Linuxer, ex IBMer. GNU https://hmchzb19.github.io/


Category: Big Data

2020-04-05 14:29:53

I wanted to practice standardization, and this post is the result.
Normalization vs. Standardization in theory
The terms normalization and standardization are sometimes used interchangeably, but they usually refer to different things. Normalization usually means scaling a variable so that its values lie between 0 and 1,
while standardization transforms the data to have a mean of zero and a standard deviation of 1.
The standardized value is called a z-score.

Standardized: (X-mean) / sd
Normalized: (X - min(X)) / (max(X)-min(X))
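
To make the two formulas concrete, here is a minimal sketch in plain Python. The sample values are made up for illustration, and using the population standard deviation (`pstdev`) is an assumption on my part; the sample standard deviation is also commonly used.

```python
from statistics import mean, pstdev


def standardize(xs):
    """z-score: subtract the mean, then divide by the standard deviation."""
    m, sd = mean(xs), pstdev(xs)  # population sd (assumption, see lead-in)
    return [(x - m) / sd for x in xs]


def normalize(xs):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


ages = [21, 30, 45, 60, 69]  # made-up values, not taken from the dataset
z = standardize(ages)
n = normalize(ages)

print([round(v, 3) for v in z])  # centered on 0, unit spread
print([round(v, 3) for v in n])  # squeezed into [0, 1]
```

After standardization the values have mean 0 and standard deviation 1 but no fixed range; after normalization the range is fixed to [0, 1] but the mean and spread depend on the data.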

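In sklearn, standardization takes only two lines:

from sklearn.preprocessing import StandardScaler
# Fit on the training set only, then apply the same scaling to both sets
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)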
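
What matters in those two lines is that the statistics are learned from the training set and then reused for the test set. The sketch below mimics that fit/transform split in plain Python. `TinyScaler` is a hypothetical stand-in written for illustration, not sklearn's implementation; it assumes no column is constant (which would give a zero standard deviation), and the data is made up.

```python
from statistics import mean, pstdev


class TinyScaler:
    """Hypothetical stand-in that mirrors the fit/transform split."""

    def fit(self, rows):
        cols = list(zip(*rows))                   # column-wise view
        self.mean_ = [mean(c) for c in cols]
        self.scale_ = [pstdev(c) for c in cols]   # population std (ddof=0)
        return self

    def transform(self, rows):
        return [[(v - m) / s for v, m, s in zip(row, self.mean_, self.scale_)]
                for row in rows]


X_train = [[1.0, 100.0], [2.0, 150.0], [3.0, 200.0]]  # made-up data
X_test = [[4.0, 250.0]]

scaler = TinyScaler().fit(X_train)       # statistics come from train only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # test reuses the train statistics

print(X_train_std)
print(X_test_std)
```

Because the test row is scaled with the training mean and standard deviation, its standardized values need not be centered on zero. Fitting the scaler on the test set as well would leak information about the test data into preprocessing.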


The data file used today is the Pima Indians diabetes CSV (data/pima-indians-diabetes.csv in the code); its columns are described below.

It has 9 columns; column 9 is the target.

  1. Number of times pregnant
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skin fold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)^2)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0 or 1): the target label; 0 means no onset of diabetes, 1 means onset

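The code below is adapted from the book Hands-on Scikit-Learn for Machine Learning Applications. RandomForestClassifier and ExtraTreesClassifier score almost the same on standardized and unstandardized data. That is expected: a tree split depends only on the ordering of a feature's values, and standardization preserves that ordering.
I did not run the same comparison for the SVM and KNN models later in the script.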
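
The observation about tree-based models can be checked with a small sketch: a single threshold split chosen by Gini impurity partitions the samples identically before and after standardization. This is a plain-Python illustration under my own assumptions (one feature, distinct values, made-up numbers), not how sklearn builds its trees internally.

```python
from statistics import mean, pstdev


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(xs, ys):
    """Indices sent to the left branch by the best single-threshold split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = float('inf'), None
    for k in range(1, len(order)):           # try every cut between neighbors
        left, right = order[:k], order[k:]
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right]))
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


glucose = [85, 89, 116, 137, 148, 183]   # made-up feature values
y = [0, 0, 0, 1, 1, 1]                   # made-up labels

m, sd = mean(glucose), pstdev(glucose)
z = [(g - m) / sd for g in glucose]      # standardized copy of the feature

print(sorted(best_split(glucose, y)))
print(sorted(best_split(z, y)))
```

Both calls select the same samples for the left branch; only the numeric threshold changes. Small differences in a real forest can still appear through floating-point effects or randomized thresholds, which fits the "almost no difference" reported here.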



import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

sns.set()


def get_scores(model, xtrain, ytrain, xtest, ytest, scoring):
    # Return (train accuracy, test accuracy, test F1) for a fitted model
    ypred = model.predict(xtest)
    train = model.score(xtrain, ytrain)
    test = model.score(xtest, ytest)
    f1 = f1_score(ytest, ypred, average=scoring)
    return (train, test, f1)


def main():
    col_names = ['preg', 'plas', 'pres', 'skin', 'insu', 'mass', 'pedi', 'age', 'class']
    data = pd.read_csv('data/pima-indians-diabetes.csv', header=None, names=col_names)
    print(data.describe())

    print(data['class'].unique())
    print(data['class'].value_counts())

    X = data.drop('class', axis=1)
    y = data['class']

    print('X shape: {}'.format(X.shape))
    print('y shape: {}'.format(y.shape))

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Standardization: fit on the training set only, then transform both sets
    scaler = StandardScaler().fit(X_train.astype(np.float64))
    X_train_std = scaler.transform(X_train.astype(np.float64))
    X_test_std = scaler.transform(X_test.astype(np.float64))

    # ExtraTrees on raw features
    et = ExtraTreesClassifier(random_state=0, n_estimators=100)
    et.fit(X_train, y_train)
    et_scores = get_scores(et, X_train, y_train, X_test, y_test, 'micro')
    print('{} train, test, f1_score'.format(et.__class__.__name__))
    print(et_scores, '\n')

    # ExtraTrees on standardized features
    et.fit(X_train_std, y_train)
    et_std_scores = get_scores(et, X_train_std, y_train, X_test_std, y_test, 'micro')
    print('{} with standardization (train, test, f1_score)'.format(et.__class__.__name__))
    print(et_std_scores, '\n')

    # RandomForest on raw features
    rf = RandomForestClassifier(random_state=42, n_estimators=100)
    rf.fit(X_train, y_train)
    rf_scores = get_scores(rf, X_train, y_train, X_test, y_test, 'micro')
    print('{} train, test, f1_score'.format(rf.__class__.__name__))
    print(rf_scores, '\n')

    # RandomForest on standardized features
    rf.fit(X_train_std, y_train)
    rf_std_scores = get_scores(rf, X_train_std, y_train, X_test_std, y_test, 'micro')
    print('{} with standardization (train, test, f1_score)'.format(rf.__class__.__name__))
    print(rf_std_scores, '\n')

    # KNN is fit on the raw features
    knn = KNeighborsClassifier().fit(X_train, y_train)
    knn_scores = get_scores(knn, X_train, y_train, X_test, y_test, 'micro')
    print('{} train, test, f1_score'.format(knn.__class__.__name__))
    print(knn_scores, '\n')

    # SVM is fit on the standardized features
    svm = SVC(random_state=0, gamma='scale')
    svm.fit(X_train_std, y_train)
    svm_scores = get_scores(svm, X_train_std, y_train, X_test_std, y_test, 'micro')
    print('{} train, test, f1_score'.format(svm.__class__.__name__))
    print(svm_scores, '\n')

    knn_name, svm_name = knn.__class__.__name__, svm.__class__.__name__

    # Transpose so that rows are predicted labels and columns are true labels
    y_pred_knn = knn.predict(X_test)
    cm_knn = confusion_matrix(y_test, y_pred_knn)
    cm_knn_T = cm_knn.T

    y_pred_svm = svm.predict(X_test_std)
    cm_svm = confusion_matrix(y_test, y_pred_svm)
    cm_svm_T = cm_svm.T

    plt.figure(knn_name)
    ax = plt.axes()
    sns.heatmap(cm_knn_T, annot=True, fmt='d', cmap='gist_ncar_r', cbar=False)
    ax.set_title('{} confusion matrix'.format(knn_name))
    plt.xlabel('true label')
    plt.ylabel('predicted label')

    plt.figure(svm_name)
    ax = plt.axes()
    sns.heatmap(cm_svm_T, annot=True, fmt='d', cmap='gist_ncar_r', cbar=False)
    ax.set_title('{} confusion matrix'.format(svm_name))
    plt.xlabel('true label')
    plt.ylabel('predicted label')

    # Count the true labels in the test set
    cnt_no, cnt_yes = 0, 0
    for label in y_test:
        if label == 0:
            cnt_no += 1
        elif label == 1:
            cnt_yes += 1

    print('true => 0: {} , 1: {}\n'.format(cnt_no, cnt_yes))

    # Row 0 = predicted 0, row 1 = predicted 1; off-diagonal cells are errors
    p_no, p_nox = cm_knn_T[0][0], cm_knn_T[0][1]
    p_yes, p_yesx = cm_knn_T[1][1], cm_knn_T[1][0]

    print('knn classification report')
    print('predicted 0: {}, ( {} misclassified)'.format(p_no, p_nox))
    print('predicted 1: {}, ( {} misclassified)'.format(p_yes, p_yesx))

    print()

    p_no, p_nox = cm_svm_T[0][0], cm_svm_T[0][1]
    p_yes, p_yesx = cm_svm_T[1][1], cm_svm_T[1][0]

    print('svm classification report')
    print('predicted 0: {}, ( {} misclassified)'.format(p_no, p_nox))
    print('predicted 1: {}, ( {} misclassified)'.format(p_yes, p_yesx))

    plt.show()


if __name__ == '__main__':
    main()
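
One detail of get_scores is worth spelling out: for a single-label problem like this one, micro-averaged F1 pools true positives, false positives and false negatives over both classes, and that pooled value equals plain accuracy. The second and third numbers in each printed tuple should therefore match. The sketch below shows this in plain Python with made-up labels; `accuracy` and `micro_f1` are hypothetical helpers written for illustration, not sklearn functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (p == c and t == c)
            fp += (p == c and t != c)
            fn += (p != c and t == c)
    return 2 * tp / (2 * tp + fp + fn)


y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]  # made-up labels
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]  # made-up predictions

acc = accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
print(acc, f1)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the two numbers coincide. To get an F1 that says something accuracy does not, `average='binary'` or `average='macro'` would be the options to try.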

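
In the script, knn is fit on the raw X_train while svm is fit on X_train_std. Distance-based models are where scaling tends to matter, because a feature with a large numeric range dominates the Euclidean distance. The following sketch uses a 1-nearest-neighbor lookup in plain Python to show the nearest row changing after standardization. The rows are made up (loosely shaped like the pedi and insu columns), and `nearest` and `standardize_columns` are hypothetical helpers, not sklearn APIs.

```python
from math import dist
from statistics import mean, pstdev


def nearest(train, query):
    """Index of the training row closest to query (Euclidean distance)."""
    return min(range(len(train)), key=lambda i: dist(train[i], query))


def standardize_columns(train, rows):
    """Scale rows column by column using the mean and std of train."""
    cols = list(zip(*train))
    ms = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, ms, sds)] for r in rows]


# Made-up rows: first column has a small range, second a large one
train = [[0.2, 100.0], [1.5, 110.0], [0.3, 160.0], [1.4, 50.0]]
query = [[1.45, 95.0]]

raw_idx = nearest(train, query[0])

train_std = standardize_columns(train, train)
query_std = standardize_columns(train, query)   # scaled with train statistics
std_idx = nearest(train_std, query_std[0])

print(raw_idx, std_idx)
```

On the raw values the second column decides the neighbor almost by itself; once both columns are on the same scale, the first column gets a say and a different row is chosen. Running knn on X_train_std in the script would be the natural way to test this on the real data.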