
I only got around to Machine Learning today:
1. Machine learning:
Algorithms that can learn from observational data, and can make predictions based on it.

Machine learning divides into unsupervised learning and supervised learning.
The difference: supervised learning has "reference answers" (labels) to learn from, while unsupervised learning does not.
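
To make the difference concrete, here is a minimal sketch of my own (not from the original course; the tiny arrays are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Supervised: we hand the algorithm the "reference answers" y along with X.
y = np.array([2.1, 3.9, 6.2, 8.1])
model = LinearRegression().fit(X, y)
print(model.predict([[5.0]]))   # predicts a value near 10

# Unsupervised: only X, no answers; the algorithm finds structure by itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)               # cluster assignment for each sample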

2. Train/Test:
Ensure both sets are large enough to contain representatives of all the variations and outliers in the data you care about.
The data sets must be selected randomly.
Train/Test is a great way to guard against overfitting.
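
In practice the random split is usually done with a library helper. A minimal sketch using scikit-learn's train_test_split (my choice of tool here, not the original post's code):

import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 100)
purchaseAmount = np.random.normal(50.0, 30.0, 100) / pageSpeeds

# train_test_split shuffles before splitting, so the sets are selected randomly.
trainX, testX, trainY, testY = train_test_split(
    pageSpeeds, purchaseAmount, test_size=0.2, random_state=42)
print(len(trainX), len(testX))   # 80 20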

K-fold cross-validation:
  One way to further protect against overfitting is k-fold cross-validation (a minimal sketch follows the list below):
    1. Split your data into k randomly-assigned segments.
    2. Reserve one segment as your test data.
    3. Train on the remaining k-1 segments combined, and measure performance against the test segment.
    4. Repeat, rotating so that each segment serves as the test set once.
    5. Take the average of the k r-squared scores.
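
A minimal k-fold sketch with scikit-learn's cross_val_score; since it needs an estimator object, a Pipeline of PolynomialFeatures + LinearRegression stands in for np.polyfit (my substitution, not the original post's code):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(2)
X = np.random.normal(3.0, 1.0, 100).reshape(-1, 1)
y = np.random.normal(50.0, 30.0, 100) / X.ravel()

# cv=5: each of the 5 segments serves as the test set once.
model = make_pipeline(PolynomialFeatures(degree=6), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores)          # one r-squared per fold
print(scores.mean())   # the averaged score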

3. Now for the actual code:
Train/Test practice - preventing overfitting of a polynomial regression.

First, fabricate some data:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 100)
purchaseAmount = np.random.normal(50.0, 30.0, 100) / pageSpeeds
plt.scatter(pageSpeeds, purchaseAmount)
plt.show()

# Now split the data into two parts: 80% for training
# and 20% for testing, so overfitting can be detected.
# In the real world, shuffle the data before splitting it.
trainX = pageSpeeds[:80]
testX = pageSpeeds[80:]

trainY = purchaseAmount[:80]
testY = purchaseAmount[80:]

plt.scatter(trainX, trainY)
plt.show()
plt.scatter(testX, testY)
plt.show()

# Try to fit an 8th-degree polynomial to this data (certainly overfitting).
# np.polyfit returns least-squares coefficients; np.poly1d wraps them
# in a callable polynomial, so p8(xp) evaluates the fitted curve.
x = np.array(trainX)
y = np.array(trainY)

p8 = np.poly1d(np.polyfit(x, y, 8))

# Plot the polynomial against the training data.
xp = np.linspace(0, 7, 100)
axes = plt.axes()
axes.set_xlim([0, 7])
axes.set_ylim([0, 200])
plt.scatter(x, y)
plt.plot(xp, p8(xp), c='r')
plt.show()

# Plot the polynomial against the test data.
testx = np.array(testX)
testy = np.array(testY)
axes = plt.axes()
axes.set_xlim([0, 7])
axes.set_ylim([0, 200])
plt.scatter(testx, testy)
plt.plot(xp, p8(xp), c='r')
plt.show()
Check the r-squared scores:

from sklearn.metrics import r2_score

# The r-squared score on the test data is terrible (about 0.30),
# even though the curve fits the training data closely.
r2 = r2_score(testy, p8(testx))
print(r2)

# The r-squared score on the training data is better: about 0.64.
r3 = r2_score(y, p8(x))
print(r3)
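
For reference, r2_score computes R^2 = 1 - SS_res / SS_tot. A hand-rolled sketch (continuing with the testy, p8, and testx defined above) shows what the call is doing:

import numpy as np

def r_squared(actual, predicted):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1.0 - ss_res / ss_tot

print(r_squared(testy, p8(testx)))   # should match r2_score(testy, p8(testx))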

4. A final note, after some experimentation:
with a 6th-degree polynomial, the r-squared on the test data is highest, about 0.60.

p8 = np.poly1d(np.polyfit(x, y, 8))
p7 = np.poly1d(np.polyfit(x, y, 7))
p6 = np.poly1d(np.polyfit(x, y, 6))
p5 = np.poly1d(np.polyfit(x, y, 5))
p4 = np.poly1d(np.polyfit(x, y, 4))

In [23]: p5 = np.poly1d(np.polyfit(x, y, 5))

In [24]: r2 = r2_score(testy, p5(testx))
    ...: print(r2)
    ...:
0.504072389719

In [25]: r2 = r2_score(testy, p6(testx))
    ...: print(r2)
    ...:
0.605011947036

In [26]: p7 = np.poly1d(np.polyfit(x, y, 7))

In [27]: r2 = r2_score(testy, p7(testx))
    ...: print(r2)
    ...:
0.546145145284
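
Rather than trying each degree by hand in IPython, a short loop (continuing with the x, y, testx, and testy arrays from above) sweeps the degrees; the exact scores depend on the random seed:

import numpy as np
from sklearn.metrics import r2_score

# Fit polynomials of increasing degree and score each one on the test set.
for degree in range(1, 9):
    p = np.poly1d(np.polyfit(x, y, degree))
    print(degree, r2_score(testy, p(testx)))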



