
I only got around to Machine Learning today:
1. Machine learning:
Algorithms that can learn from observational data, and can make predictions based on it.

Machine learning divides into unsupervised learning and supervised learning.
The difference: supervised learning has "reference answers" (labels) to learn from, while unsupervised learning does not.
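
To make the difference concrete, here is a minimal sketch of my own (not from the original course; the tiny arrays are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Supervised: we hand the algorithm the "reference answers" y along with X.
y = np.array([2.1, 3.9, 6.2, 8.1])
model = LinearRegression().fit(X, y)
print(model.predict([[5.0]]))   # predicts a value near 10

# Unsupervised: only X, no answers; the algorithm finds structure by itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)               # cluster assignment for each sample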

2. Train/Test:
Ensure both sets are large enough to contain representatives of all the variations and outliers in the data you care about.
The data sets must be selected randomly.
Train/Test is a great way to guard against overfitting.
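
In practice the random split is usually done with a library helper. A minimal sketch using scikit-learn's train_test_split (my choice of tool here, not the original post's code):

import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 100)
purchaseAmount = np.random.normal(50.0, 30.0, 100) / pageSpeeds

# train_test_split shuffles before splitting, so the sets are selected randomly.
trainX, testX, trainY, testY = train_test_split(
    pageSpeeds, purchaseAmount, test_size=0.2, random_state=42)
print(len(trainX), len(testX))   # 80 20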

K-fold cross-validation:
  One way to further protect against overfitting is k-fold cross-validation (a minimal sketch follows the list below):
    1. Split your data into k randomly-assigned segments.
    2. Reserve one segment as your test data.
    3. Train on the remaining k-1 segments combined, and measure performance against the test segment.
    4. Repeat, rotating so that each segment serves as the test set once.
    5. Take the average of the k r-squared scores.
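
A minimal k-fold sketch with scikit-learn's cross_val_score; since it needs an estimator object, a Pipeline of PolynomialFeatures + LinearRegression stands in for np.polyfit (my substitution, not the original post's code):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(2)
X = np.random.normal(3.0, 1.0, 100).reshape(-1, 1)
y = np.random.normal(50.0, 30.0, 100) / X.ravel()

# cv=5: each of the 5 segments serves as the test set once.
model = make_pipeline(PolynomialFeatures(degree=6), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores)          # one r-squared per fold
print(scores.mean())   # the averaged score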

3. Now for the actual code:
Train/Test practice - preventing overfitting of a polynomial regression.

First, fabricate some data:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 100)
purchaseAmount = np.random.normal(50.0, 30.0, 100) / pageSpeeds
plt.scatter(pageSpeeds, purchaseAmount)
plt.show()

# Now split the data into two parts: 80% for training
# and 20% for testing, so overfitting can be detected.
# In the real world, shuffle the data before splitting it.
trainX = pageSpeeds[:80]
testX = pageSpeeds[80:]

trainY = purchaseAmount[:80]
testY = purchaseAmount[80:]

plt.scatter(trainX, trainY)
plt.show()
plt.scatter(testX, testY)
plt.show()

# Try to fit an 8th-degree polynomial to this data (certainly overfitting).
# np.polyfit returns least-squares coefficients; np.poly1d wraps them
# in a callable polynomial, so p8(xp) evaluates the fitted curve.
x = np.array(trainX)
y = np.array(trainY)

p8 = np.poly1d(np.polyfit(x, y, 8))

# Plot the polynomial against the training data.
xp = np.linspace(0, 7, 100)
axes = plt.axes()
axes.set_xlim([0, 7])
axes.set_ylim([0, 200])
plt.scatter(x, y)
plt.plot(xp, p8(xp), c='r')
plt.show()

# Plot the polynomial against the test data.
testx = np.array(testX)
testy = np.array(testY)
axes = plt.axes()
axes.set_xlim([0, 7])
axes.set_ylim([0, 200])
plt.scatter(testx, testy)
plt.plot(xp, p8(xp), c='r')
plt.show()
Check the r-squared scores:

from sklearn.metrics import r2_score

# The r-squared score on the test data is terrible (about 0.30),
# even though the curve fits the training data closely.
r2 = r2_score(testy, p8(testx))
print(r2)

# The r-squared score on the training data is better: about 0.64.
r3 = r2_score(y, p8(x))
print(r3)
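
For reference, r2_score computes R^2 = 1 - SS_res / SS_tot. A hand-rolled sketch (continuing with the testy, p8, and testx defined above) shows what the call is doing:

import numpy as np

def r_squared(actual, predicted):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1.0 - ss_res / ss_tot

print(r_squared(testy, p8(testx)))   # should match r2_score(testy, p8(testx))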

4. A final note, after some experimentation:
with a 6th-degree polynomial, the r-squared on the test data is highest, about 0.60.

p8 = np.poly1d(np.polyfit(x, y, 8))
p7 = np.poly1d(np.polyfit(x, y, 7))
p6 = np.poly1d(np.polyfit(x, y, 6))
p5 = np.poly1d(np.polyfit(x, y, 5))
p4 = np.poly1d(np.polyfit(x, y, 4))

In [23]: p5 = np.poly1d(np.polyfit(x, y, 5))

In [24]: r2 = r2_score(testy, p5(testx))
    ...: print(r2)
    ...:
0.504072389719

In [25]: r2 = r2_score(testy, p6(testx))
    ...: print(r2)
    ...:
0.605011947036

In [26]: p7 = np.poly1d(np.polyfit(x, y, 7))

In [27]: r2 = r2_score(testy, p7(testx))
    ...: print(r2)
    ...:
0.546145145284
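
Rather than trying each degree by hand in IPython, a short loop (continuing with the x, y, testx, and testy arrays from above) sweeps the degrees; the exact scores depend on the random seed:

import numpy as np
from sklearn.metrics import r2_score

# Fit polynomials of increasing degree and score each one on the test set.
for degree in range(1, 9):
    p = np.poly1d(np.polyfit(x, y, degree))
    print(degree, r2_score(testy, p(testx)))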



