Python_Data_Science_Lesson_4 - hmchzb19 - ChinaUnix Blog

Python_Data_Science_Lesson_4

Category: Python/Ruby

2018-05-23 10:49:53

0. All of the code was typed into ipython3, so the prerequisites are as follows:


ipython3
In [1]: import numpy as np
In [2]: import matplotlib.pyplot as plt

Today's topic is covariance and correlation.

1. Covariance and correlation give us a way to measure how tightly two variables are related.
covariance: measures how two variables vary in tandem from their means.
correlation: ranges from -1 to 1. -1 is a perfect negative (inverse) correlation: as one value increases, the other decreases, and vice versa. 0 means no correlation. 1 is a perfect positive correlation: the two attributes move in exactly the same direction as you look across the data points.
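
To make the three reference values concrete, here is a small sketch on made-up data (the arrays are illustrative assumptions, not from the lesson). It uses np.corrcoef and reads the off-diagonal entry of the matrix it returns. The last case is a curve that is symmetric around x = 0, so it has no linear trend at all.

```python
import numpy as np

# Hypothetical toy data, symmetric around 0
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# y rises with x along a straight line: correlation of +1
r_pos = np.corrcoef(x, 2.0 * x + 1.0)[0, 1]

# y falls as x rises along a straight line: correlation of -1
r_neg = np.corrcoef(x, -3.0 * x + 5.0)[0, 1]

# y = x**2 is symmetric around x = 0, so there is no linear trend: correlation of 0
r_zero = np.corrcoef(x, x ** 2)[0, 1]

print(r_pos, r_neg, r_zero)
```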

2. The following steps compute covariance by hand; the functions below are self-written implementations of covariance and correlation.
1. Think of the data sets for the two variables as high-dimensional vectors.
2. Convert these to vectors of deviations from the mean.
3. Take the dot product of the two vectors (it is proportional to the cosine of the angle between them).
4. Divide by the sample size minus one (n - 1), which gives the sample covariance.
Correlation is then the covariance divided by the standard deviations of both variables.


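def de_mean(x):
    # Deviation of each value from the mean of x
    xmean = np.mean(x)
    return [xi - xmean for xi in x]

def covariance(x, y):
    # Sample covariance: dot product of the deviations, divided by n - 1
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n - 1)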
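
As a quick sanity check, the same steps can be traced by hand on a tiny data set. This is only a sketch: the five-point arrays are made up for illustration, and the deviations are computed with NumPy array arithmetic rather than the list comprehension used above. The result is compared with the off-diagonal entry of np.cov.

```python
import numpy as np

# Hypothetical toy data, chosen so the arithmetic is easy to follow by hand
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Deviations from the mean
dx = x - x.mean()   # [-2, -1, 0, 1, 2]
dy = y - y.mean()   # [-4, -2, 0, 2, 4]

# Dot product of the two deviation vectors
dot = np.dot(dx, dy)   # 8 + 2 + 0 + 2 + 8 = 20

# Divide by n - 1 to get the sample covariance
cov_xy = dot / (len(x) - 1)   # 20 / 4 = 5.0

# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(x, y)
print(cov_xy, np.cov(x, y)[0, 1])
```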


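# Compute the correlation
def correlation(x, y):
    # Sample standard deviations (ddof=1), matching the n - 1 used in covariance
    stddevx = np.std(x, ddof=1)
    stddevy = np.std(y, ddof=1)
    # Guard against division by zero when either variable is constant
    if stddevx == 0 or stddevy == 0:
        raise ValueError("correlation is undefined when a variable has zero variance")
    return covariance(x, y) / stddevx / stddevy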
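
One detail worth checking when writing this by hand: np.std divides by n by default (ddof=0), while the covariance above divides by n - 1. The sketch below, on made-up data, suggests that using the same convention for both reproduces np.corrcoef, whereas mixing them inflates the result by a factor of n / (n - 1). The variable names are illustrative only.

```python
import numpy as np

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

# Sample covariance (divides by n - 1)
cov_xy = np.dot(x - x.mean(), y - y.mean()) / (n - 1)

# Consistent convention: ddof=1 also divides by n - 1
r_sample = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Mixed convention: np.std defaults to ddof=0 (divides by n)
r_mixed = cov_xy / (np.std(x) * np.std(y))

print(r_sample)                   # agrees with np.corrcoef
print(np.corrcoef(x, y)[0, 1])
print(r_mixed)                    # larger by a factor of n / (n - 1)
```

For the 1000-point samples used in this lesson the factor is only 1000 / 999, so the discrepancy is easy to miss.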


# Fabricate data: page speeds (how quickly a page renders on a website) and how much people spend.
pagespeeds = np.random.normal(3.0, 1.0, 1000)
# Drawn independently from a normal distribution, so there is no real relationship between the two attributes.
purchaseAmount = np.random.normal(50.0, 10.0, 1000)
plt.scatter(pagespeeds, purchaseAmount)
covariance(pagespeeds, purchaseAmount)
np.cov(pagespeeds, purchaseAmount)
plt.show()
# Fabricate data with a relationship: the purchase amount now depends on the page speed.
pagespeeds2 = np.random.normal(3.0, 1.0, 1000)
purchaseAmount2 = np.random.normal(50.0, 10.0, 1000) / pagespeeds2
plt.scatter(pagespeeds2, purchaseAmount2)
covariance(pagespeeds2, purchaseAmount2)
np.cov(pagespeeds2, purchaseAmount2)
plt.show()
# Close to 0: the independent data show almost no correlation.
correlation(pagespeeds, purchaseAmount)
np.corrcoef(pagespeeds, purchaseAmount)
# Noticeably different from 0 (typically negative here), which indicates a relationship; the exact value varies from run to run.
correlation(pagespeeds2, purchaseAmount2)
np.corrcoef(pagespeeds2, purchaseAmount2)
# A perfect linear relationship: the correlation is very close to -1.
pagespeeds3 = np.random.normal(3.0, 1.0, 1000)
purchaseAmount3 = 100 - pagespeeds3 * 3
plt.scatter(pagespeeds3, purchaseAmount3)
correlation(pagespeeds3, purchaseAmount3)
plt.show()

3. NumPy provides numpy.cov, which computes covariance and returns a covariance matrix; by default it divides by n - 1.
numpy.corrcoef returns a matrix of correlation coefficients between every pair of the arrays passed in.
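
A minimal sketch of how those matrices are laid out, using three made-up variables stacked as rows (with the default rowvar=True, each row is treated as one variable). Entry [i, j] relates variable i to variable j, so the diagonal of the correlation matrix is all ones and the matrix is symmetric.

```python
import numpy as np

# Hypothetical toy data: three variables with five observations each
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
z = np.array([5.0, 4.0, 3.0, 2.0, 1.0])

# Each row is one variable
data = np.array([x, y, z])

cov_matrix = np.cov(data)        # 3x3, diagonal holds each variable's sample variance
corr_matrix = np.corrcoef(data)  # 3x3, diagonal is 1

# Pick out a single pair by indexing, e.g. x versus z
r_xz = corr_matrix[0, 2]

print(cov_matrix)
print(corr_matrix)
print(r_xz)
```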
