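I've had some free time lately, so I've been going through courses on Udemy, including some on data science. These are my notes; today's topic is linear regression. Following Lazyprogrammer's course, we should think of all this so-called data science as geometry.
The code is below. You need to install statsmodels, seaborn, pandas, and matplotlib beforehand; you can run it in ipython3 or Jupyter.
The CSV file is one I found on GitHub, after some haphazard searching on Kaggle and Google.
In the end I simply cloned someone else's repo: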
git clone https://github.com/timurista/data-analysis
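import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set()  # use seaborn's default plot styling

# Load the SAT/GPA dataset from the cloned repo
data = pd.read_csv('data-analysis/python-jupyter/1.01. Simple linear regression.csv')
data.describe()  # summary statistics (displayed in ipython3/jupyter)

y = data['GPA']  # dependent variable
X = data['SAT']  # independent variable

# Scatter plot of the raw data
plt.scatter(X, y)
plt.xlabel('SAT', fontsize=20)
plt.ylabel('GPA', fontsize=20)
plt.show()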

# Add an intercept column so that OLS fits GPA = b0 + b1 * SAT
x1 = sm.add_constant(X)
results = sm.OLS(y, x1).fit()

# How to read the summary table:
#   intercept (const) = 0.275
#   slope (SAT)       = 0.0017
#   coef:  coefficient
#   t:     t-statistic
#   P>|t|: p-value; less than 0.05 means the variable is significant,
#          so the closer to 0.000 the better
print(results.summary())
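
# Plot the data again with the fitted regression line on top
plt.scatter(X, y)
# Use the fitted coefficients instead of retyping them (roughly 0.0017 * X + 0.275)
yhat = results.params['SAT'] * X + results.params['const']
plt.plot(X, yhat, lw=4, c='orange', label='regression line')
plt.xlabel('SAT', fontsize=20)
plt.ylabel('GPA', fontsize=20)
plt.legend()  # show the 'regression line' label
plt.show()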
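
To connect this back to the "geometry" idea: with a single predictor plus a constant, the OLS line is the one that minimizes the sum of squared vertical distances to the points, and it has a simple closed form (slope = covariance of x and y divided by the variance of x; intercept = mean of y minus slope times mean of x). Below is a minimal sketch of that formula in plain Python. The function name `fit_line` and the sample points are made up for illustration; they are not from the course or from the SAT/GPA file.

```python
from statistics import mean

def fit_line(xs, ys):
    """Closed-form least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical points lying exactly on y = 1 + 2x (not the SAT/GPA data)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

Run on the SAT and GPA columns, this should give the same two numbers that appear in the coef column of the statsmodels summary, up to rounding.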