About the author

Linuxer, ex IBMer. GNU https://hmchzb19.github.io/

Category: Python/Ruby

2018-04-29 10:40:24

Data Science: I plan to write a series of notes recording what I learn from books and videos.
This is the first lesson.
1. Mean, Mode, Median.


  Mean (a.k.a. average): the sum of the values divided by the number of samples.
  Median: sort the values and take the value at the midpoint; for an even number of values, take the average of the two middle values.
  Mode: the most common value in a data set, i.e. the value that occurs most often.
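The three definitions above can be written out directly in plain Python. This is only a minimal sketch to make the definitions concrete; the helper names `mean`, `median` and `mode` are made up for illustration, and on ties `mode` is assumed to return the smallest value.

```python
from collections import Counter


def mean(values):
    # Sum of the values divided by the number of samples.
    return sum(values) / len(values)


def median(values):
    # Sort, then take the midpoint; for an even count,
    # average the two middle values.
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(values):
    # Most common value; on a tie, return the smallest (an assumption
    # of this sketch).
    counts = Counter(values)
    highest = max(counts.values())
    return min(v for v, c in counts.items() if c == highest)


data = [1, 2, 3, 2, 2, 2]
print(mean(data), median(data), mode(data))
```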

Below, Python code is used to actually compute these values.


  # Import packages
  import numpy as np
  from scipy import stats
  import matplotlib.pyplot as plt

  # Fabricate some data.
  # np.random.normal(loc, scale, size) draws random samples from a
  # normal (Gaussian) distribution:
  #   loc   : float or array_like of floats
  #           Mean ("centre") of the distribution.
  #   scale : float or array_like of floats
  #           Standard deviation (spread or "width") of the distribution.
  #   size  : int or tuple of ints, optional
  #           Output shape. If the given shape is, e.g., (m, n, k), then
  #           m * n * k samples are drawn. If size is None (default),
  #           a single value is returned if loc and scale are both scalars.
  #           Otherwise, np.broadcast(loc, scale).size samples are drawn.
  incomes = np.random.normal(27000, 15000, 10000)

  np.mean(incomes)  # average, close to 27000
  plt.hist(incomes, 50)
  plt.show()

  # Compute the median
  np.median(incomes)

  # Add one outlier: the mean changes a lot, but the median barely moves.
  incomes = np.append(incomes, [1000000000])
  np.mean(incomes)    # 126837.27483313478 in the author's run
  np.median(incomes)  # 26584.942499458524 in the author's run

  # Mode: if there is more than one such value, only the smallest is returned.
  # The outputs below are as recorded by the author; newer SciPy releases
  # may print scalars instead of arrays.
  lst = [1, 1, 2, 2, 3, 3, 4, 4]
  stats.mode(lst)  # ModeResult(mode=array([1]), count=array([2]))
  lst = [1, 2, 3, 2, 2, 2]
  stats.mode(lst)  # ModeResult(mode=array([2]), count=array([4]))

  # Mode of 500 random ages drawn from [18, 90)
  ages = np.random.randint(18, high=90, size=500)
  stats.mode(ages)
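Since `stats.mode` reports only the smallest value when several values tie for most common, it can be useful to list every tied value. A minimal sketch using `np.unique` with `return_counts=True`; the helper name `all_modes` is hypothetical and not part of NumPy or SciPy.

```python
import numpy as np


def all_modes(values):
    # np.unique returns the sorted distinct values and, with
    # return_counts=True, how often each one occurs.
    uniq, counts = np.unique(values, return_counts=True)
    # Keep every value whose count equals the highest count.
    return uniq[counts == counts.max()]


print(all_modes([1, 1, 2, 2, 3, 3, 4, 4]))  # [1 2 3 4]
print(all_modes([1, 2, 3, 2, 2, 2]))        # [2]
```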

2. Standard deviation and variance.

Variance is simply the average of the squared differences from the mean.
Standard deviation is the square root of the variance.

Example:
What is the variance of (1, 4, 5, 4, 8)?
Mean: 4.4
Differences from the mean: (-3.4, -0.4, 0.6, -0.4, 3.6)
Squared differences: (11.56, 0.16, 0.36, 0.16, 12.96)
Average of the squared differences (variance): 5.04
Standard deviation: 2.24
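The worked example above can be retraced step by step in plain Python. This is just a sketch that mirrors the hand calculation; the variable names are chosen for illustration.

```python
import math

values = [1, 4, 5, 4, 8]

# Step 1: the mean.
mean = sum(values) / len(values)

# Step 2: differences from the mean.
diffs = [x - mean for x in values]

# Step 3: squared differences.
squared = [d ** 2 for d in diffs]

# Step 4: variance = average of the squared differences.
variance = sum(squared) / len(squared)

# Step 5: standard deviation = square root of the variance.
std_dev = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(std_dev, 2))
```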

Here is the code:


  # Use numpy to calculate the variance and standard deviation.
  import numpy as np

  lst = [1, 4, 5, 4, 8]

  # Standard deviation
  np.std(lst)  # 2.2449944320643649

  # Variance
  np.var(lst)  # 5.04
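Note that `np.var` and `np.std` divide by the number of samples by default, which matches the "average of the squared differences" definition used here (the population variance). If the data is a sample drawn from a larger population, the sample variance divides by n - 1 instead; in NumPy this is selected with the `ddof` parameter. A short sketch, with variable names chosen for illustration:

```python
import numpy as np

lst = [1, 4, 5, 4, 8]

# Default ddof=0: divide by n (population variance).
population_var = np.var(lst)

# ddof=1: divide by n - 1 (sample variance and sample standard deviation).
sample_var = np.var(lst, ddof=1)
sample_std = np.std(lst, ddof=1)

print(population_var, sample_var, sample_std)
```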

