Category: Python/Ruby

2017-05-04 19:37:38

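Statistical functions (scipy.stats)

SciPy includes a good package for statistical inference: the stats module.

The stats module provides random variables for many probability distributions; they are either continuous or discrete.
Every continuous random variable is an instance of a subclass of rv_continuous, and every discrete random variable is an instance of a subclass of rv_discrete.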

This module contains a large number of probability distributions as well as a growing library of statistical functions.

Each univariate distribution is an instance of a subclass of rv_continuous (rv_discrete for discrete distributions):

rv_continuous([momtype, a, b, xtol, ...]) A generic continuous random variable class meant for subclassing.
rv_discrete([a, b, name, badvalue, ...]) A generic discrete random variable class meant for subclassing.

皮皮blog



Continuous distributions and their functions

Continuous distributions


alpha An alpha continuous random variable.
anglit An anglit continuous random variable.
arcsine An arcsine continuous random variable.
beta A beta continuous random variable.
betaprime A beta prime continuous random variable.
bradford A Bradford continuous random variable.
burr A Burr (Type III) continuous random variable.
burr12 A Burr (Type XII) continuous random variable.
cauchy A Cauchy continuous random variable.
chi A chi continuous random variable.
chi2 A chi-squared continuous random variable.
cosine A cosine continuous random variable.
dgamma A double gamma continuous random variable.
dweibull A double Weibull continuous random variable.
erlang An Erlang continuous random variable.
expon An exponential continuous random variable.
exponnorm An exponentially modified Normal continuous random variable.
exponweib An exponentiated Weibull continuous random variable.
exponpow An exponential power continuous random variable.
f An F continuous random variable.
fatiguelife A fatigue-life (Birnbaum-Saunders) continuous random variable.
fisk A Fisk continuous random variable.
foldcauchy A folded Cauchy continuous random variable.
foldnorm A folded normal continuous random variable.
frechet_r A Frechet right (or Weibull minimum) continuous random variable.
frechet_l A Frechet left (or Weibull maximum) continuous random variable.
genlogistic A generalized logistic continuous random variable.
gennorm A generalized normal continuous random variable.
genpareto A generalized Pareto continuous random variable.
genexpon A generalized exponential continuous random variable.
genextreme A generalized extreme value continuous random variable.
gausshyper A Gauss hypergeometric continuous random variable.
gamma A gamma continuous random variable.
gengamma A generalized gamma continuous random variable.
genhalflogistic A generalized half-logistic continuous random variable.
gilbrat A Gilbrat continuous random variable.
gompertz A Gompertz (or truncated Gumbel) continuous random variable.
gumbel_r A right-skewed Gumbel continuous random variable.
gumbel_l A left-skewed Gumbel continuous random variable.
halfcauchy A Half-Cauchy continuous random variable.
halflogistic A half-logistic continuous random variable.
halfnorm A half-normal continuous random variable.
halfgennorm The upper half of a generalized normal continuous random variable.
hypsecant A hyperbolic secant continuous random variable.
invgamma An inverted gamma continuous random variable.
invgauss An inverse Gaussian continuous random variable.
invweibull An inverted Weibull continuous random variable.
johnsonsb A Johnson SB continuous random variable.
johnsonsu A Johnson SU continuous random variable.
kappa4 Kappa 4 parameter distribution.
kappa3 Kappa 3 parameter distribution.
ksone General Kolmogorov-Smirnov one-sided test.
kstwobign Kolmogorov-Smirnov two-sided test for large N.
laplace A Laplace continuous random variable.
levy A Levy continuous random variable.
levy_l A left-skewed Levy continuous random variable.
levy_stable A Levy-stable continuous random variable.
logistic A logistic (or Sech-squared) continuous random variable.
loggamma A log gamma continuous random variable.
loglaplace A log-Laplace continuous random variable.
lognorm A lognormal continuous random variable.
lomax A Lomax (Pareto of the second kind) continuous random variable.
maxwell A Maxwell continuous random variable.
mielke A Mielke’s Beta-Kappa continuous random variable.
nakagami A Nakagami continuous random variable.
ncx2 A non-central chi-squared continuous random variable.
ncf A non-central F distribution continuous random variable.
nct A non-central Student’s T continuous random variable.
norm A normal continuous random variable.
pareto A Pareto continuous random variable.
pearson3 A pearson type III continuous random variable.
powerlaw A power-function continuous random variable.
powerlognorm A power log-normal continuous random variable.
powernorm A power normal continuous random variable.
rdist An R-distributed continuous random variable.
reciprocal A reciprocal continuous random variable.
rayleigh A Rayleigh continuous random variable.
rice A Rice continuous random variable.
recipinvgauss A reciprocal inverse Gaussian continuous random variable.
semicircular A semicircular continuous random variable.
skewnorm A skew-normal random variable.
t A Student’s T continuous random variable.
trapz A trapezoidal continuous random variable.
triang A triangular continuous random variable.
truncexpon A truncated exponential continuous random variable.
truncnorm A truncated normal continuous random variable.
tukeylambda A Tukey-Lambda continuous random variable.
uniform A uniform continuous random variable.
vonmises A Von Mises continuous random variable.
vonmises_line A Von Mises continuous random variable.
wald A Wald continuous random variable.
weibull_min A Frechet right (or Weibull minimum) continuous random variable.
weibull_max A Frechet left (or Weibull maximum) continuous random variable.
wrapcauchy A wrapped Cauchy continuous random variable.


Methods of continuous random variable objects


rvs(*args, **kwds) Random variates of given type. Draws random samples from the distribution; the size parameter sets the shape of the returned array.
pdf(x, *args, **kwds) Probability density function at x of the given RV, i.e. the density value of the distribution at each x.
logpdf(x, *args, **kwds) Log of the probability density function at x of the given RV.
cdf(x, *args, **kwds) Cumulative distribution function of the given RV. It is the integral of the probability density function, i.e. P(X <= x).
logcdf(x, *args, **kwds) Log of the cumulative distribution function at x of the given RV.
sf(x, *args, **kwds) Survival function (1 - cdf) at x of the given RV. Its value is 1 - cdf(x).
logsf(x, *args, **kwds) Log of the survival function of the given RV.
ppf(q, *args, **kwds) Percent point function (inverse of cdf) at q of the given RV. For q = 0.01, ppf returns the x for which P(X <= x) = 0.01.
isf(q, *args, **kwds) Inverse survival function (inverse of sf) at q of the given RV.
moment(n, *args, **kwds) n-th order non-central moment of distribution.
stats(*args, **kwds) Some statistics of the given RV. By default it computes the mean and the variance of the random variable.
entropy(*args, **kwds) Differential entropy of the RV.
expect([func, args, loc, scale, lb, ub, ...]) Calculate expected value of a function with respect to the distribution.
median(*args, **kwds) Median of the distribution.
mean(*args, **kwds) Mean of the distribution.
std(*args, **kwds) Standard deviation of the distribution.
var(*args, **kwds) Variance of the distribution.
interval(alpha, *args, **kwds) Confidence interval with equal areas around the median.
__call__(*args, **kwds) Freeze the distribution for the given arguments.
fit(data, *args, **kwds) Return MLEs for shape, location, and scale parameters from data. Fits the distribution to a set of samples and finds the parameters that best match them. For example, stats.norm.fit(x) treats x as a sample from some normal distribution and returns its best-fitting parameters (mean, std).
fit_loc_scale(data, *args) Estimate loc and scale parameters from data using 1st and 2nd moments.
nnlf(theta, x) Return negative loglikelihood function.
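
As a rough illustration of how the methods listed above fit together, here is a minimal sketch using stats.norm. The values loc=1.0, scale=2.0, x=2.5 and the seed are arbitrary choices for the example, not taken from the original post.

```python
import numpy as np
from scipy import stats

# Freeze a normal distribution; loc and scale are arbitrary illustration values.
rv = stats.norm(loc=1.0, scale=2.0)

x = 2.5
p = rv.cdf(x)                        # P(X <= x)
tail = rv.sf(x)                      # survival function, 1 - cdf(x)
x_back = rv.ppf(p)                   # ppf inverts cdf, so this recovers x
mean, var = rv.stats(moments="mv")   # mean and variance
lo, hi = rv.interval(0.95)           # central interval holding 95% of the probability

# Draw samples and recover loc/scale by maximum likelihood.
samples = rv.rvs(size=1000, random_state=0)
loc_hat, scale_hat = stats.norm.fit(samples)

print(p, tail, x_back)
print(mean, var, (lo, hi))
print(loc_hat, scale_hat)
```

The fitted values only approximate loc and scale, since they are estimated from a finite sample.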


Multivariate distributions



multivariate_normal A multivariate normal random variable.
matrix_normal A matrix normal random variable.
dirichlet A Dirichlet random variable.
wishart A Wishart random variable.
invwishart An inverse Wishart random variable.
special_ortho_group A matrix-valued SO(N) random variable.
ortho_group A matrix-valued O(N) random variable.
random_correlation A random correlation matrix.



multivariate_normal

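>>> import numpy as np
>>> from scipy.stats import multivariate_normal
>>> x, y = np.mgrid[-1:1:.01, -1:1:.01]
>>> pos = np.dstack((x, y))   # stack the two 2-D grids into a 3-D array of (x, y) points
>>> rv = multivariate_normal([0.5, -0.2], [[2.0, 0.3], [0.3, 0.5]])
>>> rv.pdf(pos)  # takes a 3-D array: the last axis holds each point's coordinates, the first two axes index the grid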




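Discrete distributions and their functions

When a random variable can take only discrete values, its distribution is called a discrete probability distribution. For example, a six-sided die can only show the integers 1 to 6, so the resulting probability distribution is discrete.

A discrete distribution is usually described by its probability mass function (PMF). In the stats module, all discrete random variables inherit from the rv_discrete class.

Defining a custom discrete distribution directly with rv_discrete

In stats.rv_discrete(values=(x,p)), x holds the possible values of the random variable and p the corresponding probabilities.

Consider a loaded die whose faces are not equally likely. The array x below holds all possible values of the die, and p holds the probability of each value:
>>> from scipy import stats
>>> x = range(1,7)
>>> p = (0.4, 0.2, 0.1, 0.1, 0.1, 0.1)
The following statements define the random variable for this die and call its rvs() method to roll it 20 times, producing random numbers that follow the probabilities p:
>>> dice = stats.rv_discrete(values=(x,p))
>>> dice.rvs(size=20)
array([2, 5, 1, 2, 1, 1, 2, 4, 1, 3, 1, 1, 4, 3, 1, 1, 1, 2, 6, 4])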
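
To check that the loaded die behaves as intended, one can compare its PMF and mean with the probabilities that went in, and look at the empirical frequencies of a larger number of rolls. This is only a sketch; the sample size of 20000 and the seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 7)
p = (0.4, 0.2, 0.1, 0.1, 0.1, 0.1)
dice = stats.rv_discrete(values=(x, p))

pmf = dice.pmf(x)        # should reproduce p
mean = dice.mean()       # sum(x * p) = 2.6 for these probabilities

# Roll many times and count how often each face appears.
rolls = dice.rvs(size=20000, random_state=0)
counts = np.bincount(rolls.astype(int), minlength=7)[1:]   # faces 1..6
freq = counts / rolls.size

print(pmf)
print(mean)
print(freq)
```

The empirical frequencies approach p as the number of rolls grows.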


from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# Build an empirical distribution: each of the 30 sorted samples gets probability 1/30
fs_meetsig = np.random.random(30)
fs_xk = np.sort(fs_meetsig)
fs_pk = np.ones_like(fs_xk) / len(fs_xk)
fs_rv_dist = stats.rv_discrete(name='fs_rv_dist', values=(fs_xk, fs_pk))

# Plot the cumulative distribution function at the sample points
plt.plot(fs_xk, fs_rv_dist.cdf(fs_xk), 'b-', ms=12, mec='r', label='friend')
plt.show()

[rv_discrete]

Discrete distributions


bernoulli A Bernoulli discrete random variable.
binom A binomial discrete random variable.
boltzmann A Boltzmann (Truncated Discrete Exponential) random variable.
dlaplace A Laplacian discrete random variable.
geom A geometric discrete random variable.
hypergeom A hypergeometric discrete random variable.
logser A Logarithmic (Log-Series, Series) discrete random variable.
nbinom A negative binomial discrete random variable.
planck A Planck discrete exponential random variable.
poisson A Poisson discrete random variable.
randint A uniform discrete random variable.
skellam A Skellam discrete random variable.
zipf A Zipf discrete random variable.


Methods of discrete distributions


rvs(*args, **kwargs) Random variates of given type.
pmf(k, *args, **kwds) Probability mass function at k of the given RV.
logpmf(k, *args, **kwds) Log of the probability mass function at k of the given RV.
cdf(k, *args, **kwds) Cumulative distribution function of the given RV.
logcdf(k, *args, **kwds) Log of the cumulative distribution function at k of the given RV.
sf(k, *args, **kwds) Survival function (1 - cdf) at k of the given RV.
logsf(k, *args, **kwds) Log of the survival function of the given RV.
ppf(q, *args, **kwds) Percent point function (inverse of cdf) at q of the given RV.
isf(q, *args, **kwds) Inverse survival function (inverse of sf) at q of the given RV.
moment(n, *args, **kwds) n-th order non-central moment of distribution.
stats(*args, **kwds) Some statistics of the given RV.
entropy(*args, **kwds) Differential entropy of the RV.
expect([func, args, loc, lb, ub, ...]) Calculate expected value of a function with respect to the distribution for discrete distribution.
median(*args, **kwds) Median of the distribution.
mean(*args, **kwds) Mean of the distribution.
std(*args, **kwds) Standard deviation of the distribution.
var(*args, **kwds) Variance of the distribution.
interval(alpha, *args, **kwds) Confidence interval with equal areas around the median.
__call__(*args, **kwds) Freeze the distribution for the given arguments.




Statistical functions

{Top-level functions of scipy.stats that can be applied to many kinds of distributions}


Several of these functions have a similar version in scipy.stats.mstats which works for masked arrays.

describe(a[, axis, ddof, bias, nan_policy]) Computes several descriptive statistics of the passed array.
gmean(a[, axis, dtype]) Compute the geometric mean along the specified axis.
hmean(a[, axis, dtype]) Calculates the harmonic mean along the specified axis.
kurtosis(a[, axis, fisher, bias, nan_policy]) Computes the kurtosis (Fisher or Pearson) of a dataset.
kurtosistest(a[, axis, nan_policy]) Tests whether a dataset has normal kurtosis
mode(a[, axis, nan_policy]) Returns an array of the modal (most common) value in the passed array.
moment(a[, moment, axis, nan_policy]) Calculates the nth moment about the mean for a sample.
normaltest(a[, axis, nan_policy]) Tests whether a sample differs from a normal distribution.
skew(a[, axis, bias, nan_policy]) Computes the skewness of a data set.
skewtest(a[, axis, nan_policy]) Tests whether the skew is different from the normal distribution.
kstat(data[, n]) Return the nth k-statistic (1<=n<=4 so far).
kstatvar(data[, n]) Returns an unbiased estimator of the variance of the k-statistic.
tmean(a[, limits, inclusive, axis]) Compute the trimmed mean.
tvar(a[, limits, inclusive, axis, ddof]) Compute the trimmed variance
tmin(a[, lowerlimit, axis, inclusive, ...]) Compute the trimmed minimum
tmax(a[, upperlimit, axis, inclusive, ...]) Compute the trimmed maximum
tstd(a[, limits, inclusive, axis, ddof]) Compute the trimmed sample standard deviation
tsem(a[, limits, inclusive, axis, ddof]) Compute the trimmed standard error of the mean.
variation(a[, axis, nan_policy]) Computes the coefficient of variation, the ratio of the biased standard deviation to the mean.
find_repeats(arr) Find repeats and repeat counts.
trim_mean(a, proportiontocut[, axis]) Return mean of array after trimming distribution from both tails.
cumfreq(a[, numbins, defaultreallimits, weights]) Returns a cumulative frequency histogram, using the histogram function.
(*args, **kwds) is deprecated!
(*args, **kwds) is deprecated!
itemfreq(a) Returns a 2-D array of item frequencies.
percentileofscore(a, score[, kind]) The percentile rank of a score relative to a list of scores.
scoreatpercentile(a, per[, limit, ...]) Calculate the score at a given percentile of the input sequence.
relfreq(a[, numbins, defaultreallimits, weights]) Returns a relative frequency histogram, using the histogram function.
binned_statistic(x, values[, statistic, ...]) Compute a binned statistic for one or more sets of data.
binned_statistic_2d(x, y, values[, ...]) Compute a bidimensional binned statistic for one or more sets of data.
binned_statistic_dd(sample, values[, ...]) Compute a multidimensional binned statistic for a set of data.
obrientransform(*args) Computes the O’Brien transform on input data (any number of arrays).
(*args, **kwds) is deprecated!
bayes_mvs(data[, alpha]) Bayesian confidence intervals for the mean, var, and std.
mvsdist(data) ‘Frozen’ distributions for mean, variance, and standard deviation of data.
sem(a[, axis, ddof, nan_policy]) Calculates the standard error of the mean (or standard error of measurement) of the values in the input array.
zmap(scores, compare[, axis, ddof]) Calculates the relative z-scores.
zscore(a[, axis, ddof]) Calculates the z score of each value in the sample, relative to the sample mean and standard deviation.
iqr(x[, axis, rng, scale, nan_policy, ...]) Compute the interquartile range of the data along the specified axis.
sigmaclip(a[, low, high]) Iterative sigma-clipping of array elements.
(*args, **kwds) is deprecated!
trimboth(a, proportiontocut[, axis]) Slices off a proportion of items from both ends of an array.
trim1(a, proportiontocut[, tail, axis]) Slices off a proportion from ONE end of the passed array distribution.
f_oneway(*args) Performs a 1-way ANOVA.
pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
spearmanr(a[, b, axis, nan_policy]) Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.
pointbiserialr(x, y) Calculates a point biserial correlation coefficient and its p-value.
kendalltau(x, y[, initial_lexsort, nan_policy]) Calculates Kendall’s tau, a correlation measure for ordinal data.
linregress(x[, y]) Calculate a linear least-squares regression for two sets of measurements.
theilslopes(y[, x, alpha]) Computes the Theil-Sen estimator for a set of points (x, y).
(*args, **kwds) is deprecated!
ttest_1samp(a, popmean[, axis, nan_policy]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis, equal_var, nan_policy]) Calculates the T-test for the means of two independent samples of scores.
ttest_ind_from_stats(mean1, std1, nobs1, ...) T-test for means of two independent samples from descriptive statistics.
ttest_rel(a, b[, axis, nan_policy]) Calculates the T-test on TWO RELATED samples of scores, a and b.
kstest(rvs, cdf[, args, N, alternative, mode]) Perform the Kolmogorov-Smirnov test for goodness of fit.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
power_divergence(f_obs[, f_exp, ddof, axis, ...]) Cressie-Read power divergence statistic and goodness of fit test.
ks_2samp(data1, data2) Computes the Kolmogorov-Smirnov statistic on 2 samples.
mannwhitneyu(x, y[, use_continuity, alternative]) Computes the Mann-Whitney rank test on samples x and y.
tiecorrect(rankvals) Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests.
rankdata(a[, method]) Assign ranks to data, dealing with ties appropriately.
ranksums(x, y) Compute the Wilcoxon rank-sum statistic for two samples.
wilcoxon(x[, y, zero_method, correction]) Calculate the Wilcoxon signed-rank test.
kruskal(*args, **kwargs) Compute the Kruskal-Wallis H-test for independent samples
friedmanchisquare(*args) Computes the Friedman test for repeated measurements
combine_pvalues(pvalues[, method, weights]) Methods for combining the p-values of independent tests bearing upon the same hypothesis.
(*args, **kwds) is deprecated!
(*args, **kwds) is deprecated!
jarque_bera(x) Perform the Jarque-Bera goodness of fit test on sample data.
ansari(x, y) Perform the Ansari-Bradley test for equal scale parameters
bartlett(*args) Perform Bartlett’s test for equal variances
levene(*args, **kwds) Perform Levene test for equal variances.
shapiro(x[, a, reta]) Perform the Shapiro-Wilk test for normality.
anderson(x[, dist]) Anderson-Darling test for data coming from a particular distribution
anderson_ksamp(samples[, midrank]) The Anderson-Darling test for k-samples.
binom_test(x[, n, p, alternative]) Perform a test that the probability of success is p.
fligner(*args, **kwds) Perform Fligner-Killeen test for equality of variance.
median_test(*args, **kwds) Mood’s median test.
mood(x, y[, axis]) Perform Mood’s test for equal scale parameters.
boxcox(x[, lmbda, alpha]) Return a positive dataset transformed by a Box-Cox power transformation.
boxcox_normmax(x[, brack, method]) Compute optimal Box-Cox transform parameter for input data.
boxcox_llf(lmb, data) The boxcox log-likelihood function.
entropy(pk[, qk, base]) Calculate the entropy of a distribution for given probability values.
(*args, **kwds) is deprecated!
(*args, **kwds) is deprecated!

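The describe function

The raw output of this function is hard to read!


import numpy as np
from scipy import stats

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat_percent = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]
age = np.array(age)
fat_percent = np.array(fat_percent)
# Note: reshaping the (2, 18) stack to (-1, 2) pairs up consecutive values, so each column mixes
# ages and fat percentages; np.column_stack([age, fat_percent]) would give one column per variable.
data = np.vstack([age, fat_percent]).reshape([-1, 2])
print(stats.describe(data))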
DescribeResult(nobs=18, minmax=(array([  7.8,  17.8]), array([ 60.,  61.])), mean=array([ 37.36111111,  37.86666667]), variance=array([ 236.58604575,  188.78588235]), skewness=array([-0.30733374,  0.40999364]), kurtosis=array([-0.65245849, -1.26315357]))


The result can be printed in a more readable form:


for key, value in stats.describe(data)._asdict().items():
    print(key, ':', value)
nobs : 18
minmax : (array([  7.8,  17.8]), array([ 60.,  61.]))
mean : [ 37.36111111  37.86666667]
variance : [ 236.58604575  188.78588235]
skewness : [-0.30733374  0.40999364]
kurtosis : [-0.65245849 -1.26315357]


The equivalent pandas functions can be used instead; their output is easier to read [pandas, a Python data-processing library]
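
The reshape([-1, 2]) call in the listing above regroups consecutive values of the stacked array, so the two columns it describes each mix ages and fat percentages. If per-variable statistics are wanted, a variant along the following lines may be closer to the intent. This is an alternative sketch using np.column_stack, not the original author's code.

```python
import numpy as np
from scipy import stats

age = np.array([23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61])
fat_percent = np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
                        34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7])

# One row per person, one column per variable: column 0 = age, column 1 = fat percent.
data = np.column_stack([age, fat_percent])

result = stats.describe(data)          # statistics are computed column by column (axis=0)
for key, value in result._asdict().items():
    print(key, ':', value)
```

With this layout the minmax, mean and variance entries each hold one value for age and one for fat percent.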

Entropy and KL divergence of probability distributions: scipy.stats.entropy

 scipy.stats.entropy(pk, qk=None, base=None)
    Calculate the entropy of a distribution for given probability values.
    If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=0).
    If qk is not None, then compute the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=0).
    This routine will normalize pk and qk if they don’t sum to 1.
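
The two formulas in the docstring can be checked against a direct NumPy computation. The sketch below uses small made-up probability vectors pk and qk (illustrative values only) and also shows the normalization of unnormalized counts.

```python
import numpy as np
from scipy import stats

# Illustrative probability vectors over the same three outcomes.
pk = np.array([0.5, 0.25, 0.25])
qk = np.array([0.4, 0.4, 0.2])

h_scipy = stats.entropy(pk)                    # Shannon entropy in nats
h_manual = -np.sum(pk * np.log(pk))            # S = -sum(pk * log(pk))

kl_scipy = stats.entropy(pk, qk)               # KL divergence D(pk || qk)
kl_manual = np.sum(pk * np.log(pk / qk))       # S = sum(pk * log(pk / qk))

h_bits = stats.entropy(pk, base=2)             # same entropy measured in bits

# Unnormalized counts are normalized internally, so they give the same entropy.
counts = np.array([2, 1, 1])
h_counts = stats.entropy(counts)

print(h_scipy, h_manual)
print(kl_scipy, kl_manual)
print(h_bits, h_counts)
```

Note that the KL divergence is not symmetric: stats.entropy(pk, qk) and stats.entropy(qk, pk) generally differ.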

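Computing the Shannon entropy with entropy


# ij is an array of non-negative counts (defined elsewhere by the author)
shannon_entropy = stats.entropy(ij/sum(ij), base=None)
print(shannon_entropy)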

Implementing entropy directly in Python


shannon_entropy_func = lambda pij: -sum(pij*np.log(pij))
pij = ij/sum(ij)  # normalize to probabilities first, as in the scipy call
shannon_entropy = shannon_entropy_func(pij[np.nonzero(pij)])
print(shannon_entropy)

def entropy(counts):
    '''Compute entropy in bits (log base 2).'''
    ps = counts/float(sum(counts))  # coerce to float and normalize
    ps = ps[np.nonzero(ps)]         # toss out zeros
    H = -sum(ps * np.log2(ps))      # compute entropy

    return H

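Computing the KL divergence between two distributions


# entropy() takes probability arrays over a common support, not rv_discrete objects (nonfs_pk: assumed counterpart of fs_pk)
kl = stats.entropy(fs_pk, nonfs_pk)


Other implementations of the KL divergence: [Distance and similarity measures]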

[scipy.stats.entropy]


Hypothesis-testing functions

ttest_1samp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis, equal_var]) Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
kstest(rvs, cdf[, args, N, alternative, mode]) Perform the Kolmogorov-Smirnov test for goodness of fit.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
power_divergence(f_obs[, f_exp, ddof, axis, ...]) Cressie-Read power divergence statistic and goodness of fit test.
ks_2samp(data1, data2) Computes the Kolmogorov-Smirnov statistic on 2 samples.
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney rank test on samples x and y.
tiecorrect(rankvals) Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests.
rankdata(a[, method]) Assign ranks to data, dealing with ties appropriately.
ranksums(x, y) Compute the Wilcoxon rank-sum statistic for two samples.
wilcoxon(x[, y, zero_method, correction]) Calculate the Wilcoxon signed-rank test.
kruskal(*args) Compute the Kruskal-Wallis H-test for independent samples
friedmanchisquare(*args) Computes the Friedman test for repeated measurements

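ttest_1samp implements the one-sample t-test. To test the mean rice yield in the Abra column of the data, we take as the null hypothesis that the population mean yield is 15000:

from scipy import stats as ss
# df is the author's pandas DataFrame of yields (not shown here)
# Perform one sample t-test using 15000 as the true mean
print(ss.ttest_1samp(a=df.loc[:, 'Abra'], popmean=15000))

# OUTPUT
(-1.1281738488299586, 0.26270472069109496)

It returns a tuple of the following values:

  • t : float or array
    the t-statistic
  • prob : float or array
    the two-tailed p-value

In the output above the p-value is 0.263, well above α = 0.05, so there is not enough evidence to conclude that the mean rice yield differs from 15000. Applying the same test to every variable, again with an assumed mean of 15000, gives:

print(ss.ttest_1samp(a=df, popmean=15000))

# OUTPUT
(array([ -1.12817385,   1.07053437, -65.81425599,  -4.564575  ,   6.17156198]),
 array([  2.62704721e-01,   2.87680340e-01,   4.15643528e-70,
          1.83764399e-05,   2.82461897e-08]))

The first array holds the t-statistics and the second the corresponding p-values.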
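
Since the DataFrame used above is not included in the post, the following self-contained sketch runs the same test on synthetic data and reproduces the statistic by hand. The sample (50 values drawn around 15000 with standard deviation 2000) is an assumption made purely for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one column of yields; not the author's data.
rng = np.random.RandomState(0)
yields = rng.normal(loc=15000, scale=2000, size=50)

# One-sample t-test of H0: population mean == 15000.
t_stat, p_value = stats.ttest_1samp(yields, popmean=15000)

# The same quantities computed from the definition.
n = yields.size
t_manual = (yields.mean() - 15000) / (yields.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-tailed p-value

print(t_stat, p_value)
print(t_manual, p_manual)
```

The manual computation makes explicit that the statistic is the sample mean's distance from popmean in units of the standard error, with n - 1 degrees of freedom.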




Contingency table functions

chi2_contingency(observed[, correction, lambda_]) Chi-square test of independence of variables in a contingency table.
contingency.expected_freq(observed) Compute the expected frequencies from a contingency table.
contingency.margins(a) Return a list of the marginal sums of the array a.
fisher_exact(table[, alternative]) Performs a Fisher exact test on a 2x2 contingency table.
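
A small sketch of chi2_contingency on a made-up 2x3 table of counts (the numbers are invented for the example). It also rebuilds the expected frequencies from the row and column totals to show what the function computes.

```python
import numpy as np
from scipy import stats

# Invented 2x3 contingency table of observed counts.
observed = np.array([[10, 20, 30],
                     [20, 20, 20]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Expected counts under independence: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic (no continuity correction is applied when dof > 1).
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

print(chi2, p, dof)
print(expected)
```

For a 2x2 table the function applies Yates' continuity correction by default, so the manual statistic above would only match with correction=False.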

Plot-tests

ppcc_max(x[, brack, dist]) Returns the shape parameter that maximizes the probability plot correlation coefficient for ...
ppcc_plot(x, a, b[, dist, plot, N]) Returns (shape, ppcc), and optionally plots shape vs. ...
probplot(x[, sparams, dist, fit, plot]) Calculate quantiles for a probability plot, and optionally show the plot.
boxcox_normplot(x, la, lb[, plot, N]) Compute parameters for a Box-Cox normality plot, optionally show it.

Statistical functions for masked arrays (scipy.stats.mstats)

Masked statistics functions

argstoarray(*args) Constructs a 2D array from a group of sequences.
betai(a, b, x) Returns the incomplete beta function.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
count_tied_groups(x[, use_missing]) Counts the number of tied values.
describe(a[, axis]) Computes several descriptive statistics of the passed array.
f_oneway(*args) Performs a 1-way ANOVA, returning an F-value and probability given any ...
f_value_wilks_lambda(ER, EF, dfnum, dfden, a, b) Calculation of Wilks lambda F-statistic for multivariate data, per Maxwell ...
find_repeats(arr) Find repeats in arr and return a tuple (repeats, repeat_count).
friedmanchisquare(*args) Friedman Chi-Square is a non-parametric, one-way within-subjects ANOVA.
kendalltau(x, y[, use_ties, use_missing]) Computes Kendall’s rank correlation tau on two variables x and y.
kendalltau_seasonal(x) Computes a multivariate Kendall’s rank correlation tau, for seasonal data.
kruskalwallis(*args) Compute the Kruskal-Wallis H-test for independent samples
ks_twosamp(data1, data2[, alternative]) Computes the Kolmogorov-Smirnov test on two samples.
kurtosis(a[, axis, fisher, bias]) Computes the kurtosis (Fisher or Pearson) of a dataset.
kurtosistest(a[, axis]) Tests whether a dataset has normal kurtosis
linregress(*args) Calculate a regression line
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney statistic
plotting_positions(data[, alpha, beta]) Returns plotting positions (or empirical percentile points) for the data.
mode(a[, axis]) Returns an array of the modal (most common) value in the passed array.
moment(a[, moment, axis]) Calculates the nth moment about the mean for a sample.
mquantiles(a[, prob, alphap, betap, axis, limit]) Computes empirical quantiles for a data array.
msign(x) Returns the sign of x, or 0 if x is masked.
normaltest(a[, axis]) Tests whether a sample differs from a normal distribution.
obrientransform(*args) Computes a transform on input data (any number of columns).
pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
pointbiserialr(x, y) Calculates a point biserial correlation coefficient and the associated p-value.
rankdata(data[, axis, use_missing]) Returns the rank (also known as order statistics) of each data point along ...
scoreatpercentile(data, per[, limit, ...]) Calculate the score at the given ‘per’ percentile of the sequence a.
sem(a[, axis, ddof]) Calculates the standard error of the mean (or standard error of measurement) of the values in the input array.
signaltonoise(data[, axis]) Calculates the signal-to-noise ratio, as the ratio of the mean over standard ...
skew(a[, axis, bias]) Computes the skewness of a data set.
skewtest(a[, axis]) Tests whether the skew is different from the normal distribution.
spearmanr(x, y[, use_ties]) Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.
theilslopes(y[, x, alpha]) Computes the Theil slope as the median of all slopes between paired values.
threshold(a[, threshmin, threshmax, newval]) Clip array to a given value.
tmax(a, upperlimit[, axis, inclusive]) Compute the trimmed maximum
tmean(a[, limits, inclusive]) Compute the trimmed mean.
tmin(a[, lowerlimit, axis, inclusive]) Compute the trimmed minimum
trim(a[, limits, inclusive, relative, axis]) Trims an array by masking the data outside some given limits.
trima(a[, limits, inclusive]) Trims an array by masking the data outside some given limits.
trimboth(data[, proportiontocut, inclusive, ...]) Trims the smallest and largest data values.
trimmed_stde(a[, limits, inclusive, axis]) Returns the standard error of the trimmed mean along the given axis.
trimr(a[, limits, inclusive, axis]) Trims an array by masking some proportion of the data on each end.
trimtail(data[, proportiontocut, tail, ...]) Trims the data by masking values from one tail.
tsem(a[, limits, inclusive]) Compute the trimmed standard error of the mean.
ttest_onesamp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis]) Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
tvar(a[, limits, inclusive]) Compute the trimmed variance
variation(a[, axis]) Computes the coefficient of variation, the ratio of the biased standard deviation to the mean.
winsorize(a[, limits, inclusive, inplace, axis]) Returns a Winsorized version of the input array.
zmap(scores, compare[, axis, ddof]) Calculates the relative z-scores.
zscore(a[, axis, ddof]) Calculates the z score of each value in the sample, relative to the sample mean and standard deviation.

Univariate and multivariate kernel density estimation (scipy.stats.kde)

gaussian_kde(dataset[, bw_method]) Representation of a kernel-density estimate using Gaussian kernels.
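
As a minimal sketch of gaussian_kde, the example below estimates the density of a synthetic standard-normal sample and compares it with the true density. The sample size, seed and evaluation grid are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic 1-D sample from a standard normal distribution.
rng = np.random.RandomState(0)
sample = rng.normal(loc=0.0, scale=1.0, size=2000)

kde = stats.gaussian_kde(sample)        # bandwidth chosen by Scott's rule by default

grid = np.linspace(-4, 4, 201)
density = kde(grid)                     # estimated density on the grid
area = kde.integrate_box_1d(-4, 4)      # probability mass the estimate puts on [-4, 4]

# Largest deviation from the true N(0, 1) density on the grid.
max_err = np.max(np.abs(density - stats.norm.pdf(grid)))
print(area, max_err)
```

The estimate is a smoothed version of the data, so it is slightly flatter at the peak than the true density; a smaller bw_method value reduces that bias at the cost of more noise.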




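Examples of using the statistical functions

Continuous distributions: the normal (Gaussian) distribution

{A normal continuous random variable.}

Generating a random vector from a Gaussian distribution (sampling from the normal distribution): stats.norm.rvs(loc, scale, size)

Parameters:

The location (loc) keyword specifies the mean.

The scale (scale) keyword specifies the standard deviation.

The loc and scale parameters of norm set the shift and scaling of the random variable. For a normally distributed random variable they correspond to its mean and standard deviation.

Random deviates from the Gaussian N(0, 0.01), i.e. variance 0.01 and standard deviation 0.1: y = stats.norm.rvs(loc=0, scale=0.1, size=10)
Output: array([ 0.05419826,  0.04151471, -0.10784729,  0.18283546,  0.02348312, -0.04611974,  0.0069336 ,  0.03840133, -0.05015316,  0.23315205]) 

stats.norm.stats(loc=0, scale=0.1)  # returns (mean, variance)
(array(0.0), array(0.01))

Note: numpy.random.normal can also generate Gaussian random numbers [NumPy: the random-number module numpy.random].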

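Finding the best-fit parameters of a normal distribution: stats.norm.fit(x)

>>> x = stats.norm.rvs(loc=1.0, scale=2.0, size=100)
The fit() method fits the distribution to the sample sequence x and returns the parameters that best match the samples
>>> stats.norm.fit(x)  # estimated mean and standard deviation of the sample
(1.01810091, 2.00046946)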


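Value of the N(1,1) probability density function at a given x


lambda x: norm.pdf(x, 1, 1)
Note: From the normal density formula, norm.pdf(x, loc, scale) is in general not the same as norm.pdf(x - loc); the two are equal only when the standard deviation is 1, as it is here.


Value of the N(1,1) cumulative distribution function at a given x


lambda x: norm.cdf(x, 1, 1)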
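
The relationship between the loc/scale arguments and the standard normal can be verified numerically. In this sketch the values loc=1.0, scale=2.0 and the grid x are arbitrary; the identities used are pdf(x, loc, scale) = pdf((x - loc) / scale) / scale and cdf(x, loc, scale) = cdf((x - loc) / scale).

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 5, 9)
loc, scale = 1.0, 2.0            # arbitrary illustration values

direct = norm.pdf(x, loc, scale)
standardized = norm.pdf((x - loc) / scale) / scale   # shift, rescale, then divide by scale
shifted_only = norm.pdf(x - loc)                     # ignores the scale, so it differs

cdf_direct = norm.cdf(x, loc, scale)
cdf_standardized = norm.cdf((x - loc) / scale)       # no extra factor for the cdf

# With scale == 1 the shift alone is enough, as in the N(1,1) case.
unit_direct = norm.pdf(x, 1, 1)
unit_shifted = norm.pdf(x - 1)

print(np.allclose(direct, standardized), np.allclose(direct, shifted_only))
```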

Plotting one- and two-dimensional normal probability densities

[Probability theory: the Gaussian distribution]

Uniform distribution

from scipy.stats import uniform
mu = uniform.rvs(size=N)  # sample N values from the uniform distribution on [0, 1)


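Gamma distribution

The gamma distribution takes an additional shape parameter. It can describe the time needed to wait for k independent random events to occur, where k is the shape parameter.
Its scale parameter theta is related to the rate at which the random events occur and is given by the scale argument.
>>> stats.gamma.stats(2.0,scale=2) 
(array(4.0), array(8.0))
By the mathematical definition of the gamma distribution, its mean is k*theta and its variance is k*theta^2; the call above confirms both formulas. When a distribution has additional shape parameters, its rvs(), pdf() and other methods take extra arguments to receive them.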
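
To illustrate how the shape parameter is passed to the other methods, the sketch below reuses k = 2 and theta = 2 from the call above. The sample size, seed and evaluation point x = 1 are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

k, theta = 2.0, 2.0    # shape and scale, as in the call above

# The shape parameter comes first; loc and scale are keyword arguments.
mean, var = stats.gamma.stats(k, scale=theta)        # k*theta and k*theta**2
samples = stats.gamma.rvs(k, scale=theta, size=20000, random_state=0)
pdf_at_1 = stats.gamma.pdf(1.0, k, scale=theta)

# For k = 2 the density x**(k-1) * exp(-x/theta) / (theta**k * Gamma(k))
# reduces to x * exp(-x/theta) / theta**2, since Gamma(2) = 1.
pdf_manual = 1.0 * np.exp(-1.0 / theta) / theta ** 2

print(mean, var)
print(samples.mean(), samples.var())
print(pdf_at_1, pdf_manual)
```

The sample mean and variance only approximate k*theta and k*theta^2, since they come from a finite sample.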

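Discrete distributions: the binomial distribution

Consider a trial with only two outcomes and success probability p. The binomial distribution gives the probability of k successes in n such independent trials.
The probability mass function of the binomial distribution is:

$$f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}$$

With the binomial probability mass function pmf() it is easy to compute the probability of rolling exactly k sixes.

pmf()

The first argument of pmf() is the value of the random variable; the remaining arguments are the parameters of the distribution. For the binomial distribution these are n and p, and the value ranges over the integers from 0 to n.

The following computes, from the binomial probability mass function, the probabilities of rolling 0 to 5 sixes in 5 throws of a die:

>>> stats.binom.pmf(range(6), 5, 1/6.0)
array([0.401878, 0.401878, 0.160751, 0.032150, 0.003215, 0.000129])

The result shows that rolling no six and rolling exactly one six each have probability 40.2%, while rolling three sixes has probability 3.215%.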

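Poisson distribution

In a binomial distribution, if the number of trials n is large and the success probability p of each trial is small, while the product np is moderate, the distribution of the number of successes can be approximated by a Poisson distribution.
The Poisson distribution uses lambda for the average rate of random events per unit time (or unit area). If the n of the binomial distribution is viewed as the number of trials per unit time, then its product with the event probability p is the average event rate, i.e. lambda = np.
The probability mass function of the Poisson distribution is:

$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$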


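Approximating the binomial distribution

The program below computes the probability mass functions of the binomial and Poisson distributions; when n is large enough the two are very close.
The average event rate lambda is fixed at 10, and the per-trial probability is computed from the number of trials as p = lambda/n.
>>> _lambda = 10.0
>>> k = np.arange(20)
>>> poisson = stats.poisson.pmf(k, _lambda) # Poisson distribution
>>> binom100 = stats.binom.pmf(k, 100, _lambda/100) # binomial distribution, n = 100
>>> binom1000 = stats.binom.pmf(k, 1000, _lambda/1000) # binomial distribution, n = 1000
>>> np.max(np.abs(binom100-poisson)) # maximum error
0.006755311103353312
>>> np.max(np.abs(binom1000-poisson)) # the error is smaller for n = 1000
0.00063017540509099912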


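Simulating a Poisson process

The Poisson distribution is well suited to describing the number of random events per unit time, for example how many times a facility is used in a given period, how many times a machine breaks down, or how many natural disasters occur.

Below, random numbers are used to simulate a Poisson distribution and the result is compared with its probability mass function; events occur on average lambda = 10 times per second. The comparison was made for observation times of 1000 and 50000 seconds (the listing below uses 10000 seconds): the longer the observation time, the more closely the number of events per second follows the Poisson distribution.

>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time)*time
>>> count, time_edges = np.histogram(t, bins=time, range=(0,time))
>>> count
array([10, 9, 8, …, 11, 10, 18])
>>> dist, count_edges = np.histogram(count, bins=20, range=(0,20), density=True)
>>> x = count_edges[:-1]
>>> poisson = stats.poisson.pmf(x, _lambda)
>>> np.max(np.abs(dist-poisson)) # the maximum error is small, consistent with a Poisson distribution
0.0088356241037075706

Note: rand() generates the times of _lambda*time events, uniformly distributed between 0 and time.
histogram() counts the number of events in each second of the array t, giving count.
By the definition of the Poisson distribution, the values in count should follow a Poisson distribution. The second histogram() call computes the distribution of event counts over the range 0 to 20. When histogram() is called with density=True (normed=True in old NumPy versions) and every bin has width 1, its result can be compared directly with the probability mass function.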


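Time intervals in a Poisson process:


Random events can also be viewed from another angle: the distribution of the time interval between two adjacent events, or between events that are k events apart. According to probability theory these intervals follow a gamma distribution; because an interval can take any value, the gamma distribution is a continuous probability distribution. Its probability density function, which describes the waiting time until the k-th event occurs, is

$$f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^k \, \Gamma(k)}$$

where \Gamma is the gamma function; for integer k, \Gamma(k) = (k-1)!.

The program simulates the gamma distribution of the event intervals, with an observation time of 10000 seconds and an average of 10 events per second.
The case k=1 is the distribution of the interval between two adjacent events, and k=2 is the distribution of the interval between two events separated by one event; both follow a gamma distribution.

>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time)*time
>>> t.sort() # the random times must be sorted before computing the intervals between events
>>> s1 = t[1:] - t[:-1] # intervals between adjacent events
>>> s2 = t[2:] - t[:-2] # intervals between events separated by one event
>>> dist1, x1 = np.histogram(s1, bins=100, density=True)
>>> dist2, x2 = np.histogram(s2, bins=100, density=True)
>>> gamma1 = stats.gamma.pdf((x1[:-1]+x1[1:])/2, 1, scale=1.0/_lambda)
>>> gamma2 = stats.gamma.pdf((x2[:-1]+x2[1:])/2, 2, scale=1.0/_lambda)
>>> np.max(np.abs(gamma1 - dist1))
0.13557317865888141
>>> np.max(np.abs(gamma2 - dist2))
0.087375030861794656
>>> np.max(gamma1), np.max(gamma2) # the density values themselves are fairly large, so the errors above are already small
(9.3483221580498537, 3.6767953241013656)
Note: simulating the gamma distribution:
First, the times of 100000 random events are generated within 10000 seconds, so events occur on average 10 times per second;
to compute the intervals between successive events, the random times must first be sorted;
the second value returned by histogram() holds the bin edges; the gamma density is evaluated with gamma.pdf() at the midpoint of each bin. The second argument of pdf() is k, and the scale argument is 1/λ.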

from:http://blog.csdn.net/pipisorry/article/details/49515215

ref: Statistical functions (scipy.stats)
