statistics --- Mathematical statistics functions

New in version 3.4.

Source code: Lib/statistics.py


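This module provides functions for calculating mathematical statistics of numeric (Real-valued) data.

Note

Unless explicitly noted otherwise, these functions support int, float, decimal.Decimal and fractions.Fraction. Behaviour with other types (whether in the numeric tower or not) is currently unsupported. Mixed types are also undefined and implementation-dependent. If your input data consists of mixed types, you may be able to use map() to ensure a consistent result, e.g. map(float, input_data).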
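As a minimal sketch of the map() suggestion in the note, the following coerces a mixed-type input to float before computing a statistic. The name mixed_data and its values are made up for illustration.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# Hypothetical input that mixes several numeric types.
mixed_data = [1, 2.5, Fraction(7, 2), Decimal("4.0")]

# Coerce every value to float so the data has one consistent type.
uniform_data = list(map(float, mixed_data))

print(mean(uniform_data))
```

Converting to float trades exactness for consistency; if exact results matter, converting everything to Fraction or Decimal instead may be a better fit.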

Averages and measures of central location

These functions calculate an average or typical value from a population or sample.

mean()

Arithmetic mean ("average") of data.

harmonic_mean()

Harmonic mean of data.

median()

Median (middle value) of data.

median_low()

Low median of data.

median_high()

High median of data.

median_grouped()

Median, or 50th percentile, of grouped data.

mode()

Mode (most common value) of discrete data.

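Measures of spread

These functions calculate a measure of how much the population or sample tends to deviate from the typical or average values.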

pstdev()

Population standard deviation of data.

pvariance()

Population variance of data.

stdev()

Sample standard deviation of data.

variance()

Sample variance of data.

Function details

Note: The functions do not require the data given to them to be sorted. However, for reading convenience, most of the examples show sorted sequences.

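statistics.mean(data)

Return the sample arithmetic mean of data, which can be a sequence or iterator.

The arithmetic mean is the sum of the data divided by the number of data points. It is commonly called "the average", although it is only one of many different mathematical averages. It is a measure of the central location of the data.

If data is empty, StatisticsError will be raised.

Some examples of use: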

>>> mean([1, 2, 3, 4, 4])
2.8
>>> mean([-1.0, 2.5, 3.25, 5.75])
2.625

>>> from fractions import Fraction as F
>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
Fraction(13, 21)

>>> from decimal import Decimal as D
>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
Decimal('0.5625')
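To connect the definition to the function, here is a small sketch that computes the mean by hand (sum divided by count) and compares it with mean() fed from a generator. Fractions are used only so the comparison is exact.

```python
from fractions import Fraction
from statistics import mean

values = [Fraction(3, 7), Fraction(1, 21), Fraction(5, 3), Fraction(1, 3)]

# The definition: sum of the data divided by the number of data points.
by_definition = sum(values) / len(values)

# mean() also accepts a one-shot iterator such as a generator expression.
from_iterator = mean(v for v in values)

print(by_definition, from_iterator)
```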

Note

The mean is strongly affected by outliers and is not a robust estimator for central location: the mean is not necessarily a typical example of the data points. For more robust, although less efficient, measures of central location, see median() and mode(). (In this case, "efficient" refers to statistical efficiency rather than computational efficiency.)

The sample mean gives an unbiased estimate of the true population mean, which means that, taken on average over all the possible samples, mean(sample) converges on the true mean of the entire population. If data represents the entire population rather than a sample, then mean(data) is equivalent to calculating the true population mean μ.
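The sensitivity to outliers can be seen in a short sketch with made-up data: adding one extreme value moves the mean a long way while the median barely changes.

```python
from statistics import mean, median

typical = [10, 11, 12, 13, 14]
with_outlier = typical + [1000]  # one extreme value (made-up data)

# Without the outlier, mean and median agree.
print(mean(typical), median(typical))

# With the outlier, the mean is dragged upward; the median stays near 12.
print(mean(with_outlier), median(with_outlier))
```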

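statistics.harmonic_mean(data)

Return the harmonic mean of data, a sequence or iterator of real-valued numbers.

The harmonic mean, sometimes called the subcontrary mean, is the reciprocal of the arithmetic mean() of the reciprocals of the data. For example, the harmonic mean of three values a, b and c will be equivalent to 3/(1/a + 1/b + 1/c).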

The harmonic mean is a type of average, a measure of the central location of the data. It is often appropriate when averaging quantities which are rates or ratios, for example speeds. For example:

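Suppose an investor purchases an equal value of shares in each of three companies, with P/E (price/earnings) ratios of 2.5, 3 and 10. What is the average P/E ratio for the investor's portfolio?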

>>> harmonic_mean([2.5, 3, 10])  # For an equal investment portfolio.
3.6

Using the arithmetic mean would give an average of about 5.167, which is too high.
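The same reasoning applies to speeds. The sketch below assumes a hypothetical trip with two legs of equal length and checks that the harmonic mean of the speeds matches total distance divided by total time; the figures are invented for illustration.

```python
import math
from statistics import harmonic_mean, mean

# Hypothetical trip: two legs of equal distance at different speeds (km/h).
speeds = [40, 60]
leg_km = 120  # assumed length of each leg

# Ground truth: total distance divided by total time.
total_time_h = sum(leg_km / s for s in speeds)
true_average = (leg_km * len(speeds)) / total_time_h

print(harmonic_mean(speeds))  # agrees with the true average speed
print(mean(speeds))           # the arithmetic mean overstates it
```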

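StatisticsError is raised if data is empty, or if any element is less than zero.

New in version 3.6.

statistics.median(data)

Return the median (middle value) of numeric data, using the common "mean of middle two" method. If data is empty, StatisticsError is raised. data can be a sequence or iterator.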

The median is a robust measure of central location, and is less affected by the presence of outliers in your data. When the number of data points is odd, the middle data point is returned:

>>> median([1, 3, 5])
3

When the number of data points is even, the median is interpolated by taking the average of the two middle values:

>>> median([1, 3, 5, 7])
4.0

This is suited for when your data is discrete, and you don't mind that the median may not be an actual data point.

If your data is ordinal (supports order operations) but not numeric (doesn't support addition), you should use median_low() or median_high() instead.
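As an illustration of ordinal, non-numeric data, the sketch below uses datetime.date values, which can be ordered but not added to each other. The dates themselves are made up.

```python
from datetime import date
from statistics import median_high, median_low

# Dates can be ordered, but two dates cannot be added together,
# so averaging the two middle values is not possible.
release_dates = [
    date(2020, 1, 15),
    date(2020, 6, 1),
    date(2021, 3, 10),
    date(2022, 9, 30),
]

print(median_low(release_dates))   # the earlier of the two middle dates
print(median_high(release_dates))  # the later of the two middle dates
```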

statistics.median_low(data)

Return the low median of numeric data. If data is empty, StatisticsError is raised. data can be a sequence or iterator.

The low median is always a member of the data set. When the number of data points is odd, the middle value is returned. When it is even, the smaller of the two middle values is returned.

>>> median_low([1, 3, 5])
3
>>> median_low([1, 3, 5, 7])
3

Use the low median when your data are discrete and you prefer the median to be an actual data point rather than interpolated.

statistics.median_high(data)

Return the high median of data. If data is empty, StatisticsError is raised. data can be a sequence or iterator.

The high median is always a member of the data set. When the number of data points is odd, the middle value is returned. When it is even, the larger of the two middle values is returned.

>>> median_high([1, 3, 5])
3
>>> median_high([1, 3, 5, 7])
5

Use the high median when your data are discrete and you prefer the median to be an actual data point rather than interpolated.

statistics.median_grouped(data, interval=1)

Return the median of grouped continuous data, calculated as the 50th percentile, using interpolation. If data is empty, StatisticsError is raised. data can be a sequence or iterator.

>>> median_grouped([52, 52, 53, 54])
52.5

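In the following example, the data are rounded, so that each value represents the midpoint of a data class, e.g. 1 is the midpoint of the class 0.5--1.5, 2 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data given, the middle value falls somewhere in the class 3.5--4.5, and interpolation is used to estimate it: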

>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
3.7

Optional argument interval represents the class interval, and defaults to 1. Changing the class interval naturally will change the interpolation:

>>> median_grouped([1, 3, 3, 5, 7], interval=1)
3.25
>>> median_grouped([1, 3, 3, 5, 7], interval=2)
3.5

This function does not check whether the data points are at least interval apart.
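The interpolation can be sketched with the usual textbook formula for grouped data: take the class containing the middle value, then move into it in proportion to how many observations are still needed to reach the halfway point. The helper grouped_median_sketch below is a hypothetical illustration under that assumption, not necessarily how median_grouped() is implemented.

```python
import math
from statistics import median_grouped


def grouped_median_sketch(data, interval=1):
    """Textbook grouped-median interpolation (illustrative helper only)."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("no data")
    x = values[n // 2]                        # midpoint of the median class
    lower = x - interval / 2                  # lower limit of that class
    below = sum(1 for v in values if v < x)   # cumulative frequency below it
    freq = values.count(x)                    # frequency of the median class
    return lower + interval * (n / 2 - below) / freq


sample = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
print(grouped_median_sketch(sample), median_grouped(sample))
```

For this sample the median class is 3.5--4.5, four values lie below it and five fall inside it, so the estimate sits one fifth of the way into the class.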

CPython implementation detail: Under some circumstances, median_grouped() may coerce data points to floats. This behaviour is likely to change in the future.

See also

  • "Statistics for the Behavioral Sciences", Frederick J Gravetter and Larry B Wallnau (8th Edition).

  • The SSMEDIAN function in the Gnome Gnumeric spreadsheet, including this discussion.

statistics.mode(data)

Return the most common data point from discrete or nominal data. The mode (when it exists) is the most typical value, and is a robust measure of central location.

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

mode assumes discrete data, and returns a single value. This is the standard treatment of the mode as commonly taught in schools:

>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3

The mode is unique in that it is the only statistic which also applies to nominal (non-numeric) data:

>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'
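When the data may have more than one most common value, it can be useful to list all of them rather than rely on a single result. The following is a minimal sketch using collections.Counter; all_modes is a hypothetical helper and not part of the statistics module.

```python
from collections import Counter


def all_modes(data):
    """Return every most-common value (hypothetical helper)."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    # Keep each value whose count ties the highest count.
    return [value for value, count in counts.items() if count == highest]


print(all_modes([1, 1, 2, 3, 3, 3, 3, 4]))        # a single mode
print(all_modes(["red", "blue", "blue", "red"]))  # two equally common values
```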

statistics.pstdev(data, mu=None)

Return the population standard deviation (the square root of the population variance). See pvariance() for arguments and other details.

>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
0.986893273527251

statistics.pvariance(data, mu=None)

Return the population variance of data, a non-empty iterable of real-valued numbers. Variance, or second moment about the mean, is a measure of the variability (spread or dispersion) of data. A large variance indicates that the data is spread out; a small variance indicates it is clustered closely around the mean.

If the optional second argument mu is given, it should be the mean of data. If it is missing or None (the default), the mean is automatically calculated.

Use this function to calculate the variance from the entire population. To estimate the variance from a sample, the variance() function is usually a better choice.

Raises StatisticsError if data is empty.

Examples:

>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
>>> pvariance(data)
1.25

If you have already calculated the mean of your data, you can pass it as the optional second argument mu to avoid recalculation:

>>> mu = mean(data)
>>> pvariance(data, mu)
1.25

This function does not attempt to verify that you have passed the actual mean as mu. Using arbitrary values for mu may lead to invalid or impossible results.

Decimals and Fractions are supported:

>>> from decimal import Decimal as D
>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
Decimal('24.815')

>>> from fractions import Fraction as F
>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
Fraction(13, 72)

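Note

When called with the entire population, this gives the population variance σ². When called on a sample instead, this is the biased sample variance s², also known as variance with N degrees of freedom.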

If you somehow know the true population mean μ, you may use this function to calculate the variance of a sample, giving the known population mean as the second argument. Provided the data points are representative (e.g. independent and identically distributed), the result will be an unbiased estimate of the population variance.
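The relationship between the biased (N degrees of freedom) and sample (N-1 degrees of freedom) variances can be checked directly. This sketch uses Fractions so that the comparison is exact; the sample values are arbitrary.

```python
from fractions import Fraction
from statistics import pvariance, variance

sample = [Fraction(1, 6), Fraction(1, 2), Fraction(5, 3)]
n = len(sample)

biased = pvariance(sample)    # divides the squared deviations by N
unbiased = variance(sample)   # divides by N - 1 (Bessel's correction)

# Both are built from the same sum of squared deviations.
print(biased * n == unbiased * (n - 1))
```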

statistics.stdev(data, xbar=None)

Return the sample standard deviation (the square root of the sample variance). See variance() for arguments and other details.

>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
1.0810874155219827
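As a quick sanity check of the "square root of the variance" relationship, this sketch compares stdev() and pstdev() with math.sqrt() applied to the corresponding variances. math.isclose() is used because floating-point results may differ in the last digits.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

# Sample standard deviation versus the root of the sample variance.
print(math.isclose(stdev(data), math.sqrt(variance(data))))

# Population standard deviation versus the root of the population variance.
print(math.isclose(pstdev(data), math.sqrt(pvariance(data))))
```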
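
statistics.variance(data, xbar=None)

Return the sample variance of data, an iterable of at least two real-valued numbers. Variance, or second moment about the mean, is a measure of the variability (spread or dispersion) of data. A large variance indicates that the data is spread out; a small variance indicates it is clustered closely around the mean.

If the optional second argument xbar is given, it should be the mean of data. If it is missing or None (the default), the mean is automatically calculated.

Use this function when your data is a sample from a population. To calculate the variance from the entire population, see pvariance().

Raises StatisticsError if data has fewer than two values.

Examples: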

>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
>>> variance(data)
1.3720238095238095

If you have already calculated the mean of your data, you can pass it as the optional second argument xbar to avoid recalculation:

>>> m = mean(data)
>>> variance(data, m)
1.3720238095238095

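This function does not attempt to verify that you have passed the actual mean as xbar. Using arbitrary values for xbar can lead to invalid or impossible results.

Decimal and Fraction values are supported: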

>>> from decimal import Decimal as D
>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
Decimal('31.01875')

>>> from fractions import Fraction as F
>>> variance([F(1, 6), F(1, 2), F(5, 3)])
Fraction(67, 108)

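Note

This is the sample variance s² with Bessel's correction, also known as variance with N-1 degrees of freedom. Provided that the data points are representative (e.g. independent and identically distributed), the result should be an unbiased estimate of the true population variance.

If you somehow know the actual population mean μ, you should pass it to the pvariance() function as the mu parameter to get the variance of a sample.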

Exceptions

A single exception is defined:

exception statistics.StatisticsError

Subclass of ValueError for statistics-related exceptions.
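A minimal sketch of handling the exception follows. safe_mean is a hypothetical wrapper, shown only to illustrate catching StatisticsError when the data may be empty; because the exception subclasses ValueError, an existing `except ValueError` handler would catch it as well.

```python
from statistics import StatisticsError, mean


def safe_mean(data, default=None):
    """Return mean(data), or `default` when data is empty (hypothetical helper)."""
    try:
        return mean(data)
    except StatisticsError:
        return default


print(safe_mean([1, 2, 3]))
print(safe_mean([]))
print(issubclass(StatisticsError, ValueError))
```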