Tuesday, May 31, 2011

Skewness and Kurtosis

In our previous blog entry, we discussed how a probability distribution can be described using the first four moments: mean, variance/standard deviation, skewness, and kurtosis. That entry covered mean and standard deviation; now we will talk about skewness and kurtosis.

Skewness: Skewness, written (b1)^0.5, is a dimensionless measure of the asymmetry of a random variable's distribution. It is calculated from the second and third moments of the distribution about the mean. If (b1)^0.5 < 0 the distribution is negatively skewed (tail to the left), and if (b1)^0.5 > 0 it is positively skewed (tail to the right). The equations for calculating skewness from data are:
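
One common moment-based form can be sketched in Python. This is a minimal sketch, assuming the usual definition (b1)^0.5 = m3 / m2^1.5, where mk is the k-th sample moment about the mean computed with divisor n. The divisor and the function names are assumptions for illustration, not taken from the original equations.

```python
def central_moment(data, k):
    """k-th sample moment about the mean, using divisor n (an assumption)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def skewness(data):
    """Moment-based skewness (b1)^0.5 = m3 / m2^1.5."""
    m2 = central_moment(data, 2)
    m3 = central_moment(data, 3)
    return m3 / m2 ** 1.5


# Illustrative data sets (made up for this sketch).
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]   # long tail to the right
symmetric = [1, 2, 3, 4, 5]                # balanced about the mean
left_tailed = [-x for x in right_tailed]   # mirror image: tail to the left

print(skewness(right_tailed))  # positive
print(skewness(symmetric))     # 0.0
print(skewness(left_tailed))   # negative
```

The sign of the result matches the direction of the tail, which is the reading described above.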


Distributions that are: (a) positively skewed with the tail to the right, (b1)^0.5 > 0; (b) symmetric, (b1)^0.5 = 0; and (c) negatively skewed with the tail to the left, (b1)^0.5 < 0. Red lines are means.


Kurtosis: Kurtosis, written b2, is a dimensionless quantity that characterizes the peakedness of a random variable's distribution. It is calculated from the fourth and second moments of the distribution about the mean. If b2 >> 3, the distribution has a high peak; at b2 = 3, it matches the normal distribution; and at b2 = 1.8, the distribution is flat (uniform).
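
The three cases can be checked numerically. The sketch below assumes the usual moment-based definition b2 = m4 / m2^2, with sample moments about the mean computed with divisor n. The function name and the data sets are made up for illustration.

```python
import random


def kurtosis(data):
    """Moment-based kurtosis b2 = m4 / m2^2 (divisor n is an assumption)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2


# Evenly spaced points approximate a flat (uniform) distribution.
flat = [i / 1000 for i in range(1001)]

# Most values at the center plus two far outliers: a sharp peak with long tails.
peaked = [0] * 98 + [-10, 10]

# Pseudo-random draws from a normal distribution (seeded for repeatability).
random.seed(0)
normal = [random.gauss(0.0, 1.0) for _ in range(100_000)]

print(round(kurtosis(flat), 2))  # 1.8
print(kurtosis(peaked))          # 50.0
print(kurtosis(normal))          # close to 3
```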



(a) distribution with a high peak (b2  >> 3), (b) normal distribution (b2  = 3), (c) flat (uniform) distribution with (b2  =1.8).

In Excel, the functions SKEW and KURT can be used to calculate skewness and kurtosis, respectively. Be warned: the Excel KURT function calculates the "Excess of Kurtosis" relative to a normal distribution. For normally distributed data, KURT returns a value near 0 rather than the value of 3 found in statistics textbooks. We suggest that you add 3 to any value returned by the KURT function when presenting data.

=KURT(A1:A10) + 3
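
For small samples, KURT + 3 will be close to, but not identical to, the moment-based b2, because the formula documented for KURT includes a small-sample correction. The Python sketch below compares the two. The function names excel_kurt and moment_kurtosis are hypothetical, the moment-based form assumes divisor n, and the sample values are illustrative only.

```python
import math


def excel_kurt(data):
    """Excess kurtosis using the small-sample formula documented for Excel's KURT."""
    n = len(data)
    mean = sum(data) / n
    # Sample standard deviation (divisor n - 1), as used in the KURT formula.
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


def moment_kurtosis(data):
    """Moment-based kurtosis b2 = m4 / m2^2 (divisor n is an assumption)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2


sample = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]  # ten illustrative values

print(excel_kurt(sample) + 3)   # what =KURT(A1:A10) + 3 would report
print(moment_kurtosis(sample))  # moment-based b2 for the same data
```

With only ten points the two numbers differ noticeably; the gap shrinks as the sample size grows.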

Sunday, May 15, 2011

Mean and Standard Deviation

A random variable is defined by a distribution that has one or more variables that describe location, shape, and scaling. (The term distribution is used in Six Sigma to denote a probability density function.) Practically, a distribution can be described by:

  • mean 
  • variance or standard deviation 
  • skewness 
  • kurtosis 

Once mean, standard deviation, skewness, and kurtosis are calculated or assumed, the relevant location, shape, and scaling variables can be computed.

Mean
The expected (or average) value for the distribution of a random variable is the mean xbar, and it can be calculated from sample data as follows:

xbar = (x1 + x2 + ... + xn) / n

where xi are the values of the n data points. The mean is also called the first moment of a distribution about zero (this is the same thing as a centroid, with the moment taken about zero).

Means (red lines) for different distributions.


Variance and Standard Deviation
The spread of a random variable is measured by the variance σ^2, and the square root of the variance is the standard deviation σ. The variance is equivalent to the second moment of a distribution about its mean.
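
As a minimal sketch, the second moment about the mean can be computed in Python as shown here. The ddof parameter is an assumption added for illustration: ddof=0 divides by n (the plain second moment), while ddof=1 divides by n - 1 (the usual sample estimate). The text does not say which convention is intended.

```python
import math


def variance(data, ddof=0):
    """Second moment about the mean; ddof=0 divides by n, ddof=1 by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)


def std_dev(data, ddof=0):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(data, ddof))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values with mean 5

print(variance(data))          # 4.0
print(std_dev(data))           # 2.0
print(variance(data, ddof=1))  # slightly larger: 32 / 7
```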



Distributions with increasing standard deviation (a) to (c). Red lines are means.

Friday, May 13, 2011

Chebychev’s Inequality

There is a neat theorem in statistics called Chebychev’s Inequality that states: for any distribution with a finite mean and standard deviation σ, at least a fraction 1 - (1/k^2) of the distribution lies within ±kσ of the mean. The bound is only informative for k > 1. For normal distributions, the percent within ±kσ of the mean is known exactly.





Percent of a distribution that falls within ±kσ of the mean.
k        Any Distribution    Normal
1        -----               68.3%
1.415    50.0%               84.3%
2        75.0%               95.5%
3        88.9%               99.7%
4        93.8%               99.99%
5        96.0%               99.99994%
6        97.2%               99.9999998%
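
The table values can be reproduced with a few lines of Python. This is a sketch: the function names are made up for illustration, and the normal column uses the standard identity that the fraction within ±kσ equals erf(k / sqrt(2)).

```python
import math


def chebyshev_lower_bound(k):
    """Minimum fraction within ±kσ for any distribution; 0 when k <= 1."""
    return max(0.0, 1.0 - 1.0 / k ** 2)


def normal_fraction(k):
    """Exact fraction of a normal distribution within ±kσ of the mean."""
    return math.erf(k / math.sqrt(2))


# Print one row per k: the guaranteed minimum and the normal-distribution value.
for k in (1, 1.415, 2, 3, 4, 5, 6):
    print(f"{k:>6}  {chebyshev_lower_bound(k):8.1%}  {normal_fraction(k):.7%}")
```

For k = 1 the bound evaluates to zero, which is why the table shows no guaranteed percentage in that row.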

I think this is pretty useful.  You don't have to worry if the distribution is skewed, peaked, or flattened.  Chebychev's Inequality will always give you the minimum percent that falls within ±kσ.  It's a shame that this is not more widely used in engineering design or business decision making.