Concept Sheet 2
Descriptive Statistics

After we have collected a group of data, we have to begin describing it. To start, we are interested in two broad characterizations of data: measures of central tendency (or location) and measures of dispersion.

Central Tendency or Location

Measures of central tendency attempt to get a handle on the "middle" or representative observation. There are three different measures of central tendency. Even for one data set, the three measures can give different answers, because each measure looks at different aspects of "middle-ness."

Mode --is the value with the maximum frequency in the data set, or the value that occurs most often.

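Median --is a positional rank, or the value having the same number of observations above and below. This requires that the observations are at least ordinally measured, so that they can be ranked. With an even number of observations, the median is conventionally taken as the midpoint of the two middle values.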

Mean, $\bar{x}$ --is an arithmetic measure defined as the sum of all observations divided by the total number of observations: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
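As a minimal sketch of how the three measures can disagree on a single data set, the following uses Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample (illustrative data only); one large value skews the mean.
sample = [1, 1, 2, 4, 12]

mode = statistics.mode(sample)      # value that occurs most often
median = statistics.median(sample)  # middle value of the sorted data
mean = statistics.mean(sample)      # sum of the values divided by their count

print("mode:", mode)      # 1
print("median:", median)  # 2
print("mean:", mean)      # 4
```

Each measure gives a different "middle" here because the single large observation pulls the mean upward without changing the mode or the median.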

Now that we have a sense of the "middle" or typical value, we want to get a sense of how the data vary around the "middle" that we just found. The measures used to describe this variation are called measures of dispersion.

Measures of Dispersion

We will break these measures up into two categories depending on how the data are measured.

I. For Discrete/Nominal or Ordinal Variables:
These statistics measure variation in a sample by comparing the cases (i.e., scores or observations) to one another.

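Variation Ratio (V)--is the ratio of NON-modal observations to the total number of observations (or, equivalently, one minus the proportion of observations in the modal category): $V = 1 - f_{mode}/n$, where $f_{mode}$ is the number of cases in the modal category and $n$ is the total number of cases.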

Minimum value for V: 0, where all cases are modal (a constant!)
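Maximum value for V: approaches 1 as the modal category's share of the cases shrinks. With k categories, V cannot exceed (k-1)/k, which is reached when the cases are spread evenly across the categories.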
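The variation ratio can be sketched in a few lines of Python. The function name `variation_ratio` and the sample data are hypothetical, and the sketch assumes a non-empty list of category labels.

```python
from collections import Counter

def variation_ratio(observations):
    """Share of observations that fall outside the modal category.

    Assumes `observations` is a non-empty list of category labels.
    """
    counts = Counter(observations)
    modal_count = max(counts.values())  # frequency of the modal category
    return 1 - modal_count / len(observations)

# Hypothetical nominal data: 3 of 6 cases are in the modal category.
parties = ["Dem", "Dem", "Dem", "Rep", "Rep", "Ind"]
print(variation_ratio(parties))            # 1 - 3/6 = 0.5
print(variation_ratio(["Dem", "Dem"]))     # a constant: 0.0
```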

Index of Diversity (D)--indicates the likelihood that two observations drawn at random from the sample are from different categories of the variable. The larger the number for D, the greater the dispersion.

$D = 1 - \sum_{i=1}^{k} p_i^2$

where:

k = number of categories of the variable
$p_i$ = proportion of cases in category i

Minimum value for D: 0 (when all cases fall into a single category)
Maximum value for D: (k-1)/k (a function of the number of categories of the variable; reached when the cases are spread evenly across all k categories)

Index of Qualitative Variation (IQV)--standardized version of D, obtained by dividing D by its maximum possible value: $IQV = D / ((k-1)/k)$. Same interpretation as D, but IQV has a maximum value of 1, regardless of the number of categories of the variable.
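A minimal sketch of both indices follows, assuming the usual definitions: D is one minus the sum of the squared category proportions, and IQV is D divided by its maximum, (k-1)/k. The function names and the optional `k` argument are hypothetical. By default k is taken to be the number of categories actually observed, which understates k if some categories of the variable happen to be empty in the sample.

```python
from collections import Counter

def index_of_diversity(observations):
    """D = 1 - sum of squared category proportions."""
    n = len(observations)
    proportions = [count / n for count in Counter(observations).values()]
    return 1 - sum(p * p for p in proportions)

def iqv(observations, k=None):
    """D rescaled by its maximum (k-1)/k, so that the result lies in [0, 1].

    k is the number of categories of the variable; if omitted, it is
    assumed to equal the number of distinct categories observed.
    """
    if k is None:
        k = len(set(observations))
    if k < 2:
        return 0.0  # a single category has no variation to standardize
    return index_of_diversity(observations) / ((k - 1) / k)

# Hypothetical data: two categories, evenly split.
even_split = ["a", "a", "b", "b"]
print(index_of_diversity(even_split))  # 1 - (0.25 + 0.25) = 0.5
print(iqv(even_split))                 # 0.5 / 0.5 = 1.0
```

Passing `k` explicitly (for example `iqv(even_split, k=4)`) is the safer choice when the variable has categories that do not appear in the sample.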

II. For Continuous/Interval Variables:
With the exception of the range, these statistics measure variation in the sample by comparing the scores to the mean.

Range--difference between extreme values of a distribution (high value minus low value).

R = Highest value - Lowest value.

Minimum value for the range: 0 (where all observations are on the same value)
Maximum value for the range: the distance from the lowest to the highest possible measurement

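Average Deviation--sum of the absolute deviations of scores (cases or observations) about the mean, divided by the total number of scores: $AD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$. (Absolute values are needed because the signed deviations about the mean always sum to zero.)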
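The range and the average deviation can be sketched as follows. The sketch assumes the average deviation uses absolute deviations, since signed deviations about the mean always sum to zero. The function names and sample scores are hypothetical.

```python
def value_range(scores):
    """Highest value minus lowest value."""
    return max(scores) - min(scores)

def average_deviation(scores):
    """Mean of the absolute deviations of the scores about their mean."""
    mean = sum(scores) / len(scores)
    return sum(abs(x - mean) for x in scores) / len(scores)

# Hypothetical interval-level scores; their mean is 5.
scores = [2, 4, 4, 6, 9]
print(value_range(scores))        # 9 - 2 = 7
print(average_deviation(scores))  # (3 + 1 + 1 + 1 + 4) / 5 = 2.0
```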

Variance--the sum of the squared deviations of the scores about the mean, divided by the total number of scores minus 1: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. (Note: we divide by (n-1) because, otherwise, the sample variance will be a biased estimator of the population variance. We will talk more later in the semester about what this means.)

Standard Deviation, s --the square root of the variance: $s = \sqrt{s^2}$. Think of it as the "average" or "typical" amount of variation around the mean. Loosely speaking, a "typical" observation lies about one standard deviation away from the mean.

Minimum value for s.d.: 0 (where there is no variation from the mean at all; all cases are on the mean)
Maximum value for s.d.: approximately half of the total possible range of measurement (for example, on a 100 point scale, max. s.d. is roughly 50)
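As a sketch, the sample variance and standard deviation can be computed by hand with the (n-1) divisor and compared against Python's `statistics.variance` and `statistics.stdev`, which use the same divisor. The function name `sample_variance` and the scores are hypothetical.

```python
import math
import statistics

def sample_variance(scores):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

# Hypothetical scores; their mean is 5.
scores = [2, 4, 4, 6, 9]

s2 = sample_variance(scores)  # (9 + 1 + 1 + 1 + 16) / 4 = 7.0
s = math.sqrt(s2)             # standard deviation = square root of variance

print(s2, statistics.variance(scores))
print(s, statistics.stdev(scores))
```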

Chebychev's Inequality Theorem:

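For ANY distribution, at least $(1 - 1/k^2) \times 100\%$ of the observations lie within k standard deviations of the mean, for any k > 1. (Here k is a number of standard deviations, not the number of categories used earlier.) For example, at least 75% of the observations lie within 2 standard deviations of the mean.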
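The inequality can be checked empirically on a deliberately skewed data set. This is a sketch only: the function names and the data are hypothetical, and it uses the sample standard deviation (n-1 divisor) defined above, which is slightly larger than the population version and so can only widen the interval.

```python
import statistics

def chebyshev_bound(k):
    """Minimum share of observations within k standard deviations (k > 1)."""
    return 1 - 1 / k**2

def share_within(scores, k):
    """Observed share of scores within k standard deviations of the mean."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    inside = [x for x in scores if abs(x - mean) <= k * sd]
    return len(inside) / len(scores)

# Hypothetical, strongly skewed data with one extreme value.
skewed = [1, 1, 2, 2, 3, 3, 4, 50]

for k in (2, 3):
    print(k, chebyshev_bound(k), share_within(skewed, k))
```

The observed share is usually well above the bound, because the bound has to hold for every possible distribution.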

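For Normally Distributed data:

Approximately 68% of the observations lie within 1 standard deviation of the mean.
Approximately 95% of the observations lie within 2 standard deviations of the mean.
Approximately 99.7% of the observations lie within 3 standard deviations of the mean.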
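As a sketch, the share of a normal distribution lying within k standard deviations of the mean can be computed from the normal cumulative distribution function, here with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name `normal_share_within` is hypothetical.

```python
from statistics import NormalDist

def normal_share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(normal_share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

These shares are much higher than the guarantee for an arbitrary distribution (for example, 75% at k = 2), because they rely on the data actually being normal.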

Posted February 18, 1998
Copyright Chris Fastnow