Summarizing Skewed Distributions of Continuous Variables

Many measurement variables evaluated in public health are more or less normally distributed, but some are not. Consider the frequency distribution below, which shows the percentage of people who have 0 to 20 or more drinks per day.

This distribution is not normal; it is skewed to the right. In this situation, the mean and standard deviation are not appropriate statistics for characterizing this variable, because a few extreme values in the tail pull the mean away from the bulk of the data. Instead, one should use the median to characterize the central tendency and the interquartile range to indicate variability.

The Median

The median is the middle value, i.e., the value at which half of the measurements are below that value and half are above. For the seven systolic blood pressure measurements in the table above, the median value is 121 mm Hg.

100, 110, 114, 121, 130, 130, 160

To find the median, sort the values and take the middle value if the number of values is odd; if the number of values is even, the median is the average of the two middle values. However, it is easier to let Excel compute this, particularly with a large number of subjects.
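The sorting procedure described above can also be sketched in a few lines of Python, as an alternative to Excel. The function name manual_median is only illustrative; the result is checked against the median function in Python's standard statistics module.

```python
from statistics import median

def manual_median(values):
    """Median by sorting and picking (or averaging) the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: the average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

# The seven systolic blood pressure values (mm Hg) from the example.
systolic = [100, 110, 114, 121, 130, 130, 160]

print(manual_median(systolic))   # 121
print(median(systolic))          # 121, same answer from the standard library

# Dropping the last value, purely to illustrate the even-count case:
print(manual_median(systolic[:-1]))  # 117.5, the average of 114 and 121
```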

Interquartile Range (IQR)

The figure below shows a normal distribution, for which we would use the mean and standard deviation instead of the median and IQR. However, for ease of illustration we will use this normal distribution to explain the quartiles and interquartile range that would be used for a skewed distribution. (Note that the mean and median will be similar in a symmetrical distribution like this.)

To figure out the quartiles and IQR manually, you would first rank the observations from smallest to largest and then divide the data set into four equal parts. These are the quartiles, each of which has an equal (or nearly equal) number of observations. Half of the observations will be below the median, and half will be above.

The 1st quartile (Q1) has the lowest 25% of observations; its upper boundary is found by taking the middle value between the lowest value and the median. The 4th quartile (Q4) has the highest 25% of observations; its lower boundary is found by taking the middle value between the median and the highest value in the data set. The 2nd quartile (Q2) has the 25% between the 1st quartile and the median, and the 3rd quartile (Q3) has the 25% between the median and the 4th quartile. The interquartile range (IQR) is the range for the middle 50% of the data, i.e., between the top of Q1 and the top of Q3.

IQR=Q3-Q1
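As a rough illustration, the manual procedure can be sketched in Python. This sketch assumes the "median of each half" convention described above (Q1 is the median of the values below the median, Q3 the median of the values above it); spreadsheet and statistics packages often use interpolation rules instead, so their results may differ slightly. The function name quartiles is only illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # values above the median position
    return median(lower), median(ordered), median(upper)

# The seven systolic blood pressure values (mm Hg) from the median example.
systolic = [100, 110, 114, 121, 130, 130, 160]

q1, med, q3 = quartiles(systolic)
iqr = q3 - q1
print(q1, med, q3, iqr)  # 110 121 130 20
```

Here the lower half is 100, 110, 114 and the upper half is 130, 130, 160, so Q1 = 110, Q3 = 130, and IQR = 130 - 110 = 20.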

Outliers

Outliers are extreme values. Data points are outliers if they meet either of these definitions:

  • For a more or less normal distribution, outliers are values more than 3 standard deviations above the mean or more than 3 standard deviations below the mean.
  • For non-normal distributions, outliers are:
    • values > Q3 + 1.5(IQR)
    • values < Q1 – 1.5(IQR)
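Both rules can be sketched in Python as follows. This is only an illustration: it assumes the median-of-halves convention for Q1 and Q3 and the sample standard deviation, the function names are made up for this example, and the extra value of 210 is hypothetical, added only to show a flagged point.

```python
from statistics import mean, median, stdev

def iqr_outliers(values):
    """Values above Q3 + 1.5(IQR) or below Q1 - 1.5(IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[n // 2 + n % 2:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in ordered if v < low_fence or v > high_fence]

def sd_outliers(values):
    """Values more than 3 standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in sorted(values) if abs(v - m) > 3 * s]

# The seven systolic blood pressure values (mm Hg) from the median example.
systolic = [100, 110, 114, 121, 130, 130, 160]

print(iqr_outliers(systolic))          # [] (160 sits exactly on the upper fence)
print(sd_outliers(systolic))           # []
print(iqr_outliers(systolic + [210]))  # [210] (210 is a hypothetical value)
```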

Box and Whisker Plots

A box and whisker plot is a way of summarizing skewed data. It gives a sense of the shape of the distribution, the central tendency, and the degree of variability. You will not have to make box and whisker plots for this course.

Example of a box and whisker plot

Example of Box and Whisker Plots Used for Comparison
Carl and Angela work in a computer store and want to compare the number of sales they made for the past 12 months.

In the past 12 months Angela sold
34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37
(Ordered: 1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57)

In the past 12 months Carl sold
51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13
(Ordered: 6, 7, 13, 17, 20, 25, 39, 41, 43, 49, 51, 62)

After ordering the data, it can be summarized as follows:
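The five values a box and whisker plot is built from (minimum, Q1, median, Q3, maximum) can be recomputed from the two lists of sales. The Python sketch below assumes the median-of-halves convention for the quartiles; other software may use a different quartile rule and report slightly different Q1 and Q3 values. The function name five_number_summary is only illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])          # median of the lower half
    q3 = median(ordered[n // 2 + n % 2:])   # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Monthly sales for the past 12 months, as listed above.
angela = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
carl = [51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13]

for name, sales in (("Angela", angela), ("Carl", carl)):
    print(name, five_number_summary(sales))
# Angela (1, 17.0, 26.0, 42.0, 57)
# Carl (6, 15.0, 32.0, 46.0, 62)
```

Under this convention, Angela's IQR is 42 - 17 = 25 and Carl's is 46 - 15 = 31.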

Summary: Carl's highest and lowest sales are both higher than Angela's, and Carl's median sales figure is higher too. Over the past year, Carl generally sold more computers than Angela.

Test Yourself
