2  Descriptive Statistics: Tabular and Graphical Presentations

2.1 Summarizing Categorical Data

2.1.1 Frequency Distribution

A frequency distribution is a tabular summary of data showing the number (frequency) of items in each of several nonoverlapping classes.

Example 2.1 Consider the following data shown in Table 2.1.

Table 2.1: Data from a sample of 50 soft drink purchases
Coke Classic    Coke Classic    Coke Classic
Diet Coke       Diet Coke       Coke Classic
Pepsi           Coke Classic    Pepsi
Diet Coke       Diet Coke       Dr. Pepper
Coke Classic    Coke Classic    Coke Classic
Coke Classic    Sprite          Diet Coke
Dr. Pepper      Pepsi           Pepsi
Diet Coke       Coke Classic    Pepsi
Pepsi           Coke Classic    Pepsi
Pepsi           Coke Classic    Pepsi
Coke Classic    Pepsi           Coke Classic
Dr. Pepper      Coke Classic    Dr. Pepper
Sprite          Sprite          Pepsi
Coke Classic    Dr. Pepper      Sprite
Diet Coke       Pepsi           Coke Classic
Coke Classic    Diet Coke       Sprite
Coke Classic    Pepsi

We now construct a frequency distribution by counting how many times each soft drink appears in the data.

Table 2.2: Frequency distribution of Soft Drink Purchases
Soft Drink Frequency
Coke Classic 19
Diet Coke 8
Dr. Pepper 5
Pepsi 13
Sprite 5
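The counting step can be sketched in a few lines of Python. This is only an illustration, not part of the original example: the two-letter codes (`CC`, `DC`, `DP`, `P`, `S`) are abbreviations introduced here to keep the listing short, and the rows follow Table 2.1 in order.

```python
from collections import Counter

# The 50 purchases of Table 2.1, row by row, using abbreviations
# introduced for this sketch only.
rows = """
CC CC CC
DC DC CC
P CC P
DC DC DP
CC CC CC
CC S DC
DP P P
DC CC P
P CC P
P CC P
CC P CC
DP CC DP
S S P
CC DP S
DC P CC
CC DC S
CC P
"""
names = {"CC": "Coke Classic", "DC": "Diet Coke", "DP": "Dr. Pepper",
         "P": "Pepsi", "S": "Sprite"}
purchases = [names[code] for code in rows.split()]

# Counter tallies how many times each category occurs.
frequency = Counter(purchases)
for drink in sorted(frequency):
    print(f"{drink:<14}{frequency[drink]:>3}")
```

The printed counts should agree with Table 2.2, and the frequencies add up to the sample size of 50.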

Relative Frequency and Percent Frequency Distributions

  • \(\text{Relative frequency}=\dfrac{\text{Frequency of the class}}{n}\), where \(n\) is the total number of observations.

  • The percent frequency of a class is the relative frequency multiplied by 100.
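As a small sketch of these two definitions, the snippet below applies them to the frequencies of Table 2.2. The variable names are chosen here for illustration.

```python
# Frequencies taken from Table 2.2
frequency = {"Coke Classic": 19, "Diet Coke": 8, "Dr. Pepper": 5,
             "Pepsi": 13, "Sprite": 5}
n = sum(frequency.values())  # total number of observations

# Relative frequency = class frequency / n
relative = {drink: f / n for drink, f in frequency.items()}
# Percent frequency = relative frequency * 100
percent = {drink: 100 * rf for drink, rf in relative.items()}

for drink in frequency:
    print(f"{drink:<14}{frequency[drink]:>4}{relative[drink]:>7.2f}{percent[drink]:>6.0f}")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on the arithmetic.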

2.1.2 Bar Charts and Pie Charts

  • Bar chart: A graphical device for depicting qualitative data that have been summarized in a frequency, relative frequency, or percent frequency distribution.

  • Pie chart: A graphical device for presenting data summaries based on subdivision of a circle into sectors that correspond to the relative frequency for each class.

From the frequency distribution of soft drink purchases, we develop the relative and percent frequency distributions (see Table 2.3), which are the basis for the bar chart and pie chart.

Table 2.3: Frequency, Relative and Percent Frequency Distributions of Soft Drink Purchases
Soft Drink      Frequency (f)   Relative Frequency (Rf)   Percent Frequency (Pf)
Coke Classic    19              0.38                      38
Diet Coke       8               0.16                      16
Dr. Pepper      5               0.10                      10
Pepsi           13              0.26                      26
Sprite          5               0.10                      10

Now we construct a bar chart and pie chart.

Figure 2.1: Bar chart of Soft drink purchases
Figure 2.2: Pie chart of Soft drink purchases
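To make the construction of the two charts concrete, here is a minimal text-only sketch based on the percent frequencies of Table 2.3. It is not the code behind Figures 2.1 and 2.2. The scale of one `#` per 2 percentage points is an arbitrary choice made here. The pie chart part uses the fact that a sector's central angle is its relative frequency times 360 degrees.

```python
# Percent frequencies from Table 2.3
percent = {"Coke Classic": 38, "Diet Coke": 16, "Dr. Pepper": 10,
           "Pepsi": 26, "Sprite": 10}

# Text bar chart: bar length is proportional to the percent frequency
# (one '#' per 2 percentage points, a scale chosen for this sketch).
for drink, pf in percent.items():
    print(f"{drink:<13}| {'#' * (pf // 2)} {pf}%")

# Pie chart: central angle of each sector = relative frequency * 360 degrees
angles = {drink: pf / 100 * 360 for drink, pf in percent.items()}
for drink, angle in angles.items():
    print(f"{drink:<13}{angle:6.1f} degrees")
```

The sector angles add up to 360 degrees, just as the percent frequencies add up to 100.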

2.1.3 Cross-tabulation and its graphical presentation

Cross-tabulation (also known as crosstabs or contingency tables) is a fundamental data analysis technique used to examine the relationship between two or more categorical variables. For instance, consider the following cross-tabulation of gender by fitness level for 30 individuals:

Table 2.4: Cross-Tabulation of Gender by Fitness Level
Gender Excellent Fair Good Poor Grand Total
Male 2 7 3 3 15
Female 5 3 5 2 15
Grand Total 7 10 8 5 30

The information in a cross-tabulation can be displayed using (a) a clustered column chart or (b) a stacked column chart (see Figure 2.3).

Figure 2.3: Visualization of the relationship between two categorical variables
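A cross-tabulation is simply a count of each pair of categories. The sketch below rebuilds the counts of Table 2.4 from the Gender and Fitness Level columns of the case-study data in Section 2.6, listed in ID order. In that table the odd IDs are male and the even IDs are female. The variable names are illustrative only.

```python
from collections import Counter

# Fitness level of respondents 1..30, copied from the case-study table (Section 2.6)
fitness = ("Fair Poor Good Fair Poor Excellent Fair Poor Good Excellent "
           "Poor Good Fair Excellent Fair Good Excellent Fair Fair Good "
           "Good Excellent Fair Fair Poor Good Excellent Good Fair Excellent").split()
# In that table, odd IDs are Male and even IDs are Female.
gender = ["Male" if i % 2 == 1 else "Female" for i in range(1, 31)]

# Count every (gender, fitness level) pair
cells = Counter(zip(gender, fitness))

levels = ["Excellent", "Fair", "Good", "Poor"]
table = {g: [cells[(g, lv)] for lv in levels] for g in ("Male", "Female")}
for g, row in table.items():
    print(f"{g:<8}{row}  total = {sum(row)}")
```

The row totals are the gender counts, and summing each column gives the fitness-level totals in the last row of Table 2.4.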

2.2 Summarizing Quantitative Data

2.2.1 Frequency Distribution of quantitative data

Consider the following data.

YEAR-END AUDIT TIMES (IN DAYS): 12, 14, 19, 18, 15, 15, 18, 17, 20, 27, 22, 23, 22, 21, 33, 28, 14, 18, 16, 13

To construct a frequency distribution, we have to:

  1. Determine the number of nonoverlapping classes (\(k\)).
  2. Determine the width of each class.
  3. Determine the class limits.

2.2.2 Frequency Distribution of Audit time data

Here, \(n=20\), the smallest value is \(12\), and the largest value is \(33\).

  1. Determine the number of classes \(k\): \(k=\sqrt{n}=\sqrt{20}\approx 4.47\), which we round up to \(5\). So \(5\) is the number of classes.
  2. Determine the class width \(w\): \(w=\frac{\text{Largest}-\text{Smallest}}{k}=\frac{33-12}{5}=4.2\), which we round up to a convenient width of \(5\).
  3. Determine the class limits: starting slightly below the smallest value (12), say at \(10\), we have the following classes (exclusive method, where the upper bound of each class is excluded):

[10,15), [15,20), [20,25), [25,30), and [30,35)
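The three steps can be sketched as follows. Rounding \(k\) up with `ceil`, choosing the width 5, and starting at 10 mirror the choices made in the text. They are judgment calls rather than fixed rules, and are marked as such in the comments.

```python
import math

audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
n = len(audit_times)

# Step 1: number of classes, sqrt(n) rounded up (assumption: round up)
k = math.ceil(math.sqrt(n))

# Step 2: raw class width, then a convenient rounded-up value
raw_width = (max(audit_times) - min(audit_times)) / k
width = 5   # chosen as in the text: 4.2 rounded up to a convenient number

# Step 3: class limits, starting slightly below the smallest value
start = 10  # chosen as in the text
classes = [(start + i * width, start + (i + 1) * width) for i in range(k)]

print(k, raw_width, classes)
```

Each pair `(lower, upper)` stands for the half-open class [lower, upper).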

Now count the data values that fall in each class to obtain the frequency distribution. From the frequency distribution we can also produce the relative and percent frequency distributions (Table 2.5).

Table 2.5: Frequency, relative frequency (rf) and percent frequency (pf) distribution for the audit time data (n=20)
Audit Time (days)   Frequency (f)   Relative frequency (rf)   Percent frequency (pf)
[10,15)             4               0.20                      20
[15,20)             8               0.40                      40
[20,25)             5               0.25                      25
[25,30)             2               0.10                      10
[30,35)             1               0.05                      5
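The counting behind Table 2.5 can be sketched in Python. The condition `lo <= x < hi` implements the exclusive method: a value equal to an upper limit, such as 15 or 20, is counted in the next class.

```python
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35)]
n = len(audit_times)

# Exclusive method: lower limit included, upper limit excluded
frequency = [sum(lo <= x < hi for x in audit_times) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequency):
    print(f"[{lo},{hi})  f={f:<2}  rf={f / n:.2f}  pf={100 * f / n:.0f}")
```

Because the classes do not overlap and together cover the whole range of the data, the frequencies must add up to \(n=20\).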

2.2.3 Histogram

A common graphical presentation of quantitative data is a histogram. This graphical summary can be prepared for data previously summarized in either a frequency, relative frequency, or percent frequency distribution.

Figure 2.4: Histogram for the Audit Time data

Important Note: A relative frequency (or percent frequency) histogram is well suited to comparing distributions across groups of different sizes, because it displays proportions instead of raw counts, so the comparison is not driven by group size.

Illustration (see Figure 2.5)

A dataset of Marks (out of 100) was collected from two student groups:

  • 70 Female students

  • 40 Male students

The goal is to compare the distribution of marks between these two groups.

The frequency histogram shows how many students fall into each marks range (bin). However, since there are more female students (70) than male students (40), the female bars are naturally taller, even if the relative performance is similar. This makes a direct comparison of raw counts misleading.

A relative frequency histogram shows the proportion of students in each bin within each group. Dividing the counts by the total number in the group:

  • normalizes the data,

  • allows fair comparisons between groups of different sizes,

  • highlights true differences in distribution, not just differences in group size.

Example Observation

From the relative frequency histogram:

  • Around 40% of males scored between 50–60.

  • Around 39% of females scored between 50–60.

This comparison is valid only because the histograms show relative frequency, not raw counts.

Figure 2.5: Comparison between Frequency histogram and relative frequency histogram
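The effect of normalizing can be seen in a tiny sketch. The bin counts below are invented for illustration only and are not the data behind Figure 2.5. They are chosen so that both groups have the same shape but different sizes (70 and 40).

```python
# Hypothetical counts per marks bin (invented for this sketch)
bins = ["40-50", "50-60", "60-70", "70-80", "80-90"]
female_counts = [7, 14, 28, 14, 7]   # 70 students in total
male_counts = [4, 8, 16, 8, 4]       # 40 students in total

def relative_frequencies(counts):
    """Divide each bin count by the group total."""
    total = sum(counts)
    return [c / total for c in counts]

female_rf = relative_frequencies(female_counts)
male_rf = relative_frequencies(male_counts)

for label, fc, mc, frf, mrf in zip(bins, female_counts, male_counts, female_rf, male_rf):
    print(f"{label}: counts {fc:>2} vs {mc:>2} | proportions {frf:.2f} vs {mrf:.2f}")
```

In every bin the female count is larger, yet the proportions are identical, so in this made-up case a frequency histogram would suggest a difference that the relative frequency histogram shows does not exist.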

2.2.4 Histogram and Shape of the Distribution

See Figure 2.6.

Figure 2.6: Histograms Showing Differing Levels of Skewness

2.2.5 Cumulative Distributions and Ogive

A variation of the frequency distribution that provides another tabular summary of quantitative data is the cumulative frequency distribution. Table 2.6 shows the cumulative relative frequency of Audit Time data.

Table 2.6: Frequency, relative frequency and cumulative relative frequency distribution of Audit Time data
Audit Time (days)   Frequency (f)   Relative frequency (rf)   Cumulative relative frequency (crf)
[10,15)             4               0.20                      0.20
[15,20)             8               0.40                      0.60
[20,25)             5               0.25                      0.85
[25,30)             2               0.10                      0.95
[30,35)             1               0.05                      1.00
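A cumulative distribution is a running total. As a brief sketch, the frequencies of Table 2.6 can be accumulated first and then divided by \(n\). Accumulating the integer counts before dividing is a choice made here to avoid floating-point round-off.

```python
from itertools import accumulate

frequency = [4, 8, 5, 2, 1]   # class frequencies from Table 2.6
n = sum(frequency)

# Running totals of the frequencies
cumulative_frequency = list(accumulate(frequency))
# Cumulative relative frequency = cumulative frequency / n
cumulative_relative = [cf / n for cf in cumulative_frequency]

print(cumulative_frequency)
print([f"{crf:.2f}" for crf in cumulative_relative])
```

The last cumulative frequency always equals \(n\), and the last cumulative relative frequency always equals 1.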

2.2.6 Ogive

Another way of presenting this information is the ogive, which is a graphical representation of the cumulative relative frequencies. Figure 2.7 shows the ogive of the cumulative relative frequencies for the Audit Time data.

Figure 2.7: Ogive for Audit Time data

For instance, from both Table 2.6 and Figure 2.7 we can say (estimate) that 60% of audits took less than 20 days. Similarly, 95% of audits took less than 30 days, and so on.
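Reading a value off an ogive can be sketched as linear interpolation between the plotted points (class limit, cumulative relative frequency). The function name `proportion_below` is introduced here for illustration. Between class limits the result is only an estimate, since it assumes the observations are spread evenly within each class.

```python
from bisect import bisect_right

# Ogive points: class limits and the cumulative relative frequency reached at each
limits = [10, 15, 20, 25, 30, 35]
crf = [0.0, 0.20, 0.60, 0.85, 0.95, 1.00]

def proportion_below(x):
    """Estimate the proportion of audits taking less than x days."""
    if x <= limits[0]:
        return 0.0
    if x >= limits[-1]:
        return 1.0
    i = bisect_right(limits, x) - 1          # segment containing x
    x0, x1 = limits[i], limits[i + 1]
    y0, y1 = crf[i], crf[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

print(proportion_below(20))    # value at a class limit, as in Table 2.6
print(proportion_below(22.5))  # interpolated estimate inside a class
```

At the class limits the function returns exactly the tabulated values, so it agrees with the two statements above for 20 and 30 days.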

2.2.7 The Stem-and-Leaf Display

The techniques of exploratory data analysis consist of simple arithmetic and easy-to-draw graphs that can be used to summarize data quickly. One technique—referred to as a stem-and-leaf display—can be used to show both the rank order and shape of a data set simultaneously (Anderson and Sweeney 2011).

Steps to Construct a Stem-and-Leaf Diagram

(1) Divide each number into two parts: a stem, consisting of one or more of the leading digits, and a leaf, consisting of the remaining digit.

(2) List the stem values in a vertical column.

(3) Record the leaf for each observation beside its stem.

(4) Write the units for stems and leaves on the display.

Example 2.2 The following data are the numbers of questions answered correctly on an aptitude test given to 50 individuals recently interviewed for a position at Haskens Manufacturing.

112, 72, 69, 97, 107, 73, 92, 76, 86, 73, 126, 128, 118, 127, 124, 82, 104, 132, 134, 83, 92, 108, 96, 100, 92, 115, 76, 91, 102, 81, 95, 141, 81, 80, 106, 84, 119, 113, 98, 75, 68, 98, 115, 106, 95, 100, 85, 94, 106, 119


  The decimal point is 1 digit(s) to the right of the |

   6 | 89
   7 | 233566
   8 | 01123456
   9 | 12224556788
  10 | 002466678
  11 | 2355899
  12 | 4678
  13 | 24
  14 | 1
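The construction steps can be sketched in Python for whole-number data like these. The function `stem_and_leaf` is written here for illustration and is not the tool that produced the display above. It takes the last digit as the leaf and the leading digits as the stem.

```python
from collections import defaultdict

scores = [112, 72, 69, 97, 107, 73, 92, 76, 86, 73, 126, 128, 118, 127, 124,
          82, 104, 132, 134, 83, 92, 108, 96, 100, 92, 115, 76, 91, 102, 81,
          95, 141, 81, 80, 106, 84, 119, 113, 98, 75, 68, 98, 115, 106, 95,
          100, 85, 94, 106, 119]

def stem_and_leaf(values):
    """Return display lines; stem = leading digits, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting puts the leaves in rank order
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):   # keep empty stems too
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>3} | {row}")
    return lines

display = stem_and_leaf(scores)
print("\n".join(display))
```

The lengths of the rows show the shape of the distribution in the same way a histogram turned on its side would.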

Exception

In some data sets, it may be desirable to provide more classes or stems. One way to do this is to split each original stem in two. For example, divide stem 5 into two new stems, 5L and 5U: stem 5L has leaves 0, 1, 2, 3, and 4, and stem 5U has leaves 5, 6, 7, 8, and 9. This doubles the number of original stems. Since data come in many forms in practice, the stems and the leaf unit should be chosen to suit the data at hand.
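As an illustration of split stems, the sketch below applies the L/U rule to the aptitude scores of Example 2.2. The function `split_stem_and_leaf` and its label format are assumptions made for this sketch. Halves with no observations are simply omitted.

```python
scores = [112, 72, 69, 97, 107, 73, 92, 76, 86, 73, 126, 128, 118, 127, 124,
          82, 104, 132, 134, 83, 92, 108, 96, 100, 92, 115, 76, 91, 102, 81,
          95, 141, 81, 80, 106, 84, 119, 113, 98, 75, 68, 98, 115, 106, 95,
          100, 85, 94, 106, 119]

def split_stem_and_leaf(values):
    """Stem-and-leaf lines with each stem split into L (leaves 0-4) and U (leaves 5-9)."""
    rows = {}
    for v in sorted(values):                      # sorted input keeps rows in order
        stem, leaf = divmod(v, 10)
        label = f"{stem}{'L' if leaf <= 4 else 'U'}"
        rows.setdefault(label, []).append(leaf)
    return [f"{label:>4} | {''.join(map(str, leaves))}"
            for label, leaves in rows.items()]

display = split_stem_and_leaf(scores)
print("\n".join(display))
```

Compared with the nine stems of the original display, the split version spreads the same 50 values over more rows, which gives a finer view of the shape.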

Example 2.3: Construct a stem-and-leaf plot from the following data:

88.5, 98.8, 89.6, 92.2, 92.7, 88.4, 87.5, 90.9, 94.7, 88.3, 90.4, 83.4, 87.9, 92.6, 87.8, 89.9, 84.3, 90.4, 91.6, 91.0


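Display 1: values rounded to the nearest whole number, with each stem split into a lower half (leaves 0 to 4) and an upper half (leaves 5 to 9).

  The decimal point is 1 digit(s) to the right of the |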

  8 | 34
  8 | 888889
  9 | 0000112233
  9 | 59

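Display 2: unrounded values, with each stem covering two whole numbers (for example, stem 82 covers 82.0 to 83.9, so 83.4 appears as leaf 4). This is why the leaves within a row are not always in increasing order.

  The decimal point is at the |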

  82 | 4
  84 | 3
  86 | 589
  88 | 34569
  90 | 44906
  92 | 267
  94 | 7
  96 | 
  98 | 8

Example 2.4 (Another example): Construct a stem-and-leaf plot from the following data: 7, 8, 2, 1, 8, 3, 5, 7, 1, 2, 2, 5, 8, 5, 5, 7, 8, 7, 5, 3

Solution:


  The decimal point is at the |

  1 | 00
  2 | 000
  3 | 00
  4 | 
  5 | 00000
  6 | 
  7 | 0000
  8 | 0000

2.3 Exercises

2.1 A doctor’s office staff studied the waiting times for patients who arrive at the office with a request for emergency service. The following data with waiting times in minutes were collected over a one-month period.

2, 5, 10, 12, 4, 4, 5, 17, 11, 8, 9, 8, 12, 21, 6, 8, 7, 13, 18, 3

Use a class width of 5 in the following (start the first class at 0):

  1. Show the frequency distribution.
  2. Show the relative frequency distribution.
  3. Show the cumulative frequency distribution.
  4. Show the cumulative relative frequency distribution.
  5. What proportion of patients needing emergency service wait less than 10 minutes?

2.2 A shortage of candidates has required school districts to pay higher salaries and offer extras to attract and retain school district superintendents. The following data show the annual base salary ($1000s) for superintendents in 20 districts in the greater Rochester, New York, area (The Rochester Democrat and Chronicle, February 10, 2008).

187, 184, 174, 185, 175, 172, 202, 197, 165, 208, 215, 164, 162, 172, 182, 156, 172, 175, 170, 183

Use an appropriate number of classes and class width in the following.

  1. Show the frequency distribution.
  2. Show the percent frequency distribution.
  3. Show the cumulative percent frequency distribution.
  4. Develop a histogram for the annual base salary.
  5. Do the data appear to be skewed? Explain.
  6. Which salary range contains the highest percentage of superintendents?

2.3 NRF/BIG research provided results of a consumer holiday spending survey (USA Today, December 20, 2005). The following data provide the dollar amount of holiday spending for a sample of 25 consumers.

1200, 850, 740, 590, 340, 450, 890, 260, 610, 350, 1780, 180, 850, 2050, 770, 800, 1090, 510, 520, 220, 1450, 280, 1120, 200, 350

  1. What is the lowest holiday spending? The highest?
  2. Use a class width of $250 to prepare a frequency distribution and a percent frequency distribution for the data.
  3. Prepare a histogram and comment on the shape of the distribution.
  4. What observations can you make about holiday spending?

2.4 Construct a stem-and-leaf display for the following data.

70, 72, 75, 64, 58, 83, 80, 82, 76, 75, 68, 65, 57, 78, 85, 72

2.5 Construct a stem-and-leaf display for the following data.

11.3, 9.6, 10.4, 7.5, 8.3, 10.5, 10.0, 9.3, 8.1, 7.7, 7.5, 8.4, 6.3, 8.8

2.6 A psychologist developed a new test of adult intelligence. The test was administered to 20 individuals, and the following data were obtained.

114, 99, 131, 124, 117, 102, 106, 127, 119, 115, 98, 104, 144, 151, 132, 106, 125, 122, 118, 118

Construct a stem-and-leaf display for the data.

2.4 Line Chart

2.5 Scatter diagram

2.6 Case Study: Lifestyle Indicators and Preferences

You are given data from a cross-sectional survey of 30 individuals. The variables are Age, Gender, Monthly Income, Hours of Exercise/Week, Fitness Level, Favorite Fruit, and Temperature Preference. This data set enables practice with data visualization, scale classification, and relationship analysis using tools such as bar charts, scatter plots, and histograms.

ID  Age  Gender  Monthly Income (USD)  Hours of Exercise/Week  Favorite Fruit  Fitness Level  Temperature Preference (°C)
1   25   Male    1520.75               2.5                     Mango           Fair           23
2   32   Female  2280.50               1.2                     Apple           Poor           22
3   20   Male    925.10                4.8                     Banana          Good           19
4   29   Female  1980.90               3.3                     Orange          Fair           22
5   40   Male    3055.45               0.7                     Apple           Poor           21
6   23   Female  1425.00               5.9                     Mango           Excellent      20
7   37   Male    3625.80               2.1                     Banana          Fair           18
8   31   Female  2520.00               1.5                     Apple           Poor           27
9   27   Male    1685.25               3.7                     Orange          Good           22
10  22   Female  1190.00               6.2                     Mango           Excellent      19
11  34   Male    3180.45               0.0                     Banana          Poor           24
12  26   Female  1830.60               4.3                     Orange          Good           19
13  39   Male    3999.00               1.4                     Apple           Fair           22
14  24   Female  1335.75               5.6                     Banana          Excellent      26
15  28   Male    2125.35               2.2                     Mango           Fair           27
16  33   Female  2885.50               3.1                     Orange          Good           22
17  21   Male    960.00                6.0                     Apple           Excellent      18
18  30   Female  2190.00               1.7                     Banana          Fair           27
19  36   Male    3425.90               2.3                     Mango           Fair           27
20  29   Female  1920.40               4.5                     Orange          Good           20
21  35   Male    3095.25               3.0                     Apple           Good           23
22  22   Female  1125.00               5.1                     Banana          Excellent      27
23  38   Male    4180.60               1.8                     Mango           Fair           22
24  30   Female  2590.00               2.4                     Orange          Fair           20
25  41   Male    3540.85               0.9                     Apple           Poor           22
26  28   Female  1875.50               4.0                     Banana          Good           24
27  24   Male    1275.00               6.7                     Mango           Excellent      21
28  31   Female  2390.25               3.5                     Orange          Good           25
29  36   Male    2965.80               2.6                     Banana          Fair           22
30  21   Female  985.00                5.4                     Apple           Excellent      25

Tasks

  1. Identify the scale of measurement of each variable.

  2. Construct a suitable graph or chart for categorical variables such as Gender and Fitness Level.

  3. Construct a relative frequency histogram for quantitative variables such as Age and Monthly Income (USD). Describe what you learn from each.

  4. Plot Hours of Exercise/Week against Temperature Preference (°C). What do you conclude?

  5. Draw the scatter plot of Monthly Income against Hours of Exercise/Week. Is there any relationship?

  6. Cross-tabulation

     1. Make a cross-tabulation of Gender and Favorite Fruit. Show the result in a stacked bar chart.
     2. Make a cross-tabulation of Favorite Fruit and Fitness Level. Show the result in a stacked bar chart.