United States General Accounting Office

GAO Program Evaluation and Methodology Division

May 1992

Quantitative Data Analysis: An Introduction

Preface

GAO assists congressional decisionmakers in their deliberative process by furnishing analytical information on issues and options under consideration. Many diverse methodologies are needed to develop sound and timely answers to the questions that are posed by the Congress. To provide GAO evaluators with basic information about the more commonly used methodologies, GAO's policy guidance includes documents such as methodology transfer papers and technical guidelines.

This methodology transfer paper on quantitative data analysis deals with information expressed as numbers, as opposed to words, and is about statistical analysis in particular because most numerical analyses by GAO are of that form. The intended reader is the GAO generalist, not statisticians and other experts on evaluation design and methodology. The paper aims to bridge the communications gap between generalist and specialist, helping the generalist evaluator be a wiser consumer of technical advice and helping report reviewers be more sensitive to the potential for methodological errors. The intent is thus to provide a brief tour of the statistical terrain by introducing concepts and issues important to GAO's work, illustrating the use of a variety of statistical methods, discussing factors that influence the choice of methods, and offering some advice on how to avoid pitfalls in the analysis of quantitative data. Concepts are presented in a nontechnical way by avoiding computational procedures, except for a few illustrations, and by avoiding a rigorous discussion of assumptions that underlie statistical methods.

Quantitative Data Analysis is one of a series of papers issued by the Program Evaluation and Methodology Division (PEMD). The purpose of the series is to provide GAO evaluators with guides to various aspects of audit and evaluation methodology, to illustrate applications, and to indicate where more detailed information is available. We look forward to receiving comments from the readers of this paper. They should be addressed to Eleanor Chelimsky at 202-275-1854.

Werner Grosshans
Assistant Comptroller General
Office of Policy

Eleanor Chelimsky
Assistant Comptroller General for Program Evaluation and Methodology

Chapter 1 Introduction


Guiding Principles

Data analysis is more than number crunching. It is an activity that permeates all stages of a study. Concern with analysis should (1) begin during the design of a study, (2) continue as detailed plans are made to collect data in different forms, (3) become the focus of attention after data are collected, and (4) be completed only during the report writing and reviewing stages.\1

The basic thesis of this paper is that successful data analysis, whether quantitative or qualitative, requires (1) understanding a variety of data analysis methods; (2) planning data analysis early in a project and making revisions in the plan as the work develops; (3) understanding which methods will best answer the study questions posed, given the data that have been collected; and (4) once the analysis is finished, recognizing how weaknesses in the data or the analysis affect the conclusions that can properly be drawn. The study questions govern the overall analysis, of course. But the form and quality of the data determine what analyses can be performed and what can be inferred from them. This implies that the evaluator should think about data analysis at four junctures:

-------------------- \1 Relative to GAO job phases, the first two checkpoints occur during the job design phase, the third occurs during data collection and analysis, and the fourth during product preparation. For detail on job phases see the General Policy Manual, chapter 6, and the Project Manual, chapters 6.2, 6.3, and 6.4.


Designing the Study

As policy-relevant questions are being formulated, evaluators should decide what data will be needed to answer the questions and how they will analyze the data. In other words, they need to develop a data analysis plan. Determining the type and scope of data analysis is an integral part of an overall design for the study. (See the transfer paper entitled Designing Evaluations, listed in "Papers in This Series.") Moreover, confronting data collection and analysis issues at this stage may lead to a reformulation of the questions to ones that can be answered within the time and resources available.


Data Collection

When evaluators have advanced to the point of planning the details of data collection, analysis must be considered again. Observations can be made and, if they are qualitative (that is, text data), converted to numbers in a variety of ways that affect the kinds of analyses that can be performed and the interpretations that can be made of the results. Therefore, decisions about how to collect data should be influenced by the analysis options in mind.

 
Data Analysis

After the data are collected, evaluators need to see whether their expectations regarding data characteristics and quality have been met. Choice among possible analyses should be based partly on the nature of the data--for example, whether many observed values are small and a few are large and whether the data are complete. If the data do not fit the assumptions of the methods they had planned to use, the evaluators have to regroup and decide what to do with the data they have.\2 A different form of data analysis may be advisable, but if some observations are untrustworthy or missing altogether, additional data collection may be necessary.

As the evaluators proceed with data analysis, intermediate results should be monitored to avoid pitfalls that may invalidate the conclusions. This is not just verifying the completeness of the data and the accuracy of the calculations but maintaining the logic of the analysis. Yet it is more, because the avoidance of pitfalls is both a science and an art. Balancing the analytic alternatives calls for the exercise of considerable judgment. For example, when observations take on an unusual range of values, what methods should be used to describe the results? What if there are a few very large or small values in a set of data? Should we drop data at the extreme high and low ends of the scale? On what grounds?

-------------------- \2 An example would be a study in which the data analysis method evaluators planned to use required the assumption that observations be from a probability sample, as discussed in chapter 5. If the evaluators did not obtain observations for a portion of the intended sample, the assumption might not be warranted and their application of the method could be questioned.


Writing and Reviewing

Finally, as the evaluators interpret the results and write the report, they have to close the loop by making judgments about how well they have answered the questions, determining whether different or supplementary analyses are warranted, and deciding the form of any recommendations that may be suitable. They have to ask themselves questions about their data collection and analysis: How much of the variation in the data has been accounted for? Is the method of analysis sensitive enough to detect the effects of a program? Are the data "strong" enough to warrant a far-reaching recommendation? These questions and many others may occur to the evaluators and reviewers and good answers will come only if the analyst is "close" to the data but always with an eye on the overall study questions.


Quantitative Questions Addressed in the Chapters of This Paper

Most GAO statistical analyses address one or more of the four generic questions presented in table 1.1. Each generic question is illustrated with several specific questions and examples of the kinds of statistics that might be computed to answer the questions. The specific questions are loosely based on past GAO studies of state bottle bills (U.S. General Accounting Office, 1977 and 1980).

Table 1.1 Generic Types of Quantitative Questions

Generic question: What is a typical value of the variable?
Specific question: At the state level, how many pounds of soft drink bottles (per unit of population) were typically returned annually?
Useful statistics: Measures of central tendency (ch. 2)

Generic question: How much spread is there among cases?
Specific question: How similar are the individual states' return rates?
Useful statistics: Measures of spread (ch. 3)

Generic question: To what extent are two or more variables associated?
Specific question: What factors are most associated with high return rates: existence of state bottle bills? state economic conditions? state levels of environmental awareness?
Useful statistics: Measures of association (ch. 4)

Generic question: To what extent are there causal relationships among two or more variables?
Specific question: What factors cause high return rates: existence of state bottle bills? state economic conditions? state levels of environmental awareness?
Useful statistics: Measures of association (ch. 4); note that association is but one of the three conditions necessary to establish causation (ch. 6)

Bottle bills have been adopted by about nine states and are intended to reduce solid waste disposal problems by recycling. Other benefits can also be sought, such as the reduction of environmental litter and savings of energy and natural resources. One of GAO's studies was a prospective analysis, intended to inform discussion of a proposed national bottle bill. The quantitative analyses were not the only relevant factor. For example, the evaluators had to consider the interaction of the merchant-based bottle bill strategy with emerging state incentives for curbside pickups or with other recycling initiatives sponsored by local communities. The quantitative results were, however, relevant to the overall conclusions regarding the likely benefits of the proposed national bottle bill.

The first three generic questions in table 1.1 are standard fare for statistical analysis. GAO reports using quantitative analysis usually include answers in the form of descriptive statistics such as the mean, a measure of central tendency, and the standard deviation, a measure of spread. In chapters 2, 3, and 4 of this paper, we focus on descriptive statistics for answering the questions.

To answer many questions, it is desirable to use probability samples to draw conclusions about populations. In chapter 5, we address the first three questions from the perspective of inferential statistics. The treatment there is necessarily brief, focused on point and interval estimation methods.

The fourth generic question, about causality, is more difficult to answer than the others. Providing a good answer to a causal question depends heavily upon the study design and somewhat advanced statistical methods; we treat the topic only lightly in chapter 6. Chapter 7 discusses some broad strategies for avoiding pitfalls in the analysis of quantitative data.

Before describing these concepts, it is important to establish a common understanding about some ideas that are basic to data analysis, especially those applicable to the quantitative analysis we describe in this paper. Each of GAO's assignments requires considerable analysis of data. Over the years, many workable tools and methods have been developed and perfected. Trained evaluators use these tools as appropriate in addressing an assignment's objectives. This paper tries to reinforce the uses of these tools and put consistent labels on them.\3 It also gives helpful hints and illustrates the use of each tool. In the next section, we discuss the basic terminology that is used in later chapters.

-------------------- \3 Inconsistencies in the use of statistical terms can cause problems. We have tried to deal with the difficulty in three ways: (1) by using the language of current writers in the field, (2) by noting instances where there are common alternatives to key terms, and (3) by including a glossary of the terms used in this paper.

Attributes, Variables, and Cases

Observations about persons, things, and events are central to answering questions about government programs and policies. Groups of observations are called data, which may be qualitative or quantitative. Statistical analysis is the manipulation, summarization, and interpretation of quantitative data.

We observe characteristics of the entities we are studying. For example, we observe that a person is female and we refer to that characteristic as an attribute of the person. A logical collection of attributes is called a variable; in this instance, the variable would be gender and would be composed of the attributes female and male.\4 Age might be another variable composed of the integer values from 0 to 115.

It is convenient to refer to the variables we are especially interested in as response variables. For example, in a study of the effects of a government retraining program for displaced workers, employment rate might be the response variable. In trying to determine the need for an acquired immune deficiency syndrome (AIDS) education program in different segments of the U.S. population, evaluators might use the incidence of AIDS as the response variable. We usually also collect information on other variables with which we hope to better understand the response variables. We occasionally refer to these other variables as supplementary variables.

The data that we want to analyze can be displayed in a rectangular or matrix form, often called a data sheet (see table 1.2). To simplify matters, the individual persons, things, or events that we get information about are referred to generically as cases. (The intensive study of one or a few cases, typically combining quantitative and qualitative data, is referred to as case study research. See the GAO transfer paper entitled Case Study Evaluations.) Traditionally, the rows in a data sheet correspond to the cases and the columns correspond to the variables of interest. The numbers or words in the cells then correspond to the attributes of the cases.

Table 1.2 Data Sheet for a Study of College Student Loan Balances

Case   Age   Class       Type of institution   Loan balance
   1    23   Sophomore   Private                     $3,254
   2    19   Freshman    Public                       1,500
   3    21   Junior      Public                       2,361
   4    30   Graduate    Private                      8,100
   5    21   Freshman    Private                      1,970
   6    22   Sophomore   Public                       3,866
   7    21   Sophomore   Public                       2,890
   8    20   Freshman    Public                       6,300
   9    22   Junior      Private                      2,639
  10    21   Sophomore   Public                       1,718
  11    19   Freshman    Private                      2,690
  12    20   Sophomore   Public                       3,812
  13    20   Sophomore   Public                       2,210
  14    23   Senior      Private                      3,780
  15    24   Senior      Private                      5,082

Table 1.2 shows 15 cases, college students, from a hypothetical study of student loan balances at higher education institutions. The first column shows an identification number for each case, and the rest of the columns indicate four variables: age of student, class, type of institution, and loan balance. Two of the variables, class and type of institution, are presently in text form. As will be seen shortly, they can be converted to numbers for purposes of quantitative analysis. Loan balance is the response variable and the others are supplementary.
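As a rough illustration of the rows-as-cases, columns-as-variables layout, the data sheet above might be held in a program as follows. This is only a sketch: the variable names (case, age, class, institution, balance) are illustrative labels chosen here, not terms from the paper.

```python
# The 15 cases from the data sheet: one tuple per case (row).
# Column names below are illustrative assumptions, not GAO terminology.
# Case 2's balance is entered as 1,500, the value the text uses in its
# histogram and range examples.
rows = [
    (1, 23, "Sophomore", "Private", 3254),
    (2, 19, "Freshman", "Public", 1500),
    (3, 21, "Junior", "Public", 2361),
    (4, 30, "Graduate", "Private", 8100),
    (5, 21, "Freshman", "Private", 1970),
    (6, 22, "Sophomore", "Public", 3866),
    (7, 21, "Sophomore", "Public", 2890),
    (8, 20, "Freshman", "Public", 6300),
    (9, 22, "Junior", "Private", 2639),
    (10, 21, "Sophomore", "Public", 1718),
    (11, 19, "Freshman", "Private", 2690),
    (12, 20, "Sophomore", "Public", 3812),
    (13, 20, "Sophomore", "Public", 2210),
    (14, 23, "Senior", "Private", 3780),
    (15, 24, "Senior", "Private", 5082),
]
columns = ("case", "age", "class", "institution", "balance")

# Each case becomes a mapping from variable name to attribute.
data_sheet = [dict(zip(columns, row)) for row in rows]

# Variables still in text form must be coded as numbers before
# quantitative analysis.
text_variables = [
    name for name, value in data_sheet[0].items() if isinstance(value, str)
]
print(len(data_sheet), "cases; text variables:", text_variables)
```

The two variables flagged as text are the ones the paragraph above says must be converted to numbers.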

The choice of a data analysis method is affected by several considerations, especially the level of measurement for the variables to be studied; the unit of analysis; the shape of the distribution of a variable, including the presence of outliers (extreme values); the study design used to produce the data from populations, probability samples, or batches; and the completeness of the data. Each factor is considered briefly.

-------------------- \4 Instead of referring to the attributes of a variable, some prefer to say that the variable takes on a number of "values." For example, the variable gender can have two values, male and female. Also, some statisticians use the expression "attribute sampling" in reference to probability sampling procedures for estimating proportions. Although attribute sampling is related to attribute as used in data analysis, the terminology is not perfectly parallel. See the discussion of attribute sampling in the transfer paper entitled Using Statistical Sampling, listed in "Papers in This Series."


Level of Measurement

Quantitative variables take several forms, frequently called levels of measurement, which affect the type of data analysis that is appropriate. Although the terminology used by different analysts is not uniform, one common way to classify a quantitative variable is according to whether it is nominal, ordinal, interval, or ratio.

The attributes of a nominal variable have no inherent order. For example, gender is a nominal variable in that being male is neither better nor worse than being female. Persons, things, and events characterized by a nominal variable are not ranked or ordered by the variable. For purposes of data analysis, we can assign numbers to the attributes of a nominal variable but must remember that the numbers are just labels and must not be interpreted as conveying the order of the attributes. In the study of student loans, the type of institution is a nominal variable with two attributes--private and public--to which we might assign the numbers 0 and 1 or, if we wish, 12 and 17. For most purposes, 0 and 1 would be more useful.\5
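One reason 0 and 1 tend to be the more useful labels can be sketched in a few lines. With 0/1 coding the average of the codes equals the proportion of cases coded 1, whereas the average of arbitrary labels such as 12 and 17 has no direct interpretation. The list below is the type-of-institution column from the student loan data sheet; the names used in the code are illustrative.

```python
# Type of institution for cases 1-15 of the student loan data sheet.
institution = [
    "Private", "Public", "Public", "Private", "Private",
    "Public", "Public", "Public", "Private", "Public",
    "Private", "Public", "Public", "Private", "Private",
]

# Two labelings of the same nominal variable; the numbers are only labels.
codes_01 = {"Private": 0, "Public": 1}
codes_arbitrary = {"Private": 12, "Public": 17}

coded_01 = [codes_01[x] for x in institution]
coded_arbitrary = [codes_arbitrary[x] for x in institution]

# With 0/1 coding the mean is the share of cases coded 1 (public).
share_public = sum(coded_01) / len(coded_01)

# With 12/17 coding the mean is a number with no direct meaning.
mean_arbitrary = sum(coded_arbitrary) / len(coded_arbitrary)

print(f"Public institutions: {sum(coded_01)} of {len(coded_01)}")
print(f"Mean of 0/1 codes:   {share_public:.3f}")
print(f"Mean of 12/17 codes: {mean_arbitrary:.3f}")
```

Neither coding implies that public ranks above private; the 0/1 labels are simply more convenient arithmetic.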

With an ordinal variable, the attributes are ordered. For example, observations about attitudes are often arrayed into five classifications, such as greatly dislike, moderately dislike, indifferent to, moderately like, greatly like. Participants in a government program might be asked to categorize their views of the program offerings in this way. Although the ordinal level of measurement yields a ranking of attributes, no assumptions are made about the "distance" between the classifications. In this example, we do not assume that the difference between persons who greatly like a program offering and ones who moderately like it is the same as the difference between persons who moderately like the offering and ones who are indifferent to it. For data analysis, numbers are assigned to the attributes (for example, greatly dislike = -2, moderately dislike = -1, indifferent to = 0, moderately like = +1, and greatly like = +2), but the numbers are understood to indicate rank order and the "distance" between the numbers has no meaning. Any other assignment of numbers that preserves the rank order of the attributes would serve as well. In the student loan study, class is an ordinal variable.
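The point that any order-preserving assignment of numbers serves equally well can be checked with a small sketch. It assumes the ordering Freshman < Sophomore < Junior < Senior < Graduate for the class variable, which the paper implies but does not spell out, and the second set of codes is deliberately arbitrary.

```python
# Class for cases 1-15 of the student loan data sheet.
student_class = [
    "Sophomore", "Freshman", "Junior", "Graduate", "Freshman",
    "Sophomore", "Sophomore", "Freshman", "Junior", "Sophomore",
    "Freshman", "Sophomore", "Sophomore", "Senior", "Senior",
]

# Assumed rank order of the attributes (an assumption of this sketch).
order = ["Freshman", "Sophomore", "Junior", "Senior", "Graduate"]

# Two codings that both preserve the rank order.
coding_a = {name: rank for rank, name in enumerate(order, start=1)}
coding_b = dict(zip(order, [10, 20, 35, 70, 400]))


def median_class(coding):
    """Return the class label of the middle case (odd number of cases)."""
    codes = sorted(coding[c] for c in student_class)
    middle_code = codes[len(codes) // 2]
    label_for = {code: name for name, code in coding.items()}
    return label_for[middle_code]


def mean_code(coding):
    """Average of the numeric codes; depends on the arbitrary spacing."""
    return sum(coding[c] for c in student_class) / len(student_class)


print("Median class:", median_class(coding_a), "/", median_class(coding_b))
print("Mean code:   ", round(mean_code(coding_a), 2), "/", round(mean_code(coding_b), 2))
```

A rank-based summary such as the median category is the same under both codings, while the mean of the codes changes with the spacing, which is why the "distance" between ordinal codes should not be interpreted.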

The attributes of an interval variable are assumed to be equally spaced. For example, temperature on the Fahrenheit scale is an interval variable. The difference between a temperature of 45 degrees and 46 degrees is taken to be the same as the difference between 90 degrees and 91 degrees. However, it is not assumed that a 90-degree object has twice the temperature of a 45-degree object (meaning that the ratio of temperatures is not necessarily 2 to 1). The condition that makes the ratio of two observations uninterpretable is the absence of a true zero for the variable. In general, with variables measured at the interval level, it makes no sense to try to interpret the ratio of two observations.
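A short worked example may help show why ratios are uninterpretable without a true zero. The sketch converts the same two temperatures to the Celsius scale (not mentioned in the paper, used here only for comparison): the ratio changes with the scale, while equal differences stay equal.

```python
def fahrenheit_to_celsius(f):
    """Standard conversion between the two interval scales."""
    return (f - 32) * 5 / 9


# Ratio of the same two temperatures on each scale.
ratio_f = 90 / 45
ratio_c = fahrenheit_to_celsius(90) / fahrenheit_to_celsius(45)

# One-degree Fahrenheit steps at two different points on the scale.
step_low = fahrenheit_to_celsius(46) - fahrenheit_to_celsius(45)
step_high = fahrenheit_to_celsius(91) - fahrenheit_to_celsius(90)

print(f"Ratio in Fahrenheit: {ratio_f:.2f}")
print(f"Ratio in Celsius:    {ratio_c:.2f}")
print(f"Step sizes in Celsius: {step_low:.4f} and {step_high:.4f}")
```

Because the "2 to 1" ratio disappears as soon as the zero point moves, it reflects the choice of scale rather than a property of the objects.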

The attributes of a ratio variable are assumed to have equal intervals and a true zero point. For example, age is a ratio variable because the negative age of a person or object is not meaningful and, thus, the birth of the person or the creation of the object is a true zero point. With ratio variables, it makes sense to form ratios of observations and it is thus meaningful, for example, to say that a person of 90 years is twice as old as one of 45. In the study of student loans, age and loan balance are both ratio variables (the attributes are equally spaced and the variables have true zero points). For analysis purposes, it is seldom necessary to distinguish between interval and ratio variables so we usually lump them together and call them interval-ratio variables.

-------------------- \5 A variable for which the attributes are assigned arbitrary numerical values is usually called a "dummy variable." Dummy variables occur frequently in evaluation studies.

Unit of Analysis

Units of analysis are the persons, things, or events under study--the entities that we want to say something about. Frequently, the appropriate units of analysis are easy to select. They follow from the purpose of the study. For example, if we want to know how people feel about the offerings of a government program, individual people would be the logical unit of analysis. In the statistical analysis, the set of data to be manipulated would be variables defined at the level of the individual.

However, in some studies, variables can potentially be analyzed at two or more levels of aggregation. Suppose, for example, that evaluators wished to evaluate a compensatory reading program and had acquired reading test scores on a large number of children, some who participated in the program and some who did not. One way to analyze the data would be to treat each child as a case.

But another possibility would be to aggregate the scores of the individual children to the classroom level. For example, they could compute the average scores for the children in each classroom that participated in their study. They could then treat each classroom as a unit, and an average reading test score would be an attribute of a classroom. Other variables, such as teacher's years of experience, number of students, and hours of instruction could be defined at the classroom level. The data analysis would proceed by using classrooms as the unit of analysis. For some issues, treating each child as a unit might seem more appropriate, while in others each classroom might seem a better choice. And we can imagine rationales for aggregating to the school, school district, and even state level.

Summarizing, the unit of analysis is the level at which analysis is conducted. We have, in this example, five possible units of analysis: child, classroom, school, school district, and state. We can move up the ladder of aggregation by computing average reading scores across lower-level units. In effect, the definition of the variable changes as we change the unit of analysis. The lowest-level variable might be called child-reading-score, the next could be classroom-average-reading-score, and so on.
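Moving one step up the ladder of aggregation can be sketched as follows. The classroom names and reading scores are invented purely for illustration; they do not come from any GAO study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child-level data: (classroom, reading score).
child_scores = [
    ("Room A", 70), ("Room A", 80), ("Room A", 90),
    ("Room B", 60), ("Room B", 64),
    ("Room C", 85),
]

# Group the child-reading-score values by classroom.
scores_by_room = defaultdict(list)
for room, score in child_scores:
    scores_by_room[room].append(score)

# classroom-average-reading-score: one value per classroom.
classroom_avg = {room: mean(scores) for room, scores in scores_by_room.items()}

# The same question answered at two units of analysis.
child_level_mean = mean([score for _, score in child_scores])
classroom_level_mean = mean(list(classroom_avg.values()))

print("Classroom averages:", classroom_avg)
print(f"Mean with child as unit:     {child_level_mean:.2f}")
print(f"Mean with classroom as unit: {classroom_level_mean:.2f}")
```

The two overall means differ because classrooms of unequal size count equally once the classroom becomes the unit of analysis.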

In general, the results from an analysis will vary, depending upon the unit of analysis. Thus, for studies in which aggregation is a possibility, evaluators must answer the question: What is the appropriate unit of analysis? Several situation-specific factors may need consideration, and there may not be a clear-cut answer. Sometimes analyses are carried out with several units of analysis. (GAO evaluators should seek advice from technical assistance groups.)

Distribution of a Variable

The cases we observe vary in the characteristics of interest to us. For example, students vary by class and by loan balance. Such variation across cases, which is called the distribution of a variable, is the focus of attention in a statistical analysis. Among the several ways to picture or describe a distribution, the histogram is probably the simplest. To illustrate, suppose we want to display the distribution of the loan balance variable for the 15 cases in table 1.2. A histogram for the data is shown in figure 1.1. The length of the lefthand bar corresponds to the number of observations between $1,000 and $1,999. There are three: $1,500, $1,970, and $1,718. The lengths of the other bars are determined in a similar fashion, and the overall histogram gives a picture of the distribution. In this example, the distribution is rather "piled up" on one end and spread out at the other; two intervals have no observations.
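The bar lengths described above can be reproduced by counting the observations that fall in each $1,000 interval. The following is a minimal sketch that prints a text histogram rather than a chart; the variable names are illustrative.

```python
from collections import Counter

# Loan balances for the 15 cases (case 2 entered as 1,500, as in the text).
balances = [3254, 1500, 2361, 8100, 1970, 3866, 2890, 6300,
            2639, 1718, 2690, 3812, 2210, 3780, 5082]

bin_width = 1000

# Map each balance to the lower bound of its interval, then count.
counts = Counter((b // bin_width) * bin_width for b in balances)

# One row per interval from $1,000-$1,999 through $8,000-$8,999.
for lower in range(1000, 9000, bin_width):
    n = counts[lower]
    print(f"${lower:,}-${lower + bin_width - 1:,}  {'#' * n}")

empty_intervals = [lower for lower in range(1000, 9000, bin_width)
                   if counts[lower] == 0]
```

The counts are concentrated in the three lowest intervals, and the two empty intervals are the ones beginning at $4,000 and $7,000.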

Figure 1.1: Histogram of Loan Balances


Histograms show the shape of a distribution, a factor that helps determine the type of data analysis that will be appropriate. For example, some techniques are suitable only when the distribution is approximately symmetrical (as in figure 1.2a), while others can be used when the observations are asymmetrical (figure 1.2b).

Figure 1.2: Two Distributions

Once data are collected for a study, we need to inspect the distributions of the variables to see what initial steps are appropriate for the data analysis. Sometimes it is advisable to transform a variable (that is, systematically change the values of the observations) that is distributed asymmetrically to one that is symmetric. For example, taking the square root of each observation is a transformation that will sometimes work. Velleman and Hoaglin (1981, ch. 2) provide a good introduction to transformation strategies (they refer to them as "re-expression") and Hoaglin, Mosteller, and Tukey (1983, ch. 4) give a more complete treatment. (GAO generalists who believe that such a strategy is in order are advised to seek help from a technical assistance group.) With proper care, transformations do not alter the conclusions that can be drawn from data.
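To give a feel for what a square-root transformation does, the sketch below applies it to the 15 loan balances and compares a rough asymmetry index before and after. The index used, (mean - median) / standard deviation, is an illustrative choice made here and is not a measure the paper prescribes.

```python
import math
from statistics import mean, median, pstdev

# Loan balances for the 15 cases (case 2 entered as 1,500, as in the text).
balances = [3254, 1500, 2361, 8100, 1970, 3866, 2890, 6300,
            2639, 1718, 2690, 3812, 2210, 3780, 5082]


def asymmetry(values):
    """Rough index: how far the mean sits above the median, in SD units."""
    return (mean(values) - median(values)) / pstdev(values)


# Re-express each observation as its square root.
sqrt_balances = [math.sqrt(b) for b in balances]

raw_index = asymmetry(balances)
sqrt_index = asymmetry(sqrt_balances)

print(f"Asymmetry index, original values: {raw_index:.3f}")
print(f"Asymmetry index, square roots:    {sqrt_index:.3f}")
```

For these data the index shrinks but does not reach zero, which is consistent with the caution that the transformation only sometimes works.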

Another aspect of a distribution is the possible presence of outliers, a few observations that have extremely large or small values so that they lie on the outer reaches of the distribution. For the student loan observations, case number 4, which has a value of $8,100, is far from the center of the distribution. Outliers can be important because they may lead to new understanding of the variable in question. However, outliers attributable to measurement error may produce misleading results with some statistical analyses, so an early decision must be made about how to handle outliers--a decision not easy to make. The usual way is to employ analytical methods that are relatively insensitive to outliers--for example, by using the median instead of the mean. Sometimes outliers are dropped from the analysis but only if there is good reason to believe that the observations are in error.
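The relative insensitivity of the median can be seen by computing both statistics with and without case number 4. The sketch removes the $8,100 observation only to show sensitivity; it is not a suggestion that the case should be dropped.

```python
from statistics import mean, median

# Loan balances for the 15 cases (case 2 entered as 1,500, as in the text).
balances = [3254, 1500, 2361, 8100, 1970, 3866, 2890, 6300,
            2639, 1718, 2690, 3812, 2210, 3780, 5082]

# Same batch with the $8,100 observation (case 4) set aside.
without_outlier = [b for b in balances if b != 8100]

mean_all, median_all = mean(balances), median(balances)
mean_trim, median_trim = mean(without_outlier), median(without_outlier)

mean_shift = abs(mean_all - mean_trim)
median_shift = abs(median_all - median_trim)

print(f"Mean:   {mean_all:,.2f} -> {mean_trim:,.2f} (shift {mean_shift:,.2f})")
print(f"Median: {median_all:,.2f} -> {median_trim:,.2f} (shift {median_shift:,.2f})")
```

The single extreme value moves the mean by roughly three times as much as it moves the median.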

Considerations about the shape of a distribution and about outliers apply to ordinal, interval, and ratio variables. Because the attributes of a nominal variable have no inherent order, these spatial relationships have no meaning. However, we can still display the results from observations on a nominal variable as a histogram, as long as we remember that the order of the attributes is arbitrary. Figure 1.3 shows hypothetical data on the number of participants in four government programs. There is no inherent order for displaying the programs.

Figure 1.3: Histogram for a Nominal Variable


Another way of showing the distribution of a variable is to use a simple table. Suppose evaluators have data on 341 homeowners' attitudes toward energy conservation with three categories of response: indifferent, somewhat positive, and positive. Table 1.3 shows the data in summary form. This kind of display is not often used when only one variable is involved, but with two it is common (see chapter 4).

Table 1.3 Tabular Display of a Distribution

Attitude toward energy conservation   Number of homeowners
Indifferent                                            120
Somewhat positive                                      115
Positive                                               106
Total                                                  341
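A tabular distribution of this kind is often easier to read when the counts are also expressed as percentages of the total. The following sketch does that for the homeowner counts; the percentage column is an addition made here for illustration and does not appear in the table.

```python
# Counts from the tabular display of homeowner attitudes.
counts = {
    "Indifferent": 120,
    "Somewhat positive": 115,
    "Positive": 106,
}

total = sum(counts.values())

# Percentage of all homeowners falling in each category.
percentages = {label: 100 * n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<20}{n:>5}{percentages[label]:>8.1f}%")
print(f"{'Total':<20}{total:>5}")
```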

Populations, Probability Samples, and Batches

Statistical analysis is applied to a group of cases. The process by which the group was chosen (that is, the study design) affects the type of data analysis that is appropriate and the interpretations that may be drawn from the analysis. Three types of group are of interest: populations, probability samples, and batches.

A population is the full set of cases that the evaluators have a question about. For example, suppose they want to know the age of Medicaid participants and the amount of benefits these participants received last year. The population would be all persons who received such benefits, and the evaluators might obtain data tapes containing the attributes for all such persons. They could perform statistical analyses to describe the distributions of certain variables such as age and amount of benefits received. The results of such an analysis are called descriptive statistics.

A second way to draw conclusions about the Medicaid participants is to use a probability sample from the population of beneficiaries. A probability sample is a group of cases selected so that each member of the population has a known, nonzero probability of being selected. (For detailed information on probability sampling, see the transfer paper entitled Using Statistical Sampling.) Studies based on probability samples are usually less expensive than those that use data from the entire population and, under some conditions, are less error-prone.\6 The sample itself can be described with descriptive statistics, but drawing conclusions about the population from which the sample was selected requires inferential statistics (discussed in chapter 5).
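The simplest kind of probability sample, a simple random sample, gives every member of the population the same known selection probability. The sketch below draws one from a hypothetical list of 1,000 beneficiary identifiers; the population size, sample size, and seed are all assumptions made for illustration. Other probability designs allow unequal selection probabilities, provided they are known and nonzero.

```python
import random

# Hypothetical population: identifiers for 1,000 beneficiaries.
population = list(range(1, 1001))
sample_size = 50

# A seeded generator so that rerunning the sketch gives the same sample.
rng = random.Random(1992)

# Simple random sample drawn without replacement.
sample = rng.sample(population, sample_size)

# Under simple random sampling every member has the same known,
# nonzero chance of selection.
selection_probability = sample_size / len(population)

print(f"Selected {len(sample)} of {len(population)} cases")
print(f"Selection probability for each member: {selection_probability}")
```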

A group of cases can also be treated as a batch, a group produced by a process about which we make no probabilistic assumptions. For example, the evaluators might use their judgment, not probability, to select a number of interesting Medicaid cases for study. Being neither a population nor a probability sample, the set of cases is treated as a batch. As such, the techniques of descriptive statistics can be applied but not those of inferential statistics. Thus, conclusions about the population of which the batch is a part cannot be based on statistical rules of inference.

When do we regard a group of cases as a batch? Evaluators who have purposely chosen a nonprobability sample, or who have doubts about whether cases in hand fit the definition of a probability sample--for example, because they are using someone else's data and the selection procedures were not well described--should treat the cases as a batch. Actually, any group of cases can be regarded as a batch. The term is applied whenever we do not wish to assume the grouping is a population or a probability sample.

-------------------- \6 Error in using probability samples to answer questions about populations stems from the net effects of both measurement error and sampling error. Conclusions based upon data from the entire population are subject only to measurement error. The total error associated with data from a probability sample may be less than the total error (measurement only) of data from a population.


Completeness of Data

When we design a study, we plan to obtain data for a specific number of cases. Despite our best plans, we usually cannot obtain data on all variables for all cases. For example, in a sample survey, some persons may decline to respond at all and others may not answer certain questionnaire items. Or responses to some interview questions may be inadvertently "lost" during data editing and processing. In another study, we may not be allowed to observe certain events. Almost inevitably, the data will be incomplete in several respects, and data analysis must contend with that eventuality.

Incompleteness in the data can affect analysis in a variety of ways. The classic example is when we draw a probability sample with the aim of using inferential statistics to answer questions about a population. To illustrate, suppose evaluators send a questionnaire to a sample of Medicaid beneficiaries but only 45 percent provide data. Without increasing the response rate or satisfying themselves that nonrespondents would have answered in ways similar to respondents (or that the differences would have been inconsequential), the evaluators would not be entitled to draw inferential conclusions about the population of Medicaid beneficiaries. If they knew the views of the nonrespondents, their overall description of the population might be quite different. They would be limited, therefore, to descriptive statistics about the 45 percent who responded, and that information might not be useful for answering a policy-relevant question.
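How different the overall description might be can be bounded with simple arithmetic. The sketch below keeps the 45 percent response rate from the example and assumes, purely hypothetically, that 60 percent of respondents answered "yes" to some question; it then works out the lowest and highest sample-wide proportions consistent with what the nonrespondents might have said.

```python
# Response rate from the example in the text.
response_rate = 0.45

# Hypothetical share of respondents answering "yes" (an assumption).
yes_among_respondents = 0.60

# "Yes" answers actually observed, as a share of the full sample.
observed_yes = response_rate * yes_among_respondents

# Extreme cases: every nonrespondent would have said "no" or "yes".
lowest_possible = observed_yes
highest_possible = observed_yes + (1 - response_rate)

print(f"Share of 'yes' among respondents: {yes_among_respondents:.0%}")
print(f"Possible share in the full sample: "
      f"{lowest_possible:.0%} to {highest_possible:.0%}")
```

With more than half the sample silent, the respondents' 60 percent is compatible with anything from about a quarter to about four-fifths of the full sample, which is why the inferential conclusion is not warranted without further evidence about nonrespondents.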

The problem of incomplete data entails several considerations and a variety of analytic approaches. (See, for example, Groves, 1989; Madow, Olkin, and Rubin, 1983; and Little and Rubin, 1987.) One important strategy is to minimize the problems by using good data collection techniques. (See the transfer papers entitled Using Structured Interviewing Techniques and Developing and Using Questionnaires.)

 
Statistics

In GAO work, we may be interested in analyzing data from a population, a probability sample, or a batch. Regardless of how the group of cases is selected, we make observations on the cases and can produce a data sheet like that of table 1.1. A main purpose of statistical analysis is to draw conclusions about the real world by computing useful statistics.\7 A statistic is a number computed from a set of data. For example, the midpoint loan balance for the 15 students, $2,890, is a statistic-- the median loan balance for the batch in statistical terminology.

Many statistics are possible but only a relative few are useful in the sense of helping us understand the data and answer policy-relevant questions. Another possibly useful statistic from the batch of 15 is the range--the difference between the maximum loan balance and the minimum. The range, in this example, is 8,100 - 1,500 = 6,600. In this instance, the "computation" of the statistic is merely a sorting through the attributes for the loan balance variable to find the largest and smallest values and then computing the difference between them. Many statistics can be imagined but most would not be useful in describing the batch. For example, the square root of the difference between the maximum loan balance and the mean loan balance is a statistic but not a useful one.

The methods of statistical analysis provide us with ways to compute and interpret useful statistics. Those that are useful for describing a population or a batch are called descriptive statistics. They are used to describe a set of cases upon which observations were made. Methods that are useful for drawing inferences about a population from a probability sample are called inferential statistics. They are used to describe a population using merely information from observations on a probability sample of cases from the population. Thus, the same statistic can be descriptive or inferential or both, depending on its use.

-------------------- \7 Another purpose, though one that has received less attention in the statistical literature, is to devise useful ways to graphically depict the data. See, for example, Du Toit, Steyn, and Stumpf, 1986; and Tufte, 1983.

 

Chapter 2 Determining the Central Tendency of a Distribution

Descriptive analyses are the workhorses of GAO, carrying much of the message in many of our reports. There are three main forms of descriptive analysis: determining the central tendency in the distribution of a variable (discussed in this chapter), determining the spread of a distribution (chapter 3), and determining the association among variables (chapter 4).

The determination of central tendency answers the first of GAO's four basic questions, What is a typical value of the variable? All readers are familiar with the basic ideas. Sample questions might be

How satisfied are Social Security beneficiaries with the agency's responsiveness?
How much time is required to fill requests for fighter plane repair parts?
What was the dollar value in agricultural subsidies received by wealthy farmers?
What was the turnover rate among personnel in long-term care facilities?

The common theme of these questions is the need to express what is typical of a group of cases. For example, in the last question, the response variable is the turnover rate. Suppose evaluators have collected information on the turnover rates for 800 long-term care facilities. Assuming there is variation among the facilities, they would have a distribution for the turnover rate variable. There are two approaches for describing the central tendency of a distribution: (1) presenting the data on turnover rates in tables or figures and (2) finding a single number, a descriptive statistic, that best summarizes the distribution of turnover rates.

The first approach, shown in table 2.1, allows us to "see" the distribution. The trouble is that it may be hard to grasp what the typical value is. However, evaluators should always take a graphic or tabular approach as a first step to help in deciding how to proceed on the second approach, choosing a single statistic to represent the batch. How a display of the distribution can help will be seen shortly.

Table 2.1 Distribution of Staff Turnover Rates in Long-Term Care Facilities

Turnover rates                  Frequency count
(percent new staff per year)    (number of long-term care facilities)
0-0.9                           155
1.0-1.9                         100
2.0-2.9                         125
3.0-3.9                         150
4.0-4.9                         100
5.0-5.9                         75
6.0-6.9                         50
7.0-7.9                         25
8.0-8.9                         15
9.0-9.9                         5
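
A tabular distribution such as table 2.1 can also be turned into a rough picture with very little effort. The following Python sketch is illustrative only: it copies the frequency counts from table 2.1, prints a crude text histogram, and finds the class interval that contains the middle case. The names `turnover_counts`, `text_histogram`, and `median_class` are our own and are not part of any GAO procedure.

```python
# Frequency counts copied from table 2.1
# (turnover-rate class -> number of long-term care facilities).
turnover_counts = {
    "0-0.9": 155, "1.0-1.9": 100, "2.0-2.9": 125, "3.0-3.9": 150,
    "4.0-4.9": 100, "5.0-5.9": 75, "6.0-6.9": 50, "7.0-7.9": 25,
    "8.0-8.9": 15, "9.0-9.9": 5,
}


def text_histogram(counts, cases_per_mark=5):
    """Return one line per class, with one '*' per `cases_per_mark` cases."""
    return [
        f"{label:>8} | {'*' * (n // cases_per_mark)} {n}"
        for label, n in counts.items()
    ]


def median_class(counts):
    """Return the class that contains the middle case of the batch.

    With an even number of cases this is the class of the upper of the
    two middle cases.
    """
    total = sum(counts.values())
    running = 0
    for label, n in counts.items():
        running += n
        if running >= (total + 1) / 2:
            return label
    raise ValueError("empty distribution")


total_facilities = sum(turnover_counts.values())
print("\n".join(text_histogram(turnover_counts)))
print("Total facilities:", total_facilities)
print("Class containing the median:", median_class(turnover_counts))
```

Run on the counts above, the display shows the largest count in the lowest class (155 facilities) and a second peak of 150 in the 3.0-3.9 class, which is the kind of shape information that a single number cannot convey.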

The second approach, describing the typical value of a variable with a single number, offers several possibilities. But before considering them, a little discussion of terminology is necessary. A descriptive statistic is a number, computed from observations of a batch, that in some way describes the group of cases. The definition of a particular descriptive statistic is specific, sometimes given as a recipe for calculation. Measures of central tendency form a class of descriptive statistics each member of which characterizes, in some sense, the typical value of a variable--the central location of a distribution.\1 The definition of central tendency is necessarily somewhat vague because it embraces a variety of computational procedures that frequently produce different numerical values. Nonetheless, the purpose of each measure would be to compress information about a whole distribution of cases into a single number.

-------------------- \1 Measures of central tendency also go by other, equivalent names such as "center indicators" and "location indicators."


Measures of the Central Tendency of a Distribution

Three familiar and commonly used measures of central tendency are summarized in table 2.2. The mean, or arithmetic average, is calculated by summing the observations and dividing the sum by the number of observations. It is ordinarily used as a measure of central tendency only with interval-ratio level data. However, the mean may not be a good choice if several cases are outliers or if the distribution is notably asymmetric. The reason is that the mean is strongly influenced by the presence of a few extreme values, which may give a distorted view of central tendency. Despite such limitations, the mean has definite advantages in inferential statistics (see chapter 5).

   

Table 2.2

                     Use of measure\a

Measurement level    Mode    Median    Mean
Nominal              Yes     No        No
Ordinal              Yes     Yes       No\b
Interval-ratio       Yes     Yes       Yes\c

\a "Yes" means the indicator is suitable for the measurement level shown.
\b May be OK in some circumstances. See chapter 7.
\c May be misleading when the distribution is asymmetric or has a few outliers.

The median--calculated by determining the midpoint of rank-ordered cases--can be used with ordinal, interval, or ratio measurements and no assumptions need be made about the shape of the distribution.\2 The median has another attractive feature: it is a resistant measure. That means it is not much affected by changes in a few cases. Intuitively, this suggests that significant errors of observation in several cases will not greatly distort the results. Because it is a resistant measure, outliers have less influence on the median than on the mean. For example, notice that the observations 1,4,4,5,7,7,8,8,9 have the same median (7) as the observations 1,4,4,5,7,7,8,8,542. The means (5.89 and 65.11, respectively), however, are quite different because of the outlier, 542, in the second set of observations.
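
The resistance of the median can be checked directly. The short Python sketch below uses the two sets of observations given above and the standard library's `statistics` module; the variable names are ours.

```python
from statistics import mean, median

batch_1 = [1, 4, 4, 5, 7, 7, 8, 8, 9]
batch_2 = [1, 4, 4, 5, 7, 7, 8, 8, 542]  # same cases, except the last is an outlier

for name, batch in (("first", batch_1), ("second", batch_2)):
    print(f"{name}: median = {median(batch)}, mean = {mean(batch):.2f}")
# first: median = 7, mean = 5.89
# second: median = 7, mean = 65.11
```

Changing one case from 9 to 542 leaves the median untouched but multiplies the mean roughly elevenfold.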

The mode is determined by finding the attribute that is most often observed.\3 That is, we simply count the number of times each attribute occurs in the data, and the mode is the most frequently occurring attribute. It can be used as a measure of central tendency with data at any level of measurement. However, the mode is most commonly employed with nominal variables and is generally less used for other levels. A distribution can have more than one mode (when two or more attributes tie for the highest frequency). When it does, that fact alone gives important information about the shape of the distribution.

Measures of central tendency are used frequently in GAO reports. In a study of tuition guarantee programs (U.S. General Accounting Office, 1990c), for example, the mean was often used to characterize the programs in the sample, but when outliers were evident, the median was reported. In another GAO study (U.S. General Accounting Office, 1988), the distinctions between properties of the mode, median, and mean figured prominently in an analysis of procedures used by the Employment and Training Administration to determine prevailing wage rates of farmworkers.

-------------------- \2 With an odd number of cases, the midpoint is the median. With an even number of cases, the median is the mean of the middle pair of cases.
\3 This definition is suitable when the mode is used with nominal and ordinal variables--the most common situation. A slightly different definition is required for interval-ratio variables.


Analyzing and Reporting Central Tendency

To illustrate some considerations involved in determining the central tendency of a distribution, we can recall the earlier study question about the views of Social Security beneficiaries regarding program services. Assume that a questionnaire has been sent to a batch of 800 Social Security recipients asking how satisfied they are with program services.\4 Further, imagine four hypothetical distributions of the responses. By assigning a numerical value of 1 to the item response "very satisfied," 2 to "satisfied," and so on through 5 for "very dissatisfied," we can create an ordinal variable. The three measures of central tendency can then be computed to produce the results in table 2.3.\5 Although the data are ordinal, we have included the mean for comparison purposes.

Table 2.3 Illustrative Measures of Central Tendency

                                                     Distribution

Attribute                             Code    A      B          C      D
Very satisfied                        1       250    250        100    159
Satisfied                             2       200    150        150    159
Neither satisfied nor dissatisfied    3       125    0          300    164
Dissatisfied                          4       125    150        150    159
Very dissatisfied                     5       100    250        100    159
Total responses                               800    800        800    800
Mean                                          2.5    3          3      3
Median                                        2      3          3      3
Mode                                          1      1 and 5    3      3

In distribution A, the data are distributed asymmetrically. More persons report being very satisfied than any other condition, and mode 1 reflects this. However, 225 beneficiaries expressed some degree of dissatisfaction (codes 4 and 5), and these observations pull the mean to a value of 2.5 (that is, toward the dissatisfied end of the scale). The median is 2, between the mode and the mean. Although the mean might be acceptable for some ordinal variables, in this example it can be misleading and shows the danger of using a single measure with an asymmetrical distribution. The mode seems unsatisfactory also because, although it draws attention to the fact that more respondents reported satisfaction with the services than any other category, it obscures the point that 225 reported that they were dissatisfied or very dissatisfied. The median seems the better choice for this distribution if we can display only one number, but showing the whole distribution is probably wise.

In distribution B, the mean and the median both equal 3 (a central tendency of "neither satisfied nor dissatisfied"). Some would say this is nonsense in terms of the actual distribution, since no one actually chose the middle category. Modes 1 and 5 seem the better choices to represent the clearly bimodal distribution, although again a display of the full distribution is probably the best option.

In distribution C, the mean, median, and mode are identical; the distribution is symmetrical. Any one of the three would be appropriate. One easy check on the symmetry of a distribution, as this shows, is to compare the values of the mean, median, and mode. If they differ substantially, as with distribution A, the distribution is probably such that the median should be used.

As distribution D illustrates, however, this rule-of-thumb is not infallible. Although the mean, median, and mode agree, the distribution is almost flat. In this case, a single measure of central tendency could be misleading, since the values 1, 2, 3, 4, and 5 are all about equally likely to occur. Thus, the full distribution should be displayed.

The lesson of this example? First, before representing the central tendency by any single number, evaluators need to look at the distribution and decide whether the indicator would be misleading. Second, there will be occasions when displaying the results graphically or in tabular form will be desirable instead of, or in addition to, reporting statistics.
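
The measures in table 2.3 can be recomputed from the response counts. The Python sketch below is one way to do so, under the assumption that each count is expanded into that many coded observations; the helper names `expand` and `summarize` are ours, and `statistics.multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Response counts for codes 1 through 5, copied from table 2.3.
distributions = {
    "A": [250, 200, 125, 125, 100],
    "B": [250, 150, 0, 150, 250],
    "C": [100, 150, 300, 150, 100],
    "D": [159, 159, 164, 159, 159],
}


def expand(counts):
    """Turn counts per code into one coded observation per respondent."""
    return [code for code, n in enumerate(counts, start=1) for _ in range(n)]


def summarize(counts):
    """Mean, median, and mode(s) of the coded responses."""
    cases = expand(counts)
    return {
        "mean": mean(cases),
        "median": median(cases),
        "modes": multimode(cases),  # a list, so ties such as "1 and 5" show up
    }


summary = {name: summarize(counts) for name, counts in distributions.items()}
for name, s in summary.items():
    print(name, f"mean={s['mean']:.2f}", f"median={s['median']}", f"modes={s['modes']}")
```

The unrounded mean for distribution A is 2.53, which table 2.3 shows as 2.5. Distributions B, C, and D all have a mean and median of 3, so only the modes and the counts themselves reveal how different their shapes are.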

The interpretation of a measure of central tendency comes from the context of the associated policy question. The number itself does not carry along a message saying whether policymakers should be complacent or concerned about the central tendency. For example, the observed mean agricultural subsidy for farmers can be interpreted only in the context of economic and social policy. Comparison of the mean to other numbers such as the wealth or income level of farmers or to the trend over time for mean subsidies might be helpful in this regard. And, of course, limits on mean values are sometimes written into law. An example is the fleet-average mileage standard for automobiles. Information that can be used to interpret the observed measures of central tendency is a necessary part of the overall answer to a policy question.

-------------------- \4 To keep the discussion general, we make no assumptions about how the group of recipients was chosen. However, in GAO, a probability sample would usually form the basis for data collection by a mailout questionnaire.
\5 Although computer programs automatically compute a variety of indicators and although we display three of them here, we are not suggesting that this is a good practice. In general, the choice of an indicator should be based upon the measurement level of a variable and the shape of the distribution.

Chapter 3
Determining the Spread of a Distribution

Spread refers to the extent of variation among cases--sometimes cases cluster closely together and sometimes they are widely spread out. When we determine appropriate policy action, the spread of a distribution may be as much of a factor as the central tendency, or more.

The point is illustrated by the issue of variation in hospital mortality rates. Consider two questions. How much do hospital mortality rates vary? If there is substantial variation, what accounts for it? We consider questions of the first type in this chapter and questions of the second type in chapter 6.

Figure 3.1: Histogram of Hospital Mortality Rates


Figure 3.1 shows the distribution of hypothetical data on mortality rates in 1,225 hospitals. While the depiction is useful both in gaining an initial understanding of the spread in mortality rates and in communicating findings, it is also usually desirable to produce a number that characterizes the variation in the distribution.

Other questions in which spread is the issue are

What is the variability in timber production among national forests?
What is the variation among the states in food stamp participation rates?
What is the spread in asset value among failed savings and loan institutions?

In each of these examples, we are addressing the generic question, How much spread (or variation in the response variable) is there among the cases? (See table 1.3.)

Even when spread is not the center of attention, it is an important concept in data analysis and should be reported when a set of data are described. Whenever evaluators give information about the central tendency of a distribution, they should also describe the spread.

Measures of the Spread of a Distribution

There is a variety of statistics for gauging the spread of a distribution. Some measures should be used only with interval-ratio measurement while others are appropriate for nominal or ordinal data. Table 3.1 summarizes the characteristics of four particular measures.

Table 3.1 Measures of Spread

                                          Use of measure

Measurement level    Index of dispersion    Range        Interquartile range    Standard deviation
Nominal              Yes                    No           No                     No
Ordinal              Sometimes              Sometimes    Yes                    No
Interval-ratio       No                     Yes          Yes                    Yes

The index of dispersion is a measure of spread for nominal or ordinal variables. With such variables, each case falls into one of a number of categories. The index shows the extent to which cases are bunched up in one or a few categories rather than being well spread out among the available categories.

The calculation of the index is based upon the concept of unique pairs of cases. Suppose, for example, we want to know the spread for gender, a nominal variable. Assume a batch of 8 cases, 3 females and 5 males. Each of the 3 females could be paired with each of the 5 males to yield 15 unique pairs (3 x 5).

The index is a ratio in which the numerator is the number of unique pairs (15 in the example) that can be created given the observed number of cases (n = 8 in the example). The denominator of the ratio is the maximum number of unique pairs of cases that can be created with n cases.

The maximum occurs when the cases are evenly divided among the available categories. The maximum number of unique pairs (for n = 8) would occur if the batch included 4 females and 4 males (the 8 cases evenly divided among the two categories). Under this condition, 16 unique pairs (4 x 4) could be formed. The index of dispersion for the example would thus be 15/16 = .94. Although this example illustrates the concept of the index, the calculation of the index becomes more tedious as the number of cases and the number of categories increase. Loether and McTavish (1988) give a computational formula and a computer program for the index of dispersion.

As the cases become more spread out among the available categories, the index of dispersion increases in value. The index of dispersion can be as large as 1, when the categories have equal numbers of cases, and as small as 0, when all cases are in one category.
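
The pair-counting idea can be written compactly. The Python sketch below is our own reading of the concept described above, not necessarily the computational formula given by Loether and McTavish (1988). It assumes that the number of unique pairs drawn from different categories equals (n^2 minus the sum of squared category counts) / 2, and that the maximum for n cases in k categories is n^2 (k - 1) / (2k), the value reached when the cases divide evenly. When n is not a multiple of k, that maximum is a theoretical ceiling rather than an attainable count.

```python
def index_of_dispersion(counts):
    """Index of dispersion for a categorical variable.

    `counts` holds the number of cases in each available category,
    including categories with zero cases.
    """
    n = sum(counts)
    k = len(counts)
    if k < 2 or n == 0:
        raise ValueError("need at least two categories and at least one case")
    # Unique pairs of cases that come from different categories.
    observed_pairs = (n * n - sum(c * c for c in counts)) / 2
    # The same quantity if the n cases were spread evenly over k categories.
    max_pairs = n * n * (k - 1) / (2 * k)
    return observed_pairs / max_pairs


print(index_of_dispersion([3, 5]))  # 3 females, 5 males: 15/16 = 0.9375
print(index_of_dispersion([4, 4]))  # evenly divided: 1.0
print(index_of_dispersion([8, 0]))  # all cases in one category: 0.0
```

The first call reproduces the 15 and 16 unique pairs of the gender example; the other two show the upper and lower limits of the index.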

The range is a commonly used measure of spread when a variable is measured at least at the ordinal level. The range is the difference between the largest and smallest observations in the distribution. Because the range is based solely on the extreme values, it is a crude measure that is very sensitive to sample size and to outliers. The effect of an outlier is shown by the two distributions we considered in chapter 2: (1) 1,4,4,5,7,7,8,8,9, and (2) 1,4,4,5,7,7,8,8,542. The range for the first distribution is 8, and for the second it is 541. The huge difference is attributable to the presence of an outlier in the second distribution.

A range of 0 means there is no variation in the cases, but unlike the index of dispersion, the range has no upper limit. The range is not used with nominal variables because the measure makes sense only when cases are ordered. To illustrate the measure, the distribution of hospital mortality rates is reproduced in figure 3.2. Inspection of the data showed that the minimum rate was .025 and the maximum was .475, so the range is .45.

Figure 3.2: Spread of a Distribution


Another measure of spread, the interquartile range, is the difference between the two points in a distribution that bracket the middle 50 percent of the cases. These two points are called the 1st and 3rd quartiles and, in effect, they cut the upper and lower 25 percent of the cases from the range. The more closely the cases are bunched together, the smaller will be the value of the interquartile range. Like the range, the interquartile range requires at least an ordinal level of measurement, but by discounting extreme cases, it is not subject to criticism for being inappropriately sensitive to outliers. In the hospital mortality example, the 1st quartile is .075 and the 3rd quartile is .275 so the interquartile range is the difference, .2.
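
The contrast between the range and the interquartile range can be seen with the two small batches used earlier to illustrate the effect of an outlier. The Python sketch below is illustrative; it relies on `statistics.quantiles` (Python 3.8 or later) with its default method, and because quartile conventions differ slightly among textbooks and software packages, other tools may report slightly different quartiles for the same data.

```python
from statistics import quantiles


def value_range(batch):
    """Difference between the largest and smallest observations."""
    return max(batch) - min(batch)


def interquartile_range(batch):
    """Difference between the 3rd and 1st quartiles."""
    q1, _, q3 = quantiles(batch, n=4)  # the middle cut point is the median
    return q3 - q1


batch_1 = [1, 4, 4, 5, 7, 7, 8, 8, 9]
batch_2 = [1, 4, 4, 5, 7, 7, 8, 8, 542]  # contains an outlier

for batch in (batch_1, batch_2):
    print("range =", value_range(batch),
          " interquartile range =", interquartile_range(batch))
# range = 8  interquartile range = 4.0
# range = 541  interquartile range = 4.0
```

The outlier changes the range from 8 to 541 but leaves the interquartile range unchanged.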

A fourth measure of spread, one often used with interval-ratio data, is the standard deviation. It is the square root of the average of the squares of the deviations of each case from the mean. As with the preceding measures, the standard deviation is 0 when there is no variation among the cases. It has no upper limit, however. For the distribution of hospital mortality rates, the standard deviation is .12 but note, from figure 3.2, that the distribution is somewhat asymmetric, so this measure of spread is apt to be misleading. The four-standard-deviation band shown in figure 3.2 is .48 units wide and centered on the sample mean of .19.\1
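
The verbal definition of the standard deviation translates directly into a calculation. The Python sketch below follows that definition literally, dividing by the number of cases; it uses a small made-up batch rather than the hospital data, which are not reproduced here. Note that many statistical packages divide by the number of cases minus one when the data are a sample, so their results can differ slightly.

```python
from math import sqrt
from statistics import pstdev


def standard_deviation(batch):
    """Square root of the average squared deviation from the mean."""
    m = sum(batch) / len(batch)
    return sqrt(sum((x - m) ** 2 for x in batch) / len(batch))


batch = [1, 4, 4, 5, 7, 7, 8, 8, 9]  # illustrative batch, not GAO data
sd = standard_deviation(batch)
center = sum(batch) / len(batch)
band = (center - 2 * sd, center + 2 * sd)  # four standard deviations wide

print(f"standard deviation = {sd:.2f}")  # 2.42
print(f"library check (statistics.pstdev) = {pstdev(batch):.2f}")
print(f"four-standard-deviation band: {band[0]:.2f} to {band[1]:.2f}")
```

The function agrees with the standard library's `pstdev`, which uses the same divide-by-n definition.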

One way of interpreting or explaining the spread of a distribution (for ordinal or higher variables) is to look at the proportion of cases "covered" by a measure of dispersion. To do this, we think of a spread measure as a band having a lower value and an upper value and then imagine that band superimposed on the distribution of cases. A certain proportion of the cases have observations larger than the lower value of the band and less than the upper value; those cases are thus covered by the spread measure. For the range, the lower value is the smallest observation among all cases and the upper value is the largest observation (see figure 3.2, based upon 1,225 cases). Then 100 percent of the cases are covered by the range.

Likewise, we know that when the interquartile range is used, 50 percent of the cases are always covered. The situation with the standard deviation is more complex but ultimately, in terms of inferential statistics, more useful.

When we use the standard deviation as the measure of spread, we can define the width of the band in an infinite number of ways but only two or three are commonly used. One possibility is to define the lower value of the band as the mean minus one standard deviation and the upper value as the mean plus one standard deviation. In other words, this band is two standard deviations wide (and centered on the mean). We could then simply count the cases in the batch that are covered by the band. However, it is important to realize that the proportion of cases covered can vary from study to study. For example, 53 percent of the cases might be covered in one study, to pick an arbitrary figure, and 66 percent in another. Just how many depends upon the shape of the distribution. So, unlike the situation with the range or the interquartile range, the measure by itself does not imply that a specified proportion of cases will be covered by a band that is two standard deviations wide. Thus if we know only the width of the band, we may have difficulty interpreting the meaning of the measure. Other bands could be defined as four standard deviations wide or any other multiple of the basic measure, a standard deviation.\2
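
Counting the cases covered by a band is a simple tally. The Python sketch below is illustrative and uses a small made-up batch. It treats a case as covered when its observation lies on or inside the band limits (an assumption on our part, made so that the range covers every case), and the function names are ours.

```python
from statistics import mean, pstdev


def coverage(batch, lower, upper):
    """Proportion of cases with lower <= observation <= upper."""
    return sum(lower <= x <= upper for x in batch) / len(batch)


def sd_band(batch, width_in_sds=2):
    """Band centered on the mean and `width_in_sds` standard deviations wide."""
    half_width = width_in_sds / 2 * pstdev(batch)
    center = mean(batch)
    return center - half_width, center + half_width


batch = [1, 4, 4, 5, 7, 7, 8, 8, 9]  # illustrative batch, not GAO data

print(coverage(batch, min(batch), max(batch)))  # the range covers every case
print(coverage(batch, *sd_band(batch, 2)))      # two standard deviations wide
print(coverage(batch, *sd_band(batch, 4)))      # four standard deviations wide
```

For this batch, the two-standard-deviation band covers 7 of the 9 cases and the four-standard-deviation band covers 8 of 9; a batch with a different shape would give different proportions for bands of the same width.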

We can obtain some idea of the effect of distribution shape on the interpretation of the standard deviation by considering three situations. We may believe that the distribution is (1) close to a theoretical curve called the normal distribution (the familiar bell-shaped curve), (2) unimodal and approximately symmetric (but not necessarily normal in shape), or (3) of unknown or "irregular" shape.\3 For this example, we define the band to be four standard deviations wide (that is, two standard deviations on either side of the mean).

When the distribution of a batch is close to a normal distribution, statistical theory permits us to say that approximately 95 percent of the cases will be covered by the four-standard-deviation band. (See figure 3.3.) However, if we know only that the distribution is unimodal and symmetric, theory lets us say that, at minimum, 89 percent of the cases will be covered. If the distribution is multimodal or asymmetric or if we simply do not know its shape, we can make a weaker statement that applies to any distribution: that, at minimum, 75 percent of the cases will be covered by the four-standard-deviation band.

Figure 3.3: Spread in a Normal Distribution


From this example, it should be evident that care must be taken when using the standard deviation to describe the spread of a batch of cases. The common interpretation that a four-standard-deviation band covers about 95 percent of the cases is true only if the distribution is approximately normal.
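
The three coverage figures for a four-standard-deviation band (about 95, 89, and 75 percent) can be reproduced from standard results. The Python sketch below assumes, on our part, that the 75 percent figure comes from Chebyshev's inequality, 1 - 1/k^2, and that the 89 percent figure corresponds to the bound 1 - 4/(9k^2) for unimodal distributions, where k is the number of standard deviations on either side of the mean. The function names are ours.

```python
from math import erf, sqrt


def normal_coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))


def unimodal_minimum(k):
    """Lower bound 1 - 4/(9 k^2); assumed valid here for k of about 1.63 or more."""
    return 1 - 4 / (9 * k * k)


def any_shape_minimum(k):
    """Chebyshev lower bound 1 - 1/k^2, which holds for any distribution."""
    return 1 - 1 / (k * k)


k = 2  # a four-standard-deviation band: two on either side of the mean
print(f"normal distribution: about {normal_coverage(k):.0%}")
print(f"unimodal and symmetric: at least {unimodal_minimum(k):.0%}")
print(f"any distribution: at least {any_shape_minimum(k):.0%}")
```

The same functions can be called with other values of k to see how the guarantees weaken for narrower bands; with k = 1, for example, Chebyshev's bound guarantees nothing at all.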

One GAO example of describing the spread of a distribution comes from a report on Bureau of the Census methods for estimating the value of noncash benefits to poor families (U.S. General Accounting Office, 1987b). Variation in the amount of noncash benefits was described in terms of both the range and the standard deviation. In a study of homeless children and youths (U.S. General Accounting Office, 1989a), GAO evaluators asked shelter providers, advocates for the homeless, and government officials to estimate the proportion of the homeless persons, in a county, that seek shelter in a variety of settings (for example, churches, formal shelters, and public places). The responses were summarized by reporting medians and by the first and third quartiles (from which the interquartile range can be computed).

-------------------- \1 Expressing the spread as a band of four standard deviations is a common but not unique practice. Any multiple of standard deviations would be acceptable but two, four, and six are commonplace.
\2 The term "standard deviation" is sometimes misunderstood to be implying some substantive meaning to the amount of variation-- that the variation is a large amount or a small amount. The measure by itself does not convey such information, and after we have computed a standard deviation, we still have to decide, on the basis of nonstatistical information, whether the variation is "large" or not.
\3 The name for the set of theoretical distributions called "normal" is unfortunate in that it seems to imply that distributions that have this form are "to be expected." While many real-world distributions are indeed close to a normal (or Gaussian) distribution in shape, many others are not.

 

Analyzing and Reporting Spread

To analyze the spread of a nominal variable, it is probably best just to develop a table or a histogram that shows the frequency of cases for each category of the variable. The calculation of a single measure, such as the index of dispersion, is not common but can be done.

For describing the spread of an ordinal variable, tables or histograms are useful, but the choice of a single measure is problematic. The index of dispersion is a possibility, but it does not take advantage of the known information about the order of the categories. Range, interquartile range, and standard deviation are all based on interval or ratio measurement. When a single measure is used, the best choice is often the interquartile range.

With an interval-ratio variable, graphic analysis of the spread is always advisable even if only a single measure is ultimately reported. The standard deviation is a commonly used measure but, as noted above, may be difficult to interpret if it cannot be shown that the cases have approximately a normal distribution.\4 Consequently, the interquartile range may be a good alternative to the standard deviation when the distribution is questionable.

With respect to reporting data, a general principle applies: whenever central tendency is reported, spread should be reported too. There are two main reasons for this. The first is that a key study question may ask about the variability among cases. In such instances, the mean should be reported but the real issue pertains to the spread.

The second reason for describing the spread of a distribution, which applies even when the study question focuses on central tendency, is that knowledge of variation among individual cases tells us the extent to which an action based on the central tendency is likely to be on the mark. The point is that government action based upon the central tendency may be appropriate if the spread of cases is small, but if the spread is large, several different actions may be warranted to take account of the great variety among the cases. For example, policymakers might conclude that the mean mortality rate among hospitals is satisfactory and, given central tendency alone, might decide that no action is needed. If there is little spread among hospitals with respect to mortality rates, then taking no action may be appropriate. But if the spread is wide, then maybe hospitals with low rates should be studied to see what lessons can be learned from them and perhaps hospitals with extremely high rates should be looked at closely to see if improvements can be made.

-------------------- \4 A possible approach with a variable that does not have a normal distribution is to change the scale of the variable so that the shape does approximate the normal. See Velleman and Hoaglin (1981) for some examples; they refer to the process of changing the scale as "re-expression," but "transformation" of the variables is a more common term.

Chapter 4  Determining Association Among Variables

Many questions GAO addresses deal with associations among variables:

Do 12th grade students in high-spending school districts learn more than students in low-spending districts?
Are different procedures for monitoring thrift institutions associated with different rates for correctly predicting institution failure?
Is there a relationship between geographical area and whether farm crop prices are affected by price supports?
Are homeowners' attitudes about energy conservation related to their income level?
Are homeowners' appliance-purchasing decisions associated with government information campaigns aimed at reducing energy consumption?

Recalling table 1.3, these examples illustrate the third generic question, To what extent are two or more variables associated? An answer to the first question would reveal, for example, whether high achievement levels tend to be found in higher-spending districts and low achievement levels in lower-spending districts (a positive association), or vice versa (a negative association).

What Is an Association Among Variables?

Just what do we mean by an association among variables?\1 The simplest case is that involving two variables, say homeowners' attitudes about energy conservation and income level. Imagine a data sheet as in table 4.1 representing the results of interviews with 341 homeowners. For these hypothetical data, we have adopted the following coding scheme: attitude toward energy conservation (indifferent = 1, somewhat positive = 2, positive = 3); family income level (low = 1, medium = 2, high = 3).

Table 4.1 Data Sheet With Two Variables

Case Attitude toward energy conservation Family income level
1 3 2
2 3 1
3 1 3
... ... ...
341 2 2



To say that there is an association between the variables is to say that there is a particular pattern in the observations. Perhaps homeowners who respond that their attitude toward energy conservation is positive tend to report that they have low income and homeowners who respond that they are indifferent toward conservation tend to have high income. The pattern is that the cases vary together on the two variables of interest. Usually the relationship does not hold for every case but there is a tendency for it to occur.

The trouble with a data sheet like this is that it is usually not easy to perceive an association between the two variables. Evaluators need a way to summarize the data. One common way, with nominal or ordinal data, is to use a cross-tabular display as in table 4.2. The numbers in the cells of the table indicate the number of homeowners who responded to each possible combination of attitude and income level.

Table 4.2 Cross-Tabulation of Two Ordinal Variables

                                       Family income level
Attitude toward energy conservation    Low    Medium    High    Total
Indifferent                             27        37      56      120
Somewhat positive                       35        39      41      115
Positive                                43        33      30      106
Total                                  105       109     127      341

Notice that the information in table 4.2 is an elaboration of the distribution of 341 homeowners shown in table 1.2. Reading down the total column in table 4.2 gives the distribution of the homeowners with respect to the attitude variable (the same as in table 1.2). In a two-variable table, this distribution is called a marginal distribution; it presents information on only one variable. The last row in table 4.2 (not including the grand total, 341) also gives a marginal distribution--that for the income variable.

There is much more information in table 4.2. If we look down the numbers in the low-income column only, we are looking at the distribution of attitude toward energy conservation for only low-income households. Or, if we look across the indifferent row, we are looking at the distribution of income levels for indifferent households. The distribution of one variable for a given value of the other variable is called a conditional distribution. Four other conditional distributions (for households with medium income, high income, somewhat positive attitudes, and positive attitudes) are displayed in table 4.2, which in its entirety portrays a bivariate distribution.

The new table compresses the data, from 682 cells in the data sheet of table 4.1 to 16, and again we can look for patterns in the data. In effect, we are trying to compare distributions (for example, across low-income, medium-income, and high-income households) and if we find that the distributions are different (across income levels, for example) we will conclude that attitude toward energy conservation is associated with income level. Specifically, households with high income tend to be less positive than low-income households. But the comparisons are difficult because the numbers of households in the categories (for example, low-income and medium-income) are not equal, as we can observe from the row and column totals.

The next step in trying to understand the data is to convert the numbers in table 4.2 to percentages. That will eliminate the effects of different numbers of households in different categories. There are three ways to make the conversion: (1) make each number in a row a percentage of the row total, (2) make each number in a column a percentage of the column total, or (3) make each number in the table a percentage of the batch total, 341. (Computer programs can readily compute all three variations.) In table 4.3, we have chosen the second way. Now we can see much more clearly how the distributions compare for different income levels.

Table 4.3 Percentaged Cross-Tabulation of Two Ordinal Variables

 

                                       Family income level
Attitude toward energy conservation    Low    Medium    High    Total
Indifferent                             26        34      44       35
Somewhat positive                       33        36      32       34
Positive                                41        30      24       31
Total                                  100       100     100      100
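The column percentages can be reproduced with a few lines of code. The following is only a minimal sketch in Python, using the counts from table 4.2; the variable names are ours and are not part of the paper.

```python
# Counts from table 4.2: rows are attitudes, columns are income levels
# in the order Low, Medium, High.
counts = {
    "Indifferent": [27, 37, 56],
    "Somewhat positive": [35, 39, 41],
    "Positive": [43, 33, 30],
}

# Column totals give the marginal distribution of income.
column_totals = [sum(row[j] for row in counts.values()) for j in range(3)]

# Express each cell as a percentage of its column total (the second of the
# three ways of percentaging), rounded to a whole percent.
column_percentages = {
    attitude: [round(100 * n / total) for n, total in zip(row, column_totals)]
    for attitude, row in counts.items()
}

for attitude, percentages in column_percentages.items():
    print(attitude, percentages)
```

Each printed row is a slice across the three conditional distributions of attitude, one for each income level, which is what makes the columns directly comparable.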

And we could go on and look at the other ways of computing percentages. But even with all three displays, it still may not be easy to grasp the extent of an association, much less readily communicate its extent to another person. Therefore, we often want to go beyond tabular displays and seek a number, a measure of association, to summarize the association. Such a measure can be used to characterize the extent of the relationship and, often, the direction of the association, except for nominal variables. We may sometimes use more than one measure to observe different facets of an association. Although this example involves two ordinal variables, the notion of an association is similar for other combinations of measurement levels.

-------------------- \1 The term "relationship" is equivalent to "association."

 

Measures of Association Between Two Variables

A measure of association between variables is calculated from a batch of observations, so it is another descriptive statistic. Several measures of association are available to choose from, depending upon the measurement level of the variables and exactly how association is defined. For illustrative purposes, we mention four from the whole class of statistics sometimes used for indicating association: gamma, lambda, the Pearson product-moment correlation coefficient, and the regression coefficient.

Ordinal Variables: Gamma

When we have two ordinal variables, as in the energy conservation example, gamma is a common statistic used to characterize an association. This indicator can range in value from -1 to +1, indicating perfect negative association and perfect positive association, respectively. When the value of gamma is near zero, there is little or no evident association between the two variables. Gamma is readily produced by available statistical programs, and it can be computed by hand from a table like table 4.2, but the calculation, sketched out below, is rather laborious. For our hypothetical data set, gamma is found to be -.24.

The computed value of gamma indicates that the association between family income level and attitude toward energy conservation is negative but that the extent of the association is modest. One way to interpret this result is that if we are trying to predict a family's attitude toward energy conservation, we will be more accurate (but not much more) if we use knowledge of its income level in making the prediction. The gamma statistic is based upon a comparison of the errors in predicting the value of one variable (for example, family's attitude toward conservation) with and without knowing the value of another variable (family income). This idea is expressed in the following formula: gamma = (prediction errors not knowing income - prediction errors knowing income)/prediction errors not knowing income.

The calculation of gamma involves using the information in table 4.2 to determine the number of prediction errors for each of two situations, with and without knowing income. The formula above is actually quite general and applies to a number of measures of association, referred to as PRE (proportionate reduction in error) measures. The more general formulation (Loether and McTavish, 1988) is PRE measure = reduction in errors with more information/original amount of error. PRE measures vary, depending upon the definition of prediction error.
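The paper does not spell out the hand calculation. As a hedged sketch, the code below assumes the usual Goodman and Kruskal definition of gamma, in which pairs of cases ordered the same way on both variables (concordant) are set against pairs ordered in opposite ways (discordant). The function name is ours.

```python
def goodman_kruskal_gamma(table):
    """Gamma for a cross-tabulation whose rows and columns are both ordered.

    Assumes the common definition (C - D) / (C + D), where C and D are the
    numbers of concordant and discordant pairs of cases.
    """
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair each cell with every cell in a lower row.
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    pairs = table[i][j] * table[k][l]
                    if l > j:
                        concordant += pairs   # higher on both variables
                    elif l < j:
                        discordant += pairs   # higher on one, lower on the other
    return (concordant - discordant) / (concordant + discordant)

# Interior cells of table 4.2: rows run from indifferent to positive,
# columns from low to high income.
table_4_2 = [
    [27, 37, 56],
    [35, 39, 41],
    [43, 33, 30],
]

gamma = goodman_kruskal_gamma(table_4_2)
print(round(gamma, 2))  # -0.24
```

For these data there are 9,863 concordant and 16,079 discordant pairs, which reproduces the value of -.24 reported above.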

Nominal Variables: Lambda

With two nominal variables, the idea of an association is similar to that between ordinal variables but the approach to determining the extent of the association is a little different. This is so because, according to definition, the attributes of a nominal variable are not ordered. The consequences can be seen by looking at another cross-tabulation.

Suppose we have data with which to answer the question about the association between whether the prices farmers receive are affected by government crop supports and the region of the country in which they live. Then the variables and attributes might be as follows: crop supports (yes, no); region of the country (Northeast, Southeast, Midwest, Southwest, Northwest). Some hypothetical data for these variables are displayed in table 4.4.

Table 4.4 Cross-Tabulation of Two Nominal Variables

                Prices affected by crop supports
Region          Yes       No        Total
Northeast       322       672         994
Southeast       473       287         760
Midwest         366       382         748
Southwest       306       297         603
Northwest       342       312         654
Total         1,809     1,950       3,759

If we start to look for a pattern in this cross-tabulation, we have to be careful because the order in which the regions are listed is arbitrary. We could just as well have listed them as Southwest, Northeast, Northwest, Midwest, and Southeast or in any other sequence. Therefore, the pattern we are looking for cannot depend upon the sequence as it does with ordinal variables.

Lambda is a measure of association between two nominal variables. It varies from 0, indicating no association, to 1, indicating perfect association.\2 The calculation of lambda, which is another PRE measure like gamma, involves the use of the mode as a basis for computing prediction errors. For the crop support example, the computed value of lambda is .08.\3 This small value indicates that there is not a very large association between crop-support effects and region of the country.
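As an illustration only, the sketch below computes the symmetric lambda from the cell counts of table 4.4. It assumes the standard Goodman and Kruskal formula, which the paper does not state; prediction errors are counted against the modal category, and the function name is ours.

```python
def symmetric_lambda(table):
    """Symmetric lambda for a cross-tabulation of two nominal variables.

    Assumes the standard mode-based formula: errors made when guessing the
    overall modal category are compared with errors made when guessing the
    modal category within each row and within each column.
    """
    n = sum(sum(row) for row in table)
    columns = list(zip(*table))
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in columns]

    # Correct guesses when the other variable is known.
    sum_row_modes = sum(max(row) for row in table)
    sum_column_modes = sum(max(col) for col in columns)

    # Correct guesses when the other variable is not known.
    overall_modes = max(row_totals) + max(column_totals)

    return (sum_row_modes + sum_column_modes - overall_modes) / (2 * n - overall_modes)

# Cell counts from table 4.4 (rows: regions; columns: Yes, No).
table_4_4 = [
    [322, 672],  # Northeast
    [473, 287],  # Southeast
    [366, 382],  # Midwest
    [306, 297],  # Southwest
    [342, 312],  # Northwest
]

lam = symmetric_lambda(table_4_4)
print(round(lam, 2))  # 0.08
```

Because only modal categories enter the calculation, reordering the regions leaves the result unchanged, which is the property required of a measure for nominal variables.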

-------------------- \2 A definition of perfect association is beyond the scope of this paper. Different measures of association sometimes imply different notions of perfect association.
\3 There are actually three ways to compute lambda. The numerical value here is the symmetric lambda. There is some discussion of symmetric and asymmetric measures of association later in this paper. 

Interval-Ratio Variables: Correlation and Regression Coefficients

A Pearson product-moment correlation coefficient is a measure of linear association between two interval-ratio variables.\4 The measure, usually symbolized by the letter r, varies from -1 to +1, with 0 indicating no linear association. The square of the correlation coefficient is another PRE measure of association.

Figure 4.1 Scatter Plots for Spending Level and Test Scores 


The Pearson product-moment correlation coefficient can be illustrated by considering the question about the association between students' achievement level in the 12th grade and school district spending level, regarding both variables as measured at the interval-ratio level. Such data are often displayed in a scatter plot, an especially revealing way to look at the association between two variables measured at the interval-ratio level. Figure 4.1 shows three scatter plots for three sets of hypothetical data on two variables: average test scores for 12th graders in a school district and the per capita spending level in the district. Each data point represents two numbers: a districtwide test score and a spending level.


In figure 4.1a, which shows essentially no pattern in the scatter of points, the correlation coefficient is .12. In figure 4.1b, the points are still widely scattered but the pattern is clear--a tendency for high test scores to be associated with high spending levels and vice versa; the correlation coefficient is .53. And finally, in figure 4.1c the linear pattern is quite pronounced and the correlation coefficient is .96.

The regression coefficient is another widely used measure of association between two interval-ratio variables and it can be used to introduce the idea of an asymmetric measure. First, we use the scatter plot data from figure 4.1b and replot them in figure 4.2. Using the set of data represented by the scatter plot, we can use the method of regression analysis to "regress Y on X," which tells us where to position a line through the scatter plot.\5 How the analysis works is not important here, but the interpretation of the line as a measure of association is. The slope of the line is numerically equal to the amount of change in the Y variable per 1 unit change in the X variable. The slope is the regression coefficient, an asymmetric measure of association between the two variables. Unlike many other commonly used measures, the regression coefficient is not limited to the interval from -1 to +1.\6 The regression coefficient for the data displayed in figure 4.2 is 1.76, indicating that a $100 change in spending level is associated with a change of 1.76 in test scores.

Figure 4.2: Regression of Test Scores on Spending Level


Why the regression coefficient is asymmetric can be understood if we turn the scatter plot around, as in figure 4.3, so that X is on the vertical axis and Y is on the horizontal. The pattern of points is a little different now, and if we reverse the roles of the X and Y variables in the regression procedure (that is, "regress X on Y"), the resulting line will have a different slope. Consequently the Y-on-X regression coefficient is different from the X-on-Y coefficient and that is why the measure is said to be asymmetric. Measures of association in which the roles of the X and Y variables can be interchanged in the calculation procedures without affecting the measure are said to be symmetric and measures in which the interchange produces different results are asymmetric.
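The asymmetry can be checked numerically. The sketch below uses a small set of invented district data (not the data behind the figures) and the ordinary least-squares slope formulas, which the paper does not present; all names are ours.

```python
from math import sqrt

# Hypothetical district data, invented for illustration only.
spending = [30, 35, 40, 45, 50, 55, 60]  # X: spending, in hundreds of dollars
scores = [52, 60, 55, 68, 63, 75, 70]    # Y: average test score

def deviation_sums(x, y):
    """Return sums of squared deviations and cross-products (Sxx, Syy, Sxy)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy

sxx, syy, sxy = deviation_sums(spending, scores)

slope_y_on_x = sxy / sxx    # "regress Y on X": change in Y per unit of X
slope_x_on_y = sxy / syy    # "regress X on Y": change in X per unit of Y
r = sxy / sqrt(sxx * syy)   # Pearson correlation: same whichever way round

print("Y-on-X slope:", round(slope_y_on_x, 3))
print("X-on-Y slope:", round(slope_x_on_y, 3))
print("correlation: ", round(r, 3))
```

The two slopes differ because each is scaled by the spread of a different variable, whereas the correlation coefficient treats the two variables alike. Under these formulas, the product of the two slopes equals the square of the correlation coefficient.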

Figure 4.3: Regression of Spending Level on Test Scores


When we use asymmetric measures of association, we also use special language to characterize the roles of the variables. Or, put in a more direct way, if we take the view that the variables are playing different roles, we give them names indicative of the roles. One is called the dependent variable and the other is the independent variable. The language is applied to two kinds of application: (1) when we are trying to establish that the independent variable causes changes in the dependent one and (2) when we are trying to use the independent variable to predict the dependent one, without necessarily supposing the association is causal. In either case, the dependent variable in some sense depends upon the independent one. Graphically, the convention is to plot the dependent variable along the vertical axis and the independent variable along the horizontal axis.

Whether evaluators should use an asymmetric measure of association or a symmetric one depends upon the application. If there is no reason to label variables as dependent and independent, then they should use a symmetric measure. But when they are predicting one variable from another or believe that one has a causal effect on the other, an asymmetric measure is preferred.

In each of the foregoing examples, both variables were measured at the same level. That will not always be the case. One common circumstance in which the variables are at different levels is discussed in a section below, entitled "The Comparison of Groups."

-------------------- \4 The word "correlation" is sometimes used in a nonspecific way as a synonym for "association." Here, however, the Pearson product-moment correlation coefficient is a measure of linear association produced by a specific set of calculations on a batch of data. It is necessary to specify linear because if the association is nonlinear, the two variables might have a strong association but the correlation coefficient could be small or even zero. This potential problem is another good reason for displaying the data graphically, which can then be inspected for nonlinearity. For a relationship that is not linear, another measure of association, called "eta," can be used instead of the Pearson coefficient (Loether and McTavish, 1988).
\5 Regression analysis is not covered in this paper. For extensive treatments, see Draper and Smith, 1981, and Pedhazur, 1982.
\6 The regression coefficient is closely related to the Pearson product-moment correlation. In fact, when the observed variables are transformed to so-called z-scores, by subtracting the mean from each observed value of a variable and dividing the difference by the standard deviation of the variable, the regression coefficient of the transformed variables is equal to the correlation coefficient.

Examples

An example of a measure of association between ordinal variables comes from a GAO report on the use of medical devices in hospitals (U.S. General Accounting Office, 1986). In reporting on the association between the seriousness of a device problem and hospitals' actions to contact the manufacturer or some other party outside the hospital, the evaluators displayed the results in cross-tabular array and summarized them using a symmetric measure.

In a study of election procedures (U.S. General Accounting Office, 1990d), some of the major findings were reported as a series of correlation coefficients that showed the association between voter turnout and numerous factors characterizing absentee ballot rules and voter information activities. The same study used a regression coefficient to show the association between voter turnout and the registration deadline, expressed as number of days before the election.

The Comparison of Groups

A situation of special interest arises when evaluators want to compare two groups on some variable to see if they are different. For example, suppose the evaluators want to compare government benefits received by farmers who live east of the Mississippi to those who live west of the Mississippi. Questions about the difference between two groups are very common. In this instance, it would probably be best to answer the question by computing the mean benefits for each group and looking at the difference.

Equivalently, the comparison between these two groups of cases can be seen as a measure of association. With government benefits measured at the interval-ratio level (in dollars) and region of the country measured at the nominal level (for example, 0 for East and 1 for West), we can compute a measure of association called the point biserial correlation between benefits and region.\7 If we then multiply this correlation by the standard deviation of benefits and divide by the standard deviation of region, we will get the difference between the means of the two groups. The same result would be obtained if we regressed benefits on region; the regression coefficient is equal to the difference between the means of the two groups. We thus have three different, but statistically equivalent, methods of comparing the two groups: (1) computing the difference between means of the groups, (2) computing the point biserial correlation (and then adjusting it), and (3) computing the regression coefficient.
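The equivalence of the three methods can be verified on a small example. The sketch below uses invented benefit amounts for nine farmers and standard formulas for the correlation and the regression slope; the data and names are hypothetical and serve only to illustrate the arithmetic.

```python
from math import sqrt

# Hypothetical data: region coded 0 = East, 1 = West; benefits in
# thousands of dollars. Invented for illustration only.
region = [0, 0, 0, 0, 1, 1, 1, 1, 1]
benefits = [12.0, 15.0, 9.0, 14.0, 20.0, 18.0, 25.0, 17.0, 22.0]

def mean(values):
    return sum(values) / len(values)

# Method 1: difference between the group means.
east = [b for g, b in zip(region, benefits) if g == 0]
west = [b for g, b in zip(region, benefits) if g == 1]
mean_difference = mean(west) - mean(east)

# Sums of squared deviations and cross-products.
mean_region, mean_benefits = mean(region), mean(benefits)
sxx = sum((g - mean_region) ** 2 for g in region)
syy = sum((b - mean_benefits) ** 2 for b in benefits)
sxy = sum((g - mean_region) * (b - mean_benefits) for g, b in zip(region, benefits))

# Method 2: point biserial correlation, adjusted by the standard deviations.
point_biserial = sxy / sqrt(sxx * syy)
sd_region = sqrt(sxx / len(region))
sd_benefits = sqrt(syy / len(benefits))
adjusted_correlation = point_biserial * sd_benefits / sd_region

# Method 3: regression coefficient from regressing benefits on region.
regression_coefficient = sxy / sxx

print(mean_difference, adjusted_correlation, regression_coefficient)
```

All three quantities agree, apart from rounding in floating-point arithmetic. The agreement depends on region being coded 0 and 1; with other codes, the regression coefficient would be the difference in means divided by the distance between the two codes.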

The point is that when evaluators compare two groups, they are examining the extent of association between two variables: one is group membership and the other is the response variable, the characteristic on which the groups might differ. Such comparisons are the main method for evaluating the effect of a program. For example, a question might be: What is the effect of the Special Supplemental Food Program for Women, Infants, and Children (WIC) on birthweight? The answer is partly to be found in the association, if any, between group membership (program participation or not) and birthweight.

Knowing the association is only part of the answer, however, because the question about effect is about the causal association between program participation and birthweight. As we show in chapter 6, the existence of an association is one of three conditions necessary to establish causality.

A comparison of means is but one among many ways in which it might be appropriate to compare two groups. Other possibilities include the comparison of (1) medians, (2) proportions, and (3) distributions. For example, if two groups are being compared on an ordinal variable and the distribution is highly asymmetric, then an analysis of the medians may be preferable to an analysis of the means. Or, as noted in chapter 3, sometimes the question the evaluators are attempting to answer is focused on the spread of a distribution and so they might be interested in comparing a measure of spread in two groups. For the hospital mortality study, we could compare the spread of mortality rates between two categories of hospitals, say teaching and nonteaching ones.

Statistical methods for comparing groups are important to GAO in at least three situations: (1) comparison of the characteristics of populations (for example, farmers in the eastern part of the country with those in the western), (2) determination of program effects (for example, the WIC program), and (3) the comparison of processes (for example, different ways to monitor thrift institutions). The questions that arise from these situations lead to a variety of data analysis methods. Factors that determine an appropriate data analysis methodology include (1) the number of groups to be compared, (2) how cases for the groups were selected, (3) the measurement level of the variables, (4) the shape of the distributions, and (5) the type of comparison (measure of central tendency, measure of spread, and so on). A further complexity is that, when sampling, evaluators need to know if the observed difference between groups is real or most likely stems from sampling fluctuation. For making that determination, the methods of statistical inference are required.

A study of changes to the program called Aid to Families with Dependent Children illustrates the use of group comparisons on factors such as employment status to draw conclusions about effects of the changes (U.S. General Accounting Office, 1985). In another example, two groups of farmers, ones who specialized in a few crops and ones who diversified, were compared on agricultural practices (U.S. General Accounting Office, 1990a).

-------------------- \7 The point biserial correlation is analogous to the Pearson product-moment correlation that applies when both variables are measured at the interval-ratio level.

Analyzing and Reporting the Association Between Variables

Answering a question about the association between two variables really involves four subquestions: Does an association exist? What is the extent of the association? What is the direction of the association? What is the nature of the association? Analysis of a batch of data to answer these questions usually involves the production of tabular or graphic displays as well as the calculation of measures of association.

With nominal or ordinal data presented in tabular form, evaluators can check for the existence of an association by inspection of the tables. If the conditional distributions are identical or nearly so, they can conclude that there is no association. Table 4.2 illustrates a data set for which an association exists between income level and attitude toward energy conservation. Table 4.5 shows another set of 341 cases--one in which there is virtually no association. The marginal distributions are the same for tables 4.2 and 4.5, so the pattern of observations can change only in the nine interior cells.

Table 4.5 Two Ordinal Variables Showing No Association

 

                                       Family income level
Attitude toward energy conservation    Low    Medium    High    Total
Indifferent                             37        38      45      120
Somewhat positive                       35        37      43      115
Positive                                33        34      39      106
Total                                  105       109     127      341

Most bivariate data show the existence of association. The question is really whether the association is large enough to be important.\8 A measure of association is calculated to help answer this question, and evaluators must make a judgment about importance, using the context of the question as a guide.

The direction of an association is also given by a measure of association unless the variables are nominal, in which case the direction is not meaningful. Most measures are defined so that a negative value indicates that as one variable increases the other decreases and that a positive value indicates that the variables increase or decrease together.

While the existence, extent, and direction of an association can be revealed by a measure of association, determining the nature of the association requires other methods. Usually it is done by inspecting the tabular or graphic display of a bivariate distribution. For example, a scatter plot will show if the association is approximately linear, a constant amount of change in one variable being associated with a constant amount of change in the other variable, as in figure 4.4a. However, the scatter plot may show that the association is nonlinear as in figure 4.4b. Interpretations of the data are usually easier if the association is linear and, of course, interpolations and extrapolations are more straightforward.
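A small constructed example shows why a measure of association alone cannot reveal the nature of an association. In the sketch below, both sets of invented Y values are completely determined by X, yet the Pearson coefficient is 1 for the linear set and 0 for the curved one. The data and function name are ours.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Invented data: the same X values paired with a linear and a curved Y.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # straight-line association
y_curved = [v ** 2 for v in x]      # perfect but nonlinear association

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)

print("linear:", round(r_linear, 2))
print("curved:", round(r_curved, 2))
```

A scatter plot of the curved data would show the U-shaped pattern at once, even though the correlation coefficient reports no linear association.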

Figure 4.4: Linear and Nonlinear Associations


In comparisons between groups of cases, regression analysis is an important tool when the dependent variable is interval-ratio. When the assumptions necessary for regression are not satisfied, other techniques are necessary.\9

Overall, there are many analysis choices. Evaluators can always find the extent of the association, if any, between the variables and, unless one or both of the variables are measured at the nominal level, they can also determine the direction of the association. The appropriateness of a given procedure depends upon the measurement level of the variables and the definition of association believed best for the circumstances. It is also wise to display the data in a table or a graph as a way to understand the form of the association.

How much information from the analysis should be included in a report? The answer depends on how strongly the conclusions are based upon the association that has been determined. If the relationship between the two variables is crucial, then probably both measures of association and tabular or graphic displays should be presented. Otherwise, reporting only the measures will probably suffice. In either case, evaluators should be clear about the level of measurement assumed and analysis methods used.

-------------------- \8 If we are trying to draw conclusions about a population from a probability sample, then we must additionally be concerned about whether what seems to be an association really stems from sampling fluctuation. The data analysis then involves inferential statistics.
\9 The assumptions are not very stringent for descriptive statistics but may be problematic for inferential statistics.

Chapter 5  Estimating Population Parameters

Many questions that GAO seeks to answer are about relatively large populations of persons, things, or events. Examples are

What is the average student loan balance owed by college students?
Among households eligible for food stamps, what proportion receive them?
How much hazardous waste is produced in the nation annually and how much variation is there among individual generators?
What is the relationship between the receipt of Medicaid benefits and size of household?

In chapters 2, 3, and 4, we focused on descriptive statistics--ways to answer questions about just those cases for which we had data. We now consider inferential statistics--methods for answering questions about cases for which we do not have observations. The procedures involve using data from a sample of cases to infer conclusions about the population of which the sample is a part.

The shift to inferential statistics is necessary when evaluators want to know about large populations but, for practical reasons, do not try to get information from every member of such populations. The most obvious obstacle to collecting data on many cases is cost, but other factors such as deadlines for producing results may play a role.

To generalize findings from a sample of cases to the larger population, not just any sample of cases will do--a probability sample is required. Random processes for drawing probability samples are detailed in the transfer paper in this series entitled Using Statistical Sampling.\1 Under such methods, each member of a population has a known, nonzero probability of being drawn.

The methods collectively called inferential statistics are based upon the laws of probability and require samples drawn by a random process. Attempts to draw conclusions about populations based upon nonprobability samples are usually not very persuasive, so we do not consider them here.

From the perspective of inferential statistics, the illustrative questions above need two-part answers: a point estimate of a parameter that describes the population and an interval estimate of the parameter. (Other forms of statistical inference, such as hypothesis testing, are appropriate to other kinds of questions. They are not covered in this paper.) Full understanding of inferential statistical statements requires a thorough knowledge of probability, the development of which is beyond the scope of this introductory paper. For our brief treatment here, we use the concept of the histogram and illustrate how probability comes into play through sampling distributions.

Some notions discussed in earlier chapters, involving data on all cases in a batch, are extended in this chapter to show how statistics computed from a probability sample of cases are used to estimate parameters such as the central tendency of a population (see chapter 2). The notable difference between describing a batch, using statistics from all cases in the batch, and describing a population, using statistics from a probability sample of the population, is that we will necessarily be somewhat uncertain in describing a population. However, the data analysis methods for inferential statistics allow us to be precise about the degree of uncertainty.

-------------------- \1 Probability sampling is sometimes called statistical sampling or scientific sampling.

Histograms and Probability Distributions

A key concept in statistical inference is the sampling distribution. The histogram, which was introduced in chapter 1, is a way of displaying a distribution, so we begin there. Expanding on the first example from chapter 1, suppose that instead of information on a batch of 15 college students, we have collected information on loan balances from 150 students. If we round numbers to the nearest $1,000 for ease of computation and display, our observations produce the distribution of loan balances shown in figure 5.1. For example, the height of the third bar corresponds to the number of students who reported loan balances between $1,500 and $2,499. The distribution is somewhat asymmetrical and has a mean of $2,907.

Figure 5.1: Frequency Distribution of Loan Balances


Probability is a numerical way of expressing the likelihood that a particular outcome, among a set of possibilities, will occur. Suppose that we do not have access to the responses from individual students in the survey but that we want to use the distribution in figure 5.1 to make a wager on whether a student to be selected at random from this sample of 150 will have a loan balance between $1,500 and $2,500. To make a reasonable bet, we need to know the probability that a particular outcome--a loan value between $1,500 and $2,500--will be reported when we make a phone call to the student. The information we need is in the figure but the answer will be more evident if we make a slight change in the display.

We can describe the students' use of loans in probability terms if we convert the frequency distribution to a probability distribution. The frequency distribution shows the number of students who reported each possible outcome (that is, loan balances between $1,500 and $2,500 and so on). We can present the same information in terms of percentages by dividing the number of students reporting each outcome (the height of a bar) by the total number in the sample (150). The percentages, expressed in decimal form, can be interpreted as probabilities and are displayed in figure 5.2.

Figure 5.2: Probability Distribution of Loan Balances


We now have a probability distribution for the loan balance variable for the sample of 150 students. Picking the outcome we want to make a wager on, we can say that the probability is .26 (39 divided by 150) that a student selected randomly from the sample will report a balance between $1,500 and $2,500.

The shape of the probability distribution is the same as the frequency distribution; we have just relabeled the vertical axis. But the probability distribution has two important characteristics not possessed by the frequency histogram: (1) the height of each bar is equal to or greater than 0 and equal to or less than 1 and (2) the sum of the heights of the bars is equal to 1. These characteristics qualify the new display as a probability distribution for nominal or ordinal variables.\2 The probability of an outcome is defined as ranging between 0 and 1, and the sum of probabilities across all possible outcomes is 1.
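The conversion from frequencies to probabilities, and the two characteristics just described, can be sketched in a few lines of code. Only the sample size of 150 and the count of 39 for the third bar come from the example; the other bar heights are invented for illustration and do not reproduce figure 5.1.

```python
# Frequency distribution: loan balance (rounded to the nearest $1,000)
# mapped to the number of students. Only the count of 39 for the third bar
# and the total of 150 are taken from the example; the rest are hypothetical.
frequencies = {0: 12, 1000: 25, 2000: 39, 3000: 30,
               4000: 20, 5000: 12, 6000: 7, 7000: 5}

sample_size = sum(frequencies.values())

# Divide each bar height by the sample size to obtain probabilities.
probabilities = {balance: count / sample_size
                 for balance, count in frequencies.items()}

# Characteristic 1: every bar height lies between 0 and 1.
all_in_range = all(0 <= p <= 1 for p in probabilities.values())

# Characteristic 2: the bar heights sum to 1 (within floating-point rounding).
total_probability = sum(probabilities.values())

print(sample_size, probabilities[2000], all_in_range, round(total_probability, 6))
```

The probability attached to the third bar is 39 divided by 150, or .26, the value used for the wager.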

The probability distribution in figure 5.2 is an empirical distribution because it is based on experience. "Theoretical" probability distributions are also important in drawing conclusions from data and deciding actions to take. An example relevant to the decisions that gamblers make is the distribution of possible outcomes from throwing a six-sided die. In theory, the probability distribution for the six possible outcomes could be displayed with six bars, each having a height of 1/6.
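As a small sketch of a theoretical distribution, the die example can be written out directly. The helper name probability_of is hypothetical and is used here only for illustration; exact fractions are used so that the sum of the six bar heights comes out to exactly 1.

```python
from fractions import Fraction

# Theoretical distribution for one throw of a fair six-sided die:
# six possible outcomes, each with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}


def probability_of(event, distribution):
    """Add the probabilities of the outcomes that make up an event."""
    return sum(distribution[outcome] for outcome in event)


total = probability_of(die.keys(), die)   # all six outcomes together
high_roll = probability_of({5, 6}, die)   # a throw of 5 or 6

print(total)      # 1
print(high_roll)  # 1/3
```

Unlike the empirical distribution in figure 5.2, nothing here depends on observed data; the probabilities follow from the assumption that the die is fair.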

Theoretical distributions that play key roles in the methods of inferential statistics are the binomial, normal, chi-square, t, and F distributions. Actually, each of these names refers to a whole family of distributions. Widely available tables give numerical information about these distributions. For tables and discussions of the distributions, consult a statistics text such as Loether and McTavish (1988). For example, one could use a table of the normal distribution (with mean of 0 and standard deviation of 1) to find the probability that an observation from a population with this distribution will exceed a specified value. Before computers became commonplace for statistical calculations, tables of the distributions were indispensable to the application of inferential statistics.
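The table lookup described above can now be done with a few lines of code. The sketch below uses NormalDist from Python's standard statistics module; the function name probability_exceeds and the cutoff of 1.96 are chosen only for illustration.

```python
from statistics import NormalDist

# Normal distribution with mean of 0 and standard deviation of 1.
standard_normal = NormalDist(mu=0, sigma=1)


def probability_exceeds(value, dist=standard_normal):
    """Probability that an observation from dist is greater than value."""
    return 1 - dist.cdf(value)


p = probability_exceeds(1.96)
print(f"{p:.3f}")  # 0.025
```

The result is the same figure one would read from a printed table of the normal distribution: about 2.5 percent of observations fall above 1.96.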

--------------------
\2 Nominal and ordinal variables take on a finite set of values. Interval-ratio variables have a potentially infinite set of values, so the corresponding probability distribution is defined a little differently. (These variables are introduced under "Level of Measurement" in chapter 1.)

Sampling Distributions

The distribution of responses from 150 college students in the example above is the distribution of a sample. If we were to draw another sample of 150 students and plot a histogram, we would almost surely see a slightly different distribution and the mean would be different. And we could go on drawing more samples and plotting more histograms. Differences among the resulting distributions of samples are inherent in the sampling process.

The aim is to be precise about how much variation to expect among statistics computed from different samples. For example, if we use the mean of a sample to describe the distribution of loan balances in a student population, how much uncertainty derives from using a sample? New kinds of distributions called sampling distributions of statistics, or just sampling distributions for short, provide the basis for making statements about statistical uncertainty.

To this point, we have computed statistics without concern for how we produced the data, but now we must use probability sampling, which requires that data be produced by a random process. In particular, suppose that we were to draw 100 different simple random samples, each with 150 students, and compute sample statistics, such as the mean, for each sample.\3 This would give us a data sheet like that in table 5.1. Since the computed sample means vary across the samples, we could draw a histogram showing the distribution of the sample means (figure 5.3). Each bar covers an interval centered on the number shown beneath it on the X axis. Such a distribution is what we mean by a sampling distribution--one that tells us the probability of obtaining a sample in which a computed statistic, such as the mean, will have certain values.\4 Using figure 5.3, we can say that 25 percent of the sample means had values in the interval $3,000 plus or minus $50. Using such information, we will be able to make a statement about the probability that a given interval includes the value of the population mean.\5 This idea is developed further in a later section on interval estimation.

Table 5.1: Data Sheet for 100 Samples of College Students

Sample    Computed mean loan balance
1         $2,907
2         $2,947
3         $2,933
4         $3,127
5         $3,080
...       ...
100       $3,227
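The thought experiment of drawing 100 samples of 150 students can be imitated by simulation. The sketch below is an illustration under stated assumptions, not the procedure used to produce table 5.1: the population of 20,000 balances, its roughly normal shape, and its standard deviation of $1,200 are all hypothetical. Only the number of samples, the sample size, and the population mean near $3,000 follow the text.

```python
import random
import statistics

random.seed(1992)  # fixed seed so that a rerun draws the same samples

# Hypothetical population of loan balances (an assumption for this sketch):
# roughly normal, mean near $3,000, standard deviation $1,200, no negatives.
population = [max(0.0, random.gauss(3000, 1200)) for _ in range(20_000)]


def sample_means(pop, n_samples=100, sample_size=150):
    """Draw simple random samples (without replacement) and return their means."""
    return [
        statistics.mean(random.sample(pop, sample_size))
        for _ in range(n_samples)
    ]


def tally(values, width=100):
    """Count values in intervals of the given width centered on multiples of it."""
    counts = {}
    for v in values:
        center = round(v / width) * width
        counts[center] = counts.get(center, 0) + 1
    return dict(sorted(counts.items()))


means = sample_means(population)
histogram = tally(means)

# A text histogram of the sampling distribution of the mean.
for center, count in histogram.items():
    print(f"${center:,} plus or minus $50: {'#' * count}")
```

Each row of the printed histogram corresponds to one bar of a display like figure 5.3: the number of sample means that fell in an interval centered on the value shown.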

Figure 5.3: Sampling Distribution for Mean Student Loan Balances


Speaking practically, of course, we would not want to draw many samples of college students because a principal reason for sampling, after all, is to avoid having to make a large number of observations. Therefore, we cannot hope actually to produce a distribution like that of figure 5.3 from empirical evidence. But if our sample is a probability sample, we can usually determine the amount of uncertainty associated with sampling and yet draw only one sample. With a probability sample, the laws of probability often enable us to know the theoretical distribution of a sample statistic so that we can use that instead of an empirical distribution obtained by drawing many samples.\6

The sample displayed in figure 5.1, like the other 99, was in fact drawn randomly from a population with a mean loan balance of $3,000. It can therefore be used to estimate population parameters for the distribution of student loan balances.
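One familiar instance of using theory in place of repeated sampling is the standard error of the mean. For a simple random sample from a large population, the spread of the sampling distribution of the mean can be estimated from a single sample as the sample standard deviation divided by the square root of the sample size. The sketch below assumes a hypothetical sample of 150 balances generated for illustration; it does not use the survey data.

```python
import math
import random
import statistics

random.seed(5)  # fixed seed so that a rerun produces the same sample

# One hypothetical simple random sample of 150 loan balances (assumed values,
# generated here only so that the sketch has data to work with).
sample = [max(0.0, random.gauss(3000, 1200)) for _ in range(150)]


def standard_error_of_mean(values):
    """Estimate the standard deviation of the sampling distribution of the mean."""
    return statistics.stdev(values) / math.sqrt(len(values))


estimate = statistics.mean(sample)
standard_error = standard_error_of_mean(sample)

print(f"sample mean:    ${estimate:,.0f}")
print(f"standard error: ${standard_error:,.0f}")
```

The standard error plays the role that the width of the histogram in figure 5.3 plays in the thought experiment, but it is computed from one sample rather than 100.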

--------------------
\3 There are many kinds of probability samples. The most elementary is the simple random sample in which each member of the population has an equal chance of being drawn to the sample.
\4 Notice the difference between a sample distribution (the distribution of a sample) and a sampling distribution (the distribution of a sample statistic).
\5 The mean either lies in a given interval or it does not. No probability is involved in that respect. However, the probability statement is appropriate since the population mean is usually unknown and we use the confidence interval as a measure of the uncertainty in our estimate of the mean that stems from sampling.
\6 This is where families of distributions like the chi-square and the t come into play to help us estimate population parameters. They are the theoretical distributions that we need.

Population Parameters

A population parameter is a number that describes a population. Consider again the question of the mean student loan balance for college students. We want to know about the population of all college students--specifically, we want to know the mean loan balance--but we do not want to collect information from every student. We describe the situation by saying that we want to estimate a population parameter--in this case, the mean of the distribution of loan balances for all students. We want a reasonably close estimate but we are willing to tolerate some uncertainty in exchange for avoiding the cost and time of querying every college student.

The id