| United States General Accounting Office |
| GAO | Program Evaluation and Methodology Division |
| November 1990 | Case Study Evaluations |
Preface
GAO assists congressional decisionmakers in their deliberative process by furnishing
analytical information on issues and options under consideration. Many diverse
methodologies are needed to develop sound and timely answers to the questions that are
posed by the Congress. To provide GAO evaluators with basic information about the more
commonly used methodologies, GAO's policy guidance includes documents such as methodology
transfer papers and technical guidelines.
This methodology transfer paper on case study evaluations describes how GAO evaluators could use case study methods in performing our work. It describes six applications of case study methods, including the purposes and pitfalls of each, and explains similarities and differences among the six. This paper presents an evaluation perspective on case studies, defines them, and determines their appropriateness in terms of the type of evaluation question posed. The original report was authored by Lois-ellin Datta in April 1987. This reissued (1990) version supersedes the earlier edition.
Case Study Evaluations is one of a series of papers issued by the Program
Evaluation and Methodology Division (PEMD). The purpose of the series is to provide GAO
evaluators with guides to various aspects of audit and evaluation methodology, to
illustrate applications, and to indicate where more detailed information is available.
We look forward to receiving comments from the readers of this paper. They should be
addressed to Eleanor Chelimsky at 202-276-1854.
Werner Grosshans
Assistant Comptroller General
Office of Policy
Eleanor Chelimsky
Assistant Comptroller General
for Program Evaluation and Methodology
Contents
Preface
Chapter 1
Introduction
Chapter 2
What Are Case Studies?
What Is Meant by "A Case Study"?
Some Common Benefits Expected From Case Study Evaluations
Instance Selection in Case Studies
Chapter 3
Case Study Applications
Illustrative
Exploratory
Critical Instance
Program Implementation
Program Effects
Cumulative
Design Decisions and Case Study Applications
Chapter 4
Data Collection and Analysis
Data Collection
Data Analysis
Handling Multisite Data Sets
Basic Models for Data Analysis
Pitfalls and Booby Traps
Where to Go for More Information
Chapter 5
Summary
What Are Case Studies?
When Are Case Studies Appropriately Used in Evaluation?
What Distinguishes a Good From a Not-Good Case Study?
Impartiality and Generalizability
Appendixes
Appendix I: Theory and History
Appendix II: Site Selection Example
Appendix III: Guidelines for Reviewing Case Study Reports
Bibliography
Glossary
Papers in This Series
Tables
Table 2.1: What Is a Case Study? Exercise
Table 2.2: Complexity of Questions
Table 2.3: Methods of Obtaining Description and Analysis in Case Studies
Table 2.4: Some Common Benefits Expected From Case Study Evaluations
Table 2.5: Instance Selection in Case Studies
Table 2.6: Hypothetical Data on Instance Selection
Table 3.1: Illustrative Case Studies
Table 3.2: Exploratory Case Studies
Table 3.3: Critical Instance Case Studies
Table 3.4: Program Implementation Case Studies
Table 3.5: Illustration of Differences in Note-Taking
Table 3.6: Program Effects Case Studies
Table 3.7: Cumulative Case Studies
Table 3.8: Some Design Decisions in Case Study Methods
Table 4.1: Ways of Analyzing Case Study Data
Table 5.1: Some Common Pitfalls in Case Study Evaluation
Table I.1: Criteria of Good Research
Table I.2: Evaluation Adaptations of the Research Case Study
Table II.1: Hypothetical Data on Unfiled Corporate Income Tax Returns for 1986
State Income Tax Returns
Table III.1: Checklist for Reviewing Case Study Reports
Abbreviations
GAO General Accounting Office
OTTR Observe, think, test, and revise
PEMD Program Evaluation and Methodology Division
SSA Social Security Administration
Chapter 1
Introduction
At his government-required anti-terrorist training session recently, a captain for a major
airline said,
"The bits of information were so few and far between that people weren't even paying attention. My instructor for the eight-hour course entered the room only to change videotapes. People were talking; they were doing other things, including reading the paper." (Philadelphia Inquirer, 1986)
This is a case instance. It is an effective way of drawing attention to a problem such as training quality. Such anecdotes are remembered and they are convincing. What they are not, however, is generalizable: that is, an anecdote doesn't tell whether it is the only such instance or whether the problem is widespread. And anecdotes usually don't show the reasons for a situation, and thus are of limited value in suggesting solutions.
The challenge for evaluators is how to use those aspects of an anecdote that are effective for our work (the immediacy, the convincingness, the attention-getting quality) and, at the same time, fulfill other informational requirements for our jobs, such as generalizability and reliability. Case study methods, while not without their limitations in this regard, can help us answer this challenge.
GAO already does a lot of case studies, or at least what we ourselves call case studies in describing our methods. There are GAO case studies in many areas, including urban housing, weapon systems testing, community development, military procurement contracts, influences on the Brazilian export-import balances, how programs aimed at improving water quality are working, and the implementation of block grants, to name only a few.
Most of these case studies are either "illustrative" or "critical instance"
applications. The first type of application illustrates findings established by
other techniques, supplementing, for example, national findings on clean air from
administrative records and other sources with in-depth description of how funds have
been used and with what results in selected cities. The second type of application is
in-depth analysis of a case of unique interest, such as whether funds have been awarded
and managed properly in a specific community health center or whether a certain former
government official had done anything improper before or after leaving the government.
There are, however, four other applications of case studies that are less often used at
present but that could be appropriate for our jobs. In brief, the six types of case
study, which we examine in chapter 3, are illustrative, exploratory, critical instance,
program implementation, program effects, and cumulative.
Case Study Evaluations is a review of methodological issues involved in using case study evaluations. It is not a detailed guide to case study design. It does, however, explain the similarities and differences among the six kinds of case study and discusses ideas for successfully designing them. It also gives guidance to the manager who, in reviewing completed case studies, wants to assess their strengths. Finally, it presents an evaluation perspective on case studies, defining them and determining their appropriateness in terms of the type of evaluation question posed.
The methods and types of case studies outlined here are not definitive. The case study as a research method has evolved over many years of experience but evaluative use of the method has been more limited. Indeed, the history of the case study as an evaluation method is little older than a decade. Therefore, discussion of some of the applications described here is based on relatively extensive field experience (with questions in such domains as justice, education, welfare, environment, housing, and foreign aid), while the discussion of some of the other applications is based on more constrained experience.
We have paid particular attention to the conventional wisdom that case studies are
always subjective and nongeneralizable. In many uses of case studies, there is no need to
generalize. Nonetheless, we find that there are steps that can be taken to generalize from
case studies when this is desired. We did not give particular emphasis to the popular idea
that case studies are inexpensive to conduct, because issues of research management common
to all designs were outside the scope of our work. One thing that should emerge quite
clearly from the discussion of design features intrinsic to the case study, however, is
that it can be a rather costly endeavor, given the time required, the rich in-depth nature
of the information sought, and the need to achieve credibility. This reinforces the
importance of weighing carefully the decision to employ the case study method in program
evaluation.
In this paper, we have taken positions on many issues, expecting to revise these as
experience accumulates and as we receive reactions from evaluators and researchers. This
paper is intended to transfer what we believe to be good practice in case studies and to
help establish the principles of applying case studies to evaluation. Thus, while the
document offers preliminary guidance, it is also a point of departure. For example, we are
developing the variation that we call the "cumulative" case study. It can entail
prospective and retrospective designs and it permits synthesis of many individual case
studies undertaken at different times and in different sites.
The quality of case studies can be variable. Some score high on reasonable tests of quality; others have lower scores. Three problems often encountered have to do with matching the question the evaluator set out to answer and the method for selecting the instances examined, reporting the basis for selecting the instances, and integrating findings across several instances when the findings in one were inconsistent with those in another.
The next sections of this paper will first present some new ways of thinking about a
familiar method, the case study, and then introduce the six applications, describing what
is required, in terms of methodology, to get the benefits case studies can offer. In the
last chapter, we turn to two basic questions: what we need to take into account with
regard to the objectivity of case studies, and what we need to take into account with
regard to their generalizability.
Chapter 2
What Are Case Studies?
Almost everyone in GAO probably has worked on a case study at one time or another yet may
be unfamiliar with what is meant, methodologically, by a case study. The methodological
meaning is important in understanding what differentiates a case study from a noncase
study and a good case study from a not-so-good case study.
What is a case study? The exercise in table 2.1 describes a job we might be asked to do and a design for it and asks you to decide whether or not this is a case study. Take about 10 minutes to think through this example and write out your answer. It is important that you try this out yourself, so please do it before continuing.
Table 2.1: What Is a Case Study? Exercise
| Item | Writing assignment |
| Exercise | Suppose GAO has been asked whether the informed consent requirements for experimentation with human subjects are being properly implemented. Suppose further that we visit three sites where humans are used as subjects for research (a hospital, a university, and a clinic) and that we review the informed consent procedures at each site. |
| Question 1 | Is this an application of the case study method? Why? |
| Question 2 | If not, would case studies be appropriate for answering the question we were asked? Why? |
| Question 3 | What is your definition of "case study"? |
The answers some GAO evaluators gave may illustrate the range of definitions surrounding
case study methods.
To some GAO evaluators, the instance was an application of the case study
method, because we were looking at only a few sites or because we could not
generalize or because "actual subjects are being used for analysis of a specific
question." To some, the instance was clearly not an application of the case study
method, because "we do not know if the instances are representative of the
universe," and "there doesn't appear to be enough done at each site." To
still others, it was not possible to tell whether this was a case study because looking at
instances was what we do in all our methods, and there was no differentiation between this
job and a compliance audit.
The definitions given also varied greatly. To one person, a case study involves looking
at individual people. To another, a case study examines a clearly defined site and reports
on that one site, so that multiple site studies would not be case studies. To another,
case studies involve getting a great deal of information about a single site or
circumstance, when generalizability isn't important. To others, "a random sample is
necessary for a case study," "case studies are nonnormative research that
investigate a situation without prejudice," "where we could look at a limited
number of cases that would represent the universe overall," and "a review of
relevant conditions in a specific environment with no attempt to project to a larger
universe." There were almost as many definitions as people, and few of them had
elements in common. While exact uniformity isn't expected or perhaps even possible when
people are asked to recall a definition, the extreme variability illustrates that we could
be talking about very different things in a proposal or report when we discuss case study
methods. Thus a decision to "do case studies" could lead to the collection of
irreconcilably dissimilar information from groups working on the same job.
What Is Meant by "A Case Study"?
We have developed a definition of case studies that leads to appropriate uses and says
something about how a good case study is conducted. It is somewhat technical, so we turn
next to giving this definition and to discussing each of its elements.
"A case study is a method for learning about a complex instance, based on a comprehensive understanding of that instance obtained by extensive description and analysis of that instance taken as a whole and in its context."
For example, if we were asked to study what caused the Three Mile Island disaster and scoped the job to describe whether required safeguards were complied with, this would not be a case study. If, however, we scoped the job to examine in depth events leading up to the disaster, what went wrong, and why it went wrong, this would be a case study. For a second example, if we were asked to study the safety of nuclear plants in general, we might select as our method a survey of self-reported compliance with safeguards in all existing plants. This would not be a case study. If, however, we scoped the job to examine in depth recent problems in appropriately selected nuclear plants including among others Three Mile Island, seeking to understand why the safeguards either were not complied with or were not sufficient, then we would have selected the case study method to answer the question.
As we will discuss later, several methods can be used in one job; these examples are only intended to highlight what is not, and what is, a case study. Examining the elements of the definition also may help make this distinction clear.
"A complex instance" means that input and output cannot be readily or very
accurately related. There are several reasons why such a relationship might be difficult.
There could be many influences on what is happening and these influences could interact in
nonlinear ways such that a unit of change in the input can be associated with quite
different changes in the output, sometimes increasing it, sometimes decreasing it, and
sometimes having no discernible effect.
Table 2.2 gives an example of a less and a more complex instance. "Are U.S. airports
following required U.S. and international security procedures for passengers?" is a
less complex question because the criterion is fairly clear, the focus is narrow, the
influences on compliance are likely to be relatively few, and the relation of input and
output is likely to be fairly direct. Staff knowledge of procedures ought to play some
role in following these procedures, for instance.
Some questions are more complex, however, such as the question: "Are security procedures in U.S. airports sufficient to protect the safety of passengers and equipment?" This is more complex because the criterion of "sufficient protection" is much less certain; the focus is broader; the influences on actual achievement of sufficient procedures are likely to be many; and the relation of input and output is not only likely to be both direct and indirect but also difficult to measure.
The second key element in our definition is "a comprehensive understanding."
Here the situation is more straightforward. This means that the goal of a case study is to
obtain as complete a picture as possible of what is going on in an instance, and why.
The third key element, "obtained by extensive description and analysis," has
three components. These are summarized in table 2.3. Case studies involve what
methodologists call "thick" descriptions: rich, full information that should
come from multiple data sources, particularly from firsthand observations. The analysis
also is extensive, and the method compares information from different types of data
sources through a technique called "triangulation." That is, reliability of the
findings is developed through the multiple data sources within each type. This is
akin to corroboration as discussed in the General Policy Manual, chapter 8.0. The
validity of the findings, particularly validity with regard to cause and effect, is
derived from agreement among the types of data sources, together with the systematic
ruling out of alternative explanations and the explanation of "outlier" results.
Examining consistency of evidence across different types of data sources is akin to
verification. There are specialized strategies for making these comparisons, namely
pattern matching, explanation building, and thematic review. The technical how-to for
these three strategies will be summarized later in this paper. They involve techniques
such as graphic data displays, tabulations of event frequencies, and chronological or time
series orderings. Generally, data collection and analysis are concurrent and
interactive, that is, "yoked" in case study methods.
Table 2.2: Complexity of Questions
| Example | Characteristics |
| A less complex question: "Are U.S. airports following required U.S. and international security procedures for passengers?" | Criterion is fairly clear: "required U.S. and international security procedures." Focus is narrow: "passengers." Influences on compliance are likely to be relatively few: staff knowledge of procedures and training in implementation equipment, number of staff compared to workflow, degree of supervision, staff screening and selection. Relation of input (influences on compliance) to output (that required security procedures are followed) is fairly direct. |
| A more complex question: "Are security procedures in U.S. airports sufficient to protect the safety of passengers and equipment?" | Criterion is less clear: what would be sufficient under present conditions and with existing and possible technologies? Focus is broader: passengers and equipment (although still fairly well specified). Influences on achievement of sufficient procedures are likely to be many, including the state of the art of detection technologies, number and militancy of potential threats to security, and the willingness of passengers, airline personnel, and airport personnel to accept different costs and forms of protection. Relations of input (influences on security) and output (safety) are likely to be difficult to measure and to be both indirect and direct. |
Table 2.3: Methods of Obtaining Description and Analysis in Case Studies a
| Technique | Methodology |
| Extensive or "thick" analysis | Analysis of multiple types of data sources such as --interviews with all relevant persons |
| Analysis via triangulation of data | Analysis through --Pattern matching |
| Comparison of evidence for consistency | Analysis through techniques such as --Matrix of categories |
a Different types of evidence and standards for them are discussed in General Policy Manual, chapter 8.0.
The next element of the definition is "taken as a whole." As this list
indicates, the size of the instance can be as small as one individual or as large as a
nation. The instance as a whole can be
| 1 These instances have been the subject of case studies. (See U.S. General Accounting Office, February 22, 1984, and Allison, 1971.) Others are general illustrations. |
One example of a GAO case study that examines an individual is our examination of whether or not a senior official behaved improperly with regard to influence and accepting money before and since leaving the White House (U.S. General Accounting Office, July 11, 1986). Another example would be a request to examine in detail ex-President Marcos' use of funds intended by the United States for military or civilian purposes for his personal benefit. At the other extreme, an instance may be as large as an event, such as the Cuban missile crisis (Allison, 1971) and the swine flu vaccine (Neustadt and Fineberg, 1978), which have been the subjects of two well-known case studies, or the Challenger tragedy. It can be a region (Chesapeake Bay water cleanup programs), a nation (democracy in the Philippines), or an organization (UNESCO). Moreover, it is possible to have questions that require nested case studies. For example, to answer a question about how programs to serve handicapped children are working, we might select the cases of preschool and elementary programs; we might further select, within preschool programs, those for the hearing impaired and those for the orthopedically impaired. Each of these nested studies is treated, in terms of specification of the unit of study and collection of data appropriate to it, as any other case study would be.
The last key element of the definition is "and in its context." Context means
all factors that could affect what is happening in an instance. As an example, in the
Challenger tragedy, inquiry began with trying to locate the technology that failed as the
reason for the explosion. The right-hand booster rocket was identified as the source of
the explosion and, within the rocket, technological attention focused on the O-rings. The
inquiry expanded very quickly, however, from asking what technology failed to an
examination of contextual influences, such as
That is, the Challenger inquiry could be seen as similar to a case study in some ways. The rapid spread of inquiry from an examination of the technology to an investigation of decisionmaking on that flight, to inquiry about NASA management as it affected the Challenger disaster generally, is what "taking the context into account" means. In case study methods, to understand what happened and why, context always is considered, and it is this consideration that gives the case study its strength as a way of understanding cause and effect.
Some Common Benefits Expected From Case Study Evaluations
Doing a good case study is more than just looking at what is happening in a few instances.
It is a special systematic way of looking at what is happening, of selecting the
instances, collecting the data, analyzing the information, and reporting the results.
There are nine features of case study evaluations that merit special discussion. Each
of these features, if carried out, confers certain benefits in terms of the product. Two
of the features relate to design, three to data collection, three to analysis, and one to
reporting. These features and their benefits are shown in table 2.4. For example, with
regard to design, information over time (the longitudinal feature of the design) provides
assurance that the final product represents what is happening and not an atypical
situation.
Table 2.4: Some Common Benefits Expected From Case Study Evaluations
| Study feature | Benefits expected |
| Design | |
| Longitudinal | Assurance that a short-term situation that may be unrepresentative of what is happening isn't inflated in importance |
| Triangulation | Assurance that reasons given for events properly reflect influences from many different sources |
| Purposive instance | Ability to match questions asked and later generalization of findings at level appropriate to the questions |
| Data Collection | |
| Comprehensive | Assurance that important conditions, consequences, and reasons for these have not been overlooked |
| Flexible | Broader perspectives, increased assurance that what is important on the scene rather than centrally will be examined |
| Multiple data sources | Assurance that a full picture will be obtained and that bias associated with self-protection or self-interests will be reduced |
| Analysis | |
| "Yoked" or concurrent with data collections | Assurance of the ability to collect data needed to test alternative interpretations and to make rapid adjustments in design |
| Search for disproving - proving evidence | Assurance that alternative interpretations have been thoroughly searched for and checked; thorough identification of instances that don't fit the general pattern; and, often, understanding of the reasons for the outliers |
| Chain-of-evidence and pattern matching techniques | Permit fairly direct assessment of how convincingly the evidence and conclusions are related |
| Reporting | |
| Actual instances | Assurance of authenticity through persuasiveness and ease of recall; use of the tendency to generalize from personal experience but via the substitution of more objective experience for anecdotes of unknown credibility |
These features are the price of admission to the expected benefits. One frequent question about case study methods is how rigorously these features have to be followed. Obviously, the more closely the requirements are followed, the more benefits can be expected. It is a judgment call as to how much the features can be compromised before the "case study" becomes a site visit or turns into a survey. Probably the most critical features are appropriate instance selection, triangulation, and the search for disproving evidence. And of these three, probably the most critical is appropriate instance selection.
Instance Selection in Case Studies
There are three general bases for selecting instances: convenience, purpose, and
probability. Each has its function and can be used to answer certain questions. A good
case study will use a basis for instance selection that is appropriate for the question to
be answered. Using the wrong basis for selecting an instance is a fatal error in case
study designs, as in all designs. Such a case study is a not-good case study, and it is
irredeemably flawed despite any methodological virtues it may have in terms of data
collection, analysis, and reporting.
Table 2.5 summarizes the three general bases for selecting instances and the questions each basis can answer. Of particular interest may be the seven varieties of purposive site selection: bracketing, best cases, worst cases, cluster, representative, typical, and special interest.
Instance selection is crucial to generalizability and to answering the evaluation
questions appropriately. Only rarely will convenience be a sound basis for instance
selection; only rarely will probability sampling be feasible. Thus, instance selection on
the basis of the purpose of the study is the most appropriate method in many designs.
Table 2.5: Instance Selection in Case Studies
| Selection basis | When to use and what questions it can answer |
| Convenience | "In this site, selected because it was expedient for data collection purposes, what is happening, and why?" |
| Purpose | |
| Bracketing | "What is happening at extremes? What explains such differences?" |
| Best cases | "What accounts for an effective program?" |
| Worst cases | "Why isn't the program working?" |
| Cluster | "How do different types of programs compare with each other?" |
| Representative | "In instances chosen to represent important variations, what is the program like and why?" |
| Typical | "In a typical site, what is happening and why?" |
| Special Interest | "In this particular circumstance, what is happening and why?" |
| Probability | "What is happening in the program as a whole, and why?" |
The match between the question asked and the method of purposive sampling chosen can be
tricky. For example, studies that attain "representativeness" by conducting a
few case studies in a rural setting, a few in a suburban setting, and a few in an urban
setting will produce a report in which the three settings receive more or less equal
weight. If, however, 90 percent of the clients or sites for the program are rural, such
"representativeness" may appropriately capture the range of site experiences but
be rather unrepresentative of the program as a whole, and care will be needed to
generalize only to the range of settings and not to the program as a whole.
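The weighting problem can be made concrete with a small arithmetic sketch. Only the 90 percent rural share comes from the discussion above; the suburban and urban shares and the per-setting findings are invented purely for illustration and do not describe any real program.

```python
# Hypothetical share of program sites in each setting. Only the 90 percent
# rural figure comes from the text; the other two shares are assumed.
SHARE = {"rural": 0.90, "suburban": 0.05, "urban": 0.05}

# Hypothetical finding per setting (for example, percent of sites meeting
# some standard). These values are illustrative assumptions.
FINDING = {"rural": 40.0, "suburban": 70.0, "urban": 85.0}


def equal_weight_mean(finding):
    """Average the settings as if each counted equally in the report."""
    return sum(finding.values()) / len(finding)


def program_weighted_mean(finding, share):
    """Average the settings in proportion to their share of the program."""
    return sum(finding[setting] * share[setting] for setting in finding)


if __name__ == "__main__":
    print("equal weight across settings:", equal_weight_mean(FINDING))
    print("weighted by program share:", program_weighted_mean(FINDING, SHARE))
```

Under these assumed numbers the equal-weight figure is well above the program-weighted one, which is why findings from such a design are best generalized to the range of settings rather than to the program as a whole.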
Table 2.6: Hypothetical Data on Instance Selection
| Location | Operated by | Number of beds | Clientele served | Years in Operation | Funded by | Costs a | Problems b |
| 1. San Diego, CA | CAIM, Inc. | 800 | Men and boys | 2 | INS | 25 | 4 |
| 2. Amarillo, TX | CAIM, Inc. | 130 | Men and boys | 1 | INS | 30 | 4 |
| 3. El Paso, TX | PIC | 75 | Families | 3 | INS | 15 | 7 |
| 4. El Paso, TX | CAIM, Inc. | 350 | Men and boys | 1 | BOP/INS | 60 | 7 |
| 5. Miami, FL | Security | 100 | Men and boys | 1 | BOP/INS | 150 | 15 |
| 6. Clearwater, FL | CAIM, Inc. | 300 | Men and boys | 5 | BOP/INS | 100 | 10 |
| 7. Pensacola, FL | Security | 100 | Families | 5 | INS/State | 70 | 6 |
| 8. Denver, CO | PIC | 100 | Families | 3 | INS/State | 20 | 3 |
| 9. Salida, CO | Security | 200 | Men and boys | 4 | INS | 70 | 9 |
| 10. Salinas, CA | CAIM, Inc. | 100 | Men and boys | 2 | INS | 30 | 3 |
| 11. Los Angeles, CA | Security | 300 | Men and boys | 3 | INS | 75 | 5 |
| 12. San Francisco, CA | Security | 250 | Men and boys | 3 | INS/State | 70 | 7 |
| 13. San Francisco, CA | PIC | 100 | Men and boys | 3 | INS | 25 | 4 |
| 14. New York, NY | ARIVA, Inc. | 100 | Men and boys | 2 | INS | 55 | 6 |
| 15. Washington, DC | ARIVA, Inc. | 300 | Families | 2 | INS | 85 | 5 |
| 16. Seattle, WA | Security | 100 | Men and boys | 3 | INS/State | 60 | 7 |
To illustrate what each variety means, and how it might be operationalized, consider the information in table 2.6. This gives hypothetical data about a real situation in designing a study: selecting instances (in this study, sites or locations) for an assessment of the costs and operations of federal detention facilities managed by private contractors under OMB Circular A-76. There are not many such facilities, so the 16 hypothetical facilities represent what we might actually find in such a study. The following paragraphs describe what a sample would look like if it were chosen according to the bases in table 2.5.
Convenience Samples
If our location were the Denver Regional Office, a convenience sample would be sites 8
(Denver) and 9 (Salida). That is, ease of collecting data and minimizing resources
required would have driven our choice.
Purposive Sample
Bracketing
If our interests were extreme costs, numbers 3 (El Paso, at $15 per person day) and 5
(Miami, at $150 per person day) would bracket the cost extremes. If we wanted the three
least expensive and the three most expensive, we could select 3 (El Paso), 8 (Denver, at
$20), and 13 (San Francisco, at $25) in comparison to 5 (Miami, at $150), 6 (Clearwater,
at $100), and 15 (Washington, D.C., at $85). Such an addition would also give us a better
basis for analysis because it includes not only high-cost and low-cost sites but also
services to men and boys and to families, a difference that in itself might be expected to
lead to cost variations.
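Bracketing on cost can be sketched as a simple ranking of the Costs column of table 2.6. The function name and the rule of keeping every site tied at the cutoff are assumptions made for this illustration; the ranking only produces candidates, and the final choice among tied sites remains a design judgment.

```python
# Cost per person day (dollars) for the 16 hypothetical sites in table 2.6,
# keyed by site number.
COSTS = {
    1: ("San Diego", 25), 2: ("Amarillo", 30), 3: ("El Paso", 15),
    4: ("El Paso", 60), 5: ("Miami", 150), 6: ("Clearwater", 100),
    7: ("Pensacola", 70), 8: ("Denver", 20), 9: ("Salida", 70),
    10: ("Salinas", 30), 11: ("Los Angeles", 75), 12: ("San Francisco", 70),
    13: ("San Francisco", 25), 14: ("New York", 55), 15: ("Washington", 85),
    16: ("Seattle", 60),
}


def bracket(costs, k=1):
    """Return (low-cost sites, high-cost sites) for the k extremes.

    Sites tied with the k-th lowest or k-th highest cost are all kept,
    so the evaluator can see the tie and choose among them.
    """
    ordered = sorted(cost for _, cost in costs.values())
    low_cut, high_cut = ordered[k - 1], ordered[-k]
    low = sorted(s for s, (_, c) in costs.items() if c <= low_cut)
    high = sorted(s for s, (_, c) in costs.items() if c >= high_cut)
    return low, high


if __name__ == "__main__":
    print(bracket(COSTS))       # the single cheapest and costliest sites
    print(bracket(COSTS, k=3))  # three at each extreme, ties included
```

With k=3 the low-cost list contains four sites, because sites 1 and 13 both cost $25; choosing site 13 over site 1 is a judgment the ranking itself cannot make.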
Best Cases
If our interests were in operating centers with the fewest problems, we might examine
numbers 8 (Denver, 3 percent) and 10 (Salinas, 3 percent). Since both are in Colorado
(although operated by different firms and serving different groups), we might want to add
sites. Such an addition could show whether we were looking at something about Colorado
rather than about low-problem centers. We could do this by selecting 1 (San Diego, 4
percent), 2 (Amarillo, 4 percent), and 13 (San Francisco, 4 percent).
Worst Cases
Sites 5 (Miami, 15 percent problems) and 6 (Clearwater, 10 percent) stand out as worst
cases.
Selecting an out-of-state comparison, if we wanted it, is harder here. The next highest
problem rate (9, Salida, at 9 percent) is run by a different company and costs much less.
Security has a site in San Francisco, for men and boys, which costs $70 daily with a
7-percent problem rate. The costs of site 15 (Washington, D.C.) are higher, but this site
serves families and has a low problem rate. The best choice probably is 12 (San
Francisco): it serves the same group (men and boys) and is run by the same company
(Security).
Cluster
We might be interested in administrative arrangements: for example, how administration
works out when INS alone is the contractor, when responsibility is shared with another federal
agency (Bureau of Prisons), and when responsibility is shared with the state. One cluster of
sites (1, 2, 3, 8, 9, 10, 11, 13, 14, and 15) is administered by INS alone. Another cluster
(4, 5, and 6) is shared between BOP and INS, and the last cluster (7, 8, 12, and 16) is run
by INS and the state. We could pick one or two sites from each cluster to get a sense of how
agency auspices may affect program operations.
Representative
One issue we might need to examine could be efficiencies of operation, particularly in
terms of facility size. Here we might select numbers 1 (San Diego, 800 beds), 6 (Clearwater,
300 beds), and 10 (Salines, 100 beds). All are run by CAIM, and all serve men and boys. We
would have to limit our generalizations to facilities for men and boys, but these three
sites should give a good sense of the size and operations issue.
Typical
This would be a challenge. In terms of size, there is a "typical" bed size (100
beds); in terms of people served, there is a "typical" population (men and
boys); and in terms of years of operation, 3 years is "typical," with 2 years a
close runner-up. In terms of costs, however, the distribution is trimodal (that is, three
values appear about equally often), and for percent of problems, it is almost flat with two
outliers. Also, there is not a single site that matches all three "typical"
characteristics well. Miami, for example, has 100 beds and serves men and boys, but it has
been in operation only 1 year, costs $150 per person per day, and has a 15-percent problem
rate. The best approach would be to indicate that it is not possible to pick one site that
is "typical" of such distributions.
Special Interest
Any one of the 16 sites might be examined as a result of special congressional
interest. Such interest usually would be based on information extraneous to the data in
the table: a complaint might be received, for example, about conditions in the San Diego
site, or allegations might be made that the high costs of the Miami site were due to
mismanagement.
Probability Samples
Probabilistic sampling is the method of choice for answering questions about "how
much," or how extensive a problem is in a population. Properly carried out, it provides
strong generalizability and assurance of representativeness. A probability sample is one
in which all members of the
population have a known and equal chance of being selected. If we used a table of random
numbers, and selected as the first two sites those corresponding to the first two numbers
between 1 and 16 in the table, we would have selected a probability sample. Each site
would have a 1-in-16 chance of selection, and that chance would be equal among sites. A
fair objection to this statement is that the laws of probability operate on large numbers,
and selecting fewer than 30 instances does not always provide the generalizability to the
population as a whole that probability samples promise. However, in terms of actual
operations, which we want to illustrate here, the method just sketched is a probabilistic
one, and some case studies have involved 30 or more sites selected on a probabilistic
basis. (See PEMD's transfer paper entitled Using Statistical Sampling (U.S. General
Accounting Office, May 15, 1986) for more information.)
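The draw just described can be sketched in a few lines of Python. This is an illustration only: it substitutes a seeded pseudorandom generator for the printed table of random numbers mentioned above, and the function name `draw_probability_sample` and the seed value are hypothetical choices made for the example.

```python
import random


def draw_probability_sample(site_numbers, n, seed=None):
    """Draw n distinct sites by simple random sampling without replacement."""
    # Recording the seed lets a reviewer reproduce exactly the same draw.
    rng = random.Random(seed)
    return sorted(rng.sample(list(site_numbers), n))


sites = range(1, 17)  # the 16 hypothetical facilities, numbered 1 to 16
sample = draw_probability_sample(sites, n=2, seed=1990)
print("Selected sites:", sample)
```

Every site has the same chance of being drawn, which is what distinguishes this selection from the purposive bases discussed earlier. With only two of 16 sites chosen, the caution above about small numbers still applies.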
For readers who want to check out their skills in applying different types of purposive selection, appendix II gives information for a job involving the 50 states (a fairly common situation for GAO), a form for indicating which you would select for each of the seven kinds of purposive selection, and our answers, for comparison against yours.
In many jobs, what is a "case" and what dimensions are important to consider
in selection will be clear. For example, the population of detention facilities supported
by INS contracts can be defined legally (by the contract awarded), and the relevant
dimensions (length of time in operation, facility size, detainee mix) are straightforward.
There are, however, more problematic circumstances. An example would be a study of the
extent to which voluntary organizations have taken up any slack in welfare supports. What
is a voluntary organization can be defined broadly, as "any nonprofit
organization," or narrowly, as "a service-oriented group whose members do not
receive payment for their work."
Dimensions of potential relevance for the outcome of interest are many, and the empirical
basis for selecting any one dimension over others is thin. In such situations, the evaluator
can turn to past experience, a search of the appropriate theoretical as well as empirical
literature, the advice of knowledgeable persons, an examination of key issues in proposed
or pending legislation, customer guidance, and similar techniques. That is, while it is
important to recognize the difficulties, there are ways of dealing with them in case
definition.
Chapter 3
Case Study Applications
As noted earlier, there are six types of applications for case study methods: illustrative,
exploratory, critical instance, implementation, program effects, and cumulative. But case
study reports commonly use only two of the six applications: illustrative and critical
instance. Greater use could be made of the four others in selecting alternative ways of
answering questions, because these may be able to give information that is more valuable
to customers than other techniques. Also, improvements can always be made in how even the
two approaches already used frequently are carried out, especially in the area of
selecting instances for study. The next sections summarize, for each of the six types, the
evaluation questions they can answer, the functions they perform, their design features,
and their pitfalls. The last section shows what basis for selecting sites is appropriate
for each of the six applications.
Illustrative
As table 3.1 indicates, illustrative case studies primarily describe what is happening and
why, in one or two instances, to show what a situation is like. This can help in the
interpretation of other data, particularly if we have reason to believe most readers know
too little about a program or situation to understand fully the information from surveys
or other methods.
Table 3.1: Illustrative Case Studies
| Aspect examined | Characteristics |
| Evaluation questions | Help interpret other data when there is reason to believe that readers know too little about a program; descriptive, often used in conjunction with other methods. |
| Functions | Make the unfamiliar familiar; provide surrogate experience; avoid over-simplification of reality; and give reader a common language about the topic. |
| Design features | Site selected as typical or representative of important variations; small number of cases to keep reader's interest; data often include visual evidence; analysis concerned with data quality and meaning; and reports use self-contained, separate narratives or descriptions. |
| Pitfalls | May be difficult to hold reader's interest while presenting in-depth information on each illustration; may not adequately represent situations where considerable diversity exists (in such situations it may be impossible to represent variety well enough to use illustrative case studies); and may not have time on-site for in-depth examination |
GAO has many examples of such illustrative use. In 1982, for instance, CED examined
housing block grants through a survey supplemented by case studies. The results of the
survey were published in the main report (U.S. General Accounting Office, December 13,
1982). For three of the sites (Pittsburgh, Seattle, and Dallas), individual reports
described what each city was like with regard to housing and housing-related activities
and how the money was used in that city and included before-and-after pictures of what
rehabilitation meant for individual neighborhoods and houses (U.S. General Accounting
Office, March 24, 1982; March 30, 1982; April 30, 1982). In a similar application, HRD
described the projects funded under the Emergency Jobs Appropriations Act of 1983 in
communities in Texas, Alabama, California, Georgia, and
Massachusetts (U.S. General Accounting Office, March 26, 1985; August 27, 1985; September
25, 1985; December 6, 1985).
Illustrative case studies are used by evaluators in other agencies. When the Department of Health and Human Services was trying out delivery of Head Start services to parents and children in their own homes, called Home Start, the Department supplemented a formal assessment of the development of the children before and after the program with case studies (High/Scope Educational Research Foundation, 1972). These case studies described what services were delivered, the conditions in rural as well as urban areas, and what the Home Start teachers did during the home visits. In general, they provided a surrogate or vicarious experience for readers who might never have visited a Head Start or a Home Start center. The case studies told, too, of the development of the program over time and helped give a realistic sense of problems in start-up and implementation, how changes in staffing were accommodated, and the impact of shifting federal guidance on efforts to carry out the program in the field.
Case studies such as these are well accepted as a valid way of amplifying a more systematic presentation via the realism and vividness of anecdotal information. There are, however, pitfalls in presenting illustrative case studies. The most serious is selecting the instances. The case or cases must adequately represent the situation or program. This is relatively easy if the program is small and homogeneous. Where considerable diversity exists, it may not be possible to select a "typical" site, and the diversity may be so great that to represent it adequately would require more case studies than most people would want to read for illustrative purposes. In the example of privately operated detention facilities, an illustrative case study might run the risk of oversimplifying a more complex situation. The example was contrived to illustrate exactly this point: that sometimes we cannot select a site that fits our needs and thus the method is not appropriate.
However, in many real-world situations, it is possible to represent diversity adequately for illustrative purposes and to obtain the benefits of this application: helping readers feel, hear, see, "be there" when this kind of surrogate site experience is necessary to undo stereotypes or explain a situation otherwise inaccessible for most people.
Such a situation might be a bilingual education class, about which stereotypes can
abound, or life aboard a nuclear-weapon-equipped submarine, a situation few readers will
ever experience themselves but may need to get a feel for in order to understand staff
selection, training, and management on modern submarines.
Exploratory
The exploratory case study is a shortened case study, undertaken before launching into a
large scale investigation. Its function is to develop the evaluation questions, measures,
designs, and analytic strategy for the bigger study. As table 3.2 indicates, it is most
helpful where considerable uncertainty exists about program operations, goals, and
results. Also, when we do not have an adequate on-the-shelf set of designs and measures,
conducting an exploratory case study before initiating a job requiring 1,000 staff days or
more can save time and money in implementation and improve the confidence we have in
our results. We can aim more precisely and hit the target more often.
Table 3.2: Exploratory Case Studies
| Aspect examined | Characteristic |
| Evaluation questions | Usually cause and effect |
| Functions | Where considerable uncertainty exists about program operations, goals, and results, exploratory case studies help identify questions, select important measurement constructs, develop actual measures for these, which can be used later in larger-scale tests; formulate expectations; safeguard investment in larger studies (for problems or programs that are not well-developed) |
| Design features | Site selected: needs at least one site that represents each important variation to make a convenience sample acceptable; number of cases sufficient to cover diversity; data focus on program operations and on-site observation, are not longitudinal but need enough time to find out what is going on; analysis is closely concurrent with field work but does not require strong chain of evidence or audit trail; reports are usually internal or parts of larger, longer reports |
| Pitfalls | Temptation to prolong the exploratory phase; site selection only for convenience, inadequate coverage of diversity; prematurity (exploratory findings released as conclusions); over-involvement in evaluator's own hunches so that initial findings are confirmed rather than tested |
Some of our scoping work already may involve exploratory case studies. For example, in GGD, a design study was done as a separate job, culminating in a briefing, prior to an in-depth study of the implementation of the Bail Reform Act of 1984. The methodology included 90 interviews, observations, and data analysis from the population of 94 court districts selected purposively for their characteristics on significant variables. Researchers and experts in the field were also interviewed. An expert panel was used to give feedback at various points to make sure we had a comprehensive picture of the situation. The product of this exploratory case study was a briefing, with the study design choices described, including detailed research questions, outlines of data sources, significant variables, extant data bases, and site selection criteria. From this, a larger study was designed to meet the needs of the requester. Other jobs may involve similar efforts that are not, however, reported as separate jobs and thus are less visible as exploratory case studies.
Also, reports that include some features of exploratory case studies have been issued by GAO. In 1985, for example, NSIAD examined emerging issues in export competition through a case study of the Brazilian market (U.S. General Accounting Office, September 26, 1985). Combining site visits to Brazil, Japan, West Germany, and France, interviews with many officials of appropriate agencies and from the private sector, examination of official government files, and a questionnaire survey of high technology firms active in the Brazilian market, the evaluators amassed a rich array of contextual and focal information and identified four trade practices considered to be key factors in export competitiveness in Brazilian markets. These were bilateral trade accords, countertrade, export financing, and compliance with trade-related industrial policy. Although to meet the requirements of the job, NSIAD did not need to test these factors for generalizability to other countries through a later study, the product would permit such testing. NSIAD is using the findings in this way, as part of its ongoing work on bilateral initiatives. Of particular methodological note in this report is the detailed explanation of why export competitiveness in Brazilian markets (the instance) was selected for the case study.
The exploratory case study has been used by agencies outside GAO. The Department of Justice, for example, supported an exploratory case study of the career criminal program (Chelimsky and Dahmann, 1980). The career criminal program aimed at "swift and certain" justice by trying to expedite and strengthen processing of individuals who had long criminal histories at the time of apprehension. The exploratory study looked in depth at four of the nine demonstration sites prior to conducting a program effects evaluation. The evaluators identified the key elements of the programs as implemented and what measurable changes were likely to occur and developed measures of the outcomes, as well as designs for testing cause and effect in the subsequent larger study (Chelimsky and Sasfy, 1976).
The greatest pitfall in the exploratory study is prematurity: that is, the findings may seem so convincing that it can be difficult to resist pressures to report on these as if they had the strength of the larger study. Also, care must be taken to scope and sequence the exploratory study so that it yields enough information to be worthwhile and in time for use in the larger study but does not unduly delay answering the questions through the larger study. In addition, it is inappropriate to use the scoping phase as an ad hoc exploratory case study accompanied by an urge to issue the product at the end of scoping, when the necessary procedures for an exploratory case study with regard to such issues as instance selection have not been followed.
Critical Instance
The critical instance is the most frequent application of the case study
method in GAO, so much so that it may be seen as a "usual GAO review" rather
than recognized as what it can be: a case study (U.S. General Accounting Office, January
22, 1981; April 23, 1982; October 30, 1985). The advantage of recognizing the approach as
an application of case study methods is that some aspects of the method (such as the close
yoking of data collection and analysis) that may not be widely used now could be applied in
a way that increases timeliness without reducing quality. (This technique, discussed in
more detail in the section on analysis, can increase efficiency by reducing collection of
data and large-scale analyses of these data that subsequently do not prove useful.)
The critical instance case study examines one, or very few, sites for one of two purposes. First, a very frequent application is the examination of a situation of unique interest, such as Three Mile Island, the Challenger disaster, or allegations concerning funding for a specific presidential campaign. There is little or no interest in generalizability. The instance is not "selected" by us; rather, we are called to it.
GAO conducts many critical instance studies. One example, already mentioned, was our review of the representation of foreign interests by former high-level government officials (U.S. General Accounting Office, July 11, 1986). Another is PEMD's review of the readiness of the Big Eye Bomb for production (U.S. General Accounting Office, May 23, 1986). Yet another is RCED's review of a construction contract award at Jean Lafitte National Historical Park (U.S. General Accounting Office, September 26, 1987) and its examination, in a separate report, of the park service actions at Delaware Water Gap National Recreation Area in awarding a lease, closing a campground, and raising a house rent (U.S. General Accounting Office, October 28, 1987).
A second, rare, application is where a highly generalized or universal assertion is being called into question, and we are able to test it through examining one instance.
In one such study, GGD examined whether national policies, procedures, and practices with regard to cargo imports were causing problems in port operations (U.S. General Accounting Office, December 1986). The Port of New York offered a critical test because, given the diversity of imports and the volume of work, if problems were occurring, they would be likely to show up clearly in this site. If no problems were observed, problems in other sites were unlikely. GGD used observations, interviews, and document analysis at three sites in the Port of New York and supplemented these with a small number of less intensive observations at other sites. The method, in this instance, was sufficient to permit recommendations that were systemwide and generalizable with the single case.
Table 3.3 summarizes the features of the critical instance case study. As noted, the method is particularly suited for answering cause-and-effect questions about the instance of concern. It provides assurance that we have not prematurely overlooked important factors, that we have not been swayed by information from limited or perhaps biased sources, and that we have taken context into account, thus giving a fair and balanced picture of the situation.
Perhaps the biggest pitfall in this application is insufficient specification of the
customer's question. That is, the job may be presented to us as if only that situation is
of concern, but the underlying question may call for a broader look at the issue. A
request to investigate the reasons for the bank failures in Ohio, for example, may reflect
an interest only in Ohio, but it could be a "tip of the iceberg" question. What
the customer may really want to know is whether other states are likely to have similar
problems. In such a situation, Ohio might be selected as a site to examine but we would
also need to look at other states or use other approaches to achieve the generalizability
needed. This then rules out the critical instance method as appropriate for this job. The
importance of probing the underlying questions in a request to achieve good specification
of the evaluation question is not unique, of course, to the critical instance case study
but it is crucial in its appropriate application.
Table 3.3: Critical Instance Case Studies
| Aspect examined | Characteristic |
| Evaluation questions | Cause and effect, usually stand alone |
| Functions | Investigation of specific problem (frequently encountered at GAO), decisive testing of universal assertion; cause-and-effect questions |
| Design features | Site selects itself in specific problem (for decisive testing, have to assume uniform system with regard to issue and so convenience sample acceptable); number of cases is usually one instance; comprehensive data for specific problem (for decisive testing, need more modeling, hypotheses, and targeting to know what to study); data analysis and collection concurrent and interactive: data feed new collection, and emphasis on ruling out alternative causes; report describes instances, presents conclusions about cause, gives evidence |
| Pitfalls | Inappropriate selection of this technique as real issue may not be specific problem (e.g. Ohio bank failure) but more general questions; premature closure may narrow causal search too early; overgeneralization from evidence |
Program Implementation
We frequently are asked whether a program has been implemented and, often,
whether implementation is in compliance with congressional intent. The program
implementation case study is helpful where enabling legislation offers considerable
flexibility. In such cases, a wide variety of expenditures or actions could be consistent
with legislation and compliance with intent may be a matter of understanding the process
by which decisions were made, who was involved, and whether the actions are meeting local
needs. One example is the 1981 legislation consolidating many small categorical grants
into larger block grants, the funds for which could be spent very flexibly.
Another situation where program implementation case studies may be called for is when
concern exists about implementation problems. In-depth, longitudinal reports of what has
happened over time and why can set a context for interpreting a finding of implementation
variability: that is, whether there seem to be basic structural problems or if the program
understandably requires time for installation, adaptation, and building an infrastructure.
In some instances, GAO has been able to follow fairly intensively the implementation of
programs or activities. One example is GGD's series of reports on how the 1980 census was
conducted. GAO evaluators, in addition to being "on the scene" due to their
location at the major audit site, accompanied enumerators into the field and examined, in
depth, Census procedures at field offices. In other instances, we have spent somewhat less
elapsed time in the field, with less direct observation, and with greater reliance on
interview and documentary evidence. In 1985, for example, RCED was asked how the
Department of the Interior was implementing the Office of Management and Budget's Circular
A-76, dealing with privatization of all appropriate services. The request overlapped with
another, similar request, which reflected a senator's special interest in
Glacier National Park in Montana. The evaluators were able to combine the jobs in a review
that eventually involved information from 8 of 17 National Park Service regional offices
and 19 of 402 field offices. The report aggregates findings across these sites and
concludes that agencies have been slow to implement the circular, although progress has
been made since 1982 (U.S. General Accounting Office, March 15, 1985).
Another example is GAO's review of 23 federal agencies' efforts to implement the Federal
Managers' Financial Integrity Act of 1982. A series of case studies, together with an
overview report, was produced. Among these, RCED's review of the Department of Commerce
implementation, to take one report, examined the actions Commerce took that were intended
to improve internal controls, such as training senior financial analysts in evaluating
applicants and borrowers in the troubled EDA business loan program and overhauling the way
in which computer resources were used for the National Weather Service. RCED also examined
the results of these efforts and highlighted priority areas for further improvement, such
as better information on results for internal management purposes.
Table 3.4 summarizes the design, data collection, analysis, and reporting features of
program implementation case studies. Usually, in such studies, generalization is wanted
and care is required to negotiate the question with the customer (best situations? worst?
typical?) and to match instance selection carefully with the questions. Unless the program
is small and homogeneous, the evaluator faces two possibilities. The first possibility is
that the number of instances will need to be fairly large in order to achieve the
generalizability wanted, and, as a consequence, skill will be needed to manage data
collection with sufficient flexibility to obtain the insights case studies offer and
sufficient structure to permit cross-site aggregation of findings. The second possibility
is that the diversity will be so great that it would be impossible to have enough
instances to meet needs for generalizability and still manage the data collection and
analysis.
Table 3.4: Program Implementation Case Studies
| Aspect examined | Characteristic |
| Evaluation questions | Descriptive, normative |
| Functions | Learn what implementation has been achieved, understand unexpected aspects; understand reasons why implementation looks the way it does; useful when enabling legislation has given flexibility |
| Design features | Site selection cannot be convenience because usually generalization wanted, and purposive sample can be typical and representative of diversity and best and worst cases; number of cases depends on program diversity since generalization usually wanted; data rely on common instruments, published documents, and observation; reports are varied in theme, site, chronology, and narration |
| Pitfalls | Bias detection methods may be inadequate; may fail to take into account diverse views about program goals and purposes; competence of all on-site observers may not be sufficiently high; can be costly due to management, data quality control, validation procedures, and analytic model (within site, cross site, etc.), and the cost may lead to cutting too many corners to maintain quality |
An important requirement for good program implementation case studies is investment of enough time on site to get longitudinal data and to obtain breadth of information. If the purpose is to report what is happening in a descriptive sense only, short site visits together with administrative records may provide adequate bases for findings. If, however, the evaluation question requires GAO to report on how satisfactory progress is or the reasons for problems in implementation, the more staff who can be on site over time, with the richest or "thickest" base for examining the situation as the many people involved see it, the sounder our causal conclusions and subsequent recommendations will be.
The multiple sites usually required for program implementation questions impose demands
on training and supervision needed for quality control. Because of tight resources, lack
of travel funds, and the need to use staff with uneven experience and skills, this becomes
critical in situations involving many evaluators working in different regions. That is,
time is needed to train staff adequately in such case study techniques as the note-taking
required for thick descriptions, which is in turn required for the content analysis of
themes in the instance. It is possible, for example, for two persons to interview the same
informant and find that one has used a one-sentence summary for a detailed, rich, 5-minute
discourse while the other captured much more of the complexity and essence of what was
said and what was happening. Table 3.5 illustrates such a difference.
Table 3.5: Illustration of Differences in Note-Taking
| Situation | Technique | Characteristic |
| In an interview with the Director of the National Science Foundation program for grants to small colleges, the following question is asked: "How does your program inform the eligible colleges of the opportunity to apply for grants?" | Rich notes | "The Director indicated that procedures have changed three times since the inception of the program. In the first 4 years, announcements were mailed to the individual named as president in the listing, for the same year, of the American Association of Small Colleges. Because applications were very sparse, with about 30% of eligible colleges applying, the procedure was changed to a two-stage mailing, first to the president to find out the name of the official in charge of federal programs and then to the official. This worked well for a 5-year period, in terms of receipt of applications from over 80% of the eligible colleges, but when overall federal funding for research was reduced, the positions of federal funding were abolished and applications fell to about 49% of eligible institutions responding. Two years ago, the decision was made to mail copies to the persons listed as chairs of the relevant science departments in each college in appropriate professional association listings. This has increased the cost of outreach by about $15,000 or about 25% more than the prior system. To date, returns are at the 80% rate again." |
| | Thin notes | "The current system is to mail copies of the announcements to the chairs of relevant science departments, such as chemistry, biology, physics, and computer science." |
Program Effects
Case studies can determine the effects of programs and reasons for success (or failure).
In 1982, for example, RCED examined the progress made since the 1970's in cleaning up the
nation's air, water, and land, finding that while strides had been made toward meeting the
established goals (cleaner air, properly treated wastewater, more drinkable water),
deadlines had been extended and unresolved issues made meeting even these deadlines
difficult (U.S. General Accounting Office, July 21, 1982). We pointed to lack of
flexibility as a source of cascading problems and delays. The bases for these conclusions
were in-depth case studies of three sites (Cleveland, Dallas, and New York City) together
with information from reports prepared by six federal agencies and by environmental
organizations and public interest groups and interviews with Environmental Protection
Agency officials. Particularly notable methodologically in this report is the integration
of case study findings with other sources of information throughout the first volume.
A PEMD report has focused on water quality: the effectiveness of efforts to improve water quality and the reasons for successes and failures. In-depth, very extensive case studies of several water catchment areas were conducted, and the final report is based on a synthesis of the findings from the case studies, another example of integration of findings across diverse sites (U.S. General Accounting Office, December 17, 1986a, b; September 19, 1986). This series of reports also is useful for illustrating the way in which causality is established in case studies: through development of internally consistent explanations of what led to what and the conscientious use of information from within the site and from contrasting sites to rule out alternative explanations.
For another example, to determine whether actions taken by the states since the mid-1970's to address medical malpractice insurance reduced insurance costs, the number of claims filed, and the average amount paid per claim, HRD conducted case studies in six selected states (Arkansas, California, Florida, Indiana, New York, and North Carolina). Work included obtaining views of organizations representing physicians, hospitals, insurers, and lawyers on perceived problems, actions taken to deal with them, results of these actions, and the need for federal involvement. Other information came from surveys of nonfederal hospitals about the sources, coverage limits, and costs and claims from leading insurers in each state and, for comparison, the same type of information from a nationwide company. The results are presented separately in six case study reports and aggregated in the overall report (U.S. General Accounting Office, December 31, 1986).
Other federal agencies have used the case study method successfully in answering program effects questions. The National Science Foundation, for example, assessed the effectiveness of a cooperative science program aimed at increasing innovation and knowledge transfer between university and industry researchers. Ten case studies were undertaken of a carefully selected group of projects that ranged from computer language systems through nuclear science to fisheries biology and chemical engineering. Of note is the methodological detail given on project selection, data collection, analysis, and case format. In a companion report, results from a survey of grant recipients are analyzed, giving both a quantitative and a qualitative sense of how the program was working. Results from the two methods were not integrated; both suggested, however, that the program was generally working well (National Science Foundation, 1984).
Table 3.6 summarizes key features of program effects case studies. Like the program
implementation case study, the evaluative question often requires generalizability and,
for a highly diverse program, it may not be possible to answer the questions adequately
and still have a manageable number of sites.
Table 3.6: Program Effects
| Aspect examined | Characteristic |
| Evaluation questions | Cause and effect; can be stand-alone or multimethod and can be conducted before, during, or after other methods |
| Functions | Determine impact and give strong inference about reasons for effects |
| Design features | Site selection depends on program diversity, cannot be used with highly diverse programs; best, worst, representative, typical, or cluster bases appropriate; must keep number of cases manageable or risk becoming minisurvey, can use survey before or after to check generalizability or mix survey with concurrent case studies selected for special purposes; data rely on observation and structured materials, often combine qualitative and quantitative data; analysis uses varying degrees of formalization around emergent or predetermined themes; reports are usually thematic and describe site differences and explain these; variation in degree of integration of data across sites and of findings from different methods |
| Pitfalls | Not collecting the right amount of data; not examining the right number of sites; insufficient supply of well-trained evaluators; difficulties in giving evaluators enough data collection latitude to obtain insight without risking bias |
There are some methodological solutions to this problem. One solution would be to conduct the case studies first in a set of sites chosen for representativeness and to verify the findings from the case study through targeted examination of administrative data, prior reports, or a survey. A second solution would be to use these other methods first. After identifying the findings of particular interest, case studies would be conducted in sites selected to maximize the ability to get the specific understanding required. Both of these approaches have been used with good effect in program evaluation.
Cumulative
This relatively new and not as yet widely used application of case study methods brings
together the findings from case studies done at different times. The applications
previously discussed that involved multisite case studies are cross-sectional: that is,
information from several sites is collected at the same time. In contrast, the cumulative
case study aggregates information from several sites collected at different and even quite
extended times.
The cumulative case study can be retrospective, aggregating information across studies done in the past, or prospective, structuring a series of investigations for different times in the future. The techniques for ensuring sufficient comparability and quality and for aggregating the information are what constitute the "cumulative" part of the methodology.
That is, the cumulative case study is similar to an evaluation synthesis, in that it is
a method for aggregating the findings of several studies. It differs from an evaluation
synthesis in that special techniques are required to aggregate the qualitative information
that often is a feature of case studies and to maintain the sense of the "instance as
a whole" in its complexity that distinguishes case studies from surveys of several
sites. For some jobs, both case study and noncase study reports can be
aggregated, each using the appropriate techniques, in order to produce capping reports or
similar products.
GAO does not appear to have done a cumulative case study using our own case study reports or other case studies. GAO reports have been used with good results, however, in cumulative case studies published by others outside GAO. One example is a book on bureaucratic failures, which is based entirely on GAO reports of management problems in different agencies over a considerable period of time (Pierce, 1981). The author began with a set of hunches or hypotheses about what can go wrong in agency management, and what would be evidence supporting (or contradicting) these hypotheses. He reviewed the GAO reports in detail, analyzed the data from each one in terms of his framework, and aggregated the results in his final chapter.
Other examples of cumulative case studies come from two international agencies. A
retrospective cumulative case study was conducted by the World Bank in its examination of
four in-depth case studies of the effectiveness of educational programs. These case
studies were intended initially as stand-alone assessments of the programs but were
brought together to learn about the effectiveness of the evaluations themselves in the
context of educational programs (Searle, 1985). A prospective cumulative case study was
commissioned by the U.S. Agency for International Development. The purpose was to identify
input and process components of economic assistance that could be quantitatively
associated with differences in outcome measures. The method was the specification of a
common set of data (both qualitative and quantitative) to be collected over a 5-year
period as projects were initiated, together with a means of coding the data across the 47
studies eventually completed. The coded results were analyzed quantitatively in the final
report (Finsterbush, 1984).
Table 3.7: Cumulative Case Studies
| Aspect examined | Characteristic |
| Evaluation questions | Cause and effect |
| Functions | Retrospective cumulation allows generalization without cost and time of conducting numerous new case studies; prospective cumulation also allows generalization without unmanageably large numbers of cases in process at any one time; strengthens inference from new studies by combining with results from older studies |
| Design features | Uses site selection and usually a large number of cases; data as reported (retrospective); usually on-site observation (prospective); backfill techniques; analysis uses case survey method to cumulate findings; possible to examine interactions directly since number of instances is large; reports may resemble evaluation syntheses |
| Pitfalls | Publication basis may severely limit generalization; inadequate or uncertain quality of original data; quality of data-reduction procedures may be very difficult to determine; the effects of changes in many contextual factors over time may be difficult to separate from effects of the programs |
Two features of the cumulative case study, shown in table 3.7, are the case survey
method just described as a means of aggregating findings (Lucas, 1974; Yin and Heald,
1975; Yin et al., 1976) and backfill techniques (Berger, 1983). The latter are helpful in
retrospective cumulation as a means of obtaining information from the authors that permits
an otherwise unusable case study to be included in the aggregation. Knowing the basis on
which the case instances were selected, for example, is crucial in cumulation; otherwise
it is not possible to know whether best case, worst case, typical, or other kinds of instances
are being aggregated. Some published case studies do not provide sufficient detail on
this. In backfilling, the evaluator might call the author, visit the author to review the
original data, or contact others who were knowledgeable about the design decisions in
order to get adequate information on instance selection.
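The interplay between case survey coding and backfilling can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the field names (`selection_basis`, `program_worked`) and the four coded cases are hypothetical and are not drawn from any of the studies cited. Each case study is coded on a common closed-ended form, cases whose basis for instance selection is unknown are set aside for backfilling, and the rest are tabulated by selection basis.

```python
from collections import Counter

# Hypothetical coded case studies; None marks information
# that the published report does not provide.
cases = [
    {"id": "case-1", "selection_basis": "best case", "program_worked": True},
    {"id": "case-2", "selection_basis": "typical", "program_worked": False},
    {"id": "case-3", "selection_basis": None, "program_worked": True},
    {"id": "case-4", "selection_basis": "typical", "program_worked": True},
]

def split_for_backfill(coded_cases):
    """Separate usable cases from those needing follow-up with the authors."""
    usable = [c for c in coded_cases if c["selection_basis"] is not None]
    needs_backfill = [c["id"] for c in coded_cases if c["selection_basis"] is None]
    return usable, needs_backfill

usable, needs_backfill = split_for_backfill(cases)

# Tabulate by selection basis so that best-case and typical
# instances are not pooled without the evaluator knowing it.
tally = Counter((c["selection_basis"], c["program_worked"]) for c in usable)

print("Needs backfill:", needs_backfill)
for (basis, worked), count in sorted(tally.items()):
    print(f"{basis:10s} worked={worked}: {count}")
```

Keeping the selection basis as an explicit coded field is a design choice in this sketch; it makes the gap visible before aggregation rather than after.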
Opinion varies as to the credibility of cumulative case studies for answering program
implementation and effects questions. One authority notes that publication biases may
favor programs that seem to work, which could lead to a misleadingly positive view
(Berger, 1983). Other experts are concerned about the quality of the original data and
analyses and problems in verifying their quality (Hoaglin et al., 1982; Yin, 1989). For
the cumulative use of GAO reports, these concerns are less important, since we already use
the "audit trail" procedures recommended in the policy and other manuals for
verification of data collection and analysis quality. We do, however, have the opposite
concern: that is, we would need to be sure there was not "bad news" selectivity
in a particular area, associated with killing jobs that did not identify problems during
scoping.
Table 3.8: Some Design Decisions in Case Study Methods
Type of question
| Design decision | Illustrative, exploratory | Critical instance | Implementation, program effects, cumulative |
| Basis for site selection | Typical, representative, cluster | Convenience, unique interest | Best-worst case, bracketing, typical, representative, cluster, probability |
| If multimethod | Concurrent | Concurrent | Before, concurrent, after |
| Prestructuring | Low, moderate | Low, moderate | Moderate, high |
| Type of data | Qualitative only, qualitative-quantitative | Qualitative only, qualitative-quantitative | Qualitative only, qualitative-quantitative, quantitative only |
| Sequence of analysis | Within sites, then across | Within sites, then across | Within sites, then across; across sites, then within; concurrent |
| Reporting | Narrative, thematic | Narrative, thematic | Thematic |
Design Decisions and Case Study Applications
In earlier sections, we discussed seven bases for purposive selection of instances and six
applications of the case study method, each of which was associated with a different
evaluation purpose or question. Bringing this information together, table 3.8 shows the
relations among case study applications and design decisions. For example, if the purpose
of the study is illustrative, an appropriate basis for site selection could be typical,
representative, or cluster; the case studies would be conducted concurrently with other
methods used in the main study; prestructuring or guidance to the evaluators in the field
would be low to moderate to permit the thickness and richness of insights needed; data
could be qualitative only or both qualitative and quantitative; the case studies probably
would be analyzed within sites only; and the reporting would probably be narrative.
Chapter 4
Data Collection and Analysis
We have said that the features distinguishing case studies from other methods are how
sites are selected, how the data are collected, and how they are analyzed. In the last
chapter, we covered instance selection. We turn now to other elements that distinguish a
case study from a not-case study and a good case study from a not-good case study. The
discussion is an introduction to the approaches.
Data Collection
In other transfer papers on program evaluation, we have emphasized the importance of
validity. Validity involves measurement and also design. A valid measure (that is, one with
construct validity) reflects what it claims to reflect and not something else. For example,
whether or not there are active opposition parties may be a more valid measure of whether
a country is a democracy than how many people vote in an election. A valid
cause-and-effect design (that is, one with internal validity) rules out alternative
explanations of results by comparing what happened with an intervention to what happened
in the absence of the intervention. For example, in a study of the effects of an
employment training program, greater employment of participants after the training than
before must be shown to be due to the training and not simply to better economic
conditions, which also could increase employment.
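The arithmetic behind the employment training example can be made concrete with a short Python sketch. All figures are hypothetical and chosen only for illustration; the comparison group is an assumption introduced here to stand in for "what happened in the absence of the intervention."

```python
# Hypothetical employment rates, in percent. Illustrative figures only.
participants = {"before": 40, "after": 55}
comparison_group = {"before": 41, "after": 50}  # similar people, no training

def change(group):
    """Percentage-point change in employment from before to after."""
    return group["after"] - group["before"]

naive_gain = change(participants)
gain_without_program = change(comparison_group)
estimated_program_effect = naive_gain - gain_without_program

print("Gain among participants:", naive_gain)
print("Gain seen without the program:", gain_without_program)
print("Gain left to attribute to training:", estimated_program_effect)
```

In this made-up case, most of the 15-point gain among participants also appears among nonparticipants, so only the remaining 6 points could plausibly be credited to the training.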
Measurement Validity
Case study methods can use two tactics for achieving measurement validity: multiple
sources of evidence and using the chain-of-evidence technique in data reduction.
Multiple Sources of Evidence
Turning first to multiple data sources: case studies require "thick"
description in order to get enough information to check for trends, to rule out competing
explanations, and to corroborate findings. Eight techniques are used (sometimes all of them
in the same study) to collect information (Neustadt and Fineberg, 1978; Yin, 1989).
Many of the eight techniques are discussed in the General Policy Manual, chapter 8.0. Of these ways, the approaches that most differentiate case studies from other techniques are direct observation and participant observation.
GAO has used both approaches in its jobs. For example, in NSIAD's study of conditions on submarines, auditors spent time aboard submarines in a variety of situations, getting firsthand knowledge of life in these vessels. Their direct observations form the primary data source for our report. We went to sea in this instance, however, in our GAO role, as auditors and evaluators and so, it could be argued, might have seen what special guests see and not what life would be like for the average sailor.
To get more authentic information, evaluators have sometimes become participants in
situations, not identified to the other persons involved as GAO staff. One example of how
we have adapted this participant-observer approach was in GGD's study of the services available to taxpayers
from IRS after IRS reduced the number of public information agents (U.S. General
Accounting Office, April 5, 1984). We developed a set of standard income tax questions
about which citizens typically would call IRS, obtained IRS agreement on the correct
answers to these questions, and then, on a probabilistic sampling basis, called IRS
offices around the country to seek help. We used names such as Gerald A. Office in these
conversations but did not say we were from GAO. We were able to report how long it took to
get the phone answered, how long it took to get information, the consistency of
information, and general helpfulness of the responding agent. Such an approach gave more
authentic information than relying only on IRS records of calls received, or a survey of
taxpayers. In the first instance, IRS would have no record of time before the person could
get through to an agent and of "discouraged callers." In the second, a survey of
taxpayers would have to be very large to get a good "hit" rate of individuals
who sought assistance, and the diversity of individual questions would have blurred
ability to interpret variation in IRS responsiveness. HRD used a similar approach in
reviewing the Social Security Administration's telephone inquiry program; over 4,000 calls
were made, with GAO personnel taking the role of ordinary citizens in asking the randomly
selected, prepared questions (U.S. General Accounting Office, August 29, 1986).
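The mechanics of such a telephone test can be sketched as follows. This is a minimal Python illustration, not a description of how either study was actually run: the office names, question labels, call results, and random seed are all hypothetical, and simple random sampling stands in for whatever probabilistic design was used.

```python
import random
from statistics import mean

# Hypothetical sampling frame and standard question set.
offices = [f"office-{i:02d}" for i in range(1, 21)]
questions = ["Q1", "Q2", "Q3", "Q4", "Q5"]

rng = random.Random(1990)                     # fixed seed so the plan is reproducible
sampled_offices = rng.sample(offices, k=5)    # sampling without replacement
call_plan = [(office, rng.choice(questions)) for office in sampled_offices]

# Hypothetical call log; in practice each record is filled in by the caller.
call_log = [
    {
        "office": office,
        "question": question,
        "seconds_to_answer": 30 + 10 * i,
        "answer_correct": i % 2 == 0,
    }
    for i, (office, question) in enumerate(call_plan)
]

average_wait = mean(record["seconds_to_answer"] for record in call_log)
share_correct = sum(record["answer_correct"] for record in call_log) / len(call_log)

print("Average seconds before the phone was answered:", average_wait)
print("Share of answers matching the agreed correct answer:", share_correct)
```

Because every caller asks from the same prepared question set, variation in the log reflects the responding office rather than differences in what was asked.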
One element of data collection that distinguishes case studies from other techniques is that comprehensiveness of interviewing is very important. In order to learn the meaning of events to those involved in them, a key element of case studies, the views of more senior officials are not given greater weight than views of less highly placed persons. In fact, a case study where the only people interviewed were senior officials would be seen as a not-good case study, in contrast to one where the views of individuals at all levels affected were obtained. For example, if we wanted to learn about how noncompetitive awards were reviewed in an agency, a good case study would obtain information from the agency head, the head of the procurement division, the inspector general's office, the contracts officer responsible for selected awards, staff involved in the reviews for these awards, counterpart persons from the contractors' procurement and program operations staff, and the legal divisions within the agency and the contractors. We might shadow several noncompetitive procurements, following their life history from initiation through actual awards, sitting in on meetings, and studying, over time, how the awards were handled.
Chain of Evidence
A chain of evidence is the sequence from observation to conclusions. In a strong chain of
evidence, an independent second evaluator could follow the first evaluator from original
observations, the "raw" or unreduced data, through all the steps of data
aggregation and analysis, and conclude that the first evaluator's findings were justified
by the evidence and fairly represented it. This requires careful organization of the files
of original observations, complete documentation of the conditions of data collection that
are relevant to the trustworthiness and credibility of the information, and making
transparent and reproducible the manner in which the evaluator moved from phase to phase
of the analysis. Some evaluators call such a procedure "building an audit trail"
and use procedures similar to indexing and referencing to establish both the construct
validity of the measures reported and the convincingness of the causal explanations
developed in the case study (Halpern, 1983). That is, they have an independent evaluator
review the equivalent of their workpapers rather than providing so much detail in the
report itself that a reader can come to the same conclusion.
Some information in a case study is likely to be judgmental, particularly when observer
and participant-observer modes of data collection are used. And the collection process
involves judgment calls about promising leads and the meaning of initial information. While
documenting the basis for judgments can be more difficult than documenting nonjudgmental
information, overall the chain of evidence or audit trail techniques should not pose any
greater difficulty for GAO evaluators than our documentation procedures for other
evaluation methods.
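One way to picture a chain of evidence is as a set of links from each finding back to identified raw observations. The Python sketch below is an assumption-laden illustration, not GAO's indexing and referencing procedure: the class names, fields, and sample records are hypothetical. It flags findings that a second evaluator could not trace back to the files.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    obs_id: str
    source: str       # e.g., interview, document review, direct observation
    conditions: str   # conditions of collection that bear on credibility
    text: str         # the raw, unreduced notes

@dataclass
class Finding:
    statement: str
    supporting_obs: list = field(default_factory=list)  # obs_id values

def untraceable_findings(findings, observations):
    """Return findings with no support or with support missing from the files."""
    known_ids = {obs.obs_id for obs in observations}
    problems = []
    for finding in findings:
        missing = [i for i in finding.supporting_obs if i not in known_ids]
        if not finding.supporting_obs or missing:
            problems.append((finding.statement, missing))
    return problems

# Hypothetical records for illustration.
observations = [
    Observation("obs-001", "interview", "on site, notes taken during interview", "raw notes"),
    Observation("obs-002", "document review", "agency files", "raw notes"),
]
findings = [
    Finding("Finding A", ["obs-001", "obs-002"]),
    Finding("Finding B", ["obs-009"]),   # cites an observation not in the files
    Finding("Finding C"),                # cites nothing
]

problems = untraceable_findings(findings, observations)
for statement, missing in problems:
    print(statement, "cannot be traced; missing:", missing)
```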
Data Analysis
Case studies, obviously, can generate a great deal of data, data that need
to be analyzed sufficiently and with appropriate techniques in order to be useful. Much is
qualitative. As table 4.1 indicates, there are six general features of data analysis. Four
are essential to case study methods: iteration, OTTR, triangulation, and ruling out rival
explanations.
A unique feature of case studies is that data collection and analysis are concurrent. In most methods, we plan for data collection, then we collect the information, then we analyze it, and then we write the report. In case studies, the data coming in are analyzed as they become available, and the emerging results are used to shape the next set of observations.
Table 4.1: Ways of Analyzing Case Study Data
| Feature | Methodology |
| Iterative | Data collection and concurrent analysis |
| OTTR | Observe, think, test, and revise |
| Triangulation | Comparison of multiple, independent sources of evidence before deciding there is a finding |
| Rival explanations | Developing alternative interpretations of findings and testing through search for confirming and disconfirming evidence until one hypothesis is confirmed and others ruled out |
| Reproducibility of findings | Establish through analysis of multiple sites and data over time |
| Plausible and complete | Data analysis ends when a plausible explanation has been developed, considering completely all the evidence |
| Specific techniques for handling multisite data sets | Matrix of categories, graphic data displays, tabulating frequency of different events, developing complex tabulations to check for relationships, and ordering information chronologically for time series analysis |
The sequence in which this takes place is the OTTR, which stands for "observe,
think, test, revise." After observations have been made in the first phase (and
during the observations, because that is a natural way for our minds to work), the
evaluators think about the meaning of the information: what does it suggest about what is
happening and why? What else could explain what is going on? The
second, or "think," phase ends with specification of what new information
would be needed to rule out alternative explanations or confirm interpretations. This
triggers the third phase: test. In this phase, the evaluator collects more information, as
required by the specifications from the "think" cycle. The data collected in the
third phase are not specified before the first phase: they emerge, often with surprises,
from the initial observations. The fourth phase is examination of the second round of data
collection and a revision of initial interpretations and expectations: the
"revise" phase. The revise phase may lead to another test phase, if information
from the second round of data collection was insufficient to rule out alternatives, or if,
during revision, new interpretations emerged. This iterative process ends when a plausible
explanation has been developed and, at the end of a "revise" phase, there are no
outlier or unexplained data, no further interpretations possible, or it is clear that
despite the most diligent search for information, more is not available to further refine
description and explanation.
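The OTTR cycle can be pictured as a loop that keeps narrowing a set of rival explanations. The Python sketch below is a deliberately simplified illustration: it assumes that each round of data collection can be summarized as the set of explanations it contradicts, and it stops when at most one explanation remains or the evidence runs out. The explanation labels and rounds are hypothetical.

```python
def ottr(candidate_explanations, evidence_rounds):
    """Narrow rival explanations over successive rounds of data collection.

    evidence_rounds: one collection per round, holding the explanations
    that the round's observations contradict.
    """
    remaining = set(candidate_explanations)
    log = []
    for round_number, contradicted in enumerate(evidence_rounds, start=1):
        # Observe and test: new data arrive. Think: what do they rule out?
        remaining = remaining - set(contradicted)
        # Revise: record the explanations still standing after this round.
        log.append((round_number, sorted(remaining)))
        if len(remaining) <= 1:
            break  # one plausible explanation left, or none
    return remaining, log

# Hypothetical rival explanations for a delay in program implementation.
explanations = ["staff shortage", "unclear guidance", "funding delay"]
rounds = [
    {"funding delay"},    # first site visit contradicts this explanation
    set(),                # second round rules nothing out
    {"staff shortage"},   # targeted follow-up contradicts this one
]

remaining, log = ottr(explanations, rounds)
print("Still plausible:", remaining)
for round_number, still_standing in log:
    print(round_number, still_standing)
```

A real analysis would also add explanations that emerge during revision; the sketch only removes them, which is one of its simplifications.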
In case study methods, causality is established through the internal consistency and
plausibility of explanation, derived additively through the OTTR sequence. This is in
considerable contrast to other evaluation methods, where control and comparison groups are
used subtractively to rule out other reasons for a finding and establish firm attribution.
Handling Multisite Data Sets
Several techniques have been developed recently for handling multisite case
study data sets. These include setting up a matrix of categories, graphic data displays,
tabulating frequencies, developing cross-tabulations, and time series analysis.
Matrix of Categories
In this technique, a coding scheme is developed prior to data collection. It is modified
during data collection and the OTTR process and finalized after the evaluation team has
read through all the case materials. The categories are related to the evaluation
subquestions; for example, if a subquestion was "How does the Immigration and
Naturalization Service monitor the conditions of confinement in privately contracted
detention facilities," coding categories might include who is responsible, how these
persons get information, what they do with information received, evidence that minimum
standards are met, evidence of shortfalls, changes over time in monitoring, and
conflicting guidance or responsibilities. These categories might be put into a matrix by
facility size or groups served. The approach is similar to content analysis, and the PEMD
transfer paper on content analysis gives further how-to information (U.S. General
Accounting Office, June 1982).
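A matrix of categories can be kept as a simple lookup from (category, column) to the coded notes filed there. The Python sketch below is an illustration under stated assumptions: three of the coding categories are paraphrased from the example above, while the facility labels, size groupings, and note texts are hypothetical placeholders.

```python
from collections import defaultdict

categories = [
    "who is responsible",
    "how information is obtained",
    "evidence of shortfalls",
]
sizes = ["small", "large"]   # hypothetical column grouping by facility size

# Hypothetical coded notes: (category, facility size, note text).
coded_notes = [
    ("who is responsible", "small", "Facility A: placeholder note"),
    ("who is responsible", "large", "Facility B: placeholder note"),
    ("evidence of shortfalls", "large", "Facility B: placeholder note"),
    ("how information is obtained", "small", "Facility A: placeholder note"),
    ("evidence of shortfalls", "large", "Facility C: placeholder note"),
]

# Each cell of the matrix holds the notes coded to that category and size.
matrix = defaultdict(list)
for category, size, note in coded_notes:
    matrix[(category, size)].append(note)

# Print the number of notes per cell; empty cells show where evidence is thin.
for category in categories:
    counts = [len(matrix[(category, size)]) for size in sizes]
    print(f"{category:30s}", dict(zip(sizes, counts)))
```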
Graphic Data Displays
This is a family of techniques, some of which have been adapted for computers and some of
which use wall-space. The evaluators immerse themselves in information on a site,
following OTTR. Their initial story of what is happening and why is displayed as a
flowchart with a series of critical paths for action. Evidence supporting the story is
arrayed in the display. The materials then are searched for counter evidence and
subsidiary or branching paths are laid out. As a satisfactory graphic is developed for one
site, the evaluators turn to the next site. The evaluators could at this point either
modify the first graphic, based on information from the second site, or prepare an
independent flowchart. In the second approach, aggregation would come after all the sites
had been charted, and the charts would be used as the data base for aggregation.
The graphic techniques can be applied to an instance as a whole or to subcomponents.
For example, if an analysis of life-threatening or fatal incidents at national parks were
needed, the evaluators might develop separate graphics for events leading up to the
incidents, the incidents themselves, and postincident actions. More complex case studies
might need several "layers" or graphics; less complex, few.
Tabulating Event Frequencies
Another technique for analyzing multisite case data is identifying events within
each case study ("meeting between Jones and Smith"; "Smith staff prepares
recommendations") and tabulating their frequency of occurrence. Such a simple
tabulation can draw the evaluator's attention to events that may be significant or to
informal networks and give a sense of actual (as contrasted to on-paper) organizational
relationships. Divergences between observed and expected patterns can be examined further
to see what happens as a result of these meetings and identify potential problem nodes:
for example, when an expected high-communication node turns out to be, relatively
speaking, a low-communication spot.
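A frequency tabulation of this kind needs little more than a counter. In the Python sketch below, the meeting records, the third party "Lee," the expected high-contact pair, and the threshold of two meetings are all hypothetical choices made for illustration.

```python
from collections import Counter

def pair(a, b):
    """Order-independent key for a contact between two parties."""
    return tuple(sorted((a, b)))

# Hypothetical meetings noted in the case materials.
observed_meetings = [
    pair("Jones", "Smith"), pair("Smith", "Jones"), pair("Jones", "Smith"),
    pair("Smith", "Lee"), pair("Jones", "Lee"),
]

# Pairs that the organization chart suggests should be in frequent contact.
expected_high_contact = {pair("Smith", "Lee")}
LOW_THRESHOLD = 2   # illustrative cutoff for "low communication"

frequency = Counter(observed_meetings)
low_spots = sorted(
    p for p in expected_high_contact if frequency[p] < LOW_THRESHOLD
)

print("Meeting counts:", dict(frequency))
print("Expected high-contact pairs observed as low:", low_spots)
```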
Complex Tabulations
Cross-tabulations of events can identify interactions and check the developing
story more formally. For example, service coordination is a popular remedy for limited
funds. An evaluator in the field may observe that coordination among local agencies funded
through the same federal agency is more frequent than coordination among local agencies
funded by different federal departments. Tabulations of actual meetings and of consequent
actions for same-agency funded and different-agency funded services can help check out
whether this impression is reliable.
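The coordination example can be checked with a two-way tabulation of funding relationship against whether a meeting led to action. The Python sketch below uses hypothetical meeting records; the category labels are assumptions introduced for illustration.

```python
from collections import Counter

# Hypothetical records: (funding relationship, did a consequent action follow?)
meetings = [
    ("same agency", True), ("same agency", True),
    ("same agency", True), ("same agency", False),
    ("different departments", True),
    ("different departments", False), ("different departments", False),
]

crosstab = Counter(meetings)

def share_with_action(funding):
    """Share of meetings in a funding group that were followed by action."""
    followed = crosstab[(funding, True)]
    total = followed + crosstab[(funding, False)]
    return followed / total if total else None

for funding in ("same agency", "different departments"):
    print(
        funding,
        "meetings:", crosstab[(funding, True)] + crosstab[(funding, False)],
        "share followed by action:", share_with_action(funding),
    )
```

With so few records the shares are only a prompt for further observation, not a test of significance.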
Time Series Analysis
Organization of information within each site by time of occurrence, coupled with a
systematic analysis of contextual influences on events, permits a nonquantitative time
series analysis for case study data. The flow of events over time for each significant
actor and for significant points in the series of events forms the organizing framework
for data analysis within each site. Such comparisons of when key actions occurred, how
well (or poorly) they were carried out, and what influenced both timing and quality of
performance can be particularly helpful in case studies of program implementation.
In some instances, only one component of a case study may be analyzed in this way. For
example, a case study of the effectiveness of a job training program might need to take
into account general economic trends, such as unemployment rates in the community. A time
series comparing local unemployment rates with placement rates for job training program
participants could be computed quantitatively and changes interpreted through the more
qualitative time series data about the program.
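The quantitative component of such a comparison can be sketched in Python as follows. The years, rates, and program note are hypothetical figures chosen for illustration; the sketch lines up year-to-year changes in the two series beside the qualitative record so that the evaluator can read them together.

```python
years = [1984, 1985, 1986, 1987]

# Hypothetical rates, in percent.
local_unemployment = {1984: 9, 1985: 8, 1986: 6, 1987: 7}
placement_rate = {1984: 50, 1985: 54, 1986: 63, 1987: 58}

# Hypothetical qualitative note from the case study's chronology.
program_events = {1986: "new employer outreach begins"}

rows = []
for previous, current in zip(years, years[1:]):
    rows.append({
        "year": current,
        "unemployment_change": local_unemployment[current] - local_unemployment[previous],
        "placement_change": placement_rate[current] - placement_rate[previous],
        "program_note": program_events.get(current, ""),
    })

for row in rows:
    print(row)
```

In these made-up figures, placements rise when unemployment falls and drop when it rises, which is the pattern an evaluator would need to separate from any effect of the program change noted for 1986.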
Basic Models for Data Analysis
Two basic models of data analysis are pattern matching and explanation
building. Pattern matching requires using past experience, logic, or theory before the job
begins to specify what we expect to find. The analysis then compares actual findings to
expectations. When the findings fit, the pattern is confirmed. When the findings don't
fit, the evaluator adjusts the expectations or elaborates them, building a subroutine that
can explain the unexpected findings. Explanation building is the inverse procedure:
starting with the observations, the evaluator develops a picture of what is happening and
why. Data are used to fill in the initial hunches, to change them, to elaborate on them.
The first strategy matches findings to hypotheses or assumptions. The second uses the data
to structure the hypotheses or assumptions.
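Pattern matching can be illustrated with a small comparison of expectations set before the job against what was observed. In the Python sketch below, the expectation labels and observed values are hypothetical; the point is only the sorting of results into confirmed, unexpected, and not yet observed.

```python
# Expectations specified before the job begins (hypothetical labels).
expected = {
    "training needs are assessed annually": True,
    "supervisors select trainees": True,
    "course quality is monitored": True,
}

# What the evaluators have observed so far (hypothetical values).
observed = {
    "training needs are assessed annually": True,
    "supervisors select trainees": False,
}

confirmed = [k for k in expected if k in observed and observed[k] == expected[k]]
unexpected = [k for k in expected if k in observed and observed[k] != expected[k]]
not_yet_observed = [k for k in expected if k not in observed]

print("Pattern confirmed:", confirmed)
print("Needs adjusted or elaborated expectations:", unexpected)
print("More data collection needed:", not_yet_observed)
```

Explanation building would run the other way: the observed entries would be gathered first, and the expected pattern would be written and rewritten to fit them.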
In either strategy, the evaluator needs to search the full data base thoroughly for disconfirming evidence, in order to avoid the pitfall of premature conclusions. Data analysis ends when the best fit possible has been reached between the observations and a statement about what they mean.
In either strategy, expectations and explanations can be expressed as themes: a job dealing with bank failures, for example, might have as themes decisions about credit risks, procedures for reviewing decisions, or controls over the accuracy and recency of information on bank solvency. A job dealing with employee training might have as themes decisions about training needs, how employees are selected for training, how course quality is monitored, or how employees and supervisors view the purpose of training.
Themes, in turn, can be analyzed within individual sites first, then findings on each
theme aggregated across sites. Alternatively, all themes within one site can be analyzed
first; then data from the second (and subsequent) sites can be examined. Theme analysis
also can proceed in matrix fashion. On the PEMD AFDC study, for example, evaluators were
assigned as site managers, responsible for understanding across themes all there was to
know about the issues for their site. They also were assigned to individual themes, such
as health and employment, responsible concurrently for looking across all sites for
information on their topic. This organization proved helpful in ensuring that reasons why
a site showed up as an outlier for a given theme could be discussed by someone who knew
the site as a whole.
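The matrix organization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the site names, the two themes, the 1-to-5 ratings, and the 1.5 standard deviation threshold are all hypothetical and do not come from the AFDC study, which this paper does not describe as using numeric ratings. The row view corresponds to the site manager's perspective, the column view to the theme lead's, and the outlier check only flags sites that the two might want to discuss together.

```python
from statistics import mean, pstdev

# Hypothetical ratings (1 = weak, 5 = strong), keyed by site, then by theme.
# All names and values are invented for illustration.
ratings = {
    "Site A": {"health": 4, "employment": 3},
    "Site B": {"health": 4, "employment": 3},
    "Site C": {"health": 1, "employment": 3},
    "Site D": {"health": 4, "employment": 4},
}


def by_site(matrix, site):
    """Row view: everything recorded for one site, across all themes."""
    return dict(matrix[site])


def by_theme(matrix, theme):
    """Column view: one theme, across every site that has data on it."""
    return {site: themes[theme] for site, themes in matrix.items() if theme in themes}


def outliers(matrix, theme, threshold=1.5):
    """Sites whose rating on a theme lies at least `threshold` population
    standard deviations from the cross-site mean. The threshold is an
    arbitrary assumption, not a recommended standard."""
    column = by_theme(matrix, theme)
    values = list(column.values())
    spread = pstdev(values)
    if spread == 0:
        return []  # every site has the same rating, so nothing stands out
    center = mean(values)
    return [site for site, value in column.items()
            if abs(value - center) / spread >= threshold]


for theme in ("health", "employment"):
    for site in outliers(ratings, theme):
        # The flag is a prompt for discussion, not a finding in itself.
        print(theme, site, by_site(ratings, site))
```

A flagged site is not an explanation. The value of the dual assignment is that the person who knows the whole site can say why it differs on that one theme.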
Pitfalls and Booby Traps
Case study methods, like any other method, offer plenty of opportunity to
go awry. Two frequent concerns are the risks in using other people's studies and in
generalizability.
Impartiality
The biggest risk when we use other people's case studies is that GAO standards of
impartiality may not have been met. There are three meanings of impartiality, one of which
does not create problems. Case studies use as data the impressions and judgments of the
evaluator, which are inherently subjective. For a case study methodologist and for GAO, if
proper care is taken, this should not be a problem. If we want to illustrate, for example,
working conditions for immigrant laborers, we can report what the thermometers registered
and we can also report, firsthand, how people were sweating and what it felt like to be
out in the fields. Such observation is part of the richness, immediacy, and
"thick" description of a case study. However, case studies, like any other
method GAO uses, have to meet two other criteria of impartiality: accuracy and lack of
bias, in the sense that the evaluator's personal, preconceived opinions about a situation
do not distort reporting and that the evaluator is scrupulously evenhanded in examining
all sides of a situation.
Some authorities on evaluation methods believe that case studies reflect the author's
values in ways that can be difficult to detect. Other experts conclude that three actions,
taken together, are sufficient safeguards for lack of bias and adequate accuracy. These
are (1) submitting reports to people from whom data were collected and printing their
critiques with the report, (2) using multiple data collection methods within case
studies, and (3) adopting the audit trail or chain-of-evidence technique. Adequate
supervisory controls also are recommended. Complying with these safeguards should give us
no major problems in our own jobs. The guidance would mainly expand the range of
reviewers. We already conduct exit conferences and, following the "Yellow Book"
and Communications Manual, submit draft reports for agency comments. We often use
multiple methods, and the audit trail technique now recommended for case study use was
itself adopted from such auditing procedures as workpapers and referencing, which are
standard practice at GAO. We also require adequate supervisory control through such
means as prompt review of workpapers. We would need to assure ourselves, however, that
case studies whose results we are going to use have adopted the same procedures for
ensuring impartiality. (Appendix III gives a checklist for reviewing proposed or completed
case studies for quality.)
Generalizability
We often are asked questions where the customer wants in-depth information that is
nationally generalizable, but frequently the issue may not yet be ripe for a national
study or we do not have the resources to collect in-depth data from nationally
representative samples. Using 4, 10, or 15 sites as case studies might be feasible, but we
would still need to be concerned about the risks in generalizability. A main point of
this paper is that generalizability depends less on the number of sites and more on the
right match between the purpose of the study and how the instances were selected, taking
into account the diversity of the programs.
An efficient example of matching a carefully specified study purpose with appropriate site selection is the GGD study of the productivity of the Social Security Administration's (SSA's) regional operations. This review examined in depth only one SSA region (U.S. General Accounting Office, September 11, 1985). Atlanta was selected because it had the best productivity among the 10 regions; if GAO could demonstrate opportunities for improvement in the most productive SSA region, then similar improvements might be possible in the less productive regions. Following the case study, an inexpensive (25-staff-day) check was made on productivity data and trends from other SSA regions, and similarities were noted. While other problems might be affecting these less productive regions, the findings from the single site plus the trends were so convincing that SSA concluded the single-instance examination had national impli