United States General Accounting Office

GAO Program Evaluation and Methodology Division

November 1990

Case Study Evaluations

                              
Preface


GAO assists congressional decisionmakers in their deliberative process by furnishing analytical information on issues and options under consideration. Many diverse methodologies are needed to develop sound and timely answers to the questions that are posed by the Congress. To provide GAO evaluators with basic information about the more commonly used methodologies, GAO's policy guidance includes documents such as methodology transfer papers and technical guidelines.

This methodology transfer paper on case study evaluations describes how GAO evaluators could use case study methods in performing our work. It describes six applications of case study methods, including the purposes and pitfalls of each, and explains similarities and differences among the six. This paper presents an evaluation perspective on case studies, defines them, and determines their appropriateness in terms of the type of evaluation question posed. The original report was authored by Lois-ellin Datta in April 1987. This reissued (1990) version supersedes the earlier edition.

Case Study Evaluations is one of a series of papers issued by the Program Evaluation and Methodology Division (PEMD). The purpose of the series is to provide GAO evaluators with guides to various aspects of audit and evaluation methodology, to illustrate applications, and to indicate where more detailed information is available.

We look forward to receiving comments from the readers of this paper. They should be addressed to Eleanor Chelimsky at 202-276-1854.

Werner Grosshans
Assistant Comptroller General
Office of Policy

Eleanor Chelimsky
Assistant Comptroller General
for Program Evaluation and Methodology

Contents

Preface

Chapter 1
Introduction

Chapter 2
What Are Case Studies?
  What Is Meant by "A Case Study"?
  Some Common Benefits Expected From Case Study Evaluations
  Instance Selection in Case Studies

Chapter 3
Case Study Applications

  Illustrative
  Exploratory
  Critical Instance
  Program Implementation
  Program Effects
  Cumulative
  Design Decisions and Case Study Applications

Chapter 4
Data Collection and Analysis

  Data Collection
  Data Analysis
  Handling Multisite Data Sets
  Basic Models for Data Analysis
  Pitfalls and Booby Traps
  Where to Go for More Information

Chapter 5
Summary

  What Are Case Studies?
  When Are Case Studies Appropriately Used in Evaluation?
  What Distinguishes a Good From a Not-Good Case Study?
  Impartiality and Generalizability

Appendixes
  Appendix I: Theory and History
  Appendix II: Site Selection Example
  Appendix III: Guidelines for Reviewing Case Study Reports

Bibliography

Glossary

Papers in This Series

Tables
  Table 2.1: What Is a Case Study? Exercise
  Table 2.2: Complexity of Questions
  Table 2.3: Methods of Obtaining Description and Analysis in Case Studies
  Table 2.4: Some Common Benefits Expected From Case Study Evaluations
  Table 2.5: Instance Selection in Case Studies
  Table 2.6: Hypothetical Data on Instance Selection
  Table 3.1: Illustrative Case Studies
  Table 3.2: Exploratory Case Studies
  Table 3.3: Critical Instance Case Studies
  Table 3.4: Program Implementation Case Studies
  Table 3.5: Illustration of Differences in Note-Taking
  Table 3.6: Program Effects Case Studies
  Table 3.7: Cumulative Case Studies
  Table 3.8: Some Design Decisions in Case Study Methods
  Table 4.1: Ways of Analyzing Case Study Data
  Table 5.1: Some Common Pitfalls in Case Study Evaluation
  Table I.1: Criteria of Good Research
  Table I.2: Evaluation Adaptations of the Research Case Study
  Table II.1: Hypothetical Data on Unfiled Corporate Income Tax Returns for 1986 State Income Tax Returns
  Table III.1: Checklist for Reviewing Case Study Reports

Abbreviations

GAO General Accounting Office
OTTR Observe, think, test, and revise
PEMD Program Evaluation and Methodology Division
SSA Social Security Administration


Chapter 1
Introduction

At his government-required anti-terrorist training session recently, a captain for a major airline said,

"The bits of information were so few and far between that people weren't even paying attention. My instructor for the eight-hour course entered the room only to change videotapes. People were talking; they were doing other things, including reading the paper." (Philadelphia Inquirer, 1986)

This is a case instance. It is an effective way of drawing attention to a problem such as training quality. Such anecdotes are remembered and they are convincing. What they are not, however, is generalizable: that is, an anecdote doesn't tell whether it is the only such instance or whether the problem is widespread. And anecdotes usually don't show the reasons for a situation, and thus are of limited value in suggesting solutions.

The challenge for evaluators is how to use those aspects of an anecdote that are effective for our work (the immediacy, the convincingness, the attention-getting quality) and, at the same time, fulfill other informational requirements for our jobs, such as generalizability and reliability. Case study methods, while not without their limitations in this regard, can help us answer this challenge.

GAO already does a lot of case studies, or at least what we ourselves call case studies in describing our methods. There are GAO case studies in many areas: urban housing, weapon systems testing, community development, military procurement contracts, influences on the Brazilian export-import balances, how programs aimed at improving water quality are working, and the implementation of block grants, to name only a few.

Most of these case studies are either "illustrative" or "critical instance" applications. The first type of application illustrates findings established by other techniques, supplementing, for example, national findings on clean air from administrative records and other sources, with in-depth description on how funds have been used and with what results in selected cities. The second type of application is in-depth analysis of a case of unique interest, such as whether funds have been awarded and managed properly in a specific community health center or if a certain former government official had done anything improper before or after leaving the government. There are, however, four other applications of case studies that are less often used at present but that could be appropriate for our jobs. In brief, the six types of case study, which we examine in chapter 3, are as follows:

  1. Illustrative. This case study is descriptive in character and intended to add realism and in-depth examples to other information about a program or policy.

  2. Exploratory. This is also a descriptive case study but is aimed at generating hypotheses for later investigation rather than illustrating.

  3. Critical instance. This examines a single instance of unique interest or serves as a critical test of an assertion about a program, problem, or strategy.

  4. Program implementation. This case study investigates operations, often at several sites, and often normatively.

  5. Program effects. This application uses the case study to examine causality and usually involves multisite, multimethod assessments.

  6. Cumulative. This brings together findings from many case studies to answer an evaluation question, whether descriptive, normative, or cause-and-effect.

Case Study Evaluations is a review of methodological issues involved in using case study evaluations. It is not a detailed guide to case study design. It does, however, explain the similarities and differences among the six kinds of case study and discusses ideas for successfully designing them. It also gives guidance to the manager who, in reviewing completed case studies, wants to assess their strengths. Finally, it presents an evaluation perspective on case studies, defining them and determining their appropriateness in terms of the type of evaluation question posed.

The methods and types of case studies outlined here are not definitive. The case study as a research method has evolved over many years of experience but evaluative use of the method has been more limited. Indeed, the history of the case study as an evaluation method is little older than a decade. Therefore, discussion of some of the applications described here is based on relatively extensive field experience (with questions in such domains as justice, education, welfare, environment, housing, and foreign aid), while the discussion of some of the other applications is based on more constrained experience.

We have paid particular attention to the conventional wisdom that case studies are always subjective and nongeneralizable. In many uses of case studies, there is no need to generalize. Nonetheless, we find that there are steps that can be taken to generalize from case studies when this is desired. We did not give particular emphasis to the popular idea that case studies are inexpensive to conduct (issues of research management common to all designs were outside the scope of our work). One thing that should emerge quite clearly from the discussion of design features intrinsic to the case study, however, is that it can be a rather costly endeavor, given the time required, the rich in-depth nature of the information sought, and the need to achieve credibility. This reinforces the importance of weighing carefully the decision to employ the case study method in program evaluation.

In this paper, we have taken positions on many issues, expecting to revise these as experience accumulates and as we receive reactions from evaluators and researchers. This paper is intended to transfer what we believe to be good practice in case studies and to help establish the principles of applying case studies to evaluation. Thus, while the document offers preliminary guidance, it is also a point of departure. For example, we are developing the variation that we call the "cumulative" case study. It can entail prospective and retrospective designs and it permits synthesis of many individual case studies undertaken at different times and in different sites.

The quality of case studies can be variable. Some score high on reasonable tests of quality; others have lower scores. Three problems often encountered have to do with matching the question the evaluator set out to answer and the method for selecting the instances examined, reporting the basis for selecting the instances, and integrating findings across several instances when the findings in one were inconsistent with those in another.

The next sections of this paper will first present some new ways of thinking about a familiar method, the case study, and then introduce the six applications, describing what is required, in terms of methodology, to get the benefits case studies can offer. In the last chapter, we turn to two basic questions: What do we need to take into account with regard to the objectivity of case studies and their generalizability?


Chapter 2
What Are Case Studies?

Almost everyone in GAO probably has worked on a case study at one time or another yet may be unfamiliar with what is meant, methodologically, by a case study. The methodological meaning is important in understanding what differentiates a case study from a noncase study and a good case study from a not-so-good case study.

What is a case study? The exercise in table 2.1 describes a job we might be asked to do and a design for it and asks you to decide whether or not this is a case study. Take about 10 minutes to think through this example and write out your answer. It is important that you try this out yourself, so please do it before continuing.

Table 2.1: What Is a Case Study? Exercise

Item: Writing assignment
Exercise: Suppose GAO has been asked whether the informed consent requirements for experimentation with human subjects are being properly implemented. Suppose further that we visit three sites where humans are used as subjects for research (a hospital, a university, and a clinic) and that we review the informed consent procedures at each site.
Question 1: Is this an application of the case study method? Why?
Question 2: If not, would case studies be appropriate for answering the question we were asked? Why?
Question 3: What is your definition of "case study"?


The answers some GAO evaluators gave may illustrate the range of definitions surrounding case study methods.

To some GAO evaluators, the instance was an application of the case study method, because we were looking at only a few sites or because we could not generalize or because "actual subjects are being used for analysis of a specific question." To some, the instance was clearly not an application of the case study method, because "we do not know if the instances are representative of the universe," and "there doesn't appear to be enough done at each site." To still others, it was not possible to tell whether this was a case study because looking at instances was what we do in all our methods, and there was no differentiation between this job and a compliance audit.

The definitions given also varied greatly. To one person, a case study involves looking at individual people. To another, a case study examines a clearly defined site and reports on that one site, so that multiple site studies would not be case studies. To another, case studies involve getting a great deal of information about a single site or circumstance, when generalizability isn't important. To others, "a random sample is necessary for a case study," "case studies are nonnormative research that investigate a situation without prejudice," "where we could look at a limited number of cases that would represent the universe overall," and "a review of relevant conditions in a specific environment with no attempt to project to a larger universe." There were almost as many definitions as people, and few of them had elements in common. While exact uniformity isn't expected or perhaps even possible when people are asked to recall a definition, the extreme variability illustrates that we could be talking about very different things in a proposal or report when we discuss case study methods. Thus a decision to "do case studies" could lead to the collection of irreconcilably dissimilar information from groups working on the same job.

What Is Meant by "A Case Study"?

We have developed a definition of case studies that leads to appropriate uses and says something about how a good case study is conducted. It is somewhat technical, so we turn next to giving this definition and to discussing each of its elements.

"A case study is a method for learning about a complex instance, based on a comprehensive understanding of that instance obtained by extensive description and analysis of that instance taken as a whole and in its context."

For example, if we were asked to study what caused the Three Mile Island disaster and scoped the job to describe whether required safeguards were complied with, this would not be a case study. If, however, we scoped the job to examine in depth events leading up to the disaster, what went wrong, and why it went wrong, this would be a case study. For a second example, if we were asked to study the safety of nuclear plants in general, we might select as our method a survey of self-reported compliance with safeguards in all existing plants. This would not be a case study. If, however, we scoped the job to examine in depth recent problems in appropriately selected nuclear plants including among others Three Mile Island, seeking to understand why the safeguards either were not complied with or were not sufficient, then we would have selected the case study method to answer the question.

As we will discuss later, several methods can be used in one job; these examples are only intended to highlight what is not, and what is, a case study. Examining the elements of the definition also may help make this distinction clear.

"A complex instance" means that input and output cannot be readily or very accurately related. There are several reasons why such a relationship might be difficult. There could be many influences on what is happening and these influences could interact in nonlinear ways such that a unit of change in the input can be associated with quite different changes in the output, sometimes increasing it, sometimes decreasing it, and sometimes having no discernible effect.

Table 2.2 gives an example of a less and a more complex instance. "Are U.S. airports following required U.S. and international security procedures for passengers?" is a less complex question because the criterion is fairly clear, the focus is narrow, the influences on compliance are likely to be relatively few, and the relation of input and output is likely to be fairly direct. Staff knowledge of procedures ought to play some role in following these procedures, for instance.

Some questions are more complex, however, such as the question: "Are security procedures in U.S. airports sufficient to protect the safety of passengers and equipment?" This is more complex because the criterion of "sufficient protection" is much less certain; the focus is broader; the influences on actual achievement of sufficient procedures are likely to be many; and the relation of input and output is not only likely to be both direct and indirect but also difficult to measure.

The second key element in our definition is "a comprehensive understanding." Here the situation is more straightforward. This means that the goal of a case study is to obtain as complete a picture as possible of what is going on in an instance, and why.

The third key element, "obtained by extensive description and analysis," has three components. These are summarized in table 2.3. Case studies involve what methodologists call "thick" descriptions: rich, full information that should come from multiple data sources, particularly from firsthand observations. The analysis also is extensive, and the method compares information from different types of data sources through a technique called "triangulation." That is, reliability of the findings is developed through the multiple data sources within each type. This is akin to corroboration as discussed in the General Policy Manual, chapter 8.0. The validity of the findings, particularly validity with regard to cause and effect, is derived from agreement among the types of data sources, together with the systematic ruling out of alternative explanations and the explanation of "outlier" results. Examining consistency of evidence across different types of data sources is akin to verification. There are specialized strategies for making these comparisons, namely, pattern matching, explanation building, and thematic review. The technical how-tos for these three strategies will be summarized later in this paper. They involve techniques such as graphic data displays, tabulations of event frequencies, and chronological or time series orderings. Generally, data collection and analysis are concurrent and interactive, that is, "yoked" in case study methods.

Table 2.2: Complexity of Questions

Example Characteristics
A less complex question

"Are U.S. airports following required U.S. and international security procedures for passengers?"

Criterion is fairly clear: "required U.S. and international security procedures"

Focus is narrow: "passengers"

Influences on compliance are likely to be relatively few: staff knowledge of procedures and training in implementation, equipment, number of staff compared to workflow, degree of supervision, staff screening and selection

Relation of input (influences on compliance) to output (that required security procedures are followed) is fairly direct.

A more complex question

"Are security procedures in U.S. airports sufficient to protect the safety of passengers and equipment?"

Criterion is less clear: what would be sufficient under present conditions and with existing and possible technologies?

Focus is broader: passengers and equipment (although still fairly well specified)

Influences on achievement of sufficient procedures likely to be many, including the state of the art of detection technologies, number and militancy of potential threats to security, and the willingness of passengers, airline personnel, and airport personnel to accept different costs and forms of protection

Relations of input (influences on security) and output (safety) likely to be difficult to measure and to be both indirect and direct.

Table 2.3: Methods of Obtaining Description and Analysis in Case Studies a

Technique Methodology
Extensive or "thick" analysis Analysis of multiple types of data sources such as

--Interviews with all relevant persons
--Observation over time
--Participant observation
--Documents
--Archives
--Physical information

Analysis via triangulation of data Analysis through

--Pattern matching
--Explanation building
--Thematic review

Comparison of evidence for consistency Analysis through techniques such as

--Matrix of categories
--Graphic data displays
--Tabulation of event frequencies
--Chronological or time series ordering

a  Different types of evidence and standards for them are discussed in General Policy Manual, chapter 8.0

The next element of the definition is "taken as a whole." As this list indicates, the size of the instance can be as small as one individual or as large as a nation. The instance as a whole can be

1 These instances have been the subject of case studies. (See U.S. General Accounting Office, February 22, 1984, and Allison, 1971.) Others are general illustrations.

One example of a GAO case study that examines an individual is our examination of whether or not a senior official behaved improperly with regard to influence and accepting money before and since leaving the White House (U.S. General Accounting Office, July 11, 1986). Another example would be a request to examine in detail ex-President Marcos' use of funds intended by the United States for military or civilian purposes for his personal benefit. At the other extreme, an instance may be as large as an event, such as the Cuban missile crisis (Allison, 1971) and the swine flu vaccine (Neustadt and Fineberg, 1978), which have been the subjects of two well-known case studies, or the Challenger tragedy. It can be a region (Chesapeake Bay water cleanup programs), a nation (democracy in the Philippines), or an organization (UNESCO). Moreover, it is possible to have questions that require nested case studies. For example, to answer a question about how programs to serve handicapped children are working, we might select the cases of preschool and elementary programs; we might further select within preschool programs, those for the hearing impaired and those for the orthopedically impaired. Each of these nested studies is treated, in terms of specification of the unit of study and collection of data appropriate to it, as any other case study would be.

The last key element of the definition is "and in its context." Context means all factors that could affect what is happening in an instance. As an example, in the Challenger tragedy, inquiry began with trying to locate the technology that failed as the reason for the explosion. The right-hand booster rocket was identified as the source of the explosion and, within the rocket, technological attention focused on the O-rings. The inquiry expanded very quickly, however, from asking what technology failed to an examination of contextual influences, such as

That is, the Challenger inquiry could be seen as similar to a case study in some ways. The rapid spread of inquiry from an examination of the technology to an investigation of decisionmaking on that flight, to inquiry about NASA management as it affected the Challenger disaster generally, is what "taking the context into account" means. In case study methods, to understand what happened and why, context always is considered, and it is this consideration that gives the case study its strength as a way of understanding cause and effect.

Some Common Benefits Expected From Case Study Evaluations

Doing a good case study is more than just looking at what is happening in a few instances. It is a special systematic way of looking at what is happening, of selecting the instances, collecting the data, analyzing the information, and reporting the results.

There are nine features of case study evaluations that merit special discussion. Each of these features-if carried out-confers certain benefits in terms of the product. Two of the features relate to design, three to data collection, three to analysis, and one to reporting. These features and their benefits are shown in table 2.4. For example, with regard to design, information over time-the longitudinal feature of the design-provides assurance that the final product represents what is happening and not an atypical situation.

Table 2.4: Some Common Benefits Expected From Case Study Evaluations

Study feature: Benefits expected

Design
  Longitudinal: Assurance that a short-term situation that may be unrepresentative of what is happening isn't inflated in importance
  Triangulation: Assurance that reasons given for events properly reflect influences from many different sources
  Purposive instance: Ability to match questions asked and later generalization of findings at level appropriate to the questions

Data collection
  Comprehensive: Assurance that important conditions, consequences, and reasons for these have not been overlooked
  Flexible: Broader perspectives, increased assurance that what is important on the scene rather than centrally will be examined
  Multiple data sources: Assurance that a full picture will be obtained and that bias associated with self-protection or self-interests will be reduced

Analysis
  "Yoked" or concurrent with data collection: Assurance of the ability to collect data needed to test alternative interpretations and to make rapid adjustments in design
  Search for disproving - proving evidence: Assurance that alternative interpretations have been thoroughly searched for and checked; thorough identification of instances that don't fit the general pattern; and, often, understanding of the reasons for the outliers
  Chain-of-evidence and pattern matching techniques: Permit fairly direct assessment of how convincingly the evidence and conclusions are related

Reporting
  Actual instances: Assurance of authenticity through persuasiveness and ease of recall; use of the tendency to generalize from personal experience but via the substitution of more objective experience for anecdotes of unknown credibility

These features are the price of admission to the expected benefits. One frequent question about case study methods is how rigorously these features have to be followed. Obviously, the more closely the requirements are followed, the more benefits can be expected. It is a judgment call as to how much the features can be compromised before the "case study" becomes a site visit or turns into a survey. Probably the most critical features are appropriate instance selection, triangulation, and the search for disproving evidence. And of these three, probably the most critical is appropriate instance selection.

Instance Selection in Case Studies

There are three general bases for selecting instances: convenience, purpose, and probability. Each has its function and can be used to answer certain questions. A good case study will use a basis for instance selection that is appropriate for the question to be answered. Using the wrong basis for selecting an instance is a fatal error in case study designs, as in all designs. Such a case study is a not-good case study, and it is irredeemably flawed despite any methodological virtues it may have in terms of data collection, analysis, and reporting.

Table 2.5 summarizes the three general bases for selecting instances and the questions each basis can answer. Of particular interest may be the seven varieties of purposive site selection: bracketing, best cases, worst cases, cluster, representative, typical, and special interest.

Instance selection is crucial to generalizability and to answering the evaluation questions appropriately. Only rarely will convenience be a sound basis for instance selection; only rarely will probability sampling be feasible. Thus, instance selection on the basis of the purpose of the study is the most appropriate method in many designs.

Table 2.5: Instance Selection in Case Studies

Selection basis: When to use and what questions it can answer

Convenience: "In this site, selected because it was expedient for data collection purposes, what is happening, and why?"
Purpose
  Bracketing: "What is happening at extremes? What explains such differences?"
  Best cases: "What accounts for an effective program?"
  Worst cases: "Why isn't the program working?"
  Cluster: "How do different types of programs compare with each other?"
  Representative: "In instances chosen to represent important variations, what is the program like and why?"
  Typical: "In a typical site, what is happening and why?"
  Special interest: "In this particular circumstance, what is happening and why?"
Probability: "What is happening in the program as a whole, and why?"

The match between the question asked and the method of purposive sampling chosen can be tricky. For example, studies that attain "representativeness" by conducting a few case studies in a rural setting, a few in a suburban setting, and a few in an urban setting will produce a report in which the three settings receive more or less equal weight. If, however, 90 percent of the clients or sites for the program are rural, such "representativeness" may appropriately capture the range of site experiences but be rather unrepresentative of the program as a whole, and care will be needed to generalize only to the range of settings and not to the program as a whole.

Table 2.6: Hypothetical Data on Instance Selection

Location               Operated by  Number   Clientele     Years in   Funded by  Costs a  Problems b
                                    of beds  served        operation
1. San Diego, CA       CAIM, Inc.   800      Men and boys  2          INS        25       4
2. Amarillo, TX        CAIM, Inc.   130      Men and boys  1          INS        30       4
3. El Paso, TX         PIC          75       Families      3          INS        15       7
4. El Paso, TX         CAIM, Inc.   350      Men and boys  1          BOP/INS    60       7
5. Miami, FL           Security     100      Men and boys  1          BOP/INS    150      15
6. Clearwater, FL      CAIM, Inc.   300      Men and boys  5          BOP/INS    100      10
7. Pensacola, FL       Security     100      Families      5          INS/State  70       6
8. Denver, CO          PIC          100      Families      3          INS/State  20       3
9. Salida, CO          Security     200      Men and boys  4          INS        70       9
10. Salinas, CO        CAIM, Inc.   100      Men and boys  2          INS        30       3
11. Los Angeles, CA    Security     300      Men and boys  3          INS        75       5
12. San Francisco, CA  Security     250      Men and boys  3          INS/State  70       7
13. San Francisco, CA  PIC          100      Men and boys  3          INS        25       4
14. New York, NY       ARIVA, Inc.  100      Men and boys  2          INS        55       6
15. Washington, DC     ARIVA, Inc.  300      Families      2          INS        85       5
16. Seattle, WA        Security     100      Men and boys  3          INS/State  60       7

a  Costs per person per day, charged by contractor to funder (hypothetical data).
b  Problem rates include all problems considered under contract as serious, such as escapes, acts of violence by or toward individuals, vandalism requiring more than $1,000 to repair, and suicides. Rates are number of such instances per 100 days per year (hypothetical data).

To illustrate what each variety means, and how it might be operationalized, consider the information in table 2.6. This gives hypothetical data about a real situation in designing a study: selecting instances (in this study, sites or locations) for an assessment of the costs and operations of federal detention facilities managed by private contractors under OMB Circular A-76. There are not many such facilities, so the 16 hypothetical facilities represent what we might actually find in such a study. The following paragraphs describe what a sample would look like if it were chosen according to the bases in table 2.5.

Convenience Samples
If our location were the Denver Regional Office, a convenience sample would be sites 8 (Denver) and 9 (Salida). That is, ease of collecting data and minimizing resources required would have driven our choice.

Purposive Sample

Bracketing
If our interests were extreme costs, numbers 3 (El Paso, at $15 per person day) and 5 (Miami, at $150 per person day) would bracket the cost extremes. If we wanted the three least expensive and the three most expensive, we could select 3 (El Paso), 8 (Denver, at $20), and 13 (San Francisco, at $25) in comparison to 5 (Miami, at $150), 6 (Clearwater, at $100), and 15 (Washington, D.C., at $85). Such an addition would also give us a better basis for analysis because it includes not only high-cost and low-cost sites but also services to men and boys and to families, a difference that in itself might be expected to lead to cost variations.
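
The bracketing choice can be sketched in a few lines of Python. This is only an illustration, assuming the hypothetical daily costs from table 2.6 are keyed by site number; the names COSTS and bracket are ours and not part of any GAO procedure. Sites tied at the cutoff are all returned, so the evaluator still has to choose among them (for example, between sites 1 and 13, both at $25).

```python
# Hypothetical cost per person per day (dollars) by site number, from table 2.6.
COSTS = {1: 25, 2: 30, 3: 15, 4: 60, 5: 150, 6: 100, 7: 70, 8: 20,
         9: 70, 10: 30, 11: 75, 12: 70, 13: 25, 14: 55, 15: 85, 16: 60}


def bracket(costs, k=1):
    """Return (low_sites, high_sites) at or beyond the k-th lowest and k-th highest cost.

    Ties at either cutoff are kept, so a list may hold more than k sites.
    """
    ordered = sorted(costs.values())
    low_cut, high_cut = ordered[k - 1], ordered[-k]
    low = sorted(site for site, cost in costs.items() if cost <= low_cut)
    high = sorted(site for site, cost in costs.items() if cost >= high_cut)
    return low, high


print(bracket(COSTS, k=1))  # ([3], [5])
print(bracket(COSTS, k=3))  # ([1, 3, 8, 13], [5, 6, 15])
```

A rule like this only ranks on cost. Judgments such as balancing clientele served across the selected sites remain with the evaluator.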

Best Cases
If our interests were in operating centers with the fewest problems, we might examine numbers 8 (Denver, 3 percent) and 10 (Salinas, 3 percent). Since both are in Colorado (although operated by different firms and serving different groups), we might want to add sites. Such an addition could show whether we were looking at something about Colorado rather than about low-problem centers. We could do this by selecting 1 (San Diego, 4 percent), 2 (Amarillo, 4 percent), and 13 (San Francisco, 4 percent).

Worst Cases
Sites 5 (Miami, 15 percent problems) and 6 (Clearwater, 10 percent) stand out as worst cases. Selecting an out-of-state comparison, if we wanted it, is harder here. The next highest problem rate (9, Salida, at 9 percent) is run by a different company and costs much less. Security has a site in San Francisco, for men and boys, which costs $70 daily with a 7-percent problem rate. The costs of site 15 (Washington, D.C.) are higher, but this site serves families and has a low problem rate. The best choice probably is 12 (San Francisco): it serves the same group (men and boys) and is run by the same company (Security).

Cluster
We might be interested in administrative arrangements: for example, how administration works out when INS alone is the contractor, when responsibility is shared with another federal agency (Bureau of Prisons), and when responsibility is shared with the state. One cluster of sites (1, 2, 3, 9, 10, 11, 13, 14, and 15) is administered by INS alone. Another cluster (4, 5, and 6) is shared between BOP and INS, and the last cluster (7, 8, 12, and 16) is run by INS and the state. We could pick one or two sites from each cluster to get a sense of how agency auspices may affect program operations.
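
Forming the clusters amounts to grouping sites on the "Funded by" column of table 2.6. The sketch below is illustrative only; FUNDED_BY and clusters are names we have made up, and the data are the hypothetical values from the table.

```python
from collections import defaultdict

# Hypothetical funding arrangement by site number, from table 2.6.
FUNDED_BY = {1: "INS", 2: "INS", 3: "INS", 4: "BOP/INS", 5: "BOP/INS",
             6: "BOP/INS", 7: "INS/State", 8: "INS/State", 9: "INS",
             10: "INS", 11: "INS", 12: "INS/State", 13: "INS", 14: "INS",
             15: "INS", 16: "INS/State"}


def clusters(funded_by):
    """Group site numbers by funding arrangement, keeping sites in numeric order."""
    groups = defaultdict(list)
    for site in sorted(funded_by):
        groups[funded_by[site]].append(site)
    return dict(groups)


for arrangement, sites in clusters(FUNDED_BY).items():
    print(arrangement, sites)
# INS [1, 2, 3, 9, 10, 11, 13, 14, 15]
# BOP/INS [4, 5, 6]
# INS/State [7, 8, 12, 16]
```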

Representative
One issue we might need to examine could be efficiencies of operation, particularly in terms of facility size. Here we might select numbers 1 (San Diego, 800 beds), 6 (Clearwater, 300 beds), and 10 (Salinas, 100 beds). All are run by CAIM, and all serve men and boys. We would have to limit our generalizations to facilities for men and boys, but these three sites should give a good sense of the size and operations issue.

Typical
This would be a challenge. In terms of size, there is a "typical" bed size (100 beds); in terms of people served, there is a "typical" population (men and boys); and in terms of years of operation, 3 years is "typical," with 2 years a close runner-up. In terms of costs, however, the distribution is trimodal (that is, three values appear about equally often), and for percent of problems, it is almost flat with two outliers. Also, there is not a single site that matches all three "typical" characteristics well. Miami, for example, has 100 beds and serves men and boys, but it has been in operation only 1 year, costs $150 per person per day, and has a 15-percent problem rate. The best approach would be to indicate that it is not possible to pick one site that is "typical" of such distributions.
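
The tallies behind the "typical" bed size and years of operation can be checked with a short script. This is a minimal sketch, assuming the values are copied in site order (1 through 16) from table 2.6; the variable names BEDS and YEARS are ours.

```python
from collections import Counter

# Hypothetical values in site order 1..16, from table 2.6.
BEDS = [800, 130, 75, 350, 100, 300, 100, 100, 200, 100, 300, 250, 100, 100, 300, 100]
YEARS = [2, 1, 3, 1, 1, 5, 5, 3, 4, 2, 3, 3, 3, 2, 2, 3]

# most_common(n) lists the n most frequent values with their counts.
print(Counter(BEDS).most_common(1))   # [(100, 7)]
print(Counter(YEARS).most_common(2))  # [(3, 6), (2, 4)]
```

A tally of this kind finds the most common value on each dimension separately. It does not by itself show whether any one site is typical on every dimension at once, which is the difficulty described above.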

Special Interest
Any one of the 16 sites might be examined as a result of special congressional interest. Such interest usually would be based on information extraneous to the data in the table: a complaint might be received, for example, about conditions in the San Diego site, or allegations might be made that the high costs of the Miami site were due to mismanagement.

Probability Samples
Probabilistic sampling is the method of choice for answering questions about "how much," or how extensive a problem is in a population. Properly carried out, it provides strong generalizability and assurance of representativeness. A probability sample is one in which all members of the population have a known and equal chance of being selected. If we used a table of random numbers, and selected as the first two sites those corresponding to the first two numbers between 1 and 16 in the table, we would have selected a probability sample. Each site would have a 1-in-16 chance of selection, and that chance would be equal among sites. A fair objection to this statement is that the laws of probability operate on large numbers, and selecting fewer than 30 instances does not always provide the generalizability to the population as a whole that probability samples promise. However, in terms of actual operations, which we want to illustrate here, the method just sketched is a probabilistic one, and some case studies have involved 30 or more sites selected on a probabilistic basis. (See PEMD's transfer paper entitled Using Statistical Sampling (U.S. General Accounting Office, May 15, 1986) for more information.)
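
The random draw just described can be imitated in software. The sketch below is an assumption-laden illustration: Python's standard pseudorandom generator stands in for the printed table of random numbers, and draw_sites and its seed parameter are hypothetical names of ours. Recording the seed lets a reviewer reproduce the draw.

```python
import random


def draw_sites(n_sites=16, k=2, seed=None):
    """Draw k distinct site numbers from 1..n_sites, each site equally likely.

    A fixed seed makes the draw repeatable for documentation purposes.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_sites + 1), k))


# Two of the 16 hypothetical sites; the result depends on the seed chosen.
print(draw_sites(seed=1990))
```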

For readers who want to check out their skills in applying different types of purposive selection, appendix II gives information for a job involving the 50 states (a fairly common situation for GAO), a form for indicating which you would select for each of the seven kinds of purposive selection, and our answers, for comparison against yours.

In many jobs, what is a "case" and what dimensions are important to consider in selection will be clear. For example, the population of detention facilities supported by INS contracts can be defined legally (by the contract awarded), and the relevant dimensions (length of time in operation, facility size, detainee mix) are straightforward. There are, however, more problematic circumstances. An example would be a study of the extent to which voluntary organizations have taken up any slack in welfare supports. What is a voluntary organization can be defined broadly, as "any nonprofit organization," or narrowly, as "a service-oriented group whose members do not receive payment for their work."

Dimensions of potential relevance for the outcome of interest are many, and the empirical bases for selecting any one dimension over others are few. In such situations, the evaluator can turn to past experience, a search of the appropriate theoretical as well as empirical literature, the advice of knowledgeable persons, an examination of key issues in proposed or pending legislation, customer guidance, and similar techniques. That is, while it is important to recognize the difficulties, there are ways of dealing with them in case definition.


Chapter 3
Case Study Applications

As noted earlier, there are six types of applications for case study methods: illustrative, exploratory, critical instance, implementation, program effects, and cumulative. But case study reports commonly use only two of the six applications: illustrative and critical instance. Greater use could be made of the four others in selecting alternative ways of answering questions, because these may be able to give information that is more valuable to customers than other techniques. Also, improvements can always be made in how even the two approaches already used frequently are carried out, especially in the area of selecting instances for study. The next sections summarize, for each of the six types, the evaluation questions they can answer, the functions they perform, their design features, and their pitfalls. The last section shows what basis for selecting sites is appropriate for each of the six applications.

Illustrative

As table 3.1 indicates, illustrative case studies primarily describe what is happening and why, in one or two instances, to show what a situation is like. This can help in the interpretation of other data, particularly if we have reason to believe most readers know too little about a program or situation to understand fully the information from surveys or other methods.

Table 3.1: Illustrative Case Studies

Aspect examined: Characteristics
Evaluation questions: Help interpret other data when there is reason to believe that readers know too little about a program; descriptive, often used in conjunction with other methods.
Functions: Make the unfamiliar familiar; provide surrogate experience; avoid over-simplification of reality; and give reader a common language about the topic.
Design features: Site selected as typical or representative of important variations; small number of cases to keep reader's interest; data often include visual evidence; analysis concerned with data quality and meaning; and reports use self-contained, separate narratives or descriptions.
Pitfalls: May be difficult to hold reader's interest while presenting in-depth information on each illustration; may not adequately represent situations where considerable diversity exists (in such situations it may be impossible to represent variety well enough to use illustrative case studies); and may not have time on-site for in-depth examination.

GAO has many examples of such illustrative use. In 1982, for instance, CED examined housing block grants through a survey supplemented by case studies. The results of the survey were published in the main report (U.S. General Accounting Office, December 13, 1982). For three of the sites (Pittsburgh, Seattle, and Dallas), individual reports described what each city was like with regard to housing and housing-related activities and how the money was used in that city and included before-and-after pictures of what rehabilitation meant for individual neighborhoods and houses (U.S. General Accounting Office, March 24, 1982; March 30, 1982; April 30, 1982). In a similar application, HRD described the projects funded under the Emergency Jobs Appropriations Act of 1983 in communities in Texas, Alabama, California, Georgia, and Massachusetts (U.S. General Accounting Office, March 26, 1985; August 27, 1985; September 25, 1985; December 6, 1985).

Illustrative case studies are used by evaluators in other agencies. When the Department of Health and Human Services was trying out delivery of Head Start services to parents and children in their own homes, called Home Start, the Department supplemented a formal assessment of the development of the children before and after the program with case studies (High/Scope Educational Research Foundation, 1972). These case studies described what services were delivered, the conditions in rural as well as urban areas, and what the Home Start teachers did during the home visits and generally provided a surrogate or vicarious experience for readers who might never have visited a Head Start or a Home Start center. The case studies told, too, of the development of the program over time and helped give a realistic sense of problems in start-up and implementation, how changes in staffing were accommodated, and the impact of shifting federal guidance on efforts to carry out the program in the field.

Case studies such as these are well accepted as a valid way of amplifying a more systematic presentation via the realism and vividness of anecdotal information. There are, however, pitfalls in presenting illustrative case studies. The most serious is selecting the instances. The case or cases must adequately represent the situation or program. This is relatively easy if the program is small and homogeneous. Where considerable diversity exists, it may not be possible to select a "typical" site, and the diversity may be so great that to represent it adequately would require more case studies than most people would want to read for illustrative purposes. In the example of privately operated detention facilities, an illustrative case study might run the risk of oversimplifying a more complex situation. The example was contrived to illustrate exactly this point: that sometimes we cannot select a site that fits our needs and thus the method is not appropriate.

However, in many real-world situations, it is possible to represent diversity adequately for illustrative purposes and to obtain the benefits of this application: helping readers feel, hear, see, "be there" when this kind of surrogate site experience is necessary to undo stereotypes or explain a situation otherwise inaccessible for most people.

Such a situation might be a bilingual education class, about which stereotypes can abound, or life aboard a nuclear-weapon-equipped submarine, a situation few readers will ever experience themselves but may need to get a feel for in order to understand staff selection, training, and management on modern submarines.

Exploratory

The exploratory case study is a shortened case study, undertaken before launching into a large-scale investigation. Its function is to develop the evaluation questions, measures, designs, and analytic strategy for the bigger study. As table 3.2 indicates, it is most helpful where considerable uncertainty exists about program operations, goals, and results. Also, when we do not have an adequate on-the-shelf set of designs and measures, conducting an exploratory case study before initiating a job requiring 1,000 staff days or more can save time and money in implementation and improve the confidence we have in our results. We can aim more precisely and hit the target more often.

Table 3.2: Exploratory Case Studies

Aspect examined: Characteristic
Evaluation questions: Usually cause and effect
Functions: Where considerable uncertainty exists about program operations, goals, and results, exploratory case studies help identify questions, select important measurement constructs, develop actual measures for these, which can be used later in larger-scale tests; formulate expectations; safeguard investment in larger studies (for problems or programs that are not well-developed)
Design features: Site selection needs at least one site that represents each important variation to make a convenience sample acceptable; number of cases sufficient to cover diversity; data focus on program operations and on-site observation, are not longitudinal but need enough time to find out what is going on; analysis is closely concurrent with field work but does not require strong chain of evidence or audit trail; reports are usually internal or parts of larger, longer reports
Pitfalls: Temptation to prolong the exploratory phase; site selection only for convenience, inadequate coverage of diversity; prematurity (exploratory findings released as conclusions); over-involvement in evaluator's own hunches so that initial findings are confirmed rather than tested

Some of our scoping work already may involve exploratory case studies. For example, in GGD, a design study was done as a separate job, culminating in a briefing, prior to an in-depth study of the implementation of the Bail Reform Act of 1984. The methodology included 90 interviews, observations, and data analysis from the population of 94 court districts selected purposively for their characteristics on significant variables. Researchers and experts in the field were also interviewed. An expert panel was used to give feedback at various points to make sure we had a comprehensive picture of the situation. The product of this exploratory case study was a briefing, with the study design choices described, including detailed research questions, outlines of data sources, significant variables, extant data bases, and site selection criteria. From this, a larger study was designed to meet the needs of the requester. Other jobs may involve similar efforts that are not, however, reported as separate jobs and thus are less visible as exploratory case studies.

Also, reports that include some features of exploratory case studies have been issued by GAO. In 1985, for example, NSIAD examined emerging issues in export competition through a case study of the Brazilian market (U.S. General Accounting Office, September 26, 1985). Combining site visits to Brazil, Japan, West Germany, and France, interviews with many officials of appropriate agencies and from the private sector, examination of official government files, and a questionnaire survey of high technology firms active in the Brazilian market, the evaluators amassed a rich array of contextual and focal information and identified four trade practices considered to be key factors in export competitiveness in Brazilian markets. These were bilateral trade accords, countertrade, export financing, and compliance with trade-related industrial policy. Although NSIAD did not need to test these factors for generalizability to other countries through a later study to meet the requirements of the job, the product would permit such testing. NSIAD is using the findings in this way, as part of its ongoing work on bilateral initiatives. Of particular methodological note in this report is the detailed explanation of why export competitiveness in Brazilian markets (the instance) was selected for the case study.

The exploratory case study has been used by agencies outside GAO. The Department of Justice, for example, supported an exploratory case study of the career criminal program (Chelimsky and Dahmann, 1980). The career criminal program aimed at "swift and certain" justice by trying to expedite and strengthen processing of individuals who had long criminal histories at the time of apprehension. The exploratory study looked in depth at four of the nine demonstration sites prior to conducting a program effects evaluation. The evaluators identified the key elements of the programs as implemented and what measurable changes were likely to occur and developed measures of the outcomes, as well as designs for testing cause and effect in the subsequent larger study (Chelimsky and Sasfy, 1976).

The greatest pitfall in the exploratory study is prematurity: that is, the findings may seem so convincing that it can be difficult to resist pressures to report on these as if they had the strength of the larger study. Also, care must be taken to scope and sequence the exploratory study so that it yields enough information to be worthwhile and in time for use in the larger study but does not unduly delay answering the questions through the larger study. In addition, it is inappropriate to use the scoping phase as an ad hoc exploratory case study accompanied by an urge to issue the product at the end of scoping, when the necessary procedures for an exploratory case study with regard to such issues as instance selection have not been followed.

Critical Instance

The critical instance is the most frequent application of the case study method in GAO, so much so that it may be seen as a "usual GAO review" rather than recognized as what it can be: a case study (U.S. General Accounting Office, January 22, 1981; April 23, 1982; October 30, 1985). The advantage of recognizing the approach as an application of case study methods is that some aspects of the method (such as the close yoking of data collection and analysis) that may not be widely used now could be applied in a way that increases timeliness without reducing quality. (This technique, discussed in more detail in the section on analysis, can increase efficiency by reducing collection of data and large-scale analyses of these data that subsequently do not prove useful.)

The critical instance case study examines one, or very few, sites for one of two purposes. First, a very frequent application is the examination of a situation of unique interest, such as Three Mile Island, the Challenger disaster, or allegations concerning funding for a specific presidential campaign. There is little or no interest in generalizability. The instance is not "selected" by us; rather, we are called to it.

GAO conducts many critical instance studies. One example, already mentioned, was our review of the representation of foreign interests by former very high government officials (U.S. General Accounting Office, July 11, 1986). Another is PEMD's review of the readiness of the Big Eye Bomb for production (U.S. General Accounting Office, May 23, 1986). Yet another is RCED's review of a construction contract award at Jean Lafitte National Historical Park (U.S. General Accounting Office, September 26, 1987) and their examination in a separate report of the park service actions at Delaware Water Gap National Recreation area in awarding a lease, closing a camp ground, and raising a house rent (U.S. General Accounting Office, October 28, 1987).

A second, rare, application is where a highly generalized or universal assertion is being called into question, and we are able to test it through examining one instance.

In one such study, GGD examined whether national policies, procedures, and practices with regard to cargo imports were causing problems in port operations (U.S. General Accounting Office, December 1986). The Port of New York offered a critical test because, given the diversity of imports and the volume of work, if problems were occurring, they would be likely to show up clearly in this site. If no problems were observed, problems in other sites were unlikely. GGD used observations, interviews, and document analysis at three sites in the Port of New York and supplemented these with a small number of less intensive observations at other sites. The method, in this instance, was sufficient to permit recommendations that were systemwide and generalizable with the single case.

Table 3.3 summarizes the features of the critical instance case study. As noted, the method is particularly suited for answering cause-and-effect questions about the instance of concern. It provides assurance that we have not prematurely overlooked important factors, that we have not been swayed by information from limited or perhaps biased sources, and that we have taken context into account, thus giving a fair and balanced picture of the situation.

Perhaps the biggest pitfall in this application is insufficient specification of the customer's question. That is, the job may be presented to us as if only that situation is of concern, but the underlying question may call for a broader look at the issue. A request to investigate the reasons for the bank failures in Ohio, for example, may reflect an interest only in Ohio, but it could be a "tip of the iceberg" question. What the customer may really want to know is whether other states are likely to have similar problems. In such a situation, Ohio might be selected as a site to examine but we would also need to look at other states or use other approaches to achieve the generalizability needed. This then rules out the critical instance method as appropriate for this job. The importance of probing the underlying questions in a request to achieve good specification of the evaluation question is not unique, of course, to the critical instance case study but it is crucial in its appropriate application.

Table 3.3: Critical Instance Case Studies

Aspect examined: Characteristic
Evaluation questions: Cause and effect, usually stand alone
Functions: Investigation of specific problem (frequently encountered at GAO), decisive testing of universal assertion; cause-and-effect questions
Design features: Site selects itself in specific problem (for decisive testing, have to assume uniform system with regard to issue and so convenience sample acceptable); number of cases is usually one instance; comprehensive data for specific problem (for decisive testing, need more modeling, hypotheses, and targeting to know what to study); data analysis and collection concurrent and interactive: data feed new collection, and emphasis on ruling out alternative causes; report describes instances, presents conclusions about cause, gives evidence
Pitfalls: Inappropriate selection of this technique as real issue may not be specific problem (e.g., Ohio bank failure) but more general questions; premature closure may narrow causal search too early; overgeneralization from evidence

Program Implementation

We frequently are asked whether a program has been implemented and, often, whether implementation is in compliance with congressional intent. The program implementation case study is helpful where enabling legislation offers considerable flexibility. In such cases, a wide variety of expenditures or actions could be consistent with legislation and compliance with intent may be a matter of understanding the process by which decisions were made, who was involved, and whether the actions are meeting local needs. One example is the 1981 legislation consolidating many small categorical grants into larger block grants, the funds for which could be spent very flexibly.

Another situation where program implementation case studies may be called for is when concern exists about implementation problems. In-depth, longitudinal reports of what has happened over time and why can set a context for interpreting a finding of implementation variability: that is, whether there seem to be basic structural problems or if the program understandably requires time for installment, adaptations, and building an infrastructure.

In some instances, GAO has been able to follow fairly intensively the implementation of programs or activities. One example is GGD's series of reports on how the 1980 census was conducted. GAO evaluators, in addition to being "on the scene" due to their location at the major audit site, accompanied enumerators into the field and examined, in depth, Census procedures at field offices. In other instances, we have spent somewhat less elapsed time in the field, with less direct observation, and with greater reliance on interview and documentary evidence. In 1985, for example, RCED was asked how the Department of Interior was implementing the Office of Management and Budget's Circular A-76, dealing with privatization of all appropriate services. The request overlapped with another similar request. This request reflected a senator's special interest in the Glacier National Park in Montana. The evaluators were able to combine the jobs in a review that eventually involved information from 8 of 17 National Park Service regional offices and 19 of 402 field offices. The report aggregates findings across these sites and concludes that agencies have been slow to implement the circular, although progress has been made since 1982 (U.S. General Accounting Office, March 15, 1985).

Another example is GAO's review of 23 federal agencies' efforts to implement the Federal Managers' Financial Integrity Act of 1982. A series of case studies, together with an overview report, was produced. Among these, RCED's review of the Department of Commerce implementation, to take one report, examined the actions Commerce took that were intended to improve internal controls, such as training senior financial analysts in evaluating applicants and borrowers in the troubled EDA business loan program and overhauling the way in which computer resources were used for the National Weather Service. RCED also examined the results of these efforts and highlighted priority areas for further improvement, such as better information on results for internal management purposes.

Table 3.4 summarizes the design, data collection, analysis, and reporting features of program implementation case studies. Usually, in such studies, generalization is wanted and care is required to negotiate the question with the customer (best situations? worst? typical?) and to match instance selection carefully with the questions. Unless the program is small and homogeneous, the evaluator faces two possibilities. The first possibility is that the number of instances will need to be fairly large in order to achieve the generalizability wanted, and, as a consequence, skill will be needed to manage data collection with sufficient flexibility to obtain the insights case studies offer and sufficient structure to permit cross-site aggregation of findings. The second possibility is that the diversity will be so great that it would be impossible to have enough instances to meet needs for generalizability and still manage the data collection and analysis.

Table 3.4: Program Implementation Case Studies

Aspect examined: Characteristic
Evaluation questions: Descriptive, normative
Functions: Learn what implementation has been achieved, understand unexpected aspects; understand reasons why implementation looks the way it does; useful when enabling legislation has given flexibility
Design features: Site selection cannot be convenience because usually generalization wanted, and purposive sample can be typical and representative of diversity and best and worst cases; number of cases depends on program diversity since generalization usually wanted; data rely on common instruments, published documents, and observation; reports are varied in theme, site, chronology, and narration
Pitfalls: Bias detection methods may be inadequate; may fail to take into account diverse views about program goals and purposes; competence of all on-site observers may not be sufficiently high; can be costly due to management, data quality control, validation procedures, and analytic model (within site, cross site, etc.), which may lead to cutting too many corners to maintain quality

An important requirement for good program implementation case studies is investment of enough time on site to get longitudinal data and to obtain breadth of information. If the purpose is to report what is happening in a descriptive sense only, short site visits together with administrative records may provide adequate bases for findings. If, however, the evaluation question requires GAO to report on how satisfactory progress is or the reasons for problems in implementation, the more staff who can be on site over time, with the richest or "thickest" base for examining the situation as the many people involved see it, the sounder our causal conclusions and subsequent recommendations will be.

The multiple sites usually required for program implementation questions impose demands on training and supervision needed for quality control. Because of tight resources, lack of travel funds, and the need to use staff with uneven experience and skills, this becomes critical in situations involving many evaluators working in different regions. That is, time is needed to train staff adequately in such case study techniques as the note-taking required for thick descriptions, which is in turn required for the content analysis of themes in the instance. It is possible, for example, for two persons to interview the same informant and find that one has used a one-sentence summary for a detailed, rich, 5-minute discourse while the other captured much more of the complexity and essence of what was said and what was happening. Table 3.5 illustrates such a difference.

Table 3.5: Illustration of Differences in Note-Taking

Situation: In an interview with the Director of the National Science Foundation program for grants to small colleges, the following question is asked: "How does your program inform the eligible colleges of the opportunity to apply for grants?"
Technique: Rich notes
Characteristic: "The Director indicated that procedures have changed three times since the inception of the program. In the first 4 years, announcements were mailed to the individual named as president in the listing, for the same year, of the American Association of Small Colleges. Because applications were very sparse, with about 30% of eligible colleges applying, the procedure was changed to a two-stage mailing, first to the president to find out the name of the official in charge of federal programs and then to the official. This worked well for a 5-year period, in terms of receipt of applications from over 80% of the eligible colleges, but when overall federal funding for research was reduced, the positions of federal funding were abolished and applications fell to about 49% of eligible institutions responding. Two years ago, the decision was made to mail copies to the persons listed as chairs of the relevant science departments in each college in appropriate professional association listings. This has increased the cost of outreach by about $15,000 or about 25% more than the prior system. To date, returns are at the 80% rate again."
Technique: Thin notes
Characteristic: "The current system is to mail copies of the announcements to the chairs of relevant science departments, such as chemistry, biology, physics, and computer science."


Program Effects

Case studies can determine the effects of programs and reasons for success (or failure). In 1982, for example, RCED examined the progress made since the 1970's in cleaning up the nation's air, water, and land, finding that while strides had been made toward meeting the established goals (cleaner air, properly treated wastewater, more drinkable water), deadlines had been extended and unresolved issues made meeting even these deadlines difficult (U.S. General Accounting Office, July 21, 1982). We pointed to lack of flexibility as a source of cascading problems and delays. The bases for these conclusions were in-depth case studies of three sites (Cleveland, Dallas, and New York City) together with information from reports prepared by six federal agencies and by environmental organizations and public interest groups and interviews with Environmental Protection Agency officials. Particularly notable methodologically in this report is the integration of case study findings with other sources of information throughout the first volume.

A PEMD report has focused on water quality: the effectiveness of efforts to improve water quality and the reasons for successes and failures. In-depth, very extensive case studies of several water catchment areas were conducted, and the final report is based on a synthesis of the findings from the case studies, another example of integration of findings across diverse sites (U.S. General Accounting Office, December 17, 1986a, b; September 19, 1986). This series of reports also is useful for illustrating the way in which causality is established in case studies: through development of internally consistent explanations of what led to what and the conscientious use of information from within the site and from contrasting sites to rule out alternative explanations.

For another example, to determine whether actions taken by the states since the mid-1970's to address medical malpractice insurance reduced insurance costs, the number of claims filed, and the average amount paid per claim, HRD conducted case studies in six selected states (Arkansas, California, Florida, Indiana, New York, and North Carolina). Work included obtaining views of organizations representing physicians, hospitals, insurers, and lawyers on perceived problems, actions taken to deal with them, results of these actions, and the need for federal involvement. Other information came from surveys of nonfederal hospitals about the sources, coverage limits, and costs and claims from leading insurers in each state and, for comparison, the same type of information from a nationwide company. The results are presented separately in six case study reports and aggregated in the overall report (U.S. General Accounting Office, December 31, 1986).

Other federal agencies have used the case study method successfully in answering program effects questions. The National Science Foundation, for example, assessed the effectiveness of a cooperative science program aimed at increasing innovation and knowledge transfer between university and industry researchers. Ten case studies were undertaken of a carefully selected group of projects that ranged from computer language systems through nuclear science to fisheries biology and chemical engineering. Of note is the methodological detail given on project selection, data collection, analysis, and case format. In a companion report, results from a survey of grant recipients are analyzed, giving both a quantitative and a qualitative sense of how the program was working. Results from the two methods were not integrated; both suggested, however, that the program was generally working well (National Science Foundation, 1984).

Table 3.6 summarizes key features of program effects case studies. Like the program implementation case study, the evaluative question often requires generalizability and, for a highly diverse program, it may not be possible to answer the questions adequately and still have a manageable number of sites.

Table 3.6: Program Effects

Aspect examined Characteristic
Evaluation questions Cause and effect; can be stand-alone or multimethod and can be conducted before, during, or after other methods
Functions Determine impact and give strong inference about reasons for effects
Design features Site selection depends on program diversity, cannot be used with highly diverse programs; best, worst, representative, typical, or cluster bases appropriate; must keep number of cases manageable or risk becoming minisurvey, can use survey before or after to check generalizability or mix survey with concurrent case studies selected for special purposes; data rely on observation and structured materials, often combine qualitative and quantitative data; analysis uses varying degrees of formalization around emergent or predetermined themes; reports are usually thematic and describe site differences and explain these; variation in degree of integration of data across sites and of findings from different methods
Pitfalls Not collecting the right amount of data; not examining the right number of sites; insufficient supply of well-trained evaluators; difficulties in giving evaluators enough data collection latitude to obtain insight without risking bias

There are some methodological solutions to this problem. One solution would be to conduct the case studies first in a set of sites chosen for representativeness and to verify the findings from the case study through targeted examination of administrative data, prior reports, or a survey. A second solution would be to use these other methods first. After identifying the findings of particular interest, case studies would be conducted in sites selected to maximize the ability to get the specific understanding required. Both of these approaches have been used with good effect in program evaluation.

Cumulative

This relatively new and not as yet widely used application of case study methods brings together the findings from case studies done at different times. The applications previously discussed that involved multisite case studies are cross-sectional: that is, information from several sites is collected at the same time. In contrast, the cumulative case study aggregates information from several sites collected at different and even quite extended times.

The cumulative case study can be retrospective, aggregating information across studies done in the past, or prospective, structuring a series of investigations for different times in the future. The techniques for ensuring sufficient comparability and quality and for aggregating the information are what constitute the "cumulative" part of the methodology.

That is, the cumulative case study is similar to an evaluation synthesis, in that it is a method for aggregating the findings of several studies. It differs from an evaluation synthesis in that special techniques are required to aggregate the qualitative information that often is a feature of case studies and to maintain the sense of the "instance as a whole" in its complexity that distinguishes case studies from surveys of several sites. For some jobs, both case study and noncase study reports can be aggregated, each using the appropriate techniques, in order to produce capping reports or similar products.

GAO does not appear to have done a cumulative case study using our own case study reports or other case studies. GAO reports have been used with good results, however, in cumulative case studies published by others outside GAO. One example is a book on bureaucratic failures, which is based entirely on GAO reports of management problems in different agencies over a considerable period of time (Pierce, 1981). The author began with a set of hunches or hypotheses about what can go wrong in agency management, and what would be evidence supporting (or contradicting) these hypotheses. He reviewed the GAO reports in detail, analyzed the data from each one in terms of his framework, and aggregated the results in his final chapter.

Other examples of cumulative case studies come from two international agencies. A retrospective cumulative case study was conducted by the World Bank in its examination of four in-depth case studies of the effectiveness of educational programs. These case studies were intended initially as stand-alone assessments of the programs but were brought together to learn about the effectiveness of the evaluations themselves in the context of educational programs (Searle, 1985). A prospective cumulative case study was commissioned by the U.S. Agency for International Development. The purpose was to identify input and process components of economic assistance that could be quantitatively associated with differences in outcome measures. The method was the specification of a common set of data (both qualitative and quantitative) to be collected over a 5-year period as projects were initiated, together with a means of coding the data across the 47 studies eventually completed. The coded results were analyzed quantitatively in the final report (Finsterbush, 1984).

Table 3.7: Cumulative Case Studies

Aspect examined Characteristic
Evaluation questions Cause and effect
Functions Retrospective cumulation allows generalization without cost and time of conducting numerous new case studies; prospective cumulation also allows generalization without unmanageably large numbers of cases in process at any one time; strengthens inference from new studies by combining with results from older studies
Design features Uses site selection and usually a large number of cases; data as reported (retrospective); usually on-site observation (prospective); backfill techniques; analysis uses case survey method to cumulate findings; possible to examine interactions directly since number of instances is large; reports may resemble evaluation syntheses
Pitfalls Publication basis may severely limit generalization; inadequate or uncertain quality of original data; quality of data-reduction procedures may be very difficult to determine; the effects of changes in many contextual factors over time may be difficult to separate from effects of the programs

Two features of the cumulative case study, shown in table 3.7, are the case survey method just described as a means of aggregating findings (Lucas, 1974; Yin and Heald, 1975; Yin et al., 1976) and backfill techniques (Berger, 1983). The latter are helpful in retrospective cumulation as a means of obtaining information from the authors that permits an otherwise unusable case study to be included in the aggregation. Knowing the basis on which the case instances were selected, for example, is crucial in cumulation; otherwise it is not possible to know whether best case, worst case, typical, or the like instances are being aggregated. Some published case studies do not provide sufficient detail on this. In backfilling, the evaluator might call the author, visit the author to review the original data, or contact others who were knowledgeable about the design decisions in order to get adequate information on instance selection.
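As an illustration only, the case survey and backfill ideas can be sketched in a few lines of Python. The study names, the coding fields, and the two helper functions below are hypothetical; none comes from the works cited. The sketch assumes each case study has already been coded on a common form, flags studies that cannot be cumulated until the basis of instance selection is known, and tallies the rest.

```python
from collections import Counter

# Hypothetical coded case studies: one dict per study, filled in from a
# common coding form. None marks information the published study omitted.
coded_cases = [
    {"study": "A", "selection_basis": "typical", "outcome": "improved"},
    {"study": "B", "selection_basis": "best case", "outcome": "improved"},
    {"study": "C", "selection_basis": None, "outcome": "no change"},
    {"study": "D", "selection_basis": "typical", "outcome": "no change"},
]


def needs_backfill(cases, field):
    """Return the studies whose value for `field` is missing."""
    return [case["study"] for case in cases if case[field] is None]


def tally_outcomes(cases, field="selection_basis"):
    """Count outcomes by selection basis, skipping cases missing the field."""
    counts = Counter()
    for case in cases:
        if case[field] is not None:
            counts[(case[field], case["outcome"])] += 1
    return counts


# Studies whose authors would have to be contacted before cumulation.
backfill = needs_backfill(coded_cases, "selection_basis")
# Tally of the studies that can be cumulated as they stand.
outcome_counts = tally_outcomes(coded_cases)
```

Keeping the selection basis in the tally makes it visible whether, say, mostly best-case instances are being cumulated.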

Opinion varies as to the credibility of cumulative case studies for answering program implementation and effects questions. One authority notes that publication biases may favor programs that seem to work, which could lead to a misleadingly positive view (Berger, 1983). Other experts are concerned about the quality of the original data and analyses and problems in verifying their quality (Hoaglin et al., 1982; Yin, 1989). For the cumulative use of GAO reports, these concerns are less important, since we already use the "audit trail" procedures recommended in the policy and other manuals for verification of data collection and analysis quality. We do, however, have the opposite concern: that is, we would need to be sure there was not "bad news" selectivity in a particular area, associated with killing jobs that did not identify problems during scoping.

Table 3.8: Some Design Decisions in Case Study Methods

Type of question

Design decision Illustrative, exploratory Critical instance Implementation, program effects, cumulative
Basis for site selection Typical, representative, cluster Convenience, unique interest Best-worst case, bracketing, typical, representative, cluster, probability
If multimethod Concurrent Concurrent Before, concurrent, after
Prestructuring Low, moderate Low, moderate Moderate, high
Type of data Qualitative only, qualitative-quantitative Qualitative only, qualitative-quantitative Qualitative only, qualitative-quantitative, quantitative only
Sequence of analysis Within sites, then across Within sites, then across Within sites, then across; across sites, then within; concurrent
Reporting Narrative, thematic Narrative, thematic Thematic


Design Decisions and Case Study Applications


In earlier sections, we discussed seven bases for purposive selection of instances and six applications of the case study method, each of which was associated with a different evaluation purpose or question. Bringing this information together, table 3.8 shows the relations among case study applications and design decisions. For example, if the purpose of the study is illustrative, an appropriate basis for site selection could be typical, representative, or cluster; the case studies would be conducted concurrently with other methods used in the main study; prestructuring or guidance to the evaluators in the field would be low to moderate to permit the thickness and richness of insights needed; data could be qualitative only or both qualitative and quantitative; the case studies probably would be analyzed within sites only; and the reporting would probably be narrative.


Chapter 4
Data Collection and Analysis

We have said that the features distinguishing case studies from other methods are how sites are selected, how the data are collected, and how they are analyzed. In the last chapter, we covered instance selection. We turn now to other elements that distinguish a case study from a not-case study and a good case study from a not-good case study. The discussion is an introduction to the approaches.

Data Collection

In other transfer papers on program evaluation, we have emphasized the importance of validity. Validity involves measurement and also design. A valid measure (that is, one with construct validity) reflects what it claims to reflect and not something else. For example, whether or not there are active opposition parties may be a more valid measure of whether a country is a democracy than how many people vote in an election. A valid cause-and-effect design (that is, one with internal validity) rules out alternative explanations of results by comparing what happened with an intervention to what happened in the absence of the intervention. For example, in a study of the effects of an employment training program, greater employment of participants after the training than before must be shown to be due to the training and not simply to better economic conditions, which also could increase employment.

Measurement Validity
Case study methods can use two tactics for achieving measurement validity: multiple sources of evidence and the chain-of-evidence technique in data reduction.

Multiple Sources of Evidence
Turning first to multiple data sources: case studies require "thick" description in order to get enough information to check for trends, to rule out competing explanations, and to corroborate findings. Eight techniques are used-sometimes all of them in the same study-to collect information (Neustadt and Fineberg, 1978; Yin, 1989).

  1. Collect physical articles.

  2. Collect documents such as contracts, memos, and reports.

  3. Examine archives such as lists of persons served, computerized order records.

  4. Conduct open-ended interviews.

  5. Conduct focused interviews.

  6. Conduct structured interviews and surveys.

  7. Undertake direct observations.

  8. Carry out participant observations.

Many of the eight techniques are discussed in the General Policy Manual, chapter 8.0. Of these ways, the approaches that most differentiate case studies from other techniques are direct observation and participant observation.

GAO has used both approaches in its jobs. For example, in NSIAD's study of conditions on submarines, auditors spent time aboard submarines in a variety of situations, getting firsthand knowledge of life in these vessels. Their direct observations form the primary data source for our report. We went to sea in this instance, however, in our GAO role, as auditors and evaluators and so, it could be argued, might have seen what special guests see and not what life would be like for the average sailor.

To get more authentic information, evaluators have sometimes become participants in situations, not identified to the other persons involved as GAO staff. One example of how we have adapted this participant-observer approach was in GGD's study of the services available to taxpayers from IRS after IRS reduced the number of public information agents (U.S. General Accounting Office, April 5, 1984). We developed a set of standard income tax questions about which citizens typically would call IRS, obtained IRS agreement on the correct answers to these questions, and then, on a probabilistic sampling basis, called IRS offices around the country to seek help. We used names such as Gerald A. Office in these conversations but did not say we were from GAO. We were able to report how long it took to get the phone answered, how long it took to get information, the consistency of information, and general helpfulness of the responding agent. Such an approach gave more authentic information than relying only on IRS records of calls received, or a survey of taxpayers. In the first instance, IRS would have no record of time before the person could get through to an agent and of "discouraged callers." In the second, a survey of taxpayers would have to be very large to get a good "hit" rate of individuals who sought assistance, and the diversity of individual questions would have blurred ability to interpret variation in IRS responsiveness. HRD used a similar approach in reviewing the Social Security Administration's telephone inquiry program; over 4,000 calls were made, with GAO personnel taking the role of ordinary citizens in asking the randomly selected, prepared questions (U.S. General Accounting Office, August 29, 1986).

One element of data collection that distinguishes case studies from other techniques is that comprehensiveness of interviewing is very important. In order to learn the meaning of events to those involved in them, a key element of case studies, the views of more senior officials are not given greater weight than views of less highly placed persons. In fact, a case study where the only people interviewed were senior officials would be seen as a not-good case study, in contrast to one where the views of individuals at all levels affected were obtained. For example, if we wanted to learn about how noncompetitive awards were reviewed in an agency, a good case study would obtain information from the agency head, the head of the procurement division, the inspector general's office, the contracts officer responsible for selected awards, staff involved in the reviews for these awards, counterpart persons from the contractors' procurement and program operations staff, and the legal divisions within the agency and the contractors. We might shadow several noncompetitive procurements, following their life history from initiation through actual awards, sitting in on meetings, and studying, over time, how the awards were handled.

Chain of Evidence
A chain of evidence is the sequence from observation to conclusions. In a strong chain of evidence, an independent second evaluator could follow the first evaluator from original observations, the "raw" or unreduced data, through all the steps of data aggregation and analysis, and conclude that the first evaluator's findings were justified by the evidence and fairly represented it. This requires careful organization of the files of original observations, complete documentation of the conditions of data collection that are relevant to the trustworthiness and credibility of the information, and making transparent and reproducible the manner in which the evaluator moved from phase to phase of the analysis. Some evaluators call such a procedure "building an audit trail" and use procedures similar to indexing and referencing to establish both the construct validity of the measures reported and the convincingness of the causal explanations developed in the case study (Halpern, 1983). That is, they have an independent evaluator review the equivalent of their workpapers rather than providing so much detail in the report itself that a reader can come to the same conclusion.

Some information in a case study is likely to be judgmental, particularly when observer and participant-observer modes of data collection are used. The collection process also involves judgment calls about promising leads and the meaning of initial information. While documenting the basis for judgments can be more difficult than documenting nonjudgmental information, overall the chain of evidence or audit trail techniques should not pose any greater difficulty for GAO evaluators than our documentation procedures for other evaluation methods.

Data Analysis

Case studies, obviously, can generate a great deal of data, data that need to be analyzed sufficiently and with appropriate techniques in order to be useful. Much is qualitative. As table 4.1 indicates, there are six general features of data analysis. Four are essential to case study methods: iteration, OTTR, triangulation, and ruling out rival explanations.

A unique feature of case studies is that data collection and analysis are concurrent. In most methods, we plan for data collection, then we collect the information, then we analyze it, and then we write the report. In case studies, the data coming in are analyzed as they become available, and the emerging results are used to shape the next set of observations.

Table 4.1: Ways of Analyzing Case Study Data

Feature Methodology
Iterative Data collection and concurrent analysis
OTTR Observe, think, test, and revise
Triangulation Comparison of multiple, independent sources of evidence before deciding there is a finding
Rival explanations Developing alternative interpretations of findings and testing through search for confirming and disconfirming evidence until one hypothesis is confirmed and others ruled out
Reproducibility of findings Establish through analysis of multiple sites and data over time
Plausible and complete Data analysis ends when a plausible explanation has been developed, considering completely all the evidence
Specific techniques for handling multisite data sets Matrix of categories, graphic data displays, tabulating frequency of different events, developing complex tabulations to check for relationships, and ordering information chronologically for time series analysis

The sequence in which this takes place is the OTTR, which stands for "observe, think, test, revise." After observations have been made in the first phase (and during the observations, because that is a natural way for our minds to work), the evaluators think about the meaning of the information: what does it suggest about what is happening and why? What else could explain what is going on? The second, or "think," phase ends with specification of what new information would be needed to rule out alternative explanations or confirm interpretations. This triggers the third phase: test. In this phase, the evaluator collects more information, as required by the specifications from the "think" cycle. The data collected in the third phase are not specified before the first phase: they emerge, often with surprises, from the initial observations. The fourth phase, the "revise" phase, is examination of the second round of data collection and a revision of initial interpretations and expectations. The revise phase may lead to another test phase, if information from the second round of data collection was insufficient to rule out alternatives, or if, during revision, new interpretations emerged. This iterative process ends when a plausible explanation has been developed and, at the end of a "revise" phase, there are no outlier or unexplained data, no further interpretations possible, or it is clear that despite the most diligent search for information, more is not available to further refine description and explanation.
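The OTTR cycle can be pictured as a loop. The Python sketch below is a toy illustration under stated assumptions, not a procedure prescribed in this paper: the two rival explanations, the indicators they predict, and the FIELD dictionary standing in for what evaluators would find on site are all hypothetical. The loop stops when only one explanation survives or when no further information is available.

```python
# Hypothetical rival explanations, each stated as the indicator values it predicts.
RIVALS = {
    "training worked": {"placements": "up", "local_unemployment": "flat"},
    "economy improved": {"placements": "up", "local_unemployment": "down"},
}

# Hypothetical stand-in for what the evaluators would find in the field.
FIELD = {"placements": "up", "local_unemployment": "flat"}


def think(evidence):
    """Keep rivals not contradicted so far; list indicators still unobserved."""
    surviving = [
        name
        for name, predicted in RIVALS.items()
        if all(evidence.get(key, value) == value for key, value in predicted.items())
    ]
    if len(surviving) <= 1:
        return surviving, []
    needed = sorted(
        {key for name in surviving for key in RIVALS[name] if key not in evidence}
    )
    return surviving, needed


def ottr(first_observation, max_rounds=5):
    """Run observe-think-test-revise until rivals are ruled out or data run dry."""
    evidence = dict(first_observation)                      # observe
    for rounds in range(1, max_rounds + 1):
        surviving, needed = think(evidence)                 # think
        if not needed:
            return surviving, rounds
        new = {key: FIELD[key] for key in needed if key in FIELD}  # test
        if not new:
            return surviving, rounds                        # no more information
        evidence.update(new)                                # revise
    return think(evidence)[0], max_rounds


explanations, rounds_used = ottr({"placements": "up"})
```

Note that the information gathered in the "test" step is not fixed in advance; it is whatever the preceding "think" step identified as needed.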

In case study methods, causality is established through the internal consistency and plausibility of explanation, derived additively through the OTTR sequence. This is in considerable contrast to other evaluation methods, where control and comparison groups are used subtractively to rule out other reasons for a finding and establish firm attribution.

Handling Multisite Data Sets

Several techniques have been developed recently for handling multisite case study data sets. These include setting up a matrix of categories, graphic data displays, tabulating frequencies, developing cross-tabulations, and time series analysis.

Matrix of Categories
In this technique, a coding scheme is developed prior to data collection. It is modified during data collection and the OTTR process and finalized after the evaluation team has read through all the case materials. The categories are related to the evaluation subquestions; for example, if a subquestion was "How does the Immigration and Naturalization Service monitor the conditions of confinement in privately contracted detention facilities?" coding categories might include who is responsible, how these persons get information, what they do with information received, evidence that minimum standards are met, evidence of shortfalls, changes over time in monitoring, and conflicting guidance or responsibilities. These categories might be put into a matrix by facility size or groups served. The approach is similar to content analysis, and the PEMD transfer paper on content analysis gives further how-to information (U.S. General Accounting Office, June 1982).
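A minimal sketch of such a matrix, in Python, may make the idea concrete. The category names are taken from the example above; the facility-size columns, the coded notes, and the build_matrix helper are hypothetical and serve only as illustration.

```python
# Coding categories from the subquestion example (a subset, for brevity).
CATEGORIES = [
    "who is responsible",
    "how these persons get information",
    "evidence of shortfalls",
]
SIZES = ["small", "large"]  # hypothetical grouping of facilities

# Hypothetical coded excerpts: (facility size, category, note text).
coded_notes = [
    ("small", "who is responsible", "Regional officer visits quarterly"),
    ("large", "who is responsible", "On-site contract monitor"),
    ("large", "evidence of shortfalls", "Two inspection reports cite overcrowding"),
]


def build_matrix(notes, categories, columns):
    """Arrange coded notes in a category-by-column matrix of lists."""
    matrix = {cat: {col: [] for col in columns} for cat in categories}
    for col, cat, text in notes:
        if cat not in matrix:
            raise ValueError(f"note uses a category not in the scheme: {cat!r}")
        matrix[cat][col].append(text)
    return matrix


matrix = build_matrix(coded_notes, CATEGORIES, SIZES)

# Empty cells show where the case materials are silent and more data
# collection (or a revised category) may be needed.
empty_cells = [
    (cat, col) for cat, row in matrix.items() for col, cell in row.items() if not cell
]
```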

Graphic Data Displays
This is a family of techniques, some of which have been adapted for computers and some of which use wall space. The evaluators immerse themselves in information on a site, following OTTR. Their initial story of what is happening and why is displayed as a flowchart with a series of critical paths for action. Evidence supporting the story is arrayed in the display. The materials then are searched for counterevidence, and subsidiary or branching paths are laid out. As a satisfactory graphic is developed for one site, the evaluators turn to the next site. The evaluators could at this point either modify the first graphic, based on information from the second site, or prepare an independent flowchart. In the second approach, aggregation would come after all the sites had been charted, and the charts would be used as the data base for aggregation.

The graphic techniques can be applied to an instance as a whole or to subcomponents. For example, if an analysis of life-threatening or fatal incidents at national parks were needed, the evaluators might develop separate graphics for events leading up to the incidents, the incidents themselves, and postincident actions. More complex case studies might need several "layers" or graphics; less complex ones, fewer.

Tabulating Event Frequencies
Another technique for analyzing multisite case data is identifying events within each case study ("meeting between Jones and Smith"; "Smith staff prepares recommendations") and tabulating their frequency of occurrence. Such a simple tabulation can draw the evaluator's attention to events that may be significant or to informal networks and give a sense of actual (as contrasted to on-paper) organizational relationships. Divergences between observed and expected patterns can be examined further to see what happens as a result of these meetings and identify potential problem nodes: for example, when an expected high-communication node turns out to be, relatively speaking, a low-communication spot.
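The tabulation itself is simple enough to sketch in Python. The meeting log, the list of pairs expected to be in frequent contact, and the threshold of two meetings are all hypothetical values chosen for illustration.

```python
from collections import Counter

# Hypothetical field-log entries: each tuple records one observed meeting.
meetings = [
    ("Jones", "Smith"),
    ("Smith", "Jones"),
    ("Jones", "Lee"),
    ("Jones", "Smith"),
]

# Count meetings per pair, ignoring who is listed first.
observed = Counter(frozenset(pair) for pair in meetings)

# Hypothetical pairs that the organization chart says should meet often.
expected_frequent = [frozenset(("Smith", "Lee")), frozenset(("Jones", "Smith"))]

# Expected high-communication pairs observed fewer than 2 times
# (the threshold is an arbitrary illustrative choice).
low_contact = [sorted(pair) for pair in expected_frequent if observed[pair] < 2]
```

Here the Smith and Lee pairing, expected to be active, never appears in the log and would be a candidate for closer examination.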

Complex Tabulations
Cross-tabulations of events can identify interactions and check the developing story more formally. For example, service coordination is a popular remedy for limited funds. An evaluator in the field may observe that coordination among local agencies funded through the same federal agency is more frequent than coordination among local agencies funded by different federal departments. Tabulations of actual meetings and of consequent actions for same-agency funded and different-agency funded services can help check out whether this impression is reliable.
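The coordination example can be checked with a small cross-tabulation. In the Python sketch below, each record is a hypothetical pair of local agencies, classified by whether the pair shares a federal funding agency and whether any coordination was observed; the data and the coordination_rate helper are illustrative assumptions.

```python
from collections import Counter

# Hypothetical records, one per pair of local agencies:
# (funding relationship, whether coordination was observed).
pairs = [
    ("same", True), ("same", True), ("same", False),
    ("different", True), ("different", False),
    ("different", False), ("different", False),
]

# Cross-tabulation: counts for each (funding, coordinated) cell.
table = Counter(pairs)


def coordination_rate(funding):
    """Share of agency pairs with this funding relationship that coordinated."""
    coordinated = table[(funding, True)]
    total = coordinated + table[(funding, False)]
    return coordinated / total if total else None


same_rate = coordination_rate("same")
different_rate = coordination_rate("different")
```

With so few cases, a difference in rates supports or weakens the field impression; it does not by itself establish a finding.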

Time Series Analysis
Organization of information within each site by time of occurrence, coupled with a systematic analysis of contextual influences on events, permits a nonquantitative time series analysis for case study data. The flow of events over time for each significant actor and for significant points in the series of events forms the organizing framework for data analysis within each site. Such comparisons of when key actions occurred, how well (or poorly) they were carried out, and what influenced both timing and quality of performance can be particularly helpful in case studies of program implementation.

In some instances, only one component of a case study may be analyzed in this way. For example, a case study of the effectiveness of a job training program might need to take into account general economic trends, such as unemployment rates in the community. A time series comparing local unemployment rates with placement rates for job training program participants could be computed quantitatively and changes interpreted through the more qualitative time series data about the program.
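A sketch of the quantitative part of such a comparison follows, in Python. The yearly figures are invented for illustration, and the rule used (flag years in which placements rose although local unemployment did not fall) is one simple assumption about what would call for explanation from the qualitative record.

```python
# Hypothetical yearly figures, in percent.
unemployment = {1984: 9.0, 1985: 8.0, 1986: 8.0, 1987: 6.5}
placement = {1984: 40.0, 1985: 46.0, 1986: 52.0, 1987: 58.0}


def yearly_changes(series):
    """Return the change from each year to the next, keyed by the later year."""
    years = sorted(series)
    return {
        later: series[later] - series[earlier]
        for earlier, later in zip(years, years[1:])
    }


unemployment_change = yearly_changes(unemployment)
placement_change = yearly_changes(placement)

# Years in which placements rose although unemployment did not fall;
# better economic conditions cannot account for these gains.
not_explained_by_economy = [
    year
    for year in sorted(placement_change)
    if placement_change[year] > 0 and unemployment_change[year] >= 0
]
```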

Basic Models for Data Analysis

Two basic models of data analysis are pattern matching and explanation building. Pattern matching requires using past experience, logic, or theory before the job begins to specify what we expect to find. The analysis then compares actual findings to expectations. When the findings fit, the pattern is confirmed. When the findings don't fit, the evaluator adjusts the expectations or elaborates them, building a subroutine that can explain the unexpected findings. Explanation building is the inverse procedure: starting with the observations, the evaluator develops a picture of what is happening and why. Data are used to fill in the initial hunches, to change them, to elaborate on them. The first strategy matches findings to hypotheses or assumptions. The second uses the data to structure the hypotheses or assumptions.

In either strategy, the evaluator needs to search the full data base thoroughly for disconfirming evidence in order to avoid the pitfall of premature conclusions. Data analysis ends when the best fit possible has been reached between the observations and a statement about what they mean.
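Pattern matching, with its search for disconfirming evidence, can be sketched as a comparison of expectations with findings. The expectations and findings in this Python example are hypothetical, as is the match_pattern helper; an actual job would derive expectations from past experience, logic, or theory before data collection begins.

```python
# Hypothetical expectations, specified before the job begins.
expected = {
    "eligible colleges notified by mail": True,
    "application rate at or above 80 percent": True,
    "outreach cost unchanged": True,
}

# Hypothetical findings from the field; a missing item is not yet observed.
observed = {
    "eligible colleges notified by mail": True,
    "application rate at or above 80 percent": True,
    "outreach cost unchanged": False,
}


def match_pattern(expected, observed):
    """Sort each expectation into confirmed, disconfirmed, or unobserved."""
    confirmed, disconfirmed, unobserved = [], [], []
    for item, expectation in expected.items():
        finding = observed.get(item)
        if finding is None:
            unobserved.append(item)
        elif finding == expectation:
            confirmed.append(item)
        else:
            disconfirmed.append(item)
    return confirmed, disconfirmed, unobserved


confirmed, disconfirmed, unobserved = match_pattern(expected, observed)
```

Disconfirmed items are where the evaluator would adjust or elaborate the expectations; unobserved items mark where the search of the data base is incomplete.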

In either strategy, expectations and explanations can be expressed as themes: a job dealing with bank failures, for example, might have as themes decisions about credit risks, procedures for reviewing decisions, or controls over the accuracy and recency of information on bank solvency. A job dealing with employee training might have as themes decisions about training needs, how employees are selected for training, how course quality is monitored, or how employees and supervisors view the purpose of training.

Themes, in turn, can be analyzed within individual sites first, then findings on each theme aggregated across sites. Alternatively, all themes within one site can be analyzed first; then data from the second (and subsequent) sites can be examined. Theme analysis also can proceed in matrix fashion. On the PEMD AFDC study, for example, evaluators were assigned as site managers, responsible for understanding across themes all there was to know about the issues for their site. They also were assigned to individual themes, such as health and employment, responsible concurrently for looking across all sites for information on their topic. This organization proved helpful in ensuring that reasons why a site showed up as an outlier for a given theme could be discussed by someone who knew the site as a whole.

Pitfalls and Booby Traps

Case study methods, like any other method, offer plenty of opportunity to go awry. Two frequent concerns are the risks in using other people's studies and in generalizability.

Impartiality
The biggest risk when we use other people's case studies is that GAO standards of impartiality may not have been met. There are three meanings of impartiality, one of which does not create problems. The first concerns subjectivity: case studies use as data the impressions and judgments of the evaluator, which are inherently subjective. For a case study methodologist and for GAO, if proper care is taken, this should not be a problem. If we want to illustrate, for example, working conditions for immigrant laborers, we can report what the thermometers registered and we can also report, firsthand, how people were sweating and what it felt like to be out in the fields. Such observation is part of the richness, immediacy, and "thick" description of a case study. However, case studies, like any other method GAO uses, have to meet two other criteria of impartiality: accuracy and lack of bias. Lack of bias means that the evaluator's personal, preconceived opinions about a situation do not distort reporting and that the evaluator is scrupulously evenhanded in examining all sides of a situation.

Some authorities on evaluation methods believe that case studies reflect the author's values in ways that can be difficult to detect. Other experts conclude that three actions, taken together, are sufficient safeguards for lack of bias and adequate accuracy. These are (1) submitting reports to people from whom data were collected and printing their critiques with the report, (2) using multiple data collection methods within case studies, and (3) adopting the audit trail or chain-of-evidence technique. Adequate supervisory controls also are recommended. Complying with these safeguards should give us no major problems in our own jobs. The guidance would mainly expand the range of reviewers. We already conduct exit conferences and, following the "Yellow Book" and Communications Manual, submit draft reports for agency comments. We often use multiple methods, and the audit trail technique now recommended for case study use was itself adopted from such auditing procedures as workpapers and referencing, which are standard practice at GAO. We also require adequate supervisory control through such means as prompt review of workpapers. We would need to assure ourselves, however, that case studies whose results we are going to use have adopted the same procedures for ensuring impartiality. (Appendix III gives a checklist for reviewing proposed or completed case studies for quality.)

Generalizability
We often are asked questions where the customer wants in-depth information that is nationally generalizable, but frequently the issue may not yet be ripe for a national study or we do not have the resources to collect in-depth data from nationally representative samples. Using 4, 10, or 15 sites as case studies might be feasible, but we would still need to be concerned about the risks in generalizability. A main point of this paper is that generalizability depends less on the number of sites and more on the right match between the purpose of the study and how the instances were selected, taking into account the diversity of the programs.

An example of an efficient combination of careful specification of the purpose of the study matched with appropriate site selection is the GGD study of the productivity of the Social Security Administration's (SSA's) regional operations. This review examined in depth only one SSA region (U.S. General Accounting Office, September 11, 1985). Atlanta was selected because it had the best productivity among the 10 regions; if GAO could demonstrate opportunities for improvement in the most productive SSA region, then similar improvements might be possible in the less productive regions. Following the case study, an inexpensive (25 staff day) check was made on productivity data and trends from other SSA regions, and similarities were noted. While other problems might be affecting these less productive regions, the findings from the single site plus the trends were so convincing that SSA concluded the single instance examination had national impli