United States General Accounting Office

GAO Program Evaluation and Methodology Division

October 1993

Developing and Using Questionnaires

 

Preface

GAO assists congressional decisionmakers in their decisionmaking process by furnishing analytical information on issues and options under consideration. Many diverse methodologies are needed to develop sound and timely answers to the questions that are posed by the Congress. To provide GAO evaluators with basic information about the more commonly used methodologies, GAO's policy guidance includes documents such as methodology transfer papers and technical guidelines.

The purpose of this methodology transfer paper is to provide evaluators with a background that is of sufficient depth to use questionnaires in their evaluations. Specifically, this paper provides rationales for determining when questionnaires should be used to accomplish assignment objectives. It also describes how to plan, design, and use a questionnaire in conducting a population survey. We do not expect GAO evaluators to become experts after reading this paper. But we do hope that they will become familiar enough with questionnaire design guidelines to plan and use a questionnaire; to make preliminary designs and assist in many development and testing tasks; to communicate the questionnaire requirements to the measurement, sampling, and statistical analysis experts; and to ensure the quality of the final questionnaire and the resulting data collection.

The present document is a revision. An earlier version was authored by Brian Keenan and Marilyn Mauch in 1986. This revision, authored by Brian Keenan, includes new material on cognition as well as on a number of developments in pretesting that have occurred since then. As such, the present document supersedes the 1986 version.

Developing and Using Questionnaires is one of a series of papers prepared and issued by the Program Evaluation and Methodology Division (PEMD). The purpose of the series is to provide GAO evaluators with guides to various aspects of audit and evaluation methodology, to illustrate applications, and to indicate where more detailed information is available.

We look forward to receiving comments from the readers of this paper. They should be addressed to Eleanor Chelimsky at 202-612-2900.

Werner Grosshans
Assistant Comptroller General
Office of Policy

Eleanor Chelimsky
Assistant Comptroller General
for Program Evaluation and Methodology

Contents

Chapter 1 Using Questionnaires
  Overview of Tasks in Using Questionnaires
  Deciding to Use Structured Questionnaires
  Planning the Questionnaire
  Developing the Measures
  Designing the Sample
  Developing and Testing the Questionnaire
  Producing the Questionnaire
  Preparing for and Collecting Data
  Analyzing Data
  Telephone Surveys

Chapter 2 Developing the Measures to Get the Questions
  The Questionnaire Framework
  Operationalizing the Constructs
  Developing Measures From Operationalized Constructs
  Specify the Key Variable Relationships

Chapter 3 Designing the Sample or Population for Data Collection
  Survey Population
  Selecting the Sample
  Nonstatistical Sampling

Chapter 4 Formatting the Questions
  Open-Ended Questions
  Fill-in-the-Blank Questions
  Yes-No Questions
  "Implied No" Choices
  Single-Item Choices
  Expanded Yes-No Questions
  Free Choices
  Multiple-Choice Questions
  Ranking and Rating Questions
  Guttman Format
  Intensity Scale Questions
  Semantic Differential Intensity Scales
  Intensity Paired-Comparison Scales

Chapter 5 Avoiding Inappropriate Questions
  Questions That Are Not Relevant to the Evaluation Goals
  Unbalanced Line of Inquiry
  Questions That Cannot or Will Not Be Answered Accurately
  Questions That Are Not Geared to Respondent's Depth and Range of Information, Knowledge, and Perceptions
  Questions That Respondents Perceive as Illogical or Unnecessary
  Questions That Require Unreasonable Effort to Answer
  Threatening or Embarrassing Questions
  Vague or Ambiguous Questions
  Unfair Questions

Chapter 6 Writing Clear Questions
  Simplify the Word Structure
  Be Careful About Words With Several Specific Meanings and Other Problem Words
  Do Not Use Abstract Words
  Reduce the Complexity of Ideas and Present Them One at a Time in Logical Order
  Reduce the Sentence Length
  Simplify the Sentence Structure
  Use Active and Passive Voice Appropriately
  Use Direct, Periodic, and Balanced Styles Appropriately
  Avoid Writing Styles That Inhibit Comprehension

Chapter 7 Developing Unscaled Response Lists
  Developing Comprehensive Lists
  Presenting Mutually Exclusive Categories
  Using Relevant and Appropriate Categories
  Keeping the Response List Reasonably Short
  Using Categories of Appropriate Specificity
  Listing Categories in the Logical Order Expected by Respondents
  Using a Screening Question

Chapter 8 Minimizing Question Bias and Memory Error
  Question Bias
  Memory Error
  Remembering Frequency and Time of Occurrence

Chapter 9 Minimizing Respondent Bias
  Response Styles
  Highly Sensitive Items

Chapter 10 Measurement Error and Measurement Scales in Brief
  Measurement Scales
  Equal-Appearing Intervals

Chapter 11 Organizing the Line of Inquiry
  Setting Expectations
  Sequencing Questions
  Using Subtitles as Cues
  Choosing an Opening Question
  Obtaining Complex Data
  Using Transitional Phrases
  Putting Specific Questions Before Overall Judgment Questions

Chapter 12 Following Quality Assurance Procedures
  Pretesting
  Expert Review
  Validation and Verification
  Analysis of Questionnaire Nonresponses

Chapter 13 Designing the Questionnaire Graphics and Layout
  Instructions
  Questionnaire Format Preparation
  Typographic Style

Chapter 14 Preparing the Mail-Out Package and Collecting and Reducing the Data
  Preparation of the Mail-Out Package
  Data Collection
  Data Reduction

Chapter 15 Analyzing Questionnaire Results
  Analysis Plan
  Item Responses and Univariate Analysis
  Bivariate Analysis and Comparison of Two Groups
  Multivariate Analysis and Comparison of Multiple Groups
  Choice of Analysis Methods

Chapter 16 Adaptations for the Design and Use of Telephone Surveys
  Advantages and Disadvantages of Telephone Surveys
  Design Guidelines
  Administration

Bibliography

Glossary

Papers in This Series

Table
  Table 14.1: The Percentage of Questionnaires That Should Be Randomly Sampled to Determine the Keypunch Error Rate

Figures
  Figure 1.1: Typical Completion Times for Major Questionnaire Tasks
  Figure 2.1: Operationalized Variable in Question Response Format
  Figure 4.1: Fill-in-the-Blank Questions
  Figure 4.2: Fill-in-the-Blank Row, Column, and Matrix Formats
  Figure 4.3: Yes-No Filter Question
  Figure 4.4: Mixed Yes-No and Multiple Choice Question
  Figure 4.5: Balanced and Unambiguous Yes-No Question
  Figure 4.6: "Implied No" Question
  Figure 4.7: Emphasized-No Question
  Figure 4.8: Single-Item Choice Question
  Figure 4.9: Expanded Yes-No Format
  Figure 4.10: Expanded Yes-No Format With Middle Category
  Figure 4.11: Expanded Yes-No Format With Escape Choice
  Figure 4.12: Multiple-Choice Question
  Figure 4.13: Ranking Question
  Figure 4.14: Rating Questions
  Figure 4.15: Guttman Question
  Figure 4.16: Extent Scale and the Expanded Yes-No Scale Questions
  Figure 4.17: Extent Scale Converted to Likert Scale Question
  Figure 4.18: Likert Question Used to Evaluate Policy
  Figure 4.19: Amount Intensity Scale
  Figure 4.20: Frequency Intensity Scale
  Figure 4.21: Frequency and Amount Intensity Scales With Proportional and Verbal Descriptive Anchors in Addition to the Conventional Adjective and Scale Number Anchors
  Figure 4.22: Branching Intensity Scale Format
  Figure 4.23: Number-of-Occurrences and Time Interval Formats
  Figure 4.24: Semantic Differential Question
  Figure 4.25: Intensity Paired Comparison Scale
  Figure 5.1: Skip Question
  Figure 5.2: Behavior-Oriented Question
  Figure 7.1: Question With Comprehensive List of Categories
  Figure 7.2: Question With Overlapping Categories
  Figure 7.3: Question With Nonoverlapping Categories
  Figure 7.4: Tailored Question With Comprehensive Nonoverlapping Categories
  Figure 8.1: Biased Question
  Figure 8.2: List Divided Into Subgroups to Counter Primacy and Recency Biases
  Figure 8.3: "Check All That Apply" Response Format Changed to "Check Yes or No" Format
  Figure 8.4: Using Presentation Order to Counteract Expected Bias
  Figure 8.5: Complex Question Broken Into Sequence of Questions
  Figure 9.1: Question to Reduce Overreporting
  Figure 9.2: Question With List of Ranges
  Figure 9.3: Series of Indirect Questions
  Figure 11.1: Sequence of Questions Obtaining Complex Data
  Figure 13.1: Partial Questionnaire
  Figure 14.1: Initial Questionnaire Transmittal Letter
  Figure 14.2: Questionnaire Follow-Up Letter


 

Chapter 1
Using Questionnaires

This paper describes how to design and use questionnaires. Such information is important for GAO evaluators for two reasons. First, GAO frequently uses questionnaires to collect data. Second, the questionnaire is a method with a high potential for error if not designed and used properly.

GAO employs questionnaires to ask people for figures, statistics, amounts, and other facts. We ask them to describe conditions and procedures that affect the work, organizations, and systems with which they are involved, and we ask for their judgments and views about processes, performance, adequacy, efficiency, and effectiveness. We ask people to report past events and to make forecasts, to tell us about their attitudes and opinions, and to describe their behavior and the behavior of others.

Questionnaires are popular because they can be a relatively inexpensive way of getting people to provide information. But because they rely on people to provide answers, a benefit-risk consideration is associated with their use. People with the ability to observe, select, acquire, process, evaluate, interpret, store, retrieve, and report can be a valuable and versatile source of information under the right circumstances. However, the human mind is a very complex and vulnerable observation instrument. And if we do not ask the right people the right questions in the right way, we will not get high-quality answers.

This holds true for even the simplest of questions. An easy way to demonstrate this is to do a simple straw poll, like asking co-workers how they came to work. One may answer "By way of New York Avenue" or give some other route description. Another answer to the same question may be "by car pool." If you continued this straw poll, many of the answers would be unusable if your intent was to learn modes of transportation to work.
[Figure 1.1: Typical Completion Times for Major Questionnaire Tasks]

After describing important factors to consider when deciding to use a questionnaire, we briefly cover, in the remaining sections of this chapter, the major tasks listed in figure 1.1 and refer to subsequent chapters that provide detailed instructions. We do this to give an overview of the scope of work required to plan, develop, and implement a questionnaire and to show what the reader can expect to find in each of the subsequent chapters. Overall, the organization of this paper parallels the logical sequence of tasks undertaken when developing and using questionnaires.

Asking good questions in the right way, the focus of this paper, is both a science and an art. It is a science in that it uses many scientific principles developed from various fields of applied psychology, sociology, cognitive research, and evaluation research. It is an art because it requires clear and interesting writing and the ability to trade off or accommodate many competing requirements. For example, a precisely worded, well-qualified, unambiguous question may be stilted and hard to read. Questions must be clear, interesting, and easy to understand and answer. In addition to asking the right questions, evaluators need to be aware of other principles dealing with questionnaire design and administration that are also covered in this paper.

Overview of Tasks in Using Questionnaires

Using even a simple questionnaire is not always simple. Numerous major tasks to develop and use a questionnaire must be completed in a logical sequence. After deciding to use a questionnaire, evaluators must plan the questionnaire, develop measures, design the sample, develop and test the questionnaire, produce the questionnaire, prepare and distribute the mailout or interview packages, collect the data and follow up with nonrespondents, perform checks to ensure the quality of responses, and reduce and analyze the data. Figure 1.1 reviews these major tasks. Except for the data collection, these processes are very similar regardless of whether the questionnaire is to be designed for the mail or a telephone or face-to-face interview. When interviewers are used, however, they must also be trained, which adds another major task.

Deciding to Use Structured Questionnaires

One of the first decisions evaluators have to make is whether to use a questionnaire or some other method to collect the data for the job. In many situations, other data collection techniques may be superior. In fact, over the past years other techniques were recommended by technical design teams for about one of every three proposed GAO questionnaires. The decision to use questionnaires should be made only after carefully considering the comparative advantages and disadvantages of the various ways of administering questionnaires over other data collection techniques.

Data Considerations

Data can be collected in a variety of ways, such as field observations, reviews of records or published reports, interviews, and standardized mail, face-to-face, or telephone questionnaires. The selection of one technique over another involves trade-offs between staff requirements, costs, time constraints, and, most importantly, the depth and type of information needed. For example, if the objective of the assignment is to determine the average per acre charge and the income derived from public grazing-land permit fees, the evaluator might consider using structured data collection forms or pro forma work papers to manually retrieve data from the case files in record storage. However, if the objective is to determine how much land the ranchers are willing to lease and how much per acre they are willing to pay, a mail, telephone, or face-to-face survey of ranchers would be necessary.

Questionnaires are frequently used with sample survey strategies to answer descriptive and normative audit or evaluation questions. They are often less central in studies answering impact, or cause-and-effect, questions. While operational audits and impact, or cause-and-effect, studies are often not large-scale efforts, questionnaires can be used to confirm or expand their scope.

Questionnaires can be useful when the evaluator needs a cost-effective way to collect a large amount of standardized information, when the information to be collected varies in complexity, when a large number of respondents are needed, when different populations are involved, and when the people in those populations are in widely separated locations.

Furthermore, questionnaires are usually more versatile than other methods. They can be used to collect more types of information from a wider variety of sources than other methods because they use people, who can report facts, figures, amounts, statistics, dates, attitudes, opinions, experiences, events, assessments, and judgments during a single contact. People can answer for a specific type of source, such as members of a health maintenance organization, or for a variety of types of sources, such as local, state, and federal government officials.

Questionnaires are difficult to use if the respondent population cannot be readily identified or if the information being sought is not widely distributed among the population of those who hold the knowledge. Furthermore, questionnaires should not be used if the respondents are likely to be unable or unwilling to answer or to provide accurate and unbiased answers or if the questions are inappropriate or compromising.

In general, questionnaires should not be used to gather information that taxes the limitations of the respondent. Sometimes people are not knowledgeable or accurate reporters of certain kinds of information. They remember recent events much better than long-past events. They remember salient and routine events and meaningful facts but do not remember details, dates, and incidental events very well. For example, veterans might accurately report that doctors made medical examinations for Agent Orange effects on their eyes, ears, nose, throat, genitals, and pelvis but might substantially underreport skin examinations. If the information were needed on skin examinations, other sources, such as medical records, might be more useful. However, there are exceptions, particularly when the respondents are highly motivated.

Structured questionnaires are also not particularly well suited for broad, global, or exploratory questions. Because respondents have many different frames of reference, levels of knowledge, and question interpretations, the structured methodology limits the evaluators' ability to vary the focus, scope, depth, and direction of the line of inquiry. Such flexibility is necessary to accommodate variances in the respondents' perceptions and understanding that result from such questions.

Most of the people from whom GAO evaluators seek information are members of special populations, such as federal and state government employees, welfare recipients, or company executives. Unlike pollsters and market researchers, GAO evaluators rarely do a national population survey. Consequently, some of the mass survey techniques like random-digit dialing seldom apply to GAO work.1 Also, GAO evaluators very rarely go back to the same population, and when they do, the time periods between surveys are so long that they usually have to redocument the population.

1 Random-digit dialing refers to a telephone interview method that contacts people by dialing numbers at random. In some situations, usually when the population is hidden or not easily identified (for example, heads of households older than 66), this method may provide better access than other methods.

Administration Considerations
If, after considering the pros and cons of using questionnaires, a questionnaire is still the method of choice for data collection, the evaluators need to consider the most appropriate method of administration. The appropriateness of the method of administration, whether it be mail, face-to-face interview, or telephone, varies with the resources and constraints of the job, the abilities and motivation of the respondent population, and the requirements of the evaluation. All three methods have comparative advantages and disadvantages, depending on the time and cost constraints of the job, the characteristics of the respondent population, and the nature of the inquiry.

Mail questionnaires are usually more cost effective but require longer time periods than personal or telephone interviews. While mail questionnaires usually have higher development costs than telephone or face-to-face interviews, this is generally offset by the relatively inexpensive data collection costs. Mail questionnaires are the least labor intensive of the alternatives, with the labor costs limited to the effort needed to mail the questionnaire and track, follow up on, and edit the returns. Generally staff can mail hundreds of letters or edit scores of returns in a given day. Workers are not so productive with telephone and face-to-face interviews.

Because of the difficulty in establishing telephone or personal interview contacts and the one-on-one nature of interviews, these alternatives require more staff time. Interviewers usually do not complete more than 10 or 12 telephone interviews or two or three face-to-face interviews in a day. Furthermore, the travel requirements for personal interviews can be very expensive when compared to postage or telephone charges.
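The daily completion rates cited above lend themselves to a rough staffing estimate. The following Python sketch is illustrative only: the function name `interviewer_days` and the figure of 300 completed interviews are assumptions made for the example, not GAO planning figures, and the calculation ignores travel, callbacks, and training time.

```python
import math

# Interviews one interviewer usually completes per day (low, high),
# taken from the ranges cited in the text above.
DAILY_RATES = {
    "telephone": (10, 12),
    "face_to_face": (2, 3),
}

def interviewer_days(completed_interviews, mode):
    """Return (best_case, worst_case) interviewer-days for a given mode.

    Hypothetical helper: best case uses the high daily rate,
    worst case uses the low daily rate. Partial days are rounded up.
    """
    low_rate, high_rate = DAILY_RATES[mode]
    best = math.ceil(completed_interviews / high_rate)
    worst = math.ceil(completed_interviews / low_rate)
    return best, worst

if __name__ == "__main__":
    # 300 is an assumed target number of completed interviews.
    for mode in DAILY_RATES:
        best, worst = interviewer_days(300, mode)
        print(f"{mode}: {best} to {worst} interviewer-days")
```

Under these assumptions, the same number of completed interviews requires several times more interviewer-days face to face than by telephone, before any travel cost is counted.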

But mail questionnaires take longer to design and require longer periods for collecting and editing data than other choices. Extra care must be taken with the mail questionnaires because, unlike the other choices, there is no interviewer to help the respondent. Also, mail is a slow means of transmission, and mail questionnaires take two or three follow-ups. In summary, if money is tight and the subject matter can be phrased intelligibly for the respondent population, use the mail; if time is tight and staff time is not, use the face-to-face or telephone interview methods.

In addition to subject matter, respondent characteristics play a key role in the method of choice. For example, if the respondents are motivated and literate and have normal vision, the mail is often the best option; otherwise, use the telephone or an interviewer. If respondents cannot be readily located by address or telephone number but gather at particular places (such as restaurants, parks, or hospitals), then a face-to-face interview is the only option.

If the contact people are likely to conceal the identity of the intended respondent, and this is likely to make a difference, or if the evaluator is not sure that the intended respondent will get the questionnaire, then personal contact is better than telephone and telephone is better than mail. Also, if the respondent has a vested interest in giving biased reports that can readily be verified by inspection, then the face-to-face interview is the obvious choice.

However, if the contact has a likely chance of temporarily inconveniencing the respondent or the respondent has privacy concerns, then a mail survey has the advantage over the remaining choices.

Questionnaire characteristics also determine choice. Long, complex questionnaires designed to be answered by simple checks or short fill-in-the-blanks are better suited for self-administered questionnaires than the interview method. However, the converse is often true if the questions require the composition of responses that are other than very short answers (most people would rather speak than write). Also, if the questionnaire has many complex and confusing skips that frequently require respondents to answer some questions but not others, then one of the interview methods is preferable to a mail or self-administered questionnaire.

In summary, evaluators should review the conditions and requirements of the data collection before deciding to use questionnaires and again before deciding the methods for administering the questionnaire. Mail questionnaires are a versatile, low-cost method of collecting detailed data. They are particularly adaptable to survey methods when the population is big, difficult to contact, likely to be inconvenienced, concerned about privacy, and widely dispersed. But mail questionnaires usually have a long turnaround time. The evaluators must be willing to invest the time required to carefully craft and test these questions. And the respondent must be willing and able and sufficiently literate and unbiased to accurately answer the queries. Interview methods, while much more expensive and more prone to bias, help insure against respondent error, have less turnaround time if sufficient staff is provided, and can be used to provide some interviewer verifications.
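The considerations summarized above can be read as a rough sequence of screening rules. The Python sketch below is one possible encoding and not a GAO procedure: the field names, the function `suggest_method`, and especially the order in which the rules are checked are assumptions made for illustration. In practice these factors are weighed together rather than applied mechanically.

```python
from dataclasses import dataclass

@dataclass
class SurveyConditions:
    """Hypothetical summary of the conditions discussed in this section."""
    reachable_by_address_or_phone: bool = True
    # Respondent may have a vested interest in biased reports
    # that can readily be verified by inspection.
    answers_verifiable_by_inspection: bool = False
    respondents_literate_and_motivated: bool = True
    complex_skip_patterns: bool = False
    privacy_or_inconvenience_concerns: bool = False
    # Time is tight and staff time is not.
    time_is_tight: bool = False

def suggest_method(c):
    """Return a suggested administration method (assumed rule ordering)."""
    if not c.reachable_by_address_or_phone:
        # Respondents can be found only where they gather.
        return "face-to-face"
    if c.answers_verifiable_by_inspection:
        return "face-to-face"
    if not c.respondents_literate_and_motivated or c.complex_skip_patterns:
        # An interviewer is needed to help the respondent.
        return "telephone or face-to-face"
    if c.privacy_or_inconvenience_concerns:
        return "mail"
    if c.time_is_tight:
        return "telephone or face-to-face"
    # Default: mail is usually the most cost-effective choice.
    return "mail"

if __name__ == "__main__":
    print(suggest_method(SurveyConditions()))
    print(suggest_method(SurveyConditions(time_is_tight=True)))
```

Placing the "cannot be located" and "verifiable by inspection" checks first reflects the text's statement that face-to-face interviewing is the only or obvious option in those cases; the remaining ordering is a judgment call.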

Planning the Questionnaire

Once evaluators decide to use a questionnaire, planning starts with this paper, which provides information on the procedures necessary to do each of the major tasks to design and use questionnaires. The next step is to review the evaluation design and audit plan and then mentally walk the job through each procedure necessary to design and implement a questionnaire: developing the measures, designing the sample, developing and testing the questionnaire, producing the questionnaire, preparing the mailout or interview materials, and conducting the data collection, reduction, and analysis. A write-up of this mental walkthrough, evaluated for comprehensiveness and feasibility, can serve as a basis for writing the implementation plan.

Developing the Measures

As evaluators do their planning, they will find that the scope of the effort is greatly influenced by information developed in the next two tasks, developing the measures and designing the sample, which are done to ensure that the right questions are being asked of the right people. Remember that the questionnaire is an instrument used to take measures. To be sure it can do this, evaluators must first identify all the variables or conditions, criteria, causes, and effects that they want to measure. Next, evaluators analyze these variables and describe them so scientifically and precisely that they can be qualified, quantified, manipulated, and related. As explained in chapter 2, "Developing the Measures to Get the Questions," these measures define the requirements for the questionnaire. Questionnaires are designed by establishing a framework and sets of related questions that provide these measures.

Designing the Sample


Questionnaires are a way of asking the right people to take the measures needed to complete an evaluation. Before evaluators begin to write a question, it makes good sense to be sure they can find the people. The right people are representatives of a population who share the experiences the evaluators are interested in and who have or can get, will get, and will give them the information they need. Furthermore, evaluators must select these people scientifically, so the population these people represent can be talked about rather than just the individuals contacted. This is called a population survey, and how to do a population survey with questionnaires is explained in chapter 3, "Designing the Sample or Population for Data Collection."

Developing and Testing the Questionnaire

Once the evaluators have established what to measure and who to ask to take the measures, they are ready to ask people to take these measures. Asking questions in the right way requires the evaluators to write sets of questions so that the answerer can easily understand precisely what information must be provided and, with little or no error, can easily provide this information. This means writing questions in a way that facilitates rather than interferes with the respondents' ability to understand the question and report the answer to the best of their ability. This simply stated task is deceptively complicated. To write good questions, evaluators must first understand something about the very complicated mental or cognitive process people use to answer questions. If evaluators access this cognitive process properly, the questionnaire can become a highly versatile and powerful instrument for observation and recall. If not, it can become a source of confusion and error.

The sets of inquiries or questions must then be organized into a draft instrument. This questionnaire is then tested, reviewed, and revised until it is proven that as an instrument it takes the required measures. Since completing these tasks is perhaps the most difficult part of the job and consumes the most resources, we devote nine chapters (chapters 4-12) to explaining some of the many known and tested ways to do this work.

Chapters 4-7 show how to facilitate the perception, acceptance, and understanding of the questions and how to help respondents recall their mentally stored information. In chapter 4, "Formatting the Questions," we show how to present the question in the precise format best suited to get the specific type of information requested. We demonstrate what respondents are likely to consider as fair and unfair questions in chapter 5, "Avoiding Inappropriate Questions." In chapter 6, "Writing Clear Questions," we explain how to write a question that can be quickly, easily, and precisely understood by all respondents in the same way. And in chapter 7, "Developing Unscaled Response Lists," we explain how to write in a way that aids respondents as they cognitively search their minds to select the answers to questions.

Chapters 8 and 9 deal with the problem of bias and error. This problem has two sources: the question writer and the question answerer. Chapter 8, "Minimizing Question Bias and Memory Error," illustrates many of the typical mistakes question writers make and how to avoid them. Chapter 9, "Minimizing Respondent Bias," explains the ranges of capacities and limitations that respondents have in answering questions and how to make the most of the respondents' abilities and minimize the risk and compromise of their shortcomings.

Chapter 10, "Measurement Error and Measurement Scales in Brief," explains how to translate the question answers into qualitative and quantitative measures for use in GAO reports. Through chapter 10, we deal with how to write individual questions. However, when we put these individual questions together into a single questionnaire, they often interact with one another in a context that affects the meaning of the questions. Chapter 11, "Organizing the Line of Inquiry," shows how to organize these questions into a line of inquiry that can enhance the quality of the answers and minimize unintended and interfering effects.

After finishing the first 11 chapters of this paper, evaluators should be able to help write the first draft of a questionnaire. But there is still much more to be done before evaluators can use this draft as a survey instrument. They should go through a quality-assurance procedure, which requires that the draft questionnaire be tested and validated. The methods for this task, and other quality assurance tasks carried out during data collection and analysis, are described in chapter 12, "Following Quality Assurance Procedures."

Producing the Questionnaire

Once the questionnaire has been tested and validated, and probably revised, the evaluators can put it in final form and use it to collect and analyze data to answer the assignment questions. Good questionnaires can be seriously compromised if they are not presented in a format that is easy to read and administer. Chapter 13, "Designing the Questionnaire Graphics and Layout," addresses this issue and shows the evaluator how to design the questionnaire type, format, and layout in a manner that greatly facilitates the user's ability to perceive and respond.

Preparing for and Collecting Data

Several administrative procedures, such as preparing the transmittal or contact letters or mail piece or interviewers' kits, must precede data collection. Data collection methods then involve such activities as mailing, contacting, interviewing, tracking, and following up on nonresponses. Poor quality in the execution of these fundamental and very important activities can cut the response rate by as much as 50 percent. To avoid this problem, we have documented procedures shown to be highly effective for mail surveys in chapter 14, "Preparing the Mail-Out Package and Collecting and Reducing the Data." Activities needed to check, edit, and prepare the data for computer processing are also covered in this chapter.

Analyzing Data

Chapter 15, "Analyzing Questionnaire Results," discusses some of the initial thinking and conceptualization that are important to the data analysis, including the development of a strategy and a plan for the data analysis. We do not describe data analysis methods since they are covered in Quantitative Data Analysis: An Introduction.2 Chapter 15 concludes the discussion on using mail and self-administered questionnaires.

2 U.S. General Accounting Office, Quantitative Data Analysis: An Introduction, GAO/PEMD-10.1.11 (Washington, D.C.: June 1992).

Telephone Surveys

Personal or telephone interviews are also important and useful methods for collecting structured data for GAO assignments. While the methodology for asking good questions developed in this paper applies regardless of whether the questions are asked in a self-administered mode, such as by mail, or in some other mode, such as a face-to-face or telephone interview, certain limitations are specific to each administration method. Those that apply to conducting telephone surveys are discussed in the concluding chapter 16, "Adaptations for the Design and Use of Telephone Surveys." Further details on personal interviews are presented in Using Structured Interviewing Techniques.3

3 U.S. General Accounting Office, Using Structured Interviewing Techniques, GAO/PEMD-10.1.6 (Washington, D.C.: July 1991). Some information relevant to conducting face-to-face interviews is presented in chapter 12 of this paper, in a section dealing with pretesting techniques.

 

Chapter 2
Developing the Measures to Get the Questions

Deciding what and whom to ask appears to be a straightforward task. But appearances can be deceiving. And as we shall see in the next two chapters, this initial step must be thought through with careful consideration and structured to an elemental level of detail. The what and whom to ask decision lays the foundation for the focus and scope, the level of difficulty and complexity, the risk, completion times, data collection, analysis, and processing requirements and resources needed for the job. Hence, all the job plans are based on this decision. Furthermore, the three major sources of error--misspecification of variables, measurement error, and sampling error--are often introduced at this stage.

In this chapter, we discuss methods for documenting what a questionnaire should ask. This documentation will be used to develop a framework for writing the questions, describing the variables in scientific terms necessary for measurement, developing the measures, and specifying the variable relationships in order to check for misspecification of variables and measurement error. In the next chapter, we discuss protocols for selecting the target population in ways that maintain the integrity of the design and minimize sampling error. Because deciding what to ask and deciding whom to ask it of are complex, we have described them in two chapters. However, in actual practice, deciding what and whom to ask go hand in hand and are among the few tasks in survey research that must be done interactively and iteratively. This is because the questions we ask are determined by both the need for information and the respondent's ability to provide this information.

To document the questionnaire framework, variable operationalizations, measures, and variable relationships, it is best to start with what we know about the requirements of the job and mentally work in two directions, by thinking, first, in the abstract to integrate and conceptualize and, then, shifting to more concrete logic to define and analyze. At the start, evaluators usually find that some of the information they will need is very global, general, and abstract and other information is highly specific. However, most of the information they have gathered is at a middle level of detail, and they can begin by working with what they have. Information should be available from the job design, audit plan, evaluation framework, and previously gathered background material. Evaluators should conceptualize and organize this information into a framework of inquiry or types of questions that can be developed to yield answers to the evaluation questions. Often they may have to do additional research or additional thinking through to fill knowledge gaps.

Next, they must go in the other direction and think more concretely and analytically. They must specifically describe or operationalize these information requirements and develop measures that will satisfy these requirements. Finally, they should integrate these conceptualizations and analysis into a format that presents the key relationships of the measurement variables. The process needed to develop each document product is described in the following sections.

The Questionnaire Framework

Initially, the evaluators decide what constructs, traits, conditions, or variables are to be measured and how to measure them. The documentation for this task is sometimes referred to as a questionnaire framework. The framework is usually depicted as a taxonomical classification. It is a scheme that lays out the evaluation questions and all the information required to answer each question with ordered and specified relationships. In essence, the framework provides a roadmap to identify and track the kind of data needed to answer the questionnaire.

A relatively uncomplicated example might be structured in response to the evaluation question "Is the size of the 4-year college associated with student performance?" The constructs (or the things evaluators want to measure) for college size and student performance and their relationships are identified for measurement development.

The identification of these constructs and their relationships influences the choice of data collection sources, methods, and measures. For instance, in the example above, we can readily see that there are alternatives: the use of extant data from various national graduate record achievement score data bases, surveys of administrative and academic deans, and so on. And just as the choice of methods and sources will force a choice of measures, so will the choice of measures determine the methods and sources.

Hence, these choices must be made interactively and iteratively. The relationship of college size to student performance was a simple example. In this case, evaluators might have been able to proceed without committing measurement considerations to paper, but it is nearly impossible to plan complex questionnaires without documentation. For example, consider the following evaluation question: "What are the needs of earth-orbiting satellite image users?" The answer to this question requires a plurality of complex considerations, constructs, and measures, such as the identification of the different types of users (national and international scientists, political administrators, disaster managers, and earth resource managers); the identification of the national and international, geopolitical, and socioeconomic considerations that determine the type of use; the measures of the quality of the information displays of the satellite; and the relationships among the variables and constructs. This is a level of complexity that requires documentation.

As we can see by the example, the framework identifies, specifies, and justifies the need for the information, constructs, variables, measures, and variable relationships that the evaluator wishes to collect data on. It is a scheme for documenting the information requirements. It is not a questionnaire but rather the basis for the questionnaire.

Operationalizing the Constructs

So far we have talked in broad terms about ideas or concepts, traits or properties, and characteristics evaluators often like to measure--usually referred to as constructs. These constructs are not measures until the terms are specific enough to standardize. By "standardize," we mean that questions are designed and asked so that each recipient will understand and answer the same question in the same way. Different people reading the same questions need to have a common understanding. For example, one survey asked congresspersons about the "timeliness" of reports. Some respondents interpreted the construct "timeliness" as turnaround time while others interpreted it as getting the report information in time to use it for legislative decisions. As we can see, standardizing is very important because it enhances the objectivity of the resulting measure.

The first step toward standardization is to operationalize or to define the construct in concrete, specific, unambiguous, and contextual terms that reduce the measure to a single trait or characteristic. Failure to do this in the example citing the size of the college resulted in a misspecification of this variable. The respondents variously interpreted size of college as spring enrollment, fall enrollment, total spring and fall enrollment, total full-time plus part-time spring enrollment, total full-time and part-time fall enrollment, full-time equivalent enrollment, and so on. The construct should have been operationalized as the enumeration of both the total full-time enrollment and the total part-time enrollment as of the close of the spring 1992 semester or quarter.

Developing Measures From Operationalized Constructs

Measures are developed by giving operationalized constructs a dimension. Measures qualify and sometimes quantify the trait in a single dimension such as presence or absence or the amount, intensity, value, frequency of occurrence, or the ranking or rating or some other form of comparative valuation or quantification. The next few paragraphs will help familiarize the reader with some of the requirements of a measure. Although this familiarization will proceed in other chapters of this paper through discussion and example, evaluators should consult a text specifically devoted to measurement or consult a specialist when complex measures are required.

Measures must be accurate, precise, valid, reliable, relevant, realistic, meaningful, comprehensive, and in some cases complementary, sensitive, and properly anchored. While evaluators may readily understand the meaning of precision and accuracy, some of the other terms may need to be defined, because in measurement they are used in a very special way. For instance, measures are considered valid if they are logical and they measure what they say they are measuring. They must adequately represent the trait in question. They must consistently predict outcomes, vary as expected in a variety of situations, and hold up against rigorous attempts to prove them invalid. We have all seen valid and questionable measures. Positive examples might be found in well-executed polls that predict voter outcome to a reasonable degree of accuracy. A negative example might be found in the logic of using complaints as a measure of discrimination, because the cost, time to resolve a case, difficulty in proving discrimination, difficulty in filing, fear of retaliation, and other reasons discourage the aggrieved from filing a complaint.

Next, consider reliability, which is different from and independent of validity. To be reliable, a measure must give consistent results when repeated under similar situations. For example, IQ tests and employee attitude surveys usually give consistent results when repeated under similar circumstances with the same people.
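One common way to examine reliability in this sense is to give the same measure twice to the same people and correlate the two sets of scores. The sketch below is only an illustration under stated assumptions: the scores are invented, the function name pearson_r is ours, and the Pearson correlation is one of several statistics a measurement specialist might choose.

```python
from math import sqrt

def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_1 = sum(first) / len(first)
    mean_2 = sum(second) / len(second)
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    var_1 = sum((a - mean_1) ** 2 for a in first)
    var_2 = sum((b - mean_2) ** 2 for b in second)
    if var_1 == 0 or var_2 == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_1 * var_2)

# Hypothetical attitude scores from the same five employees, asked twice
# under similar circumstances.
first_round = [3, 4, 2, 5, 4]
second_round = [3, 4, 3, 5, 4]
retest_r = pearson_r(first_round, second_round)
print(f"test-retest correlation: {retest_r:.2f}")
```

A value near 1 suggests the measure gives consistent results on repetition. It says nothing about validity, which must be judged separately.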

Measures should be relevant, meaningful, and realistic. For example, some very valid measures like IQ and grade-point average are used to hire employees. These are not relevant measures if the new employee is expected to be creative and inventive and generate new ideas, because the traits of IQ, grade-point average, and creativity are not correlated. Also, the labels given to the measure should correctly describe and communicate its meaning. For example, managers frequently measure things like costs, staff time, and number of reports under the term "quality measures." These measures may index effectiveness or productivity but not quality. The measure should be realistic or practical. For example, if a reader's pupils are dilated, this might be a good measure of his or her interest, but these observations are very hard to make. Therefore, under certain circumstances, the respondents' recall and self-reports, while not as accurate, are more useful because answers are easy to obtain.

Ideally, measures ought to be comprehensive and, in some cases, complementary. Comprehensive measures span the entire range of values that are of interest with equal precision. A single measure usually refers to a single trait, but sometimes, if the construct is multidimensional or has several traits, we develop a measure that captures these multitrait effects, either for reasons of economy or because of the need to capture two or more traits as they work together. For example, asking the respondent if the text was easy to read and readily understandable might be considered a comprehensive measure. In contrast, complementary measures are measures that are distinct and must be taken together to reflect the construct. For example, the number of contrast shades and the sharpness of the contour lines are needed to measure photographic image quality.

Other features of measures that are also important are sensitivity and anchoring. Sensitivity refers to a measure's ability to detect (1) the presence or absence of the trait, (2) levels of intensity, or (3) changes in the level of intensity with sufficient precision at sufficiently low levels to meet the needs of the evaluation. Anchoring refers to the establishment of clear, concrete points on the measurement scale that are meaningful to the respondent. That is, the scale should have meaningful starting, interim, mid, and end points. For example, we might anchor estimations of lighting quality as too dim (not bright enough to read a newspaper), appropriate (could comfortably read a newspaper), or too bright (too much glare to comfortably read a newspaper).

An example of a complex measure taken from one of the cases cited in the preceding part of this chapter is presented in figure 2.1. The measure was developed from a construct identified with a questionnaire framework: the user's perception of the quality of an earth-orbiting satellite image. The construct was operationalized and developed into a measure of image quality. During this process, particular attention was given to accuracy, precision, validity, reliability, realism of application, meaningfulness of concept, the comprehensiveness and complementary requirements, measure sensitivity, and anchoring of the measure.
*insert figure 2.1

Specify the Key Variable Relationships


We conclude this chapter with a brief but important discussion on specifying the variable relationships to be evaluated. (The two remaining sets of documentation needed to initiate the planning--the identification of and the selection of the target population--are discussed in the next chapter.) This task is important because, as we shall see, errors or omissions in specifying the variable relationship can either invalidate or weaken the evaluation. In this task, evaluators document and review all variables to ensure that all key variable relationships are included and specified with common units of analysis, for appropriate functional relationships, and in the appropriate measurement stratification and time periods so as to permit statistical, temporal, and cross-sectional observations and comparability. These variable relationships should be documented down to the level of measurement specification.

Then the evaluation design, the evaluation framework, and the questionnaire framework should be checked against this documentation to make sure nothing important is left out and that nothing unnecessary is included. A review should ensure that the sample or population measurements are to be taken on--and generalized to--common units of analysis. For example, in one case we found that one measure was to be taken on contractors, while its comparison measures applied to contracts. A review should be made for changes that would facilitate statistical comparability. For instance, the evaluators may find that one of the measures to be related is unnecessarily categorized while the other is continuous, or that some measures are inappropriately categorized for the intended cross-sectional comparisons, thus weakening the statistical power of the analysis or, worse yet, rendering the analysis invalid.

Further, review should make sure the specified categories in the comparison variables are not likely to confound cross-sectional comparisons. For instance, suppose we know from past studies that the effects of training are not likely to be noticed until 9 months later, there is less bias against the mentally disabled in the city than in the suburbs, or treatment for violence exposure is most effective soon after the incident. If evaluators test for training effect soon after the training, they may not see an influence because the trainees did not have enough development time to assimilate their experience. If the test is for bias against the mentally disabled only in the inner city rather than in both the inner city and the suburbs, the evaluators may not find the effect because this bias is less noticeable in the inner city. If they test for the effects of treatment for exposure to violence on only those who waited a year before receiving treatment, they may not see the effect because the treatment was given too late to do much good. Hence, evaluators must make sure that the cross-sectional comparison categories are structured to capture, not hide, the effects under study.

Another point is to make sure the temporal comparisons are appropriate. For example, it is not unusual to find that the data for the different variables in the relationship are to be collected during different years. Finally, it is important to be sure important categories were not left out. This is because the sampling specialists will use this documentation to design the sample. For instance, in one case the evaluator was disappointed to find that the sample did not have enough power to compare important city, race, and educational stratifications because the sampling specialist had not been aware of these stratifications.
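Some of the checks described in this section, such as common units of analysis and matching time periods, can be made routine once the variable documentation is kept in a structured form. The following is a minimal sketch, assuming a simple record per variable; the class name VariableSpec, its fields, and the example entries are hypothetical and are not part of any GAO procedure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VariableSpec:
    """One documented variable; the field names are illustrative only."""
    name: str
    unit_of_analysis: str   # for example "contract" or "contractor"
    period: str             # for example "1992"

def relationship_problems(a, b):
    """List the mismatches that would weaken a comparison of a with b."""
    problems = []
    if a.unit_of_analysis != b.unit_of_analysis:
        problems.append(
            f"unit of analysis differs: {a.name} ({a.unit_of_analysis}) "
            f"vs {b.name} ({b.unit_of_analysis})"
        )
    if a.period != b.period:
        problems.append(
            f"time period differs: {a.name} ({a.period}) "
            f"vs {b.name} ({b.period})"
        )
    return problems

# Hypothetical entries echoing the contractor/contract mix-up described above.
cost_overrun = VariableSpec("cost overrun", "contract", "1992")
past_performance = VariableSpec("past performance", "contractor", "1991")
found = relationship_problems(cost_overrun, past_performance)
for problem in found:
    print(problem)
```

A check of this kind catches only mechanical mismatches. Whether the categories hide or capture the effect under study still requires the evaluator's judgment.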


Chapter 3
Designing the Sample or Population for Data Collection

Along with deciding what to ask, evaluators must decide whom to ask. The people questioned must have the information the evaluators need, they must be readily identifiable and accessible, they must be willing and able to answer, and they must be representative of the population being measured. They can be migrant workers, prisoners, police, scientists, medical doctors, commanders or soldiers, inner city African American youths, or government officials.

Ideally, everyone in the population should be questioned, and sometimes this is done if the population is very small. But usually the best that can be done is to take a sample of these people and generalize the findings to the population they come from.

In theory, to generalize findings, evaluators must first define the population. Then they should enumerate every unit in the population in a way such that every unit has an equal chance of being selected for the sample. In practice, it may be unrealistic to expect to enumerate every unit in a real population (for example, all persons who participated in a government program such as Head Start), but the enumeration must be reasonably complete and accurate and be reasonably representative of the actual population. The evaluators must then draw a representative sample from this population.
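The idea of giving every enumerated unit an equal chance of selection can be illustrated with a simple random draw. This is a sketch only: the frame of participant identifiers is invented, the function name is ours, and an actual design would normally be worked out with a sampling specialist.

```python
import random

def draw_simple_random_sample(frame, sample_size, seed=None):
    """Draw sample_size distinct units, each unit equally likely to be chosen."""
    units = list(frame)
    if sample_size > len(units):
        raise ValueError("sample cannot be larger than the enumerated population")
    rng = random.Random(seed)   # a recorded seed lets the draw be repeated
    return rng.sample(units, sample_size)

# Hypothetical frame: 1,000 enumerated program participants.
frame = [f"participant-{i:04d}" for i in range(1, 1001)]
sample = draw_simple_random_sample(frame, 50, seed=1993)
print(len(sample), "units drawn")
```

Recording the seed is a design choice that allows a reviewer to reproduce the draw from the same list.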

Survey Population

However, the sample cannot be determined or drawn until the evaluators have studied the size and characteristics of the population they want to know about. All too often, this step in questionnaire development is overlooked or assumed to be routine. Then, when the questionnaire is complete and ready to be mailed, the team is faced with weeks of hard research, or a major redesign, because the sample was not well founded.

The first step in defining the survey population is to learn about the population distribution--the major categories of units and the numbers in each category. For example, if the evaluators want to sample banks, they should learn the differences between county, regional, statewide, branch, and unit banks; they should know geographic location factors and understand the basis for classifying banks as very large, large, medium, and small. If they are studying unit commanders in the armed services, they should know the unit sizes and types and the variations among the services. This research will help in designing sampling factors, such as stratification and stratification size, and will ensure a representative sample.
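Once the numbers in each category are known, one simple way to set stratum sizes is to allocate the sample in proportion to each category's share of the population. The sketch below assumes proportional allocation and invented bank counts; the function name is hypothetical, and a sampling specialist may well prefer a different allocation, for example one that oversamples small but important strata.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size."""
    population = sum(stratum_sizes.values())
    exact = {s: total_sample * size / population
             for s, size in stratum_sizes.items()}
    allocation = {s: int(share) for s, share in exact.items()}
    # Hand any units lost to rounding down to the strata with the
    # largest fractional remainders, so the total is preserved.
    leftover = total_sample - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - allocation[s],
                          reverse=True)
    for stratum in by_remainder[:leftover]:
        allocation[stratum] += 1
    return allocation

# Hypothetical counts of banks in each size class.
bank_counts = {"very large": 50, "large": 150, "medium": 300, "small": 500}
allocation = proportional_allocation(bank_counts, 100)
print(allocation)
```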

Once the evaluators are familiar with the characteristics of the population, they can look for sources that enumerate each unit in the population or develop a reasonable theory for selecting the sampling units. The enumeration should be accurate, up-to-date, and organized to reflect the distribution characteristics. Sometimes this task is relatively easy. For example, in one project we needed to assess the effect that the Foreign Corrupt Practices Act had on U.S. business. The act prohibits payments to foreign officials if the purpose is to influence business. The population was U.S. companies that conduct most of the foreign business. These companies were readily identified because they were among the Fortune 1,000 companies, which conduct most of the foreign business. All we had to do was buy this list from Fortune magazine. The list gave the order of the companies by sales volume and provided information on each company's activities and the name and address of both the chief executive officer and the chairman of the board. However, for many other projects, considerable effort is needed to document the survey population.

In practice, evaluators rarely have a list of the real population; at best they have only a list at the time the source material was current. By the time the questionnaire is administered, some units will have left the population and others will have joined it. For example, in the Fortune 1,000 evaluation, 6 percent of the firms left the population and we do not know how many may have joined it. The sample analysis must evaluate and make statistical adjustments for the losses. Whenever possible, the effect of the additions should also be considered.

The best way to start enumerating a population is to talk to experts in the field and search out likely organizations, archives, directories, libraries, and management information systems until a reliable source has been discovered. Then the sampling units or population elements are organized, reorganized, or indexed into groups or frames, so they can be reached by a random, systematic, or prescribed process. For example, in one evaluation, we had to locate retired military users of military medical facilities. From a Department of Defense archival data base we were able to get the names and addresses of all the retired military personnel, but we had no way of knowing if they were users of a particular medical facility. Our field work showed that retired military were likely to travel up to 40 miles to use hospital services; if they lived farther away, they usually made other arrangements. So we developed a computer program, based on zip codes, that matched persons to the hospitals that were within 40 miles of their homes.
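The paper does not describe how the zip code matching program worked. One plausible approach, sketched here purely as an assumption, is to look up a latitude and longitude for each zip code and keep the hospitals whose zip code lies within 40 miles of the person's zip code, using the great-circle (haversine) distance. The zip codes, coordinates, and hospital names below are invented.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def miles_between(p, q):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))

def hospitals_within(home_zip, zip_centroids, hospitals, limit_miles=40.0):
    """Names of hospitals whose zip centroid is within limit_miles of home_zip."""
    home = zip_centroids[home_zip]
    return sorted(name for name, hospital_zip in hospitals.items()
                  if miles_between(home, zip_centroids[hospital_zip]) <= limit_miles)

# Hypothetical zip codes and centroids (not real postal data).
zip_centroids = {
    "00001": (38.90, -77.00),
    "00002": (39.00, -77.00),   # roughly 7 miles north of 00001
    "00003": (40.00, -77.00),   # roughly 76 miles north of 00001
}
hospitals = {"Hospital Near": "00002", "Hospital Far": "00003"}
nearby = hospitals_within("00001", zip_centroids, hospitals)
print(nearby)
```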

In a study of zoning problems encountered by group homes for the mentally disabled, we discovered that there was no national register of group homes. Since this was a study to see if this restrictive zoning practice was geographically widespread, we sampled catchment areas. We then called up the catchment area directors and got the names and addresses of every group home in each catchment area and sent the group home directors a questionnaire asking about their zoning problems.

Sometimes, no matter how hard the search, archival data or records cannot be found from which to develop a population. When this happens, the best thing to do is to look for groups, sections, or clusters of files or lists that contain the information. Or the evaluators may want to look at existing data to surmise some ratio or relationship associated with the population. For example, if they want to define the population of general aviation flight-service airport specialists, they may be able to use previous work or pilot or survey studies. From previous experience, they may find that they can estimate that the average number of specialists per airport is 16, multiply 16 specialists by the 316 airports, and estimate the population at about 5,000.
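The arithmetic behind such a ratio estimate is short enough to write out. In the sketch below, the average of 16 specialists per airport and the 316 airports come from the example above, but the five pilot counts used to arrive at the average are invented for illustration.

```python
def estimate_population(pilot_counts, number_of_sites):
    """Multiply the average count per site by the number of sites."""
    if not pilot_counts:
        raise ValueError("need at least one pilot count")
    average_per_site = sum(pilot_counts) / len(pilot_counts)
    return average_per_site, round(average_per_site * number_of_sites)

# Hypothetical specialist counts observed at five airports in earlier work.
pilot_counts = [14, 18, 16, 15, 17]
average, estimate = estimate_population(pilot_counts, 316)
print(average, estimate)   # 16 per airport times 316 airports, about 5,000
```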

Unfortunately, in a great many cases, there is neither a population enumeration nor a way to get cluster, unit, or ratio figures. In these cases, the evaluators must try to document the biggest possible portion of the most important and most representative cases, or they must develop some reasonable theory for selecting the sampling units. For example, to get a representative list of internal auditors, the evaluators might use the membership list for the Institute of Internal Auditors plus a list of the internal audit departments for the Fortune 1,000 companies. The latter would be included because most of them have internal audit departments.

In one situation, we had to sample major importers and exporters. The available list had over 10,000 entries, almost all of which were too small to be considered major. So we used a combination of a "small world network" and a "snowball" approach. We found an association on the eastern coast to which most major mid-Atlantic shippers belonged. We contacted the association and obtained a list of the major shippers and their business volume. This association identified two other shippers' associations, which provided their lists and the names of six more associations. We continued until we had identified all associations and had a list of most of the major shippers. The shippers' associations reviewed our list and estimated that it accounted for 82 percent of the import-export business.
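The referral chain in this example can be pictured as following links from one association to the next until no new names turn up. The sketch below is a simplified illustration of that stopping rule; the association names and the referral table are invented, and in practice each step is a contact with an organization rather than a table lookup.

```python
from collections import deque

def snowball(start, referrals):
    """Follow referrals breadth-first until no new associations are named."""
    found = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for referred in referrals.get(current, []):
            if referred not in seen:      # ignore names already on the list
                seen.add(referred)
                found.append(referred)
                queue.append(referred)
    return found

# Hypothetical referrals: each association names others it knows of.
referrals = {
    "Mid-Atlantic Shippers": ["Gulf Shippers", "Pacific Shippers"],
    "Gulf Shippers": ["Great Lakes Shippers", "Mid-Atlantic Shippers"],
    "Pacific Shippers": ["Great Lakes Shippers"],
}
associations = snowball("Mid-Atlantic Shippers", referrals)
print(associations)
```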

Many other sources of specialized lists are available, but their reliability varies considerably. For example, major organizations such as the American Medical Association, the National Education Association, and the National Association of Home Builders can provide detailed address lists and population descriptions of their members. However, their cooperation varies with their interest in what the job is about. The cost for lists can be anything from nothing to a few hundred to several thousand dollars. Although the Bureau of the Census sometimes has useful lists, such as the census of manufactures and the census of governments, these sources may be out of date. Many commercial sources, such as Ruben and Donnelly, Polk, and Thomas, sell population lists. Also, some commercial firms such as Dun and Bradstreet sell specialized lists for various users, such as mail order companies. Care must be taken in using these lists because their quality varies considerably and very little may be known about the bias built into them, how they were developed, or what they include and, more importantly, exclude.

Before using a list, it is a good idea to review and perhaps test it. For example, in a sample survey of farmers, the address list was developed from a list of subscribers to the Farm Home Journal. The list turned out to be several years old, and many of the subscribers were not farmers in the technical sense but people who sold or bought agricultural equipment or products or who were interested in rural living.

Selecting the Sample

Once the population has been enumerated and the evaluators are sure that it represents the population to which they want to generalize, they are ready to draw the sample.

The sample must be drawn in accordance with a procedure that ensures a random selection. The sample size must be large enough to provide the degree of measurement precision and accuracy generally accepted by the scientific community. This must be done very efficiently and cost effectively. In many instances, accomplishing this will require the assistance of a sampling statistician who has the appropriate technical skills and practical experience.1

1 See U.S. General Accounting Office, Using Statistical Sampling, GAO/PEMD-10.1.6 (Washington, D.C.: May 1992). This paper provides a thorough treatment of this topic.
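To give a feel for how precision requirements drive sample size, the sketch below uses a widely taught textbook approximation for estimating a proportion from a simple random sample, with an optional finite population correction. It is an assumption on our part that this formula suits a given job; it is not taken from this paper, the function name is hypothetical, and stratified or clustered designs call for a specialist's calculation.

```python
from math import ceil

def sample_size_for_proportion(margin, population=None, z=1.96, p=0.5):
    """Approximate sample size for estimating a proportion.

    margin     -- desired half-width of the confidence interval (0.05 = 5 points)
    population -- size of the enumerated population, if known
    z          -- normal critical value (1.96 for about 95 percent confidence)
    p          -- assumed proportion; 0.5 gives the most conservative size
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    n = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        n = n / (1 + (n - 1) / population)   # finite population correction
    return ceil(n)

print(sample_size_for_proportion(0.05))                    # large population
print(sample_size_for_proportion(0.05, population=1000))   # a list of 1,000 units
```

The result is the number of completed responses needed, so the number mailed out would have to be larger to allow for nonresponse.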

Nonstatistical Sampling

Questionnaires may be used on projects in which statistical sampling is not used, so we need to consider briefly other ways in which evaluators select cases (Deming, 1960). Either all the cases can be studied--that is, a census can be taken--or part of the population can be selected in a nonstatistical manner. When evaluators take part of the population, they usually do so for a reason. It may be that they are doing a case study, so they select one or more cases that provide the best opportunity to observe the phenomena or relationships of interest, and they do not need to generalize their findings to the population. In other situations, the evaluators know very little about the population and cannot draw a statistical sample, so they arbitrarily select as many cases as they can and report the findings. However, in many situations, evaluators want to generalize and they know something about the population but it is just not feasible to draw statistical samples. So they pick a sample that they hope will correspond, in its features, to the population, even though they know they will not be able to use the powerful reasoning associated with statistical samples. An important category of nonstatistical sampling is "judgment sampling."

A judgment sample draws its name from the fact that in the judgment of the evaluator, the cases chosen correspond to certain aspects of the population. The cases may be selected because they are judged most typical, because they represent the extreme ranges, because they represent a known part of the population, or because they simulate or act as a proxy for a representative sample from the population. For example, we could interview all the Fortune 500 chief executive officers in New York and Chicago because we believe that this sample is typical of chief executive officers in large companies. We could study selected group homes for the mentally disabled in California, Mississippi, New York, and Texas, because these states represent the extremes of the laws and practices. We could study 60 prime contractors with the Department of Defense in California and New York, because these contractors account for 82 percent of all defense contracts. We might pick 15 airports in 11 states, such that the sample would be similar to the population of airports with respect to size, geographic coverage, and weather conditions.

As a rule, the use of judgment sampling in a project in which the intent is to generalize is ill advised, because arguments to support generalization cannot be nearly as persuasive as with statistical samples. However, occasions may arise (as with a very homogeneous population) in which the situation is not altogether bleak.

When the validity of the findings depends on the extent to which they can be generalized to the population, and when there is no statistical sample, it might help to have some rule of thumb that might compare judgment samples to statistical samples. One way to picture the relationship between statistical samples and judgment samples with respect to representativeness might be to imagine a credibility scale from 1 to 10. Assume that a score of 1 is the value given to a single case study designed with no intent whatsoever to generalize, and 10 is the credibility associated with studying the whole population. A very large, statistically valid random sample might yield a value of 9. Large, medium, and very small but statistically valid random samples might yield respective scores of 8, 7, and 6. If we made many case studies but did not take a random sample, we might get a value of 4. We might extend this value to 5 if the groups were large enough to provide statistical certainty within their limited area of selection or if the population was very homogeneous. We might get the same score of 5 if we selected a number of cases that represented the range of conditions and circumstances that apply to the population. (Incidentally, this is how pretest candidates are selected, because there is neither time nor resources to draw a statistically valid sample.) However, the score would drop to 3 or even 2 if we selected many or few cases without giving consideration to representing the expected range of conditions.

A few years ago, we did a review of the elderly in which we selected thousands of cases at random from the same city. This might have been acceptable, from a generalization viewpoint, if we were measuring the conditions associated with cholesterol levels; these levels could be presumed similar for most U.S. city-dwellers. However, in this review, we were concerned about programs and their effects, which may have varied from city to city. Thus, limiting the sample to one city prohibited generalizations beyond the city that was studied. Another example involved a population of 132 health maintenance organizations. We arbitrarily picked 16 of these organizations and collected data from hundreds of people in each one. In the end, what we came up with was a set of 16 case studies. Although the sample for each case study was representative of the population of people in one of the 132 health maintenance organizations, the 16 case studies together permitted only very careful and limited findings. We might have had a much more powerful evaluation at a fraction of the cost if we had taken a random sample of organizations and looked at fewer cases within each organization.
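The alternative suggested in the last sentence, a random sample of organizations followed by a smaller random sample of people within each, can be sketched as a two-stage draw. Everything specific below is assumed for illustration: the membership lists are invented, and the choices of 40 organizations and 25 people per organization are arbitrary.

```python
import random

def two_stage_sample(members_by_org, n_orgs, n_per_org, seed=None):
    """Randomly pick organizations, then randomly pick people within each."""
    rng = random.Random(seed)
    chosen_orgs = rng.sample(sorted(members_by_org), n_orgs)
    return {
        org: rng.sample(members_by_org[org],
                        min(n_per_org, len(members_by_org[org])))
        for org in chosen_orgs
    }

# Hypothetical frame: 132 organizations with 200 enrolled members each.
members_by_org = {
    f"HMO-{i:03d}": [f"HMO-{i:03d}-member-{j:03d}" for j in range(200)]
    for i in range(1, 133)
}
drawn = two_stage_sample(members_by_org, n_orgs=40, n_per_org=25, seed=7)
print(len(drawn), "organizations,", sum(len(v) for v in drawn.values()), "people")
```

When organizations differ greatly in size, the analysis would usually need weights to reflect each person's chance of selection, which is a matter for the sampling specialist.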


Chapter 4
Formatting the Questions



Before writing the questionnaire, the evaluators need to choose the format for each question. Each format presented in this chapter serves a specific purpose that should coincide with the available information and data analysis needs.

Open-Ended Questions

Open-ended questions are easy to write and require very little knowledge of the subject. All the evaluators have to do is ask a question, such as "What factors do you consider when you pick a carrier?" But this type of question provides a very unstandardized, often incomplete, and ambiguous answer, and it is very difficult to use such answers in a quantitative analysis. Respondents will write some salient factors that they happen to think of (for example, lower rates and faster transit time) but will leave out some important factors because at that moment they did not think of them. Open-ended questions do not help respondents consider a range of factors; rather, they depend on the respondents' unaided recall. There is no way of knowing what was important but not recalled, and because not all respondents consider the same set of factors, it may be extremely difficult or impossible to aggregate the responses.

Also, the evaluators may not know how to interpret the answers. For example, people might say they choose a carrier because it is more convenient or less trouble. There is no way of knowing what this means. It may mean anything from faster transit time to easier documentation.

Another problem is that open-ended questions cannot easily be tabulated. Rather, a complicated process called "content analysis" must be used, in which someone reads and rereads a substantial number of the written responses, identifies the major categories of themes, and develops rules for assigning responses to these categories. Then the entire sample has to be gone through to categorize each answer. Because people interpret differently, three or four people have to categorize the answers independently. Furthermore, rules must be developed to handle disagreements.1 Similarly, at the conclusion of the data reduction phase, only very low levels of qualitative analysis can be performed.

1 Interrater reliability is a measure of the consistency among the people categorizing the answers.
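To make the idea of consistency among coders concrete, the sketch below computes two common agreement measures for a pair of coders: simple percent agreement and Cohen's kappa, which adjusts for agreement expected by chance. The choice of these two measures, the category labels, and the sample codings are our assumptions for illustration; with three or four coders, the same functions could be applied to each pair.

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Share of responses that two coders placed in the same category."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Agreement between two coders, corrected for chance agreement.

    Undefined (division by zero) if chance agreement is exactly 1,
    i.e., both coders used a single identical category throughout.
    """
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    # Chance agreement: probability both coders pick the same category at random.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codings of six open-ended answers by two coders.
coder_1 = ["rates", "transit", "rates", "convenience", "transit", "rates"]
coder_2 = ["rates", "transit", "convenience", "convenience", "transit", "rates"]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
```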

Still another problem is that open-ended questions substantially increase response burden. They usually take several minutes to answer, rather than a few seconds. Because respondents must compose and organize their thoughts and then try to express them in concise English, they are much less likely to answer.

However, open-ended questions do sometimes have advantages. It may happen that they are unavoidable when, for example, we are uncertain about criteria or we are engaged in exploratory work. If we ask enough people an open-ended question, we can develop a list of alternatives for closed-ended questions. We can also use open-ended questions to make sure our list of structured alternatives did not omit an important item or qualification. We can also ask open-ended questions to obtain responses that might further clarify the meaning of answers to closed-ended questions or to gather respondent examples that can be used to illustrate points. The rest of this chapter details closed-ended questions, because they are the meat and potatoes of our work.

Fill-in-the-Blank Questions

Each questionnaire usually has some fill-in-the-blank questions. They are not open-ended because the blanks are accompanied by parenthetical directions that specify the units in which the respondent is to answer. Some examples are shown in figure 4.1.

Figure 4.1 Fill-in-the-Blank Questions

  1. What was your age on your last birthday?____(age in years)
  2. What was your city's infant mortality rate last year?____(mortality rate in deaths/1,000)
  3. What size is your printing plant?____(in square ft.)

Fill-in-the-blank questions should be reserved for very specific requests. The instructions should be explicit and should specify the answer units. Sometimes, several fill-in-the-blank questions are asked at once in a row, column, or matrix format, as shown in the examples presented in figure 4.2.
*insert figure 4.2

Yes-No Questions


Unfortunately, yes-no questions are very popular. Although they have some advantages, they have many problems and few uses. Yes-no questions are ideal for dichotomous variables, such as black and white, because they measure whether the condition or trait is present or absent. They are therefore very good for filters in the line of questioning and can be used to move respondents to the questions that apply to them, as in figure 4.3.

Figure 4.3 Yes-No Filter Question

  1. Did you get training? (Check one.)
    __ Yes (continue)
    __ No (go to question 5)
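For questionnaires that are administered or checked by computer, a filter like the one in figure 4.3 can be expressed as a small table of skip rules. The sketch below is a minimal illustration; the function names and the six-question length are hypothetical.

```python
def next_question(current, answer, skip_rules):
    """Return the number of the next question to present."""
    # With no matching rule, the respondent simply continues to the next item.
    return skip_rules.get((current, answer), current + 1)

def route(answers, last_question, skip_rules):
    """List the questions a respondent should see, given the filter answers."""
    path, q = [], 1
    while q <= last_question:
        path.append(q)
        q = next_question(q, answers.get(q), skip_rules)
    return path

# Rule from figure 4.3: a "No" on question 1 sends the respondent to question 5.
skip_rules = {(1, "No"): 5}

trained_path = route({1: "Yes"}, last_question=6, skip_rules=skip_rules)
untrained_path = route({1: "No"}, last_question=6, skip_rules=skip_rules)
```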

However, most of the questions GAO asks deal with measures that are not absolute or measures that span a range of values and conditions. Consider the question: "Were the terms of the contracts clear?" Most people would have trouble with this question because it involves several different considerations. First, some contracts may have been clear and others may not have been. Second, some contracts may have been neither clear nor unclear or of marginal clarity. Third, parts of some contracts may have been clear and others not clear.

Because so little information is obtained from each yes-no question, several rounds of individual questions have to be administered to get the information needed: "Did you have a plan?" "Was the plan in writing?" "Was it a formal plan?" "Was it approved?" This method of inquiry is usually so boring as to discourage respondents.

Sometimes, question writers try to compress their line of inquiry and cause serious item-construction flaws. They ask for two things at once: a double-barreled question. For instance, a yes-no answer to "Did you get mission and site support training?" is imprecise. How do respondents answer if they got mission but not site support training?

A related question-writing mistake is mixing yes-no and multiple choice. See figure 4.4.

Figure 4.4 Mixed Yes-No and Multiple Choice Question

  1. Did you get mission and/or site training?
    1. __ Yes, mission but not site training
    2. __ Yes, site but not mission training
    3. __ Yes, both mission and site training
    4. __ No, neither mission nor site training

The example in figure 4.4 has several problems. The question and the response space do not agree. This slows up the cognitive processing because the question prepares the reader for a simple yes-no answer. But in reality the reader gets not a yes-no answer space but, rather, a list of qualified alternatives. The response alternatives are biased toward "yes" because most of the choices have "yes" in them. Furthermore, "no" in the last item cannot be used with the correlative conjunction "neither nor," because this is an unintended double negative. Such questions make a simple inquiry difficult because they are counter to the cognitive process, burdensome, and cause errors.

Yes-no questions are prone to bias and misinterpretation for several reasons. First, many people like to say "yes." Some have the opposite bias and like to say "no." Second, questions such as "Do you submit reports?" have what is called an "inferred bias" toward the "yes" response. The most common way to counter this bias is to add the negative alternative, for example, "Do you submit reports or not?" However, if this is done, the use of yes-no choices in the answer must be qualified or avoided. Without this precaution, a simple "yes" answer may be read as applying to both parts of the question, "Yes, I submit" and "Yes, I do not submit." A simple "no" might also be read as "No, I do not submit," a double negative. To prevent confusion, qualify the answer choices or avoid yes-no answers. See figure 4.5.

Figure 4.5 Balanced and Unambiguous Yes-No Question

  1. Do you submit reports or not? (Check one.)
    1. __ I submit reports.
    2. __ I do not submit reports.

"Implied No" Choices

In figure 4.6, failure to check an item implies "no." The implied-no choice format is used because it is easy to read and quick to answer.

Figure 4.6 "Implied No" Question

  1. What health problems, if any, did the VA tell you that you had? (Check all that apply.)
    1. __ Skin problems
    2. __ Liver or kidney problems
    3. __ Tumors or growths
    4. __ Problems with your nerves
    5. __ Other health problems (please specify)

When evaluators want to emphasize the "no" alternative, they can expand the implied-no format to include one column for "yes" answers and one for "no." "No" is listed as an option when the respondent might not answer or might overlook part of the question, as when the choices are difficult, the list of items is long, or the respondent's recollection is taxed. If "no" is not included as an alternative, no's will be overreported, because the analysts will not be able to differentiate real no's from omissions and nonresponses. An example appears in figure 4.7.

Figure 4.7

Questions asked                           Yes (1)   No (2)
1. Nervousness                              __        __
2. Headaches                                __        __
3. Numbness in arms, hands, legs, feet      __        __
4. Infections                               __        __
5. Liver problems                           __        __
6. Weight loss                              __        __
7. Fatigue                                  __        __
8. Skin problems                            __        __
9. Lung problems                            __        __
10. Change in sex drive                     __        __
11. Sterility                               __        __
12. Birth defects in children               __        __
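The analytic difference between the two formats can be illustrated with a short tabulation. In this sketch, which uses hypothetical responses, the codes 1 and 2 follow the "Yes (1)" and "No (2)" columns of figure 4.7, and a blank row is stored as a missing value. The same data are then tabulated as if they had come from an implied-no checklist.

```python
def tabulate_explicit(responses, item):
    """Count yes, no, and missing answers for one item in a yes/no column format."""
    counts = {"yes": 0, "no": 0, "missing": 0}
    for r in responses:
        answer = r.get(item)  # None means the row was left blank
        if answer == 1:
            counts["yes"] += 1
        elif answer == 2:
            counts["no"] += 1
        else:
            counts["missing"] += 1
    return counts

def tabulate_implied_no(responses, item):
    """In a check-all-that-apply format, every unchecked item is read as 'no'."""
    yes = sum(1 for r in responses if r.get(item) == 1)
    return {"yes": yes, "no": len(responses) - yes}

# Hypothetical answers from five respondents; two left the row blank.
responses = [{"Headaches": 1}, {"Headaches": 2}, {}, {"Headaches": 2}, {}]

explicit = tabulate_explicit(responses, "Headaches")
implied = tabulate_implied_no(responses, "Headaches")
```

With the explicit columns, two real no's and two omissions can be told apart; under the implied-no reading, all four are counted as no's.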



Single-Item Choices

In single-item choices, respondents choose not "yes" or "no" but one of two or more alternatives. See figure 4.8 for an example. Since yes-no and single-item choices are similar, they have the same types of problems, but the difficulties are less pronounced in some respects and accentuated in others.

Figure 4.8 Single-Item Choice Question

  1. There are two programs for educating the handicapped.  One program provides special education in separate classrooms and uses a curriculum different from that used for the main group of children.  Another program (called mainstreaming) includes the handicapped in the regular classroom, adapts the main curriculum to special education, and makes other provisions for the handicapped.  The question is, which alternative do you prefer? (Check one.)
    1. __ Separate special education classes
    2. __ Mainstream classes

On the positive side, the differences between the choices are usually clear, and the writer can set up a truly dichotomous question. If used carefully, the single-item choice can be efficient. It often serves to filter people out or to skip them through parts of the questionnaire. It is not likely to be overused and cause excessive cycles of repetition. Furthermore, the question writer is not likely to compress the question into a double-barreled item. The single-choice format is also not subject to bias from yea sayers or nay sayers. And eliminating the negative alternative reduces misinterpretation.

But there are problems. In the single-choice format, the writer is more apt to bias one of the choices by understating or overstating it. Some writers may not properly emphasize the second alternative; others, aware of this tendency, overcompensate.

Expanded Yes-No Questions

One way around the yes-no constraints is to use an expanded yes-no format like that shown in figure 4.9. The expanded yes-no format gives a measure of intensity, avoids some of the biases common to yes-no, implied-no, and single-choice questions, and resolves the problem of quibbling. Consider the question, "Could you have gotten through college without a loan or not?" In the expanded format, more students will answer in the negative than would otherwise.

Figure 4.9 Expanded Yes-No Format

  1. __ Yes
  2. __ Probably yes
  3. __ Probably no
  4. __ No

The expanded alternatives can have qualifiers other than "probably yes" and "probably no." Qualifiers can be changed to meet the situation: "generally yes" and "generally no" or "for the most part yes" and "for the most part no."

Free Choices

Yes-no, implied-no, single-choice, and expanded formats are forced choices in that respondents must answer one way or the other. Forced-choice items generally simplify measurement and analysis because they divide the population clearly into those who do and those who do not or those who have and those who have not. Unfortunately, putting the population into just two camps may also oversimplify the picture and yield error, bias, and unreliable answers. To avoid this problem and to reduce the respondent's burden, a middle category can be added, as in the question in figure 4.10.

Figure 4.10 Expanded Yes-No Format With Middle Category

  1. __ Yes
  2. __ Probably yes
  3. __ Uncertain
  4. __ Probably no
  5. __ No

Even though the proportion of yes's to no's will not change, the evaluators will have a better measure of the yes-no polarization, because the middle category absorbs those who are uncertain. A good rule of thumb is that if we are not certain that nearly everyone can make a clear choice, we include a middle category.

Usually, the question asker will also put in an "escape choice" to filter out those for whom the question is not relevant. Examples are "not applicable," "no basis to judge," "have not considered the issue," and "can't recall." See figure 4.11.

Figure 4.11

  1. __ Yes
  2. __ Probably yes
  3. __ Uncertain
  4. __ Probably no
  5. __ No
  6. __ Have not considered the issue

Multiple-Choice Questions

The most efficient format, and the most difficult to design, is the multiple-choice question. The respondent is exposed to a range of choices and must pick one or more, as in the example in figure 4.12.

Figure 4.12 Multiple-Choice Question

  1. What reasons explain why you or your family went or had to go elsewhere for care? (Check all that apply.)
    1. __ No doctor available to treat your particular case.
    2. __ There was a very long waiting list for an appointment, so you were advised that it was better to go elsewhere.
    3. __ The equipment required for your care was not available at that facility.
    4. __ The facility was very busy and you preferred to go elsewhere for care.
    5. __ Other (specify) _______________________

Multiple-choice questions are difficult to write because the writer must provide a comprehensive range of nonoverlapping choices. They must be a logical and reasonable grouping of the types of experience the respondents are likely to have encountered.

The example in figure 4.12 turned out to be flawed in practice. We learned during the pretest that we had left out some important choices. We detected this error because many respondents wrote answers in the "other" category.
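The kind of check that revealed this flaw is easy to automate when pretest answers are recorded. In the sketch below, each pretest response is the set of choice numbers checked in figure 4.12, where choice 5 is "Other." The pretest data and the 20 percent flagging threshold are hypothetical assumptions, not a GAO standard.

```python
def other_share(responses, other_code=5):
    """Fraction of respondents who used the 'other' category."""
    if not responses:
        return 0.0
    used_other = sum(1 for checked in responses if other_code in checked)
    return used_other / len(responses)

# Hypothetical pretest: each set holds the choices one respondent checked.
pretest = [{1}, {5}, {2, 5}, {5}, {3}, {5}, {4, 5}, {5}]

FLAG_THRESHOLD = 0.20  # assumed cutoff for "too many write-ins"
share = other_share(pretest)
needs_revision = share > FLAG_THRESHOLD
```

A high share signals that the write-in answers should be read and the most common ones added as explicit choices before the questionnaire is fielded.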

Because this format is very important and requires the most research, field work, and testing, and because the analysis and interpretation can be complex, we discuss multiple-choice question design in chapter 7 in considerably more detail.

Ranking and Rating Questions

Ranking questions are used to make very difficult distinctions between things that are of nearly equal value. The question forces the respondent to value one alternative over another no matter how close they are. The value that is assigned is a relative value. Rating questions are used when the alternatives are likely to vary somewhat in value and when evaluators want to know how valuable the alternative is rather than if it is a little more or less valuable than the next alternative. First consider ranking. In ranking, the respondents are asked to tell which alternative has the highest value, which has the second highest, and so on. They rank the choices with respect to one another, but their answers tell little about the intrinsic value of their choices. For example, suppose we asked respondents to rank the importance of the following services for institutionalized children: education, health care, lawn care, telephones, and choir practice. They would be hard put to choose between education and health care, because both are essential to the children's development. But they would have to rank one first and one second. Telephones would probably be ranked third. Compared to health care and education, telephones are much less important, yet they are ranked third just behind two services that are so important that it is difficult to choose between them.
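The loss of information in ranking can be shown with a few lines of code. The importance ratings below (on a 1-to-5 scale) are hypothetical values chosen to match the example of services for institutionalized children; the tie between education and health care is broken by name order only because a ranking must break it somehow.

```python
# Hypothetical importance ratings on a 1 (low) to 5 (high) scale.
ratings = {
    "education": 5,
    "health care": 5,
    "telephones": 2,
    "lawn care": 1,
    "choir practice": 1,
}

def forced_ranks(ratings):
    """Rank items from most to least valued, forcing a choice between ties."""
    ordered = sorted(ratings, key=lambda item: (-ratings[item], item))
    return {item: position for position, item in enumerate(ordered, start=1)}

ranks = forced_ranks(ratings)

# Telephones sit one rank behind health care, but three rating points behind.
rank_gap = ranks["telephones"] - ranks["health care"]
rating_gap = ratings["health care"] - ratings["telephones"]
```

The ranks record only order, so the large drop in importance between the second and third services disappears.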

Ranking starts to get hard for people when there are more than seven categories. This is because they can usually pick the first and second and third and then the last and next to the last and the next to the next to last, so that what is left is the middle. But for more than seven items, respondents begin to lose track of where they are with respect to the first, last, and middle positions. When this happens, they make mistakes. For more than seven items, respondents can be given special task-taking procedures to counter this problem. But this procedure is rather burdensome.

Also, ranking questions have to be written very carefully. The slightest lapse in clarity in the question or the instruction given will cause some people to rank in the reverse order or to assign two alternatives the same rank or to forget to rank every alternative. Nonetheless, ranking must sometimes be used. The example in figure 4.13 is one that has worked reasonably well. Respondents will make a few errors, but statistical procedures are available to handle them.

Figure 4.13

Consider each of the following types of findings, which are often used to assess programs.  FROM YOUR EXPERIENCE, which do you think are more likely to impress the state education agency (SEA) program officers?  Indicate your answer by rank ordering each of the following alternatives from the most to the least impressive.  Select the type of result you think is most likely to impress the SEA officials.  Rank this 1st by checking.  Do the same for all the remaining categories, ranking them 2nd, 3rd, 4th, 5th, 6th, and 7th.

                                                                                            1st 2nd 3rd 4th 5th 6th 7th
1. Improvement in educational management or accountability                                  __  __  __  __  __  __  __
2. Improvement in school or facilities                                                      __  __  __  __  __  __  __
3. Student improvement through gain scores on grades or teacher rating                      __  __  __  __  __  __  __
4. Student improvement through gain scores on standardized norm referenced tests            __  __  __  __  __  __  __
5. Student improvement through gain scores on criterion referenced tests                    __  __  __  __  __  __  __
6. Student improvement through gain scores in the affective domain (e.g., likes, dislikes)  __  __  __  __  __  __  __
7. Improvement in curriculum and instruction                                                __  __  __  __  __  __  __
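Two of the ranking errors described above, tied ranks and unranked alternatives, can be detected mechanically before analysis. The sketch below assumes each response is stored as a list of seven ranks, one per alternative, with a blank recorded as a missing value; this storage layout is our assumption. A reversed ranking cannot be detected this way, because it is a complete and valid ordering.

```python
def ranking_errors(ranks, n_items=7):
    """Return a list of problems found in one respondent's set of ranks."""
    errors = []
    given = [r for r in ranks if r is not None]
    if len(given) < n_items:
        errors.append("unranked items")
    if len(set(given)) < len(given):
        errors.append("tied ranks")
    if any(r < 1 or r > n_items for r in given):
        errors.append("rank out of range")
    return errors

# Hypothetical responses to the seven alternatives in figure 4.13.
clean = ranking_errors([3, 1, 2, 7, 6, 5, 4])
tied = ranking_errors([1, 2, 2, 4, 5, 6, 7])
incomplete = ranking_errors([1, 2, 3, None, 5, 6, 7])
```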

Rating questions are perhaps our most useful format because we usually want to know the actual or absolute value of the trait we are measuring. Ratings are assigned solely on the basis of the score's absolute position within a range of possible values. For example, a rating scale might be assigned the following categories: of little importance, somewhat important, moderately important, and so on. In writing rating questions, we should try to categorize the scales in equal intervals and anchor the scale positions whenever possible. Aside from the scaling, rating questions are easier to write properly and cause less error than ranking questions. We can see from the two examples of the rating format shown in figure 4.14 that ratings provide an adequate level of quantification for most purposes. We can also see by comparing the examples in figures 4.13 and 4.14 that rating formats are far less cumbersome than ranking formats.

Figure 4.14 Rating Questions

  1. Under what risk classification should Presentence Investigation reports contain recommendations for court conditions? (Check one.)
    1. __ Maximum risk
    2. __ Moderate risk
    3. __ Minimum risk
  2. Rate how well the report contents were supported by verification, referencing of sources, statistics, statements of scientific certainty, or soundness of data-gathering methods.  (Check one.)
    1. __ More than adequate
    2. __ Generally adequate
    3. __ Of marginal or borderline adequacy
    4. __ Inadequate
    5. __ Very inadequate

Guttman Format

In questions written in the Guttman format, the alternatives increase in comprehensiveness; that is, the higher-valued alternatives include the lower-valued alternatives.
Applying this principle in one job, we asked state resource officials how they benefited from an earth-orbiting satellite. The question is given in figure 4.15. Here we assumed that if respondents had measured the benefit, they had identified it, and if they had determined the cost-benefit ratio, they had measured the primary and secondary benefits and lack of benefits as well as the worth or dollar value of these benefits and lack of benefits.

Figure 4.15

Consider the benefits, if any, that your state government may have received from participating in the LANDSAT program.  Identify the benefit areas and the degree to which you can qualify and/or quantify these benefits.  (Check column 1 if a particular benefit was not identified; otherwise check one of columns 2-5.)

                            Qualification of Benefits

Columns:
  (1) No benefit identified
  (2) Identified benefits
  (3) Measured some or all benefits
  (4) Assessed worth and/or dollar value of benefits
  (5) Made cost-benefit analysis

Benefit area                                                (1)  (2)  (3)  (4)  (5)
1. Agriculture/forestry, range resources                    __   __   __   __   __
2. Land use survey and mapping                              __   __   __   __   __
3. Mineral resources, geostructural, and land form surveys  __   __   __   __   __
4. Water resources                                          __   __   __   __   __
5. Marine resources and ocean surveys                       __   __   __   __   __
6. Meteorology                                              __   __   __   __   __
7. Environment                                              __   __   __   __   __
8. Other                                                    __   __   __   __   __
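The cumulative assumption behind figure 4.15 can be written out explicitly: the single column a respondent checks implies every lower level as well. The following sketch expands a checked column into the levels it implies. The shortened level labels and the function name are ours.

```python
# Levels from figure 4.15, keyed by column number (column 1 = no benefit identified).
LEVELS = {
    2: "identified benefits",
    3: "measured some or all benefits",
    4: "assessed worth and/or dollar value of benefits",
    5: "made cost-benefit analysis",
}

def implied_levels(column_checked):
    """Return every level implied by the single column checked."""
    if column_checked not in (1, 2, 3, 4, 5):
        raise ValueError("column must be between 1 and 5")
    # Under the Guttman assumption, checking column k implies columns 2..k.
    return [LEVELS[k] for k in range(2, column_checked + 1)]

nothing = implied_levels(1)
through_worth = implied_levels(4)
```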



Intensity Scale Questions

The intensity scale format is usually used to measure the strength of an attitude or an opinion. Two popular versions, the extent and expanded yes-no scales, are presented in figure 4.16.

Figure 4.16 Extent Scale and the Expanded Yes-No Scale Questions

  1. To what extent, if at all, do you believe an international agreement against bribery would strengthen American companies' competitive position abroad? (Check one.)
    1. __ To little or no extent
    2. __ To some extent
    3. __ To a moderate extent
    4. __ To a great extent
    5. __ To a very great extent
    6. __ No opinion
  2. Do you feel that an international trade agreement against bribery would strengthen American companies' competitive position  abroad or not?  (Check one.)
    1. __ Yes
    2. __ Probably yes
    3. __ Uncertain
    4. __ Probably no
    5. __ No

Likert Scale
Another frequently used intensity scale format is the Likert or agree-or-disagree scale. The Likert scale is easy to construct. Consider the extent-scale example of figure 4.16. As shown in figure 4.17, all the question writer has to do is convert the question into a statement and follow it with agree-or-disagree choices.

Figure 4.17 Extent Scale Converted to Likert Scale Question

  1. An international agreement against bribery would strengthen U.S. companies' competitive position abroad.  (Check one.)
    1. __ Strongly agree
    2. __ Agree
    3. __ Undecided
    4. __ Disagree
    5. __ Strongly disagree
    6. __ No basis for judging

However, if the writer is not careful, the simplicity and adaptability of the Likert scale format are often paid for by greater error and threats to validity.

First, there is bias. The Likert scale presents only one side of an argument, and some people have a natural tendency to agree with the "status quo" or the argument presented. Writers of Likert scale questions could attempt to counter this bias error by presenting the converse statement also. For example, they would first ask for a response to "My boss does not let me participate in decisions (agree or disagree)." Then in a subsequent part of the questionnaire, they have to ask their questions in reverse: "My boss lets me participate in decisions (agree or disagree)."

But now the line of inquiry is no longer concise or simple. The questions are doubled in number with a serial repetitive format that interferes with the cognitive recall process, aside from inhibiting motivation because these formats quickly become boring. Furthermore, developing precise converse statements of counterbalancing intensity can be difficult and complex. For example, "not satisfied" is not necessarily the opposite of "satisfied." And in the example above, the phrase "My boss does not let me participate" is much more negative than the phrase "My boss lets me participate" is positive.
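When a statement and its converse are both asked, the analysis typically reverses the scores on one of them and looks for respondents who agreed with both. The sketch below assumes the coding of figure 4.17 (1 = strongly agree through 5 = strongly disagree); the function names are hypothetical. It does not correct for the unequal intensity of the two statements noted above.

```python
def reverse_score(score, scale_max=5):
    """Flip a 1..scale_max agreement code so that both items point the same way."""
    if not 1 <= score <= scale_max:
        raise ValueError("score outside the agreement scale")
    return scale_max + 1 - score

def agrees_with_both(statement_code, converse_code, agree_codes=(1, 2)):
    """True when a respondent agrees with a statement and with its converse."""
    return statement_code in agree_codes and converse_code in agree_codes

# A respondent who strongly agrees (1) that "My boss lets me participate"
# and also agrees (2) that "My boss does not let me participate".
suspect = agrees_with_both(1, 2)
consistent = agrees_with_both(1, 5)
flipped = reverse_score(5)  # strong disagreement with the converse
```

Respondents flagged this way are showing the tendency to agree with whatever is presented, rather than a real attitude.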

Another problem is that the extent of the respondent's agreement or disagreement with a statement may not correspond directly to the strength of the respondent's attitude about the Likert statement posed in the question. The respondent may consider the statement either true or false and respond as if the question were in an "either or" format rather than a graduated scale measuring the intensity of a belief.

The Likert question uses the statement as a reference point or anchor. Hence, what is measured may be not the strength of the respondent's attitude over the complete range of intensities but, rather, the range of intensities bounded or referenced by the position of the anchoring statement at one end of the range and unbounded at the other end of the range. To complicate things even more, the single-bounding anchor may not be at the extreme end of the range; this makes comparisons among items very difficult.

The point is that the indirect approach in the Likert scale may produce misleading results for a variety of reasons. It is usually better to use a direct approach that measures the strength of the respondent's actual attitude over a complete range of intensities. For example, it is better to reformulate the item from "My boss never lets me participate" to "To what extent, if at all, do you participate?"

However, one situation in which the Likert scale is very useful is when extent of agreement or disagreement is closely and directly related to the statement. For instance, the respondent may be asked about the extent to which he or she agrees or disagrees with a policy, as in figure 4.18.

Figure 4.18 Likert Question Used to Evaluate Policy

  1. Some people agree with GAO's policy on rotation, while others do not.  The question is, how do you feel about the policy?  (Check one.)
    1. __ Strongly agree
    2. __ Agree more than disagree
    3. __ Undecided
    4. __ Disagree more than agree
    5. __ Strongly disagree

Amount and Frequency Intensity Scales
Many questions ask the respondent to "quantify" either amounts or frequencies. These are relatively simple. They use certain descriptive words to characterize the amount, frequency, or number of items being measured. For example, traits like "help," "hindrance," "effect," "increase," or "decrease" can be quantified by adding "little," "some," "moderate," "great," or "very great." Certain adjectives like "some" and "great" have a stable and relatively precise level of quantification. For instance, "some" is usually considered to be about 25 percent of the amount shown on the scale, and "a great amount" is usually considered to be about 75 percent. Sometimes such adverbs as "very" and "extremely" are used. Quantities can also be implied by the sequence of numbered alternatives ordered with respect to increasing or decreasing intensity. See figure 4.19, which uses both methods together, as is the common practice.

Figure 4.19 Amount Intensity Scale

  1. __ Little or no hindrance
  2. __ Some hindrance
  3. __ Moderate hindrance
  4. __ Great hindrance
  5. __ Very great hindrance

Frequencies or occurrences of events are treated the same way. Question writers know that words like "sometimes" and "great many" or "very often" mean about one fourth of the amount or 25 percent of the time and three fourths or 75 percent of the time, respectively, to most people. Similarly, words like "about half" and "moderate" anchor the midpoints. As with amount intensity scales, it is important to use both numbered, ordered scalar presentations and words to quantify the scale intervals. See figure 4.20.

Figure 4.20 Frequency Intensity Scale

  1. __ Seldom if ever
  2. __ Sometimes
  3. __ Often
  4. __ Very often
  5. __ Always or almost always

In many amount and frequency measures, where ambiguities are likely to occur, it is also important to use proportional anchors such as fractions and percents or verbal descriptive anchors such as once a day or once a month in addition to the adjective and scale number anchors. Examples are shown in figure 4.21.

Figure 4.21 Frequency and Amount Intensity Scales with Proportional and Verbal Descriptive Anchors in addition to the Conventional Adjective and Scale Number Anchors

  1.  
    1. __ Seldom if ever (0 to 10% of the time)
    2. __ Sometimes (about 1/4 of the time)
    3. __ Often (about 1/2 of the time)
    4. __ Very often (about 3/4 of the time)
    5. __ Always or almost always (90 to 100% of the time)
  2.  
    1. __ Always or almost always (once a day or so)
    2. __ Very often (every other day)
    3. __ Often (about once a week)
    4. __ Sometimes (every 2 or 3 weeks)
    5. __ Infrequently (once a month or less)
  3.  
    1. __ To little or no extent; less than 10% of streams are covered
    2. __ To some extent; perhaps 1/4 of the streams are covered
    3. __ To a moderate extent; about 1/2 of the streams are covered
    4. __ To a great extent; about 3/4 of the streams are covered
    5. __ To a very great extent; nearly all of the streams are covered
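Proportional anchors also make it possible to translate between category numbers and approximate proportions during analysis. The sketch below uses the first scale in figure 4.21. The values for the two end categories (0.05 and 0.95) are our assumption, taken as the midpoints of the stated ranges, and averaging the proportions treats the categories as roughly equal intervals, which is also an assumption.

```python
# Approximate proportion of the time for each category in figure 4.21, scale 1.
FREQUENCY_ANCHORS = {1: 0.05, 2: 0.25, 3: 0.50, 4: 0.75, 5: 0.95}

def nearest_category(proportion):
    """Return the scale category whose anchor is closest to a known proportion."""
    return min(FREQUENCY_ANCHORS, key=lambda c: abs(FREQUENCY_ANCHORS[c] - proportion))

def mean_proportion(categories):
    """Average anchored proportion across a set of checked categories."""
    return sum(FREQUENCY_ANCHORS[c] for c in categories) / len(categories)

# Hypothetical answers from four respondents.
average = mean_proportion([2, 3, 3, 4])
category = nearest_category(0.30)
```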

Branching Intensity Scale Formats
So far, all the examples have illustrated nonbranching formats. However, even more precise measures can be obtained with branching formats. An example is shown in figure 4.22.

Figure 4.22 Branching Intensity Scale Format

  1. If a group home for the mentally ill were applying for a license in your neighborhood, would you support or oppose this licensing or are you undecided?  (Check one.)
    1. __ Support (continue)
    2. __ Undecided (go to 4)
    3. __ Oppose (go to 3)
  2. If you would support this licensing to what extent would you support it?
    1. __ To a little extent
    2. __ To some extent
    3. __ To a moderate extent
    4. __ To a great extent
    5. __ To a very great extent
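One way to analyze answers from a branching format is to combine the direction and the extent into a single signed score. The sketch below is an illustration only. It assumes that the follow-up question for those who oppose, which is not shown in figure 4.22, uses the same five-point extent scale as question 2; the scoring scheme and the function name are likewise our assumptions.

```python
def branch_score(direction, extent=None):
    """Combine a direction and a 1-5 extent into a score from -5 to +5.

    Assumes the 'oppose' branch uses the same five-point extent scale
    as the 'support' branch shown in figure 4.22.
    """
    if direction == "undecided":
        return 0
    if direction not in ("support", "oppose"):
        raise ValueError("direction must be support, oppose, or undecided")
    if extent is None or not 1 <= extent <= 5:
        raise ValueError("an extent from 1 to 5 is required")
    return extent if direction == "support" else -extent

strong_support = branch_score("support", 4)
mild_opposition = branch_score("oppose", 2)
neutral = branch_score("undecided")
```

The result is an 11-point measure obtained from two short questions, which is the added precision the branching format offers.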

Fill-in-the-Blank Frequency Formats
Sometimes when evaluators have to be really precise and the range of frequency choices is very wide, such as in the study of repetitive behaviors, they can use a fill-in-the-blank format. What is asked for is the number of occurrences in a given time period or the interval between events to be counted. Examples are shown in figure 4.23.

Figure 4.23 Number-of-Occurrences and Time Interval Formats

  1. How many meetings have you attended in the last two full weeks?
    ____________(number of meetings in the last two weeks, counting back from last full week)
  2.  
    1. When was the last meeting you attended? _____________
    2. Before this meeting, how long had it been since you had attended a meeting?
      _____________(number of days since attending another meeting)
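Answers from the two formats in figure 4.23 can be put on a common footing by converting each to a rate per week. The sketch below shows the arithmetic; the function names are hypothetical, and the interval conversion assumes that the single reported gap is typical of the respondent's behavior.

```python
def rate_from_count(count, weeks=2):
    """Meetings per week from a count over a fixed number of full weeks."""
    return count / weeks

def rate_from_interval(days_between):
    """Meetings per week implied by the number of days between two meetings."""
    if days_between <= 0:
        raise ValueError("interval must be a positive number of days")
    return 7 / days_between

# Six meetings in the last two full weeks; 14 days between the last two meetings.
weekly_from_count = rate_from_count(6)
weekly_from_interval = rate_from_interval(14)
```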

Here are some guidelines for using intensity scales.

  1. Pick a dimension and a dimension reference point; then decide whether the scale should increase in a negative direction from that reference point, increase in a positive direction, or both. For instance, consider the question, "To what extent, if at all, did the law affect your business?" Here, the scale might go from reference point "no effect" to "a severe hardship" or, if you believe the law can only help, from "no effect" to "a very great help." But if the law could help some and hinder others, the scale would span the range from "a severe hardship" through the "no effect" reference point to "a very great help."
  2. Use an odd number of categories, preferably five or seven.
  3. If there is a possibility of bias from the category ordering, order the scale in a way that favors the hypothesis you want to disconfirm and that disadvantages the hypothesis you want to confirm. This way, you confirm the hypothesis with the bias against you.
  4. If there is no bias, start the scale with the most undesirable or negative effect and end the scale with the most positive categories.
  5. Present the scale categories in the sequence that people are used to seeing them.
  6. Pick scale-range anchors or poles (that is, specify the ends of the range) with concrete and unambiguous measures.
  7. Use the item sequence and numbering to help define the range of categories.
  8. Use words that are natural anchors or that will divide the scale at equal intervals, particularly over the middle two thirds or three fourths of the scale. For example, to most people, "some or somewhat" is usually perceived as about one fourth of the time, intensity, or amount, whereas "great" has a face value of about three fourths.
  9. Anchor the intervals with numbers, fractions, or proportions and descriptions, when feasible.
  10. Use a branching format when feasible, as it is precise.
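Guideline 8 can be checked with a little arithmetic. The sketch below tests whether the face values assigned to a set of anchor words divide the scale at equal intervals. The values for "some" (about one fourth) and "great" (about three fourths) come from the guideline; the value for "moderate" (one half) and the function names are assumptions made for illustration.

```python
# Approximate face values for anchor words in the middle of an extent scale.
# "some" and "great" follow guideline 8; "moderate" is an assumed value.
FACE_VALUES = {"some": 0.25, "moderate": 0.50, "great": 0.75}


def interval_widths(values):
    """Return the distance between each pair of adjacent anchor values."""
    return [b - a for a, b in zip(values, values[1:])]


def equally_spaced(values, tol=1e-9):
    """Return True if all adjacent anchors are the same distance apart."""
    widths = interval_widths(values)
    return all(abs(w - widths[0]) <= tol for w in widths)


middle_anchors = [FACE_VALUES[w] for w in ("some", "moderate", "great")]
anchors_ok = equally_spaced(middle_anchors)
```

In practice, the face values themselves would have to come from pretesting with the respondent group rather than from a fixed table like this one.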

Semantic Differential Intensity Scales

In a semantic differential question, frequencies or values that span the range of possible choices are not completely identified; only the extreme value or frequency categories are labeled. An example is shown in figure 4.24. The respondent must infer that the range is divided into equal intervals. The range seems to work much better with seven categories than five. The reasons for this are complicated, but seven categories provide a closer approximation to the normal distribution.

Figure 4.24 Semantic Differential Question

  1. Indicate the number of times per week you usually engage in technical communication with colleagues in your group.
    1. __  (Few)
    2. __
    3. __
    4. __
    5. __
    6. __
    7. __  (20 or more)

Semantic differentials are very useful when the evaluators do not have enough information to anchor the intervals between the poles. However, three major problems detract from this format. First, if the questions are not written with great care, many respondents will not answer or will answer with errors. Second, respondents may flounder and make judgment errors because the semantic differential has no midrange or intermediate anchors. Third, the results lack a certain amount of credibility because they are not tied to a factual observation. For example, compare a factually anchored scale point with a simple enumerated scale point. We find there is a big difference between saying that 70 percent of the respondents said their streams were polluted to the point at which most aquatic life was declining and saying that 70 percent checked 6 on a scale of 1 to 7.

Intensity Paired Comparison Scales

Intensity scales are very versatile and are sometimes combined with other types of scales. One such combination is sometimes used in establishing priorities. Here an intensity scale is combined with a paired comparison scale. As its name implies, a paired comparison scale compares all the question options by pairs, asking the respondent to rank one item of each pair over the other. An intensity paired comparison scale also asks the respondents to scale the amount of the difference between the two items in each pair. See figure 4.25.

Figure 4.25: Intensity Paired Comparison Scale

| Comparison Activities | Much less important | Somewhat less important | Equally important | Somewhat more important | Much more important |
|---|---|---|---|---|---|
| 1. Biotechnology vs. Acquisition |  |  |  |  |  |
| 2. Description vs. Breeding |  |  |  |  |  |
| 3. Enhancement vs. Preservation |  |  |  |  |  |
| 4. Acquisition vs. Description |  |  |  |  |  |
| 5. Preservation vs. Biotechnology |  |  |  |  |  |
| 6. Breeding vs. Enhancement |  |  |  |  |  |
| 7. Biotechnology vs. Breeding |  |  |  |  |  |
| 8. Description vs. Preservation |  |  |  |  |  |
| 9. Enhancement vs. Acquisition |  |  |  |  |  |
| 10. Acquisition vs. Breeding |  |  |  |  |  |
| 11. Preservation vs. Acquisition |  |  |  |  |  |
| 12. Breeding vs. Preservation |  |  |  |  |  |
| 13. Biotechnology vs. Enhancement |  |  |  |  |  |
| 14. Description vs. Biotechnology |  |  |  |  |  |
| 15. Enhancement vs. Description |  |  |  |  |  |
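The 15 comparisons in figure 4.25 are every possible pairing of six activities. The sketch below generates such a set of pairs and shows one possible way to turn the intensity ratings into a rough priority score. The numeric codes (-2 to +2), the scoring rule, and the reading of each rating as "the first item compared with the second" are all assumptions made for illustration; the figure itself does not prescribe a scoring method.

```python
from itertools import combinations

ACTIVITIES = [
    "Acquisition", "Biotechnology", "Breeding",
    "Description", "Enhancement", "Preservation",
]

# Assumed numeric codes for the five scale categories.
SCALE = {
    "Much less important": -2,
    "Somewhat less important": -1,
    "Equally important": 0,
    "Somewhat more important": 1,
    "Much more important": 2,
}


def all_pairs(items):
    """Return every unordered pair of items, each pair listed once."""
    return list(combinations(items, 2))


def priority_scores(responses):
    """Sum the ratings for each item.

    `responses` maps (first, second) to a scale label, read here as
    "first is <label> compared with second" (an assumed convention).
    The first item gains the coded value and the second loses it.
    """
    scores = {}
    for (first, second), label in responses.items():
        value = SCALE[label]
        scores[first] = scores.get(first, 0) + value
        scores[second] = scores.get(second, 0) - value
    return scores


pairs = all_pairs(ACTIVITIES)  # six activities yield 15 pairs
```

Because the number of pairs grows quickly as items are added, this format imposes a noticeably heavier burden on respondents when the list of options is long.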

 


Chapter 5
Avoiding Inappropriate Questions

To make sure questions are appropriate, the evaluators must become familiar with respondent groups: their knowledge of certain areas, the terms they use, and their perceptions and sensitivities. What may be an excessive burden for one group may not be for another. And what may be a fair question for some may not be for others. For example, in a survey of the handicapped, those who were not obviously handicapped were very sensitive about answering questions.

This chapter discusses nine types of inappropriate questions and ways to avoid them. Questions are inappropriate if they

The best way to avoid inappropriate questions is to learn about the respondent group, design and field test for this group, and not rely on preconceptions or stereotypes. An anecdote may bring this point home. A researcher was pretesting a questionnaire on people who used mental health services. During the test, the researcher expressed surprise that the respondents could handle certain difficult concepts. Annoyed, one of the respondents rejoined, "I may be crazy, but I'm not stupid."

Questions That Are Not Relevant to the Evaluation Goals

A questionnaire should contain no more questions than necessary. Questions that are not related to the goals of the evaluation or that are not likely to be used in the final report should be avoided. They require unnecessary time and effort from respondents. And questions that respondents view as irrelevant to the evaluation are less likely to be answered. Perceived irrelevance is the single biggest cause of nonparticipation. However, there are occasions when questions that are indeed very important appear to be irrelevant. If this is expected, the author should be very careful to explain why they were included.

Occasionally, however, someone asks the evaluators to include what is called a "rider," an unrelated question for use in another evaluation. Including riders creates three problems. First, the evaluation now has a dual purpose that has to be explained to readers. Second, the riders have to be woven into the questionnaire so that they do not seem irrelevant. Third, the use of the rider changes the context and hence the meaning of the questions.

Aside from riders, there are three other ways in which irrelevant questions typically find their way into evaluations:

  1. The evaluation design was inadequate. The evaluators did not formulate the overall project questions and the technical approach in a systematic way but decided to measure "everything" and see what they could come up with.
  2. The evaluators had a hidden agenda. The evaluation was just a pretext for measuring other things.
  3. The evaluators used the questionnaire to cover their bets. They already had the information they needed. They just wanted to be sure not to miss anything.

Not one of these reasons is acceptable because the use of evaluations for such purposes wastes the agency's and the respondents' time and money.

Unbalanced Line of Inquiry

Evaluators should not write questions that could be seen as developing a line of inquiry to support a particular position or preconceived idea, possibly at the expense of evidence to the contrary. The purpose of questionnaires is to develop information for an objective evaluation. To seem to do otherwise threatens a study's reputation for objectivity, commitment to balance, and integrity.

Questions That Cannot or Will Not Be Answered Accurately

Perhaps the most frequent source of error is asking questions that cannot or will not be answered correctly. For example, we asked companies for 4 years of data, when they kept records for only 3 years.

A more difficult problem occurs when respondents either purposely or unconsciously give biased answers. For example, unit commanders had a favorable bias when reporting on the performance of their units, whereas enlisted personnel were more likely to "tell it like it is." Similarly, physicians in certain hospitals rated the quality of their own medical practice very high but were objective in their judgment of peers. In these instances, it was inappropriate to ask unit commanders and physicians to rate themselves, because they were understandably biased in their answers. We obtained much more accurate observations from other sources (enlisted members and physician peer and nurse reports).

Sometimes respondents provide misinformation because they make a random guess, do not like to admit that they do not know something, or want to please the question asker by responding "yes." But it is better to have no information than false information. So it is important to skip out those not qualified to answer by using socially acceptable skip questions (see figure 5.1) or to direct the questionnaire only to those the evaluators know are knowledgeable. For example, in one project we evaluated the usefulness of a congressional report that analyzed federal funding by program and geographic location. We did not know which congressional staff used this report. So we analyzed staffing patterns and sent the questionnaire to the right people.

Figure 5.1 Skip Question

  1. Was your rating changed by officials other than your supervisor? (Check one.)
    1. __ Yes (CONTINUE)
    2. __ No (CONTINUE)
    3. __ Don't know (GO TO QUESTION 21)
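The routing in figure 5.1 can be expressed as a small lookup table, which is useful when a questionnaire is administered or checked by computer. The sketch below is illustrative only: the position of the skip question in the questionnaire (question 18 here) and the function and variable names are assumptions, since the figure gives only the skip destination, question 21.

```python
CONTINUE = "CONTINUE"

# Routing rules taken from figure 5.1: "Yes" and "No" continue to the
# next question; "Don't know" skips ahead to question 21.
SKIP_RULES = {
    "Yes": CONTINUE,
    "No": CONTINUE,
    "Don't know": 21,
}


def next_question(current_number, answer, rules):
    """Return the number of the question the respondent should answer next."""
    if answer not in rules:
        raise ValueError(f"unexpected answer: {answer!r}")
    target = rules[answer]
    if target == CONTINUE:
        return current_number + 1
    return target


# Assume, for illustration, that the skip question is question 18.
after_yes = next_question(18, "Yes", SKIP_RULES)
after_dont_know = next_question(18, "Don't know", SKIP_RULES)
```

Keeping the rules in one table makes it easy to verify during pretesting that every response category leads somewhere and that respondents who are not qualified to answer bypass the follow-up questions.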

Another means of selection is to ask people to rate their expertise. For example, in a study of the feasibility of a national health plan, we asked people to rate their expertise in the various knowledge areas such as the health care industry, insurance, education, manufacturing, and preventive medicine.

Questions That Are Not Geared to Respondent's Depth and Range of Information, Knowledge, and Perceptions

To avoid questions not properly geared to the respondents, it is important not to use words or terms they do not understand. It is very easy to assume that respondents know the same words we do. Some terms and abbreviations that have caused problems in past surveys are "detoxification," "EEO," "DCASR," "peer group," "net sales," and "adjusted gross income." We could have saved time and money had we provided a few words of explanation, such as "detoxification, or drying out"; "peer group, or the people you work with who have similar rank or status"; and "net sales, or the profit on sales after all expenses have been deducted."

Evaluators must also use terms in the same context and sense that people are used to seeing them in. To students at a state college, the student union was a place where people hang out, watch television, and buy coffee and doughnuts; however, to military academy cadets, it was a subversive organization. In another survey, the term "margin" had different meanings to different respondents. It meant barely adequate to consumers, the amount of collateral required for stock purchases to bankers and brokers, the benefits of building or buying additional units to businessmen, and a cross-tabulation calculation to statisticians.

Question writers must be familiar with their population, and they cannot assume too much or too little. For instance, we were worried about using two technical terms in surveying ranchers: "actual grazing capacity" and "forage productive capacity." However, our pretests showed the ranchers uniformly understood the terms. In another survey, we asked users to rate the quality of the computer image tapes from the LANDSAT earth-orbiting satellite. (The tapes provide data used to make computer maps of the earth's surface.) In general, the users could not answer this question because it was too broad. They wanted us to be much more specific and ask about the quality of the calibration, striping, formatting, wavelength bands, pixel number of original amplitude steps used in digital conservation, corrections for geometric errors and distortions, and threshold settings. In yet anothe