
Research Methods Knowledge Base


Measurement

Measurement is the process of observing and recording the observations that are collected as part of a research effort. There are two major issues that will be considered here.

First, you have to understand the fundamental ideas involved in measuring. Here we consider two major measurement concepts. In Levels of Measurement, I explain the meaning of the four major levels of measurement: nominal, ordinal, interval, and ratio. Then we move on to the reliability of measurement, including consideration of true score theory and a variety of reliability estimators.

Second, you have to understand the different types of measures that you might use in social research. We consider four broad categories of measurements. Survey research includes the design and implementation of interviews and questionnaires. Scaling involves consideration of the major methods of developing and implementing a scale. Qualitative research provides an overview of the broad range of non-numerical measurement approaches. And unobtrusive measures presents a variety of measurement methods that don’t intrude on or interfere with the context of the research.



Social Sci LibreTexts

4.1: What is Measurement?


Learning Objective

  • Define measurement.

Measurement

Measurement is important. Recognizing and respecting that fact will be of great benefit to you, both in research methods and in other areas of life. If, for example, you have ever baked a cake, you know well the importance of measurement. As someone who much prefers rebelling against precise rules to following them, I once learned the hard way that measurement matters. A couple of years ago I attempted to bake my husband a birthday cake without the help of any measuring utensils. I’d baked before, I reasoned, and I had a pretty good sense of the difference between a cup and a tablespoon. How hard could it be? As it turns out, it’s not easy guesstimating precise measures. That cake was the lumpiest, most lopsided cake I’ve ever seen. And it tasted kind of like Play-Doh. Figure 4.1 depicts the monstrosity I created, all because I did not respect the value of measurement.

Figure 4.1: Measurement is important in baking and in research.

Just as measurement is critical to successful baking, it is critical to successfully pulling off a social scientific research project. In sociology, when we use the term measurement we mean the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating. At its core, measurement is about defining one’s terms in as clear and precise a way as possible. Of course, measurement in social science isn’t quite as simple as using some predetermined or universally agreed-on tool, such as a measuring cup or spoon, but there are some basic tenets on which most social scientists agree when it comes to measurement. We’ll explore those as well as some of the ways that measurement might vary depending on your unique approach to the study of your topic.

What Do Social Scientists Measure?

The question of what social scientists measure can be answered by asking oneself what social scientists study. Think about the topics you’ve learned about in other sociology classes you’ve taken or the topics you’ve considered investigating yourself. Or think about the many examples of research you’ve read about in this text, such as Milkie and Warner’s (2011) study of first graders’ mental health, published as “Classroom learning environments and the mental health of first grade children” (Journal of Health and Social Behavior, 52, 4–22). In order to conduct that study, Milkie and Warner needed to have some idea about how they were going to measure mental health. What does mental health mean, exactly? And how do we know when we’re observing someone whose mental health is good and when we see someone whose mental health is compromised? Understanding how measurement works in research methods helps us answer these sorts of questions.

As you might have guessed, social scientists will measure just about anything that they have an interest in investigating. For example, those who are interested in learning something about the correlation between social class and levels of happiness must develop some way to measure both social class and happiness. Those who wish to understand how well immigrants cope in their new locations must measure immigrant status and coping. Those who wish to understand how a person’s gender shapes their workplace experiences must measure gender and workplace experiences. You get the idea. Social scientists can and do measure just about anything you can imagine observing or wanting to study.

How Do Social Scientists Measure?

Measurement in social science is a process. It occurs at multiple stages of a research project: in the planning stages, in the data collection stage, and sometimes even in the analysis stage. Recall that previously we defined measurement as the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating. Once we’ve identified a research question, we begin to think about what some of the key ideas are that we hope to learn from our project. In describing those key ideas, we begin the measurement process.

Let’s say that our research question is the following: How do new college students cope with the adjustment to college? In order to answer this question, we’ll need to have some idea about what coping means. We may come up with an idea about what coping means early in the research process, as we begin to think about what to look for (or observe) in our data-collection phase. Once we’ve collected data on coping, we also have to decide how to report on the topic. Perhaps, for example, there are different types or dimensions of coping, some of which lead to more successful adjustment than others. However we decide to proceed, and whatever we decide to report, the point is that measurement is important at each of these phases.

As the preceding paragraph demonstrates, measurement is a process in part because it occurs at multiple stages of conducting research. We could also think of measurement as a process because of the fact that measurement in itself involves multiple stages. From identifying one’s key terms to defining them to figuring out how to observe them and how to know if our observations are any good, there are multiple steps involved in the measurement process. An additional step in the measurement process involves deciding what elements one’s measures contain. A measure’s elements might be very straightforward and clear, particularly if they are directly observable. Other measures are more complex and might require the researcher to account for different themes or types. These sorts of complexities require paying careful attention to a concept’s level of measurement and its dimensions.

KEY TAKEAWAYS

  • Measurement is the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating.
  • Measurement occurs at all stages of research.


10.1 What is measurement?

Learning Objectives

Learners will be able to…

  • Define measurement
  • Explain where measurement fits into the process of designing research
  • Apply Kaplan’s three categories to determine the complexity of measuring a given variable

Pre-awareness check (Knowledge)

What do you already know about measuring key variables in your research topic?

In social science, when we use the term measurement, we mean the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating. In this chapter, we’ll use the term “concept” to mean an abstraction that has meaning. Concepts can be understood from our own experiences or from particular facts, but they don’t have to be limited to real-life phenomena. We can have a concept of anything we can imagine or experience, such as weightlessness, friendship, or income. Understanding exactly what our concepts mean is necessary in order to measure them.

In research, measurement is a systematic procedure for assigning scores, meanings, and descriptions to concepts so that those scores represent the characteristic of interest. Social scientists can and do measure just about anything you can imagine observing or wanting to study. Of course, some things are easier to observe or measure than others.

Where does measurement fit in the process of designing research?

Table 10.1 is intended as a partial review and outlines the general process researchers can follow to get from problem formulation to data collection, including measurement. Keep in mind that this process is iterative. For example, you may find something in your literature review that leads you to refine your conceptualizations, or you may discover as you attempt to conceptually define your terms that you need to return to the literature for further information. Accordingly, this table should be seen as a suggested path to take rather than an inflexible rule about how research must be conducted.

Table 10.1. Components of the Research Process from Problem Formulation to Data Collection. Note: Information on attachment theory in this table came from Bowlby, J. (1978). Attachment theory and its therapeutic implications. Adolescent Psychiatry, 6, 5–33.

Categories of concepts that social scientists measure

In 1964, philosopher Abraham Kaplan [1] wrote The Conduct of Inquiry, which has been cited over 8,500 times. [2] In his text, Kaplan describes different categories of things that behavioral scientists observe. One of those categories, which Kaplan called “observational terms,” is probably the simplest to measure in social science. Observational terms are simple concepts, the sorts of things we can see with the naked eye simply by looking at them. Kaplan roughly defines them as concepts that are easy to identify and verify through direct observation. If, for example, we wanted to know how the conditions of playgrounds differ across neighborhoods, we could directly observe the variety, amount, and condition of equipment at various playgrounds.

Indirect observables, on the other hand, are less straightforward concepts to assess. In Kaplan’s framework, they are subtle and complex conditions that we must define using existing knowledge and intuition. If we conducted a study for which we wished to know a person’s income, we’d probably have to ask them their income, perhaps in an interview or a survey. Thus, we have observed income, even if only indirectly. Birthplace might be another indirect observable. We can ask study participants where they were born, but chances are good we won’t have directly observed any of those people being born in the locations they report.

Sometimes the concepts we are interested in are more complex and more abstract than observational terms or indirect observables. Because they are complex, constructs generally consist of more than one concept. Take, for example, the construct “bureaucracy.” We know this term has something to do with hierarchy, organizations, and how they operate, but measuring such a construct is trickier than measuring something like a person’s income because of the complexity involved. Here’s another construct: racism. What is racism? How would you measure it? Racism and bureaucracy are both constructs whose meanings we have come to agree on.

Though we may not be able to observe constructs directly, we can observe their components. In Kaplan’s categorization, constructs are concepts that are “not observational either directly or indirectly” (Kaplan, 1964, p. 55), [3] but they can be defined based on observables. An example is the construct of depression: a diagnosis of depression can be made using the DSM-5, which includes diagnostic criteria such as fatigue and poor concentration, and each of these components of depression can be observed indirectly. We are able to measure constructs by defining them in terms of what we can observe.

TRACK 1 (IF YOU ARE CREATING A RESEARCH PROPOSAL FOR THIS CLASS):

Look at the variables in your research question.

  • Classify them as direct observables, indirect observables, or constructs.
  • Do you think measuring them will be easy or hard?
  • What are your first thoughts about how to measure each variable? No wrong answers here, just write down a thought about each variable.

TRACK 2 (IF YOU AREN’T CREATING A RESEARCH PROPOSAL FOR THIS CLASS): 

You are interested in studying older adults’ social-emotional well-being. Specifically, you would like to research the impact on levels of older adult loneliness of an intervention that pairs older adults living in assisted living communities with university student volunteers for a weekly conversation.

Develop a working research question for this topic. Then, look at the variables in your research question.

  • Kaplan, A. (1964). The conduct of inquiry: Methodology for behavioral science. San Francisco, CA: Chandler Publishing Company.
  • Earl Babbie offers a more detailed discussion of Kaplan’s work in his text. You can read it in: Babbie, E. (2010). The practice of social research (12th ed.). Belmont, CA: Wadsworth.
  • Kaplan, A. (1964). The conduct of inquiry: Methodology for behavioral science. San Francisco, CA: Chandler Publishing Company.

Glossary

  • Measurement: the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena under investigation in a research study.
  • Direct observables: in measurement, conditions that are easy to identify and verify through direct observation.
  • Indirect observables: conditions that require subtle and complex observations to measure, perhaps drawing on existing knowledge and intuition to define.
  • Constructs: conditions that are not directly observable and represent states of being, experiences, and ideas.

Doctoral Research Methods in Social Work Copyright © by Mavs Open Press. All Rights Reserved.


5.1 Understanding Psychological Measurement

Learning Objectives

  • Define measurement and give several examples of measurement in psychology.
  • Explain what a psychological construct is and give several examples.
  • Distinguish conceptual from operational definitions, give examples of each, and create simple operational definitions.
  • Distinguish the four levels of measurement, give examples of each, and explain why this distinction is important.

What Is Measurement?

Measurement is the assignment of scores to individuals so that the scores represent some characteristic of the individuals. This very general definition is consistent with the kinds of measurement that everyone is familiar with—for example, weighing oneself by stepping onto a bathroom scale, or checking the internal temperature of a roasting turkey by inserting a meat thermometer. It is also consistent with measurement throughout the sciences. In physics, for example, one might measure the potential energy of an object in Earth’s gravitational field by finding its mass and height (which of course requires measuring those variables) and then multiplying them together along with the gravitational acceleration of Earth (9.8 m/s²). The result of this procedure is a score that represents the object’s potential energy.
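Written out, the procedure just described computes the familiar formula for gravitational potential energy:

```latex
E_p = m g h , \qquad g \approx 9.8\ \mathrm{m/s^2}
```

The resulting number is the score; the measurement is the whole procedure that produces it.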

Of course this general definition of measurement is consistent with measurement in psychology too. (Psychological measurement is often referred to as psychometrics.) Imagine, for example, that a cognitive psychologist wants to measure a person’s working memory capacity—his or her ability to hold in mind and think about several pieces of information all at the same time. To do this, she might use a backward digit span task, where she reads a list of two digits to the person and asks him or her to repeat them in reverse order. She then repeats this several times, increasing the length of the list by one digit each time, until the person makes an error. The length of the longest list for which the person responds correctly is the score and represents his or her working memory capacity. Or imagine a clinical psychologist who is interested in how depressed a person is. He administers the Beck Depression Inventory, which is a 21-item self-report questionnaire in which the person rates the extent to which he or she has felt sad, lost energy, and experienced other symptoms of depression over the past 2 weeks. The sum of these 21 ratings is the score and represents his or her current level of depression.
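To make the digit span scoring rule concrete, here is a minimal sketch in Python. It is an illustration only: the function name, the starting list length, and the cap of ten digits are our assumptions, not part of any standardized test protocol.

```python
import random

def backward_digit_span(respond, start_len=2, max_len=10):
    """Score a backward digit span task.

    `respond` is a callable standing in for the participant: it takes
    a list of digits and returns the attempted reversal. Testing stops
    at the first error; the score is the longest list length for which
    the participant responded correctly.
    """
    score = 0
    for length in range(start_len, max_len + 1):
        digits = [random.randint(0, 9) for _ in range(length)]
        if respond(digits) == digits[::-1]:
            score = length  # longest list reversed correctly so far
        else:
            break           # first error ends the task
    return score

# Simulated participant who can hold at most six digits in mind:
capacity = 6
print(backward_digit_span(lambda d: d[::-1] if len(d) <= capacity else []))
# -> 6
```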

The important point here is that measurement does not require any particular instruments or procedures. It does not require placing individuals or objects on bathroom scales, holding rulers up to them, or inserting thermometers into them. What it does require is some systematic procedure for assigning scores to individuals or objects so that those scores represent the characteristic of interest.

Psychological Constructs

Many variables studied by psychologists are straightforward and simple to measure. These include sex, age, height, weight, and birth order. You can almost always tell whether someone is male or female just by looking. You can ask people how old they are and be reasonably sure that they know and will tell you. Although people might not know or want to tell you how much they weigh, you can have them step onto a bathroom scale. Other variables studied by psychologists—perhaps the majority—are not so straightforward or simple to measure. We cannot accurately assess people’s level of intelligence by looking at them, and we certainly cannot put their self-esteem on a bathroom scale. These kinds of variables are called constructs (pronounced CON-structs) and include personality traits (e.g., extroversion), emotional states (e.g., fear), attitudes (e.g., toward taxes), and abilities (e.g., athleticism).

Psychological constructs cannot be observed directly. One reason is that they often represent tendencies to think, feel, or act in certain ways. For example, to say that a particular college student is highly extroverted (see Note 5.6, “The Big Five”) does not necessarily mean that she is behaving in an extroverted way right now. In fact, she might be sitting quietly by herself, reading a book. Instead, it means that she has a general tendency to behave in extroverted ways (talking, laughing, etc.) across a variety of situations. Another reason psychological constructs cannot be observed directly is that they often involve internal processes. Fear, for example, involves the activation of certain central and peripheral nervous system structures, along with certain kinds of thoughts, feelings, and behaviors—none of which is necessarily obvious to an outside observer. Notice also that neither extroversion nor fear “reduces to” any particular thought, feeling, act, or physiological structure or process. Instead, each is a kind of summary of a complex set of behaviors and internal processes.

The Big Five

The Big Five is a set of five broad dimensions that capture much of the variation in human personality: extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience. Each of the Big Five can even be defined in terms of six more specific constructs called “facets” (Costa & McCrae, 1992).

The conceptual definition of a psychological construct describes the behaviors and internal processes that make up that construct, along with how it relates to other variables. For example, a conceptual definition of neuroticism (another one of the Big Five) would be that it is people’s tendency to experience negative emotions such as anxiety, anger, and sadness across a variety of situations. This definition might also include that it has a strong genetic component, remains fairly stable over time, and is positively correlated with the tendency to experience pain and other physical symptoms.

Students sometimes wonder why, when researchers want to understand a construct like self-esteem or neuroticism, they do not simply look it up in the dictionary. One reason is that many scientific constructs do not have counterparts in everyday language (e.g., working memory capacity). More important, researchers are in the business of developing definitions that are more detailed and precise—and that more accurately describe the way the world is—than the informal definitions in the dictionary. As we will see, they do this by proposing conceptual definitions, testing them empirically, and revising them as necessary. Sometimes they throw them out altogether. This is why the research literature often includes different conceptual definitions of the same construct. In some cases, an older conceptual definition has been replaced by a newer one that works better. In others, researchers are still in the process of deciding which of various conceptual definitions is the best.

Operational Definitions

An operational definition is a definition of a variable in terms of precisely how it is to be measured. These measures generally fall into one of three broad categories. Self-report measures are those in which participants report on their own thoughts, feelings, and actions, as with the Rosenberg Self-Esteem Scale. Behavioral measures are those in which some other aspect of participants’ behavior is observed and recorded. This is an extremely broad category that includes the observation of people’s behavior both in highly structured laboratory tasks and in more natural settings. A good example of the former would be measuring working memory capacity using the backward digit span task. A good example of the latter is a famous operational definition of physical aggression from researcher Albert Bandura and his colleagues (Bandura, Ross, & Ross, 1961). They let each of several children play for 20 minutes in a room that contained a clown-shaped punching bag called a Bobo doll. They filmed each child and counted the number of acts of physical aggression he or she committed. These included hitting the doll with a mallet, punching it, and kicking it. Their operational definition, then, was the number of these specifically defined acts that the child committed in the 20-minute period. Finally, physiological measures are those that involve recording any of a wide variety of physiological processes, including heart rate and blood pressure, galvanic skin response, hormone levels, and electrical activity and blood flow in the brain.

Figure: A man wearing an EEG cap. In addition to self-report and behavioral measures, researchers in psychology use physiological measures; an electroencephalograph (EEG) records electrical activity from the brain. (Wikimedia Commons, public domain.)

For any given variable or construct, there will be multiple operational definitions. Stress is a good example. A rough conceptual definition is that stress is an adaptive response to a perceived danger or threat that involves physiological, cognitive, affective, and behavioral components. But researchers have operationally defined it in several ways. The Social Readjustment Rating Scale is a self-report questionnaire on which people identify stressful events that they have experienced in the past year; it assigns points for each one depending on its severity. For example, a man who has been divorced (73 points), changed jobs (36 points), and had a change in sleeping habits (16 points) in the past year would have a total score of 125. The Daily Hassles and Uplifts Scale is similar but focuses on everyday stressors like misplacing things and being concerned about one’s weight. The Perceived Stress Scale is another self-report measure that focuses on people’s feelings of stress (e.g., “How often have you felt nervous and stressed?”). Researchers have also operationally defined stress in terms of several physiological variables, including blood pressure and levels of the stress hormone cortisol.
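The arithmetic in the divorce/job/sleep example can be spelled out in a short sketch. The three weights are the ones quoted above; the dictionary encoding is simply an illustrative assumption about how one might represent the scale.

```python
# Illustrative subset of Social Readjustment Rating Scale weights
# (only the three values quoted in the text above).
SRRS_POINTS = {
    "divorce": 73,
    "changed jobs": 36,
    "change in sleeping habits": 16,
}

def srrs_score(events):
    """Sum the severity points for each stressful event reported."""
    return sum(SRRS_POINTS[event] for event in events)

print(srrs_score(["divorce", "changed jobs", "change in sleeping habits"]))
# -> 125
```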

When psychologists use multiple operational definitions of the same construct—either within a study or across studies—they are using converging operations . The idea is that the various operational definitions are “converging” on the same construct. When scores based on several different operational definitions are closely related to each other and produce similar patterns of results, this constitutes good evidence that the construct is being measured effectively and that it is useful. The various measures of stress, for example, are all correlated with each other and have all been shown to be correlated with other variables such as immune system functioning (also measured in a variety of ways) (Segerstrom & Miller, 2004). This is what allows researchers eventually to draw useful general conclusions, such as “stress is negatively correlated with immune system functioning,” as opposed to more specific and less useful ones, such as “people’s scores on the Perceived Stress Scale are negatively correlated with their white blood counts.”
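The logic of converging operations can be illustrated numerically. In this sketch the scores are invented for the sake of the example; the point being demonstrated is only that closely related operationalizations should yield highly correlated scores.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented scores for five participants on two operational
# definitions of stress (SRRS points and Perceived Stress Scale).
srrs = [125, 40, 210, 88, 150]
perceived_stress = [22, 9, 31, 15, 25]

print(round(pearson_r(srrs, perceived_stress), 2))  # -> 0.99
```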

Levels of Measurement

The psychologist S. S. Stevens suggested that scores can be assigned to individuals so that they communicate more or less quantitative information about the variable of interest (Stevens, 1946). For example, the officials at a 100-m race could simply rank order the runners as they crossed the finish line (first, second, etc.), or they could time each runner to the nearest tenth of a second using a stopwatch (11.5 s, 12.1 s, etc.). In either case, they would be measuring the runners’ times by systematically assigning scores to represent those times. But while the rank ordering procedure communicates the fact that the second-place runner took longer to finish than the first-place finisher, the stopwatch procedure also communicates how much longer the second-place finisher took. Stevens actually suggested four different levels of measurement (which he called “scales of measurement”) that correspond to four different levels of quantitative information that can be communicated by a set of scores.

The nominal level of measurement is used for categorical variables and involves assigning scores that are category labels. Category labels communicate whether any two individuals are the same or different in terms of the variable being measured. For example, if you look at your research participants as they enter the room, decide whether each one is male or female, and type this information into a spreadsheet, you are engaged in nominal-level measurement. Or if you ask your participants to indicate which of several ethnicities they identify themselves with, you are again engaged in nominal-level measurement.

The remaining three levels of measurement are used for quantitative variables. The ordinal level of measurement involves assigning scores so that they represent the rank order of the individuals. Ranks communicate not only whether any two individuals are the same or different in terms of the variable being measured but also whether one individual is higher or lower on that variable. The interval level of measurement involves assigning scores so that they represent the precise magnitude of the difference between individuals, but a score of zero does not actually represent the complete absence of the characteristic. A classic example is the measurement of heat using the Celsius or Fahrenheit scale. The difference between temperatures of 20°C and 25°C is precisely 5°, but a temperature of 0°C does not mean that there is a complete absence of heat. In psychology, the intelligence quotient (IQ) is often considered to be measured at the interval level. Finally, the ratio level of measurement involves assigning scores in such a way that there is a true zero point that represents the complete absence of the quantity. Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam.

Stevens’s levels of measurement are important for at least two reasons. First, they emphasize the generality of the concept of measurement. Although people do not normally think of categorizing or ranking individuals as measurement, these activities count as measurement as long as the resulting scores represent some characteristic of the individuals. Second, the levels of measurement can serve as a rough guide to the statistical procedures that can be used with the data and the conclusions that can be drawn from them. With nominal-level measurement, for example, the only available measure of central tendency is the mode. Also, ratio-level measurement is the only level that allows meaningful statements about ratios of scores. One cannot say that someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is measured at the interval level, but one can say that someone with six siblings has twice as many as someone with three because number of siblings is measured at the ratio level.
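As a rough illustration of how level of measurement constrains analysis, consider the following sketch (all data values are invented):

```python
from statistics import median, mode

# Invented data at each of Stevens's four levels of measurement.
handedness = ["right", "left", "right", "right"]  # nominal
finish_order = [1, 2, 3, 4]                       # ordinal (race ranks)
iq_scores = [70, 105, 140]                        # interval
sibling_counts = [0, 3, 6]                        # ratio

print(mode(handedness))                       # nominal: the mode is meaningful
print(median(finish_order))                   # ordinal: medians and ranks apply
print(iq_scores[2] - iq_scores[1])            # interval: differences are meaningful
print(sibling_counts[2] / sibling_counts[1])  # ratio: 6 really is twice 3

# The analogous IQ ratio, iq_scores[2] / iq_scores[0] == 2.0, does NOT
# license the claim "twice as intelligent": IQ has no true zero point.
```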

Key Takeaways

  • Measurement is the assignment of scores to individuals so that the scores represent some characteristic of the individuals. Psychological measurement can be achieved in a wide variety of ways, including self-report, behavioral, and physiological measures.
  • Psychological constructs such as intelligence, self-esteem, and depression are variables that are not directly observable because they represent behavioral tendencies or complex patterns of behavior and internal processes. An important goal of scientific research is to conceptually define psychological constructs in ways that accurately describe them.
  • For any conceptual definition of a construct, there will be many different operational definitions or ways of measuring it. The use of multiple operational definitions, or converging operations, is a common strategy in psychological research.
  • Variables can be measured at four different levels—nominal, ordinal, interval, and ratio—that communicate increasing amounts of quantitative information. The level of measurement affects the kinds of statistics you can use and conclusions you can draw from your data.

Exercises

  • Practice: Complete the Rosenberg Self-Esteem Scale and compute your overall score.
  • Practice: Think of three operational definitions for sexual jealousy, decisiveness, and social anxiety. Consider the possibility of self-report, behavioral, and physiological measures. Be as precise as you can.

Practice: For each of the following variables, decide which level of measurement is being used.

  • A college instructor measures the time it takes his students to finish an exam by looking through the stack of exams at the end. He assigns the one on the bottom a score of 1, the one on top of that a 2, and so on.
  • A researcher accesses her participants’ medical records and counts the number of times they have seen a doctor in the past year.
  • Participants in a research study are asked whether they are right-handed or left-handed.

Bandura, A., Ross, D., & Ross, S. A. (1961). Transmission of aggression through imitation of aggressive models. Journal of Abnormal and Social Psychology, 63, 575–582.

Costa, P. T., Jr., & McCrae, R. R. (1992). Normal personality assessment in clinical practice: The NEO Personality Inventory. Psychological Assessment, 4, 5–13.

Segerstrom, S. E., & Miller, G. E. (2004). Psychological stress and the human immune system: A meta-analytic study of 30 years of inquiry. Psychological Bulletin, 130, 601–630.

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.

Research Methods in Psychology Copyright © 2016 by University of Minnesota is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.


Oncology Nursing Forum, Number 4, July 2014

Measurements in Quantitative Research: How to Select and Report on Research Instruments

Teresa L. Hagan

Measures exist to numerically represent degrees of attributes. Quantitative research is based on measurement and is conducted in a systematic, controlled manner. These measures enable researchers to perform statistical tests, analyze differences between groups, and determine the effectiveness of treatments. If something is not measurable, it cannot be tested.


Handbook of Research Methods in Health Social Sciences, pp. 663–679

Measurement Issues in Quantitative Research

Dafna Merom & James Rufus John

Measurement is central to empirical research, whether observational or experimental. Common to all measurement is the systematic application of a numerical value (scale) to a variable or a factor we wish to quantify. Measurement can be applied to physical, biological, or chemical attributes, or to more complex factors such as human behaviors, attitudes, and physical, social, or psychological characteristics, or to a combination of several characteristics that denote a concept. There are many reasons for the act of measurement that are relevant to health and social science disciplines: understanding the aetiology of disease or developmental processes, evaluating programs, monitoring progress, and decision-making. Regardless of the specific purpose, we should aspire for our measurement to be adequate. In this chapter, we review the properties that determine the adequacy of a measurement (reliability, validity, and sensitivity) and provide examples of statistical methods that are used to quantify these properties. In the concluding section, we provide examples from the physical activity and public health field in the four areas for which precise measurements are necessary, illustrating how imprecise or biased scoring procedures can lead to erroneous decisions across the four major purposes of measurement.




Merom, D., & John, J. R. (2019). Measurement issues in quantitative research. In P. Liamputtong (Ed.), Handbook of research methods in health social sciences (pp. 663–679). Springer, Singapore. https://doi.org/10.1007/978-981-10-5251-4_95


Stanford Encyclopedia of Philosophy

Measurement in Science

Measurement is an integral part of modern science as well as of engineering, commerce, and daily life. Measurement is often considered a hallmark of the scientific enterprise and a privileged source of knowledge relative to qualitative modes of inquiry. [1] Despite its ubiquity and importance, there is little consensus among philosophers as to how to define measurement, what sorts of things are measurable, or which conditions make measurement possible. Most (but not all) contemporary authors agree that measurement is an activity that involves interaction with a concrete system with the aim of representing aspects of that system in abstract terms (e.g., in terms of classes, numbers, vectors, etc.). But this characterization also fits various kinds of perceptual and linguistic activities that are not usually considered measurements, and is therefore too broad to count as a definition of measurement. Moreover, if “concrete” implies “real”, this characterization is also too narrow, as measurement often involves the representation of ideal systems such as the average household or an electron at complete rest.

Philosophers have written on a variety of conceptual, metaphysical, semantic and epistemological issues related to measurement. This entry will survey the central philosophical standpoints on the nature of measurement, the notion of measurable quantity and related epistemological issues. It will refrain from elaborating on the many discipline-specific problems associated with measurement and focus on issues that have a general character.

1. Overview


Modern philosophical discussions about measurement—spanning from the late nineteenth century to the present day—may be divided into several strands of scholarship. These strands reflect different perspectives on the nature of measurement and the conditions that make measurement possible and reliable. The main strands are mathematical theories of measurement, operationalism, conventionalism, realism, information-theoretic accounts and model-based accounts. These strands of scholarship do not, for the most part, constitute directly competing views. Instead, they are best understood as highlighting different and complementary aspects of measurement. The following is a very rough overview of these perspectives:

  • Mathematical theories of measurement view measurement as the mapping of qualitative empirical relations to relations among numbers (or other mathematical entities).
  • Operationalists and conventionalists view measurement as a set of operations that shape the meaning and/or regulate the use of a quantity-term.
  • Realists view measurement as the estimation of mind-independent properties and/or relations.
  • Information-theoretic accounts view measurement as the gathering and interpretation of information about a system.
  • Model-based accounts view measurement as the coherent assignment of values to parameters in a theoretical and/or statistical model of a process.

These perspectives are in principle consistent with each other. While mathematical theories of measurement deal with the mathematical foundations of measurement scales, operationalism and conventionalism are primarily concerned with the semantics of quantity terms, realism is concerned with the metaphysical status of measurable quantities, and information-theoretic and model-based accounts are concerned with the epistemological aspects of measuring. Nonetheless, the subject domain is not as neatly divided as the list above suggests. Issues concerning the metaphysics, epistemology, semantics and mathematical foundations of measurement are interconnected and often bear on one another. Hence, for example, operationalists and conventionalists have often adopted anti-realist views, and proponents of model-based accounts have argued against the prevailing empiricist interpretation of mathematical theories of measurement. These subtleties will become clear in the following discussion.

The list of strands of scholarship is neither exclusive nor exhaustive. It reflects the historical trajectory of the philosophical discussion thus far, rather than any principled distinction among different levels of analysis of measurement. Some philosophical works on measurement belong to more than one strand, while many other works do not squarely fit either. This is especially the case since the early 2000s, when measurement returned to the forefront of philosophical discussion after several decades of relative neglect. This recent body of scholarship is sometimes called “the epistemology of measurement”, and includes a rich array of works that cannot yet be classified into distinct schools of thought. The last section of this entry will be dedicated to surveying some of these developments.

2. Quantity and Magnitude: A Brief History

Although the philosophy of measurement formed as a distinct area of inquiry only during the second half of the nineteenth century, fundamental concepts of measurement such as magnitude and quantity have been discussed since antiquity. According to Euclid’s Elements, a magnitude—such as a line, a surface or a solid—measures another when the latter is a whole multiple of the former (Book V, def. 1 & 2). Two magnitudes have a common measure when they are both whole multiples of some magnitude, and are incommensurable otherwise (Book X, def. 1). The discovery of incommensurable magnitudes allowed Euclid and his contemporaries to develop the notion of a ratio of magnitudes. Ratios can be either rational or irrational, and therefore the concept of ratio is more general than that of measure (Michell 2003, 2004a; Grattan-Guinness 1996).

Aristotle distinguished between quantities and qualities. Examples of quantities are numbers, lines, surfaces, bodies, time and place, whereas examples of qualities are justice, health, hotness and paleness (Categories §6 and §8). According to Aristotle, quantities admit of equality and inequality but not of degrees, as “one thing is not more four-foot than another” (ibid. 6.6a19). Qualities, conversely, do not admit of equality or inequality but do admit of degrees, “for one thing is called more pale or less pale than another” (ibid. 8.10b26). Aristotle did not clearly specify whether degrees of qualities such as paleness correspond to distinct qualities, or whether the same quality, paleness, was capable of different intensities. This topic was at the center of an ongoing debate in the thirteenth and fourteenth centuries (Jung 2011). Duns Scotus supported the “addition theory”, according to which a change in the degree of a quality can be explained by the addition or subtraction of smaller degrees of that quality (2011: 553). This theory was later refined by Nicole Oresme, who used geometrical figures to represent changes in the intensity of qualities such as velocity (Clagett 1968; Sylla 1971). Oresme’s geometrical representations established a subset of qualities that were amenable to quantitative treatment, thereby challenging the strict Aristotelian dichotomy between quantities and qualities. These developments made possible the formulation of quantitative laws of motion during the sixteenth and seventeenth centuries (Grant 1996).

The concept of qualitative intensity was further developed by Leibniz and Kant. Leibniz’s “principle of continuity” stated that all natural change is produced by degrees. Leibniz argued that this principle applies not only to changes in extended magnitudes such as length and duration, but also to intensities of representational states of consciousness, such as sounds (Jorgensen 2009; Diehl 2012). Kant is thought to have relied on Leibniz’s principle of continuity to formulate his distinction between extensive and intensive magnitudes. According to Kant, extensive magnitudes are those “in which the representation of the parts makes possible the representation of the whole” (1787: A162/B203). An example is length: a line can only be mentally represented by a successive synthesis in which parts of the line join to form the whole. For Kant, the possibility of such synthesis was grounded in the forms of intuition, namely space and time. Intensive magnitudes, like warmth or colors, also come in continuous degrees, but their apprehension takes place in an instant rather than through a successive synthesis of parts. The degrees of intensive magnitudes “can only be represented through approximation to negation” (1787: A168/B210), that is, by imagining their gradual diminution until their complete absence.

Scientific developments during the nineteenth century challenged the distinction between extensive and intensive magnitudes. Thermodynamics and wave optics showed that differences in temperature and hue corresponded to differences in spatio-temporal magnitudes such as velocity and wavelength. Electrical magnitudes such as resistance and conductance were shown to be capable of addition and division despite not being extensive in the Kantian sense, i.e., not synthesized from spatial or temporal parts. Moreover, early experiments in psychophysics suggested that intensities of sensation such as brightness and loudness could be represented as sums of “just noticeable differences” among stimuli, and could therefore be thought of as composed of parts (see Section 3.3). These findings, along with advances in the axiomatization of branches of mathematics, motivated some of the leading scientists of the late nineteenth century to attempt to clarify the mathematical foundations of measurement (Maxwell 1873; von Kries 1882; Helmholtz 1887; Mach 1896; Poincaré 1898; Hölder 1901; for historical surveys see Darrigol 2003; Michell 1993, 2003; Cantù and Schlaudt 2013; Biagioli 2016: Ch. 4, 2018). These works are viewed today as precursors to the body of scholarship known as “measurement theory”.

3. Mathematical Theories of Measurement (“Measurement Theory”)

Mathematical theories of measurement (often referred to collectively as “measurement theory”) concern the conditions under which relations among numbers (and other mathematical entities) can be used to express relations among objects. [2] In order to appreciate the need for mathematical theories of measurement, consider the fact that relations exhibited by numbers—such as equality, sum, difference and ratio—do not always correspond to relations among the objects measured by those numbers. For example, 60 is twice 30, but one would be mistaken in thinking that an object measured at 60 degrees Celsius is twice as hot as an object at 30 degrees Celsius. This is because the zero point of the Celsius scale is arbitrary and does not correspond to an absence of temperature. [3] Similarly, numerical intervals do not always carry empirical information. When subjects are asked to rank on a scale from 1 to 7 how strongly they agree with a given statement, there is no prima facie reason to think that the intervals between 5 and 6 and between 6 and 7 correspond to equal increments of strength of opinion. To provide a third example, equality among numbers is transitive (if a = b and b = c, then a = c), but empirical comparisons among physical magnitudes reveal only approximate equality, which is not a transitive relation. These examples suggest that not all of the mathematical relations among numbers used in measurement are empirically significant, and that different kinds of measurement scale convey different kinds of empirically significant information.
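The Celsius example can be made explicit. The Celsius and Kelvin scales differ by a shift of the zero point, and ratios of values are not preserved under such shifts:

```latex
T_K = T_C + 273.15, \qquad
\frac{60}{30} = 2
\quad\text{but}\quad
\frac{60 + 273.15}{30 + 273.15} = \frac{333.15}{303.15} \approx 1.10
```

Only on a scale whose zero corresponds to a genuine absence of the quantity (as the Kelvin zero does for temperature) do ratios of scale values carry empirical information.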

The study of measurement scales and the empirical information they convey is the main concern of mathematical theories of measurement. In his seminal 1887 essay, “Counting and Measuring”, Hermann von Helmholtz phrased the key question of measurement theory as follows:

[W]hat is the objective meaning of expressing through denominate numbers the relations of real objects as magnitudes, and under what conditions can we do this? (1887: 4)

Broadly speaking, measurement theory sets out to (i) identify the assumptions underlying the use of various mathematical structures for describing aspects of the empirical world, and (ii) draw lessons about the adequacy and limits of using these mathematical structures for describing aspects of the empirical world. Following Otto Hölder (1901), measurement theorists often tackle these goals through formal proofs, with the assumptions in (i) serving as axioms and the lessons in (ii) following as theorems. A key insight of measurement theory is that the empirically significant aspects of a given mathematical structure are those that mirror relevant relations among the objects being measured. For example, the relation “bigger than” among numbers is empirically significant for measuring length insofar as it mirrors the relation “longer than” among objects. This mirroring, or mapping, of relations between objects and mathematical entities constitutes a measurement scale. As will be clarified below, measurement scales are usually thought of as isomorphisms or homomorphisms between objects and mathematical entities.

Other than these broad goals and claims, measurement theory is a highly heterogeneous body of scholarship. It includes works that span from the late nineteenth century to the present day and endorse a wide array of views on the ontology, epistemology and semantics of measurement. Two main differences among mathematical theories of measurement are especially worth mentioning. The first concerns the nature of the relata, or “objects”, whose relations numbers are supposed to mirror. These relata may be understood in at least four different ways: as concrete individual objects, as qualitative observations of concrete individual objects, as abstract representations of individual objects, or as universal properties of objects. Which interpretation is adopted depends in large part on the author’s metaphysical and epistemic commitments. This issue will be especially relevant to the discussion of realist accounts of measurement (Section 5). Second, different measurement theorists have taken different stands on the kind of empirical evidence that is required to establish mappings between objects and numbers. As a result, measurement theorists have come to disagree about the necessary conditions for establishing the measurability of attributes, and specifically about whether psychological attributes are measurable. Debates about measurability have been highly fruitful for the development of measurement theory, and the following subsections will introduce some of these debates and the central concepts developed therein.

During the late nineteenth and early twentieth centuries several attempts were made to provide a universal definition of measurement. Although accounts of measurement varied, the consensus was that measurement is a method of assigning numbers to magnitudes. For example, Helmholtz (1887: 17) defined measurement as the procedure by which one finds the denominate number that expresses the value of a magnitude, where a “denominate number” is a number together with a unit, e.g., 5 meters, and a magnitude is a quality of objects that is amenable to ordering from smaller to greater, e.g., length. Bertrand Russell similarly stated that measurement is

any method by which a unique and reciprocal correspondence is established between all or some of the magnitudes of a kind and all or some of the numbers, integral, rational or real. (1903: 176)

Norman Campbell defined measurement simply as “the process of assigning numbers to represent qualities”, where a quality is a property that admits of non-arbitrary ordering (1920: 267).

Defining measurement as numerical assignment raises the question: which assignments are adequate, and under what conditions? Early measurement theorists like Helmholtz (1887), Hölder (1901) and Campbell (1920) argued that numbers are adequate for expressing magnitudes insofar as algebraic operations among numbers mirror empirical relations among magnitudes. For example, the qualitative relation “longer than” among rigid rods is (roughly) transitive and asymmetrical, and in this regard shares structural features with the relation “larger than” among numbers. Moreover, the end-to-end concatenation of rigid rods shares structural features—such as associativity and commutativity—with the mathematical operation of addition. A similar situation holds for the measurement of weight with an equal-arms balance. Here deflection of the arms provides ordering among weights and the heaping of weights on one pan constitutes concatenation.

Early measurement theorists formulated axioms that describe these qualitative empirical structures, and used these axioms to prove theorems about the adequacy of assigning numbers to magnitudes that exhibit such structures. Specifically, they proved that ordering and concatenation are together sufficient for the construction of an additive numerical representation of the relevant magnitudes. An additive representation is one in which addition is empirically meaningful, and hence also multiplication, division etc. Campbell called measurement procedures that satisfy the conditions of additivity “fundamental” because they do not involve the measurement of any other magnitude (1920: 277). Kinds of magnitudes for which a fundamental measurement procedure has been found—such as length, area, volume, duration, weight and electrical resistance—Campbell called “fundamental magnitudes”. A hallmark of such magnitudes is that it is possible to generate them by concatenating a standard sequence of equal units, as in the example of a series of equally spaced marks on a ruler.

Although they viewed additivity as the hallmark of measurement, most early measurement theorists acknowledged that additivity is not necessary for measuring. Other magnitudes exist that admit of ordering from smaller to greater, but whose ratios and/or differences cannot currently be determined except through their relations to other, fundamentally measurable magnitudes. Examples are temperature, which may be measured by determining the volume of a mercury column, and density, which may be measured as the ratio of mass and volume. Such indirect determination came to be called “derived” measurement and the relevant magnitudes “derived magnitudes” (Campbell 1920: 275–7).

At first glance, the distinction between fundamental and derived measurement may seem reminiscent of the distinction between extensive and intensive magnitudes, and indeed fundamental measurement is sometimes called “extensive”. Nonetheless, it is important to note that the two distinctions are based on significantly different criteria of measurability. As discussed in Section 2, the extensive-intensive distinction focused on the intrinsic structure of the quantity in question, i.e., whether or not it is composed of spatio-temporal parts. The fundamental-derived distinction, by contrast, focuses on the properties of measurement operations. A fundamentally measurable magnitude is one for which a fundamental measurement operation has been found. Consequently, fundamentality is not an intrinsic property of a magnitude: a derived magnitude can become fundamental with the discovery of new operations for its measurement. Moreover, in fundamental measurement the numerical assignment need not mirror the structure of spatio-temporal parts. Electrical resistance, for example, can be fundamentally measured by connecting resistors in a series (Campbell 1920: 293). This is considered a fundamental measurement operation because it has a shared structure with numerical addition, even though objects with equal resistance are not generally equal in size.

The distinction between fundamental and derived measurement was revised by subsequent authors. Brian Ellis (1966: Ch. 5–8) distinguished among three types of measurement: fundamental, associative and derived. Fundamental measurement requires ordering and concatenation operations satisfying the same conditions specified by Campbell. Associative measurement procedures are based on a correlation of two ordering relationships, e.g., the correlation between the volume of a mercury column and its temperature. Derived measurement procedures consist in the determination of the value of a constant in a physical law. The constant may be local, as in the determination of the specific density of water from mass and volume, or universal, as in the determination of the Newtonian gravitational constant from force, mass and distance. Henry Kyburg (1984: Ch. 5–7) proposed a somewhat different threefold distinction among direct, indirect and systematic measurement, which does not completely overlap with that of Ellis. [ 4 ] A more radical revision of the distinction between fundamental and derived measurement was offered by R. Duncan Luce and John Tukey (1964) in their work on conjoint measurement, which will be discussed in Section 3.4.

The previous subsection discussed the axiomatization of empirical structures, a line of inquiry that dates back to the early days of measurement theory. A complementary line of inquiry within measurement theory concerns the classification of measurement scales. The psychophysicist S.S. Stevens (1946, 1951) distinguished among four types of scales: nominal, ordinal, interval and ratio. Nominal scales represent objects as belonging to classes that have no particular order, e.g., male and female. Ordinal scales represent order but no further algebraic structure. For example, the Mohs scale of mineral hardness represents minerals with numbers ranging from 1 (softest) to 10 (hardest), but there is no empirical significance to equality among intervals or ratios of those numbers. [ 5 ] Celsius and Fahrenheit are examples of interval scales: they represent equality or inequality among intervals of temperature, but not ratios of temperature, because their zero points are arbitrary. The Kelvin scale, by contrast, is a ratio scale, as are the familiar scales representing mass in kilograms, length in meters and duration in seconds. Stevens later refined this classification and distinguished between linear and logarithmic interval scales (1959: 31–34) and between ratio scales with and without a natural unit (1959: 34). Ratio scales with a natural unit, such as those used for counting discrete objects and for representing probabilities, were named “absolute” scales.

As Stevens notes, scale types are individuated by the families of transformations they can undergo without loss of empirical information. Empirical relations represented on ratio scales, for example, are invariant under multiplication by a positive number, e.g., multiplication by 2.54 converts from inches to centimeters. Linear interval scales allow both multiplication by a positive number and a constant shift, e.g., the conversion from Celsius to Fahrenheit in accordance with the formula °C × 9/5 + 32 = °F. Ordinal scales admit of any transformation function as long as it is monotonic and increasing, and nominal scales admit of any one-to-one substitution. Absolute scales admit of no transformation other than identity. Stevens’ classification of scales was later generalized by Louis Narens (1981, 1985: Ch. 2) and Luce et al. (1990: Ch. 20) in terms of the homogeneity and uniqueness of the relevant transformation groups.
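
As a rough illustration of these transformation families (a sketch added for this discussion; the function names are ours), the permissible transformations for ratio and linear interval scales can be written out explicitly:

```python
# Permissible transformations for two of Stevens' scale types.
# Applying them preserves the empirically significant information.

def ratio_transform(x, a):
    """Ratio scales admit x -> a*x for a > 0, e.g., inches to centimeters."""
    return a * x

def interval_transform(x, a, b):
    """Linear interval scales admit x -> a*x + b for a > 0,
    e.g., Celsius to Fahrenheit with a = 9/5 and b = 32."""
    return a * x + b

print(ratio_transform(1.0, 2.54))          # 1 inch = 2.54 cm
print(interval_transform(100.0, 9/5, 32))  # 100 degrees C = 212 degrees F

# Ordinal scales admit any strictly increasing function, nominal scales
# any one-to-one substitution, and absolute scales only the identity.
```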

While Stevens’ classification of scales met with general approval in scientific and philosophical circles, its wider implications for measurement theory became the topic of considerable debate. Two issues were especially contested. The first was whether classification and ordering operations deserve to be called “measurement” operations, and accordingly whether the representation of magnitudes on nominal and ordinal scales should count as measurement. Several physicists, including Campbell, argued that classification and ordering operations did not provide a sufficiently rich structure to warrant the use of numbers, and hence should not count as measurement operations. The second contested issue was whether a concatenation operation had to be found for a magnitude before it could be fundamentally measured on a ratio scale. The debate became especially heated when it re-ignited a long-running controversy surrounding the measurability of intensities of sensation. It is to this debate we now turn.

One of the main catalysts for the development of mathematical theories of measurement was an ongoing debate surrounding measurability in psychology. The debate is often traced back to Gustav Fechner’s (1860) Elements of Psychophysics, in which he described a method of measuring intensities of sensation. Fechner’s method was based on the recording of “just noticeable differences” between sensations associated with pairs of stimuli, e.g., two sounds of different intensity. These differences were assumed to be equal increments of intensity of sensation. As Fechner showed, under this assumption a stable linear relationship is revealed between the intensity of sensation and the logarithm of the intensity of the stimulus, a relation that came to be known as “Fechner’s law” (Heidelberger 1993a: 203; Luce and Suppes 2004: 11–2). This law in turn provides a method for indirectly measuring the intensity of sensation by measuring the intensity of the stimulus, and hence, Fechner argued, provides justification for measuring intensities of sensation on the real numbers.
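
The logarithmic form of Fechner’s law can be sketched as follows (the constant k and threshold intensity I0 below are illustrative placeholders, not values from Fechner):

```python
import math

def sensation(intensity, k=1.0, i0=1.0):
    """Fechner's law: S = k * log(I / I0), where I0 is the threshold
    stimulus intensity at which sensation begins."""
    return k * math.log(intensity / i0)

# Equal ratios of stimulus intensity yield equal increments of sensation,
# which is how just noticeable differences come to be treated as equal units.
for intensity in (1, 10, 100, 1000):
    print(intensity, round(sensation(intensity), 3))
```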

Fechner’s claims concerning the measurability of sensation became the subject of a series of debates that lasted nearly a century and proved extremely fruitful for the philosophy of measurement, involving key figures such as Mach, Helmholtz, Campbell and Stevens (Heidelberger 1993a: Ch. 6 and 1993b; Michell 1999: Ch. 6). Those objecting to the measurability of sensation, such as Campbell, stressed the necessity of an empirical concatenation operation for fundamental measurement. Since intensities of sensation cannot be concatenated to each other in the manner afforded by lengths and weights, there could be no fundamental measurement of sensation intensity. Moreover, Campbell claimed that none of the psychophysical regularities discovered thus far are sufficiently universal to count as laws in the sense required for derived measurement (Campbell in Ferguson et al. 1940: 347). All that psychophysicists have shown is that intensities of sensation can be consistently ordered, but order by itself does not yet warrant the use of numerical relations such as sums and ratios to express empirical results.

The central opponent of Campbell in this debate was Stevens, whose distinction between types of measurement scale was discussed above. Stevens defined measurement as the “assignment of numerals to objects or events according to rules” (1951: 1) and claimed that any consistent and non-random assignment counts as measurement in the broad sense (1975: 47). In useful cases of scientific inquiry, Stevens claimed, measurement can be construed somewhat more narrowly as a numerical assignment that is based on the results of matching operations, such as the coupling of temperature to mercury volume or the matching of sensations to each other. Stevens argued against the view that relations among numbers need to mirror qualitative empirical structures, claiming instead that measurement scales should be regarded as arbitrary formal schemas and adopted in accordance with their usefulness for describing empirical data. For example, adopting a ratio scale for measuring the sensations of loudness, volume and density of sounds leads to the formulation of a simple linear relation among the reports of experimental subjects: loudness = volume × density (1975: 57–8). Such assignment of numbers to sensations counts as measurement because it is consistent and non-random, because it is based on the matching operations performed by experimental subjects, and because it captures regularities in the experimental results. According to Stevens, these conditions are together sufficient to justify the use of a ratio scale for measuring sensations, despite the fact that “sensations cannot be separated into component parts, or laid end to end like measuring sticks” (1975: 38; see also Hempel 1952: 68–9).

In the mid-twentieth century the two main lines of inquiry in measurement theory, the one dedicated to the empirical conditions of quantification and the one concerning the classification of scales, converged in the work of Patrick Suppes (1951; Scott and Suppes 1958; for historical surveys see Savage and Ehrlich 1992; Diez 1997a,b). Suppes’ work laid the basis for the Representational Theory of Measurement (RTM), which remains the most influential mathematical theory of measurement to date (Krantz et al. 1971; Suppes et al. 1989; Luce et al. 1990). RTM defines measurement as the construction of mappings from empirical relational structures into numerical relational structures (Krantz et al. 1971: 9). An empirical relational structure consists of a set of empirical objects (e.g., rigid rods) along with certain qualitative relations among them (e.g., ordering, concatenation), while a numerical relational structure consists of a set of numbers (e.g., real numbers) and specific mathematical relations among them (e.g., “equal to or bigger than”, addition). Simply put, a measurement scale is a many-to-one mapping—a homomorphism—from an empirical to a numerical relational structure, and measurement is the construction of scales. [ 6 ] RTM goes into great detail in clarifying the assumptions underlying the construction of different types of measurement scales. Each type of scale is associated with a set of assumptions about the qualitative relations obtaining among objects represented on that type of scale. From these assumptions, or axioms, the authors of RTM derive the representational adequacy of each scale type, as well as the family of permissible transformations making that type of scale unique. In this way RTM provides a conceptual link between the empirical basis of measurement and the typology of scales. [ 7 ]
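
The following toy sketch (our own construction, with hidden magnitudes standing in for real rods) illustrates what it means for a numerical assignment to be a homomorphism in the RTM sense: the mapping from objects to numbers preserves the qualitative ordering and concatenation relations.

```python
# Empirical relational structure: rods with qualitative comparison and
# end-to-end concatenation. The numeric "true lengths" below are hidden
# magnitudes used only to simulate the qualitative relations.
rods = {"a": 1.0, "b": 2.0, "c": 3.0}

def longer_than(x, y):
    """Qualitative comparison: does rod x extend at least as far as rod y?"""
    return rods[x] >= rods[y]

def concatenate(x, y):
    """End-to-end concatenation of two rods, yielding a new rod."""
    name = x + y
    rods[name] = rods[x] + rods[y]
    return name

# Numerical assignment (the scale): map each rod to a real number.
def f(x):
    return rods[x]

# Homomorphism conditions: order and concatenation are mirrored
# by ">=" and "+" among the assigned numbers.
assert longer_than("b", "a") == (f("b") >= f("a"))
ab = concatenate("a", "b")
assert f(ab) == f("a") + f("b")
```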

On the issue of measurability, the Representational Theory takes a middle path between the liberal approach adopted by Stevens and the strict emphasis on concatenation operations espoused by Campbell. Like Campbell, RTM accepts that rules of quantification must be grounded in known empirical structures and should not be chosen arbitrarily to fit the data. However, RTM rejects the idea that additive scales are adequate only when concatenation operations are available (Luce and Suppes 2004: 15). Instead, RTM argues for the existence of fundamental measurement operations that do not involve concatenation. The central example of this type of operation is known as “additive conjoint measurement” (Luce and Tukey 1964; Krantz et al. 1971: 17–21 and Ch. 6–7). Here, measurements of two or more different types of attribute, such as the temperature and pressure of a gas, are obtained by observing their joint effect, such as the volume of the gas. Luce and Tukey showed that by establishing certain qualitative relations among volumes under variations of temperature and pressure, one can construct additive representations of temperature and pressure, without invoking any antecedent method of measuring volume. This sort of procedure is generalizable to any suitably related triplet of attributes, such as the loudness, intensity and frequency of pure tones, or the preference for a reward, its size and the delay in receiving it (Luce and Suppes 2004: 17). The discovery of additive conjoint measurement led the authors of RTM to divide fundamental measurement into two kinds: traditional measurement procedures based on concatenation operations, which they called “extensive measurement”, and conjoint or “nonextensive” fundamental measurement. Under this new conception of fundamentality, all the traditional physical attributes can be measured fundamentally, as well as many psychological attributes (Krantz et al. 1971: 502–3).
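
A toy illustration of the underlying idea (our construction, not Luce and Tukey’s proof): if two attributes combine multiplicatively to produce an observable joint effect, then the ordering of joint effects is consistent with additive representations of each attribute obtained by taking logarithms.

```python
import math

# Hidden contributions of two attributes to a joint effect V = f(t) * g(p).
# The labels and values are invented for illustration.
f = {"t1": 1.0, "t2": 2.0, "t3": 4.0}
g = {"p1": 1.0, "p2": 3.0}

for t in sorted(f):
    for p in sorted(g):
        joint = f[t] * g[p]                         # observable joint effect
        additive = math.log(f[t]) + math.log(g[p])  # additive representation
        print(t, p, joint, round(additive, 3))

# Because log is strictly increasing, ordering pairs by their joint effect
# coincides with ordering them by the sum of the additive representations;
# conjoint measurement exploits such qualitative orderings to construct
# the additive scales without any antecedent measurement of the effect.
```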

Above we saw that mathematical theories of measurement are primarily concerned with the mathematical properties of measurement scales and the conditions of their application. A related but distinct strand of scholarship concerns the meaning and use of quantity terms. Scientific theories and models are commonly expressed in terms of quantitative relations among parameters, bearing names such as “length”, “unemployment rate” and “introversion”. A realist about one of these terms would argue that it refers to a set of properties or relations that exist independently of being measured. An operationalist or conventionalist would argue that the way such quantity-terms apply to concrete particulars depends on nontrivial choices made by humans, and specifically on choices that have to do with the way the relevant quantity is measured. Note that under this broad construal, realism is compatible with operationalism and conventionalism. That is, it is conceivable that choices of measurement method regulate the use of a quantity-term and that, given the correct choice, this term succeeds in referring to a mind-independent property or relation. Nonetheless, many operationalists and conventionalists adopted stronger views, according to which there are no facts of the matter as to which of several and nontrivially different operations is correct for applying a given quantity-term. These stronger variants are inconsistent with realism about measurement. This section will be dedicated to operationalism and conventionalism, and the next to realism about measurement.

Operationalism (or “operationism”) about measurement is the view that the meaning of quantity-concepts is determined by the set of operations used for their measurement. The strongest expression of operationalism appears in the early work of Percy Bridgman (1927), who argued that

we mean by any concept nothing more than a set of operations; the concept is synonymous with the corresponding set of operations. (1927: 5)

Length, for example, would be defined as the result of the operation of concatenating rigid rods. According to this extreme version of operationalism, different operations measure different quantities. Length measured by using rulers and by timing electromagnetic pulses should, strictly speaking, be distinguished into two distinct quantity-concepts labeled “length-1” and “length-2” respectively. This conclusion led Bridgman to claim that currently accepted quantity concepts have “joints” where different operations overlap in their domain of application. He warned against dogmatic faith in the unity of quantity concepts across these “joints”, urging instead that unity be checked against experiments whenever the application of a quantity-concept is to be extended into a new domain. Nevertheless, Bridgman conceded that as long as the results of different operations agree within experimental error it is pragmatically justified to label the corresponding quantities with the same name (1927: 16). [ 8 ]

Operationalism became influential in psychology, where it was well-received by behaviorists like Edwin Boring (1945) and B.F. Skinner (1945). Indeed, Skinner maintained that behaviorism is “nothing more than a thoroughgoing operational analysis of traditional mentalistic concepts” (1945: 271). Stevens, who was Boring’s student, was a key promoter of operationalism in psychology, and argued that psychological concepts have empirical meaning only if they stand for definite and concrete operations (1935: 517; see also Isaac 2017). The idea that concepts are defined by measurement operations is consistent with Stevens’ liberal views on measurability, which were discussed above (Section 3.3). As long as the assignment of numbers to objects is performed in accordance with concrete and consistent rules, Stevens maintained that such assignment has empirical meaning and does not need to satisfy any additional constraints. Nonetheless, Stevens probably did not embrace an anti-realist view about psychological attributes. Instead, there are good reasons to think that he understood operationalism as a methodological attitude that was valuable to the extent that it allowed psychologists to justify the conclusions they drew from experiments (Feest 2005). For example, Stevens did not treat operational definitions as a priori but as amenable to improvement in light of empirical discoveries, implying that he took psychological attributes to exist independently of such definitions (Stevens 1935: 527). This suggests that Stevens’ operationalism was of a more moderate variety than that found in the early writings of Bridgman. [ 9 ]

Operationalism was met with initial enthusiasm by logical positivists, who viewed it as akin to verificationism. Nonetheless, it was soon revealed that any attempt to base a theory of meaning on operationalist principles was riddled with problems. Among such problems were the automatic reliability operationalism conferred on measurement operations, the ambiguities surrounding the notion of operation, the overly restrictive operational criterion of meaningfulness, and the fact that many useful theoretical concepts lack clear operational definitions (Chang 2009). [ 10 ] In particular, Carl Hempel (1956, 1966) criticized operationalists for being unable to define dispositional terms such as “solubility in water”, and for multiplying the number of scientific concepts in a manner that runs against the need for systematic and simple theories. Accordingly, most writers on the semantics of quantity-terms have avoided espousing an operational analysis. [ 11 ]

A more widely advocated approach admitted a conventional element to the use of quantity-terms, while resisting attempts to reduce the meaning of quantity terms to measurement operations. These accounts are classified under the general heading “conventionalism”, though they differ in the particular aspects of measurement they deem conventional and in the degree of arbitrariness they ascribe to such conventions. [ 12 ] An early precursor of conventionalism was Ernst Mach, who examined the notion of equality among temperature intervals (1896: 52). Mach noted that different types of thermometric fluid expand at different (and nonlinearly related) rates when heated, raising the question: which fluid expands most uniformly with temperature? According to Mach, there is no fact of the matter as to which fluid expands more uniformly, since the very notion of equality among temperature intervals has no determinate application prior to a conventional choice of standard thermometric fluid. Mach coined the term “principle of coordination” for this sort of conventionally chosen principle for the application of a quantity concept. The concepts of uniformity of time and space received similar treatments by Henri Poincaré (1898, 1902: Part 2). Poincaré argued that procedures used to determine equality among durations stem from scientists’ unconscious preference for descriptive simplicity, rather than from any fact about nature. Similarly, scientists’ choice to represent space with either Euclidean or non-Euclidean geometries is not determined by experience but by considerations of convenience.

Conventionalism with respect to measurement reached its most sophisticated expression in logical positivism. Logical positivists like Hans Reichenbach and Rudolf Carnap proposed “coordinative definitions” or “correspondence rules” as the semantic link between theoretical and observational terms. These a priori, definition-like statements were intended to regulate the use of theoretical terms by connecting them with empirical procedures (Reichenbach 1927: 14–19; Carnap 1966: Ch. 24). An example of a coordinative definition is the statement: “a measuring rod retains its length when transported”. According to Reichenbach, this statement cannot be empirically verified, because a universal and experimentally undetectable force could exist that equally distorts every object’s length when it is transported. In accordance with verificationism, statements that are unverifiable are neither true nor false. Instead, Reichenbach took this statement to express an arbitrary rule for regulating the use of the concept of equality of length, namely, for determining whether particular instances of length are equal (Reichenbach 1927: 16). At the same time, coordinative definitions were not seen as replacements for, but rather as necessary additions to, the familiar sort of theoretical definitions of concepts in terms of other concepts (1927: 14). Under the conventionalist viewpoint, then, the specification of measurement operations did not exhaust the meaning of concepts such as length or length-equality, thereby avoiding many of the problems associated with operationalism. [ 13 ]

Realists about measurement maintain that measurement is best understood as the empirical estimation of an objective property or relation. A few clarificatory remarks are in order with respect to this characterization of measurement. First, the term “objective” is not meant to exclude mental properties or relations, which are the objects of psychological measurement. Rather, measurable properties or relations are taken to be objective inasmuch as they are independent of the beliefs and conventions of the humans performing the measurement and of the methods used for measuring. For example, a realist would argue that the ratio of the length of a given solid rod to the standard meter has an objective value regardless of whether and how it is measured. Second, the term “estimation” is used by realists to highlight the fact that measurement results are mere approximations of true values (Trout 1998: 46). Third, according to realists, measurement is aimed at obtaining knowledge about properties and relations, rather than at assigning values directly to individual objects. This is significant because observable objects (e.g., levers, chemical solutions, humans) often instantiate measurable properties and relations that are not directly observable (e.g., amount of mechanical work, more acidic than, intelligence). Knowledge claims about such properties and relations must presuppose some background theory. By shifting the emphasis from objects to properties and relations, realists highlight the theory-laden character of measurements.

Realism about measurement should not be confused with realism about entities (e.g., electrons). Nor does realism about measurement necessarily entail realism about properties (e.g., temperature), since one could in principle accept only the reality of relations (e.g., ratios among quantities) without embracing the reality of underlying properties. Nonetheless, most philosophers who have defended realism about measurement have done so by arguing for some form of realism about properties (Byerly and Lazara 1973; Swoyer 1987; Mundy 1987; Trout 1998, 2000). These realists argue that at least some measurable properties exist independently of the beliefs and conventions of the humans who measure them, and that the existence and structure of these properties provides the best explanation for key features of measurement, including the usefulness of numbers in expressing measurement results and the reliability of measuring instruments.

For example, a typical realist about length measurement would argue that the empirical regularities displayed by individual objects’ lengths when they are ordered and concatenated are best explained by assuming that length is an objective property that has an extensive structure (Swoyer 1987: 271–4). That is, relations among lengths such as “longer than” and “sum of” exist independently of whether any objects happen to be ordered and concatenated by humans, and indeed independently of whether objects of some particular length happen to exist at all. The existence of an extensive property structure means that lengths share much of their structure with the positive real numbers, and this explains the usefulness of the positive reals in representing lengths. Moreover, if measurable properties are analyzed in dispositional terms, it becomes easy to explain why some measuring instruments are reliable. For example, if one assumes that a certain amount of electric current in a wire entails a disposition to deflect an ammeter needle by a certain angle, it follows that the ammeter’s indications counterfactually depend on the amount of electric current in the wire, and therefore that the ammeter is reliable (Trout 1998: 65).

A different argument for realism about measurement is due to Joel Michell (1994, 2005), who proposes a realist theory of number based on the Euclidean concept of ratio. According to Michell, numbers are ratios between quantities, and therefore exist in space and time. Specifically, real numbers are ratios between pairs of infinite standard sequences, e.g., the sequence of lengths normally denoted by “1 meter”, “2 meters” etc. and the sequence of whole multiples of the length we are trying to measure. Measurement is the discovery and estimation of such ratios. An interesting consequence of this empirical realism about numbers is that measurement is not a representational activity, but rather the activity of approximating mind-independent numbers (Michell 1994: 400).

Realist accounts of measurement are largely formulated in opposition to strong versions of operationalism and conventionalism, which dominated philosophical discussions of measurement from the 1930s until the 1960s. In addition to the drawbacks of operationalism already discussed in the previous section, realists point out that anti-realism about measurable quantities fails to make sense of scientific practice. If quantities had no real values independently of one’s choice of measurement procedure, it would be difficult to explain what scientists mean by “measurement accuracy” and “measurement error”, and why they try to increase accuracy and diminish error. By contrast, realists can easily make sense of the notions of accuracy and error in terms of the distance between real and measured values (Byerly and Lazara 1973: 17–8; Swoyer 1987: 239; Trout 1998: 57). A closely related point is the fact that newer measurement procedures tend to improve on the accuracy of older ones. If choices of measurement procedure were merely conventional it would be difficult to make sense of such progress. In addition, realism provides an intuitive explanation for why different measurement procedures often yield similar results, namely, because they are sensitive to the same facts (Swoyer 1987: 239; Trout 1998: 56). Finally, realists note that the construction of measurement apparatus and the analysis of measurement results are guided by theoretical assumptions concerning causal relationships among quantities. The ability of such causal assumptions to guide measurement suggests that quantities are ontologically prior to the procedures that measure them. [ 14 ]

While their stance towards operationalism and conventionalism is largely critical, realists are more charitable in their assessment of mathematical theories of measurement. Brent Mundy (1987) and Chris Swoyer (1987) both accept the axiomatic treatment of measurement scales, but object to the empiricist interpretation given to the axioms by prominent measurement theorists like Campbell (1920) and Ernest Nagel (1931; Cohen and Nagel 1934: Ch. 15). Rather than interpreting the axioms as pertaining to concrete objects or to observable relations among such objects, Mundy and Swoyer reinterpret the axioms as pertaining to universal magnitudes, e.g., to the universal property of being 5 meters long rather than to the concrete instantiations of that property. This construal preserves the intuition that statements like “the size of x is twice the size of y” are first and foremost about two sizes, and only derivatively about the objects x and y themselves (Mundy 1987: 34). [ 15 ] Mundy and Swoyer argue that their interpretation is more general, because it logically entails all the first-order consequences of the empiricist interpretation along with additional, second-order claims about universal magnitudes. Moreover, under their interpretation measurement theory becomes a genuine scientific theory, with explanatory hypotheses and testable predictions. Building on this work, Jo Wolff (2020a) has recently proposed a novel realist account of quantities that relies on the Representational Theory of Measurement. According to Wolff’s structuralist theory of quantity, quantitative attributes are relational structures. Specifically, an attribute is quantitative if its structure has translations that form an Archimedean ordered group. Wolff’s focus on translations, rather than on specific relations such as concatenation and ordering, means that quantitativeness can be realized in multiple ways and is not restricted to extensive structures. It also means that being a quantity does not have anything special to do with numbers, as both numerical and non-numerical structures can be quantitative.

Information-theoretic accounts of measurement are based on an analogy between measuring systems and communication systems. In a simple communication system, a message (input) is encoded into a signal at the transmitter’s end, sent to the receiver’s end, and then decoded back (output). The accuracy of the transmission depends on features of the communication system as well as on features of the environment, i.e., the level of background noise. Similarly, measuring instruments can be thought of as “information machines” (Finkelstein 1977) that interact with an object in a given state (input), encode that state into an internal signal, and convert that signal into a reading (output). The accuracy of a measurement similarly depends on the instrument as well as on the level of noise in its environment. Conceived as a special sort of information transmission, measurement becomes analyzable in terms of the conceptual apparatus of information theory (Hartley 1928; Shannon 1948; Shannon and Weaver 1949). For example, the information that reading \(y_i\) conveys about the occurrence of a state \(x_k\) of the object can be quantified as \(\log \left[\frac{p(x_k \mid y_i)}{p(x_k)}\right]\), namely as a function of the decrease of uncertainty about the object’s state (Finkelstein 1975: 222; for alternative formulations see Brillouin 1962: Ch. 15; Kirpatovskii 1974; and Mari 1999: 185).
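
For instance, plugging invented probabilities into the expression above gives the following (the base-2 logarithm is chosen here so that the result is expressed in bits):

```python
import math

# Invented example values: prior probability of the object's state x_k,
# and its probability conditional on having observed reading y_i.
p_x = 0.25
p_x_given_y = 0.9

# Information the reading conveys about the state, in bits.
info = math.log2(p_x_given_y / p_x)
print(round(info, 2))  # ~1.85 bits: the reading reduced uncertainty about x_k
```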

Ludwik Finkelstein (1975, 1977) and Luca Mari (1999) suggested the possibility of a synthesis between Shannon-Weaver information theory and measurement theory. As they argue, both theories centrally appeal to the idea of mapping: information theory concerns the mapping between symbols in the input and output messages, while measurement theory concerns the mapping between objects and numbers. If measurement is taken to be analogous to symbol-manipulation, then Shannon-Weaver theory could provide a formalization of the syntax of measurement while measurement theory could provide a formalization of its semantics. Nonetheless, Mari (1999: 185) also warns that the analogy between communication and measurement systems is limited. Whereas a sender’s message can be known with arbitrary precision independently of its transmission, the state of an object cannot be known with arbitrary precision independently of its measurement.

Information-theoretic accounts of measurement were originally developed by metrologists — experts in physical measurement and standardization — with little involvement from philosophers. Independently of developments in metrology, Bas van Fraassen (2008: 141–185) has recently proposed a conception of measurement in which information plays a key role. He views measurement as composed of two levels: on the physical level, the measuring apparatus interacts with an object and produces a reading, e.g., a pointer position. [ 16 ] On the abstract level, background theory represents the object’s possible states on a parameter space. Measurement locates an object on a sub-region of this abstract parameter space, thereby reducing the range of possible states (2008: 164 and 172). This reduction of possibilities amounts to the collection of information about the measured object. Van Fraassen’s analysis of measurement differs from information-theoretic accounts developed in metrology in its explicit appeal to background theory, and in the fact that it does not invoke the symbolic conception of information developed by Shannon and Weaver.

7. Model-Based Accounts of Measurement

Since the early 2000s a new wave of philosophical scholarship has emerged that emphasizes the relationships between measurement and theoretical and statistical modeling (Morgan 2001; Boumans 2005a, 2015; Mari 2005b; Mari and Giordani 2013; Tal 2016, 2017; Parker 2017; Miyake 2017). According to model-based accounts, measurement consists of two levels: (i) a concrete process involving interactions between an object of interest, an instrument, and the environment; and (ii) a theoretical and/or statistical model of that process, where “model” denotes an abstract and local representation constructed from simplifying assumptions. The central goal of measurement according to this view is to assign values to one or more parameters of interest in the model in a manner that satisfies certain epistemic desiderata, in particular coherence and consistency.

Model-based accounts have been developed by studying measurement practices in the sciences, and particularly in metrology. Metrology, officially defined as the “science of measurement and its application” (JCGM 2012: 2.2), is a field of study concerned with the design, maintenance and improvement of measuring instruments in the natural sciences and engineering. Metrologists typically work at standardization bureaus or at specialized laboratories that are responsible for the calibration of measurement equipment, the comparison of standards and the evaluation of measurement uncertainties, among other tasks. It is only recently that philosophers have begun to engage with the rich conceptual issues underlying metrological practice, and particularly with the inferences involved in evaluating and improving the accuracy of measurement standards (Chang 2004; Boumans 2005a: Chap. 5, 2005b, 2007a; Frigerio et al. 2010; Teller 2013, 2018; Riordan 2015; Schlaudt and Huber 2015; Tal 2016a, 2018; Mitchell et al. 2017; Mößner and Nordmann 2017; de Courtenay et al. 2019).

A central motivation for the development of model-based accounts is the attempt to clarify the epistemological principles underlying aspects of measurement practice. For example, metrologists employ a variety of methods for the calibration of measuring instruments, the standardization and tracing of units and the evaluation of uncertainties (for a discussion of metrology, see the previous section). Traditional philosophical accounts such as mathematical theories of measurement do not elaborate on the assumptions, inference patterns, evidential grounds or success criteria associated with such methods. As Frigerio et al. (2010) argue, measurement theory is ill-suited for clarifying these aspects of measurement because it abstracts away from the process of measurement and focuses solely on the mathematical properties of scales. By contrast, model-based accounts take scale construction to be merely one of several tasks involved in measurement, alongside the definition of measured parameters, instrument design and calibration, object sampling and preparation, error detection and uncertainty evaluation, among others (2010: 145–7).

According to model-based accounts, measurement involves interaction between an object of interest (the “system under measurement”), an instrument (the “measurement system”) and an environment, which includes the measuring subjects. Other, secondary interactions may also be relevant for the determination of a measurement outcome, such as the interaction between the measuring instrument and the reference standards used for its calibration, and the chain of comparisons that trace the reference standard back to primary measurement standards (Mari 2003: 25). Measurement proceeds by representing these interactions with a set of parameters, and assigning values to a subset of those parameters (known as “measurands”) based on the results of the interactions. When measured parameters are numerical they are called “quantities”. Although measurands need not be quantities, a quantitative measurement scenario will be supposed in what follows.

Two sorts of measurement outputs are distinguished by model-based accounts (JCGM 2012: 2.9 & 4.1; Giordani and Mari 2012: 2146; Tal 2013):

  • Instrument indications (or “readings”): these are properties of the measuring instrument in its final state after the measurement process is complete. Examples are digits on a display, marks on a multiple-choice questionnaire and bits stored in a device’s memory. Indications may be represented by numbers, but such numbers describe states of the instrument and should not be confused with measurement outcomes, which concern states of the object being measured.
  • Measurement outcomes (or “results”): these are knowledge claims about the values of one or more quantities attributed to the object being measured, and are typically accompanied by a specification of the measurement unit and scale and an estimate of measurement uncertainty. For example, a measurement outcome may be expressed by the sentence “the mass of object a is 20±1 grams with a probability of 68%”.

As proponents of model-based accounts stress, inferences from instrument indications to measurement outcomes are nontrivial and depend on a host of theoretical and statistical assumptions about the object being measured, the instrument, the environment and the calibration process. Measurement outcomes are often obtained through statistical analysis of multiple indications, thereby involving assumptions about the shape of the distribution of indications and the randomness of environmental effects (Bogen and Woodward 1988: 307–310). Measurement outcomes also incorporate corrections for systematic effects, and such corrections are based on theoretical assumptions concerning the workings of the instrument and its interactions with the object and environment. For example, length measurements need to be corrected for the change of the measuring rod’s length with temperature, a correction which is derived from a theoretical equation of thermal expansion. Systematic corrections involve uncertainties of their own, for example in the determination of the values of constants, and these uncertainties are assessed through secondary experiments involving further theoretical and statistical assumptions. Moreover, the uncertainty associated with a measurement outcome depends on the methods employed for the calibration of the instrument. Calibration involves additional assumptions about the instrument, the calibrating apparatus, the quantity being measured and the properties of measurement standards (Rothbart and Slayden 1994; Franklin 1997; Baird 2004: Ch. 4; Soler et al. 2013). Another component of uncertainty originates from vagueness in the definition of the measurand, and is known as “definitional uncertainty” (Mari and Giordani 2013; Grégis 2015). Finally, measurement involves background assumptions about the scale type and unit system being used, and these assumptions are often tied to broader theoretical and technological considerations relating to the definition and realization of scales and units.
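
As a concrete illustration of one such systematic correction (the coefficient and temperatures below are invented for the example), consider adjusting a raw length indication for the thermal expansion of the measuring rod:

```python
# Correcting a length indication for thermal expansion of the measuring rod.
alpha = 11.5e-6       # assumed linear expansion coefficient (1/degree C)
t_reference = 20.0    # temperature at which the rod was calibrated (degrees C)
t_measurement = 26.0  # temperature during the measurement (degrees C)
indication = 1000.00  # raw instrument indication (mm)

# The rod's marks spread apart when it is warmer than at calibration,
# so the raw indication under-reports the length being measured.
corrected = indication * (1 + alpha * (t_measurement - t_reference))
print(f"{corrected:.3f} mm")  # ~1000.069 mm

# Each input above carries its own uncertainty (in alpha, in the
# thermometer reading, in the indication), which a full uncertainty
# budget would propagate into the reported measurement outcome.
```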

These various theoretical and statistical assumptions form the basis for the construction of one or more models of the measurement process. Unlike mathematical theories of measurement, where the term “model” denotes a set-theoretical structure that interprets a formal language, here the term “model” denotes an abstract and local representation of a target system that is constructed from simplifying assumptions. [ 17 ] The relevant target system in this case is a measurement process, that is, a system composed of a measuring instrument, objects or events to be measured, the environment (including human operators), secondary instruments and reference standards, the time-evolution of these components, and their various interactions with each other. Measurement is viewed as a set of procedures whose aim is to coherently assign values to model parameters based on instrument indications. Models are therefore seen as necessary preconditions for the possibility of inferring measurement outcomes from instrument indications, and as crucial for determining the content of measurement outcomes. As proponents of model-based accounts emphasize, the same indications produced by the same measurement process may be used to establish different measurement outcomes depending on how the measurement process is modeled, e.g., depending on which environmental influences are taken into account, which statistical assumptions are used to analyze noise, and which approximations are used in applying background theory. As Luca Mari puts it,

any measurement result reports information that is meaningful only in the context of a metrological model, such a model being required to include a specification for all the entities that explicitly or implicitly appear in the expression of the measurement result. (2003: 25)

Similarly, models are said to provide the necessary context for evaluating various aspects of the goodness of measurement outcomes, including accuracy, precision, error and uncertainty (Boumans 2006, 2007a, 2009, 2012b; Mari 2005b).

Model-based accounts diverge from empiricist interpretations of measurement theory in that they do not require relations among measurement outcomes to be isomorphic or homomorphic to observable relations among the items being measured (Mari 2000). Indeed, according to model-based accounts relations among measured objects need not be observable at all prior to their measurement (Frigerio et al. 2010: 125). Instead, the key normative requirement of model-based accounts is that values be assigned to model parameters in a coherent manner. The coherence criterion may be viewed as a conjunction of two sub-criteria: (i) coherence of model assumptions with relevant background theories or other substantive presuppositions about the quantity being measured; and (ii) objectivity, i.e., the mutual consistency of measurement outcomes across different measuring instruments, environments and models [ 18 ] (Frigerio et al. 2010; Tal 2017a; Teller 2018). The first sub-criterion is meant to ensure that the intended quantity is being measured, while the second sub-criterion is meant to ensure that measurement outcomes can be reasonably attributed to the measured object rather than to some artifact of the measuring instrument, environment or model. Taken together, these two requirements ensure that measurement outcomes remain valid independently of the specific assumptions involved in their production, and hence that the context-dependence of measurement outcomes does not threaten their general applicability.

Besides their applicability to physical measurement, model-based analyses also shed light on measurement in economics. Like physical quantities, values of economic variables often cannot be observed directly and must be inferred from observations based on abstract and idealized models. The nineteenth century economist William Jevons, for example, measured changes in the value of gold by postulating certain causal relationships between the value of gold, the supply of gold and the general level of prices (Hoover and Dowell 2001: 155–159; Morgan 2001: 239). As Julian Reiss (2001) shows, Jevons’ measurements were made possible by using two models: a causal-theoretical model of the economy, which is based on the assumption that the quantity of gold has the capacity to raise or lower prices; and a statistical model of the data, which is based on the assumption that local variations in prices are mutually independent and therefore cancel each other out when averaged. Taken together, these models allowed Jevons to infer the change in the value of gold from data concerning the historical prices of various goods. [ 19 ]

The ways in which models function in economic measurement have led some philosophers to view certain economic models as measuring instruments in their own right, analogously to rulers and balances (Boumans 1999, 2005c, 2006, 2007a, 2009, 2012a, 2015; Morgan 2001). Marcel Boumans explains how macroeconomists are able to isolate a variable of interest from external influences by tuning parameters in a model of the macroeconomic system. This technique frees economists from the impossible task of controlling the actual system. As Boumans argues, macroeconomic models function as measuring instruments insofar as they produce invariant relations between inputs (indications) and outputs (outcomes), and insofar as this invariance can be tested by calibration against known and stable facts. When such model-based procedures are combined with expert judgment, they can produce reliable measurements of economic phenomena even outside controlled laboratory settings (Boumans 2015: Chap. 5).

Another area where models play a central role in measurement is psychology. The measurement of most psychological attributes, such as intelligence, anxiety and depression, does not rely on homomorphic mappings of the sort espoused by the Representational Theory of Measurement (Wilson 2013: 3766). Instead, psychometric theory relies predominantly on the development of abstract models that are meant to predict subjects’ performance in certain tasks. These models are constructed from substantive and statistical assumptions about the psychological attribute being measured and its relation to each measurement task. For example, Item Response Theory, a popular approach to psychological measurement, employs a variety of models to evaluate the reliability and validity of questionnaires. Consider a questionnaire that is meant to assess English language comprehension (the “ability”), by presenting subjects with a series of yes/no questions (the “items”). One of the simplest models used to calibrate such questionnaires is the Rasch model (Rasch 1960). This model supposes a straightforward algebraic relation—known as the “log of the odds”—between the probability that a subject will answer a given item correctly, the difficulty of that particular item, and the subject’s ability. New questionnaires are calibrated by testing the fit between their indications and the predictions of the Rasch model and assigning difficulty levels to each item accordingly. The model is then used in conjunction with the questionnaire to infer levels of English language comprehension (outcomes) from raw questionnaire scores (indications) (Wilson 2013; Mari and Wilson 2014).
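
A minimal sketch of the Rasch model relation described above (the parameter values are arbitrary illustrations; ability and difficulty live on the same latent scale):

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: log-odds of a correct response = ability - difficulty."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

print(p_correct(0.0, 0.0))  # 0.5: odds are even when ability equals difficulty
print(p_correct(1.0, 0.0))  # ~0.73: abler subjects succeed more often
print(p_correct(0.0, 2.0))  # ~0.12: harder items are passed less often
```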

The sort of statistical calibration (or “scaling”) provided by Rasch models yields repeatable results, but it is often only a first step towards full-fledged psychological measurement. Psychologists are typically interested in the results of a measure not for their own sake, but for the sake of assessing some underlying and latent psychological attribute, e.g., English language comprehension. A good fit between item responses and a statistical model does not yet determine what the questionnaire is measuring. The process of establishing that a procedure measures the intended psychological attribute is known as “validation”. One way of validating a psychometric instrument is to test whether different procedures that are intended to measure the same latent attribute provide consistent results. Such testing belongs to a family of validation techniques known as “construct validation”. A construct is an abstract representation of the latent attribute intended to be measured, and

reflects a hypothesis […] that a variety of behaviors will correlate with one another in studies of individual differences and/or will be similarly affected by experimental manipulations. (Nunnally & Bernstein 1994: 85)

Constructs are denoted by variables in a model that predicts which correlations would be observed among the indications of different measures if they are indeed measures of the same attribute. Such models involve substantive assumptions about the attribute, including its internal structure and its relations to other attributes, and statistical assumptions about the correlation among different measures (Campbell & Fiske 1959; Nunnally & Bernstein 1994: Ch. 3; Angner 2008).
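
The logic of such convergent predictions can be illustrated with a small simulation (entirely invented data; real validation studies involve much richer models): two noisy measures of the same latent attribute should correlate across individuals.

```python
import random
import statistics

random.seed(0)

# Simulated latent attribute for 1000 individuals, plus two measures
# that each track the attribute with independent noise.
latent = [random.gauss(0, 1) for _ in range(1000)]
measure_a = [x + random.gauss(0, 0.5) for x in latent]
measure_b = [x + random.gauss(0, 0.5) for x in latent]

# If both procedures measure the same attribute, their indications
# should correlate, as the construct model predicts (around 0.8 here).
print(round(statistics.correlation(measure_a, measure_b), 2))
```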

In recent years, philosophers of science have become increasingly interested in psychometrics and the concept of validity. One debate concerns the ontological status of latent psychological attributes. Denny Borsboom has argued against operationalism about latent attributes, and in favour of defining validity in a manner that embraces realism: “a test is valid for measuring an attribute if and only if a) the attribute exists, and b) variations in the attribute causally produce variations in the outcomes of the measurement procedure” (2005: 150; see also Hood 2009, 2013; Feest 2020). Elina Vessonen has defended a moderate form of operationalism about psychological attributes, and argued that moderate operationalism is compatible with a cautious type of realism (2019). Another recent discussion focuses on the justification for construct validation procedures. According to Anna Alexandrova, construct validation is in principle a justified methodology, insofar as it establishes coherence with theoretical assumptions and background knowledge about the latent attribute. However, Alexandrova notes that in practice psychometricians who intend to measure happiness and well-being often avoid theorizing about these constructs, and instead appeal to respondents’ folk beliefs. This defeats the purpose of construct validation and turns it into a narrow, technical exercise (Alexandrova and Haybron 2016; Alexandrova 2017; see also McClimans et al. 2017).

A more fundamental criticism leveled against psychometrics is that it dogmatically presupposes that psychological attributes can be quantified. Michell (2000, 2004b) argues that psychometricians have not made serious attempts to test whether the attributes they purport to measure have quantitative structure, and instead adopted an overly loose conception of measurement that disguises this neglect. In response, Borsboom and Mellenbergh (2004) argue that Item Response Theory provides probabilistic tests of the quantifiability of attributes. Psychometricians who construct a statistical model initially hypothesize that an attribute is quantitative, and then subject the model to empirical tests. When successful, such tests provide indirect confirmation of the initial hypothesis, e.g. by showing that the attribute has an additive conjoint structure (see also Vessonen 2020).

Several scholars have pointed out similarities between the ways models are used to standardize measurable quantities in the natural and social sciences. For example, Mark Wilson (2013) argues that psychometric models can be viewed as tools for constructing measurement standards in the same sense of “measurement standard” used by metrologists. Others have raised doubts about the feasibility and desirability of adopting the example of the natural sciences when standardizing constructs in the social sciences. Nancy Cartwright and Rosa Runhardt (2014) discuss “Ballung” concepts, a term they borrow from Otto Neurath to denote concepts with a fuzzy and context-dependent scope. Examples of Ballung concepts are race, poverty, social exclusion, and the quality of PhD programs. Such concepts are too multifaceted to be measured on a single metric without loss of meaning, and must be represented either by a matrix of indices or by several different measures depending on which goals and values are at play (see also Bradburn, Cartwright, & Fuller 2016, Other Internet Resources). Alexandrova (2008) points out that ethical considerations bear on questions about the validity of measures of well-being no less than considerations of reproducibility. Such ethical considerations are context sensitive, and can only be applied piecemeal. In a similar vein, Leah McClimans (2010) argues that uniformity is not always an appropriate goal for designing questionnaires, as the open-endedness of questions is often both unavoidable and desirable for obtaining relevant information from subjects. [ 20 ] The intertwining of ethical and epistemic considerations is especially clear when psychometric questionnaires are used in medical contexts to evaluate patient well-being and mental health. In such cases, small changes to the design of a questionnaire or the analysis of its results may result in significant harms or benefits to patients (McClimans 2017; Stegenga 2018, Chap. 8). These insights highlight the value-laden and contextual nature of the measurement of mental and social phenomena.

8. The Epistemology of Measurement

The development of model-based accounts discussed in the previous section is part of a larger, “epistemic turn” in the philosophy of measurement that occurred in the early 2000s. Rather than emphasizing the mathematical foundations, metaphysics or semantics of measurement, philosophical work in recent years tends to focus on the presuppositions and inferential patterns involved in concrete practices of measurement, and on the historical, social and material dimensions of measuring. The philosophical study of these topics has been referred to as the “epistemology of measurement” (Mari 2003, 2005a; Leplège 2003; Tal 2017a). In the broadest sense, the epistemology of measurement is the study of the relationships between measurement and knowledge. Central topics that fall under the purview of the epistemology of measurement include the conditions under which measurement produces knowledge; the content, scope, justification and limits of such knowledge; the reasons why particular methodologies of measurement and standardization succeed or fail in supporting particular knowledge claims, and the relationships between measurement and other knowledge-producing activities such as observation, theorizing, experimentation, modelling and calculation. In pursuing these objectives, philosophers are drawing on the work of historians and sociologists of science, who have been investigating measurement practices for a longer period (Wise and Smith 1986; Latour 1987: Ch. 6; Schaffer 1992; Porter 1995, 2007; Wise 1995; Alder 2002; Galison 2003; Gooday 2004; Crease 2011), as well as on the history and philosophy of scientific experimentation (Harré 1981; Hacking 1983; Franklin 1986; Cartwright 1999). The following subsections survey some of the topics discussed in this burgeoning body of literature.

A topic that has attracted considerable philosophical attention in recent years is the selection and improvement of measurement standards. Generally speaking, to standardize a quantity concept is to prescribe a determinate way in which that concept is to be applied to concrete particulars. [ 21 ] To standardize a measuring instrument is to assess how well the outcomes of measuring with that instrument fit the prescribed mode of application of the relevant concept. [ 22 ] The term “measurement standard” accordingly has at least two meanings: on the one hand, it is commonly used to refer to abstract rules and definitions that regulate the use of quantity concepts, such as the definition of the meter. On the other hand, the term “measurement standard” is also commonly used to refer to the concrete artifacts and procedures that are deemed exemplary of the application of a quantity concept, such as the metallic bar that served as the standard meter until 1960. This duality in meaning reflects the dual nature of standardization, which involves both abstract and concrete aspects.

In Section 4 it was noted that standardization involves choices among nontrivial alternatives, such as the choice among different thermometric fluids or among different ways of marking equal duration. These choices are nontrivial in the sense that they affect whether or not the same temperature (or time) intervals are deemed equal, and hence affect whether or not statements of natural law containing the term “temperature” (or “time”) come out true. Appealing to theory to decide which standard is more accurate would be circular, since the theory cannot be determinately applied to particulars prior to a choice of measurement standard. This circularity has been variously called the “problem of coordination” (van Fraassen 2008: Ch. 5) and the “problem of nomic measurement” (Chang 2004: Ch. 2). As already mentioned, conventionalists attempted to escape the circularity by positing a priori statements, known as “coordinative definitions”, which were supposed to link quantity-terms with specific measurement operations. A drawback of this solution is that it supposes that choices of measurement standard are arbitrary and static, whereas in actual practice measurement standards tend to be chosen based on empirical considerations and are eventually improved or replaced with standards that are deemed more accurate.

A new strand of writing on the problem of coordination has emerged in recent years, consisting most notably of the works of Hasok Chang (2001, 2004, 2007; Barwich and Chang 2015) and Bas van Fraassen (2008: Ch. 5; 2009, 2012; see also Padovani 2015, 2017; Michel 2019). These works take a historical and coherentist approach to the problem. Rather than attempting to avoid the problem of circularity completely, as their predecessors did, they set out to show that the circularity is not vicious. Chang argues that constructing a quantity-concept and standardizing its measurement are co-dependent and iterative tasks. Each “epistemic iteration” in the history of standardization respects existing traditions while at the same time correcting them (Chang 2004: Ch. 5). The pre-scientific concept of temperature, for example, was associated with crude and ambiguous methods of ordering objects from hot to cold. Thermoscopes, and eventually thermometers, helped modify the original concept and made it more precise. With each such iteration the quantity concept was re-coordinated to a more stable set of standards, which in turn allowed theoretical predictions to be tested more precisely, facilitating the subsequent development of theory and the construction of more stable standards, and so on.

How this process avoids vicious circularity becomes clear when we look at it either “from above”, i.e., in retrospect given our current scientific knowledge, or “from within”, by looking at historical developments in their original context (van Fraassen 2008: 122). From either vantage point, coordination succeeds because it increases coherence among elements of theory and instrumentation. The questions “what counts as a measurement of quantity X ?” and “what is quantity X ?”, though unanswerable independently of each other, are addressed together in a process of mutual refinement. It is only when one adopts a foundationalist view and attempts to find a starting point for coordination free of presupposition that this historical process erroneously appears to lack epistemic justification (2008: 137).

The new literature on coordination shifts the emphasis of the discussion from the definitions of quantity-terms to the realizations of those definitions. In metrological jargon, a “realization” is a physical instrument or procedure that approximately satisfies a given definition (cf. JCGM 2012: 5.1). Examples of metrological realizations are the official prototypes of the kilogram and the cesium fountain clocks used to standardize the second. Recent studies suggest that the methods used to design, maintain and compare realizations have a direct bearing on the practical application of concepts of quantity, unit and scale, no less than the definitions of those concepts (Riordan 2015; Tal 2016). The relationship between the definition and realizations of a unit becomes especially complex when the definition is stated in theoretical terms. Several of the base units of the International System (SI) — including the meter, kilogram, ampere, kelvin and mole — are no longer defined by reference to any specific kind of physical system, but by fixing the numerical value of a fundamental physical constant. The kilogram, for example, was redefined in 2019 as the unit of mass such that the numerical value of the Planck constant is exactly 6.62607015 × 10⁻³⁴ kg m² s⁻¹ (BIPM 2019: 131). Realizing the kilogram under this definition is a highly theory-laden task. The study of the practical realization of such units has shed new light on the evolving relationships between measurement and theory (Tal 2018; de Courtenay et al. 2019; Wolff 2020b).

As already discussed above (Sections 7 and 8.1), theory and measurement are interdependent both historically and conceptually. On the historical side, the development of theory and measurement proceeds through iterative and mutual refinements. On the conceptual side, the specification of measurement procedures shapes the empirical content of theoretical concepts, while theory provides a systematic interpretation for the indications of measuring instruments. This interdependence of measurement and theory may seem like a threat to the evidential role that measurement is supposed to play in the scientific enterprise. After all, measurement outcomes are thought to be able to test theoretical hypotheses, and this seems to require some degree of independence of measurement from theory. This threat is especially clear when the theoretical hypothesis being tested is already presupposed as part of the model of the measuring instrument. To cite an example from Franklin et al. (1989: 230):

There would seem to be, at first glance, a vicious circularity if one were to use a mercury thermometer to measure the temperature of objects as part of an experiment to test whether or not objects expand as their temperature increases.

Nonetheless, Franklin et al. conclude that the circularity is not vicious. The mercury thermometer could be calibrated against another thermometer whose principle of operation does not presuppose the law of thermal expansion, such as a constant-volume gas thermometer, thereby establishing the reliability of the mercury thermometer on independent grounds. To put the point more generally, in the context of local hypothesis-testing the threat of circularity can usually be avoided by appealing to other kinds of instruments and other parts of theory.

A different sort of worry about the evidential function of measurement arises on the global scale, when the testing of entire theories is concerned. As Thomas Kuhn (1961) argues, scientific theories are usually accepted long before quantitative methods for testing them become available. The reliability of newly introduced measurement methods is typically tested against the predictions of the theory rather than the other way around. In Kuhn’s words, “The road from scientific law to scientific measurement can rarely be traveled in the reverse direction” (1961: 189). For example, Dalton’s Law, which states that the weights of elements in a chemical compound are related to each other in whole-number proportions, initially conflicted with some of the best known measurements of such proportions. It is only by assuming Dalton’s Law that subsequent experimental chemists were able to correct and improve their measurement techniques (1961: 173). Hence, Kuhn argues, the function of measurement in the physical sciences is not to test the theory but to apply it with increasing scope and precision, and eventually to allow persistent anomalies to surface that would precipitate the next crisis and scientific revolution. Note that Kuhn is not claiming that measurement has no evidential role to play in science. Instead, he argues that measurements cannot test a theory in isolation, but only by comparison to some alternative theory that is proposed in an attempt to account for the anomalies revealed by increasingly precise measurements (for an illuminating discussion of Kuhn’s thesis see Hacking 1983: 243–5).

Traditional discussions of theory-ladenness, like those of Kuhn, were conducted against the background of the logical positivists’ distinction between theoretical and observational language. The theory-ladenness of measurement was correctly perceived as a threat to the possibility of a clear demarcation between the two languages. Contemporary discussions, by contrast, no longer present theory-ladenness as an epistemological threat but take for granted that some level of theory-ladenness is a prerequisite for measurements to have any evidential power. Without some minimal substantive assumptions about the quantity being measured, such as its amenability to manipulation and its relations to other quantities, it would be impossible to interpret the indications of measuring instruments and hence impossible to ascertain the evidential relevance of those indications. This point was already made by Pierre Duhem (1906: 153–6; see also Carrier 1994: 9–19). Moreover, contemporary authors emphasize that theoretical assumptions play crucial roles in correcting for measurement errors and evaluating measurement uncertainties. Indeed, physical measurement procedures become more accurate when the model underlying them is de-idealized, a process which involves increasing the theoretical richness of the model (Tal 2011).

The acknowledgment that theory is crucial for guaranteeing the evidential reliability of measurement draws attention to the “problem of observational grounding”, which is an inverse challenge to the traditional threat of theory-ladenness (Tal 2016b). The challenge is to specify what role observation plays in measurement, and particularly what sort of connection with observation is necessary and/or sufficient to allow measurement to play an evidential role in the sciences. This problem is especially clear when one attempts to account for the increasing use of computational methods for performing tasks that were traditionally accomplished by measuring instruments. As Margaret Morrison (2009) and Wendy Parker (2017) argue, there are cases where reliable quantitative information is gathered about a target system with the aid of a computer simulation, but in a manner that satisfies some of the central desiderata for measurement such as being empirically grounded and backward-looking (see also Lusk 2016). Such information does not rely on signals transmitted from the particular object of interest to the instrument, but on the use of theoretical and statistical models to process empirical data about related objects. For example, data assimilation methods are customarily used to estimate past atmospheric temperatures in regions where thermometer readings are not available. Some methods do this by fitting a computational model of the atmosphere’s behavior to a combination of available data from nearby regions and a model-based forecast of conditions at the time of observation (Parker 2017). These estimations are then used in various ways, including as data for evaluating forward-looking climate models. Regardless of whether one calls these estimations “measurements”, they challenge the idea that producing reliable quantitative evidence about the state of an object requires observing that object, however loosely one understands the term “observation”. [ 23 ]
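
The following toy Python sketch illustrates the general logic of such model-data fusion with a scalar, Kalman-style analysis step. The function, variances, and temperatures are hypothetical and greatly simplified relative to the data assimilation methods Parker (2017) discusses.

```python
def assimilate(forecast, forecast_var, obs, obs_var):
    """Combine a model forecast with an observation by inverse-variance
    weighting, the core of a scalar Kalman-style analysis step."""
    gain = forecast_var / (forecast_var + obs_var)
    estimate = forecast + gain * (obs - forecast)
    variance = (1 - gain) * forecast_var
    return estimate, variance

# Hypothetical numbers: a model forecasts -3.0 C (variance 4.0) for a
# region with no thermometer; a nearby station reports -1.0 C (variance 1.0).
estimate, variance = assimilate(-3.0, 4.0, -1.0, 1.0)
print(estimate, variance)  # -1.4, 0.8
```

The resulting estimate is pulled toward whichever source is less uncertain, which is why such procedures can yield quantitative information about regions that were never directly observed.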

Two key aspects of the reliability of measurement outcomes are accuracy and precision. Consider a series of repeated weight measurements performed on a particular object with an equal-arms balance. From a realist, “error-based” perspective, the outcomes of these measurements are accurate if they are close to the true value of the quantity being measured—in our case, the true ratio of the object’s weight to the chosen unit—and precise if they are close to each other. An analogy often cited to clarify the error-based distinction is that of arrows shot at a target, with accuracy analogous to the closeness of hits to the bull’s eye and precision analogous to the tightness of spread of hits (cf. JCGM 2012: 2.13 & 2.15, Teller 2013: 192). Though intuitive, the error-based way of carving the distinction raises an epistemological difficulty. It is commonly thought that the exact true values of most quantities of interest to science are unknowable, at least when those quantities are measured on continuous scales. If this assumption is granted, the accuracy with which such quantities are measured cannot be known with exactitude, but only estimated by comparing inaccurate measurements to each other. And yet it is unclear why convergence among inaccurate measurements should be taken as an indication of truth. After all, the measurements could be plagued by a common bias that prevents their individual inaccuracies from cancelling each other out when averaged. In the absence of cognitive access to true values, how is the evaluation of measurement accuracy possible?
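
The error-based distinction can be illustrated with a small simulation, sketched below in Python under the epistemically unrealistic assumption that the true value is known; the bias and spread figures are hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(1)
TRUE_WEIGHT = 100.0  # stipulated here; in practice the true value is unknown

def weigh(n, bias, spread):
    """Simulate n repeated weighings with a systematic bias and random spread."""
    return [random.gauss(TRUE_WEIGHT + bias, spread) for _ in range(n)]

for label, readings in [("precise but inaccurate", weigh(20, bias=2.0, spread=0.1)),
                        ("accurate but imprecise", weigh(20, bias=0.0, spread=2.0))]:
    # Error-based accuracy: closeness of the readings to the true value.
    # Precision: closeness of the readings to one another.
    print(label,
          "| mean error:", round(mean(readings) - TRUE_WEIGHT, 2),
          "| spread:", round(stdev(readings), 2))
```

Note that the simulation can evaluate accuracy only because TRUE_WEIGHT is stipulated, which is precisely the cognitive access that the epistemological difficulty denies us in actual measurement.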

In answering this question, philosophers have benefited from studying the various senses of the term “measurement accuracy” as used by practicing scientists. At least five different senses have been identified: metaphysical, epistemic, operational, comparative and pragmatic (Tal 2011: 1084–5). In particular, the epistemic or “uncertainty-based” sense of the term is metaphysically neutral and does not presuppose the existence of true values. Instead, the accuracy of a measurement outcome is taken to be the closeness of agreement among values reasonably attributed to a quantity given available empirical data and background knowledge (cf. JCGM 2012: 2.13 Note 3; Giordani & Mari 2012; de Courtenay and Grégis 2017). Thus construed, measurement accuracy can be evaluated by establishing robustness among the consequences of models representing different measurement processes (Basso 2017; Tal 2017b; Bokulich 2020; Staley 2020).

Under the uncertainty-based conception, imprecision is a special type of inaccuracy. For example, the inaccuracy of weight measurements is the breadth of spread of values that are reasonably attributed to the object’s weight given the indications of the balance and available background knowledge about the way the balance works and the standard weights used. The imprecision of these measurements is the component of inaccuracy arising from uncontrolled variations to the indications of the balance over repeated trials. Other sources of inaccuracy besides imprecision include imperfect corrections to systematic errors, inaccurately known physical constants, and vague measurand definitions, among others (see Section 7.1).
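
As a rough illustration of how imprecision enters as one component among others, the following sketch combines a statistical (type A) uncertainty estimated from the spread of indications with hypothetical systematic (type B) components, using the root-sum-of-squares rule familiar from uncertainty evaluation guides such as the GUM; all numbers are invented.

```python
import math
from statistics import mean, stdev

# Hypothetical indications of the balance over repeated trials (grams):
indications = [99.8, 100.1, 99.9, 100.2, 100.0]

# Type A uncertainty: the imprecision component, estimated from the
# spread of the indications over repeated trials.
u_repeatability = stdev(indications) / math.sqrt(len(indications))

# Hypothetical type B components: an imperfect correction for air
# buoyancy and the calibration uncertainty of the standard weights.
u_buoyancy = 0.05
u_standards = 0.08

# Combined standard uncertainty: root sum of squares of the components.
u_combined = math.sqrt(u_repeatability**2 + u_buoyancy**2 + u_standards**2)
print(f"{mean(indications):.2f} g +/- {u_combined:.2f} g")
```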

Paul Teller (2018) raises a different objection to the error-based conception of measurement accuracy. He argues against an assumption he calls “measurement accuracy realism”, according to which measurable quantities have definite values in reality. Teller argues that this assumption is false insofar as it concerns the quantities habitually measured in physics, because any specification of definite values (or value ranges) for such quantities involves idealization and hence cannot refer to anything in reality. For example, the concept usually understood by the phrase “the velocity of sound in air” involves a host of implicit idealizations concerning the uniformity of the air’s chemical composition, temperature and pressure, as well as the stability of units of measurement. Removing these idealizations completely would require adding an infinite amount of detail to each specification. As Teller argues, measurement accuracy should itself be understood as a useful idealization, namely as a concept that allows scientists to assess coherence and consistency among measurement outcomes as if the linguistic expression of these outcomes latched onto anything in the world. Precision is similarly an idealized concept, based on an open-ended and indefinite specification of what counts as repetition of measurement under “the same” circumstances (Teller 2013: 194).

Bibliography

  • Alder, K., 2002, The Measure of All Things: The Seven-Year Odyssey and Hidden Error That Transformed the World , New York: The Free Press.
  • Alexandrova, A., 2008, “First Person Reports and the Measurement of Happiness”, Philosophical Psychology , 21(5): 571–583.
  • –––, 2017, A Philosophy for the Science of Well-Being , Oxford: Oxford University Press.
  • Alexandrova, A. and D.M. Haybron, 2016, “Is Construct Validation Valid?” Philosophy of Science , 83(5): 1098–1109.
  • Angner, E., 2008, “The Philosophical Foundations of Subjective Measures of Well-Being”, in Capabilities and Happiness , L. Bruni, F. Comim, and M. Pugno (eds.), Oxford: Oxford University Press.
  • –––, 2013, “Is it Possible to Measure Happiness? The argument from measurability”, European Journal for Philosophy of Science , 3: 221–240.
  • Aristotle, Categories , in The Complete Works of Aristotle , Volume I, J. Barnes (ed.), Princeton: Princeton University Press, 1984.
  • Baird, D., 2004, Thing Knowledge: A Philosophy of Scientific Instruments , Berkeley: University of California Press.
  • Barwich, A.S., and H. Chang, 2015, “Sensory Measurements: Coordination and Standardization”, Biological Theory , 10(3): 200–211.
  • Basso, A., 2017, “The Appeal to Robustness in Measurement Practice”, Studies in History and Philosophy of Science Part A , 65: 57–66.
  • Biagioli, F., 2016, Space, Number, and Geometry from Helmholtz to Cassirer , Dordrecht: Springer.
  • –––, 2018, “Cohen and Helmholtz on the Foundations of Measurement”, in C. Damböck (ed.), Philosophie Und Wissenschaft Bei Hermann Cohen – Philosophy and Science in Hermann Cohen , Dordrecht: Springer, 77–100.
  • BIPM (Bureau International des Poids et Mesures), 2019, The International System of Units (SI Brochure), 9th Edition. [ BIPM 2019 available online ]
  • Bogen, J. and J. Woodward, 1988, “Saving the Phenomena”, The Philosophical Review , 97(3): 303–352.
  • Bokulich, A., 2020, “Calibration, Coherence, and Consilience in Radiometric Measures of Geologic Time”, Philosophy of Science , 87(3): 425–56.
  • Boring, E.G., 1945, “The use of operational definitions in science”, in Boring et al. 1945: 243–5.
  • Boring, E.G., P.W. Bridgman, H. Feigl, H. Israel, C.C Pratt, and B.F. Skinner, 1945, “Symposium on Operationism”, The Psychological Review , 52: 241–294.
  • Borsboom, D., 2005, Measuring the Mind: Conceptual Issues in Contemporary Psychometrics , Cambridge: Cambridge University Press.
  • Borsboom, D., and G.J. Mellenbergh, 2004, “Why psychometrics is not pathological: A comment on Michell”, Theory & Psychology , 14: 105–120.
  • Boumans, M., 1999, “Representation and Stability in Testing and Measuring Rational Expectations”, Journal of Economic Methodology , 6(3): 381–401.
  • –––, 2005a, How Economists Model the World into Numbers , New York: Routledge.
  • –––, 2005b, “Truth versus Precision”, in Logic, Methodology and Philosophy of Science: Proceedings of the Twelfth International Congress , P. Hájek, L. Valdés-Villanueva, and D. Westerstahl (eds.), London: College Publications, pp. 257–269.
  • –––, 2005c, “Measurement outside the laboratory”, Philosophy of Science , 72: 850–863.
  • –––, 2006, “The difference between answering a ‘why’ question and answering a ‘how much’ question”, in Simulation: Pragmatic Construction of Reality , J. Lenhard, G Küppers, and T Shinn (eds.), Dordrecht: Springer, pp. 107–124.
  • –––, 2007a, “Invariance and Calibration”, in Boumans 2007b: 231–248.
  • ––– (ed.), 2007b, Measurement in Economics: A Handbook , London: Elsevier.
  • –––, 2009, “Grey-Box Understanding in Economics”, in Scientific Understanding: Philosophical Perspectives , H.W. de Regt, S. Leonelli, and K. Eigner, Pittsburgh: University of Pittsburgh Press, pp. 210–229.
  • –––, 2012a, “Modeling Strategies for Measuring Phenomena In- and Outside the Laboratory”, in EPSA Philosophy of Science: Amsterdam 2009 (The European Philosophy of Science Association Proceedings), H.W. de Regt, S. Hartmann, and S. Okasha (eds.), Dordrecht: Springer, pp. 1–11.
  • –––, 2012b, “Measurement in Economics”, in Philosophy of Economics (Handbook of the Philosophy of Science: Volume 13), U. Mäki (ed.), Oxford: Elsevier, pp. 395–423.
  • –––, 2015, Science Outside the Laboratory: Measurement in Field Science and Economics , Oxford: Oxford University Press.
  • Bridgman, P.W., 1927, The Logic of Modern Physics , New York: Macmillan.
  • –––, 1938, “Operational Analysis”, Philosophy of Science , 5: 114–131.
  • –––, 1945, “Some General Principles of Operational Analysis”, in Boring et al. 1945: 246–249.
  • –––, 1956, “The Present State of Operationalism”, in Frank 1956: 74–79.
  • Brillouin, L., 1962, Science and information theory , New York: Academic Press, 2nd edition.
  • Byerly, H.C. and V.A. Lazara, 1973, “Realist Foundations of Measurement”, Philosophy of Science , 40(1): 10–28.
  • Campbell, N.R., 1920, Physics: the Elements , London: Cambridge University Press.
  • Campbell, D.T. and D.W. Fiske, 1959, “Convergent and discriminant validation by the multitrait-multimethod matrix”, Psychological Bulletin , 56(2): 81–105.
  • Cantù, P. and O. Schlaudt (eds.), 2013, “The Epistemological Thought of Otto Hölder”, special issue of Philosophia Scientiæ , 17(1).
  • Carnap, R., 1966, Philosophical foundations of physics , G. Martin (ed.), reprinted as An Introduction to the Philosophy of Science , NY: Dover, 1995.
  • Carrier, M., 1994, The Completeness of Scientific Theories: On the Derivation of Empirical Indicators Within a Theoretical Framework: the Case of Physical Geometry , The University of Western Ontario Series in Philosophy of Science Vol. 53, Dordrecht: Kluwer.
  • Cartwright, N.L., 1999, The Dappled World: A Study of the Boundaries of Science , Cambridge: Cambridge University Press.
  • Cartwright, N.L. and R. Runhardt, 2014, “Measurement”, in N.L. Cartwright and E. Montuschi (eds.), Philosophy of Social Science: A New Introduction , Oxford: Oxford University Press, pp. 265–287.
  • Chang, H., 2001, “Spirit, air, and quicksilver: The search for the ‘real’ scale of temperature”, Historical Studies in the Physical and Biological Sciences , 31(2): 249–284.
  • –––, 2004, Inventing Temperature: Measurement and Scientific Progress , Oxford: Oxford University Press.
  • –––, 2007, “Scientific Progress: Beyond Foundationalism and Coherentism”, Royal Institute of Philosophy Supplement , 61: 1–20.
  • –––, 2009, “Operationalism”, The Stanford Encyclopedia of Philosophy (Fall 2009 Edition), E.N. Zalta (ed.), URL= < https://plato.stanford.edu/archives/fall2009/entries/operationalism/ >
  • Chang, H. and N.L. Cartwright, 2008, “Measurement”, in The Routledge Companion to Philosophy of Science , S. Psillos and M. Curd (eds.), New York: Routledge, pp. 367–375.
  • Clagett, M., 1968, Nicole Oresme and the medieval geometry of qualities and motions , Madison: University of Wisconsin Press.
  • Cohen, M.R. and E. Nagel, 1934, An introduction to logic and scientific method , New York: Harcourt, Brace & World.
  • Crease, R.P., 2011, World in the Balance: The Historic Quest for an Absolute System of Measurement , New York and London: W.W. Norton.
  • Darrigol, O., 2003, “Number and measure: Hermann von Helmholtz at the crossroads of mathematics, physics, and psychology”, Studies in History and Philosophy of Science Part A , 34(3): 515–573.
  • de Courtenay, N., O. Darrigol, and O. Schlaudt (eds.), 2019, The Reform of the International System of Units (SI): Philosophical, Historical and Sociological Issues , London and New York: Routledge.
  • de Courtenay, N. and F. Grégis, 2017, “The evaluation of measurement uncertainties and its epistemological ramifications”, Studies in History and Philosophy of Science (Part A), 65: 21–32.
  • Diehl, C.E., 2012, The Theory of Intensive Magnitudes in Leibniz and Kant , Ph.D. Dissertation, Princeton University. [ Diehl 2012 available online ]
  • Diez, J.A., 1997a, “A Hundred Years of Numbers. An Historical Introduction to Measurement Theory 1887–1990—Part 1”, Studies in History and Philosophy of Science , 28(1): 167–185.
  • –––, 1997b, “A Hundred Years of Numbers. An Historical Introduction to Measurement Theory 1887–1990—Part 2”, Studies in History and Philosophy of Science , 28(2): 237–265.
  • Dingle, H., 1950, “A Theory of Measurement”, The British Journal for the Philosophy of Science , 1(1): 5–26.
  • Duhem, P., 1906, The Aim and Structure of Physical Theory , P.P. Wiener (trans.), New York: Atheneum, 1962.
  • Ellis, B., 1966, Basic Concepts of Measurement , Cambridge: Cambridge University Press.
  • Euclid, Elements , in The Thirteen Books of Euclid’s Elements , T.L. Heath (trans.), Cambridge: Cambridge University Press, 1908.
  • Fechner, G., 1860, Elements of Psychophysics , H.E. Adler (trans.), New York: Holt, Reinhart & Winston, 1966.
  • Feest, U., 2005, “Operationism in Psychology: What the Debate Is About, What the Debate Should Be About”, Journal of the History of the Behavioral Sciences , 41(2): 131–149.
  • –––, 2020, “Construct Validity in Psychological Tests–the Case of Implicit Social Cognition”, European Journal for Philosophy of Science , 10(1): 4.
  • Ferguson, A., C.S. Myers, R.J. Bartlett, H. Banister, F.C. Bartlett, W. Brown, N.R. Campbell, K.J.W. Craik, J. Drever, J. Guild, R.A. Houstoun, J.O. Irwin, G.W.C. Kaye, S.J.F. Philpott, L.F. Richardson, J.H. Shaxby, T. Smith, R.H. Thouless, and W.S. Tucker, 1940, “Quantitative estimates of sensory events”, Advancement of Science , 2: 331–349. (The final report of a committee appointed by the British Association for the Advancement of Science in 1932 to consider the possibility of measuring intensities of sensation. See Michell 1999, Ch 6. for a detailed discussion.)
  • Finkelstein, L., 1975, “Representation by symbol systems as an extension of the concept of measurement”, Kybernetes , 4(4): 215–223.
  • –––, 1977, “Introductory article”, (instrument science), Journal of Physics E: Scientific Instruments , 10(6): 566–572.
  • Frank, P.G. (ed.), 1956, The Validation of Scientific Theories . Boston: Beacon Press. (Chapter 2, “The Present State of Operationalism” contains papers by H. Margenau, G. Bergmann, C.G. Hempel, R.B. Lindsay, P.W. Bridgman, R.J. Seeger, and A. Grünbaum)
  • Franklin, A., 1986, The Neglect of Experiment , Cambridge: Cambridge University Press.
  • –––, 1997, “Calibration”, Perspectives on Science , 5(1): 31–80.
  • Franklin, A., M. Anderson, D. Brock, S. Coleman, J. Downing, A. Gruvander, J. Lilly, J. Neal, D. Peterson, M. Price, R. Rice, L. Smith, S. Speirer, and D. Toering, 1989, “Can a Theory-Laden Observation Test the Theory?”, The British Journal for the Philosophy of Science , 40(2): 229–231.
  • Frigerio, A., A. Giordani, and L. Mari, 2010, “Outline of a general model of measurement”, Synthese , 175(2): 123–149.
  • Galison, P., 2003, Einstein’s Clocks, Poincaré’s Maps: Empires of Time , New York and London: W.W. Norton.
  • Gillies, D.A., 1972, “Operationalism”, Synthese , 25(1): 1–24.
  • Giordani, A., and L. Mari, 2012, “Measurement, models, and uncertainty”, IEEE Transactions on Instrumentation and Measurement , 61(8): 2144–2152.
  • Gooday, G., 2004, The Morals of Measurement: Accuracy, Irony and Trust in Late Victorian Electrical Practice , Cambridge: Cambridge University Press.
  • Grant, E., 1996, The foundations of modern science in the middle ages , Cambridge: Cambridge University Press.
  • Grattan-Guinness, I., 1996, “Numbers, magnitudes, ratios, and proportions in Euclid’s Elements: How did he handle them?”, Historia Mathematica , 23: 355–375.
  • Grégis, F., 2015, “Can We Dispense with the Notion of ‘True Value’ in Metrology?”, in Schlaudt and Huber 2015, 81–93.
  • Guala, F., 2008, “Paradigmatic Experiments: The Ultimatum Game from Testing to Measurement Device”, Philosophy of Science , 75: 658–669.
  • Hacking, I, 1983, Representing and Intervening , Cambridge: Cambridge University Press.
  • Harré, R., 1981, Great Scientific Experiments: Twenty Experiments that Changed our View of the World , Oxford: Phaidon Press.
  • Hartley, R.V., 1928, “Transmission of information”, Bell System technical journal , 7(3): 535–563.
  • Heidelberger, M., 1993a, Nature from Within: Gustav Theodore Fechner and His Psychophysical Worldview , C. Klohr (trans.), Pittsburgh: University of Pittsburgh Press, 2004.
  • –––, 1993b, “Fechner’s impact for measurement theory”, commentary on D.J. Murray, “A perspective for viewing the history of psychophysics”, Behavioural and Brain Sciences , 16(1): 146–148.
  • von Helmholtz, H., 1887, Counting and measuring , C.L. Bryan (trans.), New Jersey: D. Van Nostrand, 1930.
  • Hempel, C.G., 1952, Fundamentals of concept formation in empirical science , International Encyclopedia of Unified Science, Vol. II. No. 7, Chicago and London: University of Chicago Press.
  • –––, 1956, “A logical appraisal of operationalism”, in Frank 1956: 52–67.
  • –––, 1966, Philosophy of Natural Science , Englewood Cliffs, N.J.: Prentice-Hall.
  • Hölder, O., 1901, “Die Axiome der Quantität und die Lehre vom Mass”, Berichte über die Verhandlungen der Königlich Sächsischen Gesellschaft der Wissenschaften zu Leipzig, Mathematische-Physische Klasse , 53: 1–64. (for an excerpt translated into English see Michell and Ernst 1996)
  • Hood, S.B., 2009, “Validity in Psychological Testing and Scientific Realism”, Theory & Psychology , 19(4): 451–473.
  • –––, 2013, “Psychological Measurement and Methodological Realism”, Erkenntnis , 78(4): 739–761.
  • Hoover, K. and M. Dowell, 2001, “Measuring Causes: Episodes in the Quantitative Assessment of the Value of Money”, in The Age of Economic Measurement (Supplement to History of Political Economy : Volume 33), J. Klein and M. Morgan (eds.), pp. 137–161.
  • Isaac, A.M. C., 2017, “Hubris to Humility: Tonal Volume and the Fundamentality of Psychophysical Quantities”, Studies in History and Philosophy of Science (Part A), 65–66: 99–111.
  • Israel-Jost, V., 2011, “The Epistemological Foundations of Scientific Observation”, South African Journal of Philosophy , 30(1): 29–40.
  • JCGM (Joint Committee for Guides in Metrology), 2012, International Vocabulary of Metrology—Basic and general concepts and associated terms (VIM), 3rd edition with minor corrections, Sèvres: JCGM. [ JCGM 2012 available online ]
  • Jorgensen, L.M., 2009, “The Principle of Continuity and Leibniz’s Theory of Consciousness”, Journal of the History of Philosophy , 47(2): 223–248.
  • Jung, E., 2011, “Intension and Remission of Forms”, in Encyclopedia of Medieval Philosophy , H. Lagerlund (ed.), Netherlands: Springer, pp. 551–555.
  • Kant, I., 1787, Critique of Pure Reason , P. Guyer and A.W. Wood (trans.), Cambridge: Cambridge University Press, 1998.
  • Kirpatovskii, S.I., 1974, “Principles of the information theory of measurements”, Izmeritel’naya Tekhnika , 5: 11–13, English translation in Measurement Techniques , 17(5): 655–659.
  • Krantz, D.H., R.D. Luce, P. Suppes, and A. Tversky, 1971, Foundations of Measurement Vol 1: Additive and Polynomial Representations , San Diego and London: Academic Press. (For references to the two other volumes see Suppes et al. 1989 and Luce et al. 1990.)
  • von Kries, J., 1882, “Über die Messung intensiver Grösse und über das sogenannte psychophysische Gesetz”, Vierteljahrschrift für wissenschaftliche Philosophie (Leipzig), 6: 257–294.
  • Kuhn, T.S., 1961, “The Function of Measurement in Modern Physical Sciences”, Isis , 52(2): 161–193.
  • Kyburg, H.E. Jr., 1984, Theory and Measurement , Cambridge: Cambridge University Press.
  • Latour, B., 1987, Science in Action , Cambridge: Harvard University Press.
  • Leplège, A., 2003, “Epistemology of Measurement in the Social Sciences: Historical and Contemporary Perspectives”, Social Science Information , 42: 451–462.
  • Luce, R.D., D.H. Krantz, P. Suppes, and A. Tversky, 1990, Foundations of Measurement (Volume 3: Representation, Axiomatization, and Invariance), San Diego and London: Academic Press. (For references to the two other volumes see Krantz et al. 1971 and Suppes et al. 1989.)
  • Luce, R.D., and J.W. Tukey, 1964, “Simultaneous conjoint measurement: A new type of fundamental measurement”, Journal of mathematical psychology , 1(1): 1–27.
  • Luce, R.D. and P. Suppes, 2004, “Representational Measurement Theory”, in Stevens’ Handbook of Experimental Psychology (Volume 4: Methodology in Experimental Psychology), J. Wixted and H. Pashler (eds.), New York: Wiley, 3rd edition, pp. 1–41.
  • Lusk, G., 2016, “Computer simulation and the features of novel empirical data”, Studies in History and Philosophy of Science Part A , 56: 145–152.
  • Mach, E., 1896, Principles of the Theory of Heat , T.J. McCormack (trans.), Dordrecht: D. Reidel, 1986.
  • Mari, L., 1999, “Notes towards a qualitative analysis of information in measurement results”, Measurement , 25(3): 183–192.
  • –––, 2000, “Beyond the representational viewpoint: a new formalization of measurement”, Measurement , 27: 71–84.
  • –––, 2003, “Epistemology of Measurement”, Measurement , 34: 17–30.
  • –––, 2005a, “The problem of foundations of measurement”, Measurement , 38: 259–266.
  • –––, 2005b, “Models of the Measurement Process”, in Handbook of Measuring Systems Design , vol. 2, P. Sydenham and R. Thorn (eds.), Wiley, Ch. 104.
  • Mari, L., and M. Wilson, 2014, “An introduction to the Rasch measurement approach for metrologists”, Measurement , 51: 315–327.
  • Mari, L. and A. Giordani, 2013, “Modeling measurement: error and uncertainty”, in Error and Uncertainty in Scientific Practice , M. Boumans, G. Hon, and A. Petersen (eds.), Ch. 4.
  • Maxwell, J.C., 1873, A Treatise on Electricity and Magnetism , Oxford: Clarendon Press.
  • McClimans, L., 2010, “A theoretical framework for patient-reported outcome measures”, Theoretical Medicine and Bioethics , 31: 225–240.
  • –––, 2017, “Psychological Measures, Risk, and Values”, In Measurement in Medicine: Philosophical Essays on Assessment and Evaluation , L. McClimans (ed.), London and New York: Rowman & Littlefield, 89–106.
  • McClimans, L. and P. Browne, 2012, “Quality of life is a process not an outcome”, Theoretical Medicine and Bioethics , 33: 279–292.
  • McClimans, L., J. Browne, and S. Cano, 2017, “Clinical Outcome Measurement: Models, Theory, Psychometrics and Practice”, Studies in History and Philosophy of Science (Part A), 65: 67–73.
  • Michel, M., 2019, “The Mismeasure of Consciousness: A Problem of Coordination for the Perceptual Awareness Scale”, Philosophy of Science , 86(5): 1239–49.
  • Michell, J., 1993, “The origins of the representational theory of measurement: Helmholtz, Hölder, and Russell”, Studies in History and Philosophy of Science (Part A), 24(2): 185–206.
  • –––, 1994, “Numbers as Quantitative Relations and the Traditional Theory of Measurement”, British Journal for the Philosophy of Science , 45: 389–406.
  • –––, 1999, Measurement in Psychology: Critical History of a Methodological Concept , Cambridge: Cambridge University Press.
  • –––, 2000, “Normal science, pathological science and psychometrics”, Theory & Psychology , 10: 639–667.
  • –––, 2003, “Epistemology of Measurement: the Relevance of its History for Quantification in the Social Sciences”, Social Science Information , 42(4): 515–534.
  • –––, 2004a, “History and philosophy of measurement: A realist view”, in Proceedings of the 10th IMEKO TC7 International symposium on advances of measurement science , [ Michell 2004 available online ]
  • –––, 2004b, “Item response models, pathological science and the shape of error: Reply to Borsboom and Mellenbergh”, Theory & Psychology , 14: 121–129.
  • –––, 2005, “The logic of measurement: A realist overview”, Measurement , 38(4): 285–294.
  • Michell, J. and C. Ernst, 1996, “The Axioms of Quantity and the Theory of Measurement”, Journal of Mathematical Psychology , 40: 235–252. (This article contains a translation into English of a long excerpt from Hölder 1901.)
  • Mitchell, D.J., E. Tal, and H. Chang, 2017, “The Making of Measurement: Editors’ Introduction.” Studies in History and Philosophy of Science (Part A), 65–66: 1–7.
  • Miyake, T., 2017, “Uncertainty and Modeling in Seismology”, in Mößner & Nordmann (eds.) 2017, 232–244.
  • Morgan, M., 2001, “Making measuring instruments”, in The Age of Economic Measurement (Supplement to History of Political Economy : Volume 33), J.L. Klein and M. Morgan (eds.), pp. 235–251.
  • Morgan, M. and M. Morrison (eds.), 1999, Models as Mediators: Perspectives on Natural and Social Science , Cambridge: Cambridge University Press.
  • Morrison, M., 1999, “Models as Autonomous Agents”, in Morgan and Morrison 1999: 38–65.
  • –––, 2009, “Models, measurement and computer simulation: the changing face of experimentation”, Philosophical Studies , 143: 33–57.
  • Morrison, M. and M. Morgan, 1999, “Models as Mediating Instruments”, in Morgan and Morrison 1999: 10–37.
  • Mößner, N. and A. Nordmann (eds.), 2017, Reasoning in Measurement , London and New York: Routledge.
  • Mundy, B., 1987, “The metaphysics of quantity”, Philosophical Studies , 51(1): 29–54.
  • Nagel, E., 1931, “Measurement”, Erkenntnis , 2(1): 313–333.
  • Narens, L., 1981, “On the scales of measurement”, Journal of Mathematical Psychology , 24: 249–275.
  • –––, 1985, Abstract Measurement Theory , Cambridge, MA: MIT Press.
  • Nunnally, J.C., and I.H. Bernstein, 1994, Psychometric Theory , New York: McGraw-Hill, 3rd edition.
  • Padovani, F., 2015, “Measurement, Coordination, and the Relativized a Priori”, Studies in History and Philosophy of Science (Part B: Studies in History and Philosophy of Modern Physics), 52: 123–28.
  • –––, 2017, “Coordination and Measurement: What We Get Wrong About What Reichenbach Got Right”, In M. Massimi, J.W. Romeijn, and G. Schurz (eds.), EPSA15 Selected Papers (European Studies in Philosophy of Science), Cham: Springer International Publishing, 49–60.
  • Parker, W., 2017, “Computer Simulation, Measurement, and Data Assimilation”, British Journal for the Philosophy of Science , 68(1): 273–304.
  • Poincaré, H., 1898, “The Measure of Time”, in The Value of Science , New York: Dover, 1958, pp. 26–36.
  • –––, 1902, Science and Hypothesis , W.J. Greenstreet (trans.), New York: Cosimo, 2007.
  • Porter, T.M., 1995, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life , New Jersey: Princeton University Press.
  • –––, 2007, “Precision”, in Boumans 2007b: 343–356.
  • Rasch, G., 1960, Probabilistic Models for Some Intelligence and Achievement Tests , Copenhagen: Danish Institute for Educational Research.
  • Reiss, J., 2001, “Natural Economic Quantities and Their Measurement”, Journal of Economic Methodology , 8(2): 287–311.
  • Riordan, S., 2015, “The Objectivity of Scientific Measures”, Studies in History and Philosophy of Science (Part A), 50: 38–47.
  • Reichenbach, H., 1927, The Philosophy of Space and Time , New York: Dover Publications, 1958.
  • Rothbart, D. and S.W. Slayden, 1994, “The Epistemology of a Spectrometer”, Philosophy of Science , 61: 25–38.
  • Russell, B., 1903, The Principles of Mathematics , New York: W.W. Norton.
  • Savage, C.W. and P. Ehrlich, 1992, “A brief introduction to measurement theory and to the essays”, in Philosophical and Foundational Issues in Measurement Theory , C.W. Savage and P. Ehrlich (eds.), New Jersey: Lawrence Erlbaum, pp. 1–14.
  • Schaffer, S., 1992, “Late Victorian metrology and its instrumentation: a manufactory of Ohms”, in Invisible Connections: Instruments, Institutions, and Science , R. Bud and S.E. Cozzens (eds.), Cardiff: SPIE Optical Engineering, pp. 23–56.
  • Schlaudt, O. and Huber, L. (eds.), 2015, Standardization in Measurement: Philosophical, Historical and Sociological Issues , London and New York: Routledge.
  • Scott, D. and P. Suppes, 1958, “Foundational aspects of theories of measurement”, Journal of Symbolic Logic , 23(2): 113–128.
  • Shannon, C.E., 1948, “A Mathematical Theory of Communication”, The Bell System Technical Journal , 27: 379–423 and 623–656.
  • Shannon, C.E. and W. Weaver, 1949, A Mathematical Theory of Communication , Urbana: The University of Illinois Press.
  • Shapere, D., 1982, “The Concept of Observation in Science and Philosophy”, Philosophy of Science , 49(4): 485–525.
  • Skinner, B.F., 1945, “The operational analysis of psychological terms”, in Boring et al. 1945: 270–277.
  • Soler, L., F. Wieber, C. Allamel-Raffin, J.L. Gangloff, C. Dufour, and E. Trizio, 2013, “Calibration: A Conceptual Framework Applied to Scientific Practices Which Investigate Natural Phenomena by Means of Standardized Instruments”, Journal for General Philosophy of Science , 44(2): 263–317.
  • Staley, K. W., 2020, “Securing the empirical value of measurement results”, The British Journal for the Philosophy of Science , 71(1): 87–113.
  • Stegenga, J., 2018, Medical Nihilism , Oxford: Oxford University Press.
  • Stevens, S.S., 1935, “The operational definition of psychological concepts”, Psychological Review , 42(6): 517–527.
  • –––, 1946, “On the theory of scales of measurement”, Science , 103: 677–680.
  • –––, 1951, “Mathematics, Measurement, Psychophysics”, in Handbook of Experimental Psychology , S.S. Stevens (ed.), New York: Wiley & Sons, pp. 1–49.
  • –––, 1959, “Measurement, psychophysics and utility”, in Measurement: Definitions and Theories , C.W. Churchman and P. Ratoosh (eds.), New York: Wiley & Sons, pp. 18–63.
  • –––, 1975, Psychophysics: Introduction to Its Perceptual, Neural and Social Prospects , New York: Wiley & Sons.
  • Suppes, P., 1951, “A set of independent axioms for extensive quantities”, Portugaliae Mathematica , 10(4): 163–172.
  • –––, 1960, “A Comparison of the Meaning and Uses of Models in Mathematics and the Empirical Sciences”, Synthese , 12(2): 287–301.
  • –––, 1962, “Models of Data”, in Logic, methodology and philosophy of science: proceedings of the 1960 International Congress , E. Nagel (ed.), Stanford: Stanford University Press, pp. 252–261.
  • –––, 1967, “What is a Scientific Theory?”, in Philosophy of Science Today , S. Morgenbesser (ed.), New York: Basic Books, pp. 55–67.
  • Suppes, P., D.H. Krantz, R.D. Luce, and A. Tversky, 1989, Foundations of Measurement Vol 2: Geometrical, Threshold and Probabilistic Representations , San Diego and London: Academic Press. (For references to the two other volumes see Krantz et al. 1971 and Luce et al. 1990.)
  • Swoyer, C., 1987, “The Metaphysics of Measurement”, in Measurement, Realism and Objectivity , J. Forge (ed.), Reidel, pp. 235–290.
  • Sylla, E., 1971, “Medieval quantifications of qualities: The ‘Merton School’”, Archive for history of exact sciences , 8(1): 9–39.
  • Tabor, D., 1970, “The hardness of solids”, Review of Physics in Technology , 1(3): 145–179.
  • Tal, E., 2011, “How Accurate Is the Standard Second?”, Philosophy of Science , 78(5): 1082–96.
  • –––, 2013, “Old and New Problems in Philosophy of Measurement”, Philosophy Compass , 8(12): 1159–1173.
  • –––, 2016a, “Making Time: A Study in the Epistemology of Measurement”, British Journal for the Philosophy of Science , 67(1): 297–335.
  • –––, 2016b, “How Does Measuring Generate Evidence? The Problem of Observational Grounding”, Journal of Physics: Conference Series , 772: 012001.
  • –––, 2017a, “A Model-Based Epistemology of Measurement”, in Mößner & Nordmann (eds.) 2017, 233–253.
  • –––, 2017b, “Calibration: Modelling the Measurement Process”, Studies in History and Philosophy of Science (Part A), 65: 33–45.
  • –––, 2018, “Naturalness and Convention in the International System of Units”, Measurement , 116: 631–643.
  • Teller, P., 2013, “The concept of measurement-precision”, Synthese , 190: 189–202.
  • –––, 2018, “Measurement Accuracy Realism”, in I. Peschard and B.C. van Fraassen (eds.), The Experimental Side of Modeling , Minneapolis: University of Minnesota Press, 273–98.
  • Thomson, W., 1889, “Electrical Units of Measurement”, in Popular Lectures and Addresses (Volume 1), London: MacMillan, pp. 73–136.
  • Trout, J.D., 1998, Measuring the intentional world: Realism, naturalism, and quantitative methods in the behavioral sciences , Oxford: Oxford University Press.
  • –––, 2000, “Measurement”, in A Companion to the Philosophy of Science , W.H. Newton-Smith (ed.), Malden, MA: Blackwell, pp. 265–276.
  • van Fraassen, B.C., 1980, The Scientific Image , Oxford: Clarendon Press.
  • –––, 2008, Scientific Representation: Paradoxes of Perspective , Oxford: Oxford University Press.
  • –––, 2009, “The perils of Perrin, in the hands of philosophers”, Philosophical Studies , 143: 5–24.
  • –––, 2012, “Modeling and Measurement: The Criterion of Empirical Grounding”, Philosophy of Science , 79(5): 773–784.
  • Vessonen, E., 2019, “Operationalism and Realism in Psychometrics”, Philosophy Compass , 14(10): e12624.
  • –––, 2020, “The Complementarity of Psychometrics and the Representational Theory of Measurement”, The British Journal for the Philosophy of Science , 71(2): 415–442.
  • Wilson, M., 2013, “Using the concept of a measurement system to characterize measurement models used in psychometrics”, Measurement , 46(9): 3766–3774.
  • Wise, M.N. (ed.), 1995, The Values of Precision , NJ: Princeton University Press.
  • Wise, M.N. and C. Smith, 1986, “Measurement, Work and Industry in Lord Kelvin’s Britain”, Historical Studies in the Physical and Biological Sciences , 17(1): 147–173.
  • Wolff, J. E., 2020a, The Metaphysics of Quantities , Oxford: Oxford University Press.
  • –––, 2020b, “Heaps of Moles? – Mediating Macroscopic and Microscopic Measurement of Chemical Substances”, Studies in History and Philosophy of Science (Part A), 80: 19–27.
Other Internet Resources

  • Bradburn, N.M., Cartwright, N.L., and Fuller, J., 2016, “ A Theory of Measurement ”, CHESS Working Paper No. 2016-07 (Centre for Humanities Engaging Science and Society), Durham University. (A summary of this paper appears in R.M. Li (ed.), The Importance of Common Metrics for Advancing Social Science Theory and Research: A Workshop Summary , Washington, DC: National Academies Press, 2011, pp. 53–70.)
  • Openly accessible guides to metrological terms and methods by the International Bureau of Weights and Measures (BIPM)
  • Bibliography on measurement in science at PhilPapers.

Related Entries

Duhem, Pierre | economics: philosophy of | empiricism: logical | Helmholtz, Hermann von | Mach, Ernst | models in science | operationalism | physics: experiment in | Poincaré, Henri | quantum theory: philosophical issues in | Reichenbach, Hans | science: theory and observation in | scientific objectivity | Vienna Circle

Acknowledgments

The author would like to thank Stephan Hartmann, Wendy Parker, Paul Teller, Alessandra Basso, Sally Riordan, Jo Wolff, Conrad Heilmann, and participants of the History and Philosophy of Physics reading group at the Department of History and Philosophy of Science at the University of Cambridge for helpful feedback on drafts of this entry. The author is also indebted to Joel Michell and Oliver Schliemann for useful bibliographical advice, and to John Wiley and Sons Publishers for permission to reproduce an excerpt from Tal (2013). Work on this entry was supported by an Alexander von Humboldt Postdoctoral Research Fellowship and a Marie Curie Intra-European Fellowship within the 7th European Community Framework Programme. Work on the 2020 revision of this entry was supported by an FRQSC New Academic grant, a Healthy Brains for Healthy Lives Knowledge Mobilization grant, and funding from the Canada Research Chairs program.

Copyright © 2020 by Eran Tal <eran.tal@mcgill.ca>



Velentgas P, Dreyer NA, Nourjah P, et al., editors. Developing a Protocol for Observational Comparative Effectiveness Research: A User's Guide. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Jan.


Chapter 6 Outcome Definition and Measurement

Priscilla Velentgas, PhD, Nancy A. Dreyer, MPH, PhD, and Albert W. Wu, MD, MPH.

This chapter provides an overview of considerations for the development of outcome measures for observational comparative effectiveness research (CER) studies, describes the implications of proposed outcomes for study design, and enumerates issues of bias that may arise when incorporating outcome ascertainment into observational research, along with means of evaluating, preventing, and/or reducing these biases. Clear and objective outcome definitions that correspond to the nature of the hypothesized treatment effect and address the research questions of interest, together with validation of outcomes or use of standardized patient-reported outcome (PRO) instruments validated for the population of interest, contribute to the internal validity of observational CER studies. Outcome data must also be collected in an equivalent manner across treatment comparison groups. Analytic methods suited to the outcome measure, and sensitivity analyses addressing varying definitions of at least the primary study outcomes, are needed to draw robust and reliable inferences. The chapter concludes with a checklist of guidance and key considerations for outcome determination and definitions for observational CER protocols.

  • Introduction

The selection of outcomes to include in observational comparative effectiveness research (CER) studies involves the consideration of multiple stakeholder viewpoints (provider, patient, payer, regulatory, industry, academic, and societal) and the intended use of the resulting evidence for decisionmaking. It also depends on the level of funding and scope of the study. These studies may focus on clinical outcomes, such as recurrence-free survival from cancer or coronary heart disease mortality; general health-related quality of life measures, such as the EQ-5D and the SF-36; disease-specific scales, such as the Uterine Fibroid Symptom and Quality of Life questionnaire (UFS-QOL); and/or health resource utilization or cost measures. As with other experimental and observational research studies, the hypotheses or study questions of interest must be translated into one or more specific outcomes with clear definitions.

The choice of outcomes to include in a CER study will in turn drive other important design considerations such as the data source(s) from which the required information can be obtained (see chapter 8 ), the frequency and length of followup assessments to be included in the study following initial treatment, and the sample size, which is influenced by the expected frequency of the outcome in addition to the magnitude of relative treatment effects and scale of measurement.

In this chapter, we provide an overview of types of outcomes (with emphasis on those most relevant to observational CER studies); considerations in defining outcomes; the process of outcome ascertainment, measurement and validation; design and analysis considerations; and means to evaluate and address bias that may arise.

  • Conceptual Models of Health Outcomes

In considering the range of health outcomes that may be of interest to patients, health care providers, and other decisionmakers, key areas of focus are medical conditions, impact on health-related or general quality of life, and resource utilization. To address the interrelationships of these outcomes, some conceptual models have been put forth by researchers with a particular focus on health outcomes studies. Two such models are described here.

Wilson and Cleary proposed a conceptual model or taxonomy integrating concepts of biomedical patient outcomes and measures of health-related quality of life. The taxonomy is divided into five levels: biological and physiological factors, symptoms, functioning, general health perceptions, and overall quality of life. 1 The authors discuss causal relationships between traditional clinical variables and measures of quality of life that address the complex interactions of biological and societal factors on health status, as summarized in Table 6.1 .

Table 6.1. Wilson and Cleary's taxonomy of biomedical and health-related quality of life outcomes.

An alternative model, the ECHO (Economic, Clinical, Humanistic Outcomes) Model, was developed for planning health outcomes and pharmacoeconomic studies, and goes a step further than the Wilson and Cleary model in incorporating costs and economic outcomes and their interrelationships with clinical and humanistic outcomes ( Figure 6.1 ). 2 The ECHO model does not explicitly incorporate characteristics of the patient as an individual or psychosocial factors to the extent that the Wilson and Cleary model does, however.

Figure 6.1. The ECHO model. See Kozma CM, Reeder CE, Schultz RM. Economic, clinical, and humanistic outcomes: a planning model for pharmacoeconomic research. Clin Ther. 1993;15(6):1121-32. This figure is copyrighted by Elsevier Inc. and reprinted with permission.

As suggested by the complex interrelationships between different levels and types of health outcomes, different terminology and classifications may be used, and there are areas of overlap between the major categories of outcomes important to patients. In this chapter, we will discuss outcomes according to the broad categories of clinical, humanistic, and economic and utilization outcome measures.

  • Outcome Measurement Properties

The properties of outcome measures that are an integral part of an investigator's evaluation and selection of appropriate measures include reliability, validity, and variability. Reliability is the degree to which a score or other measure remains unchanged upon test and retest (when no change is expected), or across different interviewers or assessors; it is measured by statistics including kappa and inter- or intra-class correlation coefficients. Validity, broadly speaking, is the degree to which a measure assesses what it is intended to measure. Types of validity include face validity (the degree to which users or experts perceive that a measure is assessing what it is intended to measure), content validity (the extent to which a measure accurately and comprehensively measures what it is intended to measure), and construct validity (the degree to which an instrument accurately measures a nonphysical attribute or construct, such as depression or anxiety, that is itself a means of summarizing or explaining different aspects of the entity being measured). 3 Variability usually refers to the distribution of values associated with an outcome measure in the population of interest, with a broader distribution or range of values said to show more variability.
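To make the agreement statistics concrete, the following minimal sketch (in Python; the function and ratings are illustrative, not from this chapter) computes Cohen's kappa for two raters classifying the same patients:

```python
# A minimal sketch of Cohen's kappa for two raters classifying the same
# patients on a categorical outcome. The example ratings are invented.

def cohens_kappa(ratings_a, ratings_b):
    n = len(ratings_a)
    # Observed proportion of agreement between the two raters
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    categories = set(ratings_a) | set(ratings_b)
    expected = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Two raters classifying 10 patients as case (1) or non-case (0)
rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
print(cohens_kappa(rater_a, rater_b))  # 0.6: agreement beyond chance
```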

Responsiveness is another property usually discussed in the context of patient-reported outcomes (PROs) but extendable to other measures, representing the ability of a measure to detect change in an individual over time.

These measurement properties may affect the degree of measurement error or misclassification that an outcome measure is subject to, with the consideration that the properties themselves are specific to the population and setting in which the measures are used. Issues of misclassification and considerations in reducing this type of error are discussed further in the section on “avoidance of bias in study design.”

  • Clinical Outcomes

Clinical outcomes are perhaps the most common category of outcome to be considered in CER studies. Medical treatments are developed and must demonstrate efficacy in preapproval clinical trials to prevent the occurrence of undesirable outcomes such as coronary events, osteoporosis, or death; to delay disease progression such as in rheumatoid arthritis; to hasten recovery or improve survival from disease, such as in cancer or H5N1 influenza; or to manage or reduce the burden of chronic diseases including diabetes, psoriasis, Parkinson's disease, and depression. Postapproval observational CER studies are often needed to compare newer treatments against the standard of care; to obtain real-world data on effectiveness as treatments are used in different medical care settings and broader patient populations than those studied in clinical trials; and to increase understanding of the relative benefits and risks of treatments by weighing quality of life, cost, and safety outcomes alongside clinical benefits. For observational studies, this category of outcome generally focuses on clinically meaningful outcomes such as time between disease flares; number of swollen, inflamed joints; or myocardial infarction. Feasibility considerations sometimes dictate the use of intermediate endpoints, which are discussed in further detail later in the chapter.

Definitions of Clinical Outcomes

Temporal aspects.

The nature of the disease state to be treated, the mechanism, and the intended effect of the treatment under study determine whether the clinical outcomes to be identified are incident (a first or new diagnosis of the condition of interest), prevalent (existing disease), or recurrent (new occurrence or exacerbation of disease in a patient who has a previous diagnosis of that condition). The disease of interest may be chronic (a long-term or permanent condition), acute (a condition with a clearly identifiable and rapid onset), transient (a condition that comes and goes), or episodic (a condition that comes and goes in episodes), or have more than one of these aspects.

Subjective Versus Objective Assessments

Most clinical outcomes involve a diagnosis or assessment by a health care provider. These may be recorded in a patient's medical record as part of routine care, coded as part of an electronic health record (EHR) or administrative billing system using coding systems such as ICD-9 or ICD-10, or collected specifically for a given study.

While there are varying degrees of subjectivity involved in most assessments by health care providers, objective measures are those that are not subject to a large degree of individual interpretation, and are likely to be reliably measured across patients in a study, by different health care providers, and over time. Laboratory tests may be considered objective measures in most cases and can be incorporated as part of a standard outcome definition to be used for a study when appropriate. Some clinical outcomes, such as all-cause mortality, can be ascertained directly and may be more reliable than measures that are subject to interpretation by individual health care providers, such as angina or depression.

Instruments have been developed to help standardize the assessment of some conditions for which a subjective clinical assessment might introduce unwanted variability. Consider the example of a study of a new psoriasis treatment. Psoriasis is a chronic skin condition that causes lesions affecting varying amounts of body surface area, with varying degrees of severity. While a physician may be able to assess improvement within an individual patient, a quantifiable measure that is reproducible across patients and raters improves the information value of comparative trials and observational studies of psoriasis treatment effectiveness. An outcome assessment that relies on purely subjective ratings of improvement, such as "Has the patient's condition improved a lot, a little, or not at all?", is vulnerable to measurement error arising from subjective judgments or disagreement among clinicians about what comprises the individual categories and how to rate them, often resulting in low reproducibility or inter-rater reliability of the measure.

In the psoriasis example, an improved measure of the outcome would be a standardized assessment of the severity and extent of disease expressed as percentage of affected body surface area, such as the Psoriasis Area and Severity Index (PASI) score. 4 The PASI requires rating the severity of target symptoms [erythema (E), infiltration (I), and desquamation (D)] and the area of psoriatic involvement (A) for each of four main body areas [head (h), trunk (t), upper extremities (e), lower extremities (l)]. Target symptom severity is rated on a 0–4 scale; area of psoriatic involvement is rated on a 0–6 scale, with each numerical value representing a percentage of area involvement. 4 The final calculated score ranges from 0 (no disease) to 72 (severe disease), with the score contribution of each body area weighted by its percentage of total body area (10, 20, 30, and 40% for head, upper extremities, trunk, and lower extremities, respectively). 4 Compared with subjective clinician assessment of overall improvement, using changes in the PASI score increases reproducibility and comparability across studies that use the score.
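As a worked illustration of the PASI arithmetic just described (a sketch only; the ratings are invented, and real scoring is done on validated assessment forms):

```python
# A worked sketch of the PASI calculation summarized above.

REGION_WEIGHTS = {"head": 0.1, "upper": 0.2, "trunk": 0.3, "lower": 0.4}

def pasi(ratings):
    """ratings maps region -> (erythema, infiltration, desquamation, area),
    with E/I/D each rated 0-4 and area rated 0-6."""
    return sum(REGION_WEIGHTS[region] * (e + i + d) * area
               for region, (e, i, d, area) in ratings.items())

# Example: disease confined to the trunk and lower extremities
score = pasi({
    "head":  (0, 0, 0, 0),
    "upper": (0, 0, 0, 0),
    "trunk": (2, 2, 1, 3),   # 0.3 * 5 * 3 = 4.5
    "lower": (3, 2, 2, 4),   # 0.4 * 7 * 4 = 11.2
})
print(score)  # ~15.7 on the 0-72 scale
```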

Relatedly, the U.S. Food and Drug Administration (FDA) has provided input on types of Clinical Outcome Assessments (COAs) that may be considered for qualification for use in clinical trials, with the goals of increasing the reliability of such assessments within a specific context of use in drug development and regulatory decisionmaking to measure a specific concept with a specific interpretation. Contextual considerations include the specific disease of interest, target population, clinical trial design and objectives, regionality, and mode of administration. The types of COAs described are: 5

  • Patient-reported outcome (PRO) assessment : A measurement based on a report that comes directly from the patient (i.e., the study subject) about the status of particular aspects of or events related to a patient's health condition. PROs are recorded without amendment or interpretation of the patient's response by a clinician or other observer. A PRO measurement can be recorded by the patient directly, or recorded by an interviewer, provided that the interviewer records the patient's response exactly.
  • Observer-reported outcome (ObsRO) assessment : An assessment that is determined by an observer who does not have a background of professional training that is relevant to the measurement being made, i.e., a nonclinician observer such as a teacher or caregiver. This type of assessment is often used when the patient is unable to self-report (e.g., infants, young children). An ObsRO assessment should only be used in the reporting of observable concepts (e.g., signs or behaviors); ObsROs cannot be validly used to directly assess symptoms (e.g., pain) or other unobservable concepts.
  • Clinician-reported outcome (ClinRO) assessment : An assessment that is determined by an observer with some recognized professional training that is relevant to the measurement being made.

Other considerations related to use of PROs for measurement of health-related quality of life and other concepts are addressed later on in this chapter.

Composite Endpoints

Some clinical outcomes are composed of a series of items, and are referred to as composite endpoints. A composite endpoint is often used when the individual events included in the score are rare, and/or when it makes biological and clinical sense to group them. The study power for a given sample size may be increased when such composite measures are used as compared with individual outcomes, since by grouping numerous types of events into a larger category, the composite endpoint will occur more frequently than any of the individual components. As desirable as this can be from a statistical point of view, challenges include interpretation of composite outcomes that incorporate both safety and effectiveness, and broader adoption of reproducible definitions that will enhance cross-study comparisons.

For example, Kip and colleagues 6 point out that there is no standard definition for MACE (major adverse cardiac events), a commonly used outcome in clinical cardiology research. They conducted analyses to demonstrate that varying definitions of composite endpoints, such as MACE, can lead to substantially different results and conclusions. The investigators utilized the DEScover registry patient population, a prospective observational registry of drug-eluting stent (DES) users, to evaluate differences in 1-year risk for three definitions of MACE in comparisons of patients with and without myocardial infarction (MI), and patients with multi-lesion stenting versus single-lesion stenting (also referred to as percutaneous coronary intervention or PCI). The varying definitions of MACE included one related to safety only [composite of death, MI, and stent thrombosis (ST)], and two relating to both safety and effectiveness [composite of death, MI, ST, and either (1) target vessel revascularization (TVR) or (2) any repeat revascularization].

When comparing patients with and without acute MI, the three definitions of MACE yielded very different hazard ratios. The safety-only definition of MACE yielded a hazard ratio of 1.75 (p<0.05), indicating that patients with acute MI were at greater risk of 1-year MACE. However, for the composites of safety and effectiveness endpoints, the risk of 1-year MACE was greatly attenuated and no longer statistically significant. Additionally, when comparing patients with single versus multiple lesions treated with PCI, the three definitions also yielded different results; while the safety-only composite endpoint showed no difference in 1-year MACE, adding TVR to the composite definition led to a hazard ratio of 1.4 (p<0.05) for multi-lesion versus single-lesion PCI. This research serves as a cautionary tale for the creation and use of composite endpoints. Not only can varying definitions of composite endpoints such as MACE lead to substantially different results and conclusions; results must also be carefully interpreted, especially where safety and effectiveness endpoints are combined.
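The effect of shifting composite definitions can be sketched in a few lines of code; the event labels and the had_mace helper below are illustrative stand-ins, not a standard coding scheme:

```python
# Illustrative sketch: the same patient counts as a MACE event under one
# composite definition but not another.

SAFETY_ONLY = {"death", "mi", "stent_thrombosis"}
SAFETY_PLUS_TVR = SAFETY_ONLY | {"tvr"}                      # adds target vessel revascularization
SAFETY_PLUS_ANY_REVASC = SAFETY_PLUS_TVR | {"other_revasc"}  # adds any repeat revascularization

def had_mace(patient_events, definition):
    """True if any of the patient's events falls under the composite."""
    return bool(set(patient_events) & definition)

# A patient whose only 1-year event is a target vessel revascularization:
events = ["tvr"]
print(had_mace(events, SAFETY_ONLY))      # False under the safety-only composite
print(had_mace(events, SAFETY_PLUS_TVR))  # True once TVR enters the definition
```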

Intermediate Endpoints

The use of an intermediate or surrogate endpoint is more common in clinical trials than in observational studies. This type of endpoint is often a biological marker for the condition of interest, and may be used to reduce the followup period required to obtain results from a study of treatment effectiveness. An example would be the use of measures of serum lipids as endpoints in randomized trials of the effectiveness of statins, for which the major disease outcomes of interest to patients and physicians are a reduction in coronary heart disease incidence and mortality. The main advantages of intermediate endpoints are that the followup time required to observe possible effects of treatment on these outcomes may be substantially shorter than for the clinical outcome(s) of primary interest, and that, if they are measured on all patients, the number of outcomes available for analysis may be larger. Much as with composite endpoints, using intermediate endpoints will increase study power for a given sample size as compared with outcomes that may be relatively rare, such as primary myocardial infarction. Surrogate or intermediate outcomes, however, may provide an incomplete picture of benefits and risks. Treatment comparisons based on intermediate endpoints may differ in magnitude or direction from those based on major disease endpoints, as evidenced in a clinical trial of nifedipine versus placebo 7-8 as well as other clinical trials of antihypertensive therapy. 9 Nifedipine, a calcium channel blocker, was superior to placebo in reducing the onset of new coronary lesions; yet mortality was sixfold greater among patients who received nifedipine rather than placebo. 7

Freedman and colleagues have provided recommendations regarding the use of intermediate endpoints. 10 Investigators should consider the degree to which the intermediate endpoint reflects the main outcome, as well as the degree to which effects of the intervention may be mediated through the intermediate endpoint. Psaty and colleagues have cautioned that, because drugs have multiple effects and a surrogate endpoint is likely to capture only a subset of them, results of studies based on surrogate endpoints may be a misleading substitute for major disease outcomes as a basis for choosing one therapy over another. 9

Table 6.2. Clinical outcome definitions and objective measures.

Selection of Clinical Outcome Measures

Identification of a suitable measure of a clinical outcome for an observational CER study is a process in which various aspects of the nature of the disease or condition under study should be considered along with sources of information by which the required information may be feasibly and reliably obtained.

The choice of outcome measure may follow directly from the expected biological mechanism of action of the intervention(s) under study and its impact on specific medical conditions. For example, the medications tamoxifen and raloxifene are selective estrogen receptor modulators that act through binding to estrogen receptors to block the proliferative effect of estrogen on mammary tissue and reduce the long-term risk of primary and recurrent invasive and non-invasive breast cancer. 11 Broader or narrower outcome definitions may be appropriate to specific research questions or designs. In some situations, however, the putative biologic mechanism may not be well understood. Nonetheless, studies addressing the clinical question of comparative effectiveness of treatment alternatives may still inform decisionmaking, and advances in understanding of the biological mechanism may follow discovery of an association through an observational CER study.

The selection of clinical outcome measures may be challenging when many clinical aspects are of interest and a single measure or scale may not adequately capture the perspectives of the clinician and patient. For example, in evaluating treatments or other interventions that may prolong the time between flares of systemic lupus erythematosus (SLE), researchers may use an index such as the Systemic Lupus Erythematosus Disease Activity Index (SLEDAI), which measures changes in disease activity, or the SLICC/ACR damage index, an instrument designed to assess accumulated damage since the onset of the disease. 12-14 The SLEDAI has been tested in different populations and has demonstrated high reliability, evidence for validity, and responsiveness to change. 15 Yet multiple clinical outcomes in addition to disease activity may be of interest in studying treatment effectiveness in SLE, such as prolongation of time to flare, reduction in corticosteroid use, or occurrence of serious acute manifestations (e.g., acute confusional state or acute transverse myelitis). 16

Interactions With the Health Care System

For any medical condition, one should first determine the source of reporting or detection that may lead to initial contact with the medical system. The manner in which the patient presents for medical attention may provide insights as to the data source(s) that may be useful in studying the condition. The decision whether to collect information directly from the physician, through medical record abstraction, directly from patients, and/or through use of electronic health records (EHRs) and/or administrative claims data will follow from this. For example, general hospital medical records are unlikely to provide the key components of an outcome such as respiratory failure, which requires information about use of mechanical ventilation. In contrast, hospital medical records are useful for the study of myocardial infarction, which must be assessed and treated in a hospital setting and is nearly always accompanied by an overnight stay. General practice physician office records and emergency department records may be useful in studying the incidence of influenza A or urticaria, with the choice between these sources depending on the severity of the condition. A prospective study may be required to collect clinical assessments of disease severity using a standard instrument, as these are not consistently recorded in medical practice and are not coded in administrative data sources. The chapter on data sources ( chapter 8 ) provides additional information on selection of appropriate sources of data for an observational CER study.

  • Humanistic Outcomes

While outcomes of interest to patients generally include those of interest to physicians, payers, regulators, and others, they are often differentiated by two characteristics: (1) they are clinically meaningful with practical implications for disease recognition and management (i.e., patients generally have less interest in intermediate pathways with no clear clinical impact); and (2) they include reporting of outcomes based on a patient's unique perspective, e.g., patient-reported scales that indicate pain level, degree of functioning, etc. This section deals with measures of health-related quality of life (HRQoL) and the range of measures collectively described as patient-reported outcomes (PROs), which include measures of HRQoL. Other humanistic perspectives relevant to patients (e.g., economics, utilization of health services, etc.) are covered elsewhere.

Health-Related Quality of Life

Health-related quality of life (HRQoL) measures the impact of disease and treatment on the lives of patients and is defined as "the capacity to perform the usual daily activities for a person's age and major social role." 17 HRQoL commonly includes physical functioning, psychological well-being, and social role functioning. These outcomes reflect the patient's perspective and are measured by asking patients, or surrogate reporters, about them.

HRQoL is an outcome increasingly used in randomized and non-randomized studies of health interventions, and as such FDA has provided clarifying definitions of HRQoL and of improvements in HRQoL. The FDA defines HRQoL as follows:

HRQL is a multidomain concept that represents the patient's general perception of the effect of illness and treatment on physical, psychological, and social aspects of life. Claiming a statistical and meaningful improvement in HRQL implies: (1) that all HRQL domains that are important to interpreting change in how the clinical trial's population feels or functions as a result of the targeted disease and its treatment were measured; (2) that a general improvement was demonstrated; and (3) that no decrement was demonstrated in any domain. 18

Patient-Reported Outcomes

Patient-reported outcomes (PROs) include any outcomes that are based on data provided by patients or by people who can report on their behalf (proxies), as opposed to data from other sources. 19 PROs refer to patient ratings and reports about any of several outcomes, including health status, health-related quality of life, quality of life defined more broadly, symptoms, functioning, satisfaction with care, and satisfaction with treatment. Patients can also report about their health behaviors, including adherence and health habits. Patients may be asked to directly report information about clinical outcomes or health care utilization and out-of-pocket costs when these are difficult to measure through other sources. The FDA defines a PRO as “a measurement based on a report that comes directly from the patient (i.e., study subject) about the status of a patient's health condition without amendment or interpretation of the patient's response by a clinician or anyone else. A PRO can be measured by self-report or by interview provided that the interviewer records only the patient's response.” 18

In this section we focus mainly on the use of standard instruments for measurement of PROs, in domains including specific disease areas, health-related quality of life, and functioning. PRO measures may be designed to measure the current state of health of an individual or to measure a change in health state. PROs have similarities to other outcome variables measured in observational studies. They are measured with components of both random and systematic error (bias). To be most useful, it is important to have evidence about the reliability, validity, responsiveness, and interpretation of PRO measures, discussed further later in this section.

Types of Humanistic Outcome Measures

Generic measures.

Generic PRO questionnaires are measurement instruments designed to be used across different subgroups of individuals, and contain common domains that are relevant to almost all populations. They can be used to compare one population with another, or to compare scores in a specific population with normative scores. Many have been used for years, and have well-established and well-understood measurement properties.

Generic PRO questionnaires can focus on a comprehensive set of domains, or on a narrow range of domains such as symptoms or aspects of physical, mental, or social functioning. An example of a generic PRO measure is the Sickness Impact Profile (SIP), one of the oldest and most rigorously developed questionnaires, which measures 12 domains that are affected by illness. 20 The SIP produces two subscale scores, one for physical and one for mental health, and an overall score. Another questionnaire, the SF-36, measures eight domains including general health perceptions, pain, physical functioning, role functioning (as limited by physical health), social functioning, mental health, and vitality. 21 The SF-36 produces a Physical Component Score and a Mental Component Score. 22 The EQ-5D is another generic measure of health-related quality of life, intended for self-completion, that generates a single index score. This scale defines health in terms of 5 dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension has three response categories corresponding to no problem/some problem/extreme problem. Taken as a whole, the EQ-5D defines a total of 243 possible states, to which two further states (dead and unconscious) have been added. 23

Another broadly used indicator of quality of life relates to the ability to work. The Work Productivity and Activity Impairment (WPAI) questionnaire was created as a patient-reported quantitative assessment of the amount of absenteeism, presenteeism, and daily activity impairment attributable to general health (WPAI:GH) or to a specific health problem (WPAI:SHP) (see below), in an effort to develop a quantitative approach to measuring the ability to work. 24
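Returning to the EQ-5D figures above, a quick sketch confirms the state count (the enumeration is illustrative; EQ-5D scoring itself uses published value sets):

```python
# Five dimensions with three levels each yield 3**5 = 243 profiles; the
# two added states (dead and unconscious) bring the total to 245.
from itertools import product

dimensions = ["mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression"]
levels = (1, 2, 3)  # no problem / some problem / extreme problem
profiles = list(product(levels, repeat=len(dimensions)))
print(len(profiles))       # 243
print(len(profiles) + 2)   # 245, including 'dead' and 'unconscious'
```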

Examples of generic measures that assess a more restricted set of domains include the SCL-90 to measure symptoms, 25 the Index of Activities of Daily Living to measure independence in basic functioning, 26 the Psychological General Well-Being Index (PGWBI) to measure psychological well-being, 27 and the Beck Depression Inventory. 28

Disease- or Population-Specific Measures

Specific PRO questionnaires are sometimes referred to as “disease-specific.” While a questionnaire can be disease- or condition-specific (e.g., chronic heart failure), it can also be designed for use in a specific population (e.g., pediatric, geriatric), or for use to evaluate a specific treatment (e.g., renal dialysis). Specific questionnaires may be more sensitive to symptoms that are experienced by a particular group of patients. Thus, they are thought to detect differences and changes in scores when they occur in response to interventions.

Some specific measurement instruments assess multiple domains that are affected by a condition. For example, the Arthritis Impact Measurement Scales (AIMS) includes nine subscales that assess problems specific to the health-related quality of life of patients with rheumatoid arthritis and its treatments. 29 The MOS-HIV Health Survey includes 10 domains that are salient for people with HIV and its treatments. 30

Some of these measures take a modular approach, including a core measure that is used for assessment of a broader set of conditions, accompanied by modules that are specific to disease subtypes. For example, the FACIT and EORTC families of measures for evaluating cancer therapies each include a core module that is used for all cancer patients, and specific modules for each type of cancer, such as a module pertaining specifically to breast cancer. 31 - 33

Other measures focus more narrowly on a few domains most likely to be affected by a disease, or most likely to improve with treatment. For example, the Headache Impact Test includes only six items. 34 In contrast, other popular measures focus on symptoms that are affected by many diseases, such as the Brief Pain Inventory and the M.D. Anderson Symptom Inventory (MDASI), which measure the severity of pain and other symptoms and the impact of symptoms on function, and have been developed, refined, and validated in many languages and patient subgroups over three decades. 35 - 36

It is possible, though not always advisable, to design a new PRO instrument for use in a specific study. The process of developing and testing a new PRO measure can be lengthy (generally requiring at least a year), and there is no guarantee that a new measure will work as well as more generic but better tested instruments. Nonetheless, it may be necessary to do so for an uncommon condition with no existing PRO measures, for a specific cultural context that differs from those studied before, and/or to capture effects of new treatments that may require a different approach to measurement. Even in these cases, when possible it is prudent also to include a PRO measure with evidence for reliability and validity, ideally in the target patient population, in case the newly designed instrument fails to work as intended. This approach allows comparisons with the new measure to assess content validity where the concepts being measured overlap.

Item Response Theory (IRT) and Computer Adaptive Testing (CAT)

Item Response Theory (IRT) is a framework for the development of tests and measurement tools, and for the assessment of how well the tools work. Computer Adaptive Testing (CAT) represents an area of innovation in measuring PROs. CAT selects the items to administer so that questions are relevant to the respondent and targeted to the individual's specific level, with each response determining the next question asked. Behind the scenes, items are selected from "item banks": collections of dozens to hundreds of questions that represent the universe of potential levels of the dimension of interest, along with an indication of the relative difficulty or dysfunction each represents. For example, the Patient-Reported Outcomes Measurement Information System (PROMIS) item bank for physical functioning includes 124 items that range in difficulty from getting out of bed to running several miles. 37 This individualized administration can both enhance measurement precision and reduce respondent burden. 38 Computer adaptive testing is based on IRT methods of scaling items and drawing subsets of items from a larger item bank. 39 Considerations around adaptive testing involve balancing the benefit of tailoring items to the specific individual against the risk of inappropriate targeting or classification if items answered incorrectly early on determine the later set of items a subject is able to respond to. PROMIS 40 is a major NIH initiative that leverages these desirable properties for PROs in clinical research and practice applications.
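The item-selection idea can be sketched as follows; this is an illustration of the general principle, not the PROMIS algorithm, and the items and difficulty values are invented:

```python
# Sketch of adaptive item selection: each item carries a single difficulty
# parameter, and the next item administered is the unanswered one whose
# difficulty lies closest to the current ability estimate.

def next_item(item_bank, answered, ability_estimate):
    candidates = {item: d for item, d in item_bank.items()
                  if item not in answered}
    return min(candidates,
               key=lambda item: abs(candidates[item] - ability_estimate))

# Hypothetical physical-functioning items with invented difficulties
bank = {"get out of bed": -2.0, "walk one block": -1.0,
        "climb stairs": 0.0, "jog one mile": 1.5, "run several miles": 2.5}
print(next_item(bank, answered={"climb stairs"}, ability_estimate=0.8))
# -> 'jog one mile', the closest match to the respondent's estimated level
```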

Descriptive Versus Preference Format

Descriptive questionnaires ask about general or common domains and complaints, and usually provide multiple scores. Preference-based measures, generally referred to as utility measures, provide a single score, usually on a 0–1 scale, that represents the aggregate of multiple domains for an overall estimate of burden.

Most of the questionnaires familiar to clinical researchers fall into the category of descriptive measures, including all of those mentioned in the preceding paragraphs. Patients or other respondents are asked to indicate the extent to which descriptions of specific feelings, abilities, or behaviors apply to them. Utility measures are discussed further in the following section.

Other Attributes of PROs

Within each of the above options, there are several attributes of PRO instruments to consider. These include response format (numeric scales vs. verbal descriptors or visual analogue scales), the focus of what is being assessed (frequency, severity, impairment, all of the above), and recall period. Shorter, more recent recall periods more accurately capture the individual's actual experience, but may not provide as good an estimate of their typical activities or experiences. (For example, not everyone vacuums or has a headache every day.)

Content Validity

Content validity is the extent to which a PRO instrument covers the breadth and depth of salient issues for the intended group of patients. If a PRO instrument is not valid with respect to its content, then there is an increased chance that it may fail to capture adequately the impact of an intervention. For example, in a study to compare the impact of different regimens for rheumatoid arthritis, a PRO that does not assess hand function could be judged to have poor content validity, and might fail to capture differences among therapies. FDA addresses content validity as being of primary interest in assessing a PRO, with other measurement properties being secondary, and defines content validity as follows:

Evidence from qualitative research demonstrating that the instrument measures the concept of interest including evidence that the items and domains of an instrument are appropriate and comprehensive relative to its intended measurement concept, population, and use. Testing other measurement properties will not replace or rectify problems with content validity. 18

Content validity is generally assessed qualitatively rather than statistically. It is important to understand and consider the population being studied, including their usual activities and problems, the condition (especially its impact on the patient's functioning), and the interventions being evaluated (including both their positive and adverse effects).

Responsiveness and Minimally Important Difference

Responsiveness is a measure of a PRO instrument's sensitivity to changes in health status or other outcome being measured. If a PRO is not sufficiently responsive, it may not provide adequate evidence of effectiveness in observational studies or clinical trials. Related to responsiveness is the minimally important difference that a PRO measure may detect. Both the patient's and the health care provider's perspectives are needed to determine if the minimally important difference detectable by an instrument is in fact of relevance to the patient's overall health status. 41

Floor and Ceiling Effects

Poor content validity can also lead to a mismatch between the distribution of responses and the true distribution of the concept of interest in the population. For example, if questions in a PRO assessing ability to perform physical activities are too "easy" relative to the level of ability in the population, the PRO will not reflect the true distribution: a large proportion of the sample will simply report no disability. This problem is known as a "ceiling" effect. Similarly, "floor" effects are seen when questions are too "difficult" for the population, so that responses cluster at the bottom of the scale and again fail to capture the true variability.

Interpretation of PRO Scores

Clinicians and clinical researchers may be unfamiliar with how to interpret PRO scores. They may not understand, or have reference to, the usual distribution of scores for a particular PRO in a clinical or general population. Without knowledge of normal ranges, physicians may not know what score cutpoints indicate that action is warranted. Without reference values from a comparable population, researchers will not know whether an observed difference between two groups is meaningful, or whether a given change within or between groups is important. The task of understanding the meaning of scores is made more difficult by the fact that different PRO measurement tools tend to use different scoring systems. For most questionnaires, higher scores imply better health, but for some, a higher score is worse. Some scales are scored from 0 to 1, where 0=dead and 1=perfect health. Others are scored on a 0–100 scale, where 0 is simply the lowest attainable score (i.e., the respondent indicates the "worst" health state in response to all of the questions) and 100 is the highest. Still others are "normalized," so that, for example, a score of 50 represents the mean score for the healthy or nondiseased population, with a standard deviation of 10 points. It is therefore crucial for researchers and users of PRO data to understand the scoring system being used for an instrument and its expected distributional properties.

For some PRO instruments, particularly generic questionnaires that have been applied to large groups of patients over many years, population norms have been collected and established. These can be used as reference points. Scoring also can be recalculated and “normalized” to a “T-score” so that a specific score (often 50 or 100) corresponds to the mean score for the population, and a specific number of points (often 5 or 10) corresponds to 1 standard deviation unit in that population.
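The T-score transformation described above amounts to a one-line rescaling; in this sketch the population mean and standard deviation are assumed known from reference data:

```python
# Minimal sketch of T-score normalization: raw scores are rescaled so the
# reference-population mean maps to 50 and one SD maps to 10 points.

def t_score(raw, pop_mean, pop_sd):
    return 50 + 10 * (raw - pop_mean) / pop_sd

# A raw score of 30 in a reference population with mean 25 and SD 5:
print(t_score(30, 25, 5))  # 60.0, i.e., one SD above the population mean
```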

Selection of a PRO Measure

There are a number of practical considerations to take into account when selecting PRO measures for use in a CER study. The measurement properties discussed in the preceding sections also require evaluation in all instances for the specific instrument selected, within a given population, setting, and intended purpose.

It is important to understand the target population that will be completing the PRO assessment. These may range from individuals who can self-report, to individuals requiring the assistance of a proxy or medical professional (e.g., children, mentally or cognitively limited individuals, visually impaired individuals). Some respondents may be ambulatory individuals living in the community, whereas others may be inpatients or institutionalized individuals.

If a PRO questionnaire is to be used in non–English-speaking populations or in multiple languages, it is necessary to have versions appropriately adapted to language and culture. One should have evidence for the reliability and validity of the translated and culturally adapted version, as applied to the concerned population. One also should have data showing the comparability of performance across different language and cultural groups. This is of special importance when pooling data across language versions, as in a multinational clinical trial or registry study.

It is important to match the respondent burden created by a PRO instrument to the requirements of the population being studied. Patients with greater levels of illness or disability are less able to complete lengthy questionnaires. In some cases, the content or specific questions posed in a PRO may be upsetting or otherwise unacceptable to respondents. In other cases, a PRO questionnaire may be too cognitively demanding or written at a reading level above that of the intended population. The total burden of study-related data collection on patients and providers must also be considered, as an excessive number of forms to complete is likely to reduce compliance.

Cost and Copyright

Another practical consideration is the copyright status of a PRO being considered for use. Some PRO questionnaires are entirely in the public domain and are free for use. Others are copyrighted and require permission and/or the payment of fees for use. Some scales, such as the SF-12 and SF-36, require payment of fees for scoring.

Mode and Format of Administration

As noted above, there are various options for how a questionnaire should be administered and how the data should be captured, each method having both advantages and disadvantages. A PRO questionnaire can be (1) self-administered at the time of a clinical encounter, (2) administered by an interviewer at the time of a clinical encounter, (3) administered with computer assistance at the time of a clinical encounter, (4) self-administered by mail, (5) self-administered online, (6) interviewer-administered by telephone, or (7) computer-administered by telephone. Self-administration at the time of a clinical encounter requires little technology or up-front cost, but requires staff for supervision and data entry and can be difficult for respondents with limited literacy or sophistication. Face-to-face administration engages respondents and reduces their burden but requires trained interviewers. Computer-assisted administration provides an intermediate solution but also requires capital investment. Mailed surveys afford more privacy to respondents, but they generate mailing expenses and do not eliminate problems with literacy. Paper-based formats require data entry, scoring, and archiving and are prone to calculation errors. Online administration is relatively inexpensive, especially for large surveys, and surveys can be completed any time, but not all individuals have Internet access. Administration by live telephone interview is engaging and allows interviewer flexibility but is also expensive. "Cold calls" to potential study participants may result in low response rates, given the increased prevalence of caller ID screening systems and widespread skepticism about "telemarketing."

Interactive voice response systems (or IVRS) can also be used to conduct telephone interviews, but it can be tedious to respond using the telephone key pad, and this format strikes some as impersonal.

Static Versus Dynamic Questionnaires

Static questionnaires employ a fixed set of questions and response options. They can be administered on paper, by interview, or through the Internet. Dynamic questionnaires select followup questions based on the responses already given to previous questions. Because they are more efficient, dynamic questionnaires can assess more domains for a given respondent burden.

  • Economic and Utilization Outcomes

While clinical outcomes represent the provider and professional perspective, and humanistic outcomes represent the patient perspective, economic outcomes, including measures of health resource utilization, represent the payer and societal perspective. In the United States, measures of cost and cost-effectiveness are often excluded from government-funded CER studies. However, these measures are important to a variety of stakeholders, such as payers and product manufacturers, and are routinely included in cost-effectiveness research in countries such as Australia, the United Kingdom, Canada, France, and Germany. 42

Research questions addressing issues of cost-effectiveness and resource utilization may be formulated in a number of ways. Cost identification studies measure the cost of applying a specified treatment to a population under a certain set of conditions; they describe the cost incurred without comparison to alternative interventions.

Some cost identification studies describe the total costs of care for a particular population, whereas others isolate costs of care related to a specific condition; this latter approach requires that each episode of care be ascribed as related or unrelated to the illness of interest, and involves substantial review. 43 Cost-benefit studies express both costs and benefits in dollars or another currency: they compare the monetary costs of an intervention against the standard of care with the cost savings that result from the benefits of that treatment. In these studies, mortality is also assigned a dollar value, although techniques for assigning value to a human life are controversial. Cost-effectiveness is a relative concept, and cost-effectiveness analysis compares the costs and benefits of treatments in terms of a specified outcome, such as reduced mortality or morbidity, years of life saved, or infections averted.

Types of Health Resource Utilization and Cost Measures

Monetary costs.

Studies most often examine direct costs (i.e., the monetary costs of the medical treatments themselves, potentially including associated costs of administering treatment or conditions associated with treatment), but may also include measures of indirect costs (e.g., the costs of disability or loss of livelihood, both actual and potential). Multiple measures of costs are commonly included in any given study.

Health Resource Utilization

Measures of health resource utilization, such as the number of inpatient or outpatient visits, total days of hospitalization in a given year, or number of days treated with IV antibiotics, are often used as efficient and easily interpretable proxies for cost. Actual costs depend on numerous factors (e.g., institutional overhead, volume discounts) and can be difficult to obtain because they are often confidential, reflecting in part business acumen in price negotiation; costs may also vary by institution or location, such as the cost of a day in the hospital or of a medical procedure. Resource utilization measures may be preferred when a study is intended to yield results generalizable to health systems or reimbursement systems other than those under study, as they do not depend on a particular reimbursement structure such as Medicare. Alternatively, a specific cost or reimbursement structure, such as the amount reimbursed by the Centers for Medicare and Medicaid Services (CMS) for specific treatment items, or average wholesale drug costs, may be applied to units of health resource use when conducting studies that pool data from different health systems.

Utility and Preference-Based Measures

PROs and cost analyses intersect around the calculation of cost-utility. Utility measures are derived from economic and decision theory. The term utility refers to the value placed by the individual on a particular health state. Utility is summarized as a score ranging from 0.0 representing death to 1.0 representing perfect health.

In health economic analyses, utilities are used to justify devoting resources to a treatment. There are several widely used preference-based instruments that are used to estimate utility.

Preference measures are based on the fundamental concept that individuals or groups have reliable preferences about different health states. To evaluate those preferences, individuals rate a series of health states: for example, a person with specific levels of physical functioning (able to walk one block but not climb stairs), mental health (happy most of the time), and social role functioning (not able to work due to health). The task for the individual is to directly assign a degree of preference to that state. Widely used elicitation methods and preference-based instruments include the Standard Gamble and Time Tradeoff methods, 44-45 the EQ-5D, also referred to as the EuroQol, 23 the Health Utilities Index, 46-47 and the Quality of Well-Being Scale. 48

Quality-Adjusted Life Years (QALYs)

Utility scores associated with treatment can be used to weight the duration of life according to its quality, and are thereby used to generate QALYs. Utility scores are generally first ascertained directly in a sample of people with the condition in question, either cross-sectionally or over time in a clinical trial. Utility values are sometimes estimated indirectly using other sources of information about the health status of people in a population. The QALYs produced by an intervention can then be calculated as the area under the curve of utility plotted against time.

For example, if the mean utility score for patients receiving antiretroviral treatment for HIV disease is 0.80, then the outcome for a treated group would be survival time multiplied by 0.80.
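The underlying arithmetic is simple enough to sketch; the qalys helper and all values below are illustrative:

```python
# Minimal sketch of the QALY arithmetic in the example above: duration of
# life in each health state is weighted by the utility of that state.

def qalys(segments):
    """segments: list of (years_in_state, utility) pairs."""
    return sum(years * utility for years, utility in segments)

print(qalys([(10, 0.80)]))           # 10 years at utility 0.80 -> 8.0 QALYs
print(qalys([(2, 0.9), (3, 0.7)]))   # 2*0.9 + 3*0.7 -> 3.9 QALYs
```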

Disability-Adjusted Life Years (DALYs)

DALYs are another measure of overall disease burden, expressed as the number of years lost to poor health, disability, or premature death. 49 As with QALYs, mortality and morbidity are combined in a single metric: potential years of life lost to premature death are supplemented with years of healthy life lost to less-than-optimal health. Whereas 1 QALY corresponds to one year of life in optimal health, 1 DALY corresponds to one year of healthy life lost.

An important aspect of the calculation of DALYs is that the value assigned to each year of life depends on age. Years lived as a young adult are valued more highly than those spent as a young child or older adult, reflecting the different capacity for work productivity during different phases of life. DALYs are therefore estimated for different chronic illnesses by first calculating the age- and sex-adjusted incidence of disease. A DALY is then calculated as the sum of the average years of life lost and the average years lived with a disability. For example, to estimate the years of healthy life lost in a region due to HIV/AIDS, one would first estimate the prevalence of the disease by age, then sum the average years of life lost and the average number of years lived with AIDS, discounted using a universal set of standard weights derived from expert valuations.
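A minimal sketch of this identity (years of life lost plus severity-weighted years lived with disability); the dalys helper and the disability weight are invented for illustration, and real DALY estimation also involves age weighting and discounting:

```python
# DALY = YLL + YLD, where YLD is weighted for disability severity.

def dalys(yll, years_with_disability, disability_weight):
    return yll + years_with_disability * disability_weight

print(dalys(yll=4.0, years_with_disability=10.0, disability_weight=0.35))
# 4 + 10 * 0.35 = 7.5 DALYs
```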

Selection of Resource Utilization and Cost Measures

The selection of measures of resource utilization or costs should correspond to the primary hypothesis in terms of the impact of an intervention. For example, will treatment reduce the need for hospitalization or result in a shorter length of stay? Or, will treatment or other intervention reduce complications that require hospitalization? Or, will a screening method reduce the total number of diagnostic procedures required per diagnosis?

It is useful to consider what types of costs are of interest to the investigators and to various stakeholders. Are total costs of interest, or costs associated with specific resources (e.g., prescription drug costs)? Are only direct costs being measured, or are indirect costs, such as those related to days lost from work, also of interest?

When it is determined that results will be presented in terms of dollars rather than units of resources, several different methods can be applied. In the unusual case that an institution has a cost-accounting system, cost can be measured directly. In most cases, resource units are collected, and costs are assigned based on local or national average prices for the specific resources being considered, for example, reimbursement from CMS for a CT scan, or a hospital day. Application of an external standard cost system reduces variability in costs due to region, payer source, and other variables that might obscure the impact of the intervention in question.

  • Study Design and Analysis Considerations

Study Period and Length of Followup

In designing a study, the required study period and length of followup are determined by the expected time frame within which an intervention may be expected to impact the outcome of interest. A study comparing traditional with minimally invasive knee replacement surgery will need to follow subjects at least for the duration of the expected recovery time of 3 to 6 months or longer. The optimal duration of a study can be problematic when studying effects that may become manifest over a long time period, such as treatments to prevent or delay the onset of chronic disease. In these cases, data sources with a high degree of turnover of patients, such as administrative claims databases from managed care organizations, may not be suitable. For example, in the case of Alzheimer's disease, a record of health care is likely to be present in health insurance claims. However, with the decline in cognitive function, patients may lose ability to work and may enter assisted care facilities, where utilization is not typically captured in large health insurance claims systems. Some studies may be undertaken for the purpose of determining how long an intervention can be expected to impact the outcome of interest. For example, various measures are used to aid in reducing obesity and in smoking cessation, and patients, health care providers, and payers are interested in knowing how long these interventions work (if at all), for whom, and in what situations.

Notwithstanding the limitations of intermediate endpoints (discussed in a preceding section), one of the main advantages of their use is the potential truncation of the required study followup period. Consider, for example, a study of the efficacy of the human papillomavirus (HPV) vaccine, for which the major medical endpoint of interest is prevention of cervical cancer. The long latency period (more than 2 years, depending on the study population) and the relative infrequency of cervical cancer raise the possibility that intermediate endpoints should be used. Candidates might include new diagnoses of genital warts, or new diagnoses of the precancerous conditions cervical intraepithelial neoplasia (CIN) or vaginal intraepithelial neoplasia (VIN), which have shorter latency periods (less than 1 year and a minimum of about 2 years, respectively). Use of these endpoints would allow such a study to provide meaningful evidence informing the use of the HPV vaccine in a shorter timeframe, during which more patients might benefit from its use. Alternatively, if the vaccine is shown to be ineffective, this information could avoid years of unnecessary treatment and its associated costs, as well as the costs of running a longer trial.

Avoidance of Bias in Study Design

Misclassification.

The role of the researcher is to understand the extent and sources of misclassification in outcome measurement, and to try to reduce these as much as possible. To ensure comparability between treatment groups with as little misclassification (also referred to as measurement error) of outcomes as possible, a clear and objective (i.e., verifiable and not subject to individual interpretation insofar as possible) definition of the outcome of interest is needed. An unclear outcome definition can lead to misclassification and bias in the measure of treatment effectiveness. When the misclassification is nondifferential, or equivalent across treatment groups, the estimate of treatment effectiveness will be biased toward the null, reducing the apparent effectiveness of treatment, which may result in an erroneous conclusion that no effect (or one smaller than the true effect size) exists. When the misclassification differs systematically between treatment groups, it may distort the estimate of treatment effectiveness in either direction.
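The direction of this bias can be illustrated with a small calculation. In the sketch below, outcome ascertainment has the same imperfect sensitivity and specificity in both groups; the true risks and the error rates are invented for illustration only.

```python
# Nondifferential outcome misclassification pulls a risk ratio toward 1.0.

def observed_risk(true_risk: float, sensitivity: float, specificity: float) -> float:
    """Probability of a *recorded* outcome under imperfect ascertainment."""
    return true_risk * sensitivity + (1 - true_risk) * (1 - specificity)

p_treated, p_control = 0.10, 0.20   # true risks -> true RR = 0.5
sens, spec = 0.80, 0.95             # identical in both groups (nondifferential)

rr_true = p_treated / p_control
rr_obs = (observed_risk(p_treated, sens, spec)
          / observed_risk(p_control, sens, spec))
print(rr_true, round(rr_obs, 2))    # 0.5 vs 0.62: biased toward the null
```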

For clinical outcomes, incorporating an objective measure, such as a validated tool developed for use in clinical practice settings, or an adjudication panel that reviews outcomes against the predetermined definition of an event, increases the likelihood that outcomes will be measured and classified accurately and in a manner unlikely to vary according to who is doing the assessment. For PROs, measurement error can stem from several sources, including the way in which a question is worded and hence understood by a respondent, how the question is presented, the population being assessed, the literacy level of respondents, the language in which the questions are written, and elements of the culture that language represents.

To avoid differential misclassification of outcomes, care must also be taken to use the same methods of ascertainment and definitions of study outcomes whenever possible. For prospective or retrospective studies with contemporaneous comparators, this is usually not an issue, since it is most straightforward to utilize the same data sources and methods of outcome ascertainment for each comparison group. A threat to validity may arise in use of a historical comparison group, which may be used in certain circumstances. For example, this occurs when a new treatment largely displaces use of an older treatment within a given indication, but further evidence is needed for the comparative effectiveness of the newer and older treatments, such as enzyme replacement for lysosomal storage disorders. In such instances, use of the same or similar data sources and equivalent outcome definitions to the extent possible will reduce the likelihood of bias due to differential outcome ascertainment.

Other situations that may give rise to differential misclassification of outcomes include: when investigators are not blinded to the hypothesis of the study and “rule-out” diagnoses are more common in those with a particular exposure of interest; when screening or detection of outcomes is more common or more aggressive with one treatment than another (i.e., surveillance bias, e.g., when liver function tests are preferentially performed in patients using a new drug compared with other treatments for that condition); and when loss to followup is related to the risk of experiencing the outcome. For example, once a safety signal has been identified and publicized, physicians are alerted and look more proactively for clinical signs and symptoms in treated patients. This effect is even more pronounced for products subject to controlled distribution or Risk Evaluation and Mitigation Strategies (REMS). Consider clozapine, an antipsychotic used to treat schizophrenia that is subject to controlled distribution through a “no blood, no drug” monitoring program. The blood testing program was implemented to detect early development of agranulocytosis. When comparing patients treated with clozapine with those treated with other antipsychotics, those using clozapine may appear to have a worse safety profile with respect to this outcome.

Sensitivity analyses may be conducted in order to estimate the impact of different levels of differential or nondifferential misclassification on effect estimates from observational CER studies. These approaches are covered in detail in chapter 11.

Validation and Adjudication

In some instances, additional information must be collected (usually from medical records) to validate the occurrence of the outcome of interest, including to exclude erroneous or “rule-out” diagnoses. This is particularly important for medical events identified in administrative claims databases, for which a diagnosis code associated with a medical encounter may represent a “rule out” diagnosis or a condition that does not map to a specific diagnosis code. For some complex diagnoses, such as unstable angina, a standard clinical definition must be applied by an adjudication panel that has access to detailed records inclusive of subjects' relevant medical history, symptomatic presentation, diagnostic work-up, and treatment. Methods of validation and adjudication of outcomes strengthen the internal validity and therefore the evidence that can be drawn from a CER study. However, they are resource-intensive.

Issues Specific to PROs

PROs are prone to several specific sources of bias. Self-reports of health status are likely to differ systematically from reports by surrogates, who, for example, are likely to report less pain than the individuals themselves. 50 Some biases may be population-dependent. For example, there may be a greater tendency of some populations to succumb to acquiescence bias (agreeing with the statements in a questionnaire) or social desirability bias (answering in a way that would cast the respondent in the best light). 51 In some situations, however, a PRO may be the most useful marker of disease activity, such as with episodic conditions that cause short-duration disease flares such as low back pain and gout, where patients may not present for health care immediately, if at all.

The goal of the researcher is to understand and reduce sources of bias, considering those most likely to apply in the specific population and topics under study. In the case of well understood systematic biases, adjustments can be made so that distributions of responses are more consistent. In other cases, redesigning items and scales, for example, by including both positively and negatively worded items, can reduce specific kinds of bias.

Missing data, an issue covered in more detail in chapter 10, pose a particular problem with PROs, since PRO data are usually not missing at random. Instead, respondents whose health is poorer are more likely to fail to complete an assessment. Another special case of missing data occurs when a patient dies and is unable to complete an assessment. If this issue is not taken into account in the data analysis, and scores are only recorded for living patients, incorrect conclusions may be drawn. Strategies for handling this type of missing data include selection of an instrument that incorporates a score for death, such as the Sickness Impact Profile 20, 52 or the Quality of Well-Being Scale, 48 or an analytic strategy that allows for some missing values.

Failure to account for missing PRO data that are related to poor health or death will lead to an overestimate of the health of the population based on responses from subjects who do complete PRO forms. Therefore, in research using PROs, it is very important to understand the extent and pattern of missing data, both at the level of the individual as well as for specific items or scales on an instrument. 53

A strategy for handling missing data should be put in place when developing the study protocol and analysis plans. Strategies pertaining to the use of PROs in research are discussed in further detail in publications such as the book by Fairclough and colleagues.

Analytic Considerations

Form of outcome measure and analysis approach.

To a large extent, the form of the primary outcome of interest—that is, whether the outcome is measured and expressed as a dichotomous or polytomous categorical variable or a continuous variable, and whether it is to be measured at a single time point, measured repeatedly at fixed intervals, or measured repeatedly at varying time intervals—determines the appropriate statistical methods that may be applied in analysis. These topics are covered in detail in chapter 10.

Sensitivity Analysis

One of the key factors to address in planned sensitivity analyses for an observational CER study is how varying definitions of the study outcome, or of related outcomes, affect the study's measures of association. Such investigations include: assessing multiple related outcomes within a disease area (for example, multiple measures of respiratory function such as FEV1, FEV1% predicted, and FVC in studies of asthma treatment effectiveness in children); assessing the effect of different cutoffs for dichotomized continuous outcome measures (for example, the use of Systemic Lupus Erythematosus Disease Activity Index-2000 scores to define active disease in lupus treatment studies 54 ); and assessing the use of different sets of diagnosis codes to capture a condition, such as influenza and related respiratory conditions, in administrative data. These and other considerations for sensitivity analyses are covered in detail in chapter 11.

Future Directions

Increased use of EHRs as a source of data for observational research, including registries, other types of observational studies, and specifically CER, has prompted initiatives to develop standardized definitions of key outcomes and other data elements for use across health systems and different EHR platforms, to facilitate comparisons between studies and pooling of data. The National Cardiovascular Research Infrastructure partnership between the American College of Cardiology and Duke Clinical Research Institute, which received American Recovery and Reinvestment Act funding to establish interoperable data standards based on the National Cardiovascular Data Registry, is an example of such an activity. 55

This chapter has provided an overview of considerations in development of outcome definitions for observational CER studies; has described implications of the nature of the proposed outcomes for the study design; and has enumerated issues of bias that may arise in incorporating the ascertainment of outcomes into observational research. It has also suggested means of preventing or reducing these biases.

Development of clear and objective outcome definitions that correspond to the nature of the hypothesized treatment effect and address the research questions of interest, along with validation of outcomes where warranted or use of standardized PRO instruments validated for the population of interest, contribute to the internal validity of observational CER studies. Attention to collection of outcome data in an equivalent manner across treatment comparison groups is also required. Use of appropriate analytic methods suitable to the outcome measure, and sensitivity analysis to address varying definitions of at least the primary study outcomes, are needed to make inferences drawn from such studies more robust and reliable.

Checklist: Guidance and key considerations for outcome selection and measurement for an observational CER protocol

Developing a Protocol for Observational Comparative Effectiveness Research: A User’s Guide is copyrighted by the Agency for Healthcare Research and Quality (AHRQ). The product and its contents may be used and incorporated into other materials on the following three conditions: (1) the contents are not changed in any way (including covers and front matter), (2) no fee is charged by the reproducer of the product or its contents for its use, and (3) the user obtains permission from the copyright holders identified therein for materials noted as copyrighted by others. The product may not be sold for profit or incorporated into any profitmaking venture without the expressed written permission of AHRQ.

Cite this page: Velentgas P, Dreyer NA, Wu AW. Outcome Definition and Measurement. In: Velentgas P, Dreyer NA, Nourjah P, et al., editors. Developing a Protocol for Observational Comparative Effectiveness Research: A User's Guide. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Jan. Chapter 6.



Psychological Measurement

19 Understanding Psychological Measurement

Learning Objectives

  • Define measurement and give several examples of measurement in psychology.
  • Explain what a psychological construct is and give several examples.
  • Distinguish conceptual from operational definitions, give examples of each, and create simple operational definitions.
  • Distinguish the four levels of measurement, give examples of each, and explain why this distinction is important.

What Is Measurement?

Measurement  is the assignment of scores to individuals so that the scores represent some characteristic of the individuals. This very general definition is consistent with the kinds of measurement that everyone is familiar with—for example, weighing oneself by stepping onto a bathroom scale, or checking the internal temperature of a roasting turkey using a meat thermometer. It is also consistent with measurement in the other sciences. In physics, for example, one might measure the potential energy of an object in Earth’s gravitational field by finding its mass and height (which of course requires measuring  those  variables) and then multiplying them together along with the gravitational acceleration of Earth (9.8 m/s²). The result of this procedure is a score that represents the object’s potential energy.

This general definition of measurement is consistent with measurement in psychology too. (Psychological measurement is often referred to as psychometrics .) Imagine, for example, that a cognitive psychologist wants to measure a person’s working memory capacity—their ability to hold in mind and think about several pieces of information all at the same time. To do this, she might use a backward digit span task, in which she reads a list of two digits to the person and asks them to repeat them in reverse order. She then repeats this several times, increasing the length of the list by one digit each time, until the person makes an error. The length of the longest list for which the person responds correctly is the score and represents their working memory capacity. Or imagine a clinical psychologist who is interested in how depressed a person is. He administers the Beck Depression Inventory, which is a 21-item self-report questionnaire in which the person rates the extent to which they have felt sad, lost energy, and experienced other symptoms of depression over the past 2 weeks. The sum of these 21 ratings is the score and represents the person’s current level of depression.
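The backward digit span procedure lends itself to a short sketch of its scoring rule: the score is the longest list length the person repeats correctly in reverse before the first error. The `responder` function below is a stand-in for a real participant.

```python
import random

def backward_digit_span(responder, start_len: int = 2, max_len: int = 9) -> int:
    """Score = longest list length the responder reverses correctly."""
    score = 0
    for length in range(start_len, max_len + 1):
        digits = [random.randint(0, 9) for _ in range(length)]
        if responder(digits) != digits[::-1]:
            break
        score = length
    return score

# A simulated participant who always answers perfectly maxes out the task:
print(backward_digit_span(lambda d: d[::-1]))  # -> 9
```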

The important point here is that measurement does not require any particular instruments or procedures. What it  does  require is  some  systematic procedure for assigning scores to individuals or objects so that those scores represent the characteristic of interest.

Psychological Constructs

Many variables studied by psychologists are straightforward and simple to measure. These include age, height, weight, and birth order. You can ask people how old they are and be reasonably sure that they know and will tell you. Although people might not know or want to tell you how much they weigh, you can have them step onto a bathroom scale. Other variables studied by psychologists—perhaps the majority—are not so straightforward or simple to measure. We cannot accurately assess people’s level of intelligence by looking at them, and we certainly cannot put their self-esteem on a bathroom scale. These kinds of variables are called  constructs  (pronounced  CON-structs ) and include personality traits (e.g., extraversion), emotional states (e.g., fear), attitudes (e.g., toward taxes), and abilities (e.g., athleticism).

Psychological constructs cannot be observed directly. One reason is that they often represent  tendencies  to think, feel, or act in certain ways. For example, to say that a particular university student is highly extraverted does not necessarily mean that she is behaving in an extraverted way right now. In fact, she might be sitting quietly by herself, reading a book. Instead, it means that she has a general tendency to behave in extraverted ways (e.g., being outgoing, enjoying social interactions) across a variety of situations. Another reason psychological constructs cannot be observed directly is that they often involve internal processes. Fear, for example, involves the activation of certain central and peripheral nervous system structures, along with certain kinds of thoughts, feelings, and behaviors—none of which is necessarily obvious to an outside observer. Notice also that neither extraversion nor fear “reduces to” any particular thought, feeling, act, or physiological structure or process. Instead, each is a kind of summary of a complex set of behaviors and internal processes.

The Big Five

The Big Five is a set of five broad dimensions that capture much of the variation in human personality. Each of the Big Five can even be defined in terms of six more specific constructs called “facets” (Costa & McCrae, 1992) [1] .

Table 4.1 The Big Five Personality Dimensions

The  conceptual definition  of a psychological construct describes the behaviors and internal processes that make up that construct, along with how it relates to other variables. For example, a conceptual definition of neuroticism (another one of the Big Five) would be that it is people’s tendency to experience negative emotions such as anxiety, anger, and sadness across a variety of situations. This definition might also include that it has a strong genetic component, remains fairly stable over time, and is positively correlated with the tendency to experience pain and other physical symptoms.

Students sometimes wonder why, when researchers want to understand a construct like self-esteem or neuroticism, they do not simply look it up in the dictionary. One reason is that many scientific constructs do not have counterparts in everyday language (e.g., working memory capacity). More important, researchers are in the business of developing definitions that are more detailed and precise—and that more accurately describe the way the world is—than the informal definitions in the dictionary. As we will see, they do this by proposing conceptual definitions, testing them empirically, and revising them as necessary. Sometimes they throw them out altogether. This is why the research literature often includes different conceptual definitions of the same construct. In some cases, an older conceptual definition has been replaced by a newer one that fits and works better. In others, researchers are still in the process of deciding which of various conceptual definitions is the best.

Operational Definitions

An  operational definition  is a definition of a variable in terms of precisely how it is to be measured. These measures generally fall into one of three broad categories.  Self-report measures  are those in which participants report on their own thoughts, feelings, and actions, as with the Rosenberg Self-Esteem Scale (Rosenberg, 1965) [2] . Behavioral measures  are those in which some other aspect of participants’ behavior is observed and recorded. This is an extremely broad category that includes the observation of people’s behavior both in highly structured laboratory tasks and in more natural settings. A good example of the former would be measuring working memory capacity using the backward digit span task. A good example of the latter is a famous operational definition of physical aggression from researcher Albert Bandura and his colleagues (Bandura, Ross, & Ross, 1961) [3] . They let each of several children play for 20 minutes in a room that contained a clown-shaped punching bag called a Bobo doll. They filmed each child and counted the number of acts of physical aggression the child committed. These included hitting the doll with a mallet, punching it, and kicking it. Their operational definition, then, was the number of these specifically defined acts that the child committed during the 20-minute period. Finally,  physiological measures  are those that involve recording any of a wide variety of physiological processes, including heart rate and blood pressure, galvanic skin response, hormone levels, and electrical activity and blood flow in the brain.

For any given variable or construct, there will be multiple operational definitions. Stress is a good example. A rough conceptual definition is that stress is an adaptive response to a perceived danger or threat that involves physiological, cognitive, affective, and behavioral components. But researchers have operationally defined it in several ways. The Social Readjustment Rating Scale (Holmes & Rahe, 1967) [4] is a self-report questionnaire on which people identify stressful events that they have experienced in the past year; each event is assigned points depending on its severity. For example, a man who has been divorced (73 points), changed jobs (36 points), and had a change in sleeping habits (16 points) in the past year would have a total score of 125. The Hassles and Uplifts Scale (Delongis, Coyne, Dakof, Folkman & Lazarus, 1982) [5] is similar but focuses on everyday stressors like misplacing things and being concerned about one’s weight. The Perceived Stress Scale (Cohen, Kamarck, & Mermelstein, 1983) [6] is another self-report measure that focuses on people’s feelings of stress (e.g., “How often have you felt nervous and stressed?”). Researchers have also operationally defined stress in terms of several physiological variables including blood pressure and levels of the stress hormone cortisol.
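The Social Readjustment Rating Scale’s scoring is just a weighted sum, which the following sketch makes explicit using only the three item weights quoted above.

```python
# Item weights taken from the worked example in the text.
SRRS_POINTS = {
    "divorce": 73,
    "changed jobs": 36,
    "change in sleeping habits": 16,
}

def srrs_score(events):
    """Total stress score = sum of the points for each reported life event."""
    return sum(SRRS_POINTS[event] for event in events)

print(srrs_score(["divorce", "changed jobs", "change in sleeping habits"]))
# -> 125, matching the example above
```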

When psychologists use multiple operational definitions of the same construct—either within a study or across studies—they are using converging operations . The idea is that the various operational definitions are “converging” or coming together on the same construct. When scores based on several different operational definitions are closely related to each other and produce similar patterns of results, this constitutes good evidence that the construct is being measured effectively and that it is useful. The various measures of stress, for example, are all correlated with each other and have all been shown to be correlated with other variables such as immune system functioning (also measured in a variety of ways) (Segerstrom & Miller, 2004) [7] . This is what allows researchers eventually to draw useful general conclusions, such as “stress is negatively correlated with immune system functioning,” as opposed to more specific and less useful ones, such as “people’s scores on the Perceived Stress Scale are negatively correlated with their white blood cell counts.”

Levels of Measurement

The psychologist S. S. Stevens suggested that scores can be assigned to individuals in a way that communicates more or less quantitative information about the variable of interest (Stevens, 1946) [8] . For example, the officials at a 100-m race could simply rank order the runners as they crossed the finish line (first, second, etc.), or they could time each runner to the nearest tenth of a second using a stopwatch (11.5 s, 12.1 s, etc.). In either case, they would be measuring the runners’ times by systematically assigning scores to represent those times. But while the rank ordering procedure communicates the fact that the second-place runner took longer to finish than the first-place finisher, the stopwatch procedure also communicates  how much  longer the second-place finisher took. Stevens actually suggested four different levels of measurement (which he called “scales of measurement”) that correspond to four types of information that can be communicated by a set of scores, and the statistical procedures that can be used with the information.

The  nominal level  of measurement is used for categorical variables and involves assigning scores that are category labels. Category labels communicate whether any two individuals are the same or different in terms of the variable being measured. For example, if you ask your participants about their marital status, you are engaged in nominal-level measurement. Or if you ask your participants to indicate which of several ethnicities they identify themselves with, you are again engaged in nominal-level measurement. The essential point about nominal scales is that they do not imply any ordering among the responses. For example, when classifying people according to their favorite color, there is no sense in which green is placed “ahead of” blue. Responses are merely categorized. Nominal scales thus embody the lowest level of measurement [9] .

The remaining three levels of measurement are used for quantitative variables. The  ordinal level  of measurement involves assigning scores so that they represent the rank order of the individuals. Ranks communicate not only whether any two individuals are the same or different in terms of the variable being measured but also whether one individual is higher or lower on that variable. For example, a researcher wishing to measure consumers’ satisfaction with their microwave ovens might ask them to specify their feelings as either “very dissatisfied,” “somewhat dissatisfied,” “somewhat satisfied,” or “very satisfied.” The items in this scale are ordered, ranging from least to most satisfied. This is what distinguishes ordinal from nominal scales. Unlike nominal scales, ordinal scales allow comparisons of the degree to which two individuals rate the variable. For example, our satisfaction ordering makes it meaningful to assert that one person is more satisfied than another with their microwave ovens. Such an assertion reflects the first person’s use of a verbal label that comes later in the list than the label chosen by the second person.

On the other hand, ordinal scales fail to capture important information that will be present in the other levels of measurement we examine. In particular, the difference between two levels of an ordinal scale cannot be assumed to be the same as the difference between two other levels (just like you cannot assume that the gap between the runners in first and second place is equal to the gap between the runners in second and third place). In our satisfaction scale, for example, the difference between the responses “very dissatisfied” and “somewhat dissatisfied” is probably not equivalent to the difference between “somewhat dissatisfied” and “somewhat satisfied.” Nothing in our measurement procedure allows us to determine whether the two differences reflect the same difference in psychological satisfaction. Statisticians express this point by saying that the differences between adjacent scale values do not necessarily represent equal intervals on the underlying scale giving rise to the measurements. (In our case, the underlying scale is the true feeling of satisfaction, which we are trying to measure.)

The  interval level  of measurement involves assigning scores using numerical scales in which intervals have the same interpretation throughout. As an example, consider either the Fahrenheit or Celsius temperature scales. The difference between 30 degrees and 40 degrees represents the same temperature difference as the difference between 80 degrees and 90 degrees. This is because each 10-degree interval has the same physical meaning (in terms of the kinetic energy of molecules).

Interval scales are not perfect, however. In particular, they do not have a true zero point even if one of the scaled values happens to carry the name “zero.” The Fahrenheit scale illustrates the issue. Zero degrees Fahrenheit does not represent the complete absence of temperature (the absence of any molecular kinetic energy). In reality, the label “zero” is applied to its temperature for quite accidental reasons connected to the history of temperature measurement. Since an interval scale has no true zero point, it does not make sense to compute ratios of temperatures. For example, there is no sense in which the ratio of 40 to 20 degrees Fahrenheit is the same as the ratio of 100 to 50 degrees; no interesting physical property is preserved across the two ratios. After all, if the “zero” label were applied at the temperature that Fahrenheit happens to label as 10 degrees, the two ratios would instead be 30 to 10 and 90 to 40, no longer the same! For this reason, it does not make sense to say that 80 degrees is “twice as hot” as 40 degrees. Such a claim would depend on an arbitrary decision about where to “start” the temperature scale, namely, what temperature to call zero (whereas the claim is intended to make a more fundamental assertion about the underlying physical reality).
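The re-zeroing argument can be checked numerically: shifting the zero point of an interval scale changes the ratios between scores.

```python
# Compare 40/20 and 100/50 degrees Fahrenheit, then re-zero at 10 degrees.
pairs = [(40, 20), (100, 50)]
for shift in (0, 10):
    print(shift, [(a - shift) / (b - shift) for a, b in pairs])
# shift 0:  [2.0, 2.0]   -> the ratios happen to match
# shift 10: [3.0, 2.25]  -> 30/10 vs. 90/40: no longer equal
```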

In psychology, the intelligence quotient (IQ) is often considered to be measured at the interval level. While it is technically possible to receive a score of 0 on an IQ test, such a score would not indicate the complete absence of IQ. Moreover, a person with an IQ score of 140 does not have twice the IQ of a person with a score of 70. However, the difference between IQ scores of 80 and 100 is the same as the difference between IQ scores of 120 and 140.

Finally, the  ratio level  of measurement involves assigning scores in such a way that there is a true zero point that represents the complete absence of the quantity. Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam. You can think of a ratio scale as the three earlier scales rolled up in one. Like a nominal scale, it provides a name or category for each object (the numbers serve as labels). Like an ordinal scale, the objects are ordered (in terms of the ordering of the numbers). Like an interval scale, the same difference at two places on the scale has the same meaning. However, in addition, the same ratio at two places on the scale also carries the same meaning (see Table 4.1).

The Fahrenheit scale for temperature has an arbitrary zero point and is therefore not a ratio scale. However, zero on the Kelvin scale is absolute zero. This makes the Kelvin scale a ratio scale. For example, if one temperature is twice as high as another as measured on the Kelvin scale, then it has twice the kinetic energy of the other temperature.

Another example of a ratio scale is the amount of money you have in your pocket right now (25 cents, 50 cents, etc.). Money is measured on a ratio scale because, in addition to having the properties of an interval scale, it has a true zero point: if you have zero money, this actually implies the absence of money. Since money has a true zero point, it makes sense to say that someone with 50 cents has twice as much money as someone with 25 cents.

Stevens’s levels of measurement are important for at least two reasons. First, they emphasize the generality of the concept of measurement. Although people do not normally think of categorizing or ranking individuals as measurement, in fact, they are as long as they are done so that they represent some characteristic of the individuals. Second, the levels of measurement can serve as a rough guide to the statistical procedures that can be used with the data and the conclusions that can be drawn from them. With nominal-level measurement, for example, the only available measure of central tendency is the mode. With ordinal-level measurement, the median or mode can be used as indicators of central tendency. Interval- and ratio-level measurement are typically considered the most desirable because they permit any indicator of central tendency (i.e., mean, median, or mode) to be computed. Also, ratio-level measurement is the only level that allows meaningful statements about ratios of scores. Once again, one cannot say that someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is measured at the interval level, but one can say that someone with six siblings has twice as many as someone with three because number of siblings is measured at the ratio level.
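As a quick sketch of how the levels constrain the statistics, the example below (with made-up data) uses Python’s statistics module to compute the central-tendency measure appropriate to each level.

```python
import statistics as st

favorite_color = ["green", "blue", "blue", "red"]  # nominal
satisfaction = [1, 2, 2, 3, 4]  # ordinal codes: 1 = very dissatisfied ... 4 = very satisfied
iq_scores = [85, 100, 115, 140]                    # interval
siblings = [0, 1, 2, 3, 6]                         # ratio

print(st.mode(favorite_color))   # mode: the only option at the nominal level
print(st.median(satisfaction))   # median (or mode) at the ordinal level
print(st.mean(iq_scores))        # mean is meaningful at the interval level...
print(st.mean(siblings))         # ...and the ratio level, where ratios also hold
```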

  • Costa, P. T., Jr., & McCrae, R. R. (1992). Normal personality assessment in clinical practice: The NEO Personality Inventory. Psychological Assessment, 4 , 5–13. ↵
  • Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press ↵
  • Bandura, A., Ross, D., & Ross, S. A. (1961). Transmission of aggression through imitation of aggressive models. Journal of Abnormal and Social Psychology, 63 , 575–582. ↵
  • Holmes, T. H., & Rahe, R. H. (1967). The Social Readjustment Rating Scale. Journal of Psychosomatic Research, 11 (2), 213-218. ↵
  • Delongis, A., Coyne, J. C., Dakof, G., Folkman, S., & Lazarus, R. S. (1982). Relationships of daily hassles, uplifts, and major life events to health status. Health Psychology, 1 (2), 119-136. ↵
  • Cohen, S., Kamarck, T., & Mermelstein, R. (1983). A global measure of perceived stress. Journal of Health and Social Behavior, 24, 386-396. ↵
  • Segerstrom, S. E., & Miller, G. E. (2004). Psychological stress and the human immune system: A meta-analytic study of 30 years of inquiry. Psychological Bulletin, 130 , 601–630. ↵
  • Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103 , 677–680. ↵
  • Levels of Measurement. Retrieved from http://wikieducator.org/Introduction_to_Research_Methods_In_Psychology/Theories_and_Measurement/Levels_of_Measurement ↵

Glossary

Measurement: The assignment of scores to individuals so that the scores represent some characteristic of the individuals.

Psychometrics: A subfield of psychology concerned with the theories and techniques of psychological measurement.

Psychological constructs: Psychological variables that represent an individual's mental state or experience, often not directly observable, such as personality traits, emotional states, attitudes, and abilities.

Conceptual definition: Describes the behaviors and internal processes that make up a psychological construct, along with how it relates to other variables.

Operational definition: A definition of a variable in terms of precisely how it is to be measured.

Self-report measures: Measures in which participants report on their own thoughts, feelings, and actions.

Behavioral measures: Measures in which some other aspect of participants’ behavior is observed and recorded.

Physiological measures: Measures that involve recording any of a wide variety of physiological processes, including heart rate and blood pressure, galvanic skin response, hormone levels, and electrical activity and blood flow in the brain.

Converging operations: When psychologists use multiple operational definitions of the same construct—either within a study or across studies.

Levels of measurement: Four categories, or scales, of measurement (i.e., nominal, ordinal, interval, and ratio) that specify the types of information that a set of scores can have, and the types of statistical procedures that can be used with the scores.

Nominal level: A level of measurement used for categorical variables, involving the assignment of scores that are category labels.

Ordinal level: A level of measurement that involves assigning scores so that they represent the rank order of the individuals.

Interval level: A level of measurement that involves assigning scores using numerical scales in which intervals have the same interpretation throughout.

Ratio level: A level of measurement that involves assigning scores in such a way that there is a true zero point that represents the complete absence of the quantity.

Research Methods in Psychology Copyright © 2019 by Rajiv S. Jhangiani, I-Chant A. Chiang, Carrie Cuttler, & Dana C. Leighton is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


11. Quantitative measurement

Chapter Outline

  • Overview of measurement (11 minute read)
  • Operationalization and levels of measurement (20 minute read)
  • Scales and indices (15 minute read)
  • Reliability and validity (20 minute read)
  • Ethical and social justice considerations for measurement (6 minute read)

Content warning: Discussions of immigration issues, parents and gender identity, anxiety, and substance use.

11.1 Overview of measurement

Learning Objectives

Learners will be able to…

  • Provide an overview of the measurement process in social work research
  • Describe why accurate measurement is important for research

This chapter begins with an interesting question: Is my apple the same as your apple? Let’s pretend you want to study apples. Perhaps you have read that chemicals in apples may impact neurotransmitters and you want to test if apple consumption improves mood among college students. So, in order to conduct this study, you need to make sure that you provide apples to a treatment group, right? In order to increase the rigor of your study, you may also want to have a group of students, ones who do not get to eat apples, to serve as a comparison group. Don’t worry if this seems new to you. We will discuss this type of design in Chapter 13. For now, just concentrate on apples.

In order to test your hypothesis about apples, you need to define exactly what is meant by the term “apple” so you can ensure everyone is consuming the same thing, and you need to know what you consider a “dose” of this thing that we call “apple.” So, let’s start by making sure we understand what the term “apple” means. Say you have an object that you identify as an apple and I have an object that I identify as an apple. Perhaps my “apple” is a chocolate apple, one that looks similar to an apple but is made of chocolate and red dye, and yours is a Honeycrisp. Perhaps yours is papier-mache and mine is a MacBook Pro. All of these are identified as apples, right?


You can see the multitude of ways we could conceptualize “apple,” and how that could create a problem for our research. If I get a Red Delicious (ick) apple and you get a Granny Smith (yum) apple and we observe a change in neurotransmitters, it’s going to be even harder than usual to say the apple influenced the neurotransmitters because we didn’t define “apple” well enough. Measurement in this case is essential to treatment fidelity, which means ensuring that everyone receives the same, or as close to the same, treatment as possible. In other words, you need to make sure everyone is consuming the same kind of apples, and you need a way to ensure that you give the same amount of apples to everyone in your treatment group.

In social science, when we use the term  measurement , we mean the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating. At its core, measurement is about defining one’s terms in as clear and precise a way as possible. Of course, measurement in social science isn’t quite as simple as using a measuring cup or spoon, but there are some basic tenets on which most social scientists agree when it comes to measurement. We’ll explore those, as well as some of the ways that measurement might vary depending on your unique approach to the study of your topic.

An important point here is that measurement does not require any particular instruments or procedures. What it does require is some systematic procedure for assigning scores, meanings, and descriptions to individuals or objects so that those scores represent the characteristic of interest. You can measure phenomena in many different ways, but you must be sure that how you choose to measure gives you information and data that lets you answer your research question. If you’re looking for information about a person’s income, but your main points of measurement have to do with the money they have in the bank, you’re not really going to find the information you’re looking for!

What do social scientists measure?

The question of what social scientists measure can be answered by asking yourself what social scientists study. Think about the topics you’ve learned about in other social work classes you’ve taken or the topics you’ve considered investigating yourself. Let’s consider Melissa Milkie and Catharine Warner’s study (2011) [1] of first graders’ mental health. In order to conduct that study, Milkie and Warner needed to have some idea about how they were going to measure mental health. What does mental health mean, exactly? And how do we know when we’re observing someone whose mental health is good and when we see someone whose mental health is compromised? Understanding how measurement works in research methods helps us answer these sorts of questions.

As you might have guessed, social scientists will measure just about anything that they have an interest in investigating. For example, those who are interested in learning something about the correlation between social class and levels of happiness must develop some way to measure both social class and happiness. Those who wish to understand how well immigrants cope in their new locations must measure immigrant status and coping. Those who wish to understand how a person’s gender shapes their workplace experiences must measure gender and workplace experiences. You get the idea. Social scientists can and do measure just about anything you can imagine observing or wanting to study. Of course, some things are easier to observe or measure than others.

In 1964, philosopher Abraham Kaplan (1964) [2] wrote The Conduct of Inquiry, which has since become a classic work in research methodology (Babbie, 2010). [3] In his text, Kaplan describes different categories of things that behavioral scientists observe. One of those categories, which Kaplan called “observational terms,” is probably the simplest to measure in social science. Observational terms are the sorts of things that we can see with the naked eye simply by looking at them. Kaplan roughly defines them as conditions that are easy to identify and verify through direct observation. If, for example, we wanted to know how the conditions of playgrounds differ across different neighborhoods, we could directly observe the variety, amount, and condition of equipment at various playgrounds.

Indirect observables, on the other hand, are less straightforward to assess. In Kaplan’s framework, they are conditions that are subtle and complex and that we must use existing knowledge and intuition to define. If we conducted a study for which we wished to know a person’s income, we’d probably have to ask them their income, perhaps in an interview or a survey. Thus, we have observed income, even if it has only been observed indirectly. Birthplace might be another indirect observable. We can ask study participants where they were born, but chances are good we won’t have directly observed any of those people being born in the locations they report.

How do social scientists measure?

Measurement in social science is a process. It occurs at multiple stages of a research project: in the planning stages, in the data collection stage, and sometimes even in the analysis stage. Recall that previously we defined measurement as the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating. Once we’ve identified a research question, we begin to think about what some of the key ideas are that we hope to learn from our project. In describing those key ideas, we begin the measurement process.

Let’s say that our research question is the following: How do new college students cope with the adjustment to college? In order to answer this question, we’ll need some idea about what coping means. We may come up with an idea about what coping means early in the research process, as we begin to think about what to look for (or observe) in our data-collection phase. Once we’ve collected data on coping, we also have to decide how to report on the topic. Perhaps, for example, there are different types or dimensions of coping, some of which lead to more successful adjustment than others. However we decide to proceed, and whatever we decide to report, the point is that measurement is important at each of these phases.

As the preceding example demonstrates, measurement is a process in part because it occurs at multiple stages of conducting research. We could also think of measurement as a process because it involves multiple stages. From identifying your key terms to defining them to figuring out how to observe them and how to know if your observations are any good, there are multiple steps involved in the measurement process. An additional step in the measurement process involves deciding what elements your measures contain. A measure’s elements might be very straightforward and clear, particularly if they are directly observable. Other measures are more complex and might require the researcher to account for different themes or types. These sorts of complexities require paying careful attention to a concept’s level of measurement and its dimensions. We’ll explore these complexities in greater depth at the end of this chapter, but first let’s look more closely at the early steps involved in the measurement process, starting with conceptualization.

The idea of coming up with your own measurement tool might sound pretty intimidating at this point. The good news is that if you find something in the literature that works for you, you can use it with proper attribution. If there are only pieces of it that you like, you can just use those pieces, again with proper attribution. You don’t always have to start from scratch!

Key Takeaways

  • Measurement (i.e. the measurement process) gives us the language to define/describe what we are studying.
  • In research, when we develop measurement tools, we move beyond concepts that may be subjective and abstract to a definition that is clear and concise.
  • Good social work researchers are intentional with the measurement process.
  • Engaging in the measurement process requires us to think critically about what we want to study. This process may be challenging and potentially time-consuming.
  • How easy or difficult do you believe it will be to study these topics?
  • Think about the chapter on literature reviews. Is there a significant body of literature on the topics you are interested in studying?
  • Are there existing measurement tools that may be appropriate to use for the topics you are interested in studying?

11.2 Operationalization and levels of measurement

Learning Objectives

Learners will be able to…

  • Define constructs and operationalization and describe their relationship
  • Be able to start operationalizing variables in your research project
  • Identify the level of measurement for each type of variable
  • Demonstrate knowledge of how each type of variable can be used

Now we have some ideas about what and how social scientists need to measure, so let’s get into the details. In this section, we are going to talk about how to make your variables measurable (operationalization) and how you ultimately characterize your variables in order to analyze them (levels of measurement).

Operationalizing your variables

“Operationalizing” is not a word I’d ever heard before I became a researcher, and actually, my browser’s spell check doesn’t even recognize it. I promise it’s a real thing, though. In the most basic sense, when we operationalize a variable, we break it down into measurable parts. Operationalization is the process of determining how to measure a construct that cannot be directly observed, and constructs are conditions that are not directly observable and represent states of being, experiences, and ideas. But why construct? We call them constructs because they are built using different ideas and parameters.

As we know from Section 11.1, sometimes the measures that we are interested in are more complex and more abstract than observational terms or indirect observables. Think about some of the things you’ve learned about in other social work classes—for example, ethnocentrism. What is ethnocentrism? Well, from completing an introduction to social work class you might know that it’s a construct that has something to do with the way a person judges another’s culture. But how would you measure it? Here’s another construct: bureaucracy. We know this term has something to do with organizations and how they operate, but measuring such a construct is trickier than measuring, say, a person’s income. In both cases, ethnocentrism and bureaucracy, these theoretical notions represent ideas whose meaning we have come to agree on. Though we may not be able to observe these abstractions directly, we can observe the things that they are made up of.


Now, let’s operationalize bureaucracy and ethnocentrism. The construct of bureaucracy could be measured by counting the number of supervisors that need to approve routine spending by public administrators. The greater the number of administrators that must sign off on routine matters, the greater the degree of bureaucracy. Similarly, we might be able to ask a person the degree to which they trust people from different cultures around the world and then assess the ethnocentrism inherent in their answers. We can measure constructs like bureaucracy and ethnocentrism by defining them in terms of what we can observe.

How we operationalize our constructs (and ultimately measure our variables) can affect the conclusions we can draw from our research. Let’s say you’re reviewing a state program to make it more efficient in connecting people to public services. What might be different if we decide to measure bureaucracy by the number of forms someone has to fill out to get a public service instead of the number of people who have to review the forms, like we talked about above? Maybe you find that there is an unnecessary amount of paperwork based on comparisons to other state programs, so you recommend that some of it be eliminated. This is probably a good thing, but will it actually make the program more efficient like eliminating some of the reviews that paperwork has to go through would? I’m not really making a judgment on which way is better to measure bureaucracy, but I encourage you to think about the costs and benefits of each way we operationalized the construct of bureaucracy, and extend this to the way you operationalize your own concepts in your research project.

Levels of Measurement

Now, we’re going to move into some more concrete characterizations of variables. You now hopefully understand how to operationalize your concepts so that you can turn them into variables. Imagine a process kind of like what you see in Figure 11.1 below.

[Figure 11.1: the operationalization process, with arrows pointing from the construct, through its measurable variables, to the research question]

Notice that the arrows from the construct point toward the research question, because ultimately, measuring them will help answer your question!

The level of measurement of a variable tells us how the values of the variable relate to each other and what mathematical operations we can perform with the variable. (That second part will become important once we move into quantitative analysis in Chapter 14 and Chapter 15.) Many students find this definition a bit confusing. What does it mean when we say that the level of measurement tells us about mathematical operations? So before we move on, let’s clarify this a bit.

Let’s say you work for a community nonprofit that wants to develop programs relevant to community members’ ages (i.e., tutoring for kids in school, job search and resume help for adults, and home visiting for elderly community members). However, you do not have a good understanding of the ages of the people who visit your community center. Below is part of a questionnaire that you developed to collect this information.

  • How old are you? – Under 18 years old – 18-30 years old – 31-50 years old – 51-60 years old – Over 60 years old
  • How old are you? _____ years

Look at the two items on this questionnaire. They both ask about age, but the first item asks the participant to identify an age range, while the second asks for the actual age in years. These two questions give us data that represent the same information measured at different levels.

It would help your agency if you knew the average age of clients, right? So, which item on the questionnaire will provide this information? Item one’s choices are grouped into categories. Can you compute an average age from these choices? No. Conversely, participants completing item two are asked to provide an actual number, one that you could use to determine an average age. In summary, the two items both ask the participants to report their age. However, the type of data collected from both items is different and must be analyzed differently. 
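A short sketch (with invented responses) shows the practical difference: exact ages support an average, while ranges only support counts per category.

```python
from collections import Counter

exact_ages = [19, 24, 35, 47, 62, 71]  # responses to item two
age_ranges = ["18-30", "18-30", "31-50", "31-50", "51-60", "Over 60"]  # item one

print(sum(exact_ages) / len(exact_ages))  # a true average: 43.0
print(Counter(age_ranges))                # the best item one allows: frequencies
```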

We can think about the four levels of measurement as going from less to more specific or, as it’s more commonly called, from lower to higher: nominal, ordinal, interval, and ratio. Each of these levels differs and helps the researcher understand something about their data. Think about levels of measurement as a hierarchy.

In order to determine the level of measurement, examine your data and then ask these four questions, in order (a minimal code sketch of this decision procedure follows the list).

1. Do I have mutually exclusive categories? If the answer is yes, continue to question #2.
2. Do my item choices have a hierarchy or order? In other words, can you put your item choices in order? If no, stop–you have nominal level data. If the answer is yes, continue to question #3.
3. Can I add, subtract, divide, and multiply my answer choices? If no, stop–you have ordinal level data. If the answer is yes, continue to question #4.
4. Is it possible that the answer to this item can be zero? If the answer is no–you have interval level data. If the answer is yes, you are at the ratio level of measurement.
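Here is that four-question procedure expressed as a minimal decision function, with each property supplied as a yes/no answer.

```python
def level_of_measurement(mutually_exclusive: bool, ordered: bool,
                         arithmetic_meaningful: bool, zero_possible: bool) -> str:
    if not mutually_exclusive:
        return "not a valid measurement"  # question 1 must be yes
    if not ordered:
        return "nominal"                  # stopped at question 2
    if not arithmetic_meaningful:
        return "ordinal"                  # stopped at question 3
    if not zero_possible:
        return "interval"                 # stopped at question 4
    return "ratio"

# An item with ordered, math-friendly answers where zero is not possible:
print(level_of_measurement(True, True, True, False))  # -> "interval"
```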

Nominal level. The nominal level of measurement is the lowest level of measurement. It contains categories that are mutually exclusive, which means that anyone who falls into one category cannot fall into another category. The data can be represented with words (like yes/no) or numbers that correspond to words or a category (like 1 equaling yes and 0 equaling no). Even when the categories are represented as numbers in our data, the number itself does not have an actual numerical value; it is merely a number we have assigned so that we can use the variable in mathematical operations (which we will start talking about in Chapter 14.1). We say this level of measurement is lowest or least specific because someone who falls into a category we’ve designated could differ from someone else in the same category. Let’s say on our questionnaire above, we also asked folks whether they own a car. They can answer yes or no, and they fall into mutually exclusive categories. In this case, we would know whether they own a car, but not whether owning a car really affects their life significantly. Maybe they have chosen not to own one and are happy to take the bus, bike, or walk. Maybe they do not own one but would like to own one. We cannot get this information from a nominal variable, which is okay when we have meaningful categories. Nominal variables are especially useful when we just need the frequency of a particular characteristic in our sample.

The nominal level of measurement usually includes many demographic characteristics like race, gender, or marital status.

Ordinal level. The ordinal level of measurement is the next level of measurement and contains slightly more specific information than the nominal level. This level has mutually exclusive categories and a hierarchy or order. Let’s go back to the first item on the questionnaire we talked about above.

Do we have mutually exclusive categories? Yes. Someone who selects item A cannot also select item B. So, we know that we have at least nominal level data. However, the next question that we need to ask is “Do my answer choices have order?” or “Can I put my answer choices in order?” The answer is yes, someone who selects A is younger than someone who selects B or C. So, you have at least ordinal level data.

From a data analysis and statistical perspective, ordinal variables get treated exactly like nominal variables because they are both categorical variables , or variables whose values are organized into mutually exclusive groups but whose numerical values cannot be used in mathematical operations. You’ll see this term used again when we get into bivariate analysis in Chapter 15.

Interval level. The interval level of measurement is a higher level of measurement. It contains all of the characteristics of the previous levels (mutually exclusive categories and order). What distinguishes it from the ordinal level is that the interval level marks the point where we are able to conduct mathematical computations with our data (like an average, for instance).

Let’s think back to our questionnaire about age again and take a look at the second question where we asked for a person’s exact age in years. Age in years is mutually exclusive – someone can’t be 14 and 15 at the same time – and the order of ages is meaningful, since being 18 means something different than being 32. Now, we can also take the answers to this question and do math with them, like addition, subtraction, multiplication, and division.

Ratio level . Ratio level data is the highest level of measurement. It has mutually exclusive categories, order, and you can perform mathematical operations on it. The main difference between the interval and ratio levels is that the ratio level has an absolute zero, meaning that a value of zero is both possible and meaningful. You might be thinking, “Well, age has an absolute zero,” but someone who is not yet born does not have an age, and the minute they’re born, they are not zero years old anymore.

Data at the ratio level of measurement are usually amounts or numbers of things, and can be negative (if that makes conceptual sense, of course). For example, you could ask someone to report how many A’s they have on their transcript or how many semesters they have earned a 4.0. They could have zero A’s and that would be a valid answer.

From a data analysis and statistical perspective, interval and ratio variables are treated exactly the same because they are both continuous variables , or variables whose values are mutually exclusive and can be used in mathematical operations. Technically, a continuous variable could have an infinite number of values.

What does the level of measurement tell us?

We have spent time learning how to determine our data’s level of measurement. Now what? How could we use this information to help us as we measure concepts and develop measurement tools? First, the types of statistical tests that we are able to use are dependent on our data’s level of measurement. (We will discuss this soon in Chapter 15.) The higher the level of measurement, the more complex statistical tests we are able to conduct. This knowledge may help us decide what kind of data we need to gather, and how. That said, we have to balance this knowledge with the understanding that sometimes, collecting data at a higher level of measurement could negatively impact our studies. For instance, sometimes providing answers in ranges may make prospective participants feel more comfortable responding to sensitive items. Imagine that you were interested in collecting information on topics such as income, number of sexual partners, number of times used illicit drugs, etc. You would have to think about the sensitivity of these items and determine if it would make more sense to collect some data at a lower level of measurement.

Finally, sometimes when analyzing data, researchers find a need to change a variable’s level of measurement. For example, a few years ago, one of my students was interested in studying the relationship between mental health and life satisfaction. She collected a variety of data. One item asked about the number of mental health diagnoses, reported as the actual number. When analyzing the data, she examined the mental health diagnosis variable and noticed that she had two groups: those with no or one diagnosis and those with many diagnoses. Instead of using the ratio level data (actual number of mental health diagnoses), she collapsed her cases into two categories, few and many, and used this variable in her analyses, as shown in the sketch below. It is important to note that you can collapse data from a higher level of measurement to a lower level; however, you cannot move data from a lower level to a higher one.
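
A minimal sketch of that recoding, using invented diagnosis counts:

```python
# Ratio-level data: actual counts of diagnoses per participant (made up)
diagnosis_counts = [0, 1, 4, 0, 6, 1, 3]

def collapse(count):
    """Recode a count into one of two ordered categories."""
    return "few" if count <= 1 else "many"

collapsed = [collapse(c) for c in diagnosis_counts]
print(collapsed)  # ['few', 'few', 'many', 'few', 'many', 'few', 'many']

# Note the one-way street: from 'few'/'many' alone you can never recover
# the original counts, which is why you can move down the hierarchy of
# measurement but not back up.
```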

  • Operationalization involves figuring out how to measure a construct you cannot directly observe.
  • Nominal variables have mutually exclusive categories with no natural order. They cannot be used for mathematical operations like addition or subtraction. Race and gender are examples.
  • Ordinal variables have mutually exclusive categories  and a natural order. They also cannot be used for mathematical operations like addition or subtraction. Age when measured in categories (i.e., 18-25 years old) would be an example.
  • Interval variables have mutually exclusive categories, a natural order, and can be used for mathematical operations. Age as a raw number would be an example.
  • Ratio variables have mutually exclusive categories, a natural order, can be used for mathematical operations, and have an absolute zero value. The number of times someone calls a legislator to advocate for a policy would be an example.
  • Nominal and ordinal variables are categorical variables, meaning they have mutually exclusive categories and cannot be used for mathematical operations, even when assigned a number.
  • Interval and ratio variables are continuous variables, meaning their values are mutually exclusive and can be used in mathematical operations.
  • Researchers should consider the costs and benefits of how they operationalize their variables, including what level of measurement they choose, since the level of measurement can affect how you must gather your data.
  • What are the primary constructs being explored in the research?
  • Could you (or the study authors) have chosen another way to operationalize this construct?
  • What are these variables’ levels of measurement?
  • Are they categorical or continuous?

11.3 Scales and indices

  • Identify different types of scales and compare them to each other
  • Understand how to begin the process of constructing scales or indices

Quantitative data analysis requires the construction of two types of measures of variables: indices and scales. These composite measures are frequently used and important because social scientists often study variables that possess no clear and unambiguous indicators (unlike, say, age or gender). First, researchers often focus on the attitudes and orientations of a group of people, which require several items to adequately capture the variable. Second, researchers often seek to rank cases in ordinal categories from very low to very high (or vice versa), which a single data item cannot reliably do but an index or scale can.

Although they exhibit differences (which will be discussed later), the two have several features in common.

  • Both are ordinal measures of variables.
  • Both can order the units of analysis in terms of specific variables.
  • Both are composite measures of variables (measurements based on more than one data item).

In general, an index is a sum of a series of individual yes/no questions that are combined into a single numeric score. Indices are usually a measure of the quantity of some social phenomenon and are constructed at a ratio level of measurement. More sophisticated indices weight individual items according to their importance in the concept being measured (e.g., a multiple-choice test where different questions are worth different numbers of points). Some interval-level indices are not weighted counts but contain other indices or scales within them (e.g., college admissions scores that combine an applicant’s GPA, SAT scores, and essays, assigning a different point value to each source). A sketch of both approaches follows.
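
Here is the promised sketch of a simple summed index and a weighted one; the items and weights are hypothetical:

```python
# Unweighted index: sum of yes/no answers coded 1/0
responses = [1, 0, 1, 1]        # four yes/no items from one respondent
simple_index = sum(responses)   # 3

# Weighted index: items count for different amounts, like test
# questions worth different numbers of points
weights = [1, 2, 2, 5]
weighted_index = sum(r * w for r, w in zip(responses, weights))  # 1 + 0 + 2 + 5 = 8

print(simple_index, weighted_index)
```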

This section discusses two formats used for measurement in research: scales and indices (sometimes called indexes). These two formats are helpful in research because they use multiple indicators to develop a composite (or total) score. Composite scores provide a much greater understanding of concepts than a single item could. Although we won’t delve too deeply into the process of scale development, we will cover some important topics for you to understand how scales and indices can be used.

Types of scales

As a student, you are very familiar with end-of-semester course evaluations. These evaluations usually include statements such as, “My instructor created an environment of respect” and ask students to use a scale to indicate how much they agree or disagree with the statements. These scales, if developed and administered appropriately, provide a wealth of information to instructors that may be used to refine and update courses. If you examine the end-of-semester evaluations, you will notice that they are organized, use language that is specific to your course, and have very intentional methods of implementation. In essence, these tools are developed to encourage completion.

As you read about these scales, think about the information that you want to gather from participants. What type or types of scales would be the best for you to use and why? Are there existing scales or do you have to create your own?

The Likert scale

Most people have seen some version of a Likert scale. Designed by Rensis Likert (Likert, 1932) [4], a Likert scale is a very popular rating scale for measuring ordinal data in social work research. This scale includes Likert items that are simply worded statements to which participants can indicate their extent of agreement or disagreement on a five- or seven-point scale ranging from “strongly disagree” to “strongly agree.” You will also see Likert scales used for importance, quality, frequency, and likelihood, among lots of other concepts. Below is an example of how we might use a Likert scale to assess your attitudes about research as you work your way through this textbook.
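
As a hedged illustration, here is how such Likert responses might be scored in practice. The item wording, anchors, and responses below are invented for demonstration and are not from an actual instrument:

```python
# Map each anchor to a number (a common five-point coding)
likert_anchors = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

items = [
    "I feel confident reading research articles.",
    "Research methods are relevant to my practice.",
    "I enjoy working with data.",
]

one_participant = ["agree", "strongly agree", "neutral"]

# A common approach: convert anchors to numbers, then sum or average
scores = [likert_anchors[answer] for answer in one_participant]
print(sum(scores))                # 12 (total score)
print(sum(scores) / len(scores))  # 4.0 (mean score)
```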

Likert scales are excellent ways to collect information. They are popular; thus, your prospective participants may already be familiar with them. However, they do pose some challenges. You have to be very clear about your question prompts. What does strongly agree mean, and how is it differentiated from agree? In order to clarify this for participants, some researchers will place definitions of these items at the beginning of the tool.

There are a few other, less commonly used, scales discussed next.

Semantic differential scale

This is a composite (multi-item) scale where respondents are asked to indicate their opinions or feelings toward a single statement using different pairs of adjectives framed as polar opposites. For instance, in a Likert scale, the participant is asked how much they agree or disagree with a statement. In a semantic differential scale, the participant is asked to indicate how they feel about a specific item. This makes the semantic differential scale an excellent technique for measuring people’s attitudes or feelings toward objects, events, or behaviors. The following is an example of a semantic differential scale that was created to assess participants’ feelings about the content taught in their research class.

Feelings About My Research Class

Directions: Please review the pair of words and then select the one that most accurately reflects your feelings about the content of your research class.

Boring……………………………………….Exciting

Waste of Time…………………………..Worthwhile

Dry…………………………………………….Engaging

Irrelevant…………………………………..Relevant
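
If you wanted to score such an instrument, one common approach is to map each adjective pair onto a numeric line. The 1–7 scoring rule below is an assumption for illustration, not something specified by the example itself:

```python
pairs = [
    ("Boring", "Exciting"),
    ("Waste of Time", "Worthwhile"),
    ("Dry", "Engaging"),
    ("Irrelevant", "Relevant"),
]

# Suppose a participant marks a position from 1 (left adjective)
# to 7 (right adjective) for each pair:
marks = [6, 5, 6, 7]

# Higher totals indicate more positive feelings about the class
print(sum(marks))  # 24 out of a possible 28
```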

Guttman scale

This composite scale was designed by Louis Guttman and uses a series of items arranged in increasing order of intensity (least intense to most intense) of the concept. This type of scale allows us to understand the intensity of beliefs or feelings. Each item in a Guttman scale has a weight (not indicated on the tool itself) which varies with the intensity of that item, and the weighted combination of each response is used as an aggregate measure of an observation. Let’s pretend that you are working with a group of parents whose children have identified as part of the transgender community. You want to know how comfortable they feel with their children. You could develop the following items.

Example Guttman Scale Items

  • I would allow my child to use a name that was not gender-specific (e.g., Ryan, Taylor): Yes/No
  • I would allow my child to wear clothing of the opposite gender (e.g., dresses for boys): Yes/No
  • I would allow my child to use the pronoun of the opposite sex: Yes/No
  • I would allow my child to live as the opposite gender: Yes/No

Notice how the items move from lower intensity to higher intensity. A researcher reviews the yes answers and creates a score for each participant, as in the sketch below.
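
Here is the scoring sketch. The weights are hypothetical, since the text notes only that they rise with item intensity:

```python
weights = [1, 2, 3, 4]                # one weight per item, least to most intense
answers = [True, True, True, False]   # yes/no responses from one parent

score = sum(w for w, yes in zip(weights, answers) if yes)
print(score)  # 6: endorsed the three least intense items

# In a well-behaved Guttman scale, a "yes" on an intense item implies
# "yes" on all less intense items, so the response pattern itself
# (here: yes, yes, yes, no) is informative, not just the total.
```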

Indices (Indexes)

An index is a composite score derived from aggregating measures of multiple concepts (called components) using a set of rules and formulas. It is different from a scale. Scales also aggregate measures; however, these measures examine different dimensions or the same dimension of a single construct. A well-known example of an index is the consumer price index (CPI), which is computed every month by the Bureau of Labor Statistics of the U.S. Department of Labor. The CPI is a measure of how much consumers have to pay for goods and services (in general) and is divided into eight major categories (food and beverages, housing, apparel, transportation, healthcare, recreation, education and communication, and “other goods and services”), which are further subdivided into more than 200 smaller items. Each month, government employees call all over the country to get the current prices of more than 80,000 items. Using a complicated weighting scheme that takes into account the location and probability of purchase for each item, analysts then combine these prices into an overall index score using a series of formulas and rules.

Another example of an index is the Duncan Socioeconomic Index (SEI). This index is used to quantify a person’s socioeconomic status (SES) and is a combination of three concepts: income, education, and occupation. Income is measured in dollars, education in years or degrees achieved, and occupation is classified into categories or levels by status. These very different measures are combined to create an overall SES index score. However, SES index measurement has generated a lot of controversy and disagreement among researchers.

The process of creating an index is similar to that of a scale. First, conceptualize (define) the index and its constituent components. Though this appears simple, there may be a lot of disagreement on what components (concepts/constructs) should be included or excluded from an index. For instance, in the SES index, isn’t income correlated with education and occupation? And if so, should we include one component only or all three components? Reviewing the literature, using theories, and/or interviewing experts or key stakeholders may help resolve this issue. Second, operationalize and measure each component. For instance, how will you categorize occupations, particularly since some occupations may have changed with time (e.g., there were no Web developers before the Internet)? Third, create a rule or formula for calculating the index score. Again, this process may involve a lot of subjectivity. Lastly, validate the index score using existing or new data.
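
As a hedged illustration of the third step (creating a rule or formula), the sketch below standardizes each component before averaging. This is one common choice, not the actual Duncan SEI formula, and all of the values are invented:

```python
from statistics import mean, pstdev  # population SD, for simplicity

def standardize(values):
    """Convert raw values to z-scores so unlike components are comparable."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

incomes = [32_000, 58_000, 91_000]   # dollars
education = [12, 16, 20]             # years of schooling
occupation = [2, 3, 5]               # hypothetical status levels

components = [standardize(c) for c in (incomes, education, occupation)]

# One possible rule: each person's index is the mean of their z-scores
ses_index = [mean(person) for person in zip(*components)]
print([round(s, 2) for s in ses_index])
```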

Differences Between Scales and Indices

Though indices and scales yield a single numerical score or value representing a concept of interest, they are different in many ways. First, indices often comprise components that are very different from each other (e.g., income, education, and occupation in the SES index) and are measured in different ways. Conversely, scales typically involve a set of similar items that use the same rating scale (such as a five-point Likert scale about customer satisfaction).

Second, indices often combine objectively measurable values such as prices or income, while scales are designed to assess subjective or judgmental constructs such as attitude, prejudice, or self-esteem. Some argue that the sophistication of the scaling methodology makes scales different from indexes, while others suggest that indexing methodology can be equally sophisticated. Nevertheless, indexes and scales are both essential tools in social science research.

A note on scales and indices

Scales and indices seem like clean, convenient ways to measure different phenomena in social science, but just like with a lot of research, we have to be mindful of the assumptions and biases underneath. What if a scale or an index was developed using only White women as research participants? Is it going to be useful for other groups? It very well might be, but when using a scale or index on a group for whom it hasn’t been tested, it will be very important to evaluate the validity and reliability of the instrument, which we address in the next section.

It’s important to note that while scales and indices are often made up of nominal or ordinal items, when we combine them into composite scores, we treat those scores as interval/ratio variables.

  • Scales and indices are common ways to collect information and involve using multiple indicators in measurement.
  • A key difference between a scale and an index is that a scale contains multiple indicators for one concept, whereas an index examines multiple concepts (components).
  • In order to create scales or indices, researchers must have a clear understanding of the indicators for what they are studying.
  • What is the level of measurement for each item on each tool? Take a second and think about why the tool’s creator decided to include these levels of measurement. Identify any levels of measurement you would change and why.
  • If these tools don’t exist for what you are interested in studying, why do you think that is?

11.4 Reliability and validity in measurement

  • Discuss measurement error, its different types, and how to minimize the probability of each
  • Differentiate between reliability and validity and understand how these are related to each other and relevant to understanding the value of a measurement tool
  • Compare and contrast the types of reliability and demonstrate how to evaluate each type
  • Compare and contrast the types of validity and demonstrate how to evaluate each type

The previous chapter provided insight into measuring concepts in social work research. We discussed the importance of identifying concepts and their corresponding indicators as a way to help us operationalize them. In essence, we now understand that when we think about our measurement process, we must be intentional and thoughtful in the choices that we make. Before we talk about how to evaluate our measurement process, let’s discuss why we want to evaluate our process. We evaluate our process so that we minimize our chances of error . But what is measurement error?

Types of Errors

We need to be concerned with two types of errors in measurement: systematic and random errors. Systematic errors are errors that are generally predictable. These are errors that “are due to the process that biases the results.” [5] For instance, my cat stepping on the scale with me each morning is a systematic error in measuring my weight. I could predict that each measurement would be off by 13 pounds. (He’s a bit of a chonk.)

There are multiple categories of systematic errors.

  • Social desirability occurs when you ask participants a question and they answer in the way that they feel is most socially desirable. For instance, let’s imagine that you want to understand the level of prejudice that participants feel regarding immigrants and decide to conduct face-to-face interviews with participants. Some participants may feel compelled to answer in a way that indicates that they are less prejudiced than they really are.
  • Acquiescence bias occurs when participants answer items in some type of pattern, usually skewed to more favorable responses. For example, imagine that you took a research class and loved it. The professor was great and you learned so much. When asked to complete the end-of-course questionnaire, you immediately mark “strongly agree” on all items without really reading them. After all, you really loved the class. However, instead of reading and reflecting on each item, you “acquiesced” and used your overall impression of the experience to answer all of the items.
  • Leading questions are questions worded in a way that “leads” the participant to a specific answer. For instance, think about the question, “Have you ever hurt a sweet, innocent child?” Most people, regardless of their true response, may answer “no” simply because the wording of the question leads them to believe that “no” is the correct answer.

In order to minimize these types of errors, you should think about what you are studying and examine potential public perceptions of this issue. Next, think about how your questions are worded and how you will administer your tool (we will discuss these in greater detail in the next chapter). This will help you determine if your methods inadvertently increase the probability of these types of errors. 

These errors differ from random errors, which are “due to chance and are not systematic in any way.” [6] Sometimes it is difficult to “tease out” random errors. When you take your statistics class, you will learn more about random errors and what to do about them. They’re hard to observe until you start diving deeper into statistical analysis, so put a pin in them for now.

Now that we have a good understanding of the two types of errors, let’s discuss what we can do to evaluate our measurement process and minimize the chances of these occurring. Remember, quality projects are clear on what is measured, how it is measured, and why it is measured. In addition, quality projects are attentive to the appropriateness of measurement tools and evaluate whether tools are used correctly and consistently. But how do we do that? Good researchers do not simply assume that their measures work. Instead, they collect data to demonstrate that they work. If their research does not demonstrate that a measure works, they stop using it. There are two key factors to consider in deciding whether your measurements are good: reliability and validity.

Reliability

Reliability refers to the consistency of a measure. Psychologists consider three types of reliability: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Test-retest reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time and then using it again on the same group of people at a later time, with no intervention taking place in between. Once you have these two measurements, you look at the correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing the correlation coefficient. Figure 11.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. The correlation coefficient for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.

Figure 11.2: A scatterplot with scores at time 1 on the x-axis and scores at time 2 on the y-axis, both ranging from 0 to 30. The dots indicate a strong, positive correlation.
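
In code, the check looks like the sketch below. The scores are invented, and the correlation function comes from Python’s standard library (available in Python 3.10+):

```python
from statistics import correlation  # Pearson's r

time1 = [22, 25, 18, 29, 24, 20, 27]  # self-esteem scores at time 1
time2 = [21, 26, 17, 28, 25, 19, 26]  # same people, one week later

r = correlation(time1, time2)
print(round(r, 2))  # +.80 or greater is conventionally considered good
```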

Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.

Internal consistency

Another kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials.
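
The text describes internal consistency conceptually and does not name a statistic; Cronbach’s alpha is one widely used option, sketched below on invented responses:

```python
from statistics import variance  # sample variance (n - 1 denominator)

# Five people answering a four-item scale scored 1-5 (made-up data)
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
]

k = len(responses[0])                                   # number of items
item_vars = [variance(col) for col in zip(*responses)]  # per-item variance
total_var = variance([sum(row) for row in responses])   # variance of totals

# Alpha compares summed item variance to total-score variance
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # 0.94 here; values nearer 1.0 = more consistent
```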

Interrater reliability

Many behavioral measures involve significant judgment on the part of an observer or a rater. Interrater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does, in fact, have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other.
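
A brief sketch of that check, applying the same correlation approach to two raters’ judgments (the ratings are invented):

```python
from statistics import correlation  # Pearson's r, Python 3.10+

# Two observers rate the same ten students' social skills on a 1-10 scale
rater_a = [7, 4, 8, 5, 9, 3, 6, 7, 5, 8]
rater_b = [6, 4, 9, 5, 8, 3, 7, 6, 5, 8]

print(round(correlation(rater_a, rater_b), 2))  # close to 1.0 = strong agreement
```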

Validity

Validity, another key element of assessing measurement quality, is the extent to which the scores from a measure represent the variable they are intended to measure. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.

Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure.

Face validity

Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.

Content validity

Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then their measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that they think positive thoughts about exercising, feel good about exercising, and actually exercise. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Criterion validity

Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity ; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).

Discriminant validity

Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.

Increasing the reliability and validity of measures

We have reviewed the types of errors and how to evaluate our measures based on reliability and validity considerations. However, what can we do while selecting or creating our tool to minimize the potential for errors? Many of our options were covered in our discussion about reliability and validity. Nevertheless, the takeaways below provide a quick summary of things that you should do when creating or selecting a measurement tool.

  • In measurement, two types of errors can occur: systematic, which we might be able to predict, and random, which are difficult to predict but can sometimes be addressed during statistical analysis.
  • There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
  • Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
  • Once you have used a measure, you should reevaluate its reliability and validity based on your new data. Remember that the assessment of reliability and validity is an ongoing process.
  • Provide a clear statement regarding the reliability and validity of these tools. What strengths did you notice? What were the limitations?
  • Think about your target population. Are there changes that need to be made in order for one of these tools to be appropriate for your population?
  • If you decide to create your own tool, how will you assess its validity and reliability?

11.5 Ethical and social justice considerations for measurement

  • Identify potential cultural, ethical, and social justice issues in measurement.

Just like with other parts of the research process, how we decide to measure what we are researching is influenced by our backgrounds, including our culture, implicit biases, and individual experiences. For me as a middle-class, cisgender white woman, the decisions I make about measurement will probably default to ones that make the most sense to me and others like me, and thus measure characteristics of people like us most accurately, if I don't think carefully about it. There are major implications for research here because this could affect the validity of my measurements for other populations.

This doesn't mean that standardized scales or indices, for instance, won't work for diverse groups of people. What it means is that researchers must not ignore difference in deciding how to measure a variable in their research. Doing so may serve to push already marginalized people further into the margins of academic research and, consequently, social work intervention. Social work researchers, with our strong orientation toward celebrating difference and working for social justice, are obligated to keep this in mind for ourselves and encourage others to think about it in their research, too.

This involves reflecting on what we are measuring, how we are measuring, and why we are measuring. Do we have biases that impacted how we operationalized our concepts? Did we include stakeholders and gatekeepers in the development of our concepts? This can be a way to gain access to vulnerable populations. What feedback did we receive on our measurement process, and how was it incorporated into our work? These are all questions we should ask as we are thinking about measurement. Further, engaging in this intentionally reflective process will help us maximize the chances that our measurement will be accurate and as free from bias as possible.

The NASW Code of Ethics discusses social work research and the importance of engaging in practices that do not harm participants. [14] This is especially important considering that many of the topics studied by social workers are those that are disproportionately experienced by marginalized and oppressed populations. Some of these populations have had negative experiences with the research process: historically, their stories have been viewed through lenses that reinforced the dominant culture's standpoint. Thus, when thinking about measurement in research projects, we must remember that the way in which concepts or constructs are measured will impact how marginalized or oppressed persons are viewed.  It is important that social work researchers examine current tools to ensure appropriateness for their population(s). Sometimes this may require researchers to use or adapt existing tools. Other times, this may require researchers to develop completely new measures. In summary, the measurement protocols selected should be tailored and attentive to the experiences of the communities to be studied.

But it's not just about reflecting on and identifying problems and biases in our measurement, operationalization, and conceptualization: what are we going to do about it? Consider this as you move through this book and become a more critical consumer of research. Sometimes there isn't something you can do in the immediate sense; the literature base at this moment just is what it is. But how does that inform what you will do later?

  • Social work researchers must be attentive to personal and institutional biases in the measurement process that affect marginalized groups.
  • What are the potential social justice considerations surrounding your methods?
  • What are some strategies you could employ to ensure that you engage in ethical research?
  • Milkie, M. A., & Warner, C. H. (2011). Classroom learning environments and the mental health of first grade children. Journal of Health and Social Behavior, 52, 4–22.
  • Kaplan, A. (1964). The conduct of inquiry: Methodology for behavioral science. San Francisco, CA: Chandler Publishing Company.
  • Earl Babbie offers a more detailed discussion of Kaplan’s work in his text. You can read it in: Babbie, E. (2010). The practice of social research (12th ed.). Belmont, CA: Wadsworth.
  • Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1–55.
  • Engel, R., & Schutt, R. (2013). The practice of research in social work (3rd ed.). Thousand Oaks, CA: SAGE.
  • Engel, R., & Schutt, R. (2013). The practice of research in social work (3rd ed.). Thousand Oaks, CA: SAGE.
  • Sullivan, G. M. (2011). A primer on the validity of assessment instruments. Journal of Graduate Medical Education, 3(2), 119–120. doi:10.4300/JGME-D-11-00075.1
  • https://www.socialworkers.org/about/ethics/code-of-ethics/code-of-ethics-english

The process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating.

In measurement, conditions that are easy to identify and verify through direct observation.

In measurement, conditions that are subtle and complex that we must use existing knowledge and intuition to define.

The process of determining how to measure a construct that cannot be directly observed.

Conditions that are not directly observable and represent states of being, experiences, and ideas.

“a logical grouping of attributes that can be observed and measured and is expected to vary from person to person in a population” (Gillespie & Wagner, 2018, p. 9)

The level that describes the types of operations that can be conducted with your data. There are four levels: nominal, ordinal, interval, and ratio.

Level of measurement that follows nominal level. Has mutually exclusive categories and a hierarchy (order).

A higher level of measurement. Denoted by having mutually exclusive categories, a hierarchy (order), and equal spacing between values. This last item means that values may be added, subtracted, divided, and multiplied.

The highest level of measurement. Denoted by mutually exclusive categories, a hierarchy (order), values can be added, subtracted, multiplied, and divided, and the presence of an absolute zero.

variables whose values are organized into mutually exclusive groups but whose numerical values cannot be used in mathematical operations.

variables whose values are mutually exclusive and can be used in mathematical operations

The difference between the value that we get when we measure something and the true value.

Errors that are generally predictable.

Errors that lack any perceptible pattern.

The ability of a measurement tool to measure a phenomenon the same way, time after time. Note: Reliability does not imply validity.

The extent to which scores obtained on a scale or other measure are consistent across time

The extent to which different observers are consistent in their assessment or rating of a particular characteristic or item.

The extent to which the scores from a measure represent the variable they are intended to.

The extent to which a measurement method appears “on its face” to measure the construct of interest

The extent to which a measure “covers” the construct of interest, i.e., its comprehensiveness in measuring the construct.

The extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with.

A type of criterion validity. Examines how well a tool provides the same scores as an already existing tool administered at the same point in time.

A type of criterion validity that examines how well your tool predicts a future criterion.

the group of people whose needs your study addresses

individuals or groups who have an interest in the outcome of the study you conduct

the people or organizations who control access to the population you want to study

Graduate research methods in social work Copyright © 2020 by Matthew DeCarlo, Cory Cummings, Kate Agnelli is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.



6.1 Measurement

Learning objectives.

  • Define measurement.
  • Describe Kaplan’s three categories of the things that social scientists measure.
  • Identify the stages at which measurement is important.

Measurement is important. Recognizing that fact, and respecting it, will be of great benefit to you—both in research methods and in other areas of life as well. If, for example, you have ever baked a cake, you know well the importance of measurement. As someone who much prefers rebelling against precise rules over following them, I once learned the hard way that measurement matters. A couple of years ago I attempted to bake my husband a birthday cake without the help of any measuring utensils. I’d baked before, I reasoned, and I had a pretty good sense of the difference between a cup and a tablespoon. How hard could it be? As it turns out, it’s not easy guesstimating precise measures. That cake was the lumpiest, most lopsided cake I’ve ever seen. And it tasted kind of like Play-Doh. Figure 6.1 depicts the monstrosity I created, all because I did not respect the value of measurement.


Measurement is important in baking and in research.

Just as measurement is critical to successful baking, it is as important to successfully pulling off a social scientific research project. In sociology, when we use the term measurement, we mean the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating. At its core, measurement is about defining one’s terms in as clear and precise a way as possible. Of course, measurement in social science isn’t quite as simple as using some predetermined or universally agreed-on tool, such as a measuring cup or spoon, but there are some basic tenets on which most social scientists agree when it comes to measurement. We’ll explore those as well as some of the ways that measurement might vary depending on your unique approach to the study of your topic.

What Do Social Scientists Measure?

The question of what social scientists measure can be answered by asking oneself what social scientists study. Think about the topics you’ve learned about in other sociology classes you’ve taken or the topics you’ve considered investigating yourself. Or think about the many examples of research you’ve read about in this text. In Chapter 2 "Linking Methods With Theory" we learned about Melissa Milkie and Catharine Warner’s (2011) study of first graders’ mental health. In order to conduct that study, Milkie and Warner needed to have some idea about how they were going to measure mental health. What does mental health mean, exactly? And how do we know when we’re observing someone whose mental health is good and when we see someone whose mental health is compromised? Understanding how measurement works in research methods helps us answer these sorts of questions.

As you might have guessed, social scientists will measure just about anything that they have an interest in investigating. For example, those who are interested in learning something about the correlation between social class and levels of happiness must develop some way to measure both social class and happiness. Those who wish to understand how well immigrants cope in their new locations must measure immigrant status and coping. Those who wish to understand how a person’s gender shapes their workplace experiences must measure gender and workplace experiences. You get the idea. Social scientists can and do measure just about anything you can imagine observing or wanting to study. Of course, some things are easier to observe, or measure, than others, and the things we might wish to measure don’t necessarily all fall into the same category of measurables.

In 1964, philosopher Abraham Kaplan (1964) wrote what has since become a classic work in research methodology, The Conduct of Inquiry (Babbie, 2010). In his text, Kaplan describes different categories of things that behavioral scientists observe. One of those categories, which Kaplan called “observational terms,” is probably the simplest to measure in social science. Observational terms are the sorts of things that we can see with the naked eye simply by looking at them. They are terms that “lend themselves to easy and confident verification” (Kaplan, 1964, p. 54). If, for example, we wanted to know how the conditions of playgrounds differ across different neighborhoods, we could directly observe the variety, amount, and condition of equipment at various playgrounds.

Indirect observables, on the other hand, are less straightforward to assess. They are “terms whose application calls for relatively more subtle, complex, or indirect observations, in which inferences play an acknowledged part. Such inferences concern presumed connections, usually causal, between what is directly observed and what the term signifies” (Kaplan, 1964, p. 55). If we conducted a study for which we wished to know a person’s income, we’d probably have to ask them their income, perhaps in an interview or a survey. Thus we have observed income, even if it has only been observed indirectly. Birthplace might be another indirect observable. We can ask study participants where they were born, but chances are good we won’t have directly observed any of those people being born in the locations they report.

Sometimes the measures that we are interested in are more complex and more abstract than observational terms or indirect observables. Think about some of the concepts you’ve learned about in other sociology classes—ethnocentrism, for example. What is ethnocentrism? Well, you might know from your intro to sociology class that it has something to do with the way a person judges another’s culture. But how would you measure it? Here’s another construct: bureaucracy. We know this term has something to do with organizations and how they operate, but measuring such a construct is trickier than measuring, say, a person’s income. In both cases, ethnocentrism and bureaucracy, these theoretical notions represent ideas whose meaning we have come to agree on. Though we may not be able to observe these abstractions directly, we can observe the confluence of things that they are made up of. Kaplan referred to these more abstract things that behavioral scientists measure as constructs. Constructs are “not observational either directly or indirectly” (Kaplan, 1964, p. 55), but they can be defined based on observables.

Thus far we have learned that social scientists measure what Abraham Kaplan called observational terms, indirect observables, and constructs. These terms refer to the different sorts of things that social scientists may be interested in measuring. But how do social scientists measure these things? That is the next question we’ll tackle.

How Do Social Scientists Measure?

Measurement in social science is a process. It occurs at multiple stages of a research project: in the planning stages, in the data collection stage, and sometimes even in the analysis stage. Recall that previously we defined measurement as the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating. Once we’ve identified a research question, we begin to think about what some of the key ideas are that we hope to learn from our project. In describing those key ideas, we begin the measurement process.

Let’s say that our research question is the following: How do new college students cope with the adjustment to college? In order to answer this question, we’ll need some idea about what coping means. We may come up with an idea about what coping means early in the research process, as we begin to think about what to look for (or observe) in our data-collection phase. Once we’ve collected data on coping, we also have to decide how to report on the topic. Perhaps, for example, there are different types or dimensions of coping, some of which lead to more successful adjustment than others. However we decide to proceed, and whatever we decide to report, the point is that measurement is important at each of these phases.

As the preceding paragraph demonstrates, measurement is a process in part because it occurs at multiple stages of conducting research. We could also think of measurement as a process because of the fact that measurement in itself involves multiple stages. From identifying one’s key terms to defining them to figuring out how to observe them and how to know if our observations are any good, there are multiple steps involved in the measurement process. An additional step in the measurement process involves deciding what elements one’s measures contain. A measure’s elements might be very straightforward and clear, particularly if they are directly observable. Other measures are more complex and might require the researcher to account for different themes or types. These sorts of complexities require paying careful attention to a concept’s level of measurement and its dimensions. We’ll explore these complexities in greater depth at the end of this chapter, but first let’s look more closely at the early steps involved in the measurement process.

Key Takeaways

  • Measurement is the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating.
  • Kaplan identified three categories of things that social scientists measure including observational terms, indirect observables, and constructs.
  • Measurement occurs at all stages of research.
  • See if you can come up with one example of each of the following: an observational term, an indirect observable, and a construct. How might you measure each?

Understanding qualitative measurement: The what, why, and how

Last updated 30 January 2024

You’ll need to collect data to determine the success of any project, from product launches to employee culture initiatives. How that data is collected is just as important as what it reveals.

There are many ways to gather and analyze data, from in-person interviews to emailed surveys. Qualitative research focuses on telling a story with the information collected, while quantitative research involves collecting, analyzing, and presenting hard datasets.

Data gathered through qualitative measurement describes traits or characteristics. You can collect it in different ways, including interviews and observation, and it can be in the form of descriptive words.

While gathering and analyzing data through qualitative measurement can be challenging, especially if you’re working with limited resources or a smaller team, the insights you get at the end of the project are often well worth the effort.

  • What is qualitative measurement?

Qualitative measures can be particularly helpful in understanding how a phenomenon or action affects individuals and groups.

  • Why is qualitative data important?

Through data, you can understand how to better serve your customers and employees and anticipate shifts in your business.

The data will provide a deeper understanding of your customers, empowering you to make decisions that benefit your company in the long run. Qualitative data helps you see patterns and trends so you can make actionable changes. It can also answer questions posed by your project so you can provide company stakeholders with helpful information and insights.

  • How to collect qualitative data

Your ideal method for collecting qualitative data will depend on the resources you have at your disposal, the size of your team, and your project’s timeline.

You might select one method or a mixture of several. For instance, you could opt to send out surveys following a focus group session to receive additional feedback on one or two specific areas of interest.

Analyze your available resources and discuss options with project stakeholders before committing to one particular plan.

The following are some examples of the methods you could use:

Individual interviews

In-depth interviews are one of the most popular methods of collecting qualitative data. They are usually conducted in person, but you could also use video software.

During interviews, a researcher asks the person questions, logging their answers as they go.

Focus groups

Focus groups are a powerful way to observe and document a group of people, making them a common method for collecting qualitative data. They provide researchers with a direct way to interact with participants, listening to them while they share their insights and experiences and recording responses without the interference of software or third-party systems.

However, while focus groups and interviews are two of the most popular methods, they might not be right for every situation or company.

Direct observation

Direct observation allows researchers to see participants in their natural setting, offering an intriguing “real-life” angle to data collection. This method can provide rich, detailed information about the individuals or groups you are studying.

Surveys

You can conduct surveys in person or online through web software or email. They can also be as detailed or general as your project requires. To get the most information from your surveys, use open-ended questions that encourage respondents to share their thoughts and opinions on the subject.

Diaries and journals

Product launches or employee experience initiatives are two examples of projects that could benefit from diaries and journals as a form of qualitative data gathering.

Diaries and journals enable participants to record their thoughts and feelings on a particular topic. By later examining the diary entries, project managers and stakeholders can better understand their reactions and opinions on the project and the questions asked.

  • Examples of qualitative data

Qualitative data is non-numeric information. It’s descriptive, often including adjectives to paint a picture of a situation or object. Qualitative data can be used to describe a person or place, as you can see in the examples below:

The employee prefers black coffee to sweet beverages.

The cat is black and fluffy.

The brown leather couch is worn and faded.

There are many ways to collate qualitative data, but remember to use appropriate language when communicating it to other project stakeholders. Qualitative data isn’t flowery, but nor does it shy away from descriptors that comprehensively paint a picture.

  • How to measure qualitative data

To measure qualitative data, define a clear project scope ahead of time. Know what questions you want answered and which people you need to speak to in order to make that happen. While not every result can be tallied, understanding the questions and project scope well in advance will leave you better prepared to analyze what you’re querying.

Define the method you wish to use for your project. Whether you opt for surveys, focus groups, or a mixture of methods, employ the approach that will yield the most valuable data.

Work within your means and be realistic about the resources you can dedicate to data collection. For example, if you only have one or two employees to dedicate to the project, don’t commit to multiple focus group meetings with large groups of participants, as it might not be feasible.

  • What’s the difference between qualitative and quantitative measurements?

Qualitative measurements are descriptive. You can’t measure them with a ruler or scale, nor can you express them as a numeric value.

In contrast, quantitative measurements are numeric in nature and can be counted.

  • When to use qualitative vs. quantitative measurements

Both qualitative and quantitative measurements can be valuable. Which to use greatly depends on the nature of your project.

If you’re looking to confirm a theory, such as determining which variety of body butter was sold most during a specific month, quantitative measurements will likely give you the answers you need.

To learn more about concepts and experiences, such as which advertising campaign your target customers prefer, opt for qualitative measurement.

You don’t have to commit to one or the other exclusively. Many businesses use a mixed-method approach to research, combining elements of both quantitative and qualitative measurements. Know the questions you want to answer and proceed accordingly with what makes the most sense for your goals.

  • What are the best ways to communicate qualitative data?

Communicating the qualitative data you’ve gathered can be tricky. The information is subjective, and many project stakeholders or other involved parties may have an easier time understanding and reacting to numeric data.

To effectively communicate qualitative data, you’ll need to create a compelling storyline that offers context and relevant details.

It can also help to describe the data collection method you used. This not only helps set the stage for your story but gives those listening insight into research methodologies they may be unfamiliar with.

Finally, allow plenty of time for questions. Regardless of whether you’re speaking to your company’s CEO or a fellow project manager, you should be prepared to respond to questions with additional, relevant information.

  • How can qualitative measurement be expressed through data?

Qualitative data is non-numeric. It is most often expressed through descriptions since it is surveyed or observed rather than counted.

  • Challenges associated with qualitative measurement

Any in-depth study or research project requires a time commitment. Depending on the research method you employ, other resources might be required. For instance, you might need to compensate the participants of a focus group in some way.

The time and resources required to undertake qualitative measurement could make it prohibitive for many companies, especially small ones with only a few employees. Outsourcing can also be expensive.

Conducting a cost–benefit analysis could help you decide if qualitative measurement is a worthwhile undertaking or one that should be delayed as you plan and prepare.


Research Methods | Definitions, Types, Examples

Research methods are specific procedures for collecting and analyzing data. Developing your research methods is an integral part of your research design . When planning your methods, there are two key decisions you will make.

First, decide how you will collect data . Your methods depend on what type of data you need to answer your research question :

  • Qualitative vs. quantitative : Will your data take the form of words or numbers?
  • Primary vs. secondary : Will you collect original data yourself, or will you use data that has already been collected by someone else?
  • Descriptive vs. experimental : Will you take measurements of something as it is, or will you perform an experiment?

Second, decide how you will analyze the data .

  • For quantitative data, you can use statistical analysis methods to test relationships between variables.
  • For qualitative data, you can use methods such as thematic analysis to interpret patterns and meanings in the data.

Table of contents

  • Methods for collecting data
  • Examples of data collection methods
  • Methods for analyzing data
  • Examples of data analysis methods
  • Other interesting articles
  • Frequently asked questions about research methods

Data is the information that you collect for the purposes of answering your research question . The type of data you need depends on the aims of your research.

Qualitative vs. quantitative data

Your choice of qualitative or quantitative data collection depends on the type of knowledge you want to develop.

For questions about ideas, experiences and meanings, or to study something that can’t be described numerically, collect qualitative data .

If you want to develop a more mechanistic understanding of a topic, or your research involves hypothesis testing , collect quantitative data .

You can also take a mixed methods approach , where you use both qualitative and quantitative research methods.

Primary vs. secondary research

Primary research is any original data that you collect yourself for the purposes of answering your research question (e.g. through surveys , observations and experiments ). Secondary research is data that has already been collected by other researchers (e.g. in a government census or previous scientific studies).

If you are exploring a novel research question, you’ll probably need to collect primary data . But if you want to synthesize existing knowledge, analyze historical trends, or identify patterns on a large scale, secondary data might be a better choice.

Descriptive vs. experimental data

In descriptive research , you collect data about your study subject without intervening. The validity of your research will depend on your sampling method .

In experimental research , you systematically intervene in a process and measure the outcome. The validity of your research will depend on your experimental design .

To conduct an experiment, you need to be able to vary your independent variable , precisely measure your dependent variable, and control for confounding variables . If it’s practically and ethically possible, this method is the best choice for answering questions about cause and effect.


Your data analysis methods will depend on the type of data you collect and how you prepare it for analysis.

Data can often be analyzed both quantitatively and qualitatively. For example, survey responses could be analyzed qualitatively by studying the meanings of responses or quantitatively by studying the frequencies of responses.
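As a toy illustration (the responses and thematic codes below are invented), the same open-ended answers could be coded qualitatively and the codes then counted quantitatively:

```python
# Hypothetical open-ended survey responses with hand-assigned thematic codes.
from collections import Counter

answers = ["too expensive", "love the design", "a bit pricey", "sleek design"]
codes = ["price", "design", "price", "design"]  # qualitative step: manual coding

# Quantitative step: frequencies of the coded themes.
print(Counter(codes))  # Counter({'price': 2, 'design': 2})
```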

Qualitative analysis methods

Qualitative analysis is used to understand words, ideas, and experiences. You can use it to interpret data that was collected:

  • From open-ended surveys and interviews , literature reviews , case studies , ethnographies , and other sources that use text rather than numbers.
  • Using non-probability sampling methods .

Qualitative analysis tends to be quite flexible and relies on the researcher’s judgement, so you have to reflect carefully on your choices and assumptions and be careful to avoid research bias .

Quantitative analysis methods

Quantitative analysis uses numbers and statistics to understand frequencies, averages and correlations (in descriptive studies) or cause-and-effect relationships (in experiments).

You can use quantitative analysis to interpret data that was collected either:

  • During an experiment .
  • Using probability sampling methods .

Because the data is collected and analyzed in a statistically valid way, the results of quantitative analysis can be easily standardized and shared among researchers.


If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis
  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

A sample is a subset of individuals from a larger population . Sampling means selecting the group that you will actually collect data from in your research. For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students.

In statistics, sampling allows you to test a hypothesis about the characteristics of a population.

The research methods you use depend on the type of data you need to answer your research question .

  • If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts and meanings, use qualitative methods .
  • If you want to analyze a large amount of readily-available data, use secondary data. If you want data specific to your purposes with control over how it is generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

Methodology refers to the overarching strategy and rationale of your research project . It involves studying the methods used in your field and the theories or principles behind them, in order to develop an approach that matches your objectives.

Methods are the specific tools and procedures you use to collect and analyze data (for example, experiments, surveys , and statistical tests ).

In shorter scientific papers, where the aim is to report the findings of a specific study, you might simply describe what you did in a methods section .

In a longer or more complex research project, such as a thesis or dissertation , you will probably include a methodology section , where you explain your approach to answering the research questions and cite relevant sources to support your choice of methods.


Example of ranking scale data collection using a questionnaire:

Rank the following mobile brands in order of your preference: the most preferred brand should be ranked one, the second most preferred ranked two, and so on.


For a descriptive analysis of the ranking scale, watch the video: “Ranking Scale Questionnaire - How to tabulate, analyse and prepare graph using MS Excel.”
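If you prefer code to spreadsheets, a minimal Python sketch of the same tabulation (the brands and responses below are hypothetical) might look like this:

```python
# Tabulating rank-order data: each row is one respondent's ranking
# (1 = most preferred). Brands and responses are hypothetical.
import pandas as pd

responses = pd.DataFrame({
    "Brand A": [1, 2, 1, 3, 1],
    "Brand B": [2, 1, 3, 1, 2],
    "Brand C": [3, 3, 2, 2, 3],
})

# Lower mean rank = more preferred overall.
print(responses.mean().sort_values())

# Frequency of each rank (1st/2nd/3rd) per brand.
print(responses.apply(lambda col: col.value_counts()).fillna(0).astype(int))
```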

Interval Scale

It is the next higher level of measurement and overcomes the main limitation of ordinal measurement: on an ordinal scale, the magnitude of the difference between values is unimportant, whereas on an interval scale it matters. On an interval scale, the difference between two values has a meaningful interpretation, and values are spaced at equal distances. The distance between any two adjacent attributes is called an  interval , and intervals are always equal.

Example of interval scale data collection using a questionnaire:

How likely are you to recommend our product to your friends or relatives?


The Likert scale, developed by Rensis Likert, is a tool for collecting interval data.

For a descriptive analysis of the interval scale, watch the video: “How to tabulate, analyze, and prepare graph from Likert Scale questionnaire data using Ms Excel.”
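Again, for readers working in Python instead of Excel, a minimal sketch of tabulating a single Likert item (with hypothetical responses coded 1-5) is:

```python
# Tabulating a single 5-point Likert item (hypothetical responses, coded 1-5).
import pandas as pd

likert = pd.Series([5, 4, 4, 3, 5, 2, 4, 5, 3, 4], name="recommend")

freq = likert.value_counts().sort_index()        # count per scale point
table = pd.DataFrame({"count": freq, "percent": freq / len(likert) * 100})
print(table)
print("mean:", likert.mean(), " sd:", round(likert.std(), 2))
```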

Ratio Scale

The ratio scale is purely quantitative. Among the four levels of measurement, the ratio scale is the most precise. Unlike on the other three scales, a score of zero on a ratio scale is not arbitrary; this is the unique quality of ratio scale data. The ratio scale also exhibits all the characteristics of the nominal, ordinal, and interval scales. Examples of ratio scales are age, weight, height, income, and distance.

Examples of ratio scale data collection using a questionnaire:

Specify your monthly income:

How many students are there in your institution? :

Number of departments in your organisation :


Data Science Journal


An unsupervised learning approach to evaluate questionnaire data: what one can learn from violations of measurement invariance

  • Max Hahn-Klimroth
  • Paul W. Dierkes
  • Matthias W. Kleespies

In several branches of the social sciences and humanities, surveys based on standardized questionnaires are a prominent research tool. While there are a variety of ways to analyze the data, some standard procedures have become established. When such surveys are used to analyze differences in the answer patterns of different groups (e.g., countries, gender, age), these procedures can only be carried out in a meaningful way if there is measurement invariance; i.e., the measured construct has psychometric equivalence across groups. As recently raised as an open problem by Sauerwein et al. (2021), new evaluation methods that work in the absence of measurement invariance are needed.

This paper promotes an unsupervised learning-based approach to such research data by proposing a procedure that works in three phases: data preparation, clustering of questionnaires, and measuring similarity based on the obtained clustering and the properties of each group. We generate synthetic data in three data sets, which allows us to compare our approach with the PCA approach under measurement invariance and under violated measurement invariance. As a main result, we obtain that the approach provides a natural comparison between groups and a natural description of the response patterns of the groups. Moreover, it can be safely applied to a wide variety of data sets, even in the absence of measurement invariance. Finally, this approach allows us to translate (violations of) measurement invariance into a meaningful measure of similarity.

  • Principal Component Analysis
  • Unsupervised Learning
  • Methods in Cross-Cultural Studies
  • Measurement Invariance

1. Introduction

In several branches of the social sciences and humanities, a prominent research tool is to conduct surveys using standardized questionnaires. One reason for the prominence of questionnaire-based studies may be that they are inexpensive, relatively easy to administer, and if the responses are standardized, it is easy to compile the data. Measurement instruments in such questionnaires can either consist of single questions that measure separate variables, such as questions about preferences or daily activities, or they can consist of multiple questions that can be aggregated into a single value or index. In the latter case, it is common to say that all the items in the questionnaire measure the same construct . They are often used when measuring attitudes ( Milfont and Duckitt 2010 ), connection to nature ( Mayer and Frantz 2004 ) or environmental behavior ( F. G. Kaiser 1998 ). A common research question in both cases is whether several groups differ in their preferences, attitudes, or other environmental psychological measures. In this way, differences and similarities between age groups ( Liefländer et al. 2013 ), between genders ( Kleespies and Dierkes 2020 ) or across different disciplines ( Feucht, Dierkes, and Kleespies 2023 ) can be examined.

There are a variety of ways to analyze and evaluate questionnaire data. Over the years, some standard procedures have become established in social science research and have been used in countless studies. When a large data set consisting of different psychological variables and constructs has been collected, a structurally simplifying procedure, such as a factor analysis or principal component analysis (PCA), is usually carried out to simplify the interpretation of the results (Costello and Osborne 2005; Yong and Pearce 2013).

However, these procedures require that the collected data are suitable for such an analysis. To verify the applicability of these methods, Bartlett’s test of sphericity and the Kaiser-Meyer-Olkin (KMO) criterion are usually applied (Bartlett 1950; H. F. Kaiser 1970). The KMO criterion assesses the sampling adequacy for each variable of the model and for the entire model, and Bartlett’s test of sphericity tests whether there are correlations between the single items. A factor analysis or PCA only makes sense if these criteria are met. As a general rule, a factor analysis or PCA should only be applied if Bartlett’s test finds significant deviations from the hypothesis of no correlations and the KMO is above 0.7 (Dziuban and Shirkey 1974).

In these structure-simplifying procedures, similar items are assigned to the same higher-level factors. In the analysis, the individual items of the higher-order factors can be summarized by calculating a mean value. If there is only a single factor, this is referred to as a unidimensional model. In order to confirm the internal consistency (the inter-relatedness between the test items) and validity of the individual factors or components, Cronbach’s alpha is often calculated ( Cronbach 1951 ). The mean values of the individual factors can then be used to carry out group comparisons using for example hypothesis tests.

When making comparisons between different groups, these procedures can only be carried out in a meaningful way if there is measurement invariance, i.e., if the measured construct shows psychometric equivalence between groups (Putnick and Bornstein 2016). For example, it is possible that the perception of a measurement instrument differs between cultural groups and that the factor analysis therefore produces a different factor structure for each group. In this case, there is a lack of measurement invariance, and a comparison of the different groups is not easily possible. Verifying measurement invariance is a complex, multi-stage process (Knickenberg et al. 2019). Methods are therefore still needed that allow research data to be analyzed despite a lack of measurement invariance, as recently posed as an open problem (Sauerwein and Theis 2021). Such methods of analysis could help in carrying out cross-cultural studies, of which more are needed, especially in environmental psychology (Tam and Milfont 2020).

1.1. Our contribution

In this contribution, we propose an unsupervised learning-based approach to such research data. As already described, standard methods require either a similar data structure in all subgroups or at least comparable pairwise correlations between the individual items across all groups. Especially when comparing heterogeneous groups, this cannot be guaranteed, so the standard methods described are not appropriate. In addition, missing data pose a major challenge: the standard approach replaces missing data with the mean of the questionnaire, regardless of any correlations or similarities between different items, or missing values are simply ignored by excluding cases, which reduces the sample size. Finally, from a more statistical point of view, the standard approach leads to multiple group comparisons. With an increasing number of groups, the number of pairwise comparisons also increases, which quickly becomes confusing and difficult to interpret, and the error corrections applied can make the results less precise. Our approach analyzes a questionnaire data set in three steps; more precisely, we describe an algorithm that:

  • prepares the questionnaire data,
  • clusters the questionnaires according to their response types ,
  • measures the similarity between groups using the proportion of questionnaires of each response type in the group.

In the data preparation step, the algorithm takes care of missing values in the original data using k -nearest neighbor imputation, and prepares the data for the actual clustering step. The clustering step clusters the individual questionnaires, and the centroids of the clusters will be called response types , as they refer to the typical questionnaire in each cluster. Finally, the proportion of each response type per group provides a very natural measure of similarity between groups, and further statistical analyses that might explain group similarities or differences can be applied based on this quantity.

In this paper, we give examples of this method applied to synthetic data and compare the result with the classical methods when they can be applied. We also give examples where our approach can be easily applied, but standard methods fail. Of course, the unsupervised learning approach itself (the actual clustering) and the imputation approach (nearest neighbor imputation) are well known and extensively studied methods in the data science community. However, the main goal of this paper is to combine these methods and to promote this approach for the evaluation of questionnaire data to a wide range of researchers who evaluate questionnaire data in different fields.

2. Important Definitions and Notation

2.1. Studied datasets

Below, we describe three synthetic datasets that are used throughout the paper. The datasets were created using the NumPy package in the Python programming language, and for completeness, the generated data are provided in the Supplementary Material. The first and second datasets are based on questionnaires consisting of seven items, where each item takes an integer value in [1, 5], and the third dataset consists of questionnaires with only three such items.

2.1.1. Case 1: measurement invariance is given

Let δ_v denote the Dirac measure on v ∈ {1, 2, 3, 4, 5}^7, and let unif(1, 5) denote the uniform distribution on {1, 2, 3, 4, 5}^7. Each group is sampled from a mixture of these measures; the exact laws, together with the generated data, are given in the Supplementary Material.

In the second step, the value of each element x is perturbed by the noise function F,

F(x) = min{5, max{1, round(x + ε)}},  with ε ~ N(0, 0.66).

This means that an independent Gaussian noise with mean 0 and variance 0.66 is added to the value of each element and the result is rounded to the nearest integer. Also, values above 5 and below 1 are truncated to 5 and 1, respectively.
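For illustration, this perturbation can be written in NumPy as follows (a sketch of our reading of F; note that NumPy’s normal() expects the standard deviation, hence sqrt(0.66)):

```python
# Sketch of the noise function F: add N(0, 0.66) noise, round to the
# nearest integer, and truncate to the range [1, 5].
import numpy as np

rng = np.random.default_rng(0)

def F(x):
    noisy = np.rint(x + rng.normal(0.0, np.sqrt(0.66), size=np.shape(x)))
    return np.clip(noisy, 1, 5).astype(int)

sheet = np.array([3, 3, 3, 3, 3, 3, 3])  # one questionnaire with seven items
print(F(sheet))                          # e.g. [3 4 3 2 3 3 3]
```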

2.1.2. Case 2: violations of measurement invariance

Formally, we describe the probability laws from which we sample as follows. Again, we denote by δ_v the Dirac measure on v and by unif(1, 5) the uniform distribution on {1, 2, 3, 4, 5}^7; the exact mixture laws are again given in the Supplementary Material.

Having sampled 1,000 elements for each group independently, the same perturbation as before is applied: each value is perturbed by the noise function F defined in Case 1.

2.1.3. Case 3: items are unrelated and show differences between groups

The third dataset is generated as noisy instances of six base types σ_1, …, σ_6 ∈ {1, 2, 3, 4, 5}^3, among them σ_3 = (3, 3, 3) (see Section 4.3). After the sampling procedure, we perturb each item’s value by a noise function G; the exact definitions, together with the generated data, are given in the Supplementary Material.

2.2. Ward’s clustering method

The clustering obtained in this paper is due to performing a standard agglomerative clustering with Ward’s minimum variance criterion as the objective function (Ward 1963). When this method is applied to a data set of size n, in the first step of the clustering algorithm, all n data points form their own cluster. Then, in each step, the two clusters whose merging minimizes the total within-cluster distance are merged, and the cluster center of the new cluster is computed as the point minimizing the sum of squared distances to all points in the cluster. More precisely, in each step, the clustering algorithm must find the pair of clusters that leads to the minimum increase in the total within-cluster variance after merging (Cormack 1971; Ward 1963). Formally, we choose those clusters A and B, with centers c_A and c_B, that minimize

Δ(A, B) = (|A| · |B| / (|A| + |B|)) · ‖c_A − c_B‖².

This clustering approach is quite popular because it usually produces compact and comparably sized clusters ( Szmrecsanyi 2012 ).
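As a minimal sketch (with placeholder data and an arbitrary cut at five clusters), such a clustering can be obtained with SciPy:

```python
# Agglomerative clustering with Ward's criterion via SciPy, on placeholder
# questionnaire data; the cut at five clusters is arbitrary here.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
data = rng.integers(1, 6, size=(200, 7)).astype(float)  # stand-in questionnaires

Z = linkage(data, method="ward")                 # full merge history (dendrogram)
labels = fcluster(Z, t=5, criterion="maxclust")  # cut the tree at 5 clusters

# Cluster centers: the mean questionnaire of each cluster.
centers = np.array([data[labels == k].mean(axis=0)
                    for k in range(1, labels.max() + 1)])
print(centers.round(2))
```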

However, the algorithm requires a stopping criterion. This can either be determined automatically, e.g., by stopping when the increase in within-cluster variance exceeds a certain threshold, or by stopping when a certain number of clusters is reached. We follow the second approach, where the user determines the number of clusters obtained by the method. This choice can be guided by clustering indices.

2.3. Determining the number of clusters

A fairly intuitive, though relatively recent, approach to determining a suitable number of clusters during agglomerative clustering is based on the so-called gap statistic (Mohajer, Englmeier, and Schmid 2010; Tibshirani, Walther, and Hastie 2001). It compares the cluster compactness of a given clustering with that of a null reference distribution of the data, i.e., data with no (obvious) clustering. The number of clusters suggested by the method is the value for which the cluster compactness on the original data is significantly smaller than the cluster compactness on the reference data (Mohajer, Englmeier, and Schmid 2010; Tibshirani, Walther, and Hastie 2001). Hence, we are looking for a (local) maximum in a scree plot that plots the number of clusters against the gap value. Intuitively, this corresponds to ‘unnaturally large gaps’ in a corresponding dendrogram. A dendrogram, which represents a tree, illustrates the arrangement of clusters produced by an agglomerative clustering process. The leaves of the tree are the individual data points, and whenever two clusters are merged, an edge is used to visualize the merging. The corresponding height, the distance from the leaves, is equal to the distance of the cluster centroids at that moment. The clusters are induced by a horizontal line: the tree is cut into a forest by removing everything above this line, and the height of a natural cut should correspond to a large level gap in the dendrogram (Tibshirani, Walther, and Hastie 2001).
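A compact sketch of the gap statistic, simplified from Tibshirani, Walther, and Hastie (2001) (it omits the standard-error correction of the original method) and using a uniform reference distribution over the bounding box of the data, might read:

```python
# Simplified gap statistic for Ward clustering (no standard-error term).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def within_dispersion(X, labels):
    # Sum of squared distances of the points to their cluster means.
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

def gap_statistic(X, k_max=10, n_ref=10, seed=0):
    rng = np.random.default_rng(seed)
    gaps = []
    for k in range(1, k_max + 1):
        labels = fcluster(linkage(X, "ward"), t=k, criterion="maxclust")
        log_w = np.log(within_dispersion(X, labels))
        log_w_ref = []
        for _ in range(n_ref):
            # Reference data: uniform over the bounding box of X.
            ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
            ref_labels = fcluster(linkage(ref, "ward"), t=k, criterion="maxclust")
            log_w_ref.append(np.log(within_dispersion(ref, ref_labels)))
        gaps.append(np.mean(log_w_ref) - log_w)
    return gaps  # inspect for a (local) maximum over k
```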

Of course, there are several other indices that can be used to measure the goodness of a clustering, like the Calinski-Harabasz index (Calinski and Harabasz 1974; Liu et al. 2013) or the Silhouette coefficient (Rousseeuw 1987). However, while the gap statistic can be used with any clustering algorithm, the latter indices are known to prefer convex clusters over non-convex clusters, even if a non-convex variant might intuitively reflect the better clustering (Liu et al. 2013; Rousseeuw 1987), in particular if an underlying community structure is supposed to exist. For this reason, we use the gap statistic in this contribution.

3. Evaluation algorithm for questionnaire data

In this section, we present the proposed evaluation method in detail. As previously described, the algorithm used to evaluate the questionnaire data runs in three phases. The first phase is the data preparation phase and consists of the following steps. The first step replaces missing values in the original data by k-nearest neighbor imputation (see Section 5.1).

The next step, the data balancing step, is required to obtain a meaningful clustering of the questionnaires. As we assume that the distribution of questionnaires might vary between groups, and the sample sizes between the groups might also vary, we need to make sure that the questionnaires appearing (only) in groups with a comparably small sample size are not irrelevant during the clustering step. A standard approach to guarantee this in (supervised) learning tasks is to balance the data by oversampling minority groups and/or downsampling majority groups ( Chawla et al. 2002 ; Menardi and Torelli 2012 ). In our case, we propose to oversample groups with smaller sample sizes until all groups contain equally many questionnaires, as the actual evaluation will only be with respect to original data and not the synthetically oversampled data (see Phase 2).

The last data preparation step is to perturb each item’s value slightly with independent additive Gaussian noise with mean zero and standard deviation 0.1. The main purpose is that the clustering (e.g., the cluster centers) becomes much more stable under adding or removing single data points if the raw data is augmented. This is a well-known principle, not only in clustering but in various machine learning tasks, in which models generalize much better if the training data is augmented by random noise (Belinkov and Bisk 2018; Ding et al. 2007; Min et al. 2020; Zhang, Kinoshita, and Kiya 2020). Moreover, the data matrix becomes full rank, as there are (with high probability) no duplicate rows anymore, which increases the numerical stability of the computation.
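Putting the three preparation steps together, a sketch might look as follows; the function name and the use of scikit-learn’s KNNImputer are our choices, not prescribed by the paper:

```python
# Sketch of the preparation phase: k-NN imputation, oversampling to
# balance group sizes, and Gaussian augmentation (sd = 0.1).
import numpy as np
from sklearn.impute import KNNImputer

def prepare(X, groups, seed=0):
    """X: questionnaires as rows (np.nan = missing); groups: label per row."""
    rng = np.random.default_rng(seed)
    X = KNNImputer(n_neighbors=5).fit_transform(X)       # step 1: imputation
    target = max(np.sum(groups == g) for g in np.unique(groups))
    parts = []
    for g in np.unique(groups):                          # step 2: balancing
        idx = np.where(groups == g)[0]
        extra = rng.choice(idx, size=target - len(idx), replace=True)
        parts.append(X[np.concatenate([idx, extra])])
    D = np.vstack(parts)
    return D + rng.normal(0.0, 0.1, size=D.shape)        # step 3: augmentation
```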

The second phase, the clustering phase, starts with clustering the oversampled and augmented questionnaires given by D . As a clustering algorithm, we propose to perform a standard agglomerative cluster analysis with Ward’s minimum variance method as the objective function (see Section 2.2). The number of clusters l is determined visually using the gap statistic ( Tibshirani, Walther, and Hastie 2001 ) based on a scree plot as well as a dendrogram. As explained earlier, the number of clusters is expected to be at a local maximum in the scree plot. The main idea is described in Section 2.3.

Once the clusters are obtained, we compute the corresponding cluster centers (geometrically speaking, the centroid of each cluster) and call these points response types. While mathematically the response types are really just cluster centroids, the name should reflect the fact that we expect a typical questionnaire in the cluster to follow that response type. We call the set of all response types R and fix an arbitrary order. Let the ordered response types be r_1, …, r_l.

The last phase of the proposed method combines the explorative data-driven approach with the actual content interpretation. First, the response types can be interpreted as a typical response to a questionnaire in that cluster. Second, each group is assigned a fingerprint: the vector whose i-th entry is the proportion of the group’s questionnaires that belong to response type r_i. The fingerprints thus reflect the distribution of people following a certain response type in the different groups. The more similar the fingerprints of two groups are, the more similarly people answered the questionnaires, which is a natural measure of similarity between groups.
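A minimal sketch of the fingerprint computation (the function name is ours), applied to the cluster labels of the original, non-oversampled questionnaires:

```python
# Sketch of the fingerprint computation: the share of each response type
# among a group's original (non-oversampled) questionnaires.
import numpy as np

def fingerprints(labels, groups, n_types):
    """labels: 1-based cluster labels per questionnaire; groups: group per row."""
    fps = {}
    for g in np.unique(groups):
        counts = np.bincount(labels[groups == g] - 1, minlength=n_types)
        fps[g] = counts / counts.sum()
    return fps  # each fingerprint is a probability vector over response types
```

Running the same Ward clustering on the stacked fingerprint vectors then yields a dendrogram of group similarity, as in Figures 2, 4, and 6.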

In Section 2.3, it was already explained that a dendrogram (a tree representation of a clustering algorithm) can be used to determine how many natural or robust clusters exist. It also yields a very intuitive description of similarity between data points, as those points whose clusters merge earlier are more similar. Such a notion of similarity is also standard outside of data science; for example, in ecology and evolution such dendrograms are known as phylogenetic trees, and show the evolutionary relationships among species ( Letunic and Bork 2006 ).

As an optional last step, the fingerprints in combination with the response types may be explained by group specific properties. This step is not related to cluster analysis, nor is it part of the proposed method, but for completeness we present it here. For example, suppose the groups are different countries, and the response types are easy to interpret: response type 1 might reflect a high interest in conservation, response type 2 might reflect a high interest in conservation in principle, but some parts of preservation are irrelevant to the people, etc. Thus, fingerprints with a high value in response type 1 reflect countries where the majority of people are highly interested in conservation, and fingerprints with a high value in response type 2 reflect countries where people are also interested in conservation, but certain aspects are irrelevant. These results could be explained by indices that describe countries, such as wealth indices or a country’s forest cover. A simple but powerful way to test such hypotheses is to measure the rank correlation between the marginal of the fingerprint representing a particular response type and the corresponding index.
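With SciPy, such a rank-correlation test might look as follows (all numbers below are invented for illustration):

```python
# Hypothetical example: Spearman rank correlation between the share of one
# response type across five countries and a country-level index.
from scipy.stats import spearmanr

share_type1 = [0.52, 0.31, 0.44, 0.18, 0.27]   # fingerprint marginal per country
forest_cover = [0.38, 0.12, 0.33, 0.07, 0.15]  # invented index values

rho, p = spearmanr(share_type1, forest_cover)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```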

4.1. Structure of the construct is equal in all groups

4.1.1. Factor analysis or PCA with follow-up testing

In the first phase, the applicability of PCA was assessed for each of the four groups using Bartlett’s test and the Kaiser-Meyer-Olkin (KMO) criterion. Given the significance of Bartlett’s test (p < 0.001) and a KMO criterion above 0.700 in all groups, PCA was considered appropriate. The PCA in all groups showed that the 7 items could be combined into one higher-order component according to the Kaiser criterion (F. G. Kaiser 1998). The calculation of Cronbach’s alpha for this component showed a high internal consistency and reliability between the items for all four groups (α > 0.700). After determining the component, the mean values of the items were calculated for each data point. To determine differences between the groups, these mean values were compared using hypothesis tests: a Kruskal-Wallis test followed by the Dunn-Bonferroni post-hoc test, with the level of significance adjusted by the Bonferroni correction. The results demonstrated pairwise significant differences between all groups, with the exception of Groups 2–3 (p = 0.564) and 3–4 (p = 1.00).

4.1.2. Our approach

Figure 1: Gap statistic and dendrogram indicating the goodness of the clustering of the questionnaires in data set 1; the corresponding response types are shown as a spider plot.

The next step is to compute the fingerprints of the 4 groups and express the similarity of the groups (see Figure 2). The four groups are visually different: Group 1 is concentrated on response types corresponding to large uniform responses, Group 2 on medium-large answer patterns, Group 3 contains roughly equally many questionnaires of every response type, while in Group 4 most questionnaires contain either quite small or quite large answers. Compared to the model used to generate the data, this is clearly a very good reconstruction of the actual data, and one that is easy to interpret content-wise.

Figure 2: Fingerprints of the different groups over the response types, shown as spider plots; the radial y-axes are scaled to (0, 0.7). The group similarity on data set 1 is given by a dendrogram.

4.2. Structure of the construct differs between groups

4.2.1. Factor analysis or PCA with follow-up testing

4.2.2. Our approach

Again, Figure 3 gives an overview of the indices that determine the number of response types as well as the similarity of the groups. The local maximum in the gap statistic is at 10 clusters, and those 10 clusters are highly visible in the corresponding dendrogram. As in the previous case, this fits well with the model used to generate the questionnaires: we expect up to five symmetric clusters in which all items have roughly the same value, as well as clusters in which the typical item is small to moderate but item 4 is large (response types 1, 2 & 5), and finally clusters in which the typical item is large but item 7 is small (response types 7 & 8); see Figure 3.

Figure 3: Gap statistic and dendrogram indicating the goodness of the clustering of the questionnaires in data set 2; the corresponding response types are shown as a spider plot.

Next, the fingerprints of the groups are computed and the similarity of the groups is expressed in Figure 4. Again, the group fingerprints reflect the actual groups very well. Groups 1 to 4 are described similarly to the previous case; they are still concentrated on those response types which express uniform answers of different heights. Moreover, the ‘new’ groups are also well described; in particular, the distribution over the fingerprints yields the following interpretations: Group 5 has many high responses, and item 7 is artificially small. Group 6 has small to medium responses, but item 4 is large. Finally, in Group 7 we observe a large proportion of questionnaires in which all answers are small but item 4 is large (response types 1 & 2), but also questionnaires with high answers in which item 7 is comparatively small (response types 7 & 8). We also observe that the previously more similar groups (Group 3 and Group 4) are measured as similar again, and Group 1 and Group 2 are still more similar to each other than either is to Group 3 or Group 4. The measure of similarity is thus stable with respect to adding data from more groups.

Figure 4: Fingerprints of the different groups over the response types, shown as spider plots; the radial y-axes are scaled to (0, 0.7). The group similarity on data set 2 is shown as a dendrogram.

4.3. Items are unrelated and show differences between groups

4.3.1. Factor analysis or PCA with follow-up testing

4.3.2. Our approach

As before, we plot the response types in Figure 5. The gap statistic suggests the use of 5 response types, and in the dendrogram one would choose 5–6 response types. Again, this fits well with the actual data generation, which is based on noisy instances of 6 types. The response types correspond to noisy measurements of five of the six ground-truth values σ_1, …, σ_6, but σ_3 = (3, 3, 3) does not appear as a response type. This might well be due to the relatively large noise applied to each coordinate, such that a typical sample from σ_3 will have different entries.

Figure 5: Gap statistic and dendrogram indicating the goodness of the clustering of the questionnaires in data set 3; the corresponding response types are shown as a spider plot.

Regarding the interpretation of the groups’ fingerprints (see Figure 6), we observe that Group 8 contains questionnaires of each response type, Group 9 is mostly concentrated on response types with a high score for item 1, Group 10 is concentrated on those response types in which item 1 is small, and finally, Group 11 has large entries in item 2. This reflects the actual data model very well.

Figure 6: Fingerprints of the different groups over the response types, shown as spider plots; the radial y-axes are scaled to (0, 0.7). The group similarity on data set 3 is visualized by a dendrogram.

4.4. Robustness towards the number of clusters

Figure 7: Similarity between the groups in data set 1 for a growing number of response types (left: 5 response types; right: 8 response types). The similarity based on the optimal number of response types (5) gives no significantly different result than the similarity based on more response types.

While the similarity itself does not change, it is important to notice that over-estimating the number of response types clearly has its drawbacks. The main challenge is that the cluster centroids (the typical questionnaire per cluster) are no longer well separated and are potentially harder to explain content-wise. Recall from Figure 1 that the five response types were very easy to interpret: they referred to the typical sheets 1…1–5…5 up to some noise. However, if eight response types are formed, they are not as easy to describe (see Figure 8). For example, we observe that response types 4 & 5 clearly emerge from the previous response type 3 (see Figure 1). They contain questionnaires in which the answers are around the typical answer 3, but the cluster centroids are slightly ‘deformed’ rather than being roughly uniform in all coordinates. This is not desirable because, obviously, even if all items measure the same construct and the participant answers with care, the questionnaires (4, 3, 3, 3, 3, 2, 3) (in the cluster of response type 4) and (3, 3, 3, 4, 3, 3, 3) (in the cluster of response type 5) are highly likely to be observed and should, intuitively, both correspond to the same response type.

Figure 8: The eight response types on data set 1. While the similarity between group fingerprints does not vary if more response types are used, the response types themselves are much less well separated.

To summarize, it is unproblematic to over-estimate the number of response types with regard to the similarity of the groups, but the response types might become harder to interpret.

5. Discussion & Conclusion

5.1. On the data preparation step

All the steps used in the data preparation phase are well known to the data science community, but they are rarely encountered in the described application, namely the evaluation of questionnaire data. The first step was to impute missing data by k-nearest neighbor imputation. Normally, such missing data are filled in by simply taking the average (either of the row or of the column) (Troyanskaya et al. 2001; Weaver and Maxwell 2014) or simply ignored (Husson and Josse 2013), but the former does not take into account the dependencies between different items, and the latter reduces the sample size. Therefore, especially since our method is also applicable to questionnaires measuring different constructs with different items, we need a better imputation technique. Since nearest neighbor imputation is well studied and often used for missing data in a variety of data science applications (Troyanskaya et al. 2001), we believe that it should be used in this case as well.

Next, the group samples are balanced by simple oversampling. This is necessary when the sample sizes of the different groups differ. For example, suppose one group has a completely different typical response than the other groups, but the number of questionnaires in that group is small. This tiny fraction of data points will not significantly affect the clustering metrics, and therefore the response type is unlikely to appear as a cluster centroid. This effect does not occur when the group samples are comparably large. While oversampling is a standard method in supervised learning, one usually has to be very careful not to overfit a model to a few examples (Chawla et al. 2002; Menardi and Torelli 2012). Note that this overfitting effect does not have a serious impact on the proposed method. First, we use the oversampled data set only to identify appropriate response types, while the actual evaluation (e.g., measuring similarity based on the fingerprints) is based only on the actual (not oversampled) data. Second, the obtained clustering is not intended to be applied to unseen data, but only to describe a data set. We emphasize that an alternative to oversampling would be weighted clustering, but linkage-based clustering algorithms as applied here are known to be incompatible with this approach (Ackerman et al. 2021). This choice would also result in a much higher computational cost, and in applications it may well be that non-linkage-based clustering algorithms could be used instead. Finally, while oversampling allows for different sample sizes in the different groups, it remains essential that the sample in each group be representative and valid for the question being studied.

Finally, the data augmentation step adds some noise to the individual questionnaire items. Thus, the questionnaire items are no longer integers but floating-point numbers. This makes the data matrix more likely to be of full rank, which is crucial for numerical stability, and makes the resulting clustering more stable against the removal of individual data points. Moreover, it is well known that almost all machine learning models generalize much better when the training data is augmented with random noise (Belinkov and Bisk 2018; Ding et al. 2007; Min et al. 2020; Zhang, Kinoshita, and Kiya 2020). A further idea behind the augmentation is that it turns the naive oversampling of step two into a SMOTE-like oversampling, as the oversampled data points are subjected to noise and are no longer simple duplicates of the original data (Chawla et al. 2002).

5.2. Comparison to factor analysis and PCA

In addition, each group can be described by the (normalized) entropy of its fingerprint f,

H(f) = −(1 / log l) · Σ_{i=1}^{l} f_i · log f_i,

a quantity that is a very natural measure of variation in a variety of applications from physical systems to information theory (Wehrl 1978), and we propose to use the entropy of fingerprints to measure the heterogeneity of a group. A key property of the normalized entropy is that it takes values between 0 (all questionnaires belong to the same response type)¹ and 1 (there are equally many questionnaires of each response type), which quantifies the homogeneity of a group quite naturally.
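A direct implementation of this quantity, using the convention from the footnote, could be:

```python
# Normalized entropy of a fingerprint over l response types, with the
# convention 0 * log(0) = 0 (see the footnote).
import numpy as np

def normalized_entropy(f):
    f = np.asarray(f, dtype=float)
    nz = f[f > 0]                     # drop zero entries: 0 * log(0) = 0
    return float(-(nz * np.log(nz)).sum() / np.log(len(f)))

print(normalized_entropy([1, 0, 0, 0]))   # 0.0 -> fully homogeneous group
print(normalized_entropy([0.25] * 4))     # 1.0 -> maximally heterogeneous group
```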

5.3. Advantages and limitations of the proposed approach

Probably the most limiting factor of our approach is its purely descriptive nature; i.e., it does not provide a typical measure of significance with respect to group differences. In addition, the number of required response types is usually hard to determine. While the gap statistic yields an algorithmic approach by taking the number of response types at a local maximum in the corresponding scree plot, it is not clear that a unique or obvious maximum exists. Of course, visual inspection of the dendrogram can support the choice, but overall, this freedom of choice might reduce inter-observer reliability.

On the positive side, unlike factor analysis or PCA, the proposed method can be applied to any questionnaire data set where differences between multiple groups are to be compared. The method comes with a very natural dimensionality reduction from individual questionnaires to fingerprints of groups that have a very natural interpretation, given reasonable response types. Moreover, the similarity between groups can be described in a straightforward way.

5.4. Overcoming some of the challenges

While the result of the method is indeed descriptive, the obtained similarity nevertheless quantifies how close different groups are. When an agglomerative clustering algorithm is applied to the fingerprints, visual inspection of the dendrogram or the formal gap statistic (if the number of groups is large) can be used to identify clusters of fingerprints ( Mohajer, Englmeier, and Schmid 2010 ; Tibshirani, Walther, and Hastie 2001 ). A difference between two groups is then asserted if and only if the groups do not belong to the same cluster. Moreover, as illustrated in the example above, the proportion of each response type in a group can be related to additional descriptive factors, e.g., country indices, via standard methods. As long as the response types can be interpreted well, such analyses can be used to explain the observed effects.
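As an illustration, the following sketch clusters a toy matrix of group fingerprints with SciPy; the Ward linkage and Euclidean distance used here are illustrative choices, not necessarily those of the study:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
    from scipy.spatial.distance import pdist

    F = np.array([[0.70, 0.20, 0.10],      # three groups, three response types
                  [0.65, 0.25, 0.10],      # (toy proportions)
                  [0.10, 0.30, 0.60]])

    Z = linkage(pdist(F), method='ward')             # agglomerative clustering of fingerprints
    labels = fcluster(Z, t=2, criterion='maxclust')  # e.g., cut the tree into two clusters
    tree = dendrogram(Z, no_plot=True)               # or plot it for visual inspection

Here the first two groups end up in one cluster and the third in another, i.e., only the third group would be declared different.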

Similarly, while the number of response types used in the algorithm has an impact on the results, we saw in Section 4.4 that the similarity between the fingerprints of groups is robust against choosing too many response types. The only effect concerns the interpretation of the response types (or of the groups' fingerprints on those response types). The most extreme case is a full clustering, in which every possible questionnaire pattern forms a singleton cluster; this corresponds to n^d response types, with d the number of items and n the number of possible scores per item. While the similarity can, in principle, still be measured based on the n^d-dimensional fingerprints, an interpretation of the results is out of reach. However, with the gap statistic backed up by visual inspection of the dendrogram, it is easy to determine a number of response types that is at least close to the potentially optimal choice. Here we emphasize that the property of being a local maximum of the gap value is important. As can be observed in Figure 5 , a local maximum appears around 5 response types; for more than 10 response types, however, the gap value increases again and eventually exceeds the local maximum. Since the data set was generated from noisy instances of the 6 base types σ 1 , …, σ 6 , it is not to be expected that many additional answer patterns occur regularly, so the response types found beyond roughly 10 clusters are likely artificial and cannot be interpreted well.
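Given a precomputed sequence of gap values (e.g., obtained via the procedure of Tibshirani, Walther, and Hastie 2001), picking the first local maximum is straightforward; the sketch below is a simple heuristic and should be confirmed against the dendrogram:

    def first_local_maximum(gap):
        # gap[i] = gap value for i + 1 clusters; return the first cluster count
        # whose gap value exceeds both of its neighbours.
        for i in range(1, len(gap) - 1):
            if gap[i] > gap[i - 1] and gap[i] > gap[i + 1]:
                return i + 1
        return len(gap)   # fallback: no interior local maximum found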

5.5. Conclusion

We presented a method to quantify the similarity between different groups in questionnaire studies and outlined how group differences might be explained. In contrast to the standard approaches, the method does not require measurement invariance; it can even exploit variation in the measurements to better distinguish between the groups. The approach is easy to apply, relies on well-known data science concepts, and yields a natural interpretation of the results. Moreover, we observed that even in situations in which standard factor analyses could be conducted, simply following the standard approach did not detect all occurring group differences when these concerned group homogeneity or heterogeneity rather than the average answer. Overall, we believe the proposed approach may help practitioners in a variety of fields to analyze their complex data sets.

[1] As usual, we assume that 0 log(0) = 0.

Funding Information

This work was supported by the von Opel Hessische Zoostiftung and the Open Access publication fund of Goethe University Frankfurt.

Competing Interests

The authors have no competing interests to declare.

References

Ackerman, M, et al. 2021 Weighted clustering: Towards solving the user’s dilemma. Pattern Recognition, 120: 108152. ISSN: 0031-3203. DOI: https://doi.org/10.1016/j.patcog.2021.108152

Bartlett, MS. 1950 Tests of significance in factor analysis. Br. J. Stat. Psychol ., 3(2): 77–85. DOI: https://doi.org/10.1111/j.2044-8317.1950.tb00285.x  

Belinkov, Y and Bisk, Y. 2018 Synthetic and natural noise both break neural machine translation. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings. OpenReview.net .  

Calinski, T and Harabasz, J. 1974 A dendrite method for cluster analysis. Communications in Statistics – Theory and Methods , 3(1): 1–27. DOI: https://doi.org/10.1080/03610927408827101  

Chawla, NV, et al. 2002 SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Int. Res ., 16(1): 321–357. ISSN: 1076-9757. DOI: https://doi.org/10.1613/jair.953  

Cormack, RM. 1971 A review of classification. Journal of the Royal Statistical Society. Series A (General) , 134(3): 321. ISSN: 0035-9238. DOI: https://doi.org/10.2307/2344237  

Costello, AB and Osborne, J. 2005 Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis . University of Massachusetts Amherst. DOI: https://doi.org/10.7275/JYJ1-4868  

Cover, T and Hart, P. 1967 Nearest neighbor pattern classification. In: IEEE Transactions on Information Theory , 13(1): 21–27. DOI: https://doi.org/10.1109/TIT.1967.1053964  

Cronbach, LJ. 1951 Coefficient alpha and the internal structure of tests. Psychometrika , 16(3): 297–334. DOI: https://doi.org/10.1007/BF02310555  

Ding, Y, et al. 2007 Robust clustering in high dimensional data using statistical depths. In: BMC Bioinformatics , 8(S7). DOI: https://doi.org/10.1186/1471-2105-8-S7-S8  

Dziuban, CD and Shirkey, EC. 1974 When is a correlation matrix appropriate for factor analysis? Some decision rules. Psychol. Bull ., 81(6): 358–361. DOI: https://doi.org/10.1037/h0036316  

Feucht, V, Dierkes, PW and Kleespies, MW. 2023 The different values of nature: a comparison between university students’ perceptions of nature’s instrumental, intrinsic and relational values. Sustainability Science , 18(5): 2391–2403. ISSN: 1862-4057. DOI: https://doi.org/10.1007/s11625-023-01371-8  

Husson, F and Josse, J. 2013 Handling missing values in multiple factor analysis. Food Quality and Preference , 30(2): 77–85. ISSN: 0950-3293. DOI: https://doi.org/10.1016/j.foodqual.2013.04.013  

Kaiser, FG. 1998 A general measure of ecological behavior. Journal of Applied Social Psychology , 28(5): 395–422. ISSN: 1559-1816. DOI: https://doi.org/10.1111/j.1559-1816.1998.tb01712.x  

Kaiser, HF. 1970 A second generation little jiffy. Psychometrika , 35(4): 401–415. ISSN: 1860-0980. DOI: https://doi.org/10.1007/BF02291817  

Kleespies, MW and Dierkes, PW. 2020 Impact of biological education and gender on students’ connection to nature and relational values. PLOS ONE , 15(11): e0242004. ISSN: 1932-6203. DOI: https://doi.org/10.1371/journal.pone.0242004  

Knickenberg, M, et al. 2019 Assessing dimensions of inclusion from students’ perspective – measurement invariance across students with learning disabilities in different educational settings. European Journal of Special Needs Education , 35(3): 287–302. ISSN: 1469-591X. DOI: https://doi.org/10.1080/08856257.2019.1646958  

Letunic, I and Bork, P. 2006 Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics , 23(1): 127–128. ISSN: 1367-4803. DOI: https://doi.org/10.1093/bioinformatics/btl529  

Liefländer, AK, et al. 2013 Promoting connectedness with nature through environmental education. Environmental Education Research , 19(3): 370–384. ISSN: 1469-5871. DOI: https://doi.org/10.1080/13504622.2012.697545  

Liu, Y, et al. 2013 Understanding and enhancement of internal clustering validation measures. IEEE Trans. Cybern ., 43(3): 982–994. DOI: https://doi.org/10.1109/TSMCB.2012.2220543  

Mayer, F and Frantz, CM. 2004 The connectedness to nature scale: A measure of individuals’ feeling in community with nature. Journal of Environmental Psychology , 24(4): 503–515. ISSN: 0272-4944. DOI: https://doi.org/10.1016/j.jenvp.2004.10.001  

Menardi, G and Torelli, N. 2012 Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery , 28(1): 92–122. ISSN: 1573-756X. DOI: https://doi.org/10.1007/s10618-012-0295-5  

Milfont, TL and Duckitt, J. 2010 The environmental attitudes inventory: A valid and reliable measure to assess the structure of environmental attitudes. Journal of Environmental Psychology , 30(1): 80–94. ISSN: 0272-4944. DOI: https://doi.org/10.1016/j.jenvp.2009.09.001  

Min, J, et al. 2020 Syntactic data augmentation increases robustness to inference heuristics. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020. D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault. (Eds.), Association for Computational Linguistics, 2339–2352. DOI: https://doi.org/10.18653/v1/2020.acl-main.212  

Mohajer, M, Englmeier, K-H and Schmid, VJ. 2010 A comparison of Gap statistic definitions with and without logarithm function. LMU Department of Statistics: Technical Reports , 96. DOI: https://doi.org/10.5282/ubm/epub.11920  

Putnick, D L and Bornstein, MH. 2016 Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Dev. Rev ., 41: 71–90. DOI: https://doi.org/10.1016/j.dr.2016.06.004  

Rousseeuw, PJ. 1987 Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics , 20: 53–65. ISSN: 0377-0427. DOI: https://doi.org/10.1016/0377-0427(87)90125-7  

Sauerwein, M and Theis, D. 2021 New ways of dealing with lacking measurement invariance. In: Accountability and Educational Improvement . Cham: Springer International Publishing. pp. 63–82. DOI: https://doi.org/10.1007/978-3-030-69345-9_5  

Schmitt, N and Kuljanin, G. 2008 Measurement invariance: review of practice and implications. Hum. Resour. Manag. Rev ., 18(4): 210–222. DOI: https://doi.org/10.1016/j.hrmr.2008.03.003  

Szmrecsanyi, B. 2012 Studies in English language: Grammatical variation in British English dialects: A study in corpus-based dialectometry . Cambridge, England: Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511763380  

Tam, K-P and Milfont, TL. 2020 Towards cross-cultural environmental psychology: A state-of-the-art review and recommendations. Journal of Environmental Psychology, 71: 101474. ISSN: 0272-4944. DOI: https://doi.org/10.1016/j.jenvp.2020.101474

Tibshirani, R, Walther, G and Hastie, T. 2001 Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B: Statistical Methodology , 63(2): 411–423. ISSN: 1467-9868. DOI: https://doi.org/10.1111/1467-9868.00293  

Troyanskaya, O, et al. 2001 Missing value estimation methods for DNA microarrays. Bioinformatics , 17(6): 520–525. DOI: https://doi.org/10.1093/bioinformatics/17.6.520  

Van De Schoot, R, et al. 2015 Editorial: measurement invariance. Front. Psychol ., 6: 1064. DOI: https://doi.org/10.3389/fpsyg.2015.01064  

Ward, JH. 1963 Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association , 58(301): 236–244. ISSN: 1537-274X. DOI: https://doi.org/10.1080/01621459.1963.10500845  

Weaver, B and Maxwell, H. 2014 Exploratory factor analysis and reliability analysis with missing data: A simple method for SPSS users. The Quantitative Methods for Psychology , 10(2): 143–152. ISSN: 2292-1354. DOI: https://doi.org/10.20982/tqmp.10.2.p143  

Wehrl, A. 1978 General properties of entropy. Reviews of Modern Physics , 50(2): 221–260. ISSN: 0034-6861. DOI: https://doi.org/10.1103/RevModPhys.50.221  

Yong, AG and Pearce, S. 2013 A beginner’s guide to factor analysis: Focusing on exploratory factor analysis. Tutorials in Quantitative Methods for Psychology , 9(2): 79–94. ISSN: 1913-4126. DOI: https://doi.org/10.20982/tqmp.09.2.p079  

Zhang, W, Kinoshita, Y and Kiya, H. 2020 Image-enhancement-based data augmentation for improving deep learning in image classification problem. In: IEEE International Conference on Consumer Electronics – Taiwan, ICCE-TW 2020, Taoyuan, Taiwan, September 28–30, 2020. IEEE, pp. 1–2. DOI: https://doi.org/10.1109/ICCE-Taiwan49838.2020.9258292  


Understanding Psychological Measurement

Rajiv S. Jhangiani; I-Chant A. Chiang; Carrie Cuttler; and Dana C. Leighton

Learning Objectives

  • Define measurement and give several examples of measurement in psychology.
  • Explain what a psychological construct is and give several examples.
  • Distinguish conceptual from operational definitions, give examples of each, and create simple operational definitions.
  • Distinguish the four levels of measurement, give examples of each, and explain why this distinction is important.

What Is Measurement?

Measurement is the assignment of scores to individuals so that the scores represent some characteristic of the individuals. This very general definition is consistent with the kinds of measurement that everyone is familiar with—for example, weighing oneself by stepping onto a bathroom scale, or checking the internal temperature of a roasting turkey using a meat thermometer. It is also consistent with measurement in the other sciences. In physics, for example, one might measure the potential energy of an object in Earth’s gravitational field by finding its mass and height (which of course requires measuring those variables) and then multiplying them together along with the gravitational acceleration of Earth (9.8 m/s²). The result of this procedure is a score that represents the object’s potential energy.
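As a worked version of the physics example, the score is computed as

    E_p = m \cdot g \cdot h ,

so a 2 kg object held 1.5 m above the ground scores E_p = 2 × 9.8 × 1.5 = 29.4 joules (the specific numbers here are illustrative, not from the original text).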

This general definition of measurement is consistent with measurement in psychology too. (Psychological measurement is often referred to as psychometrics.) Imagine, for example, that a cognitive psychologist wants to measure a person’s working memory capacity—their ability to hold in mind and think about several pieces of information all at the same time. To do this, she might use a backward digit span task, in which she reads a list of two digits to the person and asks them to repeat them in reverse order. She then repeats this several times, increasing the length of the list by one digit each time, until the person makes an error. The length of the longest list for which the person responds correctly is the score and represents their working memory capacity. Or imagine a clinical psychologist who is interested in how depressed a person is. He administers the Beck Depression Inventory, which is a 21-item self-report questionnaire in which the person rates the extent to which they have felt sad, lost energy, and experienced other symptoms of depression over the past 2 weeks. The sum of these 21 ratings is the score and represents the person’s current level of depression.
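A minimal sketch of how scoring the backward digit span task might be automated (the data format here is hypothetical, not part of the original description):

    def backward_digit_span(trials):
        # trials: list of (presented_digits, participant_answer) pairs, in order
        # of increasing list length, e.g. ([3, 9], [9, 3]).
        span = 0
        for presented, answer in trials:
            if answer == list(reversed(presented)):
                span = len(presented)   # longest list repeated correctly so far
            else:
                break                   # the task stops at the first error
        return span

The Beck Depression Inventory score is even simpler: it is just the sum of the 21 item ratings.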

The important point here is that measurement does not require any particular instruments or procedures. What it  does  require is  some  systematic procedure for assigning scores to individuals or objects so that those scores represent the characteristic of interest.

Psychological Constructs

Many variables studied by psychologists are straightforward and simple to measure. These include age, height, weight, and birth order. You can ask people how old they are and be reasonably sure that they know and will tell you. Although people might not know or want to tell you how much they weigh, you can have them step onto a bathroom scale. Other variables studied by psychologists—perhaps the majority—are not so straightforward or simple to measure. We cannot accurately assess people’s level of intelligence by looking at them, and we certainly cannot put their self-esteem on a bathroom scale. These kinds of variables are called  constructs  (pronounced  CON-structs ) and include personality traits (e.g., extraversion), emotional states (e.g., fear), attitudes (e.g., toward taxes), and abilities (e.g., athleticism).

Psychological constructs cannot be observed directly. One reason is that they often represent  tendencies  to think, feel, or act in certain ways. For example, to say that a particular university student is highly extraverted does not necessarily mean that she is behaving in an extraverted way right now. In fact, she might be sitting quietly by herself, reading a book. Instead, it means that she has a general tendency to behave in extraverted ways (e.g., being outgoing, enjoying social interactions) across a variety of situations. Another reason psychological constructs cannot be observed directly is that they often involve internal processes. Fear, for example, involves the activation of certain central and peripheral nervous system structures, along with certain kinds of thoughts, feelings, and behaviors—none of which is necessarily obvious to an outside observer. Notice also that neither extraversion nor fear “reduces to” any particular thought, feeling, act, or physiological structure or process. Instead, each is a kind of summary of a complex set of behaviors and internal processes.

The Big Five

The Big Five is a set of five broad dimensions (openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism) that capture much of the variation in human personality. Each of the Big Five can even be defined in terms of six more specific constructs called “facets” (Costa & McCrae, 1992) [1] .

Table 4.1 The Big Five Personality Dimensions

The  conceptual definition  of a psychological construct describes the behaviors and internal processes that make up that construct, along with how it relates to other variables. For example, a conceptual definition of neuroticism (another one of the Big Five) would be that it is people’s tendency to experience negative emotions such as anxiety, anger, and sadness across a variety of situations. This definition might also include that it has a strong genetic component, remains fairly stable over time, and is positively correlated with the tendency to experience pain and other physical symptoms.

Students sometimes wonder why, when researchers want to understand a construct like self-esteem or neuroticism, they do not simply look it up in the dictionary. One reason is that many scientific constructs do not have counterparts in everyday language (e.g., working memory capacity). More important, researchers are in the business of developing definitions that are more detailed and precise—and that more accurately describe the way the world is—than the informal definitions in the dictionary. As we will see, they do this by proposing conceptual definitions, testing them empirically, and revising them as necessary. Sometimes they throw them out altogether. This is why the research literature often includes different conceptual definitions of the same construct. In some cases, an older conceptual definition has been replaced by a newer one that fits and works better. In others, researchers are still in the process of deciding which of various conceptual definitions is the best.

Operational Definitions

An  operational definition  is a definition of a variable in terms of precisely how it is to be measured. These measures generally fall into one of three broad categories.  Self-report measures  are those in which participants report on their own thoughts, feelings, and actions, as with the Rosenberg Self-Esteem Scale (Rosenberg, 1965) [2] . Behavioral measures  are those in which some other aspect of participants’ behavior is observed and recorded. This is an extremely broad category that includes the observation of people’s behavior both in highly structured laboratory tasks and in more natural settings. A good example of the former would be measuring working memory capacity using the backward digit span task. A good example of the latter is a famous operational definition of physical aggression from researcher Albert Bandura and his colleagues (Bandura, Ross, & Ross, 1961) [3] . They let each of several children play for 20 minutes in a room that contained a clown-shaped punching bag called a Bobo doll. They filmed each child and counted the number of acts of physical aggression the child committed. These included hitting the doll with a mallet, punching it, and kicking it. Their operational definition, then, was the number of these specifically defined acts that the child committed during the 20-minute period. Finally,  physiological measures  are those that involve recording any of a wide variety of physiological processes, including heart rate and blood pressure, galvanic skin response, hormone levels, and electrical activity and blood flow in the brain.

For any given variable or construct, there will be multiple operational definitions. Stress is a good example. A rough conceptual definition is that stress is an adaptive response to a perceived danger or threat that involves physiological, cognitive, affective, and behavioral components. But researchers have operationally defined it in several ways. The Social Readjustment Rating Scale (Holmes & Rahe, 1967) [4] is a self-report questionnaire on which people identify stressful events that they have experienced in the past year, with points assigned for each one depending on its severity. For example, a man who has been divorced (73 points), changed jobs (36 points), and had a change in sleeping habits (16 points) in the past year would have a total score of 125. The Hassles and Uplifts Scale (Delongis, Coyne, Dakof, Folkman & Lazarus, 1982) [5] is similar but focuses on everyday stressors like misplacing things and being concerned about one’s weight. The Perceived Stress Scale (Cohen, Kamarck, & Mermelstein, 1983) [6] is another self-report measure that focuses on people’s feelings of stress (e.g., “How often have you felt nervous and stressed?”). Researchers have also operationally defined stress in terms of several physiological variables, including blood pressure and levels of the stress hormone cortisol.
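The Social Readjustment Rating Scale arithmetic described above amounts to a simple lookup-and-sum; in the sketch below, the point values are those quoted in the text, while the event labels are hypothetical:

    SRRS_POINTS = {"divorce": 73, "job_change": 36, "sleep_change": 16}

    def srrs_score(events):
        return sum(SRRS_POINTS[e] for e in events)

    srrs_score(["divorce", "job_change", "sleep_change"])  # -> 125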

When psychologists use multiple operational definitions of the same construct—either within a study or across studies—they are using converging operations. The idea is that the various operational definitions are “converging” or coming together on the same construct. When scores based on several different operational definitions are closely related to each other and produce similar patterns of results, this constitutes good evidence that the construct is being measured effectively and that it is useful. The various measures of stress, for example, are all correlated with each other and have all been shown to be correlated with other variables such as immune system functioning (also measured in a variety of ways) (Segerstrom & Miller, 2004) [7] . This is what allows researchers eventually to draw useful general conclusions, such as “stress is negatively correlated with immune system functioning,” as opposed to more specific and less useful ones, such as “people’s scores on the Perceived Stress Scale are negatively correlated with their white blood cell counts.”

Levels of Measurement

The psychologist S. S. Stevens suggested that scores can be assigned to individuals in a way that communicates more or less quantitative information about the variable of interest (Stevens, 1946) [8] . For example, the officials at a 100-m race could simply rank order the runners as they crossed the finish line (first, second, etc.), or they could time each runner to the nearest tenth of a second using a stopwatch (11.5 s, 12.1 s, etc.). In either case, they would be measuring the runners’ times by systematically assigning scores to represent those times. But while the rank ordering procedure communicates the fact that the second-place runner took longer to finish than the first-place finisher, the stopwatch procedure also communicates  how much  longer the second-place finisher took. Stevens actually suggested four different levels of measurement (which he called “scales of measurement”) that correspond to four types of information that can be communicated by a set of scores, and the statistical procedures that can be used with the information.

The  nominal level  of measurement is used for categorical variables and involves assigning scores that are category labels. Category labels communicate whether any two individuals are the same or different in terms of the variable being measured. For example, if you ask your participants about their marital status, you are engaged in nominal-level measurement. Or if you ask your participants to indicate which of several ethnicities they identify themselves with, you are again engaged in nominal-level measurement. The essential point about nominal scales is that they do not imply any ordering among the responses. For example, when classifying people according to their favorite color, there is no sense in which green is placed “ahead of” blue. Responses are merely categorized. Nominal scales thus embody the lowest level of measurement [9] .

The remaining three levels of measurement are used for quantitative variables. The  ordinal level  of measurement involves assigning scores so that they represent the rank order of the individuals. Ranks communicate not only whether any two individuals are the same or different in terms of the variable being measured but also whether one individual is higher or lower on that variable. For example, a researcher wishing to measure consumers’ satisfaction with their microwave ovens might ask them to specify their feelings as either “very dissatisfied,” “somewhat dissatisfied,” “somewhat satisfied,” or “very satisfied.” The items in this scale are ordered, ranging from least to most satisfied. This is what distinguishes ordinal from nominal scales. Unlike nominal scales, ordinal scales allow comparisons of the degree to which two individuals rate the variable. For example, our satisfaction ordering makes it meaningful to assert that one person is more satisfied than another with their microwave ovens. Such an assertion reflects the first person’s use of a verbal label that comes later in the list than the label chosen by the second person.

On the other hand, ordinal scales fail to capture important information that will be present in the other levels of measurement we examine. In particular, the difference between two levels of an ordinal scale cannot be assumed to be the same as the difference between two other levels (just like you cannot assume that the gap between the runners in first and second place is equal to the gap between the runners in second and third place). In our satisfaction scale, for example, the difference between the responses “very dissatisfied” and “somewhat dissatisfied” is probably not equivalent to the difference between “somewhat dissatisfied” and “somewhat satisfied.” Nothing in our measurement procedure allows us to determine whether the two differences reflect the same difference in psychological satisfaction. Statisticians express this point by saying that the differences between adjacent scale values do not necessarily represent equal intervals on the underlying scale giving rise to the measurements. (In our case, the underlying scale is the true feeling of satisfaction, which we are trying to measure.)

The  interval level  of measurement involves assigning scores using numerical scales in which intervals have the same interpretation throughout. As an example, consider either the Fahrenheit or Celsius temperature scales. The difference between 30 degrees and 40 degrees represents the same temperature difference as the difference between 80 degrees and 90 degrees. This is because each 10-degree interval has the same physical meaning (in terms of the kinetic energy of molecules).

Interval scales are not perfect, however. In particular, they do not have a true zero point even if one of the scaled values happens to carry the name “zero.” The Fahrenheit scale illustrates the issue. Zero degrees Fahrenheit does not represent the complete absence of temperature (the absence of any molecular kinetic energy). In reality, the label “zero” is applied to its temperature for quite accidental reasons connected to the history of temperature measurement. Since an interval scale has no true zero point, it does not make sense to compute ratios of temperatures. For example, there is no sense in which the ratio of 40 to 20 degrees Fahrenheit is the same as the ratio of 100 to 50 degrees; no interesting physical property is preserved across the two ratios. After all, if the “zero” label were applied at the temperature that Fahrenheit happens to label as 10 degrees, the two ratios would instead be 30 to 10 and 90 to 40, no longer the same! For this reason, it does not make sense to say that 80 degrees is “twice as hot” as 40 degrees. Such a claim would depend on an arbitrary decision about where to “start” the temperature scale, namely, what temperature to call zero (whereas the claim is intended to make a more fundamental assertion about the underlying physical reality).

In psychology, the intelligence quotient (IQ) is often considered to be measured at the interval level. While it is technically possible to receive a score of 0 on an IQ test, such a score would not indicate the complete absence of IQ. Moreover, a person with an IQ score of 140 does not have twice the IQ of a person with a score of 70. However, the difference between IQ scores of 80 and 100 is the same as the difference between IQ scores of 120 and 140.

Finally, the ratio level of measurement involves assigning scores in such a way that there is a true zero point that represents the complete absence of the quantity. Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam. You can think of a ratio scale as the three earlier scales rolled up in one. Like a nominal scale, it provides a name or category for each object (the numbers serve as labels). Like an ordinal scale, the objects are ordered (in terms of the ordering of the numbers). Like an interval scale, the same difference at two places on the scale has the same meaning. In addition, however, the same ratio at two places on the scale also carries the same meaning.

The Fahrenheit scale for temperature has an arbitrary zero point and is therefore not a ratio scale. However, zero on the Kelvin scale is absolute zero. This makes the Kelvin scale a ratio scale. For example, if one temperature is twice as high as another as measured on the Kelvin scale, then it has twice the kinetic energy of the other temperature.

Another example of a ratio scale is the amount of money you have in your pocket right now (25 cents, 50 cents, etc.). Money is measured on a ratio scale because, in addition to having the properties of an interval scale, it has a true zero point: if you have zero money, this actually implies the absence of money. Since money has a true zero point, it makes sense to say that someone with 50 cents has twice as much money as someone with 25 cents.

Stevens’s levels of measurement are important for at least two reasons. First, they emphasize the generality of the concept of measurement. Although people do not normally think of categorizing or ranking individuals as measurement, in fact they are, as long as the scores represent some characteristic of the individuals. Second, the levels of measurement can serve as a rough guide to the statistical procedures that can be used with the data and the conclusions that can be drawn from them. With nominal-level measurement, for example, the only available measure of central tendency is the mode. With ordinal-level measurement, the median or mode can be used as indicators of central tendency. Interval- and ratio-level measurement are typically considered the most desirable because they permit any measure of central tendency to be computed (i.e., mean, median, or mode). Also, ratio-level measurement is the only level that allows meaningful statements about ratios of scores. Once again, one cannot say that someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is measured at the interval level, but one can say that someone with six siblings has twice as many as someone with three because number of siblings is measured at the ratio level.
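This rough guide can be expressed as a simple lookup, as in the following Python sketch (a didactic illustration of the mapping described above, not a substitute for statistical judgment):

    from statistics import mean, median, mode

    LEVELS = {   # which central-tendency summaries are meaningful at each level
        "nominal":  [mode],
        "ordinal":  [mode, median],
        "interval": [mode, median, mean],
        "ratio":    [mode, median, mean],   # ratios of scores are additionally meaningful
    }

    def central_tendencies(scores, level):
        return {f.__name__: f(scores) for f in LEVELS[level]}

    central_tendencies([1, 2, 2, 5], "ordinal")   # -> {'mode': 2, 'median': 2.0}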

  • Costa, P. T., Jr., & McCrae, R. R. (1992). Normal personality assessment in clinical practice: The NEO Personality Inventory. Psychological Assessment, 4 , 5–13. ↵
  • Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press ↵
  • Bandura, A., Ross, D., & Ross, S. A. (1961). Transmission of aggression through imitation of aggressive models. Journal of Abnormal and Social Psychology, 63 , 575–582. ↵
  • Holmes, T. H., & Rahe, R. H. (1967). The Social Readjustment Rating Scale. Journal of Psychosomatic Research, 11 (2), 213-218. ↵
  • Delongis, A., Coyne, J. C., Dakof, G., Folkman, S., & Lazarus, R. S. (1982). Relationships of daily hassles, uplifts, and major life events to health status. Health Psychology, 1 (2), 119-136. ↵
  • Cohen, S., Kamarck, T., & Mermelstein, R. (1983). A global measure of perceived stress. Journal of Health and Social Behavior, 24, 386-396. ↵
  • Segerstrom, S. E., & Miller, G. E. (2004). Psychological stress and the human immune system: A meta-analytic study of 30 years of inquiry. Psychological Bulletin, 130 , 601–630. ↵
  • Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103 , 677–680. ↵
  • Levels of Measurement. Retrieved from http://wikieducator.org/Introduction_to_Research_Methods_In_Psychology/Theories_and_Measurement/Levels_of_Measurement ↵


Understanding Psychological Measurement Copyright © 2022 by Rajiv S. Jhangiani; I-Chant A. Chiang; Carrie Cuttler; and Dana C. Leighton is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.
