Edited by: Jimmy Thomas Efird, University of Newcastle, Australia

Reviewed by: Aida Turrini, Consiglio per la Ricerca in Agricoltura e L'analisi Dell'Economia Agraria (CREA), Italy; Mary Evelyn Northridge, New York University, United States

This article was submitted to Epidemiology, a section of the journal Frontiers in Public Health

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Scale development and validation are critical to much of the work in the health, social, and behavioral sciences. However, the constellation of techniques required for scale development and evaluation can be onerous, jargon-filled, unfamiliar, and resource-intensive. Further, it is often not a part of graduate training. Therefore, our goal was to concisely review the process of scale development in as straightforward a manner as possible, both to facilitate the development of new, valid, and reliable scales, and to help improve existing ones. To do this, we have created a primer for best practices for scale development in measuring complex phenomena. This is not a systematic review, but rather the amalgamation of technical literature and lessons learned from our experiences spent creating or adapting a number of scales over the past several decades. We identified three phases that span nine steps. In the first phase, items are generated and the validity of their content is assessed. In the second phase, the scale is constructed. Steps in scale construction include pre-testing the questions, administering the survey, reducing the number of items, and understanding how many factors the scale captures. In the third phase, scale evaluation, the number of dimensions is tested, reliability is tested, and validity is assessed. We have also added examples of best practices to each step. In sum, this primer will equip both scientists and practitioners to understand the ontology and methodology of scale development and validation, thereby facilitating the advancement of our understanding of a range of health, social, and behavioral outcomes.

Scales are a manifestation of latent constructs; they measure behaviors, attitudes, and hypothetical scenarios we expect to exist as a result of our theoretical understanding of the world, but cannot assess directly (

leads to more accurate research findings. Thousands of scales have been developed that can measure a range of social, psychological, and health behaviors and experiences.

As science advances and novel research questions are put forth, new scales become necessary. Scale development is not, however, an obvious or a straightforward endeavor. There are many steps to scale development, there is significant jargon within these techniques, the work can be costly and time consuming, and complex statistical analysis is often required. Further, many health and behavioral science degrees do not include training on scale development. Despite the availability of a large amount of technical literature on scale theory and development (

Therefore, our goal is to describe the process for scale development in as straightforward a manner as possible, both to facilitate the development of new, valid, and reliable scales, and to help improve existing ones. To do this, we have created a primer for best practices for scale development. We anticipate this primer will be broadly applicable across many disciplines, especially for health, social, and behavioral sciences. This is not a systematic review, but rather the amalgamation of technical literature and lessons learned from our experiences spent creating or adapting a number of scales related to multiple disciplines (

First, we provide an overview of each of the nine steps. Then, within each step, we define key concepts, describe the tasks required to achieve that step, share common pitfalls, and draw on examples in the health, social, and behavioral sciences to recommend best practices. We have tried to keep the material as straightforward as possible; the technical literature referenced throughout forms the foundation of this primer.

There are three phases to creating a rigorous scale—item development, scale development, and scale evaluation (

An overview of the three phases and nine steps of scale development and validation.

Item development, i.e., coming up with the initial set of questions for an eventual scale, is composed of: (1) identification of the domain(s) and item generation, and (2) consideration of content validity. The second phase, scale development, i.e., turning individual items into a coherent measure of the construct, consists of (3) pre-testing questions, (4) sampling and survey administration, (5) item reduction, and (6) extraction of latent factors. The last phase, scale evaluation, requires: (7) tests of dimensionality, (8) tests of reliability, and (9) tests of validity.

The three phases and nine steps of scale development and validation.

Domain identification | To specify the boundaries of the domain and facilitate item generation | 1.1 Specify the purpose of the domain |
( |

Item generation | To identify appropriate questions that fit the identified domain | 1.6 Deductive methods: literature review and assessment of existing scales |
( |

Evaluation by experts | To evaluate each of the items constituting the domain for content relevance, representativeness, and technical quality | 2.1 Quantify assessments of 5–7 expert judges using formalized scaling and statistical procedures including the content validity ratio, content validity index, or Cohen's kappa coefficient |
( |

Evaluation by target population | To evaluate each item constituting the domain for representativeness of actual experience from target population | 2.3 Conduct cognitive interviews with end users of scale items to evaluate face validity | ( |

Cognitive interviews | To assess the extent to which questions reflect the domain of interest and that answers produce valid measurements | 3.1 Administer draft questions to 5–15 interviewees in 2–3 rounds while allowing respondents to verbalize the mental process entailed in providing answers | ( |

Survey administration | To collect data with minimum measurement errors | 4.1 Administer potential scale items to a sample that reflects the range of the target population, using paper or electronic devices | (

Establishing the sample size | To ensure the availability of sufficient data for scale development | 4.2 Recommended sample size is 10 respondents per survey item and/or 200-300 observations | ( |

Determining the type of data to use | To ensure the availability of data for scale development and validation | 4.3 Use cross-sectional data for exploratory factor analysis |
– |

Item difficulty index | To determine the proportion of correct answers given per item (CTT) To determine the probability of a particular examinee correctly answering a given item (IRT) | 5.1 Proportion can be calculated for CTT and item difficulty parameter estimated for IRT using statistical packages | ( |

Item discrimination test | To determine the degree to which an item or set of test questions are measuring a unitary attribute (CTT) To determine how steeply the probability of correct response changes as ability increases (IRT) | 5.2 Estimate biserial correlations or item discrimination parameter using statistical packages | ( |

Inter-item and item-total correlations | To determine the correlations between scale items, as well as the correlations between each item and sum score of scale items | 5.3 Estimate inter-item/item communalities, item-total, and adjusted item-total correlations using statistical packages | ( |

Distractor efficiency analysis | To determine the distribution of incorrect options and how they contribute to the quality of items | 5.4 Estimate distractor analysis using statistical packages | ( |

Deleting or imputing missing cases | To ensure the availability of complete cases for scale development | 5.5 Delete items with many cases that are permanently missing, or use multiple imputation or full information maximum likelihood for imputation of data | ( |

Factor analysis | To determine the optimal number of factors or domains that fit a set of items | 6.1 Use scree plots, exploratory factor analysis, parallel analysis, minimum average partial procedure, and/or the Hull method | ( |

Test dimensionality | To address queries on the latent structure of scale items and their underlying relationships. i.e., to validate whether the previous hypothetical structure fits the items | 7.1 Estimate independent cluster model—confirmatory factor analysis, cf. Table |
( |

Score scale items | To create scale scores for substantive analysis including reliability and validity of scale | 7.4 Calculate scale scores using an unweighted approach, which includes summing standardized item scores and raw item scores, or computing the mean for raw item scores |
( |

Calculate reliability statistics | To assess the internal consistency of the scale. i.e., the degree to which the set of items in the scale co-vary, relative to their sum score | 8.1 Estimate using Cronbach's alpha |
( |

Test–retest reliability | To assess the degree to which the participant's performance is repeatable; i.e., how consistent their scores are across time | 8.3 Estimate the strength of the relationship between scale items over two or three time points; variety of measures possible | ( |

Predictive validity | To determine if scores predict future outcomes | 9.1 Use bivariate and multivariable regression; stronger and significant associations or causal effects suggest greater predictive validity | ( |

Concurrent validity | To determine the extent to which scale scores have a stronger relationship with criterion measurements made near the time of administration | 9.2 Estimate the association between scale scores and “gold standard” of scale measurement; stronger significant association in Pearson product-moment correlation suggests support for concurrent validity | ( |

Convergent validity | To examine if the same concept measured in different ways yields similar results | 9.3 Estimate the relationship between scale scores and similar constructs using multi-trait multi-method matrix, latent variable modeling, or Pearson product-moment coefficient; higher/stronger correlation coefficients suggest support for convergent validity | ( |

Discriminant validity | To examine if the concept measured is different from some other concept | 9.4 Estimate the relationship between scale scores and distinct constructs using multi-trait multi-method matrix, latent variable modeling, or Pearson product-moment coefficient; lower/weaker correlation coefficients suggest support for discriminant validity | ( |

Differentiation by “known groups” | To examine if the concept measured behaves as expected in relation to “known groups” | 9.5 Select known binary variables based on theoretical and empirical knowledge and determine the distribution of the scale scores over the known groups; use |
( |

Correlation analysis | To determine the relationship between existing measures or variables and newly developed scale scores | 9.6 Correlate scale scores and existing measures or, preferably, use linear regression, intraclass correlation coefficient, and analysis of standard deviations of the differences between scores | ( |

The first step is to articulate the domain(s) that you are endeavoring to measure. A domain or construct refers to the concept, attribute, or unobserved behavior that is the target of the study (

McCoach et al. outline a number of steps in scale development; we find the first five to be suitable for the identification of the domain (

Once the domain is delineated, the item pool can then be identified. This process is also called “question development” (

The deductive method, also known as “logical partitioning” or “classification from above” (

It is considered best practice to combine both deductive and inductive methods to both define the domain and identify the questions to assess it. While the literature review provides the theoretical basis for defining the domain, the use of qualitative techniques moves the domain from an abstract point to the identification of its manifest forms. A scale or construct defined by theoretical underpinnings is better placed to make specific pragmatic decisions about the domain (

It is recommended that the items identified using deductive and inductive approaches should be broader and more comprehensive than one's own theoretical view of the target (

Further, in the development of items, the

Fowler identified five essential characteristics of items required to ensure the quality of construct measurement (

These essentials are sometimes very difficult to achieve. Krosnick (

With regards to the type of responses to these questions, we recommend that questions with dichotomous response categories (e.g., true/false) should have no ambiguity. When a Likert-type response scale is used, the points on the scale should reflect the entire measurement continuum. Responses should be presented in an ordinal manner, i.e., in an ascending order without any overlap, and each point on the response scale should be meaningful and interpreted the same way by each participant to ensure data quality (

In terms of the number of points on the response scale, Krosnick and Presser (

One pitfall in the identification of domain and item generation is the improper conceptualization and definition of the domain(s). This can result in scales that may either be deficient because the definition of the domain is ambiguous or has been inadequately defined (

Caution should also be taken to avoid construct underrepresentation, which is when a scale does not capture important aspects of a construct because its focus is too narrow (

An example of best practice using the deductive approach to item generation is found in the work of Dennis on breastfeeding self-efficacy (

A valuable example for a rigorous inductive approach is found in the work of Frongillo and Nanama on the development and validation of an experience-based measure of household food insecurity in northern Burkina Faso (

Content validity, also known as “theoretical analysis” (

Content validity entails the process of ensuring that only the phenomenon spelled out in the conceptual definition, but not other aspects that “might be related but are outside the investigator's intent for that particular [construct] are added” (

Content validity is mainly assessed through evaluation by expert and target population judges.

Expert judges are highly knowledgeable about the domain of interest and/or scale development; target population judges are potential users of the scale (

Expert judges evaluate each of the items to determine whether they represent the domain of interest. These expert judges should be independent of those who developed the item pool. Expert judgment can be done systematically to avoid bias in the assessment of items. Multiple judges have been used (typically ranging from 5 to 7) (
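Where expert ratings are quantified, Lawshe's content validity ratio (CVR) offers a widely used summary. The sketch below is illustrative only: the panel size, item names, vote counts, and the 0.99 critical value often cited for a seven-judge panel are assumptions, not data from any study discussed here.

```python
# Lawshe's content validity ratio (CVR) for an expert panel:
#   CVR = (n_e - N/2) / (N/2)
# where n_e = number of judges rating the item "essential", N = panel size.
def content_validity_ratio(n_essential: int, n_judges: int) -> float:
    half = n_judges / 2
    return (n_essential - half) / half

# Hypothetical "essential" vote counts from a 7-judge panel (illustrative).
essential_votes = {"item_1": 7, "item_2": 6, "item_3": 4, "item_4": 2}
cvr = {item: content_validity_ratio(n, 7) for item, n in essential_votes.items()}

# For a 7-judge panel, a commonly cited critical CVR is 0.99; items below
# the critical value are candidates for revision or removal.
retained = [item for item, value in cvr.items() if value >= 0.99]
```

Negative CVR values indicate that fewer than half of the judges rated the item essential.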

Another way by which content validity can be assessed through expert judges is by using the Delphi method to come to a consensus on which questions are a reflection of the construct you want to measure. The Delphi method is a technique “for structuring group communication process so that the process is effective in allowing a group of individuals, as a whole, to deal with a complex problem” (

A good example of evaluation of content validity using expert judges is seen in the work of Augustine et al. on adolescent knowledge of micronutrients (

Target population judges are experts at evaluating face validity, which is a component of content validity (

An example of the concurrent use of expert and target population judges comes from Boateng et al.'s work to develop a household-level water insecurity scale appropriate for use in western Kenya (

Pre-testing helps to ensure that items are meaningful to the target population before the survey is actually administered, i.e., it minimizes misunderstanding and subsequent measurement error. Because pre-testing eliminates poorly worded items and facilitates revision of phrasing to be maximally understood, it also serves to reduce the cognitive burden on research participants. Finally, pre-testing represents an additional way in which members of the target population can participate in the research process by contributing their insights to the development of the survey.

Pre-testing has two components: the first is the examination of the extent to which the questions reflect the domain being studied. The second is the examination of the extent to which answers to the questions asked produce valid measurements (

To evaluate whether the questions reflect the domain of study and meet the requisite standards, techniques including cognitive interviews, focus group discussion, and field pre-testing under realistic conditions can be used. We describe the most recommended, which is cognitive interviews.

Cognitive interviewing entails the administration of draft survey questions to target populations and then asking the respondents to verbalize the mental process entailed in providing such answers (

The sample used for cognitive interviewing should capture the range of demographics you anticipate surveying (

In sum, cognitive interviews get to the heart of both assessing the appropriateness of the question to the target population

An example of best practice in pre-testing is seen in the work of Morris et al. (

Collecting data with minimum measurement errors from an adequate sample size is imperative. These data can be collected using paper and pen/pencil interviewing (PAPI) or Computer Assisted Personal Interviewing (CAPI) on devices like laptops, tablets, or phones. A number of software programs exist for building forms on devices. These include Computer Assisted Survey Information Collection (CASIC) Builder™ (West Portal Software Corporation, San Francisco, CA); Qualtrics Research Core™ (

Each approach has advantages and drawbacks. Using technology can reduce the errors associated with data entry, allow the collection of data from large samples with minimal cost, increase response rate, reduce enumerator errors, permit instant feedback, and increase monitoring of data collection and ability to get more confidential data (

On the other hand, paper forms may avert the crisis of losing data if the software crashes, the devices are lost or stolen prior to being backed up, and may be more suitable in areas that have irregular electricity and/or internet. However, as sample sizes increase, the use of PAPI becomes more expensive, time and labor intensive, and the data are exposed in several ways to human error (

The sample size to use for the development of a latent construct has often been contentious. It is recommended that potential scale items be tested on a heterogeneous sample, i.e., a sample that both reflects and captures the range of the target population (

The necessary sample size is dependent on several aspects of any given study, including the level of variation between the variables, and the level of over-determination (i.e., the ratio of variables to number of factors) of the factors (

In sum, there is no single item ratio that works for all survey development scenarios. A larger sample size or respondent-to-item ratio is always better, since a larger sample size implies lower measurement errors, more stable factor loadings, more replicable factors, and results that better generalize to the true population structure (

The development of a scale minimally requires data from a single point in time. To fully test for the reliability of the scale (cf. Steps 8, 9), however, either an independent dataset or a subsequent time point is necessary. Data from longitudinal studies can be used for initial scale development (e.g., from baseline), to conduct confirmatory factor analysis (using follow-up data, cf. Step 7), and to assess test–retest reliability (using baseline and follow-up data). The problem with using longitudinal data to test hypothesized latent structures is common error variance, since the same, potentially idiosyncratic, participants will be involved. To give the most credence to the reliability of scale, the ideal procedure is to develop the scale on sample A, whether cross-sectional or longitudinal, and then test it on an independent sample B.

The work of Chesney et al. on the Coping Self-Efficacy scale provides an example of this best practice in the use of independent samples (

In scale development, item reduction analysis is conducted to ensure that only parsimonious, functional, and internally consistent items are ultimately included (

Two theories, Classical Test Theory (CTT) and Item Response Theory (IRT), underpin scale development (

CTT allows the prediction of outcomes of constructs and the difficulty of items (

Several techniques exist within the two theories to reduce the item pool, depending on which test theory is driving the scale. The five major techniques used are: item difficulty and item discrimination indices, which are primarily for binary responses; inter-item and item-total correlations, which are mostly used for categorical items; and distractor efficiency analysis for items with multiple choice response options (

The item difficulty index is both a CTT and an IRT parameter that can be traced largely to educational and psychological testing to assess the relative difficulties and discrimination abilities of test items (

Under the CTT framework, the item difficulty index, also called item easiness, is the proportion of correct answers on a given item, e.g., the proportion of correct answers on a math test (
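Under CTT, the item difficulty index is simple to compute. A minimal sketch in Python, using a hypothetical matrix of scored responses and a commonly used (though not universal) 0.2–0.8 flagging range:

```python
import numpy as np

# CTT item difficulty index: the proportion of respondents answering each
# item correctly. Rows = respondents, columns = items (1 = correct).
# Hypothetical scored responses (illustrative).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
])

difficulty = responses.mean(axis=0)  # proportion correct per item
# Items that nearly everyone (or no one) answers correctly carry little
# information; one common practice flags difficulties outside ~0.2-0.8.
flagged = np.where((difficulty < 0.2) | (difficulty > 0.8))[0]
```

Here the last item, answered correctly by every respondent, would be flagged for review.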

Under the IRT framework, the item difficulty parameter is the probability of a particular examinee correctly answering any given item (
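The IRT difficulty parameter can be made concrete with the two-parameter logistic (2PL) model, in which an item's difficulty b is the ability level at which a respondent has even odds of answering correctly. A minimal sketch with hypothetical parameter values:

```python
import math

# Two-parameter logistic (2PL) IRT model: probability that a respondent
# with ability theta answers an item correctly, given the item's
# discrimination (a) and difficulty (b) parameters.
def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is exactly 0.5: difficulty marks the
# ability level at which a respondent has even odds of success.
p_at_difficulty = p_correct(theta=1.0, a=1.5, b=1.0)
# Higher-ability respondents have a higher probability on the same item.
p_high = p_correct(theta=2.0, a=1.5, b=1.0)
```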

Researchers must determine whether they need items with low, medium, or high difficulty. For instance, researchers interested in general purpose scales will focus on items with medium difficulty (

The item discrimination index (also called item-effectiveness test), is the degree to which an item correctly differentiates between respondents or examinees on a construct of interest (

The item discrimination index has been found to improve test items in at least three ways. First, non-discriminating items, which fail to discriminate between respondents because they may be too easy, too hard, or ambiguous, should be removed (

An item discrimination index can be calculated through correlational analysis between the performance on an item and an overall criterion (
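One classical way to compute a discrimination index is to contrast the proportion of correct answers in high- vs. low-scoring groups (D = p_upper − p_lower). A minimal sketch with hypothetical scored responses, splitting the sample into halves rather than the conventional top and bottom 27%:

```python
import numpy as np

# Classical discrimination index: D = p_upper - p_lower, the difference in
# the proportion answering an item correctly between high- and low-scoring
# groups. Values near zero (or negative) flag non-discriminating items.
# Hypothetical scored responses (rows = respondents, columns = items).
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
])

totals = responses.sum(axis=1)
order = np.argsort(totals)
half = responses.shape[0] // 2          # split into lower and upper halves
lower, upper = responses[order[:half]], responses[order[-half:]]
discrimination = upper.mean(axis=0) - lower.mean(axis=0)
# Here the last item discriminates negatively (low scorers outperform high
# scorers on it) and would be revised or dropped.
```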

Item discrimination under the IRT framework is a slope parameter that determines how steeply the probability of a correct response changes as the proficiency or trait increases (

A third technique to support the deletion or modification of items is the estimation of inter-item and item-total correlations, which falls under CTT. These correlations, often displayed in the form of a matrix, are used to examine the relationships that exist between individual items in a pool.

Inter-item correlations (also known as polychoric correlations for categorical variables and tetrachoric correlations for binary items) examine the extent to which scores on one item are related to scores on all other items in a scale (

Item-total correlations (also known as polyserial correlations for categorical variables and biserial correlations for binary items) examine the relationship between each item and the total score of the scale items. However, the adjusted item-total correlation, which examines the correlation between an item and the sum score of the remaining items excluding itself, is preferred (
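Both statistics are straightforward to compute. The sketch below, on simulated data with one shared latent signal, estimates the inter-item correlation matrix and the adjusted (corrected) item-total correlations; the data and the 0.30 retention rule of thumb are illustrative, not prescriptive.

```python
import numpy as np

# Inter-item correlations and adjusted (corrected) item-total correlations:
# each item is correlated with the sum of the *other* items so that the
# item does not inflate its own correlation. Simulated data (illustrative).
rng = np.random.default_rng(0)
base = rng.normal(size=100)                       # shared latent signal
items = np.column_stack([base + rng.normal(scale=0.5, size=100) for _ in range(4)])

inter_item = np.corrcoef(items, rowvar=False)     # 4 x 4 correlation matrix
adjusted_item_total = np.array([
    np.corrcoef(items[:, j], np.delete(items, j, axis=1).sum(axis=1))[0, 1]
    for j in range(items.shape[1])
])
# A common rule of thumb retains items whose adjusted item-total
# correlation exceeds roughly 0.30.
retained = np.where(adjusted_item_total > 0.30)[0]
```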

The distractor efficiency analysis shows the distribution of incorrect options and how they contribute to the quality of a multiple-choice item (

In addition to these techniques, some researchers opt to delete items with large numbers of cases that are missing, when other missing data-handling techniques cannot be used (

Factor extraction is the phase in which the optimal number of factors, sometimes called domains, that fit a set of items are determined. This is done using factor analysis. Factor analysis is a regression model in which observed standardized variables are regressed on unobserved (i.e., latent) factors. Because the variables and factors are standardized, the bivariate regression coefficients are also correlations, representing the loading of each observed variable on each factor. Thus, factor analysis is used to understand the latent (internal) structure of a set of items, and the extent to which the relationships between the items are internally consistent (

A number of analytical processes have been used to determine the number of factors to retain from a list of items, and it is beyond the scope of this paper to describe all of them. For scale development, commonly available methods to determine the number of factors to retain include a scree plot (
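As an illustration of one such method, Horn's parallel analysis retains factors whose observed eigenvalues exceed the mean eigenvalues obtained from random data of the same dimensions. A minimal sketch on simulated data containing a single common factor (all values illustrative):

```python
import numpy as np

# Horn's parallel analysis: retain factors whose observed eigenvalues
# exceed the mean eigenvalues from random data of the same dimensions.
# Simulated data with one common factor across 6 items (illustrative).
rng = np.random.default_rng(42)
n_obs, n_items = 300, 6
factor = rng.normal(size=n_obs)
data = np.column_stack([factor + rng.normal(size=n_obs) for _ in range(n_items)])

observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

n_sims = 50
random_eigs = np.empty((n_sims, n_items))
for s in range(n_sims):
    noise = rng.normal(size=(n_obs, n_items))
    random_eigs[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]

# Count how many observed eigenvalues exceed the random-data average.
n_factors = int(np.sum(observed > random_eigs.mean(axis=0)))
```

With a single common factor driving all items, parallel analysis correctly suggests retaining one factor.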

The extraction of factors can also be used to reduce items. In factor analysis, items with factor loadings (slope coefficients) below 0.30 are considered inadequate, as they explain less than 10% of the variation in the latent construct being measured. Hence, it is often recommended to retain items that have factor loadings of 0.40 and above (

Many scale development efforts stop at this phase and jump to tests of reliability, but the factors extracted at this point only provide a

The test of dimensionality is a test in which the hypothesized factors or factor structure extracted from a previous model is tested at a different time point in a longitudinal study or, ideally, on a new sample (

Confirmatory factor analysis is a form of psychometric assessment that allows for the systematic comparison of an alternative

Description of model fit indices and thresholds for evaluating scales developed for health, social, and behavioral research.

Chi-square test | The chi-square value is a test statistic of the goodness of fit of a factor model. It compares the observed covariance matrix with a theoretically proposed covariance matrix | The chi-square test of model fit has been found to be overly sensitive to sample size and to vary when dealing with non-normal variables. Hence, the use of non-normal data, a small sample size ( |
( |

Root Mean Squared Error of Approximation (RMSEA) | RMSEA is a measure of the estimated discrepancy between the population and model-implied population covariance matrices per degree of freedom ( |
Browne and Cudeck recommend RMSEA ≤ 0.05 as indicative of close fit, 0.05 ≤ RMSEA ≤ 0.08 as indicative of fair fit, and values >0.10 as indicative of poor fit between the hypothesized model and the observed data. However, Hu and Bentler have suggested RMSEA ≤ 0.06 may indicate a good fit | ( |

Tucker Lewis Index (TLI) | TLI is based on the idea of comparing the proposed factor model to a model in which no interrelationships at all are assumed among any of the items | Bentler and Bonnett suggest that models with overall fit indices of < 0.90 are generally inadequate and can be improved substantially. Hu and Bentler recommend TLI ≥ 0.95 | ( |

Comparative Fit Index (CFI) | CFI is an incremental relative fit index that measures the relative improvement in the fit of a researcher's model over that of a baseline model | CFI ≥ 0.95 is often considered an acceptable fit | ( |

Standardized Root Mean Square Residual (SRMR) | SRMR is a measure of the mean absolute correlation residual, the overall difference between the observed and predicted correlations | Threshold for acceptable model fit is SRMR ≤ 0.08 | ( |

Weighted Root Mean Square Residual (WRMR) | WRMR uses a “variance-weighted approach especially suited for models whose variables are measured on different scales or have widely unequal variances” (
Yu recommends a threshold of WRMR < 1.0 for assessing model fit. This index is used for confirmatory factor analysis and structural equation models with binary and ordinal variables | ( |

Standard of Reliability for scales | A reliability of 0.90 is the minimum recommended threshold that should be tolerated, while a reliability of 0.95 is the desirable standard. While this ideal has rarely been attained by most researchers, a reliability coefficient of 0.70 has often been accepted as satisfactory for most scales | Nunnally recommends a threshold of ≥0.90 for assessing internal consistency for scales | (
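Several of these indices are simple functions of the model's chi-square statistic. As an illustration, one common formulation of RMSEA can be computed directly from the chi-square value, degrees of freedom, and sample size; the fit statistics below are hypothetical.

```python
import math

# RMSEA from a model's chi-square statistic, degrees of freedom, and
# sample size (one common formulation):
#   RMSEA = sqrt(max(chi2 - df, 0) / (df * (N - 1)))
def rmsea(chi2: float, df: int, n: int) -> float:
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

close_fit = rmsea(chi2=52.3, df=40, n=400)   # ~0.03, within the <=0.05 "close fit" range
poor_fit = rmsea(chi2=180.0, df=40, n=400)   # ~0.09, approaching the >0.10 "poor fit" range
```

When the chi-square value is no larger than its degrees of freedom, RMSEA is zero by construction.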

Bifactor modeling, also referred to as nested factor modeling, is a form of item response theory used in testing dimensionality of a scale (

Another method to test dimensionality is measurement invariance, also referred to as factorial invariance or measurement equivalence (

An alternative approach to measurement invariance in the testing of unidimensionality under item response theory is the Rasch measurement model for binary items and polytomous IRT models for categorical items. Here, emphasis is on testing the differential item functioning (DIF)—an indicator of whether “a group of respondents is scoring better than another group of respondents on an item or a test after adjusting for the overall ability scores of the respondents” (

Whether the hypothesized structure is bidimensional or multidimensional, each dimension in the structure needs to be tested again to confirm its unidimensionality. This can also be done using confirmatory factor analysis. Appropriate model fit indices and the strength of factor loadings (cf. Table

One commonly encountered pitfall is a lack of satisfactory global model fit in confirmatory factor analysis conducted on a new sample following a satisfactory initial factor analysis performed on a previous sample. Lack of satisfactory fit offers the opportunity to identify additional underperforming items for removal. Items with very poor loadings (≤0.3) can be considered for removal. Also, modification indices, produced by structural equation modeling software such as Mplus, can suggest where model fit could be improved.

A good example of best practice is seen in the work of Pushpanathan et al. on the appropriateness of using a traditional confirmatory factor analysis or a bifactor model (

Finalized items from the tests of dimensionality can be used to create scale scores for substantive analysis including tests of reliability and validity. Scale scores can be calculated by using unweighted or weighted procedures. The unweighted approach involves summing standardized item scores or raw item scores, or computing the mean for raw item scores (
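The unweighted scoring options described above can be sketched as follows, using a hypothetical matrix of 5-point Likert responses:

```python
import numpy as np

# Unweighted scale scoring: (a) sum of raw item scores, (b) mean of raw
# item scores, and (c) sum of standardized (z-scored) item scores.
# Rows = respondents, columns = items (hypothetical Likert data).
items = np.array([
    [4, 5, 4, 3],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
])

sum_scores = items.sum(axis=1)        # (a) raw sum per respondent
mean_scores = items.mean(axis=1)      # (b) raw mean per respondent
z = (items - items.mean(axis=0)) / items.std(axis=0)
z_sum_scores = z.sum(axis=1)          # (c) sum of standardized item scores
```

Standardized scoring is useful when items are measured on different response scales, since each item then contributes equally to the total.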

Reliability is the degree of consistency exhibited when a measurement is repeated under identical conditions (

Cronbach's alpha assesses the internal consistency of the scale items, i.e., the degree to which the set of items in the scale co-vary, relative to their sum score (
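Cronbach's alpha is easily computed from the item variances and the variance of the total score. A minimal sketch on simulated data in which four items are driven by one underlying trait (the data and noise level are illustrative):

```python
import numpy as np

# Cronbach's alpha: internal consistency of a set of scale items,
#   alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated data: 4 items driven by one underlying trait (illustrative).
rng = np.random.default_rng(7)
trait = rng.normal(size=200)
items = np.column_stack([trait + rng.normal(scale=0.7, size=200) for _ in range(4)])
alpha = cronbach_alpha(items)  # strongly correlated items -> alpha well above 0.7
```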

An additional approach in testing reliability is the test–retest reliability. The test–retest reliability, also known as the coefficient of stability, is used to assess the degree to which the participants' performance is repeatable, i.e., how consistent their sum scores are across time (
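Test–retest reliability over two time points is often summarized with a simple correlation of total scores across administrations. A minimal sketch on simulated scores (all values illustrative):

```python
import numpy as np

# Test-retest reliability: correlate total scale scores from two
# administrations of the same scale to the same respondents.
# Simulated scores (illustrative); time-2 scores are time-1 plus noise.
rng = np.random.default_rng(3)
time1 = rng.normal(loc=20, scale=4, size=150)
time2 = time1 + rng.normal(scale=1.5, size=150)

test_retest_r = np.corrcoef(time1, time2)[0, 1]
# Coefficients of roughly 0.7 or higher are conventionally read as
# acceptable stability across time.
```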

The work of Johnson et al. (

Other approaches found to be useful and support scale reliability include split-half estimates, Spearman-Brown formula, alternate form method (coefficient of equivalence), and inter-observer reliability (

Scale validity is the extent to which “an instrument indeed measures the latent dimension or construct it was developed to evaluate” (

Criterion validity is the “degree to which there is a relationship between a given test score and performance on another measure of particular relevance, typically referred to as criterion” (

Concurrent criterion validity is the extent to which test scores have a strong relationship with a criterion (gold standard) measurement made at the time of test administration or shortly afterward (

One limitation of concurrent validity is that this strategy does not work well with small samples because of their large sampling errors. A second is that appropriate criterion variables, or "gold standards," may not be available (

Construct validity is the “extent to which an instrument assesses a construct of concern and is associated with evidence that measures other constructs in that domain and measures specific real-world criteria” (

Convergent validity is the extent to which a construct measured in different ways yields similar results. Specifically, it is the “degree to which scores on a studied instrument are related to measures of other constructs that can be expected on theoretical grounds to be close to the one tapped into by this instrument” (

Discriminant validity is the extent to which a measure is novel and not simply a reflection of some other construct (
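Both convergent and discriminant validity are commonly checked with simple correlations: the new scale should correlate strongly with a theoretically related measure and weakly with a theoretically distinct one. An illustrative Python sketch; all scores and variable names are hypothetical:

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Hypothetical scores for 5 respondents:
new_scale = [16, 8, 19, 12, 14]
related   = [15, 9, 18, 13, 14]   # theoretically similar construct
unrelated = [10, 9, 11, 12, 8]    # theoretically distinct construct

r_convergent = pearson_r(new_scale, related)      # expect high
r_discriminant = pearson_r(new_scale, unrelated)  # expect near zero
```

In this toy example the convergent correlation is near 1 while the discriminant correlation is modest, which is the pattern that supports construct validity.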

Differentiation or comparison between known groups examines the distribution of a newly developed scale score over known binary items (
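For example, a group known to have a condition should, on average, score differently on the new scale than a group known not to have it. A minimal sketch using Welch's t statistic (hypothetical data; the significance lookup against a t distribution is omitted):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic comparing mean scale scores of two known groups
    (sample variances; does not assume equal group variances)."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

# Hypothetical scale scores: group with the condition vs. group without
with_cond = [18, 16, 19, 17, 20]
without = [10, 12, 9, 11, 13]
t = welch_t(with_cond, without)
```

A large t statistic in the expected direction supports known-groups validity of the scale.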

Although correlational analysis is frequently used, bivariate regression analysis is preferable for quantifying validity (
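One practical reason for this preference is that a regression slope is expressed in the units of the criterion, which eases interpretation. A minimal Python sketch of a bivariate ordinary least squares fit (the data and variable names are hypothetical):

```python
from statistics import mean, pvariance

def ols_slope_intercept(x, y):
    """Bivariate OLS regression of y on x: returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    slope = cov / pvariance(x)
    return slope, my - slope * mx

# Hypothetical: criterion measure regressed on the new scale score
scale_score = [16, 8, 19, 12, 14]
criterion = [32, 18, 39, 25, 29]
slope, intercept = ols_slope_intercept(scale_score, criterion)
```

Here the slope says how many criterion units change per one-point change in the scale score, something a unitless correlation coefficient cannot convey.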

Taken together, these methods make it possible to assess the validity of an adapted or a newly developed scale. In addition to predictive validity, existing studies in fields such as health, social, and behavioral sciences have shown that scale validity is supported if at least two of the different forms of construct validity discussed in this section have been examined. Further information about establishing validity and constructing indicators from scales can be found in Frongillo et al. (

In sum, we have sought to give an overview of the key steps in scale development and validation (Figure

Because scale development is so complicated, this should be considered a primer, i.e., a "jumping off point" for anyone interested in scale development. The technical literature and examples of rigorous scale development mentioned throughout will be important for readers to pursue. There are a number of matters not addressed here, including how to interpret scale output, the designation of cut-offs, when indices, rather than scales, are more appropriate, and principles for re-testing scales in new populations. Also, this review leans toward the classical test theory approach to scale development; a comprehensive review of IRT modeling would be a useful complement. We hope this review helps to ease readers into the literature, but space precludes consideration of all these topics.

The necessity of the nine steps that we have outlined here (Table

Well-designed scales are the foundation of much of our understanding of a range of phenomena, but ensuring that we accurately quantify what we purport to measure is not a simple matter. By making scale development more approachable and transparent, we hope to facilitate the advancement of our understanding of a range of health, social, and behavioral outcomes.

GB and SY developed the first draft of the scale development and validation manuscript. All authors participated in the editing and critical revision of the manuscript and approved the final version of the manuscript for publication.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We would like to acknowledge the importance of the works of several scholars of scale development and validation used in developing this primer, particularly Robert DeVellis, Tenko Raykov, George Marcoulides, David Streiner, and Betsy McCoach. We would also like to acknowledge the help of Josh Miller of Northwestern University for assisting with design of Figure

χ^{2} test of goodness of fit

audio computer self-assisted interviewing

adherence self-efficacy scale

computer assisted personal interviewing

confirmatory factor analysis

computer assisted survey information collection builder

comparative fit index

classical test theory

differential item functioning

exploratory factor analysis

full information maximum likelihood

fear of negative evaluation

global factor

intraclass correlation coefficient

Independent cluster model

item response theory

Open Data Kit

paper and pen/pencil interviewing

Questionnaire Development System

root mean square error of approximation

social avoidance and distress

statistical analysis systems

social anxiety scale for children revised

structural equation model

statistical package for the social sciences

statistics and data

standardized root mean square residual

Tucker Lewis Index

water, sanitation, and hygiene

weighted root mean square residual.