SYSTEMATIC REVIEW article

Front. Psychol., 08 February 2021

Sec. Psychology of Language

Volume 11 - 2020 | https://doi.org/10.3389/fpsyg.2020.01941

An Extensive Knowledge Mapping Review of Measurement and Validity in Language Assessment and SLA Research

  • 1. National Institute of Education, Nanyang Technological University, Singapore, Singapore

  • 2. Nanyang Technological University, Singapore, Singapore

  • 3. College of Computing and Informatics, Drexel University, Philadelphia, PA, United States

  • 4. Department of Information Science, Yonsei University, Seoul, South Korea

Abstract

This study set out to investigate the intellectual domains of language assessment and second language acquisition (SLA) research published in English in peer-reviewed journals, together with the use of measurement and validation methods in that research. Using Scopus, we created two datasets: (i) a core journals dataset consisting of 1,561 articles published in four language assessment journals, and (ii) a general journals dataset consisting of 3,175 articles on language assessment published in the top journals of SLA and applied linguistics. We applied document co-citation analysis to detect thematically distinct research clusters. Next, we coded the citing papers in each cluster based on an analytical framework for measurement and validation. We found that the core journals focused primarily on reading and listening comprehension assessment, secondarily on facets of speaking and writing performance such as raters and validation, and thirdly on feedback, corpus linguistics, and washback. By contrast, the primary focus of assessment research in the general journals was on vocabulary, oral proficiency, essay writing, grammar, and reading. The secondary focus was on affective schemata, awareness, memory, language proficiency, explicit vs. implicit language knowledge, language or semantic awareness, and semantic complexity. With the exception of language proficiency, this second area of focus was absent in the core journals. We further found that the majority of citing publications in the two datasets did not carry out inference-based validation of their instruments before using them. More research is needed to determine what motivates authors to select and investigate a topic, how thoroughly they cite past research, and what internal (within a field) and external (between fields) factors sustain a research topic in language assessment.

Introduction

Although the practice of language testing and/or assessment can be traced back to ancient China (Spolsky, 1990), many language assessment scholars recognize Lado's (1961) pioneering book and Carroll's (1961) book chapter as the beginning of the modern language testing/assessment field (Davies, 2008, 2014). The field was routinely referred to as language testing, at least from the 1950s until the 1990s. In contemporary usage, it is possible to distinguish between testing and assessment in terms of the formality and stakes involved in the procedures, the use of quantitative vs. qualitative approaches in design and implementation, and other aspects1. Nonetheless, in the present study, testing and assessment are used interchangeably. Despite the general recognition of 1961 as the beginning of the field of language testing, many language testing studies had been published before 1961, particularly in the field of reading (e.g., Langsam, 1941; Davis, 1944; Hall and Robinson, 1945; see also Rosenshine, 2017; Aryadoust, 2020 for reviews). By definition, these studies qualify as language testing research and practice since they meet several criteria that Priscilla Allen, Alan Davies, Carol Chapelle and Geoff Brindley, and F. Y. Edgeworth set forth in their delineations of language testing, most notably the practice of evaluating language ability/proficiency, the psychometric activity of developing language tests, and/or decision making about test takers based on test results (Fulcher, n.d.).

In order to build a fair portrayal of a discipline, researchers often review the research outputs that have been generated over the years to understand its past and present trends (Goswami and Agrawal, 2019). For language assessment, several scholars have surveyed the literature and divided its development into distinct periods (Spolsky, 1977, 1995; Weir, 1990; Davies, 2014), while characterizing its historical events (Spolsky, 2017). Alternatively, some provided valuable personal reflections on the published literature (Davies, 1982; Skehan, 1988; Bachman, 2000; Alderson and Banerjee, 2001, 2002). Examples of personal reflections on specific parts of language assessment history also include Spolsky's (1990) paper on the “prehistory” of oral examinations and Weir et al.'s (2013) historical review of Cambridge assessments.

These narrative reviews offer several advantages, such as the provision of “experts' intuitive, experiential, and explicit perspectives on focused topics” (Pae, 2015, p. 417). On the other hand, narrative reviews are qualitative in nature and do not use databases or rigorous frameworks and methodologies (Jones, 2004; Petticrew and Roberts, 2006). This contrasts with quantitative reviews, which have specific research questions or hypotheses and rely on the quantitative evaluation and analysis of data (Collins and Fauser, 2005). An example of such an approach is Scientometrics, which is “the quantitative methods of the research on the development of science as an informational process” (Nalimov and Mulcjenko, 1971, p. 2). This approach comprises several main themes, including “ways of measuring research quality and impact, understanding the processes of citations, mapping scientific fields and the use of indicators in research policy and management” (Mingers and Leydesdorff, 2015, p. 1). This wide scope makes Scientometrics a specialized and “extensively institutionalized area of inquiry” (De Bellis, 2014, p. 24). Thus, it is appropriate for analyzing entire areas of research across various research fields (Mostafa, 2020).

Present Study

The present study had two main aims. First, we adopted Scientometrics to identify the intellectual structure of language assessment research published in English peer-reviewed journals. Although Scientometrics and similar approaches such as Bibliometrics have been adopted in applied linguistics to investigate the knowledge structure across several research domains (Arik and Arik, 2017; Lei and Liu, 2019), no study to date has investigated the intellectual structure of research in language assessment. Here, intellectual structure refers to a set of research clusters that represents specialized knowledge groups and research themes, as well as the growth of the research field over time (Goswami and Agrawal, 2019). To identify an intellectual structure, a representative dataset of the published literature is first generated and specialized software is subsequently applied to mine and extract the hidden structures in the data (Chen, 2016). The measures generated are then used to portray the structure and dynamics of the field “objectively,” where the dataset represents the research field in question (Goswami and Agrawal, 2019). Second, we aimed to examine the content of the emergent research clusters, using two field-specific frameworks to determine how each cluster can be mapped onto commonly adopted methodologies in the field: validity argument (Chapelle, 1998; Bachman, 2005; Kane, 2006; Chapelle et al., 2008; Bachman and Palmer, 2010) and measurement frameworks (Norris and Ortega, 2003). The two research aims are discussed in detail next.

First Aim

To achieve the first aim of the study, we adopted a Scientometric technique known as document co-citation analysis (DCA) (Chen, 2006, 2010) to investigate the intellectual structure of the field of language assessment as well as assessment-based research in second language acquisition (SLA). Co-citation refers to the frequency with which two or more publications are referenced together in another publication (Chen, 2003, 2016). When a group of publications cites the same papers and books, this means that they are not only thematically related but also take reference from the same pool of papers (Chen, 2003). Moreover, co-citations can also be generalized to authors and journals by identifying the frequency with which publications have been written by the same authors or cited in the same journals (Chen, 2004, 2006; Chen and Song, 2017). Of note, co-citation analysis is similar to factor analysis, which is extensively used for data reduction and pattern recognition in surveys and tests. In the latter, items are categorized into separate clusters called factors based on their correlation patterns. Factor loadings indicate the correlation of the item in question with other items that are categorized as a factor (Field, 2018). Some items have high loadings on latent variables, whereas others have low loading coefficients. The items with low loading coefficients do not make a significant contribution to the measurement of the ability or skill under assessment and can be removed from the instrument without affecting the amount of variance explained by the test items (Field, 2018). Similarly, co-citation analysis categorizes publications into discrete research clusters based on the publications that are co-cited in each cluster. When two publications co-cite a source or reference, this suggests that they may be related. If these publications share (co-cite) at least 50% of their references, it is plausible that there is a significant thematic link between them.
Identifying the publications that co-cite the same sources facilitates the identification of related research clusters via their pool of references. The publications that are clustered together (like factors in factor analysis) may then be inspected for their thematic relationships, either automatically through text-mining methods or manually by experts who read the content of the clustered publications. Furthermore, each cluster may contain influential publications that have received sudden surges of citations from other publications; these are termed “citation bursts.” Reviewing the content of the publications with citation bursts can further help researchers characterize the cluster in terms of its focus and scope (Chen, 2017).
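To make the counting step concrete, the following minimal Python sketch (with hypothetical publication labels) tallies how often pairs of references appear together in the reference lists of citing papers. It illustrates only the raw co-citation count; CiteSpace's DCA additionally applies time slicing, thresholding, and normalization.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Tally how often each pair of cited works appears together in
    the reference list of the same citing paper."""
    counts = Counter()
    for refs in reference_lists:
        # sorted() gives each pair a canonical order; set() drops duplicates
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts

# Hypothetical citing papers and the works they reference.
citing_papers = [
    ["Bachman1990", "Kane2006", "Messick1989"],
    ["Bachman1990", "Kane2006"],
    ["Chapelle2008", "Messick1989"],
]
counts = cocitation_counts(citing_papers)
print(counts[("Bachman1990", "Kane2006")])  # co-cited by two papers -> 2
```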

Second Aim

To achieve the second aim of the study, we developed a framework to describe measurement and validation practices across the emergent clusters. Despite the assumption that testing and assessment practices are specific to the language assessment field, SLA researchers have employed certain assessment techniques to investigate research questions pertinent to SLA (Norris and Ortega, 2003). Nevertheless, there seem to be methodological and conceptual gaps in assessment between the language testing field and SLA, which several publications have attempted to bridge (Upshur, 1971; Bachman, 1990; see chapters in Bachman and Cohen, 1998). Bachman (1990, p. 2) asserted that “language testing both serves and is served by research in language acquisition and language teaching. Language tests, for example, are frequently used as criterion measures of language abilities in second language acquisition research.” He extended the uses and contributions of language assessment to teaching and learning practices, stressing that language tests are used for a variety of purposes such as assessing progress and achievement, diagnosing learners' strengths and weaknesses, and serving as tools for SLA research. He stressed that insights from SLA can reciprocally assist language assessment experts in developing more useful assessments. For example, insights from SLA research on learners' characteristics and personality can help language testing experts develop measurement instruments to investigate the effect of learner characteristics on assessment performance. Therefore, in Bachman's (1990) view, the relationship between SLA and language assessment is neither exclusively unidirectional nor limited to validity and reliability matters. Despite this, doubts have been voiced regarding the measurement of constructs in SLA (Bachman and Cohen, 1998) and the validity of the instruments used in SLA (Chapelle, 1998).
For example, Norris and Ortega (2003) critiqued SLA research on the grounds that measurement is not often conducted with sufficient rigor.

Measurement is defined as the process of (i) construct representation, (ii) construct operationalization, (iii) data collection via “behavior elicitation” (Norris and Ortega, 2003, p. 720), (iv) data analysis to generate evidence, and (v) the employment of that evidence to draw theory-based conclusions (Messick, 1989, 1996). To establish whether measurement instruments function properly, it is essential to investigate their reliability and, where applicable and plausible, validate interpretations and uses of their results (scores) (Messick, 1996; Kane, 2006). Reliability refers to the evidence that the measurement is precise or has low error of measurement (Field, 2018) and its output is reproducible across occasions, raters, and test forms (Green and Salkind, 2014; Grabowski and Oh, 2018). In addition, since the publication of Cronbach and Meehl's (1955) paper, validation has been primarily treated as the process of developing arguments to justify the meaning and utility of test scores or assessment results. Messick (1989) emphasized that validation should encompass evidentiary and consequential bases of score interpretation and meaning and Kane (2006) proposed a progressive plan for collecting various sorts of evidence to buttress inferences drawn from the data and rebut counter-evidence (if any). Like the theory of measurement, Messick's (1989) and Kane's (2006) frameworks have had a lasting impact on language assessment (Bachman, 2005; Chapelle et al., 2008; Bachman and Palmer, 2010; Aryadoust, 2013).
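As an aside, the internal-consistency facet of reliability mentioned above is commonly estimated with Cronbach's alpha, alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores). A minimal sketch with invented toy scores:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha: items is a list of per-item score lists,
    with respondents in the same order for every item."""
    k = len(items)
    total_scores = [sum(scores) for scores in zip(*items)]  # per-respondent totals
    item_var_sum = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / pvariance(total_scores))

# Invented scores: three items, four respondents.
items = [[1, 2, 3, 4], [2, 2, 3, 5], [1, 3, 3, 4]]
print(round(cronbach_alpha(items), 3))  # 0.947
```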

We note that, in addition to the argument-based validation framework, there are several other validation frameworks, such as Weir's (2005b) socio-cognitive framework and Borsboom and Mellenbergh's (2007) test validity framework, which have been adopted in some previous research. However, Borsboom and Mellenbergh's (2007) work is less well-known in language assessment and SLA and has a heavy focus on psychometrics. In addition, certain components of Weir's (2005a) framework, such as cognitive validity, are relatively under-researched in language assessment and SLA, and coding the studies for these components would generate comparatively little useful information. Therefore, the choice of the argument-based validation framework seems more plausible for this study, although we do recognize the limitations of the approach (see Conclusion).

Bachman (2005) stressed that, before using an assessment for decision-making purposes, a validity argument should be fully fledged in terms of evidence supporting test developers' claims. On the other hand, empirical validation studies have demonstrated that collecting such evidence to establish an all-encompassing validity argument is an arduous and logistically complex task (Chapelle et al., 2008; Aryadoust, 2013; Fan and Yan, 2020). We are, hence, keen to determine the extent to which language assessment and SLA studies involving measurement and assessment have fulfilled the requirements of validation in the research clusters that are identified through DCA.

Methodology

Overview

This study investigated the intellectual structure of the language assessment field. It examined the literature over the period 1918–2019 to identify the network structure of the influential research domains involved in the evolution of language assessment. The year 1918 was chosen as the lower limit because it is the earliest year covered by Scopus. The study adopted a co-citation method comprising document co-citation analysis (DCA) (Small and Sweeney, 1985; Chen, 2004, 2006, 2010, 2016; Chen et al., 2008, 2010). The study also adopted CiteSpace Version 5.6.R3 (Chen, 2016), a computational tool used to identify highly cited publications and authors that act as pivotal points of transition within and among research clusters (Chen, 2004).

Data Source and Descriptive Statistics

Scopus was employed as our main database, with selective searches carried out to create the datasets of the study. We identified several publications that defined language assessment as the practice of assessing first, second or other languages (Hornberger and Shohamy, 2008), including the assessment of what is known to be language “skills and elements” or a combination of them. Despite the defined scope, the bulk of the publications concerns SLA (as will be seen later). We treated the journals that proclaimed their focus to be exclusively language assessment as the “core journals” of the field, while using a keyword search to identify the focus of language assessment publications in applied linguistics/SLA journals. Accordingly, two datasets were created (see Appendix for the search code).

  • A core journals dataset consisting of 1,561 articles published in Language Testing, Assessing Writing, Language Assessment Quarterly, and Language Testing in Asia, which were indexed in Scopus. These journals focus specifically on publishing language assessment research and were, accordingly, labeled as core journals. The dataset also included all the publications (books, papers etc.) that were cited in the References of these articles.

  • A general journals dataset consisting of 3,175 articles on language assessment published in the top 20 journals of applied linguistics/SLA. The dataset also included all the publications cited in these articles. This list of journals was identified based on their ranking in the “Scimago Journal and Country Rank (SJR)” database and their relevance to the current study. The journals consisted of Applied Psycholinguistics, System, Language Learning, Modern Language Journal, TESOL Quarterly, Studies in Second Language Acquisition, English Language Teaching, RELC Journal, Applied Linguistics, Journal of Second Language Writing, English for Specific Purposes, Language Awareness, Language Learning and Technology, Recall, Annual Review of Applied Linguistics, and Applied Linguistics Review. There was no overlap between datasets (i) and (ii). To create dataset (ii), the Scopus search engine was set to search for the generic keywords “test,” “assess,” “evaluate,” “rate,” and “measure” in the titles, keywords, or abstracts of publications2. These search words were chosen from the list of high-frequency words that were extracted by Scopus from the core journals dataset (i). Next, we reviewed the coverage of the 1,405 out of 3,175 articles3, as determined by the CiteSpace analysis, that contributed to the networks in this dataset to ascertain whether they addressed a topic in language assessment. The publications were found to either have an exclusive focus on assessment or use assessment methods (e.g., test development, reliability analysis, or validation) as one of the components of the study.

Supplemental Table 1 presents the total number of articles published by the top 20 journals, countries/regions, and academic institutes. The three journals with the largest output were Language Testing, System, and Language Learning, with totals of 690, 389, and 361 papers published between 1980 and 2019—note that there were language testing/assessment studies published earlier in other journals. In general, the journals published more than 100 papers each, with the exceptions of Language Learning Journal, ReCall, Language Awareness, Journal of Second Language Writing, Language Learning and Technology, and English for Specific Purposes. The total number of papers published by the top five journals (2,087) accounted for more than 50% of the papers published by all the journals.

The top five countries/regions producing the greatest number of articles were the United States (US), the United Kingdom, Canada, Iran, and Japan, with 1,644, 448, 334, 241, and 233 articles, respectively. Eleven of the top 20 countries/regions, listed in Supplemental Table 1, published more than 100 articles. The top three academic institutes publishing articles were the Educational Testing Service (n = 99), the University of Melbourne (n = 92), and Michigan State University (n = 68). In line with the top producing country, just over half of these institutions were located in the US.

First Aim: Document Co-Citation Analysis (DCA)

The document co-citation (DCA) technique was used to measure the frequency of earlier literature co-cited together in later literature. DCA was used to establish the strength of the relationship between the co-cited articles, identify ‘popular’ publications with high citations (bursts) in language assessment, and identify research clusters comprising publications related via co-citations4. DCA was conducted twice—once for each dataset obtained from Scopus, as previously discussed. We further investigated the duration of burstness (the period of time in which a publication continued to be influential) and burst strength (the quantified magnitude of influence).

Visualization and Automatic Labeling of Clusters

The generation of a timeline view in CiteSpace allowed clusters of publications to be visualized on discrete horizontal axes. Clusters were arranged vertically in descending order of size, with the largest cluster at the top. Colored lines represent co-citation links, with each color denoting the time period in which the corresponding link was first made. Publications that had a citation burst and/or were highly cited are represented with red tree rings or appear larger than the surrounding nodes.

The identified clusters were automatically labeled. In CiteSpace, three term-ranking algorithms can be used to label clusters: latent semantic indexing (LSI), log-likelihood ratio (LLR), or mutual information (MI). The ranking algorithms use different methods to identify the cluster themes. LSI uses document matrices but is “underdeveloped” (Chen, 2014, p. 79). Both LLR and MI identify cluster themes by indexing noun phrases in the abstracts of citing articles (Chen et al., 2010), with different ways of computing the relative importance of said noun phrases. We chose the labels selected by LLR (rather than MI) as they represent unique aspects of the cluster (Chen et al., 2010) and are more precise at identifying cluster themes (Aryadoust and Ang, 2019).
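Although we do not reproduce CiteSpace's implementation here, the LLR statistic used for labeling is generally understood to be Dunning's log-likelihood ratio (G²) computed over a 2 × 2 contingency table of term-by-cluster counts; terms sharply concentrated in one cluster's abstracts score highest. A minimal sketch (the counts are invented):

```python
from math import log

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 table:
    k11 = occurrences of the term in the cluster's abstracts,
    k12 = occurrences of the term elsewhere,
    k21/k22 = occurrences of all other terms in/outside the cluster."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g2 = 0.0
    for i, row in enumerate(((k11, k12), (k21, k22))):
        for j, obs in enumerate(row):
            if obs > 0:
                expected = rows[i] * cols[j] / n
                g2 += obs * log(obs / expected)
    return 2 * g2

# A term concentrated in one cluster scores far higher than an evenly spread one.
print(llr(30, 2, 500, 2000) > llr(8, 8, 522, 1994))  # True
```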

While separate clusters represent discrete research themes, some clusters may consist of sub-themes. For example, our previous research indicated that certain clusters are characterized by publications that present general guidelines on the application of quantitative methods alongside publications focused on a special topic, e.g., language-related topics (Aryadoust and Ang, 2019; Aryadoust, 2020). In such cases, subthemes and their relationships should be identified (Aryadoust, 2020).

Temporal and Structural Measures of the Networks

To evaluate the quality of the DCA network, temporal and structural measures of the networks were computed. Temporal measures comprised citation burstness and sigma (Σ). Citation burstness shows how favorably an article was regarded in the scientific community. If a publication receives no sudden increase of citations, its burstness tends to be close to or equal to zero. On the other hand, there is no upper boundary for burstness. The sigma value of a node in CiteSpace merges citation burstness and betweenness centrality, demonstrating both the temporal and structural significance of a citation. Sigma can also be indicative of novelty, detecting publications that presented novel ideas in their respective field (Chen et al., 2010). That is, the higher the sigma value, the higher the likelihood that the publication includes novel ideas.
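As we understand the CiteSpace literature, sigma is computed as (betweenness centrality + 1) raised to the power of burstness (Chen et al., 2010), so a node must be both structurally central and temporally bursty to obtain a high value. A one-line sketch (the illustrative inputs are ours):

```python
def sigma(centrality, burstness):
    """Sigma as (betweenness centrality + 1) ** burstness: a node with
    zero centrality yields 1.0 no matter how strong its burst."""
    return (centrality + 1) ** burstness

print(sigma(0.0, 8.0))  # 1.0 -- bursty but structurally peripheral
print(sigma(0.35, 11.77))  # > 30: both bursty and central
```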

Structural measures comprised the average silhouette score, betweenness centrality, and the modularity (Q) index. The average silhouette score ranges between −1 and 1 and measures the quality of the clustering configuration (Chen, 2019). This score defines how well a cited reference matches with the cluster in which it has been placed (vs. other clusters), depending on its connections with neighboring nodes (Rousseeuw, 1987). A high mean silhouette score suggests a large number of citers leading to the formation of a cluster, and is therefore reflective of high reliability of clustering; by contrast, a low silhouette score illustrates low homogeneity of clusters (Chen, 2019).
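The silhouette statistic itself follows directly from Rousseeuw's (1987) definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and s = (b - a)/max(a, b). A pure-Python sketch on invented two-dimensional toy points:

```python
from math import dist

def mean_silhouette(points, labels):
    """Rousseeuw's silhouette averaged over all points (Euclidean distance).
    Assumes every cluster has at least two members."""
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / labels.count(lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters -> mean silhouette close to 1.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(round(mean_silhouette(pts, [0, 0, 1, 1]), 2))  # 0.9
```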

The modularity (Q) index ranges between −1 and 1 and determines the overall intelligibility of a network by decomposing it into several components (Chen et al., 2010; Chen, 2019). A low Q score hints at a network cluster without clear boundaries, while a high Q score is telling of a well-structured network (Newman, 2006).
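Newman's (2006) modularity can likewise be sketched from its definition: Q sums, over communities, the observed fraction of within-community edges minus the fraction expected under random rewiring that preserves node degrees. A minimal illustration on an invented toy graph:

```python
def modularity(edges, community):
    """Newman's modularity Q for an undirected graph.
    edges: (u, v) pairs; community: dict mapping node -> community id."""
    m = len(edges)
    intra = {}   # L_c: edges with both endpoints inside community c
    degree = {}  # d_c: summed degree of community c's nodes
    for u, v in edges:
        degree[community[u]] = degree.get(community[u], 0) + 1
        degree[community[v]] = degree.get(community[v], 0) + 1
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    return sum(intra.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
parts = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, parts), 3))  # 0.357
```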

The betweenness centrality metric ranges between 0 and 1 and assesses the degree to which a node lies on the paths connecting other nodes within the network (Brandes, 2001). A high betweenness centrality indicates that a publication may contain groundbreaking ideas; for example, if a node is the only connection between two large but otherwise unrelated clusters, that node will score high on betweenness centrality (Chen et al., 2010).

However, it must be noted that these measures are not absolute scales on which a higher value automatically indicates greater importance. Rather, they show tendencies and directions for the analyst to pursue. In practice, one should also consider the diversity of the citing articles (Chen et al., 2010). For example, a high silhouette value generated from a single citing article is not necessarily indicative of greater importance than a relatively lower value derived from multiple distinct citing articles. Likewise, the significance of the modularity index and the betweenness centrality metric is subject to interpretation and dependent on further analyses, including analyses of the citing articles.

Second Aim: The Analytical Framework

In DCA, clusters reflect what citing papers have in common in terms of how they cite references together (Chen, 2006). Therefore, we designed an analytical framework to examine the citing publications in the clusters (Table 1). In addition, we took into account the bursts (cited publications) per cluster in deciding what features would characterize each cluster. The framework was informed by a number of publications in language assessment research such as Aryadoust (2013), Bachman (1990), Bachman and Cohen (1998), Bachman and Palmer (2010), Chapelle et al. (2008), Eckes (2011), Messick (1989), Messick (1996), Kane (2006), Norris and Ortega (2003), and Xi (2010a). In Table 1, “component” is a generic term to refer to the inferences that are drawn from the data and are supported by warrants (specific evidence that buttress the claims or conclusions of the data analysis) (Kane, 2006; Chapelle et al., 2008; Bachman and Palmer, 2010). In addition, it also refers to the facets of measurement articulated by Messick (1989, 1996) and Norris and Ortega (2003) in their investigation of measurement and construct definition in assessment and SLA. It should be noted that the validity components in this framework, i.e., generalization, explanation, extrapolation, and utilization, are descriptive (rather than evaluative) and intended to record whether or not particular studies reported evidence for them. Thus, the lack of reporting of these components does not necessarily indicate that this evidence was not presented when it should have been, unless it is stated otherwise.

Table 1

Component: Domain specification
Definition: The definition of the target language use (TLU) domain and the components of the representation of the construct in question (construct representation)
Relevant procedures and/or warrants: Generating a theoretical framework to explain (i) the cognitive processes of the latent trait under investigation (competency-based approach) and/or (ii) the characteristics of the tasks that represent the TLU domain (task-based approach)
References: Messick, 1989; Norris and Ortega, 2003; Chapelle et al., 2008

Component: Construct operationalization
Definition: The realization of the construct, i.e., translating the construct definition into actual assessment instruments
Relevant procedures and/or warrants: (i) Using one or more task formats such as open-ended questions or discrete-point/selected-response methods like multiple-choice questions, and (ii) experts' evaluation of the tasks
References: Messick, 1989; Norris and Ortega, 2003

Component: Evaluation (scoring)
Definition: Eliciting the intended behavior from the test taker and using a scale to translate the test performance into a score, mark, or grade
Relevant procedures and/or warrants: (i) Developing or adapting a scale to grade or provide feedback on students' performance, conducted by human raters or machines (e.g., automated writing evaluators); (ii) establishing the reliability of the scale using reliability analysis (e.g., internal consistency or rater reliability)
References: Norris and Ortega, 2003; Kane, 2006; Chapelle et al., 2008; Bachman and Palmer, 2010; Xi, 2010a; Grabowski and Oh, 2018

Component: Generalization
Definition: Establishing whether the observed scores represent a “universe score” and are not exclusive to the test form, rater, or test item formats in the assessment
Relevant procedures and/or warrants: Generalizability theory analysis or many-facet Rasch measurement to investigate the sources of variance and error in the data as well as erratic marking patterns
References: Kane, 2006; Eckes, 2011; Aryadoust, 2013; Grabowski and Lin, 2019; Sawaki and Xi, 2019

Component: Explanation (analogous to traditional construct validation)
Definition: Establishing whether the test engages the target construct or whether the test takers' performance can primarily be explained by the target construct
Relevant procedures and/or warrants: Latent variable analysis such as exploratory or confirmatory factor analysis or Rasch measurement
References: Chapelle et al., 2008

Component: Extrapolation (analogous to traditional criterion evidence of validity)
Definition: Establishing whether the test scores can be extrapolated to or predict test takers' performance in the TLU domain
Relevant procedures and/or warrants: Correlation analysis, regression analysis, or structural equation modeling (SEM) to examine the relationships between test results and the future performance of test takers in the TLU domain
References: Kane, 2006; Bachman and Palmer, 2010

Component: Utilization (analogous to traditional washback research or consequential validity)
Definition: Establishing whether the test results are used appropriately and whether their use has any positive impact on the individual, the educational system, and society
Relevant procedures and/or warrants: Investigation of washback by collecting evidence from classrooms, workplaces, or test takers, using questionnaires or interviews and analysis methods such as SEM or regression analysis
References: Bailey, 1999; Bachman and Palmer, 2010

The analytical framework to address the second aim of the study.

Using this framework, we coded the publications independently and compared our codes. Only a few discrepancies were identified, and these were subsequently resolved by the first author.
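Although the study reports only that discrepancies were resolved, inter-coder agreement of this kind is often quantified with a chance-corrected index such as Cohen's kappa; the sketch below (with invented codes) illustrates the computation:

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Chance-corrected agreement between two coders' category labels:
    kappa = (observed agreement - expected agreement) / (1 - expected)."""
    n = len(coder1)
    observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    freq1, freq2 = Counter(coder1), Counter(coder2)
    expected = sum(freq1[c] * freq2[c] for c in freq1) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical codes for eight publications by two independent coders.
c1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
c2 = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no"]
print(round(cohens_kappa(c1, c2), 2))  # 0.75
```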

Results

DCA of the Core and General Journals Networks

Supplemental Table 2 presents the top publications in the core and general journals datasets with the strongest citation bursts sustained for at least 2 years. (Due to space constraints, only the top few publications have been presented). Overall, the publications had a low betweenness centrality index ranging from 0.01 to 0.39. Bachman (1990; centrality = 0.35) and Canale and Swain (1980; centrality = 0.39) had the highest betweenness centrality index among the core and general journals datasets, respectively. Of these, Bachman (1990) and Skehan (1998) appeared on both core and general journals lists. The books identified in the analysis were not included directly in the datasets; they appeared in the results since they were co-cited by a significant number of citing papers (i.e., they came from the References section of the citing papers).

The top five most influential publications in the core journals were Bachman and Palmer (1996; duration of burst = 6, strength = 17.39, centrality = 0.11, sigma = 6.4), Bachman and Palmer (2010; duration of burst = 4, strength = 14.93, centrality = 0.02, sigma = 1.25), Bachman (1990; duration of burst = 5, strength = 11.77, centrality = 0.35, sigma = 32.79), Fulcher (2003; duration of burst = 5, strength = 11.54, centrality = 0.01, sigma = 1.10), and Council of Europe (2001; duration of burst = 3, strength = 11.17, centrality = 0.01, sigma = 1.11).

In addition, four publications in the general journals dataset had a burst strength higher than 11: Skehan (1998; duration of burst = 9, strength = 13.42, centrality = 0.05, sigma = 1.85), Bachman and Palmer (1996; duration of burst = 7, strength = 12.15, centrality = 0.05, sigma = 1.81), Norris and Ortega (2009; duration of burst = 7, strength = 13.75, centrality = 0.01, sigma = 1.08), and Nation (1990; duration of burst = 6, strength = 11.00, centrality = 0.05, sigma = 1.67).
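
For readers unfamiliar with these metrics, CiteSpace's sigma value combines a node's betweenness centrality with its burst strength. Assuming the standard CiteSpace formulation, sigma = (centrality + 1)^burst strength, the reported values can be approximately reproduced; small deviations reflect rounding of the published centrality figures. A minimal sketch:

```python
def sigma(centrality, burst_strength):
    """CiteSpace salience measure: (betweenness centrality + 1) ** burst strength."""
    return (centrality + 1) ** burst_strength

# Bachman and Palmer (1996): centrality = 0.11, burst = 17.39, reported sigma = 6.4
print(round(sigma(0.11, 17.39), 2))
# Skehan (1998): centrality = 0.05, burst = 13.42, reported sigma = 1.85
print(round(sigma(0.05, 13.42), 2))
```

A node with zero centrality therefore always has sigma = 1, regardless of burst strength, which is why the many low-centrality bursts in Table 6 cluster near sigma = 1.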

Visualization of the DCA Network for the Core Journals Dataset

Figure 1 depicts the cluster view of the DCA network of the core journals. Each cluster consists of nodes, which represent publications, and links, which are represented by lines and show co-citation connections. The cluster labels are derived from the headings assigned to the citing articles within each cluster. The color of a link denotes the earliest time slice in which the connection was made, with warm colors like red marking the most recent bursts and cold colors like blue marking older connections. As the density of the nodes in Figure 1 shows, six large clusters experienced citation bursts: #0, language assessment (size = 224; silhouette value = 0.538; mean year of publication = 1995); #1, interactional competence (size = 221; silhouette value = 0.544; mean year of publication = 2005); #2, reading comprehension test (size = 171; silhouette value = 0.838; mean year of publication = 1981); #3, task-based language assessment (size = 161; silhouette value = 0.753; mean year of publication = 1994); #4, rater experience (size = 108; silhouette value = 0.752; mean year of publication = 1999); and #5, pair task performance (size = 78; silhouette value = 0.839; mean year of publication = 1993). Note that the numbers assigned to the clusters in this figure (from 0 to 20) reflect cluster size, so #0 is the largest, followed by #1, and so on. Smaller clusters with too few connections are not presented in cluster views. This DCA network had a modularity Q of 0.541, indicating a fairly well-structured network. The average silhouette index was 0.71, suggesting medium homogeneity of the clusters (see Supplemental Table 3 for further information). It should be noted that after examining the content of each cluster, we made some revisions to the automatically generated labels to enhance their consistency and precision (see Discussion).
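
As a point of reference for the silhouette values cited above (e.g., 0.538 for cluster #0), the coefficient can be sketched in a few lines of pure Python. This toy example on two-dimensional points is purely illustrative and is not the authors' CiteSpace computation, which operates on the co-citation network itself:

```python
def silhouette(points, labels):
    """Mean silhouette coefficient: s(i) = (b - a) / max(a, b), where a is the
    mean distance from point i to other points in its own cluster and b is the
    mean distance to the nearest other cluster. Values near 1 indicate tight,
    well-separated (homogeneous) clusters; values near 0 indicate overlap."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [dist(p, points[j]) for j in range(len(points))
                if labels[j] == lab and j != i]
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, points[j]) for j in range(len(points)) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters yield a mean silhouette value close to 1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labs = [0, 0, 0, 1, 1, 1]
print(round(silhouette(pts, labs), 3))
```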

Figure 1

The cluster view of network in the core journals dataset (modularity Q = 0.541, average silhouette score = 0.71), generated using CiteSpace, Version 5.6.R3.

Visualization of the DCA Network for the General Journals

Figure 2 depicts a cluster view of the major clusters in the general journals dataset, visualized along multiple horizontal lines (modularity Q = 0.6493, average silhouette score = 0.787). The clusters are color-coded, with their nodes (publications) and links represented by dots and straight lines, respectively. Nine major clusters emerged in the network, as presented in Supplemental Table 4. The largest cluster is #2 (incidental vocabulary learning); the oldest cluster is #0 (foreign language aptitude), whereas the most recent one is #4 (syntactic complexity). As presented in Supplemental Table 4, although the dataset represented co-citation patterns in the general journals, we noted multiple cited publications in this dataset that were published in the core journals. It should be noted that only major clusters are labeled and displayed in Figures 1, 2, and therefore the running order of the clusters differs across the two figures.

Figure 2

The cluster view of network in the general journals dataset (modularity Q = 0.6493, average silhouette score = 0.787), generated using CiteSpace, Version 5.6.R3.

Second Aim: Measurement and Validity in the Core Journal Clusters

Next, we applied the analytical framework of the study (Table 1) to examine the measurement and validation practices in each main cluster.

Domain Specification in Core Journals

For the core dataset, Table 2 presents the domains and constructs specified in the six major clusters. (Please note that the labels under the “The construct or domain specified” column were inductively assigned by the authors based on an examination of the papers in each cluster.) Overall, there were fewer constructs/domains in the core dataset (n = 15) than the 26 in the general journals dataset below. The top four most frequently occurring constructs or domains in the core dataset were speaking/oral/communicative skills, writing and/or essays, reading, and raters/ratings. The most frequently occurring construct, speaking/oral/communicative skills, appeared in every cluster, which is indicative of one of the major foci of the core journals. A series of χ2 tests showed that the categories of constructs or domains differed significantly from each other in the distribution of the skills and elements (p < 0.05). Specifically, Clusters #0 and #2 were primarily characterized by the dominance of comprehension (reading and listening) assessment research, while Clusters #1, #4, and #5 had a heavier focus on performance assessment (writing and oral production/interactional competence), thus suggesting two possible streams of research weaving the clusters together. The assessment of language elements such as vocabulary and grammar was significantly less researched across all the clusters.
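
A goodness-of-fit χ2 test of the kind reported here can be sketched as follows, using the Cluster 0 counts from Table 2 against a uniform expected distribution (the uniform expectation is our assumption for illustration; the authors do not state the expected distribution they tested against):

```python
def chi_square_stat(observed):
    """Pearson chi-square statistic against a uniform expected distribution."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Cluster 0 counts from Table 2: reading, listening, speaking, writing, proficiency
cluster0 = [18, 8, 8, 5, 7]
stat = chi_square_stat(cluster0)
CRITICAL_05_DF4 = 9.488  # chi-square critical value for df = 4 at alpha = .05
print(round(stat, 2), stat > CRITICAL_05_DF4)
```

For these counts the statistic exceeds the df = 4 critical value, consistent with the significant unevenness reported in the text.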

Table 2

Cluster # | The construct or domain specified | # of papers
Cluster 0 | Reading | 18
Cluster 0 | Listening | 8
Cluster 0 | Speaking/oral/communicative ability | 8
Cluster 0 | Writing | 5
Cluster 0 | Overall language proficiency | 7
Cluster 1 | Reading | 8
Cluster 1 | Writing | 29
Cluster 1 | Speaking/oral/communicative ability | 16
Cluster 1 | Interactional competence | 6
Cluster 1 | Corpus linguistics | 3
Cluster 1 | Overall language proficiency | 9
Cluster 1 | Feedback | 3
Cluster 2 | Reading | 6
Cluster 2 | Listening | 2
Cluster 2 | Speaking/oral/communicative ability | 3
Cluster 3 | Reading | 3
Cluster 3 | Vocabulary | 7
Cluster 3 | Speaking/oral/communicative | 5
Cluster 3 | Overall language proficiency | 2
Cluster 4 | Vocabulary | 3
Cluster 4 | Writing/essays | 15
Cluster 4 | Raters/ratings | 18
Cluster 4 | Speaking/oral/communicative ability | 8
Cluster 5 | Speaking/oral/communicative ability | 13
Cluster 5 | Washback | 2

Domain specification in major clusters in the core journals.

Other Components in Core Journals

Table 3 presents the other components of the analytical framework in the core journals: construct operationalization, evaluation, generalization, explanation, extrapolation, and utilization. The domains and constructs were operationalized using (i) a discrete-point, selected-response format (61 assessments using cloze, Likert-scale, or multiple-choice items) and (ii) a production-response format (61 essay and writing assessments and 59 oral production and interview assessments). Thus, the two most frequently occurring methods of construct operationalization in the major clusters of the core journals dataset were cloze/Likert/multiple-choice items and essay and writing assessments.

Table 3

Construct operationalization
Cluster ID | Cloze/Likert/multiple choice | Essays and writing | Oral/interview | Total
1 | 10 | 32 | 21 | 63
4 | 17 | 17 | 9 | 43
0 | 20 | 5 | 13 | 38
5 | 4 | 0 | 11 | 15
2 | 8 | 4 | 2 | 14
3 | 2 | 3 | 3 | 8
Total | 61 | 61 | 59 | 181
Reliability
Cluster ID | Reported reliability | Did not report reliability | Total
1 | 49 | 36 | 85
0 | 30 | 29 | 59
4 | 26 | 4 | 30
3 | 8 | 18 | 26
2 | 13 | 8 | 21
5 | 10 | 9 | 19
Generalization
Cluster ID | Reported generalizability evidence | Did not report generalizability evidence | Total
1 | 6 | 79 | 85
0 | 1 | 58 | 59
4 | 6 | 24 | 30
3 | 0 | 26 | 26
2 | 1 | 20 | 21
5 | 3 | 16 | 19
Criterion Evidence of Validity
Cluster ID | Yes | No | Total
1 | 5 | 80 | 85
0 | 5 | 54 | 59
4 | 1 | 29 | 30
3 | 2 | 24 | 26
2 | 5 | 16 | 21
5 | 0 | 19 | 19
Utilization
Cluster ID | Yes | No | Total
1 | 1 | 82 | 85
0 | 6 | 50 | 59
4 | 0 | 27 | 30
3 | 0 | 24 | 26
2 | 1 | 20 | 21
5 | 4 | 14 | 19
Explanation
Cluster ID | Yes | No | Total
1 | 10 | 75 | 85
0 | 8 | 51 | 59
4 | 3 | 27 | 30
3 | 0 | 26 | 26
2 | 3 | 18 | 21
5 | 2 | 17 | 19

Measurement methods and evidence of validity in major clusters in the core journals.

In addition, reliability coefficients were reported in slightly more than half of the publications (56.7%), whereas generalizability was underreported in all the clusters, with a mere 7.1% of the studies presenting evidence of generalizability. Likewise, only 7.5% presented criterion-based evidence of validity; 10.8% of the studies reported or investigated evidence supporting construct validity or the explanation inference; and 5% (12/240) of the studies addressed the utilization inference of the language assessments investigated. Among the clusters, Clusters #5 and #0 had the highest ratios of studies investigating the utilization inference, at 4/19 (21%) and 6/59 (10%), respectively.
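
These percentages can be reproduced directly from the Table 3 counts (reported-evidence counts summed across the six clusters, out of the 240 coded papers):

```python
# Counts of core-journal papers presenting each type of evidence, summed
# cluster by cluster from Table 3.
counts = {
    "reliability": 49 + 30 + 26 + 8 + 13 + 10,    # 136
    "generalizability": 6 + 1 + 6 + 0 + 1 + 3,    # 17
    "criterion evidence": 5 + 5 + 1 + 2 + 5 + 0,  # 18
    "explanation": 10 + 8 + 3 + 0 + 3 + 2,        # 26
    "utilization": 1 + 6 + 0 + 0 + 1 + 4,         # 12
}
TOTAL = 240  # papers coded across the six major clusters
percentages = {k: round(100 * n / TOTAL, 1) for k, n in counts.items()}
print(percentages)
```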

Measurement and Validity in the General Journal Clusters

Domain Specification in General Journals

Table 4 presents the domains and constructs specified in the major clusters in the general journals dataset. Of the 26 constructs/domains specified in the nine clusters, the top five (ranked by frequency of occurrence) were grammar, speaking/oral interactions, reading, vocabulary, and writing. Grammar appeared in every cluster except Cluster 8, which was distinct from the other clusters in that its papers examined not linguistic constructs but the affective aspects of language learning, with a relatively low number of publications (n = 13). Looking at the number of papers for each domain in each cluster, we can observe that some clusters were characterized by certain domains. By frequency of occurrence, papers in Cluster 0 were mostly concerned with language comprehension (reading and listening), whereas Cluster 1 was characterized by feedback on written and oral production; Cluster 2 by vocabulary; and Cluster 4 by writing, with syntactic complexity being secondary in importance. A series of χ2 tests showed that 20 of the 26 categories of constructs or domains occurred with significantly unequal probabilities, including fluency, speaking, oral ability/proficiency, language proficiency/competence, feedback, collocations, semantic awareness, syntactic complexity, task complexity, phonological awareness, explicit/implicit knowledge, comprehension, anxiety, attitudes, motivation, relative clauses, and language awareness (p < 0.005).

Table 4

Cluster # | The construct or domain specified | # of papers
Cluster 0 | Reading | 12
Cluster 0 | Listening | 10
Cluster 0 | Speaking | 6
Cluster 0 | Writing | 4
Cluster 0 | Grammar | 5
Cluster 0 | Vocabulary | 5
Cluster 0 | Oral ability | 1
Cluster 0 | Oral proficiency | 1
Cluster 0 | Language proficiency | 3
Cluster 0 | Language competence | 1
Cluster 1 | Reading | 1
Cluster 1 | Listening | 1
Cluster 1 | Speaking/Oral/Interaction | 15
Cluster 1 | Writing | 3
Cluster 1 | Grammar | 6
Cluster 1 | Vocabulary | 1
Cluster 1 | Memory | 4
Cluster 1 | Feedback* | 15
Cluster 2 | Reading | 9
Cluster 2 | Listening | 9
Cluster 2 | Speaking/Oral/Interaction | 1
Cluster 2 | Writing | 5
Cluster 2 | Grammar | 1
Cluster 2 | Vocabulary | 43
Cluster 2 | Collocations | 5
Cluster 2 | Semantic awareness | 2
Cluster 3 | Reading | 2
Cluster 3 | Listening | 1
Cluster 3 | Speaking/Oral/Interaction | 5
Cluster 3 | Writing | 3
Cluster 3 | Grammar | 2
Cluster 3 | Vocabulary | 3
Cluster 4 | Speaking/Oral/Interaction | 5
Cluster 4 | Writing | 21
Cluster 4 | Grammar | 3
Cluster 4 | Vocabulary | 1
Cluster 4 | Fluency | 5
Cluster 4 | Syntactic complexity | 7
Cluster 4 | Task complexity | 2
Cluster 5 | Reading | 2
Cluster 5 | Speaking/Oral/Interaction | 2
Cluster 5 | Grammar | 1
Cluster 5 | Vocabulary | 3
Cluster 5 | Phonological awareness | 3
Cluster 6 | Reading | 1
Cluster 6 | Speaking/Oral/Interaction | 1
Cluster 6 | Grammar | 1
Cluster 6 | Fluency | 2
Cluster 6 | Explicit/implicit knowledge | 3
Cluster 6 | Listening comprehension | 2
Cluster 8 | Anxiety | 4
Cluster 8 | Attitudes | 3
Cluster 8 | Motivation | 6
Cluster 11 | Grammar | 2
Cluster 11 | Relative clauses | 3
Cluster 11 | Language awareness | 2

Domain specification in major clusters in the general journals.

*Papers on feedback were double-counted in other categories. These consisted of 10 papers on speaking/oral/interaction, 1 paper on grammar, 1 on explicit feedback, 1 on the use of classifiers and the perfective -le in Chinese, and 2 papers on writing.

Other Components in General Journals

Table 5 presents the breakdown of construct operationalization and the evidence of validity presented in the papers in the major clusters of the general journals dataset. Given the domain characteristics (writing) of Cluster 4, discussed above, it is not surprising that the constructs in 59.6% of the papers in that cluster are operationalized through writing/essay tasks. As with the core journals dataset, the evaluation of reliability in the papers is fairly split, with 54.63% of the publications reporting reliability. The vast majority of papers did not provide any generalizability evidence (98.83%). Likewise, the majority of papers neither investigated construct validity or the explanation inference (95.03%) nor provided criterion evidence of validity (93.27%). Finally, only 24 of the publications reported or investigated the utilization inference.
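
The negative findings can likewise be checked against the Table 5 counts (342 coded papers across the nine major clusters):

```python
TOTAL = 342  # papers coded across the nine major clusters of the general journals

# "Did not report" counts summed cluster by cluster from Table 5.
no_generalizability = 83 + 66 + 40 + 38 + 49 + 24 + 11 + 15 + 12  # 338
no_criterion = 81 + 62 + 36 + 32 + 48 + 24 + 11 + 13 + 12         # 319
no_explanation = 82 + 62 + 37 + 32 + 48 + 24 + 12 + 15 + 13       # 325

for n in (no_generalizability, no_criterion, no_explanation):
    print(round(100 * n / TOTAL, 2))
```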

Table 5

Construct operationalization
Cluster ID | Cloze/Likert/multiple choice | Essay/writing | Oral/interview | Total
2 | 29 | 13 | 6 | 48
1 | 3 | 16 | 21 | 40
3 | 10 | 7 | 12 | 29
0 | 20 | 8 | 8 | 36
4 | 3 | 28 | 16 | 47
6 | 6 | 2 | 6 | 14
8 | 5 | 0 | 1 | 6
5 | 2 | 0 | 6 | 8
11 | 3 | 4 | 4 | 11
Reliability
Cluster ID | Reported reliability | Did not report reliability | Non-English | Total
2 | 44 | 40 | 0 | 84
1 | 34 | 32 | 0 | 66
3 | 21 | 20 | 0 | 41
0 | 25 | 13 | 0 | 38
4 | 27 | 22 | 0 | 49
6 | 16 | 8 | 0 | 24
8 | 5 | 6 | 1 | 12
5 | 12 | 3 | 0 | 15
11 | 3 | 9 | 1 | 13
Generalization
Cluster ID | Reported generalizability evidence | Did not report generalizability evidence | Non-English | Total
2 | 1 | 83 | 0 | 84
1 | 0 | 66 | 0 | 66
3 | 1 | 40 | 0 | 41
0 | 0 | 38 | 0 | 38
4 | 0 | 49 | 0 | 49
6 | 0 | 24 | 0 | 24
8 | 0 | 11 | 1 | 12
5 | 0 | 15 | 0 | 15
11 | 0 | 12 | 1 | 13
Criterion Evidence of Validity
Cluster ID | Yes | No | Non-English | Total
2 | 3 | 81 | 0 | 84
1 | 4 | 62 | 0 | 66
3 | 5 | 36 | 0 | 41
0 | 6 | 32 | 0 | 38
4 | 1 | 48 | 0 | 49
6 | 0 | 24 | 0 | 24
8 | 0 | 11 | 1 | 12
5 | 2 | 13 | 0 | 15
11 | 0 | 12 | 1 | 13
Explanation
Cluster ID | Yes | No | Non-English | Total
2 | 2 | 82 | 0 | 84
1 | 4 | 62 | 0 | 66
3 | 4 | 37 | 0 | 41
0 | 6 | 32 | 0 | 38
4 | 1 | 48 | 0 | 49
6 | 0 | 24 | 0 | 24
8 | 0 | 12 | 0 | 12
5 | 0 | 15 | 0 | 15
11 | 0 | 13 | 0 | 13
Utilization
Cluster ID | Yes | No | Claimed without evidence | Total
2 | 0 | 82 | 2 | 84
1 | 0 | 63 | 3 | 66
3 | 0 | 29 | 12 | 41
0 | 1 | 30 | 7 | 38
4 | 0 | 49 | 0 | 49
6 | 0 | 24 | 0 | 24
8 | 0 | 11 | 0 | 12
5 | 0 | 15 | 0 | 15
11 | 0 | 12 | 0 | 13

Measurement practices and evidence of validity in major clusters in the general journals.

Discussion

This study set out to investigate intellectual domains as well as the use of measurement and validation methods in language assessment research. We created two datasets covering the core and general journals and employed DCA to detect research clusters. Next, we coded citing papers in each cluster based on an analytical framework for measurement and validation (Norris and Ortega, 2003; Kane, 2006; Bachman and Palmer, 2010). In this section, we discuss bursts and citing publications per cluster to determine the features that possibly characterize each main cluster. We then discuss the measurement and validation practices in the citing papers in the two datasets.

First Aim: Characterizing the Detected Clusters

Core Journals

Bursts (impactful cited publications) in the influential clusters in the core journals dataset are presented in Table 6. The review presented in the following sections is organized according to the content and relevance of these publications, of which we also provide a broad overview. It should be noted that while narrative literature reviews customarily have specific foci, our aim is to leverage the potential of clustering and highlight the linked concepts that might have given rise to each cluster. Each cluster is characterized by virtue of the content of its citing and cited publications. Due to space constraints, we provide a detailed review commentary on two of the largest clusters in the core journals dataset and a general overview of the rest of the major clusters (see the Appendices for further information per cluster).

Table 6

References | Burst strength | Frequency | Centrality | Sigma | Cluster ID
Bachman and Palmer (1996) | 17.39 | 63 | 0.11 | 6.4 | 0
Alderson et al. (1995) | 10.65 | 28 | 0.02 | 1.19 | 0
Bachman (1990) | 9.58 | 67 | 0.16 | 4.13 | 0
Alderson (2000) | 8.55 | 26 | 0.01 | 1.07 | 0
Bachman and Palmer (2010) | 7.97 | 18 | 0.01 | 1.06 | 0
Shohamy (2001) | 7.84 | 22 | 0.01 | 1.1 | 0
Alderson (2005) | 7.7 | 22 | 0.02 | 1.13 | 0
McNamara (1996) | 7.22 | 22 | 0.02 | 1.14 | 0
Buck (2001) | 6.86 | 18 | 0 | 1.02 | 0
Bond and Fox (2007) | 6.55 | 12 | 0 | 1.02 | 0
Bachman (2005) | 5.99 | 32 | 0.03 | 1.17 | 0
Read (2000) | 5.64 | 13 | 0 | 1.01 | 0
Taylor (2009) | 5.33 | 10 | 0 | 1.02 | 0
Alderson and Hamp-Lyons (1996) | 4.7 | 12 | 0.01 | 1.05 | 0
Douglas (2000) | 4.47 | 8 | 0 | 1.01 | 0
Fulcher (2004) | 4.16 | 11 | 0.01 | 1.03 | 0
Canale and Swain (1980) | 4.13 | 49 | 0.22 | 2.29 | 0
Brennan (2001) | 4.06 | 10 | 0 | 1.01 | 0
Alderson and Lukmani (1989) | 3.75 | 15 | 0.02 | 1.07 | 0
Kobayashi (2002) | 3.68 | 7 | 0 | 1.02 | 0
Davison (2007) | 3.64 | 6 | 0 | 1.01 | 0
Brindley (2001) | 3.62 | 6 | 0 | 1.01 | 0
Fulcher (2003) | 11.55 | 27 | 0.01 | 1.1 | 1
Council of Europe (2001) | 11.17 | 23 | 0.01 | 1.11 | 1
American Educational Research Association (2014) | 9.17 | 19 | 0.01 | 1.05 | 1
Weigle (2002) | 9.05 | 60 | 0.05 | 1.6 | 1
Knoch (2009) | 7.77 | 21 | 0.01 | 1.08 | 1
Kane (2006) | 7.3 | 30 | 0.03 | 1.24 | 1
Weir (2005a) | 6.82 | 16 | 0.01 | 1.04 | 1
Luoma (2004) | 6.74 | 14 | 0 | 1.02 | 1
Guo et al. (2013) | 6.29 | 13 | 0 | 1.01 | 1
Messick (1989) | 6.17 | 81 | 0.12 | 2.03 | 1
Cohen (1988) | 5.99 | 19 | 0.01 | 1.07 | 1
Fulcher et al. (2011) | 5.8 | 10 | 0 | 1.02 | 1
Kane (2013) | 5.54 | 15 | 0.01 | 1.04 | 1
Chapelle et al. (2008) | 5.1 | 12 | 0 | 1.02 | 1
Cumming (2013) | 4.81 | 10 | 0 | 1.02 | 1
Biber and Gray (2013) | 4.67 | 11 | 0 | 1.01 | 1
Iwashita et al. (2008) | 4.44 | 17 | 0.01 | 1.05 | 1
Gebril (2009) | 4.33 | 15 | 0 | 1.02 | 1
Flower and Hayes (1981) | 4.32 | 8 | 0 | 1.01 | 1
McNamara et al. (2014) | 4.32 | 8 | 0 | 1.01 | 1
May (2011) | 4.26 | 10 | 0 | 1.01 | 1
Deane (2013) | 4.07 | 14 | 0.01 | 1.03 | 1
Jacobs (1981) | 3.98 | 7 | 0 | 1.02 | 1
Fulcher (1996) | 3.81 | 15 | 0.01 | 1.03 | 1
Ortega (2003) | 3.78 | 7 | 0 | 1 | 1
Plakans (2008) | 3.69 | 11 | 0 | 1.02 | 1
Knoch (2011) | 3.69 | 10 | 0.01 | 1.03 | 1
Wright and Stone (1979) | 8.1 | 17 | 0.05 | 1.48 | 2
Henning (1987) | 6.09 | 13 | 0.02 | 1.14 | 2
Oller (1979) | 5.29 | 9 | 0.04 | 1.25 | 2
Rasch (1960) | 5.25 | 8 | 0.01 | 1.05 | 2
Hambleton and Swaminathan (1985) | 4.91 | 8 | 0.01 | 1.06 | 2
Hughes (1989) | 4.55 | 7 | 0.01 | 1.05 | 2
McNamara (1990) | 4.21 | 8 | 0.01 | 1.03 | 2
Chen and Henning (1985) | 4.02 | 8 | 0.03 | 1.14 | 2
Skehan (1998) | 7.9 | 16 | 0.01 | 1.1 | 3
Messick (1989) | 7.18 | 12 | 0.01 | 1.05 | 3
Brindley (1998) | 5.52 | 12 | 0.04 | 1.22 | 3
Clapham (1996) | 4.8 | 8 | 0.01 | 1.03 | 3
Messick (1994) | 4.58 | 12 | 0.03 | 1.12 | 3
Brown and Hudson (1998) | 3.89 | 6 | 0.01 | 1.02 | 3
Bachman (1990) | 3.73 | 6 | 0 | 1 | 3
Alderson and Wall (1993) | 3.61 | 19 | 0.01 | 1.05 | 3
Cumming et al. (2002) | 8.48 | 26 | 0.01 | 1.1 | 4
Lumley (2002) | 7.94 | 43 | 0.04 | 1.32 | 4
Cumming (1990) | 6.72 | 28 | 0.01 | 1.09 | 4
Eckes (2008) | 6.05 | 24 | 0.01 | 1.06 | 4
Lumley and McNamara (1995) | 5.27 | 26 | 0.01 | 1.07 | 4
Weigle (1998) | 4.54 | 36 | 0.03 | 1.14 | 4
Weigle (1994) | 4.49 | 17 | 0.01 | 1.04 | 4
Brown (1995) | 4.26 | 22 | 0.04 | 1.17 | 4
Lim (2011) | 4.06 | 7 | 0 | 1 | 4
Barkaoui (2010) | 3.83 | 9 | 0 | 1 | 4
Hamp-Lyons (1991) | 3.81 | 13 | 0.01 | 1.04 | 4
Brown (2003) | 6.65 | 28 | 0.02 | 1.15 | 5
van Lier (1989) | 4.81 | 13 | 0.02 | 1.08 | 5
Lazaraton (1996) | 4.59 | 14 | 0.01 | 1.05 | 5
Messick (1996) | 4.15 | 33 | 0.03 | 1.14 | 5
Chalhoub-Deville (2003) | 3.95 | 17 | 0.01 | 1.04 | 5
Shohamy (1988) | 3.88 | 6 | 0.01 | 1.03 | 5

Selected cited publications (Bursts) in the core journals.

Cluster 0: Language assessment (and comprehension)

As demonstrated in Table 7, bursts in this cluster can roughly be divided into two major groups: (i) generic textbooks or publications that present frameworks for the development of language assessments in general (e.g., Bachman, 1990; Alderson et al., 1995; Bachman and Palmer, 1996, 2010; McNamara, 1996; Shohamy, 2001; Alderson, 2005), or of specific aspects in the development of language assessments (Alderson, 2000; Read, 2000; Brennan, 2001; Buck, 2001; Kobayashi, 2002; Bachman, 2005) and psychometric measurement (McNamara, 1996; Bond and Fox, 2007), and (ii) publications that describe the contexts and implementations of tests (Alderson and Hamp-Lyons, 1996; Fulcher, 2004; Davison, 2007; Taylor, 2009). The citing publications in this cluster, on the other hand, consist of papers that chiefly investigate the assessment of comprehension skills (The labels under Focus area 1 and Focus area 2 in Tables 7, 8 and Supplemental Tables 5 through 11 were inductively assigned by the authors based on the examination of papers).

Table 7

Cluster | References | Citing | Cited (bursts) | Focus area 1 | Focus area 2
0 | Bachman and Palmer (1996) | | X | Test usefulness | Test development
0 | Alderson et al. (1995) | | X | Test specification | Test development
0 | Bachman (1990) | | X | Test development | Test methods facets
0 | Alderson (2000) | | X | Test development (reading) | -
0 | Bachman and Palmer (2010) | | X | Validation | Test development
0 | Shohamy (2001) | | X | Tests and policy-making | Democratic assessment
0 | Alderson (2005) | | X | Test development (diagnostic assessment) | The DIALANG assessment system
0 | McNamara (1996) | | X | Test development | Psychometric measurement
0 | Buck (2001) | | X | Test development (listening) | Theories of listening
0 | Bond and Fox (2007) | | X | Rasch measurement | -
0 | Bachman (2005) | | X | Validation | -
0 | Read (2000) | | X | Test development (vocabulary) | Theories of vocabulary acquisition and assessment
0 | Taylor (2009) | | X | Language assessment literacy | Test wiseness
0 | Alderson and Hamp-Lyons (1996) | | X | Washback | The TOEFL
0 | Douglas (2000) | X | X | Assessment of language for specific purposes | -
0 | Fulcher (2004) | | X | The Common European Framework of Reference | Language assessment (political dimensions)
0 | Canale and Swain (1980) | | X | Communicative competence framework | -
0 | Brennan (2001) | | X | Generalizability theory | -
0 | Kobayashi (2002) | | X | Test method effect | -
0 | Davison (2007) | | X | Hong Kong Examinations and Assessment Authority (HKEAA) School Based Assessment | Perceptions toward school-based assessments
0 | Harsch (2014) | X | | Review of General Language Proficiency | -
0 | McNamara (2014) | X | | Review of Communicative Language Testing (Editorial) | CEF
0 | Phakiti and Roever (2011) | X | | Review of Language Assessment in Australia and New Zealand (Editorial) | -
0 | Xi (2010b) | X | | Review of Automated scoring and feedback systems (Editorial) | -
0 | Lee and Sawaki (2009) | X | | Review of cognitive diagnostic assessment | -
0 | Carr (2006) | X | | Reading comprehension | Test task characteristics
0 | Zhang et al. (2014) | X | | Reading comprehension | -
0 | Papageorgiou et al. (2012) | X | | Listening comprehension | Test task characteristics (dialogic vs. monologic assessment)
0 | Roever (2006) | X | | Pragmalinguistics | Validity
0 | Winke (2011) | X | | U.S. Naturalization Test | Reliability
0 | Gao and Rogers (2011) | X | | Reading comprehension | Test task characteristics
0 | Green and Weir (2010) | X | | Reading comprehension (textual features) | Validity
0 | Jang (2009a) | X | | Reading comprehension | Cognitive diagnostic assessment
0 | Jang (2009b) | X | | Reading comprehension | Cognitive diagnostic assessment
0 | Sawaki et al. (2009) | X | | Reading and listening comprehension | Cognitive diagnostic assessment
0 | Harding et al. (2015) | X | | Reading and listening comprehension | Diagnostic assessment
0 | Eckes and Grotjahn (2006) | X | | (German) General Language Proficiency (reading, listening, writing, speaking) | Validity

Major citing and cited publications in clusters 0 in the core journals.

Table 8

Cluster | References | Citing | Cited (bursts) | Focus area 1 | Focus area 2
1 | Fulcher (2003) | | X | Speaking |
1 | Council of Europe (2001) | | X | Assessment |
1 | American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2014) | | X | Assessment | Validation
1 | Weigle (2002) | | X | Writing |
1 | Knoch (2009) | | X | Rating scales | Writing
1 | Kane (2006) | | X | Validation |
1 | Weir (2005a) | | X | Validation |
1 | Luoma (2004) | | X | Speaking assessment |
1 | Guo et al. (2013) | | X | Linguistic features and rating | Coh-Metrix
1 | Messick (1989) | | X | Validation |
1 | Fulcher et al. (2011) | | X | Rating scales | Speaking
1 | Kane (2013) | | X | Validation |
1 | Chapelle et al. (2008) | | X | Validation |
1 | Cumming (2013) | | X | Review of Integrated Writing Tasks |
1 | Iwashita et al. (2008) | | X | Rating scales | Speaking
1 | Gebril (2009) | | X | Integrated Writing Tasks |
1 | Flower and Hayes (1981) | | X | Writing process |
1 | McNamara et al. (2014) | | X | Coh-Metrix | Linguistic features
1 | May (2011) | | X | Rating scales | Speaking
1 | Deane (2013) | | X | Automated scoring | Writing
1 | Jacobs (1981) | | X | |
1 | Fulcher (1996) | | X | Rating scales | Speaking
1 | Ortega (2003) | | X | Review of syntactic complexity |
1 | Plakans (2008) | | X | Integrated Writing Tasks |
1 | Knoch (2011) | | X | Rating scales | Writing
1 | Plakans et al. (2019) | X | | Integrated writing tasks (reading-writing) | Process
1 | Plakans and Gebril (2017) | X | | Integrated (reading-listening-writing) tasks | The TOEFL iBT
1 | Banerjee et al. (2015) | X | | Writing assessment | Rating scale
1 | Barkaoui and Knouzi (2018) | X | | Writing assessment | Mode effect
1 | Guo et al. (2013) | X | X | Writing assessment | Linguistic features
1 | Isbell (2017) | X | | Writing assessment | Rating
1 | Lallmamode et al. (2016) | X | | Writing assessment | Validation of scoring rubric
1 | Lu (2017) | X | | Writing assessment | Syntactic complexity
1 | Rakedzon and Baram-Tsabari (2017) | X | | Writing assessment | Scoring rubric
1 | Wilson et al. (2017) | X | | Writing assessment | Automated scoring (using linguistic features measures)
1 | Zhao (2017) | X | | Writing assessment | Scoring rubric (Voice)
1 | Zheng and Yu (2019) | X | | Writing assessment | Review of writing assessment
1 | Lam (2018) | X | | Speaking assessment | Interactional competence
1 | van Batenburg et al. (2018) | X | | Speaking assessment | Interactional competence
1 | Römer (2017) | X | | Speaking assessment | Lexicogrammar

Major citing and cited publications in clusters 1 in the core journals.

Among the bursts in the first group, a few publications prove to be pillars of the field: Alderson et al. (1995), Bachman (1990), and Bachman and Palmer (1996, 2010). This can be seen from the burst strength of these publications (Table 6) as well as from the citing publications. The articles that cite the publications in Cluster 0 range from reviews or editorials providing an overview of the field of language assessment to studies examining specific aspects of it. Reviews of the field (e.g., Harsch, 2014; McNamara, 2014) consistently mention the works of Bachman. Bachman's influence is such that his publications merited mention even in reviews of specific areas of the field, as in Phakiti and Roever (2011) on regional issues in Australia and New Zealand, Xi (2010b) on scoring and feedback, and Lee and Sawaki (2009) on cognitive diagnostic assessment. Bachman and Palmer (1996, 2010) have wide appeal and are referenced with respect to a wide range of topics in Cluster 0, such as reading (Carr, 2006; Zhang et al., 2014), listening (Papageorgiou et al., 2012), and pragmalinguistics (Roever, 2006). Bachman and Palmer (1996) and Bachman (1990) are also frequent sources for definitions, examples of which are too numerous to recount exhaustively. Two examples, the definition of reliability in Winke (2011) and of practicality in Roever (2006), show the influence of these two texts in explicating core concepts of language assessment.

Articles on the assessment of reading comprehension (e.g., Jang, 2009a,b; Sawaki et al., 2009; Green and Weir, 2010; Gao and Rogers, 2011; Harding et al., 2015) often reference Charles Alderson: Alderson (2000), Alderson (2005) and to a lesser extent, Alderson et al. (1995) and Alderson and Lukmani (1989). For example, Jang's (2009a,b) studies on reading comprehension investigated the validity of LanguEdge test materials and the notion of reading subskills using cognitive diagnosis assessment. Prior discussions on the various aspects of reading assessment—like subskills—in Alderson's various works feature strongly in such studies (see also Sawaki et al., 2009). An exception is Carr's (2006) study on reading comprehension. While mentioning Alderson (2000), Bachman and Palmer's (1996) task characteristics model undergirds Carr's (2006) investigation on the relationship between test task characteristics and test taker performance.

Just like Alderson's works for reading, Buck (2001) seems to be the definitive textbook on assessing the listening component of language. For example, in influential citing papers such as Harding et al. (2015), Papageorgiou et al. (2012), as well as Sawaki et al. (2009), Buck's conceptualization of the subskills involved in listening is discussed.

Similarly, McNamara (1996) is a sourcebook on the development and validation of performance tests. McNamara (1996) introduced many-facet Rasch measurement (Linacre, 1994) as a useful method to capture the effect of external facets—most notably rater effects—on the measured performance of test takers. Relatedly, Bond and Fox (2007) guide readers through the general principles of the Rasch model and the various ways of applying it in their textbook. The importance of the Rasch model for test validation makes this accessible text oft-cited in studies concerned with test validity (e.g., Eckes and Grotjahn, 2006; Winke, 2011; Papageorgiou et al., 2012).

Another group of bursts in the cluster describe the then-current contexts of language assessment literacy (Taylor, 2009), frameworks (Fulcher, 2004), language tests after implementation (Alderson and Hamp-Lyons, 1996; Davison, 2007), and language for specific purposes (LSP; Douglas, 2000). In a call for the development of “assessment literacy” among applied linguists, Taylor (2009) described the state of the field of language assessment at that moment, looking at the types of practical knowledge needed and the scholarly works that offer them. This need for assessment literacy when implementing tests had already been highlighted by Alderson and Hamp-Lyons (1996) some years before. Emphasizing the need to move beyond assumptions when hypothesizing about washback, Alderson and Hamp-Lyons (1996) observed and compared TOEFL and non-TOEFL classes taught by the same teachers in order to establish the presence of the oft-assumed washback effect of the TOEFL. Davison (2007) takes a similar tack in looking at teachers' perceptions of the challenges in adapting to Hong Kong's shift to school-based assessment (SBA) of oral language skills. Although Davison (2007) and Alderson and Hamp-Lyons (1996) describe different tests, both sources highlight the importance of moving beyond theory and looking at implementation. That test development does not end at implementation is similarly highlighted by Fulcher (2004), who tackles the larger contexts surrounding the Common European Framework (CEF) in his critical historical overview of the development of said framework. Finally, Douglas's (2000) work on the assessment of LSP has become a major sourcebook in the field. Douglas's model of LSP ability drew inspiration from the communicative competence model of Canale and Swain (1980) and comprised language knowledge, strategic competence, and background knowledge.

Cluster 1: Rating (and Validation)

Moving from the global outlook on language assessment that largely characterizes Cluster 0, Cluster 1 narrows down to two related aspects of language testing: validation and rating. The unitary concept of validity (Messick, 1989), the socio-cognitive validity framework (Weir, 2005a), and the argument-based approach to validation (Kane, 2006, 2013) are the three main frameworks of validity featured in Cluster 1. The second major line of research in Cluster 1 focuses on improving rating scales. Fulcher (1996) proposed a data-driven approach to constructing rating scales, coding transcripts from the ELTS oral examination to pinpoint “observed interruptions in fluency” (Fulcher, 1996, p. 216) in candidates' speech. Using discriminant analysis, Fulcher (1996) linked linguistic descriptions to speaker performance while at the same time validating the rating scale produced. Iwashita et al. (2008) took a similar approach but expanded the range of measures beyond fluency to a more comprehensive set: grammatical accuracy and complexity, vocabulary, pronunciation, and fluency. In the same vein, Fulcher et al. (2011) criticized the low richness of the descriptions generated by the measurement-driven approach and proposed Performance Decision Trees (PDTs), which are based on a non-linear scoring system comprising yes/no decisions. In contrast, May (2011) took a different approach, using raters' perspectives to determine how raters would operationalize a rating scale and what features are salient to them. Unlike the previous studies, however, the rating scale in May (2011) was for a paired speaking test. Mirroring these concerns about the rating descriptors of speaking tasks, Knoch (2009) compared a new scale with more detailed, empirically developed descriptors to a pre-existing scale with less specific descriptors; the former scale yielded higher rater reliability and better candidate discrimination. In a separate study, Knoch (2011) explained the features of diagnostic assessments of writing, stressing the uses and interpretations of rating scales.

With regard to the citing publications, papers describing the development of rating or scoring scales often cited the above publications, irrespective of the task the scale was designed for, resulting in the emergence of Cluster 1. For example, Banerjee et al.'s (2015) article focused on the rating scale of a writing assessment but discussed Fulcher (2003) and Fulcher et al. (2011). In addition, rating scales are exclusively discussed with reference to the assessment of writing and speaking, with integrated tasks forming the nexus between these strands. Fulcher (2003) is the major publication on the speaking component of language assessment in this cluster, cited in studies focusing on speaking (Römer, 2017; van Batenburg et al., 2018) as well as meriting mention in studies on other topics like writing (Banerjee et al., 2015; Lallmamode et al., 2016). Akin to Fulcher (2003) for speaking, Weigle (2002) is a reference text on the subject of writing. It is cited in studies on a range of topics such as integrated tasks (Plakans, 2008; Gebril, 2009; Plakans and Gebril, 2017), rubrics (Banerjee et al., 2015), validation (Lallmamode et al., 2016), and linguistic features of writing (Guo et al., 2013; Lu, 2017). Other citing papers focusing on writing assessment were Isbell (2017), Zhao (2017), Lam (2018), and Zheng and Yu (2019).

Measures of linguistic features in rater-mediated assessments are of significant importance in the cluster. Ortega's (2003) research synthesis quantified the effect size of syntactic complexity on assessed proficiency levels. More sophisticated ways of quantifying linguistic features have emerged since. A notable example is Coh-Metrix, a computational linguistic engine used to measure lexical sophistication, syntactic complexity, cohesion, and basic text information (Guo et al., 2013). McNamara et al. (2014) discussed the theoretical and practical implications of Coh-Metrix and provided an in-depth discussion of the textual features it measures. In a review article on syntactic complexity, Lu (2017) highlighted the increasing popularity of this tool. Coh-Metrix has been used to operationalize and quantify linguistic and discourse features in writing, both to predict scores (Banerjee et al., 2015; Wilson et al., 2017) and to test mode effects (Barkaoui and Knouzi, 2018).
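As an illustration of how such indices are computed, one of the syntactic complexity measures synthesized by Ortega (2003), the mean length of T-unit, can be sketched as follows. This is a minimal illustration assuming T-units have already been segmented, not tooling from the reviewed studies:

```python
def mean_length_of_tunit(tunits: list[str]) -> float:
    """Mean length of T-unit (MLTU): the average number of words per
    T-unit, a classic index of syntactic complexity."""
    if not tunits:
        raise ValueError("at least one T-unit is required")
    return sum(len(unit.split()) for unit in tunits) / len(tunits)

# Two hypothetical T-units of 4 and 8 words yield an MLTU of 6.0.
sample = [
    "the students wrote essays",
    "because they wanted detailed feedback they revised carefully",
]
print(mean_length_of_tunit(sample))  # 6.0
```

Tools such as Coh-Metrix compute dozens of such indices automatically over parsed text, but the underlying logic of each index is a ratio of this kind.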

Cluster 2: Test Development (and Dimensionality)

Cluster 2 is characterized by test development and dimensionality (see Supplemental Table 5). Publications in this cluster center around the development of tests (for teaching) (e.g., Oller, 1979; Henning, 1987; Hughes, 1989) and the implications of test scores, such as Chen and Henning (1985), one of the initial works on bias. In addition, a large part of the language test development process outlined in these publications involves the interpretation and validation of test scores through item response theory (IRT) and Rasch models (Wright and Stone, 1979; Hambleton and Swaminathan, 1985; Henning, 1987). Rasch's (1960) pioneering monograph is the pillar upon which these publications stand. Citing articles are largely concerned with dimensionality (Lynch et al., 1988; McNamara, 1991) and validity (Lumley, 1993). Judging from the publication dates, Cluster 2 seems to reflect concerns specific to the field in the 1980s and early 1990s.
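To make the Rasch model underpinning much of this work concrete, the probability of a correct response can be sketched as follows. This is a minimal sketch of the dichotomous Rasch model, not code from any of the reviewed studies:

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: probability that a person of the given
    ability answers an item of the given difficulty correctly.
    Both parameters are expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals item difficulty, the success probability is 0.5;
# a person 1 logit above the item succeeds about 73% of the time.
print(rasch_probability(0.0, 0.0))            # 0.5
print(round(rasch_probability(1.0, 0.0), 2))  # 0.73
```

Estimating the ability and difficulty parameters from response data is what Rasch software does; the model itself reduces to this single logistic function of the person-item difference.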

Cluster 4: Rater Performance

As demonstrated in Supplemental Table 6, Cluster 4 concerns rating, which links it to Cluster 1. Chief concerns regarding variability in rating include raters' characteristics (Brown, 1995; Eckes, 2008), experience (Cumming, 1990; Lim, 2011), and biases (Lumley and McNamara, 1995) that affect rating performance, the effect of training (Weigle, 1994, 1998), and the processes raters undergo while rating (Cumming et al., 2002; Lumley, 2002; Barkaoui, 2010). Citing articles largely mirror the same concerns (rater characteristics: Zhang and Elder, 2010; rater experience: Kim, 2015; rater training: Knoch et al., 2007; rating process: Wiseman, 2012; Winke and Lim, 2015), making this a tightly focused cluster.

Cluster 5: Spoken Interaction

Cluster 5 looks at a specific aspect of assessing speaking: spoken interaction. Unlike Cluster 1, which also focused on assessing speaking, this cluster centers on a different group of bursts, hence its separation: Brown (2003), Lazaraton (1996), Shohamy (1988), and van Lier (1989) explored the variation in the interactions between different candidates and testers during interviews. The social aspect of speaking calls into question validity and reliability in a strict sense, with implications for models of communicative ability, as Chalhoub-Deville (2003) highlighted. These developments in language assessment meant that citing articles moved beyond interviews to paired tasks (O'Sullivan, 2002; Brooks, 2009; Davis, 2009), while maintaining similar concerns about reliability and validity (see Supplemental Table 7 for further information).

Clusters in the General Journals Dataset

Table 9 presents the bursts in the influential clusters in the general journals dataset. The main clusters are discussed below.

Table 9

| References | Burst strength | Frequency | Centrality | Sigma | Cluster ID |
|---|---|---|---|---|---|
| Bachman (1990) | 11.13 | 37 | 0.11 | 3.06 | 0 |
| Oller (1979) | 8.36 | 15 | 0.06 | 1.61 | 0 |
| Henning (1987) | 7.86 | 13 | 0.01 | 1.1 | 0 |
| Wright and Stone (1979) | 7.7 | 13 | 0.02 | 1.15 | 0 |
| Halliday and Hasan (1976) | 7.01 | 15 | 0.05 | 1.41 | 0 |
| Hughes (1989) | 5.7 | 9 | 0 | 1.03 | 0 |
| Rasch (1960) | 5.22 | 8 | 0.01 | 1.05 | 0 |
| Chen and Henning (1985) | 5.2 | 9 | 0.02 | 1.13 | 0 |
| Bachman and Palmer (1982) | 5.19 | 8 | 0.02 | 1.08 | 0 |
| Hambleton and Swaminathan (1985) | 4.78 | 8 | 0 | 1.01 | 0 |
| Cohen (1988) | 10.67 | 63 | 0.04 | 1.45 | 1 |
| Swain (1995) | 10.61 | 56 | 0.03 | 1.43 | 1 |
| Ellis N. (2005) | 10.3 | 56 | 0.03 | 1.33 | 1 |
| Spada and Tomita (2010) | 8.7 | 25 | 0.01 | 1.06 | 1 |
| Pica (1994) | 8.3 | 18 | 0.01 | 1.1 | 1 |
| Lyster and Saito (2010) | 8 | 20 | 0 | 1.03 | 1 |
| Lyster and Ranta (1997) | 7.48 | 38 | 0.02 | 1.18 | 1 |
| Schmidt (1994) | 7.2 | 18 | 0.01 | 1.08 | 1 |
| Swain (1985) | 7.08 | 42 | 0.03 | 1.2 | 1 |
| Long (2007) | 6.73 | 13 | 0 | 1.01 | 1 |
| Goo (2012) | 6.72 | 13 | 0 | 1.02 | 1 |
| Harrington and Sawyer (1992) | 6.61 | 19 | 0.01 | 1.04 | 1 |
| Daneman and Carpenter (1980) | 6.26 | 26 | 0.05 | 1.34 | 1 |
| Ammar and Spada (2006) | 6.03 | 28 | 0.01 | 1.04 | 1 |
| Li (2010) | 5.99 | 27 | 0 | 1.03 | 1 |
| Doughty (2001) | 5.96 | 14 | 0 | 1.01 | 1 |
| Ellis et al. (2006) | 5.93 | 27 | 0.01 | 1.05 | 1 |
| Schmidt (2001) | 5.76 | 78 | 0.08 | 1.58 | 1 |
| Ellis N. (2005) | 5.69 | 11 | 0 | 1.02 | 1 |
| Rebuschat (2013) | 5.57 | 12 | 0 | 1 | 1 |
| Sheen (2004) | 5.41 | 15 | 0 | 1.01 | 1 |
| Ellis et al. (2001) | 5.38 | 18 | 0.01 | 1.05 | 1 |
| Gutiérrez (2013) | 5.24 | 10 | 0 | 1.02 | 1 |
| Lyster (1998) | 5.24 | 10 | 0 | 1.01 | 1 |
| Lyster (2004) | 5.09 | 25 | 0.01 | 1.04 | 1 |
| Long (1991) | 5 | 15 | 0.02 | 1.09 | 1 |
| Miyake and Friedman (1998) | 4.8 | 13 | 0 | 1.01 | 1 |
| Erlam (2005) | 4.7 | 8 | 0 | 1 | 1 |
| Mackey and Goo (2007) | 4.66 | 8 | 0 | 1.01 | 1 |
| Nation (1990) | 11 | 33 | 0.05 | 1.67 | 2 |
| Nation (2001) | 8.95 | 67 | 0.03 | 1.36 | 2 |
| Laufer and Hulstijn (2001) | 7.1 | 23 | 0 | 1.03 | 2 |
| Read (2000) | 6.88 | 31 | 0.01 | 1.05 | 2 |
| Nation (2006) | 6.82 | 31 | 0.01 | 1.07 | 2 |
| Read (2000) | 6.74 | 18 | 0.01 | 1.06 | 2 |
| Schmitt (2010) | 6.68 | 20 | 0 | 1.01 | 2 |
| Godfroid et al. (2013) | 6.5 | 14 | 0 | 1.02 | 2 |
| Plonsky and Oswald (2014) | 6.25 | 11 | 0 | 1.01 | 2 |
| Laufer (1992) | 6.12 | 16 | 0 | 1.03 | 2 |
| Coxhead (2000) | 6.02 | 31 | 0.04 | 1.24 | 2 |
| Laufer and Ravenhorst-Kalovski (2010) | 5.77 | 11 | 0 | 1.01 | 2 |
| Nation (2013) | 5.68 | 10 | 0 | 1 | 2 |
| Waring and Takaki (2003) | 5.58 | 14 | 0 | 1.01 | 2 |
| Wray (2002) | 5.56 | 13 | 0 | 1.01 | 2 |
| Hulstijn (2003) | 5.31 | 13 | 0 | 1.01 | 2 |
| O'Malley and Chamot (1990) | 5.16 | 11 | 0.01 | 1.05 | 2 |
| Barr et al. (2013) | 5.12 | 9 | 0 | 1.02 | 2 |
| Boers et al. (2006) | 5.05 | 11 | 0 | 1.01 | 2 |
| Schmidt (2001) | 4.72 | 9 | 0 | 1 | 2 |
| Schmitt et al. (2001) | 4.65 | 8 | 0 | 1 | 2 |
| Canale and Swain (1980) | 10.36 | 57 | 0.39 | 31.21 | 3 |
| Alderson and Wall (1993) | 6.15 | 11 | 0 | 1.03 | 3 |
| Bachman and Palmer (1996) | 4.82 | 27 | 0.02 | 1.1 | 3 |
| Norris and Ortega (2009) | 11.72 | 35 | 0.01 | 1.08 | 4 |
| Norris and Ortega (2000) | 9.81 | 48 | 0.03 | 1.37 | 4 |
| Ellis (2003) | 9.76 | 37 | 0.01 | 1.09 | 4 |
| Skehan (1998) | 8.59 | 65 | 0.08 | 1.91 | 4 |
| Foster et al. (2000) | 8.24 | 28 | 0.03 | 1.27 | 4 |
| Skehan (2009) | 8.02 | 24 | 0.01 | 1.07 | 4 |
| Wolfe-Quintero et al. (1998) | 7.01 | 21 | 0 | 1.02 | 4 |
| Housen and Kuiken (2009) | 6.65 | 13 | 0 | 1.02 | 4 |
| Biber (1999) | 6.38 | 16 | 0 | 1.03 | 4 |
| Chandler (2003) | 6.25 | 19 | 0.01 | 1.07 | 4 |
| Levelt (1989) | 6.2 | 12 | 0 | 1.02 | 4 |
| Ellis (2009) | 6.01 | 13 | 0 | 1.01 | 4 |
| Vygotsky (1978) | 5.68 | 10 | 0 | 1 | 4 |
| Bates et al. (2015) | 5.68 | 10 | 0 | 1 | 4 |
| Larsen-Freeman (2006) | 5.66 | 10 | 0 | 1 | 4 |
| Ellis (2008) | 5.65 | 20 | 0.01 | 1.03 | 4 |
| Biber et al. (2011) | 5.58 | 14 | 0 | 1.02 | 4 |
| Kormos and Dénes (2004) | 5.29 | 9 | 0 | 1 | 4 |
| Ortega (2003) | 5.18 | 13 | 0 | 1.02 | 4 |
| Plonsky (2013) | 4.78 | 12 | 0 | 1.02 | 4 |
| Swain (2000) | 4.74 | 12 | 0 | 1.01 | 4 |
| Robinson (2005) | 4.64 | 10 | 0 | 1 | 4 |
| Dörnyei (2007) | 4.64 | 10 | 0 | 1 | 4 |
Selected cited publications (Bursts) in the general journals dataset.

Cluster 0: Test Development (and Dimensionality)

Cluster 0 in the general journals dataset overlapped in large part with Cluster 2 of the core journals. Publications in Cluster 0 described the processes of test development (Oller, 1979; Wright and Stone, 1979; Henning, 1987; Hughes, 1989; Bachman, 1990). As with Cluster 2 (Core), there is a subfocus on IRT and Rasch models (Rasch, 1960; Wright and Stone, 1979; Hambleton and Swaminathan, 1985; Henning, 1987). Bachman (1990), Bachman and Palmer (1982), and Halliday and Hasan (1976) feature in this cluster but not in Cluster 2 (Core). There is a similar overlap in terms of the citing literature: 42% of the citing literature of this cluster overlaps with that of Cluster 2 (Core), with few differences in the central concerns of the articles (see Supplemental Table 8 for further information).

Cluster 1: Language Acquisition (Implicit vs. Explicit)

Cluster 1 of the general journals dataset is a rather large cluster, which reflects the vastness of research into SLA. Long's (2007) book is one such attempt to elucidate decades of theories and research. Other publications looked at specific theories such as the output hypothesis (Swain, 1995), communicative competence (Swain, 1985), and the cognitive processes in language learning (Schmidt, 1994, 2001; Miyake and Friedman, 1998; Doughty, 2001). A recurrent theme in theories of SLA is the dividing line between implicit and explicit language knowledge, as Ellis N. (2005) summarized. Research in the cluster similarly tackles the implicit vs. explicit divide in instruction (Ellis N., 2005; Erlam, 2005; Spada and Tomita, 2010). A subset of this work concerns corrective feedback, where implicit feedback is often compared with explicit feedback (e.g., Ammar and Spada, 2006; Ellis et al., 2006). Along the same lines, Gutiérrez (2013) questioned the validity of using grammaticality judgement tests to measure implicit and explicit knowledge (see Supplemental Table 9 for further information).

Cluster 2: Vocabulary Learning

Cluster 2 comprises vocabulary learning research. General textbooks on theoretical aspects of vocabulary (Nation, 1990, 2001, 2013; O'Malley and Chamot, 1990; Schmitt, 2010) and Schmitt's (2008) review provide a deeper understanding of the crucial role of vocabulary in language learning, particularly in incidental learning (Laufer and Hulstijn, 2001; Hulstijn, 2003; Godfroid et al., 2013). Efforts to find more efficient ways of learning vocabulary have led to the adoption of quantitative methods in research into vocabulary acquisition. Laufer (1992), Laufer and Ravenhorst-Kalovski (2010), and Nation (2006) sought the lexical threshold, the minimum number of words a learner needs for reading comprehension, while the quantification of lexis has enabled empirically based vocabulary wordlists (Coxhead, 2000) and tests such as the Vocabulary Levels Test (Schmitt et al., 2001). The use of formulaic sequences (Wray, 2002; Boers et al., 2006) is another offshoot of this aspect of vocabulary learning. Read's (2000) text on assessing vocabulary remains a key piece of work, as it is in Cluster 0 of the core journals. Finally, with the move toward quantitative methods, publications on relevant research methods such as effect sizes (Plonsky and Oswald, 2014) and linear mixed-effects models (Barr et al., 2013) gain importance in this cluster (see Supplemental Table 10 for further information).
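The lexical threshold research mentioned above rests on the notion of lexical coverage, the proportion of running words in a text that a learner knows. A minimal sketch, with a hypothetical known-word set rather than an actual word-family list:

```python
def lexical_coverage(tokens: list[str], known_words: set[str]) -> float:
    """Proportion of running words (tokens) covered by a known-word set.
    Threshold studies relate coverage levels to adequate reading
    comprehension (e.g., Nation, 2006; Laufer and Ravenhorst-Kalovski, 2010)."""
    if not tokens:
        raise ValueError("text must contain at least one token")
    known = sum(1 for token in tokens if token.lower() in known_words)
    return known / len(tokens)

# Hypothetical example: 5 of 6 running words are known.
text = "The cat sat on the mat".split()
vocabulary = {"the", "cat", "sat", "on"}
print(round(lexical_coverage(text, vocabulary), 2))  # 0.83
```

In actual threshold studies, the known-word set is derived from frequency-based word-family lists rather than enumerated by hand, but the coverage computation is this same ratio.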

Cluster 4: Measures of Language Complexity

Cluster 4 represents research on language complexity and its various measures. A dominant approach to measuring linguistic ability in this cluster is the measurement of complexity, accuracy, and fluency (CAF). In their review, Housen and Kuiken (2009) traced the historical development of these constructs and summarized their theoretical underpinnings and practical operationalization, forming an important piece of work for research using CAF. Research in this cluster largely looked at the effect of language teaching methods on one or more of the elements of CAF: for example, the effect of corrective feedback on accuracy and fluency (Chandler, 2003), and the effects of corrective feedback and planning on all three aspects of oral production (Ellis, 2009). Another line of research examined developments in complexity, accuracy, and/or fluency in students' language production (Ortega, 2003; Larsen-Freeman, 2006).

The CAF framework is not without its flaws, as pointed out by Skehan (2009) and Norris and Ortega (2009). Norris and Ortega (2009) suggested that syntactic complexity should be measured multidimensionally, and Biber et al. (2011), using corpus methods, suggested a new approach to syntactic complexity. As with Biber et al. (2011), another theme emerging from this cluster is the application of quantitative methods in language learning and teaching research (Bates et al., 2015). Methodological issues (Foster et al., 2000; Dörnyei, 2007; Plonsky, 2013) form another sub-cluster, as researchers attempt to devise more precise ways of defining and measuring these constructs (see Supplemental Table 11 for further information).

Second Aim: Measurement and Validation in the Core and General Journals

The second aim of the study was to investigate measurement and validation practices in the published assessment research in the main clusters of the core and general journals. Figures 3–5 present visual comparisons of measurement and validation practices between the two datasets. Given the differing sizes of the two datasets, the numbers presented in the histograms have been normalized for comparability (the frequency of publications reporting the feature divided by the total number of papers). As demonstrated in Figure 3, studies in the general journals dataset covered a wider range of domain specifications, providing coverage of more fine-grained domain specifications than the core journals dataset. On the other hand, the four “basic” language skills, reading, writing, listening, and speaking (listed here as Oral Production), were, unsurprisingly, well-represented in both datasets. Cumulatively, reading, writing/essays, and oral production dominate both the general and core journals datasets, with listening comparatively less prominent in both. Of considerable interest is the predominance of vocabulary in the general journals dataset, far outstripping the four basic skills.
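The normalization applied to the histograms can be sketched as follows (the counts below are illustrative only; the actual figures are based on the coded datasets):

```python
def normalize_counts(counts: dict[str, int], total_papers: int) -> dict[str, float]:
    """Divide each raw frequency by the dataset size so that datasets of
    different sizes (here, 1,561 core vs. 3,175 general papers) become
    comparable as proportions."""
    return {feature: count / total_papers for feature, count in counts.items()}

# Hypothetical counts of papers reporting a reading-focused domain.
core = normalize_counts({"reading": 120}, 1561)
general = normalize_counts({"reading": 210}, 3175)
print(round(core["reading"], 3), round(general["reading"], 3))  # 0.077 0.066
```

Comparing proportions rather than raw counts prevents the larger general journals dataset from dominating every bar simply by virtue of its size.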

Figure 3. Comparison of domain specifications in the core and general journals.

Figure 4. Comparison of construct operationalization in the core and general journals.

Figure 5. Comparison of measurement practices in the core and general journals.

In addition, as Figure 4 shows, the numbers of studies in the core and general journals datasets that operationalized the constructs using Cloze/Likert/MCQ, writing, and oral production were fairly evenly matched. Writing was used most in the core journals, while oral production was used most in the general journals. Finally, Figure 5 shows the importance placed on reliability by authors in both datasets. In comparison, other measurement practices are scarcely mentioned. Generalization and utilization made an extremely poor showing in the general journals compared with the core journals, as the disparity between the four bars in Figure 5 shows.

Limitations and Future Directions

The present study is not without limitations. The focus of the study was to identify research clusters and bursts and the measurement and validation practices in language assessment research; the reasons why certain authors were co-cited by a large number of authors were not investigated. Merton (1968, 1988) and Small (2004) proposed two reasons for bursts in citations based on the sociology of science: the Matthew effect and the halo effect. First, Merton (1968, 1988) proposed that eminent authors often receive comparatively more credit than less known authors; Merton (1968, 1988) called this the Matthew effect. This results in a widening gap between unknown and well-known authors (Merton, 1968, 1988) and, in many cases, the unfortunate invisibility of equally strong research published by unknown authors (Small, 2004). This is because citations function like “expert referral” and, once they gain momentum, they “will increase the inequality of citations by focusing attention on a smaller number of selected sources, and widening the gap between symbolically rich and poor” (Small, 2004, p. 74). One way this can be measured in future research is by using power laws or similar mathematical functions to capture trends in the data (Brzezinski, 2015). For example, a power law would fit a dataset of cited and citing publications wherein a large portion of the observed outcomes (citations) result from a small number of cited publications (Albarrán and Ruiz-Castillo, 2011). Albarrán et al. (2011, p. 395) provided compelling evidence from an impressively large dataset to support this phenomenon, concluding that “scientists make references that a few years later will translate into a highly skewed citation distribution crowned in many cases by a power law.”
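A rough diagnostic of a power-law-like citation distribution can be sketched as a least-squares fit on the log-log rank-frequency plot (synthetic data below; dedicated maximum-likelihood estimators are preferable for formal tests of the kind the cited studies report):

```python
import numpy as np

def rank_frequency_slope(citation_counts: list[int]) -> float:
    """Fit a line to the log-log rank-frequency plot of citation counts.
    A strongly negative slope indicates a heavy-tailed, power-law-like
    distribution in which a few publications attract most citations."""
    counts = np.sort(np.asarray(citation_counts))[::-1].astype(float)
    ranks = np.arange(1, counts.size + 1, dtype=float)
    mask = counts > 0                      # log is undefined at zero
    slope, _intercept = np.polyfit(np.log(ranks[mask]), np.log(counts[mask]), 1)
    return slope

# Synthetic skewed counts: a handful of highly cited papers, a long tail.
counts = [250, 90, 40, 18, 9, 5, 3, 2, 1, 1, 1, 1]
print(rank_frequency_slope(counts) < -1)  # True for this heavy-tailed sample
```

Applied to a real cited-reference dataset, a fit of this kind would quantify how strongly citations concentrate on a small number of sources, the skew Albarrán and Ruiz-Castillo (2011) describe.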

In addition, the eminence of scholars or the reputation of the journals where their work is published can contribute significantly to burstness; this is called the halo effect (Small, 2004). In a recent paper, Zhang and Poucke (2017) showed that the journal impact factor has a significant impact on the citations a paper receives. Another study, by Antoniou et al. (2015, p. 286), identified “study design, studies reporting design in the title, long articles, and studies with high number of references” as predictors of higher citation rates. To this list, we might add the seniority and eminence of authors and the type of publication (textbook vs. paper), as well as “negative citation, self-citation, and misattribution” (Small, 2004, p. 76). Future research should investigate whether these variables played a role in the citation patterns and clusters that emerged in the present study.

While self-citation was not filtered out and may present a limitation of this study, self-citation can be legitimate and necessary to the continuity of a line of research. In CiteSpace, to qualify as a citing article, the citations of an article must exceed a selection threshold, determined by the g-index, the top N most cited publications per time slice, or other selection modes. Although this process does not prevent the selection of a self-cited reference, the selection is justifiable to a great extent. If a highly cited reference involves some or even all self-citations, it behooves the analyst to establish the role of the reference in the literature and to verify whether the high citation count is inflated or whether there is genuine intellectual merit that justifies the self-citation.
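For readers unfamiliar with the g-index used as a selection threshold above, it can be sketched as follows (a minimal implementation of Egghe's standard definition; CiteSpace's internal implementation may differ in details):

```python
def g_index(citation_counts: list[int]) -> int:
    """g-index: the largest g such that the g most highly cited papers
    together received at least g**2 citations."""
    counts = sorted(citation_counts, reverse=True)
    g, cumulative = 0, 0
    for rank, count in enumerate(counts, start=1):
        cumulative += count
        if cumulative >= rank * rank:
            g = rank
        else:
            break  # counts are descending, so the shortfall can only widen
    return g

# Five papers cited 10, 8, 5, 4, and 3 times: the top 5 papers have
# 30 citations in total, and 30 >= 5**2 = 25, so g = 5.
print(g_index([10, 8, 5, 4, 3]))  # 5
```

Unlike a simple top-N cutoff, the g-index rewards a body of work whose most-cited items are cited very heavily, which is why it is one of the selection modes offered for slicing citation networks.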

Another limitation of the study is that we did not include methodological journals such as the Journal of Educational Measurement in the search, as indicated earlier. This was because we adopted a keyword search strategy, and the majority of papers in methodological journals include the search keywords we used, such as measurement and assessment, even though many of them are not relevant to language assessment; including them would have affected the quality and content of the clusters. We suggest that future research explore the relationship between language assessment and methodological journals through, for example, the dual-map overlay method available in CiteSpace. Similarly, technical reports and book chapters were not included in the datasets, as the former are not indexed in Scopus and Scopus's coverage of the latter is not as wide as its coverage of journal articles.

Finally, it should be noted that it takes at least 1 year for a recent publication to develop a citation burst, as our present and past analyses show (Aryadoust and Ang, 2019). Therefore, the dynamics of the field under investigation can change within a few years, as new bursts and research clusters emerge and pull research in a different direction.

Conclusion

The first aim of the study was to identify the main intellectual domains in language assessment research published in the core and general journals. We found that the primary focus of the general journals was on vocabulary, oral proficiency, essay writing, grammar, and reading. The secondary focus was on affective schemata, awareness, memory, language proficiency, explicit vs. implicit language knowledge, language or semantic awareness, and semantic complexity. With the exception of language proficiency, this second area of focus was absent in the core journals. The focus of the core journals was more exclusively on reading and listening comprehension assessment (primary theme), facets of speaking and writing performance such as raters and (psychometric) validation (secondary theme), as well as feedback, corpus linguistics, and washback (tertiary theme). From this, it may be said that the main preoccupation of researchers in SLA and language assessment was the assessment of reading, writing, and oral production, whereas assessment in SLA research additionally centered on vocabulary and grammar constructs. A number of areas were underrepresented, including affective schemata, awareness, memory, language proficiency, explicit vs. implicit language knowledge, language or semantic awareness, semantic complexity, feedback, corpus linguistics, and washback. These areas should be investigated with more rigor in future research.

In both datasets, several textbooks, editorials, and review articles feature prominently in and/or across the clusters. The heavy presence of certain publications (such as Bachman's) can be attributed to the importance of the scholar to the field. However, certain types of publications, such as review articles, tend to be cited disproportionately often (Bennet et al., 2019), although precisely why this is the case is yet to be determined. Aksnes et al.'s (2019) cautions about overreliance on bibliometric analysis ring true here as well. Thus, we have provided additional analyses alongside the statistics to complete the picture behind the numbers, inasmuch as that is possible.

The second aim of the study was to describe measurement and validation practices in the two datasets. Collectively, the data and comparisons presented demonstrate strong evidence that, in both the core and general journals, the majority of citing papers did not carry out the inference-based validation spelled out by Bachman and Palmer (2010), Kane (2006), or Messick (1989). In language assessment, Bachman (2005) and Bachman and Palmer (2010) stressed that an all-encompassing validation program is “important and useful” before an assessment can be put to any use (Bachman, 2005, p. 30, emphasis in original). However, the feasibility and heavy demands of a strong validity program remain an open question (see Haertel, 1999). In particular, it seems impracticable to validate both the interpretations and uses of a language test/assessment before using the test for research purposes. A solution is Kane's (2006) less demanding approach, which holds that test instruments should be validated for the claims made. Accordingly, researchers would not be expected to provide “validity” evidence encompassing all the validity inferences explicated above for every instrument. Some useful guidelines include the reporting of reliability (internal consistency and rater consistency), the range of item difficulty and discrimination, the range of person ability, and evidence that the test measures the purported constructs. In sum, in our view, the lack of reporting of evidence for the above-mentioned components in the majority of studies was because these were not applicable to the objectives and design of the studies and their assessment tools.
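The reporting guidelines above can be made concrete with a small sketch (hypothetical dichotomous responses and classical test theory formulas; not the validation procedure of any specific cited study):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal consistency (Cronbach's alpha) for a persons-by-items matrix."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1 - item_variances / total_variance)

def item_difficulty(scores: np.ndarray) -> np.ndarray:
    """Proportion correct per dichotomous item (higher values = easier items)."""
    return scores.mean(axis=0)

# Hypothetical 0/1 responses of six test takers to four items.
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
])
print(item_difficulty(responses))           # item 1 answered correctly by 5 of 6
print(round(cronbach_alpha(responses), 2))  # 0.7
```

Reporting even these basic statistics for an instrument, alongside rater consistency where applicable, would go some way toward the minimal validity evidence discussed above.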

The preponderance of open-ended formats (essay/oral performance), which engage more communicative skills than discrete-point/selected-response testing (such as MCQ or cloze), shows a tendency toward communicative testing approaches in both datasets. As format effects have been found for L1 reading and, under certain conditions, L2 listening (see In'nami and Koizumi, 2009), the popularity of the relatively more difficult open-ended questions has implications for language test developers that cannot be ignored. Given that format effects on scores impact the reliability of tests in discriminating language ability, and consequently their fairness, the popularity of one type of format in language testing should be re-evaluated, or at the very least, examined more closely.

Finally, the sustainability of the intellectual domains identified in this study depends on the needs of the language assessment community and other factors such as “influence” of the papers published in each cluster. If a topic is an established intellectual domain with influential authors (high burstness and betweenness centrality), it stands a higher chance of thriving and proliferating. However, the fate of intellectual domains that have not attracted the attention of authors with high bursts and betweenness centrality could be bleak—even though these clusters may discuss significant areas of inquiry. There is currently no profound understanding of the forces that shape the scope and direction of language assessment research. Significantly more research is needed to determine what motivates authors to select and investigate a topic, how thoroughly they cite past research, and what internal (within a field) and external (between fields) factors lead to the sustainability of a Research Topic.

Statements

Data availability statement

Publicly available datasets were analyzed in this study. The datasets can be reproduced from Scopus using the search formula provided in the Appendix.

Author contributions

VA conceptualized the study, downloaded the data, conducted data analysis, contributed to writing the paper, and led the team. AZ and ML helped with the data analysis and coding, and contributed to writing the paper. CC contributed conceptually to data generation and analysis and suggested revisions. All authors contributed to the article and approved the submitted version.

Acknowledgments

We wish to thank Chee Shyan Ng and Rochelle Teo for their contribution to earlier versions of this paper.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The handling editor is currently co-organizing a Research Topic with one of the authors, VA, and confirms the absence of any other collaboration.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2020.01941/full#supplementary-material

Footnotes

1.^We are grateful to one of the reviewers for suggesting this note.

2.^We did not include methodological journals such as ‘Journal of Educational Measurement’ in the search, as the majority of the papers in those journals include the search keywords, even though they are not relevant to language assessment.

3.^In DCA, some publications may not have a clear link with the rest of the publications in the dataset. These were not listed among the contributory publications to the major clusters visualized by CiteSpace in the present study.

4.^CiteSpace, by default, shows the largest connected component. If a cluster does not appear in the largest connected component, this means it must appear in the second-largest connected component or other smaller components. The present study was limited to clusters within the largest connected component, which is a widely adopted strategy in network analysis.

References

  • 1

    AksnesD. W.LangfeldtL.WoutersP. (2019). Citations, citation indicators, and research quality: an overview of basic concepts and theories. Sage Open9, 117. 10.1177/2158244019829575

  • 2

    AlbarránP.CrespoJ. A.OrtuñoI.Ruiz-CastilloJ. (2011). The skewness of science in 219 sub-fields and a number of aggregates. Scientometrics88, 385397. 10.1007/s11192-011-0407-9

  • 3

    AlbarránP.Ruiz-CastilloJ. (2011). References made and citations received by scientific articles. J. Am. Soc. Inform. Sci. Technol.62, 4049. 10.1002/asi.21448

  • 4

    AldersonJ. C. (2000). Assessing Reading.Cambridge: Cambridge University Press. 10.1017/CBO9780511732935

  • 5

    AldersonJ. C. (2005). Diagnosing Foreign Language Proficiency: The Interface Between Learning and Assessment. London: A&C Black.

  • 6

    AldersonJ. C.BanerjeeJ. (2001). State of the art review: language testing and assessment Part 1. Lang. Teach.34, 213236. 10.1017/S0261444800014464

  • 7

    AldersonJ. C.BanerjeeJ. (2002). State of the art review: language testing and assessment (part two). Language Teach.35, 79113. 10.1017/S0261444802001751

  • 8

    AldersonJ. C.ClaphamC.WallD. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.

  • 9

    AldersonJ. C.Hamp-LyonsL. (1996). TOEFL preparation courses: a study of washback. Lang. Testing13, 280297. 10.1177/026553229601300304

  • 10

    AldersonJ. C.LukmaniY. (1989). Cognition and reading: cognitive levels as embodied in test questions. Read Foreign Lang.5, 253270.

  • 11

    AldersonJ. C.WallD. (1993). Does washback exist?Appl Linguist.14, 115129. 10.1093/applin/14.2.115

  • 12

    American Educational Research Association (2014). American Psychological Association,and National Council on Measurement in Education. Standards for Educational and Psychological Testing.Washington, DC: American Educational Research Association.

  • 13

    AmmarA.SpadaN. (2006). One size fits all? Recasts, Prompts, and L2 Learning. Stud. Second Lang. Acquis.28:543. 10.1017/S0272263106060268

  • 14

    AntoniouG. A.AntoniouS. A.GeorgakarakosE. I.SfyroerasG. S.GeorgiadisG. S. (2015). Bibliometric analysis of factors predicting increased citations in the vascular and endovascular literature. Ann. Vasc. Surg.29, 286292. 10.1016/j.avsg.2014.09.017

  • 15

    ArikB.ArikE. (2017). “Second language writing” publications in web of science: a bibliometric analysis. Publications5:4. 10.3390/publications5010004

  • 16

    AryadoustV. (2013). Building a Validity Argument for a Listening Test of Academic proficiency. Newcastle: Cambridge Scholars Publishing.

  • 17

    AryadoustV. (2020). A review of comprehension subskills: a scientometrics perspective. System88, 102180. 10.1016/j.system.2019.102180

  • 18

    AryadoustV.AngB. H. (2019). Exploring the frontiers of eye tracking research in language studies: a novel co-citation scientometric review. Comput. Assist. Lang. Learn.136. 10.1080/09588221.2019.1647251

  • 19

    BachmanL. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

  • 20

    BachmanL. F. (2000). Modern language testing at the turn of the century: assuring that what we count counts. Lang. Testing17, 142. 10.1177/026553220001700101

  • 21

    BachmanL. F. (2005) Building and supporting a case for test use. Lang. Assess. Quart. 2, 134. 10.1207/s15434311laq0201_1

  • 22

    BachmanL. F.CohenA. D. (1998). Interfaces Between Second Language Acquisition and Language Testing Research. Cambridge: Cambridge University Press. 10.1017/CBO9781139524711

  • 23

    BachmanL. F.PalmerA. S. (1982). The construct validation of some components of communicative proficiency. TESOL Quart.16:449. 10.2307/3586464

  • 24

    BachmanL. F.PalmerA. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.

  • 25

    BachmanL. F.PalmerA. S. (2010). Language assessment in practice:Developing Language Assessments and Justifying Their Use in the Real World. Oxford: Oxford University Press.

  • 26

    BaileyK.M. (1999). Washback in Language Testing. TOEFL Monograph Series MS-15, June 1999. Educational Testing Service. Retrieved from: https://www.ets.org/Media/Research/pdf/RM-99-04.pdf

  • 27

    BanerjeeJ.YanX.ChapmanM.ElliottH. (2015). Keeping up with the times: revising and refreshing rating scale. Assess.Writ. Int. J.26, 519. 10.1016/j.asw.2015.07.001

  • 28

    BarkaouiK. (2010). Variability in ESL essay rating processes: the role of the rating scale and rater experience. Lang. Assess. Quart.7, 5474. 10.1080/15434300903464418

  • 29

    BarkaouiK.KnouziI. (2018). The effects of writing mode and computer ability on l2 test-takers' essay characteristics and scores. Assess. Writ. Int. J.36, 1931. 10.1016/j.asw.2018.02.005

  • 30

    BarrD. J.LevyR.ScheepersC.TilyH. J. (2013). Random effects structure for confirmatory hypothesis testing: keep it maximal. J. Memory Lang.68, 255278. 10.1016/j.jml.2012.11.001

  • 31

    BatesD.MächlerM.BolkerB.WalkerS. (2015). Fitting linear mixed-effects model using Ime4. J. Stat. Softw.67, 148. 10.18637/jss.v067.i01

  • 32

    BennetL.EisnerD. A.GunnA. J. (2019). Misleading with citation statistics?J. Physiol.10:2593. 10.1113/JP277847

  • 33

    BiberD. (1999). Longman Grammar of Spoken and Written English. London: Longman.

  • 34

    BiberD.GrayB. (2013). Discourse Characteristics of Writing and Speaking Task Types on the “TOEFL iBT”® Test: A Lexico-Grammatical Analysis. “TOEFL iBT”® Research Report. TOEFL iBT-19. Research Report. RR-13-04. Princeton, NJ: ETS Research Report Series. 10.1002/j.2333-8504.2013.tb02311.x

  • 35

    BiberD.GrayB.PoonponK. (2011). Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Q.45, 5–35. 10.5054/tq.2011.244483

  • 36

    BoersF.EyckmansJ.KappelJ.StengersH.DemecheleerM. (2006). Formulaic sequences and perceived oral proficiency: putting a Lexical Approach to the test. Lang. Teach. Res.10, 245–261. 10.1191/1362168806lr195oa

  • 37

    BondT. G.FoxC. M. (2007). Applying the Rasch Model: Fundamental Measurement in the Human Sciences (2nd ed.). New Jersey: Lawrence Erlbaum.

  • 38

    BorsboomD.MellenberghG. J. (2007). “Test validity in cognitive assessment,” in Cognitive Diagnostic Assessment for Education: Theory and Applications, eds LeightonJ. P.GierlM. J. (New York, NY: Cambridge University Press), 85–115. 10.1017/CBO9780511611186.004

  • 39

    BrandesU. (2001). A faster algorithm for betweenness centrality. J. Math. Sociol.25, 163–177. 10.1080/0022250X.2001.9990249

  • 40

    BrennanR. L. (2001). Generalizability Theory. New York, NY: Springer. 10.1007/978-1-4757-3456-0.

  • 41

    BrindleyG. (1998). Outcomes-based assessment and reporting in language learning programmes: a review of the issues. Lang. Test.15, 45–85. 10.1177/026553229801500103

  • 42

    BrindleyG. (2001). Outcomes-based assessment in practice: some examples and emerging insights. Lang. Test.18, 393–407. 10.1177/026553220101800405

  • 43

    BrooksL. (2009). Interacting in pairs in a test of oral proficiency: co-constructing a better performance. Lang. Test.26, 341–366. 10.1177/0265532209104666

  • 44

    BrownA. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Lang. Test.12, 1–15. 10.1177/026553229501200101

  • 45

    BrownA. (2003). Interviewer variation and the co-construction of speaking proficiency. Lang. Test.20, 1–25. 10.1191/0265532203lt242oa

  • 46

    BrownJ. D.HudsonT. (1998). The alternatives in language assessment. TESOL Q.32, 653–675. 10.2307/3587999

  • 47

    BrzezinskiM. (2015). Power laws in citation distributions: evidence from Scopus. Scientometrics103, 213–228. 10.1007/s11192-014-1524-z

  • 48

    BuckG. (2001). Assessing Listening. Cambridge: Cambridge University Press. 10.1017/CBO9780511732959

  • 49

    CanaleM.SwainM. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Appl. Linguis.1, 1–47. 10.1093/applin/1.1.1

  • 50

    CarrN. T. (2006). The factor structure of test task characteristics and examinee performance. Lang. Test.23, 269–289. 10.1191/0265532206lt328oa

  • 51

    CarrollJ. B. (1961). “Fundamental considerations in testing for English language proficiency of foreign students,” in Testing Center for Applied Linguistics (Washington, DC). Reprinted in Allen, H.B. & Campbell, R.N. (eds.). (1972) Teaching English as a Second Language: A Book of Readings. McGraw Hill.

  • 52

    Chalhoub-DevilleM. (2003). Second language interaction: current perspectives and future trends. Lang. Test.20, 369–383.

  • 53

    ChandlerJ. (2003). The efficacy of various kinds of error feedback for improvement in the accuracy and fluency of L2 student writing. J. Second Lang. Writing12, 267–296. 10.1016/S1060-3743(03)00038-9

  • 54

    ChapelleC. A. (1998). “Construct definition and validity inquiry in SLA research,” in Interfaces Between Second Language Acquisition and Language Testing Research, eds BachmanL. F.CohenA. D. (Cambridge: Cambridge University Press) 32–70. 10.1017/CBO9781139524711.004

  • 55

    ChapelleC. A.EnrightM. K.JamiesonJ. M. (2008). Building a Validity Argument for the Test of English as a Foreign Language™. New York, NY: Routledge.

  • 56

    ChenC. (2004). Searching for intellectual turning points: progressive knowledge domain visualization. Proc. Natl. Acad. Sci. U.S.A.101, 5303–5310. 10.1073/pnas.0307513100

  • 57

    ChenC. (2006). CiteSpace II: detecting and visualizing emerging trends and transient patterns in scientific literature. J. Am. Soc. Inform. Sci. Technol.57, 359–377. 10.1002/asi.20317

  • 58

    ChenC. (2010). “Measuring Structural Change in Networks Due to New Information,” in NATO IST-093/RWS-015 Workshop on Visualizing Networks: Coping with Change and Uncertainty. Rome: Griffiss Institutes.

  • 59

    ChenC. (2014). The CiteSpace Manual. Available online at: http://cluster.ischool.drexel.edu/cchen/citespace/CiteSpaceManual.pdf

  • 60

    ChenC. (2016). CiteSpace: A Practical Guide for Mapping Scientific Literature. New York, NY: Nova Science Publishers.

  • 61

    ChenC. (2017). Science mapping: a systematic review of the literature. J. Data Inform. Sci.2, 1–40. 10.1515/jdis-2017-0006

  • 62

    ChenC. (2019). How to Use CiteSpace. Retrieved from https://leanpub.com/howtousecitespace

  • 63

    ChenC.Ibekwe-SanJuanF.HouJ. (2010). The structure and dynamics of co-citation clusters: a multiple-perspective co-citation analysis. J. Am. Soc. Inform. Sci. Technol.61, 1386–1409. 10.1002/asi.21309

  • 64

    ChenC.SongI. Y.YuanX.ZhangJ. (2008). The thematic and citation landscape of data and knowledge engineering (1985–2007). Data Knowl. Eng.67, 234–259. 10.1016/j.datak.2008.05.004

  • 65

    ChenC.SongM. (2017). Representing Scientific Knowledge: The Role of Uncertainty. Princeton, NJ: Springer. 10.1007/978-3-319-62543-0

  • 66

    ChenZ.HenningG. (1985). Linguistic and cultural bias in language proficiency tests. Lang. Test.2:155. 10.1177/026553228500200204

  • 67

    ChenC. (2003). Mapping Scientific Frontiers: The Quest for Knowledge Visualization. 1st Edn. Princeton, NJ: Springer. 10.1007/978-1-4471-0051-5_1

  • 68

    ClaphamC. (1996). The Development of IELTS:A Study of the Effect of Background Knowledge on Reading Comprehension. Cambridge: University of Cambridge Local Examinations Syndicate.

  • 69

    CohenJ. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: L. Erlbaum Associates.

  • 70

    CollinsA. J.FauserC.J.M. B. (2005). Balancing the strengths of systematic and narrative reviews. Hum. Reprod. Update11, 103–104. 10.1093/humupd/dmh058

  • 71

    Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge: Press Syndicate of the University of Cambridge.

  • 72

    CoxheadA. (2000). A new academic word list. TESOL Quart.34, 213–238. 10.2307/3587951

  • 73

    CronbachL. J.MeehlP. E. (1955). Construct validity in psychological tests. Psychol. Bull.52, 281–302. 10.1037/h0040957

  • 74

    CummingA. (1990). Expertise in evaluating second language compositions. Lang. Test.7:31. 10.1177/026553229000700104

  • 75

    CummingA. (2013). Assessing integrated writing tasks for academic purposes: promises and perils. Lang. Assess. Quart.10, 1–8. 10.1080/15434303.2011.622016

  • 76

    CummingA.KantorR.PowersD. E. (2002). Decision making while rating ESL/EFL writing tasks: a descriptive framework. Modern Lang. J.86, 67–96. 10.1111/1540-4781.00137

  • 77

    DanemanM.CarpenterP. A. (1980). Individual differences in working memory and reading. J. Verb. Learn. Verb. Behav.19, 450–466. 10.1016/S0022-5371(80)90312-6

  • 78

    DaviesA. (1982). “Language testing parts 1 and 2,” in Cambridge Surveys, ed KinsellaV. (Cambridge: Cambridge University Press), 127–159. (Originally published in Language Teaching and Linguistics: Abstracts, 1978).

  • 79

    DaviesA. (2008). Textbook trends in teaching language testing. Lang. Test.25, 327–347. 10.1177/0265532208090156

  • 80

    DaviesA. (2014). Remembering 1980. Lang. Assess. Quart.11, 129–135. 10.1080/15434303.2014.898642

  • 81

    DavisF. B. (1944). Fundamental factors of comprehension in reading. Psychometrika9, 185–197. 10.1007/BF02288722

  • 82

    DavisL. (2009). The influence of interlocutor proficiency in a paired oral assessment. Lang. Test.26, 367–396. 10.1177/0265532209104667

  • 83

    DavisonC. (2007). Views from the chalkface: English language school-based assessment in Hong Kong. Lang. Assess. Quart.4, 37–68. 10.1080/15434300701348359

  • 84

    De BellisN. (2014). “History and evolution of (biblio) metrics,” in Beyond Bibliometrics: Harnessing Multidimensional Indicators of Scholarly Impact, eds CroninB.SugimotoC. (Cambridge, MA: MIT Press), 23–44.

  • 85

    DeaneP. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assess. Writing18, 7–24. 10.1016/j.asw.2012.10.002

  • 86

    DörnyeiZ. (2007). Research Methods in Applied Linguistics: Quantitative, Qualitative, and Mixed Methodologies. Oxford: Oxford University Press.

  • 87

    DoughtyC. (2001). “Cognitive underpinnings of focus on form,” in Cognition and Second Language Instruction, eds RobinsonP.LongM. H.RichardsJ. C. (Cambridge: Cambridge University Press) 206–257. 10.1017/CBO9781139524780.010

  • 88

    DouglasD. (2000). Assessing Languages for Specific Purposes. Cambridge: Cambridge University Press. 10.1017/CBO9780511732911

  • 89

    EckesT. (2008). Rater types in writing performance assessments: a classification approach to rater variability. Lang. Test.25, 155–185. 10.1177/0265532207086780

  • 90

    EckesT. (2011). Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments. Frankfurt: Peter Lang. 10.1080/15366367.2018.1516094

  • 91

    EckesT.GrotjahnR. (2006). A closer look at the construct validity of C-tests. Lang. Test.23, 290–325. 10.1191/0265532206lt330oa

  • 92

    EllisN. (2005). At the interface: dynamic interactions of explicit and implicit language knowledge. Stud. Second Lang. Acquis.27, 305–352. 10.1017/S027226310505014X

  • 93

    EllisR. (2003). Task-Based Language Learning and Teaching. Oxford University Press.

  • 94

    EllisR. (2005). Measuring implicit and explicit knowledge of a second language: a psychometric study. Stud. Second Lang. Acquis.27:141. 10.1017/S0272263105050096

  • 95

    EllisR. (2008). The Study of Second Language Acquisition, 2nd Edn. Oxford: Oxford University Press.

  • 96

    EllisR. (2009). The differential effects of three types of task planning on the fluency, complexity, and accuracy in L2 oral production. Appl. Linguist.30, 474–509. 10.1093/applin/amp042

  • 97

    EllisR.BasturkmenH.LoewenS. (2001). Learner uptake in communicative ESL lessons. Lang. Learn. J. Res. Lang. Stud.51:281. 10.1111/1467-9922.00156

  • 98

    EllisR.LoewenS.ErlamR. (2006). Implicit and explicit corrective feedback and the acquisition of L2 grammar. Stud. Second Lang. Acquisit.28, 339–368. 10.1017/S0272263106060141

  • 99

    ErlamR. (2005). Language aptitude and its relationship to instructional effectiveness in second language acquisition. Lang. Teach. Res.9, 147–171. 10.1191/1362168805lr161oa

  • 100

    FanJ.YanX. (2020). Assessing speaking proficiency: a narrative review of speaking assessment research within the argument-based validation framework. Front. Psychol.11:330. 10.3389/fpsyg.2020.00330

  • 101

    FieldA. (2018). Discovering Statistics Using IBM SPSS Statistics, 5th Edn. London: Sage.

  • 102

    FlowerL.HayesJ. R. (1981). A cognitive process theory of writing. Coll. Compos. Commun.32:365. 10.2307/356600

  • 103

    FosterP.TonkynA.WigglesworthG. (2000). Measuring spoken language: a unit for all reasons. Appl. Linguist.21, 354–375. 10.1093/applin/21.3.354

  • 104

    FulcherG. (1996). Does thick description lead to smart tests? A data-based approach to rating scale construction. Lang. Test.13, 208–238. 10.1177/026553229601300205

  • 105

    FulcherG. (2003). Testing Second Language Speaking. London: Pearson Education.

  • 106

    FulcherG. (2004). Deluded by artifices? The Common European Framework and harmonization. Lang. Assess. Quart.1, 253–266. 10.1207/s15434311laq0104_4

  • 107

    FulcherG. (n.d.). What Is Language Testing? Language Testing Resources. Available online at: http://languagetesting.info/whatis/lt.html

  • 108

    FulcherG.DavidsonF.KempJ. (2011). Effective rating scale development for speaking tests: performance decision trees. Lang. Test.28, 5–29. 10.1177/0265532209359514

  • 109

    GaoL.RogersW. T. (2011). Use of tree-based regression in the analyses of L2 reading test items. Lang. Test.28, 77–104. 10.1177/0265532210364380

  • 110

    GebrilA. (2009). Score generalizability of academic writing tasks: does one test method fit it all? Lang. Test.26, 507–531. 10.1177/0265532209340188

  • 111

    GodfroidA.BoersF.HousenA. (2013). An eye for words: gauging the role of attention in incidental L2 vocabulary acquisition by means of eye-tracking. Stud. Second Lang. Acquisit.35, 483–517. 10.1017/S0272263113000119

  • 112

    GooJ. (2012). Corrective feedback and working memory capacity in interaction-driven L2 learning. Stud. Second Lang. Acquisit.34:445. 10.1017/S0272263112000149

  • 113

    GoswamiA. K.AgrawalR. K. (2019). Building intellectual structure of knowledge sharing. VINE J. Inform. Knowl. Manag. Syst.50, 136–162. 10.1108/VJIKMS-03-2019-0036

  • 114

    GrabowskiK. C.OhS. (2018). “Reliability analysis of instruments and data coding,” in The Palgrave Handbook of Applied Linguistics Research Methodology, eds PhakitiA.De CostaP.PlonskyL.StarfieldS. (London: Palgrave Macmillan), 541–565. 10.1057/978-1-137-59900-1_24

  • 115

    GrabowskiK. C.LinR. (2019). “Multivariate generalizability theory in language assessment,” in Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques, eds AryadoustV.RaquelM. (New York, NY: Routledge), 54–80. 10.4324/9781315187815-4

  • 116

    GreenA.ÜnaldiA.WeirC. (2010). Empiricism versus connoisseurship: establishing the appropriacy of texts in tests of academic reading. Lang. Test.27, 191–211. 10.1177/0265532209349471

  • 117

    GreenS.SalkindN. (2014). Using SPSS for Windows and Macintosh: Analyzing and Understanding Data, 7th Edn. London: Pearson Education, Inc.

  • 118

    GuoL.CrossleyS. A.McNamaraD. S. (2013). Predicting human judgments of essay quality in both integrated and independent second language writing samples: a comparison study. Assess. Writing18, 218–238. 10.1016/j.asw.2013.05.002

  • 119

    GutiérrezX. (2013). The construct validity of grammaticality judgment tests as measures of implicit and explicit knowledge. Stud. Second Lang. Acquisit.35, 423–449. 10.1017/S0272263113000041

  • 120

    HaertelE. H. (1999). Validity arguments for high-stakes testing: in search of the evidence. Educ. Measur.18, 5–9. 10.1111/j.1745-3992.1999.tb00276.x

  • 121

    HallW. E.RobinsonF. P. (1945). An analytical approach to the study of reading skills. J. Educ. Psychol.36, 429–442. 10.1037/h0058703

  • 122

    HallidayM. A. K.HasanR. (1976). Cohesion in English. London: English Language Series, Longman.

  • 123

    HambletonR. K.SwaminathanH. (1985). Item Response Theory: Principles and Applications. Dordrecht: Kluwer Academic Publishers. 10.1007/978-94-017-1988-9

  • 124

    Hamp-LyonsL. (1991). “Scoring procedures for ESL contexts,” in Assessing Second Language Writing in Academic Contexts, ed Hamp-LyonsL. (New York, NY: Ablex Pub. Corp), 241–276.

  • 125

    HardingL.AldersonJ. C.BrunfautT. (2015). Diagnostic assessment of reading and listening in a second or foreign language: elaborating on diagnostic principles. Lang. Test.32, 317–336. 10.1177/0265532214564505

  • 126

    HarringtonM.SawyerM. (1992). L2 working memory capacity and L2 reading skill. Stud. Second Lang. Acquisit.14:25. 10.1017/S0272263100010457

  • 127

    HarschC. (2014). General language proficiency revisited: current and future issues. Lang. Assess. Quart.11, 152–169. 10.1080/15434303.2014.902059

  • 128

    HenningG. (1987). A Guide to Language Testing: Development, Evaluation, Research. New York, NY: Newberry House Publishers.

  • 129

    HornbergerN. H.ShohamyE. (2008). Encyclopedia of Language and Education Vol. 7: Language Testing and Assessment. New York, NY: Springer.

  • 130

    HousenA.KuikenF. (2009). Complexity, accuracy and fluency in second language acquisition. Appl. Linguist.30, 461–473. 10.1093/applin/amp048

  • 131

    HughesA. (1989). Testing for Language Teachers. Cambridge: Cambridge University Press.

  • 132

    HulstijnJ. H. (2003). “Incidental and intentional learning,” in The Handbook of Second Language Acquisition, eds DoughtyC. J.LongM. H. (Blackwell Publishing), 349–381. (New Jersey: Blackwell handbooks in linguistics; No. 14) 10.1002/9780470756492.ch12

  • 133

    In'namiY.KoizumiR. (2009). A meta-analysis of test format effects on reading and listening test performance: Focus on multiple-choice and open-ended formats. Lang. Test. 26, 219–244. 10.1177/0265532208101006

  • 134

    IsbellD. R. (2017). Assessing C2 writing ability on the Certificate of English Language Proficiency: rater and examinee age effects. Assess. Writing Int. J.34, 37–49. 10.1016/j.asw.2017.08.004

  • 135

    IwashitaN.BrownA.McNamaraT.O'HaganS. (2008). Assessed levels of second language speaking proficiency: how distinct? Appl. Linguist.29, 24–49. 10.1093/applin/amm017

  • 136

    JacobsH. L. (1981). Testing ESL Composition: A Practical Approach. New York, NY: Newbury House.

  • 137

    JangE. E. (2009a). Cognitive diagnostic assessment of L2 reading comprehension ability: validity arguments for Fusion Model application to LanguEdge assessment. Lang. Test.26, 31–73. 10.1177/0265532208097336

  • 138

    JangE. E. (2009b). Demystifying a Q-matrix for making diagnostic inferences about L2 reading skills. Lang. Assess. Quart.6, 210–238. 10.1080/15434300903071817

  • 139

    JonesK. (2004). Mission drift in qualitative research, or moving toward a systematic review of qualitative studies, moving back to a more systematic narrative review. Q. Rep.9, 95–112.

  • 140

    KaneM. T. (2006). “Validation,” in Educational Measurement, 4th Edn, ed BrennanR. L. (Westport, CT: American Council on Education/Praeger), 17–64.

  • 141

    KaneM. T. (2013). Validating the interpretations and uses of test scores. J. Educ. Measur.50, 1–73. 10.1111/jedm.12000

  • 142

    KimH. J. (2015). A qualitative analysis of rater behavior on an L2 speaking assessment. Lang. Assess. Quart.12, 239–261. 10.1080/15434303.2015.1049353

  • 143

    KnochU. (2009). Diagnostic assessment of writing: a comparison of two rating scales. Lang. Test.26, 275–304. 10.1177/0265532208101008

  • 144

    KnochU. (2011). Rating scales for diagnostic assessment of writing: what should they look like and where should the criteria come from? Assess. Writing16, 81–96. 10.1016/j.asw.2011.02.003

  • 145

    KnochU.ReadJ.von RandowJ. (2007). Re-training writing raters online: how does it compare with face-to-face training? Assess. Writing12, 26–43. 10.1016/j.asw.2007.04.001

  • 146

    KobayashiM. (2002). Method effects on reading comprehension test performance: text organization and response format. Lang. Test.19, 193–220. 10.1191/0265532202lt227oa

  • 147

    KormosJ.DénesM. (2004). Exploring measures and perceptions of fluency in the speech of second language learners. System32, 145–164. 10.1016/j.system.2004.01.001

  • 148

    LadoR. (1961). Language Testing: The Construction and Use of Foreign Language Tests: A Teacher's Book. London: Longmans, Green and Company.

  • 149

    LallmamodeS. P.DaudN. M.Abu KassimN. L. (2016). Development and initial argument-based validation of a scoring rubric used in the assessment of L2 writing electronic portfolios. Assess Writing30, 44–62. 10.1016/j.asw.2016.06.001

  • 150

    LamD. M. K. (2018). What counts as “responding”? Contingency on previous speaker contribution as a feature of interactional competence. Lang. Test.35, 377–401. 10.1177/0265532218758126

  • 151

    LangsamR. S. (1941). A factorial analysis of reading ability. J. Exp. Educ.10, 57–63. 10.1080/00220973.1941.11010235

  • 152

    Larsen-FreemanD. (2006). The emergence of complexity, fluency, and accuracy in the oral and written production of five Chinese learners of English. Appl. Linguist.27, 590–619. 10.1093/applin/aml029

  • 153

    LauferB. (1992). “How much lexis is necessary for reading comprehension?,” in Vocabulary and Applied Linguistics, eds ArnaudP. J. L.BejoingH. (New York, NY: Macmillan), 129–132. 10.1007/978-1-349-12396-4_12

  • 154

    LauferB.HulstijnJ. (2001). Incidental vocabulary acquisition in a second language: the construct of task-induced involvement. Appl. Linguist.22, 1–26. 10.1093/applin/22.1.1

  • 155

    LauferB.Ravenhorst-KalovskiG. C. (2010). Lexical threshold revisited: Lexical text coverage, learners' vocabulary size and reading comprehension. Read. Foreign Lang.22, 15–30.

  • 156

    LazaratonA. (1996). Interlocutor support in oral proficiency interviews: the case of CASE. Lang. Test.13, 151–172. 10.1177/026553229601300202

  • 157

    LeeY. W.SawakiY. (2009). Cognitive diagnosis approaches to language assessment: an overview. Lang. Assess. Quart.6, 172–189. 10.1080/15434300902985108

  • 158

    LeiL.LiuD. (2019). The research trends and contributions of System's publications over the past four decades (1973–2017): a bibliometric analysis. System 80, 1–13. 10.1016/j.system.2018.10.003

  • 159

    LeveltW. J. M. (1989). Speaking: From Intention to Articulation. Cambridge, MA: MIT Press.

  • 160

    LiS. (2010). The effectiveness of corrective feedback in SLA: a meta-analysis. Lang. Learn, 60, 309–365. 10.1111/j.1467-9922.2010.00561.x

  • 161

    LimG. S. (2011). The development and maintenance of rating quality in performance writing assessment: a longitudinal study of new and experienced raters. Lang. Test.28, 543–560. 10.1177/0265532211406422

  • 162

    LinacreJ. M. (1994). Many-Facet Rasch Measurement (2nd Ed.). Chicago, IL: MESA.

  • 163

    LongM. H. (2007). Problems in SLA. New Jersey: Lawrence Erlbaum Associates Publishers.

  • 164

    LongM. H. (1991). “Focus on form: a design feature in language teaching methodology,” in Foreign Language Research in Cross-Cultural Perspective. eds BotK. D.KramschC.GinsbergR. (Amsterdam: John Benjamins), 39–52. 10.1075/sibil.2.07lon

  • 165

    LuX. (2017). Automated measurement of syntactic complexity in corpus-based L2 writing research and implications for writing assessment. Lang. Test.34, 493–511. 10.1177/0265532217710675

  • 166

    LumleyT. (1993). The notion of subskills in reading comprehension tests: an EAP example. Lang. Test.10:211. 10.1177/026553229301000302

  • 167

    LumleyT. (2002). Assessment criteria in a large-scale writing test: what do they really mean to the raters? Lang. Test.19, 246–276. 10.1191/0265532202lt230oa

  • 168

    LumleyT.McNamaraT. F. (1995). Rater characteristics and rater bias: implications for training. Lang. Test.12, 54–71. 10.1177/026553229501200104

  • 169

    LuomaS. (2004). Assessing Speaking. Cambridge: Cambridge University Press. 10.1017/CBO9780511733017

  • 170

    LynchB.DavidsonF.HenningG. (1988). Person dimensionality in language test validation. Lang. Test.5:206. 10.1177/026553228800500206

  • 171

    LysterR. (1998). Recasts, repetition, and ambiguity in L2 classroom discourse. Stud. Second Lang. Acquisit.20:51. 10.1017/S027226319800103X

  • 172

    LysterR. (2004). Differential effects of prompts and recasts in form-focused instruction. Stud. Second Lang. Acquisit.26, 399–432. 10.1017/S0272263104263021

  • 173

    LysterR.RantaL. (1997). Corrective feedback and learner uptake: negotiation of form in communicative classrooms. Stud. Second Lang. Acquisit.19, 37–66. 10.1017/S0272263197001034

  • 174

    LysterR.SaitoK. (2010). Oral feedback in classroom SLA: a meta-analysis. Stud. Second Lang. Acquisit.32:265. 10.1017/S0272263109990520

  • 175

    MackeyA.GooJ. (2007). “Interaction research in SLA: a meta-analysis and research synthesis,” in Conversational Interaction in Second Language Acquisition, ed MackeyA. (Oxford: Oxford University Press), 407–453.

  • 176

    MayL. (2011). Interactional competence in a paired speaking test: features salient to raters. Lang. Assess. Quart.8, 127–145. 10.1080/15434303.2011.565845

  • 177

    McNamaraD.GraesserA.McCarthyP.CaiZ. (2014). Automated Evaluation of Text and Discourse with Coh-Metrix. London: Cambridge University Press. 10.1017/CBO9780511894664

  • 178

    McNamaraT. (2014). 30 years on – evolution or revolution? Epilogue. Lang. Assess. Quart.11, 226–232. 10.1080/15434303.2014.895830

  • 179

    McNamaraT. F. (1996). Measuring Second Language Performance. London: Longman.

  • 180

    McNamaraT. F. (1990). Assessing the second language proficiency of health professionals. (Ph.D. thesis), Department of Linguistics and Language Studies, The University of Melbourne, Australia.

  • 181

    McNamaraT. F. (1991). Test dimensionality: IRT analysis of an ESP listening test1. Lang. Test.8:139. 10.1177/026553229100800204

  • 182

    MertonR.K. (1988). The Matthew effect in science, II: cumulative advantage and the symbolism of intellectual property. ISIS 79, 606–623. 10.1086/354848

  • 183

    MertonR. K. (1968). The Matthew effect in science. Science 159, 56–63. Reprinted in: The Sociology of Science: Theoretical and Empirical Investigations. (Chicago: University of Chicago Press, 1973), p. 438–459.

  • 184

    MessickS. (1989). “Validity,” in Educational Measurement, 3rd Edn, ed LinnR. L. (New York, NY: American Council on Education/Macmillan), 13103.

  • 185

    MessickS. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educ. Res.23, 13–23. 10.3102/0013189X023002013

  • 186

    MessickS. (1996). Validity and washback in language testing. Lang. Test.13, 241–256. 10.1177/026553229601300302

  • 187

    MingersJ.LeydesdorffL. (2015). A review of theory and practice in scientometrics. Eur. J. Operation Res.246, 1–19. 10.1016/j.ejor.2015.04.002

  • 188

    MiyakeA.FriedmanN. P. (1998). “Individual differences in second language proficiency: working memory as language aptitude,” in Foreign Language Learning: Psycholinguistic Studies on Training and Retention. eds HealyA. F.BourneL. E.Jr (New Jersey: Lawrence Erlbaum Associates Publishers), 339–364.

  • 189

    MostafaM. M. (2020). A knowledge domain visualization review of thirty years of halal food research: themes, trends and knowledge structure. Trends Food Sci. Technol.99, 660–677. 10.1016/j.tifs.2020.03.022

  • 190

    NalimovV.MulcjenkoB. (1971). Measurement of Science: Study of the Development of Science as an Information Process. Washington, DC: Foreign Technology Division.

  • 191

    NationI. S. P. (2006). How large a vocabulary is needed for reading and listening? Can. Modern Lang. Rev.63, 59–82. 10.3138/cmlr.63.1.59

  • 192

    NationI. S. P. (2013). Learning Vocabulary in Another Language. Cambridge: Cambridge University Press. 10.1017/CBO9781139858656

  • 193

    NationI. S. P. (1990). Teaching and Learning Vocabulary. New York, NY: Newbury House Publishers.

  • 194

    NationI. S. P. (2001). Learning Vocabulary in Another Language. Cambridge: Cambridge University Press. 10.1017/CBO9781139524759

  • 195

    NewmanM. E. (2006). Modularity and community structure in networks. Proc. Natl. Acad. Sci. U.S.A.103, 8577–8582. 10.1073/pnas.0601602103

  • 196

    NorrisJ. M.OrtegaL. (2000). Effectiveness of L2 instruction: a research synthesis and quantitative meta-analysis. Lang. Learn. J. Res. Lang. Stud.50:417. 10.1111/0023-8333.00136

  • 197

    NorrisJ. M.OrtegaL. (2003). “Defining and measuring SLA,” in The Handbook of Second Language Acquisition, eds DoughtyC. J.LongM. H. (Malden, MA: Blackwell), 717–761.

  • 198

    NorrisJ. M.OrtegaL. (2009). Towards an organic approach to investigating CAF in instructed SLA: the case of complexity. Appl. Linguist.30, 555–578. 10.1093/applin/amp044

  • 199

    OllerJ. W. (1979). Language Tests at School: A Pragmatic Approach. London: Longman.

  • 200

    O'MalleyJ. M.ChamotA. U. (1990). Learning Strategies in Second Language Acquisition. Cambridge: Cambridge University Press.

  • 201

    OrtegaL. (2003). Syntactic complexity measures and their relationship to L2 proficiency: a research synthesis of college-level L2 writing. Appl. Linguist.24, 492–518. 10.1093/applin/24.4.492

  • 202

    O'SullivanB. (2002). Learner acquaintanceship and oral proficiency test pair-task performance. Lang. Test.19, 277–295. 10.1191/0265532202lt205oa

  • 203

    PaeC. U. (2015). Why systematic review rather than narrative review? Psychiat. Invest.12:417. 10.4306/pi.2015.12.3.417

  • 204

    PapageorgiouS.StevensR.GoodwinS. (2012). The relative difficulty of dialogic and monologic input in a second-language listening comprehension test. Lang. Assess. Quart.9, 375–397. 10.1080/15434303.2012.721425

  • 205

    PetticrewM.RobertsH. (2006). Systematic Reviews in the Social Sciences. New Jersey: Wiley Blackwell. 10.1002/9780470754887

  • 206

    PhakitiA.RoeverC. (2011). Current issues and trends in language assessment in Australia and New Zealand. Lang. Assess. Quart.8, 103–107. 10.1080/15434303.2011.566397

  • 207

    PicaT. (1994). Research on negotiation: what does it reveal about second-language learning conditions, processes, and outcomes? Lang. Learn. J. Res. Lang. Stud.44, 493–527. 10.1111/j.1467-1770.1994.tb01115.x

  • 208

    PlakansL. (2008). Comparing composing processes in writing-only and reading-to-write test tasks. Assess. Writing13, 111–129. 10.1016/j.asw.2008.07.001

  • 209

    PlakansL.GebrilA. (2017). Exploring the relationship of organization and connection with scores in integrated writing assessment. Assess. Writing39, 98–112. 10.1016/j.asw.2016.08.005

  • 210

    PlakansL.LiaoJ.-T.WangF. (2019). “I should summarize this whole paragraph”: Shared processes of reading and writing in iterative integrated assessment tasks. Assess. Writing40, 14–26. 10.1016/j.asw.2019.03.003

  • 211

    PlonskyL. (2013). Study quality in SLA: an assessment of designs, analyses, and reporting practices in quantitative L2 research. Stud. Second Lang. Acquisit.35:655. 10.1017/S0272263113000399

  • 212

    PlonskyL.OswaldF. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Lang. Learn. J. Res. Lang. Stud.64, 878–912. 10.1111/lang.12079

  • 213

    RakedzonT.Baram-TsabariA. (2017). To make a long story short: a rubric for assessing graduate students' academic and popular science writing skills. Assess. Writ. 32, 28–42. 10.1016/j.asw.2016.12.004

  • 214

    RaschG. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danmarks Paedagogiske Institut.

  • 215

    ReadJ. (2000). Assessing Vocabulary. Cambridge: Cambridge University Press. 10.1017/CBO9780511732942

  • 216

    RebuschatP. (2013). Measuring implicit and explicit knowledge in second language research. Lang. Learn.63, 595–626. 10.1111/lang.12010

  • 217

    RobinsonP. (2005). Cognitive complexity and task sequencing: studies in a componential framework for second language task design. Int. Rev. Appl. Linguist. Lang. Teach.43, 1–32. 10.1515/iral.2005.43.1.1

  • 218

    RoeverC. (2006). Validation of a web-based test of ESL pragmalinguistics. Lang. Test.23, 229–256. 10.1191/0265532206lt329oa

  • 219

    RömerU. (2017). Language assessment and the inseparability of lexis and grammar: focus on the construct of speaking. Lang. Test.34, 477–492. 10.1177/0265532217711431

  • 220

    RosenshineB.V. (2017). “Skill hierarchies in reading comprehension,” in Theoretical Issues in Reading Comprehension: Perspectives From Cognitive Psychology, Linguistics, Artificial Intelligence and Education, eds SpiroR. J.BruceB. C.BrewerW.F. (London: Taylor and Francis) 535–554. 10.4324/9781315107493-29

  • 221

    Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65. 10.1016/0377-0427(87)90125-7

  • 222

    Sawaki, Y., Stricker, L. J., and Oranje, A. H. (2009). Factor structure of the TOEFL Internet-based test. Lang. Test. 26, 5–30. 10.1177/0265532208097335

  • 223

    Sawaki, Y., and Xi, X. (2019). "Univariate generalizability theory in language assessment," in Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques, eds Aryadoust, V., and Raquel, M. (London: Routledge), 30–53. 10.4324/9781315187815-3

  • 224

    Schmidt, R. (1994). Deconstructing consciousness in search of useful definitions for applied linguistics. AILA Rev. 11, 11–26.

  • 225

    Schmidt, R. (2001). "Attention," in Cognition and Second Language Instruction, ed Robinson, P. (Cambridge: Cambridge University Press), 3–32. 10.1017/CBO9781139524780.003

  • 226

    Schmitt, N., Schmitt, D., and Clapham, C. (2001). Developing and exploring the behaviour of two new versions of the vocabulary levels test. Lang. Test. 18, 55–88. 10.1177/026553220101800103

  • 227

    Schmitt, N. (2008). Review article: instructed second language vocabulary learning. Lang. Teach. Res. 12, 329–363. 10.1177/1362168808089921

  • 228

    Schmitt, N. (2010). Researching Vocabulary: A Vocabulary Research Manual. London: Palgrave Macmillan. 10.1057/9780230293977

  • 229

    Sheen, Y. (2004). Corrective feedback and learner uptake in communicative classrooms across instructional settings. Lang. Teach. Res. 8, 263–300. 10.1191/1362168804lr146oa

  • 230

    Shohamy, E. (1988). A proposed framework for testing the oral language of second/foreign language learners. Stud. Second Lang. Acquisit. 10:165. 10.1017/S0272263100007294

  • 231

    Shohamy, E. G. (2001). The Power of Tests: A Critical Perspective on the Uses of Language Tests. Harlow; New York, NY: Longman.

  • 232

    Skehan, P. (1988). State of the art article: language testing Part 1. Lang. Teach. 21, 211–221. 10.1017/S0261444800005218

  • 233

    Skehan, P. (2009). Modelling second language performance: integrating complexity, accuracy, fluency, and lexis. Appl. Linguist. 30, 510–532. 10.1093/applin/amp047

  • 234

    Skehan, P. (1998). A Cognitive Approach to Language Learning. Oxford: Oxford University Press. 10.1177/003368829802900209

  • 235

    Small, H. (2004). On the shoulders of Robert Merton: towards a normative theory of citation. Scientometrics 60, 71–79. 10.1023/B:SCIE.0000027310.68393.bc

  • 236

    Small, H., and Sweeney, E. (1985). Clustering the science citation index using co-citations: a comparison of methods. Scientometrics 7, 391–409. 10.1007/BF02017157

  • 237

    Spada, N., and Tomita, Y. (2010). Interactions between type of instruction and type of language feature: a meta-analysis. Lang. Learn. 60, 263–308. 10.1111/j.1467-9922.2010.00562.x

  • 238

    Spolsky, B. (1977). "Language testing: art or science," in Proceedings of the Fourth International Congress of Applied Linguistics, Vol. 3, ed Nickel, G. (Stuttgart: Hochschulverlag), 7–28.

  • 239

    Spolsky, B. (1990). Oral examinations: an historical note. Lang. Test. 7, 158–173. 10.1177/026553229000700203

  • 240

    Spolsky, B. (1995). Measured Words: The Development of Objective Language Testing. Oxford: Oxford University Press.

  • 241

    Spolsky, B. (2017). "History of language testing," in Language Testing and Assessment, eds Shohamy, E., and Hornberger, N. H. (New York, NY: Springer), 375–384. 10.1007/978-3-319-02261-1_32

  • 242

    Swain, M. (1985). "Communicative competence: some roles of comprehensible input and comprehensible output in its development," in Input in Second Language Acquisition, eds Gass, S., and Madden, C. (New York, NY: Newbury House), 235–253.

  • 243

    Swain, M. (1995). "Three functions of output in second language learning," in Principle and Practice in Applied Linguistics: Studies in Honour of H. G. Widdowson, eds Cook, G., and Seidlhofer, B. (Oxford: Oxford University Press), 125–144.

  • 244

    Swain, M. (2000). "The output hypothesis and beyond: mediating acquisition through collaborative dialogue," in Sociocultural Theory and Second Language Learning, ed Lantolf, J. P. (Oxford: Oxford University Press), 97–114.

  • 245

    Taylor, L. (2009). Developing assessment literacy. Ann. Rev. Appl. Linguist. 29, 21–36. 10.1017/S0267190509090035

  • 246

    Upshur, J. A. (1971). "Productive communication testing: a progress report," in Applications in Linguistics, eds Perren, G., and Trim, J. L. M. (Cambridge: Cambridge University Press), 435–442.

  • 247

    van Batenburg, E. S. L., Oostdam, R. J., van Gelderen, A. J. S., and de Jong, N. H. (2018). Measuring L2 speakers' interactional ability using interactive speech tasks. Lang. Test. 35, 75–100. 10.1177/0265532216679452

  • 248

    van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: oral proficiency interviews as conversation. TESOL Q. 23:489. 10.2307/3586922

  • 249

    Vygotsky, L. (1978). Mind in Society: The Development of Higher Psychological Processes, eds Cole, M., John-Steiner, V., Scribner, S., and Souberman, E. Cambridge, MA: Harvard University Press.

  • 250

    Waring, R., and Takaki, M. (2003). At what rate do learners learn and retain new vocabulary from reading a graded reader? Read. Foreign Lang. 15, 130–163.

  • 251

    Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Lang. Test. 11:197. 10.1177/026553229401100206

  • 252

    Weigle, S. C. (1998). Using FACETS to model rater training effects. Lang. Test. 15, 263–287. 10.1177/026553229801500205

  • 253

    Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press. 10.1017/CBO9780511732997

  • 254

    Weir, C. (1990). Communicative Language Testing. Englewood Cliffs, NJ: Prentice Hall.

  • 255

    Weir, C. J. (2005a). Language Testing and Validation: An Evidence-Based Approach. London: Palgrave Macmillan.

  • 256

    Weir, C. J. (2005b). Language Testing and Validation. London: Palgrave Macmillan. 10.1057/9780230514577

  • 257

    Weir, C. J., Vidakovic, I., and Galaczi, E. D. (2013). Measured Constructs: A History of Cambridge English Language Examinations 1913–2012. Studies in Language Testing 37. Cambridge: Cambridge University Press.

  • 258

    Wilson, J., Roscoe, R., and Ahmed, Y. (2017). Automated formative writing assessment using a levels of language framework. Assess. Writ. 34, 16–36. 10.1016/j.asw.2017.08.002

  • 259

    Winke, P. (2011). Investigating the reliability of the civics component of the U.S. naturalization test. Lang. Assess. Q. 8, 317–341. 10.1080/15434303.2011.614031

  • 260

    Winke, P., and Lim, H. (2015). ESL essay raters' cognitive processes in applying the Jacobs et al. rubric: an eye-movement study. Assess. Writ. 25, 38–54. 10.1016/j.asw.2015.05.002

  • 261

    Wiseman, C. S. (2012). Rater effects: ego engagement in rater decision-making. Assess. Writ. 17, 150–173. 10.1016/j.asw.2011.12.001

  • 262

    Wolfe-Quintero, K., Inagaki, S., and Kim, H.-Y. (1998). Second Language Development in Writing: Measures of Fluency, Accuracy and Complexity. Honolulu, HI: University of Hawai'i Press.

  • 263

    Wray, A. (2002). Formulaic Language and the Lexicon. Cambridge: Cambridge University Press. 10.1017/CBO9780511519772

  • 264

    Wright, B. D., and Stone, M. H. (1979). Best Test Design. Chicago, IL: MESA Press.

  • 265

    Xi, X. (2010a). How do we go about investigating test fairness? Lang. Test. 27, 147–170. 10.1177/0265532209349465

  • 266

    Xi, X. (2010b). Automated scoring and feedback systems: where are we and where are we heading? Lang. Test. 27, 291–300. 10.1177/0265532210364643

  • 267

    Zhang, L., Goh, C. C. M., and Kunnan, A. J. (2014). Analysis of test takers' metacognitive and cognitive strategy use and EFL reading test performance: a multi-sample SEM approach. Lang. Assess. Q. 11, 76–102. 10.1080/15434303.2013.853770

  • 268

    Zhang, Y., and Elder, C. (2010). Judgments of oral proficiency by non-native and native English speaking teacher raters: competing or complementary constructs? Lang. Test. 28, 31–50. 10.1177/0265532209360671

  • 269

    Zhang, Z., and Van Poucke, S. (2017). Citations for randomized controlled trials in sepsis literature: the halo effect caused by journal impact factor. PLoS ONE 12:e0169398. 10.1371/journal.pone.0169398

  • 270

    Zhao, C. G. (2017). Voice in timed L2 argumentative essay writing. Assess. Writ. 31, 73–83. 10.1016/j.asw.2016.08.004

  • 271

    Zheng, Y., and Yu, S. (2019). What has been assessed in writing and how? Empirical evidence from Assessing Writing (2000–2018). Assess. Writ. 42:100421. 10.1016/j.asw.2019.100421

Keywords

document co-citation analysis, language assessment, measurement, review, scientometrics, validity, visualization, second language acquisition

Citation

Aryadoust V, Zakaria A, Lim MH and Chen C (2020) An Extensive Knowledge Mapping Review of Measurement and Validity in Language Assessment and SLA Research. Front. Psychol. 11:1941. doi: 10.3389/fpsyg.2020.01941

Received

04 February 2020

Accepted

14 July 2020

Published

04 September 2020

Volume

11 - 2020

Edited by

Thomas Eckes, Ruhr University Bochum, Germany

Reviewed by

John Read, The University of Auckland, New Zealand; Stefanie A. Wind, University of Alabama, United States

Copyright

*Correspondence: Vahid Aryadoust

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
