Validity Evidence of the Reading Screening Test for Portuguese First Graders

This study presents validity evidence for the Reading Screening Test (TRL, Teste de Rastreio de Leitura), which assesses word and pseudoword reading. Participants were 94 Portuguese first graders (49 girls and 45 boys), assessed with the TRL and with criterion measures: the ALEPE subscales for word reading, pseudoword reading, and rapid automatized naming. Results from confirmatory factor analyses indicated that a two-factor measurement model yielded a good fit to the data. Favorable estimates of internal consistency reliability were obtained. Correlation coefficients indicated that the measure was positively and statistically significantly associated with another measure of reading assessment. These results provide adequate evidence based on internal structure and on the relationship to other variables for the assessment of word reading accuracy among Portuguese first graders.


INTRODUCTION
Difficulties in reading have been observed in students with learning disabilities (Benner et al., 2010). Other students, with no diagnosis of learning difficulties, also show difficulties in reading right from the beginning of schooling (Poulsen et al., 2017). Good reading skills play a key role in individuals' personal and professional development (Jamshidifarsani et al., 2019).
Thus, it is important to identify and intervene in reading acquisition difficulties as early as possible, for which diagnostic and intervention tools are necessary (Poulsen et al., 2017). Assessment is a fundamental step in the teaching process, as it provides information that supports pedagogical decisions that help students develop their skills (Viana, 2009; Santos et al., 2017; Zuilkowski et al., 2019). In reading acquisition, early assessment is a necessary condition for early detection of difficulties and a consequent decision on intervention (Lyytinen, 2008; Hall and Burns, 2018). In turn, if the difficulties are not the target of early intervention, the child will be exposed to consecutive experiences of failure, leading to decreased motivation to learn and increasing the likelihood of grade retention or school dropout, which will negatively mark the school path (Lyytinen, 2008; Lyytinen and Erskine, 2016).
Currently, worldwide education guidelines emphasize the importance of early assessment and intervention regarding reading difficulties (Fuchs and Fuchs, 2005). The earlier, more accurate, and more thorough the assessment, the more effective the intervention (Lyytinen, 2008). In the Portuguese educational context, reading acquisition begins when children turn 6 years old and enter the 1st grade of elementary school. Preschoolers are not subject to literacy socialization. In Portugal, these concerns about early assessment and intervention are also reflected in recent education policy. The report of Rodrigues et al. (2017), for example, recommends the development of diagnostic and intervention tools and the sharing of diagnosis and early intervention methodologies, to be adopted in conjunction with pedagogical teaching strategies and individual intervention by teachers.
Another document in line with international research is the recent Portuguese legislation that establishes the need to assess all first graders and to intervene as early as possible with at-risk students (Decreto-Lei n.º 54/2018, de 6 de julho). The legislation emphasizes the role of early assessment, listing measures to ensure inclusion and to boost successful trajectories. In Portugal, the development of reading screening instruments is still at an embryonic stage (e.g., Rodrigues et al., 2017). Most of the evidence built for this purpose has significant gaps in terms of theoretical rationale and validation procedures.
Following the simple view of reading (Gough and Tunmer, 1986), two main abilities are necessary for the reading process to be mastered: decoding and comprehension. Decoding is the mechanism responsible for the conversion of graphemes into phonemes and their respective fusion, sequentially from left to right (Chang et al., 2017). There are two decoding levels: alphabetic (basic) decoding allows us to read orthographically simple words, whereas orthographic (complex) decoding allows us to read orthographically complex words. Simple words are characterized by consistent and/or dominant grapheme-phoneme correspondences, whereas complex words have inconsistent grapheme-phoneme correspondences (e.g., several phonemes correspond to the same grapheme). As the decoding process develops, familiar and unknown words are read fast and accurately (Borleffs et al., 2019), which constitutes a necessary foundation for both fluency and reading comprehension.
Several studies report moderate to high correlations between decoding and oral reading fluency (Meisinger et al., 2010; Speece et al., 2010), as well as moderate correlations between decoding and reading comprehension (Ricketts et al., 2007; Best et al., 2008). Research also indicates correlations between decoding and rapid automatized naming (RAN) (e.g., Hulme and Snowling, 2013; Lyytinen et al., 2015). RAN refers to the speed with which a stimulus is named. It consists of the ability to recover and name familiar items fluently, evaluating the speed and accuracy in the process of accessing the lexicon (Heikkilä, 2015). Slow performance on RAN tasks is associated with poor reading performance (Denkla and Rudel, 1976; Lúcio et al., 2017; Katzir et al., 2018; Landerl et al., 2019). Several studies confirm that RAN has a strong correlation with reading competence, with a progressively more relevant role throughout the school path (Cohen et al., 2018). This impact is especially strong on reading fluency (Araújo et al., 2015; Papadopoulos et al., 2016; Carvalho et al., 2017). Therefore, the validation of a reading test that assesses the processes of decoding and reading comprehension, along with an emphasis on fluency, is of great value for early reading acquisition monitoring.
This study aimed to analyze the psychometric properties of the Reading Screening Test with Portuguese-speaking first graders. The TRL aims to evaluate the processes of decoding and reading comprehension. Previous studies included the development of the TRL along with a usability study and item analysis (Silva, 2019). The TRL intends to continue the work developed by Sucena and Castro (2008) and Vilhena et al. (2016) with the reading age test. This test has a similar structure to that of the Lobrot L3 tests (Lobrot, 1973) and the reading efficiency test (Marín and Carrillo, 1999). In both these tests, the child must complete sentences by selecting the correct alternative from multiple choices. Specifically, in this study we aim to assess the measure's dimensional factor structure. Some studies consider a one-dimensional factor structure (e.g., Athayde et al., 2014; Viana et al., 2014); however, decoding is divided into two levels: alphabetic and orthographic (Borleffs et al., 2019). As described before, alphabetic decoding is predominant in the initial phase of reading acquisition, allowing the conversion of simple graphemes. Orthographic decoding is a more demanding process, which allows the conversion of complex graphemes. It is therefore important to test the dimensionality of the decoding construct. The second aim of this study was to provide evidence of validity based on the relationship to other variables: the word reading, pseudoword reading, and RAN subscales of the Reading Evaluation Battery for European Portuguese (ALEPE, Avaliação da Leitura em Português Europeu; Sucena and Castro, 2011). This reading battery has reference values for primary school students (first, second, third, and fourth grades) and is administered individually. ALEPE assesses the main processes involved in reading: phonological and written word processing.
The analysis of the results obtained in ALEPE allows us to determine the child's reading level, as well as to identify the reasons for reading difficulties. The validity of ALEPE was analyzed through correlation studies between ALEPE subscales. Based on previous studies (e.g., Lúcio et al., 2017), high positive correlations were expected between TRL scores and word and pseudoword reading scores as well as RAN. Taking previous research into account, it was hypothesized that: a one- or two-factor measurement model of the TRL scale would yield a good fit to the data (H1); the TRL total scale would present favorable estimates of internal consistency reliability (H2); and the TRL total scale would be positively and statistically significantly correlated with the ALEPE subscales (Sucena and Castro, 2011): word reading, pseudoword reading, and RAN (H3).

METHOD
Participants
This study assessed 94 first graders, 49 girls (52.1%) and 45 boys (47.9%), all native speakers of Portuguese. Participants ranged between low and medium SES: 31 students from low SES (M age = 6.11; SD = 4.4) and 63 students from medium SES (M age = 7.11; SD = 3.6). Participants were enrolled in public schools in the northern region of Portugal. This study used non-probabilistic sampling, specifically convenience sampling, based on the location of the schools and their receptivity to the study. None of the participating students had a cognitive or a language disorder. In the Portuguese educational context, classes are composed of 24 first graders (Despacho Normativo n.º 10-A/2018). In order to guarantee the minimum number of participants for parametric analysis (more than 30 participants; Field, 2009), two entire classes were enrolled in the study. All students from each class, in two elementary schools, were invited to participate.

Instruments
Participants were assessed with the TRL. The TRL is a screening test of early reading ability, developed for Portuguese-speaking first graders. The test consists of 30 incomplete sentences (items), which the reader must read and complete by selecting one of four multiple-choice alternatives.
Of the four alternatives, one is the target word and the remaining three are distractors. Distractors are words or pseudowords that are visually and/or phonologically close to the target word (e.g., "Paga o bolo com a: noda, mopa, bota, nota" [Pay for the cake with the: noda, mopa, boot, money; the other options are pseudowords], or "O pai vai à: jola, mola, loja, dota" [The father goes to the: jola, clothespin, store, dota]). Of the 30 sentences (items), 20 involve orthographically simple words and 10 involve orthographically complex words. Scores were collected 5 min after the beginning of the test. The total score corresponds to the total number of sentences completed correctly by the child. The maximum score is 30 points.
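The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the data layout, and the separate simple and complex subtotals are hypothetical, and the item order (items 1 to 20 simple, items 21 to 30 complex) follows the description given in the Discussion.

```python
# Illustrative scoring sketch. The function name, the data layout, and the
# subscale subtotals are assumptions made for this example only.

N_SIMPLE = 20   # items 1-20: orthographically simple words
N_COMPLEX = 10  # items 21-30: orthographically complex words


def score_trl(responses, answer_key, last_item_at_5min=None):
    """Return total, simple, and complex scores for one child.

    responses: list of 30 chosen alternatives (None = no response).
    answer_key: list of the 30 target words.
    last_item_at_5min: 1-based index of the last sentence completed when
        the 5-minute mark was reached; None scores all 30 items.
    """
    n_items = N_SIMPLE + N_COMPLEX
    if len(responses) != n_items or len(answer_key) != n_items:
        raise ValueError("expected 30 responses and 30 keyed answers")
    limit = n_items if last_item_at_5min is None else last_item_at_5min
    # One point per sentence completed correctly within the scored range.
    correct = [
        int(i < limit and resp is not None and resp == key)
        for i, (resp, key) in enumerate(zip(responses, answer_key))
    ]
    simple = sum(correct[:N_SIMPLE])
    complex_ = sum(correct[N_SIMPLE:])
    return {"total": simple + complex_, "simple": simple, "complex": complex_}
```

Passing `last_item_at_5min` yields the timed score, while omitting it yields the score for the administration without a time limit.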
The construction of this test took into consideration increasing task complexity, through the manipulation of three psycholinguistic variables that influence the accuracy and speed of reading (Vale, 2014): (i) syllabic structure of both words and pseudowords (simple or complex); (ii) orthographic structure of both words and pseudowords (simple or complex); and (iii) length of the sentences (short or long).
The syllabic structure of the four answer alternatives was controlled in the construction of the test. These alternatives were selected according to the following criteria: words with simple syllabic structure (consonant-vowel) and words with complex syllabic structure (consonant-vowel-consonant, consonant-diphthong, and consonant-consonant-vowel).
Regarding the orthographic condition, four types of orthographic condition were selected: words/pseudowords with simple graphemes, words/pseudowords with complex graphemes, words/pseudowords with contextual regularity, and irregular words.
Regarding sentence length, two types were considered: short and long sentences, composed of four and of six to eight words, respectively.
Participants were also assessed with three ALEPE subscales: word reading, pseudoword reading, and RAN (Sucena and Castro, 2011). The word reading subtest consists of four training items and 18 experimental words with varying orthographic complexity: simple words, complex words, and irregular words.
The pseudoword reading subtest consists of four training items and 15 experimental items with different orthographic complexity: simple and complex pseudowords. The child is asked to read each item, presented in isolation, on a computer screen. In the RAN, the child was asked to name the visual stimuli (four colors: red, yellow, blue, and green) displayed on the computer screen (for 30 s) as quickly and accurately as possible. The stimuli were displayed in continuous format (4 × 4). The experimental trial was preceded by a training trial, to ensure the child understood the task. This test allows the evaluation of the ability to recover the phonological form of words. The total result is obtained by summing the colors correctly named. Cronbach's alphas for the ALEPE word/pseudoword scales ranged from 0.46 for first graders to 0.72 for 2nd, 3rd, and 4th graders (Sucena and Castro, 2011). The authors clarify that this calculation was done separately for the first grade and the remaining grades, due to the different composition and length of the stimulus lists. In the 2nd, 3rd, and 4th grades (Lists B, C, and B'), the alpha value found was much more satisfactory than that for the first grade. This can be explained by the fact that in the first grade the number of stimuli is relatively small, which can limit the alpha value that can be reached (Sucena and Castro, 2011).

Procedures of Data Collection
Prior to data collection, authorizations from the Portuguese Education Ministry, school boards, and parents or legal guardians were obtained. The voluntary participation of all participants was ensured.
The TRL was administered without a time limit in order to analyze the functioning of all items; however, 5 min after the beginning of the test, the last sentence completed was marked, as were any items left unanswered. This pause was made in order to analyze how many items the participant could actually complete within this period, as 5 min is the time limit usually adopted in the literature for this type of screening tool (e.g., Lobrot, 1973; Cadime, 2011; Vilhena et al., 2016). After the pause, participants were instructed to continue completing the sentences. Once each participant had finished all sentences, the total completion time was recorded. The TRL was presented as a reading game composed of sentences that needed to be completed as quickly as possible. First, the experimenter read the training items aloud with the class and explained that participants should read each sentence and the four options carefully. Attention was drawn to the fact that two of the options were words and the other two were pseudowords, and that only one of the four options was correct and should be underlined. Finally, participants were instructed not to stop if they did not know how to complete a given sentence, but to proceed to the next one. After these instructions, each participant completed the test individually. The instruments were administered in two sessions, both in the school context in the last month of the school year, by a researcher with specific training in reading acquisition difficulties. In the first session, the TRL was administered collectively to all students in the classroom using paper and pencil. In the second session, each participant was assessed individually with the three ALEPE subscales (RAN, word reading, and pseudoword reading; Sucena and Castro, 2011) using a computer screen.

Procedures of Data Analyses
Confirmatory factor analysis was conducted to investigate the hypothesized dimensionality of the TRL, using the Analysis of Moment Structures (AMOS), version 25.0 for Windows. Two measurement models were tested. Model 1 assumed a one-factor structure (30 observed variables, one latent variable). Model 2 considered a two-factor structure (30 observed variables, two correlated latent variables corresponding to simple spelling words and complex spelling words). Measurement errors were freely estimated and one factor loading for each latent variable was fixed to 1. There were no missing data for any item. The assumptions of multivariate normality of the sample distribution and absence of outliers were tested beforehand. Item skewness values ranged from −0.28 to −3.54 and kurtosis values ranged from 0.51 to −10.5, suggesting a violation of the assumption of normal distribution. The Mahalanobis distance statistics of the first test suggested the existence of 60 outliers (p < 0.05). Because this is a real sample, we decided not to remove the outliers and to proceed with the analysis, taking this information into account when drawing conclusions. Given the evidence of multivariate non-normality of the sampling distribution, the Maximum Likelihood estimation method with bootstrap samples was used (Gilson et al., 2013). The chi-square and its degrees of freedom, the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA), the Standardized Root Mean Square Residual (SRMR), and the Akaike Information Criterion (AIC) were used as criteria. CFI values greater than 0.90, RMSEA values lower than 0.08, and SRMR values lower than 0.05 were indicative of good fit (Browne and Cudeck, 1992; Hoyle and Panter, 1995; Blunch, 2008; Kline, 2016). In model comparison, smaller AIC values were indicative of better fit (Tabachnick and Fidell, 2013).
To reduce the sensitivity of the chi-square test to sample size, the test value was transformed by dividing it by its degrees of freedom (Kline, 2005).
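As an illustration of how these decision criteria combine, the following Python sketch checks a set of fit indices against the cutoffs stated above, together with the 2.00 cutoff for the chi-square to degrees of freedom ratio applied in the Results. It does not reproduce the AMOS analysis: the function names are hypothetical, and the numeric inputs in the example are invented for illustration and are not the values reported in Table 1.

```python
# Cutoffs are those stated in the text; the numeric inputs below are
# hypothetical and are NOT the values reported in Table 1.

def chi2_df_ratio(chi2, df):
    """Normed chi-square: the chi-square statistic divided by its df."""
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return chi2 / df


def check_fit(chi2, df, cfi, rmsea, srmr):
    """Compare each index with the cutoff used in this study."""
    return {
        "chi2/df < 2.00": chi2_df_ratio(chi2, df) < 2.00,
        "CFI > 0.90": cfi > 0.90,
        "RMSEA < 0.08": rmsea < 0.08,
        "SRMR < 0.05": srmr < 0.05,
    }


def preferred_by_aic(aic_by_model):
    """Return the name of the model with the smallest AIC."""
    return min(aic_by_model, key=aic_by_model.get)


if __name__ == "__main__":
    # Hypothetical two-factor model output, for illustration only.
    print(check_fit(chi2=700.0, df=404, cfi=0.91, rmsea=0.06, srmr=0.049))
    print(preferred_by_aic({"one-factor": 1000.0, "two-factor": 900.0}))
```

Keeping each criterion as a separate entry makes it visible when a model meets some cutoffs but not others, which is the usual situation in applied work.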
The person separation reliability (PSR), the Kuder-Richardson Formula 20 (KR-20), and Cronbach's alpha (α) were computed to verify that reliability was satisfactory. All three coefficients are expressed on a scale ranging from 0 to 1. High reliability coefficients indicate low levels of measurement error; therefore, values closest to 1 are desirable. Reliability coefficients were computed using WINSTEPS software (Wright and Linacre, 1998).
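For readers unfamiliar with KR-20, the coefficient can be computed directly from a persons-by-items matrix of dichotomous (0/1) scores. The sketch below is a minimal, independent illustration in Python and makes no claim about how WINSTEPS computes it; the function name is hypothetical, and the use of the population variance for total scores is an assumption (some software uses the sample variance).

```python
from statistics import pvariance


def kr20(item_matrix):
    """KR-20 for a persons-by-items matrix of 0/1 scores.

    KR-20 = k / (k - 1) * (1 - sum(p_i * q_i) / var_total),
    where p_i is the proportion correct on item i and q_i = 1 - p_i.
    Population variance is used for the total scores (an assumption;
    some software uses the sample variance instead).
    """
    n_persons = len(item_matrix)
    k = len(item_matrix[0])
    if k < 2 or n_persons < 2:
        raise ValueError("need at least two items and two persons")
    totals = [sum(row) for row in item_matrix]
    var_total = pvariance(totals)
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    # Sum of item variances p * q across the k items.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n_persons
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / var_total)
```

For dichotomous items, this quantity coincides with Cronbach's alpha computed with the same variance convention.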
Analyses of validity evidence based on the relationship to other variables were performed using the Statistical Package for the Social Sciences (IBM SPSS, version 25) for Windows. In particular, evidence of validity based on relations to other variables (ALEPE) was examined through Pearson correlation coefficients. Correlations exceeding 0.10, 0.30, and 0.70 were considered low, moderate, and high, respectively (Field, 2009). To analyze the correlation between the results of the TRL and ALEPE, the assumption of normality in the distribution of interval variables was first verified. This exploratory analysis of the data revealed that the assumptions underlying the use of parametric tests were not met. However, since the results of the non-parametric tests point in the same direction as those of the parametric tests in this study, the latter are reported (Martins, 2011). Accordingly, the analysis was performed using Pearson's correlation coefficient.
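The interpretation rule above can be made concrete with a short Python sketch that computes Pearson's r and labels its magnitude with the stated cutoffs. The function names are hypothetical, the cutoffs are applied to the absolute value of r, and the "negligible" label for values of 0.10 or below is an assumption, since the text does not name that range.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


def magnitude(r):
    """Label |r| with the cutoffs 0.10, 0.30, and 0.70 used in the text."""
    size = abs(r)
    if size > 0.70:
        return "high"
    if size > 0.30:
        return "moderate"
    if size > 0.10:
        return "low"
    return "negligible"  # label assumed here; not named in the text
```

For example, `magnitude(0.50)` returns `"moderate"` and `magnitude(0.75)` returns `"high"`.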

RESULTS
Regarding the evidence based on internal structure, Table 1 presents goodness-of-fit statistics for the TRL. Chi-square values were statistically significant for both models (p < 0.001). Dividing the chi-square value by its degrees of freedom yields a value lower than 2.00 for the two-factor model, which indicates a good fit (Kline, 2005). For the one-factor model, however, the resulting value is 2.13, which does not indicate such a good fit (Kline, 2005).
Model 1, which assumes that all items load on a single general factor, does not present a better fit to the data than Model 2. Model 2, which assumes the existence of two scales (simple spelling and complex spelling), presents a better fit, with a CFI around 0.90, RMSEA close to 0.05, SRMR closer to 0.05, a chi-square to degrees of freedom ratio (χ²/df) lower than 2, and a smaller AIC than Model 1 (Kline, 2005; Brown, 2006; Byrne, 2011). The correlation between the two latent factors in the two-factor CFA model is 0.81. Thus, results from the confirmatory factor analyses suggested that Model 2 yielded the best fit to the data (Table 1).
The TRL presents satisfactory estimates of internal consistency reliability: PSR = 0.85, KR-20 = 0.95, and α = 0.95. As for convergent validity, there were positive, moderate to high, statistically significant correlations between the TRL and ALEPE (0.39 < r < 0.90). The TRL at 5 min showed positive, moderate, and statistically significant associations with the ALEPE subscales (0.50 < r < 0.60). The TRL without time limit showed positive, moderate to high, and statistically significant associations with the ALEPE subscales (0.43 < r < 0.75) (Table 2).
As can be seen in Table 2, there are statistically significant positive correlations between the TRL and the ALEPE subtests, either considering the results after 5 min, or considering the results in the application without time limit.

DISCUSSION
This study presents validity evidence based on internal structure and on the relationship to other variables. Overall, results support the psychometric quality of the TRL measure. This result represents an important contribution to the research on the factor structure of word reading tests, which has been limited. However, replication studies are needed to confirm the data presented in this study.
Regarding the internal structure, results suggest that a two-factor measurement model presents a good fit to the data, thus supporting H1 and H2. These findings support the dimensionality of decoding, as already described theoretically by, for example, Borleffs et al. (2019). In the first 20 items of the TRL, the stimuli are orthographically simple words, and to complete them the child needs to have developed alphabetic decoding skills. In the last 10 items, the stimuli have a complex spelling, and completing them involves orthographic decoding.
The correlational results support the convergent validity of the TRL. Hence, H3 is empirically supported. Moderate to strong correlations between the TRL scales and the ALEPE subscales were found. Specifically, the correlations between the TRL (with or without time limit) and performance on the word and pseudoword reading subtests were positive and of moderate magnitude. These results are consistent with the literature, emphasizing that good performance on the TRL relies on the same basic process that underlies good performance on the ALEPE reading subtests: the decoding process (Sucena and Castro, 2011). The correlation between the results on the TRL (with time limit) and RAN is positive and of moderate magnitude, once again in line with previous results described in the literature (e.g., Hulme and Snowling, 2013; Lyytinen et al., 2015; Lúcio et al., 2017).
The data from the present study provide satisfactory validity evidence for the test with first graders. The TRL allows fast identification of at-risk children who may need pedagogical adaptations and/or other intervention measures. The test thus enables early intervention, which helps to avoid jeopardizing future knowledge acquisition and the school trajectory (Viana, 2009; Santos et al., 2017; Zuilkowski et al., 2019). Further research to gather evidence based on the consequences of early testing may be useful to inform and motivate educators to adopt reading acquisition screening tools.
Despite the relevance of this study, two main limitations are worth mentioning: first, the sample size, and second, the absence of a multicultural sample and of different educational levels. Future studies might continue examining the factorial structure of the TRL with samples at different academic levels and with a larger number of participants. Hence, the psychometric properties of the TRL should continue to be analyzed, using both online and paper-based data collection procedures. This would not only help to increase the sample size, but also make it possible to test the factorial invariance of the TRL across online and paper-based applications. Further research could add evidence to establish the TRL as a brief screening test in the reading acquisition field. It would also be useful to examine the factorial invariance of the TRL over time. In this regard, the TRL could be used in longitudinal studies, following individuals from the onset to the end of primary school. Predictive validity evidence could be added in order to better understand reading and writing acquisition, with the TRL used as a measure to predict reading performance. It would also be of interest to test its factorial equivalence with bilingual students, which is the case for most immigrant children. The increasing heterogeneity of cultures in schools justifies the study of the TRL with students from different linguistic contexts. TRL results could also be used to predict academic scores in reading acquisition.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
Ethical approval was not required for the study in accordance with institutional requirements. Previous authorizations by the Portuguese Education Ministry were provided. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS
AFS collected the data, entered them into SPSS, and performed the literature review. CM analyzed the data and wrote the Method section. AS designed and supervised the study and critically revised the article. All authors contributed to the article and approved the submitted version.