Validity of social–emotional screening tool for newborns and infants: The effects of gender, ethnicity and age

The purpose of the present study was to test the measurement invariance of the baby pediatric symptom checklist (BPSC) across gender and age as a means to provide for valid comparisons in point estimates across groups. A secondary goal involved confirming the earlier identified factor structure and re-examining the presence of differentially item functioning in the BPSC across grouping variables. Participants were 601 children (322 boys and 279 girls) aged below 1 year (n = 294) and 1 year to 12 months (n = 307). Data were collected as part of the National Health Interview Survey in 2020. Results related to gender indicated satisfying all five levels of measurement and structural invariance. When testing measurement invariance across age groups, a few modifications were necessary. Specifically, partial metric invariance was satisfied after freeing slope estimates of items 1 and 2, partial scalar after freeing intercept terms of items 6 and 10, and partial residual invariance through freeing error terms of items 2 and 12. These non-invariant estimates, however, provide support for partial invariance and do not invalidate the comparison of latent means. The BPSC can be used for the valid assessment of social–emotional skills in infants up to the age of 18 months.


Introduction
Based on UNICEF, in 2020, 2.4 million newborns died in the first month of life around the globe. This estimate represents approximately 30 deaths per 1,000 live births. The number of newborn deaths has increased by 40% in 2020 compared to 1990. In the United States, the rate is approximately 5 deaths per 1,000 live births although this estimate varies widely by race and ethnicity. Benova et al. (2020) emphasized the need to identify behavioral and growth indicators that are valid predictors of a newborn's health as a means to strengthen the quality of measurable predictors and outcomes. Valid measurement of social emotional difficulties can lead to early intervention and treatment especially for disorders prevalent to infants such as internalizing and externalizing problems, such as temper tantrums. Interestingly, the validity of measurement is likely compromised across domains. For example, Williams et al. (2018) reported that broad developmental measures likely miss young children's social and emotional needs.
Social-emotional difficulties in infancy have had important consequences for future development and adaptation as well as success in school (Bolten, 2013). Problem areas have been predominantly identified in sleep (Mindell et al., 2017), feeding (Hemmi et al., 2011), and crying (Alvarez, 2004), all three identified as important predictors of subsequent behavioral problems (Bolten, 2013;Sidor et al., 2013). Behavioral problems have been associated with impaired parent-child relationships (Papousek and von Hofacker, 1998) and school problems (Hyde et al., 2012). Specifically, sleep problems have been predictive of poor social and emotional functioning and the presence of psychiatric problems (Berger et al., 2012;Winsper and Wolke, 2014;Sadeh et al., 2015). McIntosh et al. (2021) in a review of the literature pointed to the fact that most screening measures of socio-emotional screening targeted children 18 months and older pointing to the scarcity of measures for newborns. Further concerns about the measurement of social and emotional skills at a very young age pointed to the lack of rigorous psychometric data on the validity of the available measures (Pontoppidan et al. (2017). To this end, the Baby Pediatric Symptom Checklist (Sheldrick et al., 2013) was developed to screen for social and emotional symptoms in newborns and infants from birth to 18 months and has provided evidence in favor of construct and convergent validity, reliability, and bias (via Differential Item Functioning) across gender, parent education, and family income. Despite the rigorous methodology employed and the positive results in favor of the baby pediatric symptom checklist (BPSC), additional tests of validity are warranted before the measure can be utilized in tests of mean-level differences across gender, ethnic, and age groups within the range of 0-18 months.
There is ample evidence that gender differences are present in temperament during infancy with also large effect sizes (e.g., in effort control favoring girls, see Else-Quest et al., 2006). Some studies have reported that girls have higher levels in positive affect (LaFrance et al., 2003) and fear (Olino et al., 2013), which are also accompanied by lower activity levels compared to boys (Eaton and Enns, 1986), although meta-analytic findings concluded small overall effect sizes. Furthermore, differences in the variability in measuring temperament were reported with boys having more variable temperament estimates compared to girls (Else-Quest et al., 2006). This rather inconclusive evidence suggests the need to investigate such effects further, as there are concerns that they are linked to methodological shortcomings such as the source of data collection (parent report vs. observational studies), or across parent's gender (Seifer et al., 1994).
Hispanic and Latino groups are the largest and fastest-growing populations in the U.S. (Federal Interagency Forum on Child and Family Statistics, 2017) with also unique characteristics and challenges. For example, one-third (30%) of Latino children live in poverty, and more than 60% in households with low incomes (Murphey et al., 2014). Developmental findings have also indicated lower scores in Hispanic infants on both cognitive and social outcomes (Borkowski et al., 2002) and also salient differences in rearing practices, and socio-economic factors (Moore et al., 1997) suggesting they are 3-4 times at a greater risk for developmental delays, language and social-emotional difficulties (Jahromi et al., 2012).

Importance, novelty, and goals of the present study
The well-being of newborns and babies lies on the monitoring of specific crucial indicators. Tracking growth is then dependent upon the validity, importance, and relevance of these growth behaviors and indicators (Larson and Mercer, 2004). The necessity of identifying those indicators goes without saying as social and emotional problems have been strongly linked to adaptation and school success in later life (Kagan et al., 1999), the presence of psychopathological tendencies (Else-Quest et al., 2006), and adjustment in later life (Rothbart et al., 2001).
The purpose of the present study was to test the validity of the BPSC, a freely available measure for the assessment of social/ emotional well-being in children under the age of 18 months (see Supplementary material). Specifically, we sought to confirm the sample structure of the BPSC using a nationally representative U.S. sample. Furthermore, we attempt to test requisite steps before conducting mean level tests across gender and age groups, that is, the presence of measurement and structural invariance (Meredith, 1993), verify the reported simple structure (replicate the factor solution reported by Sheldrick et al., 2013), and examine additional indices of dimensionality and internal consistency reliability.

Participants and procedures
Data came from the National Health Interview Survey which provides detailed data on the health of the U.S. nation since 1957 via structured household interviews. The present paper utilized the data from 2020, which included revisions to some of the earlier surveys. There were 601 children (N boys = 322, N girls = 279) aged below 1 year (n = 294) or 1 year to 12 months (n = 307). This age classification comprised the grouping variable in our invariance modeling. Furthermore there were 142 Hispanic children (23.6%) and 459 non-Hispanic (76.4%).

Measures
The BPSC was developed as part of the Survey of Wellbeing of Young Children and in its final form included 12 items (Sheldrick et al., 2013) that are appropriate for the measurement of Frontiers in Psychology 03 frontiersin.org social-emotional skills in newborns and infants below the age of 18 months. The scaling system included 3 options from "not at all, " to "somewhat" to "very much. " In its final form, it included the measurement of 3 domains, namely, (a), inflexibility (b), irritability, and, (c) difficulty with routines, derived from the conceptual frameworks of Rothbart (1981). Internal consistency reliability was assessed using the omega coefficient (Raykov, 1997) which is most appropriate for non-tau-equivalent instruments.

Data analyses
Data were analyzed using structural equation modeling and specifically the confirmatory factor analysis framework with ordinal indicators. Furthermore, a measurement and structural invariance protocol was employed to verify the equivalence of form and function across gender and age groups as described below. First, the configural model was tested which evaluated the equivalence in form across gender and age groups. This was followed by the metric model in which the relationships between items and latent variables were tested. The third model, which represents a prerequisite before testing for mean differences was the scalar model specifying, in addition to the configural and metric models, the equivalence of intercept terms. Further constraints included "strict" invariance, that is, the equivalence of item residual variances, followed by tests of the equivalence of variances in the latent factors. All analyses were conducted using Mplus 8.8 (Muthen andMythen, 1998-2018). Models were nested as they included the same items and latent constructs but were differentiated from each other by specifying additional restrictions (e.g., equivalence of slopes, thresholds, etc.). Consequently, Chi-square difference tests were utilized when contrasting models. Table 1 displays model fit statistics of the BPSC via three competing models. Model 1 (M1) is a fixed slope 3-factor correlated model, resembling the Rasch-model idea of fixed discrimination parameters. Model 2 (M2) estimates 3-correlated factors with slopes freely estimated and Model 3 (M3) is a bifactor model with items loading on both a general factor and also specific factors. Using inferential statistical criteria results indicated that the free-slopes model (M2) was superior to the fixed slopes model using a loglikelihood difference test which is distributed as a Chi-square statistic [χ 2 (12) = 235.057, p < 0.001]. The more parameterized bifactor model was associated with a slightly better model fit, which by no means justified the estimation of the additional 9 parameters [χ 2 (9) = 11.133, p = 0.267]. Consequently, due to parsimony, the optimal model fit was attained using a three-factor correlated model (M2) as in the original Sheldrick et al. (2013) study. This model had all descriptive fit indices [i.e., comparative fit index (CFI), Tucker-Lewis index) over 0.95, and unstandardized residual values less than 5%, being indicative of "exact model fit. " Thus, the construct validity of the BPSC was replicated with the present data. Further tests of measurement and structural invariance utilized M2.

Measurement invariance across gender
Measurement and structural invariance across gender are shown in Table 2. Initially, a configural model was fit to the data, which showed an acceptable model fit in that the 3-factor correlated simple structure fit the data well for both boys and girls. Further testing involved more constrained models as a means to examine the decrease in fit due to non-invariance. The metric model specified the equivalence of factor loadings across gender with thresholds being freely estimated. Results indicated that constraining slopes to be equivalent in boys and girls was not associated with significant decrements in model fit [Difftest(9) = 14.202, p = 0.115]. The scalar model involved testing the equivalence of thresholds across boys and girls in addition to the equivalence of slopes. The scalar model was also supportive of these constraints by not being linked to a significantly worse model fit [Difftest(21) = 21.372, p = 0.436]. The results from the first three models pointed to the presence of full measurement invariance and justify the proper comparison of latent means across gender. The fourth layer of constraining involved the equivalence of residual estimates across groups, namely testing "strict" invariance. As shown in the table, constraining error terms to be equivalent across groups was again supported [Difftest(12) = 17.789, p = 0.122]. Consequently, the amount t of

Measurement invariance across ethnic groups
Measurement and structural invariance across Hispanic and Non-Hispanic groups are shown in Table 2

Measurement invariance across age
Tests of measurement and structural invariance were conducted across newborns up to 12 months old and between 1 year and 18 months old. The same analytical protocol described above was Frontiers in Psychology 05 frontiersin.org utilized here as well. As shown in Table 2, the configural model was again associated with excellent model fit (i.e., CFI = 0.979, RMSEA = 0.035). Constraining item slopes on the three domains were associated with significant decrements in model fit [Difftest(9) = 23.258, p = 0.006]. Using modification indices results indicated that the slopes of items 1 and 2 needed to be freed. This partial metric invariance model (M2b) was associated with non-significant decrements in comparison to the configural model [Difftest(7) = 12.747, p = 0.079]. Constraining the thresholds across age groups was associated with significantly lower model fit [Difftest(21) = 39.223, p = 0.009] pointing to the need to free the thresholds of items 6 (0 vs. 1 and 2) and item 10 (0 and 1 vs. 2). After allowing these parameters to vary across age groups (M3b)  Latent mean comparisons across gender, ethnic groups, and age Following satisfying scalar (for gender groups) or partial scalar invariance (for ethnic and age groups) comparisons between means can be conducted. These comparisons at the latent level involve fixing the latent factor means of the reference group to zero and freely estimating the latent means of the comparison group. Results when comparing boys with girls indicated no significant differences in the latent means of irritability Last, tests of latent mean differences in the three domains of the BPSC across age groups pointed to non-significant differences in irritability (M 12-18mos = −0.065, z = −0.557, p = 0.577), and difficulty with routines (M 12-18mos = −0.149, z = −1.274, p = 0.203). Differences were revealed in inflexibility, with infants (12 to 18 months old) having significantly higher means compared to newborns (M 12-18mos = 0.803, z = 6.456, p < 0.001, see Figure 1, lower panel).

Discussion
The present study evaluated the psychometric properties of the BPSC as a means to provide valid assessments of latent mean differences across gender and age groups. Results favored the original simple structure of Sheldrick et al. (2013) and verified the measurement and structural invariance of the 3-factor solution across gender and age groups. Furthermore, indices of internal Frontiers in Psychology 06 frontiersin.org consistency reliability were exceptional pointing to minimal amounts of measurement error. Thus, the three facets of the instrument can be used for the valid assessment of temperament in infancy (Buss and Plomin, 1975). Differences in latent means were revealed across gender with females having significantly higher difficulties in the inflexibility domain, albeit with a small effect size. This finding agrees with the results of Spinner et al. (2018) but disagrees with the earlier meta-analytic findings of Else-Quest et al. (2006) who reported no differences across gender in adaptability (d = −0.003, p = n.s.). Thus, the present findings add to the equivocal research findings from past studies. Another difference with the Else-Quest study was that the amount of variance of boys was significantly elevated compared to the respective estimate for girls in their meta-analytic study. In the present study, satisfying factor variance invariance (Model 5) suggested no differences in the variability of the latent construct of inflexibility/adaptability. However, one needs to weigh the magnitude of effect, which was small in the present study, as a function of relatively excessive levels of power, thus, there may not be a salient difference between the present and the Else-Quest et al. (2006) study's findings. For age, older infants had significantly elevated scores on inflexibility and this finding agrees with the study by Rothbart and Mauro (1990) who reported a correlation of 0.57 across scores on adaptability and an age grouping variable (6-month-old vs. 12-month-old). Furthermore, the lack of finding significant differences in temperament across ethnic groups disagrees with earlier findings pointing to delays in Hispanic toddlers in social and emotional development (Borkowski et al., 2002).
In the future additional evidence of the validity of the BPSC can be attempted through testing longitudinal invariance, which is very important as newborns and infants age. Also, the invariance across races and different SES groups may further increase our understanding of mean differences across populations. Further tests can engage advances in psychometrics such as approximate measurement invariance (AMI) and the alignment method (Muthen and Asparouhov, 2018) in case partial measurement invariance fails so that latent mean comparisons can still be conducted.

Data availability statement
Publicly available datasets were analyzed in this study. This data can be found at: https://www.cdc.gov/nchs/nhis/ 2020nhis.htm.

Author contributions
FA conceptualized the study and contributed to data analyses and the write-up of the manuscript. GA-K contributed to data analyses and the write-up of the quantitative sections. Both authors approved the final draft of the manuscript. All authors contributed to the article and approved the submitted version.