Is Parent–Child Disagreement on Child Anxiety Explained by Differences in Measurement Properties? An Examination of Measurement Invariance Across Informants and Time

There are numerous empirical studies demonstrating that agreement between parent-reports of youth and youth self-reports of internalizing behavior problems is modest at best. This has spurred much research on factors that influence the magnitude of associations between informants, including individual difference characteristics of the informants and contexts through which individuals interact with the child. There is also tremendous interest in understanding symptom trajectories longitudinally. However, each of these lines of work is predicated on the assumption that the psychometric construct being assessed from each informant and at each measurement occasion is the same. This study examined measurement invariance between maternal and child reports and longitudinally across ages 9 and 12 on five dimensions of anxiety using the Screen for Child Anxiety Related Emotional Disorders (SCARED; Birmaher et al., 1999). Few cross-informant models for anxiety dimensions achieved acceptable fit and at least partial metric and scalar invariance. Moreover, few longitudinal models demonstrated acceptable fit and at least partial metric and scalar invariance. Thus, using the SCARED as an example, these results show that inter-informant agreement may be compromised by different item functioning, and highlight the need for testing invariance before using measures for longitudinal tracking of symptoms.


INTRODUCTION
There has been extensive research on agreement and disagreement between raters of symptoms of behavior problems in children and adolescents. These studies have examined multiple constellations of raters, including parents of the same target child, a parental caregiver and teachers, and parents and their child. Overall, there is modest agreement between parents and children and between parents and teachers, but moderate agreement between parents (De Los Reyes et al., 2015). Attempts to understand factors that influence agreement between raters and also within raters over time have not provided complete explanations for lack of agreement. However, there have been no studies that test whether the underlying constructs reported by different informants, particularly primary caregivers and their children, are equivalent. There are few studies examining parallel issues over time. Without such evidence, it is difficult to interpret associations across informants as reflecting agreement on the same construct or to evaluate longitudinal changes in the constructs. Thus, the present study examines whether measurement differences are present between parent- and child self-reports of anxiety that may partially explain lack of agreement across raters and across development.
The overall pattern of inter-informant agreement on child mental health symptoms has been extensively examined and summarized in two meta-analyses spanning a 28-year period. In the first, Achenbach et al. (1987) examined the associations between youth, parent, and teacher reports of internalizing and externalizing problems. In their work, there was stronger agreement among individuals with the same relationship to the target child (e.g., inter-parental agreement, average r = 0.61 across informant types), but more modest associations across different informant types (average r = 0.29 across all informants). Inter-informant agreement for overcontrolled and undercontrolled behavior problems (similar to internalizing and externalizing problems, respectively) was in the small-to-moderate range (rs = 0.32 and 0.41, respectively). More recently, De Los Reyes et al. (2015) conducted an updated analysis of studies since the Achenbach et al. (1987) paper. In this work, the authors found that the magnitude of interparental agreement (mean r = 0.59) was similar to that of other informant pairs with the same relationship to the target (i.e., teachers, mental health workers; average r = 0.58). However, agreement between raters with different relationships to the target was markedly lower (average r = 0.29). Overall inter-informant agreement was modest for both internalizing (r = 0.25) and externalizing problems (r = 0.30). The convergent findings from the two meta-analyses indicate that individuals with greater similarity in information will have a higher degree of similarity in their ratings of behavior. This has served as the foundation for the Operations Triad Model (De Los Reyes et al., 2013), which emphasizes context as an important factor in understanding reports of child behavior problems and assessing the incremental value of information from disparate sources.
Numerous studies have examined factors that explain the modest levels of convergence between informants on youth internalizing and externalizing behavior problems. These studies have considered moderating factors such as parent-child relationship functioning (Treutler and Epkins, 2003), parent symptoms (Youngstrom et al., 2000; Treutler and Epkins, 2003; Rothen et al., 2009), parental stress (Youngstrom et al., 2000; Langberg et al., 2010), child race (Youngstrom et al., 2000), child sex (Rothen et al., 2009), and characteristics of the symptoms themselves (e.g., observability, salience; Frank et al., 2000; Karver, 2006). However, these findings lack coherence and are sparsely replicated across samples.
There have been numerous studies examining the developmental course of anxiety disorders and symptoms, with studies focusing on different age spans (Feng et al., 2008; Van Oort et al., 2009; Olino et al., 2010b, 2014). These studies have focused on risk factors predicting course as well as course predicting outcomes. However, there has been a paucity of attention to longitudinal measurement invariance (MI) for youth anxiety. This precludes understanding whether observed mean-level changes reflect true score changes or are influenced by changes in measurement properties. In one study (Mathyssek et al., 2013), the authors found evidence supporting MI for individual dimensions of anxiety from the Revised Child Anxiety and Depression Scale (RCADS; Chorpita et al., 2000). However, this study examined this issue using only youth-reports for a single assessment measure. Thus, comparisons between youth and parent reports across time are novel.
A key challenge in examining inter-informant agreement and assessing stability over time concerns the psychometric functioning of the measures used to assess the constructs. De Los Reyes et al. (2015) identified several sources of measurement error that may lead to attenuation of associations. Some of these are factors such as parental psychopathology or personality that may lead to distorted reports of youth behavior (Kagan, 1997; Najman et al., 2001; Hayden et al., 2010). Random error, such as imperfect test-retest reliability, could also limit the magnitude of associations across raters. Finally, the authors identify systematic error across informants as a potential explanation for the limited inter-informant associations.
Systematic error in ratings can come from several sources. De Los Reyes et al. (2015) focus on studies demonstrating differences in item response scaling as a possible, but unlikely, contributor to low inter-informant agreement. However, there are additional considerations that have not yet been explored in this area. For example, systematic error may be introduced because the constructs that individual informants are reporting on have different psychometric properties. Estimation of reliability is frequently indexed by Cronbach's alpha (Cronbach, 1951). However, alpha is more correctly interpreted as a measure of internal consistency (Sijtsma, 2009). It does not provide information about the specific measurement structure of the items comprising a test/scale.
To evaluate this possibility, more sophisticated analytic tools are necessary. For example, confirmatory factor analysis (CFA) can evaluate measurement properties such as how items relate to constructs. Extensions of CFA have been developed to test whether measurement properties of constructs are consistent across informants (Olino and Klein, 2015) and assessment waves (Widaman et al., 2010). These methods have been termed measurement invariance (MI; Meredith, 1993).
There are multiple levels of MI that reflect increasingly strict model properties and address different psychometric questions (Widaman et al., 2010; Millsap, 2011). A fundamental requirement is that the same items are associated with the same construct across units (e.g., informants and time). Simply stated, do the same items load on the same factors when assessed in the different units? This is referred to as configural invariance. If the items assessing what are purportedly the same constructs differ across groups, the items have different meanings within each group. Next, it is important that the magnitude of the associations between the items and the underlying construct is the same across groups (i.e., are the factor loadings for each factor comparable when assessed within the different groups?). This is referred to as metric invariance. Finally, the probability of item endorsement at a given level of the underlying construct should be the same across groups (Reise et al., 1993; Vandenberg and Lance, 2000). This is referred to as scalar invariance. When configural, metric, and scalar invariance are established for a particular measure across groups, scale scores can be considered to reflect the same psychometric quantities among the groups. Thus, it is critical to evaluate whether lack of MI is contributing to reduced associations between parents and children. However, complete MI imposes highly rigorous assumptions (i.e., equality of all factor loadings and item thresholds across informants). Consequently, there has been increasing attention to the presence of partial MI that specifies invariance on parameters for some, but not all, items (Byrne et al., 1989). This approach has gained prominence and has permitted meaningful comparisons when full MI fails (Steinmetz, 2013).
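As a rough numerical illustration of why unequal thresholds (scalar non-invariance) matter, the sketch below assumes a probit item response model with unit residual variance, in which the probability of endorsing a binary item at latent level eta is Phi(loading * eta - threshold). The function names and all parameter values are hypothetical and are not estimates from this study.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def endorsement_probability(eta: float, loading: float, threshold: float) -> float:
    """P(item endorsed | latent level eta) under an assumed probit model
    with unit residual variance: Phi(loading * eta - threshold)."""
    return normal_cdf(loading * eta - threshold)


# Hypothetical parameters: equal loadings (metric invariance holds),
# but a higher threshold for the mother-report item (scalar invariance fails).
child_p = endorsement_probability(eta=0.0, loading=0.8, threshold=0.5)
mother_p = endorsement_probability(eta=0.0, loading=0.8, threshold=1.2)

print(f"child-report endorsement probability:  {child_p:.3f}")
print(f"mother-report endorsement probability: {mother_p:.3f}")
```

Under these assumed values, the same latent anxiety level yields a lower endorsement probability for the informant with the higher threshold, so raw scale scores would differ even though the underlying level is identical.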
In the present study, we examine MI across maternal- and child-reports of youth anxiety symptoms when children are ages 9 and 12. Thus, we are able to describe differences in MI across this 3-year developmental span. We also present analyses examining MI across time for maternal- and child-reports separately.
In light of the consistently modest agreement between maternal and child reports of symptomatology, we expect to find a lack of MI across informants at both assessment waves. We do not posit whether this is due to differences in factor loadings or thresholds. However, we expect there to be stronger support for MI across time within informants as there is evidence for longitudinal stability of youth anxiety (Prenoveau et al., 2011). In instances when full MI fails, we examine partial MI that permits some flexibility in the models.

Participants and Procedure
Participants were from a larger sample of 559 children and their families living in a suburban community who were participating in the Stony Brook Temperament Study, a longitudinal study of temperament and psychopathology, which began when children were 3 years old (Olino et al., 2010a). Potential participants were identified using a commercial mailing list and screened by telephone. Families with a 3-year-old child who lived with an English-speaking biological parent within 20 contiguous miles of Stony Brook, New York and did not have significant medical conditions or developmental disabilities were included. Of the 815 identified eligible families, 68.5% entered the study. No significant differences were found between families who did and did not participate on child sex and race/ethnicity, and parental marital status and education. Informed and written consent was obtained from the parent prior to participation. The study was approved by the institutional review board at Stony Brook University, and families were compensated for their participation. At the second wave of the study, 3 years later, 50 additional minority families were recruited to increase racial/ethnic diversity (total N = 609; Bufferd et al., 2012).
At the age 9 visit, 487 mothers (80.0%) and 481 youth (79.0%) completed the measures of youth anxiety symptoms used in this study; a mother or child from 492 families (80.8%) participated. At the age 12 visit, 468 mothers (76.8%) and 470 youth (77.2%) completed these measures; a mother or child from 479 families (78.7%) participated. The mean age of the children was 9.18 years (SD = 0.40) at the 9-year assessment and 12.66 (SD = 0.46) at the 12-year assessment. Approximately half the children were female (9-year visit: 226, 45.9%; 12-year visit: 225, 47.0%) and the majority were White/non-Hispanic (9-year visit: 390, 79.3%; 12-year visit: 381, 79.5%). At the time of the 12-year visit, most mothers were married (373, 77.9%) and approximately half had graduated from college (279; 58.2%), and the median income bracket was $100,000-$119,999. Youth who participated at age 9 did not differ from those participating at age 3 on child sex, race, or total or externalizing behavior problems, as assessed by maternal reports on the Child Behavior Checklist (Achenbach and Rescorla, 2001; all ps > 0.05). However, youth who did not continue with the study at age 9 had higher levels of internalizing problems at age 3 than those who continued with the study, though the effect is small [t(547) = 4.69, P < 0.05, d = 0.09].

Measures
Children and their parents completed the 41-item youth self-report and parent-report versions, respectively, of the Screen for Child Anxiety Related Emotional Disorders (SCARED; Birmaher et al., 1997, 1999). Children and their parents are asked to rate the presence of anxiety symptoms in the child over the past 3 months on a three-point scale (0 = not true or hardly ever true; 1 = somewhat true or sometimes true; 2 = very true or often true). The SCARED is made up of five factor-analytically derived subscales: panic/somatic, general anxiety, separation anxiety, social phobia, and school phobia. These subscales reflect anxiety disorder symptoms as conceptualized in the DSM-IV-TR. Each factor has been shown to have good internal consistency and test-retest reliability (range of α: 0.78-0.87; Birmaher et al., 1999; intraclass correlation across time for each scale ranged from 0.70 to 0.80; Birmaher et al., 1997).

Statistical Analyses
In line with a model building approach and to identify whether one-factor models were appropriate for testing, we estimated a series of initial single-factor CFAs separately for youth self- and parent-reports at the ages 9 and 12 waves. Items from the panic/somatic, general anxiety, and social phobia subscales were included in models reflecting each of these constructs, respectively. Next, models were fit sequentially to evaluate MI, and we continued testing for MI only when there was evidence that a one-factor model for each was an acceptable fit to the data. We followed the same logical progression of testing MI across informants as is used in examinations of longitudinal invariance (Widaman et al., 2010) with minor modifications. We tested first for configural invariance (schematic models for configural invariance models are displayed in Figure 1), or whether the pattern of significant (i.e., non-zero) factor loadings is similar across youth and parent-reports. We estimated models for each of the subscales including a single factor for youth-reports and a single factor for maternal-reports simultaneously while permitting the factors to be correlated. These models were specified by freely estimating all factor loadings and fixing the latent variable variances at 1 for purposes of model identification. Next, we tested for metric invariance, or whether factor loadings for each item are equal across informants. In these models, we freely estimated the variance of the maternal-report latent factor, as fixing factor loadings to be equal across informants permits this constraint to be relaxed for one informant. Finally, we tested for scalar invariance, or whether the probability of item endorsement is similar across informants, by constraining the thresholds across informants to be equal. In these models, we freely estimated the mean of the maternal-report latent factor, as fixing thresholds to be equal across informants permits this constraint to be relaxed for one informant.
If all three types of invariance hold, this indicates that the scales measure the same constructs across reporters on the same scale. Thus, differences in mean trait levels can be interpreted as true score differences, as opposed to differences in measurement.
For models that did not achieve full MI, we tested partial MI, which identifies whether some, but not all, items are invariant across informants and/or time. We examined the presence of comparable factor loadings by using the MODEL CONSTRAINT command in Mplus to test differences between loadings in the configural invariance model. When factor loadings were identified that did not significantly differ at P < 0.05, a partial metric invariant model was estimated that included equality constraints on those factor loadings. In this partial metric invariance model, we again used the MODEL CONSTRAINT command, which tests whether specified parameters differ significantly, to examine the presence of comparable item thresholds. When item thresholds were identified that did not significantly differ at P < 0.05, a partial scalar invariant model was estimated that included equality constraints on those item thresholds.
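The selection step for partial invariance can be sketched as a Wald-type z-test on the difference between two parameter estimates. This is only an illustration of the logic, not the Mplus implementation: the function names, item labels, and all estimates and standard errors below are hypothetical, and the covariance between the two estimates is assumed to be zero for simplicity (estimates from the same model are generally correlated, and software would use the full parameter covariance matrix).

```python
from math import erf, sqrt


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    upper_tail = 0.5 * (1.0 - erf(abs(z) / sqrt(2.0)))
    return 2.0 * upper_tail


def difference_test(est_a, se_a, est_b, se_b, cov_ab=0.0):
    """Wald-type test of est_a - est_b; cov_ab = 0 is a simplifying assumption."""
    diff = est_a - est_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2 - 2.0 * cov_ab)
    z = diff / se_diff
    return diff, z, two_sided_p(z)


# Hypothetical loadings: item -> (child estimate, child SE, mother estimate, mother SE)
loadings = {
    "item_1": (0.70, 0.08, 0.75, 0.07),
    "item_2": (0.40, 0.09, 0.85, 0.06),
}

# Items whose loadings do not differ at P < 0.05 receive equality constraints
# in the partial invariance model; the remaining items are left free.
constrain_equal = []
for item, (child_est, child_se, mother_est, mother_se) in loadings.items():
    _, z, p = difference_test(child_est, child_se, mother_est, mother_se)
    if p >= 0.05:
        constrain_equal.append(item)

print(constrain_equal)
```

The same comparison would then be repeated for item thresholds within the partial metric model.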
All models were estimated in Mplus version 8 (Muthén and Muthén, 1998-2017) using the mean- and variance-adjusted weighted least squares estimator (WLSMV; Flora and Curran, 2004), which is a robust estimator suited for modeling binary data. There were low rates of responses in the highest response category (i.e., "very true or often true") on many items. Specifically, for 34 (82.9%) items at both ages 9 and 12, 5% or fewer of parents endorsed the highest category. Similarly, for 7 (17.1%) items at age 9, and 23 items (56.1%) at age 12, 5% or fewer of children endorsed the most severe response option. Consequently, the top two item response categories were collapsed, making all items binary. We evaluated models on two goodness-of-fit indices. Specifically, we used the comparative fit index (CFI; Bentler, 1990) and the root mean square error of approximation (RMSEA; Steiger, 1990). Although cut-offs are somewhat arbitrary (Marsh et al., 2004), current conventions suggest that excellent model fit is indicated by CFI values ≥ 0.95 (Hu and Bentler, 1999) and RMSEA values ≤ 0.05 (MacCallum et al., 2006); good fit is indicated by a CFI greater than 0.90 and an RMSEA between 0.05 and 0.10.
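The recoding and the fit conventions described above can be written out as a minimal sketch. The function names are hypothetical, and the rule for combining the two indices (both must meet a level's cut-off for the model to be assigned that level) is an assumption consistent with, but not stated in, the text.

```python
def collapse_to_binary(response: int) -> int:
    """Collapse the three-point rating (0, 1, 2) so that 1 and 2 both become 1."""
    if response not in (0, 1, 2):
        raise ValueError("response must be 0, 1, or 2")
    return 1 if response >= 1 else 0


def classify_fit(cfi: float, rmsea: float) -> str:
    """Label model fit using the CFI and RMSEA conventions cited in the text.

    Assumption: a model must satisfy both indices to earn a given label.
    """
    if cfi >= 0.95 and rmsea <= 0.05:
        return "excellent"
    if cfi > 0.90 and rmsea <= 0.10:
        return "good"
    return "less than adequate"


# Hypothetical fit values, not results from this study.
print([collapse_to_binary(r) for r in (0, 1, 2)])
print(classify_fit(cfi=0.97, rmsea=0.04))
print(classify_fit(cfi=0.93, rmsea=0.08))
print(classify_fit(cfi=0.96, rmsea=0.12))
```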
We estimated configural (similar pattern of factor loadings across groups), metric (equality of factor loadings across groups), and scalar (equality of thresholds across groups) invariance models for comparisons between maternal- and child-reports. In addition to testing MI across informants, we also tested the same sequence of models for evaluating longitudinal MI in each informant, separately. Model fit comparisons were evaluated by investigating change in both CFI and RMSEA using Chen's (2007) guidelines. Chen (2007) recommended interpreting reductions in CFI of 0.01 or more and increases in RMSEA of 0.015 or more as indicating non-invariance (i.e., failure to demonstrate MI). When the RMSEA and CFI changes led to different conclusions, we relied on the more conservative index to inform interpretations.
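The model comparison rule can be sketched as follows. Because RMSEA worsens as it rises, the sketch treats an increase in RMSEA as the signal of worse fit. The function name and the example fit values are hypothetical, and reading "the more conservative index" as "flag non-invariance when either index crosses its cut-off" is an assumption, since the text does not define it further.

```python
def flags_noninvariance(cfi_less_constrained: float,
                        cfi_more_constrained: float,
                        rmsea_less_constrained: float,
                        rmsea_more_constrained: float,
                        tol: float = 1e-9) -> bool:
    """Apply the change-in-fit cut-offs attributed to Chen (2007) in the text.

    A drop in CFI of at least 0.01 or a rise in RMSEA of at least 0.015,
    moving from the less to the more constrained model, is treated as
    evidence of non-invariance. `tol` guards against floating-point error.
    """
    delta_cfi = cfi_more_constrained - cfi_less_constrained        # negative = worse fit
    delta_rmsea = rmsea_more_constrained - rmsea_less_constrained  # positive = worse fit
    cfi_flag = delta_cfi <= -0.01 + tol
    rmsea_flag = delta_rmsea >= 0.015 - tol
    # Assumption: either flag is enough to conclude non-invariance.
    return cfi_flag or rmsea_flag


# Hypothetical comparisons of a metric model against a configural model.
print(flags_noninvariance(0.960, 0.955, 0.040, 0.045))  # small changes on both indices
print(flags_noninvariance(0.960, 0.940, 0.040, 0.045))  # large CFI drop
```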

Measurement Models for Informant and Age
Initial models estimated one-factor models for each of the SCARED subscales for child self- and maternal-reports at ages 9 and 12. These models were estimated to identify scales that fit the data well enough to pursue tests of MI. Table 1 displays overall fit for each of the models tested. For age 9 data, one-factor models demonstrated excellent fit for child-reported generalized anxiety disorder (GAD), panic, and social phobia and demonstrated a good fit for maternal-reported GAD, panic, and separation anxiety. For age 12 data, one-factor models demonstrated excellent fit for child-reported panic and good fit for GAD and social phobia, and demonstrated excellent fit for maternal-reported panic and good fit for GAD, separation anxiety, and social phobia. One-factor models for child-reported separation anxiety were poor fits to the data at each time point. Model fit for school avoidance was also less than adequate. For child reports at age 12 and mother reports at age 9, the CFI was acceptable, but the RMSEA was greater than 0.10. In addition, the model for maternal-report of school avoidance at age 12 failed to provide an admissible solution. Owing to the brevity of the school phobia scale, the school avoidance models included only four observed indicators, which may have led to model instability.
As child-report separation anxiety provided poor fit to the data at ages 9 and 12, we did not assess MI for the youth reports on this subscale. However, as maternal reports of separation anxiety demonstrated good fit, we examined longitudinal invariance for mothers' reports on this subscale. Due to the problematic fit of the school avoidance models, we did not conduct any MI analyses on this subscale. All model parameters are available in the Supplementary Materials.

Tests of MI: Child- and Maternal-Reports at Age 9
The configural invariance model for GAD across youth self- and maternal-reports was a good fit to the data (Table 2). Likewise, the fit of the metric invariance model was good, and imposing constraints on the factor loadings did not markedly diminish model fit. However, when imposing constraints on the item thresholds across informants, model fit diminished substantially. Comparisons identified three item thresholds that did not significantly differ across informants. Estimating a partial scalar invariant model that constrained those three item thresholds to equality yielded good model fit. Thus, this model supports partial scalar MI.
The configural invariance models for panic disorder across youth self- and maternal-reports were a poor fit to the data. Thus, further tests of metric and scalar invariance were not pursued.
The fit for the configural invariance model for social phobia across youth self- and maternal-reports was good. Likewise, the metric invariance model was a good fit to the data, and imposing constraints on the factor loadings did not markedly diminish model fit. Similarly, imposing constraints on the item thresholds across informants did not substantially diminish model fit, supporting full-scalar MI.

Tests of MI: Child- and Maternal-Reports at Age 12
The configural invariance model for GAD across youth self- and maternal-reports at age 12 was a good fit to the data (Table 3). Likewise, the fit of the metric invariance model was good, and imposing constraints on the factor loadings did not markedly diminish model fit. However, when imposing constraints on the item thresholds across informants, model fit diminished substantially, failing to support scalar invariance. Comparisons identified only one item threshold that did not significantly differ across informants. Thus, this model also failed to support partial scalar MI.
The configural invariance model for panic disorder demonstrated adequate fit. Including constraints on factor loadings across informants to test metric invariance yielded a model with an adequate fit to the data that did not markedly differ from the configural invariance model. However, when including constraints on item thresholds to test for scalar invariance, model fit was poor and was reduced relative to the metric invariance model. Moreover, all item thresholds significantly differed across informants; hence, there was no basis for evaluating partial scalar invariance.
The configural invariance model for social phobia across youth self- and maternal-reports was a good fit to the data. Likewise, the fit of the metric invariance model was good, and imposing constraints on the factor loadings did not markedly diminish model fit. Finally, after imposing constraints on the item thresholds across informants, model fit was not substantially diminished. Thus, this model supports full-scalar MI.

Tests of MI: Child-Reports Across Ages 9 and 12
The fit for the configural invariance model for GAD for youth self-reports across ages 9 and 12 was excellent (Table 4). Likewise, the metric invariance model was an excellent fit to the data, as imposing constraints on the factor loadings did not markedly diminish model fit. When imposing constraints on the item thresholds across time to test scalar invariance, overall model fit was still good; however, model fit was diminished relative to the metric invariance model. Comparisons identified only two item thresholds that did not significantly differ across time. This partial scalar invariance model yielded excellent model fit. However, with only two invariant item thresholds, this model failed to sufficiently support partial scalar MI.
The fit for the configural invariance model for panic disorder for youth self-reports across ages 9 and 12 was excellent (Table 4). The metric invariance model was also an excellent fit to the data. However, relative to the configural invariance model, there was a substantial reduction in model fit as indexed by the CFI and a more modest reduction in fit according to the RMSEA. Comparisons identified three factor loadings that differed across age. The partial metric invariance model was an excellent fit to the data. As only partial metric invariance was supported, when estimating scalar invariance, thresholds for items that did not evince equal factor loadings across time were freely estimated. After imposing constraints on the other item thresholds across time, overall model fit was still good; however, model fit was diminished relative to the partial metric invariance model. Comparisons identified four item thresholds that did not significantly differ across time. This partial scalar invariance model yielded excellent model fit.
The configural invariance model for social phobia for youth self-reports across ages 9 and 12 was an excellent fit to the data. The fit of the metric invariance model was also good. However, there was a substantial reduction in model fit as indexed by the CFI, and a modest reduction in fit according to the RMSEA. Comparisons identified three factor loadings that did not statistically differ across age. The partial metric invariance model was an excellent fit to the data. As only partial metric invariance was supported, when estimating scalar invariance, item thresholds for items that did not evince equal factor loadings across time were freely estimated. Three item thresholds were constrained

TABLE 1 | Initial model fit for child self- and maternal-report of SCARED subscales at ages 9 and 12.
GAD, generalized anxiety disorder symptoms; panic, panic disorder symptoms; school, school phobia symptoms; separation anxiety, separation anxiety disorder symptoms; and social anxiety, social anxiety symptoms.

Table note: GAD, generalized anxiety disorder symptoms; panic, panic disorder symptoms; and social anxiety, social anxiety symptoms. Changes in CFI and RMSEA are calculated as differences between the metric invariance model relative to the configural invariance model and between the scalar invariance model relative to the metric invariance model. (a) In this model, three of nine threshold parameters were constrained to be equal. This model is compared with the full-metric model.

Tests of MI: Maternal-Reports Across Ages 9 and 12
The configural invariance model for GAD for mother-reports across ages 9 and 12 was an excellent fit to the data (Table 5). The fit of the metric invariance model was good, and imposing constraints on the factor loadings did not markedly diminish model fit, supporting metric invariance. After imposing constraints on the item thresholds across time, overall model fit was still good and showed a minor reduction in model fit as indexed by the CFI and a trivial reduction according to the RMSEA. Thus, scalar MI was supported.
The configural invariance model for panic disorder was an adequate fit to the data. However, there were problems in estimating the metric and scalar invariance models due to low endorsement rates of item response options across multiple items (i.e., empty cells in bivariate distributions). Thus, those models could not be adequately tested.
The fit of the configural invariance model for separation anxiety was good. The metric invariance model marginally reduced model fit, but it was enough to result in a less than adequate fit to the data. Comparisons of factor loadings identified one parameter that statistically differed across time. Model fit for the partial metric invariance model was good, supporting partial metric invariance. After adding constraints on item thresholds across time, model fit was reduced and demonstrated a poor fit to the data. Comparisons of item thresholds revealed that all parameters differed across time. Thus, there was no support for partial scalar invariance.
The fit for the configural invariance model for social phobia for maternal-reports across ages 9 and 12 was excellent. The metric invariance model was also an excellent fit to the data. However, there was a reduction in model fit as indexed by the CFI and the RMSEA. Comparisons of factor loadings identified six (of seven) factor loadings that did not statistically differ across age. Fit for the partial metric invariance model was excellent, supporting partial metric invariance. As only partial metric invariance was supported, when estimating scalar invariance, the item threshold for the item that did not evince equal factor loadings across time was freely estimated. After imposing constraints on the item thresholds across time to test for scalar invariance, overall model fit was excellent and the model did not demonstrate a substantial reduction in fit relative to the partial metric invariant model, supporting scalar invariance.

Table note: GAD, generalized anxiety disorder symptoms; panic, panic disorder symptoms; separation anxiety, separation anxiety disorder symptoms; social anxiety, social anxiety symptoms. Changes in CFI and RMSEA are calculated as differences between the metric invariance model relative to the configural invariance model and between the scalar invariance model relative to the metric invariance model. The model with best statistical fit is highlighted in bold. (a) In this model, seven of eight factor loading parameters were constrained to be equal. This model is compared to the configural invariance model. (b) In this model, 3 of 13 threshold parameters were freely estimated across time. This model is compared to the partial metric model. (c) In this model, six of seven factor loading parameters were constrained to be equal. This model is compared to the configural invariance model. (d) In this model, one of seven threshold parameters was freely estimated across time. This model is compared to the partial metric model.

DISCUSSION
There has been much previous work examining factors and contexts that influence correspondence between parents' and their children's reports of psychopathology (Achenbach et al., 1987; De Los Reyes et al., 2015). However, there has been much less research examining measurement properties between informants that could influence the comparability of reports of youth behavior. Similarly, there has been little attention to examining MI across time, which is critical to understanding whether mean-level changes across time are contaminated by changes in measurement properties of items (Widaman et al., 2010). In the present study, we used the subscales from the SCARED to examine overall fit of each anxiety construct in each informant and at each assessment. Then we examined MI between mothers and their children at ages 9 and 12. Finally, we examined invariance for each rater from middle childhood to early adolescence. Overall, full MI was supported between children and their mothers for social anxiety at both ages 9 and 12, but not for any other SCARED subscale. We found support for partial scalar invariance across mothers and children at age 9 for GAD. Longitudinally, full-scalar invariance was found for maternal reports of GAD over time and partial scalar invariance was supported for child reported panic and social anxiety and for maternal reported separation anxiety across the two waves.
Thus, we found support for full-scalar invariance across informants for only one SCARED subscale: social anxiety. This indicates that direct comparisons of mean levels of child and maternal reported anxiety symptoms are valid only for this scale of the SCARED.
To demonstrate "strong enough" measurement properties, there has to be consistent evidence supporting at least partial metric invariance across informants at both ages 9 and 12 (Marsh and Grayson, 1994). This indicates that a subset of items reflects the same target latent construct across mothers and their children. Thus, the construct reported on by each informant is conceptually similar in form and reflects rank-order associations among like-constructs. This suggests that, for the scales demonstrating at least partial metric invariance, inter-informant associations are meaningful. This condition was satisfied by the GAD scale at both ages 9 and 12. However, the lack of scalar invariance precludes comparing mean levels of generalized anxiety across informants (Millsap, 2011).
Panic, school avoidance, and separation anxiety showed the least evidence for MI. Although the panic symptom models demonstrated good fit to the data in our four preliminary models (i.e., separate informant and assessment; Table 1), tests of configural invariance across informant yielded poor fit to the data at age 9 and marginal fit to the data at age 12. Moreover, the fit of configural invariance models for school avoidance and separation anxiety was poor. Fit of these models may have been impacted by the developmental level of the children in the study. School avoidance and separation anxiety are typically observed at higher levels earlier in development. Thus, the coherence of the items in later childhood may be poorer than earlier in development (Hayward et al., 2000; Mathyssek et al., 2012). Moreover, incidence of panic continues to rise through adolescence (Beesdo et al., 2009) and item functioning may continue to change.
Examining the pattern of differences in factor loadings and thresholds between child and maternal reports, there is a consistent pattern of maternal reports having larger factor loadings and thresholds. Stronger factor loadings for maternal scores suggest that their ratings have greater precision and are better at discriminating between children with high and low levels of anxiety. Higher item thresholds for maternal than for child-reported items suggest that symptoms need to be more severe for mothers than for children to rate them as present. Taken together, these findings pose significant challenges to comparing levels of anxiety across mothers and youth. With only a few exceptions, these results argue against direct comparisons of mothers' and youth's anxiety ratings.
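To make the interpretation of loadings and thresholds concrete, the following minimal sketch computes the probability of endorsing a single binary item under a probit (categorical factor-analytic) response model. It is an illustration under stated assumptions, not the model estimated in this study: the latent factor is assumed standardized, the residual variance is assumed to be one minus the squared loading, and every loading, threshold, and function name is hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def p_endorse(eta: float, loading: float, threshold: float) -> float:
    """P(item endorsed | latent anxiety eta) for a binary probit item.

    Assumes a standardized metric, so the residual SD is sqrt(1 - loading**2).
    """
    resid_sd = sqrt(1.0 - loading ** 2)
    return normal_cdf((loading * eta - threshold) / resid_sd)


# Discrimination: with the threshold held at 0, a larger loading separates
# children one SD above and one SD below the mean more sharply.
spread_high_loading = p_endorse(1.0, 0.8, 0.0) - p_endorse(-1.0, 0.8, 0.0)
spread_low_loading = p_endorse(1.0, 0.5, 0.0) - p_endorse(-1.0, 0.5, 0.0)

# Severity: with the loading held at 0.7, a higher threshold means a child at
# the same latent level (eta = 0) is less likely to be rated as symptomatic.
p_high_threshold = p_endorse(0.0, 0.7, 1.0)
p_low_threshold = p_endorse(0.0, 0.7, 0.5)

print(round(spread_high_loading, 3), round(spread_low_loading, 3))
print(round(p_high_threshold, 3), round(p_low_threshold, 3))
```

Under these assumptions, two informants rating a child with the same latent anxiety level can produce different observed endorsement rates purely because of item parameters, which is why unequal loadings and thresholds complicate direct comparisons of raw scores.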
Our models testing longitudinal invariance demonstrated greater, albeit modest, support for MI over time for each informant taken separately. Maternal reports of youth GAD achieved full-scalar invariance, suggesting that scores from this scale are comparable from middle childhood to early adolescence. Child-reports of panic and social anxiety and maternal-reports of separation anxiety demonstrated a good fit to the data and partial scalar invariance. For these scales, there were some items that demonstrated invariance across time, permitting longitudinal comparisons of latent mean-level differences on the full set of items or examining mean-level differences on the subset of items. These comparisons should reflect true changes in the constructs, rather than being conflated with changes in item properties. Child-report of GAD and maternal-report of social anxiety each had a small number of items with invariant factor loadings and thresholds. These results raise concern about relying on this set of items/scales to assess developmental changes on dimensions of anxiety symptoms, particularly when relying on child self-reports, and provide little basis for combining these ratings. However, our findings raise the question of whether these subscales evidence MI over shorter periods of time and from pre- to post-test in evaluations of interventions. If psychometric functioning is changing over time, it may not be possible to distinguish intervention effects from measurement changes.
In our work, we focused on the primary, lower-order scales that demonstrated at least adequate fit for a one-factor model. In this evaluation, school phobia and some of the assessments of separation anxiety were not unitary factors. Thus, we did not evaluate these dimensions for MI. This suggests that more in-depth analysis of these dimensions is warranted, although there are only four items on the school phobia subscale, which restricts the alternative modeling strategies that might yield better fit. Alternatively, because school phobia and separation anxiety are most common in early childhood, there may have been limited variability in responses for these dimensions at ages 9 and 12. Earlier assessments of school phobia and separation anxiety may have greater variability (Merikangas et al., 2010) and could lead to better fitting models. Examination of other instruments (e.g., the RCADS; Chorpita et al., 2000) across informants and time would provide leverage to determine whether this is a measure-specific or construct assessment challenge.
The present study employed an underutilized lens to better understand sources of discrepancy between child- and parent-reports of anxiety, as well as instability of anxiety symptoms from middle childhood to early adolescence. We employed a relatively large sample of mothers and youth who reported on multiple dimensions of anxiety symptomatology in middle childhood and early adolescence. However, our work has some limitations. First, our data came from a community sample with modest levels of symptomatology. Further, we had truncated ranges of item endorsement and collapsed our highest endorsement categories. We are unsure how this may have affected the findings. Second, we used only a single measure of anxiety, albeit one of the most frequently employed with children and adolescents. It is possible that other measures may demonstrate different levels of robustness across informants or longitudinal assessments. Third, we relied solely on comparisons between mothers and children. It is important to consider whether other caregivers (e.g., fathers) and teachers report on the same constructs of behavior problems in children. Fourth, we focused on individual subscales, rather than the total SCARED score. Thus, our work emphasizes these anxiety domains, but does not speak to the similarity in the overall structure of anxiety between informants and across time. Additional analyses would be necessary that focus on the broader dimensional model of the SCARED as a whole. Here, preliminary multidimensional models for the total SCARED produced good fit at age 9, but only a marginal fit at age 12. Thus, there is some evidence that the general structure may differ across time. Adequate testing of this more complex model would require a larger sample with greater variability in anxiety severity. Fifth, there was some selection for continuing the study when youth had lower levels of internalizing problems at age 3, though this difference was small.
In sum, our findings illustrate that it is critical to evaluate measurement properties of anxiety symptom rating scales using sophisticated measurement strategies. We found that associations across informants may be compromised by differences in the functioning of items on the scale being examined. In such cases, testing for differences between informants and combining ratings across informants to yield single indices of severity are both inappropriate. However, there was also evidence that measurement functioning for some anxiety dimensions remained consistent over time. Thus, a few of the dimensions of the SCARED are valid for assessing longitudinal change. As it may be difficult to know a priori which measures are appropriate for assessing change, there is a pressing need for a comprehensive effort to evaluate MI for the full range of scales commonly used to assess developmental trajectories and response to treatment in child and adolescent clinical psychology and psychiatry.

ETHICS STATEMENT
Informed consent was obtained prior to participation in accordance with the Declaration of Helsinki. The study was approved by the institutional review board at Stony Brook University.

AUTHOR CONTRIBUTIONS
TO conceptualized the research questions, drafted the manuscript, and conducted analyses. MF provided assistance in conducting analyses and provided critical feedback on the manuscript. LD provided critical feedback on the manuscript. DK provided substantial contribution to the research design and critical feedback on the manuscript.