Applying Item Response Theory (IRT) Modeling to an Observational Measure of Childhood Pragmatics: The Pragmatics Observational Measure-2

Assessment of the pragmatic language abilities of children is important across a number of childhood developmental disorders, including ADHD, language impairment and Autism Spectrum Disorder. The Pragmatics Observational Measure (POM) was developed to investigate children's pragmatic skills during play in a peer-peer interaction. To date, classical test theory methodology has reported good psychometric properties for this measure, but the POM has yet to be evaluated using item response theory. The aim of this study was to evaluate the POM using Rasch analysis. Person and item fit statistics, the response scale, the dimensionality of the scale and differential item functioning were investigated. Participants included 342 children aged 5-11 years from New Zealand: 108 children with ADHD played with 108 typically developing peers, and 126 typically developing age-, sex- and ethnicity-matched peers played in dyads in the control group. Video footage of these interactions was recorded and later analyzed using the POM by an independent rater unknown to the children. Rasch analysis revealed that the rating scale was ordered and used appropriately. Overall person (0.97) and item (0.99) reliability was excellent. Fit statistics for four individual items were outside acceptable parameters and these items were removed. The dimensionality analysis showed two distinct elements (verbal and non-verbal pragmatic language) of a unidimensional construct. These findings have led to a revision of the first edition of the POM, now called the POM-2. Further empirical work investigating the responsiveness of the POM-2 and its utility in identifying pragmatic language impairments in other childhood developmental disorders is recommended.


INTRODUCTION
No matter how much one desires a solitary existence or an expansive network of friends, we all engage in social interaction. For school-aged children, peer-peer interaction during play is a major context for social interaction (Cordier et al., 2013). Crucially, this context allows children to develop, express and refine pragmatic language skills. Pragmatic language has been defined as "behavior that encompasses social, emotional, and communicative aspects of social language" (Adams et al., 2005, p. 568). Pragmatic language difficulties (PLD) are implicated in various clinical populations, including autism spectrum disorder (ASD) and Attention Deficit Hyperactivity Disorder (ADHD) (Young et al., 2005; Staikova et al., 2013), and are a key diagnostic feature of social (pragmatic) communication disorder (SCD) in the DSM-5 (American Psychiatric Association, 2013). As such, clinicians and researchers require robust measures to plan and evaluate interventions, and to understand more about the complex nature of this language domain.
The complex and multifaceted nature of pragmatic language makes it challenging to develop instruments for meaningful measurement of the construct. The formal nature of standardized assessment tasks fails to capture the fact that pragmatic language is context dependent; instead, such tasks capture a narrow and possibly unreliable picture of an individual's pragmatic "knowledge" as opposed to performance capabilities. This limitation was accentuated when a proportion of children with SCD demonstrated knowledge of pragmatic rules, but still violated the same rules during social interactions with a communication partner (Lockton et al., 2016). Parent report measures can provide an understanding of a child's abilities across various social contexts and have a role in understanding a child's needs from the perspective of the service-user (Adams et al., 2012). However, if used to evaluate intervention effectiveness, they introduce bias because parents cannot be blinded to treatments.
Although observational measures show promise in addressing these known biases, very few exist and the strength of their psychometric properties remains largely unknown (Adams, 2002). Furthermore, there is a need to develop pragmatic assessments that capture interactions in a naturalistic context, rather than focusing on the impairment level. A recent international consensus study for Developmental Language Disorder (DLD) made specific recommendations in this regard (Bishop et al., 2016, 2017). This is especially important in the measurement of pragmatic language, as it is an area with a dearth of developmental norms (Adams, 2002).
To observe and measure pragmatic language behaviors in a functional childhood activity (play), the Pragmatics Observational Measure (POM) (Cordier et al., 2014) was developed. This instrument rates 27 items that reflect five elements of verbal and non-verbal pragmatic language: (1) introducing and responding to peer-peer social interactions; (2) use and interpretation of non-verbal communication; (3) social-emotional attunement to one's own thinking, emotions and behavior, as well as the intentions and reactions of peers; (4) higher level thinking and planning; and (5) peer-peer negotiation skills including suggestions, cooperation and effective disagreement. Children in peer-peer interactions during free and uninterrupted play are videoed and then rated according to these five elements, with individual items scored along a four-point scale. To date, the POM has been used to investigate the pragmatic language abilities of typically developing school-aged children and their peers diagnosed with ADHD (Wilkes-Gillan et al., 2017a,b) and Autism Spectrum Disorder (Parsons et al., unpublished; Parsons et al., 2018).
Evaluating an instrument's psychometric properties is an important element of test development. The POM has demonstrated good reliability, validity, responsiveness and interpretability (Cordier et al., 2014). In terms of internal consistency, an exploratory principal components (PC) factor analysis and a confirmatory maximum likelihood (ML) factor analysis were performed. Factor analysis identified that the items on the POM reflected a unidimensional construct, accounting for 81.5% of the variance (exploratory PC factor analysis) and 73.7% (ML factor analysis). This suggests that while items are theoretically grouped under five elements of pragmatic language, the instrument's items represent overall pragmatic language ability rather than multidimensional "subdimensions" of pragmatic language. Our previous work used the Pragmatic Protocol (Prutting and Kirchner, 1987) as the "gold standard" comparison, as there were very few observational assessments for assessing peer-peer pragmatic language. Labeled as a descriptive taxonomy, the Pragmatic Protocol (PP) has 30 items classified under verbal, paralinguistic and non-verbal aspects. Children aged 5 years and older are observed in a dyadic interaction while the rater scores each item as appropriate, inappropriate, or not observed. While theoretically and clinically motivated, it is not clear whether this instrument reflects an underlying uni- or multidimensional construct. Thus far, the psychometric analysis of the POM has been based on Classical Test Theory (CTT) approaches, which view the whole test as the unit of analysis and assume that all items contribute equally to the same underlying construct (Cordier et al., 2014). This could be problematic for the POM given that the measure has 27 items and we cannot be completely confident that all items contribute equally. There is also the possibility that not all items reflect the same underlying construct. To explore these notions further we turned to item response theory (IRT).
Item response theory (IRT) modeling has become an important methodology for test development (Edelen and Reeve, 2007). Item response theory examines the reliability of each item and whether each item contributes to an overall construct (Linacre, 2016a). Another advantage is that IRT can be completed independently of the testing group used. Rasch analysis, a type of IRT model, has been used to successfully critique and evaluate existing measures in other clinical areas relating to swallowing and communication disorders (Donovan et al., 2006; Cordier et al., 2017a). In this study, we apply the Rasch measurement model to further evaluate the POM. Specifically, we investigated person and item fit statistics, the response scale, the dimensionality of the scale, and differential item functioning.

MATERIALS AND METHODS

Participants
Video footage of children and their playmates from Cordier et al. (2010) was used to evaluate the psychometric properties of the POM. The sample included children diagnosed with ADHD (n = 108), paired with typically developing playmates (n = 108), with one child with ADHD and one typically developing child in each observation. Children with ADHD were chosen because of their known social impairments and difficulties with pragmatic language (Staikova et al., 2013). The control group involved two typically developing children in each observation (n = 126). Children in the control group were matched on age, sex and ethnicity and not known to have ADHD as defined by the DSM-IV (American Psychiatric Association, 2000). All playmate pairs were familiar with each other.

Children With ADHD
Children with ADHD were recruited from district health boards and pediatric practices in Auckland, New Zealand. The inclusion criteria detailed that children must have a formal diagnosis of ADHD from a psychiatrist or pediatrician according to the DSM-IV criteria. Children must not have been administered medication prescribed for ADHD on the day of assessment and must not have been taking medication where an overnight period was an insufficient wash-out (e.g., Atomoxetine). This was applied to ensure high levels of diagnostic accuracy, minimize inclusion of borderline cases and cases with disorders other than ADHD as the primary diagnosis, and to enable observation of how children with ADHD interact without the effects of medication.

Typically Developing Children in the Control Group
Children in the control group (n = 126) were recruited from professional networks such as local schools and families of health service employees in Auckland, New Zealand. The inclusion criteria for the control group defined a typically developing child as a child with no childhood developmental disorder and no developmental concerns having been raised by a teacher or health professional. Presence of a developmental disorder was further ruled out through the administration of the Conners' Parent Rating Scales-Revised [CPRS-R]. The CPRS-R is a screening questionnaire completed by parents or primary carers to determine whether children aged 3-17 years have signs and symptoms consistent with a diagnosis of ADHD. Previously, the CPRS-R has shown excellent reliability (internal consistency reliability 0.75-0.94) and construct validity (to discriminate ADHD from the non-clinical group: sensitivity 92%, specificity 91%, positive predictive power 94%, negative predictive power 92%) (Conners et al., 1998; Conners, 2004). Children who scored below the clinical cut-offs for any CPRS-R subscale and DSM-IV subscale were included in the control group.

Pragmatics Observational Measure (POM)
The POM was developed in response to the need for an observational measure to assess pragmatic language in naturalistic contexts between peers. Only the Pragmatic Protocol (PP) (Prutting and Kirchner, 1987) and the Structured Multidimensional Assessment Profiles (S-MAPs) (Wiig et al., 2004) were found to be observational measures of some aspects of pragmatic language. The S-MAPs was developed as a tool for clinicians for curriculum-based assessment and intervention for children; it provided clinicians with examples of how to develop their own rubrics and included a few rubrics related to aspects of pragmatic language. However, this measure's usefulness within a research setting is limited, as little psychometric information has been published. The PP presents similar issues, with limited psychometric information available. Moreover, its use of a dichotomous rating scale limits the observer's ability to capture the complex nature of pragmatic language. A summary of the five elements and a summative description of each item grouped within each of the five elements are provided in Table 1.
The 27 items included in the POM were selected, developed and refined by the first four authors. All the authors have extensive experience in working with children from four disciplines: clinical psychology, epidemiology, speech and language pathology and occupational therapy. The item level descriptors were continuously refined over an 18-month period to ensure that they were clear, unambiguous and that all items could be rated using observable behavior. External raters assisted with item refinement by rating video footage of typically developing children and children with behavioral disorders and PLD.
The POM includes 27 items. Each item is based on the child's consistency of performance, rated on a 4-point scale (1-4): 1 - rarely or never observed; 2 - sometimes observed (25-50% of the time); 3 - observed much of the time (50-75% of the time); and 4 - almost always observed (75-100% of the time). A detailed description is provided for each level of performance for all items. Discriminant analysis was used during initial development and psychometric testing to calculate a diagnostic cut-off score of 8.02 for significant PLD (Cordier et al., 2014). Children with a mean measure score below this cut-off were classified as having significant PLD and those above it were deemed to have no PLD.

Procedure
The Sydney University Human Ethics Research Committee provided ethical approval to perform secondary analysis on data. The original study aimed to compare the play skills of children with ADHD with typically developing children (Cordier et al., 2010). Peer-peer social interactions for all children were observed. For those in the control group, children were observed using a designated play area at the respective schools that children attended, and children with ADHD were observed at clinics that they regularly attended. The same toys were present during all play sessions and the children were allowed to choose their play materials and activities. A diversity of play materials and toys catering to age and gender differences were made available to support a range of play and encourage peer-peer interaction.
The assessor introduced the peers to the free play situation and was as unobtrusive as possible. Participants were instructed that they could play with any of the toys in the playroom for 20 min and that they should ignore the assessor, who was present in the playroom. The play session was video recorded for later analysis. When children attempted to interact with the assessor, the assessor's response was neutral and the assessor remained as unobtrusive as possible. The assessor did not intervene unless a child was in danger.
A single experienced rater (who was not the assessor) rated all the children from the videotapes. The rater was blinded to the purpose of the study to minimize bias. To establish adequate inter-rater reliability, another blinded rater familiarized themselves with the POM. Next, the first and third authors developed a training video using footage of school-aged children playing who were independent of the current study. The blinded rater and the first author then coded ten samples from this footage using the POM. Coding was compared and consensus reached following discussion and re-viewing of the training footage. Reliability for the current study was calculated based on a random selection of 30% of all data.

Statistical Analysis
Rasch analyses were used to evaluate the reliability and validity of the POM. Data were analyzed using WINSTEPS version 3.92.0 (Linacre, 2016b), with joint maximum likelihood estimation of the rating scale model (Wright and Masters, 1982). Data were analyzed for all 27 POM items; thereafter an iterative process was adopted. This involved removing poorly fitting items, in various combinations, and re-running the analysis to obtain the best overall item fit, person separation and dimensionality statistics. The following analyses were conducted for all investigations.

Rating Scale Validity
Examination of the rating scale validity can confirm whether the ordinal response scale for all items stays true to the assumption that higher ratings indicate "more" and lower ratings indicate "less" of the concept under assessment. In WINSTEPS, rating scale response options are referred to as categories. There are three situations in which the partial credit model can be used: (1) items where some responses may be more correct than others; (2) items that can be broken down into component tasks; and (3) items where increments in the quality of performance are rated (Wright, 1998). None of these situations apply to the POM scale structure and all POM items have the same scale structure. As such, a Rating Scale Model (RSM) was used. In alignment with the POM response options the categories are numbered 1-4.
To determine whether the rating response scales were being used in the expected manner, category response data were examined for even distribution and category disorder. Poorly defined categories, or the inclusion of items that do not measure the construct, result in non-uniformity or disordering. Ordered categories are indicated by average measure scores that increase monotonically as the category increases. Mean squares (MnSq) outside 0.7-1.4 indicate category misfit and disordering, and collapsing the misfitting category with an adjacent category should be considered (Linacre, 2016a).
The point at which there is equal probability of a response in either of two adjacent categories being selected, known as step calibrations or Andrich-thresholds, were determined to assess step disordering. Andrich-thresholds reflect the distance between categories and should progress monotonically, showing neither overlap between categories nor too large a gap between categories.
Step disordering indicates that the category defines a narrow section of the variable, but does not imply that the category definitions are out of sequence. Distinct categories are indicated by Andrich thresholds that advance by at least 1.0 logit on a 5-category scale. An increase of >5.0 logits, however, is indicative of gaps in the variable (Linacre, 1999).
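The category probabilities under the rating scale model, and the monotonicity check on the Andrich thresholds, can be sketched as follows. This is a minimal illustration of the standard formulas, not the Winsteps implementation; the function names are ours.

```python
import numpy as np

def rsm_category_probs(theta, delta, taus):
    """Rating Scale Model: P(X = k | theta) for categories k = 0..m.

    theta: person ability (logits); delta: item difficulty;
    taus: Andrich thresholds tau_1..tau_m, shared across items.
    """
    steps = theta - delta - np.asarray(taus, dtype=float)
    # numerator for category k is exp of the cumulative sum of steps 1..k
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    return probs / probs.sum()

def thresholds_ordered(taus):
    """Ordered categories require monotonically advancing thresholds."""
    return bool(np.all(np.diff(taus) > 0))
```

For a person located at the item's difficulty with symmetric thresholds, the extreme categories are equally probable; thresholds that fail the monotonicity check (e.g., tau_2 < tau_1) signal the step-disordering problem described above.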

Person and Item Fit Statistics
Construct validity was assessed using fit statistics to identify misfitting items and the pattern of responses for each person. Fit statistics are reported as log odds units (logits) and indicate whether the items contribute to a single construct (i.e., pragmatic language ability) and the degree to which a person's responses are reliable. Unstandardized mean square (MnSq) or Z-Standard (Z-STD) scores can be used to describe infit and outfit. Infit and outfit MnSq values should be close to 1.0, with an acceptable range of 0.7-1.4 (Bond and Fox, 2015). The outfit Z-STD values are expected to be 0, and any value that exceeds ±2 is interpreted as less than the expected fit to the model (Bond and Fox, 2015). Model underfit degrades the model and requires further investigation to determine its cause. Model overfit, on the other hand, does not always degrade the model but can still lead to the misinterpretation that the model worked better than expected (Bond and Fox, 2015).
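As a sketch of the usual Rasch definitions (our own function name; the real statistics are computed by Winsteps), the infit and outfit mean squares can be obtained from model residuals as:

```python
import numpy as np

def fit_mean_squares(observed, expected, variance):
    """Infit and outfit MnSq for one item (or one person).

    observed: ratings; expected: model-expected scores; variance:
    model variance of each response; all aligned arrays.
    """
    resid = observed - expected
    z2 = resid ** 2 / variance                   # squared standardized residuals
    outfit = z2.mean()                           # unweighted; outlier-sensitive
    infit = (resid ** 2).sum() / variance.sum()  # information-weighted
    return infit, outfit
```

Values near 1.0 indicate that the responses vary about as much as the model predicts; values above 1.4 flag underfit (excess noise) and values below 0.7 flag overfit (redundancy), matching the acceptable range cited above.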
Internal consistency of the measure is evaluated through the person reliability, which is equivalent to the traditional Cronbach's alpha. Low person reliability values (<0.8) indicate having too few items or a narrow range of person measures (i.e., not having enough persons with more extreme abilities, both high and low).
Person separation determines whether the test separates the sample into a sufficient number of ability levels, distinguishing high performers from low performers. If outlying measures are accidental, people are classified using person separation; if the outlying measures represent true performances, people are classified using the person separation index (PSI)/strata, calculated as (4 × person separation + 1)/3. Low person separation indicates that the measure is not sensitive enough to separate low and high performers. Reliability values of 0.5, 0.8, and 0.9 correspond to separation into one or two levels, two to three levels, and three to four levels, respectively (Linacre, 2016a). A PSI/strata of at least 3 (the minimum level to attain a reliability of 0.9) is required to consistently identify three levels of performance. The item difficulty hierarchy (high, medium, low) is verified by item reliability; if item reliability is <0.9, the sample is too small to confirm the construct validity (item difficulty) of the measure.
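These quantities follow directly from the Rasch summary statistics. A small sketch, assuming the standard formulas (the helper name is ours):

```python
import math

def separation_stats(sd_person, rmse):
    """Separation, reliability and strata from person summary statistics.

    sd_person: observed SD of the person measures; rmse: root mean
    square measurement error. The "true" SD removes error variance.
    """
    true_sd = math.sqrt(max(sd_person ** 2 - rmse ** 2, 0.0))
    separation = true_sd / rmse
    reliability = separation ** 2 / (1 + separation ** 2)
    strata = (4 * separation + 1) / 3  # statistically distinct performance levels
    return separation, reliability, strata
```

A separation of 3, for example, corresponds to a reliability of 0.9, consistent with the thresholds quoted above.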

Dimensionality of the Scale
Dimensionality can be assessed by the following means: (a) using negative point-biserial correlations to identify any potentially problematic items; (b) identifying misfitting persons or items using Rasch fit indicators; and (c) performing Rasch factor analysis using Principal Component Analysis (PCA) of the standardized residuals (Linacre, 1998). The number of principal components is checked using PCA of residuals to confirm that there are no second or further dimensions once the intended (Rasch) dimension is removed. No second dimension is indicated if the residuals for pairs of items are uncorrelated and normally distributed. The following recommended criteria are used to determine whether further dimensions are present in the residuals: (a) the Rasch factor explains >60% of the variance; (b) the first contrast has an eigenvalue of <3 (equivalent to three items); and (c) the first contrast explains <10% of the variance (Linacre, 2016a).
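The PCA-of-residuals step can be illustrated as follows. This is a simplified sketch of the idea rather than the Winsteps procedure, and the function name is ours:

```python
import numpy as np

def residual_pca_eigenvalues(observed, expected, variance):
    """Eigenvalues of the correlations between item residuals.

    observed, expected, variance: persons x items arrays from a fitted
    Rasch model. After the Rasch dimension is removed, a first-contrast
    eigenvalue of 3 or more (three items' worth) suggests a second dimension.
    """
    z = (observed - expected) / np.sqrt(variance)  # standardized residuals
    corr = np.corrcoef(z, rowvar=False)            # item-by-item correlations
    return np.linalg.eigvalsh(corr)[::-1]          # eigenvalues, descending
```

When the residuals are uncorrelated noise, all eigenvalues sit near 1 (each item contributes one unit of variance); a dominant first eigenvalue indicates a cluster of items sharing off-dimension variance.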
The person-item dimensionality map uses a logit scale to schematically represent the distributions of person abilities and item difficulties. In this paper, person ability refers to the level of pragmatic language ability observed by an assessor. "Difficult" items on the POM capture aspects of pragmatic language that occur so infrequently that very few assessors will give a high rating to these items, whereas "easy" items refer to aspects of pragmatic language that occur regularly and receive high assessor ratings. If two or more items represent similar difficulty, these items occupy the same location on the logit scale. Locations on the logit scale where persons are represented with no corresponding item identify gaps in the item difficulty continuum. The person mean measure score is another indicator of overall distribution. A person mean measure located above the centralized item mean score of 50 on the person-item map indicates that the people in the sample were more able than the level of difficulty of the items; if the mean person location is below 50, the people in the sample were less able than the mean item difficulty.

Differential Item Analysis
To examine whether the scale items were used in the same way by all groups, a differential item functioning (DIF) analysis was performed on the remaining 23 items. DIF occurs when a characteristic other than the pragmatic language ability being assessed influences a person's rating on an item (Bond and Fox, 2015). For DIF analysis, the sample was categorized by age (5-8 years vs. 9-11 years), participant category (ADHD vs. Playmate vs. Control), ethnicity (European vs. Maori vs. Other ethnicities), gender (male vs. female), and pragmatic language difficulty (PLD vs. no PLD).
We were interested in these variables (a) based on the current literature about the development of pragmatic language, and (b) because the POM is a measure of pragmatic language performance, and we needed to establish whether it could detect differences in the performance of children with ADHD and possibly PLD, as we would expect these characteristics to impact their scores (Wu et al., 2017).
Children with ADHD (Väisänen et al., 2014) and children with PLD (Ketelaars et al., 2010;Ryder and Leinonen, 2014) have been found to have poorer pragmatic language outcomes.
If, however, there was significant DIF on a large number of items when comparing age, gender or ethnicity, this would be a concern, or at least warrant further research, as it possibly indicates item bias. DIF based on age would only be expected for children younger than 5 years of age (Moreno-Manso et al., 2010; Hoffmann et al., 2013); the children in this study were all older than 5 years. For the purposes of DIF analysis, two age groups (5-8 years vs. 9-11 years) were created by dividing the children into two groups of relatively equal size. We initially attempted three age groups, but the numbers of children in the 5-6 years and 10-11 years categories were too small. If significant differences were detected on a large number of items when examining DIF between the two age groups, that would indicate the need for further research to understand whether it was the result of impact or bias. In terms of gender, previous research found that boys performed more poorly than girls in pragmatic language outcomes (Ketelaars et al., 2010). Surprisingly, no research has specifically explored pragmatic language in terms of cultural variation, even though differences in non-verbal communication across cultures have previously been acknowledged in research (Vrij et al., 1992; Dew and Ward, 1993).
The differential item functioning contrast is inspected when comparing groups and refers to the difference in difficulty of the item between the two groups. When testing the hypothesis "this item has the same difficulty for two groups," DIF is notable when the DIF contrast (the effect size reported in Winsteps) is at least 0.5 logits with a p-value < 0.05; statistical significance alone can be affected by sample size, and the sample may not be large enough to exclude the possibility that a difference is accidental (Linacre, 2016a). When interpreting the directionality of DIF contrast values, positive logits indicate that the item was harder (lower scores) than expected, and negative logits indicate that the item was easier (higher scores) than expected. In determining DIF when comparing more than two groups (i.e., participant category and ethnicity), the hypothesis "this item has no overall DIF across all groups" is tested using the chi-square statistic with a p-value < 0.05 (Linacre, 2016a). Winsteps implements several DIF methods: the Mantel method for complete or almost complete polytomous data; the Mantel-Haenszel method for uniform DIF analysis of complete or almost complete dichotomous data; and a logistic uniform DIF method for incomplete, especially sparse, data, which estimates the difference between the Rasch item difficulties for the two groups, holding everything else constant. The Mantel/Mantel-Haenszel procedures in Winsteps are (log-)odds estimators of DIF size and significance from cross-tabulations of the observations of the two groups, using theta to stratify and thereby overcoming the original methods' need for complete data. Mantel/Mantel-Haenszel is also suitable for our sample size, as it does not require a large sample (Guilera et al., 2013). Winsteps additionally implements a non-uniform DIF logistic technique and a graphical non-uniform DIF approach.
For the DIF analyses conducted in this study we used the Mantel-Haenszel test for dichotomous variables and the Mantel test for polytomous variables, as these methods are generally considered most authoritative (Linacre, 2016b).
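For dichotomous data, the Mantel-Haenszel log-odds estimator of uniform DIF size can be sketched as follows. This is a textbook version with hypothetical stratum tables; the Winsteps implementation differs in detail:

```python
import numpy as np

def mantel_haenszel_log_odds(tables):
    """Mantel-Haenszel common log-odds ratio across ability strata.

    tables: list of 2x2 arrays [[a, b], [c, d]], one per theta stratum,
    with rows = reference/focal group and columns = success/failure.
    A value near 0 indicates no uniform DIF; the magnitude (in
    log-odds, comparable to logits) estimates the DIF size.
    """
    num = sum(t[0, 0] * t[1, 1] / t.sum() for t in tables)
    den = sum(t[0, 1] * t[1, 0] / t.sum() for t in tables)
    return float(np.log(num / den))
```

Stratifying on theta before pooling is what lets the estimator compare groups of different overall ability without confounding ability differences with item bias.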

RESULTS
The sample of 342 records, from 108 children with ADHD, 108 typically developing playmates and 126 typically developing (TD) children, was analyzed. Of the children with ADHD, 80.3% were male, with a mean age of 8.9 years (SD = 1.4); 75.2% of the playmates were male, with a mean age of 8.4 years (SD = 1.9); and 78.7% of the TD group were male, with a mean age of 8.6 years (SD = 1.5). All children were from New Zealand, with ethnicity representative of the New Zealand population. See Table 2 for details and other demographic information. Missing data were recorded for 9 (0.1%) of 9,234 observations (27 items × 342 participants), which is negligible.

Rating Scale Validity
The POM uses a 4-point (1-4) rating scale to rate the child's performance from beginner to expert. For the overall instrument, the probability of each category being observed was examined. The average measure scores increased monotonically, and the fit statistics were all in the acceptable range (MnSq = 0.7-1.4), resulting in four distinct, ordered categories (see Table 3 and Figure 1). The Andrich thresholds, which reflect the relative frequency of use of the categories, were not disordered, but all categories advanced by >5 logits (range 26.78-28.23 logits), indicating potential gaps in the measurement of the variable (i.e., in the category labels).

Person and Item Fit Statistics
The summary infit and outfit statistics for item and person ability for the 27-item scale showed good fit to the model with a good item reliability estimate (0.99) and high person reliability (0.97). The PSI of 8.48 was well above the minimum of 3 required to separate people into distinct ability strata (see Table 4). Point biserial correlations were examined and all found to be in a positive direction indicating all items contribute to the overall construct.
We then examined the summary fit statistics of the overall scale (all 27 items), after which we ran the analysis again, removing each of the outfitting items (all under-fitting) individually and then together, to determine whether this improved the overall fit to the model. The additional analyses were completed in the following sequence: (1) self-regulation only removed; (2) creativity only removed; (3) both creativity and self-regulation removed. This process led to the removal of two further items: thinking style and express feelings. The analysis was then completed with the four items removed (self-regulation, creativity, thinking style, and express feelings). The self-regulation item was returned because its removal had reduced the dimensionality statistics and item separation. We then re-analyzed the data with the three remaining items removed.
With each analysis the person and item reliability remained unchanged, except that removing self-regulation and creativity together resulted in an item reliability of 0.98 (from 0.99), still well above the required 0.90, which confirms the hierarchy of the scale items. Item separation remained good, although it was reduced relative to the full item set when self-regulation and creativity were removed separately and together, and when they were removed with thinking style and express feelings. Person separation remained at approximately 6.0, well above the required level of 3, for all analyses, and the person separation index remained high but did not improve with the removal of any items. Following examination of the point biserial correlations, which confirmed that all items contributed to the overall construct, we examined item misfit for all 27 items combined (see Table 5). We examined infit and outfit scores for contradictions; although there were more misfitting infit Z-STD scores than outfit Z-STD scores, there were no contradictions in the direction of change. Underfit (MnSq > 1.4; Z-STD > 2) is the biggest threat to the measure because it can degrade the model, as it arises from too much variation in the responses (Bond and Fox, 2015). Underfit of both infit and outfit scores was not observed for any item, but the infit MnSq and Z-STD for creativity were both underfitting, as were the outfit MnSq and Z-STD for self-regulation, thinking style, and express feelings. More misfit was evident on infit and outfit Z-STD scores than MnSq, with the infit Z-STD also underfitting for self-regulation, thinking style, express feelings, and requests. Overfit of the MnSq infit and outfit scores for respond, environmental demands, and integrate communicative aspects was also observed.
Removing self-regulation slightly reduced the underfit of the creativity infit MnSq, but slightly increased the outfit MnSq for creativity and express feelings without improving the model fit of any other items. When creativity was removed, thinking style and express feelings underfit on all infit and outfit scores, as did self-regulation on all scores except the infit MnSq, and there were no other significant or improved fit scores on the other items (Table 5). A similar outcome was observed for thinking style and express feelings when creativity and self-regulation were removed in the same analysis. Removing creativity, thinking style, and express feelings, with and without removing self-regulation, resulted in underfit on all scores for the requests item and increased underfitting infit Z-STD scores for facial expressions and repair and review (Table 5). In the final analysis, we removed creativity, thinking style, express feelings, and requests, which increased the underfit of the self-regulation outfit MnSq (from 4.07 to 8.24) and of the self-regulation infit and outfit Z-STD. With this change, the underfitting infit Z-STD of disagree and facial expression remained, and the underfit of the repair and review and initiate Z-STD increased (Table 5). This solution was retained because it produced the best individual item fit without degrading person separation.

Dimensionality of the Scale
The dimensionality of the overall scale with all 27 items was examined using principal components analysis (PCA) of the residuals (Table 6). The Rasch dimension explained 77.1% of the variance, with >40% considered strong measurement of a dimension. However, of the 77.1% explained variance, person measures (60.8%) accounted for almost four times the variance accounted for by item measures (16.3%). The total raw unexplained variance (22.9%) had an eigenvalue of 27, and the eigenvalue of the first contrast was 5.49, which indicated the presence of a second dimension. The PCA of residuals divided the items into two groups: one related to verbal aspects of pragmatic language (items 1, 2, 3, 4, 5, 6, 16, 17, 23, 25, and 26) and one related to non-verbal aspects of pragmatic language (items 7, 8, 9, 10, 11, 13, 14, 15, 20, 21, and 22). Given the theoretical position that pragmatic language consists of both verbal and non-verbal components, which informed the construction of the scale to ensure all features of the construct were measured, this finding indicates that the items form the one construct of pragmatic language. This finding is further supported by examining the disattenuated correlations between the person measures on the two sets of items (see Table 5).
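The variance bookkeeping behind these figures can be made explicit. In this framework the unexplained variance carries an eigenvalue equal to the number of items; the sketch below uses the values reported above, and the derivation of total eigenvalue units from them is our own arithmetic rather than a figure from the study.

```python
# Variance decomposition for the PCA of Rasch residuals, using the
# reported values: 77.1% explained, 27 items, first contrast eigenvalue 5.49.

n_items = 27
explained_pct = 0.771                  # variance explained by the Rasch dimension
unexplained_pct = 1 - explained_pct    # 22.9%, which carries an eigenvalue of 27
first_contrast_eigenvalue = 5.49

# Total variance in eigenvalue units: the unexplained share corresponds
# to 27 units, so scale up proportionally.
total_units = n_items / unexplained_pct
print(round(total_units, 1))           # ~117.9 eigenvalue units in total

# A first contrast above ~2 eigenvalue units suggests a secondary dimension.
print(first_contrast_eigenvalue > 2.0)
```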
With the exception of one disattenuated correlation of 0.79 (item: maintain and change), the correlations for all other items exceeded 0.8, which indicates the items are unidimensional for practical purposes (i.e., a multidimensional analysis would produce effectively the same results as a unidimensional one).
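The disattenuated correlation corrects an observed correlation for measurement error in both scores, via the standard attenuation formula r_dis = r_obs / sqrt(rel_x * rel_y). The reliabilities and observed correlation in the sketch below are illustrative values, not figures reported in the study.

```python
import math

def disattenuated(r_obs, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_obs / math.sqrt(rel_x * rel_y)

# Illustrative values: modest observed correlation, high reliabilities.
r = disattenuated(r_obs=0.74, rel_x=0.90, rel_y=0.92)
print(round(r, 2))  # -> 0.81

# Correlations above ~0.8 between person measures on the two item sets
# are taken as evidence of practical unidimensionality.
print(r > 0.8)      # -> True
```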
Second and subsequent contrasts were less than the 2 eigenvalue units required to indicate further dimensions (Table 7). This process was then repeated with self-regulation and creativity (the most misfitting items) removed separately, then both removed, and then in various combinations with creativity, thinking style, express feelings, and requests, without significant change; all models still indicated a second dimension (Table 7). As presented in Figure 2, the person-item maps show that few people were aligned with the missing items and that, although there were no redundant items, the addition of more easy and difficult items would improve the measure. The person-item dimensionality maps remained consistent regardless of which items were removed.

Differential Item Analysis
The DIF analysis enabled examination of potential contrasting item-by-item profiles associated with: (a) ADHD, playmate, or control group; (b) age; (c) gender; (d) ethnicity; and (e) PLD vs. noPLD. The summary of the DIF analysis for the remaining 23 items is presented in Table 8 and revealed that participant category (ADHD vs. playmate vs. control) was the major factor in how items were used. DIF on the identified items indicated that children with ADHD scored lower than expected on items 7, 8, 9, 10, 11, 21, and 22; that is, children with ADHD found these items more difficult than expected. Children with ADHD found items 4, 16, 17, 23, 25, and 26 easier than expected. For the PLD vs. no pragmatic language difficulties (noPLD) category, none of the items had both p < 0.05 and a DIF contrast >0.5. DIF based on participant age was observed only on items 8, 9, and 23, which were all harder than expected for the younger children. DIF based on gender was observed for three items: boys scored higher than expected (found the item easier) on items 12 and 23, and lower than expected (found the item harder) on item 13. DIF based on ethnicity indicated that it was easier for the European group to score on items 3, 6, and 17 and more difficult for them to score on the non-verbal items 7, 8, and 9.
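The joint flagging rule used here (statistical significance plus a minimum contrast in logits) can be sketched directly. The item records below are invented for illustration; only the criterion (p < 0.05 and |DIF contrast| >= 0.5 logits) comes from the text.

```python
# Flag DIF only when an item is both statistically significant and
# substantively large, per the joint criterion described in the text.

def dif_flag(contrast_logits, p_value, min_contrast=0.5, alpha=0.05):
    """True when an item meets both the significance and size criteria."""
    return p_value < alpha and abs(contrast_logits) >= min_contrast

# Invented example records: (item, DIF contrast in logits, p-value).
candidates = [
    ("item 7",  -0.82, 0.003),  # harder than expected for the focal group
    ("item 16",  0.61, 0.020),  # easier than expected for the focal group
    ("item 12",  0.30, 0.004),  # significant, but the contrast is too small
]

flagged = [name for name, c, p in candidates if dif_flag(c, p)]
print(flagged)  # -> ['item 7', 'item 16']
```

Requiring both conditions avoids flagging trivially small contrasts that reach significance only because of sample size.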

DISCUSSION
In this study we aimed to evaluate the psychometric properties of the POM using Rasch analysis. Our first important finding was that with items creativity, thinking style, express feelings, and requests removed, the overall item and person reliability [the IRT equivalent of Cronbach's Alpha/internal consistency (Mokkink et al., 2010;Bond and Fox, 2015)] of the POM was excellent, with the overall infit and outfit statistics within the required parameters. Additionally, the person and item separation indexes were within acceptable parameters. This indicates that the POM performs well in regard to separating children with different levels of pragmatic language skills into four distinct groups (i.e., scores 1-4 indicating skill levels of beginner, advanced beginner, competent or expert) (Bond and Fox, 2015).
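The reliability and separation statistics reported above are algebraically linked in Rasch measurement: R = G^2 / (1 + G^2), equivalently G = sqrt(R / (1 - R)). A minimal sketch, using the reliabilities reported in the text (the rounding to one decimal is our own):

```python
import math

def separation_from_reliability(rel):
    """Convert a Rasch reliability coefficient to a separation index G."""
    return math.sqrt(rel / (1 - rel))

person_rel, item_rel = 0.97, 0.99  # values reported in the study
print(round(separation_from_reliability(person_rel), 1))  # -> 5.7
print(round(separation_from_reliability(item_rel), 1))    # -> 9.9
```

The implied person separation (~5.7) is consistent with the value of approximately 6.0 reported in the results, and both far exceed the conventional minimum of 3.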
Our next important finding is related to the dimensionality of the POM and the removal of the misfitting items creativity, thinking style, express feelings, and requests. Using Rasch analysis, dimensionality reflects the structural validity of the POM (Baylor et al., 2011; Bond and Fox, 2015). The low percentage of overall unexplained variance across the 27 items (22.9%) supports the finding that the POM is a unidimensional construct with good structural validity. However, our analyses indicated that some items did not contribute toward the overall construct. When creativity, thinking style, express feelings, and requests were removed, the unexplained variance reduced to 21.7%.
We suspect these items may not have contributed to the construct as they are in part subsumed within items in the areas of Introduction and Responsiveness and Social-Emotional Attunement (Campbell et al., 2016). For instance, to express feelings appropriately and successfully continue a communicative interaction one needs to be able to: (1) regulate their own thinking, emotions and behavior (item: self-regulation); (2) be aware of and respond to another's emotional needs (item: emotional attunement); and (3) consider and integrate another's viewpoint/emotion (item: perspective taking). These items sit under the element of Social-Emotional Attunement in the first version of the POM.
Similarly, to think and articulate abstract and complex ideas (item: thinking style), interpret, connect, and express ideas in versatile ways (item: creativity), and request explanations or more information (item: requests) to effectively continue a communicative interaction, one must be able to use skills in the area of Introduction and Responsiveness. These skills include the ability to select and introduce and maintain and change conversational topics. In order to be able to respond to communication from another and repair or review conversation when a breakdown occurs, one must also be able to successfully request information from their communication partner (Sehley and Snow, 1992).

At an individual item level, after removing the items creativity, thinking style, express feelings, and requests, the MnSq fit statistics of most items were within acceptable parameters. However, the Z-STD fit statistics of five items (self-regulation, disagrees, facial expression, repair and review, and initiates) were outside the expected parameters. Self-regulation was the most problematic item, falling outside the acceptable parameters for both infit and outfit statistics. This is not a surprising finding given the complex, dynamic nature of self-regulation, which children continue developing into adolescence (Raffaelli et al., 2005; McClelland et al., 2015). Additionally, self-regulation relates to both children's social and emotional skills, which have been notoriously hard to define and measure (McClelland and Cameron, 2012; Jones et al., 2016). Conceptualizing and measuring self-regulation and other social-emotional skills has been described as a complex task, as these skills are categorized broadly, with each containing a set of more delineated skills.
In addition to measurement of self-regulation being hindered by lack of conceptual clarity, there is also existing debate as to the underlying components contributing to one's capacity to self-regulate (McClelland and Cameron, 2012;Campbell et al., 2016;Jones et al., 2016).
Self-regulation was also problematic regarding the person-item dimensionality map, as it was relatively easy in comparison to the rest of the POM items, despite being a skill known to be notoriously complex from both clinical and conceptual viewpoints (McClelland and Cameron, 2012; Jones et al., 2016). The findings pointed to the need to change, rather than drop, the item. Furthermore, our findings indicated that there were both easy and hard items, with people aligned alongside the items. Further, there was no evidence of item redundancy, with no items occurring at the same level.
Differential item functioning analyses were conducted in relation to participant group (ADHD, playmate, and control), age category (5-8 years or 9-11 years), gender, ethnicity, and PLD vs. noPLD. Our finding that 13 of 23 items functioned significantly differently in relation to clinical group is supported by previous research showing that children with ADHD experience difficulty with pragmatic language skills compared to their peers (Kim and Kaiser, 2000; Bignell and Cain, 2007). Of interest is that many of the items that were more difficult were the non-verbal pragmatic language items and co-operation and engagement. Items that were easier included being assertive, making suggestions, disagreeing, and initiation; surprisingly, children with ADHD also found it easier than expected to manage communication content and the item 'attend, plan and initiate.' It is not surprising to see results suggesting that children with ADHD have more PLD compared to typically developing peers. In earlier studies, children with ADHD experienced more PLD than typically developing peers when assessed using the Children's Communication Checklist - Second Edition (CCC-2) (Väisänen et al., 2014), the Test of Pragmatic Language - Second Edition (TOPL-2), and the Comprehensive Assessment of Spoken Language (CASL) (Staikova et al., 2013). However, the exact nature of the pragmatic difficulties was reported at an aggregate level only, thus not allowing for a more nuanced explanation of differences. As such, this mix of DIF on items indicates that more research is required to understand whether the DIF is related to impact (i.e., the variable skills of children with ADHD compared to the controls) or to bias in the items. For children with PLD vs. noPLD, none of the items showed significant DIF. A possible explanation for the absence of DIF on the PLD variable could be how the variable was derived.
Children, regardless of diagnosis, who scored 1.5 standard deviations below the overall mean measure score were categorized as PLD, and those who scored above the cut-off were categorized as noPLD. A cut-off of 1.5 standard deviations below the overall mean measure score may have been too conservative to truly identify children with PLD. In future development of the POM, a new cut-off point of at least two standard deviations below the mean should be considered, with a larger sample size.
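The derivation of the PLD variable described above can be sketched as a simple cut-off rule on Rasch person measures. The toy measures below are invented; only the 1.5-SD rule comes from the text, and the use of the sample standard deviation is our assumption.

```python
from statistics import mean, stdev

def categorize_pld(measures, n_sd=1.5):
    """Label each person measure 'PLD' or 'noPLD' against mean - n_sd * SD."""
    cutoff = mean(measures) - n_sd * stdev(measures)
    return ["PLD" if m < cutoff else "noPLD" for m in measures]

# Invented person measures in logits; one clearly low scorer.
measures = [1.2, 0.8, 0.5, 0.9, 1.1, -1.6, 0.7, 1.0]
print(categorize_pld(measures))
```

Tightening `n_sd` to 2.0, as suggested above, would move the cut-off further into the tail and classify fewer children as PLD.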
Very few items were significantly different in relation to age (three items) or gender (three items). Two of the significantly different items for gender (self-regulation and perspective taking) are consistent with research that found school-aged girls demonstrated higher skills than boys on objective and teacher-reported measures (Matthews et al., 2009). In terms of gender, boys have been reported to perform significantly lower than girls on the CCC-2 (Ketelaars et al., 2010). However, given that DIF was observed for only three items, further research is required to determine whether this is related to inherent differences between the groups (impact) or to bias (i.e., that the items were working differently based on gender). In relation to ethnicity (European, Maori, or other), only seven items were found to function significantly differently between children. This finding indicates the POM may have good cross-cultural validity. However, it is important to note that the majority of the sample was European and that children were categorized into only three separate ethnic groups. Testing the psychometric properties of the POM on a sample of children from a broader range of ethnic groups is therefore a required direction for further research. Given the relative homogeneity and the size of this sample, further research with larger sample sizes and different clinical populations is required to determine whether DIF should be interpreted as impact or bias.
The POM is a measure of pragmatic language performance, and it needs to differentiate on some items across the non-verbal and verbal elements of pragmatic language performance. Further research is required to understand whether the DIF observed (p < 0.05 and DIF contrast ≥0.5 logits) on the small number of age, gender, and ethnicity items is due to impact or bias. Given the age-related development of pragmatic language skills, differences would be expected between very young children, who would not yet have developed pragmatic language skills, and older children (Hoffmann et al., 2013). However, one would not expect to find DIF between children aged 5-11 years, which was the age range of the study participants (Moreno-Manso et al., 2010; Hoffmann et al., 2013). When viewed holistically, aside from ADHD as a diagnostic category, none of the other variables evaluated showed significant DIF across various items. This indicates the POM is consistent in measuring the underlying construct of pragmatic language, regardless of the child's age, gender, ethnicity, and PLD status.

Resulting Changes to the POM
Our findings indicate that several changes to the POM are required. First, we needed to remove the four misfitting items creativity, thinking style, express feelings, and requests. Second, the standardized residual loadings indicated the presence of two dimensions. However, a more precise interpretation is that we are observing the covariance that is expected in such a complex construct (Linacre, 2016a). The two groups of items comprise verbal and non-verbal aspects of pragmatic language of equal weighting, that is, two attributes that contribute equally to the same construct (pragmatic language). Therefore, we interpret the findings as indicating that the POM is a unidimensional construct comprising two elements: the Pragmatics Observational Measure Verbal Communication Element and the Pragmatics Observational Measure Non-verbal Communication Element. These two elements replace the previous five elements of the POM, which were based on theoretical constructs. This delineation of items allows for the calculation of non-verbal and verbal pragmatic language scores for children, as well as an overall pragmatic language measure score. Third, because of the aforementioned change, revisions to the four skill-level descriptions of item 23 (assertion) were required. The previous description consisted of both verbal and non-verbal behaviors; the revised description now includes only behaviors of verbal assertion. The fourth and final change involved revising the description of the self-regulation item at each of the four skill levels in the POM. The revised item now includes a more detailed description at each skill level to better capture the complex delineation of skills that comprise self-regulation, based on a review of research into the construct of self-regulation (McClelland and Cameron, 2012; Jones et al., 2016).
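The two-element scoring structure described above can be sketched as follows. The item groupings follow the PCA split reported in the results; summing raw 1-4 ratings is our simplifying assumption for illustration (in practice, Rasch person measures would provide the interval-level scores).

```python
# Sketch of the POM-2 element scoring implied by the verbal/non-verbal split.
# Item groupings follow the reported PCA of residuals; raw summation of
# 1-4 ratings is an illustrative assumption, not the study's scoring rule.

VERBAL_ITEMS = [1, 2, 3, 4, 5, 6, 16, 17, 23, 25, 26]
NONVERBAL_ITEMS = [7, 8, 9, 10, 11, 13, 14, 15, 20, 21, 22]

def element_scores(ratings):
    """ratings: dict mapping item number -> skill level 1-4 (beginner to expert)."""
    verbal = sum(ratings[i] for i in VERBAL_ITEMS)
    nonverbal = sum(ratings[i] for i in NONVERBAL_ITEMS)
    return {"verbal": verbal, "nonverbal": nonverbal, "overall": verbal + nonverbal}

# A child rated 3 ('competent') on every item:
print(element_scores({i: 3 for i in VERBAL_ITEMS + NONVERBAL_ITEMS}))
```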

Future Directions for Research
Based on the findings from this study, there are several important areas of research for the continued development of the POM. The first is testing the reliability and validity of the measure with children in broader diagnostic and ethnic groups. This should also include a greater number of typically developing females, so that gender differences can be further examined (Mokkink et al., 2010). The revised items assertion and self-regulation necessitate that current raters be provided with additional training in the observation of verbal assertion and self-regulation skills, using the revised descriptors.
Another important area for further development is making the POM accessible to allied health professionals, such as speech pathologists, occupational therapists, and psychologists, who routinely work with children with PLD (Duncan and Murray, 2012). While the routine use of outcome measures has been mandated within allied health practice for over two decades, barriers to the use of outcome measures still exist (Duncan and Murray, 2012). Thus, a training package for clinicians in the use of the POM and in scoring and interpreting the scores should be carefully considered at both an individual and an organizational level. Such a training package will likely also need to: (1) address the organization's need to provide appropriate training, administrative support, and allocation of resources to clinicians; (2) clearly highlight to clinicians the relevance and clinical applicability of the measure to direct client care; (3) address barriers around the time taken to complete the measure, considering the institutional restrictions that affect how much time a clinician spends with each patient/client; and (4) ensure the development of clinician knowledge of pragmatic language and strategies for facilitating the pragmatic language outcomes of children (Duncan and Murray, 2012).

Limitations
The psychometric properties of the POM-2 need to be assessed in other clinical groups likely to have pragmatic language difficulties, including children with autism spectrum disorder and children diagnosed with social communication disorder. The use of a single rater is also a limitation to be considered when interpreting the results; the use of multiple raters could affect POM-2 psychometric properties such as differential item functioning and dimensionality. Future research needs to: (a) investigate the responsiveness of the POM-2; (b) consider re-evaluating the POM-2 with additional items that are easier and more difficult to rate; (c) assess the test-retest reliability of the POM-2; and (d) assess the POM with multiple raters, examining rater effect and severity using a many-facet Rasch model (Linacre, 2018).

AUTHOR CONTRIBUTIONS
RC, NM, SW-G, RS, LP, and AJ contributed to the conceptual content of the manuscript. RC and AJ performed the statistical analysis. RC wrote the first draft of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.