- 1Division of Psychology and Mental Health, School of Health Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, United Kingdom
- 2The Perinatal Mental Health and Parenting (PRIME) Research Unit, Greater Manchester Mental Health National Health Service (NHS) Foundation Trust, Manchester, United Kingdom
- 3Manchester Academic Health Sciences Centre, Manchester, United Kingdom
Background: The parent–infant relationship is important for healthy infant development. Parent–infant assessments can aid clinicians in identifying any difficulties within the parent–infant relationship. Meaningful, valid, and reliable clinician-rated measures assist these assessments and provide diagnostic, prognostic, and treatment indications. Thus, this review aimed to (a) provide a comprehensive overview of existing clinician-rated measures and their clinical utility for the assessment of aspects of the parent–infant relationship and (b) evaluate their methodological qualities and psychometric properties.
Methods: A systematic search of five databases was undertaken in two stages. In Stage 1, relevant clinician-rated parent–infant assessment measures, applicable from birth until 2 years postpartum were identified. In Stage 2, relevant studies describing the development and/or validation of those measures were first identified and then reviewed. Eligible studies from Stage 2 were quality assessed in terms of their methodological quality and risk of bias; a quality appraisal of each measure’s psychometric properties and an overall grading of the quality of evidence were also undertaken. The COnsensus-based Standards for the selection of health Measurement INstruments methodology was used.
Results: Forty-one measures were eligible for inclusion at Stage 1, but relevant studies reporting on the development and/or validation of the parent–infant assessments were identified for 25 clinician-rated measures. Thirty-one studies reporting on those 25 measures that met inclusion criteria were synthesised at Stage 2. Most measures were rated as “low” or “very low” overall quality according to the Grading of Recommendations Assessment, Development and Evaluation approach. The most promising evidence was identified for the Mother–Infant/Toddler Feeding Scale, Tuned-In Parenting Scale, and Coding Interactive Behaviour Instrument.
Conclusions: There was a notable diversity of measures that can be used to assess various aspects of the parent–infant relationship, including attunement, attachment, interaction quality, sensitivity, responsivity, and reciprocity. The quality of methodological and psychometric evidence across the reviewed measures was low, with 76% of measures having only one study supporting the measure’s development and/or validation. Thus, further research is needed to establish these measures’ psychometric properties and their suitability as assessment measures.
Introduction
Disruptions to early childhood, for example, through trauma or illness, can have a long-term impact on infant mental and physical health, developmental trajectory, and even socioeconomic standing later in life (1, 2). During the first critical year in a child’s life, the infant brain undergoes rapid development and is particularly sensitive to experiences, both positive and negative (3). The parent–infant relationship has been identified as an early life experience, crucial for the infant’s development (4, 5). As infants can recognise and respond to parental speech and cues within the first three months of life (6), parental behaviours can significantly and profoundly influence infant wellbeing (7). Inappropriate parent–infant interactions and traumatic experiences in the early period of a child’s life can impact the developing brain (8, 9) and lead to increased cortisol levels, which may later increase the risk of hyperactivity, anxiety, and attachment difficulties (10, 11). Additionally, the quality of the parent–infant relationship is known to have a significant impact on the social and emotional development of the infant as well as on cognitive and academic development (5, 12, 13). Brief periods of poorly attuned parent–infant relationships are common; however, prolonged periods of inconsistent parenting and disorganisation within the dyad can lead to maladaptive outcomes for infants (4, 14).
The impact of perinatal mental health difficulties (PMHDs) on the parent–infant relationship has been acknowledged in the literature (15–17). PMHDs occur during pregnancy or in the first year following birth, affecting up to 20% of new and expectant mothers (18). PMHDs cover a wide range of conditions, including postpartum depression, anxiety and psychosis (19). If left untreated, PMHDs can have both short- and long-term impacts on the parent, child and wider family, including transgenerational effects (20). Perinatal mental health (PMH) services (including parent–infant services) can help to ameliorate these effects. PMH services assess the parent–infant relationship and identify negative and positive aspects of parent–infant interactions (21). The assessment of the parent–infant relationship and its associated aspects, such as attachment behaviours, sensitivity, responsivity, reciprocity, and attunement, can assist clinicians in providing assessment, guidance, and, importantly, interventions, with the aim of improving maternal sensitivity, the parent–child relationship and child behaviour (22). Measurement tools are also routinely used to monitor and evaluate treatment and service effectiveness. It is therefore of critical importance for clinicians to have access to meaningful, valid, and reliable measures to assess the parent–infant relationship.
The Royal College of Psychiatrists (23) recommends several parent report measures to assess the parent–infant relationship, namely, the Postpartum Bonding Questionnaire (PBQ) (24) and the Mothers’ Object Relations Scale–short form (25) as well as clinician-rated measures, such as the Bethlem Mother–Infant Interaction Scale (26), the CARE-Index (27), the Parent–Infant Interaction Observation Scale (28), and the National Institute of Child Health and Human Development scale (NICHD) (29).
In a comprehensive review of 17 original parent report assessment measures and 13 modified versions, Wittkowski et al. (30) identified that the PBQ, in both its original and modified versions, had the strongest psychometric properties and received the highest quality of evidence ratings. Despite the potential drawbacks of using clinician-rated measures, several authors [e.g., (31–33)] have questioned the benefits of parent report measures over clinician-rated or observational measures, citing possible biases from parents regarding their child’s perceived skills, behaviours, and interactions or their tendency to respond in a socially desirable way. Wittkowski et al. (30) also noted that clinician-rated measures might not be used consistently across services, potentially due to the need for training to use the measures and any training costs, as well as supervision and capacity issues.
At least three other reviews of clinician-rated measures assessing aspects of the parent–infant relationship exist. For example, in their systematic review of 17 measures of the parent–infant interaction, Munson and Odom (34) provided good levels of detail regarding the validity and reliability of the identified assessment measures; however, they did not assess responsiveness or measurement error and, in terms of validity, they also did not assess the measures’ structural validity, thereby reducing the comprehensiveness of their results. They also drew on non-peer-reviewed sources, such as books and manuals; thus, the impact of the results in this field of research may be reduced (30). Additionally, their review, which is now 28 years old, excluded measures that used behavioural coding systems, solely assessing measures which used rating scales.
To demonstrate the appropriateness of assessing behavioural and emotional problems during infancy, Bagner et al. (35) conducted a review of both parent report questionnaires (n = 7) and observational coding or clinician-rated procedures (n = 4). Of the four observational coding measures they reviewed, the Functional Emotional Assessment Scale (36) and the Emotional Availability Scales (EAS) (37) are the most widely known ones. The authors concluded that the observational coding procedures provided more detailed and meaningful information regarding the infant (less than 12 months old) and caregiver than parent report measures. However, their review did not assess responsiveness, measurement error, or hypothesis testing for construct validity to determine the quality of the studies, potentially leading to errors in judgement when clinicians or researchers attempt to determine the best measure to use (32).
Finally, in their comprehensive review of measures rated by a trained clinician, Lotzin et al. (32) focused on 24 existing measures with more than one journal article describing or evaluating each measure. They synthesised 104 articles published between 1975 and 2012, 60.5% of which had low methodological quality; lower quality ratings were assigned when authors did not report enough detail about their study and/or used small sample sizes. Lotzin et al. (32) concluded that further studies refining the existing tools were needed with regard to content validity and consequential validity. Although they were comprehensive and thorough in their evaluation of psychometric properties across their stipulated five validity domains of (1) content, (2) response process, (3) internal structure, (4) relation to other variables and (5) consequences of assessment, Lotzin et al. (32) appeared to follow their own idiosyncratic method of assessing a measure’s validity rather than following standardised criteria.
Increasingly, systematic reviews of assessment measures (self-report and/or clinician-rated) have used the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) (38–41) tools. The COSMIN is an initiative of a team of researchers who have expertise in the development and evaluation of outcome measurement instruments. The COSMIN initiative aims to improve the selection of outcome measures within clinical practice and research (41) by developing specific standards and criteria for evaluating and reporting on the measurement properties of the outcome measures (42). For examples of reviews informed by the COSMIN criteria and guidelines, see Wittkowski et al. (43), Bentley, Hartley and Bucci (44) and Wittkowski et al. (30). These reviews did not assess clinician-rated measures.
Given the shortcomings of the abovementioned reviews by Wittkowski et al. (30), Munson and Odom (34), Bagner et al. (35) and Lotzin et al. (32), there is now a clear need for a systematic, transparent, comprehensive, COSMIN-informed review of relevant measures in this field. Thus, the aim of this systematic review was to assist practitioners and researchers in identifying the most suitable measures to use in their clinical practice or research by providing an overview (in Stage 1) and evaluation (Stage 2) of the current existing clinician-rated assessment measures of the parent–infant relationship, including its specific aspects such as attachment behaviours, sensitivity, responsivity, reciprocity, and attunement. The following questions were examined in this review:
1. What assessment measures exist for clinicians to assess the parent–infant relationship in the perinatal period?
2. Which measures demonstrate the best clinical utility, methodological qualities, and psychometric properties?
Methods
This systematic review, registered with the PROSPERO database (www.crd.york.ac.uk/prospero; registration number CRD42024501229), was conducted in accordance with the COSMIN tools (38, 41, 42) and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (45). The methodology, which was specifically developed and validated for use in reviews of patient-reported outcome measures (PROMs) (38), can be adapted and used for other types of outcome measures, for example, those in which opinions on the parent–infant relationship are not self-reported but instead are evaluated by clinicians (clinician-reported outcome measures or ClinROMs) (40). The first author acted as the main reviewer but received support and supervision from the other two authors.
Search strategy
A search was conducted in two stages: 1) to identify which parent–infant assessments exist for clinicians to use and 2) to identify studies describing the development and/or validation of each identified measure. The following databases were searched for both stages: PsycINFO (Ovid), Cumulative Index of Nursing and Allied Health Literature (CINAHL), Excerpta Medica database (EMBASE, Ovid), Medical Literature Analysis and Retrieval System Online (MEDLINE, Ovid), and Web of Science.
Stage 1 of the search involved designing a search strategy to identify and retrieve studies of relevance to the development and/or use of clinician-rated measures of the parent–infant relationship. As recommended by the COSMIN guidelines for systematic reviews (41), this initial search was first piloted and then, after further refinement with a university librarian, the final Stage 1 search was completed in November 2023. Searches using Ovid (MEDLINE, EMBASE and PsycINFO) were limited to abstracts, English language and “humans.” CINAHL and Web of Science did not offer these limit options. Six search categories were developed, which were combined using the Boolean operator “AND.” The instruction “OR” was applied within each category and, when relevant, wildcard asterisks were used to capture related terms (Table 1). At Stage 1, all articles were screened based on abstract and title review and those mentioning parent–infant assessment measures were examined for full-text review. Each identified measure was assessed for eligibility against the inclusion and exclusion criteria.
To ensure the reliability of this review process, an independent reviewer (E.W.) double screened 10% of randomly selected papers from Stage 1. Cohen’s kappa and the percentage of inter-rater agreement were calculated, with good agreement (κ = .80, p <.001; 98.20%) (46).
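The agreement statistics reported above can be illustrated with a short sketch. This is not the authors' code; it is a minimal, hypothetical Python implementation of Cohen's kappa and percentage agreement for two reviewers' include/exclude screening decisions, with the function name and example labels invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa and raw percentage agreement for two raters'
    decisions over the same set of records (illustrative sketch)."""
    n = len(rater_a)
    # Observed agreement: proportion of records both raters coded identically
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal category frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return kappa, observed * 100

# Hypothetical screening decisions for four records:
k, agreement = cohens_kappa(["in", "in", "out", "out"],
                            ["in", "out", "out", "out"])
# observed = 0.75, chance = 0.50, so kappa = 0.5 and agreement = 75.0
```

In practice a statistics package (e.g., `sklearn.metrics.cohen_kappa_score`) would typically be used; the sketch simply makes the two reported quantities (κ and percentage agreement) explicit.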
At Stage 2, any relevant measures identified in Stage 1 were searched for in the same databases to identify any studies describing each measure’s initial development and/or validation. This search was conducted in December 2023 and later updated in early 2024. The following terms were searched: “Relationship” OR “Interaction” OR “Dyad” OR “Bond” OR “Sensitivity” OR “Responsiveness” OR “Attachment” OR “Attunement” OR “Reflexivity” OR “Adjustment” OR “Behaviour” AND the measure’s name OR abbreviation. Studies identified in Stage 2 were reviewed based on title and abstract; studies were assessed for eligibility by examining their full text, and their reference lists were checked for additional studies.
Eligibility criteria of measures and studies
At Stage 1, measures were included if they were developed for clinicians to assess or rate the parent–infant relationship or a specific aspect of this relationship (e.g., attachment, reciprocity, attunement, bonding, parental sensitivity, and emotional regulation) (12). For the purpose of this review, we used the following definitions of the parent–infant relationship to help guide the identification of suitable measures: “Parent–infant relationships refer to the quality of the relationship between a baby and their parent or carer” [(46), p. 2] and “the connection or bond created between the parent and infant through the exchange of behaviours and emotion communicated between both parties” [(47), p. 3]. Thus, we included measures of interaction between the parent and their infant if it was a reciprocal exchange. The CARE-Index was also pre-determined to be included because its utility in assessing parent–infant interactions has been demonstrated in research into attachment behaviours (27, 48, 49). The CARE-Index has also been recognised in other systematic reviews of parent–infant assessment measures, including by Lotzin et al. (32) and the Royal College of Psychiatrists (23).
Measures were included only if they were applicable for use with an infant from birth up to the age of 2 years, which is defined as the perinatal period by the NHS Long Term Plan in the UK (50) and sometimes also referred to as the first 1,001 days (51). In the perinatal period, any difficulties within the parent–infant relationship should be identified as early as possible so that future interventions or treatment decisions can be made (52). Measures were excluded if they were designed to assess a related but different concept (e.g., “parenting style” or “attitudes to pregnancy”). Measures were also excluded if full-text studies could not be accessed or if they assessed the parent–infant relationship as part of a subscale in a longer inventory.
At Stage 2, studies were included if they 1) described the initial development and/or validation of an identified measure, 2) included data pertaining to an attempt to validate and/or to test the psychometric properties of the measure and this was stated in the aims of the study, and 3) were published in a peer-reviewed journal in order to ensure consistently high-quality studies were used (53). Studies were excluded if they were not written in English and/or were reported only in theses, dissertations, or conference abstracts. We also excluded any measure for which we could not identify any studies describing the psychometric evaluation of that measure.
Quality assessment of the studies included after Stage 2
The COSMIN Risk of Bias Tool (40), an extended version of the COSMIN Risk of Bias checklist (38), was used to assess the methodological quality of studies identified at Stage 2 and subsequently to determine each study’s overall risk of bias. Figure 1 reflects the recommended 11-step procedure for conducting a systematic review on any type of outcome measure instrument outlined in the COSMIN Risk of Bias Tool user manual (40). The manual was developed to assess the quality of studies of all types of outcome measure instruments (including ClinROMs) and designed to be incorporated into the COSMIN methodology (40). It differs from the COSMIN Risk of Bias checklist in that it includes boxes for assessing reliability and measurement error. Furthermore, Step 8 (“Evaluate interpretability and feasibility”) was removed from the Risk of Bias Tool because interpretability and feasibility do not refer to the quality of the ClinROM (40). Interpretability and feasibility were instead extracted and summarised within a descriptive characteristics table. In this table, we included ease of administration (with regard to home or laboratory observations and time required to complete observations), associated costs and interpretability of scores. Both the COSMIN Risk of Bias Tool (40) and the COSMIN criteria (41, 42) are based on the COSMIN taxonomy for measurement properties, and these criteria are generally agreed to be the gold standard for evaluating measures in the context of a systematic review, ensuring standardisation across papers (39). The COSMIN guidelines recommend the following stages for assessing the quality of an outcome measure, outlined in Figure 1 as Parts A, B, and C.

Figure 1. Diagram of the 11 steps for conducting a systematic review on any type of outcome measure instrument.
Part A: quality appraisal for the methodological quality for each measurement property and risk of bias assessment across each study
The first steps in assessing the methodological quality and risk of bias of the included studies were based on the Terwee et al. (42) COSMIN criteria and the Mokkink et al. (40) COSMIN Risk of Bias Tool. A COSMIN evaluation sheet (see Appendix A in Supplementary Materials) was adapted for this review to include comprehensibility (from the clinician’s point of view) because this was more applicable for clinician-rated measures (54). Content validity was assessed in terms of relevance, comprehensiveness and comprehensibility using Terwee et al.’s (42) updated criteria. Each measurement property is outlined in Table 2.
Each study was assessed for methodological quality and was rated using the COSMIN scale’s 4-point scoring system (4 = “very good,” 3 = “adequate,” 2 = “doubtful,” 1 = “inadequate”). An overall score for a study’s risk of bias was then determined by taking the lowest rating among all criteria for each category, known as the “worst score counts” method. This method was followed because poor methodological qualities should not be compensated for by good qualities (42).
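The “worst score counts” rule can be expressed in a few lines. The following is an illustrative Python sketch, not part of the COSMIN tooling; the function and rating labels simply mirror the 4-point scale described above, and the example criterion ratings are hypothetical.

```python
# COSMIN 4-point scoring system, as described in the text
RATINGS = {"inadequate": 1, "doubtful": 2, "adequate": 3, "very good": 4}

def worst_score_counts(criterion_ratings):
    """Overall risk-of-bias rating for a study: the lowest rating
    across all individual criteria ("worst score counts")."""
    return min(criterion_ratings, key=lambda r: RATINGS[r])

# A hypothetical study rated on three criteria:
overall = worst_score_counts(["very good", "adequate", "doubtful"])
# overall is "doubtful": the single weakest criterion determines the score
```

The design choice reflects the rationale stated above: one “very good” criterion cannot compensate for a methodologically weak one, so the minimum, rather than an average, is taken.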
Part B: quality appraisal of the psychometric properties of each measure
The main reviewer appraised the quality of the reported results in terms of psychometric properties for each measure. Each of the eight psychometric properties (except content validity) was rated as “sufficient” (+) if the results were determined to provide good evidence of a measure exhibiting this property. An “indeterminate” (?) rating was assigned if results were not consistent, not reported, or appropriate tests had not been performed, and an “insufficient” (−) rating was assigned when appropriate tests had been performed but the results fell below the COSMIN checklist’s standards.
Content validity (i.e., in terms of relevance, comprehensiveness and comprehensibility) was rated as either sufficient (+), insufficient (−), inconsistent (±) or indeterminate (?). A subjective rating regarding content validity was also considered (41). The evaluated results of all studies for each measure were summarised. The focus at this stage changed to the measures, whereas in the previous substeps, the focus was on the individual studies.
Part C: quality grading of the evidence
The strength of evidence for each category for each measure was determined based on the methodological quality and risk of bias (Part A) and the psychometric properties (Part B). The main reviewer utilised the modified Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach (38) to assess the quality of the evidence provided for each measure (Table 3). Detailed information on the GRADE approach can be found in the COSMIN user manual (38, 41, 42). As per COSMIN guidance, if studies were rated as being “inadequate” overall (Part A), the GRADE rating of “very low” was given for the content validity categories. If studies were rated as being of “doubtful” quality overall, a GRADE rating of “low” was given for content validity categories (42). COSMIN guidelines recommend that studies determined to be “inadequate” should not be rated further. However, in order to gain a comprehensive overview of each measure, we rated all studies in full.
As current COSMIN criteria do not include guidance regarding the rating of exploratory factor analysis (EFA), the criteria for assessing structural validity were adapted, comparable to Wittkowski et al. (30). EFAs were rated as “sufficient” if > 50% of the variance was explained (55) and studies using EFA could only be rated as “adequate” rather than “very good” for risk of bias.
When confirmatory factor analysis (CFA) was also reported alongside EFA, the lower quality evidence of EFA was ignored and the study was rated according to the CFA results reported. If the percentage of variance accounted for and/or model fit statistics were not reported in studies, an “indeterminate” rating was given.
The GRADE approach to rating results also takes into consideration the risk of bias, inconsistency (unexplained inconsistency of results across multiple studies), imprecision (total sample size in the studies) and indirectness (evidence from different populations than the population of interest) (40, 41). The GRADE approach follows the assumption that all evidence is of high quality to begin with. The quality of the evidence is subsequently downgraded to “moderate,” “low,” or “very low” when there is a risk of bias, unexplained inconsistencies in the results, imprecision (a total sample size of fewer than 100 or fewer than 50 participants) or indirect results (41).
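The start-high-then-downgrade logic described above can be sketched as follows. This is a deliberately simplified, hypothetical Python illustration: it downgrades one level per flagged concern, whereas the actual COSMIN/GRADE rules are more nuanced (e.g., severe imprecision can downgrade by more than one level).

```python
# Evidence levels in descending order, as described in the text
LEVELS = ["high", "moderate", "low", "very low"]

def grade_evidence(concerns):
    """Start at 'high' and downgrade one level per flagged concern
    (risk of bias, inconsistency, imprecision, indirectness),
    bottoming out at 'very low'. Simplified illustration only."""
    return LEVELS[min(len(concerns), len(LEVELS) - 1)]

# Hypothetical examples:
grade_evidence([])                              # "high"
grade_evidence(["risk of bias", "imprecision"]) # "low"
```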
Results
Review process
At Stage 1, 5,974 papers were identified (see Figure 2 for details). After removing duplicates at this stage, the titles and abstracts of 5,328 records were screened. The full texts of 329 papers were examined against the inclusion and exclusion criteria, leading to the identification of 41 potentially eligible parent–infant measures.
At Stage 2, with the titles of the identified measures as the search terms, 11,464 records were identified. After removing 2,810 duplicates, 8,654 records were screened, with 8,573 records subsequently excluded based on title and abstract review. This process resulted in 81 full text articles, which were assessed for eligibility at Stage 2.
All decisions regarding inclusion and exclusion of studies and measures were discussed by all authors and any discrepancies were resolved (for a list of excluded measures, please see Appendix B in Supplementary Materials). After a detailed and comprehensive assessment of the identified studies from Stage 2, 31 studies describing the development of and/or validation of 25 measures were included in this review.
Study characteristics
After completion of Stage 2, the publication dates of the included studies ranged from 1978 to 2023 and sample sizes ranged from ten (56) to 838 participants (57). The greatest number of studies came from the United States of America (USA; n = 19), with the remaining studies from the United Kingdom (UK; n = 4), Australia (n = 3), Denmark (n = 1), Germany (n = 1), the Netherlands (n = 1), Peru (n = 1), and Switzerland (n = 1). Studies were conducted using either a non-clinical sample (n = 17) or a clinical sample (n = 9), or both a clinical and non-clinical comparison sample (n = 5). Further details on measure development, aspects of clinical utility and characteristics of each of the included measures and studies are provided in Table 4.

Table 4. Overview of the 25 included parent–infant assessment measures (presented in alphabetical order).
Overview of identified clinician-rated measures
All measures covered infancy (i.e., from birth to 2 years of age), but some were designed for use with children up to 14 years old, such as the EAS (37). The measure with the narrowest age range was the Family Alliance Assessment Scales for Diaper Change Play (FAAS-DCP) (63), which was suitable for use with infants only within the first three weeks of life. The Parent Infant Interaction Observation Scale (PIIOS) (50) could only be used with infants aged two to seven months.
The AMIS, FAAS-DCP, and LPICS were applicable only for use with infants under 3 months of age. Six measures (BMIS, DMC, Monadic Phases, MRS, PIIOS, and PIPE) were applicable only for use with infants younger than 12 months. A further seven measures (Attachment Q sort, CIB, MACI, MRO, PIIOS, PIOG, and PIPE) were not applicable for use with newborns.
The AMIS, CIB, DMC, IPSIC, MACI, PCERA, and PIIS assessed the parent, infant, and the dyad. Ten measures (Attachment Q sort, BMIS, CARE-Index, EAS, M-C ADS, Monadic Phases, MRS, NCATS, PIOG, and PIRAT) required clinicians to assess the parent and infant separately. Eight measures (FAAS-DCP, LPICS, M-I/TFS, MRO, PIIOS, PIPE, PIRGAS, and TIP-RS) assessed only the dyad/triad. All measures included multiple subscales, which ranged from three (PIRGAS) to 25 subscales (AMIS). The number of items used in the measures ranged from four items (PIPE) to 111 items (MRS). The length of time required to complete each measure ranged from a “brief” 2-min game in the PIPE to 6–8 h of observations for the Attachment Q sort.
Sixteen measures (64%) required the clinician to use videotaped recordings to code the observed relationship, so video recording equipment was required. Seven measures (Attachment Q sort, BMIS, PIOG, PIPE, PIRAT, PIRGAS, and TIP-RS) were designed to be completed as live observations of the interactions (no video recording required) and two measures (EAS and NCATS) could be completed live or by using videotaped recordings. Thirteen measures (52%) were designed to be completed either in home or in clinical (including laboratory) environments. Ten measures were designed to be completed in clinical environments only. Two measures (CIB and IPSIC) were designed to be completed at the home of the family being assessed.
All 25 measures assessed the parent–infant relationship in terms of expected relationship characteristics, namely, perceived sensitivity and reciprocity (PIIS), sensitivity and responsiveness (LPICS, PIIOS), reciprocity (NCATS, PIPE), synchrony (DMC), sensitivity and synchrony (EAS), mutual responsivity (MRO), facial expressions (Monadic Phases, MRS), quality of the interactions (BMIS, CIB, FAAS-DCP, IPSIC, MACI, PCERA, PIOG, PIRGAS, TIP-RS), risk (PIRAT) and attachment/attachment behaviours (Attachment Q sort, CARE-Index, M-C ADS). The AMIS and the M-I/TFS were designed to assess the parent–infant relationship, with a focus on interactions in a feeding context.
In terms of costs, training requirements and access to the measures’ scale, manual and training courses, 15 measures (60%) required the user to be trained in using the measure. However, seven of these measures (Attachment Q sort, DMC, FAAS-DCP, LPICS, MACI, NCATS, and PIPE) required the user to complete training but offered no further information on how to access this training. The M-C ADS required self-study of the published, free-to-access manual as training, and the IPSIC required training information to be requested from the measure authors. The AMIS and BMIS did not require the user to be trained to use the measure. For eight measures (Monadic Phases, M-I/TFS, MRO, MRS, PCERA, PIIS, PIOG, and TIP-RS), it was unclear if training was required. The CARE-Index, CIB, EAS, PIIOS, PIRAT, and PIRGAS had training courses accessible via online websites. Of these, only three listed the costs of their training courses: the CARE-Index (£850–£1,050), the CIB ($2,500), and the PIIOS (£450). For the remaining measures, we could not find information detailing the costs of accessing the training courses. Where training requirements were available, the time required to complete the training course ranged from 4 h for the PIRGAS to nine days for the CARE-Index.
Eight measures (32%) and/or their scoring sheets were freely accessible in the original development or validation study (AMIS, Attachment Q Sort, BMIS, IPSIC, M-C ADS, Monadic Phases, MRS, and PIPE). For the CARE-Index, CIB, EAS, NCATS, PIIOS, PIRAT, and PIRGAS, costs were involved for manual and scale access; the amounts required were unknown for the EAS, NCATS, PIRAT, and PIRGAS. The IPSIC scale and manual could be requested through the measure authors. The M-C ADS had a freely accessible published manual online. The FAAS-DCP and MACI both had manuals, but access to these could not be located. For fourteen measures, no mention of a manual was made in the studies or could be found online (AMIS, Attachment Q Sort, BMIS, DMC, LPICS, Monadic Phases, M-I/TFS, MRO, MRS, PCERA, PIIS, PIOG, PIPE, and TIP-RS).
Overview of the quality of measurement properties assessed
Thirty-one studies pertaining to the 25 measures were assessed. Table 5 provides the overall evidence ratings for each measure for Parts A, B, and C. The overall risk of bias of each study was evaluated through the “worst score counts” method. Only one study (76) received an “adequate” rating (evaluating the M-I/TFS) for overall risk of bias. Nine studies were rated as “doubtful” (evaluating the BMIS, CARE-Index, CIB, DMC, EAS, M-C ADS, PIOG, and TIP-RS). Twenty-one of 31 studies (67.7%) received overall scores of “inadequate” in terms of risk of bias. With regard to the quality of evidence reported, only one measure, the M-I/TFS, received a “high” rating for quality of evidence reported. Ten measures were assigned “moderate” ratings for at least one measurement property assessed (BMIS, CARE-Index, DMC, MACI, M-I/TFS, MRO, NCATS, PCERA, PIIOS, TIP-RS). In terms of final overall evidence, 15 measures were assigned “very low” ratings. Of these, four measures were assigned “very low” ratings for nine out of ten measurement properties (IPSIC, LPICS, MRS, and PIRAT). Nine measures were assigned “low” ratings for final overall evidence (CARE-Index, CIB, DMC, EAS, M-C ADS, MRO, PIIOS, PIOG, TIP-RS).
Assessment of validity
Content validity
Because studies received very different scores for content validity, and because for some measures no content validity studies were identified, the relevance, comprehensiveness, and comprehensibility ratings were reported separately (see Terwee et al.’s (42) criteria for assessing content validity). Fifteen of 31 studies (48.4%) reported evaluating the “relevance” of the measure’s items. Only one study, using the Attachment Q sort (62), received a “very good” rating for “relevance” in terms of methodological quality. Fourteen studies were rated “adequate” for methodological quality, and the remaining 16 studies were rated “doubtful” because insufficient evidence was given as to whether “relevance” was assessed by the study authors.
With regard to “comprehensiveness,” 14 studies (45.2%) reported evaluating this aspect: three studies received a “very good” rating (M-I/TFS (77), PIOG (99), and PIRAT (91)) in terms of methodological quality. Eleven studies were rated “adequate,” 16 studies received a “doubtful” rating, and one study (LPICS (68)) received an “inadequate” rating. Ten studies (32.3%) reported assessing “comprehensibility,” but only two studies were rated “very good” (PIOG (99), PIRAT (91)) for methodological quality. Eight studies were rated “adequate,” and the remaining 21 studies were assigned “doubtful” ratings because the study authors gave too little information to assign any higher rating.
In terms of quality appraisal of the psychometric properties in the reported results, relevance, comprehensiveness and comprehensibility were again evaluated separately. With regard to the quality appraisal of findings for relevance, nine measures received “sufficient” (+) ratings (CARE-Index, DMC, LPICS, MACI, M-C ADS, M-I/TFS, Monadic Phases, MRO, PIIS). Two measures received “inconsistent” ratings (±), the BMIS and NCATS.
The “inconsistent” ratings arose because relevance, comprehensiveness, and/or comprehensibility was “sufficient” in one study but “insufficient” in another, so the content validity ratings differed between two studies evaluating the same measure. The remaining 14 measures received “indeterminate” (?) ratings because many of the studies failed to report enough information in the results to meet a “sufficient” rating. In terms of the quality appraisal of results for “comprehensiveness,” nine measures received “sufficient” ratings (AMIS, Attachment Q sort, DMC, FAAS-DCP, IPSIC, M-I/TFS, PIOG, PIRAT and TIP-RS). Four measures received “insufficient” ratings (BMIS, CARE-Index, Monadic Phases, and PIPE), whereas the remaining 12 measures received “indeterminate” ratings. Finally, in terms of “comprehensibility,” 11 measures (AMIS, Attachment Q sort, BMIS, CARE-Index, DMC, EAS, IPSIC, M-I/TFS, PIOG, PIRAT, and TIP-RS) received “sufficient” ratings. One measure (NCATS) received an “inconsistent” rating. The remaining 13 measures received “indeterminate” ratings.
In the final step, the scores assigned for both methodological quality and psychometric properties of a measure were rated using the GRADE approach. As per the COSMIN criteria, if a study received an “inadequate” risk of bias rating, then the measure evaluated in that study received a “very low” GRADE rating for relevance, comprehensiveness and comprehensibility. If a study was rated as “doubtful” quality, the content validity ratings were of “low” quality. Therefore, only one measure, the M-I/TFS, was rated as “moderate” quality of evidence for content validity, due to receiving an “adequate” overall score for risk of bias. Seven measures (the CARE-Index, CIB, DMC, EAS, M-C ADS, PIOG, and TIP-RS) were rated as “low” quality evidence, and 17 measures were rated as “very low” quality of evidence for relevance, comprehensiveness and comprehensibility according to the GRADE approach.
Structural validity
Two studies (evaluating the EAS and MRO) were rated “very good” for methodological quality, and six studies (evaluating the CIB, FAAS-DCP, M-I/TFS, PCERA, PIOG, TIP-RS) were rated “adequate,” as most of these studies used EFA; studies using EFA could be rated only as “adequate” rather than “very good” for methodological quality. The remaining studies were rated “doubtful” because they provided no information on the assessment or consideration of structural validity.
Structural validity was assessed in studies for only seven of the 25 measures (28%). Only the FAAS-DCP, PCERA and PIOG were assigned a “sufficient” rating for quality appraisal; EFA was used to assess their structural validity. The CIB, EAS, MRO and TIP-RS were assigned “insufficient” ratings: all four of these measures had studies reporting on structural validity using CFA, and all four sets of reported results were “insufficient” against the COSMIN criteria. The remaining 18 measures were assigned “indeterminate” ratings because not enough information on structural validity was reported to meet the criterion for either a “sufficient” or “insufficient” rating. The quality of the evidence ranged from “moderate” to “very low” for this measurement property.
Hypothesis testing for construct validity
Seventeen of the 25 measures (68%) had studies reporting information regarding construct validity. Four studies (assessing the BMIS, EAS, M-C ADS, M-I/TFS) received “very good” ratings for methodological quality and 17 studies received “adequate” ratings. Five studies (assessing the AMIS, NCATS, PIIOS, PIOG, PIRAT) received “inadequate” ratings. The remaining five studies received “doubtful” ratings.
In terms of quality appraisal of the psychometric properties only the CARE-Index, DMC and the M-I/TFS were assigned “sufficient” ratings. Fourteen measures were assigned “insufficient” ratings and the remaining eight measures were assigned “indeterminate” ratings (if hypotheses could not be defined by the review team). Gradings of “high” to “very low” were given for the quality of evidence for this measurement property.
Criterion validity
The assessment of criterion validity was reported in studies for 13 measures (52%). Four studies were assigned “very good” ratings for methodological quality (the BMIS, Monadic Phases, NCATS, and TIP-RS). Nine studies were assigned “adequate” ratings for methodological quality and six studies were assigned “inadequate” ratings. The remaining 12 studies were assigned “doubtful” ratings.
With regard to the quality appraisal of the measures’ reported psychometric properties, only the PIIOS was assigned a “sufficient” rating for criterion validity. Twelve measures were assigned “insufficient” ratings and the remaining 12 measures were assigned “indeterminate” ratings. The quality of the evidence was graded “moderate” to “very low” for this measurement property.
Assessment of reliability
Internal consistency
In terms of internal consistency, 14 studies were assigned “very good” ratings and five studies were assigned “adequate” ratings regarding methodological quality. The remaining 12 studies were assigned “doubtful” ratings, with no studies receiving a rating of “inadequate.”
Internal consistency was reported in studies for 16 of the 25 measures (64%). However, the COSMIN criteria stipulate that measures that do not demonstrate at least “low” evidence of “sufficient” structural validity can be rated as “indeterminate” only. Therefore, only the CIB, FAAS-DCP and PCERA were rated “sufficient” for psychometric evidence of internal consistency; the PIOG was assigned an “insufficient” rating, and the remaining 21 measures were rated “indeterminate.” For internal consistency, the quality of the evidence was graded “moderate” to “very low.”
Reliability
Reliability of the measures was reported in studies for 20 of the 25 measures (80%). Only four studies received “adequate” ratings for methodological quality (studies reporting on the BMIS, CIB, FAAS-DCP, and M-I/TFS). Three studies reporting on the Attachment Q sort, MACI, and PIRAT were assigned “inadequate” ratings. The remaining studies received “doubtful” ratings.
Regarding quality appraisal of psychometric properties, seven measures were assigned “sufficient” ratings, comprising the AMIS, Attachment Q Sort, CIB, EAS, M-I/TFS, MRO, and PIPE. Thirteen measures were assigned “insufficient” ratings. The remaining five measures (CARE-Index, Monadic Phases, MRS, PIIS, and PIRGAS) were assigned “indeterminate” ratings. The quality of the evidence was graded “moderate” to “very low” for this measurement property.
Measurement error
Three measures (12%) had studies reporting on measurement error. With regard to methodological quality, only two studies were rated as “adequate” (M-C ADS and M-I/TFS). Six studies were rated as “inadequate” (for the MACI, MRS, PIOG, PIRAT, and PIRGAS). The remaining 23 studies were assigned ratings of “doubtful” for methodological quality.
With regard to quality appraisal of psychometric properties ratings, three measures (the M-C ADS, NCATS, and PIOG) were assigned “insufficient” ratings as per the COSMIN criteria for measurement error. The remaining 22 measures were assigned “indeterminate” ratings. The quality of the evidence was graded “moderate” to “very low” for this measurement property.
Responsiveness
In terms of responsiveness, three studies reporting on the MACI, M-I/TFS and PIIS were assigned “very good” ratings for methodological quality. Seven studies received “adequate” ratings (for the BMIS, CARE-Index, DMC, MRS, MRO, NCATS, and the TIP-RS) and 12 studies received “inadequate” ratings for methodological quality. The remaining nine studies were rated “doubtful” quality.
Fifteen of the 25 measures (60%) had studies reporting information for responsiveness. Only the CARE-Index, MACI, and the M-I/TFS were assigned “sufficient” ratings for quality appraisal of the reported psychometric properties. Twelve measures were deemed to have “insufficient” information to meet the COSMIN criteria for a rating of “sufficient,” and the remaining ten measures were assigned “indeterminate” ratings. Finally, with regard to responsiveness, the quality of the evidence was graded “high” to “very low.”
Inter-rater reliability
To ensure inter-rater reliability and quality of the ratings, an independent researcher undertook quality ratings for 25% of identified papers describing the measures. An exact agreement of 87.5% was achieved for the quality ratings, with any discrepancies resolved through discussion.
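Exact agreement of this kind is simply the proportion of items rated identically by the two raters. A minimal sketch in Python, using hypothetical ratings rather than the review’s actual data:

```python
def exact_agreement(rater_a, rater_b):
    """Percentage of items on which two raters assigned identical ratings."""
    if len(rater_a) != len(rater_b):
        raise ValueError("rating lists must be the same length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)

# Hypothetical quality ratings from two reviewers over eight items;
# they disagree on one item, giving 87.5% exact agreement.
print(exact_agreement([2, 1, 3, 2, 1, 1, 3, 2],
                      [2, 1, 3, 2, 1, 1, 3, 1]))  # 87.5
```

Exact agreement is stricter than, and complementary to, chance-corrected statistics such as Cohen’s kappa, which are often reported alongside it for ordinal quality ratings.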
Discussion
This review systematically identified 25 clinician-rated parent–infant assessments and comprehensively examined their psychometric properties and overall quality, informed by the COSMIN criteria. A previous review by Munson and Odom in 1996 (34) identified 17 clinician-rated parent–infant assessments, of which only five met the inclusion criteria for the current review. A review completed by Bagner et al. in 2012 (35) identified four clinician-rated parent–infant assessments, of which three were included in this review. In 2015, Lotzin et al. (32) reviewed 24 clinician-rated parent–infant assessment measures, of which eight were not included in this systematic review. These differences could be attributed to differences in inclusion and exclusion criteria; for example, Munson and Odom (34) drew on book chapters for information, rather than peer-reviewed journals. Bagner et al. (35) did not offer detailed information on their search strategy or their inclusion/exclusion criteria. Lotzin et al. (32) only included measures with more than one study describing the development and/or validation of the measure, whereas the current review included measures even if only one relevant study was identified. These methodological differences are important to consider because they likely explain why different assessment measures were ultimately included in each review.
The measures identified assessed the parent–infant relationship across the age range from birth to two years, some within specific contexts, such as during a feed. Two measures (AMIS, LPICS) could be used only within a short period, from birth to 3 months old. The FAAS-DCP and the PIIOS had very strict periods of use, from birth to 3 weeks and from 2 to 7 months, respectively. The AMIS and M-I/TFS could be used in specific feeding contexts only. Thus, these measures can be used only in very specific circumstances and may not be applicable for wider implementation in perinatal services.
All 25 measures assessed the parent–infant relationship in terms of expected relationship characteristics, with the most common focus being the perceived quality of the parent–infant interaction (nine measures focused on this). Over half (60%) of the measures required the clinician to complete further training to use the measure; however, training courses or information could not be located for seven of these measures. Furthermore, only eight measures offered free access to the scales, and for 14 measures no manual could be found in the included studies or online.
The COSMIN criteria are considered stringent (39); as a result, some measures for which adequate psychometric properties were reported were assigned scores that fell short of the COSMIN requirements. The M-I/TFS, followed by the TIP-RS and CIB, demonstrated the most promising evidence overall. However, the M-I/TFS, the measure that demonstrated the best psychometric properties, is limited in its uses because it applies to a feeding context only. Thus, its utility across other specific contexts (e.g., structured play, i.e., a play interaction guided or structured by the caregiver) is limited. Twenty-one of 31 studies (67.7%) received overall risk-of-bias ratings of “inadequate.” Structural validity was reported variably across the studies. All studies reporting use of EFA to assess structural validity were assigned “sufficient” ratings; all studies that used CFA were assigned “insufficient” ratings. The most frequently assessed measurement property was reliability, reported in studies for 80% of the measures. With regard to internal consistency, only three measures (CIB, FAAS-DCP, PCERA) ultimately received “sufficient” ratings, because the remaining measures did not meet the COSMIN criterion of at least “low” evidence of structural validity required for a “sufficient” rating. No measure received a “sufficient” rating for measurement error, and only one measure (PIIOS) was assigned a “sufficient” rating for criterion validity. With regard to the strength of evidence, the majority of measures were assigned “very low” and “low” ratings using the GRADE approach. Only one measure, the M-I/TFS, scored “moderate” for overall evidence; it was also the only measure to score “high” on two measurement properties according to the GRADE approach. However, this measure still scored “low” for strength of evidence across four other measurement properties.
Consequently, our recommendations regarding the use of each identified measure are cautiously provided and clinicians should be aware of the quality disparities across assessment measures. These novel findings are important, because they extend knowledge as to the quality of the parent–infant assessments that are in use. This review highlights the importance of transparency in reporting and the need for more detailed accounts of psychometric properties of measures.
Considerations relating to content validity
Content validity has been argued to be the most important psychometric property (38, 42); the relevance, comprehensiveness, and comprehensibility of a measure can be an important contributing factor when a clinician is deciding whether to use a measure for clinical or research purposes (42). Content validity was most often demonstrated within studies by a variable description of a theory-driven method or a review of relevant literature driving the development of individual items or subscales, but detail was often lacking. Authors rarely reported involving professionals or participants from the target population in the development of the measure. Many authors failed to mention necessary details of how they had conducted any evaluations of content validity; as a result, applying the stringent COSMIN criteria for content validity resulted in many studies being rated “low” or “very low” for overall quality of evidence. This is an important finding: experts (by profession, or via lived experience) should be involved in the development or adaptation of measures to improve content validity. More detailed evaluations of the content validity of these assessment measures should be prioritised in future research to increase confidence in the measures (100, 101).
Considerations relating to structural validity
Although validity evidence based on internal structure is essential to support the use of an outcome measure (102), 23 studies (74.2%) were assigned ratings of “doubtful” for this in terms of methodological quality. Only three measures, the FAAS-DCP, PCERA and PIOG, showed “sufficient” evidence of structural validity with “moderate” or “low” quality of evidence. All three used EFA to assess structural validity. Of the four measures that showed “insufficient” evidence (the CIB, EAS, MRO, and TIP-RS), all used CFA to assess structural validity. This finding is important because it adds weight to Schmitt et al.’s (103) observation that misconceptions exist among researchers about whether to use CFA, EFA, or a combination of both factor analytic approaches, and that researchers often mistakenly use CFA methods when EFA may be more appropriate.
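To illustrate what an exploratory extraction looks like in practice, the sketch below runs a crude EFA-style analysis in Python on simulated data with two latent dimensions. It uses principal-component extraction from the item correlation matrix as a simple stand-in for a full EFA (a real analysis would add rotation and a factor-retention criterion); all data and names are hypothetical.

```python
import numpy as np

def extract_loadings(X, n_factors):
    """Crude EFA-style extraction: eigendecompose the item correlation
    matrix and scale the top eigenvectors by sqrt(eigenvalue).
    (Principal-component extraction; a full EFA would add rotation.)"""
    R = np.corrcoef(X, rowvar=False)              # item-by-item correlations
    vals, vecs = np.linalg.eigh(R)                # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_factors]    # keep the largest factors
    return vecs[:, order] * np.sqrt(vals[order])  # loadings matrix (items x factors)

# Hypothetical data: 200 dyads scored on 6 items; items 0-2 are driven by
# one latent dimension and items 3-5 by another.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 200))
X = np.column_stack([f1 + rng.normal(scale=0.5, size=200) for _ in range(3)] +
                    [f2 + rng.normal(scale=0.5, size=200) for _ in range(3)])
L = extract_loadings(X, n_factors=2)
print(L.shape)  # (6, 2)
```

With this structure, each item’s communality (sum of squared loadings across the two factors) is high, reflecting that two factors account for most of the shared item variance.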
Considerations relating to construct validity
Only the CARE-Index, DMC and M-I/TFS were assigned “sufficient” ratings in terms of quality appraisal, despite 17 measures (68%) having studies reporting information regarding this measurement property. Developing more rigorous assessments of construct validity is important because misattributing or misidentifying what a measure assesses can lead to inaccuracies in measurement (104). Therefore, we identified a need to establish the construct validity of outcome measures comprehensively and to improve the transparent reporting of construct validity so that clinicians and researchers can make accurate and informed decisions.
Considerations relating to criterion validity
Although many authors assessed criterion validity by comparing the measure against a “gold-standard” clinician-rated measure of the parent–infant relationship, such as the CARE-Index (as described in Svanberg, Barlow and Tigbe (50)), only the PIIOS was assigned a “sufficient” rating for criterion validity. Many authors demonstrated their measure’s efficacy in discriminating between high- and low-risk participants or reported that the measure’s constructs correlated with other similar constructs. Furthermore, many authors reported criterion validity when assessing only specific aspects of validity, such as hypothesis testing or convergent or discriminant validity. Authors rarely assessed predictive validity (i.e., whether scores predicted future developmental outcomes), a significant omission (32).
Considerations relating to internal consistency
Authors typically reported adequate levels of internal consistency. Fourteen studies were assigned “very good” ratings, and five studies were assigned “adequate” ratings regarding methodological quality for internal consistency. However, due to “very low” ratings for sufficient structural validity for many studies, only the CIB, FAAS-DCP and PCERA were rated as “sufficient” for internal consistency. All studies used Cronbach’s alpha. However, when it is used to assess items that cover a broad or more complex topic, it has been suggested that Cronbach’s alpha may underestimate the internal consistency of the measure (105) and some researchers have suggested alphas should not be interpreted as a measure of internal consistency (106).
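For reference, Cronbach’s alpha is computed from a persons-by-items score matrix as alpha = k/(k−1) × (1 − Σ item variances / variance of total scores). A minimal Python sketch with hypothetical scores:

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Cronbach's alpha for a persons-by-items score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = len(rows[0])  # number of items
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Perfectly consistent items yield alpha = 1.0.
print(cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]))  # 1.0
```

Because alpha depends only on the ratio of variances, using population or sample variances gives the same result (the n−1 terms cancel), so either convention is acceptable as long as it is applied consistently.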
Considerations relating to reliability
Twenty of the 25 measures (80%) had studies that reported on the reliability of the measure, but only seven measures were assigned “sufficient” ratings for the quality appraisal of the reported reliability results. While most studies reported on inter-rater reliability, no studies explicitly reported on assessing intra-rater reliability. Intra-rater reliability estimates are also important, because a researcher can assess if there are any practice effects associated with clinicians becoming familiar with the outcome measure (107).
Considerations relating to measurement error
Only three measures (12%; M-C ADS, NCATS, and PIOG) had studies that reported on measurement error, and they were ultimately assigned “insufficient” ratings. The remaining 22 measures were rated “indeterminate.” On reflection, many of the studies did report adequate percentage agreement (>80%) but failed to explicitly define the minimal important change (MIC), meaning they were assigned ratings of “indeterminate” as per the COSMIN criteria. It is important to define the MIC because if the reported measurement error in a study is smaller than the MIC, researchers may be able to distinguish clinically important changes from measurement error with greater certainty (108). Additionally, many studies failed to report on sensitivity, specificity and/or accuracy, which led to “doubtful” methodological quality ratings for 74.2% of studies. Further information on sensitivity, specificity and accuracy is required for clinicians and researchers to be able to use the measures to identify parent–infant relationships at risk of breaking down or of having long-term consequences for infant mental health (32).
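One common way to relate measurement error to the MIC, under classical test theory assumptions (our illustration, not a formula reported in the reviewed studies), is via the standard error of measurement (SEM) and the smallest detectable change (SDC):

```latex
\mathrm{SEM} = \sigma \sqrt{1 - r_{xx}}, \qquad
\mathrm{SDC}_{95} = 1.96 \times \sqrt{2} \times \mathrm{SEM}
```

where \(\sigma\) is the standard deviation of the scores and \(r_{xx}\) the reliability coefficient; a measure can then distinguish clinically important change from error with reasonable certainty only when the SDC is smaller than the MIC.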
Considerations relating to responsiveness
Fifteen measures (60%) had studies reporting information on the responsiveness of the measure. However, only the CARE-Index, MACI and M-I/TFS were assigned “sufficient” ratings in terms of responsiveness; the other 12 measures were deemed “insufficient.” Few studies reported longitudinal parent–infant relationship data. Responsiveness is important for establishing the ability of a measure to detect change over time (38, 39, 109). Thus, more research is needed to fully establish the responsiveness of these measures in order to continue monitoring changes over time within the parent–infant relationship.
Considerations relating to the use of the COSMIN guidelines
The COSMIN criteria are regarded as the most stringent and comprehensive to apply to studies due to the multi-step process outlined previously, which is considered a strength of this review (39). Despite this, as highlighted by Jewell et al. (110), these stringent cutoff scores can lead to important information being overlooked. In some cases in this review, the reported results were very close to values defined as “sufficient,” indicating a positive result, but were rated “insufficient” because they did not meet the stringent COSMIN cutoff values. Additionally, the COSMIN checklist was used to assess each study’s risk of bias using the “worst score counts” method, meaning that even one flaw in a study would result in a “doubtful” or “inadequate” rating overall, despite “very good” evidence in other methodological aspects. Jewell et al. (110) argued that this method can mean the overall reported methodological quality of a study is not an accurate reflection of its risk of bias, perhaps leading to an underestimation of the adequacy of reported measurement properties. Poor risk of bias ratings often stemmed from a lack of information reported by authors, meaning the stringent COSMIN criteria were not met. It should be noted, however, that 64.5% of studies were published before the COSMIN criteria were available.
Additionally, 16% of the measures were developed or validated with a sample of 50 or fewer participants, leading to possible imprecision. Inadequate sample sizes, in terms of the COSMIN criteria, and inadequate risk of bias scores often affected the overall grading of the quality of evidence. Consequently, the subsequent use of such a measure would not necessarily be based on robust and strong psychometric evidence.
Strengths and limitations
One major strength of this review is its comprehensiveness. More than 13,000 records across all publication years in five databases were screened. This approach reduced the likelihood of missing any relevant studies and resulted in a robust approach to reviewing all studies that used or reported on assessment measures of the parent–infant relationship. Stage 1 enabled the identification of a high number of parent–infant assessment measures.
Earlier reviews (32, 34, 35), published in and prior to 2015, included a smaller number of measures and did not assess them as comprehensively, since they were not informed by the COSMIN criteria.
Some limitations to this review are acknowledged. Validity and reliability evidence was based only on studies in peer-reviewed journals that psychometrically evaluated or described the development of the measure. Other literature (e.g., book chapters and theses) was excluded; thus, other relevant evidence for the included measures may exist but was not included. Additionally, this review excluded measures that were not suitable for infants aged 0–2 years. Consequently, we may have underreported the breadth of parent–infant assessment measures available for clinicians and researchers to use.
Furthermore, we excluded non-English language studies due to the limited time and resources available; thus, the presence of possible language and location biases is acknowledged (111). However, only two identified measures (one in each stage) were excluded because the relevant studies were in a non-English language. Additionally, alternatives to the COSMIN tools for assessing the psychometric properties of outcome measures exist, such as the Evaluating the Measurement of Patient-Reported Outcomes Tool (EMPRO) (112), a 39-item standardised assessment tool that has been used to review psychometric properties of measures in other systematic reviews (113), and the Francis tool (114), which uses an 18-item checklist to appraise the psychometric properties of instruments (115); using a different appraisal tool might have yielded different ratings.
We acknowledge a further limitation of this review in terms of the potential exclusion of literature that might have examined construct validity: the authors of those studies might not have specifically stated that they intended to fully validate the measure (e.g., they might have used a method to investigate the relationship between two measures rather than examined the validity of a particular measure). Therefore, it is possible that this aspect was not reported because the authors did not have it as their primary aim.
Implications for future research and practical recommendations
Of the final 25 parent–infant assessment measures that were first identified and then evaluated in this review, the majority (76%) had only one suitable study describing or evaluating the measure’s development and/or psychometric properties. Hence, further research demonstrating each measure’s reliability and validity would be useful for clinicians.
As suggested in a review by Lotzin et al. (32), parent–infant assessment measures could further refine the constructs, subscales or items used. Five measures identified in this review (Attachment Q sort, Monadic Phases, MRS, NCATS, and PCERA) included more than 60 items, meaning the time required from both the clinicians and the dyad to complete the assessment is high. Future studies could refine measures by specifying the developmental outcomes assessed by each construct or subscale (e.g., academic, cognitive, behavioural, or socio-emotional development).
Manuals for the identified measures were often not freely available or published. Information about a measure’s psychometric evidence, within accessible manuals, would help clinicians and researchers make well-informed decisions when choosing assessment measures and prioritise measures that demonstrate good psychometric evidence.
Practical constraints, such as costs, training, manual availability, and required settings or equipment, should also be considered. It is also worth considering that parent–infant interactions were often observed in clinical or laboratory settings; thus, the behaviours rated by clinicians may not be a true reflection of the parent and infant’s typical behaviour in their home environment (116). Further studies could focus more on observations of parent–infant behaviours in the naturalistic home environment.
Conclusion
Twenty-five parent–infant assessment measures were identified, assessed for risk of bias, and appraised for the quality of their psychometric properties. This review highlights that further research examining the reliability and validity of existing measures is required to advance this field of assessing the parent–infant relationship because few measures could be recommended for clinical and/or research use based on the findings. Clinicians and researchers should be aware of the quality disparities across assessment measures and may need to look beyond local guidelines or clinical recommendations when choosing parent–infant assessment measures.
Although it is reassuring to see a wealth of emerging literature on clinician-rated parent–infant assessment measures, there is a clear need to continue evaluating the existing assessment measures for their reliability and validity to ensure high quality parent–infant assessments, with clinical utility, are completed. More significant efforts should be made to improve the quality of the existing parent–infant assessment measures, as well as increased rigour and transparency in reporting measure development and evaluations, which in turn could serve to enable greater precision, sensitivity and specificity when assessing the parent–infant relationship. Improved detection of any problems or risks within the parent–infant relationship could help to reduce negative consequences for the parents and infants in the future, as well as to facilitate and contribute to the development of interventions within community and clinical PMH services.
Author contributions
IS: Conceptualization, Data curation, Formal analysis, Investigation, Project administration, Writing – original draft, Writing – review & editing. LG: Conceptualization, Formal analysis, Supervision, Writing – review & editing. AW: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. The authors did not receive any external funding to conduct this review. However, the University of Manchester supported the open access publication.
Acknowledgments
The authors would like to thank Eleanor Wozniak who assisted with the quality appraisal and with the study screening processes. Eleanor was a postgraduate researcher and a fellow trainee clinical psychologist at the University of Manchester, School of Health Sciences.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1426198/full#supplementary-material
Keywords: reliability, validity, quality appraisal, staff, perinatal, mothers, psychometric properties
Citation: Shone I, Gregg L and Wittkowski A (2025) Assessing the parent–infant relationship: a two-stage, COSMIN-informed systematic review evaluating clinician-rated measures. Front. Psychiatry 16:1426198. doi: 10.3389/fpsyt.2025.1426198
Received: 30 April 2024; Accepted: 11 March 2025;
Published: 28 April 2025.
Edited by:
Serena Grumi, Neurological Institute Foundation Casimiro Mondino (IRCCS), Italy
Reviewed by:
Gareth McCray, Keele University, United Kingdom
Josephine Ross, University of Dundee, United Kingdom
Copyright © 2025 Shone, Gregg and Wittkowski. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Anja Wittkowski, anja.wittkowski@manchester.ac.uk