Measurement properties of patient-reported outcome measures of mental help-seeking attitude: a systematic review of psychometric properties

Background: At present, failure to seek mental help is a serious problem among patients with mental disorders, and the mental help-seeking attitude is the central construct in the help-seeking process. However, there is no consensus on which patient-reported outcome measure (PROM) should be used to assess mental help-seeking attitudes. Objective: This systematic review aims to critically appraise, compare, and summarize the measurement quality of all available PROMs of mental help-seeking attitudes to provide evidence-based guidance and reference for clinical researchers. Methods: A systematic search was conducted in 9 databases (PubMed, Embase, Web of Science, Medline, APA, CINAHL, Sinomed, CNKI, and WanFang) from database inception to November 30, 2022 to identify articles on PROMs of mental help-seeking attitudes. We used the COSMIN guidelines to evaluate the methodological quality and measurement properties of all available PROMs, and a modified Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) system to evaluate the level of evidence supporting each rating. Finally, a recommendation level was assigned according to the overall quality of each PROM. Results: We identified 29 studies representing 13 PROMs out of 2,828 screened studies. The overall quality of the included PROMs varied, with 6 rated as class B, 6 as class C, and only the Mental Help Seeking Attitudes Scale (MHSAS) as class A. Conclusion: The measurement properties of the MHSAS have been evaluated most comprehensively, and it has good reliability and validity and high feasibility for clinical application, so it can be provisionally recommended for use. However, these conclusions still need to be supported by more high-quality evidence.


Introduction
In recent years, under the influence of population, environment, and social politics, mental health disorders have become a severe and costly global public health concern, affecting people of different ages, cultures, and socio-economic statuses (1,2). At the same time, although the quality and effectiveness of mental health treatments and services have greatly improved, the utilization of mental health services is low (3,4), resulting in a treatment gap that is wider than that in any other health sector (5). Results from a large-scale epidemiological study (6) show that fewer than one in five respondents with psychiatric disorders sought mental health services during the year they were interviewed. Many people who might benefit from psychotherapy are often reluctant to seek psychological help. A study (7) during the COVID-19 pandemic also validated the results. Mental disorders are treatable and possibly preventable, but the fact is that mental help-seeking is often delayed or completely absent (8)(9)(10). Some studies (11)(12)(13) have shown that the delay or refusal of psychological help will not only aggravate the negative experience of patients with mental illness, but will further increase the burden and cost, and eventually lead to severe ramifications for the individual. To close this treatment gap in mental health services, it is important to understand the influencing factors of the mental help-seeking process.
Mental help-seeking refers to the process of individuals seeking help from professionals when they encounter psychological troubles or obstacles. As one of the crucial variables affecting mental help-seeking (14), attitudes toward mental help-seeking are people's overall evaluation (i.e., good vs. bad) of the act of seeking help from mental health services, which includes the need for, trust in, acceptance of, and expectations about psychological help. The Theory of Planned Behavior (TPB) developed by Ajzen (15) states that an individual's actual behavior is indirectly influenced by behavioral intention, and three constructs predict behavioral intention: attitudes, subjective norms, and perceived behavioral control. Among them, studies (16,17) have shown that individual attitudes are the strongest predictor of help-seeking intentions and a crucial construct in help-seeking research and practice. In addition, TPB emphasizes the high correlation between attitude and behavior: individuals evaluate how much they like or dislike a certain behavior, and this evaluation shapes their actual behavior through their behavioral intention. Many empirical studies (5,18) have demonstrated that the mental help-seeking attitude is highly correlated with the help-seeking process. Moreover, in studies where intention was not measured, attitude also directly explained differences in actual future help-seeking behavior (19). In other words, the more positive an individual's attitude toward mental help-seeking, the more likely the individual is to put help-seeking behavior into practice. Therefore, a positive mental help-seeking attitude is the first step toward promoting help-seeking behavior, which can facilitate access to mental health and wellness services.
However, previous systematic reviews (20)(21)(22)(23) of mental help-seeking mostly focused on help-seeking behavior, which is a comprehensive process that includes facilitating and hindering factors, and few studies have addressed mental help-seeking attitudes. Given the critical role of mental help-seeking attitudes in help-seeking research and practice, it is important to be able to assess this construct effectively and accurately via patient-reported outcome measures (PROMs). At present, many measurement tools can be used to assess attitudes toward seeking psychological help, but they differ in content and evaluation methods, and their quality is uneven. Moreover, we found no study that systematically evaluated and compared measures of mental help-seeking attitudes to help clinicians and researchers select appropriate scales for specific uses.
The Consensus-based Standards for the Selection of Health Measurement Instruments (COSMIN) (24) methodology evaluates the measurement tools from the quality of measurement properties and methodological quality, and comprehensively assesses the evidence quality of each measurement property, thus forming the final recommendation of the measurement tools. Therefore, the COSMIN methodology facilitates a systematic review of measurement instruments. This study comprehensively and systematically reviewed the mental help-seeking attitude assessment tools according to the COSMIN guidelines to provide reference to researchers on the selection, application, and development of related tools.

Design
This systematic review was carried out following the COSMIN (24) guideline and is reported according to the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) (25) guideline. The review was registered with PROSPERO (CRD42022382992).

Search strategy
Studies on mental help-seeking attitudes were comprehensively searched in PubMed, Embase, Web of Science, Medline, APA, CINAHL, Sinomed, CNKI, and WanFang from database inception to November 30, 2022, with no language restrictions. To ensure the inclusion of all available and relevant primary studies, this study used Medical Subject Headings (MeSH) terms and free-text words to identify studies concerning the measured constructs. Two authors developed the search strategy according to the COSMIN guidelines and the advice of relevant experts. The following key terms were applied to each database: mental help-seeking attitude OR psychological help-seeking attitude OR attitudes toward seeking professional psychological help OR attitudes toward seeking mental help OR Psychological help attitude OR help-seeking attitude. Supplementary document 1 provides the detailed search strategy for the PubMed database as an example. In addition, we manually searched the reference lists of relevant articles to identify additional studies.

Eligibility criteria
The following inclusion criteria were applied: (1) validation studies of measures assessing constructs relating to mental help-seeking attitudes; (2) studies that determined at least one measurement property of PROMs. The exclusion criteria were as follows: (1) pilot tests (pre-experiments) of measurement tools or studies of their application status; (2) secondary research (review, systematic review, etc.); (3) duplicate studies retrieved from multiple databases; (4) studies for which the full text was not available.

Study selection
All records retrieved from the database searches were imported into EndNote (version X9, Clarivate Analytics). After the removal of duplicates, the titles and abstracts were reviewed independently by two researchers, and studies not meeting the inclusion and exclusion criteria were removed at this stage. Any discrepancies were resolved by joint discussion or referral to the third author, and a list of the articles to be included in this review was determined. Articles published in languages other than English or Chinese were translated by professional translators before the researchers reviewed them. In addition, the reasons for excluding studies at the full-text screening stage were recorded. The process of study selection is shown in the PRISMA flowchart.

Data extraction
Two researchers independently extracted data, including measurement instruments, the first author, publication year, country and language, target population, sample size, number of dimensions and number of items, scoring method, and the retest time. The main findings on measurement properties included content validity, structural validity, internal consistency, cross-cultural validity/measurement invariance, test-retest reliability, criterion validity, and hypotheses testing (24).

Quality appraisal
Two researchers used the COSMIN (24) guideline for systematic reviews to independently evaluate the methodological quality and the measurement properties of the mental help-seeking attitude scales, and cross-checked the results. Disagreements were settled through consultation with the third reviewer. A narrative analysis was used to summarize and analyze the measurement property and methodological quality results of the mental help-seeking attitude assessment instruments.

Methodological quality assessment of included studies
This study used the COSMIN Risk of Bias Checklist (26) to evaluate the methodological quality of the included studies. The checklist covers an instrument's development, content validity, structural validity, internal consistency, cross-cultural validity/measurement invariance, reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness. A 4-point scoring system was used to rate the methodological quality of each study: "very good (V)," "adequate (A)," "doubtful (D)," or "inadequate (I)." The methodological quality score for each property was determined by taking the lowest rating of any item in each box (the "worst score counts" principle).
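The "worst score counts" principle described above can be illustrated with a short sketch. The function name `box_rating` and the example item ratings are hypothetical and are not taken from the review or from any COSMIN software; the sketch only assumes the four-point scale named in the text.

```python
# Ratings ordered from best to worst, following the four-point scale in the text.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def box_rating(item_ratings):
    """Return the worst rating among the items of one checklist box.

    Hypothetical helper: an unknown rating label raises ValueError.
    """
    if not item_ratings:
        raise ValueError("a box needs at least one rated item")
    # The item with the highest index in RATING_ORDER is the worst one.
    return max(item_ratings, key=RATING_ORDER.index)


# Illustrative item ratings for one box of one study (not data from the review).
example_items = ["very good", "adequate", "very good", "doubtful"]
print(box_rating(example_items))  # doubtful
```

A single "doubtful" item is enough to lower the whole box, which is why incomplete reporting of one design detail can determine the methodological rating of a study.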

Measurement property assessment of the instruments
The COSMIN quality criteria checklist developed by Terwee et al. (27) was used as a framework to evaluate the measurement properties reported in each study. The checklist covers nine measurement properties: content validity, structural validity, internal consistency, cross-cultural validity, reliability, measurement error, criterion validity, hypothesis testing, and responsiveness, and sets the rating levels as "sufficient (+)," "insufficient (−)," or "indeterminate (?)." After data synthesis, when the ratings of the individual studies were consistent, the overall rating of the measurement property was likewise "sufficient (+)," "insufficient (−)," or "indeterminate (?)." When the ratings of the individual studies were inconsistent, possible reasons were explored. If a reasonable explanation was found, subgroup analyses could be performed and ratings provided per subgroup. If no reasonable explanation was found, the overall rating of the measurement property was "inconsistent (±)."

Summarizing the evidence and grading the quality of the evidence
We used a modified Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) (28) system to assess the overall quality of evidence. Based on risk of bias, inconsistency, indirectness, and imprecision, the quality of evidence was graded at four levels: high, moderate, low, and very low.
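The rule for pooling per-study ratings of a measurement property, as described above, can be sketched as follows. The helper `pool_ratings` and its `explained` flag are assumptions made for illustration, and the ASCII character "-" stands in for the minus sign; the sketch covers only the cases stated in the text, not the full COSMIN procedure.

```python
VALID_RATINGS = {"+", "-", "?"}


def pool_ratings(study_ratings, explained=False):
    """Pool per-study ratings ('+', '-', '?') for one measurement property.

    Hypothetical helper. Consistent ratings carry over unchanged.
    Inconsistent ratings without a reasonable explanation become '±'.
    If a reasonable explanation exists (explained=True), None is returned
    to signal that ratings should be pooled separately per subgroup.
    """
    unique = set(study_ratings)
    if not unique:
        raise ValueError("no ratings to pool")
    if not unique <= VALID_RATINGS:
        raise ValueError(f"unexpected rating(s): {sorted(unique - VALID_RATINGS)}")
    if len(unique) == 1:
        return unique.pop()
    return None if explained else "±"


# Illustrative inputs (not data from the review).
print(pool_ratings(["+", "+", "+"]))  # +
print(pool_ratings(["+", "-", "+"]))  # ±
```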
Finally, based on the evaluation of the evidence, a recommendation was formulated for each instrument and its strength was graded as strong recommendation (A), weak recommendation (B), or no recommendation (C). The criteria for category A are sufficient content validity (any level of evidence) and sufficient internal consistency (evidence of at least low quality). The criterion for category C is an insufficient measurement property supported by high-quality evidence. Category B lies between categories A and C, and more studies are needed to verify the measurement properties of instruments in this category.
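The A/B/C criteria above can be expressed as a small decision function. This is a minimal sketch: the function `recommend`, the dictionary layout, and the choice to check category C before category A are assumptions for illustration, and "-" stands in for the minus sign. The example values mirror what this review reports for the MHSAS (sufficient content validity with moderate-quality evidence, sufficient internal consistency with high-quality evidence).

```python
EVIDENCE_LEVELS = ["very low", "low", "moderate", "high"]


def recommend(properties):
    """Assign a recommendation category from pooled ratings.

    properties maps a measurement property name to a tuple
    (rating, evidence_level), with rating in {'+', '-', '?', '±'}.
    Assumption: category C is checked first.
    """
    # Category C: any insufficient property backed by high-quality evidence.
    if any(r == "-" and lvl == "high" for r, lvl in properties.values()):
        return "C"
    cv_rating, _ = properties.get("content validity", ("?", "very low"))
    ic_rating, ic_level = properties.get("internal consistency", ("?", "very low"))
    at_least_low = EVIDENCE_LEVELS.index(ic_level) >= EVIDENCE_LEVELS.index("low")
    # Category A: sufficient content validity (any evidence level) and
    # sufficient internal consistency with at least low-quality evidence.
    if cv_rating == "+" and ic_rating == "+" and at_least_low:
        return "A"
    # Everything else falls between A and C.
    return "B"


mhsas_like = {
    "content validity": ("+", "moderate"),
    "internal consistency": ("+", "high"),
}
print(recommend(mhsas_like))  # A
```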

Search results
Of the 2,828 studies retrieved from the databases, 294 duplicates were removed. Upon screening the titles and abstracts, another 2,463 studies were removed. For the remaining 71 studies, their full texts were retrieved, of which 42 were removed with reasons, leaving 29 studies (14, in the review. A flowchart of the literature screening process is shown in Figure 1.
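The screening counts reported above can be checked with simple arithmetic. The variable names below are illustrative; the numbers are the ones given in the text.

```python
# Counts reported in the search results.
records_identified = 2828
duplicates_removed = 294
excluded_on_title_abstract = 2463
excluded_on_full_text = 42

# Records left for full-text retrieval after de-duplication and screening.
full_texts_assessed = (
    records_identified - duplicates_removed - excluded_on_title_abstract
)
# Studies remaining after full-text exclusions.
studies_included = full_texts_assessed - excluded_on_full_text

print(full_texts_assessed)  # 71
print(studies_included)     # 29
```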

General characteristics of the instruments included
After retrieval, a total of 29 studies representing 13 instruments conducted between 1970 and 2021 were included. Among the 13 measurement instruments, Attitudes Toward Seeking Professional Psychological Help Scale-Short Form (ATSPPH-SF) is the scale that has been evaluated most frequently, with a total of 10 studies, followed by Inventory of Attitudes Toward Seeking Mental Health Services (IASMHS) (6 studies) and Attitudes Toward Seeking Professional Psychological Help Scale (ATSPPH) (3 studies), and the rest have only one study.
The number of participants in the studies included in this systematic review ranged from 150 to 3,006, with 11 studies involving college students, and the rest involving community residents, adults, future psychotherapy practitioners, parents, and immigrant adolescents. Studies were conducted in the United States (n = 6), China (n = 11), Canada (n = 2), Singapore (n = 2), Greece (n = 1), Italy (n = 1), Jordan (n = 1), France (n = 1), Austria (n = 1), the Philippines (n = 1), Germany (n = 1) and Turkey (n = 1). The number of scale items varied from 7 to 43, and the number of dimensions ranged from one to five. In addition, 10 measurement instruments had recall periods varying from 1 to 4 weeks. Table 1 shows the characteristics of the instruments included in the systematic review.
Figure 1. Flowchart of the identification and selection of studies.

Methodological quality of the included studies and measurement property assessment of the instruments
The results of the measurement property assessment of the selected instruments and of the methodological quality assessment are summarized in Table 2. This systematic review evaluated the content validity (n = 16), structural validity (n = 29), internal consistency (n = 28), cross-cultural validity/measurement invariance (n = 7), test-retest reliability (n = 11), criterion validity (n = 4), and hypothesis testing (n = 1) of the included instruments.

Content validity
Content validity refers to the degree of agreement between the content of PROMs and the measured construct, which is considered to be the most important measurement characteristic of PROM (57). According to the COSMIN guidelines (24), content validity is described in aspects of relevance, comprehensiveness, and comprehensibility.
In 6 studies (37,38,44,45,51,53), qualitative interviews were used to investigate the target population, but the interview guide and content were not specified and the data analysis methods were not clear, so the methodological quality of the research was "doubtful." The study of Hammer et al. (14) used a questionnaire survey to evaluate the relevance, comprehensiveness, and comprehensibility of the scale items among community adults, and the analysis approach was appropriate but not clearly described, so the methodological quality was "adequate." In addition, since only 5 professionals from relevant disciplines were quantitatively surveyed in the study of Hammer et al., the methodological quality regarding relevance and comprehensiveness as judged by professionals was "doubtful." Other studies only used quantitative survey methods to evaluate content validity, and the description of the research process/statistical methods was not clear, so the methodological quality of these studies was "doubtful."

Structural validity
Structural validity refers to the degree to which the scores of a PROM adequately reflect the dimensionality of the construct to be measured (58). All included studies were evaluated for structural validity. In terms of methodological quality, 4 studies (30,35,36,43) conducted only exploratory factor analysis (EFA), so we rated them as "adequate" for structural validity. One study (54) performed confirmatory factor analysis (CFA), but its structural validity was also rated as "adequate" because there were only 150 subjects, fewer than 7 times the number of items. CFA was performed in the remaining 24 studies with sufficient sample sizes, so their structural validity was rated as "very good." In terms of instrumental

Internal consistency
Internal consistency, defined as the degree of interrelatedness among the items, is usually assessed by Cronbach's alpha (58). Twenty-eight studies measured internal consistency, 3 (30, 32, 49) of which were rated as "inadequate" for not reporting Cronbach's alpha for each subscale. The other studies reported Cronbach's alpha for all dimensions, and their methodological quality was "very good." Fifteen studies (30-34, 36-39, 48-51, 53, 54) reported a Cronbach's alpha below 0.70, so their internal consistency was rated as "insufficient."
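For readers less familiar with the statistic, the following minimal sketch computes Cronbach's alpha for one unidimensional (sub)scale using the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), and compares it with the 0.70 threshold used above. The function name and the response data are hypothetical and are not taken from any included study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one (sub)scale.

    rows: one list of item scores per respondent, all of equal length.
    Sample variances (n - 1 denominator) are used throughout.
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    item_variances = [variance(col) for col in zip(*rows)]
    total_variance = variance([sum(r) for r in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 4 respondents, 2 items.
responses = [[1, 2], [2, 1], [3, 4], [4, 3]]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))   # 0.75
print(alpha >= 0.70)     # True
```

For a multidimensional scale, the function would be applied to the items of each subscale separately, which is the reporting practice the rating above requires.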

Cross-cultural validity/measurement invariance
Cross-cultural validity/measurement invariance refers to the degree of consistency in the scores of PROM items when measured across different cultural groups (59). A total of 7 studies (14,29,32,47,49,50,56) used multi-group CFA (MGCFA) to verify the measurement invariance of the instruments across different cultural groups. However, they did not specify whether relevant characteristics other than the group variable were similar, so the methodological quality of the research was "doubtful." Three studies (49,50,56) found differences across the grouping variables, so their cross-cultural validity/measurement invariance was "insufficient."

Reliability
Reliability is defined as the extent to which a measurement is free of measurement error and can be tested with repeated measurements in stable patients at appropriate intervals under similar test conditions (60). Ten studies (14,30,31,34,36,40,43,51,52,55) evaluated reliability, but their methodological quality was "doubtful" because they failed to detail whether the characteristics of the test subjects, the test conditions, and the constructs to be measured were stable during the interval, and failed to justify the time interval. In six studies (14,30,31,34,40,55), the intraclass correlation coefficient (ICC) was calculated and exceeded 0.70, so reliability was rated as "sufficient." The other studies only calculated the Pearson correlation coefficient, so reliability was rated as "indeterminate."

Criterion validity
Criterion validity refers to the degree to which the scores of a PROM are an adequate reflection of a gold standard. Two studies (40, 46) assessed criterion validity, both of which took ATSPPH as the gold standard and calculated the correlation, so the methodological quality was rated as "very good." Only one study (46) reported a correlation greater than 0.70 and was therefore rated "sufficient."

Hypothesis testing
The assessment of hypothesis testing has two parts, a and b. Part a is the comparison with other outcome measurement instruments, and part b is the comparison between subgroups. A study can be evaluated on part a, part b, or both, depending on the type of comparison. In this systematic review, only one study evaluated hypothesis testing: part a was the comparison with ATSPPH and IASMHS, and part b was the comparison between subgroups of men and women, and between subgroups defined by whether they had previously sought mental health services. However, the study of the Mental Help Seeking Attitudes Scale (MHSAS) only gave the correlation coefficient instead of the mean and standard deviation, so the methodological quality was reduced from "very good" to "adequate." The research results of Hammer et al. were consistent with the hypotheses, so they were rated as "sufficient."

Summary of evidence and grading of the quality of evidence
This section summarizes the overall ratings and quality of evidence for all assessment tools, and Table 3 was formed. A total of 13 measurement tools were included in our study, but none of them measured measurement error and responsiveness.
In terms of measurement property quality, all 13 tools involved the assessment of content validity, but only the tool in the study of Hammer et al. was rated as "sufficient," and the remaining tools were "indeterminate." For structural validity, ATSPPH, Attitude Toward Seeking Professional Psychological Help Questionnaire-Civil Servants (ATSPPHQ-CS), and Immigrated Adolescents' Psychological Attitude Questionnaire (IAPAQ) were rated as "insufficient," Trainees' Attitudes Toward Seeking Psychotherapy Scale (TATSPS), Attitude Toward Seeking Professional Psychological Help Questionnaire-Middle School Students (ATSPPHQ-M), and Attitudes Toward Seeking Psychological Help Scale (ASPH) were rated as "indeterminate," and the other evaluation tools were rated as "sufficient." Internal consistency was evaluated for all 13 tools, with six rated as "sufficient" and five rated as "insufficient." In addition, the internal consistency of ATSPPH-SF and IASMHS was "inconsistent." Cross-cultural validity/measurement invariance was evaluated for three assessment tools, of which ATSPPH-SF and MHSAS were rated as "sufficient." Reliability was evaluated for 8 assessment tools: ATSPPH-SF, ATSPPHQ-M, Attitude Toward Seeking Professional Psychological Help Questionnaire-College Students 1 (ATSPPHQ-C 1), and MHSAS were rated as "sufficient," while the rest were rated as "indeterminate." Criterion validity was evaluated for only 2 assessment tools, among which ATSPPH-SF was "inconsistent" and TATSPS was "insufficient." Hypothesis testing was evaluated only for MHSAS and was rated "sufficient." The quality of evidence is affected by four downgrading factors: risk of bias, inconsistency, indirectness, and imprecision.
In this study, due to the impact of risk of bias and inconsistency, the overall quality of evidence for each measurement property was mostly moderate or low, especially for content validity, structural validity, cross-cultural validity/measurement invariance, and reliability. Based on the COSMIN guidelines, measurement instruments can be classified as strongly recommended (A), weakly recommended (B), and not recommended (C). Six of the 13 measurement instruments were rated as B, 6 were rated as C, and only MHSAS was rated as A.

Discussion
This systematic review is the first psychometric review concerning the measurement properties of mental help-seeking attitude instruments based on the COSMIN checklist. All stages of this study were conducted according to the PRISMA and COSMIN guidelines, which supports a high level of evidence. Through the comprehensive and systematic evaluation of the included research, we found that the assessment of mental help-seeking attitudes mainly focuses on the cognition of mental health, overall evaluation of psychological help, tolerance of stigma, interpersonal openness, need for self-sufficiency, and trust in psychological professionals. However, the instruments differ markedly in the quality of their measurement properties, and their overall quality is uneven. Therefore, the problems found in this study are summarized below to provide a reference for the development and validation of high-quality psychological help-seeking attitude scales in the future.
Frontiers in Psychiatry (frontiersin.org)
In assessing the methodological quality of studies, content validity is the most important measurement property of a PROM (27). In evaluating the PROM development process, this study found that most of the included instruments were compiled according to the results of a literature review, expert consultation, target population interviews, or a questionnaire survey. ASPH, Attitude Toward Seeking Professional Psychological Help Questionnaire-College Students 2 (ATSPPHQ-C 2), Attitude Toward Seeking Professional Psychological Help Questionnaire-Pupil's Parents (ATSPPHQ-PP), and IAPAQ conducted qualitative interviews with the target population but did not describe whether there was an interview guide, skilled interviewers, or a specific interview process. Whether data saturation was reached and how the qualitative data were analyzed was also unclear, so the methodological quality of content validity was "doubtful." The rest of the assessment instruments only conducted questionnaire surveys of the target population, among which only the content validity of MHSAS had "adequate" methodological quality; the data analysis process of the other instruments was not clear, so their methodological quality was "doubtful." This resulted in low recommendation ratings for most measurement instruments. Therefore, the general problems of content validity research are the lack of qualitative interviews with the target population and the lack of detailed and standardized descriptions of research methods. COSMIN recommends the use of qualitative research to investigate the relevance, comprehensiveness, and comprehensibility of items among patients or professionals to assess content validity. Future studies can refer to the COSMIN checklist (24) to improve PROM design, and use in-depth individual interviews or focus group interviews to understand the views of research subjects on the content of the PROM so that they can participate in the verification and evaluation process. In addition, the process of data analysis should be described in detail, especially the analysis of qualitative research data.
In this review, the included studies were all based on classical test theory (CTT), and factor analysis is the preferred method for evaluating structural validity in CTT (61). Factor analysis is a broad term that refers to a set of statistical methods for extracting common factors from a number of observed variables (62). There are two main approaches: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). CFA applies when the dimensionality of the measured construct is known, while EFA applies when it is uncertain. The fundamental difference between the two is whether there is a prior theory or knowledge. Compared with EFA, CFA can describe the relationship between measurement items and factors, and directly test this relationship or model (63). Many of the included studies only conducted EFA to produce a theory of internal structure, but not CFA, so it was impossible to evaluate how well the theoretically defined factor structure fit the sample data. In fact, EFA and CFA are two stages of the research process, and only the combination of the two allows a thorough analysis. Therefore, for PROMs lacking a theoretical basis, it is recommended to first use EFA to clarify the internal structure, and then apply CFA to analyze the relationship between measurement items and factors when evaluating structural validity.
Cross-cultural validity/measurement invariance was evaluated in ATSPPH-SF, IASMHS, and MHSAS, but the quality of evidence was not high. An ideal measurement instrument should have a stable structure when measured in different cultural groups. Therefore, when the instrument is used in different cultural groups, attention should be paid to the measurement invariance between groups and to whether there is differential item functioning (DIF) of scale items. Only MHSAS evaluated hypothesis testing for construct validity, which refers to the extent to which scale scores are consistent with the hypotheses. The more specific the hypotheses and the more hypotheses tested, the more evidence there is to support construct validity. Further research needs to focus more on this measurement property. In terms of criterion validity, there is no gold standard in the field of psychological help-seeking attitudes. However, the COSMIN guidelines state that the original scale can serve as the "gold standard" for a newly developed shorter version. Therefore, researchers can compare the short version of a scale with the original scale to verify whether it has good criterion validity.
Internal consistency was evaluated for all 13 assessment instruments, but 3 studies reported Cronbach's alpha only for the total scale, so the methodological quality of internal consistency was rated as "inadequate." The COSMIN guidelines state that the internal consistency of each subscale should be calculated when the scale is multidimensional. In future development and validation, the Cronbach's alpha of each subscale should be reported once the structural validity is clear. Among the scales for which reliability was assessed, the quality of evidence was mostly low, because the studies did not justify the choice of the measurement interval, nor did they indicate whether subjects were stable in the interim period on the construct to be measured. In addition, COSMIN points out that when evaluating reliability, the preferred statistic is the ICC for continuous scores, Kappa for dichotomous/nominal scores, and weighted Kappa for ordinal scores. Moreover, the ICC model is a two-way random effects model, which takes into account both the variation within subjects and the systematic variation, while the Pearson or Spearman correlation coefficient does not consider systematic variation. All the included studies used continuous scores, but 4 studies calculated only the Pearson correlation coefficient. Measurement error and responsiveness were not reported for any of the 13 instruments in this study. Measurement error refers to the systematic and random error of the measured scores and is expressed by parameters such as the standard error of measurement (SEM) and the minimal detectable change (MDC). Responsiveness refers to the ability of the scale to detect changes in scores over time. At present, the psychological help-seeking attitude tools are still at the development stage, and measurement error and responsiveness should be tested to strengthen the scientific rigor of the assessment tools.
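The point that a correlation coefficient ignores systematic variation can be illustrated with a minimal sketch. It assumes the two-way random effects, absolute agreement, single measurement form of the ICC, often written ICC(2,1); the function names and the test-retest scores are hypothetical, and the included studies may have used a different ICC form.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_2_1(rows):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    rows: one list per subject holding the scores from each occasion.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(c) / n for c in zip(*rows)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between occasions
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical scores: every subject scores exactly 5 points higher at retest.
test = [10, 12, 14, 16, 18]
retest = [15, 17, 19, 21, 23]

print(round(pearson_r(test, retest), 3))            # 1.0
print(round(icc_2_1(list(zip(test, retest))), 3))   # 0.444
```

In this constructed example the rank order is perfectly preserved, so the Pearson coefficient is 1.0, yet the systematic 5-point shift lowers the ICC below the 0.70 threshold used in this review.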
In addition to methodological quality and measurement property quality, the application population is also a key concern in the selection of assessment tools. The application population of a scale should be consistent with the characteristics of the population included in the development of the scale. However, there are differences between the original target population and the application population of most psychological help-seeking attitude assessment tools, such as ATSPPH-SF and IASMHS. Taking IASMHS as an example, Mackenzie et al. developed the scale based on college students, and it was then applied to different groups such as adults, Chinese-Canadian elders, and adult abuse survivors. Although the IASMHS is rated as class B based on the modified GRADE system, indicating that it has the potential to be applied in these populations, whether it is the best assessment tool for these specific populations needs to be verified by further research. In addition, many of the assessment tools in this study were developed for specific groups, such as TATSPS and ATSPPHQ-M. Therefore, more consideration can be given to developing targeted assessment tools in future research.
The results of this study show that MHSAS is the only class A scale; ATSPPH-SF, IASMHS, ATSPPHQ-M, ATSPPHQ-C1, Parental Attitudes Toward Psychological Services Inventory (PATPSI), and ATSPPHQ-PP are class B scales; and ATSPPH, TATSPS, ASPH, ATSPPHQ-C2, ATSPPHQ-CS, and IAPAQ are class C scales. A class A scale is recommended for use in its target population, a class B scale has application potential but needs further verification, and a class C scale is not recommended (28). In other words, when choosing mental help-seeking attitude assessment tools, ATSPPH, ASPH, and ATSPPHQ-C2 are not recommended among college students, TATSPS is not recommended among future psychotherapy practitioners, ATSPPHQ-CS is not recommended among civil servants, and IAPAQ is not recommended among immigrant adolescents. ATSPPH-SF, IASMHS, ATSPPHQ-M, ATSPPHQ-C1, PATPSI, and ATSPPHQ-PP have the potential to be applied in their corresponding populations. As the only class A scale, MHSAS is recommended for community adults. MHSAS was developed to address the shortcomings of previous assessment tools, so its development process is more standardized. MHSAS is a unidimensional scale with 9 items and is based on the Theory of Planned Behavior (TPB). The items are simple and the evaluation is comprehensive. Compared with other PROMs, MHSAS has moderate-quality evidence to support its content validity and high-quality evidence to support its internal consistency. It has good clinical feasibility and can be used in future research on psychological help-seeking attitudes. However, there are currently few validation studies of MHSAS: only one study was included, and the recommendation rating was obtained only in community adults, without adaptation studies in different age groups or other populations, so the generalizability of MHSAS to other groups remains uncertain. Follow-up studies could explore the use of MHSAS in other populations to provide a stronger evidence base.
ATSPPH-SF was the most frequently evaluated scale, but its development lacked standardization and its content validity was insufficiently studied, so its content validity is rated only as "indeterminate" (moderate evidence). These problems also exist in the other class B scales. In addition, future studies could further explore mental help-seeking attitude assessment tools tailored to specific populations.

Limitations
This study has some limitations. First, as with all systematic reviews, there is a possibility of publication bias. Apart from ATSPPH, ATSPPH-SF, and IASMHS, each of the other PROMs was evaluated in only one study, and validation studies with negative results may never have been published. Second, some studies may have been conducted correctly but not described in sufficient detail according to COSMIN standards, which affected their quality ratings. Third, we included only studies that were designed to assess the measurement properties of PROMs for mental help-seeking attitudes. Finally, the target population of existing PROMs is mostly college students, and there are few studies in other groups.

Conclusion
This systematic review provides a comprehensive overview of the methodological quality and the quality of measurement properties of mental help-seeking attitude instruments. The results of this review can contribute to the selection of appropriate measuring instruments to assess psychological help-seeking attitudes. Currently, a number of PROMs of psychological help-seeking attitudes are available for use, but evidence of an adequate development process is lacking. We categorized only the MHSAS as class A; it has sufficient psychometric evidence to be recommended as the most appropriate tool. ATSPPH-SF, IASMHS, ATSPPHQ-M, ATSPPHQ-C1, PATPSI, and ATSPPHQ-PP were categorized as class B, with the potential to be recommended, and should be evaluated in further studies. Future studies should pay attention to content validity, verify structural validity before measuring internal consistency, and fully report cross-cultural validity/measurement invariance, measurement error, criterion validity, and responsiveness.

Author contributions
CB and Y-HZ designed the study. CB, S-YC, W-WZ, and YC organized the data collection and extraction. CB, S-YC, S-RY, and R-XW performed the quality appraisal. CB and Y-HZ were responsible for drafting and critical revisions of the manuscript. All authors contributed to the article and approved the submitted version.

Funding
This work was supported by grants from Jiangsu Province Hospital Association (grant number JSYGY-2-2021-411).

Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
