
ORIGINAL RESEARCH article

Front. Psychiatry, 26 January 2026

Sec. Computational Psychiatry

Volume 16 - 2025 | https://doi.org/10.3389/fpsyt.2025.1671747

A multi-layer similarity approach for analyzing ADHD symptomology and assessment methods considering DSM-5 diagnostic criteria

  • 1School of Interdisciplinary Engineering and Sciences (SINES), National University of Sciences & Technology (NUST), Islamabad, Pakistan
  • 2Department of Statistical Science, University of Padua, Padova, Italy

Aim: Attention-Deficit/Hyperactivity Disorder (ADHD) is a neurodevelopmental condition characterized by two symptom domains, inattention and hyperactivity/impulsivity, as defined in the DSM-5. Prior research indicates conceptual overlap among symptoms within each domain, potentially compromising the diagnostic utility of the symptom structure itself. This structural redundancy has direct implications for the evaluation of ADHD screening tools, which already show substantial heterogeneity in item content and focus. While full psychometric validation is resource-intensive, assessing tool alignment with the DSM-5 offers a more practical and clinically relevant alternative.

Method: Considering these challenges, this study first employed a three-layer similarity framework with an entropy-based weighted combined score to investigate intra-domain symptom redundancy. Subsequently, a multi-stage classification pipeline, comprising a filtering layer and machine-learning classifiers (Random Forest, Support Vector Machine and Logistic Regression), was trained on DSM-5 ADHD and non-ADHD (Conduct Disorder, Major Depressive Disorder, Oppositional Defiant Disorder) statements, tested on the Vanderbilt preschool assessment questionnaire, and validated on the ADHD Rating Scale, the Swanson, Nolan and Pelham Rating Scale (SNAP-IV) and the Modified Checklist for Autism in Toddlers (M-CHAT), to assess screening-tool alignment with the DSM-5.

Results: The results revealed moderate overlap between symptom pairs (2, 5) and (5, 7) within the inattention domain, with similarity scores of 0.62 and 0.58, respectively. The filtering layer demonstrated a high accuracy of 97% and perfect precision and specificity in isolating ADHD symptoms. Among the classifiers, Random Forest achieved the best performance with 92% accuracy, 83% precision, 100% recall and a 91% F1-score. Validation with the ADHD Rating Scale yielded near-perfect classification due to its focused symptom set, while SNAP-IV's inclusion of non-ADHD items slightly reduced subtype specificity. M-CHAT validation further confirmed the pipeline's ability to exclude non-ADHD symptoms, supporting its classification precision.

Conclusion: The proposed pipeline can be adopted for analyzing the strengths and limitations of screening tools, serving as a catalyst for refinements that ensure reliability and effectiveness in practical applications.

1 Introduction

Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental condition characterized by a consistent pattern of inattention and hyperactivity-impulsivity that significantly impairs academic, occupational and social functioning (1). According to a meta-analysis conducted in 2023, the worldwide prevalence of ADHD in children and adolescents is estimated to be around 8% (2). As there are no definitive biological markers or objective tests currently available for ADHD, behavioral assessment remains the primary method of diagnosis. In this context, the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) provides standardized diagnostic criteria that guide clinical evaluation. According to the DSM-5, ADHD is categorized into two core domains, inattention and hyperactivity-impulsivity, each consisting of nine distinct behavioral symptoms (3). Despite this standardized framework, concerns persist regarding the conceptual distinctiveness and item-level redundancy of these symptoms. For instance, symptoms such as “often fidgets with hands or feet”, “is often on the go” and “often leaves seat in situations when remaining seated is expected” may all reflect variations of psychomotor hyperactivity. Likewise, inattention symptoms such as “often loses things necessary for tasks” and “has difficulty organizing tasks and activities” could indicate shared deficits in executive functioning (4). Using techniques such as item response theory, machine learning, network analysis and internal consistency evaluation, researchers have consistently identified symptom-level redundancy and differential contributions to the prediction of functional impairment. For instance, a study conducted in 2021 employed network analysis and random forest regression on a nationally representative adult sample and identified only a small subset of symptoms, three inattention (“difficulty organizing task and activity”, “does not follow through instructions” and “make careless mistakes”) and one hyperactive (“difficulty engaging in leisure activities”), as central bridge symptoms linked to global and domain-specific impairment (5). Similarly, in 2021, longitudinal network modeling in children and adolescents found that only a few symptoms, such as “is easily distracted”, “has difficulty sustaining attention”, “difficulties following instructions” and “interrupt/intrudes”, were consistently central across both parent and teacher reports, and these central symptoms predicted future emotional and behavioral difficulties as effectively as the full 18-symptom set (6). Item response theory likewise demonstrated that “easily distracted” provided substantially more diagnostic information than other symptoms (7). Subsequent investigations further reinforced this variability by showing that the predictive utility of individual ADHD symptoms differs across developmental stages and impairment domains, with inattention being more predictive of academic impairment and hyperactivity/impulsivity more relevant for social functioning in early childhood (8). Similarly, ROC-based classification was employed in a distinct study to develop optimized diagnostic algorithms, reporting that models limited to impairment-predictive symptoms significantly outperformed DSM-IV criteria in diagnostic efficiency and inter-rater reliability (9).
In 2019, findings emerged showing that among over 116,000 valid symptom combinations, only a few symptoms, such as “motoric activity”, “losing things” and “does not follow instructions”, had high centrality and a disproportionate influence on diagnostic outcomes (10). Collectively, these findings indicate that the 18-symptom list, particularly within the hyperactivity-impulsivity domain, may be reducible without substantial loss of diagnostic information, as individual symptoms may encapsulate content that is redundantly represented across multiple current symptoms (4). On the basis of this notion, we hypothesized that some of the 18 ADHD symptoms may exhibit misleading language patterns in their DSM-5 descriptions, potentially indicating underlying semantic redundancy. To evaluate this, we implemented a three-layer similarity framework aimed at systematically assessing possible language ambiguity among the symptom descriptions. Similar techniques have also been applied in other domains such as text summarization, textual similarity and semantic analysis (11–15). However, their application to modeling language ambiguity within ADHD symptomology remains novel.

In addition to evaluating the conceptual distinctiveness of the DSM-5 ADHD symptom set, it is also essential to consider how these symptoms are operationalized in real-world settings. This is particularly critical given that ADHD is increasingly recognized as a lifelong neurodevelopmental condition (16). Early identification can pave the way for effective behavior modification, yielding lasting benefits (17). Maximizing these outcomes hinges on screening children at the earliest possible age. Therefore, the American Academy of Pediatrics and the Centers for Disease Control and Prevention's National Center on Birth Defects and Developmental Disabilities recommend incorporating routine screenings as part of developmental surveillance to help pediatricians with early identification (18). Over the years, numerous screening tools have been developed for ADHD assessment across different age groups; however, our particular focus is on tools designed for use in pediatric populations. One of the earliest, the Conners Teacher Rating Scale, a 39-item symptom and behavior checklist, was introduced in 1969, followed by the Conners Parent Rating Scale, consisting of 73 items, in 1970, with revisions in subsequent years to enhance psychometric precision and clinical relevance (19, 20). The SNAP-IV (Swanson, Nolan and Pelham Rating Scale), a 90-item self-report tool introduced in 1985, provides a structured assessment of core attention and behavior regulation difficulties, suitable for both clinical use and research (21). Later, in 2003, the Vanderbilt ADHD Diagnostic Rating Scale (VADRS) was developed to offer a comprehensive screening tool incorporating symptom ratings as well as items related to academic performance and behavioral concerns (22). These tools have become foundational in both educational and clinical practice for identifying children at risk. Additionally, the ADHD Rating Scale has been extensively used due to its alignment with clinical observations and its application in both research and clinical settings for symptom quantification (23). Beyond these, several other tools have emerged, including the Brown Attention Deficit Disorder Scales, the Child Behavior Checklist and the Behavior Assessment System for Children, each varying in scope and purpose (24–26). The evolution of ADHD screening tools has been instrumental in early screening, yet the heterogeneity of the assessment pool presents a significant challenge in determining which tool is most effective. Broadly, two approaches exist for evaluating these tools: assessing their psychometric properties (such as reliability, validity and sensitivity) or determining their alignment with established diagnostic criteria, i.e., the DSM-5. While evaluating psychometric properties requires extensive validation across diverse populations, assessing tools based on criteria fulfillment provides a more direct and structured method. To maintain clinical relevance, ADHD screening tools must integrate the diagnostic framework, ensuring a more comprehensive assessment. This includes mapping existing questionnaires and validating them against contemporary diagnostic standards. Mapping items to DSM-5 diagnostic criteria is crucial for ensuring the reliability and validity of a screening tool, as it ensures that each item precisely measures the intended construct, leading to more consistent outcomes.
Despite the critical role of standardized diagnostic criteria in ADHD assessment, to the best of our knowledge, no comparable studies have rigorously mapped screening tools to established frameworks such as the DSM-5. This represents a significant gap in the field. Therefore, a multi-level symptom representation and classification pipeline was employed to systematically align questionnaire items with the DSM-5, enabling a precise evaluation of the strengths and limitations of ADHD screening tools. A similar approach has been used separately for question, text and fake news classification (27–29). The major contributions of this study include:

● A three-layer similarity framework was designed to examine the coherence and distinctiveness of diagnostic items beyond statistical associations.

● A multi-level symptom representation and classification pipeline was developed using linguistic features to differentiate ADHD-relevant from non-ADHD symptomatology, followed by subtype classification (inattention vs. hyperactivity/impulsivity).

● The developed pipeline was validated using three widely adopted tools (e.g., Vanderbilt, SNAP-IV and the ADHD Rating Scale). This process identified strengths, inconsistencies and potential misclassifications, particularly where tools blur boundaries between ADHD and comorbid conditions such as conduct disorder.

2 Methodology

2.1 Corpus

To investigate linguistic overlap among the ADHD symptom criteria, this study utilized the set of 18 core diagnostic statements outlined in DSM-5, comprising nine inattention and nine hyperactivity/impulsivity symptoms. Only the primary symptom descriptors were retained, excluding the illustrative examples that accompany each symptom in the manual. This approach allowed for a focused assessment of the conceptual structure and potential semantic overlap within the core diagnostic criteria, as summarized in Table 1.


Table 1. Core ADHD symptom statements used in the analysis.

2.2 Semantic similarity assessment of ADHD symptom statements

To investigate potential linguistic overlap within ADHD symptom domains, a semantic similarity analysis was conducted. This approach evaluates whether symptom statements convey closely related meanings despite being treated as distinct diagnostic indicators. For this task, a pre-trained sentence transformer model, pritamdeka/S-Biomed-Roberta-snli-multinli-stsb, was used as the primary model due to its domain-specific training on biomedical and clinical text, making it particularly suitable for capturing semantic nuances in health-related language (30). While alternative models such as BioBERT or ClinicalBERT are highly effective for token-level tasks (e.g., named entity recognition), they often underperform in semantic similarity tasks because they typically rely on simple mean pooling of token embeddings, which can dilute overall sentence meaning (31, 32). In contrast, the selected primary sentence transformer model employs a Siamese network architecture and is specifically fine-tuned on Natural Language Inference (NLI) and Semantic Textual Similarity (STS) datasets, making it well suited for computing pairwise cosine similarity between diagnostic statements.

To evaluate the sensitivity of the proposed semantic similarity analysis to model selection, a comparative analysis was conducted between the primary model and a widely used general-purpose sentence transformer, all-mpnet-base-v2 (33). Each symptom statement was passed through each model to generate a 768-dimensional sentence embedding that captured its semantic representation. Subsequently, pairwise cosine similarity scores were computed among the embeddings within each domain to quantify the conceptual proximity between statements. This process is demonstrated in Supplementary Figure S1 (Appendix 2). The resulting similarity patterns are illustrated using heat maps in Figure 1A (inattention) and Figure 1B (hyperactivity/impulsivity) for the primary model, while the similarity patterns generated by the general-purpose model are given in Supplementary Figure S2 (Appendix 2). To assess the agreement between models, the Spearman rank correlation was calculated between the similarity scores generated by the general-purpose and primary models for each symptom domain. A rank-based correlation was chosen to focus on relative ordering rather than absolute similarity values, which are known to vary across embedding spaces.
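For illustration, the embedding, pairwise cosine similarity and model-agreement steps can be sketched as follows (Python, assuming the sentence-transformers and scipy packages; the symptom list is abbreviated and this is a sketch rather than the authors' code):

```python
# Sketch of the pairwise similarity computation; model names follow the text,
# the symptom statements are abbreviated placeholders.
from itertools import combinations

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

inattention = [
    "often fails to give close attention to details or makes careless mistakes",
    "often has difficulty sustaining attention in tasks or play activities",
    "often has difficulty organizing tasks and activities",
    # ... remaining DSM-5 inattention statements
]

primary = SentenceTransformer("pritamdeka/S-Biomed-Roberta-snli-multinli-stsb")
general = SentenceTransformer("all-mpnet-base-v2")

def pairwise_scores(model, sentences):
    """Return cosine similarity for every unordered sentence pair."""
    emb = model.encode(sentences, convert_to_tensor=True)   # 768-dim embeddings
    sim = util.cos_sim(emb, emb)
    return [sim[i, j].item() for i, j in combinations(range(len(sentences)), 2)]

primary_scores = pairwise_scores(primary, inattention)
general_scores = pairwise_scores(general, inattention)

# Rank-based agreement between the two embedding models (Section 2.2)
rho, p_value = spearmanr(primary_scores, general_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```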


Figure 1. Semantic similarity heat maps for ADHD symptom domains based on sentence embeddings. (A) Inattention domain: pairwise cosine similarity among symptom statements. (B) Hyperactivity/Impulsivity domain: pairwise cosine similarity among symptom statements. Darker shades indicate higher conceptual similarity between symptoms.

To systematically identify symptom pairs exhibiting significant linguistic overlap, rather than adopting an arbitrary universal threshold (e.g., 0.75), a percentile-based approach was implemented to ensure sensitivity to the distributional characteristics of the cosine similarity scores within each domain. Thresholds of 0.65 (±3) and 0.79 (±3) were set for the inattention domain and 0.47 (±3) and 0.67 (±3) for the hyperactivity/impulsivity domain using the primary and general-purpose models, respectively. These values correspond to approximately the top 1st percentile of the respective similarity distributions, ensuring that only the most semantically proximal pairs were retained for further interpretation. The relaxation of ±3 around the cutoff captures a small range of values near the threshold to account for natural variability and measurement noise, ensuring that borderline cases with meaningful similarity are not excluded. This percentile-based criterion provides a principled, data-driven approach to distinguishing clinically meaningful similarity from general semantic relatedness. The strategy is inspired by prior studies in semantic similarity, text clustering and image thresholding (34, 35). The distributions of similarity scores and the selected thresholds for both domains using the primary model are illustrated in Figure 2A (inattention) and Figure 2B (hyperactivity/impulsivity), while those obtained using the general-purpose model are shown in Supplementary Figure S3 (Appendix 2).
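A minimal sketch of the percentile-based cutoff is shown below; the similarity scores are random placeholders, and the 0.03 tolerance band is only one assumed reading of the ±3 relaxation described above:

```python
import numpy as np

# Toy illustration of the percentile-based cutoff; 'scores' stands in for the 36
# within-domain pairwise similarities, and the 0.03 band is an assumed reading of
# the "+/-3" relaxation described in the text.
rng = np.random.default_rng(0)
scores = rng.uniform(0.1, 0.65, size=36)        # placeholder similarity values

threshold = np.percentile(scores, 99)           # approximately the top 1st percentile
band = 0.03
selected = np.where(scores >= threshold - band)[0]
print(f"cutoff = {threshold:.2f}, retained pair indices: {selected.tolist()}")
```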


Figure 2. Distribution of pairwise semantic similarity scores within ADHD symptom domains. (A) Inattention symptoms and (B) Hyperactivity/Impulsivity symptoms. The green dashed lines indicate the top 1st percentile thresholds (0.65 for inattention and 0.47 for hyperactivity/impulsivity), used to identify highly similar symptom pairs.

Across symptom domains, similarity scores differed in magnitude and ranking across models, indicating a model-dependent representation of DSM-5 symptom language. In the inattention domain, the similarity scores generated by the primary and general-purpose models showed moderate rank agreement (Spearman ρ = 0.54, p < 0.001), and percentile-based selection revealed partial overlap: the general-purpose model identified three highly similar symptom pairs, all of which were also selected by the primary model, which identified one additional pair. In contrast, for the hyperactivity/impulsivity domain, rank agreement was slightly higher (Spearman ρ = 0.63, p < 0.001), yet threshold-based selection diverged, with the general-purpose model identifying three pairs and the primary model identifying a single pair, with no overlap between the selected sets. Qualitative inspection at the level of individual symptom pairs further indicated that the general-purpose model occasionally assigned relatively high scores to clinically related but conceptually distinct behaviors (e.g., pair (3, 6) in the hyperactivity/impulsivity domain), whereas the primary model produced more conservative similarity estimates for such pairs, as shown in Table 2 and Supplementary Table S1 (Appendix 2). Although no external clinical ground truth was used to establish absolute correctness, these differences are consistent with the respective training objectives of the models and highlight that semantic similarity analysis of DSM-5 symptom descriptions is inherently model-dependent. On this basis, the pritamdeka/S-Biomed-Roberta-snli-multinli-stsb model was retained as the primary model for subsequent analysis due to its closer alignment with clinically oriented sentence-level distinctions.


Table 2. Symptom pairs exhibiting high semantic similarity based on cosine scores within each ADHD domain.

To contextualize the percentile-based threshold strategy, the primary sentence transformer model was additionally evaluated on the Semantic Textual Similarity Benchmark (STS-B), a standard dataset containing human-annotated sentence similarity scores normalized between 0 and 1 (36). On STS-B, the cosine similarity scores produced by the model showed very strong alignment with human judgments, yielding a Spearman correlation of approximately 0.96 between the model-generated similarities and the gold-standard annotations, as shown in Supplementary Figure S4 (Appendix 2). At higher similarity ranges, sentence pairs within the 95th-99th percentile exhibited cosine similarity values above 0.96, with corresponding human similarity scores exceeding 0.95, indicating near phrase-level agreement. Importantly, because the similarity values observed in the ADHD symptom analysis predominantly lay within a moderate range (0.50 to 0.65; Table 2), model behavior was specifically examined at comparable similarity levels in STS-B. At mid-range similarity values (cosine similarity of 0.50 to 0.65), the mean human-annotated similarity score was 0.55 with a median of 0.56, closely matching the model-generated values in this range and indicating reasonable calibration beyond only extreme similarity cases. While STS-B does not provide clinical validation for DSM-5 symptom comparisons, this external benchmarking supports the use of percentile-based, data-driven thresholds as a principled method for identifying relative degrees of semantic similarity within a constrained symptom set.

Despite moderate semantic similarity scores, a deeper functional examination reveals important diagnostic distinctions. For instance, the inattention symptoms “often fails to give close attention to details or makes careless mistakes” and “often does not follow through on instructions and fails to finish tasks” scored 0.63 on semantic similarity. However, the first symptom pertains primarily to selective attention and momentary lapses in cognitive processing, whereas the second involves deficits in working memory that are linked to executive functioning (37, 38). Similarly, in the hyperactivity/impulsivity domain, the symptoms “often fidgets with or taps hands or feet or squirms in seat” and “often leaves seat in situations when remaining seated is expected” show a semantic similarity of 0.50, yet represent distinct behavioral contexts: one reflects fine motor restlessness or excessive non-goal-directed motor movement, while the other indicates context-inappropriate behavioral inhibition (39, 40). These observations underscore a key limitation of relying solely on semantic similarity, as embedding models can conflate distinct constructs when symptoms share similar contextual language. This overlap is especially problematic in clinical settings, where superficially alike wording may reflect fundamentally different functional impairments. Therefore, to improve the resolution of the similarity analysis and minimize false conceptual overlap, lexical and syntactic features were incorporated in further analysis. Lexical features identify subtle vocabulary differences, while syntactic patterns reflect variations in how behaviors are structured linguistically.

2.3 Lexical similarity assessment of ADHD symptom statements

2.3.1 Preprocessing

To prepare the symptom statements for lexical similarity analysis, a preprocessing pipeline was implemented to standardize and clean the text for each domain (inattention and hyperactivity/impulsivity) separately. Each sentence was first converted to lowercase to ensure case insensitivity, followed by tokenization using NLTK's word_tokenize method, which segments the sentence into individual word tokens. Common English stop words, such as articles, conjunctions and auxiliary verbs, were removed to eliminate function words that do not contribute meaningful semantic content. Additionally, all tokens were filtered to retain only alphabetical characters, excluding punctuation and special symbols. The remaining tokens were lemmatized using the WordNet lemmatizer, which reduces each word to its base or dictionary form (e.g., “running” becomes “run”), helping to group morphological variants under a single representative form. Before lemmatization, part-of-speech (POS) tagging was applied to each token to identify its grammatical role within the sentence. This step is essential for context-aware lemmatization, as the lemmatizer requires explicit POS input to accurately reduce words to their base forms. In the absence of POS tagging, a standard lemmatizer such as WordNet's defaults to noun-based transformations, which can result in incorrect lemmatization. For instance, in the symptom “often leaves seat when remaining seated is expected”, the word “leaves” is a verb; without POS guidance, however, the lemmatizer treats it as a noun and incorrectly reduces it to “leaf”. By assigning the correct POS tag, a verb in this case, the lemmatizer accurately transforms “leaves” into its root form “leave”. Each token was thus first POS-tagged using the Penn Treebank tag set and then mapped to WordNet-compatible POS categories (i.e., noun, verb, adjective and adverb) prior to lemmatization. This preprocessing ensures that the lexical similarity measures are based on meaningful content words, enhancing the accuracy and interpretability of subsequent similarity computations.
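The preprocessing steps can be illustrated with the following sketch (NLTK; resource names may vary slightly across NLTK versions, and the expected output shown in the final comment is approximate):

```python
# Illustrative preprocessing pipeline following Section 2.3.1: lowercasing,
# tokenization, stop-word removal and POS-aware WordNet lemmatization.
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def to_wordnet_pos(penn_tag):
    """Map Penn Treebank tags to WordNet POS categories (default: noun)."""
    if penn_tag.startswith("V"):
        return wordnet.VERB
    if penn_tag.startswith("J"):
        return wordnet.ADJ
    if penn_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess(sentence):
    tokens = nltk.word_tokenize(sentence.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOP]
    tagged = nltk.pos_tag(tokens)
    return [LEMMATIZER.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in tagged]

print(preprocess("Often leaves seat when remaining seated is expected"))
# approximately: ['often', 'leave', 'seat', 'remain', 'seat', 'expect']
```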

2.3.2 Lexical similarity calculation

Following preprocessing, lexical similarity between pairs of symptom statements was computed using a word-to-word alignment framework adopted from prior work in lexical semantic analysis (11). For each token pair between two sentences, similarity was calculated using a hybrid method. First, path-based semantic similarity was computed using WordNet's path distance, which reflects the conceptual proximity between two words in the lexical ontology. If no valid semantic path existed between the words, or if the computed similarity was below a minimum threshold (0.1), a fallback Levenshtein similarity was used. This string-level metric captures the degree of surface similarity between two words based on character-level edits. Using these measures, a word-level similarity matrix was constructed for each sentence pair, where each cell represents the similarity between a token from the first sentence and a token from the second. The total similarity between two sentences was computed using a greedy alignment algorithm. At each iteration, the highest remaining similarity score in the matrix was selected and added to a cumulative total, and the corresponding row and column were removed from further consideration to avoid reusing aligned tokens. This process was repeated until all rows or columns had been exhausted. The final word similarity was obtained by dividing the accumulated score by the number of iterations (i.e., the number of aligned word pairs). To account for asymmetry in sentence length, a penalty term was applied, computed as the absolute difference between the token lengths of the two sentences, multiplied by the computed similarity and divided by the maximum of the two sentence lengths. This penalty was subtracted from the word similarity to obtain a length-normalized score, which was taken as the final lexical similarity. A detailed example illustrating this entire process is provided in Appendix 1.
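A compact sketch of this hybrid word similarity and greedy alignment is given below. It assumes the NLTK WordNet corpus from the previous sketch and a Levenshtein package (the specific string-similarity library used by the authors is not stated), and it simplifies by comparing only the first synset of each word:

```python
# Sketch of the hybrid word-to-word similarity and greedy alignment (Section 2.3.2);
# the 0.1 threshold follows the text, function names are our own.
import numpy as np
from Levenshtein import ratio as levenshtein_ratio   # assumed string-similarity backend
from nltk.corpus import wordnet

MIN_PATH_SIM = 0.1

def word_similarity(w1, w2):
    """WordNet path similarity with a Levenshtein fallback."""
    s1, s2 = wordnet.synsets(w1), wordnet.synsets(w2)
    if s1 and s2:
        path = s1[0].path_similarity(s2[0])
        if path is not None and path >= MIN_PATH_SIM:
            return path
    return levenshtein_ratio(w1, w2)

def lexical_similarity(tokens_a, tokens_b):
    """Greedy alignment over the word-level similarity matrix, with a length penalty."""
    matrix = np.array([[word_similarity(a, b) for b in tokens_b] for a in tokens_a])
    total, steps = 0.0, 0
    while matrix.size:
        i, j = np.unravel_index(np.argmax(matrix), matrix.shape)
        total += matrix[i, j]                 # best remaining pair
        matrix = np.delete(np.delete(matrix, i, axis=0), j, axis=1)
        steps += 1
    word_sim = total / steps
    penalty = abs(len(tokens_a) - len(tokens_b)) * word_sim / max(len(tokens_a), len(tokens_b))
    return word_sim - penalty                 # length-normalized lexical score

print(round(lexical_similarity(["lose", "thing", "necessary", "task"],
                               ["difficulty", "organize", "task", "activity"]), 3))
```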

2.4 Syntactic similarity assessment of ADHD symptom statements

2.4.1 Preprocessing

In the syntactic similarity layer, preprocessing is limited to the removal of punctuation marks from all symptom sentences in both domains (inattention and hyperactivity/impulsivity) separately. This step eliminates structurally irrelevant tokens that commonly appear as “punct” dependencies and make no meaningful contribution to the grammatical structure of the sentence. The rest of the sentence is retained in full to preserve its syntactic structure, as function words such as auxiliaries and prepositions are essential for accurate dependency parsing. This allows for the accurate extraction of the grammatical relations necessary for constructing RDF-style syntactic triples, which form the basis of the syntactic similarity computation. Additionally, all tokens are converted to lowercase and lemmatized to ensure consistent word forms and to improve the matching of syntactic elements across sentences.

2.4.2 Syntactic similarity calculation

The syntactic triple extraction method used in this study follows the approach proposed in (11), which represents sentence structure using RDF-style dependency triples. After preprocessing, each sentence is parsed using spaCy's dependency parser to extract syntactic relations in the form of head, relation and dependent. These triples encode the grammatical relationships between tokens and capture the core syntactic structure of the sentence. To enhance the relevance of the extracted structure, only a selected subset of dependency types is retained, for instance nominal subjects, direct objects, indirect objects, adverbs, prepositions and auxiliary verbs. This filtering removes low-impact dependencies, such as determiners, which contribute minimally to structural similarity. Dependency graph examples resulting from this process are illustrated in Figure 3, and representative triples from inattention symptoms 1 and 4 are shown in Table 3. After extracting syntactic triples from each sentence, pairwise similarity between triples was computed using the same hybrid method defined in the lexical layer. However, instead of comparing individual tokens, the comparison was performed between syntactic triples by evaluating the similarity between their corresponding vertices. For each pair of triples, two alignment scores were calculated: the first by averaging the word similarity between corresponding head and dependent words, and the second by averaging the similarity in a cross-wise manner (i.e., head to dependent and dependent to head). The final similarity between two triples was obtained by averaging these two alignment scores. This approach allows for both direct and cross alignment between the vertices of the two triples, improving robustness against syntactic variation. Using this formulation, a similarity matrix was constructed in which each cell represents the similarity score between a triple from the first sentence and one from the second. A greedy alignment algorithm, identical to that used in the lexical layer, was then applied to aggregate these scores into an overall syntactic similarity between sentences. This algorithm selects the best-matching pairs of triples without repetition and averages their scores, yielding the final syntactic similarity value. A detailed example illustrating this entire process is provided in Appendix 3.
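An illustrative sketch of the triple extraction is shown below (spaCy with the en_core_web_sm model; the retained dependency labels are a plausible subset and may differ from the exact list used in the study):

```python
# Illustrative extraction of RDF-style (head, relation, dependent) triples with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
KEEP = {"nsubj", "dobj", "pobj", "advmod", "prep", "aux", "xcomp", "conj"}  # assumed subset

def extract_triples(sentence):
    doc = nlp(sentence.lower())
    triples = []
    for token in doc:
        if token.dep_ in KEEP and not token.is_punct:
            triples.append((token.head.lemma_, token.dep_, token.lemma_))  # head, relation, dependent
    return triples

for head, rel, dep in extract_triples(
    "Often fails to give close attention to details or makes careless mistakes"
):
    print(f"{head} --{rel}--> {dep}")
```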


Figure 3. Full syntactic dependency graphs of (A) inattention symptom 1 and (B) hyperactivity/impulsivity symptom 1. Each graph represents the complete syntactic structure of the sentence, with nodes as lemmatized tokens and edges as grammatical relations, based on filtered dependency types.


Table 3. Representative syntactic triples extracted from inattention domain symptoms 1 and 4.

2.5 Validation of multi-layer similarity assumption

After computing similarity scores across the semantic, lexical and syntactic layers for all symptom pairs, the next step involved evaluating the relationship between these layers. While semantic similarity, particularly when derived from sentence transformers, is often assumed to be sufficient for capturing overall meaning (41), it remains important to assess whether it adequately reflects other linguistic dimensions, as this assumption may not hold in contexts where fine-grained linguistic cues play a significant role. To investigate this, two complementary statistical methods were employed to assess whether lexical and syntactic similarities provide non-redundant, complementary information beyond what is captured semantically.

2.5.1 Wilcoxon signed-rank test

To determine whether lexical and syntactic similarity scores differ significantly from semantic similarity scores, the Wilcoxon signed-rank test was used (42). This non-parametric test is designed for comparing paired samples and evaluates whether the median of the differences between them is zero. It is suitable in this context as it does not assume normality and is robust to the skewed distribution of similarity values. The test was applied separately to the inattention and hyperactivity/impulsivity symptom domains, and the results are reported in Table 4. Statistically significant results in all cases (p < 0.05) confirm that lexical and syntactic similarities differ meaningfully from semantic similarity, indicating that the semantic layer does not entirely capture the information encoded at the lexical and syntactic levels.
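A minimal sketch of this paired comparison, using random placeholder scores in place of the study's values:

```python
# Paired Wilcoxon signed-rank tests between similarity layers (Section 2.5.1);
# the arrays below are placeholders for the per-pair scores of one domain.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
semantic = rng.uniform(0.1, 0.7, 36)
lexical = semantic + rng.normal(0.05, 0.05, 36)
syntactic = rng.uniform(0.0, 0.4, 36)

stat_lex, p_lex = wilcoxon(lexical, semantic)
stat_syn, p_syn = wilcoxon(syntactic, semantic)
print(f"lexical vs semantic:   W = {stat_lex:.1f}, p = {p_lex:.4f}")
print(f"syntactic vs semantic: W = {stat_syn:.1f}, p = {p_syn:.4f}")
```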


Table 4. Results of the Wilcoxon signed-rank test and mutual information regression comparing semantic similarity with lexical and syntactic similarity across the inattention and hyperactivity/impulsivity domains.

2.5.2 Mutual information regression

To further examine the dependence between similarity layers, mutual information regression was employed (43). Mutual information quantifies the amount of information shared between two variables, capturing both linear and non-linear associations. In this context, it measures how much knowledge of the lexical and syntactic similarity scores reduces uncertainty about the semantic similarity scores. A value of zero indicates complete independence, while higher values suggest stronger dependency. The results of the mutual information regression are given in Table 4 and provide additional support for the findings of the Wilcoxon test. In particular, the zero mutual information between syntactic and semantic similarity in the hyperactivity/impulsivity domain indicates statistical independence, confirming that syntactic patterns are not captured by the semantic representation. Even in the inattention domain, the mutual information values are only slightly above zero, suggesting partial overlap but not redundancy.
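The corresponding mutual information estimate can be sketched as follows (scikit-learn; again with random placeholder scores rather than the study's data):

```python
# Mutual information between the semantic layer and the lexical/syntactic layers.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
semantic = rng.uniform(0.1, 0.7, 36)                 # placeholder per-pair scores
lexical = semantic + rng.normal(0.05, 0.05, 36)
syntactic = rng.uniform(0.0, 0.4, 36)

X = np.column_stack([lexical, syntactic])
mi = mutual_info_regression(X, semantic, random_state=0)
print(dict(zip(["lexical", "syntactic"], mi.round(3))))
```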

Together, these analyses strongly support the claim that lexical and syntactic features contribute unique and necessary information when assessing overlap in clinical symptom descriptions.

2.6 Entropy-based feature weighting

In this study, the Entropy Weight Method (EWM), a widely recognized technique within the Multiple Criteria Decision Making (MCDM) framework, was employed to objectively determine the weights of the different similarity layers without requiring prior knowledge or labeled data (44, 45). The method leverages the concept of entropy from information theory to quantify the degree of disorder or uncertainty in the distribution of feature values. Features exhibiting greater variability and lower entropy carry more informative content and are thus assigned higher weights. This approach facilitates an unbiased weighting process driven solely by the data itself, making it especially suitable for unsupervised scenarios. The procedure consists of the following steps:

1. Data normalization: The values of each similarity layer were normalized such that their sum across all samples equals one. This was achieved by dividing each value in the feature column (e.g., lexical) by the total sum of that feature, i.e., $p_{ij} = x_{ij} / \sum_{i=1}^{m} x_{ij}$. This step converts the original data into a probability distribution for each similarity layer, which is a prerequisite for computing entropy.

2. Zero-value adjustment: To avoid the undefined logarithm of zero during entropy calculation, any zero entries in the normalized data are substituted with a small positive constant, typically $10^{-10}$.

3. Computation of normalization factor (k): The normalization factor k ensures that calculated entropy values are scaled between zero and one, making them comparable across different features.

$$k = \frac{1}{\ln(m)}$$

where m is the number of samples in the dataset and ln denotes the natural logarithm.

4. Entropy computation: For each similarity layer, the entropy was computed as a measure of uncertainty based on its normalized value distribution, using the formula:

$$H_j = -k \sum_{i=1}^{m} p_{ij}\,\ln(p_{ij})$$

5. Determination of Divergence: The informational contribution of each similarity layer has been represented by its degree of divergence, calculated as one minus the corresponding entropy value. This metric highlights similarity layers with higher variability and information content.

6. Normalization to obtain weights: The divergence values were normalized so that their sum equals one, yielding the final set of similarity layer weights.

$$w_j = \frac{1 - H_j}{\sum_{j=1}^{n} (1 - H_j)}$$

These weights reflect the relative importance of each layer based on the inherent characteristics of the data.

By implementing the entropy weight method, we quantitatively capture the intrinsic significance of each layer for each domain separately, thereby enhancing the robustness and validity of the subsequent data analysis and decision-making process. Entropy-based weights for the lexical, syntactic and semantic similarities were calculated separately for the two domains (inattention and hyperactivity/impulsivity). Lexical similarity weights were 0.35 and 0.21, indicating moderate importance in both domains. Syntactic weights were low, at 0.16 and 0.04, showing limited contribution, especially in the second domain. Semantic weights were highest, at 0.49 and 0.75, highlighting the dominant role of this layer, particularly in the hyperactivity/impulsivity domain. The combined score for each instance in each domain was then calculated by multiplying each original similarity layer value by its corresponding weight and summing the weighted values to produce a single overall similarity score per instance. This weighted aggregation reflects the relative importance of each similarity layer as quantified by entropy. The final combined scores are presented in Tables 5, 6 for the inattention and hyperactivity/impulsivity domains, respectively.
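The six steps above can be condensed into a short function; the sketch below uses placeholder per-pair scores and reproduces the formulas, not the authors' implementation:

```python
# Sketch of the entropy weight method used to combine the three similarity layers
# (Section 2.6); input rows are placeholder per-pair scores for one domain.
import numpy as np

def entropy_weights(X, eps=1e-10):
    """Columns of X are similarity layers (lexical, syntactic, semantic)."""
    m = X.shape[0]
    P = X / X.sum(axis=0)                   # step 1: column-wise normalization
    P = np.where(P == 0, eps, P)            # step 2: zero-value adjustment
    k = 1.0 / np.log(m)                     # step 3: normalization factor
    H = -k * (P * np.log(P)).sum(axis=0)    # step 4: entropy per layer
    d = 1.0 - H                             # step 5: degree of divergence
    return d / d.sum()                      # step 6: normalized weights

rng = np.random.default_rng(0)
X = rng.uniform(0.05, 0.7, size=(36, 3))    # 36 symptom pairs x 3 layers (placeholder)
w = entropy_weights(X)
combined = X @ w                            # weighted combined score per pair
print("weights (lexical, syntactic, semantic):", w.round(2))
```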


Table 5. Combined similarity scores calculated by weighting the original feature values (lexical, syntactic and semantic) with entropy-based weights for the inattention domain.


Table 6. Combined similarity scores calculated by weighting the original feature values (lexical, syntactic and semantic) with entropy-based weights for the hyperactivity/impulsivity domain.

2.7 Feature extraction based on similarity measures

The second aim of this study is to assess the effectiveness of screening tools relative to the DSM-5 diagnostic criteria by developing a robust classification framework. This classification task involves two categories, inattention and hyperactivity/impulsivity, which were assigned the binary labels 0 and 1, respectively. Feature extraction was performed by leveraging the similarity scores computed among the 18 symptoms described in Table 1 (46, 47). Specifically, lexical, syntactic and semantic similarities were measured between each target sentence and the two predefined symptom domains corresponding to the assigned classes. For each similarity metric, the average similarity to all sentences within each symptom domain was calculated, and the difference between these averages forms the basis of the feature value. This process produces a concise three-dimensional feature vector for each sentence, capturing its relative closeness to both classes across multiple linguistic dimensions. By transforming pairwise similarity information into fixed-length feature representations, this approach facilitates effective binary classification.
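A sketch of this feature construction is shown below; the similarity functions passed in are trivial placeholders standing in for the lexical, syntactic and semantic measures described earlier:

```python
# One three-dimensional feature vector per target sentence (Section 2.7): the difference
# between its mean similarity to the inattention statements and to the
# hyperactivity/impulsivity statements, computed once per similarity layer.
import numpy as np

def feature_vector(target, inattention, hyperactive, sim_funcs):
    """sim_funcs maps layer name -> pairwise similarity function(sentence_a, sentence_b)."""
    features = []
    for sim in sim_funcs.values():
        mean_inatt = np.mean([sim(target, s) for s in inattention])
        mean_hyper = np.mean([sim(target, s) for s in hyperactive])
        features.append(mean_inatt - mean_hyper)     # one value per layer
    return np.array(features)                         # [lexical, syntactic, semantic]

# Toy usage with placeholder similarity functions
toy_sims = {
    "lexical": lambda a, b: len(set(a.split()) & set(b.split())) / max(len(set(a.split())), 1),
    "syntactic": lambda a, b: 0.2,
    "semantic": lambda a, b: 0.5,
}
vec = feature_vector(
    "often loses things necessary for tasks",
    ["often has difficulty organizing tasks and activities"],
    ["often fidgets with or taps hands or feet"],
    toy_sims,
)
print(vec)
```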

2.8 Model development and evaluation

The analysis was implemented as a two-stage machine learning pipeline, in which symptom statements flow sequentially from a filtering stage to downstream classification models. In the first stage, a semantic filtering layer was applied to distinguish ADHD-consistent symptom language from non-ADHD diagnostic language before further processing. This filtering layer was implemented using a logistic regression classifier trained on DSM-5 diagnostic statements, with ADHD symptoms treated as the positive class and DSM-5 symptoms from Conduct Disorder (CD), Oppositional Defiant Disorder (ODD) and Major Depressive Disorder (MDD) treated as the negative class. These symptoms were not treated as clinically exclusive to ADHD, as symptom co-occurrence across psychiatric disorders is well established and explicitly acknowledged. Instead, they were used as contrastive examples to model differences in diagnostic language and symptom framing at the textual level. ADHD screening instruments are specifically designed to assess core inattention and hyperactivity/impulsivity constructs; introducing non-ADHD DSM-5 symptom formulations therefore enables examination of whether such tools preferentially retain canonical ADHD-related language or admit broader behavioral and affective symptom descriptions. This design choice serves to probe the linguistic specificity, strengths and limitations of ADHD screening tools rather than to impose categorical diagnostic boundaries. Importantly, all non-ADHD DSM-5 statements were confined to the training phase of the filtering layer and were never used in downstream evaluations. All symptom statements (ADHD and non-ADHD) were encoded using a fixed pre-trained sentence transformer, pritamdeka/S-Biomed-Roberta-snli-multinli-stsb, generating 768-dimensional semantic embeddings that served as input features to the filtering layer (logistic regression). The logistic regression model was trained with default parameters except for max_iter = 1000 to ensure convergence. The training data for the filter exhibited class imbalance (18 ADHD vs. 32 non-ADHD statements), which was formally assessed using a chi-square test (chi-square statistic = 3.920, p = 0.048), indicating a statistically significant imbalance at alpha = 0.05. To address this, Adaptive Synthetic Sampling (ADASYN), a resampling technique, was applied prior to training the filter. This technique was selected because it adaptively shifts the decision boundary by generating synthetic samples in difficult regions of the minority class rather than duplicating samples, thereby improving class separability without inflating redundant examples (48).
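The filtering stage can be sketched as follows (scikit-learn and imbalanced-learn); the embeddings below are random placeholders standing in for the 768-dimensional sentence-transformer vectors, so the fitted model is illustrative only:

```python
# Sketch of the first-stage filtering layer: ADASYN resampling of the minority (ADHD)
# class followed by logistic regression on sentence embeddings (Section 2.8).
import numpy as np
from imblearn.over_sampling import ADASYN
from sklearn.linear_model import LogisticRegression

# Placeholder embeddings standing in for the 18 ADHD and 32 non-ADHD (CD/ODD/MDD) statements
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 768))
y = np.array([1] * 18 + [0] * 32)           # 1 = ADHD, 0 = non-ADHD

# ADASYN generates synthetic minority-class samples in hard-to-learn regions
X_res, y_res = ADASYN(random_state=42).fit_resample(X, y)

# Filtering layer: logistic regression on the resampled embeddings
filter_clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# At test time, only items predicted as ADHD-consistent (label 1) pass downstream
test_items = rng.normal(size=(5, 768))      # placeholder questionnaire-item embeddings
passed = test_items[filter_clf.predict(test_items) == 1]
print(f"{len(passed)} of {len(test_items)} items passed the filter")
```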

Once trained, the filtering layer was applied to the test dataset, and only those statements classified as ADHD-consistent were passed to the second stage of the pipeline. In the second stage, three supervised classifiers were trained exclusively on the ADHD DSM-5 statements using the predefined lexical, syntactic and semantic similarity features. These classifiers were 1) a logistic regression model with default parameters, 2) a support vector classifier using an RBF kernel with probability estimates enabled, and 3) a random forest classifier with 100 trees and a fixed random state of 42 (49–51). No hyper-parameter tuning was performed at this stage, allowing differences in performance to reflect model characteristics rather than optimization choices. The training data at this stage consisted only of DSM-5 ADHD statements, while all screening tools were reserved strictly for testing and validation.
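The second-stage classifier configuration described above corresponds to the following sketch (placeholder feature matrix; not the authors' code):

```python
# Second-stage classifiers as configured in Section 2.8, trained on the
# 3-dimensional similarity features of the 18 DSM-5 ADHD statements.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

classifiers = {
    "LR": LogisticRegression(),                                     # default parameters
    "SVM": SVC(kernel="rbf", probability=True),                     # probability estimates enabled
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Placeholder feature matrix and domain labels
rng = np.random.default_rng(42)
X_train = rng.normal(size=(18, 3))
y_train = np.array([0] * 9 + [1] * 9)        # 0 = inattention, 1 = hyperactivity/impulsivity

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "training accuracy:", round(clf.score(X_train, y_train), 2))
```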

For evaluation, all items from the Vanderbilt Preschool Assessment Questionnaire were first processed through the complete pipeline to assess the baseline performance of the three classifiers under identical conditions (22). Model evaluation was conducted using standard metrics, including accuracy, precision, recall, F1-score and the ROC curve, to comprehensively assess classification effectiveness. This combined architecture leverages both embedding-based filtering and feature-based classification to robustly distinguish ADHD symptom profiles.

Because screening instruments are frequently derived from DSM-5 language, a dedicated data-leakage analysis was then conducted to identify near-verbatim overlap between the training and test datasets. Specifically, cosine similarity was computed between all DSM-5 sentences used during training (including ADHD and non-ADHD statements from the filtering layer) and all questionnaire items, using a conservative threshold of cosine similarity greater than or equal to 0.90 to flag near-duplicate or paraphrase-equivalent sentence pairs rather than general semantic or topical relatedness (36). This analysis identified approximately 26% of questionnaire items as highly overlapping. These items were then excluded, and the full pipeline was re-applied to the reduced test set to examine the robustness of classifier performance in the absence of near-identical training language. To assess this, performance stability was examined using bootstrap resampling. For both the original (overlap-included) and overlap-excluded test sets, bootstrap resampling with 30 iterations was conducted (chosen owing to the small sample size), as shown in Table 7, and model accuracy was recalculated at each iteration. Paired t-tests were then applied to compare the bootstrap accuracy distributions across the two datasets, allowing statistical assessment of whether classifier performance was materially affected by the overlapping content. Following the leakage assessment, model selection was performed by comparing the three second-stage classifiers (the first stage being the filtering layer and the second the classification layer, as shown in Figure 4) using the same bootstrap framework. This procedure was applied to the overlap-excluded test set, and performance distributions were compared between classifiers (e.g., Random Forest vs. Support Vector Machine), with paired t-tests used to assess statistically meaningful differences in performance. This comparison was used to identify the most stable and reliable model for subsequent external validation, while ensuring that the observed performance was not driven by textual overlap between the diagnostic criteria and the screening tool items. The complete flow of the pipeline is shown in Figure 4.
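The leakage screening and bootstrap comparison can be sketched as follows; all embeddings, labels and predictions are random placeholders used only to show the procedure:

```python
# Items with cosine similarity >= 0.90 to any training sentence are flagged as near-verbatim,
# and classifier accuracy is bootstrapped (30 resamples) on both test-set versions.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(42)

def flag_overlap(train_emb, test_emb, threshold=0.90):
    """Indices of test items whose maximum similarity to any training sentence >= threshold."""
    sims = cosine_similarity(test_emb, train_emb)
    return np.where(sims.max(axis=1) >= threshold)[0]

def bootstrap_accuracy(y_true, y_pred, n_iter=30, seed=0):
    boot_rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_iter):
        idx = boot_rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        accs.append(np.mean(y_true[idx] == y_pred[idx]))
    return np.array(accs)

# Overlap screening on placeholder 768-dim embeddings (first 3 test items are near-duplicates)
train_emb = rng.normal(size=(50, 768))
test_emb = np.vstack([train_emb[:3] + 0.01, rng.normal(size=(10, 768))])
print("flagged items:", flag_overlap(train_emb, test_emb))

# Paired t-test between bootstrap accuracy distributions on the full vs. reduced test set
y_full, pred_full = rng.integers(0, 2, 40), rng.integers(0, 2, 40)
y_red, pred_red = y_full[:30], pred_full[:30]
t_stat, p_value = ttest_rel(bootstrap_accuracy(y_full, pred_full),
                            bootstrap_accuracy(y_red, pred_red))
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```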


Table 7. Distribution of samples across the training, test and validation datasets, showing the total number of instances and the class proportions in each split.


Figure 4. Process flow for model development, evaluation and validation.

2.9 External validation

For external validation, the best-performing classifier identified in the previous stage was evaluated on three independent instruments: the ADHD Rating Scale, SNAP-IV and M-CHAT. The ADHD Rating Scale and SNAP-IV were selected due to their widespread clinical use; they differ in item phrasing, behavioral emphasis and response structure, allowing examination of how variation in symptom wording influences model behavior. M-CHAT was included as a non-ADHD screening instrument to assess the specificity of the designed pipeline. Prior to evaluation, a verbatim overlap analysis was conducted between the training corpus and the validation items using cosine similarity, with a conservative threshold of greater than or equal to 0.90 to identify near-duplicate content. Approximately 24% of validation items exceeded this threshold and were excluded. The selected model was then applied to the remaining non-overlapping items, and performance was reported using standard classification metrics to assess generalization under strict leakage control. This approach tests the model's generalizability and specificity across related neurodevelopmental disorders, and the use of multiple standardized instruments strengthens the robustness of the validation process.

3 Results and discussion

3.1 Overlapping symptom pairs based on entropy-weighted combined similarity scores

Using the combined similarity scores, cutoff thresholds were defined at the top 1st percentile (±3) to identify the most similar symptom pairs. The 1st percentile cutoff for the inattention domain was set at 0.61 (±3), and for the hyperactivity/impulsivity domain at 0.47 (±3). The relaxation of ±3 around the cutoff captures a small range of values near the threshold to account for natural variability and measurement noise, ensuring that borderline cases with meaningful similarity are not excluded. Within these thresholds, two pairs of symptoms in the inattention domain and one pair in the hyperactivity/impulsivity domain exhibited the highest similarity values in their respective score distributions. The first pair, “often has difficulty sustaining attention in tasks or play activities” and “often has difficulty organizing tasks and activities”, showed a similarity score of 0.62. The second pair, “often has difficulty organizing tasks and activities” and “often loses things necessary for tasks and activities”, had a similarity of 0.58, as shown in Table 5. In the hyperactivity/impulsivity domain, the pair “often fidgets with or taps hands and feet, or squirms in seat” and “often leaves seat in situations when remaining seated is expected” demonstrated a similarity of 0.49, as represented in Table 6.

Among these, the first two pairs are more debatable because their similarity scores indicate meaningful but moderate overlap, while the third pair has a lower similarity that reflects a more clearly distinct symptom presentation. The two symptoms, “often has difficulty sustaining attention in tasks or play activities” and “often has difficulty organizing tasks and activities”, exhibit notable similarity despite representing distinct neurocognitive impairments, because of their functional interdependence in goal-directed behavior. Sustained attention is primarily associated with the ability to maintain focused cognitive engagement over time, a function critically mediated by prefrontal cortex circuits regulating vigilance and behavioral control (52). This deficit impairs the capacity to sustain the mental effort necessary for task completion. On the other hand, difficulty organizing tasks reflects dysfunction in working memory, planning and executive control processes, higher-order functions that depend on overlapping but partially separable prefrontal and frontostriatal networks responsible for managing, sequencing and prioritizing information (53, 54). Although impaired working memory and poor planning contribute to disorganization, these processes involve complex manipulation of information beyond pure attentional focus (52, 54). The moderate similarity arises because ineffective sustained attention can exacerbate problems with organization by limiting the individual's ability to keep task-relevant information online, creating a downstream impact on planning and task management (53, 55). Empirical data support that, while these cognitive domains interact and often co-occur in ADHD, they represent dissociable constructs with distinct neurobiological substrates (53). Therefore, despite their semantic overlap and functional association, these symptoms map onto different executive domains, vigilance versus working memory/executive control, and should be interpreted as related but neurocognitively distinct deficits. This distinction has important implications for targeted assessment and intervention in ADHD. The second symptom pair, “often has difficulty organizing tasks and activities” and “often loses things necessary for tasks and activities”, reflects closely related manifestations of executive dysfunction, particularly impairments in planning, organization and working memory. Losing things necessary for tasks, such as tools, materials or belongings, is commonly viewed as a behavioral consequence of disorganization and working memory deficits, where failure to monitor, update or recall task-relevant information impairs task execution. Neuropsychological studies show that these symptoms engage similar prefrontal and frontostriatal circuits responsible for maintaining and manipulating information during goal-directed behavior (56). However, difficulty organizing tasks captures a higher-level strategic aspect of executive control, whereas losing things reflects lapses in real-time monitoring and item tracking, implicating aspects of prospective memory and attentional control. Therefore, although both symptoms are strongly interrelated and often co-occur, they represent distinct but complementary facets of the executive function deficits commonly observed in ADHD and related conditions (57).
Clinically, this distinction is important: interventions targeting organizational skills may differ from those addressing memory strategies to reduce item loss, suggesting these symptoms should be considered closely linked but not identical (58, 59).

3.2 Evaluation of the multi-stage classification pipeline

The performance evaluation of the filtering layer and classification models was conducted on the Vanderbilt assessment tool's symptom items. On the original test data (before exclusion of overlapping items), the filtering layer identified 18 ADHD-related symptom items with a high accuracy of 97%, perfect precision and specificity (100%), and a minimal false-negative count of one. Subsequent classification of these filtered symptom items was performed using Random Forest (RF), Support Vector Machine (SVM) and Logistic Regression (LR). The RF model attained an accuracy of 94%, precision of 90%, recall of 100% and F1-score of 0.95. The SVM model achieved an accuracy, precision, recall and F1-score of 89%, while the LR model yielded slightly lower values, with an accuracy of 83%, precision of 75%, perfect recall of 100% and an F1-score of 0.86. Receiver operating characteristic (ROC) curve analysis, shown in Figure 5, yielded an area under the curve (AUC) of 0.94 for RF, compared with 0.89 and 0.83 for SVM and LR, respectively. Figure 6 summarizes the comparative evaluation metrics of these models. Per-class classification metrics are given in Table 8, with inattention encoded as class 0 and hyperactivity/impulsivity as class 1.


Figure 5. ROC curve comparing the classification performance of Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR) models. The RF model achieved the highest AUC of 0.94, indicating superior discriminative ability, followed by SVM with an AUC of 0.89 and LR with an AUC of 0.83. Higher AUC values represent better overall model performance.


Figure 6. Model evaluation scores for random forest, support vector machine and logistic regression, illustrating precision, recall, F1-score and accuracy for each model.


Table 8. Per-class classification metrics for ADHD and Non-ADHD symptom items on the original test dataset across RF, SVM and LR classifiers.

On the overlap-excluded test data, the filtering layer correctly identified 12 ADHD-related symptoms with 100% precision and specificity and only one false negative. The retained items were then classified by the same three classifiers. RF and SVM yielded identical performance (accuracy of 92%, precision of 83%, recall of 100% and F1-score of 0.91), followed by LR (accuracy of 75%, precision of 62%, recall of 100% and F1-score of 0.77). The ROC analysis shown in Figure 7 yielded the same AUC for RF and SVM (0.93) compared with LR (0.79). A comparative summary of the model metrics is presented in Figure 8, with per-class classification results in Table 9.

Figure 7
ROC curve comparison chart showcasing performance of three models: SVM (AUC=0.93, blue line), LR (AUC=0.79, orange line), and RF (AUC=0.93, green line). The x-axis represents the false positive rate, and the y-axis represents the true positive rate. The black dashed line indicates random performance.

Figure 7. ROC curve comparing the classification performance of Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR) models on the overlap-excluded test data. The RF and SVM models both achieved an AUC of 0.93, indicating superior discriminative ability, followed by LR with an AUC of 0.79. Higher AUC values represent better overall model performance.

Figure 8
Bar chart comparing precision, recall, F1-score, and accuracy percentages for three models: Random Forest, Support Vector Machine, and Logistic Regression. Random Forest and Support Vector Machine show identical scores: Precision at eighty-three percent, Recall at one hundred percent, F1-score at ninety-one percent, and Accuracy at ninety-two percent. Logistic Regression has lower scores, with Precision at sixty-two percent, Recall at one hundred percent, F1-score at seventy-seven percent, and Accuracy at seventy-five percent.

Figure 8. Model evaluation scores for random forest, support vector machine and logistic regression, illustrating precision, recall, F1-score and accuracy for each model on the overlap-excluded test set.

Table 9

Table 9. Per-class classification metrics for ADHD and Non-ADHD symptom items on the overlapping excluded test dataset across RF, SVM and LR classifiers.

To evaluate whether the proposed pipeline was susceptible to performance inflation due to overlap between DSM-5 training statements and screening-tool items, a two-stage robustness analysis was conducted: first, the filtering layer and downstream classifiers were evaluated on the full test set; second, potentially overlapping items were removed and the full evaluation was repeated. Importantly, the filtering layer, trained to distinguish ADHD-consistent statements from Non-ADHD symptoms, demonstrated identical performance before and after overlap exclusion, indicating that its behavior was not dependent on memorized lexical content but rather on its learned decision boundary. Bootstrapping with 30 iterations was then applied separately to both datasets to quantify the stability of the downstream classifiers and to test whether overlap removal altered model performance. On the overlap-excluded test set, mean accuracies were 0.949 (SD = 0.057) for LR, 0.959 (SD = 0.048) for SVM and 0.938 (SD = 0.073) for RF. The corresponding accuracy distributions, shown in Figure 9A, demonstrate that LR scores ranged primarily between 0.75 and 1.00, SVM between 0.84 and 1.00, and RF between 0.70 and 1.00, indicating moderate variability but consistently high central performance across resamples. The box plots indicated narrow interquartile ranges concentrated towards the upper end of the accuracy scale, suggesting that occasional drops in performance were rare outliers rather than systematic instability, as shown in Figure 9B. A comparable pattern emerged when the same procedure was applied to the original test set (without removal of overlapping items). Mean accuracies were 0.942 (SD = 0.075) for LR, 0.949 (SD = 0.064) for SVM, and 0.956 (SD = 0.053) for RF, with accuracy distributions again clustered in the upper range (0.75 to 1.00 for LR and SVM, 0.84 to 1.00 for RF), as shown in Figure 10A. The consistency of boxplot medians and dispersion across both datasets indicates that model behavior was stable despite the removal of overlapping textual content (shown in Figure 10B).

To formally assess whether model performance differed between datasets, paired t-tests on the bootstrap accuracy distributions were conducted. No statistically significant differences were observed at the 5% level of significance for any classifier (LR, before vs. after exclusion: t = 0.367, p = 0.717; RF: t = -1.211, p = 0.236; SVM: t = 0.656, p = 0.517). The 95% Confidence Intervals (CIs) for the mean accuracy differences all contained zero, as shown in Figure 11 (LR: diff = 0.007, CI = [-0.030, 0.043]; RF: diff = -0.018, CI = [-0.048, 0.012]; SVM: diff = 0.010, CI = [-0.021, 0.041]), confirming that any fluctuations represent sampling variation rather than systematic performance shifts attributable to data leakage. These findings indicate that the classification models are robust and not driven by information leakage, and that their predictive behavior reflects genuine discriminative learning rather than memorization of paraphrased diagnostic content.
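The bootstrap-and-test procedure described above can be sketched as follows; this is an illustrative reconstruction under stated assumptions, not the published analysis script. It assumes an already-fitted classifier (clf) and a fixed test set, and pairs the bootstrap accuracies of two conditions (for example, the original and the overlap-excluded test sets) by iteration index before applying the paired t-test and a t-based confidence interval for the mean difference.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

def bootstrap_accuracies(clf, X_test, y_test, n_iter=30, seed=0):
    """Accuracy of a fitted classifier on n_iter bootstrap resamples of the test set."""
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(n_iter):
        X_boot, y_boot = resample(X_test, y_test, random_state=rng)
        accs.append(accuracy_score(y_boot, clf.predict(X_boot)))
    return np.array(accs)

def compare_conditions(acc_a, acc_b, alpha=0.05):
    """Paired t-test plus a 95% t-interval for the mean accuracy difference,
    pairing the two bootstrap distributions by iteration index."""
    t_stat, p_val = stats.ttest_rel(acc_a, acc_b)
    diff = acc_a - acc_b
    ci = stats.t.interval(1 - alpha, len(diff) - 1,
                          loc=diff.mean(), scale=stats.sem(diff))
    return t_stat, p_val, diff.mean(), ci
```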

Figure 9
Panel A shows histograms of bootstrap accuracies for three models: Logistic Regression (LR) with most frequencies at high accuracy, Support Vector Machine (SVM) peaking around 0.92 and 1.00, and Random Forest (RF) clustering near 0.95 and 1.00. Panel B shows box plots for the same models, illustrating accuracy distributions with Logistic Regression having a median near 0.95, SVM near 0.92, and Random Forest close to 0.95.

Figure 9. Bootstrap accuracy distributions for logistic regression, SVM and random forest on the overlap-excluded test set. (A) Bar plots and (B) box plot illustrate the stability of classifier performance across 30 bootstrap iterations. Accuracy values are concentrated in the upper performance range (LR: 0.75-1.00, SVM: 0.84-1.00, RF: 0.70-1.00), indicating limited variability and high model robustness following removal of overlapping items.

Figure 10
Panel A shows histograms of bootstrap accuracies for Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF). LR and SVM have peaks at 0.95 and 1.00, while RF shows higher frequencies at 0.94 and 1.00. Panel B presents box plots of bootstrap accuracies for the three models. Logistic Regression displays a wide range of accuracies with several outliers below 0.85. SVM and Random Forest have more clustered results with Random Forest showing a slightly broader range.

Figure 10. Bootstrap accuracy distributions for logistic regression, SVM and random forest on the original test set. (A) Bar plots and (B) box plot illustrate the stability of classifier performance across 30 bootstrap iterations. Accuracy values are concentrated in the upper performance range (LR and SVM: 0.75-1.00, RF: 0.84-1.00), indicating limited variability across different model performances.

Figure 11
Graph showing 95% confidence intervals for mean differences in bootstrap accuracies across three model comparisons: SVM vs. SVM, RF vs. RF, and LR vs. LR. Each comparison has a horizontal blue line with red dots indicating mean differences. Zero difference is marked by a dashed vertical line.

Figure 11. 95% confidence intervals for bootstrap-based differences in model accuracy before and after overlap exclusion. The plot displays mean accuracy differences and their 95% confidence intervals for LR, RF and SVM across 30 bootstrap iterations. All confidence intervals include zero (LR: diff = 0.007, CI = [-0.030, 0.043]; RF: diff = -0.018, CI = [-0.048, 0.012]; SVM: diff = 0.010, CI = [-0.021, 0.041]), indicating no statistically significant change in performance between the original and overlap-excluded test sets. These results provide evidence that model accuracy was not inflated by lexical overlap and that the pipeline remains robust to potential data-leakage effects.

After confirming that performance was not inflated by lexical overlap, model selection was conducted using only the overlap-excluded test set, as this dataset provides the most conservative, leakage-controlled estimate of classifier behavior. Bootstrap resampling (30 iterations) was applied to the accuracies of the three classifiers, as shown in Figure 9, and paired t-tests were used to compare models. The results indicated no statistically significant differences in mean accuracy across models (LR vs. SVM: t = -1.278, p = 0.211; LR vs. RF: t = 1.000, p = 0.326; SVM vs. RF: t = 1.610, p = 0.118), a finding further supported by 95% confidence intervals that all crossed zero (LR-SVM: -0.027 to 0.006; LR-RF: -0.011 to 0.031; SVM-RF: -0.006 to 0.047). Although the differences were not statistically significant, the RF model was selected for external validation because, as an ensemble method, it is well suited to capturing complex, non-linear relationships among features and demonstrates greater robustness to noise and outliers by stabilizing performance through aggregation across multiple decision trees (50, 60–62). Selecting the model on the conservative (overlap-excluded) test data ensures that the chosen model reflects true generalization performance rather than artifact-driven similarity, thereby aligning with best-practice recommendations for preventing Type-I error inflation in NLP-based classifier evaluation.
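Pairwise model comparison on the overlap-excluded set can reuse the bootstrap helpers sketched earlier; the snippet below is a hypothetical usage example only, in which rf, svm and lr stand for already-fitted classifiers and X_excl/y_excl denote the overlap-excluded test items and labels.

```python
# Hypothetical usage of bootstrap_accuracies / compare_conditions from the
# earlier sketch: bootstrap each fitted model on the overlap-excluded test set,
# then run paired t-tests and report the CI of each pairwise mean difference.
accs = {name: bootstrap_accuracies(clf, X_excl, y_excl, n_iter=30)
        for name, clf in {"LR": lr, "SVM": svm, "RF": rf}.items()}
for a, b in [("LR", "SVM"), ("LR", "RF"), ("SVM", "RF")]:
    t_stat, p_val, mean_diff, ci = compare_conditions(accs[a], accs[b])
    print(f"{a} vs {b}: t={t_stat:.3f}, p={p_val:.3f}, "
          f"diff={mean_diff:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```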

The strength of the Vanderbilt assessment tool for ADHD lies in its comprehensive inclusion of all 18 DSM-5 core ADHD symptom items, which facilitates thorough symptom-level screening aligned with clinical diagnostic standards. However, alongside ADHD symptoms, the Vanderbilt assessment tool incorporates approximately 28 additional items related to frequently comorbid behavioral disorders such as Oppositional Defiant Disorder (ODD) and Conduct Disorder (CD). While this broader item inclusion enhances ecological validity by accounting for the complex behavioral presentations common in affected populations, it introduces challenges in specificity. The overlapping symptom domains associated with ODD and CD may complicate pure ADHD symptom discrimination, thereby potentially inflating false positives when relying solely on the screening tool. This represents a trade-off between comprehensive behavioral assessment and diagnostic accuracy. Careful interpretation of Vanderbilt assessment tool results in conjunction with clinical evaluation and multi-stage analytical frameworks is therefore essential.

3.3 Validation of the proposed pipeline

External validation was conducted using three widely applied ADHD screening instruments: the ADHD Rating Scale, SNAP-IV and M-CHAT. Prior to model evaluation, potential textual overlap between the training and validation data was assessed and removed to form an overlap-excluded validation subset. The filtering layer perfectly separated Non-ADHD statements across the three screening tools with 100% specificity and precision, correctly rejecting all M-CHAT items as well as SNAP-IV items 20, 21 and 26 (SNAP-IV items 19 to 26 cover non-ADHD behavioral indicators; after overlap exclusion, only these three items remained for evaluation), so that no non-ADHD statements progressed to the second-stage classifier. This confirms that the filtering layer reliably distinguished ADHD-relevant from developmentally atypical but non-diagnostic statements prior to downstream classification. On the retained ADHD-relevant items, the selected RF classifier showed high overall accuracy, with a single pattern of domain-level misclassification: one Hyperactivity/Impulsivity item (“Often has difficulty playing or engaging in leisure activities quietly”) was misclassified as belonging to the inattention domain (a false positive within domain assignment). Importantly, all items were still correctly recognized as ADHD-related; the error reflected only domain switching rather than diagnostic rejection. The corresponding confusion matrix is reported in Figure 12. These results show that, after removing lexically overlapping items, model performance remains stable and errors are restricted to borderline linguistic cases where behavioral phrasing may plausibly map onto either attentional or hyperactivity constructs.

Figure 12
Confusion matrix for validation data showing predicted versus true labels. For hyperactivity/impulsivity, 11 are correctly classified, 1 is wrongly classified as inattention. For inattention, all 13 are correctly classified.

Figure 12. Confusion matrix for the overlap-excluded validation dataset, showing domain-level classification outcome for ADHD-related items using Random Forest classifier.
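To make the two-stage validation logic concrete, the sketch below shows how a screening tool’s items could be passed through a filtering classifier and then a domain classifier, producing a confusion matrix like the one in Figure 12. It is a simplified illustration under stated assumptions: the encoder is assumed to expose a sentence-transformers style encode() method, and filter_clf/domain_clf stand in for the already-trained filtering layer and the selected RF classifier.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def validate_tool(filter_clf, domain_clf, encoder, items, domain_labels):
    """items: questionnaire statements; domain_labels: 0 = inattention,
    1 = hyperactivity/impulsivity (only meaningful for ADHD-related items)."""
    X = encoder.encode(items)                        # sentence embeddings
    is_adhd = filter_clf.predict(X).astype(bool)     # stage 1: ADHD vs. non-ADHD
    retained_X = X[is_adhd]
    retained_y = np.asarray(domain_labels)[is_adhd]
    domain_pred = domain_clf.predict(retained_X)     # stage 2: domain assignment
    return is_adhd, confusion_matrix(retained_y, domain_pred)
```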

The strengths and limitations of these tools are closely tied to the composition of their symptom items. The ADHD Rating Scale’s exclusive focus on ADHD-related items is a significant advantage, as it ensures high specificity for ADHD symptomatology and simplifies the filtering process, minimizing noise from unrelated behavioral domains. This focused scope likely contributes to the pipeline’s high classification accuracy and reduces complexity in symptom-domain discrimination. Conversely, SNAP-IV’s inclusion of eight non-ADHD items introduces both opportunities and challenges. On one hand, this broader symptom coverage allows assessment of comorbid or overlapping behavioral issues commonly seen with ADHD, which may enhance ecological validity and clinical utility. On the other hand, the presence of non-ADHD items can reduce the specificity of screening by introducing potential for misclassification or confounding during filtering and subtype classification, as such items may amplify ambiguity between ADHD and other behavioral domains. This complexity demands more nuanced interpretation to accurately delineate symptom origins.

This analytic pipeline provides a structured way to examine the strengths and limitations of ADHD screening questionnaires in relation to DSM-5 symptom constructs. By mapping questionnaire items to their most semantically aligned DSM-5 behaviors, the approach helps assess whether a tool includes a sufficiently representative set of clinically relevant behaviors across both the inattention and hyperactivity/impulsivity domains. This is particularly important because DSM-5 specifies that diagnostic classification depends on the presence of a minimum number of symptoms within each domain (e.g., 5 out of 9). The present method does not assume that a few key items are sufficient to define ADHD; instead, it highlights where a questionnaire may over- or under-represent certain behaviors, or where item wording may shift meaning towards adjacent constructs. In this way, the pipeline can inform refinement of existing tools, guide the development of more balanced and criterion-consistent item sets, and support researchers and clinicians in evaluating whether a screening tool adequately captures the breadth of ADHD-related behaviors specified in DSM-5.
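As a concrete illustration of this item-to-criterion mapping, the snippet below pairs each questionnaire item with its most semantically similar DSM-5 symptom statement using a sentence-transformers encoder such as all-mpnet-base-v2 (33); the helper and its inputs are hypothetical and only meant to show how a tool’s per-domain coverage could be inspected.

```python
from sentence_transformers import SentenceTransformer, util

def map_items_to_dsm5(tool_items, dsm5_symptoms,
                      model_name="sentence-transformers/all-mpnet-base-v2"):
    """Return (item, closest DSM-5 symptom, cosine similarity) for each item."""
    model = SentenceTransformer(model_name)
    item_emb = model.encode(tool_items, convert_to_tensor=True)
    dsm_emb = model.encode(dsm5_symptoms, convert_to_tensor=True)
    sims = util.cos_sim(item_emb, dsm_emb)          # items x symptoms matrix
    best = sims.argmax(dim=1).tolist()
    return [(item, dsm5_symptoms[j], float(sims[i][j]))
            for i, (item, j) in enumerate(zip(tool_items, best))]

# Counting how many of the nine symptoms per domain are matched by at least one
# item gives a simple coverage check against the DSM-5 per-domain thresholds.
```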

3.4 Limitations and future recommendations

Despite the promising outcomes, certain limitations must be acknowledged. The constrained dataset size presents a challenge, potentially restricting the model’s generalizability. Additionally, the study focused exclusively on screening tools designed for preschool-aged children; as a result, the framework’s effectiveness in evaluating tools for older age groups remains untested. To address these limitations, future efforts will focus on expanding the dataset to enhance the model’s robustness and ensure wider applicability. Moreover, integrating assessment tools tailored for different age groups will extend its relevance beyond screening tools designed specifically for preschool-aged children. A pivotal advancement will be transforming the model into a smart, fully automated system, enabling seamless, real-time classification at scale. Furthermore, leveraging deep learning architectures will refine feature extraction, elevate classification accuracy, and strengthen decision-making capabilities, paving the way for a more intelligent, scalable, and universally applicable screening framework.

4 Conclusion

The core insights gained from this research are outlined as follows:

● The investigation into conceptual overlap within ADHD symptom domains, using a multi-level similarity framework combined with an entropy-weighted method, revealed moderate overlap between symptom pairs (2 and 5) and (5 and 7) within the inattention domain, with similarity scores of 0.62 and 0.58, respectively.

● The multi-stage classification pipeline, comprising a filtering layer and machine learning classifiers (RF, SVM and LR), effectively separated ADHD-related symptom items and classified them into inattention and hyperactivity/impulsivity domains. The filtering layer demonstrated high accuracy of 97%, perfect precision and specificity in isolating DSM-5 ADHD symptoms. Among classifiers, RF achieved the best performance with 92% accuracy, 83% precision, 100% recall and an F1-score of 0.91.

● Testing the proposed pipeline suggests that the Vanderbilt assessment tool effectively captures core ADHD symptoms while also assessing comorbidities such as ODD and CD. This broad scope enhances ecological validity but may reduce diagnostic specificity for ADHD alone.

● Validation with ADHD-specific screening tools (ADHD Rating Scale and SNAP-IV) demonstrated the pipeline’s robustness. The ADHD Rating Scale ensured near-perfect classification due to its focused symptom set, while SNAP-IV’s inclusion of non-ADHD items slightly reduced subtype specificity. M-CHAT validation further confirmed the designed pipeline’s ability to exclude non-ADHD symptoms, supporting its classification precision.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Author contributions

SZ: Formal analysis, Validation, Data curation, Methodology, Writing – original draft, Investigation. ZH: Validation, Writing – review & editing, Conceptualization, Supervision. MZ: Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. Open Access funding provided by Università degli Studi di Padova | University of Padua, Open Science Committee.

Conflict of interest

The authors declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. During the preparation of this work the author(s) used ChatGPT Omni in order to improve language and readability. After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility for the content of the publication.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1671747/full#supplementary-material

References

1. American Psychiatric Association. Diagnostic and statistical manual of mental disorders: DSM-IV. 4th ed. Washington, DC: American Psychiatric Association (1994).

2. Ayano G, Demelash S, Gizachew Y, Tsegay L, and Alati R. The global prevalence of attention deficit hyperactivity disorder in children and adolescents: An umbrella review of meta-analyses. J Affect Disord. (2023) 339:860–6. doi: 10.1016/j.jad.2023.07.071

3. American Psychiatric Association. diagnostic and statistical manual of mental disorders: DSM-5. Washington, DC: American psychiatric association. (2013).

4. Goh PK, Elkins AR, Bansal PS, Eng AG, and Martel MM. Data-driven methods for predicting ADHD diagnosis and related impairment: the potential of a machine learning approach. Res Child Adolesc Psychopathol. (2023) 51:679–91. doi: 10.1007/s10802-023-01022-7

5. Goh PK, Martel MM, and Barkley RA. Clarifying ADHD and sluggish cognitive tempo item relations with impairment: a network analysis. J Abnormal Child Psychol. (2020) 48:1047–61. doi: 10.1007/s10802-020-00655-2

6. Martel MM, Goh PK, Lee CA, Karalunas SL, and Nigg JT. Longitudinal ADHD symptom networks in childhood and adolescence: key symptoms, stability, and predictive validity. J Abnormal Psychol. (2021) 130:562–74. doi: 10.1037/abn0000661

7. Li JJ, Reise SP, Chronis-Tuscano A, Mikami AY, and Lee SS. Item response theory analysis of ADHD symptoms in children with and without ADHD. Assessment. (2016) 23:655–71. doi: 10.1177/1073191115591595

8. Zoromski AK, Owens JS, Evans SW, and Brady CE. Identifying ADHD symptoms most associated with impairment in early childhood, middle childhood, and adolescence using teacher report. J Abnormal Child Psychol. (2015) 43:1243–55. doi: 10.1007/s10802-015-0017-8

9. Mota VL and Schachar RJ. Reformulating attention deficit/hyperactivity disorder according to signal detection theory. J Am Acad Child Adolesc Psychiatry. (2000) 39:1144–51. doi: 10.1097/00004583-200009000-00014

10. Silk TJ, Malpas CB, Beare R, Efron D, Anderson V, Hazell P, et al. A network analysis approach to ADHD symptoms: More than the sum of its parts. PLoS One. (2019) 14:e0211053. doi: 10.1371/journal.pone.0211053

11. Ferreira R, Lins RD, Simske SJ, Freitas F, and Riss M. Assessing sentence similarity through lexical, syntactic and semantic analysis. Comput Speech Lang. (2016) 39:1–28. doi: 10.1016/j.csl.2016.01.003

12. Sowmya V, Raju M. S. V. S. B., and Vardhan BV. Analysis of lexical, syntactic, and semantic features for semantic textual similarity. Int J Comput Eng Technol (IJCET). (2018) 9:1–9.

13. Wali W, Gargouri B, and Ben Hamadou A. Enhancing the sentence similarity measure by semantic and syntactico-semantic knowledge. Vietnam J Comput Sci. (2017) 4:51–60. doi: 10.1007/s40595-016-0080-2

14. Bie Y, Yang Y, and Zhang Y. Fusing syntactic structure information and lexical semantic information for end-to-end aspect-based sentiment analysis. Tsinghua Sci Technol. (2022) 28:230–43. doi: 10.26599/TST.2021.9010095

15. Mohd M, Javeed S, Wani MA, and Khanday HA. Sentiment analysis using lexico-semantic features. J Inf Sci. (2024) 50:1449–70. doi: 10.1177/01655515221124016

16. Asherson P. ADHD across the lifespan. Medicine. (2012) 40:623–7. doi: 10.1016/j.mpmed.2012.08.007

17. Long N and Coats H. The need for earlier recognition of attention deficit hyperactivity disorder in primary care: a qualitative meta-synthesis of the experience of receiving a diagnosis of ADHD in adulthood. Family Pract. (2022) 39:1144–55. doi: 10.1093/fampra/cmac038

18. Control CfD. Prevention. Learn the Signs. Act Early, Atlanta, Georgia, USA, CDC (2015).

19. Conners CK. A teacher rating scale for use in drug studies with children. Am J Psychiatry. (1969) 126:884–8. doi: 10.1176/ajp.126.6.884

20. Cohen M. The Revised Conners Parent Rating Scale: factor structure replication with a diversified clinical sample. J Abnormal Child Psychol. (1988) 16:187–96. doi: 10.1007/BF00913594

21. Atkins MS, Pelham WE, and Licht MH. A comparison of objective classroom measures and teacher ratings of Attention Deficit Disorder. J Abnormal Child Psychol. (1985) 13:155–67. doi: 10.1007/bf00918379

22. Wolraich ML, Lambert W, Doffing MA, Bickman L, Simmons T, and Worley K. Psychometric properties of the vanderbilt ADHD diagnostic parent rating scale in a referred population. J Pediatr Psychol. (2003) 28:559–68. doi: 10.1093/jpepsy/jsg046

23. DuPaul GJ, Power TJ, Anastopoulos AD, and Reid R. ADHD Rating Scale—IV: Checklists, norms, and clinical interpretation. New York and London: The Guilford Press. (1998).

24. Davenport TL and Davis AS. Brown attention-deficit disorder scales. In: Goldstein S and Naglieri JA, editors. Encyclopedia of Child Behavior and Development. Springer, Boston, MA (2011). doi: 10.1007/978-0-387-79061-9_439

25. Mazefsky CA, Anderson R, Conner CM, and Minshew N. Child behavior checklist scores for school-aged children with autism: preliminary evidence of patterns suggesting the need for referral. J Psychopathol Behav Assess. (2011) 33:31–7. doi: 10.1007/s10862-010-9198-1

26. Thorpe J, Kamphaus RW, and Reynolds CR. The behavior assessment system for children. In: Reynolds CR and Kamphaus RW, editors. Handbook of psychological and educational assessment of children: Personality, behavior, and context, 2nd ed. New York: The Guilford Press (2003). p. 387–405.

27. Mishra M, Mishra VK, and Sharma HR. Question classification using semantic, syntactic and lexical features. Int J Web Semantic Technol. (2013) 4:39. doi: 10.5121/ijwest.2013.4304

28. Zhu B and Pan W. Chinese text classification method based on sentence information enhancement and feature fusion. Heliyon. (2024) 10, e36861. doi: 10.1016/j.heliyon.2024.e36861

29. Choudhary A and Arora A. Linguistic feature based learning model for fake news detection and classification. Expert Syst Appl. (2021) 169:114171. doi: 10.1016/j.eswa.2020.114171

31. Reimers N and Gurevych I. Sentence-BERT: sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China: Association for Computational Linguistics (2019) pp. 3982–92.

32. Pennington J, Socher R, and Manning CD. GloVe: global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics (2014) pp. 1532–43.

33. Available online at: https://huggingface.co/sentence-transformers/all-mpnet-base-v2 (Accessed December 20, 2025).

34. Panchenko A. Similarity Measures for Semantic Relation Extraction. (2013) Université catholique de Louvain, 62. doi: 10.13140/RG.2.2.17076.45448.

35. Puspaningrum A, Nur N, and Riza O. Image thresholding based on hierarchical clustering analysis and percentile method for tuna image segmentation. NJCA (Nusantara J Comput Its Applications). (2018) 2, 1–11. doi: 10.36564/njca.v2i1.24

36. Available online at: https://huggingface.co/datasets/sentence-transformers/stsb (Accessed December 20, 2025).

37. Mueller A, Hong DS, Shepard S, and Moore T. Linking ADHD to the neural circuitry of attention. Trends Cognit Sci. (2017) 21:474–88. doi: 10.1016/j.tics.2017.03.009

38. Dunham S, Lee E, and Persky AM. The psychology of following instructions and its implications. Am J Pharm Educ. (2020) 84:ajpe7779. doi: 10.5688/ajpe7779

39. Koiler R, Schimmel A, Bakhshipour E, Shewokis P, and Getchell N. The impact of fidget spinners on fine motor skills in individuals with and without ADHD: an exploratory analysis. J Behav Brain Sci. (2022) 12:82–101. doi: 10.4236/jbbs.2022.123005

40. Barkley RA. Behavioral inhibition, sustained attention, and executive functions: constructing a unifying theory of ADHD. Psychol Bull. (1997) 121:65–94. doi: 10.1037/0033-2909.121.1.65

41. Šarić F and Šnajder J. Analysis of lexical, syntactic and semantic features for semantic textual similarity. Int J Comput Eng Technol. (2020) 9, 1–9.

42. Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. (1945) 1:80–3. doi: 10.2307/3001968

43. Van Dijck G and Van Hulle MM. Speeding up the wrapper feature subset selection in regression by mutual information relevance and redundancy analysis, in: Proceedings of the 16th International Conference on Artificial Neural Networks, Athens, Greece. (2006). 31–40. doi: 10.1007/11840817_4

44. Tan J, Zhao H, Yang R, Liu H, Li S, and Liu J. An entropy-weighting method for efficient power-line feature evaluation and extraction from LiDAR point clouds. Remote Sens. (2021) 13:3446. doi: 10.3390/rs13173446

45. Núñez H and Sànchez-Marrè M. Instance-based learning techniques of unsupervised feature weighting do not perform so badly! ECAI. (2004) 16:102.

46. Wang Z, Mi H, and Ittycheriah A. Sentence similarity learning by lexical decomposition and composition. arXiv preprint arXiv:1602.07019. (2016). https://doi.org/10.48550/arXiv.1602.07019

47. Kajiwara T, Bollegala D, Yoshida Y, and Kawarabayashi KI. An iterative approach for the global estimation of sentence similarity. PLoS One. (2017) 12:e0180885. doi: 10.1371/journal.pone.0180885

48. He H, Bai Y, Garcia EA, and Li S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (Hong Kong: IEEE World Congress on Computational Intelligence), pp. 1322–8. doi: 10.1109/IJCNN.2008.4633969

49. Boser BE, Guyon IM, and Vapnik VN. A training algorithm for optimal margin classifiers, in: Proceedings of the fifth annual workshop on Computational learning theory, Pennsylvania, USA. (1992). pp. 144–52.

50. Breiman L. Random forests. Mach Learn. (2001) 45:5–32. doi: 10.1023/A:1010933404324

51. Berkson J. Application of the logistic function to bio-assay. J Am Stat Assoc. (1944) 39:357–65. doi: 10.1080/01621459.1944.10500699

52. Brennan AR and Arnsten AF. Neuronal mechanisms underlying attention deficit hyperactivity disorder: the influence of arousal on prefrontal cortical function. Ann N Y Acad Sci. (2008) 1129:236–45. doi: 10.1196/annals.1417.007

53. Pievsky MA and McGrath RE. The neurocognitive profile of attention-deficit/hyperactivity disorder: A review of meta-analyses. Arch Clin Neuropsychol. (2018) 33:143–57. doi: 10.1093/arclin/acx055

54. Ortega R, López V, Carrasco X, Escobar M. J, García A. M, Parra M. A, et al. Neurocognitive mechanisms underlying working memory encoding and retrieval in Attention-Deficit/Hyperactivity Disorder. Sci Rep. (2020) 10:7771. doi: 10.1038/s41598-020-64678-x

55. Rubia K. Cognitive neuroscience of attention deficit hyperactivity disorder (ADHD) and its clinical translation. Front Hum Neurosci. (2018) 12:100. doi: 10.3389/fnhum.2018.00100

56. Nee DE, Brown JW, Askren MK, Berman MG, Demiralp E, Krawitz A, et al. A meta-analysis of executive components of working memory. Cereb Cortex. (2013) 23:264–82. doi: 10.1093/cercor/bhs007

57. Kofler MJ, Sarver DE, Harmon SL, Moltisanti A, Aduen PA, Soto EF, et al. Working memory and organizational skills problems in ADHD. J Child Psychol Psychiatry. (2018) 59:57–67. doi: 10.1111/jcpp.12773

58. Chan ESM, Gaye F, Cole AM, Singh LJ, and Kofler MJ. Central executive training for ADHD: Impact on organizational skills at home and school. A randomized Controlled trial. Neuropsychol. (2023) 37:859–71. doi: 10.1037/neu0000918

59. Al-Saad MSH, Al-Jabri B, and Almarzouki AF. A review of working memory training in the management of attention deficit hyperactivity disorder. Front Behav Neurosci. (2021) 15:686873. doi: 10.3389/fnbeh.2021.686873

60. Biau G. Analysis of a random forests model. J Mach Learn Res. (2012) 13:1063–95.

61. Belgiu M and Drăguţ L. Random forest in remote sensing: A review of applications and future directions. ISPRS J photogrammetry Remote Sens. (2016) 114:24–31. doi: 10.1016/j.isprsjprs.2016.01.011

62. Gislason PO, Benediktsson JA, and Sveinsson JR. Random forests for land cover classification. Pattern recognition Lett. (2006) 27:294–300. doi: 10.1016/j.patrec.2005.08.011

Keywords: ADHD, ADHD rating scale, DSM-5, similarity layers, SNAP-IV

Citation: Zahra Shamsi SA, Hussain Z and Zaman M (2026) A multi-layer similarity approach for analyzing ADHD symptomology and assessment methods considering DSM-5 diagnostic criteria. Front. Psychiatry 16:1671747. doi: 10.3389/fpsyt.2025.1671747

Received: 13 August 2025; Accepted: 29 December 2025; Revised: 25 December 2025;
Published: 26 January 2026.

Edited by:

Dimitrios Adamis, Mental Health Services Sligo/Leitrim and West Cavan, Ireland

Reviewed by:

Tianran Zhang, Atlantic Technological University Faculty of Science, Ireland
Muhammad Ramzan, Saudi Electronic University, Saudi Arabia

Copyright © 2026 Zahra Shamsi, Hussain and Zaman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Mehwish Zaman, mehwish.zaman@studenti.unipd.it
