- 1 Division of Psychiatry, Haukeland University Hospital, Bergen, Norway
- 2 Department of Biological and Medical Psychology, Faculty of Psychology, University of Bergen, Bergen, Norway
- 3 Department of Neurology, Haukeland University Hospital, Bergen, Norway
- 4 Department of Radiology, Mohn Medical Imaging and Visualization Centre, Bergen, Norway
Introduction: Despite digital advances in healthcare, clinical neuropsychology has been slow to adopt automated assessment tools. Automated scoring of the Rey-Osterrieth Complex Figure Test (ROCFT) could enhance efficiency and consistency in evaluating quantitative and qualitative aspects of the figure. However, clinical utility and accuracy compared to traditional scoring methods remain unclear.
Objective: To evaluate whether digital automated scoring systems provide accuracy, reliability, and clinical utility equal to or superior to traditional clinician-driven scoring of the ROCFT.
Methods: A rapid review following the PRISMA guidelines was conducted. PubMed and Web of Science were searched from January 1, 2015, to October 12, 2025, and recent studies benchmarking automated scoring against human raters were included.
Results: The review identified five articles, three using deep-learning approaches and two using rule-based algorithms. Together, they analysed more than 41,000 ROCFT drawings with diverse capture methods. Overall, well-designed automated systems can achieve, and in some cases surpass, expert-level performance. The algorithmic approaches demonstrated close agreement with trained raters and reproducible outputs, with discrepancies primarily emerging in atypical drawings. Deep-learning models achieved high concordance with expert scoring when image quality was adequate and training data were well labelled. However, performance varied with data quality and the distribution of scores.
Discussion and conclusion: This review demonstrates that digital automated ROCFT scoring achieves accuracy and reliability comparable to traditional clinician ratings, with well-designed systems occasionally surpassing human performance. Expected advances in artificial intelligence and automation could further enhance clinical neuropsychology in the 21st century. However, clinical implementation faces several constraints: heterogeneous training datasets, limited evidence of usefulness across disorders, and lack of independent validation. Automated scoring should thus augment, not replace, clinical judgement. To address these limitations, future research should strive to establish disorder-specific norms, conduct independent validation in real-world clinical settings, and develop human-in-the-loop pipelines that combine automated efficiency with clinical oversight. Responsible implementation will require explicit governance frameworks that regulate data use and sharing and address privacy and ‘mental data’ concerns. These advances would strengthen the evidence base and utility of automated ROCFT scoring and support its responsible integration into neuropsychological practice.
1 Introduction
While clinical neuropsychology has long been the gold standard for measuring cognition in clinical practice, its methods have evolved more slowly than the digital transformation in other areas of health care (1, 2). Tests are still mostly administered on paper and scored by hand, making results at least partly dependent on the rater’s training and time. The Rey-Osterrieth Complex Figure Test (ROCFT) exemplifies both the strengths and constraints of this approach (3). It is a well-established test for assessing visuoconstruction, executive planning, and visual memory by requiring participants to copy and recall a complex figure, where each of 18 elements is scored 0–2 points, for a maximum of 36 points (4). However, neuropsychological test administration and scoring are labor-intensive, and inter-rater reliability remains problematic (5). The value of neuropsychological tests also depends on norms for specific conditions: tests require demographically stratified, disorder-relevant norms to yield valid inferences across age, education, and cultural groups (6). To address these challenges, proponents of a Neuropsychology 3.0 framework argue that the way forward is to transform test results into data streams that can be audited, compared across sites, and integrated with electronic medical records and decision support (7). Thus, automated ROCFT scoring can save time and enhance measurement fidelity: item-level outputs and process features can be mapped to explicit cognitive constructs, improving transparency, inter-rater consistency, longitudinal comparability, and clinical utility while preserving clinician oversight. Demand for assessments is rising as populations age and as neurological and psychiatric conditions are recognized and treated earlier (8), making efficient assessment and scoring increasingly important for the field. Within this landscape, automated scoring of tests such as the ROCFT may offer faster feedback to patients, more standardized and sharable documentation in clinical teams, and structured data supporting ongoing research and service improvement.
Early work laid the ground for automated ROCFT scoring. Canham et al. (9) proposed a rule-based computer-vision pipeline that identified basic geometric shapes in scanned drawings and matched them to the Osterrieth scoring system. Although promising, drawings with missing or highly distorted elements still needed a human rater. Given recent technological advances, many previous shortcomings of digital scoring can now be overcome. Automated scoring, particularly approaches that leverage artificial intelligence (AI) and deep learning (DL), offers a novel path to optimize workload while preserving the test’s clinical utility. Automation can deliver instant scoring, reducing turnaround time and freeing clinician time for interpretation of results and patient feedback (10). At a deeper level, digital capture of patient drawings could enable systematic extraction of item-level errors and process features such as stroke order, pauses, spatial organization, and timing, opening possibilities for quantification and novel forms of interpretation and clinical utility (11). Standardization is another advantage. Automated pipelines apply the same rules across drawings, clinics, and time, which can improve inter-rater consistency and reliability (5) and quality assurance (12). Additional benefits concern scalability and access. Archives of patient records containing ROCFT drawings are rarely re-analysed. Image-based automation can unlock these archives for retrospective research and service evaluation, while tablet-based capture can streamline prospective clinical workflows. In settings with limited neuropsychology staffing, automated scoring may support triage, such as flagging concerning results (13). Automated scoring systems can also serve as a safety net for practitioners by providing an independent, objective score for verification, particularly in borderline or complex cases. Systems trained on large datasets can flag atypical drawings that deviate from typical patterns, and discrepancies between automated and manual scores provide feedback that can refine clinical judgment over time. While empirical evidence for these specific applications in ROCFT is still emerging, these mechanisms parallel established clinical decision support systems in other medical contexts (14).
When embedded in human-in-the-loop workflows, automation becomes a second opinion that accelerates routine cases while flagging atypical cases for closer review. However, successfully implementing such systems requires evidence about their performance relative to human raters and careful consideration of practical constraints. The present rapid review addresses this need by examining the following question: Do digital automated scoring systems provide accuracy, reliability, and clinical utility superior to traditional clinician-driven scoring of the ROCFT? To answer this question, we synthesize recent comparisons between automated and human scoring and evaluate these findings regarding workflow, generalizability, and governance. The study aims to clarify whether automated ROCFT scoring is ready to augment clinical practice and to identify what evidence and safeguards are needed to support responsible integration.
2 Methods
A review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (15), adapted for rapid reviews, was conducted. We pre-specified five inclusion criteria: (1) published in English, (2) including a form of digitally automated ROCFT scoring system (or variation), (3) reporting a head-to-head comparison between a human/expert rater and an automated system, (4) peer-reviewed research paper published in an international journal, and (5) published between 2015 and 2025. The publication date was restricted because relevant applications of modern computerized scoring have largely emerged within the past decade; earlier work, such as Canham et al. (9), mainly comprised proof-of-concept systems developed under substantially different hardware and software constraints, so this restriction better reflects contemporary technological possibilities. Eligible papers included digital variants and automated scorers using Quantitative Scoring System (QSS; 0–36) totals or equivalent element-level scoring, as well as studies evaluating automated clinical decisions derived from ROCFT drawings.
PubMed and Web of Science were searched from January 1, 2015, to October 12, 2025. A total of 459 papers were identified, and 61 duplicates were removed. The first author screened titles/abstracts and excluded 377 records according to the review criteria. Among the reports assessed for eligibility, 18 did not meet the inclusion criteria. Finally, the bibliographies of all eligible articles were systematically searched to identify further eligible work. See Figure 1 for the PRISMA flow chart.
Figure 1. PRISMA 2020 flow diagram showing the identification, screening, eligibility assessment, and inclusion of studies in this rapid review (PubMed and Web of Science; final included studies, n = 5).
3 Results
Five studies were identified including data on >41,000 ROCFT drawings. Details about the five studies are presented in Table 1.
Table 1. Characteristics and key findings of included studies comparing automated Rey–Osterrieth Complex Figure Test (ROCFT) scoring with human raters, summarizing study design, model type (deep learning vs algorithmic/classical computer vision), sample size, comparison approach, and main performance outcomes (e.g., MAE/MSE, ICC/PCC, ROC/AUC).
Three studies used automated scoring based on DL (12, 13, 16), while two used algorithmic approaches combined with digital tablets and styluses or digitized images of the drawings (11, 17). Studies differed in drawing capture methods (scanner images, photographs, or tablet input); prediction targets (QSS total scores or item-level scores); human benchmarks (expert raters, practicing clinicians, or crowd raters); and participant types (healthy adults, stroke survivors, patients with neurological/psychiatric diagnoses). Across the five studies, capture modalities included scanned paper drawings (12, 13, 16), camera-based images (17), and digital tablet input (11). Prediction targets also varied: Webb et al. (11) and Park et al. (13) focused on automated estimation of totals comparable to the QSS, whereas Langer et al. (12) and Guerrero-Martín et al. (16) trained models on element-level labels and then aggregated these to total scores. Human benchmarks ranged from a single experienced clinician (17) to panels of expert raters (11–13, 16) and crowdsourced raters (12). Samples likewise differed, including healthy adults, stroke survivors, and individuals across the cognitive spectrum from normal cognition through mild cognitive impairment to dementia (11–13, 16, 17).
Two studies show that algorithmic automation aligns well with human scoring. Webb et al. (11) validated a tablet-based scoring system for the Oxford Cognitive Screen-Plus complex figure task against human scoring. Their algorithm segmented the digital drawing into the 18 predefined elements, applied geometric and positional rules to classify each element as present, distorted, or misplaced, and then aggregated these decisions into copy and recall totals. Automated scores correlated closely with manual ratings, with intraclass correlations around .83 and element-detection sensitivity and specificity of roughly 92% and 90%, respectively. Discrepancies occurred mainly in atypical drawings.
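To illustrate how such a rule-based pipeline can work in principle, the following minimal Python sketch classifies pre-segmented elements with hypothetical geometric thresholds and aggregates them to a total; the element names, thresholds, and simplified 0–2 scoring are illustrative assumptions and do not reproduce the algorithm of Webb et al. (11).

```python
"""Illustrative rule-based element scoring (hypothetical; not Webb et al.'s algorithm).
Assumes the 18 elements have already been segmented from a digitized drawing and
summarized as simple geometric measures."""
from dataclasses import dataclass

@dataclass
class ElementObservation:
    name: str
    detected: bool            # was the element found at all?
    shape_similarity: float   # 0-1 agreement with the reference shape
    position_offset: float    # normalized distance from the expected location

def score_element(obs: ElementObservation,
                  shape_thr: float = 0.7,
                  pos_thr: float = 0.15) -> int:
    """Map one element to a 0-2 score using simple geometric/positional rules."""
    if not obs.detected:
        return 0                                    # omitted
    accurate = obs.shape_similarity >= shape_thr
    well_placed = obs.position_offset <= pos_thr
    return 2 if (accurate and well_placed) else 1   # 1 = distorted or misplaced

def total_score(observations: list[ElementObservation]) -> int:
    """Aggregate the 18 element decisions into a 0-36 copy or recall total."""
    return sum(score_element(o) for o in observations)

# Example: three (of 18) elements -> 2 + 1 + 0 = 3 points
demo = [
    ElementObservation("large rectangle", True, 0.92, 0.05),
    ElementObservation("diagonal cross", True, 0.55, 0.30),
    ElementObservation("small circle", False, 0.0, 1.0),
]
print(total_score(demo))  # 3
```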
Sangiovanni et al. (17) programmed a human-like robot to administer a cognitive test battery, including the ROCFT. Drawings captured on camera were scored using two complementary automated metrics. The element-matching score quantified how well predefined ROCFT elements in the drawing overlapped with a reference template, providing an approximation of item-level accuracy. In parallel, a global similarity score evaluated the overall correspondence between the full drawing and the template, thereby capturing large-scale distortions, misconfigurations, or omissions that may not be fully reflected by element-level overlap alone. Automated scores showed strong agreement with the expert ratings, with Pearson correlations up to r = .79, Spearman ρ up to .84, and internal consistency estimates (Cronbach’s α) up to .87. The global score provided the best calibration against the human benchmark. Overall, these results demonstrate that transparent, rule-based algorithmic scoring can achieve clinically acceptable agreement with human scoring.
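The two metric families can be approximated with a short sketch over binarized drawing and template images; the intersection-over-union overlap and the coverage-minus-spurious-ink similarity below are our own illustrative stand-ins, not the exact measures used by Sangiovanni et al. (17).

```python
import numpy as np

def element_overlap(drawn_mask: np.ndarray, template_mask: np.ndarray) -> float:
    """Intersection-over-union between a drawn element and its template region."""
    drawn, tmpl = drawn_mask.astype(bool), template_mask.astype(bool)
    union = np.logical_or(drawn, tmpl).sum()
    return float(np.logical_and(drawn, tmpl).sum() / union) if union else 0.0

def global_similarity(drawing_mask: np.ndarray, template_mask: np.ndarray) -> float:
    """Whole-figure correspondence: template coverage minus a penalty for ink
    drawn outside the template, approximating large-scale distortions/omissions."""
    drawn, tmpl = drawing_mask.astype(bool), template_mask.astype(bool)
    coverage = np.logical_and(drawn, tmpl).sum() / max(tmpl.sum(), 1)
    spurious = np.logical_and(drawn, ~tmpl).sum() / max(drawn.sum(), 1)
    return float(max(coverage - spurious, 0.0))
```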
In all included DL studies, automated scoring achieved high concordance with expert scoring when image quality was adequate and training labels were accurate. In some cases, automated ROCFT scoring matched or exceeded human performance. Park et al. (13) trained a convolutional neural network (CNN) on 20,040 drawings. The model was then tested on an external set of 150 drawings scored by five psychologists (defined as the gold standard). The model’s predicted QSS totals showed very small mean absolute error (MAE ≈ 0.64), high explained variance (R² ≈ 0.99), and Pearson correlations comparable to inter-expert agreement; accuracy of the DL scores was thus comparable to that of human experts. Langer et al. (12) trained a multiheaded CNN on >20,000 drawings (≈ 4,000 with clinician scores) and prospectively validated it on ≈ 2,500 drawings. In both retrospective and preregistered prospective evaluations, the model outperformed clinicians and crowdsourced raters on absolute error (MAE ≈ 1.11–1.16 vs ≈ 2.15 for clinicians and ≈ 2.41 for crowdsourced raters). Guerrero-Martín et al. (16) used the open dataset ROCFD528 (528 drawings) to train their models. When multiple CNNs were compared against two expert raters, the experts showed superior performance in lower score ranges, whereas the best model approached or exceeded expert performance at higher scores, with a best-case MAE around 3.45 on a deliberately challenging, imbalanced dataset. Collectively, these studies suggest that CNN-based scoring can achieve expert-level accuracy under optimal conditions, though performance is sensitive to dataset diversity.
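As a minimal illustration of how such benchmarking can be computed from paired automated and consensus expert totals (hypothetical values; not code from any of the reviewed studies), consider the following sketch.

```python
import numpy as np

def agreement_metrics(auto_scores, expert_scores) -> dict:
    """Agreement between automated and consensus expert QSS totals:
    mean absolute error, explained variance (R^2), and Pearson r."""
    auto = np.asarray(auto_scores, dtype=float)
    expert = np.asarray(expert_scores, dtype=float)
    mae = float(np.mean(np.abs(auto - expert)))
    ss_res = float(np.sum((expert - auto) ** 2))
    ss_tot = float(np.sum((expert - expert.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    pearson_r = float(np.corrcoef(auto, expert)[0, 1])
    return {"MAE": mae, "R2": r2, "Pearson_r": pearson_r}

# Hypothetical paired totals for five drawings
print(agreement_metrics([34, 28.5, 19, 31, 12], [35, 29, 18, 30, 13]))
```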
4 Discussion
This rapid review addressed whether digital automated ROCFT scoring provides accuracy, reliability, and clinical utility superior to traditional clinician-driven scoring. The evidence shows that automated systems achieve performance comparable to that of human raters, with some studies demonstrating superior accuracy under optimal conditions. While not categorically superior in all contexts, well-designed digital systems match human-level performance while offering distinct advantages in standardization, speed, and auditability.
4.1 Current state of automated scoring
While the results presented in the included articles are encouraging, we identified only five articles that directly compared automated scoring to human ratings (11–13, 16, 17). Although automated scoring systems could provide multiple clinical benefits, including faster and more consistent scoring, improved capacity for large-scale analysis of both new and archived data, and standardized workflows across clinical sites, most systems have not been tested using external data by independent research teams.
Beyond the reviewed studies, other work has used digital ROCFT variants to extract process-level features without focusing on automated QSS scoring, introducing novel techniques and applications. Prange and Sonntag (18) recorded the ROCFT with a digital stylus to extract temporal and kinematic features from each stroke, enabling automatic classification of cognitive status without manual scoring. Similarly, Savickaite et al. (19) demonstrated how a tablet-based ROCFT can yield performance metrics (e.g., accuracy, organizational strategy, and motor patterns) and reveal group differences that traditional methods might overlook. In addition, Kim et al. (20) found that while standard ROCFT scores did not distinguish between certain patient subgroups, gaze metrics did reveal divergent cognitive approaches (e.g., frequent fixations and switching in early-onset Alzheimer’s disease). Thus, combining technologies might augment clinical practice.
Importantly, DL has enabled diagnostic applications that extend substantially beyond automated scoring. While rule-based and classical ML approaches focus on replicating clinician scoring behavior, DL approaches can discover complex, non-linear patterns in ROCFT drawings that may not be captured by conventional scoring systems. Large-scale ROCFT datasets have been used to train deep neural networks that assist in diagnosing dementia (21) and identifying at-risk individuals (22, 23). These DL models leverage CNN-extracted features, capturing subtle drawing characteristics, such as spatial relationships, element sequencing, and execution dynamics, that exceed what human raters typically quantify. By using these hierarchical feature representations or element-level outputs as inputs to diagnostic classifiers, DL transforms the ROCFT from a psychometric instrument into a multidimensional digital biomarker, extending applications beyond psychometric assessment into clinical diagnostic and prognostic domains.
4.2 Advantages of automated scoring
Automated scoring offers several advantages over manual ROCFT scoring, particularly speed and consistency. Manual scoring is labor-intensive and dependent on trained experts, which constrains throughput in hospital settings (24), whereas automated systems can alleviate this pressure (12). Decision-support pipelines can label each of the 18 ROCFT elements as omitted, distorted, misplaced, or correct, preserving error types that are lost in the traditional 0–36 total and providing model explanations that clarify which features drive diagnostic predictions (25). Tablet-based approaches extend this further by extracting time-stamped, spatial, procedural, and kinematic indices that correlate with conventional tests targeting the same constructs (26, 27) and build on prior work using digital drawing to assess fine motor skills and handwriting in conditions such as Parkinson’s disease (28). For diagnostic use, automated systems can combine element-level judgements and process metrics (e.g., stroke order, pauses, spatial organization) into structured, auditable outputs that feed classifiers, support group comparisons, and align with standard scoring for clinical deployment (11–13, 25).
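As one hypothetical illustration of such a structured, auditable output, the sketch below combines element-level judgements with process metrics in a single serializable record; the schema, field names, and status labels are assumptions for illustration, not a format used by the reviewed systems.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ElementJudgement:
    element: str
    status: str      # "correct" | "distorted" | "misplaced" | "omitted"
    score: float     # contribution to the 0-36 total

@dataclass
class RocftRecord:
    patient_id: str
    condition: str   # "copy" or "recall"
    elements: list = field(default_factory=list)
    process_metrics: dict = field(default_factory=dict)  # stroke order, pauses, ...

    def total(self) -> float:
        return sum(e.score for e in self.elements)

    def to_json(self) -> str:
        """Serialize to an auditable record usable by classifiers or an EHR."""
        payload = asdict(self)
        payload["total_score"] = self.total()
        return json.dumps(payload, indent=2)

record = RocftRecord(
    patient_id="anon-001",
    condition="copy",
    elements=[ElementJudgement("large rectangle", "correct", 2.0),
              ElementJudgement("diagonal cross", "distorted", 1.0)],
    process_metrics={"n_strokes": 41, "total_pause_s": 12.4},
)
print(record.to_json())
```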
4.3 Challenges and controversies
Applying DL to clinical data faces concrete barriers that indicate why deployment remains limited. Datasets are often too small or noisy for robust training, and privacy regulations such as GDPR, especially for health data, constrain cross-border sharing and cloud processing without strong safeguards. ROCFT drawings can constitute biometric/behavioral data, attracting additional restrictions analogous to handwriting-based identification (29). Emerging concepts like ‘mental privacy’ suggest that AI applied to neurodata may require protections beyond current AI frameworks (30), making data pooling, external validation, and iterative model updates more complex and slowing translation into routine care.
While current results demonstrate technical feasibility, acceptance of automated scoring by clinicians and its integration into clinical workflows remain a potential challenge. Langer et al. (12) mention that their application is in beta testing, indicating ongoing efforts to address this issue. However, none of the papers in our review currently address pathways for regulatory approval. Developers would need formal clinical evaluation demonstrating safety, performance, and generalizability, along with documentation on data protection, bias mitigation, human-in-the-loop use, and post-market monitoring. Until such requirements are met, deployment will likely be limited to research rather than clinical practice.
Training AI requires large datasets, and a model constructed from the ground up requires increasingly large datasets until performance reaches a plateau (31). Pre-trained models, in contrast, can be fine-tuned with smaller datasets. In vision tasks, Shahinfar et al. (32) showed a robust logarithmic relationship between per-class sample size and accuracy/recall/precision, with diminishing returns beyond approximately 150–500 labelled images per class in balanced designs. For automatic scoring of the ROCFT, this suggests prioritizing high-quality, diverse exemplars per scoring category, targeting a few hundred labelled drawings where feasible, and investing in curation and external validation. Dataset diversity is also important for transportability, yet large, diverse datasets are generally not accessible to most hospitals (33), raising questions about the feasibility of implementing AI scoring systems in smaller clinical settings with limited data access. A potential solution would be for hospitals to rely on data sharing.
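The fine-tuning strategy can be sketched as follows: an ImageNet-pretrained backbone is adapted to QSS-total regression and only the new head is trained on a small labelled set. The sketch assumes a recent PyTorch/torchvision installation and a hypothetical data loader of (image, score) pairs; it is not the pipeline of any reviewed study.

```python
"""Fine-tuning sketch (assumptions: recent PyTorch/torchvision, a hypothetical
DataLoader yielding (image, QSS-total) batches; not any study's actual code)."""
import torch
import torch.nn as nn
from torchvision import models

def build_scorer() -> nn.Module:
    # ImageNet-pretrained backbone adapted to score regression.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for param in model.parameters():               # freeze pretrained features
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 1)  # new regression head
    return model

def fine_tune(model: nn.Module, loader, epochs: int = 5, lr: float = 1e-3) -> nn.Module:
    """Train only the regression head on a small labelled set of drawings."""
    optimiser = torch.optim.Adam(model.fc.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for images, scores in loader:              # images: (B, 3, H, W); scores: (B,)
            optimiser.zero_grad()
            preds = model(images).squeeze(1)
            loss = loss_fn(preds, scores.float())
            loss.backward()
            optimiser.step()
    return model
```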
Beyond data volume, process interpretability poses another challenge. Traditional machine learning methods require handcrafted features, which can provide some degree of interpretability for clinicians when the features themselves are clinically meaningful. DL learns from raw pixel data, achieving higher accuracy, but often functions as a "black box" with limited explainability. Clinical choice depends on context: DL suits high-volume screening where accuracy is paramount; more transparent approaches suit forensic, teaching, or resource-limited settings requiring clear decision pathways. Hybrid approaches combining DL feature extraction with interpretable scoring may balance accuracy and explainability (34). Both automated scoring and diagnostic classification can use either approach but serve distinct purposes: measurement versus clinical decision support.
Generalizability is essential, and training and evaluation should therefore demonstrate stable performance across age, cognitive status, and cultural backgrounds. Concrete strategies include assembling large, heterogeneous datasets with prospective and external testing (12), and reporting subgroup performance across copy/recall conditions and clinical strata (13). Transfer can be further improved by multi-stage fine-tuning, e.g., by pre-training on large datasets before ROCFT specialization (33). Finally, element-level outputs and model explanations make decisions traceable and biases visible, allowing clinicians to verify why a case was flagged (25).
This relates to the risk of bias, such as under-representation of key subgroups. Additional threats include label bias (inconsistent rater rules) and domain shift (deployment images differing from training images), both of which could reduce accuracy. Mitigations include using multiple independent raters, stress-testing for domain shifts and training with varied capture conditions, routing low-confidence or out-of-distribution cases to a human-in-the-loop, and pre-specifying subgroup metrics (age, education, diagnosis, handedness/motor impairment) before analysis. Together, these practices align automated scoring with clinical standards while preserving clinician oversight (12, 13, 25, 33) and should therefore be implemented for clinical augmentation.
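A minimal sketch of such human-in-the-loop routing is given below; the confidence and out-of-distribution scores, and their thresholds, are hypothetical placeholders for whatever uncertainty estimates a deployed model actually provides.

```python
from dataclasses import dataclass

@dataclass
class AutoResult:
    predicted_total: float
    confidence: float   # e.g., 1 - normalized predictive uncertainty
    ood_score: float    # distance of the input from the training distribution

def route(result: AutoResult,
          min_confidence: float = 0.85,
          max_ood: float = 0.5) -> str:
    """Decide whether a drawing can be auto-scored or must go to a clinician."""
    if result.ood_score > max_ood:
        return "clinician_review"        # input looks unlike the training data
    if result.confidence < min_confidence:
        return "clinician_review"        # model is unsure about this drawing
    return "auto_score_with_audit_log"   # accept, but keep an auditable record

print(route(AutoResult(predicted_total=27.5, confidence=0.91, ood_score=0.2)))
```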
Evidence on the sensitivity of automated approaches across neurological and psychiatric disorders remains limited. Langer et al. (12) reported such sampling but did not break down diagnoses in detail, underscoring the need for disorder-specific evidence. Nevertheless, some recent studies have expanded diagnostic coverage: Di Febbo et al. (25) and Park et al. (13) both included participants across cognitive status categories (healthy, MCI, and dementia). Webb et al. (11) compared healthy adults with stroke survivors, and Hyun et al. (35) contrasted adolescents with ADHD with matched controls using a digital ROCFT variant. Frigeni et al. (27), Guerrero-Martín et al. (16), Petilli et al. (26), and Sangiovanni et al. (17) included only healthy participants. These studies extend coverage to several clinical groups, but many patient populations remain unrepresented. Evidence is still sparse or absent for common neurological and psychiatric conditions (e.g., affective and psychotic disorders and many neurological diagnoses), as well as for culturally diverse patients and those with motor or visual impairments, leaving substantial gaps for disorder-specific validation.
Over-reliance on automated models can affect clinical judgement and decision making. This was recently exemplified in a multicenter study (36) which found that endoscopists who routinely used AI for polyp detection became worse at standard (non-AI assisted) colonoscopy. This finding is highly relevant for the successful implementation of automated scoring in neuropsychological assessment.
The core risk is over-reliance on the algorithmic output. Clinicians may bypass an independent visual review of drawings and accept the numerical score at face value, allowing the quantitative estimate to displace qualitative appraisal of copy strategy, organization, and error patterns. This risk is amplified when captured images are suboptimal, e.g., cropped, rotated, low contrast, or otherwise degraded (domain shift). These considerations argue for a human-in-the-loop workflow that treats the model’s output as an assistive second opinion rather than a replacement for clinical judgement. In short, the question is not only whether to implement such technology, but how it is integrated into practice.
5 Conclusion
Automated scoring has the potential to enhance clinical neuropsychology, but safe adoption requires more than accuracy. Systems must be validated; their outputs should be interpretable and auditable; and deployment must include robust data-protection measures, bias monitoring, and clinician-controlled overrides. Ethical risks remain salient: unapproved secondary use (e.g., repurposing clinical drawings for training); “construct creep”, which can erode validity if models weight features outside the intended scoring construct; and opaque, black-box outputs, which may undermine trust, consent, and accountability. In this landscape, best practice is a human-in-the-loop workflow that treats automation as a supplementary option: accelerating routine scoring, flagging low-confidence or out-of-distribution cases, and preserving clinician judgment where it matters most. Current evidence supports an augment-not-replace role: automated scoring can shorten turnaround time, improve inter-rater consistency, and surface process information that would otherwise be lost, while clinicians integrate these outputs with history, examination, imaging, and patient priorities. Establishing guidelines for implementation is paramount.
Author contributions
SA: Formal analysis, Writing – original draft, Conceptualization. AL: Supervision, Writing – review & editing, Conceptualization. ER: Conceptualization, Writing – review & editing, Supervision.
Funding
The author(s) declared that financial support was received for this work and/or its publication. Funding to publish this article has been granted by the University of Bergen.
Conflict of interest
The authors declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was used in the creation of this manuscript. AI was used for proofreading and structuring and for finding relevant resources for the manuscript. The first author checked AI-generated information.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/journals/psychiatry/articles/10.3389/fpsyt.2025.1746720/abstract#supplementary-material
References
1. Harris C, Tang Y, Birnbaum E, Cherian C, Mendhe D, and Chen MH. Digital neuropsychology beyond computerized cognitive assessment: applications of novel digital technologies. Arch Clin Neuropsychol. (2024) 39:290–304. doi: 10.1093/arclin/acae016
2. Lundervold AJ. Precision neuropsychology in the area of AI. Front Psychol. (2025) 16:1537368. doi: 10.3389/fpsyg.2025.1537368
3. Rey A. L’examen psychologique dans les cas d’encéphalopathie traumatique. Arch Psychol. (1941) 28:286–340.
4. Sherman EMS, Tan JE, and Hrabok M. A Compendium of Neuropsychological Tests: Fundamentals of neuropsychological assessment and test reviews for clinical practice. 4th ed. New York (NY): Oxford University Press (2022).
5. Styck KM and Walsh SM. Evaluating the prevalence and impact of examiner errors on the Wechsler scales of intelligence: a meta-analysis. Psychol Assess. (2016) 28:3–17. doi: 10.1037/pas0000157
6. Casaletto KB and Heaton RK. Neuropsychological assessment: past and future. J Int Neuropsychol Soc. (2017) 23:778–90. doi: 10.1017/S1355617717001060
7. Bilder RM. Neuropsychology 3.0: evidence-based science and practice. J Int Neuropsychol Soc. (2011) 17:7–13. doi: 10.1017/S1355617710001396
8. Nichols E, Steinmetz JD, Vollset SE, Fukutaki K, Chalek J, Abd-Allah F, et al. Estimation of the global prevalence of dementia in 2019 and forecasted prevalence in 2050: an analysis for the Global Burden of Disease Study 2019. Lancet Public Health. (2022) 7:e105–25. doi: 10.1016/S2468-2667(21)00249-8
9. Canham RO, Smith SL, and Tyrrell AM. Automated scoring of a neuropsychological test: the Rey Osterrieth complex figure. In: Proceedings of the 26th euromicro conference, EUROMICRO 2000 – informatics: inventing the future (Los Alamitos, CA: IEEE Computer Society), vol. 2. (2000). p. 406–13. Available online at: https://ieeexplore.ieee.org/document/874519 (Accessed September 21, 2025).
10. Jaworski M 3rd, Balconi J, Santivasci C, and Calamia M. Feasibility of AI-powered assessment scoring: can large language models replace human raters? Clin Neuropsychol. (2025) 10:1–14. doi: 10.1080/13854046.2025.2552289
11. Webb S, Moore M, Yamshchikova A, Kozik V, Duta M, Voiculescu I, et al. Validation of an automated scoring program for a digital complex figure copy task within healthy ageing and stroke. Neuropsychology. (2021) 35:547–56. doi: 10.1037/neu0000752
12. Langer N, Weber M, Hebling Vieira B, Strzelczyk D, Wolf L, Pedroni A, et al. A deep learning approach for automated scoring of the Rey–Osterrieth complex figure. eLife. (2024) 13:e96017. doi: 10.7554/eLife.96017
13. Park JY, Seo EH, Yoon HJ, Won S, and Lee KH. Automating Rey Complex Figure Test scoring using a deep learning-based approach: a potential large-scale screening tool for cognitive decline. Alzheimers Res Ther. (2023) 15:145. doi: 10.1186/s13195-023-01283-w
14. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. (2019) 25:44–56. doi: 10.1038/s41591-018-0300-7
15. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. (2021) 372:n71. doi: 10.1136/bmj.n71
16. Guerrero-Martín J, Díaz-Mardomingo MDC, García-Herranz S, Martínez-Tomás R, and Rincón M. A benchmark for Rey-Osterrieth complex figure test automatic scoring. Heliyon. (2024) 10:e39883. doi: 10.1016/j.heliyon.2024.e39883
17. Sangiovanni S, Spezialetti M, D’Asaro FA, Maggi G, and Rossi S. Administrating cognitive tests through HRI: an application of an automatic scoring system through visual analysis. In: Wagner AR, Feil-Seifer D, Haring KS, Rossi S, Williams T, He H, et al, editors. Social robotics. Springer International Publishing, Cham. (2020) p. 369–80. doi: 10.1007/978-3-030-62056-1_31
18. Prange A and Sonntag D. Modeling users’ cognitive performance using digital pen features. Front Artif Intell. (2022) 5:787179. doi: 10.3389/frai.2022.787179
19. Savickaite S, Morrison C, Lux E, Delafield-Butt J, and Simmons DR. The use of a tablet-based app for investigating the influence of autistic and ADHD traits on performance in a complex drawing task. Behav Res Methods. (2022) 54:2479–501. doi: 10.3758/s13428-021-01746-8
20. Kim KW, Choi J, Chin J, Lee BH, and Na DL. Eye-tracking metrics for figure-copying processes in early- vs. late-onset Alzheimer’s disease. Front Neurol. (2022) 13:844341. doi: 10.3389/fneur.2022.844341
21. Youn YC, Pyun JM, Ryu N, Baek MJ, Jang JW, Park YH, et al. Use of the Clock Drawing Test and the Rey–Osterrieth Complex Figure Test-copy with convolutional neural networks to predict cognitive impairment. Alzheimers Res Ther. (2021) 13:85. doi: 10.1186/s13195-021-00821-8
22. Simfukwe C, An SS, and Youn YC. Comparison of RCF scoring system to clinical decision for the Rey Complex Figure using machine-learning algorithm. Dement Neurocogn Disord. (2021) 20:70–9. doi: 10.12779/dnd.2021.20.4.70
23. Simfukwe C, Youn YC, and Jeong HT. A machine-learning algorithm for predicting brain age using Rey-Osterrieth complex figure tests of healthy participants. Appl Neuropsychol Adult. (2025) 32:225–30. doi: 10.1080/23279095.2022.2164198
24. Cheah WT, Hwang JJ, Hong SY, Fu LC, Chang YL, Chen TF, et al. A digital screening system for Alzheimer disease based on a neuropsychological test and a convolutional neural network: system development and validation. JMIR Med Inform. (2022) 10:e31106. doi: 10.2196/31106
25. Di Febbo D, Ferrante S, Baratta M, Luperto M, Abbate C, Trimarchi PD, et al. A decision support system for Rey–Osterrieth complex figure evaluation. Expert Syst Appl. (2023) 213:119226. doi: 10.1016/j.eswa.2022.119226
26. Petilli MA, Daini R, Saibene FL, and Rabuffetti M. Automated scoring for a tablet-based Rey Figure copy task differentiates constructional, organisational, and motor abilities. Sci Rep. (2021) 11:14895. doi: 10.1038/s41598-021-94247-9
27. Frigeni M, Petilli MA, Gobbo S, Di Giusto V, Zorzi CF, Rabuffetti M, et al. Tablet-based Rey–Osterrieth Complex Figure copy task: a novel application to assess spatial, procedural, and kinematic aspects of drawing in children. Sci Rep. (2024) 14:16787. doi: 10.1038/s41598-024-67076-9
28. Thomas M, Lenka A, and Kumar Pal P. Handwriting analysis in Parkinson’s disease: current status and future directions. Mov Disord Clin Pract. (2017) 4:806–18. doi: 10.1002/mdc3.12552
29. Štitilis D and Laurinaitis M. Treatment of biometrically processed personal data: problem of uniform practice under EU personal data protection law. Comput Law Secur Rev. (2017) 33:618–28. doi: 10.1016/j.clsr.2017.03.012
30. European Parliament, Directorate-General for Parliamentary Research Services. The protection of mental privacy in the area of neuroscience: societal, legal and ethical challenges. Luxembourg: Publications Office of the European Union (2024). Available online at: https://data.europa.eu/doi/10.2861/869928 (Accessed July 8, 2025).
31. Bahri Y, Dyer E, Kaplan J, Lee J, and Sharma U. Explaining neural scaling laws. Proc Natl Acad Sci U S A. (2024) 121:e2311878121. doi: 10.1073/pnas.2311878121
32. Shahinfar S, Meek P, and Falzon G. “How many images do I need?” Understanding how sample size per class affects deep learning model performance metrics for balanced designs in autonomous wildlife monitoring. Ecol Inform. (2020) 57:101085. doi: 10.1016/j.ecoinf.2020.101085
33. Schuster B, Kordon F, Mayr M, Seuret M, Jost S, Kessler J, et al. Multi-stage fine-tuning deep learning models improves automatic assessment of the Rey–Osterrieth Complex Figure Test. In: Document Analysis and Recognition – ICDAR 2023: 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part I. Berlin, Heidelberg: Springer-Verlag (2023). p. 3–19. doi: 10.1007/978-3-031-41676-7_1
34. Holzinger A, Langs G, Denk H, Zatloukal K, and Müller H. Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip Rev Data Min Knowl Discov. (2019) 9:e1312. doi: 10.1002/widm.1312
35. Hyun GJ, Park JW, Kim JH, Min KJ, Lee YS, Kim SM, et al. Visuospatial working memory assessment using a digital tablet in adolescents with attention deficit hyperactivity disorder. Comput Methods Programs Biomed. (2018) 157:137–43. doi: 10.1016/j.cmpb.2018.01.022
Keywords: artificial intelligence, automatic scoring, deep learning, diagnostic tool, neuropsychological assessment, Rey Osterrieth Complex Figure Test, validation
Citation: Andersen SL, Lundervold AJ and Ronold EH (2026) Automated versus human scoring of the Rey-Osterrieth Complex Figure Test: a rapid review. Front. Psychiatry 16:1746720. doi: 10.3389/fpsyt.2025.1746720
Received: 14 November 2025; Revised: 19 December 2025; Accepted: 23 December 2025;
Published: 14 January 2026.
Edited by:
Robert M. Roth, Dartmouth College, United States
Reviewed by:
Raheleh Heyrani, Umeå University, Sweden
Copyright © 2026 Andersen, Lundervold and Ronold. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Eivind Haga Ronold, eivind.ronold@uib.no