- 1Complex Systems Monitoring, Modeling and Control Laboratory, Pennsylvania State University, University Park, PA, United States
- 2Center for Human Systems Engineering, University of Louisville, Louisville, KY, United States
Dementia, one of the most prevalent neurodegenerative diseases, affects millions worldwide. Understanding linguistic markers of dementia is crucial for elucidating how cognitive decline manifests in speech patterns. Current non-invasive assessments such as the Montreal Cognitive Assessment (MoCA) and Saint Louis University Mental Status (SLUMS) tests rely on manual interpretation and often lack detailed linguistic insight. This paper introduces a first-of-its-kind interpretable artificial intelligence (IAI) framework, CharMark, which leverages first-order Markov Chain models to characterize language production at the character level. By computing steady-state probabilities of character transitions in speech transcripts from individuals with dementia and healthy controls, we uncover distinctive character-usage patterns. The space character “ ” (treated here as the token separating words rather than an acoustic pause) and letters such as “n” and “i” showed statistically significant differences between groups. Principal Component Analysis (PCA) revealed natural clustering aligned with cognitive status, while Kolmogorov-Smirnov tests confirmed distributional shifts. A Lasso Logistic Regression model further demonstrated that these character-level features possess strong discriminative potential. Our primary contribution is the identification and characterization of candidate linguistic biomarkers of cognitive decline: features that are both interpretable and easily computable. These findings highlight the potential of character-level modeling as a lightweight, scalable strategy for early-stage dementia screening, particularly in settings where more complex or audio-dependent models may be impractical.
1 Introduction
Alzheimer’s Disease and related dementias (ADRD) represent a growing global health crisis. As of 2020, more than 55 million people worldwide were affected, and projections suggest this number will rise to 139 million by 2050 [1]. Early detection remains one of the most powerful levers for improving outcomes, yet current tools face fundamental challenges that limit their utility in real-world clinical practice.
Widely used screening tools, such as the Montreal Cognitive Assessment (MoCA) and the Saint Louis University Mental Status (SLUMS) exam, remain the default in many healthcare settings. However, they present several well-documented limitations:
• Subjectivity and interpretation bias: Outcomes can vary significantly depending on the examiner’s training, experience, and cultural background [2].
• Time and resource burden: These tools require clinician involvement and are time-consuming to administer and score [3].
• Limited diagnostic confidence: Surveys reveal that as many as 40% of primary care providers lack confidence when diagnosing dementia based solely on these screenings [4].
• Educational and cultural bias: Performance on these tests can be influenced by a patient’s education level and language proficiency, increasing the risk of misdiagnosis [5].
Given these constraints, there is an urgent need for more objective, efficient, and culturally adaptable screening methods that can be deployed at scale.
This study introduces a novel modeling framework—CharMark—designed to uncover early linguistic biomarkers of dementia through character-level analysis of speech transcripts. Figure 1 presents a graphical depiction of the Markov network at the heart of our approach, constructed from transitions between characters in recorded speech.
Figure 1. A visual rendering of the character-level Markov Chain model used in our study. Each node represents a character, and directed edges illustrate the probability of transitioning from one character to another based on observed speech transcripts. This network reveals the structural “fingerprint” of how language flows—capturing micro-level speech dynamics often imperceptible to human raters.
Unlike prior models that emphasize high-level semantic or acoustic features, we analyze language at its most granular level: transitions between individual characters. Our framework computes steady-state probabilities for each character using a first-order Markov Chain model [6], which allows us to quantify how often each character appears in the long-run behavior of speech. This yields interpretable linguistic fingerprints that are not only compact but also agnostic to language and speaker variability.
To explore whether these fingerprints contain clinically meaningful information, we perform unsupervised k-means clustering and visualize group separation via Principal Component Analysis (PCA). We also apply the Kolmogorov-Smirnov (KS) test [7] to statistically confirm character-level differences between dementia and control groups. Finally, we use a Lasso Logistic Regression model to assess the discriminative value of these features, while maintaining interpretability.
Our contributions are as follows:
• A novel character-level approach to linguistic biomarker discovery: We introduce an interpretable AI framework based on steady-state character probabilities derived from first-order Markov Chains.
• Identification of distinctive linguistic signals: Our analysis highlights specific characters, such as the space character (captured as a whitespace token rather than a true acoustic pause), “n,” and “i,” that show statistically significant differences in usage between groups.
• Validation using transparent statistical tools: Through clustering, hypothesis testing, and logistic regression, we confirm the potential of these features as early-stage indicators of cognitive decline.
This study begins by reviewing prior research on linguistic analysis in Alzheimer’s Disease and Related Dementias (ADRD), establishing the context for our work. We then describe our methodology, including the dataset, preprocessing steps, and the implementation of our character-level Markov model. Our focus on character-level transitions, rather than words or sentences, stems from the hypothesis that microstructural disruptions in language (such as altered pause patterns or character repetition) may serve as early markers of cognitive impairment. By operating at this fine-grained level, our framework captures subtle changes in language production that often precede higher-level semantic breakdowns. The remainder of the paper presents our results, interprets the identified linguistic signals, and discusses their implications for biomarker discovery and future screening applications. This study introduces CharMark as a transparent, low-resource framework for linguistic biomarker discovery in dementia. Our experiments contextualize its discriminative potential but are not intended as a leaderboard benchmarking study. Comprehensive multi-task, multi-dataset, and multilingual validation is planned as future work.
2 Materials and methods
2.1 Overview of the approach
This study introduces an interpretable modeling framework—CharMark—for identifying early linguistic biomarkers of dementia from transcribed speech. Our approach treats each transcript as a character sequence and models it using a first-order Markov Chain. From this, we compute a 27-dimensional vector representing the steady-state probability of each character (a–z and the space character). These vectors are then analyzed through clustering, dimensionality reduction, hypothesis testing, and logistic regression to evaluate group-level differences and the potential diagnostic value of the extracted features.
Figure 2 provides a conceptual overview of the CharMark pipeline.
Figure 2. The CharMark framework begins with transcript preprocessing, followed by character-level Markov modeling to extract steady-state distributions. These linguistic fingerprints are then analyzed using statistical and machine learning techniques to identify and validate early cognitive biomarkers.
2.2 Dataset and preprocessing
We used transcripts from the DementiaBank Pitt Corpus, a widely cited dataset in dementia and speech-language research. All participants were asked to describe the Boston Cookie Theft picture, a standardized elicitation task designed to generate naturalistic but structured language samples [8].
The dataset included:
• 310 transcripts from 168 participants with Alzheimer’s Disease (AD)
• 242 transcripts from 98 cognitively healthy controls
Each transcript was lowercased, and preprocessing was applied to normalize textual variation. All non-alphabetic characters (punctuation, numerals, special symbols) were removed, and contractions were decomposed into their constituent letters (e.g., “don’t” → “dont”). Because the Pitt Corpus transcripts do not encode filled pauses or disfluency markers (e.g., “uh,” “um”), these were not present in the input. We retained the space character as a structural token marking word boundaries, yielding a discrete sequence over a 27-character vocabulary (26 letters plus space). This sequence served as the basis for Markov modeling.
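For clarity, a minimal preprocessing sketch is shown below; the function and variable names are illustrative rather than the authors’ released code, and mapping stray symbols to a space (instead of deleting them outright) is an assumption here.

```python
import re

VOCAB = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the space token


def preprocess(transcript: str) -> str:
    """Lowercase, decompose contractions by dropping apostrophes, strip all
    other non-alphabetic symbols, and collapse repeated whitespace."""
    text = transcript.lower()
    text = text.replace("'", "").replace("\u2019", "")  # "don't" -> "dont"
    text = re.sub(r"[^a-z ]+", " ", text)                # non-alphabetic -> space (assumption)
    return re.sub(r"\s+", " ", text).strip()             # single spaces between words


# Example: preprocess("Well, the boy's on the stool.") -> "well the boys on the stool"
```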
2.3 Markov chain modeling
Markov Chain modeling offers a mathematically grounded and interpretable framework for capturing sequential dependencies in language. Its ability to model transition dynamics with minimal assumptions makes it well suited for identifying subtle patterns that may serve as early digital biomarkers [9]. Building on our prior work using symbolic recurrence and character-level modeling to capture linguistic markers of cognitive decline [10, 11], we adopt a first-order Markov Chain approach in this study.
Each transcript was modeled as a first-order Markov Chain, in which the probability of observing character $c_j$ depends only on the immediately preceding character $c_i$. Transition probabilities were estimated from smoothed frequency counts with Laplace smoothing, using a small constant $\alpha$ selected to mitigate zero-probability transitions while preserving the sparsity of natural language character distributions:

$$P(c_j \mid c_i) = \frac{n_{ij} + \alpha}{\sum_{k=1}^{27} n_{ik} + 27\alpha},$$

where $n_{ij}$ is the number of observed transitions from character $c_i$ to character $c_j$, and the sum in the denominator runs over the 27-character vocabulary.
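A compact sketch of this estimation step is given below; the default value of $\alpha$ in the function signature is a placeholder rather than the value used in the paper (see the sensitivity analysis and Table 3).

```python
import numpy as np

VOCAB = "abcdefghijklmnopqrstuvwxyz "
IDX = {c: i for i, c in enumerate(VOCAB)}


def transition_matrix(sequence: str, alpha: float = 0.01) -> np.ndarray:
    """Estimate P[i, j] = Pr(next char = j | current char = i) from one
    preprocessed transcript, with additive (Laplace) smoothing alpha."""
    counts = np.zeros((len(VOCAB), len(VOCAB)))
    for a, b in zip(sequence, sequence[1:]):      # consecutive character pairs
        counts[IDX[a], IDX[b]] += 1
    smoothed = counts + alpha                     # avoid zero-probability transitions
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```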
To assess robustness, we further conducted a sensitivity analysis varying $\alpha$ while holding the classifier and splits fixed. Performance was stable across the range (AUC varied by 0.027, F1 by 0.035), confirming that the chosen default value of $\alpha$ is a reasonable and stable choice for all main analyses. Full numerical results are provided in Table 3 and trends are visualized in Figure 3.
Figure 3. Sensitivity of CharMark to the Laplace smoothing parameter $\alpha$. Performance remained stable across the tested range; AUC varied by 0.027 and F1 by 0.035. The default value of $\alpha$ is retained for all main analyses.
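A minimal sketch of this sensitivity loop is shown below; it assumes the `transition_matrix` and `steady_state` helpers sketched in this section, NumPy-array labels, and a fixed list of train/test index splits, with the candidate $\alpha$ values purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score


def alpha_sensitivity(sequences, y, splits, alphas=(1e-3, 1e-2, 1e-1, 1.0)):
    """Re-extract steady-state features for each alpha and re-score the same
    classifier on identical cross-validation splits."""
    results = {}
    for alpha in alphas:
        X = np.vstack([steady_state(transition_matrix(s, alpha)) for s in sequences])
        aucs, f1s = [], []
        for train, test in splits:               # splits held fixed across alphas
            clf = LogisticRegression(penalty="l1", solver="liblinear",
                                     class_weight="balanced", max_iter=1000)
            clf.fit(X[train], y[train])
            prob = clf.predict_proba(X[test])[:, 1]
            aucs.append(roc_auc_score(y[test], prob))
            f1s.append(f1_score(y[test], prob >= 0.5))
        results[alpha] = (np.mean(aucs), np.mean(f1s))
    return results
```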
To derive the long-run behavior of character usage, we computed the steady-state distribution $\pi$ by solving the left-eigenvector equation

$$\pi P = \pi, \qquad \sum_{i=1}^{27} \pi_i = 1,$$

where $P$ is the smoothed transition matrix.
This steady-state vector, which reflects the long-term usage frequency of each character under the Markov process, was extracted for every transcript, forming a 552 × 27 feature matrix. These steady-state distributions serve as compact, interpretable fingerprints of linguistic structure across groups.
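One way to compute this quantity, as a hedged sketch, is via the left eigenvector of the transition matrix associated with eigenvalue 1; a power-iteration fallback would give the same result.

```python
import numpy as np


def steady_state(P: np.ndarray) -> np.ndarray:
    """Return pi with pi @ P = pi and sum(pi) = 1, i.e. the long-run
    character-usage distribution of the smoothed transition matrix P."""
    eigvals, eigvecs = np.linalg.eig(P.T)      # left eigenvectors of P
    idx = np.argmin(np.abs(eigvals - 1.0))     # eigenvalue closest to 1
    pi = np.abs(np.real(eigvecs[:, idx]))
    return pi / pi.sum()


# Stacking one vector per transcript yields the 552 x 27 feature matrix:
# X = np.vstack([steady_state(transition_matrix(seq)) for seq in sequences])
```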
2.4 Clustering and dimensionality reduction
We applied k-means clustering to the steady-state vectors to explore group separability in an unsupervised setting. The optimal number of clusters was determined using silhouette analysis, which identified k = 2 as the most stable solution, corresponding closely to the ground-truth diagnostic labels of healthy controls and dementia patients. This finding suggests that meaningful cognitive signals are embedded in the character-level structure of language, independent of any supervised learning.
To better understand the geometry of these features, we performed Principal Component Analysis (PCA) on the 27-dimensional steady-state vectors. PCA reduces dimensionality while preserving the directions of greatest variance, enabling clearer visual interpretation. As shown in Figure 4, the resulting projection reveals a striking separation between clusters, with transcripts from dementia subjects forming a more compact and shifted group compared to the wider distribution of controls. This emergent clustering reinforces the hypothesis that cognitive decline alters linguistic dynamics in ways that are both measurable and visually discernible at the character level.
Figure 4. Two-dimensional PCA projection of steady-state character distributions derived from each transcript. This plot reveals a natural separation between dementia and control groups, as indicated by unsupervised k-means clustering. Transcripts from dementia subjects (red) tend to form a tighter, shifted cluster, suggesting reduced linguistic variability compared to the broader distribution of healthy controls (blue). This emergent structure reinforces the potential of character-level dynamics as interpretable biomarkers of cognitive decline.
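A sketch of this unsupervised analysis is shown below, assuming `X` is the 552 × 27 steady-state matrix; the candidate range of k and the random seed are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def cluster_and_project(X, k_range=range(2, 7)):
    """Select k by silhouette analysis, then project to 2-D with PCA."""
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                            random_state=0).fit_predict(X))
              for k in k_range}
    best_k = max(scores, key=scores.get)           # most stable clustering
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
    coords = PCA(n_components=2).fit_transform(X)  # for the 2-D scatter plot
    return best_k, labels, coords
```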
2.5 Statistical testing of character features
To identify which characters contributed most to group-level differences, we conducted two-sample Kolmogorov-Smirnov (KS) tests on each of the 27 character distributions (dementia vs. control). The KS statistic quantifies the maximal distance between the empirical cumulative distribution functions:

$$D = \sup_x \left| F_{\text{dementia}}(x) - F_{\text{control}}(x) \right|,$$

where $F_{\text{dementia}}$ and $F_{\text{control}}$ are the empirical CDFs of a given character’s steady-state probability in the two groups.
We selected the KS test due to its non-parametric nature and sensitivity to both location and shape differences between distributions, making it appropriate for features that may not follow Gaussian assumptions. Significance was evaluated with a Bonferroni correction to control for multiple comparisons across the full character set. This conservative adjustment reduces false positives, ensuring that only robust distributional shifts are flagged as candidate biomarkers.
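A sketch of the per-character screening is given below; the nominal significance level shown is a common default and should be read as a placeholder, and the feature matrices are assumed to be NumPy arrays column-aligned with the vocabulary.

```python
from scipy.stats import ks_2samp

VOCAB = "abcdefghijklmnopqrstuvwxyz "


def ks_screen(X_dementia, X_control, alpha=0.05):
    """Two-sample KS test per character column with Bonferroni correction."""
    corrected = alpha / len(VOCAB)                 # 27 comparisons
    hits = []
    for j, ch in enumerate(VOCAB):
        stat, p = ks_2samp(X_dementia[:, j], X_control[:, j])
        if p < corrected:
            hits.append((ch, stat, p))
    return sorted(hits, key=lambda t: -t[1])       # rank by KS statistic
```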
2.6 Validation via lasso logistic regression
To assess the predictive relevance of the steady-state features, we trained a Lasso Logistic Regression model using the 27 character probabilities. Lasso was chosen for its ability to perform both classification and feature selection, enabling a sparse and interpretable solution. We evaluated model performance using the area under the Receiver Operating Characteristic curve (ROC-AUC).
To ensure comparability, we used stratified 5-fold cross-validation and reused the same splits across CharMark and all baselines (TF–IDF + Logistic Regression, BERT-base). Where participant identifiers were available, all transcripts from a participant were confined to a single fold. We report mean ± SD across folds for AUC, F1, precision, recall (sensitivity), specificity, and accuracy.
To prevent data leakage across folds, all transcripts from a given participant were confined to a single fold. This ensured that no subject contributed data to both training and testing sets. Participant identifiers (Pitt Corpus IDs) were used to enforce group-level assignment when constructing cross-validation splits.
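A sketch of this evaluation protocol is shown below; the solver, class weighting, and random seed are assumptions not specified in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedGroupKFold


def grouped_cv_auc(X, y, participant_ids, n_splits=5, seed=0):
    """Stratified 5-fold CV in which no participant spans train and test."""
    cv = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train, test in cv.split(X, y, groups=participant_ids):
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 class_weight="balanced", max_iter=1000)
        clf.fit(X[train], y[train])
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return np.mean(aucs), np.std(aucs)
```

Materializing the same `cv.split(...)` index pairs once and reusing them for every model keeps the comparison with the baselines on identical folds.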
2.7 Calibration and threshold analysis
To evaluate probability calibration, we computed Brier scores and Expected Calibration Error (ECE) across cross-validation folds, and plotted reliability curves. Threshold sensitivity was assessed by comparing classification at the default threshold of 0.5 with thresholds optimized for Youden’s J statistic and for maximal F1.
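The sketch below illustrates these calibration and threshold metrics; the equal-width ECE binning is a common convention and an assumption here, not necessarily the exact scheme used in the paper.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, f1_score, roc_curve


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width bins: weighted mean of |observed rate - mean confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])     # bin index per sample
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece


def youden_threshold(y_true, y_prob):
    """Threshold maximizing sensitivity + specificity - 1 (Youden's J)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    return thresholds[np.argmax(tpr - fpr)]


# Brier score:  brier_score_loss(y_true, y_prob)
# F1 at a threshold t:  f1_score(y_true, y_prob >= t)
```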
2.8 Justification for transcript-based character modeling
While many dementia studies utilize acoustic features, we focused on text-based analysis to enhance interpretability, reproducibility, and cross-linguistic applicability. Character-level modeling captures fine-grained language disruptions such as excessive pausing, rigid phrasing, and letter-specific anomalies that are often masked in higher-level or audio-driven analyses. Moreover, this approach is computationally efficient, robust to noise, and aligned with emerging needs for lightweight, privacy-preserving digital biomarkers in remote or resource-constrained environments.
2.9 Contextual baselines
TF–IDF + Logistic regression: We extracted word unigrams and bigrams with min_df = 5 and max_df = 0.8 and trained an L1-penalized logistic regression model (C = 1.0, class-balanced, max_iter = 1,000).
BERT-base (uncased): We fine-tuned bert-base-uncased for sequence classification (maximum length 128, batch size 16, learning rate , 2 epochs) with simple per-fold oversampling on the training split to balance classes.
For both baselines we reused the same stratified 5-fold cross-validation splits as for CharMark to ensure an apples-to-apples comparison. We report mean ± SD across folds for AUC, F1, precision, recall, specificity, and accuracy.
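For reference, a sketch of the TF–IDF baseline pipeline under the hyperparameters stated above is given below (the BERT baseline follows a standard Hugging Face fine-tuning recipe and is omitted for brevity); the solver choice is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tfidf_baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.8),  # word uni/bigrams
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0,
                       class_weight="balanced", max_iter=1000),
)
# Fit and score with the same stratified, participant-grouped splits as CharMark,
# e.g. cross_validate(tfidf_baseline, raw_texts, y, cv=splits, scoring="roc_auc").
```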
2.10 Additional baselines
To further contextualize CharMark’s contribution, we evaluated two ablations. First, a frequency-only baseline using raw 27-dimensional character frequency vectors. Second, a space-only baseline using the proportion of whitespace tokens in each transcript. Both were trained with the same L1-penalized logistic regression classifier under identical 5-fold splits.
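A sketch of the two ablation feature sets follows, assuming transcripts preprocessed as in Section 2.2; computing the space-only feature as the fraction of space characters is an approximation of the stated whitespace-token proportion.

```python
import numpy as np

VOCAB = "abcdefghijklmnopqrstuvwxyz "


def char_frequencies(sequence: str) -> np.ndarray:
    """Frequency-only baseline: raw 27-dimensional character frequency vector."""
    counts = np.array([sequence.count(c) for c in VOCAB], dtype=float)
    return counts / max(counts.sum(), 1.0)


def space_proportion(sequence: str) -> np.ndarray:
    """Space-only baseline: a single whitespace-proportion feature."""
    return np.array([sequence.count(" ") / max(len(sequence), 1)])
```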
3 Results
3.1 Character-level distributions reveal salient differences
We first examined steady-state character distributions to identify linguistic signals that differed significantly between groups. The Kolmogorov-Smirnov (KS) test revealed that several characters, including the space character “ ”, “n,” and “i,” showed statistically significant distributional shifts between dementia and control groups after Bonferroni correction. These characters reflect changes in pacing, repetition, and lexical structure often observed in cognitive decline. Table 1 summarizes the top-ranked features based on KS statistic.
3.2 Space character as a primary marker of pausing behavior
Among all features, the space character emerged as the most distinctive. Figure 5 visualizes the empirical distribution of steady-state space probabilities across groups. Dementia transcripts exhibit a rightward shift in the distribution, reflecting greater whitespace token frequency, which may indicate disrupted fluency at the textual level.
Figure 5. Distribution of steady-state probabilities for the space character across transcripts. Transcripts from individuals with dementia (red) show a clear shift in mean compared to controls (blue), highlighting increased whitespace token usage, which may reflect disrupted fluency patterns.
To further probe this finding, we plotted a rolling mean of the space character probability across transcript indices (Figure 6). While average values remain elevated in dementia subjects, healthy controls display greater variance, potentially reflecting more adaptive or dynamic speech rhythm. The convergence of elevated mean and lower variance in dementia aligns with prior hypotheses about pacing rigidity in cognitive decline.
Figure 6. Rolling mean and standard deviation of the space character’s steady-state probability across transcripts. Dementia subjects show elevated means and notably reduced variance, suggesting more uniform usage patterns and supporting the hypothesis that cognitive decline manifests in more rigid, less adaptive pausing behavior.
3.3 Validation through lasso logistic regression
To evaluate the diagnostic potential of our extracted features, we trained a Lasso Logistic Regression model using the 27-dimensional steady-state character vectors. The model was trained in a binary classification setting (dementia vs. control), and the resulting Receiver Operating Characteristic (ROC) curve is shown in Figure 7. The model achieved an area under the curve (AUC) of 0.806, indicating strong discriminative performance from the character-level features alone.
Figure 7. Receiver Operating Characteristic (ROC) curve for the Lasso Logistic Regression model trained on steady-state character features. The model achieves an AUC of 0.806, confirming that even low-level linguistic structures such as pause frequency and character usage carry sufficient signal to distinguish cognitive status.
Lasso regularization yielded a sparse solution, emphasizing only the most informative features. Notably, the space character remained among the most predictive features selected by the model, reinforcing its potential role as a candidate digital biomarker. To contextualize performance, TF–IDF + logistic regression and BERT-base were evaluated on the same 5-fold splits. Alongside these baselines, both ablations were substantially weaker than CharMark: the frequency-only model trailed on both AUC and F1, and the space-only model dropped further still. These results confirm that CharMark’s lift derives from its transition dynamics, not marginal frequencies or space proportion. Full metrics for all models appear in Table 2.
Table 2. Comparative performance of models (mean ± SD, 5-fold CV). CharMark is reported with the default smoothing parameter α; full α-sensitivity results are given in Table 3. A second-order Markov ablation was also tested but did not improve performance, consistent with sparsity expectations.
3.4 Sensitivity analysis
CharMark performance was stable across the tested range of the smoothing parameter α. AUC varied by approximately 0.027 and F1 by 0.035, with precision, recall, specificity, and accuracy similarly consistent. We therefore retain the default α for all main analyses. The trends are visualized in Figure 3, and detailed numerical results are provided in Table 3.
3.5 Calibration performance
CharMark demonstrated good discrimination (AUC = 0.803 ± 0.029) with moderate calibration error (Brier = 0.183 ± 0.013, ECE = 0.104 ± 0.022). At the default threshold of 0.5, F1 = 0.719 ± 0.025, Precision = 0.672 ± 0.028, Recall = 0.699 ± 0.047, and Accuracy = 0.734 ± 0.020. Threshold optimization via Youden’s J and F1-maximization yielded only minor changes, indicating that the model is reasonably well calibrated for screening applications. Reliability curves are shown in Figure 8.
3.6 Second-order Markov ablation
We further performed a second-order Markov ablation, in which transcripts were modeled as bigram-to-bigram transitions over 27² = 729 states. Despite the expanded state space, performance did not improve: AUC = 0.795 ± 0.026 and F1 = 0.707 ± 0.025, both slightly below the first-order model. This supports the choice of first-order modeling in the present study, balancing statistical stability and interpretability.
3.7 Structural differences in markov networks
While scalar features like steady-state probabilities offer useful diagnostic signals, structural properties of the underlying transition networks also carry valuable insights into how linguistic rigidity manifests. To qualitatively explore this, we visualized first-order Markov transition networks from two representative transcripts, one from a dementia subject and one from a healthy control (Figure 9).
Figure 9. Side-by-side Markov transition networks generated from transcripts of a dementia subject and a healthy control. Each node corresponds to a character, and edge thickness represents transition probability. The dementia network (a) displays dense self-loops and concentrated local transitions, reflecting reduced lexical variety. The healthy control network (b) reveals broader, more exploratory transitions, characteristic of flexible and adaptive language use.
The dementia network is dominated by self-loops and short-range transitions among a limited subset of characters, suggesting repetitive or constrained lexical production. In contrast, the control network exhibits greater transition diversity and broader edge connectivity, consistent with richer, more adaptive language use. These structural differences offer a complementary lens into how cognitive impairment alters the foundational dynamics of speech.
3.8 Demographics
Group-level demographics are summarized in Table 4 (number of participants, sex, age, education, and MMSE scores).
4 Discussion
This study introduces a novel, interpretable framework, CharMark, for detecting linguistic signals of cognitive decline through character-level modeling of transcribed speech. By leveraging first-order Markov Chains and extracting steady-state probabilities, we uncover low-level features that capture subtle disruptions in language structure. These features, including the space character and letters such as “n” and “i,” exhibited significant distributional differences between individuals with dementia and healthy controls.
Our use of character-level modeling departs from the dominant paradigm in dementia speech analysis, which typically focuses on higher-level semantic or acoustic features. While these approaches have demonstrated predictive power, they often require large labeled datasets, complex tuning, and can be challenging to interpret in clinical settings. In contrast, CharMark produces compact, transparent features that offer a granular view of how cognitive impairment affects speech mechanics, making it particularly well suited for scalable and explainable biomarker discovery.
The elevated steady-state probability of the space character in dementia transcripts, combined with reduced variance, suggests a loss of dynamism in pause behavior. This aligns with clinical observations of reduced fluency, increased hesitation, and monotonic delivery in patients with early cognitive decline. Likewise, the reduced lexical diversity visualized through the Markov networks further supports the hypothesis that cognitive impairment constrains linguistic flexibility. Our findings thus reinforce the emerging view that early-stage dementia manifests not only in what is said, but how language is structured and delivered at a micro level.
Importantly, the Lasso Logistic Regression model yielded strong discriminative power (AUC = 0.806) using only 27 character-level features. The model’s sparsity highlights which transitions carry the most diagnostic value, offering a level of interpretability often missing from more complex black-box systems. This simplicity enhances trustworthiness and facilitates integration into clinical workflows, where transparent, actionable insights are essential. While calibration was generally reasonable, some deviations from the ideal diagonal were observed, particularly at low and high confidence levels. Future work could explore post-hoc calibration methods (e.g., Platt scaling, isotonic regression) if clinically precise probability estimates are required.
These findings build upon and complement our prior work on symbolic recurrence analysis and character-level embeddings for cognitive assessment [10, 11], while introducing a new perspective grounded in steady-state transition behavior. Compared to symbolic recurrence, the Markov modeling approach is computationally lightweight, more intuitive to interpret, and easier to deploy across language settings, particularly when acoustic data is unavailable or incomplete.
That said, several limitations merit discussion. First, the current analysis is based on a single structured elicitation task (the Cookie Theft description), which may limit generalizability to other speech contexts. This reflects the proof-of-concept scope of CharMark; broader validation across multiple tasks, datasets, and languages is planned for future work. Second, demographic variables such as age, education, and language background were not explicitly modeled, which could introduce subtle biases. This limitation reflects the released corpus, which lacks per-sample linkage between transcripts and demographic variables, precluding covariate-adjusted modeling in the present study. Because CharMark operates on character-level transition dynamics that are minimally dependent on lexical content, we expect reduced sensitivity to education and content effects; nevertheless, future datasets with linked metadata will enable formal covariate adjustment and robustness analyses across demographic strata.
At 27 symbols, moving from a first- to a second-order chain expands the state space from 27 to 27² = 729 and increases the number of free transition parameters to approximately 18,954 (729 states × 26 free probabilities per state), more than 25× the first-order case. Given the modest sample size and our emphasis on clinical interpretability, we therefore retained first-order modeling in this study. Future work could explore constrained-context extensions (e.g., selective n-grams or composite transitions) once larger datasets are available.
Nonetheless, this work advances the field of digital biomarkers by highlighting the diagnostic potential of low-level, text-based features. By identifying character-level shifts that correspond to cognitive decline, CharMark contributes a novel layer of explainable signal that can enhance multimodal models and inform early-stage screening tools, especially in resource-constrained or multilingual settings where speech transcripts may be easier to collect than high-fidelity audio.
5 Conclusions
This study presents CharMark, a novel framework for identifying early-stage linguistic biomarkers of dementia using character-level Markov modeling. By analyzing steady-state transition probabilities in transcribed speech, we uncover compact, interpretable features that distinguish individuals with cognitive decline from healthy controls. Our results demonstrate that even the most granular units of language, such as individual characters and whitespace tokens, encode meaningful diagnostic signals. CharMark offers a lightweight and transparent alternative to traditional semantic or acoustic approaches, making it especially well suited for scalable deployment in digital health tools. As speech-based screening continues to evolve, character-level modeling can serve as a critical foundation for interpretable and accessible cognitive assessment. Future work will extend this approach across diverse tasks and populations, and explore its integration into broader multimodal frameworks for precision brain health monitoring.
Data availability statement
Publicly available datasets were analyzed in this study. These data can be found in the TalkBank repository, DementiaBank Pitt Corpus: https://dementia.talkbank.org/access/English/Pitt.html.
Ethics statement
The studies involving humans were approved by University of Pittsburgh Institutional Review Board (IRB), Human Research Protection Office. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements.
Author contributions
KM: Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft. FA: Supervision, Writing – review & editing. HY: Conceptualization, Methodology, Supervision, Writing – review & editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported in part by the National Science Foundation under the Grant No. IIP-2425827 (to KM and HY on Dementia Analytics), IIS-2302833 (to FA on cognitive learning), and IIS-2302834 (to HY on cognitive learning). The Pitt Corpus data used in this study was collected with support from NIA grants AG03705 and AG05133. Any opinions, findings, or conclusions found in this paper originate from the authors and do not necessarily reflect the views of the sponsor.
Acknowledgments
This manuscript has been released as a preprint on Research Square [11].
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fdgth.2025.1659366/full#supplementary-material
References
1. Alzheimer’s Disease International. World Alzheimer Report 2021: Journey through the diagnosis of dementia (Tech. rep., World Alzheimer Report). London: Alzheimer’s Disease International (2021).
2. Karimi L, Mahboub-Ahari A, Jahangiry L, Sadeghi-Bazargani H, Farahbakhsh M. A systematic review and meta-analysis of studies on screening for mild cognitive impairment in primary healthcare. BMC Psychiatry. (2022) 22:97. doi: 10.1186/s12888-022-03730-8
3. De Roeck EE, De Deyn PP, Dierckx E, Engelborghs S. Brief cognitive screening instruments for early detection of Alzheimer’s disease: a systematic review. Alzheimers Res Ther. (2019) 11:21. doi: 10.1186/s13195-019-0474-3
4. Tsoi KKF, Chan JYC, Hirai HW, Wong SYS, Kwok TCY. Cognitive tests to detect dementia: a systematic review and meta-analysis. JAMA Intern Med. (2015) 175:1450–8. doi: 10.1001/jamainternmed.2015.2152
5. Tumas V, Borges V, Ballalai-Ferraz H, Zabetian CP, Mata IF, Brito MMC, et al. Some aspects of the validity of the montreal cognitive assessment (MoCA) for evaluating cognitive impairment in Brazilian patients with Parkinson’s disease. Dement Neuropsychol. (2016) 10:333–8. doi: 10.1590/s1980-5764-2016dn1004013
7. Massey Jr FJ. The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc. (1951) 46:68–78. doi: 10.1080/01621459.1951.10500769
8. Goodglass H, Kaplan E, Barresi B. The Assessment of Aphasia and Related Disorders. 3rd ed. Philadelphia: Lippincott Williams & Wilkins (2001).
10. Mekulu K, Aqlan F, Yang H. Character-level linguistic biomarkers for precision assessment of cognitive decline: a symbolic recurrence approach. medRxiv [Preprint]. (2025). Available online at: https://doi.org/10.1101/2025.06.12.25329529 (Accessed July 20, 2025).
11. Mekulu K, Aqlan F, Yang H. CharMark: character-level Markov modeling to detect linguistic signs of dementia. Research Square [Preprint] (Version 1). (2025). Available online at: https://doi.org/10.21203/rs.3.rs-6391300/v1 (Accessed May 02, 2025).
Keywords: dementia, linguistic biomarkers, Markov model, steady-state probability, speech analysis, interpretable AI, Alzheimer’s disease, character transitions
Citation: Mekulu K, Aqlan F and Yang H (2025) CharMark: character-level Markov modeling for interpretable linguistic biomarkers of cognitive decline. Front. Digit. Health 7:1659366. doi: 10.3389/fdgth.2025.1659366
Received: 4 July 2025; Accepted: 21 October 2025;
Published: 19 November 2025.
Edited by:
Maria Lucia O Souza-Formigoni, Federal University of São Paulo, Brazil
Reviewed by:
Pooyan Mobtahej, University of California, Irvine, United States
Vani C, SRM Institute of Science and Technology, India
Felix Agbavor, Drexel University, United States
Copyright: © 2025 Mekulu, Aqlan and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Hui Yang, huiyang@psu.edu