
ORIGINAL RESEARCH article

Front. Digit. Health, 19 November 2025

Sec. Digital Mental Health

Volume 7 - 2025 | https://doi.org/10.3389/fdgth.2025.1659366

CharMark: character-level Markov modeling for interpretable linguistic biomarkers of cognitive decline

  • 1Complex Systems Monitoring, Modeling and Control Laboratory, Pennsylvania State University, University Park, PA, United States
  • 2Center for Human Systems Engineering, University of Louisville, Louisville, KY, United States

Dementia, one of the most prevalent neurodegenerative diseases, affects millions worldwide. Understanding linguistic markers of dementia is crucial for elucidating how cognitive decline manifests in speech patterns. Current non-invasive assessments like the Montreal Cognitive Assessment (MoCA) and Saint Louis University Mental Status (SLUMS) tests rely on manual interpretation and often lack detailed linguistic insight. This paper introduces a first-of-its-kind interpretable artificial intelligence (IAI) framework, CharMark, which leverages first-order Markov Chain models to characterize language production at the character level. By computing steady-state probabilities of character transitions in speech transcripts from individuals with dementia and healthy controls, we uncover distinctive character-usage patterns. The space character “ ” (treated here as the token separating words rather than as an acoustic pause) and letters such as “n” and “i” showed statistically significant differences between groups. Principal Component Analysis (PCA) revealed natural clustering aligned with cognitive status, while Kolmogorov-Smirnov tests confirmed distributional shifts. A Lasso Logistic Regression model further demonstrated that these character-level features possess strong discriminative potential. Our primary contribution is the identification and characterization of candidate linguistic biomarkers of cognitive decline: features that are both interpretable and easily computable. These findings highlight the potential of character-level modeling as a lightweight, scalable strategy for early-stage dementia screening, particularly in settings where more complex or audio-dependent models may be impractical.

1 Introduction

Alzheimer’s Disease and related dementias (ADRD) represent a growing global health crisis. As of 2020, more than 55 million people worldwide were affected, and projections suggest this number will rise to 139 million by 2050 [1]. Early detection remains one of the most powerful levers for improving outcomes, yet current tools face fundamental challenges that limit their utility in real-world clinical practice.

Widely used screening tools, such as the Montreal Cognitive Assessment (MoCA) and the Saint Louis University Mental Status (SLUMS) exam, remain the default in many healthcare settings. However, they present several well-documented limitations:

Subjectivity and interpretation bias: Outcomes can vary significantly depending on the examiner’s training, experience, and cultural background [2].

Time and resource burden: These tools require clinician involvement and are time-consuming to administer and score [3].

Limited diagnostic confidence: Surveys reveal that as many as 40% of primary care providers lack confidence when diagnosing dementia based solely on these screenings [4].

Educational and cultural bias: Performance on these tests can be influenced by a patient’s education level and language proficiency, increasing the risk of misdiagnosis [5].

Given these constraints, there is an urgent need for more objective, efficient, and culturally adaptable screening methods that can be deployed at scale.


This study introduces a novel modeling framework—CharMark—designed to uncover early linguistic biomarkers of dementia through character-level analysis of speech transcripts. Figure 1 presents a graphical depiction of the Markov network at the heart of our approach, constructed from transitions between characters in recorded speech.

Figure 1
Network diagram with interconnected nodes labeled from a to x. Nodes are connected by directed arrows, showing complex relationships and interactions among them. Central nodes include u and i, indicating a possible hub in the network.

Figure 1. A visual rendering of the character-level Markov Chain model used in our study. Each node represents a character, and directed edges illustrate the probability of transitioning from one character to another based on observed speech transcripts. This network reveals the structural “fingerprint” of how language flows—capturing micro-level speech dynamics often imperceptible to human raters.

Unlike prior models that emphasize high-level semantic or acoustic features, we analyze language at its most granular level: transitions between individual characters. Our framework computes steady-state probabilities for each character using a first-order Markov Chain model [6], which allows us to quantify how often each character appears in the long-run behavior of speech. This yields interpretable linguistic fingerprints that are not only compact but also agnostic to language and speaker variability.

To explore whether these fingerprints contain clinically meaningful information, we perform unsupervised k-means clustering and visualize group separation via Principal Component Analysis (PCA). We also apply the Kolmogorov-Smirnov (KS) test [7] to statistically confirm character-level differences between dementia and control groups. Finally, we use a Lasso Logistic Regression model to assess the discriminative value of these features, while maintaining interpretability.

Our contributions are as follows:

A novel character-level approach to linguistic biomarker discovery: We introduce an interpretable AI framework based on steady-state character probabilities derived from first-order Markov Chains.

Identification of distinctive linguistic signals: Our analysis highlights specific characters, such as the space character (captured as a whitespace token rather than a true acoustic pause), “n,” and “i,” that show statistically significant differences in usage between groups.

Validation using transparent statistical tools: Through clustering, hypothesis testing, and logistic regression, we confirm the potential of these features as early-stage indicators of cognitive decline.

This study begins by reviewing prior research on linguistic analysis in Alzheimer’s Disease and Related Dementias (ADRD), establishing the context for our work. We then describe our methodology, including the dataset, preprocessing steps, and the implementation of our character-level Markov model. Our focus on character-level transitions, rather than on words or sentences, stems from the hypothesis that microstructural disruptions in language (such as altered pause patterns or character repetition) may serve as early markers of cognitive impairment. By operating at this fine-grained level, our framework captures subtle changes in language production that often precede higher-level semantic breakdowns. The remainder of the paper presents our results, interprets the identified linguistic signals, and discusses their implications for biomarker discovery and future screening applications. This study introduces CharMark as a transparent, low-resource framework for linguistic biomarker discovery in dementia. Our experiments contextualize discriminative potential but are not intended as a leaderboard benchmarking study. Comprehensive multi-task, multi-dataset, and multilingual validation is planned as future work.

2 Materials and methods

2.1 Overview of the approach

This study introduces an interpretable modeling framework—CharMark—for identifying early linguistic biomarkers of dementia from transcribed speech. Our approach treats each transcript as a character sequence and models it using a first-order Markov Chain. From this, we compute a 27-dimensional vector representing the steady-state probability of each character (a–z and the space character). These vectors are then analyzed through clustering, dimensionality reduction, hypothesis testing, and logistic regression to evaluate group-level differences and the potential diagnostic value of the extracted features.

Figure 2 provides a conceptual overview of the CharMark pipeline.

Figure 2
Diagram illustrating a process involving Markov modeling, feature extraction, and evaluation. It starts with transcripts leading to a first-order Markov model. In feature extraction, linguistic analysis identifies patterns, depicted as blue circles and red triangles. Evaluation uses LASSO regression, visualized with a plot separating healthy (blue) and dementia (red) classifications.

Figure 2. The CharMark framework begins with transcript preprocessing, followed by character-level Markov modeling to extract steady-state distributions. These linguistic fingerprints are then analyzed using statistical and machine learning techniques to identify and validate early cognitive biomarkers.

2.2 Dataset and preprocessing

We used transcripts from the DementiaBank Pitt Corpus, a widely cited dataset in dementia and speech-language research. All participants were asked to describe the Boston Cookie Theft picture, a standardized elicitation task designed to generate naturalistic but structured language samples [8].

The dataset included:

• 310 transcripts from 168 participants with Alzheimer’s Disease (AD)

• 242 transcripts from 98 cognitively healthy controls

Each transcript was lowercased, and preprocessing was applied to normalize textual variation. All non-alphabetic characters (punctuation, numerals, special symbols) were removed, and contractions were reduced to their constituent letters (e.g., “don’t” → “dont”). Because the Pitt Corpus transcripts do not encode filled pauses or disfluency markers (e.g., “uh,” “um”), these were not present in the input. We retained the space character as a structural token marking word boundaries, yielding a discrete sequence over a 27-character vocabulary (26 letters plus space). This sequence served as the basis for Markov modeling.
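The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `preprocess` is hypothetical, and collapsing repeated spaces and trimming the ends are assumptions that the text does not state.

```python
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space (27 symbols)

def preprocess(transcript: str) -> str:
    """Lowercase, drop non-alphabetic characters, and normalize whitespace."""
    text = transcript.lower()
    # Treat any whitespace run (tabs, newlines) as a single space.
    text = re.sub(r"\s+", " ", text)
    # Remove everything outside a-z and space, so "don't" becomes "dont".
    text = re.sub(r"[^a-z ]", "", text)
    # Assumption: removing symbols can leave double spaces, so collapse again.
    return re.sub(r" +", " ", text).strip()
```

Under these assumptions, `preprocess("Don't touch the 2 cookies!")` yields a sequence that contains only symbols from the 27-character vocabulary.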

2.3 Markov chain modeling

Markov Chain modeling offers a mathematically grounded and interpretable framework for capturing sequential dependencies in language. Its ability to model transition dynamics with minimal assumptions makes it well suited for identifying subtle patterns that may serve as early digital biomarkers [9]. Building on our prior work using symbolic recurrence and character-level modeling to capture linguistic markers of cognitive decline [10, 11], we adopt a first-order Markov Chain approach in this study.

Each transcript was modeled as a first-order Markov Chain, where the probability of observing character $s_j$ depends only on the immediately preceding character $s_i$. Transition probabilities were estimated using smoothed frequency counts with Laplace smoothing (α=0.01), a small constant selected to mitigate zero-probability transitions while preserving the sparsity of natural language character distributions:

$$P_{ij} = \frac{n_{ij} + \alpha}{\sum_{u=1}^{k} \left( n_{iu} + \alpha \right)}$$

where $n_{ij}$ is the number of observed transitions from character $s_i$ to $s_j$, and $k = 27$.
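As a rough sketch of the smoothed estimate, the following Python function counts character-to-character transitions and normalizes each row with the additive constant α. The names `transition_matrix` and `INDEX` are illustrative, and the input is assumed to contain only the 27 vocabulary symbols.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def transition_matrix(text, alpha=0.01):
    """Row-stochastic matrix P with P[i][j] = (n_ij + alpha) / sum_u (n_iu + alpha)."""
    k = len(ALPHABET)
    counts = [[0.0] * k for _ in range(k)]
    # Count each observed transition from one character to the next.
    for prev, nxt in zip(text, text[1:]):
        counts[INDEX[prev]][INDEX[nxt]] += 1.0
    matrix = []
    for row in counts:
        total = sum(row) + alpha * k  # denominator of the smoothed estimate
        matrix.append([(n + alpha) / total for n in row])
    return matrix
```

A character that never occurs in a transcript gets a uniform row (every entry equals 1/27), which is what keeps the chain free of zero-probability transitions.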

To assess robustness, we further conducted a sensitivity analysis varying α ∈ {0.001, 0.005, 0.01, 0.05, 0.1} while holding the classifier and splits fixed. Performance was stable across the range (AUC varied by 0.027, F1 by 0.035), confirming that α=0.01 is a reasonable and stable choice for all main analyses. Full numerical results are provided in Table 3 and trends are visualized in Figure 3.

Figure 3

Figure 3. Sensitivity of CharMark to Laplace smoothing parameter α. Performance remained stable across the tested range; AUC varied by 0.027 and F1 by 0.035. The default α=0.01 is retained for all main analyses.

To derive the long-run behavior of character usage, we computed the steady-state distribution π by solving the eigenvector equation:

$$\pi = \pi P \quad \text{with} \quad \sum_{i=1}^{k} \pi_i = 1$$

This steady-state vector, which reflects the long-term usage frequency of each character under the Markov process, was extracted for every transcript, forming a 552 × 27 feature matrix. These steady-state distributions serve as compact, interpretable fingerprints of linguistic structure across groups.
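One way to obtain the steady-state vector is sketched below. The text describes solving an eigenvector equation; this sketch swaps in power iteration (repeatedly applying π ← πP), which converges to the same vector when every entry of P is positive, as it is after smoothing. The function name, tolerance, and iteration cap are assumptions chosen for illustration.

```python
def steady_state(P, tol=1e-12, max_iter=10_000):
    """Approximate pi with pi = pi P and sum(pi) = 1 by power iteration."""
    k = len(P)
    pi = [1.0 / k] * k  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the chain: (pi P)_j = sum_i pi_i * P_ij
        nxt = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
        total = sum(nxt)
        nxt = [v / total for v in nxt]  # renormalize against rounding drift
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi

# Two-state toy chain whose stationary distribution is (5/6, 1/6).
toy = [[0.9, 0.1], [0.5, 0.5]]
toy_pi = steady_state(toy)
```

Applying the same function to each transcript's 27 × 27 transition matrix and stacking the results would give a feature matrix with one 27-dimensional row per transcript.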

2.4 Clustering and dimensionality reduction

We applied k-means clustering to the steady-state vectors to explore group separability in an unsupervised setting. The optimal number of clusters was determined using silhouette analysis, which identified k=2 as the most stable solution corresponding closely to the ground truth diagnostic labels of healthy controls and dementia patients. This finding suggests that meaningful cognitive signals are embedded in the character-level structure of language, independent of any supervised learning.

To better understand the geometry of these features, we performed Principal Component Analysis (PCA) on the 27-dimensional steady-state vectors. PCA reduces dimensionality while preserving the directions of greatest variance, enabling clearer visual interpretation. As shown in Figure 4, the resulting projection reveals a striking separation between clusters, with transcripts from dementia subjects forming a more compact and shifted group compared to the wider distribution of controls. This emergent clustering reinforces the hypothesis that cognitive decline alters linguistic dynamics in ways that are both measurable and visually discernible at the character level.

Figure 4
Scatter plot showing two principal components. Red dots represent the dementia group, and blue dots represent the healthy group. Both groups are distributed across the plot, indicating variance but some overlap.

Figure 4. Two-dimensional PCA projection of steady-state character distributions derived from each transcript. This plot reveals a natural separation between dementia and control groups, as indicated by unsupervised k-means clustering. Transcripts from dementia subjects (red) tend to form a tighter, shifted cluster, suggesting reduced linguistic variability compared to the broader distribution of healthy controls (blue). This emergent structure reinforces the potential of character-level dynamics as interpretable biomarkers of cognitive decline.

2.5 Statistical testing of character features

To identify which characters contributed most to group-level differences, we conducted two-sample Kolmogorov-Smirnov (KS) tests on each of the 27 character distributions (dementia vs. control). The KS statistic quantifies the maximal distance between the empirical cumulative distribution functions:

$$D = \sup_x \left| F_{\text{dementia}}(x) - F_{\text{control}}(x) \right|$$

We selected the KS test due to its non-parametric nature and sensitivity to both location and shape differences between distributions, making it appropriate for features that may not follow Gaussian assumptions. Significance was evaluated at α=0.05 with Bonferroni correction to control for multiple comparisons across the full character set. This conservative adjustment reduces false positives, ensuring that only robust distributional shifts are flagged as candidate biomarkers.
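To make the statistic concrete, here is a small pure-Python sketch of the two-sample KS distance together with the Bonferroni-adjusted significance level for 27 tests. It computes only D, not the p-value; in practice a library routine such as `scipy.stats.ks_2samp` would likely be used for both. The function names are illustrative.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every observation <= x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def bonferroni_threshold(alpha=0.05, n_tests=27):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests
```

With 27 characters tested at a family-wise level of 0.05, each individual test must reach p < 0.05/27 ≈ 0.00185 to be flagged.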

2.6 Validation via lasso logistic regression

To assess the predictive relevance of the steady-state features, we trained a Lasso Logistic Regression model using the 27 character probabilities. Lasso was chosen for its ability to perform both classification and feature selection, enabling a sparse and interpretable solution. We evaluated model performance using the area under the Receiver Operating Characteristic curve (ROC-AUC).
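For reference, a standard form of the L1-penalized logistic regression objective is written out below. The notation is an assumption introduced here for illustration: $\pi^{(m)}$ is the 27-dimensional steady-state vector of transcript $m$, $y_m \in \{0, 1\}$ its label, $\sigma$ the logistic function, and $\lambda$ a penalty strength whose value for CharMark is not stated in the text.

$$\min_{w,\, b} \; \frac{1}{N} \sum_{m=1}^{N} \Big[ -y_m \log \sigma\big(w^\top \pi^{(m)} + b\big) - (1 - y_m) \log\big(1 - \sigma\big(w^\top \pi^{(m)} + b\big)\big) \Big] + \lambda \lVert w \rVert_1$$

The $\lVert w \rVert_1$ term drives some coefficients exactly to zero, which is why the fitted model doubles as a feature selector over the 27 characters.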

To ensure comparability, we used stratified 5-fold cross-validation and reused the same splits across CharMark and all baselines (TF–IDF + Logistic Regression, BERT-base). We report mean ± SD across folds for AUC, F1, precision, recall (sensitivity), specificity, and accuracy.

To prevent data leakage across folds, all transcripts from a given participant were confined to a single fold. This ensured that no subject contributed data to both training and testing sets. Participant identifiers (Pitt Corpus IDs) were used to enforce group-level assignment when constructing cross-validation splits.
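The participant-level constraint can be illustrated with a simple greedy assignment: each participant, with all of their transcripts, goes to whichever fold currently holds the fewest transcripts of that participant's class. This is a hypothetical sketch, not the authors' splitting procedure; scikit-learn's `StratifiedGroupKFold` offers comparable behavior. It assumes each participant has a single diagnosis.

```python
from collections import defaultdict

def grouped_folds(participant_ids, labels, n_folds=5):
    """Assign each transcript a fold so that no participant spans two folds."""
    by_participant = defaultdict(list)
    for idx, pid in enumerate(participant_ids):
        by_participant[pid].append(idx)
    fold_counts = [defaultdict(int) for _ in range(n_folds)]  # per-fold class counts
    assignment = [None] * len(participant_ids)
    # Place participants with the most transcripts first to keep folds balanced.
    order = sorted(by_participant, key=lambda p: (-len(by_participant[p]), p))
    for pid in order:
        idxs = by_participant[pid]
        label = labels[idxs[0]]  # assumption: one diagnosis per participant
        fold = min(range(n_folds), key=lambda f: (fold_counts[f][label], f))
        fold_counts[fold][label] += len(idxs)
        for i in idxs:
            assignment[i] = fold
    return assignment
```

Because the unit of assignment is the participant rather than the transcript, no speaker can appear in both the training and the test portion of any fold.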

2.7 Calibration and threshold analysis

To evaluate probability calibration, we computed Brier scores and Expected Calibration Error (ECE) across cross-validation folds, and plotted reliability curves. Threshold sensitivity was assessed by comparing default classification at p=0.5 with thresholds optimized for Youden’s J statistic and for maximal F1.
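The calibration and threshold quantities named above can be sketched in plain Python as follows. The ten equal-width bins for ECE and the function names are assumptions; the text does not specify the binning scheme, and other ECE variants exist.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between observed positive rate and mean prediction per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    ece = 0.0
    for members in bins:
        if members:
            frac_pos = sum(y for y, _ in members) / len(members)
            mean_prob = sum(p for _, p in members) / len(members)
            ece += len(members) / len(y_true) * abs(frac_pos - mean_prob)
    return ece

def youden_threshold(y_true, y_prob):
    """Threshold maximizing J = sensitivity + specificity - 1 (predict 1 if p >= t)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_t, best_j = 0.5, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

An F1-maximizing threshold can be found with the same loop by replacing J with the F1 score at each candidate threshold.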

2.8 Justification for transcript-based character modeling

While many dementia studies utilize acoustic features, we focused on text-based analysis to enhance interpretability, reproducibility, and cross-linguistic applicability. Character-level modeling captures fine-grained language disruptions such as excessive pausing, rigid phrasing, and letter-specific anomalies that are often masked in higher-level or audio-driven analyses. Moreover, this approach is computationally efficient, robust to noise, and aligned with emerging needs for lightweight, privacy-preserving digital biomarkers in remote or resource-constrained environments.

2.9 Contextual baselines

TF–IDF + Logistic regression: We extracted word unigrams and bigrams with min_df = 5 and max_df = 0.8 and trained an L1-penalized logistic regression model (C = 1.0, class-balanced, max_iter = 1,000).

BERT-base (uncased): We fine-tuned bert-base-uncased for sequence classification (maximum length 128, batch size 16, learning rate $2 \times 10^{-5}$, 2 epochs) with simple per-fold oversampling on the training split to balance classes.

For both baselines we reused the same stratified 5-fold cross-validation splits as for CharMark to ensure an apples-to-apples comparison. We report mean ± SD across folds for AUC, F1, precision, recall, specificity, and accuracy.

2.10 Additional baselines

To further contextualize CharMark’s contribution, we evaluated two ablations. First, a frequency-only baseline using raw 27-dimensional character frequency vectors. Second, a space-only baseline using the proportion of whitespace tokens in each transcript. Both were trained with the same L1-penalized logistic regression classifier under identical 5-fold splits.
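The two ablation feature sets are simple enough to sketch directly. The function names are hypothetical, and the space-only feature is interpreted here as the share of characters in the preprocessed transcript that are spaces, which is an assumption about how "proportion of whitespace tokens" was computed.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def frequency_vector(text):
    """Raw relative frequency of each of the 27 symbols (no transition modeling)."""
    n = len(text)
    return [text.count(ch) / n for ch in ALPHABET]

def space_proportion(text):
    """Assumed space-only feature: fraction of characters that are spaces."""
    return text.count(" ") / len(text)
```

Both features ignore the order of characters, which is what makes them useful contrasts to the transition-based steady-state vector.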

3 Results

3.1 Character-level distributions reveal salient differences

We first examined steady-state character distributions to identify linguistic signals that differed significantly between groups. The Kolmogorov-Smirnov (KS) test revealed that several characters, including the space character (“ ”), “n,” and “i,” showed statistically significant distributional shifts between dementia and control groups (Bonferroni-corrected p<0.05). These characters reflect changes in pacing, repetition, and lexical structure often observed in cognitive decline. Table 1 summarizes the top-ranked features based on KS statistic.

Table 1

Table 1. Kolmogorov-Smirnov test results for steady-state probabilities of selected characters.

3.2 Space character as a primary marker of pausing behavior

Among all features, the space character emerged as the most distinctive. Figure 5 visualizes the empirical distribution of steady-state space probabilities across groups. Dementia transcripts exhibit a rightward shift in the distribution, reflecting greater whitespace token frequency, which may indicate disrupted fluency at the textual level.

Figure 5
Histogram showing the distribution of steady-state probabilities for the space character. Red represents dementia subjects and blue represents healthy controls. Vertical dashed lines indicate the means for each group. Density on the y-axis and probability on the x-axis, ranging from 0.20 to 0.27.

Figure 5. Distribution of steady-state probabilities for the space character across transcripts. Transcripts from individuals with dementia (red) show a clear shift in mean compared to controls (blue), highlighting increased whitespace token usage, which may reflect disrupted fluency patterns.

To further probe this finding, we plotted a rolling mean of the space character probability across transcript indices (Figure 6). While average values remain elevated in dementia subjects, healthy controls display greater variance, potentially reflecting more adaptive or dynamic speech rhythm. The convergence of elevated mean and lower variance in dementia aligns with prior hypotheses about pacing rigidity in cognitive decline.
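A rolling mean and standard deviation of this kind can be computed as sketched below. The window length is a hypothetical choice, since the text does not report the one used for Figure 6, and the function name is illustrative.

```python
def rolling_mean_std(values, window=20):
    """Mean and population standard deviation over each full sliding window."""
    out = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        mean = sum(chunk) / window
        var = sum((v - mean) ** 2 for v in chunk) / window
        out.append((mean, var ** 0.5))
    return out
```

Here `values` would hold the space character's steady-state probability for each transcript in one group, ordered by transcript index.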

Figure 6
Line graph showing the rolling mean and variation of space character usage in transcripts. The red line represents dementia patients, and the blue dashed line represents healthy controls. Shaded areas indicate one standard deviation. The y-axis shows the steady-state probability of space character, and the x-axis represents the transcript index. The graph compares the two groups' space character usage patterns.

Figure 6. Rolling mean and standard deviation of the space character’s steady-state probability across transcripts. Dementia subjects show elevated means and notably reduced variance compared to controls, suggesting more uniform usage patterns and supporting the hypothesis that cognitive decline manifests in more rigid and less adaptive pausing behavior.

3.3 Validation through lasso logistic regression

To evaluate the diagnostic potential of our extracted features, we trained a Lasso Logistic Regression model using the 27-dimensional steady-state character vectors. The model was trained in a binary classification setting (dementia vs. control), and the resulting Receiver Operating Characteristic (ROC) curve is shown in Figure 7. The model achieved an area under the curve (AUC) of 0.806, indicating strong discriminative performance from the character-level features alone.

Figure 7
Receiver Operating Characteristic (ROC) curve graph showing the true positive rate against the false positive rate. An orange line represents the ROC curve with an area under the curve (AUC) of 0.8060. A dotted diagonal line indicates random performance.

Figure 7. Receiver Operating Characteristic (ROC) curve for the Lasso Logistic Regression model trained on steady-state character features. The model achieves an AUC of 0.806, confirming that even low-level linguistic structures such as pause frequency and character usage carry sufficient signal to distinguish cognitive status.

Lasso regularization yielded a sparse solution, emphasizing only the most informative features. Notably, the space character remained among the most predictive features selected by the model, reinforcing its potential role as a candidate digital biomarker. To contextualize performance, TF–IDF + logistic regression achieved AUC 0.801±0.023 and BERT-base achieved AUC 0.8165±0.015 on the same 5-fold splits. Alongside TF–IDF and BERT baselines, both ablations were substantially weaker. The frequency-only model achieved AUC 0.782±0.024 and F1 0.701±0.026, while the space-only model dropped further to AUC 0.695±0.048 and F1 0.612±0.052. These results confirm that CharMark’s lift derives from its transition dynamics, not marginal frequencies or space proportion. Full metrics for all models appear in Table 2.

Table 2

Table 2. Comparative performance of models (mean ± SD, 5-fold CV). CharMark reported with α=0.1; full α-sensitivity in Table 3. A second-order Markov ablation was also tested but did not improve performance, consistent with sparsity expectations.

Table 3

Table 3. Sensitivity of CharMark performance to Laplace smoothing parameter α (mean ± SD, 5-fold CV).

3.4 Sensitivity analysis

CharMark performance was stable across the tested range of α. AUC varied by approximately 0.027 and F1 by 0.035, with Precision, Recall, Specificity, and Accuracy similarly consistent. We therefore retain α=0.01 for all main analyses. The trends are visualized in Figure 3, and detailed numerical results are provided in Table 3.

3.5 Calibration performance

CharMark demonstrated good discrimination (AUC = 0.803 ± 0.029) with moderate calibration error (Brier = 0.183 ± 0.013, ECE = 0.104 ± 0.022). At the default threshold (p=0.5), F1 = 0.719 ± 0.025, Precision = 0.672 ± 0.028, Recall = 0.699 ± 0.047, and Accuracy = 0.734 ± 0.020. Threshold optimization via Youden’s J and F1-maximization yielded only minor changes, indicating that the model is reasonably well calibrated for screening applications. Reliability curves are shown in Figure 8.

Figure 8
Calibration reliability curve showing empirical accuracy versus predicted probability. A dashed line indicates perfect calibration, while a blue line with markers represents CharMark with alpha equals 0.1. The blue line deviates below the perfect line between probabilities 0.1 and 0.6 and then converges at higher probabilities.

Figure 8. Reliability curve for CharMark (α=0.1). The dashed diagonal indicates perfect calibration.

3.6 Second-order Markov ablation

We further performed a second-order Markov ablation, in which transcripts were modeled as bigram-to-bigram transitions over $27^2 = 729$ states. Despite the expanded state space, performance did not improve: AUC = 0.795 ± 0.026 and F1 = 0.707 ± 0.025, both slightly below the first-order model. This supports the choice of first-order modeling in the present study, balancing statistical stability and interpretability.

3.7 Structural differences in Markov networks

While scalar features like steady-state probabilities offer useful diagnostic signals, structural properties of the underlying transition networks also carry valuable insights into how linguistic rigidity manifests. To qualitatively explore this, we visualized first-order Markov transition networks from two representative transcripts, one from a dementia subject and one from a healthy control (Figure 9).

Figure 9
Two diagrams compare networks: (a) Dementia Network, showing sparse connections with centralized nodes labeled a to v; (b) Healthy Control Network, denser with interconnected nodes labeled a to j. Blue circles represent nodes, and arrows indicate connections.

Figure 9. Side-by-side Markov transition networks generated from transcripts of a dementia subject and a healthy control. Each node corresponds to a character, and edge thickness represents transition probability. The dementia network (a) displays dense self-loops and concentrated local transitions, reflecting reduced lexical variety. The healthy control network (b) reveals broader, more exploratory transitions, characteristic of flexible and adaptive language use.

The dementia network is dominated by self-loops and short-range transitions among a limited subset of characters, suggesting repetitive or constrained lexical production. In contrast, the control network exhibits greater transition diversity and broader edge connectivity, consistent with richer, more adaptive language use. These structural differences offer a complementary lens into how cognitive impairment alters the foundational dynamics of speech.

3.8 Demographics

Group-level demographics are summarized in Table 4 (number of participants, gender, sex, age, education, MMSE).

Table 4

Table 4. Demographic information of participants (DementiaBank Pitt corpus).

4 Discussion

This study introduces a novel, interpretable framework, CharMark, for detecting linguistic signals of cognitive decline through character-level modeling of transcribed speech. By leveraging first-order Markov Chains and extracting steady-state probabilities, we uncover low-level features that capture subtle disruptions in language structure. These features, including the space character and letters such as “n” and “i,” exhibited significant distributional differences between individuals with dementia and healthy controls.

Our use of character-level modeling departs from the dominant paradigm in dementia speech analysis, which typically focuses on higher-level semantic or acoustic features. While these approaches have demonstrated predictive power, they often require large labeled datasets, complex tuning, and can be challenging to interpret in clinical settings. In contrast, CharMark produces compact, transparent features that offer a granular view of how cognitive impairment affects speech mechanics, making it particularly well suited for scalable and explainable biomarker discovery.

The elevated steady-state probability of the space character in dementia transcripts, combined with reduced variance, suggests a loss of dynamism in pause behavior. This aligns with clinical observations of reduced fluency, increased hesitation, and monotonic delivery in patients with early cognitive decline. Likewise, the reduced lexical diversity visualized through the Markov networks further supports the hypothesis that cognitive impairment constrains linguistic flexibility. Our findings thus reinforce the emerging view that early-stage dementia manifests not only in what is said, but how language is structured and delivered at a micro level.

Importantly, the Lasso Logistic Regression model yielded strong discriminative power (AUC = 0.806) using only 27 character-level features. The model’s sparsity highlights which transitions carry the most diagnostic value, offering a level of interpretability often missing from more complex black-box systems. This simplicity enhances trustworthiness and facilitates integration into clinical workflows, where transparent, actionable insights are essential. While calibration was generally reasonable, some deviations from the ideal diagonal were observed, particularly at low and high confidence levels. Future work could explore post-hoc calibration methods (e.g., Platt scaling, isotonic regression) if clinically precise probability estimates are required.

These findings build upon and complement our prior work on symbolic recurrence analysis and character-level embeddings for cognitive assessment [10, 11], while introducing a new perspective grounded in steady-state transition behavior. Compared to symbolic recurrence, the Markov modeling approach is computationally lightweight, more intuitive to interpret, and easier to deploy across language settings, particularly when acoustic data is unavailable or incomplete.

That said, several limitations merit discussion. First, the current analysis is based on a single structured elicitation task (the Cookie Theft description), which may limit generalizability to other speech contexts. This reflects the proof-of-concept scope of CharMark; broader validation across multiple tasks, datasets, and languages is planned for future work. Second, demographic variables such as age, education, and language background were not explicitly modeled, which could introduce subtle biases. This limitation reflects the released corpus, which lacks per-sample linkage between transcripts and demographic variables, precluding covariate-adjusted modeling in the present study. Because CharMark operates on character-level transition dynamics that are minimally dependent on lexical content, we expect reduced sensitivity to education and content effects; nevertheless, future datasets with linked metadata will enable formal covariate adjustment and robustness analyses across demographic strata.

At 27 symbols, moving from a first- to a second-order chain expands the state space from 27 to $27^2 = 729$ and increases the number of free transition parameters to approximately 18,954 (more than 25× the first-order case). Given the modest sample size and our emphasis on clinical interpretability, we therefore retained first-order modeling in this study. Future work could explore constrained-context extensions (e.g., selective n-grams or composite transitions) once larger datasets are available.
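The parameter counts can be checked with a short calculation, assuming each context (a single character or a character pair) has a probability distribution over 27 next characters and therefore 26 free parameters, since each row must sum to one:

$$\underbrace{27 \times 26}_{\text{first order}} = 702, \qquad \underbrace{27^2 \times 26}_{\text{second order}} = 729 \times 26 = 18{,}954, \qquad \frac{18{,}954}{702} = 27.$$

Under this counting, the second-order model has 27 times as many free parameters as the first-order model, spread over the same 552 transcripts.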

Nonetheless, this work advances the field of digital biomarkers by highlighting the diagnostic potential of low-level, text-based features. By identifying character-level shifts that correspond to cognitive decline, CharMark contributes a novel layer of explainable signal that can enhance multimodal models and inform early-stage screening tools, especially in resource-constrained or multilingual settings where speech transcripts may be easier to collect than high-fidelity audio.

5 Conclusions

This study presents CharMark, a novel framework for identifying early-stage linguistic biomarkers of dementia using character-level Markov modeling. By analyzing steady-state transition probabilities in transcribed speech, we uncover compact, interpretable features that distinguish individuals with cognitive decline from healthy controls. Our results demonstrate that even the most granular units of language, such as individual characters and whitespace tokens, encode meaningful diagnostic signals. CharMark offers a lightweight and transparent alternative to traditional semantic or acoustic approaches, making it especially well suited for scalable deployment in digital health tools. As speech-based screening continues to evolve, character-level modeling can serve as a critical foundation for interpretable and accessible cognitive assessment. Future work will extend this approach across diverse tasks and populations, and explore its integration into broader multimodal frameworks for precision brain health monitoring.

Data availability statement

Publicly available datasets were analyzed in this study. The data can be found here: Repository name: TalkBank; Dataset: DementiaBank Pitt Corpus; Direct link: https://dementia.talkbank.org/access/English/Pitt.html.

Ethics statement

The studies involving humans were approved by University of Pittsburgh Institutional Review Board (IRB), Human Research Protection Office. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements.

Author contributions

KM: Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft. FA: Supervision, Writing – review & editing. HY: Conceptualization, Methodology, Supervision, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported in part by the National Science Foundation under the Grant No. IIP-2425827 (to KM and HY on Dementia Analytics), IIS-2302833 (to FA on cognitive learning), and IIS-2302834 (to HY on cognitive learning). The Pitt Corpus data used in this study was collected with support from NIA grants AG03705 and AG05133. Any opinions, findings, or conclusions found in this paper originate from the authors and do not necessarily reflect the views of the sponsor.

Acknowledgments

This manuscript has been released as a preprint on Research Square [11].

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fdgth.2025.1659366/full#supplementary-material

References

1. Alzheimer’s Disease International. World Alzheimer Report 2021: Journey through the diagnosis of dementia (Tech. rep., World Alzheimer Report). London: Alzheimer’s Disease International (2021).

2. Karimi L, Mahboub-Ahari A, Jahangiry L, Sadeghi-Bazargani H, Farahbakhsh M. A systematic review and meta-analysis of studies on screening for mild cognitive impairment in primary healthcare. BMC Psychiatry. (2022) 22:97. doi: 10.1186/s12888-022-03730-8

3. De Roeck EE, De Deyn PP, Dierckx E, Engelborghs S. Brief cognitive screening instruments for early detection of Alzheimer’s disease: a systematic review. Alzheimers Res Ther. (2019) 11:21. doi: 10.1186/s13195-019-0474-3

4. Tsoi KKF, Chan JYC, Hirai HW, Wong SYS, Kwok TCY. Cognitive tests to detect dementia: a systematic review and meta-analysis. JAMA Intern Med. (2015) 175:1450–8. doi: 10.1001/jamainternmed.2015.2152

5. Tumas V, Borges V, Ballalai-Ferraz H, Zabetian CP, Mata IF, Brito MMC, et al. Some aspects of the validity of the Montreal Cognitive Assessment (MoCA) for evaluating cognitive impairment in Brazilian patients with Parkinson’s disease. Dement Neuropsychol. (2016) 10:333–8. doi: 10.1590/s1980-5764-2016dn1004013

6. Hayes B. First links in the Markov chain. Am Sci. (2013) 101:92–7. doi: 10.1511/2013.101.92

7. Massey Jr FJ. The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc. (1951) 46:68–78. doi: 10.1080/01621459.1951.10500769

8. Goodglass H, Kaplan E, Barresi B. The Assessment of Aphasia and Related Disorders. 3rd ed. Philadelphia: Lippincott Williams & Wilkins (2001).

9. Ross SM. Introduction to Probability Models. 11th ed. Cambridge, MA: Academic Press (2014).

10. Mekulu K, Aqlan F, Yang H. Character-level linguistic biomarkers for precision assessment of cognitive decline: a symbolic recurrence approach. medRxiv [Preprint]. (2025). Available online at: https://doi.org/10.1101/2025.06.12.25329529 (Accessed July 20, 2025).

11. Mekulu K, Aqlan F, Yang H. CharMark: character-level Markov modeling to detect linguistic signs of dementia. Research Square [Preprint] (Version 1). (2025). Available online at: https://doi.org/10.21203/rs.3.rs-6391300/v1 (Accessed May 02, 2025).

Keywords: dementia, linguistic biomarkers, Markov model, steady-state probability, speech analysis, interpretable AI, Alzheimer’s disease, character transitions

Citation: Mekulu K, Aqlan F and Yang H (2025) CharMark: character-level Markov modeling for interpretable linguistic biomarkers of cognitive decline. Front. Digit. Health 7:1659366. doi: 10.3389/fdgth.2025.1659366

Received: 4 July 2025; Accepted: 21 October 2025;
Published: 19 November 2025.

Edited by:

Maria Lucia O Souza-Formigoni, Federal University of São Paulo, Brazil

Reviewed by:

Pooyan Mobtahej, University of California, Irvine, United States
Vani C, SRM Institute of Science and Technology, India
Felix Agbavor, Drexel University, United States

Copyright: © 2025 Mekulu, Aqlan and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Hui Yang, huiyang@psu.edu
