AUTHOR=Kim-Dufor Deok-Hee, Walter Michel, Krebs Marie-Odile, Haralambous Yannis, Lenca Philippe, Lemey Christophe

TITLE=Deeper insight into speech characteristics of patients at ultra-high risk using classification and explainability models

JOURNAL=Frontiers in Psychiatry

VOLUME=Volume 16 - 2025

YEAR=2025

URL=https://www.frontiersin.org/journals/psychiatry/articles/10.3389/fpsyt.2025.1595197

DOI=10.3389/fpsyt.2025.1595197

ISSN=1664-0640

ABSTRACT=Introduction: Peculiar use of language and even language deficits are well-known signs of schizophrenia. Different language features analyzed using natural language processing and machine learning have been reported to differentiate patients at ultra-high risk for psychosis. However, it has not always been explained how, and to what extent, those linguistic markers allow patients to be distinguished. This study aims to find relevant linguistic markers for classifying patients at ultra-high risk and to explain how the detected markers contribute to the classification. Methods: The first consultations with a psychiatrist of 68 patients (15 not-at-risk patients, 45 at-risk patients, and 8 patients with first episode psychosis) were recorded, transcribed verbatim, and annotated for analyses using natural language processing. A gradient-boosted decision tree algorithm was tested to evaluate its potential to correctly classify the three categories of patients and find relevant linguistic markers at the level of lexical richness, semantic coherence, speech disfluency, and syntactic complexity. The Synthetic Minority Oversampling Technique was used to handle imbalanced data, and SHapley Additive exPlanations (SHAP) values were computed to measure feature importance and each feature’s contribution to the classification. Results: The model yielded good performance, that is, 0.82 accuracy, 0.82 F2-score, 0.85 precision, 0.82 recall, and 0.86 ROC–AUC score, with four linguistic variables that concern weak coherence, the use of “I,” and filled pauses. Discussion: The findings of this study suggest that weak coherence plays a key role in classification. No significant differences in the use of “I” and filled pauses were found between groups using a statistical test, but an explainability model showed their different contributions.
The contribution of each linguistic feature to the classification by patient group provided deeper insight into linguistic manifestations of each patient group and their subtle differences, which could help better analyze and understand patients’ language behaviors.
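
As an illustration of what a per-feature "contribution" means in the SHAP sense, the following minimal sketch computes exact Shapley values by brute-force enumeration of feature coalitions. It is not the study's model or pipeline: the scoring function `toy_score`, the three feature names, and all numeric values are hypothetical and chosen only so the example runs with the Python standard library. Practical SHAP implementations for tree ensembles use far more efficient algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to baseline (brute force)."""
    n = len(x)
    features = list(range(n))

    def value(coalition):
        # Features in the coalition take their value from x, the rest from baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy scoring function over three made-up features
# (coherence, first-person pronoun rate, filled-pause rate); NOT the study's model.
def toy_score(v):
    coherence, pronoun_rate, pause_rate = v
    return 0.5 - 0.6 * coherence + 0.2 * pronoun_rate + 0.3 * coherence * pause_rate


x = [0.2, 0.5, 0.4]         # hypothetical speaker to explain
baseline = [0.6, 0.3, 0.2]  # hypothetical reference speaker
phi = shapley_values(toy_score, x, baseline)

for name, contribution in zip(["coherence", "pronoun_rate", "pause_rate"], phi):
    print(f"{name}: {contribution:+.4f}")
```

The contributions sum to the difference between the score for `x` and the score for `baseline`. This additivity is what allows a feature to show a distinct contribution for an individual prediction even when a group-level statistical test finds no significant difference.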