ORIGINAL RESEARCH article

Front. Psychiatry

Sec. Schizophrenia

Volume 16 - 2025 | doi: 10.3389/fpsyt.2025.1595197

This article is part of the Research Topic: Natural Language Processing and Artificial Intelligence tools to explore the relationship between language and schizophrenia from diagnosis to care

Deeper insight into speech characteristics of patients at Ultra-high risk using classification and explainability models

Provisionally accepted
Deok-Hee Kim-Dufor1*, Michel Walter2, Philippe Lenca3, Christophe Lemey2,4
  • 1Limics, Sorbonne Université, Université Sorbonne Paris-Nord, INSERM, Paris, France
  • 2URCI, Department of Psychiatry, CHU de Brest, Brest, France
  • 3LUSSI, IMT Atlantique, Brest, Brittany, France
  • 4CEVUP, Department of Psychiatry, CHU de Brest, Brest, France

The final, formatted version of the article will be published soon.

Peculiar use of language and even language deficits are among the well-known signs of schizophrenia. Different language features analyzed using natural language processing and machine learning have been reported to differentiate patients at ultra-high risk for psychosis. However, it has not always been explained how and to what extent those linguistic markers allow patients to be distinguished. This study aims to find relevant linguistic markers for classifying patients at ultra-high risk and to explain how the detected markers contribute to the classification. The first consultations with a psychiatrist of 68 patients (15 Not-At-Risk patients, 45 At-Risk patients, and 8 patients with First Episode Psychosis) were recorded, transcribed verbatim, and annotated for analyses using natural language processing. A gradient-boosted decision tree algorithm was tested to evaluate its potential to correctly classify the three categories of patients and to find relevant linguistic markers at the levels of lexical richness, semantic coherence, speech disfluency, and syntactic complexity. The Synthetic Minority Oversampling Technique was used to handle imbalanced data, and SHapley Additive exPlanations values were computed to measure feature importance and each feature's contribution to the classification. The model yielded good performance, that is, 0.82 accuracy, 0.82 F2-score, 0.85 precision, 0.82 recall, and 0.86 ROC-AUC score, with four linguistic variables that concern weak coherence, the use of "I", and filled pauses. Weak coherence seems to play a key role in classification. No significant differences in the use of "I" and filled pauses were found between groups using a statistical test, but an explainability model showed their different contributions.
The contribution of each linguistic feature to the classification by patient group provided deeper insight into linguistic manifestations of each patient group and their subtle differences, which could help better analyze and understand patients' language behaviors.
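
To illustrate the oversampling step mentioned in the abstract, the following is a minimal sketch of the interpolation idea behind the Synthetic Minority Oversampling Technique, written in plain Python. It is not the authors' implementation, and the abstract does not say which software was used; in practice a library class such as imbalanced-learn's SMOTE would normally be preferred. Only the group sizes (15, 45, and 8) come from the abstract. The function name smote_like_oversample, the two-dimensional feature vectors, the neighbour count k=3, and the random seeds are all assumptions made for illustration.

```python
import math
import random


def smote_like_oversample(samples, n_new, k=3, seed=0):
    """Return n_new synthetic points, each lying on the segment between a
    randomly chosen sample and one of its k nearest same-class neighbours.
    Hypothetical helper for illustration, not a library API."""
    rng = random.Random(seed)
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    k = min(k, len(samples) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        # Rank the other samples of the same class by Euclidean distance.
        others = [s for j, s in enumerate(samples) if j != i]
        others.sort(key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Group sizes are taken from the abstract; the feature values below are
# random placeholders (assumption), not real linguistic measurements.
group_sizes = {"Not-At-Risk": 15, "At-Risk": 45, "First Episode Psychosis": 8}
rng = random.Random(42)
data = {
    group: [(rng.random(), rng.random()) for _ in range(n)]
    for group, n in group_sizes.items()
}

# Oversample every group up to the size of the largest one.
target = max(group_sizes.values())
balanced = {
    group: points + smote_like_oversample(points, target - len(points))
    for group, points in data.items()
}
balanced_sizes = {group: len(points) for group, points in balanced.items()}
```

In this sketch each group is brought up to the size of the largest group, so balanced_sizes maps all three groups to 45. Because every synthetic point is interpolated between two existing points of the same group, the added data stay within the range already observed for that group.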

Keywords: UHR patients, spoken language, Natural Language Processing, XGBoost, SMOTE, SHAP values

Received: 17 Mar 2025; Accepted: 23 May 2025.

Copyright: © 2025 Kim-Dufor, Walter, Lenca and Lemey. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Deok-Hee Kim-Dufor, Limics, Sorbonne Université, Université Sorbonne Paris-Nord, INSERM, Paris, France

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.