ORIGINAL RESEARCH article

Front. Psychiatry, 16 June 2025

Sec. Schizophrenia

Volume 16 - 2025 | https://doi.org/10.3389/fpsyt.2025.1595197

This article is part of the Research Topic: Natural Language Processing and Artificial Intelligence tools to explore the relationship between language and schizophrenia from diagnosis to care

Deeper insight into speech characteristics of patients at ultra-high risk using classification and explainability models

Deok-Hee Kim-Dufor1*, Michel Walter2, Marie-Odile Krebs3, Yannis Haralambous4, Philippe Lenca5, Christophe Lemey2,6
  • 1Limics, Sorbonne Université, Université Sorbonne Paris-Nord, INSERM, Paris, France
  • 2Unité de Recherche Clinique en Psychiatrie (URCP), Department of Psychiatry, Centre Hospitalier Universitaire (CHU) de Brest, Brest, France
  • 3University of Paris, Groupe Hospitalier Universitaire de Paris (GHU)-Paris, Service Hospitalo-Universitaire, Sainte-Anne, Centre d'évaluation pour Jeunes Adultes et ADolescents (C’JAAD), Paris, France
  • 4IMT Atlantique, Lab-STICC, UMR CNRS 6285, Brest, France
  • 5Données, Modèles, Informations & Décisions (DECIDE), Department of LUSSI, Institut Mines-Télécom (IMT) Atlantique, Brest, France
  • 6Consultation d’Evaluation de la VUlnérabilité Psychologique (CEVUP), Department of Psychiatry, CHU de Brest, Brest, France

Introduction: Peculiar use of language and even language deficits are among the well-known signs of schizophrenia. Different language features analyzed using natural language processing and machine learning have been reported to differentiate patients at ultra-high risk for psychosis. However, it has not always been explained how, and to what extent, those linguistic markers allow the distinction of patients. This study aims to find relevant linguistic markers for classifying patients at ultra-high risk and explain how the detected markers contribute to the classification.

Methods: The first consultations with a psychiatrist of 68 patients (15 not-at-risk patients, 45 at-risk patients, and 8 patients with first episode psychosis) were recorded, transcribed verbatim, and annotated for analyses using natural language processing. A gradient-boosted decision tree algorithm was tested to evaluate its potential to correctly classify three categories of patients and find relevant linguistic markers at the level of lexical richness, semantic coherence, speech disfluency, and syntactic complexity. The Synthetic Minority Oversampling Technique was used to handle imbalanced data, and the SHapley Additive exPlanations (SHAP) values were computed to measure feature importance and each feature’s contributions to the classification.

Results: The model yielded good performance, that is, 0.82 accuracy, 0.82 F2-score, 0.85 precision, 0.82 recall, and 0.86 ROC–AUC score, with four linguistic variables that concern weak coherence, the use of “I,” and filled pauses.

Discussion: The findings in this study suggest that weak coherence plays a key role in classification. No significant differences in the use of “I” and filled pauses were found between groups using a statistical test, but an explainability model showed their different contributions. The contribution of each linguistic feature to the classification by patient group provided deeper insight into linguistic manifestations of each patient group and their subtle differences, which could help better analyze and understand patients’ language behaviors.

1 Introduction

People with schizophrenia present with significant impairments stemming from disordered cognitive functioning (1). This mental illness manifests itself in characteristic symptoms such as delusions, hallucinations, disorganized thinking and behaviors, limited speech and expression of emotions, and social withdrawal. Early detection and treatment of schizophrenia have been proven to lead patients to favorable prognosis and better quality of life (2, 3). They could indeed reduce the risks and disorders associated with the first symptoms by engaging patients who present with prodromal symptoms in a care pathway (4) and limit the duration of untreated psychosis (DUP) by means of a treatment at the onset of the first episode of psychosis (FEP). The DUP is one of the key prognostic factors both in FEP (5) and in chronic schizophrenia (6). Different clinical assessments allow prodromal symptoms to be identified, such as the Comprehensive Assessment of At-Risk Mental States (CAARMS), the Structured Interview of Psychosis-risk Syndromes (SIPS) from the “Ultra-High Risk (UHR)” criteria, and the Schizophrenia Proneness Instrument, Adult version (SPI-A) from the basic symptom concept. Even though these tools show acceptable or fairly good performances, they still have a somewhat limited rate of prediction (7). Complementary elements for better predictions have therefore become a desideratum, and natural language processing (NLP) comes into play. Peculiar uses of language in schizophrenia (8–10) have been reported in the literature and are among the well-known signs (11, 12). They are very easily noticeable and were even described as “schizophrenic language” and “schizophrenese” by some authors in the last century (13–16). 
Peculiarities are observed at different language levels ranging from words to sentence structure, coherence, and pragmatics (17–21), as itemized in the Scale for the Assessment of Thought, Language, and Communication by Andreasen, for example neologism, word approximation, poverty of speech, poverty of content, tangentiality, derailment, incoherence, and stilted speech (8). Based on the idea that self-disturbance is one of the core features of schizophrenia, a phenomenological approach to the sense of self in patients has developed (22–24) along with studies on the use of first-person pronouns (25–29). Language analysis of syntactic variables was already proposed in the 1980s as a potential diagnostic aid (30–32), since differences were observed between schizophrenics, maniacs, and controls (30, 31). Even though language analyses turned out to have great potential, they were highly time consuming and likely to be subjective because they had to be manually carried out. Automated language analyses are more objective and are not limited by data size. Many studies have therefore explored language in schizophrenia and searched for linguistic markers to be used as a diagnostic aid along with biomarkers such as brain imaging, genetic testing, and blood tests (33–35). With the development of artificial intelligence, analysis techniques, such as NLP and machine learning (ML) models, have become more sophisticated and yielded more propitious results. 
These techniques have been used on linguistic data in a growing number of studies on mental health (36, 37), namely, those on schizophrenia and FEP (38, 39): latent semantic analysis for quantifying speech coherence (40), semantic, lexical, and pragmatic features (41–44), speech graph connectivity for measuring thought disorder in schizophrenia and mania (45, 46) and for predicting transition (47, 48), longitudinal classification of FEP (49), clustering for constructing language profiles of heterogeneous linguistic behaviors of patients with schizophrenia for early intervention (50) and prognosis (51), and a combination of acoustic and semantic features for classifying schizophrenia-spectrum disorders (52), to name a few. The aims of this exploratory study were to detect relevant language features that could classify patients by their status at their first consultation with a psychiatrist and seek to explain classification results with respect to clinical observations. Among the linguistic markers found in these studies (40–51), the most frequent language feature is semantic coherence despite different types and lengths of corpus. It was therefore hypothesized that semantic coherence would be part of the relevant linguistic markers in conversational discourses of patients at ultra-high risk. With the disturbed sense of self observed in the clinic, it was also hypothesized that the use of first-person singular pronoun would vary depending on the UHR patient groups.

2 Methods

2.1 Participants

Sixty-eight patients (34 males, 34 females; mean age = 19.3 ± 2.86) participated in the present study. Out of the 68 patients, 15 were assessed as NAR (7 males, 8 females; mean age = 19.5 ± 2.24), 45 as AR (22 males, 23 females; mean age = 19.2 ± 2.83), and 8 as FEP (5 males, 3 females; mean age = 19.7 ± 3.78) using the CAARMS at T0. In total, 33 patients had antidepressants and/or anxiolytics, 5 were under neuroleptic treatment for less than 6 months, and 20 had no drug treatment. Healthy controls were not recruited separately to respect the same conditions of collecting data for each of the three groups, that is, a consultation with a psychiatrist. All were native speakers of French with an IQ superior to 70 and were informed of the study. Education levels were as follows: NAR [years of education (YoE) = 12.07 ± 1.34], AR (YoE = 11.58 ± 1.32), and FEP (YoE = 12 ± 1.73). A statement of non-opposition to the study was signed by their physician or the parents of underage patients.

2.2 Collection of patients’ speech and transcription

The recruited patients were recorded during their first consultations with a psychiatrist at the Center for Evaluation of Psychological Vulnerability (CEVUP) of the University Hospital of Brest, France. The first consultation with a psychiatrist is the starting point of the care pathway at the CEVUP. It is therefore labeled T0 (time zero), and a 2-year follow-up is indicated as T2. The interviews are semi-structured with some predetermined questions on the patient’s problems. The topics broached are the patient’s background, family, social relationships, socio-professional insertion, complaints about their symptoms, and any other topics based on what is said by the patient. Some additional questions are asked if more detailed information is needed for better understanding of the help seeker’s problems to assess their risk for psychosis. The transcripts have a conversational form between a psychiatrist and a patient. A nurse participated in the consultations, but she seldom spoke, and even when she did, it was only to provide the patient with supplementary information on the care pathway at the end of the consultations. The total duration of each recording is approximately 1 h. The mean total number of all words is 4,979.18 (SD = 2,448.70). The entire utterances including filled pauses, neologisms, and mispronunciations were transcribed verbatim using Microsoft Word by two trained assistants with clear instructions. Each speech turn starts on a new line and that of the healthcare provider is marked with an octothorpe (#) at the beginning and at the end. The present study has been approved by the IRB—Comité de Protection des Personnes EST-III (CPP:18.04.03, ID-RCB: 2017-A02702-51).

2.3 Preprocessing

An experienced linguist carried out preprocessing following predefined instructions. The spellings were manually double checked and corrected in all the transcripts without affecting their verbatim nature. Three different symbols, inspired by the method proposed by Foster and colleagues (53), were used to mark the elements required for analyses as follows:

{} for speech disfluency such as filled pause, repetition, false start, auto-correction, and auto-interruption/abandonment

| for clauses whose nucleus is a conjugated verb

< > for minor utterances (no conjugated verbs).

The transcripts were segmented in three ways: each speech turn as a segment, each sentence as a segment, and each sentence without the healthcare provider’s speech as a segment. For the first, each new line was treated as a segment; for the second, segments were delimited by sentence-final punctuation; and for the third, all lines starting and ending with octothorpes were removed using Python, together with the blank lines generated by this removal.
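The removal of the healthcare provider's turns described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function name `keep_patient_lines` and the sample transcript are made up, and the only convention taken from the text is that provider turns start and end with an octothorpe.

```python
def keep_patient_lines(transcript: str) -> list[str]:
    """Drop provider turns (lines wrapped in '#') and blank lines."""
    kept = []
    for line in transcript.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line, possibly left over from a removed turn
        if stripped.startswith("#") and stripped.endswith("#"):
            continue  # healthcare provider's speech turn
        kept.append(stripped)
    return kept


# Invented example transcript (not study data).
sample = (
    "# How are you feeling today? #\n"
    "so so\n"
    "\n"
    "# Can you tell me more? #\n"
    "I don't sleep well."
)
patient_lines = keep_patient_lines(sample)
```

Each remaining line can then be split further on sentence-final punctuation to obtain the third segmentation.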

2.4 Linguistic variables

The preprocessed transcripts were analyzed using NLP techniques with Python, which resulted in 33 features at the lexical, syntactic, and semantic levels and that of speech fluency (see Table in Supplementary Material).

2.4.1 Lexical level

Lexical richness was measured to explore the variety of words and the quality of vocabulary. For the former, lexical diversity was calculated using the type–token ratio (54). For the latter, the proportion of content words (nouns, verbs, adjectives, and adverbs) to the total number of words, called lexical density (55), was measured. Since function words are excluded, lexical density reflects how informative the discourse is. Disturbed self-experience and different patterns of use of the first-person singular pronoun in people with schizophrenia have been reported (26, 29, 56). The use of personal pronouns was explored through three different measures as follows: the proportion of “I” to the total number of subject personal pronouns, the proportion of “I” to the total number of words, and the ratio of the first-person singular subject pronoun to the first-person object pronoun. The analyses at the lexical level were carried out on the lemmatized corpus using treetaggerwrapper (57).
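As a rough illustration of the lexical measures above, the sketch below computes the type–token ratio, lexical density, and the proportion of “I” to all words from a list of (lemma, tag) pairs. The coarse tag names, the function name `lexical_measures`, and the toy input are assumptions made for this example; they are not the tagset or code used in the study, where lemmas and tags came from treetaggerwrapper.

```python
# Hypothetical coarse part-of-speech labels for content words.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_measures(tagged, first_person="je"):
    """Compute three lexical features from (lemma, tag) pairs."""
    if not tagged:
        raise ValueError("empty corpus")
    lemmas = [lemma.lower() for lemma, _ in tagged]
    n_words = len(lemmas)
    return {
        # distinct lemmas / all tokens
        "type_token_ratio": len(set(lemmas)) / n_words,
        # content words / all tokens
        "lexical_density": sum(tag in CONTENT_TAGS for _, tag in tagged) / n_words,
        # first-person singular subject pronoun / all tokens
        "i_per_word": lemmas.count(first_person) / n_words,
    }


# Invented lemmatized French example (not study data).
tagged = [
    ("je", "PRON"), ("dormir", "VERB"), ("mal", "ADV"), ("et", "CONJ"),
    ("je", "PRON"), ("manger", "VERB"), ("peu", "ADV"),
]
measures = lexical_measures(tagged)
```

The other two pronoun measures follow the same pattern, with the denominator replaced by the count of subject personal pronouns or of first-person object pronouns.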

2.4.2 Syntactic level

Syntactic complexity and poverty of speech were measured. The analyses were based on lexicogrammatical constituency in functional grammar. Constituency is the hierarchical compositional structure of language, and this hierarchy of units is denominated as a rank scale, with each step in the hierarchy referred to as one rank (58). The ranks of lexicogrammatical constituency are clause > phrase/group > word > morpheme, wherein the clause is the highest unit and the central processing unit. In addition, this unit is one of the five levels in the grammatical system (59) and the primary unit in immediate speech processing (60). The clause has therefore been determined as the basic syntactic unit in this study. The utterances were segmented into clauses whose nucleus is a conjugated verb. When a group of words lacks a conjugated verb, it is considered a minor utterance. As for syntactic complexity, Szmrecsanyi compared syntax tree-based node counts, length-based word counts, and an index of syntactic complexity calculated based on subordinators and embeddedness with regard to their accuracy and applicability (61). The results showed that all three methods were almost perfect proxies, and therefore the most economical method, word counts, could be used. The average number of words per clause was therefore calculated as a measure of syntactic complexity. In turn-taking between a patient and a psychiatrist, the number of the patient’s turns was counted, and the proportion of the turns containing only minor utterances (short answers) to the total number of their turns was calculated. A patient’s turn is considered a minor utterance when the patient answers with simple words such as “yes,” “no,” “OK,” or a group of words without developing the reply. 
For example, to the question “How are you feeling today?”, the reply would be “so so/a little better/not really happy about all this.” This type of utterance is in line with “poverty of speech,” which is widely described in the literature (8, 10, 12). All the disfluency elements were removed from the corpus prior to the syntactic analyses.
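The two syntactic measures can be sketched from annotated turns as follows. The exact annotation layout is not given in the text, so this example assumes that a vertical bar separates consecutive clauses and that each minor utterance is wrapped in angle brackets; the function names and sample turns are likewise invented.

```python
import re

MINOR = r"<[^>]*>"  # assumed markup for a minor utterance


def words_per_clause(turns):
    """Average number of words per clause, ignoring minor utterances."""
    lengths = []
    for turn in turns:
        text = re.sub(MINOR, " ", turn)      # minor utterances are not clauses
        for clause in text.split("|"):       # assumed clause separator
            words = clause.split()
            if words:
                lengths.append(len(words))
    return sum(lengths) / len(lengths) if lengths else 0.0


def minor_turn_proportion(turns):
    """Share of turns that consist only of minor utterances."""
    only_minor = re.compile(rf"\s*(?:{MINOR}\s*)+")
    return sum(bool(only_minor.fullmatch(t)) for t in turns) / len(turns)


# Invented patient turns, disfluencies already removed (not study data).
turns = [
    "<so so>",
    "I sleep badly | because I think too much",
    "<yes>",
    "I stopped school",
]
avg_len = words_per_clause(turns)
minor_share = minor_turn_proportion(turns)
```

Here three clauses of 3, 5, and 3 words give an average of 11/3 words per clause, and two of the four turns are minor.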

2.4.3 Semantic level

Latent semantic analysis (LSA) (62, 63) has been applied to measure incoherence in speech (40, 41) and turned out to be fairly efficient when combined with other linguistic features (41–43, 49). LSA is a widely used NLP technique that analyzes texts to explore the relationships between a set of documents and the terms inside those documents. The underlying idea of LSA is that semantically similar words occur in similar texts, and thereby the cooccurrences of terms in large corpora of texts are used for measuring the lexical proximity/semantic similarity of terms of a language. LSA was chosen over other techniques for the following assets: a) the technique is based on a psychological theory of meaning and has shown results similar to human evaluations in educational applications (63); b) early studies using this technique paved the way for the use of NLP in early detection of psychosis (40, 41, 64, 65); c) LSA can handle longer passages of words (66) and synonyms in case of word redundancy for the avoidance of repetition (63); and d) contrary to new transformer-based models, this technique is not sensitive to initialization parameters, which allows consistent results. In addition, an LSA-based text analysis tool called Coh-Metrix (67, 68) has been efficiently used in studies on formal thought disorder (FTD) (56, 69–71). In the present study, semantic coherence was measured in three different types: intersubjective, subjective, and subjective without doctor (abbreviated henceforth as wodr) coherence. In the first type, semantic coherence was measured based on turn-taking, which represents dialogue coherence, inter-turn comparison; in the second, based on punctuation marks, such as periods and question marks, which could be called sentence-to-sentence coherence; and in the third, only the patients’ speech was considered. 
For the semantic analyses, the transcripts were not lemmatized (72), stop words were removed, and the disfluency elements were kept for the sake of semantic integrity.
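The coherence features named in this study (for example “LSA minimum” and “LSA median”) summarize similarities between consecutive segments. The sketch below shows that summary step only. As a loudly stated simplification, raw term-count vectors stand in for LSA vectors, which would normally come from a reduced-rank decomposition of a term-by-document matrix; the function names, the stop-word list, and the sample segments are all invented.

```python
import math
from collections import Counter
from statistics import median


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def coherence_summary(segments, stop_words=frozenset()):
    """Minimum and median similarity between consecutive segments."""
    if len(segments) < 2:
        raise ValueError("need at least two segments")
    # Count vectors are a stand-in for LSA vectors (assumption of this sketch).
    vectors = [
        Counter(w for w in seg.lower().split() if w not in stop_words)
        for seg in segments
    ]
    sims = [cosine(u, v) for u, v in zip(vectors, vectors[1:])]
    return {"minimum": min(sims), "median": median(sims)}


# Invented segments and stop words (not study data).
stop_words = frozenset({"i", "at", "a", "my", "in"})
segments = [
    "i sleep badly at night",
    "at night i think a lot",
    "my sister lives in brest",
]
summary = coherence_summary(segments, stop_words)
```

Passing speech turns of both speakers gives an intersubjective-style score, sentences give a subjective-style score, and patient-only sentences give the wodr variant.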

2.4.4 Speech fluency

Speech flow can vary in any individual depending on their situation, state of mind, and/or fatigue. Disfluencies in speech comprise unfilled pauses (silent), filled pauses (“uh,” “um”), false starts, repetitions, autocorrection, parenthetical remarks (“well,” “yeah”) (73), and abandoned utterances (abandonment/auto-interruption). Various features of speech disfluency in patients with psychotic disorders, such as filled pauses, autocorrection, reparandum–interregnum repair structure, and unfilled pauses, have been studied in detail (74–76). All the disfluency elements, except unfilled pauses, were counted, and three disfluency-related subcategories were created as features in the present study as follows: filled pauses, abandonments/auto-interruptions, and auto-corrections/repetitions/false starts. The proportion of each of the three to the total number of words was calculated. A disfluency element with several words was counted as one. Among the abandoned utterances, clauses with a subject and an incomplete predicate constituted a separate variable, that is, truncated clauses.
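A minimal sketch of the disfluency proportions, assuming each disfluency element is wrapped in curly braces as in the preprocessing step. Two things are assumptions of this example rather than facts from the study: filled pauses are recognized from a small hypothetical inventory (`FILLERS`), and words inside braces are included in the total word count.

```python
import re

FILLERS = {"euh", "hum"}  # hypothetical filled-pause inventory


def disfluency_proportions(text: str) -> dict:
    """Proportion of disfluency elements to the total number of words."""
    spans = re.findall(r"\{([^{}]*)\}", text)
    # Assumption: words inside the braces count toward the total.
    total_words = len(re.sub(r"[{}]", " ", text).split())
    # A multi-word element such as a repetition counts as one element.
    filled = sum(1 for s in spans if s.strip().lower() in FILLERS)
    return {
        "filled_pauses": filled / total_words,
        "other_disfluencies": (len(spans) - filled) / total_words,
    }


# Invented annotated utterance (not study data).
props = disfluency_proportions("{euh} je {je je} dors mal {hum} la nuit")
```

In this toy utterance there are nine words, two filled pauses, and one repetition counted as a single element.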

2.5 Statistics, XGBoost Classifier, SMOTE, SHAP values

Statistical analyses were carried out using Python scipy (77) and statsmodels (78). Data normality was tested using Kolmogorov–Smirnov test. For group comparisons in each of the 33 linguistic features and education levels, a Kruskal–Wallis test and a Dunn–Bonferroni test, as a post hoc analysis, were performed. Data homoscedasticity was verified using Levene’s test. A Kendall’s tau-b was calculated between the linguistic variables and the patients’ education levels as possible confounders.
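To make the group comparison concrete, the following sketch computes the Kruskal–Wallis H statistic by hand on invented values. It is a didactic re-implementation under stated assumptions, not the scipy call used in the study. With three groups the statistic has 2 degrees of freedom, for which the chi-square tail probability has the closed form exp(−H/2).

```python
import math


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    start = 0
    while start < n:
        end = start
        while end < n and pooled[end][0] == pooled[start][0]:
            end += 1
        mean_rank = (start + 1 + end) / 2  # mean of ranks start+1 .. end
        for _, g in pooled[start:end]:
            rank_sums[g] += mean_rank
        tie_term += (end - start) ** 3 - (end - start)
        start = end
    h = 12 / (n * (n + 1)) * sum(
        r * r / len(s) for r, s in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))


# Invented feature values for three groups (not study data).
nar = [0.31, 0.28, 0.35]
ar = [0.22, 0.25, 0.27]
fep = [0.05, 0.11, 0.09]

h_stat = kruskal_wallis_h([nar, ar, fep])
p_value = math.exp(-h_stat / 2)   # chi-square tail, valid for 2 degrees of freedom
alpha_per_pair = 0.05 / 3         # Bonferroni threshold for 3 pairwise post hoc tests
```

For these perfectly separated toy groups, H is 7.2 and the p-value is about 0.027; the post hoc step then compares each pair against the Bonferroni-adjusted threshold.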

A supervised machine learning model, XGBoost (eXtreme Gradient Boosting) (79), was used for classification. The gradient boosting method provides higher predictive accuracy thanks to its functional characteristics, that is, it combines weak learners to give rise to a stronger learner and therefore forms a more robust model (80). In addition, multicollinearity does not affect the stability and robustness of the model’s performance thanks to the capability of the algorithm to choose the best of highly correlated features (81). Furthermore, XGBoost has shown better performance with small datasets (82, 83) than other classifiers. The dataset in the present study is imbalanced. This limitation was addressed through SMOTE (Synthetic Minority Oversampling Technique) (84), a statistical technique for upsampling the minority class for a better balanced dataset. This technique has already been used and has proven its efficacy, for example, in diagnosis, classification, and prognosis of cancer, diabetes, and Parkinson’s disease (85–97), to name a few. Stratified K-fold cross validation (k = 3) was used to split the data into train and test sets, and SMOTE was subsequently conducted individually in each fold to avoid data leakage. Stratified K-fold cross validation was chosen over leave-one-out cross validation for the sake of computational time and power, and k = 3 was set considering our relatively small dataset and the number of patient groups. The test size was 0.3. Using Bayesian Optimization (98) to tune hyperparameters, an XGBoost Classifier was trained using the 33 features of the original data to compute the SHapley Additive exPlanation (SHAP) values (99), and the mean absolute SHAP values were calculated for feature selection (100, 101). Another XGBoostClassifier was then trained using the outcome of feature importance based on the mean absolute SHAP values and the upsampled data. 
Inspired by Shapley values (102) from cooperative game theory, the SHAP values allow interpreting the model output by measuring the contribution of each feature to predictions. Precisely, the SHAP values reveal how much (magnitude) and either positively or negatively (direction) each feature affected the classification (99). This method thereby allows explanations and better interpretation of the results. The process of speech data acquisition and analyses is depicted below in Figure 1.
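The point about oversampling inside each fold can be illustrated with a small, dependency-free sketch. It is a simplified stand-in written for this example, not the implementation used in the study: `stratified_folds` and `smote` are hypothetical helpers, the data are invented, and the interpolation step (a synthetic point placed at a random position between a minority sample and one of its nearest same-class neighbours) is a reduced form of SMOTE.

```python
import random


def stratified_folds(labels, k):
    """Assign indices to k folds so each fold keeps the class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def smote(X, y, k_neighbors=2, rng=None):
    """Simplified SMOTE: upsample every class to the majority-class size."""
    rng = rng or random.Random(0)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k_neighbors]
            other = rng.choice(neighbours) if neighbours else base
            gap = rng.random()  # position along the segment base -> other
            X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Invented imbalanced data (not study data).
X = [[float(i), float(i % 3)] for i in range(12)]
y = ["AR"] * 6 + ["NAR"] * 3 + ["FEP"] * 3

results = []
for test_idx in stratified_folds(y, k=3):
    held_out = set(test_idx)
    X_train = [X[i] for i in range(len(X)) if i not in held_out]
    y_train = [y[i] for i in range(len(y)) if i not in held_out]
    # Oversample the training part only; the held-out fold stays untouched.
    X_bal, y_bal = smote(X_train, y_train)
    results.append((len(test_idx), {c: y_bal.count(c) for c in set(y_bal)}))
```

Because synthetic samples are generated only from training rows, no interpolated copy of a test patient can leak into the model that is evaluated on that patient.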

Figure 1. Pipeline for speech data acquisition and data analyses.

3 Results

3.1 Statistical results

A Kolmogorov–Smirnov test showed that no feature had a normal distribution (0.5 ≤ D ≤ 1 and p < 0.00 in all 33 features). The results of Levene’s test indicated homogeneity of variance in all features (p > 0.05). A Kendall’s tau-b test showed no evidence for a moderate or strong impact of years of education on the linguistic features (rτ = 0.24, p = 0.01 between average number of words per clause and education level; −0.14 ≤ rτ ≤ 0.16, 0.07 ≤ p ≤ 0.99 in all the other pairs). A Kruskal–Wallis test was performed on each of the 33 features of the three groups. The results revealed significant differences between the three groups in two features (intersubjective LSA minimum and subjective LSA minimum) as shown in Table 1a (for the full table, see Supplementary Material). A Dunn–Bonferroni test was then conducted to verify which groups were different. Its results indicated significant differences either between NAR and FEP or between AR and FEP, but no differences were found between NAR and AR as shown in Table 1b.

Table 1. Kruskal–Wallis test results of the main features (a) and Dunn–Bonferroni test results (b).

3.2 Classification and explainability results

The XGBoostClassifier trained on SMOTE data with all the features yielded 0.75 accuracy, 0.73 precision, 0.75 recall, 0.74 F2-score, and 0.70 ROC–AUC score. The most impactful features were selected based on the mean absolute SHAP values computed on the original data as shown in Figure 2. The first four features, whose values are greater than 0.3, were selected (intersubjective LSA minimum, subjective LSA wodr minimum, the proportion of “I” to the total number of words, and filled pauses) for another classification using XGBoostClassifier. This cutoff was based on threshold tests on the first 10 features. The best result was obtained when the first four features were included; for example, with the first five features, the accuracy (0.79) was slightly lower than that with the first four features and higher than that with all the features (0.75). The newly trained model reached 0.82 accuracy, 0.85 precision, 0.82 recall, 0.82 F2-score, and 0.86 ROC–AUC score (see Figure 3 for ROC–AUC curve), and as for 95% confidence intervals (CI) of accuracy, the lower CI was 0.68 and the upper CI, 0.95. The specificity and sensitivity of each group (group-specificity–sensitivity) were as follows: NAR-0.82–0.80, AR-0.86–0.80, and FEP-1.00–1.00. The results are shown in Table 2. Eight patients in the test set had their statuses at T2. Only one AR patient at T0 was misclassified into NAR by our model, but their status at T2 turned out to be NAR.
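The selection rule described above (rank features by mean absolute SHAP value and keep those above a cutoff) can be sketched as follows. The SHAP matrix here is invented for illustration and has one row per sample and one column per feature; the feature names and helper functions are hypothetical, and a multiclass model would additionally require combining values across classes.

```python
def mean_abs_by_feature(shap_rows, names):
    """Mean absolute SHAP value of each feature over all samples."""
    n = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(names)
    }


def select_features(importance, threshold):
    """Features ranked by importance, keeping those above the threshold."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, value in ranked if value > threshold]


# Invented SHAP values for three samples and three features (not study data).
names = ["lsa_inter_min", "i_per_word", "type_token_ratio"]
shap_rows = [
    [0.9, -0.4, 0.1],
    [-0.7, 0.3, -0.1],
    [0.8, -0.5, 0.0],
]
importance = mean_abs_by_feature(shap_rows, names)
selected = select_features(importance, threshold=0.3)
```

Taking absolute values before averaging matters: a feature that pushes some patients strongly toward a class and others strongly away from it would otherwise average out to near zero.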

Figure 2. Mean absolute SHAP values.

Figure 3. ROC curve of XGBoostClassifier model.

Table 2. Classification report (a), specificity and sensitivity (b), 95% confidence intervals (c).

The SHAP values of each individual in each class are visually represented in Figures 4 (NAR), 5 (AR), and 6 (FEP). The x-axis indicates the SHAP values, the y-axis shows the features, and the color of the point represents the original value of that sample, that is, higher in red and lower in blue. The farther a point is from the center vertical axis, the stronger its impact is on the classification. Figure 4 shows that lower scores in intersubjective LSA minimum, lexical density, and subjective LSA without doctor minimum have a negative impact on predictions. In other words, these lower values are indicative of the individuals’ lower chance of being classified as NAR. Conversely, higher scores, albeit to a lesser degree, in filled pauses and subjective LSA median contribute positively to NAR. The magnitude of the higher scores in the proportion of “I” to the total number of words suggests their relatively small negative impact on the NAR classification. In Figure 5, the lower proportion of “I” to the total number of words, and higher frequencies of abandonment/auto-interruption and filled pauses, have a negative impact on predictions in AR. When scores in the proportion of “I” to the personal pronouns and subjective LSA minimum are higher, the odds of individuals being classified as AR are higher. Figure 6 shows that lower minimum scores in all the three types of LSA contribute positively to FEP with the greatest magnitude of intersubjective LSA minimum. Higher values in subjective LSA wodr median negatively impact FEP. The contributions are summarized by patient group, direction, and magnitude in Table 3.
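To make the notions of direction and magnitude concrete, the sketch below computes exact Shapley values for a toy scoring function by averaging each feature's marginal contribution over all orderings. The feature names, effect sizes, and interaction term are invented, and this brute-force enumeration is only feasible for a handful of features; SHAP uses efficient approximations or model-specific algorithms instead.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values by enumerating every ordering of the features."""
    contrib = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        present = set()
        for f in order:
            before = value(frozenset(present))
            present.add(f)
            contrib[f] += value(frozenset(present)) - before
    return {f: total / len(orders) for f, total in contrib.items()}


# Invented per-feature effects on a class score (not study data).
EFFECTS = {"low_lsa_min": -0.6, "filled_pauses": 0.2, "high_i_per_word": -0.1}


def toy_score(present):
    """Toy model output given the set of features that are 'switched on'."""
    score = sum(EFFECTS[f] for f in present)
    if {"low_lsa_min", "high_i_per_word"} <= present:
        score -= 0.2  # invented interaction between the two features
    return score


phi = shapley_values(list(EFFECTS), toy_score)
```

The sign of each value gives the direction of the contribution and its absolute size gives the magnitude; the interaction penalty is split equally between the two features involved, and the values sum to the difference between the full score and the empty-set score.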

Figure 4. SHAP values of Not-At-Risk patients.

Figure 5. SHAP values of At-Risk patients.

Figure 6. SHAP values of First Episode of Psychosis patients.

Table 3. Overview of the directions (positive and negative impacts on classification) and magnitudes (higher and lower values marked with ordinal numbers) of linguistic markers based on SHAP values.

4 Discussion

The present study aimed at detecting relevant linguistic markers that could classify French-speaking UHR patients by their status at T0 and seeking to explain the classification results with regard to linguistic manifestations observed in the clinic. The results showed that our model based on XGBoost, SMOTE, and the SHAP values could get good performance through the interplay of the four linguistic markers obtained from a feature importance method using the SHAP values on the original data. These mean absolute SHAP values as feature importance revealed that the two uppermost features pertained to semantic coherence, the third most important to the use of “I,” and the last important feature was one of the disfluency-related elements, filled pauses. The two hypotheses thereby turned out to be true—semantic coherence and the use of “I” played a key role in the classification. The four linguistic markers identified pertain to weak coherence (intersubjective LSA minimum and subjective LSA wodr minimum, i.e., the lowest LSA score in each patient), self-related subject pronoun (the proportion of “I” to the total number of words), and disfluency (filled pauses).

Semantic incoherence has been reported to be a linguistic characteristic in FEP or schizophrenia (8, 10, 20, 40–42, 45, 46, 56). It is noteworthy that lower minimum scores contribute positively to FEP and negatively to NAR regardless of the LSA type. Higher minimum scores in subjective LSA appear to have a positive impact on classifying AR. The feature intersubjective LSA minimum turned out to have significant differences in Kruskal–Wallis and Dunn’s tests and a much greater impact on predictions than the other markers. This type of coherence was calculated between consecutive pairs of speech turns. Studies on coherence have been focused on patients’ utterances (40–44, 49, 52) like subjective LSA wodr (only-patient LSA) in our study. A dialogue is constructed within the framework of turn-taking described as a type of social organization that is implicated in speech exchange systems (103). For a dialogue to be coherent, a response should be fluent, consistent, context related (104), and the respondent should understand conventional meaning and catch their interlocutor’s intention. Dialogue coherence is thereby grounded in Speech Act Theory (105, 106) as well as related theories on conversation analysis and discursive pragmatics (107–109), wherein semantics and pragmatics are entailed. This weak dialogue coherence could partly explain some occasional strange speech and social interaction impairment in patients. Higher median values in subjective LSA contribute positively to NAR classification, whereas higher subjective LSA wodr median scores have a negative impact on FEP. Taken together, these results suggest that weak coherence is a marker of FEP even though it is still somewhat premature to generalize this finding due to the small sample size of FEP in the current study.

The use of the first-person singular pronouns in schizophrenia has been explored in some studies whose results were opposite to one another. When compared to patients with mood disorder, schizophrenics used fewer first-person singular pronouns (26) whereas these pronouns were more frequent in individuals with schizophrenia than healthy controls (28, 29, 56). The present study focused on the first-person singular subject pronoun “I.” The results showed no significant difference between groups, and higher and lower scores of “I” in FEP do not provide unequivocal contribution types contrary to what has been reported in the literature. However, more frequent use of “I” has a positive impact on AR classification, whereas it contributes negatively to NAR. The difference between the findings in the aforementioned studies and ours could be due to the differences in the populations compared (mood disorder vs. schizophrenia, healthy individuals vs. people with schizophrenia, NAR vs. FEP, and AR vs. FEP) and the pronouns compared (first-person singular pronouns; first-person singular subjective pronoun). The frequency of “I” in this study allowed differentiating between NAR and AR. The more frequent use of “I” in AR might indicate their more intense emotional distress compared to the NAR group as the statuses are the outcome of the CAARMS that assesses “emotional disturbance” in one of the seven subscales. Rude and colleagues showed that depressed college students used “I” more frequently—not the other first-person singular pronouns such as “me” or “myself”—than non-depressed peers (110). The differentiation between NAR and AR by the frequency of “I” might be indicative of more self-centered speech of AR and explained by their considering the self to be a solitary actor/agent as proposed by Rude and colleagues in (110). 
The meaning of higher and lower values in the frequency of “I” found in both directions in FEP is unclear and intriguing to us, but it might be partly explained by current affective disorders that turned out to be significantly more common in at-risk mental state than FEP (111). This claim does not refute the interpretation of the aforementioned differentiation between NAR and AR.

A filled pause is an uttered sound that fills a momentary interruption in speech production. When considered a pragmatic function, it has several functions such as discourse planning and structuring, and turn-taking (112) by signaling delays when a speaker stalls for time to retrieve information and wishes to continue their utterance (113). When considered a speech disfluency element, filled pauses are symptomatic of production difficulties (114). In the present study, the feature filled pauses is another marker that allows differentiation between NAR and AR. Its higher values contribute positively to NAR and negatively to AR. No impact of this disfluency element is observed on FEP classification. Another disfluency element, abandonment/auto-interruption, plays a role in classifying AR. When its scores are higher, it has a negative impact on AR predictions. It has been reported that patients with schizophrenia use fewer filled pauses (74, 115, 116) and produce longer filled pauses than healthy controls (117). Interestingly, Costa and Silva found that filled pauses before personal pronouns produced by patients with schizophrenia were twice as long as others, and the pronouns are mostly first-person singular pronouns (117). It was argued by the authors that their result could be explained by patients’ possible difficulties with self-reference. Filled pauses have ambivalent roles as mentioned above—they not only help speech production but also indicate hesitations and difficulties. Lower values in filled pauses in AR in this study, and fewer thereof in FEP in the literature, could be interpreted as indicative of somewhat disturbed pragmatic functions rather than speech disfluency. No contribution of filled pauses to FEP predictions contrary to what has been reported in the literature may be due to different populations compared (schizophrenia vs. FEP) and the small number of FEP patients in the current study.

The present exploratory study used recordings of the first consultations, a non-invasive method that stays within the standard healthcare framework while allowing data collection under the same conditions for all participants. Our results provided evidence that a small number of linguistic markers, without demographic or clinical data, could classify UHR patients even at T0, that is, when patients probably do not present with obvious abnormalities in language behaviors. Besides, even healthy controls can experience mild language abnormalities (118), which could make language analyses more subtle and complicated. It should be pointed out that even though the AR patient at T0 who was misclassified into NAR is a single case of the kind in the present study, this misclassification, along with the other seven patients with their statuses at T2 who were correctly classified, is encouraging. It should cautiously be noted that the small number of FEP patients, along with possible linguistic and cultural differences, could make it somewhat delicate to generalize the results. However, the possible linguistic and cultural factor may not intervene in FTD, as a systematic review article suggests that a three-factor FTD structure with two prominent dimensions (disorganization and negative dimensions) is likely consistent and robust across languages (119). As a number of studies in the literature have also shown disturbed semantic coherence in FEP and schizophrenia, it could be argued that at least semantic disturbances are a universal linguistic manifestation of patients with psychosis regardless of languages and cultures. The SHAP values provided a local interpretation of the contribution of each feature to the classification. Even some features for which no significant group difference was observed, such as the frequency of “I,” filled pauses, and subjective LSA word minimum, showed distinctive differences in the direction and/or the magnitude of the SHAP values. These differences would more likely reflect very subtle differences between patient groups recorded at a very early stage of psychiatric care than an overfitting issue, since the model went through a cross-validation phase, although with a small k value. The SHAP explainability method could thereby give deeper insight into the linguistic characteristics and speech patterns of each category of patients, which could lead to improving diagnostic methods.
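To make the notion of a local, signed contribution concrete, the sketch below computes exact Shapley values (102) for a toy scoring function by enumerating feature coalitions. It is only an illustration under stated assumptions: `toy_score`, its coefficients, the feature names, and the `baseline` and `patient` vectors are all hypothetical, and the brute-force enumeration stands in for the SHAP library applied to the gradient boosting model in this study.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features, so this
    is suitable for toy examples only.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(z):
    # Hypothetical score, NOT the study's model: coefficients are made up.
    first_person, filled_pauses, coherence = z
    return 0.8 * first_person - 0.5 * filled_pauses + 0.3 * first_person * coherence


baseline = [0.0, 0.0, 0.0]   # assumed reference profile
patient = [1.0, 2.0, 1.0]    # assumed feature values for one speaker
phi = shapley_values(toy_score, patient, baseline)

for name, contribution in zip(["first_person", "filled_pauses", "coherence"], phi):
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the score for this speaker and the score for the baseline, and each one carries a sign and a magnitude. This is the property that lets a feature without a significant group difference still show a distinctive direction of effect for individual predictions.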

5 Limitation

The current study has too few FEP patients and lacks the 2-year statuses of most patients. In addition, our dataset is relatively small and imbalanced, which led us to carry out an exploratory study to test the feasibility and potential of a gradient boosting model using only linguistic data. With new transformer-based models, such as BERT and SBERT, as well as word-embedding models, like GloVe, LSA is considered by some to be outdated, despite its advantages, mainly because LSA does not consider word order and context. This weakness might be critical for clinical data. It would therefore be interesting to use a new model combining LSA and BERT (BERT-LSA) (120) or other models in a future study. The inclusion of more patients and their statuses at T2 would allow more robust models and more accurate model performance evaluations. It is therefore planned to continue to record UHR patients, include more FEP patients, and analyze their speech using more classifiers for performance comparisons in search of a good diagnostic aid tool.
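The word-order limitation can be seen in a minimal sketch of the bag-of-words representation that LSA starts from. The example sentences and helper names below are hypothetical, and the sketch covers only the counting stage: a full LSA pipeline would additionally apply a singular value decomposition to a term-document matrix, which does not restore the order information lost here.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Word counts for a sentence; word order is discarded at this stage."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical sentences: same words, opposite meaning.
s1 = "the doctor reassured the patient"
s2 = "the patient reassured the doctor"
s3 = "the patient felt anxious"

same = cosine(bag_of_words(s1), bag_of_words(s2))
diff = cosine(bag_of_words(s1), bag_of_words(s3))

print(bag_of_words(s1) == bag_of_words(s2))  # identical count vectors
print(round(same, 3), round(diff, 3))
```

The first two sentences reverse who reassures whom, yet they receive identical vectors and therefore maximal similarity. Context-sensitive models such as BERT are designed to distinguish cases of this kind.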

Data availability statement

The datasets presented in this article are not readily available due to medical confidentiality. Requests to access the datasets should be directed to D-HK-D, ZGgua2ltZHVmb3JAZ21haWwuY29t.

Ethics statement

The studies involving humans were approved by Comité de Protection des Personnes EST-III (CPP:18.04.03, ID-RCB: 2017-A02702-51). The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants’ legal guardians/next of kin.

Author contributions

DK: Writing – review & editing, Conceptualization, Investigation, Writing – original draft, Data curation, Formal Analysis, Methodology, Software, Visualization. MW: Conceptualization, Writing – original draft, Writing – review & editing, Funding acquisition, Project administration, Validation. M-OK: Writing – review & editing, Funding acquisition, Project administration, Validation. YH: Writing – review & editing. PL: Writing – original draft, Writing – review & editing. CL: Writing – original draft, Data curation, Validation, Methodology, Investigation, Writing – review & editing, Funding acquisition, Conceptualization.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work has been supported by the French government’s “Investissement d’Avenir” program, which is managed by the Agence Nationale de la Recherche (ANR), under the reference PsyCARE ANR-18–429 RHUS-0014.

Acknowledgments

We would like to thank the psychiatrists and research nurses at the CEVUP, CHU de Brest, for recording their consultations and helping us out with clinical data. We are also grateful to Catherine and Valentine for the transcription.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1595197/full#supplementary-material

References

1. Fatouros-Bergman H, Cervenka S, Flyckt L, Edman G, and Farde L. Meta-analysis of cognitive performance in drug-naïve patients with schizophrenia. Schizophr Res. (2014) 158:156–62. doi: 10.1016/j.schres.2014.06.034

2. Larsen TK, Melle I, Auestad B, Haahr U, Joa I, Johannessen JO, et al. Early detection of psychosis: positive effects on 5-year outcome. Psychol Med. (2011) 41:1461–9. doi: 10.1017/S0033291710002023

3. Murru A and Carpiniello B. Duration of untreated illness as a key to early intervention in schizophrenia: a review. Neurosci Lett. (2018) 669:59–67. doi: 10.1016/j.neulet.2016.10.003

4. Krebs MO. Signes précoces de schizophrénie. Paris, France: Dunod (2015).

5. Howes OD, Whitehurst T, Shatalina E, Townsend L, Onwordi EC, Mak TLA, et al. The clinical significance of duration of untreated psychosis: an umbrella review and random-effects meta-analysis. World Psychiatry. (2021) 20:75–95. doi: 10.1002/wps.20822

6. Yu M, Tan Q, Wang Y, Xu Y, Wang T, Liu D, et al. Correlation between duration of untreated psychosis and long-term prognosis in chronic schizophrenia. Front Psychiatry. (2023) 14:1112657. doi: 10.3389/fpsyt.2023.1112657

7. Fusar-Poli P, Cappucciati M, Borgwardt S, Woods SW, Addington J, Nelson B, et al. Heterogeneity of psychosis risk within individuals at clinical high risk: a meta-analytical stratification. JAMA Psychiatry. (2016) 73:113–20. doi: 10.1001/jamapsychiatry.2015.2324

8. Andreasen NC. Scale for the assessment of thought, language, and communication (TLC). Schizophr Bull. (1986) 12:473. doi: 10.1093/schbul/12.3.473

9. Covington MA, He C, Brown C, Naçi L, McClain JT, Fjordbak BS, et al. Schizophrenia and the structure of language: the linguist’s view. Schizophr Res. (2005) 77:85–98. doi: 10.1016/j.schres.2005.01.016

10. Kuperberg GR. Language in schizophrenia part 1: an introduction. Lang Linguistics Compass. (2010) 4:576–89. doi: 10.1111/j.1749-818X.2010.00216.x

11. Hinzen W and Rosselló J. The linguistics of schizophrenia: thought disturbance as language pathology across positive symptoms. Front Psychol. (2015) 6:126923. doi: 10.3389/fpsyg.2015.00971

12. Ehlen F, Montag C, Leopold K, and Heinz A. Linguistic findings in persons with schizophrenia—a review of the current literature. Front Psychol. (2023) 14:1287706. doi: 10.3389/fpsyg.2023.1287706

13. Whitehorn JC and Zipf GK. Schizophrenic language. Arch Neurol Psychiatry. (1943) 49:831–51. doi: 10.1001/archneurpsyc.1943.02290180055006

14. Lorenz M. Problems posed by schizophrenic language. Arch Gen Psychiatry. (1961) 4:603–10. doi: 10.1001/archpsyc.1961.01710120073008

15. Wolcott RH. Schizophrenese: A private language. J Health Soc Behav. (1970) 11:126–34. doi: 10.2307/2948472

16. Chaika E. A linguist looks at “schizophrenic” language. Brain Language. (1974) 1:257–76. doi: 10.1016/0093-934X(74)90040-6

17. Baskak B, Ozel ET, Atbasoglu EC, and Baskak SC. Peculiar word use as a possible trait marker in schizophrenia. Schizophr Res. (2008) 103:311–7. doi: 10.1016/j.schres.2008.04.025

18. Champagne-Lavau M and Stip E. Pragmatic and executive dysfunction in schizophrenia. J Neurolinguistics. (2010) 23:285–96. doi: 10.1016/j.jneuroling.2009.08.009

19. Moro A, Bambini V, Bosia M, Anselmetti S, Riccaboni R, Cappa SF, et al. Detecting syntactic and semantic anomalies in schizophrenia. Neuropsychologia. (2015) 79:147–57. doi: 10.1016/j.neuropsychologia.2015.10.030

20. Allé MC, Potheegadoo J, Köber C, Schneider P, Coutelle R, Habermas T, et al. Impaired coherence of life narratives of patients with schizophrenia. Sci Rep. (2015) 5:12934. doi: 10.1038/srep12934

21. Haas MH, Chance SA, Cram DF, Crow TJ, Luc A, and Hage S. Evidence of pragmatic impairments in speech and proverb interpretation in schizophrenia. J Psycholinguist Res. (2015) 44:469–83. doi: 10.1007/s10936-014-9298-2

22. Sass LA and Parnas J. Schizophrenia, consciousness, and the self. Schizophr Bull. (2003) 29:427–44. doi: 10.1093/oxfordjournals.schbul.a007017

23. Nelson B, Fornito A, Harrison BJ, Yücel M, Sass LA, Yung AR, et al. A disturbed sense of self in the psychosis prodrome: Linking phenomenology and neurobiology. Neurosci Biobehav Rev. (2009) 33:807–17. doi: 10.1016/j.neubiorev.2009.01.002

24. Moe AM and Docherty NM. Schizophrenia and the sense of self. Schizophr Bull. (2014) 40:161–8. doi: 10.1093/schbul/sbt121

25. Buck B and Penn DL. Lexical characteristics of emotional narratives in schizophrenia: relationships with symptoms, functioning, and social cognition. J Nervous Ment Dis. (2015) 203:702–8. doi: 10.1097/NMD.0000000000000354

26. Fineberg SK, Deutsch-Link S, Ichinose M, McGuinness T, Bessette AJ, Chung CK, et al. Word use in first-person accounts of schizophrenia. Br J Psychiatry. (2015) 206:32–8. doi: 10.1192/bjp.bp.113.140046

27. Fineberg SK, Leavitt J, Deutsch-Link S, Dealy S, Landry CD, Pirruccio K, et al. Self-reference in psychosis and depression: a language marker of illness. Psychol Med. (2016) 46:2605–15. doi: 10.1017/S0033291716001215

28. Tang SX, Kriz R, Cho S, Park SJ, Harowitz J, Gur RE, et al. Natural language processing methods are sensitive to sub-clinical linguistic differences in schizophrenia spectrum disorders. NPJ Schizophr. (2021) 7:25. doi: 10.1038/s41537-021-00154-3

29. Chan CC, Norel R, Agurto C, Lysaker PH, Myers EJ, Hazlett EA, et al. Emergence of language related to self-experience and agency in autobiographical narratives of individuals with schizophrenia. Schizophr Bull. (2023) 49:444–53. doi: 10.1093/schbul/sbac126

30. Morice RD and Ingram JCL. Language analysis in schizophrenia: diagnostic implications. Aust N Z J Psychiatry. (1982) 16:11–21. doi: 10.3109/00048678209161186

31. Fraser WI, King KM, Thomas P, and Kendell RE. The diagnosis of schizophrenia by language analysis. Br J Psychiatry. (1986) 148:275–8. doi: 10.1192/bjp.148.3.275

32. Thomas P, King K, and Fraser WI. Positive and negative symptoms of schizophrenia and linguistic performance. Acta Psychiatr Scand. (1987) 76:144–51. doi: 10.1111/j.1600-0447.1987.tb02877.x

33. Lin E, Lin CH, and Lane HY. Prediction of functional outcomes of schizophrenia with genetic biomarkers using a bagging ensemble machine learning method with feature selection. Sci Rep. (2021) 11:10179. doi: 10.1038/s41598-021-89540-6

34. Kraguljac NV, McDonald WM, Widge AS, Rodriguez CI, Tohen M, and Nemeroff CB. Neuroimaging biomarkers in schizophrenia. Am J Psychiatry. (2021) 178:509–21. doi: 10.1176/appi.ajp.2020.20030340

35. Rodrigues-Amorim D, Rivera-Baltanás T, López M, Spuch C, Olivares JM, and Agís-Balboa RC. Schizophrenia: a review of potential biomarkers. J Psychiatr Res. (2017) 93:37–49. doi: 10.1016/j.jpsychires.2017.05.009

36. Corcoran CM and Cecchi GA. Using language processing and speech analysis for the identification of psychosis and other disorders. Biol Psychiatry: Cogn Neurosci Neuroimag. (2020) 5:770–9. doi: 10.1016/j.bpsc.2020.06.004

37. Le Glaz A, Haralambous Y, Kim-Dufor DH, Lenca P, Billot R, Ryan TC, et al. Machine learning and natural language processing in mental health: systematic review. J Med Internet Res. (2021) 23:e15708. doi: 10.2196/15708

38. Corcoran CM, Mittal VA, Bearden CE, Gur RE, Hitczenko K, Bilgrami Z, et al. Language as a biomarker for psychosis: a natural language processing approach. Schizophr Res. (2020) 226:158–66. doi: 10.1016/j.schres.2020.04.032

39. Hitczenko K, Mittal VA, and Goldrick M. Understanding language abnormalities and associated clinical markers in psychosis: the promise of computational methods. Schizophr Bull. (2021) 47:344–62. doi: 10.1093/schbul/sbaa141

40. Elvevåg B, Foltz PW, Weinberger DR, and Goldberg TE. Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia. Schizophr Res. (2007) 93:304–16. doi: 10.1016/j.schres.2007.03.001

41. Bedi G, Carrillo F, Cecchi GA, Slezak DF, Sigman M, Mota NB, et al. Automated analysis of free speech predicts psychosis onset in high-risk youths. NPJ Schizophr. (2015) 1:1–7. doi: 10.1038/npjschz.2015.30

42. Corcoran CM, Carrillo F, Fernández-Slezak D, Bedi G, Klim C, Javitt DC, et al. Prediction of psychosis across protocols and risk cohorts using automated language analysis. World Psychiatry. (2018) 17:67–75. doi: 10.1002/wps.20491

43. Morgan SE, Diederen K, Vértes PE, Ip SHY, Wang B, Thompson B, et al. Natural Language Processing markers in first episode psychosis and people at clinical high-risk. Transl Psychiatry. (2021) 11:630. doi: 10.1038/s41398-021-01722-y

44. Gargano G, Caletti E, Perlini C, Turtulici N, Bellani M, Bonivento C, et al. Language production impairments in patients with a first episode of psychosis. PloS One. (2022) 17:e0272873. doi: 10.1371/journal.pone.0272873

45. Mota NB, Vasconcelos NAP, Lemos N, Pieretti AC, Kinouchi O, Cecchi GA, et al. Speech graphs provide a quantitative measure of thought disorder in psychosis. PloS One. (2012) 7:e34928. doi: 10.1371/journal.pone.0034928

46. Mota NB, Furtado R, Maia PP, Copelli M, and Ribeiro S. Graph analysis of dream reports is especially informative about psychosis. Sci Rep. (2014) 4:3691. doi: 10.1038/srep03691

47. Mota NB, Copelli M, and Ribeiro S. Thought disorder measured as random speech structure classifies negative symptoms and schizophrenia diagnosis 6 months in advance. NPJ Schizophr. (2017) 3:18. doi: 10.1038/s41537-017-0019-3

48. Spencer TJ, Thompson B, Oliver D, Diederen K, Demjaha A, Weinstein S, et al. Lower speech connectedness linked to incidence of psychosis in people at clinical high risk. Schizophr Res. (2021) 228:493–501. doi: 10.1016/j.schres.2020.09.002

49. Figueroa-Barra A, Del Aguila D, Cerda M, Gaspar PA, Terissi LD, Durán M, et al. Automatic language analysis identifies and predicts schizophrenia in first-episode of psychosis. Schizophrenia. (2022) 8:53. doi: 10.1038/s41537-022-00259-3

50. Oomen PP, De Boer JN, Brederoo SG, Voppel AE, Brand BA, Wijnen FNK, et al. Characterizing speech heterogeneity in schizophrenia-spectrum disorders. J Psychopathol Clin Sci. (2022) 131:172–81. doi: 10.1037/abn0000736

51. Bambini V, Frau F, Bischetti L, Cuoco F, Bechi M, Buonocore M, et al. Deconstructing heterogeneity in schizophrenia through language: a semi-automated linguistic analysis and data-driven clustering approach. Schizophr. (2022) 8:102. doi: 10.1038/s41537-022-00306-z

52. Voppel AE, De Boer JN, Brederoo SG, Schnack HG, and Sommer IEC. Semantic and acoustic markers in schizophrenia-spectrum disorders: A combinatory machine learning approach. Schizophr Bull. (2023) 49:S163–71. doi: 10.1093/schbul/sbac142

53. Foster P, Tonkyn A, and Wigglesworth G. Measuring spoken language: A unit for all reasons. Appl Linguist. (2000) 21:354–75. doi: 10.1093/applin/21.3.354

54. Templin M. Certain language skills in children: their development and interrelationships. Minneapolis: University of Minnesota Press (1957).

55. Ure J. Lexical density and register differentiation. Appl Linguist. (1971) 23:443–52.

56. Lundin NB, Cowan HR, Singh DK, and Moe AM. Lower cohesion and altered first-person pronoun usage in the spoken life narratives of individuals with schizophrenia. Schizophr Res. (2023) 259:140–9. doi: 10.1016/j.schres.2023.04.001

57. Pointal L. TreeTaggerWrapper. Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur, Laboratoire Interdisciplinaire des Sciences du Numérique. Paris, France: CNRS (2016).

58. Halliday MAK and Matthiessen CM. Halliday’s introduction to functional grammar. Milton Park, Abingdon, UK: Routledge (2013).

59. Cook WA. Introduction to tagmemic analysis. Washington D.C., USA: Georgetown University Press (1969).

60. Bever TG, Lackner J, and Kirk R. The underlying structures of sentences are the primary units of immediate speech processing. Percept Psychophys. (1969) 5:225–34. doi: 10.3758/BF03210545

61. Szmrecsányi B. On operationalizing syntactic complexity. Le poids des mots. Proceedings of the 7th international conference on textual data statistical analysis. Louvain-la-Neuve. (2004). 2:1032–9.

62. Landauer TK, Foltz PW, and Laham D. An introduction to latent semantic analysis. Discourse Processes. (1998) 25:259–84. doi: 10.1080/01638539809545028

63. Landauer TK, McNamara DS, Dennis S, and Kintsch W. Handbook of latent semantic analysis. Milton Park, Abingdon, UK: Routledge (2011).

64. Elvevåg B, Foltz PW, Rosenstein M, and DeLisi LE. An automated method to analyze language use in patients with schizophrenia and their first-degree relatives. J Neurolinguistics. (2010) 23:270–84. doi: 10.1016/j.jneuroling.2009.05.002

65. Holshausen K, Harvey PD, Elvevåg B, Foltz PW, and Bowie CR. Latent semantic variables are associated with formal thought disorder and adaptive behavior in older inpatients with schizophrenia. Cortex. (2014) 55:88–96. doi: 10.1016/j.cortex.2013.02.006

66. Wiemer-Hastings P. How latent is latent semantic analysis? Proceedings of the 16th international joint conference on Artificial intelligence. San Francisco, CA (1999) p. 932–7.

67. Graesser AC, McNamara DS, Louwerse MM, and Cai Z. Coh-Metrix: Analysis of text on cohesion and language. Behav Res Methods Instrum Computers. (2004) 36:193–202. doi: 10.3758/BF03195564

68. McNamara DS, Graesser AC, McCarthy PM, and Cai Z. Automated Evaluation of Text and Discourse with Coh-Metrix. 1st ed. Cambridge, UK: Cambridge University Press (2014). Available at: https://www.cambridge.org/core/product/identifier/9780511894664/type/book (Accessed January 20, 2025).

69. Willits JA, Rubin T, Jones MN, Minor KS, and Lysaker PH. Evidence of disturbances of deep levels of semantic cohesion within personal narratives in schizophrenia. Schizophr Res. (2018) 197:365–9. doi: 10.1016/j.schres.2017.11.014

70. Gupta T, Hespos SJ, Horton WS, and Mittal VA. Automated analysis of written narratives reveals abnormalities in referential cohesion in youth at ultra high risk for psychosis. Schizophr Res. (2018) 192:82–8. doi: 10.1016/j.schres.2017.04.025

71. Mackinley M, Chan J, Ke H, Dempster K, and Palaniyappan L. Linguistic determinants of formal thought disorder in first episode psychosis. Early Intervent Psych. (2021) 15:344–51. doi: 10.1111/eip.12948

72. Lemaire B. Limites de la lemmatisation pour l’extraction de significations. In. (2008) p:725–32.

73. Harley TA. The psychology of language: From data to theory. London, UK: Psychology press (2013).

74. Howes C, Lavelle M, Healey PG, Hough J, and McCabe R. Disfluencies in dialogues with patients with schizophrenia. Proceedings of the Annual Meeting of the Cognitive Science Society (2017) 39.

75. Vail AK, Liebson E, Baker JT, and Morency LP. Toward objective, multifaceted characterization of psychotic disorders: Lexical, structural, and disfluency markers of spoken language. Proceedings of the 20th ACM International Conference on Multimodal Interaction (2018), 170–178. doi: 10.1145/3242969

76. Çokal D, Zimmerer V, Turkington D, Ferrier N, Varley R, Watson S, et al. Disturbing the rhythm of thought: Speech pausing patterns in schizophrenia, with and without formal thought disorder. PloS One. (2019) 14:e0217404. doi: 10.1371/journal.pone.0217404

77. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. (2020) 17:352–2. doi: 10.1038/s41592-020-0772-5

78. Seabold S and Perktold J. Statsmodels: Econometric and Statistical Modeling with Python. Austin, Texas: SciPy (2010) p. 92–6. doi: 10.25080/Majora-92bf1922-011

79. Chen T and Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, San Francisco California USA (2016). p. 785–94. doi: 10.1145/2939672.2939785

80. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. (2001) 29:1189–232. doi: 10.1214/aos/1013203451

81. Chen T, He T, Benesty M, and Tang Y. Understand your dataset with XGBoost. Available online at: https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html (Accessed March 11, 2024).

82. Zou M, Jiang WG, Qin QH, Liu YC, and Li ML. Optimized XGBoost model with small dataset for predicting relative density of Ti-6Al-4V parts manufactured by selective laser melting. Materials. (2022) 15:5298. doi: 10.3390/ma15155298

83. Xu P, Ji X, Li M, and Lu W. Small data machine learning in materials science. NPJ Comput Mater. (2023) 9:42. doi: 10.1038/s41524-023-01000-z

84. Chawla NV, Bowyer KW, Hall LO, and Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. (2002) 16:321–57. doi: 10.1613/jair.953

85. Diz J, Marreiros G, and Freitas A. Applying data mining techniques to improve breast cancer diagnosis. J Med Syst. (2016) 40:1–7. doi: 10.1007/s10916-016-0561-y

86. Ramezankhani A, Pournik O, Shahrabi J, Azizi F, Hadaegh F, and Khalili D. The impact of oversampling with SMOTE on the performance of 3 classifiers in prediction of type 2 diabetes. Med Decis Making. (2016) 36:137–44. doi: 10.1177/0272989X14560647

87. Abdoh SF, Abo Rizka M, and Maghraby FA. Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques. IEEE Access. (2018) 6:59475–85. doi: 10.1109/ACCESS.2018.2874063

88. Fotouhi S, Asadi S, and Kattan MW. A comprehensive data level analysis for cancer diagnosis on imbalanced data. J Biomed Inform. (2019) 90:103089. doi: 10.1016/j.jbi.2018.12.003

89. Polat K. A hybrid approach to Parkinson disease classification using speech signal: the combination of SMOTE and random forests. In: 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT). IEEE, Istanbul, Turkey (2019). p. 1–3. Available at: https://ieeexplore.ieee.org/document/8741725/ (Accessed July 21, 2024).

90. Shuja M, Mittal S, and Zaman M. Effective prediction of type II diabetes mellitus using data mining classifiers and SMOTE. In: Sharma H, Govindan K, Poonia RC, Kumar S, and El-Medany WM, editors. Advances in Computing and Intelligent Systems. Springer Singapore, Singapore (2020). p. 195–211. Algorithms for Intelligent Systems. doi: 10.1007/978-981-15-0222-4_17

91. Abdu-Aljabar RD and Awad OA. A comparative analysis study of lung cancer detection and relapse prediction using XGBoost classifier Vol. p. Bristol, UK: IOP Publishing (2021). p. 012048.

92. Shen J, Wu J, Xu M, Gan D, An B, and Liu F. A hybrid method to predict postoperative survival of lung cancer using improved SMOTE and adaptive SVM. In: Cesarelli M, editor. Computational and Mathematical Methods in Medicine, vol. 2021 (2021) USA: Wiley Online Library. p. 1–15.

93. Chang CC, Li YZ, Wu HC, and Tseng MH. Melanoma detection using XGB classifier combined with feature extraction and K-means SMOTE techniques. Diagnostics. (2022) 12:1747. doi: 10.3390/diagnostics12071747

94. Karamti H, Alharthi R, Anizi AA, Alhebshi RM, Eshmawi AA, Alsubai S, et al. Improving prediction of cervical cancer using KNN imputed SMOTE features and multi-model ensemble learning approach. Cancers. (2023) 15:4412. doi: 10.3390/cancers15174412

95. Bhavani CH and Govardhan A. Cervical cancer prediction using stacked ensemble algorithm with SMOTE and RFERF. Mater Today: Proc. (2023) 80:3451–7. doi: 10.1016/j.matpr.2021.07.269

96. Srinivasan S, Ramadass P, Mathivanan SK, Panneer Selvam K, Shivahare BD, and Shah MA. Detection of Parkinson disease using multiclass machine learning approach. Sci Rep. (2024) 14:13813. doi: 10.1038/s41598-024-64004-9

97. Herawati BC, Hairani H, and Guterres JX. SMOTE variants and random forest method: A comprehensive approach to breast cancer classification. IJEC. (2024) 3:12–23. doi: 10.58291/ijec.v3i1.147

98. Martinez-Cantin R. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. J Machine Learning Res. (2014) 15:3915–9. Available online at: https://arxiv.org/abs/1405.7430.

99. Lundberg SM and Lee SI. A unified approach to interpreting model predictions. Advances in neural information processing systems, Vol. 30. San Diego, USA: Neural Information Processing Systems Foundation, Inc. (2017).

100. Lundberg S. An introduction to explainable AI with Shapley values (2018). Available online at: https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html (Accessed April 23, 2024).

101. Wang H, Liang Q, Hancock JT, and Khoshgoftaar TM. Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods. J Big Data. (2024) 11:44. doi: 10.1186/s40537-024-00905-w

102. Shapley LS. Notes on the N-Person Game — II: The Value of an N-Person Game. Santa Monica, USA: Rand Corporation (1951). Available at: https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM670.pdf (Accessed May 7, 2023).

103. Sacks H, Schegloff EA, and Jefferson G. A Simplest Systematics for the Organization of Turn Taking for Conversation. (This chapter is a variant version of “A Simplest Systematics for the Organization of Turn-Taking for Conversation,” which was printed in Language, 50, 4.) Available online at: https://linkinghub.elsevier.com/retrieve/pii/B9780126235500500082 (Accessed November 12, 2024).

104. Ye Z, Lu L, Huang L, Lin L, and Liang X. Towards quantifiable dialogue coherence evaluation. arXiv preprint arXiv:2106.00507. (2021). Available online at: http://arxiv.org/abs/2106.00507.

105. Austin JL. How to do things with words: the William James lectures delivered at Harvard University in 1955. Urmson JO, editor. London: Oxford Univ. Press (1971). 166 p. Oxford paperbacks.

106. Searle JR. Speech acts: an essay in the philosophy of language. Cambridge: Cambridge Univ. Press (1970). 203 p.

107. Bilmes J. Discourse and Behavior. New York, NY: Springer (1986). 1 p.

108. Kasper G. Speech acts in interaction: Towards discursive pragmatics. In: Bardovi-Harlig K, Félix-Brasdefer C, and Omar A, editors. Pragmatics & language learning. National Foreign Language Resource Center, Honolulu, HI.

109. González-Lloret M. Conversation analysis and speech act performance. In: Martínez-Flor A and Usó-Juan E, editors. Speech Act Performance: Theoretical, empirical and methodological issues. John Benjamins Publishing Company, Amsterdam (2010). p. 57–74. Available at: https://benjamins.com/catalog/lllt.26.04gon (Accessed November 27, 2024).

110. Rude S, Gortner EM, and Pennebaker J. Language use of depressed and depression-vulnerable college students. Cogn Emotion. (2004) 18:1121–33. doi: 10.1080/02699930441000030

111. Heitz U, Cherbuin J, Menghini-Müller S, Egloff L, Ittig S, Beck K, et al. Comorbidities in patients with an at-risk mental state and first episode psychosis. Eur Psychiatr. (2017) 41:S198–8. doi: 10.1016/j.eurpsy.2017.01.2142

112. Kosmala L. Exploring the status of filled pauses as pragmatic markers: The role of gaze and gesture. P&C. (2022) 29:272–96. doi: 10.1075/pc.21020.kos

113. Clark H. Using uh and um in spontaneous speaking. Cognition. (2002) 84:73–111. doi: 10.1016/S0010-0277(02)00017-3

114. Levelt W. Monitoring and self-repair in speech. Cognition. (1983) 14:41–104. doi: 10.1016/0010-0277(83)90026-4

115. Matsumoto K, Kircher TTJ, Stokes PRA, Brammer MJ, Liddle PF, and McGuire PK. Frequency and neural correlates of pauses in patients with formal thought disorder. Front Psychiatry. (2013) 4:127. doi: 10.3389/fpsyt.2013.00127

116. DeVault D, Georgila K, Artstein R, Morbini F, Traum D, Scherer S, et al. Verbal indicators of psychological distress in interactive dialogue with a virtual human. In: Proceedings of the SIGDIAL 2013 Conference. Metz, France (2013). p. 193–202.

117. Costa JC and Silva LFLE. Parts of speech and filled pauses in schizophrenia. Alfa Rev linguíst (São José Rio Preto). (2023) 67:e16993. doi: 10.1590/1981-5794-e16993t

118. Andreasen NC and Grove WM. Thought, language, and communication in schizophrenia: diagnosis and prognosis. Schizophr Bull. (1986) 12:348–59. doi: 10.1093/schbul/12.3.348

119. Zamperoni G, Tan EJ, Rossell SL, Meyer D, and Sumner PJ. Evidence for the factor structure of formal thought disorder: A systematic review. Schizophr Res. (2024) 264:424–34. doi: 10.1016/j.schres.2024.01.006

120. Cheng Q, Zhu Y, Song J, Zeng H, Wang S, Sun K, et al. Bert-based latent semantic analysis (Bert-LSA): A case study on geospatial data technology and application trend analysis. Appl Sci. (2021) 11:11897. doi: 10.3390/app112411897

Keywords: UHR patients, spoken language, natural language processing, XGBoost, SMOTE, SHAP values

Citation: Kim-Dufor D-H, Walter M, Krebs M-O, Haralambous Y, Lenca P and Lemey C (2025) Deeper insight into speech characteristics of patients at ultra-high risk using classification and explainability models. Front. Psychiatry 16:1595197. doi: 10.3389/fpsyt.2025.1595197

Received: 17 March 2025; Accepted: 23 May 2025;
Published: 16 June 2025.

Edited by:

Saturnino Luz, University of Edinburgh, United Kingdom

Reviewed by:

Sofia De La Fuente Garcia, University of Edinburgh, United Kingdom
Bahman Mirheidari, The University of Sheffield, United Kingdom

Copyright © 2025 Kim-Dufor, Walter, Krebs, Haralambous, Lenca and Lemey. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Deok-Hee Kim-Dufor, dh.kimdufor@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.