ORIGINAL RESEARCH article

Front. Artif. Intell., 02 March 2026

Sec. AI in Finance

Volume 9 - 2026 | https://doi.org/10.3389/frai.2026.1752103

FinTextSim: a domain-specific sentence-transformer for extracting predictive latent topics from financial disclosures

  • 1. DEGIN Doctoral Program, Department of Industrial Management, Escuela Técnica Superior de Ingenieros Industriales, Universidad Politécnica de Madrid, Madrid, Spain

  • 2. Beta Klinik GmbH, Bonn, Germany

  • 3. Fakultät für Wirtschaft, Hochschule Heilbronn, Heilbronn, Germany

  • 4. Department of Mechanical Engineering, Universidad de La Rioja, Logroño, Spain

Abstract

Recent advancements in information availability and computational capabilities have transformed the analysis of annual reports, integrating traditional financial metrics with insights from textual data. To extract actionable insights from this wealth of textual data, automated review processes, such as topic modeling, are essential. This study benchmarks classical approaches against contemporary neural techniques and introduces FinTextSim, a sentence-transformer finetuned for financial text. Using Item 7 and Item 7A of 10-K filings from S&P 500 companies (2016–2023), we systematically evaluate these models qualitatively and quantitatively. BERTopic in combination with FinTextSim consistently outperforms all alternatives, producing notably clearer, more coherent and financially relevant topic clusters. Compared to the most widely used standard embedding models and financial baselines, FinTextSim improves intratopic similarity by up to 71% and reduces intertopic similarity by more than 108%, highlighting the importance of domain-specific embeddings. Crucially, these qualitative gains translate into quantitative predictive benefits: incorporating FinTextSim-derived topic features into a logistic regression framework for corporate performance prediction leads to a statistically significant two-percentage-point increase in both ROC-AUC and F1-score over a purely financial baseline. In contrast, off-the-shelf sentence-transformers and classical topic models introduce noise that degrades predictive performance. For non-linear classifiers, several textual representations yield modest gains, reflecting their greater capacity to absorb noisier features. However, FinTextSim remains the most stable and consistently strong performer across both linear and non-linear settings. 
Overall, FinTextSim acts as a domain-adapted information filter, translating unstructured financial text into structured, semantically rich representations that human analysts and generic models often overlook. By bridging interpretability and predictive utility, it enables the extraction of economically relevant information from corporate narratives and supports more effective decision-making, resource allocation, and corporate performance forecasting.

1 Introduction

In recent years, the increasing availability of information (Chen and Ji, 2025; Sun et al., 2026) and advances in computational capabilities have transformed the analysis of annual reports, including 10-K filings. These filings are among the most critical disclosures (Griffin, 2003; Hajek and Munk, 2024), providing a standardized snapshot of a company's financial situation through both numerical and textual data (Masson and Paroubek, 2020). Traditional evaluations of 10-K filings have focused on retrospective quantitative financial metrics, while textual data remains underexplored (Hida and Do Nascimento, 2026). However, growing evidence shows that qualitative textual components also carry predictive power for future performance (Cohen et al., 2020; Ashtiani and Raahemi, 2023; Nazareth and Reddy, 2023; Zhu, 2026; Wang et al., 2023; Frankel et al., 2022; Siano, 2025). While these studies demonstrate the predictive potential of textual disclosures, they largely adopt end-to-end predictive frameworks and provide limited insight into how alternative textual representations, particularly topic-based representations, differ in their ability to extract economically meaningful information. Thus, integrating these textual insights with financial metrics provides a more comprehensive basis for decision-making, benefiting investors, analysts, and regulators (Hsieh and Hristova, 2022; Ueda et al., 2024).

Within 10-K filings, Item 7 and Item 7A are particularly valuable. Item 7, the Management Discussion & Analysis (MD&A), presents management's perspective on various aspects, including operations, performance, risks, opportunities, and strategies to address future challenges (Cohen et al., 2020). Item 7A provides qualitative and quantitative disclosures about market risk. As 10-K filings are mandatory for publicly traded companies, they represent a rich source of financial text that requires systematic and scalable analysis. Manual review, however, is both time-consuming and prone to subjectivity bias (Hagen, 2018; Huang et al., 2025). The growing volume of available information (Rashid et al., 2019; Wang Y. et al., 2024) further increases the risk of information overload (Lu, 2022), making it essential to allocate resources efficiently (Liu, 2022; Pufahl et al., 2025). Automated approaches, such as topic modeling, address these challenges by uncovering latent topics and summarizing large text corpora (Blei et al., 2003; Song et al., 2025; Curiskis et al., 2020). A key advantage of topic modeling is its unsupervised nature. While supervised approaches often require extensive annotated datasets, which are infeasible in most real-world settings, unsupervised methods scale more efficiently (Taha, 2023).

Classical topic modeling approaches, including Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), rely on the bag-of-words (BoW) assumption. Under this assumption, each document is treated as an unordered collection of words, disregarding their sequential order. This limits the models' ability to capture the semantic meaning of text. Neural topic modeling approaches address this issue by employing contextual embeddings (Blair et al., 2020), which capture semantic and contextual relationships between texts (Booker et al., 2024; Bhattacharya and Mickovic, 2024). Sentence-transformers further improve efficiency and semantic similarity comparisons (Reimers and Gurevych, 2019). These text representations are crucial, as they must faithfully reflect a document's content while distinguishing it from others (Sun et al., 2026), enabling advanced applications such as BERTopic (Grootendorst, 2022). Despite the widespread use of topic modeling and contextual embeddings in general Natural Language Processing (NLP), little is known about their effectiveness in financial applications, where specialized terminology and domain-specific context are critical (Bhattacharya and Mickovic, 2024).

To address this gap, we develop and evaluate FinTextSim, a sentence-transformer finetuned specifically for financial text. General-purpose models, such as all-MiniLM-L6-v2 (AM) and all-mpnet-base-v2 (MPNET), have become standard baselines due to their strong performance across a wide range of domains. Yet, they are not optimized for the semantic and contextual nuances of financial language. Furthermore, existing models tailored for the financial domain are primarily optimized for sentiment analysis (e.g., Araci, 2019; Li et al., 2023; Guo et al., 2024). As a result, their suitability for topic modeling and semantic clustering in financial text remains an open empirical question. In contrast, FinTextSim is explicitly designed to capture domain-specific semantic structure. Functioning as a domain-adapted information filter, FinTextSim mitigates a fundamental information processing and retrieval bottleneck in financial text analysis. By distilling unstructured narratives into structured, semantically rich representations that emphasize economically meaningful relations, it extracts signals often overlooked by both human analysts and generic models. Beyond model development, we systematically evaluate multiple topic modeling algorithms, comparing classical approaches with contemporary neural techniques. This dual benchmarking across embedding models and topic modeling paradigms provides the first comprehensive evaluation of topic modeling for financial text. Moreover, we demonstrate the practical relevance of FinTextSim-enhanced BERTopic, which generates higher-quality and financially relevant insights with direct implications for research, business valuation, and stock price prediction.

Extending this analysis, we integrate the outputs of topic models into a machine learning (ML) framework to assess their informational value for corporate performance prediction. Corporate performance prediction is a central objective in accounting and financial research, as accurate forecasts are closely linked to future excess investment returns (Veganzones and Severin, 2025; Cao and You, 2024; Easton et al., 2024; Uddin et al., 2022; Chen et al., 2022). Although several studies emphasize the potential of NLP and topic modeling to enhance corporate performance prediction (Peng, 2025; Hajek and Munk, 2024; Theodorakopoulos et al., 2025; Lee and Anderl, 2025), systematic evidence on how alternative textual representations, particularly topic-based representations, contribute incremental value when combined with quantitative financial indicators remains limited. To address this second gap, our approach combines topic-document distributions derived from topic models with fundamental financial indicators, allowing ML models to exploit both quantitative and qualitative information. This design enables us to assess which topic modeling approach most effectively quantifies qualitative textual information to improve corporate performance prediction, and to evaluate the robustness of these textual representations across both linear and non-linear predictive frameworks.

We will explore the following research questions based on Item 7 and Item 7A from S&P 500 companies between 2016 and 2023:

  • RQ1 How can we leverage contextual embeddings for the financial domain?

  • RQ2 Which topic modeling approach provides the highest-quality and most coherent topics?

  • RQ3 Which topic modeling approach is most effective at organizing and summarizing our large-scale financial text dataset?

  • RQ4 Does topic modeling improve corporate performance prediction?

The remainder of the paper is organized as follows. Section 2 reviews the state-of-the-art literature and methodologies. Section 3 describes our study's materials and methods, including the training procedure of FinTextSim. Section 4 presents and discusses the main findings. Finally, Section 5 provides the conclusion. This structure ensures a clear and logical progression, enabling a thorough understanding of our study's contributions.

2 State of the art

The following subsections provide an overview of topic modeling approaches and corporate performance prediction, laying the foundation for understanding the algorithms and methodologies applied in this study.

2.1 Classical topic modeling approaches

Among classical topic modeling approaches, we highlight LDA and NMF. Both operate under the BoW assumption, treating each document as a mixture of underlying topics and each topic as a mixture of words. Accordingly, they assign prevalence of terms to topics (β) and topics to documents (γ) (Blei et al., 2003). To ensure robust performance, several preprocessing steps are typically applied, including tokenization, stopword removal and lemmatization or stemming of words (Bellstam et al., 2021; Fu et al., 2021; Albalawi et al., 2020).

2.1.1 Latent Dirichlet allocation

LDA is the most widely applied topic modeling approach in the literature. It is a three-level parametric hierarchical Bayesian model. By defining a hypothetical generative process for documents, LDA works backwards to infer the topics that could have generated the documents (Abdelrazek et al., 2023). The model is governed by three key hyperparameters (Blei et al., 2003): the number of topics (k), the concentration parameter of the Dirichlet prior of the document-topic distribution (α), and the parameter controlling the distribution of words across topics (η) (Fernandes et al., 2020). These hyperparameters significantly influence the quality and stability of the generated topics. Yet, their selection remains challenging due to the inherent complexity of textual data (Maier et al., 2018; Agrawal et al., 2018).
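In practice, the three hyperparameters map directly onto standard library arguments. The following minimal sketch uses scikit-learn's `LatentDirichletAllocation` as a stand-in implementation (the toy matrix and prior values are illustrative, not those used in this study):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (4 documents x 6 terms).
A = np.array([
    [5, 3, 0, 0, 1, 0],
    [4, 4, 1, 0, 0, 0],
    [0, 0, 4, 5, 0, 1],
    [0, 1, 5, 4, 0, 0],
])

k = 2        # number of topics (k)
alpha = 0.1  # Dirichlet prior on the document-topic distribution (alpha)
eta = 0.01   # Dirichlet prior on the topic-word distribution (eta)

lda = LatentDirichletAllocation(
    n_components=k,
    doc_topic_prior=alpha,
    topic_word_prior=eta,
    random_state=0,
)
gamma = lda.fit_transform(A)   # document-topic distribution (gamma)
beta = lda.components_         # topic-word weights (beta, unnormalized)
```

Each row of `gamma` is a probability distribution over the k topics for one document, which is the representation later fed into downstream models.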

Despite its popularity, LDA faces several limitations. LDA is sensitive to the order of the training data. As a result, topic structures can vary when the training data is shuffled, introducing systematic errors into studies (Agrawal et al., 2018). Furthermore, overlapping topics can occur, as LDA extracts topics from word distributions independently (Campbell et al., 2015).

LDA has been used in various fields. Bao and Datta (2014) pioneered the integration of unsupervised learning methods into Management Accounting and Finance using LDA to analyze risk disclosures from 10-K reports. Dyer et al. (2017) examined topics contributing to the lengthening of 10-K reports over time, while Brown et al. (2020) identified topics predicting financial misreporting. Deveikyte et al. (2022) employed LDA to predict market volatility. In additional financial studies, LDA has been used to quantify the economic content of communications, identify central subjects, or estimate innovation capabilities, among other applications (Jegadeesh and Wu, 2017; Lowry et al., 2020; Bellstam et al., 2021; García-Méndez et al., 2023).

2.1.2 Non-negative matrix factorization

NMF takes a decompositional, non-probabilistic approach to topic modeling, factorizing the input document-term matrix A into the product of a term-topic matrix W and a topic-document matrix H (Lee and Seung, 1999). By evaluating the discrepancy between A and W×H using the squared Frobenius norm, the topic modeling problem is framed as an optimization task restricted to non-negative entries (Wang and Zhang, 2023). Unlike LDA, NMF does not rely on Bayesian priors, although the number of topics still needs to be specified by the user.
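The factorization can be sketched in a few lines of NumPy using the multiplicative update rules of Lee and Seung (1999) for the squared Frobenius objective. This is a minimal illustration, not the exact solver used in this study:

```python
import numpy as np

def nmf(A, k, n_iter=200, eps=1e-9, seed=0):
    """Factorize non-negative A (m x n) into W (m x k) @ H (k x n) by
    minimizing ||A - WH||_F^2 with the Lee-Seung multiplicative updates,
    which preserve non-negativity given a non-negative initialization."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update topic-document factors
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update term-topic factors
    return W, H

# Toy 4x4 matrix with an approximate two-block structure.
A = np.array([[5., 3, 0, 0],
              [4, 4, 0, 1],
              [0, 0, 4, 5],
              [1, 0, 5, 4]])
W, H = nmf(A, k=2)
err = np.linalg.norm(A - W @ H)   # reconstruction error
```

With k matching the block structure, the reconstruction error drops well below the norm of A itself, illustrating how the low-rank factors recover the latent topic structure.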

While NMF offers advantages in simplicity and computational efficiency (Egger and Yu, 2022), it also faces several challenges. Compared to LDA, it lacks a solid statistical foundation and a defined generative model. Additionally, NMF relies on anchor words to enforce a block diagonal structure in the term-topic matrix W, ensuring consistent solutions (Donoho and Stodden, 2003; Gillis and Vavasis, 2014). This assumption posits that each topic is associated with a unique anchor word, absent in other topics (Gillis and Vavasis, 2014). Given the multifaceted nature of words, this assumption can be considered fragile (Wang and Zhang, 2023). Another assumption of NMF is that each topic has at least one “pure document,” a document discussing only that specific topic (Gillis and Vavasis, 2014). This assumption is particularly fragile for longer documents.

NMF has applications in various fields and domains. In finance, Chen et al. (2017) used NMF and other topic modeling methods on 10-K and 8-K filings of bank holding companies to distinguish failed from non-failed banks. Additionally, Cai et al. (2022) applied NMF to assess the impact of risk factor disclosures on bond pricing. In other fields, NMF has been primarily employed for short-text topic modeling (Chen et al., 2019; Albalawi et al., 2020; Egger and Yu, 2022).

2.1.3 Wrapup of classical topic modeling approaches

Classical topic modeling approaches offer both advantages and disadvantages. A main advantage is the easier interpretation of hyperparameters, aiding in troubleshooting and model interpretation. However, disadvantages become increasingly pronounced with more complex corpora (Abdelrazek et al., 2023). Classical models are particularly susceptible to the following issues:

  • Loss of semantic and contextual information due to the BoW assumption.

  • Sensitivity to hyperparameter choices and, in the case of LDA, to the order of the training data, reducing stability and reproducibility.

  • Fragile structural assumptions, such as NMF's anchor-word and “pure document” requirements, and overlapping topics in LDA.

2.2 Contemporary topic modeling approaches

Modern methodologies address the issues of classical topic modeling approaches by utilizing advanced text embedding techniques (Blair et al., 2020). The following subsections provide an overview of the evolution of contemporary techniques and a detailed examination of BERTopic, a state-of-the-art topic modeling approach.

2.2.1 Evolution of contemporary topic modeling approaches

The integration of contextual embeddings has transformed topic modeling by moving beyond the BoW assumption, enabling models to better capture semantic relationships within text (Blair et al., 2020). These advances are rooted in key developments in NLP. The transformer architecture revolutionized the field by relying entirely on attention mechanisms, allowing models to capture long-range dependencies and contextual information (Vaswani et al., 2017). Encoder-only models such as BERT (Devlin et al., 2019) further advanced deep contextualized language modeling, while subsequent improvements (Warner et al., 2024) increased efficiency and performance on classification and retrieval tasks. Despite their strengths, encoder-only models are not designed for large-scale semantic similarity tasks. Sentence-transformers addressed this limitation by refining encoder-only models with siamese or triplet architectures, enabling efficient and precise similarity assessments (Reimers and Gurevych, 2019). They produce embeddings that reflect semantic similarity, providing a powerful foundation for neural topic models. Building on these advances, modern topic modeling approaches combine contextual embeddings with clustering techniques. For instance, centroid-based methods group embeddings into clusters and interpret words closest to the centroid as representative of the topic (Sia et al., 2020; Angelov, 2020). While computationally efficient, these methods rest on a fragile assumption, since real-world clusters do not always follow spherical distributions, leading to potential misrepresentation of topics (Grootendorst, 2022). A promising approach for topic modeling based on contextual embeddings, addressing centroid-based clustering issues, is BERTopic (Grootendorst, 2022).

2.2.2 BERTopic

BERTopic structures topic modeling into five sequential steps. First, document embeddings are generated using a pre-trained sentence-transformer, leveraging the benefits of advancements in modern language models (Grootendorst, 2022; Gu et al., 2024). Second, dimensionality reduction is applied to improve computational efficiency and clustering accuracy (Allaoui et al., 2020). Third, the reduced embeddings are clustered into semantically similar groups, i.e., topics. Fourth, documents are tokenized. Finally, token importance within topics is determined using class-based tf-idf (c-tf-idf), which weighs tokens by their importance within each topic and enables a more efficient extraction of topic representations.
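The final weighting step can be illustrated with the c-tf-idf formulation from Grootendorst (2022), in which all documents of a topic are concatenated into one "class." The per-class normalization of term frequencies below follows the common implementation and is an assumption of this sketch:

```python
import numpy as np

def c_tf_idf(class_term_counts):
    """Class-based tf-idf: within-class term frequency weighted by
    log(1 + A / f_t), where A is the average number of words per class
    and f_t the term's total frequency across all classes
    (Grootendorst, 2022)."""
    ctc = np.asarray(class_term_counts, dtype=float)
    tf = ctc / ctc.sum(axis=1, keepdims=True)   # within-class term frequency
    A = ctc.sum() / ctc.shape[0]                # average words per class
    f_t = ctc.sum(axis=0)                       # term frequency over classes
    return tf * np.log(1 + A / f_t)
```

Terms that are frequent within one topic but rare across the others receive the highest weights, which is exactly what makes them good topic representatives.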

Despite its advantages, BERTopic also faces challenges. It tends to produce a multitude of closely interconnected topics, which may vary upon repeated modeling attempts (Egger and Yu, 2022). This variability contributes to inconsistency in producing meaningful results, further complicated by the complexity of interpreting hyperparameters, hindering troubleshooting and diminishing the reliability of results (Abdelrazek et al., 2023). Moreover, BERTopic assumes that each document relates to a single topic, potentially oversimplifying real-world complexity (Grootendorst, 2022). Additionally, sentence-transformer models used for the document embedding step perform optimally with sentences or paragraphs, while longer documents are truncated (Reimers and Gurevych, 2019). Furthermore, high computation times can result from processing large amounts of data (Grootendorst, 2022).

Due to its novelty, applications of BERTopic are still in their infancy. In a financial context, Kim et al. (2022) utilized BERTopic on Item 1A from 10-K filings. They examined whether identified topics can enhance the accuracy of ESG rating predictions and quantify each topic's relative contribution to the final rating prediction. In other contexts, BERTopic has been applied in various studies: Sánchez-Franco and Rey-Moreno (2022) analyzed customer reviews, Abuzayed and Al-Khalifa (2021) explored its application with pre-trained Arabic language models, Egger and Yu (2022) evaluated its performance on Twitter data, and Grigore and Pintilie (2023) extended BERTopic to predict individual's responses to a questionnaire based on their social media activity.

2.3 Topic modeling of Item 7 and Item 7A

Our research is driven by several motivations regarding the choice of documents and analysis techniques. Item 7 and Item 7A stand out as particularly crucial sections in 10-K reports (Bhattacharya and Mickovic, 2024). The MD&A section (Item 7) provides a narrative that contextualizes the presented numbers. In this section, management offers its individual perspective, which is essential for understanding the company's strategic direction and potential challenges. Additionally, the MD&A section offers the most leeway and flexibility, making it rich with insights and indicative of future performance (Cohen et al., 2020). Item 7A focuses on market risks, containing valuable information regarding the company's prospective performance. Analyzing these sections enables the extraction of textual information relevant to predicting future firm performance. While Items 7 and 7A are our primary focus, we also analyze Items 1 and 1A, which are widely recognized for their economic relevance (Jamshed et al., 2025; Kim et al., 2022). This allows us to test FinTextSim's generalizability, with results for Items 1 and 1A reported in the Supplementary material. Whereas most prior work focuses on social media data (e.g., Song et al., 2025; Zheng et al., 2025; Ji and Han, 2022; Deveikyte et al., 2022), we extract and structure firm- and management-specific information from 10-K reports. To operationalize this analysis, we rely on topic modeling (Ranta et al., 2022; Abdelrazek et al., 2023).

Despite methodological advances, applications of topic modeling in finance remain scarce. LDA still dominates applied topic modeling, although more powerful approaches such as BERTopic are available (Egger and Yu, 2021; Blair et al., 2020). To bridge this gap, we benchmark classical models alongside contemporary ones, focusing on BERTopic. We demonstrate that FinTextSim, a finetuned sentence-transformer, substantially enhances BERTopic's ability to produce precise and coherent financial topics. Beyond improving research quality, FinTextSim contributes to the democratization of knowledge-intensive, expert-driven tasks (Zhang et al., 2026; García-Méndez et al., 2024), enabling more efficient and effective interpretation of disclosures for both analysts and non-experts. It also lays the foundation for aspect-based managerial sentiment analysis, further improving predictive models in valuation and stock price forecasting (García-Méndez et al., 2023; Ueda et al., 2024).

2.4 Corporate performance prediction

Forecasting corporate performance is a central objective in accounting and finance research due to its proven relationship with excess investment returns and capital market efficiency (Ou and Penman, 1989; Cao and You, 2024; Veganzones and Severin, 2025). Traditional approaches relied on statistical, regression-based models (Ou and Penman, 1989). More recently, ML techniques have gained prominence for their ability to learn complex patterns from large-scale financial data. These models uncover economically meaningful relationships between historical financial variables and future performance, generating significant abnormal returns when used for portfolio formation (Hunt et al., 2019; Uddin et al., 2022; Chen et al., 2022). Collectively, these studies highlight the growing potential of ML-based approaches to extract predictive insights that surpass those of human analysts or traditional benchmarks (Campbell et al., 2024; Van Binsbergen et al., 2023; Aoki et al., 2025).

Despite these advances, notable limitations remain. Most existing applications rely predominantly on structured numerical data. While ML models based on financial indicators can correct analyst biases and uncover hidden dependencies (Campbell et al., 2024; Van Binsbergen et al., 2023), they fail to capture forward-looking managerial information that is explicitly communicated through narrative sections such as the MD&A (Aoki et al., 2025). Recent studies have begun to incorporate textual disclosures using ML models to predict corporate performance (e.g., Frankel et al., 2022; Siano, 2025). However, these approaches largely adopt end-to-end predictive frameworks and do not systematically compare alternative textual representations. Although prior work highlights the potential of NLP to improve corporate performance predictions (Peng, 2025; Xinyue et al., 2020; Jun et al., 2022; Theodorakopoulos et al., 2025), evidence on which types of textual representations, particularly topic-based representations, provide incremental value beyond standard financial indicators remains limited.

Our study addresses this gap by integrating financial indicators with topic modeling outputs to assess the incremental informational value of textual representations for corporate performance prediction. Specifically, we integrate topic-document distributions derived from Item 7 and Item 7A of 10-K filings with fundamental financial indicators in an ML framework to predict firms' Return on Assets (ROA). We demonstrate that topic representations derived from BERTopic in combination with FinTextSim yield the most consistent predictive improvements when integrated with financial indicators, particularly in linear models. While several textual representations provide modest gains in more flexible non-linear models, FinTextSim is the only approach that improves performance reliably across both linear and non-linear settings. This finding suggests that domain-specific language models can effectively quantify qualitative disclosures, boosting both interpretability and reliability in corporate performance forecasting.

3 Materials and methods

In the following subsections, we outline the materials and methods of our study. This section is divided into several parts: sourcing the dataset, creating an enhanced financial keyword list, training FinTextSim, creating the topic models, presenting the metrics used to evaluate the performance of the topic models, and describing the downstream task of predicting corporate performance.

3.1 Dataset

Our study focuses exclusively on Item 7 and Item 7A of 10-K reports while avoiding survivorship bias. Given their greater significance, we deliberately choose 10-K over 10-Q reports (Griffin, 2003). We source our data from the Notre Dame Software Repository for Accounting and Finance in text-file format, which underwent a “Stage One Parse” to remove all HTML tags.1

To avoid survivorship bias, we include 10-K filings of all companies that have been listed in the S&P 500 index between 2016 and 2023. Using a regular expression-based extractor, we isolate the text from the start of Item 7 to the start of Item 8. We refer to this combination of Item 7 and Item 7A as “documents.” To ensure comparability, documents containing fewer than 250 words are discarded.2 Additional outlier documents are removed using z-scores, excluding documents more than two standard deviations from the mean length. Text preprocessing methods are applied to improve model performance and comparability across methods (Siino et al., 2024), including replacing contractions as well as removing URLs and numerical characters.
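A minimal sketch of the extraction and outlier-removal steps described above (the regular expressions and helper names below are illustrative, not the exact extractor used in this study):

```python
import re

import numpy as np

def extract_item7(filing_text):
    """Slice a 10-K from the start of Item 7 to the start of Item 8.
    'Item 7A' is not matched by the Item-7 pattern, because no word
    boundary separates the digit from the trailing 'A', so the slice
    keeps both Item 7 and Item 7A."""
    start = re.search(r"\bitem\s*7\b", filing_text, flags=re.I)
    end = re.search(r"\bitem\s*8\b", filing_text[start.end():], flags=re.I)
    return filing_text[start.start():start.end() + end.start()]

def drop_length_outliers(docs, z_max=2.0):
    """Remove documents whose word count lies more than z_max standard
    deviations from the mean document length."""
    lengths = np.array([len(d.split()) for d in docs])
    z = (lengths - lengths.mean()) / lengths.std()
    return [d for d, zi in zip(docs, z) if abs(zi) <= z_max]

filing = ("Item 6 Selected data. Item 7 Management's discussion. "
          "Item 7A Market risk. Item 8 Financial statements.")
segment = extract_item7(filing)   # keeps Item 7 and Item 7A only
```

Real filings require more defensive patterns (tables of contents, repeated headings), which is why the study's extractor is more involved than this sketch.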

Table 1 summarizes the number of documents at each preprocessing step.

Table 1. Dataset.

Preprocessing step                 # Documents
-------------------------------    -----------
Extracted documents                      4,754
Outlier documents                          629
Remaining documents in database          4,125
Number of sentences                  2,178,712

As BERTopic assumes single-topic documents and sentence-transformers and NMF perform best on short inputs (Grootendorst, 2022; Reimers and Gurevych, 2019; Chen et al., 2019), we tokenize each of the remaining 4,125 documents into individual sentences. This avoids losing information through truncation and prevents misleading single-topic assumptions for multi-topic MD&A sections. As a result, our dataset contains 2,178,712 sentences.

3.2 Keyword list

To train FinTextSim, we build on an established keyword framework for financial text. The foundation is the economic anchorword list for 10-K and 10-Q reports proposed by Li (2010), which covers eleven domains.3 Subsequent work by Fengler and Phan (2025) expanded this list by identifying semantically related terms with a Word2Vec model trained on MD&A sections of 10-K filings. Building on this evolution, we further refined this list to include common performance indicators and operational terms. Moreover, we broadened it with a dedicated topic on Environmental Sustainability, reflecting the growing importance of ESG-related disclosures (Giudici and Wu, 2025; Xie et al., 2025).4

3.3 FinTextSim

To accurately cluster semantically similar financial text, we introduce FinTextSim. FinTextSim is a sentence-transformer model specifically finetuned to enhance contextual embeddings for the financial domain. Given the financial jargon and its domain-specific nuances, off-the-shelf (OTS), general-purpose sentence-transformers fall short. Existing models tailored for the financial domain are primarily optimized for sentiment analysis (e.g., Araci, 2019; Li et al., 2023; Guo et al., 2024). By finetuning FinTextSim on financial text, we aim to improve the quality of generated topics, enhancing semantic coherence and separation between topics, bridging the gap between general-purpose models and the specific demands of financial text analysis.

We construct a labeled dataset from the corpus described in Section 3.1, using a dictionary-based approach that leverages the keyword list from Section 3.2. To this end, we create a keyword-sentence matrix by iterating over each word in every sentence and matching substrings to keywords. This approach allows recognition of variations such as “logistics” or “logistical” for the keyword “logistic.” Sentences containing two or more keywords from a single topic are labeled accordingly. This procedure ensures topic distinctiveness and provides a reliable ground truth for training, consistent with data-centric perspectives on model quality (Di Gennaro et al., 2024). To prevent overemphasis on repeated phrasings, only unique sentences are retained. Finally, our dataset comprises 113,291 labeled sentences. To avoid data leakage, we train the model using a temporal split. Data from 2016–2021 is used for training while data from 2022–2023 is reserved for testing. Following these steps, we obtain 85,903 training sentences and 27,388 test sentences. To assess the robustness of FinTextSim to reduced lexical cues, we conduct an additional evaluation in which 50% of the label-inducing keywords are randomly masked in the test set. Masking is applied only at evaluation time while the trained model remains unchanged, allowing us to examine whether learned representations generalize beyond explicit keyword presence. Results of this masked evaluation are reported in the Supplementary material.
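The labeling rule can be sketched as follows. The mini keyword list is hypothetical (the real list covers twelve topics with many more keywords), and treating sentences that match several topics as ambiguous is an assumption of this sketch:

```python
# Hypothetical mini version of the keyword list from Section 3.2.
TOPICS = {
    "operations": ["logistic", "supply", "warehouse"],
    "liquidity": ["cash", "liquidity", "borrow"],
}

def label_sentence(sentence, topics=TOPICS, min_hits=2):
    """Label a sentence with a topic if at least min_hits distinct
    keywords of that topic occur as substrings of its words, so that
    'logistics' or 'logistical' match the keyword 'logistic'.
    Sentences matching several topics are treated as ambiguous (None)."""
    words = sentence.lower().split()
    labels = []
    for topic, keywords in topics.items():
        hits = {kw for kw in keywords if any(kw in w for w in words)}
        if len(hits) >= min_hits:
            labels.append(topic)
    return labels[0] if len(labels) == 1 else None
```

Requiring two distinct keywords per topic filters out incidental mentions, which is what makes the resulting labels usable as training ground truth.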

FinTextSim is trained using BatchHardTripletLoss, following methods outlined by Reimers and Gurevych (2019) and Devlin et al. (2019). Unlike standard triplet loss, BatchHardTripletLoss dynamically selects the hardest positive (most dissimilar within the same class) and the hardest negative (most similar from a different class) for each anchor in the batch. This strategy forces the model to learn more discriminative embeddings, leading to faster convergence and improved representation quality (Hermans et al., 2017). As the base model, we select ModernBERT, a recent advancement in encoder-only architectures (Warner et al., 2024). We adapt it with a mean pooling and a normalization layer to enhance its performance for sentence similarity tasks (Reimers and Gurevych, 2019). Finally, we train FinTextSim with a batch size of 200 and a margin of five. Following this contrastive learning-based training approach, we aim to improve latent semantic discovery of financial topics (Luo et al., 2024).
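The batch-hard selection strategy of Hermans et al. (2017) can be illustrated with a small NumPy sketch. The actual training uses the sentence-transformers implementation with a margin of five; this stand-alone version only mirrors the loss computation:

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=5.0):
    """Batch-hard triplet loss: for each anchor, take the hardest
    positive (farthest same-class embedding) and the hardest negative
    (closest other-class embedding), then apply a hinge with margin."""
    labels = np.asarray(labels)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)  # pairwise Euclidean
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        pos_mask = same[i].copy()
        pos_mask[i] = False                     # exclude the anchor itself
        neg_mask = ~same[i]
        if not pos_mask.any() or not neg_mask.any():
            continue                            # no valid triplet for anchor
        hardest_pos = dist[i][pos_mask].max()   # most dissimilar positive
        hardest_neg = dist[i][neg_mask].min()   # most similar negative
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return float(np.mean(losses))
```

When classes are well separated by more than the margin, the loss is zero; when all embeddings collapse, it equals the margin, so minimizing it pushes same-class sentences together and different-class sentences apart.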

We evaluate FinTextSim by comparing its embeddings with those generated by AM, MPNET, and distilroberta-finetuned-financial-news-sentiment-analysis (DR), using intra- and intertopic similarity (see Section 3.5.2). As the most downloaded models for sentence similarity tasks on the Hugging Face Hub, AM and MPNET serve as robust baselines. DR is the most prominent model for financial sentiment analysis and acts as a domain benchmark. To examine embedding structure, we visualize the learned representations using Uniform Manifold Approximation and Projection (UMAP). Compared to dimensionality reduction alternatives, such as t-SNE or PCA, UMAP better preserves both local and global structure (Allaoui et al., 2020; Angelov, 2020). For UMAP, we employ the following essential hyperparameters:

  • Minimum distance: 0, to encourage closely grouped data points, facilitating the formation of clusters representing semantically similar documents.

  • Distance metric: Cosine similarity, standard for NLP similarity tasks.

  • n_neighbors: 125, prioritizing global structures in our data to identify overarching macrotopics as well as hierarchically lower-ranked microtopics (Angelov, 2020).
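Assuming the umap-learn API, these settings correspond to the following constructor arguments (a configuration sketch; all other hyperparameters remain at library defaults):

```python
# Configuration sketch of the UMAP settings above, assuming the
# umap-learn API; all other hyperparameters stay at library defaults.
umap_params = dict(
    min_dist=0.0,       # encourage tightly grouped points
    metric="cosine",    # standard for NLP similarity tasks
    n_neighbors=125,    # prioritize global structure
)
# import umap
# reducer = umap.UMAP(**umap_params)   # requires umap-learn
```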

We share the labeled dataset alongside FinTextSim's training code in the following Github Repository: https://github.com/JehnenS/FinTextSim.

3.4 Model creation

3.4.1 Classical approaches

For the classical topic modeling approaches, we follow widely adopted preprocessing steps: stopword removal, lemmatization and term frequency–inverse document frequency (tf–idf) weighting. We remove stopwords using financial domain-specific lists provided by the Notre Dame Software Repository for Accounting and Finance.5 Next, we lemmatize words to reduce vocabulary size. We deliberately choose lemmatization over stemming, as it better preserves word interpretability (Maier et al., 2018). To capture multi-word expressions, we construct bigrams and trigrams, combining terms that frequently occur together. We then build a dictionary and corpus representation of the texts and apply tf–idf weighting to emphasize informative words. Finally, we employ LDA and NMF with the number of topics fixed at 12, aligning with the number of domains in our keyword list.
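A minimal scikit-learn sketch of this pipeline on a toy corpus (the actual implementation additionally applies the financial stopword lists, lemmatization, and bigram/trigram detection, and fixes the topic count at 12):

```python
# Minimal sketch of the classical pipeline (tf-idf + NMF) on a toy
# corpus; illustrative only, with generic English stopwords in place
# of the financial lists described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "revenue increased due to higher sales volume",
    "sales growth driven by new products",
    "interest expense on long term debt increased",
    "debt refinancing reduced interest payments",
]
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                 # tf-idf weighted corpus
nmf = NMF(n_components=2, random_state=0)
doc_topic = nmf.fit_transform(X)              # document-topic weights
dominant = doc_topic.argmax(axis=1)           # dominant topic per document
```

The `doc_topic` matrix plays the same role as the topic–document distributions used later for the downstream prediction task.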

3.4.2 Contemporary approaches

For the contemporary approaches, we generate contextual embeddings using FinTextSim, AM and MPNET. Each embedding model is applied within BERTopic under identical settings, ensuring that embedding choice is the only factor influencing performance. Dimensionality reduction is performed using UMAP, which preserves both global and local structures (Allaoui et al., 2020; Angelov, 2020) and scales effectively to large datasets (Angelov, 2020). We configure UMAP with the same settings as in Section 3.3. To strike a balance between clustering efficiency and information retention, we reduce the dimensionality to ten components. For clustering, we adopt Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). HDBSCAN accommodates clusters of varying size and shape, models noise as outliers and avoids forcing unrelated documents into topics (McInnes and Healy, 2017). We use the following hyperparameters:

  • Minimum cluster size: 5,000, to prioritize global over highly local topics.

  • Minimum number of samples: 50, to reduce the number of outliers by requiring denser cluster formation.

We then vectorize documents using a CountVectorizer, removing financial stopwords. To extract relevant financial topics, we apply c-tfidf weighting, reduce overly common words and incorporate seed words from our keyword list with a weighting multiplier of 50. This guides the model toward generating finance-specific, domain-relevant topics while limiting generic clusters.
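The class-based tf–idf weighting can be sketched as follows, following Grootendorst (2022); the seed-word boost described above would then multiply the columns of selected keywords by 50:

```python
import numpy as np

def c_tf_idf(tf):
    """Class-based tf-idf (Grootendorst, 2022): weight = tf(t, c) *
    log(1 + A / f(t)), where tf is an (n_classes, n_terms) count matrix,
    A the average word count per class, and f(t) the term's total
    corpus frequency."""
    tf = np.asarray(tf, dtype=float)
    A = tf.sum(axis=1).mean()      # average words per topic class
    f_t = tf.sum(axis=0)           # total frequency of each term
    return tf * np.log(1.0 + A / f_t)

# Toy example: terms 0 and 1 are concentrated in one class each,
# term 2 is shared across both classes.
W = c_tf_idf([[4, 0, 1],
              [0, 3, 1]])
```

Class-distinctive terms (columns 0 and 1) receive higher weights than the shared term (column 2), which is what makes the resulting topic representations discriminative.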

3.5 Topic model evaluation

To compare the performance of the topic models, we focus on two fundamental tasks (Blei et al., 2003; Song et al., 2025):

  • Topic Quality: Ability to uncover interpretable topics in financial texts.

  • Organizing Power: Organizing and structuring documents into distinct, meaningful groups.

The following subsections detail how we operationalize these tasks and how we adapt evaluation to the financial domain.

3.5.1 Topic quality

To assess topic quality, we use NPMI coherence (Rashid et al., 2019; Yadavilli et al., 2024; Tang et al., 2025; Sun et al., 2026). NPMI measures the strength of association between words by comparing observed co-occurrence with expected independence. Following Röder et al. (2015), NPMI coherence is computed with a sliding window. For classical models, we maintain the default window size of ten. Due to the shorter sentence lengths resulting from stopword removal in classical models, we adjust the window size for BERTopic. Based on the ratio between sentence lengths of BERTopic versus classical models, we set the window size for BERTopic to 20, guaranteeing comparable context coverage. Moreover, we lemmatize BERTopic's input texts and topic representations to reduce the impact of divergent vocabulary sizes. For each model, we use the five most representative words per topic, balancing informativeness with interpretability (Agrawal et al., 2018).
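A minimal sketch of sliding-window NPMI with Boolean window counts (a simplification of the scheme in Röder et al., 2015; the actual evaluation uses a standard implementation):

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, texts, window=10):
    """Mean NPMI over all word pairs of a topic, with co-occurrence
    counted via Boolean sliding windows over each tokenized text
    (simplified sketch of Röder et al., 2015)."""
    windows = []
    for text in texts:
        toks = text.split()
        n = max(len(toks) - window + 1, 1)
        windows.extend(set(toks[i:i + window]) for i in range(n))
    N = len(windows)

    def p(*words):
        return sum(all(w in win for w in words) for win in windows) / N

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p12 = p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)   # never co-occur: NPMI lower bound
            continue
        pmi = math.log(p12 / (p(w1) * p(w2)))
        scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

texts = ["revenue sales growth revenue sales",
         "debt interest expense debt"]
```

On this toy corpus with a window of three, the pair (“revenue”, “sales”) co-occurs in exactly the windows where either word appears and scores the NPMI maximum of 1, while (“revenue”, “debt”) never co-occurs and scores −1.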

Raw coherence scores alone do not guarantee financial relevance. To address this, we complement them with topic accuracy, evaluated by human experts. For each topic, ten representative sentences are manually annotated to determine whether the topic assignment is correct. Topic accuracy is then defined as the proportion of correctly classified sentences. This approach captures the ability of each model to identify economically meaningful financial topics and generalize to unseen text. In addition, we perform a qualitative analysis of topic assignments to examine strengths and weaknesses of each model in capturing domain-specific semantics.

3.5.2 Organizing power

To assess document organization and clustering performance, we measure intratopic similarity (cohesion within topics) and intertopic similarity (separation across topics). High intratopic similarity combined with low intertopic similarity indicates semantically well-structured and diverse topics.

For classical models, similarities are derived from document–topic distributions. First, documents are assigned to their dominant topic. Next, topic embeddings are computed as means of assigned documents. Intertopic similarity is defined as the cosine similarity between topic embeddings. Intratopic similarity is based on the cosine similarity between each document assigned to the topic and the corresponding topic embedding.

For contemporary models, similarities are computed directly from sentence embeddings. Topic embeddings are calculated as the mean of sentence embeddings per topic. Intertopic similarity reflects pairwise cosine similarities between topic embeddings. Intratopic similarity is defined as the average cosine similarity of sentence embeddings to their topic embedding.
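Both similarity measures can be computed directly from embeddings and topic assignments; a numpy sketch of the procedure described above:

```python
import numpy as np

def topic_similarities(embeddings, topics):
    """Intra-/intertopic similarity as described above: topic embeddings
    are means of member embeddings; intratopic similarity is the mean
    cosine of members to their topic embedding; intertopic similarity
    is the mean pairwise cosine between topic embeddings."""
    emb = np.asarray(embeddings, dtype=float)
    topics = np.asarray(topics)
    ids = np.unique(topics)
    centroids = np.stack([emb[topics == t].mean(axis=0) for t in ids])
    unit = lambda m: m / np.linalg.norm(m, axis=-1, keepdims=True)
    e, c = unit(emb), unit(centroids)
    intra = np.concatenate([e[topics == t] @ c[i]
                            for i, t in enumerate(ids)]).mean()
    iu = np.triu_indices(len(ids), k=1)
    inter = (c @ c.T)[iu].mean()
    return float(intra), float(inter)

# Two near-orthogonal toy clusters: high cohesion, near-zero separation.
intra, inter = topic_similarities([[1, 0], [1, 0.01], [0, 1], [0.01, 1]],
                                  [0, 0, 1, 1])
```

Well-structured topics yield intratopic similarity near 1 and intertopic similarity near 0, which is exactly the pattern the evaluation rewards.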

Although similarity scores are computed in different latent representation spaces, all evaluated methods rely on cosine similarity, which is bounded and defined relative to a well-specified neutral reference: vector orthogonality. In both classical topic-distribution spaces and neural embedding spaces, orthogonal vectors correspond to the absence of semantic association. Importantly, our evaluation does not compare absolute cosine similarity magnitudes across model architectures. Instead, we assess relative topic structure within each model, focusing on intratopic cohesion and intertopic separation; all reported similarity statistics are interpreted relative to their empirical within-model distributions rather than as absolute semantic similarity scores. By evaluating the contrast between intratopic and intertopic similarities, rather than their raw levels, we obtain a scale-independent measure of topic organization. This framing enables meaningful comparison of topic separability across architectures while respecting the distinct geometric properties of their underlying latent spaces.

3.6 Downstream task: predictive validity

To assess the predictive value of textual information derived from topic modeling, we conduct a downstream task, evaluating whether the inclusion of topic-document distributions improves company performance prediction. Specifically, we examine the extent to which topics extracted from Item 7 and Item 7A contribute incremental predictive information for future firm profitability.

We define the prediction target as the normalized change in ROA. Following Chen et al. (2022), we normalize by subtracting the average change in ROA over the past four years from the current ROA change. In line with recent literature on corporate performance prediction, we frame the task as a binary classification problem that predicts the direction of ROA change (Peng, 2025). This setup further helps in mitigating heteroscedasticity and outlier sensitivity (Freeman et al., 1982; Ou and Penman, 1989). Consistent with Ou and Penman (1989) and Chen et al. (2022), we exclude observations with model probabilities between 0.4 and 0.6 to remove statistically ambiguous cases and strengthen the predictive signal (Jones et al., 2023; Jun et al., 2022).
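The target construction can be sketched as follows, assuming six consecutive annual ROA values per firm (five year-over-year changes: four past changes plus the current one):

```python
# Sketch of the target construction following Chen et al. (2022):
# normalized change = current ROA change minus the average ROA change
# over the past four years; the binary label is its direction.
def roa_target(roa_history):
    """roa_history: six consecutive annual ROA values, oldest first,
    yielding five year-over-year changes (four past + one current)."""
    changes = [b - a for a, b in zip(roa_history, roa_history[1:])]
    current_change = changes[-1]
    avg_past_change = sum(changes[:-1]) / len(changes[:-1])
    normalized = current_change - avg_past_change
    return normalized, int(normalized > 0)
```

A firm whose ROA change accelerates from a steady one percentage point per year to three points receives a normalized change of +2 points and a positive label.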

The independent variables comprise two components: (1) financial control variables and (2) textual topic features. The financial control variables are based on Swade et al. (2023) and Koval et al. (2024). They comprise 15 features that capture value, growth, profitability, momentum, and size. Focusing on this limited set of features allows us to represent key firm characteristics while preserving the interpretability and visibility of the added textual components. The textual variables are derived from topic-document distributions generated by each topic modeling approach. For classical models, we use the model-implied topic–document distributions directly. For BERTopic, which does not natively provide document-level topic probabilities, we employ HDBSCAN-based approximations of topic distributions. In all cases, document-level topic representations are obtained by averaging sentence-level topic probabilities, yielding vectors that reflect the relative importance of each topic within a document.

We evaluate two predictive models widely used in financial prediction: LR and XGBoost (XGB). LR serves as a linear benchmark, offering simplicity and interpretability (Gangwani and Zhu, 2024; Żbikowski and Antosiuk, 2021). XGB represents a more sophisticated tree-based model, known for its robustness and performance in financial prediction tasks. Tree-based models offer several advantages, as they are capable of handling high-dimensional data and capturing complex, non-linear interactions among features (Levy and O'Malley, 2020; Ho, 1995; Varian, 2014; Geertsema and Lu, 2023). Both ML models are trained using a temporal split: data from 2016–2021 for training and data from 2022–2023 for testing. For LR, we perform several preprocessing steps to ensure robust model performance, including removing columns or rows with excessive placeholder or zero values, replacing outlier values, and scaling features. All preprocessing steps are applied while preventing data leakage and look-ahead bias (Żbikowski and Antosiuk, 2021). As tree-based models can internally manage missing values and are resilient to outliers, we do not apply any form of winsorizing or feature scaling for XGB (Ranta and Ylinen, 2024; Geertsema and Lu, 2023). The final dataset contains 3,454 firm-year observations, with 2,568 for training and 886 for testing. We apply balanced class weighting to mitigate the mild class imbalance (43.3% positive, 56.7% negative), which is consistent across training and test sets.

We evaluate predictive performance using Accuracy, F1-score, and the Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Following Chen et al. (2022) and Carpenter and Bithell (2000), we assess the statistical significance of ROC-AUC differences by constructing bootstrap p-values for deviations from 50%, i.e., a random guess. Specifically, we generate 10,000 bootstrap samples of equal size to the original test set. The p-value is defined as the proportion of bootstrap AUCs that fall below 50%.
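A sketch of the bootstrap procedure, using a rank-based AUC for self-containment (ties in scores are ignored for simplicity):

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based ROC-AUC (probability that a random positive outranks
    a random negative); assumes untied scores for simplicity."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_pvalue(y_true, scores, n_boot=10_000, seed=0):
    """Proportion of bootstrap-sample AUCs below 0.5, i.e. below a
    random guess (Chen et al., 2022; Carpenter and Bithell, 2000)."""
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample contains a single class; skip it
        aucs.append(auc(y_true[idx], scores[idx]))
    return float(np.mean(np.array(aucs) < 0.5))
```

A classifier whose test-set scores perfectly separate the classes yields bootstrap AUCs of 1.0 throughout and hence a p-value of zero.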

For each ML model, we compare seven different inputs: a baseline model that relies solely on financial variables and six text-enhanced models that integrate topic-document distributions from distinct topic modeling approaches. This design enables a direct comparison of the incremental predictive power of textual representations, revealing which topic modeling approach most effectively contributes to corporate performance prediction. Additionally, by applying both linear and non-linear classifiers, we can assess how the benefit of textual features interacts with model complexity.

4 Results and discussion

We structure the results and discussion section according to our research questions:

  • RQ1 FinTextSim: Leveraging the quality of contextual embeddings for the financial domain.

  • RQ2 Topic Quality: Creating qualitative, coherent topic representations.

  • RQ3 Organizing Power: Organizing large financial textual datasets.

  • RQ4 Improving corporate performance prediction with textual data.

The results are presented and contextualized in the following subsections.

4.1 FinTextSim—Leveraging contextual embeddings for the financial domain

FinTextSim generates substantially improved clusters and notably reduces the number of outliers compared to standard embedding models. As illustrated in Figure 1 and Table 2, FinTextSim (Figure 1a) achieves a marked increase in intratopic similarity while simultaneously lowering intertopic similarity relative to AM, MPNET, and DR (Figures 1b–d) on the test dataset.

Figure 1

Table 2

Model        Intratopic similarity ↑    Intertopic similarity ↓    Outliers within BERTopic ↓
FinTextSim   0.998                      –0.075                     240,823
AM           0.584                      0.563                      781,965
MPNET        0.614                      0.625                      784,225
DR           0.773                      0.883                      1,332,620

FinTextSim vs. OTS embedding models: intra- and intertopic similarity on test dataset.

Specifically, FinTextSim attains an intratopic similarity of 0.998, substantially exceeding AM (0.584), MPNET (0.614), and DR (0.773). At the same time, FinTextSim reduces intertopic similarity by more than 108% compared to all baselines, achieving a score of –0.075. In contrast, AM and MPNET yield 0.563 and 0.623, respectively, while DR exhibits the highest intertopic similarity at 0.883. Differences across models are further reflected in the number of outliers generated when combined with BERTopic. AM and MPNET generate 781,965 and 784,225 outliers, respectively. DR performs worst, resulting in more than 1.3 million outliers. In contrast, using FinTextSim leads to only 240,823 outliers, representing a reduction of more than 69% relative to all baselines.

These results show that FinTextSim creates significantly enhanced clusters of semantically similar concepts, characterized by high intratopic similarity and low intertopic similarity. AM, MPNET, and DR show limited ability to capture topic-specific nuances, leading to less differentiated embedding spaces (see Figure 1). In parallel, FinTextSim notably reduces the number of outliers, preserving valuable information that standard embedding models discard. Taken together, these findings suggest that OTS sentence-transformers and models finetuned primarily for financial sentiment analysis are less well suited for semantic clustering of financial text. By explicitly modeling domain-specific semantic structure, FinTextSim provides embeddings that better align with financial topical distinctions.

Turning to a practical example, Figure 2 illustrates topic assignments for the same sentence under BERTopic in combination with FinTextSim, AM, MPNET, and DR. FinTextSim correctly identifies the topic as “Sales,” producing a coherent and interpretable topic representation. In contrast, AM and MPNET assign the sentence to cost- and debt-related topics, reflecting topic confusion and partial concept mixing that limits reliable topic differentiation in this setting. DR assigns the sentence to a diffuse topic lacking clear financial interpretation. This qualitative evidence reinforces the quantitative findings and underscores FinTextSim's advantage in producing interpretable, domain-aligned embeddings that preserve financial topical structure.

Figure 2

4.2 Topic quality

As described in Section 3.5.1, we evaluate topic quality using two complementary criteria: NPMI coherence and topic accuracy. While coherence captures statistical word co-occurrence within topics, topic accuracy directly measures whether models correctly identify economically meaningful financial topics. Table 3 reports coherence and topic accuracy for all models.

Table 3

Model                 Coherence ↑    Topic accuracy ↑
BERTopic-AM           0.387          0.06
BERTopic-MPNET        0.382          0.23
BERTopic-FinTextSim   0.287          0.81
BERTopic-DR           0.368          0.09
LDA                   0.039          0
NMF                   0.239          0.11

Topic quality.

BERTopic combined with FinTextSim outperforms all alternative approaches in topic accuracy, achieving 81% correct classification across all expert-labeled sentences. In contrast, OTS sentence-transformers achieve markedly lower accuracy. BERTopic with AM achieves a topic accuracy of only 6%, while MPNET reaches 23%. DR exhibits similarly limited performance, achieving 9% accuracy. Classical topic models perform at comparable levels. While NMF reaches 11% accuracy, LDA is unable to correctly identify any topic. A topic-level breakdown shows that FinTextSim consistently identifies most financial topics with high accuracy, whereas baseline models succeed only in narrowly defined, lexically explicit topics such as litigation. This pattern suggests that baseline models rely heavily on surface-level keyword cues, while FinTextSim captures broader domain-specific contextual semantics.6 These results highlight that FinTextSim reliably recovers a broad range of economically meaningful financial topics. In comparison, generic embedding models and classical topic models show reduced coverage and consistency, limiting their effectiveness for comprehensive, large-scale financial text analysis.

In terms of raw coherence, BERTopic models outperform classical topic modeling, consistent with Abuzayed and Al-Khalifa (2021) and Egger and Yu (2022). In line with Egger and Yu (2022), O'Callaghan et al. (2015), and Chen et al. (2019), NMF produces more coherent topics than LDA, reflecting its strengths in short-text-modeling (Chen et al., 2019) and handling non-mainstream text (O'Callaghan et al., 2015). LDA, by contrast, generates more general and less domain-specific topics, consistent with O'Callaghan et al. (2015).

During the evaluation of raw coherence scores, an important discrepancy arises: BERTopic with AM, MPNET, and DR achieve higher coherence than with FinTextSim. At first glance, this seems to suggest lower quality for FinTextSim. Yet, this interpretation is incomplete in the financial domain. The paradox arises because coherence does not penalize misclassification, i.e., low topic accuracy. In addition, AM, MPNET, and DR generate a large number of outliers, which simplifies the compression and generation of topics. This artificially inflates coherence while losing valuable financial signals. In contrast, FinTextSim preserves topic distinctions, resulting in fewer outliers and richer topical structures. A further challenge lies in the vocabulary of the financial domain. Key terms often occur as standalone words rather than within a sliding window. Hence, “true” financial topics might suffer from low coherence scores. These factors demonstrate that coherence alone is insufficient to evaluate financial topic models. In line with Grootendorst (2022), who emphasizes that topic evaluation requires both domain expertise and subjective interpretation, we argue that topic accuracy is necessary to capture meaningful financial insights. Standard embedding models within BERTopic and classical topic models exhibit limited ability to correctly identify economically meaningful topics, underscoring their limitations for finance-specific tasks.

A practical example illustrates this issue. In Figure 3, FinTextSim correctly identifies the topic as “Sales.” AM, MPNET, and DR misclassify the same sentence. Yet, AM receives a coherence score of 0.611, more than double FinTextSim's 0.263. Here, coherence rewards an incorrect classification, undermining interpretability and predictive utility.

Figure 3

Figure 4 shows another case: FinTextSim correctly assigns the sentence to “Profit and Loss.” AM associates it with foreign currency and NMF is unable to identify a financial topic at all. Nevertheless, AM (0.528) and NMF (0.341) achieve higher coherence scores than FinTextSim (0.261).

Figure 4

Overall, our findings highlight that domain-specific embeddings are essential for generating high-quality topic representations in financial text applications. Standard coherence metrics systematically undervalue accurate domain-specific topic assignments, while topic accuracy captures meaningful distinctions. By ensuring precise alignment between text and financial topics, FinTextSim provides the interpretability and reliability required for downstream tasks.

4.3 Organizing power

To efficiently organize and structure large collections of documents, maximizing intratopic similarity while simultaneously minimizing intertopic similarity is desirable. The results for intra- and intertopic similarity of our models are displayed in Table 4. These metrics are computed within each model's latent space and interpreted relatively, focusing on the contrast between cohesion and separation rather than absolute similarity values.

Table 4

Model                 Intertopic similarity ↓    Intratopic similarity ↑
BERTopic-AM           0.465                      0.596
BERTopic-MPNET        0.511                      0.656
BERTopic-FinTextSim   –0.034                     0.939
BERTopic-DR           0.745                      0.948
LDA                   1                          0
NMF                   0.202                      0.881

Topic similarities.

BERTopic combined with FinTextSim consistently achieves the strongest balance between cohesion and separation, producing highly coherent topic clusters (intratopic similarity 0.939) while maintaining strong separation between topics (intertopic similarity –0.034). This demonstrates that FinTextSim captures domain-specific distinctions in financial text, forming distinct and semantically meaningful clusters. By contrast, generic OTS sentence-transformers produce weaker topic structure. Both AM and MPNET exhibit moderate intratopic similarity (0.596 and 0.656) but substantially higher intertopic similarity (0.465 and 0.511), indicating that topics are less well-separated and concepts are partially conflated. DR shows high intratopic similarity (0.948), yet its elevated intertopic similarity (0.745) points to limited topic differentiation. Classical topic models struggle as well. LDA collapses all sentences into a single dominant topic, resulting in maximal intertopic similarity and minimal intratopic similarity. NMF produces higher intratopic similarity than LDA, but intertopic similarity remains at a moderate level, indicating partial topic mixing.

Overall, these results highlight the importance of jointly evaluating intratopic and intertopic similarity. FinTextSim consistently forms clear and well-structured topic clusters, outperforming general-purpose embeddings, domain-specific sentiment-baselines, and classical topic models. Figure 5 illustrates this advantage in practice. FinTextSim correctly identifies the sentence as belonging to the “HR” topic, ensuring a precise and domain-relevant assignment. The alternative models associate the sentence with broader or mixed topics, failing to recover this specific financial concept. Such topic ambiguity manifests in higher intertopic similarity and lower intratopic similarity, underscoring the limitations of OTS sentence-transformers, sentiment-focused financial embeddings, and classical topic models for fine-grained financial semantic clustering.

Figure 5

4.4 Predictive validity

As presented in Section 3.6, we evaluate the performance of our ML models for corporate performance prediction using accuracy, F1-score, and ROC-AUC. The results are reported in Table 5.

Table 5

Feature set              Accuracy    F1 score    ROC-AUC
LR
Financial                69.2        57.8        68.8
Financial + AM           63.8        53.3        64.6
Financial + MPNET        66.9        56.5        65.8
Financial + FinTextSim   68.6        59.9        70.8
Financial + DR           66.5        53.9        66.7
Financial + LDA          67.4        55.6        67.7
Financial + NMF          66.2        56.4        69.0
XGB
Financial                63.6        60.3        67.2
Financial + AM           66.3        62.6        67.4
Financial + MPNET        64.8        58.0        66.7
Financial + FinTextSim   66.0        61.2        68.6
Financial + DR           65.7        59.4        67.6
Financial + LDA          67.0        62.2        67.6
Financial + NMF          66.7        60.8        68.2

ML performance comparison across feature sets and models.

Bold values indicate the best performance within each model. Values highlighted in bold and italic indicate the best performing model-feature combination overall. Values are in percent.

For LR, FinTextSim delivers the strongest and most consistent improvements. Topic features derived from BERTopic combined with FinTextSim yield the highest ROC-AUC (70.8) and F1-score (59.9), representing an improvement of approximately two percentage points over the financial baseline. These gains are statistically significant and reflect simultaneous improvements in both precision and recall. In contrast, text features derived from OTS sentence-transformers and models finetuned for financial sentiment analysis reduce predictive performance relative to the financial baseline. Classical topic models offer only marginal or inconsistent improvements. Overall, these results suggest that weak or noisy topic representations do not reliably contribute predictive signal and may adversely affect linear classifiers.

Results under XGB paint a complementary picture. As a more expressive, non-linear model, XGB is better able to accommodate heterogeneous feature quality. Several text-based feature sets yield modest, statistically significant improvements over the financial baseline. Nevertheless, FinTextSim remains the most consistent performer across evaluation metrics, achieving the highest ROC-AUC while maintaining competitive accuracy and F1-score. Importantly, no alternative model matches FinTextSim's joint gains across linear and non-linear classifiers.

Taken together, these findings highlight two key insights. First, predictive gains from textual topic features are highly sensitive to embedding quality, particularly in linear models where noise cannot be absorbed through model complexity. Second, FinTextSim is the only embedding approach that improves predictive performance robustly across both LR and XGB. FinTextSim's superior predictive validity aligns with its stronger intrinsic characteristics, namely higher topic quality and cluster separation. These properties are therefore not merely internal measures of representational quality but translate directly into extrinsic predictive utility. This demonstrates that domain-specific embeddings can effectively extract latent, forward-looking information embedded in corporate narratives. On the other hand, classical, general-purpose or sentiment-focused models tend to provide weaker predictive signals in our setting.

Our results are consistent with and extend previous findings in the earnings and profitability prediction literature. The accuracy and ROC-AUC values reported in our study exceed most previous work, where accuracy typically ranges between 57% and 64% and AUC scores cluster around 68% (Anand et al., 2019; Baranes et al., 2019; Xinyue et al., 2020; Jones et al., 2023; Chen et al., 2022). For example, Jones et al. (2023) report an AUC of 68.4% for LR, while Chen et al. (2022) achieve between 67.5% and 68.7%. Compared to these benchmarks, our FinTextSim-based model demonstrates superior predictive validity using a lightweight LR framework and a wide range of model probabilities, whereas many studies focus only on the first and last quintile (Jones et al., 2023; Jun et al., 2022). This reinforces that the observed improvement is not an artifact of model complexity or sample selection but stems from the added informational value of textual features derived from domain-specific contextual embeddings. Contrary to our expectations, LR outperforms XGB, diverging from prior work (Amel-Zadeh et al., 2020; Rossi and Utkus, 2020; Levy and O'Malley, 2020; Zhu et al., 2025). We attribute XGB's comparatively weaker performance to our deliberately parsimonious feature set, which limits the scope for higher-order interactions.

Overall, our findings confirm that textual representations can meaningfully enhance the prediction of corporate performance when generated by a domain-adapted language model. FinTextSim captures subtle linguistic signals reflecting managerial expectations, strategic orientation, and forward-looking disclosures that are otherwise omitted in numerical data. By integrating such qualitative cues into financial prediction tasks, we demonstrate that corporate narratives contain actionable, forward-looking information that can improve the predictive power of conventional forecasting models and contribute to a more holistic understanding of firm performance.

4.5 Wrapup of results and discussion

We find that BERTopic is highly effective on financial text when combined with FinTextSim. AM, MPNET, DR, and classical topic models tend to produce broader and less differentiated topics, limiting their ability to capture critical financial aspects and resulting in gaps in topical coverage. Only when paired with FinTextSim does BERTopic produce clear, distinct clusters of financial topics, minimizing misclassifications and enhancing interpretability. Conceptually, this aligns with Das et al. (2017), who observed that financial text represented with expert keywords often exhibits almost linearly separable structures. Furthermore, our results support prior findings (Dong et al., 2024; Gu et al., 2024; Wang Y. et al., 2024; Hajek and Munk, 2024), demonstrating that finetuning on a domain-specific dataset improves both model performance and domain-specific understanding. While general-purpose embeddings often exhibit biases and limited coverage of specialized financial terminology (Sun et al., 2026; Hajek and Munk, 2024), models finetuned for financial sentiment analysis also appear less effective for robust topic modeling and semantic clustering. In contrast, domain-adapted models like FinTextSim produce sentence embeddings that better capture topic-specific nuances and context (Wang Y. et al., 2024), emphasizing that relying on alternatives may compromise reliability and introduce systematic errors (Sun et al., 2026). The hyperparameter choices for UMAP and HDBSCAN (see Section 3.4.2) are critical to our results. While we prioritized capturing global structures and macrotopics, these settings succeeded only with FinTextSim, which provided high-quality, pre-separated embeddings for financial text. AM, MPNET, and DR exhibit substantially higher outlier rates and produce less distinguishable topic structures under the same settings.
This further highlights a unique advantage of FinTextSim: its domain-adapted representations not only enhance intratopic and intertopic similarity but also enable dimensionality reduction and clustering methods to effectively capture macro-level topic structures, reinforcing its suitability for financial text analysis where both clarity and interpretability are paramount.

Beyond intrinsic topic quality, our results show that improved textual representations translate into tangible predictive benefits. For LR, topic features generated by BERTopic in combination with FinTextSim yield a statistically significant improvement over purely financial features, reflected in a two-percentage-point increase in both ROC-AUC and F1-score. In contrast, OTS sentence-transformers, DR, and classical topic models provide no improvement and, in some cases, even degrade performance, indicating that their latent features introduce noise rather than signal. Results under XGB present a complementary picture. As a non-linear learner, XGB is better able to absorb heterogeneous or partially noisy feature sets, leading to modest improvements for several textual representations. Nevertheless, FinTextSim remains the most consistent performer, achieving the highest ROC-AUC while maintaining competitive accuracy and F1-score. No alternative topic modeling approach delivers comparable gains across both linear and non-linear classifiers. Taken together, these findings bridge intrinsic and extrinsic evaluation. The superior topic quality and cluster separation achieved by FinTextSim are not merely internal quality measures but translate into robust predictive utility, particularly when model capacity cannot compensate for weak representations. Hence, we conclude that semantic differentiation between sentence representations not only contributes positively to topic modeling (Wang Y. et al., 2024), but also to corporate performance prediction. Therefore, we partially support prior literature suggesting that NLP can enhance corporate performance prediction. However, our evidence reveals that such improvements are realized only when domain-specific representations are employed. 
Together, these findings position FinTextSim as a bridge between qualitative disclosure analysis and quantitative forecasting, highlighting the promise of domain-adapted language models in advancing the methodological frontier of textual analysis in accounting and finance.

Evaluating topic models remains challenging (Zhao et al., 2021), and our analysis reveals the limitations of standard coherence metrics: BERTopic configurations with AM, MPNET, and DR attain higher raw coherence than FinTextSim yet exhibit low topic accuracy owing to frequent misclassifications. These findings underscore the need for coherence and topic-quality measures tailored to domain-specific texts.
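One complementary, embedding-based quality criterion is the contrast between average intratopic and intertopic cosine similarity, the measure emphasized throughout this study. A minimal numpy sketch, using hypothetical embeddings and topic labels rather than real FinTextSim output, could look as follows:

```python
import numpy as np

def intra_inter_similarity(emb: np.ndarray, labels: np.ndarray):
    """Mean cosine similarity within topics vs. between topic centroids.

    emb: (n_docs, dim) sentence embeddings; labels: (n_docs,) topic ids.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize
    topics = np.unique(labels)

    # Intratopic: average pairwise cosine similarity inside each topic.
    intra = []
    for t in topics:
        e = emb[labels == t]
        sims = e @ e.T
        iu = np.triu_indices(len(e), k=1)  # exclude self-similarities
        intra.append(sims[iu].mean())

    # Intertopic: average cosine similarity between topic centroids.
    cents = np.stack([emb[labels == t].mean(axis=0) for t in topics])
    cents = cents / np.linalg.norm(cents, axis=1, keepdims=True)
    cs = cents @ cents.T
    iu = np.triu_indices(len(topics), k=1)
    return float(np.mean(intra)), float(cs[iu].mean())

# Hypothetical embeddings: two well-separated topic clusters.
rng = np.random.default_rng(1)
a = rng.normal(loc=[5, 0, 0], scale=0.3, size=(50, 3))
b = rng.normal(loc=[0, 5, 0], scale=0.3, size=(50, 3))
emb = np.vstack([a, b])
labels = np.array([0] * 50 + [1] * 50)

intra, inter = intra_inter_similarity(emb, labels)
print(f"intratopic: {intra:.3f}  intertopic: {inter:.3f}")
```

A well-separated topic structure yields high intratopic and low intertopic similarity; unlike word-co-occurrence coherence, this criterion directly penalizes the overlapping clusters that lead to misclassified documents.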

While BERTopic enhances topic modeling relative to classical approaches, there is still significant room for improvement. The transformer architecture on which BERTopic heavily relies may not yet be fully optimized; more sophisticated and computationally efficient alternatives should therefore be explored (Karami and Ghodsi, 2024). Further advancements in encoder-only models could enhance sentence-transformers by improving their contextual understanding of language (Warner et al., 2024). Moreover, applying domain-specific pre-training to optimized BERT variants may deepen the model's understanding of financial language, leading to more effective downstream performance (Huang et al., 2023). Another promising direction is the integration of topic modeling with generative large language models such as GPT. Although generative models alone do not achieve competitive topic modeling performance owing to difficulties in handling corpus-level information (Wang R. et al., 2024), hybrid approaches that combine their generalization capabilities with topic modeling frameworks may improve both generalization and textual understanding (Tang et al., 2025).

While our experiments focus on Item 7 and Item 7A of 10-K filings, experiments on Item 1 suggest similar performance, indicating that FinTextSim's effectiveness extends to other sections of 10-K filings.7 Regarding future improvements to FinTextSim, incorporating diverse, high-quality financial sources, such as news, conference-call transcripts, and analyst reports, could enhance robustness and adaptability (Mohammed et al., 2025). Additionally, incorporating researcher-labeled data may yield further improvements (Di Gennaro et al., 2024; Tang et al., 2025). These advancements would not only improve financial text analysis but also enable topic-specific sentiment extraction, which is highly valuable for performance prediction (Gracewell et al., 2025; Deveikyte et al., 2022; Hajek and Munk, 2024).

In terms of corporate performance prediction, FinTextSim's downstream utility could be leveraged to refine investment strategies, generating excess returns by capturing information beyond raw numerical data. We achieve the best results with a lightweight LR framework and a restricted feature set, highlighting that the predictive gains stem from FinTextSim's improved information quality rather than from complex model architecture. Applying more expressive models and a richer feature set could nevertheless further amplify FinTextSim's predictive power and strategic relevance.

5 Conclusion

Increased availability of information and enhanced computational capabilities have transformed the analysis of annual reports, recognizing the value embedded within qualitative textual data. Automated review processes, such as topic modeling, are essential for analyzing this data. However, in the financial domain, the use of ML-based methods (Ranta et al., 2022), including contextual embeddings, remains underexplored (Senave et al., 2023; Hida and Do Nascimento, 2026). We address this gap by bridging classical and contemporary topic modeling approaches for Item 7 and Item 7A of 10-K reports from S&P 500 companies between 2016 and 2023. Furthermore, we introduce FinTextSim, a finetuned sentence-transformer that enhances financial text analysis with BERTopic, and demonstrate its value in downstream corporate performance prediction.

Our study reveals the advantages of FinTextSim over OTS sentence-transformer models and demonstrates the benefits of contemporary topic modeling approaches over classical ones. FinTextSim excels at generating distinct clusters of topics, substantially outperforming OTS sentence-transformers and models finetuned for financial sentiment analysis. Additionally, FinTextSim enables BERTopic to identify high-quality, domain-relevant topics, whereas standard embeddings, financial domain baselines and classical topic modeling approaches frequently miss key financial concepts, leading to misclassified documents. Combining BERTopic with FinTextSim further enhances the creation of well-separated clusters of financial topics. This underscores the critical role of domain-adapted embeddings for optimal topic modeling outcomes.

Beyond these intrinsic improvements, we demonstrate that enhanced textual representations also yield tangible benefits for corporate performance prediction. When FinTextSim-derived topic features are incorporated into a LR model predicting the direction of ROA changes, performance improves significantly, achieving a two-percentage-point increase in both ROC-AUC and F1-score over a purely financial baseline. In contrast, features derived from alternative embeddings or classical topic models tend to introduce noise, degrading predictive accuracy. Results under XGB present a more nuanced picture. As a non-linear learner, XGB can partially absorb heterogeneous or noisier textual feature representations, leading to modest improvements for several non-FinTextSim topic modeling approaches. Nevertheless, FinTextSim remains the most consistent performer across both linear and non-linear classifiers, achieving the highest ROC-AUC and stable performance across evaluation metrics. These results establish a direct link between topic quality and predictive validity, confirming that domain-specific textual representations can meaningfully enhance corporate performance forecasting.

Our work offers several key contributions. First, we advance contextual embeddings for the financial domain with FinTextSim, which functions as a domain-adapted information filter, addressing the fundamental information processing and retrieval bottleneck in financial text analysis. By transforming unstructured narratives into structured, semantically rich representations, FinTextSim enhances the quality of extracted information and enables ML models to detect economically meaningful signals often overlooked by human analysts and generic models. Second, FinTextSim strengthens the informational content of textual data, allowing analysts and researchers to derive actionable insights that support efficient resource allocation and more informed decision-making. Third, by bridging classical and contemporary topic modeling techniques, we establish a foundation for methodologically consistent and empirically validated model selection in financial text analysis. Finally, we demonstrate the practical value of FinTextSim in a downstream corporate performance prediction task. Thus, our research lays the foundation for integrating narrative information into valuation and forecasting frameworks, highlighting that qualitative disclosures can complement quantitative financial metrics in predictive applications.

Our study is not without limitations. Direct comparison between classical bag-of-words models and contextual embedding approaches remains challenging due to fundamental architectural differences. Additionally, the evaluation of topic models is inherently complex. Single metrics may be misleading, necessitating a holistic combination of quantitative and qualitative assessment.

Future research should continue refining domain-specific embeddings and topic evaluation metrics. Advancements in transformer architectures, embedding strategies, and hyperparameter optimization may further enhance topic stability and interpretability. Integrating FinTextSim-derived features with richer feature sets and more advanced learning frameworks represents another promising avenue. Ultimately, these developments will strengthen the role of FinTextSim as a semantic information filter, deepening our understanding of how corporate narratives convey actionable, forward-looking economic information.

Statements

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://sraf.nd.edu/data/stage-one-10-x-parse-data/ and Github repository (https://github.com/JehnenS/FinTextSim).

Author contributions

SJ: Methodology, Conceptualization, Software, Project administration, Visualization, Writing – original draft, Investigation, Resources, Validation, Data curation. JV-D: Supervision, Project administration, Validation, Methodology, Conceptualization, Writing – review & editing, Funding acquisition. JO-M: Supervision, Project administration, Methodology, Writing – review & editing, Conceptualization, Funding acquisition, Validation.

Funding

The author(s) declared that financial support was received for this work and/or its publication. JO-M and JV-D want to acknowledge the partial support by the Spanish “Agencia Estatal de Investigación” through the grant PID2022-137748OB-C31 funded by MCIN/AEI/10.13039/501100011033 and “ERDF A way of making Europe.”

Acknowledgments

A preliminary version of this research appeared as an arXiv preprint (Jehnen et al., 2025). The present manuscript substantially extends that work by refining FinTextSim's training process and adding a downstream ROA-prediction task, demonstrating FinTextSim's economic relevance through improved corporate performance prediction.

Conflict of interest

SJ was employed by Beta Klinik GmbH. The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. During the preparation of this work, the author(s) used ChatGPT-5 to improve readability and language of the work. After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility for the content of the published article.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2026.1752103/full#supplementary-material

Footnotes

1.^The data can be found at: https://sraf.nd.edu/data/stage-one-10-x-parse-data/.

2.^Paragraphs typically consist of 100–200 words. Moreover, sentence-transformers, such as AM and FinTextSim, are designed to capture the semantic information of sentences and short paragraphs. Input texts longer than 256 word pieces (approximately 170–210 words) are truncated by default. The 250-word threshold ensures that each document includes at least two paragraphs, enhancing relevance, as shorter texts often lack substantive or complete ideas.

3.^The domains encompass sales, cost, profit/loss, operations, liquidity, investment, financing, litigation, employment, tax/regulation, and accounting.

4.^The keyword list is presented in a Github Repository under https://github.com/JehnenS/FinTextSim.

5.^The stopword lists can be found at: https://sraf.nd.edu/textual-analysis/stopwords/.

6.^Detailed topic-level accuracies are reported in the Supplementary material.

7.^The results of the experiment are displayed in the Supplementary material.

References

  • 1

    AbdelrazekA.EidY.GawishE.MedhatW.HassanA. (2023). Topic modeling algorithms and applications: a survey. Inf. Syst. 112:102131. doi: 10.1016/j.is.2022.102131

  • 2

    AbuzayedA.Al-KhalifaH. (2021). Bert for arabic topic modeling: An experimental study on bertopic technique. Procedia Comput. Sci. 189, 191194. doi: 10.1016/j.procs.2021.05.096

  • 3

    AgrawalA.FuW.MenziesT. (2018). What is wrong with topic modeling? And how to fix it using search-based software engineering. Inf. Softw. Technol. 98, 7488. doi: 10.1016/j.infsof.2018.02.005

  • 4

    AlbalawiR.YeapT. H.BenyoucefM. (2020). Using topic modeling methods for short-text data: a comparative analysis. Front. Artif. Intell. 3:42. doi: 10.3389/frai.2020.00042

  • 5

    AllaouiM.KherfiM. L.CherietA. (2020). “Considerably improving clustering algorithms using umap dimensionality reduction technique: a comparative study,” in International Conference on Image and Signal Processing (Springer), 317325. doi: 10.1007/978-3-030-51935-3_34

  • 6

    Amel-ZadehA.CalliessJ.-P.KaiserD.RobertsS. (2020). Machine learning-based financial statement analysis. SSRN Electr. J.doi: 10.2139/ssrn.3520684

  • 7

    AnandV.BrunnerR.IkegwuK.SougiannisT. (2019). Predicting profitability using machine learning. Available at SSRN 3466478. doi: 10.2139/ssrn.3466478

  • 8

    AngelovD. (2020). Top2vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470.

  • 9

    AokiY.IshidaS.JinM.YonedaT. (2025). Machine learning versus management earnings forecasts. Available at SSRN 5365902. doi: 10.2139/ssrn.5365902

  • 10

    AraciD. (2019). Finbert: financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063.

  • 11

    AshtianiM. N.RaahemiB. (2023). News-based intelligent prediction of financial markets using text mining and machine learning: a systematic literature review. Expert Syst. Appl. 217:119509. doi: 10.1016/j.eswa.2023.119509

  • 12

    BadenC.PipalC.SchoonveldeM.van der VeldenM. A. G. (2022). Three gaps in computational text analysis methods for social sciences: a research agenda. Commun. Methods Meas. 16, 118. doi: 10.1080/19312458.2021.2015574

  • 13

    BaoY.DattaA. (2014). Simultaneously discovering and quantifying risk types from textual risk disclosures. Manage. Sci. 60, 13711391. doi: 10.1287/mnsc.2014.1930

  • 14

    BaranesA.PalasR. (2019). Earning movement prediction using machine learning-support vector machines (SVM). J. Manag. Inf. Dec. Sci. 22, 3653. doi: 10.18910/100638

  • 15

    BellstamG.BhagatS.CooksonJ. A. (2021). A text-based analysis of corporate innovation. Manage. Sci. 67, 40044031. doi: 10.1287/mnsc.2020.3682

  • 16

    BhattacharyaI.MickovicA. (2024). Accounting fraud detection using contextual language learning. Int. J. Account. Inf. Syst. 53:100682. doi: 10.1016/j.accinf.2024.100682

  • 17

    BlairS. J.BiY.MulvennaM. D. (2020). Aggregated topic models for increasing social media topic coherence. Appl. Intell. 50, 138156. doi: 10.1007/s10489-019-01438-z

  • 18

    BleiD. M.NgA. Y.JordanM. I. (2003). Latent dirichlet allocation. J. Mach. Learn. Res. 3, 9931022.

  • 19

    BookerA.ChiuV.GroffN.RichardsonV. J. (2024). Ais research opportunities utilizing machine learning: from a meta-theory of accounting literature. Int. J. Account. Inf. Syst. 52:100661. doi: 10.1016/j.accinf.2023.100661

  • 20

    BrownN. C.CrowleyR. M.ElliottW. B. (2020). What are you saying? Using topic to detect financial misreporting. J. Account. Res. 58, 237291. doi: 10.1111/1475-679X.12294

  • 21

    CaiK. N.HanleyK. W.HuangA. G.ZhaoX. (2022). Risk disclosure and the pricing of corporate debt issues in private and public markets. Georgetown McDonough School of Business Research Paper.

  • 22

    CampbellJ. C.HindleA.StrouliaE. (2015). “Latent dirichlet allocation: extracting topics from software engineering data,” in The Art and Science of Analyzing Software Data (Elsevier), 139159. doi: 10.1016/B978-0-12-411519-4.00006-9

  • 23

    CampbellJ. L.HamH.LuZ.WoodK. (2024). Expectations matter: when (not) to use machine learning earnings forecasts. Available at SSRN 4495297. doi: 10.2139/ssrn.4495297

  • 24

    CaoK.YouH. (2024). Fundamental analysis via machine learning. Financ. Anal. J. 80, 7498. doi: 10.1080/0015198X.2024.2313692

  • 25

    CarpenterJ.BithellJ. (2000). Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat. Med. 19, 11411164. doi: 10.1002/(SICI)1097-0258(20000515)19:9<1141::AID-SIM479>3.0.CO;2-F

  • 26

    ChenP.JiM. (2025). Deep learning-based financial risk early warning model for listed companies: a multi-dimensional analysis approach. Expert Syst. Applic. 283:127746. doi: 10.1016/j.eswa.2025.127746

  • 27

    ChenX.ChoY. H.DouY.LevB. (2022). Predicting future earnings changes using machine learning and detailed financial data. J. Account. Res. 60, 467515. doi: 10.1111/1475-679X.12429

  • 28

    ChenY.RabbaniR. M.GuptaA.ZakiM. J. (2017). “Comparative text analytics via topic modeling in banking,” in 2017 IEEE Symposium Series on Computational Intelligence (SSCI) (IEEE), 18. doi: 10.1109/SSCI.2017.8280945

  • 29

    ChenY.ZhangH.LiuR.YeZ.LinJ. (2019). Experimental explorations on short text topic mining between LDA and NMF based schemes. Knowl.-Based Syst. 163, 113. doi: 10.1016/j.knosys.2018.08.011

  • 30

    CohenL.MalloyC.NguyenQ. (2020). Lazy prices. J. Finance75, 13711415. doi: 10.1111/jofi.12885

  • 31

    CuriskisS. A.DrakeB.OsbornT. R.KennedyP. J. (2020). An evaluation of document clustering and topic modelling in two online social networks: Twitter and reddit. Inf. Proc. Manag. 57:102034. doi: 10.1016/j.ipm.2019.04.002

  • 32

    DasA. S.MehtaS.SubramaniamL. V. (2017). Annofin-a hybrid algorithm to annotate financial text. Expert Syst. Appl. 88, 270275. doi: 10.1016/j.eswa.2017.07.016

  • 33

    DeveikyteJ.GemanH.PiccariC.ProvettiA. (2022). A sentiment analysis approach to the prediction of market volatility. Front. Artif. Intell. 5:836809. doi: 10.3389/frai.2022.836809

  • 34

    DevlinJ.ChangM.-W.LeeK.ToutanovaK. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  • 35

    Di GennaroG.GrecoC.BuonannoA.CucinielloM.AmoreseT.LerM. S.et al. (2024). Hum-card: a human crowded annotated real dataset. Inf. Syst. 124:102409. doi: 10.1016/j.is.2024.102409

  • 36

    DongM. M.StratopoulosT. C.WangV. X. (2024). A scoping review of chatgpt research in accounting and finance. Int. J. Account. Inf. Syst. 55:100715. doi: 10.1016/j.accinf.2024.100715

  • 37

    DonohoD.StoddenV. (2003). “When does non-negative matrix factorization give a correct decomposition into parts?,” in Advances in Neural Information Processing Systems, 16.

  • 38

    DyerT.LangM.Stice-LawrenceL. (2017). The evolution of 10-k textual disclosure: evidence from latent dirichlet allocation. J. Account. Econ. 64, 221245. doi: 10.1016/j.jacceco.2017.07.002

  • 39

    EastonP. D.KaponsM. M.MonahanS. J.SchüttH. H.WeisbrodE. H. (2024). Forecasting earnings using k-nearest neighbors. Account. Rev. 99, 115140. doi: 10.2308/TAR-2021-0478

  • 40

    EggerR.YuJ. (2021). Identifying hidden semantic structures in instagram data: a topic modelling comparison. Tourism Rev. 77, 12341246. doi: 10.1108/TR-05-2021-0244

  • 41

    EggerR.YuJ. (2022). A topic modeling comparison between lda, nmf, top2vec, and bertopic to demystify twitter posts. Front. Sociol. 7:886498. doi: 10.3389/fsoc.2022.886498

  • 42

    FenglerM. R.PhanM. T. (2025). Unveiling themes in 10-k disclosures: a new topic modeling perspective. Int. Rev. Finan. Anal. 103:104121. doi: 10.1016/j.irfa.2025.104121

  • 43

    FernandesN.GkoliaA.PizzoN.DavenportJ.NairA. (2020). Unification of hdp and lda models for optimal topic clustering of subject specific question banks. arXiv preprint arXiv:2011.01035.

  • 44

    FrankelR.JenningsJ.LeeJ. (2022). Disclosure sentiment: machine learning vs. dictionary methods. Manag. Sci. 68, 55145532. doi: 10.1287/mnsc.2021.4156

  • 45

    FreemanR. N.OhlsonJ. A.PenmanS. H. (1982). Book rate-of-return and prediction of earnings changes: an empirical investigation. J. Account. Res. 20, 639653. doi: 10.2307/2490890

  • 46

    FuQ.ZhuangY.GuJ.ZhuY.GuoX. (2021). Agreeing to disagree: choosing among eight topic-modeling methods. Big Data Res. 23:100173. doi: 10.1016/j.bdr.2020.100173

  • 47

    GangwaniD.ZhuX. (2024). Modeling and prediction of business success: a survey. Artif. Intell. Rev. 57:44. doi: 10.1007/s10462-023-10664-4

  • 48

    García-MéndezS.de Arriba-PérezF.Barros-VilaA.González-CastañoF. J. (2023). Targeted aspect-based emotion analysis to detect opportunities and precaution in financial twitter messages. Expert Syst. Appl. 218:119611. doi: 10.1016/j.eswa.2023.119611

  • 49

    García-MéndezS.de Arriba-PérezF.Barros-VilaA.González-CastañoF. J.Costa-MontenegroE. (2023). Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with latent dirichlet allocation. Appl. Intell. 53, 1961019628. doi: 10.1007/s10489-023-04452-4

  • 50

    García-MéndezS.de Arriba-PérezF.González-GonzálezJ.González-CastañoF. J. (2024). Explainable assessment of financial experts' credibility by classifying social media forecasts and checking the predictions with actual market data. Expert Syst. Appl. 255:124515. doi: 10.1016/j.eswa.2024.124515

  • 51

    GeertsemaP.LuH. (2023). Relative valuation with machine learning. J. Account. Res. 61, 329376. doi: 10.1111/1475-679X.12464

  • 52

    GillisN.VavasisS. A. (2014). Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Trans. Pattern Anal. Mach. Intell. 36, 698714. doi: 10.1109/TPAMI.2013.226

  • 53

    GiudiciP.WuL. (2025). Sustainable artificial intelligence in finance: impact of ESG factors. Front. Artif. Intell. 8:1566197. doi: 10.3389/frai.2025.1566197

  • 54

    GracewellJ.RajA. A. E.KalaivaniC. (2025). Hierarchical aspect-based sentiment analysis using semantic capsuled multi-granular networks. Inf. Syst. 132:102556. doi: 10.1016/j.is.2025.102556

  • 55

    GriffinP. A. (2003). Got information? Investor response to form 10-k and form 10-q edgar filings. Rev. Account. Stud. 8, 433460. doi: 10.1023/A:1027351630866

  • 56

    GrigoreD.-N.PintilieI. (2023). “Transformer-based topic modeling to measure the severity of eating disorder symptoms,” in CLEF (Working Notes), 684692.

  • 57

    GrootendorstM. (2022). Bertopic: neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794.

  • 58

    GuH.SchreyerM.MoffittK.VasarhelyiM. (2024). Artificial intelligence co-piloted auditing. Int. J. Account. Inf. Syst. 54:100698. doi: 10.1016/j.accinf.2024.100698

  • 59

    GuoM.ZongX.GuoL.LeiY. (2024). Does haze-related sentiment affect income inequality in china?Int. Rev. Econ. Finan. 94:103371. doi: 10.1016/j.iref.2024.05.050

  • 60

    HagenL. (2018). Content analysis of e-petitions with topic modeling: How to train and evaluate lda models?Inf. Proc. Manag. 54, 12921307. doi: 10.1016/j.ipm.2018.05.006

  • 61

    HajekP.MunkM. (2024). Corporate financial distress prediction using the risk-related information content of annual reports. Inf. Proc. Manag. 61:103820. doi: 10.1016/j.ipm.2024.103820

  • 62

    HermansA.BeyerL.LeibeB. (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.

  • 63

    HidaG. S.Do NascimentoA. C. A. (2026). Overview of machine learning in class imbalance scenarios: trends, challenges, and approaches. Expert Syst. Applic. 296:129592. doi: 10.1016/j.eswa.2025.129592

  • 64

    HoT. K. (1995). “Random decision forests,” in Proceedings of 3rd International Conference on Document Analysis and Recognition (IEEE), 278282.

  • 65

    HsiehH.-T.HristovaD. (2022). “Transformer-based summarization and sentiment analysis of sec 10-k annual reports for company performance prediction,” in Proceedings of the 55th Hawaii International Conference on System Sciences, HICSS (Hawaii International Conference on System Sciences), 17591768. doi: 10.24251/HICSS.2022.218

  • 66

    HuangA. H.WangH.YangY. (2023). Finbert: a large language model for extracting information from financial text. Contemp. Account. Res. 40, 806841. doi: 10.1111/1911-3846.12832

  • 67

    HuangY.TaiW.ZhouF.GaoQ.ZhongT. (2025). Extracting key insights from earnings call transcript via information-theoretic contrastive learning. Inf. Proc. Manag. 62:103998. doi: 10.1016/j.ipm.2024.103998

  • 68

    HuntJ.MyersJ.MyersL. (2019). Improving earnings predictions with machine learning. Working Paper.

  • 69

    JamshedA.OberoiJ. S.LawalT. O. (2025). Speaking with one voice? The joint information content of tone in md & a and risk factor disclosures. Available at SSRN 5265813. doi: 10.2139/ssrn.5265813

  • 70

    JegadeeshN.WuD. A. (2017). Deciphering fedspeak: the information content of FOMC meetings. SSRN Electr. J. doi: 10.2139/ssrn.2939937

  • 71

    JehnenS.Ordieres-MeréJ.Villalba-DíezJ. (2025). Fintextsim: enhancing financial text analysis with bertopic. arXiv preprint arXiv:2504.15683.

  • 72

    JiR.HanQ. (2022). Understanding heterogeneity of investor sentiment on social media: a structural topic modeling approach. Front. Artif. Intell. 5:884699. doi: 10.3389/frai.2022.884699

  • 73

    JonesS.MoserW. J.WielandM. M. (2023). Machine learning and the prediction of changes in profitability. Contemp. Account. Res. 40, 26432672. doi: 10.1111/1911-3846.12888

  • 74

    JunS. Y.KimD. S.JungS. Y.JunS. G.KimJ. W. (2022). Stock investment strategy combining earnings power index and machine learning. Int. J. Account. Inf. Syst. 47:100576. doi: 10.1016/j.accinf.2022.100576

  • 75

    KaramiM.GhodsiA. (2024). Orchid: flexible and data-dependent convolution for sequence modeling. arXiv preprint arXiv:2402.18508.

  • 76

    KimM. G.KimK. S.LeeK. C. (2022). “Analyzing the effects of topics underlying companies' financial disclosures about risk factors on prediction of ESG risk ratings: emphasis on bertopic,” in 2022 IEEE International Conference on Big Data (Big Data) (IEEE), 45204527. doi: 10.1109/BigData55660.2022.10021110

  • 77

    KovalR.AndrewsN.YanX. (2024). “Financial forecasting from textual and tabular time series,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 8289–8300. doi: 10.18653/v1/2024.findings-emnlp.486

  • 78

    LeeC.-Y.AnderlE. (2025). Does business news sentiment matter in the energy stock market? Adopting sentiment analysis for short-term stock market prediction in the energy industry. Front. Artif. Intell. 8:1559900. doi: 10.3389/frai.2025.1559900

  • 79

    LeeD. D.SeungH. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature401, 788791. doi: 10.1038/44565

  • 80

    LevyJ. J.O'MalleyA. J. (2020). Don't dismiss logistic regression: the case for sensible extraction of interactions in the era of machine learning. BMC Med. Res. Methodol. 20:171. doi: 10.1186/s12874-020-01046-3

  • 81

    LiF. (2010). The information content of forward-looking statements in corporate filings–a naïve Bayesian machine learning approach. J. Account. Res. 48, 10491102. doi: 10.1111/j.1475-679X.2010.00382.x

  • 82

    LiT.ChenH.LiuW.YuG.YuY. (2023). Understanding the role of social media sentiment in identifying irrational herding behavior in the stock market. Int. Rev. Econ. Finance87, 163179. doi: 10.1016/j.iref.2023.04.016

  • 83

    LiuM. (2022). Assessing human information processing in lending decisions: a machine learning approach. J. Account. Res. 60, 607651. doi: 10.1111/1475-679X.12427

  • 84

    LowryM.MichaelyR.VolkovaE. (2020). Information revealed through the regulatory process: interactions between the sec and companies ahead of their ipo. Rev. Financ. Stud. 33, 55105554. doi: 10.1093/rfs/hhaa007

  • 85

    LuJ. (2022). Limited attention: Implications for financial reporting. J. Account. Res. 60, 19912027. doi: 10.1111/1475-679X.12432

  • 86

    LuoZ.LiuL.AnaniadouS.XieQ. (2024). Graph contrastive topic model. Expert Syst. Appl. 255:124631. doi: 10.1016/j.eswa.2024.124631

  • 87

    MaierD.WaldherrA.MiltnerP.WiedemannG.NieklerA.KeinertA.et al. (2018). Applying LDA topic modeling in communication research: toward a valid and reliable methodology. Commun. Methods Meas. 12, 93118. doi: 10.1080/19312458.2018.1430754

  • 88

    MassonC.ParoubekP. (2020). “NLP analytics in finance with dore: a french 257m tokens corpus of corporate annual reports,” in Language Resources and Evaluation Conference (LREC 2020) (ELRA), 22612267.

  • 89

    McInnesL.HealyJ. (2017). “Accelerated hierarchical density clustering,” in 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 33–42. doi: 10.1109/ICDMW.2017.12

  • 90

    MohammedS.BudachL.FeuerpfeilM.IhdeN.NathansenA.NoackN.et al. (2025). The effects of data quality on machine learning performance on tabular data. Inf. Syst. 132:102549. doi: 10.1016/j.is.2025.102549

  • 91

    MurphyB.FeeneyO.RosatiP.LynnT. (2024). Exploring accounting and ai using topic modelling. Int. J. Account. Inf. Syst. 55:100709. doi: 10.1016/j.accinf.2024.100709

  • 92

    NazarethN.ReddyY. V. R. (2023). Financial applications of machine learning: a literature review. Expert Syst. Appl. 219:119640. doi: 10.1016/j.eswa.2023.119640

  • 93

    O'CallaghanD.GreeneD.CarthyJ.CunninghamP. (2015). An analysis of the coherence of descriptors in topic modeling. Expert Syst. Appl. 42, 56455657. doi: 10.1016/j.eswa.2015.02.055

  • 94

    OuJ. A.PenmanS. H. (1989). Financial statement analysis and the prediction of stock returns. J. Account. Econ. 11, 295329. doi: 10.1016/0165-4101(89)90017-7

  • 95

    PengY. (2025). Earnings prediction using machine learning: a survey. Osaka Univ. Econ. 74, 4560.

  • 96

    PufahlL.StiehleF.IhdeS.WeskeM.WeberI. (2025). Resource allocation in business process executions–a systematic literature study. Inf. Syst. 132:102541. doi: 10.1016/j.is.2025.102541

  • 97

    RantaM.YlinenM. (2024). Employee benefits and company performance: evidence from a high-dimensional machine learning model. Manag. Account. Res. 64:100876. doi: 10.1016/j.mar.2023.100876

  • 98

    RantaM.YlinenM.JärvenpääM. (2022). Machine learning in management accounting research: literature review and pathways for the future. Eur. Account. Rev. 32, 607636. doi: 10.1080/09638180.2022.2137221

  • 99

    RashidJ.ShahS. M. A.IrtazaA. (2019). Fuzzy topic modeling approach for text mining over short text. Inf. Proc. Manag. 56:102060. doi: 10.1016/j.ipm.2019.102060

  • 100

    ReimersN.GurevychI. (2019). Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

  • 101

    RöderM.BothA.HinneburgA. (2015). “Exploring the space of topic coherence measures,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399–408. doi: 10.1145/2684822.2685324

  • 102

    RossiA. G.UtkusS. P. (2020). Who benefits from robo-advising? Evidence from machine learning. Working paper. Evidence from Machine Learning. doi: 10.64202/wp.120.202001

  • 103

    Sánchez-FrancoM. J.Rey-MorenoM. (2022). Do travelers' reviews depend on the destination? An analysis in coastal and urban peer-to-peer lodgings. Psychol. Market. 39, 441459. doi: 10.1002/mar.21608

  • 104

    SenaveE.JansM. J.SrivastavaR. P. (2023). The application of text mining in accounting. Int. J. Account. Inf. Syst. 50:100624. doi: 10.1016/j.accinf.2023.100624

  • 105

    Sia, S., Dalmia, A., and Mielke, S. J. (2020). “Tired of topic models? Clusters of pretrained word embeddings make for fast and good topics too!,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics), 1728–1736. doi: 10.18653/v1/2020.emnlp-main.135

  • 106

    Siano, F. (2025). The news in earnings announcement disclosures: capturing word context using LLM methods. Manag. Sci. 71, 9831–9855. doi: 10.1287/mnsc.2024.05417

  • 107

    Siino, M., Tinnirello, I., and La Cascia, M. (2024). Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on transformers and traditional classifiers. Inf. Syst. 121:102342. doi: 10.1016/j.is.2023.102342

  • 108

    Song, J., Lu, X., Hong, J., and Wang, F. (2025). External information enhancing topic model based on graph neural network. Expert Syst. Appl. 263:125709. doi: 10.1016/j.eswa.2024.125709

  • 109

    Sun, Y., Zhao, J., Xu, H., Zhang, R., Liu, C., and Yuan, L. (2026). Enhancing neural topic modeling for social media text via semantic bag of word clusters and log-domain Sinkhorn transport. Inf. Proc. Manag. 63:104411. doi: 10.1016/j.ipm.2025.104411

  • 110

    Swade, A., Hanauer, M. X., Lohre, H., and Blitz, D. (2023). Factor zoo. J. Portfolio Manag. 50, 11–31. doi: 10.3905/jpm.2023.1.561

  • 111

    Taha, K. (2023). Semi-supervised and un-supervised clustering: a review and experimental evaluation. Inf. Syst. 114:102178. doi: 10.1016/j.is.2023.102178

  • 112

    Tang, Y.-K., Huang, H., Shi, X., and Mao, X.-L. (2025). Bridging insight gaps in topic dependency discovery with a knowledge-inspired topic model. Inf. Proc. Manag. 62:103911. doi: 10.1016/j.ipm.2024.103911

  • 113

    Theodorakopoulos, L., Theodoropoulou, A., and Bakalis, A. (2025). Big data in financial risk management: evidence, advances, and open questions: a systematic review. Front. Artif. Intell. 8:1658375. doi: 10.3389/frai.2025.1658375

  • 114

    Uddin, A., Tao, X., Chou, C.-C., and Yu, D. (2022). “Machine learning for earnings prediction: a nonlinear tensor approach for data integration and completion,” in Proceedings of the Third ACM International Conference on AI in Finance, 282–290. doi: 10.1145/3533271.3561677

  • 115

    Ueda, K., Suwa, H., Yamada, M., Ogawa, Y., Umehara, E., Yamashita, T., et al. (2024). SSCDV: social media document embedding with sentiment and topics for financial market forecasting. Expert Syst. Appl. 245:122988. doi: 10.1016/j.eswa.2023.122988

  • 116

    Van Binsbergen, J. H., Han, X., and Lopez-Lira, A. (2023). Man versus machine learning: the term structure of earnings expectations and conditional biases. Rev. Financ. Stud. 36, 2361–2396. doi: 10.1093/rfs/hhac085

  • 117

    Varian, H. R. (2014). Big data: new tricks for econometrics. J. Econ. Perspect. 28, 3–28. doi: 10.1257/jep.28.2.3

  • 118

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in 31st Conference on Neural Information Processing Systems (NIPS 2017) (Long Beach, CA, USA), 1–15.

  • 119

    Veganzones, D., and Severin, E. (2025). Earnings management visualization and prediction using machine learning methods. Int. J. Account. Inf. Syst. 56:100743. doi: 10.1016/j.accinf.2025.100743

  • 120

    Wang, J., and Zhang, X.-L. (2023). Deep NMF topic modeling. Neurocomputing 515, 157–173. doi: 10.1016/j.neucom.2022.10.002

  • 121

    Wang, Q., Su, T., Lau, R. Y. K., and Xie, H. (2023). DeepEmotionNet: emotion mining for corporate performance analysis and prediction. Inf. Proc. Manag. 60:103151. doi: 10.1016/j.ipm.2022.103151

  • 122

    Wang, R., Ren, P., Liu, X., Chang, S., and Huang, H. (2024). DCTM: dual contrastive topic model for identifiable topic extraction. Inf. Proc. Manag. 61:103785. doi: 10.1016/j.ipm.2024.103785

  • 123

    Wang, Y., Zhang, J., Yang, Z., Wang, B., Jin, J., and Liu, Y. (2024). Improving extractive summarization with semantic enhancement through topic-injection based BERT model. Inf. Proc. Manag. 61:103677. doi: 10.1016/j.ipm.2024.103677

  • 124

    Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., et al. (2024). Smarter, better, faster, longer: a modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. arXiv preprint arXiv:2412.13663.

  • 125

    Xie, H., Luo, J., and Tan, X. (2025). Artificial intelligence technology application and corporate ESG performance: evidence from national pilot zones for artificial intelligence innovation and application. Front. Artif. Intell. 8:1643684. doi: 10.3389/frai.2025.1643684

  • 126

    Xinyue, C., Zhaoyu, X., and Yue, Z. (2020). Using machine learning to forecast future earnings. Atlantic Econ. J. 48, 543–545. doi: 10.1007/s11293-020-09691-1

  • 127

    Yadavilli, V. S., Seshadri, K., and Bhattu, N. (2024). Joint modeling of causal phrases-sentiments-aspects using hierarchical Pitman-Yor process. Inf. Proc. Manag. 61:103753. doi: 10.1016/j.ipm.2024.103753

  • 128

    Żbikowski, K., and Antosiuk, P. (2021). A machine learning, bias-free approach for predicting business success using crunchbase data. Inf. Proc. Manag. 58:102555. doi: 10.1016/j.ipm.2021.102555

  • 129

    Zhang, M., Xu, C., Gan, Y., Wang, Y., Fu, Y., and Chen, Y. (2026). Automating construction contract question answering using large language model and fine-tuning. Expert Syst. Appl. 297:129493. doi: 10.1016/j.eswa.2025.129493

  • 130

    Zhao, H., Phung, D., Huynh, V., Jin, Y., Du, L., and Buntine, W. (2021). Topic modelling meets deep neural networks: a survey. arXiv preprint arXiv:2103.00498.

  • 131

    Zheng, L., He, Z., and He, S. (2025). A topic model-based knowledge graph to detect product defects from social media data. Expert Syst. Appl. 268:126313. doi: 10.1016/j.eswa.2024.126313

  • 132

    Zhu, X. (2026). Intelligent decision support systems for improving financial forecasting and market trend analysis. Expert Syst. Appl. 297:129462. doi: 10.1016/j.eswa.2025.129462

  • 133

    Zhu, Z., Zheng, Y., Wang, X., Huang, D., and Feng, L. (2025). Forecasting China's precious metal futures volatility: GBRT models and time-model dimension combination of Tree SHAP. Int. Rev. Financ. Anal. 104:104249. doi: 10.1016/j.irfa.2025.104249

Keywords

artificial intelligence, BERTopic, company performance prediction, FinTextSim, LDA, machine learning, topic modeling

Citation

Jehnen S, Villalba-Díez J and Ordieres-Meré J (2026) FinTextSim: a domain-specific sentence-transformer for extracting predictive latent topics from financial disclosures. Front. Artif. Intell. 9:1752103. doi: 10.3389/frai.2026.1752103

Received

22 November 2025

Revised

10 January 2026

Accepted

16 January 2026

Published

02 March 2026

Volume

9 - 2026

Edited by

Ida Claudia Panetta, Sapienza University of Rome, Italy

Reviewed by

Asma Iqbal, Nawab Shah Alam Khan College of Engineering & Technology, India

Federico Siano, The University of Texas at Dallas, United States

Copyright

*Correspondence: Joaquín Ordieres-Meré,

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
