ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Natural Language Processing
Volume 8 - 2025 | doi: 10.3389/frai.2025.1624900
Integrated ensemble of BERT- and feature-based models for authorship attribution in Japanese literary works
Provisionally accepted
- 1 Graduate School of Culture and Information Science, Doshisha University, Kyoto, Japan
- 2 Research Center for Linguistic Ecology, Doshisha University, Kyoto, Japan
- 3 Institute of Interdisciplinary Research, Kyoto University of Advanced Science, Kyoto, Japan
- 4 Faculty of Psychology, Mejiro University, Tokyo, Japan
Background: Traditional authorship attribution (AA) research has relied primarily on statistical analysis and classification using stylistic features extracted from textual data. Although pre-trained language models such as BERT have gained prominence in text classification, their effectiveness in small-sample AA scenarios remains insufficiently explored. A critical unresolved challenge is developing methodologies that effectively integrate BERT with conventional feature-based approaches to advance AA research.

Objective: This study aims to substantially improve performance on small-sample AA tasks by strategically combining traditional feature-based methods with contemporary BERT-based approaches. We also conduct a comprehensive comparative analysis of the accuracy of BERT models and conventional classifiers, and systematically evaluate how the characteristics of individual models interact within this combination to influence overall classification effectiveness.

Methods: We propose a novel integrated ensemble methodology that combines BERT-based models with feature-based classifiers, benchmarked against conventional ensemble techniques. Experimental validation is conducted on two literary corpora, each consisting of works by ten distinct authors. The ensemble framework incorporates five BERT variants, three feature types, and two classifier architectures to systematically evaluate model effectiveness.

Results: BERT proved effective in small-sample authorship attribution tasks, surpassing traditional feature-based methods. Both BERT-based and feature-based ensembles outperformed their standalone counterparts, and the integrated ensemble method achieved even higher scores. Notably, on Corpus B, which was not included in the pre-training data, the integrated ensemble significantly outperformed the best individual model, improving the F1 score from 0.823 to 0.96. It achieved the highest score among all evaluated approaches, including standalone models and conventional ensemble techniques, by a statistically significant margin (p < 0.012, Cohen's d = 4.939), underscoring the robustness of the result. The pre-training data used in BERT had a significant impact on task performance, emphasizing the need to select models based not only on accuracy but also on model diversity. These findings highlight the importance of pre-training data and model diversity in optimizing language models for ensemble learning, offering valuable insights for authorship attribution research and, more broadly, for the development of artificial general intelligence systems.
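As a rough illustration of the integrated ensemble described in Methods, the sketch below pools per-class probability outputs from BERT-based and feature-based models via weighted soft voting. The model names, synthetic inputs, and uniform weighting are assumptions for illustration only, not the authors' exact implementation.

```python
# Minimal sketch of an integrated ensemble: per-class probabilities from
# BERT variants and feature-based classifiers are pooled by soft voting.
import numpy as np

def integrated_ensemble(prob_matrices, weights=None):
    """Average probability matrices (n_samples x n_classes) from
    heterogeneous models and return the winning class per sample."""
    stacked = np.stack(prob_matrices)  # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(prob_matrices))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    pooled = np.tensordot(weights, stacked, axes=1)  # weighted mean over models
    return pooled.argmax(axis=1)

# Toy demonstration: synthetic outputs for 4 test texts and 3 authors.
rng = np.random.default_rng(0)
bert_probs = rng.dirichlet(np.ones(3), size=4)  # stand-in for a BERT variant
svm_probs = rng.dirichlet(np.ones(3), size=4)   # stand-in for an SVM on stylometric features
rf_probs = rng.dirichlet(np.ones(3), size=4)    # stand-in for a random forest on another feature type
print(integrated_ensemble([bert_probs, svm_probs, rf_probs]))
```

Soft voting over probability outputs, rather than hard majority voting over labels, lets well-calibrated models contribute graded evidence, which is one plausible way to combine classifiers as heterogeneous as BERT and feature-based methods.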
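The Results report an effect size alongside the p-value. For readers unfamiliar with the statistic, the sketch below shows how Cohen's d is computed from two sets of per-fold F1 scores; the score values are fabricated placeholders, not the study's data.

```python
# Hedged illustration of Cohen's d with a pooled standard deviation,
# applied to hypothetical cross-validation F1 scores.
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two score samples using pooled variance."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

ensemble_f1 = [0.96, 0.95, 0.97, 0.96, 0.95]     # hypothetical fold scores
best_single_f1 = [0.82, 0.83, 0.81, 0.84, 0.82]  # hypothetical fold scores
print(round(cohens_d(ensemble_f1, best_single_f1), 3))
```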
Keywords: bidirectional encoder representations from transformers (BERT), stylometric features, authorship attribution (AA), text classification, integrated ensemble
Received: 09 May 2025; Accepted: 08 Sep 2025.
Copyright: © 2025 Taisei, Mingzhe and Wataru. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Jin Mingzhe, Research Center for Linguistic Ecology, Doshisha University, Kyoto, Japan
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.