- 1Physikalisch-Technische Bundesanstalt, Berlin, Germany
- 2Technische Universität Berlin, Berlin, Germany
Large pre-trained language models have become a crucial backbone for many downstream tasks in natural language processing (NLP). Because they are trained on a plethora of data containing a variety of biases, such as gender biases, it has been shown that they can inherit such biases in their weights, potentially affecting their prediction behavior. However, it is unclear to what extent these biases also affect feature attributions generated by applying “explainable artificial intelligence” (XAI) techniques, possibly in unfavorable ways. To systematically study this question, we create a gender-controlled text dataset, GECO, in which the alteration of grammatical gender forms induces class-specific words and provides ground truth feature attributions for gender classification tasks. This enables an objective evaluation of the correctness of XAI methods. We apply this dataset to the pre-trained BERT model, which we fine-tune to different degrees, to quantitatively measure how pre-training induces undesirable bias in feature attributions and to what extent fine-tuning can mitigate such explanation bias. To this end, we provide GECOBench, a rigorous quantitative evaluation framework for benchmarking popular XAI methods. We show a clear dependency between explanation performance and the number of fine-tuned layers, where XAI methods are observed to benefit particularly from fine-tuning or complete retraining of embedding layers.
1 Introduction
Large neural network architectures are often complex, making it difficult to understand the mechanisms by which model outputs are generated. This has led to the development of dedicated post-hoc analysis tools, commonly referred to as “explainable artificial intelligence” (XAI). In many cases, XAI methods provide so-called feature attributions, which assign an “importance” score to each feature of a given input (e.g. Ribeiro et al., 2016; Lundberg and Lee, 2017; Sundararajan et al., 2017). In the Natural Language Processing (NLP) domain, feature attribution methods in supervised learning settings are expected to highlight parts of an input text (e.g., words or sentences) that are related to the predicted target, such as a sentiment score.
However, it remains unclear to what extent feature attribution methods help answer specific explanation goals, such as model debugging (Haufe et al., 2024). This raises questions about the correctness of feature attributions. One reason why it is challenging to determine the correctness of attribution methods is the tension between model-centric and data-centric explanations (e.g. Murdoch et al., 2019; Chen et al., 2020; Fryer et al., 2020; Haufe et al., 2024). In this setting, it is unclear how to define notions of correctness in a principled manner, and consequently the extent to which feature attribution methods provide explanations that are purely model-centric or data-centric remains unknown. Empirical studies on synthetic datasets have demonstrated that numerous feature attribution methods fail to fulfill basic data-centric requirements, such as highlighting features that have a statistical association with the prediction target (also referred to as the Statistical Association Property (SAP)) (e.g. Wilming et al., 2022; Oliveira et al., 2024; Clark et al., 2024). Here, we adopt this data-centric view of assessing the correctness of feature attributions and apply it to the NLP domain.
Furthermore, within the NLP domain attribution methods are typically applied to large pre-trained language models, which are adapted to downstream tasks through transfer learning [e.g., BERT (Devlin et al., 2019) and its variants (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2018; Bugliarello et al., 2021; Dodge et al., 2020; OpenAI, 2023)].
Pre-trained language models are commonly trained on large corpora of text scraped from public and non-public sources, including Wikipedia, Project Gutenberg,1 or OpenWebText.2 These large corpora contain a variety of biases, such as biases against demographic groups (Beukeboom, 2014; Graells-Garrido et al., 2015; Reagle and Rhue, 2011). It has been shown that such biases affect model weights (Mitchell, 2007; Montañez et al., 2019) and that text corpora exhibiting problematic biases are amplified in large language models, such as BERT (e.g. Bordia and Bowman, 2019; Gonen and Goldberg, 2019; Blodgett et al., 2020; Nadeem et al., 2021).
However, it remains unclear to what extent biases contained in pre-training corpora are reflected in explanations provided by feature attribution methods, potentially hindering them from meeting specific correctness requirements such as the SAP with respect to the target data distribution and prediction task. Using the example of grammatical gender, we can imagine one particular way in which pre-training biases might lead to incorrect feature attributions or point to residual bias in fine-tuned models. In a gender classification task, asymmetries in the frequencies of specific words may be present in a pre-training corpus but not in the target domain. For example, historical novels may be biased toward male protagonists and depict women less frequently and in more narrowly defined roles, often adhering to historical gender norms. However, the association between, for example, role-specific words and gender in these texts is irrelevant when it comes to distinguishing grammatical gender (as well as for many other tasks). A feature attribution method that highlights such words therefore suggests the influence of pre-training biases.
To study and quantify the data-centric correctness of feature attribution methods and the influence of biases, we make two key contributions: (1) GECO—a gender-controlled dataset and (2) GECOBench—a quantitative benchmarking framework to assess the correctness of feature attributions for language models on gender classification tasks. Both contribute to the future development of novel XAI methods, helping with their evaluation and correctness assessment. An overview is shown in Figure 1.
Figure 1. Overview of the benchmarking approach for evaluating the correctness of XAI methods. Starting from a clear definition of discriminative features that induce statistical associations between features/words and the prediction target, we specify ground-truth explanations. With that, we craft a gender-focused dataset, GECO, by sourcing text from Wikipedia and labeling and altering the grammatical gender of specific words. The resulting training and validation datasets are used to train the BERT language model. The test dataset, together with the trained model, serves as input to the XAI method, which outputs explanations for the test set. The word-based ground truth explanations, provided by the previous labeling process, are then used to measure the correctness of each sentence's generated explanations using the Mass Accuracy metric (Arras et al., 2022; Clark et al., 2024, 2025).
GECO3 is a gender-controlled dataset in which each sentence x appears in three grammatically gendered variants: male xM, female xF, and non-binary xNB. The three variants are identical apart from gender-specific words such as pronouns. For example, consider the sentence “She loves to spend time with her favorite cat.” We label this sentence as “female (‘F')” because it contains the pronouns “she” and “her.” By replacing the pronouns with “he” and “his,” we define the “male (‘M')” counterpart of this sentence. Our approach to creating sentences with minimal changes can be seen as similar to counterfactual data augmentation (Kaushik et al., 2019; Liu et al., 2021).
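The alteration principle can be illustrated with a short, hypothetical sketch. The snippet below is not the pipeline used to build GECO, whose alterations were labeled manually (see Section 3.1); it only demonstrates, for a toy pronoun map, how the three grammatically gendered variants of a sentence relate to each other. Verb agreement for the non-binary form (e.g., “loves” vs. “love”) is deliberately not handled.

```python
# Illustrative sketch only: GECO itself was created via manual labeling.
# A toy pronoun map is used to derive the "M", "F", and "NB" variants of a sentence.
import re

PRONOUN_MAP = {
    "she": {"M": "he", "F": "she", "NB": "they"},
    "he":  {"M": "he", "F": "she", "NB": "they"},
    # "her" is ambiguous (possessive vs. object); the toy map only covers the possessive reading.
    "her": {"M": "his", "F": "her", "NB": "their"},
    "his": {"M": "his", "F": "her", "NB": "their"},
    "him": {"M": "him", "F": "her", "NB": "them"},
}

def gendered_variant(sentence: str, target: str) -> str:
    """Return a gendered variant by swapping pronouns in place (case-preserving)."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = PRONOUN_MAP[word.lower()][target]
        return repl.capitalize() if word[0].isupper() else repl
    pattern = r"\b(" + "|".join(PRONOUN_MAP) + r")\b"
    return re.sub(pattern, swap, sentence, flags=re.IGNORECASE)

sentence = "She loves to spend time with her favorite cat."
print(gendered_variant(sentence, "M"))  # He loves to spend time with his favorite cat.
```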
GECOBench4 is a workflow to quantitatively benchmark the correctness of feature attributions, specifically evaluating XAI methods for NLP classification tasks induced by GECO or similar datasets. Here, we showcase the use of GECOBench, where BERT (Devlin et al., 2019), a language model pre-trained on Wikipedia data, serves as an exemplary language model. While this benchmark can be extended to include more models, our primary focus is on benchmarking explanation methods rather than the language models themselves.
With the gender-controlled dataset GECO, we aim to construct sentences composed of discriminative features (or words) and, therefore, ground truth feature attributions that are gender-balanced concerning the classification task. Further, it is known that BERT suffers from gender biases (Nadeem et al., 2021; Ahn and Oh, 2021). Thus, when using GECO as a test set, any residual asymmetry in feature attributions can be traced back to biases induced by pre-training. Via the data-centric notion of correctness, we quantify this effect for different stages of retraining or fine-tuning distinct layers of BERT's architecture to investigate to what extent retraining or fine-tuning BERT on gender-controlled data can mitigate gender bias in feature attributions. In other words, we analyze to what extent the correctness of feature attribution is indicative of biases in models. By ensuring that the distinctly trained models have equivalent classification accuracy throughout the considered fine-tuning stages, we can assess how these training regimes impact correctness performance with the proposed dataset. Generally, we do not expect any XAI method to perform perfectly, as data-centric correctness is only one goal of interpreting machine learning models and not necessarily the primary purpose of each explanation approach. Using GECO and GECOBench, we aim to answer the following two main research questions:
RQ1: What is the performance of widely adopted XAI methods in the regime of data-centric feature importance given word/token-level ground truth feature attributions?
RQ2: Does gender bias contained in pre-trained language models affect data-centric explanation performance of feature attribution methods, and if so, does this effect depend on the selection of layers that are fine-tuned or re-trained?
2 Related research
Although the applications of XAI have increased in the past years (e.g. Lundberg et al., 2018; Jiménez-Luna et al., 2020; Tran et al., 2021; Zhang et al., 2022), the problems to be addressed by XAI have rarely been formally defined (Murdoch et al., 2019). In particular, the widely used metaphor of identifying features “used” by a model, measured through “faithfulness” or “fidelity” metrics (e.g. Jacovi and Goldberg, 2020; Hooker et al., 2019; Rong et al., 2022), can lead to fundamental misinterpretations, as such a notion depends strongly on the structure of the underlying data generative model and the resulting distribution of the (training) data (Haufe et al., 2014; Wilming et al., 2023; Haufe et al., 2024). Wilming et al. (2023) investigate such metrics, showing that many perturbation and pixel-flipping methods fail to detect statistical dependencies or other feature effects like suppressor variables (Friedman and Wall, 2005; Haufe et al., 2014), and are therefore unsuitable to directly measure certain meaningful notions of explanation “correctness.” To objectively evaluate whether a feature attribution method possesses this property, the availability of ground truth data is instrumental. Ground truth data for feature attributions in domains such as image, tabular, and time series data have been developed in the last few years (e.g. Kim et al., 2018; Ismail et al., 2019, 2020; Tjoa and Guan, 2023; Agarwal et al., 2022; Arras et al., 2022). However, most of these benchmarks do not present realistic correlations between class-dependent and class-agnostic features (e.g., the foreground or object of an image vs. the background) (Clark et al., 2024), and often use surrogate metrics, such as faithfulness, instead of directly measuring explanation performance. Other works discuss the need for normative frameworks (Sullivan, 2024) or studying data manipulation and its impact on XAI methods' output (Mhasawade et al., 2024), rather than focusing on ground truth feature attributions. Several NLP-related benchmarks have been presented (DeYoung et al., 2020; Rychener et al., 2020); however, they also have certain limitations. In the case of DeYoung et al. (2020), faithfulness of the model is measured in alignment with human-annotated rationales, which do not necessarily align with statistical association, opening the door to cognitive biases. Rychener et al. (2020) present a benchmark dataset consisting of a question-answering task, where the ground truth feature attributions originate from a text context providing the answer. However, as the authors emphasize, defining a ground truth for question-answering cannot depend on a single word but rather on a context of words that provides the prediction models with sufficient information. This work, therefore, does not provide word-level ground truth feature attributions in the sense of statistical association. Balagopalan et al. (2022) and Dai et al. (2022) analyze the fairness behavior of XAI methods, focusing on model fidelity and highlighting disparities between social groups rather than considering the correctness aspect of feature attributions.
Moreover, Joshi et al. (2024) propose a mitigation technique for gender bias in natural language generation based on feature attribution methods' output. However, since no token-level ground truth is provided, neither the correctness of feature attributions nor their ability to select biased tokens can be verified. Gamboa and Lee (2024) introduce the bias attribution score, an information-theoretic metric for quantifying token-level contributions to biased behavior in multilingual pre-trained language models, demonstrating the presence of sexist and homophobic biases in these models. Unlike GECO, neither a controlled counterfactual dataset nor ground truth attributions for evaluating the correctness of feature attribution methods are provided. Dehdarirad (2025) proposes a unified framework for evaluating feature attribution methods in language classification models, comparing SHAP, LIME, Integrated Gradients, and interaction-based approaches across classical and transformer architectures to assess their faithfulness (Samek et al., 2019; Jacovi and Goldberg, 2020) under different datasets. However, a controlled dataset with ground truth attributions is not provided. Given the tension between model-centric and data-centric feature attributions, ground-truth-based evaluations, as in GECOBench, enable a more principled study of feature attribution methods in the context of language models.
In the NLP research community, the development of datasets for bias detection, metrics for fairness and bias assessment, and methods for bias mitigation is an active field of research. For example, Bolukbasi et al. (2016) demonstrated that word embeddings encode gender stereotypes and proposed subspace-based debiasing, specifically learning a “gender direction” and projecting it out from gender-neutral words. This approach was refined by Prost et al. (2019). Dev et al. (2020) utilize natural language inference as a surrogate to systematically study and mitigate biased inferences arising from embeddings. Further benchmarks for social bias analysis in large pre-trained language models have been proposed, for example, by Manzini et al. (2019), Nangia et al. (2020), Costa-jussà et al. (2020), Nadeem et al. (2021), Parrish et al. (2022), Jentzsch and Turan (2022), Zakizadeh et al. (2023), Navigli et al. (2023), and Cimitan et al. (2024). Nevertheless, none of these datasets and benchmarks provides identical sentences for each grammatical gender that differ only at specific, controlled positions and thereby yield a ground truth for feature attribution benchmarking.
3 Materials and methods
To enable correctness evaluations for explanation methods, we introduce the GECO dataset, which comprises a set of manipulated sentences x in which grammatical subjects and objects assume either their male xM, female xF, or non-binary xNB forms. These three grammatically gendered variants give rise to the downstream task of gender classification, with labels “M,” “F,” and “NB,” which involves discriminating between the variants of sentences and is represented by the dataset D := {(x(i), y(i)) ∣ i = 1, ..., N}. Importantly, in all cases, ground truth feature attributions on a word-level basis are available by construction.
3.1 Data sourcing and generation
For the dataset, we restrict ourselves to source sentences with a human subject, such that each sentence of our manipulated dataset is guaranteed to have a well-defined gender label. This type of sentence naturally occurs in books and novels. The Gutenberg archive offers a vast collection of classical titles, enabling users to identify relevant text content from well-known novels and nonfiction works. To comply with licensing requirements surrounding the listed books, we collect the content of their corresponding Wikipedia pages and use only text pieces related to the plot of the story. We query the list of the top 100 popular books on the Gutenberg project and obtain their corresponding Wikipedia pages. More details on data licensing are provided in Supplementary material 1.1.1.
We create two ground truth datasets, DS and DA. Each contains 1,610 sentences in a male, a female, and a non-binary version, comprising 4,830 sentences in total (see Table 1). DS contains sentences in which only the words specifying the gender of the grammatical subject are manipulated into their male, female, or non-binary form, while DA contains sentences in which all gender-related words are manipulated accordingly. Table 2 shows an exemplary sentence and the resulting manipulations employing this labeling scheme. The dataset DS instantiates a substantially more challenging task compared to dataset DA due to the reduction of discriminative features. In this scenario, the model is required to differentiate between subject and object when they have different grammatical genders, necessitating a deeper understanding of the sentence's context and structure to address the task effectively. Thus, employing both types of datasets allows us to investigate whether the model inadvertently focuses on irrelevant parts of the sentence for the prediction task, potentially introducing bias that could impact explanation performance. The process for creating these datasets consists of two consecutive steps: (i) preprocessing of the scraped Wikipedia pages, and (ii) manual labeling to detect and adapt relevant subjects and objects in a sentence. More details on labeling and format are provided in Supplementary material 1.1.2. Further details on hosting and future maintenance are provided in the Supplementary material.
Table 2. Example of the labeling and alteration scheme of sentences, showing the original sentence and the six manipulated versions.
3.2 Bias assessment of GECO
We employ a co-occurrence metric (Zhao et al., 2017), a rudimentary bias measure, to demonstrate the unbiasedness of GECO. Specifically, we adopt the co-occurrence metric proposed by Cabello et al. (2023) to measure gender bias in the datasets DS and DA. For a given sentence x ∈ D, we approximately measure the bias induced by grammatical gender by considering the co-occurrence between the sentence's gender terms and the remaining words. First, we define a set of grammatical gender terms A := {“she”, “her”, “he”, “they”, ...} and, second, a word vocabulary without grammatical gender terms V := W \ A, where the vocabulary W contains all words available in the corpus of D; then, the co-occurrence metric C is defined as follows:
Furthermore, we consider the decomposition D = DM∪DF∪DNB with male DM, female DF, and non-binary DNB sentences, respectively. Then we define the bias, exemplified here for DF, according to the co-occurrence metric C as
A perfect balance is achieved for , indicating that the dataset is evenly distributed among men, women, and non-binary individuals. Deviations from this value indicate the presence of biases: values closer to 0 suggest a male or non-binary bias, while those approaching 1 indicate a female bias.
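The computation can be sketched as follows. Since the exact formula of Cabello et al. (2023) is not reproduced here, the snippet below makes the simplifying assumption that the bias score is the female share of the total co-occurrence mass between gender terms and the remaining vocabulary; under this assumption, a value of 1/3 corresponds to perfect balance across the three gender forms.

```python
# Hedged sketch of a co-occurrence-based balance check; the exact metric used in the
# paper follows Cabello et al. (2023) and may differ in detail.
GENDER_TERMS = {"she", "her", "hers", "he", "him", "his", "they", "them", "their"}

def cooccurrence_mass(sentences):
    """Number of (gender term, non-gender word) co-occurrences summed over sentences."""
    total = 0
    for sent in sentences:
        words = [w.lower().strip(".,!?;:") for w in sent.split()]
        n_gender = sum(w in GENDER_TERMS for w in words)
        total += n_gender * (len(words) - n_gender)
    return total

def female_bias(d_male, d_female, d_nonbinary):
    """Assumed bias score: female share of co-occurrence mass (1/3 = perfect balance)."""
    c_m, c_f, c_nb = map(cooccurrence_mass, (d_male, d_female, d_nonbinary))
    return c_f / (c_m + c_f + c_nb)
```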
3.3 Explanation benchmarking
The alteration of sentences induces discriminative features by construction, and their uniqueness automatically renders them the only viable ground truth feature attributions, defining two different gender classification tasks represented by the two datasets DS and DA. By extension, every word that is not grammatically gender-related, and therefore not altered, becomes a non-discriminative feature. We train machine learning models on these tasks and apply post-hoc feature attribution methods to the trained models to obtain explanations expressing the importance of features according to each XAI method's intrinsic criteria. When evaluating the XAI methods, the ground truth feature attributions are used to measure whether their output highlights the correct features. An overview is shown in Figure 1.
3.3.1 Ground truth feature attributions
We consider a supervised learning task, where a model f:Rd→R learns a function between an input x(i) ∈ Rd and a target y(i) ∈ {−1, 1}, based on training data {(x(i), y(i)) ∣ i = 1, ..., N}. Here, x(i) and y(i) are realizations of the random variables X and Y, with joint probability density function pX,Y(x, y), and [d] := {1, ..., d} is the set of feature indices for a vector-wise feature representation (Xi | i ∈ [d]). We formally cast the problem of finding an explanation or important features as a decision problem ([d], F, f), where F ⊆ [d] is the set of important features. Moreover, an explanation or saliency map s:Rd→Rd should assign a numerical value reflecting the significance of each feature. We are then interested in finding a test h:Rd → {0, 1}d, which one can use to define the set of important features F := {j ∣ hj(x) = 1, for j ∈ [d]}. For the concrete definition of the test and the resulting set of important features, we adopt the approach of Wilming et al. (2022) and Wilming et al. (2023) and give the following definition.
Definition 3.1 (Statistical Association Property (SAP)). Given the supervised learning task from above, we say that an XAI method has the Statistical Association Property (SAP) if, for any feature Xj with non-zero (or significantly larger than zero) importance, there also exists a statistical dependency between Xj and the target Y, i.e., Xj and Y are not statistically independent.
This definition is based on the observation that most feature attribution methods implicitly or explicitly assume that such a statistical association exists (Wilming et al., 2022). Now, defining a test via hj(x) = 1 if Xj and Y are statistically dependent and hj(x) = 0 otherwise, we can summarize the set of potentially important features via their univariate statistical dependence with the target, F = {j ∈ [d] ∣ Xj and Y are dependent}. Thus, each sentence of the GECO corpus and its corresponding token sequence x(i) has a matching ground truth map h(x(i)) ∈ {0, 1}d, representing the corresponding important tokens.
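Because the altered words are known by construction, the ground truth map can be derived directly from the sentence variants. The following minimal sketch (not the released labeling format) compares a sentence with one of its gendered counterparts and marks every position at which the two differ:

```python
# Minimal sketch: the word-level ground truth mask h(x) marks exactly those positions
# that were altered between two gendered variants of the same sentence.
def ground_truth_map(sentence: str, counterpart: str) -> list:
    words_a, words_b = sentence.split(), counterpart.split()
    assert len(words_a) == len(words_b), "variants differ in place, not in length"
    return [int(a.lower() != b.lower()) for a, b in zip(words_a, words_b)]

h = ground_truth_map(
    "She loves to spend time with her favorite cat.",
    "He loves to spend time with his favorite cat.",
)
# h == [1, 0, 0, 0, 0, 0, 1, 0, 0]
```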
3.3.2 Feature attribution methods
Here, we focus on post-hoc attribution methods, which can be broadly divided into gradient-based methods and local sampling or surrogate approaches. Generally, these methods produce an explanation s:Rd→Rd, which is a mapping that depends on the model f and an instance x* to be explained. Gradient-based methods locally approximate a differentiable model f around a given input sequence x*. From this class, we consider Saliency (Simonyan et al., 2013), InputXGradient (Shrikumar et al., 2017), DeepLift (Shrikumar et al., 2017), Guided Backpropagation (Springenberg et al., 2015), and Integrated Gradients (Sundararajan et al., 2017). Surrogate models, on the other hand, sample around the input x* and use a model's output f(x) to train a simple, usually linear, model and interpret f through this local approximation. In this work, we consider the surrogate methods LIME (Ribeiro et al., 2016) and Kernel SHAP (Lundberg and Lee, 2017). Additionally, our study includes Gradient SHAP (Lundberg and Lee, 2017), an approximation of the Shapley value sampling method.
We also consider two baselines. Firstly, we set the explanation for a particular input sequence x* to uniformly distributed random noise. This serves as a null model corresponding to the hypothesis that the XAI method has no knowledge of the informative features h(x*). Secondly, we employ the Pattern approach (Haufe et al., 2014; Wilming et al., 2022). We apply a variant of it based on the covariance between input features and target, Cov(X, Y). We call this the Pattern Variant, for which we utilized the tf-idf (Sparck Jones, 1972) representation of each input sequence x(i). Clearly, this explanation s is independent of both the model f and the instance x*; therefore, it yields the same feature attributions for all input sequences.
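The Pattern Variant baseline can be sketched as follows; this is an illustrative implementation of the description above (covariance between tf-idf features and the target), not the exact benchmark code:

```python
# Sketch of the Pattern Variant baseline: one global attribution per vocabulary entry,
# given by the covariance between its tf-idf value and the class label.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def pattern_variant(sentences, labels):
    """Return Cov(X_j, Y) for every tf-idf feature j (identical for all inputs)."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(sentences).toarray()   # shape: (n_samples, n_features)
    y = np.asarray(labels, dtype=float)
    cov = ((X - X.mean(axis=0)) * (y - y.mean())[:, None]).mean(axis=0)
    return dict(zip(vectorizer.get_feature_names_out(), cov))
```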
We apply these XAI methods to all fine-tuning variants of the BERT model and compute explanations on all test data sentences using the default parameters of each method. For all XAI methods except LIME, we use their Captum (Kokhlikyan et al., 2020) implementation. For LIME, we use the author's original code.5
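As an illustration of this step, the sketch below applies Integrated Gradients to a BERT sequence classifier via Captum's LayerIntegratedGradients, attributing with respect to the embedding layer. The checkpoint name, target class index, and padding baseline are illustrative choices and do not necessarily match the exact benchmark configuration:

```python
# Hedged sketch: Integrated Gradients on a BERT classifier via Captum.
import torch
from transformers import AutoTokenizer, BertForSequenceClassification
from captum.attr import LayerIntegratedGradients

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased", num_labels=2
)
model.eval()

def forward_func(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

enc = tokenizer("She loves to spend time with her favorite cat.", return_tensors="pt")
baseline_ids = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

lig = LayerIntegratedGradients(forward_func, model.bert.embeddings)
attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(enc["attention_mask"],),
    target=0,  # assumed index of the class to be explained
)
token_scores = attributions.sum(dim=-1).squeeze(0)   # one score per sub-word token
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
```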
3.3.3 Explanation performance quantification
For a given instance x* ∈ Dtest, we aim to quantitatively assess the correctness of its explanation s(x*). The corresponding ground truth h(x*) defines a set of potentially important tokens based on the alteration of words; however, a model f might only use a subset of these tokens for its predictions. Hence, an explanation method that highlights only a subset of the ground truth tokens, rather than all of them, must still be considered correct. Expressed in information retrieval terms, we are interested in mitigating the impact of false negatives and emphasizing the impact of false positives on explanation performance. False negatives occur when a token flagged as part of the ground truth receives a low importance score, and false positives occur when a token flagged as not part of the ground truth receives a high importance score. The Mass Accuracy metric (MA) (Arras et al., 2022; Clark et al., 2024) provides these properties and is defined as the total normalized attribution mass assigned to ground truth tokens, MA(h(x*), s(x*)) = Σj hj(x*) sj(x*).
Here, the feature attributions s are normalized such that Σj sj(x*) = 1 and s(x*) ∈ [0, 1]d. A score of MA(h(x*), s(x*)) = 1 indicates a perfect explanation, marking only ground truth tokens as important. For instance, for a sentence with only two ground truth tokens, h(x*) = (1, 1, 0, 0)⊤, where the attribution for only one ground truth token is high, say s(x*) = (0.9, 0, 0, 0.1)⊤, the MA metric still produces a high score of MA(h(x*), s(x*)) = 0.9, de-emphasizing false negatives. With respect to false positives, high attributions to non-ground-truth tokens do not contribute to the MA directly; however, through the normalization of s, they reduce the attribution mass assigned to ground truth tokens, leading to a lower MA score and thus effectively penalizing false-positive attributions.
Note that feature attributions s are calculated at the sub-word level. To align them with the word-level ground truth, we normalize attribution scores across a sentence and then aggregate sub-word contributions back to the word level. For example, the word “benchmark” may be split into “bench” and “mark” by the BERT tokenizer, with attributions sbench and smark, which are combined to sbenchmark = sbench + smark.
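A minimal sketch of this evaluation step is given below: sub-word scores are summed per word using the tokenizer's word indices, rectified and normalized to sum to one (the handling of negative attributions is an assumption here), and scored against the word-level ground truth mask:

```python
# Minimal sketch of word-level aggregation and the Mass Accuracy (MA) metric.
import numpy as np

def aggregate_to_words(token_scores, word_ids):
    """Sum sub-word scores belonging to the same word (word_ids from a fast tokenizer)."""
    n_words = max(i for i in word_ids if i is not None) + 1
    word_scores = np.zeros(n_words)
    for score, idx in zip(token_scores, word_ids):
        if idx is not None:              # skip special tokens such as [CLS] and [SEP]
            word_scores[idx] += score
    return word_scores

def mass_accuracy(h, s):
    """Share of normalized attribution mass assigned to ground truth words."""
    s = np.clip(np.asarray(s, dtype=float), 0.0, None)  # assumption: rectify negatives
    s = s / s.sum()
    return float(np.dot(np.asarray(h, dtype=float), s))

print(mass_accuracy([1, 1, 0, 0], [0.9, 0.0, 0.0, 0.1]))  # 0.9, as in the example above
```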
3.3.4 Explanation bias quantification
The mass accuracy MA serves two purposes: (i) it assesses the correctness of XAI methods' output with respect to the ground truth, and (ii) through its deviations from the ground truth, which depend on which layer is fine-tuned or retrained, it allows us to define a notion of what we call explanation bias. Explanation bias is defined via the relative mass accuracy (RMA)
which quantifies the deviation of explanation performance with respect to a baseline model; in this work, a zero-shot BERT model serves as this baseline (see Section 3.3.5).
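For concreteness, the sketch below computes one plausible form of the RMA; the exact form used in the benchmark may differ. The sketch assumes RMA to be the relative change in MA with respect to the zero-shot baseline model, matching the description of Figure 2a:

```python
# Hedged sketch: one plausible form of the relative mass accuracy (RMA), assumed here
# to be the relative change in MA with respect to the zero-shot baseline BERT-ZS.
import numpy as np

def relative_mass_accuracy(ma_model, ma_zero_shot):
    """Relative deviation of a fine-tuned model's MA from the zero-shot model's MA."""
    ma_model = np.asarray(ma_model, dtype=float)
    ma_zero_shot = np.asarray(ma_zero_shot, dtype=float)
    return (ma_model - ma_zero_shot) / ma_zero_shot
```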
3.3.5 Classifiers
In our analysis, we focus on the popular BERT model (Devlin et al., 2019), though one can expand this work using other common language models such as RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), or GPT models (Radford et al., 2018, 2019). For all experiments, we use the pre-trained uncased BERT model (Devlin et al., 2019).6 To investigate the impact that fine-tuning or retraining of different parts of BERT's architecture can have on explanation performance, we consider four different training paradigms: (i) We roughly split BERT's architecture into three parts: Embedding, Attention, and Classification. The standard approach to adapting BERT for a new downstream task is to train the last classification layer, which we refer to as Classification, while fixing the weights for all remaining parts of the model, specifically Embedding and Attention. We thereby only train a newly initialized classification layer and refer to the resulting model as BERT-C. (ii) We additionally train the embedding layer from scratch, resulting in a model called BERT-CE. (iii) In the third model, BERT-CEf, the embeddings are fine-tuned as opposed to newly initialized. In training paradigm (iv), we fine-tune the Embedding and Attention parts of BERT's architecture, resulting in model BERT-CEfAf. Moreover, we include a zero-shot model, BERT-ZS, which received no gradient updates. Lastly, a vanilla one-layer attention model, OLA-CEA, comprising a lower-dimensional embedding layer, one attention layer, and a classification layer, was trained from scratch only on the GECO dataset. Therefore, without pre-training on external corpora, it represents the simplest attention-based model free from residual biases, providing a clean reference point against which more complex, pre-trained models like BERT can be compared. All models achieve an accuracy above or close to 80% on the test set. Previous works on classification problems involving BERT suggest that accuracy results ranging from 60 to 90% are standard (Gao et al., 2019; Zheng and Yang, 2019; Yu and Jiang, 2019). Therefore, we consider our results as evidence that the models have successfully generalized to the given downstream task. Table 3 summarizes model performance with average accuracy and standard deviation over five models trained with different seeds. More details are given in Supplementary material 1.2.1 and in the experiments' configuration file.7
Table 3. Overview of BERT transfer learning paradigms and the performance of the resulting models on the test sets of DS and DA.
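The four transfer learning paradigms amount to different choices of which parameter groups receive gradient updates. The helper below is a hedged sketch (using Hugging Face Transformers and the model's private weight initializer for the from-scratch embeddings), not the training code of the benchmark:

```python
# Hedged sketch of the parameter freezing schemes described above.
from transformers import BertForSequenceClassification

def build_bert_variant(scheme: str, num_labels: int = 2):
    model = BertForSequenceClassification.from_pretrained(
        "google-bert/bert-base-uncased", num_labels=num_labels
    )
    # Freeze the pre-trained backbone; the newly initialized classifier head always trains.
    for param in model.bert.parameters():
        param.requires_grad = False

    if scheme in ("BERT-CE", "BERT-CEf", "BERT-CEfAf"):
        for param in model.bert.embeddings.parameters():
            param.requires_grad = True                    # embeddings receive updates
    if scheme == "BERT-CE":
        model.bert.embeddings.apply(model._init_weights)  # train embeddings from scratch
    if scheme == "BERT-CEfAf":
        for param in model.bert.encoder.parameters():
            param.requires_grad = True                    # also fine-tune the attention stack
    return model
```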
4 Experiments and results
Our bias analysis on GECO shows that there is no gender bias present in the DA dataset with . For dataset DS we achieve the scores , , and . The small difference from a perfect score can be attributed to labeling errors but is also expected for the dataset DS due to its construction, as we only change the human subject of the sentence; other gender terms, referring to other protagonists in the sentence, are kept unchanged.
Using the unbiased dataset GECO, we conduct experiments to study the influence of biased models on explanation performance. After fine-tuning and re-training the models (see Table 3), we apply feature attribution methods. Figure 2 shows the explanation performance of sample-based attribution maps s(x*) produced by the selected feature attribution and baseline methods.
Figure 2. Explanation performance of different post-hoc XAI methods applied to language models that were adapted from BERT using five different transfer learning schemes. XAI evaluations were conducted on classified sentences in two gender classification tasks, represented by datasets DS and DA. The baseline performance for uniformly drawn random feature attributions is denoted by Uniform Random. Pattern Variant denotes a model- and pretraining-agnostic global explanation method. In (a), the relative change in explanation performance with respect to a zero-shot BERT model shows consistent changes for models with fine-tuned embeddings. In (b), fine-tuning or retraining of the embedding layers of BERT leads to consistent improvements in explanation correctness even when model performance is held constant for all models. Applying XAI methods to the OLA model leads to overall higher explanation performance, with InputXGradient becoming on par with Pattern Variant.
In the following, we present the results toward the research questions RQ1 and RQ2. Firstly, we present the results for explanation performance, addressing questions about the data-centric correctness of the analyzed feature attribution methods. Secondly, we present bias-related results, focusing on how the correctness of feature attribution indicates biases in language models.
Regarding RQ1: We observe a general difference in MA between datasets DA and DS. While for the majority of attribution methods the performance on dataset DS stays below 0.25, experiments on dataset DA often reach levels above 0.25 (see Figure 2b). However, dataset DS has fewer altered gender words, and thus fewer discriminative tokens, leading to an overall degradation of classification accuracy across all models (see Table 3), which also impacts explanation performance. For all BERT models and both datasets, Integrated Gradients consistently outperforms the other methods relative to the uniform random baseline. LIME and Gradient SHAP also rank among the highest-performing methods when compared to the Pattern Variant baseline and with respect to the data-centric SAP criterion (see Definition 3.1).
Comparing the OLA-CEA model to all BERT models, we observe a stark contrast in explanation performance. Recall that the OLA-CEA model was purely trained from scratch on the gender-controlled dataset GECO; hence, it does not suffer from any gender bias. The mass accuracy for the OLA-CEA model is similar between the two datasets, with higher variance for the dataset DS. In addition to relatively well-performing methods such as Integrated Gradients, LIME, and Gradient SHAP, the MA of InputXGradient comes very close to the Pattern Variant baseline, making it the best-performing method.
Although no explanation method achieves the correctness score of Pattern Variant, fine-tuning a biased embedding layer for a downstream task has a high impact on the output for some methods. The Pattern Variant is a model-independent global explanation method that relies solely on the intrinsic structure of the data itself. It performs optimally when the relation between features and target is linear, which is largely the case for GECO. This can be seen in Figure 4 of the Supplementary material, where we visualize the Pearson correlation between the term frequency–inverse document frequency (tf-idf) (Sparck Jones, 1972) representation of words and the target, clearly showing how the word alteration procedure induces statistical dependencies between words and the target.
As an example, Figure 6 in the Supplementary material shows a sentence labeled as “female,” together with its word-based feature attributions as bar plots for each fine-tuning stage. We observe high variability in token attributions between differently fine-tuned BERT models, and the pronoun “she” receives relatively high importance compared to other words. However, not all XAI methods agree on the importance of the token “she”; for example, for model BERT-CE, InputXGradient attributes high importance to it, whereas for model BERT-CEfAf it attributes rather high importance to the word “Bella.”
Regarding RQ2: In Figure 2a we can observe a consistent pattern with respect to the fine-tuning stages across both data scenarios DS and DA in terms of RMABERT-ZS performance. Here, BERT-ZS is utilized as a baseline model, as it represents the “untouched” pre-trained model without gradient updates. Using it as a baseline allows us to quantify how feature attribution correctness evolves as models become increasingly specialized in the gender-classification task. Specifically, it highlights how fine-tuned models focus more on the discriminatory tokens compared to the zero-shot model, thereby providing a relative quantification of residual biases. For the scenario DA, it is clear that the models BERT-CE and BERT-CEf, where the embedding layer was trained or fine-tuned, respectively, outperform BERT-C and BERT-CEfAf (see also Figure 2b). This shows that the embeddings encode substantial bias-related information and, indeed, influence data-centric explanation performance.
5 Discussion
With GECO and GECOBench, we propose an open framework for benchmarking the correctness of feature attributions of pre-trained language models as well as aspects of fairness. Our initial results demonstrate (a) differences in explanation performance between feature attribution methods, (b) a general dependency of explanation performance on the amount of re-training/fine-tuning of BERT models, and (c) residual gender biases as contributors to sub-par explanation performance.
More generally, the proposed gender classification problem is a simplification that does not reflect the complexity and diversity of gender identification in our world today; however, by providing non-binary gendered sentences, we attempt to counteract historical gender norms and provide a more inclusive basis for gender-bias research. We view the gender classification task as a minimal proxy for the gender bias issue, modeling all necessary properties to analyze bias propagation into feature attribution methods. We also view predicting a sentence's gender as an auxiliary task, which we consider more of an academic problem that naturally arises from how we construct sentences but has, as we see it, no immediate application or societal impact. While GECO provides a controlled environment to study how gender bias influences feature attributions, we acknowledge that our design inevitably oversimplifies gender by restricting it to pronoun alterations (e.g., he/she/they). This simplification risks reinforcing notions of gender that fail to represent the full spectrum of identities. In addition, gender-controlled datasets such as GECO could be misused, for example, to build or evaluate models explicitly aimed at gender classification rather than for bias analysis. To mitigate this risk, we emphasize that GECO is intended solely for studying the correctness assessment of feature attribution methods under controlled bias conditions, not for downstream applications involving sensitive demographic prediction. We view GECO as a first step toward systematic evaluation of bias in feature attributions, with the expectation that future work will extend its coverage to more diverse text sources, richer notions of gender, and broader fairness concepts. While direct extrapolation to more complex applications is challenging, our results indicate that feature attributions are indeed affected by gender bias. This motivates caution in downstream tasks, such as sentiment analysis or toxic language detection, where attribution methods might incorrectly highlight some gender-related tokens due to bias rather than semantic relevance. For example, this renders model debugging challenging, as developers and researchers cannot tell whether feature attributions that suggest “flaws” in the model arise from genuine model bias or from other artifacts of the data.
When it comes to data selection, we are convinced that if biased models are applied to semantically gender-neutral sentences, no reliance on words representing classical gender roles can be expected; thus, the potential impact of biases on feature attributions cannot be measured. For this reason, the sentences selected to create GECO were intentionally taken from Wikipedia articles outlining the storylines of classic novels, as these are likely to employ historical gender norms. Such sentences are, indeed, a prerequisite for assessing the output of XAI methods applied to language models exhibiting various levels of bias.
By creating grammatical female, male, and non-binary versions of a particular sentence based on pronouns, we aim to break such historical gender associations with respect to the classification task represented by the datasets DS and DA. Single sentences may still entail historical gender associations, which can be utilized by biased machine learning models. However, the classification task arising from GECO is gender-balanced, and models specialized in that task, through successive fine-tuning and retraining of an increasing number of layers, learn not to rely on such historical gender associations, as altered pronouns are, by construction, the only words associated with the prediction target. It can then be shown that models basing their decisions on words other than the words altered by us have learned stereotypical associations. For example, consider the sentence “She prepares dinner in the kitchen while he is outside fixing the car.” This sentence illustrates “traditional” gender roles, where the woman is associated with domestic tasks and the man with mechanical or manual labor, reinforcing stereotypes. By altering the sentence to use only “she” or only “he” pronouns, we break these stereotypical gender roles. However, the break is only partial: in this instance, one part of the sentence will always reflect traditional gender roles, e.g., “She prepares dinner in the kitchen... .” Nevertheless, for the classification task represented by the datasets DS and DA, only the altered words represent a relationship with the prediction target. Biased language models might then leverage words like “kitchen” or “car” for their decisions, whereas unbiased models must rely only on the altered words, so that historical gender norms embedded in sentences become irrelevant. In future research, co-reference resolution could be an immediate extension of GECO because it takes the same gender-manipulated sentences and asks the model not just to classify gender per sentence but to resolve references consistently across discourse, thereby testing explanation correctness under contextual and bias-sensitive conditions.
In terms of data-centric correctness assessments of feature attribution methods, Pattern Variant indeed offers strong theoretical justification for detecting important features according to statistical associations (Haufe et al., 2014), establishing a solid baseline for the upper bound of explanation performance in our benchmark. Compared to the random baseline, we observe two further high-performing attribution methods in the transfer learning regime [in terms of SAP (see Definition 3.1)]: Integrated Gradients and Gradient SHAP. Nevertheless, these methods still do not achieve the same level of accuracy as Pattern Variant. The reasons can be two-fold: (i) as shown by Clark et al. (2024) and Wilming et al. (2023), feature attribution methods consistently attribute importance to suppressor variables, i.e., features not statistically associated with the target but utilized by machine learning models to increase accuracy; and (ii) model bias impacts feature attributions. We show that the gender bias in BERT leads to residual asymmetries in feature attributions and forms a consistent pattern of deviation in correctness, depending on which layer of BERT was fine-tuned or re-trained, while still achieving equivalent classification accuracy. In our experiments, updating the embedding layers has the strongest impact on feature attributions. These findings indicate that embeddings contain significant bias affecting feature attribution methods, and that the proposed data-centric notion of correctness of feature importance is indicative of model bias.
While this is, to the best of our knowledge, the first XAI benchmark addressing a well-defined notion of data-centric correctness of feature importance in the NLP domain, we do not consider it an exhaustive evaluation of feature attribution methods but rather a first step toward this. A possible limitation of our approach is that the criterion of univariate statistical association used here to define important features or tokens does not account for nonlinear feature interactions that are prevalent in many real-world applications. However, for analyzing the fundamental behaviors of feature attribution methods, this characteristic allows for straightforward evaluation strategies, permitting us to embed these statistical properties into the proposed corpus and establish a ground truth of word relevance. Designing metrics for evaluating explanation performance, particularly for measuring correctness, is another area that warrants further research.
6 Conclusion
We have introduced GECO—a novel gender-controlled ground truth text dataset designed for the development and evaluation of feature attribution methods—and GECOBench—a quantitative benchmarking framework to perform objective assessments of explanation performance for language models. We demonstrated the use of GECO and GECOBench by applying them to the pre-trained language model BERT, a model known to exhibit gender biases. With this analysis, we showed that the SAP criterion is an effective condition to quantify the data-centric correctness of feature attribution methods applied to the language model BERT, and that residual biases contained in BERT affect feature attributions and can be mitigated through fine-tuning and retraining of different layers of BERT, positively impacting explanation performance.
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary material.
Author contributions
RW: Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. AD: Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. HS: Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. MO: Conceptualization, Data curation, Software, Validation, Writing – original draft, Writing – review & editing. BC: Formal analysis, Methodology, Software, Validation, Writing – original draft, Writing – review & editing. SH: Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This result was part of a project that has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant agreement No. 758985), and the German Federal Ministry for Economy and Climate Action (BMWK) within the framework of the QI-Digital Initiative.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Gen AI was used in the creation of this manuscript. Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2025.1694388/full#supplementary-material
Footnotes
2. ^https://github.com/jcpeterson/openwebtext
3. ^Available on OSF: https://osf.io/74j9s/?view_only=8f80e68d2bba42258da325fa47b9010f.
4. ^All code, including dataset generation, model training, evaluation, and visualization, is available at: https://github.com/braindatalab/gecobench.
5. ^https://github.com/marcotcr/lime
6. ^Hosted by Hugging Face: https://huggingface.co/google-bert/bert-base-uncased.
7. ^https://osf.io/74j9s/files/p23yh?view_only=8f80e68d2bba42258da325fa47b9010f
References
Agarwal, C., Krishna, S., Saxena, E., Pawelczyk, M., Johnson, N., Puri, I., et al. (2022). “OpenXAI: towards a transparent evaluation of model explanations,” in Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Ahn, J., and Oh, A. (2021). “Mitigating language-dependent ethnic bias in BERT,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, eds. M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (Punta Cana: Association for Computational Linguistics), 533–549. doi: 10.18653/v1/2021.emnlp-main.42
Arras, L., Osman, A., and Samek, W. (2022). CLEVR-XAI: a benchmark dataset for the ground truth evaluation of neural network explanations. Inf. Fusion 81, 14–40. doi: 10.1016/j.inffus.2021.11.008
Balagopalan, A., Zhang, H., Hamidieh, K., Hartvigsen, T., Rudzicz, F., Ghassemi, M., et al. (2022). “The road to explainability is paved with bias: measuring the fairness of explanations,” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT '22 (New York, NY: Association for Computing Machinery), 1194–1206. doi: 10.1145/3531146.3533179
Beukeboom, C. J. (2014). “Mechanisms of linguistic bias: how words reflect and maintain stereotypic expectancies,” in Social Cognition and Communication, eds. J. P. Forgas, O. Vincze, and J. László (New York, NY: Psychology Press), 313–330.
Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. (2020). “Language (technology) is power: a critical survey of “bias” in NLP,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, eds. D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Association for Computational Linguistics), 5454–5476. doi: 10.18653/v1/2020.acl-main.485
Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., and Kalai, A. (2016). “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16 (Red Hook, NY: Curran Associates Inc), 4356–4364.
Bordia, S., and Bowman, S. R. (2019). “Identifying and reducing gender bias in word-level language models,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, eds. S. Kar, F. Nadeem, L. Burdick, G. Durrett, and N.-R. Han (Minneapolis: Association for Computational Linguistics), 7–15. doi: 10.18653/v1/N19-3002
Bugliarello, E., Cotterell, R., Okazaki, N., and Elliott, D. (2021). Multimodal pretraining unmasked: a meta-analysis and a unified framework of vision-and-language BERTs. Trans. Assoc. Comput. Linguist. 9, 978–994. doi: 10.1162/tacl_a_00408
Cabello, L., Bugliarello, E., Brandl, S., and Elliott, D. (2023). “Evaluating bias and fairness in gender-neutral pretrained vision-and-language models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, eds. H. Bouamor, J. Pino, and K. Bali (Singapore: Association for Computational Linguistics), 8465–8483. doi: 10.18653/v1/2023.emnlp-main.525
Chen, H., Janizek, J. D., Lundberg, S., and Lee, S.-I. (2020). True to the model or true to the data? arXiv [preprint]. arXiv:2006.16234. doi: 10.48550/arXiv.2006.16234
Cimitan, A., Alves Pinto, A., and Geierhos, M. (2024). “Curation of benchmark templates for measuring gender bias in named entity recognition models,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), eds. N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue (Torino: ELRA and ICCL), 4238–4246.
Clark, B., Wilming, R., Dox, A., Eschenbach, P., Hached, S., Wodke, D. J., et al. (2025). EXACT: towards a platform for empirically benchmarking machine learning model explanation methods. Meas. Sens. 38:101794. doi: 10.1016/j.measen.2024.101794
Clark, B., Wilming, R., and Haufe, S. (2024). XAI-TRIS: non-linear image benchmarks to quantify false positive post-hoc attribution of feature importance. Mach. Learn. 113, 6871–6910. doi: 10.1007/s10994-024-06574-3
Costa-jussà, M. R., Li Lin, P., and España-Bonet, C. (2020). “GeBioToolkit: automatic extraction of gender-balanced multilingual corpus of wikipedia biographies,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, eds. N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. (Marseille: European Language Resources Association), 4081–4088.
Dai, J., Upadhyay, S., Aivodji, U., Bach, S. H., and Lakkaraju, H. (2022). “Fairness via explanation quality: evaluating disparities in the quality of post hoc explanations,” in Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society (Oxford: ACM), 203–214. doi: 10.1145/3514094.3534159
Dehdarirad, T. (2025). Evaluating explainability in language classification models: a unified framework incorporating feature attribution methods and key factors affecting faithfulness. Data Inf. Manag. 9:100101. doi: 10.1016/j.dim.2025.100101
Dev, S., Li, T., Phillips, J. M., and Srikumar, V. (2020). On measuring and mitigating biased inferences of word embeddings. Proc. AAAI Conf. Artif. Intell. 34, 7659–7666. doi: 10.1609/aaai.v34i05.6267
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, Vol. 1 (Minneapolis, MN: ACL), 2.
DeYoung, J., Jain, S., Rajani, N. F., Lehman, E., Xiong, C., Socher, R., et al. (2020). “ERASER: a benchmark to evaluate rationalized NLP models,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, eds. D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Association for Computational Linguistics), 4443–4458. doi: 10.18653/v1/2020.acl-main.408
Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H., Smith, N., et al. (2020). Fine-tuning pretrained language models: weight initializations, data orders, and early stopping. arXiv [preprint]. arXiv:2002.06305. doi: 10.48550/arXiv.2002.06305
Friedman, L., and Wall, M. (2005). Graphical views of suppression and multicollinearity in multiple linear regression. Am. Stat. 59, 127–136. doi: 10.1198/000313005X41337
Fryer, D. V., Strümke, I., and Nguyen, H. (2020). Explaining the data or explaining a model? Shapley values that uncover non-linear dependencies. arXiv [preprint]. arXiv:2007.06011. doi: 10.48550/arXiv.2007.06011
Gamboa, L. C. L., and Lee, M. (2024). “A novel interpretability metric for explaining bias in language models: applications on multilingual models from Southeast Asia,” in Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, eds. N. Oco, S. N. Dita, A. M. Borlongan, and J.-B. Kim (Tokyo: Tokyo University of Foreign Studies), 296–305.
Gao, Z., Feng, A., Song, X., and Wu, X. (2019). Target-dependent sentiment classification with bert. IEEE Access 7, 154290–154299. doi: 10.1109/ACCESS.2019.2946594
Gonen, H., and Goldberg, Y. (2019). “Lipstick on a pig: debiasing methods cover up systematic gender biases in word embeddings but do not remove them,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), eds. J. Burstein, C. Doran, and T. Solorio (Minneapolis, MN: Association for Computational Linguistics), 609–614.
Graells-Garrido, E., Lalmas, M., and Menczer, F. (2015). “First women, second sex: gender bias in Wikipedia,” in Proceedings of the 26th ACM Conference on Hypertext & Social Media (New York, NY: ACM), 165–174. doi: 10.1145/2700171.2791036
Haufe, S., Meinecke, F., Görgen, K., Dähne, S., Haynes, J.-D., Blankertz, B., et al. (2014). On the interpretation of weight vectors of linear models in multivariate neuroimaging. NeuroImage 87, 96–110. doi: 10.1016/j.neuroimage.2013.10.067
Haufe, S., Wilming, R., Clark, B., Zhumagambetov, R., Panknin, D., Boubekki, A., et al. (2024). “Position: XAI needs formal notions of explanation correctness,” in Interpretable AI: Past, Present and Future.
Hooker, S., Erhan, D., Kindermans, P.-J., and Kim, B. (2019). “A benchmark for interpretability methods in deep neural networks,” in Advances in Neural Information Processing Systems, Vol. 32 (New York, NY: Curran Associates, Inc).
Ismail, A. A., Gunady, M., Corrada Bravo, H., and Feizi, S. (2020). Benchmarking deep learning interpretability in time series predictions. Adv. Neural Inf. Process. Syst. 33, 6441–6452.
Ismail, A. A., Gunady, M., Pessoa, L., Corrada Bravo, H., and Feizi, S. (2019). “Input-cell attention reduces vanishing saliency of recurrent neural networks,” in Advances in Neural Information Processing Systems, Vol. 32 (New York, NY: Curran Associates, Inc).
Jacovi, A., and Goldberg, Y. (2020). “Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness?” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics), 4198–4205. doi: 10.18653/v1/2020.acl-main.386
Jentzsch, S., and Turan, C. (2022). “Gender bias in BERT - measuring and analysing biases through sentiment rating in a realistic downstream classification task,” in Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP) (Seattle, WA: Association for Computational Linguistics), 184–199. doi: 10.18653/v1/2022.gebnlp-1.20
Jiménez-Luna, J., Grisoni, F., and Schneider, G. (2020). Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2, 573–584. doi: 10.1038/s42256-020-00236-4
Joshi, R. K., Chatterjee, A., and Ekbal, A. (2024). Saliency guided debiasing: detecting and mitigating biases in LMs using feature attribution. Neurocomputing 563:126851. doi: 10.1016/j.neucom.2023.126851
Kaushik, D., Hovy, E., and Lipton, Z. C. (2019). Learning the difference that makes a difference with counterfactually-augmented data. arXiv [preprint]. arXiv:1909.12434. doi: 10.48550/arXiv.1909.12434
Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al. (2018). “Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV),” in International Conference on Machine Learning (PMLR), 2668–2677.
Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Alsallakh, B., Reynolds, J., et al. (2020). Captum: a unified and generic model interpretability library for PyTorch. arXiv [preprint]. arXiv:2009.07896. doi: 10.48550/arXiv.2009.07896
Liu, Q., Kusner, M., and Blunsom, P. (2021). “Counterfactual data augmentation for neural machine translation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, eds. K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, et al. (Association for Computational Linguistics), 187–197. doi: 10.18653/v1/2021.naacl-main.18
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv [preprint]. arXiv:1907.11692. doi: 10.48550/arXiv.1907.11692
Lundberg, S. M., and Lee, S.-I. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems 30, eds. I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (New York, NY: Curran Associates, Inc), 4765–4774.
Lundberg, S. M., Nair, B., Vavilala, M. S., Horibe, M., Eisses, M. J., Adams, T., et al. (2018). Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2, 749–760. doi: 10.1038/s41551-018-0304-0
Manzini, T., Yao Chong, L., Black, A. W., and Tsvetkov, Y. (2019). “Black is to criminal as caucasian is to police: detecting and removing multiclass bias in word embeddings,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), eds. J. Burstein, C. Doran, and T. Solorio (Minneapolis, MN: Association for Computational Linguistics), 615–621. doi: 10.18653/v1/N19-1062
Mhasawade, V., Rahman, S., Haskell-Craig, Z., and Chunara, R. (2024). “Understanding disparities in post hoc machine learning explanation,” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (Rio de Janeiro: ACM), 2374–2388. doi: 10.1145/3630106.3659043
Mitchell, T. M. (2007). “The need for biases in learning generalizations,” in Rutgers CS tech report, CBM-TR-117.
Montañez, G. D., Hayase, J., Lauw, J., Macias, D., Trikha, A., and Vendemiatti, J. (2019). “The futility of bias-free learning and search,” in Australasian Joint Conference on Artificial Intelligence (Cham: Springer), 277–288. doi: 10.1007/978-3-030-35288-2_23
Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. 116, 22071–22080. doi: 10.1073/pnas.1900654116
Nadeem, M., Bethke, A., and Reddy, S. (2021). “StereoSet: measuring stereotypical bias in pretrained language models,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), eds. C. Zong, F. Xia, W. Li, and R. Navigli (Association for Computational Linguistics), 5356–5371. doi: 10.18653/v1/2021.acl-long.416
Nangia, N., Vania, C., Bhalerao, R., and Bowman, S. R. (2020). “CrowS-pairs: a challenge dataset for measuring social biases in masked language models,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), eds. B. Webber, T. Cohn, Y. He, and Y. Liu (Association for Computational Linguistics), 1953–1967. doi: 10.18653/v1/2020.emnlp-main.154
Navigli, R., Conia, S., and Ross, B. (2023). Biases in large language models: origins, inventory, and discussion. J. Data Inf. Qual. 15, 1–21. doi: 10.1145/3597307
Oliveira, M., Wilming, R., Clark, B., Budding, C., Eitel, F., Ritter, K., et al. (2024). Benchmarking the influence of pre-training on explanation performance in MR image classification. Front. Artif. Intell. 7:1330919. doi: 10.3389/frai.2024.1330919
OpenAI (2023). GPT-4 technical report. arXiv [preprint]. arXiv:2303.08774. doi: 10.48550/arXiv.2303.08774
Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., et al. (2022). “BBQ: a hand-built bias benchmark for question answering,” in Findings of the Association for Computational Linguistics: ACL 2022, eds. S. Muresan, P. Nakov, and A. Villavicencio (Association for Computational Linguistics), 2086–2105. doi: 10.18653/v1/2022.findings-acl.165
Prost, F., Thain, N., and Bolukbasi, T. (2019). “Debiasing embeddings for reduced gender bias in text classification,” in Proceedings of the First Workshop on Gender Bias in Natural Language Processing, eds. M. R. Costa-jussà, C. Hardmeier, W. Radford, and K. Webster (Florence: Association for Computational Linguistics), 69–75. doi: 10.18653/v1/W19-3810
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving Language Understanding by Generative Pre-training. OpenAI.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1:9.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). ““Why should I trust you?” Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY: ACM), 1135–1144. doi: 10.1145/2939672.2939778
Rong, Y., Leemann, T., Borisov, V., Kasneci, G., and Kasneci, E. (2022). “A consistent and efficient evaluation strategy for attribution methods,” in Proceedings of the 39th International Conference on Machine Learning, Vol. 162, eds. K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato (PMLR), 18770–18795.
Rychener, Y., Renard, X., Seddah, D., Frossard, P., and Detyniecki, M. (2020). QUACKIE: a NLP classification task with ground truth explanations. arXiv [preprint]. arXiv:2012.13190. doi: 10.48550/arXiv.2012.13190
Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K., and Müller, K.-R. (Eds.) (2019). Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of Lecture Notes in Computer Science. Cham: Springer International Publishing. doi: 10.1007/978-3-030-28954-6
Shrikumar, A., Greenside, P., and Kundaje, A. (2017). “Learning important features through propagating activation differences,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17 (JMLR), 3145–3153.
Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv [preprint]. arXiv:1312.6034. doi: 10.48550/arXiv.1312.6034
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28, 11–21. doi: 10.1108/eb026526
Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2015). “Striving for simplicity: the all convolutional net,” in ICLR (Workshop Track).
Sullivan, E. (2024). “SIDEs: separating idealization from deceptive ‘explanations' in xAI,” in The 2024 ACM Conference on Fairness, Accountability, and Transparency (Rio de Janeiro: ACM), 1714–1724. doi: 10.1145/3630106.3658999
Sundararajan, M., Taly, A., and Yan, Q. (2017). “Axiomatic attribution for deep networks,” in Proceedings of the 34th International Conference on Machine Learning, Volume 70 of Proceedings of Machine Learning Research, eds. D. Precup, and Y. W. Teh (PMLR), 3319–3328.
Tjoa, E., and Guan, C. (2023). Quantifying explainability of saliency methods in deep neural networks with a synthetic dataset. IEEE Trans. Artif. Intell. 4, 858–870. doi: 10.1109/TAI.2022.3228834
Tran, K. A., Kondrashova, O., Bradley, A., Williams, E. D., Pearson, J. V., Waddell, N., et al. (2021). Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med. 13:152. doi: 10.1186/s13073-021-00968-x
Wilming, R., Budding, C., Müller, K.-R., and Haufe, S. (2022). Scrutinizing xai using linear ground-truth data with suppressor variables. Mach. Learn. 111, 1903–1923. doi: 10.1007/s10994-022-06167-y
Wilming, R., Kieslich, L., Clark, B., and Haufe, S. (2023). “Theoretical behavior of XAI methods in the presence of suppressor variables,” in Proceedings of the 40th International Conference on Machine Learning, Volume 202 of Proceedings of Machine Learning Research, eds. A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett (PMLR), 37091–37107.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., Le, Q. V., et al. (2019). “XLNet: generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, Vol. 32 (New York, NY: Curran Associates, Inc).
Yu, J., and Jiang, J. (2019). “Adapting BERT for target-oriented multimodal sentiment classification,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 5408–5414. doi: 10.24963/ijcai.2019/751
Zakizadeh, M., Miandoab, K., and Pilehvar, M. (2023). “DiFair: a benchmark for disentangled assessment of gender knowledge and bias,” in Findings of the Association for Computational Linguistics: EMNLP 2023, eds. H. Bouamor, J. Pino, and K. Bali (Singapore: Association for Computational Linguistics), 1897–1914. doi: 10.18653/v1/2023.findings-emnlp.127
Zhang, Y., Weng, Y., and Lund, J. (2022). Applications of explainable artificial intelligence in diagnosis and surgery. Diagnostics 12:237. doi: 10.3390/diagnostics12020237
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. (2017). “Men also like shopping: reducing gender bias amplification using corpus-level constraints,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, eds. M. Palmer, R. Hwa, and S. Riedel (Copenhagen: Association for Computational Linguistics), 2979–2989. doi: 10.18653/v1/D17-1323
Zheng, S., and Yang, M. (2019). “A new method of improving BERT for text classification,” in Intelligence Science and Big Data Engineering. Big Data and Machine Learning: 9th International Conference, IScIDE 2019, Nanjing, China, October 17-20, 2019, Proceedings, Part II 9 (Cham: Springer), 442–452. doi: 10.1007/978-3-030-36204-1_37
Keywords: XAI, NLP, benchmark, dataset, explainability, interpretability, language models
Citation: Wilming R, Dox A, Schulz H, Oliveira M, Clark B and Haufe S (2026) GECOBench: a gender-controlled text dataset and benchmark for quantifying biases in explanations. Front. Artif. Intell. 8:1694388. doi: 10.3389/frai.2025.1694388
Received: 28 August 2025; Accepted: 31 October 2025;
Published: 05 January 2026.
Edited by:
Andreas Kanavos, Ionian University, Greece
Reviewed by:
Pruthwik Mishra, Sardar Vallabhbhai National Institute of Technology Surat, India
Gunjan Kumar, BBS Group of Institutions, India
Varun Dogra, Lovely Professional University, India
Poornima Shetty, Manipal Academy of Higher Education, India
Alexandre De Spindler, Zurich University of Applied Sciences, Switzerland
Copyright © 2026 Wilming, Dox, Schulz, Oliveira, Clark and Haufe. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Stefan Haufe, haufe@tu-berlin.de