
CORRECTION article

Front. Artif. Intell.

Sec. AI for Human Learning and Behavior Change

This article is part of the Research Topic "AI4Science: New Paradigms and Trends".

AI for scientific integrity: detecting ethical breaches, errors, and misconduct in manuscripts

Provisionally accepted
  • University of Saskatchewan, Vaccine and Infectious Disease Organization, International Vaccine Centre (VIDO-InterVac), Saskatoon, Saskatchewan, Canada

The final, formatted version of the article will be published soon.

In the published article, there was an error in the Abstract. The text read "industry initiatives to AI-driven papermills". This has been corrected to read "industry initiatives to combat AI-driven papermills".

In addition, the published article contained a partial sentence that should have been removed from the final draft: the phrase "' looks better, but then again it might be due to my foreigner non-standard syntax" appeared immediately before the reference to (Jiang et al., 2024). This phrase has been removed. A correction has been made to the Introduction, GenAI detection tools, 5th paragraph:

"This result, however, is counterintuitive. Non-native speakers are expected to use more loan words, construct sentences with non-standard syntax, and make more grammatical errors, traits that would typically increase perplexity, not decrease it. These features make their writing appear less natural and less similar to the LLMs' training corpus, and therefore harder to predict. A more recent and rigorous study used a larger dataset and perplexity estimations using unpublished detectors based on GPT-2 to revisit this issue (Jiang et al., 2024). It analyzed a mixed dataset of native and non-native English GRE writing assessments containing both HWT and MGT. Contrary to the earlier claims, this analysis showed that non-native texts had the highest perplexity, while MGTs consistently had much lower perplexity. Using this feature alone, the authors reported 99.9% accuracy in detecting MGTs. These conflicting findings may be explained by differences in dataset composition, detector models, or evaluation design. The earlier study might have used small or biased datasets, or misinterpreted correlations between writing style and perplexity. The later study's use of real educational writing and unpublished detectors with stricter evaluation may offer a more accurate reflection of cross-linguistic variation. This contrast highlights the need for careful consideration of language background in AI detector evaluation, and it raises important concerns about cross-linguistic generalizability and fairness in MGT detection."

The published article also mentioned "BenchGPT" where it should have read "MGTBench", and the same paragraph reported F0-scores where F1-scores were intended. A correction has been made to the Introduction, GenAI detection tools, 3rd paragraph:

"In order to compare how different detectors were able to differentiate HWT and MGT from different LLMs (ChatGLM, Dolly, ChatGPT-turbo, GPT4All, StableLM, and Claude) and across different types of corpora (academic essays, short stories, and news articles), MGTBench (He et al., 2024) created datasets of each genre containing 1,000 HWT and 1,000 MGT (from those LLMs). The study showed that all detectors are sensitive to changes in the selection of their training dataset. There is a trade-off where detectors that are robust against genre changes, like ConDA (Bhattacharjee et al., 2023) (F1-score when trained with news dropped from 0.99 to 0.67 when testing essays), are poor at detecting MGT created with a model different from the one they were trained on; when trained with StableLM (Hugging Face, n.d.), the F1-score testing Claude drops to 0.00. On the other hand, detectors like DEMASQ (Kumari et al., 2023) that are robust against changes in LLM (F1-score drop from 0.92 to 0.71 when trained on ChatGPT-Turbo and tested on MGT from StableLM) fail when there is a change in genre (F1-score of 0.23 when trained on news and testing essays)."

The original version of this article has been updated.
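The perplexity feature at the center of the corrected 5th paragraph can be made concrete with a short, self-contained sketch. The snippet below is illustrative only and is not part of the corrected article: it scores text with the public GPT-2 model from the Hugging Face transformers library, which is an assumption about the general approach rather than a reimplementation of the unpublished detectors described in Jiang et al. (2024) or the MGTBench pipeline, and the example sentences are made up.

```python
# Illustrative sketch only: perplexity of a passage under the public GPT-2 model.
# The detectors discussed above rely on perplexity-like scores; this is a generic
# example, not the pipeline of Jiang et al. (2024) or He et al. (2024).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Exponentiated mean token-level cross-entropy of `text` under GPT-2."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Supplying labels makes the model return the average next-token loss.
        out = model(enc.input_ids, labels=enc.input_ids)
    return float(torch.exp(out.loss))

# Hypothetical examples: lower perplexity means GPT-2 finds the text easier to
# predict, the property many MGT detectors treat as a signal of machine generation.
print(perplexity("The results of the experiment were consistent with the hypothesis."))
print(perplexity("Loan words and unusual syntax often raise perplexity for non-native writers."))
```

Under this kind of setup, a detection rule is typically a threshold on the returned score, which is why the dataset composition and language background issues discussed in the corrected paragraph matter for fairness.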

Keywords: artificial intelligence - AI, LLM, misconduct detection, generative AI, scientific integrity

Received: 04 Sep 2025; Accepted: 24 Oct 2025.

Copyright: © 2025 Pellegrina and Helmy. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Mohamed Helmy, helmy.sfc@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.