
ORIGINAL RESEARCH article

Front. Artif. Intell., 05 February 2026

Sec. Natural Language Processing

Volume 9 - 2026 | https://doi.org/10.3389/frai.2026.1708993

This article is part of the Research Topic "The Use of Large Language Models to Automate, Enhance, and Streamline Text Analysis Processes. Large Language Models Used to Analyze and Check Requirement Compliance."

Small language models applied in text summarization task of health-related news to improve public health audit: an experimental case study

  • 1Postgraduate Program in Computer Science (PROCC), Federal University of Sergipe, São Cristóvão, Brazil
  • 2Laboratory for Technological Innovation in Health (LAIS), Onofre Lopes University Hospital, Federal University of Rio Grande do Norte (UFRN), Natal, Brazil
  • 3Department of Science and Technology, Federal University of Rio Grande do Norte, Natal, Brazil
  • 4Center for Innovation and Advanced Technology (NAVI), Federal Institute of Rio Grande do Norte (IFRN), Natal, Brazil

Context: Fraud and corruption are among the main crimes affecting public institutions, with the healthcare sector being particularly vulnerable due to its structural complexity, the coexistence of public and private providers, the large number of actors involved, the globalized nature of supply chains, the high financial costs, and the information asymmetry among stakeholders. These factors weaken healthcare systems, resulting in resource waste, reduced resilience during medical emergencies, and limited access to essential services.

Objective: This study aims to evaluate automatic text summarization methods by comparing the quality of machine-generated summaries with those produced by humans, from the perspective of Data Scientists and SUS Auditors, within the context of audits carried out by the National Department of Unified Health System (Sistema Único de Saúde—SUS) Auditing (AudSUS).

Method: A controlled experiment was conducted to assess the performance of Small Language Models (SLMs) in summarization tasks, using the metrics ROUGE-N, ROUGE-L, BLEU, METEOR, and BERTScore. In addition, the consistency of results across 35 runs, their contribution to reducing information overload, and their pairwise performances were evaluated.

Results: The models NousResearch/Hermes-3-Llama-3.2-3B, Qwen/Qwen2.5-7B-Instruct, and meta-llama/Llama-3.2-3B-Instruct achieved the highest average performances across all metrics, standing out for their ability to preserve contextual meaning and synthesize essential information more effectively than human-generated summaries.

Conclusion: The findings highlight the potential of SLMs as tools to reduce information overload, thereby enhancing the effectiveness of the analytical phase of audits and enabling faster preparation of teams for the operational stage.

1 Introduction

Currently, most of the user-generated data in society is unstructured and textual, and Natural Language Processing (NLP) has gradually been applied in public administration. Governments and public institutions have employed this technology to process large volumes of documents with the aim of improving the quality of public services, increasing citizens' trust in institutions, and enhancing efficiency and effectiveness, particularly in functional areas such as healthcare, education, and decision-making (Jiang et al., 2023).

Many governments need to analyze, in real time, multiple sources of information, both static and dynamic, to monitor public and private cameras, citizens' comments on social media, online transactions, and events. This analysis seeks to identify patterns, establish correlations, and build predictive models that enable strategy optimization and improvement of services offered to citizens. Another key objective is to ensure the monitoring and surveillance necessary to protect the population and mitigate the impact of crimes (Benjelloun et al., 2015).

Among these crimes, fraud and corruption stand out, with the healthcare sector being particularly vulnerable due to several factors. These include the complexity of health systems, which combine public and private providers; the large number of people involved; the globalized nature of supply chains; high public and private expenditures; and information asymmetry among actors, which can negatively affect decision-making in the sector. Such vulnerabilities weaken healthcare systems, resulting in resource waste, reduced resilience during medical emergencies, and compromised coverage and access to essential health services (Mackey et al., 2018).

A study conducted by Transparency International, a global civil society organization against corruption, revealed that more than 50% of citizens in 42 out of 109 surveyed countries believe that the healthcare sector in their country is corrupt or highly corrupt. Furthermore, the World Health Organization (WHO) estimated that, out of the US$ 7.5 trillion spent on healthcare worldwide in 2008, US$ 415 billion (7.3%) was lost due to fraud or corruption in the sector (Mackey et al., 2018).

The impacts of corruption extend beyond financial losses, encompassing social consequences, particularly in low-income countries. In these regions, both immediate and long-term effects include increased morbidity and mortality, due to the barriers created by corruption in accessing healthcare services, disproportionately affecting the most vulnerable groups. Corruption undermines the quality of healthcare systems and distorts the allocation of investments in the sector (Mackey et al., 2018).

Fraud also damages organizational reputation and public trust, making the implementation of strategies to prevent, detect, and mitigate these risks vital. One effective mechanism in combating fraud is the use of ombudsman offices and reporting channels, which play a central role in compliance systems by enabling the receipt and handling of fraud and corruption complaints (Paula et al., 2024).

Furthermore, audits represent another crucial tool to mitigate these crimes and their impacts. However, the large volume of data presents significant challenges, including information overload, which makes the auditing process complex and difficult to conduct (do Amaral et al., 2020; Paula et al., 2024).

The auditing process is generally costly, time-consuming, and requires substantial human and material resources. Therefore, it is necessary to implement solutions and techniques that automate the analysis of corruption reports. This process is typically divided into two stages: in the first, the goal is to identify elements and evidence of corruption, such as suppliers, contracts, employees, clients, and other stakeholders, by assessing the plausibility and consistency of the reports and fraud indicators; in the second stage, the investigation itself takes place (Paula et al., 2024).

To build the knowledge required for audit work, information gathering about the audit's objectives must be carried out. At this stage, various sources are used, including websites (Fontes et al., 2023), and, to support the information collection process, web scraping techniques for large-scale data extraction from health-related websites can be employed. For analyzing this large volume of data, NLP techniques such as text summarization can be applied, reducing the time and resources required for the analysis and evidence collection of potential irregularities (Sanchez-Gomez et al., 2022; Benjelloun et al., 2015; Madureira et al., 2021).

Based on this scenario, this article evaluates and proposes the use of small language models, following the experimental process described in ? and ?, in order to investigate the ability of these models to support the auditing process and to identify the most suitable models for this task. The main motivation is to support the auditing process by reducing information overload in the analytical phase of planning.

The article is structured as follows: Section 2 presents the literature on small language models. Section 3 details the dataset employed, the process of creating the reference set, and the evaluation metrics adopted. In Section 4, the experiment's objectives, planning, research questions, dependent and independent variables, selection of study objects, experimental design, and instrumentation are specified. Section 5 describes the procedures for data preparation, execution, and validation. Subsequently, Section 6 presents the results obtained and discusses the threats to validity. Finally, Section 7 provides the concluding remarks and proposes directions for future work.

2 Related work

Large Language Models (LLMs) have driven a paradigm shift in Natural Language Processing (NLP) (?). These models have demonstrated emergent capabilities in text generation, question answering, and reasoning, facilitating tasks across multiple domains (Wang et al., 2025). The field of NLP has been profoundly transformed by the ability of LLMs to perform downstream tasks after being trained on massive datasets under a self-supervised learning regime (?).

Transformer-based models, such as BERT, RoBERTa, mT5, and the GPT family, have become the foundation for a wide range of applications and research directions in NLP (?). The applications of LLMs in NLP encompass a wide range of tasks, including Question Answering (QA) (Ren, 2024), text classification or categorization into predefined classes (e.g., sentiment, fake news, topics) (Fields et al., 2024), text summarization (Van Veen et al., 2024), sentiment analysis or polarity detection (positive, negative, or neutral) (Ren, 2024), and Named Entity Recognition (NER) (Luo et al., 2023). These diverse tasks are employed across domains such as health, education, and industry (Raza et al., 2025).

The development of LLMs has rapidly expanded, encompassing both proprietary models such as ChatGPT, Bard, and Claude, and open-source models such as Llama (Wang et al., 2025). Current research largely focuses on model scalability, training data, efficiency, and the overall capabilities of these architectures (?).

As argued in ?, despite these advancements, progress with LLMs has not occurred uniformly across languages. Most models are trained on high-resource languages, such as English, whereas multilingual models typically exhibit lower performance than their monolingual counterparts. This disparity arises from imbalances in training corpora, in which high-resource languages predominate, ultimately resulting in user dissatisfaction with multilingual models when applied to low-resource languages.

The practical use of LLMs still faces several limitations, including high computational costs and restrictive licensing regimes, privacy and data security concerns, infeasibility on low-power or edge devices, high inference latency, and suboptimal performance in specialized domains (Wang et al., 2025; ?; Corrêa et al., 2025). An alternative to mitigate these constraints lies in the adoption of Small Language Models (SLMs).

Small Language Models (SLMs) have gained increasing attention as promising alternatives to LLMs. The exact definition of SLMs may vary; however, they are generally characterized by having fewer parameters than LLMs, with some classifications considering models with fewer than one billion parameters. Other definitions classify them as “small” relative to their larger counterparts, encompassing models with up to 10 billion parameters, while emphasizing the absence of emergent abilities typically observed in larger LLMs (Wang et al., 2025).

SLMs stand out for their low inference latency, cost-effectiveness, development efficiency, and ease of customization and adaptability (Wang et al., 2025). They offer significant computational savings in both pre-training and inference, requiring reduced memory and storage, which is particularly relevant for applications with constrained resources (Wang et al., 2025). Such characteristics make SLMs especially suitable for resource-limited environments, including edge devices and real-time applications, where they can enhance privacy, security, and response times through on-device processing (Wang et al., 2025; ?). Moreover, when fine-tuned for specific domains, SLMs can match or even surpass the performance of larger models in specialized tasks (Wang et al., 2025).

Text summarization can be either abstractive or extractive. An abstractive summary generates new content that does not exist in the original document, creating novel words and sentences. In contrast, an extractive summary selects a subset of sentences from the original text to form the summary (Sanchez-Gomez et al., 2020). Furthermore, summaries can be classified as generic or query-oriented. Generic summaries do not require any user input (Sanchez-Gomez et al., 2018), whereas query-oriented summaries require some form of user-provided information, typically a query or topic of interest in sentence form (Sanchez-Gomez et al., 2024). Additionally, summarization methods can be categorized as single-document or multi-document. Single-document methods condense the information from a single text into a concise summary, while multi-document methods extract key information from a set of documents (Saini et al., 2019). Summarization approaches can also be classified as supervised or unsupervised (Alguliyev et al., 2015).

This study focuses on the application of SLMs for the NLP task of automatic text summarization. For this purpose, the following models were employed: BART (Lewis et al., 2019), Gemma (Gemma Team et al., 2024), Sabiá (Pires et al., 2023), Llama (Team, 2024a), TeenyTinyLlama (?), Hermes (Teknium et al., 2024), Qwen (Team, 2024b), and Tucano (Corrêa et al., 2025). Additionally, extractive models were considered, including TextRank (Nenkova and McKeown, 2011), LexRank (Erkan and Radev, 2004), LSA (Steinberger and JeŽek, 2004), KLSum (Haghighi and Vanderwende, 2009), and SumBasic (Woodsend and Lapata, 2011).

3 Materials and methods

This is an experimental study, following the steps outlined by ? for evaluating the results of text summarization methods applied to health-related news articles with indications of irregularities, assessing the quality of summaries using the ROUGE-N, ROUGE-L, BLEU, METEOR, and BERTScore metrics.

The description of the dataset employed, the process of selecting news articles with evidence of irregularities, and the evaluation metrics for the automatic summaries are presented in Sections 3.1–3.3, respectively.

The replication of experiments is a key characteristic of any scientific field. In the software domain, it is therefore essential to apply methods that can be replicated and evaluated, in order to prevent new methods, techniques, languages, and tools from being proposed, published, or marketed without proper experimentation and validation (Travassos et al., 2020).

3.1 Database collection

The construction of the database, including all stages of data collection and storage, lies outside the scope of this study. This section describes the process conducted by Fontes et al. (2023), which was carried out in three stages. Table 1 describes the stages of database collection. The first stage consisted of a proof of concept based on interviews with auditors from the National Department of Unified Health System (Sistema Único de Saúde—SUS) Auditing (AudSUS), aiming to clarify the process of acquiring the material used in the analytical phase of auditing. The audit process implemented by AudSUS consists of a set of activities and the preparation of documents that specify the tasks to be performed. These tasks are organized into three phases: the analytical phase, the operational (in loco) phase, and the final report. The analytical phase corresponds to audit planning, whose purpose is to prepare the team for the operational stage. This preparation involves building the necessary knowledge for the audit through the collection of information related to its objectives. The operational phase comprises the audit itself, while the final report presents the audit's overall findings.


Table 1. Database creation stages.

The second stage involved an exploratory study aimed at mapping how textual news articles are organized in the sources, identifying their publication routines, periodicity, existing limitations, and possible solutions for cases of missing or incomplete data. Upon completion of this mapping, the construction of a model responsible for storing this data was initiated.

In the third stage, the database itself was built. For this purpose, a data collection pipeline was developed using the Python programming language, the Django framework, the PostgreSQL database, and OpenSearch for indexing the results. The collection of news articles was performed with specialized crawlers capable of reading, interpreting, and collecting metadata from the sources. These crawlers were configured to collect only new articles or to reprocess previous publications. This functionality enables more frequent searches in sources that publish several articles daily, as opposed to those that publish only once a day, monthly, or occasionally, such as audit reports and official journals. The final database contains more than 6 million textual articles from research sources indicated by auditors consulted in the initial stage.

After collection, the articles undergo a preprocessing pipeline designed to clean the texts and standardize their metadata according to the established model. This step is essential to ensure data quality and its proper storage in both the database and the indexing tool.

3.2 Health-related news selection

The information collected by Fontes et al. (2023) was organized into a database containing 154,407 news articles with metadata regarding publication date, source (website), news title, headline, and full content. All data manipulation processes were conducted using Jupyter Notebook, the Polars library (version 1.32.0), and the Python programming language (version 3.10.12).

To identify health-related news articles with potential indications of audit relevance, a three-stage keyword-based selection strategy was applied. In the first stage, across the entire dataset, articles containing the keyword “saúde” (health) in their content (corpus) were classified as “Health” while all others were classified as “Generic News.” The generic articles were excluded from subsequent stages.

In the second stage, keywords indicating potential irregularities were applied to the subset of articles previously classified as “Health,” dividing them into “Generic Health” and “Health with Irregularities.” Articles containing at least one of the keywords in the title and/or headline were classified as “Irregularity News.” At this stage, 6,239 articles were identified as containing indications of irregularity, while the remaining 3,277 were classified as “Generic Health.” Table 2 presents the list of keywords used.
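As an illustration of this staged filtering, the sketch below uses the Polars library referenced in Section 3.2; the file name, column names, and the keyword subset are assumptions for demonstration, not the exact configuration of the original pipeline.

```python
import polars as pl

# Hypothetical input: the 154,407 collected articles with title, headline, and content columns.
df = pl.read_parquet("news_articles.parquet")

# Stage 1: articles mentioning "saúde" (health) in the content vs. generic news.
health = df.filter(pl.col("content").str.to_lowercase().str.contains("saúde"))

# Stage 2: within the health subset, flag titles/headlines containing irregularity keywords
# (illustrative subset of Table 2).
keywords = ["fraude", "corrupção", "desvio", "superfaturamento"]
pattern = "|".join(keywords)
irregularities = health.filter(
    pl.col("title").str.to_lowercase().str.contains(pattern)
    | pl.col("headline").str.to_lowercase().str.contains(pattern)
)

print(health.height, irregularities.height)
```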


Table 2. Keywords used to identify signs of irregularities.

In the third stage, articles belonging to the “Irregularity News” subgroup were independently assessed by two annotators. A substantial inter-rater agreement (IRA) (Landis and Koch, 1977) was achieved, with a Cohen's Kappa (k) value of 0.6203. Table 3 presents the contingency table of the evaluations. Five additional evaluators were assigned to resolve cases of disagreement in the classification. Each evaluator was individually responsible for making a final decision on 65 distinct samples. The assessment involved reviewing the titles and abstracts of all articles to confirm their categorization as “Health Irregularity,” “Generic Health,” or “Generic News.”
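The reported inter-rater agreement can be reproduced with a standard implementation of Cohen's Kappa; the sketch below assumes scikit-learn (not listed in Section 4.9) and uses toy labels rather than the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Toy label vectors standing in for the two annotators' categorizations.
annotator_1 = ["Health Irregularity", "Generic Health", "Generic News", "Health Irregularity"]
annotator_2 = ["Health Irregularity", "Generic Health", "Health Irregularity", "Health Irregularity"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.4f}")  # the study reports k = 0.6203 on the full annotation set
```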


Table 3. Contingency matrix of annotations between Evaluator 1 and Evaluator 2.

Finally, after the third stage, 421 health-related articles with indications of irregularity were identified. Figure 1 illustrates the news selection process.

Figure 1. Health-related news selection process: keyword filtering splits the 154,407 collected items into 144,891 generic and 9,516 health-related articles; irregularity keywords then separate 3,277 generic health items from 6,239 candidate irregularity items, of which human evaluation confirms 421.

For the use of this dataset in the text summarization task, it was necessary to create summaries of the original articles to serve as reference summaries for the evaluation of the automatically generated ones. These reference summaries were produced by an external journalist researcher.

3.3 Evaluation metrics

This subsection describes the evaluation metrics adopted in the experiment. The metrics employed were ROUGE-N (Lin, 2004), ROUGE-L (Lin, 2004), BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and BERTScore (Zhang et al., 2020).

3.3.1 ROUGE-N

The Recall-Oriented Understudy for Gisting Evaluation (ROUGE-N) (Lin, 2004) is widely employed in the literature to assess the quality of summaries generated by automatic summarization methods. This metric evaluates summary quality by measuring the overlap of word sequence units (N-grams) and word pairs between the automatically generated summary and the reference summary (El-Kassas et al., 2021). The formal definition is given in Equation 1.

\[ \text{ROUGE-N} = \frac{\sum_{S \in \text{RefSummaries}} \sum_{\text{N-gram} \in S} \text{Count}_{\text{match}}(\text{N-gram})}{\sum_{S \in \text{RefSummaries}} \sum_{\text{N-gram} \in S} \text{Count}(\text{N-gram})} \quad (1) \]

where RefSummaries denotes the set of reference summaries used for comparison with the automatically generated summary, N-grams refer to consecutive segments of N words (or tokens) in a sentence or text, Countmatch(N-gram) represents the number of times a specific N-gram from the reference summary appears in the generated summary—thus indicating the count of overlapping N-grams between the reference and generated summaries—while Count(N-gram) is the total count of N-grams in the reference summary. The denominator's summation accounts for all possible N-grams that could have been captured from the reference summary.
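In practice, ROUGE-1, ROUGE-2, and ROUGE-L can be obtained with the rouge-score package listed in Section 4.9; the sentence pair below is illustrative, and the optional built-in stemmer is English-oriented, so it is disabled here for Portuguese text.

```python
from rouge_score import rouge_scorer

reference = "Auditoria aponta desvio de recursos na compra de medicamentos."
generated = "Auditoria identifica desvio de recursos em compra de medicamentos."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(reference, generated)  # each entry has precision, recall, and fmeasure
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```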

3.3.2 ROUGE-L

The Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence (ROUGE-L) (Lin, 2004) is an automatic evaluation measure for text quality, based on the computation of the Longest Common Subsequence (LCS) between a reference text R and an automatically generated text H.

Let LCS(R, H) denote the length of the longest common subsequence between R and H. Recall, Precision, and Fβ (F1-score) are defined by Equations 2–4, respectively. Here, |R| represents the length of the reference sequence and |H| the length of the generated sequence. For the computation of Fβ, the parameter β is typically set to 1, yielding the F1-score.

The use of the longest common subsequence provides ROUGE-L with the ability to capture the global structure and the relative order of words, without the strict contiguity constraints required by N-grams. This characteristic distinguishes ROUGE-L from ROUGE-N, allowing it to reflect textual similarity at a more flexible level.

\[ R_{LCS} = \frac{LCS(R, H)}{|R|} \quad (2) \]
\[ P_{LCS} = \frac{LCS(R, H)}{|H|} \quad (3) \]
\[ F_{LCS} = \frac{(1 + \beta^2) \cdot R_{LCS} \cdot P_{LCS}}{R_{LCS} + \beta^2 \cdot P_{LCS}} \quad (4) \]
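For clarity, Equations 2–4 can be reproduced directly from a longest-common-subsequence computation; the sketch below tokenizes by whitespace, which is a simplification of the tokenization used by full ROUGE implementations.

```python
def lcs_length(a, b):
    # Classic dynamic-programming LCS over two token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, hypothesis, beta=1.0):
    r_tok, h_tok = reference.split(), hypothesis.split()
    lcs = lcs_length(r_tok, h_tok)
    recall, precision = lcs / len(r_tok), lcs / len(h_tok)   # Equations 2 and 3
    if recall == 0 or precision == 0:
        return 0.0
    return ((1 + beta**2) * recall * precision) / (recall + beta**2 * precision)  # Equation 4

print(rouge_l("o tribunal apontou fraude na compra", "fraude apontada na compra pelo tribunal"))
```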

3.3.3 BLEU

The Bilingual Evaluation Understudy (BLEU) metric (Papineni et al., 2002) measures the quality of a generated text based on the precision of N-grams with respect to one or more references, while incorporating a brevity penalty to avoid excessively short summaries. Formally, let p_n denote the n-gram precision of order n, w_n the corresponding weights (typically uniform), BP the brevity penalty, |H| the length of the generated summary, and |R| the length of the reference summary. BLEU is then defined as follows:

\[ \text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \quad (5) \]
\[ BP = \begin{cases} 1 & \text{if } |H| > |R| \\ e^{\,(1 - |R|/|H|)} & \text{if } |H| \le |R| \end{cases} \]
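BLEU, including the brevity penalty above, is available through the evaluate package listed in Section 4.9; the single sentence pair below is illustrative only.

```python
import evaluate

bleu = evaluate.load("bleu")  # downloads the metric script on first use
result = bleu.compute(
    predictions=["auditoria aponta desvio de recursos em hospital público"],
    references=[["auditoria aponta desvio de recursos no hospital público da capital"]],
)
print(result["bleu"], result["brevity_penalty"], result["precisions"])
```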

3.3.4 METEOR

The Metric for Evaluation of Translation with Explicit ORdering (METEOR) (Lavie and Agarwal, 2007) establishes a flexible alignment between automatically generated summaries and references, accounting for exact matches, stems, synonyms, and paraphrases. Precision P and recall R are defined over the identified matches. The F_α score is computed as shown in Equation 6, where α balances the relative importance of precision and recall. A fragmentation penalty Pen, dependent on the dispersion of matched segments, is also applied. Finally, METEOR is computed according to Equation 7.

\[ F_{\alpha} = \frac{P \cdot R}{\alpha P + (1 - \alpha) R} \quad (6) \]
\[ \text{METEOR} = (1 - Pen) \cdot F_{\alpha} \quad (7) \]
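METEOR can be computed analogously with the evaluate package; note that the underlying implementation relies on NLTK resources (e.g., WordNet), which must be downloadable in the execution environment, and that its synonym matching is English-centric, a limitation when scoring Portuguese text.

```python
import evaluate

meteor = evaluate.load("meteor")  # pulls NLTK data (punkt, wordnet) on first use
result = meteor.compute(
    predictions=["prefeitura é investigada por superfaturamento de insumos"],
    references=["prefeitura investigada por superfaturamento na compra de insumos"],
)
print(result["meteor"])
```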

3.3.5 BERTScore

The BERTScore metric (Zhang et al., 2020) is grounded in semantic representations obtained from pretrained Transformer-based language models. Given a set of embeddings e_h for tokens in the automatically generated summary and e_r for tokens in the reference, where cos(e_h, e_r) denotes the cosine similarity between embeddings, BERTScore diverges from traditional metrics by capturing actual semantics, including synonyms and paraphrases. Moreover, it correlates strongly with human evaluation, since both the reference text and the generated summary are represented contextually through embeddings. Precision, recall, and the final BERTScore are computed as shown in Equations 8–10.

\[ P = \frac{1}{|H|} \sum_{h \in H} \max_{r \in R} \cos(e_h, e_r) \quad (8) \]
\[ R = \frac{1}{|R|} \sum_{r \in R} \max_{h \in H} \cos(e_r, e_h) \quad (9) \]
\[ F_{BERT} = \frac{2 \cdot P \cdot R}{P + R} \quad (10) \]
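In code, the bert-score package listed in Section 4.9 encapsulates the embedding and greedy matching steps of Equations 8–10. In the sketch below, lang="pt" selects a default multilingual backbone, which is an assumption rather than the exact model configuration used in the experiment.

```python
from bert_score import score

candidates = ["auditoria aponta desvio de recursos em hospital público"]
references = ["auditoria aponta desvio de verbas no hospital público da capital"]

P, R, F1 = score(candidates, references, lang="pt", verbose=False)
print(f"P={P.mean().item():.3f} R={R.mean().item():.3f} F1={F1.mean().item():.3f}")
```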

4 Experimental definition

This section presents the objective of the experimental evaluation, the planning, the research questions, the independent and dependent variables, and the hypotheses.

4.1 Objective

To formalize the objective of this study, the Goal Question Metric (GQM) model proposed by Basili and Weiss (1984) was employed. The aim of this work is to analyze automatic text summarization methods, with the purpose of evaluating the quality of automatically generated summaries against human-generated summaries, with respect to the metrics ROUGE-1, ROUGE-2, ROUGE-L, BLEU, METEOR, and BERTScore, from the perspective of Data Scientists and Auditors of the Brazilian Unified Health System (SUS), in the context of public health audits conducted by the National Department of Unified Health System (Sistema Único de Saúde—SUS) Auditing (AudSUS).

4.2 Planning

The experiment was conducted in a controlled environment, using text summarization methods on health-related news articles with indications of irregularities. Table 4 presents and describes the methods employed.


Table 4. Models used, their characteristics and purposes.

The experiment involved the generation and evaluation of automatic summaries, as well as the analysis and presentation of results.

The automatic summary generation phase consisted of applying text summarization methods to the database of news articles with indications of irregularities. The database employed is described in Section 3.1.

For the evaluation, quality measurement metrics for summaries were applied, including ROUGE-N (Lin, 2004), ROUGE-L (Lin, 2004), BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and BERTScore (Zhang et al., 2020). These metrics are detailed in Section 3.3.

In the subsequent phase, analyses were conducted on the mean scores of each applied method, and hypothesis testing was performed to evaluate differences in results. To determine whether there were statistically significant differences in mean performance among the methods, a paired non-parametric Wilcoxon Signed-Rank test was applied.

After identifying the best-performing methods, they were comparatively described regarding the distribution of results and internal consistency through descriptive analysis of the standard deviation of the evaluation metrics. Finally, threats to validity are presented.

4.3 Context selection

Despite technological advancements, many processes in the public sector still rely on manual searches for knowledge construction. This scenario is also observed in the AudSUS, responsible for the control and oversight of the Brazilian Unified Health System (SUS). Audit activities conducted by this department play a crucial role in the management and proper use of public resources; however, the process is highly resource-intensive due to the high demand, as auditors must oversee all SUS areas while also addressing internal and external demands from the Ministry of Health (Fontes et al., 2023). In this context, the proposed experiment aims to support the analytical phase of the audit process, which is responsible for planning and preparing the team for the operative phase, through the collection of information related to the audit objectives.

4.4 Research question

To guide the experiment and fulfill the objectives of this study, the following research questions were formulated:

• RQ1: Can an automatic text summarization method support the audit process by reducing information overload regarding indications of irregularities?

• RQ2: Among the selected summarization methods, which are the top three in terms of summary quality?

To address the research questions, the following theoretical hypotheses, presented in Table 5, were formulated.


Table 5. Research questions and associated hypotheses.

4.5 Dependent variables

The dependent variables, or output variables, were the automatic summaries generated by the models, from which the summary quality evaluation metrics ROUGE-N, ROUGE-L, BLEU, METEOR, and BERTScore were derived.

4.6 Independent variables

In this experiment, the independent variables, or input variables, are the reference dataset created for evaluating the automatic summaries and the tested models: the abstractive models BART (Lewis et al., 2019), Gemma (Gemma Team et al., 2024), Sabiá (Pires et al., 2023), Llama (Team, 2024a), TeenyTinyLlama (?), Hermes (Teknium et al., 2024), Qwen (Team, 2024b), and Tucano (Corrêa et al., 2025), and the extractive models TextRank (Nenkova and McKeown, 2011), LexRank (Erkan and Radev, 2004), LSA (Steinberger and JeŽek, 2004), KLSum (Haghighi and Vanderwende, 2009), and SumBasic (Woodsend and Lapata, 2011). These models were selected following the classification of Wang et al. (2025), i.e., models with up to 10 billion parameters.

4.7 Objects selection

Following the context described in Section 4.3, the objects of this experiment are health-related news articles with indications of irregularities, as described in Section 3.2. The dataset contains 421 news articles and human-generated reference summaries.

To generalize the results of this experiment to the broader population of news articles, it is necessary to evaluate the results using a representative sample (?). For the sample size calculation, a finite population of 154,407 news articles (the total number of articles in the complete dataset), a 95% confidence level (Z = 1.96), a tolerable sampling error of 5% (e = 0.05), and an expected proportion of 50% (p = 0.5) were considered, maximizing sample variability and ensuring a more conservative sample size. It is noteworthy that the final sample (421 articles) exceeds the number estimated via Equation 12, as computed in Equation 14.

The sample size calculation for a finite population was conducted in two steps: first, the sample for an infinite population (n) was estimated using Equation 11, and then the adjustment for a finite population (nadjusted) was applied according to Equation 12, resulting in approximately 383.21 samples, as shown in Equation 14.

\[ n = \frac{Z^2 \cdot p \cdot (1 - p)}{e^2} \quad (11) \]
\[ n_{adjusted} = \frac{n}{1 + \frac{n - 1}{N}} \quad (12) \]
\[ n = \frac{1.96^2 \cdot 0.5 \cdot (1 - 0.5)}{0.05^2} = \frac{3.8416 \cdot 0.25}{0.0025} = 384.16 \quad (13) \]
\[ n_{adjusted} = \frac{384.16}{1 + \frac{384.16 - 1}{154407}} \approx 383.21 \quad (14) \]
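The arithmetic of Equations 11–14 can be checked with a few lines of Python:

```python
# Worked reproduction of Equations 11-14 (finite-population sample size).
Z, p, e, N = 1.96, 0.5, 0.05, 154_407

n_infinite = (Z**2 * p * (1 - p)) / e**2              # Equation 11 -> 384.16
n_adjusted = n_infinite / (1 + (n_infinite - 1) / N)  # Equation 12 -> ~383.21
print(round(n_infinite, 2), round(n_adjusted, 2))
```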

4.8 Experiment design

Automatic summaries were generated in 35 independent rounds for each of the 421 news articles, resulting in 14,735 automatic summaries per method. A total of 35 rounds were conducted to ensure that the distribution of the sample mean score for each method approaches a normal distribution, even if the underlying population distribution does not follow one, in accordance with the Central Limit Theorem (CLT). This sample size (n = 35) surpasses the commonly accepted threshold of n≥30 for the application of the CLT, thereby enabling the subsequent use of robust parametric tests (which assume normality of the sampling distribution) and ensuring a minimum number of observations for normality tests, such as the Kolmogorov–Smirnov test (?).

In this experiment, the metrics ROUGE-N (Lin, 2004), ROUGE-L (Lin, 2004), BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and BERTScore (Zhang et al., 2020) were employed to measure the quality of the generated summaries, as described in Section 3.3. These metrics were applied to evaluate the automatic summaries produced by the methods, using human-generated summaries as references.

For extractive methods, a preprocessing step was required, in which words were normalized using stemming, a process that reduces words to their root forms, decreasing linguistic variation and complexity while preserving the essential meaning. This process is necessary to ensure consistent representation of variant forms of the same word by removing suffixes, thereby achieving textual normalization (?).
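A minimal sketch of this normalization step, assuming NLTK's Portuguese Snowball stemmer (NLTK is installed as a dependency of the sumy package listed in Section 4.9):

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("portuguese")
tokens = "contratos superfaturados investigados pela auditoria".split()
print([stemmer.stem(t) for t in tokens])  # prints reduced roots such as 'contrat' and 'investig'
```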

The maximum length of the automatic summaries was constrained to the average length of the reference summaries. For extractive methods, summaries were limited to a maximum of five sentences, corresponding to the average sentence count of the reference summaries (4.92 sentences). For abstractive summaries, reference summaries were tokenized, the mean number of tokens was calculated, and a minimum summary length was set as max(5, mean_tokens − tolerance_value), while the maximum length was fixed at mean_tokens + tolerance_value. The tolerance value was defined as mean_tokens × 0.1. Tokenization, which divides text into subword units called tokens, was automated using the tokenizer specific to the method being executed. The restriction on summary length aims to ensure a fairer comparison, as recommended in NIST (2025).
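The length-constraint rule can be expressed as follows; the tokenizer checkpoint is illustrative (here Qwen/Qwen2.5-7B-Instruct), and each abstractive model would use its own tokenizer as stated above.

```python
from transformers import AutoTokenizer

def length_bounds(reference_summaries, tokenizer):
    # Mean reference length in tokens, with a 10% tolerance band.
    lengths = [len(tokenizer.encode(s)) for s in reference_summaries]
    mean_tokens = sum(lengths) / len(lengths)
    tolerance = mean_tokens * 0.1
    min_len = max(5, int(mean_tokens - tolerance))
    max_len = int(mean_tokens + tolerance)
    return min_len, max_len

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print(length_bounds(["resumo de referência um", "outro resumo de referência mais longo"], tokenizer))
```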

4.9 Instrumentation

The following materials and resources were employed:

• Google Sheets;

• Annotated database with reference summaries (Section 3.1);

• Google Colab;

• Python programming language (3.11.13);

• Python libraries: accelerate (1.9.0), bert-score (0.3.12), bitsandbytes (0.47.0), datasets (4.0.0), evaluate (0.4.5), matplotlib (3.10.3), openpyxl (3.1.4), packaging (25.0), pandas (2.2.3), polars (1.32.0), protobuf (6.31.1), pyarrow (20.0.0), python-dotenv (1.1.1), rouge-score (0.1.2), seaborn (0.13.2), sentencepiece (0.1.99), sumy (0.11.0), tiktoken (0.9.0), tqdm (4.67.1), scipy (1.15.3), and transformers (4.54.0);

• Computational resources from the High-Performance Computing Center (NPAD) at the Federal University of Rio Grande do Norte (UFRN).

5 Experiment operation

This section describes the preparation of the experiment, its execution, and the evaluation of the results.

5.1 Experiment preparation

The database containing all news articles was obtained as described in Section 3.1. The reference summaries, used as the comparison standard, were produced by an independent researcher, ensuring that for each health-related news article with indications of irregularity, a corresponding summary was created.

Before applying the summarization methods, a virtual environment was set up for dependency management, ensuring reproducibility and compatibility across different development environments. Within this environment, all required libraries for method execution were installed.

In the case of automatic summaries generated by extractive methods, a preprocessing step was carried out, in which the news texts were normalized using stemming, thereby reducing linguistic variation and complexity while preserving the essential meaning of words.

To ensure systematic execution of the process, a summarization pipeline was developed. This pipeline consists of a script capable of receiving one or more methods and repeatedly generating automatic summaries, with the number of repetitions (N) defined as 35 in this experiment.

As a pilot study, the pipeline was initially tested in five rounds with 25 samples to verify its functionality. Necessary adjustments and corrections were made during this preliminary stage (?). Subsequently, the complete process was executed for all methods. Table 6 presents the hyperparameters employed for the abstractive models.


Table 6. Hyperparameters of the abstractive models.

5.2 Experiment execution

The execution of the experiment consisted of two main stages: the generation and evaluation of the automatic summaries.

In the generation stage, the pipeline was executed for each method using the database of news articles with indications of fraud.

The generation stage involved the implementation of a pipeline in Python, automating the execution of each method for 35 iterations. The pipeline for extractive methods is presented in Algorithm 1, while the pipeline for abstractive methods is shown in Algorithm 2. The materials and resources used are detailed in Section 4.9.


Algorithm 1. Extractive text summarization pipeline.
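A minimal sketch of the extractive branch is shown below, using the sumy package listed in Section 4.9; the input file name is hypothetical, the five-sentence cap follows the limit defined in Section 4.8, and Algorithm 1 in the original pipeline may differ in detail.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.kl import KLSummarizer
from sumy.summarizers.sum_basic import SumBasicSummarizer

LANGUAGE, MAX_SENTENCES = "portuguese", 5
METHODS = {
    "textrank": TextRankSummarizer,
    "lexrank": LexRankSummarizer,
    "lsa": LsaSummarizer,
    "klsum": KLSummarizer,
    "sumbasic": SumBasicSummarizer,
}

def extractive_summary(text, method):
    # Parse and tokenize the article, apply the chosen summarizer with Portuguese
    # stemming and stopwords, and return at most MAX_SENTENCES sentences.
    parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
    summarizer = METHODS[method](Stemmer(LANGUAGE))
    summarizer.stop_words = get_stop_words(LANGUAGE)
    sentences = summarizer(parser.document, MAX_SENTENCES)
    return " ".join(str(s) for s in sentences)

article = open("news_article.txt", encoding="utf-8").read()  # hypothetical input file
print(extractive_summary(article, "sumbasic"))
```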


Algorithm 2. Abstractive text summarization pipeline.
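For the abstractive branch, a hedged sketch with the transformers chat pipeline is shown below; the model choice, prompt, sampling settings, and token bounds are illustrative and do not reproduce the hyperparameters of Table 6.

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # one of the evaluated SLMs
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def abstractive_summary(text, min_tokens, max_tokens):
    messages = [
        {"role": "system", "content": "Resuma a notícia a seguir de forma concisa e fiel."},
        {"role": "user", "content": text},
    ]
    output = generator(
        messages,
        min_new_tokens=min_tokens,
        max_new_tokens=max_tokens,
        do_sample=True,
        temperature=0.7,
    )
    return output[0]["generated_text"][-1]["content"]  # assistant turn appended by the pipeline

article = open("news_article.txt", encoding="utf-8").read()      # hypothetical input file
summaries = [abstractive_summary(article, 80, 120) for _ in range(35)]  # 35 independent rounds
```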

After the generation of summaries, the evaluation stage was initiated, consisting of measuring the quality of the automatic summaries using the ROUGE-N (Lin, 2004), ROUGE-L (Lin, 2004), BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and BERTScore (Zhang et al., 2020) metrics, with the reference summaries serving as the ground truth. Each of the 14,735 summaries generated per method was compared with its corresponding manually produced reference summary. Figure 2 illustrates the execution process.

Figure 2. Text summarization process and evaluation: each method (A, B, and C) summarizes the health-irregularity dataset, and the resulting summaries are evaluated against the reference summary to produce the metrics.

5.3 Data evaluation

Five types of statistical procedures were applied for analysis, interpretation, and validation: the Anderson–Darling test (AD Test), the Kolmogorov–Smirnov test (KS Test), the Wilcoxon Signed-Rank Test (pairwise), the Z-score, and the Interquartile Range (IQR). The Anderson–Darling and Kolmogorov–Smirnov tests were employed to assess data normality, the Wilcoxon Signed-Rank Test (pairwise) was applied to compare the medians of the ROUGE-1, ROUGE-2, ROUGE-L, BLEU, METEOR, and BERTScore metrics, and the Z-score and Interquartile Range were used to identify outliers within the evaluation metrics.
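The sketch below illustrates this battery of procedures with SciPy (listed in Section 4.9) and NumPy, using synthetic score vectors in place of the actual metric results.

```python
import numpy as np
from scipy import stats

# Placeholder score vectors standing in for two methods' metric results (14,735 values each).
scores_a = np.random.default_rng(0).beta(8, 3, size=14_735)
scores_b = np.random.default_rng(1).beta(7, 3, size=14_735)

# Normality checks: Anderson-Darling statistic and Kolmogorov-Smirnov p-value.
print(stats.anderson(scores_a, dist="norm").statistic)  # reject normality if above the critical value
print(stats.kstest(scores_a, "norm", args=(scores_a.mean(), scores_a.std())).pvalue)

# One-sided paired Wilcoxon: is Method A's distribution significantly greater than Method B's?
print(stats.wilcoxon(scores_a, scores_b, alternative="greater").pvalue)

# Outlier flags via Z-score (> 3 standard deviations) and the 1.5 * IQR rule.
z_outliers = np.abs(stats.zscore(scores_a)) > 3
q1, q3 = np.percentile(scores_a, [25, 75])
iqr = q3 - q1
iqr_outliers = (scores_a < q1 - 1.5 * iqr) | (scores_a > q3 + 1.5 * iqr)
print(z_outliers.mean() * 100, iqr_outliers.mean() * 100)  # percentage of flagged observations
```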

6 Results

This section presents the process of data analysis and interpretation, threats to validity, and conclusions and future work.

6.1 Data analysis and interpretation

To address the research questions outlined in Section 4.4, the Execution stage was followed, and the results for the evaluation metrics were obtained. Table 7 presents the results of the ROUGE-1, ROUGE-2, ROUGE-L, BLEU, METEOR, and BERTScore metrics, aggregated by mean, minimum value, maximum value, and standard deviation for each metric. Figure 3 provides a visual representation of the metrics in the form of a heatmap, in which lighter colors indicate higher scores.


Table 7. Results for the evaluation metrics.

Figure 3. Heatmap of evaluation metrics results: models are listed on the vertical axis and metrics (ROUGE, BLEU, METEOR, and BERTScore F1) on the horizontal axis, with brighter colors indicating higher scores.

Given these results, it is observed that the abstractive models led performance across almost all metrics with consistent results, demonstrating their ability to capture the essential information from the original text dataset, achieving performance comparable to humans, particularly under the BERTScore metric.

Despite their lower performance, extractive models can still be useful, especially when interpretability of results is a critical factor. These models employ simple statistical mechanisms, such as word frequency within the text (SumBasic), KL divergence between distributions (KL-Sum), similarity graphs (LexRank), and latent topics derived via SVD (LSA), thus providing deterministic and verifiable decision rules that clarify why each sentence included in the summary was selected. In an auditing context, where transparency, traceability, and justifiability are essential, such characteristics support auditors in making informed decisions on sensitive matters. Within the class of extractive models, the SumBasic method stands out, ranking as the fourth best overall method according to Table 13, considering its performance across all metrics compared to the others.

Among the methods with the best overall results are the abstractive models NousResearch/Hermes-3-Llama-3.2-3B, Qwen/Qwen2.5-7B-Instruct, and meta-llama/Llama-3.2-3B-Instruct, which consistently outperform the alternatives across all metrics, for instance in BERTScore, as shown in Table 13. This result suggests that they better capture and preserve the contextual meaning of the original text.

To assess the consistency of each automatic summary across each sample in all iterations, the standard deviation of the metrics was analyzed, revealing the stability of the methods' performance relative to the evaluated metrics. A high standard deviation indicates that the model's performance varies considerably across test runs, while a low standard deviation indicates greater consistency, delivering similar results regardless of the evaluated example or the number of iterations.

For comparative purposes, since all metrics range from zero to one, we classified results into “Low” (std < 0.05), “Moderate” (0.05 < std < 0.1), and “High” (std > 0.1) based on the standard deviation value for each metric. From this perspective, the methods NousResearch/Hermes-3-Llama-3.2-3B, Qwen/Qwen2.5-7B-Instruct, and meta-llama/Llama-3.2-3B-Instruct maintain their prominence with moderate variation, as shown in Table 8.


Table 8. Performance consistency classification using standard deviation of results.

In addition, to verify the consistency of the methods' results, an analysis was conducted to identify and characterize the presence of outliers, considering as such those values greater than three standard deviations, as well as those identified through the Interquartile Range (IQR). Among the 75 model-metric combinations, only 22 exhibited outliers above 1%, indicating a low incidence of outliers and, consequently, supporting the consistency of results across different iterations. Table 9 presents the distribution of outliers by method and metric.


Table 9. Percentage of outliers detected by Z-score and IQR for different methods and metrics.

To analyze the reduction of information overload, in addition to quality metrics, textual reduction relative to the original document and its size compared to human performance were assessed. Beyond preserving quality, the best models—NousResearch/Hermes-3-Llama-3.2-3B, Qwen/Qwen2.5-7B-Instruct, and meta-llama/Llama-3.2-3B-Instruct—produced summaries with lengths relatively close to the human average, with mean differences in word count of 10.57%, 3.25%, and 8.13%, respectively, compared to human performance. Table 10 describes the reduction of information overload, reporting the average length of all news articles in words (Orig. Words), the average human summary length across all documents (Ref. Words) and its relative size compared to the average document length (Ref. %), the average length of automatic summaries (Auto Words) and its relative size compared to the average document length (Auto %), and the relative difference between automatic summaries and human performance (Dif %). Supplementary Appendix A.15 compares reference summary samples with automatic summaries generated by the top-performing models.


Table 10. Percentage difference in information overload reduction between human and automatic summaries, sorted by difference (%).

Despite the promising results, it is not possible to draw definitive conclusions without sufficiently conclusive statistical evidence. Thus, to enable comparative assertions, a significance level (α) of 0.05 was established for the entire experiment. Normality tests were then applied.

A normality assessment was carried out using robust methods for large samples (14,735 per method), specifically the Anderson–Darling (AD Test) and Kolmogorov–Smirnov (KS Test), in order to determine the most appropriate hypothesis test to answer the Research Questions (Section 4.4). The AD Test was adopted as the primary metric, while the KS Test was employed as a complementary verification. The results indicated no evidence that any of the datasets follow a normal distribution, as presented in Tables 11–13. For the 14,735 samples, the critical value for the AD statistic was 0.787, leading to rejection across all methods and metrics. Analysis of the p-values from the Kolmogorov–Smirnov test further confirmed that none of the result distributions for any metric approximate a normal distribution. Consequently, the use of non-parametric hypothesis testing was required. As the primary approach, the Wilcoxon Signed-Rank Test was employed in a pairwise manner (Method A vs. Method B) to evaluate whether the distribution of values from “Method A” was significantly superior to that of “Method B.”


Table 11. Normality test results—Anderson–Darling (AD_statistic).


Table 12. Normality test results—Kolmogorov–Smirnov (KS_pvalue).


Table 13. Summary of the Wilcoxon Signed-Rank Test (pairwise).

To assess whether one method significantly outperforms another, the Wilcoxon Signed-Rank Test was employed. This is a non-parametric test for paired samples that evaluates whether the median of the differences between two methods (or conditions) is significantly different from zero. Following its application, evidence was obtained regarding the comparative performance across each of the evaluation metrics. Tables B.16–B.20 in Supplementary Appendix B describe the pairwise results of the Wilcoxon Signed-Rank Test for each evaluation metric. These results are summarized in Table 13, which reports the number of wins (comparisons with a significant p-value) of the baseline method (“Model A”) over the alternative method (“Model B”) for each metric, while the column Score represents the cumulative count of wins of the baseline model (“Model A”) over the alternative models (“Model B”).

After identifying the models with the best comparative performance, the next step involved measuring the magnitude of the difference between the medians of the top models in both the most favorable and least favorable scenarios, with the aim of evaluating the extent to which one model stands out relative to the alternative. Table 14 presents the baseline model, the models corresponding to the best- and worst-case scenarios, as well as the respective metrics. Furthermore, Table 14 reports the results obtained when the baseline model outperformed or underperformed for each analyzed metric.


Table 14. Magnitude of the difference in medians between the best models in the best- and worst-case scenarios.

The models NousResearch/Hermes-3-Llama-3.2-3B and Qwen/Qwen2.5-7B-Instruct were pre-trained on English-language datasets, whereas meta-llama/Llama-3.2-3B-Instruct was trained on multilingual corpora (Table 4). Despite being trained entirely or predominantly on languages other than Portuguese, these models achieved the best performance across all summary quality evaluation criteria and pairwise comparisons. For the BERTScore metric, the difference in medians was 0.19 (19 percentage points), meaning that, when evaluated against the human reference summaries, the three top-performing models scored 19 percentage points higher than the compared model. Conversely, the models with the weakest relative results compared to the top three were those pre-trained in Portuguese, namely maritaca-ai/sabia-7b and nicholasKluge/TeenyTinyLlama-460m-Chat. Meanwhile, the models that exhibited the smallest performance differences were the extractive sumbasic and the abstractive NousResearch/Hermes-3-Llama-3.2-3B and Qwen/Qwen2.5-7B-Instruct.

The results of all evaluations indicate, from the perspective of an auditor, that these models are technically reliable, as they demonstrate consistency in performance, highlight the most relevant information, and maintain an average summary length comparable to human performance. Thus, by leveraging automated methods to contribute to reducing informational overload, these models can support the auditing process in the analytical phase, enhancing efficiency and effectiveness in information gathering and preparing teams for the operational phase in less time.

6.2 Threats to validity

For the evaluation of the experiment, it is necessary to consider factors that may influence the results, characterized as threats to internal and external validity.

• Internal validity: the process of classifying the news articles was conducted by two annotators. As this is a manual and intensive activity, there is a possibility of classification errors. To mitigate this risk, five evaluators intervened in cases of disagreement between annotators regarding the categorization of the news.

• External validity: the number of methods trained exclusively in Brazilian Portuguese or multilingual corpora is very limited compared to those developed in English. Models trained in the same language as the dataset may achieve superior or more consistent performance relative to multilingual or English-adapted models. To mitigate this threat, comparisons were balanced by prioritizing models trained or fine-tuned in Brazilian Portuguese.

7 Conclusion and future work

The auditing process is generally characterized as costly, time-consuming, and resource-intensive, requiring substantial human and material effort. In this context, it becomes necessary to implement solutions and techniques that enable the automation of the analysis of corruption allegations. This process is typically divided into two stages: in the first, elements and evidence of corruption—such as suppliers, contracts, employees, clients, and other stakeholders—are identified, assessing the plausibility and consistency of the allegations and signs of fraud; in the second stage, the investigation itself is carried out.

For building the knowledge required in auditing activities, it is essential to collect information directly related to the audit's objectives. In this phase, various sources are consulted, including websites. To support the information-gathering process, web scraping techniques can be applied to extract large-scale data from health-related websites. Furthermore, to assist in analyzing this substantial volume of data, NLP techniques, such as text summarization, can be employed, significantly reducing the time and resources required for the analysis and collection of evidence of potential irregularities.

In this context, aiming to support, improve, and optimize the collection of relevant information that may assist in combating irregularities, this study presents the results of applying 15 automatic text summarization methods to a set of health-related news articles with indications of irregularities. The objective was to evaluate whether such methods can contribute to the auditing process by reducing informational overload, as well as to identify which are most effective for this task.

In this controlled experiment, using a curated dataset of 421 news samples, automatic summaries were generated through 15 methods, each repeated over 35 rounds. The results were robustly evaluated based on multiple performance metrics [ROUGE-N (Lin, 2004), ROUGE-L (Lin, 2004), BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and BERTScore (Zhang et al., 2020)]. In addition to evaluating summary quality across multiple metrics, the analysis included consistency through standard deviation, the presence of outliers, and the degree of informational overload reduction. Moreover, all methods were subjected to the Wilcoxon Signed-Rank Test (paired), the results of which are presented in this study.

Among the methods with the best performance in terms of summary quality and comparative results, the models NousResearch/Hermes-3-Llama-3.2-3B, Qwen/Qwen2.5-7B-Instruct, and meta-llama/Llama-3.2-3B-Instruct stand out. These models consistently outperformed others across the various evaluation metrics, demonstrating a superior ability to capture and preserve the contextual meaning of the original text while adequately synthesizing key information, when compared to human performance.

From an auditor's perspective, these models prove to be technically reliable, as they provide consistent results, highlight the most relevant information, and maintain an average summary length comparable to human performance. Therefore, by leveraging automated methods to reduce informational overload, these models can support the auditing process in the analytical phase, increasing efficiency and effectiveness in information gathering and enabling teams to prepare for the operational phase in less time.

For future work, although these models require relatively fewer computational resources compared to larger models, their implementation and execution still demand specialized knowledge and significant resources. In this regard, there is room for exploring complexity reduction techniques, such as quantization methods, which may enable more efficient use of these models in practical scenarios with limited resources. Finally, to further optimize the reduction of informational overload, it would be possible to summarize groups of texts rather than individual texts, following the application of topic modeling methods.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

AG: Writing – review & editing, Formal analysis, Validation, Data curation, Writing – original draft, Investigation, Conceptualization, Methodology. MC: Data curation, Writing – review & editing, Conceptualization, Methodology. SD: Writing – review & editing, Data curation. GA: Writing – review & editing, Data curation. RF: Writing – review & editing, Resources, Data curation. HP: Writing – review & editing, Data curation. LC: Writing – review & editing, Data curation. NM: Data curation, Writing – review & editing. RM: Writing – review & editing, Funding acquisition. JS: Writing – review & editing, Funding acquisition.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by LAIS—master's and doctoral scholarships.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2026.1708993/full#supplementary-material


References

Alguliyev, R. M., Aliguliyev, R. M., and Isazade, N. R. (2015). An unsupervised approach to generating generic summaries of documents. Appl. Soft Comput. J. 34, 236–250. doi: 10.1016/j.asoc.2015.04.050


Basili, V. R., and Weiss, D. M. (1984). A methodology for collecting valid software engineering data. IEEE Trans. Softw. Eng. 10, 728–738. doi: 10.1109/TSE.1984.5010301

Crossref Full Text | Google Scholar

Benjelloun, F.-Z., Benjelloun, F.-Z., Lahcen, A. A., Lahcen, A. A., Lahcen, A. A., Belfkih, S., et al. (2015).

Google Scholar

Corrêa, N. K., Sen, A., Falk, S., and Fatimah, S. (2025). Tucano: advancing neural text generation for Portuguese. Patterns 6:101325. doi: 10.1016/j.patter.2025.101325

PubMed Abstract | Crossref Full Text | Google Scholar

do Amaral, J. A. A., Amaral, J. A., Rodrigues, J. B., Rodrigues, J. B., and Rodrigues, J. B. (2020). Alocaçã de tópicos latentes — um modelo para segmentação de dados de auditoria do governo de pe. Rev. Eng. Pesqui. Apl. 5, 40–49. doi: 10.25286/repa.v5i1.1179

Crossref Full Text | Google Scholar

El-Kassas, W. S., Salama, C. R., Rafea, A. A., and Mohamed, H. K. (2021). Automatic text summarization: a comprehensive survey. Expert Syst. Appl. 165:113679. doi: 10.1016/j.eswa.2020.113679

Crossref Full Text | Google Scholar

Erkan, G., and Radev, D. R. (2004). Lexrank: graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res. 22, 457–479. doi: 10.1613/jair.1523

Crossref Full Text | Google Scholar

Fields, J., Chovanec, K., and Madiraju, P. (2024). A survey of text classification with transformers: How wide? How large? How long? How accurate? How expensive? How safe? IEEE Access 12, 6518–6531. doi: 10.1109/ACCESS.2024.3349952

Crossref Full Text | Google Scholar

Fontes, R. S., Júnior, M. C., Prado, H., Nely, A., Araújo, J., de Paiva, J., et al. (2023). “Sussurro - detecção na web de eventos auditáveis que representam riscos à saúde pública,” in Anais Estendidos do XXIII Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2023). doi: 10.5753/sbcas_estendido.2023.231515

Crossref Full Text | Google Scholar

Gemma Team, T. M., Hardin, C., Dadashi, R., Bhupatiraju, S., Sifre, L., Rivière, M., et al. (2024). Gemma. doi: 10.34740/KAGGLE/M/3301

Crossref Full Text | Google Scholar

Haghighi, A., and Vanderwende, L. (2009). “Exploring content models for multi-document summarization,” NAACL HLT 2009- Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Conference (Stroudsburg, PA: ACL). doi: 10.3115/1620754.1620807

Crossref Full Text | Google Scholar

Jiang, Y., Li, J., Wong, D., and Kan, H. Y. (2023). Natural language processing adoption in governments and future research directions: a systematic review. Appl. Sci. 13:12346. doi: 10.3390/app132212346

Crossref Full Text | Google Scholar

Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33, 159–174. doi: 10.2307/2529310

PubMed Abstract | Crossref Full Text | Google Scholar

Lavie, A., and Agarwal, A. (2007). “METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments,” in Proceedings of the Second Workshop on Statistical Machine Translation, eds. C. Callison-Burch, P. Koehn, C. S. Fordyce, and C. Monz (Prague: Association for Computational Linguistics), 228–231. doi: 10.3115/1626355.1626389

Crossref Full Text | Google Scholar

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., et al. (2019). BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv [preprint]. arXiv:1910.13461. doi: 10.48550/arXiv:1910.13461

Crossref Full Text | Google Scholar

Lin, C.-Y. (2004). “ROUGE: a package for automatic evaluation of summaries,” in Text Summarization Branches Out (Barcelona: Association for Computational Linguistics), 74–81.

Google Scholar

Luo, L., Ning, J., Zhao, Y., Wang, Z., Ding, Z., Chen, P., et al. (2023). Taiyi: A bilingual fine-tuned large language model for diverse biomedical tasks. J. Am. Med. Inform. Assoc. 31:1865–1874. doi: 10.1093/jamia/ocae037

PubMed Abstract | Crossref Full Text | Google Scholar

Mackey, T. K., Mackey, T. K., Vian, T., Vian, T., Köhler, J. C., Kohler, J. C., et al. (2018). The sustainable development goals as a framework to combat health-sector corruption. Bull. World Health Organ. 96, 634–643. doi: 10.2471/BLT.18.209502

PubMed Abstract | Crossref Full Text | Google Scholar

Madureira, L., Popovič, A., and Castelli, M. (2021). Competitive intelligence: a unified view and modular definition. Technol. Forecast. Soc. Change 173:121086. doi: 10.1016/j.techfore.2021.121086

Crossref Full Text | Google Scholar

Nenkova, A., and McKeown, K. (2011). Automatic summarization. Found. Trends Inf. Retr. 5, 103–233. doi: 10.1561/1500000015

Crossref Full Text | Google Scholar

NIST (2025). Proceedings of the Document Understanding Conference. Gaithersburg, MD: National Institute of Standards and Technology.

Google Scholar

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, eds. P. Isabelle, E. Charniak, and D. Lin (Philadelphia, PA: Association for Computational Linguistics), 311–318. doi: 10.3115/1073083.1073135

Crossref Full Text | Google Scholar

Paula, T. D., Amaral, A. D., Victor, A., Sales, L. A., Moreira, R., Meirelles, T., et al. (2024). “Automated admissibility of complaints about fraud and corruption,” in Proceedings of the 16th International Conference on Computational Processing of Portuguese, Vol. 1, eds. P. Gamallo, D. Claro, A. Teixeira, L. Real, M. Garcia, H. G. Oliveira, et al. (Santiago de Compostela, Galicia: Association for Computational Lingustics), 610–613.

Google Scholar

Pires, R., Abonizio, H., Almeida, T. S., and Nogueira, R. (2023). “Sabiá: Portuguese large language models,” in Intelligent Systems, M. C. Naldi, and R. A. C. Bianchi (Cham: Springer Nature Switzerland), 226–240. doi: 10.1007/978-3-031-45392-2_15

Crossref Full Text | Google Scholar

Raza, M., Jahangir, Z., Riaz, M. B., Saeed, M. J., and Sattar, M. A. (2025). Industrial applications of large language models. Sci. Rep. 15:13755. doi: 10.1038/s41598-025-98483-1

PubMed Abstract | Crossref Full Text | Google Scholar

Ren, M. (2024). Advancements and applications of large language models in natural language processing: a comprehensive review. Appl. Comput. Eng. 97, 55–63. doi: 10.54254/2755-2721/97/20241406

Crossref Full Text | Google Scholar

Saini, N., Saha, S., Jangra, A., and Bhattacharyya, P. (2019). Extractive single document summarization using multi-objective optimization: exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm. Knowl. Based Syst. 164, 45–67. doi: 10.1016/j.knosys.2018.10.021

Crossref Full Text | Google Scholar

Sanchez-Gomez, J. M., Vega-Rodríguez, M. A., and Pérez, C. J. (2018). Extractive multi-document text summarization using a multi-objective artificial bee colony optimization approach. Knowl. Based Syst. 159, 1–8. doi: 10.1016/j.knosys.2017.11.029

Crossref Full Text | Google Scholar

Sanchez-Gomez, J. M., Vega-Rodríguez, M. A., and Pérez, C. J. (2020). A decomposition-based multi-objective optimization approach for extractive multi-document text summarization. Appl. Soft Comput. J. 91:106231. doi: 10.1016/j.asoc.2020.106231

Crossref Full Text | Google Scholar

Sanchez-Gomez, J. M., Vega-Rodríguez, M. A., and Pérez, C. J. (2022). A multi-objective memetic algorithm for query-oriented text summarization: medicine texts as a case study. Expert Syst. Appl. 198:116769. doi: 10.1016/j.eswa.2022.116769

Crossref Full Text | Google Scholar

Sanchez-Gomez, J. M., Vega-Rodríguez, M. A., and Pérez, C. J. (2024). An indicator-based multi-objective variable neighborhood search approach for query-focused summarization. Swarm Evol. Comput. 91:101721. doi: 10.1016/j.swevo.2024.101721

Crossref Full Text | Google Scholar

Steinberger, J., and JeŽek, K. (2004). “Text summarization and singular value decomposition,” in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Cham: Springer). doi: 10.1007/978-3-540-30198-1_25

Crossref Full Text | Google Scholar

Team, M. (2024a). Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models. Available online at: https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ (Accessed September 15, 2025).

Google Scholar

Team, Q. (2024b). Qwen2.5: A Party of Foundation Models. Available online at: https://qwenlm.github.io/blog/qwen2.5/ (Accessed September 15, 2025).

Google Scholar

Teknium, R., Quesnelle, J., and Guang, C. (2024). Hermes 3 technical report. arXiv [preprint]. arXiv:2408.11857. doi: 10.48550/arXiv:2408.11857

Crossref Full Text | Google Scholar

Travassos, G. H., Gurov, D., and Amaral, E. (2020). Introdução à Engenharia de Software. Rio de Janeiro, RJ: Relatório, Universidade Federal do Rio de Janeiro. Experimental.

Google Scholar

Van Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J.-B., Aali, A., Bluethgen, C., et al. (2024). Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142. doi: 10.1038/s41591-024-02855-5

PubMed Abstract | Crossref Full Text | Google Scholar

Wang, F., Lin, M., Ma, Y., Liu, H., He, Q., Tang, X., et al. (2025). “A survey on small language models in the era of large language models: architecture, capabilities, and trustworthiness,” in Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, KDD '25 (New York, NY: Association for Computing Machinery), 6173–6183. doi: 10.1145/3711896.3736563

Crossref Full Text | Google Scholar

Woodsend, K., and Lapata, M. (2011). “Learning to simplify sentences with quasi-synchronous grammar and integer programming,” in EMNLP 2011- Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference.

Google Scholar

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2020). “BERTScore: evaluating text generation with BERT,” in Proceedings of the International Conference on Learning Representations (ICLR). Published as a Conference Paper at ICLR 2020. Available online at: https://arxiv.org/abs/1904.09675

Google Scholar

Keywords: text summarization, automatic text summarization, abstractive generic summarization, extractive generic summarization, small language models

Citation: Guimarães A, Colaço Junior M, De Almeida SS, Garcia Ferreira de Araújo G, Fontes RS, Prado H, Credidio Freire Alves LP, Matos N, de Medeiros Valentim RA and dos Santos JPQ (2026) Small language models applied in text summarization task of health-related news to improve public health audit: an experimental case study. Front. Artif. Intell. 9:1708993. doi: 10.3389/frai.2026.1708993

Received: 19 September 2025; Revised: 04 December 2025;
Accepted: 08 January 2026; Published: 05 February 2026.

Edited by:

Samuel Moore, Ulster University, United Kingdom

Reviewed by:

Muhammad Asim Ali, Ulster University - Belfast Campus, United Kingdom
Seán Ó. Fithcheallaigh, Ulster University - Belfast Campus, United Kingdom

Copyright © 2026 Guimarães, Colaço Junior, De Almeida, Garcia Ferreira de Araújo, Fontes, Prado, Credidio Freire Alves, Matos, de Medeiros Valentim and dos Santos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Alysson Guimarães, alyssonalk@gmail.com

ORCID: Alysson Guimarães orcid.org/0000-0002-1035-8992
Methanias Colaço Junior orcid.org/0000-0002-4811-1477
Samuel Santana De Almeida orcid.org/0000-0001-8662-1673
Gabriely Garcia Ferreira de Araújo orcid.org/0009-0009-2624-690X
Raphael Silva Fontes orcid.org/0000-0003-3160-3384
Luca Pareja Credidio Freire Alves orcid.org/0009-0000-8449-0024
Natan Matos orcid.org/0009-0003-6979-243X
Ricardo Alexsandro de Medeiros Valentim orcid.org/0000-0002-9216-8593
João Paulo Queiroz dos Santos orcid.org/0000-0002-9130-7723
