ORIGINAL RESEARCH article

Front. Comput. Sci., 23 April 2026

Sec. Human-Media Interaction

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1758333

Node-Sampling: adaptive multi-agent optimization in medical education

  • Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf, Hamburg, Germany

Abstract

Introduction:

Differences in prior knowledge among incoming medical students pose a persistent challenge for universities. To promote more individualized and equitable preparation, a large language model-based learning platform is being developed at the University Medical Center Hamburg-Eppendorf. A central component of this platform is the automated generation of multiple-choice questions (MCQs) from curated medical materials. However, ensuring their educational quality remains difficult, particularly when relying on smaller, locally deployed language models.

Methods:

This study introduces Node-Sampling, a self-optimizing multi-agent approach for improving MCQ quality. The method identifies efficient refinement strategies by modeling agents as an adaptive sequence optimized through the REINFORCE algorithm.

Results:

Expert evaluations showed that Node-Sampling significantly improves the quality of question stems compared to a fixed baseline. Importantly, Node-Sampling achieved this performance with an effective three-agent configuration, requiring only 33% of the original resources. Results for answer options were less consistent.

Discussion:

The results highlight the potential of adaptive multi-agent optimization to strengthen automated question refinement. Node-Sampling therefore presents a sustainable and promising approach to improving MCQ quality and supports more effective and personalized preparation for medical students.

1 Introduction

Medical education relies on a solid foundation of pre-existing knowledge in subjects such as biology or chemistry. However, the level of prior knowledge among incoming students often varies substantially (Nivala et al., 2016), reflecting the heterogeneity of secondary education in Germany (Behrend et al., 2019). To address these disparities, the University Medical Center Hamburg-Eppendorf (UKE) offers preparatory in-person crash courses. However, they are often poorly attended by students and require considerable preparation effort from educators.

To mitigate this problem, the Institute for Applied Medical Informatics (IAM) at the UKE is developing a large language model (LLM)-based learning platform called KiMED for medical students (Credidio et al., 2025). The present work builds on this platform through an independent extension. The main component is the generation of high-quality multiple choice questions (MCQs) designed to help students identify and close individual knowledge gaps. This enables students to prepare independently and ensures a more equal level of knowledge at the beginning of their studies. The platform offers a safe learning space for students and reduces the preparation effort for educators.

Designing high-quality MCQs is difficult, as the questions need to fulfill certain criteria, such as grammatical correctness, the absence of negative wording, and semantically distinct answer options. The overall set of criteria ensures that the questions are formulated as legally robust as possible. Several studies have explored MCQ generation in the medical domain using global models such as ChatGPT. Some report promising results (Kıyak and Emekli, 2024; Kıyak et al., 2024), but evaluation criteria are often vague, and prompt designs do not incorporate formal quality constraints. Other works emphasize the continued need for human review and improved prompting strategies (Rivera-Rosas et al., 2024).

The major challenges of LLM-based MCQ generation for the learning platform are limited computing resources and the requirement for local deployment. The latter is essential, as the model processes curated learning materials from the UKE that are subject to copyright or contain sensitive information. Running the model locally ensures data privacy but also necessitates the use of smaller models. Smaller models show weaker performance (Shen et al., 2024) and therefore generate lower-quality MCQs. Typical issues are vague phrasing, grammatical mistakes, or semantically similar answer options. Preliminary experiments showed that prompt engineering alone is insufficient: longer prompts reduced instruction adherence and made it difficult to implement robust metrics for verifying whether all criteria were satisfied.

Agent-based systems can address these deficiencies by distributing refinement tasks among specialized agents (Tian et al., 2025; Sreedhar and Chilton, 2024). Yet, creating and updating such pipelines manually poses a substantial workload and introduces variability that can undermine reliability. Moreover, LLMs already incur significant energy consumption (Samsi et al., 2023), and multi-agent systems amplify this due to the increased number of model invocations. Consequently, the number of agents should be kept as low as possible to balance performance gains with computational and energy constraints.

Motivated by these challenges, this work presents a multi-agent optimization approach called Node-Sampling. Inspired by GPTSwarm (Zhuge et al., 2024), Node-Sampling models the agent pipeline as an optimizable sequence rather than a graph. Each agent is represented as an independent node, and a probability distribution is defined over node selections. In each optimization step, a sequence of nodes is sampled. The sequence is optimized with the REINFORCE algorithm (Williams, 1992), which maximizes a predefined utility function.

Recent work focuses on static MCQ refinement (Hang et al., 2024; So et al., 2025). In contrast, Node-Sampling is a universal, modular optimization layer that can operate on any underlying LLM, agent set, or utility function.

This study aims to improve MCQ quality in medical education through self-optimizing agent pipelines, ensuring that even small, locally deployable language models can generate questions that meet educational standards. To remain compatible with limited computational resources, the approach explicitly focuses on minimizing both the number of agents involved and the number of agent calls required during training. This reduction helps contain energy consumption, shortens optimization cycles, and allows the system to operate efficiently in local environments.

2 Materials and methods

The methodological framework of this study was designed to evaluate whether Node-Sampling can improve the quality of MCQs compared to a fixed baseline. The process begins with the generation of MCQs using a Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) pipeline based on curated medical learning materials. RAG couples the dynamic retrieval of documents with LLMs, enabling up-to-date and domain-specific information from trusted sources to be incorporated. This enables its application in learning tools while maintaining LLMs' personalization potential (Gao et al., 2024). The retrieval corpus consists of licensed textbook excerpts, course slides, and literature sources that were preprocessed, chunked into semantically coherent sections, and enriched with metadata. Retrieval is performed using a hybrid strategy combining dense vector similarity search and keyword matching to ensure both semantic and lexical relevance. Retrieved passages are provided to the language model as grounding context for MCQ generation.
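To make the hybrid retrieval step concrete, the following minimal sketch fuses dense (semantic) similarity with simple keyword overlap. It is illustrative only: the function names, the equal weighting, and the assumption of precomputed chunk embeddings are ours, not the platform's actual implementation.

```python
# Minimal sketch of hybrid retrieval: dense cosine similarity blended with
# keyword (lexical) overlap. The 0.5/0.5 weighting and precomputed embeddings
# are illustrative assumptions, not the KiMED implementation.
import numpy as np

def dense_scores(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between the query embedding and each chunk embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ q

def keyword_scores(query: str, chunks: list[str]) -> np.ndarray:
    """Fraction of query terms that literally occur in each chunk."""
    terms = set(query.lower().split())
    return np.array([len(terms & set(c.lower().split())) / len(terms) for c in chunks])

def hybrid_retrieve(query, query_vec, chunks, chunk_vecs, k=3, alpha=0.5):
    """Blend semantic and lexical relevance, return the top-k chunk indices."""
    scores = alpha * dense_scores(query_vec, chunk_vecs) \
        + (1 - alpha) * keyword_scores(query, chunks)
    return np.argsort(scores)[::-1][:k]
```

The blending weight alpha controls the trade-off between semantic and lexical relevance; in practice it would be tuned on the curated corpus.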

Since not all generated questions meet established design standards, a structured refinement process is applied. Each refinement step is represented by an individual agent corresponding to a specific quality criterion. These criteria are also used to construct the utility function, which quantifies the overall question quality. Node-Sampling optimizes the sequence of agents to maximize this utility. Finally, domain experts evaluated the refined MCQs to verify whether the optimized sequences led to significant quality improvements over the baseline. The following section details the methodological process described above.

2.1 Approach

Node-Sampling is divided into two components to avoid unstructured processing when refining the full MCQ:

  • Question stem refinement: Node-Sampling is first applied to the question stem. The resulting optimized agent sequence is then used to refine selected MCQs. Both the agents and the utility function are defined according to the established quality criteria for MCQ stems.

  • Answer options and explanation refinement: based on the refined stems, the answer options, the correct answer, and an explanation are generated through a RAG pipeline. In this project, internally developed slides, texts, and license-compliant textbook sections are included. These materials are chunked by title, enriched with metadata, and processed through a hybrid retrieval system combining semantic search and keyword matching. This setup ensures that all generated MCQs remain aligned with the intended learning materials, minimizing hallucinations and guaranteeing that students are tested only on relevant, accurate content. The generated content then undergoes a second round of Node-Sampling, resulting in a separate optimized agent sequence for refining answer options and explanations. Here, the agents and utility function follow the quality criteria defined for this part of the MCQ.

Thus, two different agent sequences were employed: one specialized in refining the question stem, and another dedicated to improving the answer options and explanation. The overall workflow is visualized in Figure 1. For question stem as well as answer option generation, the Qwen-14B Instruct model (Qwen Team, 2025, 2024) was employed.

Figure 1

2.2 Agent construction

For MCQ refinement, each agent corresponds to a specific quality criterion that must be met. Quality criteria were selected by domain experts based on literature guidelines (Wang et al., 2025; Moore et al., 2023). For the question stem, the following agents are defined:

  • Grammar [Grammar agent]: checks whether the question is grammatically correct and free of spelling errors.

  • Negatively worded? [Negative agent]: negatively worded questions are likely to confuse students and less likely to assess meaningful learning outcomes. Therefore, ensuring that the question is not phrased negatively is important for generating high-quality MCQs.

  • Need for rephrasing? [Rephrase agent]: evaluates whether the question would benefit from rephrasing to improve clarity and eliminate possible errors.

  • True/false question? [TrueFalse agent]: an effective MCQ should be answerable without needing to see the options. It should be self-contained. Therefore, questions should ideally not ask for a correct or incorrect statement.

  • Hallucination Free? [Correctness agent]: ensures that the questions are factually correct and aligned with the retrieved documents from the RAG pipeline.

  • Double-barrelled question? [TwoInOne agent]: questions that address more than one issue at a time can be confusing and unfocused. Such constructions should be avoided.

The quality of answer options, including the correct answer and the explanations, is evaluated using the Grammar and Hallucination Free criteria, complemented by the following additional criteria (a schematic agent implementation is sketched after this list):

  • Avoidance of longest answer being correct [Longest agent]: ensures that the correct option is not the longest, as the longest option could serve as an unintended hint for students.

  • Distinction [Distinct agent]: all answer options should be distinct from each other to make sure each represents a unique alternative. They should also not exclude one another in an obvious way, as students could then easily answer through elimination.

  • Avoidance of absolute terms [Absolute agent]: absolute terms, such as “never,” “always,” or “all,” should not be present. Students often recognize these as likely incorrect.

  • Avoidance of vague terms [Vague agent]: vague terms like “frequently” or “occasionally” should be avoided. They are not precise and can lead to an ambiguous interpretation.
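The sketch below shows how such a criterion-specific agent could be realized around a locally deployed model. It is a hypothetical illustration: `call_llm` stands in for the local Qwen-14B Instruct endpoint, and the prompt wording is assumed rather than taken from the study.

```python
# Sketch of a criterion-specific refinement agent. `call_llm` is a hypothetical
# stand-in for the locally deployed model endpoint; the prompt wording is
# illustrative, not the exact prompt used in the study.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str       # e.g. "Negative agent"
    criterion: str  # natural-language description of the quality criterion

    def refine(self, question: str, call_llm: Callable[[str], str]) -> str:
        prompt = (
            f"Quality criterion: {self.criterion}\n"
            f"Question: {question}\n"
            "If the question violates the criterion, rewrite it so that it "
            "complies; otherwise return it unchanged."
        )
        return call_llm(prompt)

NEGATIVE_AGENT = Agent("Negative agent", "The question must not be phrased negatively.")
```

Each quality criterion listed above would yield one such agent; Node-Sampling then decides which of them to invoke and in which order.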

2.3 Utility function

The utility function was designed to evaluate the quality of MCQs refined by the agent sequence. It is the main component in the optimization process. Assessing improvements requires a robust evaluation approach, since both input and output are MCQs. Establishing a ground truth (GT) for high-quality MCQs is particularly challenging, as there is no “perfect” MCQ. A GT would be the ideal refinement of the corresponding input MCQ. Furthermore, it is challenging to find a suitable metric for evaluating the similarity between the refined MCQ and the GT.

The previously defined quality criteria are used to calculate a utility score for each generated MCQ. Question stems and answer options, including explanations, are optimized in separate steps. Accordingly, two distinct utility functions are defined, each incorporating the relevant quality criteria.

The utility function for question stems evaluates the following criteria: Grammar, Negatively Worded?, Need for Rephrasing?, True/False Question?, Hallucination Free?, and Double-Barreled Question?. Two additional criteria are included: Free of Comments? and Only in German?.

For answer options and explanations, the utility function assesses: Grammar, Hallucination-Free, Avoidance of Longest Answer Being Correct, Distinction, Avoidance of Absolute Terms, Avoidance of Vague Terms, and includes the Only in German criterion.

The same language model used within the Node-Sampling system determines whether each criterion is satisfied. The final utility score is calculated as the proportion of satisfied criteria relative to the total number of criteria. This yields a value between 0 and 1, with 1 representing the best possible score. This score serves as the objective for optimizing the agent sequence.
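As a minimal sketch of this scoring scheme: the same model judges each criterion with a yes/no verdict, and the utility is the fraction of satisfied criteria. The `call_llm` judge and the prompt format are assumptions for illustration, not the study's exact prompts.

```python
# Sketch of the criterion-based utility score: the fraction of satisfied
# criteria, each judged by the same LLM. `call_llm` is a hypothetical judge
# endpoint returning a "yes"/"no" verdict; the prompt format is illustrative.
from typing import Callable

def utility(question: str, criteria: list[str], call_llm: Callable[[str], str]) -> float:
    satisfied = 0
    for criterion in criteria:
        verdict = call_llm(
            f"Criterion: {criterion}\nQuestion: {question}\n"
            "Answer strictly 'yes' if the criterion is satisfied, else 'no'."
        )
        satisfied += verdict.strip().lower().startswith("yes")
    return satisfied / len(criteria)  # in [0, 1], 1 = all criteria met
```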

2.4 Node-Sampling approach

In Node-Sampling, each node corresponds to a specific agent. An illustrative example is shown in Figure 2A.

Figure 2

For Node-Sampling, a sequence of agents a = [a_0, a_1, ..., a_{T−1}] with a_t ∈ {0, ..., N−1} is defined. Each a_t denotes the agent selected at position t in the sequence.

Given the sequence length T∈ℕ and the number of distinct agents N∈ℕ, a logit matrix (Equation 1)

$$\Theta = \begin{pmatrix} \theta_{0,0} & \cdots & \theta_{0,N-1} \\ \vdots & \ddots & \vdots \\ \theta_{T-1,0} & \cdots & \theta_{T-1,N-1} \end{pmatrix} \in \mathbb{R}^{T \times N}$$

is constructed. It models the preference for selecting each agent at each position: the entry θ_{t,n} represents the logit associated with selecting agent n at position t.

Each entry θ_{t,n} with t∈{0, ..., T−1} and n∈{0, ..., N−1} is initialized with the same fixed value (Equation 2):

$$\theta_{t,n} = \theta_{\text{init}} \quad \text{for all } t \text{ and } n,$$

so that, initially, every agent is equally likely at every position.

The logit matrix defines the categorical distribution over all sequences.

Each row θ_t corresponds to a softmax-based distribution over all agents at position t. Let Softmax(·) denote the softmax function. The probability of selecting agent n∈{0, …, N−1} at position t∈{0, …, T−1} (Equation 3) is

$$p_\theta(a_t = n) = \mathrm{Softmax}(\theta_t)_n = \frac{\exp(\theta_{t,n})}{\sum_{m=0}^{N-1} \exp(\theta_{t,m})}.$$

A full agent sequence a is sampled from this categorical distribution by sampling each a_t independently. The log probability of the entire sequence a (Equation 4) is given by

$$\log p_\theta(a) = \sum_{t=0}^{T-1} \log p_\theta(a_t).$$
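For illustration, the following minimal PyTorch sketch instantiates Equations 1–4; T and N are arbitrary example values, and the uniform zero initialization corresponds to Equation 2.

```python
# Minimal sketch of the sequence distribution: a T x N logit matrix, one
# softmax per position, independent sampling of each agent, and the sequence
# log-probability (Equations 1-4). Sizes are illustrative.
import torch

T, N = 5, 6                                    # sequence length, number of agents
theta = torch.zeros(T, N, requires_grad=True)  # Equation 2: uniform initialization

dist = torch.distributions.Categorical(logits=theta)  # one categorical per row
a = dist.sample()                                     # sampled sequence a_0..a_{T-1}
log_p = dist.log_prob(a).sum()                        # Equation 4: sum over positions
print(a.tolist(), log_p.item())
```

Because all logits start equal, the first sampled sequences are uniformly random; optimization then shifts probability mass toward useful agents.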

2.4.1 Optimization procedure

Given a sampled sequence a, the corresponding agent operations are executed to refine the input MCQ. The final output is then evaluated by the utility function u_τ(a)∈ℝ with respect to task τ. The objective is to maximize the expected utility over all sequences sampled from the current distribution (Equation 5):

$$\max_\theta \; \mathbb{E}_{a \sim p_\theta}\big[u_\tau(a)\big].$$

This objective is optimized with the REINFORCE algorithm (Williams, 1992), which estimates the gradient of the expected utility (Equation 6) by

$$\nabla_\theta \, \mathbb{E}_{a \sim p_\theta}\big[u_\tau(a)\big] \approx \frac{1}{M} \sum_{i=1}^{M} \hat{u}_\tau(a_i) \, \nabla_\theta \log p_\theta(a_i),$$

where a_1, ..., a_M are the sampled sequences of the current batch and the estimated utility score of sequence a_i is denoted as û_τ(a_i).

The REINFORCE loss (Equation 7) is given by

$$\mathcal{L}(\theta) = -\frac{1}{M} \sum_{i=1}^{M} \big(\hat{u}_\tau(a_i) - b\big) \log p_\theta(a_i),$$

where b is a moving-average utility baseline used to reduce variance.

The optimization parameters θ_{t,n} with t∈{0, ..., T−1} and n∈{0, ..., N−1} are updated using the Adam optimizer (Kingma and Ba, 2015). The gradient update increases the probability of agents that improve utility and decreases it otherwise.
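One optimization step could look like the following sketch. The batch size, the baseline decay, and the placeholder `utility_of` (which stands in for refining an MCQ with the sampled sequence and scoring it via the utility function) are assumptions; only the general REINFORCE-with-baseline structure and the learning rate of 0.1 are taken from the text.

```python
# Sketch of REINFORCE updates with a moving-average baseline and Adam
# (Equations 5-7). `utility_of` is a dummy placeholder for "refine and score
# an MCQ" so the snippet runs standalone; batch size and decay are assumed.
import torch

T, N, M_batch, beta = 5, 6, 4, 0.9
theta = torch.zeros(T, N, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.1)        # learning rate from the experiments
baseline = 0.0

def utility_of(seq: torch.Tensor) -> float:    # placeholder for refine-and-score
    return torch.rand(1).item()

for step in range(130):                        # 130 iterations as in the study
    dist = torch.distributions.Categorical(logits=theta)
    seqs = [dist.sample() for _ in range(M_batch)]
    utils = [utility_of(s) for s in seqs]
    # Equation 7: negative advantage-weighted log-likelihood over the batch
    loss = -sum((u - baseline) * dist.log_prob(s).sum()
                for s, u in zip(seqs, utils)) / M_batch
    opt.zero_grad()
    loss.backward()
    opt.step()
    baseline = beta * baseline + (1 - beta) * (sum(utils) / M_batch)  # moving average
```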

In contrast to GPTSwarm, Node-Sampling requires fewer agent calls during optimization. In GPTSwarm, the total number of LLM calls in each iteration depends on the following parameters:

  • Number of nodes O∈ℕ: total nodes in the graph G

  • Batch size M∈ℕ: the number of sampled graphs and MCQs to be refined and evaluated

  • Number of output nodes Oout∈ℕ

  • Number of evaluation criteria C∈ℕ applied to each output ŷ_d, where d = 1, …, O_out, and

  • Number of predecessors P_n per node n in the graph G.

Accordingly, the total number of LLM calls per iteration (Equation 8) is determined as

$$\text{Calls}_{\text{GPTSwarm}} = M \cdot \left( \sum_{n=1}^{O} P_n + O_{\text{out}} \cdot C \right).$$

For Node-Sampling, the total number of LLM calls is defined solely by the following factors:

  • Number of executed agents O ≤ T, with T as the maximum sequence length

  • Batch size M∈ℕ: the number of sampled sequences and MCQs to be refined and evaluated, and

  • Number of evaluation criteria C∈ℕ applied to the single output ŷ.

Therefore, the total number of LLM calls per iteration (Equation 9) is given by

$$\text{Calls}_{\text{Node-Sampling}} = M \cdot (O + C).$$
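As a worked example of Equation 9 under assumed parameter values (not measured figures from the study):

```python
# Worked example of Equation 9 under assumed values: with a batch of M = 4
# sampled sequences, O = 10 executed agents per sequence, and C = 8 evaluation
# criteria, one Node-Sampling iteration needs M * (O + C) LLM calls.
M, O, C = 4, 10, 8
calls_per_iteration = M * (O + C)   # 4 * (10 + 8)
print(calls_per_iteration)          # 72
```

The count grows linearly in the number of executed agents, which is why shorter sequences directly translate into fewer LLM calls and lower energy consumption.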

2.4.2 Advanced Node-Sampling approach

The advanced Node-Sampling approach adds a STOP node. This allows sequences to terminate early. For example, if the maximum sequence length is set to five, sequences of lengths zero to five become possible. A possible sequence is illustrated in Figure 2B, where the STOP node is sampled at the fifth position, resulting in a sequence of length four.

Therefore, the logit matrix is expanded (Equation 10) to

$$\Theta \in \mathbb{R}^{T \times (N+1)},$$

with the final column containing the logit values of the STOP node. The STOP node is introduced to investigate whether a particular sequence length corresponds to an improved utility, as longer sequences might not automatically lead to better results. For example, with a maximum sequence length of 10 but only six distinct agents, the STOP node could terminate the sequence at position six.

Furthermore, the STOP node can improve efficiency, as it potentially reduces the number of executed agent operations. This, in turn, may speed up convergence during optimization.

A regularization term is included in the loss function to penalize sequence length. The loss (Equation 11) is then computed by

$$\mathcal{L}(\theta) = -\frac{1}{M} \sum_{i=1}^{M} \big(\hat{u}_\tau(a_i) - b - \lambda\, l_i\big) \log p_\theta(a_i),$$

where λ∈(0, 1) is a regularization weight and l_i∈[0, T] is the length of the sequence a_i. This encourages the optimizer to find the shortest possible sequence that still yields high utility.
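A sketch of the advanced variant follows, assuming the STOP column truncates the sequence at the first sampled STOP and the length penalty enters the advantage as reconstructed above; the placeholder utility and baseline values only make the snippet self-contained.

```python
# Sketch of the advanced variant: the logit matrix gains a STOP column
# (Equation 10) and the REINFORCE advantage is penalized by sequence length
# (Equation 11). Truncation logic and placeholder scores are illustrative.
import torch

T, N, lam = 10, 6, 0.03                            # max length, agents, reg. weight
theta = torch.zeros(T, N + 1, requires_grad=True)  # final column = STOP logits

dist = torch.distributions.Categorical(logits=theta)
raw = dist.sample()                                 # one draw per position
stops = (raw == N).nonzero()                        # positions where STOP was drawn
length = int(stops[0, 0]) if len(stops) else T      # truncate at first STOP
executed = raw[:length]                             # agents actually executed

utility, baseline = 0.8, 0.5                        # placeholder values
advantage = utility - baseline - lam * length       # length-penalized advantage
loss = -advantage * dist.log_prob(raw).sum()        # simplification: all positions
```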

2.5 Baseline

The outputs of the proposed methods were compared to the results of the reference pipeline (baseline, i.e., all selected quality criteria are passed sequentially in a predefined order). Each agent corresponds to a specific quality criterion that must be fulfilled. For the baseline, one node corresponds to a single agent.

The baseline order of agents for the stem refinement is: TrueFalse → TwoInOne → NegativeWorded → Correct → Rephrase → Grammar.

The baseline order of agents for the option and explanation refinement is: Distinct → Longest → Absolute → Vague → Correct → Grammar.

This specific order of agents is implemented to first check structural aspects of the question before focusing on formal correctness. The idea behind this approach is that essential reformulations are often necessary at the beginning, which would render early grammar and phrasing checks ineffective if the wording and structure of the question are later modified significantly.

2.6 Dataset

All MCQs were generated and evaluated in German. The example questions shown in the manuscript were translated into English for illustration. The dataset used for optimization consists of 80 biochemistry questions generated by the RAG pipeline and edited by domain experts; 50 questions were used for training and 30 for evaluation. Experts were instructed to generate questions containing controlled violations of the predefined quality criteria to enable systematic evaluation of whether the refinement process can detect and correct specific error types under reproducible conditions.

2.7 Experimental setup

The experiments were first conducted on question stem refinement to establish suitable parameters. These parameters were then also applied to answer options and explanations for the final evaluation.

The first step was to determine a suitable learning rate for further experiments. A learning rate of 0.1 showed the most consistent performance, with steady logit adaptation over iterations. Therefore, a learning rate of 0.1 was used in all subsequent experiments (see Supplementary Figure S1 for more details).

Next, the Node-Sampling approach was tested without the STOP node or regularization, with the number of iterations set to 130. It was investigated whether longer sequences lead to higher utility and whether some agents are more important than others. Additionally, the STOP node was included, which helped to assess whether applying agents several times is effective or redundant. Last, the regularization term was added, and a suitable regularization weight had to be determined. The goal was to promote solutions that achieve equal or superior performance compared to longer sequences.

2.8 Evaluation method

The Node-Sampling outputs and the baseline were evaluated by five domain experts responsible for manually designing MCQs used in exam preparation and assessments. These experts are well-qualified to assess MCQ quality due to their expertise in both academic content and the formulation of high-quality questions. Their involvement ensures that the evaluation reflects which MCQs best support students in achieving learning objectives.

A one-dimensional χ²-test (i.e., a goodness-of-fit test on the vote counts; Bortz and Schuster, 2010) was applied to determine whether the methods significantly outperformed the baseline. Separate tests were conducted for each pairwise comparison.

Evaluators received the original questions alongside the refined versions from the baseline and other test setups. The order of versions was permuted, so evaluators were blind to which version corresponded to which method. They selected the best version, marking both if two were equally good, or none if the versions did not differ. Votes were counted per version. For cases where evaluators marked multiple versions as equally good, each selected version received a vote. “None good” selections were excluded.

To compare two versions, the null hypothesis H0 assumes that votes are equally distributed. The null hypothesis is rejected if the test statistic is larger than the critical χ² value for α, indicating a significant difference between versions.

To account for multiple comparisons, the Bonferroni correction (Bortz and Schuster, 2010) was applied: the significance level α = 0.05 is divided by the number of tests to control the family-wise error rate (FWER). The FWER is the probability of at least one Type I error occurring across a family of statistical tests (Cramer et al., 2016), where a Type I error refers to incorrectly rejecting H0 when it is actually true (Puhani, 2025).
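This procedure can be sketched as follows; the vote counts are illustrative, not the study's data, and SciPy's one-sample χ² test against a uniform expectation mirrors the described test:

```python
# Sketch of the evaluation statistics: a one-sample chi-square test on the
# vote counts of two versions, with a Bonferroni-adjusted alpha. The counts
# below are assumed for illustration only.
from scipy.stats import chisquare

votes = [38, 17]                 # votes for version A vs. version B (assumed)
stat, p = chisquare(votes)       # H0: votes are equally distributed
alpha = 0.05 / 3                 # Bonferroni correction for 3 pairwise tests
print(f"chi2={stat:.2f}, p={p:.4f}, significant={p < alpha}")
```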

2.9 Robustness and transferability

To evaluate the robustness of the Node-Sampling approach, the final agent sequences obtained in different test scenarios were compared. Specifically, it was examined whether the resulting agent sequences across setups consistently identified the same agents as important.

Exact reproducibility of the final agent sequences could also be considered a measure of reliability. However, previous tests showed that the specific order of agents within the optimized sequence does not significantly influence the resulting utility values (Supplementary Tables S5, S6). Therefore, exact replication of the same agent sequence is not required to ensure reproducibility. Instead, the emphasis is placed on robustness.

The optimized sequence obtained from the biochemistry domain was applied to 30 physics-related MCQs from the test set to assess transferability. The MCQs were generated analogously to the biochemistry questions using UKE curated learning materials.

This evaluation, conducted by domain experts, served to examine whether the optimized sequence generalizes beyond biochemistry content and maintains its effectiveness in structurally similar, yet conceptually distinct, question domains.

2.9.1 Hardware configuration

The experiments were conducted with the following hardware configuration:

  • GPU model: NVIDIA RTX A6000 (NVIDIA Corporation, Santa Clara, California, USA).

  • GPU memory: 49140MiB.

  • Driver version: 535.129.03.

  • CUDA version: 12.2.

  • CPU: Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz (Intel Corporation, Santa Clara, California, USA).

3 Results

Prior to the experiments, the LLM-based utility score was compared to human ratings (see Supplementary Figure S4), showing a strong correlation (r = 0.953). This validates the utility function as a proxy for expert judgment. Comparing fixed sequence lengths of six and 10 agents over 130 iterations (this number of iterations was chosen because utility plateaued while additional iterations increased runtime without measurable gains), the longer sequence achieved a higher moving-average utility (Figure 3). Analysis of the final logit distributions (Figure 4) showed that the Grammar and Rephrase agents were never selected, indicating limited contribution to quality improvement. In contrast, NegativeWorded and TwoInOne appeared repeatedly, suggesting they are key drivers of utility and that repeated application may be beneficial.

Figure 3

Figure 4

Next, the STOP node was added, and the maximum sequence length was set to 10. No early termination occurred, and the moving-average utility continued to increase beyond 0.9 (Supplementary Figure S2). As in previous setups, Grammar and Rephrase were not selected, and Correctness was also absent. These results suggest that longer sequences remain beneficial for MCQ stem refinement.

Lastly, the regularization term was included, and a suitable regularization weight had to be determined as a starting point. A high weight (λ = 0.1) shortened the sequence excessively (to a single agent) and reduced utility, while a very low weight (λ = 0.01) produced no shortening at all (Supplementary Figure S3). Intermediate values showed better trade-offs: λ = 0.03 yielded a three-agent sequence with utility up to 0.85, whereas λ = 0.05 produced a shorter but weaker two-agent sequence (Figure 5). Therefore, λ = 0.03 was selected as the final configuration, with the resulting agent sequence TrueFalse → TwoInOne → NegativeWorded. The two-agent sequence with λ = 0.05 did not achieve higher utility and approached zero too rapidly, supporting the choice of the three-agent configuration.

Figure 5

3.1 Evaluation of Node-Sampling with baseline

For the final evaluation, Node-Sampling with a fixed sequence length of 10 (test setup 1) and the regularized configuration with λ = 0.03 (test setup 2) were compared against the baseline. Both configurations were evaluated for question stem and option refinement using identical parameters.

3.1.1 Evaluation of question stem refinement

The question stem quality of all versions was evaluated by five domain experts (Supplementary Table S1; percent agreement (McHugh, 2012) between evaluators for Node-Sampling vs. baseline decisions ≈ 88%). Each evaluator received the baseline version, as well as the two Node-Sampling configurations: test setup 1 (sequence length 10) and test setup 2 (regularization weight of 0.03). The corresponding χ2 test statistics are presented in Table 1. Details on which quality criteria were violated before and after refinement are provided in Supplementary Figure S5.

Table 1

Comparison of question stem        Test statistic    Significant
Baseline vs. test setup 1          7.36              Yes
Baseline vs. test setup 2          10.91             Yes
Test setup 1 vs. test setup 2      0.36              No

Comparison of answer options       Test statistic    Significant
Baseline vs. test setup 1          1.43              No
Baseline vs. test setup 2          0.11              No
Test setup 1 vs. test setup 2      0.73              No

χ2 test statistics for version comparisons of the question stem refinement and the answer options.

Test setup 1 (sequence length 10) and 2 (regularization weight 0.03) received significantly more votes than the baseline. There is no significant difference between test setup 1 and test setup 2.

Both Node-Sampling approaches show a significant improvement over the baseline. They were chosen significantly more often as the best version. A translated example question is shown in Table 2.

Table 2

Version              Example question
Original question    How does D-ribose differ from D-ribulose? What are the main differences between these two carbohydrates?
Baseline             What is the role of D-ribose in biological systems?
Test setup 1         What chemical modification differentiates D-ribose from 2-deoxy-D-ribose?
Test setup 2         What is the main difference between D-ribose and D-ribulose?

A translated example question for question stem refinement.

In this case, test setups 1 and 2 were selected as the better options by all evaluators. The baseline produces a rather broad question, which is not considered a high-quality MCQ. Test setups 1 and 2 target a specific biochemical distinction, which reduces ambiguity.

3.1.2 Evaluation of option refinement

Option refinement was evaluated analogously to the question stem study (see Supplementary Table S2; percent agreement between evaluators for Node-Sampling vs. baseline decisions ≈ 75%). The refined baseline questions served as inputs for generating answer options through the RAG pipeline, which were then refined by all three setups [baseline, sequence length 10 (test setup 1), and regularization weight 0.03 (test setup 2)]. Baseline-refined questions were chosen because the associated answer options still contained potential weaknesses, making it possible to test whether Node-Sampling could provide improvements.

The results show no statistically significant differences between setups (Table 1). In many cases, all versions produced similarly phrased correct and incorrect options, limiting observable differences. In other cases, the questions targeted highly specific entities (e.g., a particular atom or configuration), which likewise left little room for variation across setups. No meaningful differences were observed in the violated quality criteria before vs. after refinement (see Supplementary Figure S6).

The distributions of utility score before and after refinement show no systematic shift toward higher utility. Agent-based refinement produced negligible changes in answer option quality (see Supplementary Figure S7).

3.2 Evaluation of robustness and transferability

Across all setups, the final agent sequences consistently excluded the Grammar and Rephrase agents. In the 10-agent sequence, the NegativeWorded and TwoInOne agents were each applied three times, indicating their particular importance. The regularized sequence included only three agents, among which NegativeWorded and TwoInOne were again present, confirming their robustness across different configurations.

When applying the optimized agent sequence from test setup 1 (sequence length 10) to the physics question set, it yielded a significant improvement in question stem quality compared to the baseline (Table 3). The regularized sequence (test setup 2) was not significantly different from test setup 1.

Table 3

Comparison                       Test statistic    Significant
Baseline vs. test setup 1        5.46              Yes
Baseline vs. test setup 2        2.53              No
Test setup 1 vs. test setup 2    0.57              No

χ2 test statistics for setup comparisons of the question stem refinement for physics questions.

Test setup 1 (sequence length 10) received significantly more votes than the baseline. There is no significant difference between the baseline and test setup 2 (regularization weight 0.03).

4 Discussion and conclusion

This work aimed to support more equitable and individualized preparation for incoming medical students by improving the quality of automatically generated MCQs. Within the learning platform, such questions form a central mechanism for helping students identify knowledge gaps and study independently. However, generating high-quality MCQs with locally deployable models remains challenging. This study addressed this gap by introducing Node-Sampling, a self-optimizing multi-agent approach designed to systematically refine MCQs produced by smaller LLMs. Because the method identifies effective agent sequences automatically, it can operate with substantially fewer agents and therefore at lower computational cost.

In the biochemistry domain, Node-Sampling outperformed the baseline, both with and without regularization. Notably, the regularized three-agent sequence and the 10-agent sequence achieved significantly better performance than the baseline. Using only three instead of 10 agents reduces the required resources to approximately 33%, demonstrating that high-quality MCQ refinement does not require large agent cascades. This shows that Node-Sampling is capable of finding minimal yet effective agent compositions that preserve or improve performance.

Node-Sampling demonstrated robustness and transferability across domains. Although the regularized sequence did not outperform the baseline significantly in the physics setting, the 10-agent sequence did. This suggests that the regularization strength used in the biochemistry domain does not necessarily generalize to other domains. Adjustments to the regularization weights may therefore be required to ensure consistent performance across heterogeneous question sets. In expert evaluation, Node-Sampling significantly outperformed the baseline for question stem refinement, whereas option refinement did not show a significant improvement. Analyses of agent activity and utility scores show that, although the answer-option agents were applied, the proportion of criteria satisfied did not improve. Utility scores also remained essentially unchanged, matching the absence of improvement in human evaluations.

The discrepancy between stem and option refinement highlights the need for stem-aware option generation or additional specialized agents. It also suggests that the current Qwen-14B model, while capable of stem refinement, may have limited capacity for post-hoc optimization of answer options without modifications to the stem or utility criteria.

Grammar and Rephrase agents were consistently excluded, reflecting redundancy under the current utility function. This is likely because modern LLMs already produce text with fluent grammar and phrasing (Qiu et al., 2024; Wu et al., 2023), so higher-impact agents indirectly resolve surface-level issues. This illustrates effective credit assignment in Node-Sampling and highlights that agent selection is utility- and task-dependent rather than a direct measure of pedagogical importance.

Compared to prior static MCQ refinement pipelines (Hang et al., 2024; So et al., 2025), Node-Sampling offers an adaptive reinforcement learning-based approach. It jointly optimizes quality and computational efficiency. Selected agent sequences and refinement steps are directly inspectable and modifiable, increasing explainability. Agents can be exchanged, added, or removed to adapt to new rules, utility functions, or domains. Additional optimization overhead during training is offset once efficient agent sequences are identified. Node-Sampling thus operates as an independent optimization layer on top of any LLM, agent set, or retrieval pipeline.

The choice of baseline was guided by methodological considerations. A simpler baseline, such as using a single prompt with all criteria, could be considered. However, experiments showed that such simple prompting produces substantially lower-quality questions, and overly weak baselines can artificially inflate performance gains. As described in the Methods section, the baseline agent order was not randomly chosen. Additional experiments demonstrated that agent order had negligible impact on performance. Therefore, no single fixed ordering constitutes an optimal baseline.

Factual correctness and hallucination control are central to educational MCQs. Prior work emphasizes evidence-grounded and context-aware verification, including graph-based retrieval-augmented approaches such as TrumorGPT (Hang et al., 2025) or scientific evidence integration (Ni et al., 2024). Unlike open-domain fact-checking, Node-Sampling operates on curated educational materials. It enforces factual consistency via retrieval grounding and a Correctness agent, showing that Node-Sampling can leverage RAG-style evidence retrieval to improve structural and semantic MCQ quality. This highlights that hallucination control is not a standalone step but can be seamlessly incorporated into multi-agent, task-specific optimization.

The boundary conditions of this work include a limited number of evaluators and, consequently, a relatively small dataset of evaluated questions. These challenges are consistent with previous research (Law et al., 2025; Artsi et al., 2024), which report that similar studies evaluating AI-generated MCQs typically involve two to five evaluators assessing between three and 50 questions.

While Node-Sampling optimizes structural MCQ quality, it does not yet address cognitive level. Future extensions could incorporate Bloom-aligned criteria into the utility function (Bloom et al., 1956; Anderson and Krathwohl, 2001), rewarding application, analysis, or evaluation skills to expand beyond structural optimization. Specialized Bloom agents could be seamlessly integrated into the Node-Sampling framework without altering the overall multi-agent optimization process.

Local deployment with smaller LLMs involves trade-offs in model capacity, agent specialization, and semantic depth. Node-Sampling mitigates these limitations via a RAG pipeline and selective agent activation. The framework is compatible with larger models, where reducing agent calls decreases computational cost and energy consumption, providing a scalable solution.

Since all experiments were conducted using German MCQs, performance may differ for other languages, particularly when using smaller models whose multilingual capabilities can vary (Agrawal et al., 2024). Accordingly, the generalizability of the results should be interpreted with respect to the language setting.

Currently, the utility function is LLM-based and measures how many criteria are satisfied, without requiring a GT. Experiments validated the utility function as a proxy for expert evaluation. Nevertheless, there is a potential circularity, as the same LLM is used for both MCQ refinement and scoring. Future work could employ a separate, potentially larger model for evaluation. Node-Sampling may be particularly effective in multi-agent applications where a GT exists. This would allow objective scoring and potentially more stable optimization.

Despite these limitations, Node-Sampling produced a significant improvement in question stem quality, which is particularly relevant as the literature highlights the stem as the most critical component of an MCQ (Steele et al., 2025). Improving the stem alone already enhances overall question quality, often resulting in better answer options (Supplementary Table S4). Consequently, refining the question stem represents a promising strategy for efficient MCQ enhancement. For educators, this means that high-quality stems combined with reasonably good generated options require only minor adjustments, substantially reducing preparation time.

Future extensions of the learning platform could include fill-in-the-blank exercises and open-ended questions. For the latter, students would receive explanations after answering, enabling them to compare their reasoning with the correct solution. An LLM could also provide feedback by comparing the student's answer to the generated explanation, offering personalized guidance. For open-ended formats, Node-Sampling would require additional quality criteria that guide question refinement beyond structural correctness. These criteria can be integrated into the utility function alongside existing ones. Bloom's taxonomy can further inform this process by specifying or weighting cognitive objectives (e.g., application, analysis, or evaluation), either through predefined rubrics or through weighted utility terms that reward increasingly demanding question formulations. These extensions can be implemented by adding specialized agents without changing the multi-agent framework.

Node-Sampling can improve the reliability and quality of automated question generation and make it more sustainable. It can be integrated into various learning tools. The approach could go beyond exam-style assessment by supporting open-ended questions. This would help students consolidate knowledge and deepen their understanding.

In conclusion, this work presented Node-Sampling as a promising and extensible approach for refining MCQs generated by small, locally deployable language models through efficient agent systems. It contributes to more effective and personalized preparation for medical students while reducing the workload for educators. This represents a meaningful step toward scalable, high-quality, AI-assisted digital learning tools.

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

LD: Conceptualization, Writing – review & editing, Investigation, Methodology, Software, Writing – original draft, Data curation. MG: Writing – review & editing, Supervision, Writing – original draft, Methodology. LB: Methodology, Writing – original draft, Writing – review & editing. GC: Writing – original draft, Resources, Writing – review & editing. LR: Writing – review & editing, Supervision, Conceptualization, Writing – original draft.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the Open Access Publication Fund of the UKE - Universitätsklinikum Hamburg-Eppendorf.

Acknowledgments

We would like to thank the domain experts for their time, effort, and constructive evaluations, which were essential for validating this study.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. The author(s) used Generative AI for translation, proofreading, and editing. The author(s) reviewed and edited the content as needed and take full responsibility for the content for the publication.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcomp.2026.1758333/full#supplementary-material

References

Agrawal, A., Dang, A., Nezhad, S. B., Pokharel, R., and Scheinberg, R. (2024). Evaluating multilingual long-context models for retrieval and reasoning. arXiv [Preprint]. doi: 10.48550/arXiv.2409.18006

Anderson, L. W., and Krathwohl, D. R. (2001). A Taxonomy for Learning, Teaching, and Assessing. New York, NY: Longman.

Artsi, Y., Sorin, V., Konen, E., Glicksberg, B. S., Nadkarni, G., Klang, E., et al. (2024). Large language models for generating medical examinations: systematic review. BMC Med. Educ. 24:354. doi: 10.1186/s12909-024-05239-y

Behrend, R., Mette, M., Partecke, M., Reichel, K., and Wershofen, B. (2019). Heterogeneous learning cultures in interprofessional education: a teacher training. GMS J. Med. Educ. 36:Doc24. doi: 10.3205/zma001232

Bjorck, J., Gomes, C. P., and Weinberger, K. Q. (2022). Is high variance unavoidable in RL? A case study in continuous control. arXiv [Preprint]. doi: 10.48550/arXiv.2110.11222

Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., and Krathwohl, D. R. (1956). Taxonomy of Educational Objectives: The Classification of Educational Goals. White Plains, NY: Longman.

Bortz, J., and Schuster, C. (2010). Statistik für Human- und Sozialwissenschaftler. Berlin: Springer. doi: 10.1007/978-3-642-12770-0

Cramer, A. O. J., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P. P. P., et al. (2016). Hidden multiplicity in exploratory multiway ANOVA: prevalence and remedies. Psychon. Bull. Rev. 23, 640–647. doi: 10.3758/s13423-015-0913-5

Credidio, G., Größler, M., Düsterbeck, L. M., Löhndorf, A., Eisenbarth, S., Guse, A. H., et al. (2025). “KIMED - a generative artificial intelligence tool to support the individualized learning process of medical students,” in 70. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS) (Jena: GMS Publishing House), DocAbstr. 228.

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., et al. (2024). Retrieval-augmented generation for large language models: a survey. arXiv [Preprint]. doi: 10.48550/arXiv.2312.10997

Hang, C. N., Wei Tan, C., and Yu, P.-D. (2024). MCQGen: a large language model-driven MCQ generator for personalized learning. IEEE Access 12, 102261–102273. doi: 10.1109/ACCESS.2024.3420709

Hang, C. N., Yu, P.-D., and Tan, C. W. (2025). TrumorGPT: graph-based retrieval-augmented large language model for fact-checking. IEEE Trans. Artif. Intell. 6, 3148–3162. doi: 10.1109/TAI.2025.3567369

Kingma, D. P., and Ba, J. (2015). “Adam: a method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, eds. Y. Bengio and Y. LeCun (Ithaca, NY: Cornell University).

Kıyak, Y. S., Coşkun, Ö., and Budakoğlu, İ. (2024). ChatGPT for generating multiple-choice questions: evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam. Eur. J. Clin. Pharmacol. 80, 729–735. doi: 10.1007/s00228-024-03649-x

Kıyak, Y. S., and Emekli, E. (2024). ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad. Med. J. 100, 858–865. doi: 10.1093/postmj/qgae065

Law, A., So, J., Lui, C., Choi, Y. F., Cheung, K. H., Hung, K. K., et al. (2025). AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med. Educ. 25:208. doi: 10.1186/s12909-025-06796-6

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, eds. H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Red Hook, NY: Curran Associates Inc.), 9459–9474.

McHugh, M. (2012). Interrater reliability: the kappa statistic. Biochem. Med. 22, 276–282. doi: 10.11613/BM.2012.031

Moore, S., Nguyen, H. A., Chen, T., and Stamper, J. (2023). “Assessing the quality of multiple-choice questions using GPT-4 and rule-based methods,” in Responsive and Sustainable Educational Futures: 18th European Conference on Technology Enhanced Learning, EC-TEL 2023, Aveiro, Portugal, September 4-8, 2023, Proceedings, eds. O. Viberg, I. Jivet, P. J. Muñoz-Merino, M. Perifanou, and T. Papathoma (Cham: Springer-Verlag), 229–245. doi: 10.1007/978-3-031-42682-7_16

Ni, Z., Qian, Y., Chen, S., Jaulent, M.-C., and Bousquet, C. (2024). Scientific evidence and specific context: leveraging large language models for health fact-checking. Online Inf. Rev. 48, 1488–1514. doi: 10.1108/OIR-02-2024-0111

Nivala, M., Paranko, J., Gruber, H., and Lehtinen, E. (2016). The role of prior knowledge and students' perceptions in learning of biomedical sciences. Med. Sci. Educ. 26, 1–8. doi: 10.1007/s40670-016-0319-7

Puhani, J. (2025). Statistik. Wiesbaden: Springer Gabler. doi: 10.1007/978-3-658-48352-4

Qiu, Z., Duan, X., and Cai, Z. (2024). “Evaluating grammatical well-formedness in large language models: a comparative study with human judgments,” in Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, eds. T. Kuribayashi, G. Rambelli, E. Takmaz, P. Wicke, and Y. Oseki (Bangkok: Association for Computational Linguistics), 189–198. doi: 10.18653/v1/2024.cmcl-1.16

Qwen Team (2024). Qwen2.5: A Party of Foundation Models. Available online at: https://qwenlm.github.io/blog/qwen2.5/ (Accessed March 30, 2026).

Qwen Team (2025). Qwen2.5 technical report. arXiv [Preprint]. doi: 10.48550/arXiv.2412.15115

Rivera-Rosas, C. N., Calleja-López, J., Ruibal-Tavares, E., Villanueva-Neri, A., Flores-Felix, C. M., and Trujillo-López, S. (2024). Exploring the potential of ChatGPT to create multiple-choice question exams. Educ. Méd. 25:100930. doi: 10.1016/j.edumed.2024.100930

Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., et al. (2023). From words to watts: benchmarking the energy costs of large language model inference. arXiv [Preprint]. doi: 10.48550/arXiv.2310.03003

Shen, W., Li, C., Chen, H., Yan, M., Quan, X., Chen, H., et al. (2024). Small LLMs are weak tool learners: a multi-LLM agent. arXiv [Preprint]. doi: 10.48550/arXiv.2401.07324

So, C. C., Lee, C. H., Yang, D., Loh, A. W. K., and Sze, C. K. (2025). Automatic generation of high-quality MCQs with LLMs for artificial intelligence education. IEEE Access 13, 184332–184347. doi: 10.1109/ACCESS.2025.3624820

Sreedhar, K., and Chilton, L. (2024). Simulating human strategic behavior: comparing single and multi-agent LLMs. arXiv [Preprint]. doi: 10.48550/arXiv.2402.08189

Steele, S., Nayak, N., Mohamed, Y., and Panigrahi, D. (2025). The generation and use of medical MCQs: a narrative review. Adv. Med. Educ. Pract. 16, 1331–1340. doi: 10.2147/AMEP.S513119

Tian, A. X., Zhang, R., Tang, J., Cho, Y. M., Li, X., Yi, Q., et al. (2025). Beyond the strongest LLM: multi-turn multi-agent orchestration vs. single LLMs on benchmarks. arXiv [Preprint]. doi: 10.48550/arXiv.2509.23537

Wang, J., Xiao, R., and Tseng, Y.-J. (2025). “Generating AI literacy MCQs: a multi-agent LLM approach,” in Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 2 (New York, NY: Association for Computing Machinery), 1651–1652. doi: 10.1145/3641555.3705189

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256. doi: 10.1023/A:1022672621406

Wu, H., Wang, W., Wan, Y., Jiao, W., and Lyu, M. R. (2023). ChatGPT or Grammarly? Evaluating ChatGPT on grammatical error correction benchmark. arXiv [Preprint]. doi: 10.48550/arXiv.2303.13648

Zhuge, M., Wang, W., Kirsch, L., Faccio, F., Khizbullin, D., Schmidhuber, J., et al. (2024). “GPTSwarm: language agents as optimizable graphs,” in Proceedings of the 41st International Conference on Machine Learning, eds. R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, et al. (Vienna), 62743–62767.


Keywords

adaptive learning algorithm, large language model (LLM), medical teaching, multi-agent system, multiple-choice questions, personalized education, reinforcement learning, resource-efficient AI

Citation

Düsterbeck LM, Größler M, Credidio G, Bellmann L and Riemann LT (2026) Node-Sampling: adaptive multi-agent optimization in medical education. Front. Comput. Sci. 8:1758333. doi: 10.3389/fcomp.2026.1758333

Received

01 December 2025

Revised

05 March 2026

Accepted

17 March 2026

Published

23 April 2026

Volume

8 - 2026

Edited by

Antonio Sarasa-Cabezuelo, Complutense University of Madrid, Spain

Reviewed by

Ching Nam Hang, Caritas Institute of Higher Education, Hong Kong SAR, China

Ryann Perez, University of Pennsylvania, United States


*Correspondence: Lilly Marie Düsterbeck,

†These authors share last authorship

