GastroBot: a Chinese gastrointestinal disease chatbot based on the retrieval-augmented generation

Introduction: Large Language Models (LLMs) play a crucial role in clinical information processing, showcasing robust generalization across diverse language tasks. However, existing LLMs, despite their significance, are not optimized for clinical applications and present challenges in terms of hallucinations and interpretability. The Retrieval-Augmented Generation (RAG) model addresses these issues by grounding answer generation in retrieved sources, thereby reducing errors. This study explores the application of RAG technology in clinical gastroenterology to enhance knowledge generation on gastrointestinal diseases.

Methods: We fine-tuned the embedding model using a corpus consisting of 25 guidelines on gastrointestinal diseases. The fine-tuned model exhibited an 18% improvement in hit rate over its base model, gte-base-zh, and outperformed OpenAI's embedding model by 20%. Employing the RAG framework with LlamaIndex, we developed a Chinese gastroenterology chatbot named "GastroBot," which significantly improves answer accuracy and contextual relevance while minimizing errors and the risk of disseminating misleading information.

Results: When evaluating GastroBot with the RAGAS framework, we observed a context recall rate of 95%, faithfulness to the source of 93.73%, and an answer relevance of 92.28%. These findings highlight the effectiveness of GastroBot in providing accurate and contextually relevant information about gastrointestinal diseases. In manual assessment against other models, GastroBot delivered a substantial amount of valuable knowledge while ensuring the completeness and consistency of its results.

Discussion: The findings suggest that incorporating the RAG method into clinical gastroenterology can enhance the accuracy and reliability of large language models. As a practical implementation of this method, GastroBot demonstrates significant enhancements in contextual comprehension and response quality. Continued exploration and refinement of the model are poised to advance clinical information processing and decision support in gastroenterology.


Introduction
In recent years, there has been significant advancement in large language models (LLMs), with notable models like ChatGPT demonstrating remarkable performance in question answering, summarization, and content generation (1)(2)(3). These models exhibit robust generalization not only within natural language processing (NLP) tasks but also across various interdisciplinary domains (4). However, models akin to ChatGPT, trained on general datasets, lack specific optimizations for clinical applications and are prone to generating hallucinated or unrealistic content (5)(6)(7), potentially providing incomplete or inaccurate information and posing inherent risks (8,9). To specialize LLMs, three methods have been proposed: optimizing the original LLM (10), employing prompt engineering (11)(12)(13), and Retrieval-Augmented Generation (RAG) (14).
RAG, introduced in 2020, is a retrieval-augmented technique capable of fetching information from external knowledge sources, thus significantly enhancing answer accuracy and relevance (15). In recent years, RAG technology has proven effective in the biomedical field (16). Wang et al. (17) developed Almanac, which improved medical guideline retrieval. Ge et al. (18) created LiVersa for liver disease queries, while Ranjit et al. (19) applied RAG to radiology reports. Yu et al. (20) utilized RAG for diagnosing heart disease and sleep apnea, whereas Lozano et al. (21) and Manathunga et al. (22) applied it to medical literature and education.
This study focuses on the application of RAG in the field of clinical gastroenterology in China, aiming to address issues associated with the continuing high infection rate of Helicobacter pylori and the rising incidence of gastric cancer (23,24). Given the substantial patient population with gastrointestinal diseases and the complexity of diagnosis and treatment, the hallucinations introduced by LLMs may pose additional challenges to the diagnosis and treatment of gastrointestinal diseases (5,7). Integrating RAG is crucial for enhancing the accuracy of clinical practitioners in managing these diseases and can effectively mitigate this issue.
The aim of this study is to leverage RAG and large-scale language models, utilizing 25 guidelines on gastrointestinal diseases and 40 recent gastrointestinal literature sources as external knowledge bases, to develop a dedicated chatbot for gastrointestinal diseases named GastroBot. Furthermore, to enhance the relevance between retrieved content and user queries, this study conducted domain-specific fine-tuning of the embedding model tailored to gastrointestinal diseases, directly enhancing the performance of RAG. GastroBot is capable of providing precise diagnosis and treatment recommendations for gastrointestinal patients, thereby improving treatment efficacy. Figure 1 illustrates the comprehensive workflow of GastroBot.
In summary, the main contributions can be summarized as follows:
• We created a specialized dataset named the "EGD Database" specifically for Chinese gastrointestinal diseases.
• We performed domain-specific fine-tuning of the embedding model to enhance retrieval performance for gastrointestinal diseases.
• We utilized 25 gastrointestinal disease guidelines and 40 related literature articles as the knowledge base to develop a gastrointestinal disease chatbot named GastroBot using RAG and an LLM.
Materials and methods

Dataset and data preprocessing
In order to develop a gastrointestinal chatbot tailored for the Chinese context, we initially sourced 25 clinical guideline documents related to gastrointestinal diseases from the Chinese Medical Journal Full-text Database. These guidelines were selected based on their alignment with the most current official guidelines in the field, ensuring comprehensive coverage across various dimensions. Additionally, we integrated the latest literature on gastroenterology from the China National Knowledge Infrastructure (CNKI) database, categorized under the discipline of digestive system diseases. These articles covered a range of topics including gastroesophageal reflux disease, Helicobacter pylori infection, clinical observations, and peptic ulcer diseases, all published in 2024, totaling 40 articles.
Subsequently, we conducted data preprocessing on the collected dataset, removing elements such as English abstracts and references that were irrelevant to our research objectives. Given RAG's inability to process images, all image data were excluded during this preprocessing stage. The resulting refined dataset was named the "EGD Database," with EGD standing for Expert Guidelines for Gastrointestinal Diseases.

Experiment
The experimental section comprises two crucial steps aimed at developing a dedicated Chinese gastrointestinal disease chatbot, named GastroBot, for knowledge-based question answering on gastrointestinal diseases. The first step involves fine-tuning the embedding model specifically for gastrointestinal diseases. Subsequently, LlamaIndex is employed to construct the RAG pipeline.

Fine-tuning embedding model
The objective of fine-tuning is to strengthen the correlation between retrieved content and queries. Fine-tuning the embedding model aims to optimize the influence of retrieved content on generated outputs. Particularly in the medical domain, characterized by evolving or rare terminology, these tailored embedding techniques can enhance retrieval relevance. The GTE (25) embedding model is renowned for its high performance. In this study, the gte-base-zh model from Alibaba DAMO Academy served as the foundational embedding model and underwent domain-specific fine-tuning.
For fine-tuning the gte-base-zh model, we employed GPT-3.5 Turbo to aid in generating question-answer pairs. During the fine-tuning process, the LLM generated questions based on document chunks, forming pairs with their respective answers. The SentenceTransformersFinetuneEngine in LlamaIndex was then utilized for fine-tuning. The fine-tuning of the customized Chinese gastrointestinal domain embedding model was accomplished through the steps illustrated in Figure 2.

Data preprocessing in the fine-tuning stage
In this study, we leveraged the EGD database as the source of fine-tuning data and applied a multi-step preprocessing procedure to adapt it into a corpus suitable for training and evaluation. Initially, we loaded cleaned and processed PDF files containing the essential text information. Subsequently, the SimpleDirectoryReader tool was employed to extract data from designated files, generating a list containing all documents. Then, utilizing the SimpleNodeParser, we extracted meaningful node information from the documents; this parser was encapsulated within a callable function named load_corpus. After obtaining textual nodes, the data underwent transformation through the generate_qa_embedding_pairs function in LlamaIndex to produce QA pairs suitable for fine-tuning embedding models. Furthermore, we utilized gpt-3.5-turbo to define context information and questions, establishing prompt generation templates. The resulting output is an embedded-pair dataset, saved as "train_dataset.json" and "val_dataset.json." This process laid the groundwork for subsequent training and evaluation of the fine-tuned Chinese gastrointestinal embedding model.
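The preprocessing pipeline above can be sketched in plain Python. This is a minimal illustration of the data shapes involved, not the LlamaIndex implementation: generate_questions is a hypothetical stand-in for the GPT-3.5-Turbo prompt, and the dictionaries merely mirror the queries/corpus/relevant_docs structure that generate_qa_embedding_pairs produces.

```python
import random

def generate_questions(chunk_text, n=2):
    # Hypothetical stand-in for the GPT-3.5-Turbo call that writes
    # questions about a document chunk.
    return [f"Question {i + 1} about: {chunk_text[:20]}" for i in range(n)]

def build_qa_dataset(node_texts, val_fraction=0.2, seed=42):
    """Pair each text node with generated questions and split the result
    into train/val sets, mirroring the queries/corpus/relevant_docs
    structure produced by generate_qa_embedding_pairs."""
    corpus = {f"node_{i}": text for i, text in enumerate(node_texts)}
    queries, relevant_docs = {}, {}
    for node_id, text in corpus.items():
        for j, question in enumerate(generate_questions(text)):
            qid = f"{node_id}_q{j}"
            queries[qid] = question
            relevant_docs[qid] = [node_id]  # each query points back to its source node
    qids = sorted(queries)
    random.Random(seed).shuffle(qids)
    n_val = int(len(qids) * val_fraction)

    def subset(ids):
        return {"queries": {i: queries[i] for i in ids},
                "corpus": corpus,
                "relevant_docs": {i: relevant_docs[i] for i in ids}}

    return subset(qids[n_val:]), subset(qids[:n_val])
```

In the real pipeline, the two resulting dictionaries would be serialized to "train_dataset.json" and "val_dataset.json".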

Training the fine-tuned model
We chose the gte-base-zh model from the GTE series trained by Alibaba DAMO Academy as the embedding model for the fine-tuning phase. Throughout the fine-tuning process, the SentenceTransformersFinetuneEngine was employed to carry out various subtasks. This involved constructing a pre-training model using SentenceTransformer and defining a data loader responsible for loading the training dataset and parsing it into queries, corpus, and relevant_docs. Leveraging the gte-base-zh model, the engine mapped node_ids from relevant_docs to text nodes in the corpus and compiled a list of InputExamples. Training employed the multiple negatives ranking (MNR) loss from sentence_transformers, with an evaluator monitoring the model's performance on the eval dataset throughout training. The entire process was seamlessly integrated into the training pipeline, encapsulated within the SentenceTransformersFinetuneEngine in LlamaIndex, and executed by invoking its fine-tuning function.

FIGURE 1
The overview of GastroBot. The process begins with the data preparation stage, where documents are initially split using a Splitter, dividing them into multiple document chunks. Subsequently, each chunk is encoded using the fine-tuned embedding model, producing semantic vectors stored in a Vector Database. Moving on to the data retrieval stage, the user inputs a question (User Question), and based on the question vector (Query Vector), the most relevant chunks (top-3 Chunks) are retrieved from the vector database. Finally, in the LLM generation phase, the large model generates answers by combining the top-3 chunks and the prompt, submitting them together to the LLM to obtain the final answer (Answer).
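The multiple negatives ranking objective used in this training step can be sketched as follows. This is an illustrative re-implementation of the idea behind sentence_transformers' MultipleNegativesRankingLoss (in-batch negatives with a softmax cross-entropy over scaled cosine similarities), not the library code itself.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mnr_loss(query_embs, doc_embs, scale=20.0):
    """Multiple negatives ranking loss for a batch of (query, positive-doc)
    embedding pairs: every other document in the batch serves as a negative.
    Returns the mean cross-entropy of picking the matching document."""
    losses = []
    for i, q in enumerate(query_embs):
        scores = [scale * cosine(q, d) for d in doc_embs]
        log_denom = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_denom - scores[i])  # -log softmax of the true pair
    return sum(losses) / len(losses)
```

Minimizing this loss pulls each query toward its paired document chunk while pushing it away from the other chunks in the batch, which is what makes question/chunk pairs sufficient training data (no explicit negatives need to be mined).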

Implementing retrieval-augmented generation with LlamaIndex
The LlamaIndex framework is employed to construct the RAG pipeline. Initially, data loading is conducted to facilitate subsequent experiments. The entire document is segmented into smaller text units termed nodes, facilitating processing within the LlamaIndex framework. Using the SimpleNodeParser, the loaded documents are parsed and transformed into these nodes. Following this, within the global configuration object, the embedding model and LLM are explicitly specified. The chosen embedding model is fine-tuned for gastrointestinal diseases, streamlining the conversion of text into the vector representations crucial for subsequent computations. gpt-3.5-turbo is selected as the generative LLM and is utilized throughout the process for answer generation. Finally, three core components (index, retriever, and query engine) are instantiated via the LlamaIndex framework, collectively supporting question-answering functionality based on user data or documents. The index serves as a data structure for swiftly retrieving information pertinent to user queries from external documents; this is accomplished through the vector store index, which generates vector embeddings for the text of each node. The retriever is responsible for acquiring information relevant to user queries, while the query engine, built upon the index and retriever, furnishes a universal interface for posing questions to the data.

Chunking
In the text segmentation step, once data extraction is complete, the document is partitioned into multiple text blocks referred to as chunks within LlamaIndex. Each chunk's size is defined as 512 characters. Although the default ID for each node is a randomly generated text string, we have the flexibility to format it into a specific pattern as required.
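A fixed-size character chunker of the kind described here might look as follows. This is a simplified sketch (LlamaIndex's own splitters additionally handle overlap and sentence boundaries), with deterministic node IDs in place of the default random strings.

```python
def chunk_text(text, chunk_size=512):
    """Split a document into fixed-size character chunks, assigning each
    chunk a deterministic node ID instead of a random string."""
    chunks = []
    for start in range(0, len(text), chunk_size):
        chunks.append({
            "node_id": f"node_{start // chunk_size}",  # predictable ID pattern
            "text": text[start:start + chunk_size],
        })
    return chunks
```

For example, a 1,200-character document yields three chunks of 512, 512, and 176 characters with IDs node_0 through node_2.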

Embedding
Each chunk undergoes encoding using the fine-tuned embedding model to generate semantic vectors that encapsulate the nuanced information within the captured segments. This fine-tuned embedding model excels particularly at capturing the specialized vocabulary associated with gastrointestinal diseases. Vectorization is pivotal, as it transforms text data into a matrix of vectors, directly influencing the effectiveness of subsequent retrieval operations. While existing generic embedding models may serve adequately in many scenarios, in the medical domain, where rare specialized vocabulary and terminology are prevalent, we opted to fine-tune GTE to suit our specific application needs and enhance retrieval efficiency.

Vector database
The semantic vectors generated are stored within the Vector Database, establishing an indexed repository optimized for swift and semantically aligned searches. This Vector Database forms the cornerstone for efficient retrieval in the subsequent phases of the RAG model. The intricacies of these steps are elaborated upon in the data preparation section of Figure 1.
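A toy in-memory stand-in illustrates the store-then-search pattern of the vector database; a production system would use an approximate-nearest-neighbor index rather than this exhaustive cosine scan.

```python
import math

class VectorStore:
    """Minimal illustrative vector database: stores (node_id, embedding)
    pairs and returns the top-k most similar nodes by cosine similarity."""

    def __init__(self):
        self.entries = []  # list of (node_id, embedding)

    def add(self, node_id, embedding):
        self.entries.append((node_id, embedding))

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    def top_k(self, query_emb, k=3):
        # Score every stored vector and keep the k best node IDs.
        scored = [(self._cosine(query_emb, emb), node_id)
                  for node_id, emb in self.entries]
        scored.sort(reverse=True)
        return [node_id for _, node_id in scored[:k]]
```

The top_k(..., k=3) call corresponds to the top-3 chunk retrieval shown in the data retrieval stage of Figure 1.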

Building RAG
Having completed the data preparation phase, the second step involves the selection of the embedding model and LLM. The embedding model is tasked with generating vector embeddings for each text chunk, for which we employ our fine-tuned embedding model. Meanwhile, the LLM handles user queries and related text chunks, producing contextually relevant answers; for this we utilize the gpt-3.5-turbo model via API calls. Both models collaborate synergistically within the service framework, playing indispensable roles in the indexing and querying processes. In the third step, we call upon LlamaIndex to construct the index, retriever, and query engine; these three pivotal components collectively facilitate question-answering based on user data or documents.
The index facilitates swift retrieval of information relevant to user queries directly from the external knowledge base; this is achieved by creating vector embeddings for the text of each node within the vector store index. The retriever's role is to acquire information pertinent to user queries, while the query engine, positioned atop the index and retriever, offers a universal interface for posing inquiries to the data. The fundamental implementation of RAG, based on LlamaIndex, streamlines this process.
When a user poses a question, it is converted into a vector representation. Using this query vector, the most relevant segments (top-3 chunks) are retrieved from the vector database, constituting the data retrieval phase depicted in Figure 1. The top-3 chunks, along with the prompt, are then fed into the gpt-3.5-turbo model for answer generation, culminating in the final answer, as illustrated in the LLM generation phase of Figure 1. Throughout this process, the user query is embedded into the same vector space as the additional context retrieved from the vector database, enabling a similarity-based search that returns the most proximate data objects (labeled Retrieve in the figure). The combination of the user query and the supplementary context in the prompt template is referred to as augmentation (labeled Augment in the figure). Finally, the augmented prompt is input into the LLM for answer generation (labeled Generate in the figure).
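The retrieve-augment-generate flow just described can be condensed into a short sketch. Here, retrieve and generate are injected callables standing in for the vector-database search and the gpt-3.5-turbo API call, and the prompt template is a hypothetical example rather than the one used in the study.

```python
def augment_prompt(question, chunks):
    """Augment step: combine retrieved context chunks with the user
    question in a single prompt (illustrative template)."""
    context = "\n\n".join(chunks)
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

def rag_answer(question, retrieve, generate, k=3):
    """Retrieve top-k chunks, augment the prompt, and generate an answer.
    `retrieve(question, k)` and `generate(prompt)` are placeholders for
    the vector-database search and the LLM call."""
    chunks = retrieve(question, k)          # Retrieve
    prompt = augment_prompt(question, chunks)  # Augment
    return generate(prompt)                 # Generate
```

Swapping the two callables for a real retriever and a real LLM client turns this skeleton into the Retrieve/Augment/Generate loop of Figure 1.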

Comparative experiment
To demonstrate the performance of GastroBot, we conducted a comparative analysis between GastroBot and three baseline models utilizing RAG. When selecting these comparative models, we considered their scale, performance metrics, and diversity. Firstly, we selected Llama2 (26), an open-source language model that consistently outperforms other models across various external benchmark tests, including inference, encoding, proficiency, and knowledge evaluation. Secondly, we included ChatGLM-6B (27) and Qwen-7B (28), representing the latest advancements in Chinese artificial intelligence, both of which demonstrate robust capabilities across multiple natural language processing tasks. We randomly selected 20 questions related to gastrointestinal diseases and compared the answers generated by GastroBot with those generated by the other three models, evaluating GastroBot's relative performance through human assessment.

Embedding model evaluation
In this section, we evaluate three different embedding models: OpenAI text-embedding-ada-002 (29), gte-base-zh, and our fine-tuned embedding model, employing two distinct evaluation methods.

Hit rate (30): a straightforward top-k retrieval is conducted for each query/relevant_doc pair. A retrieval is considered successful (a "hit") if the search results include the relevant_doc. The hit rate is shown in Equation (1):

HR = (1/S) · Σᵢ₌₁ˢ hitᵢ (1)

where S denotes the total number of query/relevant document pairs, representing the count of user demands, and hitᵢ is an indicator function whose value is 1 if the relevant document for the i-th query is in the top-k search results, and 0 otherwise.
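The hit-rate metric translates directly into a few lines of code. This is an illustrative re-implementation (not the evaluation code used in the study) and assumes one relevant document per query.

```python
def hit_rate(results, relevant, k=3):
    """Fraction of queries whose relevant document appears among the
    top-k retrieved results. `results` maps query IDs to ranked document
    lists; `relevant` maps query IDs to the single relevant document."""
    hits = sum(1 for qid, docs in results.items() if relevant[qid] in docs[:k])
    return hits / len(results)
```

With k = 1 this reduces to top-1 accuracy; larger k makes the criterion more forgiving.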
Information Retrieval Evaluator: a comprehensive metric suite provided by LlamaIndex for the evaluation of open-source embeddings. This class evaluates an Information Retrieval (IR) (31) setting: given a set of queries and a large corpus, it retrieves the top-k most similar documents for each query and measures Mean Reciprocal Rank (MRR) (32), Recall@k, and Normalized Discounted Cumulative Gain (NDCG) (33,34).
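Two of these metrics, MRR and Recall@k, can be sketched for the single-relevant-document case as follows; this is an illustrative re-implementation, not the LlamaIndex evaluator.

```python
def mrr(rankings, relevant):
    """Mean Reciprocal Rank: average over queries of 1/rank of the first
    relevant document (contributes 0 if it is absent from the ranking)."""
    total = 0.0
    for qid, docs in rankings.items():
        for rank, doc in enumerate(docs, start=1):
            if doc == relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)

def recall_at_k(rankings, relevant, k):
    """Recall@k with one relevant document per query: fraction of queries
    whose relevant document appears in the top k."""
    hits = sum(1 for qid, docs in rankings.items() if relevant[qid] in docs[:k])
    return hits / len(rankings)
```

MRR rewards placing the relevant chunk near the top of the ranking, whereas Recall@k only asks whether it appears at all within the cutoff.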

Using RAGAs to evaluate RAG
Ragas (35) is a large-scale model evaluation framework designed to assess the effectiveness of Retrieval-Augmented Generation (RAG). It aids in analyzing the output of models, providing insights into their performance on a given task.
To assess a RAG system, Ragas requires the following information:
Questions: queries provided by users.
Answers: responses generated by the RAG system (elicited from a large language model, LLM).
Contexts: documents relevant to the queries, retrieved from external knowledge sources.
Ground Truths: authentic answers provided by humans, serving as the correct references for the questions. This input is required only for the context recall metric.
Once Ragas obtains this information, it utilizes LLMs to evaluate the RAG system.
Ragas's evaluation metrics comprise Faithfulness, Answer Relevance, Context Precision, Context Relevancy, Context Recall, Answer Semantic Similarity, Answer Correctness, and Aspect Critique. For this study, our chosen evaluation metrics are Faithfulness, Answer Relevance, and Context Recall.

Faithfulness
Faithfulness is evaluated by assessing the consistency of generated answers with the provided retrieval context, derived from both the answer itself and the retrieved context. Scores are scaled from 0 to 1, with higher scores indicating greater faithfulness.
An answer is deemed reliable if all assertions within it can be inferred from the given context. To compute this value, a set of statements is first identified from the generated answer, and each statement is then cross-checked against the provided context. Equation (2) for computing faithfulness is as follows:

Faithfulness = |V| / |S| (2)

where |V| represents the number of statements the LLM verified as supported by the context, and |S| denotes the total number of statements in the answer.
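Once the LLM has produced the statement judgments, Equation (2) reduces to a simple ratio. In this sketch, `supported` stands in for the set of statements the LLM verified against the retrieved context.

```python
def faithfulness(statements, supported):
    """Equation (2): |V| / |S|, where `statements` is the set S extracted
    from the answer and `supported` identifies the subset V judged to be
    inferable from the retrieved context."""
    if not statements:
        return 0.0
    verified = sum(1 for s in statements if s in supported)
    return verified / len(statements)
```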

Answer relevance
To evaluate the relevance of answers, we utilize the LLM to generate potential questions and compute their similarity to the original question. The relevance score of an answer is the average similarity between all generated questions and the original question.
Let the original question be q, the answer to the question be a(q), and the context segment relevant to question q be c(q). If the claims presented in the answer can be inferred from the context, we assert that the answer a(q) is faithful to the context c(q); to gauge credibility, we first employ the LLM to extract a set of statements from a(q). If the answer a(q) directly and appropriately addresses the question, we consider it relevant. Notably, our evaluation of answer relevance does not account for factual accuracy but penalizes incomplete or redundant information in answers. To estimate answer relevance, given an answer a(q), we prompt the LLM to generate n potential questions qᵢ based on a(q). We then use the text-embedding-ada-002 model from the OpenAI API to obtain embeddings for all questions and, for each qᵢ, calculate the similarity sim(q, qᵢ) with the original question q. The specific formula for answer relevance is Equation (3):

AR = (1/n) · Σᵢ₌₁ⁿ sim(q, qᵢ) (3)

This metric assesses the alignment between the generated answers and the initial question or instruction.

Context recall
Context recall assesses how well the retrieved context aligns with the authentic answers provided by humans. It is calculated by comparing the ground truth with the retrieved context, with scores ranging from 0 to 1, where higher scores indicate better performance.
To estimate context recall based on the authentic answers, each sentence in the authentic answers is examined to determine its relevance to the retrieved context. Ideally, all sentences in the authentic answers should be attributable to the retrieved context. The context recall score is calculated using the following Equation (4):

CR = (number of ground-truth sentences attributable to the retrieved context) / (total number of sentences in the ground truth) (4)

This formula quantifies the proportion of sentences in the authentic answers that can be attributed to the retrieved context, providing a measure of how well the retrieved context aligns with the ground truth.
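Mirroring the faithfulness computation, Equation (4) is again a ratio. Here, `attributable` stands in for the set of ground-truth sentences the LLM judged attributable to the retrieved context.

```python
def context_recall(ground_truth_sentences, attributable):
    """Equation (4): proportion of ground-truth sentences that can be
    attributed to the retrieved context. `attributable` identifies the
    sentences judged supported by that context."""
    if not ground_truth_sentences:
        return 0.0
    hits = sum(1 for s in ground_truth_sentences if s in attributable)
    return hits / len(ground_truth_sentences)
```

A score of 1 means the retriever surfaced everything needed to reproduce the human reference answer; lower scores indicate missing context.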

Human evaluation
Although RAGAS helps assess the performance of RAG to some extent, human evaluation remains essential from the perspectives of safety, validation of professional knowledge, flexibility and adaptability, and ethical considerations. Therefore, this study incorporated human assessment. Our investigation adopts the SUS (Safety, Usability, and Smoothness) evaluation method for human assessors. SUS encompasses three dimensions: safety, usability, and smoothness (36). The "safety" dimension evaluates whether the model-generated content could mislead users and pose health risks. The "usability" dimension reflects the depth of professional expertise, while the "smoothness" dimension gauges the fluency of the generated text, reflecting the model's proficiency as an LLM. Scoring employs a three-tier system, ranging from 1 (unsatisfactory) to 3 (good), with 2 indicating acceptable performance.

Illustrative Q&A examples from GastroBot
To demonstrate the capabilities of GastroBot, we developed a web application using Streamlit. A question dialogue box located below the main interface enables users to input queries related to gastrointestinal diseases. The interface of the web application is depicted in Figure 3.
We randomly present five questions related to gastric diseases and generate answers using both GastroBot and ChatGPT. Table 1 compares the answers generated by the two systems for the five sample questions. Through this comparison, we observe that GastroBot's answers are more precise and contextually relevant, and effectively mitigate the production of misleading information. For instance, in question 2, where the user inquires about the guidelines for handling gastric biopsy specimens, GastroBot provides a detailed response comprising five steps, each accompanied by precise explanations and requirements, and devoid of errors. In contrast, ChatGPT's response offers only a series of steps without clear time references and lacks sufficient clarity in its explanations.

Fine-tuned embedding model improved performance
The results of the hit rate evaluation are illustrated in Figure 4 and presented in Table 2. Our fine-tuned model exhibits an 18% improvement in performance compared to its base model, gte-base-zh. When contrasted with OpenAI's embedding model, text-embedding-ada-002, our fine-tuned model demonstrates a 20% enhancement in performance. The results of the Information Retrieval Evaluator are presented in Figure 5, showing a 21% improvement in performance for the fine-tuned model compared to the base model; the fine-tuned model also improves on each of the 30 evaluation metric columns.

FIGURE 3
Depicts the chat interface of GastroBot, built using the Streamlit platform.

RAGAs scores
Our Ragas evaluation relies on GPT-3.5-Turbo. To ensure diversity and representativeness in the test set, we meticulously designed the distribution of questions across categories such as "simple," "inference," "multi-context," and "conditional." Adhering to these guidelines, we curated a test set comprising 20 questions for assessment. The evaluation outcomes are consolidated in Table 3. When employing the RAGAS framework to evaluate GastroBot, we attained a context recall rate of 95%, with faithfulness reaching 93.73% and an answer relevancy score of 92.28%.

SUS scores
To assess the model's performance, we enlisted 5 professionals with medical expertise and randomly selected 20 questions recommended by GastroBot for evaluation. The experimental results of the SUS scores are detailed in Table 4. In comparison with the other three models, GastroBot scored remarkably high in terms of safety, usability, and smoothness, achieving scores of 2.87, 2.72, and 2.88, respectively. These scores signify that GastroBot's responses are exceptionally smooth and notably enhance the accessibility of knowledge while maintaining safety.

Previous research background
The extensive utilization of LLMs in natural language processing showcases remarkable generalization capabilities. However, challenges such as hallucinations and interpretability issues persist in clinical applications. Our research tackles these challenges by introducing the RAG method, which improves the accuracy and relevance of answers by retrieving information from external knowledge sources. RAG has previously proven successful in biomedical fields (16), liver disease research (18), clinical test diagnostics (19), and electrocardiogram data diagnostics (20).
In Wang et al.'s study, the application of the Almanac framework improved the retrieval of medical guidelines and treatment recommendations, showcasing the potential effectiveness of LLMs in clinical decision-making (17). Moreover, Ge et al. (18) utilized RAG technology to develop LiVersa, a specialized model for liver diseases. Given the substantial patient population and the complexity of managing gastrointestinal diseases (23,24), implementing RAG is paramount for enhancing the accuracy of diagnosis and treatment of these diseases.

FIGURE 4
Illustrates the hit rates of the text-embedding-ada-002 model, the gte-base-zh model, and the fine-tuned model.

Novelty discovered
We employed a corpus consisting of 25 guideline documents and 40 relevant literature articles to apply RAG technology in clinical practice within the field of Chinese gastroenterology. By fine-tuning the embedding model, we achieved a significant enhancement in performance: after fine-tuning, our model exhibited an 18% increase in hit rate compared to the base model gte-base-zh, and a 20% improvement compared to OpenAI's embedding model. Concurrently, leveraging the RAG framework, we developed GastroBot, a Chinese gastroenterology chatbot. Evaluation with the RAGAS framework showed a context recall rate of 95%, faithfulness of 93.73%, and a high answer relevancy of 92.28%. Human assessment indicated GastroBot's excellent performance in safety, usability, and smoothness, with scores of 2.87, 2.72, and 2.88, respectively. These findings underscore the significant advantages and innovative potential of RAG technology in addressing clinical challenges.

Explanation of potential drawbacks and limitations
Although RAG technology has shown significant improvements in accuracy and relevance, it still faces inherent limitations. The quality and accuracy of external knowledge sources directly impact the quality of generated responses. In our study, using 25 guideline documents and 40 literature articles, we encountered challenges related to the comprehensiveness and timeliness of knowledge. Responses may lack necessary details or specificity, requiring subsequent clarification, and may sometimes be overly vague or generic, failing to effectively meet user needs. In future work, we intend to explore more sophisticated retrieval strategies to address these challenges. It is important to note that our current research primarily focuses on China and thus exhibits geographical limitations; future research will strive to enhance the model's applicability across broader regions and cultural backgrounds. Furthermore, RAG technology's reliance on large datasets may hinder its performance in scenarios with limited sample sizes. Subsequent research should aim to alleviate these challenges to bolster the model's resilience and adaptability.

FIGURE 5
The Information Retrieval Evaluator is a comprehensive metric suite provided by LlamaIndex, showcasing the outcomes of 30 evaluation metrics.

Integration with current problem understanding and advancement
Our research provides a fresh perspective and solution to the information processing challenges faced in today's clinical environment. AI chatbots are already making significant strides in healthcare, particularly in pain management (37, 38). Inspired by these advancements, we successfully integrated RAG technology, which combines an LLM's reasoning capabilities with domain-specific knowledge retrieval, to develop an AI chatbot tailored to the interpretation and comprehension challenges of clinical gastroenterology. This advancement not only deepens our understanding of clinical problem-solving but also holds promise for extending the technology to other clinical domains, especially underserved regions with limited medical resources, thereby aiding early diagnosis.

Theoretical hypotheses for future directions and testing
Future research may concentrate on refining the selection and updating mechanisms of external knowledge sources to ensure the model's access to the latest and most comprehensive clinical data. To enhance GastroBot's performance, we intend to explore advanced RAG techniques. Moreover, we will seek collaboration with experts in gastroenterology to enrich GastroBot's domain-specific knowledge. Integrating electronic medical record systems into GastroBot is also part of our future enhancement agenda. Additionally, we anticipate the emergence of high-performance LLMs deployable locally, which would make it easier for researchers in various fields to deploy comparable chatbots. Fundamentally, our research introduces an innovative solution to the clinical information processing domain and provides valuable insights for future studies. Ongoing improvement and deeper exploration suggest that RAG technology may play a crucial role in the foreseeable future, particularly in fields such as clinical decision support systems.

FIGURE 2
Overview of the entire process involved in fine-tuning the embedding model for the Chinese gastroenterology domain. The process encompasses three key steps: data preparation, model fine-tuning, and model evaluation.
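The data-preparation step of this pipeline can be sketched as follows: each guideline document is split into overlapping text chunks, and each chunk is paired with a synthetic question to form (query, context) training pairs for the embedding model. This is a minimal illustration assuming simple character-window chunking; the chunk size, overlap, and the `question_fn` placeholder (standing in for LLM-generated questions) are all hypothetical choices, not the paper's exact configuration.

```python
def chunk_text(text, max_chars=200, overlap=40):
    """Split a guideline document into overlapping character windows,
    the retrieval unit later paired with generated questions."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def build_pairs(chunks, question_fn):
    """Pair each chunk with a synthetic question; in practice the
    question would be generated by an LLM prompted with the chunk."""
    return [(question_fn(c), c) for c in chunks]

# Illustrative document (repeated to exceed one chunk window).
doc = "Chronic gastritis is caused by prolonged inflammatory stimulation..." * 3
chunks = chunk_text(doc, max_chars=80, overlap=20)
pairs = build_pairs(chunks, lambda c: "What does this passage describe? " + c[:30])
```

The resulting (question, chunk) pairs feed the fine-tuning step, and a held-out subset of them supplies the labeled queries for the hit-rate evaluation in the model-evaluation step.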
Guidelines for the Handling of Gastric Endoscopic Biopsy Specimens include the following steps:
Preliminary specimen handling: Flatten the biopsy specimen after removal from the body, ensuring that the basal layer of the mucosa adheres to filter paper.
Specimen fixation: Immerse the specimen in an adequate volume of 10% neutral buffered formalin. The fixation time should be greater than 6 h but less than 48 h.
Paraffin embedding: Remove the filter paper and embed the tissue in a vertical orientation.
Description and recording: Describe the size and number of tissues submitted for examination.
Sampling: Take the entire mucosa for examination, ensuring the mucosa is wrapped in filter paper to prevent loss. When sampling, add a drop of eosin to help the technician identify the tissue during embedding and slicing. Samples of markedly different sizes should be placed separately in different dehydration boxes to prevent small biopsy tissues from being lost or over-sectioned. During embedding, ensure that the flattened mucosa is embedded upright (i.e., perpendicular to the bottom of the embedding box), with no more than three tissue pieces per wax block, embedded in parallel orientation. The white, tissue-free edge of the wax block should be trimmed away with a scalpel as far as possible. It is recommended that each slide contain 6-8 consecutive tissue sections for continuous observation.
Processing of gastroscopic biopsy specimens involves collection, rapid transport to the laboratory for moist storage, fixation with a fixative such as 10% formalin, and embedding in wax blocks. Tissue in the wax blocks is cut into thin sections, usually 3-5 microns thick, and subsequently stained, most commonly with H&E. The stained sections are examined under a microscope by a professional physician or pathologist, who writes a pathology report based on the microscopic findings to provide information for diagnosing or evaluating lesions. This process ensures the accuracy and quality of the specimen, which is essential for the diagnosis and treatment planning of diseases such as gastric cancer.
Contraindications to endoscopic treatment of early gastric cancer include:
Deep invasion: If the cancer is already deeply invasive, endoscopic treatment is not appropriate.
Lymph node metastasis: If lymph node metastasis is present, endoscopic treatment may not completely remove the cancer cells.
Extensive invasion: If the cancerous area is too large, it is difficult to cover all abnormal areas with endoscopic treatment.
Comorbid conditions: Other diseases, such as a bleeding tendency or severe cardiopulmonary disease, may increase the risk of endoscopic treatment.
Patient is not a candidate for surgery: Endoscopic therapy may not be appropriate if the patient is medically unfit for surgery.
Before deciding on endoscopic treatment, a careful assessment of the patient's condition and overall health status is necessary to ensure that the most appropriate treatment is selected.
Chronic gastritis is caused by prolonged inflammatory stimulation of the gastric mucosa. The main causes include Helicobacter pylori infection, long-term use of nonsteroidal anti-inflammatory drugs, autoimmune reactions, poor diet, acid reflux, and exposure to harmful substances. According to the type of lesion, it can be divided into chronic non-atrophic gastritis and chronic atrophic gastritis. Non-atrophic gastritis shows inflammation of the gastric mucosa with relatively normal cellular structure; in contrast, atrophic gastritis is accompanied by irreversible changes, including mucosal atrophy and gland depletion. In addition, according to the location of the lesion, it can be divided into pangastritis, antral gastritis, and other subtypes. The exact classification is usually determined by gastroscopy and tissue biopsy. Timely consultation with a doctor for diagnosis and treatment advice is key.

TABLE 1
Five example questions and answers generated by GastroBot and ChatGPT.

TABLE 2
The hit rate results for the three embedding models.

TABLE 3
Evaluation results from the RAGAS framework.