ORIGINAL RESEARCH article

Front. Earth Sci., 13 January 2026

Sec. Geohazards and Georisks

Volume 13 - 2025 | https://doi.org/10.3389/feart.2025.1695920

This article is part of the Research Topic: Prevention, Mitigation, and Relief of Compound and Chained Natural Hazards, Volume III.

Beyond structured knowledge: performance boundaries of ChatGPT in geological-hazard question answering and the need for human-in-the-loop oversight

Saier Wu1,2, Chong Xu1,2*, Zhiwen Xue1,2,3, Yuandong Huang1,2,3, Guoguo Xu1,2,3, Yulong Cui4, Junxue Ma1,2, Ruixia Ma1,2,5 and Chenchen Xie1,2,5
  • 1National Institute of Natural Hazards, Ministry of Emergency Management of China, Beijing, China
  • 2Key Laboratory of Compound and Chained Natural Hazards Dynamics, Ministry of Emergency Management of China, Beijing, China
  • 3School of Emergency Management Science and Engineering, University of Chinese Academy of Sciences, Beijing, China
  • 4School of Civil Engineering and Architecture, Anhui University of Science and Technology, Huainan, China
  • 5School of Earth Sciences and Resources, China University of Geosciences (Beijing), Beijing, China

Large language models have shown promise across specialized domains, but their performance limits in disaster risk reduction remain poorly understood. We conduct a version-specific evaluation of ChatGPT-4o for geological-hazard question answering using a transparent, rubric-based design. Sixty questions spanning six task categories (C1-C6) were posed within a fixed time window under a controlled single-turn protocol, and eight evaluators with geohazard expertise independently rated each response on six capability dimensions (D1 Knowledge Coverage; D2 Comprehension and Reasoning; D3 Accuracy and Rigor; D4 Critical Thinking; D5 Application and Context Adaptability; D6 Innovation and Knowledge Expansion). Scores were assigned on a continuous 0–1 scale, with 0, 0.5, and 1 used as anchor points to guide interpretation. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Performance was consistently higher on structured knowledge tasks, defined here as questions with well-established concepts, factual grounding, or clearly bounded reasoning paths (C1 = 0.827; C2 = 0.797; C3 = 0.818), than on open-ended tasks (C4-C6 mean = 0.591). Across dimensions, scores were highest for D1 (0.868), D2 (0.864), and D3 (0.830), and lowest for D4 (0.578) and D6 (0.550). Overall agreement was good (ICC (3, k) = 0.8095), while agreement decreased for more subjective tasks and dimensions. The study provides (i) a baseline, version-specific appraisal of GPT-4o in geohazard-related QA, (ii) a transferable rubric-based workflow for evaluating domain LLMs, and (iii) evidence that human oversight remains essential when such systems are used to support safety-critical disaster risk reduction decisions.

1 Introduction

In recent years, geological hazards triggered by extreme rainfall and earthquakes have shown rising frequency and expanding impacts (Gao et al., 2024; Huang et al., 2025; Shao et al., 2024). Compounding and cascading effects across multiple hazards have become increasingly prominent, rendering hazard processes highly abrupt, with complex evolutionary pathways and narrow decision windows (Gao et al., 2025; Wu et al., 2025). These characteristics raise the bar for monitoring and early warning, risk assessment, and emergency management (C. Xu and Lin, 2025).

Against this backdrop, generative large language models (LLMs) have made substantive advances in multimodal understanding and information integration, knowledge organization and generation, and complex reasoning. They exhibit cross-task transferability and adaptivity, revealing the potential for artificial general intelligence (Bommasani et al., 2021), and opening new technical pathways for natural-hazard research and governance (OpenAI et al., 2023; 2024). LLMs have been adopted in highly specialized domains that require timely knowledge services, including law (Lai et al., 2024), chemistry (Mirza et al., 2025), medicine (Cascella et al., 2023; Neha et al., 2024; Semeraro et al., 2025; Yang et al., 2025), and engineering (Hostetter et al., 2024; Kim et al., 2024). They are also extending into Earth system and human–environment modeling and data analysis within the geosciences (Reichstein et al., 2019; Xie et al., 2025; Zhao et al., 2024).

Within disaster risk reduction (DRR), natural language processing is increasingly used across the stages of observation, cognition, analysis, and decision-making. Pilot applications include disaster information extraction, sentiment analysis and scenario simulation, public outreach, and decision support (Xu et al., 2025; Xue et al., 2023; Zhao et al., 2024), with floods and earthquakes as the most common use cases. For example, generative models such as ChatGPT (GPT) have supported “abstractive reviews” that rapidly synthesize evidence for flood-rescue logistics and resource allocation (Kaklauskas et al., 2024). In seismic-engineering contexts, GPT has been used for technical drafting, terminology explanation, and communication to improve research and outreach efficiency (Ray, 2024; Wilson et al., 2023).

However, most existing evaluations focus on general or single tasks. Evidence vetted by researchers and practitioners in geohazard science is still lacking on the effectiveness and limits of using a generic question-answering (QA) interface across what we refer to here as the full knowledge chain of geological hazards. This chain spans foundational concepts, process interpretation, regional variability, interdisciplinary reasoning, scenario construction, and emerging research topics. Widely used academic benchmarks such as MMLU and BIG-bench do not transfer well to the domain-constrained and context-dependent nature of professional QA in geohazard assessment (Hendrycks et al., 2021; Srivastava et al., 2023). Recent discussions in the AI research community have also emphasized that strong benchmark performance does not necessarily translate into reliable behavior in applied decision-making contexts. This “high-score, low-utility” gap has been noted in public remarks by Ilya Sutskever, a leading figure in deep learning and co-founder of OpenAI, who has cautioned against over-reliance on benchmark-centered capability assessments. Taken together, these issues underscore the need for systematic, scholar-informed, practice-oriented capability auditing, and for clear, evidence-based guidance on how LLMs should be responsibly introduced into geohazard-related applications, where accuracy and practical applicability are essential.

Model iteration further underscores the need for establishing clear historical baselines. GPT-4o, released in May 2024, became one of the first widely accessible multimodal models supporting text–image interaction for tasks relevant to hazard interpretation and scientific communication (OpenAI et al., 2024). Subsequent releases, including the shift to GPT-5 as the default model in 2025, introduced architectures that can vary the depth of internal reasoning and, depending on user tier and platform, offer multiple model variants. These changes improve flexibility but also make it more difficult to determine which specific model version a user is interacting with. This increasing opacity highlights the need for version-specific and well-documented assessments to support transparent comparison across evolving model families.

Against this background, we develop an expert-driven, multi-dimensional rating and reliability framework to systematically evaluate GPT-4o's text-based QA performance for geological-hazard tasks that are pertinent to DRR and to AI-enabled, remote-sensing–assisted monitoring and early-warning workflows. We design 60 representative items across six problem categories and six capability dimensions, organize independent blind ratings, and quantify agreement and reliability using the intraclass correlation coefficient (ICC) (Koo and Li, 2016). Our principal contributions are as follows:

1. An expert-driven quantitative evaluation framework structured as “problem category × capability dimension,” with reliability reported via ICC, yielding a reproducible and transferable protocol for professional QA assessment;

2. A multi-dimensional capability profile and uncertainty characterization of GPT-4o in geohazard and DRR. The analysis identifies strengths in structured knowledge tasks (questions anchored in well-established concepts, classification standards, or other widely agreed reference answers) and in causal reasoning, and weaknesses in critical and innovative thinking as well as in complex, design-oriented, open-ended tasks. These patterns provide practical justification for human-in-the-loop oversight in safety-critical settings;

3. A version-specific, traceable performance baseline produced in the era of unified GPT-5 routing and deprecation of legacy defaults, supporting longitudinal tracking of subsequent versions and fair cross-model comparisons.

In sum, this study fills the gap in domain-specific, reproducible evaluation of general-purpose LLMs in a geological-hazard QA setting, providing an evidence base and methodological template for AI-enabled DRR and risk-aware governance. The full methodology, question set, structured rating data, and processing pipeline are documented in “Methodology and Data” and the Supplementary Material to facilitate verification and reuse by peers.

2 Methodology and data

For methodological transparency and continuity, the entire workflow is illustrated in Figure 1, encompassing every step from question design and prompt engineering through answer collection and data preprocessing to rating and statistical analysis.

Figure 1. Overall methodological workflow.

2.1 Question design and categorization

We developed a 60-question set to reflect the types of knowledge and reasoning that commonly arise in geological-hazard work. To examine different aspects of model performance, the questions were organized into six categories: (C1) Basic Knowledge; (C2) Formation-Mechanism Inference; (C3) Regional Differences; (C4) Interdisciplinary Analysis; (C5) Scenario Planning and Design; and (C6) Frontier Exploration.

When drafting the questions, our intention was to cover the main themes encountered in geohazard assessment. The set includes straightforward factual items as well as questions that require interpretation or multi-step reasoning, so that the difficulty resembles what practitioners deal with in real settings. Each question was written to stand alone, avoiding reliance on earlier items. We also paid attention to phrasing, aiming for wording that would be clear to both the model and the domain scholars who later evaluated the responses.

C1 and C2 focus on basic concepts and triggering mechanisms. C3 highlights regional patterns, C4 addresses cross-disciplinary integration, C5 concerns planning-oriented tasks in practical scenarios, and C6 explores forward-looking scientific directions. All questions were reviewed by researchers working in geohazards to ensure scientific accuracy and relevance. Each item was asked independently in a single-turn format to prevent any influence from prior interactions. The questions were also phrased in a way that resembles how non-expert users typically seek information about geological hazards. The six-category structure and representative items are listed in Table 1.

Table 1. Overview of six problem categories and illustrative examples.

2.2 Answer collection and preprocessing

All responses were generated using GPT-4o through the standard web interface. Each question was submitted in an independent chat session during a short, continuous period, ensuring that all outputs reflected the same underlying model state. A uniform presentation format was used for every item, and no supplementary background information or follow-up clarification was provided. The single-turn setup served as an experimental control that allowed us to observe the model’s immediate, unassisted response under fixed conditions (Hendrycks et al., 2021; Hosseini and Pourzangbar, 2026; Liang et al., 2022). It is important to note that in professional geohazard assessment, AI tools are typically used through iterative exchanges that allow practitioners to probe and verify the model’s responses. The single-turn configuration used here was therefore not intended to reflect standard practice, but to provide a stable, version-specific baseline for evaluating the model’s immediate, unassisted output under controlled conditions. Because LLMs may exhibit minor phrasing or temporal variability due to probabilistic decoding and periodic system updates, the analysis focuses on answers generated within this defined timeframe for the May 2024 GPT-4o release (all QA sessions were conducted in February 2025). Limiting the evaluation in this way avoids confounding from version drift and supports consistent interpretation of results.

After collection, the raw outputs were processed through a structured cleaning workflow. Boilerplate disclaimers, conversational fillers, and other non-substantive elements were removed to isolate the analytical content. Terminology and formatting were harmonized where appropriate to support clear interpretation by the evaluators, while the factual and conceptual substance of each answer was left unchanged. The rating sheets submitted by the evaluators were consolidated into a single tabular file with consistent variable definitions, facilitating score aggregation and the computation of inter-rater agreement.

The complete question set, cleaned model outputs, and de-identified rating tables are provided in the Supplementary Material to support transparent archiving and future re-analysis.

2.3 Prompt engineering

We adopted a standardized single-turn prompting procedure for all queries. Each question was submitted in a fresh session with no prior context so that the model’s output depended solely on the prompt provided. This configuration reduces variability arising from conversational history and mirrors typical information-seeking behavior in geohazard and emergency-management settings, where users often ask isolated, one-off questions. The approach also follows established evaluation practices in large-language-model benchmarking, where isolated, one-pass queries are commonly used to support fair and reproducible comparisons (Hendrycks et al., 2021; Hosseini and Pourzangbar, 2026; Srivastava et al., 2023). Consistent with this minimal-intervention philosophy, no parameter tuning, multi-turn refinement, or repeated sampling was applied.

To maintain domain relevance, a concise role-based instruction preceded every query. The instruction asked the model to answer as an expert in geological hazards and Earth sciences. The standardized cue was: “You are a domain expert in geological hazards and Earth science. Please answer the following question with accurate, detailed information and rigorous reasoning.” This prompt steered the model toward technically grounded responses without over-constraining the format, and it improved terminological clarity and coherence in domain-specific reasoning, a pattern also noted in studies using similar expert-role prompts (Wang et al., 2024).

The same prompt template was applied to all sixty questions. No examples, few-shot demonstrations, or chain-of-thought scaffolds were included, and no multi-turn exchanges were used. This ensured that each answer reflected the model’s zero-shot capability under identical conditions, supporting fair comparison across question categories and preserving the ecological validity of simulating unassisted user queries in geohazard contexts. Avoiding selective prompt augmentation also reduced the risk of uneven prompting effects, which can confound comparative evaluations (Kojima et al., 2023).
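To make this protocol easier to reproduce programmatically, the sketch below shows how the same role prompt and single-turn, context-free querying could be scripted. It is a minimal illustration, not the procedure used in the study: answers were collected through the ChatGPT web interface, so the use of the OpenAI Python SDK, the "gpt-4o" model identifier, default sampling settings, and the questions.txt input file are all assumptions.

```python
# Minimal sketch (not the study's actual collection method): scripting the same
# single-turn, role-prompted protocol via the OpenAI Python SDK.
from openai import OpenAI

ROLE_PROMPT = (
    "You are a domain expert in geological hazards and Earth science. "
    "Please answer the following question with accurate, detailed information "
    "and rigorous reasoning."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_single_turn(question: str, model: str = "gpt-4o") -> str:
    """Submit one question as a fresh, context-free exchange (zero-shot, single turn)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ROLE_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # "questions.txt" (one question per line) is an illustrative assumption.
    with open("questions.txt", encoding="utf-8") as f:
        questions = [line.strip() for line in f if line.strip()]
    for i, q in enumerate(questions, start=1):
        answer = ask_single_turn(q)  # each call is an independent exchange with no shared history
        print(f"Q{i}\t{answer[:80]}...")
```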

To examine the stability of this configuration, we conducted a small qualitative stability check (see Supplementary Material 3). Six representative questions (one from each category, C1–C6) were re-queried three times under the same settings. The core conceptual content remained consistent across runs, with variation largely confined to minor differences in phrasing. All main analyses in this study are therefore based on responses generated under this uniform, single-turn protocol.

2.4 Evaluators and rating dimensions

We invited eight evaluators (E1 through E8), comprising senior researchers, professors, postdoctoral researchers, and PhD candidates with formal training and at least 5 years of experience in geohazard investigation, risk assessment, and governance. All eight evaluators had been involved in developing the question set and refining the scoring rubric. Each evaluator independently assessed the model’s answers, working in isolation to avoid mutual influence.

A unified rubric was used to evaluate each answer across six capability dimensions (Table 2). Scores were assigned on a continuous 0–1 scale. The values 0, 0.5, and 1 served only as conceptual anchors for the endpoints and midpoint, not as the only permissible options. Before scoring, evaluators were explicitly instructed that they could select any value within the 0-1 range—typically in 0.1 increments—to express their level of satisfaction with the model’s performance. All scores were retained as provided; no adjustment or post hoc normalization was applied. Although the same six dimensions were applied across all questions, evaluators were instructed to interpret each dimension in a manner appropriate to the question type, following the definitions in Table 2 and the calibration examples in Supplementary Material 2.

Table 2. Definition and evaluation criteria for the scoring dimensions.

To support consistent interpretation of the rubric, we provided each evaluator with a set of calibrating exemplar answers and scoring notes prior to formal scoring. These materials (now included in Supplementary Material 2) illustrate how the six dimensions should be applied to different question types.

Scores were assigned independently, without discussion or consensus-building among evaluators. For each question category and each evaluation dimension, the final score was computed as the arithmetic mean of the eight individual ratings. This approach preserves individual judgment, avoids group influence, and allows agreement among evaluators to be assessed quantitatively through inter-rater reliability analysis.

2.5 Statistical analysis framework

We developed a structured statistical pipeline to systematically evaluate GPT-4o's performance across geological-hazard QA tasks and to quantify inter-rater agreement. The pipeline comprised three components: data curation and preprocessing, descriptive statistics, and rater-consistency assessment.

2.5.1 Data curation and preprocessing

The eight individual rating sheets were consolidated into a rectangular matrix with fields Evaluator × QuestionID × Category × Dimension × Score. We screened for obvious entry errors and missing ratings and applied corrections to ensure completeness and reliability. Based on the cleaned dataset, grouped summaries were produced by problem category, capability dimension, and evaluator index.

2.5.2 Descriptive statistics

To characterize performance differences across task types and capability dimensions, we computed means and standard deviations for each category and each dimension, thereby assessing central tendency and dispersion. Distributions were visualized using box plots, violin plots, and radar charts to convey overall spread, density, and contrasts, highlighting differences in knowledge coverage, reasoning ability, and contextual adaptability across categories and dimensions.
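As a minimal sketch of this step, the snippet below computes the grouped means and standard deviations from a consolidated long-format table. The column names follow the fields listed in Section 2.5.1, while the file name and storage format are illustrative assumptions rather than details taken from the study.

```python
# Minimal sketch, assuming the eight rating sheets have already been consolidated
# into one long-format table; "ratings_long.csv" is an illustrative assumption.
import pandas as pd

ratings = pd.read_csv("ratings_long.csv")  # columns: Evaluator, QuestionID, Category, Dimension, Score

# Screening step (Section 2.5.1): flag missing or out-of-range entries.
print("missing scores:", ratings["Score"].isna().sum())
print("out-of-range scores:", (~ratings["Score"].dropna().between(0, 1)).sum())

# Means and standard deviations by problem category and by evaluation dimension.
by_category = ratings.groupby("Category")["Score"].agg(["mean", "std"]).round(3)
by_dimension = ratings.groupby("Dimension")["Score"].agg(["mean", "std"]).round(3)

# Composite index per category (average over dimensions and evaluators), cf. Section 3.2.
composite = ratings.groupby("Category")["Score"].mean().round(3)
print(by_category, by_dimension, composite, sep="\n\n")
```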

2.5.3 Inter-rater reliability

Inter-rater reliability was evaluated using the ICC, specifically the two-way mixed-effects model for average measures, denoted as ICC (3, k) (McGraw and Wong, 1996; Shrout and Fleiss, 1979). This model treats the panel of evaluators (k = 8) as fixed effects and the items (n = 60) as random effects. The average-measures coefficient was selected because the study’s composite scores are derived from the mean ratings of the eight evaluators, making the reliability of the averaged scores the relevant metric rather than individual ratings. The consistency formulation of ICC (3, k) is calculated as:

ICC(3, k) = (MSR - MSE) / MSR

where MSR and MSE represent the mean square for rows (items) and the residual mean square, respectively. We report both the overall ICC and stratified estimates by task category (C1–C6) and evaluation dimension (D1–D6) to detect variations in agreement across different assessment contexts. ICC values approaching 1 indicate high agreement among evaluators. Values near zero or below indicate that agreement is limited and does not exceed what would be expected by chance, which may reflect the subjective or interpretive nature of certain question types or evaluation dimensions rather than deficiencies in the scoring procedure itself.
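For readers who wish to reproduce the reliability analysis, the following sketch implements the consistency form of ICC (3, k) directly from the two-way layout of item-by-rater scores. The function name and the pivot from the long-format ratings table are assumptions for illustration; equivalent estimates can also be obtained from standard statistical packages (for example, the intraclass_corr function in the Python pingouin library).

```python
# Minimal sketch (assumed helper, not code from the study): consistency ICC(3, k)
# for an n_items x k_raters score matrix, following Shrout and Fleiss (1979).
import numpy as np

def icc_3k(scores) -> float:
    """Two-way mixed-effects, consistency, average-measures ICC(3, k)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    row_means = x.mean(axis=1)   # per-item (question) means
    col_means = x.mean(axis=0)   # per-rater means

    ss_rows = k * ((row_means - grand_mean) ** 2).sum()   # between-item sum of squares
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()   # between-rater sum of squares
    ss_total = ((x - grand_mean) ** 2).sum()
    ss_resid = ss_total - ss_rows - ss_cols               # residual sum of squares

    ms_rows = ss_rows / (n - 1)                           # MSR
    ms_resid = ss_resid / ((n - 1) * (k - 1))             # MSE
    return (ms_rows - ms_resid) / ms_rows

# Usage (column names as assumed in the earlier sketch): pivot the long table to a
# 60 x 8 matrix of per-question composite scores before calling icc_3k, e.g.
#   matrix = (ratings.groupby(["QuestionID", "Evaluator"])["Score"].mean()
#                    .unstack("Evaluator").to_numpy())
#   print(round(icc_3k(matrix), 4))
```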

3 Results

3.1 Overall performance across all questions

The dataset comprises ratings from eight evaluators for 60 questions across six evaluation dimensions. Figure 2 summarizes the distribution of evaluator scores at the question level using interquartile-range (IQR) bands for each dimension. The shaded regions depict the middle 50% of scores across evaluators, highlighting the degree of agreement or dispersion across different dimensions and questions. The black line represents the overall mean score for each question, averaged across all evaluators and dimensions, and provides a reference for the model’s aggregate performance. Together, these elements convey both the central tendency of scores and the variability in evaluator judgments across the full question set.

Figure 2. Distribution of evaluator scores across questions using interquartile ranges. For each question (1-60), shaded bands show the interquartile range (25th-75th percentile) of evaluator scores for each evaluation dimension (D1-D6). Dimensions with greater score dispersion are rendered in lighter tones, whereas more concentrated ratings appear in more saturated colors. The black solid line indicates the overall mean score per question, averaged across all evaluators and dimensions.

Several patterns emerge from the question-level distributions in Figure 2. First, questions belonging to the scenario planning and design category (C5) tend to exhibit lower overall mean scores than other categories, indicating that evaluators were, on average, less satisfied with the model’s responses to application-oriented and planning-related tasks. This suggests that while the model performs relatively well on conceptual and interpretative questions, translating knowledge into actionable or scenario-specific guidance remains more challenging. Second, the interquartile-range bands for Dimension D6 (Innovation and Knowledge Expansion) reveal contrasting behaviors across question types. For questions with relatively fixed or well-established answers, the D6 bands are generally narrow and centered at moderate to high scores, indicating that the model tends to exercise appropriate restraint by avoiding unnecessary speculation or fabricated novelty. In contrast, for more open-ended questions, the D6 bands are wider, reflecting greater divergence in evaluator judgments. This dispersion suggests that assessments of “satisfactory innovation” are harder to align for exploratory or forward-looking tasks, where expectations regarding originality, framing, and added insight vary more substantially among evaluators.

As a complementary view of the same rating distributions, we grouped the scores by question category, evaluation dimension, and evaluator and summarized them with box-and-violin plots (Figure 3). Across question categories C1-C6 (left panel of Figure 3), scores for C1-C3 are tightly clustered, with medians close to 0.9 and standard deviations around 0.25-0.28, indicating relatively high agreement for basic knowledge and mechanism-focused questions. By contrast, the scenario-planning category C5 shows the widest spread, with a standard deviation of 0.312 and a visibly broader IQR band, suggesting more divergent views on how well the model performs on application-oriented tasks. Across evaluation dimensions (middle panel of Figure 3), the lower quartiles for D1 and D2 lie near 0.8 and their standard deviations are 0.160 and 0.169, so most ratings fall in the upper score range and internal consistency is comparatively strong. D4 and D6, in comparison, display much greater dispersion, with standard deviations of 0.393 and 0.318, reflecting wider differences in how evaluators judged critical thinking and innovation.

Figure 3. Box-and-violin plots of score distributions by question category, evaluation dimension, and evaluator. For each group, the light violin shows the full distribution of scores, the dark vertical bar indicates the interquartile range (25th-75th percentiles), and the white dot marks the median.

Evaluator-level patterns (right panel of Figure 3) further illustrate these contrasts. Most evaluators’ scores are concentrated in the high range (≥0.8), indicating generally positive assessments of the model’s answers. E2 shows the lowest standard deviation (0.138), consistent with a relatively stable internal scoring style. In contrast, E6 has the highest standard deviation (0.336) and a visibly broader spread into intermediate scores, pointing to greater within-rater variability across questions. Overall, these distributions indicate that perceived answer quality varies systematically with question type, evaluation dimension, and evaluator, rather than being uniform across the full question set.

3.2 Performance by problem category

To examine how GPT performs across diverse geohazard task types, we computed a composite performance index for each category by averaging the scores over the six dimensions D1-D6. Results are shown in Figure 4. Overall, GPT performs best on C1, with a mean score of 0.827, indicating a marked advantage on well-structured, standardized questions. C3 and C2 follow with composite scores of 0.818 and 0.797, respectively, suggesting good adaptability to tasks requiring region-specific judgments and causal reasoning.

Figure 4. Composite performance index of GPT across six problem categories.

Figure 5 further breaks down these category-level results by dimension. For C1, the highest score occurs on Accuracy and Rigor (D3 = 0.895). C2 attains the highest single-dimension score in the entire dataset on Knowledge Coverage (D1 = 0.906), while C3 performs best on Comprehension and Reasoning (D2 = 0.891). These results indicate that the model can deliver high-accuracy outputs on knowledge-oriented tasks and sustains strong comprehension and reasoning in causal and regional analyses.

Figure 5. Dimension-level performance across problem categories. (a) C1 peaks on D3, C2 on D1, and C3 on D2. (b) C4 peaks on D1, C5 on D1, and C6 on D2.

By contrast, performance is weaker on the more complex and creative categories C4-C6 (Figure 5). The composite score is 0.749 for C4, 0.673 for C5 (the lowest among all categories), and 0.746 for C6. A similar pattern appears on the Innovation and Knowledge Expansion dimension (D6): C5 and C6 score only 0.505 and 0.554, respectively, substantially lower than the scores for the foundational categories. Taken together, these results suggest that, under the current evaluation setup, GPT is less reliable when tasks require high-level synthesis, operational strategy generation, or frontier extrapolation.

3.3 Performance by evaluation dimension

To comprehensively assess GPT’s behavior under heterogeneous capability requirements, we averaged scores across all problem categories for each of the six evaluation dimensions; results are shown in Figure 6. Overall, GPT performs strongly on D1–D3, with mean scores of 0.868 (D1), 0.864 (D2), and 0.830 (D3). This indicates relatively stable performance in knowledge coverage and recall (D1), comprehension and reasoning (D2), and accuracy and rigor of expression (D3). Notably, D1 reaches its peak within category C2 at 0.906, suggesting that the model is well supported by geohazard-related corpora when broad domain coverage is required.

Figure 6. Average scores across evaluation dimensions D1–D6. Strong performance is observed on D1–D3, whereas D4 and D6 remain comparatively low, highlighting challenges in higher-order content generation.

On D5, the average score is 0.787, indicating that the model can produce operationally useful content for applied tasks (e.g., disaster monitoring and early-warning design or emergency response planning). However, the outputs often remain generic and show limited adaptability to specific scenarios.

By contrast, performance drops markedly on the higher-order cognitive dimensions D4 and D6, with mean scores of 0.578 and 0.550, respectively, forming a pronounced “performance gap” (Figure 7). This trend is most evident for categories C4-C6, reinforcing that when confronted with open-ended problems or tasks lacking established reference answers, the model’s generations often fall short in depth and originality.

Figure 7. Inter-rater reliability of expert assessments across question categories and evaluation dimensions. (a) Distribution of ICC (3,8) values across all questions and evaluators. (b) ICC (3,8) grouped by question category. (c) ICC (3,8) grouped by evaluation dimension. (d) Heatmap of ICC (3,8) values over category–dimension combinations, highlighting areas of strong consensus and areas of greater subjectivity and variability.

3.4 Evaluator rating consistency

To evaluate the consistency of evaluator scores, we applied a two-way mixed-effects, average-measures intraclass correlation, ICC (3, k), to the ratings assigned by eight evaluators across 60 questions. Details on the definition, interpretation, and calculation of ICC (3, k) are provided in Section 2.5.3 of the Methods. The overall ICC (3,8) is 0.8095, indicating small between-rater differences on the composite, multi-dimensional scores and thus good agreement (Table 3; Figure 7a). Following Cicchetti’s guideline (Cicchetti, 1994), this value falls within the 0.75-0.90 range, i.e., “good” reliability. These results suggest that the rating framework exhibits strong scoring reliability; overall evaluator agreement on task evaluations is high, and the resulting scores are suitable for a structured assessment of model performance. In addition, the between-question variance component is substantially larger than the residual error (MSR = 0.3369 vs. MSE = 0.0642), indicating that observed score differences are driven primarily by the question content and task type rather than by rater-specific biases.
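Substituting the reported mean squares into the formula in Section 2.5.3 reproduces this estimate: ICC (3,8) = (0.3369 - 0.0642) / 0.3369 ≈ 0.809, matching the reported 0.8095 to within rounding of the mean squares.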

Table 3. Summary statistics for the overall ICC(3,8) computation.

We further examined heterogeneity in inter-rater consistency across task categories and evaluation dimensions by conducting grouped (Figures 7b,c) and cross-classified (Figure 7d) ICC analyses. At the category level, C2 and C1 show the highest agreement (ICC = 0.919 and 0.783, respectively), consistent with clearer task structure and stronger answer consensus. In contrast, C4 and C5 display low reliability (ICC <0.1), reflecting pronounced divergence among evaluators for these problem types. At the dimension level, D1 and D2 exhibit higher agreement (ICC >0.75), whereas D4 and D6 are notably lower, revealing greater disagreement when judging higher-order cognitive dimensions. In the cross-combination analysis (Figure 7d), pairs such as C2–D1, C2–D3, C2–D4 and C1–D2 achieve ICC values exceeding 0.80, indicating highly consistent evaluations on core dimensions of foundational and reasoning-oriented tasks. By contrast, combinations such as C4–D4, C4–D5, and C4–D6 yield ICC values near zero or even negative, suggesting weak stability and substantial subjectivity in complex, open-ended, or cross-domain scenarios.

4 Discussion

4.1 ChatGPT’s capabilities and limitations across geohazard tasks

Our findings show that the response quality of GPT-4o to geohazard questions is strongly associated with problem type. On structured knowledge tasks—such as explaining hazard-classification standards or inferring formative mechanisms—the model achieves high accuracy and stability, accompanied by high inter-rater reliability. This aligns with prior evaluations of LLMs: when task boundaries are well defined and the training corpus provides sufficient coverage, LLMs exhibit strong information retrieval and expression capabilities (Gilson et al., 2022). Coupled with relatively low interaction latency (Tay et al., 2023), these strengths confer “advisor-style” utility in information retrieval and instructional support settings (Deng et al., 2025). Mechanistically, these capabilities are consistent with the Transformer architecture and large-scale pretraining, which enable the model to capture co-occurrence patterns and relationships among terms in the training data and to retrieve them when presented with well-posed questions (Bommasani et al., 2021; Tay et al., 2023). In our single-turn, text-only setting, the model does not perform explicit multimodal reasoning or iterative prompt refinement; instead, it appears to reconstruct plausible causal chains by combining known concepts from its training corpus. For example, in questions about rainfall-induced shallow landslides, GPT-4o correctly links rainfall intensity-duration (I-D) thresholds to the probability of triggering shallow failures and explains this in terms of infiltration, pore-pressure rise, and loss of shear strength in near-surface materials.

By contrast, on open-ended, extrapolative, and creative tasks the model does not consistently deliver high-quality strategies (Hager et al., 2024; Kim et al., 2024; Plevris et al., 2023). Both the mean scores and inter-rater agreement decline markedly, especially on core, higher-logic dimensions such as critical thinking and contextual adaptability. A likely contributor is corpus bias: as Biswas notes, output quality is constrained by the quantity and quality of training data (Biswas, 2023). Hallucinated or misleading content also remains a recurrent issue across GPT variants (OpenAI et al., 2023). In our own question set, for example, GPT-4o gave a confident but incorrect answer to a question on which slope type is more prone to instability under seismic loading, incorrectly treating anti-dip slopes as generally more susceptible than dip slopes, which conflicts with standard engineering understanding (see Supplementary Material 4, Q11). Recent work further shows that GPT can pass a three-player Turing test (Jones and Bergen, 2025), underscoring that machine-generated but incorrect answers may be highly persuasive to non-expert audiences (Zhou et al., 2024). At a fundamental level, large language models remain pattern recognizers trained on vast text corpora (Ray, 2024; Raza et al., 2025): they produce human-like text that is often accurate, but not necessarily grounded in deep understanding. For high-uncertainty problems and other high-stakes contexts (such as designing mitigation works or supporting risk-critical decisions), outputs from GPT should therefore be treated as supplementary input and remain subject to expert judgment and synthesis.

4.2 Validity and reliability of the domain-scholar–based evaluation framework

We evaluated a structured, multidimensional domain-scholar rating rubric using a two-way mixed-effects ICC (3, k). Overall agreement was good (Cicchetti, 1994; Shrout and Fleiss, 1979), indicating clear criteria, limited between-rater bias, and reliable aggregate scores. Between-question variance exceeded residual error, suggesting that score differences primarily reflect task content rather than rater noise (Landis and Koch, 1977).

Agreement varied with task type and dimension definition. Well-bounded combinations (e.g., C2-D1 and C1-D2) showed high consensus, consistent with greater stability when dimensions are clearly specified and answers are determinate (Kung et al., 2023). In contrast, subjective or open-ended combinations (e.g., C5-D4 and C6-D6) exhibited markedly lower agreement, aligning with prior findings on task subjectivity and rater-background effects (Gilson et al., 2022; Mirza et al., 2025; Yang et al., 2025).

To further enhance reliability, future studies should provide explicit scoring anchors and exemplars and conduct cross-domain calibration sessions to surface and reconcile interpretive differences. Such procedures have improved reproducibility in geohazards, geotechnical engineering, and education research (Fell et al., 2008; Fuchs et al., 2011; Kane, 2013).

4.3 Study limitations

This study has several limitations that should be acknowledged. First, the evaluation focuses on the May 2024 release of ChatGPT-labeled “GPT-4o”, and all question–answer interactions were collected within a clearly defined query window (February 2025). The study does not include a systematic comparison with later variants (such as GPT-4.1 or GPT-4.5) or with same-generation derivatives including o3, o4 mini, or o4 mini high (OpenAI, 2024; OpenAI, 2025; OpenAI et al., 2024). Models within the GPT series can differ in documented knowledge recency, modality support, usage quotas, invocation cost, and accessibility. These differences complicate direct comparison and raise concerns regarding fairness and reproducibility. Similar comparability issues also arise in cross-provider settings (e.g., Gemini, Grok, and other frontier LLMs), where product- and system-layer behaviors (not always transparent), safety filtering and compliance constraints, and default generation settings can further confound fair and reproducible comparisons. For example, GPT-4.5 has been described as a larger model with improved generation quality and social responsiveness, yet its training data extend only to October 2023, similar to GPT-4o, whereas other later GPT-4–series variants (e.g., GPT-4.1) may incorporate more recent training data (Figure 8). Despite sharing a comparable knowledge cutoff, GPT-4.5 and GPT-4o differ in their underlying multimodal design paradigms: GPT-4.5 represents a GPT-4–style multimodal extension built upon a primarily text-centric framework, whereas GPT-4o operationalizes multimodality as a first-class component of the core model architecture. As a result, a direct performance comparison between these two models would conflate differences arising from modality integration strategies with those attributable to task-specific reasoning ability, thereby complicating interpretation. Such discrepancies in knowledge recency can materially influence factual accuracy and alignment with contemporary hazard information (Bubeck et al., 2023; Wei et al., 2022). To maintain internal validity and avoid confounding effects associated with version heterogeneity, the present study restricts its evaluation to GPT-4o.

Figure 8. Release timeline of major GPT models and their training data cutoff dates.

Second, the assessment does not examine GPT-4o's multimodal or spatial-reasoning capabilities. Many geohazard-related tasks, such as landslide interpretation from imagery, slope failure recognition, and flood-extent assessment, require integration of visual, spatial, and contextual cues that cannot be fully captured through text-only inputs. Although GPT-4o provides image-processing functions, these were not included in the current design. Future work should therefore evaluate multimodal workflows, particularly those that combine remote sensing data and field observations, to achieve a more complete characterization of the model’s usefulness in geohazard prevention and response.

Third, the study does not fully address the temporal variability of LLM outputs. Large language models can produce slightly different answers when identical questions are posed at different times, reflecting the probabilistic nature of their inference processes as well as potential updates to the underlying system. Under the controlled single-turn prompting used here, the conceptual content of responses was generally stable, although minor differences in phrasing were observed. To increase transparency, we conducted a small qualitative stability check by generating three responses for six representative questions (see Supplementary Material 3). This exercise illustrated typical variation patterns while avoiding the substantial scoring burden of a full 10-run × 60-item protocol. A broader assessment of how model outputs change over longer periods and under different inference conditions would also be useful. Such work could clarify how stable LLM behavior remains as the system evolves.

A further conceptual limitation concerns the epistemological status of LLM-generated responses. Large language models produce outputs through statistical association rather than deductive, causal, or empirically grounded reasoning. Their answers therefore differ fundamentally from human expert knowledge, which is supported by disciplinary theory, field experience, and empirical validation. In this study, domain-scholar evaluations serve as a reference for assessing factual adequacy and conceptual coherence, and are not intended to imply epistemic equivalence between model outputs and human expertise. Future research should explore how these differences in reasoning modes affect reliability in high-stakes geohazard contexts, and should also examine more directly whether large language models meet the expectations of domain scholars and practitioners, particularly in terms of accuracy, relevance, and professional adequacy in geohazard applications.

5 Conclusion and way forward

This study builds a multi-dimensional, evaluator-rated framework to systematically evaluate GPT-4o on geological-hazard QA. Using 60 items spanning six problem categories and six capability dimensions, we quantify strengths and weaknesses across tasks and validate reliability with the intraclass correlation coefficient. GPT-4o attains its highest category scores in C1 (Basic Knowledge, 0.827), C3 (Regional Differences, 0.818), and C2 (Formation-Mechanism Inference, 0.797). By dimension, D1 (Knowledge Coverage, 0.868), D2 (Comprehension and Reasoning, 0.864), and D3 (Accuracy and Rigor, 0.830) lead, whereas D4 (Critical Thinking, 0.578) and D6 (Innovation, 0.550) are markedly lower, especially on complex, open-ended tasks (C4-C6). Overall rater agreement is good, ICC (3, k) = 0.8095, supporting the robustness and reproducibility of the conclusions.

Methodologically, we propose a reproducible evaluation framework for domain-specific AI that integrates multidimensional scoring, a multi-rater evaluation procedure, and statistical reliability analysis; the approach is transferable to other specialized domains. In application, the findings support cautious adoption of GPT as a supportive tool for DRR workflows, such as geohazard monitoring and early warning, while emphasizing operation under professional oversight. Conceptually, the results delineate a gap between fluency and understanding, showing that polished language does not guarantee the causal reasoning and abstraction required for complex geohazard tasks.

It is noteworthy that newer generations of ChatGPT, including GPT-5, are implemented as a unified system that can route requests between faster response behavior and deeper reasoning modes. While this evolution may improve overall adaptivity, it complicates version-specific evaluation and traceability. Our study documents a time-window-specific capability profile for GPT-4o and provides an evaluation baseline for longitudinal tracking and for future cross-model comparisons conducted under explicitly matched constraints. Future research should (i) establish unified, version-aware evaluation suites to support horizontal comparisons across models and vertical tracking across releases, and (ii) extend to multimodal tasks, using fusion of text and image modalities to evaluate the model’s ability to reason over remote sensing artifacts such as interferograms, UAV orthomosaics, LiDAR point clouds, and GNSS time series in support of geohazard risk reduction and early warning.

In summary, GPT-4o cannot replace expert judgment; however, it can efficiently support information synthesis, preliminary analyses, and cross-disciplinary communication. For high-uncertainty or safety-critical contexts typical of geohazard early warning and DRR, human-in-the-loop oversight remains essential to mitigate deceptively plausible yet erroneous outputs. The present version-specific evaluation baseline offers methodological and practical value and sets a reference point for evaluating and optimizing unified-architecture, multimodal reasoning models in the GPT-5 era.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author contributions

SW: Data curation, Formal Analysis, Investigation, Methodology, Resources, Visualization, Writing – original draft, Writing – review and editing. CoX: Conceptualization, Data curation, Funding acquisition, Supervision, Writing – review and editing. ZX: Data curation, Writing – review and editing. YH: Data curation, Writing – review and editing. GX: Data curation, Writing – review and editing. YC: Data curation, Writing – review and editing. JM: Data curation, Writing – review and editing. RM: Data curation, Writing – review and editing. CeX: Data curation, Writing – review and editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the National Institute of Natural Hazards, the Ministry of Emergency Management of China (grant no. ZDJ 2025-54), and the Chongqing Water Resources Bureau, China (grant no. CQS24C00836).

Acknowledgements

We thank the handling editor and the reviewers for their constructive comments, which substantially improved the manuscript.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. Generative AI tools (ChatGPT, OpenAI, San Francisco, CA, United States) were used in two ways: (1) to generate responses to a structured set of geohazard-related questions, which served as the primary research data for expert evaluation; and (2) to assist in language polishing, grammar checking, and translation during manuscript preparation. No AI tools were involved in study design, statistical analysis, or interpretation of results. All authors take full responsibility for the scientific integrity and accuracy of the manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feart.2025.1695920/full#supplementary-material

References

Biswas, S. S. (2023). Potential use of chat GPT in global warming. Ann. Biomed. Eng. 51 (6), 1126–1127. doi:10.1007/s10439-023-03171-8

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., et al. (2021). On the opportunities and risks of foundation models (version 3). arXiv.10.48550/ARXIV.2108.07258.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., et al. (2023). Sparks of artificial general intelligence: early experiments with GPT-4. arXiv.10.48550/ARXIV.2303.12712.

Cascella, M., Montomoli, J., Bellini, V., and Bignami, E. (2023). Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J. Med. Syst. 47 (1), 33. doi:10.1007/s10916-023-01925-4

Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol. Assess. 6 (4), 284–290. doi:10.1037/1040-3590.6.4.284

Deng, R., Jiang, M., Yu, X., Lu, Y., and Liu, S. (2025). Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies. Comput. and Educ. 227, 105224. doi:10.1016/j.compedu.2024.105224

Fell, R., Corominas, J., Bonnard, C., Cascini, L., Leroi, E., and Savage, W. Z. (2008). Guidelines for landslide susceptibility, hazard and risk zoning for land use planning. Eng. Geol. 102 (3–4), 85–98. doi:10.1016/j.enggeo.2008.03.022

Fuchs, S., Kuhlicke, C., and Meyer, V. (2011). Editorial for the special issue: vulnerability to natural hazards—the challenge of integration. Nat. Hazards 58 (2), 609–619. doi:10.1007/s11069-011-9825-5

Gao, H., Xu, C., Xie, C., Ma, J., and Xiao, Z. (2024). Landslides triggered by the July 2023 extreme rainstorm in the haihe river basin, China. Landslides 21 (11), 2885–2890. doi:10.1007/s10346-024-02322-9

Gao, H., Xu, C., Wu, S., Li, T., and Huang, Y. (2025). Has the unpredictability of geological disasters been increased by global warming? Npj Nat. Hazards 2 (1), 55. doi:10.1038/s44304-025-00108-0

Gilson, A., Safranek, C., Huang, T., Socrates, V., Chi, L., Taylor, R. A., et al. (2022). How does ChatGPT perform on the medical licensing exams? The implications of large language models for medical education and knowledge assessment. Med. Educ. doi:10.1101/2022.12.23.22283901

Hager, P., Jungmann, F., Holland, R., Bhagat, K., Hubrecht, I., Knauer, M., et al. (2024). Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30 (9), 2613–2622. doi:10.1038/s41591-024-03097-1

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., et al. (2021). Measuring massive multitask language understanding (no. arXiv:2009.03300). arXiv.10.48550/arXiv.2009.03300.

Hosseini, S. H., and Pourzangbar, A. (2026). How well do DeepSeek, ChatGPT, and gemini respond to water science questions? Environ. Model. and Softw. 196, 106772. doi:10.1016/j.envsoft.2025.106772

Hostetter, H., Naser, M. Z., Huang, X., and Gales, J. (2024). The role of large language models (AI chatbots) in fire engineering: an examination of technical questions against domain knowledge. Nat. Hazards Res. 4 (4), 669–688. doi:10.1016/j.nhres.2024.06.003

Huang, Y., Xu, C., He, X., Cheng, J., Xu, X., and Tian, Y. (2025). Landslides induced by the 2023 jishishan Ms6.2 earthquake (NW China): spatial distribution characteristics and implication for the seismogenic fault. Npj Nat. Hazards 2 (1), 14. doi:10.1038/s44304-025-00064-9

Jones, C. R., and Bergen, B. K. (2025). Large language models pass the turing test (version 1). arXiv.10.48550/ARXIV.2503.23674.

Kaklauskas, A., Rajib, S., Piaseckiene, G., Kaklauskiene, L., Sepliakovas, J., Lepkova, N., et al. (2024). Multiple criteria and statistical sentiment analysis on flooding. Sci. Rep. 14 (1), 30291. doi:10.1038/s41598-024-81562-0

Kane, M. T. (2013). Validating the interpretations and uses of test scores. J. Educ. Meas. 50 (1), 1–73. doi:10.1111/jedm.12000

Kim, D., Kim, T., Kim, Y., Byun, Y.-H., and Yun, T. S. (2024). A ChatGPT-MATLAB framework for numerical modeling in geotechnical engineering applications. Comput. Geotechnics 169, 106237. doi:10.1016/j.compgeo.2024.106237

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2023). Large language models are zero-shot reasoners (no. arXiv:2205.11916). arXiv.10.48550/arXiv.2205.11916.

Koo, T. K., and Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15 (2), 155–163. doi:10.1016/j.jcm.2016.02.012

Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., et al. (2023). Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit. Health 2 (2), e0000198. doi:10.1371/journal.pdig.0000198

Lai, J., Gan, W., Wu, J., Qi, Z., and Yu, P. S. (2024). Large language models in law: a survey. AI Open 5, 181–196. doi:10.1016/j.aiopen.2024.09.002

Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33 (1), 159–174. doi:10.2307/2529310

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., et al. (2022). Holistic evaluation of language models (no. arXiv:2211.09110). arXiv. doi:10.48550/arXiv.2211.09110.

McGraw, K. O., and Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1 (1), 30–46. doi:10.1037/1082-989X.1.1.30

Mirza, A., Alampara, N., Kunchapu, S., Ríos-García, M., Emoekabu, B., Krishnan, A., et al. (2025). A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nat. Chem. 17 (7), 1027–1034. doi:10.1038/s41557-025-01815-x

Neha, F., Bhati, D., Shukla, D. K., and Amiruzzaman, M. (2024). ChatGPT: transforming healthcare with AI. AI 5 (4), 2618–2650. doi:10.3390/ai5040126

OpenAI (2024). GPT-4.5 system card. Available online at: https://openai.com/index/gpt-4-5-system-card/.

OpenAI (2025). Introducing GPT-4.1 in the API. Available online at: https://openai.com/index/gpt-4-1/.

OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., et al. (2023). GPT-4 technical report (version 6). arXiv. doi:10.48550/arXiv.2303.08774.

OpenAI, Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., et al. (2024). GPT-4o system card (version 1). arXiv. doi:10.48550/arXiv.2410.21276.

Plevris, V., Papazafeiropoulos, G., and Jiménez Rios, A. (2023). Chatbots put to the test in math and logic problems: a comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI 4 (4), 949–969. doi:10.3390/ai4040048

Ray, P. P. (2024). ChatGPT in transforming communication in seismic engineering: case studies, implications, key challenges and future directions. Earthq. Sci. 37 (4), 352–367. doi:10.1016/j.eqs.2024.04.003

Raza, M., Jahangir, Z., Riaz, M. B., Saeed, M. J., and Sattar, M. A. (2025). Industrial applications of large language models. Sci. Rep. 15 (1), 13755. doi:10.1038/s41598-025-98483-1

Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., et al. (2019). Deep learning and process understanding for data-driven Earth system science. Nature 566 (7743), 195–204. doi:10.1038/s41586-019-0912-1

Semeraro, F., Cascella, M., Montomoli, J., Bellini, V., and Bignami, E. G. (2025). Comparative analysis of AI tools for disseminating CPR guidelines: implications for cardiac arrest education. Resuscitation 208, 110528. doi:10.1016/j.resuscitation.2025.110528

Shao, X., Ma, S., Xu, C., Xie, C., Li, T., Huang, Y., et al. (2024). Landslides triggered by the 2022 Ms6.8 Luding strike-slip earthquake: an update. Eng. Geol. 335, 107536. doi:10.1016/j.enggeo.2024.107536

Shrout, P. E., and Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86 (2), 420–428. doi:10.1037/0033-2909.86.2.420

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., et al. (2023). Beyond the imitation game: quantifying and extrapolating the capabilities of language models (no. arXiv:2206.04615). arXiv. doi:10.48550/arXiv.2206.04615.

Tay, Y., Dehghani, M., Bahri, D., and Metzler, D. (2023). Efficient transformers: a survey. ACM Comput. Surv. 55 (6), 1–28. doi:10.1145/3530811

Wang, L., Chen, X., Deng, X., Wen, H., You, M., Liu, W., et al. (2024). Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. Npj Digit. Med. 7 (1), 41. doi:10.1038/s41746-024-01029-4

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv. doi:10.48550/arXiv.2201.11903.

Wilson, M. P., Foulger, G. R., Wilkinson, M. W., Gluyas, J. G., Mhana, N., and Tezel, T. (2023). Artificial intelligence and human-induced seismicity: initial observations of ChatGPT. Seismol. Res. Lett. 94 (5), 2111–2118. doi:10.1785/0220230112

Wu, S., Xu, C., Ma, J., and Gao, H. (2025). Escalating risks and impacts of rainfall-induced geohazards. Nat. Hazards Res. 5. doi:10.1016/j.nhres.2025.03.003

Xie, C., Gao, H., Huang, Y., Xue, Z., Xu, C., and Dai, K. (2025). Leveraging the DeepSeek large model: a framework for AI-assisted disaster prevention, mitigation, and emergency response systems. Earthq. Res. Adv. 5, 100378. doi:10.1016/j.eqrea.2025.100378

Xu, C., and Lin, N. (2025). Building a global forum for natural hazard science. Npj Nat. Hazards 2 (1), s130–s132. doi:10.1038/s44304-025-00130-2

Xu, F., Ma, J., Li, N., and Cheng, J. C. P. (2025). Large language model applications in disaster management: an interdisciplinary review. Int. J. Disaster Risk Reduct. 127, 105642. doi:10.1016/j.ijdrr.2025.105642

Xue, Z., Xu, C., and Xu, X. (2023). Application of ChatGPT in natural disaster prevention and reduction. Nat. Hazards Res. 3 (3), 556–562. doi:10.1016/j.nhres.2023.07.005

Yang, Z., Yao, Z., Tasmin, M., Vashisht, P., Jang, W. S., Ouyang, F., et al. (2025). Unveiling GPT-4V’s hidden challenges behind high accuracy on USMLE questions: observational study. J. Med. Internet Res. 27, e65146. doi:10.2196/65146

Zhao, T., Wang, S., Ouyang, C., Chen, M., Liu, C., Zhang, J., et al. (2024). Artificial intelligence for geoscience: progress, challenges, and perspectives. Innovation 5 (5), 100691. doi:10.1016/j.xinn.2024.100691

Zhou, L., Schellaert, W., Martínez-Plumed, F., Moros-Daval, Y., Ferri, C., and Hernández-Orallo, J. (2024). Larger and more instructable language models become less reliable. Nature 634 (8032), 61–68. doi:10.1038/s41586-024-07930-y

Keywords: ChatGPT, disaster risk reduction, domain evaluator ratings, geological hazards, multi-dimensional capability profiling, question answering

Citation: Wu S, Xu C, Xue Z, Huang Y, Xu G, Cui Y, Ma J, Ma R and Xie C (2026) Beyond structured knowledge: performance boundaries of ChatGPT in geological-hazard question answering and the need for human-in-the-loop oversight. Front. Earth Sci. 13:1695920. doi: 10.3389/feart.2025.1695920

Received: 30 August 2025; Accepted: 29 December 2025;
Published: 13 January 2026.

Edited by:

Augusto Neri, National Institute of Geophysics and Volcanology (INGV), Italy

Reviewed by:

Annemarie Christophersen, GNS Science, New Zealand
Hans-Balder Havenith, University of Liège, Belgium
Mohammad Al Mashagbeh, The University of Jordan, Jordan

Copyright © 2026 Wu, Xu, Xue, Huang, Xu, Cui, Ma, Ma and Xie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chong Xu, xc11111111@126.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.