- 1 National Institute of Natural Hazards, Ministry of Emergency Management of China, Beijing, China
- 2 Key Laboratory of Compound and Chained Natural Hazards Dynamics, Ministry of Emergency Management of China, Beijing, China
- 3 School of Emergency Management Science and Engineering, University of Chinese Academy of Sciences, Beijing, China
- 4 School of Civil Engineering and Architecture, Anhui University of Science and Technology, Huainan, China
- 5 School of Earth Sciences and Resources, China University of Geosciences (Beijing), Beijing, China
Large language models have shown promise across specialized domains, but their performance limits in disaster risk reduction remain poorly understood. We conduct a version-specific evaluation of ChatGPT-4o for geological-hazard question answering using a transparent, rubric-based design. Sixty questions spanning six task categories (C1-C6) were posed within a fixed time window under a controlled single-turn protocol, and eight evaluators with geohazard expertise independently rated each response on six capability dimensions (D1 Knowledge Coverage; D2 Comprehension and Reasoning; D3 Accuracy and Rigor; D4 Critical Thinking; D5 Application and Context Adaptability; D6 Innovation and Knowledge Expansion). Scores were assigned on a continuous 0–1 scale, with 0, 0.5, and 1 used as anchor points to guide interpretation. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Performance was consistently higher on structured knowledge tasks, defined here as questions with well-established concepts, factual grounding, or clearly bounded reasoning paths (C1 = 0.827; C2 = 0.797; C3 = 0.818), than on open-ended tasks (C4-C6 mean = 0.591). Across dimensions, scores were highest for D1 (0.868), D2 (0.864), and D3 (0.830), and lowest for D4 (0.578) and D6 (0.550). Overall agreement was good (ICC (3, k) = 0.8095), while agreement decreased for more subjective tasks and dimensions. The study provides (i) a baseline, version-specific appraisal of GPT-4o in geohazard-related QA, (ii) a transferable rubric-based workflow for evaluating domain LLMs, and (iii) evidence that human oversight remains essential when such systems are used to support safety-critical disaster risk reduction decisions.
1 Introduction
In recent years, geological hazards triggered by extreme rainfall and earthquakes have shown rising frequency and expanding impacts (Gao et al., 2024; Huang et al., 2025; Shao et al., 2024). Compounding and cascading effects across multiple hazards have become increasingly prominent, rendering hazard processes highly abrupt, with complex evolutionary pathways and narrow decision windows (Gao et al., 2025; Wu et al., 2025). These characteristics raise the bar for monitoring and early warning, risk assessment, and emergency management (C. Xu and Lin, 2025).
Against this backdrop, generative large language models (LLMs) have made substantive advances in multimodal understanding and information integration, knowledge organization and generation, and complex reasoning. They exhibit cross-task transferability and adaptivity, revealing the potential for artificial general intelligence (Bommasani et al., 2021), and opening new technical pathways for natural-hazard research and governance (OpenAI et al., 2023; 2024). LLMs have been adopted in highly specialized domains that require timely knowledge services, including law (Lai et al., 2024), chemistry (Mirza et al., 2025), medicine (Cascella et al., 2023; Neha et al., 2024; Semeraro et al., 2025; Yang et al., 2025), and engineering (Hostetter et al., 2024; Kim et al., 2024). They are also extending into Earth system and human–environment modeling and data analysis within the geosciences (Reichstein et al., 2019; Xie et al., 2025; Zhao et al., 2024).
Within disaster risk reduction (DRR), natural language processing is increasingly used across the stages of observation, cognition, analysis, and decision-making. Pilot applications include disaster information extraction, sentiment analysis and scenario simulation, public outreach, and decision support (Xu et al., 2025; Xue et al., 2023; Zhao et al., 2024), with floods and earthquakes as the most common use cases. For example, generative models such as ChatGPT (GPT) have supported “abstractive reviews” that rapidly synthesize evidence for flood-rescue logistics and resource allocation (Kaklauskas et al., 2024). In seismic-engineering contexts, GPT has been used for technical drafting, terminology explanation, and communication to improve research and outreach efficiency (Ray, 2024; Wilson et al., 2023).
However, most existing evaluations focus on general-purpose benchmarks or single tasks. Evidence vetted by researchers and practitioners in geohazard science is still lacking on the effectiveness and limits of using a generic question-answering (QA) interface across what we refer to here as the full knowledge chain of geological hazards. This chain spans foundational concepts, process interpretation, regional variability, interdisciplinary reasoning, scenario construction, and emerging research topics. Widely used academic benchmarks such as MMLU and BIG-bench do not transfer well to the domain-constrained and context-dependent nature of professional QA in geohazard assessment (Hendrycks et al., 2021; Srivastava et al., 2023). Recent discussions in the AI research community have also emphasized that strong benchmark performance does not necessarily translate into reliable behavior in applied decision-making contexts. This “high-score, low-utility” gap has been highlighted in public remarks by Ilya Sutskever, a leading figure in deep learning and co-founder of OpenAI, who has cautioned against over-reliance on benchmark-centered capability assessments. Taken together, these issues underscore the need for systematic, scholar-informed, and practice-oriented capability auditing, and for clear, evidence-based guidance on how LLMs should be responsibly introduced into geohazard-related applications, where accuracy and practical applicability are essential.
Model iteration further underscores the need for establishing clear historical baselines. GPT-4o, released in May 2024, became one of the first widely accessible multimodal models supporting text–image interaction for tasks relevant to hazard interpretation and scientific communication (OpenAI et al., 2024). Subsequent releases, including the shift to GPT-5 as the default model in 2025, introduced architectures that can vary the depth of internal reasoning and, depending on user tier and platform, offer multiple model variants. These changes improve flexibility but also make it more difficult to determine which specific model version a user is interacting with. This increasing opacity highlights the need for version-specific and well-documented assessments to support transparent comparison across evolving model families.
Against this background, we develop an expert-driven, multi-dimensional rating and reliability framework to systematically evaluate GPT-4o's text-based QA performance for geological-hazard tasks that are pertinent to DRR and to AI-enabled, remote-sensing–assisted monitoring and early-warning workflows. We design 60 representative items across six problem categories and six capability dimensions, organize independent blind ratings, and quantify agreement and reliability using the intraclass correlation coefficient (ICC) (Koo and Li, 2016). Our principal contributions are as follows:
1. An expert-driven quantitative evaluation framework structured as “problem category × capability dimension,” with reliability reported via ICC, yielding a reproducible and transferable protocol for professional QA assessment;
2. A multi-dimensional capability profile and uncertainty characterization of GPT-4o in geohazard and DRR. The analysis identifies strengths in structured knowledge tasks (questions anchored in well-established concepts, classification standards, or other widely agreed reference answers) and in causal reasoning, and weaknesses in critical and innovative thinking as well as in complex, design-oriented, open-ended tasks. These patterns provide practical justification for human-in-the-loop oversight in safety-critical settings;
3. A version-specific, traceable performance baseline produced in the era of unified GPT-5 routing and deprecation of legacy defaults, supporting longitudinal tracking of subsequent versions and fair cross-model comparisons.
In sum, this study fills the gap in domain-specific, reproducible evaluation of general-purpose LLMs in a geological-hazard QA setting, providing an evidence base and methodological template for AI-enabled DRR and risk-aware governance. The full methodology, question set, structured rating data, and processing pipeline are documented in “Methodology and Data” and the Supplementary Material to facilitate verification and reuse by peers.
2 Methodology and data
For methodological transparency and continuity, the entire workflow is illustrated in Figure 1, encompassing every step from question design and prompt engineering through answer collection and data preprocessing to rating and statistical analysis.
2.1 Question design and categorization
We developed a 60-question set to reflect the types of knowledge and reasoning that commonly arise in geological-hazard work. To examine different aspects of model performance, the questions were organized into six categories: (C1) Basic Knowledge; (C2) Formation-Mechanism Inference; (C3) Regional Differences; (C4) Interdisciplinary Analysis; (C5) Scenario Planning and Design; and (C6) Frontier Exploration.
When drafting the questions, our intention was to cover the main themes encountered in geohazard assessment. The set includes straightforward factual items as well as questions that require interpretation or multi-step reasoning, so that the difficulty resembles what practitioners deal with in real settings. Each question was written to stand alone, avoiding reliance on earlier items. We also paid attention to phrasing, aiming for wording that would be clear to both the model and the domain scholars who later evaluated the responses.
C1 and C2 focus on basic concepts and triggering mechanisms. C3 highlights regional patterns, C4 addresses cross-disciplinary integration, C5 concerns planning-oriented tasks in practical scenarios, and C6 explores forward-looking scientific directions. All questions were reviewed by researchers working in geohazards to ensure scientific accuracy and relevance. Each item was asked independently in a single-turn format to prevent any influence from prior interactions. The questions were also phrased in a way that resembles how non-expert users typically seek information about geological hazards. The six-category structure and representative items are listed in Table 1.
2.2 Answer collection and preprocessing
All responses were generated using GPT-4o through the standard web interface. Each question was submitted in an independent chat session during a short, continuous period, ensuring that all outputs reflected the same underlying model state. A uniform presentation format was used for every item, and no supplementary background information or follow-up clarification was provided. The single-turn setup served as an experimental control that allowed us to observe the model’s immediate, unassisted response under fixed conditions (Hendrycks et al., 2021; Hosseini and Pourzangbar, 2026; Liang et al., 2022). It is important to note that in professional geohazard assessment, AI tools are typically used through iterative exchanges that allow practitioners to probe and verify the model’s responses. The single-turn configuration used here was therefore not intended to reflect standard practice, but to provide a stable, version-specific baseline for evaluating the model’s immediate, unassisted output under controlled conditions. Because LLMs may exhibit minor phrasing or temporal variability due to probabilistic decoding and periodic system updates, the analysis focuses on answers generated within this defined timeframe for the May 2024 GPT-4o release (all QA sessions were conducted in February 2025). Limiting the evaluation in this way avoids confounding from version drift and supports consistent interpretation of results.
After collection, the raw outputs were processed through a structured cleaning workflow. Boilerplate disclaimers, conversational fillers, and other non-substantive elements were removed to isolate the analytical content. Terminology and formatting were harmonized where appropriate to support clear interpretation by the evaluators, while the factual and conceptual substance of each answer was left unchanged. The rating sheets submitted by the evaluators were consolidated into a single tabular file with consistent variable definitions, facilitating score aggregation and the computation of inter-rater agreement.
The complete question set, cleaned model outputs, and de-identified rating tables are provided in the Supplementary Material to support transparent archiving and future re-analysis.
2.3 Prompt engineering
We adopted a standardized single-turn prompting procedure for all queries. Each question was submitted in a fresh session with no prior context so that the model’s output depended solely on the prompt provided. This configuration reduces variability arising from conversational history and mirrors typical information-seeking behavior in geohazard and emergency-management settings, where users often ask isolated, one-off questions. The approach also follows established evaluation practices in large-language-model benchmarking, where isolated, one-pass queries are commonly used to support fair and reproducible comparisons (Hendrycks et al., 2021; Hosseini and Pourzangbar, 2026; Srivastava et al., 2023). Consistent with this minimal-intervention philosophy, no parameter tuning, multi-turn refinement, or repeated sampling was applied.
To maintain domain relevance, a concise role-based instruction preceded every query. The instruction asked the model to answer as an expert in geological hazards and Earth sciences. The standardized cue was: “You are a domain expert in geological hazards and Earth science. Please answer the following question with accurate, detailed information and rigorous reasoning.” This prompt steered the model toward technically grounded responses without over-constraining the format, and it improved terminological clarity and coherence in domain-specific reasoning, a pattern also noted in studies using similar expert-role prompts (Wang et al., 2024).
The same prompt template was applied to all sixty questions. No examples, few-shot demonstrations, or chain-of-thought scaffolds were included, and no multi-turn exchanges were used. This ensured that each answer reflected the model’s zero-shot capability under identical conditions, supporting fair comparison across question categories and preserving the ecological validity of simulating unassisted user queries in geohazard contexts. Avoiding selective prompt augmentation also reduced the risk of uneven prompting effects, which can confound comparative evaluations (Kojima et al., 2023).
To examine the stability of this configuration, we conducted a small qualitative stability check (see Supplementary Material 3). Six representative questions (one from each category, C1–C6) were re-queried three times under the same settings. The core conceptual content remained consistent across runs, with variation largely confined to minor differences in phrasing. All main analyses in this study are therefore based on responses generated under this uniform, single-turn protocol.
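All answers analysed in this study were collected manually through the ChatGPT web interface, as described above. For readers who wish to reproduce a comparable single-turn, role-prefixed protocol programmatically, the following minimal sketch uses the OpenAI Python client; the model identifier, the example question, and the repetition count for the stability check are illustrative assumptions and were not part of our workflow.

```python
# Minimal sketch of a single-turn, role-prefixed query protocol (illustrative
# only; the study itself used the ChatGPT web interface, not the API).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ROLE_PROMPT = (
    "You are a domain expert in geological hazards and Earth science. "
    "Please answer the following question with accurate, detailed "
    "information and rigorous reasoning."
)

def ask_single_turn(question: str, n_runs: int = 1) -> list[str]:
    """Submit the question in fresh, history-free requests (one per run)."""
    answers = []
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumed model identifier for illustration
            messages=[
                {"role": "system", "content": ROLE_PROMPT},
                {"role": "user", "content": question},
            ],
        )
        answers.append(resp.choices[0].message.content)
    return answers

# Example question (hypothetical wording) and a three-run stability check
# analogous to the qualitative check in Supplementary Material 3.
runs = ask_single_turn(
    "What are the main triggering factors of rainfall-induced shallow landslides?",
    n_runs=3,
)
```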
2.4 Evaluators and rating dimensions
We invited eight evaluators (E1 through E8), comprising senior researchers, professors, postdoctoral researchers, and PhD candidates with formal training and at least 5 years of experience in geohazard investigation, risk assessment, and governance. All eight evaluators had been involved in developing the question set and refining the scoring rubric. Each evaluator independently assessed the model’s answers, working in isolation to avoid mutual influence.
A unified rubric was used to evaluate each answer across six capability dimensions (Table 2). Scores were assigned on a continuous 0–1 scale. The values 0, 0.5, and 1 served only as conceptual anchors for the endpoints and midpoint, not as the only permissible options. Before scoring, evaluators were explicitly instructed that they could select any value within the 0-1 range—typically in 0.1 increments—to express their level of satisfaction with the model’s performance. All scores were retained as provided; no adjustment or post hoc normalization was applied. Although the same six dimensions were applied across all questions, evaluators were instructed to interpret each dimension in a manner appropriate to the question type, following the definitions in Table 2 and the calibration examples in Supplementary Material 2.
To support consistent interpretation of the rubric, we provided each evaluator with a set of calibrating exemplar answers and scoring notes prior to formal scoring. These materials (now included in Supplementary Material 2) illustrate how the six dimensions should be applied to different question types.
Scores were assigned independently, without discussion or consensus-building among evaluators. For each question category and each evaluation dimension, the final score was computed as the arithmetic mean of the eight individual ratings. This approach preserves individual judgment, avoids group influence, and allows agreement among evaluators to be assessed quantitatively through inter-rater reliability analysis.
2.5 Statistical analysis framework
We developed a structured statistical pipeline to systematically evaluate GPT-4o's performance across geological-hazard QA tasks and to quantify inter-rater agreement. The pipeline comprised three components: data curation and preprocessing, descriptive statistics, and rater-consistency assessment.
2.5.1 Data curation and preprocessing
The eight individual rating sheets were consolidated into a rectangular matrix with fields Evaluator × QuestionID × Category × Dimension × Score. We screened for obvious entry errors and missing ratings and applied corrections to ensure completeness and reliability. Based on the cleaned dataset, grouped summaries were produced by problem category, capability dimension, and evaluator index.
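As a minimal illustration of this consolidation step, the sketch below stacks eight per-evaluator rating sheets into the long-format Evaluator × QuestionID × Category × Dimension × Score table described above; the file names and column layout are hypothetical stand-ins for the de-identified sheets provided in the Supplementary Material.

```python
# Illustrative consolidation of eight per-evaluator rating sheets into one
# long-format table (file names and column layout are hypothetical).
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob("ratings_E*.csv")):  # e.g. ratings_E1.csv ... ratings_E8.csv
    sheet = pd.read_csv(path)  # assumed columns: QuestionID, Category, D1 ... D6
    long_form = sheet.melt(
        id_vars=["QuestionID", "Category"],
        value_vars=[f"D{i}" for i in range(1, 7)],
        var_name="Dimension",
        value_name="Score",
    )
    long_form["Evaluator"] = path.split("_")[-1].removesuffix(".csv")
    frames.append(long_form)

ratings = pd.concat(frames, ignore_index=True)

# Completeness checks: scores in [0, 1] and 8 evaluators x 60 questions x 6 dimensions.
assert ratings["Score"].between(0, 1).all()
assert len(ratings) == 8 * 60 * 6
```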
2.5.2 Descriptive statistics
To characterize performance differences across task types and capability dimensions, we computed means and standard deviations for each category and each dimension, thereby assessing central tendency and dispersion. Distributions were visualized using box plots, violin plots, and radar charts to convey overall spread, density, and contrasts, highlighting differences in knowledge coverage, reasoning ability, and contextual adaptability across categories and dimensions.
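In code, these grouped summaries reduce to a few aggregation calls. The sketch below uses randomly generated placeholder scores so that it runs standalone; in practice the `ratings` table comes from the consolidation step in Section 2.5.1, and the box, violin, and radar charts are drawn from these summaries.

```python
# Grouped descriptive statistics on the long-format ratings table.
# Placeholder scores are generated so the snippet runs standalone; the real
# table is produced by the consolidation step sketched in Section 2.5.1.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_eval, n_q, n_dim = 8, 60, 6
ratings = pd.DataFrame({
    "Evaluator": np.repeat([f"E{i}" for i in range(1, n_eval + 1)], n_q * n_dim),
    "QuestionID": np.tile(np.repeat(np.arange(1, n_q + 1), n_dim), n_eval),
    # Illustrative category assignment (ten consecutive questions per category).
    "Category": np.tile(np.repeat([f"C{q // 10 + 1}" for q in range(n_q)], n_dim), n_eval),
    "Dimension": np.tile([f"D{i}" for i in range(1, n_dim + 1)], n_eval * n_q),
    "Score": rng.uniform(0, 1, n_eval * n_q * n_dim).round(1),
})

# Central tendency and dispersion by problem category and by capability dimension.
by_category = ratings.groupby("Category")["Score"].agg(["mean", "std"])
by_dimension = ratings.groupby("Dimension")["Score"].agg(["mean", "std"])

# Category x dimension matrix of mean scores (the basis for radar-chart profiles).
profile = ratings.pivot_table(index="Category", columns="Dimension",
                              values="Score", aggfunc="mean")

print(by_category.round(3), by_dimension.round(3), profile.round(3), sep="\n\n")
```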
2.5.3 Inter-rater reliability
Inter-rater reliability was evaluated using the ICC, specifically the two-way mixed-effects model for average measures, denoted as ICC (3, k) (McGraw and Wong, 1996; Shrout and Fleiss, 1979). This model treats the panel of evaluators (k = 8) as fixed effects and the items (n = 60) as random effects. The average-measures coefficient was selected because the study's composite scores are derived from the mean ratings of the eight evaluators, making the reliability of the averaged scores, rather than of individual ratings, the relevant metric. The consistency formulation of ICC (3, k) is calculated as:

ICC(3, k) = (MSR − MSE) / MSR,

where MSR and MSE represent the mean square for rows (items) and the residual mean square, respectively. We report both the overall ICC and stratified estimates by task category (C1–C6) and evaluation dimension (D1–D6) to detect variations in agreement across different assessment contexts. ICC values approaching 1 indicate high agreement among evaluators. Values near zero or below indicate that agreement is limited and does not exceed what would be expected by chance, which may reflect the subjective or interpretive nature of certain question types or evaluation dimensions rather than deficiencies in the scoring procedure itself.
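As a computational companion to this definition, the following sketch evaluates the consistency form of ICC(3, k) directly from an items × evaluators score matrix using the mean squares defined above. The demonstration matrix is a random placeholder; for real data the result can be cross-checked against standard implementations (e.g., the ICC3k estimate reported by the pingouin package).

```python
# Consistency ICC(3, k): two-way mixed-effects model, average of k raters.
# `scores` is an (n_items x n_raters) matrix, e.g. each evaluator's composite
# score per question (60 x 8 in this study); the demo matrix is a placeholder.
import numpy as np

def icc_3k(scores: np.ndarray) -> float:
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-item means
    col_means = x.mean(axis=0)   # per-rater means

    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-item variation
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-rater variation
    ss_error = ss_total - ss_rows - ss_cols

    ms_r = ss_rows / (n - 1)                         # MSR in the text
    ms_e = ss_error / ((n - 1) * (k - 1))            # MSE in the text
    return (ms_r - ms_e) / ms_r

rng = np.random.default_rng(1)
demo = np.clip(rng.normal(0.8, 0.1, size=(60, 8)), 0, 1)  # placeholder 60 x 8 scores
print(round(icc_3k(demo), 4))
```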
3 Results
3.1 Overall performance across all questions
The dataset comprises ratings from eight evaluators for 60 questions across six evaluation dimensions. Figure 2 summarizes the distribution of evaluator scores at the question level using interquartile-range (IQR) bands for each dimension. The shaded regions depict the middle 50% of scores across evaluators, highlighting the degree of agreement or dispersion across different dimensions and questions. The black line represents the overall mean score for each question, averaged across all evaluators and dimensions, and provides a reference for the model's aggregate performance. Together, these elements convey both the central tendency of scores and the variability in evaluator judgments across the full question set.
Figure 2. Distribution of evaluator scores across questions using interquartile ranges. For each question (1-60), shaded bands show the interquartile range (25th-75th percentile) of evaluator scores for each evaluation dimension (D1-D6). Dimensions with greater score dispersion are rendered in lighter tones, whereas more concentrated ratings appear in more saturated colors. The black solid line indicates the overall mean score per question, averaged across all evaluators and dimensions.
Several patterns emerge from the question-level distributions in Figure 2. First, questions belonging to the scenario planning and design category (C5) tend to exhibit lower overall mean scores than other categories, indicating that evaluators were, on average, less satisfied with the model’s responses to application-oriented and planning-related tasks. This suggests that while the model performs relatively well on conceptual and interpretative questions, translating knowledge into actionable or scenario-specific guidance remains more challenging. Second, the interquartile-range bands for Dimension D6 (Innovation and Knowledge Expansion) reveal contrasting behaviors across question types. For questions with relatively fixed or well-established answers, the D6 bands are generally narrow and centered at moderate to high scores, indicating that the model tends to exercise appropriate restraint by avoiding unnecessary speculation or fabricated novelty. In contrast, for more open-ended questions, the D6 bands are wider, reflecting greater divergence in evaluator judgments. This dispersion suggests that assessments of “satisfactory innovation” are harder to align for exploratory or forward-looking tasks, where expectations regarding originality, framing, and added insight vary more substantially among evaluators.
As a complementary summary of the same rating distributions, Figure 3 presents box-and-violin plots grouped by question category, evaluation dimension, and evaluator. Across question categories C1-C6 (left panel of Figure 3), scores for C1-C3 are tightly clustered, with medians close to 0.9 and standard deviations around 0.25-0.28, indicating relatively high agreement for basic knowledge and mechanism-focused questions. By contrast, the scenario-planning category C5 shows the widest spread, with a standard deviation of 0.312 and a visibly broader IQR band, suggesting more divergent views on how well the model performs on application-oriented tasks. Across evaluation dimensions (middle panel of Figure 3), the lower quartiles for D1 and D2 lie near 0.8 and their standard deviations are 0.160 and 0.169, so most ratings fall in the upper score range and internal consistency is comparatively strong. D4 and D6, in comparison, display much greater dispersion, with standard deviations of 0.393 and 0.318, reflecting wider differences in how evaluators judged critical thinking and innovation.
Figure 3. Box-and-violin plots of score distributions by question category, evaluation dimension, and evaluator. For each group, the light violin shows the full distribution of scores, the dark vertical bar indicates the interquartile range (25th-75th percentiles), and the white dot marks the median.
Evaluator-level patterns (right panel of Figure 3) further illustrate these contrasts. Most evaluators’ scores are concentrated in the high range (≥0.8), indicating generally positive assessments of the model’s answers. E2 shows the lowest standard deviation (0.138), consistent with a relatively stable internal scoring style. In contrast, E6 has the highest standard deviation (0.336) and a visibly broader spread into intermediate scores, pointing to greater within-rater variability across questions. Overall, these distributions indicate that perceived answer quality varies systematically with question type, evaluation dimension, and evaluator, rather than being uniform across the full question set.
3.2 Performance by problem category
To examine how GPT performs across diverse geohazard task types, we computed a composite performance index for each category by averaging the scores over the six dimensions D1-D6. Results are shown in Figure 4. Overall, GPT performs best on C1, with a mean score of 0.827, indicating a marked advantage on well-structured, standardized questions. C3 and C2 follow with composite scores of 0.818 and 0.797, respectively, suggesting good adaptability to tasks requiring region-specific judgments and causal reasoning.
Figure 5 further breaks down these category-level results by dimension. For C1, the highest score occurs on Accuracy and Rigor (D3 = 0.895). C2 attains the top single-dimension score across all data on Knowledge Coverage (D1 = 0.906), while C3 performs best on Comprehension and Reasoning (D2 = 0.891). These results indicate that the model can deliver high-accuracy outputs on knowledge-oriented tasks and sustains strong comprehension and reasoning in causal and regional analyses.
Figure 5. Dimension-level performance across problem categories. (a) C1 peaks on D3, C2 on D1, and C3 on D2. (b) C4 peaks on D1, C5 on D1, and C6 on D2.
By contrast, performance is weaker on the more complex and creative categories C4-C6 (Figure 5). The composite score is 0.749 for C4, 0.673 for C5 (the lowest among all categories), and 0.746 for C6. A similar pattern appears on the Innovation and Knowledge Expansion dimension (D6): C5 and C6 score only 0.505 and 0.554, respectively, substantially lower than the scores for the foundational categories. Taken together, these results suggest that, under the current evaluation setup, GPT is less reliable when tasks require high-level synthesis, operational strategy generation, or frontier extrapolation.
3.3 Performance by evaluation dimension
To comprehensively assess GPT’s behavior under heterogeneous capability requirements, we averaged scores for all problem categories across the six evaluation dimensions; results are shown in Figure 6. Overall, GPT performs strongly on D1–D3, with mean scores of 0.868 (D1), 0.864 (D2), and 0.830 (D3). This indicates relatively stable performance in knowledge coverage/recall (D1), understanding and reasoning (D2), and accuracy and rigor in expression (D3). Notably, D1 reaches its peak within category C2 at 0.906, suggesting that the model is well supported by geohazard-related corpora when broad domain coverage is required.
Figure 6. Average scores across evaluation dimensions D1–D6. Strong performance is observed on D1–D3, whereas D4 and D6 remain comparatively low, highlighting challenges in higher-order content generation.
On D5, the average score is 0.787, indicating that the model can produce operationally useful content for applied tasks (e.g., disaster monitoring and early-warning design or emergency response planning). However, the outputs often remain generic and show limited adaptability to specific scenarios.
By contrast, performance drops markedly on the higher-order cognitive dimensions D4 and D6, with mean scores of 0.578 and 0.550, respectively, forming a pronounced “performance gap” (Figure 7). This trend is most evident for categories C4-C6, reinforcing that when confronted with open-ended problems or tasks lacking established reference answers, the model’s generations often fall short in depth and originality.
Figure 7. Inter-rater reliability of expert assessments across question categories and evaluation dimensions. (a) Distribution of ICC (3,8) values across all questions and evaluators. (b) ICC (3,8) grouped by question category. (c) ICC (3,8) grouped by evaluation dimension. (d) Heatmap of ICC (3,8) values over category–dimension combinations, highlighting areas of strong consensus and areas of greater subjectivity and variability.
3.4 Evaluator rating consistency
To evaluate the consistency of evaluator scores, we applied a two-way mixed-effects, average-measures intraclass correlation, ICC (3, k), to the ratings assigned by eight evaluators across 60 questions. Details on the definition, interpretation, and calculation of ICC (3, k) are provided in Section 2.5.3 of the Methods. The overall ICC (3,8) is 0.8095, indicating small between-rater differences on the composite, multi-dimensional scores and thus good agreement (Table 3; Figure 7a). Following the guideline of Koo and Li (2016), this value falls within the 0.75-0.90 range, i.e., “good” reliability. These results suggest that the rating framework exhibits strong scoring reliability; overall evaluator agreement on task evaluations is high, and the resulting scores are suitable for a structured assessment of model performance. In addition, the between-question variance component is substantially larger than the residual error (MSR = 0.3369 vs. MSE = 0.0642), indicating that observed score differences are driven primarily by the question content and task type rather than by rater-specific biases.
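Substituting the reported mean squares into the consistency formulation given in Section 2.5.3 reproduces this value: ICC (3, 8) = (0.3369 − 0.0642) / 0.3369 ≈ 0.81.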
We further examined heterogeneity in inter-rater consistency across task categories and evaluation dimensions by conducting grouped (Figures 7b,c) and cross-classified (Figure 7d) ICC analyses. At the category level, C2 and C1 show the highest agreement (ICC = 0.919 and 0.783, respectively), consistent with clearer task structure and stronger answer consensus. In contrast, C4 and C5 display low reliability (ICC <0.1), reflecting pronounced divergence among evaluators for these problem types. At the dimension level, D1 and D2 exhibit higher agreement (ICC >0.75), whereas D4 and D6 are notably lower, revealing greater disagreement when judging higher-order cognitive dimensions. In the cross-combination analysis (Figure 7d), pairs such as C2–D1, C2–D3, C2–D4 and C1–D2 achieve ICC values exceeding 0.80, indicating highly consistent evaluations on core dimensions of foundational and reasoning-oriented tasks. By contrast, combinations such as C4–D4, C4–D5, and C4–D6 yield ICC values near zero or even negative, suggesting weak stability and substantial subjectivity in complex, open-ended, or cross-domain scenarios.
4 Discussion
4.1 ChatGPT’s capabilities and limitations across geohazard tasks
Our findings show that the response quality of GPT-4o to geohazard questions is strongly associated with problem type. On structured knowledge tasks—such as explaining hazard-classification standards or inferring formative mechanisms—the model achieves high accuracy and stability, accompanied by high inter-rater reliability. This aligns with prior evaluations of LLMs: when task boundaries are well defined and the training corpus provides sufficient coverage, LLMs exhibit strong information retrieval and expression capabilities (Gilson et al., 2022). Coupled with relatively low interaction latency (Tay et al., 2023), these strengths confer “advisor-style” utility in information retrieval and instructional support settings (Deng et al., 2025). Mechanistically, these capabilities are consistent with the Transformer architecture and large-scale pretraining, which enable the model to capture co-occurrence patterns and relationships among terms in the training data and to retrieve them when presented with well-posed questions (Bommasani et al., 2021; Tay et al., 2023). In our single-turn, text-only setting, the model does not perform explicit multimodal reasoning or iterative prompt refinement; instead, it appears to reconstruct plausible causal chains by combining known concepts from its training corpus. For example, in questions about rainfall-induced shallow landslides, GPT-4o correctly links rainfall intensity-duration (I-D) thresholds to the probability of triggering shallow failures and explains this in terms of infiltration, pore-pressure rise, and loss of shear strength in near-surface materials.
By contrast, on open-ended, extrapolative, and creative tasks the model does not consistently deliver high-quality strategies (Hager et al., 2024; Kim et al., 2024; Plevris et al., 2023). Both the mean scores and inter-rater agreement decline markedly, especially on higher-order dimensions such as critical thinking and contextual adaptability. A likely contributor is corpus bias: as Biswas notes, output quality is constrained by the quantity and quality of training data (Biswas, 2023). Hallucinated or misleading content also remains a recurrent issue across GPT variants (OpenAI et al., 2023). In our own question set, for example, GPT-4o gave a confident but incorrect answer to a question on which slope type is more prone to instability under seismic loading, incorrectly treating anti-dip slopes as generally more susceptible than dip slopes, which conflicts with standard engineering understanding (see Supplementary Material 4, Q11). Recent work further shows that GPT can pass a three-player Turing test (Jones and Bergen, 2025), underscoring that machine-generated but incorrect answers may be highly persuasive to non-expert audiences (Zhou et al., 2024). At a fundamental level, large language models remain pattern recognizers trained on vast text corpora (Ray, 2024; Raza et al., 2025): they produce human-like text that is often accurate, but not necessarily grounded in deep understanding. For high-uncertainty problems and other high-stakes contexts (such as designing mitigation works or supporting risk-critical decisions), outputs from GPT should therefore be treated as supplementary input and remain subject to expert judgment and synthesis.
4.2 Validity and reliability of the domain-scholar–based evaluation framework
We evaluated a structured, multidimensional domain-scholar rating rubric using a two-way mixed-effects ICC (3, k). Overall agreement was good (Cicchetti, 1994; Shrout and Fleiss, 1979), indicating clear criteria, limited between-rater bias, and reliable aggregate scores. Between-question variance exceeded residual error, suggesting that score differences primarily reflect task content rather than rater noise (Landis and Koch, 1977).
Agreement varied with task type and dimension definition. Well-bounded combinations (e.g., C2-D1 and C1-D2) showed high consensus, consistent with greater stability when dimensions are clearly specified and answers are determinate (Kung et al., 2023). In contrast, subjective or open-ended combinations (e.g., C5-D4 and C6-D6) exhibited markedly lower agreement, aligning with prior findings on task subjectivity and rater-background effects (Gilson et al., 2022; Mirza et al., 2025; Yang et al., 2025).
To further enhance reliability, future studies should provide explicit scoring anchors and exemplars and conduct cross-domain calibration sessions to surface and reconcile interpretive differences. Such procedures have improved reproducibility in geohazards, geotechnical engineering, and education research (Fell et al., 2008; Fuchs et al., 2011; Kane, 2013).
4.3 Study limitations
This study has several limitations that should be acknowledged. First, the evaluation focuses on the May 2024 ChatGPT release labeled “GPT-4o”, and all question–answer interactions were collected within a clearly defined query window (February 2025). The study does not include a systematic comparison with later variants (such as GPT-4.1 or GPT-4.5) or with same-generation derivatives including o3, o4-mini, or o4-mini-high (OpenAI, 2024; OpenAI, 2025; OpenAI et al., 2024). Models within the GPT series can differ in documented knowledge recency, modality support, usage quotas, invocation cost, and accessibility. These differences complicate direct comparison and raise concerns regarding fairness and reproducibility. Similar comparability issues also arise in cross-provider settings (e.g., Gemini, Grok, and other frontier LLMs), where product- and system-layer behaviors (not always transparent), safety filtering and compliance constraints, and default generation settings can further confound fair and reproducible comparisons. For example, GPT-4.5 has been described as a larger model with improved generation quality and social responsiveness, yet its training data extend only to October 2023, similar to GPT-4o, whereas other later GPT-4–series variants (e.g., GPT-4.1) may incorporate more recent training data (Figure 8). Despite sharing a comparable knowledge cutoff, GPT-4.5 and GPT-4o differ in their underlying multimodal design paradigms: GPT-4.5 represents a GPT-4–style multimodal extension built upon a primarily text-centric framework, whereas GPT-4o operationalizes multimodality as a first-class component of the core model architecture. As a result, a direct performance comparison between these two models would conflate differences arising from modality integration strategies with those attributable to task-specific reasoning ability, thereby complicating interpretation. Such discrepancies in knowledge recency can materially influence factual accuracy and alignment with contemporary hazard information (Bubeck et al., 2023; Wei et al., 2022). To maintain internal validity and avoid confounding effects associated with version heterogeneity, the present study restricts its evaluation to GPT-4o.
Second, the assessment does not examine GPT-4o's multimodal or spatial-reasoning capabilities. Many geohazard-related tasks, such as landslide interpretation from imagery, slope failure recognition, and flood-extent assessment, require integration of visual, spatial, and contextual cues that cannot be fully captured through text-only inputs. Although GPT-4o provides image-processing functions, these were not included in the current design. Future work should therefore evaluate multimodal workflows, particularly those that combine remote sensing data and field observations, to achieve a more complete characterization of the model's usefulness in geohazard prevention and response.
Third, the study does not fully address the temporal variability of LLM outputs. Large language models can produce slightly different answers when identical questions are posed at different times, reflecting the probabilistic nature of their inference processes as well as potential updates to the underlying system. Under the controlled single-turn prompting used here, the conceptual content of responses was generally stable, although minor differences in phrasing were observed. To increase transparency, we conducted a small qualitative stability check by generating three responses for six representative questions (see Supplementary Material 3). This exercise illustrated typical variation patterns while avoiding the substantial scoring burden of a full 10-run × 60-item protocol. A broader assessment of how model outputs change over longer periods and under different inference conditions would also be useful. Such work could clarify how stable LLM behavior remains as the system evolves.
A further conceptual limitation concerns the epistemological status of LLM-generated responses. Large language models produce outputs through statistical association rather than deductive, causal, or empirically grounded reasoning. Their answers therefore differ fundamentally from human expert knowledge, which is supported by disciplinary theory, field experience, and empirical validation. In this study, domain-scholar evaluations serve as a reference for assessing factual adequacy and conceptual coherence, and are not intended to imply epistemic equivalence between model outputs and human expertise. Future research should explore how these differences in reasoning modes affect reliability in high-stakes geohazard contexts, and should also examine more directly whether large language models meet the expectations of domain scholars and practitioners, particularly in terms of accuracy, relevance, and professional adequacy in geohazard applications.
5 Conclusion and way forward
This study builds a multi-dimensional, evaluator-rated framework to systematically evaluate GPT-4o on geological-hazard QA. Using 60 items spanning six problem categories and six capability dimensions, we quantify strengths and weaknesses across tasks and validate reliability with the intraclass correlation coefficient. GPT-4o attains its highest category scores in C1 (Basic Knowledge, 0.827), C3 (Regional Differences, 0.818), and C2 (Formation-Mechanism Inference, 0.797). By dimension, D1 (Knowledge Coverage, 0.868), D2 (Comprehension and Reasoning, 0.864), and D3 (Accuracy and Rigor, 0.830) lead, whereas D4 (Critical Thinking, 0.578) and D6 (Innovation, 0.550) are markedly lower, especially on complex, open-ended tasks (C4-C6). Overall rater agreement is good, ICC (3, k) = 0.8095, supporting the robustness and reproducibility of the conclusions.
Methodologically, we propose a reproducible evaluation framework for domain-specific AI that integrates multidimensional scoring, a multi-rater evaluation procedure, and statistical reliability analysis; the approach is transferable to other specialized domains. In application, the findings support cautious adoption of GPT as a supportive tool for DRR workflows, such as geohazard monitoring and early warning, while emphasizing operation under professional oversight. Conceptually, the results delineate a gap between fluency and understanding, showing that polished language does not guarantee the causal reasoning and abstraction required for complex geohazard tasks.
It is noteworthy that newer generations of ChatGPT, including GPT-5, are implemented as a unified system that can route requests between faster response behavior and deeper reasoning modes. While this evolution may improve overall adaptivity, it complicates version-specific evaluation and traceability. Our study documents a time-window-specific capability profile for GPT-4o and provides an evaluation baseline for longitudinal tracking and for future cross-model comparisons conducted under explicitly matched constraints. Future research should (i) establish unified, version-aware evaluation suites to support horizontal comparisons across models and vertical tracking across releases, and (ii) extend to multimodal tasks, using fusion of text and image modalities to evaluate the model’s ability to reason over remote sensing artifacts such as interferograms, UAV orthomosaics, LiDAR point clouds, and GNSS time series in support of geohazard risk reduction and early warning.
In summary, GPT-4o cannot replace expert judgment; however, it can efficiently support information synthesis, preliminary analyses, and cross-disciplinary communication. For high-uncertainty or safety-critical contexts typical of geohazard early warning and DRR, human-in-the-loop oversight remains essential to mitigate deceptively plausible yet erroneous outputs. The present version-specific evaluation baseline offers methodological and practical value and sets a reference point for evaluating and optimizing unified-architecture, multimodal reasoning models in the GPT-5 era.
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.
Author contributions
SW: Data curation, Formal Analysis, Investigation, Methodology, Resources, Visualization, Writing – original draft, Writing – review and editing. CoX: Conceptualization, Data curation, Funding acquisition, Supervision, Writing – review and editing. ZX: Data curation, Writing – review and editing. YH: Data curation, Writing – review and editing. GX: Data curation, Writing – review and editing. YC: Data curation, Writing – review and editing. JM: Data curation, Writing – review and editing. RM: Data curation, Writing – review and editing. CeX: Data curation, Writing – review and editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the National Institute of Natural Hazards, the Ministry of Emergency Management of China (grant no. ZDJ 2025-54), and the Chongqing Water Resources Bureau, China (grant no. CQS24C00836).
Acknowledgements
We thank the handling editor and the reviewers for their constructive comments, which substantially improved the manuscript.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The author(s) declared that they were an editorial board member of Frontiers at the time of submission. This had no impact on the peer review process and the final decision.
Generative AI statement
The author(s) declared that generative AI was used in the creation of this manuscript. Generative AI tools (ChatGPT, OpenAI, San Francisco, CA, United States) were used in two ways: (1) to generate responses to a structured set of geohazard-related questions, which served as the primary research data for expert evaluation; and (2) to assist in language polishing, grammar checking, and translation during manuscript preparation. No AI tools were involved in study design, statistical analysis, or interpretation of results. All authors take full responsibility for the scientific integrity and accuracy of the manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feart.2025.1695920/full#supplementary-material
References
Biswas, S. S. (2023). Potential use of chat GPT in global warming. Ann. Biomed. Eng. 51 (6), 1126–1127. doi:10.1007/s10439-023-03171-8
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., et al. (2021). On the opportunities and risks of foundation models (version 3). arXiv. doi:10.48550/arXiv.2108.07258
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., et al. (2023). Sparks of artificial general intelligence: early experiments with GPT-4. arXiv. doi:10.48550/arXiv.2303.12712
Cascella, M., Montomoli, J., Bellini, V., and Bignami, E. (2023). Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J. Med. Syst. 47 (1), 33. doi:10.1007/s10916-023-01925-4
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol. Assess. 6 (4), 284–290. doi:10.1037/1040-3590.6.4.284
Deng, R., Jiang, M., Yu, X., Lu, Y., and Liu, S. (2025). Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies. Comput. and Educ. 227, 105224. doi:10.1016/j.compedu.2024.105224
Fell, R., Corominas, J., Bonnard, C., Cascini, L., Leroi, E., and Savage, W. Z. (2008). Guidelines for landslide susceptibility, hazard and risk zoning for land use planning. Eng. Geol. 102 (3–4), 85–98. doi:10.1016/j.enggeo.2008.03.022
Fuchs, S., Kuhlicke, C., and Meyer, V. (2011). Editorial for the special issue: vulnerability to natural hazards—the challenge of integration. Nat. Hazards 58 (2), 609–619. doi:10.1007/s11069-011-9825-5
Gao, H., Xu, C., Xie, C., Ma, J., and Xiao, Z. (2024). Landslides triggered by the July 2023 extreme rainstorm in the Haihe River Basin, China. Landslides 21 (11), 2885–2890. doi:10.1007/s10346-024-02322-9
Gao, H., Xu, C., Wu, S., Li, T., and Huang, Y. (2025). Has the unpredictability of geological disasters been increased by global warming? Npj Nat. Hazards 2 (1), 55. doi:10.1038/s44304-025-00108-0
Gilson, A., Safranek, C., Huang, T., Socrates, V., Chi, L., Taylor, R. A., et al. (2022). How does ChatGPT perform on the medical licensing exams? The implications of large language models for medical education and knowledge assessment. Med. Educ. doi:10.1101/2022.12.23.22283901
Hager, P., Jungmann, F., Holland, R., Bhagat, K., Hubrecht, I., Knauer, M., et al. (2024). Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30 (9), 2613–2622. doi:10.1038/s41591-024-03097-1
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., et al. (2021). Measuring massive multitask language understanding (no. arXiv:2009.03300). arXiv. doi:10.48550/arXiv.2009.03300
Hosseini, S. H., and Pourzangbar, A. (2026). How well do DeepSeek, ChatGPT, and gemini respond to water science questions? Environ. Model. and Softw. 196, 106772. doi:10.1016/j.envsoft.2025.106772
Hostetter, H., Naser, M. Z., Huang, X., and Gales, J. (2024). The role of large language models (AI chatbots) in fire engineering: an examination of technical questions against domain knowledge. Nat. Hazards Res. 4 (4), 669–688. doi:10.1016/j.nhres.2024.06.003
Huang, Y., Xu, C., He, X., Cheng, J., Xu, X., and Tian, Y. (2025). Landslides induced by the 2023 Jishishan Ms 6.2 earthquake (NW China): spatial distribution characteristics and implication for the seismogenic fault. Npj Nat. Hazards 2 (1), 14. doi:10.1038/s44304-025-00064-9
Jones, C. R., and Bergen, B. K. (2025). Large language models pass the Turing test (version 1). arXiv. doi:10.48550/arXiv.2503.23674
Kaklauskas, A., Rajib, S., Piaseckiene, G., Kaklauskiene, L., Sepliakovas, J., Lepkova, N., et al. (2024). Multiple criteria and statistical sentiment analysis on flooding. Sci. Rep. 14 (1), 30291. doi:10.1038/s41598-024-81562-0
Kane, M. T. (2013). Validating the interpretations and uses of test scores. J. Educ. Meas. 50 (1), 1–73. doi:10.1111/jedm.12000
Kim, D., Kim, T., Kim, Y., Byun, Y.-H., and Yun, T. S. (2024). A ChatGPT-MATLAB framework for numerical modeling in geotechnical engineering applications. Comput. Geotechnics 169, 106237. doi:10.1016/j.compgeo.2024.106237
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2023). Large language models are zero-shot reasoners (no. arXiv:2205.11916). arXiv. doi:10.48550/arXiv.2205.11916
Koo, T. K., and Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15 (2), 155–163. doi:10.1016/j.jcm.2016.02.012
Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., et al. (2023). Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit. Health 2 (2), e0000198. doi:10.1371/journal.pdig.0000198
Lai, J., Gan, W., Wu, J., Qi, Z., and Yu, P. S. (2024). Large language models in law: a survey. AI Open 5, 181–196. doi:10.1016/j.aiopen.2024.09.002
Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33 (1), 159–174. doi:10.2307/2529310
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., et al. (2022). Holistic evaluation of language models. arXiv. doi:10.48550/arXiv.2211.09110
McGraw, K. O., and Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1 (1), 30–46. doi:10.1037/1082-989X.1.1.30
Mirza, A., Alampara, N., Kunchapu, S., Ríos-García, M., Emoekabu, B., Krishnan, A., et al. (2025). A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nat. Chem. 17 (7), 1027–1034. doi:10.1038/s41557-025-01815-x
Neha, F., Bhati, D., Shukla, D. K., and Amiruzzaman, M. (2024). ChatGPT: transforming healthcare with AI. AI 5 (4), 2618–2650. doi:10.3390/ai5040126
OpenAI (2024). GPT-4.5 system card. Available online at: https://openai.com/index/gpt-4-5-system-card/.
OpenAI (2025). Introducing GPT-4.1 in the API. Available online at: https://openai.com/index/gpt-4-1/.
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., et al. (2023). GPT-4 technical report (version 6). arXiv. doi:10.48550/arXiv.2303.08774
OpenAI, Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., et al. (2024). GPT-4o system card (version 1). arXiv. doi:10.48550/arXiv.2410.21276
Plevris, V., Papazafeiropoulos, G., and Jiménez Rios, A. (2023). Chatbots put to the test in math and logic problems: a comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI 4 (4), 949–969. doi:10.3390/ai4040048
Ray, P. P. (2024). ChatGPT in transforming communication in seismic engineering: case studies, implications, key challenges and future directions. Earthq. Sci. 37 (4), 352–367. doi:10.1016/j.eqs.2024.04.003
Raza, M., Jahangir, Z., Riaz, M. B., Saeed, M. J., and Sattar, M. A. (2025). Industrial applications of large language models. Sci. Rep. 15 (1), 13755. doi:10.1038/s41598-025-98483-1
Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., et al. (2019). Deep learning and process understanding for data-driven Earth system science. Nature 566 (7743), 195–204. doi:10.1038/s41586-019-0912-1
Semeraro, F., Cascella, M., Montomoli, J., Bellini, V., and Bignami, E. G. (2025). Comparative analysis of AI tools for disseminating CPR guidelines: implications for cardiac arrest education. Resuscitation 208, 110528. doi:10.1016/j.resuscitation.2025.110528
Shao, X., Ma, S., Xu, C., Xie, C., Li, T., Huang, Y., et al. (2024). Landslides triggered by the 2022 Ms 6.8 Luding strike-slip earthquake: an update. Eng. Geol. 335, 107536. doi:10.1016/j.enggeo.2024.107536
Shrout, P. E., and Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86 (2), 420–428. doi:10.1037/0033-2909.86.2.420
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., et al. (2023). Beyond the imitation game: quantifying and extrapolating the capabilities of language models (no. arXiv:2206.04615). arXiv. doi:10.48550/arXiv.2206.04615
Tay, Y., Dehghani, M., Bahri, D., and Metzler, D. (2023). Efficient transformers: a survey. ACM Comput. Surv. 55 (6), 1–28. doi:10.1145/3530811
Wang, L., Chen, X., Deng, X., Wen, H., You, M., Liu, W., et al. (2024). Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. Npj Digit. Med. 7 (1), 41. doi:10.1038/s41746-024-01029-4
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv. doi:10.48550/arXiv.2201.11903
Wilson, M. P., Foulger, G. R., Wilkinson, M. W., Gluyas, J. G., Mhana, N., and Tezel, T. (2023). Artificial intelligence and human-induced seismicity: initial observations of ChatGPT. Seismol. Res. Lett. 94 (5), 2111–2118. doi:10.1785/0220230112
Wu, S., Xu, C., Ma, J., and Gao, H. (2025). Escalating risks and impacts of rainfall-induced geohazards. Nat. Hazards Res. 5. doi:10.1016/j.nhres.2025.03.003
Xie, C., Gao, H., Huang, Y., Xue, Z., Xu, C., and Dai, K. (2025). Leveraging the DeepSeek large model: a framework for AI-assisted disaster prevention, mitigation, and emergency response systems. Earthq. Res. Adv. 5, 100378. doi:10.1016/j.eqrea.2025.100378
Xu, C., and Lin, N. (2025). Building a global forum for natural hazard science. Npj Nat. Hazards 2 (1), s130–s132. doi:10.1038/s44304-025-00130-2
Xu, F., Ma, J., Li, N., and Cheng, J. C. P. (2025). Large language model applications in disaster management: an interdisciplinary review. Int. J. Disaster Risk Reduct. 127, 105642. doi:10.1016/j.ijdrr.2025.105642
Xue, Z., Xu, C., and Xu, X. (2023). Application of ChatGPT in natural disaster prevention and reduction. Nat. Hazards Res. 3 (3), 556–562. doi:10.1016/j.nhres.2023.07.005
Yang, Z., Yao, Z., Tasmin, M., Vashisht, P., Jang, W. S., Ouyang, F., et al. (2025). Unveiling GPT-4V’s hidden challenges behind high accuracy on USMLE questions: observational study. J. Med. Internet Res. 27, e65146. doi:10.2196/65146
Zhao, T., Wang, S., Ouyang, C., Chen, M., Liu, C., Zhang, J., et al. (2024). Artificial intelligence for geoscience: progress, challenges, and perspectives. Innovation 5 (5), 100691. doi:10.1016/j.xinn.2024.100691
Keywords: ChatGPT, disaster risk reduction, domain evaluator ratings, geological hazards, multi-dimensional capability profiling, question answering
Citation: Wu S, Xu C, Xue Z, Huang Y, Xu G, Cui Y, Ma J, Ma R and Xie C (2026) Beyond structured knowledge: performance boundaries of ChatGPT in geological-hazard question answering and the need for human-in-the-loop oversight. Front. Earth Sci. 13:1695920. doi: 10.3389/feart.2025.1695920
Received: 30 August 2025; Accepted: 29 December 2025;
Published: 13 January 2026.
Edited by:
Augusto Neri, National Institute of Geophysics and Volcanology (INGV), Italy
Reviewed by:
Annemarie Christophersen, GNS Science, New Zealand
Hans-Balder Havenith, University of Liège, Belgium
Mohammad Al Mashagbeh, The University of Jordan, Jordan
Copyright © 2026 Wu, Xu, Xue, Huang, Xu, Cui, Ma, Ma and Xie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Chong Xu, xc11111111@126.com