ORIGINAL RESEARCH article

Front. Earth Sci., 13 January 2026

Sec. Geohazards and Georisks

Volume 13 - 2025 | https://doi.org/10.3389/feart.2025.1695920

This article is part of the Research Topic: Prevention, Mitigation, and Relief of Compound and Chained Natural Hazards, Volume III.

Beyond structured knowledge: performance boundaries of ChatGPT in geological-hazard question answering and the need for human-in-the-loop oversight

Saier Wu1,2, Chong Xu1,2*, Zhiwen Xue1,2,3, Yuandong Huang1,2,3, Guoguo Xu1,2,3, Yulong Cui4, Junxue Ma1,2, Ruixia Ma1,2,5 and Chenchen Xie1,2,5
  • 1National Institute of Natural Hazards, Ministry of Emergency Management of China, Beijing, China
  • 2Key Laboratory of Compound and Chained Natural Hazards Dynamics, Ministry of Emergency Management of China, Beijing, China
  • 3School of Emergency Management Science and Engineering, University of Chinese Academy of Sciences, Beijing, China
  • 4School of Civil Engineering and Architecture, Anhui University of Science and Technology, Huainan, China
  • 5School of Earth Sciences and Resources, China University of Geosciences (Beijing), Beijing, China

Large language models have shown promise across specialized domains, but their performance limits in disaster risk reduction remain poorly understood. We conduct a version-specific evaluation of ChatGPT-4o for geological-hazard question answering using a transparent, rubric-based design. Sixty questions spanning six task categories (C1-C6) were posed within a fixed time window under a controlled single-turn protocol, and eight evaluators with geohazard expertise independently rated each response on six capability dimensions (D1 Knowledge Coverage; D2 Comprehension and Reasoning; D3 Accuracy and Rigor; D4 Critical Thinking; D5 Application and Context Adaptability; D6 Innovation and Knowledge Expansion). Scores were assigned on a continuous 0–1 scale, with 0, 0.5, and 1 used as anchor points to guide interpretation. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Performance was consistently higher on structured knowledge tasks, defined here as questions with well-established concepts, factual grounding, or clearly bounded reasoning paths (C1 = 0.827; C2 = 0.797; C3 = 0.818), than on open-ended tasks (C4-C6 mean = 0.591). Across dimensions, scores were highest for D1 (0.868), D2 (0.864), and D3 (0.830), and lowest for D4 (0.578) and D6 (0.550). Overall agreement was good (ICC (3, k) = 0.8095), while agreement decreased for more subjective tasks and dimensions. The study provides (i) a baseline, version-specific appraisal of GPT-4o in geohazard-related QA, (ii) a transferable rubric-based workflow for evaluating domain LLMs, and (iii) evidence that human oversight remains essential when such systems are used to support safety-critical disaster risk reduction decisions.

1 Introduction

In recent years, geological hazards triggered by extreme rainfall and earthquakes have shown rising frequency and expanding impacts (Gao et al., 2024; Huang et al., 2025; Shao et al., 2024). Compounding and cascading effects across multiple hazards have become increasingly prominent, rendering hazard processes highly abrupt, with complex evolutionary pathways and narrow decision windows (Gao et al., 2025; Wu et al., 2025). These characteristics raise the bar for monitoring and early warning, risk assessment, and emergency management (C. Xu and Lin, 2025).

Against this backdrop, generative large language models (LLMs) have made substantive advances in multimodal understanding and information integration, knowledge organization and generation, and complex reasoning. They exhibit cross-task transferability and adaptivity, revealing the potential for artificial general intelligence (Bommasani et al., 2021), and opening new technical pathways for natural-hazard research and governance (OpenAI et al., 2023; 2024). LLMs have been adopted in highly specialized domains that require timely knowledge services, including law (Lai et al., 2024), chemistry (Mirza et al., 2025), medicine (Cascella et al., 2023; Neha et al., 2024; Semeraro et al., 2025; Yang et al., 2025), and engineering (Hostetter et al., 2024; Kim et al., 2024). They are also extending into Earth system and human–environment modeling and data analysis within the geosciences (Reichstein et al., 2019; Xie et al., 2025; Zhao et al., 2024).

Within disaster risk reduction (DRR), natural language processing is increasingly used across the stages of observation, cognition, analysis, and decision-making. Pilot applications include disaster information extraction, sentiment analysis and scenario simulation, public outreach, and decision support (Xu et al., 2025; Xue et al., 2023; Zhao et al., 2024), with floods and earthquakes as the most common use cases. For example, generative models such as ChatGPT (GPT) have supported “abstractive reviews” that rapidly synthesize evidence for flood-rescue logistics and resource allocation (Kaklauskas et al., 2024). In seismic-engineering contexts, GPT has been used for technical drafting, terminology explanation, and communication to improve research and outreach efficiency (Ray, 2024; Wilson et al., 2023).

However, most existing evaluations focus on general or single tasks. Evidence vetted by researchers and practitioners in geohazard science is still lacking on the effectiveness and limits of using a generic question-answering (QA) interface across what we refer to here as the full knowledge chain of geological hazards. This chain spans foundational concepts, process interpretation, regional variability, interdisciplinary reasoning, scenario construction, and emerging research topics. Widely used academic benchmarks such as MMLU and BIG-bench do not transfer well to the domain-constrained and context-dependent nature of professional QA in geohazard assessment (Hendrycks et al., 2021; Srivastava et al., 2023). Recent discussions in the AI research community have also emphasized that strong benchmark performance does not necessarily translate into reliable behavior in applied decision-making contexts. This “high-score, low-utility” gap has been noted in public remarks by Ilya Sutskever, a leading figure in deep learning and co-founder of OpenAI, who has cautioned against over-reliance on benchmark-centered capability assessments. Taken together, these issues underscore the need for systematic, scholar-informed, practice-oriented capability auditing, and for clear, evidence-based guidance on how LLMs should be responsibly introduced into geohazard-related applications, where accuracy and practical applicability are essential.

Model iteration further underscores the need for establishing clear historical baselines. GPT-4o, released in May 2024, became one of the first widely accessible multimodal models supporting text–image interaction for tasks relevant to hazard interpretation and scientific communication (OpenAI et al., 2024). Subsequent releases, including the shift to GPT-5 as the default model in 2025, introduced architectures that can vary the depth of internal reasoning and, depending on user tier and platform, offer multiple model variants. These changes improve flexibility but also make it more difficult to determine which specific model version a user is interacting with. This increasing opacity highlights the need for version-specific and well-documented assessments to support transparent comparison across evolving model families.

Against this background, we develop an expert-driven, multi-dimensional rating and reliability framework to systematically evaluate GPT-4o's text-based QA performance for geological-hazard tasks that are pertinent to DRR and to AI-enabled, remote-sensing–assisted monitoring and early-warning workflows. We design 60 representative items across six problem categories and six capability dimensions, organize independent blind ratings, and quantify agreement and reliability using the intraclass correlation coefficient (ICC) (Koo and Li, 2016). Our principal contributions are as follows:

1. An expert-driven quantitative evaluation framework structured as “problem category × capability dimension,” with reliability reported via ICC, yielding a reproducible and transferable protocol for professional QA assessment;

2. A multi-dimensional capability profile and uncertainty characterization of GPT-4o in geohazard and DRR. The analysis identifies strengths in structured knowledge tasks (questions anchored in well-established concepts, classification standards, or other widely agreed reference answers) and in causal reasoning, and weaknesses in critical and innovative thinking as well as in complex, design-oriented, open-ended tasks. These patterns provide practical justification for human-in-the-loop oversight in safety-critical settings;

3. A version-specific, traceable performance baseline produced in the era of unified GPT-5 routing and deprecation of legacy defaults, supporting longitudinal tracking of subsequent versions and fair cross-model comparisons.

In sum, this study fills the gap in domain-specific, reproducible evaluation of general-purpose LLMs in a geological-hazard QA setting, providing an evidence base and methodological template for AI-enabled DRR and risk-aware governance. The full methodology, question set, structured rating data, and processing pipeline are documented in “Methodology and Data” and the Supplementary Material to facilitate verification and reuse by peers.

2 Methodology and data

For methodological transparency and continuity, the entire workflow is illustrated in Figure 1, encompassing every step from question design and prompt engineering through answer collection and data preprocessing to rating and statistical analysis.

Figure 1. Overall methodological workflow.

2.1 Question design and categorization

We developed a 60-question set to reflect the types of knowledge and reasoning that commonly arise in geological-hazard work. To examine different aspects of model performance, the questions were organized into six categories: (C1) Basic Knowledge; (C2) Formation-Mechanism Inference; (C3) Regional Differences; (C4) Interdisciplinary Analysis; (C5) Scenario Planning and Design; and (C6) Frontier Exploration.

When drafting the questions, our intention was to cover the main themes encountered in geohazard assessment. The set includes straightforward factual items as well as questions that require interpretation or multi-step reasoning, so that the difficulty resembles what practitioners deal with in real settings. Each question was written to stand alone, avoiding reliance on earlier items. We also paid attention to phrasing, aiming for wording that would be clear to both the model and the domain scholars who later evaluated the responses.

C1 and C2 focus on basic concepts and triggering mechanisms. C3 highlights regional patterns, C4 addresses cross-disciplinary integration, C5 concerns planning-oriented tasks in practical scenarios, and C6 explores forward-looking scientific directions. All questions were reviewed by researchers working in geohazards to ensure scientific accuracy and relevance. Each item was asked independently in a single-turn format to prevent any influence from prior interactions. The questions were also phrased in a way that resembles how non-expert users typically seek information about geological hazards. The six-category structure and representative items are listed in Table 1.

Table 1. Overview of six problem categories and illustrative examples.

2.2 Answer collection and preprocessing

All responses were generated using GPT-4o through the standard web interface. Each question was submitted in an independent chat session during a short, continuous period, ensuring that all outputs reflected the same underlying model state. A uniform presentation format was used for every item, and no supplementary background information or follow-up clarification was provided. The single-turn setup served as an experimental control that allowed us to observe the model’s immediate, unassisted response under fixed conditions (Hendrycks et al., 2021; Hosseini and Pourzangbar, 2026; Liang et al., 2022). It is important to note that in professional geohazard assessment, AI tools are typically used through iterative exchanges that allow practitioners to probe and verify the model’s responses. The single-turn configuration used here was therefore not intended to reflect standard practice, but to provide a stable, version-specific baseline for evaluating the model’s immediate, unassisted output under controlled conditions. Because LLMs may exhibit minor phrasing or temporal variability due to probabilistic decoding and periodic system updates, the analysis focuses on answers generated within this defined timeframe for the May 2024 GPT-4o release (all QA sessions were conducted in February 2025). Limiting the evaluation in this way avoids confounding from version drift and supports consistent interpretation of results.

After collection, the raw outputs were processed through a structured cleaning workflow. Boilerplate disclaimers, conversational fillers, and other non-substantive elements were removed to isolate the analytical content. Terminology and formatting were harmonized where appropriate to support clear interpretation by the evaluators, while the factual and conceptual substance of each answer was left unchanged. The rating sheets submitted by the evaluators were consolidated into a single tabular file with consistent variable definitions, facilitating score aggregation and the computation of inter-rater agreement.

The complete question set, cleaned model outputs, and de-identified rating tables are provided in the Supplementary Material to support transparent archiving and future re-analysis.

2.3 Prompt engineering

We adopted a standardized single-turn prompting procedure for all queries. Each question was submitted in a fresh session with no prior context so that the model’s output depended solely on the prompt provided. This configuration reduces variability arising from conversational history and mirrors typical information-seeking behavior in geohazard and emergency-management settings, where users often ask isolated, one-off questions. The approach also follows established evaluation practices in large-language-model benchmarking, where isolated, one-pass queries are commonly used to support fair and reproducible comparisons (Hendrycks et al., 2021; Hosseini and Pourzangbar, 2026; Srivastava et al., 2023). Consistent with this minimal-intervention philosophy, no parameter tuning, multi-turn refinement, or repeated sampling was applied.

To maintain domain relevance, a concise role-based instruction preceded every query. The instruction asked the model to answer as an expert in geological hazards and Earth sciences. The standardized cue was: “You are a domain expert in geological hazards and Earth science. Please answer the following question with accurate, detailed information and rigorous reasoning.” This prompt steered the model toward technically grounded responses without over-constraining the format, and it improved terminological clarity and coherence in domain-specific reasoning, a pattern also noted in studies using similar expert-role prompts (Wang et al., 2024).

The same prompt template was applied to all sixty questions. No examples, few-shot demonstrations, or chain-of-thought scaffolds were included, and no multi-turn exchanges were used. This ensured that each answer reflected the model’s zero-shot capability under identical conditions, supporting fair comparison across question categories and preserving the ecological validity of simulating unassisted user queries in geohazard contexts. Avoiding selective prompt augmentation also reduced the risk of uneven prompting effects, which can confound comparative evaluations (Kojima et al., 2023).
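To make this protocol easier to reproduce programmatically, the sketch below shows how the same role prompt and single-turn, context-free querying could be scripted. It is a minimal illustration, not the procedure used in the study: answers were collected through the ChatGPT web interface, so the use of the OpenAI Python SDK, the "gpt-4o" model identifier, default sampling settings, and the questions.txt input file are all assumptions.

```python
# Minimal sketch (not the study's actual collection method): scripting the same
# single-turn, role-prompted protocol via the OpenAI Python SDK.
from openai import OpenAI

ROLE_PROMPT = (
    "You are a domain expert in geological hazards and Earth science. "
    "Please answer the following question with accurate, detailed information "
    "and rigorous reasoning."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_single_turn(question: str, model: str = "gpt-4o") -> str:
    """Submit one question as a fresh, context-free exchange (zero-shot, single turn)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ROLE_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # "questions.txt" (one question per line) is an illustrative assumption.
    with open("questions.txt", encoding="utf-8") as f:
        questions = [line.strip() for line in f if line.strip()]
    for i, q in enumerate(questions, start=1):
        answer = ask_single_turn(q)  # each call is an independent exchange with no shared history
        print(f"Q{i}\t{answer[:80]}...")
```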

To examine the stability of this configuration, we conducted a small qualitative stability check (see Supplementary Material 3). Six representative questions (one from each category, C1–C6) were re-queried three times under the same settings. The core conceptual content remained consistent across runs, with variation largely confined to minor differences in phrasing. All main analyses in this study are therefore based on responses generated under this uniform, single-turn protocol.

2.4 Evaluators and rating dimensions

We invited eight evaluators (E1 through E8), comprising senior researchers, professors, postdoctoral researchers, and PhD candidates with formal training and at least 5 years of experience in geohazard investigation, risk assessment, and governance. All eight evaluators had been involved in developing the question set and refining the scoring rubric. Each evaluator independently assessed the model’s answers, working in isolation to avoid mutual influence.

A unified rubric was used to evaluate each answer across six capability dimensions (Table 2). Scores were assigned on a continuous 0–1 scale. The values 0, 0.5, and 1 served only as conceptual anchors for the endpoints and midpoint, not as the only permissible options. Before scoring, evaluators were explicitly instructed that they could select any value within the 0-1 range—typically in 0.1 increments—to express their level of satisfaction with the model’s performance. All scores were retained as provided; no adjustment or post hoc normalization was applied. Although the same six dimensions were applied across all questions, evaluators were instructed to interpret each dimension in a manner appropriate to the question type, following the definitions in Table 2 and the calibration examples in Supplementary Material 2.

Table 2. Definition and evaluation criteria for the scoring dimensions.

To support consistent interpretation of the rubric, we provided each evaluator with a set of calibrating exemplar answers and scoring notes prior to formal scoring. These materials (now included in Supplementary Material 2) illustrate how the six dimensions should be applied to different question types.

Scores were assigned independently, without discussion or consensus-building among evaluators. For each question category and each evaluation dimension, the final score was computed as the arithmetic mean of the eight individual ratings. This approach preserves individual judgment, avoids group influence, and allows agreement among evaluators to be assessed quantitatively through inter-rater reliability analysis.

2.5 Statistical analysis framework

We developed a structured statistical pipeline to systematically evaluate GPT-4o's performance across geological-hazard QA tasks and to quantify inter-rater agreement. The pipeline comprised three components: data curation and preprocessing, descriptive statistics, and rater-consistency assessment.

2.5.1 Data curation and preprocessing

The eight individual rating sheets were consolidated into a rectangular matrix with fields Evaluator × QuestionID × Category × Dimension × Score. We screened for obvious entry errors and missing ratings and applied corrections to ensure completeness and reliability. Based on the cleaned dataset, grouped summaries were produced by problem category, capability dimension, and evaluator index.

2.5.2 Descriptive statistics

To characterize performance differences across task types and capability dimensions, we computed means and standard deviations for each category and each dimension, thereby assessing central tendency and dispersion. Distributions were visualized using box plots, violin plots, and radar charts to convey overall spread, density, and contrasts, highlighting differences in knowledge coverage, reasoning ability, and contextual adaptability across categories and dimensions.
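As a minimal sketch of this step, the snippet below computes the grouped means and standard deviations from a consolidated long-format table. The column names follow the fields listed in Section 2.5.1, while the file name and storage format are illustrative assumptions rather than details taken from the study.

```python
# Minimal sketch, assuming the eight rating sheets have already been consolidated
# into one long-format table; "ratings_long.csv" is an illustrative assumption.
import pandas as pd

ratings = pd.read_csv("ratings_long.csv")  # columns: Evaluator, QuestionID, Category, Dimension, Score

# Screening step (Section 2.5.1): flag missing or out-of-range entries.
print("missing scores:", ratings["Score"].isna().sum())
print("out-of-range scores:", (~ratings["Score"].dropna().between(0, 1)).sum())

# Means and standard deviations by problem category and by evaluation dimension.
by_category = ratings.groupby("Category")["Score"].agg(["mean", "std"]).round(3)
by_dimension = ratings.groupby("Dimension")["Score"].agg(["mean", "std"]).round(3)

# Composite index per category (average over dimensions and evaluators), cf. Section 3.2.
composite = ratings.groupby("Category")["Score"].mean().round(3)
print(by_category, by_dimension, composite, sep="\n\n")
```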

2.5.3 Inter-rater reliability

Inter-rater reliability was evaluated using the ICC, specifically the two-way mixed-effects model for average measures, denoted as ICC (3, k) (McGraw and Wong, 1996; Shrout and Fleiss, 1979). This model treats the panel of evaluators (k = 8) as fixed effects and the items (n = 60) as random effects. The average-measures coefficient was selected because the study’s composite scores are derived from the mean ratings of the eight evaluators, making the reliability of the averaged scores the relevant metric rather than individual ratings. The consistency formulation of ICC (3, k) is calculated as:

ICC(3, k) = (MSR - MSE) / MSR

where MSR and MSE represent the mean square for rows (items) and the residual mean square, respectively. We report both the overall ICC and stratified estimates by task category (C1–C6) and evaluation dimension (D1–D6) to detect variations in agreement across different assessment contexts. ICC values approaching 1 indicate high agreement among evaluators. Values near zero or below indicate that agreement is limited and does not exceed what would be expected by chance, which may reflect the subjective or interpretive nature of certain question types or evaluation dimensions rather than deficiencies in the scoring procedure itself.
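For readers who wish to reproduce the reliability analysis, the following sketch implements the consistency form of ICC (3, k) directly from the two-way layout of item-by-rater scores. The function name and the pivot from the long-format ratings table are assumptions for illustration; equivalent estimates can also be obtained from standard statistical packages (for example, the intraclass_corr function in the Python pingouin library).

```python
# Minimal sketch (assumed helper, not code from the study): consistency ICC(3, k)
# for an n_items x k_raters score matrix, following Shrout and Fleiss (1979).
import numpy as np

def icc_3k(scores) -> float:
    """Two-way mixed-effects, consistency, average-measures ICC(3, k)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    row_means = x.mean(axis=1)   # per-item (question) means
    col_means = x.mean(axis=0)   # per-rater means

    ss_rows = k * ((row_means - grand_mean) ** 2).sum()   # between-item sum of squares
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()   # between-rater sum of squares
    ss_total = ((x - grand_mean) ** 2).sum()
    ss_resid = ss_total - ss_rows - ss_cols               # residual sum of squares

    ms_rows = ss_rows / (n - 1)                           # MSR
    ms_resid = ss_resid / ((n - 1) * (k - 1))             # MSE
    return (ms_rows - ms_resid) / ms_rows

# Usage (column names as assumed in the earlier sketch): pivot the long table to a
# 60 x 8 matrix of per-question composite scores before calling icc_3k, e.g.
#   matrix = (ratings.groupby(["QuestionID", "Evaluator"])["Score"].mean()
#                    .unstack("Evaluator").to_numpy())
#   print(round(icc_3k(matrix), 4))
```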

3 Results

3.1 Overall performance across all questions

The dataset comprises ratings from eight evaluators for 60 questions across six evaluation dimensions. Figure 2 summarizes the distribution of evaluator scores at the question level using interquartile-range (IQR) bands for each dimension. The shaded regions depict the middle 50% of scores across evaluators, highlighting the degree of agreement or dispersion across different dimensions and questions. The black line represents the overall mean score for each question, averaged across all evaluators and dimensions, and provides a reference for the model’s aggregate performance. Together, these elements convey both the central tendency of scores and the variability in evaluator judgments across the full question set.

Figure 2. Distribution of evaluator scores across questions using interquartile ranges. For each question (1-60), shaded bands show the interquartile range (25th-75th percentile) of evaluator scores for each evaluation dimension (D1-D6). Dimensions with greater score dispersion are rendered in lighter tones, whereas more concentrated ratings appear in more saturated colors. The black solid line indicates the overall mean score per question, averaged across all evaluators and dimensions.

Several patterns emerge from the question-level distributions in Figure 2. First, questions belonging to the scenario planning and design category (C5) tend to exhibit lower overall mean scores than other categories, indicating that evaluators were, on average, less satisfied with the model’s responses to application-oriented and planning-related tasks. This suggests that while the model performs relatively well on conceptual and interpretative questions, translating knowledge into actionable or scenario-specific guidance remains more challenging. Second, the interquartile-range bands for Dimension D6 (Innovation and Knowledge Expansion) reveal contrasting behaviors across question types. For questions with relatively fixed or well-established answers, the D6 bands are generally narrow and centered at moderate to high scores, indicating that the model tends to exercise appropriate restraint by avoiding unnecessary speculation or fabricated novelty. In contrast, for more open-ended questions, the D6 bands are wider, reflecting greater divergence in evaluator judgments. This dispersion suggests that assessments of “satisfactory innovation” are harder to align for exploratory or forward-looking tasks, where expectations regarding originality, framing, and added insight vary more substantially among evaluators.

As a complementary view of the same rating distributions, we grouped the scores by question category, evaluation dimension, and evaluator and summarized them with box-and-violin plots (Figure 3). Across question categories C1-C6 (left panel of Figure 3), scores for C1-C3 are tightly clustered, with medians close to 0.9 and standard deviations around 0.25-0.28, indicating relatively high agreement for basic knowledge and mechanism-focused questions. By contrast, the scenario-planning category C5 shows the widest spread, with a standard deviation of 0.312 and a visibly broader IQR band, suggesting more divergent views on how well the model performs on application-oriented tasks. Across evaluation dimensions (middle panel of Figure 3), the lower quartiles for D1 and D2 lie near 0.8 and their standard deviations are 0.160 and 0.169, so most ratings fall in the upper score range and internal consistency is comparatively strong. D4 and D6, in comparison, display much greater dispersion, with standard deviations of 0.393 and 0.318, reflecting wider differences in how evaluators judged critical thinking and innovation.

Figure 3. Box-and-violin plots of score distributions by question category, evaluation dimension, and evaluator. For each group, the light violin shows the full distribution of scores, the dark vertical bar indicates the interquartile range (25th-75th percentiles), and the white dot marks the median.

Evaluator-level patterns (right panel of Figure 3) further illustrate these contrasts. Most evaluators’ scores are concentrated in the high range (≥0.8), indicating generally positive assessments of the model’s answers. E2 shows the lowest standard deviation (0.138), consistent with a relatively stable internal scoring style. In contrast, E6 has the highest standard deviation (0.336) and a visibly broader spread into intermediate scores, pointing to greater within-rater variability across questions. Overall, these distributions indicate that perceived answer quality varies systematically with question type, evaluation dimension, and evaluator, rather than being uniform across the full question set.

3.2 Performance by problem category

To examine how GPT performs across diverse geohazard task types, we computed a composite performance index for each category by averaging the scores over the six dimensions D1-D6. Results are shown in Figure 4. Overall, GPT performs best on C1, with a mean score of 0.827, indicating a marked advantage on well-structured, standardized questions. C3 and C2 follow with composite scores of 0.818 and 0.797, respectively, suggesting good adaptability to tasks requiring region-specific judgments and causal reasoning.

Figure 4. Composite performance index of GPT across six problem categories.

Figure 5 further breaks down these category-level results by dimension. For C1, the highest score occurs on Accuracy and Rigor (D3 = 0.895). C2 attains the highest single-dimension score in the entire dataset on Knowledge Coverage (D1 = 0.906), while C3 performs best on Comprehension and Reasoning (D2 = 0.891). These results indicate that the model can deliver high-accuracy outputs on knowledge-oriented tasks and sustains strong comprehension and reasoning in causal and regional analyses.

Figure 5. Dimension-level performance across problem categories. (a) C1 peaks on D3, C2 on D1, and C3 on D2. (b) C4 peaks on D1, C5 on D1, and C6 on D2.

By contrast, performance is weaker on the more complex and creative categories C4-C6 (Figure 5). The composite score is 0.749 for C4, 0.673 for C5 (the lowest among all categories), and 0.746 for C6. A similar pattern appears on the Innovation and Knowledge Expansion dimension (D6): C5 and C6 score only 0.505 and 0.554, respectively, substantially lower than the scores for the foundational categories. Taken together, these results suggest that, under the current evaluation setup, GPT is less reliable when tasks require high-level synthesis, operational strategy generation, or frontier extrapolation.

3.3 Performance by evaluation dimension

To comprehensively assess GPT’s behavior under heterogeneous capability requirements, we averaged scores across all problem categories for each of the six evaluation dimensions; results are shown in Figure 6. Overall, GPT performs strongly on D1–D3, with mean scores of 0.868 (D1), 0.864 (D2), and 0.830 (D3). This indicates relatively stable performance in knowledge coverage and recall (D1), comprehension and reasoning (D2), and accuracy and rigor of expression (D3). Notably, D1 reaches its peak within category C2 at 0.906, suggesting that the model is well supported by geohazard-related corpora when broad domain coverage is required.

Figure 6. Average scores across evaluation dimensions D1–D6. Strong performance is observed on D1–D3, whereas D4 and D6 remain comparatively low, highlighting challenges in higher-order content generation.

On D5, the average score is 0.787, indicating that the model can produce operationally useful content for applied tasks (e.g., disaster monitoring and early-warning design or emergency response planning). However, the outputs often remain generic and show limited adaptability to specific scenarios.

By contrast, performance drops markedly on the higher-order cognitive dimensions D4 and D6, with mean scores of 0.578 and 0.550, respectively, forming a pronounced “performance gap” (Figure 7). This trend is most evident for categories C4-C6, reinforcing that when confronted with open-ended problems or tasks lacking established reference answers, the model’s generations often fall short in depth and originality.

Figure 7. Inter-rater reliability of expert assessments across question categories and evaluation dimensions. (a) Distribution of ICC (3,8) values across all questions and evaluators. (b) ICC (3,8) grouped by question category. (c) ICC (3,8) grouped by evaluation dimension. (d) Heatmap of ICC (3,8) values over category–dimension combinations, highlighting areas of strong consensus and areas of greater subjectivity and variability.

3.4 Evaluator rating consistency

To evaluate the consistency of evaluator scores, we applied a two-way mixed-effects, average-measures intraclass correlation, ICC (3, k), to the ratings assigned by eight evaluators across 60 questions. Details on the definition, interpretation, and calculation of ICC (3, k) are provided in Section 2.5.3 of the Methods. The overall ICC (3,8) is 0.8095, indicating small between-rater differences on the composite, multi-dimensional scores and thus good agreement (Table 3; Figure 7a). Following Cicchetti’s guideline (Cicchetti, 1994), this value falls within the 0.75-0.90 range, i.e., “good” reliability. These results suggest that the rating framework exhibits strong scoring reliability; overall evaluator agreement on task evaluations is high, and the resulting scores are suitable for a structured assessment of model performance. In addition, the between-question variance component is substantially larger than the residual error (MSR = 0.3369 vs. MSE = 0.0642), indicating that observed score differences are driven primarily by the question content and task type rather than by rater-specific biases.
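Substituting the reported mean squares into the formula in Section 2.5.3 reproduces this estimate: ICC (3,8) = (0.3369 - 0.0642) / 0.3369 ≈ 0.809, matching the reported 0.8095 to within rounding of the mean squares.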

Table 3. Summary statistics for the overall ICC(3,8) computation.

We further examined heterogeneity in inter-rater consistency across task categories and evaluation dimensions by conducting grouped (Figures 7b,c) and cross-classified (Figure 7d) ICC analyses. At the category level, C2 and C1 show the highest agreement (ICC = 0.919 and 0.783, respectively), consistent with clearer task structure and stronger answer consensus. In contrast, C4 and C5 display low reliability (ICC <0.1), reflecting pronounced divergence among evaluators for these problem types. At the dimension level, D1 and D2 exhibit higher agreement (ICC >0.75), whereas D4 and D6 are notably lower, revealing greater disagreement when judging higher-order cognitive dimensions. In the cross-combination analysis (Figure 7d), pairs such as C2–D1, C2–D3, C2–D4 and C1–D2 achieve ICC values exceeding 0.80, indicating highly consistent evaluations on core dimensions of foundational and reasoning-oriented tasks. By contrast, combinations such as C4–D4, C4–D5, and C4–D6 yield ICC values near zero or even negative, suggesting weak stability and substantial subjectivity in complex, open-ended, or cross-domain scenarios.

4 Discussion

4.1 ChatGPT’s capabilities and limitations across geohazard tasks

Our findings show that the response quality of GPT-4o to geohazard questions is strongly associated with problem type. On structured knowledge tasks—such as explaining hazard-classification standards or inferring formative mechanisms—the model achieves high accuracy and stability, accompanied by high inter-rater reliability. This aligns with prior evaluations of LLMs: when task boundaries are well defined and the training corpus provides sufficient coverage, LLMs exhibit strong information retrieval and expression capabilities (Gilson et al., 2022). Coupled with relatively low interaction latency (Tay et al., 2023), these strengths confer “advisor-style” utility in information retrieval and instructional support settings (Deng et al., 2025). Mechanistically, these capabilities are consistent with the Transformer architecture and large-scale pretraining, which enable the model to capture co-occurrence patterns and relationships among terms in the training data and to retrieve them when presented with well-posed questions (Bommasani et al., 2021; Tay et al., 2023). In our single-turn, text-only setting, the model does not perform explicit multimodal reasoning or iterative prompt refinement; instead, it appears to reconstruct plausible causal chains by combining known concepts from its training corpus. For example, in questions about rainfall-induced shallow landslides, GPT-4o correctly links rainfall intensity-duration (I-D) thresholds to the probability of triggering shallow failures and explains this in terms of infiltration, pore-pressure rise, and loss of shear strength in near-surface materials.

By contrast, on open-ended, extrapolative, and creative tasks the model does not consistently deliver high-quality strategies (Hager et al., 2024; Kim et al., 2024; Plevris et al., 2023). Both the mean scores and inter-rater agreement decline markedly, especially on core, higher-logic dimensions such as critical thinking and contextual adaptability. A likely contributor is corpus bias: as Biswas notes, output quality is constrained by the quantity and quality of training data (Biswas, 2023). Hallucinated or misleading content also remains a recurrent issue across GPT variants (OpenAI et al., 2023). In our own question set, for example, GPT-4o gave a confident but incorrect answer to a question on which slope type is more prone to instability under seismic loading, incorrectly treating anti-dip slopes as generally more susceptible than dip slopes, which conflicts with standard engineering understanding (see Supplementary Material 4, Q11). Recent work further shows that GPT can pass a three-player Turing test (Jones and Bergen, 2025), underscoring that machine-generated but incorrect answers may be highly persuasive to non-expert audiences (Zhou et al., 2024). At a fundamental level, large language models remain pattern recognizers trained on vast text corpora (Ray, 2024; Raza et al., 2025): they produce human-like text that is often accurate, but not necessarily grounded in deep understanding. For high-uncertainty problems and other high-stakes contexts (such as designing mitigation works or supporting risk-critical decisions), outputs from GPT should therefore be treated as supplementary input and remain subject to expert judgment and synthesis.

4.2 Validity and reliability of the domain-scholar–based evaluation framework

We evaluated a structured, multidimensional domain-scholar rating rubric using a two-way mixed-effects ICC (3, k). Overall agreement was good (Cicchetti, 1994; Shrout and Fleiss, 1979), indicating clear criteria, limited between-rater bias, and reliable aggregate scores. Between-question variance exceeded residual error, suggesting that score differences primarily reflect task content rather than rater noise (Landis and Koch, 1977).

Agreement varied with task type and dimension definition. Well-bounded combinations (e.g., C2-D1 and C1-D2) showed high consensus, consistent with greater stability when dimensions are clearly specified and answers are determinate (Kung et al., 2023). In contrast, subjective or open-ended combinations (e.g., C5-D4 and C6-D6) exhibited markedly lower agreement, aligning with prior findings on task subjectivity and rater-background effects (Gilson et al., 2022; Mirza et al., 2025; Yang et al., 2025).

To further enhance reliability, future studies should provide explicit scoring anchors and exemplars and conduct cross-domain calibration sessions to surface and reconcile interpretive differences. Such procedures have improved reproducibility in geohazards, geotechnical engineering, and education research (Fell et al., 2008; Fuchs et al., 2011; Kane, 2013).

4.3 Study limitations

This study has several limitations that should be acknowledged. First, the evaluation focuses on the May 2024 release of ChatGPT-labeled “GPT-4o”, and all question–answer interactions were collected within a clearly defined query window (February 2025). The study does not include a systematic comparison with later variants (such as GPT-4.1 or GPT-4.5) or with same-generation derivatives including o3, o4 mini, or o4 mini high (OpenAI, 2024; OpenAI, 2025; OpenAI et al., 2024). Models within the GPT series can differ in documented knowledge recency, modality support, usage quotas, invocation cost, and accessibility. These differences complicate direct comparison and raise concerns regarding fairness and reproducibility. Similar comparability issues also arise in cross-provider settings (e.g., Gemini, Grok, and other frontier LLMs), where product- and system-layer behaviors (not always transparent), safety filtering and compliance constraints, and default generation settings can further confound fair and reproducible comparisons. For example, GPT-4.5 has been described as a larger model with improved generation quality and social responsiveness, yet its training data extend only to October 2023, similar to GPT-4o, whereas other later GPT-4–series variants (e.g., GPT-4.1) may incorporate more recent training data (Figure 8). Despite sharing a comparable knowledge cutoff, GPT-4.5 and GPT-4o differ in their underlying multimodal design paradigms: GPT-4.5 represents a GPT-4–style multimodal extension built upon a primarily text-centric framework, whereas GPT-4o operationalizes multimodality as a first-class component of the core model architecture. As a result, a direct performance comparison between these two models would conflate differences arising from modality integration strategies with those attributable to task-specific reasoning ability, thereby complicating interpretation. Such discrepancies in knowledge recency can materially influence factual accuracy and alignment with contemporary hazard information (Bubeck et al., 2023; Wei et al., 2022). To maintain internal validity and avoid confounding effects associated with version heterogeneity, the present study restricts its evaluation to GPT-4o.

Figure 8. Release timeline of major GPT models and their training data cutoff dates.

Second, the assessment does not examine GPT-4o's multimodal or spatial-reasoning capabilities. Many geohazard-related tasks, such as landslide interpretation from imagery, slope failure recognition, and flood-extent assessment, require integration of visual, spatial, and contextual cues that cannot be fully captured through text-only inputs. Although GPT-4o provides image-processing functions, these were not included in the current design. Future work should therefore evaluate multimodal workflows, particularly those that combine remote sensing data and field observations, to achieve a more complete characterization of the model’s usefulness in geohazard prevention and response.

Third, the study does not fully address the temporal variability of LLM outputs. Large language models can produce slightly different answers when identical questions are posed at different times, reflecting the probabilistic nature of their inference processes as well as potential updates to the underlying system. Under the controlled single-turn prompting used here, the conceptual content of responses was generally stable, although minor differences in phrasing were observed. To increase transparency, we conducted a small qualitative stability check by generating three responses for six representative questions (see Supplementary Material 3). This exercise illustrated typical variation patterns while avoiding the substantial scoring burden of a full 10-run × 60-item protocol. A broader assessment of how model outputs change over longer periods and under different inference conditions would also be useful. Such work could clarify how stable LLM behavior remains as the system evolves.

A further conceptual limitation concerns the epistemological status of LLM-generated responses. Large language models produce outputs through statistical association rather than deductive, causal, or empirically grounded reasoning. Their answers therefore differ fundamentally from human expert knowledge, which is supported by disciplinary theory, field experience, and empirical validation. In this study, domain-scholar evaluations serve as a reference for assessing factual adequacy and conceptual coherence, and are not intended to imply epistemic equivalence between model outputs and human expertise. Future research should explore how these differences in reasoning modes affect reliability in high-stakes geohazard contexts, and should also examine more directly whether large language models meet the expectations of domain scholars and practitioners, particularly in terms of accuracy, relevance, and professional adequacy in geohazard applications.

5 Conclusion and way forward

This study builds a multi-dimensional, evaluator-rated framework to systematically evaluate GPT-4o on geological-hazard QA. Using 60 items spanning six problem categories and six capability dimensions, we quantify strengths and weaknesses across tasks and validate reliability with the intraclass correlation coefficient. GPT-4o attains its highest category scores in C1 (Basic Knowledge, 0.827), C3 (Regional Differences, 0.818), and C2 (Formation-Mechanism Inference, 0.797). By dimension, D1 (Knowledge Coverage, 0.868), D2 (Comprehension and Reasoning, 0.864), and D3 (Accuracy and Rigor, 0.830) lead, whereas D4 (Critical Thinking, 0.578) and D6 (Innovation, 0.550) are markedly lower, especially on complex, open-ended tasks (C4-C6). Overall rater agreement is good, ICC (3, k) = 0.8095, supporting the robustness and reproducibility of the conclusions.

Methodologically, we propose a reproducible evaluation framework for domain-specific AI that integrates multidimensional scoring, a multi-rater evaluation procedure, and statistical reliability analysis; the approach is transferable to other specialized domains. In application, the findings support cautious adoption of GPT as a supportive tool for DRR workflows, such as geohazard monitoring and early warning, while emphasizing operation under professional oversight. Conceptually, the results delineate a gap between fluency and understanding, showing that polished language does not guarantee the causal reasoning and abstraction required for complex geohazard tasks.

It is noteworthy that newer generations of ChatGPT, including GPT-5, are implemented as a unified system that can route requests between faster response behavior and deeper reasoning modes. While this evolution may improve overall adaptivity, it complicates version-specific evaluation and traceability. Our study documents a time-window-specific capability profile for GPT-4o and provides an evaluation baseline for longitudinal tracking and for future cross-model comparisons conducted under explicitly matched constraints. Future research should (i) establish unified, version-aware evaluation suites to support horizontal comparisons across models and vertical tracking across releases, and (ii) extend to multimodal tasks, using fusion of text and image modalities to evaluate the model’s ability to reason over remote sensing artifacts such as interferograms, UAV orthomosaics, LiDAR point clouds, and GNSS time series in support of geohazard risk reduction and early warning.

In summary, GPT-4o cannot replace expert judgment; however, it can efficiently support information synthesis, preliminary analyses, and cross-disciplinary communication. For high-uncertainty or safety-critical contexts typical of geohazard early warning and DRR, human-in-the-loop oversight remains essential to mitigate deceptively plausible yet erroneous outputs. The present version-specific evaluation baseline offers methodological and practical value and sets a reference point for evaluating and optimizing unified-architecture, multimodal reasoning models in the GPT-5 era.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author contributions

SW: Data curation, Formal Analysis, Investigation, Methodology, Resources, Visualization, Writing – original draft, Writing – review and editing. CoX: Conceptualization, Data curation, Funding acquisition, Supervision, Writing – review and editing. ZX: Data curation, Writing – review and editing. YH: Data curation, Writing – review and editing. GX: Data curation, Writing – review and editing. YC: Data curation, Writing – review and editing. JM: Data curation, Writing – review and editing. RM: Data curation, Writing – review and editing. CeX: Data curation, Writing – review and editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the National Institute of Natural Hazards, the Ministry of Emergency Management of China (grant no. ZDJ 2025-54), and the Chongqing Water Resources Bureau, China (grant no. CQS24C00836).

Acknowledgements

We thank the handling editor and the reviewers for their constructive comments, which substantially improved the manuscript.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. Generative AI tools (ChatGPT, OpenAI, San Francisco, CA, United States) were used in two ways: (1) to generate responses to a structured set of geohazard-related questions, which served as the primary research data for expert evaluation; and (2) to assist in language polishing, grammar checking, and translation during manuscript preparation. No AI tools were involved in study design, statistical analysis, or interpretation of results. All authors take full responsibility for the scientific integrity and accuracy of the manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feart.2025.1695920/full#supplementary-material

References

Biswas, S. S. (2023). Potential use of chat GPT in global warming. Ann. Biomed. Eng. 51 (6), 1126–1127. doi:10.1007/s10439-023-03171-8

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., et al. (2021). On the opportunities and risks of foundation models (version 3). arXiv.10.48550/ARXIV.2108.07258.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., et al. (2023). Sparks of artificial general intelligence: early experiments with GPT-4. arXiv.10.48550/ARXIV.2303.12712.

Cascella, M., Montomoli, J., Bellini, V., and Bignami, E. (2023). Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J. Med. Syst. 47 (1), 33. doi:10.1007/s10916-023-01925-4

Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol. Assess. 6 (4), 284–290. doi:10.1037/1040-3590.6.4.284

Deng, R., Jiang, M., Yu, X., Lu, Y., and Liu, S. (2025). Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies. Comput. and Educ. 227, 105224. doi:10.1016/j.compedu.2024.105224

Fell, R., Corominas, J., Bonnard, C., Cascini, L., Leroi, E., and Savage, W. Z. (2008). Guidelines for landslide susceptibility, hazard and risk zoning for land use planning. Eng. Geol. 102 (3–4), 85–98. doi:10.1016/j.enggeo.2008.03.022

Fuchs, S., Kuhlicke, C., and Meyer, V. (2011). Editorial for the special issue: vulnerability to natural hazards—the challenge of integration. Nat. Hazards 58 (2), 609–619. doi:10.1007/s11069-011-9825-5

Gao, H., Xu, C., Xie, C., Ma, J., and Xiao, Z. (2024). Landslides triggered by the July 2023 extreme rainstorm in the haihe river basin, China. Landslides 21 (11), 2885–2890. doi:10.1007/s10346-024-02322-9

Gao, H., Xu, C., Wu, S., Li, T., and Huang, Y. (2025). Has the unpredictability of geological disasters been increased by global warming? Npj Nat. Hazards 2 (1), 55. doi:10.1038/s44304-025-00108-0

Gilson, A., Safranek, C., Huang, T., Socrates, V., Chi, L., Taylor, R. A., et al. (2022). How does ChatGPT perform on the medical licensing exams? The implications of large language models for medical education and knowledge assessment. Med. Educ. doi:10.1101/2022.12.23.22283901

Hager, P., Jungmann, F., Holland, R., Bhagat, K., Hubrecht, I., Knauer, M., et al. (2024). Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30 (9), 2613–2622. doi:10.1038/s41591-024-03097-1

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., et al. (2021). Measuring massive multitask language understanding (no. arXiv:2009.03300). arXiv.10.48550/arXiv.2009.03300.

Hosseini, S. H., and Pourzangbar, A. (2026). How well do DeepSeek, ChatGPT, and gemini respond to water science questions? Environ. Model. and Softw. 196, 106772. doi:10.1016/j.envsoft.2025.106772

Hostetter, H., Naser, M. Z., Huang, X., and Gales, J. (2024). The role of large language models (AI chatbots) in fire engineering: an examination of technical questions against domain knowledge. Nat. Hazards Res. 4 (4), 669–688. doi:10.1016/j.nhres.2024.06.003

Huang, Y., Xu, C., He, X., Cheng, J., Xu, X., and Tian, Y. (2025). Landslides induced by the 2023 jishishan Ms6.2 earthquake (NW China): spatial distribution characteristics and implication for the seismogenic fault. Npj Nat. Hazards 2 (1), 14. doi:10.1038/s44304-025-00064-9

Jones, C. R., and Bergen, B. K. (2025). Large language models pass the turing test (version 1). arXiv.10.48550/ARXIV.2503.23674.

Kaklauskas, A., Rajib, S., Piaseckiene, G., Kaklauskiene, L., Sepliakovas, J., Lepkova, N., et al. (2024). Multiple criteria and statistical sentiment analysis on flooding. Sci. Rep. 14 (1), 30291. doi:10.1038/s41598-024-81562-0

Kane, M. T. (2013). Validating the interpretations and uses of test scores. J. Educ. Meas. 50 (1), 1–73. doi:10.1111/jedm.12000

Kim, D., Kim, T., Kim, Y., Byun, Y.-H., and Yun, T. S. (2024). A ChatGPT-MATLAB framework for numerical modeling in geotechnical engineering applications. Comput. Geotechnics 169, 106237. doi:10.1016/j.compgeo.2024.106237

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2023). Large language models are zero-shot reasoners (no. arXiv:2205.11916). arXiv.10.48550/arXiv.2205.11916.

Koo, T. K., and Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15 (2), 155–163. doi:10.1016/j.jcm.2016.02.012

Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., et al. (2023). Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit. Health 2 (2), e0000198. doi:10.1371/journal.pdig.0000198

Lai, J., Gan, W., Wu, J., Qi, Z., and Yu, P. S. (2024). Large language models in law: a survey. AI Open 5, 181–196. doi:10.1016/j.aiopen.2024.09.002

Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33 (1), 159–174. doi:10.2307/2529310

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., et al. (2022). Holistic evaluation of language models (no. arXiv:2211.09110). arXiv. doi:10.48550/arXiv.2211.09110.

McGraw, K. O., and Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1 (1), 30–46. doi:10.1037/1082-989X.1.1.30

Mirza, A., Alampara, N., Kunchapu, S., Ríos-García, M., Emoekabu, B., Krishnan, A., et al. (2025). A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nat. Chem. 17 (7), 1027–1034. doi:10.1038/s41557-025-01815-x

Neha, F., Bhati, D., Shukla, D. K., and Amiruzzaman, M. (2024). ChatGPT: transforming healthcare with AI. AI 5 (4), 2618–2650. doi:10.3390/ai5040126

OpenAI (2024). GPT-4.5 system card. Available online at: https://openai.com/index/gpt-4-5-system-card/.

OpenAI (2025). Introducing GPT-4.1 in the API. Available online at: https://openai.com/index/gpt-4-1/.

OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., et al. (2023). GPT-4 technical report (version 6). arXiv. doi:10.48550/arXiv.2303.08774.

OpenAI, Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., et al. (2024). GPT-4o system card (version 1). arXiv. doi:10.48550/arXiv.2410.21276.

Plevris, V., Papazafeiropoulos, G., and Jiménez Rios, A. (2023). Chatbots put to the test in math and logic problems: a comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI 4 (4), 949–969. doi:10.3390/ai4040048

Ray, P. P. (2024). ChatGPT in transforming communication in seismic engineering: case studies, implications, key challenges and future directions. Earthq. Sci. 37 (4), 352–367. doi:10.1016/j.eqs.2024.04.003

Raza, M., Jahangir, Z., Riaz, M. B., Saeed, M. J., and Sattar, M. A. (2025). Industrial applications of large language models. Sci. Rep. 15 (1), 13755. doi:10.1038/s41598-025-98483-1

Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., et al. (2019). Deep learning and process understanding for data-driven Earth system science. Nature 566 (7743), 195–204. doi:10.1038/s41586-019-0912-1

Semeraro, F., Cascella, M., Montomoli, J., Bellini, V., and Bignami, E. G. (2025). Comparative analysis of AI tools for disseminating CPR guidelines: implications for cardiac arrest education. Resuscitation 208, 110528. doi:10.1016/j.resuscitation.2025.110528

Shao, X., Ma, S., Xu, C., Xie, C., Li, T., Huang, Y., et al. (2024). Landslides triggered by the 2022 Ms6.8 Luding strike-slip earthquake: an update. Eng. Geol. 335, 107536. doi:10.1016/j.enggeo.2024.107536

Shrout, P. E., and Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86 (2), 420–428. doi:10.1037/0033-2909.86.2.420

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., et al. (2023). Beyond the imitation game: quantifying and extrapolating the capabilities of language models (no. arXiv:2206.04615). arXiv. doi:10.48550/arXiv.2206.04615.

Tay, Y., Dehghani, M., Bahri, D., and Metzler, D. (2023). Efficient transformers: a survey. ACM Comput. Surv. 55 (6), 1–28. doi:10.1145/3530811

Wang, L., Chen, X., Deng, X., Wen, H., You, M., Liu, W., et al. (2024). Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. Npj Digit. Med. 7 (1), 41. doi:10.1038/s41746-024-01029-4

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv. doi:10.48550/arXiv.2201.11903.

Wilson, M. P., Foulger, G. R., Wilkinson, M. W., Gluyas, J. G., Mhana, N., and Tezel, T. (2023). Artificial intelligence and human-induced seismicity: initial observations of ChatGPT. Seismol. Res. Lett. 94 (5), 2111–2118. doi:10.1785/0220230112

Wu, S., Xu, C., Ma, J., and Gao, H. (2025). Escalating risks and impacts of rainfall-induced geohazards. Nat. Hazards Res. 5. doi:10.1016/j.nhres.2025.03.003

Xie, C., Gao, H., Huang, Y., Xue, Z., Xu, C., and Dai, K. (2025). Leveraging the DeepSeek large model: a framework for AI-assisted disaster prevention, mitigation, and emergency response systems. Earthq. Res. Adv. 5, 100378. doi:10.1016/j.eqrea.2025.100378

Xu, C., and Lin, N. (2025). Building a global forum for natural hazard science. Npj Nat. Hazards 2 (1), s130–s132. doi:10.1038/s44304-025-00130-2

Xu, F., Ma, J., Li, N., and Cheng, J. C. P. (2025). Large language model applications in disaster management: an interdisciplinary review. Int. J. Disaster Risk Reduct. 127, 105642. doi:10.1016/j.ijdrr.2025.105642

Xue, Z., Xu, C., and Xu, X. (2023). Application of ChatGPT in natural disaster prevention and reduction. Nat. Hazards Res. 3 (3), 556–562. doi:10.1016/j.nhres.2023.07.005

Yang, Z., Yao, Z., Tasmin, M., Vashisht, P., Jang, W. S., Ouyang, F., et al. (2025). Unveiling GPT-4V’s hidden challenges behind high accuracy on USMLE questions: observational study. J. Med. Internet Res. 27, e65146. doi:10.2196/65146

Zhao, T., Wang, S., Ouyang, C., Chen, M., Liu, C., Zhang, J., et al. (2024). Artificial intelligence for geoscience: progress, challenges, and perspectives. Innovation 5 (5), 100691. doi:10.1016/j.xinn.2024.100691

Zhou, L., Schellaert, W., Martínez-Plumed, F., Moros-Daval, Y., Ferri, C., and Hernández-Orallo, J. (2024). Larger and more instructable language models become less reliable. Nature 634 (8032), 61–68. doi:10.1038/s41586-024-07930-y

Keywords: ChatGPT, disaster risk reduction, domain evaluator ratings, geological hazards, multi-dimensional capability profiling, question answering

Citation: Wu S, Xu C, Xue Z, Huang Y, Xu G, Cui Y, Ma J, Ma R and Xie C (2026) Beyond structured knowledge: performance boundaries of ChatGPT in geological-hazard question answering and the need for human-in-the-loop oversight. Front. Earth Sci. 13:1695920. doi: 10.3389/feart.2025.1695920

Received: 30 August 2025; Accepted: 29 December 2025;
Published: 13 January 2026.

Edited by:

Augusto Neri, National Institute of Geophysics and Volcanology (INGV), Italy

Reviewed by:

Annemarie Christophersen, GNS Science, New Zealand
Hans-Balder Havenith, University of Liège, Belgium
Mohammad Al Mashagbeh, The University of Jordan, Jordan

Copyright © 2026 Wu, Xu, Xue, Huang, Xu, Cui, Ma, Ma and Xie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chong Xu, xc11111111@126.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.