- Department of Nuclear Engineering, North Carolina State University (NCSU), Raleigh, NC, United States
The verification, validation, and uncertainty quantification (VVUQ) of high-fidelity, high-resolution multi-physics modeling and simulation in nuclear engineering applications are essential for assessing the predictive credibility of developed models. Appropriate practices and methods are required to address ongoing challenges. Some key examples include the large dimensionality of the input and output spaces, modeling complexity, high computational cost, scarcity of relevant experimental data, and the lack of guidelines and protocols for the development of multi-physics benchmarks. This study provides guidelines and recommendations for addressing these challenges. Dimensionality reduction and screening approaches can be used to address the high-dimensional input and output spaces. A multi-level validation hierarchy where the coupling level is increased progressively is suggested to manage modeling complexity. A validation scoring method is proposed to compare the different coupling levels and to identify gaps in the modeling. Surrogate models can be used to address the computational cost, though they require the estimation of an additional model uncertainty. For consistent uncertainty propagation, sample-processing diagrams are introduced that can help avoid sampling errors between the multiple inputs. For the validation of multivariate outputs such as time series, local, regional, and global univariate metrics can be used together with more complicated multivariate methods based on U-pooling. Some of the proposed recommendations are demonstrated on the multi-physics modeling of the first cold ramp test from the OECD/Nuclear Energy Agency (NEA) Multi-physics Pellet Cladding Mechanical Interaction Validation (MPCMIV) benchmark. The multi-level modeling hierarchy ranges from single-physics fuel performance models to coupled multi-physics models. The MOOSE-based tools Griffin, Bison, and THM are employed alongside the fuel performance code OFFBEAT. The measurements considered here include the cladding’s axial elongation and the coolant temperature at three different locations during the cold ramp test. Validation metrics are computed at local, regional, and global scales. Validation scores are computed for each model and physics domain. The results highlight the need for at least a coupling between reactor physics (RP) and fuel performance (FP) to accurately predict the cladding axial elongation, whereas the coolant temperatures are less sensitive to the coupling level due to their small variations during the cold ramp test.
1 Introduction
Scientific computing plays a key role in many modern nuclear engineering activities, with modeling and simulation (M&S) contributing considerably to overcoming current limitations. Significant progress has been made in M&S in recent years due to the growing computational resources, improved numerical algorithms, the need for enhanced predictive capabilities, and the diversity of applications (e.g., advanced reactors and small modular reactors). Early M&S developments focused primarily on single-physics models, but recent decades have seen a surge of multi-physics models. Multi-physics M&S is important for accurately predicting various nuclear reactor conditions due to the inherent coupling of fundamental phenomena (e.g., Doppler feedback). The traditional approach for multi-physics modeling of a nuclear reactor core involves the external coupling of reactor physics and thermal-hydraulics codes at the assembly level. Novel multi-physics tools, however, have drastically expanded the modeling fidelity of nuclear reactors, enabling pin-level, and in some cases even sub-pin, resolution. Flexible multi-physics frameworks have been established that allow different coupling strategies, while high-performance computing and large-scale parallelization have further enhanced computational efficiency. The utilization of these impressive M&S capabilities for decision-making in areas of interest, such as core design for new advanced reactors, loading pattern optimization for the existing fleet of LWRs, and safety analysis for different beyond design basis accidents, hinges upon adequate verification, validation, and uncertainty quantification (VVUQ).
VVUQ is a process that builds confidence in the predictive capabilities of a model for a selected application of interest. VVUQ for computational models is a field with a rich history spanning more than 60 years (Sargent and Balci, 2017). Consistent and comprehensive approaches have been proposed (Roy and Oberkampf, 2011) that address some key aspects of VVUQ, such as the identification of all sources of uncertainties and the incorporation of numerical errors. Specific methodologies and guidelines have been developed for single-physics nuclear reactor models to support safety analyses. International benchmarks, databases, and standards have been established to facilitate code-to-code and code-to-measurement comparisons. In the field of reactor physics, two prominent examples are the International Criticality Safety Benchmark Evaluation Project (ICSBEP) and the International Reactor Physics Experiments Evaluation Project (IRPhEP) (Bess and Ivanova, 2020). For thermal-hydraulics and fuel performance, the International Experimental Thermal HYdraulics Systems (TIETHYS) (Rohatgi et al., 2018) and the International Fuel Performance Experiments (IFPE) (Menut et al., 2000) databases are additional examples. These initiatives have strengthened the quality and consistency of past and current experimental data while also highlighting the need for new experimental validation campaigns. Although considerable progress has been made in the VVUQ of standalone physics models, there is a growing need to develop similar standards and guidelines for multi-physics models that cover both traditional and novel multi-physics tools.
The VVUQ of nuclear reactor multi-physics models, especially those that are multi-scale and high-resolution, presents significant challenges. There is a scarcity of experimental data for multi-physics phenomena. This is even more pronounced in transient scenarios, where conducting relevant experiments is very difficult. Historical experimental data from reactivity-initiated accident (RIA) and loss of coolant accident (LOCA) tests are available (Vaglio-Gaudard et al., 2023). For the former, the experiments carried out in CABRI, SPERT-III, and TREAT are representative examples. For the latter, the experiments carried out in the LOFT and PHEBUS-LOCA programs are representative. The measured data over the transients typically include local fuel rod thermal and dimensional measurements and integral power and system measurements. An additional challenge is the high computational cost associated with multi-physics calculations that often prevents the implementation of rigorous uncertainty analyses, such as those proposed in Roy and Oberkampf (2011). Recent progress has been made in the verification of high-fidelity multi-physics tools (Novak A. et al., 2023; Avramova et al., 2021), including improvements in coupling techniques, convergence acceleration, efficient multi-scale approaches, data transfer methods, and software engineering practices. In the field of model validation, however, less progress has been made. A few international benchmark activities have been launched by the Organization for Economic Co-operation and Development (OECD)/Nuclear Energy Agency (NEA) (Avramova et al., 2021; Vaglio-Gaudard et al., 2023), such as the Multi-physics Pellet Cladding Mechanical Interaction Validation (MPCMIV) and the Tennessee Valley Authority Watts Bar Unit 1 Multi-Physics (TVA-WB1) benchmarks. As noted in Vaglio-Gaudard et al. (2023), there is currently a lack of detailed experimental data necessary for validating high-fidelity, pin-by-pin resolution multi-physics models for light-water reactor (LWR) applications. Addressing this gap will require new experimental campaigns supported by advanced instrumentation techniques. The field of uncertainty quantification (UQ) is closely related to validation, particularly the aspects of uncertainty propagation and sensitivity analysis. UQ for high-fidelity and high-resolution multi-physics nuclear reactor modeling is computationally intensive because it relies primarily on stochastic sampling approaches that require multiple code evaluations. Although deterministic UQ approaches exist for single-physics models that significantly reduce computational cost, such as those based on generalized perturbation theory in reactor physics, they have not yet been extended to multi-physics models. Moreover, current multi-physics UQ studies are often separated from code validation and consider only a subset of all possible sources of uncertainties. Examples of such studies for LWRs can be found in Delipei et al. (2022) and Croisette et al. (2025), while examples for advanced reactors can be found in Yu et al. (2025) and Trivedi et al. (2025). This brief overview of the current state of VVUQ for the multi-physics modeling of nuclear reactors underscores the need for dedicated guidelines and recommendations that will support future activities. An initial discussion and a first attempt at providing general directions regarding all these aspects were presented in Valentine et al. (2024).
In this article, guidelines and recommendations regarding the validation of high-fidelity multi-physics nuclear reactor models are discussed, along with a validation case study on the MPCMIV benchmark using MOOSE-based tools. Section 2 provides a general background on the validation of computer models, followed by more details about recent developments in the field of nuclear engineering, and identifies the main ongoing challenges. Section 3 provides recommendations to address some of these challenges. Section 4 presents a demonstration case study on the MPCMIV benchmark. Finally, some general conclusions are drawn in Section 5, and future avenues of research are highlighted.
2 Background on validation of computer models
Verification and validation (V&V) have a long history that spans various engineering and philosophical domains. The operations research community was the first to formally discuss V&V, laying the foundation for the terminology and definitions still in use today. A brief history of V&V can be found in Sargent and Balci (2017) and Oberkampf and Roy (2010). In the former, the evolution of V&V activities is divided into three main eras: “early” (before 1970), “awareness” (up to 1990), and “modern” (up to today).
During the early era, some initial attempts at V&V emerged in the scientific community, but the lack of consistent terminology and definitions hindered systematic progress. More specifically, in nuclear engineering, the design of nuclear systems relied predominantly on experimental data, from which simplified models were developed with large safety margins. The lack of advanced M&S capabilities, as well as limited computational resources, restricted the use of these models to supporting studies and analyses. Therefore, rigorous V&V was not of primary importance. A key contribution from this era was the recognition of the necessity of V&V for scientific models. Although specific methods or standards had not been established, important concepts were introduced, such as the use of experimental data for model validation.
The awareness era marked a significant rise in interest in V&V across various engineering communities. The construction of new nuclear reactors of increasing size led to the development of safety acceptance criteria and the first M&S guidelines (Rohatgi and Kaizer, 2020) for licensing purposes. This meant that models could be used to predict safety parameters, increasing their importance in the design and licensing of nuclear reactors. The models still relied on strong approximations, with many empirical or semi-empirical modules tuned using relevant experimental data. The first systematic definitions for model V&V were established in 1979 by the Society for Computer Simulation (Schlesinger, 1979). Verification was defined as “substantiation that a computerized model represents a conceptual model within specified limits of accuracy,” while validation was defined as “substantiation that a computerized model, within its domain of applicability, possesses a satisfactory range of accuracy consistent with the intended application of the model.” Several key concepts previously discussed were combined to define an iterative model development process framework. The framework is built around the concepts of reality, the conceptual model, and the computerized model. “Reality” in this definition does not explicitly refer to experimental measurements but rather to the real system and scenario under investigation (e.g., control rod ejection in a nuclear reactor). The conceptual model consists of the abstractions used to translate the physical phenomena of the system under study into a mathematical model, typically expressed through equations and operators in the form of partial differential equations (PDEs). The computerized model is the implementation of the conceptual model on a computer using elements of numerical analysis and linear algebra. In this framework, verification is related to the evaluation of the model’s implementation by assessing the accuracy of the predictions of the computerized model compared to the conceptual model. Numerical errors, such as discretization and iterative error, are estimated during this evaluation. Validation is related to the evaluation of the model’s predictions against a referent of reality (e.g., expert opinion and measurements) for the intended application. The uncertainties in the model predictions and the referent of reality are part of this evaluation.
In what follows, the discussion will primarily focus on the validation component of V&V. According to Schlesinger (1979), the key terms in the definition of validation are “substantiation,” “domain of applicability,” and “accuracy.” “Substantiation” refers to the process of providing evidence that supports the validity of the model. As in the judicial system, the modeler needs to accumulate different forms of evidence to convince the community that the model is valid. “Domain of applicability” indicates that validation cannot be performed across all conditions and scenarios and is instead restricted to the domain covered by the collected evidence. “Accuracy” involves the quantitative assessment of the model’s predictive capability against a chosen reference for the intended application, which often takes the form of available experimental measurements. A scientific model developed following these principles should have the following properties (Sargent, 2020): (1) it must be developed for a specific application; (2) it should be parsimonious, meaning that it is as simple as possible while still capturing the important phenomena of the intended application; (3) it must have an acceptable level of accuracy for its intended purpose. The last property introduces the concept of accuracy requirements. It is not sufficient to simply measure the model’s accuracy for a given application. Accuracy requirements must be defined in advance, and the model’s accuracy should be evaluated with respect to these requirements for the intended application.
During the awareness era, many approaches and techniques were proposed to answer the questions “who should decide if the model is valid?” and “how should the validity of the model be assessed?” For the former question, three different approaches exist (Sargent, 2020).
1. A decision is made by the model development team that conducted the V&V activities during the model development process.
2. A decision is made by the users of the model. This has the added benefit of improving model credibility since the users are part of the V&V process of the model.
3. A decision is made by a team independent of the development and user teams. This approach is called “independent verification and validation” (IV&V), and the independent team can perform IV&V in parallel with the model development or after the model has been developed. Conducting IV&V builds even greater credibility and confidence in the model, but it can be prohibitively expensive.
Regarding how the model should be assessed, various techniques have been explored, ranging in both complexity and degree of subjectivity. Among the more subjective approaches are face validity and scoring methods. In face validity, an expert assesses the model’s adequacy using any available information, from code documentation to plots of model predictions. Scoring methods assign subjective, weighted scores to different aspects of the model development process. One of the drawbacks of this method is that the scores tend to hide the subjectivity involved. More objective approaches rely on statistical techniques, such as hypothesis testing and confidence intervals. Hypothesis testing methods compare model predictions against experimental data to determine the probability that they originate from the same distribution. If this probability falls below a predefined threshold, then the model is deemed invalid. One drawback of this method is that it answers the model validation question with a binary response (yes/no), which is not in line with current validation practices that require a quantitative measure of accuracy. Additionally, the assessment decision should be separated from the measure of accuracy. Confidence interval methods, on the other hand, use the uncertainty information in both predictions and measurements to estimate error intervals where the “true” error lies. One key takeaway from these validation techniques is that no single approach is best across all models. The suitability and effectiveness of a given validation technique depend, among other things, on the underlying system being modeled, the goal of the M&S, and the information available (e.g., uncertainties).
In the modern era, which we are currently in, V&V interest remains strong. Solid foundations have been established for the model development process and V&V practices, including accepted definitions and standards. V&V is being adopted by most engineering M&S fields, and several techniques are being investigated. Many organizations, including the U.S. Nuclear Regulatory Commission (NRC), have made V&V a requirement for computational models. Past V&V techniques have matured, and new ones have been introduced to address existing shortcomings. The V&V definitions were refined to eliminate some of the ambiguities, converging to the definitions used today by most engineering disciplines. Oberkampf and Roy (2010) provide an excellent historical background on the evolution of these definitions through the efforts of the Institute of Electrical and Electronics Engineers (IEEE), the U.S. Department of Defense (DoD), the American Institute of Aeronautics and Astronautics (AIAA), and the American Society of Mechanical Engineers (ASME). The updated definition of validation is “the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model.” One important clarification in this definition is that both validation and verification are ongoing processes that need to be repeated whenever the model is updated, a new application is envisioned, or new experimental information becomes available. Accuracy assessment is also emphasized with respect to the real world using appropriate experimental data. Experimental observations are thus regarded as the best available measure of real-world phenomena. It is important to note that this interpretation does not treat experimental data as more accurate than model predictions but simply as the most reliable reference against which model accuracy can be evaluated.
Uncertainty quantification, in the updated definitions, is at the core of the validation process. All potential sources of uncertainty must be quantified and incorporated into the accuracy assessment. These include numerical errors, input parameter uncertainties, experimental uncertainties, and model uncertainties. The following general statistical framework can be defined between the “true” value of a scalar quantity $y_t$, the corresponding experimental measurement $y_e$, and the model prediction $y_m$:

$$y_e = y_t + \epsilon_e, \qquad y_m = y_t + \delta_{model} + \delta_{num} + \delta_{input},$$

where $\epsilon_e$ is the experimental measurement error, $\delta_{model}$ is the model-form error, $\delta_{num}$ is the numerical error, and $\delta_{input}$ is the error induced by the uncertain inputs.

Once all these sources of uncertainty are quantified, appropriate validation metrics $d(y_m, y_e)$ can be computed that quantify the disagreement between the model predictions and the experimental measurements.
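As a concrete univariate example (the area metric referenced later in Section 2.2.9; the notation here is ours), the disagreement can be measured as the area between the CDF of the model prediction and the empirical CDF of the measurements:

$$d(F, S_n) = \int_{-\infty}^{+\infty} \left| F(y) - S_n(y) \right| \, \mathrm{d}y,$$

where $F(y)$ is the CDF of the predicted quantity obtained from uncertainty propagation, $S_n(y)$ is the empirical CDF built from the $n$ experimental measurements, and $d$ carries the units of $y$, with $d = 0$ indicating perfect agreement between the two distributions.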
Model accuracy assessment using validation metrics is conducted at specified conditions, where experimental data are available that delineate the validation domain. The “predictive capability” of the model, however, refers to different conditions that can be within (interpolation) or outside (extrapolation) the validation domain. In both cases, the accuracy of the model needs to be estimated for the intended application conditions. Accuracy requirements must be specified to determine whether the model demonstrates an acceptable level of agreement. These requirements are essential, as models with high accuracy can still be insufficient for applications demanding very high precision, whereas models with low accuracy might be adequate for applications with less demanding precision requirements.
To support the model validation process, a dedicated preliminary task known as a “phenomena identification and ranking table” (PIRT) is often employed. The purpose of a PIRT is to systematically identify and rank the importance of different physical phenomena with regard to a specific quantity of interest for a given application. Additionally, a PIRT incorporates the state of knowledge for the phenomena involved and the corresponding amount of available information (e.g., aleatory and epistemic uncertainties).
The various aspects of model validation discussed so far can create confusion regarding their practical implementation. For this reason, a clear framework for the validation process is provided in Oberkampf and Roy (2010) in three sequential steps.
1. Model accuracy assessment. This step involves the comparison of model predictions against experimental data using appropriate validation metrics. This is typically referred to as “model validation” in most nuclear engineering V&V activities.
2. Model predictive capability. This step involves model prediction on the intended application conditions. As stated previously, this can be an interpolation within the validation domain or an extrapolation outside of this domain. The interpolation or extrapolation involves not only the model prediction itself but also the associated model uncertainty. The outcome of this step is the total uncertainty of the model and the corresponding accuracy of the application conditions.
3. Model adequacy assessment. In this final step, the accuracy obtained from the previous step is compared against the accuracy requirements for the intended application to decide if the model is adequate.
In most of the nuclear engineering literature, the term “model validation” is used for the first step presented above. The available experimental data for this step typically come from both traditional historical experiments and validation experiments. Traditional experiments typically aim to either infer knowledge about physical phenomena and their modeling or assess the safety of the system across different conditions. Validation experiments, on the other hand, are performed specifically to assess the predictive capability of scientific models and thus are better suited for model validation. This means that the primary focus of validation experiments is not the precision of the measurements but the accurate characterization of the experimental conditions (Oberkampf and Roy, 2010). The design of validation experiments can even be guided by M&S studies to identify the experimental configurations and conditions that will most effectively support model validation. The use of M&S for such purposes can reduce the cost of experimental campaigns and maximize the amount of assimilated information. For complex systems, direct model validation of the complete system might not be possible due to the lack of validation or traditional experiments at the system scale. Instead, a hierarchical model validation strategy with different tiers is suggested in order to leverage available experimental data at lower scales or subsystems (Avramova et al., 2021; Valentine et al., 2024; Oberkampf and Roy, 2010; NEA, 2023). This strategy employs a hierarchical methodology that decomposes and simplifies the coupled phenomena. At the top of the hierarchical pyramid lie the complete system and the conditions of interest, where experimental data are usually either unavailable or very scarce. The next tier consists of subsystems that retain most of the multi-physics coupled phenomena but do not capture the complete geometrical aspects and physical phenomena of the intended application. For these subsystems, a moderate amount of experimental data can usually be found. At the bottom of the pyramid lie unit experiments of a very limited scope, designed to target specific phenomena and often limited to single-physics phenomena in nuclear engineering. Multi-physics modeling validation can thus start from the bottom of the pyramid with the individual single-physics models and progressively increase the coupling complexity in alignment with the available experimental data at the specific validation pyramid tier. The exact form of the validation hierarchy depends on the underlying system and intended application. Recently, efforts have been made to establish general frameworks for creating such validation hierarchies (Luckring et al., 2023) and to develop methods for a hierarchical prioritization of physical phenomena using a combination of PIRTs, gap analysis, and global sensitivity analysis (Shaw et al., 2023).
Most of the model validation elements previously discussed are combined in Oberkampf et al. (2007) to develop the Predictive Capability Maturity Model (PCMM) for computational M&S. The goal of this method is to assess the maturity of M&S efforts for a given application of interest to improve its credibility and trustworthiness. Although the maturity of M&S goes beyond the scope of this article, PCMM highlights aspects that can be very important for developing validation guidelines for high-fidelity multi-physics nuclear reactor models. PCMM was also applied to the Virtual Environment for Reactor Applications (VERA) code suite as part of its V&V activities (Jones et al., 2018). Like PIRTs, PCMM uses a maturity scoring approach that combines subjective and objective judgments. The maturity scores range from 0 to 3, and they are assigned to six different elements of M&S: (1) representation and geometric fidelity, (2) physics and material model fidelity, (3) code verification, (4) solution verification, (5) model validation, and (6) uncertainty quantification and sensitivity analysis. An interesting aspect is the separation of modeling fidelity into geometric, physics, and materials fidelity. This is similar to the distinction between traditional multi-physics modeling and high-fidelity, high-resolution multi-physics modeling in nuclear engineering. Oberkampf et al. (2007) stressed that higher fidelity does not guarantee improved predictive capability. Only after model validation can predictive capability be assessed. Also relevant is the acknowledgement of various degrees of maturity for all the identified M&S elements without judging their adequacy. The usefulness of a scoring system is recognized for conveying information related to M&S maturity that should preferably also follow a hierarchical approach.
All the information provided so far presents the general evolution and current state of model validation paradigms. In the following section, a background on multi-physics validation activities specific to nuclear engineering applications is provided.
2.1 Multi-physics model validation activities in nuclear engineering
Multi-physics model validation is a relatively new field in nuclear engineering. Three main categories of activities associated with model validation are: (1) safety analysis approaches that meet regulatory requirements, (2) benchmarks for code-to-code and code-to-measurements comparisons, and (3) the qualification of multi-physics tools.
Historically, model validation in nuclear engineering has been a core part of safety analyses (Zhang, 2019). Initially, conservative approaches were employed to assess safety margins, but as computational resources increased and M&S fidelity improved, Best Estimate Plus Uncertainty (BEPU) approaches were introduced. BEPU approaches use model predictions together with an estimation of their uncertainty to compute safety margins instead of relying on overly conservative predictions. VVUQ is a key element in these approaches, with some examples being the Code Scaling, Applicability, and Uncertainty (CSAU) evaluation and the Evaluation Model Development and Assessment Process (EMDAP) developed by the U.S. NRC. Industrial BEPU approaches have also been proposed, such as BELOCA and RLBLOCA. Performing VVUQ in this context can be computationally expensive due to the required modeling fidelity and the multiple code evaluations typically required for UQ. For this reason, pragmatic and graded VVUQ approaches are suggested to allow an effective BEPU analysis (Zhang, 2019). Wilks’ method is often employed in BEPU approaches to reduce the computational cost. This method uses Wilks’ formula to determine the minimum number of random samples required to estimate a specified tolerance limit for a quantity of interest with a selected confidence level. The 95% tolerance limit with 95% confidence level is a common choice; it requires 59 code evaluations for predicting a one-sided tolerance limit. More details about Wilks’ method and its limitations can be found in Porter (2019).
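As a minimal illustration of the sample-size calculation (a sketch; the function name is ours), the first-order one-sided Wilks formula requires the smallest number of code evaluations $n$ such that $1 - \gamma^n \geq \beta$ for tolerance level $\gamma$ and confidence level $\beta$:

```python
import math

def wilks_one_sided(gamma: float = 0.95, beta: float = 0.95) -> int:
    """Smallest n such that 1 - gamma**n >= beta (first-order, one-sided)."""
    return math.ceil(math.log(1.0 - beta) / math.log(gamma))

print(wilks_one_sided())  # -> 59 code evaluations for a 95%/95% limit
```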
Aside from activities related to regulatory requirements and safety analyses, many multi-physics validation benchmark activities have been conducted by the nuclear M&S community in the past three decades. Most of these benchmarks were developed through the OECD/NEA. Initially, such benchmarks focused on the validation of traditional multi-physics tools, emphasizing the need for a multi-level approach (Avramova and Ivanov, 2010; Avramova et al., 2015). Traditional multi-physics modeling uses two-group diffusion-based neutronics codes with a spatial resolution at the assembly level for the coupled neutronics/thermal-hydraulics calculations. One representative pin is solved for each assembly with a 1D heat conduction that does not account for most of the fuel performance phenomena (e.g., relocation). The standard coupling approach for steady state calculations is Picard iterations, while for transient calculations it is explicit coupling. Examples of benchmarks developed for these tools include the OECD/NRC Pressurized Water Reactor Main Steam Line Break (PWR-MSLB), the OECD/NRC Boiling Water Reactor Turbine Trip (BWR-TT), and the OECD/DOE/CEA VVER-1000 Coolant Transient (V1000T). A multi-level approach is applied through different benchmark phases that define exercises of progressively increasing modeling complexity. These benchmarks included many code-to-code comparisons due to a lack of high-quality in-core local experimental data. The available measurements involved either system quantities or integral core quantities. As a follow-up to these initial multi-physics benchmark activities, the next set of benchmarks targeted higher quality experimental data, with some including local measurements at the assembly level (Avramova et al., 2015). The OECD/NEA Kalinin-3 VVER-1000 and OECD/NRC Oskarshamn-2 BWR stability benchmarks are some representative examples. An important issue raised by these benchmarks was the need for systematic uncertainty and sensitivity analysis. For this reason, the OECD/NEA Benchmark for Uncertainty Analysis in Best-Estimate Modelling (UAM) for Design, Operation, and Safety Analysis of LWRs (LWR-UAM) was developed. The main objective of this benchmark is the development and testing of uncertainty and sensitivity analysis across multi-physics and multi-scale models. Three different phases, starting from pin cell neutronics and up to full core and system transient exercises, are defined. Both experimental and numerical test cases are available, including overlaps with the previous benchmarks. The recent surge of high-fidelity, high-resolution tools in nuclear reactor studies motivated the development of a new set of international benchmarks that are still ongoing. These novel multi-physics tools can provide predictions at the pin/subchannel level with increased modeling fidelity in each of the coupled codes. Several ongoing benchmarks provide data at assembly and pin scales to address both traditional and novel multi-physics tools. Examples of such benchmarks include the OECD/NEA Multi-physics Pellet Cladding Mechanical Interaction Validation (MPCMIV), the OECD/NEA TVA Watts Bar Unit 1 (WB1) Multi-Physics Multi-Cycle, and BEAVRS (Horelik et al., 2025). The increased computational cost associated with the VVUQ of novel multi-physics tools has also motivated the development of high-to-low (Hi2Lo) model fidelity approaches, where a high-fidelity code is used to inform a low-fidelity code.
The LWR-UAM benchmark has been updated to include numerical benchmark exercises that target the verification and validation of such Hi2Lo approaches in the presence of uncertainties. All these benchmark activities are relevant to conventional large-scale LWRs. Significantly fewer activities have been performed for the multi-physics VVUQ of advanced reactors and small modular reactors (SMRs). The OECD/NEA SFR-UAM is one example of a benchmark for sodium fast reactors (SFRs) (OECD/NEA, 2025a), while the LW-SMR benchmark, currently under development, is an example of an upcoming activity for SMRs (OECD/NEA, 2025c).
The third category of validation activities is related to the validation of high-fidelity, high-resolution multi-physics tools. One of the most prominent examples is the VERA code suite developed under the Consortium for the Advanced Simulation of LWRs (CASL) (CASL, 2020; Athe et al., 2021). VERA has pin-by-pin resolution using, among other things, the MPACT code for core neutronics, CTF for subchannel thermal-hydraulics, and Bison for fuel performance. The PCMM approach was applied to VERA predictive capability assessment, within which VVUQ was performed for seven different challenge problems (e.g., reactivity insertion accident and loss of coolant accident). While VERA is an example of a conventional LWR activity, the U.S. Department of Energy (DOE) Nuclear Energy Advanced Modeling and Simulation (NEAMS) program is an example of advanced reactor activities. Under NEAMS, numerous multi-physics coupling systems have been developed that leverage the Multi-physics Object-Oriented Simulation Environment (MOOSE) (Martineau, 2021). MOOSE is a high-performance computing framework using finite elements that allows flexible coupling among various MOOSE-based codes. Some validation activities have been carried out for the multi-physics transient modeling of heat pipe and gas-cooled microreactors using a coupling between Griffin and Bison (Stauff et al., 2024). Recently, the capability to couple external codes with the MOOSE framework was implemented. Cardinal is a wrapper tool that allows the coupling of MOOSE-based codes with the Monte Carlo neutronics code OpenMC and the computational fluid dynamics (CFD) code NekRS. NekRS–Bison-coupled calculations were performed and compared against experimental data from a seven-pin Freon fuel bundle (Novak A. J. et al., 2023). More validation activities related to SFRs are expected in the future. In Kwon et al. (2025), the nTRACER 2D/1D planar method of characteristics (MOC) neutronics code is coupled with the CTF subchannel thermal-hydraulics code. The multi-physics coupling is applied to the TVA WB1 benchmark. Code-to-code comparisons are performed with benchmark solutions from VERA, while comparisons with experimental data are performed for the boron letdown curve. All these examples of validation activities involve deterministic codes only, but recently, multi-physics couplings have been developed involving Monte Carlo codes for the core neutronics. Examples of such activities are the McSAFE program and the ongoing McSAFER program supported by the European Commission (Demazière et al., 2022). Under the McSAFE program, a multi-physics coupling between the Serpent Monte Carlo neutronics code, the SUBCHANFLOW subchannel thermal-hydraulics code, and the TRANSURANUS fuel performance code was validated against first-cycle depletion experimental data from a PWR. The validation data included both integral and local data.
The high-fidelity multi-physics validation activities discussed so far mainly compare model predictions against available experimental data. UQ and sensitivity analysis (SA) are not performed due to the very high computational cost. The complexity of high-fidelity multi-physics M&S allows only stochastic approaches to be used, which require multiple code evaluations. Although UQ and SA are fundamental aspects of VVUQ, most current multi-physics UQ studies consist of preliminary efforts to demonstrate some capabilities without being part of model validation. Additionally, only a subset of all possible sources of uncertainties is included, without the separation of epistemic and aleatory uncertainties. An active field of research is the development of methods for reducing computational cost. Avramova et al. (2021) briefly discuss a multi-physics Hi2Lo approach where each high-fidelity code informs a lower-fidelity code. In Huang et al. (2017), reduced-order models are constructed using dimension reduction techniques in each coupled physics domain. A very common approach is to use data-driven machine learning methods either as surrogate models between the uncertain inputs and outputs or, in some cases, to accelerate the predictions of the multi-physics system. Radaideh and Kozlowski (2019) proposed a UQ and SA framework based on various deep learning methods, while in Papadionysiou et al. (2025), a neural-network model is trained with data generated with CTF to predict coolant properties at subchannel resolution. Such approaches are necessary to perform UQ and SA in the context of high-fidelity multi-physics M&S. However, they introduce an additional source of error, the surrogate model error, which must be estimated (e.g., on held-out test data) and included in the total prediction uncertainty.
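A minimal sketch of how this additional error can be estimated (the stand-in model, surrogate form, and all names are illustrative, not any specific code): a subset of the available code evaluations is held out from training, and the residuals on that test set provide the surrogate error statistics.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for expensive multi-physics evaluations (illustrative only)
def expensive_code(x):
    return np.sin(3.0 * x[:, 0]) + 0.5 * x[:, 1] ** 2

X = rng.uniform(0.0, 1.0, size=(200, 2))
y = expensive_code(X)

# Hold out a test set: its residuals estimate the surrogate error
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# Simple quadratic-polynomial surrogate fitted by least squares
def features(x):
    return np.column_stack([np.ones(len(x)), x, x**2, x[:, 0] * x[:, 1]])

coef, *_ = np.linalg.lstsq(features(X_train), y_train, rcond=None)
residuals = y_test - features(X_test) @ coef

# The residual spread is carried forward as the surrogate-model uncertainty
print(f"estimated surrogate error (1-sigma): {residuals.std(ddof=1):.4f}")
```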
The VVUQ of high-fidelity multi-physics M&S is very challenging and is a field of active research. There is a lack of high-quality, high-resolution experimental data, something even more pronounced for advanced reactors. New experimental campaigns are needed with advanced instrumentation techniques. Novak A. et al. (2023) raise the need for specific verification guidelines, and they propose some suggestions. Similarly, the development of appropriate validation guidelines and best practices has been recognized as necessary (Avramova et al., 2021). An initial discussion and some first general directions are provided in Valentine et al. (2024). In the next section, a list of challenges is identified and discussed in more detail.
2.2 Challenges for the validation and uncertainty quantification of high-fidelity multi-physics models
The model complexity, computational cost, and lack of high-quality and high-resolution experimental data create various challenges for the validation of high-fidelity multi-physics models. The main ones are identified and discussed in this section. The challenges span multiple VVUQ aspects, including input uncertainty quantification, uncertainty propagation, sensitivity analysis, and comparison against experimental data.
2.2.1 Large number of input parameters and sources of uncertainties
A high-fidelity multi-physics model will involve a vast number of input parameters, ranging from nuclear data to thermal-hydraulics closure models and fuel behavior models (e.g., fission gas release). Identifying all the relevant inputs and characterizing their uncertainties is challenging, particularly since some of these inputs might exhibit strong correlations. Large input dimensionality also typically increases the number of code evaluations required for uncertainty quantification and sensitivity analysis, leading to a stronger reliance on surrogate models. The development of accurate surrogate models (e.g., training, validation, testing) becomes more complicated and time-consuming in very large input spaces.
2.2.2 Input uncertainty quantification
Separating epistemic from aleatory uncertainty is a comprehensive approach (Roy and Oberkampf, 2011) that is seldom done in practice for nuclear engineering applications due to the lack of detailed uncertainty information and the increased computational cost for propagating these uncertainties. Inputs related to inlet conditions (e.g., mass flow rate) and manufacturing (e.g., fuel dimensions) are usually characterized as aleatory with normal or uniform distributions. The uncertainty of empirical or semi-empirical sub-models is obtained by comparison against selected experimental measurements. In some cases, inverse uncertainty quantification and calibration techniques are used to quantify the uncertainty of the sub-model’s input parameters. Care should be taken not to reuse data employed for model calibration in model validation.
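As an illustrative sketch of such an inverse UQ step (a toy random-walk Metropolis calibration of a single sub-model parameter; the sub-model, data, and uncertainties are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def sub_model(theta, t):
    # Toy empirical sub-model with one uncertain parameter theta
    return theta * np.sqrt(t)

t_meas = np.array([1.0, 4.0, 9.0])
y_meas = np.array([2.1, 3.9, 6.2])   # synthetic measurements
sigma = 0.2                          # reported measurement std deviation

def log_posterior(theta):
    if theta <= 0.0:                 # flat prior on theta > 0
        return -np.inf
    resid = y_meas - sub_model(theta, t_meas)
    return -0.5 * np.sum((resid / sigma) ** 2)

theta, chain = 1.0, []
for _ in range(20000):               # random-walk Metropolis
    prop = theta + 0.05 * rng.standard_normal()
    if np.log(rng.uniform()) < log_posterior(prop) - log_posterior(theta):
        theta = prop
    chain.append(theta)

post = np.array(chain[5000:])        # discard burn-in
print(f"posterior mean = {post.mean():.3f}, std = {post.std(ddof=1):.3f}")
```

The resulting posterior distribution would then serve as the input uncertainty of the sub-model parameter in forward propagation, taking care that the calibration measurements are excluded from the validation data.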
2.2.3 Model complexity
High-fidelity multi-physics models usually include non-linearities, discontinuities, and strong interactions among uncertain inputs. Such complex behaviors practically necessitate the use of sample-based stochastic approaches for uncertainty propagation that result in multiple code evaluations. Due to the high associated computational cost, a limited number of samples can be generated, thus restricting the applicability of various UQ and SA methods.
2.2.4 Consistent sampling of input parameters
In stochastic approaches for uncertainty propagation, which are the main approaches for high-fidelity multi-physics calculations, random samples are generated for the input parameters. For some inputs, consistency rules need to be applied to each sample. An example is the total cross-section, which must equal the sum of the partial cross-sections. A consistent sampling will involve the multivariate sampling of the partial cross-sections, accounting for their correlations, followed by their summation to obtain the corresponding total cross-section samples.
Sometimes the consistency requirement can be more subjective. An example is the perturbation of the fuel rod radius in a reactor core. One approach is to consider the same perturbation for all fuel rods. This assumes that all the fuel rod dimensions are fully correlated, leading to essentially only one random input. Another approach is to consider all perturbations to be independent for each fuel rod or fuel assembly. This results in a significantly larger number of uncertain inputs. Both methods are valid, but in practice, the former approach is used more often due to its simplicity in applying the input perturbations. The latter approach is more accurate when the underlying uncertainties represent random effects, but its implementation might not be feasible if the individual fuel rods are not explicitly modeled, thereby limiting its applicability.
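A minimal sketch of the two consistency situations discussed above (the correlation matrix, magnitudes, and rod count are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples = 1000

# --- Consistent total cross-section: sample the partials jointly, then sum
mean_partials = np.array([1.2, 0.8, 5.0])   # e.g., capture/fission/scatter
corr = np.array([[1.0, 0.3, 0.1],
                 [0.3, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
std = np.array([0.02, 0.03, 0.01]) * mean_partials
partials = rng.multivariate_normal(mean_partials, corr * np.outer(std, std),
                                   size=n_samples)
total = partials.sum(axis=1)  # derived from the partials, never sampled alone

# --- Fuel rod radius under the two correlation assumptions
n_rods, r0, sr = 264, 4.75e-3, 5.0e-6        # nominal radius and std [m]
# (a) fully correlated: one perturbation shared by every rod
r_corr = r0 + sr * np.repeat(rng.standard_normal((n_samples, 1)), n_rods, 1)
# (b) fully independent: one perturbation per rod (many more random inputs)
r_indep = r0 + sr * rng.standard_normal((n_samples, n_rods))
```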
2.2.5 Consistent uncertainty propagation
Two separate aspects can be identified regarding consistent uncertainty propagation for high-fidelity multi-physics models. The first concerns avoiding the double-counting of uncertainties between the coupled physics codes. UQ frameworks addressing this aspect were proposed in Zeng (2020) and Delipei et al. (2022), where the main idea is to apply the same samples to all common sources of uncertainties within each code. A simple example is the cladding thickness, where the same perturbation should be applied in the lattice neutronics model used to generate the sampled cross-sections, in the fuel performance model, and in the thermal-hydraulics model. A more complicated situation arises when model parameters implicitly depend on fundamental sources of uncertainties. In the previous example, if the fuel performance code uses an internal model to compute the gap conductance based on the fuel dimensions, then it would be able to predict a consistent value for each sample. However, if a fixed gap conductance value is considered with an associated uncertainty independent of the fuel dimensions, then the perturbation obtained would be inconsistent with the fuel dimension perturbation.
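A minimal sketch of the shared-sample pattern (the three single-physics functions are hypothetical stand-ins, not real code interfaces): one value is drawn per common uncertain input and passed to every code that depends on it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the coupled single-physics codes
def lattice_neutronics(clad_t):
    return 1.0 + 50.0 * (clad_t - 0.57e-3)             # cross-section scaling
def fuel_performance(clad_t):
    return 5.0e3 * (1.0 - 100.0 * (clad_t - 0.57e-3))  # gap conductance
def thermal_hydraulics(clad_t, gap_h):
    return 600.0 + 1.0e-2 * gap_h + 1.0e4 * clad_t     # cladding temperature

# One shared sample per common uncertain input (magnitudes illustrative)
clad_thickness = rng.normal(0.57e-3, 5.0e-6, size=100)  # m

outputs = []
for t in clad_thickness:
    # The SAME sampled value t enters every code that uses it, so the
    # cladding-thickness uncertainty is not double-counted between codes.
    xs_scale = lattice_neutronics(t)
    gap_h = fuel_performance(t)
    outputs.append(thermal_hydraulics(t, gap_h))
```

Note that `thermal_hydraulics` here receives the gap conductance computed by the fuel performance stand-in rather than an independently perturbed fixed value, mirroring the consistency argument above.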
The second aspect is related to the application of appropriate reactor operational procedures for each input sample. One approach is to fix the operational procedures (e.g., control rod positions and total power) and compute the associated output’s uncertainty. Through the LWR-UAM benchmark, it was concluded that such an approach can overestimate the output uncertainty, because in actual operation, procedures such as criticality searches would compensate for part of the input perturbations; applying the appropriate operational procedure to each sample therefore yields a more consistent uncertainty estimate.
2.2.6 Comprehensive uncertainty propagation
Uncertainty propagation, in principle, should identify and treat epistemic, aleatory, and mixed uncertainties separately (Roy and Oberkampf, 2011). Moreover, it should consider all possible sources of uncertainty, including numerical uncertainty and stochastic sampling uncertainty. In practice, this is not possible without simplifying assumptions, for some of the reasons mentioned above. For example, it is quite common not to separate the aleatory from the epistemic sources of uncertainty.
2.2.7 Quantitative sensitivity analysis
The goal of quantitative sensitivity analysis is to measure, through uncertainty propagation, the contribution of each input to the output’s uncertainty. The input uncertainty characterization (e.g., independent, correlated) and the underlying relationship between the inputs and the output (e.g., linear, non-linear) create different types of contributions that can be separated into four categories.
1. Direct contribution. This is the contribution of the form $\partial y / \partial x$: how strongly the output responds to a change in the input itself, independent of the size of the input’s uncertainty.
2. Uncertainty contribution. An input that has a small direct contribution but has a large uncertainty can have a stronger impact on the output’s uncertainty than an input with a large direct contribution that has a small uncertainty.
3. Interaction contribution. An input might contribute to the output’s uncertainty through interactions with other inputs. For example, the post-Departure from Nucleate Boiling (DNB) heat transfer model parameters will affect the output only if other parameters lead to DNB conditions being reached.
4. Correlation contribution. The contributions of two inputs that are highly correlated are difficult to separate. Various methods that account for these correlations exist, but they should be interpreted carefully. In the context of nuclear reactors, cross-sections are the most challenging inputs regarding this aspect because they include correlations across isotopes, reactions, and energy groups.
The impact of each input can vary in the input space due to non-linearities and discontinuities. Furthermore, the sensitivity is usually expressed for some statistical property of the output. This can be the whole distribution, the variance, an extreme percentile, or any other possible property. Some of the most widely used methods target the variance decomposition of the output. Such methods use analysis of variance (ANOVA) sensitivity indices, with prominent examples being the Sobol and Shapley indices (Owen, 2014). The main drawback of ANOVA methods is that they require a very large number of code evaluations, which for any practical multi-physics study will be above 10,000 calculations, rendering their direct application infeasible. Indirect ANOVA indices can be computed using surrogate models. The very large input and output space involved in high-fidelity and high-resolution multi-physics models requires a combination of dimensionality reduction techniques, screening approaches, and surrogate models to estimate these sensitivity indices. The Morris method is often used for screening purposes for a moderate-sized input space (Iooss and Lemaître, 2015). If the model is not so complex and a linear approximation is reasonable, then linear regression-based indices, such as the correlation coefficients and Johnson indices, can be employed with even relatively small sample sizes (Clouvel et al., 2025).
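A minimal numpy sketch of a Saltelli-type pick-freeze estimator for first-order Sobol indices (the test function is an Ishigami-like stand-in; in practice the evaluations would come from a surrogate of the multi-physics model):

```python
import numpy as np

rng = np.random.default_rng(3)

def model(x):  # illustrative stand-in for a (surrogate) model
    return (np.sin(x[:, 0]) + 7.0 * np.sin(x[:, 1]) ** 2
            + 0.1 * x[:, 2] ** 4 * np.sin(x[:, 0]))

d, n = 3, 100_000
A = rng.uniform(-np.pi, np.pi, size=(n, d))
B = rng.uniform(-np.pi, np.pi, size=(n, d))
yA, yB = model(A), model(B)
var_y = np.var(np.concatenate([yA, yB]), ddof=1)

for i in range(d):
    ABi = A.copy()
    ABi[:, i] = B[:, i]   # "pick" column i from B, "freeze" the rest
    # First-order index: Var_{x_i}( E[y | x_i] ) / Var(y)
    S_i = np.mean(yB * (model(ABi) - yA)) / var_y
    print(f"S_{i + 1} = {S_i:.3f}")
```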
2.2.8 Scarce experimental data with limited uncertainty information
There are few historical datasets that can be used for high-fidelity multi-physics model validation, complemented by some relatively new validation experiments, such as Kamerman et al. (2022). This issue is even more pronounced for high-resolution data and advanced reactors. Furthermore, the uncertainty information is limited. The experiments are usually not repeated, so the random error cannot be estimated. An estimate of the measurement uncertainty in the form of a standard deviation is sometimes provided. A common choice is to consider the uncertainty as aleatory and use a normal distribution with the provided standard deviation centered on the measured value.
2.2.9 Multivariate outputs
Most of the high-fidelity, high-resolution multi-physics models’ predictions are multivariate outputs in space and time, leading to strong spatiotemporal correlations. There could also be strong correlations between different outputs, such as between the linear power and fuel temperature. If there are available experimental data for these multivariate outputs, then the typical univariate validation metrics, such as the area metric, will not be appropriate. Validation metrics dedicated to multivariate output should be used instead.
Multivariate outputs also pose challenges in the development of surrogate models for UQ and SA. Methods based on principal component analysis (PCA) have been used to facilitate surrogate model training. PCA is a linear dimensionality reduction technique that determines a latent output space consisting of the coefficients of a few principal components that maximize the explained variance. Aggregate sensitivity indices for multivariate outputs based on the explained variance of the first principal components have been proposed in Delipei (2019). Aside from PCA, there are various other methods that can reduce the output’s dimensions. Neural-network-based autoencoders are an example of a non-linear dimensionality reduction technique.
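A minimal sketch of PCA on a matrix of multivariate output snapshots via the singular value decomposition (the synthetic time series stands in for, e.g., a temperature history):

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative multivariate output: 200 samples of a 1000-point time series
t = np.linspace(0.0, 1.0, 1000)
amp = rng.normal(1.0, 0.1, size=(200, 1))
phase = rng.normal(0.0, 0.05, size=(200, 1))
Y = amp * np.sin(2.0 * np.pi * (t + phase))          # shape (200, 1000)

Y_mean = Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Y - Y_mean, full_matrices=False)

explained = s**2 / np.sum(s**2)
k = int(np.searchsorted(np.cumsum(explained), 0.99)) + 1
print(f"{k} principal components explain 99% of the output variance")

# Low-dimensional latent coefficients (e.g., targets for surrogate training)
Z = (Y - Y_mean) @ Vt[:k].T                          # shape (200, k)
Y_reconstructed = Y_mean + Z @ Vt[:k]
```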
2.2.10 Model predictive capability
Once a model’s accuracy is assessed for the different validation experiments, the next step is to estimate its predictive credibility in the intended application domain. This can be very complicated and poses many challenges. The application domain can lie in an interpolation or an extrapolation region of the validation domain covered by the validation experiments. Transferring or scaling the estimated accuracy from the validation domain to the application domain is not straightforward and depends on the underlying physical phenomena. In nuclear reactor M&S, it is anticipated that most application domains will lie in extrapolation regions due to the difficulty of performing large-scale transient experiments.
Thermal-hydraulics is the physics domain in nuclear engineering that has historically focused more on the scaling distortions between the large-scale nuclear reactor applications and the scaled-down selected experiments that aim to represent most of the important physical phenomena (Mascari et al., 2024). Non-dimensional parameters are typically used, and various scaling methods have been proposed, from simple linear scaling to complex hierarchical procedures. Similarity indices can also be used to measure the degree of similarity between the various experiments and the intended application (Abdel-Khalik et al., 2015). More research is needed to investigate the scaling of high-fidelity multi-physics model predictions.
2.2.11 Multi-physics benchmark protocols
The ICSBEP and IRPhEP are the most mature benchmark projects that define guidelines and protocols for reactor physics validation (Bess and Ivanova, 2020). The benchmark information is divided into four main sections.
1. Detailed description of experimental conditions and all available information (e.g., uncertainties, drawings).
2. Comprehensive evaluation of experimental data to facilitate model validation. Low-quality data (e.g., missing important experimental conditions for modeling) are identified and separated from higher quality data.
3. Benchmark specifications, where simplifications are made to create clear and concise data for constructing computational models for the available experimental measurement conditions.
4. Benchmark reference calculation results accompanied by associated input decks.
In each section, different possible categories of experiments are identified. Some examples from IRPhEP include critical configurations, reactivity effects, power distributions, and isotopic concentrations.
These protocols and guidelines are primarily tailored to steady state and depletion reactor physics experiments. New protocols and guidelines should be developed for multi-physics benchmarks, covering steady state, cycle depletion, and transient conditions. The ICSBEP and IRPhEP benchmark projects should serve as the foundation for developing these protocols and guidelines.
3 Recommendations and guidelines for high-fidelity multi-physics model validation and uncertainty quantification
In the previous section, some important challenges were discussed in relation to the validation of high-fidelity multi-physics models. In this section, guidelines and recommendations are provided for some of these challenges. To facilitate the discussion, some general definitions and notations are provided for the multi-physics model. We assume there are $K$ single-physics models, with the model of physics domain $i$ written as

$$y_i = M_i(x_i), \quad i = 1, \dots, K,$$

where:

• $M_i$ is the computational model of physics domain $i$ (e.g., reactor physics, fuel performance, or thermal-hydraulics);

• $x_i$ is the vector of uncertain inputs of physics $i$, of dimension $n_i$;

• $y_i$ is the vector of outputs of physics $i$, of dimension $m_i$;

• $i = 1, \dots, K$ indexes the physics domains involved in the multi-physics model.

The single-physics model definitions are extended to a coupling between two physics domains $i$ and $j$ by letting the outputs of each model enter the inputs of the other:

$$y_i = M_i(x_i, y_j), \qquad y_j = M_j(x_j, y_i),$$

where the coupled solution is obtained by solving these equations simultaneously (e.g., through Picard iterations). It is important to note that, in this case, the term describing the model error of the coupled system is not simply the combination of the single-physics model errors: additional errors can be introduced by the coupling itself, such as data transfers and the convergence of the coupled iterations.
3.1 High-dimensional input and output spaces
The large number of input and output parameters in high-fidelity multi-physics models creates many difficulties, from input uncertainty quantification to uncertainty propagation and sensitivity analysis. It also renders the training of surrogate models more complicated and time-consuming. For all these reasons, there is a strong interest in reducing the dimensionality of both input and output spaces.
When there are strong correlations, whether statistical in nature or due to spatiotemporal aspects, linear and non-linear dimensionality reduction techniques can be very powerful. These methods transform the original high-dimensional space into a lower-dimensional latent space. Common linear approaches that have been used in nuclear engineering are PCA or proper orthogonal decomposition (POD) and dynamic mode decomposition (DMD). The Discrete Empirical Interpolation Method (DEIM) has often been used in conjunction with POD to treat non-linear operators (German et al., 2019). For non-linear complex behaviors, neural-network-based methods such as autoencoders can also be used. One drawback of dimensionality reduction techniques is that the lower-dimensional latent space variables are difficult to interpret. Examples where these methods could be useful in multi-physics nuclear reactor M&S are the input cross-sections and the output temperature distribution. Both examples have strong correlations that can be leveraged to greatly reduce their dimensionality. It is important to note that these methods are unsupervised, meaning that when used in the input space, for example, they do not incorporate information from the impact of the input on the output of interest.
A different approach to reduce the input space dimensions involves identifying and discarding a subset of non-influential inputs for a specific output of interest. This is called “screening,” and various techniques can be found in the literature; three are identified as useful for practical applications in multi-physics nuclear reactor M&S. The first is One-at-a-Time (OAT), which involves perturbing each input individually around some reference point and evaluating the impact on the output. OAT methods are very simple but have important limitations since they capture only local effects and cannot account for interactions between inputs or non-linearities. The second approach is the Morris method, which alleviates most of these limitations by performing a series of OAT studies in multiple input space locations. This happens, of course, at the expense of additional computational cost. For this reason, the Morris method is feasible for a relatively small to medium number of input parameters (<50). More details about these methods can be found in Iooss and Lemaître (2015). For larger input dimensions, a recently proposed method using the Hilbert–Schmidt independence criterion (HSIC) indices, combined with statistical tests, can be considered (Iooss and Marrel, 2019). All these methods rely on code evaluations to measure the impact of each input and, based on threshold criteria, assess whether an input is influential or negligible.
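A minimal sketch of an elementary-effects screening in the spirit of the Morris method (a simplified radial one-at-a-time design around random base points; the test function is illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)

def model(x):  # illustrative stand-in for a multi-physics evaluation
    return x[0] ** 2 + 2.0 * x[1] + 0.1 * x[2] + x[0] * x[1]

d, n_traj, delta = 3, 30, 0.1
effects = np.zeros((n_traj, d))
for r in range(n_traj):
    base = rng.uniform(0.0, 1.0 - delta, size=d)
    y0 = model(base)
    for i in range(d):
        x = base.copy()
        x[i] += delta                      # one-at-a-time step
        effects[r, i] = (model(x) - y0) / delta

mu_star = np.abs(effects).mean(axis=0)     # mean |elementary effect|
sigma = effects.std(axis=0, ddof=1)        # spread: non-linearity/interactions
for i in range(d):
    print(f"x{i + 1}: mu* = {mu_star[i]:.3f}, sigma = {sigma[i]:.3f}")
```

Inputs with both a small mean absolute effect and a small spread are candidates for being fixed at their nominal values.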
Both dimensionality reduction and screening techniques are summarized in Table 1. A combination of both techniques could be considered to develop hybrid approaches. It is difficult to provide a general recommendation because the selected technique depends on the specific multi-physics model, the intended scope (e.g., building a surrogate model), and the output of interest. From previous nuclear reactor applications, such as in Delipei (2019), the most complex inputs are the cross-sections because of the strong correlations across isotopes, reactions, and energies. The cross-sections can be thought of as a very large multivariate input with a sparse correlation matrix consisting of some independent blocks of cross-sections for some specific isotopes and reactions. Within each block, strong correlations are present. Previous studies have shown that only a few of these cross-sections usually contribute to the output’s uncertainty (Delipei et al., 2021). This leaves a lot of room for hybrid approaches using dimensionality reduction and screening to identify only a subset of the important cross-sections.
3.2 Input uncertainty quantification
Input uncertainty is characterized in multiple ways in multi-physics nuclear reactor M&S, from reported manufactured uncertainty to posterior distributions obtained via Bayesian calibration on some selected experimental data. Guidelines and best practices for inverse uncertainty quantification in thermal-hydraulics codes were proposed through the SAPIUM framework in Baccou et al. (2020) that could also be used for other physics domains. Elements of this framework include sensitivity analysis studies to identify the most important inputs, careful selection and assessment of the relevant experimental data, and selection of an appropriate inverse UQ method.
One important aspect that is usually omitted from most nuclear engineering input uncertainty quantification studies is the systematic characterization of the input uncertainties as aleatory, epistemic, or mixed. It should be noted that almost all uncertainties in nuclear engineering M&S have a mixed aleatory/epistemic nature. However, for practical purposes, it might be useful to attribute the dominant type to each input. This is important because if a comprehensive uncertainty propagation is performed as described in Roy and Oberkampf (2011), a nested stochastic approach should be applied, where the outer loop samples the epistemic inputs and the inner loop samples the aleatory inputs. This nested loop results in a family of CDFs that create a p-box for the uncertainty representation. In Table 2, a general characterization is provided for five broad categories of inputs related to multi-physics nuclear reactor M&S. The first consists of initial and boundary conditions, with typical examples being the mass flow rate, temperature, pressure, and power. The second category is the manufacturing parameters, such as the fuel rod dimensions. The third is the microscopic cross-sections provided in the nuclear data libraries. The fourth category is the material properties, with examples being the specific heat capacity and thermal conductivity. The fifth category is model parameters, such as those of the fission gas release model. The model parameters should be considered epistemic, while all other categories can be reasonably assumed to be aleatory.
It is important to keep in mind that the comprehensive treatment of epistemic and aleatory uncertainties is computationally expensive, since it involves the nested loop discussed above. For this reason, most multi-physics UQ studies treat all inputs as aleatory. However, we suggest providing a general characterization, as performed in Table 2, acknowledging their different nature, and then proceeding by stating the assumption of not separating them for practical purposes.
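The nested two-loop propagation can be sketched as follows, with a hypothetical epistemic model parameter sampled in the outer loop and an aleatory input sampled in the inner loop; the stand-in model and distributions are assumptions for illustration.

```python
# Nested aleatory/epistemic propagation sketch: each outer (epistemic) sample
# yields one empirical CDF from the inner (aleatory) loop; the family of CDFs
# forms the p-box. Model and distributions are illustrative stand-ins.
import numpy as np

def model(x, theta):                     # stand-in for a coupled multi-physics model
    return theta * x + 0.1 * x ** 2

rng = np.random.default_rng(1)
n_outer, n_inner = 20, 500
grid = np.linspace(-5.0, 15.0, 200)      # output values at which CDFs are evaluated

cdfs = []
for theta in rng.uniform(0.8, 1.2, n_outer):          # epistemic outer loop (interval)
    x = rng.normal(loc=2.0, scale=1.0, size=n_inner)  # aleatory inner loop (normal)
    y = model(x, theta)
    cdfs.append((y[:, None] <= grid).mean(axis=0))    # empirical CDF on the grid

cdfs = np.array(cdfs)
p_box_lower, p_box_upper = cdfs.min(axis=0), cdfs.max(axis=0)  # p-box bounds
```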
3.3 Model complexity
High-fidelity multi-physics models can include very complex behaviors from non-linearities to strong discontinuities. Stochastic approaches to uncertainty propagation and sensitivity analysis in the context of validation are practically needed; these require multiple code evaluations. At the same time, the available experimental data might not necessitate a complete two-way coupling of all physics domains. A hierarchical approach is thus of strong interest to progressively increase the level of coupling and the associated model validation. Additionally, a satisfactory coupling level can be selected for performing uncertainty quantification that is cost-effective. In this case, the M&S simplification is guided by experimental data.
A scheme representing a multi-level coupling hierarchy for the M&S of LWRs cores was presented in NEA (2023); a similar scheme is shown in Figure 1. Three different coupling levels are identified. Level 0 involves no coupling and thus only single-physics models for reactor physics (RP), fuel performance (FP), and thermal-hydraulics (TH). Each individual single-physics model is first validated before moving to the next coupling level. Level 1 includes the coupling of two physics domains: RP–TH, RP–FP, and FP–TH. All these coupled models are validated against appropriate experimental data. Level 2 includes the coupling of all three physics domains: RP–TH–FP. The amount of available experimental data decreases with the coupling level. However, experimental data that are used to validate higher levels can also be used at lower levels with approximations of the missing physics domain. This enables the assessment of the coupling impact at each level. Novel tools like MOOSE facilitate a seamless transition between the coupling levels.
Figure 1. Example of multi-physics coupling hierarchy for the M&S of LWRs based on NEA (2023).
A model scoring approach is suggested to easily compare the performance of each coupled model against the available experimental data. Every physics domain $d$ is assigned a global score $S_d$ obtained by aggregating the local scores $s_q$ of its $N_q$ measured quantities, for example, through an equally weighted average, $S_d = \frac{1}{N_q} \sum_{q=1}^{N_q} s_q$ (Equation 5).
The subjective aspect of the scoring approach stems from the method used to compute the local scores $s_q$, which depends on the selected validation metric and on user-defined accuracy requirements for each measured quantity.
As an example, we consider the case where there is no uncertainty information for either the predictions or the measurements. The local score can then be obtained by comparing the absolute error between prediction and measurement against a set of accuracy thresholds.
These scores take continuous values (e.g., from 1.0 for poor agreement up to 5.0 for excellent agreement), allowing a fine-grained comparison across models and physics domains.
The global scores computed using Equation 5 can be used to create two illustrative scoring diagrams in Figure 2. In Figure 2A, the scores of each physics domain are drawn for a coupling between RP, FP, and TH. The predicted quantities from the TH physics domain show the best agreement with the experimental data, while the FP predictions show the worst. In Figure 2B, the scores for one quantity of interest from the reactor physics domain (e.g., the power) are compared across the coupled models of a hierarchical study.
Figure 2. Example of scoring diagrams: (a) for all physics domain outputs in a given multi-physics model and (b) for one selected output computed by various coupled models in a hierarchical study.
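To make the scoring procedure concrete, the sketch below maps an absolute error to a continuous score in [1.0, 5.0] through a piecewise-linear rule and aggregates quantity scores into a physics-domain score by an equally weighted average. The two thresholds (`err_full`, `err_zero`) and the error values are illustrative assumptions, not the benchmark's actual accuracy requirements.

```python
# Illustrative validation-scoring sketch. The piecewise-linear mapping and its
# thresholds are assumptions; in practice the accuracy requirements are set
# per quantity from a literature review.
import numpy as np

def local_score(abs_error, err_full, err_zero):
    """Map an absolute error to a score: 5.0 at/below err_full, 1.0 at/above err_zero."""
    frac = np.clip((abs_error - err_full) / (err_zero - err_full), 0.0, 1.0)
    return 5.0 - 4.0 * frac

# Hypothetical local scores for three TH quantities, aggregated into one
# physics-domain score by an equally weighted average (the role of Equation 5).
th_scores = [local_score(e, err_full=0.5, err_zero=5.0) for e in (0.4, 1.1, 2.3)]
th_domain_score = np.mean(th_scores)
print(f"TH domain score: {th_domain_score:.2f}")
```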
3.4 Consistent sample-based uncertainty propagation
As mentioned in Section 2.2, consistency in stochastic uncertainty propagation involves (1) avoiding double-counting of uncertainty across different physics domains and (2) applying required modeling and operational rules for each sample. To better understand these aspects, the fundamental input sources of uncertainties $\mathbf{x}_i$ and their samples $\mathbf{x}_i^{(n)}$, $n = 1, \ldots, N$, must first be identified for each physics domain. Whenever two coupled codes depend on the same fundamental input, the same sample must be used by both codes so that its uncertainty is not counted twice.
Aside from the fundamental input sources of uncertainties, each code will have its specific fundamental model parameter uncertainties with their respective samples $\boldsymbol{\theta}_i^{(n)}$. These samples are local to each code, but they must still be tracked so that each coupled calculation with index $n$ uses one consistent set of all fundamental samples.
Tracking all these complex sample dependencies across different multi-physics coupling levels is challenging and prone to error. Therefore, we propose the use of sample-processing diagrams for uncertainty propagation to identify and track all interactions between the fundamental samples and record the required consistency rules. Two examples of sample-processing diagrams are provided in Figures 3 and 4. Figure 3 corresponds to a full coupling between three physics domains, while Figure 4 corresponds to a two-way coupling between two of the physics domains.
Figure 3. Example of a sample-processing diagram for uncertainty propagation of a complete two-way coupling between three physics domains.
Figure 4. Example of a sample-processing diagram for uncertainty propagation of a two-way coupling between two physics domains.
Once the sample-processing diagram is constructed, the sampling of all the fundamental inputs can be performed. The complicated nature of many of the inputs (e.g., cross-sections) and their interactions through all the coupled codes might make it challenging to perform on-the-fly sampling. An offline sampling approach is recommended in this case, where many input samples are generated in advance and stored; a minimal sketch of such a workflow is given after the list below. This approach is used in the SCALE code suite (Wieselquist and Lefebvre, 2024), where a set of 1,000 pre-computed microscopic cross-section samples is provided with the code suite. Several benefits stem from such approaches.
• Modular workload. One team can focus on the stochastic sampling generation and make sure that all consistency rules are met, while another team can work on developing the coupled models.
• Repeatability. The calculations can be repeated in case of a code crash or if the calculation was stopped.
• Model comparisons. The performance of two different models can be compared on a sample-by-sample basis to better understand the discrepancies.
• Similarity studies. If the same input samples are applied to different applications, it can allow the estimation of similarity indices between these applications. For example, an experimental configuration can be designed that maximizes similarities with a selected nuclear reactor scenario.
• Sensitivity analysis. Trends and behaviors between the inputs and outputs can be used to infer some simple sensitivity analysis conclusions.
Examples of uncertainty quantification frameworks that use offline sampling approaches can be found in Delipei et al. (2022) for LWRs and Delipei et al. (2025) for high-temperature gas-cooled reactors (HTGRs).
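The sketch below illustrates the offline workflow referenced above; the file name, input categories, and distributions are illustrative assumptions. The key point is that every coupled code reads sample n from the same stored file rather than re-sampling.

```python
# Offline sampling sketch: generate the fundamental input samples once, store
# them with their sample index, and have every code in the coupling reuse them
# unchanged, so that sample n is consistent across physics domains.
import numpy as np

rng = np.random.default_rng(2024)
n_samples = 1000
samples = {
    "inlet_temperature": rng.normal(550.0, 2.0, n_samples),  # K, aleatory
    "mass_flow_rate":    rng.normal(10.0, 0.1, n_samples),   # kg/s, aleatory
    "fgr_multiplier":    rng.uniform(0.8, 1.2, n_samples),   # epistemic model param.
}
np.savez("input_samples.npz", **samples)

# Later, each coupled code loads sample n by index (never re-sampling),
# guaranteeing consistency between, e.g., the TH and FP models.
loaded = np.load("input_samples.npz")
n = 17
case_inputs = {k: loaded[k][n] for k in loaded.files}
print(case_inputs)
```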
3.5 Sensitivity analysis
Sensitivity analysis for high-fidelity multi-physics M&S is very challenging due to its high computational cost. For most UQ multi-physics studies, a realistic number of code evaluations is in the range of 100–1,000. Based on this constraint, two different categories of sensitivity analysis approaches are suggested.
The first category consists of methods that rely only on code evaluations and thus are restricted to a limited number of samples. The input/output samples obtained are used to infer different sensitivity analysis conclusions. The high dimensionality of input or output spaces is not a problem in this case, and that is why including all possible sources of uncertainties is recommended. Methods that estimate the correlation between each input and output belong to this category, and some of the main ones follow.
• Pearson correlation, which measures the linear relationship between input and output. In the presence of strong non-linearities, these correlations can be completely misleading (Iooss and Lemaître, 2015).
• Spearman correlation, which allows for a monotonic relationship between input and output (Iooss and Lemaître, 2015).
• Dependence measures, which allow for non-linear relationships between inputs and outputs. Examples of such methods are the HSIC indices (Da Veiga, 2015) and the Randomized Dependence Coefficient (Lopez-Paz et al., 2013).
From these methods, the most widely used in nuclear engineering are Pearson and Spearman due to their simplicity. In recent years, HSIC indices have started to be used as well, especially for screening purposes. Further investigation of such dependence measures is strongly recommended. The main advantages of these methods are that they account for all sources of uncertainties, the random sampling used for uncertainty propagation can also be utilized to estimate these correlations/dependencies without additional calculations, and the correlations/dependencies estimations are computationally inexpensive, allowing the methods to be applicable to large input and output spaces. The main drawback is that the estimated correlations/dependencies encompass all the types of sensitivities discussed in Section 2.2.7 for all inputs, making it difficult to isolate each contribution. A careful interpretation of the computed correlations/dependencies should be performed because they do not necessarily relate to causal effects. Depending on the sampling size, statistical noise can be significant, requiring the analysis of only statistically significant results. For example, Delipei et al. (2022) determined that for a sample size of 100, only correlations above 0.35 are considered statistically significant.
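The following sketch illustrates this first category on synthetic input/output samples, computing Pearson and Spearman coefficients together with their p-values; the stand-in model and sample size are assumptions.

```python
# Sample-based sensitivity screening sketch with Pearson and Spearman
# coefficients and a significance check; data are synthetic stand-ins for the
# input/output samples produced by uncertainty propagation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=(n, 3))                       # three sampled inputs
y = 2.0 * x[:, 0] + np.exp(x[:, 1]) + rng.normal(scale=0.5, size=n)

for i in range(x.shape[1]):
    r_p, p_p = stats.pearsonr(x[:, i], y)
    r_s, p_s = stats.spearmanr(x[:, i], y)
    print(f"input {i}: pearson={r_p:+.2f} (p={p_p:.3f}), "
          f"spearman={r_s:+.2f} (p={p_s:.3f})")

# With n = 100 samples, only |correlations| above roughly 0.2 are significant
# at the 5% level; the 0.35 cut-off of Delipei et al. (2022) is more conservative.
```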
Aside from aiming to estimate the correlations, the variance decomposition is possible for this limited number of samples if a linear regression approximation is used. The Johnson indices have recently been applied (Clouvel et al., 2025) to nuclear engineering studies and show very promising results for high-dimensional correlated inputs such as cross-sections. Further investigations into such efficient quantitative methods are recommended.
The second category of methods provides a more detailed and quantitative breakdown of the importance of each input, but these methods consequently require significantly more code evaluations, which are not feasible using only the multi-physics model. The Sobol and Shapley ANOVA sensitivity indices (Owen, 2014) are examples used in nuclear engineering. Both indices decompose the variance of the output into individual contributions from each input and are considered the preferred methods for global sensitivity analysis. The construction of a surrogate model is necessary for such sensitivity analysis studies. The surrogate model can have any architecture, from polynomial functions to Gaussian processes and up to neural networks and random forests. The training of the surrogate model can be challenging in complex, high-dimensional input and output spaces. To address this, dimensionality reduction and screening techniques can be used, as discussed in Section 3.1. Once the surrogate model is trained in these reduced input and output spaces, it can be used to evaluate a much larger number of samples to compute the desired sensitivity indices.
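A minimal sketch of this second category is given below: a polynomial surrogate is fit to a small training set, then first-order Sobol indices are estimated on the cheap surrogate with a pick-freeze Monte Carlo scheme (a Saltelli-type estimator). The stand-in function and sample sizes are illustrative assumptions.

```python
# Surrogate-accelerated Sobol index sketch: fit a polynomial surrogate to a
# small sample set, then estimate first-order indices by pick-freeze Monte
# Carlo on the surrogate. Model and sizes are illustrative.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
dim, n_train = 4, 200
x_train = rng.uniform(size=(n_train, dim))
y_train = x_train[:, 0] ** 2 + 0.5 * x_train[:, 1] + 0.1 * rng.normal(size=n_train)

surrogate = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
surrogate.fit(x_train, y_train)

n_mc = 100_000                                    # cheap on the surrogate
a = rng.uniform(size=(n_mc, dim))
b = rng.uniform(size=(n_mc, dim))
y_a, y_b = surrogate.predict(a), surrogate.predict(b)
var_y = np.concatenate([y_a, y_b]).var()
for i in range(dim):
    b_ai = b.copy()
    b_ai[:, i] = a[:, i]                          # B with column i taken from A
    s_i = np.mean(y_a * (surrogate.predict(b_ai) - y_b)) / var_y
    print(f"S_{i} ~ {s_i:.3f}")                   # first-order Sobol index
```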
It is important to note that, although this category of methods focuses on surrogate models, Hi2Lo and reduced-order models (ROMs) can also be used to reduce the computational cost and perform more detailed sensitivity analysis studies. Moreover, the construction of a surrogate model, Hi2Lo model, or ROM can be used to accelerate uncertainty propagation. In this case, the dimensionality reduction, screening, and the model errors will need to be estimated and included in the total predicted uncertainty for the outputs of interest (Equation 1).
3.6 Scarce experimental data with limited uncertainty information
The amount of currently available validation data for high-fidelity multi-physics M&S is very limited and mainly comes from historical experiments. There is thus a strong need for validation experiments, where the main goal of the experiment is the VVUQ of multi-physics M&S. Validation experiments differ from traditional experiments since they do not so much emphasize the controlled conditions of the experiment as the complete characterization of all possible conditions and of the associated uncertainties (Oberkampf and Roy, 2010). Advanced instrumentation techniques are needed for efficient high-resolution multi-physics measurements. Promising examples of such techniques are the micro-pocket fission detector that is able to simultaneously measure local neutron flux and temperature (Unruh et al., 2017) and the instrumented capsule used to perform in situ irradiation measurements in advanced reactors (Downey et al., 2024).
The scarcity of experimental data requires validation metrics that can be applied even for a few measured values, with or without reported uncertainties. The area metric is a good example (Oberkampf and Roy, 2010) and is recommended because it adapts to both the uncertainty representation (e.g., aleatory and epistemic) and the amount of available data points. In extreme situations, where only one value is measured without uncertainty and a model prediction is also made without uncertainty estimation, the area metric is equivalent to the absolute difference between the prediction and measurement.
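The sketch below computes the area metric for the common scarce-data case of a single measured value without reported uncertainty, treated as a step CDF against the empirical CDF of the model's prediction samples; the data are synthetic stand-ins. When the prediction is also a single deterministic value, the computed area reduces to the absolute difference mentioned above.

```python
# Area metric sketch: exact area between the empirical prediction CDF and the
# step CDF of a single measurement (both CDFs are piecewise constant).
import numpy as np

def area_metric(pred_samples, measured):
    pts = np.sort(np.append(pred_samples, measured))   # breakpoints of both CDFs
    mids = pts[:-1]                                    # CDFs constant on [pts[k], pts[k+1])
    f_model = np.searchsorted(np.sort(pred_samples), mids, side="right") / len(pred_samples)
    f_data = (mids >= measured).astype(float)          # step CDF of the measurement
    return np.sum(np.abs(f_model - f_data) * np.diff(pts))

pred = np.random.default_rng(5).normal(loc=10.0, scale=0.5, size=400)
print(f"area metric: {area_metric(pred, 10.8):.3f}")
```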
3.7 Multivariate outputs
The area metric, recommended in the previous section, applies to univariate outputs. It is quite common, however, in the multi-physics M&S of nuclear reactors to predict multivariate spatial and temporal outputs. Dedicated validation metrics and approaches are needed for such outputs that capture their correlations.
The U-pooling technique is one method that can be used to combine information across multiple points in space and time and even across different quantities (Oberkampf and Roy, 2010). This technique uses the probability integral transform (PIT) theorem, which allows for the conversion of any measurement of any quantity through the corresponding prediction CDF to a $u$-value on [0, 1]. If the model predictions were perfectly calibrated, the pooled $u$-values would follow a standard uniform distribution; the area metric between the empirical CDF of the pooled $u$-values and the uniform CDF then provides a single global validation metric.
To better showcase how these metrics can be applied locally, regionally, and globally, it is important to highlight that these are univariate metrics between a prediction at some point of the state space (e.g., a given spatial location and time instant) and the corresponding measurement. Regional and global metrics can then be constructed as weighted aggregations of the local metrics over the selected region or over the entire state space.
The exact values of the weights will depend on the nature of the predicted quantity (e.g., point estimate or volumetric average) and on the spatial and temporal resolution of the predictions and measurements.
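A minimal sketch of U-pooling is shown below, using synthetic prediction ensembles and measurements at several state-space points; the ensemble sizes and distributions are assumptions. Each measurement is transformed through the empirical prediction CDF at its own point, and the pooled u-values are compared against the standard uniform CDF with the area metric.

```python
# U-pooling sketch: probability-integral-transform each measurement through the
# empirical CDF of the prediction ensemble at the same space/time point, then
# compare the pooled u-values against the uniform CDF. Data are synthetic.
import numpy as np

rng = np.random.default_rng(11)
n_points, n_ens = 30, 200
preds = rng.normal(loc=5.0, scale=1.0, size=(n_points, n_ens))  # ensemble per point
meas = rng.normal(loc=5.2, scale=1.0, size=n_points)            # one measurement per point

# PIT: u_k = F_pred,k(y_meas,k), using the empirical prediction CDF
u = np.array([(preds[k] <= meas[k]).mean() for k in range(n_points)])

# Area metric between the empirical CDF of the pooled u-values and the uniform CDF
u_sorted = np.sort(u)
grid = np.linspace(0.0, 1.0, 501)
ecdf = np.searchsorted(u_sorted, grid, side="right") / n_points
u_pool = np.abs(ecdf - grid).mean()   # integral over [0, 1] via a Riemann average
print(f"u-pooling metric: {u_pool:.3f}  (0 = perfect calibration)")
```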
3.8 Model predictive capability
The evaluation of the model predictive capability for high-fidelity multi-physics models poses a significant challenge, as discussed in Section 2.2.10. Computational cost and model complexity limit the applicability of various existing methods. A first category of approaches is the interpolation/extrapolation of model accuracy based on some identified features following a PIRT. The features determine the dimensionality of the validation and application domains, and the interpolation/extrapolation can be performed with different techniques (e.g., linear). A second category of approaches leverages the estimation of stochastic similarity indices between validation experiments and the application of interest. Deterministic similarity indices based on sensitivity studies or adjoint-based computations, such as those performed by the TSURFER module within the SCALE code suite (Wieselquist and Lefebvre, 2024), are not currently possible for high-fidelity multi-physics calculations. Examples of stochastic similarity indices can be found in Huang et al. (2020) and Park and Park (2024). An interesting approach, which is related to similarity indices, is the Physics-guided Covered Mapping (PCM) discussed in Abdel-Khalik et al. (2015). Both stochastic similarity indices and the PCM require the consistent sampling procedure discussed in Section 3.4 and can serve as promising techniques for model predictive capability. The main required functionality is the ability to apply the same samples across all models for the validation experiments and the application. Additionally, the results obtained on the validation experiments can be reused for multiple applications of interest without additional code evaluations.
3.9 Multi-physics benchmarks
The existing ICSBEP and IRPhEP benchmarks set the standards and quality for the desired benchmark guidelines of high-fidelity multi-physics M&S. The same general structure should be used, consisting of the four main aspects discussed in Section 2.2.11 (Bess and Ivanova, 2020). The uncertainty guide might need to be adapted to accommodate some of the challenges mentioned above. The peer review will need to include experts from each physics domain but also experts in multi-physics M&S at different stages of the review process discussed in Bess and Ivanova (2020), such as internal reviewers, independent reviewers, technical review group, subgroup reviewers, and the international user community.
The major aspects that need to be updated are the subsections of the benchmark handbook that are currently based on the different possible experimental measurements and configurations. In multi-physics M&S, the different measured quantities cannot be so easily separated. Additionally, only multi-physics experiments should be included in the multi-physics benchmark. Any experiment that does not involve feedback effects from different physics domains should be submitted to the appropriate existing single-physics benchmark handbook (e.g., IRPhEP). From the subsections present in IRPhEP, the following are also relevant for the multi-physics benchmarks: critical configuration, isotopic measurements, and other miscellaneous types of measurements. It is proposed that the structure of the subsection should reflect the experiment type (e.g., steady state and transient) and the resolution of the experiment (pin resolution, assembly resolution, and integral). The following nested structure is proposed for the subsections of a multi-physics benchmark handbook as a starting point.
• Subsection 1.1. critical steady-state configuration
o Subsection 1.1.1 integral measurements
⁃ critical boron concentration …
o Subsection 1.1.2 local measurements
⁃ temperature distribution …
• Subsection 1.2. cycle depletion configuration
o Subsection 1.2.1 integral measurements
⁃ boron letdown curve …
o Subsection 1.2.2 local measurements
⁃ void fraction distribution, isotopic measurements …
• Subsection 1.3. transient configuration
o Subsection 1.3.1 integral measurements
⁃ power evolution …
o Subsection 1.3.2 local measurements
⁃ fuel axial elongation …
• Subsection 1.4. other miscellaneous types of measurements
The subsections are separated at the first level by the general experiment type. Critical steady-state configurations involve experiments performed for a specific fixed core condition. Cycle depletion configuration includes experiments that exhibit slow irradiation effects over a specific time horizon (e.g., one cycle and multi-cycle). The transient configuration includes any experiment performed under transient conditions that can cover different time scales (e.g., short or long transients). Any experiment not covered by these three general categories is included in the last subsection. This broad categorization is coarser than the IRPhEP to allow flexibility in the different scales and types of measurements in multi-physics experiments. As in IRPhEP, of course, the complete description of the geometry and materials for each experiment is provided. At the second level, the integral measurements are separated from the local (assembly and pin) measurements. At the third level, the different measured or derived quantities are provided together with a detailed description of the measurement techniques, processing, and associated uncertainties.
In the benchmark specifications section of the multi-physics handbook, each measured or derived quantity in the multi-physics benchmarks can also be associated with a specific physics domain. Some commonly measured quantities by the physics domain include:
• reactor physics: reactor power and power distribution, reactivity, neutron flux;
• fuel performance: fuel/cladding temperature, fuel burnup/isotopic analysis, fuel/cladding stress and deformations, fuel/cladding microstructure changes (e.g., oxidation);
• thermal-hydraulics: coolant temperature, coolant mass flow rate, coolant density, coolant pressure, void fraction.
The benchmark specifications can be defined progressively, both in terms of model complexity and physical phenomena complexity. For example, a first benchmark case can involve the modeling of the initial conditions in Subsection 1.1, while a second case can focus on the transient itself in Subsection 1.3. Each benchmark case can also be applicable to multiple coupling levels to allow the multi-level validation discussed in Section 3.3. The quantities for physics domains not covered by the coupling must then be provided by the benchmark as boundary conditions (e.g., the measured linear heat rate for a standalone fuel performance model).
4 MPCMIV benchmark case study
In the previous sections, we provided a general background about VVUQ in the context of nuclear engineering and focused specifically on high-fidelity multi-physics M&S. Various current gaps and challenges were identified related to model validation. General guidelines and recommendations were provided to address some of these challenges. In this section, some of these guidelines are demonstrated using the MPCMIV benchmark (OECD/NEA, 2025b). MPCMIV is a multi-physics transient benchmark addressing both traditional and novel multi-physics M&S tools. The main aspects that are investigated are (1) model complexity through the use of the multi-level validation scores and (2) the use of local, regional, and global validation metrics for the multivariate output data. The other recommendations, such as the consistent uncertainty propagation, are left for future research.
4.1 MPCMIV benchmark specifications
The MPCMIV benchmark was supported by the U.S. DOE NEUP program and is developed within the OECD/NEA Expert Group on Reactor System Multi-Physics (EGMUP). The main goal of the MPCMIV benchmark is to provide validation data for transient multi-physics M&S that apply to both traditional and novel multi-physics tools. Additionally, the benchmark aims to contribute to the development of integral evaluations, of multi-physics transient benchmark protocols and guidelines, and of validation principles and practices.
The MPCMIV benchmark is based on experimental data from two cold ramp tests conducted at the Studsvik R2 reactor. The fuel rodlet used in the cold ramp tests was refabricated from a father rod previously irradiated in a boiling water reactor (BWR) for three cycles. The two cold ramp tests were performed in one of the two in-pile loops available at the R2 reactor. Various quantities were measured during the cold ramp tests for the three physics domains involved (RP, TH, and FP). For RP, the power was estimated based on a calorimetric technique. For TH, the rod inlet/outlet temperature and mass flow rates were measured. For FP, the cladding outer diameter and axial elongation were measured. The cold ramp experiments were performed by inserting the fuel rodlet from the top into the in-pile loop irradiation position. The fuel rodlet was irradiated at a constant power for less than 1 minute before the reactor was shut down. The in-pile loop has a nested structure (Figure 5, from Faure (2024)). Starting from the outside and moving inward, the following structures comprised the in-pile loop: the U-tube, the ramp rig in the descending side of the U-tube, the ramp capsule within the ramp rig, and the fuel rodlet within the ramp capsule. Water flowed between the structures through holes at various radial locations.
Figure 5. R2 in-pile loop diagram used for the MPCMIV cold ramp tests (Faure, 2024).
To accurately model the two cold ramp tests, two spatial computational domains are required: the modeling of (1) the whole R2 Studsvik reactor, and (2) the rodlet domain that experiences the cold ramp test, including the father rod irradiation and the R2 reactor in-pile loop. Four modeling tiers are available that require different benchmark simulations.
• Tier-1 addresses high-fidelity multi-physics tools that model both the R2 reactor domain (criticality and cycle depletion) and the rodlet domain during base irradiation of the father rod and the cold ramp tests.
• Tier-2 addresses high-fidelity multi-physics tools that model only the rodlet domain during base irradiation of the father rod and the cold ramp tests. Simplified boundary conditions are provided by the benchmark to account for the R2 reactor’s surrounding environment.
• Tier-3 addresses traditional multi-physics tools that model only the rodlet domain during base irradiation of the father rod and the cold ramp tests. Simplified boundary conditions are provided by the benchmark to account for the R2 reactor’s surrounding environment.
• Tier-4 addresses fuel performance single-physics codes that model only the rodlet domain during base irradiation of the father rod and the cold ramp tests. Simplified boundary conditions are provided by the benchmark for the TH and RP physics domains.
For each modeling tier, four different computational benchmark phases are required.
• Phase 1/model development: models of the individual physics domains are qualified and verified before performing the multi-physics simulations.
• Phase 2/pre-qualification: multi-physics model simulations are performed for the first cold ramp test. The model’s accuracy is measured by comparison against the available measurements, and the input sources of uncertainties are quantified.
• Phase 3/blind simulation: modeling and simulation of the second cold ramp test without access to the experimental data. Uncertainty propagation is performed to estimate the model predictive uncertainty.
• Phase 4/post-test: the experimental data are disclosed to the participants to assess the model’s predictive capability and to perform sensitivity analysis studies.
Results for Tier-2 and -4 and for Phases 1 and 2 were obtained by Faure (2024) and Faure et al. (2024). We present the results obtained for Tier-2 and Phase 2 within the context of the high-fidelity multi-physics M&S validation guidelines to illustrate some of the recommendations provided in Section 3. It is important to note that because the benchmark is still ongoing, the exact values for some of the measured inputs and outputs cannot be provided. Therefore, relative values are used instead.
4.2 Multi-physics modeling
Tier-2, as mentioned above, focuses on the fuel rodlet domain from its base irradiation up to the cold ramp test within the in-pile loop. The father rod irradiation results, including the approximate burnup reached by the rodlet, can be found in Faure et al. (2024).
Griffin is a reactor physics code that can solve the diffusion and transport (SN) equations and can perform fixed-source, eigenvalue, depletion, and transient calculations (Wang et al., 2025). Griffin uses the MOOSE finite element framework for efficient high-fidelity M&S and includes continuous (CFEM) and discontinuous (DFEM) finite element capabilities. The latter achieves higher-accuracy solutions at the expense of computational cost. Various acceleration techniques are used, such as coarse mesh acceleration methods or an asynchronous parallel sweeper. For the Griffin modeling of the first cold ramp test, only the descending part of the U-tube, where the ramp capsule is located, is modeled with a 3D geometry (Figure 7). The DFEM-SN transport solver is used with two neutron energy groups. All the main structures and coolant channels are considered. The fuel rodlet is separated into the fuel region and a homogenized gap-cladding region. In the fuel region, a mesh size similar to that of the Bison model is employed. The mesh was created using the volume conservation approach, with 25 radial and 24 azimuthal divisions in the fuel. One axial division is used every 5 mm. Vacuum boundary conditions are applied on the outer surfaces, while the incoming neutron current from the R2 reactor on these surfaces is provided by previous benchmark studies. The material cross-sections are provided by the benchmark using a 3D Monte Carlo model of the in-pile loop developed with the Serpent code (Leppänen et al., 2015). The cross-sections are parametrized in fuel temperature, moderator temperature, and moderator density as follows.
• Fuel temperature: 300 K, 600 K, 900 K, 1,200 K, 1,500 K, and 1,800 K.
• Moderator temperature: 300 K, 350 K, 400 K, and 500 K.
• Moderator density: 0.6 g/cm3, 0.8 g/cm3, 0.9 g/cm3, and 1.0 g/cm3.
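Given this parametrization, a coupled calculation must evaluate the tabulated cross-sections at the local feedback state returned by the FP and TH solvers. The sketch below does so with trilinear interpolation over the grid above; the cross-section values themselves are random placeholders, not benchmark data.

```python
# Cross-section lookup sketch: trilinear interpolation over the benchmark's
# (fuel temperature, moderator temperature, moderator density) grid.
# The tabulated values are random placeholders.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

t_fuel = np.array([300.0, 600.0, 900.0, 1200.0, 1500.0, 1800.0])  # K
t_mod = np.array([300.0, 350.0, 400.0, 500.0])                    # K
rho_mod = np.array([0.6, 0.8, 0.9, 1.0])                          # g/cm^3

# Placeholder library: one value per (t_fuel, t_mod, rho_mod) grid point
xs_table = np.random.default_rng(9).uniform(0.01, 0.02, size=(6, 4, 4))
xs = RegularGridInterpolator((t_fuel, t_mod, rho_mod), xs_table)

# Query at the local feedback state passed from the FP and TH solvers
print(xs([[750.0, 320.0, 0.95]]))   # interpolated cross-section [1/cm]
```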
THM is a MOOSE-based 1D system thermal-hydraulics module for single-phase fluids (Hansel et al., 2024). The module offers multiple fluid properties (e.g., gas and water), a wide variety of components, closure laws, and logical control systems, as well as 2D and 3D heat structures where the heat conduction equation can be solved.
The THM model used for the first cold ramp test covers the system shown in Figure 5. Due to the 1D system modeling, the 3D junctions were simplified to volumetric junctions. The axial flow areas were also homogenized because of variations over very short distances (approximately a few mm) at some locations along the flow channels and because of some complex 2D geometries that cannot be captured in detail with the THM model. An average discretization of 10 cm per element was used. The in-pile loop was considered thermally insulated from the surrounding reactor environment, resulting in no heat transfer between the reactor and the in-pile loop. Gamma irradiation heating from the R2 reactor on the metallic structures and coolant was provided by the benchmark following studies at different power levels. The Dittus–Boelter correlation was used to compute the heat transfer coefficients between the metallic structures and the coolant channels. The Churchill correlation was used to compute the Darcy friction factor between the metallic structures and the coolant.
Bison is a fuel performance code for advanced simulations of various nuclear fuel types (e.g., UO2, TRISO, and metallic fuel) in normal and accident conditions (Williamson et al., 2021). Bison includes 1D, 2D, and 3D geometric modeling capabilities and can perform both smeared and discrete pellet modeling. Different fuel performance models are available for the various complex irradiation phenomena throughout a nuclear fuel's life. Examples of these models are fuel and cladding creep, fission gas release, and fuel cracking. Regarding the pellet-to-clad contact, which is of interest to this study, the mortar contact model (Recuero et al., 2022), which aims to predict PCMI with higher accuracy than traditional methods, is available in Bison.
The OpenFOAM Fuel Behavior Analysis Tool (OFFBEAT) is a fuel performance code based on the open-source finite volume C++ library OpenFOAM (Scolaro et al., 2020). Like Bison, OFFBEAT offers 1D, 2D, and 3D modeling capabilities for normal and accident conditions. Flexible finite volume meshing allows the treatment of smeared and discrete pellets. A wide range of material properties and fuel performance models are included in OFFBEAT. For example, MATPRO is used for some material properties, and the open-source 0D code SCIANTIX is used for fission gas release. OFFBEAT relies on the rigid pellet hypothesis but allows both small and large strains. The PCMI modeling allows 50% of the relocation to be recovered; however, the friction between fuel and cladding, and thus the transfer of axial elongation, occurs during both soft and hard contact.
Fuel performance modeling was performed with both the Bison and OFFBEAT codes for the reasons mentioned previously. A common model was used for both the father rod and the refabricated rodlet to facilitate the transition between the irradiation in the BWR and the cold ramp test in the R2 reactor. For the rodlet model, the base irradiation boundary conditions at the axial position of the fuel rodlet were used (e.g., linear power), and the plenum length was scaled by preserving the plenum-to-rod length ratio. This process introduced small errors that were deemed not influential. For both the Bison and OFFBEAT models of the base irradiation and the cold ramp test, the default fuel performance models were used (Table 3). The PCMI contact model is of high importance for the cold ramp test. OFFBEAT handles PCMI through a penalty formulation using a selected friction coefficient. In Bison, the Coulomb and mortar options were selected, but due to convergence issues, the friction between the fuel pellet and the cladding was not modeled. Concerning the geometric modeling of the rodlet, an axisymmetric 2D R-Z model was used with smeared pellets in both Bison and OFFBEAT. The axial and radial nodalization selected for each code resulted from verification studies on the convergence of the cladding axial elongation during the first cold ramp test. In Bison, 208 axial and 45 radial divisions were used; in OFFBEAT, 100 axial and 45 radial divisions were used.
4.3 Multi-level validation and scoring approach
Different coupling levels were tested following the multi-level model hierarchy of Figure 1. The only level-0 model we considered was single-physics fuel performance because the cladding axial elongation is the measured quantity that exhibits the largest variation and, consequently, is the most difficult to predict accurately. RP–FP and FP–TH level-1 coupling models were developed, as well as the RP–FP–TH fully coupled level-2 model. The last model corresponds to the multi-physics scheme shown in Figure 6. From the measured quantities during the first cold ramp, one from the fuel performance physics domain and three from the thermal-hydraulics physics domain were considered.
• Cladding axial elongation.
• Inlet temperature of the ramp rig.
• Outlet temperature of the ramp rig.
• Outlet temperature of the in-pile loop.
All these measured quantities were available at each time step during the cold ramp test, leading to temporal multivariate outputs, as discussed in Section 2.2.9. The validation metric used locally at each time step was the absolute error between the prediction and the measurement. Consistent multivariate validation metrics will be investigated in future studies. One local, two regional, and one global metric were computed for each measured quantity. The local metric was simply the absolute error at the instant of the measured peak value. Two regional metrics were computed: region 1 covers the first two-thirds of the cold ramp, capturing the rapid transient effects, and region 2 covers the last third of the transient, capturing the slower transient effects as the steady state is approached. For these regional metrics, the temporal mean and standard deviation of the absolute error are provided. A small standard deviation indicates a relatively constant discrepancy over time around the mean value, whereas a large standard deviation indicates significant variation in the discrepancies between prediction and measurements over time. For the global metric, as for the regional metrics, the mean and standard deviation of the absolute error were computed over the entire transient.
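The sketch below implements this local/regional/global metric computation on a synthetic predicted and measured time series; the time grid and curves are illustrative stand-ins for the benchmark data.

```python
# Local/regional/global validation metric sketch on a synthetic time series.
import numpy as np

t = np.linspace(0.0, 60.0, 121)                 # s, illustrative transient duration
measured = 1.0 - np.exp(-t / 10.0)              # stand-in measurement
predicted = 1.0 - np.exp(-t / 12.0)             # stand-in prediction
abs_err = np.abs(predicted - measured)

i_peak = np.argmax(measured)                    # local: error at the measured peak
local_metric = abs_err[i_peak]

split = int(2 * len(t) / 3)                     # regional: first 2/3 and last 1/3
region1 = (abs_err[:split].mean(), abs_err[:split].std())
region2 = (abs_err[split:].mean(), abs_err[split:].std())

global_metric = (abs_err.mean(), abs_err.std()) # global: whole transient
print(local_metric, region1, region2, global_metric)
```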
All these different metrics can be combined with specific accuracy requirements to estimate a validation score for each measured quantity and model. The measured quantities belonging to the same physics domain are aggregated using Equation 5 to assign one score to each physics domain. For the fuel performance domain, and more specifically for the cladding axial elongation, accuracy requirements mapping the absolute error to the 1.0–5.0 scoring range were selected.
These accuracy requirements for the validation scores are based on a literature review of expected discrepancies between predictions and measurements. In fuel performance models, the onset of PCMI is very difficult to predict due to all the interacting phenomena that impact the closure of the gap and the subsequent elongation transfer from the fuel to the cladding. Some of the phenomena involved are cladding creep, fuel expansion, fuel swelling, and fuel relocation. Fuel performance codes typically use different models for all these phenomena. Additionally, geometric modeling and nodalization can have a strong impact (e.g., 1D vs. 2D). A recent OECD/NEA benchmark evaluated the impact of such effects on the cladding axial elongation for multiple ramp-ups followed by a holding period and a ramp-down (OECD/NEA, 2024). Different maximum LHRs were considered in the range [40.5–54.7] kW/m. The results show a large spread, with some participants predicting the cladding axial elongation within a 25% margin, while others predicted as much as three times more elongation than the measurements. It is important to note that the ramp-ups in that benchmark last 6 hours, while in the MPCMIV benchmark they last less than 1 minute. This example is simply provided to illustrate the difficulties in accurately predicting the cladding axial elongation.
For the thermal-hydraulics domain, accuracy requirements for the temperature predictions were likewise selected based on a literature review. An example that shares some similarities with the MPCMIV cold ramp test in terms of coolant temperature variation can be found in Licht et al. (2015).
4.4 Modeling and simulation results
This section discusses the results for the different coupling levels together with the corresponding validation scores. The validation metric used is the absolute error, which is computed locally at the maximum value, regionally for two regions covering the first two-thirds and the last third of the transient, and globally for the whole transient. For the regional and global metrics, the mean value and the standard deviation of the absolute error are provided. The validation score is computed for each physics domain using the absolute error locally and the mean of the absolute error regionally and globally. The validation scores for the three TH quantities are aggregated using Equation 5. The results of the cladding axial elongation are normalized to their maximum value because the benchmark is still ongoing. The predicted temporal evolution is compared against the measurements. For the three TH temperatures, the differences between the predictions and the measurements over time are provided due to the small variations in the temperatures.
For coupling level 0, a FP standalone model was considered. For coupling level 1, RP–FP and FP–TH models were used. For coupling level 2, a full coupling RP–FP–TH was studied. All calculations were performed on the INL Sawtooth supercomputer using one computing node with 48 CPUs. Approximate runtimes for these models are provided in Table 4, ranging from 1 h to 42 h. It is important to note that the models discussed here are preliminary and are used mainly to illustrate some of the concepts discussed in Section 3. More details about the modeling choices for each code can be found in Faure (2024).
4.4.1 FP single-physics
For the FP single-physics study, the LHR measured through the calorimetric technique was used, which is provided by the MPCMIV benchmark with a flat axial profile. For the coolant conditions, the available measurements were used. These boundary condition data replace the RP and TH physics domains that are not covered by the single-physics model.
The local, regional, and global Bison validation metrics results are shown in Table 5 together with the respective validation scores. The metrics show large discrepancies between prediction and measurements that are attributed to the lack of a friction model. The validation score is 1.0 across all metrics. The OFFBEAT results are provided in Table 6. Although they show an improvement, the discrepancies are still significant, resulting in validation scores of 1.0 for the region and global metrics. A validation score of 1.7 is obtained for the maximum value. The improvement is attributed to the contact model in OFFBEAT. The predictions of the cladding axial elongation are compared against the measurements in Figure 8. The time of the contact between the cladding and fuel can be identified in OFFBEAT predictions halfway through the cold ramp, where the predicted cladding axial elongation increases rapidly, leading to improved results. However, the measurements indicate that the contact probably occurred very early in the ramp, which explains the large discrepancies for both codes in the region and global metrics.
Table 5. Bison single-physics results for cladding elongation prediction. For the region and global validation metrics, the mean and standard deviation of the absolute error are provided.
Table 6. OFFBEAT single-physics results for cladding elongation prediction. For the region and global validation metrics, the mean and standard deviation of the absolute error are provided.
Figure 8. Cladding axial elongation comparison between the predictions from OFFBEAT and Bison single-physics calculations and the measurements.
4.4.2 FP–TH coupling
The FP–TH multi-physics study used the coupling framework of Figure 6 without Griffin. Instead, the LHR measured through the calorimetric method was used, which is provided by the MPCMIV benchmark with a flat axial profile. The local, regional, and global validation metrics and scores results are shown in Table 7. The comparisons of the predictions against measurements are shown in Figure 9. For the cladding axial elongation, the same large discrepancies were obtained as in the FP single-physics study. A validation score of 1.0 was estimated across local, regional, and global scales. For the coolant temperatures, the discrepancies were small, indicating that this modeling fidelity can be satisfactory for these quantities. The validation score ranges between 3.11 and 4.97 across all scales. This is attributed to the simple phenomena occurring during the cold ramp test in the coolant and highlights that the main difficulty is in accurately predicting the cladding axial elongation.
Table 7. FP–TH coupling results for cladding elongation prediction and the three temperatures. For the region and global validation metrics, the mean and standard deviation of the absolute error are provided.
Figure 9. Comparison between the predictions from FP–TH multi-physics calculations and the measurements.
4.4.3 RP–FP coupling
The RP–FP multi-physics study used the coupling framework of Figure 6 without THM. Instead, the coolant conditions were provided by the available measurements. The cladding axial elongation is the only predicted quantity from this coupled model that can be compared against measurements. The local, regional, and global validation metrics and validation scores are shown in Table 8. The comparisons of the predictions against measurements are shown in Figure 10. A notable improvement is obtained for this multi-physics coupling, with the validation score ranging between 4.0 and 4.8. This improvement is attributed to the prediction of the LHR by Griffin using benchmark data for the surrounding reactor environment. The LHR measured using the calorimetric technique is based on the coolant temperature difference between the inlet and the outlet. However, due to the short duration of the cold ramp, the delay in heating the coolant was significant, leading to a slower apparent increase in the LHR. This slower increase led to delayed contact between the fuel and the cladding and, consequently, an underprediction of the cladding axial elongation. The LHR predicted in the RP–FP model increased sharply to its maximum value as soon as the fuel rodlet was inserted into the reactor, which led to the closure of the gap at the beginning of the cold ramp. This allowed significantly more fuel axial elongation to be transferred to the cladding, leading to the improved results.
Table 8. RP–FP multi-physics results for cladding elongation prediction. For the region and global validation metrics, the mean and standard deviation of the absolute error are provided.
Figure 10. Cladding axial elongation comparison between the predictions from RP–FP multi-physics calculations and the measurements.
4.4.4 RP–FP–TH coupling
The RP–FP–TH multi-physics study used the coupling framework of Figure 6. The local, regional, and global validation metrics and scores are shown in Table 9. The comparisons of the predictions against measurements are shown in Figure 11. Very good performance was obtained for both the cladding axial elongation and the coolant temperatures, with validation scores ranging between 4.2 and 5.0. As mentioned for the FP–TH coupling, the coolant temperatures exhibited small variations without complicated phenomena, which makes predicting them with satisfactory accuracy relatively easy. The cladding axial elongation prediction, on the other hand, was significantly more complicated due to the PCMI. The RP–FP–TH coupling obtains results very close to the measurements for the same reason as the RP–FP coupling: the predicted sharp LHR increase at the beginning of the cold ramp, as opposed to the slower LHR obtained by the calorimetric technique, led to an earlier onset of the PCMI. An interesting observation is that the FP–TH model, when driven by the LHR predicted by the RP–FP–TH model, shows results similar to the latter model.
Table 9. RP–FP–TH coupling results for cladding elongation prediction and the three temperatures. For the region and global validation metrics, the mean and standard deviation of the absolute error are provided.
Figure 11. Comparison between the predictions from the RP–FP–TH multi-physics calculations and the measurements.
4.4.5 Multi-level validation scoring
The computation of the various validation scores allows the drawing of different model scoring diagrams that can be quite informative, as discussed in Section 3.3. For the MPCMIV example used here, the diagram showing the performance of the RP–FP–TH multi-physics model in each individual physics domain is drawn in Figure 12. In this diagram, the global scores corresponding to each measured quantity over the whole duration of the cold ramp test are aggregated by physics domain. Because there is no measured quantity for RP, a value of 1 is used as the validation score. The diagram shows that the multi-physics model is able to predict both FP and TH physics domains quantities with excellent agreement. It is important to note that in this demonstration case, FP-related phenomena dominated over TH phenomena, rendering the accurate prediction of the cladding axial elongation significantly more challenging than the coolant temperatures, as indicated by the results in Tables 7 and 9. Consequently, the specified accuracy requirements for TH are comparatively easy to satisfy in this transient case. This underscores that conclusions drawn from such studies are case-dependent and might not generalize to other transients where stronger TH phenomena might be present. Furthermore, the selected accuracy requirements will also depend on the application of interest.
Figure 12. Validation score diagram by physics domain for the RP–FP–TH multi-physics coupling over the whole cold ramp test. The RP has a value of 1 because no measurements were used from this physics domain.
Other diagrams can be drawn from the computed validation scores. An example is the diagram focusing on one measured quantity and evaluating the performance of different coupling levels. In Figure 13, the diagrams for the cladding axial elongation at the time of its maximum value (Figure 13A) and over the whole cold ramp (Figure 13B) are drawn. It can be concluded that both RP–FP and RP–FP–TH models show excellent performance regarding the cladding axial elongation prediction for the MPCMIV first cold ramp test at both the local and global scales. When RP is not included in the coupling, the performance deteriorates significantly for the reasons explained in the previous sections. One potential conclusion from this analysis is that the RP–FP coupling level is accurate enough to perform the uncertainty quantification for the cladding axial elongation, reducing coupling complexity and computational cost.
Figure 13. Multi-level validation score diagram for the cladding axial elongation: (a) at the local scale of the maximum value and (b) at a global scale over the entire transient.
5 Conclusion
High-fidelity, high-resolution multi-physics M&S capabilities are increasingly being developed and applied in nuclear engineering applications. The predictive credibility of these novel tools relies on appropriate VVUQ practices, which differ significantly from those used for traditional M&S tools. This study provides guidelines and recommendations that address key challenges regarding the model validation and uncertainty quantification of such high-fidelity, high-resolution multi-physics tools.
A general background of VVUQ for scientific computing models is first presented to clarify essential concepts and terminology, such as sources of uncertainties, validation metrics, and model predictive capability. Relevant validation efforts for multi-physics M&S in nuclear engineering are discussed, grouped into three main categories: (1) the development of approaches for safety analysis that meet regulatory requirements, (2) the development of benchmarks for code-to-code and code-to-measurement comparisons, and (3) the development of qualified multi-physics tools. Each category has distinct objectives. Several major challenges for high-fidelity, high-resolution multi-physics model validation and uncertainty quantification across these three categories are identified and discussed. These include the large number of inputs and outputs, complex coupled modeling exhibiting strong non-linearities and discontinuities, the high computational cost limiting the use of consistent and comprehensive uncertainty quantification and sensitivity analysis approaches, the scarcity of relevant experimental data, and the need for appropriate guidelines and protocols for developing multi-physics benchmarks and handbooks.
Guidelines and recommendations are provided to address some of these challenges. Dimensionality reduction and screening techniques are suggested to manage the large number of inputs and outputs. To address modeling complexity, a multi-level validation hierarchy is recommended that progressively increases the level of coupling among the physics domains. A validation scoring approach is introduced to compare results across coupling levels and identify gaps in modeling and experimental data coverage. Regarding computational cost, surrogate models can be used, but they require the estimation of an additional model uncertainty. To ensure consistent uncertainty propagation, sample-processing diagrams are proposed to prevent sampling inconsistencies among multiple inputs. For the validation of multivariate outputs, such as time series, appropriate validation metrics are discussed. Local, regional, and global univariate metrics can be used, but they do not treat the correlations between the different outputs. More sophisticated approaches based on the U-pooling method exist in the literature and warrant further investigation for nuclear engineering applications. For multi-physics benchmarks, a tentative structure is proposed using the IRPhEP and ICSBEP benchmark projects as the foundation.
Some of the recommendations are demonstrated using the MPCMIV benchmark, particularly for the multi-physics modeling of the fuel rodlet during the first cold ramp test. The adopted multi-level modeling hierarchy includes FP single-physics models, and FP–TH, RP–FP, and RP–FP–TH multi-physics models. The measurements considered in this work include the cladding axial elongation and the coolant temperature at three different locations. Validation metrics are computed at local, regional, and global scales. Validation scores are computed for each model and for each physics domain. The results highlight the necessity for at least a coupling between the RP and FP to accurately predict the cladding axial elongation. The coolant temperatures are less sensitive to the coupling level due to their small variations during the cold ramp test.
Future research will demonstrate some of the remaining recommendations using the MPCMIV benchmark. Some examples include uncertainty propagation and sensitivity analysis. Finally, the development of systematic multi-physics benchmark protocols will be investigated.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
GD: Supervision, Data curation, Formal Analysis, Writing – review and editing, Writing – original draft, Methodology, Investigation, Visualization, Conceptualization. QF: Writing – original draft, Visualization, Formal Analysis, Validation, Data curation, Writing – review and editing, Software, Investigation. MA: Project administration, Conceptualization, Investigation, Resources, Writing – original draft, Supervision, Funding acquisition, Writing – review and editing. KI: Writing – original draft, Investigation, Resources, Conceptualization, Project administration, Funding acquisition, Writing – review and editing, Supervision.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This research was supported by the U.S. Department of Energy Nuclear Energy University Program (NEUP) Project No. 20-19590: Benchmark Evaluation of Transient Multi-Physics Experimental Data for Pellet Cladding Mechanical Interactions.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The author KI declared that they were an editorial board member of Frontiers at the time of submission. This had no impact on the peer review process and the final decision.
Generative AI statement
The author(s) declared that generative AI was used in the creation of this manuscript. Portions of the text in this article were refined with the assistance of ChatGPT (GPT-5), a large language model developed by OpenAI and accessed via the ChatGPT web application (https://chat.openai.com). The tool was used solely to improve the clarity and phrasing of the writing; all substantive content and interpretations are the authors’ own.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Abdel-Khalik, H. S., Hawari, A. I., and Wang, C. (2015). Physics-guided coverage mapping: a new approach for quantifying experimental coverage and bias scaling. ANS Trans. 112, 704–707.
Ao, D., Hu, Z., and Mahadevan, S. (2017). Dynamics model validation using time-domain metrics. J. Verif. Valid. Uncert. 2, 011004. doi:10.1115/1.4036182
Athe, P., Jones, C., and Nam, D. (2021). Assessment of the predictive capability of VERA-CS for CASL challenge problems. J. Verif. Valid. Uncert. 6, 021003. doi:10.1115/1.4050248
Avramova, M. N., and Ivanov, K. N. (2010). Verification, validation and uncertainty quantification in multi-physics modeling for nuclear reactor design and safety analysis. Prog. Nucl. Energy 52 (7), 601–614. doi:10.1016/j.pnucene.2010.03.009
Avramova, M., Ivanov, K., Kozlowski, T., Pasichnyk, I., Zwermann, W., Velkov, K., et al. (2015). Multi-physics and multi-scale benchmarking and uncertainty quantification within OECD/NEA framework. Multi-Physics Model. LWR Static Transient Behav. 84, 178–196. doi:10.1016/j.anucene.2014.12.014
Avramova, M., Abarca, A., Hou, J., and Ivanov, K. (2021). Innovations in multi-physics methods development, validation, and uncertainty quantification. J. Nucl. Eng. 2 (1), 44–56. doi:10.3390/jne2010005
Baccou, J., Zhang, J., Fillion, P., Damblin, G., Petruzzi, A., Mendizábal, R., et al. (2020). SAPIUM: a generic framework for a practical and transparent quantification of thermal-hydraulic code model input uncertainty. Nucl. Sci. Eng. 194 (8–9), 721–736. doi:10.1080/00295639.2020.1759310
Bess, J. D., and Ivanova, T. (2020). Current overview of ICSBEP and IRPhEP benchmark evaluation practices. EPJ Web Conf. 239, 18002. doi:10.1051/epjconf/202023918002
Clouvel, L., Iooss, B., Chabridon, V., Il Idrissi, M., and Robin, F. (2025). An overview of variance-based importance measures in the linear regression context: comparative analyses and numerical tests. Socio-Environmental Syst. Model. 7, 18681. doi:10.18174/sesmo.18681
Croisette, T., Olita, P., Vaglio-Gaudard, C., Rubiolo, P., and Préa, R. (2025). Multi-physics simulations and sensitivity analysis of a main steam line break transient in a pressurized water reactor. Nucl. Sci. Eng. 30, 1–19. doi:10.1080/00295639.2025.2502888
Da Veiga, S. (2015). Global sensitivity analysis with dependence measures. J. Stat. Comput. Simul. 85 (7), 1283–1305. doi:10.1080/00949655.2014.945932
Delipei, G. (2019). Development of an uncertainty quantification methodology for multi-physics best estimate analysis and application to the rod ejection accident in a pressurized water reactor. Université Paris Saclay. PhD Thesis. Available online at: https://theses.hal.science/tel-02381187/.
Delipei, G. K., Hou, J., Avramova, M., Rouxelin, P., and Ivanov, K. (2021). Summary of comparative analysis and conclusions from OECD/NEA LWR-UAM benchmark phase I. Nucl. Eng. Des. 384, 111474. doi:10.1016/j.nucengdes.2021.111474
Delipei, G. K., Rouxelin, P., Abarca, A., Hou, J., Avramova, M., and Ivanov, K. (2022). CTF-PARCS core multi-physics computational framework for efficient LWR steady-state, depletion and transient uncertainty quantification. Energies 15 (14), 5226. doi:10.3390/en15145226
Delipei, G. K., Altahhan, M., Rouxelin, P., Sen, S., George, N., Avramova, M., et al. (2025). Uncertainty quantification framework for high-temperature gas-cooled reactors using VSOP and DAKOTA. Nucl. Sci. Eng. BEPU24 Special Issue 20, 1–27. doi:10.1080/00295639.2025.2486901
Demazière, C., Sanchez-Espinoza, V. H., and Zentner, I. (2022). Advanced numerical simulation and modeling for reactor safety – contributions from the CORTEX, McSAFER, and METIS projects. EPJ Nucl. Sci. Technol. 8, 29. doi:10.1051/epjn/2022026
Downey, C. M., Oldham, N., Fleming, A., Chapman, D., Cruz, A. M., and Kelly, E. (2024). Design of a First-of-A-Kind instrumented advanced test reactor irradiation capsule experiment for in situ thermal conductivity measurements of metallic fuel. Prog. Nucl. Energy 175, 105325. doi:10.1016/j.pnucene.2024.105325
Faure, Q. (2024). Novel multi-fidelity multi-physics validation methodology for transient applications. North Carolina State University. Available online at: https://www.lib.ncsu.edu/resolver/1840.20/41905.
Faure, Q., Delipei, G., Scolaro, A., Avramova, M., and Ivanov, K. (2024). Fuel performance code to code comparative analysis for the OECD/NEA MPCMIV benchmark. Nucl. Eng. Des. 430, 113685. doi:10.1016/j.nucengdes.2024.113685
German, P., Ragusa, J. C., and Fiorina, C. (2019). Application of multiphysics model order reduction to Doppler/neutronic feedback. EPJ Nucl. Sci. Technol. 5, 17. doi:10.1051/epjn/2019034
Hansel, J., Andrs, D., Charlot, L., and Giudicelli, G. (2024). The MOOSE thermal hydraulics module. J. Open Source Softw. 9 (94), 6146. doi:10.21105/joss.06146
Horelik, N., Herman, B., Forget, B., and Smith, K. (2025). Benchmark for evaluation and validation of reactor simulations (BEAVRS). La Grange Park, United States: American Nuclear Society - ANS.
Huang, D., Abdel-Khalik, H., Rabiti, C., and Gleicher, F. (2017). Dimensionality reducibility for multi-physics reduced order modeling. Ann. Nucl. Energy 110, 526–540. doi:10.1016/j.anucene.2017.06.045
Huang, D., Mertyurek, U., and Abdel-Khalik, H. (2020). Verification of the sensitivity and uncertainty-based criticality safety validation techniques: ORNL’s SCALE case study. Nucl. Eng. Des. 361, 110571. doi:10.1016/j.nucengdes.2020.110571
Iooss, B., and Lemaître, P. (2015). “A review on global sensitivity analysis methods,” in Uncertainty management in simulation-optimization of complex systems: Algorithms and applications. Editors G. Dellino, and C. Meloni (Springer).
Iooss, B., and Marrel, A. (2019). Advanced methodology for uncertainty propagation in computer experiments with large number of inputs. Nucl. Technol. 205 (12), 1588–1606. doi:10.1080/00295450.2019.1573617
Jones, C., Hetzler, A., Nam, D., Athe, P., and Sieger, M. (2018). Updated verification and validation assessment for VERA. SAND--2018-8800R. Sandia National Laboratories SNL-NM. doi:10.2172/1466526
Kamerman, D. W., Jensen, C. B., Wachs, D. M., and Woolstenhulme, N. E. (2022). High-burnup experiments in reactivity initiated accidents (HERA). INL/EXT-20-57844-Rev004. Idaho National Laboratory. doi:10.2172/1874819
Kwon, S. J., Papadionysiou, M., Jung, Y. S., Rouxelin, P., Vasiliev, A., Ferroukhi, H., et al. (2025). Verification of the nTRACER/CTF code system for full core high resolution cycle analysis with the OECD/NEA TVA watts bar unit 1 benchmark. Ann. Nucl. Energy 220, 111532. doi:10.1016/j.anucene.2025.111532
Leppänen, J., Pusa, M., Viitanen, T., Valtavirta, V., and Kaltiaisenaho, T. (2015). The Serpent Monte Carlo code: status, development and applications in 2013. Ann. Nucl. Energy 82, 142–150. doi:10.1016/j.anucene.2014.08.024
Li, W., Chen, W., Jiang, Z., Lu, Z., and Liu, Y. (2014). New validation metrics for models with multiple correlated responses. Reliab. Eng. Syst. Saf. 127, 1–11. doi:10.1016/j.ress.2014.02.002
Licht, J. R., Dionne, B., Van den Branden, G., Sikik, E., and Koonen, E. (2015). RELAP5 model description and validation for the BR2 loss-of-flow experiments. Argonne National Laboratory (ANL). ANL/GTRI/TM--14/10. doi:10.2172/1212763
Liu, Y., Chen, W., Arendt, P., and Huang, H.-Z. (2011). Toward a better understanding of model validation metrics. J. Mech. Des. 133, 071005. doi:10.1115/1.4004223
Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013). “The randomized dependence coefficient,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (Red Hook, NY, USA: NIPS’13), 1–9. doi:10.48550/arXiv.1304.7717
Luckring, J. M., Shaw, S., Oberkampf, W. L., and Graves, R. E. (2023). Model validation hierarchies for connecting system design to modeling and simulation capabilities. Prog. Aerosp. Sci. 142, 100950. doi:10.1016/j.paerosci.2023.100950
Martineau, R. C. (2021). The MOOSE multiphysics computational framework for nuclear power applications: a special issue of nuclear technology. Nucl. Technol. 207 (7), iii–viii. doi:10.1080/00295450.2021.1915487
Mascari, F., D’Auria, F., Bestion, D., Lien, P., Nakamura, H., Austregesilo, H., et al. (2024). OECD/NEA/CSNI state-of-the-art report on scaling in system thermal-hydraulics applications to nuclear reactor safety and design (The S-SOAR). Nucl. Eng. Des. 416, 112750. doi:10.1016/j.nucengdes.2023.112750
Menut, P., Sartori, E., and Turnbull, J. A. (2000). “The public domain database on nuclear fuel performance experiments (IFPE) for the purpose of code development and validation,” in Proceedings of the 2000 international topical meeting on light water reactor fuel performance (Park City, Utah). April 10.
NEA (2023). Existing practices for multi-physics validation: report on sub-task 3, task force 2. OECD Publishing. NEA/NSC/R(2019)7.
Novak, A. J., Shriwise, P., Romano, P. K., Rahaman, R., Merzari, E., and Gaston, D. (2023). Coupled Monte Carlo transport and conjugate heat transfer for wire-wrapped bundles within the MOOSE framework. Nucl. Sci. Eng. 197 (10), 2561–2584. doi:10.1080/00295639.2022.2158715
Novak, A. J., Shriwise, P., Romano, P., Shaver, D., Fang, J., Yuan, H., et al. (2023). “High-fidelity multiphysics for fission: challenges, accomplishments, and future directions,” in Proceedings of the 20th international topical meeting on nuclear reactor thermal hydraulics (NURETH 2023).
Oberkampf, W. L., and Barone, M. F. (2006). Measures of agreement between computation and experiment: validation metrics. J. Comput. Phys. 217 (1), 5–36. doi:10.1016/j.jcp.2006.03.037
Oberkampf, W. L., and Roy, C. J. (2010). Verification and validation in scientific computing. Cambridge University Press.
Oberkampf, W. L., Trucano, T. G., and Pilch, M. M. (2007). Predictive capability maturity model for computational modeling and simulation. No. SAND2007-5948. Sandia National Laboratories. doi:10.2172/976951
OECD/NEA (2024). “Benchmark study on pellet-cladding mechanical interaction in light water reactor fuel volume 2,” in Validation of beginning-of-life power ramping (case 3). Technical Report No. 7663.
OECD/NEA (2025a). Available online at: https://www.oecd-nea.org/jcms/pl_20438/subgroup-onuncertainty-analysis-in-modelling-uam-for-design-operation-and-safety-analysis-of-sodium-cooled-fast-reactors-sfr-uam (Accessed August 30).
OECD/NEA (2025b). Available online at: https://www.oecd-nea.org/jcms/pl_32203/multi-physics-pellet-cladding-mechanical-interaction-validation-mpcmiv-benchmark (Accessed December 1).
OECD/NEA (2025c). Available online at: https://www.oecd-nea.org/jcms/pl_106827/light-water-cooled-small-modular-reactor-benchmark-for-uncertainty-quantification-and-propagation-in-multiphysics-simulations-lw-smr (Accessed August 30).
Owen, A. B. (2014). Sobol’ indices and Shapley value. SIAM/ASA J. Uncertain. Quantif. 2 (1), 245–251. doi:10.1137/130936233
Papadionysiou, M., Delipei, G., Avramova, M., Ferroukhi, H., and Ivanov, K. (2025). High-resolution predictions of the coolant properties for the 3D PWR core with artificial neural networks based on CTF. Nucl. Eng. Des. 442, 114261. doi:10.1016/j.nucengdes.2025.114261
Park, H. J., and Park, J. W. (2024). Similarity coefficient generation using adjoint-based sensitivity and uncertainty method and stochastic sampling method. Energies 17 (4), 827. doi:10.3390/en17040827
Porter, N. W. (2019). Wilks’ formula applied to computational tools: a practical discussion and verification. Ann. Nucl. Energy 133, 129–137. doi:10.1016/j.anucene.2019.05.012
Radaideh, M. I., and Kozlowski, T. (2019). Combining simulations and data with deep learning and uncertainty quantification for advanced energy modeling. Int. J. Energy Res. 43 (14), 7866–7890. doi:10.1002/er.4698
Recuero, A., Lindsay, A., Yushu, D., Peterson, J. W., and Spencer, B. (2022). A mortar thermomechanical contact computational framework for nuclear fuel performance simulation. Nucl. Eng. Des. 394, 111808. doi:10.1016/j.nucengdes.2022.111808
Riedmaier, S., Danquah, B., Schick, B., and Diermeyer, F. (2021). Unified framework and survey for model verification, validation and uncertainty quantification. Arch. Comput. Methods Eng. 28 (4), 2655–2688. doi:10.1007/s11831-020-09473-7
Rohatgi, U. S., and Kaizer, J. S. (2020). Historical perspectives of BEPU research in US. Nucl. Eng. Des. 358, 110430. doi:10.1016/j.nucengdes.2019.110430
Rohatgi, U., Dyrda, J., and Soppera, N. (2018). The international experimental thermal hydraulic systems database (TIETHYS): a new NEA validation tool. Proceedings of the 26th International Conference on Nuclear Engineering (ICONE26), V004T15A022. doi:10.1115/ICONE26-82631
Roy, C. J., and Oberkampf, W. L. (2011). A comprehensive framework for verification, validation, and uncertainty quantification in scientific computing. Comput. Methods Appl. Mech. Eng. 200 (25), 2131–2144. doi:10.1016/j.cma.2011.03.016
Sargent, R. G. (2020). Verification and validation of simulation models: an advanced tutorial. Winter Simulation Conference 14, 16–29. doi:10.1109/WSC48552.2020.9384052
Sargent, R. G., and Balci, O. (2017). History of verification and validation of simulation models. Winter Simulation Conference, 292–307. doi:10.1109/WSC.2017.8247794
Schlesinger, S. (1979). Terminology for model credibility. Simulation 32 (3), 103–104. doi:10.1177/003754977903200304
Scolaro, A., Clifford, I., Fiorina, C., and Pautz, A. (2020). The OFFBEAT multi-dimensional fuel behavior solver. Nucl. Eng. Des. 358, 110416. doi:10.1016/j.nucengdes.2019.110416
Shaw, S., Luckring, J. M., Oberkampf, W., and Graves, R. E. (2023). “Exploitation of a validation hierarchy for modeling and simulation,” in AIAA SCITECH 2023 Forum, 0 vol. AIAA SciTech Forum (American Institute of Aeronautics and Astronautics). doi:10.2514/6.2023-2605
Stauff, N. E., Abdelhameed, A., Cao, Y., Ibarra, L., Lee, S. K., Miao, Y., et al. (2024). Assessment and validation of NEAMS tools for high-fidelity multiphysics transient modeling of microreactors: application of NEAMS codes to perform multiphysics modeling analyses of micro-reactor concepts. ANL/NEAMS--24/3. Argonne National Laboratory (ANL). doi:10.2172/2475766
Trivedi, I., Delipei, G., Hou, J., Grasso, G., and Ivanov, K. (2025). Development and application of two-step uncertainty propagation and sensitivity analysis methodology for fast reactor safety analysis. Nucl. Eng. Des. 433, 113882. doi:10.1016/j.nucengdes.2025.113882
Tucker, M. D., and Novog, D. R. (2020). The impact of fueling operations on full core uncertainty analysis in CANDU reactors. J. Nucl. Eng. Radiat. Sci. 6, 031401. doi:10.1115/1.4045485
Unruh, T., Reichenberger, M., Stevenson, S., McGregor, D., and Tsai, K. (2017). Enhanced micro-pocket fission detector for high temperature reactors - FY17 FInal project report. Idaho National Laboratory. INL/EXT--17-43397-Rev000. doi:10.2172/1468579
Vaglio-Gaudard, C., Destouches, C., Hawari, A., Avramova, M., Ivanov, K., Valentine, T., et al. (2023). Challenge for the validation of high-fidelity multi-physics LWR modeling and simulation: development of new experiments in research reactors. Front. Energy Res. 11. doi:10.3389/fenrg.2023.1110979
Valentine, T. E., Ivanov, E., Ivanov, K., Petruzzi, A., Avramova, M., Hursin, M., et al. (2024). OECD-NEA expert group on reactor systems multi-physics: Development of verification and validation guidelines for multi-physics simulations. ICONE31 (Volume 10 Risk Assessments and Management; Computer Code Verification and Validation; Nuclear Education and Public Acceptance), V010T13A014. doi:10.1115/ICONE31-136161
Wang, Y., Prince, Z. M., Park, H., Calvin, O. W., Choi, N., Jung, Y. S., et al. (2025). Griffin: a MOOSE-based reactor physics application for multiphysics simulation of advanced nuclear reactors. Ann. Nucl. Energy 211, 110917. doi:10.1016/j.anucene.2024.110917
Wieselquist, W. A., and Lefebvre, R. A. (2024). SCALE 6.3.2 user manual. ORNL/TM--2024/3386. Oak Ridge National Laboratory. doi:10.2172/2361197
Williamson, R. L., Hales, J. D., Novascone, S. R., Pastore, G., Gamble, K. A., Spencer, B. W., et al. (2021). BISON: a flexible code for advanced simulation of the performance of multiple nuclear fuel forms. Nucl. Technol. 207 (7), 954–980. doi:10.1080/00295450.2020.1836940
Yu, Y., Park, H., Novak, A., and Shemon, E. (2025). High fidelity multiphysics tightly coupled model for a lead cooled fast reactor concept and application to statistical calculation of hot channel factors. Nucl. Eng. Des. 435, 113915. doi:10.1016/j.nucengdes.2025.113915
Zeng, K. (2020). Uncertainty analysis framework for the multi-physics light water reactor simulation. PhD Thesis. North Carolina State University. Available online at: https://www.proquest.com/dissertations-theses/uncertainty-analysis-framework-multi-physics/docview/2429661381/se-2
Zhang, J. (2019). The role of verification & validation process in best estimate plus uncertainty methodology development. Nucl. Eng. Des. 355, 110312. doi:10.1016/j.nucengdes.2019.110312
Glossary
ANOVA Analysis of variance
BEPU Best Estimate Plus Uncertainty
BWR Boiling water reactor
CDF Cumulative distribution function
DMD Dynamic mode decomposition
Hi2Lo High to low fidelity
HSIC Hilbert–Schmidt independence criterion
HTGR High-temperature gas-cooled reactor
ICSBEP International Criticality Safety Benchmark Evaluation Project
IRPhEP International Reactor Physics Experiments Evaluation Project
LHR Linear heat rate
LWR Light-water reactor
MAE Mean absolute error
M&S Modeling and simulation
NEA Nuclear Energy Agency
OECD Organization for Economic Co-operation and Development
PCA Principal component analysis
PCMI Pellet cladding mechanical interaction
POD Proper orthogonal decomposition
REA Rod ejection accident
RMSE Root mean square error
ROM Reduced-order model
SA Sensitivity analysis
SFR Sodium fast reactor
SMR Small modular reactor
UQ Uncertainty quantification
V&V Verification and validation
VVUQ Verification, validation, and uncertainty quantification
Keywords: high-fidelity, high-resolution, Multi-physics Pellet Cladding Mechanical Interaction Validation, Benchmark, multi-physics, uncertainty quantification, validation
Citation: Delipei GK, Faure Q, Avramova M and Ivanov K (2026) High-fidelity multi-physics guidelines for model validation and uncertainty quantification. Front. Nucl. Eng. 4:1720142. doi: 10.3389/fnuen.2025.1720142
Received: 07 October 2025; Accepted: 15 December 2025;
Published: 04 February 2026.
Edited by:
Stefano Terlizzi, The Pennsylvania State University (PSU), United States
Reviewed by:
Rabab Elzohery, Oak Ridge National Laboratory (DOE), United States
Paul Ferney, Idaho National Laboratory (DOE), United States
Copyright © 2026 Delipei, Faure, Avramova and Ivanov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Gregory K. Delipei, gkdelipe@ncsu.edu
Quentin Faure