- Department of Educational Psychology, University of Nevada, Las Vegas, Las Vegas, NV, United States
When statistical significance becomes the currency of publication, the incentive to reach p < 0.05 can subtly shape research behavior. While many researchers report their findings with integrity, others face implicit pressures that increase the likelihood of selective reporting or post hoc adjustments. This article introduces the Adaptive Integrity Model (AIM), a conceptual framework that synthesizes explanation, prediction, and detection to illuminate how p-hacking tendencies arise within specific institutional and cognitive environments. Whereas existing tools such as p-curve and z-curve offer retrospective diagnostics based on p-value distributions, AIM complements these by embedding detection within a broader model of behavioral and structural influences. Its explanatory component quantifies the structural incentives and psychological biases that shape research behavior. Its predictive component flags statistical irregularities such as clustering near the significance threshold, omitted test reporting, and boundary inflation. Its detection component evaluates transparency through replication outcomes, preregistration adherence, and analytic completeness. Validated across five real-world datasets, including studies later retracted or disputed, AIM generates a Pintegrity score that captures statistical anomalies and contextual vulnerabilities. By modeling research integrity as a layered system, AIM offers journal editors, funders, and reviewers a scalable tool for credibility assessment that promotes retrospective research audits and prospective safeguards.
1 Introduction
Scientific claims today are judged by tidy statistics, with p-values standing in for actual insight. Yet, paradoxically, the p-value, originally intended as a safeguard of inferential discipline, has become a source of distortion. In fields where publication hinges on statistical significance, researchers may encounter pressures that subtly shape their analytic decisions toward achieving the canonical threshold of p < 0.05 (Head et al., 2015). Known as p-hacking, this practice includes tactics such as selective reporting, flexible data exclusion, and multiple testing without correction (Fraser et al., 2018; Head et al., 2015; Simmons et al., 2011). The damage is as technical as it is epistemic. When statistical inference becomes a game of thresholds, the scientific record fills with results that are irreproducible and misleading (Ioannidis, 2005; Munafò et al., 2017; Wicherts et al., 2016). In response, researchers have turned to methods like p-curve and z-curve analysis to look for signs of p-hacking in the distribution of published p-values. These techniques assume that clustering just below the 0.05 threshold signals selective reporting or analytical manipulation (Bartoš and Schimmack, 2022; Simonsohn et al., 2014). Yet editors and funders need models that estimate vulnerability prospectively, not merely detect distortions post hoc, and that make their assumptions and weighting transparent and open to calibration. P-hacking forms part of a broader spectrum of questionable research practices, including selective reporting, publication bias, and outcome switching, that reflect systemic rather than purely individual pressures (Fanelli, 2009; Fraser et al., 2018; Simmons et al., 2011; John et al., 2012). These behavioral patterns underscore the need for models that integrate institutional and cognitive dimensions of integrity. Broader analyses of research integrity highlight how structural incentives and metric-driven evaluation systems shape academic behavior (Ioannidis, 2023; Resnik, 2006). Recent computational approaches also examine how digital forensics and network analysis can detect emerging forms of misconduct (Hofmann et al., 2023). Together, these perspectives situate p-hacking within a broader ecosystem of systemic pressures that extend beyond individual intent.
Earlier efforts to model research credibility, such as p-curve and z-curve analyses, have been valuable in detecting statistical irregularities but remain limited in explanatory scope. Although useful as diagnostics, both models fall short of being truly explanatory. P-curve analysis assumes a dichotomy between true and false findings but cannot accommodate complex biases such as motivated reasoning or institutional incentives. The z-curve estimates discovery rates but presumes homogeneity in study quality and reporting transparency, an assumption that breaks down in high-stakes publication environments and that overlooks the institutional and psychological conditions that make p-hacking more probable (Bartoš and Schimmack, 2022; Lakens, 2021; van Aert et al., 2016). Spotting a suspicious pattern is not the same as understanding where it comes from or how to prevent it. While p-curve and z-curve analyses offer statistical diagnostics, they lack a mechanism for predicting misconduct or tracing its origins. They ignore the layered ecosystem of pressures, cognitive shortcuts, and institutional norms that shape researchers' decisions long before data analysis begins. As a result, they remain reactive rather than preventive, helpful for post hoc audits but insufficient for designing upstream safeguards (Ioannidis, 2014; Munafò et al., 2017; Nosek et al., 2012). To improve scientific practice, a model must show how researchers are nudged toward compromise, not just flag that it occurred.
In this article, I introduce the Adaptive Integrity Model (AIM), an integrated framework of explanatory, predictive, and detection components that form a basis for understanding, predicting, and detecting research distortion. The explanatory component focuses on external pressures such as tenure expectations and funding competition, as well as internal cognitive biases that shape how researchers interpret and report their findings. The predictive component identifies statistical warning signs: p-value clustering, inconsistencies in test reporting, and unexplained shifts in analytic design. The detection component evaluates transparency practices, including replication access, preregistration adherence, and analytic completeness. Tested against five publicly available datasets, most of which have been either retracted or contested, AIM not only flagged statistical anomalies but provided a rationale for their emergence. The model's output, a composite Pintegrity score, serves as an audit tool and a deterrent that equips researchers, journals, and funding agencies to evaluate research credibility in richer evidentiary terms.
What this paper adds. AIM advances integrity assessment in three ways. First, it integrates behavioral and institutional context (Ip) with statistical footprints (Ph) and transparency practices (Tx) so that detection is tied to plausible causes and preventive levers. Second, it formalizes these inputs into a single, interpretable score with an explicit weighting scheme (0.40–0.40–0.20) that reflects evidence favoring contextual and procedural predictors of credibility over anomaly density, and that we make open to empirical re-calibration (see Appendix B for scoring rules). Third, it demonstrates feasibility on five public datasets (including later-contested studies), showing how AIM can support pre-publication screening and post-publication audit without penalizing exploratory designs that are properly documented.
2 Theoretical background
2.1 The limitations of p-value reliance
For decades, the p-value has been the dominant convention for inferential statistics in empirical research, its status reinforced by journal submission standards, grant review protocols, and methodological training (Wasserman and Madrid-Morales, 2019). Yet the p-value, first introduced by Ronald Fisher in the 1920s as a heuristic for gauging evidential inconsistency with the null hypothesis, was never intended to serve as a definitive criterion for validity (Fisher et al., 1990; Hubbard and Bayarri, 2003). Over time, its probabilistic nuance was flattened into a binary decision rule, setting the stage for widespread misinterpretation and overuse. The appeal of the p-value lies in its simplicity, but that simplicity conceals a fundamental incoherence. A low p-value does not indicate that a hypothesis is true, only that the observed data would be unlikely if the null hypothesis held. Yet even this interpretation collapses under scrutiny. The probability that a result reaching p < 0.05 is in fact a false positive can skyrocket in low-powered studies, producing what Gigerenzer (2004) describes as a widespread illusion of inferential certainty rooted in statistical ritualism. Compounding the problem is the misconception that non-significance implies no effect, which in turn suppresses valuable null results and biases the published record (Hubbard and Bayarri, 2003). These distortions have real consequences in psychology, biomedicine, and education, where many studies have been called into question not because of fabrication, but as a result of statistical self-deception embedded in accepted methods (Wasserstein et al., 2019).
What makes the problem urgent now is not just the misuse of a tool but the entrenchment of a flawed epistemology. A widely cited editorial titled “Scientists rise up against statistical significance” captures the growing revolt within the scientific community, urging researchers to abandon p-value thresholds in favor of more nuanced inferential reasoning (Amrhein et al., 2019). That call to action proceeds from the reality that placing too much weight on p-values leads to either-or thinking, when research really calls for more nuanced interpretation and ongoing synthesis (Patel and Green, 2025; Wasserstein and Lazar, 2016). Ironically, the very institutional mechanisms designed to enforce rigor, such as pre-publication review, statistical checklists, and threshold criteria, often amplify the problem by incentivizing many researchers to chase thresholds in lieu of clarifying uncertainty (Wasserstein et al., 2019). In other words, the limitations of p-values are not merely statistical but structural. They arise from the conflation of decision procedures with inference and of administrative convenience with epistemic validity. Any model aspiring to reform scientific practice must begin by confronting this misalignment between statistical conventions and inferential goals.
2.2 Incentive structures and scientific distortion
If statistical misuse were simply the result of misunderstanding, the solution would lie in better training. However, as researchers operate within a reward system that privileges publishability over accuracy, the more persistent driver of distortion is structural. Journals favor statistically significant results, funding agencies reward productivity metrics, and hiring committees generally equate statistical “success” with intellectual merit (Nosek et al., 2012; Smaldino and McElreath, 2016). Inside this ecosystem, the incentives for discovering reality are misaligned with the incentives for academic survival. Indeed, if tenure, grants, and academic prestige hinge on clear, “positive” outcomes, questionable practices may begin to look rational or even necessary (Nosek et al., 2012). The empirical effects of this incentive misalignment are well documented. Studies of publication trends show a consistent overrepresentation of results that simply meet the p < 0.05 threshold and an underreporting of null findings, particularly in high-impact journals (Fanelli, 2010b; Head et al., 2015).
What appears at first glance to be a pattern of insight is, in fact, often one of selection. The overrepresentation of significant results is a statistical distortion rooted in systemic pressures. But such distortions are not entirely the product of researcher bias. They are adaptive responses to a distorted incentive landscape (Ioannidis, 2014; Munafò et al., 2017). Studies in motivated reasoning (Kunda, 1990), cognitive dissonance (Festinger, 1962), and behavioral ethics (Bazerman and Tenbrunsel, 2011) have shown how individuals can unconsciously interpret evidence in ways that align with their interests. Within institutional theory, reward systems, such as publication metrics or tenure criteria, exert structural pressures that normalize these tendencies (Merton, 1973; Smaldino and McElreath, 2016). The most insidious feature of this system is its feedback loop. As statistically significant findings become a currency of professional advancement, journals implicitly endorse the thresholds that distort the evidentiary base. Reviewers demand clarity where ambiguity would be more honest, and editorial standards reward certainty even when the data are equivocal. Over time, this dynamic erodes the epistemic function of statistical analysis (Ioannidis, 2005). In such an environment, even methodologically sound studies are susceptible to misinterpretation, as researchers internalize the value of significance and subtly shape their designs to meet it (McElreath and Smaldino, 2016; see also Wicherts et al., 2016). Reforming statistical practice, then, cannot be accomplished without addressing the deeper system of incentives that motivates distortion.
2.3 The limitations of current detection models
Several statistical tools have emerged to address research distortions, but most operate downstream from the institutional pressures that produce them. Detection models like the p-curve and the z-curve have been celebrated as methodological breakthroughs. They allow researchers to infer the presence of p-hacking by studying the distributions of reported p-values, or by looking for excess clustering just below the p < 0.05 threshold (Schimmack, 2021; Simonsohn et al., 2014). The p-curve evaluates the right skewness of significant results, interpreting skewed distributions as evidence of underlying true effects. The z-curve extends this logic by modeling a mixture of p-values to estimate both expected discovery and replication rates (Bartoš and Schimmack, 2022; van Aert et al., 2016). These analytical tools mark a meaningful shift in epistemic strategy: they evaluate statistical credibility not by weighing single studies but by analyzing the distributional patterns that emerge across them. In so doing, they have positioned meta-science as a form of empirical diagnostic.
But while these two models detect distortion, they do not explain it (van Aert et al., 2016). Their inferential logic is post hoc, and their diagnostic power ends where theory should begin. For example, a skewed p-distribution may flag selective reporting, but it says nothing at all about what caused it—career incentives, replication fear, or analytic ambiguity. Both p-curves and z-curves treat the data as if it exists apart from the pressures shaping it, which runs counter to what research on bias and incentive structures has found (Nosek et al., 2012; Smaldino and McElreath, 2016). In particular, the p-curve presumes that p-values reflect a uniform standard, when in reality, practices like data peeking or selective reporting lead to different and distorted outcomes. The z-curve attempts to adjust for this heterogeneity, but it still lacks a framework for modeling the motivational factors behind the statistical anomalies it identifies (Lakens, 2022; McShane and Gal, 2017).
The p-curve and the z-curve are also methodologically constrained when applied prospectively, because they require large distributions of p-values and cannot meaningfully assess the integrity of individual studies (Bartoš and Schimmack, 2022). As a result, their usefulness lies in retrospective audit, not forward-looking guidance. This limitation is consequential. Detection without prevention reduces epistemic reform to statistical forensics, and statistical forensics, by definition, occurs after the damage is done. In contexts where replication is costly, slow, or impractical, this lag between detection and correction is structurally disabling (Ioannidis, 2014; Munafò et al., 2017). If science is to build reliable norms of integrity, it needs tools that not only model what questionable research looks like in the aggregate, but also how it emerges, under what conditions, and with what consequences. A model that can make such patterns intelligible in real time must be grounded in statistics, as well as in behavior, incentives, and transparency practices. The ethical stakes of these distortions are palpable. Once misleading information enters the scientific record, it affects not only the reputation of individual scientists but also the allocation of resources, the shaping of public policy, and the outcomes of clinical and educational practice. It follows that any model that aims to prevent or detect statistical distortion must be grounded in epistemic responsibility and ethical accountability.
2.4 The need for multi-component models
As current detection methods cannot explain the origins of distortion or guide reform in real time, the theoretical task is clear: science needs models that integrate multiple dimensions of research behavior. A reliable assessment of research integrity must go beyond statistical audit trails. It must equally account for the pressures that produce those trails and the transparency mechanisms that either constrain or enable them. One-dimensional tools like the p-curve or z-curve cannot capture this interplay because they treat observed results as isolated data points, detached from the contexts in which they were generated (Schimmack, 2021; Lakens, 2022). Put differently, a clustering of p-values around 0.049 looks identical whether it stems from cognitive bias, institutional incentives, or true effect heterogeneity. Unlike one-dimensional detection tools, a multi-component model provides the conceptual leverage to distinguish among these sources of distortion and examine how they interact. There is precedent for this idea in other fields. In psychometrics, for example, models of test validity distinguish between observed scores, underlying constructs, and sources of measurement bias (Messick, 1995). In epidemiology, causal inference models integrate exposure, confounding, and outcome pathways instead of relying on univariate associations (Greenland et al., 1999). The core insight here is structural: when distortions arise from intersecting forces, they must be modeled accordingly.
Research integrity is no different. Statistical anomalies, researcher intentions, and procedural safeguards form a system, not a sequence, and any explanatory framework that isolates one at the expense of the others will misattribute both cause and culpability (John et al., 2012; Smaldino and McElreath, 2016). What is more, connecting anomalies to the pressures and habits that produce them allows the model to flag early warning signs and suggest practical steps. Equally important, a multi-component model expands the scope of application. Rather than rely on post hoc audits of published results, it allows for real-time estimation of research integrity tailored to specific studies, research teams, or submission pipelines. This makes it possible to flag studies for further review before publication, to identify structural vulnerabilities in editorial practices, or to monitor progress toward open science goals. The Adaptive Integrity Model (AIM) is designed with these theoretical and ethical imperatives in mind. Drawing on established work in cognitive psychology, sociology of science, and behavioral ethics, it frames p-hacking as a systemic outcome of interacting biases, pressures, and practices. Its goal is not just to detect error but to support institutional and individual accountability by revealing the pathways through which research integrity erodes and how it can be protected. Unlike retrospective tools that require dozens of p-values to infer bias, a multi-component model can evaluate even single-study submissions by incorporating factors such as preregistration compliance, test-reporting completeness, and power estimates relative to hypothesized effects (van Aert et al., 2016; Nosek et al., 2015). Such a model will not replace statistical diagnostics but reframe them, not as endpoints of suspicion, but as entry points into a more inclusive account of how research is produced and how it should be evaluated.
2.5 The adaptive integrity model: conceptual origins
As established in the foregoing section, reframing statistical diagnostics requires more than computational refinement. It requires a theoretical foundation based on understanding how scientists think, decide, and adapt. It follows that a model that explains, predicts, and detects p-hacking cannot emerge from statistical logic alone (Moss and De Bin, 2023). Instead, it must draw from a broader conceptual lineage that includes decision theory, cognitive bias research, metascience, and the sociology of knowledge. AIM is grounded in the premise that questionable research practices are neither random nor entirely individualistic. Rather, they are patterned outcomes arising from bounded rationality, incentive systems, and epistemic norms. However, the idea that distortions follow predictable paths is not new. Kahneman's (1974) notable work on heuristics and biases illustrated how decision-makers often rely on simplifying rules that lead to systematic error. In science, such heuristics are shaped by environmental factors: the pressure of “publish or perish,” the availability of flexible methods, and the absence of transparent reporting standards. If we want to measure integrity meaningfully, we need to look at where decisions are made, not just at the p-values that result from them (Bishop, 2019; Gelman and Loken, 2023).
Meta-scientific inquiry reinforces this systems-level view. Studies show that questionable research practices are not isolated incidents but replicable behaviors that increase under specific conditions: intense competition, limited oversight, and very high rewards for statistically significant results (John et al., 2012; Nosek et al., 2012). These realities reinforce the need for a model that can formalize how environmental pressures interact with researcher cognition and analytic flexibility to produce recognizable statistical artifacts. The explanatory component of AIM builds upon this foundation. It does not infer bias exclusively from numerical patterns but rather uses patterns as surface expressions of deeper problems. In this respect, AIM aligns with what Gigerenzer et al. (1999) have described as the ecological rationality approach. The model views human behavior as adaptively shaped by the constraints and opportunities of its environment. Ecological rationality presumes that decision-making is sensible only when interpreted within the context in which it occurs. In the case of research misconduct, this means that what appears as irrational behavior, such as manipulating analyses or suppressing null results, can be a rational adaptation to institutional incentives. AIM uses this lens to interpret p-hacking as a system-contingent response to misaligned professional rewards. This framing is critical, as it anchors the model's evaluation of integrity in both statistical patterns and the institutional conditions that give rise to them, including prevailing norms, systemic pressures, and procedural constraints.
This shift in explanatory focus enables predictive and diagnostic advances. Whereas most detection tools ask whether a pattern is suspicious, AIM asks what combination of incentives, biases, and safeguards could plausibly generate that pattern. Its architecture is designed not only to identify red flags but also to trace their structural roots. That design reflects insights from epistemic risk modeling, where the goal is not merely to prevent error but to understand where and why error is most likely to be made (Biddle and Kukla, 2017). In the same spirit, AIM draws from reliability engineering, where multiple indicators are used together to estimate the risk of failure. Just as aircraft maintenance does not rely upon a single warning light, AIM does not rely on a single signal. It integrates explanatory variables, predictive indicators, and detection criteria into a composite estimate of vulnerability. This is the conceptual core of the model: integrity is not a property of research outcomes but a property of systems.
3 The adaptive integrity model
3.1 Model overview
As shown in Figure 1, AIM is centered on a core premise: research integrity is not a latent trait to be inferred from isolated outcomes, but a systemic property that emerges from the interaction of cognitive biases, incentive structures, and statistical signals. Rather than rely on a single factor, such as p-value distributions or researcher disclosure practices, AIM combines and leverages three interdependent components: (1) an explanatory module (Ip) that quantifies sources of bias and pressure, (2) a predictive module (Ph) that identifies statistical red flags associated with questionable research practices, and (3) a detection module (Tx) that assesses procedural transparency. These components are integrated into a composite Pintegrity score, which reflects both the probability and the provenance of research distortion. Each component addresses a distinct epistemic challenge: Ip accounts for the origins of bias, Ph flags potential statistical footprints of bias, and Tx determines whether structural safeguards were in place. This layered approach is designed to overcome the limitations of one-dimensional models, which often detect distortion without explaining it at all (Bartoš and Schimmack, 2022; Nosek et al., 2012; van Aert et al., 2016).
AIM's design takes its cue from diagnostic systems in domains where certainty is elusive and inference must be distributed across multiple inputs. For example, in clinical medicine, the focus is no longer on individual biomarkers but on integrated decision-support systems that synthesize lab data, imaging results, and patient symptoms into probabilistic risk profiles (Berner and Graber, 2008). The goal in this domain is not confirmation but calibration, an orientation that accommodates noise, ambiguity, and variability. The same idea applies in engineering, where layered diagnostics are used to spot potential failure points (Leveson, 2012). These methods share one logic: When false positives and false negatives bear real costs, the solution is not more precision in one signal, but rather integration across many signals. AIM applies this logic to the epistemic domain. It treats statistical anomalies as interdependent outputs influenced by the procedural, cognitive, and institutional peculiarities of the research environment. In shifting the key question from “Was there p-hacking?” to “What conditions made distortion of results probable?”, AIM replaces binary judgment with layered inference and grounds itself less in accusation and more in vulnerability estimation (Biddle and Kukla, 2017; see also Gelman and Loken, 2023).
Each component within AIM adds a distinct computational layer, but all the outputs are synthesized through a unified weighting and estimation process. The explanatory module (Ip) translates context into quantifiable risk, drawing on indicators such as journal impact pressures, disciplinary norms, and researcher-level vulnerability profiles (Fraser et al., 2018; Nosek et al., 2012). The predictive module (Ph) analyzes statistical anomalies as clustered patterns, such as overrepresentation of marginally significant results or unexpected variance compressions that together signal potential distortion (Ioannidis, 2014; Schimmack, 2021). The detection module (Tx) assesses transparency measures across three levels: analytic completeness, preregistration adherence, and data accessibility (Wicherts et al., 2016). These inputs are then normalized and fed into a logistic weighting algorithm, which integrates empirical irregularities with contextual vulnerability to generate a Pintegrity score. Rather than issuing a fixed label, the score expresses a range of likelihoods that can be interpreted at various levels—from a single paper to a research program. This computational structure allows the model to operate in both retrospective and prospective contexts. It can be used for pre-publication screening, post-publication auditing tasks, or even for continuous monitoring of research workflows (Munafò et al., 2017; McElreath and Smaldino, 2016). Instead of attempting to replace human judgment, AIM provides an empirical scaffold for an elaborated research integrity assessment.
3.2 Ip, Ph, and Tx
The explanatory component (Ip) anchors AIM in the principle that research distortion is seldom the product of isolated decisions, but rather the outcome of structurally patterned pressures. It quantifies contextual vulnerability by modeling how incentive gradients, such as publishability thresholds, career advancement demands, and institutional prestige metrics, impact researcher behavior under epistemic uncertainty (Nosek et al., 2012; Biddle and Kukla, 2017). To be clear, Ip does not assume malicious intent; rather, it models rational actors navigating a system optimized for positive results. Drawing from cognitive bias theory, it assigns weighted scores to known pressure vectors, such as whether a study is submitted to a high-impact outlet, whether the author is early-career, or whether the reported design leaves room for undisclosed flexibility (Wicherts et al., 2016). By treating these features as predictors of conditional vulnerability and not as disqualifying flaws, Ip enables AIM to distinguish between studies conducted under high-pressure environments and those operating with greater epistemic flexibility. This distinction is critical. Without it, models of integrity risk equating context-blind statistical anomalies with misconduct. Ip formalizes the background conditions of research production, turning vague ethical concerns into measurable inputs. In so doing, it turns AIM into an elaborated model of how scientific norms, incentives, and cognition interact.
Much like Ip, the predictive component (Ph) treats statistical irregularities as inferential warning signals rather than isolated curiosities. Whereas the explanatory module models contextual pressures, Ph focuses on patterned outcomes—detectable signatures of distortion that transcend individual datasets. Its core function is to flag distributions that deviate systematically from what valid inference would yield, such as excessive clustering just below p < 0.05, sharp declines in insignificant ranges, or disproportionate effect sizes relative to study design. These anomalies are not interpreted as definitive proof of p-hacking but as probabilistic cues whose frequency and configuration may reveal analytic flexibility or selective reporting (Bartoš and Schimmack, 2022; Fraser et al., 2018). In addition, Ph incorporates weighting based on sample size, field of study, and baseline replication rates, to ensure that signals are assessed against calibrated expectations rather than against generic statistical thresholds (Schimmack, 2021). In this way, Ph avoids the false dichotomy between dismissing anomalies entirely and treating them as definitive. Instead, Ph synthesizes multiple irregularities into a composite risk vector, which feeds directly into the detection layer, and maintains a continuous analytic flow from structural pressure to statistical pattern to procedural indicator.
Tx, the detection component of AIM, operationalizes transparency as a procedural variable that leaves measurable audit trails in the research process. While the predictive module interprets statistical outputs, Tx evaluates what researchers disclose, omit, or leave ambiguous. It assigns weighted scores to indicators of procedural openness, including whether the study was preregistered, whether all prespecified outcomes were reported, whether data and code are accessible, and whether analytic paths are traceable from design to result (Wicherts et al., 2016; Nosek et al., 2015). These transparency metrics are not evaluated in isolation but interpreted in relation to the contextual pressures identified by Ip and the statistical issues flagged by Ph. For example, a study showing high anomaly density under high incentive pressure but scoring low on transparency presents a very different integrity profile from one with similar patterns but clear procedural disclosure. Tx completes AIM's inferential arc: it evaluates whether there were institutional or behavioral safeguards in place to constrain analytic flexibility and mitigate confirmation bias. In so doing, it treats transparency as a measurable defense mechanism, or as an empirical index of whether epistemic risk was anticipated, managed, or left unchecked.
Finally, Pintegrity serves as a probabilistic estimate of a study's vulnerability to research bias. It integrates the outputs of Ip, Ph, and Tx. As previously noted, each of these components contributes a distinct dimension: Ip assesses structural incentives that may predispose to bias; Ph identifies patterns in statistical results suggestive of questionable practices; and Tx evaluates the presence of transparency measures that can mitigate such risks. These inputs are normalized and combined using weights calibrated against empirical datasets, a method comparable to probabilistic structural integrity methods used in nuclear safety assessments (Chavoshi et al., 2021). This model supports growing efforts to detect risks in research before they lead to failure (Committee on Responsible Science, 2017). By offering a nuanced measure of integrity, the Pintegrity score supports analysis and decision-making by editors, reviewers, funders, and other stakeholders.
3.3 Algebraic specifications
For AIM to operate as a computational model, and to enable scoring, calibration, and cross-study comparison, each of its components must be explicitly formalized. The model's explanatory component, denoted Ip, is algebraically defined as:
where Pe measures extrinsic academic pressure, such as institutional metrics and journal rankings, and Bi captures intrinsic cognitive incentives, such as career ambition or fear of reputational loss. Both are normalized to the [0,1] interval. This dual structure draws from behavioral research showing that misconduct typically arises from the convergence of systemic and psychological incentives (Fanelli, 2010b; Gopalakrishna et al., 2022; Martinson et al., 2005).
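To make the component concrete, the Python sketch below illustrates one possible implementation of Ip. Because the published formula is not reproduced here, the combination of Pe and Bi is assumed to be a simple convex mixture (equal weighting by default); this is an illustrative assumption, not the paper's coefficient, and it merely preserves the [0,1] normalization described above.

```python
def explanatory_component(p_e: float, b_i: float, w_e: float = 0.5) -> float:
    """Sketch of AIM's explanatory component Ip.

    p_e : extrinsic academic pressure (Pe), normalized to [0, 1]
    b_i : intrinsic cognitive incentives (Bi), normalized to [0, 1]
    w_e : relative weight on extrinsic pressure; the equal split is an
          assumption, not the paper's published coefficient
    """
    if not (0.0 <= p_e <= 1.0 and 0.0 <= b_i <= 1.0):
        raise ValueError("Pe and Bi must be normalized to [0, 1]")
    # A convex mixture keeps Ip on the [0, 1] interval.
    return w_e * p_e + (1.0 - w_e) * b_i
```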
The model's predictive component, Ph, is formulated using a logistic function to reflect the nonlinear relationship between surface-level results and underlying data manipulation. It is expressed as:
where Tc and Tr represent the number of statistical tests conducted and reported, respectively, and δp denotes the proportion of p-values clustering between .040 and .050, a range strongly associated with questionable research practices (De Winter and Dodou, 2015; Head et al., 2015; Simonsohn et al., 2014). The function uses the constant e (~2.718) to transform a calculated red-flag score into a probability, and the logistic transformation reflects the empirical fact that small discrepancies between conducted and reported analyses may appear benign until they reach a threshold beyond which the risk of p-hacking rises steeply. When Tr ≈ Tc and δp is moderate, Ph approaches the upper mid-range (≈ 0.5), whereas when Tr < Tc the logistic term rises steeply, signaling elevated distortion risk.
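A minimal Python sketch of a logistic Ph consistent with this description is shown below. The red-flag score combines the relative gap between conducted and reported tests with the clustering proportion δp; the slope constants and the centering at a "moderate" δp of 0.5 are illustrative assumptions chosen to reproduce the qualitative behavior described above, not the paper's calibrated parameters.

```python
import math

def predictive_component(t_conducted: int, t_reported: int, delta_p: float,
                         k_gap: float = 6.0, k_cluster: float = 4.0) -> float:
    """Sketch of AIM's predictive component Ph.

    t_conducted : number of statistical tests conducted (Tc)
    t_reported  : number of statistical tests reported (Tr)
    delta_p     : proportion of reported p-values falling between .040 and .050
    k_gap, k_cluster : illustrative slope constants, not calibrated values
    """
    if t_conducted <= 0:
        raise ValueError("Tc must be positive")
    # Relative under-reporting of tests (0 when everything conducted is reported).
    gap = max(0.0, (t_conducted - t_reported) / t_conducted)
    # Centering at delta_p = 0.5 makes Ph about 0.5 when Tr is close to Tc and
    # clustering is moderate; any under-reporting pushes the logistic up steeply.
    red_flag = k_gap * gap + k_cluster * (delta_p - 0.5)
    return 1.0 / (1.0 + math.exp(-red_flag))
```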
Tx, the detection component of AIM, reflects procedural transparency and robustness. It is defined by a bounded weighted sum:
where Ra is the ratio of reported to conducted analyses, Vr is a binary variable indicating whether the study has been independently replicated, and Mc reflects the degree of match between the reported results and preregistered protocols. The weights (1:3:2) are rooted in the current consensus that successful replication is the strongest indicator of reliability (Christensen et al., 2019; Hardwicke et al., 2022; Munafò et al., 2017). The use of a minimum function ensures that the transparency score is capped at 1.0, preserving interpretability while allowing cumulative contributions across several indicators.
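A corresponding sketch of Tx appears below. The 1:3:2 ratio of weights for Ra, Vr, and Mc follows the text; their absolute scale, and hence how readily the 1.0 cap binds, is an assumption, since the bounded sum is only described qualitatively here.

```python
def detection_component(r_a: float, v_r: int, m_c: float,
                        weights=(0.2, 0.6, 0.4)) -> float:
    """Sketch of AIM's detection component Tx as a bounded weighted sum.

    r_a : ratio of reported to conducted analyses, in [0, 1]
    v_r : 1 if the study has been independently replicated, else 0
    m_c : degree of match between reported results and the preregistered
          protocol, in [0, 1]
    weights : kept in the paper's 1:3:2 ratio; the absolute scale is assumed
    """
    w_ra, w_vr, w_mc = weights
    raw = w_ra * r_a + w_vr * v_r + w_mc * m_c
    # The minimum function caps the transparency score at 1.0 while allowing
    # cumulative contributions across indicators.
    return min(1.0, raw)
```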
AIM integrates the three components into a weighted linear function that produces a scalar integrity score:
The weights emphasize contextual and procedural signals over purely statistical ones, in alignment with critiques of overreliance on p-value forensics (Ioannidis, 2014; Nelson et al., 2018; Wicherts et al., 2016). Here, the departure from a composite sigmoid model is intentional: The linear formulation provides transparency, ease of interpretation, and scalability across disciplines without sacrificing conceptual precision. The 0.40–0.40–0.20 weighting was provisionally selected based on theoretical considerations and pilot testing across a sample of articles in psychology, medicine, and economics.
Each AIM indicator score is assigned using a 4-point ordinal scale ranging from 0 to 3, consistent with reproducibility and risk-of-bias frameworks in meta-research (Wicherts et al., 2016; Hardwicke et al., 2020). A score of 0 reflects the absence of a risk signal (e.g., no evidence of p-value clustering or of departures from preregistration), while a score of 3 indicates strong evidence of procedural vulnerability or contextual pressure. These values are normalized within each component (Ip, Ph, Tx) prior to weighted aggregation. Consistent with prior critiques of overreliance on p-value diagnostics, these weights reflect a deliberate emphasis on transparency and contextual risk (see section 3.5 for further details). By expressing the model in algebraic terms, AIM is formalized into a reproducible diagnostic tool in which each component is modular, empirically motivated, and independently testable. For complete scoring criteria, ordinal thresholds, and operational definitions used to apply these formulas to empirical data, see Appendix A.
The weighting scheme reflects evidence from meta-research indicating that contextual and transparency variables are more stable predictors of research credibility than anomaly-based indicators alone. Studies of questionable research practices have shown that systemic pressures and incomplete disclosure account for a greater share of reproducibility variance than statistical clustering patterns (Fanelli, 2010a; Gopalakrishna et al., 2022; Wicherts et al., 2016; Munafò et al., 2017). Assigning 0.40 weights to both contextual (Ip) and transparency (Tx) components captures this empirical pattern while maintaining interpretive balance across domains. The lower 0.20 weighting for statistical anomalies (Ph) avoids over-sensitivity to noise in small-sample or exploratory studies, consistent with recommendations that integrity models prioritize procedural clarity over signal density (Ioannidis, 2014; Hardwicke et al., 2020). Although theoretically derived, these parameters are designed for calibration: future cross-domain applications can adjust weights through cross-validation or supervised learning once larger benchmark datasets are available.
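The aggregation step can be sketched as follows. The 0.40–0.40–0.20 weights for Ip, Tx, and Ph are taken directly from the text; the averaging rule used to map 0–3 ordinal indicator ratings onto [0,1] within each component, and the orientation of each component prior to aggregation, follow the scoring rules in Appendices A and B and are treated here as simplified assumptions.

```python
WEIGHTS = {"Ip": 0.40, "Ph": 0.20, "Tx": 0.40}  # weights stated in Sections 3.3 and 3.5

def normalize_ordinal(ratings):
    """Map a component's 0-3 ordinal indicator ratings onto [0, 1].

    The paper states that indicator values are normalized within each
    component; simple averaging and rescaling is an assumed rule.
    """
    if not ratings:
        raise ValueError("at least one indicator rating is required")
    if any(r < 0 or r > 3 for r in ratings):
        raise ValueError("ratings must lie on the 0-3 ordinal scale")
    return sum(ratings) / (3.0 * len(ratings))

def p_integrity(i_p: float, p_h: float, t_x: float) -> float:
    """Weighted linear composite of the three normalized components."""
    return WEIGHTS["Ip"] * i_p + WEIGHTS["Ph"] * p_h + WEIGHTS["Tx"] * t_x
```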
3.4 Classification system
To convert AIM from a scoring algorithm into a decision-making tool, the model's continuous integrity score must be paired with an interpretive classification system that preserves nuance without reverting to binary diagnosis. AIM addresses this need through a three-tier classification: low Integrity (0.00–0.40), Moderate Integrity (0.41–0.70), and High Integrity (0.71–1.00). These boundaries were designed to balance sensitivity to research vulnerabilities with the avoidance of false certainty (Grimes et al., 2018). While conventional fraud detection models rely on dichotomous cutoffs, AIM provides a more flexible triage to accommodate the ambiguity that typically characterizes real-world research behavior (Committee on Responsible Science, 2017; Resnik and Shamoo, 2011; Smaldino and McElreath, 2016). Each classification tier signals a level of confidence and is meant to support decision-making by editors, funders, or institutional reviewers. Rather than labeling work as credible or not, the framework encourages evaluators to treat integrity as a gradient shaped by transparency, contextual incentives, and statistical plausibility. This structure thus aligns with broader movements toward differentiated responses to research risk rather than blanket sanctions or blind trust (Vazire, 2017).
These thresholds are not derived from statistical optimization on a pilot dataset but are instead grounded in the structural design of AIM and informed by prior literature on scientific transparency and research misconduct (Ioannidis, 2005; Simmons et al., 2011). A score below 0.40 signals a confluence of high contextual pressure, statistical anomalies, and weak transparency, conditions that frequently emerge in metascientific analyses of retracted or contested studies (Fanelli, 2009; Wicherts et al., 2016; Ioannidis, 2014). Conversely, scores above 0.70 reflect high levels of procedural transparency and a low density of risk indicators, consistent with methodological best practices such as preregistration, open data, and successful replication (Munafò et al., 2017; Hardwicke et al., 2020). Although these thresholds are provisional, they align conceptually with tiered risk frameworks used in both research ethics and clinical diagnostics, which prioritize interpretability over rigid binary delineation (Grimes et al., 2018). Rather than providing definitive judgments, these ranges function as interpretive heuristics for guiding further review and signaling relative risk within a probabilistic landscape of scientific credibility.
Most important, this classification system is extensible. If AIM is embedded in more granular review protocols, such as editorial scoring rubrics, funder dashboards, or institutional audit tools, additional strata can be introduced. Depending on the field, users can fine-tune the thresholds to reflect the typical distribution of Pintegrity scores. A journal operating in a high-risk research domain may justifiably tighten the cutoff for “High Integrity,” whereas a funding agency seeking to flag high-risk exploratory work may widen the middle tier. This flexibility preserves AIM's foundational rationale while making it adaptable to regulatory, editorial, or epistemic priorities. Rather than imposing a single standard across all domains, the classification system is designed to evolve with empirical calibration, and mirrors broader calls for metrics that respond to the complexity of scientific practice (Vazire, 2017; Nosek et al., 2012; Gopalakrishna et al., 2022).
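As a usage illustration, the tier boundaries from Section 3.4 can be wrapped in a small helper whose cutoffs are parameters, reflecting the extensibility described above; the default values are the published thresholds, and any alternative cutoffs a journal or funder supplies are their own calibration choice.

```python
def classify_integrity(score: float,
                       low_cutoff: float = 0.40,
                       high_cutoff: float = 0.70) -> str:
    """Map a Pintegrity score onto AIM's three-tier classification.

    Defaults follow Section 3.4 (Low: 0.00-0.40, Moderate: 0.41-0.70,
    High: 0.71-1.00); journals or funders can tighten or widen the cutoffs.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("Pintegrity scores lie in [0, 1]")
    if score <= low_cutoff:
        return "Low Integrity"
    if score <= high_cutoff:
        return "Moderate Integrity"
    return "High Integrity"

print(classify_integrity(0.62))  # -> "Moderate Integrity"
```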
3.5 Weighting rationale
AIM's weighting (40% each for context and transparency, 20% for statistical anomalies) is based on both conceptual priorities and real-world considerations. As previously discussed, empirical critiques of overreliance on statistical red flags, such as suspicious p-value clustering, have shown limited diagnostic specificity when used in isolation (Simonsohn et al., 2014; De Winter and Dodou, 2015; Ioannidis, 2005). In assigning a lower weight to Ph, AIM avoids replicating this error by prioritizing structural and procedural features that more consistently align with scientific credibility. The equal emphasis on Ip and Tx is deliberate: One represents the environmental pressures that trigger misconduct, whereas the other captures the transparency mechanisms that constrain it. This balance acknowledges that neither context nor conduct alone suffices as an indicator of research integrity—both must be assessed in tandem to deliver a reliable diagnostic signal (Fanelli, 2010a; Gopalakrishna et al., 2022; Martinson et al., 2005).
The current weights were not selected through regression-based optimization but were chosen to qualitatively reflect patterns identified across metascientific domains (Munafò et al., 2017). Research suggests that many breakdowns in scientific credibility come from flawed systems more than from intentional wrongdoing (Smaldino and McElreath, 2016; National Academies of Sciences, 2017; Vazire, 2017). The 0.40–0.40–0.20 distribution encodes this logic: it assigns greater diagnostic weight to context and transparency, which metascientific studies consistently identify as the most reliable predictors of research integrity. Early configurations of AIM that placed disproportionate weight on Ph appeared to penalize exploratory or data-rich studies that, while statistically dense, adhered to sound procedure. This underscores a known limitation of anomaly-based diagnostics: when transparency and contextual variables are omitted, methodological complexity can be mistaken for misconduct (Wicherts et al., 2016; Nelson et al., 2018). AIM's design favors clarity and fairness over sheer sensitivity, and its weights can be fine-tuned as more data become available.
Future calibration of these weights is anticipated and encouraged. As AIM is applied across larger datasets and domains with varying risk profiles, its parameters can be empirically tuned through cross-validation, bootstrapping, or supervised learning. AIM also holds significant promise for fields like educational and cognitive psychology, where replication challenges, flexible analytic strategies, and incentive structures often converge. These domains frequently feature complex, multivariate designs that make statistical patterns alone unreliable indicators of integrity. By incorporating contextual pressures and transparency signals, AIM can distinguish between exploratory rigor and suspect reporting more effectively than anomaly-focused tools. This adaptability makes AIM a valuable complement to p-curves or z-curves, which focus on statistical artifacts. More broadly, this openness to recalibration distinguishes AIM from fixed-metric scoring systems that presume a one-size-fits-all model of misconduct detection (Gopalakrishna et al., 2022; Nosek et al., 2012). By explicitly defining and justifying its weighting scheme, AIM promotes transparency both in the research it assesses and in the methodologies used for its evaluation.
Preliminary sensitivity checks were conducted by perturbing each component weight by ± 0.05 while keeping the total at 1.00. The resulting Pintegrity classifications remained stable across all pilot datasets, with less than 0.03 average deviation in final scores. This indicates that the weighting scheme is robust to modest variations and unlikely to alter qualitative outcomes under reasonable alternatives.
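A minimal version of this sensitivity check is sketched below. The paper specifies only that each weight was perturbed by ± 0.05 with the total held at 1.00; how the offset is redistributed across the other two components is an assumption (here it is split evenly), and the example component values are illustrative.

```python
BASE_WEIGHTS = {"Ip": 0.40, "Ph": 0.20, "Tx": 0.40}

def perturbed_weight_sets(delta: float = 0.05):
    """Yield weight sets with one component shifted by +/- delta and the
    offset split evenly across the other two, keeping the total at 1.00."""
    for target in BASE_WEIGHTS:
        for sign in (+1, -1):
            w = dict(BASE_WEIGHTS)
            w[target] += sign * delta
            for other in w:
                if other != target:
                    w[other] -= sign * delta / 2.0
            yield w

def composite(components, weights):
    return sum(weights[k] * components[k] for k in weights)

# Illustrative component values (those reported for Dataset 4 in Section 4.2).
study = {"Ip": 0.83, "Ph": 0.0, "Tx": 0.72}
baseline = composite(study, BASE_WEIGHTS)
deviations = [abs(composite(study, w) - baseline) for w in perturbed_weight_sets()]
print(round(baseline, 2), round(sum(deviations) / len(deviations), 3))
# -> 0.62 with a mean deviation of roughly 0.026, consistent with the
#    stability reported above.
```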
A complete worked example of AIM scoring and normalization, including raw inputs, ordinal levels, and aggregated results, is provided in Appendix A (Table A1).
4 Applying AIM to benchmark datasets
4.1 Dataset descriptions
The five benchmark datasets used to test AIM were curated from the DataColada project, a repository renowned for its ongoing forensic scrutiny of psychological research (DataColada, n.d.). These datasets were not chosen for their disciplinary uniformity but for their diversity of experimental structure, analytic transparency, and susceptibility to bias, all of which are ideal conditions for evaluating AIM's multi-dimensional scoring system. By selecting cases that vary along these dimensions, this validation corpus tests AIM's ability to discriminate between suspect, ambiguous, and procedurally sound studies, not by field but by the empirical pattern of integrity-related indicators. Each dataset includes both behavioral outcomes and procedural signals relevant to integrity analysis, such as preregistration alignment, replication status, reporting scope, and statistical granularity. The dataset names used here (“Just Posting It,” “Clusterfake,” “My Class Year Is Harvard,” “The Cheaters Are Out of Order,” and “Forgetting the Words”) come from blog titles created by DataColada. These names are not the official publication titles but serve as shorthand for the studies under critique. The original study names are deliberately omitted by the blog's authors, likely as a legal and ethical precaution, given the forensic nature of their analyses.
Dataset 1, “Just Posting It,” investigates how emotions such as shame and guilt mediate prosocial behavior. Subjects were exposed to distinct experimental conditions and then asked to donate coins, enabling the detection of subtle psychological nudges on generosity. However, no p-values were present in the dataset, making it impossible to compute key AIM indicators such as Ph and Tx. As such, Dataset 1 was excluded from scoring.
Dataset 2, “Clusterfake,” simulates financial misreporting in a tax scenario. Participants self-reported income, deductions, and final payments under incentives to distort the truth. This dataset allows AIM to test for bias where pressure is overt but transparency is low, precisely the kind of interaction the model is designed to quantify. Dataset 3, “My Class Year Is Harvard,” captures attitude shifts under cognitive dissonance. Participants wrote persuasive essays under varying degrees of choice and then reported attitudinal alignment. The tension between subjective outcomes and obscure analytic choices creates the kind of inferential ambiguity that anomaly-oriented diagnostics misread, and it subjects AIM to a test of its contextual sensitivity.
The final two datasets deepen the model's field coverage. Dataset 4, “The Cheaters Are Out of Order,” examines dishonesty in a coin-toss guessing task. Statistical improbability is juxtaposed with self-reported ethicality and affective state, challenging AIM to balance red flags with procedural nuance. Dataset 5, “Forgetting the Words,” explores regulatory focus manipulations in a moral impurity task. Participants reflected on life goals under promotion, prevention, or control conditions, then reported moral attitudes and social intentions. The design includes essay data, manipulation checks, and a complex factorial structure—highlighting the role of interpretive fidelity over statistical simplicity.
These five datasets, spanning self-report, behavioral economics, moral psychology, and regulatory framing, constitute a robust testbed for AIM's explanatory, predictive, and detection components. Most important, they test whether the model can detect risk without over-penalizing exploratory complexity, and whether it can appropriately withhold judgment when key analytic inputs are unavailable. A full description of each dataset, including its behavioral focus, sample size, and AIM indicators, is shown in Appendix B.
4.2 Results across datasets
When applied to the five datasets, AIM produced scores that reflected theoretical expectations and empirical nuance. Dataset 1 (“Just Posting It”) could not be assigned a valid Ph or Tx score due to the absence of any reported p-values. As a result, no Pintegrity score was computed. Dataset 2 (“Clusterfake”) yielded a low overall integrity score (Pintegrity = 0.44), driven by high contextual pressure (Ip = 0.88) and limited transparency (Tx = 0.23). Ph appeared low (0.0), but this reflects a lack of clustering rather than an absence of risk, especially given overreporting of statistical outputs. Dataset 3 (“My Class Year Is Harvard”) fell into the moderate range (Pintegrity = 0.52), with high contextual pressure (Ip = 0.78), moderate transparency (Tx = 0.51), and no overt p-hacking indicators (Ph = 0.0). Dataset 4 (“The Cheaters Are Out of Order”) returned the highest score among the four analyzable datasets (Pintegrity = 0.62), balancing high contextual pressure (Ip = 0.83) with solid transparency (Tx = 0.72) and no detectable statistical anomalies. Dataset 5 (“Forgetting the Words”) received a moderate score (Pintegrity = 0.51), combining the strongest transparency profile (Tx = 0.93) with low contextual pressure (Ip = 0.35) and no anomaly signals (Ph = 0.0).
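As a plausibility check on these figures, applying the 0.40–0.20–0.40 weights from Section 3.3 directly to the reported component values reproduces each published Pintegrity score to two decimal places. The sketch below assumes no further transformation intervenes between the component values and the composite; the dataset labels follow the DataColada shorthand used above.

```python
WEIGHTS = {"Ip": 0.40, "Ph": 0.20, "Tx": 0.40}

reported = {
    "Clusterfake":                   {"Ip": 0.88, "Ph": 0.0, "Tx": 0.23, "P": 0.44},
    "My Class Year Is Harvard":      {"Ip": 0.78, "Ph": 0.0, "Tx": 0.51, "P": 0.52},
    "The Cheaters Are Out of Order": {"Ip": 0.83, "Ph": 0.0, "Tx": 0.72, "P": 0.62},
    "Forgetting the Words":          {"Ip": 0.35, "Ph": 0.0, "Tx": 0.93, "P": 0.51},
}

for name, vals in reported.items():
    recomputed = sum(WEIGHTS[k] * vals[k] for k in WEIGHTS)
    # Each recomputed value matches the reported Pintegrity after rounding.
    print(f"{name}: reported {vals['P']:.2f}, recomputed {recomputed:.2f}")
```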
The dispersion of scores reveals the model's strength not in identifying misconduct per se, but in detecting procedural vulnerability, where interpretive leeway and reporting opacity introduce epistemic risk. High Tx scores tended to elevate overall integrity, particularly when paired with moderate contextual pressure. For example, Dataset 5's complex factorial design might have triggered red flags under a purely anomaly-oriented model. But AIM's weighting structure, particularly its emphasis on transparency, preserved the distinction between complexity and opacity. This aligns with recommendations for moving beyond dichotomous models of fraud vs. innocence to capturing where scientific work may be less robust or more susceptible to distortion (Fanelli, 2018; Ioannidis, 2014). That sensitivity to gradient risk positions AIM as a diagnostic instrument rather than a punitive mechanism. Across datasets, the integrity scores appeared not as verdicts but as structured signals highlighting where closer scrutiny, clearer documentation, or independent replication would strengthen credibility.
Borderline cases, such as Dataset 3 and Dataset 4, illustrate the model's capacity for interpretive fidelity. Both studies avoided overt misconduct but revealed methodological ambiguity, either in selective reporting or pressure-laden framing. AIM neither penalized these studies unduly nor gave them clean slates; rather, it assigned moderate scores that reflect residual uncertainty. In so doing, the model resisted the pitfalls of overconfidence and overcorrection. These results suggest that AIM performed as intended, even when p-value data were missing or anomaly signals were absent. Its probabilistic orientation enables reviewers, editors, and meta-researchers to differentiate between structural fragility and outright deception, a distinction often missed in traditional checklists or binary scoring systems. Insofar as research integrity exists on a spectrum, AIM's outputs serve less as verdicts and more as maps of epistemic terrain. Such maps are particularly useful when datasets vary in format, outcome domain, and evidentiary structure. Across the five distinct cases—four scored and one excluded—AIM demonstrated interpretive robustness by assigning scores that reflected procedural vulnerability without overcorrecting for complexity or rewarding transparency alone. The results validate AIM's internal coherence and lay the groundwork for broader applications, including in fields such as educational psychology and cognitive psychology, where risk signals are often diffuse and analytic flexibility is common (Almutawa et al., 2025; Hohl and Dolcos, 2024; Uddin, 2021). Table 1 summarizes AIM scores alongside the status of each study.
4.3 Diagnostic patterns and edge cases
Across the four analyzable datasets, AIM showed consistent patterns in how signs of integrity either grouped together or pulled apart. High Tx scores tended to elevate overall integrity, even in cases with moderate contextual pressure or ambiguous statistical structures. This means that transparency acts as a stabilizing variable: preregistration, replication, and open data effectively dampen the impact of other red flags. However, Dataset 5 (“Forgetting the Words”) demonstrated that high transparency alone does not guarantee a high integrity score, particularly when both Ip and Ph are low. Conversely, inflated Ph scores, whether due to clustering or overreporting, lowered integrity estimates primarily when transparency was weak or contextual pressure was high. The model's weighting structure thus avoided over-penalizing exploratory methods when they are clearly documented. This pattern validates the model's core premise: that research integrity is not reducible to statistical output alone, but emerges from an interaction between procedural safeguards and motivational context. When those safeguards are present, statistical density can be interpreted as complexity. But when they are absent, the same signals become indicators of risk.
Edge cases further sharpened this distinction. Dataset 4 (“The Cheaters Are Out of Order”) featured high transparency but strong contextual pressure and no observable p-value anomalies, producing the highest overall score. This confirms the model's ability to recognize when high-risk settings are balanced by procedural robustness. Similarly, Dataset 3 (“My Class Year Is Harvard”) yielded a balanced overall score not because it was cleanly transparent, but because its high incentive structure was partially offset by a lack of p-hacking indicators. In both cases, AIM assigned neither absolution nor indictment. Instead, it produced a probabilistic evaluation that recognized epistemic risk without collapsing it into binary logic. This interpretive restraint sets AIM apart from tools that infer misconduct from deviation, treating all irregularity as malfeasance. Put differently, the model acknowledges that complexity without clarity warrants scrutiny, not suspicion, and that in a research culture that prizes methodological sophistication, such restraint is not optional.
Dataset 1 (“Just Posting It”) presented a different kind of edge case: a complete absence of p-value data. Rather than force an unreliable estimate, AIM withheld scoring, reflecting its capacity to respect data limitations rather than generate false certainty. These results underscore the model's role as an epistemic filter. It neither replaces peer review nor prescribes corrective action. It simply reveals structural signals that can inform judgment. That function is valuable in research endeavors where analytic flexibility is high and incentive structures are diffuse. Rather than flagging deviation, AIM contextualizes it. Rather than assuming that statistical irregularity implies intent, it asks whether that irregularity is buffered by transparency and made legible through disclosure. This disposition enables a more precise allocation of investigative or editorial attention by flagging what requires closer scrutiny, and by not condemning what is atypical. As a result, AIM's outputs are not judgments. They are invitations to verify, replicate, question, or qualify. As such, the outputs help operationalize the fundamental principle that integrity is not the absence of fraud, but the presence of conditions that make distortion unlikely (Fanelli, 2018; Ioannidis, 2014).
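The withholding behavior described here can be expressed as a simple completeness guard that refuses to score when an input dimension is absent. The status labels below, and the inline combination reused from the previous sketch, are illustrative assumptions rather than part of AIM's published specification.

```python
from typing import Optional

# Illustrative "withhold rather than guess" guard. The weights repeat the
# assumptions of the previous sketch; none of this is the published AIM rule.

def score_or_withhold(ip: Optional[float],
                      ph: Optional[float],
                      tx: Optional[float]) -> dict:
    """Score only when every component input is available; otherwise withhold."""
    missing = [name for name, value in (("Ip", ip), ("Ph", ph), ("Tx", tx))
               if value is None]
    if missing:
        # Absent p-value data (or other inputs): report what is missing
        # instead of forcing an unreliable estimate.
        return {"status": "withheld", "missing": missing, "score": None}
    risk = (1.0 - 0.5 * tx) * (0.5 * ph + 0.5 * ip)
    return {"status": "scored", "missing": [],
            "score": round((1.0 - risk) * 0.8 + 0.2 * tx, 3)}


# A study reporting no p-values supplies no Ph input, so the evaluation is
# withheld rather than forced into a number.
print(score_or_withhold(ip=0.6, ph=None, tx=0.7))
```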
5 Implications, limitations, future directions
5.1 Implications
AIM introduces a diagnostic framework for research evaluation that prioritizes structural and procedural risk indicators. This orientation reflects metascientific efforts to foster integrity not by punitive enforcement but through institutionalized transparency, accountability, and methodological rigor (Moher et al., 2020; Nosek et al., 2015). By merging contextual pressures, statistical irregularities, and procedural openness, AIM offers a probabilistic model of research risk that supplements peer review and editorial protocols with a comparative scoring tool (Fanelli, 2018; Ioannidis, 2014). Journals, funders, and institutions can adopt AIM to improve consistency and anticipate methodological risks. Editorial boards can use AIM heuristics to flag submissions for further methodological vetting and enhance traditional review without displacing expert judgment (Horbach and Halffman, 2019; Tennant et al., 2017). Funding agencies may use AIM retrospectively to track the epistemic reliability of published studies linked to grant portfolios and create a standardized audit layer in lieu of fragmented oversight mechanisms (Bouter et al., 2016; Wicherts et al., 2016). Because AIM analyzes signal patterns rather than discipline-specific norms, it can surface risks that cut across fields, favoring transparency and nuance over standards that might penalize exploratory work (Begley and Ioannidis, 2015; Munafò et al., 2017).
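As a hypothetical illustration of how an editorial board might operationalize such heuristics, the rule-based triage below routes submissions using component-level scores. The thresholds and routing labels are invented for this example and would need calibration against a journal's own review outcomes.

```python
# Hypothetical editorial triage built on component-level scores.
# Assumption: ip and ph are risk signals (higher = more risk) and tx is a
# transparency score (higher = stronger safeguards), all in [0, 1].

def triage(ip: float, ph: float, tx: float) -> str:
    """Return an illustrative routing decision for a submission."""
    if ph >= 0.7 and tx <= 0.3:
        return "methodological review: anomaly signals with weak safeguards"
    if ip >= 0.7 and tx <= 0.5:
        return "request disclosure: high-pressure context, limited transparency"
    if tx >= 0.8:
        return "standard review: documented safeguards buffer residual risk"
    return "standard review"


print(triage(ip=0.8, ph=0.2, tx=0.9))  # pressure buffered by transparency
print(triage(ip=0.4, ph=0.8, tx=0.2))  # anomalies without safeguards
```

A rule set of this kind would supplement, not replace, expert judgment; its value lies in making the flagging criteria explicit and auditable.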
Unlike rigid reporting standards that may stifle exploratory or data-rich study designs, AIM is calibrated to reward transparency and contextual clarity rather than conformity. Its scoring system supports methodological diversity by rewarding transparent design choices and guarding against rigor-focused reforms that might constrain innovation (Chambers, 2017; Nosek et al., 2012). In this way, AIM balances the need for procedural accountability with the creative demands of frontier research. As scientific governance increasingly pivots toward preventive models, AIM offers a scalable tool that aligns institutional evaluation with the probabilistic nature of epistemic risk, helping to advance a culture of forward-looking integrity and methodological responsibility (Moher et al., 2020; Nosek et al., 2015).
The model also carries implications for research training and policy development by modeling how integrity can be operationalized as a function of structural conditions rather than individual intent. Graduate curricula in research-intensive disciplines could incorporate AIM-based heuristics to teach students how procedural clarity, design transparency, and contextual disclosure jointly shape the epistemic reliability of a study (Gopalakrishna et al., 2022; Haven et al., 2019). Likewise, policymakers concerned with enhancing replicability infrastructure, such as registries, data repositories, and reporting standards, could adopt AIM as a unifying evaluative schema that encourages coordinated oversight across the research lifecycle (Hardwicke et al., 2020). Because AIM can operate even when some dimensions are unavailable—such as in cases lacking p-values or reported statistics—it provides a realistic framework for judgment under partial information. By reframing integrity as a probabilistic property of systems, AIM offers a pedagogically accessible and policy-relevant scaffold for building a culture of trustworthy scientific research.
5.2 Theoretical and empirical advantages
At the heart of AIM lies a clear departure from models that reduce scientific reliability to surface-level artifacts. Whereas traditional tools typically assume that trustworthiness can be read off surface features, such as p-values clustering just below 0.05 or the presence of preregistration checklists, AIM adopts a systems-based epistemology. The model draws on structural theories of knowledge production, which treat integrity not as a trait of individual actors or isolated practices, but as the emergent product of interlocking methodological and institutional constraints (Douglas, 2009; Kitcher, 2001). This perspective holds that integrity emerges when methodological clarity, procedural disclosure, and contextual alignment reinforce one another. Meta-research lends support to this principle by showing that research failures rarely result from a single breach but more often from the accumulation of small, interacting vulnerabilities (Munafò et al., 2017; Gopalakrishna et al., 2022). AIM formalizes this logic by embedding each epistemic dimension (statistical anomaly, procedural opacity, and contextual pressure) into a scoring system that tracks how vulnerabilities compound into systemic breakdown.
This integrated design sets AIM apart from earlier integrity tools such as the p-curve and z-curve, whose strengths are also their limitations. The p-curve identifies signs of selective reporting by analyzing the shape of statistically significant results, while the z-curve estimates replicability through the distribution of test statistics (Schimmack, 2020; Simonsohn et al., 2014). Yet neither model accounts for the procedural contexts from which such results arose: whether analytic decisions were justified, whether methods were shared, or whether institutional incentives distorted disclosure. AIM closes this gap by incorporating both inferential and procedural data into a multi-vector model. By combining statistical irregularities with transparency breakdowns under pressure, AIM reveals the structural roots of research risk. Moreover, its design allows scores to be withheld or flagged as incomplete when input dimensions, such as p-values or replication data, are missing, reinforcing interpretive humility. This shift positions AIM as an explanatory model aimed at improving institutions rather than merely auditing results.
It applies this logic early in the research lifecycle (during manuscript preparation, preregistration, and data sharing), when risks are still manageable. AIM's inputs, such as ambiguity in reporting practices and contextual volatility, are well suited for extraction by natural language processing systems, which have already shown the capacity to flag similar signals in manuscripts before peer review (Johnson et al., 2016; Marshall et al., 2015). This allows AIM to detect latent risks and address them by providing formative guidance before a study is published. This advantage casts AIM less as a tool for retrospective analysis and more as a diagnostic system embedded in the infrastructure of research itself. Unlike statistical checkers and procedural checklists, the model generates a probability vector that captures the systemic pressures surrounding a study. Its primary function is not to settle questions of validity, but to surface patterns of vulnerability that institutions can investigate or monitor over time.
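As a toy illustration of the kind of surface signal such a screen might extract before peer review, the sketch below flags ambiguous-reporting phrases with simple pattern matching. The phrase list and the flagging threshold are invented for this example; the systems cited above rely on trained models rather than keyword lists.

```python
import re

# Toy screen for ambiguous-reporting language. Patterns and threshold are
# illustrative assumptions, not a validated instrument.
AMBIGUITY_PATTERNS = [
    r"\bmarginally significant\b",
    r"\btrend(?:ed|ing)? toward significance\b",
    r"\bapproach(?:ed|ing) significance\b",
    r"\bexploratory analys[ie]s (?:was|were) conducted\b",
    r"\bdata not shown\b",
]


def flag_reporting_ambiguity(manuscript_text: str, threshold: int = 2) -> dict:
    """Count ambiguous-reporting phrases and flag the text if they accumulate."""
    hits = [p for p in AMBIGUITY_PATTERNS
            if re.search(p, manuscript_text, flags=re.IGNORECASE)]
    return {"matched_patterns": hits, "flagged": len(hits) >= threshold}


sample = ("The effect was marginally significant (p = .058) and trended toward "
          "significance at follow-up; additional results are data not shown.")
print(flag_reporting_ambiguity(sample))
```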
5.3 Limitations
AIM reframes research assessment around structural and procedural risk but remains bounded by the epistemological assumptions embedded in its design. Its reliance on probabilistic indicators in lieu of deterministic proof limits its ability to adjudicate individual misconduct or assign causal responsibility for research distortions. In fields where fraud is concealed behind sophisticated data manipulation or falsified records, AIM may fail to detect malfeasance altogether (Bik et al., 2016; Stroebe et al., 2012). And because AIM sees risk as a pattern shaped by multiple interacting forces, it offers interpretations rather than black-and-white conclusions. In this way, AIM shares the limitations of other meta-research tools that place system-level risk mapping over case-level attribution (Fanelli, 2009).
The model's flexibility also introduces methodological limits. Although AIM's architecture is meant to accommodate different study designs and disciplines, the selection and weighting of signals may inadvertently reflect the biases of its users. Scoring variability could arise not from differences in study quality but from inconsistent application of AIM criteria, especially when evaluators differ in statistical literacy or contextual knowledge (Wicherts et al., 2016; Bouter et al., 2016). Some cues, such as unusual p-value trends or pressure to publish, demand interpretation rather than rigid coding. Like many interpretive tools, AIM can make it difficult to see where human judgment influences the outcome, which complicates consistent evaluation (Munafò and Davey Smith, 2018).
Even where the model offers conceptual clarity and diagnostic potential, its institutional implementation presents challenges. Integrating the model into editorial workflows or funding evaluations demands time, training, and a cultural shift. A degree of resistance may also arise from stakeholders who perceive AIM as reductive, particularly in places where scoring outputs are used punitively (Resnik and Shamoo, 2017; Haven and Tijdink, 2023). Also, because AIM produces probabilistic outputs, its findings risk misinterpretation by those lacking statistical training. These risks underscore the value of casting AIM as a supplementary tool that advances transparency while respecting disciplinary nuance.
5.4 Future directions
Realizing AIM's potential requires more than conceptual clarity. It requires institutional anchoring within the systems that shape how research is conducted, reviewed, and disseminated. Its modular design makes it compatible with a range of implementation settings: grant review committees, editorial dashboards, reproducibility audits, and open science platforms. Further, because each of its three components (statistical irregularity, procedural opacity, and contextual pressure) can be encoded in structured or semi-structured data, the model is adaptable to both human-led assessments and AI-assisted evaluations (Elman et al., 2018b; Karjus, 2025). This versatility allows it to serve both as a decision-support tool and as a coordinating framework that aligns different reform efforts across the research lifecycle.
However, the success of such institutional embedding depends on the model's ability to accommodate the epistemic diversity of the scientific community. What counts as transparency in computational neuroscience may not apply to field-based ethnography, and pressure variables will manifest differently across disciplinary funding systems. Calibrating AIM to different fields will therefore require iterative refinement to preserve both its flexibility and its accuracy. Pilot programs led by research consortia, disciplinary societies, and national councils could help refine the model's scoring system, keeping its core structure intact while improving its fit across fields of scientific research (Moravcsik, 2014).
Future research should also examine how AIM could evolve into a scalable policy tool for scientific integrity. Journals could implement longitudinal AIM scoring to assess whether editorial reforms improve the reliability of their publication portfolios. Funding agencies might integrate AIM into program evaluations, tracing epistemic outcomes across cohorts, institutions, or thematic domains. Graduate programs could adapt AIM-inspired rubrics to evaluate thesis projects not solely on originality or execution, but on methodological transparency and epistemic rigor. In each case, AIM would shift from a heuristic to a normative framework that treats foresight, accountability, and structural alignment as essential to scientific practice (Mejlgaard et al., 2020; Organisation for Economic Co-operation and Development, 2022).
6 Conclusion
The pursuit of statistical rigor has unwittingly spawned conditions that reward distortion over disclosure. As long as publication remains contingent upon p-values falling below a conventional threshold, scientific integrity will continue to erode in subtle but systematized ways (Fraser et al., 2018; Head et al., 2015). Existing models such as the p-curve and z-curve, while effective in detecting suspicious distributions, offer little insight into the cognitive or institutional forces that shape p-hacking before the first analysis is run (Bartoš and Schimmack, 2022; Lakens, 2021). AIM addresses this epistemic gap by reframing misconduct not as the product of isolated corrupt scientists but as the predictable output of a risk-laden system. By unifying statistical anomalies, procedural ambiguity, and contextual pressure into a single scoring framework, AIM exposes how structural vulnerabilities build toward systemic failure (Gopalakrishna et al., 2022; Munafò et al., 2017; Ioannidis, 2014). Its capacity to withhold scores when inputs are missing, rather than force judgment, reinforces its commitment to interpretive integrity. And because it provides a mechanism for detecting epistemic drift before it hardens into distortion, the model's primary value lies in probabilistic foresight.
This shift in orientation from retrospective audits to systemic diagnostics elevates AIM from a passive analytic to an active system safeguard. Unlike checklist-based reforms that codify transparency without contextual nuance, this model is inherently adaptive. It responds to domain-specific pressures while preserving cross-disciplinary comparability (Bouter et al., 2016; Haven and Tijdink, 2023). By not relying exclusively on p-value clustering or statistical density, AIM avoids the false precision of anomaly-only tools and instead generates partial but informative risk profiles even in low-signal environments. Its probabilistic scoring system does not eliminate uncertainty but transforms it into an actionable signal that editors, funders, and institutions can use to direct scrutiny, allocate oversight, and preempt loss of credibility. By exposing how methodological choices interact with incentive systems to produce statistical vulnerabilities, AIM empowers editors, funders, and institutional reviewers to monitor epistemic risk without suppressing innovation (Horbach and Halffman, 2019; Moher et al., 2020). In this sense, the model is not just a tool for evaluating completed research, but a framework for shaping research environments where credibility is more likely to emerge and less likely to erode under pressure.
Implemented across editorial workflows, funding evaluations, and research training, AIM could help institutionalize a new standard of epistemic responsibility. Its modularity makes it compatible with AI-assisted screening tools, reproducibility dashboards, and structured peer review protocols, while its theoretical architecture remains anchored in empirical insights from meta-research (Elman et al., 2018a; Karjus, 2025; Pineau et al., 2021). Rather than reducing integrity to a set of metrics, AIM captures how it emerges from the system as a whole. The point is not to enforce orthodoxy, but to clarify which deviations signal systemic failure. It follows that AIM does what traditional diagnostic tools cannot: it turns statistical noise into epistemic signal, and risk detection into potential structural reform. As the demands on science intensify, integrity must evolve from a normative ideal into a measurable condition of research itself. AIM offers a credible path toward such transformation. In practical terms, AIM can be embedded at multiple levels of scientific governance. Editorial boards can use component-level AIM scores (Ip, Ph, Tx) to flag manuscripts that warrant deeper methodological review before publication. Funding agencies can integrate AIM metrics into grant evaluations to assess transparency readiness, while institutions can use the same framework in integrity audits and researcher training programs. Because the model's variables are explicitly defined and auditable, these implementations would promote transparency without imposing rigid bureaucratic controls, helping stakeholders translate diagnostic insight into corrective and preventive action.
Data availability statement
The datasets analyzed in this study are available at www.datacolada.org. A compiled and cleaned version of the datasets used in this manuscript is also available upon request from the corresponding author at: affognon@unlv.nevada.edu. Requests to access the datasets should be directed to Don Affognon: affognon@unlv.nevada.edu.
Author contributions
DA: Writing – original draft, Writing – review & editing.
Funding
The author(s) declare that no financial support was received for the research and/or publication of this article.
Conflict of interest
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Gen AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2025.1675991/full#supplementary-material
References
Almutawa, S. S., Alshehri, N. A., AlNoshan, A. A., AbuDujain, N. M., Almutawa, K. S., and Almutawa, A. S. (2025). The influence of cognitive flexibility on research abilities among medical students: cross-section study. BMC Med. Educ. 25:7. doi: 10.1186/s12909-024-06445-4
Amrhein, V., Greenland, S., and McShane, B. (2019). Scientists rise up against statistical significance. Nature 567, 305–307. doi: 10.1038/d41586-019-00857-9
Bartoš, F., and Schimmack, U. (2022). Z-curve 2.0: estimating replication rates and discovery rates. Meta Psychol. 6:e2720. doi: 10.15626/MP.2021.2720
Bazerman, M. H., and Tenbrunsel, A. E. (2011). Blind Spots: Why we Fail to do What's Right and What to do About It. Princeton, NJ: Princeton University Press. doi: 10.1515/9781400837991
Begley, C. G., and Ioannidis, J. P. A. (2015). Reproducibility in science: improving the standard for basic and preclinical research. Circ. Res. 116, 116–126. doi: 10.1161/CIRCRESAHA.114.303819
Berner, E. S., and Graber, M. L. (2008). Overconfidence as cause of diagnostic error in medicine. Am. J. Med. 121, S2–S23. doi: 10.1016/j.amjmed.2008.01.001
Bik, E. M., Casadevall, A., and Fang, F. C. (2016). The prevalence of inappropriate image duplication in biomedical research publications. mBio 7, e00809–16. doi: 10.1128/mBio.00809-16
Bishop, D. (2019). Rein in the four horsemen of irreproducibility. Nature 568, 435–435. doi: 10.1038/d41586-019-01307-2
Bouter, L. M., Tijdink, J., Axelsen, N., Martinson, B. C., and Ter Riet, G. (2016). Ranking major and minor research misbehaviors: results from a survey among participants of four World Conferences on Research Integrity. Res. Integr. Peer Rev. 1:17. doi: 10.1186/s41073-016-0024-5
Chambers, C. D. (2017). The Seven Deadly Sins of Psychology: A Manifesto for Reforming the Culture of Scientific Practice. Princeton, NJ: Princeton University Press. doi: 10.1515/9781400884940
Chavoshi, S. Z., Booker, J., Bradford, R., and Martin, M. (2021). A review of probabilistic structural integrity assessment in the nuclear sector and possible future directions. Fatigue Fract. Eng. Mater. Struct. 44, 3227–3257. doi: 10.1111/ffe.13572
Christensen, G., Dafoe, A., Miguel, E., Moore, D. A., and Rose, A. K. (2019). A study of the impact of data sharing on article citations using journal policies as a natural experiment. PLOS ONE 14:e0225883. doi: 10.1371/journal.pone.0225883
DataColada. (n.d.). Main blog. Available online at: https://datacolada.org/ (Accessed July 29, 2025)
De Winter, J. C., and Dodou, D. (2015). A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ 3:e733. doi: 10.7717/peerj.733
Douglas, H. E. (2009). Science, Policy, and the Value-Free Ideal. Pittsburgh, PA: University of Pittsburgh Press. doi: 10.2307/j.ctt6wrc78
Elman, C., Kapiszewski, D., and Lupia, A. (2018a). Transparent social inquiry: implications for political science. Annu. Rev. Polit. Sci. 21, 29–47. doi: 10.1146/annurev-polisci-091515-025429
Elman, J. A., Jak, A. J., Panizzon, M. S., Tu, X. M., Chen, T., Reynolds, C. A., et al. (2018b). Underdiagnosis of mild cognitive impairment: consequence of ignoring practice effects. Alzheimer's Dementia Diagn. Assess. Dis. Monit. 10, 372–381. doi: 10.1016/j.dadm.2018.04.003
Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLOS ONE 4:e5738. doi: 10.1371/journal.pone.0005738
Fanelli, D. (2010a). Do pressures to publish increase scientists' bias? An empirical support from US States data. PLoS ONE 5:e10271. doi: 10.1371/journal.pone.0010271
Fanelli, D. (2010b). “Positive” results increase down the hierarchy of the sciences. PLoS ONE 5:e10068. doi: 10.1371/journal.pone.0010068
Fanelli, D. (2018). Is science really facing a reproducibility crisis, and do we need it to? Proc. Natl. Acad. Sci. 115, 2628–2631. doi: 10.1073/pnas.1708272114
Festinger, L. (1962). A Theory of Cognitive Dissonance. Stanford, CA: Stanford University Press. doi: 10.1038/scientificamerican1062-93
Fisher, C. B., Hoagwood, K., Boyce, C., Duster, T., Frank, D. A., Grisso, T., et al. (1990). Research Ethics in Social Science. New York, NY: The New Press.
Fraser, H., Parker, T., Nakagawa, S., Barnett, A., and Fidler, F. (2018). Questionable research practices in ecology and evolution. PLoS ONE 13:e0200303. doi: 10.1371/journal.pone.0200303
Gelman, A., and Loken, E. (2023). Garden of Forking Paths: The Hidden Pitfalls of Data Analysis and How to Avoid Them. New York, NY: Cambridge University Press.
Gigerenzer, G., Todd, P. M., and the ABC Research Group (1999). Simple Heuristics That Make Us Smart. New York, NY: Oxford University Press.
Gopalakrishna, G., Ter Riet, G., Vink, G., Stoop, I., Wicherts, J. M., Bouter, L. M., et al. (2022). Prevalence of questionable research practices, research misconduct and their potential explanatory factors: a survey among academic researchers in The Netherlands. PLoS ONE 17:e0263023. doi: 10.1371/journal.pone.0263023
Greenland, S., Pearl, J., and Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology 10, 37–48. doi: 10.1097/00001648-199901000-00008
Grimes, D. R., Bauch, C. T., and Ioannidis, J. P. A. (2018). Modelling science trustworthiness under publish or perish pressure. R. Soc. Open Sci. 5:171511. doi: 10.1098/rsos.171511
Hardwicke, T. E., Thibault, R. T., Kosie, J. E., Wallach, J. D., Kidwell, M. C., Ioannidis, J. P. A., et al. (2022). Estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014–2017). Perspect. Psychol. Sci. 17, 239–251. doi: 10.1177/1745691620979806
Hardwicke, T. E., Wallach, J. D., Kidwell, M. C., Bendixen, T., Crüwell, S., Ioannidis, J. P. A., et al. (2020). An empirical assessment of transparency and reproducibility-related research practices in the social sciences (2014–2017). R. Soc. Open Sci. 7:190806. doi: 10.1098/rsos.190806
Haven, T. L., Bouter, L. M., Smulders, Y. M., and Tijdink, J. K. (2019). Perceived publication pressure in Amsterdam: survey of all disciplinary fields and academic ranks. PLoS ONE 14:e0217931. doi: 10.1371/journal.pone.0217931
Haven, T. L., and Tijdink, J. K. (2023). How to combine rules and commitment in fostering research integrity? Account. Res. 31, 917–943. doi: 10.1080/08989621.2023.2191192
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., and Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biol. 13:e1002106. doi: 10.1371/journal.pbio.1002106
Hofmann, B., Thoresen, M., and Holm, S. (2023). Research integrity attitudes and behaviors are difficult to alter: results from a ten-year follow-up study in Norway. J. Empir. Res. Hum. Res. Ethics 18, 50–57. doi: 10.1177/15562646221150032
Hohl, K., and Dolcos, S. (2024). Measuring cognitive flexibility: a brief review of neuropsychological, self-report, and neuroscientific approaches. Front. Hum. Neurosci. 18:1331960. doi: 10.3389/fnhum.2024.1331960
Horbach, S. P. J. M., and Halffman, W. (2019). The ability of different peer review procedures to flag problematic publications. Scientometrics 118, 339–373. doi: 10.1007/s11192-018-2969-2
Hubbard, R., and Bayarri, M. J. (2003). Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing. Am. Stat. 57, 171–178. doi: 10.1198/0003130031856
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med. 2:e124. doi: 10.1371/journal.pmed.0020124
Ioannidis, J. P. A. (2014). How to make more published research true. PLoS Med. 11:e1001747. doi: 10.1371/journal.pmed.1001747
Ioannidis, J. P. A. (2023). In defense of quantitative metrics in researcher assessments. PLoS Biol. 21:e3002408. doi: 10.1371/journal.pbio.3002408
John, L. K., Loewenstein, G., and Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532. doi: 10.1177/0956797611430953
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., et al. (2016). MIMIC-III, a freely accessible critical care database. Sci. Data 3:160035. doi: 10.1038/sdata.2016.35
Kahneman, D. (1974). Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131. doi: 10.1126/science.185.4157.1124
Karjus, A. (2025). Machine-assisted quantitizing designs: augmenting humanities and social sciences with artificial intelligence. Hum. Soc. Sci. Commun. 12:277. doi: 10.1057/s41599-025-04503-w
Kitcher, P. (2001). Science, Truth, and Democracy, 1st edn. New York: Oxford University Press. doi: 10.1093/0195145836.001.0001
Kunda, Z. (1990). The case for motivated reasoning. Psychol. Bull. 108, 480–498. doi: 10.1037/0033-2909.108.3.480
Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspect. Psychol. Sci. 16, 639–648. doi: 10.1177/1745691620958012
Lakens, D. (2022). Sample size justification. Collabra: Psychol. 8:33267. doi: 10.1525/collabra.33267
Leveson, N. G. (2012). Engineering a Safer World: Systems Thinking Applied to Safety. Cambridge, MA: The MIT Press. doi: 10.7551/mitpress/8179.001.0001
Marshall, I. J., Kuiper, J., and Wallace, B. C. (2015). Automating risk of bias assessment for clinical trials. IEEE J. Biomed. Health Inf. 19, 1406–1412. doi: 10.1109/JBHI.2015.2431314
Martinson, B. C., Anderson, M. S., and de Vries, R. (2005). Scientists behaving badly. Nature 435, 737–738. doi: 10.1038/435737a
McShane, B. B., and Gal, D. (2017). Statistical significance and the dichotomization of evidence. J. Am. Stat. Assoc. 112, 885–895. doi: 10.1080/01621459.2017.1289846
Mejlgaard, N., Bouter, L. M., Gaskell, G., Kavouras, P., Allum, N., Bendtsen, A.-K., et al. (2020). Research integrity: nine ways to move from talk to walk. Nature 586, 358–360. doi: 10.1038/d41586-020-02847-8
Merton, R. K. (1973). The Sociology of Science: Theoretical and Empirical Investigations. Chicago, IL: University of Chicago Press.
Messick, S. (1995). Validity of psychological assessment: validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am. Psychol. 50, 741–749. doi: 10.1037/0003-066X.50.9.741
Moher, D., Bouter, L., Kleinert, S., Glasziou, P., Sham, M. H., Barbour, V., et al. (2020). The Hong Kong principles for assessing researchers: fostering research integrity. PLOS Biol. 18:e3000737. doi: 10.1371/journal.pbio.3000737
Moravcsik, A. (2014). Transparency: the revolution in qualitative research. Polit. Sci. Polit. 47, 48–53. doi: 10.1017/S1049096513001789
Moss, J., and De Bin, R. (2023). Modelling publication bias and p-hacking. Biometrics 79, 319–331. doi: 10.1111/biom.13560
Munafò, M. R., and Davey Smith, G. (2018). Robust research needs many lines of evidence. Nature 553, 399–401. doi: 10.1038/d41586-018-01023-3
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie Du Sert, N., et al. (2017). A manifesto for reproducible science. Nat. Hum. Behav. 1:21. doi: 10.1038/s41562-016-0021
National Academies of Sciences, Engineering, and Medicine. (2017). Fostering Integrity in Research. Washington, DC: The National Academies Press. doi: 10.17226/21896
Nelson, L. D., Simmons, J., and Simonsohn, U. (2018). Psychology's renaissance. Annu. Rev. Psychol. 69, 511–534. doi: 10.1146/annurev-psych-122216-011836
Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., et al. (2015). Promoting an open research culture. Science 348, 1422–1425. doi: 10.1126/science.aab2374
Nosek, B. A., Spies, J. R., and Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspect. Psychol. Sci. 7, 615–631. doi: 10.1177/1745691612459058
Organisation for Economic Co-operation and Development. (2022). Recommendation on Research Security. Paris: OECD. Available online at: https://www.oecd.org/sti/recommendation-on-research-security.htm
Patel, S., and Green, A. (2025). Death by p-value: the overreliance on p-values in critical care research. Crit. Care 29:73. doi: 10.1186/s13054-025-05307-9
Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Larochelle, H. (2021). Improving reproducibility in machine learning research: a report from the NeurIPS 2019 reproducibility program. J. Mach. Learn. Res. 22, 1–20. doi: 10.48550/arXiv.2003.12206
Resnik, D. B. (2006). The Price of Truth: How Money Affects the Norms of Science. New York, NY: Oxford University Press. doi: 10.1093/acprof:oso/9780195309782.001.0001
Resnik, D. B., and Shamoo, A. E. (2011). The Singapore statement on research integrity. Account. Res. 18, 71–75. doi: 10.1080/08989621.2011.557296
Resnik, D. B., and Shamoo, A. E. (2017). Fostering research integrity. Account. Res. 24, 367–372. doi: 10.1080/08989621.2017.1334556
Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Can. Psychol. 61, 364–376. doi: 10.1037/cap0000246
Schimmack, U. (2021). Invalid claims about the validity of implicit association tests by prisoners of the implicit social-cognition paradigm. Perspect. Psychol. Sci. 16, 435–442. doi: 10.1177/1745691621991860
Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366. doi: 10.1177/0956797611417632
Simonsohn, U., Nelson, L. D., and Simmons, J. P. (2014). P-curve: a key to the file-drawer. J. Exp. Psychol. Gen. 143, 534–547. doi: 10.1037/a0033242
Smaldino, P. E., and McElreath, R. (2016). The natural selection of bad science. R. Soc. Open Sci. 3:160384. doi: 10.1098/rsos.160384
Stroebe, W., Postmes, T., and Spears, R. (2012). Scientific misconduct and the myth of self-correction in science. Perspect. Psychol. Sci. 7, 670–688. doi: 10.1177/1745691612460687
Tennant, J. P., Dugan, J. M., Graziotin, D., Jacques, D. C., Waldner, F., Mietchen, D., et al. (2017). A multi-disciplinary perspective on emergent and future innovations in peer review. F1000Research 6:1151. doi: 10.12688/f1000research.12037.3
Uddin, L. Q. (2021). Cognitive and behavioural flexibility: neural mechanisms and clinical considerations. Nat. Rev. Neurosci. 22, 167–179. doi: 10.1038/s41583-021-00428-w
van Aert, R. C. M., Wicherts, J. M., and Van Assen, M. A. L. M. (2016). Conducting meta-analyses based on p values: reservations and recommendations for applying p-uniform and p-curve. Perspect. Psychol. Sci. 11, 713–729. doi: 10.1177/1745691616650874
Vazire, S. (2017). Quality uncertainty erodes trust in science. Collabra Psychol. 3:1. doi: 10.1525/collabra.74
Wasserman, H., and Madrid-Morales, D. (2019). An exploratory study of “fake news” and media trust in Kenya, Nigeria and South Africa. Afr. J. Stud. 40, 107–123. doi: 10.1080/23743670.2019.1627230
Wasserstein, R. L., and Lazar, N. A. (2016). The ASA's statement on p-values: context, process, and purpose. Am. Stat. 70, 129–133. doi: 10.1080/00031305.2016.1154108
Wasserstein, R. L., Schirm, A. L., and Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” Am. Stat. 73, 1–19. doi: 10.1080/00031305.2019.1583913
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., and van Assen, M. A. L. M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. Front. Psychol. 7:1832. doi: 10.3389/fpsyg.2016.01832
Appendix
Appendix A: Example Scoring and Normalization
Table A1. End-to-end scoring vignette: mapping raw indicators to ordinal levels and normalized values.
Appendix B: Scoring Rules for AIM Categories
Keywords: academic integrity, p-value distributions, P-curve, publish or perish, Z-curve
Citation: Affognon DA (2025) Beyond the p < 0.05 trap: the adaptive integrity model for preventing and detecting P-hacking. Front. Educ. 10:1675991. doi: 10.3389/feduc.2025.1675991
Received: 29 July 2025; Revised: 13 November 2025;
Accepted: 18 November 2025; Published: 09 December 2025.
Edited by: Gavin T. L. Brown, The University of Auckland, New Zealand
Reviewed by: Teddy Lazebnik, University of Haifa, Israel; Hendrik Tevaearai Stahel, University Hospital of Bern, Switzerland
Copyright © 2025 Affognon. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Don A. Affognon, affognon@unlv.nevada.edu