REVIEW article

Front. Comput. Sci., 12 February 2026

Sec. Theoretical Computer Science

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1717711

Algorithmic self-repair: frontiers in fault-tolerant computation

  • College of Engineering and Information Technology, University of Dubai, Dubai, United Arab Emirates

Abstract

How can algorithms continue to function when confronted with faults, noise, and malicious behavior? This question lies at the heart of resilient computation, a challenge addressed by multiple traditions but rarely examined through a unified lens. In this article, we introduce the concept of algorithmic self-repair as a framework for understanding how algorithms detect, mitigate, and recover from failures. We compare five major classes of algorithmic self-repair: (1) self-stabilizing algorithms that guarantee convergence from arbitrary states; (2) self-healing graph algorithms that preserve connectivity under dynamic failures; (3) error-resilient online algorithms that sustain competitiveness despite uncertain or corrupted inputs; (4) redundancy-based and probabilistic repair techniques that achieve robustness through replication or stochastic correction; and (5) Byzantine fault-tolerant algorithms that maintain correctness even in the presence of adversarial participants. By consolidating these approaches into a shared taxonomy, we highlight their guiding principles, strengths, and trade-offs. The result is not merely a survey but a structured foundation and roadmap for advancing resilient computation, positioning algorithmic self-repair as a frontier where fault tolerance becomes a defining design principle of algorithms.

1 Introduction

Resilient computation has long been recognized as a central challenge in computer science. Modern systems are expected to function in environments where transient faults, adversarial attacks, and unexpected disruptions are inevitable. Distributed networks, real-time decision-making platforms, and large-scale cloud infrastructures all demand algorithms that not only achieve efficiency under ideal conditions but also maintain correctness when confronted with noise, failure, or manipulation. The question of how algorithms can continue to operate under such adverse conditions has become more pressing as computational systems grow in complexity and reach (Kumari and Kaur, 2021).

Traditional approaches to fault tolerance have typically been tailored to narrow classes of problems: redundancy mechanisms to guard against random hardware errors, Byzantine protocols to withstand malicious participants, or stabilization methods to restore systems from arbitrary states. Each of these lines of work has produced deep theoretical insights and practical successes, yet they have often been pursued in isolation. As a result, the field presents a fragmented landscape, where different schools of thought develop overlapping solutions without a shared vocabulary or unifying perspective. This fragmentation makes it difficult to compare approaches, transfer ideas across domains, or build general theories of resilience in algorithms (Isukapalli and Srirama, 2024).

This article introduces the perspective of algorithmic self-repair as a way to consolidate and organize these diverse contributions. The central idea is to analyze algorithms according to how they detect, mitigate, and recover from failures, rather than focusing solely on their nominal correctness in failure-free settings. By bringing together results from distributed computing, online algorithms, graph theory, and probabilistic methods, we aim to provide a coherent taxonomy that highlights common design principles and clarifies essential trade-offs among efficiency, robustness, and adaptability. Our intention is not to provide an exhaustive account of every contribution in fault-tolerant computing, but to identify representative paradigms that illustrate the foundations of self-repair and to trace how these ideas have evolved across different communities.

The scope of this survey is thus theoretical rather than architectural. We concentrate on algorithmic techniques, their mathematical underpinnings, and the limits of what can be achieved in adversarial or uncertain environments. The discussion emphasizes five principal categories: (1) self-stabilizing algorithms, (2) self-healing graph algorithms, (3) error-resilient online algorithms, (4) redundancy-based and probabilistic repair methods, and (5) Byzantine fault-tolerant algorithms. By comparing these categories within a single framework, we reveal structural connections that are often overlooked when they are studied separately. Figure 1 provides a high-level roadmap from classical fault tolerance to the proposed algorithmic self-repair framework, summarizing the unifying taxonomy, shared metrics, and its broader implications.

Figure 1. Roadmap from classical fault tolerance to the proposed algorithmic self-repair framework.

The remainder of the paper develops this perspective systematically. Section 2 introduces the foundations of algorithmic self-repair and presents a unified framework with a classification of major models. Section 3 situates this framework within prior literature and clarifies how it differs from existing surveys. Sections 4–8 examine each category in detail, emphasizing theoretical principles and representative results. Section 9 then turns to trade-offs and hybrid strategies, showing how different approaches can be evaluated and integrated. Finally, Section 10 summarizes the main insights and outlines directions for future research.

2 Algorithmic self-repair

This section introduces algorithmic self-repair as a unifying paradigm for resilient computation. It captures the capacity of algorithms to detect, mitigate, and recover from faults without external intervention, going beyond fixed redundancy or pre-programmed procedures. By adapting dynamically to changing conditions, self-repairing algorithms ensure that distributed networks, online platforms, and autonomous systems can function despite unpredictable failures, adversarial behavior, or incomplete information. The framework we develop specifies faults, repair strategies, and computational guarantees, and organizes major approaches into a coherent classification that highlights resilience as a fundamental property of computation.

To position the terminology more clearly, we emphasize that algorithmic self-repair operates explicitly at the algorithmic level: the detection of inconsistency, the initiation of repair, and the restoration of guarantees are all part of the algorithm's own control structure rather than delegated to external system mechanisms. This places self-repair in contrast with traditional fault tolerance, which often relies on architectural masking, hardware redundancy, or system-level coordination. The term also highlights a unifying and compositional perspective: diverse paradigms such as self-stabilization, self-healing networks, redundancy-based repair, probabilistic methods, and Byzantine agreement all implement an internal recovery loop that can be expressed through a shared structure. Algorithmic self-repair therefore provides not a renaming of classical ideas, but an integrative lens that makes their common algorithmic essence explicit.

2.1 A unified language for algorithmic self-repair

Algorithmic resilience has long been articulated through three implicit components: the assumptions about faults, the mechanisms that restore correct behavior, and the guarantees that remain valid once recovery occurs. These elements appear throughout foundational work—from Dijkstra's characterization of convergence under transient faults to the fault and guarantee structures embedded in Byzantine agreement protocols (Dijkstra, 1974; Lamport et al., 1982; Gärtner, 1999; Schneider, 1993).

What differs across research traditions is not the presence of these components, but the way they are expressed. Each paradigm introduces its own terminology and analytical conventions, often making cross-comparison difficult. To provide a common vocabulary, we make these components explicit through the triplet

〈F, R, G〉

where F denotes the assumed fault model, R captures the algorithmic mechanism that performs repair or recovery, and G represents the formal guarantees achieved after repair. This notation does not replace existing theories; rather, it distills the shared structure already present in them and enables a unified analytical perspective.

The failure model F encompasses the types of disruptions an algorithm is designed to withstand, including transient disturbances, permanent crashes, adversarial inputs, and fully arbitrary Byzantine deviations. The repair strategy R encodes how the system restores correctness, ranging from convergence techniques in self-stabilizing protocols to redundancy-based masking, probabilistic repair mechanisms, and consensus procedures used to maintain consistent state. The guarantee G specifies what the algorithm ensures under these conditions: bounded-time convergence, probabilistic correctness, bounded communication overhead, or consistency properties such as agreement and validity.

Using the 〈F, R, G〉 triplet across all categories provides a uniform lens through which the survey synthesizes ideas that historically evolved independently. It allows self-stabilization, self-healing networks, redundancy-driven repair, error-resilient online algorithms, and Byzantine-tolerant consensus to be discussed within a single coherent structure, highlighting both their shared foundations and their distinct trade-offs. This intuition can be formalized by specifying precisely what it means for an algorithm to repair itself.

An algorithm is said to be self-repairing if its behavior can be described by a triplet A = 〈F, R, G〉, where F specifies a class of faults, R is an algorithmic mechanism that detects and mitigates their effects, and G is a formal guarantee that the algorithm restores and preserves once faults stop occurring. Concretely, an algorithm is self-repairing if, after any finite sequence of faults from F, the repair mechanism R eventually re-establishes a state in which G holds and maintains G until new faults occur.

The set F denotes the admissible fault patterns under which the algorithm is required to repair. It may include transient faults, crash faults, omission faults, Byzantine deviations, or combinations thereof. The repair mechanism R denotes the algorithm's transition relation under the presence and absence of faults. It specifies how the system evolves in response to faults from F and how it progresses toward re-establishing the guarantee G. The guarantee G specifies the predicate or set of configurations that represent the intended correct behavior of the algorithm once faults have ceased.
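
As a purely illustrative reading of this definition, the triplet can be rendered as a small data structure together with a check that the repair mechanism eventually re-establishes the guarantee once faults cease. All names below (`SelfRepairSpec`, `repairs`, the toy instance) are hypothetical scaffolding chosen for exposition, not drawn from any cited work.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical rendering of the <F, R, G> triplet; names are illustrative.
@dataclass
class SelfRepairSpec:
    fault: Callable[[object], object]      # F: how a fault perturbs a state
    repair: Callable[[object], object]     # R: one repair step on a state
    guarantee: Callable[[object], bool]    # G: predicate on recovered states

def repairs(spec: SelfRepairSpec, state, max_steps=100) -> bool:
    """Check that R re-establishes G within max_steps after faults cease."""
    for _ in range(max_steps):
        if spec.guarantee(state):
            return True
        state = spec.repair(state)
    return spec.guarantee(state)

# Toy instance: F corrupts a counter, R decays it toward 0, G is "state == 0".
toy = SelfRepairSpec(
    fault=lambda s: 7,               # transient fault: arbitrary value
    repair=lambda s: max(0, s - 1),  # repair: step back toward legitimacy
    guarantee=lambda s: s == 0,
)
assert repairs(toy, toy.fault(0))    # the guarantee is eventually restored
```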

To further clarify how the terminology relates to established models of resilient computation, Table 1 summarizes the correspondence between classical notions and the unifying perspective provided by algorithmic self-repair. Our terminology for faults and dependability follows the classical taxonomy of Avizienis et al. (2004), which provides the standard definitions for faults, errors, and failures in dependable and secure computing.

Table 1

Classical notion | Relation to algorithmic self-repair
Fault tolerance | Traditionally defined at the system or architectural level; algorithmic self-repair focuses instead on the explicit detection and correction logic built into the algorithm itself.
Resilience | Broad ability of systems to maintain acceptable performance under disturbance; algorithmic self-repair explains this behavior in terms of concrete algorithmic mechanisms that detect faults, repair state, and restore guarantees.
Self-stabilization | A canonical example of algorithmic self-repair in which the algorithm is designed to converge from arbitrary or corrupted states back to a legitimate configuration after transient faults.
Self-healing networks | Structural repair strategies in dynamic graphs that reconfigure the topology after node or edge failures, fitting naturally as algorithms whose primary repair action is to adjust connectivity while preserving global properties.
Byzantine fault tolerance | Focuses on correctness in the presence of arbitrary or adversarial behavior; algorithmic self-repair captures these protocols as instances where the repair logic enforces agreement and consistency despite malicious participants.
Error-resilient online algorithms | Illustrate how self-repair can be embedded into sequential decision-making by maintaining competitive performance even when inputs are noisy, incomplete, or adversarially perturbed.

Classical notions of resilient computation mapped to the concept of algorithmic self-repair.

Throughout the paper, the standard fault taxonomy used in resilient computation is followed. A transient fault is a temporary corruption of local state or messages that may place the system in an arbitrary configuration but after which components behave correctly again. A crash fault permanently halts a component, an omission fault causes send or receive actions to be skipped, and a Byzantine fault allows arbitrary, potentially adversarial deviations from the algorithm.

2.2 Taxonomy of algorithmic self-repair

Algorithmic self-repair offers a broad paradigm for understanding how computation can persist in the presence of faults, adversarial inputs, and uncertainty. The diversity of approaches that fall under this perspective can be organized into five principal classes, each defined by its assumptions about failures, the strategies it employs, and the guarantees it provides. Taken together, these classes illustrate the coherence of a field that has often been treated as fragmented, while also highlighting the underlying principles that unify resilient computation.

To give this taxonomy a principled structure, each class is defined by its primary algorithmic repair mechanism rather than by application domain or historical lineage. Self-stabilizing algorithms repair through state convergence; self-healing algorithms repair through structural reconfiguration; error-resilient online algorithms repair by correcting sequential decisions under uncertain or corrupted inputs; redundancy-based and probabilistic methods repair through replication or stochastic correction; and Byzantine-tolerant algorithms repair by enforcing adversarial-robust agreement. This mechanism-based perspective parallels classical dependability taxonomies (Avizienis et al., 2004; Koren and Krishna, 2020; Kumari and Kaur, 2021; Solouki et al., 2024) and clarifies how these five classes collectively span the core algorithm-level approaches to self-repair.

The five classes considered here arise naturally from the different ways in which repair is carried out at the algorithmic level. Each reflects a distinct and long-standing line of work in which recovery is encoded directly into an algorithm's behavior: restoration through convergence, repair by structural reconfiguration, correction within online decision processes, resilience via redundancy and probabilistic inference, and agreement under adversarial behavior. These families capture the principal algorithmic approaches that formalize detection, response, and recovery with provable guarantees. Techniques such as checkpointing, rollback recovery, or architectural replication are typically implemented at the system or middleware layer (Hasan and Goraya, 2018; Rehman et al., 2022; Isukapalli and Srirama, 2024) and therefore play a complementary rather than primary role in this classification. When relevant, such system-oriented methods can still be interpreted within the general framework introduced earlier, but the comparative discussion centers on the five algorithmic classes that form the core landscape of self-repair.

The first class is that of self-stabilizing algorithms. These algorithms are designed to recover from arbitrary initial states or from transient perturbations of the system's global state, converging within finite time to a correct configuration. They capture the idea that stability can be restored without external reset, making them particularly suitable for dynamic and decentralized environments (Dijkstra, 1974; Schneider, 1993; Altisen et al., 2022; Guellati and Kheddouci, 2010). Their effectiveness is commonly evaluated by convergence time and by the amount of local memory required at each node. A classic example is leader election: even if faults create multiple leaders, a stabilizing algorithm ensures eventual recovery to a unique leader.

The second class consists of self-healing graph algorithms. Although certain graph algorithms exhibit both convergence and structural adjustment, self-healing is distinguished by its primary repair mechanism: the algorithm actively modifies the network's structure to restore connectivity or other topological properties. These methods maintain desirable properties such as low diameter or high fault-tolerance under repeated attacks or failures (Gallos and Fefferman, 2015; Pandurangan et al., 2016; Avin et al., 2008; Pandurangan et al., 2012). Their performance is measured by recovery time, stretch (the change in path lengths), and degree increase (the added load per node). A representative application is dynamic routing in a damaged network, where traffic must be rerouted efficiently.
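
To make the structural-repair idea concrete, the following minimal sketch (our own simplification for illustration, not a specific published protocol from the cited literature) deletes a node from an undirected graph and reconnects its orphaned neighbors in a ring, so each surviving neighbor gains at most two edges while connectivity through the deleted node is restored:

```python
# Assumption: the graph is undirected, stored as a dict of adjacency sets.
def delete_and_heal(adj, v):
    """Remove node v, then ring-connect its former neighbors as repair."""
    nbrs = sorted(adj.pop(v))
    for u in nbrs:
        adj[u].discard(v)
    for i, u in enumerate(nbrs):            # ring among former neighbors:
        w = nbrs[(i + 1) % len(nbrs)]       # bounded degree increase (<= 2)
        if u != w:
            adj[u].add(w)
            adj[w].add(u)
    return adj

def connected(adj):
    """Depth-first check that the remaining graph is connected."""
    if not adj:
        return True
    seen, stack = set(), [next(iter(adj))]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj[u] - seen)
    return len(seen) == len(adj)

# Star graph: deleting the hub would disconnect everything without healing.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
delete_and_heal(star, 0)
assert connected(star)
```

The "stretch" and "degree increase" measures mentioned above are visible here: former two-hop paths through the deleted hub become ring paths, and no survivor gains more than two edges.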

The third class involves error-resilient online algorithms. Here, repair is triggered by the need to act sequentially without knowledge of future inputs, while also compensating for noise, missing data, or adversarially corrupted requests. Sequential uncertainty is therefore an inherent part of the repair mechanism (Albers, 2003; Karp et al., 1990; Markarian, 2024; Hoi et al., 2021). Their guarantees are typically expressed through competitive analysis, regret bounds, or explicit tolerances to noise. Navigation systems are a familiar example: they must produce feasible routes despite occasional GPS inaccuracies.
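
The break-even ski-rental rule is a textbook illustration of the competitive guarantees used in this class (error-resilient variants extend this style of analysis to noisy or corrupted inputs). The sketch below, with parameter names of our choosing, checks empirically that the rule's cost never exceeds twice the offline optimum, whatever the unknown season length turns out to be:

```python
# Break-even ski rental: rent until cumulative rent reaches the buy price B,
# then buy. The season length is unknown (possibly adversarial) in advance.
def ski_rental_cost(season_days, buy_price, rent_price=1):
    rented = min(season_days, buy_price // rent_price)  # rent up to break-even
    cost = rented * rent_price
    if season_days > rented:
        cost += buy_price                               # then buy
    return cost

def offline_opt(season_days, buy_price, rent_price=1):
    # Clairvoyant optimum: either rent every day or buy immediately.
    return min(season_days * rent_price, buy_price)

B = 10
for days in range(1, 50):
    assert ski_rental_cost(days, B) <= 2 * offline_opt(days, B)  # 2-competitive
```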

The fourth class is that of redundancy-based and probabilistic repair techniques. Here, what is repaired is the correctness of data or computation: replication, coding, and probabilistic inference are used to reconstruct the intended output, even when some messages or operations are corrupted or lost. Rooted in information theory (Shannon, 1948; Hennessy and Milner, 1992; Pieprzyk et al., 2023), these methods are analyzed in terms of redundancy overhead and recovery guarantees. Classic examples include error-correcting codes in satellite communication, probabilistic filtering in sensor networks, and robust inference under noisy measurements (Tang and Breugel, 2020; Ma et al., 2023).
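
A minimal instance of redundancy-based repair is the repetition code: each bit is replicated r times and decoded by majority vote, so any block with fewer than r/2 flipped copies is reconstructed correctly. The sketch below is standard textbook material, with function names of our choosing:

```python
from collections import Counter

def encode(bits, r=3):
    """Repetition code: send each bit r times."""
    return [b for b in bits for _ in range(r)]

def decode(received, r=3):
    """Majority vote within each block of r copies repairs isolated flips."""
    out = []
    for i in range(0, len(received), r):
        votes = Counter(received[i:i + r])
        out.append(votes.most_common(1)[0][0])
    return out

msg = [1, 0, 1, 1]
sent = encode(msg)
sent[0] ^= 1          # one corrupted copy per block is tolerated
sent[4] ^= 1
assert decode(sent) == msg
```

The trade-off named above is explicit here: robustness against single flips costs a factor-r redundancy overhead in transmission.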

The fifth class comprises Byzantine fault-tolerant algorithms. These methods guarantee correctness even when some participants behave arbitrarily or maliciously (Lamport et al., 1982; Castro and Liskov, 1999; Chen, 2020; Correia et al., 2011). Their core guarantees include safety, liveness, and explicit thresholds on the maximum fraction of faulty participants. Modern blockchain consensus protocols such as PBFT and its variations (Ren et al., 2023; Chevalier et al., 2019; Ongaro and Ousterhout, 2014) exemplify this approach.
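
One concrete guarantee in PBFT-style protocols is that a client may trust a reply once f + 1 replicas return matching values: with at most f faulty replicas, any f + 1 matching replies include at least one honest sender. The sketch below is our illustrative simplification of that certification rule, not the full protocol:

```python
from collections import Counter

def certify(replies, f):
    """Accept a value only if it appears in at least f + 1 replies."""
    for value, count in Counter(replies).items():
        if count >= f + 1:
            return value
    return None  # not enough agreement yet

f = 1                              # tolerate one Byzantine replica
n = 3 * f + 1                      # classical n > 3f replication bound
honest_value = "commit"
replies = [honest_value] * (n - f) + ["bogus"] * f
assert certify(replies, f) == honest_value
# A lie told by all f faulty replicas can never reach f + 1 matches:
assert certify(["bogus"] * f, f) is None
```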

Some widely used resilience techniques, such as checkpointing, rollback recovery, and system-level replication, are not included as separate classes in this taxonomy. These mechanisms are typically implemented at the operating-system, middleware, or storage level (Isukapalli and Srirama, 2024; Kumari and Kaur, 2021) and operate independently of an algorithm's internal control logic. Nevertheless, their behavior can still be expressed within the 〈F, R, G〉 triplet by viewing rollback as a repair strategy that restores a previously recorded consistent state.

These boundaries ensure that the taxonomy remains coherent while accommodating natural overlaps between different forms of repair. Although the five classes differ in assumptions, methods, and guarantees, they share the fundamental goal of enabling computation to continue in uncertain, dynamic, and potentially hostile environments. Table 2 provides a comparative overview that aligns inputs, outputs, and objectives across classes.

Table 2

Category | Assumed input (failure) | Produced output (recovered state) | Repair objective | Relation to resilient computation
Self-stabilizing algorithms | Arbitrary or corrupted system state caused by transient faults | Stable configuration consistent across all nodes | Convergence without external resets | Recovery under minimal assumptions using only local rules
Self-healing graph algorithms | Networks with failed or removed nodes or edges | Repaired topology restoring connectivity and structure | Preserve connectivity and efficiency | Embeds repair directly into evolving network structures
Error-resilient online algorithms | Input sequences with noise, missing requests, or adversarial corruption | Robust online decisions forming a feasible sequence | Maintain efficiency vs. offline optimum | Extends online computation by incorporating explicit fault models
Redundancy and probabilistic repair | Faulty or lossy data and communication channels | Correct outputs via replication, coding, or inference | Reliability with high probability and limited overhead | Bounds uncertainty through redundancy and stochastic methods
Byzantine fault-tolerant algorithms | Distributed systems with nodes acting arbitrarily or maliciously | Consistent agreement among honest participants | Safety, liveness, correctness | Strongest guarantees in adversarial environments

Comparison of algorithmic self-repair models under the unified framework.

To clarify the computational setting underlying the five classes of algorithmic self-repair, we specify the models in which these algorithms are traditionally studied. All algorithms considered here operate over a collection of computational entities, called nodes or processes. Each node stores a local state, and the global system state is the combination of all local states together with the structure of the communication medium. For the classes originating in distributed computing, including self-stabilizing algorithms, self-healing graph algorithms, and Byzantine fault-tolerant algorithms, the communication medium is modeled as a graph whose vertices represent processes and whose edges represent communication links (Dijkstra, 1974; Schneider, 1993; Avizienis et al., 2004). Computation proceeds by local interactions in which nodes read the states of their neighbors, exchange messages, and update their own state. These distributed models rely on explicit connectivity assumptions that guarantee that information can propagate sufficiently to enable recovery. For example, self-healing graph algorithms assume that after failures the remaining network still contains enough structure to coordinate reconfiguration (Gallos and Fefferman, 2015; Pandurangan et al., 2016), and Byzantine agreement is feasible only when the number of faulty processes is below the classical threshold that preserves consistent communication among honest nodes (Lamport et al., 1982; Castro and Liskov, 1999). Dynamic networks discussed later in the survey are interpreted within the same graph-based formalism, with the graph allowed to evolve over time while maintaining minimal connectivity for progress.

Error-resilient online algorithms are studied in a sequential model that does not rely on a network at all. In this setting, requests arrive one at a time, and an algorithm must irrevocably act without knowing future inputs. The system state consists of the algorithm's internal memory and the history of previously observed requests, and guarantees are analyzed through online-performance measures such as the competitive ratio (Albers, 2003; Karp et al., 1990; Markarian, 2024). By contrast, redundancy-based and probabilistic repair techniques may operate in a centralized model, as in classical error-correcting codes that protect data stored on a single device (Shannon, 1948), or in a distributed model, as in replicated and encoded storage systems that place redundant fragments across multiple servers (Venkatesha and Parthasarathi, 2024; Isukapalli and Srirama, 2024). In these settings, the computational model specifies how data are stored, transmitted, and potentially corrupted, and repair is analyzed relative to stochastic noise processes or to the placement and accessibility of redundant information. This explicit description of nodes, networks, system states, and modeling assumptions ensures that the comparison between classes is grounded in well-defined computational foundations.

3 Related work

Research on fault tolerance has been carried out across many domains of computer science, yet surveys in this area are often tied to specific infrastructures or applications rather than abstract algorithmic principles. In cloud computing, Kumari and Kaur (2021) provide a detailed review of mechanisms such as checkpointing and replication, but their focus remains largely at the level of infrastructure redundancy. Similarly, Koren and Krishna (2020) present a broad system-level classification, emphasizing hardware and architectural techniques while stopping short of abstracting these concepts into algorithmic models. Within software systems, Solouki et al. (2024) highlight strategies for embedded computing under real-time constraints, and Isukapalli and Srirama (2024) examine recovery protocols in distributed data systems. Complementary efforts include Venkatesha and Parthasarathi (2024) on hardware-level error detection and Yu et al. (2024) on robustness in machine learning. Collectively, these surveys provide valuable advances within their domains, but they remain primarily domain-centric, without offering a unifying framework that exposes the common algorithmic principles underpinning these approaches.

Algorithmic perspectives have also emerged in more specialized contexts. Albers (2003), Markarian (2024), and Markarian et al. (2024) survey online algorithms under uncertainty, with rigorous analyses of competitiveness and adaptivity. Yet these works do not explicitly address recovery from faults. Saad and Saia (2014) investigate self-healing computation in adversarial environments, but their models remain limited to specific attack scenarios rather than general abstractions. In distributed computing, the foundational work of Dolev and Strong (1983) established the principles of authenticated Byzantine agreement and laid the groundwork for resilient protocols. However, these results were not situated within a broader classification of fault-tolerance strategies.

Self-stabilization has provided one of the earliest formal paradigms for algorithmic recovery since the pioneering work of Dijkstra. Schneider (1993) captures these origins, while Guellati and Kheddouci (2010) extend the perspective to graph algorithms. Tixeuil (2009) and Altisen et al. (2022) provide accessible introductions and formal treatments of the paradigm, and Feldmann et al. (2020) explore applications in overlay networks. Relatedly, Zhong et al. (2023) review Byzantine consensus, emphasizing scalability and blockchain infrastructures. These works serve as essential references within their respective paradigms, but they remain confined to specific classes of algorithms and do not bridge the conceptual gap between stabilization, Byzantine resilience, and adaptive online methods.

Beyond domain-specific or paradigm-specific studies, Babaoglu et al. (2004) articulated the “Self-Star Vision,” which called for systems capable of self-healing, self-stabilization, and self-management. Although visionary, this work remained aspirational, without offering the formal taxonomies or provable algorithmic guarantees required for systematic analysis.

More recent surveys and research have begun to connect resilience with advances in distributed and machine learning contexts. Bouhata et al. (2024) provide a timely survey of Byzantine fault tolerance in distributed machine learning, with particular attention to the resilience of first-order optimization methods. Wei and Liu (2025) propose a taxonomy for trustworthy distributed artificial intelligence systems, encompassing robustness, privacy, and governance. Dahan and Levy (2024) introduce Weight for Robustness, an algorithm that achieves optimal convergence in Byzantine-robust asynchronous training. Barrak et al. (2023) present SPIRT, a serverless peer-to-peer learning framework with strong resilience to peer failures. Finally, Andreina et al. (2025) examine robustness under transfer attacks in distributed learning systems, showing how heterogeneity can improve adversarial accuracy. These contributions further extend the relevance of fault-tolerance research into the emerging landscape of distributed artificial intelligence.

These bodies of work highlight the richness of research on fault tolerance, but they also underscore the absence of a unified algorithmic perspective. Most surveys are constrained either by their domain of application or by their focus on a single paradigm. Others are visionary but lack the formal abstractions needed to guide rigorous analysis. This survey responds to that gap by introducing the framework of algorithmic self-repair, formalized as a triplet of failure model, repair strategy, and computational guarantee. Through this abstraction, fault-tolerant algorithms across stabilization, healing, redundancy, Byzantine consensus, and error-resilient online computation are brought into a single taxonomy. The framework makes explicit the trade-offs among efficiency, robustness, and adaptability, and it enables structured comparison across paradigms.

Table 3 situates our work in relation to existing surveys. It illustrates how prior efforts have been constrained by scope or perspective, while the present contribution provides a cross-cutting theory of algorithmic self-repair with provable analytical foundations.

Table 3

Survey | Domain/application | Main focus | Key limitation | Relation to the unified framework
Rehman et al., 2022 | Cloud computing infrastructures | Fault tolerance mechanisms in virtualized and service-based systems | Remains domain-specific; does not abstract beyond infrastructure redundancy | Transforms redundancy into a general repair strategy with formal guarantees
Kumari and Kaur, 2021 | Cloud platforms | Checkpointing and replication for cloud reliability | Purely infrastructure-level; no formal classification of strategies | Reinterprets these techniques within a general algorithmic taxonomy of repair
Koren and Krishna, 2020 | Hardware and system architecture | System-level resilience at hardware and architectural layers | Hardware-centric; lacks abstraction into general computational models | Elevates these ideas into a unifying classification of failures and recovery methods
Solouki et al., 2024; Isukapalli and Srirama, 2024; Venkatesha and Parthasarathi, 2024; Yu et al., 2024 | Embedded systems, distributed data, processors, and artificial intelligence | Fault tolerance through error detection, recovery protocols, and hardware redundancy | Narrow coverage, tied to specific domains or technologies; do not connect to a broader theory | Unifies these efforts into a cross-cutting taxonomy of failures, repair strategies, and guarantees
Albers, 2003; Markarian, 2024; Markarian et al., 2024 | Online algorithms | Competitiveness and adaptivity under uncertainty | Do not consider faults or explicit recovery mechanisms | Extends the online model to explicitly incorporate repair strategies and fault recovery
Saad and Saia, 2014 | Adversarial distributed systems | Self-healing computation in hostile environments | Limited to narrow adversarial scenarios; lacks generality | Generalizes self-healing across diverse failure models within a unified framework
Schneider, 1993; Guellati and Kheddouci, 2010; Altisen et al., 2022; Feldmann et al., 2020 | Self-stabilization | Recovery from arbitrary initial states through convergence | Confined to a single paradigm; isolated from Byzantine fault tolerance and online resilience | Integrates self-stabilization with other paradigms under a single classification
Zhong et al., 2023 | Blockchain and distributed consensus | Byzantine fault-tolerant consensus protocols | Limited to consensus and scalability; does not generalize across paradigms | Positions Byzantine agreement within a broader theory of algorithmic self-repair
Bouhata et al., 2024; Wei and Liu, 2025 | Distributed machine learning and artificial intelligence governance | Byzantine resilience in learning and taxonomies of trustworthy AI | Oriented toward governance and machine learning; lack formal computational foundations | Provides the algorithmic backbone for guarantees that extend beyond machine learning
Babaoglu et al., 2004 | Self-managing systems | Visionary call for self-healing and self-stabilizing infrastructures | Aspirational in scope; lacks rigorous definitions and provable guarantees | Operationalizes this vision through formal definitions and provable classifications

Comparison of existing surveys with a unified framework of algorithmic self-repair.

4 Class 1: self-stabilizing algorithms

Self-stabilization represents one of the most powerful paradigms of algorithmic self-repair. Introduced by Dijkstra (1974), it addresses the fundamental challenge of how a distributed system can recover from arbitrary transient faults without relying on any external intervention. A self-stabilizing algorithm guarantees that, starting from any arbitrary or corrupted state, the system eventually converges to a legitimate configuration and remains there unless new faults occur. This makes self-stabilization a cornerstone of fault-tolerant distributed computing and a foundational category within the taxonomy of self-repair.

Three notions form the foundation of self-stabilization. A legitimate configuration is a global state satisfying the algorithm's correctness predicate. The closure property requires that every fault-free execution starting from a legitimate configuration remains within that set, whereas convergence requires that every fair execution from an arbitrary configuration eventually reaches one. Within the triplet 〈F, R, G〉, convergence captures the action of the repair mechanism R in restoring correctness after transient faults, and closure expresses the persistence of the guarantee G once faults cease. A fair execution ensures that every continuously enabled action is eventually taken.

4.1 Foundations and frontiers

The development of self-stabilization began with Dijkstra's ring-based algorithm for mutual exclusion, which showed that a distributed system could recover correctness without resets or external intervention (Dijkstra, 1974). Work in the 1980s and 1990s extended this foundation to core distributed tasks such as consensus, spanning tree construction, and resource allocation. Karaata (2002) contributed important refinements by introducing self-stabilizing algorithms for mutual exclusion and leader election, showing how stabilization principles can support structured coordination tasks.

A stabilizing algorithm is called silent when, after reaching a legitimate configuration, no further state changes or communication occur. Execution resumes only if a transient fault perturbs the state and thereby enables one of the algorithm's guarded actions. Nodes do not detect faults explicitly; progress continues solely because the corrupted state satisfies a guard condition.

These developments demonstrated that stabilization can maintain structural graph properties with minimal overhead (Devismes, 2005). In a complementary direction, Goddard et al. (2003) proposed a self-stabilizing algorithm for strong matchings, showing that stabilization applies to combinatorial optimization in distributed graphs. Saifullah and Tsin (2011) advanced this trajectory with a self-stabilizing algorithm for three-edge connectivity, and Karaata (2002) further explored biconnected component detection, underscoring the importance of stabilization in preserving connectivity under transient faults.

At the same time, theoretical results clarified the efficiency and resource boundaries of self-stabilization. Dolev et al. (1997) analyzed resource requirements of message-driven protocols, establishing time and space complexity limits for distributed self-stabilizing systems. Ghosh et al. (1996) introduced the concept of fault-containing self-stabilization, ensuring that transient faults remain locally confined rather than propagating across the entire system. Earlier work by Katz and Perry (1990) demonstrated that message-passing protocols could be extended with stabilization properties, highlighting the feasibility of embedding self-repairing features into classical distributed designs. The influential survey by Gärtner (1999) connected self-stabilization to the broader landscape of fault-tolerant distributed computing under asynchronous assumptions, providing a theoretical synthesis that situated stabilization as part of the larger theory of distributed resilience. More recently, Kanewala et al. (2017) proposed transformation techniques based on self-stabilizing kernels, showing how classical algorithms can be adapted to modern distributed graph-processing platforms. Together, these contributions demonstrate how the paradigm has evolved from a foundational idea into a versatile and system-aware approach for fault recovery.

4.2 Formalization under the 〈F, R, G〉 framework

Within the unified framework introduced in Section 2, self-stabilizing algorithms can be expressed using the triplet

〈F, R, G〉,

which specifies the admissible faults, the mechanism by which the system repairs itself, and the guarantees preserved once recovery has occurred. This notation provides a precise mathematical lens through which the classical theory of self-stabilization can be aligned with the broader landscape of algorithmic self-repair.

4.2.1 Failure model

The failure model is intentionally unrestricted. A distributed system is represented by a global state S, consisting of all information stored across its processes. A transient fault may arbitrarily corrupt S, modifying local variables, violating communication assumptions, or leaving multiple devices in inconsistent roles. Formally, the system may begin in any state S, even one that bears no relation to the intended task specification. For example, in leader election, the system may start with several devices claiming leadership or with none recognizing a leader. The only constraint is that faults eventually stop, allowing the repair mechanism to restore correctness.

4.2.2 Repair strategy

Repair is achieved through localized update rules applied by each process. Consider a system of n processes connected in a network. Each process v maintains a local state σ(v) and can read the states of its neighbors N(v). A repair action is captured by a local transition function

σ(v) ← δ(σ(v), {σ(u) : u ∈ N(v)}),

which maps the observed neighborhood state to a new local state. Intuitively, each process repeatedly inspects its neighbors and updates its own state using a simple, deterministic rule.

A classical example is Dijkstra's token ring algorithm: each process maintains a counter; the distinguished process increments its counter when it equals its predecessor's, and every other process copies its predecessor's counter when the two differ. Repeated local adjustments eventually ensure that exactly one token (privilege) circulates in the ring. Even from a completely chaotic initial state, the system gradually moves toward a correct configuration through local interactions alone.
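As a minimal executable sketch (a toy simulation under an assumed randomized central scheduler, not a faithful distributed implementation), the following Python code runs Dijkstra's K-state rule from a random corrupted state and checks that exactly one privilege remains:

```python
import random

def privileged(x, i):
    # Process 0 holds a privilege when its counter equals its predecessor's;
    # every other process holds one when the two counters differ.
    n = len(x)
    return x[0] == x[n - 1] if i == 0 else x[i] != x[i - 1]

def move(x, i, K):
    # Fire a privileged process: process 0 increments modulo K,
    # all others copy their predecessor's counter.
    if i == 0:
        x[0] = (x[0] + 1) % K
    else:
        x[i] = x[i - 1]

def stabilize(x, K, max_moves=10_000):
    # Randomized central scheduler: repeatedly fire one enabled process
    # until exactly one privilege remains (a legitimate configuration).
    rng = random.Random(1)
    for t in range(max_moves):
        enabled = [i for i in range(len(x)) if privileged(x, i)]
        if len(enabled) == 1:
            return t
        move(x, rng.choice(enabled), K)
    raise RuntimeError("no convergence within the move budget")

n, K = 7, 8                                  # K > n guarantees convergence
rng = random.Random(0)
x = [rng.randrange(K) for _ in range(n)]     # arbitrary corrupted state
stabilize(x, K)
tokens = sum(privileged(x, i) for i in range(n))   # exactly 1 after repair
```

Continuing to fire the unique enabled process after convergence preserves the single privilege, which is exactly the closure property described above.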

4.2.3 Guarantee

Self-stabilization provides two central guarantees: convergence and closure.

Convergence ensures that, from any initial state S, the system reaches a legitimate configuration S* that satisfies the task specification. For leader election, this means that exactly one leader exists; for spanning tree construction, it means that the resulting structure is a cycle-free tree. Many classical self-stabilizing algorithms converge in O(n²) rounds, so that a system of 100 processes converges within on the order of 10,000 rounds in the worst case.

Closure ensures that once the system has reached a legitimate configuration S*, it remains correct under further execution of the algorithm, provided no new faults occur. Importantly, the system does not explicitly detect that it is correct; closure follows from the structure of the algorithm rather than from explicit signaling of stability.

Additional measures refine the analysis. The space complexity of a process's local state is often logarithmic in n. Probabilistic self-stabilization further allows systems to converge with high probability, enabling faster protocols in large-scale or noisy environments.

Under the triplet 〈F, R, G〉, self-stabilizing algorithms emerge as a powerful form of algorithmic self-repair: they guarantee recovery from arbitrary states using only local rules and without centralized coordination. This makes them particularly valuable for autonomous infrastructures such as wireless sensor networks, peer-to-peer systems, and other fault-prone distributed environments where devices must eventually regain coordinated behavior even when starting from complete disorder.

5 Class 2: self-healing graph algorithms

Self-healing algorithms represent a second major category of algorithmic self-repair. Their defining property is the ability to restore structural integrity in dynamic networks after faults such as node removals or edge deletions. In this context, structural integrity refers to the preservation of connectivity, bounded path lengths, and controlled node degrees after faults and repairs, ensuring that the repaired graph continues to satisfy the essential invariants of the original specification. Unlike self-stabilization, which focuses on eventual convergence from arbitrary states, self-healing emphasizes localized and often immediate responses to damage. These algorithms are particularly relevant in graph-based systems, where maintaining connectivity, bounded path length, and balanced node degree is critical for ensuring continued performance under disruption.

Two notions play a central role in the analysis of self-healing graphs. A graph is said to maintain structural integrity if, after faults and repairs, it continues to satisfy core global properties such as connectivity, bounded path lengths, and controlled node degrees. Performance is often measured through stretch, defined as the ratio between the distance of two nodes in the repaired graph and their distance in the original graph, and through degree blow-up, the ratio between a node's degree after repair and its degree before the faults occurred.

5.1 Foundations and frontiers

Research on self-healing graph algorithms has developed along both theoretical and applied dimensions. Gallos and Fefferman (2015) proposed a simple yet effective strategy for networks damaged by node or link failures, drawing analogies to Achlioptas processes from random graph theory to explain how networks recover during restoration. Lee (2024) extended this line of inquiry by addressing clique transversal problems in graphs of bounded degeneracy, showing how structural graph parameters influence the design of efficient self-healing mechanisms.

Contributions have also bridged theory and biological inspiration. Emek and Keren introduced a thin self-stabilizing asynchronous unison algorithm designed for fault-tolerant biological networks, focusing on graphs of bounded diameter. Their results provided efficient self-healing methods for fundamental tasks such as leader election and maximal independent set discovery, strengthening the connection between distributed coordination and resilience (Emek and Keren, 2021).

Applications in engineering further illustrate the breadth of self-healing strategies. In optical networks, Hu et al. (2022) designed mechanisms that improve the reliability of fiber sensor networks by dynamically reconfiguring topologies after failures. In power systems, Choopani et al. (2020) developed optimization-based self-healing strategies for active distribution networks, using meta-heuristic algorithms to support microgrid operations while maintaining the capacity for rapid recovery. These works highlight how principles of graph repair translate into reliability improvements in critical infrastructures.

Algorithmic innovations have expanded the theoretical frontier of self-healing. Dhinnesh et al. (2022) introduced quadratic probing techniques for automated tracing and repair, reducing operational costs in managing network disruptions. Pandurangan et al. (2016) advanced the study of self-healing expanders, showing that certain classes of graphs can preserve connectivity and structural invariants under repeated adversarial deletions. Ridgley et al. (2021) proposed distributed first-order optimization algorithms with self-healing capabilities, allowing agents to adjust objectives adaptively in dynamic environments. Zadsar et al. (2017) further strengthened resilience in power networks by designing strategies for active distribution systems that emphasize rapid responsiveness during outages.

Taken together, these contributions show that self-healing graph algorithms are not confined to abstract theory but form a versatile toolkit spanning domains as diverse as network science, optical communication, and energy distribution. By combining structural guarantees with localized repair strategies, they offer a powerful paradigm for adaptive resilience in graph-based systems.

5.2 Formalization under the 〈F, R, G〉 framework

Within the unified framework introduced earlier, self-healing graph algorithms can be described by the triplet

〈F, R, G〉,

where F specifies the possible failures affecting the network, R denotes the algorithmic mechanism that performs repair, and G captures the structural and performance properties that must hold after the repair. Interpreting self-healing algorithms through this triplet provides a mathematically precise lens for understanding how they respond to disruptions and what guarantees they preserve.

5.2.1 Failure model (F)

The system is represented as a graph G = (V, E), where V is the set of vertices and E is the set of edges. A failure event is modeled as the removal of a subset of vertices F_V ⊆ V or a subset of edges F_E ⊆ E, resulting in the faulty graph

G′ = (V ∖ F_V, {{u, v} ∈ E ∖ F_E : u, v ∉ F_V}).

Failures may be adversarial, as in targeted attacks on high-degree or strategically important vertices, or stochastic, as in random node crashes or transient link failures. In both cases, the effect of F is to fragment the network, increase path lengths, or disconnect essential resources. Thus F formally specifies the class of vertex and edge deletions under which the algorithm must repair.

5.2.2 Repair strategy (R)

The repair mechanism is a transformation

R : G′ ↦ Ĝ

that produces a repaired graph Ĝ from the faulty graph G′. The repair is performed through localized modifications, such as adding new edges, redirecting communication paths, or rebalancing degrees. Locality is fundamental: each node v ∈ V makes decisions based on information available within a bounded-radius neighborhood, often of constant or logarithmic size. This ensures that the repair mechanism R scales to large and dynamic infrastructures, including wireless sensor networks, peer-to-peer overlays, and communication backbones, where global coordination is infeasible.

Two quantitative measures express how R alters network structure. Let deg_G(v) be the degree of a vertex v in G, and deg_Ĝ(v) its degree in the repaired graph Ĝ. The degree increase is

Δ(v) = deg_Ĝ(v) / deg_G(v),

which measures the additional burden placed on v due to repair actions. A desirable repair mechanism ensures that Δ(v) is bounded, typically by a constant or a logarithmic factor, so that no vertex becomes overloaded.

Likewise, if d_G(u, v) denotes the shortest-path distance between u and v in G and d_Ĝ(u, v) the corresponding distance in the repaired graph Ĝ, the stretch factor is defined by

stretch(u, v) = d_Ĝ(u, v) / d_G(u, v).

This metric captures how much longer communication paths become after repair. A stretch close to 1 indicates that communication efficiency is largely preserved, while a large stretch indicates significant performance degradation.
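These metrics can be evaluated directly on small graphs. The sketch below is a toy illustration with a hypothetical one-edge repair rule: a vertex is removed from a 6-cycle, its two stranded neighbors are reconnected, and stretch and degree blow-up are computed by breadth-first search.

```python
from collections import deque

def bfs_dist(adj, s, t):
    # Unweighted shortest-path distance via breadth-first search.
    seen, q = {s}, deque([(s, 0)])
    while q:
        u, d = q.popleft()
        if u == t:
            return d
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append((v, d + 1))
    return None  # unreachable

def remove_vertex(adj, f):
    return {u: {v for v in nbrs if v != f} for u, nbrs in adj.items() if u != f}

# Original graph G: the 6-cycle 0-1-2-3-4-5-0.
G = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

faulty = remove_vertex(G, 0)                 # F: vertex 0 fails
repaired = {u: set(nb) for u, nb in faulty.items()}
repaired[1].add(5)                           # R: reconnect the two
repaired[5].add(1)                           #    stranded neighbors of 0

survivors = sorted(repaired)
stretch = max(bfs_dist(repaired, u, v) / bfs_dist(G, u, v)
              for u in survivors for v in survivors if u < v)
blow_up = len(repaired[1]) / len(G[1])       # degree blow-up at node 1
```

Here the single repair edge restores connectivity with maximum stretch 1 over the surviving pairs and no degree blow-up at the repairing endpoints; richer repair rules trade these two quantities against each other.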

5.2.3 Guarantee (G)

The guarantee specifies the structural and performance properties that the algorithm must restore and preserve after the failures in F have stopped and the repair mechanism R has acted. The most fundamental requirement is connectivity: whenever it is possible to maintain a connected graph after the failures, the repaired graph should be connected. Beyond connectivity, guarantees focus on quantitative performance:

  • Bounded stretch: The stretch factor remains small, often logarithmic in |V|, ensuring efficient communication.

  • Bounded degree increase: The quantity Δ(v) is bounded, ensuring that repair responsibilities are distributed fairly across nodes.

  • Efficient recovery time: The computation of Ĝ from G′ can be carried out in near-linear or polylogarithmic time, making the repair feasible for large-scale systems.

Viewing self-healing graph algorithms through the triplet 〈F, R, G〉 highlights how they are designed for environments in which failures occur dynamically and repeatedly. While self-stabilizing algorithms focus on recovering from arbitrary global states, self-healing algorithms emphasize immediate, localized responses to specific structural failures. The explicit definitions of F, R, and G provide a foundation for analyzing central trade-offs, such as whether low stretch can be achieved without incurring high degree increases, or whether rapid local repair is compatible with global efficiency. These considerations reveal both the power of existing techniques and the opportunities for designing new self-healing algorithms that balance efficiency, robustness, and fairness.

6 Class 3: error-resilient online algorithms

Error-resilient online algorithms form a third major category within algorithmic self-repair. They are designed for settings where decisions must be made sequentially, often without full knowledge of future inputs, while simultaneously tolerating errors in the data stream. Their performance is typically evaluated through competitive analysis, a framework that compares the cost of an online algorithm with the cost of an optimal offline algorithm that knows the entire input sequence in advance. Within this framework, the competitive ratio is the worst-case value of this comparison, expressed as the maximum ratio between the online algorithm's cost and the offline optimum. Error resilience extends competitive analysis by explicitly accounting for corrupted, missing, or adversarial inputs, ensuring that robustness is preserved alongside efficiency.

Two models are central in this setting. A predictive algorithm receives a forecast of future requests or faults with a bounded error guarantee. A learning-augmented algorithm derives its predictions from past behavior, typically through a trained model that estimates forthcoming inputs or system states. In both cases, the repair mechanism R may condition its actions on the predicted sequence, while the guarantee G incorporates correctness together with bounds that degrade as prediction error increases.

6.1 Foundations and frontiers

The foundations of this area were laid by Karp et al. (1990), who introduced the competitive ratio as a formal metric for measuring the performance of online algorithms relative to the offline optimum. Their results established the groundwork for reasoning about algorithmic robustness under adversarial conditions and shaped decades of subsequent research in online computation.

More recent contributions have incorporated predictive and learning-based techniques into the online framework. Lykouris and Vassilvitskii (2021) demonstrated how competitive caching algorithms can be enhanced by machine learning predictions, showing that accurate forecasts of future requests can significantly improve competitive ratios while maintaining resilience against adversarial modifications. This work bridged the domains of online algorithms and learning theory, highlighting how prediction-aware adaptivity strengthens robustness in sequential decision-making.

Robust optimization has further expanded the scope of error resilience. Cohen et al. (2023) developed adaptive models for online linear programming, demonstrating that decision-making can remain stable and efficient under fluctuating and uncertain inputs. Although not explicitly adversarial in nature, these results illustrate how resilience can be embedded into real-time scheduling and resource allocation problems. In distributed environments, Zhao et al. (2019) proposed resilient distributed optimization algorithms capable of maintaining efficiency even in the presence of faulty or malicious components, underscoring the importance of error-resilient design in large-scale, multi-agent settings. Extending to new computational paradigms, Khadiev and Khadieva (2020) investigated quantum online algorithms that leverage the unique properties of quantum systems to withstand adversarial errors, opening directions for resilience in emerging technologies.

Together, these contributions reveal a trajectory from classical competitive analysis toward broader frameworks that integrate prediction, robustness, and adaptability. They also show how resilience is becoming a defining principle across both classical and novel models of online computation.

6.2 Formalization under the 〈F, R, G〉 framework

Within the unified framework introduced in Section 2, error-resilient online algorithms can be described by the triplet

〈F, R, G〉,

where F specifies how requests may be faulty, R denotes the algorithmic mechanism that adapts decisions in the presence of uncertainty, and G captures the resulting performance guarantees under imperfect inputs. This formulation extends classical online computation by embedding robustness directly into each decision step.

6.2.1 Failure model

In the standard online model, requests arrive sequentially and must be handled immediately, without knowledge of future inputs. In the error-resilient setting, even the current request may be unreliable. Some requests may be missing, some may contain random noise, and others may be corrupted adversarially.

To model these imperfections, an error budget E specifies how many faulty requests may appear in an input sequence.

  • In deterministic settings, E is an upper bound on the number of corrupted or missing requests.

  • In probabilistic settings, E represents the expected number of faulty requests, for example when each request is independently corrupted with some probability.

A simple example is a navigation system that updates routes in real time. Faulty sensors may feed it outdated or incorrect traffic information. The error budget E reflects either the maximum or the expected number of such erroneous updates that the algorithm must tolerate.
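A filtering-style repair for such a stream can be sketched as follows; the threshold and the speed values are hypothetical, chosen only to illustrate how an error budget of E = 2 corrupted readings is absorbed:

```python
def filtered_updates(updates, max_jump=30):
    """Accept a reading only if it lies within max_jump (an assumed
    physical plausibility bound) of the last accepted reading."""
    accepted = []
    for u in updates:
        if not accepted or abs(u - accepted[-1]) <= max_jump:
            accepted.append(u)
    return accepted

# Speed reports in km/h; the 250 and -40 readings are corrupted (E = 2).
speeds = [60, 58, 250, 55, 52, -40, 50]
clean = filtered_updates(speeds)   # [60, 58, 55, 52, 50]
```

The filter is one instance of the "filtering" mechanism discussed below: it trades a small risk of discarding a genuine but abrupt change for robustness against implausible corrupted inputs.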

6.2.2 Repair strategy

Unlike self-healing or self-stabilizing settings, in which repair occurs after a fault, here the repair is incorporated into the online decision itself: the algorithm must act in the moment despite uncertainty about the input. Common repair mechanisms include:

  • Filtering: ignoring requests that appear inconsistent, implausible, or too extreme.

  • Hedging: spreading decisions across multiple options to remain robust if some information is faulty.

  • Predictions with safety buffers: incorporating forecasts while adjusting them by error margins that hedge against inaccuracies.

  • Redundancy: maintaining additional assignments, caches, or partial solutions so that failures of individual decisions can be compensated by others.

For instance, an online trading algorithm that receives noisy or corrupted price information might hedge by investing across several assets rather than committing fully to a single potentially incorrect signal.
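The hedging idea can be made concrete with hypothetical numbers: a corrupted signal flips the apparent ranking of two assets, so the greedy all-in choice backfires, while a uniform split bounds the loss.

```python
# Hypothetical one-period example: true returns vs. a corrupted signal.
true_returns = {"A": 0.05, "B": -0.10}
observed = {"A": -0.02, "B": 0.08}   # adversarial noise flips the ranking

# Greedy strategy: trust the (corrupted) signal and go all-in.
greedy_pick = max(observed, key=observed.get)        # picks "B"
greedy_return = true_returns[greedy_pick]            # -0.10

# Hedged strategy: spread the decision uniformly across both options.
weights = {asset: 1 / len(true_returns) for asset in true_returns}
hedged_return = sum(w * true_returns[a] for a, w in weights.items())  # -0.025
```

The hedged portfolio can never do worse than the worst single asset, which is exactly the robustness that hedging buys at the cost of forgoing the best single outcome.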

6.2.3 Guarantee

Performance is compared to an ideal offline optimum that knows the true (uncorrupted) input sequence in advance. The classical measure is the competitive ratio

CR = max_σ ALG(σ) / OPT(σ),

where ALG(σ) denotes the online algorithm's cost on request sequence σ and OPT(σ) the cost of the offline optimum. In the presence of faulty requests, the competitive ratio becomes a function of the error budget:

CR(E) = max_{σ : at most E faulty requests} ALG(σ) / OPT(σ),

capturing how performance degrades as the number of corrupted requests increases. For instance, if

CR(E) = 2 + E,

then the algorithm is 2-competitive in the absence of errors (E = 0), and each additional corrupted request increases the cost by at most one unit of the offline optimum.

Another widely used guarantee is regret, defined as the difference between the online algorithm's cumulative cost and that of the best fixed strategy chosen in hindsight. In the noisy-input model, regret is extended to incorporate the uncertainty arising from corrupted requests. For example, in the trading scenario, regret measures how much profit is lost relative to the best single investment that could have been chosen with full knowledge of which price updates were accurate and which were corrupted.

By combining a clearly defined error model, robust decision rules, and guarantees adapted to uncertainty, error-resilient online algorithms provide a rigorous method for designing systems that remain effective even when the data stream is noisy or adversarial. Applications include navigation software that reacts to inconsistent sensor data, caching systems that process unreliable access logs, financial algorithms facing imperfect market signals, and cloud resource allocation under unpredictable workloads. The 〈F, R, G〉 framework makes explicit how uncertainty is modeled, how decisions adapt, and what forms of efficiency can still be guaranteed despite imperfect information.

7 Class 4: algorithmic redundancy and probabilistic self-repair

Redundancy-based and probabilistic repair form a fourth category within the taxonomy of algorithmic self-repair. Redundancy-based repair uses additional resources or replicated components to mask or compensate for faults, ensuring that correct behavior can be restored even when individual components fail. Probabilistic repair exploits statistical structure in the fault model to detect, isolate, or correct errors with high probability. These techniques do not eliminate faults outright but ensure that their effects remain bounded, providing reliability in settings where failures are unavoidable and uncertainty is inherent.

Two concepts are used to formalize the behavior of probabilistic repair mechanisms. A bisimulation reduction relates two stochastic processes by identifying a partition of states such that corresponding blocks admit identical transition probabilities. This allows a high-dimensional repair process to be analyzed through an equivalent, lower-dimensional abstraction. Probabilistic risk assessment quantifies the likelihood of entering undesirable states under a given repair mechanism R. In the triplet 〈F, R, G〉, bisimulation supports the analysis of R by simplifying its dynamics, while risk assessment characterizes the extent to which the guarantee G may be violated under stochastic faults.

7.1 Foundations and frontiers

The roots of redundancy-based fault tolerance trace back to Shannon (1948)'s seminal work on communication theory, which introduced error detection and correction through redundancy encoding. This foundational insight established that reliable communication is possible even over noisy channels, provided that the communication rate remains below the channel capacity. In computing, the 1970s and 1980s witnessed the rise of redundant architectures such as N-version programming and rollback recovery systems, which built directly on these principles to provide robust system-level reliability.

Modern redundancy-based and probabilistic repair mechanisms often give rise to stochastic models whose behavior is captured by Markovian transitions between system states. To analyze such stochastic repair behavior, it is useful to reduce the underlying process to a smaller but behaviorally equivalent abstraction. A bisimulation reduction achieves this by partitioning the state space of the stochastic process so that all states within a block have identical transition probabilities to every other block. This allows a high-dimensional repair process to be replaced by a lower-dimensional model that preserves its probabilistic behavior.

In addition to model reduction, evaluating the reliability of a stochastic repair process requires quantifying the likelihood that the system violates its intended correctness guarantees. Probabilistic risk assessment provides such a measure by estimating the probability that the repair dynamics enter configurations in which the guarantee G fails to hold under a given fault model F. This perspective makes explicit how uncertainty in the fault pattern influences the overall dependability of the repair mechanism.

During the 1990s, the scope expanded to include probabilistic models. Hennessy and Milner (1992) introduced probabilistic bisimilarity, laying a theoretical foundation for analyzing systems whose behavior incorporates stochastic uncertainty. This development enabled more flexible forms of error correction than deterministic approaches, opening the door to adaptive and probabilistic reasoning within algorithmic design.

Recent work highlights the growing convergence between redundancy and probabilistic approaches. Sakthivel and Alexander (2024) integrated machine learning methods such as k-nearest neighbors with probabilistic models to improve fault detection accuracy in power electronics and control systems, demonstrating how redundancy can be enriched by data-driven methods. In data structures and probabilistic modeling, Tang and Breugel (2020) investigated redundancy in automata and distance metrics, showing how layered probabilistic reasoning enhances reliability under adversarial conditions. Redundancy also underpins advances in data compression: Pieprzyk et al. (2023) demonstrated how leveraging inherent redundancies in real-world data leads to more optimal compression without compromising performance.

Probabilistic self-repair mechanisms extend these principles by explicitly embedding stochastic reasoning into recovery processes. In target tracking, Ma et al. (2023) designed a probabilistic data association algorithm that incorporates adaptive Kalman filtering, improving robustness in environments with measurement noise. Zhao et al. (2020) introduced self-paced probabilistic principal component analysis, enabling dynamic handling of data with outliers and highlighting the flexibility of probabilistic inference in self-repair. Probabilistic risk assessment has also influenced engineering design: Song et al. (2021) applied such methods to optimize structures against environmental risks, ensuring operational reliability while accounting for uncertainty. Applications in distributed energy systems emphasize the role of probabilistic scheduling, where Abo-Elyousr et al. (2022) demonstrated robust resource allocation under uncertainty. Complementary work by Qiu and Cui (2024) used probabilistic graph models for decision-making in material selection, illustrating how probabilistic reasoning enhances adaptability in data-driven learning environments.

Taken together, these contributions demonstrate how redundancy and probabilistic repair strategies reinforce one another: redundancy absorbs faults, while probabilistic reasoning interprets uncertainty. Their integration underscores the importance of adaptive, resource-aware methods for sustaining reliability in complex systems ranging from energy infrastructures to learning architectures.

7.2 Formalization under the 〈F, R, G〉 framework

Within the unified framework introduced in Section 2, redundancy-based and probabilistic repair techniques can be described by the triplet

〈F, R, G〉,

which specifies how faults are modeled, how they are mitigated, and what level of reliability is guaranteed after repair.

7.2.1 Failure model

Faults arise through stochastic disturbances that affect transmitted data, stored information, or hardware components. Typical examples include flipped bits during communication, corrupted memory cells, packet loss, or noise in sensor measurements. Formally, when a sender transmits a message

M = (m1, m2, …, mk),

the receiver may observe a corrupted version

M′ = (m′1, m′2, …, m′k),

where each symbol m′i differs from mi with a probability determined by the underlying noise process. The model F therefore characterizes the statistical properties of such corruptions and the rate at which they occur.
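The failure model F can be made concrete with a minimal simulation sketch (Python; the function name `transmit` and the bit-flip channel are illustrative assumptions, corresponding to a binary symmetric channel with flip probability p):

```python
import random

def transmit(message, p, seed=None):
    # Failure model F: each bit is independently flipped with
    # probability p (a binary symmetric channel).
    rng = random.Random(seed)
    return [bit ^ 1 if rng.random() < p else bit for bit in message]

M = [1, 0, 1, 1, 0, 0, 1, 0]              # transmitted message
M_observed = transmit(M, p=0.2, seed=42)  # possibly corrupted version
assert len(M_observed) == len(M)
```

In this toy model the statistical properties of F are captured by the single parameter p; richer failure models would replace the independent flips with correlated or bursty noise.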

7.2.2 Repair strategy

The repair mechanism introduces redundancy or uses probabilistic inference to reconstruct the original information from corrupted observations. Common forms of R include:

  • Redundancy-based repair. The message M is encoded into a codeword C(M) of length n > k, providing r = n − k redundant symbols that allow recovery even when up to r symbols are lost or corrupted. Examples include replication and erasure codes.

  • Probabilistic repair. The system estimates the most likely value of the underlying data by combining noisy observations with statistical models. Techniques such as Kalman filtering and Bayesian inference infer hidden states or correct corrupted measurements.

  • Hybrid methods. Many practical systems combine deterministic and probabilistic repair. For instance, encoded storage may be supplemented with probabilistic anomaly detection to identify or replace corrupted fragments.
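The simplest instance of redundancy-based repair, replication with majority voting, can be sketched as follows (Python; a toy illustration rather than a production erasure code):

```python
from collections import Counter

def encode(message, r=3):
    # Redundancy-based repair R: store r full replicas of the message.
    return [list(message) for _ in range(r)]

def decode(replicas):
    # Recover each symbol by majority vote across the replicas.
    return [Counter(column).most_common(1)[0][0] for column in zip(*replicas)]

replicas = encode([1, 0, 1, 1], r=3)
replicas[0][2] ^= 1                      # one replica suffers a bit flip
assert decode(replicas) == [1, 0, 1, 1]  # the majority vote repairs it
```

Erasure codes achieve the same effect with far less overhead: instead of r full copies of the message, only r = n − k extra symbols are stored.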

7.2.3 Guarantee

The guarantee specifies the reliability and efficiency achieved after repair:

  • Error probability. The system recovers the correct message or state with probability at least 1−ϵ, where ϵ is a small tunable error parameter that typically decreases as redundancy increases.

  • Redundancy overhead. If r = n − k denotes the number of redundant symbols, then the ratio

    ρ = r/k = (n − k)/k

  measures the additional storage, computation, or bandwidth required for repair. Information-theoretic limits, such as Shannon's capacity theorem, bound how small this overhead can be while still ensuring reliable communication.

  • Expected recovery time. Probabilistic processes such as Markov models or filtering techniques quantify the expected time required to detect and correct faults, ensuring that recovery occurs with minimal delay in expectation.
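For r-fold replication, the error-probability guarantee 1 − ϵ can be computed exactly: a symbol is decoded incorrectly only when a majority of its copies are flipped. A small sketch (Python; assumes independent flips with probability p and odd r):

```python
from math import comb

def residual_error(p, r):
    # Probability that majority voting over r independent copies fails,
    # i.e. that more than r // 2 of the copies are flipped (r odd).
    return sum(comb(r, k) * p**k * (1 - p)**(r - k)
               for k in range(r // 2 + 1, r + 1))

# Redundancy drives epsilon down: with p = 0.1, r = 1 leaves epsilon = 0.1,
# r = 3 gives 0.028, and r = 5 gives roughly 0.0086.
assert residual_error(0.1, 5) < residual_error(0.1, 3) < residual_error(0.1, 1)
```

The same binomial reasoning underlies the tunable ϵ in the guarantee above: each extra replica buys an exponential reduction in the residual error, at the cost of linear growth in the redundancy overhead.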

Under the triplet 〈F, R, G〉, redundancy-based and probabilistic techniques convert unpredictable noise into failures that are statistically controlled and repairable. These mechanisms support a wide range of applications, including error-correcting codes in communication systems, replicated and encoded storage in cloud infrastructures, and probabilistic filtering in sensor networks. The unified framework makes precise how faults are modeled, how repair is performed, and what reliability guarantees can be delivered.

8 Class 5: byzantine fault-tolerant algorithms

Byzantine fault tolerance represents one of the strongest forms of algorithmic self-repair. It ensures that distributed systems continue to operate correctly even when some components behave arbitrarily or maliciously. The model captures the most general type of adversarial behavior, where faulty nodes may lie, send conflicting information, or collude to disrupt agreement. Algorithms in this category are central to modern infrastructures such as blockchain networks, distributed databases, and cloud computing platforms. Their development reflects a continuous effort to balance safety, efficiency, and scalability under worst-case assumptions.

8.1 Foundations and frontiers

The concept of Byzantine fault tolerance originates from the Byzantine Generals Problem, formulated by Lamport et al. (1982), which illustrated the fundamental difficulty of achieving agreement in the presence of treacherous participants. Early work in this area established the impossibility of reaching consensus in fully asynchronous systems with even a single faulty node, highlighting the inherent challenges of the model.

A significant step forward came with the Practical Byzantine Fault Tolerance (PBFT) algorithm by Castro and Liskov (1999). PBFT reduced communication complexity and showed that consensus could be achieved efficiently in asynchronous environments, making Byzantine fault tolerance viable for real-world deployments. In parallel, the Raft algorithm by Ongaro and Ousterhout (2014) focused on clarity and maintainability in crash-tolerant systems. While Raft primarily addresses crash failures, its influence has extended into Byzantine settings through adaptations and comparative analyses.

Contemporary research has expanded Byzantine fault-tolerant protocols to meet the demands of modern distributed systems. Wang and Wattenhofer (2020) proposed a layered framework for asynchronous Byzantine agreement in incomplete networks, demonstrating how consensus protocols can adapt to varied connectivity patterns. Chen (2020) investigated the fundamental limits of Byzantine agreement with dishonest processors, providing new insight into the trade-offs that constrain protocol design. In blockchain contexts, Huang et al. (2019) analyzed the Raft algorithm for private blockchains, highlighting performance characteristics critical for secure and efficient ledger maintenance.

Other researchers have extended Byzantine agreement to broader contexts. Flamini et al. (2024) introduced a multidimensional framework that captures richer trust relationships in synchronous settings, expanding the applicability of Byzantine consensus. Augustine et al. (2013) developed a fast Byzantine agreement protocol for dynamic networks, demonstrating the need for adaptive solutions in environments where system membership or topology changes over time. Tang and Breugel (2020) proposed improvements to PBFT tailored for high-frequency trading, where consensus must be reached under stringent latency constraints. Ren et al. (2023) introduced a node role division strategy that reduces communication overhead while maintaining resilience, offering a scalable approach for heterogeneous distributed systems. Most recently, Duvignau et al. (2023) combined Byzantine tolerance with self-stabilization, presenting the first protocol for repeated reliable broadcast that recovers even from arbitrary transient faults. This hybridization marks an important convergence of paradigms, showing how Byzantine guarantees can be integrated with stabilization principles to provide stronger resilience.

Collectively, these contributions demonstrate how Byzantine fault-tolerant algorithms have evolved from theoretical impossibility results into practical and versatile tools for resilient distributed computing. They show a trajectory from foundational definitions to highly specialized protocols adapted to blockchains, financial systems, and dynamic networks.

8.2 Formalization under the 〈F, R, G〉 framework

Within the unified framework introduced in Section 2, Byzantine fault-tolerant algorithms can be expressed using the triplet

〈F, R, G〉,

which specifies the admissible adversarial behaviors, the protocol's repair and agreement mechanisms, and the correctness guarantees that hold despite arbitrary faults.

8.2.1 Failure model

The distributed system consists of n participants,

N = {1, 2, …, n},

among which a subset B ⊆ N may behave arbitrarily. Elements of B are called Byzantine nodes, and we write f = |B| for their number. The remaining n − f nodes are honest and follow the prescribed protocol.

A Byzantine node may deviate from the algorithm in any way. It may send different messages to different recipients, lie about its state, ignore received information, impersonate another node, or collude with other faulty nodes. This fault model strictly generalizes crash and omission faults by allowing the adversary to choose behaviors that hinder agreement as much as possible.

8.2.2 Repair strategy

Each honest node i ∈ N \ B maintains a local state si(t) at round t. Upon receiving a set of messages Mi(t) during that round, the node updates its state according to a deterministic or randomized transition rule

si(t+1) = Ri(si(t), Mi(t)),

where Ri denotes the algorithm's update function.

Since Byzantine nodes may send contradictory or misleading information, the repair mechanism relies on structured communication and verification among honest nodes. Common components of R include:

  • Replication and majority agreement: multiple honest nodes maintain replicas of values, and majority or supermajority rules filter out inconsistent or adversarial messages.

  • Authentication and verification: digital signatures or message authentication codes ensure that forged or tampered messages can be detected.

  • Cross-checking: honest nodes exchange and compare received information to expose inconsistencies created by faulty nodes.

  • Structured coordination: roles such as leaders or committees may be assigned to facilitate consistent updates while preserving resilience to Byzantine influence.
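The majority-agreement component of R can be illustrated with a quorum rule (Python sketch; the function and the vote representation are illustrative assumptions, not a full consensus protocol):

```python
from collections import Counter

def quorum_decide(votes, n, f):
    # Accept a value only when at least n - f matching votes arrive:
    # even if all f Byzantine senders voted for it, n - 2f honest nodes
    # agree, and n - 2f > f whenever n > 3f.
    value, count = Counter(votes).most_common(1)[0]
    return value if count >= n - f else None

# n = 4 nodes, f = 1 Byzantine: three matching votes carry the decision.
assert quorum_decide(["commit", "commit", "commit", "abort"], n=4, f=1) == "commit"
assert quorum_decide(["commit", "commit", "abort", "abort"], n=4, f=1) is None
```

A full protocol such as PBFT wraps a rule of this kind in several message rounds so that the quorum condition is checked against authenticated, cross-checked messages rather than raw votes.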

8.2.3 Guarantee

Byzantine fault-tolerant protocols aim to preserve correctness even when up to f nodes behave adversarially. The central guarantees are:

  • Safety: all honest nodes that reach a decision must decide on the same value. If nodes i and j decide on vi and vj respectively, then vi = vj. Safety holds regardless of the specific actions of the faulty nodes.

  • Liveness: every honest node eventually reaches a decision within finitely many rounds, ensuring that progress continues despite adversarial behavior.

  • Fault threshold: agreement can be maintained only when the number of faulty nodes satisfies f < fmax for the given protocol. Classical results show that in synchronous systems one can tolerate up to f < n/3 Byzantine faults, while in asynchronous systems stronger assumptions (such as partial synchrony or randomized mechanisms) are required.

  • Efficiency: the communication and round complexity required to maintain agreement scales in a controlled manner with n, with many protocols achieving polynomial or even near-linear communication overhead.
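The fault threshold translates into a simple arithmetic check (Python; based on the classical n > 3f bound quoted above):

```python
def max_byzantine_faults(n):
    # Largest f with n > 3f: agreement among n nodes tolerates
    # at most floor((n - 1) / 3) Byzantine participants.
    return (n - 1) // 3

# Four nodes tolerate one traitor; ten nodes tolerate three.
assert [max_byzantine_faults(n) for n in (3, 4, 7, 10)] == [0, 1, 2, 3]
```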

The triplet 〈F, R, G〉 therefore highlights the defining feature of Byzantine fault tolerance: even when a subset of participants behaves arbitrarily and adversarially, the honest nodes still repair inconsistencies, maintain agreement, and ensure progress. This makes Byzantine fault-tolerant algorithms a cornerstone of resilient distributed systems, from state-machine replication to blockchain consensus protocols.

9 Discussion: trade-offs and hybrid strategies

This section examines how different self-repair approaches compare in practice and how they can be combined. We first analyze the trade-offs that guide the choice of a suitable method for a given environment, and then explore how multiple strategies can be integrated into hybrid systems that achieve greater resilience than any single approach alone.

9.1 Choosing the right approach and understanding trade-offs

No single approach to algorithmic self-repair can address every failure. The right choice depends on the kinds of faults that may occur, the environment in which they arise, and the guarantees required. Some systems must recover from any initial state, others need rapid local repair after small failures, some must handle noisy or incomplete input, and others must withstand participants that behave maliciously.

Table 4 compares the main families of self-repair by four practical questions: when they are typically used, what they guarantee, how the cost grows as the system scales, and which limits cannot be overcome. For example, in error-resilient online algorithms, the benchmark is an ideal solution with full knowledge of future inputs; in redundancy-based methods, reliability depends on staying below the channel capacity of a noisy medium (Shannon, 1948); and in Byzantine protocols, correctness requires that fewer than one third of participants act maliciously (Lamport et al., 1982).

Table 4

Category: Self-stabilization
  When to use/application: systems that may start in an arbitrary or inconsistent state; large sensor deployments and routing rings.
  Core guarantee: the system eventually reaches a legitimate configuration without external reset (Dijkstra, 1974).
  Typical cost as the system grows: convergence time often on the order of n² rounds (Dijkstra, 1974).
  Fundamental limit: deterministic consensus is impossible in a fully asynchronous system with even one crash fault (Fischer et al., 1985).

Category: Self-healing graph algorithms
  When to use/application: networks that lose nodes or links; data centers; robotic or drone swarms.
  Core guarantee: connectivity is preserved while path lengths and node degrees grow only within proven bounds (Gallos and Fefferman, 2015; Pandurangan et al., 2016).
  Typical cost as the system grows: repair time from logarithmic to linear in n (Gallos and Fefferman, 2015; Pandurangan et al., 2016).
  Fundamental limit: purely local repair cannot preserve strong global properties under unlimited churn (Baruch and Trehan, 2011).

Category: Error-resilient online algorithms
  When to use/application: real-time decision systems with noisy, missing, or adversarial inputs; caching, navigation, trading.
  Core guarantee: performance tracks an ideal offline benchmark; guarantees degrade smoothly with error level (Karp et al., 1990; Lykouris and Vassilvitskii, 2021; Cohen et al., 2023).
  Typical cost as the system grows: comparable to standard online methods, with extra overhead for robustness (Lykouris and Vassilvitskii, 2021).
  Fundamental limit: lower bounds show limited side information cannot eliminate errors beyond problem-specific thresholds (Albers, 2003).

Category: Redundancy and probabilistic repair
  When to use/application: data or communication subject to random faults; satellite links; distributed storage.
  Core guarantee: error probability is reduced through replication, coding, or inference, provided the rate remains below channel capacity (Shannon, 1948).
  Typical cost as the system grows: extra replicas or symbols, with decoding overhead (Shannon, 1948).
  Fundamental limit: reliable transmission at or above channel capacity is impossible (Shannon, 1948).

Category: Byzantine fault tolerance
  When to use/application: distributed coordination with potentially malicious participants; critical control systems; permissioned ledgers.
  Core guarantee: all honest participants reach agreement and the system continues to make progress (Lamport et al., 1982).
  Typical cost as the system grows: message cost often grows quadratically with the number of participants (Castro and Liskov, 1999).
  Fundamental limit: agreement is impossible if one third or more of participants are malicious (Lamport et al., 1982); consensus is also impossible in fully asynchronous systems (Fischer et al., 1985).

Comparing self-repair approaches: typical scenarios, guarantees, costs, and fundamental limits.

Each family highlights a distinct trade-off. Self-stabilization offers recovery from any state but may require many rounds. Self-healing graph methods repair quickly but cannot guarantee global properties under unlimited change. Error-resilient online algorithms integrate robustness into real-time decisions but cannot surpass proven lower bounds. Redundancy and probabilistic repair can drive error rates low but are constrained by channel capacity. Byzantine protocols tolerate adversarial behavior but only below a fixed threshold. Choosing the right method means matching the fault model to the environment, balancing cost, and staying within known limits.

9.2 Designing hybrid self-repairing systems

Real-world systems rarely fail in only one way. An autonomous drone fleet, for example, may face harsh environments, broken communication links, faulty sensors, malicious interference, and unpredictable weather. No single method can address all these challenges. Self-healing techniques restore lost connections but are vulnerable to manipulation, Byzantine protocols secure agreement but may react too slowly, and probabilistic repair contains random errors but not coordinated attacks. These contrasts motivate hybrid systems that combine complementary strategies, achieving resilience beyond what any single category can provide. Figure 2 illustrates the hybrid self-repair pipeline, including detection, strategy selection, orchestration, verification, and outcomes.

Figure 2. The hybrid self-repair pipeline: detection, strategy selection, orchestration, verification, and outcomes.

Hybrid architectures already appear in practice. Cloud platforms combine self-healing, probabilistic repair, and Byzantine protocols. Autonomous vehicles rely on error-resilient learning, software updates, and secure consensus for coordination. Blockchains use Byzantine protocols for correctness and probabilistic replication for persistence. In all cases, resilience comes from orchestrating several mechanisms rather than relying on one.

The framework of 〈Failure model, Repair strategy, Guarantee〉 applies to hybrids as well, with multiple triplets operating across different layers. The key design principle is reinforcement: mechanisms must strengthen rather than undermine one another. For example, fast recovery must not bypass safety checks, and redundancy budgets must be preserved as controllers adapt. Proper orchestration ensures guarantees remain enforceable under stress.

These principles illustrate how the categories complement one another. Self-stabilization ensures recovery from arbitrary states, self-healing provides rapid local repair, error-resilient online algorithms manage noisy inputs, probabilistic repair absorbs random faults, and Byzantine tolerance secures agreement under malicious behavior. Each has trade-offs, but together they provide a stronger foundation.

Designing hybrids requires detecting faults, selecting strategies suited to the context, and coordinating them under resource limits. In adversarial settings, Byzantine tolerance may be prioritized; in stochastic settings, probabilistic repair may dominate. Adaptive orchestration ensures that strategies evolve with conditions rather than remaining fixed.
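This selection logic can be caricatured as a dispatcher (Python; the fault categories, strategy names, and priority order are illustrative assumptions, not a published protocol):

```python
# Map detected fault categories to the five repair families of this survey.
STRATEGIES = {
    "malicious_node": "Byzantine fault tolerance",
    "arbitrary_state": "self-stabilization",
    "lost_link": "self-healing graph repair",
    "noisy_input": "error-resilient online algorithm",
    "random_fault": "redundancy / probabilistic repair",
}

def orchestrate(detected_faults):
    # Adversarial faults are handled first, stochastic faults last,
    # mirroring the prioritization discussed above.
    priority = ["malicious_node", "arbitrary_state", "lost_link",
                "noisy_input", "random_fault"]
    return [STRATEGIES[fault] for fault in priority if fault in detected_faults]

assert orchestrate({"random_fault", "malicious_node"}) == \
    ["Byzantine fault tolerance", "redundancy / probabilistic repair"]
```

A real orchestrator would of course detect faults at runtime and respect resource budgets; the sketch only shows the priority-driven mapping from fault model to repair strategy.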

Hybridization also raises challenges. Mechanisms may interfere, redundancy may add overhead, and orchestration must balance latency, accuracy, and security. Meeting these challenges requires adaptive coordination, predictive models, and scalable verification at runtime.

The lesson is clear: modern infrastructures face diverse and evolving fault landscapes, and no single method suffices. By combining approaches, hybrid systems provide robustness, adaptability, and trustworthiness at scale. In this sense, hybrid self-repair offers a blueprint for resilient computation that is both rigorous and practical.

10 Conclusion

This survey has advanced the perspective of algorithmic self-repair as a frontier for understanding resilience in computation. Rather than treating self-stabilization, self-healing, error-resilient online computation, redundancy-based methods, and Byzantine fault tolerance as isolated traditions, we have consolidated them into a coherent taxonomy. The aim has been not merely to catalog techniques, but to expose the common principles that enable resilience and to illuminate the trade-offs between efficiency, adaptability, and robustness that define the boundaries of this frontier.

Yet the challenges of resilient computation remain far from resolved. One central issue is how to optimize fault tolerance without imposing prohibitive computational overheads, a balance that becomes particularly critical in large-scale and resource-constrained environments. Another lies in the design of adaptive strategies that can adjust proactively and reactively to evolving patterns of disruption, preserving both responsiveness and theoretical guarantees. The prospect of integrating learning-based techniques into this landscape further complicates the picture: while such methods promise adaptability, ensuring that they remain provably reliable and explainable is essential for their use in safety-critical domains.

A further frontier lies in bridging the gap between adversarial and stochastic models of failure. Real-world systems rarely conform to purely worst-case or purely probabilistic assumptions, and hybrid models that capture this duality could reshape our understanding of what self-repairing algorithms can achieve. Closely related are the roles of consensus and broadcast as core abstractions for fault-tolerant distributed computation. Their interplay not only structures impossibility results and lower bounds but also highlights opportunities for designing hybrid frameworks that combine convergence with adversarial resilience. These foundational questions extend beyond classification, pointing to deep open problems at the intersection of distributed computing, online decision-making, and fault-tolerant theory.

Looking ahead, the consolidation offered here remains primarily conceptual. To realize its potential, empirical validation is needed: benchmarking across domains such as distributed infrastructures, cybersecurity, and autonomous systems will be essential for testing scalability and practical feasibility. Only by closing the gap between theoretical foundations and real-world deployment can algorithmic self-repair move from a unifying concept to a guiding principle of resilient computation.

Statements

Author contributions

CM: Conceptualization, Formal analysis, Methodology, Supervision, Writing – original draft. AP: Investigation, Validation, Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1

    Abo-ElyousrF. K.SharafA. M.DarwishM. M. F.LehtonenM.MahmoudK. (2022). Optimal scheduling of DG and EV parking lots simultaneously with demand response based on self-adjusted pso and k-means clustering. Energy Sci. Eng. 10, 40254043. doi: 10.1002/ese3.1264

  • 2

    AlbersS. (2003). Online algorithms: a survey. Math. Program. 97, 326. doi: 10.1007/s10107-003-0436-0

  • 3

    AltisenK.DevismesS.DuboisS.PetitF. (2022). Introduction to Distributed Self-Stabilizing Algorithms. Cham: Springer Nature.

  • 4

    AndreinaS.ZimmerP.KarameG. (2025). “On the robustness of distributed machine learning against transfer attacks,” in Proceedings of the AAAI Conference on Artificial Intelligence, 15382–15390. doi: 10.1609/aaai.v39i15.33688

  • 5

    AugustineJ.PanduranganG.RobinsonP. (2013). “Fast byzantine agreement in dynamic networks,” in Proceedings of the 2013 ACM Symposium on Principles of Distributed Computing, 74–83. doi: 10.1145/2484239.2484275

  • 6

    AvinC.LotkerZ.PignoletY.-A.ScheidelerC.SchmidS. (2008). “Self-healing for ad hoc networks,” in Proceedings of the 7th International Symposium Mobile Ad Hoc Networking and Computing (MobiHoc) (ACM), 7079.

  • 7

    AvizienisA.LaprieJ.-C.RandellB.LandwehrC. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Depend. Secure Comput. 1, 1133. doi: 10.1109/TDSC.2004.2

  • 8

    BabaogluO.JelasityM.MontresorA.FetzerC.LeonardiS.van MoorselA.et al. (2004). “The self-star vision,” in Self-star Workshop (Springer), 120. doi: 10.1007/11428589_1

  • 9

    BarrakA.JaziriM.TrabelsiR.JaafarF.PetrilloF. (2023). “Spirt: a fault-tolerant and reliable peer-to-peer serverless ml training architecture,” in 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS) (IEEE), 650661. doi: 10.1109/QRS60937.2023.00069

  • 10

    BaruchA.TrehanA. (2011). “On the impossibility of local self-healing in expander graphs,” in Proceedings of the 30th Annual ACM Symposium Principles of Distributed Computing (PODC) (ACM), 273282.

  • 11

    BouhataD.MoumenH.MazariJ. A.BounceurA. (2024). Byzantine fault tolerance in distributed machine learning: a survey. J. Exp. Theor. Artif. Intell. 37, 13311389. doi: 10.1080/0952813X.2024.2391778

  • 12

    CastroM.LiskovB. (1999). “Practical byzantine fault tolerance,” in Proceedings of the 3rd USENIX Symposium Operation Systems Design Implement, 173186.

  • 13

    ChenW. (2020). Fundamental limits of byzantine agreement. arXiv preprint arXiv:2009.10965.

  • 14

    ChevalierP.KaminskiB.HutchisonF.MaQ.SharmaS.FacklerA.et al. (2019). Protocol for asynchronous, reliable, secure and efficient consensus (parsec) version 2.0. arXiv preprint arXiv:1907.11445.

  • 15

    ChoopaniK.HedayatiM.EffatnejadR. (2020). Self-healing optimization in active distribution network to improve reliability, and reduce losses, switching cost, and load shedding. Int. Trans. Electr. Energy Syst. 30:e12348. doi: 10.1002/2050-7038.12348

  • 16

    CohenI.PostekK.ShternS. (2023). An adaptive robust optimization model for parallel machine scheduling. Eur. J. Oper. Res. 306, 83104. doi: 10.1016/j.ejor.2022.07.018

  • 17

    CorreiaM.VeroneseG. S.NevesN. F.VerissimoP. (2011). Byzantine consensus in asynchronous message-passing systems: a survey. Int. J. Crit. Comput.-Based Syst. 2, 141161. doi: 10.1504/IJCCBS.2011.041257

  • 18

    DahanT.LevyK. Y. (2024). “Weight for robustness: a comprehensive approach towards optimal fault-tolerant asynchronous ML,” in Advances in Neural Information Processing Systems, 30041–30075. doi: 10.52202/079017-0946

  • 19

    DevismesS. (2005). A silent self-stabilizing algorithm for finding cut-nodes and bridges. Parallel Process. Lett. 15, 7994. doi: 10.1142/S0129626405002143

  • 20

    DhinneshN. A. D. C.SabapathiT.SundareswaranN. (2022). An efficient self-healing network through quadratic probing optimization mechanism. Int. J. Commun. Syst. 35, e5098. doi: 10.1002/dac.5098

  • 21

    DijkstraE. W. (1974). Self-stabilizing systems in spite of distributed control. Commun. ACM17, 643644. doi: 10.1145/361179.361202

  • 22

    DolevD.StrongH. R. (1983). Authenticated algorithms for byzantine agreement. SIAM J. Comput. 12, 656666. doi: 10.1137/0212045

  • 23

    DolevS.IsraeliA.MoranS. (1997). Resource bounds for self-stabilizing message-driven protocols. SIAM J. Comput. 26, 273290. doi: 10.1137/S0097539792235074

  • 24

    DuvignauR.RaynalM.SchillerE. M. (2023). Self-stabilizing byzantine fault-tolerant repeated reliable broadcast. Theor. Comput. Sci. 972:114070. doi: 10.1016/j.tcs.2023.114070

  • 25

    EmekK.KerenA. (2021). “A thin self-stabilizing asynchronous unison algorithm with applications to fault tolerant biological networks,” in Proceedings of the ACM Symposium Principles of Distributed Computing (PODC). doi: 10.1145/3465084.3467922

  • 26

    FeldmannM.ScheidelerC.SchmidS. (2020). Survey on algorithms for self-stabilizing overlay networks. ACM Comput. Surv. 53, 174. doi: 10.1145/3397190

  • 27

    FischerM. J.LynchN. A.PatersonM. S. (1985). Impossibility of distributed consensus with one faulty process. J. ACM32, 374382. doi: 10.1145/3149.214121

  • 28

    FlaminiA.LongoR.MeneghettiA. (2024). Multidimensional byzantine agreement in a synchronous setting. Appl. Algebra Eng. Commun. Comput. 35, 233251. doi: 10.1007/s00200-022-00548-5

  • 29

    GallosL. K.FeffermanD. (2015). Simple and efficient self-healing strategy for damaged complex networks. Phys. Rev. E92:052806. doi: 10.1103/PhysRevE.92.052806

  • 30

    GärtnerF. C. (1999). Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Comput. Surv. 31, 126. doi: 10.1145/311531.311532

  • 31

    GhoshS.GuptaA.HermanT.PemmarajuS. V. (1996). “Fault-containing self-stabilizing algorithms,” in Proceedings of the 15th Annual ACM Symposium Principles of Distributed Computing (PODC), 45–54. doi: 10.1145/248052.248057

  • 32

    GoddardW.HedetniemiS. T.JacobsD. P.SrimaniP. K. (2003). “Self-stabilizing distributed algorithm for strong matching in a system graph,” in Proceedings of the 10th International Conference High Performance Computer (HiPC) (Springer), 6673. doi: 10.1007/978-3-540-24596-4_8

  • 33

    GuellatiN.KheddouciH. (2010). A survey on self-stabilizing algorithms for independence, domination, coloring, and matching in graphs. J. Parallel Distrib. Comput. 70, 406415. doi: 10.1016/j.jpdc.2009.11.006

  • 34

    HasanM.GorayaM. S. (2018). Fault tolerance in cloud computing environment: a systematic survey. Comput. Ind. 99, 156172. doi: 10.1016/j.compind.2018.03.027

  • 35

    HennessyM.MilnerR. (1992). Probabilistic methods in a theory of processes. Inf. Comput. 100, 171193.

  • 36

    HoiS. C. H.SahooD.LuJ.ZhaoP. (2021). Online learning: a comprehensive survey. Neurocomputing459, 249289. doi: 10.1016/j.neucom.2021.04.112

  • 37

    HuX.SiH.MaoJ.WangY. (2022). Self-healing and shortest path in optical fiber sensor network. J. Sensors2022:5717041. doi: 10.1155/2022/5717041

  • 38

    HuangD.MaX.ZhangS. (2019). Performance analysis of the raft consensus algorithm for private blockchains. IEEE Trans. Syst. Man Cybern. Syst. 50, 172181. doi: 10.1109/TSMC.2019.2895471

  • 39

    IsukapalliS.SriramaS. N. (2024). A systematic survey on fault-tolerant solutions for distributed data analytics: taxonomy, comparison, and future directions. Comput. Sci. Rev. 53:100660. doi: 10.1016/j.cosrev.2024.100660

  • 40

    KanewalaT.ZalewskiM.BarnasM.LumsdaineA. (2017). Families of distributed memory parallel graph algorithms from self-stabilizing kernels–an sssp case study. arXiv preprint arXiv:1706.05760.

  • 41

    KaraataN. (2002). A stabilizing algorithm for finding biconnected components. J. Parallel Distrib. Comput. 62, 755766. doi: 10.1006/jpdc.2001.1833

  • 42

    KarpR. M.VaziraniU. V.VaziraniV. V. (1990). “An optimal algorithm for on-line bipartite matching,” in Proceedings of the 22nd Annual ACM Symposium Theory Computer (STOC), 352–358. doi: 10.1145/100216.100262

  • 43

    KatzS.PerryK. (1990). “Self-stabilizing extensions for message-passing systems,” in Proceedings of the 9th Annual ACM Symposium Principles of Distributed Computing (PODC), 91–101. doi: 10.1145/93385.93405

  • 44

    KhadievK.KhadievaA.MannapovI. (2018). Quantum online algorithms with respect to space and advice complexity. Lobachevskii J. Math.39, 13771387. doi: 10.1134/S1995080218090421

  • 45

    KorenI.KrishnaC. M. (2020). Fault-Tolerant Systems. San Francisco, CA: Morgan Kaufmann. doi: 10.1016/B978-0-12-818105-8.00014-0

  • 46

    KumariP.KaurP. (2021). A survey of fault tolerance in cloud computing. J. King Saud Univ. Comput. Inf. Sci. 33, 11591176. doi: 10.1016/j.jksuci.2018.09.021

  • 47

    LamportL.ShostakR.PeaseM. (1982). The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4, 382401. doi: 10.1145/357172.357176

  • 48

    LeeJ. (2024). Exploring clique transversal problems for d-degenerate graphs with fixed d: from polynomial-time solvability to parameterized complexity. Axioms13:382. doi: 10.3390/axioms13060382

  • 49

    LykourisT.VassilvitskiiS. (2021). Competitive caching with machine learned advice. J. ACM68, 124. doi: 10.1145/3447579

  • 50

    MaB.ZhangT.ShenM.TangJ. (2023). “Probabilistic data association algorithm based on adaptive robust kalman filtering,” in Proceedings of the 3rd International Symposium Computer Engeneering Intelligence Communications (ISCEIC) (SPIE), 738742. doi: 10.1117/12.2661021

  • 51

    MarkarianC. (2024). Online algorithmic study of facility location problems: a survey. IEEE Access12, 7772477738. doi: 10.1109/ACCESS.2024.3406788

  • 52

    MarkarianC.FachkhaC.YassineN. (2024). Revisiting online algorithms: a survey of set cover solutions beyond competitive analysis. IEEE Access12, 174723174739. doi: 10.1109/ACCESS.2024.3504541

  • 53

    OngaroD.OusterhoutJ. (2014). “In search of an understandable consensus algorithm,” in USENIX Annual Technology Conference, 305319.

  • 54

    PanduranganG.RobinsonP.TrehanA. (2016). Dex: self-healing expanders. Distrib. Comput. 29, 163185. doi: 10.1007/s00446-015-0258-3

  • 55

    PanduranganG.TrehanA.UpfalE. (2012). Distributed construction of self-healing expander networks. ACM Trans. Algorithms8, 129.

  • 56

Pieprzyk, J., Duda, J., Pawłowski, M., Camtepe, S., Mahboubi, A., and Morawiecki, P. (2023). The compression optimality of asymmetric numeral systems. Entropy 25:672. doi: 10.3390/e25040672

  • 57

Qiu, S., and Cui, J. (2024). Probabilistic graph model based recommendation algorithm for material selection in self-directed learning. SAGE Open 14:21582440241241981. doi: 10.1177/21582440241241981

  • 58

Rehman, A. U., Aguiar, R. L., and Barraca, J. P. (2022). Fault-tolerance in the scope of cloud computing. IEEE Access 10, 63422–63441. doi: 10.1109/ACCESS.2022.3182211

  • 59

Ren, X., Tong, X., and Zhang, W. (2023). Improved PBFT consensus algorithm based on node role division. J. Comput. Commun. 11, 20–38. doi: 10.4236/jcc.2023.112003

  • 60

Ridgley, I. L. D., Freeman, R. A., and Lynch, K. M. (2021). "Self-healing first-order distributed optimization," in Proceedings of the 60th IEEE Conference on Decision and Control (CDC) (IEEE), 3850–3856. doi: 10.1109/CDC45484.2021.9683487

  • 61

Saad, G., and Saia, J. (2014). "Self-healing computation," in Proceedings of the Symposium on Self-Stabilizing Systems (Springer), 195–210. doi: 10.1007/978-3-319-11764-5_14

  • 62

Saifullah, A., and Tsin, A. (2011). A self-stabilizing algorithm for 3-edge-connectivity. Int. J. High Perform. Comput. Netw. 5, 1–14. doi: 10.1504/IJHPCN.2011.038709

  • 63

Sakthivel, K., and Alexander, S. A. (2024). An extensive critique on machine learning techniques for fault tolerance and power quality improvement in multilevel inverters. Energy Rep. 12, 5814–5833. doi: 10.1016/j.egyr.2024.11.016

  • 64

Schneider, M. (1993). Self-stabilization. ACM Comput. Surv. 25, 45–67. doi: 10.1145/151254.151256

  • 65

Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x

  • 66

Solouki, M. A., Angizi, S., and Violante, M. (2024). Dependability in embedded systems: a survey of fault tolerance methods and software-based mitigation techniques. IEEE Access 12, 180939–180967. doi: 10.1109/ACCESS.2024.3509633

  • 67

Song, T., Pu, H., Schonfeld, P., Zhang, H., Li, W., Hu, J., et al. (2021). Bi-objective mountain railway alignment optimization incorporating seismic risk assessment. Comput.-Aided Civ. Infrastruct. Eng. 36, 143–163. doi: 10.1111/mice.12607

  • 68

Tang, L., and Breugel, F. D. (2020). Deciding probabilistic bisimilarity distance one for probabilistic automata. J. Comput. Syst. Sci. 111, 57–84. doi: 10.1016/j.jcss.2020.02.003

  • 69

Tixeuil, S. (2009). "Self-stabilizing algorithms," in Algorithms and Theory of Computation Handbook, 26-1. doi: 10.1201/9781584888215-c26

  • 70

Venkatesha, S., and Parthasarathi, R. (2024). Survey on redundancy-based fault tolerance methods for processors and hardware accelerators: trends in quantum computing, heterogeneous systems and reliability. ACM Comput. Surv. 56, 1–76. doi: 10.1145/3663672

  • 71

Wang, H., and Wattenhofer, R. (2020). "Asynchronous Byzantine agreement in incomplete networks," in Proceedings of the 2020 ACM Symposium on Principles of Distributed Computing. doi: 10.1145/3419614.3423250

  • 72

Wei, W., and Liu, L. (2025). Trustworthy distributed AI systems: robustness, privacy, and governance. ACM Comput. Surv. 57, 1–42. doi: 10.1145/3645102

  • 73

Yu, G., Tan, G., Huang, H., Zhang, Z., Chen, P., Natella, R., et al. (2024). A survey on failure analysis and fault injection in AI systems. arXiv preprint arXiv:2407.00125.

  • 74

Zadsar, M., Haghifam, M. R., and Larimi, S. M. M. (2017). Approach for self-healing resilient operation of active distribution network with microgrid. IET Gener. Transm. Distrib. 11, 4633–4643. doi: 10.1049/iet-gtd.2016.1783

  • 75

Zhao, B., Xiao, X., Zhang, W., Zhang, B., Gan, G., and Xia, S. (2020). "Self-paced probabilistic principal component analysis for data with outliers," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3737–3741. doi: 10.1109/ICASSP40776.2020.9054487

  • 76

Zhao, C., He, J., and Wang, Q.-G. (2019). Resilient distributed optimization algorithm against adversarial attacks. IEEE Trans. Autom. Control 65, 4308–4315. doi: 10.1109/TAC.2019.2954363

  • 77

Zhong, W., Yang, C., Liang, W., Cai, J., Chen, L., Liao, J., et al. (2023). Byzantine fault-tolerant consensus algorithms: a survey. Electronics 12:3801. doi: 10.3390/electronics12183801

Keywords

algorithmic redundancy, Byzantine fault tolerance, error-resilient online algorithms, fault-tolerant computation, self-healing graphs, self-stabilization

Citation

Markarian C and Panthakkan A (2026) Algorithmic self-repair: frontiers in fault-tolerant computation. Front. Comput. Sci. 8:1717711. doi: 10.3389/fcomp.2026.1717711

Received

02 October 2025

Revised

19 January 2026

Accepted

21 January 2026

Published

12 February 2026

Volume

8 - 2026

Edited by

Marco Faella, University of Naples Federico II, Italy

Reviewed by

Marielle Stoelinga, University of Twente, Netherlands

Ai Liu, Nanjing University of Aeronautics and Astronautics, China

Copyright

*Correspondence: Christine Markarian,

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
