Abstract
Cloud-native microservices improve development velocity and elasticity, but they also create complex and dynamic service dependencies. Resource contention, queue buildup, and downstream slowdowns can propagate through call chains, amplifying end-to-end tail latency (e.g., p95/p99) and increasing Service Level Objective (SLO) violation risks. While many studies focus on post-hoc anomaly detection and root-cause analysis, industrial operations increasingly demand proactive capabilities, such as predicting performance risks before a request finishes, issuing early warnings from partial trace prefixes, and producing actionable signals for mitigation. This mini-review synthesizes recent progress on trace-driven proactive SLO management. We summarize problem formulations and evaluation protocols for SLO violation and tail-quantile prediction, prefix early warning under precision constraints, and actionable intermediate outputs such as bottleneck candidate ranking and what-if estimation. We then survey modeling approaches spanning feature-based baselines, sequence models, graph neural networks, sequence-graph fusion, and multimodal/causal extensions, highlighting practical issues such as class imbalance, sampling-induced missing spans, and topology drift. Finally, we review commonly used public benchmarks and traces, and discuss open challenges toward deployable, trustworthy proactive SLO management.
1 Introduction
Modern cloud services increasingly adopt microservice architectures to enable independent development, deployment, and elastic scaling (Dragoni et al., 2017; Newman, 2021). In mission-critical domains such as power grid dispatching, cloud-native transformations have been actively explored to improve system scalability and operational efficiency (Liang et al., 2016; Wen et al., 2016). A single user request typically traverses multiple services and middleware components, forming a dynamic call graph (Gan et al., 2019). When a service experiences resource shortages, queue buildup, lock contention, garbage collection pauses, or downstream degradation, latency and errors can propagate along dependencies, manifesting as tail-latency spikes and SLO violations (Wu et al., 2020; Yu et al., 2021). This is especially relevant to latency-sensitive real-time interactive services such as cloud-hosted mobile online games (Meiländer et al., 2014). Traditional alerting based on single-metric thresholds often lags behind these cascading behaviors (Notaro et al., 2021).
Distributed tracing provides request-centric observability by recording traces and spans that capture the execution path and timing across services (Sigelman et al., 2010). Tracing has been widely used for diagnosis and root-cause analysis (Wu et al., 2020; Yu et al., 2021), but it can also support proactive operations (Grohmann et al., 2021). In proactive SLO management, the goal is to forecast risk before user impact becomes visible, ideally while a request is still in-flight (trace prefix). Such forecasts are most valuable when they arrive early enough to enable mitigation and are accompanied by actionable signals, such as likely bottleneck services or predicted benefits of interventions. Beyond compute-side mitigation, network-side control (e.g., SDN-enabled QoS enforcement/traffic steering) provides additional actionable knobs that can complement early-warning predictors (Gorlatch et al., 2014).
Service level objectives (SLOs) are central in site reliability engineering (SRE) as explicit targets for availability and latency (Beyer et al., 2016). The operational motivation for proactive prediction is closely related to tail behavior: when requests fan out across many microservices, the slowest component often dominates the end-to-end latency distribution (“tail at scale”) (Dean and Barroso, 2013). Although microservices improve modularity and independent deployment, they also multiply dependencies and complicate performance debugging and capacity planning (Dragoni et al., 2017).
The remainder of this mini-review is organized as follows. Section 2 introduces tracing primitives, prefix definitions, and multimodal observability. Section 3 formalizes prediction tasks and evaluation protocols. Section 4 surveys modeling approaches from feature-based baselines to graph neural networks and causal extensions. Section 5 surveys public benchmarks and datasets. Section 6 discusses open challenges toward deployable proactive SLO management.
2 Tracing signals, prefixes, and multimodal observability
A trace represents a single end-to-end request, while a span represents a timed operation within a service or component. Spans are linked by parent-child or caller-callee relations (Sigelman et al., 2010). Traces can be projected into call graphs whose nodes and edges carry attributes such as service identity, latency, retries, and error codes.
Proactive settings introduce an online constraint: at inference time, only a partial trace prefix is observable. This distinguishes proactive SLO management from post-hoc diagnosis, where complete traces are available. Prefixes may be defined by elapsed wall-clock time, the number of observed spans, or the depth of topological expansion from the root span. Concurrent branches often complicate the notion of “progress,” since multiple spans may execute in parallel without a natural ordering. Modeling, therefore, requires explicit decisions on prefix ordering, time encoding, and how to represent partially observed subtrees.
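As a concrete illustration of these prefix definitions, the sketch below extracts time-based and depth-based prefixes from a toy span list. The `Span` fields and helper names are illustrative, not taken from any specific tracing SDK.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None for the root span
    service: str
    start_ms: float            # offset from trace start
    duration_ms: float

def prefix_by_time(spans: List[Span], t_ms: float) -> List[Span]:
    """Spans that have started by wall-clock offset t_ms; in-flight spans
    are truncated so only the elapsed portion of their duration is visible."""
    prefix = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        if s.start_ms <= t_ms:
            observed = min(s.duration_ms, t_ms - s.start_ms)
            prefix.append(Span(s.span_id, s.parent_id, s.service, s.start_ms, observed))
    return prefix

def prefix_by_depth(spans: List[Span], max_depth: int) -> List[Span]:
    """Spans within max_depth hops of the root in the parent-child tree."""
    by_id = {s.span_id: s for s in spans}
    def depth(s: Span) -> int:
        d = 0
        while s.parent_id is not None:
            s, d = by_id[s.parent_id], d + 1
        return d
    return [s for s in spans if depth(s) <= max_depth]

# Toy trace: gateway -> {auth, cart}, cart -> db.
trace = [
    Span("a", None, "gateway", 0.0, 120.0),
    Span("b", "a", "auth", 5.0, 30.0),
    Span("c", "a", "cart", 40.0, 60.0),
    Span("d", "c", "db", 50.0, 45.0),
]
early = prefix_by_time(trace, 45.0)    # "d" has not started yet
shallow = prefix_by_depth(trace, 1)    # "d" is two hops from the root
```

Note that the two definitions can disagree (here they happen to coincide), which is exactly why published work should state which prefix notion its evaluation uses.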
Traces alone may miss important context. Metrics often reflect resource and saturation signals (CPU throttling, garbage collection, network congestion), and logs contain semantic clues (exceptions, error messages). Recent work increasingly treats proactive SLO management as a multimodal problem, aligning trace, metric, and log signals at request or service granularity (Zhao et al., 2023). In practice, data quality issues are common: head-based or tail-based sampling can drop intermediate spans, context propagation bugs can break trace continuity, and duplicated caller/callee instrumentation can bias edge statistics. Robust preprocessing and reporting of data cleaning choices are crucial for reproducibility (Huye et al., 2024).
Most modern tracing deployments rely on standardized instrumentation and context propagation. OpenTelemetry provides cross-language APIs/SDKs and a vendor-neutral specification that has become a de facto industry standard (OpenTelemetry, 2025). In production AIOps pipelines, traces are often fused with logs and metrics; classic log-analysis studies highlight both the value and the pitfalls of relying on noisy operational data (He et al., 2016; Du et al., 2017).
3 Problem formulations and evaluation
Proactive SLO management can be viewed as a family of prediction and decision-support tasks. Common formulations include: (i) SLO violation prediction (binary classification), (ii) tail latency/quantile prediction (regression), (iii) prefix early warning (when to alert), and (iv) actionable intermediate outputs such as bottleneck ranking and what-if estimation. Table 1 summarizes these tasks along with recommended evaluation metrics and representative systems.
Table 1
| Task | Input → Output | Recommended metrics | Representative systems |
|---|---|---|---|
| SLO violation prediction | Trace/metric history (window) → P(violation) | PR-AUC; calibration error | SuanMing (Grohmann et al., 2021) |
| Tail/quantile latency prediction | Prefix (+config) → latency / tail quantile (e.g., p95) | pinball loss; relative error | PERT-GNN (Tam et al., 2023); FastPERT (Tam et al., 2025) |
| Prefix early warning | Prefix stream → alert time | EAR@Precision≥P0; Coverage; FAR | (limited dedicated work) |
| Actionable outputs | Trace/call graph → Top-k candidates / what-if | Top-k hit; NDCG; intervention error | MicroRCA (Wu et al., 2020); Sage (Gan et al., 2021) |
Typical proactive SLO management tasks and evaluation targets.
SLO violation and latency prediction: SLO violation labels are often unavailable in public traces. Therefore, many studies define proxy violation labels using latency-threshold rules (e.g., based on p95/p99) computed from historical data. This proxy-vs-real-SLO gap should be made explicit because real SLOs may depend on API endpoint, customer tier, time window, and composite indicators. For binary classification, PR-AUC is preferred over ROC-AUC due to class imbalance. Calibration error (Guo et al., 2017) should also be reported since early warning systems require well-calibrated probabilities. For quantile/latency regression, pinball loss directly measures quantile accuracy, while relative error metrics capture prediction quality across different latency scales.
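For the quantile-regression side, pinball loss has a simple closed form; a minimal plain-Python reference implementation (variable names are illustrative):

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss at level q in (0, 1): under-predictions
    are penalized by q and over-predictions by (1 - q), so the minimizer is
    the q-th conditional quantile of the latency distribution."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        diff = yt - yp
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

# A constant 150 ms predictor evaluated at the p95 level: the large
# under-prediction on the 300 ms tail request dominates the loss at q = 0.95.
loss_p95 = pinball_loss([100.0, 200.0, 300.0], [150.0, 150.0, 150.0], 0.95)
```

At q = 0.5 the loss reduces to half the mean absolute error, which is a useful sanity check when wiring it into an evaluation harness.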
Prefix early warning: A key operational requirement for early warning is to control false alarms while maximizing detection earliness. One practical evaluation protocol adapted from early time-series classification literature (Xing et al., 2012; Schäfer and Leser, 2020; Bilski and Jastrzębska, 2023) is to select an alert threshold on a validation set such that Precision≥P0 (e.g., 0.90 or 0.95), then report metrics on a held-out test set. These include: (a) Earliest Alarm Ratio (EAR), the average minimal prefix fraction at which a true-violation request triggers an alert; (b) Coverage, the fraction of violating requests that are successfully alerted; and (c) False Alarm Rate (FAR), the fraction of non-violating requests that trigger an alert. This protocol is consistent with cost-sensitive thresholding and captures the “earliness-accuracy” trade-off central to proactive prediction. While SLO violation prediction and bottleneck ranking have received substantial attention, work on prefix-based early warning with explicit earliness evaluation remains relatively limited in the microservices literature.
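Under the stated assumptions (per-request sequences of prefix-level violation probabilities, and a threshold already selected for Precision≥P0 on a validation set), the three test-set metrics can be computed as in this sketch; the data layout is a simplification of a real streaming setting:

```python
def early_warning_metrics(prefix_scores, labels, threshold):
    """EAR / Coverage / FAR for a fixed alert threshold.

    prefix_scores[i] lists violation probabilities for request i at successive
    prefixes (e.g., after 1/3, 2/3, 3/3 of the trace); labels[i] is 1 if the
    request ultimately violated its SLO.
    """
    ear_fracs, alerted_pos, false_alarms, n_pos, n_neg = [], 0, 0, 0, 0
    for scores, y in zip(prefix_scores, labels):
        first = next((k for k, s in enumerate(scores) if s >= threshold), None)
        if y == 1:
            n_pos += 1
            if first is not None:
                alerted_pos += 1
                ear_fracs.append((first + 1) / len(scores))  # prefix fraction at first alert
        else:
            n_neg += 1
            if first is not None:
                false_alarms += 1
    ear = sum(ear_fracs) / len(ear_fracs) if ear_fracs else float("nan")
    coverage = alerted_pos / n_pos if n_pos else float("nan")
    far = false_alarms / n_neg if n_neg else float("nan")
    return ear, coverage, far

# Three requests scored at three prefix points each; both violating requests
# first cross the 0.5 threshold at the second prefix (EAR = 2/3).
ear, cov, far = early_warning_metrics(
    [[0.2, 0.6, 0.9], [0.1, 0.2, 0.3], [0.1, 0.8, 0.9]],
    [1, 0, 1],
    threshold=0.5,
)
```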
Bottleneck localization and root-cause analysis: Beyond forecasting SLO violation risk, trace data are widely used to localize bottlenecks and root causes. Systems such as MicroRCA (Wu et al., 2020) and MicroRank (Yu et al., 2021) analyze propagation patterns across microservices to rank likely culprits, while critical-path analysis frameworks like CRISP (Zhang et al., 2022) summarize large volumes of traces into actionable performance explanations. These diagnosis-oriented tasks are complementary to early warning, since a practical proactive controller often needs both “will we violate” and “where should we intervene” signals.
What-if estimation: What-if estimation extends bottleneck ranking by quantifying the expected improvement from specific interventions. Given a predicted SLO violation, what-if models answer questions such as: “If we scale Service A from 2 to 4 replicas, by how much will end-to-end latency decrease?” or “If we reroute 20% of traffic away from Region B, what is the probability of meeting SLO?” Sage (Gan et al., 2021) solves this through counterfactual generation: the system generates hypothetical scenarios where a candidate service's resource utilization is set to “healthy” values, then predicts the resulting end-to-end latency using a learned generative model. If the counterfactual latency meets SLO, the intervention is deemed effective. Evaluation of what-if estimation typically uses metrics such as intervention error (the difference between predicted and actual post-intervention latency) or decision accuracy (whether the recommended intervention actually resolves the SLO violation when applied). Because ground-truth intervention outcomes are expensive to obtain, requiring either production experiments or high-fidelity simulation, evaluation datasets for what-if estimation remain scarce, and most published results rely on controlled testbeds or synthetic fault injection.
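The two evaluation metrics mentioned above are straightforward to pin down in code; this is a minimal sketch with hypothetical numbers, not the protocol of any specific system:

```python
def intervention_error(pred_latency_ms, actual_latency_ms):
    """Mean absolute gap between predicted and observed post-intervention latency."""
    pairs = list(zip(pred_latency_ms, actual_latency_ms))
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

def decision_accuracy(pred_latency_ms, actual_latency_ms, slo_ms):
    """Fraction of interventions whose predicted SLO outcome (met / violated)
    matches the outcome actually observed after applying the intervention."""
    pairs = list(zip(pred_latency_ms, actual_latency_ms))
    return sum((p <= slo_ms) == (a <= slo_ms) for p, a in pairs) / len(pairs)

# Two candidate interventions against a 250 ms SLO: both predictions are off
# by 20 ms, but only the second crosses the SLO boundary the wrong way.
err = intervention_error([180.0, 260.0], [200.0, 240.0])
acc = decision_accuracy([180.0, 260.0], [200.0, 240.0], 250.0)
```

The example also illustrates why the two metrics are complementary: identical absolute errors can yield opposite SLO decisions near the threshold.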
4 Modeling approaches: from baselines to structured and multimodal predictors
Feature-based baselines: Gradient-boosted trees such as XGBoost, LightGBM, and CatBoost remain strong in practice because they are easy to train, efficient at inference, and amenable to probability calibration (Chen and Guestrin, 2016; Ke et al., 2017; Prokhorenkova et al., 2018). However, handcrafted aggregates may miss fine-grained dependency structure and temporal evolution.
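A hedged illustration of this baseline recipe follows, with synthetic per-trace features and proxy labels; scikit-learn's `GradientBoostingClassifier` stands in for XGBoost/LightGBM/CatBoost, and the feature set is hypothetical:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 600
# Hypothetical handcrafted aggregates per trace: span count, max span
# latency (ms), retry count, error-span fraction.
X = np.column_stack([
    rng.integers(3, 50, n).astype(float),
    rng.gamma(2.0, 50.0, n),
    rng.poisson(0.3, n).astype(float),
    rng.uniform(0.0, 0.2, n),
])
# Synthetic proxy "violation" label, driven mostly by the latency feature
# so that the positive class stays a small minority.
y = (X[:, 1] + 100.0 * X[:, 3] + rng.normal(0.0, 30.0, n) > 220.0).astype(int)

X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]
# Sigmoid (Platt) calibration on top of the boosted trees, since early
# warning needs usable probabilities, not just a ranking.
model = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), cv=3, method="sigmoid"
)
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
pr_auc = average_precision_score(y_te, proba)   # PR-AUC, robust to imbalance
```

Reporting PR-AUC alongside a calibration diagnostic on such a baseline gives a meaningful floor before moving to structured models.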
Sequence models: Sequence models treat spans as event tokens and are naturally aligned with prefix prediction. Recurrent units such as LSTM and GRU provide efficient prefix encoders (Hochreiter and Schmidhuber, 1997; Cho et al., 2014), while Transformer-style attention offers flexible modeling of long-range dependencies and irregular events (Vaswani et al., 2017). Temporal convolutional networks provide a convolutional alternative for generic sequence modeling (Bai et al., 2018), while long-horizon Transformer variants target long sequence time-series forecasting (Zhou et al., 2021a). However, all sequence approaches must address long traces, concurrent branches that violate strict sequential assumptions, and appropriate position/time encoding.
Graph neural networks: Graph neural networks explicitly model service dependencies and can support explanation via node/edge importance. Building on general GNN primitives such as graph convolutions, neighborhood aggregation, and graph attention (Kipf and Welling, 2017; Hamilton et al., 2017; Veličković et al., 2017), they are well-suited for bottleneck ranking and resource decision support, including proactive auto-scaling (Park et al., 2021, 2024; Meng et al., 2023). Challenges include over-smoothing on deep call chains, missing edges due to sampling, and topology drift caused by deployments and elastic scaling.
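The neighborhood-aggregation primitive underlying these models can be sketched in a few lines of NumPy; this is a single GraphSAGE-style round with placeholder weights and features, not a trained model:

```python
import numpy as np

def mean_aggregate(adj, h, w_self, w_neigh):
    """One message-passing round: each service mixes its own features with
    the mean of its neighbors' features, followed by a ReLU nonlinearity."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                     # isolated services keep their own state
    neigh = (adj @ h) / deg                 # mean over call-graph neighbors
    return np.maximum(h @ w_self + neigh @ w_neigh, 0.0)

# Toy call graph gateway-{auth, cart}, cart-db (undirected adjacency),
# with two per-service features: p95 latency (ms) and error rate.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = np.array([[120.0, 0.01],
              [30.0, 0.00],
              [60.0, 0.05],
              [45.0, 0.02]])
rng = np.random.default_rng(1)
w_self, w_neigh = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))
z = mean_aggregate(adj, h, w_self, w_neigh)   # per-service embeddings, shape (4, 8)
```

Stacking several such rounds lets slowdown signals diffuse along the call graph, which is also where the over-smoothing concern on deep chains comes from.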
Hybrid sequence-graph models: Hybrid models aim to combine temporal evolution and topology. A prominent design pattern is the dual-encoder architecture: one encoder (typically GNN-based) captures call-graph structure, while another (LSTM, GRU, or Transformer) models temporal dynamics. The two representations are then fused via gating, attention, or concatenation. TraceGra exemplifies this approach by using a GNN to extract topological features from a unified trace-and-metric graph representation and an LSTM to capture temporal patterns, combining them through an encoder–decoder framework for anomaly detection (Chen et al., 2023). TopoMAD modifies LSTM cells with additional GNN gates to jointly model spatial dependencies among microservices and temporal evolution of metrics (He et al., 2023). More recently, USRFNet introduced a dual-stream architecture that explicitly separates traffic-side features (modeled by GNN) from resource-side features (modeled by gated MLP), then fuses them via cross-diffusion attention and low-rank fusion to predict window-level p95 latency (Qian et al., 2025). These hybrid designs address the limitations of single-representation models, where pure sequence models ignore graph structure, while pure GNN models may homogenize heterogeneous feature types through uniform message passing.
Structural latency predictors and multimodal extensions: Recent latency predictors introduce structural inductive bias over PERT-like decompositions to improve both accuracy and interpretability (Tam et al., 2023, 2025). Multimodal extensions further align trace graphs with metrics and logs (Zhao et al., 2023), improving robustness when any single modality is incomplete.
Causal and counterfactual approaches: Causal methods are increasingly explored for actionable prediction. Rather than only estimating SLO violation risk, the goal is to estimate the effect of hypothetical interventions, such as scaling a specific service or rerouting traffic on end-to-end latency outcomes. Sage introduces a causal Bayesian network combined with a graphical variational auto-encoder to generate counterfactual latency scenarios (Gan et al., 2021). Given an observed SLO violation, Sage hypothetically “fixes” the utilization of candidate services to normal values and predicts whether SLO would be met (restored), thereby identifying root causes through counterfactual reasoning rather than correlation. Lohse et al. (2025) proposed a causal discovery framework that reconstructs the latency DAG of microservice architectures using domain knowledge and causal discovery algorithms, enabling identification of actionable intervention targets that causally influence high latency. However, causal estimation faces significant challenges: shared resources (CPU, network bandwidth, memory) introduce confounding, opportunities for randomized experiments are limited in production, and observational data alone cannot distinguish correlation from causation without strong structural assumptions. Current practical systems often combine structured causal models with small-scale online validation (A/B tests, canary deployments) to verify predicted intervention effects before broad rollout.
5 Public datasets and benchmarks
Reproducible evaluation requires both controllable benchmarks and representative traces. DeathStarBench provides open-source microservice applications with configurable workloads and fault injection via external tools, enabling controlled studies of tail latency and cascading slowdowns (Gan et al., 2019). Other benchmarks, such as Train-Ticket (Zhou et al., 2021b) and SockShop (Weaveworks, 2023), are also used in the literature, though DeathStarBench remains the most widely adopted for SLO-focused studies. Large-scale public traces, such as Alibaba microservice call graphs, enable learning from realistic production dynamics (Luo et al., 2021, 2022), but often require significant preprocessing, including deduplication, missing-span handling, and topology repair. Huye et al. (2024) systematically document topological inconsistencies in the Alibaba traces and introduce Casper, a method that recovers significantly more valid traces by leveraging dataset redundancies. Because SLO violations are rare events, most datasets exhibit extreme class imbalance, and reported results can be sensitive to sampling strategies, time splits, and proxy label definitions. Reporting dataset statistics such as trace volume, span depth distribution, violation rate, and sampling rate, and releasing the preprocessing code are critical for meaningful comparisons.
Beyond class imbalance, two data-quality issues recur across public traces. First, sampling-induced missing spans: production tracing systems typically employ head-based or tail-based sampling to reduce overhead, which can drop intermediate spans and break parent-child relationships. Models trained on complete traces may not generalize to sampled data; evaluation protocols should therefore report sampling rates and assess robustness to incomplete call graphs. Second, topology drift: microservice deployments evolve through canary releases, auto-scaling, and service mesh reconfigurations, so traces collected months apart may reflect different call-graph structures. Models assuming static topology risk silent degradation; benchmarks ideally should include temporal metadata to enable drift-aware evaluation.
Outside microservice-specific benchmarks, large-scale cluster traces are often used to evaluate workload prediction and resource management components underpinning SLO control. Google has released Borg cluster-usage traces and corresponding analyses (Reiss et al., 2012; Tirmazi et al., 2020). Microsoft has released Azure workload traces and telemetry through the AzurePublicDataset and Resource Central projects (Cortez et al., 2017). These datasets complement microservice benchmarks by providing realistic resource-demand dynamics, although they generally lack request-level call graphs. A summary is shown in Table 2.
Table 2
| Dataset/benchmark | Type | Scale (approx.) | Key features for proactive SLO modeling |
|---|---|---|---|
| DeathStarBench (Gan et al., 2019) | Benchmark apps | 5 apps, configurable | Tail-latency experiments; fault injection via external tools |
| Train-Ticket (Zhou et al., 2021b) | Benchmark app | 41 services (current) | Realistic ticket-booking workflow; fault injection supported |
| Alibaba Cluster Trace (v2021/v2022) (Luo et al., 2021, 2022) | Production traces | ~20M traces (0.5% sample), 20K+ microservices | Public call-graph traces; requires preprocessing; corrected version available (Huye et al., 2024) |
| Google Borg Traces (Reiss et al., 2012; Tirmazi et al., 2020) | Cluster workload | ~12K machines, 29 days | Task/resource usage and scheduling; useful for capacity planning; no request-level call graphs |
| AzurePublicDataset (V1/V2) (Cortez et al., 2017; Microsoft, 2019a,b) | Cloud workload | VM-level, 30 days | Azure VM traces for resource demand analysis; complements microservice benchmarks |
| Resource Central (Cortez et al., 2017) | Cloud telemetry | VM-level | Azure VM telemetry with forecasting tasks; includes released dataset |
Commonly used microservice benchmarks and traces for trace-driven performance modeling.
6 Discussion
Several gaps remain between research prototypes and deployable proactive SLO management. We organize the discussion around four themes: evaluation gaps and label semantics (Section 6.1), operational integration (Section 6.2), scalability and efficiency (Section 6.3), and the path from correlation to action (Section 6.4), before closing with an outlook (Section 6.5).
6.1 Evaluation gaps and label semantics
A recurring issue is the disconnect between proxy labels and real SLO semantics. Most studies define violation labels using latency quantiles computed from training data, but production SLOs are often more nuanced: they may vary by API endpoint, customer tier, or time window, and may incorporate composite indicators beyond latency. This proxy-vs-real gap makes it unclear how offline accuracy translates to operational value. Future work should explore evaluation protocols that incorporate SLO heterogeneity. Calibration and uncertainty quantification represent another under-addressed dimension. Early warning systems must balance false alarms against missed violations, yet most published models report only discrimination metrics without assessing probability calibration. Techniques such as Platt scaling or conformal prediction could improve reliability, but their application to trace-based predictors remains limited. Moreover, prefix-based early warning, where predictions must be issued from incomplete traces, introduces additional evaluation complexity. The trade-off between earliness and accuracy lacks standardized benchmarks, and most existing systems do not report metrics such as EAR or coverage that capture this trade-off.
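As one concrete instance of the conformal idea mentioned above, a split-conformal threshold can bound the false alarm rate on non-violating requests under an exchangeability assumption. The sketch below assumes a held-out calibration set with proxy violation labels; it is illustrative, not a deployed procedure:

```python
import math

def conformal_alert_threshold(cal_scores, cal_labels, alpha=0.1):
    """Split-conformal cutoff from the non-violating calibration requests:
    under exchangeability, roughly at most an alpha fraction of future
    non-violating requests score at or above the returned threshold,
    which controls the false alarm rate without retraining the model."""
    neg = sorted(s for s, y in zip(cal_scores, cal_labels) if y == 0)
    n = len(neg)
    # Conservative (1 - alpha) empirical quantile with finite-sample correction.
    k = min(math.ceil((n + 1) * (1 - alpha)) - 1, n - 1)
    return neg[k]

# Nine non-violating calibration requests with scores 0.1..0.9 plus one
# violation; at alpha = 0.1 the cutoff lands on the highest negative score.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0] * 9 + [1]
threshold = conformal_alert_threshold(scores, labels, alpha=0.1)
```

Pairing such a distribution-free guarantee with the EAR/Coverage protocol of Section 3 would make earliness results easier to compare across systems.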
6.2 Operational integration
Bridging prototypes to production makes inference latency a critical constraint: a proactive predictor must return results fast enough to enable mitigation before the request completes. Lightweight models or hierarchical prediction strategies may be necessary. Integration with existing observability stacks (e.g., commercial APM tools or open-source platforms designed for post-hoc diagnosis) requires streaming trace ingestion, low-latency feature extraction, and seamless alert routing. Production environments also drift continuously through canary releases, traffic shifts, and instrumentation changes, causing models trained on historical data to degrade silently.
6.3 Scalability and efficiency
Large-scale deployments generate millions of traces per minute. Complementary to trace-driven predictors, scalability modeling for real-time interactive applications on clouds helps characterize scaling behavior and cost–benefit trade-offs of mitigation actions under latency constraints (Meiländer and Gorlatch, 2018). Training and inference at this scale demand strategies such as sampling-based training, online learning, and model compression. Memory efficiency is equally important: graph-based models must maintain dynamic adjacency structures, while sequence models over long traces incur quadratic attention costs unless efficient variants are employed. Co-designing models with streaming infrastructure offers a promising direction but requires tight coupling between ML and systems engineering.
6.4 From correlation to action
The hardest challenge is moving from correlation to actionable decisions. Predicting that a request is at risk is not equivalent to knowing which intervention will help. Such interventions may include compute-side actions (e.g., autoscaling and configuration changes) as well as network-side control, such as SDN-enabled QoS enforcement/traffic steering (Gorlatch et al., 2014). Complementary to trace-driven prediction, a line of research closes the loop by directly optimizing resource allocation and autoscaling to mitigate tail latency and SLO/SLA violations. Sinan (Zhang et al., 2021) uses ML models to estimate the performance impact of inter-service dependencies and allocates resources per tier to preserve end-to-end tail-latency targets. POBO (Guo et al., 2023) applies safe Bayesian optimization to search for resource configurations under system-wide and tail-latency constraints. DeepScaler (Meng et al., 2023) leverages spatiotemporal GNNs with adaptive graph learning for holistic autoscaling across microservices, and QueueFlower (Cao et al., 2024) performs dynamic queue balancing using real-time latency feedback without offline dependency profiling. These systems are highly complementary to the actionable outputs surveyed in this review and further motivate integrating trace-prefix warning, bottleneck localization, and what-if estimation with safe controllers in a closed-loop pipeline. Progress likely requires combining structured models that expose interpretable decomposition, targeted online validation, and causal reasoning that accounts for shared-resource confounding. Observational trace data alone cannot distinguish whether two services are slow due to causal dependency or shared congestion.
6.5 Outlook
Despite these challenges, trace-driven proactive SLO management is a rapidly growing area. The adoption of OpenTelemetry, the availability of public benchmarks, and industry interest in AIOps indicate continued progress. Key opportunities include standardized evaluation protocols reflecting real SLO semantics, integration of causal reasoning into predictive pipelines, and end-to-end systems that close the loop from prediction to automated mitigation.
Statements
Author contributions
MY: Conceptualization, Investigation, Project administration, Writing – original draft, Writing – review & editing. HL: Investigation, Writing – review & editing. JD: Investigation, Writing – review & editing. KL: Investigation, Writing – review & editing. TD: Visualization, Writing – review & editing. YF: Visualization, Writing – review & editing. CY: Project administration, Supervision, Writing – review & editing.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Conflict of interest
MY, HL, JD, TD, YF, and CY were employed by Electric Power Research Institute, CSG. KL was employed by China Southern Power Grid.
Generative AI statement
The author(s) declared that generative AI was used in the creation of this manuscript. The author(s) used ChatGPT (OpenAI) solely for language editing and grammar improvement. ChatGPT was not used to generate any technical content, results, figures or tables, or references. All edits were reviewed and approved by the author(s), who take full responsibility for the final manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1
BaiS.KolterJ. Z.KoltunV. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. doi: 10.48550/arXiv.1803.01271
2
BeyerB.JonesC.PetoffJ.MurphyN. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media, Inc.
3
BilskiJ. M.JastrzębskaA. (2023). Calimera: a new early time series classification method. Inform. Process. Manag. 60:103465. doi: 10.1016/j.ipm.2023.103465
4
CaoH.LiuX.GuoH.HeJ.LiuX. (2024). “Queueflower: orchestrating microservice workflows via dynamic queue balancing,” in 2024 IEEE International Conference on Web Services (ICWS) (Shenzhen: IEEE), 1293–1299. doi: 10.1109/ICWS62655.2024.00155
5
ChenJ.LiuF.JiangJ.ZhongG.XuD.TanZ.et al. (2023). TraceGra: a trace-based anomaly detection for microservice using graph deep learning. Comput. Commun.204, 109–117. doi: 10.1016/j.comcom.2023.03.028
6
ChenT.GuestrinC. (2016). “Xgboost: a scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'16 (San Francisco, CA: ACM), 785–794. doi: 10.1145/2939672.2939785
7
ChoK.van MerrienboerB.GulcehreC.BahdanauD.BougaresF.SchwenkH.et al. (2014). “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Doha: Association for Computational Linguistics), 1724–1734. doi: 10.3115/v1/D14-1179
8
CortezE.BondeA.MuzioA.RussinovichM.FontouraM.BianchiniR. (2017). “Resource central: understanding and predicting workloads for improved resource management in large cloud platforms,” in Proceedings of the 26th Symposium on Operating Systems Principles, SOSP'17 (Shanghai: ACM), 153–167. doi: 10.1145/3132747.3132772
9
DeanJ.BarrosoL. A. (2013). The tail at scale. Commun. ACM56, 74–80. doi: 10.1145/2408776.2408794
10
DragoniN.GiallorenzoS.LafuenteA. L.MazzaraM.MontesiF.MustafinR.et al. (2017). Microservices: Yesterday, Today, and Tomorrow. Cham: Springer International Publishing, 195–216. doi: 10.1007/978-3-319-67425-4_12
11
DuM.LiF.ZhengG.SrikumarV. (2017). “Deeplog: anomaly detection and diagnosis from system logs through deep learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS'17 (Dallas, TX: ACM), 1285–1298. doi: 10.1145/3133956.3134015
12
GanY.LiangM.DevS.LoD.DelimitrouC. (2021). “Sage: practical and scalable ml-driven performance debugging in microservices,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'21 (Virtual: ACM), 135–151. doi: 10.1145/3445814.3446700
13
GanY.ZhangY.ChengD.ShettyA.RathiP.KatarkiN.et al. (2019). “An open-source benchmark suite for microservices and their hardware-software implications for cloud &edge systems,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'19 (Providence, RI: ACM), 3–18. doi: 10.1145/3297858.3304013
14
Gorlatch, S., Humernbrum, T., and Glinka, F. (2014). "Improving qos in real-time internet applications: from best-effort to software-defined networks," in 2014 International Conference on Computing, Networking and Communications (ICNC) (Honolulu, HI: IEEE), 189–193. doi: 10.1109/ICCNC.2014.6785329
Grohmann, J., Straesser, M., Chalbani, A., Eismann, S., Arian, Y., Herbst, N., et al. (2021). "Suanming: explainable prediction of performance degradations in microservice applications," in Proceedings of the ACM/SPEC International Conference on Performance Engineering, ICPE'21 (Virtual: ACM), 165–176. doi: 10.1145/3427921.3450248
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). "On calibration of modern neural networks," in International Conference on Machine Learning (Sydney: PMLR), 1321–1330.
Guo, H., Cao, H., He, J., Liu, X., and Shi, Y. (2023). Pobo: safe and optimal resource management for cloud microservices. Perform. Eval. 162:102376. doi: 10.1016/j.peva.2023.102376
Hamilton, W. L., Ying, R., and Leskovec, J. (2017). "Inductive representation learning on large graphs," in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17 (Red Hook, NY: Curran Associates Inc.), 1025–1035.
He, S., Zhu, J., He, P., and Lyu, M. R. (2016). "Experience report: system log analysis for anomaly detection," in 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE) (Ottawa, ON: IEEE), 207–218. doi: 10.1109/ISSRE.2016.21
He, Z., Chen, P., Li, X., Wang, Y., Yu, G., Chen, C., et al. (2023). A spatiotemporal deep learning approach for unsupervised anomaly detection in cloud systems. IEEE Trans. Neural Netw. Learn. Syst. 34, 1705–1719. doi: 10.1109/TNNLS.2020.3027736
Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780. doi: 10.1162/neco.1997.9.8.1735
Huye, D., Liu, L., and Sambasivan, R. R. (2024). "Systemizing and mitigating topological inconsistencies in alibaba's microservice call-graph datasets," in Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, ICPE'24 (London: ACM), 276–285. doi: 10.1145/3629526.3645043
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. (2017). "Lightgbm: a highly efficient gradient boosting decision tree," in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17 (Red Hook, NY: Curran Associates Inc.), 3149–3157.
Kipf, T. N., and Welling, M. (2017). "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations.
Liang, S., Hu, R., Zhou, H., He, C., Fang, W., Zhao, H., et al. (2016). A new generation of power dispatching automation system based on cloud computing architecture. South. Power Syst. Technol. 10, 8–14 (in Chinese). doi: 10.13648/j.cnki.issn1674-0629.2016.06.002
Lohse, C., Tsutsumi, D., Ba, A., Harsha, P., Subramanian, C., Straesser, M., et al. (2025). "Causal latency modelling for cloud microservices," in 2025 IEEE 18th International Conference on Cloud Computing (CLOUD) (Helsinki: IEEE), 143–151. doi: 10.1109/CLOUD67622.2025.00024
Luo, S., Xu, H., Lu, C., Ye, K., Xu, G., Zhang, L., et al. (2021). "Characterizing microservice dependency and performance: Alibaba trace analysis," in Proceedings of the ACM Symposium on Cloud Computing, SoCC'21 (Seattle, WA: ACM), 412–426. doi: 10.1145/3472883.3487003
Luo, S., Xu, H., Ye, K., Xu, G., Zhang, L., Yang, G., et al. (2022). "The power of prediction: microservice auto scaling via workload learning," in Proceedings of the 13th Symposium on Cloud Computing, SoCC'22 (San Francisco, CA: ACM), 355–369. doi: 10.1145/3542929.3563477
Meiländer, D., Glinka, F., Gorlatch, S., Lin, L., Zhang, W., and Liao, X. (2014). "Bringing mobile online games to clouds," in 2014 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) (Toronto, ON: IEEE), 340–345. doi: 10.1109/INFCOMW.2014.6849255
Meiländer, D., and Gorlatch, S. (2018). Modeling the scalability of real-time online interactive applications on clouds. Fut. Gen. Comput. Syst. 86, 1019–1031. doi: 10.1016/j.future.2017.07.041
Meng, C., Song, S., Tong, H., Pan, M., and Yu, Y. (2023). "Deepscaler: holistic autoscaling for microservices based on spatiotemporal gnn with adaptive graph learning," in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) (Luxembourg: IEEE), 53–65. doi: 10.1109/ASE56229.2023.00038
Microsoft (2019a). AzurePublicDatasetV1. GitHub Repository. Redmond, WA: Microsoft (Accessed January 07, 2026).
Microsoft (2019b). AzurePublicDatasetV2. GitHub Repository. Redmond, WA: Microsoft (Accessed January 07, 2026).
Newman, S. (2021). Building Microservices: Designing Fine-Grained Systems. Sebastopol, CA: O'Reilly Media, Inc.
Notaro, P., Cardoso, J., and Gerndt, M. (2021). A survey of aiops methods for failure management. ACM Trans. Intell. Syst. Technol. 12, 1–45. doi: 10.1145/3483424
OpenTelemetry (2025). OpenTelemetry Specification 1.52.0. OpenTelemetry (Accessed January 07, 2026).
Park, J., Choi, B., Lee, C., and Han, D. (2021). "Graf: a graph neural network based proactive resource allocation framework for slo-oriented microservices," in Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies, CoNEXT'21 (Virtual: ACM), 154–167. doi: 10.1145/3485983.3494866
Park, J., Choi, B., Lee, C., and Han, D. (2024). Graph neural network-based slo-aware proactive resource autoscaling framework for microservices. IEEE/ACM Trans. Netw. 32, 3331–3346. doi: 10.1109/TNET.2024.3393427
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). "Catboost: unbiased boosting with categorical features," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18 (Red Hook, NY: Curran Associates Inc.), 6639–6649.
Qian, W., Zhao, H., Chen, T., Chen, J., Wang, Z., Chow, K., et al. (2025). Learning unified system representations for microservice tail latency prediction. arXiv preprint arXiv:2508.01635. doi: 10.48550/arXiv.2508.01635
Reiss, C., Tumanov, A., Ganger, G. R., Katz, R. H., and Kozuch, M. A. (2012). "Heterogeneity and dynamicity of clouds at scale: Google trace analysis," in Proceedings of the Third ACM Symposium on Cloud Computing, SOCC'12 (San Jose, CA: ACM), 1–13. doi: 10.1145/2391229.2391236
Schäfer, P., and Leser, U. (2020). Teaser: early and accurate time series classification. Data Min. Knowl. Discov. 34, 1336–1362. doi: 10.1007/s10618-020-00690-z
Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., et al. (2010). Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical report. Mountain View, CA: Google, Inc.
Tam, D. S. H., Liu, Y., Xu, H., Xie, S., and Lau, W. C. (2023). "Pert-gnn: latency prediction for microservice-based cloud-native applications via graph neural networks," in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD'23 (Long Beach, CA: ACM), 2155–2165. doi: 10.1145/3580305.3599465
Tam, D. S. H., Xu, H., Liu, Y., Xie, S., and Lau, W. C. (2025). Fastpert: towards fast microservice application latency prediction via structural inductive bias over pert networks. Proc. AAAI Conf. Artif. Intell. 39, 20787–20795. doi: 10.1609/aaai.v39i19.34291
Tirmazi, M., Barker, A., Deng, N., Haque, M. E., Qin, Z. G., Hand, S., et al. (2020). "Borg: the next generation," in Proceedings of the Fifteenth European Conference on Computer Systems, EuroSys'20 (Heraklion: ACM), 1–14. doi: 10.1145/3342195.3387517
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). "Attention is all you need," in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17 (Long Beach, CA: Curran Associates Inc.), 6000–6010.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903. doi: 10.48550/arXiv.1710.10903
Weaveworks (2023). Sock Shop: A Microservice Demo Application. Weaveworks (Archived December 29, 2023; Accessed January 07, 2026).
Wen, B., Su, Y., Hu, J., Gu, Q., and Sun, C. (2016). Architecture design of an intelligent dispatching supporting platform based on integration of core business. South. Power Syst. Technol. 10, 15–19 (in Chinese). doi: 10.13648/j.cnki.issn1674-0629.2016.06.003
Wu, L., Tordsson, J., Elmroth, E., and Kao, O. (2020). "Microrca: root cause localization of performance issues in microservices," in NOMS 2020 - 2020 IEEE/IFIP Network Operations and Management Symposium (Budapest: IEEE), 1–9. doi: 10.1109/NOMS47738.2020.9110353
Xing, Z., Pei, J., and Yu, P. S. (2012). Early classification on time series. Knowl. Inform. Syst. 31, 105–127. doi: 10.1007/s10115-011-0400-x
Yu, G., Chen, P., Chen, H., Guan, Z., Huang, Z., Jing, L., et al. (2021). "Microrank: end-to-end latency issue localization with extended spectrum analysis in microservice environments," in Proceedings of the Web Conference 2021, WWW'21 (Ljubljana: ACM), 3087–3098. doi: 10.1145/3442381.3449905
Zhang, Y., Hua, W., Zhou, Z., Suh, G. E., and Delimitrou, C. (2021). "Sinan: Ml-based and qos-aware resource management for cloud microservices," in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'21 (ACM), 167–181. doi: 10.1145/3445814.3446693
Zhang, Z., Ramanathan, M. K., Raj, P., Parwal, A., Sherwood, T., and Chabbi, M. (2022). "CRISP: critical path analysis of large-scale microservice architectures," in 2022 USENIX Annual Technical Conference (USENIX ATC 22) (Carlsbad, CA: USENIX Association), 655–672.
Zhao, C., Ma, M., Zhong, Z., Zhang, S., Tan, Z., Xiong, X., et al. (2023). "Robust multimodal failure detection for microservice systems," in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD'23 (Long Beach, CA: ACM), 5639–5649. doi: 10.1145/3580305.3599902
Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., et al. (2021a). Informer: beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell. 35, 11106–11115. doi: 10.1609/aaai.v35i12.17325
Zhou, X., Peng, X., Xie, T., Sun, J., Ji, C., Li, W., et al. (2021b). Fault analysis and debugging of microservice systems: industrial survey, benchmark system, and empirical study. IEEE Trans. Softw. Eng. 47, 243–260. doi: 10.1109/TSE.2018.2887384
Summary
Keywords
causal inference, distributed tracing, graph neural networks, microservices, multimodal learning, prefix-based early warning, SLO violation prediction, tail latency prediction
Citation
Yu M, Liu H, Du J, Lin K, Dai T, Fu Y and Yang C (2026) From distributed tracing to proactive SLO management: a mini-review of trace-driven performance prediction for cloud-native microservices. Front. Comput. Sci. 8:1783945. doi: 10.3389/fcomp.2026.1783945
Received
09 January 2026
Revised
03 February 2026
Accepted
03 February 2026
Published
18 February 2026
Volume
8 - 2026
Edited by
Xiaojie Wang, Chongqing University of Posts and Telecommunications, China
Reviewed by
Sergei Gorlatch, University of Münster, Germany
Hongchen Cao, ShanghaiTech University, China
Copyright
© 2026 Yu, Liu, Du, Lin, Dai, Fu and Yang.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Chunyan Yang, yangcy2@csg.cn
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.