Abstract
The development of online social networks is accompanied by intricate abnormal interaction phenomena that severely impair the ecosystem’s credibility. Current anomaly detection approaches struggle to balance accuracy and robustness when tackling dynamic structural changes, heterogeneous relationships, and a lack of labeled data. To address these challenges, this paper proposes ST-MVAN, a Spatio-Temporal Multi-View Attention Network for unsupervised anomaly detection. The proposed framework integrates three core components: (1) in the spatial dimension, we construct heterogeneous relational subgraphs and design an improved Graph Convolutional Network (GCN) that incorporates edge attributes as an additive bias and leverages sparse attention to filter structural noise; (2) for feature fusion, an Efficient Channel Attention (ECA) mechanism is introduced to adaptively assign importance weights to multi-view features; and (3) in the temporal dimension, a bidirectional GRU captures dynamic evolutionary dependencies. Finally, a joint encoder-decoder framework calculates anomaly scores based on reconstruction errors. Furthermore, we perform experiments on the Digg and Yelp datasets to validate that our method achieves an AUC improvement of up to 12.26% compared to baseline methods. These results demonstrate that ST-MVAN can effectively mitigate structural noise and enhance the security of dynamic social network environments.
1 Introduction
Online social platforms’ deep penetration into daily life has driven social networks to transform information dissemination modes, interpersonal interaction approaches, and social collaboration mechanisms. Likes, reposts, comments, and other interactive acts have emerged as key pillars of the online ecosystem. However, the openness and anonymity of online platforms, while offering convenience to users, have also spawned numerous abnormal activities, including botnet attacks, malicious fake reviews, and false public opinion manipulation (Figure 1). These behaviors not only disrupt the normal network order, but may also trigger severe trust crises and social risks. Therefore, accurately detecting anomalous behaviors in massive, complex, and dynamic social data has become a pressing issue for both academia and industry.
FIGURE 1
Existing methods for anomaly behavior detection in online social networks can be roughly categorized into two types: traditional machine learning methods and deep learning methods. Traditional machine learning methods, such as logistic regression, support vector machine (SVM) and random forests, mainly rely on handcrafted statistical features for classification tasks [1, 2]. However, such methods struggle to effectively capture the complex graph structural correlations and temporal evolution patterns of behaviors in social networks. In addition, they suffer from time-consuming and labor-intensive feature engineering, along with limited generalization ability when dealing with heterogeneous interaction data [3].
In recent years, deep learning has brought new tools for anomaly detection. In many real-world applications, researchers have built stronger recognition systems and robust frameworks to cope with complex inputs and noise [4, 5]. These studies suggest that layered representations and removing negative information can significantly improve robustness [6, 7]. This lesson is particularly relevant to dynamic social networks, where multiple relations exist and noisy links may hide abnormal behaviors. Graph Neural Networks (GNNs) and their variants, such as GCN and GAT, have shown strong results for social network anomaly detection by modeling non-Euclidean data [8–10]. Nevertheless, many existing methods remain limited in handling structural camouflage and in effectively fusing multi-view spatio-temporal features in complex environments.
To address the aforementioned challenges, this paper proposes a deep graph neural network model named ST-MVAN, which is based on spatio-temporal multi-view attention. The model is designed to achieve accurate identification of anomalous behaviors through refined structure-aware learning and spatio-temporal feature fusion. Specifically, the core designs of the model are as follows. First, through heterogeneous view construction, the complex heterogeneous graph is decomposed into multiple homogeneous interaction subgraphs according to interaction type. Second, in the spatial feature extraction phase, a sparse graph attention mechanism integrated with edge attributes is proposed; it incorporates edge features into multi-head attention computations as an additive bias, with Sparsemax embedded for weight normalization. Meanwhile, an ECA-based adaptive multi-view fusion module utilizes the Efficient Channel Attention network to dynamically allocate importance weights to the various interaction views. Finally, a Bi-GRU is employed to capture the temporal evolution characteristics of user behaviors, with anomaly detection achieved through reconstruction errors.
The main contributions of this paper are summarized as follows:
This paper presents ST-MVAN, a comprehensive Spatio-Temporal Multi-View Anomaly Detection framework incorporating heterogeneous view decoupling, relation-aware sparse aggregation, adaptive multi-view fusion, and temporal evolution modeling to handle the intricacy of social network data; it efficiently overcomes the shortcomings of current approaches in managing structural noise and extracting spatio-temporal features.
We develop a relation-aware multi-head sparse attention module to address the drawback of graph convolutional networks’ weighted-average neighbor aggregation. The module integrates edge features into multi-head attention calculations and, combined with the Sparsemax strategy, achieves adaptive truncation and sparsification of neighbor weights, cutting down computational costs and boosting the model’s ability to filter camouflaged noisy neighbors.
We propose an ECA-based multi-view fusion and temporal modeling mechanism to tackle the weight differences among interaction relations and the dynamic changes of user behavior in heterogeneous graphs, utilizing ECA-Net to dynamically acquire view-specific importance weights and a Bi-GRU to capture temporal dependencies.
The rest of this paper is structured as follows: Section 2 reviews relevant studies on abnormal behavior detection; Section 3 details the complete framework of the ST-MVAN model along with the design concepts behind each core component; Section 4 presents experimental findings and their analysis for the proposed model across multiple datasets, where baseline approaches are chosen for comparative assessment; Section 5 summarizes the research conducted in this paper.
2 Related work
Anomaly detection in social networks has evolved from early manual feature engineering to advanced deep learning paradigms capable of automatic feature extraction. To comprehensively position our work, we review the literature along two complementary axes: temporal evolution modeling, which captures dynamic behavioral changes; and heterogeneous multi-relational modeling, which addresses complex structural interactions and multi-view fusion.
2.1 Dynamic graph anomaly detection
Dynamic graph methodologies aim to identify anomalies by characterizing the temporal evolution of network topology and user behaviors. These approaches generally fall into three sub-categories: snapshot-based, stream-based, and Transformer-based frameworks.
Snapshot-based approaches conceptualize the dynamic graph as a sequence of discrete static snapshots. Early studies leveraged temporal random walks or cross-snapshot matrix factorization to capture structural variations over discrete intervals [11, 12]. While effective for coarse-grained analysis, these methods often struggle to capture fine-grained temporal dependencies. To address real-time interaction updates, stream-based methods have been developed to update node embeddings incrementally. For instance, NetWalk [13] utilizes clique embedding and streaming clustering to detect anomalies in real-time. Similarly, AddGraph [14] employs an extended temporal window to model short-term dependencies, providing an end-to-end architecture for dynamic edge classification.
More recently, Transformer-based architectures have emerged as a dominant force for modeling long-range temporal dependencies. By leveraging self-attention mechanisms, these models can capture complex evolutionary patterns that RNN-based methods might miss. Liu et al. [15] proposed a Transformer-based framework that captures the velocity of connection variations, significantly improving the detection of abrupt structural anomalies. However, while these dynamic methods excel at temporal modeling, they often treat all interactions uniformly, potentially neglecting the heterogeneity inherent in social relationships.
2.2 Heterogeneous and multi-relational modeling
Real-world social networks are inherently heterogeneous, encompassing diverse node types and interaction patterns. Research in this domain focuses on distinguishing these patterns through meta-paths, hypergraphs, and multi-view graph neural networks.
To preserve the semantic information of different relations, meta-path-based methods define specific semantic sequences to aggregate neighbors [16, 17], while recent works explore automatic relation weighting to reduce manual reliance [18]. To capture higher-order correlations beyond pairwise connections, hypergraph neural networks have been introduced to model complex group interactions [19]. Multi-view GNNs and structure-content fusion represent the current state-of-the-art, aiming to project node attributes and topology into a shared latent space [20, 21], often leveraging contrastive learning to enhance discriminative power [22]. Furthermore, to combat “structural camouflage” where anomalies mimic normal connections, researchers have integrated adversarial learning and neighbor filtering strategies to mitigate graph inconsistency [23–25].
However, ensuring robustness against noise and perturbations remains a challenge in heterogeneous graph modeling. Recent advances in robust multimodal learning offer valuable methodological insights for fusing diverse data signals. Specifically, strategies such as adversarial alignment [26–28] and negative information removal [7, 29] have demonstrated strong capability in mitigating noise. Furthermore, context-aware attention mechanisms [30] offer a principled way to refine feature weights. Recognizing that relation-specific views play a role analogous to modalities, and motivated by these robust mechanisms [31], ST-MVAN adapts such adaptive fusion concepts to resist view-specific noise and minimize aggregation bias.
While the aforementioned methods have advanced the field, ST-MVAN distinguishes itself in three key aspects. First, unlike the attention-window approach of AddGraph [14], ST-MVAN explicitly decouples heterogeneous interactions into distinct views, mitigating semantic interference across relation types. Second, compared with THGNN [21], our framework incorporates relation-aware sparse attention with edge-attribute bias, where Sparsemax enables adaptive neighbor pruning and improved resilience to structural camouflage noise. Finally, addressing the scalability limitations of Transformer-based dynamic graph models [15], ST-MVAN leverages a streamlined Bi-GRU with connectivity-restricted sparse aggregation to efficiently capture long-range evolutionary dependencies on large-scale sparse networks. Collectively, these spatial and temporal refinements enhance robustness against structural perturbations and dynamic network evolution.
3 Construction of the ST-MVAN model
3.1 Overall framework of the model
This paper presents a deep graph neural network architecture built upon Spatio-Temporal Multi-View Attention, designated ST-MVAN, to address the complexity and heterogeneity of interaction patterns in dynamic graph structures. The architecture realizes accurate recognition of anomalous interaction patterns within an unsupervised learning framework by capturing spatial correlation features and temporal evolutionary regularities of nodes under multi-dimensional perspectives.
ST-MVAN adopts an encoder-decoder framework (as shown in Figure 2). The encoder harmonizes heterogeneous feature representations through feature projection, integrates neighboring node data across distinct relation views via a sparse attention module that incorporates edge attributes, adaptively fuses multi-view feature sets using an ECA mechanism, and leverages a bidirectional GRU to capture temporal dynamics, generating the final spatio-temporal node embeddings. The decoder uses these embeddings to jointly recover the network’s topological structure and interaction feature attributes, measuring the anomaly level based on reconstruction residuals.
FIGURE 2
3.2 Multi-view sparse attention encoder
To decouple spatial dependencies of different patterns from complex network interactions, the ST-MVAN model first formally defines dynamic graph snapshots and performs multi-view decomposition, followed by the alignment of heterogeneous features.
3.2.1 Dynamic graph definition and multi-view subgraph construction
We define a dynamic graph as a sequence of temporal snapshots $\mathcal{G} = \{G^{(1)}, G^{(2)}, \dots, G^{(T)}\}$. Here, the network snapshot at any arbitrary time $t$ is denoted as $G^{(t)} = (V, E^{(t)}, X^{(t)}, F^{(t)})$, where $V$ represents the set containing nodes, $E^{(t)}$ is the set of interaction edges at time $t$, $X^{(t)}$ represents the node feature matrix, and $F^{(t)}$ contains the attribute features of all edges.
Considering that multiple types of interaction relations often exist between nodes in real-world networks, we define the set of all possible interaction relation types as $\mathcal{R} = \{r_1, r_2, \dots, r_K\}$. To capture the topological structure under specific relations, the model decomposes the global snapshot $G^{(t)}$ into $K$ independent relation view subgraphs, as shown in Equation 1:
$$G^{(t)} = \left\{ G_1^{(t)}, G_2^{(t)}, \dots, G_K^{(t)} \right\} \tag{1}$$
For the $k$-th view subgraph $G_k^{(t)} = (V, E_k^{(t)})$, its node set is shared with the original graph, but the edge set only contains interaction edges belonging to relation type $r_k$. Formally, it is defined as Equation 2:
$$E_k^{(t)} = \left\{\, e \in E^{(t)} \mid \phi(e) = r_k \,\right\} \tag{2}$$
where $\phi(\cdot)$ is the relation type mapping function. Specifically, these views are defined by different interaction types. This decomposition strategy enables the model to learn spatial embeddings of nodes specifically for each distinct interaction pattern, avoiding mutual interference between different relational features.
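The view decomposition of Equations 1-2 amounts to partitioning the snapshot's edge set by relation type while sharing the node set. A minimal sketch (function and relation names are illustrative, not from the paper's code):

```python
from collections import defaultdict

def decompose_views(edges):
    """Split one snapshot's edges into per-relation subgraph edge sets.

    edges: list of (src, dst, relation) tuples; the relation plays the
    role of the mapping function phi(e) in Eq. 2.
    """
    views = defaultdict(list)
    for src, dst, rel in edges:
        views[rel].append((src, dst))  # edge e goes to the view with phi(e) = rel
    return dict(views)

# Toy snapshot with two relation types (e.g., Digg's Following / Co-voting views)
snapshot = [(0, 1, "follow"), (0, 2, "co-vote"), (1, 2, "follow")]
views = decompose_views(snapshot)
```

Each value in `views` is the edge set of one homogeneous subgraph; the node set stays global, so nodes isolated in a view simply have no incident edges there.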
3.2.2 Feature extraction and alignment initialization
Social network data exhibits heterogeneous, multi-modal properties, including unstructured text-based semantic features, structured statistical user attributes, and attribute characteristics of interaction edges. To integrate such heterogeneous information into a unified deep learning framework, we map both node and edge information into a latent space of identical dimensionality.
Original node features comprise two components, semantic and statistical features, to fully capture comprehensive user characteristic profiles. For the semantic feature component, users’ historical behavior logs hold abundant text-based content reflecting personal interest tendencies and latent behavioral intentions. We extract and concatenate core textual content linked to user behavioral activities, specifically article titles and review content. We adopt the pre-trained RoBERTa-base model to encode these text sequences. Input sequences are truncated to a maximum length of 128 tokens. To ensure training efficiency, the parameters of RoBERTa are frozen to serve as a static feature extractor. Finally, we implement mean pooling on the output word vectors of the last hidden layer to yield a 768-dimensional feature vector containing complete deep semantic information.
Regarding statistical features, this work chooses structural metrics representing user impact and activity degrees: social connectivity gauges node connection scope, while cumulative interaction count gauges general activity degree. We assemble these quantitative data into a vector and then conduct log normalization to derive the statistical feature vector $x_i^{\mathrm{stat}}$.
To fuse these two types of information, the model concatenates them and maps them to the initial hidden state of the node through a learnable linear projection layer, as shown in Equation 3:
$$h_i^{(0)} = \sigma\!\left( W_0 \left[ x_i^{\mathrm{sem}} \,\|\, x_i^{\mathrm{stat}} \right] + b_0 \right) \tag{3}$$
where $W_0$ and $b_0$ are the weight matrix and bias term respectively, $\|$ denotes the concatenation operation, and $\sigma$ employs the ReLU activation function. At this point, $h_i^{(0)}$ serves as the node feature input to the graph neural network.
Edge features are used to characterize the attributes of interactions. For an edge $(i, j)$, interaction frequency and interaction timestamp are selected as raw features. The normalized frequency value $\hat{f}_{ij}$ is concatenated with the time feature encoded by a sinusoidal function $\tau(\cdot)$, and then mapped to an edge embedding of the same dimension as the node feature via a Multi-Layer Perceptron (MLP), as shown in Equation 4:
$$e_{ij} = \mathrm{MLP}\!\left( \left[ \hat{f}_{ij} \,\|\, \tau(t_{ij}) \right] \right) \tag{4}$$
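The projections of Equations 3-4 can be sketched in PyTorch as below. The dimensions (768-d frozen-RoBERTa semantics, 2-d statistics, 64-d hidden space) follow the text; the module names and the 2-d raw edge input are assumptions for illustration:

```python
import torch
import torch.nn as nn

SEM_DIM, STAT_DIM, EDGE_RAW, HID = 768, 2, 2, 64  # dims per the paper / assumed

# Eq. 3: learnable linear projection of concatenated node features + ReLU
node_proj = nn.Sequential(nn.Linear(SEM_DIM + STAT_DIM, HID), nn.ReLU())
# Eq. 4: MLP mapping raw edge features to the node-embedding dimension
edge_mlp = nn.Sequential(nn.Linear(EDGE_RAW, HID), nn.ReLU(), nn.Linear(HID, HID))

x_sem = torch.randn(5, SEM_DIM)    # stand-in for mean-pooled RoBERTa vectors
x_stat = torch.randn(5, STAT_DIM)  # log-normalized degree / activity counts
h0 = node_proj(torch.cat([x_sem, x_stat], dim=-1))  # initial node states

e_raw = torch.randn(7, EDGE_RAW)   # normalized frequency + sinusoidal time code
e_emb = edge_mlp(e_raw)            # edge embeddings, same width as nodes
```

Freezing RoBERTa means `x_sem` is computed once offline; only the small projection layers above are trained end to end.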
3.2.3 Sparse graph attention aggregation with edge attribute bias
To efficiently utilize network topology and edge attribute data during node representation updating, this paper proposes a sparse graph attention mechanism integrated with an edge attribute bias. Unlike conventional GCNs, which conduct average aggregation over neighboring nodes, this mechanism employs multi-head attention to dynamically allocate weights based on node feature similarity and edge attributes.
Let the input node feature of the $l$-th layer be $h_i^{(l)}$. Under a specific relation view, for the $m$-th attention head, the model first maps the source node $i$ and its neighbor $j$ to independent feature subspaces to calculate the query vector and key vector, as shown in Equation 5:
$$q_i^{(m)} = W_Q^{(m)} h_i^{(l)}, \qquad k_j^{(m)} = W_K^{(m)} h_j^{(l)} \tag{5}$$
where $W_Q^{(m)}$ and $W_K^{(m)}$ are learnable linear projection matrices. The model incorporates the obtained edge embedding $e_{ij}$ into the relevance metric. This paper adopts an additive bias strategy, utilizing a nonlinear transformation function $\psi(\cdot)$ to map the high-dimensional edge embedding to a scalar bias, which is applied to the calculation of the attention score, as shown in Equation 6:
$$s_{ij}^{(m)} = \frac{\left\langle q_i^{(m)}, k_j^{(m)} \right\rangle}{\sqrt{d_m}} + \psi(e_{ij}) \tag{6}$$
In the equation, the first term characterizes the association intensity of nodes in the feature space through a scaled dot product; the second term introduces edge attributes as a bias to correct this association intensity.
To filter out massive redundant connections and invalid information in the network structure, this paper adopts the Sparsemax activation function instead of the traditional Softmax. Sparsemax can truncate the weights of low-relevance neighbors to zero, thereby generating sparse and robust normalized weights, as shown in Equation 7:
$$\alpha_{ij}^{(m)} = \operatorname{Sparsemax}_{j \in \mathcal{N}(i)}\!\left( s_{ij}^{(m)} \right) \tag{7}$$
Finally, weighted aggregation is performed on the neighbor value vectors $v_j^{(m)} = W_V^{(m)} h_j^{(l)}$ based on these sparse weights. To promote gradient propagation and training stability in deep networks, the model introduces multi-head concatenation, residual connections, and layer normalization to obtain the updated node representation under the current view, as shown in Equation 8:
$$h_i^{(l+1)} = \mathrm{LayerNorm}\!\left( h_i^{(l)} + W_O \left[ \big\Vert_{m=1}^{M} \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(m)} v_j^{(m)} \right] \right) \tag{8}$$
where $W_O$ is used to fuse the feature information output by multi-head attention.
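The scoring and normalization steps of Equations 5-7 can be sketched for a single head and a single node as follows. The Sparsemax routine implements the standard simplex projection (Martins and Astudillo, 2016); all tensor values here are random stand-ins:

```python
import torch

def sparsemax(z):
    """Sparsemax over the last dim: like softmax, but low scores are
    truncated to exactly zero via Euclidean projection onto the simplex."""
    zs, _ = torch.sort(z, descending=True, dim=-1)
    rng = torch.arange(1, z.size(-1) + 1, dtype=z.dtype)
    csum = zs.cumsum(dim=-1)
    support = (1 + rng * zs) > csum        # k(z): size of the support set
    k = support.sum(dim=-1, keepdim=True)
    tau = (csum.gather(-1, k - 1) - 1) / k.to(z.dtype)  # threshold
    return torch.clamp(z - tau, min=0.0)

d = 8
q_i = torch.randn(d)        # query of node i (one head), per Eq. 5
k_nbr = torch.randn(4, d)   # keys of i's 4 neighbors
bias = torch.randn(4)       # psi(e_ij): scalar edge-attribute bias
scores = k_nbr @ q_i / d ** 0.5 + bias   # Eq. 6: scaled dot product + bias
alpha = sparsemax(scores)                # Eq. 7: sparse neighbor weights
```

Unlike softmax, `alpha` can contain exact zeros, so camouflaged low-relevance neighbors contribute nothing to the aggregation in Equation 8.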
3.2.4 Multi-view adaptive fusion based on ECA
Since different types of interaction views contribute differently to defining node behavior patterns, simple averaging or concatenation cannot distinguish the importance of views. Therefore, this paper introduces Efficient Channel Attention (ECA) to achieve efficient fusion of multi-view features.
First, we concatenate the node representations under the $K$ views to form a multi-channel feature stack $Z_i^{(t)} \in \mathbb{R}^{K \times d}$. The module treats each view as an independent channel. It first aggregates the global information of each channel through Global Average Pooling (GAP), then captures local interaction information across channels using 1D convolution, and finally generates normalized weights for each view through a Sigmoid function, as shown in Equation 9:
$$w_i^{(t)} = \sigma\!\left( \mathrm{Conv1D}\!\left( \mathrm{GAP}\!\left( Z_i^{(t)} \right) \right) \right) \tag{9}$$
After obtaining the weights, we perform a weighted summation of the representations from different views. To further fuse features and compress redundant information, the weighted features are reduced in dimension through a fully connected layer to obtain the final spatial feature representation at time $t$, as shown in Equation 10:
$$s_i^{(t)} = W_f \sum_{k=1}^{K} w_{i,k}^{(t)} \, z_{i,k}^{(t)} \tag{10}$$
where $z_{i,k}^{(t)}$ is node $i$’s representation under view $k$ and $W_f$ is the fully connected projection.
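A compact sketch of the ECA-style fusion in Equations 9-10. The kernel size of the 1D convolution and all dimensions are assumptions for illustration, not the paper's tuned values:

```python
import torch
import torch.nn as nn

K, d, N = 3, 64, 10          # views, hidden dim, number of nodes (assumed)
Z = torch.randn(N, K, d)     # stacked per-view node embeddings Z_i

gap = Z.mean(dim=-1)         # GAP over the feature dim -> (N, K), one scalar per view
conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)  # cross-channel mixing
w = torch.sigmoid(conv(gap.unsqueeze(1))).squeeze(1)          # (N, K) view weights

fuse = nn.Linear(d, d)       # W_f: fully connected dimension reduction / fusion
s = fuse((w.unsqueeze(-1) * Z).sum(dim=1))                    # Eq. 10 -> (N, d)
```

The point of ECA here is that view weighting costs only a tiny 1D convolution per node, rather than a full attention layer over views.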
3.3 Bidirectional temporal evolution modeling
After processing by the spatial multi-view attention module, the aggregated spatial feature representation $s_i^{(t)}$ of each node at each moment $t$ is obtained. To capture the dynamic evolution patterns of user behavior in the temporal dimension, the model inputs the sequence of spatial features within a continuous time window into a bidirectional GRU for temporal modeling.
Compared with LSTM, GRU retains the ability to capture long-range dependencies with fewer parameters and higher computational efficiency, making it more suitable for processing large-scale dynamic graph data. To comprehensively assess a node’s state at time $t$, ST-MVAN uses a bidirectional GRU to capture both historical context and future evolution trends simultaneously. Formally, we abstract the GRU’s internal gate update mechanism as a nonlinear transformation function. The model executes forward and backward propagation in parallel. Forward evolution transmits past information to the future, capturing the cumulative impact of historical behavior sequences on the current state, as shown in Equation 11:
$$\overrightarrow{h}_i^{(t)} = \overrightarrow{\mathrm{GRU}}\!\left( s_i^{(t)}, \, \overrightarrow{h}_i^{(t-1)} \right) \tag{11}$$
Backward evolution passes information from the future to the past, utilizing observed data from subsequent moments to assist in judging the potential intent of the current behavior, as shown in Equation 12:
$$\overleftarrow{h}_i^{(t)} = \overleftarrow{\mathrm{GRU}}\!\left( s_i^{(t)}, \, \overleftarrow{h}_i^{(t+1)} \right) \tag{12}$$
where $\overrightarrow{h}_i^{(t)}$ and $\overleftarrow{h}_i^{(t)}$ represent the forward and backward hidden states at time $t$, respectively. Finally, the hidden states from these two directions are concatenated to obtain the spatio-temporal embedding representation of the node at time $t$, as shown in Equation 13:
$$z_i^{(t)} = \left[ \overrightarrow{h}_i^{(t)} \,\|\, \overleftarrow{h}_i^{(t)} \right] \tag{13}$$
This representation fuses the multi-view spatial structural features with the temporal evolution patterns.
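The bidirectional modeling of Equations 11-13 maps directly onto PyTorch's built-in bidirectional GRU, whose output already concatenates the forward and backward hidden states per step. A minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

N, T, d = 10, 6, 64                     # nodes, window length, hidden dim (assumed)
spatial_seq = torch.randn(N, T, d)      # s_i^{(t)} sequence from the spatial module

# Bidirectional GRU: forward pass implements Eq. 11, backward pass Eq. 12,
# and the output concatenates both directions per time step (Eq. 13).
bigru = nn.GRU(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)
z, _ = bigru(spatial_seq)               # spatio-temporal embeddings z_i^{(t)}
```

Each node is treated as one sequence in the batch, so the whole window is processed in a single call; the final dimension is `2 * d` due to the direction concatenation.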
3.4 Anomaly detection based on dual reconstruction
ST-MVAN follows a self-supervised learning paradigm. The core assumption is that anomalous behaviors deviate from the latent spatio-temporal evolution patterns of the network, thereby leading to higher reconstruction errors. The model contains dual decoders for structure and attributes, quantifying the degree of anomaly by jointly optimizing reconstruction tasks to capture normal behavior patterns.
3.4.1 Dual decoding and joint optimization
To simultaneously capture topological structure associations and content patterns of interaction behaviors, the decoder reconstructs the link existence probability $\hat{A}_{ij}^{(t)}$ and the edge attribute $\hat{e}_{ij}^{(t)}$, respectively, as shown in Equation 14:
$$\hat{A}_{ij}^{(t)} = \sigma\!\left( \mathrm{MLP}_{s}\!\left( \left[ z_i^{(t)} \,\|\, z_j^{(t)} \right] \right) \right), \qquad \hat{e}_{ij}^{(t)} = \mathrm{MLP}_{a}\!\left( \left[ z_i^{(t)} \,\|\, z_j^{(t)} \right] \right) \tag{14}$$
where $\sigma$ is the Sigmoid function and $\|$ denotes the concatenation operation.
Model training aims to minimize the joint reconstruction loss. To prevent the model from trivially predicting edges for all node pairs, a negative sampling strategy is introduced in the structure reconstruction task. For each moment $t$, in addition to the set of existing positive sample edges $E^{(t)}$, a set of negative sample edges $E^{-}$ of comparable size is constructed via random sampling. The structure reconstruction loss adopts the binary cross-entropy loss function, aiming to maximize the likelihood of positive samples while minimizing that of negative samples, as shown in Equation 15:
$$\mathcal{L}_{str} = -\sum_{(i,j) \in E^{(t)}} \log \hat{A}_{ij}^{(t)} \; - \sum_{(i,j) \in E^{-}} \log\!\left( 1 - \hat{A}_{ij}^{(t)} \right) \tag{15}$$
For attribute reconstruction, the prediction error is calculated only within the set of real edges $E^{(t)}$, adopting mean squared error to quantify the difference between the reconstructed attribute $\hat{e}_{ij}^{(t)}$ and the real attribute $e_{ij}^{(t)}$, as shown in Equation 16:
$$\mathcal{L}_{attr} = \frac{1}{|E^{(t)}|} \sum_{(i,j) \in E^{(t)}} \left\| \hat{e}_{ij}^{(t)} - e_{ij}^{(t)} \right\|_2^2 \tag{16}$$
The final total optimization objective is defined as the weighted sum of the above two parts, with an $L_2$ regularization term introduced to prevent overfitting, as shown in Equation 17:
$$\mathcal{L} = \mathcal{L}_{str} + \lambda \mathcal{L}_{attr} + \gamma \|\Theta\|_2^2 \tag{17}$$
where $\lambda$ is the hyperparameter balancing structure and attribute losses, $\Theta$ represents all learnable parameters of the model, and $\gamma$ is the regularization weight coefficient.
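The joint objective of Equations 15-17 can be sketched with fabricated mini-batch tensors; the decoder scores, batch sizes, and the `lam`/`gamma` values are illustrative placeholders, not the paper's tuned settings:

```python
import torch
import torch.nn.functional as F

# Structure loss (Eq. 15): BCE over positive edges and sampled negatives.
pos_logits = torch.randn(32)   # decoder scores for observed edges
neg_logits = torch.randn(32)   # scores for negatively sampled non-edges
loss_str = F.binary_cross_entropy_with_logits(pos_logits, torch.ones(32)) \
         + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros(32))

# Attribute loss (Eq. 16): MSE over the real edges only.
e_hat, e_true = torch.randn(32, 64), torch.randn(32, 64)
loss_attr = F.mse_loss(e_hat, e_true)

# Joint objective (Eq. 17) with L2 regularization over model parameters.
lam, gamma = 0.5, 1e-4         # illustrative hyperparameter values
params = [torch.randn(8, 8, requires_grad=True)]
l2 = sum(p.pow(2).sum() for p in params)
loss = loss_str + lam * loss_attr + gamma * l2
```

Note that using the logit-level BCE keeps the negative-sample term numerically stable even when the decoder is very confident.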
3.4.2 Anomaly score determination
In the testing phase, the model calculates the anomaly score for each interaction edge based on the reconstruction difficulty. We perform a weighted fusion of the structural anomaly and the attribute anomaly, defining the final anomaly score as shown in Equation 18:
$$\mathrm{score}(i,j) = \beta \, s_{ij}^{str} + (1 - \beta) \, s_{ij}^{attr} \tag{18}$$
where $s_{ij}^{str}$ and $s_{ij}^{attr}$ denote the structure and attribute reconstruction errors of edge $(i, j)$, and $\beta$ is used to adjust the sensitivity of both components. When $\mathrm{score}(i,j)$ exceeds a preset threshold, the interaction is determined to be an anomalous behavior.
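The scoring rule of Equation 18 is a simple convex combination; a sketch with hypothetical `beta` and threshold values (the paper does not report them at this point):

```python
def anomaly_score(err_str, err_attr, beta=0.6):
    """Eq. 18: weighted fusion of structure and attribute reconstruction
    errors. beta=0.6 is a hypothetical sensitivity setting."""
    return beta * err_str + (1.0 - beta) * err_attr

# An edge whose link probability was poorly reconstructed but whose
# attributes were reconstructed well:
score = anomaly_score(err_str=0.9, err_attr=0.2)
flagged = score > 0.5        # hypothetical preset threshold
```

Because both terms are reconstruction errors on comparable scales, `beta` directly trades off sensitivity to structural camouflage versus attribute perturbation.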
3.5 Complexity analysis
To evaluate the scalability of the proposed ST-MVAN framework, we analyze the time complexity of its core components. Let $|V|$ denote the number of nodes, $|E|$ the number of edges, $K$ the number of relational views, $L$ the number of GNN layers, $M$ the number of attention heads, $d$ the hidden feature dimension, and $T$ the length of the temporal window.
For a single graph snapshot $G^{(t)}$, the computational cost primarily stems from the spatial encoder and the reconstruction decoder. In the encoder, unlike standard Transformers with $O(|V|^2 d)$ complexity, our sparse attention mechanism restricts computation to connected neighbors. Considering $M$ attention heads, the complexity for calculating attention coefficients and aggregating neighbors is $O(M|E|d)$. Including the linear transformations and the negligible cost of ECA-based fusion, the spatial encoding complexity per snapshot is $O\!\left( L \left( K|V|d^2 + M|E|d \right) \right)$. Furthermore, the reconstruction decoder, which is critical for anomaly detection, employs a negative sampling strategy. Instead of reconstructing the full adjacency matrix with $O(|V|^2)$ cost, we only compute scores for the positive edges $E^{(t)}$ and a sampled set of negative edges $E^{-}$, resulting in a decoding complexity of $O\!\left( (|E| + |E^-|) d \right)$ per snapshot.
In the temporal dimension, the Bi-GRU captures evolutionary patterns over a window of length $T$. For each node, the complexity of updating hidden states at each time step is $O(d^2)$, resulting in a temporal complexity of $O(|V|d^2)$ per snapshot. Summing the spatial encoder, decoder, and temporal components, the overall time complexity of ST-MVAN is expressed as Equation 19:
$$O\!\left( T \left( L \left( K|V|d^2 + M|E|d \right) + (|E| + |E^-|) d + |V|d^2 \right) \right) \tag{19}$$
Since real-world social networks are typically sparse (i.e., $|E| \ll |V|^2$) and $|E^-|$ is comparable to $|E|$, the overall complexity remains linear with respect to the number of edges $|E|$ and nodes $|V|$. This confirms that ST-MVAN avoids the quadratic bottleneck of dense methods and maintains high efficiency and scalability for large-scale dynamic social networks.
4 Experiments and analysis
4.1 Datasets
To assess the proposed model’s performance on dynamic multi-view graph anomaly detection tasks, we adopted the Digg and Yelp datasets. Digg is a representative news aggregation and social sharing platform that allows users to build social connections by following others and to express interest in news via voting (the Digg function). This dataset includes 279,631 users with abundant topological structure and time-stamped interaction data, well suited for building heterogeneous multi-view networks. Given the original dataset’s large size and numerous sparse nodes, we kept only active users in the candidate pool to guarantee experimental efficiency and model stability while conducting user alignment.
To capture the heterogeneity of user interactions, we constructed relation-specific subgraphs as distinct interaction views for each dataset. For Digg, we constructed a Following view to model explicit social links, and a Co-voting view to capture implicit interest similarity, where edges connect users who voted on the same post. Similarly, for Yelp, we constructed a Friendship view representing user social connections, and a Co-reviewing view derived from shared reviewed businesses, where edges connect users who reviewed the same business.
4.2 Baselines and experimental settings
4.2.1 Baseline methods
We compare ST-MVAN with the following four baseline methods:
DeepWalk [12]: A typical graph embedding approach that regards random walk sequences as text segments and nodes as vocabulary entries; it adopts the Skip-Gram model to acquire latent node representations by maximizing the co-occurrence probability of nodes.
GraphSAGE [32]: An inductive architecture for graph representation learning; it produces node embeddings through sampling and aggregating features from a node’s local neighboring area; this approach fully leverages node attributes and processes each graph snapshot separately.
NetWalk [13]: A dynamic network anomaly detection approach enabling incremental learning of network representations; it uses clique embedding technologies to update node representations in real time and adopts a streaming clustering method to identify anomalies based on reconstruction deviations; this approach effectively copes with dynamic structural variations.
AddGraph [14]: A robust end-to-end architecture for dynamic graph anomaly detection; it models structural patterns and short-term temporal dependencies simultaneously; this approach adopts an attention-based time window to capture evolution trends.
4.2.2 Parameter settings
The experiments were built with the PyTorch framework (Python 3.8). The AdamW optimizer was chosen to boost generalization performance paired with a cosine annealing learning rate scheduler. The initial learning rate was set to 0.003 to enable fast convergence in early training phases and detailed parameter tuning in subsequent stages. Weight decay was set to and dropout rate to 0.5 to avoid overfitting. The model’s structure includes 2 GCN layers, 4 attention heads and a hidden layer dimension of 64; the model underwent 100 training epochs. The dataset was split in chronological order with the first 50% used as training data and the next 50% as testing data.
Since the datasets lack ground-truth anomalies, we employ an anomaly injection strategy to generate synthetic anomalies exclusively for the testing phase. Specifically, we treat the existing edges in the test set as normal samples. To generate structural anomalies, we randomly sample node pairs from the node set that have no observed interaction in the original data and assign them random timestamps within the test interval to create non-existent edges, simulating users forming unusual connections. To generate attribute anomalies, we randomly perturb the interaction frequency or timestamp features of existing edges by adding noise to deviate from their normal distribution. These anomalies are injected at ratios of 1%, 5%, and 10% relative to the normal edges. All experiments were repeated 10 times independently. Specifically, for each run, we re-generated the injected anomalies using different random seeds to evaluate the model’s robustness against data variations. Within each specific run, the same set of anomalies was applied to all baseline methods to ensure a fair comparison. We report the mean AUC and standard deviation over these repeated runs to assess result stability.
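The structural-anomaly injection described above (sampling unobserved node pairs and assigning random timestamps in the test interval) can be sketched as follows; the function name and parameters are illustrative:

```python
import random

def inject_structural(nodes, existing, n, t_min, t_max, seed=0):
    """Sample n node pairs with no observed interaction and give them
    random timestamps within [t_min, t_max], simulating unusual links."""
    rng = random.Random(seed)
    fakes = set()
    while len(fakes) < n:
        u, v = rng.sample(nodes, 2)
        if (u, v) not in existing and (v, u) not in existing:
            fakes.add((u, v))
    return [(u, v, rng.uniform(t_min, t_max)) for u, v in fakes]

nodes = list(range(20))
existing = {(0, 1), (1, 2), (2, 3)}    # toy test-set edges treated as normal
anomalies = inject_structural(nodes, existing, n=5, t_min=0.0, t_max=1.0)
```

Attribute anomalies would instead perturb the frequency or timestamp features of edges already in `existing`; re-seeding `seed` per run reproduces the paper's protocol of regenerating anomalies across the 10 repetitions.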
4.2.3 Evaluation metrics
To conduct quantitative evaluation of the proposed model’s performance we employ the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) as the core assessment index. This index offers a reliable gauge of the model’s distinguishing capability with a higher value indicating stronger capacity to accurately differentiate abnormal interactions from normal ones.
4.3 Analysis of experimental results
4.3.1 Robustness analysis under different anomaly ratios
Table 1 presents the quantitative results (Mean ± SD) of diverse models on the Digg and Yelp datasets under varying anomaly injection ratios. The experimental findings show that ST-MVAN attains the best AUC results across all settings, fully confirming the validity of the proposed framework. Moreover, the relatively low standard deviations observed across all settings indicate the high stability and robustness of ST-MVAN against experimental randomness. Taking the Digg dataset with 1% anomaly injection as an instance, ST-MVAN achieves an AUC of 87.89%, surpassing DeepWalk by around 17.09% and outperforming GraphSAGE by roughly 15.39%. This directly illustrates the vital role of capturing temporal dynamics in anomaly detection. Against NetWalk, ST-MVAN maintains a distinct advantage, exceeding it by approximately 12.26% and 8.00% under the 1% setting on the Digg and Yelp datasets respectively. When compared with AddGraph, ST-MVAN outperforms it by 4.48%, 1.84%, and 2.19% under 1%, 5%, and 10% anomaly ratios on the Digg dataset. Similarly, on the Yelp dataset, ST-MVAN delivers performance improvements of 4.06%, 2.28%, and 1.68% respectively. Overall, these findings confirm that the ST-MVAN model exhibits superior detection capability and robustness.
TABLE 1
| Model | Digg 1% | Digg 5% | Digg 10% | Yelp 1% | Yelp 5% | Yelp 10% |
|---|---|---|---|---|---|---|
| DeepWalk | 0.7080 ± 0.0154 | 0.6881 ± 0.0142 | 0.6398 ± 0.0168 | 0.6243 ± 0.0132 | 0.6185 ± 0.0157 | 0.6054 ± 0.0126 |
| GraphSAGE | 0.7250 ± 0.0105 | 0.7385 ± 0.0092 | 0.7120 ± 0.0118 | 0.7350 ± 0.0086 | 0.7409 ± 0.0114 | 0.7280 ± 0.0097 |
| NetWalk | 0.7563 ± 0.0082 | 0.7176 ± 0.0115 | 0.6837 ± 0.0096 | 0.7524 ± 0.0108 | 0.7478 ± 0.0085 | 0.7396 ± 0.0121 |
| AddGraph | 0.8341 ± 0.0076 | 0.8470 ± 0.0064 | 0.8369 ± 0.0088 | 0.7918 ± 0.0095 | 0.8037 ± 0.0072 | 0.7950 ± 0.0086 |
| Ours (ST-MVAN) | **0.8789 ± 0.0079** | **0.8654 ± 0.0053** | **0.8588 ± 0.0064** | **0.8324 ± 0.0084** | **0.8265 ± 0.0069** | **0.8118 ± 0.0063** |
Comparison of the AUC performance of different models on the Digg and Yelp datasets (Mean ± SD).
Bold values indicate the best performance.
To visually evaluate overall model performance under a specific noise level, Figures 3, 4 show ROC curve comparisons of all models at the 5% anomaly injection ratio. All curves rise sharply, attaining a high True Positive Rate (TPR) while maintaining a low False Positive Rate (FPR), which demonstrates strong anomaly ranking ability. Notably, ST-MVAN's ROC curve covers the largest area and lies mostly above those of all baseline models, visually highlighting its edge in overall detection performance. Compared with AddGraph, ST-MVAN shows a notable advantage in the low-FPR region, with a more pronounced initial rise. This trait enables ST-MVAN to detect more abnormal interactions with fewer false positives in real-world applications, delivering greater practical utility.
FIGURE 3
FIGURE 4
4.3.2 Ablation studies
To further confirm the validity of the core components of ST-MVAN, we carried out ablation experiments on the Digg and Yelp datasets with a fixed 5% anomaly injection ratio. Experimental results presented in Table 2 and Figure 5 show that the full ST-MVAN model achieves the best AUC performance on both datasets, reaching 86.54% and 82.65%, respectively. Comparative analysis indicates that removing the Bi-GRU module causes the most notable performance drop, which proves that capturing the long- and short-term dynamic evolution patterns of user behaviors is critical for identifying abnormal behaviors in social networks. Additionally, replacing the multi-head attention mechanism with a traditional GCN that uses weighted-average aggregation discards the semantic information carried by edge attributes and the ability to adaptively weight neighbors. Moreover, excluding the ECA channel attention prevents the model from adaptively balancing the importance of different relational views, undermining the robustness of multi-view feature fusion. Overall, the experimental results demonstrate that ST-MVAN's superior performance originates from the synergy of refined spatial neighborhood feature aggregation, adaptive multi-view subgraph fusion, and bidirectional temporal modeling. This synergy allows the model to efficiently learn the spatiotemporal distribution patterns of normal interactions and accurately identify abnormal behaviors that deviate from these patterns.
TABLE 2
| Model variants | Digg | Yelp |
|---|---|---|
| With GCN aggregation | 0.8492 | 0.8094 |
| Without ECA | 0.8578 | 0.8185 |
| Without Bi-GRU | 0.8390 | 0.7961 |
| Ours (ST-MVAN) | **0.8654** | **0.8265** |
Ablation study results on Digg and Yelp datasets.
Bold values indicate the best performance.
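The ECA-style view weighting ablated above can be illustrated with a small NumPy sketch. This is our own simplification, not the authors' implementation: ECA uses a learned 1D convolution across channel descriptors, which we replace here with a fixed uniform averaging kernel, and the view count, feature dimension, and function name are all illustrative assumptions.

```python
import numpy as np

def eca_weights(view_feats, k=3):
    """Channel-attention-style importance weights over relational views.

    view_feats: (num_views, dim) array of per-node features from each view.
    A local averaging kernel of size k stands in for ECA's learned 1D conv.
    Returns one sigmoid-gated weight per view.
    """
    desc = view_feats.mean(axis=1)                     # global average descriptor per view
    pad = k // 2
    padded = np.pad(desc, pad, mode="edge")            # edge-pad so output length matches
    conv = np.array([padded[i:i + k].mean() for i in range(len(desc))])
    return 1.0 / (1.0 + np.exp(-conv))                 # sigmoid gate in (0, 1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))                        # 3 relational views, 8-dim features
w = eca_weights(feats)
fused = (w[:, None] * feats).sum(axis=0)               # weighted multi-view fusion
```

Removing this gate (the "Without ECA" variant) corresponds to summing the views with equal weights, which is exactly the fixed-fusion behavior the ablation shows to be weaker.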
FIGURE 5
4.3.3 Sensitivity analysis
To further assess ST-MVAN's stability across varying training proportions, we carried out sensitivity tests on the Digg dataset with a fixed 10% anomaly injection ratio. We gradually decreased the training proportion from 60% to 10% and recorded the AUC values for each timestamp in the testing phase.
From the experimental results in Figure 6, it is evident that as the training set proportion drops from 60% to 20%, the median and maximum AUC values of the model do not decline with reduced training data, instead showing a steady upward tendency. This is because the training set consists exclusively of normal samples, and the relative quantity of anomalous edge samples in the testing environment increases with the rising testing proportion. This shift in data partitioning enables the model to better capture the distribution differences between positive and negative samples during inference, enhancing the discriminability of anomaly scores and lifting the upper limit of detection performance. Yet when the training data is reduced to 10%, the boxplot morphology changes noticeably, with an expanded vertical span and a decreased minimum AUC value, indicating that data scarcity increases training uncertainty and fluctuation, impairing the stability of detection results to some extent. Even so, ST-MVAN maintains a highly competitive average AUC, with a median exceeding that of the high-training-proportion settings even under 10% training data. This verifies the proposed framework's strong feature capture ability and robustness in label-scarce or weakly supervised scenarios, effectively addressing the data limitation challenges in dynamic graph anomaly detection.
FIGURE 6
5 Conclusion
To tackle the challenges posed by anomalous interactions in dynamic social networks, including strong concealment, intricate evolution characteristics, and the lack of labeled data, this paper presents an end-to-end deep learning detection architecture named ST-MVAN. The model develops an attention mechanism integrating edge-attribute bias together with an ECA channel weighting module, realizing refined feature fusion across multi-view heterogeneous subgraphs. By combining a Bi-GRU with an Encoder-Decoder self-supervised reconstruction architecture, the model captures the long- and short-term temporal dependencies of user behaviors, allowing accurate detection of anomalous edges in an unsupervised setting. Extensive experiments on two real-world datasets, Digg and Yelp, validate the effectiveness and robustness of ST-MVAN. Experimental results demonstrate that the proposed model significantly outperforms mainstream baseline methods, such as NetWalk and AddGraph, in detection accuracy. Ablation studies confirmed that biased attention aggregation, multi-view adaptive fusion, and bidirectional temporal modeling are critical components for enhancing model performance. Additionally, the model exhibited satisfactory stability in the sensitivity analysis.
Statements
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
JW: Supervision, Methodology, Validation, Conceptualization, Writing – review and editing, Data curation, Investigation, Software, Writing – original draft, Formal Analysis, Resources, Visualization, Project administration.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1.
Xing L, Li S, Zhang Q, Wu H, Ma H, Zhang X. A survey on social network's anomalous behavior detection. Complex Intell Syst (2024) 10:5917–32. 10.1007/s40747-024-01446-8
2.
Guo S, Li X, Mu Z. Adversarial machine learning on social network: a survey. Front Phys (2021) 9:766540. 10.3389/fphy.2021.766540
3.
Ma X, Wu J, Xue S, Yang J, Zhou C, Sheng QZ, et al. A comprehensive survey on graph anomaly detection with deep learning. IEEE Trans Knowl Data Eng (2021) 35:12012–38. 10.1109/tkde.2021.3118815
4.
Zhu X, Guo C, Feng H, Huang Y, Feng Y, Wang X, et al. A review of key technologies for emotion analysis using multimodal information. Cogn Comput (2024) 16:1504–30. 10.1007/s12559-024-10287-z
5.
Zhu X, Liu Z, Cambria E, Yu X, Fan X, Chen H, et al. A client–server based recognition system: non-contact single/multiple emotional and behavioral state assessment methods. Comput Methods Programs Biomed (2025) 260:108564. 10.1016/j.cmpb.2024.108564
6.
Zhu X, Huang Y, Wang X, Wang R. Emotion recognition based on brain-like multimodal hierarchical perception. Multimedia Tools Appl (2024) 83:56039–57. 10.1007/s11042-023-17347-w
7.
Wang R, Wang Y, Cambria E, Fan X, Yu X, Huang Y, et al. Contrastive-based removal of negative information in multimodal emotion analysis. Cogn Comput (2025) 17:107. 10.1007/s12559-025-10463-9
8.
Jiang B, Zhang Z, Lin D, Tang J, Luo B. Semi-supervised learning with graph learning-convolutional networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019). p. 11313–20.
9.
Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y. Graph attention networks. In: International conference on learning representations (2018). Available online at: https://openreview.net/forum?id=rJXMpikCZ
10.
Ding K, Li J, Bhanushali R, Liu H. Deep anomaly detection on attributed networks. In: Proceedings of the 2019 SIAM international conference on data mining (SIAM) (2019). p. 594–602.
11.
Yoon M, Hooi B, Shin K, Faloutsos C. Fast and accurate anomaly detection in dynamic graphs with a two-pronged approach. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining (2019). p. 647–57.
12.
Perozzi B, Al-Rfou R, Skiena S. DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (2014). p. 701–10.
13.
Yu W, Cheng W, Aggarwal CC, Zhang K, Chen H, Wang W. NetWalk: a flexible deep embedding approach for anomaly detection in dynamic networks. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining (2018). p. 2672–81.
14.
Zheng L, Li Z, Li J, Li Z, Gao J. AddGraph: anomaly detection in dynamic graph using attention-based temporal GCN. In: IJCAI (2019). p. 4419–25. 10.24963/ijcai.2019/614
15.
Liu Y, Pan S, Wang YG, Xiong F, Wang L, Chen Q, et al. Anomaly detection in dynamic graphs via transformer. IEEE Trans Knowl Data Eng (2021) 35:12081–94. 10.1109/tkde.2021.3124061
16.
Wang X, Lu Y, Shi C, Wang R, Cui P, Mou S. Dynamic heterogeneous information network embedding with meta-path based proximity. IEEE Trans Knowl Data Eng (2020) 34:1117–32. 10.1109/tkde.2020.2993870
17.
Wang X, Ji H, Shi C, Wang B, Ye Y, Cui P, et al. Heterogeneous graph attention network. In: The World Wide Web conference (2019). p. 2022–32. 10.1145/3308558.3313562
18.
Zhao J, Wang X, Shi C, Hu B, Song G, Ye Y. Heterogeneous graph structure learning for graph neural networks. Proc AAAI Conf Artif Intell (2021) 35:4697–705. 10.1609/aaai.v35i5.16600
19.
Alam MT, Ahmed CF, Leung CK. Hyperedge anomaly detection with hypergraph neural network. arXiv preprint arXiv:2412.05641 (2024).
20.
Wang L, Li P, Xiong K, Zhao J, Lin R. Modeling heterogeneous graph network on fraud detection: a community-based framework with attention mechanism. In: Proceedings of the 30th ACM international conference on information and knowledge management (2021). p. 1959–68.
21.
Li Y, Zhu J, Zhang C, Yang Y, Zhang J, Qiao Y, et al. THGNN: an embedding-based model for anomaly detection in dynamic heterogeneous social networks. In: Proceedings of the 32nd ACM international conference on information and knowledge management (2023). p. 1368–78.
22.
Jin M, Liu Y, Zheng Y, Chi L, Li YF, Pan S. ANEMONE: graph anomaly detection with multi-scale contrastive learning. In: Proceedings of the 30th ACM international conference on information and knowledge management (2021). p. 3122–6.
23.
Jin W, Derr T, Wang Y, Ma Y, Liu Z, Tang J. Node similarity preserving graph convolutional networks. In: Proceedings of the 14th ACM international conference on web search and data mining (2021). p. 148–56.
24.
Dou Y, Liu Z, Sun L, Deng Y, Peng H, Yu PS. Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. In: Proceedings of the 29th ACM international conference on information and knowledge management (2020). p. 315–24.
25.
Liu Y, Ao X, Qin Z, Chi J, Feng J, Yang H, et al. Pick and choose: a GNN-based imbalanced learning approach for fraud detection. In: Proceedings of the web conference 2021 (2021). p. 3168–77.
26.
Xiang J, Zhu X, Cambria E. Integrating audio–visual text generation with contrastive learning for enhanced multimodal emotion analysis. Inf Fusion (2025) 127:103809. 10.1016/j.inffus.2025.103809
27.
Wang R, Xu D, Cascone L, Wang Y, Chen H, Zheng J, et al. RAFT: robust adversarial fusion transformer for multimodal sentiment analysis. Array (2025) 27:100445. 10.1016/j.array.2025.100445
28.
Zhu X, Wang Y, Cambria E, Rida I, López JS, Cui L, et al. RMER-DT: robust multimodal emotion recognition in conversational contexts based on diffusion and transformers. Inf Fusion (2025) 123:103268. 10.1016/j.inffus.2025.103268
29.
Zhang Y, Chen H, Rida I, Zhu X. A generative random modality dropout framework for robust multimodal emotion recognition. IEEE Intell Syst (2025) 40:62–9. 10.1109/mis.2025.3597120
30.
Wang R, Guo C, Shabaz M, Rida I, Cambria E, Zhu X. CIME: contextual interaction-based multimodal emotion analysis with enhanced semantic information. IEEE Trans Comput Soc Syst (2025) 1–11. 10.1109/TCSS.2025.3572495
31.
Zhu X, Feng H, Cambria E, Huang Y, Ju M, Yuan H, et al. EMVAS: end-to-end multimodal emotion visualization analysis system. Complex Intell Syst (2025) 11:1–15. 10.1007/s40747-025-01931-8
32.
Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs. Adv Neural Inf Process Syst (2017) 30:1025–35. Available online at: https://dl.acm.org/doi/10.5555/3294771.3294869
Summary
Keywords
anomaly behavior detection, complex networks, graph neural network, multi-head attention mechanism, social networks
Citation
Wang J (2026) Dynamic social network anomalous behavior detection based on spatiotemporal multi-view graph attention fusion network. Front. Phys. 14:1786937. doi: 10.3389/fphy.2026.1786937
Received
13 January 2026
Revised
11 February 2026
Accepted
18 February 2026
Published
27 February 2026
Volume
14 - 2026
Edited by
Amin Ul Haq, University of Electronic Science and Technology of China, China
Reviewed by
Syed Mohd. Faisal, Malla Reddy University, India
Xianxun Zhu, Shanghai University, China
Copyright
© 2026 Wang.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jimin Wang, wjm426@stu.haust.edu.cn