Detecting Group Anomalies in Tera-Scale Multi-Aspect Data via Dense-Subtensor Mining

Shin, Kijung; Hooi, Bryan; Kim, Jisu; Faloutsos, Christos

doi:10.3389/fdata.2020.594302

ORIGINAL RESEARCH article

Front. Big Data, 29 April 2021

Sec. Big Data Networks

Volume 3 - 2020 | https://doi.org/10.3389/fdata.2020.594302

This article is part of the Research Topic: Computational Behavioral Modeling for Big User Data

Kijung Shin¹*

Bryan Hooi²

Jisu Kim³

Christos Faloutsos⁴

¹Graduate School of AI and School of Electrical Engineering, KAIST, Daejeon, South Korea
²School of Computing and Institute of Data Science, National University of Singapore, Singapore, Singapore
³DataShape, Inria Saclay, Palaiseau, France
⁴School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, United States

How can we detect fraudulent lockstep behavior in large-scale multi-aspect data (i.e., tensors)? Can we detect it when data are too large to fit in memory or even on a disk? Past studies have shown that dense subtensors in real-world tensors (e.g., social media, Wikipedia, TCP dumps, etc.) signal anomalous or fraudulent behavior such as retweet boosting, bot activities, and network attacks. Thus, various approaches, including tensor decomposition and search, have been proposed for detecting dense subtensors rapidly and accurately. However, existing methods suffer from low accuracy, or they assume that tensors are small enough to fit in main memory, which is unrealistic in many real-world applications such as social media and the web. To overcome these limitations, we propose D-Cube, a disk-based dense-subtensor detection method, which can also run in a distributed manner across multiple machines. Compared to state-of-the-art methods, D-Cube is (1) Memory Efficient: it requires up to 1,561× less memory and handles 1,000× larger data (2.6TB), (2) Fast: it is up to 7× faster due to its near-linear scalability, (3) Provably Accurate: it gives a guarantee on the densities of the detected subtensors, and (4) Effective: it spotted network attacks from TCP dumps and synchronized behavior in rating data most accurately.

1 Introduction

Given a tensor that is too large to fit in memory, how can we detect dense subtensors? In particular, can we spot dense subtensors without sacrificing the speed and accuracy provided by in-memory algorithms?

A common application of this problem is review fraud detection, where we aim to spot suspicious lockstep behavior among groups of fraudulent user accounts who review suspiciously similar sets of products. Previous work (Maruhashi et al., 2011; Jiang et al., 2015; Shin et al., 2018) has shown the benefit of incorporating extra information, such as timestamps, ratings, and review keywords, by modeling review data as a tensor. Tensors allow us to consider additional dimensions in order to identify suspicious behavior of interest more accurately and specifically. That is, extraordinarily dense subtensors indicate groups of users with lockstep behaviors both in the products they review and along the additional dimensions (e.g., multiple users reviewing the same products at the exact same time).

In addition to review-fraud detection, spotting dense subtensors has been found effective for many anomaly-detection tasks. Examples include network-intrusion detection in TCP dumps (Maruhashi et al., 2011; Shin et al., 2018), retweet-boosting detection in online social networks (Jiang et al., 2015), bot-activity detection in Wikipedia (Shin et al., 2018), and genetics applications (Saha et al., 2010; Maruhashi et al., 2011).

Due to these wide applications, several methods have been proposed for rapid and accurate dense-subtensor detection, and search-based methods have shown the best performance. Specifically, search-based methods (Jiang et al., 2015; Shin et al., 2018) outperform methods based on tensor decomposition, such as CP Decomposition and HOSVD (Maruhashi et al., 2011), in terms of accuracy and flexibility with regard to the choice of density metrics. Moreover, the latest search-based methods (Shin et al., 2018) provide a guarantee on the densities of the subtensors they find, while methods based on tensor decomposition do not.

However, existing search methods for dense-subtensor detection assume that input tensors are small enough to fit in memory. Moreover, they are not directly applicable to tensors stored on disk, since their highly iterative nature incurs too many disk I/Os on such tensors. Real applications, such as social media and the web, often involve disk-resident tensors of terabytes or even petabytes, which in-memory algorithms cannot handle. This leaves a growing gap that needs to be filled.

1.1 Our Contributions

To overcome these limitations, we propose D-Cube, a dense-subtensor detection method for disk-resident tensors. D-Cube works under the W-Stream model (Ruhl, 2003), where data are only sequentially read and written during computation. As seen in Table 1, only D-Cube supports out-of-core computation, which allows it to process data too large to fit in main memory. D-Cube is optimized for this setting by carefully minimizing the amount of disk I/O and the number of steps requiring disk accesses, without losing the accuracy guarantees it provides. Moreover, we present a distributed version of D-Cube using the MapReduce framework (Dean and Ghemawat, 2008), specifically its open-source implementation Hadoop.

TABLE 1. Comparison of D-Cube and state-of-the-art dense-subtensor detection methods. ✓ denotes ‘supported’.

The main strengths of D-Cube are as follows:

Memory Efficient: D-Cube requires up to 1,561× less memory and successfully handles 1,000× larger data (2.6TB) than its best competitors (Figures 1A,B).

Fast: D-Cube detects dense subtensors up to 7× faster in real-world tensors and 12× faster in synthetic tensors than its best competitors due to its near-linear scalability with all aspects of tensors (Figure 1A).

Provably Accurate: D-Cube provides a guarantee on the densities of the subtensors it finds (Theorem 3), and it shows similar or higher accuracy in dense-subtensor detection than its best competitors on real-world tensors (Figure 1B).

Effective: D-Cube successfully spotted network attacks from TCP dumps, and lockstep behavior in rating data, with the highest accuracy (Figure 1C).

FIGURE 1. Strengths of D-Cube. ‘O.O.M’ stands for ‘out of memory’. (A) Fast and Scalable: D-Cube was 12× faster and successfully handled 1,000× larger data (2.6TB) than its best competitors. (B) Efficient and Accurate: D-Cube required 47× less memory and found subtensors as dense as those found by its best competitors from English Wikipedia revision history. (C) Effective: D-Cube accurately spotted network attacks from TCP dumps. See Section 4 for the detailed experimental settings.

Reproducibility: The code and data used in the paper are available at http://dmlab.kaist.ac.kr/dcube.

1.2 Related Work

We discuss previous work on (a) dense-subgraph detection, (b) dense-subtensor detection, (c) large-scale tensor decomposition, and (d) other anomaly/fraud detection methods.

Dense Subgraph Detection. Dense-subgraph detection in graphs has been extensively studied in theory; see Lee et al. (2010) for a survey. Exact algorithms (Goldberg, 1984; Khuller and Saha, 2009) and approximate algorithms (Charikar, 2000; Khuller and Saha, 2009) have been proposed for finding subgraphs with maximum average degree. These have been extended for incorporating size restrictions (Andersen and Chellapilla, 2009), alternative metrics for denser subgraphs (Tsourakakis et al., 2013), evolving graphs (Epasto et al., 2015), subgraphs with limited overlap (Balalau et al., 2015; Galbrun et al., 2016), and streaming or distributed settings (Bahmani et al., 2012, 2014). Dense subgraph detection has been applied to fraud detection in social or review networks (Beutel et al., 2013; Jiang et al., 2014; Shah et al., 2014; Shin et al., 2016; Hooi et al., 2017).

Dense Subtensor Detection. Extending dense subgraph detection to tensors (Jiang et al., 2015; Shin et al., 2017a, 2018) incorporates additional dimensions, such as time, to identify dense regions of interest with greater accuracy and specificity. Jiang et al. (2015) proposed CrossSpot, which starts from a seed subtensor and adjusts it in a greedy way until it reaches a local optimum. CrossSpot shows high accuracy in practice but does not provide any theoretical guarantees on its running time and accuracy. Shin et al. (2018) proposed M-Zoom, which starts from the entire tensor and only shrinks it by removing attributes one by one in a greedy way; it improves on CrossSpot in terms of speed and approximation guarantees. M-Biz, which was also proposed in Shin et al. (2018), starts from the output of M-Zoom and repeatedly adds or removes an attribute greedily until a local optimum is reached. Given a dynamic tensor, DenseAlert and DenseStream, which were proposed in Shin et al. (2017a), incrementally compute a single dense subtensor in it. CrossSpot, M-Zoom, M-Biz, and DenseStream require all tuples of relations to be loaded into memory at once and to be randomly accessed, which limits their applicability to large-scale datasets. DenseAlert maintains only the tuples created within a time window, and thus it can find a dense subtensor only within the window. Dense-subtensor detection in tensors has been found useful for detecting retweet boosting (Jiang et al., 2015), network attacks (Maruhashi et al., 2011; Shin et al., 2017a, 2018), bot activities (Shin et al., 2018), and vandalism on Wikipedia (Shin et al., 2017a), and also for genetics applications (Saha et al., 2010; Maruhashi et al., 2011).

Large-Scale Tensor Decomposition. Tensor decomposition such as HOSVD and CP decomposition (Kolda and Bader, 2009) can be used to spot dense subtensors, as shown in Maruhashi et al. (2011). Scalable algorithms for tensor decomposition have been developed, including disk-based algorithms (Shin and Kang, 2014; Oh et al., 2017), distributed algorithms (Kang et al., 2012; Shin and Kang, 2014; Jeon et al., 2015), and approximate algorithms based on sampling (Papalexakis et al., 2012) and count-min sketch (Wang et al., 2015). However, dense-subtensor detection based on tensor decomposition has serious limitations: it usually detects subtensors with significantly lower density (see Section 4.3) than search-based methods, provides no flexibility with regard to the choice of density metric, and does not provide any approximation guarantee.

Other Anomaly/Fraud Detection Methods. In addition to dense-subtensor detection, many approaches, including those based on egonet features (Akoglu et al., 2010), coreness (Shin et al., 2016), and behavior models (Rossi et al., 2013), have been used for anomaly and fraud detection in graphs. See Akoglu et al. (2015) for a survey.

1.3 Organization of the Paper

In Section 2, we provide notations and a formal problem definition. In Section 3, we propose D-Cube, a disk-based dense-subtensor detection method. In Section 4, we present experimental results and discuss them. In Section 5, we offer conclusions.

2 Preliminaries and Problem Definition

In this section, we first introduce notations and concepts used in the paper. Then, we define density measures and the problem of top-k dense-subtensor detection.

2.1 Notations and Concepts

Table 2 lists the symbols frequently used in the paper. We use $[x] = \{1, 2, \dots, x\}$ for brevity. Let $ℛ (A_{1}, \dots, A_{N}, X)$ be a relation with N dimension attributes, denoted by $A_{1}, \dots, A_{N}$, and a nonnegative measure attribute, denoted by X (see Example 1 for a running example). For each tuple $t \in ℛ$ and for each $n \in [N]$, $t [A_{n}]$ and $t [X]$ indicate the values of $A_{n}$ and X, respectively, in t. For each $n \in [N]$, we use $ℛ_{n} = \{t [A_{n}] : t \in ℛ\}$ to denote the set of distinct values of $A_{n}$ in $ℛ$. The relation $ℛ$ is naturally represented as an N-way tensor of size $| ℛ_{1} | \times \dots \times | ℛ_{N} |$. The value of each entry in the tensor is $t [X]$ if the corresponding tuple t exists, and 0 otherwise. Let $ℬ_{n}$ be a subset of $ℛ_{n}$. Then, a subtensor $ℬ$ in $ℛ$ is defined as $ℬ (A_{1}, \dots, A_{N}, X) = \{t \in ℛ : \forall n \in [N], t [A_{n}] \in ℬ_{n}\}$, the set of tuples where each attribute $A_{n}$ has a value in $ℬ_{n}$. The relation $ℬ$ is a ‘subtensor’ because it forms a subtensor of size $| ℬ_{1} | \times \dots \times | ℬ_{N} |$ in the tensor representation of $ℛ$, as in Figure 2B. We define the mass of $ℛ$ as $M_{ℛ} = \sum_{t \in ℛ} t [X]$, the sum of attribute X in the tuples of $ℛ$. We denote the set of tuples of $ℬ$ whose attribute $A_{n} = a$ by $ℬ (a, n) = \{t \in ℬ : t [A_{n}] = a\}$ and its mass, called the attribute-value mass of a in $A_{n}$, by $M_{ℬ (a, n)} = \sum_{t \in ℬ (a, n)} t [X]$.

TABLE 2. Table of symbols.

FIGURE 2. Pictorial description of Example 1. (A) Relation $ℛ$ where the colored tuples compose relation $ℬ$ . (B) Tensor representation of $ℛ$ where the relation $ℬ$ forms a subtensor.

Example 1 (Wikipedia Revision History). As in Figure 2, assume a relation $ℛ (\underline{\text{user}}, \underline{\text{page}}, \underline{\text{date}}, \text{count})$, where each tuple (u, p, d, c) in $ℛ$ indicates that user u revised page p, c times, on date d. The first three attributes, $A_{1}$ = user, $A_{2}$ = page, and $A_{3}$ = date, are dimension attributes, and the other one, X = count, is the measure attribute. Let $ℬ_{1} = \{\text{Alice}, \text{Bob}\}$, $ℬ_{2} = \{\text{A}, \text{B}\}$, and $ℬ_{3} = \{\text{May-29}\}$. Then, $ℬ$ is the set of tuples regarding the revision of page A or B by Alice or Bob on May-29, and its mass $M_{ℬ}$ is 19, the total number of such revisions. The attribute-value mass of Alice (i.e., $M_{ℬ (\text{Alice}, 1)}$) is 9, the number of revisions on A or B by exactly Alice on May-29. In the tensor representation, $ℬ$ composes a subtensor in $ℛ$, as depicted in Figure 2B.
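The notation above can be made concrete with a minimal Python sketch. The tuple values below are hypothetical (the actual entries of Figure 2 are not reproduced here); they were chosen only so that the totals agree with the mass of 19 and the attribute-value mass of 9 stated in Example 1. The helper names `subtensor`, `mass`, and `attribute_value_mass` are illustrative, and dimensions are 0-indexed in the code whereas the paper indexes them from 1.

```python
# Hypothetical tuples (user, page, date, count); values are illustrative only.
R = [
    ("Alice", "A", "May-29", 4),
    ("Alice", "B", "May-29", 5),
    ("Bob", "A", "May-29", 7),
    ("Bob", "B", "May-29", 3),
    ("Alice", "A", "May-30", 2),
    ("Carol", "C", "May-30", 1),
]


def subtensor(relation, value_sets):
    """Tuples whose n-th dimension value lies in value_sets[n] for every n."""
    return [t for t in relation
            if all(t[n] in vals for n, vals in enumerate(value_sets))]


def mass(relation):
    """Sum of the measure attribute (last field) over all tuples."""
    return sum(t[-1] for t in relation)


def attribute_value_mass(relation, a, n):
    """Mass of the tuples whose n-th dimension attribute equals a (n is 0-indexed)."""
    return sum(t[-1] for t in relation if t[n] == a)


B = subtensor(R, [{"Alice", "Bob"}, {"A", "B"}, {"May-29"}])
print(mass(B))                              # 19
print(attribute_value_mass(B, "Alice", 0))  # 9
```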

2.2 Density Measures

We present density measures proven useful for anomaly detection in past studies. We use them throughout the paper although our dense-subtensor detection method, explained in Section 3, is flexible and not restricted to specific measures. Below, we slightly abuse notation to emphasize that the density measures are functions of $M_{ℬ}$, $\{| ℬ_{n} |\}_{n = 1}^{N}$, $M_{ℛ}$, and $\{| ℛ_{n} |\}_{n = 1}^{N}$, where $ℬ$ is a subtensor of a relation $ℛ$.

Arithmetic Average Mass (Definition 1) and Geometric Average Mass (Definition 2), which were used for detecting network intrusions and bot activities in Shin et al. (2018), are the extensions of density measures widely-used for graphs (Kannan and Vinay, 1999; Charikar, 2000).

Definition 1 (Arithmetic Average Mass $ρ_{ari}$). The arithmetic average mass of a subtensor $ℬ$ of a relation $ℛ$ is defined as

ρ_{ari} (ℬ, ℛ) = ρ_{ari} (M_{ℬ}, \{| ℬ_{n} |\}_{n = 1}^{N}, M_{ℛ}, \{| ℛ_{n} |\}_{n = 1}^{N}) = \frac{M_{ℬ}}{\frac{1}{N} \sum_{n = 1}^{N} | ℬ_{n} |} .

Definition 2 (Geometric Average Mass $ρ_{geo}$). The geometric average mass of a subtensor $ℬ$ of a relation $ℛ$ is defined as

ρ_{geo} (ℬ, ℛ) = ρ_{geo} (M_{ℬ}, \{| ℬ_{n} |\}_{n = 1}^{N}, M_{ℛ}, \{| ℛ_{n} |\}_{n = 1}^{N}) = \frac{M_{ℬ}}{(\prod_{n = 1}^{N} | ℬ_{n} |)^{\frac{1}{N}}} .

Suspiciousness (Definition 3), which was used for detecting ‘retweet-boosting’ activities in Jiang et al. (2014), is the negative log-likelihood that $ℬ$ has mass $M_{ℬ}$ under the assumption that each entry of $ℛ$ is i.i.d. from a Poisson distribution.

Definition 3 (Suspiciousness $ρ_{susp}$). The suspiciousness of a subtensor $ℬ$ of a relation $ℛ$ is defined as

\begin{matrix} ρ_{susp} (ℬ, ℛ) = ρ_{susp} (M_{ℬ}, \{| ℬ_{n} |\}_{n = 1}^{N}, M_{ℛ}, \{| ℛ_{n} |\}_{n = 1}^{N}) \\ = M_{ℬ} (\log \frac{M_{ℬ}}{M_{ℛ}} - 1) + M_{ℛ} \prod_{n = 1}^{N} \frac{| ℬ_{n} |}{| ℛ_{n} |} - M_{ℬ} \log (\prod_{n = 1}^{N} \frac{| ℬ_{n} |}{| ℛ_{n} |}) . \end{matrix}

Entry Surplus (Definition 4) is the observed mass of $ℬ$ minus α times its expected mass, under the assumption that the value of each entry (in the tensor representation) in $ℛ$ is i.i.d. It is a multi-dimensional extension of edge surplus, which was proposed in Tsourakakis et al. (2013) as a density metric for graphs.

Definition 4 (Entry Surplus $ρ_{es (α)}$). The entry surplus of a subtensor $ℬ$ of a relation $ℛ$ is defined as

\begin{matrix} ρ_{es (α)} (ℬ, ℛ) = ρ_{es (α)} (M_{ℬ}, \{| ℬ_{n} |\}_{n = 1}^{N}, M_{ℛ}, \{| ℛ_{n} |\}_{n = 1}^{N}) \\ = M_{ℬ} - α M_{ℛ} \prod_{n = 1}^{N} \frac{| ℬ_{n} |}{| ℛ_{n} |} . \end{matrix}

The kind of subtensor favored by entry surplus can be tuned by adjusting α. With high α values, relatively small compact subtensors have higher entry surplus than large sparse subtensors, while the opposite happens with small α values. We show this tendency experimentally in Section 4.7.
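As an illustration, Definitions 1–4 can be transcribed into Python roughly as follows. This is a sketch rather than the authors' implementation: the function names and the helper `volume_ratio` are assumptions, all cardinalities are assumed to be positive, and the convention that $M_{ℬ} \log M_{ℬ}$ is taken as 0 when $M_{ℬ} = 0$ is our own choice for the degenerate case.

```python
import math


def rho_ari(m_b, b_sizes):
    """Arithmetic average mass: mass divided by the mean cardinality."""
    return m_b / (sum(b_sizes) / len(b_sizes))


def rho_geo(m_b, b_sizes):
    """Geometric average mass: mass divided by the geometric mean cardinality."""
    return m_b / (math.prod(b_sizes) ** (1.0 / len(b_sizes)))


def volume_ratio(b_sizes, r_sizes):
    """Product over dimensions of |B_n| / |R_n| (hypothetical helper)."""
    return math.prod(b / r for b, r in zip(b_sizes, r_sizes))


def rho_susp(m_b, b_sizes, m_r, r_sizes):
    """Suspiciousness, following Definition 3 term by term."""
    ratio = volume_ratio(b_sizes, r_sizes)
    if m_b == 0:
        # Assumed convention: the terms multiplied by m_b vanish.
        return m_r * ratio
    return m_b * (math.log(m_b / m_r) - 1) + m_r * ratio - m_b * math.log(ratio)


def rho_es(m_b, b_sizes, m_r, r_sizes, alpha=1.0):
    """Entry surplus: observed mass minus alpha times the expected mass."""
    return m_b - alpha * m_r * volume_ratio(b_sizes, r_sizes)


# Example: a 2 x 2 x 1 subtensor of mass 19 in a 3 x 3 x 2 tensor of mass 22.
print(rho_ari(19, [2, 2, 1]))
print(rho_es(19, [2, 2, 1], 22, [3, 3, 2], alpha=1.0))
```

Raising `alpha` in `rho_es` penalizes the volume term more heavily, which matches the tendency described above for small compact subtensors to score higher under large α.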

2.3 Problem Definition

Based on the concepts and density measures in the previous sections, we define the problem of top-k dense-subtensor detection in a large-scale tensor in Problem 1.

Problem 1 (Large-scale Top-k Densest Subtensor Detection). (1) Given: a large-scale relation R not fitting in memory, the number of subtensors k, and a density measure ρ, (2) Find: the top-k subtensors of R with the highest density in terms of ρ.

Even when we restrict our attention to finding one subtensor in a matrix fitting in memory (i.e., k = 1 and N = 2), obtaining an exact solution takes $O ((\sum_{n = 1}^{N} | ℛ_{n} |)^{6})$ time (Goldberg, 1984; Khuller and Saha, 2009), which is infeasible for large-scale tensors. Thus, our focus in this work is to design an approximate algorithm with (1) near-linear scalability with all aspects of $ℛ$, which does not fit in memory, (2) an approximation guarantee at least for some density measures, and (3) meaningful results on real-world data.

3 Proposed Method

In this section, we propose D-Cube, a disk-based dense-subtensor detection method. We first describe D-Cube in Section 3.1. Then, we prove its theoretical properties in Section 3.2. Lastly, we present our MapReduce implementation of D-Cube in Section 3.3. Throughout these subsections, we assume that the entries of tensors (i.e., the tuples of relations) are stored on disk and read/written only in a sequential way. However, all other data (e.g., distinct attribute-value sets and the mass of each attribute value) are assumed to be stored in memory.

3.1 Algorithm

D-Cube is a search method that starts with the given relation and removes attribute values (and the tuples with the attribute values) sequentially so that a dense subtensor is left. Contrary to previous approaches, D-Cube removes multiple attribute values (and the tuples with the attribute values) at a time to reduce the number of iterations and also disk I/Os. In addition to this advantage, D-Cube carefully chooses attribute values to remove to give the same accuracy guarantee as if attribute values were removed one by one, and shows similar or even higher accuracy empirically.

3.1.1 Overall Structure of D-Cube (Algorithm 1)

Algorithm 1 describes the overall structure of D-Cube. It first copies the given relation $ℛ$ to $ℛ^{ori}$ (line 1) and computes the sets of distinct attribute values composing $ℛ$ (line 2). Then, it finds k dense subtensors one by one from $ℛ$ (line 6) using its mass as a parameter (line 5). The detailed procedure for detecting a single dense subtensor from $ℛ$ is explained in Section 3.1.2. After each subtensor $ℬ$ is found, the tuples included in $ℬ$ are removed from $ℛ$ (line 7) to prevent the same subtensor from being found again. Due to this change in $ℛ$, subtensors found from $ℛ$ are not necessarily subtensors of the original relation $ℛ^{ori}$. Thus, instead of $ℬ$, the subtensor in $ℛ^{ori}$ formed by the same attribute values forming $ℬ$ is added to the list of k dense subtensors (lines 8–9). Notice that, due to this step, D-Cube can detect overlapping dense subtensors. That is, a tuple can be included in multiple dense subtensors.

Based on our assumption that the sets of distinct attribute values (i.e., $\{ℛ_{n}\}_{n = 1}^{N}$ and $\{ℬ_{n}\}_{n = 1}^{N}$) are stored in memory and can be randomly accessed, all the steps in Algorithm 1 can be performed by sequentially reading and writing tuples in relations (i.e., tensor entries) on disk without loading all the tuples in memory at once. For example, the filtering steps in lines 7–8 can be performed by sequentially reading each tuple from disk and writing the tuple to disk only if it satisfies the given condition.
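The sequential filtering described above might look like the following minimal sketch. It assumes, purely for illustration, that tuples are stored as rows of a CSV file; the file names, the CSV layout, and the helpers `filter_tuples` and `in_subtensor` are hypothetical and not taken from the D-Cube code. Only the in-memory value sets are accessed randomly, and only one tuple is held in memory at a time.

```python
import csv
import os
import tempfile


def filter_tuples(src_path, dst_path, keep):
    """Stream rows from src_path to dst_path, writing only rows where keep(row) holds.

    Returns the number of rows written.
    """
    written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if keep(row):
                writer.writerow(row)
                written += 1
    return written


def in_subtensor(value_sets):
    """Predicate: every dimension value of the row lies in the matching in-memory set."""
    return lambda row: all(row[n] in vals for n, vals in enumerate(value_sets))


# Usage with hypothetical data: remove the tuples of B from R (as in line 7).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "R.csv")
dst = os.path.join(tmp_dir, "R_next.csv")
with open(src, "w", newline="") as f:
    csv.writer(f).writerows([
        ["Alice", "A", "May-29", 4],
        ["Bob", "B", "May-29", 3],
        ["Carol", "C", "May-30", 1],
    ])
in_b = in_subtensor([{"Alice", "Bob"}, {"A", "B"}, {"May-29"}])
kept = filter_tuples(src, dst, lambda row: not in_b(row))
```

Passing `in_b` itself instead of its negation would correspond to the other filtering step, which keeps the tuples formed by the chosen attribute values.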

Algorithm 1. D‐CUBE

Note that this overall structure of D-Cube is similar to that of M-Zoom (Shin et al., 2018) except that tuples are stored on disk. However, the methods differ significantly in the way each dense subtensor is found from $ℛ$ , which is explained in the following section.
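The overall loop described in this subsection can be outlined as below. This is a hedged, in-memory sketch: Python lists stand in for the disk-resident relations, the input relation is assumed to be non-empty, and `find_one` is passed in as a parameter. The stub `heaviest_tuple` is a placeholder that only makes the example runnable; it is not the detection procedure of Section 3.1.2.

```python
def d_cube(relation, k, find_one):
    """Outline of the top-level loop of Algorithm 1 as described in the text.

    relation: non-empty list of tuples whose last field is the measure attribute.
    find_one(R, R_sets, mass_R): returns one set of attribute values per dimension.
    """
    n_dims = len(relation[0]) - 1
    r_ori = list(relation)                                 # line 1: copy of the input
    r = list(relation)
    r_sets = [{t[n] for t in r} for n in range(n_dims)]    # line 2: distinct values
    results = []
    for _ in range(k):
        mass_r = sum(t[-1] for t in r)                     # line 5: mass of R
        b_sets = find_one(r, r_sets, mass_r)               # line 6: one dense subtensor

        def in_b(t, b_sets=b_sets):
            return all(t[n] in b_sets[n] for n in range(n_dims))

        r = [t for t in r if not in_b(t)]                  # line 7: remove found tuples
        results.append([t for t in r_ori if in_b(t)])      # lines 8-9: use original R
    return results


def heaviest_tuple(r, r_sets, mass_r):
    """Placeholder find_one: returns the attribute values of the heaviest tuple."""
    if not r:
        return [set() for _ in r_sets]
    best = max(r, key=lambda t: t[-1])
    return [{v} for v in best[:-1]]


data = [("u1", "p1", 5), ("u2", "p2", 3), ("u3", "p3", 1)]
found = d_cube(data, 2, heaviest_tuple)
```

Because results are collected from the copy of the original relation rather than from the shrinking one, a tuple may appear in more than one returned subtensor.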

3.1.2 Single Subtensor Detection (Algorithm 2)

Algorithm 2 describes how D-Cube detects each dense subtensor from the given relation $ℛ$ . It first initializes a subtensor $ℬ$ to $ℛ$ (lines 1–2) then repeatedly removes attribute values and the tuples of $ℬ$ with those attribute values until all values are removed (line 5).

Specifically, in each iteration, D-Cube first chooses a dimension attribute $A_{i}$ from which attribute values are removed (line 7). Then, it computes $D_{i}$, the set of attribute values whose masses are at most $θ (\geq 1)$ times the average (line 8). We explain how the dimension attribute is chosen in Section 3.1.3 and analyze the effects of θ on the accuracy and the time complexity in Section 3.2. The tuples whose attribute values of $A_{i}$ are in $D_{i}$ are removed from $ℬ$ at once within a single scan of $ℬ$ (line 16). However, deleting only a subset of $D_{i}$ may achieve a higher value of the metric ρ. Hence, D-Cube computes the changes in the density of $ℬ$ (line 11) as if the attribute values in $D_{i}$ were removed one by one, in increasing order of their masses. This allows D-Cube to optimize ρ as if attribute values were removed one by one, while still benefiting from the computational speedup of removing multiple attribute values in each scan. Note that these changes in ρ can be computed exactly without actually removing the tuples from $ℬ$ or even accessing the tuples in $ℬ$ since its mass (i.e., $M_{ℬ}$) and the numbers of distinct attribute values (i.e., $\{| ℬ_{n} |\}_{n = 1}^{N}$) are maintained up-to-date (lines 11–12). This is because removing an attribute value from a dimension attribute does not affect the masses of the other values of the same attribute. The order in which attribute values are removed and the point at which the density of $ℬ$ is maximized are maintained (lines 13–15) so that the subtensor $ℬ$ maximizing the density can be restored and returned (lines 17–18), as the result of Algorithm 2.

Algorithm 2. find_one in D-Cube

Note that, in each iteration (lines 5–16) of Algorithm 2, the tuples of $ℬ$ , which are stored on disk, need to be scanned only twice, once in line 6 and once in line 16. Moreover, both steps can be performed by simply sequentially reading and/or writing tuples in $ℬ$ without loading all the tuples in memory at once. For example, to compute attribute-value masses in line 6, D-Cube increases $M_{ℬ (t [A_{n}], n)}$ by $t [X]$ for each dimension attribute A_n after reading each tuple t in $ℬ$ sequentially from disk.
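To make the procedure concrete, the following is a simplified in-memory sketch of the single-subtensor search as described in the text. Several points are assumptions on our part: a Python list stands in for the disk-resident $ℬ$, the maximum cardinality policy is hard-coded, the mass of $ℛ$ is computed inside the function rather than passed in, and candidate subtensors with an empty dimension are simply skipped when tracking the best density. It is not the authors' implementation.

```python
def rho_ari(m_b, b_sizes, m_r=None, r_sizes=None):
    """Arithmetic average mass (Definition 1); m_r and r_sizes are unused here."""
    return m_b / (sum(b_sizes) / len(b_sizes))


def find_one(relation, r_sets, rho=rho_ari, theta=1.0):
    """Return one set of attribute values per dimension for a dense subtensor."""
    n_dims = len(r_sets)
    r_sizes = [len(s) for s in r_sets]
    m_r = sum(t[-1] for t in relation)
    b = list(relation)                    # stands in for the on-disk copy of B
    b_sets = [set(s) for s in r_sets]
    m_b = m_r
    best_density = rho(m_b, r_sizes, m_r, r_sizes)
    order, rank, best_rank = {}, 0, 0     # removal rank of each (dimension, value)
    while any(b_sets):
        # First scan (line 6): attribute-value masses from one sequential pass.
        masses = [dict.fromkeys(s, 0) for s in b_sets]
        for t in b:
            for n in range(n_dims):
                masses[n][t[n]] += t[-1]
        # Line 7: maximum cardinality policy (hard-coded in this sketch).
        i = max(range(n_dims), key=lambda n: len(b_sets[n]))
        size_i = len(b_sets[i])
        # Line 8: values with mass at most theta times the average, lightest first.
        d_i = sorted(
            (a for a in b_sets[i] if masses[i][a] * size_i <= theta * m_b),
            key=lambda a: masses[i][a],
        )
        # Lines 11-15: simulate one-by-one removal without touching the tuples.
        for a in d_i:
            b_sets[i].remove(a)
            m_b -= masses[i][a]
            rank += 1
            order[(i, a)] = rank
            sizes = [len(s) for s in b_sets]
            if min(sizes) > 0:            # assumed: skip empty candidates
                density = rho(m_b, sizes, m_r, r_sizes)
                if density > best_density:
                    best_density, best_rank = density, rank
        # Second scan (line 16): drop all tuples with a removed value at once.
        removed = set(d_i)
        b = [t for t in b if t[i] not in removed]
    # Lines 17-18: restore the values removed after the best point.
    return [{a for a in r_sets[n] if order[(n, a)] > best_rank}
            for n in range(n_dims)]


# Hypothetical data: a dense 2 x 2 block plus two isolated light tuples.
R = [("u1", "p1", 10), ("u1", "p2", 10), ("u2", "p1", 10), ("u2", "p2", 10),
     ("u3", "p3", 1), ("u4", "p4", 1)]
R_sets = [{t[n] for t in R} for n in range(2)]
dense = find_one(R, R_sets)   # [{'u1', 'u2'}, {'p1', 'p2'}]
```

The comparison `masses[i][a] * size_i <= theta * m_b` is the threshold test written without a division, which keeps the arithmetic exact for integer measures when θ is 1.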

3.1.3 Dimension Selection (Algorithms 3 and 4)

We discuss two policies for choosing the dimension attribute from which attribute values are removed. Either can be used in line 7 of Algorithm 2, and they offer different advantages.

Maximum Cardinality Policy (Algorithm 3): The dimension attribute with the largest cardinality is chosen, as described in Algorithm 3. This simple policy, however, provides an accuracy guarantee (see Theorem 3 in Section 3.2.2).

Maximum Density Policy (Algorithm 4): The density of $ℬ$ when attribute values are removed from each dimension attribute is computed. Then, the dimension attribute leading to the highest density is chosen. Note that the tuples in $ℬ$ , stored on disk, do not need to be accessed for this computation, as described in Algorithm 4. Although this policy does not provide the accuracy guarantee given by the maximum cardinality policy, this policy works well with various density measures and tends to spot denser subtensors than the maximum cardinality policy in our experiments with real-world data.

Algorithm 3. select_dimension by cardinality

Algorithm 4. select_dimension by density
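The two policies can be sketched in Python as follows. The function names and signatures are illustrative assumptions. For the maximum density policy, we assume that the density is evaluated after removing every value whose mass is at most θ times the average from the candidate dimension, that candidates leaving an empty dimension are skipped, and that the maximum cardinality policy is used as a fallback when every candidate is skipped; the details of Algorithm 4 may differ.

```python
def select_by_cardinality(b_sets):
    """Maximum cardinality policy: the non-empty dimension with the most values."""
    candidates = [n for n, s in enumerate(b_sets) if s]
    return max(candidates, key=lambda n: len(b_sets[n]))


def select_by_density(b_sets, masses, m_b, rho, theta=1.0):
    """Maximum density policy, computed from in-memory masses only.

    masses[n][a] is the attribute-value mass of value a in dimension n;
    rho(mass, sizes) is a density measure.
    """
    best_dim, best_density = None, float("-inf")
    for n, values in enumerate(b_sets):
        if not values:
            continue
        size = len(values)
        d_n = [a for a in values if masses[n][a] * size <= theta * m_b]
        new_mass = m_b - sum(masses[n][a] for a in d_n)
        sizes = [len(s) for s in b_sets]
        sizes[n] = size - len(d_n)
        if min(sizes) == 0:               # assumed: skip empty candidates
            continue
        density = rho(new_mass, sizes)
        if density > best_density:
            best_dim, best_density = n, density
    return best_dim if best_dim is not None else select_by_cardinality(b_sets)


def rho_ari(m_b, sizes):
    """Arithmetic average mass (Definition 1)."""
    return m_b / (sum(sizes) / len(sizes))


# Hypothetical state in which the two policies disagree.
b_sets = [{"u1", "u2"}, {"p1", "p2", "p3"}]
masses = [{"u1": 20, "u2": 1}, {"p1": 7, "p2": 7, "p3": 7}]
by_card = select_by_cardinality(b_sets)                # dimension 1 (three values)
by_dens = select_by_density(b_sets, masses, 21, rho_ari)  # dimension 0
```

Neither function reads tuples: both work only on the value sets and attribute-value masses that are assumed to be held in memory.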

3.1.4 Efficient Implementation

We present the optimization techniques used for the efficient implementation of D-Cube.

Combining Disk-Accessing Steps. The amount of disk I/O can be reduced by combining multiple steps involving disk accesses. In Algorithm 1, updating $ℛ$ (line 7) in an iteration can be combined with computing the mass of $ℛ$ (line 5) in the next iteration. That is, if we aggregate the values of the tuples of $ℛ$ while they are written for the update, we do not need to scan $ℛ$ again for computing its mass in the next iteration. Likewise, in Algorithm 2, updating $ℬ$ (line 16) in an iteration can be combined with computing attribute-value masses (line 6) in the next iteration. This optimization reduces the amount of disk I/O in D-Cube by about 30%.
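A minimal sketch of the first combination, assuming tuples arrive as an iterable and are written through a callback (both stand-ins for sequential disk access; the name `filter_and_sum` is hypothetical):

```python
def filter_and_sum(tuples, keep, write):
    """One sequential pass: write the kept tuples and accumulate their mass."""
    total = 0
    for t in tuples:
        if keep(t):
            write(t)
            total += t[-1]      # last field is the measure attribute
    return total


# Usage with hypothetical data: drop tuples of user "u3" and get the new mass.
data = [("u1", "p1", 5), ("u2", "p2", 3), ("u3", "p3", 1)]
out = []
new_mass = filter_and_sum(iter(data), lambda t: t[0] != "u3", out.append)
```

The mass returned here is the mass of the updated relation, so no separate scan is needed to obtain it in the next iteration.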

Caching Tensor Entries in Memory. Although we assume that tuples are stored on disk, storing them in memory up to the memory capacity speeds up D-Cube by up to 3× in our experiments (see Section 4.4). We cache the tuples in $ℬ$, which are more frequently accessed than those in $ℛ$ or $ℛ^{ori}$, in memory with the highest priority.

3.2 Analyses

In this section, we prove the time and space complexities of D-Cube and the accuracy guarantee provided by D-Cube. Then, we theoretically compare D-Cube with M-Zoom and M-Biz (Shin et al., 2018).

3.2.1 Complexity Analyses

Theorem 1 states the worst-case time complexity of D-Cube, which equals its worst-case I/O complexity.

Lemma 1 (Maximum Number of Iterations in Algorithm 2). Let $L = \max_{n \in [N]} | ℛ_{n} |$. Then, the number of iterations (lines 5–16) in Algorithm 2 is at most

N \min (\log_{θ} L, L) .

Proof. In each iteration (lines 5–16) of Algorithm 2, among the values of the chosen dimension attribute $A_{i}$, attribute values whose masses are at most $θ \frac{M_{ℬ}}{| ℬ_{i} |}$, where $θ \geq 1$, are removed. The set of such attribute values is denoted by $D_{i}$. We will show that, if $| ℬ_{i} | > 0$, then

| ℬ_{i} \setminus D_{i} | < | ℬ_{i} | / θ . (1)

Note that, when $| ℬ_{i} \setminus D_{i} | = 0$, Eq. (1) trivially holds. When $| ℬ_{i} \setminus D_{i} | > 0$, $M_{ℬ}$ can be decomposed and lower bounded as

\begin{matrix} M_{ℬ} = \sum_{a \in ℬ_{i} \setminus D_{i}} M_{ℬ (a, i)} + \sum_{a \in D_{i}} M_{ℬ (a, i)} \\ \geq \sum_{a \in ℬ_{i} \setminus D_{i}} M_{ℬ (a, i)} > | ℬ_{i} \setminus D_{i} | \cdot θ \frac{M_{ℬ}}{| ℬ_{i} |}, \end{matrix}

where the last strict inequality follows from the definition of $D_{i}$ and the fact that $| ℬ_{i} \setminus D_{i} | > 0$. This strict inequality implies $M_{ℬ} > 0$, and thus dividing both sides by $θ \frac{M_{ℬ}}{| ℬ_{i} |}$ gives Eq. (1). Now, Eq. (1) implies that the number of remaining values of the chosen attribute after each iteration is less than $1 / θ$ of that before the iteration. Hence, each attribute can be chosen at most $\log_{θ} L$ times before all of its values are removed. Thus, the maximum number of iterations is at most $N \log_{θ} L$. Also, by Eq. (1), at least one attribute value is removed per iteration. Hence, the maximum number of iterations is at most the number of attribute values, which is upper bounded by $N L$. Combining the two bounds, the number of iterations is upper bounded by $N \min (\log_{θ} L, L)$. ∎
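For a feel of the numbers, the bound of Lemma 1 can be evaluated with a few lines of Python. The function name is hypothetical, and treating $\log_{θ} L$ as infinite when θ = 1 is our reading of the bound, in which case the $L$ term of the minimum applies.

```python
import math


def max_iterations(n_dims, largest_cardinality, theta):
    """Upper bound N * min(log_theta(L), L) on the iterations of Algorithm 2."""
    if theta < 1:
        raise ValueError("theta must be at least 1")
    big_l = largest_cardinality
    # log_theta(L) is unbounded when theta == 1, so the L term applies.
    log_term = math.inf if theta == 1 else math.log(big_l) / math.log(theta)
    return n_dims * min(log_term, big_l)


print(max_iterations(3, 1_000_000, 1))   # bound N * L
print(max_iterations(3, 1_000_000, 2))   # bound N * log2(L), roughly 60
```

With three dimensions and a million values in the largest one, raising θ from 1 to 2 lowers the worst-case bound from three million iterations to about sixty.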

Theorem 1 (Worst-case Time Complexity). Let $L = \max_{n \in [N]} | ℛ_{n} |$. If $θ = O (e^{(\frac{N | ℛ |}{L})})$, which is a weaker condition than $θ = O (1)$, the worst-case time complexity of Algorithm 1 is

O (k N^{2} | ℛ | \min (\log_{θ} L, L)) . (2)

Proof. From Lemma 1, the number of iterations (lines 5–16) in Algorithm 2 is $O (N \min (\log_{θ} L, L))$. Executing lines 6 and 16 $O (N \min (\log_{θ} L, L))$ times takes $O (N^{2} | ℛ | \min (\log_{θ} L, L))$, which dominates the time complexity of the other parts. For example, repeatedly executing line 9 takes $O (N L \log_{2} L)$, and by our assumption, it is dominated by $O (N^{2} | ℛ | \min (\log_{θ} L, L))$. Thus, the worst-case time complexity of Algorithm 2 is $O (N^{2} | ℛ | \min (\log_{θ} L, L))$, and that of Algorithm 1, which executes Algorithm 2 k times, is $O (k N^{2} | ℛ | \min (\log_{θ} L, L))$. ∎

However, this worst-case time complexity, which allows the worst distributions of the measure attribute values of tuples, is too pessimistic. In Section 4.4, we experimentally show that D-Cube scales linearly with k, N, and $| ℛ |$, and sub-linearly with L even when θ is set to its smallest value 1.

Theorem 2 states the memory requirement of D-Cube. Since the tuples do not need to be stored in memory all at once in D-Cube, its memory requirement does not depend on the number of tuples (i.e., $| ℛ |$).

Theorem 2 (Memory Requirements). The amount of memory space required by Algorithm 1 is $O (\sum_{n = 1}^{N} | ℛ_{n} |)$.

Proof. In Algorithm 1, $\{\{M_{ℬ (a, n)}\}_{a \in ℬ_{n}}\}_{n = 1}^{N}$, $\{ℛ_{n}\}_{n = 1}^{N}$, and $\{ℬ_{n}\}_{n = 1}^{N}$ need to be loaded into memory at once. Each has at most $\sum_{n = 1}^{N} | ℛ_{n} |$ values. Thus, the memory requirement is $O (\sum_{n = 1}^{N} | ℛ_{n} |)$. ∎

3.2.2 Accuracy in Dense-Subtensor Detection

We show that D-Cube gives the same accuracy guarantee as the in-memory algorithms proposed in Shin et al. (2018) if we set θ to 1, although accesses to tuples (stored on disk) are restricted in D-Cube to reduce disk I/Os. Specifically, Theorem 3 states that the subtensor found by Algorithm 2 with the maximum cardinality policy has density at least $\frac{1}{θ N}$ of the optimum when $ρ_{ari}$ is used as the density measure.

Theorem 3 (θN-Approximation Guarantee). Let $ℬ^{*}$ be the subtensor $ℬ$ maximizing $ρ_{ari} (ℬ, ℛ)$ in the given relation $ℛ$. Let $\tilde{ℬ}$ be the subtensor returned by Algorithm 2 with $ρ_{ari}$ and the maximum cardinality policy. Then,

ρ_{ari} (\tilde{ℬ}, ℛ) \geq \frac{1}{θ N} ρ_{ari} (ℬ^{*}, ℛ) .

Proof. First, the maximal subtensor $ℬ^{*}$ satisfies that, for any $i \in [N]$ and for any attribute value $a \in ℬ_{i}^{*}$, its attribute-value mass $M_{ℬ^{*} (a, i)}$ is at least $\frac{1}{N} ρ_{ari} (ℬ^{*}, ℛ)$. This is because the maximality of $ρ_{ari} (ℬ^{*}, ℛ)$ implies $ρ_{ari} (ℬ^{*} - ℬ^{*} (a, i), ℛ) \leq ρ_{ari} (ℬ^{*}, ℛ)$, and plugging in Definition 1 of $ρ_{ari}$ gives

\frac{M_{ℬ^{*}} - M_{ℬ^{*} (a, i)}}{\frac{1}{N} ((\sum_{n = 1}^{N} | ℬ_{n}^{*} |) - 1)} = ρ_{ari} (ℬ^{*} - ℬ^{*} (a, i), ℛ) \leq ρ_{ari} (ℬ^{*}, ℛ) = \frac{M_{ℬ^{*}}}{\frac{1}{N} \sum_{n = 1}^{N} | ℬ_{n}^{*} |} ,

which reduces to

M_{ℬ^{*} (a, i)} \geq \frac{1}{N} ρ_{ari} (ℬ^{*}, ℛ) . (3)

Consider the earliest iteration (lines 5–16) in Algorithm 2 where an attribute value a of $ℬ^{*}$ is included in $D_{i}$. Let $ℬ^{'}$ be $ℬ$ at the beginning of the iteration. Our goal is to prove $ρ_{ari} (\tilde{ℬ}, ℛ) \geq \frac{1}{θ N} ρ_{ari} (ℬ^{*}, ℛ)$, which we will show through the chain

ρ_{ari} (\tilde{ℬ}, ℛ) \geq ρ_{ari} (ℬ^{'}, ℛ) \geq \frac{M_{ℬ^{'} (a, i)}}{θ} \geq \frac{M_{ℬ^{*} (a, i)}}{θ} \geq \frac{1}{θ N} ρ_{ari} (ℬ^{*}, ℛ) .

First, $ρ_{ari} (\tilde{ℬ}, ℛ) \geq ρ_{ari} (ℬ^{'}, ℛ)$ follows from the maximality of $ρ_{ari} (\tilde{ℬ}, ℛ)$ among the densities of the subtensors generated in the iterations of Algorithm 2. Second, applying $| ℬ_{i}^{'} | \geq \frac{1}{N} \sum_{n = 1}^{N} | ℬ_{n}^{'} |$, which holds under the maximum cardinality policy (Algorithm 3), to Definition 1 of $ρ_{ari}$ gives $ρ_{ari} (ℬ^{'}, ℛ) = \frac{M_{ℬ^{'}}}{\frac{1}{N} \sum_{n = 1}^{N} | ℬ_{n}^{'} |} \geq \frac{M_{ℬ^{'}}}{| ℬ_{i}^{'} |}$. Moreover, $a \in D_{i}$ gives $θ \frac{M_{ℬ^{'}}}{| ℬ_{i}^{'} |} \geq M_{ℬ^{'} (a, i)}$. Combining these gives $ρ_{ari} (ℬ^{'}, ℛ) \geq \frac{M_{ℬ^{'} (a, i)}}{θ}$. Third, $\frac{M_{ℬ^{'} (a, i)}}{θ} \geq \frac{M_{ℬ^{*} (a, i)}}{θ}$ follows from $ℬ^{'} \supseteq ℬ^{*}$. Fourth, $\frac{M_{ℬ^{*} (a, i)}}{θ} \geq \frac{1}{θ N} ρ_{ari} (ℬ^{*}, ℛ)$ follows from Eq. (3). Hence, $ρ_{ari} (\tilde{ℬ}, ℛ) \geq \frac{1}{θ N} ρ_{ari} (ℬ^{*}, ℛ)$ holds. ∎

3.2.3 Theoretical Comparison with M-Zoom and M-Biz (Shin et al., 2018)

While D-Cube requires only $O (\sum_{n = 1}^{N} | ℛ_{n} |)$ memory space (see Theorem 2), which does not depend on the number of tuples (i.e., $| ℛ |$ ), M-Zoom and M-Biz require an additional $O (N | ℛ |)$ space for storing all tuples in main memory. The worst-case time complexity of D-Cube is $O (k N^{2} | ℛ | \min (\log_{θ} L, L))$ (see Theorem 1), and it is slightly higher than that of M-Zoom, which is $O (k N | ℛ | \log L)$ . Empirically, however, D-Cube is up to 7× faster than M-Zoom, as we show in Section 4. The main reason is that D-Cube reads and writes tuples only sequentially, allowing efficient caching based on spatial locality. On the other hand, M-Zoom requires tuples to be stored and accessed in hash tables, making efficient caching difficult.¹ The time complexity of M-Biz depends on the number of iterations until reaching a local optimum, and there is no known upper bound on the number of iterations tighter than $O (2^{\sum_{n = 1}^{N} | ℛ_{n} |})$ . If $ρ_{a r i}$ is used, M-Zoom and M-Biz² give an approximation ratio of N, which is the approximation ratio of D-Cube when θ is set to 1 (see Theorem 3).

3.3 MapReduce Implementation

We present our MapReduce implementation of D-Cube, assuming that tuples in relations are stored in a distributed file system. Specifically, we describe four MapReduce algorithms that cover the steps of D-Cube accessing tuples.

(1) Filtering Tuples. In lines 7–8 of Algorithm 1 and line 16 of Algorithm 2, D-Cube filters the tuples satisfying the given conditions. These steps are done by the following map-only algorithm, where we broadcast the data used in each condition (e.g., $\{ℬ_{n}\}_{n = 1}^{N}$ in line 7 of Algorithm 1) to mappers using the distributed cache functionality.

Map-stage: Take a tuple t (i.e., $〈 t [A_{1}], \dots, t [A_{N}], t [X] 〉$ ) and emit t if t satisfies the given condition. Otherwise, the tuple is ignored.
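The map-only filter can be illustrated with a minimal plain-Python sketch. This is a simulation, not Hadoop code and not the authors' implementation: `filter_map`, `inside`, and the toy relation are hypothetical names and data, and the "tuple lies inside a block" predicate is only one example of a condition that depends on broadcast data.

```python
from typing import Callable, Iterable, Iterator, Sequence, Set, Tuple

Record = Tuple  # (a_1, ..., a_N, x): N dimension attribute values, then the measure value


def filter_map(records: Iterable[Record], keep: Callable[[Record], bool]) -> Iterator[Record]:
    """Map-only job: emit t if it satisfies the condition, otherwise ignore it."""
    for t in records:
        if keep(t):
            yield t


def inside(block: Sequence[Set]) -> Callable[[Record], bool]:
    """Hypothetical condition: every dimension attribute value of t lies in the block."""
    return lambda t: all(t[n] in values for n, values in enumerate(block))


relation = [
    ("u1", "p1", "t1", 3),
    ("u1", "p2", "t1", 1),
    ("u2", "p1", "t2", 5),
    ("u3", "p3", "t1", 2),
]
# The block plays the role of the broadcast data (sent to every mapper).
block = [{"u1", "u2"}, {"p1"}, {"t1", "t2"}]
kept = list(filter_map(relation, inside(block)))
print(kept)
```

Because each tuple is tested independently, no shuffle or reduce stage is needed; the only shared state is the small broadcast block.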

(2) Computing Attribute-value Masses. Line 6 of Algorithm 2 is performed by the following algorithm, where we reduce the amount of shuffled data by combining the intermediate results within each mapper.

Map-stage: Take a tuple t (i.e., $〈 t [A_{1}], \dots, t [A_{N}], t [X] 〉$ ) and emit N key/value pairs $\{〈 (n, t [A_{n}]), t [X] 〉\}_{n = 1}^{N}$ .

Combine-stage/Reduce-stage: Take $〈 (n, a), values 〉$ and emit $〈 (n, a), sum (values) 〉$ .

Each tuple $〈 (n, a), value 〉$ of the final output indicates that $M_{ℬ (a, n)} = value$ .
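The map, combine, and reduce stages above can be sketched in plain Python as follows. This is a simulation under stated assumptions rather than Hadoop code: the two input splits are hypothetical, dimension indices are 0-based here (the paper counts from 1), and the same summing function stands in for both the combiner and the reducer.

```python
from collections import defaultdict


def map_stage(t, num_dims):
    """Emit one ((n, t[A_n]), t[X]) pair per dimension attribute."""
    x = t[num_dims]
    for n in range(num_dims):
        yield (n, t[n]), x


def combine(pairs):
    """Sum values per key; used both as the per-mapper combiner and as the reducer."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)


# Two hypothetical input splits, one per mapper.
splits = [
    [("u1", "p1", 3), ("u1", "p2", 1)],
    [("u2", "p1", 5), ("u1", "p1", 2)],
]
N = 2
# Combine-stage: each mapper pre-aggregates its own output.
combined = [combine(kv for t in split for kv in map_stage(t, N)) for split in splits]
# Reduce-stage: aggregate the partial sums across mappers.
masses = combine(kv for partial in combined for kv in partial.items())
print(masses)
```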

(3) Computing Mass. Line 5 of Algorithm 1 can be performed by the following algorithm, where we reduce the amount of shuffled data by combining the intermediate results within each mapper.

Map-stage: Take a tuple t (i.e., $〈 t [A_{1}], \dots, t [A_{N}], t [X] 〉$ ) and emit $〈 0, t [X] 〉$ .

Combine-stage/Reduce-stage: Take $〈 0, values 〉$ and emit $〈 0, sum (values) 〉$ .

The value of the final tuple corresponds to $M_{ℛ}$ .
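To make the benefit of the combine stage concrete, the following plain-Python sketch counts how many records would be shuffled with and without per-mapper combining. The function `total_mass` and the input splits are hypothetical; a real Hadoop job would decide when combiners run.

```python
def total_mass(splits, use_combiner=True):
    """Return (mass, number of shuffled records) for a list of input splits."""
    shuffled = []
    for split in splits:
        emitted = [(0, t[-1]) for t in split]            # map-stage: emit <0, t[X]>
        if use_combiner and emitted:
            emitted = [(0, sum(v for _, v in emitted))]  # combine-stage, per mapper
        shuffled.extend(emitted)
    mass = sum(v for _, v in shuffled)                   # reduce-stage
    return mass, len(shuffled)


splits = [
    [("u1", "p1", 3), ("u1", "p2", 1)],
    [("u2", "p1", 5), ("u1", "p1", 2)],
]
print(total_mass(splits, use_combiner=True))
print(total_mass(splits, use_combiner=False))
```

With combining, the number of shuffled records is bounded by the number of mappers instead of the number of tuples, while the resulting mass is unchanged.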

(4) Computing Attribute-value Sets. Line 2 of Algorithm 1 can be performed by the following algorithm, where we reduce the amount of shuffled data by combining the intermediate results within each mapper.

Map-stage: Take a tuple t (i.e., $〈 t [A_{1}], \dots, t [A_{N}], t [X] 〉$ ) and emit N key/value pairs $\{〈 (n, t [A_{n}]), 0 〉\}_{n = 1}^{N}$ .

Combine-stage/Reduce-stage: Take $〈 (n, a), values 〉$ and emit $〈 (n, a), 0 〉$ .

Each tuple $〈 (n, a), 0 〉$ of the final output indicates that a is a member of $ℛ_{n}$ .
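A minimal plain-Python sketch of this deduplicating job is given below. It is a simulation, not Hadoop code; `attribute_value_sets` and the input splits are hypothetical, and dimension indices are 0-based.

```python
def attribute_value_sets(splits, num_dims):
    """Return [R_1, ..., R_N] (as a 0-based list) from tuples spread over input splits."""
    shuffled = []
    for split in splits:
        # Map-stage emits <(n, t[A_n]), 0>; the combiner keeps one pair per distinct key.
        local_keys = {(n, t[n]) for t in split for n in range(num_dims)}
        shuffled.extend(local_keys)
    sets = [set() for _ in range(num_dims)]
    for n, a in shuffled:  # reduce-stage: one output <(n, a), 0> per distinct key
        sets[n].add(a)
    return sets


splits = [
    [("u1", "p1", 3), ("u1", "p2", 1)],
    [("u2", "p1", 5), ("u1", "p1", 2)],
]
print(attribute_value_sets(splits, num_dims=2))
```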

4 Results and Discussion

We designed and conducted experiments to answer the following questions:

Q1. Memory Efficiency: How much memory space does D-Cube require for analyzing real-world tensors? How large a tensor can D-Cube handle?

Q2. Speed and Accuracy in Dense-subtensor Detection: How rapidly and accurately does D-Cube identify dense subtensors? Does D-Cube outperform its best competitors?

Q3. Scalability: Does D-Cube scale linearly with all aspects of data? Does D-Cube scale out?

Q4. Effectiveness in Anomaly Detection: Which anomalies does D-Cube detect in real-world tensors?

Q5. Effect of θ: How does the mass-threshold parameter θ affect the speed and accuracy of D-Cube in dense-subtensor detection?

Q6. Effect of α: How does the parameter α in density metric $ρ_{e s (α)}$ affect subtensors that D-Cube detects?

4.1 Experimental Settings

4.1.1 Machines

We ran all serial algorithms on a machine with 2.67GHz Intel Xeon E7-8837 CPUs and 1TB memory. We ran MapReduce algorithms on a 40-node Hadoop cluster, where each node has an Intel Xeon E3-1230 3.3GHz CPU and 32GB memory.

4.1.2 Datasets

We describe the real-world and synthetic tensors used in our experiments. Real-world tensors are categorized into four groups: (a) Rating data (SWM, Yelp, Android, Netflix, and YahooM.), (b) Wikipedia revision histories (KoWiki and EnWiki), (c) Temporal social networks (Youtube and SMS), and (d) TCP dumps (DARPA and AirForce). Some statistics of these datasets are summarized in Table 3.

TABLE 3. Summary of real-world datasets.

Rating data. Rating data are relations with schema (user, item, timestamp, score, #ratings). Each tuple (u,i,t,s,r) indicates that user u gave item i score s, r times, at timestamp t. In the SWM dataset (Akoglu et al., 2013), the timestamps are in dates, and the items are entertainment software from a popular online software marketplace. In the Yelp dataset, the timestamps are in dates, and the items are businesses listed on Yelp, a review site. In the Android dataset (McAuley et al., 2015), the timestamps are in hours, and the items are Android apps on Amazon, an online store. In the Netflix dataset (Bennett and Lanning, 2007), the timestamps are in dates, and the items are movies listed on Netflix, a movie rental and streaming service. In the YahooM. dataset (Dror et al., 2012), the timestamps are in hours, and the items are musical items listed on Yahoo! Music, a provider of various music services.

Wikipedia revision history. Wikipedia revision histories are relations with schema (user, page, timestamp, #revisions). Each tuple (u,p,t,r) indicates that user u revised page p, r times, at timestamp t (in hours) in Wikipedia, a crowd-sourced online encyclopedia. In the KoWiki dataset, the pages are from Korean Wikipedia. In the EnWiki dataset, the pages are from English Wikipedia.

Temporal social networks. Temporal social networks are relations with schema (source, destination, timestamp, #interactions). Each tuple (s,d,t,i) indicates that user s interacts with user d, i times, at timestamp t. In the Youtube dataset (Mislove et al., 2007), the timestamps are in hours, and the interactions are becoming friends on Youtube, a video-sharing website. In the SMS dataset, the timestamps are in hours, and the interactions are sending text messages.

TCP Dumps. The DARPA dataset (Lippmann et al., 2000), collected by the Cyber Systems and Technology Group in 1998, is a relation with schema (source IP, destination IP, timestamp, #connections). Each tuple (s,d,t,c) indicates that c connections were made from IP s to IP d at timestamp t (in minutes). The AirForce dataset, used for KDD Cup 1999, is a relation with schema (protocol, service, src bytes, dst bytes, flag, host count, srv count, #connections). The description of each attribute is as follows:

protocol: type of protocol (tcp, udp, etc.).

service: service on destination (http, telnet, etc.).

src bytes: bytes sent from source to destination.

dst bytes: bytes sent from destination to source.

flag: normal or error status.

host count: number of connections made to the same host in the past two seconds.

srv count: number of connections made to the same service in the past two seconds.

#connections: number of connections with the given dimension attribute values.

Synthetic Tensors: We used synthetic tensors for scalability tests. Each tensor was created by generating a random binary tensor and injecting ten random dense subtensors, whose volumes are 10^N and whose densities (in terms of $ρ_{a r i}$ ) are between 10× and 100× that of the entire tensor.
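One way such tensors could be generated is sketched below. This is an illustration under our own assumptions, not the authors' generator: each injected block has 10 values per mode (so its volume is 10^N), the target density is measured against the random background before injection, the block's attribute values are sampled uniformly, and `synthetic_tensor` and all parameter names are hypothetical. The measure $ρ_{a r i}$ is taken as mass divided by the average mode cardinality.

```python
import random


def rho_ari(mass, cardinalities):
    """Arithmetic-average density: mass divided by the mean mode cardinality."""
    return mass / (sum(cardinalities) / len(cardinalities))


def synthetic_tensor(num_dims, cardinality, num_tuples, num_blocks=10, side=10, seed=0):
    rng = random.Random(seed)
    if num_tuples > cardinality ** num_dims:
        raise ValueError("more tuples requested than cells available")
    # Random binary background: a set of distinct coordinates, each with value 1.
    entries = set()
    while len(entries) < num_tuples:
        entries.add(tuple(rng.randrange(cardinality) for _ in range(num_dims)))
    base = rho_ari(len(entries), [cardinality] * num_dims)

    blocks = []
    for _ in range(num_blocks):
        factor = rng.uniform(10, 100)  # target density relative to the background
        modes = [rng.sample(range(cardinality), side) for _ in range(num_dims)]
        target_mass = round(factor * base * side)  # rho_ari of the block = mass / side
        if target_mass > side ** num_dims:
            raise ValueError("a binary block of this volume cannot be that dense")
        cells = set()
        while len(cells) < target_mass:
            cells.add(tuple(rng.choice(mode) for mode in modes))
        entries |= cells
        blocks.append((modes, cells))
    return entries, blocks, base


entries, blocks, base = synthetic_tensor(num_dims=3, cardinality=1000, num_tuples=1000)
for modes, cells in blocks:
    print(len(cells), rho_ari(len(cells), [len(m) for m in modes]) / base)
```

The sketch ignores background entries that happen to fall inside an injected block, which can only make the block slightly denser than the target.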

4.1.3 Implementations

We implemented the following dense-subtensor detection methods for our experiments.

D-Cube (Proposed): We implemented D-Cube in Java with Hadoop 1.2.1. We set the mass-threshold parameter θ to 1 and used the maximum density policy for dimension selection, unless otherwise stated.

M-Zoom and M-Biz (Shin et al., 2018): We used the open-source Java implementations of M-Zoom and M-Biz³. As suggested in Shin et al. (2018), we used the outputs of M-Zoom as the initial states in M-Biz.

CrossSpot (Jiang et al., 2015): We used a Java version of the open-source implementation of CrossSpot⁴. Although CrossSpot was originally designed to maximize $ρ_{s u s p}$ , we used its variants that directly maximize the density metric compared in each experiment. We used CPD as the seed selection method of CrossSpot as in Shin et al. (2018).

CPD (CP Decomposition): Let $\{A^{(n)}\}_{n = 1}^{N}$ be the factor matrices obtained by CP Decomposition (Kolda and Bader, 2009). The ith dense subtensor is composed of every attribute value $a_{n}$ whose corresponding element in the ith column of $A^{(n)}$ is greater than or equal to $1 / \sqrt{| ℛ_{n} |}$ . We used the Tensor Toolbox⁵ for CP Decomposition.
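The thresholding rule can be sketched in a few lines of Python. The factor matrices below are made-up numbers rather than the output of an actual CP Decomposition, `cpd_block` is a hypothetical helper, and attribute values are identified by 0-based row indices. Note that $1 / \sqrt{| ℛ_{n} |}$ is the value every entry would take in a unit-norm column with equal entries, so the rule keeps the rows at or above that uniform level.

```python
import math


def cpd_block(factors, i):
    """Select, per mode, the rows whose entry in column i is >= 1 / sqrt(#rows).

    factors[n] is the |R_n| x rank factor matrix A^(n), given as a list of rows.
    """
    block = []
    for A in factors:
        threshold = 1.0 / math.sqrt(len(A))
        block.append([a for a, row in enumerate(A) if row[i] >= threshold])
    return block


# Hypothetical rank-2 factor matrices for a 2-way tensor with 4 values per mode.
A1 = [[0.7, 0.1], [0.7, 0.1], [0.1, 0.7], [0.1, 0.7]]
A2 = [[0.9, 0.0], [0.3, 0.6], [0.5, 0.8], [0.0, 0.1]]
print(cpd_block([A1, A2], 0))
print(cpd_block([A1, A2], 1))
```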

MAF (Maruhashi et al., 2011): We used the Tensor Toolbox for CP Decomposition, which MAF is largely based on.

4.2 Q1. Memory Efficiency

We compare the amount of memory required by different methods for handling the real-world datasets. As seen in Figure 3, D-Cube, which does not require tuples to be stored in memory, needed up to 1,561× less memory than the second most memory-efficient method, which stores tuples in memory.

FIGURE 3. D-Cube is memory efficient. D-Cube requires up to 1,561× less memory than the second most memory-efficient method.

Due to its memory efficiency, D-Cube successfully handled 1,000× larger data than its competitors within a memory budget. We ran methods on 3-way synthetic tensors with different numbers of tuples (i.e., $| ℛ |$ ), with a memory budget of 16GB per machine. In every tensor, the cardinality of each dimension attribute was $1 / 1000$ of the number of tuples, i.e., $| ℛ_{n} | = | ℛ | / 1000$ , $\forall n \in [N]$ . Figure 1A in Section 1 shows the result. The Hadoop implementation of D-Cube successfully spotted dense subtensors in a tensor with 10¹¹ tuples (2.6TB), and the serial version of D-Cube successfully spotted dense subtensors in a tensor with 10¹⁰ tuples (240GB), which was the largest tensor that could be stored on a disk. However, all other methods ran out of memory even on a tensor with 10⁹ tuples (21GB).

4.3 Q2. Speed and Accuracy in Dense-Subtensor Detection

We compare how rapidly and accurately D-Cube (the serial version) and its competitors detect dense subtensors in the real-world datasets. We measured the wall-clock time (average over three runs) taken for detecting three subtensors by each method, and we measured the maximum density of the three subtensors found by each method using different density measures in Section 2.2. For this experiment, we did not limit the memory budget so that every method can handle every dataset. D-Cube also utilized extra memory space by caching tuples in memory, as explained in Section 3.1.4.

Figure 4 shows the results averaged over all considered datasets.⁶ The results for each dataset can be found in the supplementary material. D-Cube provided the best trade-off between speed and accuracy. Specifically, D-Cube was up to 7× faster (on average 3.6× faster) than the second fastest method, M-Zoom. Moreover, D-Cube with the maximum density policy spotted high-density subtensors consistently regardless of target density measures. Specifically, on average, D-Cube with the maximum density policy was most accurate in dense-subtensor detection when $ρ_{g e o}$ and $ρ_{e s (10)}$ were used; and it was second most accurate when $ρ_{s u s p}$ and $ρ_{e s (1)}$ were used. When $ρ_{a r i}$ was used, M-Zoom, M-Biz, and D-Cube with the maximum cardinality policy were on average more accurate than D-Cube with the maximum density policy. Although MAF does not appear in Figure 4, it consistently provided sparser subtensors than CPD with similar speed.

FIGURE 4. D-Cube rapidly and accurately detects dense subtensors. In each plot, points indicate the densities of subtensors detected by different methods and their running times, averaged over all considered real-world tensors. Upper-left region indicates better performance. D-Cube is about 3.6× faster than the second fastest method, M-Zoom. Moreover, D-Cube with the maximum density policy consistently finds dense subtensors regardless of target density measures.

4.4 Q3. Scalability

We show that D-Cube scales (sub-)linearly with every input factor, i.e., the number of tuples, the number of dimension attributes, the cardinality of dimension attributes, and the number of subtensors that we aim to find. To measure the scalability with each factor, we started with finding a dense subtensor in a synthetic tensor with 10⁸ tuples and 3 dimension attributes, each of whose cardinality is 10⁵. Then, we measured the running time as we changed one factor at a time while fixing the other factors. The threshold parameter θ was fixed to 1. As seen in Figure 5, D-Cube scaled linearly with every factor except the cardinality of attributes, with which it scaled sub-linearly, even when θ was set to its minimum value 1. This supports our claim in Section 3.2.1 that the worst-case time complexity of D-Cube (Theorem 1) is too pessimistic. This linear scalability of D-Cube held both with enough memory budget (blue solid lines in Figure 5) to store all tuples and with minimum memory budget (red dashed lines in Figure 5) to barely meet the requirements, although D-Cube was up to 3× faster in the former case.

FIGURE 5. D-Cube scales (sub-)linearly with all input factors regardless of memory budgets.

We also evaluate the machine scalability of the MapReduce implementation of D-Cube. We measured its running time taken for finding a dense subtensor in a synthetic tensor with 10¹⁰ tuples and 3 dimension attributes each of whose cardinality is 10⁷, as we increased the number of machines running in parallel from 1 to 40. Figure 6 shows the changes in the running time and the speed-up, which is defined as T₁/T_M where T_M is the running time with M machines. The speed-up increased near linearly when a small number of machines were used, while it flattened as more machines were added due to the overhead in the distributed system.

FIGURE 6. D-Cube scales out. The MapReduce implementation of D-Cube is sped up 8× with 10 machines, and 20× with 40 machines.
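The flattening of the speed-up can be quantified with a small sketch. The speed-up T₁/T_M follows the definition in the text and the two speed-up values are those stated in the Figure 6 caption; the parallel efficiency (speed-up divided by the number of machines) is a standard diagnostic that we add here for illustration, and the running times passed to `speed_up` are hypothetical.

```python
def speed_up(t_1, t_m):
    """Speed-up T_1 / T_M of a run on M machines relative to one machine."""
    return t_1 / t_m


# Hypothetical running times (seconds) that would give an 8x speed-up.
print(speed_up(100.0, 12.5))

# Speed-ups stated in the Figure 6 caption: machines -> speed-up.
reported = {10: 8.0, 40: 20.0}
# Parallel efficiency (our added diagnostic): speed-up per machine.
efficiency = {m: s / m for m, s in reported.items()}
print(efficiency)
```

Under this reading, each machine contributes about 80% of its ideal share at 10 machines and about 50% at 40 machines, which matches the described overhead of the distributed system.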

4.5 Q4. Effectiveness in Anomaly Detection

We demonstrate the effectiveness of D-Cube in four applications using real-world tensors.

4.5.1 Network Intrusion Detection from TCP Dumps

D-Cube detected network attacks from TCP dumps accurately by spotting corresponding dense subtensors. We consider two TCP dumps that are modeled differently. The DARPA dataset is a 3-way tensor where the dimension attributes are source IPs, destination IPs, and timestamps in minutes; and the measure attribute is the number of connections. The AirForce dataset, which does not include IP information, is a 7-way tensor where the measure attribute is the same but the dimension attributes are the features of the connections, including protocols and services. Both datasets include labels indicating whether each connection is malicious or not.

Figure 1C in Section 1 lists the five densest subtensors (in terms of $ρ_{g e o}$ ) found by D-Cube in each dataset. Notice that the dense subtensors are mostly composed of various types of network attacks. Based on this observation, we classified each connection as malicious or benign based on the density of the densest subtensor including the connection (i.e., the denser the subtensor including a connection is, the more suspicious the connection is). This led to a high area under the ROC curve (AUROC), as seen in Table 4, where we report the AUROC when each method was used with the density measure giving the highest AUROC. In both datasets, using D-Cube resulted in the highest AUROC.
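The scoring scheme can be sketched as follows. The blocks, densities, connections, and labels are made-up toy data, and `in_block`, `suspiciousness`, and `auroc` are hypothetical helpers; in particular, giving score 0 to connections outside every detected subtensor is our assumption. AUROC is computed directly as the probability that a random malicious connection outscores a random benign one, with ties counted as one half.

```python
def in_block(t, block):
    """True if every dimension attribute value of t lies in the block."""
    return all(a in values for a, values in zip(t, block))


def suspiciousness(t, detected):
    """Density of the densest detected block containing t (0 if none contains it)."""
    return max((d for block, d in detected if in_block(t, block)), default=0.0)


def auroc(scores, labels):
    """P(score of a malicious item > score of a benign item), ties counted as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy detected blocks over (source IP, destination IP, minute), with their densities.
detected = [
    ([{"s1"}, {"d1"}, {1, 2}], 9.0),
    ([{"s1", "s2"}, {"d1", "d2"}, {1, 2, 3}], 4.0),
]
connections = [("s1", "d1", 1), ("s1", "d1", 2), ("s2", "d2", 3), ("s3", "d1", 1), ("s4", "d4", 9)]
labels = [1, 1, 0, 0, 1]  # 1 = malicious, 0 = benign (made up)
scores = [suspiciousness(t, detected) for t in connections]
print(scores, auroc(scores, labels))
```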

TABLE 4. D-Cube spots network attacks and synchronized behavior fastest and most accurately from TCP dumps and rating datasets, respectively.

4.5.2 Synchronized Behavior Detection in Rating Data

D-Cube spotted suspicious synchronized behavior accurately in rating data. Specifically, we assume an attack scenario where fraudsters in a review site, who aim to boost (or lower) the ratings of a set of items, create multiple user accounts and give the same score to the items within a short period of time. This lockstep behavior forms a dense subtensor with volume (# fake accounts × # target items × 1 × 1) in the rating dataset, whose dimension attributes are users, items, timestamps, and rating scores.

We injected 10 such random dense subtensors whose volumes varied from 15×15×1×1 to 60×60×1×1 in the Yelp and Android datasets. We compared the ratio of the injected subtensors detected by each dense-subtensor detection method. We considered each injected subtensor as overlooked by a method if the subtensor did not belong to any of the top-10 dense subtensors spotted by the method or it was hidden in a natural dense subtensor at least 10 times larger than the injected subtensor. That is, we measured the recall at top 10. We repeated this experiment 10 times, and the averaged results are summarized in Table 4. For each method, we report the results with the density measure giving the highest recall. In both datasets, D-Cube detected the largest number of the injected subtensors. In particular, in the Android dataset, D-Cube detected 9 out of the 10 injected subtensors, while the second best method detected only 7 injected subtensors on average.
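The recall-at-top-10 criterion can be sketched as below. Two points are our assumptions rather than details given in the text: an injected subtensor "belongs to" a detected one when each of its modes is a subset of the corresponding detected mode, and "10 times larger" is measured by volume. The helper names and toy blocks are hypothetical.

```python
from math import prod


def volume(block):
    """Number of cells in a block given as a list of per-mode value sets."""
    return prod(len(values) for values in block)


def is_detected(injected, top_blocks, hide_factor=10):
    """Detected if some top block contains it and is less than hide_factor times larger."""
    for block in top_blocks:
        contains = all(inj <= det for inj, det in zip(injected, block))
        if contains and volume(block) < hide_factor * volume(injected):
            return True
    return False


def recall_at_k(injected_blocks, detected_blocks, k=10):
    """Fraction of injected blocks detected among the k densest detected blocks."""
    top = detected_blocks[:k]
    return sum(is_detected(inj, top) for inj in injected_blocks) / len(injected_blocks)


injected_blocks = [[{1, 2}, {1, 2}], [{5, 6}, {7, 8}]]
detected_blocks = [
    [{1, 2, 3}, {1, 2}],                                 # tightly contains the first block
    [{5, 6, 9, 10, 11, 12, 13}, {7, 8, 20, 21, 22, 23}],  # contains the second, but too large
]
print(recall_at_k(injected_blocks, detected_blocks))
```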

4.5.3 Spam-Review Detection in Rating Data

D-Cube successfully spotted spam reviews in the SWM dataset, which contains reviews from an online software marketplace. We modeled the SWM dataset as a 4-way tensor whose dimension attributes are users, software, ratings, and timestamps in dates, and we applied D-Cube (with $ρ = ρ_{a r i}$ ) to the dataset. Table 6 shows the statistics of the top-3 dense subtensors. Although ground-truth labels were not available, as the examples in Table 5 show, all the reviews composing the first and second dense subtensors were obvious spam reviews. In addition, at least 48% of the reviews composing the third dense subtensor were obvious spam reviews.

TABLE 5. D-Cube successfully detects spam reviews in the SWM dataset.

4.5.4 Anomaly Detection in Wikipedia Revision Histories

D-Cube detected interesting anomalies in Wikipedia revision histories, which we model as 3-way tensors whose dimension attributes are users, pages, and timestamps in hours. Table 6 gives the statistics of the top-3 dense subtensors detected by D-Cube (with $ρ = ρ_{a r i}$ and the maximum cardinality policy) in the KoWiki dataset and by D-Cube (with $ρ = ρ_{g e o}$ and the maximum density policy) in the EnWiki dataset. All three subtensors detected in the KoWiki dataset indicated edit wars. For example, the second subtensor corresponded to an edit war where 4 users changed 4 pages, 1,011 times, within 5 h. On the other hand, all three subtensors detected in the EnWiki dataset indicated bot activities. For example, the third subtensor corresponded to 3 bots which edited 1,067 pages 973,747 times. The users composing the top-5 dense subtensors in the EnWiki dataset are listed in Table 7. Notice that all of them are bots.

TABLE 6. Summary of the dense subtensors that D-Cube detects in the SWM, KoWiki, and EnWiki datasets.

TABLE 7. D-Cube successfully spots bot activities in the EnWiki dataset.

4.6 Q5. Effects of Parameter θ on Speed and Accuracy in Dense-Subtensor Detection

We investigate the effects of the mass-threshold parameter θ on the speed and accuracy of D-Cube in dense-subtensor detection. We used the serial version of D-Cube with a memory budget of 16GB, and we measured the relative density of detected subtensors and its running time, as in Section 4.3. Figure 7 shows the results averaged over all considered datasets. Different θ values provided a trade-off between speed and accuracy in dense-subtensor detection. Specifically, increasing θ tended to make D-Cube faster but also make it detect sparser subtensors. This tendency is consistent with our theoretical analyses (Theorems 1–3 in Section 3.2). The sensitivity of the dense-subtensor detection accuracy to θ depended on the used density measures. Specifically, the sensitivity was lower with $ρ_{e s (α)}$ than with the other density measures.

FIGURE 7. The mass-threshold parameter θ gives a trade-off between the speed and accuracy of D-Cube in dense-subtensor detection. We report the running time and the density of detected subtensors, averaged over all considered real-world datasets. As θ increases, D-Cube tends to be faster, detecting sparser subtensors.

4.7 Q6. Effects of Parameter α in $ρ_{e s (α)}$ on Subtensors Detected by D-Cube

We show that the dense subtensors detected by D-Cube are configurable by the parameter α in density measure $ρ_{e s (α)}$ . Figure 8 shows the volumes and masses of subtensors detected in the Youtube and Yelp datasets by D-Cube when $ρ_{e s (α)}$ with different α values were used as the density metrics. With large α values, D-Cube tended to spot relatively small but compact subtensors. With small α values, however, D-Cube tended to spot relatively sparse but large subtensors. Similar tendencies were obtained with the other datasets.
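The effect of α can be illustrated with a toy calculation. The sketch assumes that the entry-surplus measure of Definition 3 has the form $ρ_{e s (α)} (ℬ, ℛ) = M_{ℬ} - α \frac{M_{ℛ}}{V_{ℛ}} V_{ℬ}$, where V denotes volume, i.e., the mass of the subtensor minus α times its expected mass under a uniform background. The masses, volumes, and helper names below are made up for illustration.

```python
def entry_surplus(mass_b, vol_b, mass_r, vol_r, alpha):
    """Observed mass of the block minus alpha times its expected mass."""
    return mass_b - alpha * mass_r * vol_b / vol_r


MASS_R, VOL_R = 10_000, 1_000_000  # made-up whole-tensor mass and volume
# Made-up candidate blocks: name -> (mass, volume).
candidates = {"small, compact": (90, 100), "large, sparse": (400, 5_000)}


def best(alpha):
    """Name of the candidate with the highest entry surplus for the given alpha."""
    return max(candidates, key=lambda name: entry_surplus(*candidates[name], MASS_R, VOL_R, alpha))


for alpha in (1, 10):
    print(alpha, best(alpha))
```

With α = 1 the large block wins on raw mass, while with α = 10 the volume penalty dominates and the small, compact block is preferred, which is the tendency described above.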

FIGURE 8. Subtensors detected by D-Cube are configurable by the parameter α in density metric $ρ_{e s (α)}$ . As α increases, D-Cube spots smaller but more compact subtensors.

5 Conclusion

In this work, we propose D-Cube, a disk-based dense-subtensor detection method, to deal with disk-resident tensors too large to fit in main memory. D-Cube is optimized to minimize disk I/Os while providing a guarantee on the quality of the subtensors it finds. Moreover, we propose a distributed version of D-Cube running on MapReduce for terabyte-scale or larger data distributed across multiple machines. In summary, D-Cube achieves the following advantages over its state-of-the-art competitors:

Memory Efficient: D-Cube handles 1,000× larger data (2.6TB) by reducing memory usage up to 1,561× compared to in-memory algorithms (Section 4.2).

Fast: Even when data fit in memory, D-Cube is up to 7× faster than its competitors (Section 4.3) with near-linear scalability (Section 4.4).

Provably Accurate: D-Cube is one of the methods guaranteeing the best approximation ratio (Theorem 3) in dense-subtensor detection and spotting the densest subtensors in practice (Section 4.3).

Effective: D-Cube was most accurate in two applications: detecting network attacks from TCP dumps and lockstep behavior in rating data (Section 4.5).

Reproducibility: The code and data used in the paper are available at http://dmlab.kaist.ac.kr/dcube

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: http://dmlab.kaist.ac.kr/dcube.

Author Contributions

KS, BH, and CF contributed to conception and design of the study. KS performed the experiments. JK performed the mathematical analysis. KS wrote the first draft of the manuscript. KS and BH wrote sections of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

Funding

This research was supported by National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. NRF-2020R1C1C1008296) and Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)). This research was also supported by the National Science Foundation under Grant Nos. CNS-1314632 and IIS-1408924. This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The content of the manuscript has been presented in part at the 10th ACM International Conference on Web Search and Data Mining (Shin et al., 2017b). In this extended version, we refined D-Cube with a new parameter θ, and we proved that the time complexity of D-Cube is significantly improved with the refinement (Lemma 1 and Theorem 1). We also proved that, for N-way tensors, D-Cube gives a θN-approximation guarantee for Problem 1 (Theorem 3). Additionally, we considered an extra density measure (Definition 3) and an extra competitor (i.e., M-Biz); and we applied D-Cube to three more real-world datasets (i.e., KoWiki, EnWiki, and SWM) and successfully detected edit wars, bot activities, and spam reviews (Tables 5–7). Lastly, we conducted experiments showing the effects of parameters θ and α on the speed and accuracy of D-Cube in dense-subtensor detection (Figures 7 and 8). Most of this work was also included in the PhD thesis of the first author (KS).

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fdata.2020.594302/full#supplementary-material

Footnotes

¹M-Zoom repeats retrieving all tuples with a given attribute value, and thus it requires storing and accessing tuples in hash tables for quick retrievals.

²We assume that M-Biz uses the outputs of M-Zoom as its initial states, as suggested in Shin et al. (2018).

³https://github.com/kijungs/mzoom

⁴https://github.com/mjiang89/CrossSpot

⁵https://www.sandia.gov/tgkolda/TensorToolbox/

⁶In each dataset, we measured the relative running time of each method (compared to the running time of D-Cube with the maximum density policy) and the relative density of detected dense subtensors (compared to the density of subtensors detected by D-Cube with the maximum density policy). Then, we averaged them over all considered datasets.

References

Akoglu, L., Chandy, R., and Faloutsos, C. (2013). Opinion fraud detection in online reviews by network effects. ICWSM.

Akoglu, L., McGlohon, M., and Faloutsos, C. (2010). Oddball: spotting anomalies in weighted graphs. PAKDD.

Akoglu, L., Tong, H., and Koutra, D. (2015). Graph based anomaly detection and description: a survey. Data Mining Knowl. Discov. 29, 626–688. doi:10.1201/b15352-15

Andersen, R., and Chellapilla, K. (2009). Finding dense subgraphs with size bounds. WAW.

Bahmani, B., Goel, A., and Munagala, K. (2014). Efficient primal-dual graph algorithms for mapreduce. WAW.

Bahmani, B., Kumar, R., and Vassilvitskii, S. (2012). Densest subgraph in streaming and mapreduce. PVLDB 5, 454–465. doi:10.14778/2140436.2140442

Balalau, O. D., Bonchi, F., Chan, T., Gullo, F., and Sozio, M. (2015). Finding subgraphs with maximum total density and limited overlap. WSDM.

Bennett, J., and Lanning, S. (2007). The netflix prize. KDD Cup.

Beutel, A., Xu, W., Guruswami, V., Palow, C., and Faloutsos, C. (2013). Copycatch: stopping group attacks by spotting lockstep behavior in social networks. WWW.

Charikar, M. (2000). Greedy approximation algorithms for finding dense components in a graph. APPROX.

Dean, J., and Ghemawat, S. (2008). Mapreduce: simplified data processing on large clusters. Commun. ACM 51, 107–113. doi:10.21276/ijre.2018.5.5.4

Dror, G., Koenigstein, N., Koren, Y., and Weimer, M. (2012). The yahoo! music dataset and kdd-cup’11. KDD Cup.

Epasto, A., Lattanzi, S., and Sozio, M. (2015). Efficient densest subgraph computation in evolving graphs. WWW.

Galbrun, E., Gionis, A., and Tatti, N. (2016). Top-k overlapping densest subgraphs. Data Mining Knowl. Discov. 30, 1134–1165. doi:10.1007/s10618-016-0464-z

Goldberg, A. V. (1984). Finding a maximum density subgraph. Technical Report.

Hooi, B., Shin, K., Song, H. A., Beutel, A., Shah, N., and Faloutsos, C. (2017). Graph-based fraud detection in the face of camouflage. ACM Trans. Knowl. Discov. Data 11, 44. doi:10.1145/3056563

Jeon, I., Papalexakis, E. E., Kang, U., and Faloutsos, C. (2015). Haten2: billion-scale tensor decompositions. ICDE, 1047–1058.

Jiang, M., Beutel, A., Cui, P., Hooi, B., Yang, S., and Faloutsos, C. (2015). A general suspiciousness metric for dense blocks in multimodal data. ICDM.

Jiang, M., Cui, P., Beutel, A., Faloutsos, C., and Yang, S. (2014). Catchsync: catching synchronized behavior in large directed graphs. KDD.

Kang, U., Papalexakis, E., Harpale, A., and Faloutsos, C. (2012). Gigatensor: scaling tensor analysis up by 100 times-algorithms and discoveries. KDD.

Kannan, R., and Vinay, V. (1999). Analyzing the structure of large graphs. Technical Report.

Khuller, S., and Saha, B. (2009). On finding dense subgraphs. ICALP, 597–608.

Kolda, T. G., and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Rev. 51, 455–500. doi:10.2172/755101

Lee, V. E., Ruan, N., Jin, R., and Aggarwal, C. (2010). A survey of algorithms for dense subgraph discovery. Managing and Mining Graph Data, 303–336.

Lippmann, R. P., Fried, D. J., Graf, I., Haines, J. W., Kendall, K. R., McClung, D., et al. (2000). Evaluating intrusion detection systems: the 1998 darpa off-line intrusion detection evaluation. DISCEX.

Maruhashi, K., Guo, F., and Faloutsos, C. (2011). Multiaspectforensics: pattern mining on large-scale heterogeneous networks with tensor analysis. ASONAM.

McAuley, J., Pandey, R., and Leskovec, J. (2015). Inferring networks of substitutable and complementary products. KDD.

Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P., and Bhattacharjee, B. (2007). Measurement and analysis of online social networks. IMC.

Oh, J., Shin, K., Papalexakis, E. E., Faloutsos, C., and Yu, H. (2017). S-hot: scalable high-order tucker decomposition. WSDM.

Papalexakis, E. E., Faloutsos, C., and Sidiropoulos, N. D. (2012). Parcube: sparse parallelizable tensor decompositions. PKDD.

Rossi, R. A., Gallagher, B., Neville, J., and Henderson, K. (2013). Modeling dynamic behavior in large evolving graphs. WSDM.

Ruhl, J. M. (2003). Efficient algorithms for new computational models. Ph.D. thesis, Massachusetts Institute of Technology.

Saha, B., Hoch, A., Khuller, S., Raschid, L., and Zhang, X. N. (2010). Dense subgraphs with restrictions and applications to gene annotation graphs. RECOMB.

Shah, N., Beutel, A., Gallagher, B., and Faloutsos, C. (2014). Spotting suspicious link behavior with fBox: an adversarial perspective. ICDM.

Shin, K., Eliassi-Rad, T., and Faloutsos, C. (2016). Corescope: graph mining using k-core analysis—patterns, anomalies and algorithms. ICDM.

Shin, K., Hooi, B., and Faloutsos, C. (2018). Fast, accurate, and flexible algorithms for dense subtensor mining. ACM Trans. Knowledge Discov. Data 12, 28. doi:10.1145/3154414

Shin, K., Hooi, B., Kim, J., and Faloutsos, C. (2017b). D-Cube: dense-block detection in terabyte-scale tensors. WSDM.

Shin, K., Hooi, B., Kim, J., and Faloutsos, C. (2017a). DenseAlert: incremental dense-subtensor detection in tensor streams. KDD.

Shin, K., and Kang, U. (2014). Distributed methods for high-dimensional and large-scale tensor factorization. ICDM.

Tsourakakis, C., Bonchi, F., Gionis, A., Gullo, F., and Tsiarli, M. (2013). Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. KDD.

Wang, Y., Tung, H. Y., Smola, A. J., and Anandkumar, A. (2015). Fast and guaranteed tensor decomposition via sketching. NIPS.

Keywords: tensor, dense subtensor, anomaly detection, fraud detection, out-of-core algorithm, distributed algorithm

Citation: Shin K, Hooi B, Kim J and Faloutsos C (2021) Detecting Group Anomalies in Tera-Scale Multi-Aspect Data via Dense-Subtensor Mining. Front. Big Data 3:594302. doi: 10.3389/fdata.2020.594302

Received: 13 August 2020; Accepted: 17 December 2020;
Published: 29 April 2021.

Edited by:

Meng Jiang, University of Notre Dame, United States

Reviewed by:

Kai Shu, Illinois Institute of Technology, United States
Kun Kuang, Zhejiang University, China
Tong Zhao, University of Notre Dame, United States

Copyright © 2021 Shin, Hooi, Kim and Faloutsos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Kijung Shin, kijungs@kaist.ac.kr

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
