MODIT: MOtif DIscovery in Temporal Networks

Temporal networks are graphs in which each edge is associated with a timestamp denoting when an interaction between two nodes happens. According to the most recently proposed definitions of the problem, motif search in temporal networks consists in finding and counting all connected temporal graphs Q (called motifs) occurring in a larger temporal network T, such that matched target edges follow the same chronological order imposed by edges in Q. In the last few years, several algorithms have been proposed to solve motif search, but most of them are limited to very small or specific motifs due to the computational complexity of the problem. In this paper, we present MODIT (MOtif DIscovery in Temporal Networks), an algorithm for counting motifs of any size in temporal networks, inspired by a very recent algorithm for subgraph isomorphism in temporal networks, called TemporalRI. Experiments show that MODIT can retrieve big motifs (more than 3 nodes and 3 edges) in reasonable time (up to a few hours) in many networks of medium and large size and outperforms state-of-the-art algorithms.


INTRODUCTION AND RELATED WORKS
Networks (also named graphs) are tools for the description and analysis of entities, called nodes, that interact with each other by means of edges. Many types of data can be represented by graphs, including computer networks, social networks, communication networks, and biological networks. A wide range of domains can be modeled and studied with static networks, but many complex systems are fully dynamic: interactions between entities change over time. Systems of this type can be modeled as temporal networks, in which edges between nodes are associated with temporal information such as the duration of the interaction and the instant at which the interaction begins. Annotating edges with temporal data is important to understand the formation and the evolution of such systems.
In the literature, several definitions of temporal networks have been proposed (Holme and Saramaki, 2012; Masuda and Lambiotte, 2020). In a few works, these are also referred to as dynamic (Carley et al., 2007), evolutionary (Aggarwal and Subbian, 2014) or time-varying (Casteigts et al., 2011) networks. In this paper, we define a temporal network as a multigraph (i.e., a graph where two nodes may interact multiple times). Each edge is associated with an integer, called timestamp, which denotes when two nodes interact.
In this work, we focus on motif search in temporal networks. Different definitions of temporal motifs have been proposed so far (Kovanen et al., 2011; Hulovatyy et al., 2015; Paranjape et al., 2017). Here, we follow the most recent definition proposed by Paranjape et al. (2017), which is becoming the most accepted one. A temporal motif is a temporal network where edges denote a succession of events. In addition to the original definition proposed by Paranjape et al. (2017), simultaneous events, represented by edges with equal timestamps, are allowed, provided that such edges do not link the same pair of nodes. Temporal graphs Q1 and Q2 of Figure 1 are two examples of motifs. Applications of Temporal Motif Search include the creation of evolution rules that govern the way the network changes over time (Berlingerio et al., 2009; Ugander et al., 2013), which also makes it possible to identify every time an edge participates in a particular pattern within a time window. A second application is the identification of motifs in a temporal network at different time resolutions, to identify patterns at different time scales. Another application is temporal network classification using a feature representation based on the distribution of temporal motifs (Tu et al., 2018).
Given a time interval Δ, we say that a motif Q Δ-occurs in T iff: (i) Q is isomorphic (i.e., structurally equivalent) to a subgraph S of T (called an occurrence of Q in T), (ii) edges in S follow the same chronological order imposed by corresponding matched edges in Q, (iii) all interactions in S are observed in a time interval less than or equal to Δ (i.e., they are likely to be related to each other). In Figure 1, motif Q1 Δ-occurs (Δ = 6 in the example) in T, while Q2 does not.
For a given temporal graph T and time interval Δ, motif search aims at retrieving all motifs that Δ-occur in T. In addition, for each such motif Q, we also count the number of occurrences of Q in T. It has been shown that the Temporal Motif Search (TMS) problem is NP-complete, even for star topologies (Liu et al., 2019). For this reason, motif search is usually restricted to motifs with up to a certain number of nodes and edges. Given Δ = 10, Figure 2 shows all temporal motifs with up to 3 nodes and 3 edges that Δ-occur in a toy temporal graph, together with the corresponding number of occurrences.
Temporal motifs were first introduced by Kovanen et al. (2011). The authors define a motif as an ordered set of edges such that: (i) the difference between the timestamps of two consecutive edges must be less than or equal to a certain threshold Δ and (ii) if a node is part of a motif, then all its adjacent edges have to be consecutive (consecutive edge restriction).
In Hulovatyy et al. (2015) the consecutive edge restriction was relaxed and the authors considered only induced subgraphs, called graphlets, in order to reduce the computational complexity while obtaining approximate results. Paranjape et al. (2017) describe a temporal motif as a sequence of edges ordered by increasing timestamps. More precisely, the authors define a k-node, l-edge, Δ-temporal motif as a sequence of l edges, $M = (u_1, v_1, t_1), (u_2, v_2, t_2), \ldots, (u_l, v_l, t_l)$, that are time-ordered within a duration Δ, i.e., $t_1 < t_2 < \cdots < t_l$ and $t_l - t_1 \le \Delta$, such that the static graph induced by the edges is connected and has k nodes. The authors present an algorithm to efficiently calculate the frequencies of all possible directed temporal motifs with 3 edges. For bigger motifs they use a naive algorithm that first computes static matches, then filters out occurrences which do not match the temporal constraints.
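The three requirements in the definition of Paranjape et al. (2017) quoted above (strict time ordering, total duration within Δ, connected induced static graph) can be checked directly on a candidate edge sequence. The following is a minimal sketch; the function name `is_delta_temporal_sequence` and the tuple layout `(source, destination, timestamp)` are illustrative assumptions, not part of any published implementation.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, int]  # (source, destination, timestamp), assumed layout


def is_delta_temporal_sequence(edges: List[Edge], delta: int) -> bool:
    """Return True if `edges` is strictly time-ordered, spans at most `delta`,
    and its induced static graph (ignoring direction) is connected."""
    if not edges:
        return False
    times = [t for _, _, t in edges]
    strictly_increasing = all(a < b for a, b in zip(times, times[1:]))
    within_window = times[-1] - times[0] <= delta

    # Build an undirected adjacency view of the induced static graph.
    adj: Dict[str, Set[str]] = {}
    for u, v, _ in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Depth-first traversal from an arbitrary node to test connectivity.
    start = next(iter(adj))
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)

    return strictly_increasing and within_window and len(seen) == len(adj)
```

For instance, a triangle whose three edges carry timestamps 1, 3 and 5 satisfies the definition for Δ = 4 but not for Δ = 3.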
To tackle the NP-completeness of TMS, approximate solutions have been proposed too. Liu et al. (2019) propose a general sampling framework to estimate motif counts. It consists in partitioning time into intervals, finding exact counts of motifs in each interval and weighting counts to get the final estimate, using importance sampling.
In this paper, we present a new motif search algorithm, called MODIT (MOtif DIscovery in Temporal networks). The method is inspired by the temporal subgraph matching algorithm TemporalRI (Micale et al., 2021). Our algorithm overcomes many of the limitations imposed by other motif search methods. In fact, MODIT is general and can search for motifs of any size. It has no consecutive edge restriction and allows edges with equal timestamps, provided that they do not link the same pair of nodes.
The rest of the paper is organized as follows. In section 2, we give preliminary definitions about temporal networks and temporal motif search, then we illustrate MODIT and evaluate its computational complexity. In section 3, we assess the performance of MODIT on a dataset of real networks and compare it with the algorithm presented in Paranjape et al. (2017). Finally, section 4 ends the paper.

Preliminary Definitions
In this section, we formally define the concepts of temporal graph and temporal motif, then we introduce the temporal motif search problem.

Temporal Graph
A temporal graph (or network) G is a pair of sets (V, E), where V is the set of nodes and E ⊆ V × V × ℝ is the set of edges. Each edge is a triple (s, d, t) where s is the source node, d is the destination node and t is the timestamp, denoting the moment or the time interval in which the two nodes interact.
By definition, a temporal graph is a multigraph, because there can be multiple edges between two nodes. However, triplets in E are distinct, therefore such edges need to have different timestamps. With e.source, e.dest, and e.time we denote the source, the destination and the timestamp of edge e, respectively. A temporal graph G = (V, E) is undirected if ∀(s, d, t) ∈ E(G) we have (d, s, t) ∈ E(G), otherwise it is directed. With Inc(v) we denote the set of all edges that are incident to node v, i.e., having v as source or destination.
FIGURE 1 | Example of motif Δ-occurrence in a temporal graph T, given Δ = 6. Motif Q1 has exactly one Δ-occurrence in T, which is the subgraph formed by nodes and edges colored in red. Motif Q2, instead, does not Δ-occur in T. In fact, the subgraph with blue nodes and blue edges is isomorphic to Q2 and respects the chronological order imposed by Q2's edges, but its edges are not observed within the time window Δ.
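The definitions above map naturally onto a small data structure. The sketch below is one possible representation, not the one used in the MODIT implementation; the class name `TemporalGraph` and its methods are hypothetical. Edges are stored as a set of distinct triples, so the multigraph property and the distinctness of triplets follow directly, and `inc(v)` plays the role of Inc(v).

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Edge(NamedTuple):
    source: str
    dest: str
    time: int


class TemporalGraph:
    """Temporal multigraph stored as a set of distinct (source, dest, time) triples."""

    def __init__(self, edges: Iterable[Tuple[str, str, int]] = ()):
        self.nodes: Set[str] = set()
        self.edges: Set[Edge] = set()
        self._inc: Dict[str, Set[Edge]] = defaultdict(set)
        for s, d, t in edges:
            self.add_edge(s, d, t)

    def add_edge(self, s: str, d: str, t: int) -> None:
        e = Edge(s, d, t)
        self.edges.add(e)          # identical triples collapse into one edge
        self.nodes.update((s, d))
        self._inc[s].add(e)
        self._inc[d].add(e)

    def inc(self, v: str) -> Set[Edge]:
        """Inc(v): all edges having v as source or destination."""
        return set(self._inc.get(v, set()))

    def is_undirected(self) -> bool:
        """True if every edge (s, d, t) has its reverse (d, s, t) in E."""
        return all(Edge(e.dest, e.source, e.time) in self.edges for e in self.edges)
```

Two interactions between the same pair of nodes are kept as separate edges as long as their timestamps differ, which is exactly the multigraph behavior described in the text.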

Temporal Motif
Let Q = (V, E) be a connected temporal graph with l edges and $(t_1, t_2, \ldots, t_l)$ the sequence of Q's edge timestamps in ascending order. Q is a temporal motif iff: (i) $t_1 = 1$, and (ii) $t_{i+1} - t_i \le 1$ for all $1 \le i < l$. In other words, a temporal motif Q can be considered a sort of standardized temporal graph, in which edge timestamps denote the order in which events happen, starting from the initial event (i.e., event 1). Edges with equal timestamps (if any) in Q represent simultaneous events. Examples of temporal motifs are the graphs Q1 and Q2 depicted in Figure 1.
To establish if a temporal graph contains a temporal motif, we need to introduce the concept of Temporal Subgraph Isomorphism. We follow the definition reported in Micale et al. (2021).

Temporal Subgraph Isomorphism
Given two temporal graphs $Q = (V_Q, E_Q)$ and $T = (V_T, E_T)$, called, respectively, query and target, and an integer Δ, the Temporal Subgraph Isomorphism (TSI) problem consists in finding an injective function $f : V_Q \to V_T$, called node mapping, and an injective function $g : E_Q \to E_T$, called edge mapping, such that the following conditions hold: (1) $\forall (u, v, t) \in E_Q$, $g((u, v, t)) = (f(u), f(v), t')$ for some timestamp $t'$; (2) $\forall q_1, q_2 \in E_Q$, $q_1.time \le q_2.time \Rightarrow g(q_1).time \le g(q_2).time$; (3) $\forall q_1, q_2 \in E_Q$, $|g(q_1).time - g(q_2).time| \le \Delta$. The first condition ensures that the edge mapping is consistent with the node mapping. The second condition requires that the chronological order between query edges is respected in the target network. The third condition imposes that all matching target edges are observed within a fixed time interval Δ.
The TSI problem can have one or more solutions. When it has at least one, we say that Q Δ-occurs in T. Given an edge mapping g between Q and T, a Δ-occurrence of Q in T is a temporal graph S formed by edges $g(q_1), g(q_2), \ldots, g(q_k)$ and all nodes that are sources or destinations of at least one of these edges.
In Figure 1, given Δ = 6, query Q1 Δ-occurs in target T and the corresponding occurrence is the subgraph of T highlighted in red. Query Q2, instead, has no Δ-occurrences in T. Indeed, there is only one subgraph of T (highlighted in blue) that is isomorphic to Q2, but it violates the constraint on edge timestamps.
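To make the three TSI conditions concrete, the sketch below verifies a given pair of mappings rather than searching for one. It is an illustration under stated assumptions: the name `is_delta_occurrence` is hypothetical, edges are `(source, destination, timestamp)` tuples, and the chronological-order condition is read as non-strict (an earlier-or-simultaneous query edge must map to an earlier-or-simultaneous target edge).

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, int]  # (source, destination, timestamp), assumed layout


def is_delta_occurrence(
    query_edges: Iterable[Edge],
    target_edges: Set[Edge],
    f: Dict[str, str],      # node mapping: query node -> target node
    g: Dict[Edge, Edge],    # edge mapping: query edge -> target edge
    delta: int,
) -> bool:
    query_edges = list(query_edges)
    # Both mappings must be injective.
    if len(set(f.values())) != len(f) or len(set(g.values())) != len(g):
        return False
    for q in query_edges:
        image = g[q]
        # Matched edges must exist in the target.
        if image not in target_edges:
            return False
        # Condition 1: the edge mapping agrees with the node mapping.
        if (image[0], image[1]) != (f[q[0]], f[q[1]]):
            return False
    # Condition 2: chronological order of query edges is preserved (non-strict reading).
    for q1 in query_edges:
        for q2 in query_edges:
            if q1[2] <= q2[2] and not g[q1][2] <= g[q2][2]:
                return False
    # Condition 3: all matched target edges fall within a window of size delta.
    times = [g[q][2] for q in query_edges]
    return max(times) - min(times) <= delta
```

As in the Figure 1 discussion, a mapping can be structurally valid and order-preserving and still fail only because the matched edges span more than Δ.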
Finally, we define the temporal motif search problem.

Temporal Motif Search
Given a temporal graph $T = (V_T, E_T)$ and three integers k, l and Δ, the Temporal Motif Search (TMS) problem consists in: (i) retrieving all temporal motifs that Δ-occur in T and have at most k nodes and l edges, (ii) counting the number of Δ-occurrences of such motifs. An example of application of the TMS problem is shown in Figure 2, where k = 3, l = 3 and Δ = 10.

The MODIT Algorithm
In what follows we introduce a new algorithm for solving the TMS problem, called MODIT (MOtif DIscovery in Temporal networks).
Given three parameters k, l, and Δ, MODIT scans a temporal graph T to retrieve all temporal motifs with at most k nodes and l edges which Δ-occur in T and counts the number of Δ-occurrences of each motif in T.
Algorithm 1 | MODIT(G, Δ, k, l). 1 Let motifMap be an empty hash map. 2 Let timestampSet be an empty multiset.
For each newly identified occurrence, the algorithm performs the following steps: 1. Standardization of edge timestamps; 2. Construction of the canonical form and identification of the corresponding temporal motif; 3. Update of the count of the number of motif occurrences.
The search starts from the smallest motif occurrences, formed by single edges. We call these edges seeds. Each of these occurrences is then recursively extended by adding one edge at a time until the specified maximum number of nodes and edges is reached.
MODIT can work on both undirected and directed graphs. For ease of presentation, we illustrate the functioning of the algorithm for undirected networks. However, all the procedures presented here can be easily adapted to directed networks.
The pseudocode of MODIT is reported in Algorithm 1. All motifs retrieved by the algorithm, together with the number of their occurrences, are stored in a hash map motifMap, empty at the beginning (line 1). Each motif in the map is uniquely represented by a string, called canonical form. In addition, since the same occurrence of a motif may be examined multiple times, we also need to store all the distinct retrieved subgraph occurrences in a hash set minedSubgraphs (line 2).
For each edge e = (u, v, t) of the target graph, MODIT performs the following steps. First, e and its nodes are embedded in a new occurrence graph S, which is added to the set minedSubgraphs (lines 4-7). The timestamp of the seed edge e is stored in a variable minTime, which will be used throughout the search to avoid scanning some subgraphs of T multiple times (line 8). Each edge that will be added to the subgraph must have a timestamp greater than or equal to minTime. To ensure that each subgraph obtained by expanding S does not violate the temporal constraint, MODIT also uses three auxiliary variables: timestampSet, maxTime, and λ (lines 9-12). Variable timestampSet is a multiset containing the timestamps of the currently examined subgraph S (at the beginning just t). maxTime will store at each step the maximum timestamp of edges in S. λ represents how much we can extend the time window covered by edges in S (i.e., the difference between the maximum and the minimum timestamps), without exceeding Δ.
maxTime and λ are initialized and kept updated during the search using Algorithm 2 (line 13).
FIGURE 3 | Example of computation of the canonical form. Node a has the maximum degree (2), so a is the first node of the ordering. Nodes b and c have the same degree, so we need to examine the adjacency lists of b and c sorted by timestamp and destination node. The first edge in the sorted list of b has timestamp 1, while the first edge in the sorted list of c has timestamp 2, so the second node of the ordering is b and the third one is c. Following such node ordering, the canonical form of the graph in the figure is {(a, 1, …
Next, the timestamps of the current subgraph S are standardized, i.e., they are modified in order to transform S into a temporal motif M, in compliance with the definition provided in section 2.1 (line 14). Standardization of timestamps aims at identifying the motif M of which S is an occurrence and works as follows. First, the list of edge timestamps in S (without duplicates) is sorted in ascending order. Then, each edge of S is assigned the rank of the corresponding timestamp in the sorted list.
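The standardization step described above amounts to a dense ranking of timestamps. A minimal sketch follows, assuming edges are `(u, v, timestamp)` tuples and that ranks start at 1 (consistent with the initial event being event 1); the function name `standardize` is illustrative.

```python
from typing import List, Tuple

Edge = Tuple[str, str, int]  # (u, v, timestamp), assumed layout


def standardize(edges: List[Edge]) -> List[Edge]:
    """Replace each timestamp by its 1-based rank among the distinct timestamps."""
    distinct = sorted({t for _, _, t in edges})          # duplicates removed, ascending
    rank = {t: i + 1 for i, t in enumerate(distinct)}    # timestamp -> rank
    return [(u, v, rank[t]) for u, v, t in edges]
```

Edges that share a timestamp receive the same rank, so simultaneous events in the occurrence remain simultaneous in the resulting motif.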
Standardization alone may produce distinct motifs that are actually structurally equivalent and in which equivalent edges have the same timestamps. To avoid this, a canonical form C is extracted from M (line 15). The canonical form is a string that uniquely represents a motif, so that two equivalent motifs have exactly the same canonical form. C is obtained by concatenating motif edges based on a given order of motif nodes. Nodes are first ordered according to their degree. Possible ties are resolved by comparing their sorted adjacency lists, in which edges are ordered by timestamp and, in case of ties, by destination node. Following the calculated node ordering, sorted adjacency lists of nodes are concatenated to yield the canonical form. Figure 3 shows an example of computation of the canonical form. In the depicted motif, node a is the first node in the ordering, since it has the maximum degree (2). Nodes b and c have the same degree, so their sorted adjacency lists are compared: the first edge of b has timestamp 1 and the first edge of c has timestamp 2, so b precedes c. After constructing the canonical form, the number of occurrences of M is incremented in motifMap (line 17).
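The node ordering and concatenation described above can be sketched as follows. This is a simplified illustration, not the MODIT implementation: it handles undirected motifs only, breaks degree ties using only the sorted timestamps of each node's adjacency list (the destination-node tie-break is omitted), and writes nodes as their position in the ordering so that the output does not depend on the original node names. Nodes that remain tied keep their input order, so the result is not guaranteed to be canonical for every motif. The name `canonical_form` and the output format are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (u, v, standardized timestamp), assumed layout


def canonical_form(edges: List[Edge]) -> str:
    """Degree-then-timestamps node ordering followed by adjacency-list concatenation."""
    adj: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for u, v, t in edges:
        adj[u].append((t, v))
        adj[v].append((t, u))

    def key(node: str):
        # Higher degree first; ties broken by the sorted incident timestamps.
        return (-len(adj[node]), sorted(t for t, _ in adj[node]))

    order = sorted(adj, key=key)
    pos = {node: i for i, node in enumerate(order)}

    parts = []
    for node in order:
        # Adjacency list sorted by timestamp, then by neighbor position.
        for t, nbr in sorted((t, pos[n]) for t, n in adj[node]):
            parts.append(f"({pos[node]},{t},{nbr})")
    return "{" + ",".join(parts) + "}"
```

On the three-node example discussed in the text (a linked to b at time 1 and to c at time 2), the ordering is a, b, c, and renaming the nodes leaves the resulting string unchanged.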
Next, MODIT continues the search of new occurrences by extending S edge by edge, starting from an anchor node. This is done using the recursive procedure RECURSIVESEARCH described in Algorithm 3 (lines 17-18). The first two calls to the procedure will extend the seed using both endpoints as anchors. In general, S will be extended by adding an edge which is not already present in S and is incident to an anchor node already present in S.
The structure of the RECURSIVESEARCH procedure is very similar to Algorithm 1. First, we check if the currently examined subgraph has reached the maximum allowed number of edges (lines 1-2). If so, the recursive algorithm stops, otherwise the search goes on, considering all possible edges e = (u, a, t) not already present in S and incident to the anchor node a (line 3). For each such edge, MODIT ensures that adding e to S does not violate the temporal constraint (line 4). Based on the current values of minTime, maxTime and λ, we can add e to S without violating the constraint iff minTime ≤ t ≤ maxTime + λ. We impose that t is no lower than minTime, i.e., the timestamp of the seed edge, to reduce the number of redundant candidates generated. If e does not violate the constraint, before adding it to S (line 11), we do the following steps. First, we check whether e connects two nodes already present in S (line 5). If it does not, we verify that the currently examined subgraph has not reached the maximum allowed number of nodes (line 6). In this case, node u is added to S (line 7). Otherwise, we proceed with the next edge (line 10).
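The bookkeeping of maxTime and λ and the admissibility test minTime ≤ t ≤ maxTime + λ can be sketched as below. This is an interpretation rather than a transcription of Algorithm 2: it assumes λ equals Δ minus the current span of S (maximum minus minimum timestamp), and it uses `collections.Counter` as the multiset. The names `update_window` and `can_add` are hypothetical.

```python
from collections import Counter
from typing import Tuple


def update_window(timestamp_set: Counter, delta: int) -> Tuple[int, int]:
    """Return (max_time, lam) for the multiset of timestamps of the current subgraph S.

    lam is the remaining slack: how far the window can still grow without
    exceeding delta (assumed to be delta minus the current span).
    """
    times = list(timestamp_set.elements())  # skips entries whose count dropped to 0
    min_t, max_t = min(times), max(times)
    return max_t, delta - (max_t - min_t)


def can_add(t: int, min_time: int, max_time: int, lam: int) -> bool:
    """Temporal admissibility test for a candidate edge with timestamp t."""
    return min_time <= t <= max_time + lam
```

With a seed at time 5 and Δ = 6, candidates are admissible up to time 11; after adding two edges at time 8 the slack shrinks to 3, and the upper limit stays at 11. Decrementing counts in the multiset during backtracking restores the earlier window.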
The Boolean conditions expressed in line 4 do not prevent examining some subgraphs multiple times. Therefore, before going on with the search, we need to check whether S has already been examined (line 12). This is done by simply comparing the list of edge ids of S with that of each subgraph in the minedSubgraphs set.
If S is new, we follow the same steps performed in Algorithm 1. First, we include t in timestampSet and add S to minedSubgraphs (lines 13-14). We update the temporal auxiliary variables minTime, maxTime and λ (line 15). Then, edge timestamps are standardized to obtain a temporal motif M (line 16). From M we extract the canonical form C (line 17) and increase the number of its occurrences (line 18).
Next, Algorithm 3 is called recursively twice using the endpoints of e as anchor nodes (lines 19-20). After returning from the recursive calls, backtracking is performed (lines 21-26). Backtracking implies: (i) removing from S the last added edge e, (ii) removing from S the last added node, (iii) removing the timestamp of e from timestampSet, (iv) updating auxiliary variables maxTime and λ.
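The overall seed-and-extend search can be condensed into a short sketch. It is a simplified illustration and deliberately departs from the pseudocode in several labeled ways: it treats the graph as undirected with each interaction listed once, extends S from every node already in S instead of a single anchor, passes copies of the state instead of backtracking, and identifies motifs with a brute-force canonical labeling (minimum over all node permutations, practical only for small k) in place of the degree-based canonical form. Because every added timestamp is at least minTime, the test minTime ≤ t ≤ maxTime + λ reduces here to minTime ≤ t ≤ minTime + Δ. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str, int]  # undirected interaction (u, v, timestamp), listed once


def standardize(edges: List[Edge]) -> List[Edge]:
    rank = {t: i + 1 for i, t in enumerate(sorted({t for _, _, t in edges}))}
    return [(u, v, rank[t]) for u, v, t in edges]


def canonical(edges: List[Edge]) -> tuple:
    """Brute-force canonical labeling: smallest relabeled edge list over all permutations."""
    nodes = sorted({x for u, v, _ in edges for x in (u, v)})
    best = None
    for perm in permutations(range(len(nodes))):
        m = dict(zip(nodes, perm))
        form = tuple(sorted((min(m[u], m[v]), max(m[u], m[v]), t) for u, v, t in edges))
        if best is None or form < best:
            best = form
    return best


def modit_sketch(edges: List[Edge], delta: int, k: int, l: int) -> Counter:
    inc = defaultdict(list)                 # node -> ids of incident edges
    for i, (u, v, _) in enumerate(edges):
        inc[u].append(i)
        inc[v].append(i)
    motif_map: Counter = Counter()          # canonical motif -> number of occurrences
    mined = set()                           # edge-id sets already examined

    def record(sub: FrozenSet[int]) -> None:
        motif_map[canonical(standardize([edges[i] for i in sub]))] += 1

    def extend(sub: FrozenSet[int], nodes: FrozenSet[str], min_time: int) -> None:
        if len(sub) == l:                   # maximum number of edges reached
            return
        for a in nodes:
            for i in inc[a]:
                if i in sub:
                    continue
                u, v, t = edges[i]
                if not (min_time <= t <= min_time + delta):
                    continue                # temporal constraint violated
                new_nodes = nodes | {u, v}
                if len(new_nodes) > k:
                    continue                # maximum number of nodes exceeded
                new_sub = sub | {i}
                if new_sub in mined:
                    continue                # occurrence already examined
                mined.add(new_sub)
                record(new_sub)
                extend(new_sub, new_nodes, min_time)

    for i, (u, v, t) in enumerate(edges):   # every edge acts as a seed
        seed = frozenset([i])
        mined.add(seed)
        record(seed)
        extend(seed, frozenset((u, v)), t)
    return motif_map
```

On a toy graph with edges a-b at time 1, b-c at time 2, a-c at time 3 and c-d at time 20, with Δ = 5, k = 3 and l = 3, the sketch finds three motifs: the single edge (4 occurrences), the two-edge path (3 occurrences) and the triangle (1 occurrence).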

MODIT Complexity Analysis
In this subsection, we analyze the complexity of MODIT. The search starts from the smallest motif occurrences, formed by single edges. Therefore, the for-loop in Algorithm 1 is performed |E(G)| times, where G is the target graph. Inside the loop, MODIT tries to expand each occurrence as long as possible. Lines 4-7 of Algorithm 1 require constant time because they are applied to a subgraph formed by only one edge.
Now let us analyze the complexity of Algorithm 3. Let $d_{max}$ be the maximum node degree of G. The for-loop in Algorithm 3 is performed, in the worst case, $d_{max}$ times. Assuming Δ = ∞, all $d_{max}$ edges are candidates to extend the motif. Lines 4-14 of Algorithm 3 require constant time. The complexity of Algorithm 2 depends on the number of distinct timestamps of the current motif. In the worst case there are l edges and all timestamps are different. Identifying the minimum and maximum requires scanning the timestamps, which takes linear time. The rest of the operations can be done in constant time. So, the complexity of Algorithm 2 is O(l). However, in practice, these operations are done faster because MODIT stores timestamps in a data structure that is kept sorted as elements are inserted or removed.
Standardization of timestamps requires linear time with respect to the number of edges. Since a motif can have at most l edges, the complexity is O(l). The time required to build the canonical form depends on the number of nodes and the number of edges of the motif. Sorting the adjacency list of a node requires, in the worst case, l operations. Since a motif can have at most k nodes, sorting their adjacency lists requires at most k · l operations. The ordering of nodes has linear complexity with respect to the number of nodes, and thus performs, in the worst case, k operations. Therefore, the number of operations required to calculate the canonical form is at most k · l + k. Updating the number of occurrences of a motif is done using a hash map, so it takes constant time.
To derive the final complexity of Algorithm 3, we need to evaluate the maximum depth of the recursion. Each call to Algorithm 3 adds one edge at a time, so the maximum recursion depth is l. Assuming no early backtracking, this implies that the complexity of the recursive procedure is $O((l \cdot k \cdot d_{max})^{l})$. So overall, the complexity of MODIT is $O(|E(G)| \cdot (l \cdot k \cdot d_{max})^{l})$.
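The chain of costs leading to this bound can be summarized compactly. The block below only restates the quantities given in the analysis; the symbols $C_{\text{iter}}$ (cost of processing one candidate edge) and $C_{\text{call}}$ (cost of one call, excluding recursion) are introduced here purely for illustration, and the final expressions are upper bounds, not tight estimates.

```latex
\begin{align*}
C_{\text{iter}} &= \underbrace{O(l)}_{\text{Algorithm 2}}
                 + \underbrace{O(l)}_{\text{standardization}}
                 + \underbrace{O(k \cdot l + k)}_{\text{canonical form}}
                 = O(k \cdot l), \\
C_{\text{call}} &= d_{max} \cdot C_{\text{iter}} = O(l \cdot k \cdot d_{max}), \\
C_{\text{Algorithm 3}} &= O\big((l \cdot k \cdot d_{max})^{l}\big)
                 \quad \text{(recursion depth at most } l\text{)}, \\
C_{\text{MODIT}} &= O\big(|E(G)| \cdot (l \cdot k \cdot d_{max})^{l}\big)
                 \quad \text{(one search per seed edge)}.
\end{align*}
```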

RESULTS
MODIT has been implemented in Java and tested on two datasets of real temporal networks of different sizes, denoted as Dataset 1 and Dataset 2, respectively. Table 1 lists the main features of the networks of the two datasets. For each graph we report the number of nodes, the number of edges, the number of distinct timestamps and the resolution, i.e., the minimum difference between consecutive timestamps. SFHH-CONF is a network that describes the interactions between the 405 participants of the SFHH conference in Nice, France (Génois and Barrat, 2018). AS-TOPOLOGY is a peer-to-peer communication network between autonomous systems, with data collected between February and March of 2010. CONTACTS-DUBLIN is a contact network of attendees at the Infectious SocioPatterns event held in the Science Gallery in Dublin, Ireland (Isella et al., 2011). ENRON-EMAIL is a network of e-mail exchanges of Enron corporation employees between 1999 and 2003 (Keila and Skillicorn, 2005). DIGG-FRIENDS describes friendship relations between users of Digg, a web news aggregator used in America (Hogg and Lerman, 2012). It is based on data collected in 1 month of 2009. YAHOO-MESSAGES represents the exchange of e-mails between users of Yahoo-Mail in 1 month of 2010.
Dataset 2 includes 6 temporal networks taken from the SNAP dataset collection (https://snap.stanford.edu/data/index.html). COLLEGEMSG consists of private messages sent on an online social network at the University of California, Irvine (Panzarasa et al., 2009). The network EMAIL-EU-CORE-TEMPORAL was generated using email data from a large European research institution. Only emails exchanged between institution members were taken into account. EMAIL-EU-CORE-TEMPORAL-DEPT1, EMAIL-EU-CORE-TEMPORAL-DEPT2, EMAIL-EU-CORE-TEMPORAL-DEPT3 and EMAIL-EU-CORE-TEMPORAL-DEPT4 are four sub-networks including communications between members of four different departments of the institution (Paranjape et al., 2017).
We first ran MODIT on each network of Dataset 1 for different combinations of values of Δ, k (maximum number of motif nodes) and l (maximum number of motif edges). Then, we compared MODIT to the algorithm proposed by Paranjape et al. (2017), which is included in the SNAP platform for network analysis and uses the same definition of temporal motif. All other methods were discarded because they use different definitions of temporal motifs. This comparison was done on the networks of Dataset 2.
All experiments were performed on an Intel Core i5-7500 processor with 16 GB of RAM, 10 GB of which were used for the Java Virtual Machine.

Experiments on Dataset 1
In this section, we illustrate the results of the experiments on Dataset 1. We report in Tables 2-4 the results in terms of (i) execution times, (ii) number of distinct motifs identified, (iii) number of occurrences of the most frequent motif, and (iv) average number of motif occurrences. The experiments were performed for different combinations of values of Δ, k and l. Δ was set to r, 2r, and 3r, where r is the resolution of the temporal network. For k we used values 3, 4, and 5. l was varied as a function of k and set to values k − 1, 2 · (k − 1) and 3 · (k − 1). In some networks (in particular, in AS-TOPOLOGY and ENRON-EMAIL) and for some configurations of the parameters, MODIT ran out of memory and was unable to finish the execution. In these cases we did not report any running time. This is due to the large number of distinct motifs present in the networks, which leads to an excessive growth of the map of motif counts, together with a large number of occurrences causing many recursive calls of Algorithm 3. In fact, as k and l increase, the number of motif topologies and the number of combinations of standardized timestamps increase exponentially [e.g., SFHH-CONF network in the following configurations: (Δ = 3r, k = 3, l = 2), (Δ = 3r, k = 3, l = 4) and (Δ = 3r, k = 3, l = 6)]. This is confirmed by the high values of the number of occurrences of the most frequent motif and the average number of motif occurrences found for small values of k and l.
Interestingly, in some networks (e.g., ENRON-EMAIL, DIGG-FRIENDS) we observe that keeping Δ and k fixed and varying l, the number of distinct motifs, the number of occurrences of the most frequent motif and the average number of occurrences remain the same.
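The experimental grid described in this section (Δ ∈ {r, 2r, 3r}, k ∈ {3, 4, 5}, l ∈ {k − 1, 2(k − 1), 3(k − 1)}) can be generated with a few lines. The sketch below is illustrative only; the function names are hypothetical, and the resolution r is computed over distinct timestamps, which is an assumption made here so that repeated timestamps do not yield a resolution of zero.

```python
from typing import Iterable, List, Tuple


def resolution(timestamps: Iterable[int]) -> int:
    """Minimum difference between consecutive distinct timestamps (assumed reading of r)."""
    distinct = sorted(set(timestamps))
    return min(b - a for a, b in zip(distinct, distinct[1:]))


def parameter_grid(r: int) -> List[Tuple[int, int, int]]:
    """All (delta, k, l) configurations: delta in {r, 2r, 3r}, k in {3, 4, 5},
    l in {k - 1, 2(k - 1), 3(k - 1)}."""
    return [
        (m * r, k, f * (k - 1))
        for m in (1, 2, 3)
        for k in (3, 4, 5)
        for f in (1, 2, 3)
    ]
```

This yields 27 configurations per network; for a network with resolution r = 20, the configuration (Δ = 3r, k = 3, l = 6) corresponds to the tuple (60, 3, 6).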

Comparison With Paranjape's Algorithm
Finally we compared MODIT with the algorithm proposed by Paranjape et al. (2017) on the networks of Dataset 2.
For the comparison, we set k = 3 and l = 3 because Paranjape's algorithm can handle only motifs with 2 or 3 nodes and 3 edges.
Results are reported in Table 5 and show that Paranjape's algorithm is much faster than MODIT. This gap is mainly due to the fact that Paranjape's method uses a series of efficient dynamic programming algorithms, which are specifically designed to count specific classes of motifs, i.e., motifs with 3 edges. MODIT, on the other hand, is general and designed to find motifs of any size and any type. Furthermore, Paranjape's algorithm searches only for motifs having exactly the specified number of nodes and edges, whereas MODIT looks for all motifs having at most the number of nodes and edges specified by the user.

CONCLUSIONS
In this paper, we presented MODIT (MOtif DIscovery in Temporal Networks), an algorithm for counting motifs of any size in temporal networks, inspired by a very recent algorithm for subgraph isomorphism in temporal networks, called TemporalRI. Given the three parameters k, l, and Δ, MODIT scans the whole temporal graph to search for all subgraphs having at most k nodes and l edges and in which the difference between the maximum and the minimum timestamp is no greater than Δ. We ran MODIT on a dataset of real temporal networks of medium and large size by varying Δ and the maximum number of nodes and edges. We also compared MODIT with the algorithm proposed by Paranjape et al. (2017) using a different dataset of temporal networks downloaded from SNAP.
For the future, we plan to: