T-EDGE: Temporal WEighted MultiDiGraph Embedding for Ethereum Transaction Network Analysis

Recently, graph embedding techniques have been widely used in the analysis of various networks, but most existing embedding methods omit the temporal and weighted information of edges, which may be informative in financial transaction networks. The open nature of Ethereum, a blockchain-based platform, gives us an unprecedented opportunity for data mining in this area. By taking the realistic rules and features of transaction networks into consideration, we propose to model the Ethereum transaction network as a Temporal Weighted Multidigraph (TWMDG), where each node is a unique Ethereum account and each edge represents a transaction weighted by amount and assigned a timestamp. In a TWMDG, we define the problem of Temporal Weighted Multidigraph Embedding (T-EDGE) by incorporating both the temporal and the weighted information of the edges, the purpose being to capture more comprehensive properties of dynamic transaction networks. To evaluate the effectiveness of the proposed embedding method, we conduct experiments on predictive tasks, including temporal link prediction and node classification, on real-world transaction data collected from Ethereum. Experimental results demonstrate that T-EDGE outperforms baseline embedding methods, indicating that time-dependent walks and the multiplicity of edges are informative and essential for time-sensitive transaction networks.


Introduction
The past decade has witnessed an explosive growth of graph data, and the analysis of large-scale networks has attracted increasing attention from both academia and industry [Volpp, 2006]. However, although financial transaction networks exist widely in the real world, there are relatively few analytical studies on them because transaction data are usually kept private for reasons of security and interest. Fortunately, the recent emergence of blockchain technology makes transaction data mining more feasible and reliable. Generally speaking, blockchain is an open and distributed ledger technology managed by a peer-to-peer network through a special consensus mechanism, and all transaction records on a blockchain are publicly accessible [Swan, 2015]. The open nature of blockchain data provides researchers with unprecedented opportunities for data mining in this area [Tasca et al., 2018; Feder et al., 2018; Atzei et al., 2017; Möser et al., 2013].
Being the largest public blockchain-based platform that supports smart contracts, Ethereum [Wood, 2014] has attracted wide attention and its market capitalization has reached 20 billion USD. To facilitate the implementation of smart contracts, Ethereum introduces the concept of an account, which is formally an address but adds storage space for recording account balances, transactions, code, etc. The corresponding cryptocurrency on Ethereum, known as Ether, can be transferred between accounts and used to compensate participating mining nodes. Since its debut in 2014, Ethereum has accumulated a large number of user transaction records. Utilizing these records, prior work conducted the first systematic study to characterize Ethereum and obtained new observations via traditional network analysis. Different from other large-scale complex networks, the Ethereum transaction network, where each edge represents a particular Ether transaction, contains unique information such as the directions, amounts and timestamps of the transactions. It is essential to incorporate such information for accurate modeling, characterization, and understanding of transaction network data. In addition, multiple transactions between two users are to be expected, so it is more comprehensive to model a transaction network as a multidigraph rather than as a simple graph. Therefore, in this work, we model the Ethereum transaction network as a Temporal Weighted Multidigraph where a node is a unique address and an edge represents a transaction weighted by amount and assigned a timestamp.
In recent years, researchers have extensively investigated a variety of machine learning applications on large-scale complex networks, and the performance of these machine learning tasks is heavily dependent on the choice of data representation. Graph embedding is an effective method to represent node features in a low dimensional space for network analysis and downstream machine learning tasks [Cai et al., 2018]. Among various graph embedding methods, a series of random walk based approaches have been proposed to learn a mapping function from an original graph to a low dimensional vector space by maximizing the likelihood of co-occurrence of neighbor nodes [Perozzi et al., 2014; Grover and Leskovec, 2016]. Inspired by the word2vec algorithm [Mikolov et al., 2013a] proposed for natural language processing, these random walk based embedding methods are especially useful when the network is too large to be measured entirely [Goyal and Ferrara, 2018]. Recently, to better extract the temporal information from dynamic networks, [Nguyen et al., 2018] proposed a general framework called Continuous-Time Dynamic Network Embeddings (CTDNE) to incorporate temporal dependencies into existing random walk based network embedding models.
Considering the realistic rules and features of transaction networks such as Ethereum, the challenges of transaction network embedding are as follows: (1) Transaction networks evolve continuously over time as links are added, which is overlooked by most existing graph embedding algorithms; (2) A connection between accounts is not a relationship established once and for all but a time-dependent event, hence multiple edges need to be considered in transaction network embedding; (3) Unlike in social networks, random walks on the Ethereum transaction network have a concrete meaning, since they represent money transfer flows in the real world; (4) The amount of a transaction reflects the similarity between two accounts to some extent. In most cases, the larger the transaction amount, the closer the relationship between the two accounts. Figure 1 is a microcosm of transaction activities on Ethereum.
To this end, we propose a novel framework named Temporal WEighted MultiDiGraph Embedding (T-EDGE), which aims to capture the non-negligible temporal properties and important money-transfer tendencies of time-sensitive transaction networks. For the transaction networks discussed here, existing methods that ignore temporal information may sample a large number of invalid transaction sequences to derive node embeddings. For example, in Figure 1, {A5, A1, A2} is a possible random walk sequence in traditional methods. However, it is not feasible in a temporal graph because the transaction from A1 to A2 happens earlier. In CTDNE [Nguyen et al., 2018], although temporal information is considered, the existence of multiple edges between nodes is neglected. For instance, according to CTDNE, the temporal walk from A0 to A1 is represented as a sequence of nodes {A0, A1}. However, whether A2 is a possible next step depends on whether transaction path 1 or 3 was sampled by the previous step from A0 to A1.
In this work, we represent an l-length temporal walk as a sequence of l nodes together with a sequence of (l − 1) edges traversed in non-decreasing timestamps. This kind of temporal walk represents an actually feasible path for money flow in the transaction network. Therefore, the proposed method is expected to learn more meaningful and accurate time-dependent node embeddings that capture more comprehensive properties of dynamic transaction networks.
The main contributions of our paper are as follows:
• To the best of our knowledge, this is the first work to understand Ethereum transaction records via graph embedding. In particular, we consider two important and practical machine learning tasks, namely link prediction and node classification.
• We refine the definition of a temporal walk for transaction networks by considering temporal dependencies and the multiplicity of edges. Such random walk sequences carry the practical meaning of money flow in transaction networks.
• We propose a novel graph embedding method called Temporal Weighted Multidigraph Embedding (T-EDGE), which incorporates transaction information from both the time and amount domains, and experiments on real Ethereum data demonstrate its superiority over existing methods.

Figure 2 demonstrates the four main steps of the proposed framework for Ethereum transaction network analysis, including data collection, network construction, graph embedding and downstream applications. The parts of network construction and graph embedding are described in the rest of this section, and the parts of data collection and applications will be explained later in Section 3.

Network Construction
Ether transfer is one of the major activities on Ethereum. Here we abstract an Ether transfer transaction as a four-tuple (src, dst, w, t), which means that the sender src transfers w Ether to the recipient dst at time t. To investigate Ether transfers on Ethereum, we abstract the Ethereum transaction network as a Temporal Weighted Multidigraph:

Definition 1 (Temporal Weighted Multidigraph (TWMDG)). Given a graph G = (V, E), let V be the set of nodes and E be the set of edges. Each edge is unique and is represented as e = (u, v, w, t), where u is the source node, v is the target node, w is the weight value and t is the timestamp. For the sake of simplicity, we define mapping functions Src(e) = u, Dst(e) = v, W(e) = w and T(e) = t.

Based on the four-tuples collected from Ethereum transaction records, we can build a Temporal Weighted Multidigraph, where each node represents a unique account and each edge represents a unique Ether transfer transaction.
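As an illustration of Definition 1, the following minimal sketch builds a multidigraph from four-tuples. The class name `Edge`, the function `build_twmdg` and the toy records are hypothetical and only serve the example; the adjacency-list layout is one possible choice, not necessarily the one used in our implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One Ether transfer, i.e. one edge e = (u, v, w, t) of the multidigraph."""
    src: str       # sender account
    dst: str       # recipient account
    weight: float  # transferred amount in Ether
    time: int      # timestamp of the transaction


def build_twmdg(transactions):
    """Return (nodes, out_edges) built from (src, dst, w, t) four-tuples.

    out_edges maps each node to the list of its outgoing edges, so parallel
    edges between the same pair of accounts are kept as separate entries.
    """
    nodes = set()
    out_edges = defaultdict(list)
    for src, dst, w, t in transactions:
        nodes.update((src, dst))
        out_edges[src].append(Edge(src, dst, w, t))
    return nodes, out_edges


# Toy records (hypothetical accounts and values, not real Ethereum data).
toy_transactions = [
    ("A0", "A1", 1.0, 10),
    ("A0", "A1", 2.5, 30),  # second transfer between the same pair
    ("A1", "A2", 0.7, 20),
]
nodes, out_edges = build_twmdg(toy_transactions)
```

Keeping one list entry per transaction, rather than merging parallel edges into a single weighted edge, is what preserves the multiplicity that the embedding relies on.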

Temporal Weighted Multidigraph Embedding
We now define the problem of Temporal WEighted MultiDiGraph Embedding (T-EDGE) as follows: Given a temporal weighted multidigraph G = (V, E), our principal goal is to learn an embedding function $\Phi : V \rightarrow \mathbb{R}^d$ ($d \ll |V|$) which preserves original network information including node similarity, as well as temporal and weighting properties specifically for financial transaction networks, thus enhancing predictive performance on downstream machine learning tasks. The proposed method aims to learn more appropriate and meaningful dynamic node representations using a general embedding framework consisting of two main parts. The first part is a random walk generator, which samples a set of walks with the temporal constraint and flexible biased strategies; the second part is an update procedure based on SkipGram [Mikolov et al., 2013a; Mikolov et al., 2013b], which learns node embeddings as a maximum likelihood optimization problem.
The random walk mechanism has been widely shown to be an effective technique for measuring local similarity in networks from a variety of domains [Spitzer, 2013]. For the temporal weighted multidigraph discussed here, we define the concept of a Temporal Walk as follows:

Definition 2 (Temporal Walk). In a TWMDG, a temporal walk from node $v_1$ to $v_l$ is an l-length path traversed in non-decreasing timestamps. Such a temporal walk is represented as a sequence of l nodes $walk_n = \{v_1, v_2, \ldots, v_l\}$ together with a sequence of (l − 1) edges $walk_e = \{e_1, e_2, \ldots, e_{l-1}\}$, where edge $e_i$ leads from $v_i$ to $v_{i+1}$ for $1 \le i \le l-1$, and $T(e_i) \le T(e_{i+1})$ for $1 \le i \le l-2$.

We say that nodes u and v are temporally connected if there exists a temporal path from u to v. In order to sample valid random walks which obey the temporal constraint, we introduce a new concept called Temporal Successive Edges in a TWMDG.

Definition 3 (Temporal Successive Edges). Given a temporal weighted multidigraph G = (V, E), the temporal successive edges of a node u at time t are defined as follows:

$$L_t(u) = \{ e \mid Src(e) = u,\ T(e) \ge t \}$$

For instance, in Figure 1, let $t = T(e_5)$; then $L_t(A1) = \{e_5, e_6, e_{10}\}$. The set of temporal successive edges serves as the candidate set from which a walker selects its next edge.
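The two definitions above can be made concrete with a short sketch. It assumes that edges are plain (src, dst, w, t) tuples and that "non-decreasing" admits equal timestamps; the function names and the toy edges are hypothetical.

```python
def temporal_successive_edges(edges, u, t):
    """Edges leaving node u whose timestamp is not earlier than t."""
    return [e for e in edges if e[0] == u and e[3] >= t]


def is_temporal_walk(walk_edges):
    """Check that consecutive edges are head-to-tail and non-decreasing in time."""
    for prev, nxt in zip(walk_edges, walk_edges[1:]):
        if prev[1] != nxt[0] or prev[3] > nxt[3]:
            return False
    return True


# Toy multidigraph: two parallel transfers from A0 to A1 at times 1 and 3.
toy_edges = [
    ("A0", "A1", 1.0, 1),
    ("A1", "A2", 0.5, 2),
    ("A0", "A1", 2.0, 3),
    ("A1", "A3", 4.0, 5),
]

# Arriving at A1 through the early transfer leaves both successors open,
# arriving through the late transfer leaves only the edge to A3.
after_early = temporal_successive_edges(toy_edges, "A1", 1)
after_late = temporal_successive_edges(toy_edges, "A1", 3)
```

The toy graph mirrors the situation discussed in the Introduction: which successors are valid depends on which of the parallel edges was traversed, so a walk must record edges and not only nodes.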
Apart from the temporal constraint, we further develop biased searching strategies by considering more detailed transaction information. For the Ethereum transaction network discussed here, we abstract the transaction time and amount as the temporal and weighted information of a TWMDG. Consider a random walk that has just traversed edge $e_{i-1}$ and is now at node $v_i$ at time $t = T(e_{i-1})$. The next node $v_{i+1}$ of the random walk is decided by selecting a temporally valid edge $e_i$. We describe different sampling biases by formulating the selection probability for each temporal successive edge $e \in L_t(v_i)$.
From the perspective of temporal domain, we consider both unbiased and biased sampling strategies as follows.
• Temporal Unbiased Sampling (TUS). This is the default setting in the time domain, which assumes that each temporal successive edge $e \in L_t(v_i)$ of node $v_i$ at time t has the same probability of being selected:

$$P_T(e) = \frac{1}{|L_t(v_i)|} \quad (1)$$

• Temporal Biased Sampling (TBS). For financial transaction networks, the similarity between accounts is time-dependent and dynamic. On the one hand, accounts with frequent interactions are supposed to have a stronger relationship. Therefore, we let $\eta_- : \mathbb{R} \rightarrow \mathbb{Z}^+$ be a function that maps the timestamps of edges to a descending ranking. In this case, each edge $e \in L_t(v_i)$ is assigned the selection probability

$$P_T(e) = \frac{\eta_-(T(e))}{\sum_{e' \in L_t(v_i)} \eta_-(T(e'))} \quad (2)$$

where T(e) denotes the timestamp of edge e. This sampling method biases the selection towards edges that are closer in time to the previous edge.

On the other hand, sampling the interactions among accounts over a large time interval may also be important for different domains of networks, for the purpose of preserving global similarity in the time domain. For such scenarios, we propose another strategy that favors edges appearing later relative to the previous timestamp. Let $\eta_+ : \mathbb{R} \rightarrow \mathbb{Z}^+$ be a function that maps the timestamps of edges to an ascending ranking. The probability of selecting each edge $e \in L_t(v_i)$ can be given as

$$P_T(e) = \frac{\eta_+(T(e))}{\sum_{e' \in L_t(v_i)} \eta_+(T(e'))} \quad (3)$$

Algorithm 1 Temporal Weighted Multidigraph Embedding
Input: Temporal Weighted Multidigraph G = (V, E), dimensions d, walks per node r, walk length l, window size k
Output: Φ(v) for ∀v ∈ V
  Initialize set of temporal walks TW to ∅
  for iter = 1 to r do
    for all nodes u ∈ V do
      walk_n = TemporalWalk(G, u, l)
      Append walk_n to TW
    end for
  end for
  Φ = StochasticGradientDescent(k, d, TW)
  return Φ ∈ R^{|V|×d}

Apart from the transaction time, the amount values of the edges (edge weights) also play an essential role in financial transaction networks. In the following, we present unbiased and biased strategies for the weighted domain.
• Weighted Unbiased Sampling (WUS). Similar to TUS, this is the default setting in the amount domain, and each edge $e \in L_t(v_i)$ has the same probability of being sampled:

$$P_W(e) = \frac{1}{|L_t(v_i)|} \quad (4)$$

• Weighted Biased Sampling (WBS). As illustrated in the Introduction, the weight value of each transaction indicates the significance of the interaction between the two accounts involved. In most instances, a higher transaction amount implies a larger similarity between the two accounts. Thus each edge $e \in L_t(v_i)$ can be assigned the selection probability

$$P_W(e) = \frac{W(e)}{\sum_{e' \in L_t(v_i)} W(e')} \quad (5)$$

To prevent the extreme situation where edges with small weights would never be sampled, we consider a linear mapping function that weakens the effect of edge weights. Thus we have

$$P_W(e) = \frac{\eta_+(W(e))}{\sum_{e' \in L_t(v_i)} \eta_+(W(e'))} \quad (6)$$
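The sampling strategies above can be sketched as follows. The sketch rests on several assumptions that the text does not fix: a descending ranking assigns rank 1 to the largest value (so that earlier edges receive larger scores), an ascending ranking assigns rank 1 to the smallest value, and equal values share a rank. The strategy labels "TBS-", "TBS+" and "WBS-rank" are hypothetical names for the descending-rank, ascending-rank and rank-based weighted variants.

```python
def ascending_rank(values):
    """Smallest value gets rank 1, largest gets the highest rank."""
    position = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [position[v] for v in values]


def descending_rank(values):
    """Largest value gets rank 1, smallest gets the highest rank."""
    position = {v: i + 1 for i, v in enumerate(sorted(set(values), reverse=True))}
    return [position[v] for v in values]


def normalize(scores):
    """Scale non-negative scores so that they sum to one."""
    total = sum(scores)
    return [s / total for s in scores]


def temporal_probs(timestamps, strategy="TUS"):
    """Selection probabilities in the time domain for the candidate edges."""
    if strategy == "TUS":
        return normalize([1.0] * len(timestamps))
    if strategy == "TBS-":  # favours edges close in time to the previous one
        return normalize(descending_rank(timestamps))
    if strategy == "TBS+":  # favours edges that occur later
        return normalize(ascending_rank(timestamps))
    raise ValueError(f"unknown temporal strategy: {strategy}")


def weighted_probs(amounts, strategy="WUS"):
    """Selection probabilities in the amount domain for the candidate edges."""
    if strategy == "WUS":
        return normalize([1.0] * len(amounts))
    if strategy == "WBS":  # proportional to the raw amount
        return normalize(amounts)
    if strategy == "WBS-rank":  # rank-based, dampens very large amounts
        return normalize(ascending_rank(amounts))
    raise ValueError(f"unknown weighted strategy: {strategy}")


# Three candidate edges with hypothetical timestamps and amounts.
timestamps = [5, 6, 10]
amounts = [0.1, 8.0, 1.9]
```

With these toy values, the raw-amount variant gives the smallest transfer a probability of about 0.01, whereas the rank-based variant raises it to 1/6, which illustrates why the linear mapping keeps small transactions reachable.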
Furthermore, we combine the aforementioned sampling probabilities from both the temporal and weighted domains, i.e., $P_T$ and $P_W$, by $P(e) = P_T(e)^{\alpha} P_W(e)^{(1-\alpha)}$ $(0 \le \alpha \le 1)$ for $\forall e \in L_t(v_i)$. Here α = 0.5 is the default value for balancing the time domain and the amount domain. Note that T-EDGE, with default settings TUS and WUS, can be regarded

EDGE, while T-EDGE (TBS), T-EDGE (WBS) and T-EDGE (TBS+WBS) select the edges with temporal or/and weighted biases.
Given the sampling results of temporal random walks, we formulate the task of learning time- and weight-dependent graph embeddings in a TWMDG as an optimization problem. This optimization aims to maximize the log-probability of observing a node's neighborhood conditioned on its embedding vector:

$$\max_{\Phi} \; \log \Pr\left(\{v_{i-k}, \ldots, v_{i+k}\} \setminus v_i \mid \Phi(v_i)\right) \quad (7)$$

where k is the window size which restricts the size of the random walk context. According to the conditional independence assumption in SkipGram, Eq. 7 can be transformed to

$$\max_{\Phi} \; \sum_{j = i-k,\ j \ne i}^{i+k} \log \Pr\left(v_j \mid \Phi(v_i)\right) \quad (8)$$

The pseudocode for T-EDGE and the temporal walk is given in Algorithms 1 and 2, respectively.
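The walk generation stage can be sketched as below. This is a simplified illustration under stated assumptions rather than our exact procedure: edges are (src, dst, w, t) tuples, the first step of a walk may use any outgoing edge, a walk stops early at a dead end, the biased variants use rank-based scores, and all function names are hypothetical. The SkipGram update is omitted; the resulting node sequences could be passed to any SkipGram implementation.

```python
import random
from collections import defaultdict


def rank(values, descending=False):
    """1-based rank of each value; equal values share a rank."""
    order = sorted(set(values), reverse=descending)
    position = {v: i + 1 for i, v in enumerate(order)}
    return [position[v] for v in values]


def edge_scores(candidates, temporal="TUS", weighted="WUS", alpha=0.5):
    """Unnormalized P(e) = P_T(e)^alpha * P_W(e)^(1 - alpha) per candidate."""
    n = len(candidates)
    times = [e[3] for e in candidates]
    amounts = [e[2] for e in candidates]
    # Any value other than "TUS"/"WUS" switches to the rank-based bias.
    p_t = [1.0] * n if temporal == "TUS" else rank(times, descending=True)
    p_w = [1.0] * n if weighted == "WUS" else rank(amounts)
    t_sum, w_sum = sum(p_t), sum(p_w)
    return [(a / t_sum) ** alpha * (b / w_sum) ** (1 - alpha)
            for a, b in zip(p_t, p_w)]


def temporal_walk(out_edges, start, length, rng, **bias):
    """Sample a walk of at most `length` nodes with non-decreasing timestamps."""
    walk, node, t = [start], start, float("-inf")
    while len(walk) < length:
        candidates = [e for e in out_edges.get(node, []) if e[3] >= t]
        if not candidates:
            break  # dead end: no temporally valid outgoing edge
        scores = edge_scores(candidates, **bias)
        edge = rng.choices(candidates, weights=scores)[0]
        node, t = edge[1], edge[3]
        walk.append(node)
    return walk


def generate_walks(edges, walks_per_node, length, seed=0, **bias):
    """Outer loops of Algorithm 1: r walks of length l from every node."""
    rng = random.Random(seed)
    out_edges, nodes = defaultdict(list), set()
    for e in edges:
        out_edges[e[0]].append(e)
        nodes.update((e[0], e[1]))
    return [temporal_walk(out_edges, u, length, rng, **bias)
            for _ in range(walks_per_node) for u in sorted(nodes)]


toy_edges = [
    ("A0", "A1", 1.0, 1),
    ("A1", "A2", 0.5, 2),
    ("A0", "A1", 2.0, 3),
    ("A1", "A3", 4.0, 5),
]
walks = generate_walks(toy_edges, walks_per_node=2, length=4,
                       temporal="TBS", weighted="WBS", alpha=0.5)
```

Because `random.choices` accepts unnormalized weights, the combined score does not need to be renormalized before sampling.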

Data Collection
On Ethereum, accounts can be divided into two categories: externally owned accounts (EOAs), which are similar to general bank accounts [Weili and Zibin, 2018], and smart contract accounts, which are source code files. In this work, we focus on the transactions among EOAs because the Ether transfer records between them are publicly available on the blockchain. In addition, we only include successful transactions among EOAs with a non-zero amount in our dataset.
Since it is extremely time-consuming to process the whole Ethereum transaction network with more than two million EOAs, we select a number of objective accounts and then obtain their transaction data through the APIs of Etherscan (https://etherscan.io/). Centered on each objective account, we obtain a directed K-order subgraph (see Figure 4). K-in and K-out are two parameters that control the depth of sampling inward and outward from the center, respectively.

Figure 4: Schematic illustration of a directed K-order subgraph (example with K-in = 2 and K-out = 3 around the center account).
On Ethereum, the information related to each Ether transaction is stored as a data package. In detail, the TxHash field is the unique identifier of a transaction, the Value field refers to the amount of money transferred, and the Timestamp field indicates when the transaction happened. In addition, the From and To fields denote the sender and the recipient of the transaction. With the collected four-tuples (From, To, Value, Timestamp), we can easily construct a temporal weighted multidigraph.
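The directed K-order subgraph sampling described above can be sketched as a pair of bounded breadth-first searches. This is one plausible reading of the procedure, shown on a local list of (From, To, Value, Timestamp) tuples instead of live API calls; the function name and the toy records are hypothetical.

```python
from collections import defaultdict, deque


def k_order_subgraph(transactions, center, k_in, k_out):
    """Transactions within k_in hops upstream and k_out hops downstream of center."""
    out_adj, in_adj = defaultdict(list), defaultdict(list)
    for tx in transactions:
        out_adj[tx[0]].append(tx)
        in_adj[tx[1]].append(tx)

    def expand(adjacency, neighbor_index, depth):
        # Breadth-first search that stops expanding nodes at the given depth.
        seen, kept = {center}, []
        frontier = deque([(center, 0)])
        while frontier:
            node, hops = frontier.popleft()
            if hops == depth:
                continue
            for tx in adjacency.get(node, []):
                kept.append(tx)
                neighbor = tx[neighbor_index]
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, hops + 1))
        return kept

    outward = expand(out_adj, 1, k_out)  # follow From -> To
    inward = expand(in_adj, 0, k_in)     # follow To -> From
    # Merge both directions and order the result by timestamp.
    return sorted(set(outward + inward), key=lambda tx: tx[3])


# Toy records (hypothetical accounts): C is the center account.
toy_transactions = [
    ("U", "V", 1.0, 0),
    ("V", "C", 1.0, 1),
    ("C", "X", 2.0, 2),
    ("X", "Y", 1.0, 3),
    ("Y", "Z", 1.0, 4),
    ("Q", "X", 1.0, 5),
]
subgraph = k_order_subgraph(toy_transactions, "C", k_in=1, k_out=2)
```

In this reading, an incoming transfer to a downstream account (such as Q to X) is not collected, because the inward search starts only from the center.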

Link Prediction
The link prediction problem is to predict the occurrence of links in a given graph on the basis of observed information. In this work, we first evaluate the performance of the proposed T-EDGE method on a temporal directed link prediction task based on binary classification.
First of all, we sort all the collected edges according to their timestamps and take the earlier edges E′ (those with smaller timestamps) as the known links, where V′ denotes the nodes involved in E′. Node set V′ and edge set E′ constitute the current network G′ = (V′, E′). Then we can learn node representations Φ(v) of the current network for ∀v ∈ V′ via graph embedding methods. Secondly, for the binary classifier, node pairs (src, dst) existing in E′ act as the positive samples of the training set. Then we randomly sample an equal number of node pairs with no link as negative samples. We obtain the features of a directed link from node $v_i$ to $v_j$ by concatenating their node embeddings, i.e., $F_{i,j} = [\Phi(v_i), \Phi(v_j)]$. If $i \ne j$, then $F_{i,j} \ne F_{j,i}$. Finally, we train a support vector classifier to classify the links in the test set, where the remaining links (those with larger timestamps) are treated as the positive samples.
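The data preparation for this task can be sketched as follows. The split ratio `train_fraction`, the placeholder embedding vectors and all function names are assumptions made for the illustration; the exhaustive enumeration of negative candidates is quadratic in the number of nodes and is only suitable for small examples. The classifier itself is left out.

```python
import random


def temporal_split(edges, train_fraction):
    """Earlier edges become the known links, later edges the test positives."""
    ordered = sorted(edges, key=lambda e: e[3])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def link_feature(embedding, src, dst):
    """Concatenate source and target embeddings; the order encodes direction."""
    return list(embedding[src]) + list(embedding[dst])


def sample_negative_pairs(nodes, linked_pairs, count, rng):
    """Draw `count` ordered node pairs that are not connected by a known link."""
    ordered_nodes = sorted(nodes)
    candidates = [(u, v) for u in ordered_nodes for v in ordered_nodes
                  if u != v and (u, v) not in linked_pairs]
    return rng.sample(candidates, count)  # raises ValueError if too few exist


# Toy edges as (src, dst, w, t) tuples with hypothetical values.
toy_edges = [("a", "b", 1.0, 1), ("b", "c", 2.0, 2),
             ("a", "c", 1.0, 3), ("c", "a", 0.5, 4)]
train_edges, test_edges = temporal_split(toy_edges, train_fraction=0.5)
train_nodes = {node for e in train_edges for node in e[:2]}
positives = sorted({(e[0], e[1]) for e in train_edges})
negatives = sample_negative_pairs(train_nodes, set(positives),
                                  len(positives), random.Random(0))

# Placeholder vectors standing in for learned node embeddings.
toy_embedding = {"a": [0.1, 0.2], "b": [0.3, 0.4], "c": [0.5, 0.6]}
features = [link_feature(toy_embedding, u, v) for u, v in positives + negatives]
labels = [1] * len(positives) + [0] * len(negatives)
```

The feature matrix and labels could then be used to fit a support vector classifier. Test links that involve nodes absent from the training network have no embedding and would need to be skipped or handled separately.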
Dataset In this work, we collect three subgraphs of different sizes from Ethereum for the experiments.
EthereumG1 is centered by account "0x51faeda318982f439e80012fb45d2b017ddccdbe" with K-in = K-out = 3; EthereumG2 is centered by account "0x5e247060f48eeb64367250ed03ff5091bba47fd1" with K-in = K-out = 4; EthereumG3 is centered by the same account as EthereumG1 with K-in = K-out = 4. A summary of the dataset is listed in Table 1.

Settings
In the experiments, we compare the proposed T-EDGE with two baseline random walk based graph embedding methods, DeepWalk [Perozzi et al., 2014] and node2vec [Grover and Leskovec, 2016]. To ensure a fair comparison, we implement the directed versions of DeepWalk and node2vec using OpenNE [THUNLP, 2017], an open source toolkit for graph embedding. These random walk based embedding methods share several hyperparameters: the node embedding dimension d, the window size k, the walk length l, and the number of walks per node r. In general, we set d = 128 and k = 4. Specifically, we set r = 20, l = 10 for EthereumG1, r = 10, l = 10 for EthereumG2, and r = 10, l = 20 for EthereumG3. For node2vec, we grid search over p, q ∈ {0.50, 1.0, 1.5, 2.0} according to [Grover and Leskovec, 2016]. For DeepWalk, we set p = q = 1.0 as it is a special case of node2vec.

Table 2 compares the performance of the various methods on temporal directed link prediction in terms of Area Under Curve (AUC) and Average Precision (AP). For a clearer illustration, we only report two extreme sampling strategies of the proposed algorithm: T-EDGE, which does not apply any bias, and T-EDGE (TBS+WBS), which combines biases from both the time domain and the amount domain with the default α = 0.5. As discussed in Section 2.2, we have two kinds of TBS defined in Eqs. 2 and 3 as well as two kinds of WBS defined in Eqs. 5 and 6. Here we implement all four possible combinations for T-EDGE (TBS+WBS), and report the best result in Table 2.

Discussion of results
According to Table 2, we have the following observations: (1) T-EDGE without any bias substantially outperforms DeepWalk and node2vec, which indicates that the temporal information as well as the multiplicity of edges in a TWMDG are important and meaningful for the analysis and understanding of financial transaction networks; (2) With biases from both the time and amount domains, T-EDGE (TBS+WBS) attains better performance than unbiased T-EDGE, demonstrating that the rich information from the time and amount domains does help us obtain a more comprehensive representation for predictive tasks.
To further illustrate the superiority of the T-EDGE methods, we compare the performance of the embedding methods on EthereumG1 with varying values of the node embedding dimension d, walk length l, walks per node r and window size k. The results in Figure 5 show that: (1) T-EDGE with or without additional biases consistently outperforms DeepWalk and node2vec under different settings of k, l and r; (2) DeepWalk and node2vec are more sensitive to two hyperparameters, walk length l and walks per node r, while the T-EDGE methods achieve promising results over a wide range of both l and r; (3) Interestingly, with an increase of d, the performance of the T-EDGE methods improves monotonically, whereas the performance of DeepWalk and node2vec degrades when d is larger than 64, which implies that the T-EDGE methods can embed richer useful information and thus require a larger value of d for data representation.
To further investigate the effects of different sampling strategies on T-EDGE methods, we provide results of all possible combinations of three time domain strategies defined in Eqs. 1, 2, 3 and three amount domain strategies described in Eqs. 4, 5, 6. Figure

Node Classification
Phishing scams are a new type of cybercrime that has arisen along with the emergence of online business [Liu and Ye, 2001]. They are reported to account for more than 50% of all cybercrimes on Ethereum since 2017 [Konradt et al., 2016]. To further evaluate the performance of the proposed T-EDGE strategies, we also conduct node classification experiments on Ethereum to classify labeled phishing nodes and unlabeled nodes (treated as non-phishing nodes). In this part, we consider 445 phishing nodes labeled by Etherscan and the same number of randomly selected unlabeled nodes as our objective nodes, and a detailed list of these nodes is given in [Authors, 2019]. We make the assumption that, for a typical Ether transfer flow centered on a phishing node, the previous node of the phishing node may be a victim, and the next one to three nodes may be bridge nodes with money laundering behaviors. Therefore, we collect subgraphs with K-in = 1, K-out = 3 for each of the 890 objective nodes and then splice them into a large-scale network with 86,623 nodes.

For all embedding methods, we use the same hyperparameter setting (k = 4, r = 4, l = 10, d = 128), and the specific settings for node2vec are the same as those in the link prediction experiments. To make a comprehensive evaluation, we randomly select {60%, 70%, 80%} of the objective nodes as the training set and use the remaining objective nodes as the test set, respectively. We use five-fold cross validation to train the classifier and evaluate it on the test set. The results of micro-F1 (miF1) and macro-F1 (maF1) are shown in Table 3. These results further verify our assumption and motivation in Section 1 that, by considering temporal properties and money-transfer information, we can obtain a more meaningful representation of transaction networks which can effectively boost predictive performance.
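For reference, the two evaluation metrics can be computed as in the following sketch, which uses the standard definitions of micro-averaged and macro-averaged F1. The function name and the toy labels are hypothetical, with 1 standing for a phishing node and 0 for a non-phishing node.

```python
def f1_scores(y_true, y_pred):
    """Return (micro-F1, macro-F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp_sum = fp_sum = fn_sum = 0
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Micro-F1 pools the counts of all classes, macro-F1 averages class scores.
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Toy predictions: one non-phishing node is misclassified as phishing.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]
micro_f1, macro_f1 = f1_scores(y_true, y_pred)
```

For single-label classification, micro-F1 coincides with accuracy, while macro-F1 weights both classes equally and is therefore more sensitive to errors on the smaller class.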

Conclusion
In this work, we proposed a novel framework for Ethereum analysis via network embedding. In particular, we constructed a temporal weighted multidigraph to retain as much information as possible and presented a graph embedding method called T-EDGE, which incorporates the temporal and weighted information of financial transaction networks into node embeddings. We implemented the proposed method and two baseline embedding methods on a real Ethereum network for two predictive tasks with practical relevance, namely temporal link prediction and phishing/non-phishing node classification. Experimental results demonstrated the effectiveness of the proposed T-EDGE embedding method, and indicated that a temporal weighted multidigraph can more comprehensively represent the temporal and financial properties of dynamic transaction networks. For future work, we can use the proposed embedding method to investigate more applications on Ethereum or extend the current framework to analyze other large-scale temporal or domain-dependent networks.