The Complex Community Structure of the Bitcoin Address Correspondence Network

Bitcoin is built on a blockchain, an immutable decentralised ledger that allows entities (users) to exchange Bitcoins in a pseudonymous manner. Bitcoins are associated with alpha-numeric addresses and are transferred via transactions. Each transaction is composed of a set of input addresses (associated with unspent outputs received from previous transactions) and a set of output addresses (to which Bitcoins are transferred). Although Bitcoin was designed with anonymity in mind, different heuristic approaches exist to detect which addresses in a specific transaction belong to the same entity. By applying these heuristics, we build an Address Correspondence Network: in this representation, addresses are nodes, and two nodes are connected by an edge if at least one heuristic detects them as belonging to the same entity. In this paper, we analyse for the first time the Address Correspondence Network and show it is characterised by a complex topology, signalled by a broad, skewed degree distribution and a power-law component size distribution. Using a large-scale dataset of addresses for which the controlling entities are known, we show that external data combined with standard community detection algorithms can reliably identify entities. The complex nature of the Address Correspondence Network reveals that usage patterns of individual entities create statistical regularities, and that these regularities can be leveraged to more accurately identify entities and gain a deeper understanding of the Bitcoin economy as a whole.


Introduction
Cryptocurrencies are rapidly growing in interest, becoming a popular mechanism to perform pseudonymous exchanges between users (entities). They also allow payments in a decentralised manner without needing a trusted third party. The first and most popular cryptocurrency is Bitcoin, which uses an immutable and publicly available ledger to facilitate transactions between entities. Moreover, given its pseudo-anonymity, Bitcoin has also been used to perform activities in illegal markets.
For example, [1] estimate that one-quarter of entities in the Bitcoin network are associated with illegal activity. Consequently, several governance challenges have arisen, and law enforcement agents are particularly interested in techniques that allow tracing the origin of funds. Specifically, in Bitcoin, given the ledger's public nature, tracing the funds can be achieved by inspecting the history of transactions in the system. However, identifying the entities is a complex task because they can use different pseudonyms (addresses) in the system. By design of the Bitcoin protocol, it is impossible to completely de-anonymise the entities; however, not all entities prioritise anonymity [2], and it is possible to find recoverable traces of their activity in the transaction history.
The structure of the transactions allows, in some cases, tracing back address pseudonyms that potentially belong to the same entity. For example, [3] apply heuristics and then cluster together pseudonyms based on evidence of shared spending authority. In this paper, we study the application of several heuristics, which leads to the creation of a sequence of Address Correspondence Networks.

arXiv:2105.09078v1 [cs.SI] 19 May 2021
Each of these networks includes weighted links between addresses that potentially belong to the same entity, thus approaching entity identification from a network science perspective. Even though other approaches use networks to model some parts of the Bitcoin economic dynamics (e.g. [4][5][6]), to the best of our knowledge, network science approaches have not addressed the problem of analysing the Address Correspondence Network to date. In this study, we show that the Address Correspondence Networks have a strong community structure and that general-purpose clustering approaches are suitable for analysing them. Furthermore, our experiments suggest that having a set of identified entities generates large gains in cluster quality; however, this gain quickly declines, and a small number of known entities is enough to produce a significant increase in the quality of the detection.
The rest of this paper is organised as follows: section 2 explains the basics of the Bitcoin blockchain, heuristics, entity identification and related work. Section 3 presents our methods for constructing Address Correspondence Networks, the clustering technique and its quality metrics. In section 4, we discuss our findings, and finally, in section 5, we discuss conclusions and future work.

Background and related work
This section introduces the main concepts related to Bitcoin. Next, it discusses the task of identifying addresses controlled by the same entity, followed by a review of the main studies in the area.

The Bitcoin blockchain
Bitcoin was introduced in [7] as a decentralised payment network and digital currency that would be independent of central bank authorities. It is built on a blockchain, an immutable decentralised ledger that allows users, i.e. entities, to exchange the units of account (Bitcoins) in a pseudonymous manner. Entities transacting in the Bitcoin network control addresses: unique identifiers which have the right to transfer specific amounts of Bitcoins.
There are different types of addresses, which determine how the associated Bitcoins are accessed. For example, to spend Bitcoins associated with an address of type Pay to Public Key Hash (P2PKH), the entity needs to present a valid signature based on their private key, and a public key that hashes to the P2PKH value. Another example is the Pay to Script Hash (P2SH) address type: it defines a script for custom validation, which may include several signatures, passwords and other user-defined requirements. We denote with $a$ an address and with $A$ the set $\{a_1, \dots, a_n\}$ of addresses appearing in the Bitcoin blockchain.
Furthermore, we denote an entity as $e$, with $E$ representing the set $\{e_1, \dots, e_k\}$ of entities that own Bitcoin addresses.
To spend or receive Bitcoins, entities create transactions. A transaction $t$ is composed of a set of input addresses, a set of output addresses, and information specifying the amount of Bitcoins to be allocated to each output address. Formally, let $T$ be the set of transactions stored in the Bitcoin blockchain, and $P(A)$ be the power set of $A$. We model with $i : T \to P(A)$ and $o : T \to P(A)$ the mappings between a transaction and its input and output address sets. The sum of Bitcoins associated with the input addresses equals the sum of Bitcoins associated with the output addresses plus transaction fees. Therefore, if an entity wishes to spend only a partial amount of Bitcoins associated with the input addresses, the remainder is typically sent to an existing or newly created change address controlled by the initiating entity. Transaction outputs that have not yet been used as inputs to other transactions are referred to as UTXOs (unspent transaction outputs).
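As a minimal sketch of the transaction model above, the following Python snippet represents a transaction by the amounts attached to its input and output addresses, derives the address sets i(t) and o(t), and computes the fee as the difference between the two totals. The class name `Transaction`, its fields, and the example addresses and amounts are illustrative assumptions; they are neither part of the Bitcoin protocol nor taken from the dataset used in this study.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Transaction:
    """Toy transaction: maps addresses to amounts (in satoshi) on each side."""
    inputs: Dict[str, int]   # input address -> amount spent
    outputs: Dict[str, int]  # output address -> amount received

    def i(self) -> FrozenSet[str]:
        """Input address set i(t)."""
        return frozenset(self.inputs)

    def o(self) -> FrozenSet[str]:
        """Output address set o(t)."""
        return frozenset(self.outputs)

    def fee(self) -> int:
        """Fee = sum of input amounts minus sum of output amounts."""
        return sum(self.inputs.values()) - sum(self.outputs.values())


# Hypothetical example: a1 and a2 pay 120 to b1, and 75 goes to change address a3.
t = Transaction(inputs={"a1": 150, "a2": 50}, outputs={"b1": 120, "a3": 75})
print(sorted(t.i()), sorted(t.o()), t.fee())
```

In this toy example the change address a3 appears only on the output side, which is the kind of pattern that change-address heuristics try to detect.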
The transaction history is replicated on multiple nodes in the Bitcoin network. Entities broadcast new transactions to other nodes in the network. As part of Bitcoin's decentralised consensus protocol, specialised miner nodes are incentivised to solve proof-of-work puzzles that validate new transactions and group them into blocks. Blocks are sequentially appended to the blockchain; the number of blocks preceding a particular block is known as its block height. Furthermore, entities may specify a transaction's locktime. This is the minimum block height the blockchain must reach before miners should consider validating the transaction, i.e. a transaction with locktime $j$ is added to block $j + 1$ or later.

[...] introduced in section 2.2) linking those addresses to the same entity. Each heuristic is assigned a positive weight, such that their sum is equal to one. The edge weight is the sum of the weights of the heuristics that link the two addresses. The author applies a variety of generic graph clustering algorithms (e.g. k-means, spectral, DBSCAN) as well as graph sparsification and coarsening techniques to the constructed heuristic graph. In this study, we propose the Address Correspondence Network, which is similar to the network built by [13]. However, in our correspondence network, the weight of an edge between two addresses represents the number of times the heuristics identify the pair as controlled by the same entity. We use a label propagation algorithm to build the clusters, using ground truth information to drive the algorithm.
There exist other approaches and extensions to address clustering. [14] show that higher cluster homogeneity can be achieved when transaction data is augmented with off-chain information from the internet. [15] propose incorporating lower-level network information to enhance deanonymisation. Furthermore, [16] extend address clustering by using supervised machine learning to predict the type of entity controlling addresses in an unlabelled cluster. In our study, in addition to using a ground truth to guide the clustering construction, we introduce a temporal component in the analysis. We build Address Correspondence Networks for various time intervals. In this way, we can analyse the evolution of the network over time.

Methodology
We expand upon the work of [13] by performing address clustering on so-called Address Correspondence Networks, denoted $G_{[o,c]}$.

For some addresses $a_j$, the controlling entity is known. Using the block explorer tool provided by [17], we obtain entity labels for 28 million addresses involved in transactions before 2017. We refer to this data set as the ground truth. The mapping information contained in the ground truth is denoted with $e(\cdot)$, such that $\{a_j \mid \exists\, e(a_j)\} \subseteq A$ is the set of addresses for which the entity label is known. We use the ground truth (1) to sample from $T$ and (2) to evaluate the quality of address clustering methods.
The remainder of this section is organised as follows. Section 3.1 describes the method for sampling from T . This sample is divided further into cumulative and partial subsets, which are described in section 3.2. Section 3.3 details the construction of the Address Correspondence Networks. We explain our approach to clustering these networks in section 3.4, while the metrics used to evaluate clustering quality are introduced in section 3.5.

Transaction sampling
For computational feasibility, we restrict our analysis to a sample of $T$, as depicted in Figure 1. First, we randomly select a subset $A^S_0$ of the addresses in the ground truth. Next, we select all transactions involving an address $a \in A^S_0$ as an input or output, i.e.,

$$T^S_0 = \{t \in T \mid \exists\, a \in A^S_0 : a \in i(t) \cup o(t)\}.$$

We then build the set $A^S_1$ of addresses that appear in transactions in $T^S_0$. The aforementioned process is then repeated in a similar manner. This involves finding the set $T^S_1$ of transactions which include at least two addresses in $A^S_1$, i.e.

$$T^S_1 = \{t \in T \mid |(i(t) \cup o(t)) \cap A^S_1| \geq 2\}.$$

We set the condition on two addresses per transaction to reduce the size of the subsequently constructed Address Correspondence Networks. Finally, we build $A^S_2$ as the addresses appearing in transactions of $T^S_1$ and not already in $A^S_0 \cup A^S_1$.

As a result, this process constructs a set of sampled transactions $T^S = T^S_0 \cup T^S_1$.
An advantage of this sampling method is that the constructed Address Correspondence Networks are centred around ground truth seed addresses, thereby exploiting the previous knowledge of controlling entities.
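The sampling procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: transactions are reduced to pairs of input and output address sets, the function name `sample_transactions` and the toy data are hypothetical, and the first-step address set is assumed to keep the seed addresses rather than exclude them.

```python
import random
from typing import List, Set, Tuple

# A transaction is modelled as (input address set, output address set).
Tx = Tuple[frozenset, frozenset]


def addresses(tx: Tx) -> frozenset:
    """All addresses appearing in a transaction, i.e. i(t) united with o(t)."""
    return tx[0] | tx[1]


def sample_transactions(txs: List[Tx], ground_truth: Set[str],
                        n_seeds: int, rng: random.Random):
    """Two-step sampling around randomly chosen ground-truth seed addresses."""
    seeds = set(rng.sample(sorted(ground_truth), n_seeds))       # seed addresses
    t0 = [t for t in txs if addresses(t) & seeds]                # touch a seed
    a1 = set().union(*(addresses(t) for t in t0))                # first-step addresses
    t1 = [t for t in txs if len(addresses(t) & a1) >= 2]         # >= 2 known addresses
    a2 = set().union(*(addresses(t) for t in t1)) - a1 - seeds   # newly reached
    sampled = list(dict.fromkeys(t0 + t1))                       # union, no duplicates
    return seeds, a1, a2, sampled


# Hypothetical toy ledger: "a" is the only ground-truth address.
txs = [
    (frozenset({"a"}), frozenset({"b"})),
    (frozenset({"a"}), frozenset({"c"})),
    (frozenset({"b", "c"}), frozenset({"d"})),   # two first-step addresses: kept
    (frozenset({"b"}), frozenset({"e"})),        # only one: dropped
    (frozenset({"x"}), frozenset({"y"})),        # unrelated: dropped
]
seeds, a1, a2, sampled = sample_transactions(txs, {"a"}, 1, random.Random(0))
print(sorted(a1), sorted(a2), len(sampled))
```

The two-address condition in the second step is what keeps the transaction `b -> e` out of the sample, which illustrates how the condition limits the growth of the sampled network.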

Partial and cumulative transaction sets
To study the evolution of the Bitcoin Address Correspondence Network over time, we create temporal subsets of the transactions in $T^S$. Each subset includes only the transactions in $T^S$ that were generated in a specific time interval. We create time intervals using two different strategies, which we name cumulative and partial, summarised in Figure 2.
The cumulative strategy creates eight time intervals of progressively increasing width 1

Address Correspondence Network construction
Let $w : A \times A \to \mathbb{N}$ be a function that counts how often an address pair $(a_1, a_2)$ is detected by any of the seven heuristics introduced in section 2.2 as being controlled by the same entity (considering only transactions in $T^S_{[o,c]}$). It is worth noting that $w$ is symmetric (i.e. the resulting network is undirected): $w(a_1, a_2) = w(a_2, a_1)$.
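A minimal sketch of such a counting function is given below. It assumes that each heuristic can be expressed as a function that takes a transaction and yields groups of addresses it considers co-owned. The single heuristic shown, `multi_input` (all inputs of a transaction share an owner), is included purely for illustration; the names `build_weights` and `multi_input` are hypothetical and do not reproduce the seven heuristics used in this study.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Iterable, List, Set, Tuple

Tx = Tuple[frozenset, frozenset]                 # (input addresses, output addresses)
Heuristic = Callable[[Tx], Iterable[Set[str]]]   # yields groups of co-owned addresses


def multi_input(tx: Tx):
    """Illustrative heuristic: all inputs of a transaction share one owner."""
    if len(tx[0]) > 1:
        yield set(tx[0])


def build_weights(txs: List[Tx], heuristics: List[Heuristic]) -> Counter:
    """Count, per unordered address pair, how often any heuristic links it."""
    w = Counter()
    for tx in txs:
        for heuristic in heuristics:
            for group in heuristic(tx):
                for a, b in combinations(sorted(group), 2):
                    w[frozenset((a, b))] += 1    # unordered key, so w is symmetric
    return w


# Hypothetical toy transactions.
txs = [
    (frozenset({"a1", "a2"}), frozenset({"b"})),
    (frozenset({"a1", "a2", "a3"}), frozenset({"c"})),
]
w = build_weights(txs, [multi_input])
print(w[frozenset(("a1", "a2"))], w[frozenset(("a1", "a3"))])
```

Using an unordered pair as the key makes the symmetry of the weight function hold by construction, and a `Counter` returns 0 for pairs that no heuristic ever links.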

The information captured by applying $w$ to each pair of addresses in $A^S_{[o,c]}$ is represented as a weighted, undirected graph, the Address Correspondence Network $G_{[o,c]}$. The construction process is depicted in Figure 2.

The addresses in $A^S_{[o,c]}$ are the vertices of the graph, and $w$ is the weight function. The set of edges connects addresses in two ways:

(1) pairs $(a_i, a_o)$ for which there is a transaction $t$ having respectively $a_i$ and $a_o$ in its input and output address sets $i(t)$ and $o(t)$, and having $w(a_i, a_o) > 0$;

(2) pairs $(a_{i_1}, a_{i_2})$ for which there is a transaction $t$ having both $a_{i_1}$ and $a_{i_2}$ in its input set $i(t)$, and having $w(a_{i_1}, a_{i_2}) > 0$.

1 We represent dates in the DD.MM.YY format.
Note that in a transaction, different heuristics can concur by identifying the same address as a change address, increasing the weights of the edges related to such an address.

Figure 3 shows the degree distribution of the Address Correspondence Networks $G_{[11s2,12s1]}$ and $G_{[11s2]}$. Although the left plot refers to a cumulative network and the right plot to a partial network, the two distributions show a similar shape; this indicates that the correspondence networks appear to preserve common properties across time. Table 1 provides descriptive statistics of the 16 Address Correspondence Networks we constructed from the eight partial and cumulative transaction sets. While the degree distributions cannot be described by a single statistical distribution, they are skewed and fat-tailed, features that are recognised in complex networks of different contexts like biological, technological or social interactions [18].

[...] [23]. In LPA, each node is initialised with a unique label, denoting the cluster it is part of (the controlling entity of an address). In the basic case, all the nodes are initially assigned a random label. Afterwards, nodes are visited in random order and each is assigned the label held by the majority of its neighbours. The process repeats until every node in the network has a label to which most of its neighbours belong. Figure 5 shows a clustering for the partial network $G_{[12s2]}$.

Address Correspondence Network clustering
To initialise a subset of the nodes, we use the information from the ground truth.
In this paper, we are interested in exploring the ability of community detection algorithms to provide additional information about the true identities of users. We hypothesise that the Address Correspondence Network encodes additional information about the entities that control specific addresses. We argue that successive applications of heuristics may lead to connections between addresses controlled by the same entity that are denser and more heavily weighted than connections between addresses of different entities. Following this argument, we apply LPA to obtain a disjoint set of clusters $C_{[o,c]}$. Because of the additional information provided by the ground truth, we modified LPA so that the addresses in $A^I_{[o,c]}$ cannot change label, as they are associated with the actual entity according to the ground truth information. In the experiments, we vary the proportion $p$ of initialised nodes.

Given the set of clusters $C_{[o,c]}$ produced by LPA, we analyse the quality of $C_{[o,c]}$ by defining a set of discrete random variables to describe characteristics of the network, and by five metrics: modularity, which gives information about the intrinsic quality of the clusters (and the inherent topological structure of the network), and homogeneity, entropy, Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI), which compare the clusters with the ground truth labels. Furthermore, all metrics are measured as functions of the proportion of initialised nodes $p$.
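The modified label propagation described above can be sketched as follows. This is a simplified illustration and not the implementation used in this study: it assumes weighted voting over neighbour labels, random tie-breaking, and a fixed iteration cap. The function name `seeded_lpa`, the adjacency format and the toy graph are hypothetical. Choosing which and how many nodes to seed (the proportion p) is left to the caller.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable


def seeded_lpa(adj: Dict[Hashable, Dict[Hashable, float]],
               seeds: Dict[Hashable, Hashable],
               rng: random.Random, max_iter: int = 100) -> Dict[Hashable, Hashable]:
    """Weighted label propagation in which seed nodes never change label.

    adj[u][v] is the edge weight w(u, v); seeds maps a node to its known entity.
    """
    # Unseeded nodes start with a unique label; seeds carry their entity label.
    labels = {u: seeds.get(u, ("unlabelled", u)) for u in adj}
    free = [u for u in adj if u not in seeds]
    for _ in range(max_iter):
        rng.shuffle(free)                      # visit nodes in random order
        changed = False
        for u in free:
            if not adj[u]:
                continue                       # isolated node keeps its label
            votes = defaultdict(float)
            for v, weight in adj[u].items():
                votes[labels[v]] += weight     # weighted majority voting
            best = max(votes.values())
            if votes.get(labels[u], 0.0) == best:
                continue                       # current label is already a winner
            winners = sorted((l for l, s in votes.items() if s == best), key=repr)
            labels[u] = rng.choice(winners)    # break ties at random
            changed = True
        if not changed:
            break
    return labels


# Hypothetical toy graph: two groups joined by a weak edge, one seed per group.
edges = [("a", "b", 3.0), ("a", "c", 3.0), ("b", "c", 1.0),
         ("d", "e", 3.0), ("d", "f", 3.0), ("e", "f", 1.0), ("c", "f", 1.0)]
adj = defaultdict(dict)
for u, v, weight in edges:
    adj[u][v] = weight
    adj[v][u] = weight
labels = seeded_lpa(adj, seeds={"a": "X", "d": "Y"}, rng=random.Random(0))
print(labels)
```

Because seed nodes are excluded from the update loop, their ground-truth labels act as fixed sources from which entity labels spread through heavily weighted edges.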

Random variables
To study the characteristics of the network, we define the following discrete random variables associated with the distributions of entities, addresses, and known addresses in the address correspondence network.
The first random variable, $E$, assumes a value from the set of entities according to their frequency in the correspondence network. More specifically, $E$ can assume the value $e \in E_{[o,c]}$ with probability equal to the number of addresses in $A_{[o,c]}$ mapped to $e$, divided by the total number of addresses in $A_{[o,c]}$, i.e.:

$$P(E = e) = \frac{|\{a \in A_{[o,c]} : a \text{ is mapped to } e\}|}{|A_{[o,c]}|}$$

In addition to $E$, we also define variables that assume values in the entity set according to their frequency in specific clusters.
Let $q_{ij}$ be the ratio of edges connecting nodes in $C^{(i)}_{[o,c]}$ to nodes in $C^{(j)}_{[o,c]}$, and $r_i = \sum_j q_{ij}$ the ratio of edges with at least one end in $C^{(i)}_{[o,c]}$. The modularity is defined as:

$$Q = \sum_i \left( q_{ii} - r_i^2 \right)$$

A value close to 0 indicates that the community structure is akin to a random network, while values close to 1 indicate strong community structures, meaning dense connections inside the communities and sparse connections between them.

Information Theory Metrics. Entropy, introduced in an information theory context by [25], quantifies the expected amount of information or uncertainty contained in a random variable. Let $X$ be a discrete random variable, which can assume values $\{x_1, x_2, \dots, x_k\}$ with probability $\{P(x_1), P(x_2), \dots, P(x_k)\}$. The entropy of $X$ is defined as:

$$H(X) = -\sum_{i=1}^{k} P(x_i) \log P(x_i)$$

while the normalised Shannon entropy is:

$$\hat{H}(X) = \frac{H(X)}{H_{\max}(X)}, \qquad H_{\max}(X) = \log k$$

We use the normalised entropy of $E_i$ and $C_j$ to study the clusters from the perspective of the entities and from that of the clusters themselves.
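As an illustration of the modularity measure, the sketch below computes it for an unweighted, undirected edge list, assuming the standard Newman-Girvan form Q = sum_i (q_ii - r_i^2), in which an edge between two different clusters contributes half of its share to each of them. The function name `modularity` and the toy graph are hypothetical, and edge weights are ignored for simplicity.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]],
               cluster: Dict[Hashable, Hashable]) -> float:
    """Q = sum_i (q_ii - r_i^2) for an unweighted, undirected edge list."""
    edges = list(edges)
    m = len(edges)
    inside = defaultdict(float)   # q_ii: share of edges with both ends in cluster i
    ends = defaultdict(float)     # r_i: share of edge ends attached to cluster i
    for u, v in edges:
        cu, cv = cluster[u], cluster[v]
        if cu == cv:
            inside[cu] += 1 / m
        ends[cu] += 0.5 / m
        ends[cv] += 0.5 / m
    return sum(inside[c] - ends[c] ** 2 for c in ends)


# Hypothetical toy graph: two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
two_clusters = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_cluster = {node: 0 for node in two_clusters}
print(modularity(toy_edges, two_clusters), modularity(toy_edges, one_cluster))
```

Placing every node in a single cluster yields a modularity of zero, whereas the split into the two triangles yields a clearly positive value.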
Entropy also gives important information about the interrelation between random variables. Let us consider two variables $X$ and $Y$, and let $P(X, Y)$ be the joint probability distribution. The conditional entropy $H(Y|X)$ is defined as:

$$H(Y|X) = -\sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)}$$

The conditional entropy indicates how much extra information is needed to describe $Y$ given that $X$ is known. Additionally, the amount of information needed on average to specify the value of two random variables is $H(X, Y) = H(X|Y) + H(Y)$.
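The entropy measures discussed above can be estimated from observed labels as in the following sketch. It assumes empirical (frequency-based) probabilities, natural logarithms, and a normalisation by the logarithm of the number of distinct observed values; the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(xs: Sequence[Hashable]) -> float:
    """Shannon entropy H(X) of the empirical distribution of xs."""
    n = len(xs)
    return -sum(c / n * math.log(c / n) for c in Counter(xs).values())


def normalised_entropy(xs: Sequence[Hashable]) -> float:
    """H(X) / log(k), with k the number of distinct values; 0 if k < 2."""
    k = len(set(xs))
    return entropy(xs) / math.log(k) if k > 1 else 0.0


def conditional_entropy(ys: Sequence[Hashable], xs: Sequence[Hashable]) -> float:
    """H(Y|X) = -sum over (x, y) of P(x, y) * log(P(x, y) / P(x))."""
    n = len(xs)
    joint, marginal = Counter(zip(xs, ys)), Counter(xs)
    return -sum(c / n * math.log(c / marginal[x]) for (x, _), c in joint.items())


# Hypothetical toy labels: cluster and entity of six addresses.
clusters = ["c1", "c1", "c1", "c2", "c2", "c2"]
entities = ["X", "X", "Y", "Y", "Y", "Y"]
joint_entropy = entropy(list(zip(clusters, entities)))
chain_rule = conditional_entropy(entities, clusters) + entropy(clusters)
print(joint_entropy, chain_rule, normalised_entropy(entities))
```

The last lines check the chain rule numerically: the joint entropy of cluster and entity labels matches the sum of the conditional entropy and the entropy of the cluster labels.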
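We use conditional entropy to measure the quality of the clusters. We do so by comparing them with the distribution of the entities in the Address Correspondence Network, exploiting the variables $E$ and $C$. Such a measure is named homogeneity and was initially introduced by [26]:

$$h = 1 - \frac{H(E|C)}{H(E)}$$

[...]

The Rand Index is defined as:

$$RI = \frac{n_{11} + n_{00}}{\binom{|A_{[o,c]}|}{2}}$$

where $n_{11}$ is the number of address pairs that are in the same cluster and mapped to the same entity, $n_{00}$ is the number of address pairs that are in different clusters and mapped to different entities, and the denominator is the number of address pairs in $A_{[o,c]}$. As with MI/AMI, we consider an adjusted version of RI, the Adjusted Rand Index (ARI) as proposed by [30], which accounts for chance:

$$ARI = \frac{RI - \mathbb{E}[RI]}{\max(RI) - \mathbb{E}[RI]}$$

[31] shows that ARI is equivalent to Cohen's Kappa [32], which is well suited for the evaluation of community detection methods, as discussed by [33].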
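For illustration, the Rand Index and its chance-adjusted version can be computed from pair counts as sketched below. The sketch assumes the usual pair-counting formulation of the Adjusted Rand Index (Hubert and Arabie) and requires at least two addresses. The function name `rand_indices` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence, Tuple


def rand_indices(clusters: Sequence[Hashable],
                 entities: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (RI, ARI) for two labelings of the same addresses (n >= 2)."""
    n = len(clusters)
    pairs = comb(n, 2)                                   # all address pairs
    both = sum(comb(c, 2) for c in Counter(zip(clusters, entities)).values())
    same_c = sum(comb(c, 2) for c in Counter(clusters).values())
    same_e = sum(comb(c, 2) for c in Counter(entities).values())
    # Agreements: pairs together in both labelings plus pairs apart in both.
    ri = (both + (pairs - same_c - same_e + both)) / pairs
    expected = same_c * same_e / pairs                   # chance level of `both`
    max_index = (same_c + same_e) / 2
    if max_index == expected:                            # degenerate labelings
        return ri, 1.0
    return ri, (both - expected) / (max_index - expected)


# Hypothetical toy labels: cluster and entity of six addresses.
clusters = [0, 0, 0, 1, 1, 1]
entities = ["X", "X", "Y", "Y", "Y", "Y"]
ri, ari = rand_indices(clusters, entities)
print(ri, ari)
```

In this toy example the adjusted index is markedly lower than the raw Rand Index, since part of the raw agreement is what two unrelated labelings of the same sizes would achieve by chance.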

Results
We first analyse the size of the clusters identified by LPA for the Address Correspondence Networks described in Section 3, whose statistics are shown in Table 1. Figure 6 shows the cluster size distribution of $G_{[11s2,13s1]}$ and $G_{[15s1]}$, for initialisation proportions $p = 0$ and $p = 0.1$. Note that, in both cases, the density of small clusters shifts towards larger cluster sizes when $p = 0.1$, as does the maximum cluster size of $G_{[11s2,13s1]}$. This indicates that even a small proportion of initialised nodes, such as $p = 0.1$, considerably modifies the cluster distribution in the networks.
We also fit a power-law distribution to the cluster size distribution, shown by the dotted red lines with the corresponding alpha values in Figure 6. Furthermore, the power-law distribution fits the data significantly better than an exponential distribution, resulting in p-values of less than 0.1% using likelihood ratio tests [34]. The exponents are larger for $p = 0$ than for $p = 0.1$, in agreement with the observation related to the range of values in the cluster size. In general, the distributions are very heterogeneous. Additionally, the cluster size distribution suggests that, from a Correspondence Network perspective, there is a preferential attachment dynamic in the address generation where entities that control many addresses are likely to generate more addresses than others.
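The comparison between a power-law and an exponential fit can be illustrated with the following sketch. It is not the fitting procedure used in this study: it assumes continuous approximations of both distributions above a chosen minimum size `xmin`, maximum-likelihood estimates for their parameters, and a normalised (Vuong-style) log-likelihood ratio test. The function name `powerlaw_vs_exponential` and the synthetic sample are hypothetical.

```python
import math
from typing import Sequence, Tuple


def powerlaw_vs_exponential(xs: Sequence[float],
                            xmin: float = 1.0) -> Tuple[float, float, float]:
    """Fit both tails by maximum likelihood; return (alpha, R, p_value).

    R > 0 favours the power law; p_value is two-sided for the sign of R.
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)   # power-law MLE
    lam = 1.0 / (sum(tail) / n - xmin)                        # exponential MLE
    # Pointwise log-likelihood differences between the two fitted models.
    diffs = [
        (math.log((alpha - 1.0) / xmin) - alpha * math.log(x / xmin))
        - (math.log(lam) - lam * (x - xmin))
        for x in tail
    ]
    ratio = sum(diffs)
    mean = ratio / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    p_value = math.erfc(abs(ratio) / (sigma * math.sqrt(2.0 * n)))
    return alpha, ratio, p_value


# Synthetic sizes drawn deterministically from a power law with exponent 2.5.
sizes = [(1 - (i + 0.5) / 1000) ** (-1 / 1.5) for i in range(1000)]
alpha, ratio, p_value = powerlaw_vs_exponential(sizes)
print(alpha, ratio, p_value)
```

Since cluster sizes are integers, a discrete fit would be more faithful; the continuous form is used here only to keep the example short.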
Next, we study the behaviour of the total intra-cluster degree (number of edges connecting nodes that belong to the same cluster) and the inter-cluster degree (number of edges between nodes that belong to different clusters) as functions of the cluster size. For the total intra-cluster degree, two extreme behaviours can be expected. On the one hand, a linear dependency on cluster size would signal that address reuse is negligible (and therefore that privacy-preserving usage patterns are commonplace), and that the topology of the correspondence network encodes no additional information about the identity of the users that control the addresses. On the other hand, a quadratic relationship (close to the theoretical maximum $\propto c(c - 1)/2$) would signal that the clusters are very densely interconnected, and that actual address reuse is high. In that case, it would be possible to infer actual information about the users by directly inspecting the correspondence network through network science methods. In Figure 7, [...]

We observe that AMI and ARI lead to similar results: they rapidly increase before converging to the maximum value as $p$ increases. In contrast, homogeneity exhibits no such initial rapid increase, and instead increases linearly with $p$. The mean levels of AMI, ARI and homogeneity do not consistently increase or decrease over successive half-years. Furthermore, the mean metric levels for the partial networks appear to be comparable to those for the cumulative networks. This suggests that the complexity and structure of the Address Correspondence Network communities remain stable over time.
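The intra- and inter-cluster degrees discussed above, and the way they scale with cluster size, can be computed as in the following sketch. It assumes an unweighted edge list and estimates the scaling exponent by a least-squares fit in log-log space, where a slope near 1 indicates a linear dependency and a slope near 2 a quadratic one. The function names and the toy graph are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def cluster_degrees(edges: Iterable[Tuple[Hashable, Hashable]],
                    cluster: Dict[Hashable, Hashable]):
    """Per cluster: (size, intra-cluster edge count, inter-cluster edge count)."""
    size = Counter(cluster.values())
    intra, inter = Counter(), Counter()
    for u, v in edges:
        cu, cv = cluster[u], cluster[v]
        if cu == cv:
            intra[cu] += 1
        else:
            inter[cu] += 1     # the edge leaves both clusters it touches
            inter[cv] += 1
    return {c: (size[c], intra[c], inter[c]) for c in size}


def loglog_slope(points: List[Tuple[float, float]]) -> float:
    """Least-squares slope of log(y) against log(x); needs positive x and y."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))


# Hypothetical toy graph: two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
toy_cluster = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B", "f": "B"}
stats = cluster_degrees(toy_edges, toy_cluster)
print(stats, loglog_slope([(1, 1), (2, 4), (4, 16)]))
```

In practice, the slope would be fitted to the (size, intra-cluster degree) pairs of all clusters with at least one internal edge, and separately to the inter-cluster degrees.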
The effect of the node initialisation. If the cost of labelling a Bitcoin address is assumed to be constant, the marginal gain in clustering quality per unit cost from increasing $p$ quickly declines. Considering that homogeneity increases at a constant rate across all $p$, it appears that increasing $p$ is cost-effective until around $p = 0.1$. At this point, $A^I_{[o,c]}$ contains most of the information required to describe the community structure. The observed saturations in $|C_{[o,c]}|$, AMI and ARI suggest that increasing $p$ beyond 0.1 adds only idiosyncratic community information, yielding little improvement in clustering quality. This is further confirmed by studying clustering modularity as a function of $p$ in Figure 11. Modularity appears mostly constant except for a sharp initial change, showing a robust community topology that is consistently detected after initialising a small proportion of nodes.
To assess the significance of the results presented in Figures 8-12, [...] $G_{[11s2,13s1]}$, the randomised results show little variation. However, all randomised results appear significantly different from those for the original networks. This suggests that the (non-randomised) results shown in Figures 8-12 are a consequence of more complex network properties rather than solely the degree distribution.

Furthermore, the effect of node initialisation order was studied by repeating the experiments for the $G_{[11s2,13s1]}$ and $G_{[15s1]}$ networks using 100 random orderings. The node initialisation order does not seem to affect the general level and shape of the curves. Small perturbations observed in Figures 8-12 appear to be idiosyncrasies of the chosen ordering, and may be larger for smaller networks (since the curves for $G_{[11s2,13s1]}$ vary more than the ones for $G_{[15s1]}$).
The effect of cluster and entity sizes. Figure 13 shows $\hat{H}$ [...] and $G_{[11s2,15s2]}$ networks. $\hat{H}(E_i)$ and $\hat{H}(C_j)$ are expressed as functions of the relative cluster and entity sizes, respectively, i.e. normalised to $|A_{[o,c]}|$. We run experiments with $p = 0$ and $p = 0.1$. We note that $\hat{H}(E_i)$ correlates negatively with the relative cluster size, and $\hat{H}(C_j)$ correlates negatively with relative entity size. For small clusters and entities, there are strips of points located at the minimum and maximum values of $\hat{H}(E_i)$ and $\hat{H}(C_j)$. This is to be expected: if we consider a cluster with only two addresses, both associated with the same entity, $\hat{H}(E_i)$ is minimum. If the two addresses are mapped to different entities, we obtain a uniform entity label distribution, and $\hat{H}(E_i)$ is maximum. Such extreme fluctuations become less likely as cluster size increases. Large clusters, therefore, tend to be purer than smaller clusters, corresponding to a higher clustering quality.
Similarly, entities represented by more addresses are distributed more asymmetrically across clusters, again corresponding to a higher clustering quality. This is in agreement with the results in Figure 7, where the community structure is shown to become more apparent for larger clusters.
Furthermore, the mean levels of $\hat{H}(E_i)$ and $\hat{H}(C_j)$ for the partial networks are always less than or equal to the ones of the corresponding cumulative networks (comparing row 1 to row 3 and row 2 to row 4 in Figure 13). This suggests that partial networks allow a higher quality of interpretation regarding the community structure. A possible explanation for this is that Bitcoin entities have less time to obfuscate their activity: the longer the considered transaction history, the more the obfuscation attempts accumulate and the more difficult it becomes to detect the true community structure.
Interestingly, the average $\hat{H}(E_i)$ and $\hat{H}(C_j)$ increase after initialising 10% of nodes. The increase in $\hat{H}(E_i)$ can be explained by the loss of small, homogeneous clusters with low $\hat{H}(E_i)$. For $\hat{H}(C_j)$, the increase is likely due to the decrease in the number of clusters, which in turn causes $H_{\max}(E_i)$ to decrease.

Conclusion and future work
In this paper, we consider the application of a general-purpose community detection algorithm, LPA, to detect address clusters that are controlled by the same entity in the Bitcoin transaction history. Specifically, we apply LPA to Address Correspondence Networks, which incorporate information from a variety of simple address linking heuristics. We detect a strong community structure within these networks by inspecting their intra- and inter-cluster degrees. We find that the intra-cluster degree grows faster than the inter-cluster degree as cluster size increases. Address Correspondence Networks are therefore suitable for the application of general community detection methods from the broader field of network science. This creates an entry point for future researchers to move far beyond the application of primitive heuristics.
Since LPA is able to exploit ground truth information, we find that clustering quality improves as the number of labelled addresses in the Address Correspondence Networks increases. However, under the assumption that the cost of labelling a Bitcoin address is constant, we find that the marginal gain in clustering quality per unit cost quickly declines. Under this assumption, we propose that address labelling is cost-effective until around $p = 0.1$. [...] analysis; supervised methods are suitable if more ground truth information is available in the future.

Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Author Contributions
DDA and CJT conceived the experiment. JAF and AP performed the analysis. All authors discussed the methods and results.
JAF and AP wrote a first draft. All authors worked and agreed on the final version.

Funding
DDA acknowledges partial funding from the Swiss National Science Foundation under contract # 407550_167177. CJT acknowledges financial support from the University of Zurich through the University Research Priority Program on Social Networks.

Data Availability Statement
The data analysed in this study is publicly available and can be obtained by synchronising the Bitcoin blockchain. The ground truth dataset is available at https://www.walletexplorer.com/.

Figure 13. Normalised entropy as a function of relative cluster size and relative entity size.