ORIGINAL RESEARCH article
The Complex Community Structure of the Bitcoin Address Correspondence Network
- 1Faculty of Business, Economics and Informatics, Universität Zürich, Zürich, Switzerland
- 2Department of Computer Science, Aalborg University, Aalborg, Denmark
- 3Department of Informatics, Universität Zürich, Zürich, Switzerland
- 4UZH Blockchain Center and URPP Social Networks, Universität Zürich, Zürich, Switzerland
Bitcoin is built on a blockchain, an immutable decentralized ledger that allows entities (users) to exchange Bitcoins in a pseudonymous manner. Bitcoins are associated with alpha-numeric addresses and are transferred via transactions. Each transaction is composed of a set of input addresses (associated with unspent outputs received from previous transactions) and a set of output addresses (to which Bitcoins are transferred). Although Bitcoin was designed with anonymity in mind, different heuristic approaches exist to detect which addresses in a specific transaction belong to the same entity. By applying these heuristics, we build an Address Correspondence Network: in this representation, addresses are nodes, and two nodes are connected by an edge if at least one heuristic detects them as belonging to the same entity. In this paper, we analyze the Address Correspondence Network for the first time and show that it is characterized by a complex topology, signaled by a broad, skewed degree distribution and a power-law component size distribution. Using a large-scale dataset of addresses for which the controlling entities are known, we show that a combination of external data coupled with standard community detection algorithms can reliably identify entities. The complex nature of the Address Correspondence Network reveals that usage patterns of individual entities create statistical regularities, and that these regularities can be leveraged to more accurately identify entities and gain a deeper understanding of the Bitcoin economy as a whole.
Cryptocurrencies are rapidly growing in interest, becoming a popular mechanism to perform pseudonymous exchanges between users (entities). They also allow payments in a decentralized manner without needing a trusted third party. The first and most popular cryptocurrency is Bitcoin, which uses an immutable and publicly available ledger to facilitate transactions between entities. Given its pseudonymity, Bitcoin has also been used to perform activities in illegal markets. For example, Foley et al. estimate that one-quarter of entities in the Bitcoin network are associated with illegal activity. Consequently, several governance challenges have arisen, and law enforcement agencies are particularly interested in techniques that allow tracing the origin of funds. In Bitcoin, given the ledger’s public nature, funds can be traced by inspecting the history of transactions in the system. However, identifying the entities is a complex task because they can use different pseudonyms (addresses) in the system. Under the Bitcoin protocol, it is impossible to completely de-anonymize the entities; however, not all entities prioritize anonymity, and it is possible to find recoverable traces of their activity in the transaction history.
The structure of the transactions allows, in some cases, tracing back address pseudonyms that potentially belong to the same entity. For example, Meiklejohn et al. apply heuristics and then cluster together pseudonyms based on evidence of shared spending authority. In this paper, we study the application of several heuristics, which leads to a sequence of Address Correspondence Networks. Each of these networks includes weighted links between addresses that potentially belong to the same entity, thus approaching entity identification from a network science perspective. Even though other approaches use networks to model some parts of the Bitcoin economic dynamics (e.g. [4–7]), to the best of our knowledge, network science approaches have not yet been used to analyze the Address Correspondence Network. In this study, we show that the Address Correspondence Networks have a strong community structure and that general-purpose clustering approaches are suitable for analyzing them. Furthermore, our experiments suggest that having a set of identified entities generates large gains in cluster quality. However, this gain quickly declines, and a small number of known entities is enough to produce a significant increase in the quality of the detection.
The rest of this paper is organized as follows: Section 2 explains the basics of the Bitcoin blockchain, heuristics, entity identification and related work. Section 3 presents our methods for constructing Address Correspondence Networks, the clustering technique and its quality metrics. In Section 4, we discuss our findings, and finally, in Section 5, we present our conclusions and future work.
2 Background and Related Work
This section introduces the main concepts related to Bitcoin. Next, it discusses the task of identifying addresses controlled by the same entity, followed by a review of the main studies in the area.
2.1 The Bitcoin Blockchain
Bitcoin was introduced by Nakamoto [8] as a decentralized payment network and digital currency which would be independent of central bank authorities. It is built on a blockchain, an immutable decentralized ledger that allows users, i.e. entities, to exchange the units of account (Bitcoins) in a pseudonymous manner. Entities transacting in the Bitcoin network control addresses, which are unique identifiers that have the right to transfer specific amounts of Bitcoins.
There are different types of addresses, which determine how the associated Bitcoins are accessed. For example, to spend Bitcoins associated with an address of type Pay to Public Key Hash (P2PKH), the entity needs to present a valid signature based on their private key, and a public key that hashes to the P2PKH value. Another example is the Pay to Script Hash (P2SH) address type: it defines a script for custom validation, which may include several signatures, passwords and other user-defined requirements. We denote with a an address and with
To spend or receive Bitcoins, entities create transactions. A transaction t is composed of a set of input addresses, a set of output addresses, and information specifying the amount of Bitcoins to be allocated to each output address. Formally, let
The transaction history is replicated on multiple nodes in the Bitcoin network. Entities broadcast new transactions to other nodes in the network. As part of Bitcoin’s decentralized consensus protocol, specialized miner nodes are incentivized to solve proof-of-work puzzles that validate new transactions and group them into blocks. Blocks are sequentially appended to the blockchain; the number of blocks preceding a particular block is known as its block height. Furthermore, entities may specify a transaction’s locktime. This is the minimum block height the blockchain must reach before miners should consider validating the transaction, i.e. a transaction with locktime j is added to block
A peculiar property of the Bitcoin network is its pseudonymity: entities conceal their identity through the use of nameless addresses (pseudonyms). Linking an address to a real-world entity exposes that entity's entire activity on the Bitcoin network, since the transaction history is publicly available. Entities are therefore advised to generate a new address for every transaction, so that each address is used once as a transaction output and once as a transaction input.
2.2 Address Clustering
The objective of address clustering is to find sets of addresses
1) Multi-input: All input addresses of a transaction are assumed to be controlled by the same entity.
2) Change address type: If all input addresses of a transaction are of one address type (e.g. P2PKH or P2SH), the potential change addresses are of the same type.
3) Change address behavior: Since entities are advised to generate a new address for receiving change, an output address receiving Bitcoins for the first time may be a change address.
4) Change locktime: If a transaction’s locktime is specified, outputs spent in different transactions on the same block as the specified locktime may be change addresses. Intuitively, this is because the entity initiating the transaction also knows its locktime.
5) Optimal change: If an output is smaller than any of the transaction inputs, it is likely a change address.
6) Peeling chain: In a peeling chain, a single address with a relatively large amount of Bitcoins begins by transferring a small amount of Bitcoins to an output address, with the rest being allocated to a one-time change address. This process repeats several times until the larger amount is reduced, meaning that addresses continuing the chain are potential change addresses (Meiklejohn et al.).
7) Power of 10: This heuristic assumes that the sum of deliberately transferred Bitcoins in a transaction is a power of 10. If such an output is present, the other outputs may be change addresses.
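Two of the heuristics listed above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the `Transaction` structure and the function names are hypothetical, values are integers (e.g. satoshi), and the optimal change rule is read here as "smaller than every input", which is an assumption about the wording of heuristic 5.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Transaction:
    """Hypothetical, simplified transaction: tuples of (address, value) pairs."""
    inputs: tuple
    outputs: tuple


def multi_input_pairs(tx):
    """Heuristic 1: every pair of distinct input addresses is linked."""
    addresses = sorted({address for address, _ in tx.inputs})
    return set(combinations(addresses, 2))


def optimal_change_candidates(tx):
    """Heuristic 5 (assumed reading): outputs smaller than every input value."""
    if not tx.inputs:
        return set()
    smallest_input = min(value for _, value in tx.inputs)
    return {address for address, value in tx.outputs if value < smallest_input}


# Toy transaction: A and B jointly pay 7 to C and receive 1 back on D.
tx = Transaction(inputs=(("A", 5), ("B", 3)), outputs=(("C", 7), ("D", 1)))
linked_inputs = multi_input_pairs(tx)
change_candidates = optimal_change_candidates(tx)
```

In this toy transaction the multi-input heuristic links A and B, and only D qualifies as a change candidate because its value is below the smallest input.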
2.3 Related Work
Address clustering in Bitcoin has been the subject of numerous studies. Initial studies focused on the multi-input heuristic. For example, Nick identifies more than 69% vulnerable addresses using only this heuristic. Harrigan and Fretter [11] also consider the multi-input heuristic and attribute its effectiveness to frequent address reuse, as well as the presence of large address clusters having high centrality measures with respect to transactions between clusters. Furthermore, they suggest that incremental cluster growth and the avoidable merging of large clusters make the multi-input heuristic suitable for real-time analysis. Fleder et al. construct directed transaction graphs for periods of 24 h and 7 months. In such graphs, the nodes are addresses and each edge represents a transaction from an input address to an output address. They obtain address entity labels by scraping public forums and social networks. By applying the multi-input heuristic, they identify transactions where labeled addresses have interacted with a large number of known entities such as SatoshiDICE and Wikileaks.
Meiklejohn et al. combine the multi-input heuristic with a second one, similar to the change address behavior heuristic. They identify major entities and interactions between them, and note that the change address heuristic tends to collapse address groups into large super-clusters. Zhang et al. consider another variation of the change address behavior heuristic, and show that it improves clustering quality when address reduction is used as a performance measure. In this study, we focus on the heuristics by Kalodner et al. introduced in Section 2.2.
Patel proposes novel approaches to Bitcoin address clustering. He considers clustering an undirected, weighted heuristic graph, where the nodes are addresses, and each edge indicates the presence of at least one of eight heuristics (a superset of those introduced in Section 2.2) linking those addresses to the same entity. Each heuristic is assigned a positive weight, such that their sum is equal to one. The edge weight is the sum of the weights of the heuristics that are present between two addresses. The author applies a variety of generic graph clustering algorithms (e.g. k-means, spectral, DBSCAN) as well as graph sparsification and coarsening techniques to the constructed heuristic graph. In this study, we propose the address correspondence network, which is similar to the network built by Patel. However, in our correspondence network, the weight of an edge between two addresses is the number of times the heuristics identify the pair as controlled by the same entity. We use a label propagation algorithm to build the clusters, using ground truth information to drive the algorithm.
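The two ideas in the previous paragraph, count-based edge weights and label propagation driven by ground truth, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the algorithm used in this study: the function names are hypothetical, seed labels are kept fixed, nodes are updated asynchronously in sorted order, and ties are broken by keeping the current label or else the alphabetically first one.

```python
from collections import Counter, defaultdict


def build_network(hits):
    """Edge weight = number of heuristic detections linking the two addresses."""
    weights = Counter()
    for u, v in hits:
        if u != v:
            weights[frozenset((u, v))] += 1
    return weights


def seeded_label_propagation(weights, seeds, max_rounds=100):
    """Propagate known entity labels (seeds) along weighted edges."""
    adjacency = defaultdict(dict)
    for edge, weight in weights.items():
        u, v = tuple(edge)
        adjacency[u][v] = weight
        adjacency[v][u] = weight
    # Unlabeled nodes start with a unique placeholder label (assumption).
    labels = {node: seeds.get(node, "?" + node) for node in adjacency}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            if node in seeds:
                continue  # ground-truth labels are never overwritten
            score = Counter()
            for neighbor, weight in adjacency[node].items():
                score[labels[neighbor]] += weight
            best = max(score.values())
            candidates = sorted(l for l, s in score.items() if s == best)
            new = labels[node] if labels[node] in candidates else candidates[0]
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break
    return labels


# Toy example: one pair per heuristic detection, two known addresses.
hits = [("a1", "a2")] * 3 + [("a2", "a3")] * 2 + [("a3", "b1")] + [("b1", "b2")] * 3
network = build_network(hits)
clusters = seeded_label_propagation(network, seeds={"a1": "exchange", "b2": "market"})
```

In the toy example the weak single-detection link between a3 and b1 is outweighed by the stronger links on either side, so the two seed labels spread to separate groups of addresses.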
There exist other approaches and extensions to address clustering. Ermilov et al. [15] show that higher cluster homogeneity can be achieved when transaction data is augmented with off-chain information from the internet. Biryukov and Tikhomirov [16] propose incorporating lower-level network information to enhance deanonymization. Furthermore, Harlev et al. [17] extend address clustering by using supervised machine learning to predict the type of entity controlling addresses in an unlabeled cluster. In our study, in addition to using a ground truth to guide the clustering construction, we introduce a temporal component in the analysis. We build address correspondence networks for various time intervals. In this way, we can analyze the evolution of the network over time.
We expand upon the work of Patel  by performing address clustering on so-called Address Correspondence Networks, denoted
For some addresses
The remainder of this section is organized as follows. Section 3.1 describes the method for sampling from
3.1 Transaction Sampling
For computational feasibility, we restrict our analysis to a sample of
As a result, this process constructs a set of sampled transactions
3.2 Partial and Cumulative Transaction Sets
To study the evolution of the Bitcoin Address Correspondence Network over time, we create temporal subsets of the transactions in
FIGURE 2. Cumulative and partial transaction sets, and construction of the Address Correspondence Networks.
The cumulative strategy creates eight time intervals of progressively increasing width,1
Cumulative transaction sets are denoted with
3.3 Address Correspondence Network Construction
The information captured by applying
Note that in a transaction, different heuristics can concur by identifying the same address as a change address, increasing the weights of the edges related to such an address. Figure 3 shows the degree distribution of the Address Correspondence Networks
TABLE 1. Number of nodes, edges and ground truth addresses of the partial and cumulative Address Correspondence Networks for each semester from 2012 to 2015.
Figure 4 shows the distribution of ground truth entities in the Address Correspondence Networks. In each plot, we compare a cumulative network and the partial network from its last six months, e.g.
3.4 Address Correspondence Network Clustering
To initialize parts of the nodes, we use the information from the ground truth
In this paper, we are interested in exploring the ability of community detection algorithms to provide additional information about the true identities of users. We hypothesize that the Address Correspondence Network encodes additional information about the entities that control specific addresses. We argue that successive applications of heuristics may lead to connections between addresses controlled by the same entity that are denser and higher weighted than connections between addresses of different entities. Following this argument, we apply LPA to obtain a disjoint set of clusters
In the experiments, we vary the proportion p of initialized nodes, that is defined as:
3.5 Cluster Quality Analysis
Finally, we quantify the clustering quality as a function of cluster size and entity size. Given an Address Correspondence Network
3.5.1 Random Variables
To study the characteristics of the network, we define the following discrete random variables associated with the distributions of entities, addresses, and known addresses in the address correspondence network.
The first random variable, E, assumes a value from the set of entities according to their frequency in the correspondence network. More specifically, E can assume the value
In addition to E, we also define variables that assume values in the entity set according to their frequency in specific clusters. Let
The variable C assumes a cluster identifier according to its frequency over the addresses in the ground truth. C can assume a value
Finally, we define variables complementary to
Modularity, initially proposed by Newman and Girvan, compares the clusters with a random baseline. This is done by computing the difference between the number of edges inside the clusters and the expected number of such edges for the same clusters but with random connections between the nodes. Let
A value close to 0 indicates that the community structure is akin to a random network, while values close to 1 indicate strong community structures, meaning dense connections inside the communities and sparse connections between them.
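As an illustration of the quantity described above, the following sketch computes the standard weighted form of Newman-Girvan modularity, Q = Σ_c [ L_c / m − (d_c / 2m)² ], where L_c is the total edge weight inside cluster c, d_c is the summed weighted degree of its nodes, and m is the total edge weight. Whether the study uses exactly this weighted variant is an assumption; the function name and the input format are hypothetical.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Weighted Newman-Girvan modularity.

    edges: dict mapping (u, v) to a positive weight, each undirected edge once.
    cluster_of: dict mapping each node to its cluster identifier.
    """
    total_weight = sum(edges.values())
    inside = defaultdict(float)    # weight of edges with both ends in the cluster
    strength = defaultdict(float)  # summed weighted degree of the cluster's nodes
    for (u, v), weight in edges.items():
        strength[cluster_of[u]] += weight
        strength[cluster_of[v]] += weight
        if cluster_of[u] == cluster_of[v]:
            inside[cluster_of[u]] += weight
    return sum(
        inside[c] / total_weight - (strength[c] / (2 * total_weight)) ** 2
        for c in strength
    )


# Two disconnected edges: separating them gives Q = 0.5, merging them gives Q = 0.
toy_edges = {("a", "b"): 1, ("c", "d"): 1}
q_split = modularity(toy_edges, {"a": 0, "b": 0, "c": 1, "d": 1})
q_merged = modularity(toy_edges, {"a": 0, "b": 0, "c": 0, "d": 0})
```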
Information Theory Metrics: Entropy, introduced in an information theory context by Shannon , quantifies the expected amount of information or uncertainty contained in a random variable. Let X be a discrete random variable, which can assume values
while the normalized Shannon entropy is:
We use the normalized entropy of
Entropy also gives important information of the interrelation between random variables. Let us consider two variables X and Y, and let
The conditional entropy indicates how much extra information is needed to describe Y given that X is known. Additionally, the amount of information needed on average to specify the value of two random variables is
We use conditional entropy to measure the quality of the clusters. We do so by comparing them with the distribution of the entities in the Address Correspondence Network, exploiting the variables E and C. This measure is called homogeneity and was introduced by Rosenberg and Hirschberg [27]. Ideally, a cluster should only contain addresses that are controlled by the same entity. In such a case, clusters are homogeneous and it holds
The fundamental Mutual Information (MI) quantifies the agreement between partitions. In addition to
and quantifies the reduction of the uncertainty of
AMI gets values in the
Finally, we consider the Rand Index (RI), initially proposed by Rand, which compares two sets of clusters while ignoring permutations. Let
The Rand Index is defined as:
where the denominator is the number of address pairs in
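The entropy-based and pair-counting metrics of this section can be sketched in a few lines of Python. The sketch uses the textbook definitions: Shannon entropy with natural logarithms, H(Y|X) = H(X,Y) − H(X), homogeneity as 1 − H(E|C)/H(E) following Rosenberg and Hirschberg, and the unadjusted Rand Index as the fraction of address pairs on which the two partitions agree. The function names are hypothetical, and the chance-adjusted variants (AMI, ARI) are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(ys, xs):
    """H(Y | X) = H(X, Y) - H(X)."""
    return entropy(list(zip(xs, ys))) - entropy(xs)


def homogeneity(entities, clusters):
    """1 - H(E | C) / H(E); equals 1 when every cluster holds a single entity."""
    h_entities = entropy(entities)
    if h_entities == 0:
        return 1.0
    return 1.0 - conditional_entropy(entities, clusters) / h_entities


def rand_index(entities, clusters):
    """Fraction of address pairs treated consistently by both partitions."""
    pairs = list(combinations(range(len(entities)), 2))
    agreements = sum(
        (entities[i] == entities[j]) == (clusters[i] == clusters[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Four labeled addresses: entity y is split across two pure clusters.
true_entities = ["x", "x", "y", "y"]
found_clusters = [0, 0, 1, 2]
h = homogeneity(true_entities, found_clusters)
ri = rand_index(true_entities, found_clusters)
```

The toy clustering is perfectly homogeneous, since no cluster mixes entities, yet its Rand Index is 5/6 because one pair of addresses from the same entity ends up in different clusters.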
We first analyze the size of the clusters identified by LPA for the Address Correspondence Networks described in Section 3, whose statistics are shown in Table 1. Figure 6 shows the cluster size distribution of
FIGURE 6. Cluster size distribution of
We also fit a power-law distribution to the cluster size distribution, shown by the dotted red lines with the corresponding alpha values in Figure 6. Furthermore, the power-law distribution fits the data significantly better than an exponential distribution, resulting in p-values of less than
Next, we study the behavior of the intra-cluster total degree (number of edges connecting nodes that belong to the same cluster) and the inter-cluster degree (number of edges between nodes that belong to different clusters) as functions of the cluster size. For the total intra-cluster degree, there are two extreme behaviors that can be expected. On the one hand, a linear dependency on cluster size would signal that address reuse is negligible (and therefore that privacy-preserving usage is commonplace), and the topology of the correspondence network encodes no additional information about the identity of the users that control the addresses. On the other hand, a quadratic relationship (close to the theoretical maximum
FIGURE 7. Comparison of the total intra-cluster and inter-cluster degrees for
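The per-cluster quantities compared in this figure can be computed as in the following sketch. It follows the definitions given in the text (edges inside a cluster versus edges leaving it) on an unweighted edge list; the function name and input format are hypothetical.

```python
from collections import Counter


def cluster_degrees(edges, cluster_of):
    """Return {cluster: (size, intra-cluster edges, inter-cluster edges)}."""
    size = Counter(cluster_of.values())
    intra, inter = Counter(), Counter()
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] += 1
        else:
            # An edge between two clusters counts once for each of them.
            inter[cu] += 1
            inter[cv] += 1
    return {c: (size[c], intra[c], inter[c]) for c in size}


# A triangle (cluster 0) attached by one edge to a pair of nodes (cluster 1).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
toy_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
degrees = cluster_degrees(toy_edges, toy_clusters)
```

For a cluster of n nodes in a simple graph, the intra-cluster edge count cannot exceed n(n − 1)/2; the triangle in the toy example reaches this bound with 3 edges for n = 3.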
Figure 8 shows the number of clusters returned by LPA,
The complexity and structure of the Address Correspondence Network are stable over time: Figures 9–11 show AMI, ARI and homogeneity as functions of p. Since these metrics require ground truth labels, they are computed only for addresses in
The effect of the node initialization: If the cost of labeling a Bitcoin address is assumed to be constant, the marginal gain in clustering quality per unit cost from increasing p quickly declines. Considering that homogeneity remains constant across all p, it appears that increasing p is cost-effective until around
To assess the significance of the results presented in Figures 8–12, we repeated the experiments for 100 randomized versions of the
Furthermore, the effect of node initialization order was studied by repeating the experiments for the
The effect of cluster and entity sizes: Figure 13 shows
Furthermore, the mean levels of
Interestingly, the average
5 Conclusion and Future Work
In this paper, we consider the application of a general-purpose community detection algorithm, LPA, to detect address clusters that are controlled by the same entity in the Bitcoin transaction history. Specifically, we apply LPA to Address Correspondence Networks, which incorporate information from a variety of simple address linking heuristics. We detect a strong community structure within these networks by inspecting their intra- and inter-cluster degrees. We find that the intra-cluster degree grows faster than the inter-cluster degree as cluster size increases. Address correspondence networks are therefore suitable for the application of general community detection methods from the broader field of network science; this creates an entry point for future researchers to move far beyond the application of primitive heuristics.
Since LPA is able to exploit ground truth information, we find that clustering quality improves as the number of labeled addresses in the Address Correspondence Networks increases. However, under the assumption that the cost of labeling a Bitcoin address is constant, we find that the marginal gain in clustering quality per unit cost quickly declines. Under this assumption, we propose that address labeling is cost-effective until around p = 0.1.
For future work, we plan to conduct experiments to test the robustness of the heuristics and of specific combinations of them, for example, by analyzing their likelihood and studying their contribution to the links between addresses. From a network reconstruction perspective, link prediction is an interesting approach to improve the correspondence network by validating current links and predicting missing ones. Additionally, different machine learning approaches can be applied to graph analysis; supervised methods are suitable if more ground truth information becomes available in the future.
Data Availability Statement
Author Contributions
JF and AP developed the software, curated the data, ran the analyses, created the visualizations, and wrote the initial draft. DD and CT contributed to the conceptualization and methodology, supervised the study, and reviewed and edited the text. AB supervised the study and reviewed and edited the text. All authors discussed the results. All authors worked on and agreed to the final version.
Funding
DD acknowledges partial funding from the Swiss National Science Foundation under contract #407550_167177. CT acknowledges financial support from the University of Zurich through the University Research Priority Program on Social Networks.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
1We represent dates in the DD.MM.YY format.
2. Gaihre A, Luo Y, Liu H. Do Bitcoin Users Really Care about Anonymity? An Analysis of the Bitcoin Transaction Graph. In: 2018 IEEE International Conference on Big Data (Big Data) (2018). p. 1198–207. doi:10.1109/BigData.2018.8622442
7. Bovet A, Campajola C, Mottes F, Restocchi V, Vallarano N, Squartini T, et al. The Evolving Liaisons between the Transaction Networks of Bitcoin and its Price Dynamics. arXiv:1907.03577 [physics, q-fin] (2019).
8. Nakamoto S. Bitcoin: A Peer-To-Peer Electronic Cash System (2008). Available at SSRN: https://ssrn.com/abstract=3440802
11. Harrigan M, Fretter C. The Unreasonable Effectiveness of Address Clustering. In: 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld) (2016). p. 368–73. doi:10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071
15. Ermilov D, Panov M, Yanovich Y. Automatic Bitcoin Address Clustering. In: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). Cancun, Mexico: IEEE (2017). p. 461–6. doi:10.1109/ICMLA.2017.0-118
16. Biryukov A, Tikhomirov S. Deanonymization and Linkability of Cryptocurrency Transactions Based on Network Analysis. In: 2019 IEEE European Symposium on Security and Privacy (EuroS&P). Stockholm, Sweden: IEEE (2019). p. 172–84. doi:10.1109/EuroSP.2019.00022
17. Harlev MA, Sun Yin H, Langenheldt KC, Mukkamala RR, Vatrapu R. Breaking Bad: De-anonymising Entity Types on the Bitcoin Blockchain Using Supervised Machine Learning. In: Proceedings of the 51st Hawaii International Conference on System Sciences 2018. United States: Hawaii International Conference on System Sciences (HICSS) (2018). p. 3497–506.
27. Rosenberg A, Hirschberg J. V-measure: A Conditional Entropy-Based External Cluster Evaluation Measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague, Czech Republic: Association for Computational Linguistics (2007). p. 410–20.
Keywords: blockchain technology, bitcoin (BTC), label propagation algorithm, network science, deanonymization
Citation: Fischer JA, Palechor A, Dell’Aglio D, Bernstein A and Tessone CJ (2021) The Complex Community Structure of the Bitcoin Address Correspondence Network. Front. Phys. 9:681798. doi: 10.3389/fphy.2021.681798
Received: 17 March 2021; Accepted: 10 June 2021;
Published: 30 June 2021.
Edited by: Zhong-Yuan Zhang, Central University of Finance and Economics, China
Reviewed by: Ju Xiang, Changsha Medical University, China
Jie Cao, Nanjing University of Finance and Economics, China
Copyright © 2021 Fischer, Palechor, Dell’Aglio, Bernstein and Tessone. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Daniele Dell’Aglio, firstname.lastname@example.org
†These authors have contributed equally to this work