Topical alignment in online social systems

Understanding the dynamics of social interactions is crucial to comprehending human behavior. The emergence of online social media has enabled access to data regarding people's relationships at a large scale. Twitter, specifically, is an information-oriented network, with users sharing and consuming information. In this work, we study whether users tend to be in contact with people interested in similar topics, i.e., whether they are topically aligned. To do so, we propose an approach based on the use of hashtags to extract information topics from Twitter messages and model users' interests. Our results show that, on average, users are connected with other users similar to them. Furthermore, we show that topical alignment provides information that can eventually allow inferring users' connectivity. Our work, besides providing a way to assess the topical similarity of users, quantifies topical alignment among individuals, contributing to a better understanding of how complex social systems are structured.

have allowed to address a number of questions related to how humans connect among each other. Research using data from online social media has, in turn, produced new methods and models that are at the core of present (computational) social sciences. In this work, we explore the relationships between users of the microblogging service Twitter and the information shared by them. Information sharing is a very important aspect of Twitter, which is also considered an information network [1], i.e., it is often a means for the consumption and sharing of content that is mainly diffused through users' connections. The way Twitter (and other social networks) works leads to an interesting linkage between information and (often adaptive or dynamic) relationships among individuals, which is the focus of our investigation.
In what follows, we inspect how much the information shared by users is related to their connections in the social network. Our goal is to demonstrate that the information spread on Twitter is a crucial component of social dynamics through the verification of topical alignment. Connected users being topically aligned is an indication of how pervasive homogeneity is across their dimensions of interests and ideas. Accordingly, this requires a proper assessment of the information topics, since focusing only on individual annotations does not capture the latent "context" in which users are engaged when exchanging messages. We infer topics from clusters of highly associated hashtags in messages exchanged by users. This allows us to capture topics exposing latent higher-level semantic entities without the need for an external ontology or manual classification step [2,3,4]. Users' affiliation to these topics indicates individual preferences in the wide range of topics available in the social network, and it constitutes our pool to assess similarity between users. The engagement in specific topics tells something about a user, and we adopt it as the basis to create a metric based on users' interests in different topics. In short, we want to assess whether connected users tend to be topically similar and how relevant the similarity is to their relationships.
Our results show that, on Twitter, follow and mention relationships are more likely to have a higher topical alignment than random pairs of users. Furthermore, we verify that both kinds of relationships tend to display a similar alignment pattern, despite the belief that they are relationships of a different kind [5].
Finally, our analysis also shows that connections with strong interactions tend to have higher similarity and that the similarity between connected users indicates a higher probability of interaction.

Related Work
In an online social system, the emergence of connections among individuals can be explained by different mechanisms, from preferential attachment [6] to shortcuts for the consumption of information [7]. It is clear that the information shared in an online social network is an important characteristic to be taken into account while analyzing its connections. However, there is no clear definition of information in a social network context. In this work, we consider information as the different kinds of content that flow in a network and may affect people's opinions or ideas. This is analogous to Bateson's general definition of information as composed of pieces that are supposed to be "a difference that makes a difference" [8]. Some recent efforts have been directed to the study of how the information traversing the network is related to its links. Weng et al. [7] recently demonstrated that information flows play an important role in link creation in the Yahoo! Meme network. Around 12% of the new edges were motivated by the information flow, indicating that the dynamics of the network's edges cannot be explained merely by its topological structure. Furthermore, they showed that, while some users create connections mostly based on friendship, others are more guided by the content that users produce and share. Bogdanov et al. provided a model of pre-specified topics and verified the consistency of their use by Twitter users; they also applied it to predict influencers and to minimize the latency in information dissemination [4]. Meyers et al. [9] were interested in how abrupt changes in the information flow dynamics influence the creation and removal of links. Their work found that cascades of tweets were likely to cause follow or unfollow bursts, i.e., people start to follow or unfollow others with the abrupt increase in the retweets of some content. Also using data from Twitter, Das et al. [10] studied how the difference in users' intent affects the content of their messages and their propagation. Suh et al. [11] focused on which features increase the probability of a message being retweeted, finding that the presence of hashtags (i.e., the presence of a context), along with other factors, favors sharing. Following on how context and interests shape information sharing, Wu et al. [12] categorized influential users on Twitter (i.e., celebrities, media, and bloggers), finding that users in the same category usually show common behaviors that differ from one category to another. Another contribution along this line is the one by Kang and Lerman [13], who studied how the position in the network and the engagement of users affect the information they receive. The authors found that more engaged users usually occupy bridge positions in the network and are exposed to more diverse and novel information than less engaged ones. Finally, the role of network structure and access to information has also been studied by Aral and Van Alstyne [14], who analyzed data from an executive recruitment firm.

Topical Alignment
We are concerned with the degree to which users are more topically aligned with their connections. This is closely related to the concept of homophily [15,16,17,18,19], i.e., the tendency of individuals to form dyads with people similar to them, which has implications for the final network structure [43]. If the similarity between pairs of individuals induces them to form a tie, this tendency is called choice homophily; if it is just a result of the constraints on the opportunities for connection, it is called induced homophily. Both types might be necessary to explain the levels of similarity encountered in dyads, as Kossinets et al. showed with dyads of a university community [18]. Nonetheless, choice homophily requires assessing individuals' preferences, and this is often infeasible. Thus, the concepts of baseline homophily (the expected similarity between random pairs of individuals) and inbreeding homophily (the similarity of dyads that are above or under the baseline), introduced by McPherson & Smith-Lovin [15], are often used in practical approaches [20].
Topically aligned dyads are not necessarily a result of homophily, as connected individuals also tend to become more similar to each other over time, which is known as social influence or social contagion [21,22,23]. Social influence is an important ingredient of synthetic models such as the one proposed by Robert Axelrod [24] and is also verified in social networks [25]. However, real data also show that some effects attributed to social contagion may be a result of homophily [22]. Furthermore, the creation of dyads may be motivated by latent or unknown characteristics, as pointed out by Shalizi et al.; thus, it might be impossible to verify whether the similarity of ties is really the result of homophily or social influence [23]. This infeasibility of disentangling both processes does not affect our work, as we are not interested in verifying which one is driving the similarity of the dyads. Our goal is to assess to which degree connected users are topically similar, independently of the generating mechanism.
Nonetheless, we consider that the works most related to our own were strictly interested in homophily in online networks. Laniado et al. [26] inspected gender homophily (i.e., the prevalence of same-gender relationships) in the Spanish social network Tuenti. They based their analysis on self-reported gender data, and their results showed the presence of gender homophily in dyadic and triadic relationships. Aiello et al. [27] explored homophily in the context of tagging social networks (Flickr, Last.fm, and aNobii). In these networks, tags are used to classify resources, a different usage than hashtags on Twitter. In their approach, the tags employed by users are used to compute their similarity, which quantifies their proximity in tag usage. They found that users' topical similarity is related to their shortest path distance on the social graph and that it could predict some links on the graph. Crandall [25] explored homophily using datasets extracted from Wikipedia and LiveJournal (article-based and blogging-based networks) and modeled users according to their article editing history. Choudhury explored homophily over a set of demographic user characteristics and its relation to the structure of users' ego networks; most importantly, they showed that the presence of homophily concerning topical interests is independent of the ego network structure [28].
Some of these works are more related to networks centered on some kind of digital artifact, e.g., images or articles. Twitter, however, is more centered on the information posted by its users. Furthermore, hashtags or other features, by themselves, are not sufficient to assess similarity among users, as they do not fully capture the context of users' messages. Thus, despite their findings, these works leave aside the latent semantics in information sharing. Others had to rely on an external tool or specific classification to measure users' similarity [28,29,30]. It is necessary to look at a higher granularity to capture the different kinds of content that users are engaged with, which we achieve using topics of information. To the best of our knowledge, no study has explored the subject in this way. Thereby, our work contributes to the understanding of the nature of relationships in a social network by exploring a component that is still hard to manipulate: the different kinds of information that traverse the network.

Twitter Dataset
There is no clear definition of social media or online social network; however, there is a general consensus that services like Twitter are instances of social media services [31]. Due to its microblogging nature, some also consider Twitter a news medium or an information network [1,32]. This is an important feature, as we are interested in the content shared by the users and their relationships. In this work, we explore both mention relationships (mentioning a user in a tweet) and follow relationships (subscribing to receive another user's tweets). In our analyses, we explicitly decided not to include retweets, as we are more interested in information created by the users than in shared information. Explicitly creating a new tweet requires a larger effort than retweeting one; thus, we believe that original tweets are a more reliable proxy of users' real interests than retweets. Moreover, not considering retweets also has the side effect of limiting the number of bots in our datasets. As suggested in [33], the majority of the content produced by bots consists of retweets, so excluding retweets and users with only retweets should reduce the number of bots in our analyses. Finally, the most recent form of interaction on Twitter, quoted tweets, was not present in 2013 when we started our data collection. Thus, we do not consider quoted tweets in this work.
Our dataset is composed of all the geo-localized tweets (tweets with valid GPS coordinates) located in the United Kingdom and Ireland in a 7-month period from January to September 2013, collected through Twitter's Streaming API.
Further, more tweets of the users with geo-localized tweets and their follow/friend

Topics of Information
Information in Twitter flows through tweets, which are short messages with a highly dynamic vocabulary, encumbering traditional text clustering techniques.
We decided to build topics of information considering the tweets with hashtags, as they are indicators of the tweet content. Hashtags are user-generated annotations containing a shared meaning, similar to acronyms generated organically by a population [44]. Furthermore, it is common for users to insert more than one hashtag in a tweet, and we exploit this aspect to build a semantic mapping of information on Twitter. We assume the existence of a semantic association between hashtags that co-occur in the same tweet. This is analogous to the assumption that words are semantically associated if they are likely to co-occur frequently [34]. Thus, our method focuses only on the implicit semantics given by Twitter messages, i.e., it does not consider explicit semantics given by other sources. This semantic mapping is captured by a weighted co-occurrence graph of hashtags, which we built by extracting all pairs of hashtags that co-occurred in each tweet in our dataset. Therefore, in this graph, an edge $(h_i, h_j)$ indicates that the hashtags $h_i$ and $h_j$ co-occurred, and its weight gives the number of different tweets in which they are both present.
We built a weighted hashtag co-occurrence graph using the 16,935,625 tweets with hashtags belonging to our dataset. As we removed hashtags that did not co-occur with any other, the co-occurrence graph retained 2,090,971 of the 4,320,429 distinct hashtags. As noted before, the edges of this graph represent a semantic association between hashtags. In order to further restrict our analysis to cases in which the statistics are not too scarce, and to reduce possible noise coming from low co-occurrences that might not reflect a clear association, we additionally removed all the edges between pairs of hashtags that co-occurred in fewer than 3 tweets. This process produces our final co-occurrence graph, which includes 104,308 hashtags and 526,522 edges.
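The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the study: the function name `build_cooccurrence_graph`, the input format (one list of hashtags per tweet), and the lowercasing of hashtags are all hypothetical choices; only the threshold of 3 tweets comes from the text.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(tweets, min_tweets=3):
    """Return {(h_i, h_j): weight} for hashtag pairs that co-occur
    in at least `min_tweets` different tweets."""
    weights = Counter()
    for hashtags in tweets:
        # Assumption: hashtags are compared case-insensitively.
        # Deduplicate within a tweet so each tweet counts once per pair,
        # and sort so (a, b) and (b, a) map to the same undirected edge.
        unique = sorted({h.lower() for h in hashtags})
        weights.update(combinations(unique, 2))
    # Drop weak edges; hashtags left without edges simply disappear.
    return {edge: w for edge, w in weights.items() if w >= min_tweets}


tweets = [
    ["#love", "#music"],
    ["#music", "#love", "#live"],
    ["#love", "#music"],
    ["#news"],  # a lone hashtag never forms an edge
]
graph = build_cooccurrence_graph(tweets)
```

In this toy input only the pair (#love, #music) reaches the threshold, so #live and #news are absent from the resulting graph, mirroring how the filtering shrinks the set of hashtags.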
We consider that topics of information are sets of hashtags clustered together in the graph. Thus, we expect that they will reflect the higher-level structures that emerge from the latent semantic association of hashtags, providing the different contexts to which messages refer. These clusters can naturally be captured by a community detection method, and we decided to use the OSLOM tool [35]. OSLOM is able to capture overlapping communities, a desirable feature considering that one hashtag may be used in different contexts. The application of OSLOM resulted in 2,074 communities and 14,118 homeless nodes, i.e., hashtags that did not belong to any community. We considered both the communities and the homeless nodes as topics. Although the latter may not significantly benefit our subsequent procedures, we believe that a hashtag alone can also carry information.
Furthermore, our method to assess topical similarity should not be affected by this increase of topics as it does not take into consideration the topics that are not shared by two users (see the Supplementary Information for more details).
Summing up both communities and homeless nodes in our analysis we consider a total of 16,192 topics with an average of 622 users per topic.
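As an illustration of how communities and homeless nodes combine into a single list of topics, consider the sketch below. It assumes the community detection output has already been parsed into Python sets; the function name `build_topics` and the input format are hypothetical and do not reflect OSLOM's actual output files.

```python
def build_topics(communities, all_hashtags):
    """Combine (possibly overlapping) communities with homeless nodes.

    Each community becomes one topic, and each hashtag that belongs to
    no community becomes a singleton topic.
    """
    covered = set()
    for community in communities:
        covered |= set(community)
    homeless = sorted(set(all_hashtags) - covered)
    return [set(c) for c in communities] + [{h} for h in homeless]


communities = [{"#love", "#music"}, {"#love", "#wedding"}]  # overlap on #love
hashtags = ["#love", "#music", "#wedding", "#news"]
topics = build_topics(communities, hashtags)
```

With the figures reported above, the same bookkeeping gives 2,074 + 14,118 = 16,192 topics.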
This approach of building a co-occurrence graph and using a community detection method to find topics was also used by Weng and Menczer [36] through the Louvain method [37], although they were not concerned with topical alignment.
They assumed, based on the topical locality assumption, that semantically similar hashtags would appear in tweets together. Notwithstanding the resemblance to our premises, we do not presume that hashtags are similar, only semantically associated. Even though there is not an easy way to ground the accuracy of this approach, we believe that it is a sound method for assessing information topics.
Its premises and procedures are well defined over the semantic associations of hashtags.

Users Dataset
Users considered in our analysis had at least one tweet with a hashtag, so that we could assess which topics of information they were affiliated with. Thus, we selected 774,596 users from the 1 million users with tweets. Before starting the analysis, we also took particular care to reduce the number of bots in our dataset. Along with excluding retweets, we also decided to remove users that had been active for less than one day and those that showed unusual activity. Specifically, we excluded users that had, on average, more than 400 tweets per day, as we consider that it is normally infeasible for a real person to produce this quantity of tweets (for more information on bot filtering and its possible impact, see the Supplementary Information). Finally, users had to have at least one hashtag belonging to the detected topics (described in the previous section), leading to a final set of 608,899 users. We name this set Population, as it includes all the users in our experiment.
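The activity filter can be sketched as follows. The thresholds (one day of activity, 400 tweets per day) come from the text; how the activity span and the daily average are measured is not specified, so the definitions used here (time between first and last tweet, tweets divided by that span) are assumptions, as is the function name `is_plausible_human`.

```python
from datetime import datetime, timedelta

MAX_TWEETS_PER_DAY = 400  # threshold stated in the text


def is_plausible_human(timestamps):
    """Keep a user only if active for at least one day and averaging
    at most MAX_TWEETS_PER_DAY tweets per day.

    `timestamps` holds the datetimes of the user's non-retweet tweets.
    Assumption: activity span = time between first and last tweet.
    """
    if not timestamps:
        return False
    active_days = (max(timestamps) - min(timestamps)).total_seconds() / 86400
    if active_days < 1:
        return False
    return len(timestamps) / active_days <= MAX_TWEETS_PER_DAY


start = datetime(2013, 1, 1)
regular = [start, start + timedelta(days=10)]
burst = [start + timedelta(minutes=2 * k) for k in range(1000)]  # ~720 per day
short_lived = [start, start + timedelta(hours=1)]
```

Here `regular` passes the filter, while `burst` exceeds the daily threshold and `short_lived` fails the one-day activity requirement.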
After that, we extracted from the entire population another set of 9,490 users, which we define as central users. Those users are the core of our analyses, as we calculate the topical similarity between them and their direct connections and compare it against random users selected from the entire population. Central users were extracted randomly from the users in our dataset that were active for the entire 7-month data collection period and produced at least $10^2$ tweets, to guarantee a large corpus of tweets about their interests. Details for the two sets are shown in Table 1.

Users Representation
Each user is represented by a feature vector $u$, which comprises her affiliation to all topics of information. Feature $u_i$ corresponds to her affiliation to topic $i$, and its value represents the number of hashtags belonging to $t_i$ (the set of hashtags belonging to topic $i$) that were used by the user in her tweets. As the communities obtained by OSLOM may overlap, the same hashtag may be counted in more than one feature. In this case, each hashtag adds a proportional value to each feature it belongs to. All the hashtags used by a user are contained in a multiset $U = (H, m_U)$, and the value of a feature $u_i$ is given by Eq. 1. The process of building a user vector is illustrated in the figure below.
[Figure: construction of a user vector using Eq. 1. As #love appears in the topics $t_1$ and $t_2$, it adds 1 to their respective features.]
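To make the construction concrete, the sketch below follows one reading of the definitions above: $u_i = \sum_{h \in H \cap t_i} m_U(h)$, i.e., every use of a hashtag adds 1 to each topic that contains it, as in the #love example. This reading is an assumption, since the exact form of Eq. 1 is not reproduced here, and the name `user_vector` is hypothetical.

```python
from collections import Counter


def user_vector(used_hashtags, topics):
    """Build the feature vector u for one user.

    `used_hashtags` lists every hashtag occurrence in the user's tweets;
    `topics` is a list of sets of hashtags (t_1, t_2, ...).
    """
    multiplicity = Counter(used_hashtags)  # the multiset U = (H, m_U)
    # A hashtag shared by several topics contributes to each of them.
    return [sum(multiplicity[h] for h in topic) for topic in topics]


topics = [{"#love", "#music"}, {"#love", "#wedding"}, {"#news"}]
u = user_vector(["#love", "#music", "#music", "#news"], topics)
```

In this toy case #love contributes 1 to both the first and the second feature, while #music is counted twice in the first feature only.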

Weighting Users' Vectors
The previous definition of users' features vector considers that all topics have the same weight, i.e., the values of the respective features are directly derived from the number of hashtags used. This may be not suitable for our task as some popular topics or of general use could be over-represented and thus should have a smaller weight. To overcome this distortion, we consider that topics shared by a large percentage of the users ought to have a small weight, likewise, topics possessed by only a small percentage of users ought to weight more. The intuition behind this is that features corresponding to rare topics should be more discriminative of the topical proximity of users than features corresponding to frequent topics.
Strictly speaking, we would like to take into account the information content of each topic [38]. To do so, we rely on TF-IDF [34] to weight users' affiliation to each topic $u_i$, with the inverse document frequency computed over $I$, the set of all individuals, i.e., Twitter users. For each feature $i$ in the user vector, this method weighs its value according to the number of users that also used it; e.g., a feature that is shared by all users will have its value set to 0, as it does not provide information to discriminate users.
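A common TF-IDF form that is consistent with the behavior described above (a topic shared by all users receives weight 0) is $w_i = u_i \cdot \log\left(|I| / |\{v \in I : v_i > 0\}|\right)$. The sketch below uses this textbook form as an assumption; it may differ in detail from the weighting actually applied, and the name `tfidf_weight` is hypothetical.

```python
import math


def tfidf_weight(vectors):
    """Re-weight user vectors so that widely shared topics count less.

    `vectors` holds one raw feature vector per user (all of equal length).
    """
    n_users = len(vectors)
    n_topics = len(vectors[0])
    # Number of users affiliated with each topic (document frequency).
    df = [sum(1 for v in vectors if v[i] > 0) for i in range(n_topics)]
    # Unsmoothed idf: a topic used by every user gets log(1) = 0.
    idf = [math.log(n_users / d) if d else 0.0 for d in df]
    return [[v[i] * idf[i] for i in range(n_topics)] for v in vectors]


vectors = [[3, 1, 0], [2, 0, 0], [1, 0, 4]]
weighted = tfidf_weight(vectors)
```

The first topic is used by all three users, so its weighted value is 0 for everyone, whereas the rarer second and third topics are scaled by log(3).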

Computing Similarity between Users
With the representation of users as feature vectors, we are able to compute the topical similarity between two users using the cosine similarity of their vectors as a metric [34]. Cosine similarity is well suited to this task, as it only focuses on the angle between vectors, i.e., it does not consider their length. Since the features are non-negative, cosine similarity ranges from 0 to 1: identical users have similarity 1, and users that share nothing in common have similarity 0. It is evaluated using Eq. 3. In preliminary analyses, we also tested Kendall's tau, Spearman's rho, and Jaccard similarity measures. We did not adopt them, as they did not present significant differences or improvements with respect to cosine similarity.
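For reference, the standard cosine similarity between two vectors is $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. A minimal sketch follows; returning 0 for an all-zero vector is an assumption made here to avoid division by zero, and the function name is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a user with no weighted affiliation shares nothing.
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
disjoint = cosine_similarity([1, 0], [0, 1])
```

The first pair differs only in vector length and is therefore treated as identical in interests, which is the property that motivates using the angle rather than the magnitude.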

Topical Alignment
The hypothesis that users are more topically aligned with their neighbors than with random users is addressed here in terms of baseline alignment and inbreeding alignment, similar to the classification introduced by McPherson & Smith-Lovin [15]. Here, we consider baseline alignment as the expected average similarity between users and a random group of the population. Inbreeding alignment is defined as the difference between the baseline distribution and the distribution of average similarity between the users and those with whom they form a dyad, which is formed by a follow or mention relationship. In other words, baseline alignment is our null model, and inbreeding alignment is a measure of how much real values deviate from the null model. This deviation is captured by the Kolmogorov-Smirnov test [39], and the likelihood of the distribution of dyads yielding higher (or lower) values of average similarity is captured by a Mann-Whitney U test [40,41]. We believe this approach has significant benefits over just looking at the hashtags shared by users, as we discuss in the Supplementary Information.
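The two statistics can be illustrated with a small pure-Python sketch. It computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the empirical distribution functions) and the Mann-Whitney U statistic normalized by the number of pairs, i.e., the probability that a value from the first sample exceeds one from the second. Treating this normalized U as the effect size is an assumption, p-values are omitted, and the sample values are invented for illustration.

```python
from bisect import bisect_left, bisect_right


def ks_statistic(xs, ys):
    """Largest absolute difference between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in xs + ys
    )


def probability_of_superiority(xs, ys):
    """U / (n_x * n_y): chance that a random x exceeds a random y,
    counting ties as one half."""
    ys_sorted = sorted(ys)
    wins = 0.0
    for x in xs:
        below = bisect_left(ys_sorted, x)
        ties = bisect_right(ys_sorted, x) - below
        wins += below + 0.5 * ties
    return wins / (len(xs) * len(ys))


# Invented average similarities, for illustration only.
with_followees = [0.2, 0.3, 0.4, 0.5]
with_random = [0.1, 0.15, 0.25, 0.35]
ks = ks_statistic(with_followees, with_random)
effect = probability_of_superiority(with_followees, with_random)
```

A value of `effect` above 0.5 indicates that the first sample tends to yield higher similarities than the second; 0.5 would indicate no tendency either way.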

Topically Aligned Follow Relationships
We initially explore inbreeding alignment with respect to follow connections. Our hypothesis is that users are, on average, more similar to their followees, i.e., we expect their topical alignment to be significant. This means that the distribution of the average similarity of individuals with their followees is expected to yield higher values than the distribution of averages with randomly chosen individuals from the population. We tested this hypothesis using the central users and their followees. There is an overlap among the distributions, mostly concentrated in lower similarities. However, it is clear that there is a difference between the random distribution and the followees distribution. The Kolmogorov-Smirnov statistic between the distributions is 0.37, p < 0.001. We also used the Mann-Whitney U test to verify whether the distribution with followees was likely to have a higher average similarity than the other. Results were positive, with an effect size of 0.75, p < 0.001. Overall, the analysis shows that, on average, users tend to be connected to those with whom they are more similar, that is, the similarity with followees is higher than the baseline similarity, thus showing the presence of inbreeding alignment. This implies that a user tends to have a stronger topical similarity with followees than with randomly chosen users.
Users on Twitter can use the convention @username to mention another user in a tweet. The interactions that happen through mentions are often seen as a stronger relationship than follow connections [42]. One hypothesis that emerges from this claim is that the topical similarity with mentioned users tends to be higher than with followed users. To test this hypothesis, we verified whether the distribution of average similarity with mentioned users tended to be concentrated in higher values of similarity than the same distribution for followees. As shown by Fig 4, the distributions are roughly the same.
Thus, in this context, we cannot say that the mention relations are more topically aligned than the connections with followed users. distributions, that is not evident in the distributions with random users (Fig 3).

Users Interactions
Given the proximity between the two distributions presented in Fig 4, users on average might follow and mention others in a close similarity pattern. This hypothesis is verified in Fig 5, which indicates that users that tend to follow similar users, also tend to mention similar users.

Reciprocity of Relationships
Relationships on Twitter are not necessarily reciprocal: a user following another does not imply that the other will choose to follow back. Thus, the existence of reciprocity indicates a stronger relationship between two users, as both decided to establish this bond. In the scope of this work, relationship strength is also viewed in terms of topical similarity; thus, we expect reciprocal dyads to have a higher similarity than non-reciprocal dyads. This was verified for both mention and follow relationships, i.e., relationships wherein the two users mentioned each other or followed each other. The two distributions differ: the distribution of similarity for the reciprocal followees is concentrated around higher values of similarity. The comparison for the reciprocal mentions distribution is shown in Fig 6 (B). The distribution of reciprocal mentions also has a higher similarity. This indicates that reciprocal relations are more prone to have a higher topical similarity, i.e., users have a more similar topic affiliation if they have a reciprocal relationship.
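Separating reciprocal from non-reciprocal dyads amounts to checking whether the reverse of each directed edge also exists. The sketch below assumes the relationships are available as (source, target) pairs; the function name and the data are hypothetical.

```python
def split_by_reciprocity(edges):
    """Split directed edges (u, v) into reciprocal and one-way sets.

    An edge is reciprocal when the opposite edge (v, u) is also present.
    """
    edge_set = set(edges)
    reciprocal = {(u, v) for (u, v) in edge_set if (v, u) in edge_set}
    return reciprocal, edge_set - reciprocal


follows = [("ana", "bob"), ("bob", "ana"), ("ana", "cat")]
reciprocal, one_way = split_by_reciprocity(follows)
```

The similarity distributions of the two resulting groups can then be compared in the same way as the followee and random distributions.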
The tests conducted in this subsection reinforce what was seen in the previous section: there is no significant difference between mention and follow relationships with respect to topical similarity. The distributions of both relationships are very much alike when considering the similarity of the dyads, even with reciprocal relationships. Furthermore, we could verify that reciprocal relationships show a higher topical alignment than non-reciprocal relationships. This indicates that users with a reciprocal relationship tend to become more similar by social influence or, conversely, that users' similarity can be a factor that leads both to establish the relationship. Our method is unable to discriminate between the two mechanisms, as we would need to add a temporal dimension to the evolution of similarity and the network structure.

Mention Probability
All the analyses shown until now indicate that the similarity of most of the dyads is concentrated around low values. Therefore, it is natural to presume that most of the mentions made by central users involve users with low similarity with them. However, this contrasts with common sense as we expect that users in dyads with high similarity are more likely to be mentioned.
We explored this question, i.e., whether the probability of being mentioned is higher for users with a high similarity, by looking at all dyads of followees. We also took into account the number of times that each followee was mentioned by a central user. To do so, we first defined $m_{u,v}$ as the number of mentions made by central user $u$ to followee $v$ and $s_{u,v}$ as their similarity. Then we calculated $P(m_{u,v} > M \mid s_{u,v} \le S)$, the conditional probability of a user being mentioned more than $M$ times given that her similarity with the mentioning user is at most $S$.
Figure 7: Conditional probability of followees being mentioned more than M times by central users, given that their similarity is at most S. The probability has been calculated using 547,346 dyads involving connected users.
This analysis shows how the similarity gives an indication of the interactions inside connections, at least for some values of similarity. The more similar the connected users are, the higher the probability that they have interacted.
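Empirically, this conditional probability is the fraction of dyads with more than $M$ mentions among the dyads whose similarity does not exceed $S$. A minimal sketch, assuming each dyad is stored as a (mentions, similarity) pair; the function name and the toy values are hypothetical.

```python
def mention_probability(dyads, M, S):
    """Estimate P(m > M | s <= S) from (mentions, similarity) pairs.

    Returns None when no dyad satisfies the condition s <= S.
    """
    eligible = [m for m, s in dyads if s <= S]
    if not eligible:
        return None
    return sum(1 for m in eligible if m > M) / len(eligible)


# Invented dyads, for illustration only.
dyads = [(0, 0.10), (2, 0.20), (5, 0.60), (0, 0.05)]
p_low = mention_probability(dyads, M=1, S=0.2)   # 1 of 3 eligible dyads
p_all = mention_probability(dyads, M=1, S=1.0)   # 2 of 4 eligible dyads
```

Sweeping `M` and `S` over a grid of values yields the kind of surface summarized in Figure 7.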

Inference by Similarity
There is a correlation between users average similarity with followees and men-  followees. This happens because they continue to be the most similar available in the whole pool. We believe that this is due to the fact that topics' affiliation patterns are almost unique for some dyads, hence, the majority of other users in the pool does not have a larger similarity than the actual followees of the user.
Even if the results for an average similarity of 0.4 and 0.6 are quite remarkable in terms of the match between inferred and real followees, the results obtained considering all the users together are not as good. Nonetheless, it is important to notice that the method applied here does not take into consideration the whole social network structure, which is likely the main factor responsible for determining connections. Our focus is to explore the relation between information and users' relationships, not to provide a complete algorithm for link prediction or recommendation. That said, we believe that our results show that users' affiliation to topics can be an important feature to take into account in link prediction or recommendation algorithms. We repeated the process carried out for follow relations considering, in this case, the probability of mentioning another user. Here, we verified whether we could infer if a central user mentioned another user by looking only at the similarity between them. Results are shown in Fig 9 and are quite similar to the ones for the following probability, with better performance in some cases. This once again reinforces the idea that, in the case of topical alignment, follow and mention interactions show similar behavior, and it highlights the importance that topical similarity might have for some users.

Conclusions
In today's world, online social networks such as Twitter provide a laboratory where information and users' connections are available for study. In this work, we analyzed how the pair-to-pair structure of a social network is related to the information shared on it. Connections in a social network are the substrate over which information flows, which makes this flow partially dictated by the network structure. However, information flow cannot be seen as an independent phenomenon; its contents can affect how individuals behave. For instance, people might be inclined to bond with others according to the affinity in the information they share. On the other hand, information shared by an individual can make other users less prone to establish a bond with her. We have explored this relation using Twitter's information and connection data, demonstrating that individuals who have a relationship tend to be more similar than expected regarding the information they share, i.e., connected users tend to be topically aligned.
In order to investigate how information is coupled with social connections, a key point is to design a model that captures its desired characteristics. We achieve this by modeling information as semantic topics of hashtags, as done by Weng and Menczer [36]. These topics encompass the contents of the information shared among users. We computed users' affiliation to topics to characterize individuals' interests and preferences on Twitter. This characterization served as a basis for the exploration of topical similarity between individuals, and we found that, on average, individuals are more likely to have a relationship with more similar users. For some users this effect is so profound that they are essentially connected to the users most similar to them in our entire dataset, which suggests an effective way to predict new connections, at least for a subset of individuals in the network.
We have also verified whether the influence of topical similarity between individuals differs between mention and follow relations. Our results are consistent across the two types of relationships, showing no significant difference between them. This was also verified when considering reciprocal relationships, which, in both cases, showed a higher level of similarity than non-reciprocal ones.
The approach presented in this work uses hashtags to build information topics. This limited our results to users who used hashtags, which significantly reduced our sample. Moreover, as we did not have the whole Twitter network structure, our analysis was restricted to dyads and could not address questions involving network measures, such as distance and centrality. Additionally, considering only geo-localized tweets further reduced the size of our datasets. Nonetheless, we believe that our sample provides meaningful support to understand some relationships among users. It is also possible to improve our method for building topics, which currently ignores the temporal behavior of hashtags. The moment at which hashtags co-occur might contain specificities that we were not able to capture. However, even with these limitations, we could verify that the detected topics are semantically meaningful and that our datasets were sufficiently large to achieve statistical relevance.
Our work demonstrates the importance of topical similarity between users regarding their connections and interactions. Our contribution also provides a computationally feasible way to measure the similarity between users, which can be used to further explore homophily and social influence in a social network. This can be extended to improve our understanding of the mechanisms by which users connect by analyzing the whole social network structure, which was not available to us. Furthermore, it is necessary to investigate in more depth how the flow of information is related to network dynamics. Our results also leave open opportunities to explore how topics' semantics affect the behavior of the users who adopt them. Other possibilities include using our method in applications for link recommendation or for finding missing links in social networks.

Competing interests
The authors declare that they have no competing interests.

Data Availability Statement
The data used in this study is available, in anonymized form, at DOI: 10.5281/zenodo.833390.

Data Ethics Statement
All data used in this work have been obtained using the Twitter public API. We adhered to the Spanish Law for personal data protection, which does not require obtaining permission from an Ethical Committee to use public and anonymized Twitter data. We also confirmed that we followed Twitter's terms and conditions when conducting this study.

More information on bot filtering
The presence of bots in our dataset may alter the results of our analyses. Bots usually produce a large number of tweets focused mainly on one or a few topics, so not filtering them could lead to large errors in the average similarity between users.
To ensure that our dataset contains only a negligible fraction of bots, we implemented three different filtering strategies. First, as bot content tends to be dominated by retweets (see, e.g., [1]), we excluded from our dataset all retweets and all users with a very high percentage of retweets in their timeline. This decision was also motivated by the fact that, in our analysis, we are more interested in content produced by users than in shared information.
To further reduce the probability of finding a bot in our dataset, we also removed users with an unfeasible daily rate of tweets. Fig 2 shows, for each user in our dataset, the average number of tweets produced per day in relation to the time they have been active, measured as the time difference between the first and the last tweet in our dataset. As is clear from the figure, there is a high peak in the activity of users with a short active time, suggesting that they produced a high number of tweets and then disappeared. This is another typical signature of bot activity, so we decided to filter out all users who were active for one day or less and those who produced, on average, more than 400 tweets per day. This left 9490 central users and a total population of 608899 out of the initial 774596 users who used at least one hashtag included in the topics we extracted. The distribution of tweets per user is very heterogeneous, as can be seen in the corresponding figure. To further demonstrate the robustness of our analyses, we also computed the average similarities, as in Fig 3 of the main text, considering only central users whose score is smaller than 0.5 (Fig 5). Comparing the original figure with the new one, it is clear that, even using this conservative threshold, the possible influence of bots on our results is insignificant.
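The activity-based filter described above can be sketched as follows. The thresholds (an active time of one day or less, and more than 400 tweets per day on average) are taken from the text. The function name `is_suspected_bot` and the input format, a list of tweet timestamps per user with retweets already removed, are assumptions made for this illustration.

```python
from datetime import datetime, timedelta

MAX_TWEETS_PER_DAY = 400             # threshold stated in the text
MIN_ACTIVE_TIME = timedelta(days=1)  # users active one day or less are dropped


def is_suspected_bot(timestamps):
    """Flag a user from the timestamps of their tweets.

    Active time is the difference between the first and the last tweet;
    the daily rate is the number of tweets divided by the active time in days.
    """
    if not timestamps:
        return True  # assumption: users without tweets are not kept
    active = max(timestamps) - min(timestamps)
    if active <= MIN_ACTIVE_TIME:
        return True
    days_active = active.total_seconds() / 86400
    return len(timestamps) / days_active > MAX_TWEETS_PER_DAY


start = datetime(2015, 1, 1)
# One tweet per day over ten days: a plausible human user.
regular_user = [start + timedelta(days=i) for i in range(10)]
# Five tweets within four hours, then silence: active for less than a day.
burst_user = [start + timedelta(hours=i) for i in range(5)]
# One tweet every two minutes for two days: about 720 tweets per day.
flood_user = [start + timedelta(minutes=2 * i) for i in range(1441)]

kept = [name for name, ts in [("regular", regular_user),
                              ("burst", burst_user),
                              ("flood", flood_user)]
        if not is_suspected_bot(ts)]
```

Only the first toy user passes both criteria. The retweet-based filter and the score-based check mentioned above are not part of this sketch.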

Topical similarity versus hashtags similarity
One of the main innovations of our work is that we use topics instead of hashtags to calculate the similarity between users. At this point one may ask what the advantages of using topics instead of hashtags are, as extracting topics from the hashtag co-occurrence network is a costly process. A similar approach would be to consider directly the vectors that describe hashtag usage instead of topics. This method, however, disregards dyads in which users do not use the same hashtags but are interested in the same issues. To test whether our method gives better results than using hashtags alone, we repeated the analysis in Fig 3 of the main text, also calculating the average similarity with hashtag vectors. As shown in Fig 6, the distribution of the average similarity calculated using hashtags is more peaked and centered at lower values. This is due, as shown in the inset of Fig 6, to the presence of a significant number of dyads with a low similarity. This means that most connected users engage in the same topics but do not use the same hashtags. Thus, we believe that using topics to detect similarity is more robust and allows us to uncover relationships that would go unnoticed using hashtags.
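The difference between the two representations can be illustrated with a toy example. The sketch below assumes cosine similarity between count vectors as the similarity measure, and a hypothetical mapping `hashtag_to_topic` standing in for the communities extracted from the hashtag co-occurrence network. Both are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_vector(hashtag_counts, hashtag_to_topic):
    """Aggregate a user's hashtag counts into counts per topic."""
    topics = Counter()
    for tag, count in hashtag_counts.items():
        if tag in hashtag_to_topic:
            topics[hashtag_to_topic[tag]] += count
    return topics


# Hypothetical topic assignment of three hashtags.
hashtag_to_topic = {
    "#worldcup": "football",
    "#fifa": "football",
    "#elections": "politics",
}

# Two users interested in the same issue through different hashtags.
alice = Counter({"#worldcup": 3})
bob = Counter({"#fifa": 2})

hashtag_similarity = cosine(alice, bob)
topic_similarity = cosine(topic_vector(alice, hashtag_to_topic),
                          topic_vector(bob, hashtag_to_topic))
```

Here the hashtag vectors share no component, so their similarity is zero, whereas the topic vectors point in the same direction. This is the kind of dyad that a hashtag-only comparison disregards.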

The role of homeless nodes in topics detection
The extraction of topics based on community detection revealed a high number (14118) of "homeless nodes" (hashtags that do not belong to any community, so that the algorithm creates a new community containing only one node). Given their high number with respect to larger communities, it is important to assess how those topics affect our results. Our intuition is that homeless hashtags are usually typos or ambiguous words that are uncommon as hashtags. Thus, they should appear in few tweets and be employed by few users. To verify this hypothesis, Fig. 7 presents the number of distinct users who used each homeless hashtag. As expected, more than half of the homeless hashtags have been used by only one user, and almost all of them by no more than five. This supports our intuition that those hashtags play a minimal role in our results, as topics used by only one user are not considered in the similarity.
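The check described above amounts to counting, for each homeless hashtag, the number of distinct users who employed it. A minimal sketch, assuming the usage data is available as (user, hashtag) pairs and that the set of homeless hashtags is already known (the names and toy values below are hypothetical):

```python
from collections import defaultdict


def users_per_hashtag(usage, homeless):
    """Count distinct users for each homeless hashtag.

    usage: iterable of (user_id, hashtag) pairs, one per hashtag occurrence.
    homeless: set of hashtags forming single-node communities.
    """
    users = defaultdict(set)
    for user, tag in usage:
        if tag in homeless:
            users[tag].add(user)
    return {tag: len(users.get(tag, ())) for tag in homeless}


def fraction_used_by_at_most(counts, n):
    """Fraction of hashtags employed by no more than n distinct users."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c <= n) / len(counts)


usage = [
    ("u1", "#wrldcup"), ("u1", "#wrldcup"),   # repeated use by the same user
    ("u2", "#mondayz"), ("u3", "#mondayz"),
    ("u1", "#worldcup"),                       # not a homeless hashtag
]
homeless = {"#wrldcup", "#mondayz"}

counts = users_per_hashtag(usage, homeless)
single_user_fraction = fraction_used_by_at_most(counts, 1)
```

Repeated use by the same user counts once, so a hashtag typed many times by a single account still appears as a single-user topic.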

A description of Topics
The majority of the topics considered in this work contain few hashtags, as can be seen in Fig. 8.