AlignScape, displaying sequence similarity using self-organizing maps

The current richness of sequence data needs efficient methodologies to display and analyze the complexity of the information in a compact and readable manner. Traditionally, phylogenetic trees and sequence similarity networks have been used to display and analyze sequences of protein families. These methods aim to shed light on key computational biology problems such as sequence classification and functional inference. Here, we present a new methodology, AlignScape, based on self-organizing maps. AlignScape is applied to three large families of proteins: the kinases and GPCRs from human, and bacterial T6SS proteins. AlignScape provides a map of the similarity landscape and a tree representation of multiple sequence alignments These representations are useful to display, cluster, and classify sequences as well as identify functional trends. The efficient GPU implementation of AlignScape allows the analysis of large MSAs in a few minutes. Furthermore, we show how the AlignScape analysis of proteins belonging to the T6SS complex can be used to predict coevolving partners.

where cij are the BLOSUM62 elements (m=23 is the BLOSUM62 square matrix dimension).Thus, SR(Ma, Mb) can be considered as the lower limit of the score between Ma and Mb.Importantly, if Ma and Mb are very distant or Go and Ge are set too low, it might happen that S(Ma, Mb) < SR (Ma, Mb).In such a case, S(Ma, Mb)-SR (Ma, Mb) is set heuristically to a minimum value of 0.001.On the other hand, the upper limit of the score between Ma and Mb is the score obtained when the two PPM are identical, i.e., SI(Ma, Mb): As PPMd can be expressed through vectorial algebra, it was implemented in a pytorch call to perform calculations on GPU.

MSA clustering with a self-organizing map (AlignScape)
AlignScape is an adaptation of a code where SOM was developed to analyze the conformational space of protein MD trajectories (Bouvier et al., 2015;Mallet et al., 2021).AlignScape takes an MSA as input and utilizes the PPMd as a metric.To avoid the border effects of a planar output space (Mount and Weaver, 2011), AlignScape sets the output SOM as a n x n toroidal map where each unit is initialized as a random PPM computed using a uniform distribution U(0,1).Hence, AlignScape reduces an input MSA of l•25 variables (where l is the alignment length and 25 all different alignment symbols) and p observations (total number of aligned sequences) to a n 2 output SOM.AlignScape training runs consecutive training cycles named epochs.During an epoch, each MSA sequence is iteratively sampled once.When selected, each aligned sequence is converted into PPM and compared against all SOM units to identify its BMU.Then, the PPMs of the BMU and its neighboring units are updated to resemble the sampled sequence.The function that regulates the neighbor selection radius (σ) is a Gaussian, which decays exponentially with the iterations, from a fourth of the SOM dimension (n/4) to 1.The learning rate (α), which controls how much we modify the PPMs of the BMU and its neighbors, decreases exponentially from 0.5 to 0. To maximize efficiency, each iteration can be structured in batches of sequences rather than in individual sequences.
AlignScape training can be performed on both CPUs and GPUs.Our CPU implementation was developed based on Bouvier et al. work (Bouvier et al., 2015) and our GPU implementation was built upon the subsequent work by Mallet et al. (Mallet et al., 2021), where they specifically optimized their method for GPU.

U-matrix representation
As the output SOM, the U-matrix is a n x n periodic grid.More specifically, the value of each U-matrix unit is the mean PPMd between the corresponding SOM unit and its eight adjacent neighboring units.Here, the U-matrix is plotted using a heatmap, where dark-blue units represent units whose sequences are similar to the sequences of their neighboring units.In contrast, yellow units represent units whose sequences diverge from the neighbor unit sequences.Hence, regions of contiguous blue-colored units are interpreted as clusters of similar sequences (basin) and yellow regions as borders between clusters (barrier).Sequences from the input MSA can be mapped to the U-matrix using their respective BMUs.Subsequently, these BMUs can be color-coded according to prior information associated with their corresponding sequences.This sequence annotating process enables the identification of insightful patterns within the U-matrix.

U-matrix distance
Given two aligned sequences sa and sb, their corresponding PPMs Ma and Mb, and their corresponding U-matrix BMUs, BMUa and BMUb, we defined its U-matrix distance as the length of the shortest path between BMUa and BMUb.Importantly, the distance between two U-matrix adjacent units is calculated as the norm of their values.SciPy Python library (Virtanen et al., 2020) was used to compute the shortest path between non-adjacent units.

Quantization and topographical error
The quantization error (QE) is the mean PPMd between the PPMs of the input sequences and the PPMs of their corresponding BMUs.Hence, the QE is a descriptor representing the similarity between the input MSA and the trained SOM.The smaller the QE, the better SOM fits the initial data.The topographical error (TE) is the proportion of input sequences for which its best matching unit and its second best matching unit are not adjacent.TE estimates to which extent SOM preserves the topology of the input data.The closer the TE is to 0, the better.

Robustness and scalability of AlignScape
There are several user-defined parameters needed to perform a AlignScape calculation: the learning rate (α), the neighbor selection radius (σ), the size of the map (n x n), the size of the batch, and the number of epochs.The values of α and σ were empirically determined by trial and error.The σ parameter specifically affects the coarse-graining of the sequences, a smaller radius forces all sequences to be assigned to different units.The map size should be larger than the input MSA, and its maximal size is limited by the computational power (i.e., GPU or CPU memory).Similarly, the batch size is limited by the processing unit's memory and trade-off between the learning time and the map size.In our trials, we set the batch size of 10 when the map has a size of 90 x 90.Memory allocation depends on several parameters: the batch size, the SOM size, and the length of the MSA.Trimming the original alignment, decreasing the batch size, or reducing the SOM size can facilitate the accommodation of lengthy MSAs.On the other hand, execution time is dependent on the batch size, the SOM size, the MSA length, and the number of sequences within the MSA.AlignScape's execution time increases as the SOM size, the MSA length, and the number of MSA sequences grow.To mitigate execution time, one might consider increasing the batch size, although this must be balanced against memory allocation.Suppl.Figure 10A illustrates the increase in memory allocation as the batch size increases, while Suppl.Figure 10B showcases the average execution time for a training epoch.This data is valuable in selecting an appropriate batch size for a specific MSA length, aiding in optimizing execution times.To evaluate the convergence of our AlignScape calculations, we employed two error measurements: quantization and topographical errors (QE and TE).We observed that the convergence was achieved after 100 epochs (Suppl.Figure 7).Different runs consistently generated similar Umatrices during this stage, even with various random initializations.

Inference of non-annotated sequences
By leveraging sequence annotation within the U-matrix, we can infer information for non-annotated sequences.To do so, we employed the k-nearest neighbors algorithm.The annotated sequences are utilized as the training set, with their corresponding BMUs serving as feature vectors and the annotated information acting as labels.The U-matrix distance is utilized as the algorithm metric.In cases where multiple labels had an equal number of neighbors during the inference stage, we implemented a procedure to resolve the tie.This procedure involves calculating the sum of distances of all k-nearest neighbors assigned to a label.The label with the lowest total distance is then selected as the final inference.

Graph-based representation of the SOM
The SOM can be represented as a square grid graph, accompanied by its corresponding adjacency matrix.In this graph, the SOM units serve as the nodes, and the connections to their eight direct neighbors in the grid are represented as edges.In addition, one can define a graph representation and its corresponding meta-adjacency matrix based on a sample of sequences {si} of the MSA, where the nodes are the {BMUi} of these sequences and the edges are the U-matrix distances between the units.

Minimum spanning tree (MST)
The minimum spanning tree (MST) is defined as the set of shortest paths that connect {BMUi} corresponding to a sample of sequences {si}.The MST is required not to have cycles, and the path lengths are computed as U-matrix distances.To compute the MST, we utilize the sparse.csgraph.minimum_spanning_treefrom Scipy Python (Virtanen et al., 2020) from the metaadjacency matrix of {si}.

Aperiodic U-matrix
Given a sample of sequences {si} and its MST, we can transform the periodic U-matrix into a planar aperiodic U-matrix as follows: 1.The aperiodic U-matrix is defined as a 3n x 3n map subdivided into 3 x 3 smaller maps of size n x n, where n is the dimension of the SOM.The aperiodic U-matrix value of rBMU1 and rBMU2 units are set as the value of BMU1 and BMU2, respectively.While iterating over the BMUs pairs, it could happen that either BMU1 or BMU2 has already been remapped.In this scenario, the coordinates of the already remapped BMU are kept, and the coordinates of its partner are calculated as specified in ii) onwards.3. The periodic U-matrix units {BMUj} not included in {BMUi} are remapped into the aperiodic U-matrix as follows: i) the BMUi from {BMUi} with the smallest U-matrix distance against BMUj is calculated ii) the coordinates of BMUj into the nine smaller maps are calculated, and the ones with the lowest Euclidean distance against rBMUi is set as the rBMUj iii) the value of rBMUj is set as the value of BMUj. 4. The values of the units of the aperiodic U-matrix without an associated periodic U-matrix unit are set as infinite.

Clustering
To cluster the U-matrix units, we used an Agglomerative Clustering algorithm.As a clustering metric, we employed the U-matrix distance.The number of clusters for the Agglomerative algorithm was set as the total number of local minima in the U-matrix.To calculate these minima, we utilized the Umatrix unit values and enforced a minimal Euclidean distance of seven units between each minima.The clustering algorithm was performed by cluster.AgglomerativeClustering from scikit-learn Python (Pedregosa et al., 2011) and the count of local minima by skimage.feature.peak_local_maxfrom scikitimage (Walt et al., 2014).In the latter, the U-matrix unit values were multiplied by -1 to locate the local minimums instead of the local maximums.

AlignScape distance matrices
Given a sample of k sequences {si} and their corresponding {BMUi}, we defined its AlignScape distance matrix as a k x k matrix with elements dij corresponding to the U-matrix distance between the ith and the jth sequences.All AlignScape distance matrices were hierarchically clustered using the Seaborn Python library (Waskom et al., 2017).

Linear correlation coefficient between AlignScape distance matrices
A sequence sA and a sequence sB mapped in UA and UB U-matrices, respectively, are defined as a duplet if they have a common genetic origin (gene, gene cluster, …).Given ai and bi distances from DA and DB AlignScape distance matrices, respectively, they are defined as correlated distances if they represent a distance between the same two duplets (Suppl.Figure 1).The Pearson correlation coefficient r between DA and DB is calculated as: where ai and bi are the correlated distances between DA and DB, t is the total number of correlated distances, and ā and b̅ are the means of all ai and all bi, respectively.
Given a set {Di} of AlignScape distance matrices, its AlignScape correlation matrix is calculated by pairwisely computing its r.

Phylogenetic and BLOSUM62 correlation matrices
The phylogenetic correlation matrix was calculated using gene phylogenetic distance matrices.These distance matrices were obtained by computing the gene phylogenetic trees of the sequences mapped in the AlignScape U-matrices.The phylogenetic trees were generated following the subsequent steps: i) the aligned sequences forming the phylogenetic tree were extracted from the AlignScape inputting MSA.ii) The corrected Akaike Information criteria (Akaike, 1998) calculated by Iqtree (Kalyaanamoorthy et al., 2017) was used to determine the tree best-fit maximum likelihood model (Suppl.Table 5).iii) Iqtree calculated the tree with 1000 bootstrap replicates.iv) SeaView (Gouy et al., 2010) was used to visualize the final phylogenetic tree.To extract the distances between tree leaves and generate the phylogenetic distance matrices, we utilized the Phylo package from BioPython (Talevich et al., 2012).The BLOSUM62 correlation matrix was computed using gene BLOSUM62 distance matrices.These were calculated with the BLOSUM62 distance between the sequences mapped in the AlignScape Umatrices.BLOSUM62 distance was computed using the BioPython package.

Data gathering 1.16.1 Human Kinome
The structure-based MSA of the human Kinome containing 497 typical kinase domains was downloaded from Modi V. et al. (Modi and Dunbrack, 2019).The authors estimated an accuracy of 97% for this MSA by pairwisely superimposing 272 known structures of human kinases.Importantly, each MSA sequence was labeled according to its kinase group (AGC, CAMK, STE, CMGC, TKL, TK, CK1, RGC, TYR, NEK, and Other).We modified the downloaded MSA by removing the sites with low structural similarity (sites in lowercase) as well as non-informative sites (using ClipKit (ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference | PLOS Biology, n.d.).Additionally, we used cd-hit (Li et al., 2001) to remove redundant sequences (99% sequence identity threshold).

Human GPCRs
The MSA of the seven transmembrane domains of all 817 human G-protein coupled receptors (GPCRs) was downloaded from Cvicek V et al. (Cvicek et al., 2016).This MSA was generated and validated with experimental GPCR structures.Each aligned sequence was labeled in one of the 12 GPCR subgroups (A-α, A-β, A-γ, A-δ, A-other, olfactory, taste2, vomeronasal, B, Adeshion, C, and F) among the four major GPCR groups (A, B, C, and F).We used cd-hit to remove redundant sequences (99% sequence identity threshold) and ClipKit to remove non-informative sites from the downloaded MSA.

T6SS i gene clusters
Twenty-six genomes of pathogenic proteobacteria expressing at least one T6SS i were selected from the KEGG database (Kanehisa and Goto, 2000).From the 26 genomes, we extracted 56 T6SS gene clusters and classified them according to their subtypes (Suppl.Table 4).For each gene cluster, the corresponding TssB sequence was searched in the SecreT6 database (Li et al., 2015) (which has subtype annotation) using Blast (Madden and Camacho, 2021).The gene cluster subtype was assigned according to the subtype of the closest blast hit.To maximize the variety of T6SS i , the genomes were chosen based on a thorough literature search.
The sequences of the 13 essential T6SS genes (TssA, TssB, TssC, TssE, TssF, TssG, TssJ, TssK, TssL, TssM, hcp, ClpV, and VgrG) were extracted from each gene cluster and used as queries on a Blast search on Uniref90 database (Suzek et al., 2015).Resulting homologous sequences whose length was shorter than 50% or longer than 150% of the length of the query were discarded.Redundant hits with an identity higher than 90% were clustered using cd-hit (Li et al., 2001).
We generated an MSA for each essential T6SS gene by iterating two steps: 1) sequence alignment of the filtered hits using MAFFT (Katoh et al., 2002) and 2) outliers removal using EvalMSA (Chiner-Oms and González-Candelas, 2016).This iterative process stops either at the 5th round or when EvalMSA could not find any outlier.Finally, we used ClipKit (ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference | PLOS Biology, n.d.) on the latest MSA to remove the non-informative sites.Figure 6) TssC and TssM aperiodic U-matrices and distance matrices.In the Umatrices, each sequence from the input MSA was mapped to its BMU, which was colored according to the T6SS i subtype of the sequence.Gray squared BMUs along the MST represent MSA sequences without prior classification data.In the distance matrices, colored boxes between the matrix and the dendrogram indicate the T6SS i subtype of each sequence, and sequence acronyms were reported in Supplementary Figure 9) Visual comparison between three representations: AlignScape Umatrix, circular unrooted phylogenetic tree, and an Sequence Similarity Network (SSN) for the human kinome sequences.A) An aperiodic U-matrix representing the sequence similarity landscape of the Human Kinome.Each sequence from the input MSA was mapped to its BMU and was colorcoded based on kinase groups.B) Circular unrooted phylogenetic tree of human protein kinases generated using the p-distance metric between sequences, the neighbor-joining algorithm, and 1000 bootstrap replicates.These calculations were performed using the MegaX software (Kumar et al., 2018).The resulting tree was subsequently uploaded to the iTOL webserver (Letunic and Bork, 2016), where each clade was color-coded based on its corresponding kinase group.C) Sequence similarity network of the Human kinome.We used EFI online tools to generate the SSN (Zallot et al., 2019;Oberg et al., 2023), and the network was displayed using cytoscape (Shannon et al., 2003).Supplementary Figure 10

Supplementary
) AlignScape scalability.A) GPU memory allocation required for various MSA lengths and batch sizes.B) Average execution time for one training epoch across various MSA lengths and batch sizes.The average execution time and its standard deviation (depicted as error bars) were calculated based on five trials.Both GPU memory allocation and execution time tests were conducted in an NVIDIA RTX A4500 with 20GB of memory.

Table 2 )
A-δ sequences and A-δ inferred sequences outside the main A-δ branch.

Table 4 )
List of T6SS gene clusters.It includes information about the organism of 330 origin, the gene cluster acronym, the organism TaxID in both KEGG and NCBI database, and the