Quantification of inter-sample differences in T cell receptor sequences

Inter-sample comparisons of the T cell receptor (TCR) repertoire are crucial for gaining a better understanding into the immunological states determined by different collections of T cells from different donor sites, cell types, and genetic and pathological backgrounds. As a theoretical approach for the quantitative comparison, previous studies utilized the Poisson abundance models and the conventional methods in ecology, which focus on the abundance distribution of observed TCR sequences. However, these methods ignore the details of the measured sequences and are consequently unable to identify sub-repertoires that might have the contributions to the observed inter-sample differences. In this paper, we propose a new comparative approach based on TCR sequence information, which can estimate the low-dimensional structure by projecting the pairwise sequence dissimilarities in high-dimensional sequence space. The inter-sample differences are then quantified according to information-theoretic measures among the distributions of data estimated in the embedded space. Using an actual dataset of TCR sequences in transgenic mice that have strong restrictions on somatic recombination, we demonstrate that our proposed method can accurately identify the inter-sample hierarchical structure, which is consistent with that estimated by previous methods based on abundance or count information. Moreover, we identified the key sequences that contribute to the pairwise sample differences. Such identification of the sequences contributing to variation in immune cell repertoires may provide substantial insight for the development of new immunotherapies and vaccines.


INTRODUCTION
The development of high-throughput sequencing with next-generation sequencers has provided new 23 opportunities to quantify T cell receptor (TCR) repertoires and to compare their differences among different 24 cell types, organisms, and pathological samples. Such information is indispensable for quantitatively 25 understanding the immunological state of organisms that is shaped by the collection of immune cells. 26 Moreover, the detailed information of TCR repertoires, especially that of inter-sample differences, is 27 anticipated to significantly promote the development of immunotherapies and vaccines (Hou et al., 2016). 28 To this end, several theoretical methods have been proposed to quantify sample differences by focusing on 29 the count (abundance) distribution of unique TCR sequences in a repertoire (Greiff et al., 2015;Laydon 30 et al., 2015;Hou et al., 2016). Poisson abundance (PA) models are among the recently developed methods 97 with the presence or absence of Foxp3 expression. The two genetic backgrounds of mice are labeled as 98 "wild type" and "Ep" in the original paper. Both groups showed strong restriction on rearrangement of 99 V(D)J genes (i.e., the two α-chain rearrangements between Jα2.6 and Jα2 with a fixed Vα2.9 segment 100 and fixed β-chain Vβ14Dβ2Jβ2.6). The main difference between these groups is that the Ep mice were 101 backcrossed with mice that express transgenic class 2 major histocompatibility complex molecules bound to 102 a single "Ep" peptide (Pacholczyk et al., 2006). Thus, Ep mice are expected to show a more restricted 103 TCR repertoire than wild-type mice. To evaluate the diversity of TCR repertoire, the complementarity   107 To date, no sufficient and effective method for comparing TCR repertoires among different samples 108 has been established, which is mainly due to the enormous complexity and diversity of TCR sequences 109 (Hou et al., 2016). In ecology, the diversity in a population is conventionally measured with metrics of 110 species abundance' between pairs of samples. However, these methods generally rely on observational 111 data of species abundance counts, and can therefore be vulnerable to sampling bias. One widely used 112 typical measure to quantify biological diversity is through a dissimilarity metric such as the Bray-Curtis 113 index (Bray and Curtis, 1957;Tang et al., 2016;Silverman et al., 2016). In the context of TCR repertories, 114 the abundance counts of observed sequences' can be considered to be analogous to species abundance' 115 in an ecological context. However, because of the sparsity of observed sequences, application of these 116 dissimilarity metrics to a dataset of TCR repertoires may not always work well. PA models have attracted 117 substantial attention as methods to overcome these issues, since these models can compensate for the sampling bias by estimating the statistical parameters directly from the measures of the abundance counts of 119 observed sequences (Robinson and Smyth, 2007;Sepúlveda et al., 2010), and are thus expected to be more 120 robust to sampling bias. However, both methods ignore detailed information of the amino acid sequences of 121 TCRs.

122
Therefore, in this study, rather than focusing on observation counts, we instead focus on the sequence 123 similarity among repertoires. Our method consists of four steps: (i) calculate a dissimilarity matrix of 124 observed TCR sequences in all samples using the Smith-Waterman (SW) algorithm with a scoring matrix; 125 (ii) embed the data in a low-dimensional Euclidian space by dimensionality reduction methods while 126 preserving the inter-sequence relations quantified by the dissimilarity matrix; (iii) estimate the sequence 127 distributions in the low-dimensional space by the KDE algorithm; and (iv) quantify the sample differences 128 by calculating the JSD value between the probabilistic density functions of different samples, and cluster 129 the samples accordingly. Each of the above steps is described in detail in the following subsections. The first step of our method is the quantification of similarity for each pair of TCR sequences in all 132 samples. The SW algorithm remains the most popular pairwise local sequence alignment algorithm in 133 bioinformatics for quantifying the similarity of amino acid sequences (Smith and Waterman, 1981). In 134 recent years, improved versions of the SW algorithm have been proposed to resolve the problems related to 135 the increase in computational costs along with the rapidly increasing size of datasets that are now possible 136 from next-generation sequencing. Here, we used one of these modifications, the striped SW algorithm 137 (Farrar, 2007), which uses a single-instruction-multiple-data (SIMD) system that allows for multiple units to  The SW algorithm requires amino acid substitution matrices, which determine the cost of the replacement 141 of a single amino acid residue by another (Henikoff and Henikoff, 1992). Although the SW algorithm has 142 already been applied to TCR sequences as a mapping tool for CDR3 sequences (Shugay et al., 2014), no 143 study has yet established the best choice of substitution matrices for comparison of TCR data. Therefore, to 144 clarify the effect of the type of substitution matrix employed and determine the optimal choice for our 145 method, we tested 10 different matrices: five different point-accepted mutation matrices (PAM; 30, 100, 146 120, 160, and 250) (States et al., 1991), and five different blocks substitution matrices (BLOSUM; 45, 50, 147 62, 80, and 100) (Henikoff and Henikoff, 1992). The gap opening and extension penalties were set to 10 148 and 1, respectively (Farrar, 2007).

149
Since the substitution matrices give nonzero values for replacements between the same amino acid 150 residues, the total score of the alignment between two identical sequences depends on their sequence lengths.

151
Thus, the diagonal elements of a pairwise distance matrix will have different values even when they are 152 calculated from the alignments of two identical sequences. In other words, both the sequence similarity and 153 the sequence length determine the values of the pairwise distance matrix. To adjust for this sequence-length 154 effect, we converted the pairwise distance matrix into a dissimilarity matrix using the following equation: where D i,j and S i,j are a pairwise distance matrix and dissimilarity matrix between the two sequences i and 156 j, respectively. At this step, we calculated the pairwise distances between all pairs of unique sequences 157 observed in all samples with the striped SW algorithm. We then transformed the pairwise distance matrix 158 into the dissimilarity matrix using Eq. 1.

Dimensionality reduction with manifold learning methods 160
To visualize the structure of the high-dimensional dissimilarity matrix in a low-dimensional space, we 161 applied dimensionality reduction (manifold learning) techniques to the dissimilarity matrix described detailed parameters of all methods are described in Table S1 in the Supplementary Information.
where P (x) and Q(x) are the estimated PDFs and D KL is the Kullback-Leibler divergence; M (x) is and D local JS is the "local JSD" , whose integration with respect to x gives the JSD. Thus, the 216 "pairwise" JSDs provide a sample-difference matrix that quantifies the combinatorial differences between 217 all pairs of the samples. To categorize all samples, we utilized hierarchical cluster analysis, which converts 218 the N × N -dimension sample-distance matrix into a dendrogram. Specifically, we used an agglomerative 219 hierarchical cluster technique; each sample is initially treated as a singleton cluster, and pairs of clusters are 220 repetitively merged according to a criterion until only a single cluster remains (Maimon and Rokach, 2005). 221 We here used Wards criterion (Ward, 1963)  count-based methods, we also quantified the inter-sample difference with BPLN and Bray-Curtis methods.

225
BPLN was applied according to the methods described in the original paper by Rempala et al. (2011).

226
To evaluate the goodness of fit of the clustering results, Rempala and colleagues Rempala et al.

227
(2011) calculated the cophenetic correlation coefficient (CCC), which quantifies the distortion due to the 228 transformation from the distance matrix to the cophenetic matrix, from which the dendrogram was derived. value of JSD is unclear. Specifically, we used bootstrap methods to evaluate significance, resampled data 235 points from the naive PDF according to the number of observed read counts, and then re-estimated the PDF 236 from the resampled data points. We then calculated the JSDs between the naive and re-estimated PDFs.

237
We repeated this process 100 times to obtain a histogram of the calculated JSDs. The 2.5th and 97.5th 238 percentiles of the histogram of the JSDs between the naive and each re-estimated PDF represent both ends 239 of the 95% confidence interval, where values outside of the interval indicate a significance level of over 5%.

240
To identify the sequences with the greatest contributions to the inter-sample distances, we selected square 241 bins for the top 1% of the local JSDs. We next defined the sequences in these bins as those contributing to 242 the observed pairwise sample difference. Furthermore, to investigate the characteristics of the contributing 243 sequences, we calculated the relative frequencies of the amino acid residues in all of the contributing sequences. The graphics of the relative frequencies were obtained using WebLog 3 software (Crooks et al., .

246
All analyses were performed using custom-made codes written in Python, Matlab, and R.  Fig. 1(B)) were well 281 reflected in the two clusters for the MDS and ISOMAP methods ( Fig. 2(B,C)), but were not represented 282 clearly in the clusters of t-SNE. This result suggests that t-SNE emphasizes slight differences within clusters 283 rather than large differences between the clusters of the dissimilarity matrix. Since it is unclear whether this 284 visualization property of t-SNE works efficiently for comparisons between samples, we quantified and compared the distributions of data points at the next step, and examined the method that would be most 286 appropriate for this purpose. 288 We applied the KDE algorithms to the spatial distributions of data points to estimate their probability 289 density functions (color gradient in Fig. 2), with which JSD is calculated to quantify pairwise-sample 290 differences. The matrices of the pairwise-sample differences are shown in Fig. 3(A), and the resulting 291 dendrograms in Fig. 3

298
To verify the results of hierarchical clustering obtained by our method, they were compared with those 299 obtained with previous observation count-based methods, the BPLN and Bray-Curtis method. As shown 300 in Fig. 4, the sample differences and dendrograms estimated from the BPLN and Bray-Curtis methods

301
were very similar to those obtained using our approach with MDS, t-SNE, and SE. Importantly, these 302 similar results were obtained with different data modalities: sequence similarity and observation counts.

303
Therefore, this consistency suggests that there is common information between sequence similarity and 304 observation counts with respect to quantifying the differences among samples. We should note that these two 305 modalities can be combined simply by assigning the number of observed sequence counts as a weighting 306 factor for each data point (i.e., a unique sequence) in the embedded space. Indeed, the counts-weighted 307 PDFs using KDE (Fig. S1) showed no obvious change in the hierarchical clustering structure of the 308 pairwise-sample differences. Taking these results together, the MDS or t-SNE appears to be the better choice 309 as a dimension-reduction method for evaluation of differences in TCR repertories among samples, given 310 that these methods show wide spatial distributions of the data points, and also show the most consistent 311 dendrogram structures with those of previous count-based methods.

313
To verify the statistical significance of the calculated JSDs between all sample pairs, we calculated the 314 JSDs between the naive and the re-estimated PDFs using a non-parametric bootstrap algorithm. Figure 5 315 shows the histogram of the JSD values between the naive PDF of Fig. 2 (C, ii) and the re-estimated PDFs. indicates that the types of T cells and the genetic background can be discriminated with sufficient statistical 326 significance.

328
The main advantage of our method compared to count-based methods is the ability to identify the major 329 sequences contributing to inter-sample differences. To identify the sequences with the greatest contributions 330 to large local JSD values, we plotted the spatial distribution of the local JSDs between the WtTN1 and 331 EpTN1 sequences. As shown in Fig. 6, two regions were identified that were associated with top 1% 332 significance values. Table 1 lists the identified sequences in these regions with larger local JSDs than the 333 others. In regions 1 and 2, there were no sequences for WtTN1, whereas EpTN1 had multiple sequences 334 in these regions, suggesting that these apparent Ep-specific sequences may contribute to the observed 335 abnormality in the antigen presentation of Ep mice.

336
This type of sequence identification can provide further knowledge about the characteristics of sequence 337 alignments. Figure 7 shows the occupation probability (relative frequency) of amino acids at each position 338 of the sequences obtained from all of the sequences contributing to pairwise differences, and a consensus 339 sequence was determined from amino acid positions 6th to 11th. We note that these contributing sequences 340 and their characteristics cannot be easily identified simply by examining the overlapping sequences in two 341 samples, because there was almost no overlap between EpTN1 and WtTN1 sequences (0.352%, 1/284), and 342 because these account for only 6.34% (18/284) of the total unique sequences in the two samples.

DISCUSSION
We quantified the difference in TCR repertoires among different samples based on amino acid sequence 344 dissimilarity. Through a quantitative comparison of the sequence distributions in the dimension-reduction 345 spaces of the dissimilarity matrix, we estimated the inter-sample hierarchical structure, which was almost 346 identical to that estimated with previous count-based methods that did not incorporate detailed sequence 347 information. Furthermore, we identified the sequences that most strongly contribute to the pairwise sample 348 difference using the local JSD distribution.

349
Despite the fact that our method relies on sequence similarity and previous methods are based on  repertoires.

380
One issue concerns the treatment of gap penalties. When we evaluated the differences of the score matrices 381 shown in Fig. 1, we fixed the gap opening and extension penalties to 10 and 1, respectively. Although the 382 effects of the penalties have not been adequately investigated in previous studies (Wrabl and Grishin, 2004), 383 the gap opening penalty was found to affect estimations of the hierarchical data structure (data not shown).

384
The CDR3 region of TCR is a much shorter sequence than peptide sequences, and also shows frequent The upper and lower panels show the dissimilarity matrices and projection maps in two-dimensional space, respectively. All of the rows and columns in each dissimilarity matrix were sorted according to the sum of their elements. The colors of points in the lower panels of (A) and (B) correspond to the clusters in PAM250 and BLOSUM45 that were discriminated by k-means algorithms (k = 7). Panel (i) includes the points of the total unique sequences observed in all samples. Panel (ii) includes only the portions of sequences that were observed in each sample. "Ep" and "Wt" denote two different genetic backgrounds of mice. "TN" and "TR" denote naive and regulatory T-cells. "1" and "2" denote the thymus and peripheral lymph nodes, respectively. As an instance, EpTN1 denotes the naive T-cells that were collected from the thymus in the "Ep" mice.       Table S1. Detail parameters of manifold learning methods. Except for ISOMAP, we used the functions in the class of sklearn.manifold in the scikit-learn toolbox. The parameters of each function are described in the Table S1. For ISOMAP, first, we calculated the geodesic distances by using k-nearest neighbor algorithm and Floyd-Warshall method. Then, we applied the geodesic distances to MDS with the following parameters.