Comparative Analysis of the CDR Loops of Antigen Receptors

The adaptive immune system uses two main types of antigen receptors: T-cell receptors (TCRs) and antibodies. While both proteins share a globally similar β-sandwich architecture, TCRs are specialized to recognize peptide antigens in the binding groove of the major histocompatibility complex, while antibodies can bind an almost infinite range of molecules. For both proteins, the main determinants of target recognition are the complementarity-determining region (CDR) loops. Five of the six CDRs adopt a limited number of backbone conformations, known as the “canonical classes”; the remaining CDR (β3in TCRs and H3 in antibodies) is more structurally diverse. In this paper, we first update the definition of canonical forms in TCRs, build an auto-updating sequence-based prediction tool (available at http://opig.stats.ox.ac.uk/resources) and demonstrate its application on large scale sequencing studies. Given the global similarity of TCRs and antibodies, we then examine the structural similarity of their CDRs. We find that TCR and antibody CDRs tend to have different length distributions, and where they have similar lengths, they mostly occupy distinct structural spaces. In the rare cases where we found structural similarity, the underlying sequence patterns for the TCR and antibody version are different. Finally, where multiple structures have been solved for the same CDR sequence, the structural variability in TCR loops is higher than that in antibodies, suggesting TCR CDRs are more flexible. These structural differences between TCR and antibody CDRs may be important to their different biological functions.


INTRODUCTION
The adaptive immune system defends the host organism against a wide range of foreign molecules, or antigens, using two types of receptors: T-cell receptors (TCRs) and antibodies (1). TCRs typically recognize peptide antigens presented via the major histocompatibility complex [MHC; (2)], while antibodies can bind almost any antigen, including proteins, peptides, and haptens (3). Despite their different roles in the immune response, these proteins share a β-sandwich fold (Figure 1) (4,5).
In humans, most TCRs are αβTCRs, consisting of one TCRα chain and one TCRβ chain, while most antibodies are comprised of two heavy(H)-light(L) chain dimers (Figure 1). All four types of chains (α, β, H, and L) are formed from the somatic rearrangement of the respective V, D, and J genes of the TCR or antibody loci. The random combination of these genes, alongside further diversification mechanisms (e.g., random nucleotide addition), are estimated to yield trillions of unique TCRs and antibodies (6,7). TCRα and antibody light chains are made from the V and J genes, while TCRβ and antibody heavy chains are assembled from the V, D, and J genes, making the L-chain equivalent to α-chain and H-chain equivalent to β-chain (6,8). In both types of FIGURE 1 | TCRs and antibodies share a globally similar structure. Both proteins are heterodimers, characterized by a set of six CDRs that form the majority of the binding site. Comparable chains and CDRs share coloring schemes; for example, the TCRβ and antibody heavy chain are colored gray, while CDRα1, and CDRL1 are colored purple.
antigen receptors, sequence, and structural diversity is concentrated in six hypervariable loops, known as the complementarity determining regions (CDRs). There are three in the TCRα chain (CDRα1-CDRα3) and three in the TCRβ chain (CDRβ1-CDRβ3). Likewise, the light chain and heavy chain of antibodies have three CDRs each (CDRL1-CDRL3, CDRH1-CDRH3). In TCRs, CDR1, and CDR2 typically contact the MHC's conserved α-helices (9,10), while the CDR3 almost always contacts the peptide antigen (11,12). All six antibody CDRs can be involved in antigen recognition (3,13), though the CDRH3 loop is often the most important (14,15). The structural complementarity between the binding sites of the antigen receptor and their cognate antigen governs the binding interactions. As the CDRs form the majority of the binding site, their conformations are critical to the binding.
The "canonical class" model was first proposed for antibodies in 1987 (16). It is based on the observation that CDRs adopt a limited number of backbone conformations. The definition of canonical classes had been revisited multiple times as more structures become available [e.g., (16)(17)(18)(19)(20)]. Sequence features, such as the presence of specific amino acids within or near the CDR loop, may be used to predict the canonical forms [e.g., (17,18,20)]. Canonical forms have been used for in silico antibody design (21,22), and predicting the structures of CDR sequences from next-generation sequencing (NGS) datasets (20,23). Despite the value of canonical classes to antibody design and development, only two studies have so far applied the concept to TCR CDRs (24,25).
The first clustering of TCR CDR loops was carried out using only seven TCR structures (24). At that time, four canonical classes were identified for CDRα1, four for CDRα2, three for CDRβ1, and three classes for CDRβ2; neither CDR3 loop was clustered. More recently, Klausen et al. clustered the CDRs from a non-redundant set of 105 αβTCR and 11 unpaired TCR structures. They performed the clustering in torsion space using an affinity propagation algorithm. In total, 38 canonical forms were characterized. These clusters were then used to construct a sequence-based, random forest classifier, with canonical form prediction accuracies between 63.21 and 98.25% (25).
CDRβ3 in TCRs and CDRH3 in antibodies show higher variability in sequence composition and structure than the other CDRs (1). While no canonical forms have been defined for CDRH3, several groups have analyzed the kinked and extended (or bulged and non-bulged) conformations at the start and end of the loop, known as the "base" or "torso" region (18,(26)(27)(28)(29)(30)(31). Weitzner et al. (30) showed that pseudo bond angle τ and pseudo dihedral angle α of the second last residue of the CDRH3 loop [Chothia (32) position 101, IMGT (33) position 116] can differentiate between the extended and kinked torso conformations. Finn et al. (31) analyzed the first three and last four residues of CDRH3 loops and observed that for the same IMGT position 116, the φ/ψ angles are different in kinked and extended torsos. In this paper, we carry out the first examination of the conformation of the base region of CDRβ3 loops.
Although TCRs and antibodies are derived from similar genetic mechanisms and share a similar architecture, only a handful of studies have compared them [e.g., (5,10,(33)(34)(35)(36)]. Furthermore, analyses have largely focused on sequence-based features. For instance, Rock et al. found that the CDRα3 and CDRβ3 loops have a different length distribution to CDRL3 and CDRH3 (34), while Blevins et al. observed that the TCR CDR1 and CDR2 sequences have more charged amino acids than the analogous antibody CDR1 and CDR2 (10). Given the similarity in the genetic mechanisms, fold and the limited conformational variability in the canonical CDRs, it might be reasonable to expect structural similarity between TCRs and antibodies. Comparing TCRs and antibodies should identify potential characteristics that may inspire antibody-like TCR design, TCR-mimic antibodies and soluble TCRs (5,(37)(38)(39). Such analyses can also highlight structural signatures that may relate to their different biological functions, such as MHC restriction in TCRs (6) and the virtually unconstrained antigen binding in antibodies (3).
In this manuscript, we first used a length-independent clustering method to update the canonical classes of TCR CDRs (20). We then built a sequence-based TCR CDR canonical form prediction algorithm based on an adapted position-specific scoring matrix (40). Next, we attempted to enrich the TCR CDR dataset with antibody CDR structures. We found that TCR and antibody CDRs occupy distinct areas of structural space. In the small number of common conformational clusters, the underlying sequence patterns differ. This structural distinction may be a key differentiator between the functions of these two classes of immune proteins.

RESULTS
The IMGT-defined CDR loops (33) were extracted from each chain of the 270 high-quality TCR structures in STCRDab [STCRDab set; (41)]. This is more than double the numbers used in previous studies of TCR canonical forms, Al-Lazikani et al.  (24,25). Redundant sequences were retained as the conformations may differ as shown in the forthcoming sections. Summary statistics of the sequence-redundant STCRDab set for each of the CDR types are listed in Table 1.

Updating the Canonical Classes of TCR CDRs
We clustered each CDR type (e.g., CDRα1, CDRα2, etc.) using the DBSCAN method (20), with a minimum cluster size of five and a clustering threshold of 1.0Å . For clusters with more than two unique sequences, we designated them as "canonical classes"; otherwise, they were considered "pseudo-classes" ( Table 1; see Materials and Methods for full details).
In total, we identified seven α1, seven α2, five α3, one β1 and two β2 canonical classes (see Table 1). The representative structures and sequence patterns of our TCR CDR canonical classes are shown in Figure 2 and Figures S1-S4. We found that some canonical forms have highly conserved positions in their encoding sequences, which might govern the conformations. However, we also observed many cases where structures of the same sequence fell into different structural clusters. For instance, the α1 loop with the sequence DSVNN belonged to α1-5-A when unbound, to pseudo-class α1-5-B* when bound to an MHC with a long peptide, and was unclustered when the binding peptide is short ( Figure S5).
Inter-species difference was not observed among canonical classes. Any antigenic species and TCR species may be found in any canonical classes (Figures S6, S7). TCRs that bind to MHC1relate proteins appear to adopt only a particular set of CDR canonical forms ( Figure S8).
We compared our canonical forms to those from Al-Lazikani et al. and Klausen et al. (see Tables S1, S2). We matched canonical classes if their representative structure was found in our canonical class. Some canonical classes were represented by structures that were filtered out of our dataset for quality reasons. In these cases, a sequence-identical CDR with a comparable backbone conformation was used as a proxy to match canonical classes (see Materials and Methods).

Comparison to Previous Canonical Forms
For CDRα1, our DBSCAN method broadly agreed with Al-Lazikani et al. and Klausen et al. (24,25), apart from α1-3 in Al-Lazikani et al.'s classes, for which we found no corresponding cluster. Klausen et al.'s α1-5 cluster was matched to pseudo-class α1-6-F, meaning that for this cluster, we found more than five structures, but only a single unique sequence.
All seven of our CDRα2 classes and our one CDRα2 pseudoclass were matched to previously observed CDRα2 canonical classes (see Table S1 for the full list of comparisons). All five of our CDRα3 canonical forms and two of the pseudo-classes mapped to ones from Klausen et al. In addition, we found nine further pseudo-classes that were not identified in their study. The "#" refers to the number of entities. TCR-only clusters are formed at a threshold of 1.0Å; TCR and antibody CDRs are clustered using the standard threshold for antibody CDRs given in Nowak et al. (20). The TCR and antibody CDR sequence lengths observed in the combined TCR and antibody CDR cluster are separated by ''/".  (42)] four-letter code and the chain identifier of the centroid structure is shown in the bracket next to the cluster name. The sequence pattern below each centroid structure is generated by WebLogo (43), using the unique sequences of the cluster: α3-10-A has 11, α3-11-A has 6, α3-12-A has 2, α3-13-A has 3 and α3-13-B has 2. Hydrophilic residues were in blue, neutral residues in green and hydrophobic residues in black (see Materials and Methods).
cluster representative of the Klausen et al.'s α3-1 canonical class was filtered out of our dataset as it comes from a structure with a resolution >2.9Å. Since its sequence was also absent in the rest of our dataset, we were unable to map this class to our canonical forms. The cluster representatives of Klausen et al.'s α3-2, α3-4, α3-11, and α3-12 were unclustered in our analysis. Our single CDRβ1 class and our one CDRβ1 pseudo-class were matched to previous clusterings. Klausen et al.'s β1-3 and β1-4 forms were not in clusters in our work (Table S2). For CDRβ2, we were unable to find a match for Al-Lazikani et al.'s β2-2 class, nor Klausen et al.'s β2-1 and β2-7 classes. However, our two CDRβ2 classes and one CDRβ2 pseudo-class were matched, with our β2-6-B merging three of Klausen et al.'s clusters (Table S2).
Both previous clusterings were based on backbone dihedral angles of the CDR loops, whereas in our work, we clustered using backbone distances. Despite these different approaches, there was a large degree of overlap between our canonical forms and those found previously. We have also identified a small number of new canonical classes from our larger dataset. As was shown for antibody CDR canonical forms (40), the growth of structural data continuously modifies our understanding of CDR loop structures. It is therefore necessary to continuously and preferably automatically update the definition of canonical forms as more structural information becomes available.

Prediction of CDRs From Sequence
As our TCR canonical classes showed conserved sequence patterns (Figure 2 and Figures S1-S4), we built a sequencebased, length-independent position-specific scoring matrix (PSSM), that can be used to predict TCR CDR canonical classes (40). The performance of the predictor was evaluated by a leaveone-out cross-validation protocol on the unique sequences in STCRDab set ( Table 2). Accuracy ranged from 73.2 to 100%, which is comparable to previous results (25).
To assess the coverage of our method on large sets of sequencing data, we used our PSSMs to predict the CDR canonical classes of an NGS dataset of mouse TCRα sequences (44). The entire dataset contained 1,563,876 sequences (1,498,254 CDRα1, 1,563,876 CDRα2, 1,267,235 CDRα3); on a single 3.4 GHz core, the prediction took three minutes. Our method achieved high coverage for CDRα1 and CDRα2, but made predictions for only 37% of the non-redundant CDRα3 sequences ( Table 3). The poor coverage of CDRα3 could be due to the paucity of the currently available data in capturing the conformations of this more sequence-diverse CDR (Figures S9-S11).
Ten TCR structures containing 44 individual chains that were unseen at the time of methodology development, were used as a blind test set for the predictor. All canonical CDRs had 100% prediction coverage. Apart from CDRα3, all CDR types were predicted with 100% accuracy; in other words, at least one member of the predicted canonical class had backbone root-mean square deviation (RMSD) ≤1.0Å to the native structure. CDRα3 had one false prediction where the loop GTERSGGYQKVT was not assigned to any clusters even though the backbone RMSD falls within 1.0Å to a member of the α3-9-A canonical class.

Comparison Between TCR and Antibody CDRs
Inspired by the shared architecture and genetic generation mechanism of antibodies and TCRs, we examined whether their CDR length distributions, structures/canonical forms, and sequence patterns that give rise to their canonical forms overlap. For each of the TCR CDRs, we compared all its structures with the corresponding antibody CDR's structures (e.g., CDRα1 with CDRL1). We considered the CDR sequence length distributions and their structural clustering with DBSCAN, using the clustering thresholds that have been previously used for antibody CDR clustering (20). Changing these thresholds does not qualitatively affect the results described below. A full list of comparisons is described in Table 1. For all pairs of TCR/antibody CDRs, we found that most clusters contained only TCR CDRs or antibody CDRs (e.g., CDRα1 only). In other words, the CDRs from TCRs and from antibodies tended to occupy distinct areas of structural space. Where we observed an overlap in the structural space, we inspected whether the sequence motifs for the canonical classes differ between the two types of receptors.

CDRα1/CDRL1
The CDRα1 loops in our set were five-to seven-residues long, while CDRL1 loops spanned from three-to twelve-residues ( Figure S12). The most common length was six for both types of loops: over 70% of the CDRα1 loops and 45% of the CDRL1 loops. Thirteen structural clusters of CDRα1/CDRL1 were identified (Figure S13), of these only α1,L1-5-A contained both CDRα1 loops (6 unique sequences) and CDRL1s (18 unique sequences). The sequence logos formed by the sequence-unique CDRα1 and CDRL1 loops in α1,L1-5-A had different sequence patterns ( Figure S14). The general physicochemical properties were similar, but CDRL1 had a preference for Valine and Tyrosine on the third and fifth positions while CDRα1 showed ambiguity at these two positions. Principal component analysis (PCA) on one-hot-encoded unique sequences displayed a separation between CDR loops from the two types of receptors (see Figure S14), except for one TCR α1 sequence (DSVNN) that was considered to have a more similar sequence pattern to antibody L1 loops. This sequence is a TCR α1 sequence that adopts multiple conformations as described above.

CDRα2/CDRL2
CDRα2 loops tend to be longer than CDRL2, with nearly all of the CDRL2 loops being three-residues long and CDRα2 having a range of lengths from five to eight (Figure S15). Given this, there is little chance of structural similarity between the two types of loops. Structural clustering confirmed that all classes only contained one CDR type, with two CDRα2 clusters, and one CDRL2 cluster (Figure S16).

CDRα3/CDRL3
The majority of the CDRL3 loops in our set were nine-residues long (2073 of the 2872; 72.2%), whilst CDRα3 loops had nine to fifteen residues (Figure 3). Among the eleven clusters identified, we found a cluster, α3,L3-10-A, which included both CDR types, with 17 and 12 unique sequences of CDRL3 and CDRα3 respectively (Figure 4). A PCA of the sequences in this cluster (α3,L3-10-A) separated the TCRα3 and antibody L3 members with one exception (Figure 5). The α3 sequence GTYNQGGKLI clustered with the L3 group. This sequence was one of the eight (out of a total of 99) unique α3 sequences we found to have multiple conformations. Our dataset contained three structures of this sequence; the other two were unclustered (3vxu:I and 3vxu:D).

CDRβ1/CDRH1
In our dataset, CDRβ1 loops tended to be shorter than their CDRH1 counterpart ( Figure S17). Over 80% of the CDRβ1 were five-residues long, compared to eight-residues long loops dominating for CDRH1 (>80%). Nine classes were formed by the structural clustering: seven CDRH1 clusters and two CDRβ1 clusters ( Figure S18).

CDRβ2/CDRH2
More than 90% of the CDRβ2 loops in our set were six-residues long, while only 0.2% of the CDRH2 had six residues and the rest were in the range of seven to ten residues ( Figure S19). None of the seven clusters that were formed contained members from both types of CDRs ( Figure S20).

β-Chain CDRs in TCRs and Heavy-Chain CDRs in Antibodies and Nanobodies
In our clustering, CDRs from nanobodies were also included. Unlike the CDRs of TCRs, these typically did fall into the antibody H1 and H2 clusters (Figures S18, S20). In the case of CDRβ1/CDRH1, 72.2% (156 out of 216) of the nanobody CDRH1 loops were clustered with antibody CDRH1s in the β1,H1-8-A class. The remaining nanobody CDRH1 loops formed the majority of the other two length-eight clusters (all 46 in β1,H1-8-B and 14 out of 21 in β1,H1-8-C respectively). Nanobody CDRH2 loops clustered with the length-seven and eight antibody CDRH2 loops. These results suggest that  nanobody CDR loops are structurally more similar to antibodies than to TCRs.

CDRβ3/CDRH3
There are no canonical classes for CDRH3 or CDRβ3 but in the case of CDRH3, previous studies have found structural conservation in the start and end of the loop, known as the "torso" or "base" region (30,31). We therefore inspected the base structure of the CDRβ3/CDRH3 loops. Following the study by Weitzner et al. (30), we carried out the loop anchor transform (LAT) analysis that was used to capture the characteristic base structure in CDRH3. The LAT analysis showed that CDRβ3 loops have a similar width of distribution in all six degrees of freedom as seen for CDRH3, but with a slight shift in the peak (Figure S21). This presented the possibility that CDRβ3 and CDRH3 could share similar base structures. In the same study, the pseudo bond angle (τ ) and pseudo dihedral angle (α) of the penultimate residue of the CDRH3 structure (IMGT position 116) were used to differentiate between extended and kinked torsos. We found that very few CDRβ3 loops had their  Figure S22).

CDR Structural Variability in TCR and Antibodies
As noted above, some TCR CDRs with identical sequences fell into different structural clusters. Considering the joint clustering of TCR and antibody CDRs, we found that of the 212 unique TCR CDR sequences with multiple example structures, 41 (19.3%) fell into more than one structural cluster (Table S3). This happened far less for antibody CDRs where only 7.95% (166 of 2089) of the unique antibody CDR sequences with multiple example structures were found in more than one structural cluster. Whether the structure was crystallized in complex with the cognate molecule did not appear to cause the conformational difference as bound and unbound structures were observed in the different clusters (Tables S4, S5).
Following the analysis in Marks et al. (45), we compared the maximum backbone RMSD of structures from an identical sequence ( Figure S23). If we consider structures with a difference of <1Å in backbone RMSD as similar (184 of the 284 TCRs and 2195 of the 2717 antibodies), this analysis illustrates that the majority of the sequence-identical structures are structurally similar in both TCR and antibody CDRs, but that TCR loops show a more skewed distribution, tending toward higher structural variability.

DISCUSSION
We have identified and refined the canonical class definitions of TCR CDRs using a more up-to-date structural dataset, and used these to generate an auto-updating database and prediction server.
In our dataset of 270 TCR structures, we find seven CDRα1, seven CDRα2, five CDRα3, one CDRβ1, and two CDRβ2 canonical classes. In addition, we report several "pseudoclasses", in which there are multiple examples of the structural conformation, but only one unique sequence. One of the major advantages of a canonical class model is the rapid mapping between the sequence and structural space. To demonstrate this for TCRs, we applied an adapted PSSM-based methodology (40) to assign the canonical classes for an NGS dataset of ∼1.5 million mouse TCRα sequences in a few minutes.
The commonalities between TCR and antibody in their folds and the genetic mechanisms that generate them prompted us to explore the similarities and differences between TCR and antibody CDR structures. We performed a length-independent structural clustering of TCR and antibody CDRs and found that they almost always separate into different clusters. This separation was partly driven by the differences in the length distributions between TCR and antibody CDRs. In the cases where we found structural clusters which contained both types of loops, we found that the underlying sequence motifs were distinct.
We also compared the CDRβ3 and CDRH3 loops in terms of their torso structures -the starting and the ending residues of the CDRβ3/CDRH3 loops, and found once again TCRs and antibodies gravitated toward different structures. CDRH3 loops in antibodies tend to have a kinked torso while CDRβ3 are only found with the extended torsos that is less common in CDRH3.
Driven by the multiple observations where sequence-identical TCR CDR structures were found in different clusters, we also assessed the structural variability of TCR and antibody CDRs. Our results agree with previous findings about the flexibility of TCR CDRs (46,47), and further suggest that TCR loops are more flexible than antibody CDRs. Structural rigidification has been proposed as an affinity maturation mechanism in antibodies (48), while flexibility in the binding site has been suggested as one of the features enabling the promiscuous binding of TCR-MHC (49). The fact that nearly 20% of the TCR CDRs that could show different conformations did so suggests that the canonical class model will struggle to accurately predict TCR CDR conformations. TCR modeling tools may provide more details through homology or de novo loop predictions (25,50,51). Overall, our results suggest that there are structural differences between TCR and antibody CDRs, and the differences we observe potentially help explain how these two receptors bind to their separate target antigens.
Many groups have attempted to augment TCR and antibody designs by swapping binding sites between these two receptors [e.g., (37,52)]. TCR-mimic antibodies were developed in the hope of transferring the ability of TCRs to target intracellular proteins, to antibodies for the use in cancer therapy. Antibody-like TCR design transfers the highly specific antibody binding sites to TCRs. It was shown to enhance the specificity and affinity in TCR antigen recognition in vitro and in vivo (37). These cases present the possibility of altering the behavior of the receptors such as targeting peptides in the context of an MHC molecule by grafting TCR binding sites, or enhancing specificity when the antibody components are added. Our results should aid the development and design of these types of therapeutic immune proteins.

Nomenclature
We use the following nomenclature for structures: four characters of the PDB code, followed by a colon, then the chain identifier of the structure, e.g., 5hhm:E. Clusters are identified by two letters describing the CDR type, the loop length(s), then a letter (20). For instance, the α1-6-B class refers to the length-6 CDRα1 class with the second-largest number of unique sequences. Pseudo-clusters with only one unique sequence have their names appended with an asterisk (*). For the clustering of both TCR and antibody CDRs, we name the clusters by the two letters representing the TCR CDR type, two letters for the antibody CDR type, the loop length(s), followed by a letter, e.g., α1,L1-11,12-A denotes the largest cluster from the CDRα1/CDRL1 clustering, in which sequences are 11 or 12 residues long.

Structural Datasets
TCR structures with at least one α or β chain and resolution ≤2.8Å were downloaded from STCRDab on 31 May 2018 (41). Antibody structures with resolution ≤2.8Å were downloaded from SAbDab on 31 May 2018 (54). Structures from all species were retained. In total, 270 TCR PDB entries and 2,563 antibody PDB entries were used. We retained all chains in a PDB entry that passed the quality criteria. The structures of the CDR loops were extracted using a similar procedure to Nowak et al. (20). The IMGT-defined CDRs, along with five N-terminal and five Cterminal anchor residues, were selected if there were no missing backbone atoms and none of the backbone atoms had B-factors >80. Loops also needed to be continuous, i.e., all peptide bond lengths were <1.37Å. Since loops with identical sequences can adopt multiple conformations (see Results and Figure

Clustering Method
Loops are clustered using the length-independent clustering method from Nowak et al. (20). First, the algorithm superimposes the backbone atoms of all ten anchor residues. Structural similarity is then calculated using the dynamic time warp (DTW) algorithm (55); the resulting DTW score is effectively a lengthindependent RMSD value (20). The DTW scores are then clustered using the DBSCAN method (56).
In the clustering of TCR CDRs alone, we only consider a set of sequences to be a CDR cluster only if it contains a minimum of five structures and two unique sequences. If a set contains five or more structures but only one unique sequence, we label this a "pseudo-class." All other loops are considered to be "unclustered." This is a more lenient threshold than previous investigations clustering antibody CDR loops [e.g., Nowak et al. (20)], as the dataset of TCR structures is far smaller. To choose the optimal clustering parameter, DBSCAN was run over a range of DTW score thresholds in increments of 0.1Å. We find that 1.0Å offered an optimal balance between the number and size of clusters for the five TCR CDR loops.
For the clustering of multiple CDR types, we use the same clustering thresholds and criteria for the respective CDR types as the antibody study (20), where they were selected using the Ordering Points to Identify the Clustering Structure [OPTICS; (57)] algorithm. CDRα1/CDRL1 are clustered at 0.82Å , CDRα2/CDRL2 at 1Å (not initialized in the previous paper), CDRα3/CDRL3 at 0.91Å , CDRβ1/CDRH1 at 0.8Å and CDRβ2/CDRH2 at 0.63Å. A valid cluster must contain at least six unique sequences as illustrated in Nowak et al. (20).

Comparison With TCR Canonical Classes in Earlier Work
In order to compare our CDR clusters to previous studies (24,25), we identified overlaps in the representative PDB entries. For example, Al-Lazikani et al.'s α2-3 class contains the PDB entry 1tcr. Since 1tcr:A was in our α2-8-B class, we considered these two classes (α2-3 and α2-8-B) to be analogous. A similar procedure was used to map our classes to Klausen et al.'s classes (25).
Some canonical classes from Al-Lazikani et al. or Klausen et al. were represented by structures that were filtered out of our dataset. To match these classes, we searched for a CDR structure in our set that has the same CDR sequence and checked if there was a backbone match. For example, the centroid structure of Klausen et al.'s α3-7 class is PDB 3tf7, which has the sequence AVSAKGTGSKLS. We found that 3tfk:C has the same CDRα3 sequence as 3tf7 and a backbone RMSD of 0.81Å ; thus, we assigned their α3-7 to our α3-12-A class.

Sequence-Based Prediction of Canonical Forms
For each CDR canonical class, we generated a position-specific scoring matrix (PSSM). The score of an amino acid a in IMGT position i, s(a, i), is: where p a,i represents the probability of a at i, and this is calculated separately for each class. The background probability of a, b a , was assumed to be identical for all residues i.e., 0.05. To predict the cluster for a target loop with length l, we first select PSSMs containing loops with the same length. For example, if the new CDRα1 loop is six residues long, we choose PSSMs of the α1-6-A, α1-6-B, α1-6-C and α1-6-D classes. The PSSM score for the target loop for class c, P c , is the sum of the position-specific scores: If a is never observed at i, we assume s(a, i) = −1. For canonical class assignment, we designated a target loop to the class with the highest value of P c . Furthermore P c must be higher than 1, except for CDRα3where P c must be equal to or greater than the loop's length. The value of P c was chosen by performing leave-one-out cross-validation tests over several values of P c . Sequences were assigned to a pseudo-class if and only if they had an identical sequence.
To benchmark the scoring strategy, we ran a leave-oneout cross-validation protocol on the unique sequences from canonical classes and unclustered structures. A prediction was evaluated using the following criteria: • True positive: sequence is assigned to the correct canonical class.
• False positive: sequence is assigned to a different canonical class.
• True negative: sequence is from an unclustered loop, and not assigned a canonical class.
• False negative: sequence is from a canonical class, but predicted to be unclustered.

Prediction on Next-Generation Sequencing Dataset
We predicted the canonical forms for the α-chain in a set of mouse TCR sequences from BioProject accession PRJNA362309 (44). Overlapping Illumina reads were assembled using FLASh (58), and TCR amino acid sequences were extracted using IgBLAST (59). The sequences were then numbered by ANARCI (53); only those with productive CDR3 rearrangement, CDR1 and CDR2 loops at least five residues long, and CDR3 loops at least eight residues long were retained.

Prediction of New TCR Structures
We used the 44 αβTCR structures that were released between 31st May 2018 and 5th June 2019 as a blind test set. Unlike our structural dataset for clustering, we did not impose a quality restriction for CDR prediction. Predictions were considered to be correct if the backbone RMSD between the native CDR structure and any member of the assigned canonical class was ≤1.0Å.

Comparison Between TCR and Antibody CDR Structures
We clustered TCR and antibody CDR structures and examined the length distribution, structural clustering and sequence patterns. Sequence lengths of CDR loops disregard the five anchor residues at each side of the CDR structures. Structural clustering comparison were done as described above. We used WebLogo to generate all sequence patterns (43). Amino acids were colored with the default hydrophobicity scale: hydrophilic residues (R, K, D, E, N and Q) were in blue, neutral residues (S, G, H, T, A, and P) in green and hydrophobic residues (Y, V, M, C, L, F, I, and W) in black.
In canonical forms where both TCR and antibody CDRs were found, we examined the difference between the sequence patterns used by TCRs and antibodies. We transformed the sequences using one hot encoding by position, where each feature was represented by the position and the residue name, and applied Principal Component Analysis (PCA) using the scikit-learn module in Python.

Analysis of CDRβ3 and CDRH3 Structures
We analyzed CDRβ3 structures with a set of metrics that have been previously applied to the CDRH3 loop in antibodies: loop anchor transform (LAT), pseudo bond angles, and dihedral angles (30,31).
LAT is the Euler transformation of the coordinate planes formed by residues at IMGT positions of 105 and 117 [see Supplementary Information of Weitzner et al. (30) for a detailed mathematical definition]. Briefly, a coordinate system is defined for each of the two residues centered on the C α atoms, where the z-axis points toward the carbonyl carbon, the y-axis is perpendicular to z in the N-C α -C plane, and the x-axis is the cross product of these two components. The Euler transformation is then represented by six degrees of freedom, capturing the translation (X, Y, Z) and rotation (φ, ψ, θ ).
We calculated the pseudo bond angle (τ ) and pseudo dihedral angle (α) of the residue at IMGT position 116 (corresponding to Chothia position 101), in the CDRβ3 and CDRH3 sets. Consistent with Weitzner et al. (30), the pseudo bond angle is formed by the C α atoms of the residues before, at and after the IMGT position 116, whereas the pseudo dihedral angle spans across the C α atoms of residues at position 116, one before and two after.
Finn et al. (31) observed that the dihedral angles adopted by the base residues of extended torsos were different from that of kinked torsos. To capture the observation made by Finn et al. (31), the dihedral angles of the first three (T1-T3) and last four (T4-T7) residues of the CDRβ3 and CDRH3 loops are obtained from the Biopython module (60).

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the manuscript/Supplementary Files.