Original Research ARTICLE
Comparative Analysis of the CDR Loops of Antigen Receptors
- Department of Statistics, University of Oxford, Oxford, United Kingdom
The adaptive immune system uses two main types of antigen receptors: T-cell receptors (TCRs) and antibodies. While both proteins share a globally similar β-sandwich architecture, TCRs are specialized to recognize peptide antigens in the binding groove of the major histocompatibility complex, while antibodies can bind an almost infinite range of molecules. For both proteins, the main determinants of target recognition are the complementarity-determining region (CDR) loops. Five of the six CDRs adopt a limited number of backbone conformations, known as the “canonical classes”; the remaining CDR (β3in TCRs and H3 in antibodies) is more structurally diverse. In this paper, we first update the definition of canonical forms in TCRs, build an auto-updating sequence-based prediction tool (available at http://opig.stats.ox.ac.uk/resources) and demonstrate its application on large scale sequencing studies. Given the global similarity of TCRs and antibodies, we then examine the structural similarity of their CDRs. We find that TCR and antibody CDRs tend to have different length distributions, and where they have similar lengths, they mostly occupy distinct structural spaces. In the rare cases where we found structural similarity, the underlying sequence patterns for the TCR and antibody version are different. Finally, where multiple structures have been solved for the same CDR sequence, the structural variability in TCR loops is higher than that in antibodies, suggesting TCR CDRs are more flexible. These structural differences between TCR and antibody CDRs may be important to their different biological functions.
The adaptive immune system defends the host organism against a wide range of foreign molecules, or antigens, using two types of receptors: T-cell receptors (TCRs) and antibodies (1). TCRs typically recognize peptide antigens presented via the major histocompatibility complex [MHC; (2)], while antibodies can bind almost any antigen, including proteins, peptides, and haptens (3). Despite their different roles in the immune response, these proteins share a β-sandwich fold (Figure 1) (4, 5).
Figure 1. TCRs and antibodies share a globally similar structure. Both proteins are heterodimers, characterized by a set of six CDRs that form the majority of the binding site. Comparable chains and CDRs share coloring schemes; for example, the TCRβ and antibody heavy chain are colored gray, while CDRα1, and CDRL1 are colored purple.
In humans, most TCRs are αβTCRs, consisting of one TCRα chain and one TCRβ chain, while most antibodies are comprised of two heavy(H)-light(L) chain dimers (Figure 1). All four types of chains (α, β, H, and L) are formed from the somatic rearrangement of the respective V, D, and J genes of the TCR or antibody loci. The random combination of these genes, alongside further diversification mechanisms (e.g., random nucleotide addition), are estimated to yield trillions of unique TCRs and antibodies (6, 7). TCRα and antibody light chains are made from the V and J genes, while TCRβ and antibody heavy chains are assembled from the V, D, and J genes, making the L-chain equivalent to α-chain and H-chain equivalent to β-chain (6, 8). In both types of antigen receptors, sequence, and structural diversity is concentrated in six hypervariable loops, known as the complementarity determining regions (CDRs). There are three in the TCRα chain (CDRα1–CDRα3) and three in the TCRβ chain (CDRβ1–CDRβ3). Likewise, the light chain and heavy chain of antibodies have three CDRs each (CDRL1–CDRL3, CDRH1–CDRH3). In TCRs, CDR1, and CDR2 typically contact the MHC's conserved α-helices (9, 10), while the CDR3 almost always contacts the peptide antigen (11, 12). All six antibody CDRs can be involved in antigen recognition (3, 13), though the CDRH3 loop is often the most important (14, 15). The structural complementarity between the binding sites of the antigen receptor and their cognate antigen governs the binding interactions. As the CDRs form the majority of the binding site, their conformations are critical to the binding.
The “canonical class” model was first proposed for antibodies in 1987 (16). It is based on the observation that CDRs adopt a limited number of backbone conformations. The definition of canonical classes had been revisited multiple times as more structures become available [e.g., (16–20)]. Sequence features, such as the presence of specific amino acids within or near the CDR loop, may be used to predict the canonical forms [e.g., (17, 18, 20)]. Canonical forms have been used for in silico antibody design (21, 22), and predicting the structures of CDR sequences from next-generation sequencing (NGS) datasets (20, 23). Despite the value of canonical classes to antibody design and development, only two studies have so far applied the concept to TCR CDRs (24, 25).
The first clustering of TCR CDR loops was carried out using only seven TCR structures (24). At that time, four canonical classes were identified for CDRα1, four for CDRα2, three for CDRβ1, and three classes for CDRβ2; neither CDR3 loop was clustered. More recently, Klausen et al. clustered the CDRs from a non-redundant set of 105 αβTCRand 11 unpaired TCR structures. They performed the clustering in torsion space using an affinity propagation algorithm. In total, 38 canonical forms were characterized. These clusters were then used to construct a sequence-based, random forest classifier, with canonical form prediction accuracies between 63.21 and 98.25% (25).
CDRβ3in TCRs and CDRH3 in antibodies show higher variability in sequence composition and structure than the other CDRs (1). While no canonical forms have been defined for CDRH3, several groups have analyzed the kinked and extended (or bulged and non-bulged) conformations at the start and end of the loop, known as the “base” or “torso” region (18, 26–31). Weitzner et al. (30) showed that pseudo bond angle τ and pseudo dihedral angle α of the second last residue of the CDRH3 loop [Chothia (32) position 101, IMGT (33) position 116] can differentiate between the extended and kinked torso conformations. Finn et al. (31) analyzed the first three and last four residues of CDRH3 loops and observed that for the same IMGT position 116, the ϕ/ψ angles are different in kinked and extended torsos. In this paper, we carry out the first examination of the conformation of the base region of CDRβ3loops.
Although TCRs and antibodies are derived from similar genetic mechanisms and share a similar architecture, only a handful of studies have compared them [e.g., (5, 10, 33–36)]. Furthermore, analyses have largely focused on sequence-based features. For instance, Rock et al. found that the CDRα3and CDRβ3loops have a different length distribution to CDRL3 and CDRH3 (34), while Blevins et al. observed that the TCR CDR1 and CDR2 sequences have more charged amino acids than the analogous antibody CDR1 and CDR2 (10). Given the similarity in the genetic mechanisms, fold and the limited conformational variability in the canonical CDRs, it might be reasonable to expect structural similarity between TCRs and antibodies. Comparing TCRs and antibodies should identify potential characteristics that may inspire antibody-like TCR design, TCR-mimic antibodies and soluble TCRs (5, 37–39). Such analyses can also highlight structural signatures that may relate to their different biological functions, such as MHC restriction in TCRs (6) and the virtually unconstrained antigen binding in antibodies (3).
In this manuscript, we first used a length-independent clustering method to update the canonical classes of TCR CDRs (20). We then built a sequence-based TCR CDR canonical form prediction algorithm based on an adapted position-specific scoring matrix (40). Next, we attempted to enrich the TCR CDR dataset with antibody CDR structures. We found that TCR and antibody CDRs occupy distinct areas of structural space. In the small number of common conformational clusters, the underlying sequence patterns differ. This structural distinction may be a key differentiator between the functions of these two classes of immune proteins.
The IMGT-defined CDR loops (33) were extracted from each chain of the 270 high-quality TCR structures in STCRDab [STCRDab set; (41)]. This is more than double the numbers used in previous studies of TCR canonical forms, Al-Lazikani et al. (seven structures) and Klausen et al. (116 structures) (24, 25). Redundant sequences were retained as the conformations may differ as shown in the forthcoming sections. Summary statistics of the sequence-redundant STCRDab set for each of the CDR types are listed in Table 1.
2.1. Updating the Canonical Classes of TCR CDRs
We clustered each CDR type (e.g., CDRα1, CDRα2, etc.) using the DBSCAN method (20), with a minimum cluster size of five and a clustering threshold of 1.0Å. For clusters with more than two unique sequences, we designated them as “canonical classes”; otherwise, they were considered “pseudo-classes” (Table 1; see Materials and Methods for full details).
In total, we identified seven α1, seven α2, five α3, one β1and two β2canonical classes (see Table 1). The representative structures and sequence patterns of our TCR CDR canonical classes are shown in Figure 2 and Figures S1–S4. We found that some canonical forms have highly conserved positions in their encoding sequences, which might govern the conformations. However, we also observed many cases where structures of the same sequence fell into different structural clusters. For instance, the α1loop with the sequence DSVNN belonged to α1-5-A when unbound, to pseudo-class α1-5-B* when bound to an MHC with a long peptide, and was unclustered when the binding peptide is short (Figure S5).
Figure 2. CDRα3canonical classes. At a 1.0Åclustering threshold, our DBSCAN method identified five canonical classes. Each class has at least five structures and two unique sequences. For every CDRα3class, the centroid structure is illustrated, with anchors in white, and the CDRα3region (IMGT 105–117) colored. The Protein Data Bank [PDB; (42)] four-letter code and the chain identifier of the centroid structure is shown in the bracket next to the cluster name. The sequence pattern below each centroid structure is generated by WebLogo (43), using the unique sequences of the cluster: α3-10-A has 11, α3-11-A has 6, α3-12-A has 2, α3-13-A has 3 and α3-13-B has 2. Hydrophilic residues were in blue, neutral residues in green and hydrophobic residues in black (see Materials and Methods).
Inter-species difference was not observed among canonical classes. Any antigenic species and TCR species may be found in any canonical classes (Figures S6, S7). TCRs that bind to MHC1-relate proteins appear to adopt only a particular set of CDR canonical forms (Figure S8).
We compared our canonical forms to those from Al-Lazikani et al. and Klausen et al. (see Tables S1, S2). We matched canonical classes if their representative structure was found in our canonical class. Some canonical classes were represented by structures that were filtered out of our dataset for quality reasons. In these cases, a sequence-identical CDR with a comparable backbone conformation was used as a proxy to match canonical classes (see Materials and Methods).
2.1.1. Comparison to Previous Canonical Forms
For CDRα1, our DBSCAN method broadly agreed with Al-Lazikani et al. and Klausen et al. (24, 25), apart from α1-3 in Al-Lazikani et al.'s classes, for which we found no corresponding cluster. Klausen et al.'s α1-5 cluster was matched to pseudo-class α1-6-F, meaning that for this cluster, we found more than five structures, but only a single unique sequence.
All seven of our CDRα2classes and our one CDRα2pseudo-class were matched to previously observed CDRα2canonical classes (see Table S1 for the full list of comparisons). All five of our CDRα3canonical forms and two of the pseudo-classes mapped to ones from Klausen et al. In addition, we found nine further pseudo-classes that were not identified in their study. The cluster representative of the Klausen et al.'s α3-1 canonical class was filtered out of our dataset as it comes from a structure with a resolution >2.9Å. Since its sequence was also absent in the rest of our dataset, we were unable to map this class to our canonical forms. The cluster representatives of Klausen et al.'s α3-2, α3-4, α3-11, and α3-12 were unclustered in our analysis.
Our single CDRβ1class and our one CDRβ1pseudo-class were matched to previous clusterings. Klausen et al.'s β1-3 and β1-4 forms were not in clusters in our work (Table S2). For CDRβ2, we were unable to find a match for Al-Lazikani et al.'s β2-2 class, nor Klausen et al.'s β2-1 and β2-7 classes. However, our two CDRβ2classes and one CDRβ2pseudo-class were matched, with our β2-6-B merging three of Klausen et al.'s clusters (Table S2).
Both previous clusterings were based on backbone dihedral angles of the CDR loops, whereas in our work, we clustered using backbone distances. Despite these different approaches, there was a large degree of overlap between our canonical forms and those found previously. We have also identified a small number of new canonical classes from our larger dataset. As was shown for antibody CDR canonical forms (40), the growth of structural data continuously modifies our understanding of CDR loop structures. It is therefore necessary to continuously and preferably automatically update the definition of canonical forms as more structural information becomes available.
2.2. Prediction of CDRs From Sequence
As our TCR canonical classes showed conserved sequence patterns (Figure 2 and Figures S1–S4), we built a sequence-based, length-independent position-specific scoring matrix (PSSM), that can be used to predict TCR CDR canonical classes (40). The performance of the predictor was evaluated by a leave-one-out cross-validation protocol on the unique sequences in STCRDab set (Table 2). Accuracy ranged from 73.2 to 100%, which is comparable to previous results (25).
To assess the coverage of our method on large sets of sequencing data, we used our PSSMs to predict the CDR canonical classes of an NGS dataset of mouse TCRα sequences (44). The entire dataset contained 1,563,876 sequences (1,498,254 CDRα1, 1,563,876 CDRα2, 1,267,235 CDRα3); on a single 3.4 GHz core, the prediction took three minutes. Our method achieved high coverage for CDRα1and CDRα2, but made predictions for only 37% of the non-redundant CDRα3sequences (Table 3). The poor coverage of CDRα3could be due to the paucity of the currently available data in capturing the conformations of this more sequence-diverse CDR (Figures S9–S11).
Ten TCR structures containing 44 individual chains that were unseen at the time of methodology development, were used as a blind test set for the predictor. All canonical CDRs had 100% prediction coverage. Apart from CDRα3, all CDR types were predicted with 100% accuracy; in other words, at least one member of the predicted canonical class had backbone root-mean square deviation (RMSD) ≤ 1.0Åto the native structure. CDRα3had one false prediction where the loop GTERSGGYQKVT was not assigned to any clusters even though the backbone RMSD falls within 1.0Åto a member of the α3-9-A canonical class.
2.3. Comparison Between TCR and Antibody CDRs
Inspired by the shared architecture and genetic generation mechanism of antibodies and TCRs, we examined whether their CDR length distributions, structures/canonical forms, and sequence patterns that give rise to their canonical forms overlap. For each of the TCR CDRs, we compared all its structures with the corresponding antibody CDR's structures (e.g., CDRα1with CDRL1). We considered the CDR sequence length distributions and their structural clustering with DBSCAN, using the clustering thresholds that have been previously used for antibody CDR clustering (20). Changing these thresholds does not qualitatively affect the results described below. A full list of comparisons is described in Table 1. For all pairs of TCR/antibody CDRs, we found that most clusters contained only TCR CDRs or antibody CDRs (e.g., CDRα1only). In other words, the CDRs from TCRs and from antibodies tended to occupy distinct areas of structural space. Where we observed an overlap in the structural space, we inspected whether the sequence motifs for the canonical classes differ between the two types of receptors.
The CDRα1loops in our set were five- to seven-residues long, while CDRL1 loops spanned from three- to twelve-residues (Figure S12). The most common length was six for both types of loops: over 70% of the CDRα1loops and 45% of the CDRL1 loops. Thirteen structural clusters of CDRα1/CDRL1 were identified (Figure S13), of these only α1,L1-5-A contained both CDRα1loops (6 unique sequences) and CDRL1s (18 unique sequences). The sequence logos formed by the sequence-unique CDRα1and CDRL1 loops in α1,L1-5-A had different sequence patterns (Figure S14). The general physicochemical properties were similar, but CDRL1 had a preference for Valine and Tyrosine on the third and fifth positions while CDRα1showed ambiguity at these two positions. Principal component analysis (PCA) on one-hot-encoded unique sequences displayed a separation between CDR loops from the two types of receptors (see Figure S14), except for one TCR α1sequence (DSVNN) that was considered to have a more similar sequence pattern to antibody L1 loops. This sequence is a TCR α1sequence that adopts multiple conformations as described above.
CDRα2loops tend to be longer than CDRL2, with nearly all of the CDRL2 loops being three-residues long and CDRα2having a range of lengths from five to eight (Figure S15). Given this, there is little chance of structural similarity between the two types of loops. Structural clustering confirmed that all classes only contained one CDR type, with two CDRα2clusters, and one CDRL2 cluster (Figure S16).
The majority of the CDRL3 loops in our set were nine-residues long (2073 of the 2872; 72.2%), whilst CDRα3loops had nine to fifteen residues (Figure 3). Among the eleven clusters identified, we found a cluster, α3,L3-10-A, which included both CDR types, with 17 and 12 unique sequences of CDRL3 and CDRα3respectively (Figure 4). A PCA of the sequences in this cluster (α3,L3-10-A) separated the TCRα3and antibody L3 members with one exception (Figure 5). The α3sequence GTYNQGGKLI clustered with the L3 group. This sequence was one of the eight (out of a total of 99) unique α3sequences we found to have multiple conformations. Our dataset contained three structures of this sequence; the other two were unclustered (3vxu:I and 3vxu:D).
Figure 4. All CDRα3and CDRL3 structures were clustered using the DBSCAN method. Orange bars indicate the number of CDRL3 structures, and blue bars the number of CDRα3structures. All classes, apart from α3,L3-10-A, have structures from only one CDR type, i.e. CDRα3or CDRL3.
Figure 5. The unique TCR and antibody sequences in the α3,L3-10-A class. (A) Sequence logos of CDRL3 and CDRα3loops in the α3,L3-10-A class, using only the unique sequences. The sequence patterns appear to be distinct between the two classes. (B) Principal component analysis (PCA) plot of the first two components in one-hot-encoded unique sequences in α3,L3-10-A, stratified by TCR and antibody CDRs (see Materials and Methods). TCR α3and antibody L3 are separated by the first principal component, only one TCR sequence (GTYNQGGKLI) is close to the antibody set.
In our dataset, CDRβ1loops tended to be shorter than their CDRH1 counterpart (Figure S17). Over 80% of the CDRβ1were five-residues long, compared to eight-residues long loops dominating for CDRH1 (>80%). Nine classes were formed by the structural clustering: seven CDRH1 clusters and two CDRβ1clusters (Figure S18).
More than 90% of the CDRβ2loops in our set were six-residues long, while only 0.2% of the CDRH2 had six residues and the rest were in the range of seven to ten residues (Figure S19). None of the seven clusters that were formed contained members from both types of CDRs (Figure S20).
2.3.6. β-Chain CDRs in TCRs and Heavy-Chain CDRs in Antibodies and Nanobodies
In our clustering, CDRs from nanobodies were also included. Unlike the CDRs of TCRs, these typically did fall into the antibody H1 and H2 clusters (Figures S18, S20). In the case of CDRβ1/CDRH1, 72.2% (156 out of 216) of the nanobody CDRH1 loops were clustered with antibody CDRH1s in the β1,H1-8-A class. The remaining nanobody CDRH1 loops formed the majority of the other two length-eight clusters (all 46 in β1,H1-8-B and 14 out of 21 in β1,H1-8-C respectively). Nanobody CDRH2 loops clustered with the length-seven and eight antibody CDRH2 loops. These results suggest that nanobody CDR loops are structurally more similar to antibodies than to TCRs.
There are no canonical classes for CDRH3 or CDRβ3but in the case of CDRH3, previous studies have found structural conservation in the start and end of the loop, known as the “torso” or “base” region (30, 31). We therefore inspected the base structure of the CDRβ3/CDRH3 loops. Following the study by Weitzner et al. (30), we carried out the loop anchor transform (LAT) analysis that was used to capture the characteristic base structure in CDRH3. The LAT analysis showed that CDRβ3loops have a similar width of distribution in all six degrees of freedom as seen for CDRH3, but with a slight shift in the peak (Figure S21). This presented the possibility that CDRβ3and CDRH3 could share similar base structures. In the same study, the pseudo bond angle (τ) and pseudo dihedral angle (α) of the penultimate residue of the CDRH3 structure (IMGT position 116) were used to differentiate between extended and kinked torsos. We found that very few CDRβ3loops had their τ116 and α116 in the space of kinked torsos (Figure 6). Instead 317 out of the 325 (97.5%) CDRβ3loops had an extended base. This behavior is the opposite of CDRH3 loops. The eight outlying CDRβ3structures with positive α116 have either an aromatic side chain at IMGT position 116 that restricts the shape of the base, or an abrupt bend to accommodate unusual binding peptides. Consistent observations were made when we analyzed the ϕ/ψ plots for the torso positions as outlined in Finn et al. (31) (see Figure S22).
Figure 6. Pseudo bond angle (τ) and pseudo dihedral angle (α) analyses on IMGT (33) position 116. Scatters represent individual observations of β3 (blue) and H3 (orange) structures. Regions occupied by kinked and extended CDRH3 loops according to Weitzner et al. (30) are indicated by arrows.
2.3.8. CDR Structural Variability in TCR and Antibodies
As noted above, some TCR CDRs with identical sequences fell into different structural clusters. Considering the joint clustering of TCR and antibody CDRs, we found that of the 212 unique TCR CDR sequences with multiple example structures, 41 (19.3%) fell into more than one structural cluster (Table S3). This happened far less for antibody CDRs where only 7.95% (166 of 2089) of the unique antibody CDR sequences with multiple example structures were found in more than one structural cluster. Whether the structure was crystallized in complex with the cognate molecule did not appear to cause the conformational difference as bound and unbound structures were observed in the different clusters (Tables S4, S5).
Following the analysis in Marks et al. (45), we compared the maximum backbone RMSD of structures from an identical sequence (Figure S23). If we consider structures with a difference of <1Åin backbone RMSD as similar (184 of the 284 TCRs and 2195 of the 2717 antibodies), this analysis illustrates that the majority of the sequence-identical structures are structurally similar in both TCR and antibody CDRs, but that TCR loops show a more skewed distribution, tending toward higher structural variability.
We have identified and refined the canonical class definitions of TCR CDRs using a more up-to-date structural dataset, and used these to generate an auto-updating database and prediction server.
In our dataset of 270 TCR structures, we find seven CDRα1, seven CDRα2, five CDRα3, one CDRβ1, and two CDRβ2canonical classes. In addition, we report several “pseudo-classes”, in which there are multiple examples of the structural conformation, but only one unique sequence. One of the major advantages of a canonical class model is the rapid mapping between the sequence and structural space. To demonstrate this for TCRs, we applied an adapted PSSM-based methodology (40) to assign the canonical classes for an NGS dataset of ~1.5 million mouse TCRα sequences in a few minutes.
The commonalities between TCR and antibody in their folds and the genetic mechanisms that generate them prompted us to explore the similarities and differences between TCR and antibody CDR structures. We performed a length-independent structural clustering of TCR and antibody CDRs and found that they almost always separate into different clusters. This separation was partly driven by the differences in the length distributions between TCR and antibody CDRs. In the cases where we found structural clusters which contained both types of loops, we found that the underlying sequence motifs were distinct.
We also compared the CDRβ3and CDRH3 loops in terms of their torso structures – the starting and the ending residues of the CDRβ3/CDRH3 loops, and found once again TCRs and antibodies gravitated toward different structures. CDRH3 loops in antibodies tend to have a kinked torso while CDRβ3are only found with the extended torsos that is less common in CDRH3.
Driven by the multiple observations where sequence-identical TCR CDR structures were found in different clusters, we also assessed the structural variability of TCR and antibody CDRs. Our results agree with previous findings about the flexibility of TCR CDRs (46, 47), and further suggest that TCR loops are more flexible than antibody CDRs. Structural rigidification has been proposed as an affinity maturation mechanism in antibodies (48), while flexibility in the binding site has been suggested as one of the features enabling the promiscuous binding of TCR-MHC (49). The fact that nearly 20% of the TCR CDRs that could show different conformations did so suggests that the canonical class model will struggle to accurately predict TCR CDR conformations. TCR modeling tools may provide more details through homology or de novo loop predictions (25, 50, 51). Overall, our results suggest that there are structural differences between TCR and antibody CDRs, and the differences we observe potentially help explain how these two receptors bind to their separate target antigens.
Many groups have attempted to augment TCR and antibody designs by swapping binding sites between these two receptors [e.g., (37, 52)]. TCR-mimic antibodies were developed in the hope of transferring the ability of TCRs to target intracellular proteins, to antibodies for the use in cancer therapy. Antibody-like TCR design transfers the highly specific antibody binding sites to TCRs. It was shown to enhance the specificity and affinity in TCR antigen recognition in vitro and in vivo (37). These cases present the possibility of altering the behavior of the receptors such as targeting peptides in the context of an MHC molecule by grafting TCR binding sites, or enhancing specificity when the antibody components are added. Our results should aid the development and design of these types of therapeutic immune proteins.
4. Materials and Methods
We use the following nomenclature for structures: four characters of the PDB code, followed by a colon, then the chain identifier of the structure, e.g., 5hhm:E. Clusters are identified by two letters describing the CDR type, the loop length(s), then a letter (20). For instance, the α1-6-B class refers to the length-6 CDRα1class with the second-largest number of unique sequences. Pseudo-clusters with only one unique sequence have their names appended with an asterisk (*). For the clustering of both TCR and antibody CDRs, we name the clusters by the two letters representing the TCR CDR type, two letters for the antibody CDR type, the loop length(s), followed by a letter, e.g., α1,L1-11,12-A denotes the largest cluster from the CDRα1/CDRL1 clustering, in which sequences are 11 or 12 residues long.
TCR and antibody structures were numbered in the IMGT scheme (33) using ANARCI (53): CDR1 (27–38), CDR2 (56–65), and CDR3 (105–117). To compare the CDR loops of TCRs and antibodies, we assumed equivalence between β/heavy chains and α/light chains (41), as TCRβ and antibody heavy chains rely on VDJ recombination, while TCRα and antibody light chains are formed by VJ recombination.
4.3. Structural Datasets
TCR structures with at least one α or β chain and resolution ≤ 2.8Åwere downloaded from STCRDab on 31 May 2018 (41). Antibody structures with resolution ≤ 2.8Åwere downloaded from SAbDab on 31 May 2018 (54). Structures from all species were retained. In total, 270 TCR PDB entries and 2,563 antibody PDB entries were used. We retained all chains in a PDB entry that passed the quality criteria. The structures of the CDR loops were extracted using a similar procedure to Nowak et al. (20). The IMGT-defined CDRs, along with five N-terminal and five C-terminal anchor residues, were selected if there were no missing backbone atoms and none of the backbone atoms had B-factors >80. Loops also needed to be continuous, i.e., all peptide bond lengths were <1.37Å. Since loops with identical sequences can adopt multiple conformations (see Results and Figure S5 in the current work on TCR CDRs and examples in Nowak et al. (20) for antibody CDRs), we retained all CDR loop structures that passed the structural quality criteria.
4.4. Clustering Method
Loops are clustered using the length-independent clustering method from Nowak et al. (20). First, the algorithm superimposes the backbone atoms of all ten anchor residues. Structural similarity is then calculated using the dynamic time warp (DTW) algorithm (55); the resulting DTW score is effectively a length-independent RMSD value (20). The DTW scores are then clustered using the DBSCAN method (56).
In the clustering of TCR CDRs alone, we only consider a set of sequences to be a CDR cluster only if it contains a minimum of five structures and two unique sequences. If a set contains five or more structures but only one unique sequence, we label this a “pseudo-class.” All other loops are considered to be “unclustered.” This is a more lenient threshold than previous investigations clustering antibody CDR loops [e.g., Nowak et al. (20)], as the dataset of TCR structures is far smaller. To choose the optimal clustering parameter, DBSCAN was run over a range of DTW score thresholds in increments of 0.1Å. We find that 1.0Åoffered an optimal balance between the number and size of clusters for the five TCR CDR loops.
For the clustering of multiple CDR types, we use the same clustering thresholds and criteria for the respective CDR types as the antibody study (20), where they were selected using the Ordering Points to Identify the Clustering Structure [OPTICS; (57)] algorithm. CDRα1/CDRL1 are clustered at 0.82Å, CDRα2/CDRL2 at 1Å (not initialized in the previous paper), CDRα3/CDRL3 at 0.91Å, CDRβ1/CDRH1 at 0.8Åand CDRβ2/CDRH2 at 0.63Å. A valid cluster must contain at least six unique sequences as illustrated in Nowak et al. (20).
4.5. Comparison With TCR Canonical Classes in Earlier Work
In order to compare our CDR clusters to previous studies (24, 25), we identified overlaps in the representative PDB entries. For example, Al-Lazikani et al.'s α2-3 class contains the PDB entry 1tcr. Since 1tcr:A was in our α2-8-B class, we considered these two classes (α2-3 and α2-8-B) to be analogous. A similar procedure was used to map our classes to Klausen et al.'s classes (25).
Some canonical classes from Al-Lazikani et al. or Klausen et al. were represented by structures that were filtered out of our dataset. To match these classes, we searched for a CDR structure in our set that has the same CDR sequence and checked if there was a backbone match. For example, the centroid structure of Klausen et al.'s α3-7 class is PDB 3tf7, which has the sequence AVSAKGTGSKLS. We found that 3tfk:C has the same CDRα3sequence as 3tf7 and a backbone RMSD of 0.81Å; thus, we assigned their α3-7 to our α3-12-A class.
4.6. Sequence-Based Prediction of Canonical Forms
For each CDR canonical class, we generated a position-specific scoring matrix (PSSM). The score of an amino acid a in IMGT position i, s(a, i), is:
where pa, i represents the probability of a at i, and this is calculated separately for each class. The background probability of a, ba, was assumed to be identical for all residues i.e., 0.05.
To predict the cluster for a target loop with length l, we first select PSSMs containing loops with the same length. For example, if the new CDRα1loop is six residues long, we choose PSSMs of the α1-6-A, α1-6-B, α1-6-C and α1-6-D classes. The PSSM score for the target loop for class c, Pc, is the sum of the position-specific scores:
If a is never observed at i, we assume s(a, i) = −1. For canonical class assignment, we designated a target loop to the class with the highest value of Pc. Furthermore Pc must be higher than 1, except for CDRα3where Pc must be equal to or greater than the loop's length. The value of Pc was chosen by performing leave-one-out cross-validation tests over several values of Pc. Sequences were assigned to a pseudo-class if and only if they had an identical sequence.
To benchmark the scoring strategy, we ran a leave-one-out cross-validation protocol on the unique sequences from canonical classes and unclustered structures. A prediction was evaluated using the following criteria:
• True positive: sequence is assigned to the correct canonical class.
• False positive: sequence is assigned to a different canonical class.
• True negative: sequence is from an unclustered loop, and not assigned a canonical class.
• False negative: sequence is from a canonical class, but predicted to be unclustered.
4.7. Prediction on Next-Generation Sequencing Dataset
We predicted the canonical forms for the α-chain in a set of mouse TCR sequences from BioProject accession PRJNA362309 (44). Overlapping Illumina reads were assembled using FLASh (58), and TCR amino acid sequences were extracted using IgBLAST (59). The sequences were then numbered by ANARCI (53); only those with productive CDR3 rearrangement, CDR1 and CDR2 loops at least five residues long, and CDR3 loops at least eight residues long were retained.
4.8. Prediction of New TCR Structures
We used the 44 αβTCRstructures that were released between 31st May 2018 and 5th June 2019 as a blind test set. Unlike our structural dataset for clustering, we did not impose a quality restriction for CDR prediction. Predictions were considered to be correct if the backbone RMSD between the native CDR structure and any member of the assigned canonical class was ≤ 1.0Å.
4.9. Comparison Between TCR and Antibody CDR Structures
We clustered TCR and antibody CDR structures and examined the length distribution, structural clustering and sequence patterns. Sequence lengths of CDR loops disregard the five anchor residues at each side of the CDR structures. Structural clustering comparison were done as described above. We used WebLogo to generate all sequence patterns (43). Amino acids were colored with the default hydrophobicity scale: hydrophilic residues (R, K, D, E, N and Q) were in blue, neutral residues (S, G, H, T, A, and P) in green and hydrophobic residues (Y, V, M, C, L, F, I, and W) in black.
In canonical forms where both TCR and antibody CDRs were found, we examined the difference between the sequence patterns used by TCRs and antibodies. We transformed the sequences using one hot encoding by position, where each feature was represented by the position and the residue name, and applied Principal Component Analysis (PCA) using the scikit-learn module in Python.
4.10. Analysis of CDRβ3and CDRH3 Structures
LAT is the Euler transformation of the coordinate planes formed by residues at IMGT positions of 105 and 117 [see Supplementary Information of Weitzner et al. (30) for a detailed mathematical definition]. Briefly, a coordinate system is defined for each of the two residues centered on the Cα atoms, where the z-axis points toward the carbonyl carbon, the y-axis is perpendicular to z in the N-Cα-C plane, and the x-axis is the cross product of these two components. The Euler transformation is then represented by six degrees of freedom, capturing the translation (X, Y, Z) and rotation (ϕ, ψ, θ).
We calculated the pseudo bond angle (τ) and pseudo dihedral angle (α) of the residue at IMGT position 116 (corresponding to Chothia position 101), in the CDRβ3and CDRH3 sets. Consistent with Weitzner et al. (30), the pseudo bond angle is formed by the Cα atoms of the residues before, at and after the IMGT position 116, whereas the pseudo dihedral angle spans across the Cα atoms of residues at position 116, one before and two after.
Finn et al. (31) observed that the dihedral angles adopted by the base residues of extended torsos were different from that of kinked torsos. To capture the observation made by Finn et al. (31), the dihedral angles of the first three (T1-T3) and last four (T4-T7) residues of the CDRβ3and CDRH3 loops are obtained from the Biopython module (60).
Data Availability Statement
All datasets generated for this study are included in the manuscript/Supplementary Files.
WW and JL performed the experiments and analyzed the data. WW, JL, and CD contributed to the design of the study, manuscript writing, revision, read, and approved the submitted version.
This work was supported by the Engineering and Physical Sciences Research Council and Medical Research Council Grant EP/L016044/1.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu.2019.02454/full#supplementary-material
4. Sillitoe I, Lewis TE, Cuff A, Das S, Ashford P, Dawson NL, et al. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. (2015) 43:D376–81. doi: 10.1093/nar/gku947
5. Dunbar J, Knapp B, Fuchs A, Shi J, Deane CM. Examining variable domain orientations in antigen receptors gives insight into TCR-like antibody design. PLOS Comput Biol. (2014) 10:1–10. doi: 10.1371/journal.pcbi.1003852
7. Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR. The promise and challenge of high–throughput sequencing of the antibody repertoire. Nat Biotechnol. (2014) 32:158–68. doi: 10.1038/nbt.2782
9. Sharon E, Sibener LV, Battle A, Fraser HB, Garcia KC, Pritchard JK. Genetic variation in MHC proteins is associated with T cell receptor expression biases. Nat Genet. (2016) 48:995–1002. doi: 10.1038/ng.3625
10. Blevins SJ, Pierce BG, Singh NK, Riley TP, Wang Y, Spear TT, et al. How structural adaptability exists alongside HLA-A2 bias in the human αβ TCR repertoire. Proc Natl Acad Sci USA. (2016) 113:E1276–85. doi: 10.1073/pnas.1522069113
12. Cole DK, Miles KM, Madura F, Holland CJ, Schauenburg AJA, Godkin AJ, et al. T-cell Receptor (TCR)-peptide specificity overrides affinity-enhancing TCR-major histocompatibility complex interactions. J Biol Chem. (2014) 289:628–38. doi: 10.1074/jbc.M113.522110
19. Kuroda D, Shirai H, Kobori M, Nakamura H. Systematic classification of CDR–L3 in antibodies: implications of the light chain subtypes and the VL–VH interface. Proteins. (2009) 75:139–46. doi: 10.1002/prot.22230
20. Nowak J, Baker T, Georges G, Kelm S, Klostermann S, Shi J, et al. Length-independent structural similarities enrich the antibody CDR canonical class model. mAbs. (2016) 8:751–60. doi: 10.1080/19420862.2016.1158370
21. Pantazes RJ, Maranas CD. OptCDR: a general computational method for the design of antibody complementarity determining regions for targeted epitope binding. Protein Eng Des Sel. (2010) 23:849–58. doi: 10.1093/protein/gzq061
22. Lapidoth GD, Baran D, Pszolla GM, Norn C, Alon A, Tyka MD, et al. AbDesign: An algorithm for combinatorial backbone design guided by natural conformations and sequences. Proteins. (2015) 83:1385–406. doi: 10.1002/prot.24779
23. Kovaltsuk A, Krawczyk K, Galson JD, Kelly DF, Deane CM, Trück J. How B-cell receptor repertoire sequencing can be enriched with structural antibody data. Front Immunol. (2017) 8:1753. doi: 10.3389/fimmu.2017.01753
31. Finn JA, Leman JK, Willis JR, Cisneros A III, Crowe JE Jr, Meiler J. Improving loop modeling of the antibody complementarity-determining region 3 using knowledge-based restraints. PLoS ONE. (2016) 11:e0154811. doi: 10.1371/journal.pone.0154811
33. Lefranc MP, Pommié C, Ruiz M, Giudicelli V, Foulquier E, Truong L, et al. IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev Comp Immunol. (2003) 27:55–77. doi: 10.1016/S0145-305X(02)00039-3
37. Xu Y, Yang Z, Horan LH, Zhang P, Liu L, Zimdahl B, et al. A novel antibody-TCR (AbTCR) platform combines Fab-based antigen recognition with gamma/delta-TCR signaling to facilitate T-cell cytotoxicity with low cytokine release. Cell Discov. (2018) 4:62. doi: 10.1038/s41421
40. Wong WK, Georges G, Ros F, Kelm S, Lewis AP, Taddese B, et al. SCALOP: sequence-based antibody canonical loop structure annotation. Bioinformatics. (2018) 35:1774–6. doi: 10.1080/2162402X.2015.1049803
44. Maceiras A, Almeida S, Mariotti-Ferrandiz E, Chaara W, Six A, Hori S, et al. T follicular helper and T follicular regulatory cells have different TCR specificity. Nat Commun. (2017) 8:15067. doi: 10.1038/ncomms15067
46. Holland CJ, MacLachlan BJ, Bianchi V, Hesketh SJ, Morgan R, Vickery O, et al. In Silico and structural analyses demonstrate that intrinsic protein motions guide T cell receptor complementarity determining region loop flexibility. Front Immunol. (2018) 9:674. doi: 10.3389/fimmu.2018.00674
47. Crooks JE, Boughter CT, Scott LR, Adams EJ. The hypervariable loops of free TCRs sample multiple distinct metastable conformations in solution. Front Mol Biosci. (2018) 5:95. doi: 10.3389/fmolb.2018.00095
49. Rossjohn J, Gras S, Miles JJ, Turner SJ, Godfrey DI, McCluskey J. T cell antigen receptor recognition of antigen-presenting molecules. Ann Rev Immunol. (2015) 33:169–200. doi: 10.1146/annurev-immunol-032414-112334
51. Li S, Wilamowski J, Teraguchi S, van Eerden FJ, Rozewicki J, Davila A, et al. Structural modeling of lymphocyte receptors and their antigens. In: Kaneko S, editor. In Vitro Differentiation of T-Cells. Springer (2019). p. 207–29.
56. Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. KDD'96. Portland, OR: AAAI Press (1996). p. 226–31.
60. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. (2009) 25:1422–23. doi: 10.1093/bioinformatics/btp163
Keywords: T-cell receptors, antibodies, loop conformations, protein structure prediction, NGS
Citation: Wong WK, Leem J and Deane CM (2019) Comparative Analysis of the CDR Loops of Antigen Receptors. Front. Immunol. 10:2454. doi: 10.3389/fimmu.2019.02454
Received: 30 July 2019; Accepted: 01 October 2019;
Published: 15 October 2019.
Edited by:Harry W. Schroeder, University of Alabama at Birmingham, United States
Reviewed by:Michael Zemlin, Saarland University Hospital, Germany
John Cambier, University of Colorado Denver, United States
Copyright © 2019 Wong, Leem and Deane. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Charlotte M. Deane, email@example.com
†Present address: Jinwoo Leem, BenevolentAI, London, United Kingdom