The TR locus annotation and characteristics of Rhinolophus ferrumequinum

T cell antigen receptors (TCRs) in vertebrate could be divided into αβ or γδ, which are encoded by TRA/D, TRG and TRB loci respectively. TCRs play a central role in mammal cellular immunity, which are functionally produced by rearrangement of V gene, D gene, J gene and C gene in the loci. Bat is the only mammal with flying ability, and is considered as the main host of zoonotic virus, which occupies an important position in the field of public health. At present, little is known about the composition of bat TRs gene. According to the whole genome sequencing results of the Rhinolophus ferrumequinum, and referring to the TR/IG annotation rules formulated by IMGT. We make a complete annotation on the TRA/D, TRG and TRB loci of the Rhinolophus ferrumequinum. A total of 128 V segments, 3 D segments, 85 J segments and 6 C segments were annotated, in addition to compared with the known mammalian, the characteristics of the TRs locus and germline genes of the Rhinolophus ferrumequinum were analyzed.


Introduction
Vertebrate T cells participate in adaptive immune response via their diversified TCRs on the surface. TCRs composed of two polymorphic chains, which can be divided into two types: α β polypeptide chain and γ δ However, due to the incompleteness of sequencing and annotation of these bats' genomes and transcriptome, the systematic and complete annotation of bats' whole genome has not been completed at present, especially the annotation of bats' immune genes such as TCR, BCR and MHC is currently blank. According to the annotation rules of IMGT-ONTOLOGY, we annotated TRA/D, TRB, TRG of Rhinolophus ferrumequinum completely, and compared the annotated genes with those of human, mouse, pig, dog, cattle and other species in the IMGT database.

Methods and Materials 2.1 Annotation of the TR locus of Rhinolophus ferrumequinum
According to the whole genome sequence (GCA_004115265.3) shared in NCBI, which was submitted by vertebrate genomes project through whole genome shotgun sequencing, the genome has 28 chromosomes, with a total length of 2075.77Mb, and we could complete the current analysis due to the high quality and coverage of this genome. We completed the annotation work according to the process of Figure 1. The first step is to locate the TR locus in chromosome. According to the borne gene information of mammalian TR loci recorded in IMGT database, TRA/D loci are usually located in OR10G2 (Olfactory receptor family 10 subfamily G member 2, OR10G2) gene and DAD1(Defender against cell death 1, DAD1). The TRG locus is located between AMPH (Amphiphysin, AMPH) gene and STARD3NL (Related to steroidogenic acute regulatory protein D3-n-terminal like, STRAD3NL) gene. TRB locus is located between MOXD2 (Monooxygenase-beta-hydroxylase-like 2, MOXD2) gene and EPHB6 (Ephrin type-b receptor 6, EPHB6) gene. The second step is the identification of germline genes. We analyzed sequences among the borne genes: we analyzed the V gene based on human and mouse pattern by IgBLAST tool (for the V gene of TRA/D locus, if the output result is TRAV/TRDV, we will annotate it as the V gene shared by TRA and TRD loci).
Furthermore, the sequence was imported into Geneious Prime software, and the homology of the sequence was compared by using the germline gene information recorded in IMGT database, including the existing V, D, J and C genes of human, mouse, pig and dog.
The third step is description of germline genes. Here, four parts are described according to the types of germline gene. (1) V gene is composed of 3 parts: L-Part+V-exon+V-RS, L-part contains two segments: L-part1 and L-part2, there is a part of intron between part1 and part2, L-Part1 contains 2 conservative components: Init-codon and Donor-splice, L-part2 contains 1 conservative component: Acceptor-splice, part2 is next to the V-exon part, and V-exon is divided The composition of the C gene is different due to loci: TRAC and TRDC usually consist of 4 exons: Exon1+Exon2+Exon3+Exon4UTR. TRGC usually consists of 3 exons: Exon1+Exon2+Exon3, and TRGC-Exon2 usually has multiple situations: Ex2A, Ex2B, Ex2C, Ex2R, Ex2T. TRBC usually consists of 4 exons: Exon1+Exon2+Exon3+Exon4. The fourth step is the functional identification of germline genes. The following rules are used for functional identification of germline genes: (1) 3'RSS sequence of V gene (7-23-9), J gene 5' end RSS sequence (9-12-7), 5' end RSS sequence (9-12-7) and 3' end RSS sequence (7-23-9) of D gene, (2) Whether the 5'end of V gene contains leader sequence; (3) Conserved acceptor and donor splicing site; (4) Length of coding region; (5) Frameshift or stop codon of coding region. The last step is classification of group and subgroup of germline genes. For each identified and analyzed V gene, according to its existing species, a phylogenetic evolution tree is constructed to name the V gene uniformly. If no homologous V gene is found, it will be named according to its position in the loci. Each D, J and C gene was named according to its physical location in its own cluster.

Comparison of the TR locus of Rhinolophus ferrumequinum and other mammals
The analysis results of TRA/D, TRG and TRB loci were drawn into physical maps by Geneious Prime software according to the annotation of IMGT, and the information of four TR loci of each species in IMGT database was counted and compared. At the same time, the number of germline genes and subgroups among species were compared, and the nucleotide and amino acid composition of annotated genes were analyzed. According to the classical RSS sequence, the mutation number was counted, and the conservative analysis of 12/23 RSS sequences of V and J genes was carried out. Sequences we analyzed include: 847Kb between OR10G2 and DAD1, 174Kb between AMPH and STARD3L, and 240Kb between MOXD2 and EPHB6. Using the whole genome assembly (GCA_004115265.3) of Rhinolophus ferrumequinum recorded in the NCBI, the TRA/D locus of Rhinolophus ferrumequinum was identified ( Figure.2):
Supplementary Table 2 provides detailed information for each annotated TRG locus germline genes in Chromosome20.

TRB locus
The TRB locus ( Figure. The structure of the TRB locus of Rhinolophus ferrumequinum suggests that the upstream is a V cluster composed of 28 TRBV genes, and the downstream is two complete D-J-C gene clusters. The first cluster contains: 1 D gene, 6 J genes and 1 C gene, the second cluster contains: 1 D gene, 9 J genes and 1 C gene. There is a TRBV gene with the opposite transcription direction downstream of the second TRBC gene. Supplementary Table 3 provides detailed information for each annotated TRB locus germline genes in Chromosome26.

Classification of germline gene in 4 TR locus of Rhinolophus ferrumequinum 3.4.1 Classification and phylogenetic analysis of the V genes
(1) A total of 81 TRAV genes were found in TRA loci, and the nucleotide similarity of 81 TRAV genes ranged from 25.1% to 98.9%. According to the standard that the nucleotide homology is more than 75% and belongs to the same group, it can be divided into 31 groups.
In order to analysis the classification and affiliation between annotated V genes of bat with existing species, we selected the V gene sequences of primate representative species, human, carnivorous representative species, dog and artiodactyl representative species, cattle, from the IMGT database to construct phylogenetic trees. There are two criteria for selecting genes: one is to select only potential functional genes and in-frame pseudogenes, and the other is to select only one gene for each subgroup.
After all the selected V genes were compared by ClustalW method, MEGA7 was used for N-J method of rootless tree construction. The phylogenetic tree constructed by 31 bat TRAV genes shown in Figure5-A. The results reflect the homology between the annotated bat TRAV genes and existing species. We named the bat TRAV genes according to the comparison results. It should be noted that there are 3 groups, TRAV8, TRAV18 and TRAV19, of which the TRAV8-2 gene and the rest of the TRAV8 members are only about 65% identity, but in the evolutionary branch, we found that TRAV8-2 and TRAV8-1 clustered to the same branch, since TRAV8-2 is a pseudogene, we incorporated it into TRAV8 group. The same situation occurs when the identity between TRAV18-1 and TRAV18-2 is 73.8%, and TRAV19-1 and TRAV19-2 is 69. (2) There are 18 TRDV genes in the TRD locus, and the nucleotide identity is between 27%-98.6%. The 18 TRDV genes are divided into 11 families, TRDV5, TRDV6, TRDV7, TRDV8, TRDV9, TRDV10, TRDV11 are single gene groups. The phylogenetic tree constructed by the 11 TRDV genes of bat with human, dog, and cattle is shown in Figure 5-B. Except for the TRDV10 family, all groups of the bat can find homology in the 3 species. Of the 18 TRDV genes, only 2 shared with TRA loci were classified as pseudogenes, and the other 16 were functional genes.
(3) The TRG locus contains 14 TRGV genes with nucleotide similarity ranging from 33% to 99%. Because the TRG locus of bat is composed of two identical V-J-C gene clusters, the 14 TRGV genes are divided into 7 groups with nucleotide identity ranging from 94.6% to 99.0%. A phylogenetic tree was constructed with the annotated TRDV genes of bat and the TRDV genes of human, dog, cattle, and mice (Figure5-C). All TRGV genes of bat can be found homologous in 4 species. The 14 TRGV genes have a high proportion of pseudogenes, 5 of 14 are classified as pseudogenes, 3 ORFs and 6 functional genes. Because the recorded data of TRDV and TRGV genes between species are limited and the uniformity is not high, they are named according to position in the locus.
(4) The TRB locus contains 29 TRBV genes with nucleotide identity ranging from 33.7% to 94.2%, and the 29 TRBV genes are divided into 25 groups. The phylogenetic trees constructed by human, mouse, dog, and pig showed that all the annotated TRBV genes of the bat could be found to be homologous in 4 species. It is worth noting that human TRBV1 could not cluster with the TRBV1 group of cattle, mouse, pig and bat, but was on the branch of TRBV4. There are 3 multigene groups, TRBV5 and TRBV6 have 2 gene members, and TRBV12 has 3 gene members. The  Supplementary Table 4.
The amino acid sequences of some TRAV genes (81 TRAV genes in Supplementary   Figure 1) and all TRDV, TRGV and TRBV genes annotated of bat were aligned according to the unique number of IMGT V-region (Figure 6), so as to ensure the homology to the maximum extent.  species is between 72.5% and 92.6%, and the similarity of TRBC2 is between 71.3% and 93.8%.

Germline gene recombination signal sequences analysis
According to the IMGT rule, we statistically analyzed the RSS sequences judged as germline genes (Figure 9). We used the classic 7-mer-CACAGTG (heptamer) and 9-mer-ACAAAAACC (nonamer) as motifs, The number of mutations in the RSS sequence was analyzed: In the TRA/D locus, 7 of the 81 TRAV genes did not have the 23RSS sequence, and the number of mutations in the 74 RSS sequences was between 0-9; The 3' ends of 18 TRDV genes all have 23RSS sequences, and the number of mutations in 18 RSS sequences is between 0 and 9, the number of mutations in 15 sequences is less than 6, the number of mutations in 2 sequences is 8, and the number of mutations in one sequence is 9. Of the 60 TRAJ genes, 5 TRAJ have no 12RSS sequence at the 3' end, and the number of mutations in 55 12RSS sequences is between 0 and 7, among which the number of mutations in 54 sequences is less than 6 and the number of mutations in 1 sequence is 7.
In TRG locus, 2 out of 14 TRGV genes do not have 23RSS sequence, and the number of mutations in 23RSS sequence of 12 TRGV genes is between 0 and 9, of which 11 have less than 7 mutations, and the number of mutations in one sequence is 9. The 6 TRGJ genes all have 12RSS sequences at the 3' end, and the number of mutations in the 12RSS sequences of the 6 TRGJ genes ranges from 0 to 9, among which the number of mutations in 3 genes is less than 5 and the number of mutations in 3 genes is 9. In TRB loci, 29 TRBV genes all have 23RSS sequences, and the number of mutations in 23RSS sequences of TRBV genes is between 1 and 6, of which only 1 mutation number is 6, and a total of 20 mutations are concentrated in 3-4 mutations; Only 1 of the 15 TRBJ genes does not have 12RSS sequence, and the number of mutations in 12RSS of 14 genes is between 2 and 6, mainly between 3 and 5. It can be seen that the RSS sequence of TRB locus is more conservative than other loci.
All the V genes heptamer and the poly-A tract of V gene nonamer are highly conserved. The nonamer of TRAV and TRDV has a high degree of diversity in the first 4 nucleotides, while the whole nonamer of TRGV has high diversity. The diversity of RSS sequence is more obvious in J gene, except that the poly-A tract of J gene nonamer is conservative. The first 2 nucleotides and the last two of nonamer of J gene are diversified, which is more obvious in heptamer of J gene, except the bases (cac) of the first 3 nucleotides, the last 4 nucleotides of J gene heptamer are highly diversified.

Discussion
The function of vertebrate T/B cells depends on TCR/BCR, and the TR locus analysis of species is of great significance for us to understand its immune response. The information of 4 TR loci of many species has been included in the IMGT database, and the TR loci of 12 species, including human, macaque, mouse, pig, cow, sheep, dog, cat, dolphin, ferret, rabbit and zebra fish, have been completely annotated and described(http://www.imgt.org/IMGTrepertoire/LocusGenes/#C). Previous studies always annotated and analyzed one or two loci of a single species. However, for the first time, we annotated the 4 loci TRA/TRD, TRG and TRB of Rhinolophus ferrumequinum directly through the whole genome through the distribution characteristics, sequence characteristics and conservative RSS of germline genes. This paper can provide ideas and reference methods for new species with unresolved TR/IG loci. In addition, we believe that high-quality TR locus annotation is the key to the subsequent analysis of the TCR/BCR repertoire of Rhinolophus ferrumequinum.
Gene replication is the basis of the evolution of antigen-specific receptors, which leads to the formation of independent IG/TR sites [26,27]. Since mammalian TRD loci are embedded in TRA loci, the length of this locus spans the longest among the TR loci, even so, the structure has not changed much [28], and the amplification events in the genome are part of the reasons for the differences. In the TRA/D locus, the most obvious replication events occurred in Artiodactyl cattle (3330Kb) and sheep (2880Kb). In terms of the structural characteristics of the loci, the TRA/D locus of Rhinolophus ferrumequinum also maintained a conservative structure of mammals, only primate macaques (806Kb), carnivorous cats (830Kb) and dogs (760Kb) are shorter than the TRA/D locus of Rhinolophus ferrumequinum (850Kb).
In contrast, although the coverage of TRG loci is the shortest, the situation among species may be more complicated. Many species have detected the existence of 2 TRG loci. For example, there are two loci, TRG1 and TRG2, in Atlantic salmon, although the expression of TRG2 cannot be detected [29]. In addition, the study of sheep found that there were at least two TRG loci on chromosome 4, and the expression analysis showed that both of them had functions [30,31], and the second TRG locus was found in cattle, buffalo and goats by the same method [32,33]. On the other hand, the TRG locus structure of Rhinolophus ferrumequinum is relatively simple, which consists of two V-J-C gene clusters, and the composition and similarity among the gene clusters are almost identical. The dog's TRG locus has undergone repeated replication, resulting in eight V-J-C gene clusters [34]. Mammalian TRB locus is probably the most conservative , with multiple V genes in the upstream and multiple D-J-C clusters in the downstream, and replication events occur among D-J-C gene clusters. In the early analysis of chicken expression level, only one D-J-C cluster was found [35]. Primate humans, macaques, rodent mice, carnivorous cats, dogs, and Rhinolophus ferrumequinum all contain two D-J-C clusters. Artiodactyla is the representative of three D-J-C gene clusters, including pigs, sheep. We found that the TRBJ gene of the second D-J-C cluster and the third cluster of pigs and sheep has very high homology, so the replication event is likely to occur in the second gene cluster, and there are four D-J-C clusters downstream of the TRB locus of short-tailed opossum [36]. In addition, the TRB loci of primate humans (620Kb) and rhesus monkeys (736Kb) have the largest span, and the number of embryonic genes is only slightly less than that of sheep (506Kb Similarly, TRBV6 of horse-headed bat and human are polygenic families, but TRBV6 of dog and pig is monogenic family [10]. TRAV24, TRAV25 of Rhinolophus ferrumequinum and cattle are obviously polygenic groups, but TRAV24, TRAV25 of human and dog are monogenic groups. Similarly, TRBV6 of horse-headed bat and human are polygenic families, but TRBV6 of dog and pig is monogenic family [10]. We counted the CDR region length (Supplemental Table 6) of the homologous TRAV and TRBV genes among different species, and concluded that the CDR region length of the TRBV gene subgroup is conservative among different species, and only the lengths of TRBV1 and TRBV17 are different between human beings and bats. However, the CDR region length of TRAV varies greatly among four species, even though the loss of some statistical information and gene families has affected our statistics. Early researchers have put forward the evolutionary model of IG and TCR genes, which is called: birth-and-death evolution. This evolutionary model explains the emergence of most polygenic families and the fact that some genes become pseudogenes due to extensive mutations, and family amplification occurs at the same time as mutations and deletions occur between families. This is a compensation mechanism to maintain diversity. The internal replication of genome has become a routine event, and the birth of most germline genes comes from the replication of intergenomic fragments, that is to say, the birth of germline genes depends more on the replication of existing genes than on the birth of new subgroups [37][38][39].
The birth of genes is accompanied by death, because some genes always become pseudogenes due to mutations or frame shifting [39]. 16% of 81 TRAVs, 11% of 18 TRDV genes, 35% of 14 TRGVs and 3% of 29 TRBVs belong to pseudogenes. Correspondingly, the proportion of TRBV pseudogenes in human, dog and pig is 19%, 43% and 31.5% respectively. The TRA/D locus of cattle not only shows high "birth rate", but also shows high "death rate". The proportion of TRAV pseudogenes in human, mouse and cattle is 19%, 16.5% and 37.2% respectively [8]. Although there is no diversity degree of cattle, in contrast, 81 TRAV/DV genes of Rhinolophus ferrumequinum show a larger and more stable germline genes pool than 54 TRAV/DV genes of human beings, which is similar to IGVH3 gene family of small brown bat, and there may be more preliminary discoveries of combination diversity [22].
Generally speaking, we want to evaluate the diversity of TCR or BCR receptor repertoire of a species, first from the point of view of the number of germline genes, because this directly affects the diversity of rearrangement, and then consider the addition and mutation of nucleotides. In previous studies, SHM occurred only in B cells of higher vertebrates in order to produce high affinity antibodies, which is very rare in T cells. We counted all species in the IMGT database, including our annotated germline gene pool of Rhinolophus ferrumequinum. Unlike the TRAV gene, most mammals have about 60 TRAJ genes. It is worth noting that sheep and cattle of artiodactyl have a large number of replication events in TRA/D locus and TRAV/dv gene, and the number of TRAJ gene in sheep is 79, but the number of cattle belonging to artiodactyl has not increased significantly. The researchers speculated that the 19 TRAJ complementary genes found in sheep may be the result of replication from TRAJ29 to TRAJ39 due to sequencing errors or amplification [40]. The statistical results of D gene are similar, only cattle and sheep have 9 TRAD genes, and to our surprise, there is only one D gene in Rhinolophus ferrumequinum, which is the only species with only one TRDD gene at present. Another feature of TRG loci is the number of TRGC and the diversity of corresponding protein structures, especially the size of TRGC gene is usually different, and this difference is caused by the different exon number and intron length of coding junction region, which is very obvious even among same species. TRGC2, TRGC3 and TRGC5 of salmon are divided into three exons (EX1, EX2 and EX3), while TRGC1 has two EX2 [29]. TRGC1 and TRGC5 of sheep have only one EX2(A), while TRGC3 has two EX2(A and c) and TRGC6 has three EX2(A, b and C) [41]. TRGC5 of cattle has only one EX2, but TRGC3 and TRGC7 have two EX2, while TRGC1, TRGC2, TRGC4 and TRGC6 all have three EX2 [42]. TRGC1 and TRGC2 of dromedary also contain three EX2 [43,44]. TRGC2, TRGC3 and TRGC4 of dogs have two EX2, while TRGC1, TRGC6, TRGC7 and TRGC8 have only one EX2. All Monoptera and Marsupia lack the second cystamine, which is necessary for the formation of intra-chain disulfide bonds, which is obviously caused by independent mutations [45]. We made a conservative analysis of the RSS sequences of germline genes, but we did not find non-classical RSS before and after the V and J genes without RSS sequences, for example, if the spacer was 12bp±1 or 23bp±1. It would not affect the further analysis, because the effective recombination only occurred between the 12bp RSS and the 23bp RSS [46]. RSS sequence conserved sites of the four loci of Rhinolophus ferrumequinum are all related, whether it is heptamer or nonamer, the first 4 nucleotides of the conserved site of heptamer are CACA, while the conserved site of nonamer is usually the poly-A tract in the sequence. Because the first three nucleotides of heptamer play an important role in the recombination process, the differences of these positions may affect or hinder the recombination of genes, but at present we still don't know how or how much it will affect the rearrangement if we have atypical RSS before and after germline genes [47].

Conclusion
Bat is a special kind of mammal, which carries a lot of virulent viruses, but it will not be harmed by these viruses. We still don't know the mechanism of bat's anti-virus, not only because of the difficulty in obtaining samples, but also because of the lack of relevant research on bats [17]. In this paper, the four TR loci of Rhinolophus ferrumequinum were annotated completely at the genome level, the structural differences of TR loci between different species and Rhinolophus ferrumequinum were statistically analyzed, and the differences of germline gene composition between different species and Rhinolophus ferrumequinum were discussed. Generally speaking, the four TCR loci of Rhinolophus ferrumequinum are highly conserved compared with other mammals. It should be noted that homologous genes defined by phylogeny may not be able to participate in the rearrangement process, such as homologous genes of TRAV and TRDV, so it is necessary to analyze bat receptor repertoire data through specific primers. On the basis of the annotation of TR loci of Rhinolophus ferrumequinum, our research group has completed the annotation of IG loci of Rhinolophus ferrumequinum (another paper), and has completed the high-throughput sequencing analysis of TCR CDR3 and BCR CDR3 repertoire collected from many types of bats in Xishui area, Guizhou province, China, and compared with the homogeneity and heterogeneity of TCR CDR3 and BCR CDR3 repertoire in physiological conditions, human and mouse central and peripheral, which can provide a basis for studying the acquired immune response mechanism of bats.