Immunoglobulin and T Cell Receptor Genes: IMGT® and the Birth and Rise of Immunoinformatics

Lefranc, Marie-Paule

doi:10.3389/fimmu.2014.00022

CLASSIFICATION article

Front. Immunol., 05 February 2014

Sec. T Cell Biology

Volume 5 - 2014 | https://doi.org/10.3389/fimmu.2014.00022

Marie-Paule Lefranc ^*

The International ImMunoGenetics Information System® (IMGT®), Laboratoire d’ImmunoGénétique Moléculaire (LIGM), Institut de Génétique Humaine, UPR CNRS, Université Montpellier 2, Montpellier, France

Abstract

IMGT^®, the international ImMunoGeneTics information system^®¹ (CNRS and Université Montpellier 2), is the global reference in immunogenetics and immunoinformatics. By its creation in 1989, IMGT^® marked the advent of immunoinformatics, which emerged at the interface between immunogenetics and bioinformatics. IMGT^® is specialized in the immunoglobulins (IG) or antibodies, T cell receptors (TR), major histocompatibility (MH), and proteins of the IgSF and MhSF superfamilies. IMGT^® has been built on the IMGT-ONTOLOGY axioms and concepts, which bridged the gap between genes, sequences, and three-dimensional (3D) structures. The concepts include the IMGT^® standardized keywords (concepts of identification), IMGT^® standardized labels (concepts of description), IMGT^® standardized nomenclature (concepts of classification), IMGT unique numbering, and IMGT Colliers de Perles (concepts of numerotation). IMGT^® comprises seven databases, 15,000 pages of web resources, and 17 tools, and provides a high-quality and integrated system for the analysis of the genomic and expressed IG and TR repertoire of the adaptive immune responses. Tools and databases are used in basic, veterinary, and medical research, in clinical applications (mutation analysis in leukemia and lymphoma), and in antibody engineering and humanization. They include, for example, IMGT/V-QUEST and IMGT/JunctionAnalysis for nucleotide sequence analysis and their high-throughput version IMGT/HighV-QUEST for next-generation sequencing (500,000 sequences per batch); IMGT/DomainGapAlign for amino acid sequence analysis of IG and TR variable and constant domains and of MH groove domains; IMGT/3Dstructure-DB for 3D structures, contact analysis, and paratope/epitope interactions of IG/antigen and TR/peptide-MH complexes; and the IMGT/mAb-DB interface for therapeutic antibodies and fusion proteins for immune applications (FPIA).

IMGT^®: The Birth of Immunoinformatics

IMGT^®, the international ImMunoGeneTics information system^®¹ (1), was created in 1989 by Marie-Paule Lefranc at Montpellier, France (CNRS and Université Montpellier 2). The founding of IMGT^® marked the advent of immunoinformatics, a new science, which emerged at the interface between immunogenetics and bioinformatics. For the first time, immunoglobulin (IG) or antibody and T cell receptor (TR) variable (V), diversity (D), joining (J), and constant (C) genes were officially recognized as “genes” as well as the conventional genes (2–5). This major breakthrough allowed genes and data of the complex and highly diversified adaptive immune responses to be managed in genomic databases and tools.

The adaptive immune response was acquired by jawed vertebrates (or gnathostomata) more than 450 million years ago and is found in all extant jawed vertebrate species from fishes to humans. Understanding the basis for adaptive immunity, at the level of cell populations, individual cells, and molecules, has been a major focus of immunology in the past century (6, 7). The adaptive immune response is characterized by a remarkable immune specificity and memory, which are the properties of the B and T cells owing to an extreme diversity of their antigen receptors. The specific antigen receptors comprise the immunoglobulins (IG) or antibodies of the B cells and plasmocytes (2) (Figure 1), and the T cell receptors (TR) (3) (Figure 2). The IG recognize antigens in their native (unprocessed) form, whereas the TR recognize processed antigens, which are presented as peptides by the highly polymorphic major histocompatibility (MH, in humans HLA for human leukocyte antigens) proteins (Figure 2).

Figure 1

Figure 2

The potential antigen receptor repertoire of each individual is estimated to comprise about 2 × 10¹² different IG and TR, and the limiting factor is only the number of B and T cells that an organism is genetically programed to produce (2, 3). This huge diversity results from the complex molecular synthesis of the IG and TR chains and more particularly of their variable domains (V-DOMAIN) which, at their N-terminal end, recognize and bind the antigens (2, 3). The IG and TR synthesis includes several unique mechanisms that occur at the DNA level: combinatorial rearrangements of the V, D, and J genes that code the V-DOMAIN [the V–(D)–J being spliced to the C gene that encodes the C-REGION in the transcript], exonuclease trimming at the ends of the V, D, and J genes and random addition of nucleotides by the terminal deoxynucleotidyl transferase (TdT) that creates the junctional N-diversity regions, and later during B cell differentiation, for the IG, somatic hypermutations and class or subclass switch (2, 3).

IMGT^® manages the diversity and complexity of the IG and TR and the polymorphism of the MH of humans and other vertebrates. IMGT^® is also specialized in the other proteins of the immunoglobulin superfamily (IgSF) and MH superfamily (MhSF) and related proteins of the immune system (RPI) of vertebrates and invertebrates (1). IMGT^® provides a common access to standardized data from genome, proteome, genetics, two-dimensional (2D), and three-dimensional (3D) structures. IMGT^® is the acknowledged high-quality integrated knowledge resource in immunogenetics for exploring immune functional genomics. IMGT^® comprises seven databases (for sequences, genes and 3D structures) (9–14), 17 online tools (15–30), and more than 15,000 pages of web resources [e.g., IMGT Scientific chart, IMGT Repertoire, IMGT Education > Aide-mémoire (31), the IMGT Medical page, the IMGT Veterinary page, the IMGT Biotechnology page, the IMGT Immunoinformatics page] (1). IMGT^® is the global reference in immunogenetics and immunoinformatics (32–47). Its standards have been endorsed by the World Health Organization–International Union of Immunological Societies (WHO–IUIS) Nomenclature Committee since 1995 (first IMGT^® online access at the Ninth International Congress of Immunology, San Francisco, CA, USA) (48, 49) and the WHO–International Nonproprietary Names (INN) Programme (50, 51).

The accuracy and the consistency of the IMGT^® data are based on IMGT-ONTOLOGY (52–54), the first, and so far, unique ontology for immunogenetics and immunoinformatics (8, 52–70). IMGT-ONTOLOGY manages the immunogenetics knowledge through diverse facets that rely on seven axioms: IDENTIFICATION, DESCRIPTION, CLASSIFICATION, NUMEROTATION, LOCALIZATION, ORIENTATION, and OBTENTION (53, 54, 58). The concepts generated from these axioms led to the elaboration of the IMGT^® standards that constitute the IMGT Scientific chart: e.g., IMGT^® standardized keywords (IDENTIFICATION) (59), IMGT^® standardized labels (DESCRIPTION) (60), IMGT^® standardized gene and allele nomenclature (CLASSIFICATION) (61), IMGT unique numbering (8, 62–66), and its standardized graphical 2D representation or IMGT Colliers de Perles (67–70) (NUMEROTATION).

The fundamental information generated from these IMGT-ONTOLOGY concepts, which led to the IMGT Scientific chart rules, is reviewed. The major IMGT^® tools and databases used for IG and TR repertoire analysis, antibody humanization, and IG/Ag and TR/pMH structures are briefly presented: IMGT/V-QUEST (15–20) for the analysis of rearranged nucleotide sequences with the results of the integrated IMGT/JunctionAnalysis (21, 22), IMGT/Automat (23, 24) and IMGT/Collier-de-Perles tool (29), IMGT/HighV-QUEST, the high-throughput version for next-generation sequencing (NGS) (20, 25, 26), IMGT/DomainGapAlign (12, 27, 28) for amino acid (AA) sequence analysis, IMGT/3Dstructure-DB for 3D structures (11–13) and its extension, IMGT/2Dstructure-DB (for antibodies and other proteins for which the 3D structure is not available). IMGT^® tools and databases run against IMGT reference directories built from sequences annotated in IMGT/LIGM-DB (9), the IMGT^® nucleotide database (175,406 sequences from 346 species in November 2013) and from IMGT/GENE-DB (10), the IMGT^® gene database (3,117 genes and 4,732 alleles from 17 species, of which 695 genes and 1,420 alleles for Homo sapiens and 868 genes and 1,318 alleles for Mus musculus in November 2013).

An interface, IMGT/mAb-DB (14), has been developed to provide an easy access to therapeutic antibody AA sequences (links to IMGT/2Dstructure-DB) and structures (links to IMGT/3Dstructure-DB, if 3D structures are available). IMGT/mAb-DB data include monoclonal antibodies (mAb, INN suffix -mab; a -mab is defined by the presence of at least an IG variable domain) and fusion proteins for immune applications (FPIA, INN suffix -cept) (a -cept is defined by a receptor fused to an Fc) from the WHO–INN Programme (50, 51). This database also includes a few composite proteins for clinical applications (CPCA) (e.g., protein or peptide fused to an Fc for only increasing their half-life, identified by the INN prefix ef-) and some related proteins of the immune system (RPI) used, unmodified, for clinical applications. The unified IMGT^® approach is of major interest for bridging knowledge from IG and TR repertoire in normal and pathological situations (71–74), IG allotypes and immunogenicity (75–77), NGS repertoire (25, 26), antibody engineering, and humanization (35, 42–44, 46, 78–82).

IMGT-Ontology Concepts

IDENTIFICATION: IMGT^® standardized keywords

More than 325 IMGT^® standardized keywords (189 for sequences and 137 for 3D structures) were precisely defined (59). They represent the controlled vocabulary assigned during the annotation process and allow standardized search criteria for querying the IMGT^® databases and for the extraction of sequences and 3D structures. They have been entered in BioPortal at the National Center for Biomedical Ontology (NCBO) in 2010².

Standardized keywords are assigned at each step of the molecular synthesis of an IG. Those assigned to a nucleotide sequence are found in the “DE” (definition) and “KW” (keyword) lines of the IMGT/LIGM-DB files (9). They characterize for instance the gene type, the configuration type and the functionality type (59). There are six gene types: variable (V), diversity (D), joining (J), constant (C), conventional-with-leader, and conventional-without-leader. Four of them (V, D, J, and C) identify the IG and TR genes and are specific to immunogenetics. There are four configuration types: germline (for the V, D, and J genes before DNA rearrangement), rearranged (for the V, D, and J genes after DNA rearrangement), partially-rearranged (for D gene after only one DNA rearrangement) and undefined (for the C gene and for the conventional genes that do not rearrange). The functionality type depends on the gene configuration. The functionality type of genes in germline or undefined configuration is functional (F), open reading frame (ORF), or pseudogene (P). The functionality type of genes in rearranged or partially-rearranged configuration is either productive [no stop codon in the V–(D)–J-region and in-frame junction] or unproductive [stop codon(s) in the V–(D)–J-region, and/or out-of-frame junction].
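
The productive/unproductive rule for rearranged sequences can be sketched in a few lines of Python. This is only an illustration, not the way any IMGT^® tool is implemented: the function names are hypothetical, the V–(D)–J-region is assumed to be supplied starting in its reading frame, and an in-frame junction is approximated as a junction whose nucleotide length is a multiple of three.

```python
# Standard stop codons of the universal genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_codon(region_nt):
    """Return True if the region, read codon by codon from its first
    nucleotide (assumption: it starts in frame), contains a stop codon."""
    seq = region_nt.upper()
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))


def rearranged_functionality(region_nt, junction_nt):
    """Apply the rule stated in the text to a rearranged sequence:
    productive = no stop codon AND in-frame junction; otherwise unproductive.
    'In-frame' is approximated here by a length divisible by three."""
    in_frame = len(junction_nt) % 3 == 0
    if in_frame and not has_stop_codon(region_nt):
        return "productive"
    return "unproductive"


# Toy nucleotide strings, invented for the illustration only.
print(rearranged_functionality("TGTGCGAGATGG", "TGTGCGAGATGG"))
print(rearranged_functionality("TGTTGAAGATGG", "TGTTGAAGATGG"))
```

The functionality of genes in germline or undefined configuration (F, ORF, P) is not covered, since its criteria are not detailed in this section.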

The 20 usual AA have been classified into 11 IMGT physicochemical classes (IMGT^®, see footnote text 1, IMGT Education > Aide-mémoire > Amino acids). The AA changes are described according to the hydropathy (3 classes), volume (5 classes), and IMGT physicochemical classes (11 classes) (31). For example, Q1 > E (+ + −) means that in the AA change (Q > E), the two AA at codon 1 belong to the same hydropathy (+) and volume (+) classes but to different IMGT physicochemical properties (−) classes (31). Four types of AA changes are identified in IMGT^®: very similar (+ + +), similar (+ + −, + − +), dissimilar (− − +, − + −, + − −), and very dissimilar (− − −).
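
The notation can be illustrated with a short Python sketch. The class tables below are deliberately limited to Q and E, the two AA of the example above, and the class names given to them are assumptions; the complete tables for the 20 AA should be taken from the IMGT Aide-mémoire (31). The function names are hypothetical, and the ASCII characters + and - stand for the signs used in the text.

```python
# Partial class tables covering only Q and E (the example in the text).
# Class names are illustrative assumptions; complete the tables from
# IMGT Education > Aide-memoire > Amino acids before any real use.
HYDROPATHY = {"Q": "hydrophilic", "E": "hydrophilic"}
VOLUME = {"Q": "medium", "E": "medium"}
PHYSICOCHEMICAL = {"Q": "amide", "E": "acidic"}

# Sign patterns (hydropathy, volume, physicochemical) and the four types
# of AA change, exactly as listed in the text.
CHANGE_TYPES = {
    "+++": "very similar",
    "++-": "similar", "+-+": "similar",
    "--+": "dissimilar", "-+-": "dissimilar", "+--": "dissimilar",
    "---": "very dissimilar",
}


def change_pattern(aa_from, aa_to):
    """Return '+' (same class) or '-' (different class) for each criterion."""
    tables = (HYDROPATHY, VOLUME, PHYSICOCHEMICAL)
    return "".join("+" if t[aa_from] == t[aa_to] else "-" for t in tables)


def describe_change(aa_from, position, aa_to):
    """Format an AA change in the style 'Q1 > E (+ + -)' with its type."""
    pattern = change_pattern(aa_from, aa_to)
    kind = CHANGE_TYPES.get(pattern, "not listed")
    return f"{aa_from}{position} > {aa_to} ({' '.join(pattern)}): {kind}"


print(describe_change("Q", 1, "E"))  # Q1 > E (+ + -): similar
```

A pattern that is not among those enumerated in the text is reported as "not listed" rather than being assigned a type by guesswork.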

DESCRIPTION: IMGT^® standardized labels

More than 560 IMGT^® standardized labels (277 for sequences and 285 for 3D structures) were precisely defined (60). They are written in capital letters (no plural) to be recognizable without creating new terms. Standardized labels assigned to the description of sequences are found in the “FT” (feature) lines of the IMGT/LIGM-DB files (9). The ability to query these labels is a major advantage over the generalist nucleotide databases [GenBank/European Nucleotide Archive (ENA)/DNA Data Bank of Japan (DDBJ)]. Thus it is possible to query for the “CDR3-IMGT” of the human rearranged productive sequences of IG-Heavy-Gamma (e.g., 1733 CDR3-IMGT obtained, with their sequences at the nucleotide or AA level). The core labels include V-REGION, D-REGION, J-REGION, and C-REGION, which correspond to the coding region of the V, D, J, and C genes. IMGT structure labels for chains and domains and their correspondence with sequence labels are shown for human IG (Table 1), for human TR (Table 2), and for MH (8) (Table 3). These labels are necessary for a standardized description of the IG, TR, and MH sequences and structures in databases and tools (60).

Table 1

IG structure labels (IMGT/3Dstructure-DB)				Sequence labels (IMGT/LIGM-DB)
Receptor^a	Chain^b	Domain description type	Domain^c	Region
IG-GAMMA-1_KAPPA	L-KAPPA	V-DOMAIN	V-KAPPA	V–J-REGION
		C-DOMAIN	C-KAPPA	C-REGION
	H-GAMMA-1	V-DOMAIN	VH	V–D–J-REGION
		C-DOMAIN	CH1	C-REGION^d
		C-DOMAIN	CH2
		C-DOMAIN	CH3
IG-MU_LAMBDA	L-LAMBDA	V-DOMAIN	V-LAMBDA	V–J-REGION
		C-DOMAIN	C-LAMBDA-1	C-REGION
	H-MU	V-DOMAIN	VH	V–D–J-REGION
		C-DOMAIN	CH1	C-REGION^d
		C-DOMAIN	CH2
		C-DOMAIN	CH3
		C-DOMAIN	CH4^e

Immunoglobulin (IG) receptor, chain, and domain structure labels and correspondence with sequence labels.

^aLabels are shown for two examples of IG (Homo sapiens IgG1-kappa and IgM-lambda). An IG (“Receptor”) (Figure 1) is made of two identical heavy (H, for IG-HEAVY) chains and two identical light (L, for IG-LIGHT) chains (“Chain”) and usually comprises 12 (e.g., IgG1) or 14 (e.g., IgM) domains. Each chain has an N-terminal V-DOMAIN (or V–(D)–J-REGION, encoded by the rearranged V–(D)–J genes), whereas the remainder of the chain is the C-REGION (encoded by a C gene). The IG C-REGION comprises one C-DOMAIN (C-KAPPA or C-LAMBDA) for the L chain, or several C-DOMAIN (CH) for the H chain (2).

^bThe kappa (L-KAPPA) or lambda (L-LAMBDA) light chains may associate with any heavy chain isotype (e.g., H-GAMMA-1, H-MU). In humans, there are nine isotypes, H-MU, H-DELTA, H-GAMMA-3, H-GAMMA-1, H-ALPHA-1, H-GAMMA-2, H-GAMMA-4, H-EPSILON, H-ALPHA-2 [listed in the order 5′–3′ in the IGH locus of the IGHC genes, which encode the constant region of the heavy chains (2)] (IMGT^®, http://www.imgt.org, IMGT Repertoire).

^cThe IG V-DOMAIN includes VH (for the IG heavy chain) and VL (for the IG light chain). In higher vertebrates, the VL is V-KAPPA or V-LAMBDA, whereas in fishes, the VL is V-IOTA. The C-DOMAIN includes CH [for the IG heavy chain, the number of CH per chain depending on the isotype (2)] and CL (for the IG light chain). In higher vertebrates, the CL is C-KAPPA or C-LAMBDA, whereas in fishes, the CL is C-IOTA.

^dThe heavy chain C-REGION also includes the HINGE-REGION for the H-ALPHA, H-DELTA, and H-GAMMA chains and, for membrane IG (mIG), the CONNECTING-REGION (CO), TRANSMEMBRANE-REGION (TM) and CYTOPLASMIC-REGION (CY); for secreted IG (sIG), the C-REGION includes CHS instead of CO, TM, and CY.

^eFor H-MU and H-EPSILON.

Table 2

TR structure labels (IMGT/3Dstructure-DB)				Sequence labels (IMGT/LIGM-DB)
Receptor^a	Chain	Domain description type	Domain^b	Region
TR-ALPHA_BETA	TR-ALPHA	V-DOMAIN	V-ALPHA	V–J-REGION
		C-DOMAIN	C-ALPHA	Part of C-REGION^c
	TR-BETA	V-DOMAIN	V-BETA	V–D–J-REGION
		C-DOMAIN	C-BETA	Part of C-REGION^c
TR-GAMMA_DELTA	TR-GAMMA	V-DOMAIN	V-GAMMA	V–J-REGION
		C-DOMAIN	C-GAMMA	Part of C-REGION^c
	TR-DELTA	V-DOMAIN	V-DELTA	V–D–J-REGION
		C-DOMAIN	C-DELTA	Part of C-REGION^c

T cell receptor (TR), chain, and domain structure labels and correspondence with sequence labels.

^aA TR (“Receptor”) (3) (Figure 2) is made of two chains (alpha and beta, or gamma and delta) (“Chain”) and comprises four domains. Each chain has an N-terminal V-DOMAIN [or V–(D)–J-REGION, encoded by the rearranged V–(D)–J genes (3)], whereas the remainder of the chain is the C-REGION (encoded by a C gene). The TR C-REGION comprises one C-DOMAIN (3). TR receptor, chain, and domain structure labels, and correspondence with sequence labels, are shown for two examples of TR (Homo sapiens TR-alpha_beta and TR-gamma_delta).

^bThe TR V-DOMAIN includes V-ALPHA, V-BETA, V-GAMMA, and V-DELTA. The TR C-DOMAIN includes C-ALPHA, C-BETA, C-GAMMA, and C-DELTA (there are two isotypes for the TR-BETA and TR-GAMMA chains in humans, TR-BETA-1 and TR-BETA-2, and TR-GAMMA-1 and TR-GAMMA-2, the C-REGION of these chains being encoded by the TRBC1 and TRBC2 genes, and TRGC1 and TRGC2 genes, respectively) (IMGT^®, http://www.imgt.org, IMGT Repertoire) (3).

^cThe TR chain C-REGION also includes the CONNECTING-REGION (CO), the TRANSMEMBRANE-REGION (TM), and the CYTOPLASMIC-REGION (CY), which are not present in 3D structures.

Table 3

MH group	MH structure labels (IMGT/3Dstructure-DB)					Sequence labels (IMGT/LIGM-DB)
	Receptor^a	Chain	Domain description type^b	Domain	Domain number	Region
MH1	MH1-ALPHA_B2M	I-ALPHA	G-DOMAIN	G-ALPHA1	[D1]	Part of REGION^c
			G-DOMAIN	G-ALPHA2	[D2]
			C-LIKE-DOMAIN	C-LIKE	[D3]
		B2M	C-LIKE-DOMAIN	C-LIKE	[D]	REGION
MH2	MH2-ALPHA_BETA	II-ALPHA	G-DOMAIN	G-ALPHA	[D1]	Part of REGION^c
			C-LIKE-DOMAIN	C-LIKE	[D2]
		II-BETA	G-DOMAIN	G-BETA	[D1]	Part of REGION^c
			C-LIKE-DOMAIN	C-LIKE	[D2]

Major histocompatibility (MH) receptor, chain, and domain structure labels and correspondence with sequence labels.

^aAn MH (“Receptor”) (8) depending on the MH group is made of one chain (I-ALPHA) non-covalently associated to the beta2-microglobulin (B2M) (MH1 group, in the literature MHC class I) (Figure 2) or of two chains (II-ALPHA and II-BETA) (MH2 group, in the literature MHC class II). The I-ALPHA chain has two G-DOMAIN whereas each II-ALPHA and II-BETA has one G-DOMAIN. MH receptor, chain, and domain structure labels, and correspondence with sequence labels, are shown for examples of members of the MH1 and MH2 groups.

^bThe domain description type shows that the MH proteins belong to the MhSF by their G-DOMAIN and to the IgSF by their C-LIKE-DOMAIN. The B2M associated to the I-ALPHA chain in MH1 has only a single C-LIKE-DOMAIN and only belongs to the IgSF.

^cThe REGION of the I-ALPHA, II-ALPHA, and II-BETA chains also includes the CONNECTING-REGION (CO), the TRANSMEMBRANE-REGION (TM), and the CYTOPLASMIC-REGION (CY), which are not present in 3D structures.

Highly conserved AA at a given position in a domain have IMGT labels (60). Thus three AA labels are common to the V and C-domains: 1st-CYS (cysteine C at position 23), CONSERVED-TRP (tryptophan W at position 41), and 2nd-CYS (C at position 104) (62–66). Two other labels are characteristic of the IG and TR V-DOMAIN and correspond to the first AA of the canonical F/W–G–X–G motif (where F is phenylalanine, W tryptophan, G glycine, and X any AA) encoded by the J-REGION: J-PHE or J-TRP (F or W at position 118) (62–64, 66).

CLASSIFICATION: IMGT^® standardized genes and alleles

The IMGT-ONTOLOGY CLASSIFICATION axiom was the trigger of immunoinformatics’ birth. Indeed, the IMGT^® concepts of classification made it possible, for the first time, to classify the antigen receptor genes (IG and TR) for any locus [e.g., immunoglobulin heavy (IGH), T cell receptor alpha (TRA)], for any gene configuration (germline, undefined, or rearranged), and for any species (from fishes to humans). In higher vertebrates, there are seven IG and TR major loci (other loci correspond to chromosomal orphon sets, genes of which are orphons, not used in the IG or TR chain synthesis). The IG major loci include the IGH, and for the light chains, the immunoglobulin kappa (IGK), and the immunoglobulin lambda (IGL) in higher vertebrates, and the immunoglobulin iota (IGI) in fishes (IMGT^®, see footnote text 1, IMGT Repertoire).

Since the creation of IMGT^® in 1989, at New Haven during the Tenth Human Genome Mapping Workshop (HGM10), the standardized classification and nomenclature of the IG and TR of humans and other vertebrate species have been under the responsibility of the IMGT Nomenclature Committee (IMGT-NC). IMGT^® gene and allele names are based on the concepts of classification of “Group,” “Subgroup,” “Gene,” and “Allele” (61). “Group” classifies a set of genes that belong to the same multigene family, within the same species or between different species. For example, there are 10 groups for the IG of higher vertebrates: IGHV, IGHD, IGHJ, IGHC, IGKV, IGKJ, IGKC, IGLV, IGLJ, IGLC. “Subgroup” identifies a subset of genes that belong to the same group and that, in a given species, share at least 75% identity at the nucleotide level, e.g., Homo sapiens IGHV1 subgroup. Subgroups, genes, and alleles are always associated with a species name. An allele is a polymorphic variant of a gene, which is characterized by the mutations of its sequence at the nucleotide level, identified in its core sequence and compared to the gene allele reference sequence, designated as allele *01. For example, Homo sapiens IGHV1-2*01 is the allele *01 of the Homo sapiens IGHV1-2 gene that belongs to the Homo sapiens IGHV1 subgroup, which itself belongs to the IGHV group. For the IGH locus, the constant genes are designated by the letter (and, where applicable, number) corresponding to the encoded isotypes (IGHM, IGHD, IGHG3…), instead of using the letter C. IG and TR genes and alleles are not italicized in publications. IMGT-ONTOLOGY concepts of classification have been entered in the NCBO BioPortal.
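
The hierarchy Group > Subgroup > Gene > Allele can be read directly from a name such as IGHV1-2*01. The Python sketch below decomposes V-gene allele names of that form only; the regular expression and the function name are hypothetical and are not an official IMGT^® grammar. Names that follow other conventions (for example the IGH constant genes, named after the isotype) are deliberately rejected rather than guessed at.

```python
import re

# Hypothetical pattern for V-gene allele names shaped like IGHV1-2*01.
V_ALLELE = re.compile(
    r"^(?P<group>(?:IG[HKL]|TR[ABDG])V)"  # group, e.g. IGHV
    r"(?P<subgroup_number>\d+)"           # subgroup number, e.g. 1
    r"(?P<gene_suffix>[-\w]*)"            # rest of the gene name, e.g. -2
    r"\*(?P<allele>\d+)$"                 # allele number, e.g. 01
)


def parse_v_allele(name):
    """Split a V-gene allele name into group, subgroup, gene, and allele."""
    match = V_ALLELE.match(name)
    if match is None:
        raise ValueError(f"not a V-gene allele name of the expected form: {name!r}")
    group = match.group("group")
    subgroup = group + match.group("subgroup_number")
    gene = subgroup + match.group("gene_suffix")
    return {
        "group": group,
        "subgroup": subgroup,
        "gene": gene,
        "allele": "*" + match.group("allele"),
    }


print(parse_v_allele("IGHV1-2*01"))
```

Since subgroups, genes, and alleles are always associated with a species name, a real record would carry the species (e.g., Homo sapiens) alongside these four fields.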

The IMGT^® IG and TR gene names (2–5) were approved by the Human Genome Organisation (HUGO) Nomenclature Committee (HGNC) in 1999 (83, 84) and were endorsed by the WHO–IUIS Nomenclature Subcommittee for IG and TR (48, 49). The IMGT^® IG and TR gene names are the official international reference and, as such, have been entered in IMGT/GENE-DB (10), in the Genome Database (GDB) (85), in LocusLink at the National Center for Biotechnology Information (NCBI) USA (86), in Entrez Gene (NCBI) when this database (now designated as “Gene”) superseded LocusLink (87), in NCBI MapViewer, in Ensembl at the European Bioinformatics Institute (EBI) (88), and in the Vertebrate Genome Annotation (Vega) Browser (89) at the Wellcome Trust Sanger Institute (UK). HGNC, Gene NCBI, Ensembl, and Vega have direct links to IMGT/GENE-DB (10). IMGT^® human IG and TR genes were also integrated in IMGT-ONTOLOGY on the NCBO BioPortal and, on the same site, in the HUGO ontology and in the National Cancer Institute (NCI) Metathesaurus. AA sequences of human IG and TR constant genes (e.g., Homo sapiens IGHM, IGHG1, IGHG2) were provided to UniProt in 2008. Since 2007, IMGT^® gene and allele names have been used for the description of the therapeutic mAb and FPIA of the WHO–INN Programme (50, 51).

The basis for the nomenclature of the MH of newly sequenced genomes has been set up on the same concepts. In IMGT^®, MHC refers to the locus, which indeed is a complex of genes, particularly in the higher vertebrates. In contrast, the letter “C” is dropped when referring to individual genes and proteins. Thus, the class I genes are designated as MH1 whereas the class II genes are designated as MH2. The IMGT nomenclature, with the MH1 and MH2 groups, was first used for the Oncorhynchus mykiss genes [see footnote text 1, IMGT Repertoire (MH) > Locus and genes > Gene tables]. It can also be applied to the human genes in databases that deal with humans and other vertebrate species (for example, Homo sapiens MH1-A for HLA-A).

NUMEROTATION: IMGT unique numbering and IMGT Collier de Perles

The IMGT-ONTOLOGY NUMEROTATION axiom is acknowledged as the “IMGT^® Rosetta stone” that has bridged the biological and computational spheres in bioinformatics (40). The IMGT^® concepts of numerotation comprise the IMGT unique numbering (8, 62–66) and its graphical 2D representation the IMGT Collier de Perles (67–70). Developed for and by the “domain,” these concepts integrate sequences, structures, and interactions into a standardized domain-centric knowledge for functional genomics. The IMGT unique numbering has been defined for the variable V-domain (V-DOMAIN of the IG and TR, and V-LIKE-DOMAIN of IgSF other than IG and TR) (62–64), the constant C-domain (C-DOMAIN of the IG and TR, and C-LIKE-DOMAIN of IgSF other than IG and TR) (65), and the groove G-domain (G-DOMAIN of the MH, and G-LIKE-DOMAIN of MhSF other than MH) (8, 90, 91). Thus the IMGT unique numbering and IMGT Collier de Perles provide a definitive and universal system across species including invertebrates, for the sequences and structures of the V, C, and G-domains of IG, TR, MH, IgSF, and MhSF (66, 70, 92, 93).

V-domain IMGT^® definitive system

V-domain definition and main characteristics

In the IMGT^® definitive system, the V-domain includes the V-DOMAIN of the IG and of the TR, which corresponds to the V–J-REGION or V–D–J-REGION encoded by V–(D)–J rearrangements (2, 3), and the V-LIKE-DOMAIN of the IgSF other than IG and TR. The V-domain description of any receptor, any chain, and any species is based on the IMGT unique numbering for V-domain (V-DOMAIN and V-LIKE-DOMAIN) (62–64, 66).

A V-domain (Figure 3) comprises about 100 AA and is made of nine antiparallel beta strands (A, B, C, C′, C″, D, E, F, and G) linked by beta turns (AB, CC′, C″D, DE, and EF), and three loops (BC, C′C″, and FG), forming a sandwich of two sheets [ABED] [GFCC′C″] (62–64, 66). The sheets are closely packed against each other through hydrophobic interactions giving a hydrophobic core, and joined together by a disulfide bridge between a first highly conserved cysteine (1st-CYS) in the B strand (in the first sheet) and a second equally conserved cysteine (2nd-CYS) in the F strand (in the second sheet) (62–64, 66).

Figure 3

V-domain strands and loops (FR-IMGT and CDR-IMGT)

The V-domain strands and loops and their delimitations and lengths, based on the IMGT unique numbering for V-domain (62–64, 66), are shown in Table 4. In the IG and TR V-DOMAIN, the three hypervariable loops BC, C′C″, and FG involved in the ligand recognition (native antigen for IG and pMH for TR) are designated complementarity determining regions (CDR-IMGT), whereas the strands form the framework region (FR-IMGT), which includes FR1-IMGT, FR2-IMGT, FR3-IMGT, and FR4-IMGT (Table 4). In the IMGT^® definitive system, the CDR-IMGT have accurate and unambiguous delimitations in contrast to the CDR described in the literature. Correspondences between the IMGT unique numbering and other numberings, e.g., Kabat (94) or Chothia (95), are available in the IMGT Scientific chart. The correspondences with these previous and heterogeneous numberings are useful for the interpretation of previously published data but nowadays the usage of these numberings has become obsolete owing to the development of immunoinformatics based on the IMGT^® standards (8, 62–70) (IMGT^®, see footnote text 1, IMGT Scientific chart > Numbering > Correspondence between V numberings).

Table 4

V-domain strands and loops^a	IMGT positions^b	Lengths^c	Characteristic IMGT Residue@Position^d	V-DOMAIN FR-IMGT and CDR-IMGT
A-STRAND	1–15	15 (14 if gap at 10)		FR1-IMGT
B-STRAND	16–26	11	1st-CYS 23
BC-LOOP	27–38	12 (or less)		CDR1-IMGT
C-STRAND	39–46	8	CONSERVED-TRP 41	FR2-IMGT
C′-STRAND	47–55	9
C′C″-LOOP	56–65	10 (or less)		CDR2-IMGT
C″-STRAND	66–74	9 (or 8 if gap at 73)		FR3-IMGT
D-STRAND	75–84	10 (or 8 if gaps at 81, 82)
E-STRAND	85–96	12	Hydrophobic 89
F-STRAND	97–104	8	2nd-CYS 104
FG-LOOP	105–117	13 (or less, or more)		CDR3-IMGT
G-STRAND	118–128	11 (or 10)	V-DOMAIN J-PHE 118 or J-TRP 118^e	FR4-IMGT

V-domain strands and loops, IMGT positions, and lengths, based on the IMGT unique numbering for V-domain (V-DOMAIN and V-LIKE-DOMAIN).

^aIMGT^® labels (concepts of description) are written in capital letters (no plural) (60). Beta turns (AB, CC′, C″D, DE, or EF) are individualized only if they have additional AA compared to the standard description. If not, they are included in the strands.

^bBased on the IMGT unique numbering for V-domain (V-DOMAIN and V-LIKE-DOMAIN) (62–64, 66).

^cIn number of AA (or codons).

^dIMGT Residue@Position is a given residue (usually an AA) or a given conserved property AA class, at a given position in a domain, based on the IMGT unique numbering (66).

^eIn the IG and TR V-DOMAIN, the G-STRAND (or FR4-IMGT) is the C-terminal part of the J-REGION, with J-PHE or J-TRP 118, and the canonical motif F/W–G–X–G at positions 118–121 (2, 3). The JUNCTION refers to the CDR3-IMGT plus the two anchors 2nd-CYS 104 and J-PHE or J-TRP 118 (63, 64). The JUNCTION (positions 104–118) is therefore two AA longer than the corresponding CDR3-IMGT (positions 105–117) (63, 64).

For a V-domain, the BC loop (or CDR1-IMGT in a V-DOMAIN) encompasses positions 27–38, the C′C″ loop (or CDR2-IMGT in a V-DOMAIN) positions 56–65, and the FG loop (or CDR3-IMGT) positions 105–117. In a V-DOMAIN, the CDR3-IMGT encompasses the V–(D)–J junction that results from a V–J or V–D–J rearrangement (2, 3) and is more variable in sequence and length than the CDR1-IMGT and CDR2-IMGT that are encoded by the V gene region only. For CDR3-IMGT of length >13 AA, additional IMGT positions are added at the top of the loop between 111 and 112 (Table 5).
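
The delimitations of Table 4 lend themselves to a simple lookup. The Python sketch below maps an integer IMGT position to its strand or loop and to the corresponding FR-IMGT or CDR-IMGT of a V-DOMAIN. The data structure and function name are hypothetical, the ASCII characters ' and " replace the prime signs in the labels, and additional positions such as 111.1 are not handled.

```python
# (label, first position, last position, FR-IMGT/CDR-IMGT), from Table 4.
V_DOMAIN_LAYOUT = [
    ("A-STRAND", 1, 15, "FR1-IMGT"),
    ("B-STRAND", 16, 26, "FR1-IMGT"),
    ("BC-LOOP", 27, 38, "CDR1-IMGT"),
    ("C-STRAND", 39, 46, "FR2-IMGT"),
    ("C'-STRAND", 47, 55, "FR2-IMGT"),
    ("C'C\"-LOOP", 56, 65, "CDR2-IMGT"),
    ('C"-STRAND', 66, 74, "FR3-IMGT"),
    ("D-STRAND", 75, 84, "FR3-IMGT"),
    ("E-STRAND", 85, 96, "FR3-IMGT"),
    ("F-STRAND", 97, 104, "FR3-IMGT"),
    ("FG-LOOP", 105, 117, "CDR3-IMGT"),
    ("G-STRAND", 118, 128, "FR4-IMGT"),
]


def locate(position):
    """Return (strand or loop label, FR/CDR label) for an IMGT position."""
    for label, first, last, region in V_DOMAIN_LAYOUT:
        if first <= position <= last:
            return label, region
    raise ValueError(f"position {position} is outside the V-domain numbering 1-128")


for pos in (23, 41, 104, 118):  # 1st-CYS, CONSERVED-TRP, 2nd-CYS, J-PHE/J-TRP
    print(pos, locate(pos))
```

Because the numbering is fixed, the same lookup applies to any receptor, chain, or species; only the number of occupied positions varies.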

Table 5

CDR3-IMGT lengths	IMGT additional positions for CDR3-IMGT length >13 AA^a
21	111	111.1	111.2	111.3	111.4	112.4	112.3	112.2	112.1	112
20	111	111.1	111.2	111.3	–	112.4	112.3	112.2	112.1	112
19	111	111.1	111.2	111.3	–	–	112.3	112.2	112.1	112
18	111	111.1	111.2	–	–	–	112.3	112.2	112.1	112
17	111	111.1	111.2	–	–	–	–	112.2	112.1	112
16	111	111.1	–	–	–	–	–	112.2	112.1	112
15	111	111.1	–	–	–	–	–	–	112.1	112
14	111	–	–	–	–	–	–	–	112.1	112

IMGT additional positions for CDR3-IMGT.

^aFor CDR3-IMGT length >13 AA, IMGT additional positions are created between positions 111 and 112 at the top of the CDR3-IMGT loop in the following order 112.1, 111.1, 112.2, 111.2, 112.3, 111.3, etc., and as many positions can be added as necessary for very long CDR3-IMGT. For CDR3-IMGT length <13 AA (not shown), IMGT gaps are created classically from the top of the loop, in the following order 111, 112, 110, 113, 109, 114, etc. (IMGT^®, http://www.imgt.org, IMGT Scientific chart > Numbering).
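
The rules of Table 5 and its footnote can be turned into a small Python sketch that lists the occupied IMGT positions for a CDR3-IMGT of a given length. The function name is hypothetical. The order of additional positions follows the footnote; the gap order beyond 114 (108, 115, 107, ...) is an extrapolation of the "etc." in the footnote and should be treated as an assumption.

```python
# Gap order for short CDR3-IMGT: the first six values are from the footnote,
# the rest extend the same alternating pattern (assumption).
GAP_ORDER = [111, 112, 110, 113, 109, 114, 108, 115, 107, 116, 106, 117, 105]


def cdr3_imgt_positions(length):
    """Return the occupied IMGT positions (as strings) of a CDR3-IMGT."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if length <= 13:
        # Remove positions from the top of the loop, in the gap order.
        gaps = set(GAP_ORDER[:13 - length])
        return [str(p) for p in range(105, 118) if p not in gaps]
    extra = length - 13
    n_112 = (extra + 1) // 2  # 112.1 is created first, so 112.x gets the odd one
    n_111 = extra // 2
    left = [str(p) for p in range(105, 112)]
    left += [f"111.{i}" for i in range(1, n_111 + 1)]
    right = [f"112.{i}" for i in range(n_112, 0, -1)]
    right += [str(p) for p in range(112, 118)]
    return left + right


print(cdr3_imgt_positions(15)[6:10])  # ['111', '111.1', '112.1', '112']
```

For a length of 21, the slice from 111 to 112 reproduces the first row of Table 5: 111, 111.1, 111.2, 111.3, 111.4, 112.4, 112.3, 112.2, 112.1, 112.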

IMGT Colliers de Perles

The loops and strands are visualized in the IMGT Colliers de Perles (67–70), which can be displayed on one layer (closer to the AA sequence) or on two layers (closer to the 3D structure) (Figure 3). The three loops, BC, C′C″, and FG (or CDR1-IMGT, CDR2-IMGT, and CDR3-IMGT for a V-DOMAIN) are delimited by the IMGT anchors, which are shown as squares in the IMGT Colliers de Perles. IMGT anchors are positions that belong to strands and represent anchors for the loops of the V-domains. IMGT anchors are the key and original concept of IMGT^®, which definitively solved the ambiguous situation of different CDR lengths and delimitations found in the literature. The six anchors of a V-domain are positions 26 and 39 (anchors of the BC loop or CDR1-IMGT in V-DOMAIN), 55 and 66 (anchors of the C′C″ loop or CDR2-IMGT in V-DOMAIN), and 104 and 118 (anchors of the FG loop or CDR3-IMGT in V-DOMAIN). The CDR3-IMGT anchors are highly conserved: they are C104 (2nd-CYS, in the F strand) and F118 or W118 (J-PHE or J-TRP, in the G strand). The JUNCTION of an IG or TR V-DOMAIN includes the anchors 104 and 118, and is therefore two AA longer than the corresponding CDR3-IMGT (positions 105–117).

In biological data, the lengths of the loops and strands are given by the number of occupied positions [unoccupied positions or “IMGT gaps” are represented with hatches in the IMGT Colliers de Perles (Figure 3) or by dots in alignments]. The CDR-IMGT lengths are given in number of AA (or codons), in brackets and separated by dots: for example, [9.6.9] means that the BC, C′C″, and FG loops (or CDR1-IMGT, CDR2-IMGT, and CDR3-IMGT for a V-DOMAIN) have a length of 9, 6, and 9 AA (or codons), respectively. Similarly, [25.17.38.11] means that the FR1-IMGT, FR2-IMGT, FR3-IMGT, and FR4-IMGT have a length of 25, 17, 38, and 11 AA (or codons), respectively. Together, the four FR of a VH domain usually comprise 91 AA and the individual FR-IMGT lengths are [25.17.38.11], whereas the four FR of a VL domain usually comprise 89 AA and the individual FR-IMGT lengths are [26.17.36.10].
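The bracket notation is simple enough to parse mechanically. The following Python sketch uses a hypothetical helper, `parse_imgt_lengths`, and handles only the purely numeric form shown in this paragraph; it also confirms the FR totals quoted above.

```python
def parse_imgt_lengths(notation):
    """Parse a bracketed, dot-separated length string such as '[9.6.9]'.

    Hypothetical helper: only purely numeric entries are supported.
    """
    text = notation.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"expected a bracketed notation, got {notation!r}")
    return [int(part) for part in text[1:-1].split(".")]


if __name__ == "__main__":
    cdr_lengths = parse_imgt_lengths("[9.6.9]")
    vh_fr = parse_imgt_lengths("[25.17.38.11]")
    vl_fr = parse_imgt_lengths("[26.17.36.10]")
    print(cdr_lengths)   # CDR1-IMGT, CDR2-IMGT, CDR3-IMGT lengths
    print(sum(vh_fr))    # total FR length of a typical VH domain
    print(sum(vl_fr))    # total FR length of a typical VL domain
```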

Conserved AA

A V-domain has five characteristic AA at given positions (positions with bold (online red) letters in the IMGT Colliers de Perles). Four of them are highly conserved and hydrophobic (31) and are common to the C-domain: 23 (1st-CYS), 41 (CONSERVED-TRP), 89 (hydrophobic), and 104 (2nd-CYS). These AA contribute to the two major features shared by the V-domain and C-domain: the disulfide bridge (between the two cysteines 23 and 104) and the internal hydrophobic core of the domain (with the side chains of tryptophan W41 and AA 89). The fifth position, 118, is an anchor of the FG loop. In the V-domains of IgSF other than IG or TR, it is occupied by AA with diverse physicochemical properties (31). In contrast, in the IG and TR V-DOMAIN, position 118 is occupied by remarkably conserved AA, namely a phenylalanine or a tryptophan encoded by the J-REGION and therefore designated J-PHE or J-TRP 118. The bulky aromatic side chains of J-TRP and J-PHE are internally orientated and structurally contribute to the V-DOMAIN hydrophobic core (64).
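As a rough illustration, the five characteristic positions of an IG or TR V-DOMAIN can be checked with a few lines of Python. The function name and the input format (a mapping from IMGT position to one-letter AA) are hypothetical, and the set of AA treated as hydrophobic at position 89 is an assumption made for this sketch, not a definition taken from the text.

```python
# Assumption: AA counted as "hydrophobic" for position 89 in this sketch.
HYDROPHOBIC = set("ACFILMVW")


def check_v_domain_characteristic_positions(residues):
    """Check the five characteristic positions of an IG or TR V-DOMAIN.

    `residues` maps an IMGT position (int) to a one-letter AA code.
    Returns a dict of position -> True/False; a missing position is False.
    """
    return {
        23: residues.get(23) == "C",            # 1st-CYS
        41: residues.get(41) == "W",            # CONSERVED-TRP
        89: residues.get(89) in HYDROPHOBIC,    # hydrophobic AA
        104: residues.get(104) == "C",          # 2nd-CYS
        118: residues.get(118) in ("F", "W"),   # J-PHE or J-TRP
    }


if __name__ == "__main__":
    example = {23: "C", 41: "W", 89: "L", 104: "C", 118: "F"}
    print(check_v_domain_characteristic_positions(example))
```

Because the check for position 118 applies only to IG and TR, a V-LIKE-DOMAIN of another IgSF protein would be expected to fail it without being abnormal.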

Genomic delimitation

A last criterion used in the IMGT^® definitive system for the characterization of a V-domain is its delimitation, which takes into account the exon delimitations whenever appropriate. The exon rule is not used for the delimitation of the 5′ end of the first N-terminal domain of proteins with a leader (this includes the V-DOMAIN of the IG and TR chains). In those cases, the 5′ end of the first N-terminal domain of the chain corresponds to the proteolytic site between the leader (L-REGION) and the coding region of the mature protein. The IG and TR V-DOMAIN is therefore delimited at its 5′ end by a proteolytic site and at its 3′ end, at the genomic level, by the splicing site of the J-REGION (60). This IMGT^® genomic approach integrates the strands A and G, in contrast to structural alignments that usually lack these strands due to their poor structural conservation, and thus bridges the gap between genomic data (exon) and 3D structure (domain).

C-domain IMGT^® definitive system

C-domain definition and main characteristics

In the IMGT^® definitive system, the C-domain includes the C-DOMAIN of the IG and of the TR (2, 3) and the C-LIKE-DOMAIN of the IgSF other than IG and TR. The C-domain description of any receptor, any chain, and any species is based on the IMGT unique numbering for C-domain (C-DOMAIN and C-LIKE-DOMAIN) (65, 66).

A C-domain (Figure 4) comprises about 90–100 AA and is made of seven antiparallel beta strands (A, B, C, D, E, F, and G), linked by beta turns (AB, DE, and EF), a transverse strand (CD), and two loops (BC and FG), forming a sandwich of two sheets (ABED and GFC) (65, 66). A C-domain has a topology and a three-dimensional structure similar to those of a V-domain, but without the C′ and C″ strands and the C′C″ loop, which is replaced by a transverse CD strand (65).

Figure 4

C-domain strands and loops

The C-domain strands, turns, and loops and their delimitations and lengths, based on the IMGT unique numbering for C-domain (65, 66), are shown in Table 6. Correspondences between the IMGT unique numbering and other numberings (Eu, Kabat) are available in the IMGT Scientific chart. The correspondences with these previous numberings are useful for the interpretation of previously published data but, as for the V-domain, the usage of these previous numberings has become obsolete owing to the development of immunoinformatics based on the IMGT^® standards (8, 62–70) (IMGT^®, see footnote 1, IMGT Scientific chart > Numbering > Correspondence between C numberings).

Table 6

C-domain strands, turns, and loops^a	IMGT positions^b	Lengths^c	Characteristic IMGT Residue@Position^d
A-STRAND	1–15	15 (14 if gap at 10)
AB-TURN	15.1–15.3	0–3
B-STRAND	16–26	11	1st-CYS 23
BC-LOOP	27–31, 34–38	10 (or less)
C-STRAND	39–45	7	CONSERVED-TRP 41
CD-STRAND	45.1–45.9	0–9
D-STRAND	77–84	8 (or 7 if gap at 82)
DE-TURN	84.1–84.7, 85.1–85.7	0–14
E-STRAND	85–96	12	Hydrophobic 89
EF-TURN	96.1–96.2	0–2
F-STRAND	97–104	8	2nd-CYS 104
FG-LOOP	105–117	13 (or less, or more)
G-STRAND	118–128	11 (or less)

C-domain strands, turns, and loops, IMGT positions, and lengths, based on the IMGT unique numbering for C-domain (C-DOMAIN and C-LIKE-DOMAIN).

^aIMGT^® labels (concepts of description) are written in capital letters (no plural) (60).

^bBased on the IMGT unique numbering for C-domain (C-DOMAIN and C-LIKE-DOMAIN) (65, 66).

^cIn number of amino acids (AA) (or codons).

^dIMGT Residue@Position is a given residue (usually an AA) or a given conserved property AA class, at a given position in a domain, based on the IMGT unique numbering (66).
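Table 6 can be read as a lookup from an IMGT position to a C-domain label. The Python sketch below is a minimal illustration under stated assumptions: the names `C_DOMAIN_INTEGER_RANGES`, `C_DOMAIN_ADDITIONAL`, and `c_domain_label` are hypothetical, positions are passed as strings such as "45.3", and any position not listed in the table (for example, additional positions in a longer FG-LOOP) returns `None` rather than a guessed label.

```python
# Integer position ranges (inclusive) taken from Table 6.
C_DOMAIN_INTEGER_RANGES = [
    ("A-STRAND", 1, 15),
    ("B-STRAND", 16, 26),
    ("BC-LOOP", 27, 31),
    ("BC-LOOP", 34, 38),
    ("C-STRAND", 39, 45),
    ("D-STRAND", 77, 84),
    ("E-STRAND", 85, 96),
    ("F-STRAND", 97, 104),
    ("FG-LOOP", 105, 117),
    ("G-STRAND", 118, 128),
]

# Additional (dotted) positions from Table 6: base -> (label, highest suffix).
C_DOMAIN_ADDITIONAL = {
    15: ("AB-TURN", 3),
    45: ("CD-STRAND", 9),
    84: ("DE-TURN", 7),
    85: ("DE-TURN", 7),
    96: ("EF-TURN", 2),
}


def c_domain_label(position):
    """Return the Table 6 label for a C-domain position string, or None."""
    base_text, _, suffix_text = position.partition(".")
    base = int(base_text)
    if suffix_text:
        suffix = int(suffix_text)
        entry = C_DOMAIN_ADDITIONAL.get(base)
        if entry is not None and 1 <= suffix <= entry[1]:
            return entry[0]
        return None  # dotted position not listed in Table 6
    for label, start, end in C_DOMAIN_INTEGER_RANGES:
        if start <= base <= end:
            return label
    return None  # integer position not listed in Table 6


if __name__ == "__main__":
    for pos in ("23", "41", "45.3", "85.1", "89", "104", "32"):
        print(pos, c_domain_label(pos))
```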