Preparing Unbiased T-Cell Receptor and Antibody cDNA Libraries for the Deep Next Generation Sequencing Profiling

High-throughput sequencing has the power to reveal the nature of adaptive immunity as represented by the full complexity of T-cell receptor (TCR) and antibody (IG) repertoires, but is at present severely compromised by the quantitative bias, bottlenecks, and accumulated errors that inevitably occur in the course of library preparation and sequencing. Here we report an optimized protocol for the unbiased preparation of TCR and IG cDNA libraries for high-throughput sequencing, starting from thousands or millions of live cells in an investigated sample. Critical points to control are revealed, along with tips that allow researchers to minimize quantitative bias, accumulated errors, and cross-sample contamination at each stage, and to enhance the subsequent bioinformatic analysis. The protocol is simple, reliable, and can be performed in 1–2 days.


INTRODUCTION
Next generation sequencing (NGS) technologies have opened a breathtaking opportunity to perform deep analysis and comparative studies of the T-cell receptor (TCR) and antibody (IG) repertoires of human donors and model animals, as well as of various sorted, separated, or cultured lymphocyte subsets of interest (1–13). Still, rational NGS analysis of such immune repertoires critically depends on the library preparation protocol, which starts from a lymphocyte/PBMC sample and ends with the amplification of individual TCR/IG-encoding molecules on the solid phase of a sequencing machine. Multiple sampling bottlenecks, PCR biases, and cross-contamination at different stages lie in wait to trick researchers on their way to deep, clear, and congruent data.
While studying autoimmunity and hematopoietic stem cell transplantation therapy (10, 14–17), we have optimized a cDNA-based protocol that allows unbiased pre-sequencing amplification of human and murine alpha- and beta-TCR, as well as IG heavy chain, gene libraries. The protocol employs a specific oligonucleotide to prime cDNA synthesis and the template switching effect to form a universal 5′-adapter and to introduce a sample barcode at the very first stage of library preparation. The subsequent two-step PCR amplification is performed with universal primer pairs for the whole library, using step-out PCR plus the PCR-suppression effect (18) on the 5′-end and nested PCR (19) on the 3′-end of the library (16).
This approach allows efficient and unbiased amplification of millions of TCR/IG mRNA molecules in only 27–30 PCR cycles (21–24 considering the dilution factor, see below), thus providing sufficient starting material for deep NGS analysis of complex lymphocyte samples. The current protocol is optimal for sequencing on the Illumina MiSeq/HiSeq and Roche 454 platforms.
Here we report the upgraded and tested protocol in a ready-to-use format with the technical details required for the method to be easily and uniformly reproduced in any laboratory.

ADVANTAGES OF cDNA LIBRARIES AND 5′-TEMPLATE SWITCH
Starting with cDNA synthesis using 5′-template switching (16,20,21) has at least two decisive advantages in comparison with the genomic DNA-based approaches (2,12).
First, the whole diversity of variable chains (up to approximately 100 different V gene segment variants) can be amplified using just a pair (for TCRs) or a simple multiplex set (for IGs) of oligonucleotides specific to the template switch adapter on the 5′-end and to the constant gene segments on the 3′-end of the library (Figure 1).
In contrast, the approaches starting with genomic DNA require multiplex primer sets to be used both at the 5′ V gene segment end and at the 3′ intron/J-segment end of the library (2). Moreover, a subsequent nested PCR amplification, which requires another set of multiplex primers, can be necessary to obtain a pure TCR or IG library from genomic DNA. Multiplexing inevitably leads to dramatic bias in the relative amplification efficiency of different variable segments and thus to the loss of quantitative information and the complete loss of some rare clonotypes (10,16,22,23).

FIGURE 1 | Flow-chart of the library preparation protocol from RNA to NGS-ready PCR product. XXXXX: optional sample barcodes (see Sample Barcoding in Appendix for details and Supplementary Material for barcodes). *For TCR alpha/beta profiling with 100 nt sequencing length, multiplexed J-segment-specific primers should be used as a reverse primer in the second PCR amplification step as described in section "Next Generation Sequencing Options."
Second, abundant mRNA copies encoding TCR or IG chains make up a substantial portion of the total lymphocyte RNA. In practice, this results in efficient amplification of a deep library starting from 10^6 mRNA molecules from 3 µg of total RNA purified from three million PBMCs (10). The cDNA synthesis reaction can be performed in a volume of 10–15 µl in a single PCR tube (see Protocol), allowing multiple parallel experiments to be carried out.
In contrast, amplification of the TCR/IG library starting from 15 µg of genomic DNA of the same three million PBMC sample requires PCR to be carried out in larger volumes (since no more than 0.5 µg of genomic DNA can be taken for a 50 µl PCR reaction), and still does not provide comparable PCR efficiency: a substantial portion of the original sample diversity is lost due to the stochastic character of PCR, which inevitably misses rare molecules.
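The volume argument above can be made concrete with a short calculation. The sketch below is only an illustration of the arithmetic stated in the text (15 µg of genomic DNA, at most 0.5 µg per 50 µl reaction); the function name and its parameters are hypothetical and not part of the protocol.

```python
import math


def gdna_pcr_reactions(total_gdna_ug, max_ug_per_reaction=0.5, reaction_volume_ul=50):
    """Return (number of reactions, total PCR volume in µl) needed to use all gDNA.

    The defaults are the limits quoted in the text; the function itself is an
    illustrative helper, not part of the published protocol.
    """
    reactions = math.ceil(total_gdna_ug / max_ug_per_reaction)
    return reactions, reactions * reaction_volume_ul


reactions, volume_ul = gdna_pcr_reactions(15)
print(f"{reactions} reactions, {volume_ul} µl total")  # 30 reactions, 1500 µl total
```

Under these numbers, the genomic DNA of three million PBMCs needs about 30 separate 50 µl reactions, whereas the cDNA route fits into a single 10–15 µl reaction.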

LIMITATIONS OF THE USE OF cDNA LIBRARIES AND 5′-TEMPLATE SWITCH
We have recently demonstrated that the cDNA-based template switching protocol is highly quantitative at the ensemble level, that is, at the level of relative TRBV gene segment frequencies (10). Indeed, PCR bias is minimized, and the whole approach is quite quantitative with respect to the relative abundance of mRNA molecules at the start and of sequencing reads at the end of the analysis pipeline. However, it should be noted that individual T-cell or B-cell clones can potentially be characterized by higher or lower expression levels of TCR or IG mRNA (24,25). This limitation should be kept in mind when using NGS data to estimate the relative abundance of particular lymphocyte clones.

Frontiers in Immunology | T Cell Biology

It is generally important that the cells being analyzed "feel fine" and contain a sufficient amount of TCR/IG mRNA. Therefore, it is preferable to purify total RNA from a freshly isolated cell sample for the native analysis. For frozen samples, overnight incubation of thawed cells in the presence of IL-2 (Roche, 15 U/ml) leads to an at least twofold increase in TCR gene RNA expression levels (our unpublished observations).
Differences in the efficiency of reverse transcription and template switching may lead to a different number of cDNA molecules read per T or B cell. Therefore, it is important to use the same reverse transcriptase and 5′-template switch adapter and to carry out all procedures under identical experimental conditions in order to obtain results that can be accurately compared at the deep level (e.g., in an analysis of the relative diversity of naïve T cells or a PBMC sample).

EXPERIMENTAL DESIGN: CELLS, NUMBERS, AND BOTTLENECKS
The desired depth of TCR or IG repertoire analysis depends on the particular experimental questions raised. For example, application of the current protocol for the deep analysis of a PBMC sample containing 10^6 T cells will provide quantitative data on those TCR clonotypes that constitute at least 0.01–0.1% of all T cells in the sample (100–1000 T cells) (10). The majority (>95%) of TCR clonotypes constituting at least 0.001% (at least 10 T cells) will be sequenced, while approximately 20–40% of TCR clonotypes represented by a single T cell in the sample may be lost (estimated from our quantitative experiments; the value depends on the reverse transcriptase used). Preferably, all the synthesized cDNA should be used for the first PCR amplification step. The second PCR should yield a sufficient amount of target PCR product in a reasonable number of amplification cycles (see Protocol). The desired number of output CDR3-containing high-quality sequencing reads is at least 2 × 10^6 per sample (see Protocol and Expected Results).
Much smaller bottleneck limits should be quite sufficient for the majority of experimental tasks concerning more specific lymphocyte subpopulations characterized by lower diversity [such as sorted antigen-specific T cells (26) or B cells (27)]. For example, 10,000 lymphocytes, 10 ng of high-quality total RNA, no more than 21 cycles of the first PCR, no more than 20 cycles of the second PCR (see Protocol and Expected Results), and at least 30,000 CDR3-containing sequencing reads (ideally 100,000 reads to achieve over-sequencing) per sample may be sufficient to identify most TCR/IG clonotypes in a low-complexity sample. It is preferable to use a reverse transcriptase with high 5′-template switching efficiency (e.g., SMARTScribe, Clontech) when small cell samples or RNA amounts are analyzed.
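The thresholds quoted above follow from simple arithmetic, sketched below. The first function converts a clonotype frequency into a cell count. The second is a deliberately simplified model that is an assumption of this sketch and not taken from the protocol: it treats the loss of each cell's TCR molecules as independent, with a per-cell loss rate equal to the upper bound of the 20–40% range given for single-cell clonotypes. Both function names are hypothetical.

```python
def cells_for_frequency(total_cells, percent):
    """Number of cells represented by a clonotype at `percent` of the sample."""
    return round(total_cells * percent / 100)


def loss_probability(clone_size, per_cell_loss):
    """Chance that a clone of `clone_size` cells is missed entirely.

    Assumption of this sketch: each cell is lost independently with
    probability `per_cell_loss`.
    """
    return per_cell_loss ** clone_size


total_t_cells = 10**6
for percent in (0.1, 0.01, 0.001):
    print(f"{percent}% of the sample = {cells_for_frequency(total_t_cells, percent)} T cells")

for clone_size in (1, 10):
    print(f"clone of {clone_size} cell(s): loss probability {loss_probability(clone_size, 0.4):.6f}")
```

Under this independence assumption, a clonotype of 10 cells is missed far less often than 5% of the time, which is consistent with the >95% recovery stated for clonotypes at 0.001%.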

EXPERIMENTAL DESIGN: SAMPLE BARCODES, MULTIPLEXED SEQUENCING, CROSS-CONTAMINATION
Since as few as 30,000 sequencing reads per sample may be sufficient for many experimental tasks in immune repertoire profiling, and, for example, a paired end 150 bp Illumina MiSeq run can produce more than five million good-quality TCR/IG CDR3 reads, a researcher may often be interested in sequencing multiple samples in a single run. At the same time, ligating Illumina sample barcodes to 10 or more samples is rather expensive and laborious. In our design, sample barcodes can be introduced within the 5′-template switch adapter during cDNA synthesis and/or during the second PCR amplification step (see Figure 1). The barcoded samples can then be combined in equal proportions (or unequal ones, if more reads are desired for some samples), and Illumina adapters can be ligated to the resulting pooled PCR library of approximately 500–600 bp length (see Protocol and Sample Barcoding in Appendix).
Sample barcodes on both ends of the library make it possible to eliminate most cross-contamination between samples sequenced in the same run/lane, which may occur during amplification of the combined sample after adapter ligation and potentially in the course of bridge amplification on the solid phase of the sequencing machine.
To avoid contamination at the earlier stages of pre-sequencing library preparation, all procedures, including RNA purification, cDNA synthesis, and preparation of the first and second PCR, should be performed in separate clean PCR boxes.
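As a rough planning aid, the read budget above can be turned into a number of samples per run. The sketch below uses only the figures quoted in the text (about five million CDR3 reads per MiSeq run, 30,000 to 100,000 reads per sample); the function name is hypothetical, and real yields should be confirmed with the sequencing facility.

```python
def samples_per_run(run_reads, reads_per_sample):
    """How many samples fit into one run if each gets at least `reads_per_sample`."""
    return run_reads // reads_per_sample


miseq_cdr3_reads = 5_000_000  # figure quoted in the text for a paired end MiSeq run

print(samples_per_run(miseq_cdr3_reads, 30_000))   # minimal depth per sample
print(samples_per_run(miseq_cdr3_reads, 100_000))  # over-sequencing depth per sample
```

With these inputs, a single run accommodates on the order of 50 to 166 low-complexity samples, which is why introducing barcodes during library preparation is attractive compared with ligating a separate Illumina barcode to each sample.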

Use the standard Trizol (Invitrogen) or QIAzol (QIAGEN) protocol, or another analogous protocol, for RNA isolation. Alternatively, use the RNeasy kit (QIAGEN) or another column-based RNA isolation method. Depending on the starting material, consider the following RNA purification procedures:
A. For a small amount of whole blood (less than 100 µl), use 1 ml of Trizol or a specific RNA isolation kit (for example, QIAamp RNA Blood Kit, QIAGEN).
B. For a large amount of whole blood, preferably perform preliminary PBMC separation using standard procedures (Ficoll density gradient separation) and proceed to C.
C. For a large amount of white blood cells, use 1 ml of Trizol (per up to 10^7 cells). If using a column-based RNA isolation method for a large amount of cells, DNase treatment is necessary (according to the manufacturer's protocol), since large amounts of genomic DNA significantly affect cDNA synthesis.
D. For a small amount of cells (below 100,000 live cells, for example, sorted or bead-separated T or B cells), preferably isolate total RNA shortly after cell acquisition in order to minimize the loss of live cells and mRNA. When using the Trizol protocol, add a co-precipitant (e.g., Pellet Paint, Millipore) to the aqueous phase before adding isopropyl alcohol. It is highly desirable that the precipitate forms a single well-defined spot. This provides confidence that no portion of the material will be washed off by EtOH. Do not discard the EtOH used to wash the sample until you are convinced that library preparation has been performed successfully, since some portion of the RNA can remain in the EtOH.
All the cell/RNA isolation, cDNA synthesis, and first PCR preparation steps should be carried out in a clean DNA/RNase-free room or a PCR box, with no contact with any TCR-containing PCR products, to prevent contamination. Standard RNA handling precautions should be used (gloves, lab coats, filtered tips, and certified RNase-free reagents) to avoid RNA degradation.

www.frontiersin.org

MIXING THE BARCODED SAMPLES FOR MULTIPLEX SEQUENCING
In order to combine several PCR libraries with pre-introduced sample barcodes (see Figure 1 and Sample Barcoding in Appendix for possible options), perform the following:
13. Determine the concentration of each library using the QuBit Fluorometer.
14. Combine the samples in a sterile microcentrifuge tube proportionally to the desired number of sequencing reads per sample. The total amount of PCR product should be approximately 0.5–1 µg (check the required amount of PCR product with the sequencing center).
Alternatively, each sample can be ligated to sequencing adapters with different sample barcodes separately. Samples are mixed in desirable proportions before sequencing.
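The pooling step can be sketched as a small calculation. The example below assumes that all libraries have a similar length (approximately 500–600 bp, as stated earlier), so that mass proportions approximate molar proportions; the function name, the sample names, and the concentrations are hypothetical illustration values, not measured data.

```python
def pooling_volumes(concentrations_ng_per_ul, read_shares, total_ng):
    """Volume (µl) of each library to pool so that mass is proportional to read share.

    Assumes libraries of similar length, so mass ratio approximates molar ratio.
    """
    share_sum = sum(read_shares.values())
    volumes = {}
    for sample, conc in concentrations_ng_per_ul.items():
        mass_ng = total_ng * read_shares[sample] / share_sum
        volumes[sample] = mass_ng / conc
    return volumes


# Hypothetical QuBit readings (ng/µl) and relative read shares.
concentrations = {"sample_A": 20.0, "sample_B": 10.0, "sample_C": 25.0}
shares = {"sample_A": 2, "sample_B": 1, "sample_C": 1}  # A should get twice the reads

for sample, volume in pooling_volumes(concentrations, shares, total_ng=600).items():
    print(f"{sample}: {volume:.1f} µl")
```

Here sample_A contributes 300 ng and the other two 150 ng each, for a 600 ng pool that falls within the 0.5–1 µg range given above.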

NEXT GENERATION SEQUENCING OPTIONS
The design of the current protocol is optimized for Illumina paired end 2 × 150 nt (or 2 × 300 nt for IGs) sequencing as the most reliable way to obtain an unbiased TCR/IG repertoire. Paired end sequencing is obligatory when double sample barcodes (see Sample Barcoding in Appendix) and/or unique molecular identifiers (see Unique Molecular Identifiers in Appendix) are used. If no unique molecular identifiers are used and sample barcoding is used on the 3′-end of the library only (Figure 1), then single end sequencing is possible. However, only half of the obtained sequencing reads will contain the CDR3 region.
The protocol is also well suited to the Roche 454 sequencing technology. Frequent length errors in reading homopolymer stretches on this platform should be kept in mind, and proper error-correction algorithms should be used (10).
In order to use Illumina paired end 2 × 100 nt sequencing for TCRs, the only required modification is that multiplexed J-segment-specific primers should be used instead of the reverse primer in the second PCR amplification step. This minor multiplexing within a limited number of PCR cycles does not lead to substantial quantitative bias and allows sequencing to start closer to the CDR3 region of interest, as described (10,16). For the IG heavy chain, the universal J-segment-specific primer (Table 1) is already close to CDR3, and no modifications are necessary.
An alternative strategy is to introduce the sequences for the Illumina flow cell and custom sequencing primers in the course of amplification (not shown in Figure 1). Although potentially beneficial, this requires thorough design in cooperation with the sequencing center.
This protocol is not adapted for Ion Torrent, as these sequencing machines limit the maximal length of the analyzed sequencing library. A multiplex PCR mix for the V-segments is required for Ion Torrent library preparation, although this leads to significant quantitative bias during amplification (10).
To provide better cluster differentiation, ask the sequencing facility to spike the library with 10–30% of PhiX and/or design primers as described in Introducing Diversity at the Ends of the Library in Appendix.
Size selection on agarose gel after ligation of adapters is strongly recommended since even minor amounts of short nonspecific PCR products can significantly reduce target sequences output.

SOFTWARE ANALYSIS OF NGS DATA
Output NGS data on TCR/IG profiling contain numerous errors accumulated during reverse transcription, PCR amplification, and sequencing. For sequencing, a higher Phred quality score only means a lower frequency of sequencing errors; high sequence quality does not guarantee their absence. Generally, the more we sequence, the more erroneous TCR/IG variants we generate. Without appropriate error correction, NGS data can generate artificial TCR/IG diversity that exceeds the native diversity of a complex input library up to several-fold (10).
Several approaches have been proposed to correct PCR errors and high-quality sequencing errors in TCR datasets: filtering off low-frequency TCR variants (8), filtering off low-abundance variants that have a single mismatch compared with the major clonotypes (7), or correcting single-mismatch errors in germline segments by mapping to the major clonotypes (10). Low-quality sequences can be either filtered off (7,8) or mapped to the high-quality ones in order to rescue quantitative information (10). There are currently three available software packages for NGS TCR data analysis: the IMGT/HighV-QUEST web service, Decombinator (28), and our new software, named MiTCR (29). Note that IMGT/HighV-QUEST is limited to only 50,000–150,000 sequences per batch and is thus hardly suitable for the analysis of deep NGS profiling data. MiTCR is the only software package that considers sequence quality, performs correction of PCR and sequencing errors, and rescues low-quality sequencing data. Two basic error-correction modes are currently implemented, aiming either to eliminate the maximal number of accumulated errors or to preserve the maximal original TCR diversity, albeit with less efficient error correction. Moreover, the analysis parameters can be tuned by the user over a wide range to obtain the optimal result for a particular experimental task. The output format is a tab-delimited file or a special *.cls file for the MiTCR-Viewer software (Figure 2).

Table 1 (fragment) | Step_1: Step-out primer 2, from the Smart20, universal for all libraries.
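To make the single-mismatch strategy mentioned above more tangible, here is a minimal sketch of one way it could be implemented. It is not the algorithm used by MiTCR or by any of the cited tools: the function names, the `max_ratio` threshold, and the toy CDR3 nucleotide sequences are all assumptions made for illustration. A low-abundance variant is merged into a more abundant clonotype when the two differ by exactly one substitution and the variant's count is at most `max_ratio` of the parent's count.

```python
def differs_by_one(a, b):
    """True if two equal-length sequences differ by exactly one substitution."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def merge_single_mismatch(counts, max_ratio=0.05):
    """Merge rare single-mismatch variants into their abundant neighbours.

    `max_ratio` is a hypothetical threshold chosen for this sketch.
    """
    # Visit clonotypes from most to least abundant so parents are seen first.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    merged = {}
    for seq, n in ordered:
        parent = next(
            (p for p in merged if differs_by_one(p, seq) and n <= max_ratio * merged[p]),
            None,
        )
        if parent is None:
            merged[seq] = n        # kept as a separate clonotype
        else:
            merged[parent] += n    # reads rescued into the parent clonotype
    return merged


toy_counts = {
    "TGTGCCAGCAGT": 1000,  # major clonotype
    "TGTGCCAGCAGA": 12,    # one mismatch, rare: treated as an error
    "TGTGCCAGCTGT": 400,   # one mismatch, abundant: kept as a real clonotype
    "TGCGCTAGCAGT": 7,     # two mismatches from the major clonotype: kept
}
print(merge_single_mismatch(toy_counts))
```

The total read count is preserved, so the quantitative information carried by erroneous reads is reassigned and not discarded. A stricter threshold removes more errors at the risk of collapsing genuinely similar clonotypes, which mirrors the trade-off between the two error-correction modes described in the text.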

RNA
The quality and quantity of the obtained RNA are critical for library generation. The quality of total RNA is evaluated by two visible bands on electrophoresis (or the two highest peaks on an Agilent Bioanalyzer) corresponding to 18S and 28S rRNA. The relative amount of the two bands should be between 1:2 and 1:1. The expected yield is 1–3 µg of total RNA from one million PBMCs when using the Trizol protocol. If the starting material is limited (10,000 cells or less), the RNA should be used completely in one cDNA synthesis reaction without analysis by electrophoresis.

NUMBER OF PCR CYCLES
In order to preserve the natural TCR/IG diversity of the sample, it is important to minimize the number of PCR cycles used for library preparation. In our system, the maximal number of PCR cycles should be 18 for the first and 12 for the second amplification step when starting from 2 µg of total RNA. A clearly visible band is observed on electrophoresis after 12 cycles of the second PCR amplification (that is, at least 50 ng of PCR product per 25 µl reaction). For a minimal amount of starting material (below 10,000 cells), the maximum number of PCR cycles should be 21 for the first and 18–20 for the second amplification step. If more cycles are needed to obtain a visible band, this may indicate that a low number of molecules has successfully entered amplification, which makes detection of the CDR3 clonotypes of the input sample uncertain.

SEQUENCING OUTPUT AND ANALYSIS
With the use of the proposed protocol, at least three million high-quality CDR3-containing sequencing reads from a paired end MiSeq run and at least 100 million CDR3-containing sequencing reads from one lane of a paired end HiSeq 2000/2500 run are expected. The number of different clonotypes depends on the nature and amount of the starting material. For example, profiling of 5–10 million human PBMCs using 1/10 of a HiSeq 2000 Illumina lane (at least 10 million CDR3-containing reads) can yield from 0.5 to 2.5 million TCR beta CDR3 clonotypes after appropriate error correction.

APPENDIX

SAMPLE BARCODING
When sequencing multiple samples, it is recommended to introduce sample barcodes during the library preparation process. This minimizes cross-sample contamination and allows all samples to be treated as a single one when ligating Illumina adapters. Sample barcodes can be introduced at different stages (see Figure 1). One of the best ways is to use 5′-template switch adapters with built-in sample barcodes, thus labeling each sample at the very first library preparation step. Alternatively or additionally, a 5′-end sample barcode can be introduced at the 5′-end of the Step-out primer 2 (see Table 1). We also recommend introducing sample barcodes within the reverse primers used in the second amplification step (hum bcj, hum acj, mus bcj, mus acj, or IGHJ-r1, see Table 1). Using this approach, each sample is barcoded at both ends of the library. This is crucial when accurate comparison of two or more samples is required, as we observe different levels of end swapping between molecules in the course of the standard Illumina library preparation stage and presumably on the solid phase of the sequencer during bridge amplification. For your convenience, we have generated a list of 5-nucleotide sample barcodes that differ from each other by at least two nucleotides (see Supplementary Material), thus minimizing the chance of barcode misinterpretation if a single error occurs during sample preparation or sequencing.
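The property of the barcode list described above (any two barcodes differ in at least two positions) can be illustrated with a short sketch. The greedy generator below does not reproduce the authors' barcode list from the Supplementary Material; it simply builds a hypothetical set with the same property, and `assign_sample` shows the consequence: a read whose barcode carries a single substitution matches no sample and is left unassigned. All names here are assumptions of this sketch.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_barcodes(length=5, min_distance=2, limit=None):
    """Greedily pick barcodes that are at least `min_distance` apart from each other."""
    chosen = []
    for candidate in product("ACGT", repeat=length):
        barcode = "".join(candidate)
        if all(hamming(barcode, other) >= min_distance for other in chosen):
            chosen.append(barcode)
            if limit is not None and len(chosen) >= limit:
                break
    return chosen


def assign_sample(read_barcode, barcode_to_sample):
    """Exact-match assignment; a barcode with an error returns None."""
    return barcode_to_sample.get(read_barcode)


barcodes = greedy_barcodes(limit=24)
assert all(hamming(a, b) >= 2 for a, b in combinations(barcodes, 2))

barcode_to_sample = {bc: f"sample_{i + 1}" for i, bc in enumerate(barcodes)}
print(assign_sample(barcodes[0], barcode_to_sample))              # sample_1
corrupted = "C" + barcodes[0][1:]                                 # one substitution
print(assign_sample(corrupted, barcode_to_sample))                # None
```

With a minimum distance of two, a single error can be detected but not corrected, which is why affected reads are dropped here instead of being reassigned.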

UNIQUE MOLECULAR IDENTIFIERS
Unique molecular identifiers can be introduced as random oligonucleotides at the very first amplification (or cDNA synthesis) step of library preparation (30). Each molecule that successfully enters amplification becomes labeled by a unique combination of nucleotides, a molecular identifier. Thus, each TCR/IG CDR3 sequence variant in the output NGS dataset is characterized by a number of distinct molecular identifiers, which indicates the number of such cDNA molecules that have entered PCR amplification.
This approach makes it possible to correct the PCR bias that occurs during amplification and to count mRNA/cDNA molecules of each type directly, which makes the TCR/IG repertoire analysis even more quantitative. Unique molecular identifiers consisting of 12 random nucleotides (which give approximately 17 million unique variants) can be introduced within the 5′-template switch adapter (Table 1, SmartNNN). This template switch adapter also contains multiple deoxyuridine nucleotides. After cDNA synthesis, uracil-DNA glycosylase treatment eliminates SmartNNN, thus preventing possible exchange of unique molecular identifiers during the subsequent PCR amplification (30).
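The counting principle described above can be sketched in a few lines. The example first checks the size of the identifier space (4^12 for 12 random nucleotides) and then counts distinct identifiers per CDR3 variant. The function name and the toy reads are hypothetical, and the sketch deliberately ignores sequencing errors within the identifiers, which a real pipeline would need to handle.

```python
from collections import Counter, defaultdict

# 12 random positions, 4 possible nucleotides each.
umi_space = 4 ** 12
print(umi_space)  # 16777216, i.e. approximately 17 million variants


def count_molecules(reads):
    """Count distinct molecular identifiers per CDR3 variant.

    `reads` is an iterable of (umi, cdr3) pairs; errors in identifiers are ignored.
    """
    umis_per_cdr3 = defaultdict(set)
    for umi, cdr3 in reads:
        umis_per_cdr3[cdr3].add(umi)
    return {cdr3: len(umis) for cdr3, umis in umis_per_cdr3.items()}


# Toy data: clonotype X is over-amplified (5 reads from 2 molecules),
# clonotype Y is not (2 reads from 2 molecules).
toy_reads = [
    ("AAACCCGGGTTT", "CDR3_X"), ("AAACCCGGGTTT", "CDR3_X"), ("AAACCCGGGTTT", "CDR3_X"),
    ("TTTGGGCCCAAA", "CDR3_X"), ("TTTGGGCCCAAA", "CDR3_X"),
    ("ACGTACGTACGT", "CDR3_Y"), ("TGCATGCATGCA", "CDR3_Y"),
]

read_counts = Counter(cdr3 for _, cdr3 in toy_reads)
molecule_counts = count_molecules(toy_reads)
print(dict(read_counts))   # read-based abundance
print(molecule_counts)     # molecule-based abundance
```

In this toy case the read counts suggest that X is 2.5 times more abundant than Y, while the identifier counts show that both entered amplification as two molecules each.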

INTRODUCING DIVERSITY AT THE ENDS OF THE LIBRARY
A common problem with sequencing PCR products on Illumina platforms is the presence of the same nucleotides at the beginning of most sequencing reads. This can lead to failure of a sequencing run, as the Illumina software cannot discriminate adjacent clusters that produce identical fluorescent signals during the first several sequencing cycles. The common solution used by sequencing centers is to spike the sequencing library with a PhiX library containing random DNA fragments. However, in this case, the number of obtained target sequences is decreased by at least 30%. To avoid this problem, we introduce two to four random nucleotides ("N") at the 5′ end of the primers used in the second amplification step (see Table 1). Preferably, the number of "N" nucleotides flanking the library should differ between the samples mixed on the same Illumina lane, in order to generate additional diversity in the first sequencing cycles and to avoid identical nucleotides being present at the same positions, which may impair Illumina sequencing quality. If one sample is sequenced per Illumina lane and no sample barcodes are used, it is recommended to use a mixture of three otherwise identical primers, each containing a different number of "N" nucleotides at the 5′ end, e.g., (N)2Step1/(N)3Step1/(N)4Step1, and the same for the reverse primer (see Table 1).