TCRpred: incorporating T-cell receptor repertoire for clinical outcome prediction

The T-cell receptor (TCR) plays a critical role in recognizing antigen peptides and mediating the adaptive immune response against disease. High-throughput technologies have enabled sequencing of the TCR repertoire at the single-nucleotide level, allowing researchers to characterize TCR sequences at high resolution. TCR sequences provide important information about a patient's adaptive immune system and have the potential to improve clinical outcome prediction. However, incorporating TCR repertoire data into prediction is challenging: the data are unstructured and highly complex, and TCR sequences vary widely in composition and abundance across individuals. We introduce TCRpred, an analytic tool for incorporating the TCR repertoire into clinical outcome prediction. TCRpred can use features that can be extracted from the TCR amino acid sequences, as well as features that are hidden in those sequences and are hard to extract. Simulation studies show that the proposed approach performs well in predicting clinical outcome and tends to be more powerful than potential alternative approaches. We apply TCRpred to real cancer datasets and demonstrate its practical utility in clinical outcome prediction.

In this simulation study, we compared five methods, Basic-GLM, tcrLASSO, tcrRidge, DeepTCR, TCRpred B and TCRpred P, under different parameter settings. Since DeepTCR does not accommodate covariate adjustment, the data were generated without adjusting covariates.
We focused on the binary outcome. We considered k = 3 (i.e., 3-mers) and let the six most frequent 3-mers be the important features. The corresponding coefficients follow $\gamma_j \sim c_0 \cdot \mathrm{Uniform}(-1, 1)$ for $j = 1, \dots, 6$. We considered $c_0$ = 1, 3, and 5. The scaling parameter for K was set to τ = 5 and 8. The homology matrix K for data simulation was constructed from the BLOSUM62 substitution matrix. The proportion of cases was between 0.3 and 0.7 for the analyzed datasets.
For each replicate, we simulated 500 individuals for parameter estimation and 500 individuals for prediction evaluation. We used 500 replicates for each parameter setting. For DeepTCR, the V(D)J gene segment usages were included and the default argument settings were used; the 500 training individuals were split into two sets, 350 for training and 150 for validation. For performance evaluation, we used classification error (C.Err) and area under the ROC curve (AUC) over 500 replicates.
The results are shown in Supplementary Tables S1 and S2. The proposed TCRpred outperforms the compared approaches, including Basic-GLM, tcrRidge, tcrLASSO, and DeepTCR. Basic-GLM does not account for the TCR information, while tcrRidge and tcrLASSO do not accommodate the homology information among TCR repertoires. DeepTCR encodes each TCR sequence as an embedding vector and then aggregates multiple TCR sequences into a single numerical vector to represent a TCR repertoire. This aggregation is likely to cause a loss of information, which may partly explain its less accurate prediction performance.
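The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes that each method returns a predicted probability of being a case and that C.Err is obtained by thresholding at 0.5; the threshold, the function names, and the toy data are assumptions for illustration and are not taken from the TCRpred software.

```python
from typing import Sequence


def classification_error(y_true: Sequence[int], prob: Sequence[float],
                         threshold: float = 0.5) -> float:
    """Fraction of individuals whose thresholded prediction differs from the label.

    The 0.5 default threshold is an assumption, not a documented TCRpred setting.
    """
    if len(y_true) != len(prob):
        raise ValueError("y_true and prob must have the same length")
    wrong = sum(int(p >= threshold) != y for y, p in zip(y_true, prob))
    return wrong / len(y_true)


def auc(y_true: Sequence[int], prob: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties contribute one half (the Mann-Whitney formulation of the AUC).
    """
    cases = [p for y, p in zip(y_true, prob) if y == 1]
    controls = [p for y, p in zip(y_true, prob) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                wins += 1.0
            elif p_case == p_control:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy example with hypothetical labels and predicted probabilities.
y = [1, 0, 1, 0, 1]
p = [0.9, 0.2, 0.4, 0.6, 0.8]
c_err = classification_error(y, p)   # 2 of 5 individuals misclassified
auc_value = auc(y, p)                # 5 of 6 case-control pairs ranked correctly
```

In a simulation like the one described, these two functions would be applied to the 500 evaluation individuals of each replicate and the results summarized across replicates.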
Table S1. Classification error (C.Err) and AUC for the binary outcome when τ = 5

ADDITIONAL SIMULATION STUDY II
We conducted a simulation study for τ = 8. Other parameter settings were in line with those in the main text. We observed a pattern similar to that in the main text, where τ = 5.
Table S3. Classification error (C.Err) and AUC for the binary outcome. Data were simulated based on 3-mers (k = 3) and BLOSUM62.

ADDITIONAL SIMULATION STUDY III
We conducted simulation studies for the binary outcome. In data generation, we considered k = 3, τ = 5, and the BLOSUM62 substitution matrix. Other parameter settings were in line with those in the main text.
In model fitting, we considered k = 2 and k = 4. We observed that the proposed methods still performed well even when k was misspecified.
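To make the role of k concrete, the following minimal sketch counts overlapping k-mers in a set of amino acid sequences and selects the most frequent ones, for k = 2, 3, and 4. The function names and the toy sequences are hypothetical, and the sketch ignores clone abundances; the feature construction actually used by TCRpred may differ.

```python
from collections import Counter
from typing import List


def kmer_counts(sequences: List[str], k: int) -> Counter:
    """Count overlapping k-mers across the amino acid sequences of one repertoire."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    counts: Counter = Counter()
    for seq in sequences:
        # Sequences shorter than k contribute no k-mers (empty range).
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def top_kmers(counts: Counter, m: int) -> List[str]:
    """Return the m most frequent k-mers; ties are broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [kmer for kmer, _ in ranked[:m]]


# Hypothetical toy repertoire of two short amino acid sequences.
repertoire = ["CASSL", "CASSP"]

counts_k2 = kmer_counts(repertoire, 2)
counts_k3 = kmer_counts(repertoire, 3)
counts_k4 = kmer_counts(repertoire, 4)

# With k = 3, "CAS" and "ASS" each occur twice; "SSL" and "SSP" occur once.
top_two_3mers = top_kmers(counts_k3, 2)
```

Changing k changes the feature space itself (here, five distinct 2-mers, four 3-mers, and three 4-mers), which is why fitting with k = 2 or k = 4 to data generated with k = 3 amounts to a misspecified model.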

COEFFICIENT ESTIMATES OF THE EXTRACTED FEATURES BY THE TCRPRED APPROACH
Table S10. Coefficient estimates of the extracted features by the TCRpred approach

Feature    PDR    RYN    GNE    AGG    GGR    GDT    ETQ
Estimate   0.990  0.987  0.390  0.133  0.117  0.047  0.039

Table S2. Classification error (C.Err) and AUC for the binary outcome when τ = 8

Table S4. Classification error (C.Err) and AUC for the binary outcome. Data were simulated based on 4-mers (k = 4) and BLOSUM62.

Table S5. Classification error (C.Err) and AUC for the binary outcome. Data were simulated based on 3-mers (k = 3) and PAM250.

Table S6. Classification error (C.Err) and AUC for the binary outcome. Data were simulated based on 4-mers (k = 4) and PAM250.

Table S8. Classification error (C.Err) and AUC when 2-mers were used for model fitting.

Table S9. Classification error (C.Err) and AUC when 4-mers were used for model fitting.