Current annotation strategies for T cell phenotyping of single-cell RNA-seq data

Single-cell RNA sequencing (scRNA-seq) has become a popular technique for interrogating the diversity and dynamic nature of cellular gene expression and has numerous advantages in immunology. For example, scRNA-seq, in contrast to bulk RNA sequencing, can discern cellular subtypes within a population, which is important for heterogenous populations such as T cells. Moreover, recent advancements in the technology allow the parallel capturing of the highly diverse T-cell receptor (TCR) sequence with the gene expression. However, the field of single-cell RNA sequencing data analysis is still hampered by a lack of gold-standard cell phenotype annotation. This problem is particularly evident in the case of T cells due to the heterogeneity in both their gene expression and their TCR. While current cell phenotype annotation tools can differentiate major cell populations from each other, labelling T-cell subtypes remains problematic. In this review, we identify the common automated strategy for annotating T cells and their subpopulations, and also describe what crucial information is still missing from these tools.


Introduction
The first single-cell RNA sequencing (scRNA-seq) experiments started in 2009, and the technique became commercially available in 2014 (1).Single-cell RNA sequencing has rapidly gained widespread use, as more detailed information can be acquired using it than using bulk RNA-seq.Additionally, scRNA-seq data are becoming more accessible as more companies (e.g., 10x Genomics and BD Rhapsody ® ) are developing and optimising the technology, leading to a higher throughput and decreasing costs.With the increasing availability of scRNA-seq data, there has been a substantial increase in our understanding of the functions of immune cells.This has led to discoveries of new immune cell subpopulations, their dynamic and heterogeneous nature, and their role in disease (2)(3)(4)(5).A particularly useful advantage of scRNA-seq for the study of the adaptive immune system is the ability to uncover paired information on the gene expression and the immune receptor of a single cell [more extensively reviewed in (6)].However, defining the cellular profiles for adaptive immune cells remains a complex task.For example, the T cells of the adaptive immune system are very heterogeneous and can adopt a wide variety of phenotypes.In addition to a wide variety of phenotypes, there is an increased layer of complexity due to the highly polymorphic nature of the immune cell receptors they carry, such as the T-cell receptor (TCR) for T cells.The TCR is created through somatic recombination to create a highly variable CDR3 sequence containing a variable (V), and Junction (J) for alpha (a) and gamma (g) chains, or a V, Diversity (D) and J for beta (b) and delta (d) chains (7).These unique TCRs can recognise a vast array of epitopes, including immunopeptides, lipids, and some small molecules [e.g., phosphoantigens and Vitamin B metabolites (8)].The most wellstudied mechanism of epitope recognition is the antigenic peptide presentation by the major histocompatibility complex (MHC) protein, encoded in humans via the human leukocyte antigen (HLA) gene loci, and then to conventional ab T cells (7).However, there are also unconventional T cells which are thought not to interact with MHC, such as mucosal-associated invariant T cells (MAIT), natural killer (NK) T cells, and gd T cells (9).These unconventional T cells and their cellular profiles remain poorly understood.
A crucial step in the analysis of scRNA-seq data involves annotating the cells with the correct cellular phenotype.The initial manual annotation of the cells in a scRNA-seq dataset, after (pre-)processing, is time intensive, may contain data entry errors, and requires expert knowledge of the marker genes specific to the different cellular subsets.The initial (pre-)processing is commonly done using the R Seurat package (10) or the Python Scanpy package (11).For a more comprehensive description of the different steps in the (pre-) processing of scRNA-seq data, we refer you to this excellent review by Heumos et al. (2023) (12).In brief, the manual annotation of cells in scRNA-seq data is typically approached by clustering the cells and comparing these clusters to identify the differentially expressed genes (DEGs) among them to verify if they are known marker genes that are specific to cellular populations.This is hampered by a number of factors, however, such as a high gene dropout rate, the free-floating ambient mRNA of one cell being captured in a droplet together with another cell (droplet-based methods), or the poor expression of some marker genes at the RNA level, which would be better identified at the protein level (13).More recently, this manual annotation process has been superseded by automatic methods that leverage machine learning to automate and ease the burden (12).To aid in annotating cells with their phenotypes in scRNA-seq data, several automated pipelines have been developed to infer the phenotype based on a cell's gene expression profile.However, these tools are often focused on inferring broader cell types (i.e., annotating a cell as a T cell), and it is unknown how well these tools work for inferring the subpopulations of these broader cell types (i.e., identifying a T helper [Th] 1 cell).Thus, in this review, we describe the currently available annotation tools for identifying T-cell phenotypes from scRNA-seq datasets.We compare their annotation strategies to the literature to verify whether they are fully capturing these hard-todelineate subpopulations.Finally, we reflect on how well some of the unconventional T-cell populations are currently being captured.

Single-cell annotation tools
To prevent the labour-exhaustive manual annotation of new datasets, automatic annotation tools have been developed to decrease time, improve labelling accuracy, and promote consistency.Automated annotation has become part of the current goldstandard approach to single-cell RNAseq, along with manual annotation/inspection of the automated annotations by expert review (i.e., expert familiarity with the common markers of cellular populations, which enables accurate annotation) (12).Therefore, a range of tools have been developed to aid in annotation automation (Table 1).As highlighted in Table 1, the current tools fall into several subcategories, each with distinct advantages and limitations.These annotation methods can also be distinguished by the type of machine learning (ML) approach, with methods categorized into unsupervised, supervised, or semi-supervised approaches.
The unsupervised approach is typically clustering-based, including, for example, k-nearest neighbours (e.g., Seurat clustering (10)), which groups together cells with similar expression profiles.Subsequently labelling the clusters then requires the manual interrogation of the distinct markers per population.Accurate annotation relies on the expert knowledge of the user for common genes expressed for each cell type.
The supervised ML classification of scRNA-seq data is available in SingleR (20), Garnett (14), and CellTypist (16).These tools enable the prediction of cell-type labels for a novel dataset based on a prediction model trained on prior datasets.The ability to annotate a new dataset with high accuracy requires the dataset to have a good overlap of genes with the prediction model.This method is more robust in handling missing marker genes in a dataset, as it relies on the entirety of a cell's gene expression to classify a cell, rather than just a few marker genes.However, if there is too much heterogeneity between the datasets, then the prediction tools fail to identify the cells correctly.The package scTriangulate aims to overcome this limitation by using multiple annotation sources (21).
The semi-supervised annotation approach includes models such as the SCINA (22) tool, which was developed to annotate cells based on a consensus list of known markers.An alternative tool, scGate (15), follows a process similar to the gating strategy employed in flow cytometry experiments, and classifies the markers in a hierarchical structure of pure and impure cells.The latter includes prelisted markers, adding to the interpretability of the method.The scGate researchers also defined common gating strategies on common cellular markers, and this led to the development of ProjectTILs (23) to further automate the process.A particular advantage of scGate is that the user can provide their own list of markers and and is advantageous to use in instances that the dataset is dissimilar or not modelled within the pre-learned supervised models.
Therefore, the researcher will need to consider which method is most appropriate for their dataset.For instance, if their dataset is similar to a previous annotated dataset and was obtained using the same technology, then the reference-based/label transfer may be the best strategy for annotating the cells.Alternatively, if researchers have a novel cellular subset from a species that is not human or mouse, the use of reference-based, gene set-based, and marker-based tools may not be advisable, as they rely on similarity to previously curated datasets.In addition, these ML-based label transfer methods are hampered by their reliance on the quality of the annotation of the original dataset.As such, we encourage users to carefully review the latest datasets and markers that were used to define populations, if available.
Although the accuracy of these automated methods has significantly improved, a two-step annotation process is strongly recommended.This two-step process involves primary annotations of the gene expression clusters by automated algorithms, followed by expert-based manual interrogation of the cell populations.In general, a combination of both strategies will result in the most accurate definitions of cell subsets.

T-cell annotations
As highlighted above, the current annotation strategies can distinguish between populations with large phenotypic differences (e.g., B cell vs. T cells), as there are fewer overlapping transcripts.However, within each cell type there can be subspecialisations.For instance, T cells have a variety of subtypes.These subtypes are first stratified into two main lineages based on the TCR, that is, alphabeta (ab) and gamma-delta (gd) T cells.Subsequently, ab T cells, the best-described T-cell subtype, are further delineated into CD4 -or CD8 -expressing T cells.However, these can be further stratified based on their function and capacity for formation of immunological memory.The most well-described classical subpopulations relate to the class I (CD8 + ) and class II (CD4 + ) abTCR cells, which are responsible for screening the peptideloaded major histocompatibility complex for "self" and "foreign" antigens (7).Less is known about the unconventional T cells, which encompass natural killer T (NKT) cells, mucosal invariant T cells (MAIT), and gd T cells.However, evidence that these unconventional T cells have important roles in both health and disease [reviewed in (24)(25)(26)] is emerging.Therefore, future work should consider both classical and unconventional T cells.
Given this plethora of cell subsets, how are these subpopulations currently defined by common annotation models for humans?To address this question, we looked at several tools that claim to be able to annotate for more delineated T-cell subpopulations.These annotation tools included scGate, CellTypist, and Data2Talk [online tool], as these had more extensive documentation for the T-cell subsets.Additionally, we compared these with the common protein expression panels used to identify T-cell subsets, as these are well-curated and validated panels.Last, we also included findings Impossible to take into account "new effects".Requires strong degree of similarity between query and reference.
Azimuth (10), Symphony (18), scArches (19) from the literature to fill in other annotation gaps.Tables 2-5 highlight the markers used to classify the CD4 ab T cells (Table 2), CD8 ab T cells (Table 3), gd T cells (Table 4), and miscellaneous T cell markers (Table 5) that were identified by the documented annotation models or through literature searches.
The T-cell annotation models include most of the well-defined effector CD4 + populations, including T helper 1, Th17, follicular Th (Tfh) and regulatory T cells (Tregs) (Table 2).CellTypist was the only annotation model to included memory markers for the CD4 + T-cell population, while Data2Talk included Th2 cells, but the markers were not disclosed.The CD8 + T cells were classified into cytotoxic T cells (CTL; granzymes [GZMB, GZMK, etc.], perforin [PFR1], granulysin [GNLY]), NKT cells (KLR gene family, CD160, etc.), and MAIT cells (Table 3).These CD8 + T-cell subsets were also broken down into memory features, naive, effector, effector memory (Tem), terminal memory (TEMRA), resident memory (Trm), and central memory (Tcm) (Table 3) cells.The three annotation models cover many of the common classical CD8 + and CD4 + T-cell populations, except for Th2, Th9, and Th22 cells.The identification of these populations has relied on cytokine expression.However, the current technology inadequately captures the transcription factors and cytokines (e.g., interleukins) (37).In addition, these populations may also be missed due to a bias in the experimental choices, that is, no focused Th2 specific singlecell experiments.Therefore, we need to identify appropriate markers for the transcriptional level before we can add them to the label transfer models.
T cells can also be defined by their functional states, which are not necessarily restricted to T-cell lineage (e.g., gd TCR vs. ab TCR), or a specific subtype (e.g., CD4, CD8, or DN).These functional features include activation (e.g., CD69 [early], CD25 [late] and CD38/HLA-DR [very late]), exhaustion (PD-1, TIGIT, LAG3, and TIM3) (36), senescent (CD57 and KLRG1) (34), and cell cycling/ proliferation markers (Table 5).However, the current automated annotation includes only the cell cycling markers.Given that these functional features are important in determining if a T cell is functioning properly, they need to be included in annotation models to identify the most biologically relevant T cell clones.It should be mentioned that when a cell expresses a marker associated with activation, senescence, or exhaustion, it does not mean a cell is activated, senescent, or exhausted.For instance, exhaustion is a functional state characterized by multiple features, including not only the expression of a combination of inhibitory genes such as PD-1, TIGIT, LAG3, and TIM3, and others, but also a lack of effector capacity, that is, a lack of cytokine production or cytotoxic activity (38).Defining these states is even further complicated by the fact that certain genes are associated with multiple states.For example, sole PD-1 expression can indicate an activated state, but it can also indicate a differentiation state to exhaustion, or be a marker of exhaustion if expressed together with other immune checkpoint genes (39).Similarly, when KLRG1 is expressed together with CD57, this can point to T-cell senescence; however, KLRG1 can also be a defining feature of antigen-experienced memory T cells when expressed by itself (40).Therefore, to accurately annotate the exhausted and senescent cellular states, identification of the expression (or lack thereof) of multiple markers is needed.Researchers need to carefully design their annotation panels and be transparent about what markers were used to identify the subpopulations.Currently, gd T cells have limited representation in the annotation models (Table 4).The gd T-cell population is subdivided into innate (Vd2g9 + ) and adaptive-like (e.g., Vd1 + , Vd2 + Vg9 -, Vd3 + ) gd T cells.The most studied gd T-cell subpopulation is that of the invariant innate Vd2g9 + T cells that respond to (E)-4-Hydroxy-3-methyl-but-2-enyl pyrophosphate   (HMB-PP).However, the models fail to adequately differentiate between these innate and adaptive-like gd T-cell subpopulations.For instance, the scGate general annotation model classifies gd T cells into the innate T-cell population along with NKT cells.CellTypist classifies gd T cells as gd TCR or CRTAM + gd T cells.Only Talk2Data includes gd T cell sub-populations, but the markers used for the classifications are unknown (Table 4).Therefore, at this point in time, fully capturing the diversity of gd T-cell subsets in scRNAseq data analysis requires expert knowledge of marker genes.
From the literature we know that there are difficulties obtaining data from the adaptive gd T cells.gd T cells are challenging to study as no antigen-specific culturing methods (41) exist for them, they have a minority fraction in the blood (comprising up to 10% of all T cells) (40), and have a high prevalence in mucosal membranes (e.g., skin, liver, and intestines) (42).Nevertheless, we are slowly defining adaptive gd T cells that have overlapping phenotypes with the ab T cells.For instance, functional information derived from mouse models has been used to identify several phenotypes, including T17 + (IL-17 and Th17-like) and T1 (IFNg and Th1-like) cells (43).Intriguingly, on average ~30% of gd T cells express the CD8 marker (40).Importantly, recent studies show CD8 + gd T cells exhibit peptide restriction, similar to classical ab T cells (44, 45).Consequently, gd T cells express the same cytotoxic T-cell markers as CD8 + ab T cells (30).Therefore, CD8 + gd T cells may be functionally indistinguishable from CD8+ ab T cells.We also note that gd T cells can interact with CD1 and MR1, but their  % Flow cytometry protein expression panel markers from: https://www.sartorius.com/en/applications/life-science-research/cell-analysis/flow-cytometry/immune-cell-function/tcell-activation & Bioturing can predict cell types based on 80,574,317 cells https://talk2data.bioturing.com/predict^Curated markers from CellTypist (V2 list of markers).*Based on Figure 1 top associated markers from a single-cell study.molecular signature is not well defined (46).Overall, the innate gd T cells can be readily identified by their TCR arrangement in singlecell experiments.To bridge the adaptive gd T cells annotation gap, therefore, we need to use TCR information along with what is known about classical and unconventional ab T cells.
In addition to the issues with gd T-cell classification, there are also issues with annotating other unconventional populations, such as MAIT and NKT cells.MAIT cells exhibit MR1 restriction and the semi-invariant TCR arrangement of TRAV1-2 with TRAJ33, TRAJ12, or TRAJ23, often paired with either the TRBV6 or TRBV20 gene families.In flow cytometry experiments, TRAV1-2 and CD161 (KLRB1), IL-18Ra or CD26 are commonly used to identify the MAIT population; however, there can be individual variability (28).MAIT cells can also exhibit the expression of Th17 markers (RORgt and IL-17), in addition to Th1-like features (T-bet, IFNg) (28).Moreover, the semi-invariant type I NKT cells are identified by an invariant pairing of Va24Ja18 (TRAV10-TRAJ18) with Vb11 (TRBV25) and exhibit CD1d restriction, while the type II NKT cells have highly variable TCR combinations, and little is known about what their lipid-restriction is ( 9).It appears that there may be many subtypes of NKT cells, including Th17-like, Th2-like (GATA3), and Th1 (T-bet) cells (9).Therefore, both MAIT cells and NKT cells cannot be distinguished from other T-cell subpopulations based on gene expression alone.The easiest way to identify MAIT cells and Type I NKT cells will be to utilise the scTCR-seq data layer in combination with the gene expression layer.However, the type II NKT cells cannot be accurately identified until we can identify if they have specific gene(s) that are distinct from the other T-cell subsets.
Other considerations for annotation at the single-cell transcriptome level concern the gene sparsity, low abundance of transcripts captured, and poor correlation of mRNA expression to protein expression for several markers.To illustrate these problems, we highlight a common issue with the identification of the CD4 + Tcell population due to the abundance and sparsity of CD4 cells being lower than those of CD8A and CD8B cells (Figure 1).For this, we used the publicly available dataset GSE145370 (47), which was derived from CD45 + sorted cells from oesophageal tumour and adjacent tissue.The 14 available samples (~108,000 single cells) were then processed through the STEGO.R pipeline (48).The low abundance and high sparsity of CD4 cells makes it difficult to distinguish the double negative (CD4 − CD8 − ) T-cell population Marker sparsity of common T cell markers.represent the (top row) CD8 + T cells, (middle row) CD4 + T cells and (bottom row) double negative.The (left column) represents the scGate annotations, while the (middle and right columns) show the scaled expression of the markers of individual transcripts with the name listed above.The data was derived from an esophageal cancer set: GSE145370 (47), with the data processed and figure made with the aid of STEGO.R (48).
from the true CD4 + population (Figure 1, left column).As an illustration of missing populations, we looked at a common marker associated with CD4 + Tregs, FOXP3.The semi-automated method may miss many of the CD4 + T regulatory cells if CD4 is used in the annotation model.Therefore, other common CD4-specific markers, in the absence of CD8, may need to be used as a surrogate for correctly annotating CD4 subpopulations.Alternatively, the sorting of pure CD4 + T-cell subpopulations (i.e., Th1 and Th2), followed by bulk RNA-seq and differential expression analysis, may be required to identify new population specific transcriptional markers.This would aid in finding alternative transcriptional markers to identify CD4 subpopulations without the need to use the CD4 transcript for annotation purposes.
A combination of gene and protein expression layers could be used to resolve several of these annotation problems.This inclusion of a protein expression layer has been made possible by the cellular indexing of transcriptomes and epitopes through sequencing (CITEseq), allowing the use of protein-specific antibodies within scRNAseq.For instance, the issue of identifying CD4 + T cells that stems from low mRNA abundance could be resolved by the addition of CD4-specific antibodies to capture CD4 protein expression.Another such problem that could be resolved by CITE-seq is the identification of memory subsets within T-cell populations.CITE-seq resolves this by capturing the expression of two CD45 protein isoforms that originate from alternative splicing, CD45RO and CD45RA, to differentiate between naive T cells (CD45RA + /CD45RO − ) and memory T cells (CD45RA − /CD45RO + ) (10).However, while the inclusion of protein antibodies is easily able to resolve isoform expression, this task is not as simple as relying on RNA expression alone.The sequencing of isoforms typically requires full-length transcripts, and the read lengths required to cover these transcripts are not obtained by the commonly used short-read methods for scRNA-seq (49, 50).To illustrate this limitation, using CellTypist to annotate cells in a scRNA-seq experiment, we are currently unable to differentiate between a naive T cell and central memory T cell (Table 3).However, scRNA-seq with the CITE-seq has been able to identify the memory populations (51).Alternatively, long-read sequencing, for example by Oxford Nanopore Technologies (ONT) or PacBio, can be used for scRNA-seq, and can readily resolve splicing/isoform information.For instance, ONT-based single-cell RNA sequencing led to the detection of multiple CD45 isoforms, consistent with CITE-seq data (52).Thus, to properly annotate memory T-cell populations or other T-cell populations that are defined by protein markers with poor mRNA expression correlation, we will either need to include protein expression (e.g., CITE-seq), or sequence isoforms using techniques that capture the full length of a transcript, such as long-read sequencing.
An additional inconsistency between the protein expression and transcriptional profiling pertains to the degree of expression.With flow cytometry, protein expression can readily capture dose, including low, moderate, and high, based on arbitrary cut-offs.However, due to fewer transcripts being captured, there is limited capacity to have these grades of expression in scRNA-seq, and they can mostly only be differentiated by binary (e.g., present or absent) thresholds.For instance, CD127 low protein expression is a marker for Tregs; however, this would be an inappropriate transcriptional marker (Table 2).Therefore, when designing a panel of transcriptional phenotyping markers, the expert will need to consider this technological limitation.
Overall, the above analysis identified inconsistencies with marker choice (Tables 2-5), which represents a concerning issue regarding the reproducibility of these T-cell studies.Additionally, there was a plethora of missing annotations (e.g., for Th2 cells, gd T cell phenotypes, and functional features).Consequently, if these missing annotations are essential to identifying the T cell associated with a particular disease(s)/pathology (e.g., infection, cancer, autoimmune disease, and transplantation), using the automated models will lead to the T-cell subset of interest being missed.Therefore, filling in the missing annotations will need to be done manually or by way of a semiautomated process using custom gene sets.

Identifiable needs for future T-cell annotation strategies
T cells remain a challenging subset of immune cells to interrogate due to their complex and variable subspecialisations, together with the diversity of the TCR repertoire.There has been some progress made in the development of T cell-specific annotation strategies and in TCR repertoire interrogation [reviewed in (6)].Technology has progressed to now include simultaneous scRNA-seq and scTCR-seq, which can capture both the abTCR and gdTCR genes (e.g., 10x Genomics and BD Rhapsody).Both these layers of data are likely needed to identify the role individual T-cell clones are performing at a given time point.For example, scTCR-seq can capture the paired abTCR or gdTCR sequence and identify if the clone was expanded.Clonal expansion may indicate whether or not a particular TCR has responded to an epitope/antigen.The functional state will also further indicate if it is worth undertaking further analysis of the T cell and enable bystander clones to be ruled out.This information is needed for functional validation so that sorting based on phenotype-specific biomarkers and TCR genes can be done, which in turn can eventually be used as immunotherapies (e.g., CAR-T or TCR-T) (53).Having access to both layers in the initial discovery single-cell experiment will decrease the time needed to identify the most biologically relevant T-cell clones.
A deep dive into the current annotation strategies identified that inconsistences exist in the subclassification of T cells, along with missing T-cell subsets.To rectify these phenotyping inconsistencies, we will need a central resource of well-curated classifications so we can estimate the robustness of the markers for any given T-cell subpopulation.We may need to consider not segregating the classification based on gdTCR vs. abTCR, as new understanding is showcasing overlapping, if not identical, markers (Tables 2-4).To achieve this database, the T-cell community requires the development of a public repository for protein markers, bulk RNA-seq derived markers, and, if possible, scRNA-seq with scTCR-seq and protein antibody information.Once this is built, we can determine the most robust markers per T-cell subset.We believe this literature review provides a useful reference and may serve as a foundation in the realization of this effort.
Once a consistent gene-set list of markers is established, we need to tackle the remaining problems regarding how to efficiently interrogate scRNA with paired scTCR-seq data.To achieve this, expert T-cell functional knowledge and computational expertise will be needed.This could help in determining which T cells should be functionally tested, and may lead to groundbreaking discoveries that lead to novel T cell-based therapeutics or help guide patient management in current immunotherapy protocols.

Conclusions
In this study, we presented a comprehensive review of the tools used to annotate T cells from scRNA-seq datasets and also analysed the single-cell derived TCR repertoire.There are a multitude of automated strategies used to annotate T cells.However, the biggest shortcomings are a lack of consistency among tools concerning the markers used to annotate the T cell subsets, leading to severe issues with reproducibility.To overcome this challenge, collation of the currently available T cell-based data should be stored in a single repository, and development of new tools that make use of this harmonised framework is needed.Without this progress, there will continue to be issues with reproducibility, which will hamper progress in the development of T cell-based therapies.

TABLE 1
Common strategies and programs for annotating scRNA-seq datasets.

TABLE 4 gd
T cell Markers (human).

TABLE 5
Miscellaneous T cell markers (human).