Elucidating the activation mechanisms for bifurcation of regulatory and effector T cell fates by multidimensional single cell analysis

In T cells, T cell receptor (TCR) signalling initiates downstream transcriptional mechanisms for T cell activation and differentiation. Foxp3-expressing regulatory T cells (Treg) require TCR signals for their suppressive function and maintenance in the periphery. It is, however, unclear how TCR signalling controls the transcriptional programme of Treg. Since most of studies identified the transcriptional features of Treg in comparison to naïve T cells, the relationship between Treg and non-naïve T cells including memory-phenotype T cells (Tmem) and effector T cells (Teff) is not well understood. Here we dissect the transcriptomes of various T cell subsets from independent datasets using the multidimensional analysis method Canonical Correspondence Analysis (CCA). We show that resting Treg share gene modules for activation with Tmem and Teff. Importantly, Tmem activate the distinct transcriptional modules for T cell activation, which are uniquely repressed in Treg. The activation signature of Treg is dependent on TCR signals, and is more actively operating in activated Treg. Furthermore, by analysing single cell RNA-seq data from tumour-infiltrating T cells, we revealed that FOXP3 expression occurs predominantly in activated T cells. Moreover, we identified FOXP3-driven and T follicular helper (Tfh)-like differentiation pathways in tumour microenvironments, and their bifurcation point, which is enriched with recently activated T cells. Collectively, our study reveals the activation mechanisms downstream of TCR signals for the bifurcation of Treg and Teff differentiation and their maturation processes.


Introduction 1
T cell receptor (TCR) signalling activates NFAT, AP-1, and NF-kB (1), which 2 induces the transcription of Interleukin (IL)-2 and IL-2 receptor (R) a-chain 3 (Il2ra, CD25). IL-2 signalling induces further T cell activation, proliferation and 4 differentiation (2). In addition, IL-2 signalling has key roles in immunological 5 tolerance (2). This is partly mediated through CD25-expressing regulatory T 6 cells (Treg), which suppress the activities of other T cells (3). Intriguingly, TCR 7 signalling also induces the transient expression of FoxP3, the lineage-specific 8 transcription factor of Treg (4), in any T cells in humans (5), and in mice in the 9 presence of IL-2 and TGF-b (6). These suggest that FoxP3 can be actively 10 induced as a negative feedback mechanism for the T cell activation process, 11 especially in inflammatory conditions in tissues (7). Thus, the T cell activation 12 processes may dynamically control Treg phenotype and function during 13 immune response and homeostasis. 14 In fact, TCR signalling plays a critical role in Treg. Studies using TCR 15 transgenic mice showed that Treg require TCR activation for in vitro 16 suppression (8). The binding of Foxp3 protein to chromatin occurs mainly in 17 the enhancer regions that have been opened by TCR signals (9). In fact, 18 continuous TCR signals are required for Treg function, because the 19 conditional deletion of the TCR-a chain in Treg abrogates the suppressive 20 activity of Treg and eliminates their activated or effector-Treg (eTreg) 21 phenotype (10, 11). It is, however, unclear how TCR signals contribute to the 22 Treg-type transcriptional programme, and whether TCR signals are operating 23 in all Treg cells or whether these are required only when Treg suppress the 1

activity of other T cells. 2
Heterogeneity of Treg has been previously addressed through classifying 3 Treg into subpopulations, according to the origin (thymic Treg, peripheral 4 Treg, visceral adipose tissue Treg (12)), the transcription factor expression 5 and ability to control inflammation (Th1-Treg (13) and , and T 6 follicular regulatory T cells (15)), and their activation status (activated 7 Treg/eTreg, resting Treg, and memory-type Treg (16)). Among these Treg 8 subpopulations, of interest is eTreg, which are activated and functionally 9 mature Treg. Murine eTreg can be identified by memory/activation markers 10 such as CD44, CD62L, and GITR (16,17), and their differentiation is 11 controlled by the transcription factors Blimp-1, IRF4 and Myb (18,19). Human 12 Treg can be classified into naïve Treg (CD25+CD45RA+Foxp3+) and eTreg 13 (CD25+CD45RA-Foxp3+) (20). However, our recent computational study 14 showed that classical gating approach is not effective for understanding 15 multidimensional data, and that marker expression data may be rather 16 effectively analysed by the computational clustering approaches that aim to 17 understand the dynamics of marker expressions in Treg (21). Furthermore, 18 the recent advancement of single cell technologies has opened the door to 19 address the heterogeneity of Treg by their gene profiles at the single cell 20

level. 21
When addressing the single cell level heterogeneity, it is critical to analyse 22 activated effector T cells (Teff) and memory-like T cells (memory-phenotype T 23 cells; Tmem) together with Treg. The surface phenotype of Tmem is 24 methods are often not sufficiently powerful for hypothesis-driven research, 1 and our previous studies developed a transcriptome analysis method using a 2 variant of CA, Canonical Correspondence Analysis (CCA) for microarray data 3 (31) and RNA-seq data (34). In this approach, two transcriptome datasets are 4 canonically analysed: the correlations between cell samples in one dataset 5 (main dataset) and the immunological processes (explanatory variables) in 6 another dataset (explanatory dataset) are analysed based on their 7 correlations to individual genes. Briefly, CCA uses a linear regression to 8 identify the interpretable part (constrained space) of main data by explanatory 9 variables, and visualises similarities between genes, cells, and explanatory 10 variables using a singular value decomposition (SVD) solution within the 11 interpretable space (34). Thus, CCA enables to investigate and identify the 12 unique features of each T cell population, visualising the relationships 13 between T cell populations. 14 In this study, we investigate the multidimensional features of Treg in 15 comparison to other CD4+ T cells including Teff, Tmem, and naïve T cells 16 under normal or pathological conditions. Here we aim to identify the 17 differential regulation of transcriptional modules for T cell activation and 18 differentiation in these populations, and to reveal systems and molecular 19 mechanisms behind the differential regulation. Furthermore, using our new 20 single cell combinatorial CCA (SC4A) approach, we investigate the single-cell Conventional CCA (Gene-oriented analysis) 3 CCA canonically analyses two independent microarray or RNA-seq datasets 4 (34). Briefly, gene expression data of the standardised main dataset (S) is 5 linearly regressed onto the explanatory variable(s) (D), which identifies the 6 interpretable part of the main dataset ("Constrained data", S*). When only one 7 explanatory variable is used, the CA algorithm of CCA assigns numerical 8 values to cell samples and genes so that the dispersion of samples is 9 maximised (uncorrelated information components), providing a one-10 dimensional solution (34). The correlation analysis of explanatory data and 11 gene scores in the CA solution generates a biplot value, which, in the one-12 dimensional solution, represents the explanatory variable score. In the one-13 dimensional CCA solution, the single biplot value can either be +1 or -1, 14 indicating the direction (increasing/decreasing) of correlated genes in the 15 explanatory variable against that in the main dataset. In order to use the one-16 dimensional solution as a scoring system, the CCA score (i.e. Axis 1 score) is 17 multiplied by the single biplot value, which indicates positive or negative 18 correlation to Axis 1, ensuring that cells and genes with high scores have high 19 positive correlations to the explanatory variable. When two or more 20 explanatory variables are used, the CA algorithm then performs SVD on S*, 21 creating new matrices (i.e. sample and gene score matrices). These scores 22 are sorted into new uncorrelated axes α k , along which the entire set of scores 23 generated by SVD is distributed. The first axis has the largest variation 24 (inertia) and thereby explains the greatest amount of information extracted by 25 the analysis. The map approach enables the comparison of two or more 1 explanatory variables, while the regression process in CCA allows the 2 analysis across two different experiments (34). Biplot values of the CCA result 3 are shown by arrows on the CCA map. CCA provides a map that shows the 4 correlations between samples of interest, explanatory variables, and genes. 5 Highly correlated components are closely positioned on the map. 6 7 Note that the same genes must be used in both transcriptome dataset 8 matrices. The main dataset is projected onto the explanatory variable dataset, 9 thus the genes in common to both datasets comprise the interpretable part 10 (intersect) of the main data. Mathematical operation implemented in the CCA 11 algorithm produces immunological process (explanatory variable), gene and 12 cell sample scores. The results are visualized as the 3-dimensional CCA 13 solution on the CCA map (i.e. CCA triplot) that shows the relationships 14 between cell subsets, genes and immunological processes. For example, in 15 the application of CCA to population-level (bulk) data, transcriptome datasets 16 of peripheral CD4+ T cells (including Treg, naïve, memory, draining LN and 17 non-dLN and tissue effector CD4+ T cells) were processed by CCA using 18 indicated explanatory variables (e.g. T cell activation) and the cross-level 19 relationships between components at three different levels (immunological 20 process, gene and cell) were analysed. 21

22
Single cell data pre-processing and single cell CCA (single-cell oriented 23

analysis) 24
RNA-seq expression data of GSE72056 was obtained from single-cell 1 suspension of tumour cells with unknown activation and differentiation 2 statuses (35). Genes with low variances and low maximal values were 3 excluded. In order to identify CD4+ T cells, single cell data were filtered by the 4 expression of CD4 and CD3E to obtain only the CD4+CD3E+ single cells, and 5 also by k-means clustering of PCA gene plot to exclude outlier cells (21) for 6 subsequent analysis. 7 In the application of CCA to single cell data, importantly, the same single cells 8 are used in both main data and the explanatory variables (i.e. selected 9 genes). The main dataset is projected onto the explanatory variables, 10 visualising the relationships between single cells, genes and explanatory 11 variables, which represent major activation/differentiation processes in the 12 dataset. 13 14

Explanatory variables for conventional CCA 15
Explanatory variables for CCA were prepared as follows. Differentially-16 expressed genes were selected by a moderated t-test result using the 17 Bioconductor package, limma. The top-ranked differentially expressed genes 18 (according to their p-values) were used for making the explanatory variables. 19 The T cell activation explanatory variable (Tact) was defined by the difference 20 in gene expression between anti-CD3/CD28-stimulated (17 h) CD4 + T cells 21 and untreated naïve CD4 + T cells from GSE42276 (36). Precisely, genes were 22 selected by FDR < 0.01 and log2 fold change (> 1 or < -1) in the comparison 23 of the gene expression profile of the activated and resting T cells. For the one-24 dimensional CCA of T cell populations ( Figure 1B), the expression data of 1 GSE15907 (37) was regressed onto gene values in Tact representing the 2 change in gene expression following T cell activation, and CA was performed 3 for the regressed data and the correlation analysis was done between the 4 new axis and the explanatory variable. For the two-dimensional CCA of T cell 5 populations (Figure 2A and 2B), the expression data of GSE15907 (37) was 6 regressed onto Tact, in combination with Foxp3 and Runx1 explanatory 7 variables representing the effects of Foxp3 and Runx1 expression on CD4+ T 8 cells (GSE6939 (38)). Foxp3 explanatory variable is the log2 fold change of 9 Foxp3-transduced naïve CD4+ T cells and empty vector-transduced CD4+T 10 cells. Runx1 explanatory variable is the log2 fold change of Runx1-transduced 11 naïve CD4+ T cells and empty vector-transduced CD4+T cells. Subsequently, 12 CA was applied to the regressed data and the correlation analysis was 13 performed between the new axes and the explanatory variables. 14 15 SC4A 16 SC4A is a composite approach to understand non-annotated single cell data 17 by identifying distinct populations of cells and the differentiation processes 18 that are correlated with these populations. Since T cell population is usually 19 identified by a single lineage specification factor, in the application of SC4A, 20 such a factor will represent the cell population and their differentiation process 21 Figure 1A). The advantage of this approach is that SC4A 22 uses the gene expression data of a part of the dataset analysed, and thus the 23 regression analysis of CCA becomes more efficient because of the absence 24 of between-experimental variation, which is usually significant in cross-dataset 1 analysis (34). 2 SC4A is performed by the following three steps: (1) identification of putative 3 cell populations and candidate genes for explanatory variables by standard 4 CCA; (2) combinatorial CCA to identify the top-ranked genes to be used as 5 explanatory variables; and (3) the final CCA solution using the selected genes 6 as explanatory variables. 7

Preliminary analysis 8
The aim of the preliminary analysis is to identify the putative cell populations 9 and candidate genes for explanatory variables, and standard CCA is a useful 10 method to do this, because candidate genes can be identified by their 11 correlation to each putative cell population. Considering that the final output is 12 most effectively understood by visualisation using 2-dimensional (showing 13 correlation between explanatory variable, samples and genes) or 3-14 dimensional (showing correlation between explanatory variable and 15 samples/genes) plot, up to 4 cell populations will be identified, and up to 5-10 16 genes for each population will be identified by their correlation to the 17 population (Supplementary Figure 1B). 18

Combinatorial CCA 19
Here SC4A aims to identify a set of genes that make the dispersion of cell 20 populations maximum in the CCA solution. To achieve this, all the 21 combinations of genes will be used as explanatory variables and tested for 22 discriminating each two populations using CCA. During each combinatorial 23 cycle, two genes are chosen from the total selected genes for all defined 1 single-cell populations in the main dataset and tested for their correlations to 2 one defined cell population vs all other T cells. 3 Correlation can be visualised as the degree angle measured between the 4 explanatory variable (gene) and the centroid of the defined cell population. 5 Out of the two genes, the gene with the smallest angle to the defined cell 6 population is the most correlated. All selected genes are tested in this 7 pairwise manner against all defined cell populations vs all other T cells to 8 identify the gene that is most highly and uniquely correlated to each defined 9 cell population. At each combinatorial CCA, the most correlated gene to each 10 cell population is identified using vector multiplication (Supplementary Figure  11   1C). The top-ranked genes are determined by F1 score (the harmonic mean 12 of precision and sensitivity) and the correlation to the population of interest. 13 When the top-ranked gene is different between F1 score and the correlation, 14 the most immunologically-meaningful gene can be chosen. 15

SC4A 16
Finally, CCA is performed using the genes that are selected by the 17 combinatorial CCA to be used as explanatory variables. Thus, the single cell 18 dataset will be explained by the expression of the set of chosen genes, each 19 of which uniquely correlates with a cell population and represents the 20 differentiation process of the cell population (Supplementary Figure 1D). 21 Algebraically, SC4A is defined as follows. Single cell RNA-seq data X R p × 22 m is the measurement of m genes from p single cells. The j-th column x j = (x 1j 23 x 2j ... x pj ) T is the expression data of the j-th gene from p single cells, where T 1 indicates transposed vector. In SC4A, by choosing a set of k genes for 2 explanatory variable, X' R p × (m -k) will be analysed by CCA using Z R p × k 3 as explanatory variables. As in the algorithm for CCA, X' is standardised by 4 column sums (c) and row sums (r), i.e. S = D r -1/2 (1/n X-rcT) D c -1/2 , where n 5 is the grand total of gene expression data, D r and D c are the diagonal 6 matrices of r and c, respectively. Meanwhile, Z is scaled and standardised 7 (i.e. mean = 0 and variance = 1). S is linearly regressed onto Z by the 8  Results 2 Identification of the Foxp3-independent activation signature in Treg and 3

memory-phenotype T cells 4
Firstly, we investigated how T cell activation-related genes are differentially 5 regulated in resting Treg and other CD4+ T cell populations including Tmem 6 and Teff. To address this multidimensional problem, we applied CCA to the 7 microarray dataset of various CD4 + T cells using the explanatory variable for 8 the T cell activation process, which was obtained from the microarray dataset 9 that analysed resting and activated conventional T cells ("T cell subset data" 10 and "T cell activation data" in Table 1). Thus, we aimed to visualise the cross-11 level relationships between genes, the T cell populations, and the T cell 12 activation process ( Figure 1A). Using the single explanatory variable, the T 13 cell activation process, the solution of CCA is one-dimensional and the cell 14 node (dLN) T cells from BDC TCR transgenic (Tg) mice, which develop type I 20 diabetes). In contrast, Foxp3 + Treg, Tmem, and tissue-infiltrating Teff in the 21 pancreas from BDC Tg (i.e. with inflammation in the islets) had high scores 22 ( Figure 1B). These results indicate that Treg are as "activated" as Tmem and 23 tissue-infiltrating activated Teff at the transcriptomic level by CCA. 24 Next, we addressed whether the highly "activated" status of Treg is dependent 1 on Foxp3. Since Foxp3 suppresses Runx1-mediated transcriptional activities 2 (38), we investigated the same T cell population dataset using the following 3 three explanatory variables: T cell activation (Tact), retroviral Foxp3 4 transduction (Foxp3) and Runx1 transduction (Runx1) (see Methods). The 5 CCA solution was 3-dimensional, while the first two axes explained the 6 majority of variance (98.8%, Figure 2A). As expected, Tmem, tissue- These results indicate that a set of activation genes are operating in all the 19 three non-naïve T cell populations (i.e. Treg, Teff and Tmem), while some of 20 them are more specific to Treg. 21

22
The Treg transcriptome is characterized by the repression of a part of the 1 activation genes for Tmem 2 Next, we determine the modules of genes that are differentially regulated 3 between Treg and Tmem, in order to understand the multidimensional identity 4 of Treg and Tmem transcriptomes (i.e. how these populations can be defined 5 in comparison to all relevant populations). Specifically, we asked if the Axis 2 6 captured the differential transcriptional regulations between Tmem and Treg. 7 Importantly, Axis 2 represents Foxp3-driven and Runx1-driven transcriptional 8 effects, which are correlated with Treg and Tmem/Teff, respectively ( Figure  9 3A). This suggests that Axis 2 provides a 'scoring system' for regulatory vs 10 effector functions. Thus, the genes in Axis 1-low (precisely, genes above 25 11 percentile for positive correlations with Tact) were identified as Tact genes. Intriguingly, heatmap analysis showed that both Treg and Tmem expressed 22 Tact-Foxp3 genes at high levels, compared to naïve and effector T cells 23 ( Figure 3B). On the other hand, Tact-Runx1 genes were selectively 24 downregulated in Treg, while their expressions were sustained in Tmem 1 ( Figure 3C). In other words, the repression of Tact-Runx1 genes was the 2 major feature of Treg in comparison to Tmem, and Tact-Foxp3 genes are the 3 activation genes, the expression of which is induced by T cell activation in 4 both Treg and Tmem, and is sustained or enhanced even in the presence of 5 Foxp3. Interestingly, comparable selective downregulation of Tact-Runx1 6 genes was observed in Teff as well ( Figure 3C). This suggests that the set of 7 activation genes operating in Teff is different from the ones in Tmem, and that 8 Tmem and Treg share more activation genes than Treg-Teff and Tmem-Teff 9 ( Figure 3B and 3C). These results collectively indicate that the Treg-ness is 10 composed of the induction of the Treg-Tmem shared activation genes (i.e. 11 Tact-Foxp3 genes) and the Foxp3-mediated repression of Tmem-specific 12 genes (i.e. Tact-Runx1 genes), defining the multidimensional identity of Treg. Treg, Tmem and Teff (as captured by Figure 3B and 3C). Importantly, many 18 of these activation genes are shared between Treg and Tmem, but not with 19 Teff. The closer similarity between resting Treg and Tmem, compared to Teff, 20 is not surprising, considering that both resting Treg and Tmem are at the 21 resting status, while Teff are more recently activated and executing effector 22 functions. In addition, the distinct features of Teff may also include their 23 capacity of tissue infiltration and the effects of the microenvironment. These 24 features were not captured by standard t-test analysis ( Supplementary Fig  1   1). 2 Tact-Foxp3 genes included the transcription factors Nfat5, Runx2, and Ahr, 3 which were expressed by most of Tmem cells as well ( Figure 3D). The Treg-4 associated markers, Il2ra (CD25), Itgae (CD103), and Tnfrsf18 (GITR) were 5 expressed not only by Treg but also by Tmem at moderate to high levels. 6 Notably, the expression of Ctla4, Ccr4, and Lag3 was high in Treg and Tmem 7 cells, but it was repressed in Teff ( Figure 3D). This suggests that Treg and 8 Tmem are in later stages of T cell activation, when the expression of CTLA-4 9 is induced as a negative feedback mechanism (41), while it is not induced in 10 tissue-infiltrating Teff, presumably because they are more recently activated 11 and actively proliferating. 12 Tact-Runx1 genes included many cell cycle-related genes (e.g. Ccna1, 13 Cdca2, and Chek2), suggesting that these cells are in cell cycle and 14 proliferating ( Figure 3E). The higher expression of Mki67 and Fos suggests 15 that these Tmem cells had been activated by TCR signals in vivo before the 16 analysis. Tact-Runx1 genes also included the transcription factors Tbx21, 17 Maf, Hif1a, and Bcl6, which have roles in Th1, Th2, Th17, and Tfh 18 differentiation, respectively (42-44). In accordance with this, the Tfh markers 19 Cxcr5 and Pdcd1 were specifically expressed by Tmem, suggesting that 20 Tmem are heterogenous populations and composed of these Th and Tfh 21 cells. These results are compatible with the model that Treg and Tmem 22 constitute the self-reactive T cell population that have constitutive activation 23 status (7), and that the major function of Foxp3 is to modify the constitutive 24 activation processes by repressing a part of the activation gene modules (i.e. 1 Tact-Runx1 genes) (Figure 4). 2 3 The activated status of Treg is TCR signal dependent 4 We next asked whether the constitutively "activated" status of Treg is 5 dependent on TCR signals. We applied CCA to the microarray data of 6 CD44 hi CD62L lo activated Treg (CD44 hi activated Treg) and CD44 lo CD62L hi 7 naïve-like Treg (CD44 lo naïve Treg) from inducible Tcra KO or WT (TCR KO 8 data, Table 1, Figure 5A) using the T cell activation variable as explanatory 9 variable. The CCA result showed that CD44 hi activated Treg from WT mice 10 only showed high activation scores, compared with all the other groups. 11 Interestingly, TCRa KO CD44 lo naïve-like Treg showed the lowest scores, and 12 were lower than WT CD44 lo naïve-like Treg ( Figure 5B). These results 13 indicate that TCR signaling is required for the constitutive activation status of 14 Treg, especially CD44 hi activated Treg, and suggest that these activated Treg 15 are more enriched with the cells that received TCR signals recently, 16 compared to CD44 lo naïve-like Treg. 17 In order to further address whether the TCR signal-dependent activation 18 signature of Treg is constitutively maintained or specifically induced by in vivo 19 activation events (presumably as tonic TCR signals (7)), we analysed the 20 RNA-seq dataset of in vivo activated Treg (Ref. (16), Table 1). The dataset 21 was generated by depleting a part of Treg by Diphtheria toxin (DT) using bone 22 marrow chimera of Foxp3 GFPCreERT2 :Rosa26YFP and Foxp3 GFP DTR (16). The 23 DT treatment depletes DT receptor (DTR)-expressing Treg from Foxp3 GFP DTR , 24 and thus induces a transient inflammation through the reduction of Treg. 1 Foxp3 GFPCreERT2 allows to label Foxp3-expressing cells by YFP at the moment 2 of tamoxifen administration. Van der Veeken et al thus analysed resting Treg 3 from untreated mice (rTreg), activated Treg from mice with recent depletion 4 (11 days before the analysis) in an inflammatory condition (aTreg), and 5 "memory" Treg (mTreg) from mice with a distant depletion (60 days before the 6 analysis) ( Figure 5C). As expected, the CCA analysis using the T cell 7 activation variable showed that aTreg had higher activation scores than both 8 rTreg and mTreg ( Figure 5D). This indicates that the activation mechanisms 9 are more actively operating in activated Treg in an inflammatory environment. 10 In order to further dissect the activation signature of Treg, we obtained the 11 lists of differentially expressed genes (DEG) between WT Treg vs Tcra KO 12 Treg (designated as TCR-dependent genes), and between aTreg and rTreg 13 Foxp3 genes are regulated by TCR signalling (Figure 6A and 6B), 22 suggesting that the activation status of resting Treg and Tmem may be 23 sustained by TCR signals. Pathway analysis showed that Tact-Runx1 and 24 aTreg-specific genes were enriched for cell-cycle related pathways. In 25 contrast, Tact-Foxp3 genes were enriched for pathways related to signal 1 transduction only ( Figure 6C). Collectively, the results above suggest that 2 resting Treg are maintained by TCR and cytokine signalling, and that the 3 activation of Treg induces the transcriptional activities of Tact-Runx1 genes, 4 which promote proliferation and cell division. 5 6 FOXP3 expression more frequently occurs in activated T cells than resting 7 cells by single cell CCA 8 The analyses above showed that Treg are on average more activated than 9 naïve T cells and that the activation status of Treg can be variable. However, 10 it is still unclear whether individual Treg are activated than any naïve T cells at 11 the single cell level. The alternative hypothesis is that Treg are enriched with 12 the T cells that have recognised their cognate antigens and been activated. In 13 order to determine this and thereby understand the dynamics of T cell 14 regulation in vivo, we investigated the single cell RNA-seq data of tumour-15 infiltrating T cells from human patients (Ref. (35)  Firstly, we in silico-sorted FOXP3+ and FOXP3-CD4 + CD3 + T cells from 19 unannotated single cell data from tumours, which tissues were dispersed and 20 CD45+ cells were sorted by flow cytometry without the use of any lymphocyte 21 markers (GSE72056 , Table 1). Thus, the identities of individual single cells 22 were needed to be identified in a data-oriented manner, and Treg and non-23 Treg cells in these tumour tissues had unknown individual activation and 24 differentiation statuses. Thus, we applied CCA to the in silico-sorted single 1 cell T cell data using the explanatory variables of activated conventional CD4+ 2 T cells (Tact) and resting T cells (Trest; GSE15390, Table 1), aiming to define 3 individual single cells according to their level of activation by their correlations 4 to these two variables ( Figure 7A). Here we used these two variables, Tact  5 and Trest, in order to generate a two-dimensional CCA solution, instead of a 6 single explanatory variable that represents T cell activation by the log2 fold 7 change in gene expression between activated and resting CD4+ T cells (c.f. 8 in the Axis 1 ( Figure 7B). Here, CCA Axis 1 ´ (-1) score is designated as the 19 T cell activation score. Thus, using the activation score and FOXP3 20 expression, the following four subpopulations were defined: "Activated 21 FOXP3+", "Resting FOXP3+", "Activated FOXP3-", and "Resting FOXP3-" 22 ( Figure 7B). 23 Next, we aimed to determine whether individual activated FOXP3+ Treg are 24 more activated than activated FOXP3-non-Treg at the single cell level 25 According to the T cell activation score established by the CCA solution in 1 Figure 7B, FOXP3 + Treg had significantly higher T cell activation scores than 2 FOXP3-non-Treg on average, as indicated by the higher median in the violin 3 plots and greater density of samples with higher T cell activation scores 4 ( Figure 7C), confirming the results by bulk cell analysis (Figure 1). Using the 5 CCA definition of activated and resting Treg and non-Treg established in 6 Figure 7B, the T cell activation score neatly captured the activated status of 7 single cells, allocating high positive and negative scores to activated and 8 resting cells, respectively ( Figure 7D). Importantly, there was no significant 9 difference between Activated FOXP3+ and Activated FOXP3-cells and 10 between Resting FOXP3+ and Resting FOXP3-cells (Figure 7D) mechanism to suppress the effector response against tumour antigen (7). 20 Alternatively, but not exclusively, FOXP3+ T cells may have high-affinity TCRs 21 to self-MHC and/or tumour antigens and be more prone to activation (10). and Methods). Thus, PDCD1, FOXP3, CTLA4, and CCR7 were identified as 5 the most correlated genes for Activated FOXP3-, Activated FOXP3+, 6 Figure  7 3), which represent the four immunological processes (see above). Finally, 8 using these four genes as explanatory variables, we applied CCA to the single 9 cell transcriptomes, obtaining the solution of the SC4A approach. 10

Activated T cells, and Resting T cells, respectively (Supplementary
The single cell space of the SC4A solution showed that Activated and Resting 11 T cells had negative and positive scores, respectively ( Figure 8B). This 12 indicates that Axis 1 represents T cell activation vs naïve-ness. Single cells 13 were successfully clustered into Activated FOXP3 + Treg, Activated FOXP3-14 non-Treg, and Resting T cells. Resting FOXP3+ Treg and Resting FOXP3-T 15 cells were mostly overlapped (Figure 8C), indicating that the major features in 16 the dataset dominated the difference between these two resting T cell groups. 17 Importantly, the explanatory variable CTLA4, which represents the T cell 18 activation process, was highly correlated with both Activated FOXP3 + Treg 19 and Activated FOXP3-non-Treg at the middle, indicating its neutral position in 20 terms of Tfh and Treg activation processes. As expected, the variable CCR7, 21 which represents naïve-ness, was correlated with both Resting FOXP3+ Treg 22 and Resting FOXP3-T cells. The explanatory variable PDCD1, which 23 represents the Tfh-like process, was highly correlated with Activated FOXP3-24 non-Treg cells, while the variable FOXP3 was correlated with Activated 25 FOXP3+ Treg. Thus, the single cell transcriptomes were modelled by the 1 correlations between gene expression, single cells, and the expression of the 2 four key genes, which represent the four immunological processes ( Figure 8B  3 and 8C). PCA and t-distributed stochastic neighbor embedding (t-SNE) did 4 not provide insights into such cross-level relationships or clear separations of 5 the populations (Supplementary Figure 4). 6 Next, in order to understand the relationship between the T cell activation 7 signature and FOXP3-driven and Tfh-like processes ( Figure 8B and 8C), we 8 aimed to identify and characterise genes with high correlations to these 9 processes, which were represented by CTLA4, FOXP3, and PDCD1 10 explanatory variables, by analysing the gene space of the final output of 11 SC4A ( Figure 8C; see Methods). As expected, the Tfh genes, IL21 and BCL6 12 (45), were highly correlated with PDCD1 explanatory variable. IL2RA (CD25) 13 is a Treg marker (46) and was highly correlated with FOXP3 explanatory 14 variable. IL7R and BACH2 are known to be associated with naïve T cells (47, 15 48), and were positively correlated with CCR7 explanatory variable, which 16 represents the naïve-ness ( Figure 8C). Thus we defined FOXP3-driven Treg 17 genes (magenta circles) and Tfh-like genes (blue circles) according to their 18 high correlation to the FOXP3 and the PDCD1 explanatory variables, 19 respectively, while we designated as Activation genes (red circles) the genes 20 that have high correlations with the CTLA4 variable, including LAG3 and 21 CCR5, which were positioned around 0 in Axis 2 ( Figure 8C). 22

Identification of the bifurcation point of activated T cells that leads to Tfh-like 1 and Treg differentiation in tumour-infiltrating T cells 2
The analyses above strongly suggested that there are two major 3 differentiation pathways for those tumour-infiltrating T cells, which are 4 regulated by FOXP3-driven and Tfh-like processes. In order to identify these 5 lineages, we applied an unsupervised clustering algorithm to the sample 6 space of the SC4A/CCA result ( Figure 8B), and identified 6 clusters, to which 7 a pseudotime method (49) was applied, constructing "lineage curves" ( Figure  8 8D; see Methods). Importantly, the lineage curves had a bifurcation point at 9 Cluster II, which leads to the two distinct differentiation pathways, Tfh-like and 10 FOXP3-driven differentiation. Since cells may change and mature their 11 phenotypes in different dynamics between these two lineages, we designated 12 Tfh-like-associated and FOXP3-associated pseudotime as Tfh-pseudotime 13 and FOXP3-pseudotime ( Figure 8D). 14 In fact, the expression of Activation genes was progressively increased in the 15 shared clusters (i.e. Cluster I and II) for the two pseudotimes, and throughout 16 the rest of the FOXP3-pseudotime and the early phase of Tfh-like 17 differentiation (i.e. Cluster III) in Tfh-pseudotime, while it was suppressed 18 towards the end of Tfh-like differentiation (Cluster IV; Figure 8E) in Tfh-19 pseudotime. Given that Tfh-pseudotime is correlated with PDCD1 expression 20 ( Figure 8C), this suggests that PDCD1 expression and the Tfh-effector 21 process are induced during the earlier phases of effector T cell activity, and 22 that the activation processes in PDCD1 high T cells are suppressed, 23 presumably through PD1-PDL1 interactions in the tumour environment (50). 24 Interestingly, FOXP3-driven genes had similar dynamics to Activation genes 1 in both FOXP3-pseudotime and Tfh-pseudotime ( Figure 8F). In contrast, Tfh-2 like genes were mostly suppressed throughout FOXP3-pseudotime, while 3 they were progressively induced throughout Tfh-pseudotime (Figure 8G). 4 These differential regulations of two gene modules resonate with those of 5 Tact-Foxp3 genes (which are expressed by both Treg and Tmem) and Tact-6 Runx1 genes (which are expressed specifically in Tmem, and repressed in 7 Treg) (Figure 3). In fact, FOXP3 expression is weakly induced in some cells in 8 the bifurcating Cluster II and the early phase of Tfh-like differentiation (Cluster 9 III) in Tfh-pseudotime, and is progressively increased at and beyond Cluster V 10 in FOXP3-pseudotime ( Figure 8H). 11 RUNX1 is highly expressed in the common Clusters I and II, and is 12 downregulated in the transition from Cluster II to Cluster III in Tfh-pseudotime, 13 and from Cluster II to Cluster V in FOXP3-pseudotime ( Figure 8I), which is 14 compatible with the known dynamics of RUNX1 expression: Runx1 is 15 downregulated when naïve CD4+ T cells differentiate into activated/effector 16 cells following TCR signaling (51). By analysing other key genes used as CCA 17 explanatory variables, CTLA4 was induced at the bifurcating point, Cluster II, 18 and onwards in both of the lineages at equivalent expression levels ( Figure  19

8J), reflecting the activated status of both effector Tfh-type cells and Treg. 20
Importantly, CTLA4 is a marker of Treg as well as activated effector T cells, 21 where it acts as a negative regulator of T cell proliferation (52). 22 PDCD1 expression was also induced at the bifurcating point, and throughout 23 Tfh-pseudotime, but specifically suppressed in the early phase of FOXP3-24 pseudotime (Figure 8K), which is compatible with the known dynamics that 1 PDCD1 is transiently upregulated in activated CD4+ T cells as a negative 2 regulatory mechanism to restrain proinflammatory immune responses and 3 maintain peripheral tolerance (53). Further supporting this dynamic 4 perspective, IL2 expression occurs mainly in Cluster II, indicating that these 5 cells are enriched with the T cells that recently recognised antigens (54) 6 ( Figure 8L). Consistently, the expression of the naïve T cell marker CCR7 7 was the highest in the cells with a relatively naïve phenotype in shared Cluster

Identification of markers for the differential regulation of Tfh-like and Treg 1 differentiation in activated T cells 2
Lastly, we aimed to demonstrate the utility of the current approach by 3 discovering exemplary marker genes that distinguish cells in FOXP3-and Tfh-4 pseudotime (i.e. the FOXP3-driven pathway I-II-V-VI, and the Tfh-like pathway 5 I-II-III-IV) (Figure 9A), and identifying the T cell subpopulations by a flow 6 cytometric visualisation of single cell data. Since Activation genes ( Figure 8C Figure 8E), we took the intersect of these genes and the Tact-Foxp3 genes, 9 which were expressed by both resting Tmem and resting Treg in mice ( Figure  10 3). DUSP4 and NFAT5 were such genes and in fact induced in cells at the 11 activated bifurcating Cluster II and onwards in both lineages ( Figure 9B). 12 Similarly, in order to identify a marker to distinguish Treg and Tfh-like cells, 13 firstly, we identified CCR8 and IL2RA in the intersect of FOXP3-driven genes 14 ( Figure 8C) and the Tact-Foxp3 genes, which were induced highly and 15 progressively in Treg-lineage cells throughout FOXP3-pseudotime, while 16 mostly suppressed across Tfh-pseudotime ( Figure 9C). In contrast, BCL6 and 17 KCNK5 (found in the intersect of Tfh-like genes ( Figure 8C) and the Tact-18 Runx1 genes, which are expressed in resting Tmem but suppressed in resting 19 Treg (Figure 3)) were progressively induced across Tfh-pseudotime, while 20 suppressed in FOXP3-pseudotime ( Figure 9D). Resting Treg showed an activated status, comparable to that of Teff and 3 Tmem at the population level. In addition, the activation signature of Treg was 4 more remarkable in CD44 hi CD62L lo activated Treg than CD44 lo CD62L hi naïve-5 like Treg. CD44 hi CD62L lo Treg are also identified as eTreg, which may have 6 enhanced immunosuppressive activities (55). The eTreg fraction includes the 7 GITR hi PD-1 hi CD25 hi "Triple-high" eTreg that have high CD5 and Nur77 8 expressions, which indicates that they have received strong TCR signals (17). 9 In humans, CD25 hi CD45RA -FOXP3 hi eTreg highly express Ki67 (56), 10 indicating that these cells were recently activated. Given that TCRs of Treg 11 have higher affinities to self-antigens (57), these eTreg may have the most 12 self-reactive TCRs during homeostasis. Alternatively, the eTreg subset may 13 have recently received strong TCR signals and upregulated activation 14 markers, and such cells may acquire a resting status at later time points. 15 Future investigations by TCR repertoire analysis will answer this question. 16 Our study revealed the heterogeneity of FOXP3+ Treg at the single cell level, 17 and showed that tumour-infiltrating Treg include FOXP3+ T cells with various 18 levels of activation ( Figure 7B and Figure 8C). It is plausible that, in the 19 physiological polyclonal settings, the variations in the activated status of 20 individual Treg may be due to the TCR affinity to its cognate antigen, the 21 availability of cognate antigen, and the strength and duration of TCR signals. 22 Our SC4A analysis identified the FOXP3-driven genes, which are specific to 23 activated FOXP3+ cells and include IL-2 and common gamma chain cytokine 24 receptors (i.e. IL2RA, IL2RB, IL15RA, IL4R, and IL2RG), DNA replication 25 licensing factors (e.g. MCM2), and transcription factors such as PRDM1 1 (BLIMP1) and IRF4 (which control the differentiation and function of eTreg 2 (19)). These gene modules are distinct from the Tfh-like genes and the 3 activation genes (Figure 8), and may be controlled specifically by FOXP3 4 under strong TCR signals. The expression of these genes is variable within 5 the FOXP3+ T cells, suggesting that the transcriptional activities of these 6 genes are dynamically regulated over time in tumour-infiltrating Treg. Thus, 7 single cell-level analysis is becoming a key technology to address the 8 heterogeneity of Treg. To our knowledge, this study is one of the first single 9 cell analyses of Treg transcriptomes, while we find that, during the review 10 process of this manuscript, another study addressing Treg heterogeneity by 11 single cell RNA-seq was deposited at a preprint server (58)). 12 The shared activation genes between activated FOXP3+ Treg and FOXP3-13 non-Treg contain apoptosis-related genes (e.g. CASP3, BAD), which may be 14 differentially controlled between Treg and non-Treg at the protein level. For 15 example, activated FOXP3-non-Treg express DUSP6 ( Figure 9B), which is a 16 negative regulator of JNK-induced apoptosis through BIM activation, while 17 FOXP3 suppresses DUSP6 expression and promotes the apoptosis 18 mechanism (59). In addition, the activation genes include transcription factors 19 such as TBX21 (T-bet) and BATF. Although TBX21 is sometimes thought to 20 be a Th1-specific gene, it is upregulated immediately after T cell activation 21 (60). BATF was identified as a critical factor for the differentiation and 22 accumulation of tissue-infiltrating Treg (61). These activation genes may be 23 required when T cells are activated and differentiate into either Treg or Teff. 24 Further studies are required to investigate the temporal sequences of these 1 differentiation events in vivo. 2 Although the effects of TCR signals on Tmem were not directly examined, 3 considering that Tmem are self-reactive and their differentiation is dependent 4 on the recognition of cognate antigens in the thymus (7), these results 5 collectively suggest that the activation signature of Tmem is also dependent 6 on TCR signals, as is the activation signature in Treg ( Figure 5B). Intriguingly, 7 some Treg may lose their Foxp3 expression and become ex-Treg, which are 8 enriched in CD44 hi effector T cells or Tmem (30). In contrast, a Tmem 9 population (precisely, Foxp3 − CD44 hi CD73 hi FR4 hi T cells) efficiently express 10 Foxp3 during lymphopenia (62). These findings support the feedback control 11 model that Foxp3 expression can be induced in Tmem and sustained in Treg 12 as a regulatory feedback mechanism for TCR signals (7). Given the variations 13 in the activated status in individual Treg and Tmem, single cell analysis will be 14 required to address this problem. For example, although Samstein et al. 15 showed that DNA hypersensitivity sites in Treg are similar to those in 16 activated T cells (9), it is possible that DNA hypersensitivity sites are variable 17 between individual Treg, and that Tmem may have a similar chromatin 18 structure to Treg. 19 Importantly, our analysis showed that Tmem-specific activation-induced 20 genes (i.e. Tact-Runx1 genes) are uniquely repressed in Treg. The repression 21 is likely to be mediated by the interaction between Foxp3 and other 22 transcription factors that regulate the expression of the Tmem-specific 23 activation genes ( Figure 3C). Interestingly, Runx1 was associated with these 24 Tmem-specific genes. In fact, Foxp3 interacts with Runx1 and thereby 1 represses IL-2 transcription and controls the regulatory function of Treg (38), 2 and a significant part of the Foxp3-binding to active enhancers occurs through 3 the Foxp3-Runx1 interaction (9). These suggest that Runx1 may have a 4 unique role in the differentiation and maintenance of Tmem. 5 While CTLA-4 is commonly recognised as a Treg marker, it is upregulated in 6 all activated T cells, thus CTLA-4 is also a marker of activated T cells (41). 7 CTLA-4 is in fact expressed by only a subset of resting Treg (63), which may 8 be more activated and proliferating in vivo (64). In fact, our study shows that 9 CTLA-4 is expressed by non-Treg activated T cells including resting Tmem 10 ( Figure 3D) and FOXP3-Tfh-like effector T cells in the tumour 11 microenvironment ( Figure 7E and 8C). These findings support that CTLA-4 is 12 primarily a marker for general T cell activation, rather than Treg-specific 13 marker, and that Treg are highly activated T cells with FOXP3 and CTLA-4 14 expression. Importantly, although both FOXP3+ and FOXP3-cells had the 15 same relative level of activation ( Figure 7D), the absolute number of FOXP3+ 16 cells expressing CTLA4 was lower than that of Tfh-type cells ( Figure 8J In contrast, the expression of PDCD1 was consistently high in all Tfh-like 23 cells, while it was sparse among FOXP3+ cells ( Figure 8K). The co-24 expression of BCL6 and IL21 in some of these PD-1+ cells indicates that Tfh 1 differentiation occurs in the tumour microenvironment, presumably through 2 the repeated and chronic exposure to quasi-self antigens (i.e. tumour 3 antigens). Interestingly, the Tfh signature has been identified in type-I 4 diabetes in both mice and humans (65). Intriguingly, the Tfh-like genes include 5 cell-cycle related genes (e.g. CDK6), immediate early transcription factors 6 (NFATC1, EGR2/3), and RNA-processing genes (e.g. DICER1). The 7 significance of these gene modules should be addressed in future studies. 8 However, the high PDCD1 expression in Tfh-like cells may make them 9 vulnerable to the negative immunoregulatory effect of PD-1 in tumour 10 microenvironments (50). In fact, the most mature PDCD1 high Tfh-like cells 11 (cluster VI, Figure 8K) moderately decrease the expression levels of activation 12 genes ( Figure 8E), suggesting that these cells may have started to be 13 regulated by PD1 ligands. Further experimental investigations are required to 14 reveal how dynamically PD1 regulates T cells during immune response. 15 Interestingly, RUNX1 is completely repressed in the early phase of FOXP3-16 pseudotime (Cluster V) but re-expressed in the late phase of FOXP3-17 pseudotime the expression of RUNX1 is significantly elevated (Cluster VI) 18 ( Figure 8I). Similar to RUNX1, some cells appear to be expressing PDCD1 in 19 the later phase of FOXP3-pseudotime in Treg-lineage cells. The 20 reappearance of effector phenotype genes RUNX1 and PDCD1 in FOXP3 high 21 cells may indicate that these Treg are highly activated effector Treg, which 22 may be actively participating in neutralizing the activities of effector T cells 23 and Tfh. Alternatively, these FOXP3 high cells may include ambivalent cells with 24 the characters of both regulatory and effector, or alternatively, be at the point 25 of conversion to Tmem (Figure 8K). In bulk resting Treg, Pdcd1 was 1 expressed at low levels in some Treg ( Figure 3E) as well as PD-L2-encoding 2 gene Pdcd1lg2 ( Figure 3D). Future studies are required to reveal the role of 3 these cells. 4 SC4A is a useful method to identify distinct clusters of T cells and the 5 correlated genes to each cluster, and thereby to reveal characteristic cell 6 groups and their active gene modules, while retaining the single-cell level 7 variations. We also showed that SC4A and CCA results can be further 8 analysed by the pseudotime approach (CCA-pseudotime). Since SC4A/CCA 9 provides functional annotations to cell groups and gene clusters, the 10 understanding of the pseudotime axis is effective, as shown in the current 11 study. However, given that pseudotime is not a direct measurement of the 12 time-dependent events, but rather is that of similarities between samples (66), 13 future studies are required to analyse time-dependent events in vivo, ideally 14 with a new experimental system to directly address the temporal dynamics. In 15 order to make the current SC4A/CCA approach accessible to experimental 16 immunologists, we visualised our single cell data findings using a flow-17 cytometry style (Figure 9). Although reliable antibodies are currently not 18 available for those intracellular candidate genes, and the expression of protein 19 and transcripts may not be synchronized, the present study demonstrated the 20 power of the SC4A/CCA approach to extract biological meaning from 21 unannotated single cell RNA-seq data. The current limitation of SC4A is that it 22 is computationally expensive (i.e. requires several hours for each analysis 23 using a standard desktop), and the improvement of the computational 24 algorithm using a low-level language will be beneficial. Importantly, SC4A is 25 most effective when used together with in-depth knowledge of immunology 1 and gene regulation, facilitates the interpretation of CCA results and 2 explanatory variable selection. Thus, it is hoped that these tools will be widely 3 used by experimental immunologists with a sound understanding of the 4 biological significance of results, as well as adequate competence in 5 computational analysis, which will enable to ask questions involving 6 multidimensional problems such as multiple T cell subsets. 7 8 Data and code availability 9 All R codes are available upon request. Processed data will be provided upon 10 reasonable requests to the corresponding author.          The microarray dataset of peripheral CD4+ T cells, including naïve, effector 5 and memory phenotype from various sites (GSE15907), was analysed using 6 the T cell activation variable, which was obtained by the microarray dataset of 7 conventional activated CD4+ T cells (GSE42276). (A) Schematic 8 representation of CCA for the cross-level analysis of T cell populations (cells), 9 immunological processes, and genes. (B) CCA was applied to the T cell 10 population data using an explanatory variable for T cell activation, which was 11 obtained as fold change between activated and resting conventional CD4+ T 12 cells. The CCA solution is thus one-dimensional, and is used as "T cell 13 activation score" (see Methods).  Venn diagram analysis was used to obtain intersects of TCR-dependent 4 genes (DEG between TCRa KO and WT Treg), aTreg-specific genes (DEG 5 between aTreg and rTreg), and Tact-Foxp3 and Tact-Runx1 genes (see 6 between Tact and Trest, and thus, Activated T cells and Resting T cells were 23 defined by the CCA Axis 1 score, and these cells were further classified into 24 Treg and non-Treg by their FOXP3 expression (see legend). Percentage 25 cells was shown by a 2-dimensional plot in a flow cytometric style. Data from 1 Treg-lineage cells (Cluster V and VI, upper panels) and Tfh-like lineage cells 2 (Cluster III and IV, lower panels). The gene in x-axis (NFAT5) is from the 3 activation gene group (B), while y-axis shows genes from either the FOXP3-4 Treg group (C) or the Tfh-like/Tmem group (D). Thresholds and quadrant 5 gates were determined in an empirical manner using density plot. 6