A Regularized Multi-Task Learning Approach for Cell Type Detection in Single-Cell RNA Sequencing Data

Cell type prediction is one of the most challenging tasks in the analysis of single-cell RNA sequencing (scRNA-seq) data. Existing methods use unsupervised learning to identify signature genes in each cluster, followed by a literature survey of those genes to assign cell types. However, finding potential marker genes in each cluster is cumbersome, which impedes the systematic analysis of scRNA-seq data. To address this challenge, we propose a framework based on regularized multi-task learning (RMTL) that simultaneously learns the subpopulations associated with particular cell types. Learning the structure of each subpopulation is treated as a separate task in the multi-task learner. Regularization is used to modulate the task models (W_1, W_2, ..., W_T) jointly, according to a specified prior. To validate the model, we trained it on reference data constructed from a scRNA-seq experiment and applied it to a completely independent query dataset. We assessed the efficacy of the proposed method by comparing it with other state-of-the-art techniques for cell type detection. The results revealed that the proposed method detects cell types in scRNA-seq data accurately and can thus serve as a useful tool in the scRNA-seq pipeline.


INTRODUCTION
There has been great interest recently in single-cell molecular profiling technologies, particularly when dealing with rare or highly specific cell types and states. Recent technological advances enable tens of thousands of cells to be processed per scRNA-seq experiment (Svensson et al., 2018). A fundamental step in the downstream analysis of single-cell data is to type the individual cells. The most popular and immediate approach is to group cells using unsupervised learning (Gribov et al., 2010; Kiselev et al., 2017) and then analyze the clusters to determine the cell categories. This style of analysis has so far been prevalent for identifying biologically coherent cell populations in scRNA-seq data (Cao et al., 2017; Fincher et al., 2018; Han et al., 2018; Plass et al., 2018).
Unsupervised (clustering) methods require manual annotation, which entails problems concerning the resolution of (sub-)types, manpower resources, and bias toward existing human knowledge. This step forfeits a characteristic advantage of scRNA-seq data analysis because of the human intervention it requires. Manual annotation depends on prior knowledge of marker genes, which may come from earlier bulk studies. Consequently, assigning biological meaning to the cell clusters is not only a complicated task but also demands a large amount of time. The problem becomes even worse as the number of cells and samples increases, which prevents fast and reproducible annotation.
To overcome this problem, we need methods that determine cell labels automatically. Given labeled input data, supervised learning methods (Alquicira-Hernández et al., 2018; Wagner and Yanai, 2018; Ma and Pellegrini, 2019; Pliner et al., 2019) can handle automatic, hassle-free cell type detection. Supervised approaches have recently gained popularity because they can determine cell types from the data, although the underlying molecular mechanisms are still not fully explored (Abdelaal et al., 2019).
Several methods address the problem of cell type detection in scRNA-seq data using supervised (or unsupervised) approaches. Abdelaal et al. (2019) present an excellent review of the different supervised techniques for cell type detection. The task is to learn cellular identities from annotated training data and then predict the cell type. These approaches are relatively new compared with the large number of methods available for the other computational challenges of single-cell analysis. Pliner et al. (2019) devised a method called Garnett for rapidly annotating cell types in scRNA-seq data. Garnett operates in four steps: first, it defines a markup language for specifying cell types; second, this language is processed by a parser that identifies representative cells bearing marker genes; third, it assigns additional cells to each cell type based on their similarity to the representatives; and fourth, it applies a classifier trained on one dataset to others. Wagner and Yanai (2018) introduced a method based on a hierarchical machine learning framework that can construct robust cell type classifiers from heterogeneous scRNA-seq datasets. Ma and Pellegrini (2019) proposed ACTINN (Automated Cell Type Identification using Neural Networks), which trains a neural network with three hidden layers on a dataset with predefined cell types; predictions for other datasets are then based on the trained parameters. scPred (Alquicira-Hernández et al., 2018) combines an unbiased feature selection method with machine learning classification, using dimensionality reduction and orthogonalization of gene expression values to predict cell types accurately.
Most of these methods train a model on the whole scRNA-seq dataset for prediction, or apply feature selection to reduce the dimension of the input data before clustering. The training process therefore depends on the data as a whole and ignores the crosstalk between multiple cell populations.
Here, we present a framework based on the regularized multi-task learning (RMTL) approach for automatic cell type detection in scRNA-seq datasets. The advantage of our model is that it takes multiple cell populations as input, enabling simultaneous learning of their features. RMTL is already a well-explored field and has recently gained popularity for solving numerous problems in the bioinformatics domain (Zhang et al., 2018; Dizaji et al., 2020; Wang et al., 2020). Learning from multiple interrelated tasks improves performance (Baxter, 1997), and the multi-task approach can also mitigate overfitting. The crucial task is to find the shared parameters that capture the relationship among common features across tasks (Singh et al., 2019). Our model takes samples of different cellular identities as reference input data and predicts cell types in query datasets. We hypothesize that the biological information of cell samples from several cell types is related, and for this reason we learn all cell samples simultaneously using multi-task learning. L_{2,1} regularization is used to smooth the loss function, thereby reducing the complexity of the model. We compared the proposed method with four state-of-the-art, widely used cell type detection methods for scRNA-seq data. The results show that the proposed method outperforms the others in automatically detecting cell types.

Dataset Description and Preprocessing
The following datasets are used for the experiments; their details are summarized in Table 1. We adopted the standard pipeline of Seurat v3 (Stuart et al., 2019) for the preliminary analysis and preprocessing, particularly quality control, cell and gene filtering, and normalization.

1. CBMC (Stoeckius et al., 2017).
... (Grabherr et al., 2011) and the eukaryotic genome annotation tool PASA were used to perform the de novo assembly of reads.
5. Klein (Klein et al., 2015): This dataset was generated by the droplet barcoding method, with an average total read count of 20,033.40 reads in the expression matrix. A total of eight single-cell datasets were submitted: three for mouse embryonic stem (ES) cells (one biological replicate and two technical replicates); three samples following LIF withdrawal (days 2, 4, and 7); one pure RNA dataset (from human lymphoblast K562 cells); and one sample of single K562 cells. The dataset was downloaded from GEO under accession no. GSE65525. It contains 24,175 genes and 2,717 cells with four cell types. Cells are captured and barcoded in nanoliter droplets with high capture efficiency.
6. PBMC68k (Zheng et al., 2017): The dataset was downloaded from the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets). The data were sequenced on an Illumina NextSeq 500 (high output) with 20,000 reads per cell.

Preprocessing
In the initial step, we collected single-cell RNA count matrices from different sources. In these matrices, columns contain cell/sample information and genes are represented row-wise. We organize the RNA counts as a matrix M_cl×ge, where cl is the number of cells and ge is the number of genes; each element [M]_ij represents the count of the jth gene in the ith cell. A cell is termed good if more than a thousand genes are expressed (non-zero counts) in it. A gene is considered expressed if its read count exceeds 5 in at least 10% of the good cells. The data matrix restricted to expressed genes and good cells is normalized using a linear model and normality-based normalizing transformation method (Linnorm) (Yip et al., 2017). The resulting matrix M_cl′×ge′ is then log2 transformed after adding one as a pseudocount.
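The filtering and transformation steps above can be sketched as follows. This is a minimal Python/NumPy illustration with hypothetical function and parameter names; the Linnorm normalization used in the actual pipeline is replaced here by a plain log2(x + 1) transform.

```python
import numpy as np

def preprocess_counts(M, min_genes_per_cell=1000, min_count=5, min_cell_frac=0.10):
    """Filter a cells x genes count matrix as described in the text.

    A cell is 'good' if more than `min_genes_per_cell` genes have non-zero
    counts; a gene is 'expressed' if its count exceeds `min_count` in at
    least `min_cell_frac` of the good cells. (Linnorm normalization from
    the original pipeline is replaced by a simple log2(x + 1) transform.)
    """
    M = np.asarray(M, dtype=float)
    good_cells = (M > 0).sum(axis=1) > min_genes_per_cell   # cells with enough expressed genes
    G = M[good_cells]
    expressed = (G > min_count).mean(axis=0) >= min_cell_frac  # genes expressed in enough good cells
    G = G[:, expressed]
    return np.log2(G + 1.0)                                  # pseudocount of one, then log2
```

The thresholds mirror the text (1,000 genes per good cell, read count above 5 in at least 10% of good cells) but are exposed as parameters so smaller toy matrices can be filtered too.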

A Short Description on Multi-Task Learning
Baxter (1997) first introduced the concept of multi-task learning through the theoretical study of learning multiple tasks, and described the use of multi-task sampling in a Bayesian model. This model was used to determine how much information each individual task requires in order to learn. Baxter (2000) established the concept of inductive bias for searching for optimal hypotheses in an environment of multiple related tasks. Ben-David and Schuller (2003) used a generalized VC dimension to derive bounds for each task under the assumption that the learning tasks are related.
The main assumption behind the use of multi-task learning (MTL) is that tasks arising in different types of learning are related to each other. Examples of learning tasks include supervised learning (e.g., classification, regression), unsupervised learning (e.g., clustering), reinforcement learning, semi-supervised learning, and many more. Among them, either all tasks or a subset of tasks are assumed to be related. The motivation is that simultaneous learning of multiple related tasks leads to better performance than learning each task alone. The primary intention of using MTL is thus to enhance generalization performance across related tasks.
The main idea is as follows: given Z learning tasks whose datasets come from the same space X × Y, the conditional distributions of the response variable, Y_z | X_z, are assumed to be related, where X is the space of explanatory variables shared by all Z tasks. In particular, given Z learning tasks {τ_z}, z = 1, ..., Z, each task has n data points (x_1z, y_1z), (x_2z, y_2z), ..., (x_nz, y_nz), where each data point comes from a distribution P_z on X × Y. The distributions P_z differ across tasks, but MTL assumes that they are related. The aim is to learn Z functions f_1, f_2, ..., f_Z, each corresponding to one learning task, such that f_z(x_iz) = y_iz. For Z = 1 the problem reduces to single-task learning. Several setups are possible: the input data x_iz may be the same for all tasks while the output values y_iz differ; conversely, the same output y_iz may correspond to different inputs x_iz, which corresponds to the problem of integrating information from heterogeneous databases (Ben-David et al., 2002).

Description of the Proposed Methodology
We have proposed a supervised model which leverages the characteristics of the regularized multi-task learning algorithm for efficiently identifying cell types present in the scRNA-seq datasets. The overall analysis is shown in Figure 1.

Multi-Task Learning for Cell Type Prediction
A regularization-based approach (Evgeniou and Pontil, 2004) is used to solve the MTL problem, where regularized loss functions are minimized in a manner analogous to the SVM used in single-task learning. All such algorithms minimize, more or less, a function of the form

min_W Σ_{t=1}^{T} L(Y_t, X_t; W_t) + λ_1 ω(W) + λ_2 ||W||²,

where L(·) represents the loss function, ω represents the cross-task regularization, and λ_1 and λ_2 are positive regularization parameters. λ_1 signifies the strength of relatedness of all tasks and is estimated through a cross-validation procedure, whereas λ_2 introduces a penalty on the quadratic form of W.
The tasks in our case are to learn the different subpopulations (S_1, S_2, ..., S_T) of T cell types. Each task t can be represented by a data pair (X_t, Y_t) with X_t ∈ R^{n×p}, where n represents the number of samples and p the number of genes in the scRNA-seq data. We assume the generalized linear model Y_it = f_t(W_t · X_it) = f_t((W_0 + V_t) · X_it) for each t ∈ {1, 2, ..., T}, where "·" represents the standard dot product in R^p and W_t = W_0 + V_t. The vector W_t corresponds to the linear model for task t, and V_t controls the task relatedness. We use the RMTL framework to estimate the parameters W_0 and V_t by minimizing

Σ_{t=1}^{T} Σ_{i=1}^{n} L(y_it, (W_0 + V_t) · x_it) + λ_1 Σ_{t=1}^{T} ||V_t||² + λ_2 ||W_0||²,

where the first (double) summation represents the loss over all tasks and data points, and the positive regularization parameters λ_1 and λ_2 trade off fitting the data against the smoothness of the estimate. The input matrix is first randomly partitioned into training and test sets (Figure 1): 80% of the data are randomly selected for training, and the remaining 20% are used for testing. In this model, the tasks are indexed by t ∈ {1, 2, ..., T}; each task represents the learning of expression data from one cell subpopulation.
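A minimal sketch of the shared/task-specific parameter split W_t = W_0 + V_t, fit by plain gradient descent, may clarify the objective. This is illustrative only: it uses a squared loss rather than the paper's classification loss with L_{2,1} regularization, and the function and parameter names are hypothetical.

```python
import numpy as np

def rmtl_fit(Xs, Ys, lam1=0.1, lam2=0.1, lr=0.01, epochs=500):
    """Gradient-descent sketch of the Evgeniou-Pontil RMTL model.

    Minimizes sum_t sum_i (y_it - (W0 + V_t) . x_it)^2 / n
              + lam1 * sum_t ||V_t||^2 + lam2 * ||W0||^2.
    Xs, Ys are lists of per-task design matrices and response vectors.
    """
    T, p = len(Xs), Xs[0].shape[1]
    W0 = np.zeros(p)          # shared component
    V = np.zeros((T, p))      # task-specific components
    for _ in range(epochs):
        gW0 = 2.0 * lam2 * W0
        for t in range(T):
            r = Xs[t] @ (W0 + V[t]) - Ys[t]        # residuals for task t
            g = 2.0 * Xs[t].T @ r / len(Ys[t])     # per-task loss gradient
            gW0 += g                               # shared part sees every task
            V[t] -= lr * (g + 2.0 * lam1 * V[t])   # task-specific update
        W0 -= lr * gW0                             # shared update
    return W0, V
```

Because W_0 accumulates gradients from all tasks while each V_t sees only its own, λ_1 directly controls how far the tasks may drift from the shared model, which is the "task relatedness" knob described above.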
Here, the reference data are the scRNA-seq expression matrix over all cellular identities. We trained the proposed model on the reference data and tested its accuracy on the query data. The query dataset is, of course, excluded from the reference used for training, so the model is evaluated on completely independent data. Accuracy is calculated as the percentage of correctly identified cell types in the test data.

Comparison of the Proposed Method With State-Of-The-Art Methods

Description of State-Of-The-Art Methods
We compared the proposed method with current state-of-the-art techniques for supervised learning-based single-cell typing. Supervised learning is advantageous over unsupervised learning (clustering) because it automates the cell typing procedure instead of relying on manual annotation. Garnett (Pliner et al., 2019) annotates cell types in scRNA-seq data by defining a markup language to specify cell types, which is subsequently processed by a parser that identifies representative cells bearing marker genes; new cells are then assigned to cell types based on their similarity to the representative cells. An alternative work presents a hierarchical machine learning framework that yields robust cell type classifiers trained on heterogeneous scRNA-seq datasets (Wagner and Yanai, 2018). ACTINN (Automated Cell Type Identification using Neural Networks) presents a neural network model with three hidden layers, which is trained and used for prediction in the usual way (Ma and Pellegrini, 2019). scPred combines an unbiased feature selection method with standard machine learning classification, where dimensionality reduction and orthogonalization of gene expression values prove advantageous for accurately predicting cell types (Alquicira-Hernández et al., 2018). CHETAH builds a hierarchical classification tree from the reference (training) dataset and classifies unknown samples by computing the correlation between genes that discriminate the test cell from the reference dataset (de Kanter et al., 2019).

Training and Test Data
Each scRNA-seq dataset is divided into training and test data at a ratio of 8:2. The performance of each competing method is evaluated by the average test accuracy and the corresponding standard error over 100 runs. To assess how the different methods react to reduced training data, the data were subsampled at rates ranging from 20 to 100% in steps of 20% prior to training; for each of the 100 runs, random subsamples were drawn independently.

FIGURE 1 | Workflow of the methodology: the proposed approach for cell type identification. The data are randomly divided into training and test sets. The cell types present in the training sets are used to train the multi-task learning classifier with cross-validation. The learnt model is then tested on the test datasets, and accuracy is measured with a confusion matrix.

Frontiers in Genetics | www.frontiersin.org April 2022 | Volume 13 | Article 788832

Table 2 displays the mean accuracy with standard deviation over the 100 independent runs; the bold values represent the amount of training data used during the training phase (in percentage). It is evident that the proposed method outperformed the state-of-the-art methods with respect to the accuracy score. Although the performance is less impressive for small numbers of training samples, the proposed method outperformed the others when sufficient samples are available for training.

To visualize the original and predicted labels, we performed t-SNE-based embedding of the datasets. Figure 2 shows the two-dimensional t-SNE-based embedding of the melanoma data with its original and predicted labels, the predicted labels being obtained from the trained model. Figure 2A shows the t-SNE embedding of the melanoma data for original and predicted labels, while Figure 2B compares the predicted and original labels for each individual cell. Of note, some minor cell types that come with few samples, such as macrophages (4.47% of the total cells), endothelial cells (7.15% of the total cells), and NK cells (1.73% of the total cells), are also correctly predicted by the proposed method. The bottom panels of Figure 2B show a set of donut charts representing the misclassification of cells; it is evident that in most cases (except Treg cells), the false-positive rate is extremely low. In particular, endothelial cells (7.15% of the original cell samples and 7.15% of the predicted cell samples) are predicted with utmost accuracy. A similar conclusion can be drawn from the CBMC data classification; Supplementary Figure S1 shows the classification results in detail.
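The two-dimensional embedding used for the label plots can be reproduced schematically as follows. This assumes scikit-learn is available; the perplexity value is an illustrative choice, not taken from the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(expr, random_state=0):
    """Project a cells x genes expression matrix to 2-D with t-SNE,
    as used for the original-vs-predicted label visualizations.
    Perplexity is an illustrative default, not the paper's setting."""
    tsne = TSNE(n_components=2, perplexity=30, random_state=random_state)
    return tsne.fit_transform(np.asarray(expr, dtype=float))
```

Plotting the same 2-D coordinates twice, colored once by original labels and once by predicted labels, gives side-by-side panels in the style of Figure 2A.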

Proposed Method Can Identify Poorly Covered Cell Types Accurately
In this study, our aim was to see how well poorly covered cell types, which generally come with little training data, are detected by the competing methods. For this, we applied the trained model to test data and computed the recall and precision scores for the samples of each particular cell type.
Considering the heterogeneous distribution of cells, we performed this experiment on the CBMC and Melanoma datasets. In the other datasets, the cell types have a sufficient proportion of samples; for Klein, for example, the proportions of the four cell types are type-1 (11.15%), type-2 (25.13%), type-3 (29.37%), and type-4 (34.33%).
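The per-cell-type scores used here can be sketched with plain NumPy (a minimal illustration; the function name is hypothetical):

```python
import numpy as np

def precision_recall(y_true, y_pred, cls):
    """Precision and recall for one cell type, as used to assess
    detection of poorly covered (rare) cell types."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == cls) & (y_true == cls))  # correctly called cls
    fp = np.sum((y_pred == cls) & (y_true != cls))  # wrongly called cls
    fn = np.sum((y_pred != cls) & (y_true == cls))  # cls cells missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec
```

Computing these scores separately for each cell type, rather than a single overall accuracy, is what exposes how the methods behave on rare populations such as NK cells.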

Stability Performance
To compare the stability of the four competing methods, we carried out a 10-fold cross-validation analysis for all the datasets. In each fold, we randomly divided the training data into training and validation sets in the ratio 9:1 and computed the validation accuracy; the process was repeated 100 times per fold. Thus, in each fold we obtained 100 validation accuracies and one test accuracy for each competing method. The medians of the validation accuracies were compared across the folds with a Wilcoxon rank-sum test. Table 4 shows the p-values for all the competing methods across all the datasets. Although all methods produce stable results with low p-values, the proposed method showed the most stable performance. Figure 3 also shows the test accuracy for all the methods across the folds for the CBMC dataset (see Supplementary Material for the results on the other datasets). From Table 4 and Figure 3, it is evident that the proposed method outperforms the others in producing stable results.
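The repeated 9:1 validation splits and the rank-sum comparison can be sketched as follows. This is a schematic under assumptions: `train_fn` is a hypothetical interface (a fit function returning a predict callable), and SciPy's rank-sum test stands in for the paper's exact statistical procedure.

```python
import numpy as np
from scipy.stats import ranksums

def fold_validation_accuracies(X, y, train_fn, n_repeats=100, seed=0):
    """Repeatedly split training data 9:1 (train:validation) and record
    validation accuracy, as done within each cross-validation fold."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(y))
        cut = int(0.9 * len(y))
        tr, va = idx[:cut], idx[cut:]
        predict = train_fn(X[tr], y[tr])           # fit on 90%
        accs.append(np.mean(predict(X[va]) == y[va]))  # score on 10%
    return np.array(accs)

def stability_pvalue(acc_a, acc_b):
    """Wilcoxon rank-sum test comparing two methods' validation accuracies."""
    return ranksums(acc_a, acc_b).pvalue
```

Collecting 100 accuracies per fold and per method, then comparing the distributions, separates genuinely stable methods from ones whose accuracy fluctuates with the split.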

Execution Time
All experiments were carried out on a Linux server with 50 cores on an x86_64 platform. To compare the execution times of the competing methods, four simulated datasets were generated with splatter (Zappia et al., 2017), varying the number of cells and classes as follows: 500 cells with two classes, 1,000 cells with three classes, 1,500 cells with four classes, and 2,000 cells with five classes. All the simulated datasets were generated with equal group probabilities, 2,000 features, and a fixed dropout rate of 0.2. In each case, the runtime of the competing methods is compared (see Table 5).
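A runtime comparison of this kind can be sketched generically (a minimal illustration; `fit_predict` is a hypothetical stand-in for any competing method's train-plus-predict call):

```python
import time

def time_method(fit_predict, X, y, repeats=3):
    """Wall-clock timing of a train+predict call on simulated data,
    analogous to a per-method runtime comparison table."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fit_predict(X, y)
        times.append(time.perf_counter() - t0)
    return min(times)  # best-of-n reduces scheduling noise
```

Taking the minimum over a few repeats is a common convention for wall-clock benchmarks, since background load only ever inflates a measurement.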

DISCUSSION
Our proposed methodology addresses the cell type prediction problem with a robust multi-task learning model that predicts cell types efficiently. Cell type detection is crucial in many applications of single-cell RNA sequencing data. The results demonstrate that the L_{2,1} regularization technique helps in jointly learning the features of the cell types. In experiments on six different datasets, CBMC (Stoeckius et al., 2017), Goolam (Goolam et al., 2016), Melanoma (Tirosh et al., 2016), PBMC (Zheng et al., 2017), Yan (Yan et al., 2013), and Klein (Klein et al., 2015), we evaluated how the proposed method performed in comparison with other methods: scPred (Alquicira-Hernández et al., 2018), ACTINN (Ma and Pellegrini, 2019), CHETAH (de Kanter et al., 2019), and Garnett (Pliner et al., 2019). The proposed method outperformed the other methods on all datasets used in this study. It also outperformed the others in terms of economical use of training samples: for the Klein, CBMC, Melanoma, and Yan datasets, 40% of the training samples are enough to obtain more than 65% accuracy. Of note, exactly these advantages were landmark arguments for regularized multi-task learning in its original application as well.
In summary, we provide a new method that brings recent advances in machine learning to the typing of single cells in heterogeneous single-cell RNA sequencing data. We have demonstrated that the theoretical promises can indeed be leveraged, and we argue that this pushes the limits of single-cell typing by a non-negligible amount.
We conclude by acknowledging that our method, of course, also leaves room for improvement: various open problems still await their solutions. For example, one challenge is that our method, being a supervised approach, requires cell annotations prior to classification. Although an automated approach is much preferred over approaches requiring manual intervention at some point, actionable annotations still need to be provided before running the method.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.