DeconPeaker, a Deconvolution Model to Identify Cell Types Based on Chromatin Accessibility in ATAC-Seq Data of Mixture Samples

While our understanding of cellular and molecular processes has grown exponentially, issues related to the cell microenvironment and cellular heterogeneity have sparked a new debate concerning the cell identity. Cell composition (chromatin and nuclear architecture) poses a strong risk for dynamic changes in the diseased condition. Since chromatin accessibility patterns play a major role in human diseases, it is therefore anticipated that a deconvolution tool based on open chromatin data will provide better performance in identifying cell composition. Herein, we have designed the deconvolution tool “DeconPeaker,” which can precisely define the uniqueness among subpopulations of cells using open chromatin datasets. Using this tool, we simultaneously evaluated chromatin accessibility and gene expression datasets to estimate cell types and their respective proportions in a mixture of samples. In comparison to other known deconvolution methods, we observed the lowest average root-mean-square error (RMSE = 0.042) and the highest average correlation coefficient (r = 0.919) between the prediction and “true” proportion. As a proof-of-concept, we also tested chromatin accessibility data from acute myeloid leukemia (AML) and successfully obtained unique cell types associated with AML progression. Furthermore, we showed that chromatin accessibility represents more essential characteristics in the identification of cell types than gene expression. Taken together, DeconPeaker as a powerful tool has the potential to combine different datasets (primarily, chromatin accessibility and gene expression) and define different cell types in mixtures. The Python package of DeconPeaker is now available at https://github.com/lihuamei/DeconPeaker.


INTRODUCTION
Human diseases are multifactorial and complex processes in which genetic and epigenetic components are significantly involved. To date, several key biological pathways regulating cellular functions have been defined; however, knowledge about the behavior of individual cells is still very limited. Furthermore, the diversity of intracellular and intercellular interactions creates a significant challenge for understanding this multicellular network. Notably, the lack of defined gene signatures and biological characteristics for bulk tissues from distinct histological subtypes leads to suboptimal results in studies of human diseases (Amit et al., 2020).
Several disease association studies have suggested that cell type composition is a confounding factor (Newman et al., 2015). For instance, at various stages of acute myeloid leukemia (AML), dynamic changes in cell composition occur from hematopoietic stem cells (HSCs) to monocytes, indicating that leukemogenesis largely mirrors the process of normal myelopoiesis (Corces et al., 2016). Likewise, the cell types in the tumor microenvironment (TME) reflect both cancer subtype and the immune response (Hutter and Zenklusen, 2018). Embryogenesis, morphogenesis, cell differentiation, and growth are also directly associated with changes in cell type composition (Hunt et al., 2019). In single-cell sequencing analysis, cell identity is mainly labeled with cell type-specific surface markers (proteins); however, difficulties arise when a heterogeneous mixture of cells also contains an unknown cell type. Furthermore, publicly available databases such as The Cancer Genome Atlas (TCGA) contain thousands of profiled samples. However, these samples were generated as mixtures from bulk sequencing. Therefore, resolving cell types and compositions from these available samples will facilitate our understanding of biological mechanisms. Thus, adequate methods are needed to identify the correct cell types and compositions from a mixture.
To gain better statistical insight into the composition of cell types in a sample mixture, many methods (known as cell type deconvolution) have previously been developed. Most of these approaches use gene expression data and focus on estimating the proportions and/or pure expression states; they can be divided into two subclasses, "partial" and "complete" (Gaujoux and Seoighe, 2012; Chikina et al., 2015). The former requires either cell type-specific signatures or their relative proportions (Abbas et al., 2009; Erkkilä et al., 2010; Newman et al., 2015; Hunt et al., 2019), while the latter estimates the relative cell fractions and simultaneously disentangles their expression profiles directly from mixtures (Repsilber et al., 2010; Zhong et al., 2013). In addition, DNA methylation signal is also used to predict cellular components. Houseman et al. proposed a method based on linking two regression models for the prediction of blood cell-type components (Houseman et al., 2012). Jaffe and Irizarry further reported an adaptation of the Houseman method for application to Illumina M450 array data (Aryee et al., 2014). Salas et al. proposed an optimized library for whole-blood deconvolution (Salas et al., 2018). Likewise, Chakravarthy et al. proposed a DNA methylation-based approach for the deconvolution of Pan-cancer datasets (Newman et al., 2015; Chakravarthy et al., 2018). In recent years, it has been found that chromatin accessibility at regions distant from transcription start sites (TSSs), such as enhancers, is more predictive of cell identity than gene expression itself (Song et al., 2011; Hnisz et al., 2013). In this regard, Corces et al. demonstrated that chromatin accessibility was more cell type-specific and could capture cell identity better than mRNA expression (Corces et al., 2016). Likewise, Zamanighomi et al. suggested that cell type-specific peaks, identified from the assay for transposase-accessible chromatin by sequencing (ATAC-Seq), were enriched for the transcription factor motifs specific to each cell subpopulation (Zamanighomi et al., 2018).
Considering these factors, we generated DeconPeaker, a partial deconvolution method that resolves relative proportions of different cell types in the peak intensity profiles (chromatin accessibility) from the measurement of mixture samples. Compared to other known deconvolution methods, DeconPeaker is reliable and applicable to both chromatin accessibility as well as gene expression data from mixtures.

MATERIALS AND METHODS
DeconPeaker estimates relative proportions of each cell type from bulk ATAC-Seq data. The model can be simply described as m = B × f, where m is the measurement for the mixtures, B represents the signature matrix, and f denotes a vector of unknown proportions that needs to be estimated (Newman et al., 2015).

Pre-processing for ATAC-Seq Data
ATAC-Seq datasets were processed by Kundaje's pipeline with default parameters¹. Briefly, the pipeline has two steps: (1) aligning paired-end reads to the hg19 genome and removing duplicate reads; and (2) calling narrow peaks with MACS2 (Zhang et al., 2008) for each unique cell type. Only the peak and BAM files of all samples were retained.
For reference samples, we first filtered out weak peaks with −log10(q) ≤ 2 (as defined by MACS2) to avoid false positives, as previously described (Jalili et al., 2015). Peaks that overlapped with the blacklisted regions² were also discarded. After generating the list of non-redundant peaks for all reference samples as previously described (Corces et al., 2016), featureCounts (Liao et al., 2013) was used to calculate the fragment counts (paired-read counts) for each sample across all non-redundant peaks, resulting in a count matrix in which rows represent peaks and columns represent the reference samples.

Cell Type-Specific Peaks and Signature Matrix
To avoid batch effects, we performed quantile normalization on the count matrix across all reference samples and excluded peaks below the median value of globally normalized intensities.
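The normalization and filtering step above can be illustrated with a minimal sketch. It assumes a peaks × samples count matrix, breaks ties by sort order rather than averaging them, and interprets "below the median" as comparing each peak's mean normalized intensity with the global median; the function names and that interpretation are assumptions for illustration, not the package's actual code.

```python
import numpy as np

def quantile_normalize(counts: np.ndarray) -> np.ndarray:
    """Quantile-normalize the columns (samples) of a peaks x samples matrix."""
    order = np.argsort(counts, axis=0)                   # per-sample sort order
    ranks = np.argsort(order, axis=0)                    # rank of each peak within its sample
    mean_sorted = np.sort(counts, axis=0).mean(axis=1)   # shared reference distribution
    return mean_sorted[ranks]                            # map each rank to the reference value

def keep_strong_peaks(norm: np.ndarray) -> np.ndarray:
    """Boolean mask of peaks whose mean normalized intensity reaches the
    global median (assumed reading of the filtering rule)."""
    return norm.mean(axis=1) >= np.median(norm)

counts = np.array([[5., 4., 3.],
                   [2., 1., 4.],
                   [3., 4., 6.],
                   [4., 2., 8.]])
norm = quantile_normalize(counts)
mask = keep_strong_peaks(norm)
```

After normalization every sample shares the same set of intensity values, so differences between columns reflect only the ranking of peaks within each sample.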

Cell Type-Specific Peaks (CTSPs)
A statistically reliable and significant CTSP represents an open chromatin region that tends to be more pronounced in one cell type than others. Recently, Zamanighomi et al. used a Poisson regression model combined with hypothesis testing to call CTSPs in single-cell ATAC-seq (scATAC-Seq) data (Zamanighomi et al., 2018), which was difficult to converge during parameter estimation. Here, we employed the strategy of scABC (Zamanighomi et al., 2018) and gave a relatively simple and fast hypothesis-testing framework to identify CTSPs. The detailed mathematical processes can be found in Supplementary Material S1. Finally, the intensity of each CTSP was represented with the average of the peaks in each cell type's samples.

Signature Matrix
This step determines B, the signature matrix mentioned above. In a typical human ATAC-Seq dataset, thousands of CTSPs can introduce noise. This noise can be reduced by shrinking the total number of CTSPs while considering the stability of the linear system, so as to avoid extreme sensitivity to small fluctuations (Abbas et al., 2009; Newman et al., 2015). As previously described (Newman et al., 2015), signature matrices can be made more robust for deconvolution by minimizing the condition number, which is an inherent matrix property. Here, we first calculated the significance score π-value (defined in Supplementary Material S1) of each peak and then employed the optimizing strategy of CIBERSORT (Newman et al., 2015) to derive the signature matrix (Figure 1B). Briefly, we pre-set the minimum (g) and maximum (G) number of CTSPs for each cell type in B. For each cell type, CTSPs were ranked by π-value, and the top G CTSPs were selected into B. To ensure signature matrix stability, we iterated B across all cell types from g to G, and the signature matrix with the minimum condition number was retained. For each iteration, we performed the z-score transformation on B.
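The search over the number of CTSPs per cell type can be sketched as follows. This is an illustrative outline rather than the package's implementation: it assumes that a higher score means a more specific peak, that the z-score is applied column-wise, and that the same number of peaks is taken for every cell type in each iteration. The function and argument names are hypothetical.

```python
import numpy as np

def select_signature(intensity, scores, labels, g=2, G=5):
    """Pick the per-cell-type peak count in [g, G] that minimizes cond(B).

    intensity: peaks x cell_types matrix of mean CTSP intensities
    scores:    per-peak significance score (assumed: higher = more specific)
    labels:    per-peak index of the cell type the peak is specific to
    """
    n_types = intensity.shape[1]
    # Candidate peaks of each cell type, ordered by decreasing score.
    ranked = []
    for k in range(n_types):
        idx = np.flatnonzero(labels == k)
        ranked.append(idx[np.argsort(-scores[idx])])

    best_kappa, best_n, best_rows = np.inf, None, None
    for n in range(g, G + 1):
        rows = np.concatenate([r[:n] for r in ranked])
        B = intensity[rows]
        Bz = (B - B.mean(axis=0)) / B.std(axis=0)   # column-wise z-score (assumption)
        kappa = np.linalg.cond(Bz)                  # 2-norm condition number
        if kappa < best_kappa:
            best_kappa, best_n, best_rows = kappa, n, rows
    return best_kappa, best_n, best_rows

# Toy example: 3 cell types, 10 candidate peaks each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
intensity = rng.random((30, 3))
intensity[np.arange(30), labels] += 5.0   # each peak is strongest in its own cell type
scores = rng.random(30)
kappa, n_per_type, rows = select_signature(intensity, scores, labels)
```

A small condition number means that small perturbations in the mixture measurement translate into small changes in the estimated fractions, which is the stability property the text refers to.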

Data Transformation
Owing to the complexity and diversity of the input data [i.e., signature matrices (B) and mixtures (m)], the accuracy and reliability of the deconvolution are difficult to guarantee. Therefore, we employed an automatic transformation method to enhance the adaptability of DeconPeaker, primarily based on the optimal parameter λ estimated by the Box-Cox algorithm (Box and Cox, 1964). We only considered three transformation scenarios: log2-transformation (λ = 0), square-root transformation (λ = 0.5), and no transformation (λ = 1) (Osborne, 2010).
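One way to picture the automatic choice among the three scenarios is to estimate λ by maximum likelihood and snap it to the nearest allowed value. The snapping rule, the handling of non-positive values, and the function name below are assumptions made for illustration; the text does not specify how the estimated λ is mapped to a scenario.

```python
import numpy as np
from scipy import stats

# The three scenarios named in the text, keyed by their Box-Cox lambda.
CANDIDATES = {0.0: np.log2, 0.5: np.sqrt, 1.0: lambda x: x}

def auto_transform(values):
    """Return (lambda, transform) for the candidate closest to the
    maximum-likelihood Box-Cox lambda (nearest-value rule is an assumption)."""
    x = np.asarray(values, dtype=float).ravel()
    x = x[x > 0]                                   # Box-Cox requires positive data
    lam_hat = stats.boxcox_normmax(x, method="mle")
    nearest = min(CANDIDATES, key=lambda c: abs(c - lam_hat))
    return nearest, CANDIDATES[nearest]

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=2.0, sigma=1.0, size=2000)
lam, transform = auto_transform(skewed)            # log-normal data favors lambda near 0
```

In practice, zero intensities would need a pseudocount before a log2 transform is applied; that detail is omitted here.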

SIMPLS for Deconvolution
In DeconPeaker, SIMPLS (De Jong, 1993) was employed to identify cell types by estimating relative subsets of chromatin accessibility. SIMPLS is a variant of partial least squares (PLS) for multiple response variables that estimates the regression coefficients by linking signature matrices (B) and the measurement of the mixture (m). The regression was conducted using the function "mvr" in the R package "pls" (Mevik et al., 2011). To obtain the estimated cell type proportions (f), negative regression coefficients were set to 0, and the remaining coefficients were normalized to sum to 1.
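The conversion of regression coefficients into proportions can be sketched for the model m = B × f. Note that ordinary least squares is used below purely as a stand-in for the SIMPLS fit, so only the post-processing (zeroing negatives and renormalizing) mirrors the description above; the function name is hypothetical.

```python
import numpy as np

def estimate_fractions(B, m):
    """Estimate cell type fractions f from m = B x f.

    Ordinary least squares stands in for SIMPLS here (assumption);
    negative coefficients are set to 0 and the rest rescaled to sum to 1.
    """
    coef, *_ = np.linalg.lstsq(B, m, rcond=None)
    coef = np.clip(coef, 0.0, None)
    total = coef.sum()
    if total == 0:
        raise ValueError("all regression coefficients were non-positive")
    return coef / total

rng = np.random.default_rng(0)
B = rng.random((50, 3))                      # 50 signature peaks, 3 cell types
f_true = np.array([0.6, 0.3, 0.1])
f_hat = estimate_fractions(B, B @ f_true)    # noise-free mixture is recovered exactly

# A coefficient vector with a negative entry is clipped and renormalized.
f_clipped = estimate_fractions(B, B @ np.array([0.7, 0.4, -0.1]))
```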

Model Evaluation
Root-mean-square error (RMSE) and Pearson correlation coefficient (PCC) between the predicted (f_p) and the known (f_t) cell type composition were calculated to evaluate the model; both are commonly used to measure differences between the ground truth and an estimate. In addition, we employed an asymptotic test for the Wasserstein distance between the observed m and the predicted m to produce a P-value for the deconvolution using Monte Carlo sampling. The Wasserstein metric reflects the degree of similarity between two distributions; smaller distances and smaller P-values indicate greater significance of the model. The details of estimating the P-value are described in Supplementary Material S1.
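For concreteness, the two accuracy metrics can be computed as in the short sketch below; the function names are illustrative and the example proportions are made up.

```python
import numpy as np

def rmse(f_true, f_pred):
    """Root-mean-square error between true and predicted proportions."""
    t, p = np.asarray(f_true, float), np.asarray(f_pred, float)
    return float(np.sqrt(np.mean((t - p) ** 2)))

def pcc(f_true, f_pred):
    """Pearson correlation coefficient between true and predicted proportions."""
    t, p = np.asarray(f_true, float), np.asarray(f_pred, float)
    return float(np.corrcoef(t, p)[0, 1])

f_t = [0.5, 0.3, 0.2]   # known composition (toy values)
f_p = [0.4, 0.4, 0.2]   # predicted composition (toy values)
error = rmse(f_t, f_p)
corr = pcc(f_t, f_p)
```

RMSE captures the absolute deviation of the estimates, whereas PCC captures whether the relative ordering and scaling of the proportions are preserved, so the two metrics are complementary.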

Synthetic Dataset
We simulated 195 synthetic mixture samples from ATAC-Seq data of known cell types: cell type fractions were sampled from a Dirichlet distribution, and reads were drawn from primary blood cell data (GSE74912) with SAMTOOLS (Li et al., 2009). The synthetic samples covered mixtures containing different numbers of cell types (1-13 types), and each synthesized sample consisted of 2 million paired-end reads.
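The fraction-sampling step can be sketched as below. The concentration parameter, the way the present cell types are chosen, and the rounding of read counts are assumptions made for illustration; the actual read subsampling is performed on BAM files and is not shown.

```python
import numpy as np

def sample_fractions(n_types_total, n_present, rng, alpha=1.0):
    """Draw Dirichlet fractions for n_present randomly chosen cell types;
    all other cell types get a fraction of 0 (alpha is an assumed setting)."""
    chosen = rng.choice(n_types_total, size=n_present, replace=False)
    f = np.zeros(n_types_total)
    f[chosen] = rng.dirichlet(np.full(n_present, alpha))
    return f

def reads_per_type(f, total_reads=2_000_000):
    """Convert fractions to integer read counts that sum to total_reads."""
    counts = np.floor(f * total_reads).astype(int)
    counts[np.argmax(f)] += total_reads - counts.sum()   # give the remainder to the largest type
    return counts

rng = np.random.default_rng(1)
f = sample_fractions(n_types_total=13, n_present=4, rng=rng)
n_reads = reads_per_type(f)
```

Each entry of `n_reads` would then be the number of read pairs to subsample from the corresponding pure cell type sample before merging them into one synthetic mixture.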
To further validate the performance of DeconPeaker using chromatin accessibility data, we retrieved a dataset of transposase-accessible chromatin profiles for 695 individual mouse cardiac progenitor cells (PRJEB23303) covering five cell types from E7.5 to E9.5 (Jia et al., 2018). From this scATAC-Seq dataset, we randomly selected 50% of the cells in each cell type without replacement to construct the reference samples, and the remaining cells were used to synthesize 100 test mixture samples by sampling with replacement. Each synthetic sample consisted of 3,000 cells. In each test sample, the true proportion of each cell type was calculated from the number of cells of each of the five cell types in the sample.
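The split-and-resample design can be sketched as follows, working only with cell type labels. The per-type cell counts in the example are invented (they merely sum to 695), and the function name is hypothetical.

```python
import numpy as np

def split_and_mix(cell_types, rng, n_mixtures=100, cells_per_mixture=3000):
    """Hold out half of each cell type as reference cells, then build test
    mixtures by sampling the remaining cells with replacement."""
    cell_types = np.asarray(cell_types)
    types = np.unique(cell_types)
    ref_parts = []
    for t in types:
        idx = np.flatnonzero(cell_types == t)
        ref_parts.append(rng.choice(idx, size=len(idx) // 2, replace=False))
    ref_idx = np.concatenate(ref_parts)
    test_idx = np.setdiff1d(np.arange(len(cell_types)), ref_idx)

    mixtures, truths = [], []
    for _ in range(n_mixtures):
        drawn = rng.choice(test_idx, size=cells_per_mixture, replace=True)
        mixtures.append(drawn)
        # True proportion = share of drawn cells belonging to each cell type.
        truths.append([(cell_types[drawn] == t).mean() for t in types])
    return ref_idx, mixtures, np.array(truths)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), [100, 150, 120, 200, 125])   # toy labels, 695 cells
ref_idx, mixtures, truths = split_and_mix(labels, rng)
```

Keeping the reference and test cells disjoint ensures that the signature matrix is never derived from cells that later appear in a test mixture.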

Data Availability
The study utilized 11 datasets from three different platforms (ATAC-Seq, RNA-Seq, and Microarray), as listed in Supplementary Table S1. For the evaluation of DeconPeaker, ATAC-Seq datasets were downloaded from Gene Expression Omnibus (GEO) via accession number GSE74912 (Corces et al., 2016) and from https://github.com/loosolab/cardiac-progenitors (Jia et al., 2018). The former contains 79 normal samples and 42 AML samples (as mixture samples). Data of the normal samples, which contained 13 cell types, were used as a reference to derive the signature matrix and to generate simulated datasets with variable numbers and proportions of cell types. The latter (PRJEB23303) contained 695 cells, covering five cell types that were characterized as mouse cardiac progenitor cells from E7.5 to E9.5 using single-cell transposase-accessible chromatin profiling (scATAC-Seq). In addition, the RNA-Seq dataset (GSE74246), which contains the mRNA expression data matched to GSE74912 (ATAC-Seq data), was used to test the model. For comparative validation, eight benchmarking datasets were retrieved from the previously described source link³ (Hunt et al., 2019). Among these eight datasets, two [PRJEB8231 (Parsons et al., 2015) and GSE64098 (Ruijie et al., 2015)] were RNA-Seq data, while the other six [GSE29832 (Gong et al., 2011), GSE19830 (Shen-Orr et al., 2010), GSE11058 (Abbas et al., 2009), GSE5350 (Leming et al., 2006), GSE19380 (Kuhn et al., 2011), and GSE65133 (Newman et al., 2015)] were microarray data. Each dataset contains reference and mixture samples with known mixing proportions.

RESULTS
DeconPeaker predicts cell type composition using SIMPLS (De Jong, 1993) on the basis of a signature matrix that represents cell type-specific peaks (open chromatin regions). Data processing in this tool involves three main steps (as shown in Figure 1A): (1) identification of a list of non-overlapping cell type-specific peaks (CTSPs) from the reference samples by a hypothesis-testing framework, followed by construction of a signature matrix by minimizing the condition number (workflow shown in Figure 1B, see section "Materials and Methods"); (2) deconvolution of the mixtures with the signature matrix using SIMPLS; and (3) evaluation of the deconvolution using an asymptotic test for consistency of the distributions between observations and predictions (see section "Materials and Methods"). With this procedure, DeconPeaker can handle chromatin accessibility data as well as cell type-specific gene expression (mRNA expression levels). The use of SIMPLS distinguishes this tool from other methods for the deconvolution of mixed cell samples.
Notably, optimizing the number of CTSPs ensures that the signature matrix is stable and robust. In addition, the model automatically transforms the input data, including signature matrices and mixtures, before SIMPLS (De Jong, 1993) is applied (see section "Materials and Methods"), so as to make them fit a normal distribution as closely as possible. As such, this strategy can enhance the adaptability and accuracy of DeconPeaker for different datasets.

Performance Evaluation of Synthetic Mixtures
Given n samples for a cell type in the reference data (GSE74912) (Corces et al., 2016), n−1 reference samples were used to derive a signature matrix, and the remaining reference sample was used to synthesize 195 mixture samples (see "Materials and Methods," as shown in Supplementary Figure S1). RMSE and PCC were used to measure the consistency between the ground truth and estimated fractions. Compared to CIBERSORT (Newman et al., 2015), DeconPeaker showed higher PCC (Figure 2A) and lower average RMSEs on the synthetic mixtures (Figure 2B). Furthermore, when examining the deconvolution at the level of single cell types, we found that DeconPeaker's PCCs were above 0.95 and higher than those of CIBERSORT for each cell type (Supplementary Figures S3A-C), indicating that DeconPeaker had better deconvolution performance on synthetic mixtures of chromatin accessibility.
Additionally, the signature peak intensities showed a strong positive correlation within the pair HSC and multipotent progenitor (MPP) and within the pair granulocyte-monocyte progenitor (GMP) and lymphoid-primed multipotent progenitor (LMPP) (Figure 2C), which led us to ask whether this positive correlation influences the deconvolution through multicollinearity between the two cell types of each pair.
To test for multicollinearity, we first calculated the differences between the expected (true) and predicted proportions for each cell type. We then fitted the differences of one cell type against those of the other, and found a strong anti-correlation within the pairs HSC-MPP and GMP-LMPP (Figures 2D,E), suggesting that multicollinearity could affect the accuracy of deconvolution if the two cell types coexist in the sample.
We also compared the PCCs of the signature peak intensities between cell types with the schematic (cell lineage) of the human hematopoietic hierarchy for 13 primary blood cell types (Corces et al., 2016). The result showed that cell type pairs with strong PCCs have short lineage distances, indicating that proximity between cell types in the lineage is an important cause of multicollinearity and a potential source of interference in the deconvolution. The performance of DeconPeaker was also validated on single-cell ATAC-Seq datasets (PRJEB23303). This dataset contained ATAC-Seq data for 695 single mouse cardiac progenitor cells covering five cell types from E7.5 to E9.5 (Jia et al., 2018) (see "Materials and Methods"). On 100 synthetic mixtures, DeconPeaker displayed a higher correlation coefficient (PCC = 0.97) between the true and predicted proportions than CIBERSORT (Figure 2F). For the average RMSEs, the two models showed comparable results (Figure 2G), indicating the potential of DeconPeaker in resolving single-cell data.

Evaluation on Experimental Data Sets
Owing to the lack of ATAC-Seq data from mixture samples with known cell type proportions, the evaluation of our tool is based on eight known gene expression benchmarking datasets (two RNA-Seq and six microarray datasets), which have been widely used to test deconvolution algorithms (Supplementary Table S1). Although DeconPeaker is modeled specifically for chromatin accessibility, apart from constructing the count matrix it shares many features with other partial deconvolution algorithms (Figure 1A). This allowed us to evaluate its performance on these benchmarking datasets, even though they do not contain chromatin accessibility peaks.
We used DeconPeaker and CIBERSORT to derive signature matrices from the reference samples of each benchmarking dataset and to predict the cell type proportions in each mixed sample. For the RNA-Seq datasets, both DeconPeaker and CIBERSORT showed good performance on Liu's data (Figure 3A), while CIBERSORT performed slightly better on Parsons' data (Figure 3B). Furthermore, the two tools performed comparably on the mixtures of Shen-Orr's data, which consisted of microarray data of rat liver, brain, and lung (Figure 3C). We also tested Newman-PBMC data, which has a very complex cell type composition and poses a huge challenge for deconvolution methods. Interestingly, DeconPeaker showed high PCC and performed better than CIBERSORT (Figure 3D). Moreover, DeconPeaker consistently showed higher PCCs than CIBERSORT in the other four microarray datasets used in this study (Supplementary Figures S4A-D). To further test whether the performance of DeconPeaker is significantly better than that of CIBERSORT, we used a non-parametric test (Wilcoxon test) between every two groups. The results showed that the predictions by DeconPeaker significantly outperform those by CIBERSORT on the Synthetic, Shi's, and Shen-Orr's datasets, but are slightly worse on Jia's and Kuhn's. On the remaining datasets, the predictions of the two methods did not show any significant difference (Supplementary Figure S5). Together, these results demonstrate the reliable performance of DeconPeaker on the benchmarking datasets in the context of cross-platform adaptability.
Notably, from the RMSE perspective, the predictions on Parsons', Shi's, and Newman's datasets showed higher deviations from the true proportions than on the others (Supplementary Figure S5). Several factors may make these datasets difficult to resolve. The first is the number of cell types in the mixed sample. The second is the presence of two or more very similar cell types, such as HSC-MPP (Figures 2C-E). A third factor is batch bias in the reference profiles for the cell types.

Comparison With Other Deconvolution Algorithms
We compared DeconPeaker against nine other deconvolution models, four of which were accessed through the CellMix R package (Gaujoux and Seoighe, 2013), including ls-fit (Abbas et al., 2009), qprog (Gong et al., 2011), DSA (Zhong et al., 2013), and deconf (Repsilber et al., 2010). The remaining algorithms, EPIC (Racle et al., 2017), PERT (Qiao et al., 2012), dtangle (Hunt et al., 2019), DeconRNASeq (Gong and Szustakowski, 2013), and CIBERSORT (Newman et al., 2015), were retrieved from the links provided in the corresponding literature. All of these models require signature genes or peaks (chromatin accessibility). However, DSA and deconf are complete deconvolution methods that only require signature genes and do not explicitly require reference data. To better evaluate and compare the performance of these methods, we have considered two scenarios. One is the comparisons of the methods based on different signature matrices. Some deconvolution methods have plugins for directly inferring signature matrices, such as CIBERSORT and dtangle, where the signature matrix is the major determinant of prediction accuracy. The second is to compare the performance of the methods using unified signature matrices.
For the first scenario mentioned above, the signatures for the different methods were obtained as follows. For DeconPeaker, CIBERSORT, and dtangle, the signature matrices were derived by their own specific strategies, while for the other models, the signature peaks (or genes) or signature matrices were provided by CIBERSORT. Some algorithms are tailored to certain platforms, such as CIBERSORT (microarray), DeconRNASeq (RNA-Seq), and dtangle (both); if deconvolution algorithms are applied to cross-platform data, the prior assumptions of their models may be violated. In this comparison, we therefore considered different configurations of the algorithms to reduce the impact of cross-platform data on the models; the details are provided in Supplementary Table S2.
We assessed the capacity of these algorithms by RMSE and PCC on the nine benchmarking datasets (including one synthetic dataset) and found that DeconPeaker showed the lowest average RMSEs on Shen-Orr's and Newman's data. Although DeconPeaker performed slightly worse on Shi's and Parsons' data, it still performed quite well (RMSE = 0.054 on Shi's data and RMSE = 0.091 on Parsons' data) (Figure 4A), indicating good robustness. To make the comparison between the algorithms more intuitive, we combined the deconvolution results of each algorithm across the benchmarking datasets, which showed that DeconPeaker exhibits the lowest average RMSEs (Figure 4B) and the highest average PCCs (Figure 4C). For the second scenario, we uniformly used the signature matrices identified by DeconPeaker to evaluate the accuracy and robustness of the methods. The results showed that DeconPeaker has the second lowest average RMSEs and the highest average PCCs (Supplementary Figure S6). Together, these results indicate that DeconPeaker performs comparably to or even better than the other methods in predictive performance. A flowchart of the analysis is shown in Supplementary Figure S2.

Performance Evaluation of Signature Matrices
To evaluate the effect of signature matrices derived by DeconPeaker, the signature matrix containing 1,768 peaks from GSE74912 (Corces et al., 2016) was used. We first employed ChIPSeeker (Yu et al., 2015) to annotate the signature peaks and found that most of the peaks were located in the introns and distal intergenic regions, of which only 4.13% of the peaks were at the promoters ( Figure 5D and Supplementary Table S3). This peak distribution confirms that distal element accessibility is highly cell type specific (Corces et al., 2016). Furthermore, we used principal component analysis (PCA) to visualize the distribution of individual samples of different cell types for all the peaks and signature peaks (Figures 5A,B). The clustering based on all peaks does not clearly separate cell types in the first three principal components. In contrast, the use of the peaks in the signature matrix led to a clear separation of cell type. In addition, the heat map of the signature matrix intensity intuitively showed differential peaks across the cell types ( Figure 5C). Collectively, these results verify that the signature matrices derived by our model are reasonable.
It is well established that signature matrices are the key to deconvolution. The comparison between matrices can reflect the performance of a model from different aspects. Moreover, different strategies have been used to derive signature matrices. For instance, CIBERSORT uses a two-sided unequal-variance t-test combined with condition number minimization to derive signature matrices (Newman et al., 2015). dtangle uses the "Ratio" method, which selects and ranks markers according to the ratio of the mean expression of each gene in each cell type to the mean of the gene in all other cell types (Hunt et al., 2019). In this study, we used the Jaccard similarity coefficient (JSC) to compare the signature matrices derived by CIBERSORT, dtangle, and DeconPeaker. JSC is a measure of similarity between two sets and is defined as the size of the intersection divided by the size of the union. The results showed that the JSC between DeconPeaker and CIBERSORT is 0.125, between DeconPeaker and dtangle is 0.201, and between CIBERSORT and dtangle is 0.235 (Figure 5E). The low JSCs suggest that the signature peaks derived by different algorithms vary widely.
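As a small worked example of the JSC definition, the sketch below compares two made-up sets of peak identifiers; the identifiers and the function name are illustrative only.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical signature peak sets from two methods.
peaks_method_1 = {"peak_1", "peak_2", "peak_3"}
peaks_method_2 = {"peak_2", "peak_3", "peak_4", "peak_5"}
jsc = jaccard(peaks_method_1, peaks_method_2)   # 2 shared / 5 total = 0.4
```

A JSC of 0.125, as reported between DeconPeaker and CIBERSORT, therefore means that only one in eight of the peaks selected by either method is selected by both.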
Furthermore, we exchanged the signature matrices derived by DeconPeaker and CIBERSORT, and compared the deconvolution on the synthetic mixtures. The signature matrix identified by DeconPeaker yielded lower average RMSEs, indicating that the signature matrices identified by DeconPeaker can improve the accuracy of deconvolution (Figure 5F). In addition, the deconvolutions by CIBERSORT using an external and the self-identified signature matrices showed different average RMSEs by one-way ANOVA, while DeconPeaker was found to be relatively stable (Figure 5F). To explore this issue further, we repeated the test on all benchmarking datasets. DeconPeaker was stable when given external signature matrices, while CIBERSORT was sensitive to them (Figure 5G). When using the signature matrices identified by DeconPeaker, both algorithms showed smaller median RMSEs and more compact distributions, although the average RMSE was slightly higher for CIBERSORT (Figure 5G). Taken together, the signature matrices derived by DeconPeaker showed good performance in deconvolution. Furthermore, DeconPeaker's deconvolution based on the external signature matrix maintained good stability and accuracy.

Impact of Data Transformation
To enhance the adaptability of DeconPeaker to different kinds of datasets (gene expression or open chromatin), an automatic data transformation method was introduced, and the deconvolution performance was evaluated. Here, we applied only the "None" (no transform), "Auto" (automatic transform), "Log2" (log2-transform), and "Sqrt" (square-root transform) transformations to the signature matrix and the mixtures of each benchmarking dataset. The results indicated that the "Auto" transformation has the lowest average RMSEs, the highest average PCCs, and the most compact distribution (Figures 6A,B), indicating that the automatic transformation method could improve the adaptability and deconvolution performance for different kinds of datasets. To test whether there is a significant difference between the transformation methods, we applied the Wilcoxon test between every two groups. The results showed that there were significant differences between "Log2" and the others, and no significant differences among "Auto," "Sqrt," and "None" (Figures 6A,B). Hence, in most cases, the peak profile or gene expression deconvolution should be done in linear space rather than log-transformed space. However, a model that combines multiple transformation strategies has better data adaptability for deconvolution. A flowchart of this analysis is shown in Supplementary Figure S2.

Deconvolution of Cell Composition for AML Dataset
We used a dataset of HSCs from AML (Corces et al., 2016). This dataset contained 32 unique mixture samples (replicates were merged) on ATAC-Seq and RNA-Seq, primarily covering three distinct stages of AML development, namely, preleukemic HSCs (pHSCs), leukemia stem cells (LSCs), and leukemic blast cells (Blasts). According to the literature (Corces et al., 2016), cell type compositions were mostly HSCs and MPPs in pHSCs, and GMPs and LMPPs in LSCs. However, the Blasts analysis showed a wider distribution, namely, less differentiated blasts associate with GMP cells, and more differentiated blasts associate with monocytes (MONOs) (Corces et al., 2016).
For the ATAC-Seq data (GSE74912), we built the signature matrix (1,768 peaks) based on 77 normal samples, covering 13 primary blood cell types (workflow shown in Supplementary Figure S1), and used this signature matrix to deconvolute the mixtures of AML (Figures 7A,B). HSC and MPP were found to be the major cell type components of the pHSC stage, the proportions of LMPP and GMP were significantly elevated in the LSC stage, and GMP and MONO dominated the Blast stage, which is consistent with the previous report (Corces et al., 2016). In addition, we observed an increase in the MONO proportion across the three stages, especially from the LSC stage to the Blast stage. Notably, GMP showed no significant change between the LSC and Blast stages.
For RNA-Seq data (GSE74246), we built the signature matrix (1,245 genes) with 49 normal transcriptomes to deconvolute the mixed transcriptomes of AML (Supplementary Figure S1). The results showed that HSC and MPP dominated the pHSC stage (Supplementary Figures S7A,B), which is consistent with the finding based on ATAC-Seq data (Figures 7A,B). The proportion of LMPP was the highest in the LSC stage, while MONO dominated in the Blast stage. According to literature (Corces et al., 2016) and the deconvolution results for ATAC-Seq data (Figures 7A,B), a high proportion of GMP was found in both LSC and Blast stages, but in the deconvolution using RNA-Seq data, GMP did not show a dominating proportion, suggesting that gene expression is not the most essential feature of cell identity. To further evaluate the capacity of chromatin accessibility data and gene expression data to identify cell types, we visualized the distribution of individual samples of different cell types with PCA, using either the signature peaks (from GSE74912) (Figure 5B) or the signature genes (from GSE74246) (Supplementary Figure S7C). In PCA plots when using cell type-specific ATAC-Seq peaks, samples of the same cell type were better clustered, while the ones of different cell types were better separated in comparison to the results with the signature genes, such as for cell types LMPP, GMP, and CMP ( Figure 5B and Supplementary Figure S7C), suggesting that the chromatin accessibility is more specific in classifying cell types. Furthermore, to compare the performance of deconvolution of the mixed samples on the two signature matrices, the literature results were used as a standard cell type proportion (Corces et al., 2016) and were further compared to the cell type proportions predicted either based on ATAC-Seq data or on RNA-Seq data by calculating the correlation (PCC) between them. 
The results showed that the cell type proportions based on ATAC-Seq data exhibited a higher PCC and a lower average RMSE with respect to the standard cell type proportions than those based on RNA-Seq data (Figures 7C-E). This further supports the view that chromatin accessibility provides more information about cell identity than gene expression. In addition, we identified the genes shared between the expression signature genes and the genes associated with the signature ATAC-Seq peaks. According to the annotation (Supplementary Table S3), the 1,768 signature ATAC-Seq peaks are associated with 870 genes, while the expression signature (GSE74246) comprises 1,245 genes. Importantly, only 112 genes (∼6.3% relative to the number of signature peaks) were common to both sets (Supplementary Figure S7D). This small overlap indicates that most of the ATAC-Seq signature peaks are not located at the regulatory sites of the signature genes. In other words, one chromatin accessibility region (one peak) probably participates in the regulation of multiple genes rather than acting in a one-to-one manner, especially given the 3D organization of the genome. This is also consistent with the fact that gene expression is the consequence of a complex regulatory process. For a refined and unique cell type or cell state, several factors (including external stimuli) often play a significant role, primarily by affecting transcription factor binding to DNA.
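As a worked check of the overlap figure, the short sketch below recomputes the percentage from the counts given in the text. The variable names are illustrative; the two alternative denominators are shown only for context and are not reported in the article.

```python
# Counts taken from the text above.
n_signature_peaks = 1768   # signature ATAC-Seq peaks
n_peak_genes = 870         # genes associated with those peaks
n_signature_genes = 1245   # expression signature genes
n_shared = 112             # genes common to both gene sets

# Overlap relative to the number of signature peaks (the figure quoted in the text).
pct_of_peaks = 100 * n_shared / n_signature_peaks

# For context only: the same overlap relative to each gene set.
pct_of_peak_genes = 100 * n_shared / n_peak_genes
pct_of_signature_genes = 100 * n_shared / n_signature_genes

print(f"{pct_of_peaks:.1f}% of signature peaks")
print(f"{pct_of_peak_genes:.1f}% of peak-associated genes")
print(f"{pct_of_signature_genes:.1f}% of expression signature genes")
```

Whichever denominator is used, the shared genes remain a small minority of either signature.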

DISCUSSION
Gene expression deconvolution methods are well suited to defining unique cell types in transcriptomes of samples with mixed cell types. Likewise, DNA methylation data are highly cell type specific and can reveal hidden components of tissue mixtures. Studies have shown that models using DNA methylation consistently outperform gene expression-based methods (Houseman et al., 2012; Reinius et al., 2012; Chakravarthy et al., 2018). Notably, chromatin accessibility is related to the binding of transcription factors to DNA, which also reflects cell type specificity. In this study, we developed DeconPeaker, a novel deconvolution estimator for characterizing cell type composition from chromatin accessibility mixtures as well as from gene expression datasets (RNA-Seq and microarray). To optimize DeconPeaker, we first derived signature matrices using multiple linear regression with an intersection-union test (IUT) while minimizing the condition number, and then predicted the cellular fractions of the mixed samples by SIMPLS with the derived signature matrices. In comparison with other known deconvolution methods on benchmarking datasets across different platforms, DeconPeaker achieved the lowest average RMSEs and the highest average PCCs between predictions and ground truth (Figures 4B,C). In addition, the signature matrices identified by DeconPeaker consistently showed lower average RMSEs and the highest average PCCs, indicating the greater reliability and broader range of applications of DeconPeaker. Notably, we have proposed a novel hypothesis testing framework that minimizes the condition number to identify signature matrices in DeconPeaker. Compared with CIBERSORT (Newman et al., 2015), our strategy depicts the cell type specificity of peaks or genes (for gene expression data) more accurately.
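The role of the condition number in choosing a signature matrix can be sketched as follows. This is a simplified illustration under stated assumptions, not the DeconPeaker implementation: the specificity score is a hypothetical stand-in for the regression and IUT step, the function name `select_signature` is invented here, the data are simulated, and ordinary least squares replaces SIMPLS for the fraction estimate.

```python
import numpy as np

def select_signature(ref, sizes):
    """Pick the feature subset whose matrix has the lowest condition number.

    ref   : (n_features, n_cell_types) mean profile of each pure cell type.
    sizes : candidate numbers of top-ranked features to keep per cell type.
    """
    n_types = ref.shape[1]
    # Hypothetical specificity score (assumption, not the IUT statistic):
    # signal in one cell type minus the largest signal in any other type.
    score = np.empty_like(ref, dtype=float)
    for k in range(n_types):
        others = np.delete(ref, k, axis=1)
        score[:, k] = ref[:, k] - others.max(axis=1)
    order = np.argsort(-score, axis=0)  # most specific features first

    best_rows, best_cond = None, np.inf
    for g in sizes:
        rows = np.unique(order[:g, :].ravel())
        cond = np.linalg.cond(ref[rows, :])  # 2-norm condition number
        if cond < best_cond:
            best_rows, best_cond = rows, cond
    return best_rows, best_cond

# Simulated reference: 3 cell types, 20 specific features each, 1,000 unspecific.
rng = np.random.default_rng(0)
n_types, per_type, n_noise = 3, 20, 1000
ref = rng.uniform(0.5, 1.5, size=(n_types * per_type + n_noise, n_types))
for k in range(n_types):
    ref[k * per_type:(k + 1) * per_type, k] += 10.0

rows, cond = select_signature(ref, sizes=range(5, 51, 5))

# Noise-free mixture with known fractions, solved by ordinary least squares
# (a plain stand-in for SIMPLS, used here only to keep the sketch short).
true_f = np.array([0.5, 0.3, 0.2])
mixture = ref @ true_f
est_f, *_ = np.linalg.lstsq(ref[rows, :], mixture[rows], rcond=None)
print(len(rows), round(cond, 2), np.round(est_f, 3))
```

A lower condition number means the columns of the signature matrix are closer to linearly independent, so small perturbations in the mixture translate into smaller errors in the estimated fractions.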
The automatic transformation strategy for input data (signature matrices and mixtures) distinguishes DeconPeaker from other deconvolution methods. Outliers and anomalous distributions in the input data can greatly affect model performance, so this transformation is required to ensure the accuracy of the deconvolution (Figures 6A,B). In most cases, peak profile or gene expression deconvolution performs better in linear space than in log-transformed space.
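One common explanation for the advantage of linear space, offered here as an illustration rather than as the authors' analysis, is that signals from mixed cell types add up in linear space, whereas a log transform breaks that additivity. The toy signature matrix and fractions below are invented for the example, and ordinary least squares is used in place of SIMPLS.

```python
import numpy as np

# Toy signature matrix: 3 features (rows) x 2 cell types (columns).
signature = np.array([[100.0,   1.0],
                      [  1.0, 100.0],
                      [ 50.0,  50.0]])
fractions = np.array([0.7, 0.3])

# A bulk sample is a weighted sum of pure profiles in linear space.
mixture = signature @ fractions

# Deconvolution in linear space recovers the fractions exactly.
est_linear, *_ = np.linalg.lstsq(signature, mixture, rcond=None)

# After log2(x + 1), the mixture is no longer a weighted sum of the
# transformed profiles, so the same regression gives biased estimates.
est_log, *_ = np.linalg.lstsq(np.log2(signature + 1.0),
                              np.log2(mixture + 1.0), rcond=None)

print("linear space:", np.round(est_linear, 3))
print("log space:   ", np.round(est_log, 3))
```

The noise-free setup isolates the effect of the transform alone; with real data, variance stabilization from a log transform can pull the other way, which is consistent with the hedged "in most cases" above.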
To validate DeconPeaker, we analyzed the marker cell types of AML at different stages based on chromatin accessibility (ATAC-Seq, GSE74912) and mRNA expression data (GSE74246). The conclusions derived from chromatin accessibility were consistent with previous reports (Corces et al., 2016), suggesting that this tool has important applications in the interpretation and identification of biological mechanisms. We further evaluated the capacity of chromatin accessibility data and gene expression data to identify cell types, and found that chromatin accessibility was more specific than gene expression in the classification of cell types (Supplementary Figure S7C). Moreover, deconvolution with chromatin accessibility data yielded a higher PCC and a lower average RMSE between the predictions and the standard cell type proportions (Figures 7C-E). This indicates that chromatin accessibility carries more information about cell identity than gene expression. It is noteworthy that only a small number (∼6.3%) of genes overlapped between the expression signature genes and the genes associated with the ATAC-Seq signature peaks (Supplementary Figure S7D).
This indicates that cell identity is reflected by distinct features in different layers of cellular information, namely the transcriptional regulation layer (chromatin accessibility) and the gene expression layer. Since a cell type or cell state is often associated with extracellular stimuli, chromatin accessibility, which affects transcription factor binding to DNA, is probably more sensitive to cell identity.
We must also point out the possible biases that arise when a constant signature matrix is used to deconvolute samples from different disease states or sequencing platforms. Therefore, we propose building the signature matrix from purified data obtained on multiple platforms and from both healthy and disease samples to reduce biological and technical bias (Vallania et al., 2018).
Taken together, "DeconPeaker" is applicable to chromatin accessibility data measured with ATAC-Seq and to gene expression datasets. Owing to its flexible statistical approach, it will enable researchers to analyze bulk biospecimens, in particular samples containing a mixture of cell types. Notably, to further improve performance, additional tests of DeconPeaker should be conducted in the future, particularly on more ATAC-Seq datasets that contain both reference samples and mixture samples with known cell type proportions.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found at NCBI: GSE74912, GSE74246, GSE5350, GSE29832, GSE19830, GSE11058, GSE19380, GSE65133, and GSE64098. Data are also available from the ENA repository: PRJEB8231 and PRJEB23303.

AUTHOR CONTRIBUTIONS
HDL and XS designed the study. HML coded the algorithms. HML, HDL, and AS wrote and revised the manuscript. HML conducted the data analysis. XS, KL, and ZQ provided interpretation and discussion. All authors contributed to and approved the final manuscript.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00392/full#supplementary-material
FIGURE S1 | Schematic of DeconPeaker validation on ATAC-Seq synthetic mixtures and applications on acute myeloid leukemia (AML) datasets.
MATERIAL S1 | Supplementary Note: Identification of cell type-specific peaks, Calculation of significance score π-value, and Asymptotic test for deconvolution.