<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Bioinform.</journal-id>
<journal-title>Frontiers in Bioinformatics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Bioinform.</abbrev-journal-title>
<issn pub-type="epub">2673-7647</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1120290</article-id>
<article-id pub-id-type="doi">10.3389/fbinf.2023.1120290</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Bioinformatics</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Gene representation in scRNA-seq is correlated with common motifs at the 3&#x2032; end of transcripts</article-title>
<alt-title alt-title-type="left-running-head">Li et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fbinf.2023.1120290">10.3389/fbinf.2023.1120290</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Li</surname>
<given-names>Xinling</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Gibson</surname>
<given-names>Greg</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/43397/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Qiu</surname>
<given-names>Peng</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/334884/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>The Wallace H. Coulter Department of Biomedical Engineering</institution>, <institution>Georgia Institute of Technology and Emory University</institution>, <addr-line>Atlanta</addr-line>, <addr-line>GA</addr-line>, <country>United States</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>School of Biological Sciences, and Center for Integrative Genomics</institution>, <institution>Georgia Institute of Technology</institution>, <addr-line>Atlanta</addr-line>, <addr-line>GA</addr-line>, <country>United States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1803995/overview">Martin Hemberg</ext-link>, Brigham and Women&#x2019;s Hospital and Harvard Medical School, United States</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1870775/overview">Stefan Semrau</ext-link>, New York Stem Cell Foundation, United States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/127000/overview">Xianwen Ren</ext-link>, Peking University, China</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1871495/overview">Yered Pita-Juarez</ext-link>, Beth Israel Deaconess Medical Center and Harvard Medical School, United States</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Peng Qiu, <email>peng.qiu@bme.gatech.edu</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>15</day>
<month>05</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>3</volume>
<elocation-id>1120290</elocation-id>
<history>
<date date-type="received">
<day>09</day>
<month>12</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>02</day>
<month>05</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Li, Gibson and Qiu.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Li, Gibson and Qiu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>One important characteristic of single-cell RNA sequencing (scRNA-seq) data is its high sparsity, where the gene-cell count data matrix contains a high proportion of zeros. The sparsity has motivated widespread discussions on dropouts and missing data, as well as imputation algorithms for scRNA-seq analysis. Here, we aim to investigate whether there exist genes that are more prone to be under-detected in scRNA-seq, and if so, what commonalities those genes may share. From public data sources, we gathered paired bulk RNA-seq and scRNA-seq data from 53 human samples, which were generated in diverse biological contexts. We derived pseudo-bulk gene expression by averaging the scRNA-seq data across cells. Comparisons of the paired bulk and pseudo-bulk gene expression profiles revealed that there indeed exists a collection of genes that are frequently under-detected in scRNA-seq compared to bulk RNA-seq. This result was robust to randomization when unpaired bulk and pseudo-bulk gene expression profiles were compared. We performed a motif search on the last 350&#xa0;bp of the identified genes, and observed an enrichment of a poly(T) motif. The poly(T) motif toward the tails of those genes may be able to form hairpin structures with the poly(A) tails of their mRNA transcripts, making it difficult for their mRNA transcripts to be captured during scRNA-seq library preparation. This offers a mechanistic conjecture for why certain genes may be more prone to be under-detected in scRNA-seq.</p>
</abstract>
<kwd-group>
<kwd>10X</kwd>
<kwd>single-cell RNA sequencing</kwd>
<kwd>bulk RNA-seq</kwd>
<kwd>data integration</kwd>
<kwd>comparison</kwd>
<kwd>dropouts</kwd>
<kwd>pathway analysis</kwd>
<kwd>motif discovery</kwd>
</kwd-group>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Single Cell Bioinformatics</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>Single-cell RNA-sequencing (scRNA-seq) allows the dissection of gene expression heterogeneity at single-cell resolution (<xref ref-type="bibr" rid="B7">Chen et al., 2019a</xref>), which can give insights into the existence and behavior of different cell types (<xref ref-type="bibr" rid="B28">Pennisi, 2018</xref>). In general, scRNA-seq technologies can be categorized into two major types: droplet-based and plate-based (<xref ref-type="bibr" rid="B3">Baran-Gale et al., 2018</xref>). Droplet-based scRNA-seq systems include Drop-seq (<xref ref-type="bibr" rid="B26">Macosko et al., 2015</xref>), inDrop (<xref ref-type="bibr" rid="B19">Klein et al., 2015</xref>), and 10X Chromium (<xref ref-type="bibr" rid="B18">Kitzman, 2016</xref>), and plate-based scRNA-seq systems include SMART-seq and SMART-seq2 (<xref ref-type="bibr" rid="B29">Picelli et al., 2013</xref>). Regardless of the technology, scRNA-seq data is often highly sparse. In a typical gene-cell count matrix in scRNA-seq analysis, &#x3e;90% of the elements are zeros. Some of those zeros are biologically meaningful signals, such as a cell type specific marker gene showing zero expression count in cells belonging to other cell types. Meanwhile, some of those zeros represent technical issues, such as an expressed gene in a cell not being captured and hence undetected due to technical limitations. The fact that not all zeros in scRNA-seq data are problematic has been supported by multiple published studies (<xref ref-type="bibr" rid="B17">Kim et al., 2020</xref>; <xref ref-type="bibr" rid="B30">Qiu, 2020</xref>; <xref ref-type="bibr" rid="B34">Svensson, 2020</xref>).</p>
<p>Many computational methods and pipelines for scRNA-seq include components of gene selection and dimension reduction to address the high sparsity of the data. Selection of highly variable genes enables subsequent analysis to focus on genes whose zero counts are more enriched for biologically meaningful zeros and less affected by the technical limitations (<xref ref-type="bibr" rid="B30">Qiu, 2020</xref>). In dimension reduction techniques [e.g., PCA (<xref ref-type="bibr" rid="B10">Friedman et al., 2001</xref>), t-SNE (<xref ref-type="bibr" rid="B20">Kobak and Berens, 2019</xref>) and UMAP (<xref ref-type="bibr" rid="B4">Becht et al., 2018</xref>)], the reduced dimensions are derived by linear or non-linear combinations of genes, which borrow strength across genes to reduce the sparsity. Methods that adopted these approaches include Seurat (<xref ref-type="bibr" rid="B5">Butler et al., 2018</xref>), TSCAN (<xref ref-type="bibr" rid="B15">Ji and Ji, 2016</xref>), and STREAM (<xref ref-type="bibr" rid="B8">Chen et al., 2019b</xref>). In addition, many imputation algorithms have been developed to generate improved versions of the data with lower sparsity, such as scImpute (<xref ref-type="bibr" rid="B21">Li and Li, 2018</xref>), MAGIC (<xref ref-type="bibr" rid="B36">van Dijk et al., 2018</xref>), RESCUE (<xref ref-type="bibr" rid="B35">Tracy et al., 2019</xref>), and SAVER (<xref ref-type="bibr" rid="B14">Huang et al., 2018</xref>). Many of these imputation tools also adopt gene selection and dimension reduction, so that they can robustly identify gene-gene similarities or cell-cell similarities and use these relationships to impute the data (<xref ref-type="bibr" rid="B13">Hou et al., 2020</xref>). 
Furthermore, an opposite view of the sparsity has been presented in two algorithms, co-occurrence clustering (<xref ref-type="bibr" rid="B30">Qiu, 2020</xref>) and HIPPO (<xref ref-type="bibr" rid="B17">Kim et al., 2020</xref>), which demonstrated that the sparsity pattern of scRNA-seq data can be an extremely useful signal to accurately identify cell clusters and cell types. Therefore, the research community has not yet reached a consensus on best practice for handling the sparsity of scRNA-seq data.</p>
<p>In the literature, the high sparsity in scRNA-seq is often referred to as dropout. The term dropout was introduced to describe technical failures that may cause a highly expressed gene to be undetected (<xref ref-type="bibr" rid="B16">Kharchenko et al., 2014</xref>). However, in widespread discussions, the use of this terminology has been inconsistent. Dropout sometimes refers to zeros caused by technical issues so that expressed genes are undetected, sometimes refers to all observed zeros in the data, and sometimes refers to the fact that not all mRNA molecules in the biological sample are captured, which causes all genes to be under-detected to some extent (<xref ref-type="bibr" rid="B31">Sarkar and Stephens, 2021</xref>). In this paper, our usage of the term dropout aligns with the third meaning above, and we are interested in examining whether there exist genes that are more prone to be under-detected in scRNA-seq.</p>
<p>In order to develop methods to address dropouts or under-detection in scRNA-seq, it is important to understand the factors that contribute to the dropouts. Recent studies have suggested that 3&#x2032;-UTR length, compartment, transcript count, and differential expression levels (<xref ref-type="bibr" rid="B1">Andrews and Hemberg, 2019</xref>; <xref ref-type="bibr" rid="B32">Lipnitskaya et al., 2022</xref>) may play roles in the dropouts of scRNA-seq. For example, genes with shorter 3&#x2032;-UTR length have larger quantitative difference between gene expression in matched scRNA-seq and bulk RNA-seq experiments (<xref ref-type="bibr" rid="B32">Lipnitskaya et al., 2022</xref>). In addition, choice of technology platform can also affect dropouts. For example, comparisons between SMART-seq2 and 10X Chromium showed that 10X Chromium had more noise and a higher dropout rate (<xref ref-type="bibr" rid="B37">Wang et al., 2021</xref>). However, these previous studies involved relatively small numbers of samples, which led to conclusions with limited scope and generality. In this study, we collected paired bulk RNA-seq and scRNA-seq samples from diverse data sources and diverse biological contexts, and used this data to investigate whether there exist genes that are more prone to be under-detected in scRNA-seq, and if yes, what commonalities those genes may share.</p>
</sec>
<sec sec-type="results" id="s2">
<title>Results</title>
<sec id="s2-1">
<title>Paired bulk RNA-seq and scRNA-seq data</title>
<p>Through an extensive literature search, we identified eight datasets with paired bulk RNA-seq data and scRNA-seq data available for the same samples. A summary of these datasets is listed in the Materials and Methods. In total, we have paired bulk RNA-seq data and scRNA-seq data for 53 samples. The samples originated from diverse biological contexts, including fibroblasts, trachea, the female reproductive system, breast cancer, and cancer cell lines.</p>
<p>For each GEO bulk RNA-seq dataset, median-of-ratios normalization was performed followed by log transformation. The scRNA-seq data for each sample was preprocessed separately, with library size normalization followed by log transformation. Then, a pseudo-bulk RNA-seq profile was calculated for each sample by averaging the scRNA-seq expression data across all cells in the sample. Next, for each sample pair, we identified the genes present in both data types and retained their normalized bulk RNA-seq and pseudo-bulk expression values. With these preprocessing steps, for each of the 53 samples, we obtained one bulk RNA-seq profile and one pseudo-bulk RNA-seq profile for the overlapping genes. Finally, quantile normalization was performed on the bulk RNA-seq profiles.</p>
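<p>As an illustration only, the pseudo-bulk and quantile normalization steps described above can be sketched in plain Python. The function names, the scale factor of 10,000, the use of log(1 &#x2b; x), and the tie handling are assumptions made for this sketch and are not details of the pipeline used in this study; median-of-ratios normalization of the bulk data is not shown.</p>
<preformat>
```python
import math


def pseudo_bulk(counts, scale=10000.0):
    """Average log-normalized expression across cells.

    counts: list of cells, each a list of raw counts per gene.
    scale:  hypothetical scale factor (assumption, not from the paper).
    Assumes every cell has a non-zero total count.
    """
    n_genes = len(counts[0])
    totals = [0.0] * n_genes
    for cell in counts:
        library_size = sum(cell)
        for g, c in enumerate(cell):
            # library size normalization followed by log(1 + x)
            totals[g] += math.log1p(c / library_size * scale)
    return [t / len(counts) for t in totals]


def quantile_normalize(profiles):
    """Force all profiles onto one shared distribution.

    profiles: list of samples, each a list of expression values per gene.
    Ties are broken by gene order (a simplification for this sketch).
    """
    n_samples = len(profiles)
    n_genes = len(profiles[0])
    sorted_profiles = [sorted(p) for p in profiles]
    # reference distribution: mean of the k-th smallest value across samples
    reference = [
        sum(sp[k] for sp in sorted_profiles) / n_samples for k in range(n_genes)
    ]
    normalized = []
    for p in profiles:
        order = sorted(range(n_genes), key=lambda g: p[g])
        q = [0.0] * n_genes
        for rank, g in enumerate(order):
            q[g] = reference[rank]
        normalized.append(q)
    return normalized
```
</preformat>
<p>After quantile normalization every profile in the sketch contains the same set of values and differs only in how they are assigned to genes, which is why the bulk profiles of different samples occupy the same range.</p>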
<p>The preprocessed paired bulk and pseudo-bulk data were visualized using the scatter plots in <xref ref-type="fig" rid="F1">Figure 1</xref>, where each dot represents expression data of one gene in one sample. In <xref ref-type="fig" rid="F1">Figure 1A</xref>, we visualized the paired bulk and pseudo-bulk data for all 53 samples (blue), overlaid with the paired bulk and pseudo-bulk data for one of the samples (red). In <xref ref-type="fig" rid="F1">Figure 1B</xref>, the same visualization was used to highlight another sample in the context of all samples. Visualizations highlighting other samples (not shown) looked similar to <xref ref-type="fig" rid="F1">Figures 1A, B</xref>. These scatter plots show the general correlation between bulk RNA-seq and scRNA-seq data, which is expected. To justify the choice of quantile normalization for processing the bulk expression data, we removed quantile normalization of the bulk RNA-seq data from our analysis pipeline, and noticed that the alignment of normalized bulk data across samples was poor. For example, as shown in <xref ref-type="sec" rid="s10">Supplementary Figure S1</xref>, without quantile normalization, the range of normalized bulk RNA-seq data for the two highlighted samples showed marked difference. Therefore, quantile normalization is needed for the bulk RNA-seq data. Across the 53 paired samples, the Pearson correlation between the two data types has mean and standard deviation of 0.385 &#xb1; 0.063, while the Spearman correlation has mean and standard deviation of 0.849 &#xb1; 0.049. The Pearson correlation is lower than the Spearman correlation, which is expected, because the relationship between the two data types is not linear as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. 
In addition, we can see that the preprocessing steps were able to properly align the 53 bulk RNA-seq expression profiles across different datasets, and also properly align the 53 pseudo-bulk expression profiles, so that we can compare across these samples to identify genes that tend to be under-detected in scRNA-seq relative to bulk RNA-seq.</p>
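<p>The gap between the two correlation measures can be illustrated with a small sketch. The code below is not the analysis code of this study; the helper names and the toy exponential relationship are assumptions chosen only to show that a monotone but non-linear relationship yields a Spearman correlation of one while the Pearson correlation is noticeably lower.</p>
<preformat>
```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i != len(values):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 != len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data (hypothetical): pseudo-bulk grows monotonically but non-linearly.
bulk = [float(i) for i in range(1, 11)]
pseudo = [math.exp(b) for b in bulk]
print("Pearson:", pearson(bulk, pseudo))
print("Spearman:", spearman(bulk, pseudo))
```
</preformat>
<p>Because Spearman correlation depends only on the ordering of genes, it is insensitive to the curvature visible in the scatter plots, whereas Pearson correlation is penalized by it.</p>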
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Scatter plot visualization of paired bulk and pseudo-bulk data for 53 samples. Each dot is the expression of one gene in one sample, so each scatter plot contains 36,362 &#xd7; 53 dots (genes &#xd7; samples). <bold>(A)</bold> Scatter plot of all genes in all 53 samples in blue, overlaid with scatter plot for all genes in one sample &#x201c;N3&#x201d; in red. <bold>(B)</bold> Scatter plot of all genes in all 53 samples in blue, overlaid with all genes in another sample &#x201c;P2&#x201d; in red.</p>
</caption>
<graphic xlink:href="fbinf-03-1120290-g001.tif"/>
</fig>
</sec>
<sec id="s2-2">
<title>Genes that are consistently under-detected in scRNA-seq</title>
<p>To identify genes that are more prone to dropout or under-detection in scRNA-seq, we examined whether there exist genes that repeatedly appeared in the upper-left corner of the scatter plot in <xref ref-type="fig" rid="F1">Figure 1</xref>, which compared bulk RNA-seq expression profiles and pseudo-bulk expression profiles derived from scRNA-seq. We visualized the scatter plot as a density plot in <xref ref-type="fig" rid="F2">Figure 2A</xref>, and manually drew a gate (region-of-interest) in its upper-left corner. We positioned the gate to avoid high density regions, so that genes falling into the gate represented outlier cases where expression levels detected by scRNA-seq were much lower than those detected by bulk RNA-seq. If a gene appeared in the gate multiple times, this gene was consistently under-detected in scRNA-seq experiments compared to bulk RNA-seq experiments in multiple of the 53 samples.</p>
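<p>The counting logic behind the gate can be sketched as follows. The gate in this study was drawn manually on the density plot; the rectangular gate, its two thresholds, and the function names below are hypothetical simplifications used only to show how per-gene occurrences across samples are tallied.</p>
<preformat>
```python
from collections import Counter


def gate_counts(samples, bulk_min, pseudo_max):
    """Count, per gene, the number of samples in which it falls in the gate.

    samples:    list of dicts mapping gene -> (bulk, pseudo_bulk) expression.
    bulk_min:   hypothetical lower bound on bulk expression.
    pseudo_max: hypothetical upper bound on pseudo-bulk expression.
    The rectangle stands in for the manually drawn upper-left gate.
    """
    counts = Counter()
    for profile in samples:
        for gene, (bulk, pseudo) in profile.items():
            if bulk >= bulk_min and pseudo_max >= pseudo:
                counts[gene] += 1
    return counts


def recurrent_genes(counts, min_occurrences=2):
    """Genes that fall in the gate in at least min_occurrences samples."""
    return sorted(g for g, n in counts.items() if n >= min_occurrences)


# Toy example with made-up expression values.
toy_samples = [
    {"GENE_A": (9.0, 0.5), "GENE_B": (9.5, 4.0), "GENE_C": (2.0, 0.1)},
    {"GENE_A": (8.5, 0.8), "GENE_B": (9.0, 0.9), "GENE_C": (2.5, 0.2)},
]
toy_counts = gate_counts(toy_samples, bulk_min=8.0, pseudo_max=1.0)
print(recurrent_genes(toy_counts))
```
</preformat>
<p>In the toy example only GENE_A falls in the gate in both samples, so it is the only gene reported as recurrent.</p>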
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Density plot of the scatter plot of the 53 paired samples, with three gates indicating candidate gene sets. <bold>(A)</bold> The gates were selected based on the data distribution of the 53 paired samples: the gate in the upper-left corner represents genes that are under-detected in scRNA-seq experiments compared to bulk RNA-seq experiments, the gate in the upper-right corner represents genes that are highly expressed in both bulk RNA-seq and scRNA-seq experiments, and the gate in the bottom corner represents genes that are over-detected in scRNA-seq experiments compared to bulk RNA-seq experiments. Significantly enriched motifs in the last 350&#xa0;bp of the longest transcripts of genes that occurred more than once in each gate are shown for the upper-left gate <bold>(B)</bold>, upper-right gate <bold>(C)</bold>, and bottom gate <bold>(D)</bold>.</p>
</caption>
<graphic xlink:href="fbinf-03-1120290-g002.tif"/>
</fig>
<p>The top 15 genes that most frequently appeared in the upper-left gate are listed in <xref ref-type="table" rid="T1">Table 1</xref>, along with their numbers of appearances, which ranged from 26 to 44. This suggests that out of the 53 samples with paired bulk and single-cell data, each of these genes was under-detected by scRNA-seq in roughly half or more of the samples. Therefore, there do seem to exist genes that are consistently under-detected in scRNA-seq experiments for many samples. The 15 genes include AHNAK, EIF4G2, XIST, CSDE1, DST, DDX17, FN1, SRRM2, FLNA, YWHAZ, COL3A1, ITGB1, PRRC2C, COL1A1, and GNAS. AHNAK encodes a protein involved in diverse processes such as blood-brain barrier formation, cell structure and migration, cardiac calcium channel regulation, and tumor metastasis (<xref ref-type="bibr" rid="B33">Stelzer et al., 2011</xref>). DDX17 encodes a DEAD box protein. DEAD box proteins are implicated in a number of cellular processes involving alteration of RNA secondary structure, such as translation initiation, nuclear and mitochondrial splicing, and ribosome and spliceosome assembly (<xref ref-type="bibr" rid="B33">Stelzer et al., 2011</xref>). EIF4G2 functions as a general repressor of translation by forming translationally inactive complexes (<xref ref-type="bibr" rid="B33">Stelzer et al., 2011</xref>).</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>List of top 15 genes that are most frequently under-detected in scRNA-seq and their frequencies.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Gene name</th>
<th align="center">Frequency of occurrence among 53 sample pairs</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">AHNAK</td>
<td align="center">44</td>
</tr>
<tr>
<td align="center">EIF4G2</td>
<td align="center">43</td>
</tr>
<tr>
<td align="center">XIST</td>
<td align="center">39</td>
</tr>
<tr>
<td align="center">CSDE1</td>
<td align="center">35</td>
</tr>
<tr>
<td align="center">DST</td>
<td align="center">35</td>
</tr>
<tr>
<td align="center">DDX17</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">FN1</td>
<td align="center">34</td>
</tr>
<tr>
<td align="center">SRRM2</td>
<td align="center">31</td>
</tr>
<tr>
<td align="center">FLNA</td>
<td align="center">30</td>
</tr>
<tr>
<td align="center">YWHAZ</td>
<td align="center">28</td>
</tr>
<tr>
<td align="center">COL3A1</td>
<td align="center">27</td>
</tr>
<tr>
<td align="center">ITGB1</td>
<td align="center">27</td>
</tr>
<tr>
<td align="center">PRRC2C</td>
<td align="center">27</td>
</tr>
<tr>
<td align="center">COL1A1</td>
<td align="center">26</td>
</tr>
<tr>
<td align="center">GNAS</td>
<td align="center">26</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Given the diversity of biological contexts of the 53 samples and the possibility that any given gene may only be expressed in a subset of those contexts, we broadened our criterion for under-detected genes in scRNA-seq and considered all genes that occurred more than once in the upper-left gate in <xref ref-type="fig" rid="F2">Figure 2A</xref>. The total number of dots in the upper-left gate is 3,363, with each dot representing one gene in one sample. There were 468 unique genes that appeared more than once in the upper-left gate, which is roughly 3 times more than expected according to the hypergeometric test. Gene set enrichment analysis showed that those genes were involved in multiple KEGG pathways related to cancer. For example, significantly enriched KEGG pathways include proteoglycans in cancer (FDR &#x3d; 1.24E-13), pathways in cancer (FDR &#x3d; 0.02), and microRNAs in cancer (FDR &#x3d; 0.036). This is expected because 29 out of the 53 samples were generated from cancer patients or cancer cell lines.</p>
<p>To examine sequence-based commonalities among the genes that appeared more than once in the upper-left gate in <xref ref-type="fig" rid="F2">Figure 2A</xref>, we searched for enriched motifs in the last 350&#xa0;bp of those genes using the MEME Suite (<xref ref-type="bibr" rid="B2">Bailey et al., 2009</xref>), and observed two motifs that were significantly enriched with small E-value and large number of sites (<xref ref-type="fig" rid="F2">Figure 2B</xref>). For the enriched poly(T) motif, the position-weight visualization showed that the bit scores for most of the positions were high. For many positions, the bit score was over 50%, and for some of the positions the score reached 80%. This indicated that there was relatively high certainty about the enrichment of T at most of the positions within the motif, forming a consecutive block of T&#x2019;s. In contrast, for the enriched poly(G) motif, the bit scores for most of the positions were low, indicating that the certainty of having G at those positions was low. In addition, the G&#x2019;s do not form long consecutive blocks. Therefore, even though the number of sites of the poly(G) motif was large and its E-value was significant, the poly(G) motif was not as strong as the poly(T) motif. Since there is no obvious mechanism associated with the poly(G) enrichment, we focused on the poly(T) motif. We conjecture that the poly(T) motif toward the tails of genes under-detected in scRNA-seq may be able to form hairpin structures with the poly(A) tails of their mRNA transcripts, making it difficult for their mRNA transcripts to be captured during the capturing step of scRNA-seq library preparation, which would explain why those genes may be more prone to be under-detected in scRNA-seq.</p>
</sec>
<sec id="s2-3">
<title>Genes consistently highly expressed in both bulk RNA-seq and scRNA-seq</title>
<p>As a comparison, we manually drew another gate in the upper-right corner of the density plot in <xref ref-type="fig" rid="F2">Figure 2A</xref>, and examined whether there exist genes that were consistently highly expressed in both bulk RNA-seq and scRNA-seq. The manually drawn gate was positioned such that the number of genes within the gate was comparable to the number of genes within the gate in the upper-left corner. In addition, we positioned the gate to avoid high density regions, so that genes falling into the gate represented outlier cases in <xref ref-type="fig" rid="F2">Figure 2A</xref> where expression detected in both bulk RNA-seq and scRNA-seq is high. If a gene appeared in the upper-right gate multiple times, this gene consistently showed high expression in both bulk RNA-seq and scRNA-seq in multiple of the 53 samples. The total number of dots in the upper-right gate is 1,451, among which 96 unique genes appeared more than once.</p>
<p>For the top 15 genes with the highest frequency of occurrence in the upper-right gate, the numbers of occurrences ranged from 29 to 50, which is more than half of the 53 samples. Given the diverse biological contexts of the 53 samples under consideration, such consistency of highly expressed genes was interesting. Meanwhile, since many housekeeping genes are known to be involved in diverse fundamental biological processes, such consistency of highly expressed genes was also expected. The top 15 genes with the highest frequency in the upper-right gate included MALAT1, RPLP1, EEF1A1, RPL10, RPL13, RPS18, FTH1, B2M, TMSB4X, RPS4X, RPL13A, RPL32, RPS12, RPS27A, and RPL11. B2M encodes a protein which is associated with MHC class I heavy chain on the surface of nearly all nucleated cells (<xref ref-type="bibr" rid="B33">Stelzer et al., 2011</xref>). TMSB4X encodes a protein which is involved in cell proliferation, migration, and differentiation and it is a major cellular constituent in many tissues (<xref ref-type="bibr" rid="B33">Stelzer et al., 2011</xref>). EEF1A1 is expressed in brain, placenta, lung, liver, kidney, and pancreas, and it is responsible for the enzymatic delivery of aminoacyl tRNAs to the ribosome (<xref ref-type="bibr" rid="B33">Stelzer et al., 2011</xref>). In addition to the top 15 most frequently appearing genes, we also considered genes that appeared more than once in the upper-right gate in <xref ref-type="fig" rid="F2">Figure 2A</xref>. Gene set enrichment analysis showed that those genes were involved in multiple GO terms including ribosome (FDR &#x3d; 3.89E-69), cytosol (FDR &#x3d; 2.55E-25), RNA binding (FDR &#x3d; 1.53E-44), and protein binding (FDR &#x3d; 4.01E-4), which supported our intuition that the upper-right gate is enriched for housekeeping genes required for diverse fundamental cellular processes.</p>
<p>Using the MEME Suite, we searched for enriched motifs among the 96 genes that appeared more than once in the upper-right gate in <xref ref-type="fig" rid="F2">Figure 2A</xref>, and observed two enriched motifs with moderate number of sites, a poly(A) motif and a poly(G) motif (<xref ref-type="fig" rid="F2">Figure 2C</xref>). For both of these enriched motifs in <xref ref-type="fig" rid="F2">Figure 2C</xref>, the bit scores were relatively low and the enriched bases did not form long consecutive blocks, suggesting that they were not as strong as the poly(T) motif enriched in the upper-left gate. It was encouraging to see that the poly(T) motif enriched in the upper-left gate was not observed in the upper-right gate, which strengthened our mechanistic conjecture that the poly(T) motif and hairpin structures may play a role in under-detection of gene expression in scRNA-seq experiments.</p>
</sec>
<sec id="s2-4">
<title>Genes that appear to be over-detected in scRNA-seq</title>
<p>For completeness, we also attempted to identify genes that are frequently over-detected in scRNA-seq compared to bulk RNA-seq. We manually drew a third gate in <xref ref-type="fig" rid="F2">Figure 2A</xref>, to define the outlier cases in the bottom-right region with low density. The gate was positioned such that the numbers of genes within each of the three gates were comparable. If a gene appeared in the bottom gate multiple times, this gene was consistently over-detected in scRNA-seq compared to bulk RNA-seq. The total number of dots in the bottom gate is 1,079, which contained 174 unique genes that appeared more than once. Compared with the total number of dots and number of unique genes in the upper-left and upper-right gates, the average occurrence of unique genes in the bottom gate was much smaller than that of genes in the other two gates, indicating that far fewer genes were consistently over-detected in scRNA-seq.</p>
<p>For genes that appeared more than once in the bottom gate in <xref ref-type="fig" rid="F2">Figure 2A</xref>, we performed motif search using the MEME Suite, and observed two highly enriched motifs (<xref ref-type="fig" rid="F2">Figure 2D</xref>). For the enriched poly(A) motif, the bit scores for various positions were moderate. For the enriched poly(C) motif, the bit scores were relatively low at most of the positions, and the C&#x2019;s do not form a long consecutive block. Therefore, neither of the enriched motifs in the bottom gate was as strong as the poly(T) motif enriched in the upper-left gate. Among genes that tended to be over-detected in scRNA-seq, the absence of the poly(T) motif further strengthened our conjecture that sequence-based features may be predictive of capturing efficiency during scRNA-seq library preparation.</p>
</sec>
<sec id="s2-5">
<title>Robustness of enriched sequence motifs to choices of normalization procedure</title>
<p>To demonstrate the robustness of the enriched sequence motifs for the genes that are under-detected in scRNA-seq, we repeated the analysis of the 53 paired samples with four choices of scRNA-seq normalization algorithms, including DESeq2 (<xref ref-type="bibr" rid="B24">Love et al., 2014</xref>), SCTransform (<xref ref-type="bibr" rid="B11">Hafemeister and Satija, 2019</xref>), Linnorm (<xref ref-type="bibr" rid="B39">Yip et al., 2017</xref>), and scran (<xref ref-type="bibr" rid="B25">Lun et al., 2016</xref>). For each choice of normalization algorithm, we generated pseudo-bulk data based on the normalized scRNA-seq data, and compared it with the normalized bulk RNA-seq data using the same analysis as in <xref ref-type="fig" rid="F2">Figure 2</xref>. Results of these four analyses based on different scRNA-seq normalization algorithms are shown in <xref ref-type="sec" rid="s10">Supplementary Figures S2&#x2013;S5</xref>. In these supplementary figures, we consistently observed that a poly(T) motif was significantly enriched in the upper-left gate of genes under-detected in scRNA-seq, and a poly(A) motif was enriched in the upper-right and bottom gates. These results suggested that our observation of motif enrichment is robust to the choice of the normalization procedure.</p>
</sec>
<sec id="s2-6">
<title>Randomly paired bulk RNA-seq and scRNA-seq expression profiles</title>
<p>To examine the robustness of our comparison between paired bulk RNA-seq and scRNA-seq expression data, we randomly shuffled the gene expression profiles to create 53 random pairs, such that the bulk RNA-seq profile and the scRNA-seq pseudo-bulk profile in each pair came from different biological samples. With the randomly paired data, we performed the same analysis as above and examined whether the randomly paired data would produce similar results.</p>
<p>The randomly paired bulk and pseudo-bulk data were visualized using scatter plots in which each dot represents the expression of one gene in one randomly paired expression profile (<xref ref-type="fig" rid="F3">Figure 3</xref>). In <xref ref-type="fig" rid="F3">Figure 3</xref>, we visualized all 53 randomly paired bulk and pseudo-bulk data (blue), overlaid with one such random pair (red) in <xref ref-type="fig" rid="F3">Figure 3A</xref> and another random pair in <xref ref-type="fig" rid="F3">Figure 3B</xref>. Visualizations highlighting other random pairs (not shown) were similar to <xref ref-type="fig" rid="F3">Figures 3A, 3B</xref>. Based on these scatter plots, we observed that the general correlation between bulk RNA-seq and scRNA-seq data was robust to random pairing of the data. Across the 53 randomly paired samples, the Pearson correlation between the two data types was 0.350 &#xb1; 0.055 (mean &#xb1; standard deviation), while the Spearman correlation was 0.786 &#xb1; 0.058. The average correlation values were slightly lower for the randomly paired data than for the paired data.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Scatter plot visualization of randomly paired bulk and pseudo-bulk data for 53 samples. Each dot represents the expression of one gene in one sample pair, so each scatter plot contains 37,387 &#xd7; 53 dots. <bold>(A)</bold> Scatter plot of all 53 samples in blue, overlaid with the scatter plot for one sample pair &#x201c;N3 (bulk RNA-seq) vs. 3_cell_line_mixture (scRNA-seq)&#x201d; in red. <bold>(B)</bold> Scatter plot of all 53 samples in blue, overlaid with another sample pair &#x201c;3_cell_line_mixture (bulk RNA-seq) vs. 293T (scRNA-seq)&#x201d; in red.</p>
</caption>
<graphic xlink:href="fbinf-03-1120290-g003.tif"/>
</fig>
<p>Similar to the analysis above, we manually drew three gates in <xref ref-type="fig" rid="F4">Figure 4A</xref> to capture genes that tended to be under-detected in scRNA-seq, over-detected in scRNA-seq, or highly expressed in both bulk RNA-seq and scRNA-seq. The numbers of unique genes that appeared more than once in the upper-left, upper-right and bottom gates in <xref ref-type="fig" rid="F4">Figure 4A</xref> were 473, 97, and 507, respectively. Before random pairing, the numbers of genes appearing more than once in those three gates in <xref ref-type="fig" rid="F2">Figure 2A</xref> were 468, 96 and 174. The comparable numbers for the upper-left and upper-right gates were encouraging, showing that the pattern of under-detection in scRNA-seq and the pattern of high expression in both technologies were robust to random pairing of the data. Interestingly, the number of genes appearing more than once in the bottom gate became much larger after random pairing. This was likely because random pairing increased the variation in the scatter plot, so that more dots/genes fell into the bottom gate. This observation indicated that the pattern of over-detection in scRNA-seq was not as robust as the other two patterns, namely under-detection in scRNA-seq and high expression in both technologies. For each of the three gates in <xref ref-type="fig" rid="F4">Figure 4A</xref>, we performed a motif search on the last 350&#xa0;bp of genes that appeared more than once. Similar to the results before random pairing, for genes that occurred in the upper-left gate in <xref ref-type="fig" rid="F4">Figure 4A</xref>, the poly(T) motif was significantly enriched while the poly(A) motif was not observed. In contrast, for genes that appeared more than once in the other two gates in <xref ref-type="fig" rid="F4">Figure 4A</xref>, the poly(T) motif was not enriched. It was encouraging to see that the motif enrichment that led to our mechanistic conjecture on detection in scRNA-seq was robust to random pairing of the data.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Density plot of the scatter plot of the 53 randomly paired samples for one of the 100 random-pairing iterations, with gates indicating candidate genes in the three categories. The gate coordinates were the same as those used in the scatter plot of the 53 paired samples <bold>(A)</bold>. Enriched motifs in the last 350&#xa0;bp of the longest transcripts of genes that occurred more than once in each gate are shown for the upper-left gate <bold>(B)</bold>, upper-right gate <bold>(C)</bold>, and bottom gate <bold>(D)</bold>.</p>
</caption>
<graphic xlink:href="fbinf-03-1120290-g004.tif"/>
</fig>
</sec>
<sec id="s2-7">
<title>Comparison of frequently appearing genes in paired and randomly paired data</title>
<p>As a further comparison between the paired and randomly paired bulk RNA-seq and scRNA-seq data, we examined genes that frequently appeared in the three gates, defined as &#x2265;20 times out of the 53 samples under consideration. For each gate in <xref ref-type="fig" rid="F2">Figure 2A</xref> based on the paired data, we computed the ratio between the number of genes that appeared &#x2265;20 times and the total number of unique genes, and listed the ratios in the first row of <xref ref-type="table" rid="T2">Table 2</xref>. We also calculated these ratios for the gates in <xref ref-type="fig" rid="F4">Figure 4A</xref> based on the randomly paired data, as shown in the second row of <xref ref-type="table" rid="T2">Table 2</xref>. In addition, we performed 100 iterations of the random pairing, which allowed us to quantify the variation of the ratios in the second row of <xref ref-type="table" rid="T2">Table 2</xref>. Once again, in terms of the number of genes frequently appearing in the three gates, we observed that the results were robust to random pairing of the data.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Percentage of genes that frequently appeared (&#x2265;20 times) in the three gates in the analysis of paired data, and the percentage of frequently appearing genes in the three gates in the analysis of randomly paired data.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Percentage of unique genes that occurred &#x2265;20 times</th>
<th align="left">Upper-left gate</th>
<th align="left">Upper-right gate</th>
<th align="left">Bottom gate</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Paired data</td>
<td align="left">4.3%</td>
<td align="left">23.2%</td>
<td align="left">0.8%</td>
</tr>
<tr>
<td align="left">Randomly paired data (100 iterations)</td>
<td align="left">5.0% (&#xb1;0.7%)</td>
<td align="left">21.5% (&#xb1;2.7%)</td>
<td align="left">0.6% (&#xb1;0.3%)</td>
</tr>
</tbody>
</table>
</table-wrap>
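<p>The ratios in <xref ref-type="table" rid="T2">Table 2</xref> can be illustrated with a minimal sketch, which is not the code used in this study. It assumes that the denominator is the number of unique genes that fell into the gate at least once; the function name <italic>frequent_gene_ratio</italic>, the gene labels, and the toy input are hypothetical.</p>
<preformat>
```python
from collections import Counter


def frequent_gene_ratio(genes_per_pair, min_count=20):
    """Fraction of unique gate genes that occur in at least `min_count` sample pairs.

    genes_per_pair: one collection of gene names per sample pair, listing the
    genes that fell inside a given gate for that pair.
    Assumption: the denominator counts genes seen in the gate at least once.
    """
    counts = Counter(gene for genes in genes_per_pair for gene in set(genes))
    if not counts:
        return 0.0
    n_frequent = sum(1 for n in counts.values() if n >= min_count)
    return n_frequent / len(counts)


# Toy input (hypothetical): 53 sample pairs and three genes in one gate.
gate_hits = [set() for _ in range(53)]
for i in range(53):
    gate_hits[i].add("GENE_A")  # inside the gate for all 53 pairs
for i in range(20):
    gate_hits[i].add("GENE_B")  # inside the gate for exactly 20 pairs
for i in range(5):
    gate_hits[i].add("GENE_C")  # inside the gate for only 5 pairs

ratio = frequent_gene_ratio(gate_hits)
print(f"{ratio:.3f}")  # two of the three genes reach the threshold
```
</preformat>
<p>Repeating such a calculation for each of the 100 random pairings would give the spread reported in the second row of the table.</p>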
<p>The percentage of frequently appearing genes in the upper-right gate was above 20%, indicating great consistency of highly expressed genes across the 53 samples from diverse biological contexts. This agreed with our observation that the upper-right gate was enriched for housekeeping genes required for diverse fundamental cellular processes. The percentage of frequently appearing genes in the upper-left gate was around 4%&#x2013;5%, which was lower than in the upper-right gate but much higher than in the bottom gate, indicating that the pattern of under-detection in scRNA-seq was more consistent than that of over-detection.</p>
<p>In addition to comparing the number or percentage of frequently appearing genes, we also examined whether those frequently appearing genes were the same between the paired and randomly paired data. For each gate, we computed the intersection-over-union ratio between the sets of frequently appearing genes in the paired and randomly paired analyses, and averaged it across the 100 iterations of random pairing. The average intersection-over-union ratios were 0.75 and 0.66 for the upper-left and upper-right gates, respectively, indicating high overlap for those two gates between the paired and randomly paired analyses. In contrast, the average intersection-over-union ratio for the bottom gate was only 0.17, showing that the frequently appearing genes in the bottom gate were quite different between the paired and randomly paired analyses, which further indicated that the pattern of over-detection in scRNA-seq was weak.</p>
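<p>The averaged intersection-over-union ratio described above can be sketched as follows. This is an illustrative example rather than the code used in the study; the function names and the toy gene sets are hypothetical, and an empty union is assumed to score 0.</p>
<preformat>
```python
def intersection_over_union(set_a, set_b):
    """Return |A intersect B| / |A union B| (0.0 when both sets are empty)."""
    a, b = set(set_a), set(set_b)
    union = a.union(b)
    if not union:
        return 0.0
    return len(a.intersection(b)) / len(union)


def mean_iou(paired_genes, random_iterations):
    """Average the IoU between the paired-data gene set and each random-pairing gene set."""
    scores = [intersection_over_union(paired_genes, genes) for genes in random_iterations]
    return sum(scores) / len(scores)


# Toy sets (hypothetical): frequently appearing genes in one gate.
paired = {"G1", "G2", "G3", "G4"}
iterations = [
    {"G1", "G2", "G3"},              # IoU = 3 / 4
    {"G1", "G2", "G3", "G4", "G5"},  # IoU = 4 / 5
]
average = mean_iou(paired, iterations)
print(round(average, 3))
```
</preformat>
<p>A value close to 1 means that nearly the same genes are frequent in both analyses, whereas a value close to 0 means that the two gene sets barely overlap.</p>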
</sec>
</sec>
<sec sec-type="discussion" id="s3">
<title>Discussion</title>
<p>In this study, we analyzed paired bulk RNA-seq and scRNA-seq data from 53 samples from various biological contexts. Comparison between bulk RNA-seq and scRNA-seq data revealed genes that were consistently under-detected in scRNA-seq, and this result was robust to random pairing of the data. In addition, we observed that the frequently under-detected genes in scRNA-seq were significantly enriched for the poly(T) motif. In contrast, enrichment of the poly(T) motif was not observed in genes consistently highly expressed in both technologies or in genes that appeared to be over-detected in scRNA-seq. This motif-based observation led to our hypothesis that the poly(T) motif in genes may be able to form hairpin structures with the poly(A) tails of their mRNA transcripts, making it difficult for those transcripts to be captured during the capture step of scRNA-seq library preparation. This is a mechanistic conjecture for why those genes may be more prone to under-detection in scRNA-seq compared to bulk RNA-seq.</p>
<p>The datasets analyzed in the study not only reflected a variety of biological contexts, but also contained technical variations in experimental and computational analyses. These technical variations include choices of alignment tools, choices of reference genome, library preparations and other experimental factors. All these factors could impact the data and subsequent analyses, including the results presented in our study. An ideal situation for our study would be that all samples were processed using the same experimental protocol, the same reference genome, and the same version of the same alignment software. However, this is infeasible because almost all previous datasets involved some unique details in their experimental protocols. In addition, since raw FASTQ files were unavailable for many of the bulk RNA-seq and scRNA-seq samples in this study, we were unable to obtain the raw reads and run a standardized pre-processing pipeline to derive the gene expression data for all the samples. One experimental factor in scRNA-seq, the choice between 3&#x2032; and 5&#x2032; library preparation protocols, presents an interesting discussion, because our motif observations and mechanistic conjecture are both relevant to the 3&#x2032; end. Among the 8 datasets included in this study, 7 were generated using the 3&#x2032; protocol. The remaining dataset contained a mix of 3&#x2032; and 5&#x2032; scRNA-seq data, but did not provide information on which samples were profiled by which protocol. Therefore, we did not distinguish 3&#x2032; from 5&#x2032; in our analysis. Since the above-mentioned factors were ignored in our analysis, we were effectively embracing the variations caused by those factors. Even with such variations in the data, we still observed a robust motif for the upper-left gate of genes under-detected in scRNA-seq. The persistence of this motif despite these technical variations therefore supports the robustness of our results.</p>
<p>Although the poly(T) motif was significantly enriched among genes that were frequently under-detected in scRNA-seq, there were consistently under-detected genes that lacked this motif. For example, among the top 15 genes that were most frequently under-detected in scRNA-seq as shown in <xref ref-type="table" rid="T1">Table 1</xref>, the poly(T) motif was not present in CSDE1, FLNA, FN1, DST, EIF4G2, and YWHAZ, while the remaining 9 genes contained the poly(T) motif. The mechanism of why these 6 genes were consistently under-detected in scRNA-seq is still unclear and needs further investigation.</p>
<p>Genes that are repeatedly under-detected in scRNA-seq are less likely to be considered highly variable genes, and thus are less likely to drive clustering or trajectory analysis results in downstream analyses. However, recognizing such genes is important. If the goal of a research project is to investigate a specific gene that happens to be prone to under-detection in scRNA-seq, scRNA-seq may be a less reliable experimental strategy than bulk RNA-seq. When developing imputation methods for scRNA-seq data, more attention should be paid to genes with an enriched poly(T) motif. As another future direction, the bulk vs. single-cell comparison in this study can be extended to other genomic data types, such as paired ATAC-seq and scATAC-seq data for a common set of samples.</p>
</sec>
<sec sec-type="materials|methods" id="s4">
<title>Materials and methods</title>
<sec id="s4-6">
<title>Summary of datasets</title>
<p>Expression data for 53 paired bulk RNA-seq and scRNA-seq samples were obtained from 8 published GEO datasets. The paired samples were from either the same individuals, the same cell lines, or the same tissue sources. The scRNA-seq data for the majority of the 53 samples were generated using the 10X Chromium single-cell 3&#x2032; v2 or v3 protocol. Some of the samples in one GEO dataset (GSE176078) were processed using the 10X Chromium single-cell 5&#x2032; protocol, but the identity of those samples was not available. A summary of the 8 GEO datasets and their references is available in <xref ref-type="table" rid="T3">Table 3</xref>. More details about the accession of individual samples and how bulk RNA-seq and scRNA-seq samples were paired for each dataset can be found in <xref ref-type="sec" rid="s10">Supplementary Table S1</xref>.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Summary of datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Bulk RNA-seq and scRNA-seq datasets</th>
<th align="center">Source of samples</th>
<th align="center">Number of samples</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">GSE151202 (<xref ref-type="bibr" rid="B22">Li et al., 2021</xref>)</td>
<td align="center">Human vaginal wall from women with severe anterior vaginal prolapse</td>
<td align="center">15</td>
</tr>
<tr>
<td align="center">GSE161529 (<xref ref-type="bibr" rid="B27">Pal et al., 2021</xref>) and GSE161892 (<xref ref-type="bibr" rid="B27">Pal et al., 2021</xref>)</td>
<td align="center">Breast cancer</td>
<td align="center">3</td>
</tr>
<tr>
<td align="center">GSE176078 (<xref ref-type="bibr" rid="B38">Wu et al., 2021</xref>)</td>
<td align="center">Breast cancer</td>
<td align="center">24</td>
</tr>
<tr>
<td align="center">GSE149694 (<xref ref-type="bibr" rid="B23">Liu et al., 2020</xref>) and GSE150311 (<xref ref-type="bibr" rid="B23">Liu et al., 2020</xref>)</td>
<td align="center">Human fibroblasts</td>
<td align="center">4</td>
</tr>
<tr>
<td align="center">GSE108382 (<xref ref-type="bibr" rid="B12">Ho et al., 2018</xref>) and GSE108394 (<xref ref-type="bibr" rid="B12">Ho et al., 2018</xref>)</td>
<td align="center">Melanoma cell line</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">GSE136148 (<xref ref-type="bibr" rid="B9">Dong et al., 2021</xref>)</td>
<td align="center">A mixture of MDA-MB-438, MCF7, and human dermal fibroblast cell lines</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">GSE143705 (<xref ref-type="bibr" rid="B6">Carraro et al., 2020</xref>) and GSE143706 (<xref ref-type="bibr" rid="B6">Carraro et al., 2020</xref>)</td>
<td align="center">Human trachea</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">GSE129240 (<xref ref-type="bibr" rid="B40">Zaitsev et al., 2019</xref>) and 10X website</td>
<td align="center">Jurkat and 293T cell lines</td>
<td align="center">2</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-1">
<title>Data preprocessing of bulk RNA-seq</title>
<p>For each GEO bulk RNA-seq dataset, median-of-ratios normalization was performed using DESeq2, which accounts for factors including sequencing depth and RNA composition. Next, log transformation was performed on the normalized data. The genes shared between the bulk RNA-seq and scRNA-seq data of each paired sample were then identified across the 53 sample pairs from the GEO datasets, and a matrix representing the normalized bulk RNA-seq expression of all overlapping genes in the 53 paired samples was created. Finally, quantile normalization was performed on this matrix.</p>
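<p>The final quantile normalization step can be illustrated with a minimal sketch. This is a textbook version of the technique, not the implementation used in this study; it assumes a genes-by-samples matrix without missing values and breaks ties by input order rather than averaging them. The function name and toy matrix are hypothetical.</p>
<preformat>
```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes-by-samples matrix given as a list of rows.

    After normalization every sample (column) has the same distribution of values:
    the mean of the k-th smallest values across all samples.
    Assumption: ties within a column are broken by input order, not averaged.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / n_cols for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result


# Toy matrix (hypothetical): 4 genes (rows) by 3 samples (columns).
toy = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 5.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(toy)
for row in normalized:
    print([round(value, 3) for value in row])
```
</preformat>
<p>In the output, each column contains the same set of values, and only their assignment to genes differs between samples.</p>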
</sec>
<sec id="s4-2">
<title>Data preprocessing of scRNA-seq</title>
<p>For scRNA-seq samples, library size normalization was performed using Seurat, followed by natural log transformation. The average expression of each gene across all cells was then calculated to obtain pseudo-bulk RNA-seq data. Finally, a matrix representing the single-cell based pseudo-bulk expression of the 53 samples was created.</p>
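<p>The pseudo-bulk calculation can be sketched in a few lines. The example below is an illustration rather than the Seurat code used in this study. It assumes a scale factor of 10,000 and a log(1 + x) transformation, which mirror the defaults of Seurat&#x2019;s log-normalization; the function name and the toy counts are hypothetical.</p>
<preformat>
```python
import math


def pseudo_bulk(counts, scale_factor=10_000):
    """Average log-normalized expression per gene across cells.

    counts: list of cells, each a list of raw counts per gene (same gene order).
    Assumptions: each cell is scaled to `scale_factor` total counts and
    transformed with log(1 + x); cells with zero total counts contribute zeros.
    """
    n_genes = len(counts[0])
    totals = [0.0] * n_genes
    for cell in counts:
        library_size = sum(cell)
        if library_size > 0:
            for j, count in enumerate(cell):
                totals[j] += math.log1p(count / library_size * scale_factor)
    return [total / len(counts) for total in totals]


# Toy input (hypothetical): 2 cells by 2 genes.
toy_counts = [
    [1, 9],
    [5, 5],
]
profile = pseudo_bulk(toy_counts)
print([round(value, 3) for value in profile])
```
</preformat>
<p>One such profile per sample, restricted to the genes shared with the bulk data, would form one column of the pseudo-bulk matrix.</p>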
</sec>
<sec id="s4-3">
<title>Correlation analysis of bulk RNA-seq and scRNA-seq</title>
<p>Pearson and Spearman correlation coefficients were used to quantify the relationship between the normalized bulk RNA-seq and pseudo-bulk RNA-seq expression profiles of the paired and randomly paired data.</p>
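<p>For illustration, the two coefficients can be computed as in the sketch below, which is not the code used in this study. Spearman correlation is taken here as the Pearson correlation of average ranks; the function names and toy profiles are hypothetical, and both profiles are assumed to be non-constant.</p>
<preformat>
```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i != len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 != len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy profiles (hypothetical): monotone but non-linear relationship.
bulk = [1.0, 2.0, 3.0, 4.0, 5.0]
pseudo = [0.1, 0.2, 0.4, 0.8, 5.0]
r_pearson = pearson(bulk, pseudo)
r_spearman = spearman(bulk, pseudo)
print(round(r_pearson, 3), round(r_spearman, 3))
```
</preformat>
<p>In this toy case the Spearman coefficient is higher than the Pearson coefficient because the relationship is monotone but not linear.</p>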
</sec>
<sec id="s4-4">
<title>Motif enrichment analysis</title>
<p>The MEME Suite was used to find significantly enriched sequence motifs in the last 350&#xa0;bp of the cDNA sequences of the longest transcripts of candidate genes. For the motif site distribution, the &#x201c;any number of occurrences&#x201d; option was selected. The MEME Suite reports an E-value, which serves as an indicator of the statistical significance of a motif. A motif with an E-value smaller than 0.05 was considered significant.</p>
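<p>The sequence extraction that precedes the motif search can be sketched as follows. This is an illustrative example, not the code used in this study; the function name, the transcript identifiers, and the toy sequences are hypothetical, and ties in transcript length are assumed to be resolved by taking the first transcript encountered.</p>
<preformat>
```python
def last_bases_of_longest_transcript(transcripts, n_bases=350):
    """Return (transcript_id, tail) for the longest transcript of one gene.

    transcripts: dict mapping transcript ID to its cDNA sequence.
    The tail is the last `n_bases` bases, or the whole sequence if it is shorter.
    """
    longest_id = max(transcripts, key=lambda tid: len(transcripts[tid]))
    return longest_id, transcripts[longest_id][-n_bases:]


# Toy gene (hypothetical) with two transcripts of 400 and 450 bases.
toy_gene = {
    "tx1": "ACGT" * 100,
    "tx2": "A" * 50 + "T" * 400,
}
transcript_id, tail = last_bases_of_longest_transcript(toy_gene)
print(transcript_id, len(tail))
```
</preformat>
<p>The resulting tails, one per candidate gene, would then be written to a FASTA file and supplied to the motif search.</p>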
</sec>
<sec id="s4-5">
<title>Pathway analysis</title>
<p>DAVID (Database for Annotation, Visualization, and Integrated Discovery) was used to identify the enriched KEGG pathways, and biological process, cellular component, and molecular function GO terms. FDR value was reported by DAVID for each significantly enriched pathway.</p>
</sec>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s5">
<title>Data availability statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/<xref ref-type="sec" rid="s10">Supplementary Material</xref>.</p>
</sec>
<sec id="s6">
<title>Author contributions</title>
<p>XL performed the analysis and prepared the manuscript. GG and PQ designed and supervised the project and reviewed the manuscript. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s7">
<title>Funding</title>
<p>This work was supported by funding from the National Institutes of Health (U01CA265711) and the National Science Foundation (CCF2007029). PQ is an ISAC Marylou Ingram Scholar and a Wallace H. Coulter Distinguished Faculty Fellow. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funders.</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s10">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fbinf.2023.1120290/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fbinf.2023.1120290/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet1.PDF" id="SM1" mimetype="application/PDF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Andrews</surname>
<given-names>T. S.</given-names>
</name>
<name>
<surname>Hemberg</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>M3Drop: Dropout-based feature selection for scRNASeq</article-title>. <source>Bioinformatics</source> <volume>35</volume> (<issue>16</issue>), <fpage>2865</fpage>&#x2013;<lpage>2867</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty1044</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bailey</surname>
<given-names>T. L.</given-names>
</name>
<name>
<surname>Boden</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Buske</surname>
<given-names>F. A.</given-names>
</name>
<name>
<surname>Frith</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Grant</surname>
<given-names>C. E.</given-names>
</name>
<name>
<surname>Clementi</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2009</year>). <article-title>MEME Suite: Tools for motif discovery and searching</article-title>. <source>Nucleic Acids Res.</source> <volume>37</volume>, <fpage>W202</fpage>&#x2013;<lpage>W208</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkp335</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baran-Gale</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chandra</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kirschner</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Experimental design for single-cell RNA sequencing</article-title>. <source>Brief. Funct. Genomics</source> <volume>17</volume> (<issue>4</issue>), <fpage>233</fpage>&#x2013;<lpage>239</lpage>. <pub-id pub-id-type="doi">10.1093/bfgp/elx035</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Becht</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>McInnes</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Healy</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dutertre</surname>
<given-names>C. A.</given-names>
</name>
<name>
<surname>Kwok</surname>
<given-names>I. W. H.</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>L. G.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Dimensionality reduction for visualizing single-cell data using UMAP</article-title>. <source>Nat. Biotechnol.</source> <volume>37</volume>, <fpage>38</fpage>&#x2013;<lpage>44</lpage>. <pub-id pub-id-type="doi">10.1038/nbt.4314</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Butler</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hoffman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Smibert</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Papalexi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Satija</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Integrating single-cell transcriptomic data across different conditions, technologies, and species</article-title>. <source>Nat. Biotechnol.</source> <volume>36</volume> (<issue>5</issue>), <fpage>411</fpage>&#x2013;<lpage>420</lpage>. <pub-id pub-id-type="doi">10.1038/nbt.4096</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Carraro</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Mulay</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Mizuno</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Konda</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Petrov</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Single-cell reconstruction of human basal cell diversity in normal and idiopathic pulmonary fibrosis lungs</article-title>. <source>Am. J. Respir. Crit. Care Med.</source> <volume>202</volume> (<issue>11</issue>), <fpage>1540</fpage>&#x2013;<lpage>1550</lpage>. <pub-id pub-id-type="doi">10.1164/rccm.201904-0792oc</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Ning</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Single-cell RNA-seq technologies and related computational data analysis</article-title>. <source>Front. Genet.</source> <volume>10</volume>, <fpage>317</fpage>. <pub-id pub-id-type="doi">10.3389/fgene.2019.00317</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Albergante</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>J. Y.</given-names>
</name>
<name>
<surname>Lareau</surname>
<given-names>C. A.</given-names>
</name>
<name>
<surname>Lo Bosco</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Guan</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM</article-title>. <source>Nat. Commun.</source> <volume>10</volume> (<issue>1</issue>), <fpage>1903</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-019-09670-4</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dong</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Thennavan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Urrutia</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Perou</surname>
<given-names>C. M.</given-names>
</name>
<name>
<surname>Zou</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>SCDC: Bulk gene expression deconvolution by multiple single-cell RNA sequencing references</article-title>. <source>Brief. Bioinform.</source> <volume>22</volume> (<issue>1</issue>), <fpage>416</fpage>&#x2013;<lpage>427</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbz166</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Friedman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hastie</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tibshirani</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2001</year>). <source>The elements of statistical learning</source>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>Springer series in statistics</publisher-name>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hafemeister</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Satija</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression</article-title>. <source>Genome Biol.</source> <volume>20</volume> (<issue>1</issue>), <fpage>296</fpage>. <pub-id pub-id-type="doi">10.1186/s13059-019-1874-1</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ho</surname>
<given-names>Y. J.</given-names>
</name>
<name>
<surname>Anaparthy</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Molik</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Mathew</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Aicher</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Patel</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Single-cell RNA-seq analysis identifies markers of resistance to targeted BRAF inhibitors in melanoma cell populations</article-title>. <source>Genome Res.</source> <volume>28</volume> (<issue>9</issue>), <fpage>1353</fpage>&#x2013;<lpage>1363</lpage>. <pub-id pub-id-type="doi">10.1101/gr.234062.117</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hou</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Hicks</surname>
<given-names>S. C.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A systematic evaluation of single-cell RNA-sequencing imputation methods</article-title>. <source>Genome Biol.</source> <volume>21</volume> (<issue>1</issue>), <fpage>218</fpage>. <pub-id pub-id-type="doi">10.1186/s13059-020-02132-x</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Torre</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Dueck</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Shaffer</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Bonasio</surname>
<given-names>R.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>SAVER: Gene expression recovery for single-cell RNA sequencing</article-title>. <source>Nat. Methods</source> <volume>15</volume> (<issue>7</issue>), <fpage>539</fpage>&#x2013;<lpage>542</lpage>. <pub-id pub-id-type="doi">10.1038/s41592-018-0033-z</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ji</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis</article-title>. <source>Nucleic Acids Res.</source> <volume>44</volume> (<issue>13</issue>), <fpage>e117</fpage>. <pub-id pub-id-type="doi">10.1093/nar/gkw430</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kharchenko</surname>
<given-names>P. V.</given-names>
</name>
<name>
<surname>Silberstein</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Scadden</surname>
<given-names>D. T.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Bayesian approach to single-cell differential expression analysis</article-title>. <source>Nat. Methods</source> <volume>11</volume> (<issue>7</issue>), <fpage>740</fpage>&#x2013;<lpage>742</lpage>. <pub-id pub-id-type="doi">10.1038/nmeth.2967</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>T. H.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Demystifying "drop-outs" in single-cell UMI data</article-title>. <source>Genome Biol.</source> <volume>21</volume> (<issue>1</issue>), <fpage>196</fpage>. <pub-id pub-id-type="doi">10.1186/s13059-020-02096-y</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kitzman</surname>
<given-names>J. O.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Haplotypes drop by drop</article-title>. <source>Nat. Biotechnol.</source> <volume>34</volume> (<issue>3</issue>), <fpage>296</fpage>&#x2013;<lpage>298</lpage>. <pub-id pub-id-type="doi">10.1038/nbt.3500</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Klein</surname>
<given-names>A. M.</given-names>
</name>
<name>
<surname>Mazutis</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Akartuna</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Tallapragada</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Veres</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>V.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells</article-title>. <source>Cell</source> <volume>161</volume> (<issue>5</issue>), <fpage>1187</fpage>&#x2013;<lpage>1201</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2015.04.044</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kobak</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Berens</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>The art of using t-SNE for single-cell transcriptomics</article-title>. <source>Nat. Commun.</source> <volume>10</volume> (<issue>1</issue>), <fpage>5416</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-019-13056-x</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>W. V.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J. J.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>An accurate and robust imputation method scImpute for single-cell RNA-seq data</article-title>. <source>Nat. Commun.</source> <volume>9</volume> (<issue>1</issue>), <fpage>997</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-018-03405-7</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Q. Y.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>B. F.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Single-cell transcriptome profiling of the vaginal wall in women with severe anterior vaginal prolapse</article-title>. <source>Nat. Commun.</source> <volume>12</volume> (<issue>1</issue>), <fpage>87</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-020-20358-y</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ouyang</surname>
<given-names>J. F.</given-names>
</name>
<name>
<surname>Rossello</surname>
<given-names>F. J.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>J. P.</given-names>
</name>
<name>
<surname>Davidson</surname>
<given-names>K. C.</given-names>
</name>
<name>
<surname>Valdes</surname>
<given-names>D. S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Reprogramming roadmap reveals route to human induced trophoblast stem cells</article-title>. <source>Nature</source> <volume>586</volume> (<issue>7827</issue>), <fpage>101</fpage>&#x2013;<lpage>107</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-020-2734-6</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Love</surname>
<given-names>M. I.</given-names>
</name>
<name>
<surname>Huber</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Anders</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2</article-title>. <source>Genome Biol.</source> <volume>15</volume> (<issue>12</issue>), <fpage>550</fpage>. <pub-id pub-id-type="doi">10.1186/s13059-014-0550-8</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lun</surname>
<given-names>A. T. L.</given-names>
</name>
<name>
<surname>Bach</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Marioni</surname>
<given-names>J. C.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Pooling across cells to normalize single-cell RNA sequencing data with many zero counts</article-title>. <source>Genome Biol.</source> <volume>17</volume> (<issue>1</issue>), <fpage>75</fpage>. <pub-id pub-id-type="doi">10.1186/s13059-016-0947-7</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Macosko</surname>
<given-names>E. Z.</given-names>
</name>
<name>
<surname>Basu</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Satija</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Nemesh</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Shekhar</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Goldman</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets</article-title>. <source>Cell</source> <volume>161</volume> (<issue>5</issue>), <fpage>1202</fpage>&#x2013;<lpage>1214</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2015.05.002</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pal</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Vaillant</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Capaldo</surname>
<given-names>B. D.</given-names>
</name>
<name>
<surname>Joyce</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>X.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>A single-cell RNA expression atlas of normal, preneoplastic and tumorigenic states in the human breast</article-title>. <source>EMBO J.</source> <volume>40</volume> (<issue>11</issue>), <fpage>e107333</fpage>. <pub-id pub-id-type="doi">10.15252/embj.2020107333</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pennisi</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Chronicling embryos, cell by cell, gene by gene</article-title>. <source>Science</source> <volume>360</volume> (<issue>6387</issue>), <fpage>367</fpage>. <pub-id pub-id-type="doi">10.1126/science.360.6387.367</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Picelli</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Bjorklund</surname>
<given-names>A. K.</given-names>
</name>
<name>
<surname>Faridani</surname>
<given-names>O. R.</given-names>
</name>
<name>
<surname>Sagasser</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Winberg</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Sandberg</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Smart-seq2 for sensitive full-length transcriptome profiling in single cells</article-title>. <source>Nat. Methods</source> <volume>10</volume> (<issue>11</issue>), <fpage>1096</fpage>&#x2013;<lpage>1098</lpage>. <pub-id pub-id-type="doi">10.1038/nmeth.2639</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qiu</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Embracing the dropouts in single-cell RNA-seq analysis</article-title>. <source>Nat. Commun.</source> <volume>11</volume> (<issue>1</issue>), <fpage>1169</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-020-14976-9</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sarkar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Stephens</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis</article-title>. <source>Nat. Genet.</source> <volume>53</volume> (<issue>6</issue>), <fpage>770</fpage>&#x2013;<lpage>777</lpage>. <pub-id pub-id-type="doi">10.1038/s41588-021-00873-4</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lipnitskaya</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Legewie</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Klein</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Becker</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Machine learning-assisted identification of factors contributing to the technical variability between bulk and single-cell RNA-seq experiments</article-title>. <comment>bioRxiv</comment>, <fpage>2022</fpage>.
</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stelzer</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dalah</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>T. I.</given-names>
</name>
<name>
<surname>Satanower</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Rosen</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Nativ</surname>
<given-names>N.</given-names>
</name>
<etal/>
</person-group> (<year>2011</year>). <article-title>
<italic>In-silico</italic> human genomics with GeneCards</article-title>. <source>Hum. Genomics</source> <volume>5</volume> (<issue>6</issue>), <fpage>709</fpage>&#x2013;<lpage>717</lpage>. <pub-id pub-id-type="doi">10.1186/1479-7364-5-6-709</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Svensson</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Droplet scRNA-seq is not zero-inflated</article-title>. <source>Nat. Biotechnol.</source> <volume>38</volume> (<issue>2</issue>), <fpage>147</fpage>&#x2013;<lpage>150</lpage>. <pub-id pub-id-type="doi">10.1038/s41587-019-0379-5</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tracy</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>G. C.</given-names>
</name>
<name>
<surname>Dries</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>RESCUE: Imputing dropout events in single-cell RNA-sequencing data</article-title>. <source>BMC Bioinforma.</source> <volume>20</volume> (<issue>1</issue>), <fpage>388</fpage>. <pub-id pub-id-type="doi">10.1186/s12859-019-2977-0</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>van Dijk</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Sharma</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Nainys</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yim</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Kathail</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Carr</surname>
<given-names>A. J.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Recovering gene interactions from single-cell data using data diffusion</article-title>. <source>Cell</source> <volume>174</volume> (<issue>3</issue>), <fpage>716</fpage>&#x2013;<lpage>729.e27</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2018.05.061</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Direct comparative analyses of 10X Genomics Chromium and Smart-seq2</article-title>. <source>Genomics Proteomics Bioinforma.</source> <volume>19</volume> (<issue>2</issue>), <fpage>253</fpage>&#x2013;<lpage>266</lpage>. <pub-id pub-id-type="doi">10.1016/j.gpb.2020.02.005</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>S. Z.</given-names>
</name>
<name>
<surname>Al-Eryani</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Roden</surname>
<given-names>D. L.</given-names>
</name>
<name>
<surname>Junankar</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Harvey</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Andersson</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>A single-cell and spatially resolved atlas of human breast cancers</article-title>. <source>Nat. Genet.</source> <volume>53</volume> (<issue>9</issue>), <fpage>1334</fpage>&#x2013;<lpage>1347</lpage>. <pub-id pub-id-type="doi">10.1038/s41588-021-00911-1</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yip</surname>
<given-names>S. H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Kocher</surname>
<given-names>J. A.</given-names>
</name>
<name>
<surname>Sham</surname>
<given-names>P. C.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Linnorm: Improved statistical analysis for single cell RNA-seq expression data</article-title>. <source>Nucleic Acids Res.</source> <volume>45</volume> (<issue>22</issue>), <fpage>13097</fpage>. <pub-id pub-id-type="doi">10.1093/nar/gkx1189</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zaitsev</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Bambouskova</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Swain</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Artyomov</surname>
<given-names>M. N.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures</article-title>. <source>Nat. Commun.</source> <volume>10</volume> (<issue>1</issue>), <fpage>2209</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-019-09990-5</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>