Reconstructing the Developmental Trajectories of Multiple Subtypes in Pulmonary Parenchymal Epithelial Cells by Single-Cell RNA-seq

Background Some lung diseases are cell type-specific. It is essential to study the cellular anatomy of the normal human lung for exploring the cellular origin of lung disease and the cell development trajectory. Methods We used the Seurat R package for data quality control. The principal component analysis (PCA) was used for linear dimensionality reduction. UMAP and tSNE were used for dimensionality reduction. Muonocle2 was used to extract lung epithelial cells to analyze the subtypes of epithelial cells further and to study the development of these cell subtypes. Results We showed a total of 20154 high quality of cells from human normal lung tissue. They were initially divided into 17 clusters cells and then identified as seven types of cells, namely macrophages, monocytes, CD8 + T cells, epithelial cells, endothelial cells, adipocytes, and NK cells. 4240 epithelial cells were extracted for further analysis and they were divided into seven clusters. The seven cell clusters include alveolar cell, alveolar endothelial progenitor, ciliated cell, secretory cell, ionocyte cell, and a group of cells that are not clear at present. We show the development track of these subtypes of epithelial cells, in which we speculate that alveolar epithelial progenitor (AEP) is a kind of progenitor cells that can develop into alveolar cells, and find six essential genes that determine the cell fate, including AGER, RPL10, RPL9, RPS18, RPS27, and SFTPB. Conclusion We provide a transcription map of human lung cells, especially the in-depth study on the development of epithelial cell subtypes, which will help us to study lung cell biology and lung diseases.


INTRODUCTION
In mammals, the lung is a highly branched network, in which the distal area of the bronchus tree transforms into dense alveolus vesicles during development. The lung plays a crucial role in both gas exchange and mucosal immunity. Its anatomical structure provides these functions through the following ways: (1) guiding air into the airway of the respiratory unit, providing mucociliary clearance, and forming a barrier to inhaled particles and pathogens; (2) alveoli, the distal cystic structure where gas exchange takes place (Vieira Braga et al., 2019). The lung is a complex organ with long development time. It is controlled and coordinated by gene network and dynamic crosstalk of multiple cells, including lineage assignment, cell proliferation, differentiation, migration, morphogenesis, and damage repair. The function of epithelial tissue depends on the abundance and distribution of differentiated cells (Rock and Hogan, 2011;Hines and Sun, 2014). Researchers have been trying to restore tissue function after lung injury, which requires understanding how physiological tasks are distributed among cell types and how cell states change between homeostasis, damage/repair, and disease (Rock et al., 2010).
Lung diseases, including lung trauma, viral pneumonia, lung cancer, asthma, chronic obstructive pulmonary disease, and cystic fibrosis, need stem cells to restore the damaged lung tissue after a large number of lung tissue cells lose function (Monsel et al., 2016). Finding stem cells of lung tissue is of great significance for the treatment of lung diseases. In this study, we integrated and analyzed samples of single-cell RNA sequencing (scRNAseq) from multiple normal lung tissues to explore progenitor cells that may regenerate into alveolar epithelial cells. The study of normal lung tissue cells will help us to better understand the process of lung development, especially the process of alveolar formation. It will have a significant impact on the survival rate of premature infants with incomplete lung development, and will also promote the repair and regeneration of lung injury in adults.

RESULTS
We calculated the number of genes in each cell, the number of UMIs (Unique Molecular Identifiers), and the percentage of mitochondrial genes (Supplementary Figure 1A), and the correlation plot of sequencing depth (Supplementary Figure 1C). Using the FindVariableFeatures function, we extracted 2000 genes with a large coefficient of variation between cells (Supplementary Figure 1B). After QC, we furtherly analyzed a total of 20154 high-quality lung cells from four normal lung tissue. To eliminate the interference of cell cycle on cell dimensionality reduction clustering, we analyzed the influence of cell cycle on clustering and removed cell cycle-related genes. The final data results show that the cells of each cell cycle are evenly distributed in each cell cluster (Supplementary  Figures 1E,F). The results of the principal component analysis (PCA) show that the first 20 pcs have good discrimination ( Figure 1A). All of them are included in the next step of the analysis. We also show the gene expression heatmap of the first four PCs and the related genes of each PCA component (Supplementary Figures 3A,B).
After removing the batch effect by harmony, the cells from each sample are evenly distributed in each cell cluster (Supplementary Figures 3C,D). We identified 17 clusters of 194-3005 cells per cluster ( Figure 1B). Based on the expression of the top ten genes with the most significant differential expression in each cell cluster, we drew the heatmap of marker gene expression in these 17 cell clusters ( Figure 1C). Cell clusters were visualized using two different methods (UMAP and tSNE), and the results were the same (Supplementary Figures 3E,F). According to the specific marker gene (Supplementary Table 2), we subdivide the cells into seven categories: macrophages, monocytes, CD8 + T cells, epithelial cells, endothelial cells, adipocytes, and NK cells (Figures 1F,G). Our scatter plot, bubble plot, and violin plot of the marker genes of epithelial cells show that these genes can distinguish epithelial cells from other cells well (Figures 1D,E and Supplementary Figures 3G,H). We also found that the classic maker genes of almost all epithelial cells are highly expressed in our epithelial cells (Supplementary Figure 4E), so we think these results are reliable.
Then, we extracted the lung epithelial cells for further analysis. Our results show that there are 4240 epithelial cells. After dimensionality reduction clustering of these cells, we found that epithelial cells can be divided into seven different clusters ( Figures 2B,C). Violin plots showing the expression of marker genes in each cluster of epithelial cells (Figure 2A). We found out the maker genes of each cluster. Then, we drew the heatmap of the top ten differentially expressed marker genes (Supplementary Figure 4D), among which the representative marker genes in each cluster were displayed with a violin plot and bubble plot (Supplementary Figures 4F,G).  (Montoro et al., 2018;Plasschaert et al., 2018;Zacharias et al., 2018;Vieira Braga et al., 2019; Table 1 and Supplementary Table 2). Besides, we used Monocle2 to perform the pseudo time trajectories of all epithelial cells and showed the fate determination between them (Figures 2D-H). The shade of the color indicates the trajectory of cell development, which is sorted according to the pseudo-time value ( Figure 2D). We mapped the Seurat cluster ID of epithelial cells to the distribution of the pseudotime value of each cell to show the development trajectory between different subgroups of epithelial cells ( Figure 2E). Five branches with different colors were obtained by pseudo-time analysis, which represented the fate of different cell subsets in epithelial cells ( Figure 2F). We mapped epithelial cells from different tissue sources and the cell cycle of these cells to the pseudo temporal trajectory to determine the influence of tissue source ( Figure 2G) and cell cycle on the pseudo-time analysis ( Figure 2H). In combination with the results of the pseudo-time analysis and the above annotation of epithelial cell subsets, we speculate that AEP, as a stem cell, might have the potential to differentiate in multiple directions. We found two subtypes (0 and 1 cell clusters) in alveolar cells. Interestingly, there was a kind of cell cluster in pulmonary epithelial cells,   which is not clear yet, according to the existing literature. In this unclear cell cluster, the expression of marker genes was characterized by the negative expression of some specific maker genes as the positive expression in the alveolar cells. At the same time, we found the first six genes that affect cell fate (Figure 2I), including AGER, RPL10, RPL9, RPS18, RPS27, and SFTPB. To show more detailed genetic information, we show the top 50 genes that affect the fate of cells ( Figure 2J). These genes may play an important role in the direction of cell differentiation.

DISCUSSION
In recent years, with the application of scRNA-seq (Treutlein et al., 2014;Ardini-Poleske et al., 2017;Nabhan et al., 2018) technology in lung tissue, it has been widely used to describe the cell type diversity and lineage level in the lung tissue due to its advantages in obtaining single-cell information. These studies have contributed to find new cell subtypes and explain the development track between these cells in adult lung tissue, so as to find out the stem cells (Zacharias et al., 2018). At present, the most studied lung stem cells are the basal cells of the tracheal epithelium. The basal progenitor cells are thought to be able to regenerate into a variety of epithelial components. It is necessary for restoring normal organ homeostasis by regeneration of functional tissue after severe trauma (Zacharias et al., 2018). One of the most important parts of lung regeneration is alveoli. In this study, we found that alveolar epithelial cells have two types of cells (0 and 1 cell clusters in epithelial cell subtypes), which is consistent with previous studies. We think that this may be due to some differences in expression between the two subtypes of alveolar cells. There are two types of alveolar epithelial cells, type 1 and type 2. The expression of their maker genes such as SFTPC, NAPSA, SFTPD, PGC, and SFTPA1 is consistent between these two subgroups. This is also consistent with the results of the heatmap of maker genes in cluster 0 and cluster 1 in Supplementary Figure 3.
Previous studies showed that there are significant differences in cell components between airway and lung parenchyma, and also in different parts of the airway (Rock et al., 2009). The basal cells distribute in the trachea and bronchi. The secretory cells distribute in the bronchi and bronchioles (Rock et al., 2009). The alveolar stem cells distribute in the junction of the bronchioloalveolar tubes (Lee et al., 2014). The type II alveolar epithelial cells distribute in the alveolar area (Barkauskas et al., 2013). Lung injury activates specialized adult epithelial progenitor cells to regenerate the epithelium. According to the degree of injury, the remaining alveolar type II cells and distal airway stem/progenitor cells will mobilize to cover the stripped alveoli and restore the normal barrier. In addition to basal cells, the main source of airway stem/progenitor cells is still uncertain. Researchers found a unique subpopulation (about 5%) of rod-shaped lineage negative epithelial progenitor cells characterized by the high expression of H2-K1, which is essential for alveolar repair. The resting H2-K1 cells are the source of regeneration activity of almost all airway lineages in vitro. Normal H2-K1 (high) cells can differentiate into alveolar cells and recover lung function after transplantation into the damaged lung. These findings suggest that small subpopulations of specialized stem/progenitor cells are necessary for effective lung regeneration and are potential therapeutic adjuncts after major lung injury (Kathiriya et al., 2020). Meanwhile, researchers from the Perelman School of Medicine at the University of Pennsylvania identified a lung progenitor cell that repairs the alveoli. They isolated these progenitors from the lungs of mice and humans and described their characteristics (Zacharias et al., 2018). Morrisey's team identified an AEP, in which AEP was found in a larger population of cells called alveolar type 2 cells (Zacharias et al., 2018). In this study, according to the maker gene of AEP cells reported in the literature, we have also identified such cells in normal lung epithelial cells, including TM4SF1+, SLC34A2−, ABCA3−, CAV1+, AGER+, RTKN2+, etc. Besides, we can see that 0/1/2 of these cell clusters are very similar in the map of UMAP clustering, which is also consistent with the discussion that AEP may be mixed in alveolar epithelial cells as reported above.
The transition area between the smallest terminal bronchioles and alveoli is called the junction of bronchioloalveolar ducts. This part of the mouse body contains ciliated cells and secretory cells, but the cells distributed in this part of the human body are not clear (Cuzić et al., 2012). A small number of cells (1-2 cells in each trachea) at the junction of the bronchioloalveolar duct were labeled as SCGB1A1 (secretory cells) and alveolar surfactant protein C (SFTPC), which is a unique protein of type II alveolar epithelial cells ). The conclusion is that these double-positive cells are bronchioloalveolar stem cells. In our study, we did not find the SCGB1A1 and SFTPC double-positive cells, but we identified the cell cluster 2 of AEP cells, whose SFTPC expression was high. We speculate that the possible reason for this is that the number of such stem cells is relatively small, and there are many experimental processes and data analysis processes maybe will lead to the loss of some cells. For example, some cells will be filtered out in the process of making single cell suspension and single cell data quality control. In this way, the relatively small number of such stem cells may not be recognized effectively. Maybe we can find these cells in a larger sample size, which needs further study in the future.
In recent years, the research of stem cell and progenitor cell lineage recognition and the activity of these cells in adults has developed rapidly (Rath et al., 2018). In contrast, the research on the regulation of cell behavior by molecular level signaling pathway is slow. The major molecular signaling pathways used in the process of individual development play an important role in the process of lung cell repair (Hogan et al., 2014). However, these viewpoints need to be further tested in physiological or clinical related experimental models. To explore the mechanism of lung formation, especially in the field of alveolar formation, is in the ascendant. The gas-liquid interface culture system of bronchial epithelium and the in vitro organlike culture system (bronchus and alveolus), which are popular in recent years, provide the best technical support and research platform for solving these problems in the field of respiratory stem/progenitor cells.
To sum up, we provide a transcription map of human lung cells, especially the in-depth study on the development of epithelial cell subtypes, which will help us to study the lung cell biology and the relationship between cell types and diseases.

Data Acquisition and Ethical Review
We downloaded GSE130148 and GSE132771 10x genomics RNAseq datasets from the GEO database 1 , extracted the data of single-cell sequencing of normal lung tissue, and combined the 1 https://www.ncbi.nlm.nih.gov/gds/ two datasets with MergeSeurat function in Seurat (Satija et al., 2015;Stuart et al., 2019) R package (version 3.1.4) 2 as our analysis data. The Institutional Review Committee of Zhongshan Hospital (Fudan University, Shanghai, China) approved this study to be exempted from research.

Using Seurat for Data Quality Control (QC)
We use R (version 3.6.3) 3 and Seurat R package for data quality control. Single-cell data sets may contain "uninteresting" sources of variation, including not only technical noise, but also batch effects, and even sources of biological variation (such as cell cycle stage) (Nagano et al., 2017). Regression of these signals from the analysis can improve the downstream dimension reduction and clustering (Buettner et al., 2015). Based on previous studies, the screening criteria of cells were determined (Zacharias et al., 2018;Vieira Braga et al., 2019). According to the median of genes and percentage of mitochondrial genes in lung samples (Supplementary Figure 1A), cells with <200 and >3000 genes (potential cell diploid) and percentage of mitochondrial genes >20% were screened and removed. After QC, we obtained a total of 20154 high-quality lung cells. The relationship between mitochondrial gene percentage and mRNA reading, and between mRNA quantity and mRNA reads were detected and visualized (Supplementary Figure 1C). After normalization of data, all highly variable genes in the single-cell data were identified after controlling the relationship between mean expression and dispersion (Supplementary Figure 1B). All variable genes (n = 18458) were used for downstream analysis, i.e., PCA.

Analyze the Effect of Cell Cycle on Cell Clustering and Remove Cell Cycle-Related Genes
We used a set of 43 G1/S and 54 G2/M cell cycle-specific genes with known stage specificity to monitor, analyze, and quantify the cell cycle stages of each cell (Tirosh et al., 2016). The heterogeneity of cell cycle, especially the transition of mitotic cells between S-phase and G2/M-phase, drives a large number of transcriptome variations, thus concealing biological signals. We classified the cells according to the maximum average expression ("cycle fraction") of the two groups of cell cycle genes. When the G1/S and G2/M Cycle scores are less than 2, we consider these cells as non-proliferative cells. Otherwise, we think the cells are proliferative. After cell cycle analysis, we observed no deviation caused by cell cycle genes (Supplementary Figures 1E,F).

Principal Component Analysis
The PCA is a multivariate statistical method to investigate the correlation between multiple variables. It studies how to reveal the internal structure of multiple variables through a few principal components, that is, to derive a few principal components from the original variables, so that they retain as much information as possible about the original variables, and are not related to each other (Li et al., 2017). Generally, the mathematical treatment is to make the original indicators into a linear combination as a new comprehensive indicator. We use PCA with variable genes as input and recognize significant principal component (PC) based on jackstraw function. The first 20 pcs meet our requirements (Figure 2A), and all of them are selected for the next analysis.

Data Secondary Analysis After Reducing the Batch Effect
Since this data comes from the results of two previous studies, which are four different batches of lung tissue samples, to avoid the impact of the batch effect on the downstream analysis, we adopted a strategy to alleviate the batch effect, called "harmony" (Tran et al., 2020). R package harmony focuses on scalable integration of scRNA-seq data for batch correction and metaanalysis (Tran et al., 2020) (version 1.0) 4 . We examined the batch effect among four different lung samples (Supplementary  Figures 1D, 2C). At the resolution of 0.25, we use FindClusters function to cluster cells and divide them into 17 different cell types. Next, we use the FindAllMarkers function to find the differentially expressed genes between each type of cell (Supplementary Table 1).

Cell Type Markers
First, we use the SingleR package (version 1.1.14) 5 for cell type automatic annotation. Then, according to a widely used cell maker query website: CellMarker (Zhang et al., 2019) 6 , query the maker of the corresponding cell line, and query the maker in the previous research results to proofread (Montoro et al., 2018;Plasschaert et al., 2018;Zacharias et al., 2018;Vieira Braga et al., 2019). Finally, we annotated seven major lung tissue cells (Figures 2F,G).

Reconstruction of EPI Cell Differentiation Track With Monocle2
We analyzed epithelial cell fate determination and pseudo time trace by monocle2 R software package (Qiu et al., 2017) (version 2.12.0) 7 . Based on the expression matrix of the singlecell transcriptome, monocle2 can simulate the biological process of a cell population by unsupervised learning. We import the data containing 4240 epithelial cells into Monocle2. We used genes expressed in at least ten cells and more than 5% of the cells. Then, we use the threshold of cell nearest distance (delta) and local density (Rho) to determine the number of clusters (Supplementary Figures 4B,C). Then, we analyzed the differential gene expression. We used the first 1000 genes with the most significant differential expression as the sequencing gene set and carried out the dimension reduction trajectory analysis. 4 https://github.com/immunologics/harmony 5 http://www.bioconductor.org/packages/release/bioc/html/single.html/ 6 http://biocc.hrbmu.edu.cn/cellmarker/index.jsp/ 7 http://cole-trapnell-lab.github.io/monocle-release/

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
CZ, YH, and YZ: methodology. JY, TL, and ML: data acquisition. YH, JY, and JL: data analysis. YH, ZH, and GB: manuscript preparation. CZ, LX, QW, and WJ: management and direction. All authors contributed to the article and approved the submitted version.

ACKNOWLEDGMENTS
We thank Simon L. M. and Tsukui T. et al. for sharing the singlecell RNA sequencing data of normal lung tissue.