Single-Cell Transcriptome Analysis in Melanoma Using Network Embedding

Wang, Liming; Liu, Fangfang; Du, Longting; Qin, Guimin

doi:10.3389/fgene.2021.700036

ORIGINAL RESEARCH article

Front. Genet., 05 July 2021

Sec. RNA

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.700036

This article is part of the Research TopicArtificial Intelligence for Extracting Phenotypic Features and Disease Subtyping Applied to Single-Cell Sequencing DataView all 8 articles

Single-Cell Transcriptome Analysis in Melanoma Using Network Embedding

School of Computer Science and Technology, Xidian University, Xi’an, China

Single-cell sequencing technology provides insights into the pathology of complex diseases like cancer. Here, we proposed a novel computational framework to explore the molecular mechanisms of cancer called melanoma. We first constructed a disease-specific cell–cell interaction network after data preprocessing and dimensionality reduction. Second, the features of cells in the cell–cell interaction network were learned by node2vec which is a graph embedding technology proposed previously. Then, consensus clusters were identified by considering different clustering algorithms. Finally, cell markers and cancer-related genes were further analyzed by integrating gene regulation pairs. We exploited our model on two independent datasets, which showed interesting results that the differences between clusters obtained by consensus clustering based on network embedding (CCNE) were observed obviously through visualization. For the KEGG pathway analysis of clusters, we found that all clusters are extremely related to MicroRNAs in cancer and HTLV-I infection, and the hub genes in cluster specific regulatory networks, i.e., ETS1, TP53, E2F1, and GATA3 are highly associated with melanoma. Furthermore, our method can also be extended to other scRNA-seq data.

Introduction

Melanoma is a malignant tumor that develops from melanocytes and is a complex multifactorial disease caused by the interaction between genetic susceptibility factors and environmental exposure (Rastrelli et al., 2014; Situm et al., 2014). In the past, many studies have been focused on the molecular mechanisms of melanoma, demonstrating that PI3K-Akt and MEK-ERK signaling pathways’ hyperactivation is highly correlated with the malignant transformation and progression of melanoma (Wei et al., 2019). Significant progress has been made in targeted therapies which aim to dampen these two signaling pathways, but melanoma is becoming increasingly resistant to these therapies (Johnson and Sosman, 2015; Luke et al., 2017). Understanding the pathogenesis of melanoma may overcome this obstacle.

There have been many significant studies based on bulk RNA-seq data of melanoma. Kunz et al. (2018) analyzed two transcriptomic types of melanocytic nevi and primary melanomas, and identified genetic characteristics and mechanisms of early and late stages of melanoma. By integrating transcriptomic and structural genomic data, Berger et al. (2010) identified 11 novel melanoma gene fusions and 12 novel readthrough transcripts resulting from basic genome rearrangements. But some limitations exist when only bulk RNA-seq data are available for the molecular expression level. The single-cell sequencing technology has brought new insights into complex biological phenomena. In particular, genome-wide single-cell measurements such as transcriptome sequencing can characterize cell composition and functional variation in homogeneous cell populations (Tang et al., 2009; Kolodziejczyk et al., 2015; Jaakkola et al., 2017). However, it is challenging to explore single-cell sequencing data effectively because of noise and dropout (Brennecke et al., 2013; Horvath et al., 2019; Bai et al., 2020). Researchers have begun to interpret the functional status of cancer cells at the single-cell level. Various computational methods have been used in cancer research and have helped the discovery of cancer development, metastasis, treatment resistance, and tumor microenvironment (Ren et al., 2018; Zhang et al., 2020). However, as far as we know, few studies focusing on molecular mechanisms of melanoma at the single-cell level (Fattore et al., 2019; Durante et al., 2020).

On the other hand, network embedding is becoming a powerful way of representing the features of nodes in complex networks, and is widely used in various fields, including medicine, biology, sociology, and finance. Several state-of-art network embedding algorithms are proposed to help accomplish analysis tasks of complex networks, e.g., DeepWalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016). Network embedding is extremely useful for highly sparse single-cell sequencing data.

In this study, we applied network embedding to represent cell features, identified consensus clusters by consensus clustering, and investigated gene markers for melanoma. In the following section, we introduced the materials and methods in our model and then showed and analyzed the results. Finally, we made conclusions for our study.

Materials and Methods

Data Sources and Preprocessing

GSE72056 is an scRNA-seq data from Gene Expression Omnibus¹ (Tirosh et al., 2016), which involves 19 melanoma patients, 23,686 genes, and 4,645 cells. After QC with Scater (McCarthy et al., 2017) and the batch effects elimination with Limma (Ritchie et al., 2015), we obtained 4,630 cells and 22,105 genes, respectively. Among the 4,630 cells, there are 1,346 malignant cells and 3,284 non-malignant cells. In our study, we only selected the malignant cells for further study. Then we performed feature selection with M3drop (Andrews and Hemberg, 2019) and obtained 6,786 genes eventually.

Another scRNA-seq data EXP0072 related to melanoma were downloaded from CancerSEA² (Yuan et al., 2019), which was originally from GSE81383 (Gerber et al., 2017). EXP0072 contains the gene expression level of 307 cells and 18,938 genes. After QC, the batch effects elimination, and feature selection, there are 297 cells and 4,566 genes, respectively.

Gene regulation pairs were collected from the HTRIdb (Bovolenta et al., 2012) and TRRUST v2 (Han et al., 2018) databases. HTRIdb is an open-access TF-target gene interaction database that can be downloaded via a user-friendly web interface (Bovolenta et al., 2012). TRRUST is a TF-target interaction database for humans, and TRRUST v2 has a significant improvement from the first version of TRRUST, including a significantly increased size and a web interface (Han et al., 2018). There are 51,871 and 8,427 regulation pairs we collected from HTRIdb and TRRUST v2, respectively. The integrated regulation pairs are used as the source of regulation pairs.

Consensus Clustering Based on Network Embedding (CCNE)

After preprocessing scRNA-seq data, we analyzed the data based on a network embedding model according to the following steps. It is worth noting that we log-transformed the data in QC of preprocess.

Step 1: Cell–cell interaction network construction

We used Euclidean distance to measure the interaction between cells because it treats the differences in different dimensions equally, and chose the average distance as the threshold to construct the cell–cell interaction network. Euclidean distance between two cells c1 and c2 is defined as follows:

d i s t (X, Y) = \sqrt{\sum_{i = 1}^{n} {(X_{i} - Y_{i})}^{2}} (1)

where X and Y are gene expression level vectors of c1 and c2, respectively, and n is the number of genes.

Step 2: Network embedding

Network embedding is a powerful measure in representing complex networks. Among these state-of-art network embedding algorithms, node2vec has been successfully used in many applications. Given any graph, it can learn the feature representation of nodes, which can then be used in various downstream machine learning tasks. In this algorithm, the parameters p and q can be flexibly changed to adjust the random walk strategy, which is helpful for the biased collection of node information (Grover and Leskovec, 2016).

Step 3: Consensus clustering

As there are various clustering algorithms, in this study, we tried to apply K-means (Macqueen, 1967), Gaussian mixture model (GMM) (Guorong et al., 2001), spectral clustering (Bach and Jordan, 2003), hierarchical clustering (Johnson, 1967), and Birch algorithm (Peng et al., 2018) to cluster the embedding vectors, and used the Silhouette Coefficients (Zhou and Gao, 2014) and Caliński–Harabaz score (Caliński and Harabasz et al., 1974) to evaluate the identified clusters. We determined the number of clusters k when the Silhouette Coefficient is optimal. The Silhouette Coefficients is a value from −1 to 1, and the larger the value, the better the result. So we selected the maximum Silhouette Coefficients value. Eventually, we chose three best algorithms for our data and constructed consensus clusters. Consensus clustering is defined as follows: we took the intersection of the clusters obtained from the three algorithms, and selected the result with the largest number of cells as the consensus clusters.

Gene Regulatory Analysis

We selected the genes from each consensus clusters that expressed in more than 70% of cells; 70% of the gene expression is more representative, and the number of genes obtained by 70% is more appropriate, which is conducive to obtaining significant regulation pairs. Then we constructed the cluster specific regulatory networks according to the following procedure. First, we retained the regulation pairs in which the targets belong to the gene sets. Second, for each regulation pair, we calculated the similarity of the TF and its target. The similarity of gene regulation pair is defined as:

s i m (g_{1}, g_{2}) = 1 - \frac{dist (g_{1}, g_{2}) - \min (d i s t)}{max (d i s t) - min (d i s t)} (2)

where dist(g₁,g₂) represents the Euclidean distance of the regulation pair (g₁,g₂), and max(dist) and min(dist) represent the maximum distance and the minimum distance in all regulation pairs, respectively. The larger the value, the stronger the correlation between the genes.

Then we identified the feed-forward loops (FFLs) (Jin, 2013) from these gene regulation pairs and constructed cluster specific gene regulatory networks by Cytoscape_v3.6.1 (Shannon et al., 2003). We took the top three genes with the highest degree from the specific gene regulatory network as hub genes.

Results

We analyzed scRNA-seq data according to the pipeline shown in Figure 1. We used existing methods to explore the molecular mechanisms of melanoma by this novel pipeline. We first preprocessed the scRNA-seq data and constructed a cell–cell interaction network based on Euclidean distance. And then we represented the cells using the network embedding algorithm node2vec and identified consensus clusters by consensus clustering. Finally, we constructed and analyzed cluster specific gene regulatory networks by integrating the expressed genes in scRNA-seq data and regulation pairs.

FIGURE 1

Figure 1. The pipeline of the study.

Cell–Cell Interaction Network Construction and Network Embedding

We calculated the Euclidean distance between every pair of cells in GSE72056 and filtered the pairs with the distance less than the mean distance. As a result, we obtained a cell–cell interaction network with 1,312 nodes (cells) and 444,382 edges after discarding 34 isolated cells.

Then we applied node2vec to represent each cell as a vector in low-dimensional space. We selected the Silhouette Coefficients as the evaluation index and performed different hyperparameters from the set {0.25, 0.5, 1, 2, 4}. The hyperparameter with the maximum Silhouette Coefficients value is chosen. Finally, we chose p = 4 and q = 0.5 (Grover and Leskovec, 2016). As a result, we obtained the embedding vectors of the cells with 128 dimensions.

Consensus Clustering and Clusters Identification

We applied the five clustering algorithms mentioned in Section “Materials and Methods” to cluster the embedding vectors (Table 1) and used the Silhouette Coefficient and Caliński–Harabaz score to evaluate the results (Figure 2). Among the five algorithms, Hierarchical clustering, GMM and K-means are significantly better than the other two methods according to the evaluation criteria, so we selected these three algorithms to obtain the consensus clusters. Finally, we obtained six clusters with 82, 120, 238, 222, 162, and 239 cells, respectively. It is worth noting that in the process of consensus clustering the cells belonging to different clusters resulted from different algorithms were discarded. We visualized the clusters using t-SNE (van der Maaten and Hinton, 2008), as shown in Figure 3.

TABLE 1

Table 1. The number of cells in each cluster by different algorithms.

FIGURE 2

Figure 2. Clusters evaluation by Silhouette Coefficient (left) and Caliński–Harabaz score (right).

FIGURE 3

Figure 3. Visualization of clusters using T-SNE.

Gene Regulatory Network Construction and Analysis

We filtered the genes that expressed in less than 70% of the cells in clusters, and identified 601, 729, 480, 744, 527, and 597 for the six clusters, respectively. Then for each cluster, we built a cluster specific regulatory network with the genes and discovered the FFLs. The information about the gene regulatory networks is shown in Table 2 and Figure 4.

TABLE 2

Table 2. The number of regulation pairs in the cluster specific gene regulatory networks.

FIGURE 4

Figure 4. Cluster specific gene regulatory networks of (A–F) 6 clusters.

We further analyzed the cluster specific gene regulatory networks. We found that at least three genes existed in each cluster are known melanoma-related genes. Among them, ATM, TP53, and FOXO3 belonged to each cluster. ATM serves as a regulator of a variety of downstream proteins, including tumor suppressor proteins (Sanders et al., 2020). FOXO3 may well induce apoptosis in melanoma cells by the expression of requisite genes (Segura et al., 2009). We examined the degree distributions of both TF and genes in the six networks, which indicated that all of the gene regulatory networks were scale-free. Generally, TFs had significantly higher degrees than target genes in each network, indicated that there is a complex combination of TF coordinated regulation and gene multiplicity. There is a negative correlation between the topological coefficients and degrees of TF and genes, which shows that hub genes are the only shared neighbor of nodes with fewer links. ETS1 and GATA3 are hub genes in all of the six clusters, and E2F1 in five of the six clusters, and TP53 in one cluster.

These hub genes are closely related to cancer, and are candidate biomarkers for melanoma. TP53 is a known tumor suppressor and relates to melanoma (Shain et al., 2015). ETS1 encodes a member of the ETS transcription factor family, and these proteins act as transcriptional activators or repressors of many genes which are involved in stem cell development, cell senescence and death, and tumorigenesis (Gluck et al., 2019). GATA3 belongs to the GATA family of transcription factors and is an important regulator of T cell development. It also plays an important role in endothelial cell biology (Terra et al., 2021). E2F1 is a member of the E2F transcription factor family. The E2F family is very important for controlling the cell cycle and inhibiting tumor proteins, and it is also the target of small DNA tumor virus transforming proteins (Murphy et al., 2021; Rocca et al., 2019).

We performed Kaplan–Meier survival analysis (Goldman et al., 2020) to analyze the expression level of these genes in melanoma. Figure 5 shows that melanoma patients with lower gene expression levels of ETS1, TP53, and E2F1, and higher gene expression level of GATA3 have a higher survival probability.

FIGURE 5

Figure 5. Kaplan–Meier survival curves of (A) TP53, (B) ETS1, (C) GATA3, and (D) E2F1.

We performed gene sets GO and pathway enrichment analysis using DAVID (DAVID Functional Annotation Bioinformatics Microarray Analysis³) (Huang et al., 2008; Huang da et al., 2009).

For GO enrichment analysis, we selected the top 10 items (PValue < 0.05) (Guo et al., 2020) for each cluster, and found that all clusters were associated with protein binding, nuclear chromatin, nucleoplasm and regulation of the apoptotic process. Cluster 1 and cluster 2 are more related to sequence-specific DNA binding and chromatin binding. Cluster 4 and cluster 6 are more associated with DNA binding, and cluster 5 and cluster 6 are relevant nuclear chromosome, telomeric region, and transcription factor binding. Also, cluster 2 is related to heart development, and cluster 3 is related to negative transcription regulation and DNA-templated.

As shown in Figure 6, for KEGG pathway analysis, we chose all results with PValue < 0.05. From that, we got that all clusters are associated with MicroRNAs in cancer and HTLV-I infection. Cluster 1 and cluster 2 are more related to pathways in cancer, basal cell carcinoma and apoptosis, cluster 3 and cluster 5 are more related to FoxO signaling pathway, and cluster 6 is strongly associated with neurotrophic signaling pathway.

FIGURE 6

Figure 6. KEGG pathway analysis for (A–F) six clusters.

Experiments on an Independent Dataset

We analyzed an independent dataset EXP0072 to verify our approach. We expressed the embedding vectors of 307 cells with 128 dimensions, which is the same as our previous experiment. The identified three clusters are shown in Supplementary Table 1 and Supplementary Figure 1. By consensus clustering, the numbers of cells were 71, 174, and 52 of the consensus clusters.

The numbers of genes that expressed in less than 70% of the cells in clusters were 220, 255, and 176, respectively. Then we built cluster specific gene regulatory networks accordingly. The information of cluster specific gene regulatory networks is shown in Supplementary Table 2 and Supplementary Figure 2.

Finally, we selected the top three genes with higher degrees (hub genes) for Kaplan–Meier survival analysis (Goldman et al., 2020) to test their functions in melanoma, which are shown in Supplementary Figure 3. These genes have a greater impact on the survival rate of melanoma. E2F1 was the hub gene in both datasets. ESR1 encodes an estrogen receptor and ligand-activated transcription factor. The protein encoded by this gene regulates the transcription of many estrogen-inducible genes and is expressed in many non-reproductive tissues. NF-kappa-B is a transcription factor formed by the combination of NFKB1 and RELA, which participates in many biological processes. E2F4 encoded by this gene is a member of the E2F family of transcription factors. The E2F family plays a crucial role in the control of cell cycle and action of tumor suppressor proteins and is also a target of the transforming proteins of small DNA tumor viruses. MYC is a proto-oncogene and encodes a nuclear phosphoprotein that plays a role in cell cycle progression, apoptosis, and cellular transformation. Amplification of this gene is frequently observed in numerous human cancers (Murphy et al., 2021).

From the experiments, we found that the size of the dataset has impact on the computational cost, but not affects the network embedding, clustering, or quality of the reconstructed GRNs.

Conclusion

We investigated the pathology of melanoma via scRNA-seq, which revealed the significant impact of hub genes in the development of melanoma. By constructing the cell–cell interaction network and obtaining the consensus clusters, we found that the differences between clusters are obvious. Through our further analysis of each cluster, we found the hub genes ETS1, TP53, E2F1, and GATA3 are related to melanoma. At the same time, in order to verify the scalability of the method, we analyzed an independent melanoma dataset that the clusters were highly consistent, and hub genes have a great impact on melanoma.

Data Availability Statement

The data underlying this article are available in Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) and CancerSEA at http://biocc.hrbmu.edu.cn/CancerSEA/goDownload.

Author Contributions

GQ, FL, and LD conceived and developed the framework for consensus clusters identification and wrote this manuscript. LW and GQ provided important feedback in the framework process and edited the manuscript. All authors made significant contributions to the completion and writing of this manuscript, and read and approved the final manuscript.

Funding

This study was supported by the Natural Science Foundation of Shaanxi Province (No. 2017JM6038) and the National Key Research and Development Program of China (2018YFC0116500).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We thank Yuhan Yang and Xiaoyang Xu for helping revise the manuscript.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.700036/full#supplementary-material

Footnotes

References

Andrews, T. S., and Hemberg, M. (2019). M3Drop: dropout-based feature selection for scRNASeq. Bioinformatics 35, 2865–2867.