SGII: Systematic Identification of Essential lncRNAs in Mouse and Human Genome With lncRNA-Protein-Protein Heterogeneous Interaction Network

Long noncoding RNAs (lncRNAs) play important roles in a variety of biological processes. Knocking out or knocking down some lncRNA genes can lead to death or infertility. These lncRNAs are called essential lncRNAs. Identifying the essential lncRNA is of importance for complex disease diagnosis and treatments. However, experimental methods for identifying essential lncRNAs are always costly and time consuming. Therefore, computational methods can be considered as an alternative approach. We propose a method to identify essential lncRNAs by combining network centrality measures and lncRNA sequence information. By constructing a lncRNA-protein-protein interaction network, we measure the essentiality of lncRNAs from their role in the network and their sequence together. We name our method as the systematic gene importance index (SGII). As far as we can tell, this is the first attempt to identify essential lncRNAs by combining sequence and network information together. The results of our method indicated that essential lncRNAs have similar roles in the LPPI network as the essential coding genes in the PPI network. Another encouraging observation is that the network information can significantly boost the predictive performance of sequence-based method. All source code and dataset of SGII have been deposited in a GitHub repository (https://github.com/ninglolo/SGII).

Knocking out or knocking down some lncRNA genes can lead to death or infertility. These lncRNAs are called essential lncRNAs, which are of vital importance for survival and development. Identification of essential lncRNAs provides insight into the minimum requirements of normal cell functioning and normal organism development. Experimental methods have been applied to identify essential lncRNAs. Li et al. established the single lncRNA knockout mouse model, as well as the multiple lncRNA knockout mouse model (Li and Chang, 2014). By large-scale phenotypic analysis, they found that knocking out lncRNAs, such as Fendrr, Peril, and Mdgt, showed perinatal and postpartum lethality (Li and Chang, 2014). Watanabe et al. found that Dnm3os has an essential role in the normal growth and bone development of mice (Watanabe et al., 2008). Zhou et al. proposed that Meg3 deletion in female rats can result in skeletal muscle defect and perinatal death (Zhou et al., 2012). These studies provide helpful insights for identifying essential lncRNAs. However, experimental methods for identifying essential lncRNA genes are not always feasible due to many factors, which may also produce misleading results (Jathar et al., 2017). Therefore, computational approaches are considered as alternative ways.
Computing essentiality of a coding gene has been widely studied. Most of the existing methods define the essentiality measures based on the topological importance of a protein in protein-protein interaction networks. Various types of centralities have been introduced in this regard. For example, Jeong et al. found that hub nodes with high connections in the protein-protein interaction (PPI) network are often indispensable, which allows them to use the degree centrality (DC) to identify essential proteins (Jeong et al., 2001). Joey et al. introduced the betweenness centrality (BC) to measure the essentiality of proteins, as they found that PPI network is modularized (Joy et al., 2005). Wang et al. used eigenvector centrality (EC) to predict essential proteins, which measures the importance of nodes by calculating the connection with high index nodes in the network . Wuchty et al. found that closeness centrality (CC) measure using local information is useful in predicting essential proteins (Wuchty and Stadler, 2003). Many more methods have tried to incorporate different types of information in predicting essential proteins Wang et al., 2012;Zhong et al., 2013;Campos et al., 2019;Zhang et al., 2020;Liu et al., 2021). However, these centrality measures are not always working, due to incomplete protein-protein networks and frequent false-positives in the high-throughput experiments for identifying protein-protein interactions. Therefore, sequence-based methods were also considered. Zeng et al. defined the Gene Importance Calculator (GIC) score using only genomic sequence information (Zeng et al., 2018). The GIC score was derived from a logistic regression model. It can score not only coding genes but also noncoding genes.
As far as we can tell, the GIC score is the only available essentiality measure that can be applied on non-coding genes, including lncRNA genes (Zeng et al., 2018). However, the design of the GIC score ignored all information that is buried in the lncRNA-protein interactions (LPI). We believe that the LPI information has a similar role in identifying essential lncRNAs to that of PPI in identifying essential coding genes.
With the development of high-throughput experimental technologies, many databases have been established for noncoding genes and their interactions. The NPInter database provides a comprehensive archive of molecular interactions involving noncoding RNAs . NONCODE database is an integrated knowledge database dedicated to non-coding RNAs and their annotations . However, essential gene databases, like the DEG database, focus more on recording essential coding genes (Zhang et al., 2004). The essential non-coding genes are rarely recorded, particularly for complex organisms, like human and mouse. This is a primary challenge in developing a systematic method for measuring essentiality of non-coding genes.
By curating data from various literatures, as well as public databases, we established a dataset as the basis for developing a computational method to measure non-coding gene essentiality. In this work, we proposed the systematic gene importance index (SGII) by combining various centralities on the lncRNA-proteinprotein heterogeneous network and sequence-based essentiality scores. By comparing our measure to both network-based methods and sequence-based method, we found that network information can boost the sequence-based method significantly.

Dataset Curation
We downloaded human and mouse lncRNA-protein interactions from the NPInter database v4.0 . Selfinteractions and duplicates were removed. The mouse lncRNA-protein interaction network involves 33255 lncRNAs, 182 proteins, and 102051 interactions. The human lncRNAprotein interaction network contains 41589 lncRNAs, 3237 proteins, and 394895 interactions. We downloaded human and mouse protein-protein interaction data from BioGrid database version 4.4 (Oughtred et al., 2021). The mouse protein-protein interaction network includes 9744 proteins and 52342 interactions. The human protein-protein interaction network includes 19106 proteins and 644235 interactions.
We combine the lncRNA-protein interactions and proteinprotein interactions by matching the name of the proteins in both datasets, producing a heterogeneous network with two types of interactions. The mouse network was composed by 9845 proteins and 33255 lncRNAs with 102051 lncRNA-protein interactions and 52342 protein-protein interactions. The human network was composed by 19553 proteins and 41589 lncRNAs with 394895 lncRNA-protein interactions and 644235 proteinprotein interactions. The sequences of all lncRNAs in both human and mouse interaction networks were obtained from the NONCODE database version 5 .
According to literatures (Penny et al., 1996;Marahrens et al., 1997;Lee, 2000;Sado et al., 2001;Grote et al., 2013;Klattenhoff et al., 2013;Sauvageau et al., 2013;Yildirim et al., 2013;Zeng et al., 2018), eight mouse lncRNAs, including Xist, Gas5, Meg3, Tsix, Gt (ROSA) 26Sor, Dnm3os, Fendrr, and Braveheart, were identified as essential lncRNAs. The remaining 33247 lncRNAs in the mouse network were marked with unknown status. For human lncRNAs, we curated a set of lncRNAs that are reported to be essential in various conditions from literatures (Supplementary Table S1). This set contains 63 lncRNAs. The names of these lncRNAs and the conditions that they are reported to be essential, are listed in Supplementary Table S1, along with literatures of the original reports. In addition, 11 mouse lncRNAs, which are homologous of human essential lncRNAs, were also collected for validation purpose, as homologous usually have similar essentiality (Georgi et al., 2013).

Gene Importance Calculator
Gene Importance Calculator (GIC) (Zeng et al., 2018) is a useful essentiality indicator for both protein-coding genes and noncoding genes. It is based solely on sequence information. The GIC score (g) is defined as follows: where θ(p) is derived from a logistic regression model. θ(p) can be defined as where α 1 , α 2 , . . ., α 5 , β 0 , β 1 and β 2 are regression coefficients, L the length of RNA sequence, e the minimum free energy of RNA secondary structure, p the conditional probability that a gene is essential, and f i the occurrence frequency of a triplet in the sequence. The five types of triplets, which are considered in the GIC, are CGA, GCG, TCG, ACG and TCA (Zeng et al., 2018). When calculating the GIC score, we need to use the external program RNAfold (Lorenz et al., 2011), which requires a sequence length less than 20000 nt. Therefore, only 24450 mouse lncRNAs and 29481 human lncRNAs can be calculated for GIC. All other lncRNAs have lengths too long for the RNAfold to work.

Network Centralities
We formulate the heterogeneous graph as G = (V, E), where V is the set of all nodes, including lncRNAs and proteins, and E the set of all interactions, including lncRNA-protein and protein-protein interactions. Without losing generality, we note the number of all nodes as n. The network can be represented as an adjacency matrix A∈{0.1} n×n . The element on the ith row and the jth column of A can be denoted as a i,j . If a i,j = 1, the ith node and the jth node have interactions between them. If a i,j = 0, there is no interaction between the ith node and the jth node. Given a i,j , we can define four different centrality measures, including degree centrality (DC), betweenness centrality (BC), closeness centrality (CC), and eigenvector centrality (EC) for each node in the network.
The degree centrality of the ith node can be defined as follows: The betweenness centrality of the ith node can be defined as follows: where σ u,v is the number of shortest paths between the uth node and the vth node, and σ u,v (i) the number of shortest paths between the uth node and the vth node that pass the ith node. The closeness centrality of the ith node is defined as follows: where R(i) is the set of nodes that can reach the ith node, d i,j the length of the shortest path between the ith node and the jth node, and |.| cardinal operator of a set. The eigenvector centrality of the ith node is defined as follows: where x max (i) is the ith dimension of the normalized eigen vector x that corresponds to the largest eigen value of adjacency matrix A. Let λ max be the largest eigen value of A, the following relationships are satisfied in finding x: Ax λ max x, and (7) where ||.|| is the vector norm operator.

Systematic Gene Importance Index
Our network model contains two types of nodes, lncRNAs, and proteins. It also involves two types of interactions, the lncRNAprotein interactions and protein-protein interactions. Essentially, it is a lncRNA-protein-protein interaction (LPPI) heterogeneous network. Figure 1 illustrates a part of the LPPI network for human and mouse respectively. We propose the Systematic Gene Importance Index (SGII) as a comprehensive measure of gene essentiality, particularly for noncoding genes. SGII is a combination of the sequence-based GIC score and centrality measures, which have been elaborated as above.
For the ith node in the LPPI network, we compute its BC, CC, DC and EC, which can be noted as b i , c i , d i and e i , respectively. Its GIC score is noted as g i . We sort all nodes according to their BC, CC, DC, EC and GIC in a descending order, respectively. The rank of the ith node after sorting according to BC, CC, DC, EC and GIC can be noted as r b (i), r c (i), r d (i), r e (i) and r g (i), respectively.
Let s i be the degree of the ith node, which can be computed as follows: Given a threshold z, if s i ≥ z, the centrality measures will determine the essentiality of a gene directly. For convenience, we define the centrality-based essentiality indicator function for the ith node according to BC, CC, DC, and EC respectively as follows: where k is a rank threshold parameter. The ith node is identified as essential when is satisfied. If s i < z, we rely on the GIC score to determine the essentiality of a gene. Similarly, we can define the indicator function for GIC ranking, as follows: where t is another rank threshold parameter. The ith node is essential if is satisfied.
The whole flowchart of SGII is illustrated in Figure 2.

Performance Evaluation
In evaluating SGII, we use three statistics to describe its predictive performance. These statistics include sensitivity (s), false positive rate (r), and Fisher's exact test score (f), which are defined as follows: f −log 10 p where n t is the number of known essential lncRNAs that are identified as essential lncRNAs, n + the total number known essential lncRNAs, n f the number of lncRNAs with unknown essentiality that are identified as essential, nthe total number of lncRNAs with unknown essentiality and p the p-value of Fisher's exact test. Since SGII is a direct scoring method with manually configurable cutoff values, no training procedure is involved in the whole process. This is different to machine learning based methods. We cannot treat the above sensitivity and false positive rate as comparable to those in evaluating machine learning methods, as the knowledge of essential lncRNAs is too limited to perform any kind of cross-validations. This is also why we introduced the Fisher's exact test to further quantifying the quality of our results. It will measure how likely a result in whole is random or not. The bigger f value is, the results are less likely to be random.

Parameter Calibration
There are eight parameters in the GIC, which represent all the coefficients in the model built by GIC method. We took all the parameter values from literature (Zeng et al., 2018). The values for the mouse model are β 0 = 0.1625, β 1 = 2.638 × 10 -4 , β 2 = 2.194, Three parameters are introduced in combining centralities and GIC, which are noted as z, k and t. We first perform a grid search of k and t with a given value of z. The pairs of k and t, which maximize the score f, are recorded for every different z. These values are further sorted to find the best z, k and t combination. When performing the grid search on the mouse dataset, k = 1, 3, 5, 7, 9, and t = 1, 3, 5, 7, 9. When performing the grid search on the human dataset, k = 5, 10, 15, 20, 25 and t = 5, 10, 15, 20, 25. For both datasets, z = 5, 10, 15, 20. Finally, we set z = 15, k = 5, t = 9 for mouse dataset, and z = 5, k = 20, t = 5 for human dataset. All results for different parameters are provided in supplementary materials, as Supplementary Table S2.

Characters of the lncRNA-Protein-Protein Heterogeneous Network
We first explore the basic statistical characters of the LPPI network. We plot the degree distribution of the mouse and human network respectively in Figure 3. It is intuitively that the distribution of the degree follows the common power law distribution, which is similar to the PPI networks (Jeong et al., 2001). Since in the PPI network, essential proteins are usually rare and with high degrees, we assume that in our LPPI network, the essential lncRNAs have similar properties.
As we have mentioned in the method section, several lncRNAs with a length too long to calculate its secondary structure were not counted in our analysis. It becomes a question whether these lncRNAs have preferences to large or small amounts of interactions. We plot the degree distribution with and without those over-length lncRNAs for mouse and human datasets, respectively, in Figure 4. It is hard to find differences on the degree distributions. We therefore believe that, for a lncRNA, its length alone is not a major contributing factor to its interactions in the LPPI network. This also implied that the essentiality, which we believe to be associated with local network structure, has no direct relationship with the length of the lncRNA. These overlength lncRNAs were kept in the network as dummy nodes, which means we did not compute their essentiality at all, regardless of whether they have a degree over the threshold or not.
Integrating Centrality Measures and the GIC Score Figure 5 gives scatter plots of GIC pairing with each of the four types of centralities on human and mouse datasets, respectively. For the mouse dataset, the red dots, which represent essential lncRNAs, tend to appear in the top-right part of the plots, while the blue dots, which denote all other lncRNAs, spread much wider. Although the red dots are relatively rare, but their top-right FIGURE 2 | The flowchart of SGII. The method SGII consists of two parts. For lncRNAs whose degree is greater than or equal to z, four types of centralities were used to determine whether they were essential lncRNAs. For lncRNAs whose degree is less than z, GIC was used.
Frontiers in Genetics | www.frontiersin.org March 2022 | Volume 13 | Article 864564 preference is still observable. For human dataset, this preference is not intuitively obvious. This allows us to carry out further quantitative analysis on combining the centrality measures and the GIC scores. A primary challenge is that the number of known essential lncRNAs is too small for a machine learning algorithm to train on. In addition, some essential lncRNAs are only involved in a very limited number of interactions. For example, the Braveheart (Bvht) lncRNA, which is essential, has only one interaction record in the database. We think this may be due to the incomprehensive  knowledge of the lncRNA-protein interaction network. As the estimation of centrality measures highly rely on the interaction enrichment of a node in the network, when dealing with a lncRNA with limited number of interactions, we turn to rely on the GIC score.
With the settings in the method section, we combined four types of centrality measures and the GIC scores. On the mouse dataset, we identified 2284 essential lncRNAs from altogether 24450 lncRNAs. Among the 2284 lncRNAs, eight lncRNAs are known to be essential, accounting for 100% of all known essential lncRNAs, resulting a p-value = 5.73 × 10 -9 (Fisher's exact test). On the human dataset, we identified 5063 essential lncRNAs, from altogether 29481 lncRNAs, Among the 5063 essential lncRNAs, 41 lncRNAs are reported to be essential in various conditions in literatures, accounting for 65% of all curated essential lncRNAs (p-value = 3.59 × 10 -17 , Fisher's exact test). This result clearly indicates that our method is effective to identify essential lncRNAs.

Systematic Comparison Between Different Configurations of SGII
As SGII is the first attempt to combine the network information and sequence information to identify essential lncRNAs, we explore which kind of centrality measure is more capable to identify essential lncRNAs along with the GIC scores. We first plot the distribution density of different centralities and the GIC scores on mouse and human datasets respectively. As in Figure 6, BC and DC centrality measures along with the GIC scores appear to have much better separation than the CC and EC measures on the mouse dataset, while on the human dataset, only BC and DC present an intuitive separation.
However, considering the large differences on axis scale for essential lncRNAs and all lncRNAs, these intuitive observations may be misleading. Therefore, we performed a quantitative comparison using eight different conditions, GIC alone, GIC combined with each one of four types of centralities, GIC combined with BC and DC, GIC combined with CC and EC, and GIC combined with all four types of centralities. The parameters of all comparison are optimized as in method section ( Table 1).
The first observation on Table 1 is that the best combination of centrality measure and the GIC is not the combination of all four types of centralities. For the mouse dataset, the BC + DC + GIC method has the best significance level and lowest FPR value. For the human dataset, the BC + GIC method reaches the highest significance level. A second to the best significance level is obtained again by BC + DC + GIC method, with the highest sensitivity value. Therefore, we think that the BC + DC + GIC may be a better way to identify essential lncRNAs than the current configuration of SGII. This consists with the impression from Figure 6. However, due to the limited number of available data and current results, it is possible that this observation does not reflect a comprehensive scene of identifying essential lncRNAs. Therefore, we keep the configuration of SGII to combine all four kinds of centralities and the GIC score, for an unbiased way of identifying essential lncRNAs.

Comparative Analysis Between Human and Mouse Essential lncRNAs
At a closer look to Table 1, it appears that the Fisher's exact test reports much more significant results on both datasets when GIC is combined with centralities, which proves that integration of centrality measures and GIC is effective. Another observation is that SGII gives under-expected sensitivity values on the human dataset. However, the significance levels on the human dataset are generally way higher than that of the mouse dataset. This may be the results of two differences between the mouse and the human datasets. First, the human dataset is collected from literatures of lncRNAs in various conditions, including tumor cell line experiments. Essential lncRNAs, which are identified by one type of cell line experiments, may be different to those from the original essential gene definitions. As direct essential gene experiments on human are not feasible, the quality of the dataset is not comparable to the mouse dataset. This also applies to the coding gene data (Austin et al., 2004). Secondly, the number of essential lncRNAs in the human dataset is roughly eight times of that of mouse dataset. Since the computation process of the significance level is affected by the raw counts, it is anticipated that systematic differences on significance levels exist.
To further confirm the above explanations, we performed the following analysis. We find homologous genes of human essential lncRNAs in mouse. According to the studies in coding genes, these genes are likely to also produce essential lncRNAs (Georgi et al., 2013). Altogether 11 homologous genes in mouse were identified as lncRNA genes in the mouse LPPI network. We used SGII to test if we can identify these homolog essential lncRNAs (Table 2).
Obviously, sensitivity is dropping in comparison to the mouse essential lncRNAs. However, it should be noted that the FPR is also dropping, which indicates much less false positives. The significance levels remain almost the same as the mouse essential lncRNAs. Again, the BC + GIC method obtained the best significance level, while the BC + DC + GIC method obtained a second to the best significance level with the highest sensitivity. This result confirmed that the significance level difference between human and mouse dataset is largely caused by the raw counts of the dataset. It also suggests that the BC + GIC or BC + DC + GIC method may be a better choice than combining all types of centralities and the GIC score.
The importance of BC can be understood intuitively. If we think the cellular system as a system composed of molecules. The interactions between molecules transfer information. A high BC value indicated that the node is critical as an information hub in many shortest paths between other nodes. Therefore, dropping such nodes will easily break many information channels simultaneously, which will eventually destroy the whole system. That makes it an essential node in the network.
For the DC measure, the intrinsic mechanism is similar. The DC measure is directly associated to the degree of a node. If a node with many edges is dropped, it is more likely that the whole network collapses. This consists with the observations in coding genes. In addition, although some other kinds of centralities, like the NC (new centrality) , can identify essential coding genes better, it does not work well in non-coding genes. This is an expected result. For NC to work in the LPPI network, it requires that dense interactions exist among the proteins that interacting the same lncRNAs. However, we did not observe this phenomenon in our dataset. The NC is difficult to be estimated for many lncRNAs, due to lacking such kind of interactions.

Functional Analysis of Essential lncRNA in the Mouse Genome
We took the essential lncRNA gene in mouse genome for functional analysis. For every lncRNA that was predicted as essential in mouse genome, we first map this lncRNA to the Ensembl database (Howe et al., 2021) using either gene name or sequence information. The mapped genes are then uploaded to the Gene Ontology online system for functional enrichment analysis. The top three enrichment of functions are "nucleic acid binding" (GO:0003676), "heterocyclic compound binding" (GO:1901363) and "organic cyclic compound binding" (GO:0097159). As we have mentioned, this is expected for lncRNAs. They realize their functions through bindings with other molecules.

CONCLUSION
SGII is the first attempt to combine lncRNA-protein interactions and lncRNA sequence information for identifying essential noncoding RNAs. Since the study on collecting and identifying essential coding genes has been performed for over a decade, it is time to step forward to the essentiality of non-coding genes, as non-coding genes are much more common than coding genes in mouse and human genomes. Due to the limited number of known essential lncRNAs, SGII does not use conventional machine learning algorithms, but applies simple scoring schemes and statistical tests. By combining BC, CC, DC, EC and GIC scores, SGII achieved a better performance than using only sequence information. Since the knowledge for constructing LPPI network may be incomprehensive, we applied the centrality measures only on those lncRNAs with enough interactions. For those lncRNAs with limited number of interactions, we turned to rely on its sequence to score the essentiality.
The results support our assumption that essential lncRNAs have similar roles as essential coding genes in the LPPI network. Particularly, we found that BC appears to be more important than other kinds of centrality measures. Due to the limited number of known essential lncRNAs, it is not feasible to explore further optimization of different weight on different centralities. When more essential lncRNAs are reported and recorded, we believe that modern machine learning algorithms will provide deeper insights in identifying essential non-coding genes. As a summary, we listed the prediction results of SGII on mouse and human datasets in Supplementary Table S3 in supplementary materials, which may be useful for life science studies. A more comprehensive collection of essential lncRNAs is being curated. We plan to establish a database that is dedicated in recording essential lncRNA information in future.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
X-HX collected the data, implemented the algorithm, perform the experiments, analyzed the results, and partially wrote the manuscript; Y-YZ helped in designing the algorithm and analyzed the results; C-QG analyzed the results and partially wrote the manuscript; HM partially analyzed the results; LW and P-FD directed the whole study, conceptualize the algorithm, supervised the experiments, analyzed the results, and wrote the manuscript.

FUNDING
This work was supported by National Natural Science Foundation of China (NSFC 61872268) and National Key R and D Program of China (2018YFC0910405).