EXP2SL: A Machine Learning Framework for Cell-Line-Specific Synthetic Lethality Prediction

Synthetic lethality (SL), an important type of genetic interaction, can provide useful insight into the target identification process for the development of anticancer therapeutics. Although several well-established SL gene pairs have been verified to be conserved in humans, most SL interactions remain cell-line specific. Here, we demonstrated that the cell-line-specific gene expression profiles derived from the shRNA perturbation experiments performed in the LINCS L1000 project can provide useful features for predicting SL interactions in humans. In this paper, we developed a semi-supervised neural network-based method called EXP2SL to accurately identify SL interactions from the L1000 gene expression profiles. Through a systematic evaluation on the SL datasets of three different cell lines, we demonstrated that our model achieved better performance than the baseline methods and verified the effectiveness of using the L1000 gene expression features and the semi-supervised training technique in SL prediction.


INTRODUCTION
Two genes are considered a synthetic lethal (SL) pair if perturbation of both genes induces a defect in cell viability, while perturbation of either gene alone is not harmful to cell survival (Boone et al., 2007). Different types of perturbations were considered to trigger SL in previous studies, including knockdown, knockout, mutation, aberrant gene expression, copy number variation, and drug treatment (Whitehurst et al., 2007; Jerby-Arnon et al., 2014; Han et al., 2017; Sinha et al., 2017). Studying synthetic lethal interactions may help gain novel insights into target identification. Many cancer cells carry specific mutations in one gene (e.g., a tumor suppressor gene) of a synthetic lethal pair, and thus its synthetic lethal partner becomes a promising drug target (O'Neil et al., 2017). For example, the known synthetic lethal interactions between the tumor suppressor gene BRCA1/2 and the drug target gene PARP1 can be used to selectively kill cancer cells by triggering fatal DNA damage (Bryant et al., 2005; Farmer et al., 2005). To this end, PARP1 inhibitors have been approved to treat certain types of BRCA-mutated cancers (Fong et al., 2009). SL gene pairs can be experimentally screened by developing double-knockout strains in model organisms and human cell lines. The synthetic lethality network in yeast has been well constructed using synthetic genetic arrays (SGA) (Tong et al., 2001) and diploid synthetic lethality analysis with microarrays (dSLAM) (Pan et al., 2007). Nearly one million gene pairs covering 90% of the whole yeast genome were screened in a recent study (Costanzo et al., 2016). Compared to yeast strains, which can undergo sexual reproduction to generate double-knockout offspring from parents bearing different single knockouts, it is more challenging to develop double-knockout human cell lines in an efficient manner. Thus, only a relatively small number of human gene pairs (hundreds to thousands) can be screened by RNA interference (Whitehurst et al., 2007; Barbie et al., 2009) and CRISPR-Cas9 (Shen et al., 2017; Han et al., 2017) based double-knockout experiments. Due to the difficulty of establishing large-scale double-knockout systems in human cell lines, the currently screened gene pairs only account for a small fraction of all possible combinations of human genes.
To overcome the current difficulty in experimental screening and identify more SL interactions in human, computational methods have recently been proposed to predict novel human SL pairs. The most direct idea is to leverage the abundant SL pairs characterized in yeast to infer human SLs through ortholog mapping (Deshpande et al., 2013; Wu et al., 2013; Srivas et al., 2016). The application of these methods was limited, as a large number of human genes do not have evolutionarily close yeast orthologs. Network-based methods predict human SLs through analyzing the protein-protein interaction (PPI) networks, metabolic networks, or signaling pathways (Folger et al., 2011; Kranthi et al., 2013; Zhang et al., 2015; Apaolaza et al., 2017). Statistical methods were also developed to identify SL gene pairs from human cancer cells based on the principle that the perturbations (e.g., mutation, aberrant gene expression, and copy number variation) of both SL genes should be subject to negative selection and exhibit a mutually exclusive pattern (Jerby-Arnon et al., 2014; Srihari et al., 2015; Jacunski et al., 2015; Sinha et al., 2017; Lee et al., 2018). Besides, there exist several machine-learning-based approaches for predicting SL gene pairs. Most of these approaches learn from the abundant supervised information available for yeast (Wong et al., 2004; Pandey et al., 2010; Li et al., 2011). Only a few machine learning methods for predicting human SLs were developed. For example, Das et al. used a Random Forest classifier with multi-omics features (e.g., differential expression, expression correlation, mutual exclusivity and shared pathways) to predict SL pairs in human cancer; and Liu et al. proposed a logistic matrix factorization model regularized by the PPI similarity network and the gene ontology (GO) semantic similarity network to predict SL pairs (Liu et al., 2019).
Although a number of SL interactions are conserved in humans, most of them are only observed in specific cell lines or tissues (Ryan et al., 2018). A recent study detected SL pairs in three cell lines and found that only about 10% of SL interactions were shared by two cell lines, and no SL pair was identified in all three cell lines (Shen et al., 2017). Despite the extensive applications of the above computational methods in SL prediction, most of them make predictions for the human genetic network without considering the cell line or tissue context. Although one of the aforementioned methods can predict SL in different human cancer types, it is difficult to directly apply this method to cell lines, as the homogeneous genetic background of cell lines cannot provide enough mutation-related omics data. To provide a feasible tool for capturing the unique SL interaction networks for individual cell types, we aim to develop a computational method that learns from the experimentally measured SL interactions while considering the cell-line specific genetic information.
In this paper, we have proposed a novel computational method, EXP2SL, to predict cell-line specific SL interactions in human. The cell-line specific gene expression profiles resulting from the shRNA knockdown experiments in the LINCS L1000 project (Subramanian et al., 2017) were used to capture the information of cell-line specific genetic background. Since the available labeled data in single cell lines are limited, a semi-supervised objective function is used to exploit the large amount of unlabeled data. Tested on the combinatorial CRISPR-Cas9 perturbation-based SL datasets in three different cell lines, our model showed competitive prediction ability compared to the baseline methods. We also verified the effectiveness of the features derived from the L1000 gene expression profiles and the semi-supervised objective function. Furthermore, we evaluated the importance of each gene included in the L1000 gene expression profiles and found that the cell viability related functions were enriched among the top attributing genes.

Data Processing
The L1000 Gene Expression Profiles
The LINCS L1000 project (Subramanian et al., 2017) measured the expression levels of 978 landmark genes under different perturbations (i.e., shRNA or compounds) and control conditions (i.e., empty vectors or solvents) in different human cell lines. Here, we used the gene expression profiles resulting from shRNA perturbations to construct the features of the corresponding shRNA target genes, which were 978-dimensional vectors.
Specifically, the raw data from the LINCS L1000 project were preprocessed based on the pipeline in the original paper (Subramanian et al., 2017) with minor modifications. We first directly obtained the Level 3 data from L1000, which contained the quantile-normalized gene expression profiles. The shRNA profiles measured 96 hours after perturbation were used, as this time point had the largest amount of data. Based on this dataset, we calculated the z-score for each dimension of an shRNA-perturbed profile $x \in \mathbb{R}^{978}$ by

$$z = \frac{x - \mathrm{median}(V)}{1.4826 \times \mathrm{MAD}(V)},$$

where $z$ is the 978-dimensional z-score vector of the shRNA perturbation profile $x$, $V$ is the set of vector control profiles from the same plate, $\mathrm{median}(V)$ and $\mathrm{MAD}(V)$ stand for the median value and the median absolute deviation of $V$, and 1.4826 is a scaling factor that makes the resulting z-scores close to a normal distribution.
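The z-score computation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the original preprocessing code: the function name `robust_zscore` and the `min_mad` guard against a zero MAD are assumptions introduced here, and the median and MAD are taken per gene across the control profiles.

```python
import statistics

def robust_zscore(profile, controls, min_mad=1e-6):
    """Per-gene robust z-scores of one perturbed profile.

    profile  : list of expression values, one per landmark gene
    controls : list of vector-control profiles from the same plate
    min_mad  : hypothetical floor that avoids division by zero
    """
    z = []
    for g, x in enumerate(profile):
        ctrl = [c[g] for c in controls]
        med = statistics.median(ctrl)
        # Median absolute deviation of the control values for this gene.
        mad = statistics.median(abs(v - med) for v in ctrl)
        z.append((x - med) / (1.4826 * max(mad, min_mad)))
    return z

# Toy example with a single gene and five control wells.
toy_controls = [[1.0], [2.0], [3.0], [4.0], [5.0]]
toy_z = robust_zscore([6.0], toy_controls)
```

In the toy example the control median is 3 and the MAD is 1, so the perturbed value 6 is mapped to 3 / 1.4826.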
Notably, in the original L1000 preprocessing pipeline (Subramanian et al., 2017), the control profiles were replaced by all the profiles on the plate, called population control. Here, we argue that this data preprocessing scheme may cause a biased control distribution due to the specific perturbation design. Thus, we use the expression levels measured under empty-vector treatment as the control for the shRNA-perturbed profiles. For each gene, typically more than one type of shRNA was designed to knock down the expression of the corresponding gene product. To reduce the off-target effects of shRNAs and obtain a robust signature for each single gene, the z-scores obtained from the replicated trials of the same shRNA were first combined using the algorithm used to generate the L1000 Level 5 data (Subramanian et al., 2017), and then the same protocol was used to combine the shRNAs targeting the same gene. More specifically, the z-scores were weighted and averaged according to the Spearman correlations to obtain a final 978-dimensional L1000 gene expression profile for each gene, which was then used as the input gene features for our model and the baseline models.
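A correlation-weighted average of replicate profiles could look roughly as follows. This is a sketch under stated assumptions, not the L1000 reference implementation: each profile is weighted by its mean Spearman correlation with the other profiles, weights are clipped at a hypothetical floor `min_weight`, and all function names are illustrative. Constant profiles (zero variance) are not handled.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(a), ranks(b))

def weighted_signature(profiles, min_weight=0.01):
    """Collapse several z-score profiles into one weighted average."""
    n = len(profiles)
    if n == 1:
        return list(profiles[0])
    weights = []
    for i in range(n):
        corr = [spearman(profiles[i], profiles[j]) for j in range(n) if j != i]
        weights.append(max(sum(corr) / len(corr), min_weight))
    total = sum(weights)
    dim = len(profiles[0])
    return [sum(w * p[g] for w, p in zip(weights, profiles)) / total
            for g in range(dim)]

signature = weighted_signature([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```

The same function can be applied twice, first across replicates of one shRNA and then across the shRNAs that target one gene, which mirrors the two-stage procedure described above.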

SL Labels
The SL labels in our datasets were constructed from the CRISPR double-knockout experiments performed in human cell lines (Shen et al., 2017;Zhao et al., 2018;Najm et al., 2018). A recently proposed computational approach called GEMINI (Zamanighomi et al., 2019) was used to identify SL interactions from the combinatorial CRISPR perturbation based cell viability studies. We adopted the GEMINI scores to select the positive and negative SL pairs for constructing our datasets. In particular, for each cell line, positive SL pairs were selected from gene pairs satisfying two criteria: 1) GEMINI "strong" scores larger than zero, which indicates the existence of the synergic lethal effect, and 2) GEMINI "strong" scores ranking among top 5%, to reduce the potential false positives. The main reason for choosing this threshold is that the top 5% gene pairs were considered as "the most significant hits in each screen" in the GEMINI paper (Zamanighomi et al., 2019). To more thoroughly evaluate the performance of our method, we also tested another threshold (i.e., 10%) for choosing the positive SL pairs (Tables S1-S2). Negative SL pairs were those gene pairs satisfying 1) a GEMINI "strong" score less than zero, which means that there exists no synergic lethal effect between these two genes, and 2) a GEMINI "strong" score among the bottom 50%, to remove the potential false negatives. The gene pairs that were not selected as positive or negative SL pairs were considered as unknown pairs. Finally, cell lines with adequate numbers (>100) of gene pairs with both SL labels and L1000 gene expression profiles, including A549, A375, and HT29, were used in our study. The numbers of training samples for the cell lines are summarized in Table 1.
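The labeling rule can be illustrated with a short sketch. The function name `label_pairs`, the dictionary input, and the way rank fractions are rounded are assumptions made for this example; only the sign and percentile criteria come from the text.

```python
def label_pairs(scores, top_frac=0.05, bottom_frac=0.50):
    """Assign 'positive', 'negative', or 'unknown' to each gene pair.

    scores : dict mapping a gene pair to its GEMINI "strong" score
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    top = set(ranked[: int(n * top_frac)])
    bottom = set(ranked[n - int(n * bottom_frac):])
    labels = {}
    for pair, s in scores.items():
        if s > 0 and pair in top:
            labels[pair] = "positive"      # score > 0 and within the top 5%
        elif s < 0 and pair in bottom:
            labels[pair] = "negative"      # score < 0 and within the bottom 50%
        else:
            labels[pair] = "unknown"
    return labels

# Toy scores 10, 9, ..., -9 for 20 hypothetical gene pairs.
toy_scores = {("geneA", f"gene{i}"): 10.0 - i for i in range(20)}
toy_labels = label_pairs(toy_scores)
```

Passing `top_frac=0.10` corresponds to the less strict threshold that was also evaluated.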

The Workflow of EXP2SL
The basic idea of our EXP2SL model is to extract useful information from the L1000 expression profiles to accurately predict cell-line specific SL interactions. To achieve this goal, a semi-supervised objective function was designed to fully exploit the large amount of unlabeled data (Figure 1).

The Network Architecture of EXP2SL
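For a given cell line, suppose that there are N genes (indexed 1, 2, …, N) with measured shRNA data from the LINCS L1000 project (Subramanian et al., 2017). The corresponding L1000 gene expression profiles can be represented as a set of feature vectors $\{f_i\}_{i=1}^{N}$ with $f_i \in \mathbb{R}^{978}$. Our model first encodes the gene features through E sequential fully-connected layers; that is, the e-th layer computes the hidden feature $h_i^{e} \in \mathbb{R}^{d}$ from $h_i^{e-1}$ (with $h_i^{0} = f_i$) using the weight matrix $W^{e}_{\mathrm{encoder}}$ and the bias $b^{e}_{\mathrm{encoder}}$, where $W^{1}_{\mathrm{encoder}} \in \mathbb{R}^{d \times 978}$, $W^{e}_{\mathrm{encoder}} \in \mathbb{R}^{d \times d}$ (e = 2, …, E), and $b^{e}_{\mathrm{encoder}} \in \mathbb{R}^{d}$ (e = 1, …, E) denote the learnable parameters (d is the dimension of the hidden layers).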
After E encoding layers, the updated gene features $\{h_i^{E}\}_{i=1}^{N}$ are then used to predict SL interactions. More specifically, for a gene pair (i, j), i, j = 1, 2, …, N and i ≠ j, a confidence score is calculated through a linear layer to predict the potential of SL interaction between this gene pair, that is,

$$s_{i,j} = W_{\mathrm{out}} \, [h_i^{E}, h_j^{E}] + b_{\mathrm{out}},$$

where $[\cdot, \cdot]$ denotes concatenation, and $W_{\mathrm{out}} \in \mathbb{R}^{1 \times 2d}$ and $b_{\mathrm{out}} \in \mathbb{R}$ stand for learnable parameters. Note that the pairs (i, j) and (j, i) are equivalent, so we average the prediction scores obtained from the concatenations $[h_i^{E}, h_j^{E}]$ and $[h_j^{E}, h_i^{E}]$ to obtain identical prediction results for the input pairs (i, j) and (j, i).
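The forward pass described above (encoder layers, concatenation, linear output, and averaging over both orderings) can be sketched in plain Python. This is an illustrative toy, not the PyTorch implementation: the ReLU activation, the uniform initialization, and all helper names are assumptions, since the text does not specify them.

```python
import random

def linear(W, b, x):
    """Affine map W x + b for list-based matrices and vectors."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_out, n_in, rng):
    # Hypothetical initialization chosen only for this demo.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def encode(f, layers):
    h = f
    for W, b in layers:
        h = relu(linear(W, b, h))  # activation choice is an assumption
    return h

def pair_score(f_i, f_j, layers, W_out, b_out):
    h_i, h_j = encode(f_i, layers), encode(f_j, layers)
    s_ij = linear(W_out, b_out, h_i + h_j)[0]  # list "+" concatenates
    s_ji = linear(W_out, b_out, h_j + h_i)[0]
    return 0.5 * (s_ij + s_ji)                 # symmetric in (i, j)

rng = random.Random(0)
in_dim, d = 978, 4
layers = [init_layer(d, in_dim, rng), init_layer(d, d, rng)]  # E = 2
W_out, b_out = init_layer(1, 2 * d, rng)
f_a = [rng.gauss(0, 1) for _ in range(in_dim)]
f_b = [rng.gauss(0, 1) for _ in range(in_dim)]
score_ab = pair_score(f_a, f_b, layers, W_out, b_out)
score_ba = pair_score(f_b, f_a, layers, W_out, b_out)
```

Averaging the two orderings makes the score invariant to the order in which the two genes are supplied, which is the property the text requires.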

The Semi-Supervised Objective Function
As described in SL Labels, the gene pairs with different SL labels can be classified into positive, negative, and unknown sets, denoted as P, N, and U, respectively. Here, we designed a semi-supervised loss function that utilizes information from all three sets to optimize the parameters of our model. More specifically, our loss consisted of three parts. The first part of our objective function is the mean squared error (MSE) of positive and negative samples, calculated as

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{|P \cup N|} \sum_{(i,j) \in P \cup N} \left( s_{i,j} - y_{i,j} \right)^2,$$

where $y_{i,j}$ is the label of gene pair (i, j) and $s_{i,j}$ stands for the potential score of gene pair (i, j) predicted by EXP2SL. The second part of the objective function is inspired by the semi-supervised Bayesian personalized ranking (BPR) loss (Rendle et al., 2009), which uses the unknown labels to boost the prediction performance. In particular, the BPR loss is defined as

$$\mathcal{L}_{\mathrm{BPR}} = - \sum_{(i,j) \in P,\,(k,l) \in U} \ln \sigma\left( s_{i,j} - s_{k,l} \right) - \sum_{(k,l) \in U,\,(m,n) \in N} \ln \sigma\left( s_{k,l} - s_{m,n} \right),$$

where $\sigma$ stands for the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$. This objective function aims to enlarge the margins of the predicted scores between positive SL and unknown pairs, as well as those between the unknown and negative SL pairs. To calculate this loss, we sample the negative and unknown pairs so that their numbers equal the number of positive pairs during model training.
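The above MSE and BPR objective functions are further combined with an L2 regularizer over all the learnable model parameters to construct the final objective function of our EXP2SL model, that is,

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda_1 \mathcal{L}_{\mathrm{BPR}} + \lambda_2 \lVert \theta \rVert_2^2,$$

where $\mathcal{L}_{\mathrm{MSE}}$ and $\mathcal{L}_{\mathrm{BPR}}$ denote the MSE and BPR losses, $\theta$ denotes the model parameters, and $\lambda_1$ and $\lambda_2$ stand for the weight parameters controlling the contributions of the BPR loss and the L2 regularization term, respectively.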
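A compact numerical sketch of the combined objective is given below. Several details are assumptions made only for this example: positive and negative targets of 1 and 0 in the MSE term, one-to-one pairing of the sampled positive, unknown, and negative scores, and an unnormalized sum in the BPR term. The function names are likewise illustrative.

```python
import math

def log_sigmoid(x):
    """Numerically stable ln(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def mse_loss(pos_scores, neg_scores, pos_target=1.0, neg_target=0.0):
    # Targets of 1 and 0 are an assumption of this sketch.
    errors = [(s - pos_target) ** 2 for s in pos_scores]
    errors += [(s - neg_target) ** 2 for s in neg_scores]
    return sum(errors) / len(errors)

def bpr_loss(pos_scores, unk_scores, neg_scores):
    # Rank positives above unknowns, and unknowns above negatives.
    terms = [log_sigmoid(p - u) for p, u in zip(pos_scores, unk_scores)]
    terms += [log_sigmoid(u - n) for u, n in zip(unk_scores, neg_scores)]
    return -sum(terms)

def total_loss(pos, unk, neg, params, lam1, lam2):
    l2 = sum(p * p for p in params)
    return mse_loss(pos, neg) + lam1 * bpr_loss(pos, unk, neg) + lam2 * l2

well_ordered = bpr_loss([1.0], [0.5], [0.0])
mis_ordered = bpr_loss([0.0], [0.5], [1.0])
example_total = total_loss([1.0], [0.5], [0.0], [1.0, 2.0], lam1=16, lam2=0.01)
```

The BPR term is smaller when the scores follow the order positive > unknown > negative, which is the ranking behavior the objective is meant to encourage.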
To train the EXP2SL model, we used the Adam optimizer (Kingma and Ba, 2014) with the default learning rate 0.001 and the number of training epochs 1,000. We also clipped the gradient if it was larger than 5 to stabilize the training process. We implemented our model with PyTorch 1.0.1 (Paszke et al., 2017).

Hyper-Parameters
The hyper-parameters of our model include the weight of the BPR loss $\lambda_1$ from [16, 32, 64, 128], the weight of the L2 regularization $\lambda_2$ from [0.1, 0.05, 0.01, 0.005, 0.0001], the number of encoding layers from [0, 1, 2, 3, 4], and the dimension of hidden features d from [32, 64, 128, 256]. For each cell line, a grid search was performed to select the best combination of hyper-parameter settings from the above-mentioned ranges, according to the AUC scores achieved by five repeats of 5-fold cross-validation under the "split pair" setting (i.e., gene pairs were randomly split into training and test sets). Details about the cross-validation settings can be found in Performance Evaluation. The baseline models were tuned using the same strategy, and the ranges for hyper-parameters in each baseline model are described in Baseline Models.
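The search space above contains 4 × 5 × 5 × 4 = 400 combinations. A minimal way to enumerate it is sketched below; the dictionary keys and the `evaluate` callback (which would return the mean cross-validated AUC of one configuration) are hypothetical names introduced for illustration.

```python
from itertools import product

grid = {
    "bpr_weight": [16, 32, 64, 128],
    "l2_weight": [0.1, 0.05, 0.01, 0.005, 0.0001],
    "num_layers": [0, 1, 2, 3, 4],
    "hidden_dim": [32, 64, 128, 256],
}

def iter_configs(grid):
    """Yield every combination of hyper-parameter values as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def grid_search(grid, evaluate):
    """Return the configuration with the highest evaluation score."""
    return max(iter_configs(grid), key=evaluate)

def toy_evaluate(cfg):
    # Stand-in for cross-validated AUC; favors one arbitrary setting.
    return (-abs(cfg["hidden_dim"] - 64) - cfg["num_layers"]
            - abs(cfg["bpr_weight"] - 32) - cfg["l2_weight"])

n_configs = sum(1 for _ in iter_configs(grid))
best = grid_search(grid, toy_evaluate)
```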

Extraction of Feature Importance
FIGURE 1 | Workflow of the EXP2SL model. For a pair of genes, their L1000 gene expression profiles derived from knockdown conditions are the inputs of the encoding layers. Then, the updated features for both genes in a given pair are concatenated to predict the confidence score of being an SL pair by a linear combination. In addition, a semi-supervised objective function is used to train the model parameters, which aims to utilize the information from both known (positive and negative) and unknown SL gene pairs.

Here, we used the saliency map-based approach proposed in (Simonyan et al., 2013) to evaluate the importance of each position along the 978-dimensional input features $\{f_i\}_{i=1}^{N}$. The basic idea of this method is to calculate the gradients of the output score with respect to the input features; a larger absolute gradient value suggests a higher importance of the corresponding feature dimension. After the training process, the positive and negative SL pairs of each cell line are fed into the EXP2SL model, and the corresponding importance for each input feature dimension is calculated from the gradients of the predicted scores with respect to the input features, where $s_{i,j}$ is the predicted confidence score of gene pair (i, j), and $w$ is a 978-dimensional vector containing the importance score of each dimension of the input L1000 gene expression profiles. To reduce the variance caused by random initialization of network parameters and random sampling of the unknown and negative gene pairs for calculating the BPR loss during the training process, we also take the summation of the $w$ vectors from 10 trained EXP2SL models to obtain the final importance scores for the 978 feature dimensions. The top 50 ranked features are then selected for each cell line. We examined the overlaps of the selected features between cell lines and calculated the over-representation of functional gene sets and pathways using the WebGestalt server (Liao et al., 2019).
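The gradient-based ranking can be illustrated with a small sketch. The aggregation used here (summing absolute gradients with respect to both genes over all labeled pairs) is an assumption, and a finite-difference gradient stands in for automatic differentiation so that the example needs no deep learning library; all function names are hypothetical.

```python
def numerical_gradient(score, f_first, f_second, eps=1e-4):
    """Central-difference gradient of score(f_first, f_second) w.r.t. f_first."""
    grad = []
    for k in range(len(f_first)):
        up = f_first[:k] + [f_first[k] + eps] + f_first[k + 1:]
        down = f_first[:k] + [f_first[k] - eps] + f_first[k + 1:]
        grad.append((score(up, f_second) - score(down, f_second)) / (2 * eps))
    return grad

def importance(score, pairs, features):
    """Sum absolute input gradients over all pairs and both genes."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for i, j in pairs:
        g_i = numerical_gradient(score, features[i], features[j])
        # Swap the argument order to differentiate w.r.t. the second gene.
        g_j = numerical_gradient(lambda a, b: score(b, a), features[j], features[i])
        for k in range(dim):
            w[k] += abs(g_i[k]) + abs(g_j[k])
    return w

def top_features(w, k):
    return sorted(range(len(w)), key=lambda idx: w[idx], reverse=True)[:k]

# Toy linear score with known gradients [0.1, -2.0, 0.5] for each gene.
coef = [0.1, -2.0, 0.5]
toy_score = lambda a, b: sum(c * (x + y) for c, x, y in zip(coef, a, b))
toy_features = {"A": [1.0, 2.0, 3.0], "B": [0.0, 1.0, 0.0]}
w = importance(toy_score, [("A", "B")], toy_features)
top = top_features(w, 1)
```

Importance vectors from several trained models can be combined by element-wise summation before calling `top_features` with k = 50.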

Logistic Regression
We used the logistic regression (LR) model implemented in scikit-learn (Buitinck et al., 2013). The L1000 expression profiles were used as input to the LR model. For each pair of input genes (i, j), the features of genes i and j (denoted as $f_i$ and $f_j$, respectively) were concatenated before being fed into the LR model. Since LR may produce different results for pairs (i, j) and (j, i), each of the two orderings was treated as an individual input with the same label in the training phase. In the test phase, the prediction values from both inputs were then averaged to obtain the final prediction score. The inverse of regularization strength (a hyper-parameter) was chosen from [10, 1, 0.5, 0.1, 0.05, 0.01].
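The handling of pair order for the vector-input baselines can be sketched independently of any particular classifier. The helper names below and the `predict_one` callable (standing in for a trained model's scoring function) are assumptions made for this illustration.

```python
def augment_pairs(pairs, labels, features):
    """Build training inputs containing both orderings of each gene pair."""
    X, y = [], []
    for (i, j), label in zip(pairs, labels):
        X.append(features[i] + features[j])  # concatenation [f_i, f_j]
        y.append(label)
        X.append(features[j] + features[i])  # concatenation [f_j, f_i]
        y.append(label)
    return X, y

def symmetric_predict(predict_one, i, j, features):
    """Average the predictions for (i, j) and (j, i)."""
    s_ij = predict_one(features[i] + features[j])
    s_ji = predict_one(features[j] + features[i])
    return 0.5 * (s_ij + s_ji)

toy_features = {"A": [1.0], "B": [3.0]}
toy_model = lambda x: x[0] - 2.0 * x[1]   # deliberately order-dependent
X_train, y_train = augment_pairs([("A", "B")], [1], toy_features)
pred_ab = symmetric_predict(toy_model, "A", "B", toy_features)
pred_ba = symmetric_predict(toy_model, "B", "A", toy_features)
```

With a scikit-learn classifier, `predict_one` would typically wrap a call that returns the positive-class probability for a single concatenated vector.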

Random Forest
We used the random forest (RF) classifier implemented based on scikit-learn (Buitinck et al., 2013). The input and output of RF were the same as those of LR described above. The number of trees was selected from [32,64,128] and the maximum depth of the trees was selected from [8, 16, None], where "None" means that the trees will keep expanding until no node can be split.

Support Vector Machine
We used the support vector machine (SVM) classifier implemented based on scikit-learn (Buitinck et al., 2013). The input and output of SVM were the same as those of LR and RF described above. The only hyper-parameter, the inverse of regularization strength, was selected from [100, 50, 10, 5, 1, 0.5, 0.1].

Gradient Boosting Decision Tree
We used the gradient-boosting decision tree (GBDT) classifier implemented by the XGBoost project (Chen and Guestrin, 2016). The input and output of GBDT were the same as other classifiers described above. The number of trees was selected from [32,64,128] and the maximum depth of the trees was selected from [4,8,16].

NetLapRLS
NetLapRLS (Xia et al., 2010) (a semi-supervised regressor) was implemented based on PyDTI (https://github.com/stephenliu0423/PyDTI). As NetLapRLS treats symmetric gene pairs (i, j) and (j, i) in the same way, there is no need to average the predictions of both pairs. Three types of similarity matrices were used as the input to NetLapRLS: 1) The protein-protein interaction (PPI) similarity matrix $S_p$, i.e., the pairwise PPI similarities between all pairs of genes used in the cell line. The human PPI data were obtained from the STRING database v11 (Szklarczyk et al., 2014). Protein pairs with STRING scores larger than 0.8 were considered positive interaction pairs in the PPI network. The PPI similarity between two proteins (i, j) was calculated as the Jaccard similarity of their interaction partners in the PPI network, that is,

$$S_p(i, j) = \frac{|N(i) \cap N(j)|}{|N(i) \cup N(j)|},$$

where $N(x)$ stands for the neighbors of protein x in the PPI network.
2) The L1000 profile similarity matrix $S_l$, i.e., the absolute values of the pairwise L1000 profile similarities between all the genes used in the cell line. The L1000 profile similarity between two genes was calculated as the Pearson correlation between their L1000 gene expression profiles.
3) The combination of both PPI and L1000 similarities, calculated as 1 -
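The Jaccard-based PPI similarity from item 1) can be sketched as follows. The edge-list format, the protein names, and the convention of returning 0 when neither protein has any neighbor are assumptions of this example; the 0.8 score cutoff follows the text.

```python
def build_neighbors(edges, threshold=0.8):
    """Map each protein to its interaction partners above the score cutoff."""
    neighbors = {}
    for a, b, score in edges:
        if score > threshold:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    return neighbors

def jaccard(neighbors, i, j):
    """Jaccard similarity of the neighbor sets of proteins i and j."""
    n_i = neighbors.get(i, set())
    n_j = neighbors.get(j, set())
    union = n_i | n_j
    # Returning 0.0 for two isolated proteins is a convention chosen here.
    return len(n_i & n_j) / len(union) if union else 0.0

# Hypothetical edges: (protein, protein, confidence score in [0, 1]).
toy_edges = [
    ("G1", "G3", 0.90),
    ("G1", "G4", 0.95),
    ("G2", "G3", 0.85),
    ("G2", "G5", 0.50),  # below the cutoff, ignored
]
toy_neighbors = build_neighbors(toy_edges)
sim_12 = jaccard(toy_neighbors, "G1", "G2")
```

Here G1 has neighbors {G3, G4} and G2 has neighbors {G3}, so one shared partner out of two distinct partners is found.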

Cell-Line Specificity of SL Interactions
To demonstrate the cell-line specificity of SL interactions, we examined 378 CRISPR knockout pairs screened in different cell lines from the Big Papi SynLet library (Najm et al., 2018). Their SL scores were calculated by GEMINI (Zamanighomi et al., 2019), a computational tool for identifying SL interactions from pairwise CRISPR knockout screens. Three cell lines were used in our performance evaluation, including A549, A375, and HT29. Among these three cell lines, A549 and A375 exhibited relatively high correlation (Pearson correlation 0.71, Figure 2A) in GEMINI scores, which measure the strength of the SL interactions. Meanwhile, the correlations between HT29 and the other two cell lines are relatively low (Pearson correlations 0.36 and 0.28, Figure 2A). These results indicate that the SL interaction patterns between the same gene pairs in different cell lines can be quite different. Next, we examined the positive and negative SL samples selected from the Big Papi dataset according to the criteria described in SL Labels. By comparing the SL labels of the same gene pairs in the three cell lines, we found that most gene pairs have inconsistent labels across different cell lines (Figure 2B). There are 38 gene pairs with at least one positive label in the three cell lines, but only one of them (i.e., the BRCA1-PARP1 gene pair) is always labeled as a positive SL. Among these 38 gene pairs, 16 have negative labels in one cell line but positive labels in another one.
Based on the above observation that most SL pairs were not conserved across different cell lines, we built prediction models for each cell line separately. In addition to the Big Papi dataset, we also included the data from other literature (Shen et al., 2017;Zhao et al., 2018), which further enlarged the SL data of cell line A549. The overlaps of gene pairs used as labeled training samples between the three cell lines are shown in Figure 2C.

Performance Evaluation
We compared the performance of our model to that of several baseline methods through cross-validation on the aforementioned datasets for the three cell lines. LR, RF, SVM, and GBDT were selected as the baseline methods because they are standard machine learning models that accept vector input, which is suitable for our case. NetLapRLS is also used as a baseline model, as it is a well-established semi-supervised method that accepts network input and can be used to test the effectiveness of other features, such as the PPI network. Two settings were used to split the training and test samples. The first one was called "split pair", in which gene pairs were randomly split into training and test sets. The second one was called "split gene", in which, for each test gene pair, at least one gene is not seen in the training data. The "split gene" setting was mainly used to test whether the prediction can be generalized to unseen genes, which is more challenging. Note that the splitting was performed over positive and negative SL pairs, and our model also utilized the unknown pairs during the training process.
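The two splitting schemes can be contrasted with a short sketch. The way held-out genes are chosen in `split_gene` (a random fraction of genes, with every pair touching a held-out gene assigned to the test set) is one possible construction assumed here; the function names and the `test_frac` parameter are illustrative.

```python
import random
from itertools import combinations

def split_pair(pairs, test_frac=0.2, seed=0):
    """Randomly split labeled gene pairs into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def split_gene(pairs, test_frac=0.2, seed=0):
    """Split so that every test pair has at least one gene unseen in training."""
    rng = random.Random(seed)
    genes = sorted({g for pair in pairs for g in pair})
    rng.shuffle(genes)
    held_out = set(genes[: max(1, int(len(genes) * test_frac))])
    train = [p for p in pairs if p[0] not in held_out and p[1] not in held_out]
    test = [p for p in pairs if p[0] in held_out or p[1] in held_out]
    return train, test

toy_genes = [f"gene{i}" for i in range(10)]
toy_pairs = list(combinations(toy_genes, 2))
pair_train, pair_test = split_pair(toy_pairs)
gene_train, gene_test = split_gene(toy_pairs)
train_genes = {g for pair in gene_train for g in pair}
```

Under `split_pair` both genes of a test pair usually also appear in some training pair, whereas `split_gene` guarantees that at least one gene of each test pair never appears in training.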
Area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPR), F1 score, accuracy, precision, sensitivity and selectivity were used to evaluate the classification performance (Tables 2 and 3). The receiver operating characteristic (ROC) and precision-recall (PR) curves achieved by EXP2SL and the baseline models are shown in Figures S2-S3. Under the "split pair" setting, all the models achieved relatively high performance, which indicates that the prediction problem defined under this setting was relatively easy. The performance of our model was comparable with that of the top-performing baseline methods under this setting. Under the more practical "split gene" setting, in which we wished to predict SL pairs containing novel genes without experimental screen data (due to the limited existing experimental data), the SL prediction task became more difficult, as all the models achieved lower AUC and AUPR scores than under the "split pair" setting. Nevertheless, our model exhibited significantly better performance than all the baseline models under this "split gene" setting. EXP2SL achieved the best performance in at least 6/7 metrics for all three cell lines (Table 3). We also tested our model and the baseline methods with a less strict threshold for defining the positive SL pairs (i.e., 10%), and our model again achieved better performance than the baseline methods (Tables S1-S2).
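Most of the listed metrics follow directly from the confusion matrix, and AUC can be computed as the fraction of correctly ordered positive-negative score pairs. The sketch below makes two assumptions: a decision threshold of 0.5 for the thresholded metrics, and "selectivity" interpreted as the true negative rate. It omits AUPR and does not guard against empty classes.

```python
def threshold_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, sensitivity, selectivity, and F1 at one threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "sensitivity": sensitivity,
        "selectivity": tn / (tn + fp),  # assumed to mean true negative rate
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]
toy_metrics = threshold_metrics(toy_labels, toy_scores)
toy_auc = auc(toy_labels, toy_scores)
```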

Ablation Study and Feature Comparison
To evaluate the contribution of the semi-supervised objective function to the final prediction, we tested our EXP2SL model without the BPR loss. That is, we modified the objective function in Equation 6 and used only the MSE loss and the L2 regularization term; our model can thus be trained in a supervised manner. An obvious decrease in performance under the "split gene" setting could be observed when we removed the BPR loss (see the "EXP2SL (no BPR loss)" row in Table 3). Therefore, the results demonstrated that the semi-supervised objective function had an important contribution to the prediction performance of our model. One of the baseline models, NetLapRLS, can also incorporate different similarity matrices (i.e., the L1000 profile similarities, the PPI similarities, and the combined similarities, as described in NetLapRLS), thus allowing the comparison between different settings using different input information. The NetLapRLS models with L1000 profile similarities and with PPI similarities as the input features achieved similar performance, and the combination of both features only led to a slight increase in performance in most cases. In general, the performance of NetLapRLS was worse than EXP2SL.
We also incorporated the PPI network into our EXP2SL framework (denoted as EXP2SL (PPI) in Tables 2 and 3) using a graph convolution network (Lei et al., 2017), as described in Supporting Material and Figure S1. In this case, no significant improvement in AUC and AUPR scores was observed after adding the PPI network information (p values larger than 0.1 for all the cell lines in both conditions, Wilcoxon rank-sum test). These results indicate that using only the L1000 gene expression profiles is adequate to enable the models to capture useful features for accurately predicting SL interactions.

Feature Importance Analysis
We used the scheme described in Extraction of Feature Importance to extract the important features based on the saliency map approach (Simonyan et al., 2013). Those features (i.e., the corresponding expression levels of 978 genes) ranked among the top 50 (about 5% of the 978-dimensional features) were selected as the important features for each cell line. Among the selected feature sets, there is only one gene shared across all three cell lines, that is, AKT1. AKT1 is known as a serine/threonine protein kinase, which regulates many viability-related cellular processes, including proliferation, apoptosis, and cell survival (Chen et al., 2001; Lee et al., 2011). Most features were considered as the top 50 important features only in one cell line (47, 46, and 46 unique important features for A549, A375, and HT29, respectively), which suggests that the prediction may rely on the specific gene expression landscapes in different cell lines.
We also checked the over-representation of functional gene sets and pathways among the selected important features of the three cell lines using the WebGestalt server (Liao et al., 2019). The gene ontology (GO) related to biological processes was first used to examine the enriched functional annotations of the selected feature sets (Tables S3-S5). The enriched GO terms were ranked according to the false discovery rate (FDR) scores and p values. As a result, the top 10 enriched functional annotations for the selected features of HT29 contain the regulation of cell death, proliferation, and apoptosis (p values < 10^-6 and FDRs < 10^-3), which are cell viability related functions. Then, we also checked the over-representation of selected genes among the KEGG pathways using the WebGestalt server (Liao et al., 2019) (Tables S6-S8). Among the top 10 enriched pathways ranked according to the FDR scores and p values, we found multiple cancer-related pathways for cell line HT29 and also cell cycle or cancer-regulatory pathways for A375 and A549, e.g., the p53 and ERBB signaling pathways. All these results indicated that the selected features are probably related to the regulation of cell viability.

TABLE 2 | Performance evaluation in three different cell lines under the "split pair" setting. The mean and standard deviation (in brackets) of metrics over 10 repeats of 5-fold cross-validations are shown. The best results for each cell line and each metric are marked in bold.

CONCLUSION
In this paper, we proposed a semi-supervised neural network based method, EXP2SL, to accurately predict cell-line specific SL interactions. Our method exploits the L1000 expression profiles measured from the shRNA knockdown experiments performed in different cell lines to learn the cell-line specific SL interactions from the labeled data generated by CRISPR-Cas9 double-knockout based screens. In addition, a semi-supervised objective function is designed to make use of the large amount of unlabeled data. Tests on three datasets corresponding to three different cell lines showed that our model achieved better performance than the baseline models. At the same time, we verified that the L1000 gene expression profiles and the semi-supervised objective function are useful in SL prediction. Moreover, we analyzed the most important genes among the whole L1000 gene expression profiles, and found that the top attributing genes are related to the regulation of cell viability, which suggested that our model may pay more attention to such meaningful components of the whole gene expression profiles. The major contributions of our work are the demonstration of L1000 expression profiles as effective features for SL prediction, and a novel semi-supervised neural network algorithm to accurately capture SL interactions. To the best of our knowledge, our model is the first computational approach for predicting cell-line specific synthetic lethal interactions, which may potentially benefit the target identification for specific tissue or cancer types. However, the application of our model may be limited in certain cancer types with high heterogeneity. Another limitation of our model is its dependence on the available L1000 gene expression profiles as input to EXP2SL. Although the L1000 expression profiles of more than 3,500 genes have been measured by shRNA knockdown experiments in the three cell lines analyzed in this work, there exist some cell lines with a paucity of data, which may thus limit the applications of our model on such cell lines.