A breast cancer-specific combinational QSAR model development using machine learning and deep learning approaches

Breast cancer is the most prevalent and heterogeneous form of cancer affecting women worldwide. Various therapeutic strategies are in practice depending on the extent of disease spread, such as surgery, chemotherapy, radiotherapy, and immunotherapy. Combinational therapy is another strategy that has proven effective in controlling cancer progression: an anchor drug, a well-established primary therapeutic agent with known efficacy against specific targets, is administered together with a library drug, a supplementary agent intended to enhance the anchor drug's efficacy and broaden the therapeutic approach. Our work focused on harnessing regression-based machine learning (ML) and deep learning (DL) algorithms to develop a structure-activity relationship between the molecular descriptors of drug pairs and their combined biological activity through a QSAR (quantitative structure-activity relationship) model. Eleven widely used machine learning and deep learning algorithms were employed to develop QSAR models. A total of 52 breast cancer cell lines, 25 anchor drugs, and 51 library drugs were considered in developing the QSAR model. Deep Neural Networks (DNNs) achieved an impressive R2 (coefficient of determination) of 0.94 with an RMSE (root mean square error) of 0.255, making them the most effective algorithm for developing a structure-activity relationship with strong generalization capabilities. In conclusion, applying combinational therapy alongside ML and DL techniques represents a promising approach to combating breast cancer.

XGBoost (XGB): Extreme gradient boosting is a robust ensemble machine learning algorithm used extensively in regression tasks. An ensemble of decision trees starts with initial predictions, calculates initial residuals, and uses regularized trees to reduce overfitting. New predictions are calculated by adding the output of the current tree to the previous predictions. The algorithm updates the residuals based on the difference between the actual values and the current predictions, and this process is iterated according to the number of estimators specified (2). The XGB algorithm was trained with a maximum depth of 1, a maximum of 8 features, a learning rate of 0.08, 10000 estimators, and absolute error as the loss function.
Ridge Regression: Ridge regression is a linear machine learning algorithm widely used to predict a response variable from a set of highly correlated independent variables. It finds the coefficients of the independent variables in the linear equation that best predicts the response variable and introduces L2 regularization, which adds a penalty term (alpha) to the equation and prevents overfitting by discouraging large coefficients for individual descriptors; the regularization parameter controls the strength of the penalty (3). An alpha value of 20 was used while training the model.
KNN (k-Nearest Neighbors): KNN is a nonlinear machine learning algorithm that predicts data points based on their proximity to other data points in feature space. Predictions are made from the nearest neighbors of a data point, chosen according to hyperparameters such as the distance metric, weights, and neighbor count; for regression tasks, the average of the neighbors' values is used (3). The KNN algorithm was trained with a nearest-neighbor count of five, uniform weights, and the Euclidean distance metric with a leaf size of 10.
LASSO: We leveraged LASSO regression, a linear algorithm that plays an impressive role in handling high-dimensional data and extracting essential relationships between molecular descriptors and biological responses. LASSO applies L1 regularization, adding a penalty value alpha that drives some coefficient estimates to exactly zero in the linear regression equation, thereby identifying the features most relevant for predicting the biological response. We performed systematic cross-validation experiments to identify the optimal level of regularization (3). An alpha value of 0.5 was used in LASSO regression to train the model on the drug response data.
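The three linear/neighborhood models above can be sketched together with scikit-learn estimators using the stated hyperparameters; the synthetic data is illustrative, and the assumption that scikit-learn was the underlying library is ours.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=150)

ridge = Ridge(alpha=20)       # L2 penalty strength from the text
lasso = Lasso(alpha=0.5)      # L1 penalty; drives some coefficients exactly to zero
knn = KNeighborsRegressor(n_neighbors=5, weights="uniform",
                          metric="euclidean", leaf_size=10)

for name, model in [("ridge", ridge), ("lasso", lasso), ("knn", knn)]:
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))

# LASSO's feature selection: only a sparse subset of coefficients survives
n_selected = int((lasso.coef_ != 0).sum())
print(n_selected)
```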
Elastic Net Regression: Elastic Net is a linear regression model that leverages both L1 (alpha) and L2 (lambda) regularization techniques to establish a relationship between molecular descriptors and the biological activity of the drugs, simultaneously selecting essential variables from the training dataset (L1) and reducing multicollinearity (L2) (4). By experimenting with various alpha and lambda values, we identified the optimal regularization values as alpha = 0.5 and lambda = 0.6, with random coefficient selection, for biological activity prediction.
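In scikit-learn's `ElasticNet`, the overall penalty strength is `alpha` and the L1/L2 mix is `l1_ratio`; mapping the paper's alpha = 0.5 and lambda = 0.6 onto these two parameters is our assumption, sketched below on synthetic data.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 40))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=120)

# Assumed mapping: text's alpha -> penalty strength, text's lambda -> l1_ratio;
# selection="random" updates one random coefficient per iteration, as stated.
enet = ElasticNet(alpha=0.5, l1_ratio=0.6, selection="random", random_state=0)
enet.fit(X, y)
print(int((enet.coef_ != 0).sum()))  # sparse subset of descriptors retained
```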
CART (Classification and Regression Trees): CART is a nonlinear, decision tree-based machine learning algorithm used for regression tasks. Its strength lies in capturing the relationship between molecular descriptors and the biological activity of drugs by recursively partitioning the feature space into homogeneous regions. It involves the construction of decision trees, with each node representing a feature, selecting the features that maximize the reduction in variance (2,4). Hyperparameter tuning was performed to identify the optimal parameters: a maximum tree depth of 50, a minimum samples split of 10, a minimum sample leaf count of 5, the number of features to consider at each split, and a maximum of 100 leaf nodes.
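A minimal sketch of this configuration with scikit-learn's `DecisionTreeRegressor` (the library choice is an assumption); `max_features` is left at its default since the text names the parameter but gives no value.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 15))
y = np.where(X[:, 0] > 0, 2.0, -1.0) + rng.normal(scale=0.1, size=300)

# Stated hyperparameters; max_features left default (no value given in the text)
tree = DecisionTreeRegressor(max_depth=50, min_samples_split=10,
                             min_samples_leaf=5, max_leaf_nodes=100)
tree.fit(X, y)
print(tree.get_n_leaves())  # bounded above by max_leaf_nodes=100
```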
Stochastic Gradient Descent Regressor (SGD): SGD is a linear machine learning algorithm used in QSAR tasks to efficiently capture the relationship between the molecular descriptors and the biological activity of the drugs by iteratively updating the model coefficients to minimize the loss function (mean squared error). The model coefficients were optimized by the SGD regressor on the chosen loss function, with updates made at small learning rates to minimize the loss, which makes the method suitable for large datasets (3). A series of experiments was performed to identify optimal hyperparameters for our dataset: the number of iterations with no improvement in validation score was set to 250, L2 regularization was applied with a lambda value of 0.7, the learning rate was 0.001, and the maximum number of epochs was 10000.
Support Vector Regressor (rbf-SVR): rbf-SVR is a machine learning algorithm that captures nonlinear interactions between molecular descriptors and biological activity. The C parameter in the algorithm handles the tradeoff between low training error and low testing error, thus preventing overfitting (4,5). Cross-validation experiments were performed by tuning various hyperparameters to identify the optimal ones: the kernel was the radial basis function, epsilon, which controls the error tolerated in regression predictions, was 0.9, and gamma, which defines the shape of the decision boundary, was set to 'scale'.
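A sketch with scikit-learn's `SVR` using the stated kernel, epsilon, and gamma; the C parameter is left at its default of 1.0 since the text names it without giving a value, and the data is synthetic.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=200)

# epsilon=0.9: width of the tube within which errors are not penalized;
# C left at its default (no value given in the text)
svr = SVR(kernel="rbf", epsilon=0.9, gamma="scale")
svr.fit(X, y)
print(svr.predict(X[:5]).shape)  # (5,)
```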
Wider Neural Network: Deep learning-based neural networks were also used alongside the machine learning algorithms for QSAR model development; neural networks are robust algorithms for regression-based tasks (5). A wider neural network (fewer hidden layers with more neurons in each layer) was employed, with one input layer consisting of 2516 units serving as the entry point for molecular descriptor data, and two hidden layers with 3000 and 2000 units, respectively, using the rectified linear unit (ReLU) activation function. The hidden layers' primary purpose is to learn and represent complex patterns and relationships between molecular descriptors and biological activity. A single output layer with one unit and a linear activation function produces the quantitative predictions of biological activity, ensuring continuous output values suitable for a regression task.
Hyperparameter tuning was performed to identify the optimal parameters for better regression-based predictions: the learning rate was set to 0.001 to ensure stable convergence and control the step size during optimization, the Adam optimizer was used to minimize the loss (mean squared error), training was conducted for 50 epochs with a batch size of 64, and an early stopping protocol with a patience limit of 10 was employed to prevent overfitting. In summary, the neural network hyperparameters were carefully selected and tuned to maintain a balance between model complexity and predictive performance.
Deep Neural Network: In a revised QSAR study, we changed the architecture of the neural network, developing a deep neural network with an input layer of 2516 nodes, five hidden layers with 500, 250, 125, 64, and 32 nodes, respectively, all using the rectified linear unit activation function, and an output layer consisting of a single unit responsible for the quantitative prediction of continuous variables, suitable for a regression task. Hyperparameter tuning identified the optimal parameters as a learning rate of 0.001 and a momentum of 0.9; stochastic gradient descent was used as the optimizer, mean squared error as the loss function, 50 epochs, a batch size of 64, and a validation split of 0.25 (5).

Supplementary Tables
Table S1: Top 10 leading cancers with their respective estimated deaths and estimated new cases percentages (6).
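The deep network described above (2516-unit input, five ReLU hidden layers, SGD with momentum, validation split of 0.25) can be approximated with scikit-learn's `MLPRegressor`; this is a sketch under that assumption, with simulated descriptor data, not the study's actual implementation, and the input width is inferred from the data rather than declared.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 2516))   # 2516 molecular descriptors per drug pair
y = X[:, :5].mean(axis=1) + rng.normal(scale=0.05, size=400)

# Five ReLU hidden layers and SGD with momentum, per the text; early stopping
# holds out 25% of the training data, matching the stated validation split.
dnn = MLPRegressor(hidden_layer_sizes=(500, 250, 125, 64, 32),
                   activation="relu", solver="sgd",
                   learning_rate_init=0.001, momentum=0.9,
                   batch_size=64, max_iter=50,
                   early_stopping=True, validation_fraction=0.25,
                   random_state=0)
dnn.fit(X, y)
print(dnn.predict(X[:3]).shape)  # (3,)
```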

Supplementary Figures
Figure S1: Heatmap representing the distribution of anchor drug target pathways in various cell lines.

Figure S5: Top 20 attributes from the processed dataset positively impacting model predictions. (The vertical axis represents the molecular descriptors from the dataset; the horizontal axis represents the average impact of each descriptor on the model's output magnitude.)

Table S2: Summary of the data sourced from the GDSC2 dataset.

Table S3: Drug attributes contributing most positively and negatively to model predictions, according to SHAP scores.