Study of PARP inhibitors for breast cancer based on enhanced multiple kernel function SVR with PSO

PARP1 is one of six enzymes required for the highly error-prone DNA repair pathway microhomology-mediated end joining (MMEJ) and needs to be inhibited when over-expressed. In order to study the PARP1 inhibitory effect of fused tetracyclic or pentacyclic dihydrodiazepinoindolone derivatives (FTPDDs) by quantitative structure-activity relationship technique, six models were established by four kinds of methods, heuristic method, gene expression programming, random forester, and support vector regression with single, double, and triple kernel function respectively. The single, double, and triple kernel functions were RBF kernel function, the integration of RBF and polynomial kernel functions, and the integration of RBF, polynomial, and linear kernel functions respectively. The problem of multi-parameter optimization introduced in the support vector regression model was solved by the particle swarm optimization algorithm. Among the models, the model established by support vector regression with triple kernel function, in which the optimal R 2 and RMSE of training set and test set were 0.9353, 0.9348 and 0.0157, 0.0288, and R2 cv of training set and test set were 0.9090 and 0.8971, shows the strongest prediction ability and robustness. The method of support vector regression with triple kernel function is a great promotion in the field of quantitative structure-activity relationship, which will contribute a lot to designing and screening new drug molecules. The information contained in the model can provide important factors that guide drug design. Based on these factors, six new FTPDDs have been designed. Using molecular docking experiments to determine the properties of new derivatives, the new drug was ultimately successfully designed.


Introduction
Breast cancer is a common malignant disease in breast tissue.In 2020, more than 2.26 million people were diagnosed with the disease and about 685 thousand people were killed by it (Sung et al., 2021).Because of the high incidence rate and seriousness of breast cancer, developing new drugs with good therapeutic effects has become a hot research spot.The size of the breast cancer drug development market has totaled 20.2 billion dollars in 2019 and is forecast to grow to 47.7 billion in 2029 (Wilcock and Webster, 2021), which makes breast cancer drug development become one of the biggest disease research fields.
Although new breast cancer drugs have been developed continuously, the efficiency of most drugs has been limited due to side effects and the lack of specificity.Since Bryant et al. proposed the concept of synthetic lethality effect in 2005, the anti-tumor effect of Poly (ADP-ribose) polymerase (PARP) inhibitors has been gradually revealed.
The PARP family, known as diphtheria-toxin-like ADPribosyltransferases (ARTDs), is a family of 17 enzymes that share a common catalytic ADP-ribosyl transferase (ART) motif (Zong et al., 2022).PARP can be activated by recognizing DNA fragments of structural damage and then perform base excision repair (Zong et al., 2022).It plays an important role in DNA single-strand break (SSB) repair, programmed cell death regulation, and DNA stability maintenance (Ohmoto and Yachida, 2017).Among the 17 kinds of PARP, PARP1 and PARP2 are the two most important subtypes of the PARP family (Huang and Yin, 2021).And at present, PARP inhibitors mainly inhibit these two subtypes.
PARP1 is one of six enzymes required for the highly errorprone DNA repair pathway microhomology-mediated end joining (MMEJ) (Sharma et al., 2015).When PARP1 is upregulated, MMEJ is increased, causing genome instability (Muvarak et al., 2015).Ordinarily, deficient expression of a DNA repair enzyme results in increased un-repaired DNA damage which leads to mutations and cancer through replication errors.However, the accuracy of MMEJ repair mediated by PARP1 is significantly low.Therefore, it seems that cancer is more likely to occur due to over-expression rather than under-expression (Sharma et al., 2015).
In breast cancer cells, the breast cancer susceptibility gene BRCA1 and BRCA2 genes are significant tumor suppressors for DNA double-strand breaks (DSBs) by homologous recombination (HR) and their mutation easily causes genetic instability and leads to the emergence of tumor cells (Luo et al., 2021).
PARP inhibitors can selectively kill tumor cells with HR function defects caused by RBCA1 and BRCA2 genes mutation by inhibiting PARP activity and leading to DNA repair failure, but have no effect on normal cells, which is synergistic lethal effect (Lord and Ashworth, 2017).PARP inhibitors show good curative effects in breast cancer therapies, which makes the research of PARP inhibitors become a significant field in breast cancer drug research.
In the experiments shown in literature (Wang H. et al., 2020), fused tetracyclic or pentacyclic dihydrodiazepinoindolone derivatives (FTPDDs) were studied as PARP inhibitors, and Pamiparib was found with great effects in inhibiting the activity of PARP1.Therefore, some FTPDDs that have a similar structure to Pamiparib should be deeply studied to find PARP1 inhibitors with better therapeutic effects and fewer side effects.The inhibitory activity on PARP can be evaluated by PARP IC 50 .Therefore, obtaining PARP1 IC 50 values of FTPDDs is of great importance for screening out PARP1 inhibitors with high biological activity and low toxicity, which will contribute to screening out FTPDDs with good therapeutic effects subsequently.Since the traditional methods of measuring IC 50 consume a lot of manpower and material resources, quantitative structure-activity relationship (QSAR) technique is used to predict the IC 50 values of compounds quickly and accurately.
QSAR is a method that uses mathematical models to describe the relationship between molecular structure and certain biological activity (An et al., 2006;Chen and Si, 2021).The basic assumption of QSAR is that the molecular structure of the compound decides its physical, chemical, and biological properties which subsequently decide its biological activity (Roy and Das, 2014).And the biological and chemical properties of compounds are portrayed by molecular descriptors in QSAR model (Jin et al., 2022;LI et al., 2023).Therefore, QSAR can be used to predict the biological activity of new compounds by mathematical models based on precisely selected molecular descriptors (Vilar et al., 2008;Si et al., 2022).QSAR shows a strong ability in new drug research, which optimizes pharmacodynamic characteristics and reduces expensive experiments in the meanwhile.Therefore, QSAR can be used to predict the IC 50 of PARP1 inhibitors to screen out new potential drug molecules quickly and accurately.
In this study, six QSAR models based on three kinds of methods, heuristic method (HM), gene expression programming (GEP), random forest and support vector regression (SVR) with single, double, and triple kernel function, were established and compared after calculating molecular descriptors to predict the PARP1 IC 50 of FTPDDs.The single, double, and triple kernel functions were RBF kernel function, the integration of RBF and polynomial kernel functions, and the integration of RBF, polynomial, and linear kernel functions respectively.Among the results of the models, the predictions of the model constructed by SVR with triple kernel function were consistent with the measured values best.The method of SVR with triple kernel function is a big breakthrough and will contribute a lot to designing and screening new drug molecules.Activity prediction and molecular docking experiments were applied to newly designed multiple compounds.The experimental results indicated that the newly designed compounds showed better performance.
2 Computational details and theories

Data set
The data set of 57 FTPDDs was collected from literature (Wang H. et al., 2020) and listed in Tables 1-4 according to the main structure of compounds.It included two parts, the structures of 57 FTPDDs and their PARP1 IC 50 values which were measured by the same experimental method under the same experimental conditions.Using simple random sampling, the original data is randomly divided into training set and test set according to the ratio of 4:1.The data set was randomly separated into a training set of 45 compounds and a test set of 12 compounds.The training set was used to establish the models and the test set was used to evaluate the prediction ability of the constructed models.For multi model regression evaluation, small datasets are susceptible to noise and randomness, and each model may only have a small number of samples for training, resulting in unstable model results.In addition, overfitting is easy to occur if each model has more degrees of freedom to adjust.Therefore, cross validation is a good method to ensure the stability of machine learning models.
In order to test the robustness of the models, K-fold crossvalidation was adopted.The procedure involves splitting the original data evenly into K parts, and sequentially using one of these parts as the validation set while using the remaining K-1 parts as training sets.This process is repeated K times, and the average of the K validation results is used to evaluate the model's performance.

Descriptor calculation
The steps of calculating molecular descriptors of compounds are as follow.
The first three steps include drawing molecular structures of all compounds in ChemDraw software and performing preliminary optimization of the structures of the compounds by MM + molecular mechanics force field and more precise optimization by semi-empirical PM3 (Klein et al., 2006) method successively in HyperChem software (HyperChem, 1994).After the three steps, the lowest energy structures are obtained.
The remaining two steps consist of putting files obtained from HyperChem into MOPAC software (Stewart, 1989) to generate MNO files and then using MNO files as the input of the CODESSA software to calculate five classes of molecular descriptors: constitutional, topological, geometrical, electrostatic, quantumchemical (Wright et al., 1997).

Statistical parameters
The coefficient of determination (denoted R 2 or r 2 ) and root mean square error (RMSE) were used to evaluate the models.R 2 is a measure of the goodness of fit of a model.In regression, the R 2 is a statistical measure of how well the regression predictions approximate the real data points.The closer R 2 is to 1, the more it indicates that the regression prediction fits the data (Glantz and Slinker, 1990).Furthermore, to verify the robustness of the models, the coefficient of determination of K-fold cross-validation (R 2 cv ) is used in the evaluation of the results, which is the average of all R 2 values in K cross validation experiments (Allen, 1974).
RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed.The RMSE represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences (Hyndman and Koehler, 2006).
Mean Absolute Error (MAE) is a measurement used to measure the average absolute difference between predicted and true values in regression problems.MAE can measure the average error between predicted and true values.The smaller the value of MAE, the smaller the average difference between predicted and true values, indicating a higher accuracy of prediction (Tropsha et al., 2003).
The Concordance Correlation Coefficient (CCC) was developed as a measure for the correlation between two sets of data, for instance a gold standard and a second reading (Sheikhpour et al., 2017).

Linear model by HM
After generating molecular descriptors, HM in CODESSA software was used to accomplish the pre-selection of the descriptors and build the linear model (Si et al., 2021).
HM in CODESSA software has the advantages of high efficiency and no limitation to the size of the data set.After calculating all molecular descriptors, preprocessing program eliminates 3 types of descriptors that cannot be used: 1) descriptors that not all compounds have; 2) descriptors that have a small variation in magnitude for all structures; 3) two collinear descriptors with the correlation coefficient greater than 0.8 (Wang Y. et al., 2020).The number of remaining available molecular descriptors is marked N. The steps of HM can refer to literature (Katritzky et al., 1995).
The HM performs descriptors pre-selection by the following criteria (Katritzky et al., 2023): 1) Fisher F-criteria must be greater than 1.0; 2) R 2 value should be higher than a specified threshold; 3) Student's t criterion must exceed a defined value; 4) duplicate descriptors should have a squared intercorrelation coefficient below a predetermined level, and the descriptor with a higher R 2 value relative to the property is retained.The remaining descriptors are then arranged in descending order based on their correlation coefficients.Any significant 2 parameter correlation identified by the F-criteria is further expanded recursively into an n parameter correlation until the normalized F-criteria is no longer higher than the initial value.The top N correlations which are determined by both R 2 and the F-criterion are then saved.
HM attempts to build a series of linear regression models with 1 to N molecular descriptors and find the optimal one.Molecular descriptors determined by HM model serve as independent variables of the non-linear models for the next step (Gao et al., 2022) .

Non-linear model by GEP
Considering that the relationship among the factors affecting IC 50 of PARP1 inhibitors is complex and usually non-linear, one non-linear model was established by GEP to predict IC 50 values of the compounds respectively.
As a new algorithm based on genetic algorithm (GA) and genetic planning (GP), GEP was proposed by Ferreira in 2001(Ferreira, 2001).More details about GEP algorithm can refer to literature (Zhou and Sun, 1999).And the steps of GEP can refer to literature (Ferreira, 2001;Liu et al., 2004).
The following schematic Figure 1 depicts the essential process of gene expression programming.Initially, a certain number of individuals (known as the initial population) have their chromosomes generated randomly.Subsequently, these chromosomes are expressed into syntactically correct programs, and each individual's fitness is evaluated against a set of fitness cases, also referred to as the selection environment (Ferreira, 2001).Based on their fitness, individuals are selected and modified for reproducing offspring with new traits.These newly individuals undergo the same developmental process, including genome expression, evaluation within the selection environment, selection, and modified reproduction (Ferreira, 2006).
The process is repeated for a specific number of generations or until a satisfactory solution is obtained.
GEP is widely used in science and Ferreira developed Automatic Problem Solver (APS) (Gepsoft, 2023), a commercial software integrated with GEP algorithm.The software can encode the appropriate molecular descriptors which are selected by HM and most related to inhibitor activity, and establish a non-linear model to predict IC 50 values of FTPDDs.The process of gene expression programming.

Non-linear model by RF
Random forest regression is an ensemble learning algorithm that performs regression tasks by constructing multiple decision trees and integrating their prediction results.In random forest, each decision tree is trained on randomly selected subsets of the data, reducing the risk of overfitting.The final regression result is obtained by averaging or weighted averaging the predictions of the individual trees.
The algorithm works by randomly selecting samples from the training set to form subsets, increasing the diversity of the models.At each node of each decision tree, only a portion of randomly selected features are considered, improving the robustness of the models.Decision trees are built using recursive partitioning based on the least impurity.In random forest, the optimal splitting points of each tree are determined by calculating their Gini coefficients which are the measurements of the purity of a classification model.
The value range for the Gini coefficient is from 0 to 1.A larger value indicates a higher purity of the model.The classifier for random forest is the CART tree.The Gini coefficient of the CART tree can be represented by Eq. 1.

Gini
When traversing each segmentation point of each feature, it is assumed that using the feature A = a.The set D is divided into two parts, namely, D1 (sample set that satisfies A = a) and D2 (sample set that does not satisfy A = a).The Gini coefficient of D under this feature is Eq. 2.
Random forest regression has several advantages such as handling high-dimensional and large-scale datasets, having good generalization performance to avoid overfitting, and handling missing and abnormal values.Additionally, it has strong fitting ability for data with nonlinear relationships.The non-linear model established through random forest can effectively predict the IC 50 value of FTPDDs.

Non-linear models by SVR
In order to build the model with stronger prediction ability, three non-linear models were established based on SVR with single, double, and triple kernel function respectively.
SVR is an important application branch of support vector machine (SVM).SVM, proposed by Vladimir N. Vapnik and Alexey Ya.Chervonenkis in 1964, is a generalized linear classifier (GLC) that classifies the data into binary categories under supervised learning.When SVM is applied in the field of regression analysis, it is often called SVR (Wang Y. et al., 2020).Unlike SVM which maximizes the distance from the hyperplane to the points closest to it, SVR minimizes the total deviation from all sample points to the hyperplane (Du and Wu, 2003).
A significant advantage of SVR is its simplicity in mathematical calculations, as it transforms nonlinear problems in the input space into linear problems in a high-dimensional feature space.Another advantage is that SVR can utilize probability rules to train multiple classifiers on different types of data, thus enhancing prediction accuracy by measuring the classification confidence.Compared to other regression techniques, SVR demonstrates lower computational complexity.Some parameters of support vector machines play an important role in model training.The penalty constant (C) is a regularization parameter that controls the degree of punishment for misclassified samples.A larger C value can lead to stricter classification, which may lead to the model overfitting the training data.A smaller C value allows for more classification errors, which may lead to better generalization ability.
Slack variable, which introduces tolerance that allows some samples to be at the boundary of classification errors or intervals.These parameters are usually used in conjunction with C to adjust the degree of cosmetic error.
For certain kernel functions, such as polynomial kernels and RBF kernels, there can also be additional parameters, such as the order of the polynomial, the bandwidth of the Gaussian kernel, etc.These parameters need to be adjusted according to the characteristics of the data.For imbalanced datasets, the importance of different categories can be balanced by adjusting their weights.This is very useful when dealing with imbalanced data.
The basic idea of SVR is to use a predetermined non-linear constructor to map the input vector to the high-dimensional feature space and regress in the mapped space.To avoid complex mapping operations in high-dimensional feature space, kernel function is used to realize inner product operation in the original space (Fang and Zhao, 2013).
Referring to literature (Deng and Tian, 2004), the primal problem of standard ε-SVR is as follows.
In Eq. 3, ξ i (*) means (ξ 1 , ξ 1 *, . . ., ξ k , ξ k *) T , which is the slack variable.b is the offset.ε is the maximum error allowed in regression.C is the penalty constant.The problem in Eq. 3 can be transformed into its dual problem: In this Eq.4, α(*) means (α 1 , α 1 *, . . ., α l , α l *) T which is Lagrange multiplier and K(x i , y i ) is kernel function.By solving this dual problem, the answer to the primal problem and the final regression decision function are obtained.
According to the quadratic programming method in the optimization theory, the parameters α 1 and α 1 * can be obtained in solving the dual problem.The parameter b can be gotten by using the Karush-Kuhn-Tucker (KTT) condition.In this way, the expression of the fitting function on the sample set f(x) can be constructed.Its form is given in Eq. 5.The coefficient

Kernel function of SVR
Kernel function has a great influence on the fitting effect of the SVR model.As mentioned in the part of establishing non-linear models by SVR, the job of the kernel function that realizes the inner product operation in the original space is to simplify calculation in high-dimensional space.Kernel function maps each sample point to an infinite-dimensional feature space to make the linearly inseparable data linearly separable (Wang and Chen, 2014).The popularly used kernel functions include linear kernel function, RBF kernel function, polynomial kernel function, and sigmoid kernel function (Mo et al., 2020).The first three kernel functions have been widely applied when establishing models by SVR and their introduction is as follows (Sheikhpour et al., 2018).
Linear kernel function is the simplest kernel function, which only calculates the inner product of two feature vectors.The accuracy of linear kernel function is not high (Mo et al., 2020) but it can be used to find which feature vectors are important with little computation.The form of linear kernel function is as follows.
RBF kernel function, a certain kind of scalar function that's radially symmetric, is the most widely used kernel function.RBF kernel function has a strong local learning ability, which means it can well characterize the local property of the sample, but its generalization ability is not good enough (Fang and Zhao, 2013;Ding et al., 2021).The form of RBF kernel function is given in Eq. 7, where the constant σ is the kernel radius of RBF kernel function.
The polynomial kernel function is a commonly used kernel function.The generalization ability of polynomial kernel function is strong, but its local learning ability is not good.The form of polynomial kernel function is given in Eq. 8, where the constant q is the order of polynomial kernel function.
The process of establishing three models of SVR with different number of kernel functions is as follows.
First, SVR with single kernel function was used to establish the regression model.RBF kernel function is usually selected as the single kernel function due to its good local learning ability.The form of single kernel function is as follows.

K sin gle K RBF (9)
In order to obtain the kernel function that enhances the learning and generalization ability of the model by SVR, Smits et al. proposed the multiple kernel function (Fang and Zhao, 2013).There are many combinations of kernel function, but the obtained multiple kernel function must satisfy Mercer's Theorem (Smits and Jordaan, 2002).
According to the closure characteristic of kernel functions, the derivation principle of mixed kernel functions can be strictly proven.Take X1, X2, that the Gram matrices corresponding to αK 1 + βK 2 are all positive definite.From this, it can be concluded that αK 1 + βK 2 , α, β ≥ 0 is kernel function.In addition, by limiting α + β 1, a stationary mixed kernel function can be constructed.Because the regression ability and generalization ability of SVR model are a pair of mutually balanced factors, the SVR model obtained by mixed kernel functions with different properties in a certain proportion can balance the performance of both aspects.
Considering that the generalization ability of polynomial kernel function is strong, RBF kernel function and polynomial kernel function were integrated as double kernel function to establish the SVR model with enhanced generalization ability (Yang et al., 2023).The form of double kernel function is given in Eq. 10, where the value of the variable a ranges from 0 to 1.
Considering linear kernel function can capture key vectors at the cost of little computation and therefore help to enhance the regression effects of the model, linear kernel function was added to the previous double kernel function, which formed triple kernel function to establish the regression model with better learning and generalization ability again.The form of triple kernel function is given in Eq. 11, where variables a and b are positive coefficients and the sum of a and b is less than or equal to 1.
Therefore, three models based on SVR with different numbers of kernel functions were constructed to predict the IC 50 values of compounds.

SVR model optimized by particle swarm optimization
The values of parameters when establishing SVR model have a great influence on the learning and generalization ability of the model.In the process of constructing the model by SVR with triple kernel function, six parameters need to be optimized: the insensitive parameter ε, the penalty factor C, the kernel radius of RBF kernel function σ, the coefficient of RBF kernel function a, the order of polynomial kernel function q, and the coefficient of polynomial kernel function b.And in the process of establishing the models by SVR with single and double kernel function, the first three and five parameters need to be optimized respectively.Their searching ranges are as follows.
However, the optimization of parameters when building model by SVR with single kernel function has already become a research difficulty.As for the model by SVR with multiple kernel function, with the number of parameters increasing, it is more difficult to optimize the parameters.Considering that traditional parameter-searching methods including the grid search method and the random search method are inefficient, particle swarm optimization (PSO) algorithm was used to optimize parameters in the process of establishing three models by SVR.
As a kind of bionic optimization algorithm, PSO was proposed by social psychologist Kennedy and electrical engineer Eberhart in 1995.The idea of PSO is based on birds swarm finding the optimal destination by sharing collective information.PSO algorithm takes random values in high-dimensional space to initialize the position and velocity information of particles and update the information by self and group learning of particles.In PSO algorithm, particles only transmit the optimal information in the process of iterative process, so the PSO has the advantages of fast search speed and convergence rate (Wang, 2022).
PSO adopts the velocity-position model, in which the forms of position and velocity of particle i in D-dimensional solution space are as follows. Xi To characterize which particle is in the best position among all particles, RMSE is used to evaluate the fitness of each particle.The smaller RMSE means better position and fitness.The best position of particle i is recorded as p ibest and the global best position is recorded as g best .In each iteration, the particle tracks p ibest , g best and its previous state to adjust the position and velocity at the current time.The iterative equations are as follows.
In the Eq. 15, V i (k), V i (k + 1), X i (k), X i (k + 1) are the velocity and position of the particle i at the current time and the next time respectively; rand( ) is a random number in range [0, 1].c 1 and c 2 are learning factors which are usually set to 2. ω is the weight factor and the value of ω automatically decreases with the iteration of the algorithm to accelerate the convergence rate.It is generally defined as: ω max , ω min are the maximal and minimal weight factors respectively, iter is the current number of iterations, and iter max is the maximum number of iterations.
Referring to literature (Xiong and Xu, 2006), The steps of parameter optimization by PSO are as follows.
Step 1: Initialize parameters of PSO including population size and the maximal and minimal weight factors ω max and ω min .Set the maximum number of iterators iter max .
Step 2: Each particle is randomly assigned a set of position information X (i,0) and velocity information V (i,0) .Set the individual best position p ibest of each particle to its current position X (i,0) .
Step 3: RMSE values of each particle are calculated to evaluate fitness.The initial global best position g best is set to the position of the particle with the best fitness.
Step 4: Update the position and velocity of each particle by iterative calculation according to Eqs 14-16 and calculate the RMSE of each particle to evaluate fitness.
Step 5: Compare the fitness of each particle with the fitness of its p ibest .If the fitness is better than that of p ibest , p ibest is updated to the current position, otherwise the original value of p ibest remains unchanged.
Step 6: Compare the updated p ibest of each particle with g best .If p ibest is better than g best , g best is updated to p ibest , otherwise the original value of g best remains unchanged.
Step 7: Judge whether the termination conditions are met.If the maximum number of iterations is reached or g best is not changed, terminate the iteration, otherwise return to step 4.

Property prediction and molecular docking
It is essential for newly designed molecules not only to exhibit efficacy against the target but also to possess favorable physical and chemical properties.Property explorer is a complimentary tool that predicts physical, chemical, and toxicological properties of molecules to aid in designing active drug compounds (Song et al., 2017).This tool enables researchers to construct new molecular structures and analyze their attribute values automatically.It provides valuable information on properties such as molecular weight, Partition coefficient (LogP), water solubility, topological polar surface area (TPSA), drug similarity, toxicity assessment, overall drug score, etc.
Macromolecular docking has become a key step in the drug development process (Chen et al., 2022a), facilitating the identification of potential therapeutic molecules and predicting ligand-target interactions at the molecular level (Chen et al., 2022b).To explore these interactions between the new PARP inhibitor and PARP1 at the binding site, the Sybyl-X 2.1 software package was employed (Li et al., 2012) This software allowed users to generate potential interactions between the molecules and proteins, as well as exploring the expected binding sites for molecular fitting.
The macromolecular docking technique was utilized to investigate the potential interaction between the new PARP inhibitor and the binding site of PARP1.Firstly, the chemical structure was imported into Sybyl-X software for calculation and optimization with parameters "max interactions" and "max display" as 1000 and 0.01, respectively.The molecules were then assigned Gasteiger-Hückel charges and minimized using the Tripos force field until convergence reached 0.05 kcal/mol/Å (Song et al., 2023).
Subsequently, the protein was imported into Sybyl-X software for hydrogenation, charging, and optimization.Unnecessary ligands and water molecules were removed to enable proper binding with the protein targets.
Finally, the docking results were imported into PyMol software for image optimization, and the amino acid residues and hydrogen bonds were labeled by same software.

Results of HM
540 molecular descriptors were calculated in CODESSA software.To select the molecular descriptors most related to PARP1 inhibitors, a series of linear models with the increasing number of molecular descriptors were established.Figure 2 shows the influences of the number of descriptors on R 2 , R 2 cv , and the standard deviation (S) of all compounds.
As shown in Figure 2, the values of R 2 and R 2 cv of all compounds were rising with the increasing number of molecular descriptors.
However, when the number of descriptors reached eight, adding another descriptor did not significantly improve the statistics of the model, so the eight-parameter model can be regarded as the optimal one.The eight selected molecular descriptors and their physicalchemical meanings are shown in Table 5, and their R 2 matrix is shown in Table 6.The 8-parameter model is discussed in details as follows.
R 2 of training set and test set in HM model were 0.7550, 0.9014 and their RMSE were 0.2327, 0.2378.The optimal prediction results by HM model are shown in Figure 3.In addition, R 2 cv of training set and test set in HM model are 0.7773 and 0.6798.

Results of RF
In order to verify the effectiveness of selecting descriptors through the HM method, RF was applied in the experiment for feature selection.Firstly, 540 unprocessed molecular descriptors belonging to 57 FTPDDs were exported from Codessa.Secondly, all descriptors containing default values were removed.Thirdly, according to the preprocessing steps of the heuristic algorithm, all descriptors with high collinearity and irrationality were also removed.Finally, RF was used to extract features from the remaining descriptors to select the most important 8 descriptors.Figure 5 shows the 8 descriptors selected by RF.The comparison between the descriptors selected by RF and HM was carried out to verify the effectiveness of the descriptors.In addition, the RMSE of four ring compounds and five ring compounds are 0.3078 and 0.3420, respectively, and their R 2 values are 0.7203 and 0.9002, respectively.

Results of SVR
The learning and generalization ability of SVR model depends on kernel function and the values of parameters.In the process of establishing models by SVR, there are always two parameters to be optimized: the insensitive parameter ε and the penalty factor.The analysis of the influences of the parameters to the SVR model (Fang and Zhao, 2013) is as follows.ε reflects the sensitivity of the model to the noise contained in the input vectors.The larger the ε is, the lower the fitting accuracy of the model is.C represents the tolerance of the model to the error.The greater the C is, the higher the fitting accuracy is, which causes more computation.In addition, if the model is established by SVR with multiple kernel function, the coefficients of kernel functions also need to be optimized.Considering the low efficiency of traditional parameter searching algorithms including the grid search method and the random research method when the number of parameters is increasing, PSO was employed to find the best combination of parameters.In PSO algorithm, only the optimal information is transmitted between particles, which avoids searching all combinations of parameters and accelerates convergence rate.SVR with single kernel function is commonly used, in which RBF kernel function is usually selected.In Eq. 7, the kernel radius σ represents the mean square deviation of RBF function, that is, the width of RBF kernel function in the direction of independent variable.The smaller the σ is, the better the fitting performance of RBF kernel function is, which leads to poor generalization ability (Fang and Zhao, 2013).The optimal combination of parameters by PSO is as follows.
ε, C, σ ( ) 0.024, 4.132, 0.758 ( ) The optimal prediction results of the model established by SVR with single kernel function are given in Figure 6.R 2 of training set and test set in the model were 0.9114 and 0.8599 and their RMSE were 0.0215 and 0.0491 respectively.In addition, R 2 cv of training set and test set in the model were 0.8223 and 0.7815.
Results showed that R 2 of training set was about 0.06 more than that of test set in the model established by SVR with single kernel function, which meant the model had an unsatisfactory generalization ability.Therefore, polynomial kernel function was integrated with RBF kernel function to establish the model by SVR with double kernel function.Therefore, the coefficient of RBF kernel function a and the order of polynomial kernel function q also need to be optimized.The optimal combination of parameters by PSO is as follows.
ε, C, σ, a, q 0.013, 0.658, 0.328, 0.51, 2 ( ) The ratio of RBF kernel function to polynomial kernel function was 0.51:0.49,which meant RBF kernel function and polynomial kernel function play almost equal roles.The optimal prediction results of the model established by SVR with double kernel function are given in Figure 7. R 2 of training set and test set in the model were 0.9259 and 0.9175 and their RMSE were 0.0180 and 0.0289 respectively.In addition, R 2 cv of training set and test set in the model were 0.8989 and 0.8712.However, results showed that both R 2 cv of training set and test set were less than 0.9 in the model established by SVR with double kernel function, which meant the learning and generalization ability of the model still needed to be improved.As mentioned before, linear kernel function can improve the regression ability of the model, so linear kernel function, polynomial kernel function, and RBF kernel function composed triple kernel function, which formed a novel SVR model with better learning and generalization ability.The optimal combination of parameters by PSO is as follows.
ε, C, σ, a, q, b 0.024, 1.123, 0.332, 0.31, 2, 0.58 ( ) The ratio of RBF kernel function, polynomial kernel function, and linear kernel function was 0.31:0.58:0.11,which indicated that polynomial kernel function played the most important role in realizing inner product operation of kernel function.The optimal prediction results of SVR with triple kernel function are given in Figure 8. R 2 of training set and test set in the model were 0.9353 and 0.9348 and their RMSE were 0.0157 and 0.0228 respectively.In addition, R 2 cv of training set and test set in the model were 0.9090 and 0.8971.

Design of new FTPDDs
Through the analysis of the molecular descriptors adopted in models, the structural factors that influence the IC 50 values of the FTPDDs were identified.The non-standardized coefficients in table 2 represent the slope of the regression equation for each independent variable and indicate the magnitude of change in the dependent variable (IC 50 ) with respect to each independent variable.These coefficients are not dependent on the unit of the independent variables and can reveal the influence of various factors on IC 50 activity.
The descriptors included in the HM model provide valuable insights into the factors related to IC 50 activity: (1) "MENANNB" refers to the interaction force between the electrons and nucleus in the bond between two nitrogen atoms.Reducing this value leads to a significant reduction in IC 50 .(Clementi, 1980).( 2) "MEEROA" represents the minimum recurrence between electrons in an oxygen atom.As its coefficient in the HM model is negative, increasing the MEEROA value results in gradually decrease in the IC 50 value.(Clementi, 1980).( 3) "MRECCB" reflects the maximum response energy of a C-C bond.In the HM model, it has a positive regression coefficient, meaning that smaller MRECCB value indicates stronger inhibitory ability of PARP.(Clementi, 1980).( 4) "MBCM" quantifies the contribution of each atomic orbital in a molecular orbital to the formation of a chemical bond.In the HM model, a positive regression coefficient suggests that smaller MBCM value leads to a lower IC 50 value and higher activity.(Clementi, 1980).( 5) "ANRINA" describes the relative reactivity of nitrogen atoms as nuclear agents in chemical reactions.The negative coefficient in the HM model implies that increasing the ANRINA value leads to gradual decrease in the IC 50 value.(Franke, 1984).( 6) "KA (O3)" reflects the connectivity and arrangement of atoms in a molecule.The negative regression coefficient indicates Influences of the number of descriptors on R2, R2cv, and S2 of all compounds.that larger KA (O3) value corresponds to a smaller IC 50 value and higher activity.(Kier, 1985).( 7) "HDH/T (QCP)" describes hydrogen bond interactions on the molecular surface.A negative regression coefficient suggests that higher HDH/T (QCP) value leads to a smaller IC 50 value and higher activity.(Clementi, 1980).( 8) "MRECGB" refers to the context of a molecular orbital (MO) particle in bonding in a chemical molecule or specifications.
In the HM model, the negative regression coefficient indicates that larger MRECGB value results in a smaller IC 50 value and higher activity.(Clementi, 1980).
In summary, the interpretation of the HM model and molecular descriptors has revealed several factors affecting inhibitory activity.To design new compounds, it is beneficial to reduce polar interactions between molecular atoms and alter the characteristics of different charge distributions of N atoms.
To obtain an ideal inhibitor structure, the structural composition of compound 44 which is the most effective compound in the literature can be modified based on these factors.Molecular structure adjustments can focus on the R region shown in Figure 9. Specifically, changes in the benzene ring with its 6 C atoms may be favorable for achieving the desired distribution of different charges.
Some chemical functional groups were incorporated into positions R 1 to R 4 , utilizing a random combination approach to minimize polar interactions between atoms.These functional groups include halogen, hydroxyl, carboxyl, aldehyde, hydrocarbon, as well as various forms of carbon and nitrogen atoms.
Through continuous and purposeful adjustments and combinations, a set of 126 molecules was designed based on analysis of descriptors in the HM model.Subsequently, the physical and chemical parameters of the newly designed molecules were calculated using CODESSA software.By inputting these parameters into the HM model, the IC 50 value of each molecule was predicted.If the predicted IC 50 value was lower than that of compound 44, the corresponding molecule would be selected for further analysis and macromolecular docking study in the Property explorer applet.Table 7 shows the predicted IC50 by HM and Docking total score of new FTPDDs.
As a result of the analysis, the predicted IC 50 values of six new compounds indicate that these compounds have preferable

Property prediction of new FTPDDs
In order to predict properties of the new compound, the Property explorer applet (PEA) was applied in the experiment.This applet provides real-time predictions of physico-chemical properties and identification of potential toxicity risks for any chemical structure drawn.It can evaluate many properties of compounds, including partition coefficient, water solubility, topological polar surface area (TPSA), molecular weight, etc. Table 8 shows the predicted IC 50 by HM and properties by PEA of newly designed compounds.
The partition coefficient, abbreviated P, is defined as a particular ratio of the concentrations of a solute between the two solvents and the logP is the logarithm of the ratio.The LogP value represents the logarithm of the partition coefficient between n-octanol and water, which is a standard measure of a compound's hydrophilicity.It has been established that the LogP value of compounds with good potential absorption should not exceed 5.0 (Kwon, 2001).
Water solubility significantly influences the intestinal absorption and cellular distribution characteristics of compounds.Higher solubility means better absorption.The goal drug design is to obtain compounds with higher water solubility.
TPSA is the sum of all topological polar regions on the molecular surface and is closely related to various bioavailability-related characteristics, such as permeability through the Blood-brain barrier (Ertl et al., 2000).
Molecular weight plays a role in the biological activity of compounds.Lower molecular weight compounds are more easily absorbed and distributed.
Drug similarity is utilized in new drug design to evaluate the "similarity" of the compound with factors such as bioavailability (Smith, 2011).

Molecular docking of new FTPDDs
During the lunar docking experiments, the newly designed compounds were employed as ligands to dock with PARP (pdb code 7CMW).Remarkably, compound 44a exhibited the most favorable performance in the macromolecular docking, achieving remarkable total score of 7.7065 which significantly surpassing that of compound 44.The detailed binding mode of compound 44a is presented in Figure 10,   illustrating the formation of two crucial hydrogen bonds with specific residues.Based on the docking conformation of compound 44a, the N atom establishes a hydrogen bond with the residue GLY-863, which aligns with the binding pattern observed in compound 44.Additionally, the nitrogen atoms from the newly incorporated structural component also form hydrogen bonds with SER-904.The strong bond reaction observed between compound 44a and PARP suggests that it could potentially serve as a promising candidate inhibitor for this protease.
Molecular docking requires the preparation of ligands and protein structures.Maestro is required for Protein Preparation.Firstly, energy minimization is performed, and then the receptor is isolated from the protein.Minimize the energy of the generated compound in Chem3D using MM2, align the processed compound as a ligand in PyMol, and finally calculate the RMSD between the generated compound and the existing ligand (Wang and Zhang, 2023).9 shows RMSD values between the selected docking pose of 7 cmw and the experimental X-ray structure.

Comparison of different models
To verify the robustness of the models, k-fold cross-validation was used for model evaluation and the value of k was set to 5. All optimal predicted results of models based on HM, GEP, RF and SVR with single, double, and triple kernel function and their R 2 cv are given in Table 10 respectively.
It is obvious that the non-linear model established by SVR with triple kernel function shows the strongest prediction ability and model robustness than that of other models.Compared with the linear model built by HM, nonlinear models can better describe complicated problems.Compared with GEP which is easily trapped in the local optimal solution of the problem, SVR is a convex quadratic optimization method, which makes its local optimal solution the global optimal one.Compared to RF, the kernel function of SVR can be selected and adjusted according to the nature of the problem to adapt to different data distributions and patterns.Compared with SVR with single kernel function, the difference between R  The design strategy mainly focused on the R region of compound 44.Furthermore, to verify the effect of compounds with different ring numbers on the predicted results, the original dataset was divided into tetracyclic compounds and pentacyclic compounds, and the R 2 and RMSE of six models related to these two compounds were calculated respectively.Table 10 shows that the prediction results by six models related to pentacyclic compounds better fit the measured values.R 2 of three SVR models related to tetracyclic compounds improve significantly.Furthermore, the improvement

Conclusion
In this study, 6 eight-parameter QSAR models were established by HM, GEP, RF and SVR with single, double, and triple kernel function to predict the biological activity of 57 FTPDDs as PARP1 inhibitors respectively.Compared with other models, the model established by SVR with triple kernel function shows the strongest prediction ability and robustness, which indicates that the method of SVR with triple kernel function has good potential for constructing models to predict the biological activity of compounds and guiding drug design.In addition, the PSO algorithm shows a strong parameter-optimized ability in the process of establishing SVR model due to its characteristic of high searching speed and fast convergence rate, which means PSO has good potential for optimizing parameters when building SVR model.Furthermore, the model established by SVR with triple kernel function shows 8 important factors that have a great influence on the biological activity of PARP1 inhibitors, which will guide new drug design and screening for breast cancer.Six FTPDDs were designed using these 8 important factors and molecular docking experiments were conducted on them.The properties of new derivatives were ultimately verified, and the effectiveness of the SVR model was also demonstrated.

FIGURE 1
FIGURE 1 obtained in the 199th generation.R 2 of training set and test set in GEP model were 0.7395, 0.7818 and their RMSE were 0.2520 and 0.2768.R 2 cv of training set and test set in GEP model were 0.7302 and 0.6998.The optimal prediction results by GEP model are shown in Figure 4.The mathematical equation of the non-linear model built by GEP is as follows.
The 8 descriptors filtered out are Max resonance energy for a H-N bond (MREAH), HASA-2/TMSA [Zefirov's PC] (HTZP), Wiener index (WI), Min e-e repulsion for a O atom (MEERAO), Max atomic state energy for a C atom (MASEAC), Final heat of formation (FHF), Min e-n attraction for a N-N bond (MENAAN), Min e-e repulsion for a C-H bond (MEERAC).Min e-e repetition for a O atom and Min e-n attraction for a N-N bond are the two most important descriptors selected through HM.Both descriptors were successfully screened by RF, demonstrating the effectiveness of the HM algorithm.Furthermore,5-fold cross-validation was used to obtain an R 2 cv of 0.87 for the training set and 0.86 for the test set.The MAE values for the training and testing sets are 0.1518 and 0.0821, respectively.

FIGURE
FIGUREPlot of measured and predicted lg (IC50) by HM.

FIGURE
FIGUREEight descriptors selected by RF.

FIGURE
FIGUREPlot of measured and predicted lg (IC50) by SVR with single kernel function.
2 of training set and test set in the model by SVR with double kernel function decrease from 0.0515 to 0.0084, which demonstrated the addition of polynomial kernel function did improve the generalization ability of SVR model.Compared with SVR with double kernel function, R 2 of training set and test set in the model by SVR with triple kernel function increased by 0.0094 and 0.0173, which demonstrated the addition of linear kernel function was helpful to improve the learning and generalization ability of SVR model.

FIGURE 7
FIGURE 7Plot of measured and predicted lg (IC50) by SVR with double kernel function.

FIGURE
FIGUREPlot of measured and predicted lg (IC50) by SVR with triple kernel function.

TABLE 1
Measured and predicted lg (IC 50 ) of FTPDDs 1-16.*The compounds of the test set.

TABLE 1 (
Continued) Measured and predicted lg (IC 50 ) of FTPDDs 1-16.*The compounds of the test set.Measured and predicted lg (IC 50 ) of FTPDDs 17-34.*The compounds of the test set.

TABLE 3
Measured and predicted lg (IC 50 ) of FTPDDs 35-54.*The compounds of the test set.

TABLE 4
Measured and predicted lg (IC 50 ) of FTPDDs 55-57.*The compounds of the test set.

TABLE 5
The selected descriptors and their physical-chemical meanings and coefficient.

TABLE 6 R
2 matrix of the eight descriptors.Predicted IC50 by HM and Docking total score of new FTPDDs.

TABLE 8
Predicted IC 50 by HM and properties by PEA of newly designed compounds.TABLE 9 RMSD values between the selected docking pose of 7 cmw and the experimental X-ray structure.

TABLE 9 (
Continued) RMSD values between the selected docking pose of 7 cmw and the experimental X-ray structure.

TABLE 10
Comparison of Statistical parameters of different methods.SVR kernel function can better improve the performance of models related to tetracyclic compounds.In addition, to demonstrate the external predictive ability of the model, Table10presents the calculation results of the three statistics Q 2 F1 , Q 2 F2 , and CCC under six different models. of