SYSTEMATIC REVIEW article

Front. Bioeng. Biotechnol., 07 July 2022

Sec. Bioprocess Engineering

Volume 10 - 2022 | https://doi.org/10.3389/fbioe.2022.788300

Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field

  • 1. Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico

  • 2. Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico

  • 3. Instituto de Investigaciones Filosóficas, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico

  • 4. Instituto Nacional de Pediatría, Mexico City, Mexico


Abstract

Proteins are some of the most fascinating molecules in the universe, and they pose a formidable challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit–explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space corresponding to gigabytes of life data in its order of magnitude. Many tasks, such as novel protein design, protein folding pathways, and synthetic metabolic routes, as well as protein-aggregation mechanisms, the pathogenesis of protein misfolding and disease, and proteostasis networks, are currently unexplored or unrevealed. In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap in what we call the artificial intelligence (AI)–protein science (PS) binomial, a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring “the state of the art” in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including: Which tasks have already been explored by ML approaches to protein science? What are the most common ML algorithms and databases used? What is the situational diagnosis of the AI–PS inter-field? What do ML processing steps have in common? We also formulate novel questions, such as: Is it possible to discover the rules of protein evolution with the AI–PS binomial? How do protein folding pathways evolve? What are the rules that dictate the folds?
What are the minimal nuclear protein structures? How do protein aggregates form, and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins? We are a cross-functional group of scientists from several academic disciplines, and we conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were finally selected for further analysis. A summary of our findings is as follows: regarding AI applications, there are mainly four types: 1) genomics, 2) protein structure and function, 3) protein design and evolution, and 4) drug design. In terms of the ML algorithms and databases used, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniProtKB/Swiss-Prot were the most common ones (21% and 8%, respectively). Moreover, we identified that approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI–PS binomial. All research efforts to collect and integrate multidimensional data features and then analyze and validate them are, so far, uncoordinated and scattered throughout the scientific literature without a clear epistemic goal or connection between the studies.
Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design and in protein structure, design, and function prediction, while also presenting the “state of the art” of research on the AI–PS binomial up to February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins, protein networks, and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of these novel proteins and metabolic pathways do not currently exist in nature, nor are they used in the chemical industry or the biomedical field.

Introduction

Protein science is witnessing the most exciting and demanding revolution in its own field; the magnitude of its genetic–epigenetic information (molecular networks, inhibitors, activators, modulators, and metabolites) is astronomical. It is organized in an open “protein self-organization, adjustment, and fitness space”; for example, a protein of 100 amino acids has 20^100 possible sequence variants, and the search for the native conformation of a 100-amino-acid protein must navigate ~10^46 conformations before reaching a unique native state, with the protein data exceeding many petabytes (1 petabyte is 1 million gigabytes) (Kauffman, 1992).
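To make these magnitudes concrete, the arithmetic can be checked in a few lines of Python (a sketch only; the 20-letter alphabet, the 100-residue length, and the ~10^46 conformation figure are taken from the text above):

```python
import math

# Combinatorial spaces for a 100-residue protein (figures cited above,
# after Kauffman, 1992): a 20-letter amino acid alphabet and ~10^46
# candidate backbone conformations en route to a unique native state.

n_residues = 100
alphabet = 20

# Sequence space: 20^100 possible sequences, i.e. roughly 10^130.
print(round(n_residues * math.log10(alphabet)))  # → 130

# Even sampling one conformation per femtosecond (1e-15 s), exhaustively
# visiting 10^46 conformations would take on the order of 10^31 seconds.
print(f"{10**46 * 1e-15:.0e}")  # → 1e+31
```

The second figure is the classic Levinthal-style argument for why folding cannot proceed by exhaustive search of the conformational space.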

Therefore, the use of artificial intelligence in protein science is creating new avenues for understanding the ways of organizing and classifying life within its organisms in order, eventually, to design, control, and improve this organization. In this respect, protein synthesis is a case in point. Indeed, the discovery of the underlying mechanism of protein synthesis is an inter-field discovery, that is, “a significant achievement of 20th century biology that integrated results from two fields: molecular biology and biochemistry” (Baetu, 2015). More recently, the field of protein science has, in turn, become another inter-field enterprise, this time between molecular biology and computer science, or better said, between the members of a cross-functional team of researchers (biochemists, protein scientists, protein engineers, systems biologists, bioinformaticians, among others). Nowadays, it is possible to classify, share, and use a significant number of structural biology databases that help researchers throughout the world. Once the mechanism by which DNA directs protein synthesis was deduced, it became possible to model it via computational strategies through artificial intelligence (AI) and machine learning (ML) algorithms that can provide important information such as pattern recognition, nearest neighbors, vector profiles, and backpropagation, among others. AI has been used to exploit this novel knowledge to predict, design, classify, and evolve known proteins with improved and enhanced properties and applications in protein science (Paladino et al., 2017; Wardah et al., 2019; Cheng et al., 2008; Bernardes and Pedreira, 2013), which, in turn, makes its way toward solving complex problems in the “fourth industrial revolution” and opening new areas of protein research that are growing at a very fast pace.

Machine learning techniques are a subfield of AI that has become popular due to its capacity to process linear and non-linear data and the large amount of available combinatorial space. As a result, sophisticated algorithms have emerged, promoting the use of neural networks (Gainza et al., 2016). However, in spite of the large amount of research done in protein science, as far as we know, there are neither systematic reviews nor any biochemical meta-analyses in the scientific literature informing, illuminating, and guiding researchers on the best available ML techniques for particular tasks in protein science, although there have been recent reviews such as the work of AlQuraishi (2021), Dara et al. (2021), and Hie and Yang (2022), which show that this inter-field is evolving. By a biochemical meta-analysis, we mean an analysis resulting from two processes: identification and prediction. The former consists of identifying AI applications in the protein field, where we classify and identify active and allosteric sites, molecular signatures, and molecular scaffolding not yet described in nature.

Each structural signature, pattern, or profile constitutes a singular part of the whole “Lego-structure kit” that is the protein space, which includes the catalytic task space and shape space, which Kauffman (1992) defines as an abstract representation or mapping of all shapes and chemical reactions that can be catalyzed onto a task space. The latter process is an analysis of the resulting predictions of structures, molecular signatures, regulatory sites, and ligand sites. Both processes are related in the sense that the proteins in the identification process are search targets of the 3D structure for the prediction process, which predicts the protein conformation multiple times from a template family or using a model-free approach. The biochemical meta-analysis includes formulating the research question, searching for and classifying protein tasks in the selected studies, gathering AI–PS information from the studies, evaluating the quality of the studies, analyzing and classifying the outcomes of the studies, building up tables and figures for the interpretation of evidence, and presenting the results.

This study puts forward the use of ML classes and methods to address complex problems in protein science. Our point of departure is the state of the art of the AI–PS binomial; by binomial, we mean a name consisting of two terms that are partners, computational science on one side and biomedical or biotechnological science on the other, a “two-feet principle” for understanding, enhancing, and controlling protein science development from an artificial intelligence perspective. Our cross-functional team aims at accelerating the translation of basic scientific knowledge from protein science laboratories into AI applications. Here, we report a comprehensive, balanced systematic review of the literature in the inter-field and a biochemical meta-analysis, which includes a classification of the screened articles: 1) by the ML techniques they use, narrowing down the subareas, 2) by the classes, methods, algorithms, prediction type, and programming language, 3) by some protein science queries, 4) by protein science applications, and 5) by protein science problems. Moreover, we present the main contributions of AI to several tasks, as well as a general outline of the processes carried out throughout the construction of the models and their applications. We outline a discussion on the best practices of validation, cross-validation, and individual control of testing ML models in order to assess the role they play in the progress of ML techniques, integrating several data types and developing novel interpretations of computational methodology, thus enabling a wider range of protein-universe impacts. Finally, we provide future directions for machine learning approaches in the design of novel proteins, metabolic pathways, and the synthetic redesign of protein networks.
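As a minimal illustration of the cross-validation practice discussed above (a pure-Python sketch with a hypothetical interleaved-fold scheme, not the protocol of any reviewed study):

```python
# Minimal k-fold cross-validation sketch: each sample serves as test
# data exactly once, and the model is trained on the remaining folds.

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 5))
print(len(splits))           # → 5
print(sorted(splits[0][1]))  # → [0, 5]
assert all(len(tr) + len(te) == 10 for tr, te in splits)
```

Averaging a performance metric (accuracy, AUC, MCC, etc.) over the k held-out folds is what the reviewed articles report as, e.g., "5-fold cross-validation."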

Materials and Methods

A systematic review of the scientific literature (covering publications until February 2021) was carried out for this study (Figures 1–3), following the PIO (participants/intervention/outcome) approach and the PRISMA declaration (Preferred Reporting Items for Systematic Reviews and Meta-Analyses; see Supplementary Material). No ethical approval or letter of individual consent was required for this research.

FIGURE 1

FIGURE 2

FIGURE 3

PIO Strategy

One of the main objectives is to discuss the latest findings about the functions of AI in protein design. Furthermore, this review and meta-analysis intend to cover a wide scope of the status of artificial intelligence in protein science. The PIO (participants, intervention, and outcome) strategy was used to systematically search all databases and to address the following research questions: What is the state of the art in the use of artificial intelligence in the protein science field? What is the use of neural networks in the rational design of proteins? Which neural networks are used in the rational design of proteins? Protein design is currently considered a challenge; as artificial intelligence makes progress, it is presented as a solution to various issues, addressing how this new branch can be used for the creation of high-precision models in protein design. Following the PIO strategy, the following terms were used in the search.

Participants:

articles about proteins and their MeSH terms in general were considered for inclusion; we gave special consideration to protein design and their related terms such as scaffold (as a main structure or template), rational design, and biocatalysts (as a main task target for protein evolution and design in the chemical–biotechnological industry and biomedical field):

  • protein

  • protein design

  • scaffold

  • rational protein design

  • biocatalysts

Intervention: studies using any type of algorithm, software, programming language, platform, or paradigm, alone or in combination, were selected.

Types of algorithms:

  • neural networks

  • recurrent neural networks

  • LSTM/GRU networks

  • convolutional neural networks

  • deep belief networks

  • deep stacking networks

  • C5.0

  • genetic algorithms

  • artificial intelligence

  • decision trees

  • classification

  • prediction C&A

Software:

  • Weka

  • RapidMiner

  • IBM Modeler

Programming languages:

  • Python

  • Java

  • OpenGL

  • C++

  • Shell

Development platform:

  • Caffe

  • DeepLearning4j

  • TensorFlow

  • IBM distributed deep learning (DDL)

Paradigm:

  • supervised learning

  • unsupervised learning

  • reinforcement learning

Outcomes:

  • novel proteins

  • protein structure prediction

  • novel biocatalysts

  • new fold

  • evolved protein

  • new function

Databases and Searches

The electronic databases used were PubMed, Bireme, EBSCO, and OVID. Similar concepts were combined with “OR,” and the groups corresponding to each element of the PIO research were combined with “AND.” Next, a diagram was constructed to show the history of the searches and the concepts used (figure tree diagram). This figure describes in full detail the search strategy in the PubMed database as well as all keywords used; moreover, it includes the number of resulting articles. Subsequently, the results obtained from these searches were recorded, and the references were downloaded into the Mendeley database. All references were organized and saved in Mendeley, eliminating duplicates for the final result.
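The combination rule can be sketched as follows (term lists abbreviated from the PIO strategy above; the helper function is hypothetical and does not reproduce the exact PubMed syntax used):

```python
# Sketch: synonyms within one PIO group are joined with OR, and the
# parenthesized groups are then joined with AND (term lists abbreviated).

participants = ["protein", "protein design", "scaffold", "biocatalysts"]
intervention = ["neural networks", "genetic algorithms", "artificial intelligence"]
outcomes = ["novel proteins", "protein structure prediction", "new fold"]

def or_group(terms):
    """Parenthesized OR-combination of synonymous search terms."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# One AND-joined query covering all three PIO groups.
query = " AND ".join(or_group(g) for g in (participants, intervention, outcomes))
print(query)
```

The printed string has the shape ("term" OR "term" OR …) AND (…) AND (…), mirroring how the PIO groups were combined across databases.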

Biochemical Meta-analysis

The biochemical meta-analysis included formulating the research question, searching and classifying protein tasks in the 144 selected studies, gathering AI–PS information from the 144 studies, evaluating the quality of the studies (as described in the systematic review, see flowchart of PRISMA), analyzing and classifying the intervention and outcome of studies (networks, software, programming languages, development platforms, paradigms, novel proteins, novel scaffold, new fold, etc.), and building up tables and figures for the interpretation of evidence and presenting the results.

By a biochemical meta-analysis, we mean an analysis resulting from two processes: identification and prediction. The former consists of identifying AI applications in the protein field: classifying and identifying active and allosteric sites, molecular signatures, and molecular scaffolding not yet described in nature, each of which constitutes a single part of a grand-type Lego structure. The latter is an analysis of the resulting predictions: structures, molecular signatures, regulatory and ligand sites, etc.

Biochemical Meta-Analysis and Designing the Road Map

PRELIMINARY: we formulated the problem and the objectives of the research within the figure, which includes the treatment of the data and their applications. Note: the information was acquired from a list of various databases from which the data were analyzed.

DATA COLLECTION: primary data: observation, research, and review of articles. Secondary data: data from the reviewed articles and information shared among keywords.

DATA PRE-PROCESSING (ETL and training): identification of filtered data, curated data, and features implemented; machine learning input relationship with protein science servers.

DATA PROCESSING (training data and feature extraction): observation of input data and data encoding format. Record of machine learning algorithms and methods. Recognition of key information for processing data within databases.

DATA POST-PROCESSING: observation of post-processing treatment, rule quality processing, filtering, combination, or unification of information.

MEASURE: explanation of the process, the values of different metrics for the quantification of magnitudes, and the contribution for the completion within the process of information.

ANALYZE: identify the machine learning algorithms applied, including the input dataset, the data encoding format, the training set, and the 3D structures processed.

IMPROVE: determine the sets to which these new forms will be applied in models of the researched data, contributing to future implementations in protein science.
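The pre-process/process/post-process organization referred to throughout this section can be sketched as a toy pipeline (all function names, the toy records, and the trivial length-threshold "model" are hypothetical illustrations, not any reviewed study's code):

```python
# Toy sketch of the three-stage organization: ETL-style pre-processing,
# model training, and post-processing/evaluation (all names hypothetical).

def pre_process(raw_records):
    """ETL stage: filter/curate raw database entries and extract features."""
    curated = [r for r in raw_records if r.get("sequence")]        # filtering
    return [{"x": len(r["sequence"]), "y": r["label"]} for r in curated]

def process(examples):
    """Training stage: fit a trivial placeholder model on the features."""
    threshold = sum(e["x"] for e in examples) / len(examples)      # mean length
    return lambda x: int(x > threshold)                            # "model"

def post_process(model, examples):
    """Evaluation stage: unify predictions into a simple accuracy metric."""
    hits = sum(model(e["x"]) == e["y"] for e in examples)
    return hits / len(examples)

records = [{"sequence": "MKT", "label": 0}, {"sequence": "MKTAYIAKQR", "label": 1}]
data = pre_process(records)
model = process(data)
print(post_process(model, data))  # → 1.0
```

Real studies substitute curated database queries for `pre_process`, an ML algorithm for `process`, and metric computation (AUC, MCC, cross-validation) for `post_process`.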

Concerning the computational aspects of how the articles were classified, three initial divisions were made, as displayed in Table 1: pre-process, process, and post-process, each of which contains, in turn, the following items:

TABLE 1

Author/Year of Publication/Setting | Classes of machine learning | Methods | Algorithms | Protein Query | Characteristics | Strengths | Limitations | Validation and performance
Study characteristics and algorithm aspects
 Song J., 2021, China (Song et al., 2021)Connectionist and SymbolistAn ensemble predictor with a deep convolutional neural network and LightGBM with ensemble learning algorithmCNN, LightGBMA sequence-based prediction method for protein–ATP-binding residues, including PSSM, the predicted secondary structure, and one-hot encodingThe CNN frameworks are proposed as a multi-incepResNet-based predictor architecture and a multi-Xception-based predictor architecture. LightGBM, as a Gradient Boosting Decision Tree (GBDT) for classification and regression merged by an ensemble learning algorithmThe model enriches the protein–ATP-binding residue prediction ability using sequence information. Outstanding performance using an ensemble learning algorithm in combination with a deep convolutional neural network and LightGBM as an ATP-binding toolDistribution of the specific weights was calculated according to the ratio between the positive instances and the negative instances to solve the imbalance problem. The sensitivity prediction was only 0.189. This can be attributed to its very limited prediction coverage and the limited number of sequences in the training setAUC (0.922 and 0.902), MCC (0.639 and 0.0642), and 5-fold cross-validation
 Verma N., 2021, US (Verma et al., 2021)ConnectionistA DNN framework (Ssnet) for protein–ligand interaction prediction, which utilizes the secondary structure of proteins extracted as a 1D representation based on the curvature and torsion of the protein backboneDNNInformation about locations in a protein where a ligand can bind, including binding sites, allosteric sites, and cryptic sites, independently of the conformationCurvature and torsion of the protein backbone, feature vector for the ligand. Multiple convolution networks with varying window sizes as branch convolutionThe model does not show biases in the physicochemical properties and necessity of accurate 3D conformation while requiring significantly less computing time. Fast computation once the model is trained with weights fixed. No requirement of high-resolution structural dataSsnet being blind to conformation limits its capability to account for mutations resulting in the same fold but a significant difference in binding affinity. Ssnet should be treated as a tool to cull millions of drug-like molecules and not as an exact binding affinity prediction toolAUC, ROC, and EF scores
 Bond S., 2020, US (Bond et al., 2020)ConnectionistCCP4i2 Buccaneer automated model-building pipelinePDBCorrectness of protein residuesVisual examination by the crystallographer. Coot provides validation tools to identify Ramachandran outliers, unusual rotamers, and other potential errors, as well as an interface to some tools from MolProbityNo cutoff has to be chosenIt may also have difficulties in that a residue built into the solvent 5 Å away from the structure is no different from one 10 Å awayCOD for main chain 0.751; COD for side chain 0.613
 Kwon Y., 2020, Korea (Kwon et al., 2020)ConnectionistA new neural network model for binding affinity prediction of a protein–ligand complex structure3D-CNNProtein–protein complexes in a 3D structureEnsemble of multiple independently trained networks that consist of multiple channels of 3D CNN layers. Protein–ligand complexes were represented as 3D grids of the voxelized binding pocket and ligandHigher Pearson coefficient (0.827) than the state-of-the-art binding affinity prediction scoring functions. Accurate ranking of the relative binding affinities of possible multiple binders of a protein, comparable to the other scoring functionsFor docking power, the AK-score-single model is not as prominent as the other criteria modelsSpearman and Pearson correlation coefficients
 Li H., 2020, France, Hong Kong (Hongjian et al., 2021)Connectionist, Symbolist, and AnalogistAnalyzed machine learning scoring functions for structure-based virtual screeningRF, BRT, kNN, NN, SVM, GBDT, multi-task DNN, XGBoostComparison and review of machine learning scoring functions and classical scoring functionsMachine learning-based scoring functions perform better than classical scoring functions, outperforming the average classical methodsMachine learning-based scoring functions have introduced strong improvements over classical scoring functions in benchmarks for SBVS. Current SBVS benchmarks do not actually mimic real test sets, and thus their ability to anticipate prospective performance is uncertainN/A
 Liang M., 2020, China (Liang and Nie, 2020)ConnectionistMethod that uses the relation between amino acids directly to predict enzyme functionRN, LSTMState description matrix containing structural information in four parts: amino acid name (N), angles φ and ψ (A), relative distance (RD), and relative angle γ (RA)A three-layer MLP; a four-layer MLP; a three-layer MLP, all with ReLU nonlinearities. The final layer was a linear layer that produced logits for optimization with a softmax loss functionStructural relationship information of amino acids and the relationship inference model can achieve good results in protein functional classificationThe model is currently only for single-label classification rather than multi-label classification and only predicts proteins approximately into six major classes. Training takes considerable time during the entire experiment; further optimization is necessary to improve performanceAccuracy, ROC curve, AUC, 3-fold cross-validation
 Nie J., 2020, Singapore, Taiwan (Sua et al., 2020)Probabilistic inference, symbolist, and analogistIdentification of lysine PTM sites from a convolutional neural network and sequence graph transform techniquesRF, SVM, MNB, LR, Max Entropy, KNN, CNN, MLPA computational technique to improve the identification of reaction sites for multiple lysine PTM sites in a protein sampleImproves the performance of identifying lysine PTM sites by using a novel combination of convolutional neural networks and sequence graph transformAs the current model being proposed is a multilabel model, it is very generalizable, especially when it comes to combinations of multilabels that the dataset does not have. In addition, such combinations of multilabels will increase the test sample size and provide a better idea of the accuracy of the modelDeep learning models are black-box models and may not be very useful for trying to understand the causes of PTMs and how to affect them. Scientists would like to know the cause and effect in order to propose disease modification methods, rather than just pure identification of PTMsCross-validation, precision, accuracy, recall, Hamming loss
 Qin Z., 2020, US (Qin et al., 2020)ConnectionistA method that learns how an amino acid sequence folds into a protein structure, along with the phi–psi angle information, for high-resolution protein structureMNNNPrediction with only the primary amino acid sequence, without any template or co-evolutionary informationPerforms labeling of dihedral angles, combined with the sequence information, allowing the phi–psi angle prediction and building of the atomic structurePrediction takes six orders of magnitude less time. Prediction of the structure of an unknown protein is achieved, showing a great advantage in the rational design of de novo proteinsPrediction accuracy can be further improved by incorporating new structures to refine the modelPrediction accuracy (85%)
 Savojardo C., 2020, Italy (Savojardo et al., 2020a)ConnectionistA method for protein subcellular localization predictionDeepMITO, 1D-CNNPerforming proteome-wide prediction of sub-mitochondrial localization on representative proteomesIts major characteristic is to combine proteome-wide experimental data with the predicted annotation of subcellular localization at the submitochondrial level and complementary functional characterization in terms of biological processes and molecular functions. Evolutionary information, in the form of Position-Specific Scoring Matrices (PSSM)The model allows users to search for proteins by organism, mitochondrial compartment, biological process, or molecular function and to quickly retrieve and download results in different formats, including JSON and CSVN/AMCC coefficient
 Wang M., 2020, US (Wang M. et al., 2020a)SymbolistA topology-based network tree, constructed by integrating the topological representation and NetTree for predicting protein–protein interactions (PPI)TopNetTree, CNN, GBTProtein structures, protein mutation, and mutation typeConvolutional Neural Networks, used in their TopNetTree model, as a second module consisting of the CNN-assisted GBT modelThe proposed model achieved significantly better Rp than those of other existing methods, indicating that the topology-based machine learning methods have a better predictive power for PPI systemsBoth GBTs and neural networks are quite sensitive to systematic errors in the training of a model. The ΔΔG of 27 non-binders (–8 kcal mol⁻¹) did not follow the distribution of the whole datasetPearson coefficient (Rp) = 0.65/0.68 and 10-fold cross-validation
 Wardah W., 2020, Australia, Fiji, Japan, US (Wardah et al., 2020)Pattern recognitionA convolutional neural network to identify the peptide-binding sites in proteinsCNNAmino acid residues to create the image-like representations by feature vectorsSets of convolution layers for image operations, followed by a pooling layer and a fully connected layer. The internal weights of the network were adjusted using the Adam optimizer. Bayesian optimization uses calculated values for configuring the model’s hyper-parameters based on prior observationsThe model is able to predict a protein sequence with the highest sensitivity compared to any other toolNeeds improvement, especially in reducing the number of non-binding residues that get falsely classified as binding sites; better feature engineering to produce better protein–peptide-binding site prediction results; a more advanced computing environmentSensitivity, specificity, AUC, ROC curve, and MCC coefficient
 Yu C., 2020, Taiwan, US (Yu and Buehler, 2020)ConnectionistA deep neural network model based on translating protein sequences and structural information into a musical score, reflecting secondary structure information and information about the chain length and different protein moleculesRNN, LSTMA vibrational spectrum of the amino acid, comprising amino acid sequence, fold geometry, or secondary structureThe RNN layers, Long Short-Term Memory units, are for time-sequence features, alongside dynamical conditioning. The attention dynamical conditioning model monitors the note velocity changes of the note sequencesThe deep neural network is capable of training, classifying, and generating new protein sequences, reproducing existing sequences, and producing completely new sequences that do not exist yet. The model generates new proteins with an embedded secondary structure approachThe method could be extended to address folded structures of proteins by including more spatial information (relative distance of residues, angles, or contact information), as well as by adding combined optimization algorithms, such as genetic algorithmsMolecular dynamics equilibration with normal mode analysis
 Cui Y., 2019, China (Cui et al., 2019)Pattern recognitionA sequence-based deep learning model for ab initio protein–ligand-binding residue predictionDCNNProtein sequences used to construct several features for the input feature mapFirst representation: an amino acid sequence as an m × d matrix. First convolutional layer with a k × d kernel size. Stage 1, with Plain(k × 1, 2c) the same as for Block(k × 1, 2c). Stage 2, with a Block(k × 1, 2c) and a Layer normalization-GLU-Conv blockThe convolutional architecture provides the ability to process variable-length inputs. The hierarchical structure of the architecture enables us to capture long-distance dependencies between the residue and those that are precisely controlled. Augmentation of the training sets slightly improves the performanceThe computational cost for training increases several times. Due to the considerable data skew, the training algorithm tends to fall into a local minimum where the network predicts all inputs as negative examplesPrecision, Recall, MCC
 Degiacomi M., 2019, UK (Degiacomi, 2019)Deep machine learningConformational space generatorMolecular dynamics, random forests, and autoencoder algorithmsA generative neural network trained on protein structures produced by molecular simulation can generate plausible conformationsGenerative neural networks for the characterization of the conformational space of proteins featuring domain-level dynamicsThe autoencoder is better at describing concerted motions (e.g., hinge motions) than at capturing subtle local fluctuations; it is most suitable for handling cases featuring domain-level rearrangementsThis generative neural network model remains incapable of reproducing non-diversity-related cases, which is a subject of active research in the machine learning communityPerformance assessed using different sizes of latent vector and optimizer
 Fang C., 2019, China, Japan (Fang et al., 2019)ConnectionistProtein sequence descriptor, position-specific scoring matrix, en DCNNMoRFDCNNPinpoint molecular recognition features, which are key regions of intrinsically disordered proteins by machine learning methodsEnsemble deep convolutional neural network-based molecular recognition feature prediction. It does not incorporate any predicted features from other classifiersThe proposed method is highly performant for proteome-wide MoRF prediction without any protein category biasIt is yet difficult to predict if the new models will perform better only on the results, referring to the use of a new dataset.Sensitivity, Specificity, Accuracy, AUC, ROC curve, MCC coefficient
 Fang C., 2019, US (Fang et al., 2020) | Connectionist | Deep dense inception network for beta-turn prediction | DeepDIN | Protein sequence represented by four sets of features: physicochemical, HHBlits, predicted shape string, and predicted eight-state secondary structure | Concatenate four convolved feature maps along the feature dimension; feed the concatenated feature map into the stringed dense inception blocks; dense layer with Softmax function | The proposed method for beta-turn prediction outperforms previous methods | Of the nine classes used, the amount of data in each class may not suffice to produce a model that extracts features well or generalizes well; combined features yield better prediction results than features used alone | MCC and 5-fold cross-validation
 Fu H., 2019, China (Fu et al., 2019) | Analogist | Classification as a natural language processing (NLP) task | CNN, DL | Predict lysine ubiquitination sites at large scale | Input fragment; multi-convolution–pooling layers; fully connected layers | Extracts features from the original protein fragments; first use of deep learning in the prediction of ubiquitination | DeepUbi is not very deep: only two convolution–pooling structures | 4-, 6-, 8-, and 10-fold cross-validation; Sensitivity, Specificity, Acc, AUC, MCC; Acc > 85%, AUC = 0.9066, MCC = 0.78
 Guo Y., 2019, US (Guo et al., 2019) | Connectionist and Symbolist | Asymmetric convolutional neural networks and bidirectional long short-term memory | ACNN, BLSTM, DeepACLSTM | Sequence-based prediction of protein secondary structure (PSS) | The DeepACLSTM method is proposed to predict 8-category PSS from protein sequence features and profile features | The method efficiently combines ACNNs with BLSTM neural networks for PSS prediction, leveraging the feature-vector dimension of the protein feature matrix | Expensive and time-consuming | Accuracy: CB6133 = 0.742, CB513 = 0.705
 Haberal I., 2019, Norway, Turkey (Haberal and Ogul, 2019) | Connectionist | Three different deep learning architectures for prediction of metal binding of histidine (HIS) and cysteine (CYS) amino acids | 2D CNN, LSTM, RNN | Three methods, PAM, ProCos, and BR, to create the feature set from the frame vector; applied directly to raw protein sequences without any extensive feature engineering, while optimizing the model for predicting metal-binding sites | The model is a 2D-CNN with four convolution layers, two pooling, two dropout, and two multi-layer perceptron layers; each convolution layer has 3 × 3 pixel filters | The good performance of the model demonstrates its potential for protein metal-binding site prediction; a competitive tool for future metal-binding studies, protein–metal interaction, protein secondary structure prediction, and protein function prediction; the CNN method gives the best results for prediction of protein metal binding using PAM attributes | The overall best results were obtained for a window of size 15; the lowest results were obtained with windows of size 101; the lowest result for ProCos was obtained with the CNN model | Precision, Accuracy, Recall, F-measure; k-fold (k = 3, 5) cross-validation
 Heinzinger M., 2019, Germany (Heinzinger et al., 2019) | Connectionist | Natural language processing with deep learning | ELMo, CharCNN, LSTM | Protein function and structure prediction via analysis of unlabeled big data and deep learning processing | Novel representation of protein sequences as continuous vectors using the NLP language model ELMo | The approach improved over some popular methods using evolutionary information, and for some proteins even beat the best, showing that the embeddings condense the underlying principles of protein sequences; the important novelty is speed | Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information | Predictions of intrinsic disorder were evaluated through the Matthews correlation coefficient and the False-Positive Rate; the Gorodkin measure was also used
 Kaleel M., 2019, Ireland (Kaleel et al., 2019) | Connectionist and Symbolist | Deep neural network architecture composed of stacks of bidirectional recurrent neural networks and convolutional layers | RSA | Three-dimensional protein structure prediction | Predicting the relative solvent accessibility (RSA) of amino acids within a protein is a significant step toward resolving the protein structure prediction challenge, especially when structural information about a protein is not available by homology transfer | High accuracy in four different classes (75% average); all training and testing performed with 5-fold cross-validation on a very large, state-of-the-art redundancy-reduced set containing over 15,000 experimentally resolved proteins | The protein structure prediction challenge, especially in cases in which structural information about a protein is not available by homology transfer | 2-class ACC 0.805, 2-class F1 0.80; 3-class ACC 0.664, 3-class F1 0.66; 4-class ACC 0.565, 4-class F1 0.56
 Karimi M., 2019, US (Karimi et al., 2019) | Pattern recognition | Interpretable deep learning of compound–protein affinity | RNN–CNN models | Development of accurate deep learning models for predicting compound–protein affinity using only compound identities and protein sequences | Using only compound identities and protein sequences, and taking massive protein and compound data, the trained RNN–CNN and GCNN models outperform baseline models | Compared to conventional compound or protein representations using molecular descriptors or Pfam domains, the encoded representations learned from novel structurally annotated SPS sequences and SMILES strings improve both predictive power and training efficiency for various machine learning models | The resulting unified RNN/GCNN–CNN model did not improve on the unified RNN–CNN | Relative error in IC50 within 5-fold for test cases and within 20-fold for protein classes not included in training
 Li C., 2019, China (Li and Liu, 2020) | Constrained optimization and Connectionist | Feature-extraction techniques for protein-fold recognition | MotifCNN and MotifDCNN, SVM, CNN | Fold-specific features with biological attributes, considering the evolutionary information from position-specific frequency matrices (PSFMs) and the structure information from residue–residue contact maps (CCMs) | The predictor, called MotifCNN-fold, combines SVMs with pairwise sequence-similarity scores based on fold-specific features | The model incorporates structural motifs into the CNNs, aiming to extract more discriminative fold-specific features with biological attributes, considering the evolutionary information from PSFMs and the structure information from CCMs | Existing fold-specific features lack biological evidence and interpretability; the feature-extraction method is still the bottleneck for performance improvement of machine learning-based methods | 2-fold cross-validation, Accuracy
 Lin J., 2019, China (Lin et al., 2019) | Analogist and evolving structures | A drug-target prediction method based on a genetic algorithm and a Bagging-SVM ensemble classifier | GA, SVM | Protein sequences combining pseudo amino acid composition, dipeptide composition, and reduced-sequence algorithms | GA is used to select the druggable protein dataset; the optimal feature vectors feed the SVM classifier; the Bagging-SVM ensemble handles the positive and negative sample sets | The method has high reference value for the prediction of potential drug targets; an improvement over previous methods | N/A | Acc, MCC, Sn, Sp, AUC, PPV, NPV, F1-score, ROC curve, and 5-fold cross-validation
 Pagès G., 2019, France (Pagès et al., 2019) | Connectionist | Regression on atomic structure depicted with a density function | 3D CNN | Protein model quality assessment | Three convolutional layers; fully connected layers; ELU as activation function | Competitive with single-model protein model quality assessment methods; trained to match the CAD-score on stage 2 of CASP 11 | Ornate does not reach the accuracy of the best meta-methods; scoring time is about 1 s for mid-size proteins | Network run on a GeForce GTX 680 GPU
 Picart-Armada S., 2019, Belgium, UK, Spain (Picart-Armada et al., 2019) | Pattern recognition | Network propagation machine learning methods | PR, Random, raw, EGAD, PPR, GM, MC, Z-scores, KNN, WSLD, COSNet, bagSVM, RF, SVM | Assess the performance of several network propagation algorithms at finding sensible gene targets for 22 common non-cancerous diseases | Two biological networks, six performance metrics, and two types of input gene–disease association scores were compared; the impact of the design factors on performance was quantified through additive explanatory models | Network propagation seems effective for drug target discovery, reflecting the fact that drug targets tend to cluster within the network | The choice of the input network and of the seed scores on the genes needs careful consideration due to the possibility of overestimating performance indicators | There was a dramatic reduction in performance for most methods when using a complex-aware cross-validation strategy; three cross-validation schemes were used
 Savojardo C., 2019, Italy (Savojardo et al., 2020b) | Connectionist | A convolutional neural network architecture to extract relevant patterns from primary features | CNN | High-accuracy discrimination of four mitochondrial compartments (matrix, outer membrane, inner membrane, intermembrane space) | Two pooling layers concatenated into a single vector with four independent output units with sigmoid activation functions quantifying the membership of each considered mitochondrial compartment | The model is robust with respect to class imbalance and gives accurate predictions for the four compartments | Adopting a more complex architecture, such as recurrent layers, could in principle improve performance; however, the recurrent models tested performed badly; inability to predict multiple localizations for a single protein sequence | 10-fold cross-validation; MCC from 0.45 to 0.65
 Schantz M., 2019, Argentina, Denmark, Malaysia (Klausen et al., 2019) | Connectionist | NetSurfP-2.0 | NetSurfP-2.0 | Predict local structural features of a protein from the primary sequence | A novel tool that predicts the most important local structural features with unprecedented accuracy and runtime; it is sequence-based and uses an architecture composed of convolutional and long short-term memory neural networks trained on solved protein structures | Predicts solvent accessibility, secondary structure, structural disorder, and backbone dihedral angles for each residue of the input sequences | The models are presented with cases that are neither physically nor biologically meaningful | CASP12 0.726, TS115 0.778, CB513 0.794
 Taherzadeh G., 2019, Australia, US (Taherzadeh et al., 2019) | Constrained optimization and Connectionist | Predictor of N- and mucin-type O-linked glycosylation sites in mammalian glycoproteins | DNN, SVM | An amino acid sequence binary vector, evolutionary information, and physicochemical properties | The DNN uses deep architectures of fully connected artificial neural networks, and an SVM with linear kernel is the classification technique used to predict O-linked glycosylation sites | The N-glycosylation model performs equally well on intra- and cross-species datasets | Limited to typical N-linked and mucin-type O-linked glycosylation sites due to lack of data for atypical N-linked and other types of O-linked glycosylation sites | AUC, MCC, accuracy, sensitivity, specificity, ROC curve, 10-fold cross-validation
 Torng W., 2019, US (Torng and Altman, 2019) | Analogist | Classification with a Softmax classifier for class probabilities | 3D CNN, SVM | Protein functional site detection | Protein site representation as four atom channels and supervised labels | Achieved an average recall of 0.955 at a precision threshold of 0.99 on PROSITE families; good performance where sequence motifs are absent but a function is known | Loss of specific orientation data; NOS structures 1TLL and 1F20 and catalytic sites in TRYPSIN-like enzymes were not detected | 5-fold cross-validation; Precision = 0.99, Recall = 0.955
 Wan C., 2019, UK (Wan et al., 2019) | Connectionist | A novel method (STRING2GO) with deep maxout neural networks for protein functional predictive information | DMNN, SVM | Protein functional biological network node neighborhoods and co-occurrence function information | The network architecture consists of three fully connected hidden layers, followed by an output layer with as many neurons as the number of terms selected for the biological process functional domain; a sigmoid activation function is used and the AdaGrad optimizer is implemented | Successful learning of the functional representations used by the classifiers for making predictions | Predictive accuracy could potentially be improved by integrating representations from other data sources with the current PPI network embedding representations | AUC, ROC, MCC
 Wang D., 2019, China (Wang D. et al., 2020) | Evolutionary | An artificial intelligence-based protein structure refinement method | Multi-objective PSO | Query sequence structures as the initial particle selection for conformation representation | Use of multiple energy functions as multi-objectives; initialization with an energy map of the initial particles; iteration over the energy landscape (e.g., at the 4th iteration); selection of non-dominated solutions added to the Pareto set; selection of the global best position and of the best position each swarm has had, using the dominance relationship of swarms to move in the optimal direction | The success of AIR can be attributed to three main aspects: the anisotropy of multiple templates, the complementarity of multi-objective energy functions, and the swarm intelligence of the PSO algorithm for effective search of good solutions; a larger number of iterations allows a more detailed search of the search space, which can improve the quality of the output models | The velocity of the dihedral angles must be restricted in each iteration to a reasonable range to balance accuracy against conformational search; there are still some unreasonable solutions in the Pareto set; the final step, which ranks the structures in the Pareto set, needs more study | RMSD value
 Yu C., 2019, US (Yu et al., 2019) | Connectionist | Regression of musical patterns as an extension of protein design | RNN, LSTM | Generation of audible sound from amino acid sequences for application to designer materials | An RNN (LSTM) utilized for melody generation with time-sequence featurization | A mechanism to explain the importance of protein sequences; it can be applied to express the structure of other nanostructures | N/A | N/A
 Zhang D., 2019, US (Zhang and Kabuka, 2019) | Connectionist | Protein sequence pre-processing, unsupervised learning, supervised learning, and deep feature extraction | Multimodal DNN | Identify protein–protein interactions and classify families via deep learning models | Multi-modal deep representation learning structure incorporating protein physicochemical features with graph topological features from the PPI networks | The model outperforms most of the baseline machine learning models analyzed by the authors on the same reference datasets | If there is a certain type of PPI that previous models cannot handle, the article does not say whether the new model can | PPI prediction accuracy for eight species ranged from 96.76 to 99.77%, implying that the multi-modal deep representation-learning framework achieves superior performance compared to other computational methods
 Zhang Y., 2019, China (Zhang et al., 2019) | Connectionist | A new prediction approach appropriate for imbalanced DNA–protein-binding site data | ADASYN | Employment of PSSM and sequence features for predicting DNA-binding sites in proteins | Introduction of a new feature representation combining the position-specific scoring matrix, one-hot encoding, and predicted solvent accessibility features; adaptive synthetic sampling to oversample the minority class and a Bootstrap strategy for the majority class to deal with the imbalance problem | Demonstration that the method achieves high prediction performance and outperforms state-of-the-art sequence-based DNA–protein-binding site predictors | Other physicochemical features remain to be considered for the model, and the biological meaning of the CNN filters remains to be explained | Sensitivity, Specificity, Accuracy, Precision, and MCC coefficient
 Zheng W., 2019, US (Zheng et al., 2019) | Probabilistic inference, Symbolist | Two fully automated deep learning structure prediction pipelines for guided protein structure prediction | Zhang-Server and QUARK | Starting from a full-length query sequence structure | Three core modules: a multiple sequence alignment (MSA) generation protocol to construct deep sequence profiles for contact prediction; an improved meta-method, NeBcon, which combines multiple contact predictors, including ResPRE, which predicts contact maps by coupling precision matrices with deep residual convolutional neural networks; and an optimized contact potential to guide structure assembly simulations | Improved accuracy of protein structure prediction for both FM and TBM targets; accurate evolutionary coupling information for contact prediction, improving the performance of structure prediction; properly balancing the components of the energy function was vital for accurate structure prediction | Incorrect prediction of contacts between the N- and C-terminal protein regions; low accuracy of contact prediction in the terminal regions due to MSAs with many gaps there, as the accuracy of contact-map prediction and FM target modeling is highly influenced by the number of effective sequences in the MSA | TM-score and p-values
 Cuperus J., 2018, US (Cuperus et al., 2017) | Connectionist | Regression with a dropout probability distribution | DNN, CNN, LSTM | Predict protein expression | Hierarchical representation of image features from data | Prediction and visualization of transcription factor binding, DNase I hypersensitivity sites, enhancers, and DNA methylation sites | Measurement of protein expression in yeast, which possesses only 5,000 genes | k-mer features, cross-validation, held-out R2 = 0.61
 Fang C., 2018, US (Fang et al., 2018) | Pattern recognition | A deep learning network architecture capturing both local and global interactions between amino acids for secondary structure prediction | Deep3I | A protein secondary structure prediction model | A designed feature matrix corresponding to the primary amino acid sequence of a protein, consisting of a rich set of information derived from individual amino acids as well as the context of the protein sequence | The model uses a more sophisticated yet efficient deep learning architecture, utilizing hierarchical deep inception blocks to effectively process local and nonlocal interactions of residues | Further application of the model to predict other protein structure-related properties, such as backbone torsion angles, solvent accessibility, contact number, and protein order/disorder regions, is left for future work | Accuracy, p-value
 Feinberg E., 2018, China, US (Feinberg et al., 2018) | Connectionist | A PotentialNet family of graph convolutions | GCNN | A generalized graph convolution including intramolecular interactions and noncovalent interactions between different molecules | First, graph convolutions over only bonds, deriving new node feature maps; second, both bond-based and spatial distance-based propagation of information; third, a graph gather operation over the ligand atoms, whose feature maps are derived from bonded ligand information and spatial proximity to protein atoms | Statistically significant performance increases for all three prediction tasks: electronic property (multitask), solubility (single task), and toxicity prediction (multitask); spatial graph convolutions can learn an accurate mapping of protein–ligand structures to binding free energies using a relatively low amount of data | A drawback of the train–test split is possible overfitting to the test set through hyperparameter searching; another limitation is that train and test sets will contain similar examples | Regression enrichment factor (EF), Pearson and Spearman coefficients, R-squared, MUE (mean unsigned error)
 Frasca M., 2018, Italy (Frasca et al., 2018) | Analogist | Clustering with a Hopfield model | COSNet, ParCOSNet, HNN | AFP (Automated Protein Function Prediction) | Network parameters are learned to cope with the label imbalance | Takes advantage of the sparsity of input graphs and of the scarcity of positive proteins characterizing AFP data | Execution time increased less than the density and more than the number of nodes | 5-fold cross-validation; implementation and execution on a Nvidia GeForce GTX 980 GPU; Precision, Recall, F-score, AUPRC
 Hanson J., 2018, Australia, China (Hanson et al., 2019) | Pattern recognition | Sequence-based prediction of one-dimensional structural properties of proteins | CNN, LSTM-BRNN | Improving the prediction of protein secondary structure, backbone angles, and solvent accessibility | The model leverages an ensemble of LSTM-BRNN and ResNet models, together with predicted residue–residue contact maps, to continue the push toward the attainable limit of prediction for 3- and 8-state secondary structures, backbone angles (θ, τ, φ, and ψ), half-sphere exposure, contact numbers, and solvent accessible surface area (ASA) | Large improvement in fragment structural accuracy; a new method for predicting one-dimensional structural properties of proteins based on an ensemble of different types of neural networks (LSTM-BRNN, ResNet, and FC-NN) with predicted contact-map input from SPOT-Contact; the ensemble of different network types contributes another 0.5% improvement | Long proteins take extensive time, especially for 2D analysis tools; CPU vs. GPU makes little difference in runtime, as the speed-up from GPU acceleration mainly comes during training | 10-fold cross-validation, Accuracy
 Hanson J., 2018, Australia, China (Hanson et al., 2018) | Connectionist | Method stacking residual 2D-CNNs with residual bidirectional recurrent LSTM networks, with 2D evolutionary coupling-based information | CNN, 2D-BRLSTM | Protein contact map prediction | Transformation of sequence-based 1D features into a 2D representation (outer concatenation function); ResNet, 2D-BRLSTM, and fully connected (FC) layers | The method achieves robust performance; the model is more accurate in contact prediction across different sequence separations, proteins with different numbers of homologous sequences, and residues with different numbers of contacts | Computing-environment limitations imposed by the 2D-BRLSTM model: training and testing inputs are limited to proteins of length 300 and 700 residues, respectively | AUC > 0.95, ROC curve, precision
 Huang L., 2018, US (Huang et al., 2008) | Connectionist | A novel PPI prediction method based on a deep learning neural network and a regularized Laplacian kernel | ENN-RL | Protein–protein interaction networks | Contains five layers: the input layer, three hidden layers, and the output layer; sigmoid is adopted as the activation function for each neuron, and layers are connected with dropout; a regularized Laplacian kernel is applied to the transition matrix built upon the evolved PPI network | The transition matrix learned from the evolution neural network can also help build an optimized kernel fusion, which effectively overcomes the limitation of the traditional WOLP method, which needs a relatively large and connected training network to obtain the optimal weights | The results show that the method can further improve prediction performance by only up to 2%, which is very close to an upper bound obtained by an approximate Bayesian computation-based sampling method | Cross-validation, AUC, sensitivity
 Khurana S., 2018, Qatar, US (Khurana et al., 2018) | Analogist | Clustering as a natural language processing task | CNN, FFNN | Solubility prediction | Use of additional biological features from external feature-extraction toolkits applied to the protein sequences | DeepSol is at least 3.5% more accurate than PaRSnIP and 15% more accurate than PROSO II; DeepSol is superior to all current sequence-based protein solubility predictors | The DeepSol S2 model was surpassed by PaRSnIP on sensitivity for soluble proteins | 10-fold cross-validation; Acc, MCC; MCC = 0.55; DeepSol S1 and S2 accuracy = 69%
 Le N., 2018, Taiwan (Le et al., 2018) | Analogist | Regression with a Softmax layer for classification | CNN | Classify Rab protein molecules | 2D-CNN and position-specific scoring matrices; PSSM profiles as 20 × 20 matrices | Construction of a robust deep neural network for classifying each of four specific molecular functions; a powerful model for discovering new proteins that belong to Rab molecular functions | More rigorous classification tests remain to be considered | 5-fold cross-validation; Sensitivity, Specificity, Acc, AUC, F-score, MCC; Acc = 99%, 99.5%, 96.3%, 97.6%
 Li H., 2018, China (Huang et al., 2018) | Constrained optimization | Regression with the Adam optimizer | DNN, CNN, LSTM | Prediction of protein interactions | Machine learning approach for the computational prediction of PPIs | Insight into the identification of protein–protein interactions (PPIs) and their role in protein function | Manual input of features into the networks | Hold-out testing set model validation; Acc = 0.9878, Recall = 0.9891, Precision = 0.9861, F-score = 0.9876, MCC = 0.9757
 Long H., 2018, China, US (Long et al., 2018) | Connectionist | Classification with a sigmoid function | HDL, CNN, LSTM, RNN | Predicting hydroxylation sites | CNN deep learning model; the convolution layer consists of a set of filters over the dimensions of the input data | p-values between the AUCs of this and other methods are less than 0.000001 | Comparative results only for the CNN and iHyd-PseCp networks | 5-fold cross-validation; Sn, Sp, Acc, MCC, TPR, FPR, Precision, Recall
 Makrodimitris S., 2018, Netherlands (Makrodimitris et al., 2019) | Analogist | Clustering with constrained optimization | KNN, LSDR | Protein function prediction | Transformation of the GO terms into a lower-dimensional space | GO-aware LSDR has slightly better performance on SDp; LSDR reduces the number of dimensions in the label space and improves the power of the term-specific predictors | LSDR generates inconsistent parent–child pairs; GO-aware terms show higher inconsistency | 3-fold cross-validation; Fp, AUPRCp, SDp, Ft, AUCRPCt
 Popova M., 2018, Russia, US (Popova et al., 2018) | Constrained optimization | Regression with a Stack-RNN as a generative model | Stack-RNN, LSTM | De novo drug design | A deep neural network that generates novel molecules (G) and predicts properties of novel compounds (P) | The ReLeaSE method does not rely on predefined chemical descriptors; no manual feature engineering is needed for the input representation | Extension of the system to afford multi-objective optimization of several target properties | 5-fold cross-validation (5CV); model trained using a GPU; Acc, R2, RMSE; R2 = 0.91, RMSE = 0.53
 Sunseri J., 2018, US (Sunseri et al., 2019) | Connectionist | Regression over distributed atom densities | CNN | Cathepsin S model, ligand–protein | CNN-based scoring functions | The CNN scoring function outperforms Vina on most tasks without manual intervention | Difficulties with Cathepsin S for de novo docking | AUC, ROC, MCC
 Zhang B., 2018, China (Zhang B. et al., 2018) | Connectionist | A novel deep learning architecture to synergistically improve protein secondary structure prediction | CNN, RNN, BRNN | Four input features: position-specific scoring matrix, protein coding features, physical properties, and characterization of the protein sequence | A local block comprising two 1D convolutional networks with 100 kernels and the concatenation of their outputs; in the BGRU block, the concatenation of the input from the previous layer and from the layer before it is fed to the 1D convolutional filter; after reducing the dimensionality, the 500-dimensional data are transferred to the next BGRU layer | The CNN was successful at feature extraction, and the RNN was successful at sequence processing; the residual network connected the interval BGRU network to improve modeling of long-range dependencies; when the stacked layers were increased to two, performance rose to 70.5%, and three-layer networks increased it further to 71.4% accuracy | When the recurrent neural network was built from unidirectional GRUs, performance dropped to 67.2%; the unidirectional GRU network was ineffective at capturing contextual dependencies | Precision, Recall, F1-score, macro-F1, Accuracy
 Zhang L., 2018, China (Zhang L. et al., 2018) | Connectionist | Two novel approaches that separately generate reliable non-interacting pairs, based on sequence similarity and on random walk in the PPI network | DNN, Adam algorithm | Use of the auto-covariance (AC) descriptor to extract features from amino acid sequences and deep neural networks to predict PPIs | The feature vectors of two individual proteins extracted by AC are employed as the inputs of the two DNNs, respectively; the Adam algorithm is applied to speed up training; dropout is employed to avoid overfitting; the ReLU activation function and cross-entropy loss are employed, since they both accelerate model training and obtain better prediction results | To reduce bias and enhance the generalization ability of the generated negative dataset, the two strategies separately adjust the degree of the non-interacting proteins and approximate it to that of the positive dataset | NIP-SS is competent on all datasets and holds good performance, whereas NIP-RW only performs well on small datasets (positive samples ≤ 6,000) because of the restriction of the random walk, as shown by extensive experiments | Precision, Accuracy, Recall, Specificity, MCC coefficient, F1-score, AUC, Sensitivity
 Zhao X., 2018, China (Zhao et al., 2018) | Connectionist | Bi-modal deep architecture with sub-nets handling two parts (raw protein sequence and physicochemical properties) | CNN and DNN | Raw sequence and physicochemical properties of proteins for characterization of the acetylated fragments | Multi-layer 1D CNN as feature extractor and DNN with an attention layer and a softmax layer | Capability of transfer learning for species-specific models, combining raw protein sequence and physicochemical information | Interpretation of the biological aspects; overfitting problems on small-scale data | 10-fold cross-validation; ACC = 0.708, sensitivity (SEN) = 0.723, specificity (SPE) = 0.707, AUC = 0.783, MCC = 0.251
 Armenteros J., 2017, Denmark (Almagro Armenteros et al., 2017) | Analogist | Classification optimization | CNN, RNN, BLSTM, FFNN, attention models | Predict protein subcellular localization | The CNN extracts motif information using different motif sizes; the recurrent neural network scans the sequence in both directions | The A-BLSTM and CONV A-BLSTM models achieved the highest performance | Training time for the full ensemble was 80 h, approximately 5 h per model | Nested cross-validation and a held-out set for testing models; Gorodkin, Acc, MCC; 72.90%, 72.89%
 Jimenez J., 2017, Spain (Jiménez et al., 2017) | Bayesian | Regression with a sigmoid activation function depicting the probability | 3D CNN | Predict protein–ligand-binding sites; drug design | Fully connected networks; hierarchically organized layers | Four convolutional layers with max pooling and dropout after every two convolutional layers, followed by one regular fully connected layer | Demands more significant computational resources than other methods for ligand-binding prediction | 10-fold cross-validation; Nvidia GeForce GTX 1080 GPU for accelerated computing; DCC, DVO, AUC, ROC, Sn, Sp, Precision, F1-score, MCC, Cohen's Kappa coefficient
 Müller A., 2017, Switzerland (Müller et al., 2018) | Analogist | Regression with a SoftMax function for temperature-controlled probability | RNN, LSTM | Combinatorial de novo peptide design | The computed output y is compared to the actual amino acid to calculate the categorical cross-entropy loss | The network models were shown to generate peptide libraries of a desired size within the applicability domain of the model | Increasing the network size to more than two layers with 256 neurons led to rapid over-fitting of the training data distribution | 5-fold cross-validation; network training and sequence generation on a Nvidia GeForce GTX 1080 Ti GPU
 Ragoza M., 2017, US (Ragoza et al., 2017)ConnectionistClassification distributed atom densitiesCNN SGDProtein-ligand score for drug discoveryCNN architecture: constructed using simple parameterization to serve as a starting point for optimizationOn a per-target basis, CNN scoring outperforms Vina scoring for 90% of the DUD-E targetsCNN performance is worse at intra-target pose ranking, which is more relevant to molecular docking3-fold cross-validation ROC, AUC, FPR, TPR, RF-score, NNScore. CNN-0.815 Vina-0.645
 Szalkai B., 2017, Hungary (Szalkai and Grolmusz, 2018a)Pattern recognitionA classification by amino acid sequence multi-label classification abilityANNProtein classification by amino acid sequenceThe convolutional architecture with 1D spatial pyramid pooling and fully connected layers. The network has six one-dimensional convolution layers with kernel sizes [6,6,5,5,5,5] and depths (filter counts) [128,128,256,256,512,512], with parametric rectified linear unit activation. Each max pooling layer was followed by a batch normalization layerThe model outperformed the existing solutions and has attained near-100% accuracy in multi-label, multi-family classificationNetwork variants without batch normalization and five (instead of six) layers showed a performance drop of several percentage points. With more GPU RAM available, one can further improve upon the performance of the neural network by simply increasing the number of convolutional or fully connected layersPrecision, Recall, F1-value, AUC, ROC curve
 Szalkai B., 2017, Hungary (Szalkai and Grolmusz, 2018b)Logical InferenceClassification Hierarchical classification treeANNHierarchical biological sequence classificationSECLAF implements a multi-label binary cross-entropy classification loss on the output neuronsSECLAF produces the most accurate artificial neural network for residue sequence classification to datePreparation of the input data must be done by the userAUC
 Vang Y., 2017, US (Vang and Xie, 2017)AnalogistRegression Distributed representation with NLPCNNHLA class I-peptide-binding predictionThe CNN architecture: convolutional and fully connected dense layersEffective for validation, distribution, and representation for automatic encoding with no handcrafted encode constructionProvided sufficient data, the method is able to make prediction for any length peptides or allele subtype70% training set and 30% validation set (Hold-out) and 10-fold cross-validation GPU for faster computation of model SRCC, AUC SRCC = 0.521, 0.521, 0.513 AUC= 0.836, 0.819, 0.818 66.7%
 Wang S., 2017, US (Wang et al., 2017)AnalogistClassification Regression Regularization and optimizationUDNN RNNPrediction of Protein Contact MapConsists of two major modules, each being a residual neural network3D models built from contact prediction have TM-score >0.5 for 208 of the 398 membrane proteinsNo recognition of predicted contact maps from PDB.Algorithm runs on GPU card. Acc L/k (k= 10, 5, 2, 1) Long-range 47% CCMpred- 21% CASP11–30%
 Yeh C., 2017, UK, US (Yeh et al., 2018)Evolving structuresOptimization GAGA multithreaded processingDesigned helical repeat proteins (DHRs)Iterates through mutation, scoring, ranking, and selectionAims to control the overall shape and size of a protein using existing blocksFirst workload imbalance, less efficient work sharing and overheads in schedulingRMSD value
 Simha R., 2015, Canada, Germany, US (Simha et al., 2015)BayesianClassification Probabilistic generative model Bayesian networksMDLoc BNProtein multi-location predictionEach iteration of the learning process obtains a Bayesian network structure of locations using the software package BANJO.Improvement of MDLoc over preliminary methods with Bayesian network classifiersMDLoc’s precision values are lower than those of BNCs5-fold cross-validation Presi, Recsi, Acc, F1-scoresi
 Yang J., 2015 China, US (Yang et al., 2015)AnalogistRegression hierarchical order reductionSVRStructure prediction of cysteine-rich proteinsPosition-specific scoring matrix (PSSM): each oxidized cysteine residue is represented as a vector of 20 elementsCyscon improved the average accuracy of connectivity pattern predictionContact information must be predicted from sequence either by feature-based training or by correlated mutations10-fold cross-validation and 20-fold cross-validation QC, QP 21.9%
 Folkman L., 2014, Australia (Folkman et al., 2014)Bayesian Constrained optimizationClassification predicted probability of the mutationSFFS SVM EASE-MMModel designed for a specific type of mutationFeature-based multiple models, with each model designed for a specific type of mutationEASE-MM achieved balanced results for different types of mutations based on the accessible surface area, secondary structure, or magnitude of stability changesUsing an independent test set of 238 mutations, results were compared with related work10-fold cross-validation ROC, AUC, MCC, Q2, Sn, Sp, PPV, NPV AUC = 0.82 MCC = 0.44 Q2 = 74.71 Sn = 73.14 Sp = 75.28 PPV = 52.30 NPV = 88.33
 Li Z., 2014, US (Li et al., 2014)BayesianClassification Probability output predictionSPIN NNSequence profile predictionSequence Profiles by Integrated Neural network based on fragment-derived Sequence profiles and structure-derived energy profilesSPIN improves over the fragment-derived profile by 6.7% (from 23.6 to 30.3%) in sequence identity between predicted and raw sequencesMinor improvement in the core of proteins, which have 10% less hydrophilic residues in predicted sequences than raw sequences10-fold cross-validation MSE, Precision, Recovery rate
 Eisenbeis S., 2012, Germany (Eisenbeis et al., 2012)N/AN/AN/AEnzyme designNo networkNo networkNo network
 Qi Y., 2012, US (Qi et al., 2012)ConnectionistClassification Back propagation in deep layersDNNPrediction of local properties in proteinsAn amino acid feature extraction layer. A sequential feature extraction layer. A series of classical neural network layersFor the prediction of coiled coil regions, the reported performance of 97.4% beats the best result (94%) on the same dataset using the same evaluation setupThe largest improvement is observed for relative solvent accessibility prediction, from 79.2 to 81.0% in the multitask setting3- and 10-fold cross-validation Acc, precision, recall, F1 80.3%
 Ebina T., 2011, Japan (Ebina et al., 2011)AnalogistClassification Domain linker prediction SVMDROP SVM RFDomain predictorVector encoding. Random Forest feature selection. SVM parameter optimization. Prediction assessmentAdvantage for testing several averaging windows, 600 properties encoded, averaged with five different windows into a 3000-dimensional vectorComputational time required for performing an exhaustive search5-fold cross-validation AUC, Sn, Precision, NDO, AOS
 Yang Y., 2011, US (Yang et al., 2011)Probability InferenceRegression probabilistic-based matchingSPARKS-X AlgorithmSingle-method fold recognitionThe model is built by modeller9v7 using the alignment generated by SPARKS-XSPARKS-X performs significantly better in recognizing structurally similar proteins (3%) and in building better models (3%)HHPRED improves by 3% over SPARKS-X due to its significantly more sophisticated model-building techniquesROC, TPR, FPR
 Briesemeister S., 2010, Germany (Briesemeister et al., 2010)BayesianClassification probabilistic approachNBPredict protein subcellular localizationYLoc, based on the simple naive Bayes classifierSmall number of features and the simple architecture guarantee interpretable predictionsReturns confidence estimates that rate whether predictions are reliable or not5-fold cross-validation Acc, F1-score, precision, recall
 Lin G., 2010, US (Lin et al., 2010)AnalogistClassification OptimizationSVM SVRProtein folding kinetic rate and real-value folding rateSVM classifier to classify folding types based on binary kinetic mechanism (two-state or multi-state), instead of using structural classes of all-α-class, all-β-class and α/β-classThe accuracy of fold rate prediction is improved over previous sequence-based prediction methodsPerformance can be further enhanced with additional informationLeave-one-out cross-validation (LOOCV) Classification accuracy surface, Predicted precision
 Tian J., 2010, China (Tian et al., 2010)AnalogistClassification OptimizationRFR SVM RFEffect on single or multi-site mutation on protein thermostabilityRandom forest includes bootstrap re-sampling, random feature selection, in-depth decision, tree construction, and out-of-bag error estimatesOverall accuracy of classification and the Pearson correlation coefficient of regression were 79.2% and 0.72Direct comparison of Prethermut with the other published predictor was not performed as a result of data limitation and differences10-fold cross-validation Overall accuracy (Q2), MCC, Sn, Sp, Pearson correlation coefficient (r) Acc = 79.2% r = 0.72
 Zhao F., 2010, US (Zhao et al., 2010)BayesianClassification probabilistic graphical modelCNF SVMProtein foldingConformations of a residue in the protein backbone are described as a probabilistic distribution of (θ, τ)The method generates conformations by restricting the local conformations of a proteinCNF can generate decoys with lower energy but not improve decoy quality5-, 7-, and 10-fold cross-validation Accuracy (Q3) Q3 = 80.1%
 Hong E., 2009, US (Hong et al., 2009)SymbolistClassification Branch and bound tree Logical inferenceBroMapTenth human fibronectin, D44.1 and DI.3 antibodies, Human erythropoietinBroMAP attempts the reduction of the problem size within each node through DEE and eliminationLower bounds are exploited in branching and subproblem selection for fast discovery of strong upper boundsBroMAP is particularly applicable to large protein design problems where DEE/A∗ struggles and can also substitute for DEE/A∗ in general GMEC searchN/A
 Özen A., 2009, Turkey (Özen et al., 2009)AnalogistClassification Regression Constrained optimizationSVM KNN DT SVRSingle-site amino acid substitutionEarly Integration. Intermediate Integration. Late IntegrationPossible combinations including a new feature set, a new kernel, or a learning method to improve accuracyTraining any classifier with an unbalanced dataset in favor of negative instances makes it difficult to learn the positive instances20-fold cross-validation Acc, Error rate, Precision, Recall, FP rate Acc= 0.842, 0.835
 Ebrahimpour A., 2008, Malaysia (Ebrahimpour et al., 2008)ConnectionistClassification Back and batch back propagationANN FFNN IBP BBP QP GA LMLipase production Syncephalastrum racemosum, Pseudomonas sp. strain S5 and Pseudomonas aeruginosaANN architecture: input layer with six neurons, an output layer with one neuron, and a hidden layer. Transfer functions of hidden and output layers are iteratively determinedMaximum predicted values by ANN (0.47 U ml−1) and RSM (0.476 U ml−1), whereas R2 and AAD were determined as 0.989 and 0.059% for ANN and 0.95 and 0.078% for RSM, respectivelyANN has the disadvantage of requiring large amounts of training dataRMSE, R2, AAD RMSE<0.0001 R2 = 0.9998
 Huang W., 2008, Taiwan (Huang et al., 2008)AnalogistClustering Combinatorial optimizationGA SVM KNNPrediction method for predicting subcellular localization of novel proteinsPreparation of SVM, binary classifiers of LIBSVM. Sequence representation. Inclusion of essential GO termsBias-free estimation of the accuracy reduces computational costComputational demand is impractical for large datasets10-fold cross-validation and leave-one-out cross-validation (LOOCV) Accuracy, MCC Acc= 90.6–85.7%
 Katzman S., 2008, US (Katzman et al., 2008)BayesianClassification ProbabilisticMUSTER SVMLocal structure predictionCalculation of output of each unit in each layer. Soft max function to all outputs of a given layer represents valid probability distributionAccurate predictions of novel alphabets for extending the performanceSmaller windows and number of units, the network has fewer total degrees of freedom3-fold cross-validation, Qn
 Liao J., 2007, US (Liao et al., 2007)Supervised LearningClassification RegressionRR Lasso PLSR SVMR LPSVMR LPBoosR MR ORMRProteinase K variantsDesign of protein variants. Expression of the protein variants. Analysis of protein variant sequences and activities to assess the contribution of each amino acid substitutionMachine learning algorithms make it possible to use more complex and expensive tests to only protein propertiesComputational resources are cheap; we instead used the 1000 subsamples of the training setsCross-validation
 Raveh B., 2007, Israel (Raveh et al., 2007)ConnectionistClustering Pattern recognitionK-means ClusteringExistence of α-helices, parallel β-sheets, anti-parallel sheets and loops. Non-conventional hybrid structuresNetwork motif vector (k means of motif vector). Enriched Interaction graphsRediscovery existence of conventional a-helices, parallel b-sheets, anti-parallel sheets and loops, and non-conventional hybrid structuresLimitation to backbone interactions, the degree of each node in the network was bounded from above by two covalent and two possible hydrogen bonds10-fold cross-validation
 Shamim M., 2007, India (Shamim et al., 2007)AnalogistClassification RegressionSVMProtein-fold predictionLIBSVM provides a choice of in-built kernels, such as Linear, Polynomial, Radial basis function (RBF), and Gaussian; the RBF kernel is usedOverall accuracy of 65.2% for fold discrimination and individual propensities, which is better than those from the literatureIncreasing the number of backbone conformations results in reduced prediction accuracy2-fold cross-validation 5-fold cross-validation Accuracy (Q), Sn, Sp Q= 65.2% >70%
 Hung C., 2006, Taiwan (Hung et al., 2006)SymbolistRegression Genetic algorithm casual treeDFS HMM GA AGCTPredict protein functionsAGCT study applies a hybrid methodology based on genetic programming with a causal tree model to predicting protein functionThe model is developed to exploit global search capabilities in genetic programming for predicting protein functions of a distantly related protein family that has difficulties in the conserved domain identificationRatios of comparison between the heuristic signal match and exhaustive sequence alignment are lowCross-validation
 Sidhu A., 2006, UK (Sidhu and Zheng, 2006)SymbolistClassification Logical InferenceBBFNN NN DTPredict signal peptideBBFNN Characteristics: Mutation matrix for protein sequence encoding. BBFNN is a linear combination of K bio-bases with the bio-basis functionThe BBFNN has improved the accuracy by a further 5%. Most cost-effective and efficient way of predicting signal peptidesSize of the positive examples in the dataset reduces prediction accuracy5-fold cross-validation Accuracy Acc >90%, 97.16% for BBFNN 97.63% for C4.5
 Zimmermann O., 2006, Germany, US (Zimmermann and Hansmann, 2006)AnalogistClassificationSVM C-SVM algorithm implementationPrediction of dihedral regionsImplementation of the sequence window of length seven and three separate predictions: helix, extended beta, and outliersProfile-only SVM classifiers show a prediction performance of 80%The approach is based on sequence profiles only. Models show a tendency to over-predict extended residues and under-predict residues in the helical stateAcc, MCC, Sn, Sp Acc = 93.3%, 93.4% MCC = 0.645, 0.671
 Capriotti E., 2005, Italy (Capriotti et al., 2005)AnalogistClassificationSVMProtein stability predictionPrediction of the direction of the protein stability changes upon single-point mutation from the protein tertiary structureLarge extent protein stability can be evaluated with specific interactions in the sequence neighbors capturedCorrelation of predicted with expected/experimental values is 0.71 with a standard error of 1.30 kcal/mol and 0.62 with an SE of 1.45 kcal/molCross-validation Accuracy, MCC, Q2 = 0.80, 0.77 MCC = 0.51, 0.42
 Rossi A., 2001, Italy (Rossi et al., 2001)ConnectionistRegression Perceptron algorithmNNBarnase and chymotrypsin inhibitorTwo- and three-body energy functions. Partitioning the 20 amino acids into classes (Hydrophobic, Neutral, Charged)The method is able to identify crucial sites for folding process: for 2ci2 and barnase and shows a very good agreement with experimental resultsNo improvement on success rate by introducing more sophisticated energy functions. Important features of real proteins are neglected by short-range HamiltoniansN/A

An overview of the included articles on study and algorithm features based on their characteristics, strengths, limitations, and measure of precision.
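Most rows above report the same family of binary-classification metrics (ACC, sensitivity, specificity, MCC). As a reference for how these relate to the confusion matrix, here is a minimal pure-Python sketch; the function names are illustrative and not taken from any cited study:

```python
import math

def confusion_counts(y_true, y_pred):
    """Tally TP, TN, FP, and FN for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    """Return ACC, SEN (recall), SPE, and MCC as reported in the tables."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    acc = (tp + tn) / len(y_true)
    sen = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    spe = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SEN": sen, "SPE": spe, "MCC": mcc}
```

MCC is the metric most often quoted alongside accuracy because it remains informative on the unbalanced datasets that several of the studies above flag as a limitation.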

• Pre-process: database, pretreatment, and input.

• Process: machine learning paradigm and input, algorithm and development software, three aspects of the neural network used (characteristics, strengths, and limitations), and output.

• Post-process: input and web server, when applied.

Most of the research reported in these articles performs a pretreatment on the protein database used, that is, processes of randomization and training/test partitioning, in order to prepare the data for the computational process itself: the execution of the algorithm on a software platform within a particular machine learning paradigm (mostly supervised, unsupervised, and deep learning, as shown in Figure 4). We also report notable characteristics, as well as strengths and limitations, of the neural networks used. Finally, part of the post-process, when applied, concerns the web server where research results are made available. Some of these aspects are also registered in Tables 2–6, along with others (programming language and software license type).
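Most of the studies surveyed validate with k-fold cross-validation after randomizing the data. The pretreatment step they share (shuffle, then partition into folds) can be sketched as follows; this is a generic illustration using only the Python standard library, not any specific study's pipeline:

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Randomize sample order, then yield (train, test) index lists for
    each of k folds -- the randomization-and-partitioning pretreatment
    applied before the learning algorithm is run."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # reproducible randomization
    folds = [idx[i::k] for i in range(k)]      # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each sample appears in exactly one test fold, so averaging a metric over the k folds gives the cross-validated estimates (e.g., the 10-fold ACC and MCC values) reported throughout the tables.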

FIGURE 4

TABLE 2

First Author/Year of Publication/CountryDatabaseInitial scaffold (ID)Designed ProteinML modelSoftware/ServerProgramming language/PlatformLicenseQuality (%)Machine learningProtein application
Protein and drug design
 Hie B., 2022, USA (Hie and Yang, 2022)N/ASequence-to-function machine learning surrogate modelProtein engineering designMachine learning optimizationN/AN/AN/A50%Supervised learning: optimizationProtein design
 Dara S., 2021, India (Dara et al., 2021)ZINC, BindingDB, PUBCHEM, Drugbank, REAL, Genomic Database, Adaptable Clinical Trial Database, DataFoundry, SWISS-PROT, SCoP, dbEST, Genome Information Management System, BIOMOLQUEST, PDB, SWISS-PROT, ENZYMETarget identification, hit discovery, hit to lead, lead optimizationPPI prediction, protein folding, drug repurposing, virtual screening, activity scoring, QSAR, drug design, evaluation of ADME/T propertiesAutoEncoder, ANN, CNN, DL, MLP, NB, RF, RNN, CNN, SVM, LRN/AN/AN/A50%Supervised learning: predictionDrug discovery
Feger G., 2020, Czech Republic, France (Feger et al., 2020)PDBPeptide amphiphile scaffoldsAmphiphilic peptide scaffold designSVM, RFSasFitCOpen source60Supervised Learning: PredictionProtein design
He H., 2020, China (He et al., 2020)Multiple databasesMultiple organismsReview of novel drug discovery techniquesMultiple methods for structure prediction, ligand-binding site, undruggable to drug rabble targets, hidden allosteric siteN/AN/AN/A50N/ADrug discovery
Maia E., 2020, Brazil (Maia et al., 2020)Multiple databasesstructure-based virtual screening (SBVS)Drug developmentVSAN/AMultiple languagesN/A60Supervised Learning: Unsupervised LearningDrug development design
Qin Z., 2020, US (Qin et al., 2020)PDBPhi–psi angle and sequence of natural protein, only of standard amino acidsProtein design of fold alpha-helical structureMNNNTensorflow https://github.com/IBM/mnnnPythonOpen Source95Supervised Learning: Prediction RegressionProtein design
Tsou L., 2020, Taiwan (Tsou et al., 2020)ChEMBLIn-house database of 165,000 compoundsTNBC inhibitors and GPCR classification predictionDNN, RFN/AN/AN/A60Supervised Learning: ClassificationDrug design
Wang X., 2020, China (Wang X. et al., 2020)KIBA, Davis datasetKinase protein familyPredict drug-target-binding affinityCNN, GCNN/AN/AN/A60Supervised Learning, Semi-Supervised Learning: PredictionDrug-target binding-affinity
Yu C., 2020, Taiwan, US (Yu and Buehler, 2020)PDBα-helix-rich proteinsDe novo protein designRNN, LSTMTensorFlow, https://github.com/tensorflow/magenta/issues/1438PythonOpen Source90Supervised Learning: Unsupervised Learning: PredictionProtein design
Fang C., 2019, US (Fang et al., 2020)UniProtProteins from datasets BT426 and BT6376 containing at least one beta-turnBeta-turn predictionHMM, CNN, DeepDINTensorflow, Keras http://dslsrv8.cs.missouri.edu/∼cf797/MUFoldBetaTurn/download.htmlPythonOpen Source90Supervised Learning: ClassificationProtein design
Karimi M., 2019, US (Karimi et al., 2019)BindingDB, STITCH, UnirefVarious protein classesCompound–protein affinity predictionRNN, CNNhttps://github.com/ShenLab/DeepAffinityN/AN/A75Semi-supervised, Unsupervised Learning: RegressionDrug design
Lin J., 2019, China (Lin et al., 2019)DrugBankDruggable proteins and non-druggable proteinsDrug target predictionSVM, GAhttps://github.com/QUST-AIBBDRC/GA-Bagging-SVMMatlabMathWorks90Supervised Learning: PredictionDrug design
Hu B., 2018, China (Hu et al., 2018)DDI, SIDER, TWOSIDES, HPRD, Drug Bank, Offsides PubChemSemantic meta-paths ADRmeta-path-based proximities ADRSDHINE, Network embeddingTensorFlow, N/AC, C++, PythonApache 2.065Supervised Learning: RegressionDrug design
Popova M., 2018, Russia, US (Popova et al., 2018)PHYSPROP, ChEMBL, KKBSMILE stringDrug design (de novo design)Stack-RNN, LSTM, ReLeaSEPyTorch, TensorFlow ReLeaSE https://github.com/isayev/ReLeaSEPython, CUDAOpen Source75Reinforced Learning, Unsupervised Learning: RegressionDrug design
Zafeiris D., 2018, UK (Zafeiris et al., 2018)GEO, Array ExpressionAmyloid beta-precursor protein, microtubule-associated protein tau, apolipoprotein EBiomarker discovery for Alzheimer’s diseaseANNN/AN/AN/A50Supervised Learning: ClassificationEnzyme design
Jimenez J., 2017, Spain (Jiménez et al., 2017)scPDBPDB ID File or PDB filePredict protein–ligand-binding sites Drug design3D-DCNNKeras, Theano www.playmolecule.orgPythonOpen Source90Supervised Learning: RegressionDrug design
Müller A., 2017, Switzerland (Müller et al., 2018)ADAM, APD DADPAntimicrobial peptide Amino acid sequencesDesign of new peptide combinatorial de novo peptide designRNN, LSTMmodlAMP Python package https://github.com/alexarnimueller/LSTM_peptidesPythonOpen Source100Supervised Learning: RegressionDrug design
Ragoza M., 2017, US (Ragoza et al., 2017)PDB ChEMBLSpatial and chemical features of protein–ligand complexProtein–ligand score for drug discoveryCNN, SGDGnina Caffe https://github.com/gninaC++Open Source85Supervised Learning: ClassificationDrug design
Yeh C., 2017, UK, US (Yeh et al., 2018)JSON database: centers of mass and geometric relationship dataHelical repeat proteins, Center of mass (CoM) using C-α protein sequenceDesigned helical repeat proteins (DHRs)GA multithreaded processingELFIN https://github.com/joy13975/elfinPython, C++, MATLABApache 2.0 open source 3-Clause BSD90Supervised Learning: OptimizationDrug design
 Folkman L., 2014, Australia (Folkman et al., 2014)ProThermProtein sequence and amino acid substitutionModel designed for a specific type of mutationEASE-MM, SVMEASE-MM LIBSVM http://www.ict.griffith.edu.au/bioinf/easePython, LinuxOpen Source75Supervised Learning: ClassificationModel design
Khan Z., 2014, Pakistan (Khan et al., 2015)BRENDAAmino Acid sequence and alkaline enzymeEnzyme catalysisDT, KNN, MLP, PNN, SVMMATLAB Bioweka WekaJavaOpen Source MathWorks50Supervised Learning: ClassificationDrug design
Li Y., 2014, US (Li and Cirino, 2014)PDBE. coliDesigns of improved enzymes and enzymes with new functions and activitiesComputational design and scaffolding and compartmentalizationN/AN/AN/A50N/ADrug design
Murphy G., 2014, US (Murphy et al., 2015)DND_4HB proteinDND_4HB proteinDesign an up-down four-helix bundleComputational foldingN/AN/AN/A50N/ADrug design
Traoré S., 2013, France (Traoré et al., 2013)PDB3D protein structureStructure-based computational protein design frameworkCFNCPD http://genoweb.toulouse.inra.fr/tschiex/CPDPerlOpen source65Supervised Learning: ClassificationProtein design
Volpato V., 2013, Ireland (Volpato et al., 2013)ENZYME UniProtOxidoreductase, transferase, hydrolase, lyase, isomerase, and ligaseAcid-residue frequency derived from multiple sequence alignments extracted from uniref90N-to-1 Neural NetworkN/AN/AN/A65Supervised Learning: ClassificationDrug design
 Daniels N., 2012, US (Daniels et al., 2012)SCOPProtein sequence, 207 beta structural SCOP super familiesDetection of beta-structural proteins in the twilight zone, making over 100 new fold predictions for the genome of T. maritimaHMM, MRFSMURFLite http://smurf.cs.tufts.edu/smurflite/N/AOpen Source65Unsupervised Learning: ClusteringDrug design
Eisenbeis S., 2012, Germany (Eisenbeis et al., 2012)PDB(βα)8-barrel and the flavodoxin-like fold, CheY, HisFEnzyme designRational recombinationhttp://pubs.acs.org Modeller, RosettaPythonIBM, Academic nonprofit freeware75N/ADrug design
Ebina T., 2011, Japan (Ebina et al., 2011)DS-All datasetProtein sequenceDomain predictorDROP, SVM, RFDROP http://web.tuat.ac.jp/∼domserv/DROP.htmlBash scriptOpen source75Supervised Learning: ClassificationDrug design
Bostan B., 2009, US (Bostan et al., 2009)KEGGGiven a species proteomePredict homologous signaling pathwayPSPN/AN/AN/A50Supervised Learning: ClassificationModel design
Hong E., 2009, US (Hong et al., 2009)Standard rotamer library Expanded rotamer libraryFn3: Derived from protein Fn3, 10th human fibronectin-type III domainTenth human fibronectin, D44.1 and DI.3 antibodies, Human erythropoietinBroMAPBroMAPC++, LinuxOpen Source100Supervised Learning: OptimizationDrug design
Özen A., 2009, Turkey (Özen et al., 2009)ProThermStructure-based features: amino acid substitution likelihood equilibrium fluctuations α, Cβ, packing densitySingle-site amino acid substitutionSVM, KNN, DT, SVRMOSEK http://www.prc.boun.edu.tr/appserv/prc/mlstaN/AOpen Source85Supervised Learning: Classification RegressionModel design
Ebrahimpour A., 2008, Malaysia (Ebrahimpour et al., 2008)GenBankGeobacillus sp. StrainLipase production Syncephalastrum racemosum, Pseudomonas sp. Strain S5 and Pseudomonas aeruginosaANN, FFNN, IBP, BBP, QP, GA, LMCPC-X Software N/AJavaNeural Power version 2.575Supervised Learning: ClassificationProtein design
 Zhu X., 2008, China (Zhu and Lai, 2009)PDB223 scaffold proteinsPocket residues of ribose-binding protein (2dri), tyrosyl-tRNA synthetase (4ts1), and tryptophan synthase (1a50). No metal ion-binding sitesVector matchingN/AN/AN/A65N/ADrug design
Liao J., 2007, US (Liao et al., 2007)GenBankProteinase K-catalyzed hydrolysis of the tetrapeptide N-Succinyl-Ala-Ala-Pro-Leu p-nitroanilideProteinase K variantsRR, Lasso, PLSR, SVMR, LPSVMR, LPBoosR, MR, ORMRN/AMatlabMathWorks75Supervised Learning: Classification RegressionProtein design
Raveh B., 2007, Israel (Raveh et al., 2007)PDBTIM-barrel fold 1YPI. Whole β-sheet global structuresExistence of α-helices, parallel β-sheets, anti-parallel sheets and loops. Non-conventional hybrid structuresK-means clusteringMatlabMatlabMathWorks75Unsupervised Learning: ClusteringProtein design
Zimmermann O., 2006, Germany (Zimmermann and Hansmann, 2006)PDBProtein sequencePrediction of dihedral regionsC-SVMLIBSVM-library DHPRED http://www.fz- juelich.de/nic/cbbC, Python, Linux, WindowsOpen source80Supervised Learning: ClassificationProtein design
Russ W., 2002, US (Russ and Ranganathan, 2002)N/ASH3 domain GroEL minichaperone WW domain prototypeThermostable consensus phytase, 84.5 kDa proteinKnowledge-base potential functionsN/AN/AN/A65N/AProtein design
Rossi A., 2001, Italy (Rossi et al., 2001)PDB, HSSP2ci2 BarnaseBarnase and chymotrypsin inhibitorPerceptronN/AN/AN/A90Supervised Learning: RegressionDrug design

An overview of the protein and drug design articles with the quality assessment.

3D-CNN, Three-dimensional convolutional neural network; ANN, Artificial neural network; BBP, Batch back propagation; BroMap, Branch and bound map estimation; CFN, Cost function network; CNN, Convolutional neural network; DeepDIN, Deep dense inception network; DT, Decision tree; DROP, Domain linker prediction using optimal feature; EASE-MM, Evolutionary Amino acid, and Structural Encodings with Multiple Models; FFNN, Feed forward neural network; GA, Genetic algorithms; GCN, Graph convolutional network; HMM, Hidden Markov model; IBP, Incremental back propagation; KNN, k-nearest neighbor; Lasso, Least absolute shrinkage and selection operator; LM, Levenberg–Marquardt; LPBoostR, Linear programming boosting regression; LPSVMR, Linear programming support vector machine regression; LSTM, Long short-term memory; MLP, Multilayer perceptron; MR, Matching loss regression; MRF, Markov random field; MNNN, Multi-scale neighborhood-based neural network; ORMR, One-norm regularization matching-loss regression; PLSR, Partial least-squares regression; PNN, Probabilistic neural network; PSP, Predict Signal Pathway; QP, quick propagation; ReLeaSE, Reinforcement Learning for Structural Evolution; RF, Random forest; RNN, Recurrent neural network; RR, Ridge regression; SDHINE, Meta path-based heterogeneous information embedding approach; SFFS, Sequential forward floating selection; SGD, Stochastic gradient descent; SVM, Support vector machine; SVMR, Support vector machine regression; SVR, Support vector regression; VSA, Virtual screening algorithms.
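Several of the CNN/RNN models listed in Table 2 take the raw amino acid sequence as input. A common first step, sketched here generically rather than as any cited study's exact pipeline, is one-hot encoding over the 20 standard residues:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional one-hot
    vectors, the kind of raw-sequence input fed to sequence-based
    CNN/RNN models."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = []
    for aa in sequence.upper():
        v = [0] * len(AMINO_ACIDS)
        v[index[aa]] = 1  # raises KeyError for non-standard residues
        vectors.append(v)
    return vectors
```

The resulting L×20 matrix (L = sequence length) is what a 1D convolution slides over; physicochemical property channels, as in the bi-modal architectures of Table 1, can simply be concatenated as extra columns.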

TABLE 3

First Author/Year of Publication/CountryDatabaseInitial scaffold (ID)Designed ProteinML modelSoftware/ServerProgramming language/PlatformLicenseQuality (%)Machine learningProtein application
Protein function prediction
Verma N., 2021, US (Verma et al., 2021)DrugBank matador PDBHuman, C. ElegansProtein–ligand interactionsDNNGitHub (https://github.com/ekraka/SSnet)PythonOpen source75Supervised learning: PredictionProtein–ligand interaction prediction
Du Z., 2020, China, Russia, US (Du et al., 2020)CAFA3, SwissProtHuman, C. ElegansAutomated function predictionNLP, CNNKeras, TensorFlowPythonOpen Source70Supervised Learning: ClassificationProtein function prediction
 Liang M., 2020, China (Liang and Nie, 2020)PDBRelative angle of (C – Cα – C) principal planeEnzymatic function predictionRNN, LSTMTensorFlowPythonOpen Source90Supervised Learning: PredictionProtein function prediction, Function ID
Rifaioglu A., 2019, Turkey, UK (Rifaioglu et al., 2019)UniProtKB/Swiss-ProtN/AGO term predictionDNNTensorflow, https://github.com/cansyl/DEEPredPythonOpen Source70Supervised Learning: RegressionProtein function prediction
Torng W., 2019, US (Torng and Altman, 2019)PROSITE NOS datasetProtein structure as 3D imagesProtein functional site detectionDL, 3D-CNN, SVMN/A https://simtk.org/projects/fscnnPythonN/A75Supervised Learning: ClassificationProtein function prediction
Wan C., 2019, UK (Wan et al., 2019)UniProtKB/Swiss-ProtHuman proteinsFunction predictionDMNN, SVMKeras, https://github.com/psipred/STRING2GOPythonOpen Source80Supervised Learning: Prediction ClassificationProtein function prediction
Feinberg E., 2018, China, US (Feinberg et al., 2018)PDB Bind 2007Scaffold split for grouping ligands in common frameworksMolecular Property PredictionGCNNPyTorch, NumPy and SciPyPythonOpen Source100Supervised Learning: PredictionProtein function prediction
Frasca M., 2018, Italy (Frasca et al., 2018)STRING GOOrganisms: Homo sapiens (human) S. cerevisiae (yeast) Mus musculus (mouse)AFP (Automated Protein Function Prediction)COSNet, ParCOSNet, HNNCOSNet, ParCOSNetC, C++, R, CUDAOpen Source75Unsupervised Learning: ClusteringProtein function prediction
Khurana S., 2018, Qatar, US (Khurana et al., 2018)pepcDB databasek-mer structure and additional sequence and structural features extracted from the protein sequenceSolubility predictionCNN, DL, FFNNPROSO II https://zenodo.org/record/1162886#.XSP26ffPzOQ DeepSol: https://github.com/sameerkhurana10/DSOL_rv0.2Python, LinuxOpen source95Unsupervised Learning: ClusteringProtein function prediction
Li H., 2018, China (Li et al., 2018)HPRD DIP HIPPIEPrimary sequence Escherichia coli, Drosophila, Caenorhabditis elegans, Pan’s PPI datasetsPrediction of protein interactionsDNN, CNN, LSTMKeras, Theano, TensorFlow, N/APythonOpen Source85Supervised Learning: RegressionProtein function prediction
Long H., 2018, China, US (Long et al., 2018)UniProtPseAAC Hydroxyproline and hydroxylysinePredicting hydroxylation sitesCNN, LSTMMXNet, N/ARApache 2.085Supervised Learning: ClassificationProtein function prediction
Makrodimitris S., 2018, Netherlands (Makrodimitris et al., 2019)Arabidopsis thaliana proteinsArabidopsis thaliana proteinProtein function predictionKNN, LSDRSciPy, https://github.com/stamakro/SSP-LSDRPython, MATLAB Bioinformatics toolboxOpen source, Mathworks80Unsupervised Learning: ClusteringProtein function prediction
Zhang L., 2018, China (Zhang L. et al., 2018)UniProt, DIPS. cerevisiae, H. sapiens, and M. musculusPredicting Protein–Protein interactionsDNN, Adam AlgorithmTensorFlowPythonOpen Source100Supervised Learning: PredictionProtein function prediction
Adhikari B., 2017, US (Adhikari et al., 2018)DNCON DatasetN/AContact map protein predictionCNNTensorFlow, Keras http://sysbio.rnet.missouri.edu/dncon2/PythonOpen Source65Supervised Learning: Regression PredictionProtein residue–residue contacts
Cao R., 2017, US (Cao et al., 2017)UniProtProtein sequenceProtein function predictionRNNProLanGO Model N/AN/AN/A50Supervised Learning: ClassificationProtein function prediction
Al-Gharabli S., 2015, Jordan (Al-Gharabli et al., 2015)PDBAmino acid sequence hydrophobicityPrediction of dihedral angles physiochemical properties, enzyme loopsANNN/AN/AN/A50Supervised Learning: ClassificationProtein function prediction
Qi Y., 2012, US (Qi et al., 2012)Standard benchmark, CB513 DSSPPSI-BLAST amino acid embeddingPrediction of the local properties in proteinsDNNTorch5COpen Source100Supervised Learning: ClassificationProtein function prediction
Yang Y., 2011, US (Yang et al., 2011)SPINEProtein sequenceSingle-method fold recognitionSPARKS-X AlgorithmSPARKS-X https://sparks-lab.org/server/sparks-x/Shell scriptOpen Source75Supervised Learning: RegressionProtein function prediction
Latek D., 2010, Poland (Latek and Kolinski, 2011)10 globular proteins, 216 residues, and S100A1 protein10 globular proteins and S100A1 proteinPredicted Nuclear Overhauser Effect signals on the basis of low-energy structures from CABS-NMRCABS, MCCABS- NMR toolkit http://biocomp.chem.uw.edu.pl/services.phpN/AN/A70Unsupervised Learning: ClusteringProtein function prediction
Tian J., 2010, China, US (Tian et al., 2010)ProTherm PDB3D structuresEffect on single- or multi-site mutation on protein thermostabilityRFR, RF, SVMPrethermut http://www.mobioinfor.cn/prethermut/R, Perl, LinuxOpen Source75Supervised Learning: ClassificationProtein function prediction
Wu S., 2008, US (Wu and Zhang, 2008)PDBPDB protein sequenceProtein contact predictorMUSTERMUSTER http://zhang.bioinformatics.ku.edu/MUSTERN/AN/A50Supervised Learning: ClassificationProtein function prediction
Hung C., 2006, Taiwan (Hung et al., 2006)NCBINucleocapsid (nsp1) of a coronavirus familyPredict protein functionsAGCTN/AN/AN/A75Supervised Learning: ClassificationProtein function prediction
Sidhu A., 2006, UK (Sidhu and Zheng, 2006)Swiss-ProtSignal peptides and non-secretory proteins from Human, E. coli, prokaryoticPredict signal peptideBBFNN, DTN/AN/AN/A75Supervised Learning: RegressionProtein function prediction
Capriotti E., 2005, Italy (Capriotti et al., 2005)ProThermProtein tertiary structureProtein stability predictionSVMI-Mutant2.0 http://gpcr.biocomp.unibo.it/cgi/predictors/I-Mutant2.0/I-Mutant2.0.cgiPythonOpen Source75Supervised Learning: ClassificationProtein function prediction
Hu C., 2004, US (Hu et al., 2004)WhatIF database UniProt3D coarse-grained structure from protein sequencesOptimal non-linear scoringSVM non-linear Gaussian kernel functionsN/AN/AN/A65Supervised Learning: ClassificationProtein function prediction
Gutteridge A., 2003, UK (Gutteridge et al., 2003)PDBAmino acid sequence of quinolate phosphoribosyl transferasePredict active siteFFNNN/AN/AN/A50Unsupervised Learning: ClusteringProtein function prediction
Function Prediction and Novel Function
Nie J., 2020, Singapore (Sua et al., 2020)UniProt“acetyl-lysine” (S1), “crotonyl-lysine” (S2), “methyl-lysine” (S3), or “succinyl-lysine” (S4)Identification of Lysine PTM sitesRF, SVM, MNB, LR, ME, KNN, CNN, MLPTensorFlow, https://github.com/khanhlee/lysineSGTPythonN/A100Supervised Learning: ClassificationFunction ID
Savojardo C., 2020, Italy (Savojardo et al., 2020a)UniProtKB GOA, DeepMitoDBHuman, mouse, fly, yeast, and Arabidopsis thalianaProtein sub-mitochondrial localizationDeepMito, 1D-CNNN/AN/AN/A75Supervised learning: PredictionFunction ID
Fang C., 2019, China, Japan (Fang et al., 2019)PDBMoRF-containing membrane protein chainsMolecular recognition features MoRFs predictionDCNNN/AN/AN/A75Supervised Learning: ClassificationFunction ID and Fold ID
Zhang Y., 2019, China (Zhang et al., 2019)PDBPDNA-543, PDNA-224 and PDNA-316Identification of DNA–protein-binding siteADASYNTheanoPythonOpen Source85Supervised Learning: ClassificationFunction ID and Fold ID
Hanson J., 2018, Australia, China (Hanson et al., 2019)PISCES CASP12 PDB5N5EA 6FI2A 6FQ3ASequence-based prediction of one-dimensional structural properties of proteinsCNN, 2D-BRLSTMN/AN/AN/A80Supervised Learning: ClassificationFunction ID
Shah R., 2008, US (Shah et al., 2008)D DatasetProtein sequenceHomology detectionSVMSVM-HUSTLE http://www.sysbio.org/sysbio/networkbio/svm_hustlN/AN/A70Supervised Learning: ClassificationFunction ID and Fold ID

An overview of the protein function prediction, function prediction, and novel function articles with the quality assessment.

1D-CNN, one-dimensional convolutional neural network; 2D-BRLSTM, two-dimensional bidirectional recurrent long short-term memory; 3D-CNN, three-dimensional convolutional neural network; ADASYN, Adaptive Synthetic Sampling; ANN, Artificial neural network; AGCT, Alignment genetic causal tree; BBFNN, Biobasis function neural network; CABS, C-alpha-beta side; CNN, Convolutional neural network; COSNet, Cost-sensitive neural network; DCNN, Deep Convolutional neural network; DMNN, Deep mahout neural network; DFS, Depth first search; DL, Deep learning; DNN, Deep neural network; DTNN, Deep tensor neural network; FFNN, Feed forward neural network; GA, Genetic algorithms; HDL, Hybrid Deep learning; HMM, Hidden Markov model; HNN, Hopfield neural network; KNN, k-nearest neighbor; LR, Logistic regression; LSDR, Label-Space dimensionality reduction; LSTM, Long short-term memory; MC, Monte Carlo; ME, Max Entropy; MLP, Multilayer perceptron; MNB, Multinomial Naïve Bayes; MPNN, Message passing neural network; NLP, Natural language processing; NN, Neural network; ParCOSNet, Parallel COSNet; RF, Random forest; RN, Relational network; RNN, Recurrent neural network; SPARKS-X, Probabilistic-based matching; SVM, Support vector machine.

TABLE 4

First Author/Year of Publication/CountryDatabaseInitial scaffold (ID)Designed ProteinML modelSoftware/SeverProgramming language/PlatformLicenseQuality (%)Machine learningProtein application
Fold ID and physicochemical properties
Rives A., 2020, UK, USA (Rives et al., 2021)SCOPeProtein data in the form of unlabeled amino acid sequences. Small vocabulary of 20 canonical elementsPredicted model contains information about biological properties in its representationsDeep contextual language modelhttps://github.com/facebookresearch/esmPythonOpen source70Supervised learning; predictionPhysicochemical and biological properties
Li H., 2020, France, Hong Kong (Hongjian et al., 2021)PDB, PubChem, ZINC, ChEMBL, BindingDB, HTSChemical Estrogen receptor α (ERα) Anaplastic lymphoma kinase Neuraminidase (NA) Reducing the level of Dmiro protein in flies Acetylcholinesterase (AChE)Protein–ligand complexRF, BRT, kNN, NN, SVM, GBDT, multi-task DNN, XGBoostDescriptor data bank ODDT BINANA RF-Score-v1 RF-Score-v3 MIEC-SVMPythonOpen Source100Supervised Learning; Unsupervised Learning; Prediction Classification RegressionPhysicochemical properties
Shroff R., 2020, US (Shroff et al., 2020)PDBN/AAmino acid association-guided mutation3D CNNTheano, www.mutcompute.comPythonOpen Source70Supervised Learning: Classification PredictionMicroenvironment mutation identification
Wang M., 2020, China, US (Wang M. et al., 2020b)UniProtE. coli, M. musculus, H. sapiensProtein malonylation site predictionDL-CNNKeras, https://github.com/QUST-AIBBDRC/DeepMal/Python, MatlabOpen Source80Supervised Learning: ClassificationMalonylation site prediction
Chen J., 2019, China (Chen et al., 2019)Datasets A(CPLM),B,CProteins and reducing sugarsGlycation product predictionRNN, CNNN/AN/AN/A60Supervised Learning: ClassificationGlycation site predictor
Han X., 2019, Singapore, US (Han et al., 2019)eSolCell-free protein expression from E. coliProtein solubilityGANN/AN/AN/A60Supervised Learning: Regression PredictionProtein solubility prediction
Heinzinger M., 2019, Germany (Heinzinger et al., 2019)UniProt, PDBTS115 CB513 CASP12Protein sequence representationNLP, ELMoPytorch, https://embed.protein.properties/PythonOpen Source80Supervised Learning: ClassificationFold ID
Kaleel M., 2019, Ireland (Kaleel et al., 2019)PDBAmino acids are subcellular into four classes involving RSAPrediction of relative solvent accessibilityBRNNhttp://distilldeep.ucd.ie/paleale/PythonOpen Source90Supervised Learning: PredictionProtein relative solvent accessibility prediction
Li C., 2019, China (Li and Liu, 2020)LE dataset from SCOPMultiple superfamiliesDetect the structural motifs related with the protein foldsMotifCNN and MotifDCNN SVM CNNTensorFlowPythonOpen source100Supervised Learning: ClassificationFold ID
Luo L., 2019, China (Luo L. et al., 2019)BioCreative II, BioCreative III, BioCreative II.5PPI protein articlesProtein–protein interactionKeSACNNKerasPythonOpen Source50Supervised Learning: ClassificationPhysicochemical properties
Taherzadeh T., 2019, Australia, US (Taherzadeh et al., 2019)Uniprot, dbPTM, Uniprep, UnicarKB, GlycoProtDBGlycoproteinN- and O-linked glycosylationDNN SVMTensorFlow, https://sparks-lab.org/server/sprint-gly/PythonOpen Source80Supervised Learning: Regression PredictionGlycosylation site identification
Zhang D., 2019, US (Zhang and Kabuka, 2019)DIP, HPRD, UniProtD. melanogaster, S. cerevisiae, E. coli, C. elegans, H. sapiens, H. pylori, M. musculus, R. norvegicusProtein–protein interactions and protein family predictionMultimodal DNNN/AN/AN/A75Supervised Learning: ClassificationPhysicochemical properties
Cuperus J., 2018, US (Cuperus et al., 2017)5′ UTR library of 50-nt-long random sequencesYeast Saccharomyces cerevisiaePredict protein expressionCNNKeras, Theano, https://github.com/Seeliglab/2017---Deep-learning-yeast-UTRsPythonOpen Source85Supervised Learning: RegressionFold ID
Hochuli J., 2018, US (Hochuli et al., 2018)PDBLigands SMILE Protein FASTAIdentify protein–ligand scoringCNNGnina, Caffe, github.com/gninaC++, PythonOpen source50Supervised Learning: ClassificationProtein Scoring
Luo F., 2018, China (Luo F. et al., 2019)Phospho.ELM, PhosphositePlus, HPRD, dbPTM, SysPTMKinase protein familyProtein phosphorylationCNNhttps://github.com/USTCHIlab/DeepPhosN/AN/A60Supervised Learning: Regression PredictionPhosphorylation site predictor
Zhao X., 2018, China (Zhao et al., 2018)PLMDLysineLysine acetylation sitesCNN DNNKeras, Theano, https://github.com/jiagenlee/DeepAcePythonOpen Source80Supervised Learning: Regression Classification PredictionAcetylation site prediction
Zhao F., 2010, US (Zhao et al., 2010)CASP(PSSM) Position-specific scoring matrix generated by PSI-BLASTProtein foldingCNFCNFN/AN/A80Supervised Learning: ClassificationFold ID
Armstrong K., 2008, US (Armstrong and Tidor, 2008)PDBProtein sequenceProtein engineering space of foldable sequencesComputational mappingN/AC++Open source50N/AFold ID
Shamim M., 2007, India (Shamim et al., 2007)D-B dataset Ext. D-B datasetStructural information of amino acid residue and amino acid residue pairsProtein fold predictionSVMLIBSVM-libraryC++, Java, Python Windows, LinuxOpen source80Supervised Learning: ClassificationFold ID
Protein Classification
Burak T., 2021, Turkey (Alakuş and Türkoğlu, 2021)UniProtProtein sequence from 60 different familiesProtein family classification/identificationFIBHASHN/AN/AN/A70Supervised Learning: ClassificationProtein classification
Zhao Z., 2019, China (Zhao and Gong, 2019)Monomers and dimers from the authorMonomers and dimers from the authorProtein–protein interactionLSTMN/AN/AN/A60Supervised Learning: Unsupervised Learning: RegressionInterface residue pair prediction
Huang L., 2018, US (Huang et al., 2018)DIP, HPRDPPI network graphProtein–protein interactionENN-RLTensorFlow, https://www.eecis.udel.edu/∼lliao/enn/PythonOpen Source75Supervised Learning: PredictionProtein–protein interaction
Le N., 2018, Taiwan (Le et al., 2018)UniProt GORab GGT activity Rab GDI activity Rab GTPase binding Rab GEF activityClassify Rab protein molecules2D-CNNKeras, Theano DeepRab; http://bio216.bioinfo.yzu.edu.tw/deeprab/PythonOpen Source90Supervised Learning: RegressionProtein Classification
Xue L., 2018, China, US (Xue et al., 2019)Swiss-Prot, TrEMBLSecretory proteinProtein sequence into T3Ses or non-T3SesDCNNKeras, https://github.com/lje00006/DeepT3PythonOpen Source60Supervised Learning: Regression ClassificationProtein classification
Zhao B., 2018, US (Zhao and Xue, 2018)DisProt PDBIntrinsically disordered proteins (IDPs), intrinsically disordered regions (IDRs), and intrinsically disordered amino acids (IDAAs)N/AANN, DTDisEMBL, IUPred, VSL2, Dbann, and EspritzN/AN/A50Supervised Learning: RegressionIntrinsically disordered protein prediction
Szalkai B., 2017, Hungary (Szalkai and Grolmusz, 2018a)Swiss-Prot, UniProt, GOThyroid hormone, phenol-containing compound, cellular modified amino acid, protein kinase superfamilyProtein classification by amino acid sequenceANNTensorFlowPythonOpen Source90Supervised Learning: ClassificationProtein Classification
Szalkai B., 2017, Hungary (Szalkai and Grolmusz, 2018b)UniProt GOClasses.treHierarchical Biological Sequence ClassificationDNNSECLAF, TensorFlow https://pitgroup.org/seclaf/PythonOpen Source85Supervised Learning: ClassificationProtein Classification

An overview of the fold ID, physicochemical properties, and protein classification articles with the quality assessment.

3D-CNN, three-dimensional convolutional neural network; ANN, Artificial neural network; BLSTM, Bidirectional long short-term memory; BRNN, Bidirectional recurrent neural network; BRT, Booster regression tree; CNF, Conditional neural field; DNN, Deep neural network; DT, Decision Tree; ELMo, Embeddings from language models; ENN-RL, Evolution neural network-based Regularized Laplacian kernel; FIBHASH, Fibonacci numbers and hashing table; GAN, Generative adversarial network; GBDT, Gradient boosted decision tree; GR, Genetic recombination; KNN, k-nearest neighbor; KeSACNN, Knowledge-enriched Self-Attention convolutional neural network; LSTM, Long short-term memory; MotifCNN, Motif convolutional neural network; MotifDCNN, Motif deep convolutional neural network; Multimodal DNN, Multimodal deep neural network; NLP, Natural language processing; NN, Neural network; RF, Random forest; RNN, Recurrent neural network; SPARKS-X, Probabilistic-based matching; SVM, Support vector machine.

TABLE 5

First Author/Year of Publication/CountryDatabaseInitial scaffold (ID)Designed ProteinML modelSoftware/SeverProgramming language/PlatformLicenseQuality (%)Machine learningProtein application
Protein Structure Prediction
Xu J., 2022, USA (Xu et al., 2021)CASP13, PDB, PISCES, CATHDiscrete probability over distance for three backbone atom pair and inter-residue orientationStructure predictionConvolutional residual neural networkhttps://github.com/j3xugit/RaptorX-3DModeling/pythonOpen source70Supervised Learning; PredictionProtein structure prediction
ALQuraishi M., 2021, USA (AlQuraishi, 2021)PDB, CASP14Primary protein sequenceStructure predictionMarkov random field, Attention networksN/AN/AN/A50Supervised Learning: PredictionProtein structure prediction
Bond P., 2020, UK (Bond et al., 2020)PDBOnly residues with side chains longer than beta-carbonPredicting the correctness of protein residuesNN, MLPCCP4C++, PythonOpen Source60Supervised Learning: RegressionProtein structure prediction
Wardah W., 2020, Australia, Fiji, Japan, US (Wardah et al., 2020)BioLiPPositive (binding) or negative (non-binding), protein sequence classificationPredicting Protein-peptide-binding sitesCNNPyTorch, https://github.com/WafaaWardah/VisualPythonOpen Source100Supervised Learning: Prediction ClassificationProtein structure prediction
Yang J., 2019, China, USA (Yang J. et al., 2020)CASP13, Uniclust30Representation of the rigid-body transform from one residue to another; angles and distancesPredicted inter-residue orientationsDeep residual convolutional neural networkhttps://yanglab.nankai.edu.cn/trRosetta/PythonOpen source70Supervised Learning; PredictionProtein structure prediction
Degiacomi M., 2019, UK (Degiacomi, 2019)PDBMalate dehydrogenase (1MLD), αB crystallin (2WJ7) Phospholipase A2 (1POA), Envelope glycoprotein (1SVB), MurD, closed (3UAG), MurD, open (1E0D), MurD, closed + open (3UAG,1E0D), HIV-1 (1E6J)Enhancement of molecular conformational space generatorMolecular dynamics, RF, auto encoderKeras, TensorflowPythonOpen Source80Unsupervised Learning: ClassificationProtein conformational space
Guo Y., 2019, US (Guo et al., 2019)CB513, CASP10, CASP11Protein sequencesProtein secondary structureACNN, BLSTMKeras, Tensorflow, https://github.com/GYBTA/DALSTM/PythonOpen Source80Supervised Learning: Prediction ClassificationProtein secondary structure prediction
Long S., 2019, China (Long and Tian, 2019)Jpred dataset cullpdb dataset UniRef90 UniProtMultiple superfamiliesProtein secondary structure predictionCNNTensorFlow N/APythonOpen Source60Supervised Learning; Unsupervised Learning; PredictionProtein structure prediction
Mirabello C., 2019, Sweden (Mirabello and Wallner, 2019)PDBN/AMethod predictionNLP, DNNKeras, TensorFlow https://bitbucket.org/clami66/rawmsaPythonOpen Source70Supervised Learning: PredictionProtein structure prediction
Pagès G., 2019, France (Pagès et al., 2019)CASPModel QAProtein model quality assessment3D CNNTensorFlow, Ornate https://team.inria.fr/nanod/software/Ornate/C++, PythonOpen Source85Supervised Learning: RegressionModel protein prediction
Schantz M., 2019, Argentina, Denmark, Malaysia (Klausen et al., 2019)PDB, PISCESCrystal structuresPrediction of protein structural featuresCNN, LSTMKerasPythonOpen source100Supervised Learning: PredictionProtein structure prediction
Wang D., 2019, China (Wang D. et al., 2020)CASP11, 12Caspase 14Protein structure refinementMulti-objective PSOAIR 2.0 www.csbio.sjtu.edu.cn/bioinf/AIR/PythonOpen Source95Supervised Learning: OptimizationProtein structure prediction
Yu C., 2019, US (Yu et al., 2019)PDB194l (lysozyme), 107m (myoglobin), 6cgz (β-barrel), a silk protein, amyloid protein, and othersGeneration of audible sound from amino acid sequence for application on designer materialsRNN, LSTMMagenta TensorFlow, Melody RNNJava, PythonOpen Source100Supervised Learning: RegressionProtein sequence prediction
Zheng W., 2019, US (Zheng et al., 2019)CASP13Query sequence profilesAutomated structure prediction pipelineZhangServer and QUARK pipelinesZhang and Quark serverN/AOpen Source85Supervised Learning: Classification RegressionProtein structure prediction
Fang C., 2018, US (Fang et al., 2018)PDB JPRED CASP CB513Different super-families, CASP10, 11, 12Protein secondary structure predictionDeep3I networkMUFOLD-SS TensorFlow and KerasPythonOpen Source80Supervised Learning: ClassificationProtein structure prediction
O’Connell J., 2018, Australia, China, US (O’Connell et al., 2018)SPIN datasetN/ASequence profile compatibleDNNhttp://sparks-lab.org, SPINN/AOpen Source65Supervised Learning: PredictionProtein sequence prediction
Sunseri J., 2018, US (Sunseri et al., 2019)D3R Grand challenge 3 Grand challenge 3Input ligand SMILES protein FASTA CSARCathepsin S model ligand proteinCNNGnina, Caffe, https://github.com/gninaC++, PythonOpen Source100Supervised Learning: RegressionProtein model prediction
Zhang B., 2018, China (Zhang B. et al., 2018)PDB, PISCES, TR5534 DatasetCASP10, 11, 12 and 13Prediction of performance of proteinCNN, RNN, BRNNKerasPythonOpen Source100Supervised Learning, PredictionProtein structure prediction
Armenteros J., 2017, Denmark (Almagro Armenteros et al., 2017)UniProtProtein sequence, Sequence informationPredict protein subcellular localizationCNN, RNN BLSTM, FFNN, Attention modelsLasagne, Theano, Deep Loc: http://www.cbs.dtu.dk/services/DeepLocPythonLicense MIT90Supervised Learning: ClassificationProtein structure prediction
Vang Y., 2017, US (Vang and Xie, 2017)IEDB MHCBN SYFPEITHIHuman leukocyte antigen (HLA) complexHLA class I-peptide-binding predictionNLP, CNNKeras, Theano, https://github.com/uci-cbcl/HLA-bindPythonOpen Source100Supervised Learning: RegressionProtein structure prediction
Wang S., 2017, US (Wang et al., 2017)Pfam CASP CAMEO150 Pfam families 105 CASP11 test proteins 76 hard CAMEO5f5pHDRNNTensorFlow, Theano http://raptorx.uchicago.edu/ContactMap/PythonApache 2.075Supervised Learning: Classification RegressionProtein structure prediction
Yang J., 2015, China, US (Yang et al., 2015)PDB SPx dataset PDBCYS datasetAmino acid sequenceStructure prediction of cysteine-rich proteinsHMM, SVRCYSCON http://www.csbio.sjtu.edu.cn/bioinf/Cyscon/N/AN/A75Supervised Learning: RegressionProtein structure prediction
Li Z., 2014, US (Li et al., 2014)PISCESTL2282 dataset TS500 dataset TR1532 datasetSequence profile predictionSPIN, NNSPIN http://sparks-lab.orgPython, LinuxOpen Source85Supervised Learning: ClassificationProtein structure prediction
Wong K., 2013, Canada, US, Saudi Arabia (Wong et al., 2013)Protein-Binding Microarray datasetDNA sequenceDNA-motif discoveryKmer-HMMkmerHMM http://www.cs.toronto.edu/wkc/kmerHMMN/AN/A50Supervised Learning: Classification. Unsupervised Learning: ClusteringModel Discovery
Katzman S., 2008, US (Katzman et al., 2008)PDB PISCESAmino acid sequence of a protein of unknown structureLocal structure predictionMulti-layer NNPREDICT-2ND, http://www.soe.ucsc.edu/∼karplus/predict-2nd/C++Open source80Unsupervised Learning: ClusteringProtein structure prediction
Bindslev C., 2002, Denmark (Bindslev-Jensen et al., 2003)20 Patients with allergy to Macrozoarces americanusMacrozoarces americanusInvestigate potential allergenicity of Ice Structuring Protein (ISP)DTN/AN/AN/A45Supervised Learning: RegressionProtein structure prediction

An overview of the protein structure prediction articles with the quality assessment.

3D-CNN, three-dimensional convolutional neural network; ACNN, Asymmetric convolutional neural network; BLSTM, Bidirectional long short-term memory; BRNN, Bidirectional recurrent neural network; CNN, Convolutional neural network; Deep3I, Deep inception-inside-inception network; DNN, Deep neural network; DRNN, Deep residual neural network; DT, Decision Tree; FFNN, Feed forward neural network; HMM, Hidden Markov model; K-merHMM, K.mer Hidden Markov model; LSTM, Long short-term memory; MC, Monte Carlo; ML, Model; MLP, Multilayer perceptron; NN, Neural network; PSO, Particle swarm optimization; RNN, Recurrent neural network; RNN 2, Residual neural network; SPIN, Sequence Profiles by Integrated Neural network; SVR, Support vector regression; UDNN, Ultradeep neural network.

TABLE 6

First Author/Year of Publication/CountryDatabaseInitial scaffold (ID)Designed ProteinML modelSoftware/SeverProgramming language/PlatformLicenseQuality (%)Machine learningProtein application
Protein Contact Map Prediction
Yang H., 2020, China (Yang H. et al., 2020)SCOPe 2.07N/AContact map protein predictionGANKeras, Tensorflow https://github.com/melissaya/GANconPythonOpen Source70Supervised Learning: RegressionContact map prediction
Hanson J., 2018, Australia, China (Hanson et al., 2018)PDB UniProtPrimary amino acid sequence, proteins from CASP12Protein contact map predictionCNN, 2D-BRLSTMhttp://sparks-lab.org/jack/server/SPOTContact/N/AN/A95Supervised Learning: PredictionProtein contact map prediction
Ashkenazy H., 2011, Israel (Ashkenazy et al., 2011)PDB3D protein structureContact map predictionWMChttp://tau.ac.il/∼haimash/WMCPerlOpen Source45N/AProtein map prediction
Durrant J., 2011, US (Durrant and McCammon, 2011)PDB MOADCrystal structure dataIdentification of small-molecule ligandsANN scoring function mapNNScore 2.0 http://www.nbcr.net/software/nnscore/PythonOpen Source50Supervised Learning: ClassificationProtein map prediction
Lin G., 2010, US (Lin et al., 2010)PDBProtein Folding Rates. Predicting protein folding rates from geometric contact and amino acid sequenceProtein folding kinetic rate and real-value folding rateSVM, SVRSeqRate http://casp.rnet.missouri.edu/fold_rate/index.htmlJavaOpen Source75Supervised Learning: ClassificationProtein map prediction
Protein-Binding Prediction
Song J., 2021, China (Song et al., 2021)PDB Swiss-ProtATP-binding proteinsProtein–ATP-Binding ResiduesDCNN, LightGBMTensorFlow, Keras https://github.com/tlsjz/ATPensemblePythonOpen Source80Supervised Learning: Regression Prediction ClassificationPrediction of Protein–ATP Binding Residues
Kwon Y., 2020, Korea (Kwon et al., 2020)PDBind-2016VEGFR2 kinase domain and adenosine deaminasePrediction of affinity-binding of a protein–ligand complex3D-CNNKeras, TensorflowPythonOpen Source85Supervised Learning: PredictionProtein affinity-binding prediction
Mahmoud A., 2020, Switzerland, US (Mahmoud et al., 2020)PDBHIV-1 protease, dihydrofolate reductaseHydration site occupancy and thermodynamics predictionsCNNhttps://hub.docker.com/r/lilllab/watsite3N/AOpen Source65Supervised Learning: Regression ClassificationProtein–ligand-binding prediction
Wang M., 2020, US (Wang M. et al., 2020a)SKEMPI 1.0, 2.0 dataset AB-Bind S645 datasetProtein–protein complexesProtein–ligand-binding affinity predictionsSite-specific persistent homology, CNN, GBTTopNetTree, Keras https://doi.org/10.24433/CO.0537487.v1Matlab, java, pythonOpen Source90Supervised Learning: PredictionProtein–protein-binding affinity
Luo X., 2019, China (Luo et al., 2020)TransfacDNA sequencesPredicting DNA–protein bindingCNNKeras, Tensorflow https://github.com/gao-lab/ePoolingPythonOpen Source70Supervised Learning: Regression PredictionProtein-binding prediction
Protein Site Prediction
Zheng W., 2020, China, US (Zheng et al., 2020)SCOPe2.07N/AProtein domain boundariesDRNNhttps://zhanglab.ccmb.med.umich.edu/FUpred/N/AOpen Source60Supervised Learning: ClassificationProtein domain identification
Cui Y., 2019, China (Cui et al., 2019)BioLipFourteen binding residuesProtein–ligand-binding residue predictionDCNNTensorFlow, https://github.com/yfCuiFaith/DeepCSeqSitePythonOpen Source100Supervised Learning: PredictionProtein site prediction
Fu H., 2019, China (Fu et al., 2019)PLMDSequences and physicochemical properties of proteinPredict Lysine ubiquitination sites in large scaleCNN, DL DeepUbiTensorFlow, DeepUbi: https://github.com/Sunmile/DeepUbiPython, MATLAB, LinuxOpen Source100Supervised Learning: ClassificationProtein site prediction
Haberal I., 2019, Norway, Turkey (Haberal and Ogul, 2019)PDBMetal binding of histidine and cysteine amino acidsPrediction of metal binding in proteins2D-CNN, LSTM, RNNKeras, TensorFlowPythonOpen Source100Supervised Learning: PredictionProtein site prediction
Savojardo C., 2019, Italy (Savojardo et al., 2020b)UniprotKB/Swiss-ProtMitochondrial proteinsSub-mitochondrial cellular localizationCNNhttp://busca.biocomp.unibo.it/deepmitoPythonOpen Source75Supervised Learning: RegressionProtein sub-mitochondrial site prediction
Simha R., 2015, Canada, Germany, US (Simha et al., 2015)DBMLoc datasetN/AProtein multi-location predictionMDLoc, BNMDLoc http://www.eecis.udel.edu/compbio/mdlocPythonOpen Source75Supervised Learning: ClassificationProtein site prediction
Briesemeister S., 2010, Germany (Briesemeister et al., 2010)UniProtProtein sequencePredict protein subcellular localizationNBYLoc, Weka, www.multiloc.org/YLocPython, Java, LinuxOpen source85Supervised Learning: ClassificationProtein site prediction
Huang W., 2008, Taiwan (Huang et al., 2008)UniProt GOSCL12, SCL16 Sequence-based, GO terms, protein sequencePrediction method for predicting subcellular localization of novel proteinsGA, SVMLIBSVM ProlocGO http://iclab.life.nctu.edu.tw/prolocgoN/AN/A75Supervised Learning: ClassificationProtein site prediction
Ladunga I., 1991, Hungary (Ladunga et al., 1991)UniProtSignal peptideNovel predicted signal peptidesNN (Tiling algorithm)N/ACN/A50Supervised Learning: ClassificationProtein site prediction
Genomics
Dai W., 2020, China (Dai et al., 2020)Reactome DB and InBio Map DBHuman essential genePredict human essential genesNetwork embedding, SVMN/AN/AN/A50Supervised Learning: ClassificationHuman gene prediction
Picart-Armada S., 2019, Belgium, UK, Spain (Picart-Armada et al., 2019)STRINGGene-disease data from 22 common non-cancerous diseasesTarget disease gene identificationPR, Random Randomraw EGAD, PPR, Raw, GM, MC, Z-scores, KNN, WSLD, COSNet, bagSVM, RF, SVMhttps://github.com/b2slab/genediseROpen Source80Semi-supervised, Supervised Learning: ClassificationTarget gene identification, target drug discovery

An overview of the protein contact map prediction, protein-binding prediction, protein site prediction, and genomics articles with the quality assessment.

2D-BRLSTM, two-dimensional bidirectional Res-long short-term memory; 2D-CNN, Two-dimensional convolutional neural network; 3D-CNN, Three-dimensional convolutional neural network; ANN, Artificial neural network; BN, Bayesian Network; CNN, Convolutional neural network; DCNN, Deep Convolutional neural network; DL, Deep learning; GAs, Genetic algorithms; GBT, Gradient boost tree; KNN, k-nearest neighbor; LightGBM, Light Gradient Boosting Machine; LSTM, Long short-term memory; NB, Naïve Bayes; NN, Neural network; RNN, Recurrent neural network; SVM, Support vector machine; SVR, Support vector regression; WMC, Weighted multiple conformation.

Results

Article Scaffolding

This article is arranged as follows (Figure 2). First, we describe the process of designing and preparing the guideline used throughout the article. Second, we review the formulation of the research question, the problem statement and research objectives, and the treatment and applications of the data. Third, we describe the observation, research, and review of the selected articles, including the gathering of AI–PS information: the identification of filtered and curated data, the features implemented, the input data and data-encoding formats, the machine learning algorithms and methods used, and the post-processing steps (quality-rule processing, filtering, and the combination or unification of information). The collected information is then interpreted and represented in figures and tables. The results focus on the latest findings of AI applications in the field of protein science, as well as on the use of specific algorithms for protein design. Our aim is to cover a wide scope of the state of the art of artificial intelligence within protein science. This leads to the subsequent analysis and discussion of AI applications in the protein field, including the classification and identification of major protein structures and of components not yet found or described in nature; the resolution of protein structure prediction and related problems are plausible outcomes of future research.

Toward an Innovative Cross-Functional AI–PS Binomial Inter-field

This systematic review and meta-analysis are focused on the latest findings of AI applications to the field of protein science as well as specific algorithms used for protein design. Furthermore, it aims to include a wide scope of the state of the art of artificial intelligence in protein science. PIO is the methodology used to address the following research question: What is the state of the art in the use of artificial intelligence in the protein science field? Figure 1 shows the total number of articles retrieved using the PIO strategy in the PubMed database.

The systematic review process began with 541 references obtained from five electronic databases: 42 from PubMed, 74 from Ebsco, 48 from Bireme, 38 from OVID, and 339 from Web of Science. In the first screening, 403 articles were removed: 250 duplicate references, 2 not written in Spanish or English, 149 whose topic was irrelevant to the review, and 2 newspaper items, letters, or reviews. This selection process left 138 references, to which we manually added 6, for a total of 144 articles for the review (Figure 3).

A second screening (eligibility) was performed using the following set of quality criteria:

  • 1. Clear research questions and objectives.

  • 2. Definition of the measured concepts.

  • 3. Reliability and feasibility of the instruments to be measured.

  • 4. Detailed description of the method.

  • 5. Scaffolding and enhanced protein information.

  • 6. Characteristics of scaffolding and its realization.

  • 7. Appropriate system and learning approach.

  • 8. Journal impact.

A total of 93 articles were included for further analysis, and 51 studies were removed based on quality criteria.

Machine Learning Approach to Protein Science

Proteins are influenced by epigenetic phenomena (cellular stress, aging, etc.) through their multiple structure–folding–function relationships within protein science (PS), phenomena that can be tackled through the use of artificial intelligence (AI). Several questions arise within this interdisciplinary approach: How do proteins evolve? How do proteins fold and acquire their tridimensional structure? What are the networks within proteins? The astronomical number of possible protein structures, configurations, and functions requires the use of AI as a tool to fully understand protein behavior.

A total of 144 articles were assessed for quality (Tables 2–6), resulting in 93 articles (Table 1); articles scoring 75% or higher on the quality qualification were kept for the final biochemical meta-analysis. For this review and meta-analysis, we identified five main applications of AI in PS (Tables 2–6 and Figures 4–6):

  • I. Protein design and drug design (Table 2)

    • a) De novo protein design.

    • b) Novel biocatalyst design.

    • c) Novel function and ligand interaction.

    • d) Evolution of non-existent proteins in nature.

    • e) Chemical structure and properties.

    • f) Drug–drug interaction.

    • g) Drug–receptor interaction.

    • h) Drug effects.

  • II. Protein function, function prediction, and novel function (Table 3)

    • a) Protein–ligand interactions.

    • b) Hydroxylation site prediction.

    • c) Prediction of the local properties in proteins.

    • d) Enzymatic function prediction.

    • e) Predicting protein–protein interactions.

    • f) Function prediction.

    • g) Molecular property prediction.

  • III. Fold ID, physicochemical properties, and protein classification (Table 4)

    • a) Fold Id.

    • b) Glycation site predictor.

    • c) Phosphorylation site predictor.

    • d) Protein–protein interaction.

    • e) Intrinsically disordered protein prediction.

  • IV. Protein structure prediction (Table 5)

    • a) Protein structure prediction: primary, secondary, and 3D-structures; domains, active sites, allosteric sites, and structural feature prediction.

    • b) Protein structure classification: folds, structural families, intrinsically disorder proteins, etc.

    • c) Protein–protein interactions and protein networks.

    • d) Protein–ligand interactions: substrates, inhibitors, activators, ions, etc.

  • V. Protein contact map prediction, protein-binding prediction, protein site prediction, and genomics (Table 6)

    • a) Contact map prediction.

    • b) Protein sub-mitochondrial site prediction.

    • c) Genomics.

FIGURE 5

FIGURE 6

Forty percent (57/144) of the proteins studied in the AI applications were the following: myoglobin, silk protein, amyloid proteins, the Rab family, the cathepsin S family, the kinase family, K proteinase, barnase, the apolipoprotein family, protein DND_4HB, and antimicrobial peptides. Studies on enzymes should also be pointed out: oxidoreductases, transferases, hydrolases, lyases, isomerases, ligases, NOS (nitric oxide synthase), and lysozyme, which are included in the columns of the initial scaffold (Tables 2–6). These proteins are very useful in industry as well as in the biomedical fields. With respect to the type of organism, the most explored were E. coli, Drosophila, Caenorhabditis elegans, Homo sapiens, the yeast S. cerevisiae, Mus musculus (mouse), Geobacillus, and Coronavirus.

Tables 2–6 list the databases most commonly used in AI applications to PS. Across all the studies reviewed, single-database usage was as follows:

  • 1) PDB (30/144) 21%.

  • 2) Author’s dataset construction (21/144) 15%.

  • 3) UniProt, either UniProtKB or UniProtKB/SwissProt (12/144) 8%.

  • 4) CASP (critical assessment of protein structure prediction) database (5/144) 3%.

  • 5) SCOP (structural classification of proteins) (4/144) 3%.

  • 6) N/A, GenBank (4/144) 3%.

  • 7) ProTherm (3/144) 2%.

  • 8) BioLip (biologically relevant ligand–protein) (2/144) 1%.

  • 9) PLMD (protein lysine modifications database) (2/144) 1%.

  • 10) Each of the following databases at (1/144) 1%: ChEMBL, eSol, GEO, DSSP, DrugBank, BioCreative, TRANSFAC, STRING, BRENDA, SPINE, PISCES, NCBI, D3R Grand Challenge 3, and KEGG.

Of the studies reviewed, 16% (23/144) use two databases. Of these, 48% (11/23) use the PDB in combination with HSPP, PISCES, ProTherm, MOAD, the SPx dataset, ChEMBL, DisProt, or UniProt/SwissProt; 17% (4/23) combine the GO database with UniProt or STRING; 17% (4/23) combine the UniProt/SwissProt database with the ENZYME, DIP, TrEMBL, or CAFA databases; and 9% (2/23) use combinations among the DIP, HPRD, and SKEMPI databases and the SPx dataset. The remaining 17% (24/144) use combinations of three or more databases, including the PDB and UniProt, among others.

Moreover, several authors (Shamim et al., 2007; Simha et al., 2015; Yang et al., 2015; Li et al., 2018; Torng and Altman, 2019) focused on using previously constructed datasets, while others chose to create their own, based on their own design and intended outcome, for example, NOS, PPIs, SPx, DBMLoc, D-B, and Extended D-B (Tables 2–6 and Figure 5).

The following tables show the principal protein categories that were found in this study. Table 2 shows the result of each of the 38 articles that were considered in the protein and drug design category.

Table 3 shows 26 studies that are related to protein function prediction and 6 studies related to function prediction and novel function.

Table 4 shows 19 studies that are related to fold ID and physicochemical properties and 8 studies related to protein classification.

Table 5 shows 26 studies that are related to protein structure prediction.

Table 6 shows five studies for protein contact map prediction, five studies for protein-binding prediction, nine studies for protein site prediction, and two studies for genomics.

Table 1 shows the overview of the extracted information of the selected studies based on the quality criteria.

Machine Learning Paradigms and AI Algorithm Roles

The most applied approach found in our review and meta-analysis corresponds to supervised learning, at 85% (123/144), which comprises classification algorithms (CNN, NB, KNN, RF, SVM, etc.) and regression algorithms (SVR, RFR, DT, ANN, DNN, etc.) used for a variety of tasks: detection of functional sites, hydroxylation sites, amino acid composition, DNA expression sequences, protein interaction, biomarker finding, protein design, drug design, 3D structure prediction, and protein folding (Tables 2–6 and Figures 4, 5). Within supervised machine learning (123 studies), classification techniques outnumber regression ones (31/123) by far (see Tables 2–6). On closer inspection, these methods are generally very good at prediction tasks, although the required execution time may increase complexity significantly, something that is often reported as a drawback (AlQuraishi, 2021).
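
None of these classification algorithms is spelled out in the reviewed articles themselves, but the flavor of a supervised classifier can be conveyed with a minimal sketch. The following k-nearest-neighbors (KNN) classifier, written in plain Python with toy, invented feature vectors and labels, illustrates the classify-by-labeled-examples paradigm the surveyed studies rely on:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    labeled training points (Euclidean distance)."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy example: two made-up features per "protein" (not real data).
train = [((0.1, 0.2), "soluble"), ((0.2, 0.1), "soluble"),
         ((0.9, 0.8), "aggregating"), ((0.8, 0.9), "aggregating")]
label = knn_predict(train, (0.15, 0.15))  # query near the "soluble" group
```

The same fit-on-labeled-pairs, predict-on-new-points interface is what the larger classifiers (SVM, RF, CNN) generalize.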

In contrast to supervised learning, only 12% (17/144) of the studies focus on unsupervised learning, using clustering and related algorithms (CNN, FFNN, LSDR, DL, HMM, MRF, NN, etc.) for various purposes, such as protein solubility prediction, prediction of new protein functions, discovery of DNA motifs, detection of protein structures, and prediction of the nuclear Overhauser effect at low energies. Of the eight articles using this approach, two report an improvement in performance as an advantage, one in time reduction (Frasca et al., 2018) and the other in the acceleration of automated protein function prediction methods in general (Makrodimitris et al., 2019). At the same time, however, a reported disadvantage is increased execution time, a fact that should not surprise us, for it is well known that unsupervised learning algorithms are computationally very complex (Table 1 and Figures 4–7).
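
As a contrast to the supervised setting, the core loop of a clustering method can also be sketched compactly. The minimal k-means implementation below (plain Python, deterministic initialization from the first k points, a simplification rather than any reviewed article's actual method) shows the assign-then-update iteration that underlies many unsupervised approaches:

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    centroids = list(points[:k])  # deterministic initialization (simplification)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated toy "feature" groups (invented data).
points = [(0.0, 0.1), (0.9, 1.0), (0.1, 0.0), (1.0, 0.9)]
centroids, clusters = kmeans(points, k=2)
```

No labels are consumed anywhere, which is precisely what distinguishes this family from the supervised algorithms above.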

On the other hand, supervised machine learning is used only slightly more than deep learning techniques. Moreover, it is interesting to note that roughly 53% (77/144) of the deep learning articles combine two algorithms, most often CNN (47/77, 61%) and LSTM (16/77, 21%). Some articles also put forward optimization procedures in a genetic-algorithm fashion (Figures 4–7).

Regarding hybrid algorithms using neural networks, we found that all 11 articles explicitly stating their use of hybrid algorithms belong to the deep learning paradigm, combining CNN and LSTM or RNN and CNN. One of them (Almagro Armenteros et al., 2017) goes even further in that it uses a combination of these two neural networks to predict protein subcellular localization and then an attention mechanism to identify the protein regions important for that localization (Table 1 and Figures 4–6).

It is interesting to note as well that nine articles target prediction tasks (glycation product prediction (Chen et al., 2019), protein secondary structure (Guo et al., 2019), prediction of metal binding in proteins (Haberal and Ogul, 2019), compound–protein affinity prediction (Karimi et al., 2019), prediction of protein structural features (Klausen et al., 2019), protein contact map prediction (Hanson et al., 2018), prediction of protein interactions (Huang et al., 2018), prediction of hydroxylation sites (Long et al., 2018), and prediction of protein subcellular localization (Almagro Armenteros et al., 2017)), of which two perform prediction directly from the original sequences (Almagro Armenteros et al., 2017; Li et al., 2018).

Moreover, one of them highlights the design of new drugs as an application, and one of them performs this task (Karimi et al., 2019).

It is tempting to claim that hybrid deep learning algorithms are very good for prediction tasks as well as for applications in new drug design. It is noteworthy that these articles belong to the last 3 years covered by our review, which suggests a tendency toward hybrid methods in the near future (Table 1).

AI Training, Validation, and Performance

The validation process provides a quantitative measure of a model’s efficiency. In the studies in this systematic review, the machine and deep learning models were trained and validated mainly by means of hold-out and k-fold cross-validation, the latter being the most utilized, each study with a different folding proposal, e.g., 2-, 3-, 5-, and 10-fold; Szalkai and Grolmusz (2018a) trained and validated their algorithm using two validations, 3- and 5-fold cross-validation. Several articles employed a graphics processing unit (GPU) to accelerate the deep learning training and validation process. The most utilized AI algorithm in these articles was CNN, with a 33% occurrence, followed by DNN with 9%, both programmed in Python. The performance of the AI algorithms for protein design was evaluated using parameters such as sensitivity, specificity, true-positive rate, false-positive rate, accuracy, recall, precision, F1-score, area under the curve (AUC), receiver operating characteristic (ROC) curve, and Matthew’s correlation coefficient (MCC). In hold-out validation, a dataset, which for this review means a database of proteins, genes, peptides, etc. (see Tables 2–6 and Figures 4–6), is randomly divided into two mutually exclusive partitions with different proportions (50, 70, or 75% training versus 50, 30, or 25% validation). 
The first partition is used to feed the input vectors and train the machine or deep learning algorithms, while the rest is used to evaluate and validate the results obtained with the proposed algorithms. This methodology is computationally very simple; however, it suffers from high variance, because it is not known which data will end up in the test set and which in the training set, nor how important those data might be, and for large datasets, in particular large protein databases, the processing can still take a long time. As a result of our meta-analysis, we found hold-out used to train and validate AI proposals such as CNN, RNN, LSTM, and FFNN (Tables 1–6 and Figures 4–6) in the prediction of protein expression, interactions, and subcellular localization, as well as in the prediction of peptide binding.
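
The hold-out scheme described above can be written down in a few lines. This sketch (plain Python; the 70/30 proportion, the seed, and the toy identifiers are illustrative choices, not taken from any reviewed article) produces the two mutually exclusive partitions:

```python
import random

def holdout_split(dataset, train_frac=0.7, seed=42):
    """Randomly partition `dataset` into mutually exclusive
    training and validation subsets (here 70%/30%)."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-ins for database entries (proteins, genes, peptides, ...).
entries = [f"protein_{i:03d}" for i in range(100)]
train, val = holdout_split(entries, train_frac=0.7)
# 70 training entries, 30 validation entries, no overlap
```

The high variance mentioned in the text comes from the single random cut: a different seed yields a different, equally valid partition.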

Another technique for evaluating the performance of AI methods, particularly with large databases such as those used in protein design, is cross-validation. Cross-validation estimates the ability of a model to fit an unknown dataset given the collected dataset. In this context, k-fold cross-validation is an iterative process that divides the dataset randomly into k groups of approximately the same size: in each iteration, k − 1 groups are used to train the AI model and the remaining group is used to test and validate it, repeating the process k times with a different validation group each time. Although not all possible combinations of sets are examined, a more than acceptable estimate of the average accuracy can be obtained by training the model only k times. Although cross-validation is computationally an intensive training and validation method, its advantage is that all the data are used for training and tested exactly once, maintaining reduced variance and bias. Of the 93 articles in this review, 41 (47%) used the following cross-validation schemes: leave-one-out and 2-, 3-, 4-, 5-, 6-, 7-, 8-, 10-, and 20-fold cross-validation. Among these, 5-fold and 10-fold cross-validation predominated for analyzing the performance of the AI proposals, with 16 and 17 articles, respectively. This method was preferred for evaluating the performance of CNN and SVM algorithms, with databases such as the PDB, ProTherm, UniProt, GO, and ChEMBL. Additionally, seven articles (17%) carried out several types of cross-validation to obtain more information on the performance of their proposals. 
Another variant was observed in three articles (7%), which combined both hold-out and cross-validation methodologies in their proposals, providing a more effective comparison of results across validation schemes.
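
The k-fold procedure described above maps directly to code. The following helper (a generic sketch in plain Python, not any article's implementation) yields the k train/validation index splits; every sample appears in exactly one validation fold:

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation:
    each of k roughly equal folds serves once as the validation set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, val_idx

# Example: a 5-fold split over a toy dataset of 23 samples.
splits = list(kfold_indices(23, 5))
```

Averaging a metric over the k validation folds gives the cross-validated performance estimate the reviewed studies report.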

In contrast, 22 articles in this review (25%) mentioned neither their training methods nor the validation performed to evaluate the performance of their algorithms. Likewise, 7% of the articles evaluated their methods using several types of cross-validation at the same time to obtain more information on the performance of their proposals, e.g., 4-, 6-, 8-, and 1-fold; 3-, 5-, 7-, and 1-fold; or 10- and 20-fold, with databases such as the PDB, UniProt, GO, ChEMBL, ProTherm, PISCES, GenBank, and STRING, and new databases such as NOS, SPx, D-B, and Ext D-B.

In general, the performance of all proposed AI algorithms was evaluated using several parameters such as sensitivity, specificity, true-positive rate, false-positive rate, accuracy, recall, precision, root-mean-square error (RMSE), R2, F1-score, area under the curve (AUC), receiver operating characteristic (ROC) curve, and Matthew’s correlation coefficient (MCC) (Table 1).
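
All of the listed parameters except the AUC/ROC can be derived directly from the four confusion-matrix counts. As a reference sketch (the counts below are invented purely for illustration), the main metrics reduce to:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity / true-positive rate
    specificity = tn / (tn + fp)       # 1 - false-positive rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "accuracy": accuracy,
            "f1": f1, "mcc": mcc}

m = classification_metrics(tp=40, fp=10, tn=45, fn=5)  # toy counts
```

MCC is often favored in the protein literature because, unlike accuracy, it remains informative when the positive and negative classes are strongly imbalanced.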

Of the 87 articles selected as finalists, 32 use one single algorithm and 55 use a combination of two or three algorithms applied sequentially. Among these, we found 30 applying machine learning (most often SVM), 20 applying deep learning, 11 applying deep learning with RNN, and 6 using optimization through genetic algorithms.

Regarding the programming language in which each study was developed, 47 articles do not specify the language they are based on; 75 articles are based on Python, of which 57 are based entirely on Python and 18 combine it with other software (see Tables 2–6).

Twelve articles are based on C++, of which only three are based exclusively on that language and nine combine it with Python; with C, R, and CUDA; or with C++ in the Linux environment.

Another nine articles are based on MATLAB, of which only four are based exclusively on that language and five combine it with Python and Bioinformatics tools, or with Python and C++.

Six articles are based on C, of which three are based exclusively on that language and three combine it with C++, R, and CUDA, or with Java and Python, and one runs in both Linux and Windows environments.

Finally, seven articles are based on Java, of which two are written exclusively in this language and five combine it with TensorFlow or with C and Python.

Regarding software licenses, 90 articles were found to be open source. One article is licensed under Neural Power version 2.5, and one article each specifies an open license type belonging to IBM and to GNU, respectively. Unfortunately, 45 articles did not specify the type of license they hold.

Road Map of Artificial Intelligence in Protein Science

The goal of this analysis is to provide a road map for applying machine learning and AI techniques in protein science. One result of our meta-analysis, for protein structure prediction, is shown in Figure 6, in which we can observe the two main strategies for protein structure prediction. In Figure 2, we show the scaffold template-based modeling that is most commonly used by scientists in this field, with very good results. Recently, however, Senior and collaborators, using a free modeling approach, successfully developed the AlphaFold algorithm based on a deep neural network. They achieved outstanding accuracy for the 3D structure of a protein with an unknown fold in CASP14 (Senior et al., 2020). This raises a big open question about the importance of the starting point in protein structure prediction in particular and in protein science in general.

The road map of this research is an evolving and dynamic process (Figure 7). It begins by obtaining information from a list of several databases, followed by a pre-treatment step over the extracted data, including steps for eliminating redundancies within sequences, applying structure thresholds based on RMSD values, and the like. Further pre-processing steps complete the reporting process, after which the data themselves are processed: the input data are prepared as FASTA sequences, training sets, or 3D structures, depending on the algorithm at hand, and the machine learning algorithm is applied. The algorithms used fall into four categories: supervised learning, unsupervised learning, deep learning, and optimization, where each category includes its own set of subparts, which are then combined and configured to predict new ways of modeling previous data and to contribute to future implementations in protein science. The post-processing of the data and the support of the newly acquired data consist of models and sequences loaded onto platforms and servers such as DeepUbi, DeepSol, COSNet, and Gnina, among others, which are used for the storage or deployment of the respective methods. Figure 7 shows that more than half of the reported research completed the three steps we set forward, pre-process, process, and post-process, so this sequence may be applied across protein science, including protein design, classification, physicochemical properties, functionalities, folding properties, and new functions such as homology prediction, domain prediction, subcellular localization, drug design, and sensitivity, as well as other enhancers that can provide new catalysts and new functions, all of which support future development for biomolecular enhancement within protein science through machine learning. 
Model development is intrinsically related to the protein application to be developed. Data extraction varies depending on the architecture of the model, since the data become more complex as the transformation, training, and feature-extraction processes unfold. Extraction ranges from obtaining the amino acid sequence and secondary structure to the 3D atomic model built from atomic coordinates. Data transformation emphasizes adequate filtering of the information for training the model, which leads to feature extraction for the machine learning model and finally to generating the output. The process road map includes the fusion of these different applied AI learning schemes, models, and classifications into a connected deep learning layer to be included in future research and test datasets, covering AI science, proteins, and their applications.
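
The pre-treatment stage, reading sequence records and removing redundancy, can be illustrated concretely. The sketch below (plain Python; the exact redundancy criterion, here identical sequences, is a deliberate simplification of the threshold-based filters the reviewed articles use) parses FASTA input and keeps one entry per unique sequence:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records

def drop_redundant(records):
    """Keep one record per identical sequence (a crude redundancy filter)."""
    seen, kept = set(), {}
    for header, seq in records.items():
        if seq not in seen:
            seen.add(seq)
            kept[header] = seq
    return kept

# Toy input: the second record duplicates the first sequence.
fasta = ">seq_a\nMVLSPADKTN\nVKAAWGKVGA\n>seq_b (duplicate)\nMVLSPADKTNVKAAWGKVGA\n"
unique = drop_redundant(parse_fasta(fasta))
```

Real pipelines replace the identity test with sequence-identity thresholds (e.g., via clustering tools) or structural RMSD cutoffs, but the shape of the step is the same.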

FIGURE 7

Final Discussion and Further Challenges for Our Understanding of Protein Science Using AI

Novelties and Future Directions in the Binomial AI–PS Research

The protein science field has great expectations of ML methods as indispensable tools for the biomedical sciences as well as for the chemical and biotechnology industries, for applied research is moving toward synthetic organisms with artificial metabolic networks, regulators, and so on, creating synthetic molecular factories. The binomial AI–PS research is evolving and strengthening, as shown in the Results section (Tables 1–6 and Figures 4–7). Our research reveals that road maps are much needed to solve complex problems in PS, guiding the exploration of the protein universe. As depicted in Table 1, the ML techniques used nowadays are tailored to the expected results; Tables 1–6 display an array of networks of several problem-solving methods, hence showing that guidance is needed in the form of road maps.

It is important to emphasize that in order to design a bank of model algorithms functioning as a tool-kit, it is essential to understand the source from which the data are obtained and then used to train each model. The studies analyzed solve classification, regression, and optimization problems. As depicted in Table 1, models providing a solution make use of probabilistic inference, functions and activation functions, reduction of hierarchical order, and logical inference. These results support the fact that machine learning models are heterogeneous and time-demanding to design and to evaluate correctly, since the result may not always be as expected or the method may not be carried out successfully. As illustrated in Table 3, there are physical limitations blocking the full execution of the various models or algorithms, for example, when no appropriate computational equipment is available. Not surprisingly, several authors report that executing a model demands long execution times, computational power, extensive time to evaluate the model correctly, large memory consumption, and optimization for GPUs (Frasca et al., 2018; Almagro Armenteros et al., 2017; Yeh et al., 2018; Jiménez et al., 2017; Lin et al., 2010). Another crucial aspect mentioned in Table 1 is the lack of input data to train the model, something that influences the model’s precision and accuracy (Pagès et al., 2019; Cuperus et al., 2017; Folkman et al., 2014; Qi et al., 2012). Moreover, there are also limitations in model construction, such as errors in the training process, manual intervention on the data, overfitting of the model, and inadequate algorithm construction. 
In the studies analyzed, there are cases with no description of the performance of the overall models, generating gaps in our understanding of the behavior of the algorithms or models, such as whether they are deterministic (Long et al., 2018; Ragoza et al., 2017; Makrodimitris et al., 2019). As stated in the Machine Learning Paradigms and AI Algorithm Roles section, supervised learning is the most used method, which highlights the use of classification algorithms. Moreover, there seems to be a current trend toward solving problems in protein science using techniques that require a cross-functional group of scientists, something that, in turn, highlights the fact that there is plenty of unexplored terrain in the use of unsupervised machine learning.

An interesting finding is the implementation of free code and software, as shown in the AI Training, Validation, and Performance section. Our results exhibit a tendency to create models with transparency, meaning that every study implemented on a public server gives access to all the new models created. Another crucial result is the one depicted in the Road Map of Artificial Intelligence in Protein Science section, an abstraction that reduces the design of an artificial intelligence model for the resolution of a specific problem in protein science to three steps directed toward building a competent model: 1) the procedure for obtaining raw data and the type of processing required for the model to be adequate, 2) the type of algorithm to be used depending on the complexity of the problem, and 3) the interpretation of the results.

Overall, AI opens a window of opportunities to solve complex problems in PS because of its potential for finding patterns and correlating information, which requires the integration of protein data exceeding many petabytes. However, we are still far from solving all protein tasks computationally. As a result of our biochemical meta-analysis, we showed that AI applications are strongly directed toward function identification and protein classification (Tables 1–6), for machine learning models and methods are heterogeneous and do not always draw a clear line as to whether a process should follow a certain sequence (Table 1 and Figures 4–7). It should also be noted that there is no single optimal method, which is why applications have different purposes and conditions, suggesting that algorithms must be customized based on the expected outcome or query (Table 1).

The evaluation-accuracy horizon remains an open epistemic horizon, as shown in Table 1: the metrics reported for the ML methods used in several applications are limited; for instance, no research articles were reported using random forest, for which cross-validation is unnecessary. In summary, none of the studies explicitly reported using robustly validated methods.

We end by commenting on a key problem in the binomial AI–PS. As is well known, it is not possible to work directly with protein sequences. To tackle this challenge, several studies represent the sequence of a protein as an input to the deep learning model (Almagro Armenteros et al., 2017; Long et al., 2018; Fu et al., 2019). Several featured procedures comprise what may be called the coding architecture, which is based on creating a specific weight matrix or a bit vector that represents the sample. This practice was observed in articles (Cuperus et al., 2017; Jiménez et al., 2017; Khurana et al., 2018; Le et al., 2018) that work with 2D convolutional neural networks, in which the authors reported an increase in sensitivity and precision when using indexed datasets. A similar abstraction was observed in 3D convolutional neural networks: since the structural representation of a protein is not rotationally invariant, several authors (Jiménez et al., 2017; Ragoza et al., 2017; Hochuli et al., 2018; Pagès et al., 2019; Sunseri et al., 2019; Torng and Altman, 2019) propose using a volumetric map divided into voxels centered on the backbone atoms, representing the physicochemical properties of proteins.
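
The "bit vector" representation mentioned above is, in its simplest form, a one-hot encoding over the 20 standard amino acids. A minimal sketch (plain Python; real pipelines add padding, gap or unknown-residue symbols, and position-specific weights) is:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def one_hot(sequence):
    """Encode a protein sequence as one 20-bit vector per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        matrix.append(row)
    return matrix

encoded = one_hot("MKV")  # 3 residues -> 3 rows of 20 bits each
```

A position-specific weight matrix generalizes the same layout by replacing the single 1 in each row with per-residue probabilities or scores.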

Regarding other review articles along the lines we have followed, the closest we found is that of Dara et al. (2021). That review is restricted to drug discovery, one of the five applications we analyzed (genomics, protein structure and function, protein design and evolution, and drug design).

Of the 38 articles we presented in Table 2 concerning protein and drug design, only 11 were about protein design, so a comparison between the two articles is not entirely fair as far as the analyzed bibliography is concerned. However, we share with these authors some of the challenges for researchers in this area: data quality, as well as the heterogeneity of the databases to be searched.

Optimization and the characterization of a prediction must be carried out with a few design considerations in mind, including how to represent the protein data and what type of learning algorithm to use. These underpin the establishment of priority acquisition, standard acquisition, etc., and the generation of a protein based on a base model, with the aim that one day it will be possible to have controllable predictive models that can read and generate outputs in a consensual terminology, as reviewed in Hie and Yang (2022). There is a clear replacement of conventional methods by machine learning algorithms (neural networks), attributable to improvements in design, computational power, etc.; the result of a machine learning algorithm is not deterministic but rather performs transformation functions in relation to the complexity of the data, as depicted in AlQuraishi (2021). There are volumes and volumes of empirical protein data, and it is extremely difficult to synthesize such data for correct use in existing algorithms; machine learning, however, has helped to compile a large number of methodologies under specific assumptions. Nevertheless, most of the empirical methodologies used to demonstrate that drugs are safe and effective continue to be used, since there is a gap in our understanding of how the learning is transmitted from the data to the model (Dara et al., 2021).

To close our reflection as a research team, we believe that a landmark for the epistemic horizon in research is the reassurance that cross-functional groups of scientists from several academic disciplines, including experts from the natural sciences (organic chemistry, physics and chemistry of proteins, molecular and structural biology, protein engineering, systems biology, microfluidic chip engineering, and nanobiotechnology) together with those in computer science (artificial intelligence, knowledge engineering), promote the innovation process in the techno-sciences by combining tacit and explicit knowledge, sharing skills, methodologies, tools, ideas, concepts, experiences, and challenges to fully explore the promising AI–PS binomial research area (Hey et al., 2019; Mataeimoghadam et al., 2020; Senior et al., 2020; Tsuchiya and Tomii, 2020). A very recent successful case study that highlights this approach is the team behind the AlphaFold system (Senior et al., 2020; AlQuraishi, 2021), which, in the CASP (Critical Assessment of Protein Structure Prediction) competition for three-dimensional protein structure modeling, was able to determine the 3D structure of a protein from its amino acid sequence. By doing so, this group of researchers addressed one of natural science’s open (until now) and most challenging problems using a deep learning approach combining template-based modeling (TBM) and free modeling (FM). The key point is that the neural network predicts backbone torsion angles and pairwise distances between residues (Senior et al., 2020). At the dawn of 2021, this tip of the iceberg brings fresh air and great power to the protein science field in particular and to the life sciences more broadly, encouraging the new generation of scientists to work in cross-functional teams to tackle novel tasks toward the understanding of nature.

One challenge for the binomial AI–PS research area is to tackle the representation of tacit knowledge and include it in ML algorithms. The relevance of tacit knowledge to the building of protein science knowledge has come a long way since Polanyi first noted it, extending to different fields in the search to improve their practical skills. In AI, the predominant mode of knowledge acquisition and performance is a formal one, in which the machine learns and expresses itself explicitly through guidelines and works in a focalized manner; the new task alludes to a tacit dimension (Polanyi, 1962), which remains at the edge of attention and incorporates aspects that are taught and learned mostly through practice and in a comprehensive manner (it is context-specific, spreads in the laboratory environment, and comes into play in decision-making).

Some Conclusions

To sum up, the systematic review and the biochemical meta-analysis offered in this article focused on the enormous innovation that has taken place in binomial AI–PS research, both in its applications and in its road maps for solving protein structure and function prediction, protein and drug design, and other tasks. The contribution of this study is 3-fold: firstly, the setup of a cross-functional group in which computer scientists, professionals in biomedicine, and a philosopher constructed a common language, together identified relevant literature in the AI–PS inter-field, and built a bridge between the two fields, which can serve as a framework for further research in either area.

Secondly, we stressed the importance of a finer-grained understanding of the training and validation methods of ML models and their outcomes, combining databases from several areas of knowledge (life-science experiments, in silico simulations, ML, the directed-evolution approach, etc.) that allowed us to classify, stratify, and contribute to the evolving protein science field. Thirdly, we showed that the binomial AI–PS is a progressive research program, as Lakatos would say, which still has several challenges to tackle: the development of a comprehensive machine learning benchmarking enterprise; the experimental confirmation of the 3D structural models in the laboratory; the classification and control of the vulnerability of neural networks; the development of a tool-kit for designing, via reverse engineering, novel biocatalysts not found in nature; human-made metabolic routes; the design of new antibody molecular factories; novel proteostasis systems; and the understanding of protein-folding and protein-aggregation mechanisms. Finally, we suggested that there may be a paradigm shift in AI–PS research as a result of the recent great outcome of AlphaFold, encouraging the new generation of scientists to use it.
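The training and validation discipline stressed above can be sketched in a few lines: hold out part of the data so that a model's reported performance is measured on examples it never saw during training. The feature vectors and labels below are hypothetical toy data, not drawn from any database surveyed in this review.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100
X = rng.normal(size=(n_samples, 20))    # e.g. per-protein feature vectors
y = rng.integers(0, 2, size=n_samples)  # e.g. binary labels (soluble or not)

# Shuffle, then split 80/20 into training and held-out validation sets.
indices = rng.permutation(n_samples)
split = int(0.8 * n_samples)
train_idx, valid_idx = indices[:split], indices[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_valid, y_valid = X[valid_idx], y[valid_idx]
print(len(X_train), len(X_valid))  # 80 20
```

In protein applications the split must usually be stricter than random, e.g. separating by sequence or structural similarity, so that near-duplicate proteins do not leak from the training set into the validation set.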

In any case, what is clear is that a cross-functional group of scientists from several knowledge domains is required to work in coordination, sharing ideas, methodologies, and challenges toward the development of road maps, computational tools, paradigms, and tacit and explicit knowledge, in order to fully explore and close the gap in the binomial AI–PS, a promising research area.

Statements

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors without undue reservation.

Author contributions

Conceived and designed the experiments: MA-B and NA-B. Performed the systematic review: JV-A, MVA, FO-F, RZ-S, NA-B, and MA-B. Analyzed the data: JV-A, LO-T, MVA, FP-E, AA, FO-F, RZ-S, NA-B, NK-V, SVA, and MA-B. Contributed to reagents/materials/analysis tools: NK-V, NA-B, CR-M, and MA-B. Wrote the article: JV-A, LO-T, MVA, AA, FP-E, NA-B, and MA-B. Contributed to helpful discussions: JV-A, LO-T, MVA, FP-E, AA, FO-F, RZ-S, NA-B, NK-V, CR-M, SVA, and MA-B.

Acknowledgments

The authors would like to acknowledge the experimental support and fruitful discussions provided by Dr. Elsa de la Chesnaye. We also wish to thank Dr. Laura Bonifaz for her support. The contributions made by the assigned pre-graduate research fellows at the Universidad Iberoamericana and UNAM are greatly appreciated. We are also thankful for the contributions of Perla Sueiras, Daniela Monroy, Maria Fernanda Frías, Pablo Cardenas, and Mattea Cussel in translating and proofreading the manuscript, and of Rogelio Ezequiel and Alonso Loyo for the artwork.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Glossary

  • 1D-CNN

    One-dimensional convolutional neural network

  • 2D-BRLSTM

    two-dimensional bidirectional recurrent long short-term memory

  • 2D-CNN

    Two-dimensional convolutional neural network

  • 3D-CNN

    Three-dimensional convolutional neural network

  • ACNN

    Asymmetric convolutional neural network

  • ADASYN

    Adaptive synthetic sampling

  • AGCT

    Alignment genetic causal tree

  • ANN

    Artificial neural network

  • BBFNN

    Biobasis function neural network

  • BBP

    Back back propagation

  • BLSTM

    Bidirectional long short-term memory

  • BN

    Bayesian network

  • BRNN

    Bidirectional recurrent neural network

  • BroMap

    Branch and bound map estimation

  • BRT

    Boosted regression tree

  • CABS

    C-alpha, C-beta, and side chain

  • CFN

    Cost function network

  • CNF

    Conditional neural field

  • CNN

    Convolutional neural network

  • COSNet

    Cost-sensitive neural network

  • DCNN

    Deep convolutional neural network

  • DeepDIN

    Deep dense inception network

  • Deep3I

    Deep inception-inside-inception network

  • DFS

    Depth first search

  • DL

    Deep learning

  • DMNN

    Deep mahout neural network

  • DNN

    Deep neural network

  • DRNN

    Deep residual neural network

  • DROP

    Domain linker prediction using optimal feature

  • DT

    Decision tree

  • DTNN

    Deep tensor neural network

  • EASE-MM

    Evolutionary amino acid and structural encodings with multiple models

  • ELMO

    Embeddings from language models

  • ENN-RL

    Evolution neural network-based regularized Laplacian kernel

  • FFNN

    Feed forward neural network

  • FIBHASH

    Fibonacci numbers and hashing table

  • GA

    Genetic algorithms

  • GAN

    Generative adversarial network

  • GBT

    Gradient boost tree

  • GBDT

    Gradient boosted decision tree

  • GCN

    Graph convolutional network

  • GR

    Genetic recombination

  • HDL

    Hybrid deep learning

  • HMM

    Hidden Markov model

  • HNN

    Hopfield neural network

  • IBP

    Incremental back propagation

  • KeSCANN

    Knowledge-enriched self-attention convolutional neural network

  • K-merHMM

    K-mer hidden Markov model

  • KNN

    k-nearest neighbor

  • Lasso

    Least absolute shrinkage and selection operator

  • LightGBM

    Light gradient boosting machine

  • LM

    Levenberg–Marquardt

  • LPBoostR

    Linear programming boosting regression

  • LPSVMR

    Linear programming support vector machine regression

  • LR

    Logistic regression

  • LSDR

    Label-space dimensionality reduction

  • LSTM

    Long short-term memory

  • MC

    Monte Carlo

  • ME

    Max entropy

  • ML

    Machine learning

  • MLP

    Multilayer perceptron

  • MNB

    Multinomial naïve Bayes

  • MNNN

    Multi-scale neighborhood-based neural network

  • MNPP

    Message passing neural network

  • MotifCNN

    Motif convolutional neural network

  • Motif DNN

    Motif deep neural network

  • MR

    Matching loss regression

  • MRF

    Markov random field

  • Multimodal DNN

    Multimodal deep neural network

  • NB

    Naïve Bayes

  • NLP

    Natural language processing

  • ORMR

    One-norm regularization matching-loss regression

  • ParCOSNet

    Parallel COSNet

  • PLSR

    Partial least-squares regression

  • PNN

    Probabilistic neural network

  • PS

    Protein science

  • PSO

    Particle swarm optimization

  • PSP

    Predict signal pathway

  • QP

    Quick prob

  • ReLeaSE

    Reinforcement learning for structural evolution

  • RF

    Random forest

  • RN

    Relational network

  • RNN

    Recurrent neural network

  • RNN 2

    Residual neural network

  • RR

    Ridge regression

  • SDHINE

    Meta path-based heterogeneous information embedding approach

  • SFFS

    Sequential forward floating selection

  • SGD

    Stochastic gradient descent

  • SPARK-X

    Probabilistic-based matching

  • SPIN

    Sequence profiles by integrated neural network

  • SVM

    Support vector machine

  • SVMR

    Support vector machine regression

  • SVR

    Support vector regression

  • UDNN

    Ultradeep neural network

  • VSA

    Virtual screening algorithms

  • WMC

    Weighted multiple conformations

References

  • 1

    AdhikariB.HouJ.ChengJ. (2018). DNCON2: Improved Protein Contact Prediction Using Two-Level Deep Convolutional Neural Networks. BioInformatics34, 14661472. 10.1093/bioinformatics/btx781

  • 2

    Al-GharabliS. I.AgtashS. A.RawashdehN. A.BarqawiK. R. (2015). Artificial Neural Networks for Dihedral Angles Prediction in Enzyme Loops: A Novel Approach. Ijbra11, 153161. 10.1504/IJBRA.2015.068090

  • 3

    AlakuşT. B.Türkoğluİ. (2021). A Novel Fibonacci Hash Method for Protein Family Identification by Using Recurrent Neural Networks. Turk. J. Electr. Eng. Comput. Sci.29, 370–386. 10.3906/elk-2003-116

  • 4

    Almagro ArmenterosJ. J.SønderbyC. K.SønderbyS. K.NielsenH.WintherO. (2017). DeepLoc: Prediction of Protein Subcellular Localization Using Deep Learning. Bioinformatics33, 33873395. 10.1093/bioinformatics/btx431

  • 5

    AlQuraishiM. (2021). Machine Learning in Protein Structure Prediction. Curr. Opin. Chem. Biol.65, 18. 10.1016/j.cbpa.2021.04.005

  • 6

    ArmstrongK. A.TidorB. (2008). Computationally Mapping Sequence Space to Understand Evolutionary Protein Engineering. Biotechnol. Prog.24, 6273. 10.1021/bp070134h

  • 7

    AshkenazyH.UngerR.KligerY. (2011). Hidden Conformations in Protein Structures. Bioinformatics27, 19411947. 10.1093/bioinformatics/btr292

  • 8

    BaetuT. (2015). Carl F. Craver and Lindley Darden: In Search of Mechanisms: Discoveries across the Life Sciences. Hpls36, 459–461. 10.1007/s40656-014-0038-6

  • 9

    BernardesJ.PedreiraC. (2013). A Review of Protein Function Prediction under Machine Learning Perspective. Biot7, 122141. 10.2174/18722083113079990006

  • 10

    Bindslev-JensenC.StenE.EarlL. K.CrevelR. W. R.Bindslev-JensenU.HansenT. K.et al (2003). Assessment of the Potential Allergenicity of Ice Structuring Protein Type III HPLC 12 Using the FAO/WHO 2001 Decision Tree for Novel Foods. Food Chem. Toxicol.41, 8187. 10.1016/S0278-6915(02)00212-0

  • 11

    BondP. S.WilsonK. S.CowtanK. D. (2020). Predicting Protein Model Correctness in Coot Using Machine Learning. Acta Cryst. Sect. D. Struct. Biol.76, 713723. 10.1107/S2059798320009080

  • 12

    BostanB.GreinerR.SzafronD.LuP. (2009). Predicting Homologous Signaling Pathways Using Machine Learning. Bioinformatics25, 29132920. 10.1093/bioinformatics/btp532

  • 13

    BriesemeisterS.RahnenführerJ.KohlbacherO. (2010). Going from where to Why-Interpretable Prediction of Protein Subcellular Localization. Bioinformatics26, 12321238. 10.1093/bioinformatics/btq115

  • 14

    CaoR.FreitasC.ChanL.SunM.JiangH.ChenZ. (2017). ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network. Molecules22, 1732. 10.3390/molecules22101732

  • 15

    CapriottiE.FariselliP.CasadioR. (2005). I-Mutant2.0: Predicting Stability Changes upon Mutation from the Protein Sequence or Structure. Nucleic Acids Res.33, W306W310. 10.1093/nar/gki375

  • 16

    ChenJ.YangR.ZhangC.ZhangL.ZhangQ. (2019). DeepGly: A Deep Learning Framework with Recurrent and Convolutional Neural Networks to Identify Protein Glycation Sites from Imbalanced Data. IEEE ACCESS7, 142368142378. 10.1109/ACCESS.2019.2944411

  • 17

    ChengJ.TeggeA. N.BaldiP. (2008). Machine Learning Methods for Protein Structure Prediction. IEEE Rev. Biomed. Eng.1, 4149. 10.1109/RBME.2008.2008239

  • 18

    CuiY.DongQ.HongD.WangX. (2019). Predicting Protein-Ligand Binding Residues with Deep Convolutional Neural Networks. BMC Bioinforma.20, 93. 10.1186/s12859-019-2672-1

  • 19

    CuperusJ. T.GrovesB.KuchinaA.RosenbergA. B.JojicN.FieldsS.et al (2017). Deep Learning of the Regulatory Grammar of Yeast 5′ Untranslated Regions from 500,000 Random Sequences. Genome Res.27, 20152024. 10.1101/gr.224964.117

  • 20

    DaiW.ChangQ.PengW.ZhongJ.LiY. (2020). Network Embedding the Protein-Protein Interaction Network for Human Essential Genes Identification. Genes.11, 153. 10.3390/genes11020153

  • 21

    DanielsN. M.HosurR.BergerB.CowenL. J. (2012). SMURFLite: Combining Simplified Markov Random Fields with Simulated Evolution Improves Remote Homology Detection for Beta-Structural Proteins into the Twilight Zone. Bioinformatics28, 12161222. 10.1093/bioinformatics/bts110

  • 22

    DaraS.DhamercherlaS.JadavS. S.BabuC. H.AhsanM. J. (2021). Machine Learning in Drug Discovery: A Review. Artif. Intell. Rev.55 (3), 19471999. 10.1007/s10462-021-10058-4

  • 23

    DegiacomiM. T. (2019). Coupling Molecular Dynamics and Deep Learning to Mine Protein Conformational Space. Structure27, 10341040. 10.1016/j.str.2019.03.018

  • 24

    DuZ.HeY.LiJ.UverskyV. N. (2020). DeepAdd: Protein Function Prediction from K-Mer Embedding and Additional Features. Comput. Biol. Chem.89, 107379. 10.1016/j.compbiolchem.2020.107379

  • 25

    DurrantJ. D.McCammonJ. A. (2011). NNScore 2.0: A Neural-Network Receptor-Ligand Scoring Function. J. Chem. Inf. Model.51, 2897–2903. 10.1021/ci2003889

  • 26

    EbinaT.TohH.KurodaY. (2011). DROP: An SVM Domain Linker Predictor Trained with Optimal Features Selected by Random Forest. Bioinformatics27, 487494. 10.1093/bioinformatics/btq700

  • 27

    EbrahimpourA.RahmanR. N. Z. R. A.Ean Ch'ngD. H.BasriM.SallehA. B. (2008). A Modeling Study by Response Surface Methodology and Artificial Neural Network on Culture Parameters Optimization for Thermostable Lipase Production from a Newly Isolated Thermophilic Geobacillus Sp. Strain ARM. BMC Biotechnol.8, 96. 10.1186/1472-6750-8-96

  • 28

    EisenbeisS.ProffittW.ColesM.TruffaultV.ShanmugaratnamS.MeilerJ.et al (2012). Potential of Fragment Recombination for Rational Design of Proteins. J. Am. Chem. Soc.134, 40194022. 10.1021/ja211657k

  • 29

    FangC.MoriwakiY.TianA.LiC.ShimizuK. (2019). Identifying Short Disorder-To-Order Binding Regions in Disordered Proteins with a Deep Convolutional Neural Network Method. J. Bioinform. Comput. Biol.17, 1950004. 10.1142/S0219720019500045

  • 30

    FangC.ShangY.XuD. (2020). A Deep Dense Inception Network for Protein Beta‐turn Prediction. Proteins88, 143151. 10.1002/prot.25780

  • 31

    FangC.ShangY.XuD. (2018). MUFOLD-SS: New Deep Inception-Inside-Inception Networks for Protein Secondary Structure Prediction. Proteins86, 592598. 10.1002/prot.25487

  • 32

    FegerG.AngelovB.AngelovaA. (2020). Prediction of Amphiphilic Cell-Penetrating Peptide Building Blocks from Protein-Derived Amino Acid Sequences for Engineering of Drug Delivery Nanoassemblies. J. Phys. Chem. B124, 40694078. 10.1021/acs.jpcb.0c01618

  • 33

    FeinbergE. N.SurD.WuZ.HusicB. E.MaiH.LiY.et al (2018). PotentialNet for Molecular Property Prediction. ACS Cent. Sci.4, 15201530. 10.1021/acscentsci.8b00507

  • 34

    FolkmanL.StanticB.SattarA. (2014). Feature-based Multiple Models Improve Classification of Mutation-Induced Stability Changes. BMC Genomics15, 96. 10.1186/1471-2164-15-S4-S6

  • 35

    FrascaM.GrossiG.GliozzoJ.MesitiM.NotaroM.PerlascaP.et al (2018). A GPU-Based Algorithm for Fast Node Label Learning in Large and Unbalanced Biomolecular Networks. BMC Bioinforma.19, 353. 10.1186/s12859-018-2301-4

  • 36

    FuH.YangY.WangX.WangH.XuY. (2019). DeepUbi: A Deep Learning Framework for Prediction of Ubiquitination Sites in Proteins. BMC Bioinforma.20, 86. 10.1186/s12859-019-2677-9

  • 37

    GainzaP.NisonoffH. M.DonaldB. R. (2016). Algorithms for Protein Design. Curr. Opin. Struct. Biol.39, 1626. 10.1016/j.sbi.2016.03.006

  • 38

    GuoY.LiW.WangB.LiuH.ZhouD. (2019). DeepACLSTM: Deep Asymmetric Convolutional Long Short-Term Memory Neural Models for Protein Secondary Structure Prediction. BMC Bioinforma.20, 341. 10.1186/s12859-019-2940-0

  • 39

    GutteridgeA.BartlettG. J.ThorntonJ. M. (2003). Using a Neural Network and Spatial Clustering to Predict the Location of Active Sites in Enzymes. J. Mol. Biol.330, 719734. 10.1016/S0022-2836(03)00515-1

  • 40

    Haberalİ.OğulH. (2019). Prediction of Protein Metal Binding Sites Using Deep Neural Networks. Mol. Inf.38, 1800169. 10.1002/minf.201800169

  • 41

    HanX.ZhangL.ZhouK.WangX. (2019). ProGAN: Protein Solubility Generative Adversarial Nets for Data Augmentation in DNN Framework. Comput. Chem. Eng.131, 106533. 10.1016/j.compchemeng.2019.106533

  • 42

    HansonJ.PaliwalK.LitfinT.YangY.ZhouY. (2018). Accurate Prediction of Protein Contact Maps by Coupling Residual Two-Dimensional Bidirectional Long Short-Term Memory with Convolutional Neural Networks. Bioinformatics34, 4039–4045. 10.1093/bioinformatics/bty481

  • 43

    HansonJ.PaliwalK.LitfinT.YangY.ZhouY. (2019). Improving Prediction of Protein Secondary Structure, Backbone Angles, Solvent Accessibility and Contact Numbers by Using Predicted Contact Maps and an Ensemble of Recurrent and Residual Convolutional Neural Networks. Bioinformatics35, 24032410. 10.1093/bioinformatics/bty1006

  • 44

    HeH.LiuB.LuoH.ZhangT.JiangJ. (2020). Big Data and Artificial Intelligence Discover Novel Drugs Targeting Proteins without 3D Structure and Overcome the Undruggable Targets. STROKE Vasc. Neurol.5, 381387. 10.1136/svn-2019-000323

  • 45

    HeinzingerM.ElnaggarA.WangY.DallagoC.NechaevD.MatthesF.et al (2019). Modeling Aspects of the Language of Life through Transfer-Learning Protein Sequences. BMC Bioinforma.20, 723. 10.1186/s12859-019-3220-8

  • 46

    HeyT.ButlerK.JacksonS.ThiyagalingamJ. (2019). Machine Learning and Big Scientific Data. Philos. Trans. A Math. Phys. Eng. Sci.378 (2166), 20190054. 10.1098/rsta.2019.0054

  • 47

    HieB. L.YangK. K. (2022). Adaptive Machine Learning for Protein Engineering. Curr. Opin. Struct. Biol.72, 145152. 10.1016/j.sbi.2021.11.002

  • 48

    HochuliJ.HelblingA.SkaistT.RagozaM.KoesD. R. (2018). Visualizing Convolutional Neural Network Protein-Ligand Scoring. J. Mol. Graph. Model.84, 96108. 10.1016/j.jmgm.2018.06.005

  • 49

    HongE.-J.LippowS. M.TidorB.Lozano-PérezT. (2009). Rotamer Optimization for Protein Design through MAP Estimation and Problem-Size Reduction. J. Comput. Chem.30, 19231945. 10.1002/jcc.21188

  • 50

    HuB.WangH.WangL.YuanW. (2018). Adverse Drug Reaction Predictions Using Stacking Deep Heterogeneous Information Network Embedding Approach. Molecules23, 3193. 10.3390/molecules23123193

  • 51

    HuC.LiX.LiangJ. (2004). Developing Optimal Non-linear Scoring Function for Protein Design. Bioinformatics20, 30803098. 10.1093/bioinformatics/bth369

  • 52

    HuangL.LiaoL.WuC. H. (2018). Completing Sparse and Disconnected Protein-Protein Network by Deep Learning. BMC Bioinforma.19, 103. 10.1186/s12859-018-2112-7

  • 53

    HuangW.-L.TungC.-W.HoS.-W.HwangS.-F.HoS.-Y. (2008). ProLoc-GO: Utilizing Informative Gene Ontology Terms for Sequence-Based Prediction of Protein Subcellular Localization. BMC Bioinforma.9, 80. 10.1186/1471-2105-9-80

  • 54

    HungC.-M.HuangY.-M.ChangM.-S. (2006). Alignment Using Genetic Programming with Causal Trees for Identification of Protein Functions. Nonlinear Analysis Theory, Methods & Appl.65, 10701093. 10.1016/j.na.2005.09.048

  • 55

    JiménezJ.DoerrS.Martínez-RosellG.RoseA. S.De FabritiisG. (2017). DeepSite: Protein-Binding Site Predictor Using 3D-Convolutional Neural Networks. Bioinformatics33, 30363042. 10.1093/bioinformatics/btx350

  • 56

    KaleelM.TorrisiM.MooneyC.PollastriG. (2019). PaleAle 5.0: Prediction of Protein Relative Solvent Accessibility by Deep Learning. Amino Acids51, 1289–1296. 10.1007/s00726-019-02767-6

  • 57

    KarimiM.WuD.WangZ.ShenY. (2019). DeepAffinity: Interpretable Deep Learning of Compound-Protein Affinity through Unified Recurrent and Convolutional Neural Networks. Bioinformatics35, 3329–3338. 10.1093/bioinformatics/btz111

  • 58

    KatzmanS.BarrettC.ThiltgenG.KarchinR.KarplusK. (2008). Predict-2nd: A Tool for Generalized Protein Local Structure Prediction. Bioinformatics24, 24532459. 10.1093/bioinformatics/btn438

  • 59

    KauffmanS. A. (1992). “Origins of Order in Evolution: Self-Organization and Selection,” in Understanding Origins (Netherlands: Springer), 153181. 10.1007/978-94-015-8054-0_8

  • 60

    KhanZ. U.HayatM.KhanM. A. (2015). Discrimination of Acidic and Alkaline Enzyme Using Chou's Pseudo Amino Acid Composition in Conjunction with Probabilistic Neural Network Model. J. Theor. Biol.365, 197203. 10.1016/j.jtbi.2014.10.014

  • 61

    KhuranaS.RawiR.KunjiK.ChuangG.-Y.BensmailH.MallR. (2018). DeepSol: A Deep Learning Framework for Sequence-Based Protein Solubility Prediction. Bioinformatics34, 26052613. 10.1093/bioinformatics/bty166

  • 62

    KlausenM. S.JespersenM. C.NielsenH.JensenK. K.JurtzV. I.SønderbyC. K.et al (2019). NetSurfP‐2.0: Improved Prediction of Protein Structural Features by Integrated Deep Learning. Proteins87, 520527. 10.1002/prot.25674

  • 63

    KwonY.ShinW.-H.KoJ.LeeJ. (2020). AK-score: Accurate Protein-Ligand Binding Affinity Prediction Using an Ensemble of 3D-Convolutional Neural Networks. Ijms21, 8424. 10.3390/ijms21228424

  • 64

    LadungaI.CzakóF.CsabaiI.GesztiT. (1991). Improving Signal Peptide Prediction Accuracy by Simulated Neural Network. Bioinformatics7, 485487. 10.1093/bioinformatics/7.4.485

  • 65

    LatekD.KolinskiA. (2011). CABS-NMR-De Novo Tool for Rapid Global Fold Determination from Chemical Shifts, Residual Dipolar Couplings and Sparse Methyl-Methyl Noes. J. Comput. Chem.32, 536544. 10.1002/jcc.21640

  • 66

    LeN.-Q. -K.HoQ.-T.OuY.-Y. (2018). Classifying the Molecular Functions of Rab GTPases in Membrane Trafficking Using Deep Convolutional Neural Networks. Anal. Biochem.555, 3341. 10.1016/j.ab.2018.06.011

  • 67

    LiC.-C.LiuB. (2020). MotifCNN-fold: Protein Fold Recognition Based on Fold-specific Features Extracted by Motif-Based Convolutional Neural Networks. Brief. Bioinform.21, 21332141. 10.1093/bib/bbz133

  • 68

    LiH.GongX.-J.YuH.ZhouC. (2018). Deep Neural Network Based Predictions of Protein Interactions Using Primary Sequences. Molecules23, 1923. 10.3390/molecules23081923

  • 69

    LiH.SzeK. H.LuG.BallesterP. J. (2021). Machine‐learning Scoring Functions for Structure‐based Virtual Screening. WIREs Comput. Mol. Sci.11. 10.1002/wcms.1478

  • 70

    LiY.CirinoP. C. (2014). Recent Advances in Engineering Proteins for Biocatalysis. Biotechnol. Bioeng.111, 12731287. 10.1002/bit.25240

  • 71

    LiZ.YangY.FaraggiE.ZhanJ.ZhouY. (2014). Direct Prediction of Profiles of Sequences Compatible with a Protein Structure by Neural Networks with Fragment-Based Local and Energy-Based Nonlocal Profiles. Proteins82, 25652573. 10.1002/prot.24620

  • 72

    LiangM.NieJ. (2020). Prediction of Enzyme Function Based on a Structure Relation Network. IEEE ACCESS8, 132360132366. 10.1109/ACCESS.2020.3010028

  • 73

    LiaoJ.WarmuthM. K.GovindarajanS.NessJ. E.WangR. P.GustafssonC.et al (2007). Engineering Proteinase K Using Machine Learning and Synthetic Genes. BMC Biotechnol.7, 16. 10.1186/1472-6750-7-16

  • 74

    LinG. N.WangZ.XuD.ChengJ. (2010). SeqRate: Sequence-Based Protein Folding Type Classification and Rates Prediction. BMC Bioinforma.11, S1. 10.1186/1471-2105-11-S3-S1

  • 75

    LinJ.ChenH.LiS.LiuY.LiX.YuB. (2019). Accurate Prediction of Potential Druggable Proteins Based on Genetic Algorithm and Bagging-SVM Ensemble Classifier. Artif. Intell. Med.98, 35–47. 10.1016/j.artmed.2019.07.005

  • 76

    LongH.LiaoB.XuX.YangJ. (2018). A Hybrid Deep Learning Model for Predicting Protein Hydroxylation Sites. Ijms19, 2817. 10.3390/ijms19092817

  • 77

    LongS.TianP. (2019). Protein Secondary Structure Prediction with Context Convolutional Neural Network. RSC Adv.9, 3839138396. 10.1039/c9ra05218f

  • 78

    LuoF.WangM.LiuY.ZhaoX.-M.LiA. (2019). DeepPhos: Prediction of Protein Phosphorylation Sites with Deep Learning. Bioinformatics35, 27662773. 10.1093/bioinformatics/bty1051

  • 79

    LuoL.YangZ.WangL.ZhangY.LinH.WangJ. (2019). KeSACNN: a Protein-Protein Interaction Article Classification Approach Based on Deep Neural Network. Ijdmb22, 131148. 10.1504/ijdmb.2019.099724

  • 80

    LuoX.TuX.DingY.GaoG.DengM. (2020). Expectation Pooling: an Effective and Interpretable Pooling Method for Predicting DNA-Protein Binding. Bioinformatics36, 14051412. 10.1093/bioinformatics/btz768

  • 81

    MahmoudA. H.MastersM. R.YangY.LillM. A. (2020). Elucidating the Multiple Roles of Hydration for Accurate Protein-Ligand Binding Prediction via Deep Learning. Commun. Chem.3, 19. 10.1038/s42004-020-0261-x

  • 82

    MaiaE. H. B.AssisL. C.de OliveiraT. A.da SilvaA. M.TarantoA. G. (2020). Structure-Based Virtual Screening: From Classical to Artificial Intelligence. Front. Chem.8. 10.3389/fchem.2020.00343

  • 83

    MakrodimitrisS.Van HamR. C. H. J.ReindersM. J. T. (2019). Improving Protein Function Prediction Using Protein Sequence and GO-Term Similarities. Bioinformatics35, 11161124. 10.1093/bioinformatics/bty751

  • 84

    MataeimoghadamF.NewtonM. A. H.DehzangiA.KarimA.JayaramB.RanganathanS.et al (2020). Enhancing Protein Backbone Angle Prediction by Using Simpler Models of Deep Neural Networks. Sci. Rep.10, 112. 10.1038/s41598-020-76317-6

  • 85

    MirabelloC.WallnerB. (2019). rawMSA: End-To-End Deep Learning Using Raw Multiple Sequence Alignments. PLoS One14, e0220182. 10.1371/journal.pone.0220182

  • 86

    MüllerA. T.HissJ. A.SchneiderG. (2018). Recurrent Neural Network Model for Constructive Peptide Design. J. Chem. Inf. Model.58, 472–479. 10.1021/acs.jcim.7b00414

  • 87

    MurphyG. S.SathyamoorthyB.DerB. S.MachiusM. C.PulavartiS. V.SzyperskiT.et al (2015). Computational De Novo Design of a Four-Helix Bundle Protein-Dnd_4hb. Protein Sci.24, 434445. 10.1002/pro.2577

  • 88

    O'ConnellJ.LiZ.HansonJ.HeffernanR.LyonsJ.PaliwalK.et al (2018). SPIN2: Predicting Sequence Profiles from Protein Structures Using Deep Neural Networks. Proteins86, 629633. 10.1002/prot.25489

  • 89

    ÖzenA.GönenM.AlpaydınE.HaliloğluT. (2009). Machine Learning Integration for Predicting the Effect of Single Amino Acid Substitutions on Protein Stability. BMC Struct. Biol.9. 10.1186/1472-6807-9-66

  • 90

    PagèsG.CharmettantB.GrudininS.ValenciaA. (2019). Protein Model Quality Assessment Using 3D Oriented Convolutional Neural Networks. Bioinformatics35, 33133319. 10.1093/bioinformatics/btz122

  • 91

    PaladinoA.MarchettiF.RinaldiS.ColomboG. (2017). Protein Design: from Computer Models to Artificial Intelligence. WIREs Comput. Mol. Sci.7, e1318. 10.1002/wcms.1318

  • 92

    Picart-ArmadaS.BarrettS. J.WilléD. R.Perera-LlunaA.GutteridgeA.DessaillyB. H. (2019). Benchmarking Network Propagation Methods for Disease Gene Identification. PLoS Comput. Biol.15, e1007276. 10.1371/journal.pcbi.1007276

  • 93

    PolanyiM. (1962). Personal Knowledge. Towards a Post-Critical Philosophy. 2nd ed.. London: Routledge & Kegan Paul.

  • 94

    PopovaM.IsayevO.TropshaA. (2018). Deep Reinforcement Learning for De Novo Drug Design. Sci. Adv.4, eaap7885. 10.1126/sciadv.aap7885

  • 95

    QiY.OjaM.WestonJ.NobleW. S. (2012). A Unified Multitask Architecture for Predicting Local Protein Properties. PLoS One7, e32235. 10.1371/journal.pone.0032235

  • 96

    QinZ.WuL.SunH.HuoS.MaT.LimE.et al (2020). Artificial Intelligence Method to Design and Fold Alpha-Helical Structural Proteins from the Primary Amino Acid Sequence. Extreme Mech. Lett.36, 100652. 10.1016/j.eml.2020.100652

  • 97

    RagozaM.HochuliJ.IdroboE.SunseriJ.KoesD. R. (2017). Protein-Ligand Scoring with Convolutional Neural Networks. J. Chem. Inf. Model.57, 942–957. 10.1021/acs.jcim.6b00740

  • 98

    RavehB.RahatO.BasriR.SchreiberG. (2007). Rediscovering Secondary Structures as Network Motifs-An Unsupervised Learning Approach. Bioinformatics23, e163–e169. 10.1093/bioinformatics/btl290

  • 99

    RivesA.MeierJ.SercuT.GoyalS.LinZ.LiuJ.et al (2021). Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. Proc. Natl. Acad. Sci. U. S. A.118, e2016239118. 10.1073/pnas.2016239118

  • 100

    RossiA.MichelettiC.SenoF.MaritanA. (2001). A Self-Consistent Knowledge-Based Approach to Protein Design. Biophysical J.80, 480490. 10.1016/S0006-3495(01)76030-4

  • 101

    RussW. P.RanganathanR. (2002). Knowledge-based Potential Functions in Protein Design. Curr. Opin. Struct. Biol.12, 447452. 10.1016/S0959-440X(02)00346-9

  • 102

    SavojardoC.MartelliP. L.TartariG.CasadioR. (2020b). Large-scale Prediction and Analysis of Protein Sub-mitochondrial Localization with DeepMito. BMC Bioinforma.21, 266. 10.1186/s12859-020-03617-z

  • 103

    SavojardoC.BruciaferriN.TartariG.MartelliP. L.CasadioR. (2020a). DeepMito: Accurate Prediction of Protein Sub-mitochondrial Localization Using Convolutional Neural Networks. Bioinformatics36, 56–64. 10.1093/bioinformatics/btz512

  • 104

    SeniorA. W.EvansR.JumperJ.KirkpatrickJ.SifreL.GreenT.et al (2020). Improved Protein Structure Prediction Using Potentials from Deep Learning. Nature577, 706710. 10.1038/s41586-019-1923-7

  • 105

    ShahA. R.OehmenC. S.Webb-RobertsonB.-J. (2008). SVM-HUSTLE--an Iterative Semi-supervised Machine Learning Approach for Pairwise Protein Remote Homology Detection. Bioinformatics24, 783790. 10.1093/bioinformatics/btn028

  • 106

    ShamimM. T. A.AnwaruddinM.NagarajaramH. A. (2007). Support Vector Machine-Based Classification of Protein Folds Using the Structural Properties of Amino Acid Residues and Amino Acid Residue Pairs. Bioinformatics23, 33203327. 10.1093/bioinformatics/btm527

  • 107

    ShroffR.ColeA. W.DiazD. J.MorrowB. R.DonnellI.AnnapareddyA.et al (2020). Discovery of Novel Gain-Of-Function Mutations Guided by Structure-Based Deep Learning. ACS Synth. Biol.9, 29272935. 10.1021/acssynbio.0c00345

  • 108

    SidhuA.YangZ. R. (2006). Prediction of Signal Peptides Using Bio-Basis Function Neural Networks and Decision Trees. Appl. Bioinforma.5, 1319. 10.2165/00822942-200605010-00002

  • 109

    SimhaR.BriesemeisterS.KohlbacherO.ShatkayH. (2015). Protein (Multi-)location Prediction: Utilizing Interdependencies via a Generative Model. Bioinformatics31, i365i374. 10.1093/bioinformatics/btv264

  • 110

    Song, J., Liu, G., Jiang, J., Zhang, P., and Liang, Y. (2021). Prediction of Protein-ATP Binding Residues Based on Ensemble of Deep Convolutional Neural Networks and LightGBM Algorithm. Int. J. Mol. Sci. 22, 939. doi:10.3390/ijms22020939

  • 111

    Sua, J. N., Lim, S. Y., Yulius, M. H., Su, X., Yapp, E. K. Y., Le, N. Q. K., et al. (2020). Incorporating Convolutional Neural Networks and Sequence Graph Transform for Identifying Multilabel Protein Lysine PTM Sites. Chemom. Intell. Lab. Syst. 206, 104171. doi:10.1016/j.chemolab.2020.104171

  • 112

    Sunseri, J., King, J. E., Francoeur, P. G., and Koes, D. R. (2019). Convolutional Neural Network Scoring and Minimization in the D3R 2017 Community Challenge. J. Comput. Aided Mol. Des. 33, 19–34. doi:10.1007/s10822-018-0133-y

  • 113

    Sureyya Rifaioglu, A., Doğan, T., Jesus Martin, M., Cetin-Atalay, R., and Atalay, V. (2019). DEEPred: Automated Protein Function Prediction with Multi-Task Feed-Forward Deep Neural Networks. Sci. Rep. 9, 7344. doi:10.1038/s41598-019-43708-3

  • 114

    Szalkai, B., and Grolmusz, V. (2018a). Near Perfect Protein Multi-Label Classification with Deep Neural Networks. Methods 132, 50–56. doi:10.1016/j.ymeth.2017.06.034

  • 115

    Szalkai, B., and Grolmusz, V. (2018b). SECLAF: A Webserver and Deep Neural Network Design Tool for Hierarchical Biological Sequence Classification. Bioinformatics 34, 2487–2489. doi:10.1093/bioinformatics/bty116

  • 116

    Taherzadeh, G., Dehzangi, A., Golchin, M., Zhou, Y., and Campbell, M. P. (2019). SPRINT-Gly: Predicting N- and O-Linked Glycosylation Sites of Human and Mouse Proteins by Using Sequence and Predicted Structural Properties. Bioinformatics 35, 4140–4146. doi:10.1093/bioinformatics/btz215

  • 117

    Tian, J., Wu, N., Chu, X., and Fan, Y. (2010). Predicting Changes in Protein Thermostability Brought about by Single- or Multi-Site Mutations. BMC Bioinformatics 11, 370. doi:10.1186/1471-2105-11-370

  • 118

    Torng, W., and Altman, R. B. (2019). High Precision Protein Functional Site Detection Using 3D Convolutional Neural Networks. Bioinformatics 35, 1503–1512. doi:10.1093/bioinformatics/bty813

  • 119

    Traoré, S., Allouche, D., André, I., De Givry, S., Katsirelos, G., Schiex, T., et al. (2013). A New Framework for Computational Protein Design through Cost Function Network Optimization. Bioinformatics 29, 2129–2136. doi:10.1093/bioinformatics/btt374

  • 120

    Tsou, L. K., Yeh, S.-H., Ueng, S.-H., Chang, C.-P., Song, J.-S., Wu, M.-H., et al. (2020). Comparative Study between Deep Learning and QSAR Classifications for TNBC Inhibitors and Novel GPCR Agonist Discovery. Sci. Rep. 10, 16771. doi:10.1038/s41598-020-73681-1

  • 121

    Tsuchiya, Y., and Tomii, K. (2020). Neural Networks for Protein Structure and Function Prediction and Dynamic Analysis. Biophys. Rev. 12, 569–573. doi:10.1007/s12551-020-00685-6

  • 122

    Vang, Y. S., and Xie, X. (2017). HLA Class I Binding Prediction via Convolutional Neural Networks. Bioinformatics 33, 2658–2665. doi:10.1093/bioinformatics/btx264

  • 123

    Verma, N., Qu, X., Trozzi, F., Elsaied, M., Karki, N., Tao, Y., et al. (2021). SSnet: A Deep Learning Approach for Protein-Ligand Interaction Prediction. Int. J. Mol. Sci. 22, 1392. doi:10.3390/ijms22031392

  • 124

    Volpato, V., Adelfio, A., and Pollastri, G. (2013). Accurate Prediction of Protein Enzymatic Class by N-to-1 Neural Networks. BMC Bioinformatics 14, S11. doi:10.1186/1471-2105-14-S1-S11

  • 125

    Wan, C., Cozzetto, D., Fa, R., and Jones, D. T. (2019). Using Deep Maxout Neural Networks to Improve the Accuracy of Function Prediction from Protein Interaction Networks. PLoS One 14, e0209958. doi:10.1371/journal.pone.0209958

  • 126

    Wang, D., Geng, L., Zhao, Y.-J., Yang, Y., Huang, Y., Zhang, Y., et al. (2020). Artificial Intelligence-Based Multi-Objective Optimization Protocol for Protein Structure Refinement. Bioinformatics 36, 437–448. doi:10.1093/bioinformatics/btz544

  • 127

    Wang, M., Cang, Z., and Wei, G.-W. (2020a). A Topology-Based Network Tree for the Prediction of Protein-Protein Binding Affinity Changes Following Mutation. Nat. Mach. Intell. 2, 116–123. doi:10.1038/s42256-020-0149-6

  • 128

    Wang, M., Cui, X., Li, S., Yang, X., Ma, A., Zhang, Y., et al. (2020b). DeepMal: Accurate Prediction of Protein Malonylation Sites by Deep Neural Networks. Chemom. Intell. Lab. Syst. 207, 104175. doi:10.1016/j.chemolab.2020.104175

  • 129

    Wang, X., Liu, Y., Lu, F., Li, H., Gao, P., and Wei, D. (2020). Dipeptide Frequency of Word Frequency and Graph Convolutional Networks for DTA Prediction. Front. Bioeng. Biotechnol. 8, 267. doi:10.3389/fbioe.2020.00267

  • 130

    Wang, S., Sun, S., Li, Z., Zhang, R., and Xu, J. (2017). Accurate De Novo Prediction of Protein Contact Map by Ultra-deep Learning Model. PLoS Comput. Biol. 13, e1005324. doi:10.1371/journal.pcbi.1005324

  • 131

    Wardah, W., Dehzangi, A., Taherzadeh, G., Rashid, M. A., Khan, M. G. M., Tsunoda, T., et al. (2020). Predicting Protein-Peptide Binding Sites with a Deep Convolutional Neural Network. J. Theor. Biol. 496, 110278. doi:10.1016/j.jtbi.2020.110278

  • 132

    Wardah, W., Khan, M. G. M., Sharma, A., and Rashid, M. A. (2019). Protein Secondary Structure Prediction Using Neural Networks and Deep Learning: A Review. Comput. Biol. Chem. 81, 1–8. doi:10.1016/j.compbiolchem.2019.107093

  • 133

    Wong, K.-C., Chan, T.-M., Peng, C., Li, Y., and Zhang, Z. (2013). DNA Motif Elucidation Using Belief Propagation. Nucleic Acids Res. 41, e153. doi:10.1093/nar/gkt574

  • 134

    Wu, S., and Zhang, Y. (2008). A Comprehensive Assessment of Sequence-Based and Template-Based Methods for Protein Contact Prediction. Bioinformatics 24, 924–931. doi:10.1093/bioinformatics/btn069

  • 135

    Xu, J., Mcpartlon, M., and Li, J. (2021). Improved Protein Structure Prediction by Deep Learning Irrespective of Co-evolution Information. Nat. Mach. Intell. 3, 601–609. doi:10.1038/s42256-021-00348-5

  • 136

    Xue, L., Tang, B., Chen, W., and Luo, J. (2019). DeepT3: Deep Convolutional Neural Networks Accurately Identify Gram-Negative Bacterial Type III Secreted Effectors Using the N-Terminal Sequence. Bioinformatics 35, 2051–2057. doi:10.1093/bioinformatics/bty931

  • 137

    Yang, H., Wang, M., Yu, Z., Zhao, X.-M., and Li, A. (2020). GANcon: Protein Contact Map Prediction with Deep Generative Adversarial Network. IEEE Access 8, 80899–80907. doi:10.1109/ACCESS.2020.2991605

  • 138

    Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., and Baker, D. (2020). Improved Protein Structure Prediction Using Predicted Interresidue Orientations. Proc. Natl. Acad. Sci. U.S.A. 117, 1496–1503. doi:10.1073/pnas.1914677117

  • 139

    Yang, J., He, B.-J., Jang, R., Zhang, Y., and Shen, H.-B. (2015). Accurate Disulfide-Bonding Network Predictions Improve Ab Initio Structure Prediction of Cysteine-Rich Proteins. Bioinformatics 31, 3773–3781. doi:10.1093/bioinformatics/btv459

  • 140

    Yang, Y., Faraggi, E., Zhao, H., and Zhou, Y. (2011). Improving Protein Fold Recognition and Template-Based Modeling by Employing Probabilistic-Based Matching between Predicted One-Dimensional Structural Properties of Query and Corresponding Native Properties of Templates. Bioinformatics 27, 2076–2082. doi:10.1093/bioinformatics/btr350

  • 141

    Yeh, C.-T., Brunette, T., Baker, D., McIntosh-Smith, S., and Parmeggiani, F. (2018). Elfin: An Algorithm for the Computational Design of Custom Three-Dimensional Structures from Modular Repeat Protein Building Blocks. J. Struct. Biol. 201, 100–107. doi:10.1016/j.jsb.2017.09.001

  • 142

    Yu, C.-H., and Buehler, M. J. (2020). Sonification Based De Novo Protein Design Using Artificial Intelligence, Structure Prediction, and Analysis Using Molecular Modeling. APL Bioeng. 4, 016108. doi:10.1063/1.5133026

  • 143

    Yu, C.-H., Qin, Z., Martin-Martinez, F. J., and Buehler, M. J. (2019). A Self-Consistent Sonification Method to Translate Amino Acid Sequences into Musical Compositions and Application in Protein Design Using Artificial Intelligence. ACS Nano 13, 7471–7482. doi:10.1021/acsnano.9b02180

  • 144

    Zafeiris, D., Rutella, S., and Ball, G. R. (2018). An Artificial Neural Network Integrated Pipeline for Biomarker Discovery Using Alzheimer's Disease as a Case Study. Comput. Struct. Biotechnol. J. 16, 77–87. doi:10.1016/j.csbj.2018.02.001

  • 145

    Zhang, B., Li, J., and Lü, Q. (2018). Prediction of 8-state Protein Secondary Structures by a Novel Deep Learning Architecture. BMC Bioinformatics 19, 293. doi:10.1186/s12859-018-2280-5

  • 146

    Zhang, D., and Kabuka, M. (2019). Multimodal Deep Representation Learning for Protein Interaction Identification and Protein Family Classification. BMC Bioinformatics 20, 531. doi:10.1186/s12859-019-3084-y

  • 147

    Zhang, L., Yu, G., Guo, M., and Wang, J. (2018). Predicting Protein-Protein Interactions Using High-Quality Non-interacting Pairs. BMC Bioinformatics 19, 525. doi:10.1186/s12859-018-2525-3

  • 148

    Zhang, Y., Qiao, S., Ji, S., Han, N., Liu, D., and Zhou, J. (2019). Identification of DNA-Protein Binding Sites by Bootstrap Multiple Convolutional Neural Networks on Sequence Information. Eng. Appl. Artif. Intell. 79, 58–66. doi:10.1016/j.engappai.2019.01.003

  • 149

    Zhao, B., and Xue, B. (2018). Decision-tree Based Meta-Strategy Improved Accuracy of Disorder Prediction and Identified Novel Disordered Residues inside Binding Motifs. Int. J. Mol. Sci. 19, 3052. doi:10.3390/ijms19103052

  • 150

    Zhao, F., Peng, J., and Xu, J. (2010). Fragment-free Approach to Protein Folding Using Conditional Neural Fields. Bioinformatics 26, i310–i317. doi:10.1093/bioinformatics/btq193

  • 151

    Zhao, X., Li, J., Wang, R., He, F., Yue, L., and Yin, M. (2018). General and Species-specific Lysine Acetylation Site Prediction Using a Bi-modal Deep Architecture. IEEE Access 6, 63560–63569. doi:10.1109/ACCESS.2018.2874882

  • 152

    Zhao, Z., and Gong, X. (2019). Protein-Protein Interaction Interface Residue Pair Prediction Based on Deep Learning Architecture. IEEE/ACM Trans. Comput. Biol. Bioinf. 16, 1753–1759. doi:10.1109/TCBB.2017.2706682

  • 153

    Zheng, W., Li, Y., Zhang, C., Pearce, R., Mortuza, S. M., and Zhang, Y. (2019). Deep-learning Contact-map Guided Protein Structure Prediction in CASP13. Proteins 87, 1149–1164. doi:10.1002/prot.25792

  • 154

    Zheng, W., Zhou, X., Wuyun, Q., Pearce, R., Li, Y., and Zhang, Y. (2020). FUpred: Detecting Protein Domains through Deep-Learning-Based Contact Map Prediction. Bioinformatics 36, 3749–3757. doi:10.1093/bioinformatics/btaa217

  • 155

    Zhu, X., and Lai, L. (2009). A Novel Method for Enzyme Design. J. Comput. Chem. 30, 256–267. doi:10.1002/jcc.21050

  • 156

    Zimmermann, O., and Hansmann, U. H. E. (2006). Support Vector Machines for Prediction of Dihedral Angle Regions. Bioinformatics 22, 3009–3015. doi:10.1093/bioinformatics/btl489

Keywords

artificial intelligence, proteins, protein design and engineering, machine learning, deep learning, protein prediction, protein classification, drug design

Citation

Villalobos-Alva J, Ochoa-Toledo L, Villalobos-Alva MJ, Aliseda A, Pérez-Escamirosa F, Altamirano-Bustamante NF, Ochoa-Fernández F, Zamora-Solís R, Villalobos-Alva S, Revilla-Monsalve C, Kemper-Valverde N and Altamirano-Bustamante MM (2022) Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field. Front. Bioeng. Biotechnol. 10:788300. doi: 10.3389/fbioe.2022.788300

Received

02 October 2021

Accepted

25 May 2022

Published

07 July 2022

Volume

10 - 2022

Edited by

Ratul Chowdhury, Harvard Medical School, United States

Reviewed by

Neng-Zhong Xie, Guangxi Academy of Sciences, China

Nabankur Dasgupta, Sandia National Laboratories, United States

Sudhanya Banerjee, AspenTech, United States

*Correspondence: Myriam M. Altamirano-Bustamante,

†These authors have contributed equally to this work

This article was submitted to Bioprocess Engineering, a section of the journal Frontiers in Bioengineering and Biotechnology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
