1 Introduction
Genetic variation data are now easy to generate. Variation interpretation means describing the significance of variations, often in relation to disease, and it is a substantially more difficult problem than sequence generation. Experimental methods provide verified interpretations; however, because of the huge number of variations in every individual, computational approaches are widely used. The human genome is over 3 billion base pairs long (Nurk et al., 2022). Due to individual genetic heterogeneity, 4.1–5.0 million sites in a typical genome differ from the reference genome (Auton et al., 2015). Various types of prediction methods are used to interpret the variations (Niroula and Vihinen, 2016). Benchmark studies have indicated large differences in the performance of methods developed for the same type of variation prediction task, see e.g., (Thusberg et al., 2011; Niroula and Vihinen, 2019; Zhang et al., 2019; Marabotti et al., 2021; Anderson and Lassmann, 2022). Both predictor development and performance assessment depend heavily on high-quality data. One might think that there is a large number of verified variations, as genetic diagnosis is widely applied; however, that is not the case, especially when considering specific types of variations or mechanisms.
The development and testing of computational methods depend on experimental data. Accurate prediction methods can be developed only from reliable, experimentally verified cases, with a systematic approach and relevant performance measures (Vihinen, 2012; Vihinen, 2013). Method performance has to be assessed in comparison to existing knowledge. For that purpose, benchmark data sets with known and verified outcomes are needed. Such data sets can be time-consuming and costly to collect and require many manual steps. Therefore, it is important that the produced data are distributed and reused.
In the variation interpretation field, two databases deliver such data sets. VariBench (Nair et al., 2013; Sarkar et al., 2020) and VariSNP (Schaafsma et al., 2015) contain variation benchmark data. VariSNP is a version of the dbSNP database (Sherry et al., 2001) for short variations from which known disease-causing variants have been filtered out. VariBench is a generic database that contains all types of variations with all kinds of effects. These resources have been widely used for prediction method training and testing.
What requirements and criteria should benchmark data sets fulfill in relation to variation interpretation and in general? We have defined five criteria (Nair et al., 2013): relevance, representativeness, non-redundancy, inclusion of both positive and negative cases, and reusability. VariBench subscribes to these criteria, collects data sets and distributes them freely. VariBench data sets are frequently used to train methods and test their performance. These sets also facilitate post-publication comparison of methods to published benchmarks (Sarkar et al., 2020).
The bottleneck in sequencing projects has shifted from sequencing to the interpretation of the obtained results. Experimental studies of variant effects are the gold standard. They are not feasible in many instances, and therefore various computational approaches have been developed. We divide the prediction methods into five categories in VariBench.
First, pathogenicity predictions, also called tolerance predictions, aim to identify disease-related alterations of various types (for details see Table 1). These methods aim only to detect harmful or disease-related variants. Second, effect-specific methods predict various effects at DNA, RNA and protein levels. Third, there are predictors specific for certain molecules or families of molecules, typically proteins. Fourth, some methods are dedicated to certain diseases. Fifth, some tools predict the phenotype, typically the severity of the variant effect.
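Two of these criteria, non-redundancy and the inclusion of both positive and negative cases, lend themselves to a simple automated check. The following is a minimal sketch, assuming a hypothetical record layout of (protein identifier, variant, label). The function name, the identifiers and the labels "pathogenic" and "benign" are illustrative assumptions and do not describe the VariBench file format.

```python
from collections import Counter


def check_benchmark(records):
    """Report duplicate variants and class counts for a data set.

    `records` is an iterable of (protein_id, variant, label) tuples.
    This layout is a hypothetical example, not the VariBench format.
    """
    records = list(records)
    # A variant is identified by its protein and its variant description.
    key_counts = Counter((protein, variant) for protein, variant, _ in records)
    duplicates = sorted(key for key, n in key_counts.items() if n > 1)
    label_counts = Counter(label for _, _, label in records)
    return {
        "n_records": len(records),
        "duplicates": duplicates,
        "label_counts": dict(label_counts),
        # Both classes have to be present for a usable benchmark.
        "has_both_classes": all(
            label_counts.get(label, 0) > 0 for label in ("pathogenic", "benign")
        ),
    }


# Made-up records for illustration only.
example = [
    ("P00001", "R97C", "pathogenic"),
    ("P00001", "R97C", "pathogenic"),  # redundant entry
    ("P00002", "A12V", "benign"),
]
report = check_benchmark(example)
print(report)
```

Relevance and representativeness cannot be verified this mechanically; they depend on the intended prediction task and on the distribution of cases in the set.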
TABLE 1
| Data set | Data sets in previous version | New data sets |
|---|---|---|
| Variation type data sets | ||
| Insertions and deletions | 4 | 2 |
| Substitutions, coding region | ||
| Training data sets | 23 | 9 |
| Test data sets | 5 | 3 |
| Structure mapped variations | ||
| General structural data sets | 2 | 3 |
| Transmembrane protein data sets | 0 | 4 |
| Synonymous and unsense variants | 2 | 5 |
| Benign variants | 2 | 0 |
| Structural variants | 0 | 1 |
| Effect specific data sets | ||
| DNA regulatory elements | 7 | 4 |
| RNA splicing | 15 | 6 |
| Protein aggregation | 2 | 0 |
| Binding free energy | 2 | 1 |
| Protein disorder | 1 | 1 |
| Protein solubility | 1 | 1 |
| Protein stability | 31 | 9 |
| Single variants | 21 | 9 |
| Double variants | 1 | 0 |
| Protein folding rate | 0 | 5 |
| Protein binding affinity | ||
| Generic protein-protein interactions | 1 | 13 |
| Antibody-antigen affinity changes | 0 | 5 |
| Protein-nucleic acid interactions | 0 | 7 |
| Functional effects | ||
| Gain of function variants | 0 | 1 |
| Deep mutational data sets | 0 | 7 |
| Molecule-specific data sets | 18 | 7 |
| Disease-specific data sets | ||
| Cancer variation data sets | 4 | 4 |
| Other diseases | 8 | 2 |
| Phenotype data sets | 1 | 1 |
Types of data sets in VariBench.
High-quality variation data sets are difficult and laborious to generate. VariBench collects and organizes different types of variation data sets, integrates additional information into them and distributes them. It is a unique database. We have updated the resource with 143 new data sets, which include more than 90 million variants. During the update, some new categories of variations and effects have been included. There are currently variations in 5 main categories, 17 subgroups and 11 groups.
2 Data sets and quality
VariBench collects data sets that have been used to train methods or to assess their performance from the literature, databases and predictors. There are no selection criteria for the inclusion of data sets, for several reasons: the data sets can be used as such, or they can be further cleaned and pruned for use in additional tasks, be extended with new cases, etc. A good benchmark data set should fulfill several requirements (Vihinen, 2012; Vihinen, 2013), including good coverage, representativeness and the inclusion of both positive and negative cases that are experimentally determined. The representativeness of amino acid substitution data sets has been investigated (Schaafsma and Vihinen, 2018) and found not to be optimal.
The quality of data sets in VariBench is variable. We include even known low-quality data sets, since they may be valuable when building new data sets and for other applications. We have performed some quality tests, including consistency checks; however, it is the duty of the users of the data to evaluate whether the data are suitable for the intended use. One of the goals of VariBench is to provide existing data sets, even when problematic, e.g., for comparative purposes.
Systematics is an integral part of data and database quality. It is quite common that, due to errors and a lack of systematics, not all variants in an existing data set can be reused because they cannot be mapped to reference sequences.
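The mapping problem can be illustrated with a small consistency check. The sketch below is a simplified illustration, assuming amino acid substitutions written in one-letter notation with 1-based positions (e.g., R97C). The function name, the notation and the sequence are assumptions made for the example; real data sets use several naming conventions and reference sequence versions.

```python
import re

# Matches a one-letter amino acid substitution such as "R97C".
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def maps_to_reference(variant, reference_sequence):
    """Return True if the original residue of `variant` matches the reference.

    Positions are 1-based. The notation and the function name are
    illustrative assumptions, not part of VariBench.
    """
    match = SUBSTITUTION.match(variant)
    if match is None:
        return False  # the description cannot be parsed
    original, position = match.group(1), int(match.group(2))
    if position < 1 or position > len(reference_sequence):
        return False  # the position falls outside the reference sequence
    return reference_sequence[position - 1] == original


reference = "MKTAYIAK"  # made-up sequence for illustration
variants = ["K2R", "A4G", "L3P", "Y50C", "p.K2R"]
mappable = [v for v in variants if maps_to_reference(v, reference)]
print(mappable)
```

In this toy case, three of the five descriptions fail: one names the wrong original residue, one lies outside the sequence, and one uses a notation the parser does not recognize.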
An example of the importance of data quality is in the field of protein stability predictions. Most of the existing predictors are based on a single database, ProTherm, which was shown to contain numerous problems (Yang et al., 2018). Recently, new and higher-quality databases have emerged in this field (Stourac et al., 2021; Turina et al., 2021).
3 Uses of VariBench data
VariBench data sets have been widely used, especially to train and test variation interpretation predictors (pathogenicity/tolerance, protein stability, solubility, melting temperature, gene/protein/disease-specific predictors, and interaction and structural effects on folded and disordered regions and proteins), but also to benchmark the performance of tools for various variation types and effects. In addition to human-related tools, plant- and animal-related predictors and benchmarks have benefitted from VariBench (Yang et al., 2022). The data have also facilitated the interpretation of variants according to the guidelines of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) (Richards et al., 2015) and the benchmarking of such annotations.
4 Data sets in VariBench
VariBench now contains 559 files for separate data sets from 295 studies and covers a wide range of variations (Tables 1, 2). The data sets were collected from the literature, websites and databases. They have been used for predictive purposes, most often to develop novel predictors for different types or effects of variants. Some data sets have been specifically collected for benchmarking purposes.
TABLE 2
| Origin of data | Dataset first used for | Number of variants in each dataset | Number of different genes, transcripts or proteins in each dataset | References |
|---|---|---|---|---|
| Variation type datasets | ||||
| Insertions and deletions | ||||
| HGMD, gnomAD | MutPredIndel | 231963, 4679, 1203 | 3556, 4679, 802 | Pagel et al. (2019) |
| HGMD, gnomAD | MutPredLof | 98095, 8840 | 13648, 1239 | Pagel et al. (2019) |
| Substitutions, coding region | ||||
| Training datasets | ||||
| VariBench | PON-All | 45573, 306, 5360, 324, 3836, 1109, 48176, 4154 | 14765, 232, 1261, 233, 704, 287, 13383, 1149 | Yang et al. (2022) |
| HumDiv, HumVar, MGI, Disease Ontology Database, OMIA, UniProtKB, Ensembl | Mammalian diseases | 377, 207, 62 | 131, 315, 51 | Plekhanova et al. (2019) |
| http://www.arabidopsis.org, UniProt/Swiss-Prot, Ensembl | Arabidopsis thaliana | 13707 | 999 | Kono et al. (2018) |
| UniProt, SwissProt | Arabidopsis | 4410 | 994 | Kovalev et al. (2018) |
| HGMD, SwissVar, dbSNP | MutPred2 | 20643 | | Pejaver et al. (2020) |
| ClinVar, UniProt | DeepSav | 43000, 43000 | 3386, 10974 | Pei et al. (2020) |
| dbNSFP, ClinVar, HumsaVar, HGMD | VARITY | 157708, 157708 | 3912, 3912 | Wu et al. (2021) |
| ClinVar, gnomAD | MutScore | 66037 | | Quinodoz et al. (2022) |
| HGMD, gnomAD | MutFormer | 69159160 | | Jiang et al. (2021) |
| Test datasets | ||||
| ClinVar, HGMD, OMIM, gnomAD | Benchmarking with clinical data set | 1757 | | Gunning et al. (2021) |
| ClinVar, VariBench | Benchmarking study | 35167, 29173 | 3349, 8562 | Anderson and Lassmann (2022) |
| ClinVar | Rett syndrome benchmark | 4354 | 3217 | Ganakammal and Alexov (2019) |
| Structure mapped variants | ||||
| General structural datasets | ||||
| ClinVar, ExAC, HumsaVar | Missense3D | 1965, 2134 | | Ittisoponpisan et al. (2019) |
| UniProt | Protein structural analysis | 6025, 4536 | 3782, 8211 | Gao et al. (2015) |
| HumsaVar | Solvent accessibility | 10760, 69385 | 1283, 12494 | Savojardo et al. (2020) |
| Transmembrane proteins | ||||
| VariBench, ExAC | Transmembrane protein analysis | 2058, 5422, 508, 1289, 1289 | 870, 5422, 508, 1289, 1289 | Orioli and Vihinen (2019) |
| PDB | mCSM-membrane | 347, 138/38, 16 | | Pires et al. (2020) |
| ClinVar, gnomAD | TMSNP | 2624, 196705 | | Garcia-Recio et al. (2021) |
| BorodaTM, PredMutHTP, TMSNP | MutTMPredictor | 21379, 10031, 3706, 7374, 546 | 3341, 2114, 1183, 1848, 62 | Ge et al. (2021) |
| Synonymous and unsense variations | ||||
| 1KGP | Silva | 33 | | Buske et al. (2013) |
| Silva, OMIM | TraP | 75 | 376, 96, 102 | Gelfman et al. (2017) |
| HGMD, dbDSM | usDSM | 239358, 2400, 4502, 665, 5085 | | Tang et al. (2021) |
| ClinVar | Ensemble predictor | 243, 243 | | Ganakammal and Alexov (2020) |
| 1KGP, ExAC, gnomAD, generated data | Predictor review | 1048576 | | Zeng and Bromberg (2019) |
| Structural variations | ||||
| ClinVar, gnomAD, ape sequences, 1KGP | StrVCTVRE | 7669 | 5119 | Sharo et al. (2022) |
| Effect-specific datasets | ||||
| DNA regulatory elements | ||||
| DNaseI-seq, ChIP-seq data | deltaSVM | 45 | | Lee et al. (2015) |
| dbSNP, ClinVar, OMIM | ncVarDB | 7228, 722 | | Biggs et al. (2020) |
| PRVCS, 1KGP, GTEx, GWAS Catalog | regBase | 108, 67635, 796, 60393, 21725, 3105, 102, 7513, 61170, 5023, 11436, 61170 | | Zhang et al. (2019) |
| HGMD, ClinVar, OregAnno, GWAS Catalog | WEVar | 2874, 29 | | Wang et al. (2021) |
| RNA splicing | ||||
| BIC | EX-SKIP and HOT-SKIP | 74, 42 | | Raponi et al. (2011) |
| ClinVar, literature | SQUIRLS | 8322 | | Danis et al. (2021) |
| ClinVar, literature, InSiGHT | Cancer gene analysis | 12, 347, 18 | 3, 32, 13 | Moles-Fernández et al. (2018) |
| HGMD, SpliceDisease, DBASS | scdbNSFP | 2959, 45 | | Jian et al. (2014) |
| Experimental data | SPiCE | 142, 163, 90 | 2, 2, 9 | Leman et al. (2018) |
| ClinVar | CADD-Splice | 1688852, 14011296, 1688852, 14011296 | | Rentzsch et al. (2021) |
| Binding free energy | ||||
| SKEMPI, literature | SAAMBE | 2041, 1327 | 81, 43 | Petukh et al. (2016) |
| Protein disorder | ||||
| SwissProt, VariBench | IDRMutPred | 3348, 559, 5794, 5027 | 321, 26, 2562, 2390 | Zhou et al. (2020) |
| Protein solubility | ||||
| VariBench, literature | PON-Sol2 | 5666, 46, 662 | 66, 9, 34 | Yang et al. (2021) |
| Protein stability | ||||
| Single variants | ||||
| ProTherm | PreTherMut | 836, 2530 | | Tian et al. (2010) |
| ProTherm | iStable | 3131 | | Chen et al. (2013) |
| Experimental data | CAGI frataxin benchmark | 8 | | Strokach et al. (2021) |
| ProTherm | iStable2 | 1564, 1495, 759, 265, 363, 129 | | Chen et al. (2020) |
| VariBench, ProTherm | Benchmarking study | 1024 | | Marabotti et al. (2021) |
| ProTherm | Thermonet | 3214, 3214, 3214, 1744, 1744, 1744 | 148, 148, 148, 127, 127, 127 | Li et al. (2020) |
| ProTherm, literature | ACDC-NN | [2197, 2050, 2046, 2231, 2042, 2094, 2300, 1933, 2007, 2284] [268, 183, 415, 187, 230, 376, 178, 170, 545, 96] [183, 415, 187, 230, 376, 178, 170, 545, 96, 268] [5, 199, 21, 75, 7, 1, 33 ] [5, 1, 199, 21, 75, 7, 1, 33] [1013, 813, 924, 1080, 1157, 1296, 1219, 1235, 1180] [268, 176, 398, 65, 143, 164, 66, 25, 143, 9] [176, 398, 65, 143, 164, 66, 25, 143, 9, 198] | [104, 107, 105, 103, 103, 103, 107, 111, 109, 104] [15, 13, 12, 15, 14, 15, 14, 11, 10, 13] [13, 12, 15, 14, 15, 14, 11, 10, 13, 15] [1, 4, 2, 2, 2, 1, 2] [5, 1, 199, 21, 75, 7, 1, 33] [63, 60, 60, 55, 56, 65, 65, 69, 69] [16, 7, 11, 7, 9, 14, 8, 5, 8, 1] [7, 11, 7, 9, 14, 8, 5, 8, 1, 8] | Benevenuta et al. (2021) |
| ThermoMutDB, ProTherm, VariBench | Benchmarking study | 352 | | Pancotti et al. (2022) |
| Protein folding rate | ||||
| Experimental data | Kinetic data | 806 | | Naganathan and Muñoz (2010) |
| Literature, PFD, kineticDB | KD-FREEDOM | 467 | 15, 4 | Huang and Gromiha (2010) |
| PFD, kineticDB | Fora | 467, 154 | | Huang and Gromiha (2012) |
| PFD, kineticDB, literature | FREEDOM | 467 | | Huang (2014) |
| Literature | UnfoldingRaCe and FoldingRaCe | 790, 16, 60 | 26, 10, 5 | Chaudhary et al. (2015), Chaudhary et al. (2016) |
| Protein interaction | ||||
| Generic protein-protein interactions | ||||
| Literature | CC/PBSA | 582, 592 | 9, 57 | Benedix et al. (2009) |
| SKEMPI, literature | Protein-protein binding affinity | 123, 242, 574, 1844 | 5, 9, 29, 81 | Li et al. (2014) |
| SKEMPI | MutaBind | 1925 | | Li et al. (2016) |
| SKEMPI | BindProfX | 1402 | | Xiong et al. (2017) |
| DACUM, SKEMPI, literature | iSEE | 1102 | | Geng et al. (2019) |
| SKEMPI, AB-Bind, PROXiMATE, dbMPIKT | mCSM-PPI2 | 4196, 378 | 319, 19 | Rodrigues et al. (2019) |
| SKEMPI, literature | MutaBind2 | 4191, 1707 | 319, 19 | Zhang et al. (2020) |
| SKEMPI, CAPRI | SSIPe | 1470, 734, 888, 190, 152 | 319, 19 | Huang et al. (2020) |
| SKEMPI | NetTree | 645, 1131, 4947, 4169, 8338, 787 | 29, 112, 319, 319, 319, 21 | Wang et al. (2020) |
| PROXiMATE | ProAffiMuSeq | 1061, 112 | 104, 53 | Jemimah et al. (2020) |
| ClinVar, ProTherm, SKEMPI, literature | ELASPIC2 | 16189, 2563 | 14227, 2378 | Strokach et al. (2019) |
| SKEMPI | mmCSM-PPI | 1340, 595, 272 | 296, 68, 24 | Rodrigues et al. (2021) |
| TCGA, ICGC | e-MutPath | 59712 | | Li et al. (2021a) |
| Antibody-antigen affinity | ||||
| AB-Bind | mCSM-AB | 558 | | Pires and Ascher (2016) |
| Literature | SiPMAB | 212 | | Sulea et al. (2016) |
| Literature | Free energy perturbation method | 200 | | Clark et al. (2019) |
| SiPMAB | Consensus predictor | 46 | | Kurumida et al. (2020) |
| AB-Bind, PROXiMATE, SKEMPI | mCSM-AB2 | 1810 | | Myung et al. (2020) |
| Protein-nucleic acid interactions | ||||
| ProNIT | mCSM-NA | 662 | 369 | Pires and Ascher (2017) |
| ProNIT | SAMPDI | 104 | 13 | Peng et al. (2018) |
| ProNIT, dbAMEPNI | PremPDI | 219 | 49 | Zhang et al. (2018) |
| ENCODE, POSTAR2 | DeepCLIP | 81 | 32 | Grønning et al. (2020) |
| dbAMEPNI | iPNHOT | 293 | 105 | Zhu et al. (2020) |
| ProNIT, dbAMEPNI | SAMPDI-3D | 101, 463, 200, 419, 227 | 26, 30, 49, 96, 18 | Li et al. (2021b) |
| PDB, literature | Nabe | 2506 | 473 | Liu et al. (2021) |
| Functional effects | ||||
| Gain of function data sets | ||||
| Literature | funNCion | 3794, 6930 | | Heyne et al. (2020) |
| Deep mutational data sets | ||||
| Literature | DeepSequence | 712218 | 31 | Riesselman et al. (2018) |
| Literature | fuNTRp | 303, 75, 102, 286, 56 | | Miller et al. (2019) |
| Literature | Functional effects | 183204 | | Reeb et al. (2020) |
| Literature | Deep mutational landscape | 6357, 6357 | | Dunham and Beltrao (2021) |
| Literature | Benchmarking study | 230033 | 10 | Livesey and Marsh (2020) |
| Literature | LacI | 102, 4303 | 1, 1 | Miller et al. (2017) |
| Literature | Liver pyruvate kinase | 126 | 1 | Martin et al. (2020) |
| Molecule-specific data sets | ||||
| | CFTR-MetaPred | 1899, 1210 | | Rychkova et al. (2017) |
| Literature | CYSMA | 141 | | Sasorith et al. (2020) |
| SwissProt, BTKbase | KinMutRF | 3689 | 459 | Pons et al. (2016) |
| SwissVar, HumsaVar, Ensembl Variation, ClinVar | Cardiac sodium channel variants | 1392 | 1 | Tarnovskaya et al. (2020) |
| Literature | SCN9A variants | 85 | 1 | Toffano et al. (2020) |
| Literature | Troponin variants | 136 | 1 | Shakur et al. (2021) |
| Literature, ClinVar, HGMD | IDUA | 147 | 1 | Borges et al. (2021) |
| Disease-specific data sets | ||||
| Cancer variation data sets | ||||
| Literature | dbCID | 57, 153, 728 | 22, 39, 46 | Yue et al. (2019) |
| Literature | dbCPM | 108, 863, 1109 | 11, 71, 130 | Yue et al. (2018) |
| ICGC, TCGA, Pediatric Cancer Genome Project | MutaGene | 5276 | 58 | Goncearenco et al. (2017) |
| UMD_TP53, TP53MULTLOAD | TP53_PROF | 1362, 1295 | 1, 1 | Ben-Cohen et al. (2022) |
| Other diseases | ||||
| ClinVar, gnomAD, literature | CardioBoost | [1237, 215, 154, 308, 532] [218, 289, 2003, 2578] [218, 289, 2003, 2578] [347, 463, 170] [106, 106, 35] [157, 227, 75] [157, 227, 75] | [7, 6, 6, 7, 9] [16, 16, 16, 21] [16, 16, 16, 21] [12, 8, 11] [1, 1, 1] [1, 1, 1] [1, 1, 1] | Zhang et al. (2021) |
| HGMD, dbSNP | Steroid metabolism diseases | 797 | 12 | Chan (2013) |
| COSMIC | Benchmarking cancer variants | 164 | 11 | Petrosino et al. (2021) |
| Phenotype data sets | ||||
| ClinVar | VusPrize | 45749, 25080, 684, 4843, 51091 | 2106, 1615, 244, 1239, 2828 | Mahecha et al. (2022) |
New data sets in VariBench.
Abbreviations: 1KGP, 1000 Genomes Project; HGMD, Human Gene Mutation Database; ICGC, International Cancer Genome Consortium; PDB, Protein Data Bank; TCGA, The Cancer Genome Atlas.
There are 247 new data files that contain a total of 90,886,959 variants. Together with previous versions, there are 105,181,219 variants, a more than seven-fold increase from the original number of 14,294,260 variants. The number of data sets is high because many articles contain more than one data set. Many of the data sets are redundant as they contain data from the same origin. The most common sources of variants are the ClinVar database of variants and their disease relationships (Landrum et al., 2018), the ProTherm thermodynamic database (Kumar et al., 2006), and VariBench itself. The number of unique variants is therefore significantly lower than the sum of the variants in the data sets.
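The totals in this paragraph can be checked with simple arithmetic, and the difference between summed and unique variant counts can be illustrated with a set union. The variant counts below are those stated in the text; the two small sets and their identifiers are made up for illustration only.

```python
# Variant counts as stated in the text.
new_variants = 90_886_959       # variants in the new data files
previous_variants = 14_294_260  # variants in the previous versions

total_variants = new_variants + previous_variants
fold_increase = total_variants / previous_variants

print(f"total: {total_variants:,}")
print(f"fold increase: {fold_increase:.2f}")

# Toy illustration of redundancy between data sets from the same origin.
# The identifiers are hypothetical.
set_a = {"P00001:R97C", "P00002:A12V"}
set_b = {"P00001:R97C", "P00003:G5D"}

summed = len(set_a) + len(set_b)  # counts the shared variant twice
unique = len(set_a | set_b)       # counts every variant once

print(summed, unique)
```

The sum reproduces the stated total of 105,181,219 variants and a fold increase of about 7.4, while the toy sets show why a sum over overlapping data sets overstates the number of unique variants.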
The data sets are divided into 5 categories, 17 subgroups and 11 groups (Table 1). The number of data items varies between sets and depends on the original data. Data items irrelevant to VariBench (i.e., not describing variants or their effects) were removed when the sets were included in the database. In many data sets, variants are described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level. One of the aims of VariBench is to facilitate the reuse of existing data sets; therefore the data are provided at as many levels as possible. Further, the data can be used for various purposes beyond the original application, such as benchmarking, developing different types of predictors, bioinformatics reviews and analyses of variation types, clinical variation interpretation, etc. When extending the use in this way, users must be cautious, be aware of the possible limitations of the data sets and understand how they have been collected.
The main categories of variation type data sets are insertions and deletions, substitutions in coding and non-coding regions, structure-mapped variants, synonymous and unsense variants, benign variants, and DNA structural variants (see Tables 1, 2). Unsense variants are a new category for exonic alterations that may look synonymous but affect the protein or its expression, typically due to aberrant splicing or miRNA binding alterations (Vihinen, 2022; Vihinen, 2023a; Vihinen, 2023b). Effect-specific data sets include DNA regulatory elements, RNA splicing, and protein properties: aggregation, binding free energy, disorder, solubility, stability, folding rate, interactions, and functional effects. Molecule- and disease-specific data sets include information for individual genes, proteins, gene/protein families or diseases. Phenotype data sets describe a disease feature, the severity of the phenotype.
Almost all the categories contain new data sets. In addition, we have 6 new variation categories: structural variations in DNA (1 data set), protein folding rate (5 data sets in six publications), antibody-antigen affinity changes (5 articles and sets), protein-nucleic acid interactions (7 articles), gain of function variants (1 data set), and deep mutational data sets (7 studies).
One of the new categories is for functional effects under the effect-specific category. These sets mainly originate from massively parallel reporter assay (saturation mutagenesis) experiments. Users of these data have to be careful, since the included data sets display a measured effect whose relevance to biological effect is not always clear (Vihinen, 2021). A functional effect does not necessarily imply a biological effect. One would likely say that a reduction of more than 50% of, e.g., enzyme activity has a functional effect. However, there are several diseases where 90% or more of the normal activity has to be lost for an individual to have the disease and show the effect on biological activity (Vihinen, 2021). Examples include hemophilias due to factor II, VII, IX, X or XII variations and severe immunodeficiency caused by adenosine deaminase alterations.
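The distinction between a measured functional effect and a disease-relevant effect can be made concrete with two thresholds. The sketch below is only an illustration of the reasoning above. The function names are hypothetical, and the cut-offs (less than 50% of wild-type activity remaining for a functional effect, at most 10% remaining for disease relevance) are taken from the example in the text; they are not general or clinical thresholds.

```python
def has_functional_effect(relative_activity, threshold=0.5):
    """True if the remaining activity (fraction of wild type) is below `threshold`."""
    return relative_activity < threshold


def is_disease_relevant(relative_activity, residual_limit=0.1):
    """True if at most `residual_limit` of the normal activity remains."""
    return relative_activity <= residual_limit


# Made-up measurements: fraction of wild-type activity retained by each variant.
measurements = {"variant_1": 0.80, "variant_2": 0.40, "variant_3": 0.05}

for name, activity in measurements.items():
    print(name, has_functional_effect(activity), is_disease_relevant(activity))
```

Under these assumptions, the variant retaining 40% of the activity would be scored as having a functional effect in an assay, yet it would not reach the level of loss that such diseases require.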
Statements
Data availability statement
Publicly available datasets were analyzed in this study. These data can be found here: http://structure.bmc.lu.se/VariBench.
Author contributions
MV conceived the project; NS collected the data sets and developed the web site; NS and MV wrote the manuscript. All authors contributed to the article and approved the submitted version.
Funding
Financial support from Vetenskapsrådet (2019-01403) and the Swedish Cancer Society (grant number CAN 20 1350) is gratefully acknowledged.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1
AndersonD.LassmannT. (2022). An expanded phenotype centric benchmark of variant prioritisation tools. Hum. Mutat.43, 539–546. 10.1002/humu.24362
2
AutonA.BrooksL. D.DurbinR. M.GarrisonE. P.KangH. M.KorbelJ. O.et al (2015). A global reference for human genetic variation. Nature526, 68–74. 10.1038/nature15393
3
Ben-CohenG.DoffeF.DevirM.LeroyB.SoussiT.RosenbergS. (2022). TP53_PROF: A machine learning model to predict impact of missense mutations in TP53. Brief. Bioinform23, bbab524. 10.1093/bib/bbab524
4
BenedixA.BeckerC. M.de GrootB. L.CaflischA.BöckmannR. A. (2009). Predicting free energy changes using structural ensembles. Nat. Methods6, 3–4. 10.1038/nmeth0109-3
5
BenevenutaS.PancottiC.FariselliP.BiroloG.SanaviaT. (2021). An antisymmetric neural network to predict free energy changes in protein variants. J. Phys. D. Appl. Phys.54, 245403. 10.1088/1361-6463/abedfb
6
BiggsH.ParthasarathyP.GavryushkinaA.GardnerP. P. (2020). ncVarDB: a manually curated database for pathogenic non-coding variants and benign controls. Oxford: Database, 2020.
7
BorgesP.PasqualimG.MatteU. (2021). Which is the best in silico program for the missense variations in idua gene? A comparison of 33 programs plus a conservation score and evaluation of 586 missense variants. Front. Mol. Biosci.8, 752797. 10.3389/fmolb.2021.752797
8
BuskeO. J.ManickarajA.MitalS.RayP. N.BrudnoM. (2013). Identification of deleterious synonymous variants in human genomes. Bioinformatics29, 1843–1850. 10.1093/bioinformatics/btt308
9
ChanA. O. (2013). Performance of in silico analysis in predicting the effect of non-synonymous variants in inherited steroid metabolic diseases. Steroids78, 726–730. 10.1016/j.steroids.2013.04.002
10
ChaudharyP.NaganathanA. N.GromihaM. M. (2015). Folding RaCe: A robust method for predicting changes in protein folding rates upon point mutations. Bioinformatics31, 2091–2097. 10.1093/bioinformatics/btv091
11
ChaudharyP.NaganathanA. N.GromihaM. M. (2016). Prediction of change in protein unfolding rates upon point mutations in two state proteins. Biochim. Biophys. Acta1864, 1104–1109. 10.1016/j.bbapap.2016.06.001
12
ChenC. W.LinJ.ChuY. W. (2013). iStable: off-the-shelf predictor integration for predicting protein stability changes. BMC Bioinforma.14, S5. Suppl 2. 10.1186/1471-2105-14-s2-s5
13
ChenC. W.LinM. H.LiaoC. C.ChangH. P.ChuY. W. (2020). iStable 2.0: predicting protein thermal stability changes by integrating various characteristic modules. Comput. Struct. Biotechnol. J.18, 622–630. 10.1016/j.csbj.2020.02.021
14
ClarkA. J.NegronC.HauserK.SunM.WangL.AbelR.et al (2019). Relative binding affinity prediction of charge-changing sequence mutations with FEP in protein-protein interfaces. J. Mol. Biol.431, 1481–1493. 10.1016/j.jmb.2019.02.003
15
DanisD.JacobsenJ. O. B.CarmodyL. C.GarganoM. A.McMurryJ. A.HegdeA.et al (2021). Interpretable prioritization of splice variants in diagnostic next-generation sequencing. Am. J. Hum. Genet.108, 1564–1577. 10.1016/j.ajhg.2021.06.014
16
DunhamA. S.BeltraoP. (2021). Exploring amino acid functions in a deep mutational landscape. Mol. Syst. Biol.17, e10305. 10.15252/msb.202110305
17
GanakammalS. R.AlexovE. (2020). An ensemble approach to predict the pathogenicity of synonymous variants. Genes. (Basel), 11. 10.3390/genes11091102
18
GanakammalS. R.AlexovE. (2019). Evaluation of performance of leading algorithms for variant pathogenicity predictions and designing a combinatory predictor method: application to rett syndrome variants. PeerJ7, e8106. 10.7717/peerj.8106
19
GaoM.ZhouH.SkolnickJ. (2015). Insights into disease-associated mutations in the human proteome through protein structural analysis. Structure23, 1362–1369. 10.1016/j.str.2015.03.028
20
Garcia-RecioA.Gómez-TamayoJ. C.ReinaI.CampilloM.CordomÃA.OlivellaM.et al (2021). Tmsnp: A web server to predict pathogenesis of missense mutations in the transmembrane region of membrane proteins. Nar. Genom Bioinform3, lqab008. 10.1093/nargab/lqab008
21
GeF.ZhuY. H.XuJ.MuhammadA.SongJ.YuD. J. (2021). MutTMPredictor: robust and accurate cascade xgboost classifier for prediction of mutations in transmembrane proteins. Comput. Struct. Biotechnol. J.19, 6400–6416. 10.1016/j.csbj.2021.11.024
22
GelfmanS.WangQ.McSweeneyK. M.RenZ.La CarpiaF.HalvorsenM.et al (2017). Annotating pathogenic non-coding variants in genic regions. Nat. Commun.8, 236. 10.1038/s41467-017-00141-2
23
GengC.VangoneA.FolkersG. E.XueL. C.BonvinA. (2019). iSEE: interface structure, evolution, and energy-based machine learning predictor of binding affinity changes upon mutations. Proteins87, 110–119. 10.1002/prot.25630
24
GoncearencoA.RagerS. L.LiM.SangQ. X.RogozinI. B.PanchenkoA. R. (2017). Exploring background mutational processes to decipher cancer genetic heterogeneity. Nucleic Acids Res.45, W514–w522. 10.1093/nar/gkx367
25
GrønningA. G. B.DoktorT. K.LarsenS. J.PetersenU. S. S.HolmL. L.BruunG. H.et al (2020). DeepCLIP: predicting the effect of mutations on protein-rna binding with deep learning. Nucleic Acids Res.48, 7099–7118. 10.1093/nar/gkaa530
26
GunningA. C.FryerV.FashamJ.CrosbyA. H.EllardS.BapleE. L.et al (2021). Assessing performance of pathogenicity predictors using clinically relevant variant datasets. J. Med. Genet.58, 547–555. 10.1136/jmedgenet-2020-107003
27
HeyneH. O.Baez-NietoD.IqbalS.PalmerD. S.BrunklausA.MayP.et al (2020). Predicting functional effects of missense variants in voltage-gated sodium and calcium channels. Sci. Transl. Med.12, eaay6848. 10.1126/scitranslmed.aay6848
28
HuangL. T. (2014). Finding simple rules for discriminating folding rate change upon single mutation by statistical and learning methods. Protein Pept. Lett.21, 743–751. 10.2174/09298665113209990070
29
HuangL. T.GromihaM. M. (2010). First insight into the prediction of protein folding rate change upon point mutation. Bioinformatics26, 2121–2127. 10.1093/bioinformatics/btq350
30
Huang, L. T., and Gromiha, M. M. (2012). Real value prediction of protein folding rate change upon point mutation. J. Comput. Aided Mol. Des. 26, 339–347. doi:10.1007/s10822-012-9560-3
Huang, X., Zheng, W., Pearce, R., and Zhang, Y. (2020). SSIPe: accurately estimating protein-protein binding affinity change upon mutations using evolutionary profiles in combination with an optimized physical energy function. Bioinformatics 36, 2429–2437. doi:10.1093/bioinformatics/btz926
Ittisoponpisan, S., Islam, S. A., Khanna, T., Alhuzimi, E., David, A., and Sternberg, M. J. E. (2019). Can predicted protein 3D structures provide reliable insights into whether missense variants are disease associated? J. Mol. Biol. 431, 2197–2212. doi:10.1016/j.jmb.2019.04.009
Jemimah, S., Sekijima, M., and Gromiha, M. M. (2020). ProAffiMuSeq: sequence-based method to predict the binding free energy change of protein-protein complexes upon mutation using functional classification. Bioinformatics 36, 1725–1730. doi:10.1093/bioinformatics/btz829
Jian, X., Boerwinkle, E., and Liu, X. (2014). In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res. 42, 13534–13544. doi:10.1093/nar/gku1206
Jiang, T., Fang, L., and Wang, K. (2021). MutFormer: A context-dependent transformer-based model to predict pathogenic missense mutations. Available at: https://arxiv.org/abs/2110.14746.
Kono, T. J. Y., Lei, L., Shih, C. H., Hoffman, P. J., Morrell, P. L., and Fay, J. C. (2018). Comparative genomics approaches accurately predict deleterious variants in plants. G3 (Bethesda) 8, 3321–3329. doi:10.1534/g3.118.200563
Kovalev, M. S., Igolkina, A. A., Samsonova, M. G., and Nuzhdin, S. V. (2018). A pipeline for classifying deleterious coding mutations in agricultural plants. Front. Plant Sci. 9, 1734. doi:10.3389/fpls.2018.01734
Kumar, M. D., Bava, K. A., Gromiha, M. M., Prabakaran, P., Kitajima, K., Uedaira, H., et al. (2006). ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res. 34, D204–D206. doi:10.1093/nar/gkj103
Kurumida, Y., Saito, Y., and Kameda, T. (2020). Predicting antibody affinity changes upon mutations by combining multiple predictors. Sci. Rep. 10, 19533. doi:10.1038/s41598-020-76369-8
Landrum, M. J., Lee, J. M., Benson, M., Brown, G. R., Chao, C., Chitipiralla, S., et al. (2018). ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067. doi:10.1093/nar/gkx1153
Lee, D., Gorkin, D. U., Baker, M., Strober, B. J., Asoni, A. L., McCallion, A. S., et al. (2015). A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961. doi:10.1038/ng.3331
Leman, R., Gaildrat, P., Le Gac, G., Ka, C., Fichou, Y., Audrezet, M. P., et al. (2018). Novel diagnostic tool for prediction of variant spliceogenicity derived from a set of 395 combined in silico/in vitro studies: an international collaborative effort. Nucleic Acids Res. 46, 7913–7923. doi:10.1093/nar/gky372
Li, B., Yang, Y. T., Capra, J. A., and Gerstein, M. B. (2020). Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks. PLoS Comput. Biol. 16, e1008291. doi:10.1371/journal.pcbi.1008291
Li, G., Panday, S. K., Peng, Y., and Alexov, E. (2021b). SAMPDI-3D: predicting the effects of protein and DNA mutations on protein-DNA interactions. Bioinformatics 37, 3760–3765. doi:10.1093/bioinformatics/btab567
Li, M., Petukh, M., Alexov, E., and Panchenko, A. R. (2014). Predicting the impact of missense mutations on protein-protein binding affinity. J. Chem. Theory Comput. 10, 1770–1780. doi:10.1021/ct401022c
Li, M., Simonetti, F. L., Goncearenco, A., and Panchenko, A. R. (2016). MutaBind estimates and interprets the effects of sequence variants on protein-protein interactions. Nucleic Acids Res. 44, W494–W501. doi:10.1093/nar/gkw374
Li, Y., Burgman, B., Khatri, I. S., Pentaparthi, S. R., Su, Z., McGrail, D. J., et al. (2021a). e-MutPath: computational modeling reveals the functional landscape of genetic mutations rewiring interactome networks. Nucleic Acids Res. 49, e2. doi:10.1093/nar/gkaa1015
Liu, J., Liu, S., Liu, C., Zhang, Y., Pan, Y., Wang, Z., et al. (2021). Nabe: An energetic database of amino acid mutations in protein-nucleic acid binding interfaces. Database (Oxford) 2021.
Livesey, B. J., and Marsh, J. A. (2020). Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations. Mol. Syst. Biol. 16, e9380. doi:10.15252/msb.20199380
Mahecha, D., Nuñez, H., Lattig, M. C., and Duitama, J. (2022). Machine learning models for accurate prioritization of variants of uncertain significance. Hum. Mutat. 43, 449–460. doi:10.1002/humu.24339
Marabotti, A., Del Prete, E., Scafuri, B., and Facchiano, A. (2021). Performance of Web tools for predicting changes in protein stability caused by mutations. BMC Bioinforma. 22, 345. doi:10.1186/s12859-021-04238-w
Martin, T. A., Wu, T., Tang, Q., Dougherty, L. L., Parente, D. J., Swint-Kruse, L., et al. (2020). Identification of biochemically neutral positions in liver pyruvate kinase. Proteins 88, 1340–1350. doi:10.1002/prot.25953
Miller, M., Bromberg, Y., and Swint-Kruse, L. (2017). Computational predictors fail to identify amino acid substitution effects at rheostat positions. Sci. Rep. 7, 41329. doi:10.1038/srep41329
Miller, M., Vitale, D., Kahn, P. C., Rost, B., and Bromberg, Y. (2019). funtrp: identifying protein positions for variation driven functional tuning. Nucleic Acids Res. 47, e142. doi:10.1093/nar/gkz818
Moles-Fernández, A., Duran-Lozano, L., Montalban, G., Bonache, S., López-Perolio, I., Menéndez, M., et al. (2018). Computational tools for splicing defect prediction in breast/ovarian cancer genes: how efficient are they at predicting RNA alterations? Front. Genet. 9, 366. doi:10.3389/fgene.2018.00366
Myung, Y., Rodrigues, C. H. M., Ascher, D. B., and Pires, D. E. V. (2020). mCSM-AB2: guiding rational antibody design using graph-based signatures. Bioinformatics 36, 1453–1459. doi:10.1093/bioinformatics/btz779
Naganathan, A. N., and Muñoz, V. (2010). Insights into protein folding mechanisms from large scale analysis of mutational effects. Proc. Natl. Acad. Sci. U. S. A. 107, 8611–8616. doi:10.1073/pnas.1000988107
Nair, P. S., and Vihinen, M. (2013). VariBench: A benchmark database for variations. Hum. Mutat. 34, 42–49. doi:10.1002/humu.22204
Niroula, A., and Vihinen, M. (2019). How good are pathogenicity predictors in detecting benign variants? PLoS Comput. Biol. 15, e1006481. doi:10.1371/journal.pcbi.1006481
Niroula, A., and Vihinen, M. (2016). Variation interpretation predictors: principles, types, performance, and choice. Hum. Mutat. 37, 579–597. doi:10.1002/humu.22987
Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A. V., Mikheenko, A., et al. (2022). The complete sequence of a human genome. Science 376, 44–53. doi:10.1126/science.abj6987
Orioli, T., and Vihinen, M. (2019). Benchmarking membrane proteins: subcellular localization and variant tolerance predictors. BMC Genomics 20, 547. doi:10.1186/s12864-019-5865-0
Pagel, K. A., Antaki, D., Lian, A., Mort, M., Cooper, D. N., Sebat, J., et al. (2019). Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome. PLoS Comput. Biol. 15, e1007112. doi:10.1371/journal.pcbi.1007112
Pancotti, C., Benevenuta, S., Birolo, G., Alberini, V., Repetto, V., Sanavia, T., et al. (2022). Predicting protein stability changes upon single-point mutation: A thorough comparison of the available tools on a new dataset. Brief. Bioinform. 23, bbab555. doi:10.1093/bib/bbab555
Pei, J., Kinch, L. N., Otwinowski, Z., and Grishin, N. V. (2020). Mutation severity spectrum of rare alleles in the human genome is predictive of disease type. PLoS Comput. Biol. 16, e1007775. doi:10.1371/journal.pcbi.1007775
Pejaver, V., Urresti, J., Lugo-Martinez, J., Pagel, K. A., Lin, G. N., Nam, H. J., et al. (2020). Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat. Commun. 11, 5918. doi:10.1038/s41467-020-19669-x
Peng, Y., Sun, L., Jia, Z., Li, L., and Alexov, E. (2018). Predicting protein-DNA binding free energy change upon missense mutations using modified MM/PBSA approach: SAMPDI webserver. Bioinformatics 34, 779–786. doi:10.1093/bioinformatics/btx698
Petrosino, M., Novak, L., Pasquo, A., Chiaraluce, R., Turina, P., Capriotti, E., et al. (2021). Analysis and interpretation of the impact of missense variants in cancer. Int. J. Mol. Sci. 22.
Petukh, M., Dai, L., and Alexov, E. (2016). SAAMBE: webserver to predict the charge of binding free energy caused by amino acids mutations. Int. J. Mol. Sci. 17, 547. doi:10.3390/ijms17040547
Pires, D. E., and Ascher, D. B. (2016). mCSM-AB: a web server for predicting antibody-antigen affinity changes upon mutation with graph-based signatures. Nucleic Acids Res. 44, W469–W473. doi:10.1093/nar/gkw458
Pires, D. E. V., and Ascher, D. B. (2017). mCSM-NA: predicting the effects of mutations on protein-nucleic acids interactions. Nucleic Acids Res. 45, W241–W246. doi:10.1093/nar/gkx236
Pires, D. E. V., Rodrigues, C. H. M., and Ascher, D. B. (2020). mCSM-membrane: predicting the effects of mutations on transmembrane proteins. Nucleic Acids Res. 48, W147–W153. doi:10.1093/nar/gkaa416
Plekhanova, E., Nuzhdin, S. V., Utkin, L. V., and Samsonova, M. G. (2019). Prediction of deleterious mutations in coding regions of mammals with transfer learning. Evol. Appl. 12, 18–28. doi:10.1111/eva.12607
Pons, T., Vazquez, M., Matey-Hernandez, M. L., Brunak, S., Valencia, A., and Izarzugaza, J. M. (2016). KinMutRF: A random forest classifier of sequence variants in the human protein kinase superfamily. BMC Genomics 17 (Suppl. 2), 396. doi:10.1186/s12864-016-2723-1
Quinodoz, M., Peter, V. G., Cisarova, K., Royer-Bertrand, B., Stenson, P. D., Cooper, D. N., et al. (2022). Analysis of missense variants in the human genome reveals widespread gene-specific clustering and improves prediction of pathogenicity. Am. J. Hum. Genet. 109, 457–470. doi:10.1016/j.ajhg.2022.01.006
Raponi, M., Kralovicova, J., Copson, E., Divina, P., Eccles, D., Johnson, P., et al. (2011). Prediction of single-nucleotide substitutions that result in exon skipping: identification of a splicing silencer in BRCA1 exon 6. Hum. Mutat. 32, 436–444. doi:10.1002/humu.21458
Reeb, J., Wirth, T., and Rost, B. (2020). Variant effect predictions capture some aspects of deep mutational scanning experiments. BMC Bioinforma. 21, 107. doi:10.1186/s12859-020-3439-4
Rentzsch, P., Schubach, M., Shendure, J., and Kircher, M. (2021). CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med. 13, 31. doi:10.1186/s13073-021-00835-9
Richards, S., Aziz, N., Bale, S., Bick, D., Das, S., Gastier-Foster, J., et al. (2015). Standards and guidelines for the interpretation of sequence variants: A joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424. doi:10.1038/gim.2015.30
Riesselman, A. J., Ingraham, J. B., and Marks, D. S. (2018). Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822. doi:10.1038/s41592-018-0138-4
Rodrigues, C. H. M., Myung, Y., Pires, D. E. V., and Ascher, D. B. (2019). mCSM-PPI2: predicting the effects of mutations on protein–protein interactions. Nucleic Acids Res. 47, W338–W344. doi:10.1093/nar/gkz383
Rodrigues, C. H. M., Pires, D. E. V., and Ascher, D. B. (2021). mmCSM-PPI: predicting the effects of multiple point mutations on protein–protein interactions. Nucleic Acids Res. 49, W417–W424. doi:10.1093/nar/gkab273
Rychkova, A., Buu, M., Scharfe, C., Lefterova, M., Odegaard, J., Schrijver, I., et al. (2017). Developing gene-specific meta-predictor of variant pathogenicity.
Sarkar, A., Yang, Y., and Vihinen, M. (2020). Variation benchmark datasets: update, criteria, quality and applications. Database 2020, baz117. doi:10.1093/database/baz117
Sasorith, S., Baux, D., Bergougnoux, A., Paulet, D., Lahure, A., Bareil, C., et al. (2020). The CYSMA web server: an example of integrative tool for in silico analysis of missense variants identified in mendelian disorders. Hum. Mutat. 41, 375–386. doi:10.1002/humu.23941
Savojardo, C., Manfredi, M., Martelli, P. L., and Casadio, R. (2020). Solvent accessibility of residues undergoing pathogenic variations in humans: from protein structures to protein sequences. Front. Mol. Biosci. 7, 626363. doi:10.3389/fmolb.2020.626363
Schaafsma, G. C., and Vihinen, M. (2018). Representativeness of variation benchmark datasets. BMC Bioinforma. 19 (1), 461. doi:10.1186/s12859-018-2478-6
Schaafsma, G. C., and Vihinen, M. (2015). VariSNP, A benchmark database for variations from dbSNP. Hum. Mutat. 36, 161–166. doi:10.1002/humu.22727
Shakur, R., Ochoa, J. P., Robinson, A. J., Niroula, A., Chandran, A., Rahman, T., et al. (2021). Prognostic implications of troponin T variations in inherited cardiomyopathies using systems biology. NPJ Genom. Med. 6, 47. doi:10.1038/s41525-021-00204-w
Sharo, A. G., Hu, Z., Sunyaev, S. R., and Brenner, S. E. (2022). StrVCTVRE: A supervised learning method to predict the pathogenicity of human genome structural variants. Am. J. Hum. Genet. 109, 195–209. doi:10.1016/j.ajhg.2021.12.007
Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., et al. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311. doi:10.1093/nar/29.1.308
Stourac, J., Dubrava, J., Musil, M., Horackova, J., Damborsky, J., Mazurenko, S., et al. (2021). FireProtDB: database of manually curated protein stability data. Nucleic Acids Res. 49, D319–D324. doi:10.1093/nar/gkaa981
Strokach, A., Corbi-Verge, C., and Kim, P. M. (2019). Predicting changes in protein stability caused by mutation using sequence- and structure-based methods in a CAGI5 blind challenge. Hum. Mutat. 40, 1414–1423. doi:10.1002/humu.23852
Strokach, A., Lu, T. Y., and Kim, P. M. (2021). ELASPIC2 (EL2): combining contextualized language models and graph neural networks to predict effects of mutations. J. Mol. Biol. 433, 166810. doi:10.1016/j.jmb.2021.166810
Sulea, T., Vivcharuk, V., Corbeil, C. R., Deprez, C., and Purisima, E. O. (2016). Assessment of solvated interaction energy function for ranking antibody-antigen binding affinities. J. Chem. Inf. Model. 56, 1292–1303. doi:10.1021/acs.jcim.6b00043
Tang, X., Zhang, T., Cheng, N., Wang, H., Zheng, C. H., Xia, J., et al. (2021). usDSM: a novel method for deleterious synonymous mutation prediction using undersampling scheme. Brief. Bioinform. 22, bbab123. doi:10.1093/bib/bbab123
Tarnovskaya, S. I., Korkosh, V. S., Zhorov, B. S., and Frishman, D. (2020). Predicting novel disease mutations in the cardiac sodium channel. Biochem. Biophys. Res. Commun. 521, 603–611. doi:10.1016/j.bbrc.2019.10.142
Thusberg, J., Olatubosun, A., and Vihinen, M. (2011). Performance of mutation pathogenicity prediction methods on missense variants. Hum. Mutat. 32, 358–368. doi:10.1002/humu.21445
Tian, J., Wu, N., Chu, X., and Fan, Y. (2010). Predicting changes in protein thermostability brought about by single- or multi-site mutations. BMC Bioinforma. 11, 370. doi:10.1186/1471-2105-11-370
Toffano, A. A., Chiarot, G., Zamuner, S., Marchi, M., Salvi, E., Waxman, S. G., et al. (2020). Computational pipeline to probe NaV1.7 gain-of-function variants in neuropathic painful syndromes. Sci. Rep. 10, 17930. doi:10.1038/s41598-020-74591-y
Turina, P., Fariselli, P., and Capriotti, E. (2021). ThermoScan: semi-automatic identification of protein stability data from PubMed. Front. Mol. Biosci. 8, 620475. doi:10.3389/fmolb.2021.620475
Vihinen, M. (2021). Functional effects of protein variants. Biochimie 180, 104–120. doi:10.1016/j.biochi.2020.10.009
Vihinen, M. (2013). Guidelines for reporting and using prediction tools for genetic variation analysis. Hum. Mutat. 34, 275–282. doi:10.1002/humu.22253
Vihinen, M. (2012). How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics 13 (Suppl. 4), S2. doi:10.1186/1471-2164-13-s4-s2
Vihinen, M. (2023b). Nonsynonymous synonymous variants demand for a paradigm shift in genetics. Curr. Genomics 24, 18–23. doi:10.2174/1389202924666230417101020
Vihinen, M. (2023a). Systematic errors in annotations of truncations, loss-of-function and synonymous variants. Front. Genet. 14, 1015017. doi:10.3389/fgene.2023.1015017
Vihinen, M. (2022). When a synonymous variant is nonsynonymous. Genes (Basel) 13.
Wang, M., Cang, Z., and Wei, G. W. (2020). A topology-based network tree for the prediction of protein-protein binding affinity changes following mutation. Nat. Mach. Intell. 2, 116–123. doi:10.1038/s42256-020-0149-6
Wang, Y., Jiang, Y., Yao, B., Huang, K., Liu, Y., Wang, Y., et al. (2021). WEVar: A novel statistical learning framework for predicting noncoding regulatory variants. Brief. Bioinform. 22, bbab189. doi:10.1093/bib/bbab189
Wu, Y., Li, R., Sun, S., Weile, J., and Roth, F. P. (2021). Improved pathogenicity prediction for rare human missense variants. Am. J. Hum. Genet. 108, 1891–1906. doi:10.1016/j.ajhg.2021.08.012
Xiong, P., Zhang, C., Zheng, W., and Zhang, Y. (2017). BindProfX: assessing mutation-induced binding affinity change by protein interface profiles with pseudo-counts. J. Mol. Biol. 429, 426–434. doi:10.1016/j.jmb.2016.11.022
Yang, Y., Shao, A., and Vihinen, M. (2022). PON-All, amino acid substitution tolerance predictor for all organisms. Front. Mol. Biosci. 9, 867572. doi:10.3389/fmolb.2022.867572
Yang, Y., Urolagin, S., Niroula, A., Ding, X., Shen, B., and Vihinen, M. (2018). PON-Tstab: protein variant stability predictor. Importance of training data quality. Int. J. Mol. Sci. 19, 1009. doi:10.3390/ijms19041009
Yang, Y., Zeng, L., and Vihinen, M. (2021). PON-Sol2: Prediction of effects of variants on protein solubility. Int. J. Mol. Sci. 22.
Yue, Z., Zhao, L., Cheng, N., Yan, H., and Xia, J. (2019). dbCID: a manually curated resource for exploring the driver indels in human cancer. Brief. Bioinform. 20, 1925–1933. doi:10.1093/bib/bby059
Yue, Z., Zhao, L., and Xia, J. (2018). dbCPM: a manually curated database for exploring the cancer passenger mutations. Brief. Bioinform. 21, 309–317. doi:10.1093/bib/bby105
Zeng, Z., and Bromberg, Y. (2019). Predicting functional effects of synonymous variants: A systematic review and perspectives. Front. Genet. 10, 914. doi:10.3389/fgene.2019.00914
Zhang, N., Chen, Y., Lu, H., Zhao, F., Alvarez, R. V., Goncearenco, A., et al. (2020). MutaBind2: predicting the impacts of single and multiple mutations on protein-protein interactions. iScience 23, 100939. doi:10.1016/j.isci.2020.100939
Zhang, N., Chen, Y., Zhao, F., Yang, Q., Simonetti, F. L., and Li, M. (2018). PremPDI estimates and interprets the effects of missense mutations on protein-DNA interactions. PLoS Comput. Biol. 14, e1006615. doi:10.1371/journal.pcbi.1006615
Zhang, S., He, Y., Liu, H., Zhai, H., Huang, D., Yi, X., et al. (2019). regBase: whole genome base-wise aggregation and functional prediction for human non-coding regulatory variants. Nucleic Acids Res. 47, e134. doi:10.1093/nar/gkz774
Zhang, X., Walsh, R., Whiffin, N., Buchan, R., Midwinter, W., Wilk, A., et al. (2021). Disease-specific variant pathogenicity prediction significantly improves variant interpretation in inherited cardiac conditions. Genet. Med. 23, 69–79. doi:10.1038/s41436-020-00972-3
Zhou, J. B., Xiong, Y., An, K., Ye, Z. Q., and Wu, Y. D. (2020). IDRMutPred: predicting disease-associated germline nonsynonymous single nucleotide variants (nsSNVs) in intrinsically disordered regions. Bioinformatics 36, 4977–4983. doi:10.1093/bioinformatics/btaa618
Zhu, X., Liu, L., He, J., Fang, T., Xiong, Y., and Mitchell, J. C. (2020). iPNHOT: a knowledge-based approach for identifying protein-nucleic acid interaction hot spots. BMC Bioinforma. 21, 289. doi:10.1186/s12859-020-03636-w
Summary
Keywords: variation, mutation, benchmark, method performance assessment, data sets, variation database
Citation: Shirvanizadeh N and Vihinen M (2023) VariBench, new variation benchmark categories and data sets. Front. Bioinform. 3:1248732. doi: 10.3389/fbinf.2023.1248732
Received: 27 June 2023
Accepted: 08 September 2023
Published: 19 September 2023
Volume: 3 - 2023
Edited by: Marcelo Reis, State University of Campinas, Brazil
Reviewed by: Castrense Savojardo, University of Bologna, Italy; Carlos Rodrigues, Baker Heart and Diabetes Institute, Australia; Seyed Jamalaldin Haddadi, State University of Campinas, Brazil
Copyright: © 2023 Shirvanizadeh and Vihinen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Mauno Vihinen, mauno.vihinen@med.lu.se
†Present address: Niloofar Shirvanizadeh, Cancer Genomics and Proteomics, Karolinska University Hospital, Huddinge, Sweden
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.