VariBench, new variation benchmark categories and data sets



Introduction
Genetic variation data is nowadays easy to generate. Variation interpretation means describing the significance of variations, often in relation to disease. This is a substantially more difficult problem than sequence generation. Experimental methods provide verified interpretations; however, because of the huge number of variations in every individual, computational approaches are widely used. The human genome is over 3 billion base pairs long (Nurk et al., 2022). Due to individual genetic heterogeneity, 4.1–5.0 million sites differ from the reference genome (Auton et al., 2015). Various types of prediction methods are widely used to interpret the variations, see (Niroula and Vihinen, 2016). Benchmark studies have indicated large differences in the performance of methods developed for the same type of variation prediction task, see, e.g., (Thusberg et al., 2011; Niroula and Vihinen, 2019; Zhang et al., 2019; Marabotti et al., 2021; Anderson and Lassmann, 2022). Both predictor development and performance assessment depend largely on high-quality data. One might think that a large number of verified variations is available because genetic diagnosis is widely applied; however, that is not the case, especially when considering specific types of variations or mechanisms.
The development and testing of computational methods are dependent on experimental data. Accurate prediction methods can be developed only with reliable experimentally verified cases, with a systematic approach and using relevant measures (Vihinen, 2012; Vihinen, 2013). Method performance has to be assessed in comparison to existing knowledge. For that purpose, benchmark data sets with known and verified outcomes are needed. Such data sets can be time-consuming and costly to collect and require many manual steps. Therefore, it is important that the produced data are distributed and reused.
In the variation interpretation field, two databases deliver such data sets. VariBench (Nair et al., 2013; Sarkar et al., 2020) and VariSNP (Schaafsma et al., 2015) contain variation benchmark data. VariSNP is a version of the dbSNP database (Sherry et al., 2001) for short variations from which known disease-causing variants have been filtered out. VariBench is a generic database that contains all types of variations with all kinds of effects. These resources have been widely used for prediction method training and testing.
What requirements and criteria should benchmark data sets fulfill in relation to variation interpretation and in general? We have defined five criteria, discussed in (Nair et al., 2013): relevance, representativeness, non-redundancy, inclusion of both positive and negative cases, and reusability. VariBench subscribes to these criteria, collects data sets and distributes them freely. VariBench data sets are frequently used to train methods and to test their performance. These sets also facilitate post-publication comparison of methods to published benchmarks (Sarkar et al., 2020).
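Two of these criteria, non-redundancy and the inclusion of both positive and negative cases, can be checked mechanically. The following Python lines are a minimal sketch of such a check, not a VariBench procedure. The record layout (gene, protein-level variant, label), the gene names and the labels are hypothetical and chosen only for illustration. The sketch detects exact duplicates only; redundancy in a benchmark sense can also arise from closely related cases, which this check does not address.

```python
from collections import Counter

# Hypothetical record layout (an assumption, not VariBench's file format):
# (gene symbol, protein-level variant description, label)
records = [
    ("GENE1", "p.Arg12Cys", "pathogenic"),
    ("GENE1", "p.Arg12Cys", "pathogenic"),  # exact duplicate of the first record
    ("GENE1", "p.Gly45Ser", "benign"),
    ("GENE2", "p.Leu7Pro", "pathogenic"),
]


def summarize(records):
    """Return the unique records and the label counts among them."""
    # dict.fromkeys keeps the first occurrence of each record, in order.
    unique = list(dict.fromkeys(records))
    label_counts = Counter(label for _, _, label in unique)
    return unique, label_counts


unique, label_counts = summarize(records)

# Both a positive and a negative class must be represented.
has_both_classes = label_counts["pathogenic"] > 0 and label_counts["benign"] > 0

print(len(records), "records,", len(unique), "unique")
print(dict(label_counts), "both classes present:", has_both_classes)
```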
The bottleneck in sequencing projects has shifted from sequencing to the interpretation of the obtained results. Experimental studies of variant effects are the gold standard. They are not feasible in many instances, and therefore various computational approaches have been developed. We divide the prediction methods into five categories in VariBench.
First, pathogenicity predictions, also called tolerance predictions, aim to identify disease-related alterations of various types (for details see Table 1). These methods aim only to detect harmful or disease-related variants. Second, effect-specific methods predict various effects at the DNA, RNA and protein levels. Third, there are predictors specific to certain molecules or families of molecules, typically proteins. Fourth, some methods are dedicated to certain diseases. Fifth, some tools predict the phenotype, typically the severity of the variant effect.
High-quality variation data sets are difficult and laborious to generate. VariBench collects and organizes different types of variation data sets, integrates additional information into them and distributes them. It is a unique database. We have updated the resource with 143 new data sets, which include more than 90 million variants. During the update, some new categories of variations and effects have been included. There are currently variations in 5 main categories, 17 subgroups and 11 groups.

Data sets and quality
VariBench collects data sets from the literature, databases and predictors that have been used to train methods or to assess their performance. There are no selection criteria for the inclusion of data sets, for several reasons. The data sets can be used as such, or they can be further cleaned and pruned for use in additional tasks, extended with new cases, etc. A good benchmark data set should fulfill several requirements (Vihinen, 2012; Vihinen, 2013), including good coverage, representativeness and containing both positive and negative cases that are experimentally determined. The representativeness of amino acid substitution data sets was investigated (Schaafsma and Vihinen, 2018) and found not to be optimal.
The quality of data sets in VariBench is variable. We include even known low-quality data sets, since they may be valuable when building new data sets and for other applications. We have performed some quality tests, including consistency checks; however, it is the duty of the users of the data to evaluate whether the data are suitable for the intended use. One of the goals of VariBench is to provide existing data sets, even when problematic, e.g., for comparative purposes.
Systematics is an integral part of data and database quality. It is quite common that, due to errors and a lack of systematics, some variants in an existing data set cannot be reused because they cannot be mapped to reference sequences.
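One simple form of such a mapping check is to confirm that the reference residue stated for a substitution is actually present at the stated position in the reference sequence. The Python sketch below illustrates the idea under stated assumptions: the toy reference sequence, the variant tuples and the function name `is_mappable` are hypothetical, and the check is not a description of the quality tests performed in VariBench.

```python
# Hypothetical toy reference protein sequence (an assumption for illustration).
reference = "MKTAYIAK"

# Hypothetical substitutions: (1-based position, reference residue, variant residue)
variants = [
    (2, "K", "R"),   # reference residue matches position 2
    (4, "G", "D"),   # mismatch: position 4 is A in the reference
    (12, "L", "P"),  # position lies outside the reference sequence
]


def is_mappable(reference, position, ref_residue):
    """Return True if ref_residue is found at the 1-based position."""
    if not 1 <= position <= len(reference):
        return False
    return reference[position - 1] == ref_residue


# Split the variants into those that can and cannot be placed on the reference.
mappable = [v for v in variants if is_mappable(reference, v[0], v[1])]
unmappable = [v for v in variants if not is_mappable(reference, v[0], v[1])]

print("mappable:", mappable)
print("unmappable:", unmappable)
```

Variants that fail the check are kept in a separate list rather than discarded, so that they can be inspected and possibly corrected against another reference sequence version.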
An example of the importance of data quality comes from the field of protein stability prediction. Most of the existing predictors are based on a single database, ProTherm, which was shown to contain numerous problems (Yang et al., 2018). Recently, new and higher-quality databases have emerged in this field (Stourac et al., 2021; Turina et al., 2021).

Uses of VariBench data
VariBench data sets have been widely used, especially to train and test variation interpretation predictors (pathogenicity/tolerance, protein stability, solubility, melting temperature, gene/protein/disease-specific predictors, and interaction and structural effects on folded and disordered regions and proteins), but also to benchmark the performance of tools for various variation types and effects. In addition to human-related applications, plant- and animal-related predictors and benchmarks have benefitted from VariBench (Yang et al., 2022). The data has also facilitated the interpretation of variants according to the guidelines of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) (Richards et al., 2015) and the benchmarking of such annotations.

Data sets in VariBench
VariBench now contains 559 files for separate data sets from 295 studies and covers a wide range of variations (Tables 1, 2). The data sets were collected from the literature, websites and databases. They have been used for predictive purposes, most often to develop novel predictors for different types or effects of variants. Some data sets have been specifically collected for benchmarking purposes.
There are 247 new data files that contain a total of 90,886,959 variants. Together with previous versions, there are 105,181,219 variants, which is more than seven-fold the original number of 14,294,260 variants. The number of data sets is high because many articles contain more than one data set. Many of the data sets are redundant, as they contain data from the same origin. The most common sources of variants are the ClinVar database of variants and their disease relationships (Landrum et al., 2018), the ProTherm thermodynamic database (Kumar et al., 2006), and VariBench itself. The number of unique variants is significantly lower than the sum of the variants in the data sets.
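The totals above can be verified with simple arithmetic. The short Python sketch below uses only the two variant counts reported in the text; the variable names are illustrative.

```python
# Variant counts as reported in the text.
previous_variants = 14_294_260  # variants in the previous versions of VariBench
new_variants = 90_886_959       # variants in the 247 new data files

# Total after the update and its ratio to the original number.
total_variants = previous_variants + new_variants
fold_increase = total_variants / previous_variants

print(f"total variants: {total_variants:,}")
print(f"ratio to original number: {fold_increase:.2f}")
```

These figures are sums over data sets, so variants shared between data sets are counted more than once.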
The data sets are divided into 5 categories, 17 subgroups and 11 groups (Table 1). The number of data items varies between sets and depends on the original data. Data items irrelevant to VariBench (i.e., not describing variants or their effects) were removed when sets were included in the database. In many data sets, variants are described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level. One of the aims of VariBench is to facilitate the reuse of existing data sets; therefore, the data are provided at as many levels as possible. Further, the data can be used for various purposes beyond the original application, such as benchmarking, developing different types of predictors, bioinformatics reviews and analyses of variation types, clinical variation interpretation, etc. When extending the use in this way, users must be cautious, be aware of the possible limitations of the data sets and understand how they have been collected.
The main categories of variation type data sets are insertions and deletions, substitutions in coding and non-coding regions, structure-mapped variants, synonymous and unsense variants, benign variants, and DNA structural variants (see Tables 1, 2). Unsense variants are a new category for exonic alterations that may look synonymous but affect the protein or its expression, typically due to aberrant splicing or miRNA binding alterations (Vihinen, 2022; Vihinen, 2023a; Vihinen, 2023b). Effect-specific data sets include DNA regulatory elements, RNA splicing, and protein properties: aggregation, binding free energy, disorder, solubility, stability, folding rate, interactions, and functional effects. Molecule- and disease-specific data sets include information for individual genes, proteins, gene/protein families or diseases. Phenotype data sets are for a disease feature, the severity of the phenotype. Almost all the categories contain new data sets. In addition, we have 6 new variation categories including structural variations in DNA (1 data set), protein folding rate (5 data sets in six publications), antibody-antigen affinity changes (5 articles and sets), protein-nucleic acid interactions (6 articles), gain of

TABLE 1 Types of data sets in VariBench.

Data set                         Data sets in previous version   New data sets
Variation type data sets
  Insertions and deletions       4                               2
  Substitutions coding region
    Training data sets           23                              9
    Test data sets               5                               3

TABLE 2 (Continued) New data sets in VariBench.
