Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models

Generative models are becoming a tool of choice for exploring the molecular space. These models are trained on a large dataset and produce novel molecular structures with similar properties. Generated structures can be utilized for virtual screening or for training semi-supervised predictive models in downstream tasks. While there are plenty of generative models, it is unclear how to compare and rank them. In this work, we introduce a benchmarking platform called Molecular Sets (MOSES) to standardize the training and comparison of molecular generative models. MOSES provides training and testing datasets and a set of metrics to evaluate the quality and diversity of generated structures. We have implemented and compared several molecular generation models and suggest using our results as reference points for further advancements in generative chemistry research. The platform and source code are available at https://github.com/molecularsets/moses.


Introduction
Many challenges of the 21st century, from personalized healthcare to energy production and storage [1], are linked by materials discovery and development. The discovery of new molecules can bring enormous societal and technological progress, potentially curing rare diseases and providing a pathway for personalized precision medicine [2]; disruptive advances are likely to come from unexplored regions of the set of all possible chemical compounds. However, a complete exploration of the huge space of potential chemicals is computationally intractable: it has been estimated that the number of pharmacologically sensible molecules is on the order of 10^23 to 10^80 compounds [3,4]. Furthermore, this search space can often be constrained based on known chemistry, patented molecules, and desired qualities (e.g., solubility or toxicity). There have been many approaches to mapping and mining the space of possible molecules, including high-throughput screening, combinatorial libraries, evolutionary algorithms, and other techniques [5-8]. The exponential growth of publicly available experimental data has also increased the popularity of machine learning applications in bioinformatics [9,10], including the prediction of pharmacological properties of drugs [11], biomarker development [12-15], and target identification [16].
Over the last few years, advances in machine learning, and especially in deep learning, have driven the design of new computational systems that improve with experience and are able to model increasingly complex phenomena. One approach that has proven fruitful for modeling the distribution of molecular data is deep generative models. Just as we can sample random numbers from a probability distribution, generative models learn to model a data distribution in a way amenable to sampling new data points or conditioning on certain properties.
Deep generative models have shown remarkable results in a wide range of settings, from generating synthetic images [17] and natural language texts [18] to applications in biomedicine, including the design of DNA sequences [19] and aging research [20]. One important field of application for deep generative models lies in the inverse design of organic molecules [21]: for a given functionality (solubility, ease of synthesis, toxicity, presence of particular molecular fragments, etc.), we wish to find a molecule that meets these requirements. Inverse design uses optimization, sampling, and search methods to navigate the manifold of functionality in chemical space. A deep generative model fits these requirements, since it learns the manifold of desired functionality and allows mapping this functionality back to the chemical space.
Part of the success of deep learning in different fields has been driven by ever-growing publicly available datasets and standard benchmark sets. These sets serve as a common measuring stick for newly developed models and optimization strategies; for example, when the field of computer vision had more or less worked the classical MNIST dataset of handwritten digits into the ground [22], larger and more complex datasets such as ImageNet [23] immediately took its place.
In the context of organic molecules, MoleculeNet [24] was introduced as a standardized benchmark suite for discriminative tasks such as regression and classification, but at present there is no counterpart for generative models and generation tasks. We attribute this to two main reasons: 1) the lack of a common benchmark set and 2) the lack of metrics. Both are hard to tackle, since any given molecular application depends highly on context: molecules designed for organic redox flow batteries differ quite significantly from molecules for blocking malaria transmission. Even for a common dataset, the desired qualities might differ greatly. For drug molecules, we sometimes seek a diverse set of molecules to avoid an already patented chemical space, but it can also be advantageous to find molecules that are quite similar in functionality and structure to a patented molecule while being different enough from it. Furthermore, it is not immediately clear how to compare deep generative models. There is no clear-cut way to evaluate the quality of generated samples; in particular, the loss function of the model itself is seldom a good choice for final evaluation. Often the real validation test for a candidate sample requires experimental characterization, either via multiple in vitro experiments or a device prototype. Computational approaches for scoring molecules tend to suffer from a lack of accuracy (e.g., target-ligand interaction) or are too time-consuming (e.g., simulating the electronic structure).
In this work, we aim to tackle both of these problems by providing a unified implementation of a benchmark suite for molecular generation, including data preprocessing utilities, evaluation metrics, and state of the art molecular generation models.While it is clear there can be no single solution to cover all molecular applications, we hope that these contributions will serve as a clear and unified testbed for current and future generative models.
In the following sections, we elaborate on each of the features of MOSES and conclude with evaluation results on a molecular biology dataset, along with a discussion of current challenges in molecular generation. We also highlight possible directions for improvement: datasets outside of a molecular biology context, alternative representations of molecules, and better metrics for molecules. The pipeline of our framework is shown in Figure 1.

Dataset
Deep generative models require large datasets in order to learn patterns that can generalize and lead to new molecules. While large datasets of molecules exist, there is no commonly accepted standard dataset. The pharmaceutical industry has historically driven much of the large-scale development and testing of molecular libraries via virtual screening and high-throughput screening. As Table 1 shows, the largest datasets correspond to libraries utilized in medicinal chemistry.
Virtual screening libraries are often constructed by applying common reactions to a set of seed molecules; the reactions are such that their products are predictable and easily performed in a laboratory setting, while seed molecules are typically purchasable, easily synthesized, or extracted from natural products. Even though the coverage of reactions tends to be a small subset of all possible reactions [25], the resulting combinations can lead to virtual libraries on the order of 10^14 to 10^18 compounds [26]. Large datasets with experimentally characterized labeled data are quite valuable for constructing accurate models and libraries, so these tend to be closed-source and company-owned. Most current molecular generation applications utilize a subset of ZINC or ChEMBL molecules. For this purpose, we propose a biological molecule benchmark set refined from the ZINC database.
The set is based on the ZINC Clean Leads collection, which contains 4,591,276 molecules with molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms, atoms other than C, N, S, O, F, Cl, Br, and H, or cycles longer than 8 atoms. The molecules were then filtered via medicinal chemistry filters (MCFs) and PAINS filters [27].
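For illustration, the property-based part of this filtering can be sketched with RDKit as shown below. This is a minimal sketch, not the exact MOSES pipeline: RDKit's Crippen logP stands in for ZINC's XlogP, and the MCF/PAINS SMARTS patterns are omitted (they are available in the repository).

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

ALLOWED_ATOMS = {"C", "N", "S", "O", "F", "Cl", "Br", "H"}

def passes_property_filters(smiles):
    """Approximate the Clean Leads-style constraints used to build the dataset."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    if not 250 <= Descriptors.MolWt(mol) <= 350:
        return False
    if Descriptors.NumRotatableBonds(mol) > 7:
        return False
    if Crippen.MolLogP(mol) > 3.5:  # stand-in for ZINC's XlogP
        return False
    # No charged atoms and only the allowed element set
    for atom in mol.GetAtoms():
        if atom.GetFormalCharge() != 0 or atom.GetSymbol() not in ALLOWED_ATOMS:
            return False
    # No cycles longer than 8 atoms
    if any(len(ring) > 8 for ring in mol.GetRingInfo().AtomRings()):
        return False
    return True
```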
Medicinal chemistry filters are used to discard compounds containing so-called "structural alerts". Molecules containing such moieties either bear unstable or reactive groups or undergo biotransformations that form toxic metabolites or intermediates. Herein, we present a list of routine MCFs for the rational pre-selection of compounds more appropriate for modern drug design and development. These include some electrophilic alkylating groups, such as Michael acceptors (MCF1-3), alkyl halides (MCF4), epoxides (MCF5), isocyanates (MCF6), aldehydes (MCF7), imines (Schiff bases, MCF8), and aziridines (MCF9), which are highly susceptible to nucleophilic attack, e.g., by nucleophilic residues such as serine, lysine, or cysteine. In many cases, this leads to unselective protein and/or DNA damage. Metabolism of hydrazines (MCF10) furnishes diazene intermediates (MCF11), which are also alkylating warheads. Monosubstituted furans (MCF12) and thiophenes (MCF13) are transformed into reactive intermediates via epoxidation; their active metabolites irreversibly bind nucleophilic groups and modify proteins. Electrophilic aromatics (e.g., halopyridines, MCF14), oxidized anilines (MCF15), and disulfides (MCF16) are also highly reactive. In vivo, alkylators are trapped and inactivated by the thiol group of glutathione, a key natural antioxidant. Azides (MCF17) are highly toxic; compounds containing this functional group in particular cause genotoxicity. Aminals (MCF18) and acetals (MCF19) are frequently unstable and inappropriate in generated structures. In addition, virtual structures containing a large number of halogens (MCF20-22) should be excluded due to increased molecular weight and lipophilicity (insufficient solubility for oral administration), as well as metabolic stability and toxicity concerns. The detailed mechanisms of toxicity for the structural alerts mentioned above have been comprehensively described in [28,29].
The final dataset contains 1,936,962 molecular structures. For experiments, we also provide training, test, and scaffold test sets containing 250,000, 10,000, and 10,000 molecules, respectively. The scaffold test set contains unique Bemis-Murcko scaffolds [35] that are not present in the training and test sets. We use this set to assess how well a model can generate previously unobserved scaffolds.
Table 1: Datasets for generation tasks.

Representations
Before training a generative or predictive model on molecular data, one has to represent compounds in a machine learning-friendly way. For generative tasks, we often would like a one-to-one mapping between a molecule and its representation, so that we can do a quick search in a lookup table, perform reconstruction, or sample new molecules (see the example in Figure 2). For the generative models considered in this work, we have focused on two primary forms of representation: sequences of tokens via SMILES [36,37] and molecular graphs. Both representations rely on empirical principles of bonding: a molecule is interpreted as an undirected graph where each atom is a node and the bonds are the edges. To reduce complexity, hydrogen atoms are treated implicitly, since they can be deduced from standard chemical valence rules. A challenge when dealing with graphs is the graph isomorphism problem: without an explicit ordering of nodes, two graphs can map to the same structure, and finding out whether they are equivalent is computationally demanding. This can be exploited for data augmentation purposes; conversely, it has been argued that this representation degeneracy introduces noise into a model [38]. In this section, we expand on these two representations and also elaborate on other alternatives that might be used for generative models.

SMILES. Any graph can be transformed into a sequence of tokens by traversing its spanning tree and collecting all visited edges and nodes as tokens. For molecules, the Simplified Molecular Input Line Entry System, or SMILES, defines grammar and syntax rules that are widely adopted in the chemistry community. To remove ambiguity, SMILES supports a canonical ordering (canonical SMILES) [39]. Additionally, SMILES can encode stereochemistry by specifying special tokens for chirality (isomeric SMILES). A non-standard specification can also be adapted to include specific types of bonds, fragments, etc.
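As a brief illustration of canonicalization, RDKit maps different token orderings of the same molecule to a single canonical SMILES (the vanillin SMILES below are illustrative):

```python
from rdkit import Chem

# Two different token orderings of the same molecule (vanillin)
a = Chem.MolToSmiles(Chem.MolFromSmiles("O=Cc1ccc(O)c(OC)c1"))
b = Chem.MolToSmiles(Chem.MolFromSmiles("COc1cc(C=O)ccc1O"))
assert a == b  # canonicalization removes the ordering ambiguity

# Isomeric SMILES additionally keeps stereochemistry tokens
iso = Chem.MolToSmiles(Chem.MolFromSmiles("C[C@H](N)C(=O)O"), isomericSmiles=True)
```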
Much of the popularity of SMILES in molecular generative models [40-48] has been due to the fast adaptation of sequence modeling tools such as recurrent neural networks, attention mechanisms, and dilated convolutions, among others, to the textual nature of SMILES. However, a string format does not preserve the locality of atoms: atoms that are adjacent in the molecular graph can appear in distant positions in the string. Also, not every string of SMILES tokens is semantically correct, i.e., represents a valid molecule, so a neural network has to track distant relationships in SMILES, which leads to a lower number of valid strings. Some methods incorporate SMILES syntax into the network architecture to increase the fraction of valid molecules [49,50]. There is also the International Chemical Identifier (InChI) [51] format, a more verbose string representation that explicitly specifies the chemical formula, charges, hydrogens, and isotopes.
Molecular graphs. In a machine learning context, graphs are normally represented via sparse adjacency matrices with values corresponding to the presence or absence of edges between vertices. These matrices are augmented with node and edge feature vectors, since nodes and edges in molecular graphs carry physical properties related to the atoms and bonds. Often these are represented as sparse tensors [52].
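A minimal sketch of this encoding with RDKit and NumPy follows; the atom vocabulary here mirrors the dataset's heavy-atom set and is illustrative, and dense arrays are used for clarity where a real pipeline might use sparse tensors:

```python
import numpy as np
from rdkit import Chem

ATOMS = ["C", "N", "S", "O", "F", "Cl", "Br"]  # heavy atoms in our dataset

def mol_to_tensors(smiles):
    """Return (one-hot node features, bond-order adjacency); hydrogens stay implicit."""
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    nodes = np.zeros((n, len(ATOMS)), dtype=np.float32)
    adj = np.zeros((n, n), dtype=np.float32)
    for atom in mol.GetAtoms():
        nodes[atom.GetIdx(), ATOMS.index(atom.GetSymbol())] = 1.0
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        adj[i, j] = adj[j, i] = bond.GetBondTypeAsDouble()  # 1.0, 1.5, 2.0, or 3.0
    return nodes, adj
```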
There are multiple ways of processing information defined on a graph, and several approaches have been proposed, such as Graph Convolutional Networks [53], Weave Networks [24], and Message Passing Networks [54]. The molecular graph can also be represented implicitly as an adjacency tensor with different atom and bond types encoded along certain dimensions; this approach has been successfully used in the MolGAN model [55] on the QM9 dataset [56]. Another approach to treating graphs was introduced with the Junction Tree VAE model [57], which, instead of modeling a molecular graph atom-wise, interprets each molecule as a composition of subgraphs chosen from a vocabulary of fragments.
3D structure and alternative representations. For a long time, one of the most common representations of molecular structures has been the fingerprint, a fixed-length vector indicating the presence or absence of chemical environments [58,59]. Fingerprints are not typically invertible, but can be used for fast similarity screening against available in-house compounds. Kadurin et al. [47] implemented VAE and AAE models based on Molecular Access System (MACCS) fingerprints to screen for novel compounds with a desired effect. Polykovskiy et al. [60] proposed a generative model for creating molecular structures for a given fingerprint, training a simple generator on fingerprints and then transforming them into SMILES with a separate model. However, fingerprints often lack biological relevance and do not fit naturally into the process of designing novel compounds.
One shortcoming of using molecular graphs and SMILES to model chemical and physical phenomena is that they describe molecules only at the level of atoms. A complete description of a molecule requires taking into account its 3D coordinates, its symmetries with respect to the Schrödinger equation, and possibly its electrons. This is a current open research problem, and we would therefore like to draw attention to other works that involve alternative representations of molecules and could potentially be used or expanded for generative models.
One promising direction lies in wavefunction or electronic density representations of molecules; these should in principle contain all the information needed for an exact description of the physics and chemistry of a molecule. An initial approximation was introduced with the help of invertible Hamiltonians via Coulomb matrices [61], which contain pairwise Coulomb interactions between atoms. Their generalization can be seen as tensors that encode multiple symmetries and physical properties of atoms [62]. The use of electronic densities has also been explored for prediction [63].

Metrics
To compare generative models, we suggest the following five metrics: fragment similarity, scaffold similarity, nearest neighbor similarity (based on 2D topological fingerprints), internal diversity, and Fréchet ChemNet Distance. We also present a set of auxiliary metrics that are useful for the drug design process but could be extended to other applications.
Fragment similarity (Frag) is defined as the cosine distance between vectors of fragment frequencies. For a set of molecules G, its fragment frequency vector f_G has a size equal to the size of the vocabulary of all chemical fragments in the dataset, and the elements of f_G represent the frequencies with which the corresponding fragments appear in G. The distance is then defined as

Frag(G, R) = 1 − cos(f_G, f_R) = 1 − (f_G · f_R) / (||f_G|| ||f_R||),

where the molecules in both G and R are fragmented using the BRICS algorithm [67] implemented in RDKit [68]. This metric shows the similarity of two sets of molecules at the level of chemical fragments.
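A sketch of this computation using RDKit's BRICS implementation (here each fragment is counted once per molecule that contains it; the reference implementation lives in the MOSES repository):

```python
from collections import Counter
from rdkit.Chem import BRICS

def fragment_counts(mols):
    """Frequencies of BRICS fragments across a set of RDKit molecules."""
    counts = Counter()
    for mol in mols:
        counts.update(BRICS.BRICSDecompose(mol))  # fragment SMILES per molecule
    return counts

def cosine(c1, c2):
    dot = sum(c1[k] * c2[k] for k in set(c1) | set(c2))
    n1 = sum(v * v for v in c1.values()) ** 0.5
    n2 = sum(v * v for v in c2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0

def frag_distance(gen_mols, ref_mols):
    return 1.0 - cosine(fragment_counts(gen_mols), fragment_counts(ref_mols))
```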
Scaffold similarity (Scaff) is calculated in a similar way, as the cosine distance between vectors s_G and s_R that represent the frequencies of scaffolds in the sets of molecules G and R:

Scaff(G, R) = 1 − cos(s_G, s_R).

Scaffolds are derived from the molecules by removing side chain atoms using the Bemis-Murcko algorithm [35], also implemented in RDKit [68]. The purpose of this metric is to show how similar the scaffolds of the generated and reference datasets are. Note that both fragment and scaffold similarity compare sets at the substructure level, not the molecule level: it is possible to have a distance of 0 (identical) between two different sets of molecules, as long as their substructure counts are the same.
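Scaffold extraction is a one-liner in RDKit, and the same cosine computation from the fragment-similarity sketch above can be reused over scaffold counts:

```python
from collections import Counter
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_counts(smiles_list):
    """Frequencies of Bemis-Murcko scaffolds in a set of molecules."""
    return Counter(MurckoScaffold.MurckoScaffoldSmiles(s) for s in smiles_list)

# scaff_distance = 1.0 - cosine(scaffold_counts(gen), scaffold_counts(ref))
```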
Nearest neighbor similarity (SNN) is the average Tanimoto similarity T(m_G, m_R) (also known as the Jaccard index) between a molecule m_G from the generated set G and its nearest neighbor molecule m_R in the reference dataset R:

SNN(G, R) = (1 / |G|) Σ_{m_G ∈ G} max_{m_R ∈ R} T(m_G, m_R),

where m_G and m_R are representations of the molecules as bit strings (fingerprints); the resulting metric shows how similar the generated molecules are to the reference molecules in terms of the chemical structures encoded in these fingerprints. In this work, we used standard Morgan [69] fingerprints with radius 2 and 1024 bits. This representation is useful for a general analysis of the chemical space, considering that the constant search for novel scaffolds and chemotypes is an essential part of the modern drug discovery process.
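The SNN computation reduces to a nearest-neighbor search over fingerprints; a direct brute-force sketch with the Morgan parameters above:

```python
from rdkit import DataStructs
from rdkit.Chem import AllChem

def morgan_fps(mols):
    return [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024) for m in mols]

def snn(gen_mols, ref_mols):
    """Average Tanimoto similarity of each generated molecule to its nearest reference."""
    gen_fps, ref_fps = morgan_fps(gen_mols), morgan_fps(ref_mols)
    return sum(max(DataStructs.BulkTanimotoSimilarity(fp, ref_fps))
               for fp in gen_fps) / len(gen_fps)
```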
In general, molecules with similar chemical structures tend to elicit similar biological responses; moreover, there are more general features, called pharmacophores, that are responsible for biological activity. It has been shown that structurally diverse ligands sharing a common pharmacophore hypothesis can bind to the same receptor site [70]. Therefore, nearest neighbor similarity calculated using 2D pharmacophore fingerprints [59] can be beneficial when comparing two sets of molecules with regard to the biological targets they can bind to.
Internal diversity (IntDiv_p) [71] assesses the chemical diversity within the generated set of molecules G:

IntDiv_p(G) = 1 − ( (1 / |G|²) Σ_{m_1, m_2 ∈ G} T(m_1, m_2)^p )^(1/p).

While SNN measures the dissimilarity to the reference dataset (external diversity), the internal diversity metric evaluates the generated molecules themselves. This metric detects a common failure case of generative models: mode collapse, in which the model produces a limited variety of samples, ignoring some areas of chemical space. A higher value of the metric corresponds to higher diversity in the generated set. In the experiments, we report IntDiv_1(G) and IntDiv_2(G).

Fréchet ChemNet Distance (FCD) [72] is calculated using the penultimate layer of ChemNet, a deep neural network trained to predict the biological activities of drugs; the activations of this layer capture both chemical and biological properties of the compounds. For two sets of molecules G and R, FCD is defined as

FCD(G, R) = ||µ_G − µ_R||² + Tr(Σ_G + Σ_R − 2(Σ_G Σ_R)^(1/2)),

where µ_G, µ_R are mean vectors and Σ_G, Σ_R are covariance matrices of the activations of the penultimate layer of ChemNet on the sets G and R, respectively.
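Both metrics are short to implement once fingerprints and ChemNet activations are at hand; in the sketch below, acts_g and acts_r are assumed to be NumPy matrices of penultimate-layer ChemNet activations computed elsewhere:

```python
import numpy as np
from scipy.linalg import sqrtm
from rdkit import DataStructs

def internal_diversity(fps, p=1):
    """IntDiv_p over a list of fingerprints (self-pairs included, as in the formula)."""
    n = len(fps)
    total = sum(s ** p
                for fp in fps
                for s in DataStructs.BulkTanimotoSimilarity(fp, fps))
    return 1.0 - (total / n ** 2) ** (1.0 / p)

def frechet_distance(acts_g, acts_r):
    """Fréchet distance between Gaussians fitted to two activation matrices."""
    mu_g, mu_r = acts_g.mean(axis=0), acts_r.mean(axis=0)
    sigma_g = np.cov(acts_g, rowvar=False)
    sigma_r = np.cov(acts_r, rowvar=False)
    covmean = sqrtm(sigma_g @ sigma_r)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(((mu_g - mu_r) ** 2).sum()
                 + np.trace(sigma_g + sigma_r - 2.0 * covmean))
```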
We believe that these five metrics provide good coverage of the desirable properties of a set of generated molecules, and we propose to use them as standard metrics for comparing generative models of molecular fingerprints and structures.
Auxiliary Metrics. For evaluation purposes, we have also included utilities for computing auxiliary metrics that represent qualities commonly assessed in small molecule drug discovery.
• Molecular weight (MW): the sum of atomic weights in a molecule.
• LogP: the octanol-water partition coefficient, a measure of how a molecule partitions between a lipophilic and an aqueous phase; computed via RDKit's Crippen [73] estimation.
• Synthetic Accessibility Score (SA): a heuristic estimate of how hard (10) or easy (1) it is to synthesize a given molecule. The SA score is based on a combination of fragment contributions and a complexity penalty [74].
• Quantitative Estimate of Drug-likeness (QED): a value between 0 and 1 estimating how likely a molecule is to be a viable drug candidate. QED is meant to capture the abstract notion of aesthetics in medicinal chemistry [75].
• Natural Product-likeness Score (NP): a numerical estimate that can be used to determine whether a molecule is likely to be a natural product (0 to 5), a drug molecule (−3 to 3), or a synthetic product (−5 to 0). The score is calculated from several substructure descriptors, comparing them to the properties of other distributions of molecules [76].
To quantitatively compare the property distributions of the generated and test sets, we use the Fréchet distance.
Alternatively, distributions of molecules can be compared in a simplified manner by looking at the mean and variance of each property, as sketched below.
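A minimal sketch of this simplified comparison: compute a few properties with RDKit and, assuming each property is roughly Gaussian in both sets, use the closed-form Fréchet distance between one-dimensional Gaussians (the SA and NP scores ship as RDKit contrib scripts and are omitted here):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, QED

def properties(smiles_list):
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    return {
        "MW": np.array([Descriptors.MolWt(m) for m in mols]),
        "LogP": np.array([Crippen.MolLogP(m) for m in mols]),
        "QED": np.array([QED.qed(m) for m in mols]),
    }

def frechet_1d(x, y):
    """Fréchet distance between 1D Gaussians fitted to two samples."""
    return (x.mean() - y.mean()) ** 2 + (x.std() - y.std()) ** 2

# Example: compare each property distribution between generated and test sets
# dists = {k: frechet_1d(props_gen[k], props_test[k]) for k in props_gen}
```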

Models
In the current version of MOSES, we implemented several deep learning models covering different approaches to the molecule generation problem: language models implemented as character-level recurrent neural networks (CharRNN) [41,72], Variational Autoencoders (VAE) [77-79], Adversarial Autoencoders (AAE) [80,78,60], and Junction Tree Variational Autoencoders (JT-VAE) [57]. These baseline models are trained in an unsupervised or semi-supervised fashion. Most models are based on the SMILES representation of a molecule and as such typically incorporate sequence modeling tools such as recurrent neural networks (RNN) with different types of cells. New models dealing with other representations of molecules keep appearing, such as JT-VAE, which works directly with molecular graphs. Although more models will be added, we believe the current collection is sufficient for a meaningful comparison against any new model.
Model comparison can be challenging, since different training parameters (number of epochs, batch size, learning rate, initial state, optimizer) and architecture hyperparameters (hidden layer dimension, number of layers, etc.) can significantly alter performance. For each model, we attempted to preserve its original architecture as published and tweaked hyperparameters to improve performance. We used random search over 100 architectures for every model and selected the architecture that produced the highest chemical validity; among models with the same validity, we chose the one with the highest uniqueness. All models are implemented in Python 3 using the PyTorch [81] framework. Next, we briefly introduce these models.
Character-level recurrent neural network (CharRNN) [41] treats the task of generating SMILES as a language modeling problem, attempting to learn the statistical structure of SMILES syntax by training on a large corpus of SMILES strings (Figure 3). Model parameters are optimized using maximum likelihood estimation (MLE). Our CharRNN is implemented using Long Short-Term Memory [82] RNN cells stacked into 3 layers with a hidden dimension of 768 each, with dropout [83] layers at rate 0.2 and a softmax output layer. Training is done with a batch size of 64, using the Adam [84] optimizer for 50 epochs with a learning rate of 10^-3 halved after every 10 epochs.
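A minimal PyTorch sketch of this architecture follows; the token vocabulary, embedding size, and training loop are our own illustrative choices, with the LSTM dimensions taken from the description above:

```python
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden=768, layers=3, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)  # softmax is applied in the loss

    def forward(self, tokens, state=None):
        x = self.embedding(tokens)        # (batch, seq, hidden)
        x, state = self.lstm(x, state)
        return self.out(x), state         # logits over the next SMILES token

# MLE training pairs these logits with nn.CrossEntropyLoss on the next token,
# e.g. torch.optim.Adam(model.parameters(), lr=1e-3) for 50 epochs.
```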
Variational autoencoder (VAE) is a framework for training two neural networks, an encoder and a decoder, to learn a mapping from a high-dimensional data representation into a lower-dimensional space and back. The lower-dimensional space is called the latent space; it is often a continuous vector space with a normally distributed latent representation. VAE parameters are optimized to encode and decode data by minimizing a reconstruction loss together with a KL-divergence term arising from the variational approximation, which can loosely be interpreted as a regularizer (Figure 4). Since molecules are discrete objects, a properly trained VAE defines an invertible continuous representation of a molecule.
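The resulting objective can be sketched as a token-level reconstruction loss plus a KL term whose weight beta is annealed during training (a sketch; tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, targets, mu, logvar, beta):
    """recon_logits: (batch, seq, vocab); targets: (batch, seq); mu/logvar: (batch, latent)."""
    recon = F.cross_entropy(recon_logits.transpose(1, 2), targets)  # token NLL
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return recon + beta * kl  # beta is annealed from 0 to 1 over training
```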
A VAE-based architecture for molecular generation was initially proposed by Gómez-Bombarelli et al. [77], and alternative architectures have been proposed by Kadurin et al. [78] and Blaschke et al. [79]. We combine aspects of these implementations in MOSES. The encoder is a bidirectional [85] Gated Recurrent Unit (GRU) [86] with a linear output layer; the decoder is a 3-layer GRU with 512 hidden dimensions and intermediate dropout layers with dropout probability 0.2. Training is done with a batch size of 128, gradient clipping of 50, and a KL-term weight linearly increased from 0 to 1 during training. We optimized the model using the Adam optimizer with a learning rate of 3·10^-4 for 50 epochs.

Adversarial Autoencoders (AAE) [80] combine the idea of VAE with adversarial training as found in GANs. One of the main drawbacks of VAE is that the KL-divergence term has a closed-form analytical solution only for a handful of prior distributions. In AAE, the KL-divergence term is avoided by training a discriminator network to predict whether a given sample came from the latent space of the autoencoder or from a prior distribution. Parameters are optimized to minimize the reconstruction loss and the discriminator loss. Kadurin et al. [78] applied the AAE architecture to the drug generation task. Our model consists of an encoder with a 1-layer bidirectional LSTM with 512 hidden dimensions, a decoder with a 2-layer LSTM with 512 hidden dimensions, and a shared embedding of size 16. The discriminator is a 2-layer fully connected neural network with 640 and 256 nodes, respectively, with the exponential linear unit (ELU) [87] activation function [88]. Training is done with a batch size of 128, using the Adam optimizer with a learning rate of 10^-3 for 50 epochs, halving the learning rate after every 10 epochs.

We also wish to highlight and re-emphasize several research directions and challenges that future generative models and benchmarking efforts should tackle:
• Organic materials datasets. While most readily available datasets cover biological contexts, there is a need for large, high-quality datasets of organic materials for redox flow batteries, light-emitting diodes, solar cells, dyes, polymers and copolymers, odorants, crop protectants, and food preservatives.
• Non-atom-centric representations of molecules. Most generative models have focused on the molecular graph, either explicitly or as SMILES. To describe molecular processes more accurately and completely, we might need representations that include information about electrons or the 3D structure of a molecule.
• Hierarchical representations of molecules. As seen with JT-VAE, a hierarchical model might be able to leverage structure at different scales (atoms, fragments, etc.). This might prove important for moving from small molecules to proteins, where the number of atoms can be on the order of thousands. Structured representations might also benefit from reduced complexity, both in processing time and for prediction.
• Better and more general metrics. Current auxiliary metrics have been developed around heuristics and rules of medicinal chemistry, often validated on small datasets due to the lack of publicly available data. To build better guided generative models, we need more accurate metrics that go beyond medicinal chemistry. It is important to measure the synthetic accessibility and cost of synthesis of a molecule, incorporating reaction paths between commercially available reactants and natural products [91], as well as to reliably measure solubility and miscibility between materials in a variety of molecular environments [92].
• Guided and biased generation. There have been several demonstrations of biasing the generative process of models using reinforcement learning [93-95], but these methods (e.g., Monte Carlo Tree Search) are often data intensive, so more efficient RL optimization methods are needed. Two areas of application that deserve attention are the generation of symmetry (point group) preserving molecules and navigation around patented molecular space: the former is important for planning and simplifying synthesis, the latter for promoting or avoiding certain types of substructures.
We hope this work will be useful to future researchers tackling molecular and materials challenges.

Figure 1: Molecular Sets (MOSES) pipeline. The open-source library provides a dataset, baseline models, and evaluation metrics.

Figure 2: Different representations of a vanillin molecule.

Figure 3: CharRNN model. The model is trained by maximizing the likelihood of known molecules.


Figure 4: Autoencoder-based models. VAE/AAE impose a specific distribution on the latent space.

Figure 5: Distribution of chemical properties for the MOSES dataset and sets of generated molecules. In brackets: the Fréchet distance to the MOSES dataset.