OPINION article

Front. Drug Discov.

Sec. In silico Methods and Artificial Intelligence for Drug Discovery

Volume 5 - 2025 | doi: 10.3389/fddsv.2025.1674289

On the Biologically Relevant Chemical Space: BioReCS

Provisionally accepted
  • 1National Autonomous University of Mexico, México City, Mexico
  • 2Cinvestav Unidad Zacatenco, Mexico City, Mexico
  • 3Universidad de Costa Rica, San José, Costa Rica

The final, formatted version of the article will be published soon.

The terms "chemical space" (CS), "chemical compound space," and "chemical universe" are frequently used in drug discovery and other areas, including chemical synthesis, catalysis, materials science, food chemistry, and agrochemistry (Kim et al., 2024). While the concept is often used intuitively or colloquially, CS is inherently complex, and numerous formal definitions have been proposed and reviewed (Medina-Franco et al., 2022). A commonly accepted notion of CS relates to the number of chemical compounds that could theoretically exist (the "size" of chemical space), which varies greatly depending on the classes of compounds considered (e.g., small organic molecules, peptides, odorants). Another perspective views CS as a multidimensional space in which molecular properties (both structural and functional) define coordinates and relationships between compounds (Virshup et al., 2013; Martinez-Mayorga and Medina-Franco, 2014). These definitions give rise to the concept of chemical subspaces (ChemSpas): subsets of the broader chemical universe distinguished by shared structural or functional features. Within this framework, the biologically relevant chemical space (BioReCS) comprises molecules with biological activity, both beneficial and detrimental. BioReCS spans diverse application areas such as drug discovery, agrochemistry, sensory chemistry (e.g., flavor and odor), food science, and natural product research. It also includes reactive, promiscuous, and poly-active molecules, as well as compounds with highly detrimental or undesirable effects, such as toxic and allergenic compounds.

Chemical compound databases are key resources for exploring the CS and are central to chemoinformatics (Williams and Richard, 2025). Numerous public databases, varying in size and specialization, target specific regions of BioReCS. Table 1 provides representative examples of freely available libraries across several domains.
Comprehensive reviews of chemoinformatic and bioinformatic databases have been published elsewhere (Rigden and Fernández, 2024; de Azevedo et al., 2024).

A systematic study of CS requires molecular descriptors that define the dimensionality of the space. The choice of descriptors depends on project goals, compound classes (e.g., metal-containing vs. purely organic molecules), and the dataset size and diversity. Large and ultra-large chemical libraries, which are widely used today in drug discovery projects (Lyu et al., 2019; Corrêa Veríssimo et al., 2024), demand descriptors that strike a balance between computational efficiency and chemical relevance (Warr et al., 2022). The rise of machine learning has led to the development of novel molecular representations (Wigh et al., 2022). Visualization is another critical tool for CS analysis. Because these spaces often involve many dimensions, dimensionality-reduction techniques are commonly used to project them into two or three dimensions for interpretation. Recent reviews detail advancements in the visualization of chemical space (Sosnin, 2025).

In this article, we offer an integrative perspective on BioReCS, highlighting common considerations for its consistent and meaningful exploration. We also address its size, historical evolution, and future expansion.

In many research projects, the chemical universe (and, by extension, BioReCS) is explored through distinct chemical subspaces (ChemSpas). For instance, CS analyses may focus specifically on small-molecule drug candidates, peptides (Orsi and Reymond, 2024), or proteolysis-targeting chimeras (PROTACs) (Danishuddin et al., 2023; Sincere et al., 2023). Other studies target agrochemicals, odorants, natural products, or metal-containing compounds.
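As a rough illustration of the descriptor-based view described above, in which each compound is a point whose coordinates are molecular descriptors and a projection to two dimensions supports visualization, the following Python sketch standardizes a small descriptor table and projects it onto its first two principal components. Principal component analysis is used here only as one common dimensionality-reduction choice; the article does not prescribe a specific method. The compound labels and descriptor values are hypothetical placeholders, and all function names are introduced for this example only.

```python
import math
import random

# Hypothetical placeholder descriptors (not curated data):
# molecular weight, logP, H-bond donors, H-bond acceptors.
DESCRIPTORS = {
    "cpd_1": [180.0, 1.2, 1, 4],
    "cpd_2": [151.0, 0.5, 2, 2],
    "cpd_3": [206.0, 3.5, 1, 2],
    "cpd_4": [1203.0, 3.0, 5, 12],
    "cpd_5": [734.0, 3.1, 5, 14],
    "cpd_6": [300.0, 2.5, 0, 7],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def standardize(rows):
    """Column-wise z-scores so that no descriptor dominates by scale."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)] for row in rows]


def top_eigenpair(matrix, iters=1000, seed=0):
    """Leading eigenpair of a symmetric positive semidefinite matrix
    by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # zero matrix: no variance left to explain
            return v, 0.0
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return v, eigenvalue


def project_2d(rows):
    """Return 2D coordinates and the two principal axes."""
    z = standardize(rows)
    n, d = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / n for j in range(d)]
           for i in range(d)]
    pc1, lam1 = top_eigenpair(cov)
    # Deflation: subtract the variance captured by PC1, then repeat.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2, _ = top_eigenpair(deflated, seed=1)
    coords = [(dot(r, pc1), dot(r, pc2)) for r in z]
    return coords, (pc1, pc2)


if __name__ == "__main__":
    coords, _ = project_2d(list(DESCRIPTORS.values()))
    for name, (x, y) in zip(DESCRIPTORS, coords):
        print(f"{name}: PC1={x:+.2f}, PC2={y:+.2f}")
```

Standardizing first keeps a large-valued descriptor such as molecular weight from dominating the projection. In practice, a dedicated numerical library would replace the hand-written power iteration.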
Some research initiatives sit at the intersection of multiple ChemSpas, such as investigating bioactive compounds that straddle both natural product and food chemical domains (Avellaneda-Tamayo et al., 2024) or studying the overlap between flavor and odor chemicals (Cui et al., 2025). Analyzing these intersecting regions of chemical space often requires integrating methodologies from diverse disciplines.

In this section, we highlight both heavily explored and underexplored regions of BioReCS. In drug discovery, widely used public databases such as ChEMBL (Zdrazil et al., 2025) and PubChem (Kim et al., 2024) serve as major sources of biologically active small molecules, primarily organic compounds. Owing to their extensive biological activity annotations, these databases are also rich sources of poly-active compounds and promiscuous structures. Table 1 summarizes these and other key databases that cover different regions of BioReCS. The chemical space of drug-like molecules, particularly small organic compounds and natural products, has been extensively studied. Closely related areas, such as small peptides and other beyond Rule of 5 (bRo5) entities, are also well characterized using computational approaches (Price et al., 2024; Capecchi and Reymond, 2021; López-López et al., 2023).

Importantly, to fully chart the boundaries of BioReCS, it is crucial to include negative biological data, that is, compounds known to lack bioactivity (Williams et al., 2016; López-López et al., 2022). These data help define the non-biologically relevant portions of chemical space. A notable example is dark chemical matter, a large-scale dataset comprising small molecules from corporate compound collections that have repeatedly failed to show activity in high-throughput screening assays (Wassermann et al., 2015). A more recent development is InertDB, a collection of 3,205 curated inactive compounds obtained from PubChem (An et al., 2025).
The database also includes 64,368 putative inactive molecules generated with a deep generative artificial intelligence (AI) model trained on the experimentally determined inactive molecules (An et al., 2025).

Certain types of chemical structures remain underrepresented in chemoinformatics due to modeling challenges. A prominent example is metal-containing molecules, which are often excluded during data curation because most chemoinformatics tools are optimized for small organic compounds (Fourches et al., 2016; Bento et al., 2020; Valle-Núñez et al., 2025). Metallodrugs, therefore, represent a structurally and functionally important class that is commonly filtered out by default. However, the difficulty of modeling a region of BioReCS should not justify its exclusion. Similarly, various compound classes are rarely targeted in drug discovery efforts, including large and complex natural products, macrocycles (compounds containing rings of ≥12 atoms), protein-protein interaction (PPI) modulators or inhibitors, PROTACs, and mid-sized peptides. Many of these molecules fall into the bRo5 category (Price et al., 2024; Whitty and Zhou, 2015; Schaub et al., 2021) (Table 1). Despite their complexity, interest in characterizing these regions of chemical space is growing. Recent studies have addressed the CS of peptides (Orsi and Reymond, 2024; Capecchi et al., 2019), agrochemicals (Zhang et al., 2018), metallodrugs (Meggers, 2007; López López and Medina-Franco, 2025), macrocycles (Viarengo-Baker et al., 2021; Kim et al., 2025), and PPIs (Zhang et al., 2014; Choi et al., 2021).

Beyond beneficial regions, BioReCS also encompasses gray-to-dark areas: zones that include compounds with undesirable biological effects, such as toxic chemicals (Tihányi et al., 2025; Annex on Chemicals, 2025). Understandably, these regions have received less attention than areas linked to therapeutic or beneficial activity.
Nonetheless, distinguishing the characteristics that separate harmful compounds from beneficial ones is vital for the design of safer, human-beneficial, and ecologically responsible molecules.

In this section, we highlight common challenges associated with exploring BioReCS, along with possible workarounds and emerging directions. While not exhaustive, these topics are meant to illustrate recurring issues and encourage a holistic consideration of BioReCS.

The structural diversity across underexplored regions of BioReCS makes it challenging to define a consistent chemical space using molecular descriptors. Traditional descriptors, tailored to specific ChemSpas such as small molecules, peptides, or metallodrugs, lack universality. However, there are ongoing efforts to develop structure-inclusive, general-purpose descriptors. Notable examples include molecular quantum numbers (Nguyen et al., 2009) and the MAP4 fingerprint (Capecchi et al., 2020), which is designed to accommodate entities ranging from small molecules to biomolecules and even metabolomic data. More recently, neural network embeddings derived from chemical language models have shown promise in encoding chemically meaningful representations that can reconstruct molecular structures or predict properties (Lžičař and Gamouh, 2024). However, there is still a pressing need to develop systematic molecular fingerprints for the study of biomaterials and inorganic molecules.

Many bioactive compounds, especially drugs, are weak bases, acids, or ampholytes that can ionize depending on the pH of their environment. Pioneering studies reported that 62.9% of compounds in the World Drug Index (n = 582) are ionizable, with the majority being bases, fewer acids, and some ampholytes (Manallack, 2007); however, chemogenomic analyses of contemporary drugs (n = 3766) have shown that this percentage can reach 80% (Manallack et al., 2013).
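To make the pH dependence concrete, the sketch below applies the Henderson-Hasselbalch relationship to a monoprotic acid or base and derives the distribution coefficient (logD) from the partition coefficient of the neutral species (logP). It assumes a single ionizable group and that only the neutral species partitions into the organic phase, which is a textbook approximation rather than the method of any tool cited here. The function names and the example pKa and logP values are hypothetical.

```python
import math


def fraction_ionized(pka, ph, kind):
    """Fraction of a monoprotic compound in its charged form.

    kind="acid": HA <-> A(-) + H(+), pka of the acid.
    kind="base": BH(+) <-> B + H(+), pka of the conjugate acid.
    """
    if kind == "acid":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


def log_d(log_p, pka, ph, kind):
    """logD under the assumption that only the neutral form partitions:
    logD = logP + log10(neutral fraction)."""
    neutral = 1.0 - fraction_ionized(pka, ph, kind)
    return log_p + math.log10(neutral)


if __name__ == "__main__":
    # Hypothetical carboxylic acid: pKa 4.2, logP 1.9 (illustrative values).
    for ph in (2.0, 5.5, 7.4):
        ionized = fraction_ionized(4.2, ph, "acid")
        print(f"pH {ph}: {ionized:.1%} ionized, "
              f"logD = {log_d(1.9, 4.2, ph, 'acid'):.2f}")
```

Under these assumptions, a descriptor computed only for the neutral form can overstate lipophilicity by several log units at physiological pH. Multiprotic compounds and ion partitioning would need a fuller treatment.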
The ionization state (charged or neutral) of a bioactive compound profoundly impacts its solubility, permeability, absorption, distribution, toxicity, and binding, making this distinction essential in drug development and computational modeling. However, CS analyses typically assume molecular structures with neutral charge, which may not reflect the actual bioactive species of compounds under physiological or environmental conditions. Even when the structural representation of an ionizable compound is accurate, chemoinformatics tools often calculate molecular descriptors such as lipophilicity (logP) based solely on the neutral species, overlooking the dominant ionic forms. Computing lipophilicity using logD at physiological pH is much more relevant than using logP, both for small molecules (Bhal et al., 2007; Zamora et al., 2017) and for amino acid residues, from standard (Zamora et al., 2019) to non-standard ones (Viayna et al., 2024). These limitations underscore the need for chemoinformatics tools capable of calculating molecular properties contingent on the ionization state of bioactive compounds as a function of environmental pH in CS research (Bertsch et al., 2023; Bertsch-Aguilar et al., 2024). Neglecting the pH-dependent behavior of bioactive compounds could limit the biological relevance of BioReCS analyses. Future efforts should therefore incorporate protonation-state dynamics to make pH-dependent CS analyses more representative.

In drug discovery and beyond, there is growing interest in creating on-demand, synthetically accessible virtual libraries for high-throughput screening (Perebyinis and Rognan, 2022; Grygorenko et al., 2020; Chávez-Hernández et al., 2023).
Advances in generative models have accelerated the enumeration of large and ultra-large chemical libraries, expanding the known chemical space and enabling the design of extensive libraries guided by structure or property constraints (Ye, 2024). However, evaluating the usefulness of such libraries requires more than sheer size; chemical diversity, as assessed through fingerprints, scaffolds, and physicochemical descriptors, is equally critical. Notably, a recent historical analysis of ChEMBL, PubChem, and DrugBank revealed that newer libraries are not necessarily more diverse (Lopez Perez et al., 2025). A similar trend may hold for continuously enumerated ultra-large chemical libraries, highlighting the need to quantify their chemical diversity using multiple structural representations. For BioReCS, we must consider not only the scale and diversity of expansion but also its direction: whether new molecules occupy unexplored regions or merely populate existing subspaces. Depending on the application area (e.g., drug discovery), the bioactivity profile should also be considered to avoid populating regions of BioReCS with promiscuous compounds associated with undesirable clinical effects.

As the concept and application of chemical space evolve, so too must the computational tools used to explore it (Reymond, 2025). Novel or less conventional regions of drug-like space, such as bRo5 compounds discussed in Section 2.2, demand innovative methodologies or adaptations of existing ones. For instance, a recently developed hybrid fingerprint was designed specifically to accommodate metal-containing molecules, extending traditional organic-focused fingerprints by incorporating metal-specific features (López López and Medina-Franco, 2025). Looking ahead, we anticipate increasing use of hybrid computational workflows, which combine descriptor-based, rule-based, and AI-driven methods (Medina-Franco et al., 2024).
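The distinction drawn above between the size, diversity, and direction of a library expansion can be sketched with set-based fingerprints. In the minimal Python example below, diversity is taken as the mean pairwise Tanimoto distance within a library, and direction is approximated by each new compound's distance to its nearest neighbor in the existing library. Both are common conventions, not the specific metrics of the studies cited here. The fingerprints are hypothetical sets of on-bit indices, and the function names are introduced for this example only.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def mean_pairwise_distance(library):
    """Internal diversity: average of (1 - Tanimoto) over all pairs."""
    pairs = list(combinations(library, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


def novelty(new_fp, reference):
    """Distance from a new compound to its nearest reference neighbor.
    Values near 0 mean an existing region is being repopulated;
    values near 1 mean a previously unoccupied region."""
    return min(1.0 - tanimoto(new_fp, ref) for ref in reference)


# Hypothetical fingerprints (on-bit indices), for illustration only.
REFERENCE = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}),
             frozenset({2, 3, 4, 6})]
NEAR_ANALOGS = [frozenset({1, 2, 3, 4, 5}), frozenset({1, 2, 3, 6})]
NEW_REGION = [frozenset({10, 11, 12}), frozenset({20, 21, 22, 23})]

if __name__ == "__main__":
    print("reference diversity:", round(mean_pairwise_distance(REFERENCE), 3))
    for label, batch in (("near analogs", NEAR_ANALOGS),
                         ("new region", NEW_REGION)):
        scores = [round(novelty(fp, REFERENCE), 3) for fp in batch]
        print(label, "novelty:", scores)
```

Both batches add the same number of compounds, but only the second moves into unoccupied space. Repeating the analysis with several fingerprint types would follow the article's call for multiple structural representations.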
New methods for analyzing multiple dimensions and types of information, such as chemical multiverse analysis and the creation of consensus chemical spaces (Medina-Franco et al., 2022; Medina-Franco et al., 2019; López-López and Medina-Franco, 2023), will also enable more efficient use and integration of available data. Finally, machine learning models trained on known regions of BioReCS will play a pivotal role in navigating uncharted subspaces and improving coverage of BioReCS.

In this opinion article, we offered a holistic perspective on the biologically relevant chemical space (BioReCS) as a subset of the broader chemical universe. Effective navigation of BioReCS requires not only cataloging active compounds but also systematically reporting biologically inactive molecules, which help define the limits of relevance. While most of the explored regions focus on human-beneficial activities, such as therapeutic development, agriculture, and food sciences, BioReCS also includes dark regions populated by undesirable or toxic compounds. Recognizing and learning from these contrasts is essential for safer, ecologically responsible, and more targeted molecular design. The exploration of understudied ChemSpas may drive the development or refinement of computational tools, especially in cases where current methods fall short. Broadening the scope of BioReCS analysis, from both a structural and a functional standpoint, could reveal hidden subspaces containing compounds with novel or unexpected biological activities. Importantly, training machine learning models on known BioReCS data will enhance our capacity to identify uncharted regions and optimize exploration strategies. As chemical databases continue to grow, it is important to emphasize that expansion alone does not equate to increased chemical diversity or biological relevance.
Future research should consider not only the scale of these libraries but also their directionality, structural diversity, and applicability to real-world biological contexts.

JLM-F wrote the first draft of the manuscript. All authors edited, revised, and approved the revised article.

Keywords: chemoinformatics, dark chemical matter, de novo design, food chemicals, metallodrugs, natural products, odor chemicals, peptides

Received: 27 Jul 2025; Accepted: 12 Aug 2025.

Copyright: © 2025 López-López, Avellaneda-Tamayo, Zamora and Medina-Franco. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: José L Medina-Franco, National Autonomous University of Mexico, México City, Mexico

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.