ORIGINAL RESEARCH article
Front. Bioinform.
Sec. Protein Bioinformatics
This article is part of the Research Topic: AI in Protein Science
Beyond Tanimoto: A Learned Bioactivity Similarity Index Enhances Ligand Discovery
Provisionally accepted
- 1 Instituto de Cálculo, Universidad de Buenos Aires Facultad de Ciencias Exactas y Naturales, Buenos Aires, Argentina
- 2 LUCAI BIO, Dover, United States
- 3 IQUIBICEN, Universidad de Buenos Aires Facultad de Ciencias Exactas y Naturales, Buenos Aires, Argentina
Structural similarity metrics such as the Tanimoto Coefficient (TC) miss many functionally related compounds: 60% of similarly bioactive ligand pairs in ChEMBL have TC < 0.30, a major blind spot that constrains ligand-based discovery. Our aim is to overcome this blind spot and recover structurally different yet functionally equivalent chemotypes that structure-based similarity fails to detect. Here we introduce the Bioactivity Similarity Index (BSI), a machine learning model that estimates the probability that two molecules bind the same or related protein receptors. Trained under a leave-one-protein-out scheme across Pfam-defined protein groups on dissimilar pairs, BSI outperforms both TC and modern molecular embedding baselines (ChemBERTa and CLAMP, compared by cosine similarity) across protein families. We further develop a cross-family model (BSI-Large) that performs slightly below the group-specific models but generalizes better and can be fine-tuned to protein families with less data, consistently improving over models trained from scratch. In retrospective validation on new ChEMBL v35 data, BSI achieves strong early-retrieval performance as measured by the enrichment factor in the top 2% of the ranking (EF₂%), with group-specific models delivering the best enrichment and BSI-Large remaining competitive. In a realistic virtual-screening-like scenario against the target ADRA2B, the mean rank of the next active given a known active improves from 45.2 (TC) to 3.9 (BSI), compared with 54.9 for ChemBERTa and 28.6 for CLAMP. Altogether, BSI complements, rather than replaces, structure-based similarity and embedding-based comparisons, extending hit finding to remote chemotypes that are structurally dissimilar yet functionally equivalent. The code is available at https://github.com/gschottlender/bioactivity-similarity-index.
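For readers less familiar with the two quantities the abstract relies on, the following minimal Python sketch illustrates the standard textbook definitions of the Tanimoto coefficient and of an enrichment factor at a top fraction of a ranked list. It is not taken from the authors' repository. The function names, the representation of fingerprints as sets of on-bit indices, the handling of empty fingerprints, and the rounding used to size the top slice are all assumptions made for illustration, and the exact fingerprint and EF conventions used in the article may differ.

```python
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices.

    Hypothetical helper: returns 0.0 when both fingerprints are empty,
    which is a convention chosen here, not one stated in the article.
    """
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def enrichment_factor(
    scores: Sequence[float], labels: Sequence[int], fraction: float = 0.02
) -> float:
    """Enrichment factor at the top `fraction` of a score-ranked list.

    EF = (active rate in the top slice) / (active rate in the whole list).
    `labels` holds 1 for actives and 0 for inactives. The slice size is
    round(fraction * n), at least 1; this rounding rule is an assumption.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, round(fraction * n))
    # Rank compounds from highest to lowest similarity score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (total_actives / n)
```

For example, two fingerprints sharing two of four distinct on-bits have a TC of 0.5, and a ranking of 100 compounds with 5 actives that places actives in both of its top 2 positions has an EF₂% of 20, the maximum possible for that active rate.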
Keywords: Bioactivity Similarity Index, BSI, machine learning, molecular embedding baselines, ChemBERTa, CLAMP, virtual screening
Received: 29 Aug 2025; Accepted: 07 Nov 2025.
Copyright: © 2025 Schottlender, Prieto, Marti and Fernández Do Porto. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Marcelo Adrian Marti, marti.marcelo@gmail.com
Dario Fernández Do Porto, dariofd@gmail.com
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.