ORIGINAL RESEARCH article

Front. Bioinform.

Sec. Protein Bioinformatics

This article is part of the Research Topic: AI in Protein Science

Beyond Tanimoto: A Learned Bioactivity Similarity Index Enhances Ligand Discovery

Provisionally accepted
  • 1 Instituto de Cálculo, Universidad de Buenos Aires Facultad de Ciencias Exactas y Naturales, Buenos Aires, Argentina
  • 2 LUCAI BIO, Dover, United States
  • 3 IQUIBICEN, Universidad de Buenos Aires Facultad de Ciencias Exactas y Naturales, Buenos Aires, Argentina

The final, formatted version of the article will be published soon.

Structural similarity metrics such as the Tanimoto coefficient (TC) miss many functionally related compounds: 60% of similarly bioactive ligand pairs in ChEMBL have TC < 0.30, a blind spot that constrains ligand-based discovery. Our aim is to overcome this blind spot and recover structurally different yet functionally equivalent chemotypes that structure-based similarity fails to detect. Here we introduce the Bioactivity Similarity Index (BSI), a machine learning model that estimates the probability that two molecules bind the same or related protein receptors. Trained on dissimilar pairs under a leave-one-protein-out scheme across Pfam-defined protein groups, BSI outperforms both TC and modern molecular embedding baselines (ChemBERTa and CLAMP, compared by cosine similarity) across protein families. We further develop a cross-family model (BSI-Large) that performs slightly below the group-specific models but generalizes better and can be fine-tuned to protein families with less data, consistently improving over models trained from scratch. In retrospective validation on new ChEMBL v35 data, BSI achieves strong early-retrieval performance, measured by the enrichment factor in the top 2% of the ranking (EF₂%), with group-specific models delivering the best enrichment and BSI-Large remaining competitive. In a realistic virtual-screening-like scenario against the target ADRA2B, the mean rank of the next active given a known active improves from 45.2 with TC to 3.9 with BSI, compared with 54.9 for ChemBERTa and 28.6 for CLAMP. Altogether, BSI complements, rather than replaces, structure-based similarity and embedding-based comparisons, extending hit finding to remote chemotypes that are structurally dissimilar yet functionally equivalent. The code is available at https://github.com/gschottlender/bioactivity-similarity-index.
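
For readers less familiar with the two metrics named in the abstract, the following minimal Python sketch illustrates how a Tanimoto coefficient and a top-fraction enrichment factor are commonly computed. It is not taken from the authors' repository. The function names `tanimoto` and `enrichment_factor` are hypothetical, fingerprints are assumed to be given as sets of on-bit indices, and the enrichment factor follows the usual definition (hit rate in the top-ranked fraction divided by the overall hit rate), which may differ in detail from the one used in the article.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        # Assumed convention for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(fp_a & fp_b) / union


def enrichment_factor(scores: list[float], labels: list[int],
                      fraction: float = 0.02) -> float:
    """Enrichment factor in the top `fraction` of a score-ranked list.

    `labels` holds 1 for actives and 0 for inactives. The default fraction
    of 0.02 corresponds to the top 2% of the ranking.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active is required")
    # Size of the top slice, rounded to the nearest integer and at least 1.
    n_top = max(1, round(fraction * n_total))
    # Rank compounds from highest to lowest similarity score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)
```

As a usage example, `tanimoto({1, 2, 3}, {2, 3, 4})` shares two of four distinct bits and therefore gives 0.5, well above the TC < 0.30 regime the abstract focuses on. Any pairwise score, whether TC, an embedding cosine similarity, or a learned probability, can be passed as `scores` to `enrichment_factor` for a like-for-like comparison.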

Keywords: Bioactivity Similarity Index, BSI, machine learning, molecular embedding baselines, ChemBERTa, CLAMP, virtual screening

Received: 29 Aug 2025; Accepted: 07 Nov 2025.

Copyright: © 2025 Schottlender, Prieto, Marti and Fernández Do Porto. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Marcelo Adrian Marti, marti.marcelo@gmail.com
Dario Fernández Do Porto, dariofd@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.