ORIGINAL RESEARCH article
Front. Mol. Biosci.
Sec. Structural Biology
Reformulation of the Protein Databank for Real-time Search of Geometrical Attributes of Protein Structures
Provisionally accepted
University of South Carolina Molinaroli College of Engineering and Computing, Columbia, United States
In this study, we introduce the design and implementation of PDBMine, a large-scale, queryable platform for mining sequence-structure statistics from the Protein Data Bank (PDB). PDBMine enables rapid analysis of local conformational trends across proteins by extracting dihedral angles and sequence patterns at scale. In addition to the design and implementation of PDBMine, we also present results validating its ability to return structurally meaningful information. We first assess the accuracy of its dihedral angle distributions by comparing them to established Ramachandran space and verifying expected behaviors of residues such as glycine and proline. We then use PDBMine to analyze the statistical properties of amino acid subsequences of length k = 1 to 5. Our findings reveal that longer k-mers exhibit significant departures from statistical independence, suggesting context-dependent constraints on amino acid co-occurrence. We also show that increasing local sequence context restricts dihedral angle variability, with longer k-mers producing distributions that more closely align with experimentally observed backbone geometries. Finally, we present a high-dimensional clustering method for grouping full-sequence dihedral conformations, enabling identification of dominant local structural motifs. These results highlight PDBMine's potential as a versatile tool for structure validation, statistical modeling, and probing the principles that govern sequence-structure compatibility in proteins.
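The departure from statistical independence reported for longer k-mers can be illustrated with a small sketch. It is not PDBMine's implementation or its query interface: the function names `kmer_counts` and `log_odds` are hypothetical, and the input is assumed to be a plain list of amino acid sequences. The sketch compares each k-mer's observed frequency with the frequency expected if residues occurred independently, that is, the product of the single-residue frequencies.

```python
from collections import Counter
from math import log2, prod


def kmer_counts(sequences, k):
    """Count every overlapping subsequence of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def log_odds(sequences, k):
    """Return log2(observed / expected) for each k-mer that occurs.

    'expected' assumes independent residues: the product of the
    single-residue frequencies of the k-mer's letters. A value near 0 is
    consistent with independence; large positive or negative values
    indicate over- or under-representation.
    """
    residues = kmer_counts(sequences, 1)
    total_residues = sum(residues.values())
    freq = {aa: n / total_residues for aa, n in residues.items()}

    kmers = kmer_counts(sequences, k)
    total_kmers = sum(kmers.values())
    return {
        kmer: log2((n / total_kmers) / prod(freq[aa] for aa in kmer))
        for kmer, n in kmers.items()
    }


if __name__ == "__main__":
    # Toy sequences for illustration only, not data from the PDB.
    scores = log_odds(["ABAB", "ABBA"], k=2)
    for kmer, score in sorted(scores.items()):
        print(kmer, round(score, 3))
```

On real data, a formal test (for example a chi-squared test on the k-mer counts) would be needed before calling a departure significant; the log-odds score only shows the direction and size of the deviation.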
Keywords: dihedral angles, Protein Data Bank, protein structure validation, Ramachandran plot, sequence-structure relationship, structural bioinformatics
Received: 28 Aug 2025; Accepted: 04 Dec 2025.
Copyright: © 2025 Azeem, Lee, Hein, Ott and Valafar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Homayoun Valafar
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
