ProkBERT family: genomic language models for microbiome applications

Background In the evolving landscape of microbiology and microbiome analysis, the integration of machine learning is crucial for understanding complex microbial interactions, and predicting and recognizing novel functionalities within extensive datasets. However, the effectiveness of these methods in microbiology faces challenges due to the complex and heterogeneous nature of microbial data, further complicated by low signal-to-noise ratios, context-dependency, and a significant shortage of appropriately labeled datasets. This study introduces the ProkBERT model family, a collection of large language models, designed for genomic tasks. It provides a generalizable sequence representation for nucleotide sequences, learned from unlabeled genome data. This approach helps overcome the above-mentioned limitations in the field, thereby improving our understanding of microbial ecosystems and their impact on health and disease. Methods ProkBERT models are based on transfer learning and self-supervised methodologies, enabling them to use the abundant yet complex microbial data effectively. The introduction of the novel Local Context-Aware (LCA) tokenization technique marks a significant advancement, allowing ProkBERT to overcome the contextual limitations of traditional transformer models. This methodology not only retains rich local context but also demonstrates remarkable adaptability across various bioinformatics tasks. Results In practical applications such as promoter prediction and phage identification, the ProkBERT models show superior performance. For promoter prediction tasks, the top-performing model achieved a Matthews Correlation Coefficient (MCC) of 0.74 for E. coli and 0.62 in mixed-species contexts. In phage identification, ProkBERT models consistently outperformed established tools like VirSorter2 and DeepVirFinder, achieving an MCC of 0.85. These results underscore the models' exceptional accuracy and generalizability in both supervised and unsupervised tasks. Conclusions The ProkBERT model family is a compact yet powerful tool in the field of microbiology and bioinformatics. Its capacity for rapid, accurate analyses and its adaptability across a spectrum of tasks marks a significant advancement in machine learning applications in microbiology. The models are available on GitHub (https://github.com/nbrg-ppcu/prokbert) and HuggingFace (https://huggingface.co/nerualbioinfo) providing an accessible tool for the community.


Introduction
Numerous tasks in bioinformatics involve classifying or labeling sequence data such as predicting genes (Lukashin and Borodovsky, 1998;Delcher et al., 1999;Sommer and Salzberg, 2021), annotating sequence features (Aziz et al., 2008;Seemann, 2014;Tatusova et al., 2016;Meyer et al., 2019), etc.A significant challenge in this field is deriving efficient vector representations from these sequences (Zhang et al., 2023).Classification tasks related to sequences-like classifying assembled contigs into MAGs (metagenome-assembled-genomes) or analyzing AMR-associated genes-are often addressed by initially categorizing the data into bins or using simple composition-based representations, such as k-mer frequency distributions.A common method involves converting sequences into a basic presence-absence vector, indicating whether a particular genome contains specific sequence features like mutations, motifs, or other patterns.However, a drawback of this method is that proximity in this representation space doesn't always imply semantic similarity.Another prevalent representation uses hidden Markov models (Durbin et al., 1998), where the model parameters encapsulate the essential properties of the sequences.Yet, integrating such models with machine learning algorithms like support vector machines or random forests can be complex.Despite this, hidden Markov models have demonstrated their effectiveness in classification tasks and provide highest quality annotations (Zdobnov and Apweiler, 2001;Cantalapiedra et al., 2021).
Neural network-based representations have distinct advantages, primarily their compatibility with a wide range of machine-learning tools, including autoML and statistical frameworks.Past research has highlighted the effectiveness of neural network representations for sequences, with a variety of classification tasks addressed using networks such as CNNs and RNNs (Min et al., 2017).These networks have been employed in areas like motif discovery, gene-expression prediction (Kelley et al., 2018) splicing site recognition (Ji et al., 2021), and promoter identification, as detailed in several comprehensive reviews (Min et al., 2017;Sapoval et al., 2022;Zhang et al., 2023).However, convolutional neural networks face challenges, like the need for extensive labeled sequence data.They are also task-specific, limiting their applicability to other scenarios outside their training focus.A significant bottleneck in integrating neural networks into bioinformatics has been the scarcity of adequate labeled data.Recent advancements in machine learning, inspired by breakthroughs in natural language processing, image analysis (Han et al., 2022), and protein structure prediction (Alipanahi et al., 2015;Jumper et al., 2021), have introduced new paradigms.Transformer-based architectures, especially large language models (Devlin et al., 2019;Brown et al., 2020a;Raffel et al., 2020), offer versatile representations-often termed "reusable" or "fundamental models."Among the recent training approaches is the fine-tuning paradigm, which divides the training process into two phases: pretraining and fine-tuning.Pretraining demands vast amounts of self-labeled data, while fine-tuning can, in some instances, operate with minimal, or even no examples.
In bioinformatics, there exists a paradoxical challenge.On one hand, there's an abundance of sequence data available, especially in public repositories like the SRA (sequence read arhive).The volume of this data is expanding exponentially, and as sequencing and other data-producing technologies become more affordable, this growth trend is likely to persist.These data repositories are akin to hidden treasures.Yet, they remain under-analyzed and underprocessed.Researchers often focus primarily on specific mutations, neglecting other valuable aspects of the data.Conversely, while there's an abundance of raw sequence data, there's a scarcity of labeled data.The accompanying metadata is frequently limited, and given the high cost of experiments, only a handful of samples, typically ranging from 3-15, are available within a specific group or stratum.It's also worth noting that labeling criteria can differ significantly across projects.
Recognizing these challenges, there is a compelling need for innovative methods that can harness the vast repositories of raw sequence data and navigate the complexity of labeling inconsistencies.It is in this context that our research contributes a novel solution.The development and application of our genomic language model family aims to address the mentioned issues, providing a robust, adaptable, and efficient tool for sequence classification.
While the concept of pretrained models isn't new, several have emerged recently, such as DNABERT (Ji et al., 2021;Zhou et al., 2023), Nucleotide Transformer (Dalla-Torre et al., 2023), and LookingGlass (Hoarfrost et al., 2022).However, a common limitation among these methods is their primary focus on human sequences or their restricted context size.
In the pretraining phase, the objective is to derive a general representation that captures the semantic relationships between objects, which in this context means obtaining a nuanced representation of sequence data.Typically, achieving this requires billions of samples, yet the volume of available sequence data far surpasses this number.We trained our genomic language models on an extensive corpus of available sequence data, encompassing bacteria, archaea, viruses, and fungi.Subsequently, we fine-tuned our models to tackle specific classification tasks, including the recognition of promoters and phages.
The ProkBERT family encompasses a series of models tailored to meet the intricate demands of microbial sequence classification, analysis, and visualization.The versatility of the ProkBERT models is manifested through their diverse applications: 1. Zero-shot learning: This approach allows for clustering of sequences by leveraging the embeddings directly produced by the model, eliminating the necessity for explicit fine-tuning.2. Sequence classification: ProkBERT models can be seamlessly fine-tuned, whether for token-specific or comprehensive sequence-based classification tasks.
With these capabilities, the ProkBERT family aims to bridge the current gaps in the field, offering a robust toolset for diverse bioinformatics challenges.

Materials and methods
In this study, we used the transfer-learning paradigm for sequence classification based on transformer-based architectures.
The first phase involves pretraining on a large amount of sequence data, allowing the model to learn general sequence patterns.Once this foundation is established, we move to the fine-tuning phase where the model is adapted to specific tasks or datasets.
The following sections provide a step-by-step description of our methods, from preparing raw sequence data to the specifics of both pretraining and fine-tuning.Figure 1 illustrates the training process.
In the development of the ProkBERT family, the initial step involves pretraining the model on a vast corpus of data.During this pretraining phase, the model aims to tackle the Masked Language Modeling task.In this task, specific portions of the sequence are masked, and the model's objective is to predict these masked sections, optimizing the likelihood of the missing parts using crossentropy as the loss function.The model typically receives input in the form of a vectorized representation of the sequence.A notable constraint of standard transformers is their limited input size.Though various solutions have been suggested to address this limitation, the maximum token size is typically restricted up to 4kb, significantly smaller than the average bacterial genome, but much larger than an average gene.
Fine-tuning nucleotide sequences is a technique used to adapt pre-trained models to specialized tasks or specific datasets.The first step involves segmenting raw sequences into chunks, usually ranging from 0.1-1kb in size, to optimize the model's learning capability (Pan and Yang, 2009).Using weights from a pre-trained model, the system benefits from the knowledge obtained from comprehensive training on extensive datasets (Vaswani et al., 2017;Devlin et al., 2019).This initialization helps in quicker convergence and improved performance.After this initialization, the model undergoes training on the desired dataset, adjusting to its specific patterns and details.The outcome of this procedure allows the model to produce labeled sequences or tokens, which can be used for various annotation or prediction purposes (Brown et al., 2020b).

. Sequence data . . Sequence segmentation and tokenization
The first step is processing the sequence data.While there are many parallels between sequence data processing and natural language processing, drawing direct analogies can be challenging.For instance, determining what constitutes a 'sentence' in the realm of nucleotide and protein sequences doesn't have a direct counterpart in natural language.Additionally, the input size for neural networks is inherently limited.Figure 2 illustrates the strategy employed to vectorize the sequences.
Initially, the input sequence is segmented into smaller chunks.We employed two approaches for this: 1. Contiguous sampling, where contigs are divided into multiple non-overlapping segments; and 2. Random sampling, which involves fragmenting the input sequence into various segments at random.
Following segmentation, the next phase is encoding the sequence into a simpler vector format.The primary question revolves around defining the fundamental building block for a token.Various solutions have been suggested, the most widely strategy is applying one-hot-encoding (Sapoval et al., 2022), but DNA-BERT (Ji et al., 2021) applies the maximal overlapping kmer strategy, meanwhile others relies on nucleotide level mapping (Dalla-Torre et al., 2023).
This phase is termed tokenization.We introduce a method termed Local Context-Aware tokenization (LCA), where individual elements consist of overlapping k-mers.Two principal parameters dominate this approach: k-mer size and shift.For k = 1, the tokenization resorts to a basic character-based approach, with a typical example illustrated in Figure 2. Employing overlapping kmers can lead to enhanced classification performance.A greater shift value allows the model to use a broader context while reducing computational demands, while having the information-rich local context as well.
The k-mers are then mapped into numerical ids, which will be the input for ProkBERT.As another example with k = 6 and shift=2, the tokenized segments will be the following: {AAGTCC, GTCCAG, CCAGGA, ..., AAGATT}.If the sequence length is odd, then the last charcter won't be used.One of the main advantages of the approach is that with the same number of tokens it is possible to cover a larger context, therefore it is possible to considerably reduce the computational and memory requirements, which is the typical bottleneck of the transformer architecture.
In this study, we propose models with a k-mer size of 6 (termed ProkBERT-mini), k-mer size of 1 (dubbed ProkBERT-mini-c), and a variant supporting a larger context window, named ProkBERT-mini-long, which relies on a kmer size of 6 with a shift = 2.

. . Training data
The dataset was retrieved from the NCBI RefSeq database (O'Leary et al., 2016;Li et al., 2021) on January 6th, 2023.It included reference or representative genomes from bacteria, viruses, archaea, and fungi.After filtering, the sequence database consisted of 976,878 unique contigs derived from 17,178 assemblies.These assemblies represent 3,882 distinct genera, amounting to approximately 0.18 petabase pairs.The segment databases was created by sampling fixed lengths of [256,512,1024] or, in other instances, variable lengths aiming for an approximate coverage of 1.

. Pretraining and learning sequence representations . . Transformer model selection and parameters
In our study, we employed the MegatronBert model (Shoeybi et al., 2019), a variant of the BERT architecture (Devlin et al., 2019), optimized for large-scale training.The architecture overview is presented in Supplementary Figure S1.The key attributes of our models can be seen in Table 1.The mini and mini-long models share a common vocabulary of 4,101 k-mers.In contrast, the mini-c model is distinct, using a smaller set comprising only 9 items, including special tokens (i.e., [CLS], [SEP]) and nucleotides (A, C, T, G).All models employ a learnable relative key-value positional embedding, which maps input vectors into a 384-dimensional space.The mini and mini-long models support maximum sequence lengths of 1024 bp and 2048 bp, respectively.Across all models, the intermediate layers of the encoder use the GELU activation function, expanding the input dimensions to 3,072 before compressing them back to 384 dimensions.The Masked Language Modeling (MLM) head, a standard component in each model, decodes from 384 to 4,101 dimensions, adapted to the varying vocabulary sizes.To ensure efficient parallel computations, we encapsulated the entire architecture within a DataParallel wrapper, thus optimizing GPU utilization.For implementation, all models were developed using the PyTorch version 2.01 framework and the Hugging Face library version 4.33.2.While Masked Language Modeling (MLM) acts as the primary pre-training objective for BERT models (Bidirectional Encoder Representations from Transformers) as established by Devlin et al. (2019), our implementation has slight variations.In the traditional BERT approach, a certain percentage of input tokens are randomly masked, and the model predicts these based on their context.Typically, about 15% of tokens undergo masking.However, due to our usage of overlapping k-mers, masking becomes more intricate.If a k-mer of size k = 6 is masked, we need to ensure at least six tokens are also masked to prevent trivial restoration from context and locality.
For an input sequence of tokens x and a binary mask vector m-where 1 indicates a masked token and 0 indicates an unmasked token-the model outputs predicted vectors y.As for the noise application on masked tokens, probabilities p 1 , p 2 , and p 3 define different noise strategies.In our model, when a token is masked, it is substituted with the special [MASK] token with a probability of p 1 .Alternatively, with a probability p 2 , it can be replaced with a random k-mer from our vocabulary.Lastly, there's a p 3 chance that the masked k-mer will remain as it is.Conventionally, these probabilities are set at 0.8, 0.1, and 0.1, respectively.
The MLM objective aims to minimize the negative log likelihood over all masked positions, as described by the equation: Where y i [l i ] denotes the predicted probability of the true label l i for the masked position i.This objective, coupled with the noise injection strategy, ensures that the model learns bidirectional representations, thus becomes capable of understanding and generating contextually relevant tokens.
When dealing with overlapping k-mers, simple token masking becomes insufficient.If a single k-mer token is masked, all overlapping k-mers related to that token must also be masked.This is crucial because when a k-mer is not masked and subsequently restored, it might inadvertently provide contextual information about its neighbors.Such a situation would enable the trivial restoration of adjacent masked k-mers.In essence, one unmasked k-mer could potentially "leak" enough information to unmask its neighboring tokens.For examples, as presented in Figure 2C (Overlapping tokeniization), if only the second token "AGTCCA" is masked, it can be fully restored from its neighboring tokens: "AAGTCC" and "GTCCAG." This overlapping nature of k-mers posed unique challenges.As a result, we had to dynamically adjust the MLM parameters and the lengths of sequence segments during the pretraining phase.Additionally, when multiple contiguous k-mers were masked together, the probability associated with the MLM had to be recalibrated.This was necessary to ensure that the actual proportion of the sequence being masked was consistent with our intended masking ratio.

. . . Training phases and configuration
Initially, we employed parameters that allowed complete sequence restoration (k-mer of k = 6) by masking only five continuous tokens (with p 1 = 0.9) and manipulating 15% of the tokens.Once a loss threshold of 1 was attained, the MLM parameters were adjusted to heighten the masking complexity.We implemented various masking lengths, such as 2 nucleotides for kmer of k = 6 and 2 characters for k = 1.Training data in the first phase had a fixed length of 128nt segments.The succeeding phase used variable-length datasets: with a probability of 0.5 a full-length segments, and with a probability of 0.5 a segment between 30-512 bp was selected into the the batch.The termination criterion for training was no further improvement or performance decrease, in both the MLM and promoter tasks.Models underwent training for roughly one batch each.We opted for batch sizes that spanned around 0.5-2 million bp sequences.Computations were executed on HPC-VEGA and Komondor platforms with Nvidia-A100 GPUs, leveraging slurm, pytorch distributed, and multiple GPU nodes.

. . Analysis of encoder outputs
In deep learning, an encoder typically processes input data (such as a sequence of tokens) and produces a dense vector representation for each token.These dense vectors, often referred to as embeddings or encoded vectors, capture the semantic information of the input tokens.

FIGURE
LCA Tokenization and corrupted sequence restoration.The figure illustrates how the corruption of a character at sequence level a ects the initial vector representations of the seqment with respect to the di erent tokenization methods.The th nucleotide is unknown or masked, and those k-mers that overlap with that position are masked.As a result, when k = and s = not only the th character is hidden, but the th as well.
the encoder produces a sequence of vectors: where e i represents the embedded vector for the token s i .In case of multiple inputs or batches, if we have a batch of size B with each sequence containing T tokens, the encoder's output would be a 3D tensor of shape (B, T, D) where D is the dimensionality of the embeddings.
Once we have the encoded vectors, there are several ways to aggregate or pool them to get a single representation for the entire sequence as shown in Supplementary Figure S1.Here are some common pooling methods: For batches, these pooling operations are applied independently for each input sequence in the batch.The provided NCBI annotations were preprocessed and extended.Intergenic regions were defined as non-annotated genomic features with respect to the strand.We retained the CDS, intergenic, pseudogenes, ncRNA features, while the rare or infrequently used features (such as riboswitch, binding_site, tmRNA, etc.) were excluded from the analysis.This was followed by sampling segments of various lengths from each genomic region.We sampled a maximum of 2000 sequence features from each contig, considering the strand, to evaluate strand-specific biases as well.
Then, we randomly corrupted a segment 10,000 times, i.e., a character was replaced with "*" and tokens containing "*" were mapped to the [MASK] token as illustrated on Figure 3.

. Application I: bacterial promoter prediction
The first task our models were evaluated on involved distinguishing between promoter and non-promoter sequences in bacteria.A sequence is labeled "1" if identified as a promoter and "0" otherwise.The next section gives an overview of the dataset structure and details about its constructions.

. . Dataset overview
The known promoters, referred to as positive samples, are primarily drawn from the Prokaryotic Promoter Database (PPD, Su et al., 2021), which contains experimentally validated promoter sequences from 75 organisms.Figure 4 illustrates the composition and source of our dataset, segregating prokaryotic promoters from non-promoters and including an independent test set based on E.coli sigma70 promoters.

. . . Data partitioning and utilization
To ensure comprehensive evaluation, the dataset was split into three parts, divided randomly into training, validation, and testing datasets.The dataset comprises known promoter sequences from organisms, retrieved from the Prokaryotic Promoter Database (PPD), alongside non-promoter sequences obtained from the NCBI RefSeq database (specifically sampled from CDS regions).It also includes non-promoter sequences constructed via higher and zero-order Markov chains that mirror compositional characteristics of known promoters.Additionally, an independent test set, focusing on E. coli sigma promoters, was employed, curated by Cassiano and Silva-Rocha ( ).A balanced distribution approach was adopted to even out the number of positive and negative samples, with the dataset being systematically divided into training, validation, and test subsets.This stratification underpins a thorough evaluation of the model e cacy.
2. Validation set: Comprises 10% of the data, aiding in fine-tuning model parameters and preventing overfitting.3. Test set: Forms the remaining 10% of the data, crucial for unbiased model performance evaluation. .

. . Dataset construction for multispecies train, test and validation sets
The prokaryotic promoter sequences are typically 81 bp long, ensuring compatibility with most tools' input prerequisites, particularly around the putative TSS region interval [−60, +20].Our positive dataset encompasses promoter sequences from various species, predominantly found on both chromosomes and plasmids.Promoters included in the independent test set, based on exact match, were excluded from the training data.Species and contigs were mapped to NCBI assembly and sequence accessions.To curate comprehensive non-promoter sequences (negative samples), we employed three strategies: 1. Using non-promoter sequences (CDS-Coding Sequences).2. Random sequences generated with a 3rd-order Markov chain.3. Pure random sequences (0-order Markov chain) as proposed by Cassiano and Silva-Rocha (2020).
The distribution of this composite dataset was 40% CDS, 40% Markov-derived random sequences, and 20% pure random sequences (0-order Markov chain).One practical application of promoter detection in coding sequences is to check whether an unintentional promoter is injected or can be located inside a modified or designed coding sequence region, causing disruption.To cover this use-case, we incorporated the coding regions into our training and evaluation dataset.The CDS sequences were extracted from the genomic sequences of contigs, based on annotations from NCBI.The 81 bp long CDS region samples were selected based on the NCBI-provided annotations for the available contigs with respect to the underlying species.The promoter regions often contain AT-rich sequences, i.e., TATA box.To capture and model the AT-rich regions, we applied 3rd and 0 order Markov chains to generate sequence examples that reflect the compositional property of known promoters.
A 3rd-order Markov chain predicts the next nucleotide in a sequence based on the states of the previous three nucleotides.Formally, the probability of observing a nucleotide x i given the nucleotides at positions x i−3 , x i−2 , and x i−1 is: For DNA sequences, this yields 4 4 = 256 possible nucleotide combinations.Such higher-order modeling can more effectively capture intricate sequence patterns and dependencies than lowerorder models (Durbin et al., 1998).However, estimating transition probabilities requires extensive data due to the increased number of states (Koski and Noble, 2001).We determined these probabilities using promoter sequences, to which we added the reverse complement of each promoter.Subsequently, random promoter sequences were generated using these models.
We have a second, independent test for assessing model performance and referred to Cassiano and Silva-Rocha (2020)'s dataset comprising E. coli sigma70 sequences.The positive, well-recognized samples came from Regulon DB (Santos-Zavaleta et al., 2019).Cassiano and Silva-Rocha (2020) evaluated various tools using an experimentally validated E. coli K-12 promoter set dependent on sigma70, sourced from Regulon DB 10.5 (Santos-Zavaleta et al., 2019).Given the extensive documentation of sigma70-dependent promoters in bacteria, only these were considered.They used a positive dataset of 865 high-evidence sequences from Regulon DB and a negative set of 1,000 sequences mimicking the nucleotide distribution of the natural sequences.We ensured no overlap existed within the promoter datasets.
The promoter dataset is available as a Zenodo and Hugging Face dataset.

. . Training for promoter prediction
We employed a fine-tuning paradigm to evaluate our model.Our proposed binary classification model extends the Megatron BERT architecture (Shoeybi et al., 2019), tailored specifically for binary classification tasks.Let X represent the sequence of input embeddings, with f BERT (X) denoting the transformation by Megatron BERT.Given an input sequence of length T, this model transforms X into a sequence output S with dimensions T × hidden_size, where S = f BERT (X).Unlike the conventional BERT model, which classifies sequences based on the special [CLS] token representing the "sentence, " our approach emphasizes integrating representations of all tokens using a weighting scheme as shown in Supplementary Figure S1.
To obtain a fixed-size representation from the variable-length sequence S, we devised a weighting mechanism.The sequence S undergoes a transformation through a linear layer to yield a sequence of weights W: Here, W 1 is a matrix sized hidden_size × 1 and b 1 is a bias term.The softmax operation ensures W forms a valid probability distribution over sequence positions.The model then computes a weighted sum of the sequence representations: Where w i and s i represent the weight and the sequence representation at the i th position, respectively.Subsequently, P is processed by a dropout layer with a probability of hidden_dropout_prob to produce P ′ .This results in the final classification logits L.

. Application II: phage sequence analysis
Bacteriophages have a significant role in the microbiome, influencing host dynamics and serving as essential agents for horizontal gene transfer (De la Cruz and Davies, 2000).Through this mechanism, they aid in the transfer of antibiotic resistance and virulence genes, promoting evolutionary processes.Understanding the diversity of phages is crucial for tackling challenges like climate change and diseases (Jansson and Wu, 2023).These phages exhibit distinct patterns in both healthy and diseased microbiomes (Yang et al., 2023).The correlation between the human virome and various health conditions, such as cancer, inflammatory bowel diseases, and diabetes, has been documented (Zhao et al., 2017;Han et al., 2018;Nakatsu et al., 2018;Fernandes et al., 2019;Liang et al., 2020;Zuo et al., 2022).However, deeper research is needed to discern causality and their impact on microbial and host biological processes.
Despite the abundance of phages (Bai et al., 2022a), accurately quantifying and characterizing them remains a challenge.One primary limitation is the restricted number of viral sequences in databases like NCBI RefSeq.Additionally, the categorization of viral taxonomy is still a topic of discussion (Walker et al., 2022).Though there have been recent efforts to expand databases (Zhang et al., 2022;Camargo et al., 2023), the overall understanding of viral diversity is still not complete (Yan et al., 2023).We have assembled a unique phage sequence database using recently published genomic data.
Another challenge is the life cycle of phages; temperate phages might integrate their genomes into bacterial chromosomes and are often annotated as bacterial genomes, leading to potential misidentification.Current databases also show biases toward certain genera (Schackart III et al., 2023), which can skew benchmarking and the evaluation of different methods.To address this, we used a balanced benchmarking approach, ensuring each viral group corresponds to their predicted host genus, minimizing bias.We also compared viral genomes to their respective hosts, a more demanding classification task, such as distinguishing a Salmonella phage from its host genome compared to marine cyanobacteria.For our study, we selected a specific number of phages for testing, ensuring there is no overlap between training and testing sets at the species level.

. . Phage dataset description
To train and assess our prediction models, we assembled a comprehensive phage sequence database from diverse sources.As of 9th July, 2023, we procured viral sequences and annotations from the RefSeq database (O' Leary et al., 2016;Li et al., 2021).By isolating entries labeled "phage, " we obtained 6,075 contigs.Our database was further enriched with the inclusion of Ren et al. (2020), a dataset validated through the TemPhD method (Zhang et al., 2022), adding another 192,326 phage contigs extracted from 148,229 assemblies.
To address sequence redundancy present in both the RefSeq and TemPhD databases, we applied the CD-HIT algorithm (Li and Godzik, 2006;Fu et al., 2012) (using CD-HIT-EST with a default word size of 5).While several clustering thresholds (0.99, 0.95, 0.90) were experimented with and found to produce similar outcomes, we settled on a threshold of 0.99.This process resulted in a refined set of 40,512 distinct phage sequences, with an average length of approximately 43,356 base pairs, culminating in a total of 3.5 billion base pairs.Notably, these sequences target a wide spectrum of 660 bacterial genera.Subsequent to sequence curation, phage sequences were mapped to their respective bacterial hosts to formulate a balanced training dataset, ensuring equitable representation between phages and their hosts.This step is imperative, given the distinct distributions observed between bacterial sequences Frontiers in Microbiology frontiersin.org and their phage counterparts.In numerous instances, due to ambiguities in species-level identification or gaps in taxonomic data, host mapping was executed at broader taxonomic strata, predominantly at the genus level.
In our examination of bacteriophage-host associations at the genus level, several bacterial genera stood out, showcasing pronounced phage interactions.Salmonella, a main cause of food-related sicknesses (Popoff et al., 2004), stands out with an impressive association of 24,182 phages, spanning a cumulative length of over a billion base pairs (1,026,930,954 bp) and an average phage length of 42,467 bp.Following closely, the common gut bacterium, Escherichia (Tenaillon et al., 2012), is linked with 8,820 phages, accumulating a total length of 408,866,394 bp.The genus Klebsiella, notorious for its role in various infections (Paczosa and Mecsas, 2016), associates with 4,904 phages.Genera such as Listeria (Vázquez-Boland et al., 2011), Staphylococcus (Lowy, 1998), and Pseudomonas (Driscoll et al., 2007), each with distinct clinical significance, exhibit rich phage interactions.Notably, Mycobacterium (Cole et al., 1998), consisting of pathogens like the tuberculosis-causing bacterium, shows associations with 2,156 phages.Many of these bacterial genera are benign and even beneficial under normal conditions, they also include species that can cause severe diseases in humans, especially when there's an imbalance in the body's natural flora or when antibiotic resistance develops.Monitoring phage interactions with these bacteria offers potential pathways for therapeutic interventions and a deeper understanding of microbial ecology in human health.
Additionally, balanced databases were created, stratified by the host genus level, to mitigate the effect of underrepresented or overrepresented phages, such as Salmonella.The reversecomplement sequences were included.The final dataset encompasses a total of 660 unique bacterial genera.Undersampling was performed with a threshold of 20,027,298 bp for 25 genera, while the others were upsampled with a maximum coverage of 5x, obtaining random samples of shorter fragments from the contigs.Random segmentation and sampling were carried out as previously described.The bacterial assemblies were randomly selected from the NCBI database, prioritizing higher-quality assemblies.Many of them were not included in the pretraining dataset.Subsequently, we constructed a database with various sequence lengths: 256, 512, 1024, and 2048 bps.The train-test-validation split was executed in a 0.8, 0.1, and 0.1 proportion at the phage sequence level.
For comparison with alternative methods and tools, we had to subsample our test set (N = 10, 000) to conduct the evaluation within a reasonable timeframe.

. . Model training for phage sequence analysis
The task was formulated as binary classification, similarly to the promoters.Phage sequence classification was approached in a manner analogous to the promoter training.Given the extensive size of the dataset, preprocessing was conducted beforehand, segmenting sequences into various lengths: 256, 512, 1,024, and 2,048 bps.For both mini and mini-c models, the training process was partitioned into three distinct phases.An initial grid search was executed to optimize learning rates, and base models were trained for an hour.The parameter yielding the highest Matthews Correlation Coefficient (MCC) was selected.The model was then trained using segment lengths of 256 bps for half an epoch, followed by 512 bps for another half epoch, and concluding with two epochs for 1024 bps segments.The training regimen for the mini-long model was similar, albeit commencing with 512 bps segments, then transitioning to 1024 bps, and finally to 2048 bps segments.Model optimization employed the settings delineated previously.

MCC (Matthews Correlation Coefficient):
Used for binary classifications and defined as: where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.The coefficient ranges from −1 (total disagreement) to 1 (perfect agreement).F1 Score: The harmonic mean of precision and recall, given by: The silhouette score is a measure used to calculate the goodness of a clustering algorithm.It indicates how close each sample in one cluster is to the samples in the neighboring clusters, with values ranging from -1 to 1, where a high value indicates that the sample is well matched to its own cluster and poorly matched to neighboring clusters (Rousseeuw, 1987).
Equation for the silhouette score s(i) for a single sample: Where: • a(i) is the average distance from the i-th sample to the other samples in the same cluster.• b(i) is the smallest average distance from the i-th sample to samples in a different cluster, minimized over clusters.

Results and discussion
. ProkBERT's learned representations capture genomic structure and phylogeny We assessed the zero-shot capabilities of our models by examining their proficiency in predicting genomic features based solely on embedding vectors, in a manner akin to Nucleotide Transformers and related methodologies. Figure 5 presents the UMAP projection of these embedded vector representations.Employing the UMAP technique, we reduced the dimensionality of genomic segments and derived embeddings.These were then evaluated using silhouette scores across the three models: ProkBERT-mini, ProkBERT-mini-c, and ProkBERT-mini-long.
Our primary objective was to discern if the representations of sequence segments from ESKAPE pathogens could be distinctly categorized.Indeed, Figure 5 exhibits clear delineation among known genomic features, including CDS (coding sequences), intergenic regions, ncRNA, and pseudogenes.It's important to note that these models were not explicitly trained to differentiate these sequence features; the representations were solely derived through pretraining.For the critical genomic comparison between "intergenic" and "CDS" regions, the silhouette scores obtained were 0.4925, 0.5766, and 0.3352 across the respective models, emphasizing a consistent and clear distinction between these features.Regarding non-coding RNA representations, the silhouette scores for "ncRNA" vs. "CDS" were 0.1537, 0.2935, and 0.2192, while for "ncRNA" vs. "intergenic, " they were 0.1648, 0.1302, and 0.3109, further affirming the assertion that ncRNAs cluster distinctly.Pseudogenes, as anticipated, exhibited some overlap with 'CDS' , notably in the ProkBERT-mini model with a score of −0.0358.Yet, when compared with 'ncRNA' , a distinct separation was observed, as evidenced by scores of 0.1630, 0.2365, and 0.1636.This analysis aligns with biological knowledge, where pseudogenes are expected to be more similar to CDS, while ncRNAs, which have different functions and characteristics, form distinct clusters from CDS and intergenic regions.All three models appear to produce similar clustering results for the given pairs of genomic features.
The embeddings prominently display the genomic intricacies of ESKAPE pathogens.Notably, Klebsiella pneumoniae and Escherichia coli, both members of the Enterobacteriaceae family, exhibit close proximity in the embedding space, echoing potential genomic kinship or shared evolutionary paths.This observation is further corroborated by the low silhouette scores across the models.In contrast, species like Pseudomonas aeruginosa manifest as more distinct clusters, emphasizing their genetic disparities.Intriguing overlaps, such as those between differently labeled Acinetobacter baumannii entities, highlight potential challenges in the data or shared genomic features.Combined, the UMAP visualizations and silhouette scores provide a profound insight into speciesspecific genomic embeddings, revealing both shared and distinct genomic signatures.

. ProkBERT can e ciently recover corrupted sequences
In evaluating the models' capabilities in the masking task, we used random masking across various genomic segments, such as CDS, ncRNA, intergenic, and pseudogenes, detailed in Table 2.We measured performance with metrics like ROC-AUC and average reference rank.However, a direct model comparison presents challenges.Notably, ProkBERT-mini-c boasts a significantly smaller vocabulary size (9) in comparison to  This allows ProkBERT-mini-c to achieve higher rankings, like top3, with relative ease as it encompasses nearly the entire vocabulary (there are 4 nucleotides).Also, the local context's representation in ProkBERT-mini-long is less dense, making the restoration of the masked nucleotides harder in contrast to the others.
For sequences spanning 1,024 nucleotides, ProkBERT-mini exhibited a commendable AUC of 0.9998, accompanied by top 1 and top 3 prediction accuracies of 51.69% and 92.27%, respectively.Concurrently, ProkBERT-mini-c achieved an AUC of 0.9586, with top 1 and top 3 accuracies at 51.28% and 92.22%.However, ProkBERT-mini-long reported slightly subdued figures, with an AUC of 0.9992 and top 1 and top 3 accuracies of 27.68% and 55.89%.This underscores the efficacy of the ProkBERT model family in handling genomic tasks.A salient observation from our analysis is that a model's prediction proficiency is intrinsically tied to the contextual size.
In our next assessment some performance nuances became evident across various genomic regions.The prokbert-mini model consistently stood out, especially within the Coding Sequence (CDS) and Intergenic domains.For these regions, it achieved an unmatched ROC-AUC of 0.9998.Specifically, within the CDS region, the model attained a Top1 accuracy of 50.33%, a Top3 accuracy of 91.87%, and an average reference rank of 0.811.In the Intergenic sections, these figures were 48.97%, 91.12%, and 0.843, respectively.The prokbert-mini-c model also exhibited commendable performance.Within the CDS regions, this model reached a Top1 accuracy of 50.65%, a Top3 accuracy of 91.91%, and an average reference rank of 0.802.For the Intergenic regions, the metrics were 48.84%, 91.39%, and 0.839 respectively.Despite the achievements of the aforementioned models, challenges persisted across all models in the non-coding RNA (ncRNA) domains.Even the top-performing prokbert-mini saw its Top1 accuracy drop to 32.46%, with an average reference rank increasing to 1.202.Contrastingly, the prokbert-mini-long, despite its detailed design, exhibited reduced accuracies, with  Top1 and Top3 accuracies of 25.18% and 52.66% across all labels, hinting at potential inefficiencies or overfitting.Collectively, these underscore the importance of tailored model architectures for genomic sequences and highlight the complexities of various genomic regions, laying a foundation for future targeted deep learning strategies in genomics.

. ProkBERT performs accurately and robustly in promoter sequence recognition
Identifying promoters, which are crucial in initiating the transcription process, is fundamental to understanding gene regulation in bacteria.Our initial fine-tuning task focused on the identification of these genomic regions, primarily through a binary classification approach that distinguishes sequences as either promoters or non-promoters.Although this method is widely used, various alternative strategies have been explored.A significant limitation of current techniques, as highlighted by Chevez-Guardado and Peña-Castillo (2021), is their reliance on training with a limited range of species, mainly E. coli, but also including Bacillus subtilis and a few other key species.
As illustrated in Figure 1, our training began with a pretrained model followed by training using cross-entropy loss minimization.We evaluated the training outcomes on two datasets: a test set curated by Cassiano and Silva-Rocha (2020), and another one comprising mixed species.The models' performance on the first dataset can be seen in Table 3.
The results are presented in Table 3.The ProkBERT family models exhibit remarkably consistent performance across the metrics assessed.With respect to accuracy, all three tools achieve an impressive score of 0.87, marking them among the top performers in promoter prediction.This suggests that, regardless of the specific version, the underlying methodology used in the mini series is robust and effective.
When evaluating the balance between true and false predictions using MCC both ProkBERT-mini and ProkBERT-mini-long slightly edge out ProkBERT-mini-c with an MCC of 0.74 compared to 0.73 for mini-c.Although the difference is marginal, it might indicate subtle refinements in the mini-long approach.In terms of sensitivity, which focuses on the ability to correctly identify promoters, ProkBERT-mini leads with a score of 0.90, closely followed by ProkBERT-mini-long at 0.89 and ProkBERT-mini-c at 0.88.This hierarchy, albeit with small differences, highlights the minute improvements achieved in the mini and mini-long versions.Lastly, for specificity, all three versions achieve an identical score of 0.85.This uniformity underscores the consistency in their ability to correctly identify non-promoters.In summary, while the performance across the mini versions is largely comparable, ProkBERT-mini and ProkBERT-mini-long display marginal advantages in certain metrics, hinting at potential refinements in these versions.
The Promotech tool demonstrates a mixed performance across the metrics.With an accuracy of 0.71, it correctly predicts the presence or absence of promoters 71% of the time.While this accuracy is lower than the top-performing tools like ProkBERT-mini and its variants, it is significantly better than the lower-performing tools such as Multiply and bTSSfinder.Sensitivity for Promotech is 0.49, suggesting that it correctly identifies nearly half of the actual promoters.However, its most remarkable performance metric is its specificity, with a score of 0.90.This means Promotech is adept at identifying non-promoters, correctly classifying them 90% of the time.
Among the methods assessed, CNNProm, Sigma70Pred, iPromoter-BnCNN, and iPromoter-2L exhibit notably high sensitivity scores, signifying their pronounced ability to correctly identify promoters.Specifically, iPromoter-BnCNN leads with an exceptional sensitivity of 0.99, closely trailed by Sigma70Pred at 0.95, CNNProm at 0.95, and iPromoter-2L at 0.94.Such high sensitivity scores indicate these models' potential in minimizing false negatives, which is crucial in applications where missing an actual promoter can have significant implications.However, it's vital to interpret these results with caution.The high sensitivity scores, especially of iPromoter-BnCNN and Sigma70Pred, come at the expense of specificity.For instance, iPromoter-BnCNN has a notably low specificity of 0.18, implying a substantial rate of false positives.Similarly, Sigma70Pred has a specificity of 0.41.This suggests that while these models are adept at identifying promoters, they often misclassify non-promoters as promoters.An essential factor to consider in this evaluation is the training data.Given that these models were trained on E. coli data, their performance might be biased when evaluated on the same or closely related datasets.This lack of independence between training and testing data can lead to overly optimistic performance metrics, as the models might merely be recalling patterns they've already seen, rather than generalizing to novel, unseen data.
Next, we evaluated our models' performance on a test set encompassing a broad mix of promoters, extending beyond just E. coli.The results are shown in Figure 6.
The trio of tools in the ProkBERT familymini, mini-c, and mini-long -consistently exhibited strong performance across the metrics analyzed.In terms of accuracy, all three achieved scores between 0.79 and 0.81, solidifying their position among leading promoter prediction tools.This uniformity in results points to a reliable methodology underlying the ProkBERT family.Using the Matthews Correlation Coefficient (MCC) as a measure of prediction balance, ProkBERT-mini The selection of competitors for the second test set took into account the larger size of the dataset, which posed practical challenges for established methods optimized for smaller sequences, resulting in processing issues and longer evaluation times.
and ProkBERT-mini-long both slightly outperformed ProkBERT-mini-c with MCC values of 0.63 and 0.62 respectively, against the 0.57 of mini-c.Considering sensitivity, ProkBERT-mini achieved the highest score of 0.81, with ProkBERT-mini-long and ProkBERT-mini-c trailing at 0.79 and 0.75, respectively.This order reiterates the nuanced enhancements in the models.With regard to specificity, ProkBERT-mini-long stood out with a score of 0.83, whereas ProkBERT-mini and ProkBERT-mini-c both scored 0.82, reflecting their adeptness at accurate non-promoter classification.
Of the tools assessed, both Sigma70Pred and iPromoter-BnCNN show moderate performance in sensitivity, with iPromoter-BnCNN taking the lead at 0.66 and Sigma70Pred following at 0.52.Promotech displayed a varied metric performance.With an accuracy rate of 61%, it identifies promoters correctly in a majority of instances.Its sensitivity value of 0.29 signifies its capability to detect roughly one-third of true promoters.Yet, its high specificity of 0.93 reveals its proficiency at negating non-promoters.
Promoter prediction is an intricate task that requires a balance between sensitivity and specificity.The consistently strong performance of the ProkBERT family highlights their reliability in this domain.Yet, the selection of a tool should be made after weighing the potential implications of both false positives and negatives.
. ProkBERT swiftly and accurately identifies phage sequences, even in challenging settings Various tools have addressed phage sequence identification, each employing distinct strategies.These methods can be categorized into: (i) homology or marker-based tools like VirSorter2 (Guo et al., 2021) and VIBRANT (Kieft et al., 2020), (ii) alignment-free methods, for instance, DeepVirFinder (Ren et al., 2020) and INHERIT (Bai et al., 2022b).The first category leans on existing annotations, databases, and sequences.In contrast, alignment-free methods are less influenced by existing knowledge, offering broader applicability and greater reliability with imperfect sequence data (Wu et al., 2023).We assessed our classification accuracy against INHERIT, VirSorter2, and DeepVirFinder (Ren et al., 2020).Notably, INHERIT employs a DNABert architecture for classification, akin to ours, drawing inspiration from DNABert (Ji et al., 2021).
In genomic studies, discerning phage-related segments becomes increasingly challenging as the segment length diminishes (Guo et al., 2021).This study rigorously evaluates six distinct phage classification methodologies over a range of sequence lengths, leveraging the accuracy and MCC as primary performance metrics.
For the shortest fragments (256bp), VirSorter was unable to process the test set.Among the evaluated methods, the ProkBERT models-mini, mini-c, mini-longconsistently emerged as top performers across varying lengths, as depicted in Figure 7. Specifically, ProkBERT-mini excels with shorter sequences, achieving the highest accuracy for 256 bp fragments.This high accuracy does not come at the cost  of increased false positives or negatives, as evidenced by its comparable MCC values.In contrast, DeepVirFinder, ranking fifth, indicates potential optimization areas for such short sequences.While ProkBERT-mini consistently ranks highest for lengths up to 1,024 bps, ProkBERT-mini-c closely follows, signifying its stability and reliability.Notably, the maximum sequence length that ProkBERT-mini and ProkBERT-mini-c can process is limited to 1024bps, introducing the specialized ProkBERT-mini-long for extended sequences.This model showcases its prowess with 2kb sequences, achieving an accuracy of 92.90% and an MCC of 0.859.Virsorter2, despite initial struggles with shorter sequences, exhibits significant improvements for longer fragments.However, both DeepVirFinder and INHERIT show limited enhancements with increased sequence lengths, suggesting these methods might not capitalize on the additional information longer sequences provide as effectively as their counterparts.In conclusion, ProkBERT-mini and ProkBERT-mini-long clearly stand out as top-performing models across various sequence lengths.While other methods may have their merits, they simply don't match the consistency and robustness offered by the ProkBERT models.
In phage classification, sensitivity signifies the proportion of actual phage sequences that are correctly identified.Conversely, specificity represents the proportion of non-phage sequences accurately discerned.A method exhibiting high sensitivity effectively identifies most phage sequences, while high specificity indicates minimal misclassification of non-phage sequences as phage-related.Supplementary Figure S2 presents the comparative results of the models in terms of specificity and sensitivity.Interestingly, longer sequences tend to decrease the specificity for VirSorter2.This trend suggests that VirSorter2 might misclassify non-phage sequences more frequently as the sequence length increases.A concurrent analysis of sensitivity and specificity reveals nuances in method performance.For example, ProkBERT-mini consistently achieves top ranks in sensitivity but displays variable results in specificity.On the other hand, Virsorter2, despite its strong specificity, especially with extended sequences, requires enhancements in its sensitivity.Notably, several methods, including DeepVirFinder, ProkBERT-mini, ProkBERT-mini-long, and ProkBERT-mini-c, consistently maintain high specificity.Their narrow interquartile ranges around upper values underscore their consistent, reliable performance.
Next, we scrutinized the relationship between evaluation time and prediction performance.It's important to note that the evaluation time encompasses not just the prediction interval but also includes sequence preprocessing and model loading durations.The ProkBERT family shines in terms of both swiftness and efficacy.These methods, regardless of sequence length, consistently register evaluation durations under 10 seconds, making them invaluable for applications necessitating real-time predictions.Specifically, for 2kb sequences, ProkBERT-mini-long records a commendable accuracy of nearly 92.9%.Its Matthews Correlation Coefficient (MCC), a reliable metric of prediction prowess, stands at approximately 0.859 for the same sequence length.In contrast, both VirSorter2 and DeepVirFinder manifest protracted evaluation phases, with the latency amplifying as sequences lengthen.
Remarkably, VirSorter2 demands an evaluation span surpassing 1,000 seconds for 2kb sequences.While assessing accuracy, DeepVirFinder exhibits suboptimal performance, especially with succinct sequences like 256 bp, where it achieves a mere 75%.However, it's essential to acknowledge that VirSorter2 extends beyond mere classification; it offers comprehensive annotations, a process inherently time-intensive.
In essence, the ProkBERT family represents a synergy of rapidity and reliability.Concurrently, other contenders like VirSorter2, DeepVirFinder, and INHERIT unveil distinct advantages, coupled with potential avenues for refinement.

Conclusion
In bioinformatics, there has always been a keen interest in developing tools that can offer precise and context-sensitive interpretations of sequences.Meeting this demand, we introduced the ProkBERT model family.These innovative models benefit from transfer learning (Pan and Yang, 2009), a method showing promise in a variety of applications.A standout feature of ProkBERT is its ability to harness vast amounts of unlabeled sequence data through self-supervised learning (He et al., 2020).This approach equips ProkBERT to handle challenges like limited labeled data, a problem that has often hindered traditional models such as CNNs, RNNs, and LSTMs (Cho et al., 2014;LeCun et al., 2015).Another strength of ProkBERT is its adaptability; it performs well in different scenarios, from those with sparse data to classic supervised learning tasks (Snell et al., 2017).When we compare ProkBERT to older models that largely depend on expansive datasets, it's clear that ProkBERT ushers in a more adaptable approach for sequence analysis in prokaryotic microbiome studies.
Our results affirm the robust generalization capabilities of the ProkBERT family.The learned representations are not only consistent but also harmonize well with established biological understanding.Specifically, the embeddings effectively delineate genomic features such as coding sequences (CDS), intergenic regions, and non-coding RNAs (ncRNA).Beyond capturing genomic attributes, the embeddings also encapsulate phylogenetic relationships.A case in point is the close proximity in the embedding space between Klebsiella pneumoniae and Escherichia coli, both belonging to the Enterobacteriaceae family.
We validated the versatility of the ProkBERT model family by applying it to two challenging problems: promoter sequence prediction and phage identification.Promoters play an instrumental role in transcriptomic regulation.Leveraging the transfer-learning paradigm, ProkBERT adeptly addressed the promoter prediction challenge, even when fine-tuned on multi-species datasets.This adaptability addresses a significant gap, as many conventional bioinformatics tools tend to be speciesspecific, often overlooking microbial diversity.In comprehensive benchmarks against prominent tools, including Multiply, Promotech, and i-Promoter2L, ProkBERT consistently outclassed both traditional machine learning and deep learning counterparts.For instance, in E. coli promoter recognition, it achieved an accuracy of 0.87 and an MCC of 0.74, and even in a mixed-species context, the accuracy was 0.81 with an MCC of 0.62.Additionally, our findings underscore the robustness of the training, with the ProkBERT-mini variant demonstrating resilience against variations in optimization parameters, such as learning rate.
Our evaluations demonstrate the prowess of ProkBERT in classifying phage sequences.Remarkably, it achieves high sensitivity and specificity even in challenging cases where available sequence information is limited.However, this exercise also highlights an inherent limitation of ProkBERT, and more broadly of transformer models: the restricted context window size.While transformer architectures are adept at capturing long-range interactions (Lin et al., 2022), they typically have a limited view of only a few kilobases.In comparative benchmarks with varying sequence lengths, ProkBERT consistently surpassed established tools like VirSorter2 and DeepVirFinder.For instance, it attained an accuracy of 92.90% and an MCC of 0.859 in multiple benchmark studies.Intriguingly, ProkBERT even outperformed a DNA-BERTbased model, which employs a BERT architecture and vectorization strategy similar to ours.
Discussing model variants, both ProkBERT-mini and ProkBERT-mini-c have a maximum context size of 1kb, while ProkBERT-mini-long extends this to 2kb.Notably, ProkBERT-mini-long manages to use longer sequence information without compromising on prediction performance or demanding additional computational resources, thanks to the LCA tokenization strategy.Our results indicate that the local context information offered by ProkBERT-mini-long and ProkBERT-mini enhances robustness, giving them an edge over ProkBERT-mini-c.
ProkBERT's superiority is not limited to prediction accuracy; it also excels in terms of inference speed.Variants such as ProkBERT-mini, ProkBERT-mini-long, and ProkBERT-mini-c consistently deliver outstanding performance, both in terms of evaluation speed and accuracy.Regardless of the sequence length, these models typically complete evaluations in under 10 seconds, making them exceptionally suited for real-time applications (Vaswani et al., 2017).
The vector representations generated by ProkBERT can be seamlessly integrated with traditional machine learning tools, paving the way for innovative hybrid methodologies.Being an encoder architecture, ProkBERT's ability to produce embeddings for nucleotide sequences enables the direct incorporation of sequence information into more complex classifiers.This fusion of traditional and deep learning methods represents a promising frontier in bioinformatics.Furthermore, insights from natural language processing research suggest that the most informative representations may not always emerge from the final layer of a model (Rae et al., 2021).This underscores the need for future studies to delve deeper into the optimal layers for sequence representation extraction in bioinformatics models.
ProkBERT distinguishes itself by being both compact and powerful, embodying a blend of efficiency and accessibility.One prevailing challenge with contemporary large language models like GPT (Radford et al., 2019), BERT (Devlin et al., 2019), and T5 (Raffel et al., 2019) is their enormity.Models with hundreds of millions or even billions of parameters not only demand substantial computational resources but also complicate training and hyperparameter optimization processes.In stark contrast, ProkBERT is designed with a lean parameter count of approximately 20 million.This design choice ensures that it can comfortably fit within the memory constraints of modest GPUs.As a result, even researchers without access to highperformance computing setups or top-tier GPUs can utilize ProkBERT.Platforms like Google Colab, which offer free but limited GPU computation, become viable environments for training and evaluation tasks with ProkBERT.
As we present the findings of our study, it's important to recognize certain limitations and identify areas for future enhancement.These include: (i) creation of larger models: The effectiveness of our current models can be further improved by scaling up.Larger models are likely to capture more complex patterns, which is particularly beneficial for handling diverse and extensive datasets.(ii) Increasing context size: Expanding the context size in our models could lead to a better understanding of longer sequence dependencies.This enhancement is crucial for the accurate interpretation of biological sequences.(iii) Building new datasets: The development of new, comprehensive datasets is an ongoing necessity.These datasets should not only be larger in size but also more diverse, ensuring the robustness and wide applicability of our models.(iv) Diversity in sequencing applications: Despite our progress, the question of diversity in sequence applications remains.This includes broadening the range of sequences our models can recognize and applying them to a variety of biological phenomena.(v) Further applications and descriptions: Future research should also aim to add and describe additional applications.This would involve applying our models to new sequence analysis tasks, expanding the scope and utility of our work.Each of these points represents a critical area for improvement and further research.Addressing these limitations will enable us to develop more comprehensive and versatile tools in the field of bioinformatics.
In essence, our findings highlight ProkBERT's capability to learn detailed and adaptable vector representations for sequences.These representations hold promise not only for current analytical challenges but also for emergent and unforeseen sequence classification tasks in the future.Amidst the challenges of understanding microbial communities, ProkBERT stands as a transformative tool, elucidating the complex interplay of genes and organisms in the microbiome with remarkable precision.

FIGUREA
FIGUREA schematic overview of the training process.Starting with raw sequence data, it undergoes preprocessing and vectorization.The model is then fine-tuned, beginning with weights from a pretrained model, to address the specific classification task.The output showcases classified sequences or tokens, predicted labels, scores, and a visualization highlighting underlying sequence patterns and explanations.

FIGURE
FIGURE Preprocessing of sequences.The sequences are initially segmented into chunks ranging between -kb.Two segmentation strategies can be employed: (A) Contiguous segmentation where the sequence is split into non-overlapping parts; (B) Random segmentation where segments of varying lengths are randomly sampled from the original contig.The third part (C) outlines the tokenization process of the segments: (a) Splitting segments into non-overlapping tokens; (b) Creating maximally overlapping k-mers; (c) Generating partially overlapping k-mers by shifting with a fixed size.
Mean Pooling: Average the vectors: e mean = 1 T T i=1 e i .• Sum Pooling: Sum the vectors: e sum = T i=1 e i .• Max Pooling: Max value per dimension: e max [j] = max T i=1 e i [j].• Min Pooling: Min value per dimension: e min [j] = min T i=1 e i [j].
1. Training set: Constitutes 80% of the total data and is pivotal for initial model development and training.

FIGURE
FIGURESchematic of the promoter dataset.This figure provides a visual representation of the sequence sources and their distribution within the study.The dataset comprises known promoter sequences from organisms, retrieved from the Prokaryotic Promoter Database (PPD), alongside non-promoter sequences obtained from the NCBI RefSeq database (specifically sampled from CDS regions).It also includes non-promoter sequences constructed via higher and zero-order Markov chains that mirror compositional characteristics of known promoters.Additionally, an independent test set, focusing on E. coli sigma promoters, was employed, curated by Cassiano and Silva-Rocha ().A balanced distribution approach was adopted to even out the number of positive and negative samples, with the dataset being systematically divided into training, validation, and test subsets.This stratification underpins a thorough evaluation of the model e cacy.
Represents the proportion of correctly predicted instances to the total, defined as:Accuracy = TP + TN TP + TN + FP + FNSensitivity (Recall):The proportion of actual positives correctly identifiedEvaluates the model's discriminative ability between positive and negative classes.It's the area under the ROC curve, which plots Sensitivity against 1−Specificity for various thresholds.

FIGURE
FIGURE UMAP embeddings of genomic segment representations.The figure presents the two-dimensional UMAP projections of embedded vector representations for various genomic features, derived from the ProkBERT-mini, ProkBERT-mini-c, and ProkBERT-mini-long models.The distinct clusters highlight the models' ability to di erentiate between features such as CDS (coding sequences), intergenic regions, ncRNA, and pseudogenes, even without explicit training for feature di erentiation.(B) The segments are colored according to species, indicating that cluster structure reflects the phylogenic similarities.(A) Sequence embeddings of the di erent regions.(B) Sequence embeddings of the di erent species of ESKAPE pathogens.

FIGURE
FIGUREPromoter prediction performance metrics on a diverse test set.A comparative analysis of various promoter prediction tools, showcasing their performance across key metrics including accuracy, F score, MCC, sensitivity, and specificity.The tools evaluated include ProkBERT-mini, ProkBERT-mini-c, ProkBERT-mini-long, Promotech, Sigma Pred, iPromoter-BnCNN, and MULTiPly.

FIGURE
FIGURE ProkBERT identifies phage sequences accurately and rapidly.(A) Method comparison over varying sequence lengths based on two essential performance metrics: accuracy and MCC.(B) Scatter plots illustrating the relationship between evaluation time (on a logarithmic scale) and the mentioned performance metrics.The size of each point signifies the sequence length.Evaluation time encompasses model loading, sequence preprocessing, and inference phases.(A) Comparison of methods across di erent sequence lengths.(B) Comparison of evaluation time and performance metrics.
TABLE Masking performance of the ProkBERT family.TABLE Evaluation of promoter prediction tools on E-coli sigma dataset (Transposed).