StartLink and StartLink+: Prediction of Gene Starts in Prokaryotic Genomes

State-of-the-art algorithms of ab initio gene prediction for prokaryotic genomes were shown to be sufficiently accurate. A pair of algorithms would agree on predictions of gene 3′ends. Nonetheless, predictions of gene starts would not match for 15–25% of genes in a genome. This discrepancy is a serious issue that is difficult to be resolved due to the absence of sufficiently large sets of genes with experimentally verified starts. We have introduced StartLink that infers gene starts from conservation patterns revealed by multiple alignments of homologous nucleotide sequences. We also have introduced StartLink+ combining both ab initio and alignment-based methods. The ability of StartLink to predict the start of a given gene is restricted by the availability of homologs in a database. We observed that StartLink made predictions for 85% of genes per genome on average. The StartLink+ accuracy was shown to be 98–99% on the sets of genes with experimentally verified starts. In comparison with database annotations, we observed that the annotated gene starts deviated from the StartLink+ predictions for ∼5% of genes in AT-rich genomes and for 10–15% of genes in GC-rich genomes on average. The use of StartLink+ has a potential to significantly improve gene start annotation in genomic databases.

Let consider the cases when the algorithms 1 and 2 predict the same start 1. completely independently 2. with complete dependence (e.g., being the same algorithm), 3. by selecting a start at random among several candidates with equal probability Then we have for (1) the following. For independence ( ), and full dependence ( ), randomness ( ) conditions (see the proof below) we get Since ( = | ≠ ) = 1 − ( = | = ), the probability of the identical start prediction being false increases as the number of options to select from increases (Fig. S1). Importantly, we see the effect on the overall error rate when we add a second, independent prediction. This simple study demonstrates that the largest improvement is gained by maximizing independence between the algorithms.
Note that the real error rates of gene-start prediction are difficult to model theoretically, therefore the above consideration makes simplifying assumptions (e.g., on the uniformity of selecting any of the incorrect starts when a wrong prediction is made).

Proof of Equation 3
The derivations of Eq 2 under the three conditions of randomness, independence, and full dependence, is as follows. The first expansion (irrespective of condition) is Therefore, 2 Note 2: Effects of the variations in distributions of the Kimura distance on the StartLink performance.
StartLink uses the Kimura distances as a metric to filter out sequences of too close or too distant relatives. We tested several alternative evolutionary distance metrics, including (non-) synonymous substitution rates, amino acid identity, etc. but they either performed equally well or slightly worse at providing StartLink with a good set of orthologous sequences (data not shown). Overall, we found that these metrics provide a way to remove sequences of evolutionarily very close/distant relatives, but that additional filtering (such as analyzing the gaps in the MSA incurred by each query, as described in the main text) might still be needed.
The effect of varying the maximum Kimura threshold from 0.2 to 0.8, while fixing the minimum threshold to 0.1, is shown in Fig. S2. StartLink's sensitivity is robust with respect to increase of the maximum Kimura distance. On the other hand, the coverage rate fluctuates. This occurs because when more sequences with larger Kimura distance from query are included in the MSA (and thus, more mutations), the gene start becomes more difficult to infer.

Supplementary Fig. S2:
The effect of changing the maximum Kimura threshold on StartLink's sensitivity and coverage rates. The minimum Kimura threshold is fixed to 0.1, and x ∈ {0.2,0.3, … ,0.8}.
Specifically, N. pharaonis and H. salinarum's coverage rates suffer when the maximum Kimura threshold is below 0.4. Note, however, that the number of available genomes of Archaea (used as a reference clade) is quite small, especially when compared to numbers in Enterobacterales and Actinobacteria. This means that, on average, the coverage rates for archaea will be lower than for bacteria.
On the other hand, M. tuberculosis's coverage rate decreases as the maximum threshold increases past 0.4. A direct inspection of the MSAs showed that they appear to have more mutations around the genestart region than, for example, in E. coli for the same range of Kimura distances. It is unclear whether this is due to a sampling bias of genomes in the database or a biological feature. That said, StartLink+'s sensitivity rate remains constant, meaning that for such alignments, it does not make a prediction.
Similarly, Fig. S3 fixes the maximum Kimura threshold to 0.5 and varies the minimum threshold from 0.001 to 0.4. Overall, this has a lower effect on the coverage and sensitivity rates than in the previous case. We see an initial drop in the case of E. coli when the minimum threshold is close to 0.001. In this range, we observed more orthologs with high nucleotide-level identity rate (to the query gene) than in the other three genomes. It is again unclear whether this stems from a genome sampling bias in the database or from some underlying patterns of genome evolution, but the overall reduction in sensitivity is still small. It is important to note that for all the above experiments, the sensitivity of StartLink+ remains consistently high and largely unaffected, while the coverage rate mirrors that of StartLink since it is directly dependent on it (data not shown). In other words, our selection of the Kimura distances interval does not need to be fine-tuned beyond what has been done. This is fortunate, given that we do not want to overfit the small size ground-truth data available to us. The reasons for this pattern are, in part, due to just over 1,000 genomes of Archaea in the entire clade, compared to over 6,311 and 8,097 genomes for Enterobacterales and Actinobacteria, respectively, with much lower nodes in the taxonomy tree. In fact, E. coli and M. tuberculosis each have 1,000 + sequenced genomes of the strains within their own species.

Supplementary
Therefore, the observed difference in patterns is likely to only be a result of the undersampling of the microbial species and may disappear once the database is populated more uniformly.
Supplementary Fig. S5: The percentage of gene start differences between PGAP and StartLink+, computed as Diff(PGAP, StartLink+), as a function of the minimum and maximum Kimura distances between a query and its targets. The color-coding key panel shows the correspondence of color and the percentage of gene starts differences. This analysis was done for 443 query genomes.
We observed that deviations of the StartLink+ predictions from the PGAP annotation were in the same range regardless of variations in min and max values of the range of the Kimura between queries and targets Fig. S5.

Note 3: Distributions of the MSA scores computed around gene-starts
The StartLink algorithm relies on metrics, such as a measure of sequence conservation, computed in non-coding and coding regions, thus, around candidate starts. These metrics are computed after removing evolutionarily close and distant sequences, as determined by the Kimura distance, from the set of sequences in the initial MSA.
In this section, we analyze the scores computed by StartLink in different sections of the multiple sequence alignment, from 5' downstream. We run StartLink for genes with experimentally verified starts and build multiple sequence alignments. The goal is to show how the two scores defined in Eq 6 and 7 of the main text behave in different regions of the MSA (specifically, upstream of, at, and downstream of the verified start).
First, we emphasize the following: this analysis is done to show how the algorithm behaves. This means that conditions that StartLink uses, such as only extracting LORFs and filtering by Kimura, are inherently part of the analysis. For example, if a query's verified start is at the LORF 5' end, then the scores for the non-coding region are not computed, StartLink will not search for a gene-start there.

Block conservation scores
We start by analyzing the block conservation score defined by Eq 6 of the main text. For each protein MSA, we take blocks (with width 10 aa) in four different regions (Fig. S6): 20 aa upstream of the verified start (Up-far), just upstream of the verified start (Up-close), just downstream of it (Downclose), and 20 aa downstream of the verified start (Down-far).

Supplementary Fig. S6:
Blocks selected for the MSA score analysis. Fig. S7 shows the distribution of scores for each block region. As expected, the upstream (non-coding) regions have a much lower conservation score than the downstream (coding) regions, as mutations happen with higher frequency in non-coding regions than in coding regions, Also, given that sequences were extracted as LORFs, this region of the MSA would have a large number of gaps that reduce overall score. This pattern helps StartLink to skip over such regions. Notice that the rules of the StartLink setup (Kimura filtering, LORF extraction, etc...) facilitate the use a simple metric based on identity conservation to skip over non-coding regions, where conserved blocks are not detected. Supplementary Fig. S7: Distribution of block conservation scores in regions around verified starts.
On the opposite end of the protein MSA, the two blocks downstream of the verified start show relatively high conservation scores. Interestingly, the Down-far block has a higher conservation score (on average) than the Down-close one. This is especially apparent in M. tuberculosis. This pattern indicates that non-synonymous mutations just downstream of the gene-start may be more likely to occur than in a randomly selected section deeper in a gene. This may be a reason why computation of evolutionary distance metrics on the entire gene may not select an optimal set of sequences for StartLink's gene-start search algorithm.

Identity score for gene start candidates
Here we analyze the effectiveness of the 5' score (Eq 7 of the main text) for separating true and false starts. Notably, there are two types of false starts.
In an ideal case, the true start will have a high 5' identity score, due to presence of orthologous genes with the same location of the 5' end. The upstream (false) candidates tend to have low conservation scores, due to mutations that occur in non-coding region. On the other hand, the (false) candidates downstream of the verified start can show high identity scores, due to conservation across orthologs. Still, the false starts downstream could be discriminated if codons synonymous to GTG and TTG appear in the alignment to betray absence of selection pressure expected for a start codon. Hence, the penalty for the synonymous codons was added in the equation for 5 ′ (see main text). Fig. S8 shows how the 5' score separates the annotated (in this case, verified) gene-starts from the false candidates. The gap between the groups is strikingly wide, meaning that the 0.5 threshold is a reasonable, non-overfitting value. Supplementary Fig. S8: Distribution of 5' identity scores for verified starts as well as for upstream and downstream false gene start candidates.

Note 4: Distributions of differences in numbers of selected orthologs
The maximum number of allowed targets (currently = 50) is used to limit the number of sequences in MSA. At most sequences are selected from a large number of BLAST hits and used in the MSA prior to further filtering. The average number of targets per query after a full StartLink run is shown in Fig. 16; it can vary significantly, especially when comparing Archaea and Enterobacterales.
One reason why Enterobacterales has a high average number of targets is the large spread of the Kimura distances compared to other clades (see Fig. 10). On the other side, the spread of Kimura distances within Archaea targets has narrow spread, while the average number of targets reaches as low as 10 per query. This is partly due to the small number of genomes in this clade, making it less likely that we find enough sequences within the right Kimura distance range.
Still, we observed on the verified set that the larger final number of targets per query did not translate into a large improvement in performance. For example, for StartLink running with = 50, both archaea (H. salinarum and N. pharaonis) end up with on average 20 targets per query, compared to E. coli's 40 targets per query. In terms of error rate, however, for H. salinarum and N. pharaonis we saw the error rate of 3% and 2% respectively, while E. coli's the error rate was 5%.
In another experiment we decreased to 20 in order to check the performance on Archaea with low target-per-query averages. This change in N resulted in 10 to 15 targets per query for both archaea. However, we saw no real change in error rate; we saw a slight increase in error rate for H. salinarum (by 0.6%) and a decrease in error rate for N. pharaonis by 0.7%. Moreover, in all verified genomes, when was shifted we saw a 0.5% (on average per genome) change in error rate, some positive and some negative. This shows that StartLink is stable with respect to , and the minor shifts could be attributed to the underlying randomness of selecting the target sequences.

Note 5: Gene-start prediction in metagenomes
Standard self-training ab initio predictors usually require a sufficiently long sequence (e.g. a sequenced genome) for estimating the models parameters. In metagenomics applications short read assembled sequences normally are lacking full genomic context. Therefore, conventional gene finders such as GeneMarkS-2 cannot run its full cycle with training of the typical model (Lomsadze et al. 2018) due to insufficient training data. Still, GeneMarkS-2 can run a single iteration which is equivalent to running the gene finder with the MetaGeneMark models (Zhu, Lomsadze, and Borodovsky 2010). Such a run relies on pre-built collection of heuristic GC specific models that do not include models for the regulatory motifs in the gene up\stream regions. As shown in Fig. S9, the performance of GeneMarkS-2 starts to degrade rapidly when the input fragment size goes below the 200K range. And at 10K, we see a drop all the way down to 75% on M. tuberculosis, R. denitrificans, and E. coli, 96% on H. salinarum and 88% on N. pharaonis. This is in contrast to StartLink, which doesn't depend on the genome size at all, and maintains an average accuracy of 95% on all genomes, with the lowest being for M. tuberculosis at 93.62%.
This experiment shows the advantage of similarity-based approaches in ability to make predictions of gene starts in short sequence fragments for genes whose orthologs are present in the database.

Note 6: StartLink technical details and optimizations
In this section, we discuss some under-the-hood details of the selection process of target sequences in StartLink, focusing on items that can affect the overall performance and runtime. Before we begin, we state the goal of this selection process by describing the desired properties of the final selected set of sequences. We then describe the selection process and discuss optimizations and shortcuts made to reduce the overall runtime.
In StartLink, target sequence selection is a multi-step process. We start with a search of protein sequence targets in a database and then filter out unwanted sequences. Given that our initial databases are large, we use DIAMOND BLASTp, which provides a faster version of the BLASTp search algorithm. This search typically returns thousands of hits per query, and the goal is to select (in our case, = 50) target sequences (per query) that will be used by StartLink.
Ultimately, we want to choose a set of sequences that, when aligned, would allow us to find the start of the query gene easily. For our purposes, we want this set (a query and (maximum) targets) to satisfy the following pairwise constraints: ( 1 , 2 ) ∈ [0.1,0.5] ∀ 1 , 2 ∈ , 1 ≠ 2 This means that no pair of sequences in this set should have a Kimura distance smaller than 0.1 or larger than 0.5.
The most straightforward approach to accomplishing this is to compute all pairwise Kimura distances between query and target sequences, and then randomly select that exist in the desired range. However, the Kimura distance calculation requires a global alignment of two sequences, which in this case would be very costly (given the large initial number of sequences).

Avoiding global alignment
Therefore, the first order of approximation is to use the already-existing local alignments between a query and its targets (generated by the BLASTp search), and immediately compute approximate Kimura distance values. In our experiments, we found no tangible difference in StartLink's performance on the set of genes with verified starts based on whether global or local alignment was used in the Kimura distance computation.
Note, however, that these local alignments are only provided between a query and a target, and not between pairs of target sequences. Furthermore, there can still be tens of thousands of targets per query and going through them all would result in many unnecessary computations.

Avoiding all-pairs target alignments
We can avoid (at this stage) directly comparing target sequences to each other by observing how they compare to the query. This step exploits the fact that two very similar target sequences should have similar Kimura distances to the query. As such, if the Kimura distance between two targets 1 and 2 and the query is roughly the same (i.e. ( , 1 ) ≈ ( , 2 )), then we remove one of them from the list of candidates under the assumption that there is a higher chance that they are very similar to each other. We deem two Kimura distances as approximately the same if they are equal up to the fourth decimal place.
It's important to note that this heuristic is not optimal, however. In particular, it doesn't hold strictly in the opposite direction, i.e., in general, two sequences with similar Kimura distances to the query may have differences that did not affect the result of computation of the Kimura distance to the query. That said, we have found that this step slightly improves performance because we are less likely at this stage to import multiple target sequences that are very similar to each other, which would end up in biasing the alignment.

Avoiding analyzing the large number of target sequences
We first note that hits retrieved from BLASTp are ordered by -value. In practical terms, this means that they are ordered in terms of a similarity score to the query sequence. It also means that they are roughly ordered by the Kimura distance; we say "roughly" because the ordering is not exact, but it does not affect the final performance.
Given this list, we want to find the position of the sequence after which no sequence exists with ∈ [ , ]. If the list was perfectly sorted, this task would boil down to finding the first sequence with > . Given that the ordering is not exact, however, we take some measures to decrease the likelihood that our heuristic removes sequences within the range [ , ].
First, if the total number of target sequences for a query is less than 2,000, we skip this filtering step since the number of computations needed is not large. Second, instead of looking for the first sequence with > , we place an additional buffer that favors keeping in more out-of-range sequences than removing in-range sequences; basically, we update the threshold such that we search for > + 0.2. Third, instead of traversing the list in order and finding the first sequence that satisfies > + 0.2, we perform a binary search through the list, which makes it much less likely to stop at a very low positioned sequence in the list.
As such, if the number of sequences is higher than 2,000, we perform a binary search through the list to find an (approximate) "upper bound" above which all sequences (likely) have a distance value larger than the maximum allowed threshold (+0.2). This operation is quick, requiring log( ) Kimura distance computations on average (where is the number of hits). In other words, for a list of 100,000 hits, we require (on average) 16.6 Kimura distance computations, where each computation is a simple run through an existing local alignment, at the nucleotide level, counting the transition and transversion changes.
We effectively now have a list of sequences whose Kimura distance is in the range [0, + 0.2]. We don't perform the same operation to find the lower limit because our desired threshold of 0.1 is close to 0, leaving us a small buffer region. Instead, we shuffle the list and go through it one sequence at a time until we have 50 sequences that satisfy our requirements. In the worst case, we would have to go through all sequences in this range, especially if the target sequences are very similar to the query or each other.

Optimality in the face of randomness
If we had gone through every sequence in the list, computed , sorted it, then filtered out the sequences outside the range [ , ], we would still have to randomly select only (from the possibly thousands) of the remaining sequences. This reduction, which comes from the requirement of the subsequent multiple sequence alignment step, means that even if we were to act optimally at the initial stages, the act of random selection reduces the likelihood that our final set is actually optimal. In other words, we can act sub-optimally (as suggested in the above heuristics) at the initial steps without necessarily observing a change in the outcome. In practice, we did not see a tangible difference in StartLink's final performance by acting optimally or sub-optimally at the initial stages.

Note 7: Examples of multiple sequence alignments made by StartLink
We show below examples of MSAs where StartLink+ selection of the gene start (#selected) doesn't match PGAP's prediction (#ref). The examples were taken from eight genomes, coming from the four clades considered in the main text.