A novel linear indexing method for strings under all internal nodes in a suffix tree

Al-okaily, Anas; Tbakhi, Abdelghani

doi:10.3389/fbinf.2025.1577324

METHODS article

Front. Bioinform., 04 September 2025

Sec. Integrative Bioinformatics

Volume 5 - 2025 | https://doi.org/10.3389/fbinf.2025.1577324

This article is part of the Research Topic: Innovative Tools for Multi-Omics Data Analysis

Anas Al-okaily¹*

Abdelghani Tbakhi²

¹Department of Cell Therapy and Applied Genomics, King Hussein Cancer Center, Amman, Jordan
²Department of Pathology and Molecular Medicine, McMaster University, Hamilton, ON, Canada

Suffix trees are fundamental data structures in stringology and have wide applications across various domains. In this work, we propose two linear-time algorithms for indexing strings under each internal node in a suffix tree while preserving the ability to track similarities and redundancies across different internal nodes. This is achieved through a novel tree structure derived from the suffix tree, along with new indexing concepts. The resulting indexes offer practical solutions in several areas, including DNA sequence analysis and approximate pattern matching.

1 Introduction

Numerous string-processing problems arise in several scientific fields, including biology and medicine. These problems include exact and approximate pattern matching, motif search, lowest common ancestor queries, and the detection of tandem repeats. The inputs for such problems can range from small documents and databases to DNA sequences and large-scale corporate data. To address string problems more efficiently, several data structures have been designed and are commonly used, including suffix trees (Weiner, 1973; McCreight, 1976; Ukkonen, 1995), suffix arrays (Abouelhoda et al., 2004), and the FM-index (Ferragina and Manzini, 2000).

Constructing suffix trees, suffix arrays, and FM-indexes can all be achieved in linear time and space. Although building suffix trees incurs a higher constant-factor overhead than building suffix arrays and FM-indexes, their structure is more flexible and dynamic. This flexibility arises from the ability of suffix trees to identify systematic redundancies among the suffixes in the input data—capabilities not offered by suffix arrays or FM-indexes. For instance, suffix trees make it easy to observe that a subtree rooted at an internal node is isomorphic or partially isomorphic to subtrees rooted at other internal nodes. Such structural observations are not possible with suffix arrays or FM-indexes. Once these redundancies are identified and abstracted, complex string problems can be solved more efficiently than using suffix arrays, FM-indexes, or even the standard suffix tree representation.

In this work, we introduce two algorithms that index strings under all internal nodes in suffix trees in linear time and space.

2 Methods

Let $T$ be a text of length $n$ derived from an alphabet of size $Σ$. Let ST be the suffix tree of $T$. Let $h$ be the height of ST, i.e., the maximum number of nodes between the root node and an internal node. For any internal node $x$ in ST, we define the following functions: $Depth(x)$ denotes the depth of node $x$, i.e., the sum of the lengths of all edges between the root of ST and node $x$; $SL(x)$ denotes the node to which the suffix link of node $x$ points; $SLS(x)$ denotes the set of nodes whose suffix links point to node $x$ (note that for any internal node, the size of this set is at most $Σ$); and $Leaves(x)$ denotes the set of leaf nodes in the subtree rooted at node $x$. For any leaf node $l$ in ST, $Suffix\_Index(l)$ denotes the suffix index (set during the construction of ST) labeled at leaf node $l$ (for instance, if the suffix index labeled at $l$ is $i$, then the label of the edges from the root node to $l$ represents the $i^{th}$ suffix in $T$).

Definition 1: Let $x$ be an internal node, and let $S(x)$ be the set of suffix indexes, based on $T$, labeled at each leaf node under $x$ in ST, i.e., $S(x) = \{Suffix\_Index(l) \mid l \in Leaves(x)\}$. Then, the suffixes under node $x$, denoted as $SU(x)$, are the suffix indexes in $T$ that start from node $x$, i.e., $SU(x) = \{Depth(x) + Suffix\_Index(l) \mid l \in Leaves(x)\}$.

As an example, the $S$ list of node 20 in Figure 1 is $\{9,13,4\}$; therefore, $SU(20) = \{12,16,7\}$ since the $Depth$ of node 20 is 3.
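As a minimal sketch of Definition 1, the following Python snippet computes $S(x)$ and $SU(x)$ by brute force, identifying a node by its path label instead of building a real suffix tree. It assumes 0-based suffix indexes and that node 20 of Figure 1 has the path label TAA (inferred from its depth of 3 and its $S$ list); the function names `s_list` and `su` are illustrative and do not come from the published code.

```python
TEXT = "AGCATAATTTAACTAAG$"  # string of Figure 1, 0-based positions

def s_list(text, label):
    """S(x): suffix indexes of the leaves under the node whose path label is `label`."""
    return {i for i in range(len(text)) if text.startswith(label, i)}

def su(text, label):
    """SU(x): every suffix index in S(x) shifted by Depth(x) = len(label)."""
    return {i + len(label) for i in s_list(text, label)}

print(sorted(s_list(TEXT, "TAA")))  # [4, 9, 13]
print(sorted(su(TEXT, "TAA")))      # [7, 12, 16]
```

The scan over all positions is quadratic in the worst case and only serves to make the definition concrete; in a suffix tree the same sets are read off the leaves under the node.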

Figure 1. A suffix tree constructed from the string AGCATAATTTAACTAAG$ using https://hwv.dk/st/?AGCATAATTTAACTAAG$. Each node is annotated with a unique identifier enclosed in a circle for ease of reference. The edges between nodes are labeled with substrings that represent segments of the original string along distinct suffix paths. Leaf nodes (those without children) are marked with red integers, indicating the starting positions (suffix indexes) of the corresponding suffixes in the original string. Green dotted arrows denote suffix links, which connect internal nodes according to standard suffix tree construction rules.

Observe the following properties:

$•$ If $SL(a) = b$, then $SU(a) \subseteq SU(b)$. This implies that any processing or indexing assigned to suffixes in $SU(a)$ can be implicitly applied to the same suffixes that are (and must be) in $SU(b)$.

$•$ In order to achieve the above point, nodes in $SLS(x)$ must be indexed or processed before $x$ itself (which means that a post-order traversal is required).

$•$ The set of suffixes that eventually need to be indexed under node $x$ is the set of suffixes under node $x$ minus the set of all suffixes under each node with a suffix link pointing to $x$, i.e., $SU(x) - \bigcup_{n \in SLS(x)} SU(n)$.

$•$ The indexing process must recursively traverse the suffix links in ST.

Therefore, to compute this indexing scheme and traverse the suffix links recursively, the following tree structure must be designed and constructed.

2.1 Okaily-Sheehy-Huang-Rajasekaran (OSHR) tree structure

Given ST, the structure of the OSHR tree is defined as follows (the acronym “OSHR” is explained in the Acknowledgments section):

$•$ The root node is the root of ST.

$•$ There is a directed edge from node $a$ to node $b$ if $SL(b) = a$. For example, under the OSHR tree structure, node 25 must have a directed edge to node 14 since there is a suffix link from node 14 to node 25.

$•$ A leaf node in the OSHR tree structure is any internal node $v$ in the ST structure for which $SLS(v) = \emptyset$; that is, $v$ has no incoming suffix link (for example, node 14 in Figure 1).

$•$ An internal node in the OSHR tree structure is any internal node $v$ in the ST structure for which $SLS(v) \neq \emptyset$; that is, $v$ has at least one incoming suffix link (for example, node 25 in Figure 1).

$•$ The children of an internal node $v$ are the nodes in $SLS(v)$. For example, the children of node 26, $SLS(26)$, in Figure 1 are {node 15, node 21}.

$•$ Edges have no labels.

$•$ Leaf nodes under the ST structure are not included in the OSHR tree structure.

The directed edges in the OSHR tree, which are the reverse of suffix links, correspond to a simplified form of Weiner links in ST (as defined by Wellnitz (2021), Apostolico and Cunial (2014), and Belazzougui et al. (2020)). Due to the construction properties of ST and its suffix links, the OSHR tree forms a directed acyclic graph. The OSHR tree is constructed by traversing ST: at each visited internal node $v$, a list called $SLS$ is created at node $SL(v)$ if it does not already exist, and $v$ is then appended to this list. The space and time complexities of building the OSHR tree are both linear, $O(Σ n)$, and this structure can be constructed either implicitly (within the ST) or explicitly (as a separate tree structure).
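The construction described above can be sketched in Python as follows. This is a toy model under stated assumptions, not the published implementation: internal nodes are represented by their path labels (a substring followed by at least two distinct characters), found here by brute-force enumeration instead of a linear-time suffix tree builder, and $SL(v)$ is obtained by dropping the first character of the label. The helper names are hypothetical, and the mapping of labels to the node numbers of Figure 1 (T for node 26, AT for node 15, TT for node 21, AG for node 14, G for node 25) is inferred from the examples in the text.

```python
from collections import defaultdict

TEXT = "AGCATAATTTAACTAAG$"  # string of Figure 1

def internal_nodes(text):
    """Path labels of internal suffix-tree nodes: substrings followed by
    at least two distinct characters (the empty label is the root)."""
    followers = defaultdict(set)
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers[text[i:j]].add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}

def build_sls(nodes):
    """SLS lists: every internal node v is appended to the list kept at SL(v)."""
    sls = {v: [] for v in nodes}
    for v in sorted(nodes):
        if v:                     # the root has no outgoing suffix link
            sls[v[1:]].append(v)  # SL(v) drops the first character of the label
    return sls

nodes = internal_nodes(TEXT)
sls = build_sls(nodes)
print(sls["T"])                                # ['AT', 'TT']
print(sorted(v for v in nodes if not sls[v]))  # ['AG', 'AT', 'C', 'TAA', 'TT']
```

The second line of output lists the OSHR leaf nodes, i.e., the internal ST nodes with an empty $SLS$ list. Only the single pass that fills the $SLS$ lists mirrors the construction in the text; the node enumeration is a quadratic stand-in.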

The OSHR tree differs from the suffix-tour graph (Starikovskaya and Vildhøj, 2015) and the suffix link tree (Starikovskaya and Vildhøj, 2015; Apostolico and Cunial, 2014; Belazzougui et al., 2020). Unlike the suffix-tour graph, the OSHR structure is acyclic. Compared to the suffix link tree, the OSHR tree has unlabeled edges, does not include the leaf nodes of ST, and has as its leaf nodes the internal nodes in ST that have no incoming suffix links.

2.2 Okaily-Tbakhi (OT) indexing

To identify all similarities and redundancies of strings under different internal nodes in an ST, a post-order traversal of the OSHR tree is required, during which both the ST and the OSHR tree structures are utilized.

Definition 2: We denote those strings, such as the suffixes defined by the $SU()$ function, that are present under node $x$ (in the structure of ST) but not under any of the nodes in $SLS(x)$ as the Base Strings for node $x$, or $BS(x)$. Here, $SLS(x)$ refers to the child nodes of $x$ in the OSHR tree structure. The term “base” indicates that this is the first occurrence of the string under an internal node during a post-order traversal of the OSHR tree.

The types of strings considered under each internal node $x$ can vary. These may include the following:

$•$ The set of suffixes under $x$ ($SU(x)$ as defined earlier). In this context, base strings are referred to as base suffixes.

$•$ Substrings that label paths from $x$ to each of its descendant internal nodes, defined as $PU(x)$. In this context, base strings are referred to as base paths.

$•$ Specific subsets of strings.

$•$ Strings of particular lengths.

Definition 3: OT indexing (or OT processing) refers to the process of indexing or processing strings under each internal node (based on the ST structure, denoted as node $x$) via a post-order traversal of the OSHR tree while avoiding the re-indexing of the same strings that have already been indexed or processed under any of the $SLS(x)$ nodes (i.e., indexing only $BS(x)$).

As a simple example, consider the task of performing OT indexing on the suffixes under each internal node (the set of suffixes as defined by function $SU()$) in the ST shown in Figure 1. Let us describe the OT indexing process for a subset of nodes, namely, nodes 15, 21, and 26 (noting that node 26 is the parent node of nodes 15 and 21 under the OSHR tree structure). Before beginning the post-order traversal on the OSHR tree, initialize a global list called $OT\_index$, which will store the OT index values for the suffixes under all internal nodes in ST. Now, we proceed as follows:

$•$ Node 15:

Since $SU(15) = \{5,8\}$ (suffix 5 corresponds to AATTTAACTAAG$, and suffix 8 to TTAACTAAG$) and the base suffixes at this node are also $\{5,8\}$, append 5 and 8 to $OT\_index$ (so $OT\_index$ is now equal to $\{5,8\}$). Next, create two attributes associated with node 15: $Left\_OT\_index = 0$ and $Right\_OT\_index = 1$, which correspond to the offset positions of the suffixes in the $OT\_index$ list (the suffixes for any internal node must be next to each other in the $OT\_index$ list due to the post-order traversal and the fact that $\bigcup_{n \in SLS(x)} SU(n) \subseteq SU(x)$).

$•$ Node 21:

At this node, $SU(21) = \{9,10\}$ (suffix 9 corresponds to TAACTAAG$ and suffix 10 to AACTAAG$), and the base suffixes are again $\{9,10\}$, so append them to $OT\_index$ (now $OT\_index = \{5,8,9,10\}$). Then, create and set $Left\_OT\_index = 2$ and $Right\_OT\_index = 3$ for node 21.

$•$ Node 26:

For this node, $SU(26) = \{5,8,9,10,14\}$ and the base suffixes are $\{14\}$ (suffix 14 corresponds to AAG$). So, the only suffix that now requires indexing is suffix 14, as the others were already indexed during the OT indexing of nodes 15 and 21. Therefore, append 14 to $OT\_index$ ($OT\_index$ is now equal to $\{5,8,9,10,14\}$). Next, create and set $Left\_OT\_index = 0$ and $Right\_OT\_index = 4$ at node 26. This example illustrates how all suffixes under node 26 can be indexed through OT indexing without explicitly indexing each one of them.
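The three steps above can be reproduced with a short Python sketch. It is an illustration under assumptions rather than the authors' code: nodes are identified by path labels (T for node 26, AT for node 15, TT for node 21, a mapping inferred from the suffixes quoted in the example), internal nodes are enumerated by brute force, children are visited in alphabetical order, and the function `ot_index` and its `start` parameter are hypothetical names.

```python
from collections import defaultdict

TEXT = "AGCATAATTTAACTAAG$"  # string of Figure 1

def internal_nodes(text):
    """Path labels followed by at least two distinct characters ('' is the root)."""
    followers = defaultdict(set)
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers[text[i:j]].add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}

def su(text, label):
    """SU(x) = {Depth(x) + suffix index} over the leaves under label x."""
    return {i + len(label) for i in range(len(text)) if text.startswith(label, i)}

def ot_index(text, start=""):
    """Post-order traversal of the OSHR subtree rooted at `start`."""
    sls = defaultdict(list)
    for v in sorted(internal_nodes(text)):
        if v:
            sls[v[1:]].append(v)      # reverse suffix link: SL(v) -> v
    index, ranges = [], {}

    def visit(v):
        left = len(index)             # everything under v is appended from here on
        covered = set()
        for child in sls[v]:
            visit(child)
            covered |= su(text, child)
        index.extend(sorted(su(text, v) - covered))  # base suffixes only
        ranges[v] = (left, len(index) - 1)           # Left/Right offsets of v

    visit(start)
    return index, ranges

index, ranges = ot_index(TEXT, start="T")
print(index)   # [5, 8, 9, 10, 14]
print(ranges)  # {'AT': (0, 1), 'TT': (2, 3), 'T': (0, 4)}
```

Calling `ot_index(TEXT)` without `start` traverses the whole OSHR tree; the offsets of each node then depend on the order in which sibling subtrees are visited, but each node's range still covers exactly its $SU$ set.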

The second part of this work introduces the concepts of base suffixes and base paths and proposes both linear and nonlinear algorithms to identify them under each internal node in the ST.

2.3 Base suffixes

We begin by defining base suffixes and then describe linear and nonlinear algorithms for finding base suffixes under each internal node in the ST.

Algorithm 1. Non-trivial algorithm for identifying base suffixes.

Definition 4: A base suffix is a suffix that occurs under an internal node in the ST structure, denoted as node $x$, and does not occur under any of the nodes in $SLS(x)$ (the child nodes of node $x$ in the OSHR tree structure). Thus, the set of base suffixes under node $x$ (base suffixes for node $x$, or $BS(x)$) is the set $SU(x) - \bigcup_{n \in SLS(x)} SU(n)$. If $x$ is an OSHR leaf node, i.e., $SLS(x) = \emptyset$, then all suffixes under $x$ are base suffixes.

The examples from Figure 1 help illustrate the concept of a base suffix. The base suffixes for node 26 are the set {14} (base suffix 14 corresponds to AAG$). The base suffixes for node 23 are {11, 15, 6, 1, 4} (base suffix 11 corresponds to ACTAAG$, 15 to AG$, 6 to ATTTAACTAAG$, 1 to GCATAATTTAACTAAG$, and 4 to TAATTTAACTAAG$). Because node 20 has no incoming suffix links ($SLS(20) = \emptyset$), all suffixes under it are base suffixes, namely, {12, 16, 7} (base suffix 12 corresponds to CTAAG$, 16 to G$, and 7 to TTTAACTAAG$). Node 12 has no base suffixes as all suffixes under it are already covered under the nodes of $SLS(12)$ ($SLS(12) = \{\text{node } 20\}$).
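The four examples can be checked by evaluating the set difference of Definition 4 directly. The sketch below assumes the brute-force label model (nodes named by their path labels, $SL$ dropping the first character) and the inferred mapping T, A, TAA, and AA for nodes 26, 23, 20, and 12; `base_suffixes` is a hypothetical helper name.

```python
from collections import defaultdict

TEXT = "AGCATAATTTAACTAAG$"  # string of Figure 1

def internal_nodes(text):
    """Path labels followed by at least two distinct characters ('' is the root)."""
    followers = defaultdict(set)
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers[text[i:j]].add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}

def su(text, label):
    """SU(x) = {Depth(x) + suffix index} over the leaves under label x."""
    return {i + len(label) for i in range(len(text)) if text.startswith(label, i)}

def base_suffixes(text, label, nodes):
    """BS(x) = SU(x) minus the union of SU(n) over n in SLS(x)."""
    children = [v for v in nodes if v and v[1:] == label]  # SLS(x)
    covered = set().union(*(su(text, c) for c in children))
    return su(text, label) - covered

nodes = internal_nodes(TEXT)
for label in ("T", "A", "TAA", "AA"):
    print(repr(label), sorted(base_suffixes(TEXT, label, nodes)))
# 'T' [14]
# 'A' [1, 4, 6, 11, 15]
# 'TAA' [7, 12, 16]
# 'AA' []
```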

Definition 5: If $bs$ is a base suffix under node $x$, then the extended suffixes of $bs$ are all suffixes identical to $bs$ that occur under each ancestor of $x$ (where ancestry is defined according to the OSHR tree structure).

For example, suffix 8 is a base suffix for node 15 (corresponds to TTAACTAAG$, starting from node 15 and ending at leaf node 6). The extended suffixes corresponding to this base suffix are the occurrences of TTAACTAAG$ under node 26 (ending at leaf node 11) and under the root node (ending at leaf node 10).

Observation 1: Based on Definitions 4 and 5, the upper bound on the number of extended suffixes for any base suffix is $O(h)$, where the last extended suffix of any base suffix is the one occurring under the root node.

Observation 2: Based on Definitions 4 and 5 and Observation 1, the base suffixes under all internal nodes in ST must be $n$ distinct integers ranging from 0 to $n - 1$ (i.e., indexes of all suffixes in $T$).

In the example provided in Section 2.2, once the traversal reaches the root node, the $OT\_index$ list will encompass all $n$ base suffixes, ordered as identified through the post-order traversal of the OSHR tree. Consequently, the root node must have a $Left\_OT\_index$ of 0 and a $Right\_OT\_index$ of $n - 1$.

Therefore, once a base suffix is processed or indexed, this processing or indexing can be applied implicitly to all $O(h)$ extended suffixes throughout the post-order traversal of the OSHR tree. Only the $n$ base suffixes are thus processed or indexed explicitly. As a result, OT indexing or processing of all suffixes under all internal nodes in the ST can be achieved with a complexity factor of $n$.
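To make the saving concrete, the sketch below counts, for the string of Figure 1, how many entries are indexed explicitly when only base suffixes are stored versus how many entries would be stored if every internal node indexed its full $SU$ set. It assumes the brute-force label model, treats the terminal `$` as part of $T$ (so $n = 18$ here), and uses hypothetical helper names.

```python
from collections import defaultdict

TEXT = "AGCATAATTTAACTAAG$"  # string of Figure 1, n = 18 including '$'

def internal_nodes(text):
    """Path labels followed by at least two distinct characters ('' is the root)."""
    followers = defaultdict(set)
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers[text[i:j]].add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}

def su(text, label):
    """SU(x) = {Depth(x) + suffix index} over the leaves under label x."""
    return {i + len(label) for i in range(len(text)) if text.startswith(label, i)}

def base_suffixes(text, label, nodes):
    """BS(x) = SU(x) minus the union of SU(n) over n in SLS(x)."""
    children = [v for v in nodes if v and v[1:] == label]
    covered = set().union(*(su(text, c) for c in children))
    return su(text, label) - covered

nodes = internal_nodes(TEXT)
base = {v: base_suffixes(TEXT, v, nodes) for v in nodes}

all_base = sorted(p for s in base.values() for p in s)
explicit = sum(len(s) for s in base.values())     # entries stored with OT indexing
per_node = sum(len(su(TEXT, v)) for v in nodes)   # entries if each node stored SU(x)

print(all_base == list(range(len(TEXT))))  # True
print(explicit, per_node)                  # 18 47
```

The first line of output is Observation 2 for this string: the base suffixes over all internal nodes are exactly the integers 0 to $n - 1$, each appearing once.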

Algorithm 2. Non-trivial algorithm for identifying base suffixes.

Algorithm 3. Linear algorithm for finding base suffixes.

2.3.1 Finding base suffixes

To find base suffixes under each internal node in ST, we present four approaches: a trivial algorithm with $O (n h)$ complexity, a non-trivial algorithm with $O (n h)$ complexity but more time-efficient than the trivial algorithm, a second non-trivial algorithm with $O (n h \log_{2} Σ)$ complexity, and, finally, a linear algorithm.

Trivially, all base suffixes under each internal node can be identified using the following algorithm. Build the OSHR tree (mainly to generate the $SLS$ lists for each internal node). Next, traverse the OSHR tree or ST, and at each visited node $v$: create a hash table for the set $\bigcup_{n \in SLS(v)} SU(n)$; then, check whether each suffix in $SU(v)$ exists in the hash table; if not, then that suffix is a base suffix for (under) node $v$. The cost of this algorithm is $O(nh)$.

Given Observation 2, the following non-trivial algorithm, which requires auxiliary $O(n)$ space (for a hash table named $H$), also costs $O(nh)$ but is more time-efficient than the trivial algorithm because a single table is shared by all nodes instead of being rebuilt at each one. The algorithm is stated in Algorithm 1. Briefly, during the post-order traversal of the OSHR tree, check at each visited internal node $v$ whether each suffix in $SU(v)$ is already in $H$; if not, then it is a base suffix for (under) node $v$ and is added (as a number) to $H$.
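The prose description of this hash-table approach can be sketched as follows. The sketch is based only on the paragraph above, not on the Algorithm 1 listing or the repository code; it uses the brute-force label model for the tree, a Python `set` named `seen` in the role of $H$, and a hypothetical function name.

```python
from collections import defaultdict

TEXT = "AGCATAATTTAACTAAG$"  # string of Figure 1

def internal_nodes(text):
    """Path labels followed by at least two distinct characters ('' is the root)."""
    followers = defaultdict(set)
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers[text[i:j]].add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}

def su(text, label):
    """SU(x) = {Depth(x) + suffix index} over the leaves under label x."""
    return {i + len(label) for i in range(len(text)) if text.startswith(label, i)}

def base_suffixes_with_hash(text):
    """Post-order OSHR traversal with one shared table of suffixes seen so far."""
    sls = defaultdict(list)
    for v in sorted(internal_nodes(text)):
        if v:
            sls[v[1:]].append(v)
    seen = set()   # plays the role of the auxiliary table H
    base = {}

    def visit(v):
        for child in sls[v]:
            visit(child)           # children (SLS(v)) are processed first
        base[v] = [p for p in sorted(su(text, v)) if p not in seen]
        seen.update(base[v])       # each suffix enters H exactly once

    visit("")
    return base

base = base_suffixes_with_hash(TEXT)
print(base["T"], base["A"], base["AA"])  # [14] [1, 4, 6, 11, 15] []
```

Every suffix in $SU(v)$ is still inspected at every node, which is where the $O(nh)$ bound comes from; the shared table only removes the per-node cost of rebuilding the union over $SLS(v)$.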

The second non-trivial algorithm achieves $O(nh \log_{2} Σ)$ time complexity using $O(n)$ auxiliary space, as shown in Algorithm 2. After building the OSHR tree, traverse the OSHR tree or ST, and at each visited node $v$: loop through each leaf node (let the leaf node be $l$) under node $v$, then check whether the leaf node labeled with suffix index equal to $Suffix\_Index(l) - 1$ is a descendant node under any node in $SLS(v)$; if not, then that suffix is a base suffix for (under) node $v$. In the naive approach, the cost of checking whether node $l$ is a descendant node under any node in $SLS(v)$ is $O(Σ)$, as the upper bound for the $SLS()$ list for any internal node is $Σ$, but with a simple trick (which was also implemented), the cost can be reduced to $O(\log_{2} Σ)$.
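The predecessor-leaf test in this paragraph has a compact equivalent in a label-based toy model, sketched below. This is a substitute technique for illustration, not the descendant check or the $O(\log_{2} Σ)$ trick of Algorithm 2: with nodes named by their path labels, the leaf with suffix index $Suffix\_Index(l) - 1$ lies under a node of $SLS(v)$ exactly when the preceding character prepended to the label of $v$ is itself an internal-node label. All names are hypothetical.

```python
from collections import defaultdict

TEXT = "AGCATAATTTAACTAAG$"  # string of Figure 1

def internal_nodes(text):
    """Path labels followed by at least two distinct characters ('' is the root)."""
    followers = defaultdict(set)
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers[text[i:j]].add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}

def base_suffixes_by_predecessor(text, label, nodes):
    """Base suffixes of node `label`, decided leaf by leaf via suffix index - 1."""
    depth = len(label)
    base = set()
    for i in range(len(text)):
        if not text.startswith(label, i):
            continue                      # leaf i is not under this node
        # Leaf i-1 is under a node of SLS(v) iff text[i-1] + label is internal.
        if i == 0 or text[i - 1] + label not in nodes:
            base.add(i + depth)
    return base

nodes = internal_nodes(TEXT)
print(sorted(base_suffixes_by_predecessor(TEXT, "T", nodes)))  # [14]
print(sorted(base_suffixes_by_predecessor(TEXT, "A", nodes)))  # [1, 4, 6, 11, 15]
```

For node 26 (label T), only the occurrence preceded by C survives, because CT is not an internal node while AT and TT are.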

The linear algorithm was motivated by Observation 2. As the total number of base suffixes across all internal nodes in ST is equal to $n$ , if each base suffix can be found in constant time, the total cost will be $O (n)$ . To achieve this, two definitions must be introduced.

Definition 6: Let $A$ be a leaf node in ST with suffix index $x$ and $B$ be the parent of $A$. Let $C$ be the leaf node with suffix index $x + 1$ and $D$ be the parent of $C$. If $SL(B) \neq D$, we call each node between $C$ and $D$ an inbetween node for $A$, and we call $A$ a reference leaf node for each of those inbetween nodes.

As shown in Figure 1, node 6 is a reference leaf node for node 21 and node 21 is an inbetween node for node 6.

Note that a reference leaf node can be associated with $O(h)$ inbetween nodes, and an inbetween node can correspond to $O(Σ)$ reference leaf nodes. Additionally, the total number of reference leaf nodes across all internal nodes in ST is much smaller than $n$.

Definition 7: Let $A$ be an internal node in ST, with parent $B$. Let $SL(A) = C$, and let $D$ be the parent of $C$. If $SL(B) \neq D$, we call each node between $C$ and $D$ an inbetween node for $A$, and we call $A$ a reference internal node for each of those inbetween nodes.

As illustrated in Figure 1, node 20 is a reference internal node for node 23 and node 23 is an inbetween node for node 20.

A reference internal node may have $O(h)$ inbetween nodes, and an inbetween node can correspond to $O(Σ)$ reference internal nodes. Moreover, an inbetween node may be associated with $O(Σ)$ reference leaf nodes and $O(Σ)$ reference internal nodes. Finally, the total number of reference internal nodes across all internal nodes in ST is much smaller than $n$.

The linear algorithm derives and identifies each base suffix in constant time using the inbetween nodes, reference leaf nodes, and reference internal nodes, as stated in Algorithm 3. Since the upper bound on the number of reference leaf nodes and reference internal nodes is $O(Σ)$ for any internal node (most internal nodes are not inbetween nodes), the cost for finding these nodes is $O(Σ n)$. In addition, computing each of the $n$ base suffixes has a cost of $O(1)$, as shown in Algorithm 3. Therefore, the total cost is $O(Σ n)$.

Theorem 1: Finding all base suffixes under all internal nodes in an ST can be achieved in linear time and space, $O(Σ n)$.

Once the base suffixes have been identified for each internal node in an ST in linear time, suppose the $n$ base suffixes are OT indexed using an indexing operation $P$ whose cost is $p$; the total cost for OT indexing all $n$ base suffixes is then $O(pn)$. Since the OT indexing of each base suffix is implicitly applied to each of its $O(h)$ extended suffixes, the total cost of applying $P$ to all suffixes under all internal nodes in an ST is also $O(pn)$ (as opposed to $O(pnh)$).

After finding the base suffixes under all internal nodes in an ST in linear time, several applications become feasible, particularly when combined with OT indexing. One such application is illustrated by the following example.

Let the OT indexing of the base suffixes in an ST be applied to solve the problem of exact pattern matching (which is a fundamental problem in biological applications such as read alignment, motif search, and genome annotation). Suppose there is a pattern that exactly matches one of the base suffixes under some node $v$ . In this case, the final OT index (constructed across the entire ST) can be used to determine that the pattern has an exact match under node $v$ (the matching here is with the base suffix itself) and also under every ancestor node of $v$ (with ancestry based on the OSHR tree structure), where the pattern’s exact match corresponds to the extended suffix (of the base suffix) under each ancestor node. This is achieved by explicitly applying OT indexing only on the base suffix under node $v$ , while the extended suffixes under the ancestor nodes of $v$ are implicitly OT-indexed through the post-order traversal of the OSHR tree (as described in the OT indexing example).

2.4 Base paths

The motivation for this indexing approach arises from the following observations. First, the primary source of complexity in a tree structure lies in the branching caused by internal nodes. Second, the tails of suffixes (i.e., the labels between a leaf node and its parent) are often very long, making their processing computationally expensive. Third, if a process reaches an internal node whose children are all leaf nodes, the computational cost for handling these leaves is bounded by the alphabet size $Σ$ . Consequently, instead of explicitly indexing or processing the full suffix tails, it is generally sufficient (and more efficient) to process only the labels along the paths connecting internal nodes to their descendant internal nodes.

Next, we define the concept of base paths and present algorithms for identifying base paths under each internal node in an ST, with both linear and nonlinear costs.

Algorithm 4. Non-trivial algorithm for finding base paths.

Definition 8: Let $x$ be an internal node, and let $PU(x)$ be the set of internal descendant nodes under $x$ in an ST. A base path is a path between two internal nodes, for example, nodes $A$ and $B$, such that this path does not occur between two other internal nodes $C$ and $D$, where $SL(C) = A$, $SL(D) = B$, and $D$ is a descendant node of $C$. Thus, the set of base paths under node $x$ (base paths for node $x$, or $BP(x)$) is the set $PU(x) - \{SL(y) \mid y \in \bigcup_{n \in SLS(x)} PU(n)\}$. Note that if an internal node $x$ is an OSHR leaf node ($SLS(x) = \emptyset$), then all the paths between the node and its descendant internal nodes are base paths. If the path between node $A$ and node $B$ is a base path, then node $A$ is called the top base node and node $B$ is the bottom base node.

For example, in Figure 1 (noting that the $T$ string for the suffix tree is relatively short), the set of base paths under node 23 is {node 12, node 14}. Similarly, for node 26, it is {node 20, node 21}. For the root node, the set is {node 14, node 15, node 24, node 20, node 21}.
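The set formula of Definition 8 can be evaluated directly on the string of Figure 1, as in the sketch below. It assumes the brute-force label model, in which $PU(x)$ is the set of internal-node labels that properly extend the label of $x$ and $SL$ drops the first character; the label-to-node mapping (T, TAA, TT, AT, AG, and C for nodes 26, 20, 21, 15, 14, and 24) is inferred from the examples, and the helper names are hypothetical.

```python
from collections import defaultdict

TEXT = "AGCATAATTTAACTAAG$"  # string of Figure 1

def internal_nodes(text):
    """Path labels followed by at least two distinct characters ('' is the root)."""
    followers = defaultdict(set)
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers[text[i:j]].add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}

def pu(label, nodes):
    """PU(x): internal nodes strictly below x (labels that extend x's label)."""
    return {v for v in nodes if v != label and v.startswith(label)}

def base_paths(label, nodes):
    """BP(x) = PU(x) - {SL(y) | y in PU(n) for some n in SLS(x)}, as bottom nodes."""
    children = [v for v in nodes if v and v[1:] == label]       # SLS(x)
    covered = {y[1:] for c in children for y in pu(c, nodes)}   # SL(y)
    return pu(label, nodes) - covered

nodes = internal_nodes(TEXT)
print(sorted(base_paths("T", nodes)))  # ['TAA', 'TT']
print(sorted(base_paths("", nodes)))   # ['AG', 'AT', 'C', 'TAA', 'TT']
```

These two results correspond to the sets listed for node 26 and for the root node. A literal evaluation of the same formula for node 23 (label A) also returns AT (node 15); the sketch makes no attempt to reconcile that with the set quoted for node 23.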

Definition 9: If $bp$ is a base path between a top base node $A$ and a bottom base node $B$, then the path between node $SL(A)$ and node $SL(B)$ is called an extended path of $bp$. This relationship extends recursively to all paths between the ancestor nodes of $A$ and the ancestor nodes of $B$ in the OSHR tree structure via suffix links.

For instance, the path between the root and node 25 is an extended path of the base path between nodes 23 and 14.
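The recursion in Definition 9 amounts to following the suffix links of both endpoints in lockstep until the top node is the root. A minimal sketch, assuming nodes are named by their path labels so that a suffix link simply drops the first character (A, AG, and G standing for nodes 23, 14, and 25, as inferred from the example); `extended_paths` is a hypothetical name:

```python
def extended_paths(top, bottom):
    """List the extended paths of the base path top -> bottom.

    Nodes are path labels; SL(v) is modeled as v[1:], and the root is ''.
    """
    chain = []
    while top:                              # stop once the top node is the root
        top, bottom = top[1:], bottom[1:]   # apply SL to both endpoints
        chain.append((top, bottom))
    return chain

print(extended_paths("A", "AG"))   # [('', 'G')]
print(extended_paths("T", "TAA"))  # [('', 'AA')]
```

The number of extended paths equals the number of suffix-link steps from the top base node to the root, which is the quantity Observation 3 bounds by $O(h)$.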

Observation 3: Based on Definitions 8 and 9, the upper bound on the number of extended paths for any base path is $O(h)$, where the last extended path for $bp$ is the one whose top base node is the root node.

Observation 4: Based on definitions 8 and 9 and Observation 3, any path from the root node to an internal node can be the final extended path of a base path; hence, the total number of base paths is bounded by $Σ n$ .

Algorithm 5. Linear algorithm for finding base paths.

2.4.1 Finding base paths

All base paths in an ST can be identified using a straightforward (trivial) algorithm with time complexity $O(nh)$, described as follows. The algorithm starts by building the OSHR tree, followed by a post-order traversal of the OSHR tree or ST. At each visited node $v$, it constructs a hash table containing the set $\{SL(y) \mid y \in \bigcup_{n \in SLS(v)} PU(n)\}$. Then, for each descendant internal node $d$ under $v$, if $d$ is not in the hash table, the path between $v$ and $d$ is identified as a base path. In addition, this work introduces a non-trivial algorithm with a time complexity of $O(nhΣ \log_{2} Σ)$ and a linear algorithm with both time and space complexity of $O(Σ n)$.

Algorithm 4, which is analogous to Algorithm 2, can find base paths under all internal nodes with a time complexity of $O(nhΣ \log_{2} Σ)$ and a space complexity of $O(Σ n)$. The algorithm starts by building the OSHR tree and traverses the OSHR tree or ST, and at each visited node $v$: traverse each descendant internal node under $v$ (say, node $d$) and check whether any node in $SLS(d)$ is a descendant of any node in $SLS(v)$; if not, the path between $v$ and $d$ is a base path. This check has a worst-case cost of $O(Σ^{2})$, as the maximum size of any $SLS$ set is $Σ$, but with a simple trick (which was also implemented), the cost can be reduced to $O(Σ \log_{2} Σ)$.

Since the total number of base paths is no more than $Σ n$ (Observation 4), all base paths can be found in linear time and space, $O(Σ n)$, provided that each base path is found in constant time. This is precisely what Algorithm 5 achieves by leveraging the properties of the OSHR tree and reference internal nodes (Definition 7).

Theorem 2: All base paths under all internal nodes in an ST can be found in linear time and space $O (Σ n)$ .

Once base paths are computed for each internal node in an ST, any index or process $P$ with cost $p$ applied to a base path $t$ under an internal node will implicitly apply to the $O (h)$ extended paths of $t$ . Therefore, the total cost of applying process $P$ for all paths under all internal nodes in an ST will be proportional to $n$ , costing $O (n p)$ instead of $O (p h n)$ .

The following is an example of OT indexing base paths. Let the OT index be constructed to resolve the pattern matching problem, as discussed in the example at the end of Section 2.3.1, where the pattern here is an exact match of one of the base paths under node $v$ . Then, the OT index (constructed across the entire ST) can be used to determine that the pattern has an exact match under node $v$ (here, the matching is with the base path itself) and also under every ancestor node of $v$ (the ancestry is based on the OSHR tree structure), where the exact match is the extended path (of the base path) under each ancestor node.

3 Results

To assess the correctness and effectiveness of the proposed algorithms, we evaluated them on the genomes of the following organisms (with genome sizes ranging from $\sim$0.5 MB to $\sim$100 MB): WS1 bacterium JGI 0000059-K21 (bacteria, 0.5 MB), Astrammina rara (protist, 1.5 MB), Nosema ceranae (fungus, 5.5 MB), Cryptosporidium parvum Iowa II (protist, 8.8 MB), Spironucleus salmonicida (protist, 12.5 MB), Tieghemostelium lacteum (protist, 22.8 MB), Fusarium graminearum PH-1 (fungus, 35.5 MB), Salpingoeca rosetta (protist, 54 MB), and Chondrus crispus (algae, 102.5 MB).

In the preprocessing step, header lines and newline characters were removed from each FASTA file, and all lowercase nucleotides were converted to uppercase. As a result, each genome was converted to a single-line sequence with all nucleotides in uppercase. The Python script used for this preprocessing step is available at the repository: https://github.com/aalokaily/Finding_base_suffixes_and_base_paths_in_suffix_trees.
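A preprocessing step of this kind can be sketched in a few lines of Python. The function below is an illustration of the described transformation only; it is not the script from the repository, and the name `flatten_fasta` and the sample record are made up for the example.

```python
def flatten_fasta(fasta_text):
    """Drop FASTA header lines, join the sequence lines, and uppercase the result."""
    lines = fasta_text.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">")).upper()

example = ">seq1 demo\nacgtNN\nttga\n>seq2\nGGcc\n"
print(flatten_fasta(example))  # ACGTNNTTGAGGCC
```

For genomes of the sizes listed above, reading the file line by line instead of loading it as a single string would keep memory use lower; the sketch favors brevity.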

All five algorithms presented in this study were implemented in Python and are publicly available in the aforementioned repository. Notably, the non-trivial algorithm (Algorithm 2) was excluded from the comparative analysis because it is both theoretically and empirically slower than the other non-trivial algorithm (Algorithm 1), as demonstrated by preliminary tests (data not shown). Regarding base suffix identification, the results obtained using the linear algorithm (Algorithm 3) perfectly matched those of its non-trivial counterpart (Algorithm 1) for each internal node in the ST. Across all tested genomes, the total number of base suffixes under all internal nodes is equal to $n$ . Similarly, for base path identification, the outputs of the non-trivial algorithm (Algorithm 4) and the linear algorithm (Algorithm 5) were identical across all internal nodes in an ST. Across all tested genomes, the total number of base paths remained bounded by $O (Σ n)$ . A summary of these results is provided in Table 1.

Table 1. Results from the evaluation and comparison of Algorithms 1 and 3 (for base suffix identification) and Algorithms 4 and 5 (for base path identification).

Finally, a statistical analysis was conducted to evaluate the scalability and performance differences among the proposed algorithms. The execution time for each algorithm was plotted against the genome size, as shown in Figure 2. Linear regression confirmed a strong linear relationship between genome size and runtime for the linear Algorithms 3 and 5 $(R^{2} > 0.99)$, consistent with theoretical expectations. In contrast, Algorithms 1 and 4 exhibited superlinear growth due to their dependence on variable $h$ values. One-way ANOVA showed significant differences in runtime across all algorithms $(F = \dots, p < 0.001)$. Post hoc pairwise t-tests (Bonferroni-corrected) confirmed that Algorithms 3 and 5 were significantly faster than their non-trivial counterparts (Algorithms 1 and 4, respectively; $p < 0.01$). These findings empirically support the linear-time performance of the proposed linear algorithms (Algorithms 3 and 5) across genomes of varying sizes.

Figure 2. Execution time (log–log scale) of Algorithms 1 (blue), 3 (green), 4 (orange), and 5 (red) plotted against genome size (number of nucleotides/leaf nodes). The linear trend observed for Algorithms 3 and 5 confirms their linear-time behavior, while Algorithms 1 and 4 exhibit superlinear growth.

4 Conclusion

The primary contribution of the OT indexing of base suffixes and base paths is its linear time and space cost for indexing all suffixes and paths under all internal nodes in an ST. This property is not achievable using existing suffix tree construction algorithms (such as Ukkonen’s algorithm (Ukkonen, 1995) or McCreight’s algorithm (McCreight, 1976)) or other approaches related to suffix trees. The resulting linear OT index enables indexing all suffixes or paths under all internal nodes with a complexity factor of $n$ instead of $nh$. This capability can be incorporated into more efficient solutions for problems related to next-generation sequencing analysis (Li, 2013; Hu et al., 2024; Guo et al., 2024; Wang et al., 2024a; Wang et al., 2024b) and machine learning (Zhao et al., 2025; Yue et al., 2024a; Zhao et al., 2024; Zhao et al., 2022; Yue et al., 2024b).

Data availability statement

Source code for the algorithms is available at https://github.com/aalokaily/Finding_base_suffixes_and_base_paths_in_suffix_trees. Further inquiries can be directed to the corresponding author.

Author contributions

AA: Conceptualization, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review and editing. AT: Investigation, Project administration, Supervision, Writing – review and editing.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Acknowledgments

The term “OSHR” tree is derived from the last names of the first author and his PhD committee members at the University of Connecticut, Department of Computer Science, in 2016. The committee included Chun-Hsi Huang (Major Advisor), Sanguthevar Rajasekaran, and Don Sheehy. The name Okaily–Sheehy–Huang–Rajasekaran (OSHR) honors their kind, influential, and professional guidance throughout the first author’s doctoral studies. Additionally, the abbreviation “OT” corresponds to the last names of the authors of this work.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. (2004). Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86. doi:10.1016/s1570-8667(03)00065-0

Apostolico, A., and Cunial, F. (2014). Suffix trees and arrays. J. Encycl. Algorithms, 1–10. doi:10.1007/978-3-642-27848-8_627-1

Belazzougui, D., Cunial, F., Kärkkäinen, J., and Mäkinen, V. (2020). Linear-time string indexing and analysis in small space. ACM Trans. Algorithms (TALG) 16, 1–54. doi:10.1145/3381417

Ferragina, P., and Manzini, G. (2000). “Opportunistic data structures with applications,” in Proceedings 41st annual symposium on foundations of computer science (IEEE), 390–398.

Guo, P., Li, Y., Wang, R., Chen, X., Kim, S., and Park, H. J. (2024). Deep neural network learning biological condition information refines gene-expression-based cell subtypes. Briefings Bioinforma. 25, bbad512. doi:10.1093/bib/bbad512

Hu, J., Wang, Z., Sun, Z., Hu, B., Ayoola, A. O., Liang, F., et al. (2024). NextDenovo: an efficient error correction and accurate assembly tool for noisy long reads. Genome Biol. 25, 107. doi:10.1186/s13059-024-03252-4

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv 1303.3997. doi:10.48550/arXiv.1303.3997

McCreight, E. M. (1976). A space-economical suffix tree construction algorithm. J. ACM (JACM) 23, 262–272. doi:10.1145/321941.321946

Starikovskaya, T., and Vildhøj, H. W. (2015). A suffix tree or not a suffix tree? J. Discrete Algorithms 32, 14–23. doi:10.1016/j.jda.2015.01.005

Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica 14, 249–260. doi:10.1007/bf01206331

Wang, S., Dong, K., Liang, D., Zhang, Y., Li, X., and Song, T. (2024a). Mippis: protein–protein interaction site prediction network with multi-information fusion. BMC Bioinforma. 25, 345. doi:10.1186/s12859-024-05964-7

Wang, T., Zhang, Y., Wang, H., Zheng, Q., Yang, J., Zhang, T., et al. (2024b). Fast and accurate dnaseq variant calling workflow composed of lush toolkit. Hum. Genomics 18 (1), 114. doi:10.1186/s40246-024-00666-w

Weiner, P. (1973). “Linear pattern matching algorithms,” in 14th annual symposium on Switching and Automata Theory (swat 1973) (IEEE), 1–11.

Wellnitz, P. (2021). Counting patterns in strings and graphs. Saarbrücken, Germany: Saarländische Universitäts-und Landesbibliothek. Ph.D. thesis.

Yue, J., Peng, B., Chen, Y., Jin, J., Zhao, X., Shen, C., et al. (2024a). 3dsmiles-gpt: 3d molecular pocket-based generation with token-only large language model. Chem. Sci. 15, 13727–13740. doi:10.1039/d4sc03744h

Yue, J., Peng, B., Chen, Y., Jin, J., Zhao, X., Shen, C., et al. (2024b). Unlocking comprehensive molecular design across all scenarios with large language model and unordered chemical language. Chem. Sci. 15, 13727–13740. doi:10.1039/D4SC03744H

Zhao, B.-W., Su, X.-R., Hu, P.-W., Ma, Y. P., Zhou, X., and Hu, L. (2022). A geometric deep learning framework for drug repositioning over heterogeneous information networks. Briefings Bioinforma. 23, bbac384. doi:10.1093/bib/bbac384

Zhao, B.-W., Su, X.-R., Yang, Y., Li, D.-X., Li, G.-D., Hu, P.-W., et al. (2024). A heterogeneous information network learning model with neighborhood-level structural representation for predicting lncrna–mirna interactions. Comput. Struct. Biotechnol. J. 22, 2924–2933. doi:10.1016/j.csbj.2024.06.032

Zhao, B.-W., Su, X.-R., Yang, Y., Li, D.-X., Li, G.-D., Hu, P.-W., et al. (2025). Regulation-aware graph learning for drug repositioning over heterogeneous biological network. Inf. Sci. 686, 121360. doi:10.1016/j.ins.2024.121360

Keywords: suffix trees, strings indexing, approximate pattern matching, reads alignment, motif search

Citation: Al-okaily A and Tbakhi A (2025) A novel linear indexing method for strings under all internal nodes in a suffix tree. Front. Bioinform. 5:1577324. doi: 10.3389/fbinf.2025.1577324

Received: 15 February 2025; Accepted: 07 August 2025;
Published: 04 September 2025.

Edited by:

Kang Ning, Huazhong University of Science and Technology, China

Reviewed by:

Bo-Wei Zhao, Zhejiang University, China
Osman Ali Sadek Ibrahim, Minia University, Egypt
Alok Misra, Lovely Professional University, Phagwara, Punjab, India

Copyright © 2025 Al-okaily and Tbakhi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Anas Al-okaily
