
ORIGINAL RESEARCH article

Front. Comput. Sci., 20 January 2026

Sec. Software

Volume 7 - 2025 | https://doi.org/10.3389/fcomp.2025.1655377

This article is part of the Research Topic: Software Specification and Verification: Models and Tools.

Whole-value analysis by abstract interpretation

  • Department of Environmental Sciences, Informatics and Statistics, Ca' Foscari University of Venice, Venice, Italy

Value analysis is the task of understanding what concrete values a program might compute for each variable or memory region. Historically, research focused mostly on numerical analysis (i.e., value analysis of programs manipulating numeric values), while string analyses have received wider attention in the last two decades. String analyses present a key challenge: reasoning about strings entails reasoning about integer values either used as arguments to string operations (e.g., evaluating a substring) or returned by string operations (e.g., calculating the length of a string). Traditionally, string analyses were formalized with respect to a specific numeric analysis, usually considering constant values or their possible ranges, tailoring definitions, semantic proofs, and implementations to that particular combination, hence hindering the adoption of the analyses in different contexts. This study presents a modular framework to define whole-value analyses (that is, combinations of numeric analyses, string analyses, and possibly analyses of other value types computed by a program) by Abstract Interpretation. The framework defines information exchange between the different analyses in the form of abstract constraints, allowing each analysis to operate given only a generic and analysis-independent description of the abstract values computed by other analyses. Adopting such a framework (i) ensures that soundness proofs are still valid when changing the combination of domains used, and (ii) eases the implementation and experimentation of different combinations of value analyses, simplifying comparisons between different scientific contributions and augmenting the set of domains an abstract interpreter can use to analyze a program.

1 Introduction

Static analysis allows one to verify properties of computer programs before they are executed. This is important for proving that programs do not behave incorrectly at execution time, for instance by raising a runtime error or computing wrong results. Static analysis can also provide evidence of illicit information flows, a topic highly relevant to companies whose software deals with sensitive data or is exposed to interaction with external users. To have guarantees about bug and vulnerability discovery, static analysis must go beyond simple and naïve matching of well-known patterns leading to errors and vulnerabilities (so-called code smells). Instead, formal methods such as Abstract Interpretation (Cousot and Cousot, 1977; Cousot, 2021) must be used.

Abstract Interpretation is a mathematical framework to soundly reason on program semantics. Proving non-trivial properties of such semantics is, in general, undecidable (Rice, 1953). Abstract Interpretation overcomes this by reasoning on a sound over-approximation of the uncomputable real semantics (referred to as the concrete semantics), transforming it into a so-called abstract semantics that is instead computable. While approximation recovers computability, it comes at the cost of imprecision, as more executions are considered with respect to the actual ones exhibited by the program. However, thanks to the over-approximation, properties proven to hold for the abstract semantics are guaranteed to hold also for the concrete one. The main idea behind abstract interpretation is to define the concrete semantics as the fixpoint of a monotone function. Such a function can then be abstracted to a simpler one that has to be proven sound. In practice, the monotone function is defined inductively on the syntax offered by the programming language. Such a definition allows the framework (i.e., the fixpoint formulation) to be parametric with respect to the language semantics in a way that does not require proving the correctness of the whole abstraction every time. Instead, researchers define the meaning of each instruction they want to analyze (e.g., by defining its big-step semantics), and they later abstract that meaning with an abstract domain that models some of its properties. For instance, in Cousot and Cousot (1977), numeric values were abstracted as intervals (thus preserving their ranges), and the semantics of the language was abstracted with interval arithmetic.

Several abstract domains have been proposed over the years, each with a different cost-to-precision trade-off and targeting different kinds of values and properties: numeric values (Cousot and Cousot, 1977; Logozzo and Fähndrich, 2010; Miné, 2006; Cousot and Halbwachs, 1978), string values (Costantini et al., 2015; Negrini et al., 2021; Christensen et al., 2003b), types (Cousot, 1997), dependencies (Ernst et al., 2015; Cohen, 1977), and many more. Numerical abstractions were the first to be studied, as proving numerical properties is pivotal in safety-critical contexts. More recently, string analysis gained notable traction due to the many uses strings find in programming languages. While some of these domains can operate naturally in a standalone setting (e.g., type inference can produce expression types by only relying on the types of sub-expressions), cooperation between domains is sometimes necessary. Consider, for instance, the following snippet of Go code:

start := 2

end := 7

str := "Go is a programming language"

sub := str[start : end]

reasoning on the value of sub at line 4, which is built through a substring operation, entails reasoning on both strings and integers. In fact, while most numeric domains have been formalized in isolation due to the nature of the safety-critical programs they were aimed at, string domains are defined in combination with a numeric domain of choice. Typically, such a domain tracks constant values (e.g., as in Costantini et al., 2015) or intervals modeling their ranges (e.g., as in Negrini et al., 2021). While these explicit combinations are enough to understand each article's contribution, they often become limiting when one wants to reuse the same domain in a different combination. Consider, for instance, the PREFIX domain (Costantini et al., 2015), tracking definite prefixes of string values. The authors define the semantics of substring with respect to the integer values of the start and end indices. If one were to reuse the domain in a more precise setting, e.g., in conjunction with the INTERVAL domain (Cousot and Cousot, 1977), the definition has to be lifted by applying the semantics to all valid pairs of indices. While in this case the lift is straightforward, it might not be if the string domain is more elaborate (e.g., if it uses automata as in Arceri et al., 2020). Moreover, lifting the semantics naïvely risks compromising the soundness of the analysis.
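
To make the lifting concrete, the following sketch (hypothetical Go code, not the formalization adopted in this study; all names are illustrative) shows a prefix abstraction whose substring semantics is defined only for constant indices, and a naïve lift that enumerates every admissible pair of interval indices and joins the results. Note that Go slicing is end-exclusive, unlike the inclusive σ[i : j] used later in the paper.

package main

import "fmt"

// substrPrefixConst is a hypothetical substring semantics for a prefix
// abstraction, defined only for constant indices: it returns the portion of
// the known prefix covered by the requested indices, or "" (the unknown
// string) when nothing can be kept.
func substrPrefixConst(prefix string, i, j int) string {
	if i < 0 || i > j {
		return ""
	}
	if j <= len(prefix) {
		return prefix[i:j]
	}
	if i <= len(prefix) {
		return prefix[i:]
	}
	return ""
}

// substrPrefixInterval naively lifts the constant-index semantics to interval
// indices by enumerating every valid pair and joining (here: taking the
// longest common prefix of) the results.
func substrPrefixInterval(prefix string, iLow, iHigh, jLow, jHigh int) string {
	result, first := "", true
	for i := iLow; i <= iHigh; i++ {
		for j := jLow; j <= jHigh; j++ {
			if i > j {
				continue
			}
			sub := substrPrefixConst(prefix, i, j)
			if first {
				result, first = sub, false
			} else {
				result = longestCommonPrefix(result, sub)
			}
		}
	}
	return result
}

func longestCommonPrefix(a, b string) string {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return a[:n]
}

func main() {
	// Start index fixed at 2, end index ranging over [5, 7].
	fmt.Println(substrPrefixInterval("Go is a p", 2, 2, 5, 7)) // " is"
}

The enumeration quickly becomes impractical (or unsound, if done carelessly) for richer index abstractions, which is precisely the problem the constraint-based framework avoids.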

1.1 Contribution

This study presents a framework for whole-value analyses, that is, analyses aiming to model all values computed by a program regardless of their type. Specifically, the framework is an instance of the open product (Cortesi et al., 2013), and allows client abstract domains for individual data types (e.g., integers, floats, strings, …) to exchange information modularly, without tailoring the communication to specific domain combinations. The exchange happens by means of abstract constraints that one domain can obtain from the others in the form of (in-)equalities that hold in the current state. Since the format is generic, one can add new domains to the framework or swap one of them with a more refined one without the others having to redefine or lift their semantics. We then show how abstract domains' semantics can be expressed in terms of such constraints whenever a multi-type expression (i.e., an expression whose sub-expressions and result are of heterogeneous types — e.g., Java's substring) needs to be evaluated. The strength of the framework is both theoretical and practical: domains adopting this framework (i) have their abstract semantics proven sound independently from the domain they will be combined with, and (ii) can be plugged in with different abstract domains seamlessly, with no code modifications. We then implement the framework in LiSA (Negrini et al., 2023a), and compare the results of whole-value analyses with and without the presented constraint-based framework to assess its precision. In summary:

• we define a novel constraint-based framework for whole-value analyses by abstract interpretation, proving its soundness;

• we recast the definition of some widespread abstract domains' semantics to fit our framework;

• we provide an open-source implementation of the framework in LiSA (Negrini et al., 2023a);

• we compare the analysis results of whole-value analyses with and without the proposed framework.

1.2 Paper structure

Section 2 introduces the necessary notations and notions that will be used throughout the study. Section 3 defines the Imp language, a minimalistic yet expressive imperative language that we will use for the formalization of the framework, together with its semantics. Section 4 defines the split state, a rewriting of the concrete state and semantics of Imp that simplifies the definition of abstract interpretations without introducing any loss of precision. Section 5 formalizes the framework. Section 6 reports our instantiation of the framework with some notable client abstract domains. Section 7 reports our implementation in LiSA and our experiments to assess the precision of the framework. Section 8 discusses related work. Section 9 concludes and discusses future works. Appendix A reports the proofs of all lemmas and theorems.

2 Preliminaries

2.1 Sets

A set X is a possibly infinite collection of elements, written X = {x0, x1, … }, where ∅ is the set containing no elements. A set can also be defined in terms of a predicate ϕ, causing all elements satisfying ϕ to be part of the set (i.e., X = {x | ϕ(x)}). We write x ∈ X to denote that x is part of the set X. |X| is the cardinality of X, that is, the number of elements it contains, and ℘(X) is the powerset of X, that is, the set containing all subsets of X. Given two sets X and Y, X ⊆ Y is the inclusion relation between X and Y, X ∪ Y is the set union between X and Y, X ∩ Y is the set intersection between X and Y, X \ Y is the set difference between X and Y, and X × Y is the Cartesian product between X and Y, that is, the set {(x, y) | x ∈ X ∧ y ∈ Y}.

2.2 Functions

A function f : X → Y is a subset of the Cartesian product X × Y such that ∄(x, y), (z, w) ∈ f : x = z ∧ y ≠ w. The set X is called the domain (denoted as dom(f)), the set Y is called the co-domain (denoted as codom(f)), and f(x) is the image of x in f, that is, (x, f(x)) ∈ f. Similarly to sets, a function can be defined either as a set of pairs {(x0, y0), (x1, y1), … } or using a formula Φ, written f(x) = Φ(x), indicating that f = {(x0, Φ(x0)), (x1, Φ(x1)), … }. Function id is the identity function, that is, id(x) = x. Finally, given a function f : X → Y, we denote with f+ its additive lift, that is, the function f+ : ℘(X) → ℘(Y) defined as f+(S) = {f(s) | s ∈ S}.

2.3 Strings

A string σ is a sequence of characters σ0…σn, σi ∈ Σ, with length |σ| = n + 1. We denote as Σ* the set of all possibly unbounded strings. Given a string σ ∈ Σ* and i, j ∈ ℕ, 0 ≤ i ≤ j < |σ|, we denote the subsequence σi…σj by σ[i : j]. Instead, given two strings σ = σ0…σn, σ′ = σ′0…σ′k ∈ Σ*, we write σ′ ↷ σ if σ′ is a substring of σ, that is, if ∃i, j ∈ ℕ, 0 ≤ i ≤ j < |σ|, j − i = k : σ[i : j] = σ′. Furthermore, we write σ′ ↷p σ if σ′ is a prefix of σ, that is, if ∃k < |σ| : σ[0 : k] = σ′, and σ′ ↷s σ if σ′ is a suffix of σ, that is, if ∃k < |σ| : σ[n − k : n] = σ′.

2.4 Ordered structures

A set X with a partial ordering relation ⊑X ⊆ X × X is a poset, denoted by 〈X, ⊑X〉. If a poset has a bottom element ⊥X and is closed under finitary applications of the least upper bound (lub, ⊔X) operator of X, it is called a complete partial order (cpo), denoted as 〈X, ⊑X, ⊔X, ⊥X〉. Moreover, a lattice 〈X, ⊑X, ⊔X, ⊓X〉 is a poset having a minimum element (bottom, ⊥X ∈ X), a maximum element (top, ⊤X ∈ X), and closed under finitary applications of the least upper bound (lub, ⊔X) and the greatest lower bound (glb, ⊓X) operators. A complete lattice is closed under arbitrary lub and glb, so that ⊔Y ∈ X and ⊓Y ∈ X for all Y ⊆ X, and it is denoted as 〈X, ⊑X, ⊔X, ⊓X, ⊤X, ⊥X〉. Provided there is no ambiguity, we will omit subscripts of each operator for clarity. Complete lattices can be derived from other structures. For instance, given a set X, 〈℘(X), ⊆, ∪, ∩, X, ∅〉 is a complete lattice since ⊆ is a partial ordering relation, ∪ and ∩ are closed with respect to ℘(X), and ∀Y ∈ ℘(X) : ∅ ⊆ Y ⊆ X. By duality, 〈℘(X), ⊇, ∩, ∪, ∅, X〉 is also complete. Moreover, given 〈X, ⊑, ⊔, ⊓, ⊤, ⊥〉 and a set Y, the functional lift (Cousot and Cousot, 1979) of X with respect to Y is the complete lattice 〈Y → X, ⊑̇, ⊔̇, ⊓̇, ⊤̇, ⊥̇〉 of total functions Y → X, that is, of functions defined on all elements of Y. Lattice operators are defined as point-wise applications of operators over X on all y ∈ Y. Lastly, given a finite set of complete lattices 〈Yi, ⊑Yi, ⊔Yi, ⊓Yi, ⊥Yi, ⊤Yi〉, i ∈ Δ ⊂ ℕ, their Cartesian product (Cousot, 2021) is the complete lattice 〈×i∈Δ Yi, ⊑×, ⊔×, ⊓×, ⊤×, ⊥×〉, where lattice operators are component-wise applications of the operators over each Yi. Given a poset 〈X, ⊑X〉, an increasing chain C ⊆ X is a possibly infinite sequence of elements x0, x1, … of X such that x0 ⊑X x1 ⊑X … .

2.5 Abstract interpretation

Abstract Interpretation (Cousot and Cousot, 1977; Cousot, 2021) is a theoretical framework for sound reasoning on semantic properties of a program, establishing a correspondence between the semantics of a program, called concrete semantics, and an approximation of it, called abstract semantics. Let C and A be complete lattices; a pair of functions α : C → A and γ : A → C forms a Galois Connection between C and A, written 〈C, ⊑C〉 ⇄αγ 〈A, ⊑A〉, if ∀c ∈ C, a ∈ A : α(c) ⊑A a ⟺ c ⊑C γ(a). Equivalently, the Galois Connection exists if α ∘ γ is reductive (i.e., if ∀a ∈ A : α ∘ γ(a) ⊑A a) and γ ∘ α is extensive (i.e., if ∀c ∈ C : c ⊑C γ ∘ α(c)). In addition, if α ∘ γ = id, then α and γ form a Galois Embedding, where no two abstract elements have the same concretization. Furthermore, if also γ ∘ α = id, then α and γ form a Galois Isomorphism, where A is simply a reshaping of C and no abstraction (i.e., loss of precision) happens. Note that Abstract Interpretation can be employed also when a Galois Connection does not exist: in fact, it is sufficient that C and A are complete partial orders and that a monotone concretization γ exists.

2.6 Soundness

Given 〈C, ⊑C〉 ⇄αγ 〈A, ⊑A〉, a concrete function f : C → C is, in general, not computable. Hence, a function f♯ : A → A that correctly approximates f is needed. If so, we say that the function f♯ is sound. Given 〈C, ⊑C〉 ⇄αγ 〈A, ⊑A〉 and a concrete function f : C → C, an abstract function f♯ : A → A is sound with respect to f if ∀c ∈ C : α(f(c)) ⊑A f♯(α(c)), or equivalently ∀a ∈ A : f(γ(a)) ⊑C γ(f♯(a)). Note that the latter relation can be used to prove soundness even when a Galois Connection does not exist.

2.7 Abstract domains

In the Abstract Interpretation framework, abstractions are defined through so-called abstract domains. These are composed of a partial order 〈X, ⊑X〉, possibly extended to a complete partial order, a lattice, or a complete lattice, an upper bound operator ⊔X on X, a bottom element ⊥X ∈ X, a widening operator ∇X that over-approximates ⊔X and ensures convergence on increasing chains, an abstract transformer ⟅st⟆ : X → X for evaluating statements, and an abstract transformer ⟅b⟆ : X → X for traversing conditions. The purpose of the transformers is to evolve an instance of the domain according to the semantics of the statement st to be executed, and to refine an instance of the domain assuming that a condition b holds. When a domain is non-relational, that is, it does not maintain explicit relations between program variables, a third transformer ⟅e⟆ : X → V usually exists, evaluating an expression e to an abstract value v ∈ V, with 〈V, ⊑V, ⊔V, ⊥V〉 being the complete partial order of abstract values (possibly also a lattice or a complete lattice). For instance, the domain of intervals 〈ID → 𝕀, ⊑̇, ⊔̇, ⊓̇, ⊤̇, ⊥̇〉 uses functions as abstract elements, but evaluates expressions to single intervals.

3 The imp language

We begin by defining the target programming language that we aim to analyze: IMP. IMP, whose syntax is visible in Figure 1, is a simple imperative language that features arithmetic expressions (AE, where ⊕ ∈ {+, −, *, /}, with / being integer division), Boolean expressions (BE, where ⧀ ∈ {==, !=, <, <=, >, >=}), and string expressions (SE). It then features variable assignments to integers, strings, and Booleans, branching, and looping. Note that, despite its simplicity, it features the multi-type expressions len(s), a ⧀ a, s == s, contains(s, s), and substr(s, a, a), which require mixing values from different domains to be computed (i.e., substr(s, a, a) requires both strings and integers to be computed, while a ⧀ a, s == s, len(s), and contains(s, s) can be fully computed using the input values — integers or strings — but have an output of a different type — integers or Booleans). Imp programs need to be well-typed to be valid: each variable defined and used throughout the program must be assigned values of a single data type.

Figure 1. Syntax of the Imp language.

3.1 Concrete state and semantics

Expressions of the Imp language evaluate to values in the set VAL ≜ ℤ ∪ Σ* ∪ 𝔹 ∪ {⇑}, that is, to integers, strings, Booleans ({true, false}), or to a special value ⇑ denoting an error in the evaluation. Program memories μ ∈ M : ID → VAL map program variables in the set ID to their values in VAL. Abusing notation, the set M contains a special memory ⇑ produced after invalid computations. Function ⟦st⟧ : M → M defines the semantics of each statement in terms of the effect it has on the program memory it is executed on. Expression evaluation is instead defined, abusing notation, through the function ⟦e⟧ : M → VAL that computes the value of the expression given the values of each variable. The semantics of Imp statements and expressions is standard, and it is thus not fully specified: in the following, we only report the semantics of multi-type expressions in a big-step fashion. Note that both ⟦st⟧ and ⟦e⟧ yield ⇑ if the semantics of any sub-expression appearing in their argument evaluates to ⇑.

⟦s⟧μ = σ  ⟹  ⟦len(s)⟧μ = |σ|    (3.1)
⟦a1⟧μ = n1 ∧ ⟦a2⟧μ = n2  ⟹  ⟦a1 ⧀ a2⟧μ = n1 ⧀ n2    (3.2)
⟦s1⟧μ = σ1 ∧ ⟦s2⟧μ = σ2  ⟹  ⟦s1 == s2⟧μ = (σ1 == σ2)    (3.3)
⟦s1⟧μ = σ1 ∧ ⟦s2⟧μ = σ2 ∧ σ2 ↷ σ1  ⟹  ⟦contains(s1, s2)⟧μ = true    (3.4)
⟦s1⟧μ = σ1 ∧ ⟦s2⟧μ = σ2 ∧ ¬(σ2 ↷ σ1)  ⟹  ⟦contains(s1, s2)⟧μ = false    (3.5)
⟦s⟧μ = σ ∧ ⟦a1⟧μ = i ∧ ⟦a2⟧μ = j ∧ 0 ≤ i ≤ j < |σ|  ⟹  ⟦substr(s, a1, a2)⟧μ = σ[i : j]    (3.6)
⟦s⟧μ = σ ∧ ⟦a1⟧μ = i ∧ ⟦a2⟧μ = j ∧ (i < 0 ∨ j ≥ |σ| ∨ i > j)  ⟹  ⟦substr(s, a1, a2)⟧μ = ⇑    (3.7)

Intuitively, Equation 3.1 shows that len returns the number of characters in its input. Comparisons on both integers (Equation 3.2) and strings (Equation 3.3) simply apply the comparison over the literals obtained through recursive evaluation of the arguments. The semantics of contains is partitioned according to its output: Equation 3.4 yields true if the second argument is contained in the first one, while Equation 3.5 yields false if it is not. Finally, substr returns either (i) the substring of its first argument delimited by its second and third ones, as shown in Equation 3.6, or (ii) an evaluation error if the bounds are invalid, as visible in Equation 3.7.
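
As a minimal executable counterpart of Equations 3.6 and 3.7, consider the following hypothetical Go helper (the paper's σ[i : j] is inclusive of j, hence the slice ends at j + 1):

package main

import (
	"errors"
	"fmt"
)

// errEval plays the role of the error value ⇑ produced by invalid computations.
var errEval = errors.New("evaluation error")

// substr mirrors Equations 3.6 and 3.7: it returns σ[i : j] (inclusive of j)
// when 0 <= i <= j < |σ|, and an evaluation error otherwise.
func substr(sigma string, i, j int) (string, error) {
	if i < 0 || j >= len(sigma) || i > j {
		return "", errEval // Equation 3.7
	}
	return sigma[i : j+1], nil // Equation 3.6
}

func main() {
	fmt.Println(substr("Go is a programming language", 2, 7)) // " is a ", nil
	fmt.Println(substr("Go", 1, 5))                           // "", evaluation error
}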

As usual in Abstract Interpretation, the concrete semantics is lifted to the collecting semantics by using sets of possible program memories. Specifically, instead of considering a single memory, accounting for an individual program execution, the collecting semantics considers possibly infinite sets of memories, describing possibly infinite sets of executions. In this setting, the semantics of statements and expressions are defined as the additive lift of the concrete ones. The statement collecting semantics ⟦st⟧+ : ℘(M) → ℘(M) is thus defined as ⟦st⟧+M ≜ {⟦st⟧μ | μ ∈ M}, while the expression collecting semantics ⟦e⟧+ : ℘(M) → ℘(VAL) is defined as ⟦e⟧+M ≜ {⟦e⟧μ | μ ∈ M}. Lastly, note that sets of program memories are elements of the complete powerset lattice 〈℘(M), ⊆, ∪, ∩, M, ∅〉.

4 The split state

The collecting semantics of Imp programs defined in Section 3 provides information on all concrete executions of a given program, but it is not computable. Following the Abstract Interpretation framework, we aim at abstracting the collecting semantics, regaining decidability by introducing imprecision. While it is possible to design an abstraction for the collecting semantics we defined, it would be inconvenient for the purpose of this work. We aim at building a framework where semantic computations are delegated to several abstract domains operating on disjoint data types: abstract states will thus be the conjunction of the states of the individual domains. A direct abstraction would thus complicate both definitions and proofs, since the conversion from a monolithic state to a partitioned one would be necessary at each step.

Instead, we adapt the idea from Ferrara (2016) of introducing an intermediate abstraction. In this section, we define a split state, that is, a rewriting of the concrete state (i.e., program memories and concrete semantics) into one where values of different data types are stored in separate maps. This rewriting models the concrete state in a way that is convenient for the remainder of this work, simplifying definitions and proofs. Having separate memories also facilitates modular abstractions: existing abstract domains (both non-relational and relational) can be employed to abstract individual sub-memories. Concretizations and transformers from such domains can be used as-is, only redefining the evaluation of expressions spanning multiple data types. Moreover, such a rewriting does not introduce imprecision: proving soundness and/or completeness with respect to the split state is thus equivalent to proving it with respect to the concrete state.

4.1 Split memories

We begin by partitioning program memories, grouping program variables by the type of values they hold. Since Imp can store values of three different types (integers, strings, and Booleans), we define three type-specific memories: μ¯a ∈ A : ID → ℤ stores the values of integer variables, μ¯s ∈ S : ID → Σ* stores the values of string variables, and μ¯b ∈ B : ID → 𝔹 stores the values of Boolean variables. Split memories are μ¯ ∈ M¯ ⊆ A × S × B, that is, tuples composed of a function for each data type supported by the language. Similarly to M, M¯ also contains a special memory ⇑. In the following, we will refer to a split memory as either μ¯ or (μ¯a, μ¯s, μ¯b). Being a rewriting of the concrete state, split memories still refer to a single execution and should hold a single value for any given variable. We thus consider only valid split memories, that is, memories μ¯ such that dom(μ¯a) ∩ dom(μ¯s) = dom(μ¯a) ∩ dom(μ¯b) = dom(μ¯s) ∩ dom(μ¯b) = ∅. Split memories arbitrarily constructed to hold information on the same variable in more than one type-specific memory are ignored, as they cannot arise when abstracting a valid concrete memory.

Example 4.1. Consider the program memory μ = {(x, 42), (y, “hello”), (z, true)}. The corresponding split memory μ¯=(μ¯a,μ¯s,μ¯b) is such that μ¯a={(x,42)}, μ¯s={(y,“hello”)}, and μ¯b={(z,true)}. The means for commuting between the two representations will be defined in the following section.
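
The partitioning of Example 4.1 can be pictured with the following sketch (hypothetical Go types and names; the formal abstraction function is given in the next section):

package main

import "fmt"

// Value stands for the concrete set VAL restricted to the three Imp data types.
type Value any

// SplitMemory groups variables by the type of the value they hold,
// mirroring the triple (μ¯a, μ¯s, μ¯b).
type SplitMemory struct {
	Ints    map[string]int
	Strings map[string]string
	Bools   map[string]bool
}

// split partitions a concrete memory into the three type-specific maps.
func split(mu map[string]Value) SplitMemory {
	out := SplitMemory{
		Ints:    map[string]int{},
		Strings: map[string]string{},
		Bools:   map[string]bool{},
	}
	for k, v := range mu {
		switch t := v.(type) {
		case int:
			out.Ints[k] = t
		case string:
			out.Strings[k] = t
		case bool:
			out.Bools[k] = t
		}
	}
	return out
}

func main() {
	mu := map[string]Value{"x": 42, "y": "hello", "z": true}
	fmt.Printf("%+v\n", split(mu))
}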

4.2 Abstraction and concretization

Before defining α¯ and γ¯, used to convert concrete states into split states and vice-versa, we introduce two operators that will be used in their formalization: the restriction operator ⇃ and the union operator ⊎.

Definition 1 (Function restriction). Given a function f : K → V and two sets X ⊆ K and Y ⊆ V, the function restriction operator ⇃YX : (K → V) → (X → Y) yields the function f ⇃YX = {(x, f(x)) | x ∈ X ∧ f(x) ∈ Y}, that is, it restricts the input function f on the elements in the desired domain and co-domain.

Definition 2 (Function union). Given two functions f : X → W and g : Y → Z such that X ∩ Y = ∅, the function union operator ⊎ : (X → W) × (Y → Z) → (X ∪ Y → W ∪ Z) yields the function f ⊎ g = {(k, v) | (k ∈ dom(f) ∧ v = f(k)) ∨ (k ∈ dom(g) ∧ v = g(k))}, that is, it combines the input functions f and g. Since the domains of f and g are disjoint, f ⊎ g is also a function.

Intuitively, the above operators are used to split a function into sub-functions and to join them back to the original function, respectively, as shown in the following example.

Example 4.2. Consider the function f = {(1, 2), (2, 3), (3, 4)} and the sets X = {1, 2} and Y = {2, 3}. Let X¯ = dom(f) \ X and Y¯ = codom(f) \ Y. The restriction of f to X and Y is f ⇃YX = {(1, 2), (2, 3)}, and its complement is f ⇃Y¯X¯ = {(3, 4)}. The union of the restriction and its complement is f ⇃YX ⊎ f ⇃Y¯X¯ = {(1, 2), (2, 3), (3, 4)} = f.

To avoid cluttering the notation, the superscript of ⇃ will be omitted whenever it coincides with the domain of the whole function. Note that both definitions can be trivially generalized to a partitioning that generates an arbitrary number of sub-functions. We are now in a position to define α¯ and γ¯.
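
On finite maps, the two operators amount to the following sketch (hypothetical Go helpers mirroring Example 4.2; the set arguments play the role of the desired domain and co-domain):

package main

import "fmt"

// restrict keeps only the pairs whose key is in dom and whose value is in
// codom, mirroring the restriction operator ⇃ of Definition 1.
func restrict(f map[int]int, dom, codom map[int]bool) map[int]int {
	out := map[int]int{}
	for k, v := range f {
		if dom[k] && codom[v] {
			out[k] = v
		}
	}
	return out
}

// union merges two maps with disjoint domains, mirroring ⊎ of Definition 2.
func union(f, g map[int]int) map[int]int {
	out := map[int]int{}
	for k, v := range f {
		out[k] = v
	}
	for k, v := range g {
		out[k] = v
	}
	return out
}

func main() {
	f := map[int]int{1: 2, 2: 3, 3: 4}
	r := restrict(f, map[int]bool{1: true, 2: true}, map[int]bool{2: true, 3: true})
	c := restrict(f, map[int]bool{3: true}, map[int]bool{4: true})
	fmt.Println(r, c, union(r, c)) // map[1:2 2:3] map[3:4] map[1:2 2:3 3:4]
}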

Definition 3 (Abstraction of concrete states). The abstraction function α¯ : ℘(M) → ℘(M¯) converts sets of program memories to sets of split memories by converting each individual memory through the function α· : M → M¯. Formally:

α¯({μ1, …, μk}) = {α·(μ) | μ ∈ {μ1, …, μk}}  where  α·(μ) = (μ ⇃ℤ, μ ⇃Σ*, μ ⇃𝔹).

Lemma 4 (Monotonicity of α¯). Function α¯ is monotone, that is, ∀M1, M2 ⊆ M : M1 ⊆ M2 ⟹ α¯(M1) ⊆ α¯(M2). Proof in Appendix A.1.

Definition 5 (Concretization of split states). The concretization function γ¯ : ℘(M¯) → ℘(M) converts sets of split memories to sets of program memories by converting each individual memory through the function γ· : M¯ → M. Formally:

γ¯({μ¯1, …, μ¯k}) = {γ·(μ¯) | μ¯ ∈ {μ¯1, …, μ¯k}}  where  γ·(μ¯) = μ¯a ⊎ μ¯s ⊎ μ¯b.

Lemma 6 (Monotonicity of γ¯). Function γ¯ is monotone, that is, ∀M1, M2 ⊆ M¯ : M1 ⊆ M2 ⟹ γ¯(M1) ⊆ γ¯(M2). Proof in Appendix A.1.

Example 4.3. Consider again the program memory μ = {(x, 42), (y, “hello”), (z, true)} from Example 4.1. The abstraction μ¯ = (μ¯a, μ¯s, μ¯b) of such memory is computed by α· by partitioning the co-domain. Specifically, μ¯a = μ ⇃ℤ = {(k, μ(k)) | k ∈ dom(μ) ∧ μ(k) ∈ ℤ} = {(x, 42)}. Similarly, μ¯s = μ ⇃Σ* = {(y, “hello”)} and μ¯b = μ ⇃𝔹 = {(z, true)}. Since the domains of each type-specific memory are disjoint, the concretization of μ¯ through γ· simply joins the memories together, obtaining μ.

Lemma 7 (Invertibility of α¯ and γ¯). The composition of α¯ and γ¯ corresponds to the identity function, that is, α¯ ∘ γ¯ = id. Moreover, the composition of γ¯ and α¯ corresponds to the identity function, that is, γ¯ ∘ α¯ = id. Thus, α¯ is the inverse of γ¯, and vice-versa. Proof in Appendix A.1.

Both the abstraction and concretization functions thus proceed by converting each input memory individually, where the individual conversions partition the memory by data type in the abstraction, and combine the type-specific memories back in the concretization. Since both functions are monotone and they compose to the identity function (as shown in Appendix A.1), they induce the Galois Isomorphism 〈℘(M), ⊆〉 ⇄α¯γ¯ 〈℘(M¯), ⊆〉.

4.3 Split semantics of statements

Instead of specifying a custom semantics for the split state, we take full advantage of the isomorphism. In fact, a well-known property of Galois Connections (and thus of Isomorphisms) is that the abstraction function yields the best possible abstraction of any concrete element. This means that the split collecting semantics of a statement can be computed in the concrete after applying γ¯, and then abstracted back to the split setting by α¯. Formally, the collecting split statement semantics ⦇st⦈+ : ℘(M¯) → ℘(M¯) is defined as ⦇st⦈+ = α¯ ∘ ⟦st⟧+ ∘ γ¯. Such a definition is always possible within Abstract Interpretation, albeit not desirable: computability is not recovered, as the undecidable concrete semantics is involved in the computation. However, since the purpose of the split state is to provide a more convenient starting point for successive abstractions, uncomputability is not a concern. Similarly, we define the split collecting semantics of expressions ⦇e⦈+ : ℘(M¯) → ℘(VAL) as ⦇e⦈+ = ⟦e⟧+ ∘ γ¯ (here, α¯ is not involved in the computation since ⦇e⦈+ produces concrete values that do not need to be abstracted). Note that sets of split memories are elements of the complete powerset lattice 〈℘(M¯), ⊆, ∪, ∩, M¯, ∅〉.

While the presence of a Galois Isomorphism ensures that no precision is lost when commuting between concrete and split states, imprecision might still arise during computations in the split semantics. We thus also have to prove the equivalence (i.e., soundness and completeness) of the split semantics with respect to the concrete one. When two semantics f and f♯ are expressed in fixpoint form, it is enough to show that α ∘ f = f♯ ∘ α to ensure that α(lfp f) = lfp f♯ (Cousot and Cousot, 1979), where lfp f is the least fixpoint of the iterates of f. Instead, since both the concrete and split semantics are defined in their big-step forms, we have to prove that ∀st ∈ STMT : ⟦st⟧+ ∘ γ¯ = γ¯ ∘ ⦇st⦈+.

Theorem 8 (Equivalence of concrete and split semantics). For all sets of split memories M ⊆ M¯, statements st ∈ STMT, and expressions e ∈ EXPR, both ⟦st⟧+ γ¯(M) = γ¯(⦇st⦈+ M) and ⟦e⟧+ γ¯(M) = ⦇e⦈+ M hold. Proof in Appendix A.2.

We thus conclude that the split state is a rephrasing of the concrete state, with no loss of precision introduced either in the conversion between the two or during the semantics computations over the split state.

5 Constraint-based whole-value analysis

By considering split states instead of concrete ones, we start from a setting where there is a clear distinction between variables holding integers (μ¯a), ones holding strings (μ¯s), and ones holding Booleans (μ¯b). We can thus exploit a combination of existing domains in our framework, one abstracting each set of variables. We then simply need to specify how these domains can communicate to exchange information on values crossing type boundaries. We thus assume that the integer part of the memory μ¯a is abstracted by an abstract domain A♯, the string part of the memory μ¯s is abstracted by an abstract domain S♯, and that the Boolean part of the memory μ¯b is abstracted by an abstract domain B♯. In the following, let X♯ be an abstract domain in {A♯, S♯, B♯}, abstracting the respective type-specific memory M¯X in {A, S, B}. We only require X♯ to provide the minimum ingredients for Abstract Interpretation:

• it must form a partial order, denoted 〈X♯, ⊑X♯〉, and should provide an upper bound operator ⊔X♯ and a bottom element ⊥X♯;

• it provides a widening operator ∇X♯ (if not needed by the domain itself, it can simply delegate to ⊔X♯);

• it defines a monotone concretization function γX♯ : X♯ → ℘(M¯X);

• it defines abstract transformers for the statement semantics ⟅st⟆X♯ and expression evaluation ⟅e⟆X♯ that are sound (i.e., they over-approximate what the collecting semantics would compute on statements and expressions of the data type they abstract).

Note that, while the existence of ⟅e⟆ is typical of non-relational domains, relational ones that are to be employed in whole-value analyses still need to produce an abstraction of an expression's value to provide to other domains. Thus, the framework is not limited to non-relational domains only.
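
In an implementation, the requirements listed above can be summarized by an interface such as the following sketch (hypothetical Go code; it is not LiSA's actual API, and the concretization function is omitted since it is a proof artifact rather than executable code):

package main

// Domain lists the minimal ingredients a client abstract domain X♯ must
// provide to take part in the framework (a hypothetical sketch). Elem is the
// type of the domain's abstract elements.
type Domain[Elem any] interface {
	// Partial order, upper bound, and bottom element.
	LessOrEqual(a, b Elem) bool
	Lub(a, b Elem) Elem
	Bottom() Elem
	// Widening; domains not needing one can simply delegate to Lub.
	Widen(a, b Elem) Elem
	// Abstract transformers for statements and (single-type) expressions.
	EvalStatement(st Statement, state Elem) Elem
	EvalExpression(e Expression, state Elem) Value
}

// Statement, Expression, and Value are placeholders for the analyzer's IR and
// for the domain's value abstraction.
type (
	Statement  struct{}
	Expression struct{}
	Value      struct{}
)

func main() {}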

5.1 Abstract states

We define the states of our framework as a special instance of the Cartesian product of the base domains.

Definition 9 (Abstract program memories). In our framework, abstract program states are defined as μ♯ ∈ M♯ ≜ A♯ ⊗ S♯ ⊗ B♯, that is, as tuples of instances of the three domains. Here, ⊗ denotes the smash product (Arceri and Maffeis, 2017), that is, a form of reduced product (Cortesi et al., 2013) where a bottom element from one component is propagated to all others. Being a Cartesian product (Section 2), operators over M♯ are element-wise applications of the ones over the sub-domains:

(a♯1, s♯1, b♯1) ⊑M♯ (a♯2, s♯2, b♯2) ⟺ a♯1 ⊑A♯ a♯2 ∧ s♯1 ⊑S♯ s♯2 ∧ b♯1 ⊑B♯ b♯2;

(a♯1, s♯1, b♯1) ⊔M♯ (a♯2, s♯2, b♯2) = (a♯1 ⊔A♯ a♯2, s♯1 ⊔S♯ s♯2, b♯1 ⊔B♯ b♯2);

(a♯1, s♯1, b♯1) ∇M♯ (a♯2, s♯2, b♯2) = (a♯1 ∇A♯ a♯2, s♯1 ∇S♯ s♯2, b♯1 ∇B♯ b♯2).

Note that, being built with a smash product (that is, a special instance of the Cartesian product), M♯ and its operators have the same properties as the three sub-domains (i.e., ⊑M♯ is a partial order, ⊔M♯ is an upper bound operator, ⊥M♯ is the bottom element, and ∇M♯ is a widening operator). In the following, we will refer to an abstract memory as either μ♯ or (a♯, s♯, b♯).

5.2 Concretization and abstract semantics

Before discussing how to define the semantics of our framework, we first have to define how abstract states concretize to split states, which is essential to prove soundness. As previously mentioned, we assume γA♯ : A♯ → ℘(ID → ℤ), γS♯ : S♯ → ℘(ID → Σ*), and γB♯ : B♯ → ℘(ID → 𝔹) to be monotone functions, concretizing instances of A♯, S♯, and B♯ to sets of memories containing integers, strings, and Booleans, respectively. Since our objective is to connect M♯ to M¯, we equip our framework with a function γM♯ : M♯ → ℘(M¯) by exploiting the three provided concretizations.

Definition 10 (Concretization of abstract states). The concretization function γM♯ : M♯ → ℘(M¯) converts an abstract state to a set of split memories. Formally:

γM♯(μ♯) = {(μ¯a, μ¯s, μ¯b) | μ¯a ∈ γA♯(a♯) ∧ μ¯s ∈ γS♯(s♯) ∧ μ¯b ∈ γB♯(b♯)}.

Note that this definition is only possible since Imp programs are well-typed: each variable will appear in the state of exactly one domain, and each split memory is thus guaranteed to be valid (i.e., the type-specific memories have disjoint domains). The extension of this framework to programs that are not well-typed is straightforward, as it just entails a new definition of γM♯. We exclude this from our work since it would add unnecessary complexity to all proofs.

Lemma 11 (Monotonicity of γM♯). Function γM♯ is monotone, that is, ∀μ♯1, μ♯2 ∈ M♯ : μ♯1 ⊑M♯ μ♯2 ⟹ γM♯(μ♯1) ⊆ γM♯(μ♯2). Proof in Appendix A.3.

Example 4.4. Suppose that the framework is instantiated using the numerical constant propagation abstraction, the string prefix abstraction (Costantini et al., 2015), and the Boolean powerset abstraction. Abstract states are thus 〈ID → ℤ ∪ {⊤, ⊥}〉 ⊗ 〈ID → Σ*〉 ⊗ 〈ID → ℘(𝔹)〉. Consider the state μ♯ = (a♯, s♯, b♯) where a♯ = {(x, 42)}, s♯ = {(y, “foo”)}, and b♯ = {(z, true)}. The individual domain concretizations produce sets of memories containing integers, strings, and Booleans, respectively. When applied to the given state, they thus produce {{(x, 42)}}, {{(y, σ)} | “foo” ↷p σ}, and {{(z, true)}}, respectively. The concretization function γM♯ returns all possible combinations of such memories, producing the set {({(x, 42)}, {(y, σ)}, {(z, true)}) | “foo” ↷p σ}.

5.3 Abstract semantics

As typical in Cartesian products, the abstract statement semantics ⟅st⟆ : M♯ → M♯ of our framework is defined as ⟅st⟆(a♯, s♯, b♯) = (⟅st⟆A♯ a♯, ⟅st⟆S♯ s♯, ⟅st⟆B♯ b♯), where ⟅st⟆A♯ is the statement semantics of A♯, ⟅st⟆S♯ is the statement semantics of S♯, and ⟅st⟆B♯ is the statement semantics of B♯. The state of each component of the framework thus evolves in isolation, storing information on disjoint sets of variables. Note that such a semantics is inherently sound, as it is the element-wise application of sound abstract transformers. Our framework goes beyond the Cartesian product: in fact, we adopt the open product framework (Cortesi et al., 2013) to allow for modular communication between the different domains. We thus redefine how multi-type expressions are evaluated within the framework. The key idea is to let one domain evaluate the expression, while providing it with an additional tool: abstract constraints. These are formalized as a subset of the Boolean expressions BE, parametric on the expression being constrained, denoted BE¯(e).

Definition 12 (Abstract constraints set BE¯(e)). The set of abstract constraints BE¯(e), parametric over an expression e, is a set of Boolean expressions of the form 〈v ⊙ e〉, with a value v ∈ VAL on the left-hand side and the expression e on the right-hand side. Given an expression e ∈ EXPR, BE¯(e) contains the elements:

BE¯(e) ≜
  {〈v ⧀ e〉 | v ∈ ℤ}    if e ∈ AE;
  {〈v == e〉 | v ∈ 𝔹}    if e ∈ BE;
  {〈v ⊙ e〉 | v ∈ Σ* ∧ ⊙ ∈ {==, !=, ↷, ↷p, ↷s}} ∪ {〈v ⧀ len(e)〉 | v ∈ ℤ}    if e ∈ SE.

Depending on the type of expression e, BE¯(e) thus contains (i) numeric (in-)equalities, (ii) Boolean equalities, or (iii) constraints about the contents and length of a string. A set of constraints B ∈ ℘(BE¯(e)) represents definite information: if B = ∅ then all possible concrete values compatible with the expression's type are allowed, while each generated constraint reduces the set of possible values. Furthermore, a special constraint set ↯ is used to denote an invalid set (e.g., one with contradicting constraints), which models an expression whose evaluation leads to an error. The powerset of constraints forms the complete lattice 〈℘(BE¯(e)), ⊇, ∩, ∪, ∅, BE¯(e)〉, whose elements can be converted to sets of concrete values using the concretization function γBE¯(e).
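
Concretely, a constraint set can be represented as in the following sketch (hypothetical Go types; the special set ↯ becomes a distinguished flag):

package main

import "fmt"

// Constraint is one element of BE¯(e): a relation between a known value and
// the constrained expression e (e itself is implicit, since the whole set is
// parametric on it). Rel ranges over ==, !=, <=, >=, the string relations
// ↷, ↷p, ↷s, and the length forms such as <= len(e).
type Constraint struct {
	Rel   string
	Value any // an integer, string, or Boolean constant
}

// ConstraintSet stands for an element of ℘(BE¯(e)); Invalid marks ↯.
type ConstraintSet struct {
	Invalid     bool
	Constraints []Constraint
}

func main() {
	// Constraints for a string expression known to start with "foo":
	// 3 <= len(e) and "foo" ↷p e.
	b := ConstraintSet{Constraints: []Constraint{
		{Rel: "<= len", Value: 3},
		{Rel: "prefix-of", Value: "foo"},
	}}
	fmt.Printf("%+v\n", b)
}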

Definition 13 (Concretization of abstract constraints). The concretization function γBE¯(e) : ℘(BE¯(e)) → ℘(VAL) converts a set of abstract constraints to the set of concrete values satisfying them. Formally:

γBE¯(e)(B) = {v | ∀〈v′ ⊙ e〉 ∈ B : v′ ⊙ v}  with  γBE¯(e)(↯) = {⇑}.

Note that γBE¯(e) is trivially monotone, since fewer constraints (recall that the ordering relation over ℘(BE¯(e)) is ⊇) lead to a larger set of values satisfying them.

Abstract domains need to both (i) generate constraints on expressions they can handle (i.e., of the type they model), and (ii) generate an abstract element from a set of constraints, in order to interpret the information provided to them by the other domains. We thus define two functions, ℂ and 𝔾, that model these operations. In the following, let D♯ be an abstract domain providing all the ingredients enumerated at the beginning of Section 5, and let V♯ be the complete partial order of the values produced by ⟅e⟆D♯.

Definition 14 (Abstract constraint function ℂ). Given an abstract domain D♯, the abstract constraint function ℂD♯ : D♯ × EXPR → ℘(BE¯(e)), where the second argument corresponds to the parameter of the generated constraint set, yields Boolean constraints about an expression e ∈ EXPR, generating bounds on its concrete values based on the information contained in an instance of the domain d♯ ∈ D♯. The result of the function is a set of Boolean constraints parametric in the input expression.

ℂ can thus be used to restrict the set of values an expression can evaluate to. Note that, by passing a relational expression to ℂ, one can gain relational information despite the output being given as non-relational constraints. We assume that sets generated by ℂ and received by 𝔾 are minimal, that is, they do not include constraints that are implied by other ones already in the set. Also, recall that no contradicting constraints can be generated, as they would lead to the invalid set of constraints ↯.

Definition 15 (Constraint interpretation function 𝔾). Given an abstract domain D♯ that models expression values as elements of a complete partial order V♯, a set of constraints parametric over an expression e ∈ EXPR can be materialized to an abstract value v♯ ∈ V♯ that satisfies all constraints using function 𝔾D♯ : ℘(BE¯(e)) → V♯.

𝔾 is thus able to generate an abstract element that soundly abstracts (i.e., over-approximates) the concrete values identified by the conjunction of the provided constraints. All domains taking part in the framework thus have to define ℂ and 𝔾 to be employed in constraint-based whole-value analyses. Formalizations and examples of both ℂ and 𝔾 are provided in Section 6. We are now in a position to define how multi-type expressions are evaluated within the framework. We express such evaluations in big-step notation.
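
In code, the two functions become two additional operations that every participating domain implements, as in the following sketch (hypothetical Go signatures, not LiSA's actual API; the placeholder types stand for the constraint sets, the analyzer's IR, and the value abstraction V♯):

package main

// ConstraintSet stands for an element of ℘(BE¯(e)) (see the sketch after
// Definition 12); Expression and Value are placeholders for the analyzer's IR
// and for the domain's value abstraction.
type (
	ConstraintSet struct{}
	Expression    struct{}
	Value         struct{}
)

// ConstraintProvider is the counterpart of ℂ and 𝔾: a domain generates
// constraints about an expression from its current state, and materializes a
// set of constraints received from other domains into one of its own values.
type ConstraintProvider[Elem any] interface {
	// Constraints corresponds to ℂD♯(d♯, e).
	Constraints(state Elem, e Expression) ConstraintSet
	// Materialize corresponds to 𝔾D♯(B).
	Materialize(b ConstraintSet) Value
}

func main() {}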

⟅len(s)⟆μ♯ = 𝔾A♯({〈n ⧀ len(s)〉 | 〈n ⧀ len(s)〉 ∈ ℂS♯(s♯, s)})    (5.1)
⟅a1 ⧀ a2⟆μ♯ = 𝔾B♯(ℂA♯(a♯, a1 ⧀ a2))    (5.2)
⟅s1 == s2⟆μ♯ = 𝔾B♯(ℂS♯(s♯, s1 == s2))    (5.3)
⟅contains(s1, s2)⟆μ♯ = 𝔾B♯(ℂS♯(s♯, contains(s1, s2)))    (5.4)
A1 = ℂA♯(a♯, a1) ∧ A2 = ℂA♯(a♯, a2)  ⟹  ⟅substr(s, a1, a2)⟆μ♯ = ⟅substr(s, A1, A2)⟆S♯ s♯    (5.5)
e = (a1 < 0 or a2 >= len(s) or a1 > a2) ∧ 〈true == e〉 ∈ ℂA♯(a♯, e)  ⟹  ⟅substr(s, a1, a2)⟆μ♯ = ⊥S♯    (5.6)

When all arguments are of a coherent type, the evaluation is straightforward: Equations 5.2–5.4 proceed by (i) generating constraints over the result using ℂ on the appropriate domain instance (a♯ or s♯), and (ii) converting the constraints to a Boolean abstraction using 𝔾B♯. The evaluation of len(s) in Equation 5.1 is slightly different: the string domain generates constraints on the expression s starting from s♯, and only the ones regarding its length are kept. Such constraints are then passed to 𝔾A♯ to build an abstract integer. The evaluation of substr(s, a1, a2) is more complex: if one of the necessary conditions is violated, as described in Equation 5.6, the bottom element is generated. Otherwise, the domain-dependent semantics ⟅substr(s, A1, A2)⟆S♯ is invoked in Equation 5.5, taking as input a description of the integer arguments in the form of sets of constraints. Note that, whenever the computation happens in a domain but produces a value of another type (as, e.g., in Equation 5.1), the cross-domain communication happens entirely through functions ℂ and 𝔾. Instead, when an operation has parameters of different types, the semantics of that operation must be redefined in terms of constraints, as in Equation 5.5.
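
As an illustration, the evaluation of len(s) in Equation 5.1 amounts to the following steps (hypothetical Go sketch; the two function parameters stand for ℂS♯ and 𝔾A♯, and the interval encoding is simplified):

package main

import "fmt"

// Constraint is a simplified length constraint of the form n <= len(e) or
// n >= len(e).
type Constraint struct {
	Rel   string // "<= len" or ">= len"
	Bound int
}

// evalLen mirrors Equation 5.1: the string domain produces constraints about
// s, only the length constraints are kept, and the numeric domain's 𝔾 turns
// them into an abstract integer (here, an interval as a pair of bounds).
func evalLen(
	stringConstraints func() []Constraint, // plays the role of ℂS♯(s♯, s)
	materialize func([]Constraint) [2]int, // plays the role of 𝔾A♯
) [2]int {
	kept := []Constraint{}
	for _, c := range stringConstraints() {
		if c.Rel == "<= len" || c.Rel == ">= len" {
			kept = append(kept, c)
		}
	}
	return materialize(kept)
}

func main() {
	// A prefix domain tracking "foo" only knows that 3 <= len(s); an interval
	// domain materializes this as [3, +inf) (here +inf is the maximum int).
	prefixC := func() []Constraint { return []Constraint{{Rel: "<= len", Bound: 3}} }
	intervalG := func(cs []Constraint) [2]int {
		low, high := 0, int(^uint(0)>>1) // lengths are non-negative
		for _, c := range cs {
			if c.Rel == "<= len" && c.Bound > low {
				low = c.Bound
			}
			if c.Rel == ">= len" && c.Bound < high {
				high = c.Bound
			}
		}
		return [2]int{low, high}
	}
	fmt.Println(evalLen(prefixC, intervalG)) // [3 9223372036854775807]
}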

Example 5.2. Let us consider the same setting of Example 4.4, where the framework is instantiated as 〈ID → ℤ ∪ {⊤, ⊥}〉 ⊗ 〈ID → Σ*〉 ⊗ 〈ID → ℘(𝔹)〉, and consider the same state μ♯ = (a♯, s♯, b♯) where a♯ = {(x, 42)}, s♯ = {(y, “foo”)}, and b♯ = {(z, true)}. If we were to evaluate the assignment k = len(y) + 1, the framework would employ the multi-type expression semantics first: ℂPR(s♯, y) evaluates, as will be discussed in Section 6.5, to the set {〈3 <= len(y)〉, 〈“foo” ↷p y〉}, of which only the length constraint is kept (intuitively, since “foo” is a prefix, every string stored in y is guaranteed to have length at least 3, with no information on its maximum length). The set of constraints is then passed to the constant propagation domain, and 𝔾CP({〈3 <= len(y)〉}) yields, as will be defined in Section 6.1, ⊤ since the expression is not constant. Thus, the final state after the assignment is μ♯1 = (a♯1, s♯, b♯) where a♯1 = {(x, 42), (k, ⊤)}.

Example 5.3. Let us change the setting of Example 5.2 by using the interval domain as numerical abstraction. The framework is now instantiated as 〈ID → 𝕀〉 ⊗ 〈ID → Σ*〉 ⊗ 〈ID → ℘(𝔹)〉, where 𝕀 is the set of intervals as defined in Section 6.2. Let the starting state be μ♯ = (a♯, s♯, b♯) where a♯ = {(x, [42, 42])}, s♯ = {(y, “foo”)}, and b♯ = {(z, true)}. If we evaluate once more the assignment k = len(y) + 1, the evaluation only changes in the generation of the interval element: function 𝔾INTV({〈3 <= len(y)〉}) yields, as formalized in Section 6.2, [3, +∞]. Thus, the final state after the assignment is μ♯2 = (a♯2, s♯, b♯) where a♯2 = {(x, [42, 42]), (k, [4, +∞])}.

5.4 Soundness of the multi-type semantics

Having modified how multi-type expressions are abstracted, we have to ensure that those definitions are sound for the soundness of the abstract semantics to hold. To prove this, we assume that the definitions of ℂ and 𝔾 provided by the domains are sound, relying on the monotone concretization function γBE¯(e).

Definition 16 (Soundness of ℂ). Given an abstract domain D♯, function ℂD♯ is sound if the conjunction of the produced constraints soundly approximates the concrete values of the expression, that is, if ∀e ∈ EXPR, d♯ ∈ D♯ : γD♯(⟅e⟆D♯ d♯) ⊆ γBE¯(e)(ℂD♯(d♯, e)).

Definition 17 (Soundness of 𝔾). Given an abstract domain D♯, function 𝔾D♯ is sound if the generated element over-approximates the concrete elements identified by the constraints, that is, if ∀e ∈ EXPR, B ∈ ℘(BE¯(e)) : γBE¯(e)(B) ⊆ γD♯(𝔾D♯(B)).

Soundness of ℂ is thus ensured if all possible concrete values of the expression satisfy all the generated constraints. Instead, 𝔾 is sound if all the concrete values identified by the constraints are abstracted by the generated element. We now define the criteria for the soundness of the substring semantics ⟅substr(s, A1, A2)⟆S♯.

Definition 18 (Soundness of ⟅substr(s, A1, A2)⟆S♯). Given a string abstract domain S♯ whose expression semantics produces elements of the complete partial order V♯, the function ⟅substr(s, A1, A2)⟆S♯ is sound if the generated abstract element is a sound approximation of all the possible substrings evaluated in the concrete. Formally, the function is sound if ∀e ∈ EXPR, A1, A2 ∈ ℘(BE¯(e)), s♯ ∈ S♯ : {σ[i : j] | σ ∈ γS♯(⟅s⟆S♯ s♯) ∧ i ∈ γBE¯(e)(A1) ∧ j ∈ γBE¯(e)(A2)} ⊆ γS♯(⟅substr(s, A1, A2)⟆S♯ s♯).

We can now reason on the soundness of the abstract transformers. Recall that, to be employed in our framework, an abstract domain must provide sound definitions of ⟅st⟆ and ⟅e⟆. These are then applied element-wise on the abstract memory of each domain (recall that ⟅st⟆(a♯, s♯, b♯) = (⟅st⟆A♯ a♯, ⟅st⟆S♯ s♯, ⟅st⟆B♯ b♯), and ⟅e⟆ delegates to the appropriate domain for single-type expressions). Thus, the evaluation of statements and single-type expressions is trivially sound, provided that the multi-type expression evaluation is sound as well. Assuming all domains provide sound ℂ and 𝔾, and that the string domain provides a sound ⟅substr(s, A1, A2)⟆S♯, we now define the soundness of the multi-type semantics, which we generically denote with ⟅e⟆.

Theorem 19 (Soundness of ⟅e⟆). The evaluation of multi-type expressions ⟅e⟆ is a sound approximation of the split collecting semantics ⦇e⦈+. Formally, ∀e ∈ EXPR, μ♯ ∈ M♯ : ⦇e⦈+ γM♯(μ♯) ⊆ γM♯(⟅e⟆ μ♯). Proof in Appendix A.4.

6 Instantiation

In this section, we provide definitions of ℂ, 𝔾, and ⟅substr(s, A1, A2)⟆S♯ for domains commonly used in whole-value analyses. Thanks to our definitions, such domains can be implemented in a static analyzer modularly, and several of their combinations can be tested without modifying their code. The choice of the abstractions is guided by the literature on string analyses since, as will be discussed in Section 8, they are the most common source of whole-value analysis definitions. We thus select:

• the constant propagation and interval (Cousot and Cousot, 1977) domains as numeric abstractions;

• the powerset abstraction for Boolean values;

• the bounded string set (Madsen and Andreasen, 2014), the prefix (Costantini et al., 2015), and the Tarsis (Negrini et al., 2021) domains for string abstractions.

The rationale behind this choice is that constant propagation and intervals are the most common numeric abstractions used in string analyses, while the powerset abstraction is the only Boolean abstraction employed. Bounded string set, prefix, and Tarsis are instead domains tracking increasingly complex string information, showcasing how our framework can be instantiated at different levels of complexity. All domains that will be discussed are non-relational, meaning that they are defined as maps from program variables to abstract elements. To unify the notation, in each definition we will use ⊤ to denote maps having all keys mapped to the top abstract element, and ⊥ to indicate an error state.

6.1 Constant propagation

The constant propagation domain Cp is a simple non-relational domain whose lattice elements are maps from program variables to integer values. The abstract domain is defined as follows:

CP ≜ 〈ID → ℤ ∪ {⊤, ⊥}, ⊑̇, ⊔̇, ⊓̇, ⊤, ⊥〉,

where the co-domain of the maps is ℤ extended with ⊤ (for non-constant values) and ⊥ (for erroneous values), and ⊑̇, ⊔̇, and ⊓̇ are point-wise applications of the operators that can be inferred from the Hasse diagram of ℤ ∪ {⊤, ⊥}, visible in Figure 2. We now define functions ℂCP and 𝔾CP.

Figure 2. Hasse diagram of the numeric constant propagation values ℤ ∪ {⊤, ⊥}.

Definition 20 (Constraint generation for CP). Let μ ∈ CP be a mapping from program variables to integer constants. Function ℂCP : CP × EXPR → ℘(BE¯(e)) is defined as:

ℂCP(μ, e) =
  ∅    if ⟅e⟆CP μ = ⊤;
  {〈v == e〉}    if ⟅e⟆CP μ = v ∈ ℤ;
  ↯    if ⟅e⟆CP μ = ⊥.

ℂCP thus generates (i) no constraints if the expression does not have a constant value, (ii) a constraint binding the exact value of the expression if it is constant, or (iii) ↯ if the evaluation of e leads to an error.

Definition 21 (Constraint interpretation for CP). Let B ∈ ℘(BE¯(e)) be a set of constraints. Function 𝔾CP : ℘(BE¯(e)) → ℤ ∪ {⊤, ⊥} is defined as:

𝔾CP(B) =
  ⊥    if B = ↯;
  v    if 〈v == e〉 ∈ B ∨ {〈v >= e〉, 〈v <= e〉} ⊆ B;
  ⊤    otherwise.

𝔾CP thus returns ⊥ if the set of constraints is ↯, the constant value v if the constraints bind the value of the expression to v, and ⊤ otherwise. Soundness of ℂCP and 𝔾CP is proven in Appendix A.5.1.
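
Definitions 20 and 21 translate almost directly into code; the following is a minimal sketch (hypothetical Go, with ⊤ encoded as a nil constant and ⊥ as a separate flag):

package main

import "fmt"

// CPValue encodes the co-domain of the constant propagation maps:
// Bottom for ⊥, a nil Const for ⊤, and a concrete integer otherwise.
type CPValue struct {
	Bottom bool
	Const  *int
}

// Constraint is a simplified 〈v ⊙ e〉 with the expression left implicit.
type Constraint struct {
	Rel string // "==", "<=", ">="
	V   int
}

// ConstraintSet models ℘(BE¯(e)) plus the invalid set ↯.
type ConstraintSet struct {
	Invalid bool
	Cs      []Constraint
}

// constraintsCP mirrors ℂCP (Definition 20).
func constraintsCP(v CPValue) ConstraintSet {
	switch {
	case v.Bottom:
		return ConstraintSet{Invalid: true}
	case v.Const != nil:
		return ConstraintSet{Cs: []Constraint{{Rel: "==", V: *v.Const}}}
	default:
		return ConstraintSet{} // no constraints for ⊤
	}
}

// materializeCP mirrors 𝔾CP (Definition 21).
func materializeCP(b ConstraintSet) CPValue {
	if b.Invalid {
		return CPValue{Bottom: true}
	}
	var low, high *int
	for _, c := range b.Cs {
		c := c
		switch c.Rel {
		case "==":
			return CPValue{Const: &c.V}
		case "<=": // v <= e: lower bound on e
			low = &c.V
		case ">=": // v >= e: upper bound on e
			high = &c.V
		}
	}
	if low != nil && high != nil && *low == *high {
		return CPValue{Const: low}
	}
	return CPValue{} // ⊤
}

func main() {
	fortyTwo := 42
	fmt.Println(constraintsCP(CPValue{Const: &fortyTwo}).Cs)                                    // [{== 42}]
	fmt.Println(*materializeCP(ConstraintSet{Cs: []Constraint{{"<=", 7}, {">=", 7}}}).Const) // 7
}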

6.2 Intervals

The intervals domain INTV (Cousot and Cousot, 1977) is a non-relational domain whose lattice elements are maps from program variables to the ranges [l, u] their values might take, with l ∈ ℤ∪{−∞} and u ∈ ℤ∪{+∞}. The abstract domain is defined as follows:

INTV ≜ 〈ID → 𝕀, ⊑̇, ⊔̇, ⊓̇, ⊤, ⊥〉,

where the co-domain of the maps is 𝕀 ≜ ((ℤ ∪ {−∞}) × (ℤ ∪ {+∞})) ∪ {⊥}, corresponding to all possible intervals plus ⊥ to model erroneous values. Lattice operators are point-wise applications of the operators that can be inferred from the Hasse diagram of 𝕀, visible in Figure 3. We now define ℂINTV and 𝔾INTV.

Figure 3. Hasse diagram of the intervals 𝕀.

Definition 22 (Constraint generation for INTV). Let μ ∈ INTV be a mapping from program variables to intervals. Function ℂINTV : INTV × EXPR → ℘(BE¯(e)) is defined as:

ℂINTV(μ, e) =
  ∅    if ⟅e⟆INTV μ = [−∞, +∞];
  {〈l <= e〉}    if ⟅e⟆INTV μ = [l, +∞] ∧ l ≠ −∞;
  {〈u >= e〉}    if ⟅e⟆INTV μ = [−∞, u] ∧ u ≠ +∞;
  {〈l <= e〉} ∪ {〈u >= e〉}    if ⟅e⟆INTV μ = [l, u] ∧ l ≠ −∞ ∧ u ≠ +∞;
  ↯    if ⟅e⟆INTV μ = ⊥.

ℂINTV thus generates (i) no constraints if the expression can evaluate to any value, (ii) a constraint for each finite bound of the interval produced by evaluating e, or (iii) ↯ if the evaluation of e leads to an error.

Definition 23 (Constraint interpretation for INTV). Let B ∈ ℘(BE¯(e)) be a set of constraints. Function 𝔾INTV : ℘(BE¯(e)) → 𝕀 is defined as:

𝔾INTV(B) =
  ⊥    if B = ↯;
  [v, v]    if 〈v == e〉 ∈ B;
  [l, u]    if {〈l <= e〉, 〈u >= e〉} ⊆ B;
  [l, +∞]    if 〈l <= e〉 ∈ B ∧ ∄u : 〈u >= e〉 ∈ B;
  [−∞, u]    if 〈u >= e〉 ∈ B ∧ ∄l : 〈l <= e〉 ∈ B;
  [−∞, +∞]    otherwise.

𝔾INTV thus returns ⊥ if the set of constraints is ↯, and the interval corresponding to the range of possible values of e otherwise. Soundness of ℂINTV and 𝔾INTV is proven in Appendix A.5.2.
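
A corresponding sketch of Definition 23 (hypothetical Go, with the infinite bounds encoded as flags) is:

package main

import "fmt"

// Interval encodes an element of 𝕀: [Low, High] with optional infinite
// bounds, or Bottom for ⊥.
type Interval struct {
	Bottom         bool
	Low, High      int
	NegInf, PosInf bool
}

// Constraint is a simplified 〈v ⊙ e〉 with the expression left implicit.
type Constraint struct {
	Rel string // "==", "<=", ">="
	V   int
}

// materializeIntv mirrors 𝔾INTV (Definition 23): ↯ becomes ⊥, an equality
// becomes a singleton interval, and the available bounds fill the rest.
func materializeIntv(invalid bool, cs []Constraint) Interval {
	if invalid {
		return Interval{Bottom: true}
	}
	out := Interval{NegInf: true, PosInf: true} // [-∞, +∞] by default
	for _, c := range cs {
		switch c.Rel {
		case "==":
			return Interval{Low: c.V, High: c.V}
		case "<=": // v <= e: lower bound
			out.Low, out.NegInf = c.V, false
		case ">=": // v >= e: upper bound
			out.High, out.PosInf = c.V, false
		}
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", materializeIntv(false, []Constraint{{Rel: "<=", V: 3}}))
	// {Bottom:false Low:3 High:0 NegInf:false PosInf:true}, i.e. [3, +∞]
}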

6.3 Boolean powerset

The Boolean powerset domain BP is a non-relational domain whose lattice elements are maps from program variables to subsets of the Boolean values 𝔹. The abstract domain is defined as follows:

BP ≜ 〈ID → ℘(𝔹), ⊑̇, ⊔̇, ⊓̇, ⊤, ⊥〉,

where the co-domain of the maps is ℘(𝔹). Lattice operators are point-wise applications of the operators that can be inferred by the Hasse diagram of ℘(𝔹), visible in Figure 4. ℂBP and 𝔾BP are defined as follows.

Figure 4. Hasse diagram of the Boolean powerset ℘(𝔹).

Definition 24 (Constraint generation for BP). Let μ ∈ BP be a mapping from program variables to sets of Booleans. Function ℂBP : BP × EXPR → ℘(BE¯(e)) is defined as:

ℂBP(μ, e) =
  ∅    if ⟅e⟆BP μ = 𝔹;
  {〈true == e〉}    if ⟅e⟆BP μ = {true};
  {〈false == e〉}    if ⟅e⟆BP μ = {false};
  ↯    if ⟅e⟆BP μ = ∅.

ℂBP thus generates (i) no constraints if the expression can evaluate to any value, (ii) a constraint binding the expression to its Boolean value as determined by the domain if it can only assume one value, or (iii) ↯ if evaluating the expression leads to an error.

Definition 25 (Constraint interpretation for BP). Let B ∈ ℘(BE¯(e)) be a set of constraints. Function 𝔾BP : ℘(BE¯(e)) → ℘(𝔹) is defined as:

𝔾BP(B) =
  ∅    if B = ↯;
  {b}    if 〈b == e〉 ∈ B;
  𝔹    otherwise.

𝔾BP thus returns ∅ if the set of constraints is ↯, the set {b} if the expression is constrained to the Boolean value b, or the top element 𝔹 otherwise. Soundness of ℂBP and 𝔾BP is proven in Appendix A.5.3.

6.4 Bounded string set

The bounded string set domain Ss (Madsen and Andreasen, 2014) is a non-relational domain whose lattice elements are maps from program variables to bounded sets of up to k strings. The abstract domain is defined as:

SS ≜ 〈ID → ℘k(Σ*), ⊆̇, ∪̇k, ∩̇, ⊤, ⊥〉,

where the co-domain of the maps is ℘k(Σ*) ⊂ ℘(Σ*), that is, the powerset ℘(Σ*) where all sets with cardinality greater than k have been removed. An example of such a powerset, with k = 3, is visible in Figure 5. Σ* represents an unknown string (i.e., a set with more than k elements), while ∅ represents erroneous values. Lattice operators ⊆̇, ∪̇k, and ∩̇ are point-wise applications of the respective set-theoretic operators, with ∪k defined as:

Figure 5. Hasse diagram of the bounded set of strings ℘3(Σ*).

A ∪k B ≜
  A ∪ B    if |A ∪ B| ≤ k;
  Σ*    otherwise.

We now define functions ℂSS and 𝔾SS.

Definition 26 (Constraint generation for SS). Let μ ∈ SS be a mapping from program variables to bounded sets of strings. Function ℂSS : SS × EXPR → ℘(BE¯(e)) is defined as:

ℂSS(μ, e) =
  {〈0 <= len(e)〉}    if ⟅e⟆SS μ = Σ*;
  {〈min i∈[1..n] |σi| <= len(e)〉, 〈max i∈[1..n] |σi| >= len(e)〉, 〈gcp{σ1, …, σn} ↷p e〉, 〈gcs{σ1, …, σn} ↷s e〉}    if ⟅e⟆SS μ = {σ1, …, σn} ∧ n ≤ k;
  ↯    if ⟅e⟆SS μ = ∅.

ℂSS thus generates (i) a constraint indicating that the length of the expression is non-negative if the expression has more than k possible values, (ii) constraints binding (a) the prefix of the string to the greatest common prefix (gcp) of the set, (b) the suffix of the string to the greatest common suffix (gcs) of the set, and (c) bounds on the length of the string corresponding to the length of the shortest and longest string in the set, if it can have at most k possible values, or (iii) ↯ if evaluating the expression leads to an error. Note that we always produce length constraints when possible, to provide information to numerical domains that might use it.
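
The constraints of the central case of Definition 26 can be computed as in the following sketch (hypothetical Go; the bounded-set representation and the textual constraint encoding are simplified):

package main

import "fmt"

// gcp returns the greatest common prefix of a non-empty set of strings.
func gcp(set []string) string {
	p := set[0]
	for _, s := range set[1:] {
		n := 0
		for n < len(p) && n < len(s) && p[n] == s[n] {
			n++
		}
		p = p[:n]
	}
	return p
}

// reverse returns its argument reversed (ASCII-only for brevity).
func reverse(s string) string {
	b := []byte(s)
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

// constraintsSS mirrors the central case of ℂSS (Definition 26): it emits the
// length bounds, the greatest common prefix, and the greatest common suffix
// of a bounded set of strings.
func constraintsSS(set []string) []string {
	minLen, maxLen := len(set[0]), len(set[0])
	rev := make([]string, len(set))
	for i, s := range set {
		if len(s) < minLen {
			minLen = len(s)
		}
		if len(s) > maxLen {
			maxLen = len(s)
		}
		rev[i] = reverse(s)
	}
	gcs := reverse(gcp(rev)) // greatest common suffix via reversal
	return []string{
		fmt.Sprintf("%d <= len(e)", minLen),
		fmt.Sprintf("%d >= len(e)", maxLen),
		fmt.Sprintf("%q prefix-of e", gcp(set)),
		fmt.Sprintf("%q suffix-of e", gcs),
	}
}

func main() {
	fmt.Println(constraintsSS([]string{"foobar", "foobazzz"}))
	// [6 <= len(e) 8 >= len(e) "fooba" prefix-of e "" suffix-of e]
}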

Definition 27 (Constraint interpretation for SS). Let B ∈ ℘(BE¯(e)) be a set of constraints. Function 𝔾SS : ℘(BE¯(e)) → ℘k(Σ*) is defined as:

𝔾SS(B) =
  ∅    if B = ↯;
  {σ}    if 〈σ == e〉 ∈ B;
  Σ*    otherwise.

𝔾SS thus returns ∅ if the set of constraints is ↯, the set {σ} if the constraints bind the value of the expression to σ, and Σ* otherwise. Note that, since the constraint set holds definite information, there cannot be more than one constraint using ==, and constraints on prefix and suffix yield infinitely many strings: we thus cannot generate sets with more than one concrete string. We can now provide the definition of ⟅substr(s, A1, A2)⟆S♯. Since SS can only model bounded sets, we have to ensure that the resulting set holds at most k substrings; otherwise, the semantics returns Σ* if no evaluation error happens.

Definition 28 (Substring semantics for SS). Let μ ∈ SS be a mapping from program variables to bounded sets of strings, and let A1 ∈ ℘(BE¯(e1)) and A2 ∈ ℘(BE¯(e2)) be sets of constraints describing integer expressions e1 and e2, respectively. Function ⟅substr(s, A1, A2)⟆SS is defined as:

⟅substr(s, A1, A2)⟆SS μ =
  ∅    if ⟅s⟆SS μ = ∅ ∨ A1 = ↯ ∨ A2 = ↯;
  {σw[x : y] | 1 ≤ w ≤ n ∧ il ≤ x ≤ ih ∧ jl ≤ y ≤ jh ∧ 0 ≤ x ≤ y < |σw|}    if ⟅s⟆SS μ = {σ1, …, σn} ∧ {〈il <= e1〉, 〈ih >= e1〉} ⊆ A1 ∧ {〈jl <= e2〉, 〈jh >= e2〉} ⊆ A2 ∧ count(μ, s, A1, A2) ≤ k;
  Σ*    otherwise,

where function count returns the number of valid substrings that can be computed through its parameters (the definition is left implicit, but intuitively it returns the cardinality of the set returned in the second case) and, for the sake of readability, we omit the cases when at least one of i or j is constant (i.e., when 〈i == e1〉 ∈ A1 and 〈j == e2〉 ∈ A2, respectively), as they can be assimilated to the central case when both inequalities have the same bound. We also omit the case where either i or j is unbounded (i.e., where no upper bound on their value is present in the respective constraint set), as it would generate infinitely many substrings and collapse the result to Σ*.

Soundness of ℂSS, 𝔾SS, and ⟅substr(s, A1, A2)⟆SS is proven in Appendix A.5.4.

6.5 Prefix

The prefix domain PR (Costantini et al., 2015) is a non-relational domain whose lattice elements are maps from program variables to strings acting as prefixes. The abstract domain is defined as:

PR ≜ 〈ID → Σ*, ⊑̇, gcṗ, ⊓̇, ⊤, ⊥〉,

where the co-domain of the maps is Σ*, whose Hasse diagram is visible in Figure 6. Elements are ordered according to the reverse prefix relation: σ1 ⊑ σ2 ⇔ σ2 ↷p σ1. The empty string ϵ thus represents an unknown string (since ∀σ ∈ Σ* : ϵ ↷p σ), while ⊥ represents erroneous values. Lattice operators ⊑̇, gcṗ, and ⊓̇ are point-wise applications of the respective operators over Σ*, with gcp and ⊓ defined as:

σ1 gcp σ2 ≜
  σ1    if σ1 ↷p σ2;
  σ2    if σ2 ↷p σ1;
  σp    otherwise, with σ1 = σpσ′, σ2 = σpσ″, and σ′0 ≠ σ″0;
and
σ1 ⊓ σ2 ≜
  σ1    if σ2 ↷p σ1;
  σ2    if σ1 ↷p σ2;
  ⊥    otherwise.

Figure 6. Hasse diagram of the prefixes Σ*.

We now define functions ℂPR and 𝔾PR.

Definition 29 (Constraint generation for Pr). Let μ ∈ Pr be a mapping from program variables to string prefixes. Function ℂPr : Pr × Expr → ℘(BE¯(e)) is defined as:

\[
\mathbb{C}_{Pr}(\mu, e) =
\begin{cases}
\{\langle 0 <= \mathrm{len}(e) \rangle\} & \text{if } ⟦e⟧_{Pr}\mu = \epsilon;\\
\{\langle |\sigma| <= \mathrm{len}(e) \rangle, \langle \sigma ↷_p e \rangle\} & \text{if } ⟦e⟧_{Pr}\mu = \sigma;\\
↯ & \text{if } ⟦e⟧_{Pr}\mu = \bot.
\end{cases}
\]

ℂPr thus generates (i) a constraint indicating that the length of the expression is non-negative if the expression can assume any string value, (ii) constraints binding (a) the prefix of the string to the result of the evaluation, and (b) the lower bound on the length of the string corresponding to the length of the prefix, if the evaluation produces a valid string as a result, or (iii) ↯ if evaluating the expression leads to an error. Note that we always produce length constraints when possible, to provide information to numerical domains that might use it.

Definition 30 (Constraint interpretation for Pr). Let B ⊆ BE¯(e) be a set of constraints. Function 𝔾Pr : ℘(BE¯(e)) → Σ* ∪ {⊥} is defined as:

\[
\mathbb{G}_{Pr}(B) =
\begin{cases}
\bot & \text{if } B = ↯;\\
\sigma & \text{if } \langle \sigma == e \rangle \in B \lor \langle \sigma ↷_p e \rangle \in B;\\
\epsilon & \text{otherwise.}
\end{cases}
\]

𝔾Pr thus returns ⊥ if the set of constraints is ↯, the prefix σ if the constraints bind either the value of the expression or its prefix to σ, and ϵ otherwise. We can now provide the definition of ⟅substr(s, A1, A2)⟆Pr.
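As a small worked example combining Definitions 29 and 30, suppose the prefix abstraction of an expression e is the prefix “Rep”. Then

\[
\mathbb{C}_{Pr}(\mu, e) = \{\langle 3 <= \mathrm{len}(e) \rangle, \langle \text{"Rep"} ↷_p e \rangle\}, \qquad \mathbb{G}_{Pr}(\mathbb{C}_{Pr}(\mu, e)) = \text{"Rep"},
\]

so no prefix information is lost in the round trip through constraints, while a numerical domain can still exploit the length bound.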

Definition 31 (Substring semantics for Pr). Let μ ∈ Pr be a mapping from program variables to string prefixes, and let A1, A2 ⊆ BE¯(e) be sets of constraints describing integer expressions. Furthermore, let us denote as i̅ the minimal non-negative value admitted by A1, and as j̅ the minimal non-negative value admitted by A2 that is greater than or equal to i̅. Function ⟅substr(s, A1, A2)⟆Pr is defined as:

\[
⟅substr(s, A_1, A_2)⟆_{Pr}\,\mu =
\begin{cases}
\bot & \text{if } ⟦s⟧_{Pr}\mu = \bot \lor A_1 = ↯ \lor A_2 = ↯;\\
\sigma[\bar{i} : \bar{j}] & \text{if } ⟦s⟧_{Pr}\mu = \sigma \land \bar{i} \le \bar{j} \le |\sigma|;\\
\sigma[\bar{i} : |\sigma| - 1] & \text{if } ⟦s⟧_{Pr}\mu = \sigma \land \bar{i} \le |\sigma| < \bar{j};\\
\epsilon & \text{if } ⟦s⟧_{Pr}\mu = \sigma \land |\sigma| < \bar{i} \le \bar{j}.
\end{cases}
\]

⟅substr(s, A1, A2)⟆Pr thus shortens the approximation for the expression if (part of) the substring lies within the prefix, truncating it to ϵ otherwise. Soundness of ℂPr, 𝔾Pr, and ⟅substr(s, A1, A2)⟆Pr is proven in Appendix A.5.5.
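For instance (again assuming that σ[x:y] ranges from index x included to y excluded), if ⟦s⟧Prμ = “Repeat: ” and the constraint sets fix e1 = 0 and e2 = 6, then i̅ = 0 and j̅ = 6 ≤ |σ| = 8, so the second case applies:

\[
⟅substr(s, A_1, A_2)⟆_{Pr}\,\mu = \text{"Repeat: "}[0:6] = \text{"Repeat"};
\]

if instead the smallest value admitted for e2 exceeded the length of the known prefix, the third case would apply and the result would be truncated.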

6.6 Tarsis

The Tarsis domain Ta (Negrini et al., 2021) is a non-relational domain whose lattice elements are maps from program variables to special finite state automata defined over an alphabet of strings instead of single characters. The abstract domain is defined as:

\[
\mathrm{Ta} = \langle \mathrm{Id} \to \mathrm{TFa}/_{\equiv},\ \dot{\sqsubseteq},\ \dot{\sqcup},\ \dot{\sqcap},\ \top,\ \bot \rangle,
\]

where the co-domain of the maps is the set TFa/≡ of equivalence classes of finite state automata with string alphabets, and whose lattice operators are point-wise applications of the respective operators over TFa/≡. Specifically, ⊑ is the partial order induced by language inclusion, ⊔ and ⊓ correspond to automata union and intersection, respectively, and the top and bottom elements are Min(𝔸Σ*) and Min(∅) (respectively, the minimum automaton recognizing all possible strings and the one recognizing the empty language). Following the notation from (Negrini et al. 2021), we denote as A ∈ TFa/≡ a Tarsis automaton, and as L(A) ∈ ℘(Σ*) the regular language recognized by A. We now define functions ℂTa and 𝔾Ta.

Definition 32 (Constraint generation for Ta). Let μ ∈ Ta be a mapping from program variables to TFa/≡ automata. Function ℂTa : Ta × Expr → ℘(BE¯(e)) is defined as:

\[
\mathbb{C}_{Ta}(\mu, e) =
\begin{cases}
\{\langle 0 <= \mathrm{len}(e) \rangle\} & \text{if } ⟦e⟧_{Ta}\mu = \mathrm{Min}(\mathbb{A}_{\Sigma^*});\\
\{\langle \sigma == e \rangle, \langle |\sigma| <= \mathrm{len}(e) \rangle, \langle \mathrm{len}(e) <= |\sigma| \rangle\} & \text{if } L(⟦e⟧_{Ta}\mu) = \{\sigma\};\\
\{\langle \mathrm{lcp}(A) ↷_p e \rangle, \langle \mathrm{lcp}(\mathrm{rev}(A)) ↷_s e \rangle, \langle i <= \mathrm{len}(e) \rangle, \langle \mathrm{len}(e) <= j \rangle\} & \text{if } ⟦e⟧_{Ta}\mu = A \land \mathrm{len}(A) = [i, j];\\
↯ & \text{if } ⟦e⟧_{Ta}\mu = \mathrm{Min}(\emptyset).
\end{cases}
\]

ℂTa thus generates (i) a constraint indicating that the length of the expression is non-negative if the expression can assume any string value, (ii) constraints binding the exact value and length of the string if it can assume exactly one string value, (iii) constraints binding the prefix (where lcp is the longest common prefix, e.g., the one proposed in Béal and Carton, 2000), the suffix (where rev(A) reverses an automaton by swapping initial and accepting states and inverting the direction of its edges), and the minimum and maximum length of the string (where len(A) is the abstract semantics of len as defined in Negrini et al., 2021) according to the result of the evaluation, or (iv) ↯ if evaluating the expression leads to an error. For the sake of readability, the case when j = +∞ is not shown, as it is equal to the displayed case except that the bound using j is absent. Note that we always produce length constraints when possible, to provide information to numerical domains that might use it.

Definition 33 (Constraint interpretation for Ta). Let B ⊆ BE¯(e) be a set of constraints, let L(σ) be the minimum automaton A ∈ TFa/≡ recognizing the regular language {σ}, and let ⌢ be the automata concatenation. Function 𝔾Ta : ℘(BE¯(e)) → TFa/≡ is defined as:

\[
\mathbb{G}_{Ta}(B) =
\begin{cases}
\mathrm{Min}(\emptyset) & \text{if } B = ↯;\\
L(\sigma) & \text{if } \langle \sigma == e \rangle \in B;\\
L(\sigma) \frown \mathrm{Min}(\mathbb{A}_{\Sigma^*}) & \text{if } \langle \sigma ↷_p e \rangle \in B \land \nexists \sigma' \in \Sigma^* : \langle \sigma' ↷_s e \rangle \in B;\\
\mathrm{Min}(\mathbb{A}_{\Sigma^*}) \frown L(\sigma) & \text{if } \langle \sigma ↷_s e \rangle \in B \land \nexists \sigma' \in \Sigma^* : \langle \sigma' ↷_p e \rangle \in B;\\
L(\sigma) \frown \mathrm{Min}(\mathbb{A}_{\Sigma^*}) \frown L(\sigma') & \text{if } \{\langle \sigma ↷_p e \rangle, \langle \sigma' ↷_s e \rangle\} \subseteq B;\\
\mathrm{Min}(\mathbb{A}_{\Sigma^*}) & \text{otherwise.}
\end{cases}
\]

𝔾Ta thus returns Min(∅) if the set of constraints is ↯, the automaton recognizing σ if the constraints bind the value of the expression to σ, the concatenation of the definite prefix (resp. suffix) with Min(𝔸Σ*) on its right (resp. left) if only one of the two is known, the three-way concatenation of prefix, Min(𝔸Σ*), and suffix if both are known, and Min(𝔸Σ*) otherwise. We can now provide the definition of ⟅substr(s, A1, A2)⟆Ta.
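For example, if B = {⟨“Repeat: ” ↷p e⟩, ⟨“!” ↷s e⟩}, the case with both a prefix and a suffix constraint applies and

\[
\mathbb{G}_{Ta}(B) = L(\text{"Repeat: "}) \frown \mathrm{Min}(\mathbb{A}_{\Sigma^*}) \frown L(\text{"!"}),
\]

that is, the automaton recognizing every string that starts with “Repeat: ” and ends with “!”.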

Definition 34 (Substring semantics for Ta). Let μ ∈ Ta be a mapping from program variables to TFa/≡ automata, and let A1, A2 ⊆ BE¯(e) be sets of constraints describing integer expressions. Furthermore, let us denote by i+ and i− the maximal and minimal values admitted by A1, and by j+ and j− the maximal and minimal values admitted by A2, with i+ and j+ possibly equal to +∞. Function ⟅substr(s, A1, A2)⟆Ta is defined as:

\[
⟅substr(s, A_1, A_2)⟆_{Ta}\,\mu =
\begin{cases}
\mathrm{Min}(\emptyset) & \text{if } ⟦s⟧_{Ta}\mu = \mathrm{Min}(\emptyset) \lor A_1 = ↯ \lor A_2 = ↯;\\
A[[i^-, i^+] : [j^-, j^+]] & \text{if } ⟦s⟧_{Ta}\mu = A.
\end{cases}
\]

⟅substr(s, A1, A2)⟆Ta thus directly delegates to the substring semantics of Tarsis using the intervals [i−, i+] and [j−, j+] as indices, since the domain defines its semantics in terms of intervals. Soundness of ℂTa, 𝔾Ta, and ⟅substr(s, A1, A2)⟆Ta is proven in Appendix A.5.6.

6.7 Products of abstract domains

The formalizations given in this section all refer to individual domains. However, it is possible to abstract a single data type using a product of abstract domains, e.g., as in (Madsen and Andreasen 2014). The extension to such a setting is straightforward: 𝔾 and ⟅substr(s, A1, A2)⟆ are simply applied domain-wise, making each domain produce abstract elements independently, possibly followed by reductions. Instead, since ℂ must produce a minimal set of constraints containing no contradictions, the results of the individual constraint generations from each domain must be composed together to eliminate redundant constraints, collapsing the result to ↯ if a contradiction is found.
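For instance (writing ⊕ for the composition just described, a notation used only in this example), if the prefix domain abstracts e as the prefix “ca” while the bounded string set domain abstracts it as {“cats”, “carts”}, their constraint sets compose as

\[
\{\langle 2 <= \mathrm{len}(e) \rangle, \langle \text{"ca"} ↷_p e \rangle\} \oplus \{\langle \text{"ca"} ↷_p e \rangle, \langle \text{"ts"} ↷_s e \rangle, \langle 4 <= \mathrm{len}(e) \rangle, \langle \mathrm{len}(e) <= 5 \rangle\}
= \{\langle \text{"ca"} ↷_p e \rangle, \langle \text{"ts"} ↷_s e \rangle, \langle 4 <= \mathrm{len}(e) \rangle, \langle \mathrm{len}(e) <= 5 \rangle\},
\]

where the duplicated prefix constraint and the weaker lower bound on len(e) are dropped; had one of the two sets contained ⟨len(e) <= 3⟩ instead, the composition would have collapsed to ↯.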

7 Implementation and evaluation

The framework presented in this study has been implemented in LiSA (Negrini et al., 2023a), an open-source1 Java library designed to ease the creation of static analyzers and offering a common platform for the development of abstract interpretations (see, e.g., Zanatta et al., 2025; Negrini et al., 2024; Olivieri et al., 2023; Negrini et al., 2023b). Among these, (Negrini et al. 2021) provided a comparison between five string abstractions on four expressive string manipulation programs, with the objective of proving assertions. The considered abstractions were the prefix, suffix, character inclusion, and bricks domain from (Costantini et al. 2015), the FSA domain from (Arceri et al. 2020), and the Tarsis abstract domain presented in that study. The comparison was performed by building, for each domain, a smashed product between the string abstraction, the interval domain, and the Boolean powerset abstraction. The semantics of each domain were thus coded explicitly for that combination. Some domains were, however, formalized with respect to integer constants instead of intervals, and thus required their semantics to be explicitly lifted beforehand.

In this section, we (i) discuss the implementation effort to code both the smashed product and the constraint-based analysis in LiSA, and (ii) replicate the experiments from (Negrini et al. 2021) to ensure that the constraint-based analysis can achieve the same precision as the smashed product. Specifically, we employed the following domains:

• the constant propagation domain Cp and the interval domain Intv as numeric abstractions;

• the Boolean powerset domain Bp as Boolean abstraction;

• the prefix domain Pr, the suffix domain Su, the character inclusion domain Ci, the bounded string set domain Ss (with k = 5), and the Tarsis domain Ta as string abstractions.

Note that our evaluation includes two additional domains Su and Ci, and that each domain supports additional multi-type expressions with respect to the ones defined in this study (e.g., the index operator discovering the first index where a search string appears in a target string). These have been omitted for conciseness, as they would not add any technical contribution.

7.1 Implementation in LiSA

In LiSA, each abstract domain is implemented as a separate Java class that must implement some key interfaces. All domains chosen for the evaluation are non-relational (i.e., they do not explicitly track relations across different variables' values): thus, they all implement the NonRelationalValueDomain interface, which requires the definition of lattice operators for individual abstract values (e.g., operating on single intervals) and of the evaluation logic for expressions given the value of each variable. Abstract transformers for statements (e.g., assignments) are handled modularly by the ValueEnvironment class, which lifts instances of the non-relational domains to maps from variables to abstract values.
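As an illustrative sketch of this structure (not LiSA's actual API: the interface, class, and method names below are simplified stand-ins), a non-relational abstraction only needs to provide lattice operators and an evaluation function on single abstract values, while the lifting to environments mapping variables to values can be implemented once and for all:

// Illustrative only: a simplified stand-in for the role played by LiSA's
// NonRelationalValueDomain; interface and method names are not LiSA's API.
import java.util.Map;

interface NonRelationalValue<T extends NonRelationalValue<T>> {
    T lub(T other);               // least upper bound on single abstract values
    boolean lessOrEqual(T other); // partial order on single abstract values
    boolean isBottom();           // erroneous/unreachable value
}

// A toy prefix abstraction: "" is top (any string), null stands for bottom.
final class PrefixValue implements NonRelationalValue<PrefixValue> {
    static final PrefixValue TOP = new PrefixValue("");
    static final PrefixValue BOTTOM = new PrefixValue(null);
    private final String prefix;

    PrefixValue(String prefix) { this.prefix = prefix; }

    @Override
    public boolean isBottom() { return prefix == null; }

    @Override
    public PrefixValue lub(PrefixValue other) {
        if (isBottom()) return other;
        if (other.isBottom()) return this;
        int i = 0; // greatest common prefix of the two abstract values
        while (i < prefix.length() && i < other.prefix.length()
                && prefix.charAt(i) == other.prefix.charAt(i))
            i++;
        return new PrefixValue(prefix.substring(0, i));
    }

    @Override
    public boolean lessOrEqual(PrefixValue other) {
        if (isBottom()) return true;
        if (other.isBottom()) return false;
        return prefix.startsWith(other.prefix); // reverse prefix ordering
    }

    // Evaluation of an expression given the abstract value of each variable:
    // string literals are abstracted by themselves, variables are looked up.
    static PrefixValue eval(String expression, Map<String, PrefixValue> env) {
        if (expression.length() >= 2
                && expression.startsWith("\"") && expression.endsWith("\""))
            return new PrefixValue(expression.substring(1, expression.length() - 1));
        return env.getOrDefault(expression, TOP);
    }
}

The abstract transformer for an assignment x = e then reduces to updating the environment with eval(e, env), mirroring the generic lifting that ValueEnvironment provides to every non-relational domain.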

Both the smashed product2 and the constraint-based analysis3 have been implemented as instances of NonRelationalValueDomain as well, each holding references to the client domains that they use for type-specific computations. The implementations are fairly similar both in complexity and length (364 and 357 lines, respectively). The per-domain effort instead varies: the methods implemented for the smashed product are typically shorter, but they require non-trivial reasoning; instead, methods for the constraint-based analysis are fewer and simpler (mainly consisting of iterations over the constraint sets), but are longer. The sizes, however, remain comparable (e.g., the largest difference is seen in Tarsis, where the constraint-based analysis requires writing ~50 more lines of code), indicating that the constraint-based analysis does not require additional development efforts. Note, however, that LiSA is built to be as modular as possible: all domains are made to be pluggable and, to some extent, replaceable with no code modifications. This is reflected in the non-relational smashed product implementation as well: relational domains would not fit within the current communication scheme, and would require additional complexity. Instead, the constraint-based analysis features modular communication by design, and its structure is expected to be preserved regardless of the chosen domains.
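To make the difference in communication style concrete, the following sketch (again with hypothetical names, not LiSA's classes) shows the shape of the constraint-based exchange: client domains only expose ℂ-like conversions plus transformers that consume constraint sets, and the combinator routes constraints between them without knowing which abstractions it is combining:

import java.util.Set;

// Hypothetical names throughout: this is not LiSA's API, only a sketch of the
// constraint-based exchange described in this article.

// A symbolic constraint over an expression, e.g., 0 <= len(e).
record Constraint(String text) { }

// The numeric client describes an expression's abstract value as a set of
// constraints (the role played by ℂ for that domain).
interface NumericClient<N> {
    Set<Constraint> describe(N value, String expression);
}

// The string client consumes constraint sets instead of values of a specific
// numeric abstraction (the role played by ⟅substr(s, A1, A2)⟆).
interface StringClient<S> {
    S substring(S value, Set<Constraint> begin, Set<Constraint> end);
}

// The combinator only routes constraint sets between the two clients, so
// neither needs to know which concrete abstraction the other implements.
final class WholeValueCombinator<N, S> {
    private final NumericClient<N> numbers;
    private final StringClient<S> strings;

    WholeValueCombinator(NumericClient<N> numbers, StringClient<S> strings) {
        this.numbers = numbers;
        this.strings = strings;
    }

    S substring(S target, N beginIndex, N endIndex) {
        Set<Constraint> a1 = numbers.describe(beginIndex, "e1"); // A1
        Set<Constraint> a2 = numbers.describe(endIndex, "e2");   // A2
        return strings.substring(target, a1, a2);
    }
}

Swapping the interval domain for constant propagation, or the prefix domain for Tarsis, then only changes which client objects are handed to the combinator, which is the modularity argument made above.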

7.2 Comparison with the smashed product

The experiments of (Negrini et al. 2021) were run on the program samples visible in Figure 7, written in Go (where the code of strings.Count used in program CountMatches is visible in Figure 8). For each combination of the domains reported at the beginning of this section, we ran the analysis with the smashed product-based combination and with the constraint-based combination presented in this work, ensuring that the assertions they can prove are the same.

Figure 7. Program samples used for domain comparison. (a) Subs builds a string nondeterministically, takes a slice of it, and asserts on the characters the slice contains. (b) ToString joins a slice of names into a formatted string and asserts on its contents. (c) Loop repeatedly appends a value inside a nondeterministic loop and asserts on the characters of the result. (d) CountMatches counts the occurrences of “th” in a nondeterministically chosen string and asserts on the resulting count.

Figure 8. The strings.Count function of the Go API, which counts the occurrences of a substring within a string by iteratively locating it and incrementing a counter (returning the string length plus one when the substring is empty).

The results of each analysis are visible in Table 1, where column Domain reports the combination of abstract domains used and, for each analyzed program, columns Smash and Constr report the assertions' results (where ✓ denotes an assertion that never fails, ✗ an assertion that always fails, and ⋆ an assertion that might fail, according to the invariants computed by the analysis) with the smashed product and the constraint-based analysis, respectively. Invariants computed by each analysis are omitted for space reasons, but were nonetheless manually inspected at each program point. The experiments highlight that our constraint-based approach yields the same precision as the explicit smashed product: in fact, LiSA was able to determine the same result (never fails, always fails, or might fail) for all assertions in the target programs. Moreover, during the manual inspection of the invariants, we did not find differences in the states produced by each domain in the smashed product and constraint-based analysis, thus justifying the observed equality in the assertions' results.


Table 1. Proved assertions by each domain combination on the target programs.

8 Related work

The problem of combining heterogeneous abstractions to perform whole-value analyses is not new, as it is implicitly formalized in all string analysis definitions. However, formalizations are typically tailored to a specific combination of abstract domains.

(Christensen et al. 2003b) introduces JSA, an analyzer for Java that analyzes strings by building flow graphs expressing how string sources (i.e., constants and user inputs) are manipulated along the program execution, with the objective of validating the structure of dynamically-generated content like XML documents or the targets of reflective calls. (Christensen et al. 2003a) uses such analysis to further validate web services generating HTML pages. In such flow graphs, non-string arguments used in string expressions are treated as part of the operators' definition, e.g., setCharAt(0, “x”) and setCharAt(1, “x”) are seen as two different operators. Thus, no reasoning is given on how to combine the flow graphs with domains modeling additional data types. Further investigating document validation techniques, (Kim and Choe 2011) employs a domain based on pushdown automata, but only defines concatenation as an abstract operation. A notable effort by the community targeted the analysis of dynamic property accesses in JavaScript code. (Park et al. 2016) defines a domain based on regular expressions. (Madsen and Andreasen 2014) reports several existing string domains, but also defines new ones: LENGTH HASH, (SLIDING) INDEX PREDICATE, STRING HASH, NUMBER STRINGS, and TYPE STRINGS, and reasons on their combination. In both works, the authors only formalize the semantics of equality tests and string concatenation, with no reasoning for the usage of the presented domains in whole-value analyses.

Several general-purpose string abstractions have been proposed over the years, which are the most common source of whole-value analyses. (Costantini et al. 2015) formalizes the CHARACTER INCLUSION, PREFIX, SUFFIX, BRICKS, and STRING GRAPH domains, tracking increasingly complex non-relational information on string values. Each domain defines the semantics of substring using integer coefficients, and the semantics of contains that returns an element of the Boolean powerset. However, no discussion on what the result of the substring is when the indexes are not constant is given. Instead, (Arceri et al. 2020) uses finite state automata (Fsa) to model string values, and formalizes several string operations including substring, length, and startsWith. Boolean values are represented using the Boolean powerset, and numerical values using intervals. (Negrini et al. 2021) presents Tarsis, an evolution of the Fsa domain that uses automata built over an alphabet of strings instead of characters, aiming at providing the same precision while requiring fewer resources. In terms of whole-value analyses, it still uses the same abstractions for Booleans and integers. (Choi et al. 2006) defines the REGULAR STRINGS domain, a subset of the regular expressions with efficient widening. The domain is defined in conjunction with a simple constant propagation domain for numerical quantities, and the authors exploit it in their definition of substring. (Li et al. 2015) instead introduces a string-specific intermediate representation (IR) that defines data dependencies between string variables, with different levels of context sensitivity. The IR can then be analyzed with several string domains that just have to provide transformers for the IR constructs. Rather than a novel analysis, this work constitutes a framework for the definition of new analyses. Regardless, it is not clear how non-string values appear within the IR, and how client abstractions can leverage them.

Several other string analysis techniques exist in the realm of symbolic execution (Veanes, 2013; Dalla Preda et al., 2015; Han et al., 2011; Yu et al., 2014; Nguyen et al., 2011; Yu et al., 2008) and constraint solving (Abdulla et al., 2020; Chen et al., 2019; Zheng et al., 2013; Abdulla et al., 2019, 2014; Amadini et al., 2020; D'Antoni and Veanes, 2013; Wang et al., 2018), but they are orthogonal to our work and are thus not discussed.

Many techniques for combining domains have also been presented. (Cortesi et al. 2013) reports the most common instances of products (Cartesian, reduced, Granger, and open) that can be used to combine arbitrary sets of abstract domains, possibly exchanging information between them. Our framework is an instance of the open product. (Amadini et al. 2018) introduces a framework aiming at replacing Granger products, achieving a precision closer to that of the reduced product with fewer computational requirements. The framework is based on the choice of a reference domain, i.e., a domain that is at least as precise as the domains involved in the product. Refinement between domains is replaced by converting each domain into an instance of the reference domain, computing their meet, and then converting the meet back to instances of each domain. However, this framework is aimed at products of domains abstracting the same data type rather than different ones. (Gulwani and Tiwari 2006) introduces the logical product, a systematic technique to combine logical lattices. This requires the underlying theories to be convex, stably infinite, and disjoint. While the last requirement is always satisfied in our framework, we do not require either convexity or stable infiniteness on the domains we consider (note that the convexity of the set BE¯(e) is a current limitation on the information exchange, rather than a requirement of the domains one can employ). The Astrée static analyzer and the Verasco formally-verified static analyzer employ communication channels (Cousot et al., 2007; Jourdan et al., 2015) to exchange information between numerical abstractions, allowing mutual refinement between them. The channels act similarly to our functions 𝔾 and ℂ: when a domain needs information on the value of some expression, it can query the input channel to obtain it (the form taken by the information can vary: it may be an interval, a linear constraint, …), and the domain's transformers also populate output channels with information other domains can query. While both analyzers use channels to exchange numerical information, their extension to other data types should be feasible. The Apron library (Jeannet and Miné, 2009) also has means of (i) converting a domain instance to a set of constraints, and (ii) creating a (possibly different) domain instance from a set of constraints. Once more, these roughly correspond to functions 𝔾 and ℂ, but are only used as a form of conversion between numerical domain instances instead of information exchange. Finally, the Goblint static analyzer uses queries (Apinis, 2014) based on constraints to allow domains to ask for information about expressions from other domains. This workflow is very similar to the one presented in this work, but, to the best of our knowledge, it has no theoretical formalization and has not been employed outside of numerical analyses. All these works were essential in showcasing how constraints can be used effectively to exchange information between different abstract domains, possibly at the cost of some precision. Our framework, and specifically the set BE¯(e) and the functions ℂ and 𝔾, draws inspiration from them.

9 Conclusion

This study presents a modular framework for constraint-based whole-value analyses where existing domains can be used to provide abstractions of all data types supported by a programming language. The requirements are minimal: domains need to implement means for (i) converting their instances to a set of constraints, (ii) generating instances based on a set of constraints, and (iii) providing the semantics of expressions with heterogeneous argument types in terms of constraints. Provided these requirements are met, several combinations of abstract domains can be executed with no additional modification to their code. Moreover, since the constraint-based information exchange has been proven sound, each combination is guaranteed to be sound without requiring lifting of the abstract semantics, which would in turn require additional soundness proofs. An implementation of the framework is available inside the LiSA static analysis library. The framework has also been compared with ad-hoc domain combinations, showcasing the same degree of precision.

Our work explicitly targets a combination of forward analyses that are not compositional [i.e., do not run modularly (Cousot and Cousot, 2002)]. The extension of the framework to both backward and compositional analyses is left as future work. Moreover, the constraint set BE¯(e) considered here allows only convex sets of constraints: for instance, we are currently unable to express disjoint ranges for a single integer variable. While supporting non-convex sets could be trivial (e.g., by considering sets of sets of constraints instead, where each inner set represents convex information), special care must be employed to ensure soundness. We thus leave such an extension as future work.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

LN: Writing – original draft, Formal analysis, Methodology, Software, Visualization, Conceptualization, Supervision, Validation, Investigation, Writing – review & editing, Project administration.

Funding

The author(s) declared that financial support was received for this work and/or its publication. Work supported by the SERICS (PE00000014 - CUP H73C2200089001) project funded by PNRR NextGeneration EU.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcomp.2025.1655377/full#supplementary-material

Footnotes

1. ^LiSA's GitHub repository.

2. ^Implementation of the smashed product is publicly available inside LiSA.

3. ^Implementation of the constraint-based analysis is publicly available inside LiSA.

References

Abdulla, P. A., Atig, M. F., Chen, Y.-F., Diep, B. P., Dolby, J., Janků, P., et al. (2020). “Efficient handling of string-number conversion,” in Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020 (New York, NY: Association for Computing Machinery), 943–957. doi: 10.1145/3385412.3386034


Abdulla, P. A., Atig, M. F., Chen, Y.-F., Holík, L., Rezine, A., Rümmer, P., et al. (2014). “String constraints for verification,” in Computer Aided Verification: 26th International Conference, CAV 2014, Held as Part of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18-22, 2014. Proceedings 26 (Cham: Springer), 150–166.


Abdulla, P. A., Atig, M. F., Diep, B. P., Holík, L., and Janků, P. (2019). “Chain-free string constraints,” in Automated Technology for Verification and Analysis: 17th International Symposium, ATVA 2019, Taipei, Taiwan, October 28-31, 2019, Proceedings 17 (Cham: Springer), 277–293. doi: 10.1007/978-3-030-31784-3_16


Amadini, R., Gange, G., Gauthier, F., Jordan, A., Schachte, P., Søndergaard, H., et al. (2018). Reference abstract domains and applications to string analysis. Fundam. Inform. 158, 297–326. doi: 10.3233/FI-2018-1650


Amadini, R., Gange, G., and Stuckey, P. J. (2020). Dashed strings for string constraint solving. Artif. Intell. 289:103368. doi: 10.1016/j.artint.2020.103368


Apinis, K. (2014). Frameworks for Analyzing Multi-Threaded C (PhD thesis). Technischen Universität München, München.


Arceri, V., and Maffeis, S. (2017). Abstract domains for type juggling. Electron. Notes Theor. Comput. Sci. 331, 41–55. doi: 10.1016/j.entcs.2017.02.003


Arceri, V., Mastroeni, I., and Xu, S. (2020). Static analysis for ecmascript string manipulation programs. Appl. Sci. 10:3525. doi: 10.3390/app10103525


Béal, M.-P., and Carton, O. (2000). Computing the prefix of an automaton. RAIRO - Theor. Inform. Appl. 34, 503–514. doi: 10.1051/ita:2000127


Chen, T., Hague, M., Lin, A. W., Rümmer, P., and Wu, Z. (2019). Decision procedures for path feasibility of string-manipulating programs with complex operations. Proc ACM Program Lang. 3(POPL), 1–30. doi: 10.1145/3290362


Choi, T.-H., Lee, O., Kim, H., and Doh, K.-G. (2006). “A practical string analyzer by the widening approach,” in Asian Symposium on Programming Languages and Systems (Cham: Springer), 374–388. doi: 10.1007/11924661_23


Christensen, A. S., Møller, A., and Schwartzbach, M. I. (2003a). Extending java for high-level web service construction. ACM Trans. Program. Lang. Syst. 25, 814–875. doi: 10.1145/945885.945890


Christensen, A. S., Møller, A., and Schwartzbach, M. I. (2003b). “Precise analysis of string expressions,” in Static Analysis, ed. R. Cousot (Cham: Springer Berlin Heidelberg), 1–18. doi: 10.1007/3-540-44898-5_1


Cohen, E. (1977). Information transmission in computational systems. SIGOPS Oper. Syst. Rev. 11, 133–139. doi: 10.1145/1067625.806556


Cortesi, A., Costantini, G., and Ferrara, P. (2013). A survey on product operators in abstract interpretation. Electron. Proc. Theor. Comp. Sci. 129, 325–336. doi: 10.4204/EPTCS.129.19


Costantini, G., Ferrara, P., and Cortesi, A. (2015). A suite of abstract domains for static analysis of string values. Softw. Pract. Exp. 45, 245–287. doi: 10.1002/spe.2218


Cousot, P. (1997). “Types as abstract interpretations,” in Proceedings of POPL '97, POPL '97 (New York, NY: ACM), 316–331. doi: 10.1145/263699.263744


Cousot, P. (2021). Principles of Abstract Interpretation, Vol. 1, 1 Edn. Cambridge, MA: MIT Press.


Cousot, P., and Cousot, R. (1977). “Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints,” in Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, POPL '77 (New York, NY: Association for Computing Machinery), 238–252. doi: 10.1145/512950.512973


Cousot, P., and Cousot, R. (1979). “Systematic design of program analysis frameworks,” in Proceedings of the 6th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL '79 (New York, NY: Association for Computing Machinery), 269–282. doi: 10.1145/567752.567778


Cousot, P., and Cousot, R. (2002). “Modular static program analysis,” in Compiler Construction, Vol. 2304, eds. G. Goos, J. Hartmanis, J. Van Leeuwen, and R. N. Horspool (Berlin: Springer Berlin Heidelberg), 159–179. Series Title: Lecture Notes in Computer Science. doi: 10.1007/3-540-45937-5_13


Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., et al. (2007). “Combination of abstractions in the ASTRÉE static analyzer,” in Advances in Computer Science - ASIAN 2006. Secure Software and Related Issues, Vol. 4435, M. Okada, and I. Satoh (Berlin: Springer Berlin Heidelberg), 272–300. Series Title: Lecture Notes in Computer Science. doi: 10.1007/978-3-540-77505-8_23


Cousot, P., and Halbwachs, N. (1978). “Automatic discovery of linear restraints among variables of a program,” in POPL '78, eds. A. V. Aho, S. N. Zilles, and T. G. Szymanski (New York, NY: ACM Press), 84–96. doi: 10.1145/512760.512770


Dalla Preda, M., Giacobazzi, R., Lakhotia, A., and Mastroeni, I. (2015). “Abstract symbolic automata: Mixed syntactic/semantic similarity analysis of executables,” in Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (New York, NY: ACM Press), 329–341. doi: 10.1145/2676726.2676986


D'Antoni, L., and Veanes, M. (2013). “Static analysis of string encoders and decoders,” in International Workshop on Verification, Model Checking, and Abstract Interpretation (Cham: Springer), 209–228. doi: 10.1007/978-3-642-35873-9_14


Ernst, M. D., Lovato, A., Macedonio, D., Spiridon, C., and Spoto, F. (2015). “Boolean formulas for the static identification of injection attacks in java,” in Logic for Prog., Art. Int., Reason., eds. M. Davis, A. Fehnker, A. McIver, and A. Voronkov (Berlin: Springer Berlin Heidelberg), 130–145. doi: 10.1007/978-3-662-48899-7_10


Ferrara, P. (2016). A generic framework for heap and value analyses of object-oriented programming languages. Theor. Comput. Sci. 631, 43–72. doi: 10.1016/j.tcs.2016.04.001


Gulwani, S., and Tiwari, A. (2006). Combining abstract interpreters. ACM SIGPLAN Notices 41, 376–386. doi: 10.1145/1133255.1134026


Han, W., Ren, M., Tian, S., Ding, L., and He, Y. (2011). “Static analysis of format string vulnerabilities,” in 2011 First ACIS International Symposium on Software and Network Engineering (Seoul: IEEE), 122–127. doi: 10.1109/SSNE.2011.9


Jeannet, B., and Miné, A. (2009). “Apron: a library of numerical abstract domains for static analysis,” in Computer Aided Verification, Vol. 5643, eds. A. Bouajjani, and O. Maler (Berlin: Springer Berlin Heidelberg), 661–667. Series Title: Lecture Notes in Computer Science. doi: 10.1007/978-3-642-02658-4_52


Jourdan, J.-H., Laporte, V., Blazy, S., Leroy, X., and Pichardie, D. (2015). A formally-verified c static analyzer. ACM Sigplan Notices 50, 247–259. doi: 10.1145/2775051.2676966


Kim, S.-W., and Choe, K.-M. (2011). “String analysis as an abstract interpretation,” in International Workshop on Verification, Model Checking, and Abstract Interpretation (Cham: Springer), 294–308. doi: 10.1007/978-3-642-18275-4_21


Li, D., Lyu, Y., Wan, M., and Halfond, W. G. (2015). “String analysis for java and android applications,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (New York, NY: ACM), 661–672. doi: 10.1145/2786805.2786879


Logozzo, F., and Fähndrich, M. (2010). Pentagons: a weakly relational abstract domain for the efficient validation of array accesses. Sci. Comput. Program. 75, 796–807. doi: 10.1016/j.scico.2009.04.004


Madsen, M., and Andreasen, E. (2014). “String analysis for dynamic field access,” in Compiler Construction, ed. A. Cohen (Berlin: Springer Berlin Heidelberg), 197–217. doi: 10.1007/978-3-642-54807-9_12


Miné, A. (2006). The octagon abstract domain. High. Order Symb. Comp. 19, 31–100. doi: 10.1007/s10990-006-8609-1


Negrini, L., Arceri, V., Ferrara, P., and Cortesi, A. (2021). “Twinning automata and regular expressions for string static analysis,” in Proc. of VMCAI '21, Volume 12597 of LNCS (Cham: Springer), 267–290. doi: 10.1007/978-3-030-67067-2_13


Negrini, L., Ferrara, P., Arceri, V., and Cortesi, A. (2023a). “LiSA: a generic framework for multilanguage static analysis,” in Challenges of Software Verification, eds. V. Arceri, A. Cortesi, P. Ferrara, and M. Olliaro (Singapore: Springer Nature), 19–42. doi: 10.1007/978-981-19-9601-6_2


Negrini, L., Presotto, S., Ferrara, P., Zaffanella, E., and Cortesi, A. (2024). “Stability: an abstract domain for the trend of variation of numerical variables,” in Proceedings of the 10th ACM SIGPLAN International Workshop on Numerical and Symbolic Abstract Domains, NSAD '24 (New York, NY: Association for Computing Machinery), 10–17. doi: 10.1145/3689609.3689995


Negrini, L., Shabadi, G., and Urban, C. (2023b). “Static analysis of data transformations in jupyter notebooks,” in Proceedings of the 12th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, SOAP 2023 (New York, NY: Association for Computing Machinery), 8–13. doi: 10.1145/3589250.3596145


Nguyen, H. V., Nguyen, H. A., Nguyen, T. T., and Nguyen, T. N. (2011). “Auto-locating and fix-propagating for html validation errors to php server-side code,” in 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011) (Lawrence, KS: IEEE), 13–22. doi: 10.1109/ASE.2011.6100047


Olivieri, L., Negrini, L., Arceri, V., Tagliaferro, F., Ferrara, P., Cortesi, A., et al. (2023). “Information flow analysis for detecting non-determinism in blockchain,” in 37th European Conference on Object-Oriented Programming (ECOOP 2023), Volume 263 of Leibniz International Proceedings in Informatics (LIPIcs), eds. K. Ali, and G. Salvaneschi (Dagstuhl: Schloss Dagstuhl - Leibniz-Zentrum für Informatik), 23:1–23:25.


Park, C., Im, H., and Ryu, S. (2016). “Precise and scalable static analysis of jquery using a regular expression domain,” in Proceedings of the 12th Symposium on Dynamic Languages (New York, NY: ACM), 25–36. doi: 10.1145/2989225.2989228


Rice, H. G. (1953). Classes of recursively enumerable sets and their decision problems. Trans. Am. Math. Soc. 74, 358–366. doi: 10.1090/S0002-9947-1953-0053041-6


Veanes, M. (2013). “Applications of symbolic finite automata,” in Implementation and Application of Automata: 18th International Conference, CIAA 2013, Halifax, NS, Canada, July 16-19, 2013. Proceedings 18 (Cham: Springer), 16–23. doi: 10.1007/978-3-642-39274-0_3


Wang, H.-E., Chen, S.-Y., Yu, F., and Jiang, J.-H. R. (2018). “A symbolic model checking approach to the analysis of string and length constraints,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE '18 (New York, NY: Association for Computing Machinery), 623–633. doi: 10.1145/3238147.3238189


Yu, F., Alkhalaf, M., Bultan, T., and Ibarra, O. H. (2014). Automata-based symbolic string analysis for vulnerability detection. Form. Methods Syst. Des. 44, 44–70. doi: 10.1007/s10703-013-0189-1


Yu, F., Bultan, T., Cova, M., and Ibarra, O. H. (2008). “Symbolic string verification: an automata-based approach,” in International SPIN Workshop on Model Checking of Software (Cham: Springer), 306–324. doi: 10.1007/978-3-540-85114-1_21


Zanatta, G., Caiazza, G., Ferrara, P., and Negrini, L. (2025). Inference of access policies through static analysis. Int. J. Softw. Tools Technol. Transfer 26, 797–821. doi: 10.1007/s10009-024-00777-8


Zheng, Y., Zhang, X., and Ganesh, V. (2013). “Z3-str: a z3-based string solver for web application analysis,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (New York, NY: ACM), 114–124. doi: 10.1145/2491411.2491456


Keywords: abstract interpretation, products, program analysis, static analysis, value analysis

Citation: Negrini L (2026) Whole-value analysis by abstract interpretation. Front. Comput. Sci. 7:1655377. doi: 10.3389/fcomp.2025.1655377

Received: 27 June 2025; Revised: 11 December 2025;
Accepted: 23 December 2025;
Published: 20 January 2026.

Edited by:

Novarun Deb, University of Calgary, Canada

Reviewed by:

Mohamed Wiem Mkaouer, University of Michigan-Flint, MI, United States
Benoît Montagu, Inria Rennes - Bretagne Atlantique Research Centre, France

Copyright © 2026 Negrini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Luca Negrini, luca.negrini@unive.it
