Designing Experiments to Discriminate Families of Logic Models

Logic models of signaling pathways are a promising way of building effective in silico functional models of a cell, in particular of signaling pathways. The automated learning of Boolean logic models describing signaling pathways can be achieved by training to phosphoproteomics data, which is particularly useful if it is measured upon different combinations of perturbations in a high-throughput fashion. However, in practice, the number and type of allowed perturbations are not exhaustive. Moreover, experimental data are unavoidably subjected to noise. As a result, the learning process results in a family of feasible logical networks rather than in a single model. This family is composed of logic models implementing different internal wirings for the system and therefore the predictions of experiments from this family may present a significant level of variability, and hence uncertainty. In this paper, we introduce a method based on Answer Set Programming to propose an optimal experimental design that aims to narrow down the variability (in terms of input–output behaviors) within families of logical models learned from experimental data. We study how the fitness with respect to the data can be improved after an optimal selection of signaling perturbations and how we learn optimal logic models with minimal number of experiments. The methods are applied on signaling pathways in human liver cells and phosphoproteomics experimental data. Using 25% of the experiments, we obtained logical models with fitness scores (mean square error) 15% close to the ones obtained using all experiments, illustrating the impact that our approach can have on the design of experiments for efficient model calibration.


SUPPLEMENTARY ALGORITHMS
In what follows we provide pseudo-code algorithms describing the workflow used for testing our experimental design method ( Figure 1). First, let us introduce the required notation for such algorithms. We denote a prior knowledge network with nodes V and edges E with (V, E). We use (P, O) to denote a phospho-proteomics dataset consisting of P signaling perturbations and their corresponding vector observations O : P → [0, 1] m , where m is the number of readouts. In particular, we use (Ψ, Θ) to denote the dataset under consideration for testing purposes. On the one hand, for artificial case studies (Ψ, Θ) corresponds to the complete simulated dataset with respect to certain gold standard. On the other hand, for real case studies (Ψ, Θ) corresponds to the available follow-up dataset. Notably, in a real-life application, one would have only the list of feasible signaling perturbations Ψ, while observations will be generated after performing concrete wet experiments.
We assume the existence of a function LEARN for learning nearly optimal BNs and their corresponding set of input-output behaviors. Such a function is parametrized by the PKN (V, E), the phospho-proteomics dataset (P, O), F ∈ [0, 1] and S ≥ 0 specifying allowed tolerances with respect to optimal fitness and size, respectively. In addition, we assume a function M SE which returns the Mean Squared Error for a set of behaviors B with respect to a given phospho-proteomics dataset (P, O). Also, we assume the existence of a function DISCRIM IN AT E implementing the method described in Section 2.2. In this case, the parameters are the set of input-output beahaviors B, the list of feasible perturbations Ψ, the maximum allowed number of stimuli s, inhibitors i, and perturbations k to discriminate the given set of behaviors, and the Boolean flag relax specifying whether we require full pairwise discrimination or not. As a results, DISCRIM IN AT E returns all optimal sets of signaling perturbations to discriminate behaviors in B according to the given parameters. Note that there may be no solutions, i.e., an empty set is returned.
Algorithm S1 shows the pseudo-code for an auxiliary function implementing the execution of "dry experiments". The inputs of such a function are the dataset (P, O), sets of desired signaling perturbations D, and the available observations Θ. The goal of this function is to extend the dataset (P, O) with a randomly chosen set of perturbations D ∈ D by taking the corresponding observations from Θ. In addition, the function ensures that the total number of experiments in (P, O) remains lower or equal a global value M AX EXP ERIM EN T S ALLOW ED. If there exists a set of perturbations in D to extend the current dataset (P, O), the function returns the value T rue and (P, O) extended accordingly. Otherwise, the function returns the value F alse and (P, O) unchanged. It is worth noting that for the given pseudo-code we rely on the restriction and union of functions. Thus, let us recall the corresponding definitions for each case. For f : return F alse, (P, O) All sets of perturbations were performed already, i.e., D ⊆ P, ∀D ∈ D 15: end function Algorithm S2 shows the pseudo-code for the main function implementing our workflow. Line 3 starts the main loop controlled with the Boolean variable done. In Line 6 the algorithm calls the function LEARN in order to find all strictly optimal input-output behaviors, i.e., no tolerance on neither fitness nor size. Then, in Line 8 and Line 9 the algorithm prints the learning and testing MSE respectively. Notably, in a real-life application the testing MSE can not be computed as we would not have access to Θ. Next, if only 1 input-output behavior is found, in Line 12 to Line 21, the algorithm explores nearly optimal behaviors by considering a range of tolerances, first over the size and then over the fitness. After exploring nearly optimal models, if there is still only 1 input-output behavior, the algorithm goes to Line 41 and terminates the loop. Otherwise, in Line 26 the algorithm calls the function DISCRIM IN AT E in order to discriminate among found behaviors requiring full pairwise discrimination. If there is at least one solution, i.e., an optimal set of perturbations, the algorithm calls the auxiliary function DRY EXP and stores the returned value overriding variables done and (P, O). If done is set to F alse (i.e., extended is set to T rue), then it means that (P, O) was extended with new perturbations and their corresponding observations. Thus, the loop will start over from Line 4. On the contrary, if done is set to T rue (i.e., extended is set to F alse), then it means that (P, O) could not be extended and the loop terminates. In addition, in Line 26 the call to DISCRIM IN AT E may fail due to the requirement of full pairwise discrimination and return an empty set. In such a case, in Line 32 the algorithm calls the function DISCRIM IN AT E one more time but without requiring full pairwise discrimination. If there is at least one solution the algorithm proceeds as before. Otherwise, it means that it is not possible to generate any difference among the behaviors at hand considering the given list of feasible perturbations. Thus, done is set to T rue and the loop terminates.

TOY EXAMPLE ILLUSTRATION FOR THE EXPERIMENTAL DESIGN LOOP
An important result from our work is the fact that we need to introduce tolerances in the learning loop because when learning only strictly optimal models with respect to a partial experimental dataset, we may miss (global) optimal models with respect to the complete experimental dataset. Let us illustrate this with the following toy example.
We consider a PKN ( Figure S1a) in which a and b are stimuli while c and d, are readouts. From this PKN we derived a gold standard Boolean network shown in Figure S1b. Then, for the given experimental setup, all possible perturbations and their corresponding responses over the gold standard are shown in Figure S1c. Next, we start the learning loop with the PKN ( Figure S1a) and using experiments #1 and #2 ( Figure S1c). The optimal BNs learned (without tolerance) will be a family of two BNs (Iteration 0 in Figure S2), which has 2 input-output behaviors. If we compute the optimal signaling perturbations to discriminate such behaviors, the experiment proposed is #3 in Figure S1c (a = 0, b = 0). Once the experiment #3 is added to the dataset used for learning, the resulting optimal models is actually a single BN (Iteration 1 (top) in Figure S2). Notably, with 1 BN there can be only one input-output behavior. Thus, our method aims at learning nearly optimal BNs considering a range of tolerances with respect to the optimal size and fitness. In our toy example, considering BNs with size larger than the optimal by 1, yields a family of two BNs (Iteration 1 (bottom) in Figure S2), describing 2 input-output behaviors. The optimal signaling perturbation proposed to discriminate these input-output behaviors is the experiment #4 in Figure S1c (a = 1, b = 1). Finally, learning optimal BNs using experiments #1 to #4 yields the gold standard BN (Iteration 2 in Figure S2). It is worth noting that the input-output behaviors in all the steps are not necessarily the same, since BN topologies appear and dissapear from the set of optimal BNs compatible as the dataset changes. Importantly, without the incremental procedure of tolerance parameters, we would have ended the learning loop at iteration 1 and without finding the gold standard.