A Bayesian approach for constituent estimation in nucleic acid mixture models

Tomono, Taichi; Hara, Satoshi; Nakai, Yusuke; Takahara, Kazuma; Iida, Junko; Washio, Takashi

doi:10.3389/frans.2023.1301602

ORIGINAL RESEARCH article

Front. Anal. Sci., 08 January 2024

Sec. Chemometrics

Volume 3 - 2023 | https://doi.org/10.3389/frans.2023.1301602

A Bayesian approach for constituent estimation in nucleic acid mixture models

Taichi Tomono^1,2,3*

Satoshi Hara¹

Yusuke Nakai²

Kazuma Takahara²

Junko Iida^3,4

Takashi Washio¹

¹Department of Reasoning for Intelligence, The Institute of Scientific and Industrial Research, Osaka University, Ōsaka, Japan
²AI Solution Unit, Technology Research Laboratory, Shimadzu Corporation, Kyoto, Japan
³Shimadzu Analytical Innovation Research Laboratories, Osaka University, Ōsaka, Japan
⁴Life Science Business Department, Analytical and Measuring Instruments Division, Shimadzu Corporation, Kyoto, Japan

Mass spectrometry (MS) is a powerful analytical method used for various purposes such as drug development, quality assurance, food inspection, and monitoring of pollutants in the environment. In recent years, with the active development of antibodies and nucleic acid-based drugs, impurities with various modifications are produced. These can lead to a decrease in drug stability, pharmacokinetics, and efficacy, making it crucial to differentiate these impurities. Previously, attempts have been made to estimate the monoisotopic mass and ion amounts in the spectrum generated by electrospray ionization (ESI). However, conventional methods could not explicitly estimate the number of constituents, and discrete state evaluations, such as the probability that the number of constituents is k or k+1, were not possible. We propose a method where, for each possible number of constituents in the sample, mass spectrometry is modeled using parameters like monoisotopic mass and ion counts. Using Simulated Annealing, NUTS, and stochastic variational inference, we determine the parameters for each constituent number model and the maximum posterior probability. Finally, by comparing the maximum posterior probabilities between models, we select the optimal number of constituents and estimate the monoisotopic mass and ion counts under that scenario.

1 Introduction

Mass Spectrometry (MS) is a powerful analytical technique used for various purposes such as drug development and quality assurance, food inspection, and monitoring of pollutants in the environment. In recent years, with the active development of antibodies and nucleic acid drugs, impurities with different modifications are produced. These can cause a decrease in the stability of the drug, its pharmacokinetics, and its efficacy (Weinberg et al., 2005; Sanghvi, 2011; Pecori et al., 2022; Tamara et al., 2022). Therefore, it is crucial in drug development and quality assurance to distinguish these multiple impurities in pharmaceuticals and take measures against them. Moreover, if we know the monoisotopic mass of the constituents, it can provide valuable information for considering the cause of impurity generation, and if the ion amounts of the constituents are known, it helps estimate the effect of the impurity.

However, in current mass spectrometry, it is generally difficult to directly distinguish impurities in targets of middle molecules or higher that have slight modifications, and to estimate the correct number of constituents and their monoisotopic mass and ion amounts. This challenge arises because separating such impurities using conventional chromatography methods is problematic. Furthermore, the MS spectra become complex due to the isotopes in the target constituent. Especially with the most used ionization method, Electrospray Ionization (ESI), more complex spectra are produced because it creates multivalent ions, leading to many interpretations and degrees of freedom, making analysis even more challenging.

While increasing hardware resolution can differentiate slight differences between isotopes and modifications, methods like FT-ICR (Fourier Transform Ion Cyclotron Resonance), which have high resolution, require massive equipment and substantial costs, making it inconvenient to handle. Thus, it is preferable to analyze with devices that can be managed in general labs, such as Triple-Quadrupole-MS and Quadrupole -Time-of-Flight-MS(Q-TOF-MS).

Therefore, there is active research in approaching signal analysis through software. Various attempts have been made to estimate mass from mass spectrometer data. Simple methods to derive m/z lists from spectra include wavelet transformation (Zhang et al., 2009). Recently, peak detection algorithms have been developed that combine continuous wavelet transformation (CWT) and image processing (Deng et al., 2021). This method applies image segmentation to CWT coefficients, creating masks, and combining them with CWT maximum values to improve peak detection ROC (receiver operating characteristic) curve. Such techniques are useful for resolving close mass constituents in spectra measured with low molecules, which have a relatively narrow isotope distribution, or when ionized with methods producing simple charge distributions like EI (electron ionization) or MALDI (matrix-assisted laser desorption ionization). However, for spectra of middle to high molecules with a wide isotope distribution, especially those generated by ESI, which produce multivalent ions with a charge distribution, distinguishing the monoisotopic mass of interest becomes challenging.

For charge deconvolution and deisotoping from multivalent ion spectra, a lot of algorithms such as heuristic gaussian fitting using nonlinear least squares minimization (Dasari et al., 2009) have been proposed. The ReSpect algorithm using the Max Entropy method (Ferrige et al., 1992) has been long used (Zhang and Alecio, 1998; Tranter, 2000; Ferrige et al., 2003). This algorithm integrates m/z lists based on charge distribution constraints, enabling the determination of monoisotopic mass. However, ReSpect cannot explicitly estimate the number of constituents in the spectrum, and it cannot evaluate discrete states, like the probability that there are $k$ constituents or $k + 1$ constituents. Also, the entropy term of the objective function increases as the number of peaks in the deconvolution spectrum increases, leading to the selection of spectra with many peaks.

Recently, new methods like UniDec using Bayesian deconvolution have emerged (Marty et al., 2015; Marty, 2020). UniDec adopts a unique algorithm similar to the Richardson-Lucy method (Richardson, 1972; Lucy, 1974) and is faster than ReSpect. However, its iterative method to approximate the observed data by a convoluted spectrum does not resolve the issue of not being able to evaluate the probability of a certain number of constituents.

In this paper, we newly propose a method to select the optimal number of constituents by comparing the probability of each constituent count, and to estimate the monoisotopic mass and ion counts under that condition. This can suggest the presence of impurities in pharmaceuticals, assist in the search for better synthesis conditions for middle to high molecular pharmaceuticals, and be useful for quality assurance in factories. For this study, we target Time-of-Flight mass spectrometers, which are frequently used in drug development due to their high sensitivity and resolution.

2 Proposed method

2.1 Analytical method framework

First, we model the mass spectrometry system based on parameters like the mass and charge of each constituent, assuming a certain number of constituents in the sample. Here, a constituent is defined as a substance with a specific monoisotopic mass. We then perform a MAP (Maximum A Posteriori) estimation of these parameters from the observed spectrum. By comparing the maximum posterior probability in models with different numbers of constituents, we determine the model with the most appropriate number of constituents.

However, this model has a large dimensionality of the number of constituents multiplied by 6. Moreover, the posterior probability for one of the parameters, the monoisotopic mass, is flat over a large portion of the search space and has several sharp peaks locally. Hence, gradient-based methods are not suitable for this case due to anticipated gradient vanishing. Therefore, to estimate the parameters, we combine the No-U-Turn Sampler (NUTS (Hoffman and Gelman, 2014), a type of Markov Chain Monte Carlo (MCMC), with Simulated Annealing (Kirkpatrick, Gelatt, and Vecchi, 1983).

The purpose of using Simulated Annealing is to introduce a temperature parameter. By selecting a high-temperature exploration parameter distribution, we can actively explore parameters even in areas where the posterior probability is flat or has sharp peaks. This ensures a broader search across the parameter space, reducing the chance of overlooking the global solution and getting trapped in local minima.

Furthermore, NUTS can explore parameters sparsely in areas with small gradients and can explore parameters in detail in areas with large gradients. Thus, introducing NUTS allows efficient exploration of the vast, high-dimensional parameter space.

On the other hand, while MCMC is good at searching for global solutions, it does not always reach the optimal solution within a certain number of search steps. Therefore, we use the parameters with the highest posterior probabilities obtained from NUTS and Simulated Annealing as initial values and apply stochastic variational inference. By doing this, we search for the optimal parameter where the posterior probability is maximized in the vicinity of that initial value, aiming to improve the accuracy of parameter estimation.

However, simultaneously searching for parameters for all possible numbers of constituents leads to a curse of dimensionality, where the search space explosively expands as the number of constituents increases, potentially reducing search efficiency and accuracy. To avoid this problem, we sequentially increase the number of constituents from $k = 1$ to the maximum conceivable number $k = k_{\max}$ , and for $k$ constituents calculate the optimal parameters and their posterior probabilities, and efficiently focus the parameter search areas for the k+1 constituents by these posterior probabilities.

To balance the complexity of the model (number of constituents) and its fit (loss against the data), in addition to the prior distribution of each parameter, we introduce a prior distribution for the number of constituents. We also incorporate a prior distribution on the differences between the monoisotopic masses of multiple constituents. For analytical purposes, we have defined a single constituent as a substance with a distinct monoisotopic mass, thereby ensuring that their masses do not mutually take the same value. When seeking to separate isomers, it is essential to integrate other techniques such as fragmentation, ion mobility spectrometry, and chromatography, in addition to the proposed method. We first construct a model with $k = 1$ constituent, obtain the optimal parameters and the maximum posterior probability based on the above prior distributions and observed data.

Next, we construct a model with $k = 2$ constituents. For one of the two constituents, we use a prior distribution centered on the optimal parameters already estimated for $k = 1$ , narrowing its range. This suppresses the significant increase in the parameter search space. Based on this new prior distribution, we estimate the optimal parameters and obtain the maximum posterior probability.

Subsequently, we seek the maximum posterior probability for each model with constituent numbers up to the upper limit $k_{\max}$ by efficiently exploring the optimal parameters in the same manner.

Finally, we compare the maximum posterior probabilities corresponding to each model with different numbers of constituents. We select the model with the highest probability and obtain the estimates for the monoisotopic masses and ion counts.

2.2 Physical model of mass spectrometers

In a time-of-flight mass spectrometer, the relationship between time of filght $t_{t o f_t h e o r y}$ , mass $m$ , and charge $z$ can ideally be represented as in Eq. 1 (Boesl, 2017). Here, $V$ represents the acceleration voltage to eject ions, and $y$ is the flight distance of ions.

t_{t o f_t h e o r y} = 0.72 y \sqrt{\frac{m}{z} ∙ \frac{1}{V}} (1)

The actual signal obtained is a convolution of the delta function $δ (t_{t o f})$ with the detector’s response waveform $R (t)$ . Preliminary experiments have shown that $t_{t o f}$ has a stochastic ToF jitter $ε$ following the logarithmic normal distribution $Λ_{t o f} (μ_{t o f}, σ_{t o f}, E (ε) = \exp (μ_{t o f} + \frac{{σ_{t o f}}^{2}}{2}) = 0)$ ( $E : e x p e c t e d v a l u e$ ). Additionally, the height (intensity) of the response waveform also has a stochastic response factor $α$ following the logarithmic normal distribution $Λ_{r e s} (μ_{r e s}, σ_{r e s})$ .

Reflecting these variances in time of flight and detector response intensity, the response $S (t)$ when a single ion with mass m and charge z enters the detector according to Eq. 1 can be written as:

S (t, m, z) = α ∙ R (t) * δ (t - t_{t o f_t h e o r y} - ε) = α ∙ R (t) * δ (t - 0.72 y \sqrt{\frac{m}{z} ∙ \frac{1}{V}} - ε) (2)

The mass $m$ and charge $z$ of constituent $j$ follow a distribution as described below. Since the natural isotopic abundances differ by element, and each functional group has different charge rates, the probability mass functions of mass and charge are represented by a multinomial distribution as:

p_{j} (m) = m_{j}^{'} + m! \prod_{a ∊ \{H, C, N, O, S\}} \frac{1}{n_{a}!} {u_{a}}^{ξ m} (3)

q_{j} (z) = l_{j}! \prod_{b = 1}^{l_{j}} {v_{b}}^{z} (4)

$j : c o n s t i t u e n t I D s (j = 1,2, \dots k)$ $k : n u m b e r o f c o n s t i t u e n t s i n t h e s a m p l e$ $m_{j}^{'} : m o n o i s o t o p i c m a s s o f c o n s t i t u t e n t j$ $n_{a} : n u m b e r o f h y d r o g e n, c a r b o n, n i t r o g e n, o x y g e n, p h o s p h o r u s, s u l f u r a t o m s o f c o n s t i t u e n t j (a = H, C, N, O, S)$ $u_{a} : n a t u r a l i s o t o p i c a b u n d a n c e o f {}^{2}H, {}^{13}C, {}^{15}N, {}^{18}O, {}^{34}S o f c o n s t i t u t e n t j$ $ξ_{a} : i n c r e a s e d n u m b e r o f n e u t r o n s (a = H, C, N, O, S)$ $l_{j} : n u m b e r o f f u n c t i o n a l g r o u p o f c o n s t i t u t e n t j$ $v_{b} : e a c h c h a r g e r a t e o f f u n c t i o n a l g r o u p o f c o n s t i t u t e n t j (b = 1,2, \dots l_{j})$

However, this model becomes too complex for Bayesian inference due to its large number of parameters $\sum_{j = 1}^{k} (l_{j} + 10)$ . Therefore, we introduce an approximation that assumes equal isotopic abundances for all elements, an increase of one neutron in isotopes, and equal charge probabilities for all functional groups.

Consequently, the mass distribution of constituent j can be represented by a binomial distribution based on the monoisotopic mass as a reference, where $n_{j}$ atoms contained in a molecule are replaced by isotopes with an increase of +1Da at a probability of $u_{j}$ . Similarly, for the charge distribution, $l_{j}$ functional groups can be represented by a binomial distribution where they acquire a charge of +1 (in the case the mass spectrometry system is in positive mode) at a rate of $v_{j}$ . This reduces the number of parameters down to six dimensions.

{\tilde{p}}_{j} (m) = m_{j}^{'} + (\begin{array}{c} n_{j} \\ m \end{array}) {u_{j}}^{m} {(1 - u_{j})}^{n_{j} - m} (5)

{\tilde{q}}_{j} (m) = (\begin{array}{c} l_{j} \\ z \end{array}) {v_{j}}^{z} {(1 - v_{j})}^{l_{j} - z} (6)

$n_{j} : r e p r e s e n t a t i v e n u m b e r o f a t o m s o f c o n s t i t u t e n t j$ $u_{j} : r e p r e s e n t a t i v e n a t u r a l i s o t o p i c a b u n d a n c e o f c o n s t i t u t e n t j$ $v_{j} : r e p r e s e n t a t i v e c h a r g e r a t e o f f u n c t i o n a l g r o u p o f c o n s t i t u t e n t j$

Considering that constituent j contains up to $i_{j}$ ions, its spectrum is represented as:

{\bar{S}}_{j} (t) = \sum_{i = 1}^{i_{j}} S (t, m_{i}, z_{i}) (7)

\begin{array}{c} m_{i} \sim {\tilde{p}}_{j} \\ z_{i} \sim {\tilde{q}}_{j} \end{array}

$i_{j} : n u m b e r o f i o n s i n c o n s t i t u e n t j$

If the ion counts are sufficiently large, it can be approximated as:

{\bar{S}}_{j} (t) \sim i_{j} \sum_{m \in Q^{+}} \sum_{z \in Z^{+}} {\tilde{p}}_{j} (m) ∙ {\tilde{q}}_{j} (z) ∙ S (t, m, z) (8)

Noise in the time-of-flight mass spectrometer is known to be stationary and follows a normal distribution based on preliminary experiments. It is known that the thermal noise of detection circuits in such as mass spectrometers follows a normal distribution (Johnson, 1928), suggesting that in this case, thermal noise is the dominant factor in overall noise. Therefore, the noise to be added to the entire spectrum is represented as $r (t) \sim N (μ_{n o i s e}, σ_{n o i s e})$ .

Considering these, the conclusive spectrum combined with the multiple spectra of single constituent is represented as:

\bar{S} (t) = \sum_{j = 1}^{k} {\bar{S}}_{j} (t) + r (t) (9)

2.3 Bayesian estimation of number of constituents and parameters

When the observation data from the mass spectrometer ${\bar{S}}_{o b s} (t)$ is obtained, assuming the number of constituents as $k$ , the posterior probability distribution $P (θ | {\bar{S}}_{o b s} (t))$ for parameters $θ : [(m_{1}^{'}, i_{1}, n_{1}, u_{1}, l_{1}, v_{1}), \dots, (m_{k}^{'}, i_{k}, n_{k}, u_{k}, l_{k}, v_{k})]$ is defined as per Bayes’ theorem. Note that $P ({\bar{S}}_{o b s} (t) | θ)$ represents the likelihood of parameters θ when ${\bar{S}}_{o b s} (t)$ is provided, and $P (θ)$ denotes the prior distribution.

P (θ | {\bar{S}}_{o b s} (t)) \propto P ({\bar{S}}_{o b s} (t) | θ) P (θ) (10)

Here, in addition to the prior distribution of each parameter (uniform distribution), we incorporate a regularization term, $w_{b i c}$ to achieve a suitable balance between model complexity (number of constituents) and model fit (loss with respect to data). We also introduce a regularization term, $w_{e x}$ , to prevent multiple constituents within the same model from assuming the same monoisotopic mass. Hence, we introduce the following logarithmic prior distribution:

\log (P (θ)) \propto - (w_{b i c} + w_{e x}) (11)

To determine the appropriate number of constituents $k$ , we define the regularization term $w_{b i c}$ representing the complexity of the model with $k$ based on the Bayesian Information Criterion (BIC). The BIC is a statistical measure that balances the fit to the data and model complexity (Schwarz, 1978; Neath and Cavanaugh, 2012). Here, $n$ represents the dimension of the observation data ${\bar{S}}_{o b s} (t)$ , which in this study is the number of data points in the time direction.

w_{b i c} = λ \cdot \frac{k}{2} \cdot \log n

λ : 300 (h y p e r p a r a m e t e r)

Furthermore, we define a constituent by its unique monoisotopic mass. Therefore, if the estimated values of the monoisotopic mass parameters of multiple constituents are the same in the algorithm, the count of constituents will not be accurate. Here, we define the logarithmic prior distribution (regularization term) $w_{e x}$ as follows, using a penalty that increases exponentially according to the difference in estimated monoisotopic mass values, as shown in Figure 1. The integral of the spectrum $\int_{0}^{\infty} {\bar{S}}_{o b s} (t) d t$ is also multiplied as a coefficient to ensure that the impact of the penalty does not change depending on the scale of the observed data.

w_{e x} = \sum_{i = 1}^{k - 1} \sum_{j = i + 1}^{k} \frac{a \int_{0}^{\infty} {\bar{S}}_{o b s} (t) d t}{2 b} \exp (- \frac{|m_{i} - m_{j}|}{b}) (12)

a = 0.001, b = 0.1 (h y p e r p a r a m e t e r)

FIGURE 1

FIGURE 1. Characteristics of the regularization term $w_{e x .}$

Here, by substituting the parameter $θ$ generated from MCMC into model (9), we obtain the spectrum as ${\hat{S}}_{p r o p} (t)$ . We assume a normal distribution for the noise. The standard deviation of the noise denoted as $σ$ is set to 2,000. Consequently, the logarithm of the posterior probability distribution is as follows.

\begin{array}{c} \log (P (θ | {\bar{S}}_{o b s} (t))) \propto \log (P ({\bar{S}}_{o b s} (t) | θ) P (θ)) = \log (P ({\bar{S}}_{o b s} (t) | θ)) \log (P (θ)) \\ \propto L : = β \log (\frac{1}{\sqrt{2 π σ^{2}}} \exp (- \frac{{|{\hat{S}}_{p r o p} (t) - {\bar{S}}_{o b s} (t)|}^{2}}{2 σ^{2}})) - β (w_{b i c} + w_{e x}) \\ = - β (\frac{1}{2 σ^{2}} {|{\hat{S}}_{p r o p} (t) - {\bar{S}}_{o b s} (t)|}^{2} - n \log (σ) - \frac{n}{2} \log (2 π) - w_{b i c} - w_{e x}) \end{array} (13)

Here, $β$ represents the inverse temperature. As described in 2.1, we use Simulated Annealing to ensure active parameter exploration in flat areas or sharp peaks of posterior probability. This is achieved by multiplying the inverse temperature $β (< 1)$ to the posterior probability. Initially starting from a low inverse temperature value (i.e., high temperature) and gradually increasing to a higher value (i.e., low temperature). At low inverse temperatures (high temperatures), the system explores a wide parameter space. Conversely, at high inverse temperatures (low temperatures), the system converges to the optimal solution. This time, we set the temperature change in three stages: $β = {0.2}^{5} \to {0.2}^{4} \to {0.2}^{3}$ .

To obtain the maximum posterior probability and parameters $θ$ that maximize the posterior probability (formula (13)), we conduct sampling from this posterior probability distribution using MCMC.

2.4 Parameter exploration and optimization

From the posterior probability distribution $P (θ | {\bar{S}}_{o b s} (t))$ , we sample the parameter $θ$ to select the one that maximizes the posterior probability. We use the No-U-Turn Sampler (NUTS) for sampling, a recent and popular variant of the Markov Chain Monte Carlo (MCMC) method. NUTS is a type of MCMC, especially a derivative of the Hamiltonian Monte Carlo method (HMC) (Neal, 2011; Betancourt, 2017).

After executing MCMC, the parameters of the maximum posterior probability obtained are inherited as initial values, and optimization of the parameters is performed using Stochastic Variational Inference [SVI (Kingma and Max, 2013; Wingate and Weber, 2013; Ranganath et al., 2014)]. For more details, please refer to the Supporting Material.

2.5 Workflow for inferring constituents in a sample

The overall picture of the workflow to determine the optimal parameters and posterior probability for each assumed number of constituents from the observational data of the mass spectrometer is as shown in Figure 2.

FIGURE 2

FIGURE 2. Inference process overall workflow.

First, as described in 2.3.1, (i) input the observational data of the mass spectrometer with dimensions of flight time and ion counts. Then (ii) assume that the number of constituents, $k$ , contained in the sample is 1. (iii) Set the inverse temperature to ${0.2}^{5}$ .

Next, as described in 2.3.2, (iv) sample $1,000 \times 4$ times from the posterior probability distribution, (v) set the MAP solution obtained by MCMC as the initial value for the next MCMC. Then (vi) divide the inverse temperature by 0.2. Repeat steps (iv) to (vi) three times.

As described in 2.3.3, (vii) set the parameter of the maximum posterior probability obtained by MCMC as the initial value for SVI. Then (viii) optimize the parameters with SVI, and (ix) set the parameter of the maximum posterior probability obtained by SVI as the initial value for the next MCMC.

Following 2.3.1, increase the number of constituents, $k$ , by 1. Repeat steps (iii) to (vi). (x) Continue this until the maximum possible number of constituents, $k_{\max}$ . Finally, (xi) compare the maximum posterior probabilities of models from constituents $k = 1$ to $k_{\max}$ , and (xii) select the model with the largest posterior probability. Also, obtain the optimal parameters at that time.

3 Results

3.1 Validation environment

The specifications of the PC used for verifying the proposed method, as well as the software versions, are detailed in Table 1. The proposed method handles data with 1 million dimensions along the time axis, requiring a large memory size. Additionally, to rapidly explore a wide 6-dimensional parameter space using MCMC, the high-speed probabilistic programming library, NumPyro, along with its compatible CUDA and GPU, were used.

TABLE 1

TABLE 1. Validation environment.

3.2 Creation of simulation data for validation

Based on the nucleic acid drug Fomivirsen (Perry and Balfour, 1999) (ID: A), four impurity constituents with modified base sequences were added, and spectra for a total of five constituents were generated via simulation. Specific values are as per Table 2. This enables the replication of a system where the principal constituent’s isotopic distribution and the impurity spectra are mixed.

TABLE 2

TABLE 2. Settings for constituent spectrum generation.

Ion counts for each constituent were set at 20,000. To facilitate the interpretation of results and to ensure that the algorithm treats each constituent fairly, we will conduct evaluations using a 1:1 concentration ratio for each component in the proposed method. The atomic counts $n_{a}$ for H, C, N, O, and S were obtained from the molecular formula of each constituent. Natural isotopic abundance ratios $u_{j}$ followed the NIST Atomic Weights and Isotopic Compositions for All Elements. The representative functional group number $l_{j}$ and the representative charge rate $v_{j}$ were set to 224 and 0.035, respectively, to ensure that the generated spectra resembled real data.

The procedure involved sampling from the multinomial distribution represented by Eqs 3 and 4 20,000 times (total incoming ion counts) for each constituent. Subsequently, spectra were formed following the procedures in Eqs 2 and 7.

The mutation from C (Cytosine) to U (Uracil) is called deamination and is generated in the synthesis process due to solvent conditions and thermal stress (Stavnezer, 2011; Gao, Choudhry, and Cao, 2018).

The spectra of the generated single constituents A to E were combined according to Eq. 9 in the 15 combinations listed in Table 3. This allows for a comprehensive combination of 2-3 constituents based on constituent A, as well as an evaluation of each individual constituent. We use these as test data.

TABLE 3

TABLE 3. Combinations of constituents when generating spectra.

3.3 Evaluation of constituent count estimation accuracy

The results of estimating the number of constituents in the spectra of the test data (Mixture No.1∼15) using our proposed method are as shown in Table 4. The values within the table represent the negative logarithm of the maximum posterior probability in the model of constituent count k. Therefore, the smallest value should be selected.

TABLE 4

TABLE 4. Negative logarithm of the maximum posterior probability assuming each constituent count (Orange background indicates the true number of constituents, blue text indicates the minimum value across models).

By choosing the most suitable number of constituents based on this criterion, the success rate for estimating the true number of constituents was 80% (12/15). Additionally, the presence or absence of impurities (distinguishing between k = 1 and k≧2) could be determined with 100% accuracy. We believe this is sufficient as a standard for recognizing the presence and number of impurities in pharmaceuticals and taking appropriate measures.

3.4 Accuracy of parameter estimation with maximum posterior

The optimal monoisotopic masses and ion counts estimated in the model where the posterior probability is maximum for each test data are shown in Table 5. The monoisotopic mass had an average error of 1.348 Da and a maximum error of 4.931 Da. This is insufficient to determine how many mutations have occurred, making it unsuitable for examining the cause of impurity generation with a difference of 1 Da. Regarding the ion counts, there was an average error of 4% and a maximum error of 82%. For instance, the standards for total desamido impurity and total impurities in injectable glucagon are 14% or less and 31% or less, respectively (Bao et al., 2022). Therefore, the accuracy of the ion count estimation in the proposed method is insufficient to estimate the impact of impurities.

TABLE 5

TABLE 5. Optimal monoisotopic masses and ion counts of the model with the maximum posterior probability.

For reference, a comparison between the spectra reconstructed from the estimated parameters and the original signal is shown in Figure 3. The overall view in (a) represents the charge distribution, and the enlarged view in (b) represents the isotopic distribution. From these results, it is clear that the spectrum we generated closely matches the observed data. Despite the spectra matching, errors in parameter estimation occurred because of the high degree of freedom in isotopic parameters that trade-off with monoisotopic mass. Even if the monoisotopic mass was lower than the true value, by increasing the representative atomic number $n_{j}$ or the representative isotopic natural abundance $u_{j}$ , it is possible to make it fit the observed data to some extent.

FIGURE 3

FIGURE 3. Comparison of observed and estimated spectra for Mixture No. 1. (A) Overall view, (B) Enlarged view.

Also, the estimated ion counts of each constituent showed errors of up to 82% from the true values. This is presumed to be due to the trade-off relationship between the ion counts of each constituent, with a decrease in the ion count of one constituent being compensated by an increase in another. This is further supported by the fact that the average error in ion counts settles at 4%.

3.5 Comparison with UniDec

Deconvolution of the test data was performed using the existing method, UniDec as well. The results of deconvolution for each observed spectrum by UniDec are shown in Table 6. According to these results, the accuracy for the correct number of constituents was 13% (2/15). This is presumed to be because the UniDec algorithm, which obtains the number of constituents after multiple iterations of deconvolution, does not necessarily guarantee the number of constituents. Please note that this use of UniDec to determine the number of constituents is not its intended application.

TABLE 6

TABLE 6. Deconvolution results for each observed spectrum by UniDec.

For the verification above, we used UniDec (Version 6.0.2). The particularly set parameters during this verification are shown in Table 7. The Mass Range was set to the same range as the proposed method, and Sample Mass Every (Da) was set to 0.1 to sufficiently detect impurities with a difference of 1 Da. For parameters not mentioned, default values were used.

TABLE 7

TABLE 7. UniDec setting parameters.

4 Discussion

Using NUTS, Simulated Annealing, and stochastic variational inference, we estimated parameters such as monoisotopic masses from observed data, and were able to choose the correct number of constituents with a higher probability than existing methods. This is thought to be due to the fact that we created models for each number of constituents, allowing for the comparative evaluation and selection of models for each number of constituents. This made it possible to suggest the presence of impurities in pharmaceuticals, which is useful for searching for better synthesis conditions for middle to high molecular weight pharmaceuticals, and for quality assurance in factories.

On the other hand, as shown in Table 5, the estimated monoisotopic mass had a maximum error of 4.931Da from the true value. This is thought to be due to the trade-off relationship between the monoisotopic mass $m_{j}^{'}$ and the parameters $n_{j}$ and $u_{j}$ that determine the isotopic distribution. Additionally, there was a relative error of several tens of percent from the true value in the ion counts of each estimated constituent. This is speculated to be because the ion counts of each constituent trade off with each other, with a decrease in one ion being compensated for by an increase in another ion. A potential solution to these problems is to represent monoisotopic masses and ion counts as probability distributions. By considering the uncertainty in monoisotopic masses and ion counts of constituents in the sample, improvements in estimation satisfaction can be expected.

Furthermore, it took about 50 h for deconvolution assuming 5 constituents per data. This is long compared to the few seconds to a few minutes processing time of UniDec. Also, this processing time is expected to increase almost linearly with the assumed number of constituents. Therefore, it is expected to take a long time when analyzing samples with many constituents, such as serum or environmental samples. A possible countermeasure to this problem is to divide the monoisotopic mass space into mini-batches and perform parallel calculations.

5 Conclusion

We assumed multiple numbers of constituents in the sample and created a mass spectrometry model from parameters such as monoisotopic masses and ion counts. We then sought the maximum posterior probability in the model of each number of constituents against observed data using NUTS, Simulated Annealing, and stochastic variational inference. As a result, we were able to estimate the number of constituents with high accuracy. We were also able to estimate parameters such as monoisotopic masses and ion counts at the same time.

Future challenges include reducing computation time, improving mass accuracy, and improving ion count accuracy. Incorporating chromatography or ion mobility information, addressing more stringent concentration ratios between constituents (e.g., greater than 10:1), and adapting to complex samples with more constituents will be pursued to expand the applicability of this method.

Data availability statement

The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

Author contributions

TT: Conceptualization, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing–original draft. SH: Writing–review and editing, Conceptualization. YN: Methodology, Software, Writing–review and editing. KT: Software, Writing–review and editing. JI: Writing–review and editing. TW: Conceptualization, Supervision, Writing–review and editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Acknowledgments

We extend our deepest gratitude to Yoshihiro Ueno, Yusuke Tagawa, Daisuke Okumura, and Daisuke Hiramaru for their advice and coordination on the project as a whole. We also thank Akira Noda and Yusuke Tamai for their insights on Bayesian estimation. Atsuhiko Toyama, Natsuyo Asano, Hiroaki Waki, Kiyoshi Ogawa, Hideaki Izumi, Masahiro Takebe, Takashi Kawabe, and Yusuke Tateishi for their advice on the needs and trends of mass spectrometers. Our appreciation extends to Makoto Yamada and Tomoyuki Oshiro for obtaining the single response waveforms of the detectors, to Masaru Nishiguchi and Hiroyuki Miura for providing samples, to Momoka Hayashida and Noriko Kato for offering insights on nucleic acid analysis, to Yuta Miyazaki for advice on noise characteristics, and to Tomoya Kudo for guidance on ion optics simulations.

Conflict of interest

Authors TT, YN, KT, and JI were employed by Shimadzu Corporation.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frans.2023.1301602/full#supplementary-material

References

Bao, Z., Cheng, Y. C., Luo, M. Z., and Zhang, J. Y. (2022). Comparison of the purity and impurity of glucagon-for-injection products under various stability conditions. Sci. Pharm. 90 (2), 32. doi:10.3390/scipharm90020032

CrossRef Full Text | Google Scholar

Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. ArXiv [Stat.ME]. Available at: http://arxiv.org/abs/1701.02434.

Google Scholar

Boesl, U. (2017). Time-of-Flight mass spectrometry: introduction to the Basics. Mass Spectrom. Rev. 36 (1), 86–109. doi:10.1002/mas.21520

PubMed Abstract | CrossRef Full Text | Google Scholar

Dasari, S., Wilmarth, P. A., Reddy, A. P., Robertson, L. J. G., Nagalla, S. R., and David, L. L. (2009). Quantification of isotopically overlapping deamidated and 18O-labeled peptides using isotopic envelope mixture modeling. J. Proteome Res. 8 (3), 1263–1270. doi:10.1021/pr801054w

PubMed Abstract | CrossRef Full Text | Google Scholar

Deng, F., Li, H., Wang, R., Yue, H., Zhao, Z., and Duan, Y. (2021). An improved peak detection algorithm in mass spectra combining wavelet Transform and image segmentation. Int. J. Mass Spectrom. 465, 116601. doi:10.1016/j.ijms.2021.116601

CrossRef Full Text | Google Scholar

Ferrige, A., Ray, S., Alecio, R., Ye, S., and Waddell, K. (2003). Electrospray-MS charge deconvolutions without compromise – an enhanced data reconstruction algorithm utilising variable peak modelling. Available at: https://positiveprobability.com/POSTERS/ASMS%202003.pdf.

Google Scholar

Ferrige, A. G., Seddon, M. J., Green, B. N., Jarvis, S. A., Skilling, J., and Staunton, J. (1992). Disentangling electrospray spectra with maximum entropy. Rapid Commun. Mass Spectrom. RCM 6 (11), 707–711. doi:10.1002/rcm.1290061115

CrossRef Full Text | Google Scholar

Gao, J., Choudhry, H., and Cao, W. (2018). Apolipoprotein B MRNA editing enzyme catalytic polypeptide-like family genes activation and regulation during tumorigenesis. Cancer Sci. 109 (8), 2375–2382. doi:10.1111/cas.13658

PubMed Abstract | CrossRef Full Text | Google Scholar

Hoffman, M. D., and Gelman, A. (2014). The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. JMLR 15 (1), 1593–1623. doi:10.48550/arXiv.1111.4246

CrossRef Full Text | Google Scholar

Johnson, J. B. (1928). Thermal agitation of electricity in conductors. Phys. Rev. 32 (1), 97–109. doi:10.1103/physrev.32.97

CrossRef Full Text | Google Scholar

Kingma, D. P., and Max, W. (2013). Auto-encoding variational Bayes. ArXiv [Stat.ML]. Available at: http://arxiv.org/abs/1312.6114v11.

Google Scholar

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science 220 (4598), 671–680. doi:10.1126/science.220.4598.671

PubMed Abstract | CrossRef Full Text | Google Scholar

Lucy, L. B. (1974). An iterative technique for the rectification of observed distributions. Astronomical J. 79, 745. doi:10.1086/111605

CrossRef Full Text | Google Scholar

Marty, M. T. (2020). A universal score for deconvolution of intact protein and native electrospray mass spectra. Anal. Chem. 92 (6), 4395–4401. doi:10.1021/acs.analchem.9b05272

PubMed Abstract | CrossRef Full Text | Google Scholar

Marty, M. T., Baldwin, A. J., Marklund, E. G., Hochberg, G. K. A., Benesch, J. L. P., and Robinson, C. V. (2015). Bayesian deconvolution of mass and ion mobility spectra: from binary interactions to polydisperse ensembles. Anal. Chem. 87 (8), 4370–4376. doi:10.1021/acs.analchem.5b00140

PubMed Abstract | CrossRef Full Text | Google Scholar

Neal, R. (2011). “Handbook of Markov chain Monte Carlo,” in Chapter 5: MCMC using Hamiltonian Dynamics (United States: CRC Press).

Google Scholar

Neath, A. A., and Cavanaugh, J. E. (2012). The bayesian information criterion: background, derivation, and applications. WIREs Comput. Stat. 4 (2), 199–203. doi:10.1002/wics.199

CrossRef Full Text | Google Scholar

Pecori, R., Di Giorgio, S., Paulo Lorenzo, J., and Nina Papavasiliou, F. (2022). Functions and consequences of AID/APOBEC-Mediated DNA and RNA deamination. Nat. Rev. Genet. 23 (8), 505–518. doi:10.1038/s41576-022-00459-8

PubMed Abstract | CrossRef Full Text | Google Scholar

Perry, C. M., and Balfour, J. A. (1999). Fomivirsen. Drugs 57 (3), 375–380. discussion 381. doi:10.2165/00003495-199957030-00010

PubMed Abstract | CrossRef Full Text | Google Scholar

Ranganath, R., Gerrish, S., and Blei, D. (2014). “Black Box variational inference,” in Proceedings of the Seventeenth international conference on artificial intelligence and Statistics, edited by Samuel Kaski and Jukka corander, 33:814–22. Proceedings of machine learning research (Reykjavik, Iceland: PMLR).

Google Scholar

Richardson, W. H. (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62 (1), 55. doi:10.1364/josa.62.000055

CrossRef Full Text | Google Scholar

Sanghvi, Y. S. (2011). A status update of modified oligonucleotides for chemotherapeutics applications. Curr. Protoc. Nucleic Acid Chem. 2011, 1–22. Chapter 4 (September): Unit 4.1. doi:10.1002/0471142700.nc0401s46

PubMed Abstract | CrossRef Full Text | Google Scholar

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statistics 6 (2), 461–464. doi:10.1214/aos/1176344136

CrossRef Full Text | Google Scholar

Stavnezer, J. (2011). Complex regulation and function of activation-induced cytidine deaminase. Trends Immunol. 32 (5), 194–201. doi:10.1016/j.it.2011.03.003

PubMed Abstract | CrossRef Full Text | Google Scholar

Tamara, S., den Boer, M. A., and Heck, A. J. R. (2022). High-resolution native mass spectrometry. Chem. Rev. 122 (8), 7269–7326. doi:10.1021/acs.chemrev.1c00212

PubMed Abstract | CrossRef Full Text | Google Scholar

Tranter, R. L. (2000). Design and analysis in chemical research. United States: John Wiley & Sons.

Google Scholar

Weinberg, W. C., Frazier-Jessen, M. R., Wu, W. J., Weir, A., Hartsough, M., Keegan, P., et al. (2005). Development and regulation of monoclonal antibody products: challenges and opportunities. Cancer Metastasis Rev. 24 (4), 569–584. doi:10.1007/s10555-005-6196-y

PubMed Abstract | CrossRef Full Text | Google Scholar

Wingate, D., and Weber, T. (2013). Automated variational inference in probabilistic programming. ArXiv E-Prints, arXiv:1301.1299. Available at: https://doi.org/10.48550/arXiv.1301.1299.

Google Scholar

Zhang, J., Gonzalez, E., Hestilow, T., Haskins, W., and Huang, Y. (2009). Review of peak detection algorithms in liquid-chromatography-mass spectrometry. Curr. Genomics 10 (6), 388–401. doi:10.2174/138920209789177638

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, K., and Alecio, R. (1998). “A novel approach to the automated analysis of peptide mapping data,” in Proceedings of the Estonian academy of Sciences. Biology, ecology = eesti teaduste akadeemia toimetised (Okoloogia: Bioloogia).

Google Scholar

Keywords: LC-MS, ESI, chemometrics, Bayesian inference, deconvolution, signal processing, nucleic-acid-drugs

Citation: Tomono T, Hara S, Nakai Y, Takahara K, Iida J and Washio T (2024) A Bayesian approach for constituent estimation in nucleic acid mixture models. Front. Anal. Sci. 3:1301602. doi: 10.3389/frans.2023.1301602

Received: 05 September 2023; Accepted: 14 December 2023;
Published: 08 January 2024.

Edited by:

Krzysztof Bernard Bec, University of Innsbruck, Austria

Reviewed by:

Rui Vitorino, University of Aveiro, Portugal
Marina De Gea Neves, University of Duisburg-Essen, Germany

Copyright © 2024 Tomono, Hara, Nakai, Takahara, Iida and Washio. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Taichi Tomono, dF90YWljaGlAc2hpbWFkenUuY28uanA=

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.