# CHEMOMETRICS-BASED SPECTROSCOPY FOR PHARMACEUTICAL AND BIOMEDICAL ANALYSIS

EDITED BY: Vu Dang Hoang and Federico Marini. PUBLISHED IN: Frontiers in Chemistry

### Frontiers Copyright Statement

© Copyright 2007-2019 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.

The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.

Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.

Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.

As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.

All copyright, and all rights therein, are protected by national and international copyright laws.

The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use. ISSN 1664-8714 ISBN 978-2-88945-845-5 DOI 10.3389/978-2-88945-845-5

### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

# Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

# Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

# What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# CHEMOMETRICS-BASED SPECTROSCOPY FOR PHARMACEUTICAL AND BIOMEDICAL ANALYSIS

Topic Editors: Vu Dang Hoang, Hanoi University of Pharmacy, Vietnam Federico Marini, Sapienza University of Rome, Italy

Chemometrics is the application of mathematics and statistics to chemical data in order to design or select optimal experimental procedures, to provide maximum relevant information, and to obtain knowledge about systems under study. Since its inception in the 1970s, this discipline has developed steadily into a mature field of Analytical Chemistry. The utility and versatility of chemometric techniques enable spectroscopists to perform multidimensional classification and/or calibration of spectral data, making identification and quantification of analytes in complex mixtures possible.

Wavelets are mathematical functions that cut up data into different frequency components, and then study each component with a resolution matched to its scale. They are now being adapted for a vast number of signal processing applications due to their success in terms of asymptotic optimality, spatial adaptivity and computational efficiency. In analytical chemistry, they have increasingly shown great applicability and have been preferred over existing signal processing algorithms in noise removal, resolution enhancement, data compression and chemometric modeling.

The aim of this Research Topic is to present state-of-the-art applications of chemometrics, in the field of spectroscopy, with special attention to the use of wavelet transform. Both reviews and original research articles on pharmaceutical and biomedical analysis are welcome in the specialty section Analytical Chemistry.

Citation: Hoang, V. D., Marini, F., eds. (2019). Chemometrics-based Spectroscopy for Pharmaceutical and Biomedical Analysis. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-845-5

# Table of Contents

*Editorial: Chemometrics-Based Spectroscopy for Pharmaceutical and Biomedical Analysis*

Hoang Vu Dang and Federico Marini


Roberta Risoluti and Stefano Materazzi

*A Plasma Biochemical Analysis of Acute Lead Poisoning in a Rat Model by Chemometrics-Based Fourier Transform Infrared Spectroscopy: An Exploratory Study*

Wenli Tian, Dan Wang, Haoran Fan, Lujuan Yang and Gang Ma


Oleg Ryabchykov, Juergen Popp and Thomas Bocklitz


Xiangyun Ma, Xueqing Sun, Huijie Wang, Yang Wang, Da Chen and Qifeng Li

Andrey Bogomolov, Joachim Mannhardt and Oliver Heinzerling

# Editorial: Chemometrics-based Spectroscopy for Pharmaceutical and Biomedical Analysis

### Hoang Vu Dang<sup>1</sup> \* and Federico Marini <sup>2</sup>

*<sup>1</sup> Department of Analytical Chemistry and Toxicology, Hanoi University of Pharmacy, Hanoi, Vietnam, <sup>2</sup> Department of Chemistry, Sapienza University of Rome, Rome, Italy*

Keywords: chemometrics, spectroscopy, pharmaceutical analysis, biomedical analysis, wavelet transform

**Editorial on the Research Topic**

### **Chemometrics-based Spectroscopy for Pharmaceutical and Biomedical Analysis**

Spectroscopy is associated with a plethora of different techniques studying the interaction between matter and electromagnetic radiation. Linguistically speaking, the term originates from the Latin word "spectrum" meaning "specter or image/vision," and the Greek word "σκοπεῖν" meaning "to view or inspect." In other words, it is concerned with the absorption, emission, or scattering of electromagnetic radiation of different wavelengths, intimately linked to the structure of atoms or molecules under study.

Undoubtedly, spectroscopy and optical measurement technologies are of great importance for the analysis of chemical composition. Spectroscopic techniques such as UV-Vis and IR are routinely used in laboratories and are detailed in a great number of pharmacopoeia monographs (e.g., United States Pharmacopeia, British Pharmacopoeia and European Pharmacopoeia) for quality control of excipients, pharmaceutical ingredients and dosage forms. These techniques can offer a rapid, cheap, non-invasive/non-destructive analysis, using both off-line and in-/at-/on-line methodologies. Nevertheless, they are usually limited to identification and assay by spectral comparison of a test sample against a reference standard. This approach may not be suitable for qualitative and quantitative analysis of real-world samples due to the complexity of pharmaceutical and biomedical matrices.

Given the above, chemometrics is essential for extracting information efficiently from spectral data. By definition, chemometrics is the use of mathematical and statistical methods to extract relevant chemical information and to correlate quality parameters or physical properties to analytical data. A chemometrician therefore draws on knowledge of chemical and instrumental influences to display the data in ways that allow chemical interpretation of the system under study (Davies, 2012).

With reference to the most straight-forward explanation of chemometrics, in the present Research Topic, Biancolillo and Marini briefly reviewed the different chemometric approaches applicable in the context of spectroscopy-based pharmaceutical analysis, discussing the unsupervised exploration of the collected data as well as the possibility of building predictive models for both quantitative (calibration) and qualitative (classification) responses.

In another review, Tsenkova et al. described the up-to-date development of multivariate analysis methodology in aquaphotomics, a novel scientific discipline proposed by Tsenkova (2005). In aquaphotomics analysis, an aquaphotome (i.e., a database of water absorbance bands and patterns correlating water structures to their specific functions) is built by using light-water interaction. To deal with such complex multidimensional spectral data, chemometric methods are exploited to remove unwanted influences and extract water absorbance spectral patterns related to the perturbation of interest.

# Edited and reviewed by:

*Huan-Tsung Chang, National Taiwan University, Taiwan*

> \*Correspondence: *Hoang Vu Dang hoangvd@hup.edu.vn*

### Specialty section:

*This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *30 November 2018* Accepted: *01 March 2019* Published: *27 March 2019*

### Citation:

*Vu Dang H and Marini F (2019) Editorial: Chemometrics-based Spectroscopy for Pharmaceutical and Biomedical Analysis. Front. Chem. 7:153. doi: 10.3389/fchem.2019.00153*


In spectral analysis, wavelets have increasingly shown great potential in chemical studies by being superior to existing signal processing algorithms in noise removal, resolution enhancement, data compression, and chemometric modeling (Chau et al., 2004; Vu Dang, 2014). In practice, multicomponent analysis may not be possible with a traditional UV spectrophotometric method due to spectral overlapping of both active and inactive ingredients of pharmaceutical samples. Based largely on a series of studies by Dinç and co-workers, the review of Dinç and Yazan details the theoretical aspects of wavelet transform (i.e., discrete, continuous, and fractional) and its characteristic application to UV spectroscopic analysis of pharmaceuticals.

For pharmaceutical and biomedical analysis, it is noteworthy that the combination of various spectroscopic techniques is advisable in an effort to scrutinize a complex chemical process. In the present Research Topic, this is truly reflected by the following works: (i) Wani et al. studying interaction of neratinib (an anticancer drug) with bovine serum albumin by using both spectroscopic (spectrofluorometric, UV spectrophotometric and Fourier-transform infrared) and molecular docking approaches, and (ii) Shang et al. designing and synthesizing low-cytotoxicity fluorescent probes based on anthracene derivatives for hydrogen sulfide detection.

The continuing application of vibrational spectroscopy has generated an enormous number of papers in the pharmaceutical and biomedical sciences (Abramczyk et al., 2017; Brody et al., 2017; Bunaciu and Aboul-Enein, 2017). It is thus not surprising that the present Research Topic mainly consists of research articles related to infrared and Raman spectroscopy. For instance, Tian et al. explored the use of chemometrics-based Fourier transform infrared spectroscopy for the investigation of plasma biochemical changes due to acute lead poisoning in a rat model. Ryabchykov et al. investigated a data fusion approach for combining the two most powerful imaging techniques (Raman spectroscopy and matrix-assisted laser desorption/ionization mass spectrometry) to better distinguish different regions within biological samples. Risoluti and Materazzi coupled a miniaturized Near Infrared (NIR) spectrometer to chemometrics as a novel, entirely on-site approach for assessment of occupational exposure to hydroxyurea. Zou et al. compiled a NIR spectral library of amoxicillin and potassium clavulanate by using a universal model
to resolve sample-collection problems, making quantitative models more specific for Process Analytical Technology control. Dai et al. discovered the linear region of Near Infrared Diffuse Reflectance spectra of different particle sizes by using the Kubelka-Munk theory, to serve as a methodological reference for the performance of prediction models. Chen et al. introduced a novel strategy for the real-time quantification of potassium in infant formula samples, i.e., applying a modified random frog algorithm, adopted in a higher-density discrete wavelet transform domain, to select the most important features of laser-induced breakdown spectra related to potassium. Zhao et al. proved that a pharmaceutical analysis model could be more reliable and robust when its parameters (such as spectral pretreatment, latent factors, variable selection, and calibration methods) were optimized by processing trajectory, possibly integrated into PLS software. Bogomolov et al. suggested a time-domain averaging of spectral variables to improve the accuracy of in-line NIR spectroscopic moisture monitoring in a fluidized bed drying process of pharmaceutical powder. Ma et al. proposed the use of the low-rank estimation method to improve the accuracy and robustness of Partial Least Squares and Support Vector Machine chemometric models being applied to Raman quantitative analysis of pharmaceutical mixtures.

Regarding the instrumentation for vibrational spectroscopy, Chen et al. developed a moving window fast Fourier transform cross-correlation to correct non-linear shifts for synchronization of spectra obtained from different Raman instruments. In another study, Fujiwara and Kano recommended the nearest correlation-based input variable weighting method for efficient and highly accurate soft-sensor design, which is applicable to NIR data especially when the number of input variables is large.

The idea for this Research Topic came from the observation that state-of-the-art chemometrics, in particular the wavelet transform, plays a vital role in spectroscopy and continues to be refined.

As its title indicates, we hope this Research Topic will serve as a useful guide for spectroscopic analysis in the pharmaceutical and biomedical sciences.

# AUTHOR CONTRIBUTIONS

HV wrote and FM revised the manuscript.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Vu Dang and Marini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Study of Interactions of an Anticancer Drug Neratinib With Bovine Serum Albumin: Spectroscopic and Molecular Docking Approach

### Tanveer A. Wani <sup>1</sup> \*, Ahmed H. Bakheit <sup>1,2</sup>, M. A. Abounassif <sup>1</sup> and Seema Zargar <sup>3</sup>

*<sup>1</sup> Department of Pharmaceutical Chemistry, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia, <sup>2</sup> Department of Chemistry, Faculty of Science and Technology, Al-Neelain University, Khartoum, Sudan, <sup>3</sup> Department of Biochemistry, College of Science, King Saud University, Riyadh, Saudi Arabia*

Binding of therapeutic agents to plasma proteins, particularly to serum albumin, provides valuable information in drug development. This study was designed to evaluate the binding interaction of neratinib with bovine serum albumin (BSA). Neratinib blocks HER2 signaling and is effective in trastuzumab-resistant breast cancer treatment. Spectrofluorometric, UV spectrophotometric, Fourier transform infrared (FT-IR) and molecular docking experiments were performed to study this interaction. The fluorescence of BSA is attributed to the presence of tryptophan (Trp) residues. The fluorescence of BSA in the presence of neratinib was studied using an excitation wavelength of 280 nm, and the emission was measured at 300–500 nm at three different temperatures. Neratinib quenched the BSA intrinsic fluorescence by a static mechanism. Complex formation occurred due to the interaction, leading to a shift in BSA absorption. The fluorescence, UV absorption, three-dimensional fluorescence and FT-IR data showed that conformational changes occurred in BSA after interaction with neratinib. The binding constant values decreased as the temperature increased, suggesting formation of an unstable complex at high temperature. Site I (sub-domain IIA) was observed as the principal binding site for neratinib. Hydrogen bonding and van der Waals forces were suggested to be involved in the BSA-neratinib interaction on the basis of the negative values of the entropy and enthalpy changes.

Keywords: bovine serum albumin, neratinib, human serum albumin, fluorescence, quenching

# INTRODUCTION

Neratinib, a tyrosine kinase inhibitor, is used in trastuzumab-resistant breast cancer treatment as an alternative to block HER2 signaling (**Figure 1**; Burstein et al., 2010; Iqbal and Iqbal, 2014; Wani et al., 2015). Neratinib was recently approved by the United States FDA for use in early stage HER2-overexpressed/amplified breast cancer (Bose and Ozer, 2009; Feldinger and Kong, 2015; Kourie et al., 2016; US Food and Drug Administration, 2018).

# Edited by:

*Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam*

### Reviewed by:

*Hui Xu, Ludong University, China Simone Brogi, University of Siena, Italy*

> \*Correspondence: *Tanveer A. Wani twani@ksu.edu.sa*

### Specialty section:

*This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *24 October 2017* Accepted: *22 February 2018* Published: *07 March 2018*

### Citation:

*Wani TA, Bakheit AH, Abounassif MA and Zargar S (2018) Study of Interactions of an Anticancer Drug Neratinib With Bovine Serum Albumin: Spectroscopic and Molecular Docking Approach. Front. Chem. 6:47. doi: 10.3389/fchem.2018.00047*


Plasma proteins act as carriers for the transportation of drugs and other compounds. Among the various plasma proteins, serum albumin is the most abundant and plays a vital role in the transportation of drug ligands (Jahanban-Esfahlan et al., 2015; Wani et al., 2017b,c). Several tyrosine kinase inhibitors have been studied for their interaction with bovine serum albumin (BSA) (Shen et al., 2015), and in this study the interaction of neratinib with BSA was explored. BSA was selected owing to its structural similarity to human serum albumin (HSA), low procurement cost and ready availability (He and Carter, 1992; Chi et al., 2010). So far, studies on the interaction between plasma proteins and neratinib have focused only on the characterization of neratinib covalent binding with serum albumin and reversible covalent binding of neratinib with plasma proteins (Chandrasekaran et al., 2010; Wang et al., 2010). BSA contains 583 amino acids and three homologous domains (I, II, and III), which are connected by disulfide bonds. Two tryptophan residues, Trp-134 and Trp-212, are present in the BSA molecule and give rise to its intrinsic fluorescence (Kragh-Hansen, 1981). The pharmacokinetic parameters of distribution, transportation and excretion of small ligands depend on the noncovalent binding interactions of drug ligands with proteins. Exploration of the interaction mechanism between drug ligands and BSA is therefore of great interest (Berezhkovskiy, 2007; Chamani and Heshmati, 2008; Xiao et al., 2011; Khorsand Ahmadi et al., 2015; Marouzi et al., 2017).

The interaction between neratinib and serum albumin was explored in this study. Multispectroscopic (UV-vis absorption, fluorescence, FT-IR) along with computational approaches were used to study the binding interaction. The parameters under study included binding site involvement, complex formation and binding energies of neratinib with BSA. The molecular docking data were corroborated with experimental results to obtain a better understanding of the mechanisms involved in the interaction.

# METHODS

# Chemicals and Reagents

Bovine serum albumin (BSA) was procured from Sisco Research Laboratories, India. Neratinib was obtained from Selleckchem, USA. Phenylbutazone and ibuprofen were purchased through National Scientific Company, Saudi Arabia. The stock solutions of neratinib, BSA, phenylbutazone and ibuprofen were prepared on the basis of their molecular weights. Phosphate buffer pH 7.4 was used to prepare a BSA stock solution of 1.5 × 10<sup>−6</sup> M. Neratinib was dissolved in 500 µL dimethyl sulphoxide and then diluted with phosphate buffer pH 7.4 to give a stock concentration of 1.8 × 10<sup>−3</sup> M. The stock solution was further diluted with the buffer to obtain working standard solutions in the range of 3.8 × 10<sup>−5</sup> to 5.2 × 10<sup>−4</sup> M. The stock solutions of ibuprofen and phenylbutazone were prepared in methanol and then diluted with the phosphate buffer. Deionized water was obtained from a Flex Type-IV instrument from Elga Lab Water, UK.

# Fluorescence Spectra Measurement

The fluorescence analysis was carried out using a JASCO FP-8200 spectrofluorometer (Japan). The chosen excitation wavelength was 280 nm and the emission fluorescence was recorded within the 300–500 nm range. BSA solution (1.5 × 10<sup>−6</sup> M) was titrated with different neratinib concentrations (0, 1.5 × 10<sup>−6</sup>, ..., 2.11 × 10<sup>−5</sup> M) and the fluorescence measurements were carried out at 298, 303, and 308 K. The BSA and neratinib solutions were mixed in a ratio of 1:1 v/v, so the concentrations in the measured mixture were half of the initial concentrations of both BSA and neratinib. The fluorescence intensity (FI) might decrease due to the inner filter effect, since a compound present in the solution might absorb in the ultraviolet region near the excitation/emission wavelength. Therefore, the FI was corrected for studying the neratinib–BSA interaction using the following equation:

$$F_{cor} = F_{obs} \times e^{(A_{ex} + A_{em})/2}$$

where F<sub>cor</sub> and F<sub>obs</sub> denote the corrected and measured fluorescence intensities, respectively, and A<sub>ex</sub> and A<sub>em</sub> are the modified absorbance values of the protein upon ligand addition at the excitation and emission wavelengths, respectively.
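As an illustration only, the correction above can be sketched in a few lines of Python. The function name and the numerical readings are hypothetical and are not taken from the study; the formula is applied exactly as written in the text.

```python
import math


def correct_inner_filter(f_obs, a_ex, a_em):
    """Return F_cor = F_obs * exp((A_ex + A_em) / 2), as given in the text."""
    return f_obs * math.exp((a_ex + a_em) / 2.0)


# Hypothetical readings (not from the paper): observed intensity and the
# absorbances at the excitation and emission wavelengths.
f_cor = correct_inner_filter(f_obs=500.0, a_ex=0.04, a_em=0.02)
print(f"Corrected intensity: {f_cor:.2f}")
```

When both absorbances are zero the correction leaves the observed intensity unchanged, which is a quick sanity check for any implementation.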

# Synchronous Fluorescence Spectra Measurement

The synchronous fluorescence spectra were recorded to study conformational changes that could occur in BSA at 298 K (room temperature). Scanning intervals Δλ (Δλ = λ<sub>em</sub> − λ<sub>ex</sub>) of 15 and 60 nm characterize the tyrosine and tryptophan residues, respectively.

# FT-IR Spectra Measurement

A Bruker Alpha II FT-IR spectrometer (USA) coupled with the OPUS software was used. The spectra obtained (spectral resolution 2 cm<sup>−1</sup>; 24 scans) were converted into absorbance. The spectra of the buffer and of the BSA solution in buffer were recorded, and the spectrum of the buffer was subtracted from that of the BSA solution to give the FT-IR spectrum of BSA. Similarly, the BSA-neratinib solution was prepared and the spectrum of free neratinib was subtracted from that of the bound form. The FT-IR results provided evidence of possible conformational changes in the protein molecule.

# Site Probe Experiment

Site probe experiments were also conducted to determine the binding site involved in the interaction. Different concentrations of neratinib were added to equimolar concentrations of site probes (phenylbutazone or ibuprofen) and BSA; the FI was then determined at room temperature (298 K) and excitation wavelength of 280 nm.

# UV–Visible Spectra Measurement

The UV-Visible absorption spectra were recorded in the range of 200–400 nm for BSA, neratinib and the BSA-neratinib complex at room temperature (298 K) with a UV-1800 spectrophotometer (Shimadzu, Japan). The BSA-neratinib spectra were acquired by keeping the BSA concentration constant (1.5 µM) and varying the neratinib concentration.

# Molecular Docking

Molecular docking analysis was performed to study the interaction between neratinib and BSA. The docking was performed with the Molecular Operating Environment (MOE-2014). The structure of neratinib was drawn in MOE, whereas the BSA crystal structure was obtained from the Protein Data Bank (PDB) with the PDB code 4OR0 (http://www.rcsb.org). Because BSA exists as a homodimer of two chains, chain A of the BSA molecule was selected for the docking analysis. Both the protein receptor and the ligand were protonated during preparation, and the energy minimization was performed with the default parameters of Force field MMFF94X, eps = r and cut off (8–10). The docking parameters were kept at their defaults with Triangle Matcher. Rescoring function 1 was set as London dG and rescoring function 2 as GBVI/WSA dG, with 10 conformations generated in order to fit the binding groove. An mdb output file was generated for further analysis and evaluation of the neratinib–BSA interaction. The active binding site that might be involved in the interaction was obtained from the site-specific probe experiments (Jahanban-Esfahlan et al., 2015; Wani et al., 2017b,c). RMSD (root mean square deviation) parameters were used to select the most suitable interaction of BSA with neratinib.

# RESULTS

# Fluorescence Quenching

The FI of BSA and of the BSA-neratinib complex were recorded with excitation at 280 nm and emission in the range of 300–500 nm. The BSA concentration was kept constant, whereas the concentration of neratinib was varied. A decrease in FI was observed with increasing neratinib concentration. This was attributed to quenching of the BSA fluorescence by neratinib through formation of a non-fluorescent complex between neratinib and BSA (**Figure 2**). The quenching data were analyzed using the Stern-Volmer equation:

$$\frac{F_0}{F} = 1 + K_{sv}[Q] = 1 + K_q \tau_0 [Q]$$

F<sub>0</sub> and F represent the FIs in the absence and presence of neratinib, respectively; K<sub>sv</sub> is the Stern-Volmer quenching constant; [Q] is the quencher concentration; K<sub>q</sub> is the quenching rate constant; and τ<sub>0</sub> is the fluorophore lifetime in the absence of quencher, taken as 10<sup>−8</sup> s for a biopolymer (Lakowicz and Weber, 1973). The values obtained for K<sub>sv</sub> at the three different temperatures are presented in **Table 1** (**Figure 3**). During the synchronous fluorescence experiments, a stronger quenching of FI was observed for tryptophan residues (Δλ = 60 nm) than for tyrosine residues (Δλ = 15 nm), indicating the contribution of tryptophan to the intrinsic fluorescence of BSA (**Figure 4**). A red shift of 1 nm was also observed for the tryptophan residue. The 3D (3-dimensional) spectrofluorometric analysis of BSA and the BSA-neratinib complex (**Figure 5**) indicated changes in the BSA conformation after addition of neratinib.
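The Stern-Volmer analysis described above amounts to a straight-line fit of F<sub>0</sub>/F against [Q]. The following is a minimal sketch under stated assumptions: the data are synthetic and noise-free, generated from a hypothetical K<sub>sv</sub> rather than taken from Table 1, and the helper names `fit_line` and `stern_volmer` are illustrative only.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def stern_volmer(quencher_conc, intensities, f0, tau0=1e-8):
    """Fit F0/F = 1 + Ksv[Q]; return (Ksv, Kq, intercept) with Kq = Ksv / tau0."""
    ratios = [f0 / f for f in intensities]
    ksv, intercept = fit_line(quencher_conc, ratios)
    return ksv, ksv / tau0, intercept


# Synthetic data generated from a hypothetical Ksv of 5.0e4 M^-1.
F0 = 1000.0
TRUE_KSV = 5.0e4
q = [0.0, 2.0e-6, 4.0e-6, 6.0e-6, 8.0e-6, 1.0e-5]  # quencher concentration, M
f = [F0 / (1.0 + TRUE_KSV * qi) for qi in q]

ksv, kq, intercept = stern_volmer(q, f, F0)
print(f"Ksv = {ksv:.3e} M^-1, Kq = {kq:.3e} M^-1 s^-1, intercept = {intercept:.3f}")
```

With real titration data, the inner-filter-corrected intensities would be passed in, and an intercept far from 1 or visible curvature would suggest that the simple linear model does not hold.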

# Binding Constant

Small drug ligands interact with proteins binding sites independently and the equilibrium among the free and bound molecules is represented by the following equation (He et al., 2010):

$$\log \frac{(F_0 - F)}{F} = n \log K_b - n \log \left[ \frac{1}{[Q] - \frac{(F_0 - F)[P]}{F_0}} \right]$$

where K<sub>b</sub> is the binding constant and n is the number of binding sites; [Q] and [P] are the total concentrations of quencher and protein, respectively. A plot of log (F<sub>0</sub> − F)/F vs. log {1/([Q] − (F<sub>0</sub> − F)[P]/F<sub>0</sub>)} is used to calculate the binding constant (from the intercept) and the number of binding sites (from the slope). The binding constants and numbers of binding sites were determined at all three temperatures and are presented in **Table 2**. The number of binding sites was found to be equal to unity. The binding constant obtained for the BSA-neratinib complex was 8.1 × 10<sup>4</sup>, whereas in the presence of phenylbutazone and ibuprofen the values were 0.38 × 10<sup>2</sup> and 4.8 × 10<sup>4</sup>, respectively (**Figure 3**).

TABLE 1 | Stern–Volmer quenching constants (K<sub>SV</sub>) and bimolecular quenching rate constant (K<sub>q</sub>) for the binding of neratinib to BSA at three different temperatures.
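The double-logarithmic binding equation can likewise be evaluated by linear regression. The sketch below assumes the form with a minus sign before the second term, so that the slope equals −n and the intercept equals n log K<sub>b</sub>. The data are synthetic, generated from n = 1 and a K<sub>b</sub> chosen to match the order of magnitude reported in the text; the function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def binding_parameters(q_total, intensities, f0, p_total):
    """Fit log((F0-F)/F) = n log Kb - n log(1/([Q] - (F0-F)[P]/F0))."""
    xs, ys = [], []
    for q, f in zip(q_total, intensities):
        free = q - (f0 - f) * p_total / f0  # bracketed term in the equation
        xs.append(math.log10(1.0 / free))
        ys.append(math.log10((f0 - f) / f))
    slope, intercept = fit_line(xs, ys)
    n_sites = -slope
    kb = 10.0 ** (intercept / n_sites)
    return n_sites, kb


# Synthetic titration built from assumed values (n = 1, Kb = 8.1e4 M^-1).
F0, P_TOTAL, TRUE_KB = 1000.0, 0.75e-6, 8.1e4
fractions = [0.05, 0.10, 0.20, 0.30, 0.40]  # quenched fraction (F0 - F) / F0
f_values = [F0 * (1.0 - t) for t in fractions]
q_values = [t / ((1.0 - t) * TRUE_KB) + t * P_TOTAL for t in fractions]

n_sites, kb = binding_parameters(q_values, f_values, F0, P_TOTAL)
print(f"n = {n_sites:.3f}, Kb = {kb:.3e} M^-1")
```

The point at zero quencher concentration is left out because both logarithms are undefined there.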

# Binding Mode

The binding mode is established based on the thermodynamic parameters, which include the enthalpy change (ΔH<sup>0</sup>), entropy change (ΔS<sup>0</sup>) and free energy change (ΔG<sup>0</sup>). The thermodynamic parameters are given in **Table 2**. **Figure 3** represents the van't Hoff plot for the neratinib and BSA interaction.
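The text does not spell out how the thermodynamic parameters were computed. A common route, assumed here, is the standard van't Hoff relation ln K = −ΔH/(RT) + ΔS/R followed by ΔG = ΔH − TΔS. The sketch below uses synthetic binding constants generated from hypothetical ΔH and ΔS values; none of the numbers come from Table 2.

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def vant_hoff(temperatures, binding_constants):
    """Fit ln K vs 1/T; return (dH, dS, {T: dG}) in J/mol and J/(mol K)."""
    inv_t = [1.0 / t for t in temperatures]
    ln_k = [math.log(k) for k in binding_constants]
    slope, intercept = fit_line(inv_t, ln_k)
    delta_h = -slope * R
    delta_s = intercept * R
    delta_g = {t: delta_h - t * delta_s for t in temperatures}
    return delta_h, delta_s, delta_g


# Synthetic constants from assumed dH = -50 kJ/mol and dS = -80 J/(mol K).
TRUE_DH, TRUE_DS = -50.0e3, -80.0
temps = [298.0, 303.0, 308.0]  # the three temperatures used in the study
ks = [math.exp(-TRUE_DH / (R * t) + TRUE_DS / R) for t in temps]

dh, ds, dg = vant_hoff(temps, ks)
print(f"dH = {dh / 1000:.2f} kJ/mol, dS = {ds:.2f} J/(mol K)")
for t in temps:
    print(f"T = {t:.0f} K: dG = {dg[t] / 1000:.2f} kJ/mol")
```

With only three temperatures the fit has a single degree of freedom, so any uncertainty estimate on ΔH and ΔS from real data should be treated with caution.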

# DISCUSSION

# Neratinib Binding to the Serum Albumins

Fluorescence spectroscopy is a tool for investigating the interaction between biological macromolecules (proteins) and small drug ligands. The interaction can be studied in terms of the binding mechanism, binding constants, etc. The FI can be reduced by several molecular interactions, including excited-state reactions, complex formation, energy transfer and molecular rearrangements. This decrease in the FI is known as fluorescence quenching. The type of quenching involved (static or dynamic) is first assessed from the linearity of the Stern-Volmer plot of F<sub>0</sub>/F vs. [Q] (**Figure 3**). However, the Stern-Volmer plot alone cannot give sufficient information about the nature of the quenching, so further evidence is required. The change in temperature is used to distinguish between static and dynamic quenching in the ligand-BSA interaction: the K<sub>sv</sub> value decreases at higher temperature in static quenching, and increases in dynamic quenching. The results therefore imply that static quenching and complex formation occur between neratinib and BSA. This was further supported by the quenching rate constants obtained (**Table 1**). The quenching rate constant for collisional quenching can reach a maximum value of 2 × 10<sup>10</sup> M<sup>−1</sup> s<sup>−1</sup> for biopolymers. The quenching rate constants obtained here were much higher than this maximum for scatter collision quenching, clearly showing the involvement of static quenching in the BSA-neratinib interaction (Shi et al., 2014; Wani et al., 2017a).

The synchronous fluorescence experiments were performed to obtain information about the microenvironment in the immediate neighborhood of the chromophore molecules. Conformational changes are reflected by changes in the maximum emission wavelength. Stronger quenching and a red shift of 1 nm were observed for the tryptophan residue, suggesting an increase in the polarity of the surrounding environment (**Figure 4**). Therefore, it was concluded that the BSA conformation changes upon interaction with neratinib (Albert et al., 2006; Meti et al., 2015).

In the 3-dimensional spectral analysis of BSA in the presence of neratinib, two peaks were found, namely Peak 1 and Peak 2 (**Figure 5**). Peak 1 was found at an excitation wavelength of 230 nm and an emission wavelength of 344 nm. Peak 1 arises from the π–π* transition of the polypeptide structures present in the BSA molecule. Peak 2 was found at excitation and emission wavelengths of 280 and 342 nm, respectively. Tryptophan and tyrosine residues are responsible for Peak 2. A sharp decrease in the FI of BSA was observed after the addition of neratinib, meaning that fluorescence quenching occurred. A sparse spectrum in the contour plot (**Figures 5C,D**) was observed for BSA in the presence of neratinib, which confirms the occurrence of conformational changes in BSA after neratinib addition.

A decrease in the binding constants was noticed as the temperature increased indicating the instability of BSA-neratinib complex. Furthermore, the number of binding sites was found to be equal to 1, indicating a single class of binding sites on BSA.

Site-specific probes, phenylbutazone and ibuprofen, were used to determine the binding sites present on BSA (Hu et al., 2004). A decrease in the binding constants was observed in the presence of the drug site probes. Phenylbutazone caused a greater reduction in the binding constant than ibuprofen, indicating Site I as the binding site for neratinib (**Figure 3**).

TABLE 2 | Binding and thermodynamic parameters of binding between neratinib and BSA.

# Types of Interaction Force Between BSA and Neratinib

The complex formation relies on the thermodynamic process, because the binding constants are temperature-dependent. The thermodynamic parameters help characterize the kind of forces acting between BSA and neratinib (Ni et al., 2008). The forces that might be involved in binding small ligands to proteins include hydrogen bonds, Van der Waals forces, hydrophobic interactions, and electrostatic forces. The binding mode is established based on the thermodynamic parameters, which include the enthalpy change (ΔH<sup>0</sup>), entropy change (ΔS<sup>0</sup>), and free energy change (ΔG<sup>0</sup>). The thermodynamic parameters were evaluated by the following equations:

$$
\begin{aligned}
\ln K_b &= -\frac{\Delta H^0}{RT} + \frac{\Delta S^0}{R} \\
\Delta G^0 &= \Delta H^0 - T\Delta S^0 = -RT\ln K_b
\end{aligned}
$$

K<sub>b</sub> and R represent the binding constant and the universal gas constant, respectively. The negative ΔH<sup>0</sup> and ΔS<sup>0</sup> values indicate the presence of hydrogen bonding and Van der Waals forces between BSA and neratinib. Moreover, a negative ΔH<sup>0</sup> of this kind cannot be attributed to electrostatic interactions, since for these interactions ΔH<sup>0</sup> is either very small or almost zero (Ross and Subramanian, 1981; Ni et al., 2008). **Figure 3** represents the van't Hoff plot for the neratinib and BSA interaction. The spontaneous interaction between BSA and neratinib is indicated by the negative ΔG<sup>0</sup> value. Both the enthalpy change and the entropy change were negative for the neratinib-BSA interaction, suggesting an enthalpy-driven interaction; the negative entropy change is unfavorable for the binding process.
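The van't Hoff analysis above amounts to a straight-line fit of ln K<sub>b</sub> against 1/T. The following minimal sketch shows that step under stated assumptions: the binding constants are generated synthetically from assumed values of ΔH<sup>0</sup> and ΔS<sup>0</sup> (they are not the values in Table 2), and the three temperatures are hypothetical.

```python
import math

R = 8.314  # universal gas constant, J mol^-1 K^-1

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def vant_hoff(temps_k, kb_values):
    """Return (dH, dS) from ln Kb = -dH/(R*T) + dS/R."""
    inv_t = [1.0 / t for t in temps_k]
    ln_kb = [math.log(k) for k in kb_values]
    slope, intercept = fit_line(inv_t, ln_kb)
    return -slope * R, intercept * R   # J mol^-1 and J mol^-1 K^-1

def gibbs(dh, ds, temp_k):
    """dG = dH - T * dS."""
    return dh - temp_k * ds

# Hypothetical inputs: synthetic Kb values built from assumed dH and dS.
ASSUMED_DH = -50_000.0   # J mol^-1, illustrative only
ASSUMED_DS = -80.0       # J mol^-1 K^-1, illustrative only
temps = [298.0, 303.0, 308.0]
kbs = [math.exp(-ASSUMED_DH / (R * t) + ASSUMED_DS / R) for t in temps]

dh, ds = vant_hoff(temps, kbs)
dg_298 = gibbs(dh, ds, 298.0)
print(f"dH = {dh:.1f} J/mol, dS = {ds:.2f} J/(mol K), dG(298 K) = {dg_298:.1f} J/mol")
```

With both ΔH<sup>0</sup> and ΔS<sup>0</sup> negative, ΔG<sup>0</sup> stays negative only while the enthalpy term outweighs TΔS<sup>0</sup>, which is the sense in which the binding is enthalpy-driven.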

# UV–Vis Absorption Studies

The UV–vis absorption spectra suggest that a complex formed between BSA and neratinib (**Figure 6**). An increase in the absorption intensity of BSA was observed with higher neratinib concentrations. The complex formation between BSA and neratinib is further confirmed by the blue shift observed in the λmax of BSA (Kandagal et al., 2006; Peng et al., 2015).

# FT-IR Studies

Infrared spectroscopy is used to investigate the secondary structure and dynamics of proteins. The band frequencies arising from amide I, II, and III vibrations in the IR region provide information about the secondary protein structure (i.e., the amide I band at 1,600–1,700 cm<sup>−1</sup> and the amide II band at 1,548 cm<sup>−1</sup>). The information provided by amide I is more valuable than that of amide II because of its greater sensitivity to changes in protein structure. **Figure 7** provides information regarding the changes in BSA after neratinib addition. The amide I peak shifted from 1645.51 to 1652.88 cm<sup>−1</sup> and the amide II peak shifted slightly from 1544.70 to 1543.02 cm<sup>−1</sup>, suggesting a change in the secondary structure of BSA after interaction with neratinib.

# Molecular Simulation Studies

Molecular docking experiments were performed to understand the interaction between neratinib and BSA. The docking experiments further supported the spectrophotometric and spectrofluorometric data (Ali et al., 2010; Shahabadi and Fili, 2014). In molecular docking studies, the ligand is fitted to the binding pocket of the protein in different positions, thus providing valuable information on the binding site and mode. The two binding sites present on BSA are designated as Site I and Site II, and are located in sub-domains IIA and IIIA, respectively. The site probe experiment revealed Site I as the binding site for neratinib, which was further confirmed by the docking results. Sub-domain IIA of Site I was analyzed with varied conformational adaptations and the lowest possible BSA-neratinib complex energies were obtained. **Figure 8A** represents the best conformation of the neratinib-BSA complex. It is evident that neratinib interacted with Trp-213 through a pi-pi interaction and with Asp-450 and Ala-209 by hydrogen bonds (**Figure 8B**). It was reported that neratinib forms a reversible covalent bond with Lys-190 of HSA. Neratinib contains a 4-(dimethylamino) crotonamide Michael acceptor, and a covalent bond is formed between the ε-amine of the lysine of HSA and the β-carbon of the amide functional group of neratinib. The covalent bond formed between neratinib and HSA depends on temperature, pH, and time, and is independent of neratinib concentration (Chandrasekaran et al., 2010; Wang et al., 2010). The peptide LDELRDEGKASSAK is unique to human and monkey albumin, and neratinib binds to this peptide covalently. It has also been reported that neratinib does not bind covalently to plasma proteins from other species such as dogs, rabbits, and rodents, as the sequence of amino acid residues from 182 to 195 in the albumin of these species differs from that in monkeys and humans. The amino acid sequence of residues 182 to 195 in BSA is ETMREKVLTSSARQ, meaning that BSA cannot bind covalently to neratinib due to this variation (Wang et al., 2010). The binding energy of the neratinib-BSA complex at Site I obtained by molecular docking was −24.12 kJ mol<sup>−1</sup>, which is in agreement with the binding energy of −27.93 kJ mol<sup>−1</sup> found experimentally at 298 K. On the basis of the experimental and docking results, it is concluded that hydrophobic (pi-pi) and hydrophilic (hydrogen bonding) interactions were involved in stabilizing the BSA-neratinib complex.

# CONCLUSION

Neratinib, approved for use in early stage HER2 overexpressed/amplified breast cancer, was investigated for its interaction with BSA. The site probe and molecular docking results established that neratinib binds to Site I, sub-domain IIA, of BSA. The fluorescence quenching, synchronous fluorescence, UV, and FT-IR data, together with the docking studies, confirmed the formation of a complex between BSA and neratinib. Van der Waals forces and hydrogen bonding were found to be involved in the BSA-neratinib interaction in an enthalpy-driven manner. Based on our findings, the pharmacological and biochemical aspects involved in the BSA-neratinib interaction could be better understood.

# AUTHOR CONTRIBUTIONS

Conceived and designed the experiments: TW and MA. Performed the experiments: AB and SZ. Analyzed the data: AB and TW. Contributed reagents, materials, analysis tools: SZ, MA, and TW. Wrote the paper: TW and SZ.

# ACKNOWLEDGMENTS

The authors would like to extend their sincere appreciation to the Deanship of Scientific Research, King Saud University, for funding the research group No. RG-1438-042.

# REFERENCES

spectroscopic and molecular docking methods. J. Lumin. 145, 643–650. doi: 10.1016/j.jlumin.2013.08.042


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Wani, Bakheit, Abounassif and Zargar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Discovery of the Linear Region of Near Infrared Diffuse Reflectance Spectra Using the Kubelka-Munk Theory

Shengyun Dai, Xiaoning Pan, Lijuan Ma, Xingguo Huang, Chenzhao Du, Yanjiang Qiao\* and Zhisheng Wu\*

### Edited by:

*Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam*

### Reviewed by:

*Eleonora-Mihaela Ungureanu, Politehnica University of Bucharest, Romania Michalina Kotyczka-Moranska, Institute for Chemical Processing of Coal, Poland*

### \*Correspondence:

*Yanjiang Qiao yjqiao@263.net Zhisheng Wu wzs@bucm.edu.cn*

### Specialty section:

*This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *30 November 2017* Accepted: *19 April 2018* Published: *07 May 2018*

### Citation:

*Dai S, Pan X, Ma L, Huang X, Du C, Qiao Y and Wu Z (2018) Discovery of the Linear Region of Near Infrared Diffuse Reflectance Spectra Using the Kubelka-Munk Theory. Front. Chem. 6:154. doi: 10.3389/fchem.2018.00154*

*Key Laboratory of TCM-Information Engineering of State Administration of TCM, Pharmaceutical Engineering and New Drug Development of Traditional Chinese Medicine of Ministry of Education, Beijing University of Chinese Medicine, Beijing, China*

Particle size is of great importance for the quantitative model of NIR diffuse reflectance. In this paper, the effect of sample particle size on the measurement of harpagoside in *Radix Scrophulariae* powder by near infrared (NIR) diffuse reflectance spectroscopy was explored. High-performance liquid chromatography (HPLC) was employed as a reference method to construct the quantitative particle size model. Several spectral preprocessing methods were compared, and the particle size models obtained with the different preprocessing methods were used to establish partial least-squares (PLS) models of harpagoside. The data showed that the particle size distribution of 125–150 µm for *Radix Scrophulariae* exhibited the best prediction ability, with R<sup>2</sup><sub>pre</sub> = 0.9513, RMSEP = 0.1029 mg·g<sup>−1</sup>, and RPD = 4.78. For the hybrid granularity calibration model, the particle size distribution of 90–180 µm exhibited the best prediction ability, with R<sup>2</sup><sub>pre</sub> = 0.8919, RMSEP = 0.1632 mg·g<sup>−1</sup>, and RPD = 3.09. Furthermore, the Kubelka-Munk theory was used to relate the absorption coefficient *k* (concentration-dependent) and the scatter coefficient *s* (particle size-dependent). The scatter coefficient *s* was calculated based on the Kubelka-Munk theory to study the changes of *s* after mathematical preprocessing. A linear relationship was observed between *k*/*s* and absorption *A* within a certain range, where the value of *k*/*s* was >4. According to this relationship, the model was more accurately constructed with the particle size distribution of 90–180 µm when *s* was kept constant or within a small linear region. This region provided a good reference for the linear modeling of diffuse reflectance spectroscopy. To establish a diffuse reflectance NIR model, a further accurate assessment should be obtained in advance for a precise linear model.

Keywords: Kubelka-Munk theory, Near infrared (NIR) diffuse reflectance spectroscopy, particle size, PLS, harpagoside, Radix Scrophulariae


# INTRODUCTION

The implementation of process analytical technology (PAT) in the pharmaceutical industry is intended to enhance the quality of products through the measurement of critical quality and performance parameters (Roggo and Ulmschneider, 2008). Near infrared spectroscopy (NIRS) is regarded as a vital tool for the implementation of PAT, as it is increasingly used in pharmaceutical research and development due to its high analysis speed, low-cost, and non-destructive characteristics (De Beer et al., 2011). NIR spectra of chemical species (consisting of C–H, N–H, O–H, and S–H bonds; Sarraguça et al., 2011) can be used to predict their chemical and physical properties (Prieto et al., 2009).

NIR technology comprises two main modes: transmission spectroscopy and diffuse reflectance spectroscopy. The choice of spectral mode is mainly based on the state of the samples (i.e., transmission spectroscopy is suitable for liquid samples such as herbal extracts and liquid preparations, while diffuse reflectance spectroscopy is generally used for solid samples such as pharmaceutical powders or granules). Diffuse reflectance spectroscopy is an analytical technique that measures the diffuse reflection of light of different wavelengths to obtain surface information about the materials.

Various physical, chemical, and biochemical properties of Mediterranean soils were predicted by NIR (Zornoza et al., 2008). Chen et al. employed an NIR model for the analysis of total polyphenol content in green tea (Chen Q. et al., 2008). A classification accuracy of about 100% was obtained by discriminant and classification tree analyses of 82 honey samples by diffuse reflectance mid-infrared Fourier transform spectroscopy (DRIFTS) (Bertelli et al., 2007). Borin et al. utilized NIR technology for the simultaneous quantification of some common adulterants (starch, whey, or sucrose) found in milk powder samples (Borin et al., 2006). All these investigations illustrate the trend of using NIR technology to predict physical and chemical information.

Recently, the application of NIR in studying Chinese herbal medicine (CHM) has increased dramatically, for example in discrimination analysis and quality control of various samples such as raw materials, excipients, and dosage forms. Wu et al. used NIR and different PLS models to quantify the baicalin content of Yinhuang oral solution based on a total error concept (Wu et al., 2013). Chen et al. employed NIR to distinguish Ganoderma lucidum samples collected from different geographical origins using principal component analysis (PCA) and discriminant analysis algorithms (Chen Y. et al., 2008).

On the other hand, it is well known that the particle size of a sample affects its NIR spectra. Several studies have been published on the effect of particle size on the determination of drug content in mixed powder products (Norris and Williams, 1984; Aucott and Garthwaite, 1988; Bull, 1991). Franke et al. (1998) reported the particle size determination of lactose using chemometrics-based NIR spectra. However, they did not mention any basic principle for determining particle size in the experiments. Paskatan et al. (2001) reviewed theoretical and practical particle size analysis of powders by NIR spectroscopy, but they did not show the relationship between the basic light scattering principle and the particle size of the main contents.

Kubelka-Munk theory (Otsuka, 2004) is the basic quantitative theory of NIRS. The particle size of sample affects the light scattering, directly influencing model construction. It was shown that an accurate knowledge of the particles is crucial in the product development (Blanco and Peguero, 2008). Meanwhile, the differences in CHM particle size could result in different optical path lengths and multiplicative light scattering effects (Jin et al., 2012). Thus, it is important to establish an expeditious method to determine the particle size of CHM.

However, there have been few NIR studies on the simultaneous determination of particle size and active pharmaceutical ingredients of CHM. Wu Z. S. et al. demonstrated that particle size affected the NIR measurement of saikosaponin A in Bupleurum chinense DC (Wu et al., 2015). Bittner et al. successfully applied NIR spectroscopy in combination with multivariate data analysis (MVA) for the simultaneous identification and particle size determination of amoxicillin trihydrate particles (Bittner et al., 2011).

Scrophularia radix (Xuanshen), the root of Scrophularia ningpoensis Hemsl., is a typical CHM with a history going back over 1000 years (The State Pharmacopoeia Commission of People's Republic of China, 2015). It originates from Zhejiang province and is a component of the natural herbal supplement named "Zhe Ba Wei." The major ingredients of Scrophularia radix are iridoids, and harpagoside is one of the main bioactive components, with antioxidant, antimicrobial, and antitumor activities (Miyazawa and Okuno, 2003; Jing et al., 2011).

In this study, Scrophularia radix was taken as an example and harpagoside was regarded as an API of Scrophularia radix. HPLC was used as a reference method to determine the harpagoside content. NIR was used to monitor the prediction potential of the single particle size and mix particle size models simultaneously. To the best of our knowledge, this paper is the first to study particle size and harpagoside determination in Scrophularia radix with NIR diffuse reflectance spectroscopy. The differences between the single particle size model and the mix particle size model are explained from the perspective of the Kubelka-Munk theory.

TABLE 1 | HPLC gradient elution of *Scrophularia* radix extract.

**Abbreviations:** NIR, Near Infrared; PAT, Process Analytical Technology; DRIFTS, Diffuse Reflectance Mid-infrared Fourier Transform Spectroscopy; CHM, Chinese Herbal Medicine; NIRS, NIR Spectroscopy; MVA, Multivariate Data Analysis; HPLC, High Performance Liquid Chromatography; RMSEC, Root Mean Square Error of Calibration; RMSECV, Root Mean Square Error of Cross-Validation; RMSEP, Root Mean Square Error of Prediction; MSC, Multiplicative Scatter Correction; SNV, Standard Normal Variate; 1D, First Derivative; 2D, Second Derivative; SG, Savitzky–Golay; PRESS, Predicted Residual Sum of Squares; PLS, Partial Least Squares; SCOT, Second Overtones Region; FCOT, First Combination-Overtone; RPD, Residual Predictive Deviation; API, Active Pharmaceutical Ingredient; EMSC, Extended Multiplicative Scatter Correction.

# MATERIALS AND METHOD

# Materials

Ten batches of S. ningpoensis Hemsl. radix were gifted from Daozhen (Guizhou, China), and three representative samples were taken from each batch. All samples were identified by Prof. Chunsheng Liu (Beijing University of Chinese Medicine, China). Harpagoside reference standard (lot: 111730-201307) was purchased from the National Institutes for Food and Drug Control (Beijing, China). Acetonitrile (Fisher Scientific, Pittsburgh, PA) was of HPLC grade. Acetic acid (Beijing Chemical Works, Beijing, China) was of analytical grade. Deionised water was purchased from Hangzhou Wahaha Co., Ltd (Zhejiang, China).

# Preparation of Samples

Scrophularia radix samples were crushed into pieces by a disintegrator after brushing off soil dust from the surface. Thirty samples of Scrophularia radix were then pulverized with a blender and screened through a 10-mesh sieve. Finally, the powders were divided into four parts. One part was used for HPLC determination of the harpagoside content. The remaining parts were then smashed and screened through 24-, 50-, 65-, 80-, 100-, 120-, and 150-mesh sieves.

Each sieved sample of Scrophularia radix powder (1 g) was accurately weighed and placed in a 100 mL Erlenmeyer flask. The sample was extracted with 50 mL of 50% ethanol under ultrasonic vibration (40 kHz, 220 V) for 45 min. After cooling to room temperature, the solution was filtered through a 0.45-µm membrane filter for HPLC analysis.

# NIR Equipment and Measurement

The NIR spectra were recorded by a XDS Rapid Content Analyser and VISION software (Metrohm NIR Systems, Florida, USA).

The wavelength range for the spectra was 780–2,500 nm. Each spectrum was an average of 64 scans with air as the background, and the wavelength increment was 0.5 nm. Unless stated otherwise, each sample was measured in triplicate and the mean spectrum was used in the subsequent analysis.

# HPLC Method

A certain amount of harpagoside standard was accurately weighed with an XS205DU electronic balance (Mettler Toledo, Greifensee, Switzerland) and then dissolved in 100 mL of methanol to obtain a concentration of 0.02432 mg·mL<sup>−1</sup>.

HPLC analysis of Scrophularia radix (according to Chinese Pharmacopoeia, 2010 ed) was carried out using a Waters 2695 HPLC system with a Waters 2996 DAD detector and auto-sampler (Waters Technologies, Palo Alto, CA). Ten-microliter aliquots of the sample solutions were chromatographically analyzed in gradient elution mode on an octadecylsilyl column [250 × 4.6 mm, 5 µm (Dikma, China)] with a mobile phase consisting of acetonitrile and 0.4% acetic acid (v/v) at a flow rate of 1.0 mL·min<sup>−1</sup> (**Table 1**). The column temperature was kept at 30 °C and the detection wavelength was set at 280 nm. This chromatographic method exhibited good linearity (Y = 3 × 10<sup>6</sup>X − 104747, R<sup>2</sup> = 0.9998) over the concentration range 0.04864–0.02432 mg·mL<sup>−1</sup>.

# Software

Data analysis was performed with the Unscrambler version 9.6 software package (CAMO Software AS, Oslo, Norway) and home-made routines programmed in MATLAB code (MATLAB v7.0, MathWorks, Natick, MA). Following the Kennard-Stone algorithm, 210 samples were divided into 140 calibration samples and 70 validation samples. The root mean square error of calibration (RMSEC), root mean square error of cross-validation (RMSECV), root mean square error of prediction (RMSEP), and the corresponding R<sup>2</sup> were used to evaluate the PLS model.
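The Kennard-Stone split mentioned above can be illustrated with a short sketch. It follows the usual formulation of the algorithm (start from the two most distant samples, then repeatedly add the sample whose nearest selected neighbour is farthest away) with Euclidean distances; this is an assumption about the variant, since the authors' MATLAB routine is not shown. The five one-dimensional "spectra" are toy values.

```python
import math

def kennard_stone(samples, n_select):
    """Split sample indices into (selected, remaining) by Kennard-Stone.

    samples  : list of equal-length numeric sequences (e.g. spectra)
    n_select : number of calibration samples to pick (must be >= 2)
    """
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]

    # Step 1: seed the selection with the two most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    # Step 2: add the sample that is farthest from its closest selected sample.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda r: min(dist[r][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Toy data: five one-point "spectra" (hypothetical values).
toy_spectra = [[0.0], [1.0], [2.0], [3.0], [10.0]]
calibration_idx, validation_idx = kennard_stone(toy_spectra, n_select=3)
print(calibration_idx, validation_idx)
```

For the data set described in the text, the same call with `n_select=140` on the 210 spectra would give a 140/70 calibration/validation partition.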

### TABLE 2 | PLS model using preprocessing methods for different single particle sizes.

*#The original spectra without any pretreatment.*

\**The best preprocessing method used for each single particle size.*

In order to establish a robust harpagoside model, a number of preprocessing methods were selected. For instance, multiplicative scatter correction (MSC) and standard normal variate (SNV) were used to eliminate redundant effects of particle size. Derivative methods, including first derivative (1D) and second derivative (2D), were applied to reduce baseline variations observed in the original diffuse reflectance spectra and to enhance spectral features. Meanwhile, a nine-point Savitzky-Golay smoothing filter (SG) was employed to suppress the background noise amplified by the derivative. For the particle size model, MSC, SNV, and second derivative were not appropriate, since the particle size effect itself is to be modeled, so 1D + SG, normalization, and baseline subtraction were used. Leave-one-out cross-validation was used to validate the methods. The lowest predicted residual sum of squares (PRESS) value was used to determine the optimum number of latent variables.
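Two of the scatter corrections named above, SNV and MSC, are simple enough to sketch directly. The code below uses their common textbook definitions (SNV: centre and scale each spectrum by its own mean and standard deviation; MSC: regress each spectrum on the mean spectrum and remove the fitted offset and slope). Whether the Unscrambler implementation matches these details exactly is an assumption, and the spectra are toy values.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard normal variate: (x - mean(x)) / sd(x) for one spectrum."""
    m = mean(spectrum)
    s = stdev(spectrum)
    return [(v - m) / s for v in spectrum]

def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum x is modelled as x = offset + slope * reference and
    corrected to (x - offset) / slope.
    """
    n_points = len(spectra[0])
    reference = [mean(s[j] for s in spectra) for j in range(n_points)]
    ref_mean = mean(reference)
    sxx = sum((r - ref_mean) ** 2 for r in reference)

    corrected = []
    for spectrum in spectra:
        spec_mean = mean(spectrum)
        slope = sum((r - ref_mean) * (v - spec_mean)
                    for r, v in zip(reference, spectrum)) / sxx
        offset = spec_mean - slope * ref_mean
        corrected.append([(v - offset) / slope for v in spectrum])
    return corrected

# Toy example: the second spectrum is the first with a purely multiplicative
# and additive "scatter" distortion (hypothetical values).
base = [0.1, 0.4, 0.9, 0.3]
distorted = [2.0 * v + 0.5 for v in base]

snv_base, snv_distorted = snv(base), snv(distorted)
msc_base, msc_distorted = msc([base, distorted])
print(snv_base)
print(msc_base)
```

In this toy case both corrections map the two spectra onto the same curve, which is the behaviour that makes them unsuitable when particle size itself is the property to be modeled.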

# Quantitative Models of NIR Diffuse Reflectance Using the Kubelka-Munk Theory

Kubelka-Munk theory is the theoretical basis for the establishment of quantitative models of NIR diffuse reflectance and its function is as follows (Otsuka, 2004):

$$f(R_{\infty}) = \frac{\left(1 - R_{\infty}\right)^2}{2R_{\infty}} = \frac{k}{s}$$

According to the Kubelka-Munk function, f(R<sub>∞</sub>) is inversely proportional to the light-scattering coefficient (s), and the s value is inversely proportional to particle size.

The absorbance of NIR diffuse reflectance is expressed by the Kubelka-Munk equation:

$$\mathbf{A} = -\lg\left[1 + \frac{k}{s} - \sqrt{\left(\frac{k}{s}\right)^2 + 2\left(\frac{k}{s}\right)}\right]$$
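The two Kubelka-Munk expressions above are consistent with each other: solving f(R<sub>∞</sub>) = k/s for R<sub>∞</sub> gives the bracketed term, and A = −lg R<sub>∞</sub>. The short sketch below checks this numerically; the function names and the test value k/s = 4 are illustrative choices, not part of the original work.

```python
import math

def kubelka_munk(r_inf):
    """Kubelka-Munk function f(R_inf) = (1 - R_inf)^2 / (2 * R_inf) = k/s."""
    return (1.0 - r_inf) ** 2 / (2.0 * r_inf)

def reflectance_from_ks(ks):
    """Invert the Kubelka-Munk function: R_inf = 1 + k/s - sqrt((k/s)^2 + 2 k/s)."""
    return 1.0 + ks - math.sqrt(ks ** 2 + 2.0 * ks)

def absorbance_from_ks(ks):
    """Apparent absorbance A = -lg(R_inf) expressed through k/s."""
    return -math.log10(reflectance_from_ks(ks))

ks_example = 4.0                                # illustrative k/s value
r_inf = reflectance_from_ks(ks_example)
ks_roundtrip = kubelka_munk(r_inf)              # should return the input k/s
a_example = absorbance_from_ks(ks_example)
print(f"R_inf = {r_inf:.6f}, f(R_inf) = {ks_roundtrip:.6f}, A = {a_example:.4f}")
```

Because s appears in the denominator of k/s, any change in scattering (for example through particle size) shifts A even when k, and hence concentration, is unchanged.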

# RESULTS AND DISCUSSION

# Spectral Characteristics of NIR Diffuse Reflectance Spectra of Different Particle Size Samples

The representative raw spectra of Scrophularia radix with different particle sizes are shown in **Figure 1**; the spectral profiles were similar in shape. However, the main influence of particle size variation on the diffuse reflectance spectra was the baseline offset. The well-known phenomenon that larger particles show stronger absorption illustrates that particle size is vital to the response. Some weak absorption peaks were observed in the second overtone region (SCOT, 1,000–1,400 nm) of the fundamental C-H stretching bands, while considerable fluctuations were observed in the first combination-overtone region (FCOT, 1,400–2,040 nm) and the combination region (CR, 2,040–2,500 nm). Those absorption peaks might be caused by the diffuse reflectance on different particle sizes.

# HPLC Determination of Harpagoside Content in Scrophularia Radix

The HPLC chromatograms of the representative sample and standard are shown in **Figure 2**. The retention time of harpagoside in a sample extract was the same as that for the standard solution. **Figure 3** shows the harpagoside concentration of 30 samples. There is a significant difference in harpagoside concentration of samples of different particle sizes. The biggest difference of the particle sizes was located in the range of 180– 250µm, but the overall concentration design was suitable for the modeling.

TABLE 3 | Preprocessing methods for different mix particle size models (3 particle size ranges).


*#The original spectra without any pretreatment.*

\**The best preprocessing method for different mix particle size models.*

# PLS Models for NIR Diffuse Reflectance Data Using Scrophularia Radix of Each Single Particle Size

PLS models for each particle size were constructed with the different preprocessing methods. **Figure 4** shows the relationship between the number of latent variables and PRESS for the different preprocessing methods. In general, the lowest PRESS value indicates the optimum number of latent variables (Pan et al., 2015). The model was validated for prediction with an internal sample set. The model performance values for each particle size using different preprocessing methods are listed in **Table 2**. The data showed that the raw spectra gave the best models for the particle sizes of 355–850 µm and <90 µm, while the best preprocessing methods for the particle sizes of 250–355 µm, 180–250 µm, 150–180 µm, 125–150 µm, and 90–125 µm were EMSC, SG9, SG9, SNV, and MSC, respectively.

In addition, the model evaluation parameters, i.e., RMSEC, RMSECV, RMSEP, and RPD, for the particle size of 355–850 µm were 0.0576, 0.1642, 0.2094, and 2.02, respectively. The parameter values for the other particle sizes are summarized in **Table 2**. The plot of predicted values against reference values is shown in **Figure 5**, indicating that the best prediction result was obtained for the particle size of 125–150 µm. Therefore, the NIR model was influenced by particle size, and its quantitative characteristics were explored for the different particle sizes.
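For readers unfamiliar with the evaluation metrics, the following sketch shows how RMSEP and RPD are commonly computed from reference and predicted values. RPD is taken here as the standard deviation of the reference values divided by RMSEP, which is the usual definition but an assumption about the authors' exact formula; the numbers are made up for illustration.

```python
import math
from statistics import stdev

def rmsep(reference, predicted):
    """Root mean square error of prediction."""
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def rpd(reference, predicted):
    """Ratio of the reference standard deviation to RMSEP (assumed definition)."""
    return stdev(reference) / rmsep(reference, predicted)

# Hypothetical harpagoside contents (mg/g): HPLC reference vs. NIR prediction.
reference_values = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_values = [1.1, 1.9, 3.1, 3.9, 5.1]

rmsep_value = rmsep(reference_values, predicted_values)
rpd_value = rpd(reference_values, predicted_values)
print(f"RMSEP = {rmsep_value:.4f}, RPD = {rpd_value:.2f}")
```

The same `rmsep` function applied to calibration or cross-validation predictions would give RMSEC or RMSECV, respectively.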

# PLS Models for NIR Diffuse Reflectance Data Using Scrophularia Radix of Mix Particle Size

The comparison of model performance for different types of mix particle size (i.e., seven, six, five, four, and three types of particle size) shows that the mix particle size model was best constructed with the 3-type mix particle size (**Table 3**). Various preprocessing methods were used, such as MSC, SNV, EMSC, and SG9. As can be seen from **Figure 6**, the optimum preprocessing methods for the mixed particle sizes of 180–850 µm, 150–355 µm, 125–250 µm, 90–180 µm, and 0–150 µm were SG9, untreated original spectra, EMSC, EMSC, and MSC, respectively, as these models had the lowest PRESS values.

The best prediction from the mix particle size model was obtained for 90–180 µm, with an RPD value >3 (**Table 3**). The RPD values of the other mix particle size models were about 2, meaning that the performance of these mix particle size models was similar. This result further revealed that particle size is vital to the quantitative model performance of diffuse reflectance spectra acquired with an NIR sensor. To make the relationship clearer, a detailed comparison of the single particle size and mixed particle size models was carried out.

# Comparison of the Model Performance for Single Particle Size and Mix Particle Size

It can be concluded from the comparison between the single particle size and mixed particle size models that the RPD value of the former was better than that of the latter. Although the single particle size model predicted well within a certain particle size range, its prediction results were not stable. Most applications of NIR diffuse reflectance spectra involve a relatively broad range of particle sizes. As a result, a mix particle size calibration model was used for prediction in subsequent studies.

Moreover, the mix particle size calibration model was also used to predict the validation set for each particle size, to examine which particle size samples could be predicted more accurately and to provide a guideline for subsequent sample preparation. The model for the particle size of 90–180 µm was selected to predict the particle sizes of 150–180 µm, 125–150 µm, and 90–125 µm; the best preprocessing method was MSC (**Table 4**), and the RPD values of the three prediction models were 3.81, 5.78, and 2.81 (**Table 5**).

On the other hand, the RPD values of the single particle size models were 3.40, 4.78, and 2.52. Compared with the single particle size model, the RPD value of the mix particle size model was better, illustrating that the prediction of the mix particle size calibration model was more accurate (**Table 5**). The plot of reference against predicted values for the validation sets is shown in **Figure 7**. The correlation between reference and predicted values was good, which further demonstrated that the mix particle size model was better than the single particle size model. Why is particle size of such importance to the quantitative model of NIR diffuse reflectance? This can be explained by the Kubelka-Munk theory, which is a critical theory in NIR diffuse reflectance.

# Discovery of the Linear Region of NIR Diffuse Reflectance Spectra Using the Kubelka-Munk Theory

In practice, NIR diffuse reflectance is usually used for solid particle determination and its quantitative evidence is based on the Kubelka-Munk theory (**Figure 8**).

It can be seen from the equation that the absorbance is related to the k/s value. A linear relationship was discovered between the k/s value and A within a certain range.

TABLE 4 | The prediction model for the single particle size by using the mix particle size model.


*#The original spectra without any pretreatment.*

\**The best prediction model for the single particle size by using the mix particle size model.*

TABLE 5 | Predicted results of different samples of single Scrophulariaceae Radix particle size model and calibration particle size model.


\**The best predicted results.*

As illustrated in **Figure 9**, the value of k/s was >4, clearly indicating that a linear region existed. This result also explained and guided the modeling performance of NIR diffuse reflectance: such a linear region provides a reference for the linear modeling of diffuse reflectance spectra and is beneficial for establishing an NIR diffuse reflectance model. According to our data, when the scattering coefficient s does not change, the absorption coefficient k is proportional to the sample concentration. In this study, the quantitative models for single particle size and mix particle size were both constructed to minimize the limitation that the particle size of samples was only available in a certain range. The model of single particle size was better than the mix particle size owing to a small change in the scattering coefficient s.

# CONCLUSIONS

Particle size is of great importance to the quantitative model of NIR diffuse reflectance. In this study, the single particle size and mix particle size models of Radix Scrophulariae were constructed using PLS. For the single particle size model, the best prediction model was obtained for the particle size distribution of 125–150 µm. This illustrated that a small particle size was beneficial for constructing the quantitative model of harpagoside in Radix Scrophulariae.

For the mix particle size model, a better prediction was obtained for the particle size distribution of 90–180 µm, indicating that the mix particle size model could explain more variation in the sample and that its accuracy and robustness were improved. Meanwhile, the quantitative basis of NIR diffuse reflectance for different particle sizes was the Kubelka-Munk theory. A linear relationship between the k/s value and A was found within a certain range. The data showed that a narrow range of the scattering coefficient s resulted in a better model. In addition, the value of k/s was >4, clearly indicating that a linear region existed. This linear region helped explain and guide the modeling performance of NIR diffuse reflectance data. Finding such a linear region provided a methodological reference for the linear modeling of NIR diffuse reflectance spectra. Thus, the linear region should be assessed accurately in advance to obtain a precise linear model.

Our study also showed that, at the theoretical level, the quantitative analysis of CHM samples was more accurate when the scattering coefficient s remained unchanged or differed only insignificantly.

# AUTHOR CONTRIBUTIONS

ZW and YQ: conceived the research; XP: performed the experiment; SD: wrote the manuscript; CD, LM, and XH: analyzed the data. All the authors prepared the manuscript and discussed the results.

# ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (81773914), Beijing Nova Program of China (xx2016050), and Science Fund for Distinguished Young Scholars in BUCM (2015-JYB-XYQ-003). The authors thank the Key Laboratory of TCM Information Engineering of State Administration of Traditional Chinese Medicine, (Beijing, China) for the assistance in data processing, Modernization of Traditional Chinese Medicine of Daozhen county of China.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Dai, Pan, Ma, Huang, Du, Qiao and Wu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Nearest Correlation-Based Input Variable Weighting for Soft-Sensor Design

Koichi Fujiwara\* and Manabu Kano

Department of Systems Science, Kyoto University, Kyoto, Japan

In recent years, soft-sensors have been widely used for estimating product quality or other important variables when online analyzers are not available. In order to construct a highly accurate soft-sensor, appropriate data preprocessing is required. In particular, the selection of input variables or input features is one of the most important techniques for improving estimation performance. Fujiwara et al. proposed a variable selection method, in which variables are clustered into variable groups based on the correlation between variables by nearest correlation spectral clustering (NCSC), and each variable group is examined as to whether or not it should be used as input variables. This method is called NCSC-based variable selection (NCSC-VS). However, these NCSC-based methods have a lot of parameters to be tuned, and their joint optimization is burdensome. The present work proposes an effective input variable weighting method to be used instead of variable selection to conserve labor required for parameter tuning. The proposed method, referred to herein as NC-based variable weighting (NCVW), searches input variables that have the correlation with the output variable by using the NC method and calculates the correlation similarity between the input variables and output variable. The input variables are weighted based on the calculated correlation similarities, and the weighted input variables are used for model construction. There is only one parameter in the proposed NCVW since the NC method has one tuning parameter. Thus, it is easy for NCVW to develop a soft-sensor. The usefulness of the proposed NCVW is demonstrated through an application to calibration model design in a pharmaceutical process.

### Edited by:

Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam

### Reviewed by:

Daniel Cozzolino, Central Queensland University, Australia

Larisa Lvova, Università degli Studi di Roma Tor Vergata, Italy

> \*Correspondence: Koichi Fujiwara fujiwara.koichi@i.kyoto-u.ac.jp

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 28 February 2018 Accepted: 30 April 2018 Published: 22 May 2018

### Citation:

Fujiwara K and Kano M (2018) Nearest Correlation-Based Input Variable Weighting for Soft-Sensor Design. Front. Chem. 6:171. doi: 10.3389/fchem.2018.00171

Keywords: soft-sensor, calibration model, variable weighting, partial least squares, near infrared spectroscopy

# 1. INTRODUCTION

It is important in terms of process safety and quality control to estimate product quality or other process variables, particularly when online analyzers are not available. Soft-sensors are mathematical models for estimating variables that are difficult to measure by hard sensors in real time from other variables that are easy to measure. They have been used in various industries, for example, for measuring product composition at distillation columns in chemical processes, silicon wafer surface flatness in semiconductor processes, and active ingredient content of drugs in pharmaceutical processes. There are three methodologies for constructing soft-sensors: (i) first-principles modeling based on physicochemical knowledge of processes, (ii) statistical modeling based on process data, and (iii) a combination of the two. These methodologies are also called white-box, black-box, and gray-box modeling, respectively (Ahmad et al., 2014). In particular, statistical modeling has attracted wide attention due to recent advances in machine learning. Although we can utilize various machine learning techniques for soft-sensor development, partial least squares (PLS) is still widely used in chemometrics as well as soft-sensor design. This is because it can construct an accurate linear regression model even when the multicollinearity problem occurs (Wold et al., 2001; Kano and Ogawa, 2010; Kano and Fujiwara, 2013).

One of the major issues in developing a precise soft-sensor is input variable selection. Although soft-sensors fit the modeling data well when numerous variables are used as the input, their performance may deteriorate when unimportant variables are used for estimation. In particular, input variable selection is key when a calibration model is constructed from near-infrared spectroscopy (NIRS) data. NIRS is a powerful online measurement technology due to its short measuring time and non-invasiveness (Roggo et al., 2007; Miyano et al., 2014), and the number of measured wavelengths of an NIR spectrum is usually more than 100.

If all of the possible variable combinations are tested, the computational load increases exponentially as the candidate variables increase. Appropriate variables must be selected in a systematic manner, which is referred to as input variable selection in soft-sensors, and feature selection in machine learning. A technique for input variable selection should be developed for improving the efficiency of soft-sensor design (Andersen and Bro, 2010; Mehmood et al., 2012).

In linear regression, stepwise and least absolute shrinkage and selection operator (Lasso) are widely used as input variable selection methods (Hocking, 1976; Tibshirani, 1996). In addition, PLS-Beta and variable influence on projection (VIP) are available for selecting input variables of PLS (Kubinyi, 1993).

Methods of selecting variables on the basis of correlation have been proposed because the correlation between variables should be considered when building a good regression model (Fujiwara et al., 2009). In correlation-based variable selection methods, variable groups are constructed according to the correlation, some of which are selected as the input variables. Nearest correlation spectral clustering (NCSC) (Fujiwara et al., 2010, 2011) is used for variable grouping. In NCSC-based variable selection (NCSC-VS), variable groups are constructed by NCSC, and it is examined whether or not they should be used as the input variables according to their contribution to the estimates (Fujiwara et al., 2012b). In addition, NCSC-based group Lasso (NCSC-GL) uses group Lasso (Yuan and Lin, 2006; Bach, 2008) for variable group selection after NCSC (Fujiwara and Kano, 2015). Although both NCSC-VS and NCSC-GL can build highly accurate soft-sensors, tuning their parameters is complicated and time-consuming because they have multiple parameters to be tuned. Therefore, the number of their tuning parameters should be reduced for efficient variable selection.

Another approach is input variable weighting or input variable scaling, which multiplies each input variable by weights according to its importance from the viewpoint of estimation (Kim et al., 2014). The present work proposes an effective input variable weighting method to replace variable selection in order to conserve labor required for parameter tuning. The proposed method, referred to herein as NC-based variable weighting (NCVW), searches input variables that have the correlation with the output variable by using the NC method and calculates the correlation similarity between each input variable and the output variable. The input variables are weighted based on the calculated correlation similarities, and the weighted input variables are used for modeling. Since there is only one parameter in the proposed NCVW, an efficient soft-sensor design is realized. In this work, the usefulness of the proposed NCVW is demonstrated through application to calibration model design for estimating active pharmaceutical ingredient (API) content.

This paper is organized as follows. Section 2 introduces conventional variable selection methods for PLS modeling, and NCVW is proposed in section 3. Section 4 reports on application results of the proposed method to pharmaceutical data. The conclusion and future work are described in section 5.

# 2. CONVENTIONAL METHODS

This section introduces PLS and conventional input variable selection methods.

# 2.1. PLS

PLS is a widely used linear regression method in chemometrics as well as soft-sensor design. Consider an input data matrix $\mathbf{X} \in \Re^{N \times M}$ whose $n$th row is the $n$th input sample $\mathbf{x}_n \in \Re^M$ and an output data vector $\mathbf{y} \in \Re^N$ whose $n$th element is the $n$th output sample $y_n \in \Re$, where $\mathbf{X}$ and $\mathbf{y}$ are mean-centered and appropriately scaled. The input $\mathbf{X}$ and the output $\mathbf{y}$ are decomposed as follows:

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E} \tag{1}$$

$$\mathbf{y} = \mathbf{T}\mathbf{b} + \mathbf{f} \tag{2}$$

where $\mathbf{T} \in \Re^{N \times K}$ is the latent variable matrix whose columns are the latent variables $\mathbf{t}_k \in \Re^N$ ($k = 1, \cdots, K$), $\mathbf{P} \in \Re^{M \times K}$ is the loading matrix of $\mathbf{X}$ whose columns are the loading vectors $\mathbf{p}_k \in \Re^M$, and $\mathbf{b} = [b_1, \cdots, b_K]^T$ is the regression coefficient vector of $\mathbf{y}$. $K$ denotes the number of adopted latent variables. $\mathbf{E} \in \Re^{N \times M}$ and $\mathbf{f} \in \Re^N$ are errors.

A PLS model can be constructed by the non-linear iterative partial least squares (NIPALS) algorithm. Let the first to $k$th latent variables be $\mathbf{t}_1, \cdots, \mathbf{t}_k$, the loading vectors be $\mathbf{p}_1, \cdots, \mathbf{p}_k$, and the loadings be $b_1, \cdots, b_k$. The $(k+1)$th residual input and output are as follows:

$$\mathbf{X}_{k+1} = \mathbf{X}_k - \mathbf{t}_k \mathbf{p}_k^T \tag{3}$$

$$\mathbf{y}_{k+1} = \mathbf{y}_k - b_k \mathbf{t}_k. \tag{4}$$

$\mathbf{t}_k$ is a linear combination of the columns of $\mathbf{X}_k$, that is, $\mathbf{t}_k = \mathbf{X}_k \mathbf{w}_k$, where $\mathbf{w}_k \in \Re^M$ is the $k$th weighting vector. $\mathbf{w}_k$ is the eigenvector corresponding to the maximum eigenvalue of the following eigenvalue problem:

$$\mathbf{X}_k^T \mathbf{y}_k \mathbf{y}_k^T \mathbf{X}_k \mathbf{w}_k = \lambda \mathbf{w}_k \tag{5}$$

where $\lambda$ is an eigenvalue. The $k$th loading vector $\mathbf{p}_k$ and the $k$th loading $b_k$ are $\mathbf{p}_k = \mathbf{X}_k^T \mathbf{t}_k / \mathbf{t}_k^T \mathbf{t}_k$ and $b_k = \mathbf{y}_k^T \mathbf{t}_k / \mathbf{t}_k^T \mathbf{t}_k$. This procedure is repeated until the number of adopted latent variables $K$ is reached; $K$ can be determined by cross-validation.
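The NIPALS recursion of Equations (3) and (4) can be sketched in a few lines of NumPy. This is a minimal illustration for a single output, not the authors' implementation: the function name `pls1_nipals` and its return values are hypothetical, and the weighting vector is taken as the normalized product of the transposed residual input and the residual output, which is the dominant eigenvector of the rank-one eigenvalue problem in Equation (5).

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Single-output PLS by NIPALS (illustrative sketch)."""
    X_k = X - X.mean(axis=0)          # mean-center the inputs
    y_k = y - y.mean()                # mean-center the output
    N, M = X_k.shape
    T = np.zeros((N, n_components))   # latent variables t_k
    P = np.zeros((M, n_components))   # loading vectors p_k
    W = np.zeros((M, n_components))   # weighting vectors w_k
    b = np.zeros(n_components)        # loadings b_k
    for k in range(n_components):
        w = X_k.T @ y_k               # dominant eigenvector direction
        w /= np.linalg.norm(w)
        t = X_k @ w                   # latent variable
        tt = t @ t
        p = X_k.T @ t / tt            # loading vector
        b_k = (y_k @ t) / tt          # loading of the output
        X_k = X_k - np.outer(t, p)    # residual input, Equation (3)
        y_k = y_k - b_k * t           # residual output, Equation (4)
        T[:, k], P[:, k], W[:, k], b[k] = t, p, w, b_k
    # Regression coefficients with respect to the centered inputs.
    beta = W @ np.linalg.solve(P.T @ W, b)
    return T, P, W, b, beta

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.01 * rng.normal(size=30)
T, P, W, b, beta = pls1_nipals(X, y, n_components=2)
print("regression coefficients:", np.round(beta, 3))
```

In practice `n_components` would be chosen by cross-validation, as stated in the text.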

# 2.2. PLS-Beta

PLS-Beta translates a PLS model, Equations (1, 2), into a multiple linear regression (MLR) model and selects input variables based on the magnitude of its regression coefficients (Kubinyi, 1993). The translated model is expressed as

$$\hat{\mathbf{y}} = \mathbf{T}(\mathbf{T}^T \mathbf{T})^{-1} \mathbf{T}^T \mathbf{y} = \mathbf{X} \boldsymbol{\beta}_{\text{pls}} \tag{6}$$

where $\boldsymbol{\beta}_{\text{pls}} = \mathbf{W}(\mathbf{P}^T \mathbf{W})^{-1} (\mathbf{T}^T \mathbf{T})^{-1} \mathbf{T}^T \mathbf{y}$ and $\mathbf{W} = [\mathbf{w}_1, \cdots, \mathbf{w}_K] \in \Re^{M \times K}$. The evaluation index of PLS-Beta, $\nu$, is defined as

$$\nu = \frac{||\boldsymbol{\beta}_{\text{select}}||}{||\boldsymbol{\beta}_{\text{pls}}||} \quad (0 < \nu \le 1) \tag{7}$$

where $\boldsymbol{\beta}_{\text{select}}$ is the regression coefficient vector of the selected input variables. We select individual input variables in descending order of the magnitude of the elements of $\boldsymbol{\beta}_{\text{pls}}$ until $\nu$ reaches a predefined threshold.
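The selection rule built on Equation (7) can be sketched as follows. The function name `pls_beta_select` is hypothetical, and the stopping rule is read here as "stop as soon as the index is at least the threshold", which is an assumption about how the threshold is applied.

```python
import numpy as np

def pls_beta_select(beta, threshold):
    """Pick variables by descending |beta| until the norm ratio reaches the threshold."""
    total = np.linalg.norm(beta)
    if total == 0.0:
        raise ValueError("all regression coefficients are zero")
    order = np.argsort(-np.abs(beta))   # largest magnitude first
    selected = []
    nu = 0.0
    for j in order:
        selected.append(int(j))
        # Evaluation index: norm of selected coefficients over norm of all coefficients.
        nu = np.linalg.norm(beta[selected]) / total
        if nu >= threshold:
            break
    return sorted(selected), nu

# Hypothetical coefficient vector, for illustration only.
beta = np.array([3.0, 0.0, 4.0, 0.1])
selected, nu = pls_beta_select(beta, threshold=0.9)
print(selected, round(float(nu), 4))
```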

# 2.3. Variable Influence on Projection (VIP)

The VIP evaluates the contribution of each input variable to the output (Kubinyi, 1993). The VIP score of the jth input variable is

$$V_j = \sqrt{M \sum_{k=1}^{K} \left( w_{jk}^2 b_k^2 (\mathbf{t}_k^T \mathbf{t}_k) / ||\mathbf{w}_k||^2 \right) \Big/ \sum_{k=1}^{K} b_k^2 (\mathbf{t}_k^T \mathbf{t}_k)} \tag{8}$$

where $w_{jk}$ is the $j$th element of $\mathbf{w}_k$. Variables satisfying $V_j > \eta$ ($\eta > 0$) are selected.
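Equation (8) translates directly into array operations. The sketch below assumes that the weighting vectors, latent variables, and loadings of a fitted PLS model are already available as arrays; the function name `vip_scores` and the random inputs are hypothetical.

```python
import numpy as np

def vip_scores(W, T, b):
    """VIP score of each input variable.

    W: (M, K) weighting vectors, T: (N, K) latent variables, b: (K,) loadings.
    """
    M = W.shape[0]
    # Output variance explained by each latent variable: b_k^2 * t_k^T t_k.
    explained = b ** 2 * np.sum(T ** 2, axis=0)
    # Squared weights normalized by the squared norm of each weighting vector.
    w_normalized = W ** 2 / np.sum(W ** 2, axis=0)
    return np.sqrt(M * (w_normalized @ explained) / explained.sum())

# Random stand-ins for a fitted model, for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))
T = rng.normal(size=(10, 2))
b = np.array([1.5, -0.5])
print(np.round(vip_scores(W, T, b), 3))
```

By construction the squared scores sum to the number of input variables, so their mean square is one, which gives a natural scale for judging the threshold.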

# 2.4. Stepwise

Stepwise is an input variable selection method for the MLR model based on a statistical test which checks whether or not the true value of the regression coefficient of a newly added candidate variable is zero (Hocking, 1976).

# 2.5. Least Absolute Shrinkage and Selection Operator (Lasso)

Lasso is least squares with $L_1$ regularization so that some regression coefficients become zero (Tibshirani, 1996). The objective function of Lasso is as follows:

$$\boldsymbol{\beta}_{\text{lasso}} = \underset{\boldsymbol{\beta}}{\text{arg min}} \left( ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||_2^2 + \lambda ||\boldsymbol{\beta}||_1 \right), \quad \lambda > 0 \tag{9}$$

Least angle regression (LARS) solves the problem of Equation (9) efficiently (Efron et al., 2004).
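To show how the $L_1$ penalty drives coefficients to exactly zero, the sketch below minimizes the objective of Equation (9) by cyclic coordinate descent. Note that this is a different solver from LARS and is used here only because it is short; the function names and the synthetic data are hypothetical, and all columns of the input matrix are assumed to be non-zero.

```python
import numpy as np

def soft_threshold(a, tau):
    """Shrink a toward zero by tau; values within [-tau, tau] become exactly zero."""
    return np.sign(a) * max(abs(a) - tau, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X beta||_2^2 + lam * ||beta||_1 by coordinate descent."""
    M = X.shape[1]
    beta = np.zeros(M)
    col_ss = np.sum(X ** 2, axis=0)    # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(M):
            # Residual with the contribution of variable j removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # The factor 1/2 arises because the squared error is not halved.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    return beta

# Synthetic data in which only the first two variables matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.05 * rng.normal(size=50)
print(np.round(lasso_cd(X, y, lam=5.0), 3))
```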

# 3. NEAREST CORRELATION BASED VARIABLE WEIGHTING (NCVW)

The present work proposes a new method for weighting input variables for PLS modeling to be used instead of variable selection. Since the proposed method uses the nearest correlation (NC) method for calculating correlation-based variable weights, this section explains the NC method and variable selection methods based on the NC method before the proposed method is described.

# 3.1. NC Method

The NC method was originally developed as an unsupervised learning technique for detecting samples whose correlation is similar to the query (Fujiwara et al., 2012a). The procedure of the NC method is described in Algorithm 1.


The concept of Algorithm 1 is explained through a simple example. In **Figure 1** (left), there are seven samples $\mathbf{x}_q, \mathbf{x}_1, \cdots, \mathbf{x}_6$, of which five, $\mathbf{x}_q$ and $\mathbf{x}_1, \cdots, \mathbf{x}_4$, are on the same plane $P$. That is, plane $P$ expresses the hidden correlation between the five samples, and $\mathbf{x}_5$ and $\mathbf{x}_6$ have a different correlation. The aim of the NC method here is to detect samples whose correlation is similar to the query $\mathbf{x}_q$, that is, to detect $\mathbf{x}_1, \cdots, \mathbf{x}_4$ on $P$.

In steps 3–5, the entire space is translated so that $\mathbf{x}_q$ becomes the origin by subtracting $\mathbf{x}_q$ from all other samples $\mathbf{x}_n$, as shown in **Figure 1** (right). The translated plane $P$ becomes the linear subspace $V$ since it contains the origin.

Next, lines connecting each sample and the origin are drawn, and it is checked whether another sample is on each line. In this example, the pairs $\mathbf{x}_1$-$\mathbf{x}_4$ and $\mathbf{x}_2$-$\mathbf{x}_3$ satisfy such a relationship, whereas $\mathbf{x}_5$ and $\mathbf{x}_6$, which are not on $V$, cannot make pairs. The correlation coefficients of these pairs must be 1 or −1. Thus, the pairs whose correlation coefficients are ±1 are thought to have a correlation similar to $\mathbf{x}_q$. The threshold of the correlation coefficient $\gamma$ ($0 < \gamma \le 1$) is used for constraint relaxation. Steps 6–8 correspond to the above procedure.

Finally, the pairs whose correlations are similar to the query $\mathbf{x}_q$ are output in step 9.
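The procedure described above can be sketched for a single query as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `nearest_correlation` is hypothetical, the correlation coefficient of a pair is taken to be the cosine between the two translated samples, and the coordinates only mimic the configuration of the example (four samples on a plane through the query, two off it).

```python
import numpy as np

def nearest_correlation(X, x_q, gamma=0.99):
    """Return index pairs (k, l) of samples whose correlation is similar to the query."""
    D = X - x_q                               # translate so the query is the origin
    norms = np.linalg.norm(D, axis=1)
    pairs = []
    for k in range(len(D)):
        for l in range(k + 1, len(D)):
            if norms[k] == 0.0 or norms[l] == 0.0:
                continue                      # a sample coincides with the query
            c = (D[k] @ D[l]) / (norms[k] * norms[l])
            if abs(c) >= gamma:               # nearly on one line through the origin
                pairs.append((k, l))
    return pairs

# Hypothetical coordinates: rows 0-3 lie on the plane z = 0 with the query.
x_q = np.array([1.0, 1.0, 0.0])
X = np.array([
    [2.0, 1.0, 0.0],    # x1
    [1.0, 2.0, 0.0],    # x2
    [1.0, -0.5, 0.0],   # x3
    [-1.0, 1.0, 0.0],   # x4
    [3.0, 2.0, 4.0],    # x5, off the plane
    [0.0, 3.0, -2.0],   # x6, off the plane
])
print(nearest_correlation(X, x_q))
```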

# 3.2. NCSC

NCSC was originally proposed for sample clustering based on correlation between variables (Fujiwara et al., 2010, 2011), in which the NC method and spectral clustering (SC) (Ding et al., 2001; Ng et al., 2002) are integrated. SC is a graph theory-based clustering method, which can partition a weighted graph, whose weights express affinities between nodes, into subgraphs by cutting some of their arcs. In NCSC, the NC method is used for building an affinity graph expressing the correlation-based similarities between samples, and SC partitions the graph constructed by the NC method.

Algorithm 2 shows an affinity matrix construction procedure in NCSC. Steps 6–13 correspond to the NC method, and the weighted graph constructed by the NC method is expressed as an affinity matrix **S**. Although some SC algorithms have been proposed, the max-min cut (Mcut) algorithm (Ding et al., 2001) or its extended method (Ng et al., 2002) is used herein.

## **Algorithm 2** Affinity matrix construction

1: Set $\gamma$ and $J$.

2: $\mathbf{S} \in \Re^{N \times N} \leftarrow \mathbf{O}_{N,N}$.

3: $L = 1$.

4: **for** $L = 1$ to $N$ **do**

5: $\mathbf{S}_L \in \Re^{N \times N} \leftarrow \mathbf{O}_{N,N}$.

6: **for all** $n = 1, 2, \cdots, N$ ($n \neq L$) **do**

7: $\mathbf{x}'_n = \mathbf{x}_n - \mathbf{x}_L$.

8: **end for**

9: **for all** $k, l$ ($k \neq l$) **do**

10: Calculate $C'_{k,l}$ from $\mathbf{x}'_k$ and $\mathbf{x}'_l$.

11: **if** $|C'_{k,l}| \ge \gamma$ **then**

12: $(\mathbf{S}_L)_{k,l} = (\mathbf{S}_L)_{l,k} = 1$.

13: **end if**

14: **end for**

15: $\mathbf{S} = \mathbf{S} + \mathbf{S}_L$.

16: **end for**
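Algorithm 2 can be sketched in NumPy by letting every sample take the role of the query in turn and accumulating the resulting pair indicators. The function name `nc_affinity` is hypothetical, the pairwise loop of steps 9–14 is vectorized, and the correlation coefficient is assumed to be the cosine between translated samples.

```python
import numpy as np

def nc_affinity(X, gamma=0.99):
    """Affinity matrix S (N x N) accumulated over all query samples."""
    N = X.shape[0]
    S = np.zeros((N, N))
    for L in range(N):
        idx = np.delete(np.arange(N), L)      # all samples except the query
        D = X[idx] - X[L]                     # translate the query to the origin
        norms = np.linalg.norm(D, axis=1)
        norms[norms == 0.0] = np.inf          # duplicates of the query never pair
        U = D / norms[:, None]
        C = U @ U.T                           # cosines between translated samples
        S_L = np.abs(C) >= gamma
        np.fill_diagonal(S_L, False)          # pairs require k != l
        S[np.ix_(idx, idx)] += S_L.astype(float)
    return S

# Hypothetical samples: the first five lie on the plane z = 0.
X = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [1.0, -0.5, 0.0],
    [-1.0, 1.0, 0.0],
    [3.0, 2.0, 4.0],
    [0.0, 3.0, -2.0],
])
print(nc_affinity(X))
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed to a spectral clustering routine as a weighted graph.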

NCSC has two parameters: the threshold in the NC method γ and the number of clusters partitioned by SC, J. Previous studies have suggested the default value of γ to be 0.99 (Fujiwara et al., 2010, 2011), and that J needs to be determined by trial and error.

# 3.3. NCSC-VS and NCSC-GL

NCSC has been utilized for variable selection in soft-sensor design. In these methods, multiple variable groups are constructed by NCSC, of which some are selected as the input variables of a soft-sensor. NCSC classifies variables into $J$ variable groups $\mathbf{v}_j = \{x_m \mid m \in V_j\}$ ($j = 1, \cdots, J$), where $V_j$ is the subset of variable indexes and $V = \cup V_j$. An affinity matrix is derived from the transposed input variable matrix $\mathbf{X}^T$ by the NC method for variable grouping.

NCSC-VS evaluates each variable group as to whether or not its members should be used as input variables from the viewpoint of contribution to the output (Fujiwara et al., 2012b). The $j$th PLS model with $P$ latent variables, $f_j^P$, is built from the $j$th variable group matrix $\mathbf{X}_j$, and its contribution is evaluated by

$$C_j^P = 1 - \frac{||\hat{\mathbf{y}}_j^P||^2}{||\mathbf{y}||^2} \tag{10}$$

where $\hat{\mathbf{y}}_j^P$ is the estimate of $f_j^P$. We select $D$ ($\le J$) variable groups in descending order of $C_j^P$ and construct the final PLS model from the selected input variables.

NCSC-GL selects variable groups by using group Lasso instead of contribution evaluation in NCSC-VS. Group Lasso is an extension of Lasso for selecting some input variable groups from predefined multiple variable groups (Yuan and Lin, 2006; Bach, 2008).

Suppose that $M$ variables are divided into $J$ groups, and $\mathbf{X}_j$ and $\boldsymbol{\beta}_j$ denote the input data matrix and the regression coefficient vector corresponding to the $j$th group, respectively. The number of variables in the $j$th group is $M_j$, that is, $M = \sum_{j=1}^{J} M_j$. The regression coefficients of group Lasso are derived as:

$$\boldsymbol{\beta}_{\text{glasso}} = \underset{\boldsymbol{\beta}}{\text{arg min}} \left( ||\mathbf{y} - \sum_{j=1}^{J} \mathbf{X}_j \boldsymbol{\beta}_j||_2^2 + \lambda \sum_{j=1}^{J} \sqrt{M_j} ||\boldsymbol{\beta}_j||_2 \right) \tag{11}$$

where $\boldsymbol{\beta} = [\boldsymbol{\beta}_1^T, \cdots, \boldsymbol{\beta}_J^T]^T$, and $\lambda$ is a parameter. Variable groups must be constructed in advance in group Lasso. Thus, NCSC-GL uses variable groups formed by NCSC as the input of group Lasso.
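The role of the group-wise penalty in Equation (11) is easiest to see by evaluating the objective for a given coefficient vector. The sketch below only evaluates the objective and does not solve the minimization; the function name `group_lasso_objective` and the representation of groups as lists of column indexes are hypothetical.

```python
import numpy as np

def group_lasso_objective(X, y, beta, groups, lam):
    """Squared error plus the group Lasso penalty for a given beta.

    groups is a list of index lists, one per variable group.
    """
    residual = y - X @ beta
    # Each group contributes sqrt(group size) times the Euclidean norm of its block.
    penalty = sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)
    return residual @ residual + lam * penalty

# Hypothetical example: three variables in two groups.
X = np.eye(3)
y = np.array([3.0, 4.0, 1.0])
beta = np.array([3.0, 4.0, 0.0])
print(group_lasso_objective(X, y, beta, groups=[[0, 1], [2]], lam=2.0))
```

Because the penalty acts on the norm of each block rather than on single coefficients, an entire group is either retained or set to zero, which is what makes the method suitable for selecting the variable groups formed by NCSC.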

NCSC-VS has four tuning parameters: γ in the NC method, the number of variable groups partitioned by SC, J, latent variables in the PLS models for variable group evaluation, P, and selected variable groups, D. On the other hand, there are three tuning parameters in NCSC-GL: γ in the NC method, the number of variable groups J formed by SC and λ in group Lasso. These three or four parameters need to be tuned for appropriate input variable selection. However, their joint optimization is burdensome and time-consuming. For more efficient soft-sensor design, the number of tuning parameters should be reduced.

# 3.4. NCVW

A new input variable weighting method, referred to as NC-based variable weighting (NCVW), is proposed to be used instead of variable selection for conserving labor required for parameter tuning. The proposed method applies the NC method to the input variables and output variable together for calculating similarities based on the correlation between the input variables and output variable, and uses the input variables weighted by the calculated similarities for modeling.

Let the $n$th input sample and the $n$th output sample be $\mathbf{x}_n \in \Re^M$ and $y_n$, where $M$ denotes the number of input variables. In NCVW, the NC method is applied to extended samples

$$\mathbf{x}'_n = \left[ x_n^{[1]}, \cdots, x_n^{[M]}, y_n \right]^T \quad (n = 1, \cdots, N) \tag{12}$$

and the affinity matrix $\mathbf{S}$ is constructed. Next, the 1st to $M$th elements in the $(M + 1)$th column of $\mathbf{S}$, which corresponds to the output variable, are extracted as a weighting vector $\mathbf{w} = [w^{[1]}, \cdots, w^{[M]}]$. Finally, a new input variable for PLS modeling is formed as

$$\mathbf{z}_n = \mathbf{w} \circ \mathbf{x}_n = [w^{[1]} x_n^{[1]}, \cdots, w^{[M]} x_n^{[M]}]^T \tag{13}$$

where $\mathbf{a} \circ \mathbf{b}$ denotes an element-wise product between vectors $\mathbf{a}$ and $\mathbf{b}$. Algorithm 3 summarizes the procedure of the proposed NCVW.

**Algorithm 3** Nearest correlation based variable weighting (NCVW)

1: Prepare $\mathbf{x}_n$ and $y_n$ ($n = 1, \cdots, N$).

2: $\mathbf{x}_n \leftarrow [x_n^{[1]}, \cdots, x_n^{[M]}, y_n]^T$ ($n = 1, \cdots, N$).

3: Get $\mathbf{S} \in \Re^{(M+1) \times (M+1)}$ by applying Algorithm 2 to $\mathbf{x}_n$.

4: Extract the 1st to $M$th elements in the $(M + 1)$th column of $\mathbf{S}$ as $\mathbf{w} = [w^{[1]}, \cdots, w^{[M]}]$.

5: $\mathbf{z}_n = [w^{[1]} x_n^{[1]}, \cdots, w^{[M]} x_n^{[M]}]^T$ ($n = 1, \cdots, N$).

6: Construct a model from $\mathbf{z}_n$ by PLS.
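The steps of Algorithm 3 up to the weighting can be sketched as follows. This is an illustration under stated assumptions, not the authors' MATLAB code: the function names are hypothetical, the correlation coefficient is taken to be the cosine between translated vectors, and, because the affinity matrix has one row per variable, each variable (a column of the extended data matrix) is treated as one "sample" of the NC method. The final PLS fit on the weighted inputs is left to any PLS routine.

```python
import numpy as np

def nc_affinity(V, gamma=0.99):
    """Affinity matrix of Algorithm 2; each row of V plays the role of one sample."""
    n = V.shape[0]
    S = np.zeros((n, n))
    for q in range(n):
        idx = np.delete(np.arange(n), q)
        D = V[idx] - V[q]                     # translate the query to the origin
        norms = np.linalg.norm(D, axis=1)
        norms[norms == 0.0] = np.inf          # duplicates of the query never pair
        U = D / norms[:, None]
        hit = np.abs(U @ U.T) >= gamma
        np.fill_diagonal(hit, False)
        S[np.ix_(idx, idx)] += hit.astype(float)
    return S

def ncvw(X, y, gamma=0.99):
    """Return the variable weights w and the weighted inputs Z."""
    extended = np.column_stack([X, y])        # N x (M+1), output as last column
    S = nc_affinity(extended.T, gamma)        # (M+1) x (M+1), variables as samples
    w = S[:-1, -1]                            # affinities between inputs and output
    return w, X * w                           # element-wise weighting of each sample

# Toy data (N = 3 samples, M = 6 input variables), for illustration only.
X = np.array([
    [2.0, 1.0, 1.0, -1.0, 3.0, 0.0],
    [1.0, 2.0, -0.5, 1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0, 0.0, 4.0, -2.0],
])
y = np.array([1.0, 1.0, 0.0])
w, Z = ncvw(X, y)
print("weights:", w)
```

Input variables that never form a pair with the output receive a weight of zero and are therefore removed from the model, so the weighting also acts as a soft form of variable selection.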

In soft-sensor design, the correlation among multiple input variables needs to be considered as well as the correlation between an individual input variable and the output variable. Thus, the proposed NCVW does not evaluate the correlation between each input variable and the output variable, but the correlation of multiple input variables together, which may contribute to an improvement in the estimation performance of a soft-sensor. In addition, the proposed NCVW has only one parameter, which is the threshold of the NC method γ . This leads to a huge efficiency improvement of soft sensor development.

# 4. CASE STUDY

This case study evaluates the performance of the proposed NCVW through application to pharmaceutical data provided by Daiichi Sankyo Co., Ltd. (Kim et al., 2011).

# 4.1. Objective Data

The objective of this case study is to design a calibration model that estimates active pharmaceutical ingredient (API) content in a target drug. NIR spectra (2203 points in 800−2500 nm) and the API content were measured from the granules of the drug through experiments. Since the number of wavelengths in NIR spectra was large, appropriate input wavelengths of NIR spectra had to be selected for constructing a precise calibration model. The modeling data and validation data consisted of 576 and 20 samples, respectively.

# 4.2. Model Construction

Before modeling, a first-order differential Savitzky-Golay smoothing filter (Savitzky and Golay, 1964) was applied to the spectra. As a benchmark, a PLS model using all the wavelengths as the input was constructed, which was called PLS-All. The number of its adopted latent variables was determined by cross-validation. Input wavelengths were selected using PLS-Beta, VIP, stepwise, Lasso, NCSC-VS, and NCSC-GL. The parameters used in each method were selected by trial and error and are shown in **Table 1**. We calculated the root-mean-square error (RMSE) for the modeling data for each parameter setting and determined the optimal wavelengths based on the calculated RMSE.

We designed PLS models with the wavelengths selected by each method in which cross-validation was used for determining the appropriate number of latent variables. Although Lasso derives regression coefficients, the PLS model was built from the wavelengths whose regression coefficient was not zero. This is for the reason that the number of retained wavelengths was still large and dimension reduction by PLS may have been needed. On the other hand, in the proposed NCVW, we calculated variable weights and constructed the PLS model from the weighted wavelengths. Finally, the API content was estimated by these constructed PLS models.

These procedures were repeated 100 times to calculate the average CPU time per modeling run for each method. The computer configuration was as follows: OS: Windows 10 (64 bit), CPU: Intel Core i7-8700 (3.2 GHz × 6), RAM: 64 GB, and MATLAB 2018a.

**Table 2** summarizes the results of the case study. #Wavelength and #LV denote the numbers of selected wavelengths and of adopted latent variables determined by cross-validation, $R^2$ is the determination coefficient, "CPU time" is the average CPU time [s], and "Parameters" denotes the optimal parameters in each method. In addition, **Figure 2** shows the detailed estimation results.

While PLS-Beta, VIP, and Lasso improved the estimation performance compared to PLS-All, stepwise was worse than PLS-All. Both NCSC-VS and NCSC-GL achieved higher performance than the methods above; in particular, NCSC-GL had the best performance. The proposed NCVW achieved almost the same performance as NCSC-VS and NCSC-GL, even though NCVW has only one tuning parameter. The RMSE of NCVW was improved by about 42% in comparison with PLS-All.

It is concluded that the proposed NCVW is a tuning-free soft-sensor design technique and that its performance is comparable to the NCSC-based methods.

# 4.3. Discussion

According to **Table 2**, the CPU times of NCSC-VS, NCSC-GL, and the proposed NCVW were much longer than those of the other methods. NCSC occupied more than 99% of their CPU time since it uses iteration for similarity calculation, which means that NCVW does not improve the computational load. In addition, the estimation performance of NCVW was not improved in comparison with NCSC-GL; however, construction of the actual soft-sensor with NCVW is much easier than with NCSC-VS and NCSC-GL. The latter methods have four and three tuning parameters, respectively. In this case study, 36 calculations in NCSC-VS and 12 calculations in NCSC-GL were repeated to search for the best parameter combination according to **Table 1**. It becomes difficult to find the optimal parameter combination when the number of tuning parameters increases. On the other hand, NCVW has just one parameter, the threshold of the NC method $\gamma$, and its recommended value has been proposed to be $\gamma = 0.99$ (Fujiwara et al., 2010, 2011). In fact, the total computation times of NCSC-VS, NCSC-GL, and the proposed NCVW were about 121, 42, and 3 min, respectively, for parameter tuning in this case study. Thus, the proposed NCVW makes the soft-sensor design much more efficient than NCSC-VS and NCSC-GL.

Variable weighting based on another type of the weight, the correlation coefficient between each input variable and the output variable, was evaluated. This method is called correlation coefficient-based variable weighting (CCVW). The mth variable weight of CCVW is defined as follows:

$$c^{[m]} = \frac{\mathbf{y}^T \mathbf{x}^{[m]}}{||\mathbf{y}|| ||\mathbf{x}^{[m]}||} \tag{14}$$

where $\mathbf{x}^{[m]} \in \Re^N$ denotes the $m$th column in the input data matrix $\mathbf{X} \in \Re^{N \times M}$ and $\mathbf{y} \in \Re^N$ is the output data vector. A PLS model was constructed from the input variables weighted by $c^{[m]}$. RMSE and $R^2$ of NCVW were 1.34 and 0.84, respectively. This showed the effectiveness of the variable weights of NCVW, which considers the correlation of multiple input variables and the output variable together.
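The CCVW weights of Equation (14) can be computed for all input variables at once. The sketch below mean-centers the data inside the function, in line with the preprocessing stated for PLS, so that each weight equals the Pearson correlation coefficient; the function name `ccvw_weights` and the synthetic data are hypothetical.

```python
import numpy as np

def ccvw_weights(X, y):
    """Correlation coefficient between each input variable (column of X) and y."""
    Xc = X - X.mean(axis=0)            # mean-center each input variable
    yc = y - y.mean()                  # mean-center the output
    return (Xc.T @ yc) / (np.linalg.norm(yc) * np.linalg.norm(Xc, axis=0))

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X[:, 0] + 0.1 * rng.normal(size=20)
c = ccvw_weights(X, y)
Z = X * c                              # inputs weighted variable by variable
print(np.round(c, 3))
```

Unlike the NC-based weights, each weight here depends on only one input variable and the output, which is the difference the comparison in the text is meant to highlight.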

**Figure 3** shows the results of wavelength selection of NCSC-VS and the variable weights calculated by the proposed NCVW. The colored bands express the selected wavelengths, and the colors denote groups by NCSC-VS. The red line shows the weights of NCVW. The wavelength groups selected by NCSC-VS contained almost only specific peaks. On the other hand, in NCVW, the weights of almost all wavelength regions that contain peaks were large, while some peaks had small weights. This is consistent with the physicochemical knowledge that information about compounds is contained in specific peaks. Some peaks might have important information about the API content, and other peaks might not contribute to API content estimation. Therefore, the weights by NCVW suggest that unnecessary peaks for API content estimation exist in NIR spectra. This indicates that NCVW can create meaningful weights for soft-sensor design.

# 5. CONCLUSION

In the present work, an input variable weighting method was proposed for efficient and highly-accurate soft-sensor design. The proposed NCVW derives the variable weights on the basis of the correlation between the input variables and output variable by utilizing the NC method and builds a PLS model from the weighted input variables. Since NCVW has just one tuning parameter, its soft-sensor design is efficient. The performance of NCVW was evaluated through the case study of calibration model development of the pharmaceutical process. The result showed that the estimation performance of NCVW was comparable to that of NCSC-VS and NCSC-GL, while the labor required for parameter tuning was greatly conserved. Although the objective data used in the case study was NIR spectra data, the application area of the proposed method is not limited to a specific type of data. The proposed NCVW is applicable to general soft-sensor design when the number of input variables is large. Therefore, NCVW will contribute to realizing the efficient soft-sensor design.

# AUTHOR CONTRIBUTIONS

KF developed the proposed method, analyzed the data, and wrote the initial draft of the manuscript. MK contributed to data collection and analysis and assisted in the preparation of the manuscript. Both authors approved the final version of the manuscript, and agree to be accountable for all aspects of the work.

# FUNDING

This work was partially supported by the JFE 21st Century Foundation.

# ACKNOWLEDGMENTS

The authors thank Daiichi-Sankyo Co., Ltd. for providing real operation data used in case studies.

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Fujiwara and Kano. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Compilation of a Near-Infrared Library for Construction of Quantitative Models of Oral Dosage Forms for Amoxicillin and Potassium Clavulanate

Wen-bo Zou, Xiao-meng Chong, Yan Wang and Chang-qin Hu\*

Antibiotic Division, National Institutes for Food and Drug Control, Beijing, China

The accuracy of quantitative models for near-infrared (NIR) spectroscopy is dependent upon calibration samples with concentration variations. Conventional sample-collection methods have shortcomings (especially time-consumption), which creates a "bottleneck" in the application of NIR models for Process Analytical Technology (PAT) control. We undertook a study to solve the problem of sample collection for construction of NIR quantitative models. Amoxicillin and potassium clavulanate oral dosage forms (ODFs) were used as examples. The aim of this study was to find an approach to construct NIR quantitative models rapidly using a NIR spectral library based on the idea of a universal model. The NIR spectral library of amoxicillin and potassium clavulanate ODFs was defined and comprised the spectra of 377 batches of samples produced by 26 domestic pharmaceutical companies, including tablets, dispersible tablets, chewable tablets, oral suspensions, and granules. The correlation coefficient (r<sub>T</sub>) was used to indicate the similarities of the spectra. The calibration sets of samples were selected from a spectral library according to the median r<sub>T</sub> of the samples to be analyzed. The r<sub>T</sub> of the samples selected was close to the median r<sub>T</sub>. The difference in r<sub>T</sub> of these samples was 1.0–1.5%. We concluded that sample selection was not a problem when constructing NIR quantitative models using a spectral library compared with conventional methods of determining universal models. Sample spectra with a suitable concentration range in NIR models were collected rapidly. In addition, the models constructed through this method were targeted readily.

Keywords: near-infrared spectroscopy, universal model, sample selection, spectral library, quantitative analysis

### Edited by:

Federico Marini, Sapienza Università di Roma, Italy

### Reviewed by:

Thiagarajan Soundappan, Navajo Technical University, United States

Huawen Wu, BaySpec, Inc., United States

> \*Correspondence: Chang-qin Hu hucq@nifdc.org.cn

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 02 November 2017 Accepted: 07 May 2018 Published: 24 May 2018

### Citation:

Zou W, Chong X, Wang Y and Hu C (2018) Compilation of a Near-Infrared Library for Construction of Quantitative Models of Oral Dosage Forms for Amoxicillin and Potassium Clavulanate. Front. Chem. 6:184. doi: 10.3389/fchem.2018.00184

# INTRODUCTION

Near infrared spectroscopy (NIRS) is a rapid, low-cost, and non-destructive technology that has been used widely in quality control and for the rapid detection of pharmaceuticals (Jamrógiewicz, 2012; Chong et al., 2016; Dong et al., 2016). It has also been used to monitor pharmaceutical manufacturing online (Möltgen et al., 2012; Sarraguça et al., 2014; Wahl et al., 2014). In 2003, the US Food and Drug Administration (FDA) announced Pharmaceutical Current Good Manufacturing Practices (cGMPs) for the twenty-first century to obtain better knowledge of production processes. The document offers guidelines to secure pharmaceutical quality via process control of raw materials as well as intermediate and final products during manufacturing (Velagaleti et al., 2002; United States Food and Drug Administration, 2004). Process Analytical Technology (PAT) is the key point of process control during pharmaceutical production (United States Pharmacopeial Convention, 2015). NIRS is the most frequently used method of PAT because it is efficient, pollution-free and has no need for sample pretreatment (Hertrampf et al., 2015).

The accuracy of quantitative analysis depends on NIR models. Sample selection is a challenge when constructing NIR quantitative models. A sufficient number of samples is needed to cover the concentration range required for the calibration set. However, collecting enough calibration samples with concentration variability in the PAT process is difficult.

Five methods have been proposed to collect calibration samples. The first method uses normal production samples together with development samples, which are normally out of specification and can extend the concentration range (Gottfries et al., 1996; Merckle and Kovar, 1998; Corti et al., 1999). The second method uses standard additions of active pharmaceutical ingredients (APIs) or excipients to increase or decrease the sample concentration (Dreassi et al., 1996; Blanco et al., 1997, 2001). The third method uses laboratory-made samples prepared by changing the concentration of the components in the matrix (Moffat et al., 2000; Blanco et al., 2001). The fourth method combines laboratory-made samples with production samples that comprise granules, tablet cores, and coated tablets (these are all sources of variation in the model) (Blanco et al., 1998). The fifth method uses mixtures of API and excipients in different proportions to prepare laboratory-scale samples (Mafalda and Lopes, 2009).

These methods can broaden the range of the calibration concentration. However, the sample-preparation procedure is time-consuming. Also, the samples prepared in the laboratory are not "real" commercial products because they cannot encompass all the chemical and physical properties of commercial products (e.g., excipients, particle size, polymorphs). Besides, constructing models using underdosed and overdosed samples may carry problems in terms of the correlation between the concentrations of API and other excipients (Mafalda and Lopes, 2009). When constructing models of compound preparations, the underdosing/overdosing procedure should be done by means of a "sample concentration matrix." This involves calculation of the cross-correlation between the constituents as their individual concentrations are increased or decreased, thereby avoiding spurious correlations among constituents (Blanco and Alcala, 2006). Therefore, sample selection remains a "bottleneck" in the application of NIR models for PAT control.

We have been studying NIR universal models (Feng et al., 2010). Such a universal model could be used to rapidly analyze pharmaceuticals from different manufacturers under the same international non-proprietary name (INN). A homologous sample based on the application of universal samples has been proposed (Zou et al., 2013). A set of samples are considered "homologous" if they contain the same API, similar excipients, and similar production processes. The NIR spectra of the samples in one homologous sample set are, therefore, highly similar. Calibration sets in the universal model comprise several homologous samples. Samples can be accurately analyzed via universal models if they fall into homologous samples from the calibration set. Errors may occur, and the original model should be updated if the universal model analyzes a new sample that cannot be covered by the existing homologous sample sets. Universal models do not need sample preparation. All of the calibration and validation samples can be obtained in the market. The method of sample selection ensures an appropriate range of calibration concentration, which is important to develop a robust calibration.

Amoxicillin and potassium clavulanate form a compound preparation of a β-lactam and a β-lactamase inhibitor, respectively. They are used for the treatment of bacterial infections of the respiratory and urinary tracts. The oral dosage forms (ODFs) for amoxicillin and potassium clavulanate combined in different ratios are tablets (7:1, 4:1, 2:1), dispersible tablets (14:1, 7:1, 4:1), chewable tablets (8:1, 2:1), granules (7:1, 4:1), and oral suspensions (7:1, 4:1, 2:1). Universal models of tablets of amoxicillin and potassium clavulanate have been constructed to measure the content of amoxicillin, potassium clavulanate, water, and the major impurity, the cycle-closed dimer (Chong et al., 2016). Some NIR methods have been proposed for determination of amoxicillin in suspensions and capsules, in which calibration samples are formulated to resemble commercial products (Silva et al., 2012; Khan et al., 2016).

Herein, we used the concept of a universal model to build a NIR spectral library of ODFs for amoxicillin and potassium clavulanate by collecting various products with different strengths from different manufacturers. Calibration samples could be chosen from the NIR spectral library when establishing NIR universal models to determine the contents of amoxicillin, potassium clavulanate, and/or water in PAT control. Samples were considered to be homologous if they were similar to calibration samples. The feasibility of constructing NIR models using a NIR spectral library was discussed. Thus, the problem of collecting calibration samples for PAT control could be resolved.

# MATERIALS AND METHODS

# Samples and Reagents

Three hundred and seventy-seven batches of amoxicillin and potassium clavulanate ODFs produced by 26 manufacturers were collected in post-marketing surveillance in 2012 and 2014. There were 74 batches of tablets, 78 batches of dispersible tablets, 10 batches of chewable tablets, 96 batches of granules, and 120 batches of oral suspensions; 211 samples of amoxicillin capsule were from 100 batches provided by ZhuHai United Laboratories. The amoxicillin capsules included mixed intermediate granules of amoxicillin capsules as well as filled capsules and/or packaged capsules of the same batch. Reference standards of amoxicillin trihydrate (lot number: 130409-201011; content: 85.8%) and potassium clavulanate (lot number: 130429-201307; content: 95.0%) were provided by the National Institutes for Food and Drug Control (Beijing, China).

Methanol was purchased from Fisher Scientific (Pittsburgh, PA, USA). Phosphoric acid was obtained from Beijing Chemical Works (Beijing, China). Sodium dihydrogen phosphate dihydrate was purchased from Sinopharm Chemical Reagents (Beijing, China).

# Reference Method

The reference contents of amoxicillin and potassium clavulanate were determined by high-performance liquid chromatography (HPLC) (Chong et al., 2016) using an Ultimate 3000 HPLC system (Dionex, Sunnyvale, CA, USA) and a ZORBAX SB-C18 column (5 µm, 150 × 4.6 mm; Agilent Technologies, Santa Clara, CA, USA). The chromatographic conditions were: column temperature, 30 °C; detection wavelength, 220 nm; flow rate, 1 mL min<sup>−1</sup>; injection volume, 20 µL; mobile phase, 5:95 (v/v) methanol/phosphate buffer (0.05 mol L<sup>−1</sup> sodium dihydrogen phosphate, pH adjusted to 4.4 with 10% phosphoric acid).

For each tablet or granule/oral suspension sample of amoxicillin and potassium clavulanate, 10 tablets or 10 bags of granules/oral suspensions were pulverized in a mortar, weighed accurately, and dissolved in the mobile phase to obtain 0.5 mg mL<sup>−1</sup> of amoxicillin or potassium clavulanate for HPLC analysis. Two replicate runs were done for each sample to get the average reference value. The water content was determined via the Karl Fischer method according to the Chinese Pharmacopoeia.<sup>1</sup>

# Acquisition and Pre-processing of NIR Spectra

Acquisition of NIR spectra was done on a MATRIX-F FT-NIR spectrometer (Bruker Optics, Billerica, MA, USA) equipped with a 1.5-mm fiberoptic diffuse reflectance probe and an extended TE-cooled indium gallium arsenide (InGaAs) detector. Data were collected and processed using OPUS v6.5 software (Bruker Optics).

The fiberoptic probe was used to record diffuse reflectance spectra at 8 cm<sup>−1</sup> resolution in the spectral range 4,000–12,000 cm<sup>−1</sup>. During each measurement, 32 co-added scans were undertaken. The measurement was carried out by putting the fiberoptic diffuse reflectance probe close to the sample. For each tablet, dispersible tablet, and chewable tablet of amoxicillin and potassium clavulanate, three tablets were selected randomly and measured. The weight of each tablet was 0.5–1.0 g. Three sample bags, weighing 3.0–6.0 g, were selected randomly and measured for a granule and oral suspension of amoxicillin and potassium clavulanate. For each mixed intermediate granule of an amoxicillin capsule, 5 g of powder was placed in a vial and measured in triplicate. For each filled capsule and packaged capsule of amoxicillin, 5 g of powder of the capsule was placed in a vial and measured thrice. The three original spectra were averaged by OPUS v6.5 software. The average spectra were then subjected to a Savitzky–Golay first derivative treatment with 17-point smoothing, followed by vector normalization transformation. The pre-processed spectra were used for construction and validation of the model.
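The pre-processing chain described above (averaging of three replicate spectra, 17-point Savitzky–Golay first derivative, vector normalization) can be sketched as follows. This is only an illustration and not the OPUS implementation: the function names are hypothetical, a quadratic local fit and unit point spacing are assumed because the text gives only the window size, edge points without a full window are simply dropped, and vector normalization is assumed to mean mean-centering followed by scaling to unit Euclidean length.

```python
import math


def savgol_first_derivative(spectrum, window=17):
    """First derivative from a local least-squares fit over `window` points.

    For a centered window with half-width h, the slope of a linear or
    quadratic fit at the window center is sum(k * y[i + k]) / sum(k^2)
    for k = -h..h. Points without a full window are dropped (assumption).
    """
    h = window // 2
    denom = sum(k * k for k in range(-h, h + 1))
    return [
        sum(k * spectrum[i + k] for k in range(-h, h + 1)) / denom
        for i in range(h, len(spectrum) - h)
    ]


def vector_normalize(spectrum):
    """Mean-center the spectrum, then scale it to unit Euclidean length (assumed definition)."""
    mean = sum(spectrum) / len(spectrum)
    centered = [v - mean for v in spectrum]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def preprocess(replicates):
    """Average replicate spectra, differentiate, and normalize."""
    averaged = [sum(values) / len(values) for values in zip(*replicates)]
    return vector_normalize(savgol_first_derivative(averaged))
```

A call such as `preprocess([scan_a, scan_b, scan_c])` would return one pre-processed spectrum per sample, shortened by eight points at each end because of the dropped edges.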

# Compilation of a NIR Spectral Library

The NIR spectral library comprised the spectra of 377 batches of amoxicillin and potassium clavulanate ODFs produced by 26 manufacturers (74 batches of tablets, 78 batches of dispersible tablets, 10 batches of chewable tablets, 96 batches of granules, and 120 batches of oral suspensions). For the NIR spectra of the library, the content of amoxicillin was 4.77–57.86%, the content of potassium clavulanate was 1.03–20.17%, and the water content was 0.24–9.30%. The correlation coefficient r<sub>T</sub> between the spectrum of each amoxicillin and potassium clavulanate ODF in the library and the average spectrum of tablets of amoxicillin and potassium clavulanate was calculated from 4,200 to 10,000 cm<sup>−1</sup>. The r<sub>T</sub> ranged from 34.42 to 99.69% with an average of 71.78%. The r<sub>T</sub> (Equation 1) of the two spectra y<sub>1</sub>(k) and y<sub>2</sub>(k) was calculated as the ratio of their covariance to the product of the two standard deviations σ<sub>y1</sub> and σ<sub>y2</sub>. The value of r<sub>T</sub> ranges from −1 (inverted spectra) to +1 (identical spectra) and is expressed as a percentage.

$$r_{\rm T} = \frac{\text{Cov}(y_1(k), y_2(k))}{\sigma_{y_1} \sigma_{y_2}} \tag{1}$$
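As a minimal sketch of Equation 1, the correlation coefficient between a sample spectrum and a reference spectrum can be computed and expressed as a percentage as shown below. The function names `r_t` and `mean_spectrum` are hypothetical, and the spectra are assumed to be equal-length lists of intensities already restricted to the chosen spectral range.

```python
import math


def mean_spectrum(spectra):
    """Point-by-point average of several spectra (e.g., a reference spectrum)."""
    return [sum(values) / len(values) for values in zip(*spectra)]


def r_t(y1, y2):
    """Correlation coefficient of two spectra (Equation 1), in percent.

    The 1/n factors of the covariance and of the two standard deviations
    cancel, so plain sums of centered products are used.
    """
    n = len(y1)
    m1 = sum(y1) / n
    m2 = sum(y2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(y1, y2))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in y1))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in y2))
    return 100.0 * cov / (s1 * s2)
```

Because the coefficient is invariant to offset and positive scaling, two spectra that differ only in baseline level or overall intensity yield a value of 100%, whereas an inverted spectrum yields −100%.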

# Construction of the NIR Quantitative Model

Calibration models were constructed using the PLS1 algorithm (PLS regression for one y-variable) (Brereton, 2000; Burns and Ciurczak, 2008) available in the Quant 2 package of OPUS v6.5 software. The Rank value is the number of main factors in building the PLS model. Validation methods for the calibration model include Test Set Validation (TSV) and Leave-One-Out Cross Validation (LOOCV). In the relevant Figures and Tables, rank is the number of PLS latent variables (LV), which is determined by a one-sided F-test on PRESS (Equation 2). R<sup>2</sup> (Equation 3) is the coefficient of determination and gives the percentage of variance present in the true component values that is reproduced in the prediction. M is the number of samples of the validation set. Y<sub>m</sub> is the mean of the true concentration values. Differ<sub>i</sub> (Equation 4) is the difference between the true value and predicted value. RMSEP (Equation 5) is the root-mean-square error of prediction in TSV. RMSECV (Equation 6) is the root-mean-square error of LOOCV. Principal Component Analysis (PCA) scores indicate the position (coordinates) of the samples. PCA is calculated on the basis of calibration spectra.

$$\text{PRESS} = \sum_{i=1}^{M} (\text{Differ}_i)^2 \tag{2}$$

$$R^2 = \left( 1 - \frac{\sum_{i=1}^{M} (\text{Differ}_i)^2}{\sum_{i=1}^{M} (Y_i - Y_m)^2} \right) \times 100 \tag{3}$$

$$\text{Differ}_i = Y_i^{\text{true}} - Y_i^{\text{pred}} \tag{4}$$

$$\text{RMSEP} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (\text{Differ}_i)^2} \tag{5}$$

$$\text{RMSECV} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (\text{Differ}_i)^2} \tag{6}$$

<sup>1</sup>Chinese Pharmacopeia 2015th Volume IV.103–104.

FIGURE 2 | Representative spectra of a tablet, dispersible tablet, chewable tablet, granule, and oral suspension of amoxicillin and potassium clavulanate, and the spectrum of amoxicillin.

# Conventional Method of Construction of a Universal Quantitative Model

A universal model was constructed based on our reported method (Chong et al., 2016). That is, all sample spectra were grouped into hierarchical clusters based on the Euclidean distance using the Ward algorithm, and 19 groups were set according to the sample-selection strategy (Jia et al., 2011). Three random samples from each cluster were selected. Two of these samples were assigned to the calibration set, and the remaining one to the validation set. Sixty-four spectra were selected to establish the NIR quantitative model to analyze the content of amoxicillin, potassium clavulanate, and water. Test spectra for which the prediction differences were greater than the expected values were transferred to the calibration set to optimize the model.

# Construction of a Universal Quantitative Model Using the NIR Spectral Library

All spectra were sorted by r<sub>T</sub> value. One spectrum was selected at each set difference in r<sub>T</sub> value to construct a NIR quantitative model. Two-thirds of these spectra formed the calibration set, whereas one-third formed the test set. Some spectra from the test set, for which the prediction difference was greater than expected, were transferred to the calibration set to optimize the model. These spectra could be used to analyze the content of amoxicillin, potassium clavulanate, and water.

# Validation of the Accuracy of the NIR Quantitative Model

The accuracy of NIR quantitative models was evaluated by Prediction Difference, which was the difference between the predicted content and reference content of amoxicillin, clavulanate, and water.

Prediction Difference = |Prediction Content − Reference Content|
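The prediction difference defined above, together with PRESS, R<sup>2</sup>, and RMSEP from Equations 2, 3, and 5, can be sketched in a few lines. The function names and the toy reference/predicted contents below are hypothetical and serve only to show the arithmetic.

```python
import math


def prediction_differences(reference, predicted):
    """Absolute difference between predicted and reference content for each sample."""
    return [abs(p - r) for r, p in zip(reference, predicted)]


def press(reference, predicted):
    """Sum of squared differences between true and predicted values (Equation 2)."""
    return sum((r - p) ** 2 for r, p in zip(reference, predicted))


def r_squared_percent(reference, predicted):
    """Coefficient of determination as a percentage (Equation 3)."""
    mean = sum(reference) / len(reference)
    total = sum((r - mean) ** 2 for r in reference)
    return (1.0 - press(reference, predicted) / total) * 100.0


def rmsep(reference, predicted):
    """Root-mean-square error of prediction over M validation samples (Equation 5)."""
    return math.sqrt(press(reference, predicted) / len(reference))


# Hypothetical contents (%) of four validation samples.
reference = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 31.0, 39.0]
diffs = prediction_differences(reference, predicted)
```

For these toy values every prediction difference is 1.0, PRESS is 4.0, RMSEP is 1.0, and R<sup>2</sup> is 99.2%.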

# Sample Measurements

Here, 377 batches of amoxicillin and potassium clavulanate ODFs from post-marketing surveillance were measured in two time periods: 137 samples were measured for about 3 months in 2012, and the others were measured for about 6 months in 2014.

Spectra of 211 amoxicillin capsule samples were acquired for PAT control at ZhuHai United Laboratories (Guangdong Sheng, China) for about 7 months in 2016.

# EXPERIMENTAL DESIGN

Four steps were designed in the experiment (**Figure 1**). First, a universal model (model 1) of amoxicillin for all amoxicillin and potassium clavulanate ODFs was constructed using a conventional calibration method for sample selection. Then, the NIR spectral library was used for modeling (model 2). Model 1 was used as a reference model to compare with model 2. If the results analyzed by model 2 were close to those analyzed by model 1, the spectral library was considered effective. Simultaneously, the appropriate difference between r<sub>T</sub> values of adjacent spectra in the calibration set was tested (models 2, 3, and 4). In the second step, models for a dispersible tablet (models 5 and 6) and models for a granule (models 7 and 8) were constructed using a general method for constructing a NIR model using the spectral library. In the third step, the general method was validated by constructing models for analyzing potassium clavulanate (clavulanate models 1, 2, 3, and 4) and water content (water models 1, 2, 3, and 4). Finally, the spectral-library method was applied to a real PAT control. Models for analyzing amoxicillin and water content in mixed intermediate granules of amoxicillin capsules were constructed.

# RESULTS AND DISCUSSION

Representative spectra of a tablet, dispersible tablet, chewable tablet, granule, and oral suspension of amoxicillin and potassium clavulanate are shown in **Figure 2**. The spectra of a tablet, dispersible tablet, and chewable tablet are similar. Due to their formulation, low strength, and production process, the spectra of granule and oral suspension are quite different from those of a tablet, dispersible tablet, and chewable tablet. The spectrum of the amoxicillin API (**Figure 2**) was similar to the spectra of amoxicillin and potassium clavulanate ODFs in some spectral regions, such as the bands between 8,300 and 9,500 cm<sup>−1</sup> (overtone of C-H stretching vibrations), and between 5,300 and 6,500 cm<sup>−1</sup> and 4,200 and 4,800 cm<sup>−1</sup> (overtone of C=O bonds). The calibration models analyzing amoxicillin could be set up on the basis of these spectral ranges.

# Universal Quantitative Model for Amoxicillin ODFs

The universal quantitative model for amoxicillin set up using the conventional method was called "amoxicillin model 1" (model 1). The spectral range employed for model 1 is shown in **Figure 3**. **Figure 4** shows the result of test-set validation of model 1.

After optimization, there were 44 sample spectra in the calibration set (training set) and 20 in the validation set (test set). R<sup>2</sup> was found to be 98.83% with an RMSEP of 1.23% (**Table 1**). The average difference between the predicted content and reference content of amoxicillin was only 1.3%. Hence, this NIR method could replace the HPLC method.

The universal quantitative model for amoxicillin constructed using the NIR spectral library was called "amoxicillin model 2" (model 2). Model 2 and the subsequent models were optimized and validated by the same method as that used for model 1. The difference in r<sub>T</sub> values between adjacent spectra in model 2 was about 1.0%. The results of the two models were close (**Table 1**), so they had the same analysis capacity for amoxicillin.

The average difference between r<sub>T</sub> values of adjacent spectra in the calibration set of model 1 was about 1.5%. The influence of the difference between r<sub>T</sub> values of adjacent spectra was also investigated. "Amoxicillin model 3" (model 3) and "amoxicillin model 4" (model 4) were constructed with differences of 2.0 and 0.8%, respectively. The prediction differences of 377 samples analyzed by models 1, 2, 3, and 4 were compared (**Table 1**). We found that the prediction differences of models 1, 2, and 4 were close and smaller than those of model 3, especially for samples whose deviation was >5%. When the differences between r<sub>T</sub> values of adjacent spectra were large, the number of calibration samples decreased and the calibration set became less representative. A difference of 1.5% was suitable for modeling.

# Construction of NIR Quantitative Models for Specific ODFs of Amoxicillin Using a Spectral Library

We used dispersible tablets of amoxicillin and potassium clavulanate as an example to establish a universal quantitative model for one dosage form (**Table 2**). Calibration sample spectra were selected according to the r<sub>T</sub> value from a spectral library comprising 377 spectra of amoxicillin and potassium clavulanate ODFs. At first, only the calibration spectra of dispersible tablets were selected from the spectral library. The 78 spectra of dispersible tablets were sorted by r<sub>T</sub> value, and 30 spectra were chosen with a difference between r<sub>T</sub> values of adjacent spectra of 1.0–1.5% to construct "amoxicillin model 5" (model 5).

"Amoxicillin model 6" (model 6) was established in a similar way, except that the spectra of all dosage forms were eligible for calibration. The average r<sub>T</sub> value of 78 dispersible tablets was nearly the median value of r<sub>T</sub> of the 30 calibration spectra, among which there were 12 spectra of dispersible tablets. Comparing the PCA-score distribution space of models 5 and 6, the calibration samples of model 5 covered almost all of the distribution space of dispersible tablets, whereas the calibration samples of model 6 covered more space than those of model 5 (**Figure 5**). The prediction results of 78 batches of dispersible tablets by models 1, 2, 5, and 6 are shown in **Table 3**. The prediction differences seen in models 5 and 6 were lower than in the other two models. This clearly indicated that it was feasible to construct NIR quantitative models of dispersible tablets of amoxicillin and potassium clavulanate using a spectral library.

The formulation and production process of tablets/dispersible tablets and granules/oral suspensions are quite different. As a result, the spectra of those dosage forms differed greatly (**Figure 2**). Therefore, granules were taken as a second example to validate the feasibility of constructing NIR quantitative models using a spectral library.

Models 7 and 8 were established similar to models 5 and 6. Thirty spectra were selected with a difference between r<sub>T</sub> values of adjacent spectra of 1.0–1.5%. Calibration spectra for model 7 were chosen from 96 spectra of granules. Samples for model 8 were from all dosage forms in the spectral library. Because the r<sub>T</sub> value of the oral suspension was close to that of a granule, 13 oral suspension spectra were included in model 8. A tablet spectrum was not included in model 8. The prediction values of all the granules in the spectral library by models 1, 2, 7, and 8 are shown in **Table 3**. The results of models 7 and 8 were better.

spectra, and white triangles represent other spectra of dispersible tablets in the spectral library. (B) PCA scores of model 6; blocks represent training-set spectra, dark-blue blocks represent tablets, pink blocks represent dispersible tablets, yellow blocks represent chewable tablets, light-blue blocks represent oral suspensions, orange blocks represent granules, and white triangles represent other spectra of dispersible tablets in the spectral library.

TABLE 2 | Parameters of models 5, 6, 7, and 8.

Dispersible tablets of amoxicillin and potassium clavulanate could be analyzed equally well by models 5 and 6. Similar results could be obtained for granules by models 7 and 8. The r<sub>T</sub> value was critical for sample selection, but it was not necessary to choose the same dosage form as the samples to be measured.

A general method for constructing NIR quantitative models using a spectral library was summarized based on the experiments above (**Figure 6**). Firstly, the appropriate spectra of the samples to be measured were acquired, and the r<sub>T</sub> value calculated according to the definition of r<sub>T</sub> in the spectral library. Secondly, the calibration samples were selected based on the median r<sub>T</sub> value of samples to be measured. The difference between r<sub>T</sub> values of adjacent spectra in the calibration set was about 1.0–1.5%. The number of calibration samples should be ≥ 30, and their r<sub>T</sub> value should cover the range of samples to be measured. Finally, the model accuracy is validated by the samples to be measured. Appropriate sample spectra could be added to the calibration set to optimize the model if necessary.
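The selection step of the general method can be illustrated with a simple greedy procedure. The text does not specify the exact selection algorithm, so the sketch below is only one possible reading: the function name `select_calibration`, its parameters, and the rule of growing the set alternately upward and downward from the library spectrum closest to the median r<sub>T</sub> are all assumptions. Only a minimum spacing is enforced; an upper bound such as 1.5% would have to be checked separately.

```python
def select_calibration(library_rt, target_median, n_select=30, min_gap=1.0):
    """Greedily pick library indices whose r_T values (in %) are spaced at least
    `min_gap` apart, starting at the spectrum closest to `target_median` and
    growing the selection alternately upward and downward."""
    order = sorted(range(len(library_rt)), key=lambda i: library_rt[i])
    start = min(range(len(order)),
                key=lambda p: abs(library_rt[order[p]] - target_median))
    chosen = [order[start]]
    low = high = start  # positions in `order` of the lowest and highest picks
    can_go_up = can_go_down = True
    while len(chosen) < n_select and (can_go_up or can_go_down):
        if can_go_up:
            nxt = next((p for p in range(high + 1, len(order))
                        if library_rt[order[p]] - library_rt[order[high]] >= min_gap),
                       None)
            if nxt is None:
                can_go_up = False
            else:
                high = nxt
                chosen.append(order[high])
        if len(chosen) < n_select and can_go_down:
            nxt = next((p for p in range(low - 1, -1, -1)
                        if library_rt[order[low]] - library_rt[order[p]] >= min_gap),
                       None)
            if nxt is None:
                can_go_down = False
            else:
                low = nxt
                chosen.append(order[low])
    return sorted(chosen, key=lambda i: library_rt[i])


# Hypothetical library with r_T values from 60.0% to 99.5% in 0.5% steps.
library_rt = [60.0 + 0.5 * i for i in range(80)]
picked = select_calibration(library_rt, target_median=80.0, n_select=5)
picked_rt = [library_rt[i] for i in picked]
```

After such a selection, one would still verify that the chosen r<sub>T</sub> values cover the range of the samples to be measured and that at least 30 spectra are available, as required by the method.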

# Validation of the General Method for Constructing a NIR Quantitative Model Using a Spectral Library

### Constructing NIR Quantitative Models for Potassium Clavulanate

A universal quantitative model for potassium clavulanate ("clavulanate model 1") was set up as shown in section Conventional Method of Construction of a Universal Quantitative Model. "Clavulanate model 2" was constructed by the general method as shown in **Figure 6**. The parameters and prediction difference for 377 samples in the spectral library of the two models (**Table 4**) indicated that the two methods of constructing models could lead to ideal results.

TABLE 3 | Predictions of dispersible tablets and granules of amoxicillin and potassium clavulanate by amoxicillin models.

Similar to section Universal Quantitative Model for Amoxicillin ODFs, clavulanate model 3 (for dispersible tablets) and clavulanate model 4 (for granules) were constructed by the general method mentioned above. The prediction differences of 78 batches of dispersible tablets and 96 batches of granules by clavulanate models 3 and 4 were both <1.0% (**Table 4**). These two models were accurate and reliable. The feasibility of establishing a NIR quantitative model using the spectral library was also demonstrated.

### Constructing NIR Quantitative Models for Water Content

Universal quantitative models for water content (water models 1, 2, 3, and 4) were established by the conventional method shown in section Conventional Method of Construction of a Universal Quantitative Model and by the general spectral-library method, as in section Constructing NIR Quantitative Models for Potassium Clavulanate (**Table 5**). Water models 1 and 2 could be used to analyze all the dosage forms of amoxicillin and potassium clavulanate in the spectral library. Water models 3 and 4 could be used to analyze dispersible tablets and granules, respectively. **Table 5** shows that the prediction differences of the four models were <1.0%. These data further confirmed the validity of the general modeling method using a spectral library.

# Application of the Method for Constructing a NIR Quantitative Model Using a Spectral Library

The production process of amoxicillin capsules can be summarized as follows: granules are mixed with excipients after dry granulation of API and sieving; the mixed granules are then placed into capsules. The amoxicillin content of the mixed intermediate granules of amoxicillin capsules ranged from 80.0 to 84.0%, and the water content ranged from 12.1 to 13.0%. Mixed intermediate granules had only slight variability, so their NIR spectra were not suitable for a calibration set. We therefore tried to set up NIR quantitative models for analyzing the content of amoxicillin and water in mixed intermediate granules of amoxicillin capsules using the spectral library of amoxicillin and potassium clavulanate ODFs, because their spectra were similar.

TABLE 5 | Parameters and prediction differences of water models.

The r<sup>T</sup> values of 211 samples were calculated according to the definition of r<sup>T</sup>. The median r<sup>T</sup> value of the sample spectra was 91.33%, and the maximum and minimum r<sup>T</sup> values were 99.29 and 88.98%, respectively. About 40 calibration spectra were selected from the spectral library, with a difference in r<sup>T</sup> between adjacent spectra of 1.0–1.5%. The NIR model for amoxicillin was optimized by adding 16 spectra of mixed intermediate granules to the calibration set, and 13 spectra of mixed intermediate granules were added to the calibration set of the model for water content. NIR quantitative models for the content of amoxicillin and water were then constructed (**Table 6**). The prediction differences of the two models were small, so they could be used to analyze the content of amoxicillin and water of mixed granules rapidly during production.
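The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes that r<sup>T</sup> behaves like a Pearson-type correlation coefficient between a library spectrum and a reference spectrum (the actual definition is given elsewhere in the article), and the helper names `pearson_r` and `select_by_spacing` are hypothetical. Only the values 99.29, 91.33, and 88.98% come from the text; the other values are invented for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length spectra."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_by_spacing(r_values, min_gap):
    """Walk through the spectra in descending order of r and keep a spectrum
    only if its r value lies at least `min_gap` below the last one kept.
    Returns the indices of the kept spectra."""
    order = sorted(range(len(r_values)), key=lambda i: r_values[i], reverse=True)
    picked = [order[0]]
    for i in order[1:]:
        if r_values[picked[-1]] - r_values[i] >= min_gap:
            picked.append(i)
    return picked


# Hypothetical similarity values in percent; 99.29, 91.33 and 88.98 are the
# maximum, median and minimum quoted in the text, the rest are made up.
r_percent = [99.29, 98.9, 98.1, 97.0, 96.8, 95.4, 91.33, 88.98]
calibration_indices = select_by_spacing(r_percent, min_gap=1.0)
```

The sketch enforces only the lower bound of the 1.0–1.5% spacing; an upper bound can be met only if the library actually contains a spectrum inside each window.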


# CONCLUSIONS

A NIR spectral library of amoxicillin and potassium clavulanate ODFs was established using a universal model. The similarity between NIR spectra was represented by the correlation coefficient r<sup>T</sup>. About 30–50 calibration spectra were selected from the spectral library according to the median r<sup>T</sup> value to construct the NIR quantitative model. The difference in r<sup>T</sup> values between adjacent calibration spectra was about 1.0–1.5%. Compared with conventional modeling, this general method using a spectral library could be used to resolve sample-collection problems. The method makes it possible to obtain calibration samples covering an appropriate concentration range within a short time for PAT control. Furthermore, the quantitative models were more specific than models constructed by conventional methods. The proposed method offers a new and effective approach to solving the sample-selection problem in PAT modeling.

# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zou, Chong, Wang and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Low-Cytotoxicity Fluorescent Probes Based on Anthracene Derivatives for Hydrogen Sulfide Detection

Xuefang Shang<sup>1</sup>\*, Jie Li<sup>1</sup>, Yaqian Feng<sup>2</sup>, Hongli Chen<sup>3</sup>, Wei Guo<sup>1</sup>, Jinlian Zhang<sup>2</sup>, Tianyun Wang<sup>4</sup> and Xiufang Xu<sup>5</sup>

<sup>1</sup> Key Laboratory of Medical Molecular Probes, School of Basic Medical Sciences, Xinxiang Medical University, Xinxiang, China, <sup>2</sup> School of Pharmacy, Xinxiang Medical University, Xinxiang, China, <sup>3</sup> School of Life Sciences and Technology, Xinxiang Medical University, Xinxiang, China, <sup>4</sup> Department of Biochemistry, Xinxiang Medical University, Xinxiang, China, <sup>5</sup> Department of Chemistry, Nankai University, Tianjin, China

Owing to the role of H2S in various biochemical processes and diseases, its accurate detection is a major research goal. Three artificial fluorescent probes based on 9-anthracenecarboxaldehyde derivatives were designed and synthesized. Their anion-binding capacity was assessed by UV-Vis titration, fluorescence spectroscopy, HRMS, <sup>1</sup>H NMR titration, and theoretical investigations. Although the anion-binding ability of compound 1 was insignificant, compounds 2 and 3, which contain benzene rings, were highly sensitive fluorescent probes for HS<sup>−</sup> among the various anions studied (HS<sup>−</sup>, F<sup>−</sup>, Cl<sup>−</sup>, Br<sup>−</sup>, I<sup>−</sup>, AcO<sup>−</sup>, H2PO4<sup>−</sup>, SO3<sup>2−</sup>, Cys, GSH, and Hcy). This may be explained by the nucleophilic reaction between HS<sup>−</sup> and the electron-poor C=C double bond. Owing to the presence of a nitro group, compound 3, with a nitrobenzene ring, showed stronger anion-binding ability than compound 2. In addition, compound 1 had a proliferative effect on cells, and compounds 2 and 3 showed low cytotoxicity against MCF-7 cells in the concentration range of 0–150 µg·mL<sup>−1</sup>. Thus, compounds 2 and 3 can be used as biosensors for the detection of H2S in vivo and may be valuable for future applications.

Keywords: fluorescent probe, hydrogen sulfide, 9-anthracenecarboxaldehyde, nucleophilic substitution, cytotoxicity

# INTRODUCTION

Hydrogen sulfide (H2S) is a toxic gas with a smell resembling that of rotten eggs. It is a bioactive gaseous signaling molecule, along with nitric oxide (NO) and carbon monoxide (CO) (Kimura et al., 2012; Lisjak et al., 2013; Kimura, 2015; Mishanina et al., 2015). CO and NO are reactive oxygen species, whereas H2S gas is a scavenger of reactive oxygen species. Under certain pressure conditions, H2S can modulate mitochondria in mammalian cells. It also participates in many biochemical processes such as inflammation, blood pressure control, neurotransmission, and ischemia reperfusion (Fu et al., 2012; Andreadou et al., 2015; Li F. et al., 2015; Wallace et al., 2015). H2S is also a relaxing agent that can act on smooth muscle and can serve as a modulator of cardiac function in cardiovascular therapy (Polhemus and Lefer, 2014; Barr et al., 2015; Chai et al., 2015; Holwerda et al., 2015). In addition, abnormal levels of H2S are associated with many diseases, oxygen sensing, and even death (Olson et al., 2006; Pandey et al., 2012). Therefore, the construction of fluorescent probes to detect H2S has important practical applications.

### Edited by:

Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam

### Reviewed by:

Lingxin Chen, Yantai Institute of Coastal Zone Research (CAS), China Aldo Arrais, Università degli Studi del Piemonte Orientale, Italy

### \*Correspondence:

Xuefang Shang xuefangshang@126.com

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 22 January 2018 Accepted: 15 May 2018 Published: 05 June 2018

### Citation:

Shang X, Li J, Feng Y, Chen H, Guo W, Zhang J, Wang T and Xu X (2018) Low-Cytotoxicity Fluorescent Probes Based on Anthracene Derivatives for Hydrogen Sulfide Detection. Front. Chem. 6:202. doi: 10.3389/fchem.2018.00202

Traditional methods for determining the concentration of H2S in biological samples include colorimetric, electrochemical, chromatographic, metal-induced vulcanization, and fluorescence analyses (Tangerman, 2009; Shen et al., 2011). Fluorescent molecular probes are commonly used as detection tools in various fields, including the analysis of biological samples, owing to their ability to convert chemical information into light signals with high sensitivity and selectivity. Hence, the development of fluorescent probes for the detection of H2S has attracted substantial research attention (Jiménez et al., 2003; Choi et al., 2009; Yu et al., 2012, 2014).

However, only a few reports have focused on the development of fluorescent probes based on the binuclear character of H2S (Asthana et al., 2016; Das et al., 2016). Therefore, we used this approach to synthesize highly selective and sensitive fluorescent probes that can detect H2S. Under physiological conditions, hydrogen sulfide exists as approximately 30% undissociated H2S and 70% HS<sup>−</sup>. Thus, HS<sup>−</sup> detection can serve as a proxy for H2S. In this study, we designed and synthesized novel anthracene derivatives in which a -C=C- bond served as an interaction site (**Scheme 1**). The abilities of these compounds to bind to various anions [HS<sup>−</sup>, (n-C4H9)4NF (F<sup>−</sup>), (n-C4H9)4NCl (Cl<sup>−</sup>), (n-C4H9)4NBr (Br<sup>−</sup>), (n-C4H9)4NI (I<sup>−</sup>), (n-C4H9)4NAcO (AcO<sup>−</sup>), (n-C4H9)4NH2PO4 (H2PO4<sup>−</sup>), Na2SO3 (SO3<sup>2−</sup>), cysteine (Cys), glutathione (GSH), and homocysteine (Hcy)] were assessed through UV-Vis titration, fluorescence spectroscopy, HRMS, and <sup>1</sup>H NMR titration for HS<sup>−</sup> sensitivity and selectivity. These compounds were also investigated for cytotoxicity to MCF-7 cells.
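The roughly 30:70 split between H2S and HS<sup>−</sup> quoted above is consistent with a simple Henderson-Hasselbalch estimate. The sketch below is illustrative only: the first dissociation constant (pKa1 ≈ 7.0) and the physiological pH (7.4) are assumptions that are not stated in the text, the second dissociation to S<sup>2−</sup> is neglected, and the function name `hs_fraction` is hypothetical.

```python
def hs_fraction(ph, pka1):
    """Fraction of total sulfide present as HS- for the equilibrium
    H2S <-> H+ + HS-, neglecting the second dissociation to S2-."""
    return 1.0 / (1.0 + 10.0 ** (pka1 - ph))


# Assumed values (not from the source): pKa1 of about 7.0, pH 7.4.
fraction_hs = hs_fraction(ph=7.4, pka1=7.0)
fraction_h2s = 1.0 - fraction_hs
print(f"HS-: {100 * fraction_hs:.0f}%, H2S: {100 * fraction_h2s:.0f}%")
```

With these assumed inputs the estimate comes out close to 70% HS<sup>−</sup>; the exact split shifts with temperature and ionic strength through pKa1.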

# MATERIALS AND METHODS

Most of the starting materials were obtained commercially. All reagents and solvents were of analytical grade. Sodium hydrosulfide, the anions in the form of tetrabutylammonium salts [(n-C4H9)4NF, (n-C4H9)4NCl, (n-C4H9)4NBr, (n-C4H9)4NI, (n-C4H9)4NAcO, and (n-C4H9)4NH2PO4], and the amino acids (Cys, GSH, and Hcy) were purchased from Aladdin (Shanghai, People's Republic of China), stored in a vacuum desiccator containing self-indicating silica, and used without further purification. Tetrabutylammonium salts were dried for 24 h under vacuum with P2O5 at 333 K before use. Dimethyl sulfoxide was distilled in vacuo after being dried with CaH2. <sup>1</sup>H NMR spectra were recorded using a Varian Unity Plus 400 MHz spectrometer. ESI-HRMS was performed using a Mariner apparatus. UV-Vis spectroscopy titration was performed using a Shimadzu UV2550 spectrophotometer at 289 K. Fluorometric titration was performed using an Eclipse fluorescence spectrophotometer (Agilent, Santa Clara, CA, USA) at 298 K. IR spectroscopy was performed using an IRTracer-100 instrument. The binding constants (Ks) were obtained by the non-linear least-squares method for data fitting.
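To make the non-linear least-squares step concrete, the following sketch fits a binding constant to titration data. It is a simplified illustration, not the fitting routine used in the study: the source does not give its fitting equation and reports 1:2 host-guest complexes, whereas this example assumes the simpler 1:1 isotherm ΔA = ΔA_max·K[G]/(1 + K[G]) with the guest in excess. The function names and the synthetic data are hypothetical. For each trial K the amplitude ΔA_max is solved in closed form, and K itself is located by a golden-section search on log10(K).

```python
import math


def isotherm(k, g):
    """Fraction of host bound for a 1:1 complex with guest in excess."""
    return k * g / (1.0 + k * g)


def best_amplitude(k, guest, signal):
    """Closed-form least-squares amplitude (delta-A max) for a fixed K."""
    f = [isotherm(k, g) for g in guest]
    return sum(a * b for a, b in zip(f, signal)) / sum(a * a for a in f)


def sse(k, guest, signal):
    """Sum of squared residuals at K, using the best amplitude for that K."""
    amp = best_amplitude(k, guest, signal)
    return sum((amp * isotherm(k, g) - y) ** 2 for g, y in zip(guest, signal))


def fit_binding_constant(guest, signal, log_lo=0.0, log_hi=8.0, iters=100):
    """Golden-section search on log10(K) for the minimum residual sum."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = log_lo, log_hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    for _ in range(iters):
        if sse(10.0 ** c, guest, signal) < sse(10.0 ** d, guest, signal):
            b = d
        else:
            a = c
        c, d = b - phi * (b - a), a + phi * (b - a)
    k = 10.0 ** ((a + b) / 2.0)
    return k, best_amplitude(k, guest, signal)


# Synthetic, noise-free titration generated with K = 1e4 L/mol, amplitude 0.5.
guest_conc = [1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3]  # mol/L
delta_a = [0.5 * isotherm(1.0e4, g) for g in guest_conc]
k_fit, amp_fit = fit_binding_constant(guest_conc, delta_a)
```

A 1:2 model would replace `isotherm` with the corresponding two-step expression; the search strategy would stay the same but would then run over two constants.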

Cells in logarithmic growth phase were seeded in 96-well plates at a density of 2.0 × 10<sup>4</sup> cells per well and cultured for 24 h. The culture medium was then replaced with 200 µL of Roswell Park Memorial Institute (RPMI) 1640 medium containing various concentrations of the compound, and the cells were further incubated for 24 h. Next, the cells were washed with phosphate buffered saline (PBS) three times, and 100 µL of culture medium and 20 µL of MTT solution were added to each well. After further incubation (4 h), the absorbance of each well was detected at 490 nm using a microplate reader (Thermo Multiskan MK3, Thermo Fisher Scientific, MA, USA). Plain cell culture medium was used as the control.
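The absorbance readings from such an assay are usually converted to percent viability relative to the control wells. The source does not state the formula it used, so the sketch below assumes the common convention viability (%) = 100 × (A_treated − A_blank)/(A_control − A_blank); the function name and the absorbance values are hypothetical.

```python
from statistics import mean


def cell_viability(treated_wells, control_wells, blank=0.0):
    """Percent viability from replicate absorbance readings at 490 nm.

    treated_wells / control_wells: lists of absorbances from replicate wells.
    blank: absorbance of medium without cells (0.0 if already subtracted).
    """
    treated = mean(treated_wells) - blank
    control = mean(control_wells) - blank
    return 100.0 * treated / control


# Hypothetical replicate readings, chosen only to illustrate the arithmetic.
viability = cell_viability([0.78, 0.80, 0.82], [0.98, 1.00, 1.02])
```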

Compound **1** was synthesized according to previous methods (Ding et al., 2013). 9-Anthracenecarboxaldehyde (82.4 mg, 0.4 mmol) and acetone (35 mg, 0.6 mmol) were dissolved in ethanol (50 mL). Then, under stirring, an aqueous sodium hydroxide solution (2 mL, 0.04 mol·L<sup>−1</sup>) was slowly added to the reaction flask. The mixture was stirred at room temperature for 6 h, and the reaction was monitored by thin-layer chromatography. Once the reaction was complete, the mixture was adjusted to pH 5–6 with dilute hydrochloric acid (0.1 mol·L<sup>−1</sup>). Typically, a precipitate formed and was collected by filtration. The solid was washed with high-purity water and ethanol, and dried under vacuum. Yield: 87%. <sup>1</sup>H NMR (400 MHz, CDCl3, 298 K) δ 8.84 (d, J = 16.2 Hz, 1H), 8.52 (s, 1H), 8.38 (d, J = 8.3 Hz, 2H), 8.07 (d, J = 7.9 Hz, 2H), 7.69–7.47 (m, J = 88 Hz, 4H). <sup>13</sup>C NMR (101 MHz, CDCl3) δ 194.10, 147.53, 141.15, 135.40, 134.28, 129.71, 128.98, 128.60, 126.54, 125.35. IR spectrum, ν (cm<sup>−1</sup>): 1668 (C=O); 1628 (C=C); 1593 (Ar-C=C); 999 (C=C-H). ESI-HRMS (m/z): 457.2 (M + Na)<sup>+</sup>.
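The reagent quantities quoted in this procedure can be cross-checked with a short mole calculation. The sketch below is only an arithmetic check: the molecular formulas (C15H10O for 9-anthracenecarboxaldehyde, C3H6O for acetone) and the rounded atomic masses are supplied here rather than taken from the source, and the helper names are hypothetical.

```python
# Rounded standard atomic masses in g/mol (assumed values for this check).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}


def molar_mass(composition):
    """Molar mass in g/mol from an {element: count} mapping."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


def millimoles(mass_mg, molar_mass_g_per_mol):
    """mg divided by g/mol gives mmol."""
    return mass_mg / molar_mass_g_per_mol


aldehyde_mmol = millimoles(82.4, molar_mass({"C": 15, "H": 10, "O": 1}))
acetone_mmol = millimoles(35.0, molar_mass({"C": 3, "H": 6, "O": 1}))
naoh_mmol = 2.0 * 0.04  # mL x mol/L = mmol of NaOH added

print(f"aldehyde {aldehyde_mmol:.2f} mmol, acetone {acetone_mmol:.2f} mmol, "
      f"NaOH {naoh_mmol:.2f} mmol")
```

Both results round to the 0.4 and 0.6 mmol stated in the procedure, and the base is present in a substoichiometric amount.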

Compounds **2** and **3** were synthesized according to the above procedure.

Compound **2**: <sup>1</sup>H NMR (400 MHz, CDCl3, 298 K) δ 8.83 (d, J = 15.8 Hz, 1H), 8.52 (s, 1H), 8.40–8.27 (m, J = 52 Hz, 2H), 8.18–8.00 (m, J = 72 Hz, 4H), 7.68–7.60 (m, 2H), 7.60–7.48 (m, 6H). <sup>13</sup>C NMR (101 MHz, DMSO) δ 191.24, 140.88, 139.87, 137.75, 131.15, 129.15, 128.52, 127.32, 126.53, 125.50. IR spectrum, ν (cm<sup>−1</sup>): 3050 (Ar C-H); 1730 (C=O); 1560 (C=C); 720 (C=C-H). ESI-HRMS (m/z): 309.1 (M + H)<sup>+</sup>, 331.1 (M + Na)<sup>+</sup>.

Compound **3**: <sup>1</sup>H NMR (400 MHz, CDCl3, 298 K) δ 8.92 (d, J = 15.8 Hz, 1H), 8.92 (d, J = 15.8 Hz, 1H), 8.55 (s, 1H), 8.47 (s, 1H), 8.51–8.36 (m, J = 60.0 Hz, 3H), 8.42–8.22 (m, 6H), 8.28 (dd, J = 23.9 Hz, 8.3 Hz, 4H), 8.16–8.05 (m, J = 44 Hz, 2H), 8.15–8.04 (m, J = 44 Hz, 2H), 7.62–7.52 (m, J = 40 Hz, 4H), 7.65–7.52 (m, J = 52 Hz, 4H), 7.28 (s, 3H). <sup>13</sup>C NMR (101 MHz, DMSO) δ 188.64, 150.36, 142.55, 131.59, 131.32, 129.47, 127.43, 126.15, 125.55, 124.41. IR spectrum, ν (cm<sup>−1</sup>): 1750 (C=O); 1590 (C=C); 1520 (N-O); 880 (C=N). ESI-HRMS (m/z): 376.1 (M + Na)<sup>+</sup>.

# RESULTS AND DISCUSSION

# UV-Vis Spectral Titration

UV-Vis titration was performed in dimethyl sulfoxide by the stepwise addition of sodium hydrosulfide (**Figure 1**). For compound **1**, the presence of HS<sup>−</sup> resulted in an increase in the absorption intensity at 315 nm, but the spectral changes were very small. Furthermore, the addition of F<sup>−</sup>, Cl<sup>−</sup>, Br<sup>−</sup>, I<sup>−</sup>, AcO<sup>−</sup>, H2PO4<sup>−</sup>, SO3<sup>2−</sup>, Cys, GSH, or Hcy resulted in very weak spectral changes for compound **1**, and the binding capacity was negligible.

For compound **2**, the intensity of the absorption peak at 312 nm increased after the addition of sodium hydrosulfide. A hyperchromic effect was observed during the host-guest interaction process. The change in the UV-Vis spectrum was due to the interaction between sodium hydrosulfide and the electron-deficient C=C double bond (Zhao et al., 2012). However, the addition of F<sup>−</sup>, Cl<sup>−</sup>, Br<sup>−</sup>, I<sup>−</sup>, AcO<sup>−</sup>, or H2PO4<sup>−</sup> did not cause a substantial spectral response for compound **2** (Figure S1), suggesting that the host-guest interaction was weak (Shao et al., 2009; Shang et al., 2013, 2015a). For compound **3**, the intensity of the absorption peak at 336 nm increased, and the absorption band was enhanced after HS<sup>−</sup> addition. However, the addition of F<sup>−</sup>, Cl<sup>−</sup>, Br<sup>−</sup>, I<sup>−</sup>, AcO<sup>−</sup>, H2PO4<sup>−</sup>, SO3<sup>2−</sup>, Cys, GSH, or Hcy resulted in a very weak spectral response, indicating that the host-guest interaction was negligible. These results suggested that compounds **2** and **3** both showed high sensitivity and selectivity for HS<sup>−</sup>.

# Fluorescence Response

The photophysical responses of the three probes to various anions were examined. As shown in **Figure 2**, compound **1** showed an emission peak centered at 582 nm. After the addition of HS<sup>−</sup> to a solution of compound **1**, the spectral response of compound **1** was very weak, indicating that the binding ability was negligible.

For compound **2**, emission peaks were centered at 382 and 404 nm. After the addition of HS<sup>−</sup>, the fluorescence emission was significantly quenched. No significant spectral changes were observed after titration with F<sup>−</sup>, Cl<sup>−</sup>, Br<sup>−</sup>, I<sup>−</sup>, H2PO4<sup>−</sup>, AcO<sup>−</sup>, SO3<sup>2−</sup>, Cys, GSH, or Hcy, indicating that compound **2** had an insignificant binding capacity for these anions (Figure S2A).

For compound **3**, there was almost no fluorescence response. After the addition of HS<sup>−</sup>, a new emission peak at approximately 420 nm appeared, which was gradually accompanied by two shoulders centered at 402 and 440 nm. This fluorescence enhancement may result from two possible signal transduction mechanisms: the inhibition of photoinduced electron transfer and the binding of the guest to the host molecule (Watanabe et al., 1998; Lee et al., 2002; Lin et al., 2006). However, no significant spectral changes were observed when compound **3** was titrated with F<sup>−</sup>, Cl<sup>−</sup>, Br<sup>−</sup>, I<sup>−</sup>, H2PO4<sup>−</sup>, AcO<sup>−</sup>, SO3<sup>2−</sup>, Cys, GSH, or Hcy, indicating that compound **3** did not significantly bind to these anions (Figure S2B). The fluorescence calibration curve for compound **3** after the addition of HS<sup>−</sup> indicated that the emission intensity was non-linear when various quantities of HS<sup>−</sup> were added to a solution with a certain concentration of compound **3** (Shang et al., 2012a).

# Binding Constant

mol·L<sup>−1</sup>, λex = 442 nm; (B) compound **2**: 1.46 × 10<sup>−4</sup> mol·L<sup>−1</sup>, HS<sup>−</sup>: 0–50.1 × 10<sup>−4</sup> mol·L<sup>−1</sup>, λex = 324 nm; (C) compound **3**: 1.1 × 10<sup>−4</sup> mol·L<sup>−1</sup>, HS<sup>−</sup>: 0–7.7 × 10<sup>−4</sup> mol·L<sup>−1</sup>, λex = 368 nm.

The spectral responses of compound **1** after the addition of anions were very weak; hence, the binding constant could not be calculated. The UV-Vis spectral changes for compounds **2** and **3** were ascribed to the formation of host-guest (1:2) complexes; when the absorbance intensity was greatest, the ratio of [H]/([H]+[G]) was approximately 0.3, according to a Job plot (Figure S3). The binding constants were calculated by the non-linear least-squares method from the UV-Vis data and are provided in **Table 1** (Bourson et al., 1993; Liu et al., 2001, 2004). The spectra changed little for compound **1**, whereas compounds **2** and **3** showed the strongest binding ability for HS<sup>−</sup> among the various anions tested. The anion binding abilities were in decreasing order: HS<sup>−</sup> >> SO3<sup>2−</sup> ∼ Cys ∼ GSH ∼ Hcy ∼ F<sup>−</sup> ∼ Cl<sup>−</sup> ∼ Br<sup>−</sup> ∼ I<sup>−</sup> ∼ AcO<sup>−</sup> ∼ H2PO4<sup>−</sup>. The standard deviations for the binding constants were R<sup>3</sup> = 0.9941 and R<sup>2</sup> = 0.9945. Among the three compounds, the standard deviation for compound **1** was not statistically significant, and those for compounds **2** and **3** were significant (compound **2**, **S** = 31.6011; compound **3**, **S** = 159.3298) (Figure S6). The anion binding ability could be attributed to the host-guest interactions and the match in spatial structures. This indicates that HS<sup>−</sup> ions bound strongly to these compounds, according to their binding constants (Shang et al., 2012b).
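The link between the Job plot maximum and the 1:2 stoichiometry can be made explicit with a one-line calculation. In the method of continuous variation, a host mole fraction x at the maximum corresponds to (1 − x)/x guests per host. The sketch below is illustrative; the function name is hypothetical and the only input taken from the text is the approximate maximum at 0.3.

```python
def guests_per_host(host_fraction_at_max):
    """Guest:host ratio implied by the host mole fraction at a Job plot maximum."""
    return (1.0 - host_fraction_at_max) / host_fraction_at_max


observed = guests_per_host(0.3)      # maximum reported at about 0.3
ideal_1_to_2 = guests_per_host(1.0 / 3.0)  # ideal 1:2 complex peaks at x = 1/3
print(f"observed ratio: {observed:.2f}, nearest integer: {round(observed)}")
```

A maximum near 0.3 therefore points to about two guests per host, in line with the 1:2 assignment, whereas a 1:1 complex would peak at 0.5.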

<sup>a</sup>Anions were added in the form of sodium sulfide or tetra-n-butylammonium salts. <sup>b</sup>The spectra changed little, and the binding constant could not be determined (ND).

Compound **3** showed a stronger binding ability toward HS<sup>−</sup> ions than compound **2**, owing to the presence of a nitro group. The nitro group served as an electron-withdrawing group that enhanced the binding ability between the C=C double bond in compound **3** and HS<sup>−</sup>. According to the HRMS data, the observed negative ion peak (418.0577) was the MS peak of the **3**-HS<sup>−</sup> complex (theoretical value: 418.0572) (Figure S4). In addition, there was no peak of –CH2- in the <sup>1</sup>H NMR titration results, suggesting that the C=C double bond was broken during the interaction between compound **3** and HS<sup>−</sup> (Figure S5). Therefore, a possible host-guest binding mechanism was as follows. The first step was the Michael addition reaction of the conjugated system (Li J. et al., 2015). The first HS<sup>−</sup> ion was added to the C=C moiety as a nucleophile. Then, the second HS<sup>−</sup> ion attacked the active hydrogen atom (alpha-H) as an electrophile moiety, forming the final structure as shown in **Scheme 2**. The final structure was verified by mass spectrometry. The reaction of compound **3** with HS<sup>−</sup> was conducted in a simulated physiological environment, and the reaction product was subjected to a fluorescence analysis. A large increase in the fluorescence spectrum was observed.
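The agreement between the observed and theoretical m/z values for the **3**-HS<sup>−</sup> complex can be expressed as a mass error in parts per million, the usual figure of merit in HRMS. The short sketch below uses only the two values quoted in the text; the function name is hypothetical.

```python
def mass_error_ppm(observed_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1.0e6


error_ppm = mass_error_ppm(observed_mz=418.0577, theoretical_mz=418.0572)
print(f"mass error: {error_ppm:.1f} ppm")
```

The difference of 0.0005 m/z units corresponds to an error of roughly 1 ppm.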

# Cytotoxicity Assessment

The cytotoxicity of the three compounds against MCF-7 cells was evaluated by MTT assays (Vibet et al., 2008; Jiang et al., 2014; Alemany et al., 2015; Jouvin et al., 2015; Moustakim et al., 2017) (**Figure 3**). Compound **1** had a proliferative effect on the cells, and compounds **2** and **3** in the range of 0–150 µg·mL<sup>−1</sup> showed very low cytotoxicity. Cell viability was minimally affected (80% cell viability) when the concentrations of compounds **2** and **3** were increased to 150 µg·mL<sup>−1</sup>. In agreement with the determined binding constants, compounds **2** and **3** each showed a high binding capacity and low cytotoxicity and thus can be used to detect HS<sup>−</sup> in vivo (Gao et al., 2015; Shang et al., 2017). Compared with previous estimates in the literature (Zou et al., 2013; Lin et al., 2015), the cytotoxicity of the synthesized compounds was relatively low. Hence, these probes are favorable candidates for in vitro hydrogen sulfide detection.

FIGURE 3 | Cell viability values (%), as estimated from MTT proliferation assays, vs. incubation concentrations of fluorescent probe.

# Theoretical Investigation

Among the three synthesized compounds, compound **3** showed the highest sensitivity and selectivity for HS<sup>−</sup> according to the binding constants. Consequently, the geometries of compound **3** and the combination product **3**-HS (**Figure 4**) were optimized using density functional theory at the B3LYP/3-21G level. The calculation was implemented in Gaussian03 (Frisch et al., 2003; Gao et al., 2017). As shown in **Figure 4**, the distance of the intramolecular hydrogen bond in compound **3** was 2.390 Å between the hydrogen atom of the interaction site (-HC=CH-) and the oxygen atom of the carbonyl group. According to previous studies (Ni et al., 2012; Maity et al., 2014), the existence of intramolecular hydrogen bonding and an electron-withdrawing group (-NO2) increases the sensitivity. Hence, the stronger the electron-withdrawing effect, the higher the sensitivity of the compound for HS<sup>−</sup>. The combination between compound **3** and HS<sup>−</sup> was also optimized. Our results indicated that the spatial structure of the host may change as a result of the host-guest interaction. Therefore, the combination product (**3**-HS) existed in resonance form. The distance of the hydrogen bond (2.006 Å) indicated that a stable six-membered ring was formed containing a sulfur atom and a hydrogen atom in a hydroxyl group (the resonance form of the ketone) after compound **3** interacted with HS<sup>−</sup>. These results also explained the strong ability of compound **3** to bind to HS<sup>−</sup>.

In addition, the molecular frontier orbitals were examined to explore the hyperchromic effect (observed by UV-Vis titration as described above). This effect arose in the host-guest interaction process through the electron transition of the frontier orbital. The selected frontier orbitals for compound **3** and the host-guest complex are shown in **Figure 5**. An orbital analysis revealed that the highest occupied molecular orbital (HOMO) density in compound **3** was mainly localized on the anthracene moiety, whereas the lowest unoccupied molecular orbital (LUMO) density was localized on the nitrophenyl and ketone group moieties (Shang et al., 2015b). These results indicated that the electron transition from the HOMO resulted in a hyperchromic effect in the UV-Vis spectra.

# CONCLUSIONS

In conclusion, three compounds were synthesized, and their abilities to bind to various anions were evaluated by UV-Vis titration, fluorescence spectroscopy, HRMS, <sup>1</sup>H NMR titration, and theoretical investigations. Compounds **2** and **3** showed selectivity and sensitivity for HS<sup>−</sup>. Notably, compound **3** showed the strongest sensing ability for HS<sup>−</sup> among the synthesized compounds. The mechanism underlying this interaction was the nucleophilic reaction between HS<sup>−</sup> and the electron-poor C=C double bond. Theoretical investigations also elucidated the role of molecular frontier orbitals in the hyperchromic effect. In addition, compounds **2** and **3** showed low cytotoxicity against MCF-7 cells in the concentration range of 0–150 µg·mL<sup>−1</sup> and can subsequently be used as fluorescent probes to detect H2S, HS<sup>−</sup>, or S<sup>2−</sup> species in vivo. These results provide probes with a novel sensing mechanism for hydrogen sulfide, based on the amphipolar character of the S atom, that can be used in practical applications to detect H2S. Our findings establish a basis for further applications of molecular probes.

# AUTHOR CONTRIBUTIONS

XS and TW were responsible for the experimental design. JL and YF were responsible for the synthesis and properties of detection. WG and JZ were responsible for the characterization of compounds. HC was responsible for the detection of cytotoxicity. XX was responsible for the quantitative calculation of the data.

# ACKNOWLEDGMENTS

This work was supported by funding from the Program for Science & Technology Innovation Talents in Universities of Henan Province (15HASTIT039), the Fluorescence Probe and Biomedical Detection Research Team of Xinxiang City (CXTD16001), the Xinxiang Medical University Graduate Scientific Research Innovation Support Project (YJSCX201638Y), and the Scientific and Technological Research Projects of Henan Province, China (172102210449, 182102311124). We would like to thank Editage (www.editage.com) and International Science Editing (http://www.internationalscienceediting.com) for English language editing.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00202/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Shang, Li, Feng, Chen, Guo, Zhang, Wang and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# MicroNIR/Chemometrics Assessement of Occupational Exposure to Hydroxyurea

### Roberta Risoluti\* and Stefano Materazzi

Department of Chemistry, Sapienza - University of Rome, Rome, Italy

Portable Near Infrared spectroscopy (NIRs) coupled to chemometrics was investigated for the first time as a novel, entirely on-site approach for occupational exposure monitoring in the pharmaceutical field. Owing to a significant increase in the number of patients receiving chemotherapy, reliable, fast, and on-site analytical methods to assess the occupational exposure of workers in the manufacture of pharmaceutical products are increasingly required. In this work, a fast, accurate, and sensitive method for the detection of hydroxyurea, a cytotoxic antineoplastic agent commonly used in chemotherapy, was developed. Occupational exposure to antineoplastic agents was evaluated by collecting hydroxyurea on a membrane filter during the routine drug manufacturing process. Spectra were acquired in the NIR region in reflectance mode by means of a miniaturized NIR spectrometer coupled with chemometrics. This MicroNIR instrument is an ultra-compact portable device with a particular geometry and optical resolution designed in such a manner that the reduction in size does not compromise the performance of the spectrometer. The developed method could detect as little as 50 ng of hydroxyurea directly measured on the sampling filter membrane, irrespective of the complexity and variability of the matrix, thus extending the applicability of miniaturized NIR instruments in pharmaceutical and biomedical analysis.

### Edited by:

Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam

### Reviewed by:

Daniel Cozzolino, Central Queensland University, Australia Marçal Plans Pujolras, Nestle Purina PetCare Company, United States

### \*Correspondence:

Roberta Risoluti roberta.risoluti@uniroma1.it

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 01 March 2018 Accepted: 31 May 2018 Published: 19 June 2018

### Citation:

Risoluti R and Materazzi S (2018) MicroNIR/Chemometrics Assessement of Occupational Exposure to Hydroxyurea. Front. Chem. 6:228. doi: 10.3389/fchem.2018.00228

Keywords: MicroNIR, chemometrics, hydroxyurea, occupational exposure, pharmaceutics

# INTRODUCTION

Hydroxyurea (HU), or hydroxycarbamide, is a non-alkylating hydroxylated urea analog mainly recognized as an antineoplastic and antiviral agent (Spivak and Hasselbalch, 2011). Its cytotoxic and genotoxic potential makes this molecule one of the most effective agents commonly used in chemotherapy (Spivak and Hasselbalch, 2011; Karsy et al., 2016; Liew et al., 2016). In addition, HU is usually involved in the treatment of Sickle Cell Disease (SCD) (Davies and Gilmore, 2003; Heeney and Ware, 2008; Italia et al., 2009; Flanagana et al., 2010; Candrilli et al., 2011), psoriasis (Yarbro and Leavell, 1969), Philadelphia-chromosome negative myeloproliferative syndromes (MPs) (Yarbro and Leavell, 1969), some types of solid cancers (Karsy et al., 2016), and in the therapy of HIV infection (Lori et al., 1994).

An important issue when dealing with HU is related to its harmful potential (Millicovsky et al., 1981; Woo et al., 2005), especially in prolonged exposure conditions (Elchuri et al., 2015; Broto et al., 2017), as it inhibits class I ribonucleotide reductase, leading to replication fork stalling (Quattrone et al., 2013; Liew et al., 2016). Workers involved in the manufacture of drugs may be exposed to HU during manufacturing, transport, and distribution. In addition, as the number of patients receiving chemotherapy has considerably increased, there is a growing need for reliable, fast, and accurate methods to assess the occupational exposure of workers during the drug manufacturing process.

A number of analytical methods have been developed to quantify hydroxyurea in biological fluids, including spectrophotometric measurements by colorimetric techniques (Milks and Janes, 1956; Davidson and Winter, 1963; Bolton et al., 1965; Sivakumar et al., 2013; Legranda et al., 2017), electroanalytical determination (Naik et al., 2015), Nuclear Magnetic Resonance (NMR) (Main et al., 1987; Sorg et al., 2005; De Marco et al., 2011), High Performance Liquid Chromatography (HPLC) (Pujari et al., 1997; Iyamu et al., 1998; Manouilov et al., 1998), Gas Chromatography coupled to Mass Spectrometry (GC-MS) (James et al., 2006; Kettani et al., 2009; Garg et al., 2015), and Liquid Chromatography-tandem Mass Spectrometry (LC-MS/MS) (Dalton et al., 2005; Usawanuwat et al., 2014; Marahatta et al., 2016; Hai et al., 2017). Despite the copious literature on HU detection, the assay of HU may be cumbersome owing to its molecular size, reactivity, and susceptibility to chemical and enzymatic degradation (Iyamu et al., 1998; Marahatta and Ware, 2017).

The National Institute for Occupational Safety and Health (NIOSH) (Naumann et al., 1996) has proposed exposure control limits (ECL) for HU not exceeding 0.01 mg/m<sup>3</sup>, as a consequence of its potential toxicity. Conventional chromatographic techniques (Osytek et al., 2008) usually require an accurate sample clean-up to extract HU from a filter membrane and eliminate matrix interferences. All these procedures may be critical when estimating a tiny amount of HU and may lead to sample modification (Osytek et al., 2008). To overcome these problems, spectroscopic techniques have been widely proposed to give both qualitative and quantitative information about complex samples (Zontova et al., 2016; Materazzi et al., 2017a,b). In addition, multivariate statistical analysis has already proved helpful in interpreting complex spectral signals (Oliveri et al., 2011; Risoluti et al., 2016a,b, 2018; Materazzi et al., 2017c).

In this work, Near Infrared Spectroscopy is proposed as a rapid and non-destructive technique to detect and quantify HU on a glass fiber filter in order to assess a novel procedure for occupational exposure estimation. An ultra-compact portable instrument named MicroNIR (45-mm diameter, 42-mm height and 60-g operating weight), entirely powered (5 V) and controlled via the USB port of a portable computer, was used to acquire spectra, and chemometric tools were considered to perform real-time estimation of HU. A key feature of our portable MicroNIR/Chemometrics approach is the possibility of directly analyzing samples without any pretreatment or extraction. In addition, the method is simple and time-saving, and it can achieve the same outcomes as a conventional spectrometer.

# MATERIALS AND METHODS

# Materials

Hydroxyurea reference standard was purchased as powder from Sigma-Aldrich. Glass fiber filters with 2.5-cm diameter, 1-µm pore size, and 790-µm thickness (Merck Millipore) were used as membranes to collect HU. Sampling was performed by means of a Chronos sampling device (Zambelli Srl) operated at a flow rate of 3.5 L/min for 15 min, in order to mimic occupational exposure (not exceeding 3.5 µg/filter). Reference materials were prepared in a glove-box module consisting of a cube-shaped glass box isolated from the ambient temperature. Aliquots of 40 µl of HU solution in deionized water at different concentrations were added to reproduce the potential amounts of HU on a filter (50, 3.5 ng, and 50 µg).

# MicroNIR/Chemometrics Method

Spectra were collected with a portable, ultra-compact and low-cost MicroNIR spectrometer, developed and distributed by Viavi Solutions (JDSU Corporation, Milpitas, USA). This device operates in the spectral region 900–1,700 nm and consists of a linear variable filter (LVF) as dispersing element, directly connected to a 128-pixel linear indium gallium arsenide (InGaAs) array detector, and two tungsten light bulbs as radiation source.

In the MicroNIR, the optimum focal distance between the illumination source at the spectrometer's window and the sample is set by means of a special collar. This geometry yields outcomes comparable to those of a conventional spectrometer, as the reduction in size does not compromise performance. Instrument control was performed with the MicroNIR Pro software (JDSU Corporation, Milpitas, USA). Principal Component Analysis (PCA), used as an unsupervised exploratory technique, and Partial Least Squares (PLS) regression, used to build calibration models, were run in V-JDSU Unscrambler Lite (Camo software AS, Oslo, Norway).

Spectra were collected at a nominal spectral resolution of 6.25 nm in the reflectance mode. Spectralon was used as NIR reflectance standard (blank), with a 99% diffuse reflectance, while a dark reference was obtained from a fixed place in the room. The acquisitions were performed with an integration time of 10 ms, resulting in a total measurement time of 2.5 s for each sample.

As recommended for spectroscopic data (Rinnan et al., 2009), mathematical pre-treatments were considered for chemometric evaluation, such as scatter-correction methods [Standard Normal Variate transform (SNV) (Barnes et al., 1989), Multiplicative Scatter Correction (MSC) (Geladi et al., 1985), and Mean Centering (Wold and Sjöström, 1977)] and Savitzky-Golay (SG) polynomial derivative filters (Savitzky and Golay, 1964) as spectral derivation techniques. Among these pre-treatments, the combination of the second derivative algorithm followed by Mean Centering was selected because it provided the best outcomes in terms of Root Mean-Squared Error of Calibration (RMSEC), Root Mean-Squared Error of Prediction (RMSEP), and coefficient of determination (R<sup>2</sup>) (Miller and Miller, 2000; Mark and Workman, 2007).
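The pre-treatments named above can be illustrated with a minimal, standard-library-only sketch. It is not the Unscrambler implementation used in the study: the 5-point window and quadratic polynomial of the Savitzky-Golay filter are assumptions (the text does not state them), and all function names are hypothetical.

```python
import math

def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum by its own mean and SD."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / sd for v in spectrum]

# Savitzky-Golay second-derivative weights for a 5-point window and a
# quadratic fit (ASSUMED settings; the source does not report them).
SG_D2_WEIGHTS = (2 / 7, -1 / 7, -2 / 7, -1 / 7, 2 / 7)

def sg_second_derivative(spectrum, step=1.0):
    """Second derivative at interior points; two points are lost at each edge."""
    half = len(SG_D2_WEIGHTS) // 2
    out = []
    for i in range(half, len(spectrum) - half):
        window = spectrum[i - half:i + half + 1]
        out.append(sum(w * v for w, v in zip(SG_D2_WEIGHTS, window)) / step ** 2)
    return out

def mean_center(spectra):
    """Subtract the per-wavelength mean of the given set; also return that mean."""
    n = len(spectra)
    col_means = [sum(col) / n for col in zip(*spectra)]
    centered = [[v - m for v, m in zip(row, col_means)] for row in spectra]
    return centered, col_means

def second_derivative_then_center(spectra):
    """The combination selected in the text: second derivative, then mean centering."""
    derived = [sg_second_derivative(s) for s in spectra]
    return mean_center(derived)
```

In a calibration/validation setting, the column means returned for the calibration set would normally be reused to centre the validation spectra, so that both sets share the same origin. In practice a library routine such as `scipy.signal.savgol_filter` would typically replace the hand-written filter.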

Figures of merit were used to estimate model performance. In particular, the Residual Predictive Deviation (RPD) was used to evaluate the predictive ability of the model and was calculated as the standard deviation (SD) divided by the RMSEP. In general, a model is considered stable when RPD ≥ 3 and not satisfactory when RPD < 2. In this work, the precision of the method was determined on nine different samples with concentrations regularly distributed along the linear range, using nine replicates in the same day.
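As a hedged illustration of the RPD criterion described above, the sketch below computes RPD from a set of reference values and an RMSEP and applies the two thresholds quoted in the text. The use of the sample standard deviation (n − 1 denominator), the function names and the label for the intermediate range are assumptions.

```python
import statistics

def rpd(reference_values, rmsep):
    """Residual Predictive Deviation: SD of the reference values divided by RMSEP."""
    return statistics.stdev(reference_values) / rmsep

def rate_model(rpd_value):
    """Apply the thresholds quoted in the text (>= 3 stable, < 2 not satisfactory)."""
    if rpd_value >= 3:
        return "stable"
    if rpd_value < 2:
        return "not satisfactory"
    return "intermediate"  # range not characterised in the text; label is hypothetical
```

For example, `rate_model(6.1)` returns `"stable"`, while a value of 1.5 would be rated `"not satisfactory"`.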

Sensitivity (SEN) represents the fraction of the analytical signal due to a unit increase in the concentration of HU and was calculated as SEN = 1/‖b‖, where b is the vector of regression coefficients of the model with A latent variables. The minimum detectable concentration (MDC) is defined as the lowest concentration that can be reliably measured according to ISO 11843-2:2000 recommendations<sup>1</sup>.
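The sensitivity expression can be sketched in a few lines, assuming that the reciprocal of the regression vector is to be read as the reciprocal of its Euclidean norm, which is the usual multivariate definition. The function name is hypothetical.

```python
import math

def sensitivity(regression_coefficients):
    """SEN = 1 / ||b||, with ||b|| the Euclidean norm of the PLS regression vector (assumed reading)."""
    norm_b = math.sqrt(sum(c * c for c in regression_coefficients))
    return 1.0 / norm_b
```

Under this reading, a regression vector with large coefficients (a model that must amplify small spectral changes) yields a low sensitivity.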

# Experimental Design

Calibration and validation models were developed using a dataset of 297 samples. The dataset was divided into two groups, the calibration set (216 samples) and the validation set (81 samples). To make the calibration and validation sets as representative as possible and to ensure uniformity of the dataset, the X and Y distances were taken into account simultaneously by applying the Kennard–Stone (KS) uniform sampling algorithm. The calibration set consisted of a series of reference samples including blanks (filters without HU) and fortified blanks with increasing amounts of HU (50, 3.5 ng, and 50 µg).
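A minimal sketch of Kennard–Stone selection is given below for orientation. It implements the classic rule on generic feature vectors with Euclidean distance; how the X and Y distances were combined in the study is not detailed, so appending a scaled response value to each spectrum before calling the function is only one possible (assumed) way to account for both. The function name is hypothetical.

```python
import math

def kennard_stone(points, n_select):
    """Return the indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Start with the two most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in (first, second)]
    while len(selected) < n_select:
        # Add the candidate whose nearest already-selected point is farthest away.
        best = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With the sizes reported above, the call would select 216 calibration indices out of 297, and the indices left over would form the validation set.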

A comprehensive sampling procedure was scheduled as follows: samples were collected in a preserved glove box and nine spectra were acquired in reflectance mode for each membrane, as shown in **Figure 1**. A total of nine filters were used to optimize the model of prediction for HU exposure. Six independent batches were prepared for calibration; while validation was performed on the same type of samples as the calibration set, but fully independent batches, using three series of filters.

# GC-MS Method

GC-MS analysis was performed on a Perkin Elmer system (Waltham, MA) using an HP-5MS (30 m × 0.25 mm × 0.25 µm) capillary separation column. Electron impact (EI) ionization was employed at an energy of 70 eV. The carrier gas was helium delivered at a constant flow of 1 mL/min. The oven temperature program was initially set at 150°C for 1 min, ramped to 140°C at 12°C/min and maintained for 1 min, and then ramped to 270°C at 35°C/min for 2.5 min. The temperatures of the inlet, interface, ion source and quadrupole were set at 270, 250, 230, and 150°C, respectively. Mass spectral data were collected in the scan mode from m/z 44 to 400; in the SIM mode, fragments at m/z 277 and 292 were monitored for quantification and confirmation, respectively.

# RESULTS

To develop a novel analytical method to monitor occupational exposure to carcinogenic agents by evaluating the amount of HU on a filter, multivariate statistical analysis was performed for optimal selection of the experimental procedure. As a consequence, a number of variables were considered in order to ensure a correct and representative sampling procedure: (a) membrane type and sampling side; (b) sampling procedure to reproduce HU exposure in terms of volume to be added on a filter; (c) spectra acquisition. Preliminarily, all the acquired MicroNIR data corresponding to different experimental conditions were pre-treated and processed by a simple exploratory tool such as PCA. After that, a prediction model of HU based on Partial Least Squares Regression (PLSR) was entirely developed and validated.

<sup>1</sup> ISO 11843-2:2000. Capability of Detection, International Standards Organization. Geneva.

# Sampling Procedure Optimization

To make the method representative, the first investigated issue consisted of reference material preparation. Two different ways of HU deposition on a filter were investigated: (i) calibration on different filters i.e., four different filters (one blank and three fortified blanks) were considered; and (ii) calibration on a single filter i.e., only one filter was used and progressively fortified with increasing amounts of HU. In this case, spectra of blank and fortified samples were acquired prior to each deposition by the portable MicroNIR. In the first case, samples were prepared using 40 µl of aqueous solution of HU on each filter; while in the second case, a volume of 15 µl was used for each deposition.

All the acquired spectra were pre-treated and analyzed simultaneously by PCA. As displayed in **Figure 2**, each point represents an average of the nine respective spectra of a filter and colors were used to highlight the quantity of HU. The interpretation of the scores plot provides preliminary important information with respect to HU deposition and correlation to its different amounts on a filter.

A good correlation could be observed for samples of the same class (blank and fortified blanks) as there was no data dispersion, suggesting good repeatability of the method. This observation is important because it shows that the quantity of HU on the filter membrane can be clearly discriminated. Whichever deposition procedure is used, the method would therefore be suitable in practice, where occupational exposure of workers may be monitored by a personal sampling system collecting a real blank (prior to HU handling) to be fortified and directly analyzed.

In addition, as shown in **Figure 2**, in both cases moving along PC1 (97 and 89% of explained variance) all the analyzed samples could be well grouped according to HU amount. This further confirms the ability of the MicroNIR/Chemometrics approach to monitor occupational exposure to HU according to the amount collected on a filter.

A deeper investigation of the acquired spectra was performed by comparing the two series of samples (four- and one-filter calibration) in a single dataset. **Figure 3** displays the PCA results, showing that the same samples can be divided into two main groups along PC2: four- and one-filter calibration. The different locations of the samples in the plot indicate that the HU deposition procedure contributes to the spectroscopic signal.

Such a result is not surprising when a reflectance acquisition mode is involved, because the surface of the filter membrane may have some influence on the spectral response as a function of the volume added. Despite the different behaviors, samples could be clearly differentiated according to PC1 (91% of explained variance) and the preliminary outcomes suggested the possibility to further investigate the repeatability of the method.

With the aim of extending this procedure to real samples, nine different filters were prepared and subsequently fortified with different amounts of HU so as to increase the number of investigated samples and evaluate whether the method would be batch-dependent. As shown in **Figure 4**, all the samples of the same class (displayed in different colors) could be well grouped and located in the plot according to PC1. In addition, no dispersion of data was observed, indicating the effectiveness of the optimized HU deposition on a filter. On the basis of these promising preliminary results, a prediction model of HU on a single filter membrane was successfully validated.

# PLS Model of Prediction

In order to obtain the best results of calibration, the effect of a number of pre-treatments was evaluated i.e., the combination of spectral pre-treatments and wavelength range selection.

TABLE 1 | Figures of merit of HU calculated with different spectral pre-treatments in calibration and prediction steps.


Calibration and validation sets were pre-processed using Standard Normal Variate (SNV) scaling (Barnes et al., 1989), MSC (Geladi et al., 1985), Mean Centering (Wold and Sjöström, 1977), Savitzky-Golay (SG) polynomial derivative filters (Savitzky and Golay, 1964), and combinations of these pre-treatments.

To evaluate model performance, the different spectral pre-treatments were compared in order to identify the most effective one in terms of prediction error, using the Predicted Residual Error Sum of Squares (PRESS), i.e., the sum of squares of the prediction errors, and the coefficient of determination (R<sup>2</sup>). Usually, the smaller the PRESS value, the better the model's predictive ability. R<sup>2</sup> gives the percentage of variation in y explained by the x-variables and is widely used to evaluate fitting performance. Satisfactory results (R<sup>2</sup> and RMSEC) were obtained for the calibration of HU, as shown in **Table 1**.
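The quantities discussed above follow directly from reference and predicted values. The following sketch shows the standard definitions; the function names are hypothetical, and the same `rmse` function gives RMSEC when applied to calibration samples and RMSEP when applied to validation samples.

```python
import math

def press(y_ref, y_pred):
    """Predicted Residual Error Sum of Squares."""
    return sum((r - p) ** 2 for r, p in zip(y_ref, y_pred))

def rmse(y_ref, y_pred):
    """Root mean-squared error (RMSEC or RMSEP, depending on the sample set)."""
    return math.sqrt(press(y_ref, y_pred) / len(y_ref))

def r_squared(y_ref, y_pred):
    """Coefficient of determination: 1 - PRESS / total sum of squares."""
    mean_ref = sum(y_ref) / len(y_ref)
    ss_total = sum((r - mean_ref) ** 2 for r in y_ref)
    return 1.0 - press(y_ref, y_pred) / ss_total
```

Comparing pre-treatments then amounts to computing these values for each candidate model and retaining the one with the lowest errors and the highest R<sup>2</sup>.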

Good model agreement was confirmed in the validation step (R<sup>2</sup> > 0.9817 and RMSEP < 2.14 for all the optimized models). The best performance was achieved with the second derivative pre-treatment followed by mean centering (4 latent variables), as it provided the lowest RMSE and highest R<sup>2</sup> values. Furthermore, the effect of spectral variable selection within the calibration block was evaluated to improve the model's ability to predict HU. As illustrated in **Figure 5**, the first principal component loadings accounted for more than 87% of the total variance.

Validation results of the best-performing model (second derivative pre-treatment followed by mean centering) after variable selection in the range 1,540–1,600 nm are reported in **Table 2**, showing that the optimized model could quantify HU on a glass fiber filter with a limit of detection of 50 ng/filter. This finding indicates that an adequate PLS regression model can quantify HU directly from MicroNIR measurements without any prior sample preparation.
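Variable selection of this kind reduces each spectrum to the channels inside the chosen window before the PLS model is fitted. The sketch below illustrates the idea under a loud assumption: the 128 detector pixels are taken to be evenly spaced over 900–1,700 nm, which may differ from the instrument's actual wavelength calibration. The function names are hypothetical.

```python
def wavelength_axis(start_nm=900.0, stop_nm=1700.0, n_pixels=128):
    """Evenly spaced wavelength axis (ASSUMED; a real instrument supplies its own calibration)."""
    step = (stop_nm - start_nm) / (n_pixels - 1)
    return [start_nm + i * step for i in range(n_pixels)]

def select_window(spectra, wavelengths, low_nm, high_nm):
    """Keep only the channels whose wavelength lies within [low_nm, high_nm]."""
    keep = [i for i, wl in enumerate(wavelengths) if low_nm <= wl <= high_nm]
    reduced = [[row[i] for i in keep] for row in spectra]
    return reduced, keep
```

Under the even-spacing assumption, the 1,540–1,600 nm window retains 10 of the 128 channels, so the regression operates on a much smaller, more HU-specific block of variables.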

# Evaluation of Prediction Ability

The validated model was then used to process 30 filters collected during routine HU handling. In order to evaluate the prediction ability of the model, all the samples were analyzed in parallel by the reference method (GC-MS) and by the MicroNIR/Chemometrics approach. Data obtained from the MicroNIR approach (**Table 3**) show that the PLS model achieved the best prediction precision, with an RMSEP of 0.12 and an RPD of 6.1, supporting the accuracy and robustness of the model.

In addition, the chromatographic analysis detected HU in only 19 of the 30 samples, as the Limit of Detection (LOD) and Limit of Quantification (LOQ) of this method were 0.7 and 2.5 µg, respectively. When the amount of HU found chromatographically was lower than the LOQ of the method, the LOD was used for comparison with the values predicted by the MicroNIR/Chemometrics approach. The results showed an R<sup>2</sup> of 0.99 and acceptable values of bias at 95% confidence (see **Table 3**).

TABLE 2 | Analytical figures of merit for PLS quantification model.


\*Latent variables. \*\*Minimum detection concentration.

TABLE 3 | Results of the MicroNIR approach.


MicroNIR computed values were found to be significantly lower than the corresponding GC ones, as the LOD of the MicroNIR method is 50 ng; the MicroNIR/Chemometrics approach is therefore promising for occupational exposure monitoring at low HU levels.

# CONCLUSIONS

An ultra-compact portable device (MicroNIR) was applied to assess a novel way for HU occupational exposure monitoring.

A comprehensive sampling procedure was defined. Chemometric evaluation of spectra collected by a miniaturized device operating in the Near Infrared region was optimized and entirely validated by PLS regression. The proposed method has the advantage of being simple and of avoiding sample pre-treatment, thus also limiting the analyst's exposure to HU. Moreover, this approach may be considered an optimal technology to determine carcinogenic agents or other dangerous molecules in a single-touch analysis, as it is entirely portable and non-destructive. The achieved results highlight the high potential of the MicroNIR to detect HU with lower detection limits than the reference method. To the best of the authors' knowledge, this approach is the first proposed for the on-site detection of HU. It requires no sample preparation, is non-destructive and is easy to perform (no highly skilled personnel required), allowing a rapid evaluation of occupational exposure to HU.

# AUTHOR CONTRIBUTIONS

SM and RR conceived the study and developed the experimental design. RR performed the chemometric evaluation of data. SM and RR analyzed and interpreted data and wrote the manuscript. All authors reviewed and approved the manuscript.

# REFERENCES


J. Chromatogr. B Anal. Technol. Biomed. Life Sci. 87, 446–450. doi: 10.1016/j.jchromb.2008.12.048


chromatography using electrochemical detection. J. Chromatogr. B Biomed. Sci. Appl. 694, 185–191. doi: 10.1016/S0378-4347(97)00120-5


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Risoluti and Materazzi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Plasma Biochemical Analysis of Acute Lead Poisoning in a Rat Model by Chemometrics-Based Fourier Transform Infrared Spectroscopy: An Exploratory Study

### Wenli Tian, Dan Wang, Haoran Fan, Lujuan Yang and Gang Ma\*

Key Laboratory of Medicinal Chemistry and Molecular Diagnosis of Ministry of Education, Key Laboratory of Analytical Science and Technology of Hebei Province, College of Chemistry and Environmental Science, Hebei University, Baoding, China

### Edited by:

Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam

### Reviewed by:

Zoltán Kónya, University of Szeged, Hungary Chih-Ching Huang, National Taiwan Ocean University, Taiwan

> \*Correspondence: Gang Ma gangma@hbu.edu.cn

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 29 March 2018 Accepted: 11 June 2018 Published: 28 June 2018

### Citation:

Tian W, Wang D, Fan H, Yang L and Ma G (2018) A Plasma Biochemical Analysis of Acute Lead Poisoning in a Rat Model by Chemometrics-Based Fourier Transform Infrared Spectroscopy: An Exploratory Study. Front. Chem. 6:261. doi: 10.3389/fchem.2018.00261

In this work, we explored the use of chemometrics-based Fourier transform infrared (FTIR) spectroscopy to investigate the plasma biochemical changes due to acute lead poisoning (ALP) in a rat model. We first collected the FTIR spectra of plasma samples from rats with and without ALP. We then performed chemometric analysis of these FTIR spectra using principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA). We found that chemometrics-based FTIR spectroscopy can discriminate the rats with and without ALP. Further analysis of the PLS-DA regression coefficients revealed that the spectral changes, in particular those corresponding to the biochemical changes of proteins in the plasma, may be used as potential spectral biomarkers for the diagnostics of lead poisoning. Our work demonstrates the potential of chemometrics-based FTIR spectroscopy as a promising tool for the biochemical analysis of plasma that could consequently enable objective, convenient and non-destructive diagnostics of lead poisoning. To the best of our knowledge, this work is the first application of chemometrics-based FTIR spectroscopy in the diagnostics of lead poisoning.

Keywords: FTIR spectroscopy, infrared spectroscopy, chemometrics, lead poisoning, acute lead poisoning, principal component analysis, partial least squares discriminant analysis

# INTRODUCTION

Lead is an omnipresent metal that has been used since prehistoric times. Prior to the industrial revolution, human exposure to lead in the environment was relatively low, but it has increased significantly over time due to modern industrial activities. It is estimated that over 300 million tons of lead have been released to the environment by human activities (Tong et al., 2000), which has led to a rapid increase in environmental lead exposure. A previous study indicated that the lowest levels of human blood lead in the industrial era were 50–200 times higher than in the preindustrial era (Flegal and Smith, 1992b). As for lead poisoning, in 1839 Tanquerel des Planches described the symptoms of acute lead poisoning (ALP) and studied the signs of ALP in adults (Hunter, 1978). In the middle and late nineteenth century, lead poisoning became a serious health problem among British workers. The British Parliament eventually enacted relevant laws and regulations to prevent lead poisoning (Hunter, 1978; Smith, 1984; Winder, 1984; Tong et al., 2000). Lead poisoning can be caused by ingestion and inhalation of lead and related products such as lead-containing paints. Lead can cause a series of physiological and biochemical changes within the human body, affecting the central and peripheral nervous system, cardiovascular system, reproductive system, immune system, gastrointestinal tract, liver, kidney and brain (Hunter, 1978; Smith, 1984; Winder, 1984; Kazantzis, 1989; Goldstein, 1992; Tong, 1998; Tong et al., 2000).

Lead poisoning diagnostics is based on the determination of the lead level in the human body. Several methods are currently available for measuring lead in blood samples. For example, one common method is the so-called blood film method, in which the morphology of red blood cells is examined with a microscope to reveal basophilic stippling (i.e., red blood cells with dots in their morphology). However, this method is not very specific because other unrelated conditions (such as folate and vitamin B12 deficiencies) can also give basophilic stippling of red blood cells. The lead level can be evaluated indirectly by measuring erythrocyte protoporphyrin (EP) in blood samples. Such EP measurement is not very sensitive or specific, however, because an increase in EP level can also be observed in the case of iron deficiency. The X-ray fluorescence method can be used to determine the cumulative exposure and total body burden of lead. However, this method is not very convenient because X-ray fluorescence instruments are not widely available in clinics. The current methods in lead poisoning diagnostics thus still have limitations and disadvantages (Patrick, 2006; Brodkin et al., 2007). Searching for a specific, rapid, convenient, objective and cost-effective method for lead poisoning diagnostics is therefore very meaningful (Flegal and Smith, 1992a).

In recent years, Fourier transform infrared (FTIR) spectroscopy has been widely used in the biochemical analysis field (Baker et al., 2014). FTIR spectroscopy is a simple, convenient, non-destructive, rapid and low-cost detection method to sample biological materials such as blood and tissue for diagnostic purposes (Deleris and Petibois, 2003; Ellis and Goodacre, 2006; Krafft et al., 2007, 2009; Gasper et al., 2009; Gajjar et al., 2013; Baker et al., 2014; Mitchell et al., 2014; Ollesch et al., 2014; Sheng et al., 2015; Staniszewska-Slezak et al., 2015; Depciuch et al., 2017; Elmi et al., 2017; Ghimire et al., 2017; Guo et al., 2017; Le Corvec et al., 2017; Li et al., 2017; Liu et al., 2017; Paraskevaidi et al., 2017; Roy et al., 2017; Sarkar et al., 2017; Titus et al., 2017; De Bruyne et al., 2018; Rai et al., 2018). When combined with chemometric analysis, FTIR spectroscopy can be further empowered in disease diagnostics. Now, FTIR spectroscopy has been used in many studies to detect the physiological states and disease-specific biomarkers in the blood. For example, Staniszewska-Slezak et al. established the rat models for pulmonary arterial hypertension and systemic hypertension, and then collected the FTIR spectra of rat plasma samples. By using FTIR spectroscopy combined with principal component analysis (PCA), they found that they could distinguish the two different hypertension states as well as the healthy state. They also envisioned that chemometrics-based FTIR spectroscopy could potentially provide some spectral biomarkers for disease diagnostics (Staniszewska-Slezak et al., 2015). Roy et al. recently used attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy in combination with partial least squares discriminant analysis and partial least squares regression to identify malaria parasites, blood glucose and urea levels in whole blood samples (Roy et al., 2017). Titus et al. 
recently proposed an FTIR approach combined with cluster and heterogeneity analyses to rapidly screen colitis without using biopsies or in vivo measurements (Titus et al., 2017). Paraskevaidi et al. recently demonstrated an excellent diagnostic performance of chemometrics-based ATR-FTIR spectroscopy by analyzing plasma samples from patients with Alzheimer's disease (Paraskevaidi et al., 2017).

In our work, we focused on the biochemical changes of plasma after lead poisoning using a rat model suffering from ALP. The main goal of this study was to find the plasma biochemical changes induced by lead in rats by FTIR spectroscopy combined with chemometric approaches such as PCA and partial least squares discriminant analysis.

# EXPERIMENTAL

# ALP Rat Model

Male Wistar rats (240 ± 20 g) were purchased from the Vital River Lab Animal Technology Co., Ltd. (Beijing, China). Animals were housed under constant temperature, humidity and lighting (12 h per day) and were allowed free access to food and water. The animal experiment was carried out in accordance with the guidelines for the care and use of laboratory animals and the relevant ethical regulations of the Animal Ethics Committee of Tianjin Tasly Institute. The protocol was approved by the Animal Ethics Committee of Tianjin Tasly Institute.

The rats (N = 4) before lead injection were used as the control group, and the same rats after lead injection were used as the test group. To induce ALP, the rats were intraperitoneally injected with PbCl<sub>2</sub> saline solution (5 mg lead per kg). For chemometric modeling, blood samples were collected from the control group and from the test group 24 h post-injection. Blood samples were also collected from the test group 36 and 48 h post-injection for model validation. In addition, another control group (N = 4), namely a group with acute cadmium poisoning, was studied by intraperitoneally injecting the rats with CdCl<sub>2</sub> saline solution (5 mg cadmium per kg). The blood samples from this control group were collected 24 h post-injection. The blood samples were stored at about −80°C for further treatment. Both PbCl<sub>2</sub> and CdCl<sub>2</sub> of analytical grade were obtained from local vendors.

# Plasma Sample Preparation

The blood sample was centrifuged at 3,000 rpm for 10 min, and a 10-µl aliquot of supernatant plasma was pipetted on the top of a piece of 1 × 1 cm aluminum foil. Each blood sample was used to prepare five replicate samples on aluminum foil. The foil was then placed in an oven set at 37°C for 2 h, and the obtained dry plasma film was subsequently used for FTIR measurement.

# FTIR Measurement

FTIR measurements were carried out on a Bruker Vertex 70 FTIR spectrometer (Ettlingen, Germany) equipped with a DLaTGS detector in attenuated total reflection (ATR) mode. A resolution of 4 cm<sup>−1</sup> and 32 scans were used for each measurement. A Pike Technologies MIRacle single-reflection ATR accessory (Madison, USA) with a diamond element was employed. During spectral acquisition, the plasma sample was pressed against the diamond crystal using a pressing device from Pike Technologies to ensure close contact. For each piece of aluminum foil with blood sample, at least seven FTIR spectra were taken by measuring signals at different locations on the foil.

# Spectral Pretreatment

The obtained FTIR spectra of the plasma samples were first screened to remove spectra with large, error-based deviations. In ATR-FTIR mode, the contact between the sample and diamond crystal has a significant effect on the spectral quality, e.g., a poor contact will lead to poor quality FTIR spectra (abnormally low absorbance). These spectra need to be removed from the spectral dataset before chemometric analysis. Such spectral deviation is not due to the intrinsic deviation of one sample from its group (i.e., the control or test groups), but purely related to the spectral artifact caused by an improper contact between the sample and diamond crystal. These "abnormal" spectra could be easily identified visually with OPUS software and were then removed from the spectral dataset manually. The remaining spectra were used for chemometric analysis after being subjected to spectral pre-treatment including smoothing, scattering correction, vector normalization and second derivative treatment with chemometric software.

FIGURE 2 | Plasma FTIR second derivative spectra of the rat group without ALP (A) and with ALP (B) in the 3,100–2,800 and 1,750–900 cm<sup>−1</sup> spectral regions.

spectra of the rat groups without and with ALP.

# Chemometric Analysis

Chemometric analysis was performed using Unscrambler software (version 10.4) for PCA and partial least squares discriminant analysis (PLS-DA). In our study, we selected the data from the second derivative FTIR spectra in the regions of 3,100–2,800 and 1,750–900 cm<sup>−1</sup> for PCA. In addition, we used 4-fold cross validation to test the effect of rat inter-individual variability on the spectra. This chemometric approach is relatively simple yet sufficiently powerful to differentiate spectroscopically the rat groups with and without ALP.

# RESULTS AND DISCUSSION

**Figure 1** shows the plasma FTIR spectra of the rat groups without and with ALP after spectral pretreatment such as smoothing, baseline correction, and vector normalization. **Figure 2** shows the second derivative spectra of the plasma FTIR spectra presented in **Figure 1**. These second derivative spectra were the dataset used in the following chemometric analysis. The reason for applying derivative treatment to the absorbance spectra in **Figure 1** is twofold. First, the second derivative treatment can further magnify the spectral changes and differences between the control and test groups. Second, the second derivative treatment can also eliminate possible interference of the baseline in chemometric analysis. In addition, in **Figure 2**, we have only included the spectral regions of 3,100–2,800 and 1,750–900 cm<sup>−1</sup> and removed the spectral region of 2,800–1,750 cm<sup>−1</sup> (as this region contains very limited spectral information). The 3,100–2,800 cm<sup>−1</sup> region corresponds to the C-H stretching absorptions, whereas the 1,750–900 cm<sup>−1</sup> region corresponds to the protein amide I and amide II regions and the fingerprint region. The spectral regions displayed in **Figure 2** contain most of the spectral information that is highly correlated to the ALP-induced biochemical changes in the plasma, thus making them suitable for our chemometric analysis.

For the control and test group spectral datasets, we first applied the most basic chemometric approach, PCA. The contribution rates of the first five principal components (PC-1, PC-2, PC-3, PC-4, and PC-5) are 64, 20, 7, 3, and 2%, respectively. The cumulative contribution rate of these five principal components reaches 96%, indicating that they reflect most of the spectral variations and differences among the spectra of the control and test groups.
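As a quick worked check of the figures quoted above, the cumulative contribution rate is simply the running sum of the individual rates. The dictionary below only restates the percentages given in the text; the variable names are illustrative.

```python
from itertools import accumulate

# Contribution rates (%) of the first five principal components, as quoted in the text.
contribution = {"PC-1": 64, "PC-2": 20, "PC-3": 7, "PC-4": 3, "PC-5": 2}

# Running sum: cumulative contribution after each additional component.
cumulative = list(accumulate(contribution.values()))

for name, total in zip(contribution, cumulative):
    print(f"{name}: cumulative {total}%")
```

The last entry of `cumulative` reproduces the 96% stated above.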

The two-dimensional score plots of PC-1 vs. PC-2, PC-1 vs. PC-3, and PC-2 vs. PC-3 are shown in **Figures 3A–C**, respectively. The two groups are well separated in **Figure 3A**, whereas they still overlap significantly in **Figures 3B**,**C**. **Figure 3A** therefore gives the best discrimination between the control and test groups. Our chemometric analysis demonstrates that, with simple chemometric approaches such as PCA and PLS-DA, FTIR spectroscopy can be used to discriminate the rat groups with and without ALP.

For 4-fold cross validation on our data, the samples were split into four folds, and each fold was used once as the test set while the remaining three folds formed the training set. The results show that (i) there are significant differences between the test and control groups of plasma due to ALP and (ii) rat inter-individual variability has little influence on the spectral differences between the two groups. First, we analyzed the regions of 3,100–2,800 and 1,750–900 cm−<sup>1</sup> with PLS-DA. As displayed in **Figure 4**, the Y-variance plot shows that the curve essentially leveled off at PC7, and including more PCs could lead to overfitting; seven PCs were therefore selected for further analysis. **Figure 5** shows that PLS-DA could completely distinguish between healthy and ALP rats with seven PCs. However, the blue model and the red cross-validation (CV) model were not well matched. The fingerprint region of 1,750–900 cm−<sup>1</sup> alone was therefore selected. As displayed in **Figure 6**, the Y-variance plot shows that seven PCs should again be selected for further analysis. **Figure 7** shows not only that PLS-DA can completely distinguish between healthy and ALP rats with seven PCs, but also that the blue model fits well with the red CV model. In addition, the healthy and ALP groups in the red CV model are well separated by the 0.5 threshold line. In summary, the plasma spectra of healthy and ALP rats were distinctly different, and inter-individual variability had little impact on the discrimination of healthy and ALP rats.
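For readers unfamiliar with k-fold cross validation, the splitting step can be sketched as follows. This is a generic illustration, not the procedure implemented in the software used here; the function name and the sample count of 10 are hypothetical, and practical implementations usually shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, nearly equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# Hypothetical example: 10 samples split into 4 folds.
folds = list(k_fold_indices(10, 4))
for train, test in folds:
    print("test:", test, "train:", train)
```

Every sample appears in exactly one test fold, so each prediction used for validation comes from a model that never saw that sample during training.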

FIGURE 5 | PLS-DA predicted and reference plots in the 3,100–2,800 and 1,750–900 cm−<sup>1</sup> spectral regions.

The selectivity and robustness of our proposed PLS-DA model were also tested with additional controls to evaluate whether the model gives a correct discrimination (i) when the rats suffer from another heavy metal poisoning and (ii) when the rats suffer from different extents of ALP. To address the first issue, we developed an acute cadmium poisoning rat model. Rats were injected with CdCl<sub>2</sub> solution to induce acute poisoning, and the blood samples were collected 24 h post-injection. The plasma FTIR spectra and corresponding derivatives of this control group are presented in Figure S1 in the Supplementary Material. The data from this control group were tested with our PLS-DA model. As mentioned above, the 0.5 value line is the threshold in the PLS-DA model in **Figure 7**. For data points above this line, the model predicts that the rats are in ALP status; for data points below this line, the model predicts that the rats are not in ALP status. As displayed in **Figure 8**, the predicted values for the rats suffering from acute cadmium poisoning are all below the 0.5 threshold, indicating that our PLS-DA model predicts that these rats are not in ALP status. This is a correct discrimination. To address the second issue, we performed a time-dependent study (up to 48 h post-injection) on the ALP rat model. Rats exposed to lead for different periods of time would suffer from lead poisoning to different extents. The plasma FTIR spectra and corresponding derivatives of this control group are presented in Figures S2, S3 in the Supplementary Material. We tested the 36 and 48 h data with our PLS-DA model. As shown in **Figure 9**, the predicted values for these two control rat groups are all above the 0.5 threshold, indicating that these samples are in ALP status. This is a correct discrimination. These additional control experiments support the conclusion that our PLS-DA model is robust for ALP prediction.
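The decision rule described above (predicted value above 0.5 means ALP, below means not ALP) can be sketched in a few lines. The function name and the predicted values are hypothetical and serve only to illustrate the thresholding; they are not data from this study.

```python
THRESHOLD = 0.5  # decision line of the PLS-DA model, as stated in the text

def predict_alp_status(predicted_values, threshold=THRESHOLD):
    """Label each PLS-DA predicted value as 'ALP' (above threshold) or 'non-ALP'."""
    return ["ALP" if value > threshold else "non-ALP" for value in predicted_values]

# Hypothetical predicted values, invented for illustration only.
cadmium_like = [0.12, -0.05, 0.31]
lead_like = [0.88, 1.04, 0.67]

print(predict_alp_status(cadmium_like))
print(predict_alp_status(lead_like))
```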

FIGURE 7 | PLS-DA predicted and reference plot in the 1,750–900 cm−<sup>1</sup> spectral region.

proposed PLS-DA model for ALP.

Basically, some lead-induced biochemical changes in the plasma can be sensitively captured with chemometrics-based FTIR spectroscopy. To gain more insight into the biochemical changes induced by ALP in the plasma, the PLS-DA regression coefficient plot can be used to reflect the corresponding spectral changes. As shown in **Figure 10**, this plot corresponds to the ALP-induced change in the composition and structure of the biochemical components in the plasma, including biomacromolecular constituents (such as proteins, DNA, and RNA) as well as small-molecule constituents and metabolites (such as lipids and carbohydrates). These plasma constituents have characteristic vibrational absorptions in the PLS-DA regression coefficient plot. For example, the 1,700–1,600 cm−<sup>1</sup> amide I region provides information relevant to proteins, and the 1,300–1,000 cm−<sup>1</sup> region provides information relevant to DNA and RNA. In addition, the intensity of the PLS-DA regression coefficient plot in different spectral regions can provide information about the most prominent changes in the plasma. **Table 1** summarizes the spectral assignments for the prominent peaks (either positive or negative) in the PLS-DA regression coefficient plot. They are based on the assignments in previous studies (Barth and Zscherp, 2002; Zandomeneghi et al., 2004; Zou et al., 2013; Staniszewska-Slezak et al., 2015). The peaks in the amide I (1,700–1,600 cm−<sup>1</sup>) and amide II (around 1,550 cm−<sup>1</sup>) regions correspond to absorptions of plasma proteins. In these regions, we observed several prominent peaks in the PLS-DA regression coefficient plot, including the amide I and amide II peaks at 1,706, 1,689, 1,672, 1,656, 1,643, 1,613, 1,550, and 1,534 cm−<sup>1</sup>. This observation suggests that ALP induced significant compositional and structural changes of the proteins in the plasma of the ALP rat model. Such changes may be due to the direct coordination of lead ions with proteins or to the perturbation by lead ions of protein biosynthesis in the rat. In addition, lead ions may interact (or coordinate) with the side chains of some amino acids (such as tryptophan, histidine, aspartic acid, and glutamic acid) or affect the biosynthesis of these amino acids. Such interactions or perturbations are suggested by the peaks at 1,505, 1,354, and 1,241 cm−<sup>1</sup> (corresponding to the side chain of tryptophan), at 1,583 and 1,433 cm−<sup>1</sup> (corresponding to the side chain of histidine), and at 1,417 cm−<sup>1</sup> (corresponding to the side chains of aspartic acid and glutamic acid). The PLS-DA regression coefficient plot also suggests that the nucleic acids (DNA and RNA) in the plasma change, as peaks at 1,221, 1,120, 1,080, and 1,062 cm−<sup>1</sup> are observed. These peaks correspond to the PO<sub>2</sub><sup>−</sup> and C-O absorptions of DNA and RNA. Finally, the peaks at 1,034 cm−<sup>1</sup> (which may be related to the metabolism of glucose and polysaccharides) and at 989 and 972 cm−<sup>1</sup> (which correspond to the phosphorylation of proteins) are also observed in the regression coefficient plot. In summary, on the one hand, the PLS-DA regression coefficient plot suggests that very complex biochemical changes occurred in the body of the lead-poisoned rats; on the other hand, ALP-induced protein changes seem to be the most important cause of the rat poisoning. This finding further implies that the spectral changes corresponding to the biochemical changes of proteins may be used as potential spectral biomarkers for the diagnostics of ALP.

# CONCLUSION

In this exploratory study, we have demonstrated that FTIR spectroscopy combined with PCA and PLS-DA can capture ALP-induced biochemical changes in the plasma spectroscopically and is capable of differentiating rats with and without ALP. Furthermore, the revealed FTIR spectral changes, in particular those corresponding to the biochemical changes of proteins, may be used as potential spectral biomarkers for the diagnostics of lead poisoning. Our method has sufficient discriminant ability and the potential to be employed as a blood-based, objective, convenient, and non-destructive diagnostic tool for lead poisoning. To the best of our knowledge, this work is the first application of chemometrics-based FTIR spectroscopy in the diagnostics of lead poisoning. We hope that chemometrics-based FTIR spectroscopy can evolve into an objective, convenient, cost-effective, and non-destructive disease diagnostics tool in the future.

TABLE 1 | Spectral assignment for the observed peaks in the PLS-DA regression coefficient plot.

# ETHICS STATEMENT

This study was carried out in accordance with the recommendations of institutional guidelines of the Animal Ethics Committee of Tianjin Tasly Institute. The protocol was approved by the Animal Ethics Committee of Tianjin Tasly Institute.

# AUTHOR CONTRIBUTIONS

WT and GM designed the project. WT, DW, HF, LY, and GM conducted the experiments and analysed the data. GM, WT, and HF wrote the manuscript.

# ACKNOWLEDGMENTS

We gratefully acknowledge the financial support from the National Natural Science Foundation of China (No. 21075027), the Natural Science Foundation of Hebei Province (Nos. B2011201082 and B2016201034), Juren plan, and Program for Changjiang Scholars and Innovative Research Team in University (No. IRT\_15R16). GM thanks Xiangke Chen for helpful discussions.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00261/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Tian, Wang, Fan, Yang and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Pharmaceutical Analysis Model Robustness From Bagging-PLS and PLS Using Systematic Tracking Mapping

Na Zhao<sup>1</sup>, Lijuan Ma<sup>2,3</sup>, Xingguo Huang<sup>2,3</sup>, Xiaona Liu<sup>4</sup>, Yanjiang Qiao<sup>1,2,3</sup>\* and Zhisheng Wu<sup>1,2,3</sup>\*

<sup>1</sup> Key Laboratory of Xinjiang Phytomedicine Resources and Utilization, Ministry of Education, School of Pharmacy, Shihezi University, Shihezi, China, <sup>2</sup> Beijing University of Chinese Medicine, Beijing, China, <sup>3</sup> Pharmaceutical Engineering and New Drug Development of TCM of Ministry of Education, Beijing, China, <sup>4</sup> School of Integrated Traditional Chinese and Western Medicine, Binzhou Medical University, Yantai, China

Our work showed that processing trajectory can yield a more reliable and robust quantitative model than the step-by-step optimization method. The use of systematic tracking was investigated as a tool to optimize modeling parameters, including calibration method, spectral pretreatment, variable selection, and latent factors. The variables were selected by interval partial least-squares (iPLS), backward interval partial least-squares (BiPLS), and synergy interval partial least-squares (SiPLS). The models were established by partial least squares (PLS) and Bagging-PLS. The model performance was assessed using the root mean square error of prediction (RMSEP) and the ratio of the standard deviation to the standard error of prediction (RPD). The proposed procedure was used to develop models for near infrared (NIR) datasets of active pharmaceutical ingredients in tablets and of chlorogenic acid in Lonicera japonica solution during the ethanol precipitation process. The results demonstrated the advantages and feasibility of processing trajectory in the development and optimization of multivariate calibration models, as well as the effectiveness of the bagging model and variable selection in improving prediction accuracy and robustness.

Keywords: multivariate calibration, near infrared spectroscopy, processing trajectory, Bagging-PLS, variable selection

# INTRODUCTION

Multivariate calibration is the process of relating the measured response to the analyte amounts, concentrations, or other measured values of physical or chemical properties. Partial least squares (PLS) regression is one of the most effective and commonly used regression techniques in multivariate calibration because of its calibration model quality and ease of implementation. Publication statistics show that approximately 20,000 papers published from 2005 to 2017 used PLS models. The PLS technique has been effectively applied in different fields, especially in pharmaceutical analysis.

Kachrimanis et al. developed a fast and precise method using FT-Raman spectroscopy together with PLS for the quantitation of monoclinic and orthorhombic paracetamol in powder mixtures (Kachrimanis et al., 2007). Yu et al. established a PLS model using near infrared (NIR) spectroscopy and gas chromatography data to determine l-borneol in Blumea balsamifera (Ai-na-xiang) samples (Yu et al., 2017). Sarkhosh et al. developed a PLS model of redox potential with genetic algorithms selecting pixels in multivariate image analysis for a quantitative structure-activity relationship (QSAR) study of the trypanocidal activity of quinone compounds (Sarkhosh et al., 2014). Üstün et al. built a fast quantification method combining <sup>1</sup>H NMR spectroscopy with PLS to determine chondroitin sulfate and dermatan sulfate in danaparoid sodium (Üstün et al., 2011). Wu et al. used NIR as a process analytical technology and developed PLS models of 11 amino acids to monitor their concentration changes during the hydrolysis process of Cornu Bubali (Wu et al., 2013b).

### Edited by:

Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam

### Reviewed by:

Huawen Wu, BaySpec, Inc., United States Francesco Crea, Università degli Studi di Messina, Italy

\*Correspondence:

Yanjiang Qiao yjqiao@263.net Zhisheng Wu wzs@bucm.edu.cn

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 28 November 2017 Accepted: 12 June 2018 Published: 06 July 2018

### Citation:

Zhao N, Ma L, Huang X, Liu X, Qiao Y and Wu Z (2018) Pharmaceutical Analysis Model Robustness From Bagging-PLS and PLS Using Systematic Tracking Mapping. Front. Chem. 6:262. doi: 10.3389/fchem.2018.00262

The successful application of PLS depends on the development and validation of multivariate models. Multivariate data increasingly call for a more suitable method of establishing a robust and reliable PLS model. However, many parameters need to be optimized for a quantitative PLS model, including spectral pretreatment, variable selection, and calibration method. To improve model performance, pretreatments are used to reduce undesirable variations arising from the instrument, environment, sample preparation protocol, etc. (Faber, 1999; Blanco et al., 2007; Fernández-Cabanás et al., 2007; Lim et al., 2016).

Besides, variable selection in modeling is also an important step to identify informative features and/or remove uninformative variables for better prediction performance and model complexity reduction. Recently, based on the PLS algorithm, some variable selection methods have been developed including interval partial least-squares (iPLS) (Saudland et al., 2000), backward interval partial least-square (BiPLS) (Leardi and Nørgaard, 2004) and synergy interval partial least-squares (SiPLS) (Munck et al., 2001), etc. Many studies have confirmed the efficiency of these variable selection methods for improving model performance (Chen et al., 2008; Di et al., 2010; Wu et al., 2013a; Mahanty et al., 2016).

In addition, a single model is often not robust to changes in the calibration data and model parameters. An effective alternative for improving model robustness is ensemble modeling, which establishes multiple models and combines their predictions into a single value. Bagging-PLS is one of the most important ensemble modeling techniques. About 60 papers were published on the use of Bagging-PLS models in the period 2005–2017. Galvão et al. used bagging strategies in conjunction with Multiple Linear Regression (MLR) and PLS to develop multivariate calibration models for four diesel quality parameters, showing that the prediction accuracy was improved by the subagging procedure (Galvão et al., 2006). Pan et al. combined the Bagging ensemble method with PLS to detect naringin, hesperidin, and neohesperidin in the pilot-scale extraction process of Fructus aurantii with online NIR sensors (Pan et al., 2015).
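The bagging idea described above (fit several models on bootstrap resamples and average their predictions) can be sketched in plain Python. To keep the example self-contained, a univariate least-squares line stands in for the PLS base model; this substitution, the function names, and the toy data are all assumptions made for illustration and do not reproduce any published Bagging-PLS implementation.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit y = a + b*x (a simple stand-in for a PLS base model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate resample (all x identical): fall back to the mean response.
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def bagging_predict(xs, ys, x_new, n_models=50, seed=0):
    """Average the predictions of base models fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        # Bootstrap: draw n calibration samples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        intercept, slope = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(intercept + slope * x_new)
    return sum(predictions) / len(predictions)

# Toy calibration data scattered around y = 2x + 1 (invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]
estimate = bagging_predict(xs, ys, 2.5)
print(round(estimate, 2))
```

Because each base model sees a slightly different calibration set, the averaged prediction tends to be less sensitive to individual samples than any single fit.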

Most published works dealing with PLS models used a univariate approach, optimizing the modeling parameters step by step according to the root-mean-square error. The number of modeling paths explored by this method was limited, and the results were often not globally optimal. We therefore proposed processing trajectory, which provides a systematic way to optimize the parameters of a quantitative model (Zhao et al., 2015).

Based on the above considerations, we extend the tracking procedure from the optimization of spectral pretreatment, latent factors, and variable selection to the optimization of spectral pretreatment, latent factors, variable selection, and calibration method. The variable selection methods included iPLS, BiPLS, and SiPLS. The models were established using PLS and Bagging-PLS. The model performance was assessed using the root mean square error of prediction (RMSEP) and the ratio of the standard deviation to the standard error of prediction (RPD) (Esbensen et al., 2014; Williams et al., 2014). Two different NIR spectral datasets (one standard and one open source) were analyzed. The proposed procedure was used to predict active pharmaceutical ingredients (API) in tablets and chlorogenic acid in Lonicera japonica solution during the ethanol precipitation process.

# DATASETS AND ANALYSIS

# Datasets

### Tablet

The NIR transmittance spectra of a pharmaceutical tablet were described in Dyrby et al. (2002) and are publicly available at http://www.models.life.ku.dk/Tablets. This tablet dataset consists of 310 samples measured in the range of 7,000–10,500 cm−<sup>1</sup> with a resolution of 16 cm−<sup>1</sup>, i.e., a total of 404 variables per sample. The objective of the analysis was to predict the API content of the tablet. The content of API in the tablets (% w/w) was assayed by high performance liquid chromatography (HPLC). The tablet dataset is supplied in Data Sheet 1. Using the Kennard-Stone (KS) algorithm, this dataset was divided into 207 samples for training and 103 samples for validation.
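The Kennard-Stone selection mentioned above can be sketched as follows. This is a minimal textbook-style version using Euclidean distances on raw vectors; whether the original study applied it to raw spectra or to scores is not stated, and the function names and toy data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: squared_distance(samples[pair[0]], samples[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]
    # Distance of each remaining sample to its nearest selected sample.
    nearest = {
        i: min(squared_distance(samples[i], samples[s]) for s in selected)
        for i in remaining
    }
    while len(selected) < n_select:
        # Add the remaining sample that is farthest from the selected set.
        chosen = max(remaining, key=lambda i: nearest[i])
        selected.append(chosen)
        remaining.remove(chosen)
        del nearest[chosen]
        for i in remaining:
            nearest[i] = min(nearest[i], squared_distance(samples[i], samples[chosen]))
    return selected

# Hypothetical one-variable "spectra", for illustration only.
toy = [[0.0], [1.0], [2.0], [5.0], [10.0]]
calibration = kennard_stone(toy, 3)
validation = [i for i in range(len(toy)) if i not in calibration]
print(calibration, validation)
```

The selected samples span the extremes of the data space, which is why the rule is commonly used to build a calibration set that covers the validation samples.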

### Lonicera japonica

The NIR spectral dataset of Lonicera japonica has been reported previously (Wu et al., 2012). The data consisted of 216 samples with 2,800 variables in the range of 1,100–2,500 nm, measured on an XDS rapid liquid analyzer with VISION software in transmission mode (Foss NIR Systems, Silver Spring, MD, USA). NIR spectra of Lonicera japonica solution obtained from the ethanol precipitation process were measured to estimate the chlorogenic acid content. HPLC was used as the reference method for chlorogenic acid determination, as recommended by the Chinese Pharmacopoeia (CHP, 2010 Edition) monograph for Lonicera japonica. The Lonicera japonica dataset is supplied in Data Sheet 2. In this study, the training data consisted of 144 samples, and the remaining 72 samples were used for validation.

## Multivariate Data Analyses

The spectral pretreatment of the data was performed using a chemometric tool (SIMCA-P+ 11.5, Umetrics, Sweden). Data analysis was conducted using the Unscrambler 9.7 software package (Camo Software AS, Norway) and Matlab version 7.0 (MathWorks Inc., USA). Some of the algorithms were developed by Nørgaard et al. and are readily downloadable from http://www.models.life.ku.dk/iToolbox.

### Multivariate Calibration

A procedure for the development and optimization of multivariate calibration models using processing trajectory is summarized in **Figure 2**. The rationale behind this approach is that more than one path, i.e., more than one parameter combination, can lead to a good model. The procedure was therefore used to track and evaluate modeling processes with different parameters, including spectral pretreatments, variable selections, latent factors, and calibration methods. The model evaluation indices were RMSEP and RPD.
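The two evaluation indices can be computed as in the sketch below. It assumes the common definitions RMSEP = sqrt(mean squared prediction error) and RPD = SD of the reference values divided by RMSEP; some authors use the bias-corrected SEP in the denominator instead, so this is one convention rather than necessarily the exact formula used by the authors. The function names and toy values are hypothetical.

```python
import math
import statistics

def rmsep(reference, predicted):
    """Root mean square error of prediction over a validation set."""
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def rpd(reference, predicted):
    """Ratio of the sample standard deviation of the reference values to RMSEP."""
    return statistics.stdev(reference) / rmsep(reference, predicted)

# Toy validation data, invented for illustration only.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.9, 5.1]

print(round(rmsep(reference, predicted), 4))
print(round(rpd(reference, predicted), 2))
```

A lower RMSEP and a higher RPD both indicate better predictive performance; because RPD scales the error by the spread of the reference values, it allows comparison across datasets with different units.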

# RESULTS AND DISCUSSION

# Raw Spectra

The raw NIR spectra of the tablet and the Lonicera japonica solution are shown in **Figure 1**, which presents the characteristic peak locations of the active substance in each spectral dataset. In the NIR transmittance spectra of the tablet (**Figure 1A**), there were several broad peaks located at around 10,000, 8,830, 8,200, and 7,840 cm−<sup>1</sup>, which originated from several components of the corresponding drug tablet. In addition, there were large fluctuations in the combination region of fundamental vibrations in the raw spectra of the Lonicera japonica solution. Therefore, the spectral region of 1,100–1,900 nm was selected.

# Processing Trajectory of PLS Model

The modeling procedure using processing trajectory is shown in **Figure 2**. Taking the tablet dataset as an example, the dataset was split into calibration and validation sets, and the spectra were preprocessed using different methods, including first derivative (1st), second derivative (2nd), and Savitzky-Golay smoothing with 9 points [SG(9)]. The iPLS, BiPLS, and SiPLS methods were then used to select variables. Finally, the PLS and Bagging-PLS models were developed with 1 to 10 latent factors. Both RPD and RMSEP were calculated to evaluate each model. **Figure 2** shows the different modeling paths and model results. The parameters for the PLS and Bagging-PLS models of API in tablets and chlorogenic acid in Lonicera japonica solution are shown in Tables S1, S2.
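The idea of tracking every modeling path, rather than fixing one parameter at a time, amounts to enumerating the full grid of parameter combinations and scoring each one. The sketch below assumes the grid implied by the text (raw spectra plus three pretreatments, three variable selection methods, two calibration methods, and 1 to 10 latent factors); the scoring function is a meaningless placeholder, since a real score would require fitting and validating a model for each path.

```python
from itertools import product

# Parameter grid assumed from the description in the text.
pretreatments = ["raw", "1st", "2nd", "SG(9)"]
variable_selections = ["iPLS", "BiPLS", "SiPLS"]
calibration_methods = ["PLS", "Bagging-PLS"]
latent_factors = range(1, 11)

# Every modeling path is one combination of the four parameters.
paths = list(product(pretreatments, variable_selections, calibration_methods, latent_factors))

def best_path(candidate_paths, score):
    """Return the path with the lowest score (e.g., lowest RMSEP)."""
    return min(candidate_paths, key=score)

def placeholder_score(path):
    """Hypothetical stand-in for RMSEP; NOT derived from any real model."""
    _, _, _, n_factors = path
    return abs(n_factors - 4)

print(len(paths))
print(best_path(paths, placeholder_score))
```

In contrast, a step-by-step search would evaluate only a small slice of this grid, which is why it can miss combinations that perform well only jointly.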

The RPD and RMSEP showed similar trends in the PLS and Bagging-PLS models. In **Figure 2A**, the RMSEP decreased as the number of latent factors increased, across the different pretreatment methods and variable selections. The RPD likewise increased with the number of latent factors while that number was small; however, once the number of latent factors exceeded a certain value, the RPD decreased. Changes in RMSEP and RPD were not pronounced for the 1st and 2nd derivative spectra, and the other pretreatment methods were superior to 1st and 2nd derivative processing. The models for the Lonicera japonica dataset are shown in **Figure 2B**. The results were similar to those for the tablet dataset: the other pretreatment methods again gave better models than 2nd derivative processing.

Moreover, this finding indicates that more than one modeling path can lead to a successful model. The data obtained from the different modeling paths and the model classification are shown in **Figure 3**. There were six good models with RPD between 3 and 3.5 (**Figure 3A**), and some very good model paths with RPD values greater than 3.5 (**Figure 3B**). In the previous modeling routine, the parameters were optimized one at a time according to the resulting prediction accuracy. Compared with path modeling, this step-by-step parameter optimization was the poorer approach (Table S3). The optimal parameters of the API model obtained by step-by-step optimization were raw spectra and iPLS-selected variables with 3 latent factors; the model performance was fair. In contrast, the processing trajectory showed that six good models could be obtained by combining SG(9) pretreatment with BiPLS-selected variables.

# Development and Validation of Calibration Models

The best nonsystematic parameter combination for the chlorogenic acid Bagging-PLS model was raw spectra with iPLS or BiPLS variable selection and 2 latent factors; the model performance was good. However, the processing trajectory yielded 24 very good models with different systematic parameter combinations. The best of these was developed by Bagging-PLS with SG(9) spectral pretreatment and SiPLS-selected variables with 6 latent factors. This demonstrates that the model obtained through the processing trajectory was better than the one optimized step by step. In other words, the optimal systematic parameter combination can be obtained via the processing trajectory, and bagging ensemble modeling and variable selection can improve prediction accuracy and robustness.

The model validity was evaluated in terms of RMSEP and RPD values. Taking the tablet dataset as an example, **Figure 2A** shows that the model established using Bagging-PLS with SG(9) pretreatment and BiPLS-selected variables with 10 latent factors had the best performance. The RMSEP and RPD values of the validation set were 0.4126% and 3.2234, respectively. In contrast, the RMSEP and RPD of the model optimized step by step were 0.5164% and 2.5755, respectively. These results also show that the model developed with Bagging-PLS had good predictive performance. Similarly, the model for the Lonicera japonica solution was developed using Bagging-PLS with SG(9) spectral pretreatment and SiPLS-selected variables with 6 latent factors. The RMSEP and RPD were 0.0728 mg/mL and 3.9166, respectively. The RMSEP and RPD of the model optimized step by step were 0.0891 mg/mL and 3.1966, respectively. **Figure 4** presents the data obtained with the Bagging-PLS models for the two datasets. The predicted values agreed reasonably well with the HPLC results. These parameters indicate that NIRS can be used for the determination of API in tablets and chlorogenic acid in Lonicera japonica solution during the ethanol precipitation process.
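The reported figures can be cross-checked with a short calculation. Assuming the common convention RPD = SD / RMSEP (an assumption, since the exact formula is not restated here), the product RMSEP × RPD should give the same standard deviation of the validation reference values for both models of a dataset. The numbers below are those quoted in the text; the variable names are illustrative.

```python
# (RMSEP, RPD) pairs quoted in the text for each dataset and optimization route.
reported = {
    "tablet": {"trajectory": (0.4126, 3.2234), "step_by_step": (0.5164, 2.5755)},
    "lonicera": {"trajectory": (0.0728, 3.9166), "step_by_step": (0.0891, 3.1966)},
}

# Under RPD = SD / RMSEP, the implied SD of the validation set is RMSEP * RPD.
implied_sd = {
    dataset: {route: rmsep * rpd for route, (rmsep, rpd) in routes.items()}
    for dataset, routes in reported.items()
}

for dataset, routes in implied_sd.items():
    print(dataset, {route: round(sd, 4) for route, sd in routes.items()})
```

For each dataset the two implied standard deviations agree closely, which is consistent with both models having been validated on the same samples.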

# CONCLUSION

We proposed processing trajectory to optimize the parameters of multivariate calibration, namely spectral pretreatment, latent factors, variable selection, and calibration method. The models were developed using PLS and Bagging-PLS with different spectral pretreatments and variable selection methods under different numbers of latent factors. The chemometric indicators (RMSEP and RPD) were used to evaluate the models. The different PLS and Bagging-PLS models were used to quantify the API in tablets and chlorogenic acid in Lonicera japonica solution during the ethanol precipitation process. The results illustrate the advantages and feasibility of processing trajectory in the development and optimization of multivariate calibration models, as well as the effectiveness of the bagging model and variable selection in improving prediction accuracy and robustness.

In conclusion, the application of processing trajectory for model optimization shows excellent results in developing a reliable and robust model. The proposed procedure should be translated into an algorithm to be integrated into PLS software, helping to obtain better models.

# AUTHOR CONTRIBUTIONS

YQ and ZW conceived and designed the study. NZ performed the experiment with the help of LM, XH, and XL. NZ and ZW wrote the manuscript. All authors read and approved the final manuscript.

# ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (81773914), Beijing Nova Program of China (xx2016050), Science Fund for Distinguished Young Scholars in BUCM (2015-JYB-XYQ-003) and Fund for young teachers in BUCM (2016-JYB-JSMS-061).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00262/full#supplementary-material

# REFERENCES

Zhao, N., Wu, Z. S., Zhang, Q., Shi, X. Y., Ma, Q., and Qiao, Y. J. (2015). Optimization of parameter selection for partial least squares model development. Sci. Rep. 5:11647. doi: 10.1038/srep11647

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhao, Ma, Huang, Liu, Qiao and Wu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Fusion of MALDI Spectrometric Imaging and Raman Spectroscopic Data for the Analysis of Biological Samples

### Oleg Ryabchykov<sup>1,2</sup>, Juergen Popp<sup>1,2</sup> and Thomas Bocklitz<sup>1,2</sup>\*

<sup>1</sup> Spectroscopy and Imaging Research Department, Leibniz Institute of Photonic Technology, Member of Leibniz Health Technology, Jena, Germany, <sup>2</sup> Institute of Physical Chemistry and Abbe Center of Photonics, Friedrich Schiller University Jena, Jena, Germany

### Edited by:

Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam

### Reviewed by:

Xia Guan, Louisiana State University, United States Ennio Carbone, Università degli Studi Magna Græcia di Catanzaro, Italy Frédéric Jacques Cuisinier, Université de Montpellier, France Anna V. Sharikova, University at Albany, United States

\*Correspondence:

Thomas Bocklitz thomas.bocklitz@uni-jena.de

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 17 December 2017 Accepted: 08 June 2018 Published: 16 July 2018

### Citation:

Ryabchykov O, Popp J and Bocklitz T (2018) Fusion of MALDI Spectrometric Imaging and Raman Spectroscopic Data for the Analysis of Biological Samples. Front. Chem. 6:257. doi: 10.3389/fchem.2018.00257

Despite the large number of imaging techniques available for the characterization of biological samples, no universal one has been reported yet. In this work, a data fusion approach was investigated for combining Raman spectroscopic data with matrix-assisted laser desorption/ionization (MALDI) mass spectrometric data. This improves the image analysis of biological samples because Raman and MALDI information can be complementary. While MALDI spectrometry yields detailed information regarding the lipid content, Raman spectroscopy provides valuable information about the overall chemical composition of the sample. The combination of Raman spectroscopic and MALDI spectrometric imaging data helps to distinguish different regions within the sample with a higher precision than would be possible using either technique alone. We demonstrate that a data weighting step within the data fusion is necessary to reveal additional spectral features. The selected weighting approach was evaluated by examining the proportions of variance within the data explained by the first principal components of a principal component analysis (PCA) and by visualizing the PCA results for each data type and for the combined data. In summary, the presented data fusion approach provides a concrete guideline on how to combine Raman spectroscopic and MALDI spectrometric imaging data for biological analysis.

Keywords: MALDI-TOF, Raman imaging, data combination, data fusion, normalization, PCA

# INTRODUCTION

Different analytical methods can be utilized for the analysis of biomedical samples (e.g., cells and tissues) to highlight a certain aspect of the sample, e.g., morphological microstructure, distribution of electronic chromophores, molecule classes, or special proteins. Among the label-free imaging approaches, matrix-assisted laser desorption/ionization (MALDI) spectrometry and Raman microscopy are some of the most powerful techniques for the investigation of biomedical samples. Raman spectroscopy is a non-destructive spectroscopic method that provides complex molecular information about the general chemical composition of the sample with a rather high spatial resolution (Abbe limit), which allows subcellular features to be highlighted (Kong et al., 2015). The drawback of Raman imaging lies in its weak scattering efficiency, which makes the sampling time rather long for large-area imaging. Raman spectroscopic imaging has demonstrated its potential for biomedical diagnosis in numerous cancer-related studies (Tolstik et al., 2014), biological material analysis (Butler et al., 2016), cell characterization studies (Ramoji et al., 2012), and many other biomedical applications (Matousek and Stone, 2013; Ember et al., 2017).

On the other hand, MALDI mass spectrometry provides information on specific substances, such as lipids or proteins (Fitzgerald et al., 1993). MALDI is a soft ionization technique utilized for mass-spectrometric imaging (Gessel et al., 2014) to determine large organic molecules and biomolecules that remain undetected by conventional ionization techniques. This technique has been employed in clinical parasitology (Singhal et al., 2016), microbial identification (Urwyler and Glaubitz, 2016), and cancer tissue investigation (Hinsch et al., 2017).

Raman spectroscopic and MALDI mass spectrometric imaging both offer a high molecular sensitivity. Moreover, Raman spectroscopy has been sequentially applied together with different mass spectrometric techniques to address a variety of biological tasks, such as the characterization of succinylated collagen (Kumar et al., 2011), investigation of microbial cells (Wagner, 2009), identification of fungal strains (Verwer et al., 2014), and characterization of lipid extracts from brain tissue (Köhler et al., 2009). In all the aforementioned studies, the Raman and mass spectrometric data were analyzed separately and then summarized or compared to each other (Masyuko et al., 2014; Bocklitz et al., 2015; Muhamadali et al., 2016). To significantly increase the information content, Raman spectroscopic and MALDI mass spectrometric imaging data have to be co-registered (Bocklitz et al., 2013) and then subjected to a high-level (distributed) data fusion. This means that each data type is analyzed separately to obtain the respective scores, which are then fused together. Alternatively, spectroscopic imaging can be used for mapping an area that is suitable for further investigation by means of MALDI spectrometric imaging (Fagerer et al., 2013), or a certain mass peak can be used to define an area from which the Raman spectra are analyzed (Bocklitz et al., 2013). Such a hierarchical pipeline corresponds to a decentralized data fusion approach.

In the present work, we introduced an analytical method to perform a low-level (centralized) fusion of Raman and MALDI imaging data. Because the experimental implementation of correlated imaging is challenging in many aspects (Masyuko et al., 2013), we utilized a computational approach to combine imaging data obtained by MALDI spectrometry and Raman spectroscopy. The correlation of Raman spectroscopy with mass spectrometric imaging techniques such as MALDI (Ahlf et al., 2014) or secondary ion mass spectrometry (SIMS) (Lanni et al., 2014) has proved its usefulness for biological applications. Moreover, a combination of MALDI imaging data with optical microscopy could attenuate instrumental effects (Van De Plas et al., 2015), and a joint analysis of vibrational and MALDI mass spectra could provide valuable information on brain tissue (Van De Plas et al., 2015; Lasch and Noda, 2017). Nevertheless, even if Raman and MALDI spectra are obtained by correlated imaging, each type of spectra shows its own specific features and should be preprocessed separately. Because the measurement techniques are based on different physical effects, the differences in data dimensionality and dynamic range can affect the contribution of each data type to the analysis. Therefore, a weighting coefficient that balances the influence of Raman spectroscopic and MALDI spectrometric data in the data fusion center is required.

# MATERIALS AND METHODS

# Experimental Details

We demonstrated the data fusion on an example dataset of MALDI spectrometric and Raman spectroscopic scans obtained from the same 10 µm cryosection of a mouse brain (Mus musculus). The sample was cut on a cryostat and then dried on a precooled conductive ITO-coated glass slide. Subsequently, Raman spectra were obtained using a confocal Raman microscope CRM-alpha300R (WITec, Ulm, Germany) with excitation by a 633 nm HeNe laser (Melles Griot). The laser power was adjusted to about 10 mW. The laser was coupled through an optical fiber into a Zeiss microscope. A spectral map was obtained by a raster scan on a 25 µm grid with a dwell time of 2 s and a pre-bleaching time of 1 s.

After the Raman scan, MALDI mass spectrometric imaging was performed with a common matrix, α-cyano-4-hydroxycinnamic acid (5 mg/mL) in 50% acetonitrile and 0.2% trifluoroacetic acid. The ImagePrep station (Bruker Daltonics) was used to prepare and apply the matrix on the sample. The MALDI-time-of-flight (MALDI-TOF) spectrometric map was obtained on an Ultraflex III MALDI-TOF/TOF mass spectrometer (Bruker Daltonics, Bremen, Germany). A "smartbeam" laser (λ = 355 nm, repetition rate 200 Hz) was used. The spectrometer was calibrated with an external standard, a peptide calibration mixture (Bruker Daltonics). The measurements were performed in the positive reflectron mode with 500 shots per spectrum and a spatial resolution of 75 µm.

Further experimental details for both data types and an example of a hierarchical data fusion implementation can be found in the report by Bocklitz et al. (2013). Nevertheless, in the context of a further discussion, it is important to highlight that in MALDI mass spectrometric imaging a matrix suitable for the analysis of the lipid content was applied.

# Preprocessing of Raman Spectroscopic Data

The influence of corrupting effects (e.g., cosmic spikes, fluorescence) on Raman spectra cannot be avoided completely. Thus, the development of complex preprocessing routines (Bocklitz et al., 2011) is required. To allow further analysis of the Raman spectra obtained with different calibrations, all spectra need to be interpolated to the same wavenumber axis (Dörfer et al., 2011). Moreover, keeping all the spectra in a single data matrix simplifies further processing, so it is advantageous to perform the calibration as one of the first steps of the preprocessing workflow (**Figure 1**). Besides the wavenumber calibration, intensity calibration should be performed for the comparison of the measurements obtained with different devices or in the case where some changes in the measurement device have occurred (Dörfer et al., 2011).

Calibration is always needed for a reliable analysis, especially if the measurements were performed over a long time period or if the settings of the device were changed between the measurements. In contrast, the following step within the preprocessing workflow (i.e., noise removal) is optional. Among smoothing methods, only the running median with a relatively large window is applicable for the removal of cosmic ray spikes. Unfortunately, filtering with a large window may corrupt the Raman bands themselves. Alternatively, 2–3 spectra per point can be acquired to eliminate the spikes that are not present in each spectrum. However, this approach increases the measurement time dramatically and is therefore not suitable for Raman imaging, where a large number of spectra are recorded. Thus, specialized spike correction approaches, such as the wavelet transform (Ehrentreich and Summchen, 2001), correlation methods (Cappel et al., 2010), calculation of the Laplacian of the spectral data matrix (Schulze and Turner, 2014; Ryabchykov et al., 2016), or the difference between the original and a smoothed spectrum (Zhang and Henson, 2007), must be used for spike removal.
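As a rough illustration of the last option mentioned above (the difference between the original and a smoothed spectrum), the following Python sketch flags points that exceed a running-median-smoothed copy by more than a threshold and replaces them with the smoothed value. The function names, the window size, and the threshold are illustrative assumptions, not the settings or the code used in this work.

```python
from statistics import median

def running_median(spectrum, half_window):
    """Smooth a spectrum with a running median; the window shrinks at the edges."""
    n = len(spectrum)
    return [
        median(spectrum[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]

def remove_spikes(spectrum, half_window=2, threshold=5.0):
    """Replace points lying far above the running median by the median value.

    `threshold` is an absolute intensity difference (hypothetical default).
    """
    smoothed = running_median(spectrum, half_window)
    return [s if (x - s) > threshold else x for x, s in zip(spectrum, smoothed)]

# Toy spectrum with a single cosmic spike at index 4
spectrum = [1.0, 1.2, 1.1, 1.3, 50.0, 1.2, 1.0, 1.1]
cleaned = remove_spikes(spectrum)
```

Only the flagged point is modified, so genuine Raman bands that stay below the threshold are left untouched, in contrast to smoothing the whole spectrum.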

The next step in the preprocessing workflow for Raman spectra is fluorescence background removal. In this work, the statistics-sensitive nonlinear iterative peak-clipping (SNIP) algorithm (Ryan et al., 1988) was used for baseline estimation. The SNIP algorithm can be utilized for background estimation in a number of spectral measurements, such as X-ray and mass spectra.
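The core idea of iterative peak clipping can be sketched in a few lines of Python. This is a simplified variant, assuming an increasing clipping window and omitting the intensity transform that SNIP implementations commonly apply before clipping; the function name and the number of iterations are hypothetical and do not reflect the settings used in this work.

```python
def snip_baseline(spectrum, iterations):
    """Estimate a baseline by iterative peak clipping (simplified SNIP).

    For half-widths p = 1..iterations, each point is replaced by the minimum
    of itself and the mean of its two neighbours at distance p.
    """
    baseline = list(spectrum)
    n = len(baseline)
    for p in range(1, iterations + 1):
        previous = list(baseline)  # clip against the result of the last pass
        for i in range(p, n - p):
            clipped = (previous[i - p] + previous[i + p]) / 2.0
            baseline[i] = min(previous[i], clipped)
    return baseline

# Toy spectrum: flat background of 2.0 with one narrow peak
spectrum = [2.0, 2.0, 2.0, 5.0, 9.0, 5.0, 2.0, 2.0, 2.0]
baseline = snip_baseline(spectrum, iterations=3)
corrected = [x - b for x, b in zip(spectrum, baseline)]
```

In this toy case the peak is clipped away completely after three passes, so the estimated baseline is flat and the corrected spectrum retains only the peak. The number of iterations should roughly match the half-width of the broadest band that is to be preserved.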

After baseline correction, the Raman spectra must be normalized (Afseth et al., 2006) to complete the basic preprocessing. There are several normalization approaches (e.g., vector normalization, normalization to integrated spectral intensity, or normalization to a single peak intensity value) that enhance the stability of the spectral data. In this work, we used vector normalization and l1-normalization (Horn and Johnson, 1990) for Raman spectra. The difference between normalization to integrated spectral intensity and l1-normalization is that the latter utilizes absolute intensity values. As a result, the difference between both normalization approaches becomes more significant when negative values appear in the baseline-corrected spectra due to noise or baseline correction artifacts.
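The three normalizations discussed above can be written compactly as follows. This is a minimal Python sketch with hypothetical function names; the integrated spectral intensity is approximated by a plain sum, which assumes an equidistant wavenumber axis.

```python
import math

def vector_normalize(spectrum):
    """Divide by the Euclidean (l2) norm."""
    norm = math.sqrt(sum(x * x for x in spectrum))
    return [x / norm for x in spectrum]

def l1_normalize(spectrum):
    """Divide by the sum of absolute intensities (l1-norm)."""
    norm = sum(abs(x) for x in spectrum)
    return [x / norm for x in spectrum]

def integrated_intensity_normalize(spectrum):
    """Divide by the signed sum of intensities (integrated spectral intensity)."""
    total = sum(spectrum)
    return [x / total for x in spectrum]

# Toy baseline-corrected spectrum with one negative value caused by noise
spectrum = [3.0, -1.0, 4.0, 2.0]
by_vector = vector_normalize(spectrum)
by_l1 = l1_normalize(spectrum)                      # divisor: 10.0
by_area = integrated_intensity_normalize(spectrum)  # divisor: 8.0
```

The single negative value is enough to make the two divisors differ (10.0 vs. 8.0), which is the effect described in the paragraph above. For spectra without negative values the two normalizations coincide.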

# Preprocessing of MALDI Spectrometric Data

Although the measurement techniques themselves differ dramatically for Raman and MALDI mass spectrometric imaging data, the preprocessing of these data has a lot in common. The m/z values are set according to an internal calibration and may "float" slightly from one measurement to another. Therefore, a phase correction along the m/z axis must be performed within the preprocessing workflow (**Figure 2**) to ensure that the spectra obtained in different measurements are comparable. For this purpose, it is advisable to use the stable intense peaks within the phase correction routine (Gu et al., 2006).

From a theoretical point of view, MALDI spectra should not feature a spectral background. Nevertheless, in measured MALDI spectra a background is present. In literature, a background present in MALDI mass spectra is also known as "chemical noise background" (Krutchinsky and Chait, 2002). This type of noise results from matrix impurities and unstable ion clusters created during the sample scanning.

Similarly to Raman spectral preprocessing, the SNIP algorithm (Ryan et al., 1988) can be used to eliminate the background from mass spectra. Another complication in the analysis of MALDI spectra results from the fact that, even after the phase correction, peak positions vary slightly among different spectra. An interpolation procedure, which is applied in Raman data preprocessing, would corrupt the sharp peaks found in MALDI spectra and is therefore not applied. To enable a direct comparison of the spectra, a binning procedure is applied instead. This procedure is based on the equalization of the m/z values of peak positions within a certain range. Since the average peak width along the m/z axis increases with increasing mass, the binning range is set with a so-called tolerance relative to the mass values. In contrast to Raman spectroscopy, intensity calibration for MALDI mass spectrometric imaging is not required. Nevertheless, normalization may be applied. Various types of normalization are used for MALDI mass spectrometric imaging data: total ion count (TIC), vector norm (RMS), median, square root, logarithmic, and normalization to a noise level. In contrast to the Raman spectral data, MALDI mass spectra do not feature negative values. Thus, TIC normalization and normalization to the l1-norm, which is a sum of absolute values, are equal for MALDI spectra. If the significance level of the data is high, the normalization may not be necessary for the subsequent analysis.
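The binning with a relative tolerance can be illustrated by the following Python sketch. It uses a simple greedy grouping, which is an assumption made for illustration; the function name and the tolerance value are hypothetical and do not reproduce the routine used in this work.

```python
def bin_peaks(mz_values, tolerance):
    """Map each m/z value to the mean of its group of nearby peaks.

    A value joins the current group if it lies within `tolerance` (relative,
    e.g. 0.002 = 0.2 %) of the group's first m/z value; otherwise it starts
    a new group. The allowed range therefore grows with the mass.
    """
    groups = []
    for mz in sorted(mz_values):
        if groups and (mz - groups[-1][0]) <= tolerance * groups[-1][0]:
            groups[-1].append(mz)
        else:
            groups.append([mz])
    return {mz: sum(group) / len(group) for group in groups for mz in group}

# Peak positions pooled from several spectra (toy values)
peaks = [760.5, 760.7, 782.5, 782.6, 810.6]
mapping = bin_peaks(peaks, tolerance=0.002)
```

After binning, peaks that belong to the same group share one m/z value, so the spectra can be arranged in a common data matrix without interpolating the sharp peaks.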

# Computational Details

For MALDI data acquisition and calibration, the flexImaging software, version 3.0 (Bruker Daltonics), was used. The data processing was performed in R (R Core Team, 2017) using the packages akima (Gebhardt)<sup>1</sup>, Peaks (Morhac)<sup>2</sup>, readBrukerFlexData (Gibb)<sup>3</sup>, rsvd (Erichson)<sup>4</sup>, spatstat (Baddeley and Turner, 2005), and Spikes (Ryabchykov et al., 2016).

<sup>1</sup>Gebhardt, H. A. "akima: Interpolation of Irregularly and Regularly Spaced Data."

<sup>2</sup>Morhac, M. "Peaks: Peaks."

<sup>3</sup>Gibb, S. "readBrukerFlexData: Reads Mass Spectrometry Data in Bruker \*flex Format."

<sup>4</sup>Erichson, N. B. "rsvd: Randomized Singular Value Decomposition."

Prior to the data preprocessing and data fusion, the MALDI and Raman spectra were interpolated to the same (spatial) grid by utilizing a co-registration framework. Based on the false color images of Raman spectroscopic and MALDI spectrometric scans, 6 points clearly representing the same positions on every scan were manually selected. The coordinates of the Raman spectroscopic map were then transformed to the coordinate system of the MALDI mass spectrometric map. Subsequently, the Raman spectra were interpolated to the grid of the MALDI mass spectral map. To perform this interpolation, every point within the Raman grid was assigned to the nearest point within the MALDI grid. After that, the average of the Raman spectra, assigned to the same point within the MALDI grid, was calculated. Two spectral maps were thus obtained and aligned in a point-wise manner.
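The nearest-point assignment and averaging described above can be sketched in Python as follows. The sketch assumes that the Raman coordinates have already been transformed into the MALDI coordinate system; the function name and the toy coordinates are hypothetical, and a brute-force nearest-neighbour search is used for brevity.

```python
from collections import defaultdict

def average_onto_grid(raman_points, raman_spectra, maldi_points):
    """Assign each Raman point to its nearest MALDI grid point and average.

    Points are (x, y) tuples in the MALDI coordinate system. Returns a dict
    {maldi_index: mean Raman spectrum}; MALDI points that receive no Raman
    spectrum are absent from the result.
    """
    assigned = defaultdict(list)
    for (x, y), spectrum in zip(raman_points, raman_spectra):
        nearest = min(
            range(len(maldi_points)),
            key=lambda j: (maldi_points[j][0] - x) ** 2 + (maldi_points[j][1] - y) ** 2,
        )
        assigned[nearest].append(spectrum)
    return {
        j: [sum(channel) / len(spectra) for channel in zip(*spectra)]
        for j, spectra in assigned.items()
    }

# Toy example: 25 µm Raman steps mapped onto a 75 µm MALDI grid
maldi_points = [(0.0, 0.0), (75.0, 0.0)]
raman_points = [(0.0, 0.0), (25.0, 0.0), (50.0, 0.0), (75.0, 0.0)]
raman_spectra = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
averaged = average_onto_grid(raman_points, raman_spectra, maldi_points)
```

Because the Raman grid is finer than the MALDI grid, several Raman spectra typically fall onto one MALDI pixel, and averaging them also reduces noise in the resulting Raman map.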

After the alignment, the Raman spectroscopic and MALDI mass spectrometric imaging data were preprocessed. During the preprocessing, the wavenumber calibration of the Raman spectra and the phase correction of MALDI spectra were performed. The MALDI mass spectrometric imaging data were subsequently subjected to noise removal, background correction, and TIC normalization. The Raman spectra were corrected for fluorescence background and vector normalized. The SNIP algorithm was used for background estimation in both cases.

After the preprocessing, Raman and MALDI mass spectral data differed in their dimensionality and in dynamic range. Data with different dynamic ranges would contribute unequally to a further analysis, and consequently the spectral matrices have to be additionally weighted before performing the PCA. The weighting coefficient was selected as the ratio between the l1-norms of the matrices, which are sums over the absolute values in the matrix. After the weighting, the data were combined in a single matrix and analyzed with a PCA. To illustrate the benefit of data fusion and weighting, we also analyzed the un-weighted data in a combined manner and each data type separately. We also investigated the case where the same normalization approach was applied to both data types and no additional weighting was required. When the Raman spectra were normalized to the total spectral intensity, which is equivalent to TIC normalization of mass spectra, the data matrices had equal l1-norms.
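A minimal Python sketch of this weighting step is given below. It assumes that the MALDI block is multiplied by the ratio of the l1-norms so that both blocks end up with equal l1-norms; which block is scaled, as well as the function names and toy values, are assumptions made for illustration.

```python
def l1_norm(matrix):
    """Sum of absolute values over all entries of a matrix (list of rows)."""
    return sum(abs(value) for row in matrix for value in row)

def fuse_with_weighting(raman, maldi):
    """Scale the MALDI block to the l1-norm of the Raman block and
    concatenate both blocks pixel by pixel (row by row)."""
    weight = l1_norm(raman) / l1_norm(maldi)
    fused = [
        r_row + [weight * value for value in m_row]
        for r_row, m_row in zip(raman, maldi)
    ]
    return fused, weight

# Two pixels: vector-normalized Raman rows and TIC-normalized MALDI rows
raman = [[0.6, 0.8], [0.8, 0.6]]
maldi = [[0.5, 0.25, 0.25], [0.1, 0.2, 0.7]]
fused, weight = fuse_with_weighting(raman, maldi)
```

The fused matrix has one row per pixel and one column per Raman or MALDI variable, and it can be passed to a PCA directly.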

# RESULTS AND DISCUSSION

Both Raman spectroscopic and MALDI mass spectrometric imaging data provide different insights into the chemical composition of the sample. Information on a broad range of molecules can be obtained from the Raman spectra. This information can be complemented by detailed information on lipid content, obtained from the MALDI data. To utilize both types of information together, a data fusion must be applied. This data fusion may be performed during different stages of the analysis workflow. Therefore, the architecture of the data processing workflow is dependent on the selected data fusion approach. These approaches can be divided into different architecture types (Castanedo, 2013), including the centralized, decentralized, and distributed architectures discussed below.


The decentralized and distributed architectures have already shown their effectiveness for biomedical investigations (Bocklitz et al., 2013; Ahlf et al., 2014). The current work focuses on the centralized data fusion approach, also called low-level data fusion. In contrast to decentralized and distributed architectures, the centralized architecture shows a simpler workflow (**Figure 3A**). The data are combined in early steps of the analysis, directly after the preprocessing and even before the dimension reduction. At the data fusion center, where the different types of data are combined, an additional normalization or scaling of the data may be required to weight the influence of the different data types on the global model. The need for this weighting step arises from the differences in the data dimensionality, measurement units, and dynamic ranges of the different measurement techniques. It is worth mentioning that the weighting is not a major issue in high-level data fusion approaches, which usually deal with standardized low-dimensional outputs of preliminary analysis in the data fusion center. However, a low-level data fusion (such as the applied centralized data fusion model) deals directly with preprocessed spectra of different types. Thus, the data scaling may dramatically influence the efficiency of the feature extraction.

To investigate the impact of data weighting, we searched for a marker that would allow an objective comparison of different data fusion and normalization approaches. This weighting scheme is designed for biological samples, i.e., samples with a complex chemical composition, for which a large number of independent features have to be identified for an appropriate description. When a PCA is applied for dimension reduction, a large portion of the data variance is therefore expected to be spread among multiple principal components (PCs), and the optimal approach should correspond to the slowest rise of the cumulative proportion of variance with the number of PCs.
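The marker described above can be computed from the variances of the PCs alone. The following Python sketch uses hypothetical PC variances (not values from this study) to show how a slower rise of the cumulative curve indicates that the variance is spread among more components.

```python
from itertools import accumulate

def cumulative_proportion(pc_variances):
    """Cumulative share of the total variance explained by the first k PCs."""
    total = sum(pc_variances)
    return [running / total for running in accumulate(pc_variances)]

# Hypothetical PC variances (sorted in descending order) for two weightings
concentrated = [8.0, 1.0, 0.5, 0.5]  # one dominant component
spread = [4.0, 3.0, 2.0, 1.0]        # variance shared by several components

curve_a = cumulative_proportion(concentrated)  # [0.8, 0.9, 0.95, 1.0]
curve_b = cumulative_proportion(spread)        # [0.4, 0.7, 0.9, 1.0]

# Under the criterion above, the curve lying lower for every k is preferred
slower_rise = all(b <= a for a, b in zip(curve_a, curve_b))
```

As noted in the text, such curves should only be compared for data with the same number of variables.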

The variances of the data explained by the PCA are shown in **Figure 4** for the normalization and fusion approaches described in section Computational Details. Unfortunately, a direct comparison between the cumulative proportions of variance obtained from the Raman data, the MALDI mass spectral data, and their combination is not suitable due to the different number of variables. However, different trends in the variance explained by the PCs in data with the same dimensionality can be interpreted. The left side of **Figure 4** shows that the variance of vector normalized Raman data is spread among a larger number of PCs than that of the total area normalized Raman data. This finding indicates that the vector normalization allows extracting a larger number of significant features from Raman data. Because the Raman spectra were vector normalized and the MALDI spectra were TIC normalized, the Raman data contribute more to the overall data variance than the MALDI data. Consequently, the PCA will focus on the variations in the Raman data, and the variations in the MALDI data will have only a small influence. Alternatively, the two datasets can be balanced by normalizing spectra of both types to their l1-norms. By definition, this norm is a sum of absolute values. It takes dimensionality and scaling of the data into account, so no additional weighting is required. TIC normalization performed on MALDI data is already equal to l1-normalization because there are no negative values present in the mass spectra. The right side of **Figure 4** clearly shows that there is a marked difference between the approach not taking the data scaling into account and the approaches based on weighting or identical normalization. However, no significant benefit was observed when comparing the weighting to the identical normalization approach.

To further investigate the influence of weighting on data fusion, the weighting coefficient was varied in a range from 1 to 20 and a PCA was performed for every case. The extracted curves of the cumulative proportion of the variance were organized as a surface plot (**Figure 5**). To make the interpretation easier, the curves that correspond to the data combination without weighting and with weighting based on the ratio of l1-norms are additionally highlighted in **Figure 5**. Although no single weighting coefficient is globally the best, the proposed weighting coefficient lies close to the area where the data variance is spread among multiple PCs. Thus, fusing data in this manner enables the PCA to extract a larger number of reliable features.

Although an optimal data fusion has been achieved as described above, a direct comparison of cumulative proportions of variance explained by the PCA for data with different dimensionalities may be misleading. Hence, the results obtained from the combined approach and from the separate data analysis (**Figure 6**) were checked by inspecting the PCA loadings and scores. The first three PCs were visualized separately for the MALDI spectrometric imaging data (**Figures 6A,C**), Raman spectroscopic imaging data (**Figures 6B,D**), and their combination (**Figures 6E–G**).

The comparison of the PCA scores in **Figure 6** shows that the image of the MALDI-Raman combination (**Figure 6G**) depicts clearer spatial features of the sample (compared to **Figures 6C,D**). The corresponding false-color score composite (**Figure 6G**) is less noisy and looks subjectively better than the images obtained separately from the MALDI mass spectrometric (**Figure 6C**) and Raman spectroscopic data (**Figure 6D**). Moreover, the loading vector of the third PC of the MALDI spectra (shown in blue color in **Figure 6A**) has positive and negative values related to isotopes of the same molecules. This means that it represents mostly noise and variations in the signal-to-noise ratio. On the other hand, the MALDI part of the loadings of the third PC in the combined analysis (shown in blue color in **Figure 6E**) reflects a joint behavior for the isotopes of the same ions. Moreover, the Raman part of this PC contains the peaks associated with lipids (Notingher and Hench, 2006), namely the C=C stretching region (1,655–1,680 cm<sup>−1</sup>) and the CH deformation band (1,420–1,480 cm<sup>−1</sup>). Although these two peaks may also be associated with Amide I and CH deformations of proteins, there is a decrease in the protein-associated range (Notingher and Hench, 2006) in the wavenumber region 1,128–1,284 cm<sup>−1</sup>. Furthermore, there are notable changes in the CH-stretching region (2,800–3,100 cm<sup>−1</sup>). Thus, the third PC of the combined data represents the actual diversity in the lipid composition of the sample. The relationship of the CH stretching region of the Raman spectra to the changes in the lipid content can also be observed by a high correlation of the Raman spectral region with MALDI mass spectra (**Figure 7**).

Since both data types simultaneously reflect variations in lipid content, the specific changes in the correlation profiles (**Figure 7**) of the Raman and MALDI data are observed in the areas related to lipid bands in Raman spectra. Besides the contributions of lipids, which are found in the third PC, the fingerprint region of Raman spectra contains numerous peaks related to proteins and DNA. These Raman bands correlate with MALDI peaks both positively and negatively (**Figure 7**). The correlation of a certain MALDI peak with the Raman data shows a similar structure, but with an opposite sign. This sign change reflects changes in the contribution of specific lipids with respect to the overall increase of lipid content in the sample.
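A correlation profile of this kind can be sketched in Python as follows. The sketch assumes a Pearson correlation calculated across the co-registered pixels between every Raman channel and the intensity of one MALDI peak; how the profiles in **Figure 7** were actually computed is not detailed here, and the function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_profile(raman, maldi_peak):
    """Correlate every Raman channel (column) with one MALDI peak.

    `raman` has one row per pixel; `maldi_peak` holds the intensity of a
    single m/z bin for the same pixels.
    """
    return [pearson(list(channel), maldi_peak) for channel in zip(*raman)]

# Four pixels, two Raman channels; the second channel behaves oppositely
raman = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]
maldi_peak = [10.0, 20.0, 30.0, 40.0]
profile = correlation_profile(raman, maldi_peak)
```

Positive entries of the profile mark Raman channels that increase together with the selected MALDI peak, while negative entries mark channels that decrease.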

One of the non-lipid compounds that feature strong Raman bands is phenylalanine. Its symmetrical ring breathing mode and C-H in-plane mode are visible in the first two PCs at 1,004 and 1,030 cm<sup>−1</sup>. Another peak related to phenylalanine can be found in the first two PCs at 1,104 cm<sup>−1</sup> (Movasaghi et al., 2007). Aside from that, the first PC contains contributions of tryptophan at 760 cm<sup>−1</sup> (Bonifacio et al., 2010). The protein backbone C-Cα stretching of collagen is present in the second PC at 936 cm<sup>−1</sup>, and the ν(C–C) protein backbone is located in the first two PCs at 816 cm<sup>−1</sup> (Bonifacio et al., 2010). Also, prominent collagen-associated bands like Amide I and Amide III can be seen in the first PC at 1,655–1,680 and 1,220–1,284 cm<sup>−1</sup>, respectively (Krafft et al., 2005; Notingher and Hench, 2006). Moreover, the peak at 1,647 cm<sup>−1</sup> is associated with the random coil structure of proteins in general (Movasaghi et al., 2007). This peak is also present in the first two PCs.

The main contribution to the first PC is the ratio between the fingerprint region of Raman spectra and the C-H stretching region. On the other hand, the fingerprint region of the second PC contains both positive and negative peaks, reflecting the changes in protein content. Along with the protein content, valuable information about DNA is obtained from the first two PCs of the Raman spectra. The peak at 1,180 cm<sup>−1</sup> represents cytosine and guanine. Another DNA peak is located at 1,263 cm<sup>−1</sup> and represents adenine and thymine (Movasaghi et al., 2007). All Raman spectral features provide a complex overview of the chemical composition of the mouse brain section. The MALDI data, on the other hand, extend the overview of the distribution of biomolecules based on Raman spectroscopy with detailed information about the composition of the lipid content.

# CONCLUSION


In this paper, a data fusion scheme was investigated to analyze Raman spectroscopic and MALDI mass spectrometric imaging data together. We described the most significant corrupting effects influencing the analysis of Raman spectroscopic and MALDI mass spectrometric imaging data. The preprocessing workflows were shown for the suppression of these corrupting effects by means of calibration, noise reduction, background correction, and normalization for both data types. After the pretreatment steps, the importance of data weighting prior to data fusion was highlighted, especially when the data are obtained from different sources and have different scales and dimensionalities. As there is no universal way of balancing the influence of data types on the analysis, optimization and validation of weighting approaches should be done according to the specific data. In order to allow a judgment of the quality of a weighting, we proposed an approach that allows estimating the goodness of data weighting. This approach is based on analyzing proportions of data variance explained by PCs, and we applied it by examining the cumulative variance. It was shown that the weighting, based on the ratio of l1-norms of the data matrices, allows optimal unmixing of the example data set into features. Besides the comparison of different weighting schemes, the proposed method can be used for the comparison of normalization approaches. It was found that vector normalization allows better unmixing of the example Raman data as compared to the normalization to the integrated spectral intensity (l1-norm). Besides the establishment of a weighting approach, we found that a result nearly as good as that obtained with the weighting is achieved if the spectra of both types are normalized to the same norm. We demonstrated this by normalizing both types of spectra of an example dataset to the same norm, which was the l1-norm in our example. However, it is important to keep in mind that this method of comparing the cumulative proportions of variance should be used only when a researcher is interested in maximizing the number of extracted independent features.

FIGURE 6 | PCA analysis: first three PCs calculated for MALDI spectra (A), Raman spectra (B), combined Raman-MALDI data (E,F) and their false-color score composites (C,D,G). Red, green, and blue colors indicate the first, second and third PCs, respectively. Separate plots for the loadings and false color images can be found as Supplementary Material. The PCs composite image of the combined data (G) shows a smoother appearance, and the loadings after data fusion (E,F) are easier to interpret. See text for further details.

The revealing of additional meaningful features by means of optimal data fusion was demonstrated for the combination of Raman spectroscopic and MALDI mass spectrometric imaging data. We showed this by comparing the third PC extracted from each type of data separately and from the combined data. The MALDI-related part of the third combined component showed a clearer interpretation in comparison to the third loading obtained from the MALDI data alone. Moreover, the Raman-related part of the combined component reflected variations in the lipid to protein ratio. This PC depicts a decrease in a protein-associated range that occurs along with an increase of bands related to the CH deformation and C=C stretching in lipids, which can be found in the regions 1,128–1,284, 1,420–1,480, and 1,655–1,680 cm<sup>−1</sup>, respectively. Therefore, changes in the lipid to protein ratio and changes in the lipid content itself can be observed simultaneously through the data fusion of Raman spectroscopic and MALDI mass spectrometric imaging data.

Finally, the advantage of the combined analysis was illustrated by a comparison of the PCA results visualized as false-color RGB images. These images were obtained separately for the preprocessed Raman and MALDI imaging data and for the combined data. Visual investigation of the images showed that the combined approach provides a sharper image with fewer noise contributions. This allows the conclusion that the data fusion increases reliability not only for the spectral but also for the spatial features present in the data.

# ETHICS STATEMENT

This research is based on already published data provided to the authors by Bocklitz et al. (2013). For this reason, an ethics approval was not required as per institutional and national guidelines.

# AUTHOR CONTRIBUTIONS

TB and JP initiated the study, supervised the study and discussed the results. OR performed the analysis including the development of the R scripts. TB performed the pre-study including the co-registration step. OR, JP, and TB wrote the manuscript.

# ACKNOWLEDGMENTS

Financial support of the EU via the project HemoSpec (FP 7, CN 611682), co-funding of the EU for the project PhotoSkin (FKZ 13N13243) and support of the BMBF via the project PhotoSkin (FKZ 13N13243) and Uro-MDD (FKZ 03ZZ0444J) are highly acknowledged. The publication of this article was funded by the Open Access Fund of the Leibniz Association. The authors wish to thank Dr. Anna Crecelius for acquiring the data and Prof. Dr. Ferdinand von Eggeling for helpful discussions.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00257/full#supplementary-material

Supplementary Image 1 | The plots from Figure 6 provided in vector format.

# REFERENCES

quantitative correlation of MALDI-TOF and Raman imaging. Anal. Chem. 85, 10829–10834. doi: 10.1021/ac402175c


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ryabchykov, Popp and Bocklitz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Real-Time Analysis of Potassium in Infant Formula Powder by Data-Driven Laser-Induced Breakdown Spectroscopy

### Da Chen, Jing Zong, Zhixuan Huang, Junxin Liu and Qifeng Li\*

*College of Precision Instrument and Opto-Electronics Engineering, Tianjin University, Tianjin, China*

Potassium represents one of the most crucial minerals in infant formula that supports healthy growth and development of infants. Here, a novel strategy for the real-time quantification of potassium in infant formula samples is introduced. Using laser-induced breakdown spectroscopy (LIBS) in a data-driven approach, named DD-LIBS, a modified random frog algorithm (MRFA) is adopted in a higher-density discrete wavelet transform (HDWT) domain to select the most important features related to potassium. In DD-LIBS, the HDWT oversamples the LIBS signals in both time and frequency domains by a factor of two, enhancing the spectral expandability in an approximately shift-invariant way. The MRFA is thus capable of isolating the features of potassium with experience accumulated from the collected LIBS data. Such pretreatment combined with a partial least squares (PLS) model can significantly suppress the uncontrolled shift and broadening effects on multivariate calibration, improving the capability of LIBS for accurate quantification of potassium. The present work demonstrates the feasibility of DD-LIBS for the quantification of potassium content in 90 commercial infant formula samples. The satisfactory results indicate that DD-LIBS is a feasible tool for real-time analysis of potassium content with little sample preparation. This strategy may be readily extended to the detection of other elements in the presence of uncontrolled interference.

Keywords: laser-induced breakdown spectroscopy, higher density wavelet transform, modified random frog algorithm, infant formula, potassium

### Edited by:

*Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam*

### Reviewed by:

*Venugopal Rao Soma, University of Hyderabad, India Noureddine Melikechi, University of Massachusetts Lowell, United States*

> \*Correspondence: *Qifeng Li qfli@tju.edu.cn*

### Specialty section:

*This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *28 February 2018* Accepted: *11 July 2018* Published: *31 July 2018*

### Citation:

*Chen D, Zong J, Huang Z, Liu J and Li Q (2018) Real-Time Analysis of Potassium in Infant Formula Powder by Data-Driven Laser-Induced Breakdown Spectroscopy. Front. Chem. 6:325. doi: 10.3389/fchem.2018.00325*

**Abbreviations:** LIBS, Laser-induced breakdown spectroscopy; RFA, random frog algorithm; MRFA, modified random frog algorithm; HDWT, higher density wavelet transform; PLS, partial least squares.

# INTRODUCTION

Infant formula, as a breast-milk substitute, plays a significant role since it is the sole source of nutrition for some infants (Deckelbaum et al., 2004; Meucci et al., 2010; Codex, 2015; AOAC International, 2016). The international standard for infant formula set by Codex Alimentarius Commission (CAC) has a strict requirement of the essential composition and nutrition content (Codex, 2015). Meanwhile, all infant formulas marketed must also meet local standards, which are based on the national physique and health level (The Ministry of Health People's Republic of China, 2010b). As an essential cation in intracellular fluid, potassium is one of the most important minerals to support healthy growth and development of infants, because potassium is critically involved in acid-base balance, osmotic pressure regulation, nerve impulse conduction, muscle contraction and Na<sup>+</sup>/K<sup>+</sup> ATPase function (Soetan et al., 2010). An incorrect intake of potassium can also cause disorders such as hyperkalemia and hypokalemia, which makes the correct control of the potassium content of infant formula highly important for both international and local standards (Deckelbaum et al., 2004; Koletzko et al., 2005; The Ministry of Health People's Republic of China, 2010b; Codex, 2015).

To determine the potassium content, the current standard analytical methods are mostly based on atomic absorption spectrophotometry (AAS) (The Ministry of Health People's Republic of China, 2010a), inductively coupled plasma atomic emission spectrometry (ICP-AES) (The Ministry of Health People's Republic of China, 2010a; ISO, 2018a) and inductively coupled plasma mass spectrometry (ICP-MS) (ISO, 2018b). These methods require a laborious and time-consuming sample processing procedure, together with a strictly controlled laboratory environment and a large sample volume (Panne et al., 2001; Awan et al., 2013; Matsumoto et al., 2016). However, the huge consumption of infant formula, at the level of millions of tons, greatly challenges the efficiency of current analytical methods (Tan et al., 2017) and makes it necessary to develop an efficient and simple method for quantifying the potassium content in infant formula.

Laser-induced breakdown spectroscopy (LIBS), an optical emission spectroscopy technique, presents a potential solution to this challenge (Aragón and Aguilera, 2008). In LIBS, a high-power density laser pulse is focused on a target material in less than a nanosecond, during which a high-temperature plasma is generated by vaporizing a small portion of the target (Zheng et al., 2014). As a result, the characteristic radiation of the elements is emitted by the excited atomic, ionic, and molecular fragments produced by the plasma (Harmon et al., 2006; Bousquet et al., 2007). Hence, LIBS offers a strong capability to rapidly detect the element contents in many types of samples (Panne et al., 2001; Bousquet et al., 2007; Hussain and Gondal, 2008; Eseller et al., 2010), with little sample preparation (Hahn and Omenetto, 2010; Hou et al., 2016).

The development of lasers, optics and charge-coupled array detectors has driven a critical revolution in the sensitivity of LIBS, making it a "future superstar" analytical method (Hou et al., 2016). However, the complex process of laser-sample and plasma-particle interactions may distort LIBS peaks (Hahn and Omenetto, 2012). The spectral interference present in the LIBS signals often leads to an unresolved, broadened and often shifted center of gravity that introduces a wavelength shift of spectral peaks (Cremers and Radziemski, 2013), which compromises the LIBS calibration performance. Alternatively, a calibration-free LIBS (CF-LIBS) based on strict theoretical assumptions of laser-induced plasma may estimate analyte concentrations correctly. However, CF-LIBS data are severely affected by the self-absorption effect and the estimation of plasma temperature (Sun and Yu, 2009), which is challenging for pharmaceutical applications. To improve calibration results, the higher-density discrete wavelet transform (HDWT), a signal processing method with shift-invariant capability, becomes a good candidate (Selesnick, 2006). With HDWT, a minor wavelength shift in the raw spectra will not cause a significant variance of the HDWT coefficients at different scales (Qin et al., 2010), which supports the reliability of future calibration models built with the HDWT coefficients.

The unique feature of HDWT is that it processes the spectral data in an approximately shift-invariant way, while oversampling the spectral signals in both time and frequency domains by a factor of two, as opposed to the shift-variant downsampling in the conventional discrete wavelet transform (DWT) (Selesnick, 2006). It generates triple wavelet coefficients and thus enables the localized LIBS spectral features to be isolated more accurately and robustly (Han et al., 2017). After being processed by HDWT, the LIBS spectral bands of potassium can be well extracted by specific HDWT coefficients, which can be optimized by feature selection methods (Yun et al., 2013). Since the underlying mechanism of LIBS signals is too complex to be interpreted directly, the observed LIBS data themselves must drive variable selection to optimize multivariate calibration (Parab et al., 2009).

Several feature selection procedures have been developed, including the random frog algorithm (RFA) (Li et al., 2012), competitive adaptive reweighted sampling (CARS) (Li et al., 2009), uninformative variable elimination (UVE) and its derivatives (Cai et al., 2008; Moros et al., 2008), and randomization tests (Kennedy and Cade, 1996). Among these procedures, RFA presents a unique advantage in processing high-dimensional spectral data without any prior knowledge, which matches the demands of a data-driven approach well. However, the RFA tends to generate a semi-random result that may not correlate accurately with the targeted chemicals. In this case, a modified random frog algorithm (MRFA) is adopted, based on a multiple resampling strategy in which the RFA is executed hundreds of times to select the variables with the highest selection probability. Therefore, the MRFA is expected to improve the reliability of the LIBS models.

In this work, a data-driven strategy is proposed to isolate the spectral features of potassium with experience accumulated from the observed LIBS data. This strategy aims to estimate the relationship between LIBS spectral datasets and potassium concentrations from the existing input-output data (Gani et al., 2009), which is named as data-driven LIBS (DD-LIBS). In DD-LIBS, the MRFA was adopted in the HDWT domains instead of raw LIBS spectra to avoid spectral interference. A calibration model was then constructed with the selected HDWT coefficients. The DD-LIBS strategy was validated by using 90 commercial infant formula samples.

# MATERIALS AND METHODS

# Sample Resource and Preparation

Samples of 90 commercially available infant formulas, covering 24 mainstream brands in China, were purchased from the local market. The potassium content was measured by flame atomic absorption spectrometry according to the Chinese national test standard method GB5009.91-2017. To reduce the effects of particle size on LIBS signals, solid infant formula samples were pressed into compact pellets by using a hydraulic press machine under 30 MPa pressure. The diameter, thickness, and mass of the pellets were 20 mm, 10 mm, and 4 g, respectively.

# Laser-Induced Breakdown Spectrometry System

In this study, an Ocean Optics LIBS 2500-7 spectrometer system was equipped with a CFR Nd:YAG laser source (LIBS-LAS200MJ, Big Sky Laser Technologies). The laser was operated at a fundamental wavelength of 1,064 nm, and the pulse energy utilized in this experiment was 50 mJ. The pulse duration was 9.5 ns, and the pulse repetition rate was 10 Hz. The LIBS 2500-7 has seven channels to provide a broad spectral wavelength range from 200 to 880 nm, covering the emission spectra of all elements. Each channel is equipped with a 2048-element linear CCD array to present a high optical resolution of 0.1 nm (FWHM). The frame rate was 10 Hz. The integration time was 2.1 ms, and it could be changed in a free-run mode to match sample properties. The trigger delay was from −121 to +135 µs in 500 ns steps. The delay time was set at 0.83 µs, which was determined through optimizing the signal-background ratio (SBR) and characteristic spectral intensity.

# Experimental Procedure

For each LIBS analysis, the pellet was put on the sample stage, and 10 different spots on the pellet were evenly selected for LIBS measurement, which reduces the effects of inhomogeneity and surface variations on LIBS signals. Each spot was ablated with 10 laser pulses. As a result, a total of 100 LIBS spectra were collected and averaged into a single LIBS spectrum, which improves the stability of LIBS experiments.
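The averaging step can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function name `average_spectra` and the two-channel toy spectra are assumptions made for the example, and every spectrum is assumed to share the same wavelength grid.

```python
def average_spectra(spectra):
    """Average a list of spectra (equal-length lists of intensities) point by point."""
    if not spectra:
        raise ValueError("at least one spectrum is required")
    n_points = len(spectra[0])
    if any(len(s) != n_points for s in spectra):
        raise ValueError("all spectra must share the same wavelength grid")
    return [sum(s[k] for s in spectra) / len(spectra) for k in range(n_points)]


# Hypothetical toy data: 10 spots x 10 pulses, two spectral channels per shot.
shots = [[float(spot + pulse), 100.0] for spot in range(10) for pulse in range(10)]
mean_spectrum = average_spectra(shots)
```

Averaging 100 shots per pellet trades acquisition time for a lower shot-to-shot variance in the spectrum that enters the calibration.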

# Calibration Approach

Samples were randomly divided into two sets, i.e., a 65-sample set was used to build a calibration model and a 25-sample set was used to validate the calibration model.
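A random 65/25 partition of this kind might look like the sketch below. The helper `split_samples`, the integer sample IDs and the fixed seed are illustrative assumptions; the source does not state how the random division was implemented.

```python
import random


def split_samples(sample_ids, n_calibration, seed=0):
    """Randomly partition sample IDs into a calibration set and a validation set."""
    ids = list(sample_ids)
    if not 0 < n_calibration < len(ids):
        raise ValueError("n_calibration must be between 1 and len(sample_ids) - 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    return sorted(ids[:n_calibration]), sorted(ids[n_calibration:])


# 90 samples: 65 for calibration, the remaining 25 for validation.
calibration_ids, validation_ids = split_samples(range(90), n_calibration=65)
```

Fixing the seed keeps the partition reproducible, so the same validation samples are used when different models are compared.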

# Normalization Methods

In order to use LIBS in a timely manner, minimal sample pretreatment is preferred. Thus, in LIBS measurement, normalization is performed to compensate for physical variations and sample matrix differences. In this work, five normalization methods were compared, namely normalization by the average, by the norm, by the spectral area, by the spectral height, and by carbon emission lines (Abdel-Salam et al., 2013; Castro and Pereirafilho, 2016; dos Santos Augusto et al., 2017).
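The five normalization options can be illustrated with a short sketch. The definitions below are common conventions and are assumptions here, since the text does not give the formulas: "average" divides by the mean intensity, "norm" by the Euclidean norm, "area" by the summed intensity (which presumes a uniform wavelength step), "height" by the maximum intensity, and "carbon" by the intensity at a carbon emission line whose index (`carbon_index`) is a hypothetical parameter.

```python
import math


def normalize(spectrum, method, carbon_index=None):
    """Divide a spectrum by a scale factor chosen by `method` (assumed definitions)."""
    if method == "average":
        scale = sum(spectrum) / len(spectrum)
    elif method == "norm":
        scale = math.sqrt(sum(v * v for v in spectrum))
    elif method == "area":
        scale = sum(spectrum)  # proportional to the area for a uniform wavelength step
    elif method == "height":
        scale = max(spectrum)
    elif method == "carbon":
        if carbon_index is None:
            raise ValueError("carbon_index is required for carbon-line normalization")
        scale = spectrum[carbon_index]
    else:
        raise ValueError(f"unknown method: {method}")
    if scale == 0:
        raise ValueError("scale factor is zero; spectrum cannot be normalized")
    return [v / scale for v in spectrum]


# Hypothetical four-point spectrum, used only to show the calls.
spectrum = [2.0, 4.0, 10.0, 4.0]
by_average = normalize(spectrum, "average")
by_norm = normalize(spectrum, "norm")
by_height = normalize(spectrum, "height")
by_carbon = normalize(spectrum, "carbon", carbon_index=1)
```

Each option rescales the whole spectrum by one number, so relative peak shapes are preserved and only the shot-to-shot intensity level is compensated.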

# Data Analysis Through Data-Driven LIBS

The LIBS spectra are affected by matrix effects and other unknown interference, resulting in broadened and shifted LIBS peaks. DD-LIBS is thus proposed to reduce the effect of peak broadening and shift on multivariate calibration. To correct shifted and broadened spectral peaks, HDWT was applied by implementing a three-channel filter bank that performs an oversampling operation to generate nearly shift-invariant wavelet coefficients.

After the HDWT calculation, the raw LIBS spectra were decomposed into localized components labeled by a scale, facilitating the feature selection methods to isolate the spectral bands related to potassium. Then, the MRFA was performed by using the bagging strategy, assigning 70% of the samples to a training subset and 30% to a validation subset. The procedure was repeated 1,000 times to generate 1,000 different selection probabilities for each HDWT coefficient, which were then accumulated. The flowchart of MRFA is shown in **Figure 1**.

In this work, only the HDWT coefficient with the highest probability was selected for further calibration because it provided valuable robustness against the uncontrolled and unknown spectral interference, and the feature selection result can be easily validated by the reference LIBS spectra of potassium.
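The resampling idea behind MRFA can be sketched in plain Python. This is not the authors' Matlab implementation and it does not implement the random frog algorithm itself: `stand_in_selector` is a deliberately simple replacement that ranks variables by absolute Pearson correlation with the reference values, used only so that the accumulation loop is runnable. The function names, the synthetic data and the 200 runs are all assumptions for illustration, and the stand-in does not use the 30% validation subset.

```python
import math
import random


def abs_pearson(column, y):
    """Absolute Pearson correlation between one variable and the reference values."""
    n = len(y)
    mc, my = sum(column) / n, sum(y) / n
    sxy = sum((c - mc) * (v - my) for c, v in zip(column, y))
    sxx = sum((c - mc) ** 2 for c in column)
    syy = sum((v - my) ** 2 for v in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def stand_in_selector(X, y, n_keep):
    """Stand-in for one RFA run (assumption): keep the n_keep best-correlated variables."""
    scores = [abs_pearson([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep]


def accumulate_selection_probability(X, y, n_runs=1000, train_fraction=0.7,
                                     n_keep=1, seed=0):
    """Repeat the selector on random training subsets and accumulate selection frequencies."""
    rng = random.Random(seed)
    n_samples, n_vars = len(X), len(X[0])
    n_train = max(2, round(train_fraction * n_samples))
    counts = [0] * n_vars
    for _ in range(n_runs):
        train = rng.sample(range(n_samples), n_train)  # 70% training subset
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        for j in stand_in_selector(X_train, y_train, n_keep):
            counts[j] += 1
    return [c / n_runs for c in counts]


# Synthetic data (hypothetical): variable 2 tracks the reference values, the rest is noise.
data_rng = random.Random(1)
y = [0.4 + 0.005 * i for i in range(40)]
X = [[data_rng.random(), data_rng.random(),
      10 * yi + 0.01 * data_rng.gauss(0, 1), data_rng.random()] for yi in y]
probabilities = accumulate_selection_probability(X, y, n_runs=200)
best_variable = max(range(len(probabilities)), key=probabilities.__getitem__)
```

Accumulating the selections over many resampled subsets is what turns a semi-random single run into a stable ranking; the variable with the highest accumulated probability is then passed on to calibration.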

As mentioned above, DD-LIBS was established by integrating HDWT, MRFA and PLS together. The HDWT codes were written in Matlab 2013a based on the Selesnick's theory (Selesnick, 2006). The programs of PLS and RFA were available in the libPLS toolbox for Matlab (Li et al., 2014), and the MRFA was modified from RFA in Matlab 2013a.

### Evaluation Parameters

The root mean square error of cross-validation (RMSECV) was used to determine the HDWT parameters, and the coefficient of determination (R<sup>2</sup>) was used to evaluate the calibration performance of the developed models (Chu, 2011):

$$RMSECV = \sqrt{\frac{\sum_{i=1}^{m} \left(y_{i,actual} - y_{i,predicted}\right)^2}{m - 1}} \tag{1}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_{i,actual} - y_{i,predicted}\right)^2}{\sum_{i=1}^{n} \left(y_{i,actual} - \overline{y}_{actual}\right)^2} \tag{2}$$

where y<sub>i,actual</sub> is the reference value of the potassium concentration of sample i, y<sub>i,predicted</sub> represents the predicted value of sample i, m is the number of calibration samples, n is the number of samples over which R<sup>2</sup> is evaluated, and ȳ<sub>actual</sub> represents the average reference concentration of all samples. When the same error measure is obtained from the prediction set, it is referred to as the RMSEP. The evaluation criterion is simple: the smaller the RMSEP, the stronger the prediction capability of the model.
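Equations 1 and 2 translate directly into code. The sketch below follows the equations as printed, including the m − 1 denominator in Equation 1; the function names and the four-sample toy values are assumptions for illustration.

```python
import math


def rmse_eq1(actual, predicted):
    """Root mean square error with the (m - 1) denominator of Equation 1."""
    m = len(actual)
    if m < 2 or m != len(predicted):
        raise ValueError("need at least two paired values")
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(ss_res / (m - 1))


def r_squared(actual, predicted):
    """Coefficient of determination as in Equation 2."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical reference and predicted values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
error = rmse_eq1(actual, predicted)          # sqrt(4 / 3)
determination = r_squared(actual, predicted)  # 1 - 4 / 5
```

The same `rmse_eq1` function yields RMSECV when fed cross-validation predictions and RMSEP when fed predictions for the validation set.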

The limit of detection (LOD) was calculated by using the following equation (ICH Guideline, 2005):

$$\text{LOD} = \frac{3.3 \times \text{SD}_{blank}}{s} \tag{3}$$

where SD<sub>blank</sub> is the standard deviation of the baseline near the peaks, and s is the slope of the calibration curve.
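A worked instance of Equation 3 is sketched below. The baseline readings, concentrations and intensities are made-up numbers, the slope is obtained by ordinary least squares of intensity on concentration, and the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or population form was used.

```python
import statistics


def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line y = a + s * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def limit_of_detection(baseline, slope):
    """LOD = 3.3 * SD_blank / s, as in Equation 3."""
    return 3.3 * statistics.stdev(baseline) / slope


# Hypothetical calibration curve: intensity = 1 + 2 * concentration.
concentrations = [0.0, 1.0, 2.0, 3.0]
intensities = [1.0, 3.0, 5.0, 7.0]
slope = least_squares_slope(concentrations, intensities)  # 2.0

# Hypothetical baseline intensities near the peak (sample SD = 1.0).
baseline = [1.0, 2.0, 3.0]
lod = limit_of_detection(baseline, slope)  # 3.3 * 1.0 / 2.0
```

The LOD is expressed in the concentration units of the calibration curve, because the intensity units of SD<sub>blank</sub> and the slope cancel.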

# RESULTS AND DISCUSSION

# LIBS Spectrum of Infant Formula

In this work, a typical full spectrum and regional potassium peaks of an infant formula are presented in **Figure 2A**. The LIBS spectrum of infant formula has sharp characteristic peaks with different intensities, and each peak uniquely corresponds to a specific element. According to the Atomic Spectra Database (ASD) of the National Institute of Standards and Technology (NIST), the peaks located at 766.57 and 769.95 nm were selected for quantifying the potassium content in infant formula. **Figure 2B** shows the spectra of five representative samples with potassium concentrations ranging from 0.415/100 g to 0.815/100 g. The intensity of the potassium peaks clearly increased with concentration, but not linearly, because the potassium peaks were affected by both potassium concentrations and physical parameters (such as laser energy fluctuation and effects related to the sample texture and density). Unfortunately, the contribution of any interference to LIBS was unclear, and DD-LIBS was thus developed to perform the quantitative analysis of potassium by using the existing input-output LIBS data.

# Selection of Normalization Method


Five normalization methods were compared by calculating the RMSEP of each PLS calibration model. The RMSEPs of the five normalization methods, i.e., average, normalization by norm, spectral area, spectral height, and carbon emission lines, were 0.056, 0.065, 0.076, 0.059, and 0.096, respectively. The average normalization strategy was therefore the most suitable, with the lowest RMSEP value, and was subsequently applied in this work. After data normalization, the calibration performance of the univariate, PLS and DD-LIBS models was compared to facilitate the understanding of the LIBS quantification.

# Univariate Analysis

The univariate analysis represents the most conventional modeling strategy, in which the analyte's concentration and the peak intensity or the peak area are set as x and y, respectively (El Haddad et al., 2014). In this work, two calibration curves were made with two potassium peaks as shown in **Figures 3A,B**. **Figure 3C** demonstrates another calibration curve using the areas of these two peaks. The LOD obtained from the first peak of potassium was 37 ppm. As shown in **Figures 3A,B**, the R<sup>2</sup> values of both peak height curves are low, which means that the correlation is poor (El Haddad et al., 2014). The R<sup>2</sup> of the area curve (C) is also not satisfactory for quantification, even though it is slightly higher than those of the two peak height curves. The reason is that the univariate analysis is compromised by both matrix effect and sample complexity (Hou et al., 2016; Sanghapi et al., 2016). It is therefore expected that multivariate analysis could improve the calibration performance through latent projection instead of univariate regression, and PLS was chosen because it is widely adopted in multivariate calibration.

FIGURE 3 | Calibration curves based on (A) the intensity of the first peak at 766.57 nm, (B) the intensity of the second peak at 769.95 nm and (C) the areas of two peaks at 766.57 and 769.95 nm.

# PLS Calibration

The spectral features of potassium were assigned to the range from 751.90 to 774.86 nm, which contains 512 variables. To evaluate the prediction capability of the PLS model, R<sup>2</sup> and RMSEP were calculated. **Figure 4** demonstrates that the prediction results of the PLS model exceed those of univariate analysis. However, the prediction performance could be further improved through the suppression of the uncontrolled spectral shift and broadening.

# DD-LIBS Strategy

In DD-LIBS, the HDWT aims to suppress the effects of peak shift and broadening on multivariate calibration through the oversampling and shift-invariant operation. With the combination of MRFA, DD-LIBS is expected to isolate the spectral features related to the potassium accurately.

### Determination of HDWT Parameters

The performance of HDWT depends on wavelet filters and decomposition scales, which should be optimized before calibration. In HDWT, four wavelet filters with different vanishing moments are available (Selesnick, 2006). Theoretically, the wavelet filter with higher vanishing moment shrinks the peak more efficiently than that with lower vanishing moment (Han et al., 2017). Here, the "bi4" wavelet filter with four vanishing moments was selected, since it possesses the highest vanishing moment in the current HDWT filter bank (Selesnick, 2006). By using the "bi4" filter, the spectral resolution would be expanded by a factor of three, which significantly improved the spectral expandability in an approximately shift-invariant way.

The decomposition scale is also critical in HDWT, so it was optimized by the minimum RMSECV criterion. **Figure 5** indicates the relationship between the scale and RMSECV using the leave-one-out cross-validation of the calibration set. As a result, a scale of four was selected for the HDWT calculation.

### Feature Selection Obtained by MRFA

After the HDWT calculation, the original 512 variables were expanded into 1,520 new variables, providing additional flexibility to isolate the features of potassium in the presence of uncontrolled spectral interference. Subsequently, MRFA was adopted to select the accurate features of potassium. **Figure 6** illustrates the accumulated probability of each variable after 1,000 MRFA runs, and the variable with the highest probability was selected for further multivariate calibration.

With the variables selected by MRFA, a PLS model was built. Only one PLS factor was required for calibration, which reveals that DD-LIBS is capable of isolating the spectral peaks of potassium accurately. As compared to **Figure 4**, the R <sup>2</sup> of DD-LIBS is improved from 0.887 to 0.962 as shown in **Figure 7**.
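A PLS model with a single factor is compact enough to write out. The sketch below is a generic one-component PLS1 fit on mean-centered data in plain Python, given as an illustration rather than the libPLS routine used by the authors; the function names and the toy data are assumptions. With one selected variable, it reduces to ordinary simple linear regression.

```python
import math


def fit_pls1_one_factor(X, y):
    """Fit a one-component PLS1 model on mean-centered data."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximal covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("X and y are uncorrelated; no PLS direction exists")
    w = [v / norm for v in w]
    # Scores and the regression coefficient of y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_mean, y_mean, w, q


def predict_pls1(model, x):
    """Predict the response for one new spectrum (list of selected variables)."""
    x_mean, y_mean, w, q = model
    score = sum((xj - mj) * wj for xj, mj, wj in zip(x, x_mean, w))
    return y_mean + q * score


# Hypothetical data: one selected coefficient, concentration = 0.1 + 0.2 * value.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [0.3, 0.5, 0.7, 0.9]
model = fit_pls1_one_factor(X, y)
prediction = predict_pls1(model, [5.0])
```

Needing only one factor means the selected variables already vary along a single direction that tracks the analyte, which is the behavior the text attributes to DD-LIBS.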

It is also of great interest to investigate the reconstructed spectra obtained from the selected variables, which is fundamental to understanding how DD-LIBS suppresses the effects of uncontrolled peak shift and broadening on multivariate calibration efficiently. The broadening and shift effects on the LIBS spectral peaks vary from sample to sample as shown in **Figure 8A**, which may impair the LIBS calibration models. As a comparison, the DD-LIBS filtered data are illustrated in **Figure 8B**. It is clear that the reconstructed signals of DD-LIBS are located at the same positions as the highest LIBS peak of potassium, and the intensity values at 766.48 and 766.53 nm are the same. This reveals that DD-LIBS selected the shift-invariant spectral features to overcome the effects of peak shift and peak broadening on multivariate calibration. It is reasonable to expect that DD-LIBS could provide a promising tool to measure potassium content in infant formula accurately, regardless of the uncontrolled interference present.

# Comparison of Different Methods

**Table 1** shows the prediction results for potassium content in infant formula obtained by different methods. The univariate method presents a poor calibration result, revealing that the LIBS spectral analysis should be carefully designed. The PLS model improves on the prediction performance of the univariate method through multivariate calibration, but the number of PLS factors is abnormally high. The results illustrate that additional PLS factors have to be adopted for estimating unknown spectral interference, which tends to generate an overfitted result that relies too much on the current data set. Unexpectedly, the combination of RFA and PLS produces a worse result than the PLS model. This could be attributed to the effect of spectral interference (e.g., matrix effect, laser energy fluctuation, sample texture and density, and noise) on the feature selection in raw spectra.

The HDWT is explored to suppress the spectral interference. The RFA selects the most important HDWT coefficients, resulting in a better prediction precision than that of the RFA-PLS model. As expected, DD-LIBS provides the best prediction results with only one PLS factor, revealing that the LIBS spectral features of potassium are isolated efficiently. As a result, only one PLS factor is required to construct a high-quality calibration model, thus enhancing the reliability and robustness of the LIBS spectral analysis in the presence of uncontrolled interference.

# CONCLUSION

This study presented a novel strategy, named DD-LIBS, as an approach for real-time quantification of potassium content in commercial infant formula samples. With the combination of HDWT and MRFA, DD-LIBS selected the most important feature related to the potassium accurately, independent of spectral interference. As a result, DD-LIBS generated a high-quality calibration model with only one PLS factor, and the DD-LIBS reconstructed spectra were highly consistent with the original spectral bands of potassium. These satisfactory results suggest that DD-LIBS can be broadly extended to the quantification of any targeted element in solid samples in the presence of uncontrolled interference. Once a DD-LIBS model has been constructed, it can predict unknown LIBS spectra as long as these spectra are within the range of relationships learned in the training phase.

# AUTHOR CONTRIBUTIONS

DC planned and supervised the experiments, processed the raw data, and revised the manuscript. JZ processed the raw data and wrote the manuscript. JL performed the experiments. ZH advised on data processing and algorithm application. QL revised the manuscript and advised on the principles of LIBS.

# FUNDING

This work was supported by the National Natural Science Foundation of China [61378048, 21305101, 21273159], National Key Research and Development Program of China (2017YFC0803603), Tianjin Research Program of Application Foundation and Advanced Technology [14JCZDJC34700], the Open Funding of State Key Laboratory of Precision Measuring Technology and Instruments [PIL1605], the Program for New Century Excellent Talents in University [NCET-11-0368].

# REFERENCES

laser-induced breakdown spectroscopy. Arabian J. Sci. Eng. 38, 1655–1661. doi: 10.1007/s13369-013-0548-7


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Chen, Zong, Huang, Liu and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Essentials of Aquaphotomics and Its Chemometrics Approaches

Roumiana Tsenkova<sup>1</sup>\*, Jelena Munćan<sup>1,2</sup>, Bernhard Pollner<sup>3</sup> and Zoltan Kovacs<sup>4</sup>

<sup>1</sup> Biomeasurement Technology Laboratory, Graduate School of Agricultural Science, Kobe University, Kobe, Japan, <sup>2</sup> Nanolab, Biomedical Engineering Department, Faculty of Mechanical Engineering, University of Belgrade, Belgrade, Serbia, <sup>3</sup> Department for Hygiene and Medical Microbiology, Medical University of Innsbruck, Innsbruck, Austria, <sup>4</sup> Department of Physics and Control, Faculty of Food Science, Szent István University, Budapest, Hungary

Aquaphotomics is a novel scientific discipline involving the study of water and aqueous systems. Using light-water interaction, it aims to extract information about the structure of water, composed of many different water molecular conformations, using their absorbance bands. In aquaphotomics analysis, specific water structures (presented as water absorbance patterns) are related to their resulting functions in the aqueous systems studied, thereby building an aquaphotome, a database of water absorbance bands and patterns correlating specific water structures to their specific functions. Light-water interaction spectroscopic methods produce complex multidimensional spectral data, which require data processing and analysis to extract hidden information about the structure of water presented by its absorbance bands. The process of extracting information from water spectra in aquaphotomics requires a field-specific approach. It starts with an appropriate experimental design and execution to ensure high-quality spectral signals, followed by a multitude of spectral analysis, preprocessing and chemometrics methods to remove unwanted influences and extract the water absorbance spectral pattern related to the perturbation of interest through the identification of activated water absorbance bands found among the common, consistently repeating and highly influential variables in all analytical models. The objective of this paper is to introduce the field of aquaphotomics and describe the aquaphotomics multivariate analysis methodology developed during the last decade. Through a worked-out example of analysis of potassium chloride solutions, supported by similar approaches from the existing aquaphotomics literature, the provided instruction should give enough information about aquaphotomics analysis, i.e., to design and perform the experiment and data analysis as well as to represent the water absorbance spectral pattern using various forms of aquagrams, which are specifically designed aquaphotomics graphs. The explained methodology is derived from analysis of near infrared spectral data of aqueous systems and will offer a useful and new tool for extracting data from informationally rich water spectra in any region. It is the hope of the authors that with this new tool at the disposal of scientists and chemometricians, pharmaceutical and biomedical spectroscopy will substantially progress beyond its state-of-the-art applications.

Keywords: aquaphotomics, water, near infrared spectroscopy, multivariate analysis, water spectral pattern, aquagram, aquap2

### Edited by:

Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam

### Reviewed by:

Daniel Cozzolino, Central Queensland University, Australia Felix Scholkmann, UniversitätsSpital Zürich, Switzerland Zhisheng Wu, Beijing University of Chinese Medicine, China

> \*Correspondence: Roumiana Tsenkova rtsen@kobe-u.ac.jp

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 15 April 2018 Accepted: 30 July 2018 Published: 28 August 2018

### Citation:

Tsenkova R, Munćan J, Pollner B and Kovacs Z (2018) Essentials of Aquaphotomics and Its Chemometrics Approaches. Front. Chem. 6:363. doi: 10.3389/fchem.2018.00363


# INTRODUCTION TO AQUAPHOTOMICS

Aquaphotomics is a novel scientific discipline founded by Professor Roumiana Tsenkova at Kobe University, Japan, in 2005 (Tsenkova, 2005, 2006a,b,c, 2009) with the objective of studying and systematizing knowledge about water-light interaction, which was found to be a huge source of information on the subject of the structural and related functional properties of aqueous systems. This is a complementary "omics" discipline dealing with the large-scale, comprehensive study of water as the "molecular and energy mirror" of the rest of the aqueous system. While proteomics studies proteins, glycomics studies carbohydrates and lipidomics studies lipids, aquaphotomics explores the roles, relationships and functions of water, an equally important biomolecule and one of nature's fundamental building blocks.

The word "aquaphotomics" is derived from the words aqua (water) and photo (light), since this new discipline studies water by using its interaction with light. Thus, aquaphotomics is a science which uses water-light interaction to explore the structure of water as a system and matrix composed of many different water molecular conformations, thereby resulting in various functionalities (Tsenkova, 2009). The main objective of establishing aquaphotomics as a novel scientific discipline was to provide a common platform and strategy to lead to an improved general understanding of water functionality by utilizing water-light interaction at every frequency of the electromagnetic spectrum. The majority of aquaphotomics works so far have been done by using near infrared (NIR) spectroscopy, especially in the area of the 1st overtone of the OH stretching band (1,300–1,600 nm), where many water absorbance bands are identified and consistent with previously reported or calculated overtones of water absorbance bands in the infrared region (Weber et al., 2000, 2001; Smith et al., 2005; Tsenkova, 2009; Tsenkova et al., 2015). What aquaphotomics research studies showed is that NIR spectroscopy, and in general water-light interaction over the entire electromagnetic spectrum, can significantly contribute to the field of water science and to a better understanding of water molecular systems (Tsenkova, 2009).

The NIR wavelength region from around 680 to 2,500 nm is considered as an excellent tool for water observation that provides an enormous amount of information about water molecular structure (Büning-Pfaue, 2003; Tsenkova, 2009). The NIR light allows a longer penetration length, as compared to infrared, even up to 10 mm in the short wavelength region (750–1,100 nm) (Workman, 2000), making it a rapid and non-destructive measurement technique particularly suitable for studying intact biological systems. Numerous NIR spectra can be obtained in various conditions and states of the systems (under different perturbations)—all in real time. NIR spectroscopy has a rich history of applications in pharmaceutical and medical fields. Water, however, with its NIR characteristic spectrum was often seen as a problematic component and the common source of measurement error, because it could alter sample spectra, hide weak absorbance bands and shift other absorbance bands (Ciurczak and Igne, 2014). In fact, water is cited as one of the main disadvantages of NIR spectroscopy in pharmaceutical applications since it prevents a direct quantification (Jamrógiewicz, 2012).

Traditionally, water bands in the NIR region around 1,440 nm (the first overtone of OH stretch) and 1,940 nm (a combination of OH bending and stretching) have been very useful in the studies of the state of water in various samples (Ozaki, 2002). One of the major and most common applications of NIR spectroscopy was moisture determination (Osborne et al., 1993; Reeves, 1995). NIR spectroscopy has been used to investigate water content, hydrogen bonds and hydration state in a variety of fields such as agriculture and food industry, medical and pharmaceutical sciences, and polymer and textile industries (Ozaki, 2002).

Although some early works on water analysis reported the rich informational potential of its NIR spectrum (Hirschfeld, 1985; Iwamoto et al., 1987; Grant et al., 1989; Maeda et al., 1995), it was only with the development of aquaphotomics that the properties of water as a "collective matter and energy mirror" were truly explored (Tsenkova, 2009). The so-called "water mirror approach" of aquaphotomics utilizes the high sensitivity of water's hydrogen bonds, where all the components of the aqueous system and surrounding energies influence the water structure, i.e., the covalent bonds. Every aqueous system is a dynamic arrangement of a water molecular network hydrogen-bonded to other constituents and influenced by perturbations. Any perturbation of the aqueous system results in changes of water molecular conformations, which in turn produce changes in the corresponding NIR spectra at their respective water absorbance bands. As a consequence of the strong potential of water molecules for hydrogen bonding, water, a natural matrix of any aqueous or biological system, changes its absorbance pattern every time it adapts to a physical or chemical change in the system itself or its environment (Tsenkova, 2008c). It is this quality of water that indirectly permits measurements of small quantities or structural changes of other molecules present in the aqueous system. By tracking the changes of water absorbance bands in the spectra of aqueous or biological systems, information is extracted about not only water structure but also other components present in water or the state of the system as a whole (Tsenkova, 2006c, 2007, 2008b, 2009).

Being rapid and non-destructive, NIR spectroscopy is a powerful technique with an incredible range of applications, whose horizons have been further expanded by aquaphotomics. Since its establishment more than a decade ago, aquaphotomics has grown into a vast and multidisciplinary scientific field, encompassing many research areas (**Table 1**). Changes in the absorption spectrum of water are used for quantification of the solutes present in water, even when the solutes do not absorb NIR light at all (Grant et al., 1989; Tsenkova, 2009; Gowen et al., 2015). This so-called water-mirror approach enables measurements of concentrations previously impossible with NIR spectroscopy at ppm levels (Sakudo et al., 2006b; Tsenkova, 2008b; Gowen et al., 2013; Bázár et al., 2014, 2015), and even at ppb levels under certain experimental conditions (Sakudo et al., 2005, 2006b; Tsenkova et al., 2007b; Tsenkova, 2008a,b). Furthermore, the aquaphotomics research of biological systems introduced a concept of water spectral pattern as a holistic biomarker (Tsenkova, 2006c, 2007), which relates certain structures of water with functionalities of the respective biological systems, thus opening new directions toward non-destructive quality monitoring applications and non-invasive biodiagnosis.

The aquaphotomics research fields have two things in common. First, water is the common matrix of all the systems studied. Second, the approach to extract the information hidden in complex and multidimensional spectra of such systems requires a specific aquaphotomics methodology developed over the years and based on rich experience in dealing with a great variety of aqueous systems. The objective of this paper is to provide guidance about how to perform aquaphotomics analysis of NIR data. Using an example dataset of aqueous salt solutions, each step of the analysis will be explained and supplemented by similar examples from the existing literature illustrating how specific steps in data analysis provide new insights, improve spectral quality, or reveal new information. The basic methodology explained in this work is applicable to the analysis of NIR data of any aqueous system, with minor aqueous system- and purpose-specific adjustments. A step-by-step explanation of aquaphotomics analysis supplemented by citations of similar works will provide a solid basic knowledge about how to start and perform the analysis as well as where to look for further information. It is the hope of the authors that, with this new tool at the disposal of scientists and chemometricians, pharmaceutical and biomedical spectroscopy will utilize the richness of NIR water spectra to extend its applications far beyond moisture determination, leading to substantial progress beyond the current state of the art.

# GLOSSARY OF AQUAPHOTOMICS TERMS

This glossary is intended to define the terms and certain abbreviations commonly used in the aquaphotomics literature, which will appear throughout this paper. New terminology has emerged over time and with the development of aquaphotomics and the resulting need to better describe its subject of exploration using newly discovered knowledge. The origin and definitions for the terms are compiled from several sources, which are listed in the respective columns of **Table 2**.

With the main terms explained, we can now formulate the objective of aquaphotomics analysis: to apply the water mirror approach, i.e., to analyze aqueous systems as a whole, using their multidimensional spectra and focusing on water absorbance bands located at specific regions that allow observation and absorbance measurements. When activated water absorbance bands are found in response to some perturbation of interest, a water absorbance spectral pattern caused by the respective perturbation is identified. By compiling water absorbance patterns in an aquaphotome, aquaphotomics builds up a comprehensive database of the states of the analyzed system as a whole, in terms of identified water structures shaped by various internal or external perturbations. In future applications, the aquaphotome database will provide a rapid identification of causes for changes and influences on the system based on the recognized water spectral patterns, which serve as holistic markers of the state of the aqueous system or biomarkers in the case of biological systems (Tsenkova, 2006c; Kovacs et al., 2016).

# AQUAPHOTOMICS METHODS

# Basic Workflow and General Guidance

The basic workflow of aquaphotomics analysis from the experimental design to the final act of building an aquaphotome is illustrated in **Figure 1**. Similar to every conventional NIR spectroscopy work, everything starts with a proper experimental design and instrumental setup.

Although NIR spectroscopy, in general, does not require sample preparation, there are some specific aspects in aquaphotomics experimental design requiring more attention.

First of all, it is essential to ensure that the instrument provides high-quality spectral signals. In general, not all spectrometer systems are suited for aquaphotomics experiments. It is advisable to check the instrument's performance beforehand to ensure the high quality of the spectra in the entire Vis-NIR region (400–2,500 nm). All subsequent analysis will be highly influenced by the quality of raw spectral data. It is therefore of the utmost importance to evaluate raw spectra prior to any real experimental work. The basic analytical procedures for detecting errors in NIR data and evaluating signal quality have recently been provided in an extensive study by Bazar et al., which tested and compared the performance of three spectrometer systems (Bazar et al., 2016). This paper can be used as general guidance on how to test the quality and performance of an NIR instrument before venturing further.

Ensuring good spectral quality is particularly important since, in addition to the already known complexity of NIR spectra due to the overtone and combination modes resulting in broad bands, the changes in the spectra of aqueous systems caused by some perturbation of interest are small and subtle. The useful information may end up being buried in noise if the instrument does not provide a high signal-to-noise ratio. Another prerequisite is the use of a high-resolution instrument. Water absorbance bands in the NIR range are usually located very close to each other, so high spectral resolution of 0.5 or 1 nm will ensure an optimal detection and separation of the bands in a subsequent analysis.

An experiment should be carried out according to previously defined protocols to ensure the same environmental conditions. The purpose of carefully designed and established protocols is to minimize the influence of unknown factors that may affect sample spectra.

The specifics of the experimental design may vary depending on the type of aqueous system involved; however, the design must ensure that each sample is represented by several replicates (sample replicates) and that each measurement is performed by using several consecutive illuminations (consecutive replicates, consecutive spectra). Collecting and averaging multiple scans is part of the standard practice to remove noise: recording 64 or more scans per spectrum reduces the noise level significantly (Manley, 2014). Measuring liquid samples should always start with pure water (18.2 MΩ·cm), and all subsequent measurements should be done with the cuvette always placed in the same position (the same side). The same cuvette should be used throughout the experiment. It should first be rinsed at least three times with the sample before final filling. After that, it is placed in the sample holder and allowed to equilibrate before scanning in order to minimize inter-sample variation.

### TABLE 1 | Fields of aquaphotomics applications.

A reference measurement (blank air) should be done before each sample measurement. The order of sample measurements and sample replicates should be completely randomized, but pure water should always be scanned after a previously defined number of samples (e.g., every 5, 7, or 10 sample measurements). There are two reasons for measuring pure water in between samples. First, these spectra are used as an environmental control, monitoring known and unknown influences on water, and can later be used to correct or remove unwanted influences from sample spectra. Second, they build a large library of pure water spectra. Such a library has many advantages: it contains the spectra of pure water acquired over a longer period of time under different temperatures, humidity conditions and various day-to-day variations of the instrument and working environment. Building such a database has proved very useful for correction in general NIR applications (Tillmann and Paul, 1998). In addition, a novel method for enhancement of spectral signals has recently been developed, which also relies on building a similar library (Kojić et al., 2017).
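The measurement order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample labels, the fixed random seed, the function name `measurement_schedule` and the choice to start with a water scan and repeat it after every five samples are hypothetical, not a protocol prescribed by the authors.

```python
import random

def measurement_schedule(sample_ids, water_every=5, seed=0):
    """Return a randomized run order with a pure-water control placed
    first and then after every `water_every` sample measurements."""
    rng = random.Random(seed)        # fixed seed so the run sheet is reproducible
    order = list(sample_ids)
    rng.shuffle(order)               # randomize samples and sample replicates
    schedule = ["water"]             # liquid measurements start with pure water
    for i, sample in enumerate(order, start=1):
        schedule.append(sample)
        if i % water_every == 0:     # environmental control after each block
            schedule.append("water")
    return schedule

# Hypothetical labels: 10 concentration levels x 2 sample replicates
samples = [f"KCl_{c}mM_rep{r}" for c in range(10, 101, 10) for r in (1, 2)]
schedule = measurement_schedule(samples, water_every=5)
print(len(schedule), schedule[:7])
```

Logging this schedule together with the environmental parameters makes it easier to trace later which water control belongs to which block of samples.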

It is also advisable to monitor and log major external influences such as laboratory temperature, atmospheric pressure and humidity, as well as the temperature of the sample holder or cuvette. Measuring and logging external parameters can be very useful for identifying major sources of spectral variation as well as for exploring the dynamics of different aqueous systems under the same environmental perturbations.

As opposed to traditional NIR spectroscopy, which places emphasis on the control of the environment during the measurements, "perturbation" is often used in aquaphotomics and is sometimes even a necessary component of experiments, helping to reveal hidden information. The analysis of the spectra of aqueous systems under the influence of some chosen, intentional perturbation can be defined as an evaluation of the system by applying changes to the selected parameters and re-estimating the results (Tsenkova, 2007). In practice, the most frequently used perturbations to induce changes in the respective systems are changes in temperature (Gowen et al., 2013; Chatani et al., 2014; Putra et al., 2017; Wenz, 2018), consecutive illuminations (Tsenkova, 2005; Chatani et al., 2014; Wenz, 2018), and changes in dilution (Gowen et al., 2013; Wenz, 2018). Other types of perturbations can also be used to test the robustness of the models developed. For example, besides temperature perturbation, Putra et al. (2017) and Meilina et al. (2011) introduced perturbations by different metal ions to test the regression model developed for the measurement of cadmium concentrations in aqueous solutions. The use of intentional, artificially created perturbations provides a change in entropy and leads to the revelation of hidden spectral information (Tsenkova, 2006c). A recent work by Wenz on water in model membranes employed four types of perturbation in order to probe and thoroughly examine changes in the water matrix [i.e., temperature, consecutive illuminations, concentration (dilution), and difference in the molecular structure of phospholipids (identical fourteen-carbon acyl chains but with polar heads differing in the presence of a hydroxyl or a choline group)] (Wenz, 2018).

The most frequently used intentional perturbations (consecutive illuminations or increasing temperature) result in similar changes in the water matrix: an increase in the number of free water molecules, which are then available for "scanning" the rest of the system, in other words, for interacting with its components, which results in changes in sample spectra and provides additional information. Regarding unintentional perturbations, it is always advisable to investigate what perturbations (i.e., factors) have an influence on the developed models. These perturbations may include individual differences or the presence of disease in the case of biological systems studied, or even sample thickness (Tsenkova, 2004).

### TABLE 2 | Glossary of aquaphotomics terms.

The first step of analysis begins with the inspection of raw spectral data. Although NIR spectra of aqueous systems are comprised of broad, overlapping spectral bands, visual inspection of the spectra remains a vital step before any further data analysis. Visual inspection gives the first clues about the presence of outliers, helps in deciding which preprocessing steps to apply, and provides a general insight into how samples are grouped and on which spectral regions to focus attention. All the subsequent steps (data preprocessing, conventional spectral analysis and chemometrics application), which will be described in more detail later, serve to extract the information of interest. From the aspect of conventional data analysis, which involves building, testing and validating a model, either qualitative or quantitative depending on the objective of the experiment, the work is done when suitable prediction accuracy is achieved. However, this is only half of the work done in an aquaphotomics analysis. Each step of the analysis, i.e., raw data inspection, preprocessing, and conventional and chemometrics analysis (an array of exploratory, classification and regression analyses), provides certain quantitative outputs such as derivatives, subtracted spectra, regression vectors or loading vectors, discriminating power and others, which all unravel the water absorbance bands most affected by the perturbation of interest (WABs, **Figure 1**).

The NIR spectra of aqueous systems are very complex, and changes in their absorbance spectra caused by some perturbation will usually be very subtle, but nonetheless persistent and consistent. From all the WABs discovered during multiple steps of aquaphotomics analysis, a noticeable pattern of repeating, common absorbance bands will emerge to reveal perturbation-induced water absorbance bands, i.e., how and which water molecular conformations are affected. When this water absorbance spectral pattern (WASP) is recognized, it can be presented in a simple, yet concise and informative manner by using aquagrams. This aspect of aquaphotomics analysis adds one more dimension to the results obtained in that it provides understanding of the water functionality in the respective system. It allows linking discovered WASPs with the conditions of the aqueous systems analyzed, revealing how and why water changes the way it does under a certain perturbation. This is of special importance for living, biological systems. Storing WASPs in a large aquaphotome database allows for a fast comparison and identification of the state of aqueous or biological systems, thereby in essence providing biodiagnosis based on the state of water.

# Aquaphotomics Analysis of Potassium Chloride Solutions—A Worked-Out Example

To better illustrate the working process of aquaphotomics analysis, the next sections present an example of analysis performed on a spectral dataset of aqueous solutions of potassium chloride. The perturbation of the water matrix by salt and the measurement of salt concentration have already been reported in the aquaphotomics literature (Gowen et al., 2015) and even in very early near infrared spectroscopy applications (Grant et al., 1989). We have chosen this perturbation since it perfectly illustrates the aquaphotomics water-molecular and energy mirror concept, in that the salts are practically transparent to NIR light. Therefore, the results obtained are based entirely on the changes in the water molecular matrix. The experimental conditions are described next.

# Materials and Methods

### **Sample preparation**

Potassium chloride (KCl, M = 74.56 g·mol<sup>−1</sup>, purity ≥ 99.0% w/w, Wako Pure Chemical Industries, Ltd., Kobe, Japan) was used.

All samples were prepared by using deionized water from a Milli-Q water purification system (Millipore, Molsheim, France). A stock solution of 100 mM was prepared at first. Working solutions were made by serial dilution of the stock solution in 10-mM steps to produce the following KCl concentrations: 10, 20, 30, 40, 50, 60, 70, 80, and 90 mM. All samples of the stock and working solutions were freshly prepared in two independent sample replicates (i.e. a total of 20 samples for the analysis).
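The amounts behind such a concentration series follow from m = c·V·M for the stock solution and c<sub>1</sub>V<sub>1</sub> = c<sub>2</sub>V<sub>2</sub> for each working solution. The sketch below assumes, purely for illustration, that every working solution is diluted directly from the 100 mM stock; the volumes of 100 mL and 10 mL are hypothetical, since the text does not state them.

```python
M_KCL = 74.56      # g/mol, molar mass as stated in the text
STOCK_MM = 100.0   # mM, stock concentration as stated in the text

def kcl_mass_g(conc_mM, volume_mL, molar_mass=M_KCL):
    """Mass of KCl (g) for a solution of given concentration and volume: m = c * V * M."""
    return (conc_mM / 1000.0) * (volume_mL / 1000.0) * molar_mass

def stock_volume_mL(target_mM, final_volume_mL, stock_mM=STOCK_MM):
    """Stock volume (mL) needed for the target concentration: c1 * V1 = c2 * V2."""
    return target_mM * final_volume_mL / stock_mM

# Hypothetical volumes: 100 mL of stock, 10 mL of each working solution
stock_mass = kcl_mass_g(STOCK_MM, 100.0)
plan = {c: stock_volume_mL(c, 10.0) for c in range(10, 100, 10)}
for conc, v_stock in plan.items():
    print(f"{conc} mM: {v_stock:.1f} mL stock + {10.0 - v_stock:.1f} mL water")
```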

### **NIR spectra collection**

Transmittance spectra of KCl aqueous solutions were acquired by using a FOSS-XDS spectrometer (FOSS NIRSystems, Inc., Höganäs, Sweden) equipped with a Rapid Liquid Analyzer module that includes a temperature-controlled cuvette holder. The temperature of the sample holder was kept constant at 28°C during all measurements. This temperature was chosen to be close to the ambient temperature (ca. 28°C), allowing a fast and easy way of maintaining constant temperature during measurements. Each sample was first incubated in the sample holder for 90 s before scanning to reach the required temperature of 28°C. Deionized water samples were measured as an environmental control after every five sample measurements. Spectral acquisition order was randomized with respect to salt concentration. A quartz sample cell with a 1-mm path length was used as a container.

The spectra were acquired in the range of 400–2,500 nm, with a resolution of 0.5 nm. Each saved spectrum was an average of 32 successive scans. This number of scans was chosen to shorten the acquisition time. Three consecutive spectra were recorded for each sample and for each measurement. The reference spectrum was recorded before each measurement. The spectral data were transformed to pseudo-absorbance units (log T<sup>−1</sup>, where T = transmittance). One sample was represented by six spectra in total, from two independent sample replicates and three consecutive spectra. The total number of recorded spectra was 75 (10 concentrations × 2 sample replicates × 3 consecutive scans + 15 control scans of deionized water).
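The transformation to pseudo-absorbance and the bookkeeping of the number of spectra can be written out as a short sketch. The transmittance values below are made up for illustration, and the function name is hypothetical; only the formula log(1/T) and the counts come from the text.

```python
import math

def pseudo_absorbance(transmittance):
    """Convert transmittance values (0 < T <= 1) to pseudo-absorbance log10(1/T)."""
    return [math.log10(1.0 / t) for t in transmittance]

# Made-up transmittance values, for illustration only
T = [1.0, 0.5, 0.1, 0.001]
A = pseudo_absorbance(T)

# Number of spectra as described in the text
n_sample_spectra = 10 * 2 * 3     # concentrations x sample replicates x consecutive spectra
n_total = n_sample_spectra + 15   # plus control spectra of deionized water
print(A, n_total)
```

A transmittance of 0.001 already corresponds to 3 absorbance units, which gives a feeling for how little light reaches the detector in strongly absorbing regions.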

The FOSS-XDS instrument was operated by using VISION 3.5 software (FOSS NIRSystems, Inc., Hoganas, Sweden).

### **Data analysis**

For the purpose of this paper, the data analysis of KCl solutions was performed by using only the wavelength range from 1,300 to 1,600 nm, which represents the absorption region of OH bonds of water (1st overtone of OH).

Smoothed spectra were calculated by using a Savitzky-Golay polynomial filter (2nd order polynomial fit and 21 points). Difference spectra were calculated by subtracting the average spectrum of deionized water from the average spectra of potassium chloride solutions for each concentration level. The 2nd derivative spectra of potassium chloride solutions were calculated by using a Savitzky-Golay filter (2nd order polynomial fit and 21 points). Principal component analysis (PCA) was used to describe multidimensional patterns in the spectral data and to discover outliers. The relationship between the actual and predicted concentrations of KCl was examined by using Partial Least Squares Regression (PLSR) based on leave-one (concentration)-out cross-validation, i.e., leaving out the six spectra of the two independent sample replicates of one concentration at a time during the iterative validation process.
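To make the filter settings concrete, the following sketch computes Savitzky-Golay weights for a 2nd-order polynomial and a 21-point window and applies them as a moving weighted sum. It uses the standard closed-form weights for a quadratic fit on equally spaced points and is not the implementation used by the authors (the analysis in this paper was done in R with aquap2); the function names and the choice to drop the edge points are assumptions of this example.

```python
def sg_coefficients(window=21, deriv=0):
    """Savitzky-Golay weights for a 2nd-order polynomial fit on an odd window.

    deriv=0 gives smoothing weights; deriv=2 gives second-derivative
    weights expressed per squared sampling step."""
    if window % 2 == 0 or window < 3:
        raise ValueError("window must be odd and at least 3")
    m = window // 2
    idx = range(-m, m + 1)
    if deriv == 0:
        denom = (2 * m - 1) * (2 * m + 1) * (2 * m + 3)
        return [3 * (3 * m * m + 3 * m - 1 - 5 * i * i) / denom for i in idx]
    if deriv == 2:
        denom = m * (m + 1) * (2 * m - 1) * (2 * m + 1) * (2 * m + 3)
        return [30 * (3 * i * i - m * (m + 1)) / denom for i in idx]
    raise ValueError("only deriv=0 and deriv=2 are sketched here")

def sg_filter(y, window=21, deriv=0, step=1.0):
    """Filter interior points only; (window - 1) / 2 points are lost at each end."""
    w = sg_coefficients(window, deriv)
    scale = step ** deriv            # converts per-index derivative to per-unit derivative
    return [
        sum(c * v for c, v in zip(w, y[start:start + window])) / scale
        for start in range(len(y) - window + 1)
    ]

# Synthetic check signal sampled every 0.5 units: y = 2x^2 + 3x + 1
x = [0.5 * k for k in range(41)]
y = [2 * xi ** 2 + 3 * xi + 1 for xi in x]
smoothed = sg_filter(y, window=21, deriv=0)
second_deriv = sg_filter(y, window=21, deriv=2, step=0.5)
```

Because a quadratic signal is fitted exactly by a 2nd-order polynomial, the smoothed values equal the input at the window centers and the second derivative is constant, which is a convenient sanity check before applying the filter to real spectra.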

The regression was performed on the previously smoothed (Savitzky-Golay filter, 2nd order polynomial filter, 21 points) and multiplicative scatter corrected (MSC) spectra in the spectral range of 1,300–1,600 nm. The precision and accuracy of the developed PLSR model were evaluated by the coefficient of determination (R<sup>2</sup>) and root mean square error (RMSE) of cross-validation.
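The validation scheme and the two figures of merit can be sketched independently of the regression method itself. The example below only generates the leave-one-concentration-out folds and defines RMSE and R<sup>2</sup>; it does not fit a PLSR model. The function names are hypothetical, and R<sup>2</sup> is computed here as 1 − SS<sub>res</sub>/SS<sub>tot</sub>, which is an assumption, since the text does not state which definition was used.

```python
import math

def leave_one_concentration_out(concentrations):
    """Yield (held_out_level, train_idx, test_idx); all spectra of one
    concentration level are held out together in each fold."""
    for level in sorted(set(concentrations)):
        test = [i for i, c in enumerate(concentrations) if c == level]
        train = [i for i, c in enumerate(concentrations) if c != level]
        yield level, train, test

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Labels for the 60 sample spectra: 10 levels x (2 replicates x 3 consecutive spectra)
labels = [c for c in range(10, 101, 10) for _ in range(6)]
folds = list(leave_one_concentration_out(labels))
print(len(folds), len(folds[0][1]), len(folds[0][2]))
```

In a full analysis, a model would be fitted on the training indices of each fold and used to predict the six held-out spectra; the pooled predictions then enter `rmse` and `r_squared`.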

Raw spectra, difference spectra, loading vectors of PCA analysis, and regression vector of PLSR analysis were examined in order to find and assign characteristic water absorbance bands showing considerable changes in response to changes in KCl concentration. Thus, identified bands were used to describe water spectral pattern of salt solutions. To visually represent changes of water spectral pattern as a function of salt concentration, different types of aquagrams were constructed, namely classic aquagrams, aquagrams with confidence intervals and temperature-based aquagrams. The instructions for all necessary calculations and steps to produce these charts are explained in a separate section (Water spectral pattern represented by aquagrams).

All data analysis was performed by using R Project for Statistical Computing (R Core Team, 2017) (RRID:SCR\_001905) and an "aquap2" package (Pollner and Kovacs, 2016).

### Aquap2 Package

The "aquap2" package developed by Pollner and Kovacs (2016) (free download and instructions available at www.aquaphotomics.com) provides easy-to-use data preparation and analysis tools that extend the functionality of the R project software to the needs of aquaphotomics. It is non-commercial, free-to-use software, which can dramatically speed up analysis, especially in the case of large datasets. It is very flexible and allows the automation of highly repetitive tasks, while also providing special functionalities not available in commercial chemometrics software, such as the frequently used aquagram charts.

Aquap2 package offers the following functionalities:


# THE POWER OF RAW SPECTRA AND CONVENTIONAL SPECTROSCOPIC ANALYSIS

With so many chemometrics methods available, one often neglects the possibility that something can be extracted from the raw spectra, especially since changes in the water spectra in the near infrared region are subtle and difficult to observe with the naked eye. However, the first, most natural step in all data analysis is to inspect the raw data.

In the NIR region, the water spectrum consists of four main maxima located approximately at 970, 1,190, 1,450, and 1,940 nm, which are due to the second overtone of the OH stretching band (3ν<sub>1,3</sub>), the combination of the first overtone of the OH stretching and the OH bending band (2ν<sub>1,3</sub> + ν<sub>2</sub>), the first overtone of the OH stretching band (2ν<sub>1,3</sub>) and the combination of the OH stretching and OH bending band (ν<sub>1,3</sub> + ν<sub>2</sub>), respectively (Luck, 1974). All these regions are informationally valuable. So far, more than 500 water absorbance bands have been identified under these broad peaks (Tsenkova, 2009; Tsenkova et al., 2015). Depending on the type of aqueous system, some regions can prove to be more suitable for analysis and provide more information; hence it is always advisable to closely examine each of these regions.

Let us now look at the raw, untreated spectra acquired for our potassium chloride example dataset (**Figure 2**).

The raw spectra were plotted to visualize the spectral changes introduced by adding different concentrations of salt to pure water. Two large peaks (around 1,450 and 1,940 nm, attributed to the first overtone and the combination region of OH stretching and bending vibrations, respectively) dominate the spectra of potassium chloride solutions. This is expected because the salt itself does not absorb in the NIR region. Very small, broad features can also be observed around 1,190 nm. The region of the combination band shows significant noise due to the high absorption of water, which far exceeds 3 absorbance units; this region will therefore be excluded from subsequent analysis. Further analysis will be performed only in the region of the first overtone of water, where, for the most part, water absorbance bands can be clearly resolved and for which good literature sources exist about the specific assignments of water molecular conformations.

In this stage of data evaluation, two types of calculations are usually performed: averaging and spectral subtraction. The averaging can be done across all consecutive spectra and sample replicates. At this stage, the goal of averaging is to eliminate the influence of variations that are not of primary interest, such as those attributable to different temperatures, humidity, or consecutive illumination. The average spectra calculated this way will better reveal differences among sample groups. However, averaged spectra are influenced by outliers, so outliers should be detected and eliminated before this step.

The next step is a spectral subtraction, which produces difference spectra. This is a very effective way for detection of subtle differences between the two spectra (Ozaki et al., 2003).

There are many approaches to spectral subtraction. The simplest, classical approach is to subtract the averaged spectrum of pure water (or of the solvent) measured as a control during the experiment from the average spectrum of all samples. This is the simplest and most efficient way of immediately improving the visualization and observation of the water bands hidden under broad overtone and combination peaks.
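As a minimal sketch of this classical subtraction, the example below averages a group of sample spectra and a group of water control spectra and takes their difference. The spectra are represented as plain lists of absorbance values on a common wavelength grid; the numbers and function names are invented for illustration.

```python
def mean_spectrum(spectra):
    """Average several spectra (lists of equal length) wavelength by wavelength."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]

def difference_spectrum(sample_spectra, water_spectra):
    """Average sample spectrum minus average pure-water spectrum."""
    sample_mean = mean_spectrum(sample_spectra)
    water_mean = mean_spectrum(water_spectra)
    return [s - w for s, w in zip(sample_mean, water_mean)]

# Invented absorbance values at three wavelengths
sample_spectra = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
water_spectra = [[1.0, 1.0, 1.0], [1.0, 3.0, 1.0]]
diff = difference_spectrum(sample_spectra, water_spectra)
print(diff)
```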

Another, recently developed subtraction method proposes a "closest spectrum" subtraction (Kojić et al., 2017). This method involves creating all the possible pairs of differences (solution minus pure solvent) and finding the closest spectral pair (minimal difference) based on the smallest area under the curve of the difference spectrum. The spectrum found in this way, the "closest spectrum," is then subtracted from the remaining spectra. Pure solvent spectra can be acquired during the experiment or found in a library of solvent spectra, which must be created beforehand by acquiring spectra under various, mainly temperature, perturbations. This method provides, on average, a 4-fold increase in precision as compared to the traditionally used average spectrum subtraction (Kojić et al., 2017).
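One possible reading of this procedure is sketched below: for a given solution spectrum, every solvent spectrum in a library is tried, and the one giving the smallest area under the absolute difference curve is subtracted. This is an illustrative interpretation only and not the published algorithm; the use of the absolute difference, the trapezoidal area, the function names and the toy numbers are all assumptions.

```python
def abs_area(diff, step=1.0):
    """Trapezoidal area under |diff| on an equally spaced wavelength grid."""
    a = [abs(d) for d in diff]
    return step * (sum(a) - 0.5 * (a[0] + a[-1]))

def closest_spectrum_subtraction(solution, solvent_library, step=1.0):
    """Return (index, difference) for the solvent spectrum closest to `solution`."""
    best_index, best_diff, best_area = None, None, float("inf")
    for k, solvent in enumerate(solvent_library):
        diff = [s - w for s, w in zip(solution, solvent)]
        area = abs_area(diff, step)
        if area < best_area:          # keep the pair with the minimal difference
            best_index, best_diff, best_area = k, diff, area
    return best_index, best_diff

# Toy spectra at three wavelengths
solution = [1.0, 2.0, 3.0]
library = [[0.0, 0.0, 0.0], [1.0, 1.5, 3.0], [2.0, 2.0, 2.0]]
index, diff = closest_spectrum_subtraction(solution, library)
print(index, diff)
```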

Another way of enhancing differences is to calculate the difference spectrum along some perturbation of interest. This type of subtraction can reveal water absorbance bands activated by a particular perturbation. This simple approach, for example, allowed an immediate identification of the main differences in the water structure between the groups of bacterial cultures S. aureus and E. coli (Nakakimura et al., 2012). In addition, in the study of the effect of soybean mosaic virus, the difference spectrum between the average spectra of healthy and diseased plants clearly revealed water absorbance bands due to virus-induced changes (Jinendra et al., 2010). Another example can be found in a study of the spectral behavior of mushrooms subjected to physical perturbation by different levels of mechanical vibration (Gowen et al., 2009b). The difference spectra obtained by subtracting the averaged spectrum of undamaged mushrooms from the averaged spectra of damaged mushrooms subjected to different perturbation levels revealed sharp features around 1,398 nm for the two highest levels of perturbation, which correspond to the absorption of free single water molecules trapped by ions (Kojić et al., 2014) at the mushroom surface, originating from physically damaged cell walls.

Another highly efficient approach for revealing different water dynamics in samples is the subtraction of the first consecutive spectrum from all other consecutive measurements. This subtraction technique was first applied in a study of different prion protein isoforms in water solutions (Tsenkova et al., 2004; Tsenkova, 2005), when it was shown for the first time that illumination changes the water system and that each consecutive spectrum of the sample is influenced by light absorption. The absorbed photons increased the number of free water molecules available to interact with solutes in the aqueous system, performing "scanning" of solutes and the rest of the water molecular system and resulting in changes of the corresponding spectra. In this way, additional information can be extracted, which is especially beneficial when the aqueous systems analyzed are very similar. In the case of the prion protein study, this approach revealed drastic differences in the free O-H absorbance bands and superoxides for different prion protein isoforms (Tsenkova et al., 2004; Tsenkova, 2005).
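In code, this subtraction amounts to using the first consecutive spectrum of a sample as its own reference. The short sketch below assumes that the consecutive spectra of one sample are stored in acquisition order; the values and the function name are invented for illustration.

```python
def subtract_first_consecutive(consecutive_spectra):
    """Subtract the first consecutive spectrum from each later one (same sample)."""
    first = consecutive_spectra[0]
    return [
        [value - ref for value, ref in zip(spectrum, first)]
        for spectrum in consecutive_spectra[1:]
    ]

# Three invented consecutive spectra of one sample at two wavelengths
consecutive = [[1.0, 2.0], [1.5, 2.0], [2.0, 2.5]]
illumination_effect = subtract_first_consecutive(consecutive)
print(illumination_effect)
```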

The spectra transformed as just described can also be further analyzed by using other data-mining approaches.

# SPECTRAL PREPROCESSING—IMPROVING AND ENHANCING SPECTRAL INFORMATION

The fundamental problem, not only in aquaphotomics analysis but also generally in all spectral analysis, is how to extract the useful information hidden in the complex spectral measurements. The objective of preprocessing is to enhance the information of interest, and decrease or remove unwanted influences on spectral signals.

The spectral preprocessing methods include mathematical pretreatments, such as centering and normalization [mean-centering, standard normal variate transformation (SNV) (Barnes et al., 1993)]; noise-reduction methods, such as smoothing or wavelet transform (Patil, 2015); baseline correction methods, which include de-trending (Barnes et al., 1989), multiplicative scatter correction (MSC) (Dhanoa et al., 1994), and extended multiplicative scatter correction (EMSC) (Martens and Martens, 2001); and spectral derivatives which, in addition to baseline correction, also resolve overlapping peaks.

The collected spectral patterns are usually affected by noise or instrumental variations that may have a detrimental effect on further analysis and on the conclusions drawn (Gowen and Amigo, 2012). The weakly absorbing bands in the NIR region are affected far more than the stronger ones. The best way to ensure high-quality, low-noise spectra begins with careful control of the conditions of spectral collection. Usually, collecting and averaging multiple scans successfully reduces the noise. However, some level of noise should still be expected, so the common practice is to use smoothing techniques (Manley, 2014).

The most common de-noising techniques used in aquaphotomics are based on the Savitzky–Golay approach (Savitzky and Golay, 1964), which fits the spectral pattern to a polynomial function (second-order polynomial) in a step-wise manner. Continuous wavelet transform (CWT) is another de-noising technique that has proved very efficient for processing analytical signals (Shao et al., 2003), and it has recently been used frequently for enhancing spectral resolution and removing background in aquaphotomics works (Shao et al., 2010; Kang et al., 2011; Shan et al., 2015; Cui et al., 2016).
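As an illustration of Savitzky–Golay smoothing, the sketch below applies `scipy.signal.savgol_filter` with a second-order polynomial and a 21-point window to a synthetic noisy band. The synthetic spectrum, the 1 nm sampling step and the noise level are assumptions made only for this example.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)

# Synthetic stand-in for a first-overtone water band (assumed 1 nm step).
wavelengths = np.arange(1300.0, 1601.0)
clean = np.exp(-0.5 * ((wavelengths - 1450.0) / 40.0) ** 2)
noisy = clean + rng.normal(scale=0.01, size=wavelengths.size)

# Second-order polynomial fitted in a moving window of 21 points.
smoothed = savgol_filter(noisy, window_length=21, polyorder=2)
```

A wider window removes more noise but flattens narrow features, so the window size is a trade-off that should be checked against the width of the bands of interest.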

Mean centering of spectra is a pre-processing technique mostly used with principal component analysis (Agelet and Hurburgh Jr, 2010). It involves subtraction of the average spectrum from every spectrum in the dataset, which results in a reduced number of variables and reduced complexity of subsequently built models (Manley, 2014).
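Mean centering amounts to one subtraction per wavelength. A minimal sketch, assuming spectra are stored as rows and wavelengths as columns (the numbers are hypothetical):

```python
import numpy as np

# Hypothetical data: rows are spectra, columns are wavelengths.
X = np.array([
    [0.50, 0.80, 0.60],
    [0.52, 0.78, 0.61],
    [0.54, 0.76, 0.65],
])

mean_spectrum = X.mean(axis=0)   # average spectrum of the dataset
X_centered = X - mean_spectrum   # every column now has zero mean
```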

Apart from random noise, the spectra of aqueous systems often exhibit baseline variations (in slope and offset) due to scattering that originates from differences in sample surface or particle size (Ozaki et al., 2003). Baseline offset problems are commonly solved by applying SNV or MSC correction methods. MSC is a better choice when variations in the spectral slope are also present as a result of additive variation, which increases with wavelength due to the scattering present in samples. The disadvantage of the MSC transformation is that it depends on the sample set; hence any change in the sample set requires recalculation of the MSC correction and of all subsequent calculations (Dhanoa et al., 1994).
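The two corrections can be sketched as follows. This is a minimal illustration under common textbook definitions: SNV centers and scales each spectrum by its own mean and standard deviation, while MSC regresses each spectrum on a reference (here, by assumption, the mean spectrum of the set) and removes the fitted offset and slope. The function names and toy data are hypothetical.

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    If no reference is given, the mean spectrum of X is used, which is why
    the result depends on the sample set.
    """
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        # Fit spectrum = intercept + slope * ref, then undo both terms.
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected

# Toy data: one underlying shape with different offsets and scalings.
base = np.array([0.2, 0.5, 0.9, 0.6, 0.3])
X = np.vstack([0.05 + 1.0 * base, 0.10 + 1.2 * base, -0.02 + 0.8 * base])
X_snv = snv(X)
X_msc = msc(X)
```

In this toy case both methods collapse the three spectra onto a single curve, because the only differences between them are an offset and a scale factor.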

Detrending is also a possible choice for correction of baseline shift and curvilinearity. This method consists of modeling the baseline as a function of wavelength with a second-degree polynomial and a subsequent subtraction of this function from each spectrum individually.
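A minimal sketch of detrending as described above: a second-degree polynomial in wavelength is fitted to each spectrum by least squares and subtracted. The function name, the wavelength rescaling (used only for numerical conditioning) and the synthetic data are assumptions of this example.

```python
import numpy as np

def detrend(X, wavelengths, degree=2):
    """Subtract a per-spectrum polynomial baseline fitted over wavelength."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(wavelengths, dtype=float)
    w = (w - w.mean()) / w.std()            # rescale for a well-conditioned fit
    V = np.vander(w, degree + 1)            # columns: w^2, w, 1
    coef, *_ = np.linalg.lstsq(V, X.T, rcond=None)
    return X - (V @ coef).T

wavelengths = np.linspace(1300.0, 1600.0, 151)
baseline = 0.1 + 2e-4 * (wavelengths - 1300.0) + 1e-6 * (wavelengths - 1300.0) ** 2
band = np.exp(-0.5 * ((wavelengths - 1450.0) / 20.0) ** 2)

# Row 0 is pure baseline; row 1 is the same baseline plus an absorbance band.
X = np.vstack([baseline, baseline + band])
X_detrended = detrend(X, wavelengths)
```

Because the polynomial is fitted to the whole spectrum, it also absorbs part of any broad band, which is one reason baseline correction can remove information of interest.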

Care is needed when correcting for baseline variations, because they sometimes contain information of interest. For example, in a study of prion protein isoforms, the benefit of multiplicative scatter correction was 2-fold. First, it confirmed the presence of scattering for one isoform of prion protein, which helped to better understand its interaction with water by explaining that an increase in bulk water and changes in protein structure are the cause of scattering. Second, when correction for the scattering was applied, a subsequent analysis revealed differences among protein isoforms not related to the scatter (Tsenkova et al., 2004). However, in a problem of somatic cell count determination, removal of the baseline variation by application of the second derivative transformation led to a diminished accuracy of prediction of somatic cell count in milk, leading to the conclusion that the baseline correction removed significant information (Tsenkova et al., 2001a).

The use of derivatives as a pre-processing technique for NIR data is quite common. There are two ways of calculating derivatives: the Norris–Williams derivation (Norris and Williams, 1984) and the Savitzky–Golay derivation (Savitzky and Golay, 1964). Derivatives can solve two basic problems with NIR spectra of aqueous systems: overlapping peaks and large baseline variations. The effect of derivatives is most clearly seen in the second derivative of a spectrum, which is able to separate overlapping bands. The second effect of the second derivative is removal of baseline shifts (Williams and Norris, 1987; Heise and Winzen, 2002). Two side effects of derivatives are the loss of the original shape of the spectral curve, which may make data interpretation difficult, and a reduction in signal-to-noise ratio. The window size used when calculating derivatives should also be chosen with caution in the case of spectra of aqueous systems, because this parameter influences the number of points in the resulting spectral vector (Rinnan et al., 2009), which may lead to a wavelength loss and a subsequent loss of information about some water bands.
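Both effects of the second derivative can be illustrated on synthetic data. The sketch below builds two strongly overlapping bands on top of a sloped baseline and computes a Savitzky–Golay second derivative with `scipy.signal.savgol_filter`; bands show up as minima. The band positions, widths, 1 nm step and the threshold used to pick prominent minima are assumptions of this example.

```python
import numpy as np
from scipy.signal import savgol_filter

wavelengths = np.arange(1300.0, 1601.0)  # assumed 1 nm step

def gaussian_band(center, width):
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

# Two overlapping bands plus a baseline offset and slope.
spectrum = (gaussian_band(1412.0, 25.0) + 0.9 * gaussian_band(1462.0, 25.0)
            + 0.2 + 0.001 * (wavelengths - 1300.0))

# Second derivative: 2nd order polynomial, 21 points.
d2 = savgol_filter(spectrum, window_length=21, polyorder=2, deriv=2, delta=1.0)

# Absorbance bands appear as local minima of the second derivative.
interior = np.arange(1, d2.size - 1)
is_min = (d2[interior] < d2[interior - 1]) & (d2[interior] < d2[interior + 1])
prominent = d2[interior] < 0.25 * d2.min()
band_positions = wavelengths[interior[is_min & prominent]]
```

The linear baseline contributes nothing to `d2`, and the two minima fall within a few nanometers of the true band centers; the small shifts are caused by the neighboring band.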

Iwamoto et al. (1987) showed that the derivative transformation of spectra was a useful method of separating multiple absorptions in broad spectral peaks of water and used it successfully to better understand the state of water in foodstuffs. In aquaphotomics applications, the second derivative is a very popular and efficient approach for discovering activated water absorbance bands that are not visible in the original spectrum (see for example Jinendra et al., 2010; Jinendra, 2011; Kinoshita et al., 2012; Bázár et al., 2016; Kovacs et al., 2016).

Let us now look at examples of the application of these preprocessing steps to the spectra of potassium chloride solutions. The smoothed spectra were calculated by using a Savitzky-Golay filter (2nd order polynomial fit and 21 points) and are presented in **Figure 3**. Only the area of the first overtone, 1,300–1,600 nm, is plotted to provide a better visualization of how smooth the spectra should look. Next, the average spectrum of Milli-Q water was subtracted from all the averaged spectra of potassium-chloride solutions; the result is presented in **Figure 4**.

The subtracted spectra revealed the existence of at least two major peaks under the broad overtone spectral curve of potassium-chloride solutions, around 1,412 and 1,500 nm. It is also possible to observe a slight peak shift at 1,412 nm with increasing salt concentration.

The 2nd derivative spectra of potassium-chloride solutions were calculated by using a Savitzky-Golay filter (2nd order polynomial and 21 points) and are presented in **Figure 5**. The second derivative spectra also indicate the existence of the band at 1,412 nm, and a second band located at 1,462 nm can be seen as well.

With these simple preprocessing steps, we have so far identified at least two water absorbance bands activated by salt perturbation.

# CHEMOMETRICS- THE IMPORTANCE OF CONSISTENCY

As in classical spectroscopy, the use of chemometrics methods is a crucial part of aquaphotomics data analysis. It includes many well-known exploratory, classification and regression methods, depending on the objective of the experiment.

Principal component analysis (PCA) (Cowe and McNicol, 1985) is one of the most useful and probably the most commonly used exploratory techniques in spectroscopy during the early stages of data analysis. Its objective is to determine possible relationships between samples, i.e., to provide the first clues about major directions and sources of variation in the dataset. It compresses data by constructing new variables, and the results are presented in scores and loadings plots. The scores plots visualize the spectra as scores in the transformed space of the newly constructed variables (the principal components), while the corresponding loadings plots denote the contributions of the original variables (the wavelengths). The novelty of PCA application in aquaphotomics analysis is that particular attention is given to the analysis of all loading vectors, as they can reveal activated water absorbance bands.

PCA in the case of our salt dataset was used to describe multidimensional patterns in the spectral data and to discover outliers. The PCA results presented in the scores (**Figures 6**, **7**) and loadings plots (**Figure 8**) reveal major sources of variation in the data. The first two principal components describe more than 99.9% of variation in the dataset. The first principal component, whose loading shows two dominant features (a positive peak at 1,415 nm and a negative peak at 1,498 nm), is related to changes in the water matrix caused by consecutive illumination. This effect is similar to that of temperature (Segtnan et al., 2001) in that free or weakly hydrogen bonded species absorbing at 1,415 nm increase at the expense of strongly hydrogen bonded water molecules absorbing at 1,498 nm. The second principal component, which explains 11.403% of variation, shows the influence of concentration. It can be seen from the PC1-PC2 scores plot that while the scores move toward the negative part of PC2 with increasing concentration, the pure water scores are entirely located in the positive part of this PC. The loading vector of PC2 presented in **Figure 8** reveals the major water absorbance bands affected by the presence of salt in water, i.e., 1,402, 1,444, and 1,530 nm. Regarding loading vectors, it is very important to look at all PC loadings, since changes in water are very subtle and might also be described by higher-order PC loading vectors.
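To make the relation between scores, loadings and explained variation concrete, here is a minimal PCA sketch based on the singular value decomposition of the mean-centered data. It is not the software used for the figures; the function name and the synthetic two-source dataset are assumptions of this example.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of mean-centered spectra via SVD.

    Returns scores (samples x components), loadings (components x wavelengths)
    and the fraction of variation explained by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]

# Synthetic spectra driven by two independent sources of variation.
rng = np.random.default_rng(0)
wavelengths = np.linspace(1300.0, 1600.0, 151)
profile_a = np.exp(-0.5 * ((wavelengths - 1415.0) / 20.0) ** 2)
profile_b = np.exp(-0.5 * ((wavelengths - 1498.0) / 25.0) ** 2)
amounts = rng.uniform(0.0, 1.0, size=(12, 2))
X = 1.0 + amounts @ np.vstack([profile_a, profile_b])

scores, loadings, explained = pca_svd(X, n_components=2)
```

Plotting each row of `loadings` against `wavelengths` gives the loadings plot in which peaks point to the wavelengths that drive each component.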

The next steps of the analysis depend on the objective of the experiment. They can involve classification methods to group samples together according to their spectra, or regression methods to link sample spectra to some quantifiable properties (Roggo et al., 2007). The application of these methods in aquaphotomics analysis does not differ much from classical NIR applications. However, the aquaphotomics approach has the following unique characteristics.

First, the initial step of the aquaphotomics approach involves qualitative analysis. This step may include the application of PCA or some unsupervised classification analysis, performed with the objective of data exploration and better understanding of spectral variability. This step may even include some preliminary regression analysis, which can show very poor prediction results and the existence of non-linearity. However, it can provide information about the existence of natural clusters of samples, indicating the need for separate modeling of the different groups of samples thus discovered. For example, the most accurate prediction of milk components such as protein, lactose and fat in cow milk was achieved when the models were built separately by using milk spectra from healthy and mastitic animals (Tsenkova et al., 2001a,c). A subtraction of the averaged spectra of these two groups gives the first information about the "important" WAMACS to be used in further analysis. The presence of mastitis (a bacterial infection) significantly alters the structure of water in milk and the milk composition, causing non-linearity in the regression models if the spectra of healthy and mastitic animals are used together. In this case, separately built regression models form a part of the aquaphotome database, where a different regression model is applicable depending on the physiological status of the animal. In this respect, aquaphotomics neither aims to build global models nor considers it possible. This is especially true in the analysis of biological systems, which are far too complex to be described with only one model.

Second, the most important feature of aquaphotomics analysis is the special attention paid to original and transformed spectral vectors as well as model outputs. This reveals the contribution of the original variables (wavelengths) to model development and tracks consistently repeating variables. The identified variables with high contribution, which constantly repeat through all the steps of aquaphotomics analysis, are the most informative ones. For aquaphotomics, these variables are the places in the spectra where various water molecular conformations absorb. Their identification is crucial for better understanding of the aqueous system and the response of its water matrix to the perturbation. In other words, the variables that consistently appear in all aquaphotomics analyses (i.e., in subtracted spectra or transformed spectra, spectral derivatives, and model outputs in the form of PCA loadings, PLSR regression vectors, SIMCA discriminating powers, etc.) are the locations of water absorbance bands where spectral variations under controlled and uncontrolled perturbations can be observed. If they persistently and consistently appear throughout the analysis, we can consider these water absorbance bands as activated.

Let us now look at the PLSR application on our salt dataset. The regression was performed on previously smoothed (Savitzky-Golay filter, 2nd order polynomial, 21 points) and MSC transformed spectra in the spectral range of 1,300–1,600 nm to build a model for prediction of potassium-chloride concentration. The results of PLSR analysis are presented in **Figures 9**, **10**, showing a close correlation and a relatively low error of cross-validation using five latent variables (r<sup>2</sup> = 0.9989, RMSECV = 1.147 mM, **Figure 9**). The main absorbance bands showing a significant weight in the PLS regression vector (**Figure 10**) match very well with those found by the previously applied methods, and all belong to the ranges of WAMACS found in the first overtone of water (Tsenkova, 2009). The favorable prediction results are not surprising, since it is well established that salts influence the spectrum of water and that these changes can be used for prediction of salt concentration (Grant et al., 1989; Gowen et al., 2015). Because salts do not absorb NIR light, these results and the previously mentioned studies demonstrate the feasibility of the aquaphotomics water-mirror approach. In other words, the absorbance bands of water can be used to obtain information about changes in solute concentrations indirectly.
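The workflow of PLSR with one-sample-out cross-validation can be sketched with a small NIPALS PLS1 implementation. This is not the software or the data behind Figures 9 and 10: the synthetic spectra, the assumed salt-induced difference profile, the noise levels and the choice of two latent variables are all illustrative assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """NIPALS PLS1; returns the regression vector and the centering terms."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)        # weight vector
        t = Xr @ w                    # scores
        tt = t @ t
        p = Xr.T @ t / tt             # X loadings
        c = (yr @ t) / tt             # y loading
        Xr = Xr - np.outer(t, p)      # deflate X
        yr = yr - c * t               # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)   # regression vector
    return b, x_mean, y_mean

def pls1_predict(X, model):
    b, x_mean, y_mean = model
    return y_mean + (X - x_mean) @ b

def loo_rmsecv(X, y, n_components):
    """One-sample-out cross-validation; returns RMSECV and the predictions."""
    predicted = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = pls1_fit(X[keep], y[keep], n_components)
        predicted[i] = pls1_predict(X[i:i + 1], model)[0]
    return np.sqrt(np.mean((predicted - y) ** 2)), predicted

# Synthetic data: a water band perturbed in proportion to concentration (mM).
rng = np.random.default_rng(1)
wavelengths = np.linspace(1300.0, 1600.0, 151)
water = np.exp(-0.5 * ((wavelengths - 1450.0) / 50.0) ** 2)
effect = 1e-3 * (np.exp(-0.5 * ((wavelengths - 1412.0) / 15.0) ** 2)
                 - np.exp(-0.5 * ((wavelengths - 1500.0) / 20.0) ** 2))
conc = np.repeat(np.arange(0.0, 101.0, 10.0), 2)
offsets = rng.normal(scale=0.01, size=conc.size)
X = (water + np.outer(conc, effect) + offsets[:, None]
     + rng.normal(scale=1e-4, size=(conc.size, wavelengths.size)))

rmsecv, predicted = loo_rmsecv(X, conc, n_components=2)
```

The regression vector `b` returned by `pls1_fit` is the quantity whose large weights are compared with the bands found in the difference spectra, derivatives and PCA loadings.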

FIGURE 6 | PCA analysis of Milli-Q water and aqueous solutions of potassium-chloride in the concentration range of 10–100 mM derived from the smoothed (calculated with a Savitzky-Golay filter using 2nd order polynomial and 21 points) and MSC transformed absorbance (log T<sup>−1</sup>) spectra in the spectral range of 1,300–1,600 nm (OH first overtone): scores plots for the first two principal components.

FIGURE 7 | PCA scores plots for the first six principal components, spectral range 1,300–1,600 nm (OH first overtone).

It is worth mentioning that the analysis may include several more chemometrics methods that can also contribute to the identification of water absorbance bands activated by the perturbation of interest.

Employing discriminant analysis such as Partial Least Squares Discriminant Analysis (PLS-DA) (Martens and Martens, 2001) for discriminating between solvent and solutions can give more insight into how the solutes affect the water matrix of the solvent. For example, this method was employed to discriminate between solvent and pesticide-containing solutions (Gowen et al., 2011). Examination of the regression vectors of PLS discriminant analysis provides additional help in revealing water absorbance bands activated by the presence of solutes.

Similarly, Soft Modeling of Class Analogies (SIMCA) (Wold and Sjöström, 1977) can be employed for the same purpose. The discriminating power of SIMCA analysis, in that case, reveals the water absorbance bands with the highest discriminating power, which distinguish between pure solvent and solutions. One such example can be found in an aquaphotomics study concerned with measurements of different saccharides at millimolar concentrations (Bázár et al., 2015). Sometimes, both discrimination methods (SIMCA and PLS-DA) are employed for the same purpose of discriminating the solvent from the solutions and discovering additional information about water absorbance bands activated by solutes. In a study concerned with the detection of UVC damaged DNA, both PLS-DA and SIMCA were applied to distinguish between non-irradiated and UVC-irradiated DNA solutions (Goto et al., 2015). Applying two chemometrics methods for the examination of one aspect of the experimental study demonstrates the stability of the applied methodology, namely, consistency in results.

FIGURE 9 | PLSR prediction of potassium-chloride concentration: Y fit of training and one-sample-out cross-validation.

SIMCA and PLS-DA are naturally the most commonly used methods in aquaphotomics when the objective of the study is classification of, or discrimination between, different samples. The SIMCA method was employed, e.g., for discrimination between healthy and mosaic virus infected soybean plants (Jinendra et al., 2010), for discrimination between healthy and mastitic animals based on the spectra of urine, blood and milk of dairy cows (Tsenkova, 2004), for discrimination between different brands of commercially available mineral waters (Munćan et al., 2014), for discrimination of different bacteria strains (Remagni et al., 2013; Slavchev et al., 2015, 2017) and others. The PLS-based discriminant analysis was applied for discrimination between irradiated and non-irradiated DNA solutions (Goto et al., 2015), discrimination between solvents and pesticide-containing solutions (Gowen et al., 2011), and discrimination between worn and new soft contact lenses based on conventional hydrogels (Šakota Rosić et al., 2016).

Quantitative aquaphotomics analysis usually includes partial least squares regression (PLSR) (Martens and Martens, 2001) or principal component regression (PCR) (Næs et al., 2002). The principal uniqueness of the aquaphotomics approach in the application of these two methods is the utilization of water absorbance bands for indirect quantification of analytes in water, which change the water matrix. The feasibility of this approach was demonstrated in a study whose objective was quantification of different types of salt in water solutions (NaCl, KCl, MgCl2, and AlCl3), where the overall detection limit of 1,000 ppm was reported (Gowen et al., 2015). The experiment was reproduced in three independent laboratories by using 3 different spectrometer systems and in different ambient conditions. The reported detection limit of 1,000 ppm indicates that under specified conditions, the aquaphotomics approach substantially improved the detection limit for NIRS (around 5 times) (Pasquini, 2018).

Using an aquaphotomics approach, PLSR gave excellent results for quantification of various analytes in water solutions such as sugars [glucose, fructose, sucrose and lactose and their mixtures (total sugar and each sugar concentrations)] (Bázár et al., 2015), insulin protein (Chatani et al., 2014), DNA, isolated cyclobutane pyrimidine dimers, and UVC-irradiation dose (Goto et al., 2015). The same approach also provided a favorable accuracy of measurements in more complex biological samples, such as human serum albumin (HSA) and γ-globulin in phosphate buffer solutions (Murayama et al., 1998), urinary estrone-3-glucuronide (E1G) concentrations in urine of giant pandas (Kinoshita et al., 2010, 2012), HIV virus in human plasma (Sakudo et al., 2005), somatic cell counts in cow milk samples (Tsenkova et al., 2001a; Tsenkova, 2004), as well as fat, lactose, protein and urea nitrogen content of milk (Tsenkova, 2004).

Very recently, a critical review on NIRS and its modern perspectives expressed concerns regarding the capability of aquaphotomics for measurement of analytes in very low concentrations, given that concentrations of 5,000 ppm (mg L<sup>−1</sup>) or 0.5% (w/v) are roughly regarded as a common limit of quantification for NIRS (Pasquini, 2018). Such a comparison of the capabilities of traditional NIRS and the aquaphotomics approach is based on an incorrectly assumed equivalence. While the established limit of detection for the traditional approach is based on the utilization of absorbance bands of analytes in the NIR region, the aquaphotomics approach utilizes water absorbance bands. In this sense, the quantification of analytes is based on entirely different principles and, as such, logically offers different limits of detection. The two approaches and their accuracy of detection were well demonstrated in studies on the measurement of concentrations of polystyrene particles in water (Tsenkova et al., 2007b). When the first overtone of water (i.e., the aquaphotomics approach) was used to develop a model for low concentrations of polystyrene particles in aqueous suspensions (1–0.0001%), the measurements achieved a high accuracy even at very low concentrations. However, when the traditional approach was applied and measurements were based on the polystyrene band near 1,680 nm [C-H stretching from aromatic C-H (2ν) (Workman, 2016)], decreasing particle concentration led to a substantial decrease in accuracy of prediction.

Aquaphotomics can work with very water-rich systems. The intensity of water bands in the NIR spectra of such systems is much stronger than that of any constituent (Tsenkova, 2004), especially if they are in very low concentrations. The possibility of detecting and measuring such low concentrations arises from the fact that every molecule of analyte is hydrated with an abundance of water molecules, which adapt to its structure and assume various conformations that can be observed based on their respective absorbance bands in the NIR region. Since many water molecules are involved with hydration of just one molecule of analyte, the water acts as a sort of amplifier, and instead of measuring analytes directly, the information on their concentration is obtained indirectly by measuring changes in always abundant solvent molecules.

NIR spectroscopy as a non-destructive tool offers the advantage of in vivo spectral monitoring of living objects. Aquaphotomics combined with time-resolved NIR spectroscopy allows a better understanding of biological functions and underlying water dynamics.

One of the excellent methods for exploring water dynamics is generalized two-dimensional (2D) correlation spectroscopy (Noda et al., 1995; Liu et al., 1996). In 2D correlation spectroscopy, an external perturbation is applied to a system during spectral measurements, which enables exploration of spectral signals as a function of time or perturbation level (where the perturbation can be the number of consecutive illuminations, temperature, concentration, etc.). This method has significant advantages over one-dimensional spectra. Spreading the spectral region over another dimension allows a deconvolution of overlapped bands and monitoring of the specific order of spectral intensity changes. Moreover, 2D correlation spectroscopy offers the possibility of investigating various intra- and inter-molecular interactions through selective correlation of peaks. This technique, in addition to PCA, considerably contributed to the understanding of the structure of liquid water (Segtnan et al., 2001). Furthermore, it was applied for extraction of useful information from NIR spectra of protein aqueous solutions during heat-induced denaturation of ovalbumin (Wang et al., 1998) and acid-induced denaturation of human serum albumin (Murayama et al., 2000). The method can be applied even in the case of complex biological fluids such as milk (Czarnik-Matusewicz et al., 1999; Tsenkova, 2004) or complex biological samples such as fruits (Giangiacomo et al., 2009). 2D correlation analysis was also employed for the investigation of wafer etchant solutions composed of several inorganic acids (HCl, H2SO4, H3PO4, and HNO3) (Chang et al., 2018). This study, using a typical water-mirror approach, applied 2D correlation analysis to examine NIR water bands perturbed by four acids and determined their dissimilar characteristics.
The results showed that components with higher acidity in single-component samples perturbed the water hydrogen bond network more significantly, and in turn allowed more accurate concentration measurements. Heterospectral correlation (Noda and Ozaki, 2004), i.e., investigation of the correlation between water absorbance bands in different regions of the electromagnetic spectrum (IR and NIR) or obtained by different techniques (NIR and Raman spectroscopy), can significantly contribute to the development of aquaphotomics through discovery and identification of new water absorbance bands. However, it should be pointed out that the method has one inherent weakness, i.e., a high level of sensitivity to noise.
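The core computation of generalized 2D correlation can be sketched as follows, using the standard formulation in which the synchronous map is the covariance of the dynamic (mean-centered) spectra and the asynchronous map is obtained with the Hilbert–Noda transformation matrix. The sketch assumes equally spaced perturbation levels; the function name and the two synthetic bands are hypothetical.

```python
import numpy as np

def two_d_correlation(spectra):
    """Synchronous and asynchronous 2D correlation maps.

    spectra: array (n_perturbation_levels, n_wavelengths), rows ordered by
    equally spaced perturbation level (an assumption of this sketch).
    """
    Y = np.asarray(spectra, dtype=float)
    Y = Y - Y.mean(axis=0)                    # dynamic spectra
    m = Y.shape[0]
    idx = np.arange(m)
    diff = idx[None, :] - idx[:, None]        # k - j
    N = np.zeros((m, m))                      # Hilbert-Noda matrix
    off_diagonal = diff != 0
    N[off_diagonal] = 1.0 / (np.pi * diff[off_diagonal])
    synchronous = Y.T @ Y / (m - 1)
    asynchronous = Y.T @ N @ Y / (m - 1)
    return synchronous, asynchronous

wavelengths = np.linspace(1300.0, 1600.0, 61)
levels = np.linspace(0.0, 1.0, 11)
band_a = np.exp(-0.5 * ((wavelengths - 1412.0) / 15.0) ** 2)
band_b = np.exp(-0.5 * ((wavelengths - 1492.0) / 20.0) ** 2)

# Band A grows linearly with the perturbation; band B decreases
# quadratically, so the two changes are not fully in phase.
spectra = np.outer(levels, band_a) - np.outer(levels ** 2, band_b)
sync, asyn = two_d_correlation(spectra)

i_a = int(np.argmin(np.abs(wavelengths - 1412.0)))
i_b = int(np.argmin(np.abs(wavelengths - 1492.0)))
```

Here `sync[i_a, i_b]` is negative because the two bands change in opposite directions, and the non-zero `asyn[i_a, i_b]` reflects that their changes do not follow the same course along the perturbation.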

Other approaches for examination of water dynamics are also often in use. For example, plotting SIMCA interclass distance as a function of time revealed time-dependent spectral dynamics of virus infection in soybean plants (Jinendra et al., 2010). The SIMCA interclass distance between the groups of infected and non-infected plants showed small values of around 1.2 (2 weeks after inoculation), then gradually decreased to the lowest value of 0.8 (3 weeks after inoculation). After this critical point, the value of interclass distance increased steadily. The water dynamics thus revealed mirrored the dynamics of viral infection where, due to the defense reaction of the plants, the disease impact was initially suppressed exactly 3 weeks after inoculation. The same approach was utilized in a study of the ovulation period in giant pandas (Kinoshita et al., 2010). Interclass distances were calculated between spectra of urine collected each day in the time series and urine spectra collected on the first day of investigation, when the female animals had been in an estrous state. This analysis showed that the SIMCA distance between these two groups increased simultaneously with an increase in the concentration of E1G, a major estrogen metabolite excreted in the urine during estrus. Another study was concerned with investigation of protein fibrillation and employed spectral monitoring of water structural changes in real time during fibrillation of insulin (Chatani et al., 2014). This study monitored the process of fibrillation of insulin indirectly by monitoring water molecular structure dynamics in the region of the first overtone (1,300–1,600 nm), while the formation of fibrils was verified by two methods, i.e., FTIR spectroscopy and Atomic Force Microscopy.
The PCA analysis of NIR spectra of protein solutions found that for the first two PCs, score changes can be mainly attributed to a change in light scattering; however, the scores of PC3, when expressed as a function of time (in minutes), showed a time course of changes in water structure coinciding well with the proposed nucleation, elongation and equilibrium phase of protein fibrillation (Chatani et al., 2014). It is worth mentioning that other ways of exploring water dynamics are possible. For example, expressing SIMCA interclass distance as a function of consecutive illumination or temperature can reveal different responses to perturbation in different samples, which otherwise, without perturbation, may be difficult to discriminate due to a high similarity. Also, expressing SIMCA interclass distance between solvent and solutions of varying concentrations, as a function of concentration, may reveal concentration ranges in which solutes have structure-breaking and structure-making effect, thus indicating the need for building separate regression models for different ranges of concentrations.

Recently, several novel chemometrics methods were introduced to aquaphotomics studies. Multivariate curve resolution-alternating least squares (MCR-ALS) was applied to characterize the effects of temperature and salt perturbations on the NIR spectra of water in order to gain more insight into hydrogen bonding (Gowen et al., 2013). This advanced data analysis technique applies a factor model approach with the objective of recovering pure concentration and spectral profiles of the components in complex mixture systems without any prior knowledge of these features (Czarnecki et al., 2015). To perform MCR, however, one first has to estimate the number of significant components, usually based on PCA analysis. In contrast to PCA, MCR can provide results that have actual physical and chemical meaning (Czarnecki et al., 2015). The "components" in terms of water structures could be interpreted as the changing forms of water when perturbations were applied. Three distinct components were found with varying temperature dependence in the range 30–45 °C in the region of the first overtone of water, while different salts and salt concentration levels affected the water hydrogen bonded network in different ways according to their acidity (Gowen et al., 2013). By resolving different systems into idealized pure components, MCR-ALS allowed better examination of the water molecular matrix and resulted in the conclusion that the water structure can be reasonably interpreted as a multi-state system.
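The alternating structure of MCR-ALS can be sketched as below. This is a deliberately simplified illustration, not the algorithm as used in the cited studies: non-negativity is imposed by clipping rather than by a proper constrained least-squares solver, no closure or unimodality constraints are applied, and the two-state synthetic system and the initial guess (first and last measured spectra) are assumptions.

```python
import numpy as np

def mcr_als(D, S_init, n_iter=50):
    """Alternate least-squares updates of concentrations C and spectra S.

    D is modeled as C @ S. Non-negativity is imposed by clipping, a crude
    stand-in for constrained least squares.
    """
    S = np.clip(np.asarray(S_init, dtype=float), 0.0, None)
    for _ in range(n_iter):
        C = np.clip(D @ np.linalg.pinv(S), 0.0, None)   # update concentrations
        S = np.clip(np.linalg.pinv(C) @ D, 0.0, None)   # update spectra
    return C, S

# Synthetic two-state system: fractions shift from one state to the other.
wavelengths = np.linspace(1300.0, 1600.0, 151)
S_true = np.vstack([
    np.exp(-0.5 * ((wavelengths - 1412.0) / 20.0) ** 2),
    np.exp(-0.5 * ((wavelengths - 1492.0) / 30.0) ** 2),
])
fraction = np.linspace(0.9, 0.2, 8)
C_true = np.column_stack([fraction, 1.0 - fraction])
D = C_true @ S_true

# Initial spectral estimates: the first and last measured spectra.
C, S = mcr_als(D, S_init=D[[0, -1]])
residual = np.linalg.norm(D - C @ S) / np.linalg.norm(D)
```

The data are reproduced essentially exactly, yet the resolved spectra here are the measured mixtures rather than the pure states. This rotational ambiguity is why additional constraints and chemical knowledge are needed before the components are interpreted.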

Evolving factor analysis (EFA) was applied for exploration of hydration and secondary structures of bovine serum albumin in aqueous solutions (Yuan et al., 2003). Application of this method allowed an extraction of spectral information, which indicated significant changes in the secondary structure of bovine serum albumin. The application of independent component analysis (ICA) was reported in spectroscopic analysis of hydrogen bonding in water-acetone mixtures for resolving the spectra into independent components and obtaining their concentration profiles (Monakhova et al., 2014). A Gaussian fitting method was applied to study glucose-induced variation of water under temperature perturbation (Cui et al., 2016). This method, applied to NIR difference absorbance spectra (in the region 700–1,100 nm), helped identify and quantify 16 inorganic salts in water in the concentration range from 30 to 500 mM (Steen et al., 2015).

A series of articles were also published on employing and developing various chemometrics methods specifically for temperature-perturbed samples (Peinado et al., 2006; Shao et al., 2010, 2018; Kang et al., 2011; Shan et al., 2015; Cui et al., 2017b). Instead of trying to eliminate the influence of temperature, a Parallel Factor (PARAFAC) model was used to extract and separate relevant sources of both physical and chemical information (Peinado et al., 2006). PARAFAC analysis was also used to rationalize concentration-dependent peak shifts and to quantify different water species in acetone (Andrews et al., 2014), and also for a quantitative analysis of the NIR spectra of the temperature-perturbed mixtures water-ethanol-propanol and water-ethanol-glycerin (Peinado et al., 2006). Multilevel simultaneous component analysis (MSCA) has been applied to investigate the relationship between temperature and the NIR spectra of different samples at different concentrations under temperature perturbation: water-ethanol-isopropanol (Shan et al., 2015) and water-glucose (Cui et al., 2017a). This method was proposed specifically for analyzing multivariate data at different levels (Timmerman, 2006). The method offers a unique way to study the composition of solvent, temperature effect and quantitative analysis (Shan et al., 2015). Cui et al. tested three high-order chemometric algorithms, multiway principal component analysis (MPCA) (Wold et al., 1987), parallel factor analysis (PARAFAC) (Bro, 1997) and alternating trilinear decomposition (ATLD) (Wu et al., 1998), in the analysis of temperature-dependent NIR spectra of binary and ternary water-alcohol mixtures (Cui et al., 2017b). All three algorithms proved to be very powerful tools for capturing temperature- and concentration-induced spectral variations, from which a structural variation could be observed and a quantitative determination performed. Another work of Shao et al. proposes mutual factor analysis (MFA) for quantification based on temperature-dependent NIR spectra (Shao et al., 2018).
In this work, multi-component mixtures were analyzed for quantification of components and better understanding of molecular interactions in solutions. From the spectra of water–glucose mixtures, both spectral variations induced by temperature and concentration were obtained while serum samples were used for method validation (Shao et al., 2018).

The ultimate choice of chemometrics method to be applied in aquaphotomics analysis depends on the type of the aqueous system explored, spectral dataset and the research objective. Obviously, there are many chemometric methods available. The important aspect of every aquaphotomics analysis is emphasis on consistency so that each preprocessing method, conventional spectroscopic method or chemometrics method applied to extract the information from water spectra can contribute to the development of an emerging aquaphotome. Each step of aquaphotomics data analysis is important, because it can contribute to better understanding of the complexity of aqueous systems, irrespective of chemometrics method applied.

With reference to our example of potassium chloride solutions, after examining the raw spectra, difference spectra, second derivative spectra, loadings of PCA analysis and regression vector of PLSR analysis, we have identified the main water absorbance bands activated by the perturbation of potassium chloride in the concentrations up to 100 mM. The last step of analysis for our worked-out example is to represent water absorbance spectral patterns using aquagrams.

# WATER SPECTRAL PATTERN REPRESENTED BY AQUAGRAMS

## Classic Aquagrams

In data analysis, many situations arise where data visualization is helpful, even essential, for better understanding. In aquaphotomics, the need arose for a clear and comprehensive graphical representation of the water spectral patterns as well as for their easy comparison. That is why the aquagrams were introduced (Tsenkova, 2010).

When activated water absorbance bands are found based on the previously described steps, the last step is to apply MSC or SNV transformation of the raw spectra, and extract the absorbance at selected activated water bands. Thus, the calculated absorbance is normalized and averaged for different samples or sample groups, and the values are displayed on radial axes defined by the activated water absorbance bands in a radar chart.

The normalized absorbance is calculated as follows:

$$A^{*}_{\lambda} = \frac{A_{\lambda} - \mu_{\lambda}}{\sigma_{\lambda}} \tag{1}$$

where $A^{*}_{\lambda}$ is the normalized absorbance displayed on the aquagram, $A_{\lambda}$ is the absorbance after multiplicative scatter correction (MSC) or standard normal variate transformation (SNV), $\mu_{\lambda}$ is the mean of all spectra for the examined group of samples after transformation, $\sigma_{\lambda}$ is the standard deviation of all spectra for the examined group of samples after transformation, and $\lambda$ denotes the selected wavelengths chosen for display from the activated water absorbance bands.
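The normalization in Equation (1) can be illustrated with a short Python sketch. It is not the aquap2 implementation: the function names `snv` and `classic_aquagram`, the use of SNV rather than MSC, and the nearest-wavelength lookup are assumptions made for illustration only.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def classic_aquagram(spectra, wavelengths, groups, selected_nm):
    """Group-averaged normalized absorbance, Equation (1), at the selected bands.

    spectra     : array (n_samples, n_wavelengths) of raw absorbance
    wavelengths : array (n_wavelengths,) in nm
    groups      : sequence of n_samples group labels
    selected_nm : wavelengths (nm) chosen from the activated water bands
    """
    transformed = snv(spectra)
    wavelengths = np.asarray(wavelengths, dtype=float)
    # Column index of the measured wavelength closest to each requested band.
    cols = [int(np.argmin(np.abs(wavelengths - nm))) for nm in selected_nm]
    bands = transformed[:, cols]
    mu = bands.mean(axis=0)            # mean over all spectra, per band
    sigma = bands.std(axis=0, ddof=1)  # standard deviation over all spectra, per band
    normalized = (bands - mu) / sigma
    labels = list(groups)
    result = {}
    for g in sorted(set(labels)):
        mask = np.array([lab == g for lab in labels])
        result[g] = normalized[mask].mean(axis=0)  # one radial value per band
    return result
```

Each returned vector holds the values plotted on the radial axes for one group. Because the mean and standard deviation are taken over all included spectra, the values change whenever samples are added or removed.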

An exact number of axes as well as water absorbance bands will be chosen for display, depending on a specific system and perturbation; however, the axes always display various conformations of water molecules, making aquagrams very convenient tools for a quick insight into the water structure of the system. For the first overtone of water, the axes of the aquagram are usually based on previously discovered 12 WAMACs. The aquagrams are visually very convenient to allow a fast and comprehensive comparison of different systems or conditions of the same system by comparison of their WASPs.

As can be seen from Equation (1), the classic aquagram is a relative construction that depends on the samples included in the calculation. It is also a matter of choice whether the absorbance calculated with the above equation is displayed on a circular chart (radar chart) or a linear one. The package aquap2 offers both options (Pollner and Kovacs, 2016).

The more advanced version of a classic aquagram is an aquagram with confidence intervals (Pollner and Kovacs, 2016). This aquagram adds one more function, the possibility to observe whether the differences among WASPs presented in the aquagrams are statistically significant. This type of aquagram, in addition to averaged WASPs for selected groups of samples, displays its confidence intervals with 95% upper and lower limits, as calculated by using the Bootstrap method for data validation and uncertainty estimation (Davison and Hinkley, 1997; Pollner and Kovacs, 2016). With this novel function, the aquagrams with confidence intervals are not only convenient for visualization, but also especially suitable for classification and discrimination.
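A percentile bootstrap is one common way to obtain such 95% limits. The sketch below assumes that variant; the bootstrap scheme actually used in aquap2 may differ, and the function name `bootstrap_ci` and its parameters are hypothetical.

```python
import numpy as np

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence limits for the group mean at each band.

    values : array (n_samples, n_bands) of normalized absorbance for one group
    Returns (mean, lower, upper), each of shape (n_bands,).
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    n = values.shape[0]
    # Resample the samples (rows) with replacement, n_boot times.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = values[idx].mean(axis=1)          # shape (n_boot, n_bands)
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(boot_means, alpha, axis=0)
    upper = np.quantile(boot_means, 1.0 - alpha, axis=0)
    return values.mean(axis=0), lower, upper
```

If the intervals of two groups do not overlap at a given band, the difference at that band can be regarded as significant at roughly the chosen level.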

For our example dataset of potassium chloride solutions, after selecting wavelengths from the WAMACS regions in the 1st overtone of water based on the previous steps of the analysis, the classic aquagrams without and with confidence intervals, calculated by using aquap2 package, are presented in **Figures 11**, **12**.

In both types of aquagrams, it is easy to observe a large difference between the spectral patterns of water (red line) and salt solutions. Increasing the concentration of salt in water leads to increased absorbance in the region between 1,342 and 1,374 nm which corresponds to C1, C2, and C3 WAMACS, i.e., absorbance of the free OH stretch (OH-(H2O)n, n = 1...4) (Xantheas, 1995; Robertson et al., 2003). An increase in the absorbance with increasing salt concentration can also be seen in the region stretching from 1,440 to 1,452 nm, i.e., C7-C8 WAMACS that are known as bands of water hydration (Gowen et al., 2009a) and water dimers (S1) (Segtnan et al., 2001; Cattaneo et al., 2009) and symmetric and asymmetric stretching of the first overtone of water (Siesler et al., 2008; Cattaneo et al., 2009; Gowen et al., 2009a). However, in the range between 1,476 and 1,512 nm, i.e., C10-C12, samples with higher salt concentration show lower absorbance values and this region is usually connected to strongly hydrogen bonded water (Segtnan et al., 2001; Tsenkova, 2009). The spectral pattern of salt solutions represented in the aquagrams shows that for the range of concentrations of salt under study, increasing salt concentration has a structure-breaking effect on water.

FIGURE 15 | Aquagrams with 95% confidence intervals of Milli-Q water and aqueous solutions of potassium-chloride in the concentration range of 10–100 mM calculated on the MSC transformed absorbance (logT-1) spectra in the spectral range of 1,300–1,600 nm (OH first overtone) using the linearized version of the "temperature-based" mode with average values.

## Temperature-Based Aquagrams

In the previous section, we briefly mentioned that classic aquagrams are relative constructs, meaning that the WASPs displayed depend on the samples included in calculation. This is disadvantageous if the WASPs of samples or groups of samples ought to be compared over time or in different experiments. The development of a new temperature-based aquagram (Pollner and Kovacs, 2016) overcomes this difficulty by changing how spectral changes are expressed.

For the calculation of temperature-based aquagrams, it is necessary to first acquire a spectral library consisting of spectra of pure water (Milli-Q) at different temperatures covering a wider range of temperatures than the one expected to be used during the experiment. This library, the so-called reference dataset, provides the basis for temperature aquagram calculation. The spectra from this dataset are compared with the spectra acquired during the experiment (the experimental dataset), which makes it possible to express the effect of a certain perturbation on the spectral pattern of experimental samples in terms of the effect of temperature on pure water spectra. In this way, the effect of any perturbation on samples can be expressed in "temperature equivalent units," in other words, as changes in pure water spectra caused by temperature.

The calculation of a temperature-based aquagram is based on a comparison of areas covered by 12 WAMACS (Ci, i = 1, ..., 12) coordinates in the region of the 1st overtone of water. The average spectra across all sample replicates and consecutive scans are calculated for the reference and experimental datasets. The area under the curve (AUC) for every single average spectrum for both reference and experimental datasets, at the wavelength range of each WAMACS (Ci), is calculated by taking into account the baseline estimated by linear fitting on the two edges of the first overtone region (i.e., through the 1,300 and 1,600 nm points). The ratio of AUCs for every single water matrix coordinate and AUC for the first overtone region (i.e., 1,300–1,600 nm) are calculated for each averaged spectrum of both datasets in order to provide normalized values for comparison of reference and experimental datasets and to eliminate possible differences due to the scattering or path length differences. Using local polynomial regression for the reference dataset, a continuous array of values for the relative area of each Ci is calculated for a continuous temperature range chosen to include a specific temperature. In this way, a temperature calibration equation is obtained establishing a relationship between temperature and each Ci area, including the temperature at which the experiment was performed. When it is known how each Ci area for the pure water dataset is changed as a function of temperature, it is possible to pair these changes to spectral changes in the experimental dataset, i.e., to perform linking (mapping) and express the changes in Ci areas of the experimental datasets in the unit of temperature (degree Celsius) equivalent.
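The procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the aquap2 code: a global polynomial fit (`np.polyfit`) stands in for the local polynomial regression, the band limits passed in `bands` are placeholders to be supplied by the user, and the function names are hypothetical.

```python
import numpy as np

def _trapezoid(y, x):
    """Area under y(x) by the trapezoidal rule."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def relative_band_areas(spectrum, wavelengths, bands, lo=1300.0, hi=1600.0):
    """Baseline-corrected AUC of each band divided by the AUC of the whole region."""
    wl = np.asarray(wavelengths, dtype=float)
    y = np.asarray(spectrum, dtype=float)
    inside = (wl >= lo) & (wl <= hi)
    wl, y = wl[inside], y[inside]
    # Linear baseline through the two edge points of the first overtone region.
    baseline = y[0] + (y[-1] - y[0]) * (wl - wl[0]) / (wl[-1] - wl[0])
    corrected = y - baseline
    total = _trapezoid(corrected, wl)
    areas = []
    for start, stop in bands:
        m = (wl >= start) & (wl <= stop)
        areas.append(_trapezoid(corrected[m], wl[m]) / total)
    return np.array(areas)

def temperature_equivalent(ref_temps, ref_areas, exp_areas, degree=2, n_grid=2001):
    """Map each experimental band area to the water temperature giving the same area.

    ref_temps : (n_temps,) temperatures of the pure-water reference spectra
    ref_areas : (n_temps, n_bands) relative band areas of the reference spectra
    exp_areas : (n_bands,) relative band areas of one averaged experimental spectrum
    """
    ref_temps = np.asarray(ref_temps, dtype=float)
    ref_areas = np.asarray(ref_areas, dtype=float)
    grid = np.linspace(ref_temps.min(), ref_temps.max(), n_grid)
    equivalents = []
    for j, area in enumerate(exp_areas):
        # Calibration curve: relative area of band j as a function of temperature.
        coeffs = np.polyfit(ref_temps, ref_areas[:, j], degree)
        fitted = np.polyval(coeffs, grid)
        equivalents.append(grid[int(np.argmin(np.abs(fitted - area)))])
    return np.array(equivalents)
```

Subtracting the temperature at which the experiment was performed from the returned values gives the temperature-equivalent change at each coordinate.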

With this type of aquagram, it is also possible to include confidence interval limits. In that case, it is also necessary to perform transformation of upper and lower 95% confidence limits in the same manner just described above for the average spectra from the experimental dataset.

The whole calculation procedure for temperature-based aquagrams is implemented in the aquap2 package of the R programming language (Pollner and Kovacs, 2016; R Core Team, 2017). An obvious disadvantage of temperature-based aquagrams is that they are based on previously discovered WAMACS regions in the first overtone of water (Tsenkova, 2010), meaning that at the moment this type of aquagram cannot be used for other windows of the electromagnetic spectrum where water absorbs.

The temperature-based aquagrams without and with confidence intervals for our dataset of aqueous solutions of potassium-chloride spectra are presented in **Figures 13**, **14**, respectively. The linearized version of the temperature-based aquagram for **Figure 14** is plotted in **Figure 15**, where the additional table shows average values at all WAMACs.

Further understanding can be obtained from the temperature-based aquagram. The addition of, for instance, 90 mM potassium-chloride to Milli-Q water results in structural changes equivalent to temperature changes of about 0.54, 0.48, 0.3, 0.02, 0.1, 0.58, 1.53, 1.14, 0.19, −0.08, −0.26 and −0.49 °C at C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, and C12 coordinates, respectively. Furthermore, the differences are statistically significant for calculated confidence intervals, e.g., the above listed differences between pure Milli-Q and 90 mM aqueous solution of potassium-chloride were significant (p < 0.05) at the coordinates C1, C2, C3, C6, C7, C8, C11, and C12.

# CONCLUDING REMARKS

In this paper, the fundamentals of the aquaphotomics approach to data analysis have been presented and discussed. A variety of applications illustrate the potential of aquaphotomics as a powerful new spectroscopic tool to study various aspects of aqueous and biological systems, which are of interest in the pharmaceutical and biomedical fields. The process of analysis illustrated by the application of aquaphotomics analysis on aqueous salt solutions was intended as guidance for certain steps of the analysis with the simplest experimental system, which anyone can easily reproduce. Together with the examples from sources of literature referenced throughout the text, this paper should provide the basis for independent experimental work in this field. The existing aquaphotomics literature shows the results which are probably only the tip of the iceberg of possible applications. With the explained methodology of aquaphotomics analysis presented herein, we hope that scientists and chemometricians will implement it in their fields and come up with new ideas of applications as well as new and more sophisticated mathematical tools to contribute to this growing field.

# AUTHOR CONTRIBUTIONS

ZK, BP, and RT designed and performed experiments. ZK performed data analysis. JM, ZK, and RT interpreted results and wrote the manuscript.

# ACKNOWLEDGMENTS

The author JM gratefully acknowledges the financial support of JSPS Postdoctoral Fellowship for Foreign Researchers (P17406).

The author ZK gratefully acknowledges the support by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences, Hungary and by the ÚNKP-17-4 New National Excellence Program of the Ministry of Human Capacities.

# REFERENCES




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Tsenkova, Munćan, Pollner and Kovacs. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Raman Spectroscopy for Pharmaceutical Quantitative Analysis by Low-Rank Estimation

Xiangyun Ma<sup>1</sup> , Xueqing Sun<sup>1</sup> , Huijie Wang<sup>1</sup> , Yang Wang<sup>1</sup> , Da Chen<sup>2</sup> and Qifeng Li <sup>1</sup> \*

*<sup>1</sup> School of Precision Instrument and Opto-electronics Engineering, Tianjin University, Tianjin, China, <sup>2</sup> State Key Laboratory of Precision Measurement Technology and Instruments, Tianjin University, Tianjin, China*

Raman spectroscopy has been widely used for quantitative analysis in biomedical and pharmaceutical applications. However, the signal-to-noise ratio (SNR) of Raman spectra is often poor because Raman scattering is weak. Noise in a Raman spectral dataset limits the accuracy of quantitative analysis. Because of high correlations in the spectral signatures, Raman spectra have the low-rank property, which can be used as a constraint to improve Raman spectral SNR. In this paper, a simple and feasible Raman spectroscopic analysis method by Low-Rank Estimation (LRE) is proposed. The Frank-Wolfe (FW) algorithm is applied in the LRE method to seek the optimal solution. The proposed method is used for the quantitative analysis of pharmaceutical mixtures. The accuracy and robustness of Partial Least Squares (PLS) and Support Vector Machine (SVM) chemometric models can be improved by the LRE method.

### Edited by:

*Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam*

### Reviewed by:

*Andreas Borgschulte, Swiss Federal Laboratories for Materials Science and Technology, Switzerland Pellegrino Musto, Consiglio Nazionale Delle Ricerche (CNR), Italy*

> \*Correspondence: *Qifeng Li Lqfli@tju.edu.cn*

### Specialty section:

*This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *28 February 2018* Accepted: *20 August 2018* Published: *10 September 2018*

### Citation:

*Ma X, Sun X, Wang H, Wang Y, Chen D and Li Q (2018) Raman Spectroscopy for Pharmaceutical Quantitative Analysis by Low-Rank Estimation. Front. Chem. 6:400. doi: 10.3389/fchem.2018.00400*

Keywords: Raman spectroscopy, quantitative analysis, pharmaceuticals, low-rank estimation, chemometric model

# INTRODUCTION

Raman spectroscopy is one of the vibrational spectroscopic techniques that has been commonly applied in quantitative analysis (Strachan et al., 2004; Numata and Tanaka, 2011; Ai et al., 2018). Being non-invasive and marker-free, it has proved to be an effective tool in the fields of physics, chemistry, and biology (Graf et al., 2007; Neugebauer et al., 2010; Ryu et al., 2012; Tan et al., 2017). Coupled with chemometrics methods, it has the advantages of high sensitivity and resolution in biomedical and pharmaceutical quantitative analysis.

The quantitative analysis based on Raman spectra at low signal-to-noise ratio (SNR) levels is still problematic (Li, 2008; Chen et al., 2014). Generally, a Raman spectrum can be divided into two parts: the signal containing desired information and the noise containing unwanted information. Basically, the latter may include photon-shot noise, sample-generated noise, instrument-generated noise, computationally generated noise, and externally generated noise (Pelletier, 2003). Due to the inherently weak property of Raman scattering, the noise will lead to a deterioration in SNR of Raman spectra, affecting the accuracy of quantitative analysis. For instance, data of online monitoring in limited integration time always tend to be inaccurate (Han et al., 2017; Virtanen et al., 2017).

Several approaches to preprocessing Raman spectra have been proposed to minimize this problem (Clupek et al., 2007; Ma et al., 2017), such as first and second derivatives (Johansson et al., 2010), polynomial fitting (Vickers et al., 2001), Fourier transform (Pelletier, 2003), and wavelet transform (Chen et al., 2011; Li et al., 2013). Among these approaches, wavelet transform, which can extract peak information and remove background noise, has been the most widely used preprocessing method (Du et al., 2006). However, the processing of Raman spectra can be further optimized to improve the accuracy of pharmaceutical quantitative analysis.

In this paper, we introduce a simple and feasible Raman spectroscopic analysis method based on Low-Rank Estimation (LRE). Our experiments are implemented based on the Partial Least Squares (PLS) and Support Vector Machine (SVM) chemometric models. The aim of this experimental design is to enhance the quality of pharmaceutical quantitative analysis by significantly improving the accuracy and robustness of the chemometric models used.

# MATERIALS AND METHODS

Pharmaceutical substances (norfloxacin, penicillin potassium, and sulfamerazine) were purchased from Dalian Meilun Biotechnology Co., Ltd (China) and used without further purification. These substances were well blended in different proportions, pulverized, and compressed into three-component tablets. Other physical properties of these tablets (such as density, height, and diameter) were kept completely consistent. Mixed solutions were also prepared with methanol and ethanol in 100 different proportions. Raman spectral data were recorded by using a Renishaw inVia Raman spectrometer (Gloucestershire, U.K.). This system consisted of a 785-nm diode laser (∼40 mW) and a 1,200 l/mm grating. In this work, the integration times of Raman spectra were 0.1–0.5 s.

PLS and SVM regression methods were used to model and predict pharmaceutical concentration of the samples based on their Raman spectra. Eighty-five samples were selected as the training set and the remaining 15 samples as the testing set, using the Kennard-Stone (KS) algorithm. The parameters of the PLS and SVM models were tuned with a grid search algorithm. The optimal parameters were obtained by k-fold cross-validation.
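For readers unfamiliar with the Kennard-Stone split, a minimal Python sketch of the standard algorithm is given below. It is an illustration only and assumes Euclidean distances between spectra; the function name `kennard_stone` is hypothetical and the details of the implementation used in this study are not stated.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return (selected, remaining) sample indices chosen by the Kennard-Stone rule.

    X        : array (n_samples, n_features), e.g. one spectrum per row
    n_select : number of samples to place in the training set (at least 2)
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances from the Gram matrix (memory-light).
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    dist = np.sqrt(np.maximum(d2, 0.0))
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(int(np.argmax(dist)), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(X.shape[0]) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the sample whose nearest selected neighbour is farthest away.
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

With 100 spectra, `kennard_stone(X, 85)` would return 85 training indices spread across the spectral space and 15 remaining indices for testing.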

The accuracy and robustness of the above-mentioned chemometric models were further improved by the conventional Wavelet Transform (WT) method and the Low-Rank Estimation (LRE) method, respectively. In the WT method, the signals were split into different frequency components to remove low-frequency background and high-frequency noise components simultaneously. The Symlet wavelet filter (sym11, scale = 7) was selected as optimal because it provided the sharpest peaks associated with the analytes of interest. The LRE method was originally developed by our group in three dimensions to speed up Raman spectral imaging (Li et al., 2018). In this study, we used the LRE method in two dimensions to process the observed Raman spectral data matrix. In this method, the alternating least squares (ALS) algorithm is used to estimate the largest singular value of the matrix (Kroonenberg and Leeuw, 1980; Halko et al., 2011). The matrix estimation has two sets of parameters. Each set is estimated in turn by solving a least-squares problem while holding the other set fixed. After both sets have been estimated once, the procedure is repeated until convergence.
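The alternating scheme described above can be sketched for the rank-one case, where the two parameter sets are a left vector `u` and a right vector `v`. This is a generic illustration, not the authors' code; the function name `als_rank1`, the random initialization, and the stopping rule are assumptions.

```python
import numpy as np

def als_rank1(M, n_iter=200, tol=1e-10, seed=0):
    """Estimate the largest singular value of M by alternating least squares.

    Minimises ||M - u v^T||_F by solving for u with v fixed, then for v with
    u fixed, and repeating until the estimate stops changing.
    Assumes M is not the zero matrix.
    """
    M = np.asarray(M, dtype=float)
    rng = np.random.default_rng(seed)
    v = rng.normal(size=M.shape[1])
    v /= np.linalg.norm(v)
    sigma_old = 0.0
    for _ in range(n_iter):
        u = M @ v / (v @ v)        # least-squares solution for u, v held fixed
        v = M.T @ u / (u @ u)      # least-squares solution for v, u held fixed
        sigma = np.linalg.norm(u) * np.linalg.norm(v)
        if abs(sigma - sigma_old) <= tol * max(sigma, 1.0):
            break
        sigma_old = sigma
    return sigma, u / np.linalg.norm(u), v / np.linalg.norm(v)
```

The product of the two vector norms converges to the largest singular value, and the normalized vectors converge to the corresponding singular vectors (up to sign).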

The Frank-Wolfe (FW) algorithm is applied in the LRE method to seek the optimal solution. Recently, the FW algorithm has been popularly used in machine learning due to its characteristics of simple implementation and modest memory requirement (Jaggi, 2013; Guo et al., 2017). The steps of the LRE method are detailed in **Table 1**.



4: Compute the step length $r$: $r^{i+1} = \arg\min_{r \in [0,1]} \left(A - \left(X^{i} + r\,(s^{i+1} - X^{i})\right)\right)$

5: $X^{i+1} = (1 - r^{i+1})\,X^{i} + r^{i+1}\,s^{i+1}$

6: Stopping criterion: $\mathrm{ALS}(X^{i+1}) > m$

7: end for

8: The last iteration of $X$ is the final solution of the LRE method.

Output: $X$
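To make the iteration concrete, the following Python sketch applies a generic Frank-Wolfe scheme to a nuclear-norm-constrained least-squares problem. It is not the authors' implementation: the constraint radius `tau`, the squared Frobenius objective, the use of `np.linalg.svd` in place of ALS for the leading singular pair, and the stopping rule are all assumptions made for illustration.

```python
import numpy as np

def frank_wolfe_low_rank(A, tau, n_iter=200, tol=1e-8):
    """Approximately minimise 0.5 * ||A - X||_F^2 subject to ||X||_* <= tau.

    A   : observed data matrix (one spectrum per row)
    tau : assumed radius of the nuclear-norm ball (hypothetical tuning parameter)
    """
    A = np.asarray(A, dtype=float)
    X = np.zeros_like(A)
    for _ in range(n_iter):
        residual = A - X                      # negative gradient of the objective
        # Update direction: the point of the nuclear-norm ball best aligned
        # with the residual, tau * u1 v1^T (leading singular pair).
        U, _, Vt = np.linalg.svd(residual, full_matrices=False)
        s = tau * np.outer(U[:, 0], Vt[0])
        direction = s - X
        denom = float(np.sum(direction ** 2))
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, restricted to [0, 1].
        r = float(np.clip(np.sum(residual * direction) / denom, 0.0, 1.0))
        if r * np.sqrt(denom) <= tol:
            break
        X = X + r * direction                 # same as (1 - r) * X + r * s
    return X
```

Every iterate is a convex combination of rank-one matrices, so early iterations are low rank by construction, and each step only needs the leading singular pair of the residual.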

After processing by the LRE method, low-rank training and testing sets are obtained from the raw training and testing data matrices, respectively. In general, a larger data matrix enhances the effect of the LRE method. When the number of testing spectra is small, training spectra can be added to the raw testing data matrix as a supplement. The added spectra are used only to strengthen the effect of the LRE method. The conventional regression models are then applied to the low-rank training and testing sets to perform quantitative Raman analysis.

# RESULTS AND DISCUSSION

A noise-free Raman spectral dataset is a low-rank matrix. In **Figure 1**, the red line shows the ranks of the Raman spectral data matrix for an integration time of 1 s, suggesting that the Raman spectra have the low-rank property when the noise is low. The low-rank property comes from high correlations among spectral signatures. Each spectral signature can be represented by a linear combination of a small number of pure spectral endmembers, which is known as the linear spectral mixing model (Iordache et al., 2011; Golbabaee and Vandergheynst, 2012). The blue and green plots show singular values of the matrix for shorter integration times, which implies that the ranks of Raman spectra increase with decreasing integration time owing to a greater proportion of noise. The low-rank property can be used as a constraint to improve the accuracy of pharmaceutical quantitative analysis (Yi et al., 2017).
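The effect of noise on rank can be reproduced with synthetic data. The sketch below builds mixtures of three hypothetical Gaussian-band endmembers under the linear mixing model and counts singular values above a relative threshold; the band shapes, noise levels, and the 5% threshold are assumptions, not values from this study.

```python
import numpy as np

def effective_rank(M, threshold=0.05):
    """Number of singular values larger than threshold * largest singular value."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > threshold * s[0]))

def simulate_ranks(noise_levels, n_spectra=100, n_channels=500, seed=0):
    """Effective rank of a three-component mixture dataset at several noise levels."""
    rng = np.random.default_rng(seed)
    axis = np.linspace(0.0, 1.0, n_channels)
    # Three hypothetical pure-component spectra, one Gaussian band each.
    endmembers = np.array(
        [np.exp(-0.5 * ((axis - c) / 0.03) ** 2) for c in (0.25, 0.5, 0.75)]
    )
    # Mixing fractions that sum to one for every simulated sample.
    fractions = rng.dirichlet(np.ones(3), size=n_spectra)
    clean = fractions @ endmembers            # linear spectral mixing model
    ranks = {}
    for sd in noise_levels:
        noisy = clean + sd * rng.normal(size=clean.shape)
        ranks[sd] = effective_rank(noisy)
    return ranks
```

Calling `simulate_ranks([0.0, 0.01, 0.2])` shows the noise-free matrix at rank three, with the count rising once the noise singular values exceed the threshold, which mirrors the trend described for shorter integration times.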

Raw Raman spectra recorded for three pure pharmaceutical substances are shown in **Figure 2A**. Thirty Raman spectra obtained from three-component tablets with different proportions are shown in **Figure 2B**. It is clear that each pharmaceutical component has its own characteristic peaks. However, their respective Raman bands overlap. In particular, Raman signals of a lower-concentration component are almost swamped by those of a higher-concentration one, which is a common problem in practice for biomedical and pharmaceutical quantitative analysis. For clarity, the Raman spectra in **Figure 2B** were collected with an integration time of 5 s and therefore have a high SNR. In our experiments, the integration times of Raman spectra are in the range of 0.1–0.5 s, which is at least 10 times shorter than that shown in **Figure 2**. Under this condition, the spectral signals are weaker and have poor SNR.

The comparisons of predicted and actual values for norfloxacin are illustrated in **Figure 3**, which indicates the advantage of the LRE method for pharmaceutical quantitative analysis. The coefficient of determination (*R*<sup>2</sup>) and root mean square error (RMSE) of the chemometric models used for quantitative analysis of three pharmaceutical components are listed in **Table 2**. The unsatisfactory results of the raw spectral data show that the pre-treatment of Raman spectra is necessary. In this study, the LRE method and conventional wavelet transform (WT) method are applied to improve the accuracy of quantitative analysis. As shown in **Figure 3**, both the conventional WT and LRE methods can improve the predicted results. However, it is clear that the LRE method has a better performance than the conventional WT method in enhancing the prediction accuracy for pharmaceutical quantitative analysis.

FIGURE 3 | Actual vs. predicted values of norfloxacin based on the PLS (A) and SVM (B) model, where the black solid lines are the diagonals. Raw Raman spectra are collected in an integration time of 0.2 s.

TABLE 3 | *R*<sup>2</sup> and RMSE values of the chemometric models for norfloxacin in different integration times.

As shown in **Table 2**, the raw Raman spectra are all collected in an integration time of 0.2 s. The LRE method is significantly better than the conventional WT method in terms of *R*<sup>2</sup> and RMSE for all components. Quantitation limit (QL) for each pharmaceutical substance is calculated. By definition in ICH guideline (ICH Harmonised Tripartite Guideline, 2005), QL is the lowest concentration of an analyte that can be quantitatively determined with suitable precision and accuracy. It is most often determined as 10 times the standard deviation of the noise from the blank. The LRE method can be used reliably with more than a 15-fold improvement of the practical QL. After processing by the LRE method, QL values for norfloxacin, penicillin potassium, and sulfamerazine are 0.17, 0.13, and 0.19%, respectively. These results reveal that the LRE method can simultaneously improve the performance of quantitative analysis for pharmaceutical multi-component mixtures.

TABLE 4 | *R*<sup>2</sup> and RMSE values of the chemometric models for methanol in different integration times.

**Table 3** lists *R*<sup>2</sup> and RMSE values of the chemometric models used for quantitative analysis of norfloxacin in different integration times. The integration times of raw Raman spectra are 0.1, 0.2, and 0.5 s. The SNR of a Raman spectrum increases with integration time. For evaluating spectral quality, the SNR is defined as the ratio of the peak value of the signal to the root mean square of the noise. For integration times of 0.1, 0.2, and 0.5 s, the average SNR of Raman spectra are 2.47, 3.66, and 6.21, respectively. *R*<sup>2</sup> and RMSE values of the chemometric models for methanol in different integration times are listed in **Table 4**. The average SNR of the Raman spectra in the integration times of 0.1, 0.2, and 0.5 s are 2.13, 3.34, and 5.89, respectively.
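The SNR definition given above (peak value of the signal over the root mean square of the noise) can be written as a short Python sketch. How the noise was isolated is not stated in the text, so estimating it from a user-chosen signal-free region is an assumption, as is the function name `snr_peak_over_rms`.

```python
import numpy as np

def snr_peak_over_rms(spectrum, noise_region):
    """Peak signal height divided by the RMS of the noise.

    spectrum     : 1-D array of Raman intensities
    noise_region : slice selecting a region assumed to contain no Raman bands
    """
    spectrum = np.asarray(spectrum, dtype=float)
    background = spectrum[noise_region]
    level = background.mean()                       # local baseline level
    rms_noise = np.sqrt(np.mean((background - level) ** 2))
    peak = spectrum.max() - level                   # peak height above baseline
    return peak / rms_noise
```

For example, `snr_peak_over_rms(spectrum, slice(0, 50))` uses the first 50 channels as the noise estimate.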

As shown in **Tables 3**, **4**, the accuracy of the quantitative analysis rises with increasing SNR. The *R*<sup>2</sup> and RMSE values show that the LRE method performs better than the conventional WT method. The degree of improvement is higher for low-SNR Raman spectra, which indicates that the LRE method has good noise immunity.
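The two figures of merit used in the tables can be computed as in the sketch below. The exact formulas used by the authors are not stated; the conventional definitions are assumed here, and the function name `r2_and_rmse` is hypothetical.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Coefficient of determination and root mean square error of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    ss_res = float(np.sum(residual ** 2))                   # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return r2, rmse
```

A higher *R*<sup>2</sup> together with a lower RMSE on the testing set indicates predictions that lie closer to the diagonal in a plot of actual versus predicted values.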

In summary, all predicted results of the Raman spectra preprocessed by the LRE method are in good agreement with corresponding actual values. This method can be applied to improve the accuracy of quantitative analysis based on both PLS and SVM models. It is unrelated to the selection of chemometric models. The LRE method is not restricted by the state of a sample, meaning that it is applicable to both solid and liquid samples. Therefore, it can be regarded as an efficient tool with satisfactory prediction accuracy for pharmaceutical quantitative analysis, especially in the case of low-SNR spectra.

# CONCLUSION

The LRE method has been successfully applied in Raman spectroscopy for pharmaceutical quantitative analysis. It is a simple and feasible method that can improve the accuracy and robustness of PLS and SVM chemometric models. Our data show that the LRE method has advantages in improving *R*<sup>2</sup> and RMSE for quantitative analysis of pharmaceutical multi-component mixtures, especially in the case of low-SNR spectra. The LRE method will promote the development of Raman spectroscopy in biomedical and pharmaceutical quantitative analysis.

# AUTHOR CONTRIBUTIONS

XM participated in and supervised the lab work, interpreted the data, drafted the manuscript, and performed the statistical analysis. XS participated in the lab work, interpreted the data, drafted the manuscript, and performed the statistical analysis. HW designed the work and interpreted the data. YW supervised the research and performed the statistical analysis. DC supervised the research and gave final approval of the version to be published. QL participated in and supervised the lab work and gave final approval of the version to be published.

# FUNDING

National Key Research and Development Program of China (2017YFC0803603).

# REFERENCES

Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53, 217–288. doi: 10.1137/090771806


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ma, Sun, Wang, Wang, Chen and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Accuracy Improvement of In-line Near-Infrared Spectroscopic Moisture Monitoring in a Fluidized Bed Drying Process

Andrey Bogomolov<sup>1,2</sup>\*, Joachim Mannhardt<sup>1</sup> and Oliver Heinzerling<sup>3</sup>

<sup>1</sup> Blue Ocean Nova GmbH, Aalen, Germany, <sup>2</sup> Samara State Technical University, Samara, Russia, <sup>3</sup> Drug Product Development, AbbVie Deutschland GmbH & Co. KG, Ludwigshafen am Rhein, Germany

An exploratory analysis of a large representative dataset obtained in a fluidized bed drying process of a pharmaceutical powder has revealed a significant correlation of spectral intensity with granulate humidity in the whole studied range of 1091.8–2106.5 nm. This effect was explained by the dependence of powder refractive properties, and hence light penetration depth, on the water content. The phenomenon exhibited a close spectral similarity to the well-known stochastic variation of spectral intensities caused by the process turbulence (the so-called "scatter effect"). Therefore, any traditional scatter-corrective preprocessing incidentally eliminates moisture-correlated variance from the data. To preserve this additional information for a more precise moisture calibration, a time-domain averaging of spectral variables has been suggested. Its application resulted in a distinct improvement of prediction accuracy, as compared to the scatter-corrected data. Further improvement of the model performance was achieved by the application of a dynamic focusing strategy when adjusting the model to a drying process stage. Probe fouling was shown to have a minor effect on prediction accuracy. The study resulted in a considerable reduction of the root-mean-square error of in-line moisture monitoring to 0.1%, which is close to the reference method's reproducibility and significantly better than previously reported results.

Keywords: fluidized bed drying, moisture monitoring, NIR spectroscopy, light scatter, scatter correction, lighthouse probe, process analytical technology

# INTRODUCTION

Fluidized bed drying is a common unit operation routinely performed in the pharmaceutical production of solid dosage forms. In a typical batch granulation process, the drying stage immediately follows either the fluidized bed or high-shear granulation stage. It is often considered as one of the most critical steps for achieving stable product quality, i.e., for obtaining granules with desired properties at their minimal variability. Therefore, a close monitoring of the residual moisture content in the process medium is necessary for any quality assurance system in granulate production.

In modern industrial practice, moisture is commonly analyzed in isolated samples. Karl Fischer titration is a classic water analysis technique that has been widely used for decades. A viable alternative accepted by pharmacopeias is thermogravimetric analysis with a drying balance that determines moisture content in the sample as percentage weight loss on drying (LOD). At present, both techniques are realized as compact desktop devices enabling the at-line analysis of samples taken from a running process.

### Edited by:

Federico Marini, Università degli Studi di Roma La Sapienza, Italy

### Reviewed by:

Ludovic Duponchel, Université de Lille, France

Huawen Wu, BaySpec, Inc., United States

> \*Correspondence: Andrey Bogomolov ab@globalmodelling.com

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 16 April 2018 Accepted: 10 August 2018 Published: 10 October 2018

### Citation:

Bogomolov A, Mannhardt J and Heinzerling O (2018) Accuracy Improvement of In-line Near-Infrared Spectroscopic Moisture Monitoring in a Fluidized Bed Drying Process. Front. Chem. 6:388. doi: 10.3389/fchem.2018.00388

For the process type studied here, the at-line analysis of granulate moisture content typically takes 20–30 min, representing a good alternative to off-line laboratory analysis of the final product. However, such operability is insufficient to carry out real-time process control, for example by generating alarms on abnormal process states and performing timely corrections. For the same reason, at-line analysis is hardly suitable for accurately determining the process end-point, i.e., the time point at which the product reaches its optimal properties. Therefore, instant in-line monitoring of the moisture content in fluidized bed drying is strongly desired to provide a necessary level of process control and to meet growing quality requirements.

Near-infrared (NIR) spectroscopy is an undoubted favorite among real-time sensor systems for moisture monitoring in the production of solids, specifically, in the drying step (Roggo et al., 2007; Burggraeve et al., 2013; Da Silva et al., 2014). In such systems, the diffuse reflectance spectra of the process material are typically measured through an immersion probe. The key advantages of NIR spectroscopy as an in-line analytical technique include the suitability for measurements in media of highly variable bulk density, nondestructiveness, and the capability to place the probe into an appropriate position within the process space while keeping it connected to a remote spectrometer through a fiber optic cable.

The classic NIR spectroscopic moisture analysis relies on two intensive water absorption bands around 1,440 and 1,930 nm, enabling quantitative determination of the moisture in a wide concentration range. In low-selective NIR spectra, the component bands are essentially overlapped and their quantitative analysis requires the application of multivariate modeling, also known as chemometrics. In particular, the partial least-squares (PLS) regression algorithm (Sjöström et al., 1983) is widely accepted in process chemometrics (Bogomolov, 2011).

Over the last decades, the practical acceptance of NIR spectroscopy for in-line moisture monitoring in fluidized bed processing of powders and solids has been constantly growing. Published works (Frake et al., 1997; Rantanen et al., 2000; Zhou et al., 2003; Green et al., 2005; Nieuwmeyer et al., 2007; Skibsted et al., 2007; Luukkonen et al., 2008; Mantanus et al., 2009; Alcalà et al., 2010; Corredor et al., 2011; Peinado et al., 2011; Burggraeve et al., 2012; Demers et al., 2012; Möltgen et al., 2012; Obregón et al., 2013) have focused on the general feasibility of the analysis or on the investigation of specific experimental or modeling aspects (e.g., important process influences, sampling, control strategy, and model transfer). At the same time, the resulting models are typically built and validated on relatively small sets of samples and batches, which can be accounted for by the technical complexity of industrial experiments. Hence, the accuracy estimates reported for similar process setups and conditions are very diverse (Zhou et al., 2003; Green et al., 2005; Nieuwmeyer et al., 2007; Skibsted et al., 2007; Mantanus et al., 2009; Alcalà et al., 2010; Corredor et al., 2011; Peinado et al., 2011; Burggraeve et al., 2012; Demers et al., 2012; Möltgen et al., 2012) and the "ultimate" moisture determination accuracy by in-line NIR spectroscopy under widely variable process conditions remains unknown. Therefore, despite significant progress, the method can hardly be regarded as completely established yet.

In-depth considerations of NIR spectroscopic analysis in terms of light propagation in the complex fluidized bed process medium are rare (Rantanen et al., 2000; Luukkonen et al., 2008; Burggraeve et al., 2013). One of the main obstacles complicating the NIR spectroscopic monitoring of fluidized bed drying is related to process turbulence. A highly variable density of the material around the probe, and consequently the quantity of light reaching the detector, causes intensive random fluctuations of the overall intensity of in-line spectra that are often referred to as the "scatter effect." The problem is commonly resolved by preprocessing the spectra prior to the modeling step. The three most-used scatter correction methods are multiplicative scatter correction (MSC), standard normal variate (SNV), and spectral derivatives (Rinnan et al., 2009). The application of a scatter correction method to in-line process NIR spectra is ubiquitous; no exception has been found in the literature. In most cases, the choice of the preprocessing method is empirical or arbitrary.

In some publications, it was noticed that the NIR spectra expressed in the logarithmic reflectance units (lg(1/R)) exhibited a significant downward shift of the background as the drying progressed (Frake et al., 1997; Rantanen et al., 2000; Zhou et al., 2003; Luukkonen et al., 2008; Burggraeve et al., 2012). Two plausible explanations were suggested, both related to the altering of light scatter conditions in the course of drying. On one hand, the uniform decrease in spectral intensities could be caused by an increase in scattering particle size; this explanation was given by Burggraeve et al. (2012) and Frake et al. (1997). On the other hand, the presence of water on crystal surfaces affects the reflective properties of the granulated powder, resulting in a deeper light penetration and a subsequent higher absorbance of wetter samples (Rantanen et al., 2000; Luukkonen et al., 2008). Rantanen et al. (2000) provided experimental evidence of the latter phenomenon by using the pharmaceutical excipient (microcrystalline cellulose) as well as inorganic glass beads ("ballotini") with a known size distribution.

The present work aims at building an accurate and robust functional prediction model for in-line moisture content monitoring in fluidized bed drying based on a large representative set of designed process data. Both experimental and modeling factors have been scrutinized to improve the performance of the prediction model. A thorough exploratory data analysis has been applied to better understand the multivariate process trajectory delivered by in-line diffuse-reflectance NIR spectroscopy. In this study, we focus on the efficient use of the whole spectral information, including both absorption and scatter-related effects of water, to improve the performance of in-line moisture monitoring.

# MATERIALS AND METHODS

Twenty-five pilot-scale fluidized bed drying batches of a pharmaceutical powder mixture were studied by using a 256-pixel diode-array TIDAS 1121 SSG NIR spectrophotometer with a wavelength range of 1091.8–2106.5 nm (J&M Analytik AG, Germany) that was equipped with the Lighthouse Probe™ (LHP) from GEA Pharma Systems nv – Collette, Belgium (Engler et al., 2009) immersed into the process medium. The LHP was periodically cleaned and recalibrated without process interruption (see section S1.4 of **Supplementary Material**). The total number of cleaning cycles in all batches was 19.

The data of each batch included from 396 to 1,213 NIR spectra collected at 5-s intervals (16,303 spectra in total). In the course of the process, 301 samples of about 5 g (between 5 and 26 samples from each batch) were isolated and analyzed for moisture content as weight loss on drying using a HR73 halogen moisture analyzer (Mettler Toledo GmbH, Switzerland). Reproducibility checks for three LOD analyzers performed during the whole study showed that the measurement standard deviation error does not exceed 0.06% (section S1.2 of **Supplementary Material**).

The main process and the sample information are summarized in **Table S-1**. Out of the 301 samples, three were rejected from further analysis as evident outliers (section S2.3.1 of **Supplementary Material**).

Individual batch conditions were set in accordance with a developed experimental design to cover the whole range of practical process variability. Moisture content in the selected samples varied between 2.38 and 25.92%. The active pharmaceutical ingredient (API) was present in four assay levels: 0 (placebo), 0.1, 1.0, and 10.0 mg. The range of process temperatures was 30.5–49.7 °C. Eight batches (88 reference samples) formed a validation subset that was representative of the process conditions and used for model validation; the other 17 batches were used as the calibration set in that case (**Table S-1**).

A subset of 101 experimental samples was additionally analyzed off-line by using an MPA Fourier-transform (FT-) IR spectrometer (Bruker, Germany) with an integrating sphere (section S1.5 of **Supplementary Material**).

Principal component analysis (PCA) and PLS regression are multivariate data analysis algorithms described in the literature (Sjöström et al., 1983; Wold et al., 1987). The multivariate spaces, namely, PCA model principal components (PCs) and PLS latent variables (LVs) represented by their score (**t**) and loading (**p**) vectors, were used for exploratory data analysis. Conventional data preprocessing methods employed were MSC, SNV, and first-derivative using the Savitzky–Golay smoothing filter, as described by Rinnan et al. (2009).

Three validation techniques were applied with each regression model: leave-one(-sample)-out (LOO), a.k.a. full cross-validation (CV), leave-a-batch-out (LBO) CV, and validation by a preselected set (**Table S-1**). The performance of the models was characterized by root-mean-square errors (RMSE) of calibration, validation, and prediction, as well as corresponding determination coefficients R<sup>2</sup>.
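The leave-a-batch-out scheme and the RMSE statistic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `fit`/`predict` callables, and the pooling of all held-out predictions into a single RMSE are hypothetical choices and are not taken from the study's software.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def leave_a_batch_out(batch_ids):
    """Yield (batch, train_indices, test_indices), one fold per batch."""
    for batch in sorted(set(batch_ids)):
        test = [i for i, b in enumerate(batch_ids) if b == batch]
        train = [i for i, b in enumerate(batch_ids) if b != batch]
        yield batch, train, test


def lbo_cv_rmse(X, y, batch_ids, fit, predict):
    """Cross-validated RMSE with whole batches held out in turn.

    fit(X_train, y_train) returns a model and predict(model, x) returns one
    value; both are placeholders for any regression method (e.g., PLS).
    Assumption: predictions from all folds are pooled before computing RMSE.
    """
    y_pred = [None] * len(y)
    for _, train, test in leave_a_batch_out(batch_ids):
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            y_pred[i] = predict(model, X[i])
    return rmse(y, y_pred)
```

Holding out complete batches keeps samples from the same run out of both the training and the test fold at once, which is one plausible reason why this scheme gives more conservative error estimates than leave-one-sample-out.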

A detailed description of data acquisition and analysis is given in section S1 of **Supplementary Material**.

# RESULTS AND DISCUSSION

# Exploratory Analysis of In-line Spectral Data

**Figure 1** presents a set of 1,213 in-line NIR spectra obtained in batch B03 (**Table S-1**). An expected intensity reduction of the main water band in the 1,920–1,940 nm range during the process is clearly observed. Another distinct feature is the high variability of spectral intensities over the whole wavelength range (the so-called "scatter effect"), caused by strong instant density fluctuations of the granulate (and its spatial distribution) around the probe.

At the same time, the overall spectral intensity tends to fall gradually during the process, generally following the dynamics of water reduction. This trend can be illustrated by the time dependencies of the spectral intensity at two separate wavelengths: 1932.0 nm at the maximum of the main water band and 1708.1 nm where no noticeable water absorption is expected. Both intensities strongly correlate with the reference moisture content (**Figure 2A**). Data smoothing along the time scale makes this correlation even more distinct.

The moisture- and time-dependent changes in the batch processes can be effectively visualized by using data animation (section S2.1 and **Video S-1**, **Supplementary Material**). Animated spectral data reveal the same trends, namely water band reduction and stochastic background variation accompanied by a gradual fall of the spectrum intensity in the whole range.

In this situation, preprocessing is desirable, but it should be applied to the data variable vectors, i.e., along the time scale, as shown in **Figure 2A**. As the turbulence effect is supposed to be pure noise, the smoothing of variables is a straightforward way to eliminate it with a minimal loss of the informative variance.

One of the simplest smoothing techniques, the moving window averaging algorithm, has been used to preprocess the matrix of spectral data **X**. In this method, each element x<sub>ij</sub> in **X**, where i and j are respectively the object (spectrum) and variable (wavelength) indices, is replaced by a corrected value x<sup>s</sup><sub>ij</sub> calculated as a mean of the surrounding points within a window having the width defined by an odd number k (Equation 1):

$$x_{ij}^{s} = \frac{1}{k}\sum_{m=i-(k-1)/2}^{i+(k-1)/2} x_{mj} \tag{1}$$

Here, m is the running index over the spectra within the window. The transformation is performed for each variable in **X**. The (k – 1)/2 end-points on each side of the variable vector were smoothed with a reduced window of 2(l – 1) + 1 points, where l is the ordinal number of the point counted from the nearest end of the vector.
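A minimal Python sketch of this time-domain averaging is given below, assuming that the spectra are stored as a list of rows (one row per acquisition time, one column per wavelength). The function name `smooth_time_domain` is hypothetical, and the edge handling follows the reduced-window rule described above.

```python
def smooth_time_domain(X, k):
    """Moving-window average along the time axis of a spectra matrix.

    X: list of spectra (rows = acquisition times, columns = wavelengths).
    k: odd window width in points.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    n = len(X)
    half = (k - 1) // 2
    smoothed = []
    for i in range(n):
        # Near either end the window shrinks symmetrically so that it never
        # reaches outside the data: 2*(l - 1) + 1 points for the l-th point.
        h = min(half, i, n - 1 - i)
        window = X[i - h : i + h + 1]
        width = len(window)
        # Average each wavelength (column) over the spectra in the window.
        smoothed.append([sum(column) / width for column in zip(*window)])
    return smoothed
```

Because each wavelength is averaged independently over time, a uniform shift of the whole spectrum that persists over the window is retained, whereas fast random intensity fluctuations are attenuated.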

Data averaging within a selected time window is equivalent to a corresponding increase in the spectrum acquisition time, thus enlarging the virtual sample size captured by a single measurement. However, in contrast to the measurement time adjustment, the mathematical averaging does not place any limit on the time step of data acquisition, i.e., it can be performed with a time window that is much wider than the physical step size. A positive effect of the variable smoothing for the modeling of fermentation process data has been reported (Skibsted et al., 2001).

Pair-wise correlations between the LOD values and the intensities at individual variables in the corresponding (closest to the sampling times) in-line spectra were analyzed in the whole wavelength range. **Figure 2B** presents linear correlation coefficients (r) as a function of wavelength in B03. All spectral variables exhibit a strong intensity correlation with the moisture content, even in the raw data. Eliminating the process noise using the suggested averaging method (Equation 1) results in a dramatic enhancement of r. It also looks natural that correlation maxima are observed around major water bands. However, even beyond the water absorbance regions, this correlation is very high. Thus, the lowest r observed in B03 at the short-wave end of the spectral region is still greater than 0.8 (**Figure 2B**); after the smoothing, this value increases to 0.98. Similar dependencies were observed for all the 25 studied batches.
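The wavelength-wise correlation analysis can be sketched as follows. The sketch assumes that one in-line spectrum (the closest in time) has already been matched to each reference sample; the function names are hypothetical and the implementation is the textbook Pearson coefficient, not the authors' code.

```python
import math


def pearson_r(a, b):
    """Linear (Pearson) correlation coefficient of two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_spectrum(spectra, lod):
    """r between LOD and the intensity at each wavelength.

    spectra: one row per reference sample (the matched in-line spectrum).
    lod: reference moisture values in the same order.
    """
    return [pearson_r(list(column), lod) for column in zip(*spectra)]
```

Applying `correlation_spectrum` once to raw and once to time-averaged spectra would give the two kinds of curves compared in the text.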

A high correlation of lg(1/R) with the moisture content in the whole studied NIR range is in agreement with some published observations. This phenomenon can be explained by the alteration of the refractive properties of the granulate (Rantanen et al., 2000). Indeed, in the course of drying, the liquid bridges holding the primary particles together (Burggraeve et al., 2013) are replaced by air. The crystal–air interface is characterized by a higher difference of refractive indices than the crystal–water pair. Thus, drying leads to a higher scatter and hence an increased quantity of diffusely reflected light reaching the detector, which corresponds to a decrease in the spectral intensity expressed in absorbance-type units. For relatively large particles constituting the granules, this effect should be wavelength-independent. An intuitive illustration of the particle wetting effect and its uncomplicated explanation using the representative layer theory was given by Dahm (2013). A similar correlation of the Raman spectral background with the moisture content was observed in our earlier studies on pellet coating (Bogomolov et al., 2010) and granulation process monitoring (Bogomolov, 2011), and was also explained by the effect of moisture on the light propagation conditions in the process medium. Considering the strength of the spectrum variable correlation with the moisture content observed in the whole range of process conditions studied, an earlier explanation of the phenomenon in terms of changing particle size distribution during the drying course (Frake et al., 1997; Burggraeve et al., 2012) has not been confirmed. This hypothesis does not agree with the complex shape of the correlation curve in **Figure 2B**. Particle size distribution can, however, be a minor water-correlated factor affecting the spectra of the drying process.

The effect of humidity on the light penetration depth in porous materials can be compared to the watermark technique commonly used for banknote authentication. The very name of watermarks comes from the visual similarity of paper thickness variation and its wetting effects, both resulting in a decrease in the back-scattered light. Darkening of wetted powders (e.g., sand) is another manifestation of the same phenomenon that is not limited to the visible light and should be inherent in any material with a highly developed surface. The spectral variance related to the changing refractive properties of the powder is also expected to be present in the in-line process spectra. However, being wavelength-independent, the moisture-related spectral changes are masked by the stochastic "scatter effect" and then eliminated by any scatter correction. Earlier studies on in-line moisture analysis by using NIR spectroscopy neither paid any significant attention to the analytical information hidden in the "watermarks" nor attempted to use it in the modeling.
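The statement that scatter correction removes a wavelength-independent contribution can be checked on a toy example with SNV. The numbers below are made up for illustration and are not measured data, and the moisture effect is deliberately simplified to a purely additive baseline offset, which is an assumption of this sketch.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own mean and SD."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(x - mean) / sd for x in spectrum]


# Hypothetical lg(1/R) values at four wavelengths (illustrative only).
dry = [0.20, 0.25, 0.40, 0.30]
# Same spectral shape with a uniformly raised baseline, standing in for a wetter sample.
wet = [x + 0.15 for x in dry]

# Largest difference between the two spectra after SNV.
residual = max(abs(a - b) for a, b in zip(snv(dry), snv(wet)))
```

In this simplified setting, `residual` is zero up to rounding error: the two spectra become indistinguishable after SNV although their raw intensities differ at every wavelength, so any moisture information carried by the offset is lost to the subsequent model.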

A deeper insight into the data structure and its modification by adopting different preprocessing methods was obtained by the PCA of augmented process data (section S2.2 of **Supplementary Material**) that makes possible the investigation of process trajectories of individual batches in the same multivariate factor space.

As one can see from the scores of batch B10 taken as an example here (**Figure 3** and **Figure S-3**), the first PC (95.49% of **X**-variance) of the raw-data model (**Figure 3A**) is strongly associated with the moisture content, while PC<sub>2</sub> (4.23%) basically describes the process turbulence. A remarkable similarity of the first two loadings (**Figure S-4a**) with the correlation coefficient r = 0.998 is a confirmation of a close spectral affinity of these two phenomena. A scatter-driven correlation of spectral intensities with the moisture content is confirmed by the uniformly positive **p**<sub>1</sub>. A simultaneous presence of the water absorption peaks in this plot implies that PC<sub>1</sub> tends to capture the whole variance due to the moisture reduction, related to both absorbance and scatter phenomena.

Although the process noise is basically described by PC<sub>2</sub>, it strongly pollutes PC<sub>1</sub> and all further components in the raw-data model. The suggested smoothing method effectively eliminates this noise from the model scores (**Figures 3B,C** and **Figures S-3b,c**) without any essential change to the loadings (**Figures S-4b,c**). In contrast, the SNV, MSC, and first derivative (**Figures S-4d–f**) strongly modify the whole factor space; they essentially remove random fluctuations from the first two score vectors (less noisy for the first derivative) but further PCs stay very noisy (**Figures 3D–F**). The smoothed data is suitable for exploring the process trajectories in the PCA factor space. Most of the minor features revealed in the refined scores **t**<sub>2</sub>–**t**<sub>7</sub> (**Figures 3B–F**) can be assigned to certain process events, i.e., to changing process phases or LHP cleaning cycles. The PCA score plots for all batches can be found in **Figure S-3**.

**X**-variances captured by individual PCs (**Table S-3** in section S2.2.2 of **Supplementary Material**) indicate at least six significant factors for all preprocessing methods, while PC<sub>8</sub>–PC<sub>10</sub> are definitely negligible. PC<sub>7</sub> seems to be a boundary case, and its significance should be proved by using other criteria. Considering spectrum-like loadings (**Figure S-4**) and process-reflecting scores, in particular in the time-wise averaged data (**Figure 3C**), seven PCs are likely to be relevant. Additional considerations helping to deduce the number of PCs in the augmented process data are discussed in section S2.2 of **Supplementary Material**.

In general, the low variances captured by minor principal components PC<sub>2</sub>–PC<sub>7</sub> (**Table S-3**) illustrate a much higher sensitivity of NIR spectroscopy to water than to other chemical or physical variability sources in the drying process medium. Nevertheless, a thorough study of the complete PCA model resulted in some practically important observations. Thus, LHP fouling and cleaning during the process has a minor effect on the observed in-line spectra, in particular, at the final process stage (section S2.2.1 of **Supplementary Material**).

The exploratory data analysis has revealed an essential correlation of all spectral variables with the moisture content. The PCA of the united dataset (16,303 spectra) has shown that this effect is overlaid with a stochastic variation of the spectrum intensity caused by the process noise. Since both scatter-driven effects have similar spectral signatures, the application of conventional normalization or derivative preprocessing methods of scatter correction incidentally removes useful information contained in the spectrum background. Instead, it was suggested to perform the smoothing of spectral variables along the time domain, e.g., using a moving window average.

# Building an Accurate PLS Regression Model of Moisture Content

To use the additional moisture-related information contained in the spectral variables efficiently, the dependence of model accuracy on averaging window width (WW = k points) has been studied. PLS models for all possible odd k values between 3 and 101 in different moisture ranges were compared (section S2.3.2 in **Supplementary Material**). Since the in-line smoothing of time dependencies results in a delay of (k – 1)/2 trajectory points (half WW) between the process and analysis times (Bogomolov, 2011), light smoothing is technically preferred. WW = 15 was found to be optimal in all cases as it provided an essential improvement of the model accuracy with a reasonable delay of 35 s. The full WW of 70 s approximately corresponds to a material circulation period in this process and dryer type. Thus, each portion of the granulate has a good chance of being exposed to spectroscopic measurement during this time. Due to the averaging, a virtual sample size captured by spectroscopic measurement, and hence the level of scrutiny of analysis, is extended. From this point of view, an optimal WW should correspond to an averaged spectrum that is representative of the bulk material volume, while remaining a nearly instant measurement compared to the total process time. This principle can be suggested as a rule of thumb for optimal data averaging in the drying process analysis and similar applications. A 47-point averaging was found to be a "global" optimum in our case; stronger smoothing does not lead to any significant gain. Based on these observations, 15- and 47-point smoothing windows have been chosen as benchmarks for model comparison (the respective preprocessing methods are designated as S15 and S47). **Table 1** presents a summary of full-spectrum modeling results for different moisture ranges, preprocessing techniques, and validation methods.
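The quoted delay and window duration can be verified with a short calculation. It assumes that the delay equals the half-window of (k − 1)/2 points, which is what the stated 35 s implies for the 5-s acquisition interval; the symbols τ, T<sub>WW</sub>, and Δt are introduced here for illustration only.

$$\tau = \frac{k-1}{2}\,\Delta t = \frac{15-1}{2}\times 5\ \text{s} = 35\ \text{s}, \qquad T_{\mathrm{WW}} = (k-1)\,\Delta t = 14 \times 5\ \text{s} = 70\ \text{s}$$

Under the same assumption, the 47-point window (S47) would correspond to a delay of 23 × 5 s = 115 s, which illustrates why lighter smoothing is technically preferred for in-line use.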


The data covers a wide range of moisture contents from 2 to 26% (**Table S-1**). As the prediction error may be nonuniform depending on the drying stages (Mantanus et al., 2009), several PLS models were built corresponding to moisture LOD ranges <20% (D20), <15% (D15), and <10% (D10), in addition to the full-data (D) models. The abundance of measurement points makes this data reduction possible without a significant impact on the model quality. Lowering the upper limit of the moisture content noticeably reduces the RMSE (e.g., for LBO CV, it falls from 0.21 in D to 0.13 in D10), keeping R<sup>2</sup> at the same high level of 0.997–0.998 (**Table 1**). A strong error dependence on the moisture content can be practically employed to improve the performance of moisture monitoring in general. Thus, prediction software can switch to a more precise model as soon as a certain moisture content level is reached, providing an automatic model "focusing" in the process course. In this way, the most critical final stage of drying can be monitored with the highest accuracy.
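The model "focusing" idea can be illustrated with a small selection routine. This is a hypothetical sketch: the switching rule (use the narrowest-range model whose upper moisture limit still exceeds the current estimate) and the names `MODELS` and `select_model` are assumptions, since the text does not specify how the switching would be implemented.

```python
# Upper LOD limits (%) of the moisture ranges named in the text;
# the full-data model D has no upper limit.
MODELS = [(float("inf"), "D"), (20.0, "D20"), (15.0, "D15"), (10.0, "D10")]


def select_model(moisture_estimate, models=MODELS):
    """Return the name of the narrowest-range model that covers the estimate.

    A model with limit L is applicable when moisture_estimate < L.
    """
    candidates = [(limit, name) for limit, name in models if moisture_estimate < limit]
    # The smallest applicable limit identifies the most focused model.
    return min(candidates)[1]


# Illustrative drying trajectory (made-up moisture estimates in %).
trajectory = [24.8, 18.2, 12.9, 6.1]
chosen = [select_model(m) for m in trajectory]
```

In a running process, the moisture estimate used for switching could come from the previous prediction, so that the most focused model takes over as the batch approaches its end-point.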

The number of LVs to be kept in PLS models was estimated from the RMSE of different validation methods and from the explained **X**- and **y**-variances (**Table S-4**). **Figure 4** compares the LBO CV RMSE dependencies on the number of LVs for the models in different moisture ranges (**Figure 4A**) and data averaging degrees (**Figure 4B**). Their common trend is that the validation error reaches a plateau starting from the seventh LV; faint minima at higher factor numbers do not seem significant. Note that LBO CV is generally the most conservative (i.e., resulting in the highest errors) validation method in **Table 1**. Scatter correction does not result in the model simplification that might be expected. **Figure 4B** shows that the validation RMSE for MSC-preprocessed D15 data is even higher than the RMSEV of the model obtained after moderate (S15) data smoothing. This effect is observed for any number of LVs higher than one. Starting from the sixth LV, the prediction error after MSC becomes even worse than in the raw-data model. This behavior agrees with the earlier PCA-based conclusion that conventional scatter correction refines only the two first factors of the multivariate space, transferring the process noise into higher yet significant model dimensions.

TABLE 1 | PLS regression statistics for in-line moisture content determination: model comparison for different moisture ranges and preprocessing techniques using different validation methods; all models were built with 7 LVs.

<sup>a</sup>Dataset used: D – full dataset, D20, D15, and D10 – datasets limited to LOD moisture content below 20, 15, and 10%, respectively; <sup>b</sup>the number of samples without outliers (see section S2.3 of Supplementary Information); <sup>c</sup>preprocessing applied; <sup>d</sup>calibration statistics; <sup>e</sup>full cross-validation statistics; <sup>f</sup>leave-a-batch-out cross-validation statistics; <sup>g</sup>validation set (Table 1) prediction statistics; <sup>h</sup>variable averaging with 15-point window; <sup>i</sup>variable averaging with 47-point window; <sup>j</sup>Savitzky–Golay first derivative with second-order polynomial and 15-point smoothing window.

The analysis of the captured **X**- and **y**-variances (**Table S-4**) exhibited similar trends. It was also shown that seemingly insignificant variances captured by the seventh LV in the calibration data are still in agreement with the respective precisions of the NIR spectrometer and the LOD analyzer (section S2.3.3 of **Supplementary Material**).

The first two PLS loadings (**Figure S-6**) are almost identical to those in the augmented PCA (**Figure S-4**); therefore, both multivariate modeling spaces are essentially the same. Meaningful shapes of the first seven loadings, which are similar in PCA (**Figure S-4**) and PLS models (**Figure S-6**) as well as PCA scores (**Figure 3**), provide an additional justification of the chosen model's complexity. The noticeable positive offset of **p**<sub>1</sub> in raw and smoothed data models (**Figures S-6a–c**) indicates that PLS regression makes use of both absorbance and scatter-correlated variances for moisture calibration. The loadings **p**<sub>3</sub> to **p**<sub>7</sub> still exhibit similar (as in PCA) interpretable spectrum-like features. Therefore, seven LVs were found to be optimal for all moisture ranges and data preprocessing methods, consistent with the earlier PCA result for all spectral data. This number is also reasonable, considering the physical and chemical complexity of the process as well as the anticipated nonlinearity of spectral responses. It is also acceptable from the point of view of calibration set size.

**Table S-4** also confirms the efficiency of variable smoothing. Cumulative **y**-variances grow with the averaging WW, reducing the imbalance between the **X**- and **y**-variances for any number of LVs, in particular, for LV1. Starting from LV3, the **y**-variance captured in smoothed data becomes higher than that in the models preceded by scatter correction (e.g., MSC). The problem of deducing the optimal number of LVs is considered in detail in section S2.3.3 (**Supplementary Material**).

The validation statistics presented in **Table 1** show that the suggested data averaging approach is advantageous as compared to the MSC, SNV, and first derivative using the Savitzky–Golay smoothing filter. It is remarkable that any scatter correction (most essentially, MSC or SNV) leads to higher calibration and validation errors than those for raw spectral data (this comparison is provided for D and D15, but it holds for all datasets). **Figure 5** illustrates the model performance achieved in D15.

FIGURE 4 | RMSE dependencies (LBO CV) on the number of LVs in PLS models: (A) for nonpreprocessed data in different moisture content ranges: D (squares), D20 (diamonds), D15 (circles), and D10 (triangles); and (B) D15 data with different smoothing degrees: none (solid), S15 (dashed), and S47 (dash-dotted), as well as for MSC preprocessing (red dotted, filled markers).

A subset of 101 process samples was additionally analyzed off-line by using a high-resolution FT-NIR spectrometer (section S2.3.5 of **Supplementary Material**). The integrating sphere applied in this case excluded any scatter-related stochastic variation of spectral intensities. Nevertheless, all spectral variables (including the background signal) exhibited the same strong correlation with the sample moisture content (**Figures S-9, S-10**), as in the case of in-line spectra (**Figure 2**). This fact confirms our previously given explanation of this effect in terms of changing light propagation conditions. Moreover, the performance of the PLS model built on 96 off-line spectra (samples with LOD < 15% were used) was found to be essentially the same (cross-validation RMSE = 0.108) as in the model built on respective averaged in-line spectra (S15) of the same process samples (**Table S-6**). This remarkable result provides an additional confirmation of the efficiency of the suggested method. For more details on the off-line analysis results, see **Supplementary Material**, section S2.3.5.

The time dependencies of the predicted moisture content in B12 (**Figure S-7**) illustrate the additional advantages of the suggested preprocessing technique. Variable smoothing most efficiently eliminates the noise contained in process trajectories at the beginning of the drying process, when the moisture content is greater than 15%. It also helps avoid prediction artifacts related to probe cleaning during the "wet" process stage. Section S2.3.4 in **Supplementary Material** provides a detailed discussion of the predicted drying trajectories.

In numerous publications on in-line diffuse-reflectance NIR monitoring of fluidized bed drying and similar processes, data analysis is always prefaced by MSC, SNV, or derivatives without exception. A mandatory application of corrective preprocessing may only be justified in preliminary feasibility studies, when the small calibration/validation dataset does not allow for building models of adequate complexity. The results reported here could be used as evidence for the destructiveness of scatter correction for the moisture calibration, as it eliminates a significant portion of the useful variance. Similar ideas have been formulated in the literature (Chen and Thennadil, 2012), where the information content of MSC coefficients was analyzed. The PLS capability of employing the quantitative information delivered by the scatter has earlier been illustrated in other applications, in particular in particle size analysis (Nieuwmeyer et al., 2007) and the quantitative determination of fat and protein in milk (Bogomolov et al., 2012). In these cases, the predictive models built on raw data exhibited a noticeably better performance, as compared to those in which any scatter-correction was applied. For in-line process data, the suggested smoothing approach, performed in a time rather than spectral domain, presents a viable alternative to the classic scatter correction of spectra, to eliminate noise while preserving useful information contained in the spectral variables.
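The idea of smoothing in the time domain rather than the spectral domain can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `smooth_along_time`, the trailing (causal) window, and the shortened window at the start of the run are all assumptions made for the example.

```python
import numpy as np

def smooth_along_time(spectra, window):
    """Trailing moving average over time, applied to every spectral variable.

    spectra: array of shape (n_times, n_variables), one in-line spectrum per row.
    window:  number of consecutive spectra averaged (hypothetical parameter).
    """
    spectra = np.asarray(spectra, dtype=float)
    # A zero row in front makes csum[t + 1] - csum[start] the sum of rows start..t.
    zero_row = np.zeros((1, spectra.shape[1]))
    csum = np.cumsum(np.vstack([zero_row, spectra]), axis=0)
    smoothed = np.empty_like(spectra)
    for t in range(spectra.shape[0]):
        start = max(0, t - window + 1)  # assumption: shorter window at the start
        smoothed[t] = (csum[t + 1] - csum[start]) / (t + 1 - start)
    return smoothed
```

Because each variable is averaged only with its own earlier values, the spectral shape at a given time point, including any scatter-related offset, is left intact, whereas MSC or SNV would rescale each spectrum individually.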

# CONCLUSIONS

In light of the presented results, the following recommendations for practical NIR spectroscopic monitoring of moisture content in fluidized bed drying and similar process types can be formulated. The very common practice of a priori scatter correction of in-line process spectra prior to multivariate calibration is generally discouraged, because it may eliminate an essential part of the water-related variance from the data and thus degrade the resulting prediction model. To avoid this, quantitative modeling should be prefaced by an exploratory analysis of the raw data to investigate the relevance of both absorbance and scatter-related effects of moisture by using a sufficiently large representative set of designed samples and process conditions. These considerations are equally valid in cases when water content is not directly determined but should be taken into account by an accurate multivariate model as an important process factor. Process noise, i.e., stochastic background and intensity variations of in-line spectra, can be efficiently eliminated with a minimal loss of useful information by means of data smoothing along the time scale. The smoothing strength should be adjusted depending on the process scale and dynamics. Building accurate quantitative models should rely on a methodically determined number of latent variables. The deliberate use of fewer LVs than the optimal number indicated by model diagnostics, sometimes done by researchers to guard against overfitting, is not always justified. Underfitting may often be more detrimental to prediction accuracy.

# AUTHOR CONTRIBUTIONS

AB conceived and wrote the paper and analyzed the data. JM conceived the project and organized and planned industrial experiments. OH performed the experiments and analyzed the data.

# ACKNOWLEDGMENTS

The Ministry of Education and Science of the Russian Federation supported this work within the framework of the basic part of state task on the theme Adaptive technologies of analytical control based on optical sensors (Project No. 4.7001.2017/BP). The authors thank Tomas Vermeire (GEA, Belgium) and the colleagues from the previous Pharmaceutical and Analytical Development department (Weesp, The Netherlands) for their support of the experiments. J&M Analytik AG is acknowledged for organization and support. Prof. Dr. Rudolf W. Kessler (Reutlingen University, Germany) is acknowledged for fruitful discussions. Ivan and Petr Bogomolov helped in manuscript preparation.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2018.00388/full#supplementary-material


**Conflict of Interest Statement:** OH is an AbbVie employee and may own AbbVie stock/options.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Bogomolov, Mannhardt and Heinzerling. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Eliminating Non-linear Raman Shift Displacement Between Spectrometers via Moving Window Fast Fourier Transform Cross-Correlation

### Hui Chen<sup>1,2,3†</sup>, Yan Liu<sup>1†</sup>, Feng Lu<sup>1</sup>\*, Yongbing Cao<sup>2,4</sup>\* and Zhi-Min Zhang<sup>5</sup>\*

### Edited by:

Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam

### Reviewed by:

Andreas Borgschulte, Swiss Federal Laboratories for Materials Science and Technology, Switzerland; Sebastian Primpke, Alfred Wegener Institut Helmholtz Zentrum für Polar und Meeresforschung, Germany

### \*Correspondence:

Feng Lu, fenglufeng@hotmail.com; Yongbing Cao, ybcao@vip.sina.com; Zhi-Min Zhang, zhangzhimin@csu.edu.cn

†These authors have contributed equally to this work and are co-first authors

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 08 February 2018; Accepted: 05 October 2018; Published: 25 October 2018

### Citation:

Chen H, Liu Y, Lu F, Cao Y and Zhang Z-M (2018) Eliminating Non-linear Raman Shift Displacement Between Spectrometers via Moving Window Fast Fourier Transform Cross-Correlation. Front. Chem. 6:515. doi: 10.3389/fchem.2018.00515

<sup>1</sup> School of Pharmacy, Second Military Medical University, Shanghai, China, <sup>2</sup> Department of Vascular Disease, Shanghai TCM-Integrated Hospital, Shanghai University of Traditional Chinese Medicine, Shanghai, China, <sup>3</sup> Quality Control Department, Shanghai Diracarta Biomedical Technology Co., Ltd, Shanghai, China, <sup>4</sup> Department of Foundation and New Drug Research, Shanghai TCM-Integrated Institute of Vascular Disease, Shanghai, China, <sup>5</sup> College of Chemistry and Chemical Engineering, Central South University, Changsha, China

Obtaining consistent spectra by using different spectrometers is of critical importance to the fields that rely heavily on Raman spectroscopy. The quality of both qualitative and quantitative analysis depends on the stability of specific Raman peak shifts across instruments. Non-linear drifts in the Raman shifts can, however, introduce additional complexity in model building, potentially even rendering a model impractical. Fortunately, various types of shift correction methods can be applied in data preprocessing in order to address this problem. In this work, a moving window fast Fourier transform cross-correlation is developed to correct non-linear shifts for synchronization of spectra obtained from different Raman instruments. The performance of this method is demonstrated by using a series of Raman spectra of pharmaceuticals as well as comparing with data obtained by using an existing standard Raman shift scattering procedure. The results show that after the removal of shift displacements, the spectral consistency improves significantly, i.e., the spectral correlation coefficient of the two Raman instruments increased from 0.87 to 0.95. The developed standardization method has, to a certain extent, reduced instrumental systematic errors caused by measurement, while enhancing spectral compatibility and consistency through a simple and flexible moving window procedure.

Keywords: Raman instruments, shift correction, cross-correlation, fast Fourier transform, moving window

# INTRODUCTION

Over the last few decades, the use of Raman spectroscopy in combination with chemometric methods has increased significantly for analysis of pharmaceutical products (Sacré et al., 2010; Dégardin et al., 2011; Loethen et al., 2015), detection of food adulteration (Zou et al., 2009; Cheng et al., 2010), and other applications (Mrozek et al., 2004; Taleb et al., 2006; Muehlethaler et al., 2011). Raman spectroscopy is a powerful tool for sample analysis and benefits from several advantages such as high speed, simplicity, non-destructive nature, and cost-effectiveness. To date, it has been extensively applied in pharmaceutical analysis by constructing multivariate calibration models. However, an existing calibration model becomes invalid when it is applied to spectra collected on a different occasion or on a separate instrument, or when the response of an old instrument varies over time (Du et al., 2011; Brown, 2013). These variations may, if left untreated, dominate the calibration models, thereby making analysis of samples impractical. Consequently, chemometric techniques have been used to circumvent these problems through instrumental transfer or standardization so as to isolate and compensate for any instrumental and environmental variations.

A number of methods, including both instrumental transfer and standardization, have been discussed in the literature (Wang et al., 1991, 1992; De Noord, 1994; Mann and Vickers, 1999; Nguyen Quang et al., 1999; Hutsebaut et al., 2005; Kompany-Zareh and van den Berg, 2010; Rodriguez et al., 2011b; Weatherall et al., 2013). The direct standardization (DS) and piecewise direct standardization (PDS) developed by Wang et al. (1991, 1992) are the most extensively used procedures for spectral response standardization. Using the PDS method, Gryniewicz-Ruzicka et al. (2011) obtained a very low detection limit for diethylene glycol in pharmaceutical-grade glycerin by using five portable Raman spectrometers. This method, however, requires the user to measure several standards prior to analyzing samples. In addition, both the moving window strategy and the selection of principal components have a noticeable impact on the performance of PDS and need to be determined carefully. Furthermore, neither the DS nor the PDS method can deal with different (i.e., non-linear) shifts in the peaks in Raman spectra. It is worth mentioning that, in contrast to the various instrumental spectral responses, Raman shift inconsistencies arise mainly from different charge-coupled device (CCD) detectors (Vickers and Mann, 1999). Nonetheless, the use of inconsistent spectra will significantly diminish the predictive power of a calibration model. As a result, the removal of Raman shift or wavelength inconsistencies for spectra synchronization has become a particularly significant aspect of Raman spectroscopy analysis. In 1996, a mathematical procedure to correct wavelength drifts to synchronize Raman spectra was presented by Booksh et al. (1996). Typically, empirical data are required to select a number of principal components and channels to increase the synchronization precision. Westad and Martens (1999) developed a more general concept of shift determination and tested it on Raman spectra. The results revealed, however, that the spectra were not reproduced exactly after removal of peak drifts exceeding a discrete spectral resolution. Hutsebaut et al. (2005) used a Raman shift standard scattering (SSS for short) method in combination with a linear fitting to determine shift drifts between measured Raman peak and reference positions. A similar approach was used by Rodriguez et al. (2011b) to transfer Raman spectral libraries among instruments. Nevertheless, the use of Raman shift standards is inappropriate for in-line monitoring applications because of the difficulties associated with incorporating one or more of the materials proposed as shift standards in a system for in-line measurements. Recently, another approach for the removal of disturbing factors in the CCD responses and instrumental apparatus functions was proposed by Weatherall et al. (2013). Unfortunately, the use of the baselineWavelet continuous wavelet transform as a function to identify the positions of major peaks accurately requires idealized line profiles of the corresponding peaks, which is not practical for real Raman spectra. In addition, several parameters that influence the final results, such as the width of the window and the choice of the signal-to-noise threshold, need to be specified, mostly by the users.

As a result of the multifarious theoretical and practical limitations of the existing instrument standardization methods (Chen et al., 2015), there is a significant demand for methods that are easier to implement (i.e., with fewer or even no tunable parameters) and deliver better analytical performance. Accordingly, we introduced a cross-correlation method in order to address the problems discussed above (such as tunable parameters and the need for idealized line profiles). Generally, in signal processing, cross-correlation is a measure of similarity of two waveforms as a function of a time-lag applied to one of them, and is also known as a sliding dot product or sliding inner-product (Welch, 1974; Goshtasby et al., 1984). When coupled with fast Fourier transform (FFT) algorithms, the efficiency of FFT can be exploited in the numerical computation of cross-correlations, thus accelerating the convolution calculation (Bracewell, 1980). FFT cross-correlation may therefore be the fastest method in signal processing for shift correction (Bergland, 1969), and benefits from many advantages such as high speed and accuracy. Moreover, it also eliminates the requirement for alignment parameters. Previously, two alignment methods were proposed to estimate the shifts between segments in large chromatographic and spectral datasets, namely, peak alignment by FFT (Wong et al., 2005b) and recursive alignment by FFT (Wong et al., 2005a). However, these two methods move segments by insertion and deletion of data points at the start and end of segments, without considering peak information, which may change the shapes of peaks by introducing artifacts and removing peak points. Zhang et al. (2012) developed another method, known as the multi-scale peak alignment (MSPA) method, to synchronize peaks against a reference chromatogram (aligning peaks from large to small scales), which is accelerated by the application of FFT cross-correlation while preserving peak shape during synchronization. Similarly, Li et al. (2013a) developed a moving window FFT cross-correlation (MWFFT) method to effectively synchronize high-throughput chromatograms without segment size optimization. However, Raman spectral profiles differ from chromatograms, so peak fitting was required to obtain perfect profiles and a precise Raman shift.

In the present work, the MWFFT was improved and subsequently applied to spectral standardization to address the issues associated with spectral drifts in Raman spectrometers. The performance of this method was compared to that of the SSS method (Hutsebaut et al., 2005) by using two Raman datasets from primary and secondary spectrometers. The aim of our study was to establish MWFFT as a powerful and practical method for standardization across Raman spectrometers, one that can be easily implemented and is well suited to correcting Raman shift displacements between spectrometers.

# MATERIALS AND METHODS

# Standards and Samples

Standards (acetaminophen and cyclohexane) were provided by the National Institute for the Control of Pharmaceutical and Biological Products. Pharmaceutical tablets (listed in **Table 1**) from five different manufacturers were provided by the Shanghai Institute for Food and Drug Control.

# Raman Spectrometers

Two Raman instruments with an excitation wavelength of 785 nm were used, and their physical parameters are listed in **Table 2**. In this work, the i-Raman is regarded as the "master" (primary) instrument, while the GemRam is regarded as the "slave" (secondary) instrument.

The integration times for the standards and drugs were 2 and 3 s, respectively. Unless stated otherwise, six Raman spectra were collected for each drug during the experiment. It is worth noting that the final spectrum of each drug was calculated as the average of spectra collected from a variety of positions. Moreover, only the spectral region containing the most abundant information (i.e., 300–1,700 cm<sup>−1</sup>) was used in subsequent data analysis.

# Cross-Correlation

In signal processing, cross-correlation is a standard technique to calculate the similarity between and estimate the linear shift of two signals as a function of one relative to the other, which is also known as the sliding dot product. It is obvious that any changes involving the shifting of one signal will affect the correlation coefficient calculated for any combination of two signals that includes this shifted signal. For two discrete signals such as those in the Raman spectra, the cross-correlation is defined as:

$$\mathbf{c}(j) = \frac{\sum_{i} \left( \mathbf{r}(i) - \bar{\mathbf{r}} \right) \left( \mathbf{s}(i+j) - \bar{\mathbf{s}} \right)}{\sqrt{\sum_{i} \left( \mathbf{r}(i) - \bar{\mathbf{r}} \right)^{2}} \sqrt{\sum_{i} \left( \mathbf{s}(i+j) - \bar{\mathbf{s}} \right)^{2}}} \tag{1}$$

where r is the reference signal, s is the signal to be synchronized, and c holds the cross-correlation values for all lags j. As a simple example, consider two simulated Raman spectra r and s that differ only by a known displacement of 90 points along the x-axis. Cross-correlation lets us determine by how much s should be shifted along the x-axis in order to maximize its similarity to r. The above formula slides s along the x-axis, calculating the sum of the products of the two signals at each position. The value of c is maximized when the peaks of the two signals coincide, because aligned peaks make the largest contribution to the sum of products. A visual description of the calculation procedure of cross-correlation and estimation of shifts between signals via cross-correlation is shown in **Figure 1**.
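The 90-point example can be reproduced with a short sketch. The synthetic peak positions, the helper names `gaussian_peaks` and `estimate_lag`, and the use of global mean-centering (instead of the per-lag normalization written in Equation 1) are assumptions made for illustration only.

```python
import numpy as np

def gaussian_peaks(x, centers, width=8.0):
    """Synthetic spectrum: a sum of Gaussian bands (illustrative data only)."""
    return sum(np.exp(-0.5 * ((x - c) / width) ** 2) for c in centers)

def estimate_lag(r, s):
    """Return the lag j (in points) that maximizes sum_i r(i) * s(i + j)."""
    r0 = r - r.mean()
    s0 = s - s.mean()
    # np.correlate(a, v, "full")[m] equals sum_n a[n + k] * v[n]
    # with k = m - (len(v) - 1), so k runs over the lags listed below.
    cc = np.correlate(s0, r0, mode="full")
    lags = np.arange(-(len(r) - 1), len(s))
    return int(lags[np.argmax(cc)])

x = np.arange(1000)
r = gaussian_peaks(x, [200, 450, 700])   # reference
s = gaussian_peaks(x, [290, 540, 790])   # same bands, displaced by 90 points
lag = estimate_lag(r, s)                 # 90 for this synthetic pair
s_aligned = np.roll(s, -lag)             # np.roll wraps around; fine here because the edges are flat
```

A single lag of this kind describes only a uniform (linear) displacement of the whole spectrum, which is the limitation addressed by the moving window procedure.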

TABLE 1 | Correlation coefficients of drug tablets before and after shift correction.


<sup>u</sup>Correlation coefficient before shift correction; <sup>s</sup>Correlation coefficient by SSS; <sup>m</sup>Correlation coefficient by MWFFT.

# Moving Window FFT Cross-Correlation

The FFT is typically used to calculate the cross-correlation of 1D and 2D signals (Papoulis, 1962; Cooley et al., 1969; Dutt and Rokhlin, 1993). In the present work, FFT was used to increase the speed of cross-correlation between two datasets, in which one signal may be shifted relative to another. In addition, and perhaps more significantly for its application to the spectral synchronization problem, FFT cross-correlation is not heuristic and thus can consistently identify the best match between signals by finding the maximum correlation coefficient (Wong et al., 2005b).

Usually, the cross-correlation method can only estimate linear shifts between Raman spectra. However, Raman shift displacements are often non-linear in real samples. Consequently, we adopted the moving window procedure in this work to address this problem. In this procedure, the shifts relative to the reference are estimated by FFT cross-correlation within each window, allowing us to obtain the shift profiles of all samples. MWFFT can be implemented and optimized simply and effectively, provided that a moving window of appropriate size is used. With a window moving from the beginning to the end of the two spectra, one obtains a matrix of shift points. The shift profile is then obtained by calculating the mode of each column of the shift matrix. **Figure 2A** shows an example Raman shift profile estimated by using the moving window strategy and FFT cross-correlation. It is apparent from the obtained shift profile that non-linear shifts exist across the entire spectral region, while the change points are observed between two regions with different shifts. By moving the continuous region around the change points, the synchronization procedure can be completed smoothly to obtain the synchronized spectrum, which can be seen in **Figure 2B**, with all the non-linear shifts successfully synchronized.
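One possible reading of the moving window procedure is sketched below. It is not the authors' MATLAB code: the function names, the `min_std` threshold used to skip flat windows, and the vote table (which stores, for every data point, how often each lag was estimated by the windows covering it, so that its row-wise maximum is the column-wise mode of the shift matrix) are assumptions introduced for this example.

```python
import numpy as np

def fft_lag(r_win, s_win):
    """Lag j maximizing sum_i r(i) * s(i + j), computed with a zero-padded FFT."""
    n = len(r_win)
    size = 2 * n                                   # padding prevents circular wrap-around
    R = np.fft.rfft(r_win - r_win.mean(), size)
    S = np.fft.rfft(s_win - s_win.mean(), size)
    cc = np.fft.irfft(np.conj(R) * S, size)        # correlation theorem
    lags = np.concatenate([np.arange(n), np.arange(-n, 0)])
    return int(lags[np.argmax(cc)])

def shift_profile(r, s, window=70, min_std=1e-6):
    """Per-point shift of s relative to r; NaN where no window carried signal."""
    n = len(r)
    votes = np.zeros((n, 2 * window), dtype=int)   # column index = lag + window
    for k in range(n - window + 1):
        r_win, s_win = r[k:k + window], s[k:k + window]
        if r_win.std() < min_std or s_win.std() < min_std:
            continue                               # flat window: no shift information
        lag = fft_lag(r_win, s_win)
        votes[k:k + window, lag + window] += 1     # every covered point gets one vote
    profile = (votes.argmax(axis=1) - window).astype(float)
    profile[votes.sum(axis=1) == 0] = np.nan
    return profile

# Synthetic test pair: bands displaced by 3 points in one region and 6 in another.
x = np.arange(600)
band = lambda c: np.exp(-0.5 * ((x - c) / 3.0) ** 2)
r = band(100) + band(200) + band(400) + band(500)
s = band(103) + band(203) + band(406) + band(506)
profile = shift_profile(r, s, window=70)
```

Taking the mode rather than the mean makes the estimate robust against the minority of windows in which a band is cut off by the window edge, and it keeps the profile piecewise constant, so change points show up as steps.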

# RESULTS

There are two common ways of correcting the x-axis in Raman spectrometers (McCreery, 2005). The first one is to simply use the SSS method (Hutsebaut et al., 2005); the second one is based on absolute frequency calibration using the emission line spectra of gases. The SSS method, which requires the acquisition of Raman spectra of common materials with well-established Raman shift peak frequencies in order to correct the Raman shift axis directly, is used as a comparative method in this work. Two well-known Raman shift chemical standards, cyclohexane and acetaminophen, were chosen over others for this study since their spectral combination can provide more signals in the region from 300 to 1,700 cm<sup>−1</sup> (see **Table 3**). The left panel in **Figure 3** shows the spectra acquired for the chemical standards on the two instruments, while the right panel shows a plot of their differences. It should be noted that when the SSS method was used, the spectra acquired on the primary instrument were regarded as the reference, i.e., the peak positions in these spectra were used for synchronization. The peak positions obtained on the secondary instrument are subtracted from the corresponding primary peak positions to give the shift displacements. Linear fitting is then used to describe the shift displacements between the two instruments. Finally, the shift correction is carried out by linear interpolation.
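The SSS steps described above (peak displacements, linear fit, linear interpolation) can be sketched as follows. The function names are hypothetical, and the peak positions used in the usage lines are synthetic numbers chosen for the example, not the standard values of **Table 3**.

```python
import numpy as np

def fit_displacement(peaks_primary, peaks_secondary):
    """Fit displacement = primary - secondary as a linear function of the secondary position."""
    peaks_primary = np.asarray(peaks_primary, dtype=float)
    peaks_secondary = np.asarray(peaks_secondary, dtype=float)
    displacement = peaks_primary - peaks_secondary
    slope, intercept = np.polyfit(peaks_secondary, displacement, 1)
    return slope, intercept

def apply_correction(axis_secondary, intensity, slope, intercept, axis_primary):
    """Move the secondary axis by the fitted displacement, then resample on the primary axis."""
    corrected_axis = axis_secondary + slope * axis_secondary + intercept
    return np.interp(axis_primary, corrected_axis, intensity)

# Synthetic usage: the secondary instrument reads 0.002 * x + 1 cm^-1 too low.
axis = np.arange(300.0, 1701.0)
sec_peaks = np.array([400.0, 800.0, 1200.0, 1600.0])
pri_peaks = sec_peaks + 0.002 * sec_peaks + 1.0
slope, intercept = fit_displacement(pri_peaks, sec_peaks)
band = np.exp(-0.5 * ((axis - 1000.0) / 10.0) ** 2)   # band seen at 1000 on the secondary
corrected = apply_correction(axis, band, slope, intercept, axis)
```

Because the displacement is modeled by a single straight line, this correction cannot follow a shift profile with steps or curvature, which is where it differs from the moving window approach.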

# Synchronization of Pharmaceutical Datasets

Data synchronization of the raw Raman spectra is presented to evaluate the performance of the MWFFT method (**Figure 2**). In order to gain further insight into the two shift correction algorithms, and the properties and advantages of MWFFT in particular, different batches of pharmaceutical tablets were examined to verify the practicability and effectiveness of MWFFT. **Figure 4** describes the application of MWFFT: each tablet from a total of 40 drugs was analyzed on average six times on the two instruments to obtain six different spectra. Subsequently, these spectra were screened for outliers. The average of the six spectra acquired on the primary instrument, free of outliers, was regarded as the reference. Analogously, we obtained the spectrum of the same tablet on the secondary instrument, which represents the spectrum to be synchronized. Finally, MWFFT was applied to remove the shift displacements in order to synchronize the spectra across the two instruments.

Prior to data analysis, adaptive iteratively reweighted penalized least squares (airPLS) (Zhang et al., 2010a,b; Li et al., 2013b) baseline correction and Savitzky–Golay smoothing (Savitzky and Golay, 1964) (a 9-point wide window and a second-order polynomial) were used in the preprocessing of a variety of pharmaceutical datasets. All processing tasks were implemented on a personal computer (CPU: 2.53 GHz, RAM: 8 GB) with MATLAB R2013a. Firstly, we demonstrate the effect of MWFFT by using the pharmaceutical datasets (**Figure 5**). The primary instrument spectra (black lines) are used as references for synchronization. **Figure 5** shows the magnified versions of the sample profiles, focusing on a particular set of peaks in order to allow the performance of the MWFFT method to be evaluated by visual inspection. For the acyclovir and captopril datasets, it can be seen that before synchronization (top panel in **Figure 5**), the peaks in the spectra collected on the secondary instrument are de-synchronized with respect to those obtained on the primary instrument, and vary from sample to sample. After synchronization (middle panel in **Figure 5**) using the MWFFT method, it is apparent that all the spectra are now properly synchronized. This outcome is attributed to the action of the MWFFT method, which appropriately slides the peaks to match the reference spectrum with a window size of 70 points. In addition, for the sake of comparison, the results obtained using the SSS method for the same spectra are displayed in the bottom panel of **Figure 5**.

# Correlation Coefficient After Synchronization

The correlation, or distance, between a signal and the reference point is often used as an optimization objective function—when the signals match, the correlation coefficient is maximized. In this case, correlation coefficient can be used to assess the synchronization problem (Lee Rodgers and Nicewander, 1988). Generally, the correlation coefficient is a good descriptor of similarity, with a value of 1.00 indicating a perfect match, while 0 indicates significant dissimilarity. The correlation coefficient is simple to use and possesses several desirable properties, which we discussed in detail in our previous work (Gao et al., 2014). The correlation coefficient can be calculated by using the following equations:

$$r = \frac{\sum_{i=1}^{n} \left( X_i^{p} - \bar{x}^{p} \right) \left( X_i^{s} - \bar{x}^{s} \right)}{\sqrt{\sum_{i=1}^{n} \left( X_i^{p} - \bar{x}^{p} \right)^{2} \sum_{i=1}^{n} \left( X_i^{s} - \bar{x}^{s} \right)^{2}}} \tag{2}$$

$$R = \frac{\sum_{i=1}^{n} \left( X_i^{p} - \bar{x}^{p} \right) \left( X_i^{sa} - \bar{x}^{sa} \right)}{\sqrt{\sum_{i=1}^{n} \left( X_i^{p} - \bar{x}^{p} \right)^{2} \sum_{i=1}^{n} \left( X_i^{sa} - \bar{x}^{sa} \right)^{2}}} \tag{3}$$

Here, X<sup>p</sup> and X<sup>s</sup> represent the spectra of n drugs measured on the primary and secondary instruments, respectively, and X<sup>sa</sup> indicates the shift-corrected secondary spectrum. Parameters x̄<sup>p</sup>, x̄<sup>s</sup>, and x̄<sup>sa</sup> represent the average spectra of X<sup>p</sup>, X<sup>s</sup>, and X<sup>sa</sup>, respectively, while r and R denote the similarity between the original primary spectrum and the secondary spectrum before and after shift correction, respectively. The correlation coefficient of each drug's spectrum was calculated, and the results are summarized in **Table 1**. During the preprocessing, linear interpolation was used to re-compute intensities on the master Raman shift x-axis in order to unify the spectra obtained using the primary and secondary instruments. It is apparent from **Table 1** that the correlation coefficients between the two instruments improved significantly after shift correction.
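A minimal sketch of this similarity measure is given below. It assumes that the sums in Equations 2 and 3 run over the data points of one spectrum pair after both have been placed on the primary axis; the function name `spectral_correlation` and the synthetic bands are illustrative assumptions.

```python
import numpy as np

def spectral_correlation(axis_p, spec_p, axis_s, spec_s):
    """Pearson correlation between a primary and a secondary spectrum.

    The secondary spectrum is first re-sampled on the primary Raman shift
    axis by linear interpolation, as described for the preprocessing.
    """
    spec_s_on_p = np.interp(axis_p, axis_s, spec_s)
    dp = spec_p - spec_p.mean()
    ds = spec_s_on_p - spec_s_on_p.mean()
    return float(np.sum(dp * ds) / np.sqrt(np.sum(dp ** 2) * np.sum(ds ** 2)))

# Synthetic usage: one band, measured once in place and once displaced by 5 cm^-1.
axis = np.arange(300.0, 1701.0)
primary = np.exp(-0.5 * ((axis - 1000.0) / 10.0) ** 2)
displaced = np.exp(-0.5 * ((axis - 1005.0) / 10.0) ** 2)
r_same = spectral_correlation(axis, primary, axis, primary)
r_displaced = spectral_correlation(axis, primary, axis, displaced)
```

Passing the uncorrected secondary spectrum gives the quantity called r above, and passing the shift-corrected spectrum gives R.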

As can be seen in **Table 1**, the correlation coefficients prior to shift correction varied slightly among different batches of a drug. Nevertheless, these variations are within the three-sigma range. Portable spectrometers often rely on library-based spectral correlation methods (Carron and Cox, 2010), which frequently use the hit-quality index (HQI) as the figure of merit to characterize the correlation between spectra. The typical minimum threshold that classifies an unknown sample as a "Pass" is 0.95 (Rodriguez et al., 2011a, 2013), which is similar to the correlation coefficient. Clearly, the MWFFT method makes a significant contribution to the level of similarity for the spectra obtained using the slave instrument. The synchronization increased the similarities for all drugs above the verification threshold of 0.95, while the similarity for one captopril tablet remained under 0.95 when the SSS procedure was used. Consequently, the MWFFT method can correct the non-linear shifts successfully, thus synchronizing the secondary spectra to the reference spectra in a time-effective manner. In addition, MWFFT can reduce the systematic differences across spectrometers, which can increase the spectral consistency of different instruments as well as the compatibility with library search. Furthermore, this method can be used as an on-line standardization method across Raman instruments in the future.

TABLE 3 | Raman shifts (cm<sup>−1</sup>) used to calibrate standard samples.

<sup>a</sup>Values as reported by ASTM E1840-96.

# DISCUSSION

# Selection of Reference Spectrum

A wide application of the MWFFT method necessitates the selection of an appropriate reference spectrum. When a drug sample is measured on a secondary instrument to obtain an average spectrum for synchronization, its corresponding standard spectrum contained in the existing spectral library can certainly be used as the reference to correct shift displacements. However, when the spectral library does not contain the required reference spectrum, it would be preferable to use the reference spectra of existing drugs with the same generic name in the database in order to obtain a new matrix of shift points. As a result, the shift profile of the new drug can be calculated from the mode of each column of the matrix. Through this profile, one can obtain a new reference spectrum by shift correction, which can be subsequently applied. Otherwise, one can regard the new sample spectrum directly as a reference, and save it in the database for subsequent analysis. The entire procedure is depicted in **Figure 6**.

FIGURE 3 | Spectra of acetaminophen (A) and cyclohexane (C) acquired on two different instruments. Magnified spectral differences in (B,D) correspond to the shaded areas in (A,C), respectively.

# Avoiding Peak Detection Using the Moving Window Strategy

The existing peak detection methods, e.g., the wavelet and ridge line peak picking method, need idealized line profiles of the corresponding peaks in order to detect the displacements accurately, which is not practical for the spectroscopic analysis of real samples. Moreover, several parameters need to be specified with a priori knowledge, which largely influence the final results and can be difficult to implement in the C programming language. By contrast, the moving window strategy allows non-linear shifts between spectra to be estimated flexibly and without peak detection. With a window that moves from the beginning to the end of two spectra, one can obtain an N-dimensional matrix of shift points, where N is the number of data points in a Raman spectrum. In this case, the shift profile can be calculated from the mode of each column of the matrix, while the mean and median of the matrix can outline the paths of the shifts. The Raman shift profile of a metronidazole tablet is shown in **Figure 2A** using a green dotted line. It is apparent that the profiles in all regions are corrected by the MWFFT method, meaning that this method is sufficiently flexible for estimation of non-linear shifts between spectra.

lines indicate the reference spectra. The inset shows the full Raman spectra, whereas the shaded areas indicate the region magnified in the main panel.

# Advantages of the MWFFT Method

The MWFFT method has several distinctive advantages when compared to the traditional methods as a result of the continuity and redundancy of the moving window procedure. Usually, the direct evaluation of cross-correlation has a time complexity of O(N<sup>2</sup>) for a Raman spectrum of length N, which is time-consuming for spectra with thousands of data points. Fortunately, cross-correlation can be calculated much more efficiently by using FFT, which decreases the time complexity of cross-correlation from O(N<sup>2</sup>) to O(N log N). The use of the moving window strategy with FFT cross-correlation, with a window size w, leads to a time complexity of O(w log w) for one window. Accordingly, the time complexity of MWFFT is O(N w log w), where N represents the number of data points in a Raman spectrum.
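The combination of FFT cross-correlation and a moving window can be illustrated with a minimal Python sketch. It is not the authors' MWFFT implementation: the function names, the zero-padding, the mean subtraction, and the per-point voting scheme are assumptions made for illustration, and both spectra are assumed to have the same length, which must be at least the window size.

```python
import numpy as np


def fft_cross_correlation_shift(reference, sample):
    """Return the integer lag (in data points) of `sample` relative to `reference`."""
    reference = np.asarray(reference, dtype=float)
    sample = np.asarray(sample, dtype=float)
    n = len(reference)
    size = 2 * n  # zero-padding avoids wrap-around of the circular correlation
    ref_f = np.fft.rfft(reference - reference.mean(), size)
    sam_f = np.fft.rfft(sample - sample.mean(), size)
    corr = np.fft.irfft(sam_f * np.conj(ref_f), size)
    lag = int(np.argmax(corr))
    return lag - size if lag >= n else lag  # map upper half to negative lags


def moving_window_shift_profile(reference, sample, window):
    """Estimate one lag per window position, then take the mode for each point."""
    n = len(reference)
    votes = [[] for _ in range(n)]
    for start in range(n - window + 1):
        stop = start + window
        lag = fft_cross_correlation_shift(reference[start:stop], sample[start:stop])
        for i in range(start, stop):
            votes[i].append(lag)
    profile = np.zeros(n, dtype=int)
    for i, point_votes in enumerate(votes):
        values, counts = np.unique(point_votes, return_counts=True)
        profile[i] = values[np.argmax(counts)]  # mode of the estimates for point i
    return profile


# Synthetic example: one narrow band displaced by 7 data points.
x = np.arange(600)
reference = np.exp(-0.5 * ((x - 300) / 3.0) ** 2)
sample = np.exp(-0.5 * ((x - 307) / 3.0) ** 2)
print(fft_cross_correlation_shift(reference, sample))
print(moving_window_shift_profile(reference, sample, 100)[300])
```

Each window costs one FFT-based correlation of length proportional to w, and there are about N window positions, which is where the O(N w log w) estimate comes from. A positive lag means the band in the sample sits at a higher index than in the reference.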

The MWFFT method evaluates the shift of each point. In the moving window strategy, only one parameter needs to be taken into account, which makes this procedure simple and practical, as there is no need for chemical standards. By contrast, the SSS method requires the use of some chemical standards in order to locate the position of each peak, which is used in turn to obtain the corresponding shift displacement. After the shift of each point is estimated by MWFFT, the points in the spectrum are shifted according to their shifts by insertion and deletion. The present work introduced a change point, i.e., a discontinuity point in the shift profile. It is possible to see that the change points (**Figure 2A**), around which insertions and deletions occur frequently, are not in the peak region. Consequently, peak distortions can be effectively avoided, allowing the peak shape to be preserved during the synchronization procedure with MWFFT. Overall, the advantages associated with the use of non-linear shift estimation, insertion and deletion around change points, and shape preservation make MWFFT a flexible, rapid, practical, and precise method for correcting shifts in synchronization of Raman datasets.

# Evaluation of the Synchronization Quality

<sup>ac</sup>Mean Euclidean distances of acyclovir datasets; <sup>ca</sup>Mean Euclidean distances of captopril datasets.

Generally, Raman spectra will become more consistent, exhibit higher correlation coefficients, and be more similar to each other after a successful synchronization. The correlation coefficient can be used as a criterion for assessing the synchronization quality between the primary and secondary spectra. The synchronized spectra are commonly used to perform library-based searches and are further analyzed by chemometric algorithms. Usually, distance, and Euclidean distance in particular (Juday, 1993), can also be a good criterion for evaluating the quality of synchronization. Generally, the more similar the spectra are, the smaller the Euclidean distance between them, and vice versa. In this work, the mean Euclidean distance (D<sub>mean</sub>) is calculated as follows:

$$D\_{mean} = \frac{1}{n} \sum\_{i=1}^n \sqrt{\sum\_{j=1}^k \left(X\_{i,j}^p - X\_{i,j}^s\right)^2} \tag{4}$$

where the rows of matrix X correspond to observations (n), while the columns correspond to variables (k). X<sup>p</sup><sub>i</sub> and X<sup>s</sup><sub>i</sub> are the ith primary (reference) spectrum and secondary spectrum, respectively. It is worth mentioning at this stage that the normalization algorithm (Heraud et al., 2006) is used to scale the spectra within a similar range before calculating the distances. The results are summarized in **Table 4**. It is apparent that the mean Euclidean distances of the pharmaceutical datasets shift-corrected by SSS and MWFFT were considerably reduced when compared to the uncorrected ones. In addition, for the two datasets, MWFFT performed slightly better than the SSS method in terms of non-linear shift correction.
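As a small illustration of Equation (4), the following Python sketch computes the mean Euclidean distance between paired rows of two spectral matrices. The function name is hypothetical, and the spectra are assumed to have been normalized beforehand, since the normalization step is not specified in detail here.

```python
import numpy as np


def mean_euclidean_distance(primary, secondary):
    """Mean over observations of the Euclidean distance between paired spectra."""
    primary = np.asarray(primary, dtype=float)      # shape (n, k)
    secondary = np.asarray(secondary, dtype=float)  # shape (n, k)
    row_distances = np.sqrt(np.sum((primary - secondary) ** 2, axis=1))
    return float(np.mean(row_distances))


# Two observations with two variables each: the distances are 5 and 0.
print(mean_euclidean_distance([[0, 0], [1, 1]], [[3, 4], [1, 1]]))
```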

# CONCLUSIONS

Methods for the synchronization of spectra are indispensable for successful applications using different spectrometers. In the present work, we used the moving window strategy in combination with FFT cross-correlation to synchronize Raman spectra. This technique, abbreviated as MWFFT, was shown to eliminate non-linear shift displacements between Raman spectra accurately and effectively. Owing to the continuity of the moving window technique, non-linear shifts are corrected and shift profiles are obtained for each spectrum. In general, the use of the FFT cross-correlation methodology saves time and results in a significant improvement in speed. Moreover, this method can reduce or even remove systematic differences between Raman spectrometers (a dramatic increase in similarity from 0.87 to 0.95 after synchronization of the spectra between master (primary) and slave (secondary) spectrometers), and improve compatibility with the Raman spectral library. It is better than the SSS method in terms of correcting non-linear shifts and does not require the use of Raman shift standards. These advantages make MWFFT a promising shift correction method that addresses the demand for automated, flexible, rapid, and reliable data preprocessing, which plays an important role in Raman spectroscopy analysis using different spectrometers. Finally, MWFFT can be easily implemented with the C and C++ programming languages (available as an open source package at http://code.google.com/p/mwfft), which may be well-suited to solving the Raman shift displacements between spectrometers in the fields that rely heavily on the use of Raman spectrometers.

# ETHICS STATEMENT

The experimental protocol was approved by the Research Ethics Committee of The Second Medical University and Shanghai University of Traditional Chinese Medicine. The findings and conclusions in this article have not been formally disseminated by the State Food and Drug Administration and should not be construed to represent any agency determination or policy.

# AUTHOR CONTRIBUTIONS

HC designed and carried out experiments. Z-MZ and YL assisted with analyzing the results and discussions. HC and Z-MZ wrote the manuscript. FL reviewed and edited the manuscript. YC reviewed and checked our manuscript, gave constructive amendments to the text, and also approved the version to be published.

# FUNDING

This work is financially supported by the Ministry of Science and Technology of the People's Republic of China (2017YFF0210103, 2012YQ180132). The studies meet with the approval of the university's review board.


**Conflict of Interest Statement:** HC is employed by the company Shanghai Diracarta Biomedical Technology Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Chen, Liu, Lu, Cao and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Wavelet Transform-Based UV Spectroscopy for Pharmaceutical Analysis

### Erdal Dinç<sup>1</sup> \* and Zehra Yazan<sup>2</sup>

<sup>1</sup> Department of Analytical Chemistry, Faculty of Pharmacy, Ankara University, Ankara, Turkey, <sup>2</sup> Department of Chemistry, Ankara University Faculty of Science, Ankara, Turkey

In research and development laboratories, chemical or pharmaceutical analysis has been carried out by evaluating sample signals obtained from instruments. However, the qualitative and quantitative determination based on raw signals may not be always possible due to sample complexity. In such cases, there is a need for powerful signal processing methodologies that can effectively process raw signals to get correct results. Wavelet transform is one of the most indispensable and popular signal processing methods currently used for noise removal, background correction, differentiation, data smoothing and filtering, data compression and separation of overlapping signals etc. This review article describes the theoretical aspects of wavelet transform (i.e., discrete, continuous and fractional) and its characteristic applications in UV spectroscopic analysis of pharmaceuticals.

### Edited by:

Hoang Vu Dang, Hanoi University of Pharmacy, Vietnam

### Reviewed by:

Gaetano Ragno, Università della Calabria, Italy Joseph Dubrovkin, Western Galilee College, Israel

> \*Correspondence: Erdal Dinç dinc@ankara.edu.tr

### Specialty section:

This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry

Received: 27 April 2018 Accepted: 03 October 2018 Published: 26 October 2018

### Citation:

Dinç E and Yazan Z (2018) Wavelet Transform-Based UV Spectroscopy for Pharmaceutical Analysis. Front. Chem. 6:503. doi: 10.3389/fchem.2018.00503

Keywords: discrete wavelet transform, continuous wavelet transform, fractional wavelet transform, UV spectroscopy, pharmaceutical analysis

# INTRODUCTION

In experimental studies, instruments or devices can provide signals (or graphs) in different formats, e.g., spectra, chromatograms, voltammograms, and electropherograms. The analysis of chemicals and pharmaceuticals in various samples is based upon the utilization of the measured signals of substances of interest. In practice, such an analysis of a multicomponent mixture may not be possible without a prior separation step due to spectral overlapping. Therefore, high performance liquid chromatography (HPLC) is one of the most commonly used techniques for quantitative estimation in the quality control of raw materials and commercial products in laboratories. In some cases, however, chromatographic determination may not be feasible, owing both to the similar physicochemical behavior of the analytes and to the time and solvent consumed in establishing optimal experimental conditions.

In practice, UV spectroscopic methods are widely used in chemical and pharmaceutical analysis. Compared to chromatographic methods, spectroscopic methods provide rapid, low-cost analysis with acceptable results. However, multicomponent analysis may not be possible with a traditional UV spectrophotometric approach due to spectral interferences of both active and inactive ingredients in samples. In some cases, derivative spectrophotometry (O'Haver and Green, 1976; O'Haver, 1979; Levillain and Fompeydie, 1986; Ragno et al., 2006) and its improved versions, e.g., ratio spectra-derivative spectrophotometry (Salinas et al., 1990), ratio spectra-derivative spectrophotometry-zero crossing (Berzas Nevado et al., 1992; Dinç and Onur, 1998; Dinç, 1999), and double-divisor-ratio spectra-derivative spectrophotometry (Dinç and Onur, 1998; Dinç, 1999; Gohel et al., 2014; Shokry et al., 2014), could be used in place of the conventional UV spectrophotometric method for the analysis of binary and ternary mixtures without a separation step. However, these spectral approaches may not always yield successful results due to severely overlapping spectral bands, spectral noise and baseline variation. Additionally, high-order differentiation of spectra may lead to spectral deterioration, i.e., a decrease in signal intensity and signal-to-noise ratio. As a result, a number of mathematical manipulations (or signal processing methods) are often required to make instrumental signals more meaningful for analytical purposes.

Generally speaking, a transform (e.g., Fourier, Hilbert, short-time Fourier, Wigner distribution, Radon, or wavelet) is a very suitable technique for simplifying signals in the pre-treatment step. Fourier transform (FT) was the first such method used to modify chemical signals (Griffiths, 1977; Cooper, 1978; Griffiths and De Haseth, 1986; Ernst, 1989), through mathematical operations such as filtering and convolution/deconvolution. FT analysis can localize a signal very well in the frequency domain, but not in the time domain. In contrast, wavelet transform (WT) has the advantage of localizing signals in both the time (position) and frequency (scale) domains, making it a preferable mathematical tool to replace FT in the study of the local properties of a signal and in the removal of perturbations due to measurement error in spectral analysis. Nowadays, WT is one of the most commonly used signal analysis algorithms in the different fields of chemistry and engineering, providing alternative ways to resolve complex spectral bands or diverse types of signals.

For readers interested in learning the general theory of wavelets, more details can be found in the literature (Mallat, 1988; Chui, 1992; Daubechies, 1992; Newland, 1993; Byrnes et al., 1994; Chui et al., 1994; Vetterli and Kovačević, 1995; Strang and Nguyen, 1996).

In the signal smoothing and de-noising of spectral peaks, the elimination of noise requires the application of appropriate filters to the raw spectral data, such as the conventional Savitzky–Golay, Fourier and Kalman filters (Brown et al., 1994, 1996). The use of WT in signal analysis is two-fold: (i) to detect the singularities of a signal very likely caused by high-frequency noise and (ii) to separate the signal frequencies at different scales (Palavajjhala et al., 1994; Yan-Fang, 2013; Li and Chen, 2014). To illustrate this, Barclay et al. (1997) performed a comparative study of the de-noising and smoothing of a Gaussian peak by using wavelet, Fourier and Savitzky–Golay filters. In this context, smoothing eliminates high-frequency components of the transformed signal irrespective of their amplitudes, while de-noising eliminates small-amplitude components of the transformed signal irrespective of their frequencies.

Historically, the principal applications of WT in chemistry were first explored by Walczak and Massart (1997a), who presented an approach based on the application of wavelet packet transform (WPT) to the best-basis selection for the compression and denoising of a set of signals in the time-frequency domain. In their paper, the proposed technique was compared to Wickerhauser's approach (Wickerhauser, 1994) of fast approximate principal component analysis (PCA). These authors also published two more papers on the application of wavelets for data processing, i.e., the introduction of WPT for noise suppression and signal compression (Walczak and Massart, 1997b) and the use of WT for signal compression and denoising, image processing, data compression and multivariate data modeling in analytical chemistry (Walczak and Massart, 1997c). On the other hand, Alsberg et al. (1997) introduced WT to chemometricians by suggesting the short-time FT technique as a way to obtain information about frequency changes over time, as well as the WT for de-noising, baseline removal, determination of derivative zero crossings and signal compression. In 1997, the application of WT in chemical analysis was also confirmed by Wang et al. (1997) and Depczynski et al. (1997). To date, WT processing of different types of raw signals has been reported for liquid chromatography (Shao et al., 1997, 1998a,b,c), NMR spectroscopy (Neue, 1996; Barache et al., 1997), Raman spectra (Cai et al., 2001; Ehrentreich and Summchen, 2001), voltammetry (Chen et al., 1996; Fang and Chen, 1997; Zheng et al., 1998; Zhong et al., 1998; Aballe et al., 1999; Zheng and Mo, 1999), and IR and Raman spectroscopy (Shao and Zhuang, 2004; Hwang et al., 2005; Chalus et al., 2007; Jun-fang et al., 2007; Lai et al., 2011).
In this context, as in the various fields of mathematics and engineering, the implementation of WT in analytical chemistry and neighboring disciplines has become increasingly attractive as an alternative way to analyze complex mixtures previously unresolved by traditional analytical techniques.

With reference to the above-mentioned review, the aim of this paper is to describe the fundamentals of WT methodologies and their typical implementations for the UV spectroscopic analysis of pharmaceuticals.

# BRIEF HISTORY OF WAVELETS

In the literature, the first study was related to the Haar wavelet transform. This family was suggested by the mathematician Alfred Haar in 1909. However, the word "wavelet" was not used in Haar's time. In fact, the word "wavelet" was introduced by Morlet and the physicist Alex Grossman in 1984. After the first orthogonal Haar wavelet, the second orthogonal wavelet, known as the "Meyer wavelet," was formulated by the mathematician Yves Meyer in 1985. In 1988, Stephane Mallat and Meyer elaborated the concept of multiresolution. In the same year, a systematic method to construct compactly supported continuous wavelets was found by Ingrid Daubechies. Afterwards, Mallat proposed the fast wavelet transform. The emergence of this algorithm increased the implementations of the WT in the signal processing field.

In other words, the history of the wavelet families could be given in the following chronological order: Haar families in 1910, Morlet wavelet concept in 1981, Morlet and Grossman, "wavelet" in 1984, Meyer, "orthogonal wavelet" in 1985, Mallat and Meyer, multiresolution analysis in 1988, Daubechies, compact support orthogonal wavelet in 1988 and Mallat, fast wavelet transform in 1989 (c.f. Chun-Lin, 2010).

Basically, WT can be mainly classified into discrete wavelet transform (DWT) and continuous wavelet transform (CWT) in the signal analysis. The theory and implementations of wavelets in chemistry and related fields were well documented as review papers (Leung et al., 1998; Dinç and Baleanu, 2007b; Dinç, 2013; Li and Chen, 2014; Medhat, 2015) and reference books (Walczak and Massart, 2000a,b; Walczak and Radomski, 2000; Brereton, 2003, 2008; Chau et al., 2004; Danzer, 2007; Mark and Workman, 2007; Dubrovkin, 2018).

## WAVELET TRANSFORM ALGORITHMS

FT is based upon the decomposition of a signal into a set of trigonometric (sine and cosine) functions, i.e., FT represents a signal in terms of sinusoids. The FT representation of a signal, from the time domain to the frequency domain, is illustrated in **Figure 1**. To determine local information in a signal, it is necessary to use an analyzing function ψ having localization properties in both the frequency and time domains. This function ψ is called a wavelet, and it must be a wave of finite duration.

WT consists of the decomposition of a signal into a set of basis functions (wavelets). The basis functions of WT are small waves located at different times. In contrast to FT, WT gives information on both time and frequency, making it an alternative approach to eliminate the resolution problem in signal analysis.

By definition, wavelets are mathematical functions that convert the data into various coefficients, each of which is then analyzed at a resolution corresponding to its scale. The projection of a signal onto wavelet basis functions is called the wavelet transform. In other words, wavelets are generated from a mother wavelet ψ(x) by a scaling parameter a (dilatation) and a shifting parameter b (translation), i.e., the signal is expanded on a set of dilated and translated versions of the function

$$
\psi \left( \frac{x - b}{a} \right) \tag{1}
$$

The scaling parameter has a significant role for the variation of time and frequency resolution when processing the signal.

For a given mother wavelet ψ(x) (Daubechies, 1992), applying the scaling parameter and shifting parameter to ψ(x) yields a set of functions ψ<sub>a,b</sub>(x) given by the following equation.

$$\psi\_{a,b}\ (\mathbf{x}) = \frac{1}{\sqrt{|a|}} \ \psi \ \left(\frac{\mathbf{x} - b}{a}\right), \mathbf{a} \neq \mathbf{0}, \mathbf{a}, \mathbf{b} \in \mathbb{R} \tag{2}$$

where a is the scaling parameter, b is the shifting parameter and ℝ is the set of real numbers. The mathematical expression of the CWT of a function f(x) is given below

$$CWT\left\{f\left(\mathbf{x}\right);a,b\right\} = \int\_{-\infty}^{\infty} f(\mathbf{x}) \psi\_{a,b}^\*(\mathbf{x}) d\mathbf{x} = \left\langle f(\mathbf{x}), \psi\_{a,b} \right\rangle \tag{3}$$

Here the superscript <sup>∗</sup> denotes the complex conjugate and ⟨f(x), ψ<sub>a,b</sub>⟩ represents the inner product of the function f(x) with the wavelet function ψ<sub>a,b</sub>(x).
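Equations (2) and (3) can be evaluated numerically on a sampled signal. The Python sketch below is only an illustration: it assumes a uniformly spaced grid, approximates the integral by a Riemann sum, and uses an unnormalized Mexican hat function as the mother wavelet, a choice made here for convenience. The function names are hypothetical.

```python
import numpy as np


def mexican_hat(u):
    """Unnormalized Mexican hat mother wavelet (real-valued, zero mean)."""
    return (1.0 - u**2) * np.exp(-0.5 * u**2)


def cwt_coefficient(signal, x, a, b, mother=mexican_hat):
    """Approximate the inner product of Equation (3) for one pair (a, b)."""
    psi_ab = mother((x - b) / a) / np.sqrt(abs(a))  # Equation (2)
    step = x[1] - x[0]                               # uniform grid assumed
    return float(np.sum(signal * np.conj(psi_ab)) * step)


x = np.linspace(-50.0, 50.0, 10001)
constant = np.ones_like(x)
gaussian = np.exp(-0.5 * x**2)
print(cwt_coefficient(constant, x, a=2.0, b=0.0))  # close to zero
print(cwt_coefficient(gaussian, x, a=1.0, b=0.0))  # positive: wavelet matches the peak
```

Because the mother wavelet has zero mean, a constant signal gives a coefficient that is numerically zero, whereas a peak centered at b with a width comparable to a gives a large coefficient.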

The original signal can be completely reconstructed from a sampled version of the CWT. Usually, the sampling is chosen as

$$a = 2^{-m} \ \text{and} \ b = n2^{-m} \tag{4}$$

Here a and b denote the scale (dilatation) and translation parameters, respectively, and m and n are integers. The expression of DWT can be given as

$$DWT = \int\_{-\infty}^{+\infty} f(x) \, \psi\_{m,n}^\*(x) \, dx \tag{5}$$

where ψ<sub>m,n</sub>(x) = 2<sup>m/2</sup> ψ(2<sup>m</sup>x − n) is the dilated and translated version of the mother wavelet obtained by inserting Equation (4) into Equation (2). In the application of the DWT, only the outputs from the low-pass filter are processed further by WT. However, in the wavelet packet decomposition of signals, the outputs from both the low-pass and high-pass filters are processed by WT (Strang and Nguyen, 1996). Multiresolution decomposition with wavelets is an interesting topic for signal and image analysis (Mallat, 1988; Daubechies, 1992).
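The filter-bank view of the DWT can be sketched in a few lines of Python. The example assumes the Haar wavelet, chosen here only because its low-pass and high-pass filters are the simplest possible, and a signal whose length is divisible by 2 raised to the number of levels. The function names are hypothetical.

```python
import numpy as np


def haar_step(signal):
    """One decomposition level: low-pass (approximation) and high-pass (detail)."""
    signal = np.asarray(signal, dtype=float)
    even, odd = signal[0::2], signal[1::2]
    approximation = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approximation, detail


def haar_dwt(signal, levels):
    """Multi-level DWT: only the low-pass output is decomposed again."""
    approximation = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        approximation, detail = haar_step(approximation)
        details.append(detail)
    return approximation, details


approximation, details = haar_dwt([4, 4, 2, 2, 1, 1, 3, 3], levels=2)
print(approximation, details)
```

A wavelet packet decomposition would differ only in that `haar_step` would also be applied to each detail vector, not just to the approximation.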

Some families of wavelets with names and their coding list are illustrated in **Table 1**.

For signal processing, there is also another WT approach, i.e., the fractional wavelet transform (FWT), specifically designed to overcome the limitations of the WT and fractional FT (Blu and Unser, 2000, 2002; Unser and Blu, 2000). FWT is based on fractional B-splines. As is well known, splines played an important role in the early development of the theory of WT.

TABLE 1 | Families of wavelets with names and their coding list.


A B-spline is a generalization of the Bézier curve. Let a vector known as the knot vector be defined by T = {t<sub>0</sub>, t<sub>1</sub>, . . . , t<sub>m</sub>}, where T is a non-decreasing sequence with t<sub>i</sub> ∈ [0, 1], and define the control points P<sub>0</sub>, . . . , P<sub>n</sub>. The degree is defined as p = m − n − 1, and the knots t<sub>p+1</sub>, . . . , t<sub>m−p−1</sub> are called internal knots. The basis functions are defined as follows:

$$N\_{i,0}(t) = \begin{cases} 1, & \text{if } t\_i \le t < t\_{i+1} \text{ and } t\_i < t\_{i+1} \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

and

$$N\_{i,p}(t) = \frac{t - t\_i}{t\_{i+p} - t\_i} N\_{i,p-1}(t) + \frac{t\_{i+p+1} - t}{t\_{i+p+1} - t\_{i+1}} N\_{i+1,p-1}(t) \tag{7}$$

Therefore, the curve defined by

$$C(t) = \sum\_{i=0}^{n} P\_i \, N\_{i, \, p} \, (t) \tag{8}$$

is a B-spline
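The recursion in Equations (6)–(8) translates directly into code. The following Python sketch is a plain recursive implementation written for clarity rather than speed. The function names are hypothetical, scalar control points are used for simplicity, and terms with a zero denominator (repeated knots) are taken to be zero, which is the usual convention.

```python
def bspline_basis(i, p, t, knots):
    """Basis function N_{i,p}(t) from the recursion of Equations (6) and (7)."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + p] - knots[i]
    right_den = knots[i + p + 1] - knots[i + 1]
    left = 0.0
    if left_den != 0:  # skip 0/0 terms that arise from repeated knots
        left = (t - knots[i]) / left_den * bspline_basis(i, p - 1, t, knots)
    right = 0.0
    if right_den != 0:
        right = (knots[i + p + 1] - t) / right_den * bspline_basis(i + 1, p - 1, t, knots)
    return left + right


def bspline_curve(t, control_points, p, knots):
    """Curve value C(t) of Equation (8) for scalar control points."""
    return sum(c * bspline_basis(i, p, t, knots) for i, c in enumerate(control_points))


# Quadratic example: n + 1 = 4 control points, m + 1 = 7 knots, p = m - n - 1 = 2.
knots = [0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0]
control_points = [1.0, 2.0, 3.0, 4.0]
print(bspline_curve(0.25, control_points, 2, knots))
```

With this clamped knot vector the basis functions sum to one inside the interval, so the curve is a weighted average of the control points.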

Fractional B-spline: The fractional B-spline is defined as

$$\beta\_+^{\alpha}(x) = \frac{1}{\Gamma\left(\alpha+1\right)} \sum\_{k=0}^{+\infty} (-1)^k \binom{\alpha+1}{k} (x-k)\_+^{\alpha} \tag{9}$$

where Euler's Gamma function is obtained by

$$
\Gamma(\alpha+1) = \int\_0^{+\infty} x^{\alpha} \, e^{-x} \, dx \tag{10}
$$

and

$$(x - k)\_+^{\alpha} = \max \left( x - k, \ 0 \right)^{\alpha} \tag{11}$$

The forward fractional finite difference operator of order α is defined as

$$
\Delta\_+^{\alpha} f(x) = \sum\_{k=0}^{+\infty} (-1)^k \binom{\alpha}{k} f(x - k), \tag{12}
$$

where

$$\binom{\alpha}{k} = \frac{\Gamma(\alpha+1)}{\Gamma\left(k+1\right) \Gamma\left(\alpha-k+1\right)}\tag{13}$$

Fractional B-splines fulfill the convolution property, namely

$$
\beta\_+^{\alpha\_1} \* \beta\_+^{\alpha\_2} = \beta\_+^{\alpha\_1 + \alpha\_2 + 1} \tag{14}
$$

The centered fractional B-spline of degree α is defined as

$$\beta\_\*^{\alpha}(x) = \frac{1}{\Gamma(\alpha+1)} \sum\_{k \in \mathbb{Z}} (-1)^k \left| \binom{\alpha+1}{k} \right| \left| x - k \right|\_\*^{\alpha} \tag{15}$$

where

$$|x|\_\*^{\alpha} = \begin{cases} \dfrac{|x|^{\alpha}}{-2\sin\left(\frac{\pi}{2}\alpha\right)}, & \alpha \text{ not even} \\ \dfrac{x^{2n}\log x}{(-1)^{1+n}\pi}, & \alpha = 2n \text{ even} \end{cases} \tag{16}$$

The fractional B-spline wavelet is defined as

$$\psi\_+^{\alpha} \left( \frac{x}{2} \right) = \sum\_{k \in \mathbb{Z}} \frac{(-1)^k}{2^{\alpha}} \sum\_{l \in \mathbb{Z}} \binom{\alpha+1}{l} \beta\_\*^{2\alpha+1} \left( l+k-1 \right) \beta\_+^{\alpha} \left( x - k \right) \tag{17}$$

We mention that the fractional spline wavelets of degree α obey the following vanishing-moment property

$$\int\_{-\infty}^{+\infty} x^n \, \psi\_+^{\alpha} \left( x \right) dx = 0, \quad n = 0, \dots, \lceil \alpha \rceil \tag{18}$$

and the Fourier transform fulfills the following relations

$$
\hat{\psi}\_+^{\alpha}(\varpi) = \mathcal{C} \left( j\varpi \right)^{\alpha+1}, \text{ as } \varpi \to 0 \tag{19}
$$

and

$$\hat{\psi}\_{\*}^{\alpha}(\varpi) = \mathcal{C} \,(j\varpi)^{\alpha+1}, \,\, as \,\, \varpi \to 0 \tag{20}$$

where ψ̂<sup>α</sup><sub>+</sub>(ϖ) is symmetric. The fractional spline wavelet behaves like a fractional derivative operator.

# STRATEGIES IN CWT APPLICATIONS TO UV SPECTROSCOPY ANALYSIS OF MULTICOMPONENT MIXTURES

Over the past 15 years, applications of CWT in chemistry, especially in combination with other mathematical methods, have shown that WT is a useful algorithm for the quantitative UV analysis of pharmaceuticals. Four different models [i.e., continuous wavelet transform-zero crossing (CWT-ZC), ratio spectra-continuous wavelet transform (RS-CWT), ratio spectra-continuous wavelet transform-zero crossing (RS-CWT-ZC), and double divisor ratio spectra-continuous wavelet transform (DDRS-CWT)] have been described for the implementation of CWT to UV spectroscopic data, with the aim of resolving overlapping spectra to quantify drugs in different types of samples. The modeling of these CWT-UV spectroscopic approaches is detailed below. Fundamentally, these approaches can be successfully applied to the UV spectroscopic analysis of binary and ternary mixtures, provided that the law of additivity of absorbance is obeyed.

# CONTINUOUS WAVELET TRANSFORM-ZERO CROSSING

The application of CWT-ZC approach to UV spectroscopic signals was first proposed by Dinç and Baleanu (2003a).

If a mixture of two analytes (M and N) is considered (see **Figure 2A**) and the absorbance of this binary mixture is measured at λ<sup>i</sup> , we can have the following equation (Charlotte Grinter and Threlfall, 1992):

$$\mathbf{A}\_{\rm mix,\,\lambda i} = \alpha\_{\mathbf{M},\,\lambda i} \mathbf{C}\_{\mathbf{M}} + \beta\_{\mathbf{N},\,\lambda i} \mathbf{C}\_{\mathbf{N}} \tag{21}$$

where A<sub>mix,λi</sub> is the absorbance of the binary mixture at wavelength λ<sub>i</sub>, and the coefficients α<sub>M,λi</sub> and β<sub>N,λi</sub> are the absorptivities of M and N, respectively. C<sub>M</sub> and C<sub>N</sub> represent the concentrations of M and N, respectively.

If CWT is applied to Equation (21), the following function is obtained, because the transform is linear:

$$
\psi\_{(a,b),\mathrm{mix},\lambda i} = \psi\_{(a,b),\mathrm{M},\lambda i} C\_{\mathrm{M}} + \psi\_{(a,b),\mathrm{N},\lambda i} C\_{\mathrm{N}} \tag{22}
$$

If ψ<sub>(a,b),N,λi</sub>C<sub>N</sub> = 0, i.e., at a wavelength where the CWT signal of N crosses zero, then we obtain the following equation

$$
\psi\_{(a,b),\mathrm{mix},\lambda i} = \psi\_{(a,b),\mathrm{M},\lambda i} C\_{\mathrm{M}} \tag{23}
$$

Equation (23) shows that the CWT amplitudes (ψ<sub>(a,b),M,λi</sub>C<sub>M</sub>) of M in the binary mixture are dependent only on C<sub>M</sub> regardless of C<sub>N</sub> (see **Figure 2B**).
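The zero-crossing argument can be checked numerically with synthetic data. The Python sketch below is only an illustration under stated assumptions: the two "spectra" are hypothetical Gaussian bands, the mother wavelet is an unnormalized Mexican hat, and a single scale is used. None of these values come from the cited applications.

```python
import numpy as np


def mexican_hat(u):
    return (1.0 - u**2) * np.exp(-0.5 * u**2)


def cwt_curve(signal, wavelengths, a):
    """CWT at one scale a, evaluated at every shift b on the wavelength grid."""
    step = wavelengths[1] - wavelengths[0]
    u = (wavelengths[None, :] - wavelengths[:, None]) / a  # rows: b, columns: x
    return (mexican_hat(u) @ signal) * step / np.sqrt(a)


def gaussian_band(wavelengths, center, width):
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)


wavelengths = np.arange(200.0, 400.5, 0.5)
unit_m = gaussian_band(wavelengths, 255.0, 6.0)  # hypothetical unit spectrum of M
unit_n = gaussian_band(wavelengths, 280.0, 6.0)  # hypothetical unit spectrum of N
scale = 8.0
cwt_m = cwt_curve(unit_m, wavelengths, scale)
cwt_n = cwt_curve(unit_n, wavelengths, scale)

# Locate a zero crossing of the CWT of N inside its own band.
band = np.abs(wavelengths - 280.0) <= 15.0
zc = np.flatnonzero(band)[np.argmin(np.abs(cwt_n[band]))]

for c_n in (1.0, 5.0):
    mixture = 2.0 * unit_m + c_n * unit_n  # Equation (21) with C_M = 2
    amplitude = cwt_curve(mixture, wavelengths, scale)[zc]
    print(wavelengths[zc], amplitude / cwt_m[zc])  # recovered C_M for each C_N
```

At the selected wavelength the ratio of the mixture amplitude to the unit amplitude of M returns C<sub>M</sub> for both values of C<sub>N</sub>, which is the content of Equation (23). In practice a calibration line built from standards of M at that wavelength plays the role of `cwt_m[zc]`.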

# RATIO SPECTRA-CONTINUOUS WAVELET TRANSFORM

Apart from CWT-ZC approach, overlapping spectral bands in a binary mixture could be solved by the application of a combined hybrid approach i.e., RS-CWT (Dinç and Baleanu, 2004a,c).

The absorption spectra of compounds M and N and of their mixture are shown in **Figure 3A**. When Equation (21) is divided by the standard spectrum (A<sub>N,λi</sub> = β<sub>N,λi</sub>C<sup>o</sup><sub>N</sub>) of one of the compounds in the binary mixture, it becomes

$$\frac{A\_{\mathrm{mix},\lambda i}}{\beta\_{\mathrm{N},\lambda i} C\_{\mathrm{N}}^{o}} = \frac{\alpha\_{\mathrm{M},\lambda i} C\_{\mathrm{M}}}{\beta\_{\mathrm{N},\lambda i} C\_{\mathrm{N}}^{o}} + \frac{\beta\_{\mathrm{N},\lambda i} C\_{\mathrm{N}}}{\beta\_{\mathrm{N},\lambda i} C\_{\mathrm{N}}^{o}} \tag{24}$$

**Figure 3B** shows the ratio spectra of analytes and their binary mixture. If CWT is applied to Equation (24), the following equation can be obtained

$$CWT\left[\frac{A\_{\mathrm{mix},\lambda i}}{\beta\_{\mathrm{N},\lambda i} C\_{\mathrm{N}}^{o}}\right] = CWT\left[\frac{\alpha\_{\mathrm{M},\lambda i}}{\beta\_{\mathrm{N},\lambda i}}\right] \frac{C\_{\mathrm{M}}}{C\_{\mathrm{N}}^{o}} + CWT\left[\frac{\beta\_{\mathrm{N},\lambda i}}{\beta\_{\mathrm{N},\lambda i}}\right] \frac{C\_{\mathrm{N}}}{C\_{\mathrm{N}}^{o}} \tag{25}$$

Since the ratio β<sub>N,λi</sub>/β<sub>N,λi</sub> is constant, its CWT vanishes, i.e., CWT[β<sub>N,λi</sub>/β<sub>N,λi</sub>] C<sub>N</sub>/C<sup>o</sup><sub>N</sub> = 0, and we obtain

$$CWT\left[\frac{A\_{\mathrm{mix},\lambda i}}{\beta\_{\mathrm{N},\lambda i} C\_{\mathrm{N}}^{o}}\right] = CWT\left[\frac{\alpha\_{\mathrm{M},\lambda i}}{\beta\_{\mathrm{N},\lambda i}}\right] \frac{C\_{\mathrm{M}}}{C\_{\mathrm{N}}^{o}} \tag{26}$$

The ratio-CWT amplitudes of the binary mixture given in Equation (26) depend only on C<sub>M</sub> and C<sup>o</sup><sub>N</sub> regardless of C<sub>N</sub> (e.g., see **Figure 3C**).
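The key step of the ratio spectra approach, namely that the CWT of the constant term vanishes, can be verified on synthetic data. The Python sketch below uses hypothetical Gaussian absorptivity curves, an arbitrary divisor concentration, and an unnormalized Mexican hat wavelet. The shift b is chosen far enough from the ends of the wavelength range for the wavelet to fit entirely inside it.

```python
import numpy as np


def mexican_hat(u):
    return (1.0 - u**2) * np.exp(-0.5 * u**2)


def cwt_at(signal, wavelengths, a, b):
    """CWT coefficient of a sampled signal for one scale a and one shift b."""
    step = wavelengths[1] - wavelengths[0]
    psi = mexican_hat((wavelengths - b) / a) / np.sqrt(a)
    return float(np.sum(signal * psi) * step)


wavelengths = np.arange(230.0, 310.5, 0.5)
alpha = np.exp(-0.5 * ((wavelengths - 260.0) / 20.0) ** 2)  # hypothetical absorptivity of M
beta = np.exp(-0.5 * ((wavelengths - 280.0) / 25.0) ** 2)   # hypothetical absorptivity of N
c_n_standard = 2.0                                           # assumed divisor concentration
divisor = beta * c_n_standard


def ratio_cwt(c_m, c_n, a=4.0, b=270.0):
    """CWT amplitude of the ratio spectrum of a mixture, as in Equation (25)."""
    mixture = alpha * c_m + beta * c_n
    return cwt_at(mixture / divisor, wavelengths, a, b)


print(ratio_cwt(3.0, 1.0), ratio_cwt(3.0, 5.0))  # same amplitude for both C_N
print(ratio_cwt(6.0, 1.0) / ratio_cwt(3.0, 1.0))  # amplitude scales with C_M
```

Changing C<sub>N</sub> leaves the amplitude unchanged, while doubling C<sub>M</sub> doubles it, as Equation (26) predicts. Close to the ends of the wavelength range the constant term no longer cancels exactly, so working wavelengths should be chosen in the interior.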

# RATIO SPECTRA-CONTINUOUS WAVELET TRANSFORM-ZERO CROSSING

In RS-CWT-ZC approach (Dinç et al., 2005a), if a mixture of three analytes (X, Y, and Z) is considered and the absorbance of this ternary mixture is measured at λ<sup>i</sup> , the following mathematical expression (Charlotte Grinter and Threlfall, 1992) would be given

$$\mathbf{A}\_{\text{mix},\lambda i} = \alpha\_{\text{X},\lambda i} \mathbf{C}\_{\text{X}} + \beta\_{\text{Y},\lambda i} \mathbf{C}\_{\text{Y}} + \gamma\_{\text{Z},\lambda i} \mathbf{C}\_{\text{Z}} \tag{27}$$

where A<sub>mix,λi</sub> is the absorbance of the ternary mixture at wavelength λ<sub>i</sub>, and the coefficients α<sub>X,λi</sub>, β<sub>Y,λi</sub>, and γ<sub>Z,λi</sub> denote the absorptivities of X, Y, and Z, respectively. C<sub>X</sub>, C<sub>Y</sub>, and C<sub>Z</sub> represent the concentrations of X, Y, and Z, respectively.

If Equation (27) is divided by the spectrum of a standard solution (C<sup>o</sup><sub>X</sub>) of one of the compounds in the ternary mixture, we have the following equation:

$$\frac{A\_{\mathrm{mix},\lambda i}}{\alpha\_{\mathrm{X},\lambda i} C\_{\mathrm{X}}^{o}} = \frac{\alpha\_{\mathrm{X},\lambda i} C\_{\mathrm{X}}}{\alpha\_{\mathrm{X},\lambda i} C\_{\mathrm{X}}^{o}} + \frac{\beta\_{\mathrm{Y},\lambda i} C\_{\mathrm{Y}}}{\alpha\_{\mathrm{X},\lambda i} C\_{\mathrm{X}}^{o}} + \frac{\gamma\_{\mathrm{Z},\lambda i} C\_{\mathrm{Z}}}{\alpha\_{\mathrm{X},\lambda i} C\_{\mathrm{X}}^{o}} \tag{28}$$

If CWT is applied to Equation (28), the first term on the right-hand side, which reduces to the constant C<sub>X</sub>/C<sup>o</sup><sub>X</sub>, vanishes and the following equation can be obtained

$$CWT\left[\frac{A\_{\mathrm{mix},\lambda i}}{\alpha\_{\mathrm{X},\lambda i} C\_{\mathrm{X}}^{o}}\right] = CWT\left[\frac{\beta\_{\mathrm{Y},\lambda i} C\_{\mathrm{Y}}}{\alpha\_{\mathrm{X},\lambda i} C\_{\mathrm{X}}^{o}}\right] + CWT\left[\frac{\gamma\_{\mathrm{Z},\lambda i} C\_{\mathrm{Z}}}{\alpha\_{\mathrm{X},\lambda i} C\_{\mathrm{X}}^{o}}\right] \tag{29}$$

Equation (29) indicates that, at the zero-crossing points of the CWT term of Y, the CWT amplitudes of the ratio spectra of the ternary mixture are dependent only on C<sub>Z</sub> and C<sup>o</sup><sub>X</sub> regardless of the concentrations of the other compounds.

# DOUBLE DIVISOR RATIO SPECTRA-CONTINUOUS WAVELET TRANSFORM

In addition to RS-CWT-ZC approach, the spectral resolution of ternary mixtures could be effectively done by DDRS-CWT approach (Dinç and Baleanu, 2008a) as follows.

When two compounds in the ternary mixture is used as a double divisor, we have

$$A^{o}_{\text{mix},\lambda_i} = \alpha_{X,\lambda_i} C_X^{o} + \beta_{Y,\lambda_i} C_Y^{o} \tag{30}$$

Dividing Equation (27) by Equation (30), we obtain

$$\begin{split} \frac{A_{\text{mix},\lambda_i}}{\alpha_{X,\lambda_i} C_X^{o} + \beta_{Y,\lambda_i} C_Y^{o}} &= \frac{\alpha_{X,\lambda_i} C_X}{\alpha_{X,\lambda_i} C_X^{o} + \beta_{Y,\lambda_i} C_Y^{o}} \\ &\quad + \frac{\beta_{Y,\lambda_i} C_Y}{\alpha_{X,\lambda_i} C_X^{o} + \beta_{Y,\lambda_i} C_Y^{o}} + \frac{\gamma_{Z,\lambda_i} C_Z}{\alpha_{X,\lambda_i} C_X^{o} + \beta_{Y,\lambda_i} C_Y^{o}} \end{split} \tag{31}$$

TABLE 2 | Applications of the continuous wavelet transform-zero crossing technique to UV spectroscopic analysis of pharmaceuticals.


Equation (31) can be simplified to

$$\frac{A_{\text{mix},\lambda_i}}{\alpha_{X,\lambda_i} C_X^{o} + \beta_{Y,\lambda_i} C_Y^{o}} = k + \frac{\gamma_{Z,\lambda_i} C_Z}{\alpha_{X,\lambda_i} C_X^{o} + \beta_{Y,\lambda_i} C_Y^{o}} \tag{32}$$

where k = (α<sub>X,λi</sub>C<sub>X</sub> + β<sub>Y,λi</sub>C<sub>Y</sub>) / (α<sub>X,λi</sub>C<sub>X</sub><sup>o</sup> + β<sub>Y,λi</sub>C<sub>Y</sub><sup>o</sup>) represents a constant, for a given concentration range, with respect to λ<sub>i</sub> in a certain wavelength region or at a certain wavelength point.

A typical case is when C<sub>X</sub><sup>o</sup> and C<sub>Y</sub><sup>o</sup> are the same or very close to each other, namely C<sub>X</sub><sup>o</sup> = C<sub>Y</sub><sup>o</sup> or C<sub>X</sub><sup>o</sup> ≅ C<sub>Y</sub><sup>o</sup>. Therefore, we obtain

$$
\alpha_{X,\lambda_i} C_X^{o} + \beta_{Y,\lambda_i} C_Y^{o} = C_X^{o} \left( \alpha_{X,\lambda_i} + \beta_{Y,\lambda_i} \right) \tag{33}
$$

and Equation (32) can be written as

$$\frac{A_{\text{mix},\lambda_i}}{\alpha_{X,\lambda_i} C_X^{o} + \beta_{Y,\lambda_i} C_Y^{o}} = k + \frac{\gamma_{Z,\lambda_i} C_Z}{C_X^{o} \left(\alpha_{X,\lambda_i} + \beta_{Y,\lambda_i}\right)} \tag{34}$$

After applying CWT to Equation (34), in which the constant k vanishes, we have

$$\text{CWT}_{(a,b)}\left(\frac{A_{\text{mix},\lambda_i}}{\alpha_{X,\lambda_i} + \beta_{Y,\lambda_i}}\right)\frac{1}{C_X^{o}} = \text{CWT}_{(a,b)}\left(\frac{\gamma_{Z,\lambda_i} C_Z}{\alpha_{X,\lambda_i} + \beta_{Y,\lambda_i}}\right)\frac{1}{C_X^{o}} \tag{35}$$

or

$$\text{CWT}_{(a,b)}\left(\frac{A_{\text{mix},\lambda_i}}{\alpha_{X,\lambda_i} + \beta_{Y,\lambda_i}}\right) = \text{CWT}_{(a,b)}\left(\frac{\gamma_{Z,\lambda_i}}{\alpha_{X,\lambda_i} + \beta_{Y,\lambda_i}}\right) C_Z \tag{36}$$

In Equation (36), C<sub>Z</sub> is proportional to the coefficients CWT<sub>(a,b)</sub>[A<sub>mix,λi</sub> / (α<sub>X,λi</sub> + β<sub>Y,λi</sub>)] at λ<sub>i</sub>. If this procedure is applied separately to pure Z and to its ternary mixture, the CWT<sub>(a,b)</sub> coefficients coincide at some characteristic wavelength point or region, independently of both C<sub>X</sub> and C<sub>Y</sub>.
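The proportionality stated by Equation (36) can be illustrated with a short numerical sketch. Everything specific here is an assumption made for the example: the synthetic Gaussian bands, the single-scale Mexican-hat transform, and in particular the choice C<sub>X</sub> = C<sub>Y</sub> in the mixtures, which makes k exactly constant over the whole wavelength axis (in real data k is only approximately constant in selected regions).

```python
import numpy as np

def band(wl, center, width):
    """Synthetic absorptivity profile (hypothetical band shape)."""
    return np.exp(-0.5 * ((wl - center) / width) ** 2)

def mexican_hat(scale, half_width=4.0):
    """Discretised Mexican-hat wavelet, shifted so that its samples sum to zero."""
    n = int(half_width * scale)
    t = np.arange(-n, n + 1) / scale
    psi = (1.0 - t**2) * np.exp(-0.5 * t**2)
    return psi - psi.mean()

def cwt_single_scale(signal, scale):
    """Wavelet coefficients at one scale; 'valid' mode drops the edge region."""
    return np.convolve(signal, mexican_hat(scale)[::-1], mode="valid")

wl = np.linspace(200.0, 320.0, 601)
alpha = band(wl, 240.0, 18.0) + 0.05   # X (offset keeps the divisor non-zero)
beta = band(wl, 255.0, 15.0) + 0.05    # Y
gamma = band(wl, 270.0, 20.0)          # Z

divisor = alpha + beta                 # double divisor after factoring out C_X^o, Equation (33)
scale = 20

def ddrs_cwt(absorbance):
    """CWT of the double divisor ratio spectrum."""
    return cwt_single_scale(absorbance / divisor, scale)

reference = ddrs_cwt(gamma)            # pure Z at unit concentration

c_xy = 3.0                             # assumed: C_X = C_Y in every mixture
coeffs = {c_z: ddrs_cwt(alpha * c_xy + beta * c_xy + gamma * c_z)
          for c_z in (2.0, 4.0, 8.0)}

# Least-squares slope of each mixture profile against the pure-Z reference.
estimates = {c_z: float(coef @ reference / (reference @ reference))
             for c_z, coef in coeffs.items()}
```

Under these assumptions each estimated slope reproduces the C<sub>Z</sub> used to build the mixture, independently of the amounts of X and Y. In practice a calibration line would be built at a selected wavelength point rather than over the whole profile.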

# WAVELET TRANSFORM-BASED UV SPECTROSCOPIC ANALYSIS OF PHARMACEUTICALS

Typical applications of CWT and FWT algorithms to the UV spectroscopic analysis of pharmaceuticals are displayed in **Tables 2**–**5**. It is worth mentioning that WT can be applied on its own to raw spectra and ratio spectra (as specified above) or used in a hybrid approach (FWT-derivative, FWT-CWT-zero crossing, WT combined with multivariate calibration) for the simultaneous determination of analytes in pharmaceutical binary and ternary mixtures. In the reported studies, wavelet analysis of UV spectroscopic data was performed using the Wavelet Toolbox and m-files in MATLAB. The numerous works by Dinç and co-workers have clearly highlighted the success of WT-based UV spectroscopic analysis for multicomponent synthetic mixtures and for veterinary and pharmaceutical dosage forms, as well as for different types of test (e.g., assay, in vitro dissolution, stability indicating). Most studies proved it to be suitable for the routine analysis of dosage forms, with good precision and accuracy comparable to those of HPLC.

# CONCLUSIONS

From the point of view of UV spectroscopic analysis of multicomponent mixtures, CWT-based UV spectroscopic methods have outperformed both conventional and derivative UV spectroscopy in the spectral resolution of binary and ternary mixtures. Nevertheless, wavelet analysis may not always have sufficient power to resolve the overlapping spectra of analytes in samples, owing to the similarity of molecular structures and signal frequencies in some cases. It may not give desirable results for a complex mixture containing more than three compounds and/or a significant difference in the ratios of active ingredients. In such a case, the use of WT coupled with chemometric PLS and PCR calibrations is advisable. Undoubtedly, however, wavelets can still be used as a mathematical prism for signal analysis because they offer many possibilities, such as baseline correction, noise removal and resolution of overlapping peaks, when the frequencies of the analyzed components are significantly different from each other.

TABLE 4 | Applications of the ratio spectra-continuous wavelet transform and ratio spectra-continuous wavelet transform-zero crossing approaches to UV spectroscopic analysis of pharmaceuticals.

# AUTHOR CONTRIBUTIONS

ED planned and wrote the review paper. ZY carried out the literature review, collection, editing, and format arrangement.

## REFERENCES

of the fractional fourier transform bands for resolving two component mixture. Signal Image Video P. 9, 801–807. doi: 10.1007/s11760-013-0503-9


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Dinç and Yazan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Chemometric Methods for Spectroscopy-Based Pharmaceutical Analysis

Alessandra Biancolillo and Federico Marini\*

*Department of Chemistry, University of Rome La Sapienza, Rome, Italy*

Spectroscopy is widely used to characterize pharmaceutical products or processes, especially because it is rapid, cheap, non-invasive/non-destructive and applicable both off-line and in-/at-/on-line. Spectroscopic techniques produce profiles containing a large amount of information, which can profitably be exploited through the use of multivariate mathematical and statistical (chemometric) techniques. The present paper aims at providing a brief overview of the different chemometric approaches applicable in the context of spectroscopy-based pharmaceutical analysis, discussing both the unsupervised exploration of the collected data and the possibility of building predictive models for both quantitative (calibration) and qualitative (classification) responses.

### Edited by:

*Cosimino Malitesta, University of Salento, Italy*

### Reviewed by:

*Daniel Cozzolino, Central Queensland University, Australia Andreia Michelle Smith-Moritz, University of California, Davis, United States*

\*Correspondence: *Federico Marini federico.marini@uniroma1.it*

### Specialty section:

*This article was submitted to Analytical Chemistry, a section of the journal Frontiers in Chemistry*

Received: *07 July 2018* Accepted: *05 November 2018* Published: *21 November 2018*

### Citation:

*Biancolillo A and Marini F (2018) Chemometric Methods for Spectroscopy-Based Pharmaceutical Analysis. Front. Chem. 6:576. doi: 10.3389/fchem.2018.00576*

Keywords: spectroscopy, chemometrics and statistics, principal component analysis (PCA), partial least squares (PLS), classification, partial least squares discriminant analysis (PLS-DA), soft independent modeling of class analogies (SIMCA), pharmaceutical quality control

# INTRODUCTION

Quality control of pharmaceutical products is undoubtedly an important and widely debated topic. Hence, in the literature, various methods have been proposed to check the quality of medicines, either qualitative (e.g., for the identification of an active pharmaceutical ingredient, API; Blanco et al., 2000; Herkert et al., 2001; Alvarenga et al., 2008) or quantitative (quantification of the API; Blanco et al., 2000; Yao et al., 2007; Cruz Sarraguça and Almeida Lopes, 2009), involving either destructive or non-invasive online techniques. Recently, due to the benefits they bring, several non-destructive methodologies based on spectroscopic techniques (mainly near-infrared, NIR) combined with chemometric tools have been proposed for pharmaceutical quality checks (Chen et al., 2018; Rodionova et al., 2018).

Despite the development of analytical methodologies and the commitments of national and supranational entities to regulate pharmaceutical quality control, substandard and counterfeit medicines are still a major problem all over the world.

# Chemometrics as Tool for Fraud/Adulteration Detection

Poor-quality pharmaceuticals can be found on the market for two main reasons: low production standards (mainly leading to substandard medicines) and fraud attempts. Counterfeit drugs may involve different kinds of fraud or adulteration; for instance, they could contain no active pharmaceutical ingredient (API), a different API from the one declared, or a different (lower) API strength. As mentioned above, several methodologies have been proposed in order to detect substandard/counterfeit pharmaceuticals; among these, a major role is played by those based on the application of spectroscopic techniques in combination with different chemometric methods. The relevance of these methodologies is due to the fact that spectroscopy (in particular, NIR) combined with exploratory data analysis, classification and regression methods can lead to effective, high-performing, fast, nondestructive and, sometimes, online methods for checking the quality of pharmaceuticals and their compliance with production and/or pharmacopeia standards. Nevertheless, the chemometric tools available for handling spectroscopic data (and, of course, not only those) are numerous, and there is plenty of room for their misapplication (Kjeldahl and Bro, 2010). As a consequence, the aim of the present paper is to report and critically discuss some of the chemometric methods typically applied for pharmaceutical analysis, together with an essential description of the figures of merit which allow evaluating the quality of the corresponding models.

# EXPLORATORY DATA ANALYSIS

In most studies on the characterization of pharmaceutical samples for quality control, verification of compliance and identification/detection of counterfeiting, fraud or adulteration, experimental signals (usually in the form of some sort of fingerprint) are collected on a series of specimens. These constitute the data the chemometric models operate on. These data are usually arranged in the form of a matrix **X**, having as many rows as the number of samples and as many columns as the number of measured variables. Accordingly, assuming that samples are spectroscopically characterized by collecting an absorption (or reflection/transmission) profile (e.g., in the infrared region), each row of the matrix corresponds to the whole spectrum of a particular sample, whereas each column represents the absorbance (or reflectance/transmittance) of all the individuals at a particular wavenumber. This equivalence between the experimental profiles and their matrix representation is graphically reported in **Figure 1**.

Once the data have been collected, exploratory data analysis represents the first step of any chemometric processing, as it allows "to summarize the main characteristics of data in an easy-to-understand form, often with visual graphs, without using a statistical model or having formulated a hypothesis" (Tukey, 1977). Exploratory data analysis provides an overall view of the system under study, making it possible to detect similarities/dissimilarities among samples, to identify the presence of clusters or, in general, systematic trends, to discover which variables are relevant to describe the system and which could, in principle, be discarded, and to detect possible outlying, anomalous or, at least, suspicious samples (if present). As is also evident from the definition reported above, in the context of exploratory data analysis a key role is played by the possibility of capturing the main structure of the data in a series of representative plots, through appropriate display techniques. Indeed, considering a general data matrix **X**, of dimensions N×M, one could think of its entries as the coordinates of N points (the samples) in an M-dimensional space whose axes are the variables, which makes this representation unfeasible whenever more than three descriptors are collected on each individual. This is why exploratory data analysis often relies on the use of projection (bilinear) techniques to reduce the data dimensionality in a "clever" way. Projection methods look for a low-dimensional representation of the data, whose axes (normally called components or latent variables) are as relevant as possible for the specific task. In the case of exploratory data analysis, the most commonly used technique is Principal Component Analysis (PCA) (Pearson, 1901; Wold et al., 1987; Jolliffe, 2002).

# Principal Component Analysis

Principal component analysis (PCA) is a projection method which looks for directions in the multivariate space that progressively provide the best fit of the data distribution, i.e., that best approximate the data in a least-squares sense. This explains why PCA is the technique of choice in the majority of cases when exploratory data analysis is the task: indeed, by definition, for any desired number of dimensions (components) F in the final representation, the subspace identified by PCA constitutes the most faithful F-dimensional approximation of the original data. This allows compression of the data dimensionality while keeping the loss of information to a minimum. In particular, starting from a data matrix **X**(N×M), Principal Component Analysis is based on its bilinear decomposition, which can be mathematically described by Equation (1):

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E} \tag{1}$$

The loadings matrix **P**(M×F) identifies the F directions, i.e., the principal components (PCs), along which the data should be projected, and the results of such projection, i.e., the coordinates of the samples in this reduced subspace, are collected in the scores matrix **T**(N×F). In order to achieve data compression, usually F ≪ M, so that the PCA representation provides an approximation of the original data whose residuals are collected in the matrix **E**(N×M).

Since the scores represent a new set of coordinates along highly informative (relevant) directions, they may be used in two- or three-dimensional scatterplots (scores plots). This offers a straightforward visualization of the data, which can highlight possible trends in data, presence of clusters or, in general, of an underlying structure. A schematic representation of how PCA works is displayed in **Figure 2**.
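The decomposition in Equation (1) can be sketched with a singular value decomposition of the column-centered data. This is only an illustrative example: the synthetic matrix, its dimensions, and the mean-centering step (a common but not universal pretreatment) are assumptions, not data or settings from the studies discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, F = 30, 200, 2                      # samples, variables, retained components

# Synthetic "spectra": two underlying sources of variation plus small noise.
latent = rng.normal(size=(N, F))
profiles = rng.normal(size=(F, M))
X = latent @ profiles + 0.01 * rng.normal(size=(N, M))

Xc = X - X.mean(axis=0)                   # column mean-centering (assumed pretreatment)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:F].T                              # loadings, M x F
T = U[:, :F] * s[:F]                      # scores, N x F (equal to Xc @ P)
E = Xc - T @ P.T                          # residuals, N x M

explained = (s[:F] ** 2).sum() / (s ** 2).sum()   # fraction of variance captured
```

Plotting the two columns of `T` against each other gives the scores plot, while plotting each column of `P` against the variable index gives the profile-like loadings plot used for spectral data.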

**Figure 2** shows one of the simplest possible examples of feature reduction, since it describes the case where samples described by three measured variables can be approximated by being projected on an appropriately chosen two-dimensional sub-space. However, the concept may be easily generalized to higher-dimensional problems, such as those involving spectroscopic measurements. **Figure 3** shows an example of the application of PCA to mid-infrared spectroscopic data. In particular, it illustrates the possibility of extracting as much information as possible from the IR spectra recorded on 51 tablets containing either ketoprofen or ibuprofen in the region 2,000–680 cm<sup>−1</sup> (661 variables).

A large portion of the data variability can be summarized by projecting the samples onto the space spanned by the first two principal components, which account for about 90% of the original variance and can therefore be considered a good approximation of the experimental matrix. Inspection of the scores plot suggests that the main source of variability is the difference between ibuprofen tablets (blue squares) and ketoprofen ones (red circles), since the two clusters are completely separated along the first principal component. To interpret the observed cluster structure in terms of the measured variables, it is then necessary to inspect the corresponding loadings, which are also displayed in **Figure 3** for PC1. Indeed, for spectral data, plotting the loadings for the individual components in a profile-like fashion, rather than producing scatterplots for pairs of latent variables (as exemplified in **Figure 2**), is often preferred due to its more straightforward interpretability: spectral regions having positive loadings will have higher intensity on samples which have positive scores on the corresponding component, whereas bands associated with negative loadings will present higher intensity on the individuals falling at negative values of the PC. In the example reported in **Figure 3**, one could infer, for instance, that the ketoprofen samples (which fall at positive values of PC1) have a higher absorbance at the wavenumbers where the loadings are positive, whereas ibuprofen samples should present a higher signal in correspondence to the bands showing negative loadings.

Based on what is reported above, it is evident that the quality of the compressed representation in the PC space depends on the number of components F chosen to describe the data. At the same time, it must be noted that when the aim of calculating a PCA is "only" data display, as in most applications in the context of exploratory analysis, the choice of the optimal number of components is not critical: it is normally enough to inspect the data distribution across the first few dimensions and, in many cases, considering the scores plot resulting from the first two or three components could be sufficient. On the other hand, there may be cases when the aim of the exploratory analysis is not limited to data visualization and, for instance, one is interested in the identification of anomalous or outlying observations, or there could be the need to impute missing elements in the data matrix; additionally, one could also need to obtain a compressed representation of the data to be used for further predictive modeling. In all such cases, the choice of the optimal dimensionality of the PC representation is critical for the specific purposes and, therefore, the number of PCs should be carefully estimated. In this respect, different methods have been proposed in the literature and a survey of the most commonly used can be found in Jolliffe (2002).

FIGURE 2 | […] (highlighted in light red in the leftmost panel) spanned by the first two principal components. Inspection of the data set can be carried out by looking at the distribution of the samples onto the informative PC subspace (scores plot) and interpretation can be then carried out by examining the relative contribution of the experimental variable to the definition of the principal components (loadings plot).

Among the applications described above, the possibility of using PCA for the identification/detection of potential outliers deserves a few more words, as it could be of interest for pharmaceutical quality control. Although outliers (or anomalous observations, in general) could, in principle, be investigated by visually inspecting the scores plot along the first components, this approach could be subjective and, in any case, would not capture some possible data discrepancies. Alternatively, when it is used as a model to build a suitable approximation of the data, PCA provides a powerful toolbox for outlier detection based on the definition of more objective test statistics, which can be easily automated or embedded in control strategies, also on-line. This is accomplished by defining two distance measurements: (i) a squared Mahalanobis distance in the scores space, which follows the T<sup>2</sup> statistics (Hotelling, 1931) and accounts for how extreme the measurement is in the principal component subspace, and (ii) a squared orthogonal Euclidean distance (the sum of squares of the residuals after approximating the observation by its projection), which is normally indicated as Q statistics (Jackson and Muldholkar, 1979) and quantifies how well the model fits that particular individual. Outlier detection is then carried out by setting appropriate threshold values for the T<sup>2</sup> and Q statistics and verifying whether the samples fall below or above those critical limits. Moreover, once an observation is identified as a potential outlier, inspection of the contribution plot can help in relating the detected anomaly to the behavior of specific measured variables.
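The two distance measures described above can be computed in a few lines once a PCA model is available. The sketch below is illustrative only: the training data are synthetic, the two test samples are constructed on purpose to be anomalous, and the 95th-percentile limits are a simple empirical stand-in (an assumption) for the distribution-based critical values normally used for the T<sup>2</sup> and Q statistics.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, F = 50, 20, 2

# Synthetic training set with two underlying components plus small noise.
X_train = (rng.normal(size=(N, F)) @ rng.normal(size=(F, M))
           + 0.01 * rng.normal(size=(N, M)))

mean = X_train.mean(axis=0)
U, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
P = Vt[:F].T                              # loadings of the PCA model
score_var = s[:F] ** 2 / (N - 1)          # variance of the training scores

def t2_and_q(X_new):
    """Hotelling T2 (distance within the model) and Q (squared residual distance)."""
    Xn = np.atleast_2d(X_new) - mean
    T = Xn @ P
    residuals = Xn - T @ P.T
    t2 = np.sum(T ** 2 / score_var, axis=1)
    q = np.sum(residuals ** 2, axis=1)
    return t2, q

t2_train, q_train = t2_and_q(X_train)

# Hypothetical anomalous samples.
spiked = X_train[0].copy()
spiked[7] += 1.0                                   # disturbance on a single variable
extreme = mean + P @ (10.0 * np.sqrt(score_var))   # ten standard deviations along both PCs

t2_new, q_new = t2_and_q(np.vstack([spiked, extreme]))

# Empirical limits taken from the training set (assumption, not a formal test).
t2_limit = np.percentile(t2_train, 95)
q_limit = np.percentile(q_train, 95)
```

Here the spiked sample is flagged by Q, because the disturbance is not described by the model, whereas the extreme sample lies in the model plane and is flagged by T<sup>2</sup> only.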

## Selected Examples

FIGURE 3 | […] observed differences in terms of the spectroscopic signal is made possible by the inspection of the loadings on PC, which are shown in a "spectral-like" fashion in (C).

PCA is customarily used for the quality control of drugs and pharmaceuticals; several examples of the application of this technique to solve diverse issues have been reported in the literature. One of the most obviously relevant ones is fraud detection. For example, in Rodionova et al. (2005) PCA was applied to both bulk NIR spectroscopy and hyperspectral imaging (HSI) in the NIR range to spot counterfeit drugs. In particular, bulk NIR was used to differentiate genuine antispasmodic drugs from forgeries, whereas HSI on the ground uncoated tablets was employed to identify counterfeit antimicrobial drugs. In both cases, the spectroscopic data were subjected to PCA, which made it possible to clearly identify clusters in the scores plot, corresponding to the two kinds of tablets, i.e., genuine and counterfeit. In the case of the imaging platform, where the signal is stored as a data hypercube [i.e., a three-way numerical array whose dimensions are the number of horizontal pixels N<sub>x</sub>, the number of vertical pixels N<sub>y</sub> and the number of wavelengths N<sub>λ</sub>, in which each entry corresponds to the spectral intensity measured at a certain wavelength and a specific spatial position (x-y coordinates)], a preliminary unfolding step is needed. Unfolding is the procedure that reorganizes a higher-order array into a two-way matrix, which can then be processed with standard chemometric techniques. In the case of hyperspectral data cubes, this is carried out by stacking the spectra corresponding to the different pixels one on top of the other, so as to obtain a matrix of dimensions (N<sub>x</sub> × N<sub>y</sub>) × N<sub>λ</sub>.
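The unfolding of a hyperspectral data cube amounts to a reshape. The toy cube below and the use of the per-pixel sum as a stand-in for a PCA score are assumptions made only to show the bookkeeping between pixel positions and matrix rows.

```python
import numpy as np

n_x, n_y, n_lambda = 4, 3, 5      # toy cube: horizontal pixels, vertical pixels, wavelengths
cube = np.arange(n_x * n_y * n_lambda, dtype=float).reshape(n_x, n_y, n_lambda)

# Unfolding: one row per pixel, one column per wavelength.
unfolded = cube.reshape(n_x * n_y, n_lambda)

# With this (row-major) ordering, pixel (x, y) ends up in row x * n_y + y.
x, y = 2, 1
row = x * n_y + y

# Any per-pixel quantity computed on the unfolded matrix (e.g., a score on one
# principal component) can be refolded into an image for display.
per_pixel_value = unfolded.sum(axis=1)     # stand-in for a PCA score
image = per_pixel_value.reshape(n_x, n_y)
```

Refolding the scores in this way is what allows them to be displayed as score images, in which each pixel is colored according to its value on a given component.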

Another relevant application of exploratory analysis is related to quality checks. For instance, PCA can be applied to investigate formulations not meeting predefined parameters. In Roggo et al. (2005), PCA was used to investigate a suspicious blue spot present on tablets. Samples were analyzed by a multi-spectral (IR) imaging microscope and PCA was performed on the unfolded data cube, indicating that the localized coloration was not due to contamination, but was actually given by wet indigo carmine dye and placebo (expected ingredients of the formulation).

PCA can also be used for routine quality checks at the end of a production process. For example, in Myakalwar et al. (2011) laser-induced breakdown spectroscopy (LIBS) and PCA were combined with the aim of obtaining qualitative information about the composition of different pharmaceuticals.

# REGRESSION

As discussed in the previous section, exploratory analysis is a first and fundamental step in chemometric data processing and, in some cases, it could be the only approach needed to characterize the samples under investigation. However, due to its unsupervised nature, it provides only a (hopefully) unbiased picture of the data distribution and it cannot be used to formulate predictions on new observations, which, on the other hand, may be a fundamental aspect to solve specific issues. In practice, quality control and/or authentication of pharmaceutical products very often rely on some form of qualitative or quantitative prediction. For instance, the quantification of a specific compound (e.g., an active ingredient or an excipient) contained in a formulation is a routine operation in pharmaceutical laboratories. This goal can be achieved by combining instrumental (e.g., spectroscopic) measurements with chemometric regression approaches (Martens and Naes, 1991; Martens and Geladi, 2004). Indeed, given a response to be predicted y and a vector of measured signals (e.g., a spectrum) **x**, the aim of regression methods is to find the functional relationship that best approximates the response on the basis of the measurements (the predictors). Mathematically, this can be stated as:

$$
y = \hat{y} + e = f(\mathbf{x}) + e \tag{2}
$$

where ŷ is the predicted response (i.e., the response value approximated by the model), f(**x**) indicates a general function of **x**, and e is the residual, i.e., the difference between the actual response and its predicted value. In many applications, the functional relationship between the response and the predictors f(**x**) can be assumed to be linear:

$$\hat{y} = f\left(\mathbf{x}\right) = b_1 x_1 + b_2 x_2 + \dots + b_M x_M = \mathbf{x}^T\mathbf{b}\tag{3}$$

where x<sub>1</sub>, x<sub>2</sub>, …, x<sub>M</sub> are the components of the vector of measurements **x** and the transpose indicates that it is normally expressed as a row vector, while the associated linear coefficients b<sub>1</sub>, b<sub>2</sub>, …, b<sub>M</sub>, which weight the contributions of each of the M X-variables to y, are called regression coefficients and are collected in the vector **b**. Building a regression model means finding the optimal value of the parameters **b**, i.e., the values which lead to the lowest error in the prediction of the responses. As a direct consequence, in order to build a predictive model it is mandatory to have a set of samples (the so-called training set) for which both the experimental data **X** and the responses **y** are available. Indeed, the information on **y** is actively used to calculate the model parameters. When data from more than a single sample are available, the regression problem in Equations (2, 3) can be reformulated as:

$$\mathbf{y} = \hat{\mathbf{y}} + \mathbf{e} = \mathbf{X}\mathbf{b} + \mathbf{e} \tag{4}$$

where the vectors **ŷ** and **e** collect the predictions and residuals for the different samples, respectively. Accordingly, the most straightforward way of calculating the model parameters in Equation (4) is by the ordinary least-squares approach, i.e., by looking for those values of **b** which minimize the sum of squares of the residuals **e**:

$$\min_{\mathbf{b}} \mathbf{e}^T \mathbf{e} = \min_{\mathbf{b}} \sum_{i=1}^N e_i^2 \tag{5}$$

e<sub>i</sub> being the residual for the ith sample and N being the number of training observations. The corresponding method is called multiple linear regression (MLR) and, under the conditions of Equation (5), the regression coefficients are calculated as:

$$\mathbf{b} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \tag{6}$$
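As a minimal numerical sketch of Equation (6), the example below fits an MLR model on synthetic data with many more samples than variables. The data, the true coefficients and the absence of an intercept term are assumptions chosen to mirror the equation as written; solving the normal equations with `np.linalg.solve` is used here instead of forming the inverse explicitly, which is a numerical convenience rather than a different method.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 40, 3                                  # more samples than variables
X = rng.normal(size=(N, M))
b_true = np.array([1.5, -2.0, 0.5])           # hypothetical coefficients
y = X @ b_true + 0.05 * rng.normal(size=N)

# Equation (6): solve the normal equations (X^T X) b = X^T y.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same least-squares solution from NumPy's dedicated solver.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ b_normal                  # e in Equation (4)
```

At the least-squares solution the residuals are orthogonal to every column of **X**, which is exactly the condition obtained by setting the derivative of Equation (5) to zero.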

Here it is worth highlighting that, if one wishes to use the same experimental matrix **X** to predict more than one response, i.e., if, for each sample, instead of a single scalar y<sub>i</sub>, there is a dependent vector

$$\mathbf{y}_i^T = \begin{bmatrix} y_{i1} & y_{i2} & \cdots & y_{iL} \end{bmatrix} \tag{7}$$

L being the number of responses, then each dependent variable should be regressed on the independent block by means of its own set of regression coefficients. Assuming that the L responses measured on the training samples are collected in a matrix **Y**, whose columns **y**<sub>l</sub> are the individual dependent variables,

$$\mathbf{Y} = \left[\mathbf{y}_1 \cdots \mathbf{y}_l \cdots \mathbf{y}_L\right] \tag{8}$$

the corresponding regression equations could be written as:

$$\begin{aligned} \mathbf{y}_1 &= \mathbf{X}\mathbf{b}_1 + \mathbf{e}_1 \\ &\vdots \\ \mathbf{y}_l &= \mathbf{X}\mathbf{b}_l + \mathbf{e}_l \\ &\vdots \\ \mathbf{y}_L &= \mathbf{X}\mathbf{b}_L + \mathbf{e}_L \end{aligned} \tag{9}$$

which can be grouped into a single expression:

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E} \tag{10}$$

where the residuals, i.e., the differences between the measured and predicted responses are collected in the matrix **E**, and the regression coefficients vectors are gathered in a matrix **B**, which can be estimated, analogously to Equation (6), as:

$$\mathbf{B} = [\mathbf{b}_1 \cdots \mathbf{b}_l \cdots \mathbf{b}_L] = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}.\tag{11}$$
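As a numerical illustration of Equation (11), the short sketch below estimates the coefficient matrix for several responses at once and, separately, one response at a time. The synthetic data and dimensions are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, L = 30, 4, 3                            # samples, predictors, responses
X = rng.normal(size=(N, M))
Y = X @ rng.normal(size=(M, L)) + 0.1 * rng.normal(size=(N, L))

# Equation (11): all responses regressed together.
B = np.linalg.solve(X.T @ X, X.T @ Y)

# Equation (6) applied to each column of Y in turn.
B_columnwise = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ Y[:, l]) for l in range(L)]
)
```

The two coefficient matrices coincide, since in MLR each column of the coefficient matrix depends only on the corresponding column of **Y**.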

Equations (9–11) indicate that, as far as MLR is concerned, building a separate model for each response or a single model to predict multiple responses altogether leads to the same results since, in the latter case, each dependent variable is anyway modeled as if it were alone. In either case, the solutions of the least-squares problem reported in Equations (6, 11) rely on the possibility of inverting the matrix **X**<sup>T</sup>**X**, i.e., on the characteristics of the predictors. Indeed, in order for that matrix to be invertible, the number of samples should be higher than that of variables and the variables themselves should be as uncorrelated as possible. These conditions are rarely met by the techniques which are used to characterize pharmaceutical samples and, in particular, never met by spectroscopic methods. Due to these limitations, alternative approaches have been proposed in the literature to build regression models in cases where standard multiple linear regression is not applicable. In particular, since the regression solution exists only when the predictor matrix is made of few, uncorrelated variables, most of these alternative approaches involve the projection of the **X** matrix onto a reduced space of orthogonal components and the use of the corresponding scores as regressors to predict the response(s). In this regard, one of the most widely used approaches is principal component regression (PCR) (Hotelling, 1957; Kendall, 1957; Massy, 1965; Jeffers, 1967; Jolliffe, 1982, 2002; Martens and Naes, 1991; Martens and Geladi, 2004) which, as the name suggests, involves a two-stage process where at first principal component analysis is used to compress the information in the **X** block onto a reduced set of relevant scores, as already described in Equation (1):

$$\mathbf{T} = \mathbf{X} \mathbf{P} \tag{12}$$

and then these scores constitute the predictor block to build a multiple linear regression:

$$
\hat{\mathbf{Y}} = \mathbf{T}\mathbf{C} \tag{13}
$$

**C** being the matrix of regression coefficients for this model. By combining Equations (12, 13), it can be easily seen how PCR still describes a linear relationship between the responses **Y** and the original variables **X**:

$$
\hat{\mathbf{Y}} = \mathbf{TC} = \mathbf{XPC} = \mathbf{XB}\_{PCR} \tag{14}
$$

mediated by a matrix of regression coefficients **B**<sub>PCR</sub> (= **PC**), which differs from the one that would be estimated by Equation (11), since it is calculated by taking into account only the portion of the variability in the **X** block accounted for by the selected principal components. Using principal component scores as predictors solves the problems caused by the matrix **X**<sup>T</sup>**X** being usually ill-conditioned when dealing with spectroscopic techniques, but may still be suboptimal in terms of predictive accuracy.

Indeed, as described in Equations (12, 13), calculating a PCR model is a two-step procedure, which involves first the calculation of PC scores and then the use of these scores to build a regression model to predict the response(s). However, these two steps have different objective functions, i.e., the criterion used to extract the scores from the **X** matrix is not the same as the one that guides the calculation of the regression coefficients **C** in Equation (13). In other words, the directions of maximum explained variance (especially when there are many uninformative sources of variability in the data) may not be relevant for the prediction of **Y**. To overcome this drawback, an alternative approach to component-based regression is represented by the Partial Least-Squares algorithm (Wold et al., 1983; Geladi and Kowalski, 1986; Martens and Naes, 1991) which, being probably the most widely used calibration method in chemometrics, will be described in greater detail in the following subsection.
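The two-step PCR procedure of Equations (12–14) can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions: the data are synthetic, both blocks are mean-centered before modeling (a common choice that the equations above do not spell out), and the function names `pcr_fit` and `pcr_predict` are hypothetical.

```python
import numpy as np

def pcr_fit(X, Y, n_components):
    """Fit PCR; returns the centring vectors and B_PCR = P C (Equation 14)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean          # mean-centring (assumed here)
    # Step 1: PCA of the X block; the rows of Vt are the loadings
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                  # variables x components
    T = Xc @ P                               # Equation (12)
    # Step 2: MLR of Y on the scores; T'T is diagonal, hence easy to invert
    C = np.linalg.solve(T.T @ T, T.T @ Yc)   # Equation (13)
    return x_mean, y_mean, P @ C

def pcr_predict(X_new, x_mean, y_mean, B_pcr):
    return (X_new - x_mean) @ B_pcr + y_mean

# Synthetic "spectra": 20 samples, 200 collinear variables, 2 analytes
rng = np.random.default_rng(0)
conc = rng.uniform(0.0, 1.0, size=(20, 2))
pure_spectra = rng.normal(size=(2, 200))
X = conc @ pure_spectra + 0.01 * rng.normal(size=(20, 200))
Y = conc

# rank(X'X) = rank(X) is at most N - 1 after centring, far below 200,
# so the MLR solution of Equation (11) cannot be computed on these data
rank_x = np.linalg.matrix_rank(X - X.mean(axis=0))

x_mean, y_mean, B_pcr = pcr_fit(X, Y, n_components=2)
Y_hat = pcr_predict(X, x_mean, y_mean, B_pcr)
rmse = np.sqrt(np.mean((Y - Y_hat) ** 2))
print(f"rank of centred X: {rank_x}, calibration RMSE: {rmse:.4f}")
```

The regression step only ever inverts the small, diagonal matrix **T**<sup>T</sup>**T**, which is why the approach remains usable when there are many more variables than samples.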

# Partial Least Squares (PLS) Regression

Partial Least Squares (PLS) regression (Wold et al., 1983; Geladi and Kowalski, 1986; Martens and Naes, 1991) was proposed as an alternative method to calculate reliable regression models in the presence of ill-conditioned matrices. Analogously to PCR, it is based on the extraction of a set of scores **T** by projecting the **X** block onto a subspace of latent variables that are relevant for the calibration problem. However, unlike in PCR, the need for the components not only to explain a significant portion of the **X** variance but also to be predictive for the response **Y** is explicitly taken into account in the definition of the scores. Indeed, in PLS, the latent variables (i.e., the directions onto which the data are projected) are defined so as to maximize the covariance between the corresponding scores and the response(s): maximizing the covariance yields scores which at the same time describe a relevant portion of the **X** variance and are correlated with the response(s). Due to these characteristics, and unlike what was described for MLR (see Equation 11) and, by extension, PCR, two distinct PLS algorithms have been proposed depending on whether one or multiple responses should be predicted (the corresponding approaches are named PLS1 and PLS2, respectively). In the remainder of this section, both algorithms will be briefly described and commented on.

When a single response has to be predicted, its values on the training samples are collected in a vector **y**; accordingly, the PLS1 algorithm extracts scores from the **X** block having maximum covariance with the response. In particular, the first score **t**<sub>1</sub> is the projection of the data matrix **X** along the direction of maximum covariance **r**<sub>1</sub>:

$$\max\_{\mathbf{r}\_1} \left[ Cov(\mathbf{t}\_1, \mathbf{y}) \right] = \max\_{\mathbf{r}\_1} \left( \mathbf{t}\_1^T \mathbf{y} \right) \tag{15}$$

while the successive scores **t**<sub>2</sub> ⋯ **t**<sub>F</sub>, which are all orthogonal, account in turn for the maximum residual covariance. Therefore, PLS1 calculates a set of orthogonal scores having maximum covariance with **y**, according to:

$$\mathbf{T} = \mathbf{X} \mathbf{R} \tag{16}$$

**R** being the weights defining the subspace onto which the matrix should be projected, and then uses these scores as regressors for the response:

$$
\hat{\mathbf{y}} = \mathbf{T}\mathbf{q} \tag{17}
$$

**q** being the coefficients for the regression. Similarly to what was already shown in the case of PCR, Equations (16, 17) can then be combined into a single one to express the regression model as a function of the original variables, through the introduction of the regression vector **b**<sub>PLS1</sub> (= **Rq**):

$$
\hat{\mathbf{y}} = \mathbf{T}\mathbf{q} = \mathbf{X}\mathbf{R}\mathbf{q} = \mathbf{X}\mathbf{b}\_{PLS1}.\tag{18}
$$
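As an illustration of Equations (15–18), the following is a minimal sketch of PLS1 computed with the classical NIPALS scheme. The choice of NIPALS, the mean-centering, the synthetic data and the function name `pls1_fit` are assumptions made for this example; other algorithms give the same model.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 by NIPALS; returns centring values, weights R, coefficients q, b_PLS1."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean                  # X residuals, deflated at each step
    f = y - y_mean                  # centred response
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_components))
    P = np.zeros((n_vars, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)      # unit direction of maximum covariance (Eq. 15)
        t = E @ w                   # score on the current latent variable
        tt = t @ t
        p = E.T @ t / tt            # X loading used for deflation
        q[a] = f @ t / tt           # regression of y on the score (Eq. 17)
        E = E - np.outer(t, p)      # deflation: later scores are orthogonal to t
        W[:, a], P[:, a] = w, p
    R = W @ np.linalg.inv(P.T @ W)  # weights such that T = Xc R (Eq. 16)
    return x_mean, y_mean, R, q, R @ q

# Synthetic data: 25 samples, 150 collinear variables, 3 underlying factors
rng = np.random.default_rng(1)
conc = rng.uniform(0.0, 1.0, size=(25, 3))
X = conc @ rng.normal(size=(3, 150)) + 0.01 * rng.normal(size=(25, 150))
y = conc[:, 0]

x_mean, y_mean, R, q, b_pls1 = pls1_fit(X, y, n_components=3)
T = (X - x_mean) @ R
y_hat_scores = T @ q + y_mean               # Equation (17)
y_hat_direct = (X - x_mean) @ b_pls1 + y_mean   # Equation (18)
gram = T.T @ T
off_diag = gram - np.diag(np.diag(gram))    # should vanish: orthogonal scores
rmse = np.sqrt(np.mean((y - y_hat_direct) ** 2))
print(f"calibration RMSE with 3 latent variables: {rmse:.4f}")
```

The two prediction routes coincide because **b**<sub>PLS1</sub> = **Rq**, and the off-diagonal elements of **T**<sup>T</sup>**T** are zero within numerical precision, reflecting the orthogonality of the scores.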

In contrast, in the multi-response case (PLS2), it is assumed that also the matrix **Y**, which collects the values of the dependent variables on the training samples, has a latent structure, i.e., it can be approximated by a component model:

$$\hat{\mathbf{Y}} = \mathbf{U}\mathbf{Q}^T\tag{19}$$

**U** and **Q** being the **Y** scores and loadings, respectively. In particular, in order for the calibration model to be efficient, it is assumed that the **X** and the **Y** matrices share the same latent structure. This is accomplished by requiring each component to be relevant for describing the variance of the independent block and predictive for the responses. In mathematical terms, pairs of scores are simultaneously extracted from the **X** and the **Y** blocks so as to have maximum covariance:

$$\max\_{\mathbf{r}\_i, \mathbf{q}\_i} \left[ Cov(\mathbf{t}\_i, \mathbf{u}\_i) \right] = \max\_{\mathbf{r}\_i, \mathbf{q}\_i} \left( \mathbf{t}\_i^T \mathbf{u}\_i \right) \tag{20}$$

where **t**<sub>i</sub> and **u**<sub>i</sub> are the **X** and the **Y** scores on the ith latent variable, respectively, **q**<sub>i</sub> is the ith column of the **Y** loading matrix **Q**, while **r**<sub>i</sub> is the ith column of the **X** weight matrix **R**, which has the same meaning as specified in Equation (16). Additionally, these scores are made to be collinear, through what is normally defined as the inner relation:

$$
\mathbf{u}\_i = \mathbf{t}\_i c\_i \quad \forall i \tag{21}
$$

c<sub>i</sub> being a proportionality constant (inner regression coefficient). When considering all the pairs of components, Equation (21) can be rewritten in matrix form as:

$$\mathbf{U} = \mathbf{T}\mathbf{C} \tag{22}$$

where:

$$\mathbf{C} = \begin{bmatrix} c\_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & c\_F \end{bmatrix} . \tag{23}$$

Also in this case, by combining all the equations defining the model, it is possible to express the predicted responses as a linear function of the original variables:

$$\hat{\mathbf{Y}} = \mathbf{U}\mathbf{Q}^T = \mathbf{T}\mathbf{C}\mathbf{Q}^T = \mathbf{X}\mathbf{R}\mathbf{C}\mathbf{Q}^T = \mathbf{X}\mathbf{B}\_{PLS2} \tag{24}$$

where the matrix of regression coefficients **B**<sub>PLS2</sub> is defined as **RCQ**<sup>T</sup>.

Based on the above description, it is clear that, when more than one response has to be modeled, it is essential to decide whether it could be better to build an individual model for each dependent variable, or a single model to predict all the responses, as the results would not be the same. In particular, it is advisable to use the PLS2 approach only when one could reasonably assume that there are systematic relationships between the dependent variables.

On the other hand, regardless of which model one decides to use, once the values of the regression coefficients (here generally indicated as **B**) have been estimated based on the training samples, they can be used to predict the responses for any new set of measurements (**X**<sub>new</sub>):

$$
\hat{\mathbf{Y}}\_{new} = \mathbf{X}\_{new} \mathbf{B}.\tag{25}
$$

Here, it should be stressed that, for the calibrations built by PLS (but the same concept holds for PCR) to be accurate and reliable, a key parameter is the choice of an appropriate number of latent variables to describe the data. Indeed, selecting too few components carries the risk of not explaining all the relevant variance (underfitting), whereas including too many of them (so that not only the systematic information but also the noise is captured) can lead to overfitting, i.e., to a model which is very good at predicting the samples it has been calculated on, but performs poorly on new observations. To reduce this risk, a proper validation strategy is needed (see section Validation) and, in particular, the optimal number of latent variables is selected as the one leading to the minimum error during one of the validation stages (usually, cross-validation).
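The selection of the number of components by cross-validation can be sketched as follows. To keep the example short, PCR is used as a stand-in for PLS (the text notes that the same concept holds for both); the 5-fold split, the synthetic data and the names `pcr_coefficients` and `rmsecv` are assumptions of this illustration.

```python
import numpy as np

def pcr_coefficients(Xc, yc, k):
    """Regression vector of a k-component PCR model on centred data."""
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T
    T = Xc @ P
    c = (T.T @ yc) / (s[:k] ** 2)   # T'T is diagonal with entries s^2
    return P @ c

def rmsecv(X, y, k, n_folds=5, seed=0):
    """Root mean square error in cross-validation for k components."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    press = 0.0
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        # centring values are estimated on the training part only
        xm, ym = X[train].mean(axis=0), y[train].mean()
        b = pcr_coefficients(X[train] - xm, y[train] - ym, k)
        pred = (X[test] - xm) @ b + ym
        press += np.sum((y[test] - pred) ** 2)
    return np.sqrt(press / len(y))

# Synthetic data with three underlying sources of systematic variation
rng = np.random.default_rng(2)
conc = rng.uniform(0.0, 1.0, size=(40, 3))
X = conc @ rng.normal(size=(3, 100)) + 0.05 * rng.normal(size=(40, 100))
y = conc @ np.array([1.0, 2.0, 3.0]) + 0.02 * rng.normal(size=40)

candidates = range(1, 11)
cv_errors = [rmsecv(X, y, k) for k in candidates]
best_k = candidates[int(np.argmin(cv_errors))]
for k, err in zip(candidates, cv_errors):
    print(f"{k:2d} components: RMSECV = {err:.4f}")
print("selected number of components:", best_k)
```

With too few components the cross-validated error is dominated by unexplained systematic variance; once the relevant sources are captured it flattens, and adding further components mainly fits noise.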

# Selected Application of Regression Methods to Pharmaceutical Problems

Regression methods in general, and especially PLS, are often combined with spectroscopy in order to develop rapid and (sometimes) non-destructive methodologies for the quantification of active ingredients in formulations. For instance, Bautista et al. (1996) quantified three analytes of interest (caffeine, acetylsalicylic acid and acetaminophen) in their synthetic ternary mixtures and in different formulations by UV-Vis spectroscopy assisted by a PLS calibration model. Mazurek et al. proposed two approaches based on coupling FT-Raman spectroscopy with PLS and PCR calibration for the estimation of captopril and prednisolone in tablets (Mazurek and Szostak, 2006a) and of diclofenac sodium and aminophylline in injection solutions (Mazurek and Szostak, 2006b). The authors compared the results obtained from calibration models built on unnormalized spectra with those found when an internal standard was added to each sample and the spectra were normalized by the maximum or integrated intensity of a selected band of the standard. Another study on injection samples was proposed by Xie et al. (2010), using NIR spectroscopy combined with PLS and PCR to quantify pefloxacin mesylate (an antibacterial agent) in liquid formulations. PLS regression was also coupled to MIR (Marini et al., 2009) and NIR spectroscopy (Rigoni et al., 2014) to quantify the enantiomeric excess of different APIs in the solid phase, also in the presence of excipients, based on the consideration that, in the solid phase, the spectrum of the racemic mixture could be different from that of either pure enantiomer. Specifically, it was possible to accurately quantify the enantiomeric excess of S-(+)-mandelic acid and S-(+)-ketoprofen by MIR spectroscopy coupled with PLS, both on the whole spectrum and after variable selection by sequential application of backward interval PLS and genetic algorithms (biPLS-GA) (Marini et al., 2009), while NIR was used to quantify the enantiomeric excess of R-(–)-epinephrine and S-(+)-ibuprofen (Rigoni et al., 2014). In the latter case, it was also shown that, when using the validated model to quantify the enantiomeric excess of the API in the finished products, excipients and dosage forms (intact tablets or powders) have a relevant impact on the final predictive accuracy.

# CLASSIFICATION

As already introduced in the previous section, in chemometric applications in general, and in pharmaceutical analysis in particular, one is often interested in using the experimentally collected data (e.g., spectroscopic profiles) to predict qualitative or quantitative properties of the samples. While the regression methods for the prediction of quantitative responses have already been presented and discussed in section Regression, the main chemometric approaches for the prediction of qualitative properties of the individuals under investigation are outlined herein. These approaches are generally referred to as classification methods, since any discrete level that the qualitative variable can assume may also be defined as a class (or category) (Bevilacqua et al., 2013). For instance, if one were interested in recognizing which of three specific sites a raw material was supplied from, the response to be predicted could only take three possible values, namely "Site A," "Site B," and "Site C"; each of these three values would correspond to a particular class. A class can then be considered as an ensemble of individuals (samples) sharing similar characteristics. In this example, samples from the first class would all be characterized by having been manufactured from a raw material produced in Site A, and similar considerations could be made for the specimens in the second and third classes, corresponding to Site B and Site C, respectively. As the example suggests, there are many areas of application for classification methods in pharmaceutical and biomedical analysis, some of which will be further illustrated in section Selected Applications of Classification Approaches for Pharmaceutical Analysis, after a brief theoretical introduction to the topic and to the chemometric methods most frequently used in this context (especially in combination with spectroscopic techniques).

As anticipated above, classification approaches aim at relating the experimental data collected on a sample to a discrete value of a property one wishes to predict. This same problem can also be expressed in geometrical terms by considering that each experimental profile (e.g., spectrum) can be seen as a point in the multivariate space described by the measured variables. Accordingly, a classification problem can be formulated as the identification of regions in this multivariate space, which can be associated with a particular category, so that if a point falls in one of these regions, it is predicted as being part of the corresponding class. In this framework, a fundamental distinction can be made between discriminant and class-modeling methods, which constitute the two main approaches to perform classification in chemometrics (Albano et al., 1978). In detail, discriminant classification methods focus on identifying boundaries in the multivariate space, which separate the region(s) corresponding to a particular category from those corresponding to another one. This means they need representative samples from all the categories of interest in order to build the classification model, which will then be able to predict any new sample as belonging only to one of the classes spanned by the training set. In a problem involving three classes, a discriminant classification method will look for those boundaries in the multivariate space identifying the regions associated with the three categories in such a way as to minimize the classification error (i.e., the percentage of samples wrongly assigned). An example is reported in **Figure 4A**.

On the other hand, class-modeling techniques look at the similarities among individuals belonging to the same category, and aim at defining a (usually bounded) subspace where samples from the class under investigation can be found with a certain probability; in this sense, they resemble outlier tests, and indeed they borrow most of the machinery from the latter. Operationally, each category is modeled independently of the others and the outcome is the definition of a class boundary which should enclose the category sub-space, i.e., individuals falling within that space are likely to belong to the class (are "accepted" by the class model), whereas samples falling outside are deemed outliers and rejected. It is then evident that one of the main advantages of class-modeling approaches is that they allow building a classification model also in the asymmetric case, where there is only one category of interest and the alternative one is represented by all the other individuals not falling under the definition of that particular class. In this case, since the alternative category is ill-defined, heterogeneous, and very likely to be underrepresented in the training set, any discriminant model would be suboptimal, as its predictions would strongly depend on the (usually too few) samples available for that class. Modeling techniques, instead, define the category space only on the basis of data collected for the class of interest, so those problems can be overcome.

When the specific problem requires investigating more than one class, each category is modeled independently of the others and, accordingly, the corresponding sub-spaces may overlap (see **Figure 4B**). As a consequence, classification outcomes are more versatile than with discriminant methods: a sample can be accepted by a single category model (and therefore be assigned to that class), by more than one (falling in the area where different class spaces overlap and, hence, being "confused"), or it could fall outside any class region and therefore be rejected by all the categories involved in the model.

# Discriminant Methods

As mentioned above, predictions made by the application of discriminant methods are univocal; namely, each sample is uniquely assigned to one and only one of the classes represented in the training set. This is accomplished by defining decision surfaces, which delimit the boundaries among the regions of space associated with the different categories. Depending on the model complexity, such boundaries can be linear (hyperplanes) or assume more complex (non-linear) shapes. When possible, linear discriminant models are preferred as they have fewer parameters to tune, require a lower number of training samples and are in general more robust against overfitting. Based on these considerations, the first-ever and still one of the most commonly used discriminant techniques is Linear Discriminant Analysis (LDA), originally proposed by Fisher (1936). It relies on the assumption that the samples of each class are normally distributed around their respective centroids with the same variance/covariance matrix (i.e., the same within-category scatter). Under these assumptions, it is possible to calculate the probability that each sample belongs to a particular class g, p(g|**x**), as:

$$p\left(g \mid \mathbf{x}\right) = \frac{\pi\_g}{C}e^{-\frac{1}{2}\left(\mathbf{x}-\overline{\mathbf{x}}\_g\right)^{T}\mathbf{S}^{-1}\left(\mathbf{x}-\overline{\mathbf{x}}\_g\right)}\tag{26}$$

where **x̄**<sub>g</sub> is the centroid of class g, **S** the overall within-class variance/covariance matrix, π<sub>g</sub> the prior probability (i.e., the probability of observing a sample from that category before carrying out any measurement), C is a normalization constant and the quadratic form in the exponent, (**x** − **x̄**<sub>g</sub>)<sup>T</sup>**S**<sup>−1</sup>(**x** − **x̄**<sub>g</sub>), is defined as the squared Mahalanobis distance of the individual to the center of the category. Classification is then accomplished by assigning the sample to the category to which it has the highest probability of belonging.

**Figure 4.** Discriminant methods (A) divide the available hyperspace into as many regions as the number of the investigated categories (three, in the present example), so that whenever a sample falls in a particular region of space, it is always assigned to the associated class. Modeling techniques (B) build a separate model for each one of the categories of interest, so that there can be regions of space where more than one class is mapped and others where there is no class at all.

LDA is a well-established technique, which also works well on data for which the normality assumption is not fulfilled but, unfortunately, it can rarely be used on spectroscopic data for the same reasons MLR cannot be used for regression (see section Regression): calculation of the matrix **S**<sup>−1</sup> requires the experimental data matrix to be well-conditioned, which is not the case when dealing with a high number of correlated variables measured on a limited number of samples. To overcome these limitations, LDA can be applied to the scores of bilinear models used to compress the data (e.g., to principal components), but the most commonly used approach involves a suitable modification of the PLS algorithm which makes it able to deal with classification issues; the resulting method is called partial least squares discriminant analysis (PLS-DA) (Sjöström et al., 1986; Ståle and Wold, 1987; Barker and Rayens, 2003), and it will be briefly described in the following paragraph.
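The posterior probability of Equation (26) can be illustrated with a short NumPy sketch on a low-dimensional data set, for which **S** is invertible (for spectra, the same code would be applied to a few scores rather than to the raw variables, as noted above). The synthetic data, the estimation of the priors from class frequencies and the names `lda_fit` and `lda_posterior` are assumptions of this example.

```python
import numpy as np

def lda_fit(X, labels):
    """Estimate class centroids, priors and the inverse pooled covariance."""
    classes = np.unique(labels)
    centroids = np.array([X[labels == g].mean(axis=0) for g in classes])
    priors = np.array([np.mean(labels == g) for g in classes])  # assumed priors
    # pooled within-class variance/covariance matrix S
    resid = X - centroids[np.searchsorted(classes, labels)]
    S = resid.T @ resid / (len(X) - len(classes))
    return classes, centroids, priors, np.linalg.inv(S)

def lda_posterior(x, centroids, priors, S_inv):
    """Posterior probabilities p(g|x) of Equation (26) for one sample x."""
    diff = x - centroids                                # one row per class
    d2 = np.einsum("gi,ij,gj->g", diff, S_inv, diff)    # squared Mahalanobis
    unnorm = priors * np.exp(-0.5 * d2)
    return unnorm / unnorm.sum()    # the division plays the role of 1/C

# Three synthetic classes sharing the same (identity) covariance
rng = np.random.default_rng(4)
true_centres = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat([0, 1, 2], 30)
X = true_centres[labels] + rng.normal(size=(90, 2))

classes, centroids, priors, S_inv = lda_fit(X, labels)
post = lda_posterior(np.array([3.8, 0.3]), centroids, priors, S_inv)
assigned = classes[np.argmax(post)]
print("posterior probabilities:", np.round(post, 4))
print("assigned class:", assigned)
```

The sample is assigned to the class with the highest posterior, which here is the class whose centroid lies closest in the Mahalanobis sense, since the priors are equal.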

## Partial Least Squares Discriminant Analysis (PLS-DA)

In order for the PLS algorithm to deal with discriminant classification problems, the information about class belonging has to be encoded in a response variable **Y**, which can then be regressed on the experimental matrix **X** to provide the predictive model (Sjöström et al., 1986). This is accomplished by defining **Y** as a "dummy" binary matrix, having as many rows as the number of samples (N) and as many columns as the number of classes (G). Each row in **Y** is a vector encoding the information about class belonging of the corresponding sample, whereas each column is associated with a particular class (the first column with class 1, the second with class 2 and so on up to the Gth). As such, the row vector corresponding to a particular sample will contain all zeros except for the column associated with the class it belongs to, where there will be a one. For instance, in the case of a problem involving three categories, a sample belonging to Class 2 will be represented by the vector **y**<sub>i</sub> = [0 1 0]. A PLS regression model is then calculated between the experimental data matrix **X** and the dummy **Y** [as described in section Partial Least Squares (PLS) Regression], and the matrix of regression coefficients obtained is used to predict the value of the responses on new samples, **Ŷ**<sub>new</sub>. Since the dependent variable is associated with the categorical information, classification of the samples is based on the predicted responses **Ŷ**<sub>new</sub> which, however, are not binary but real-valued. As a consequence, different approaches have been proposed in the literature to define how to classify samples in PLS-DA based on the values of **Ŷ**<sub>new</sub>. The simplest approach (see e.g., Alsberg et al., 1998) is to assign each sample to the category corresponding to the highest value of the predicted response vector. For instance, if the following predictions were obtained for a particular sample: **ŷ**<sub>new,k</sub> = [0.1 −0.4 0.8], it would be assigned to Class 3. On the other hand, other strategies have also been suggested, like the application of LDA to **Ŷ**<sub>new</sub> or to the PLS scores (Nocairi et al., 2004; Indahl et al., 2007), or the use of thresholds based on probability theory (Pérez et al., 2009).
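The dummy coding of the class information and the highest-response assignment rule can be sketched in a few lines. The PLS regression step itself is omitted here; the predicted response vector is the one quoted in the text, and the helper names `dummy_matrix` and `assign_by_max` are hypothetical.

```python
import numpy as np

def dummy_matrix(labels, classes):
    """Binary N x G matrix with a one in the column of each sample's class."""
    labels = np.asarray(labels)
    return (labels[:, None] == np.asarray(classes)[None, :]).astype(float)

def assign_by_max(Y_hat, classes):
    """Assign each row to the class with the highest predicted response."""
    return np.asarray(classes)[np.argmax(Y_hat, axis=1)]

classes = [1, 2, 3]

# Four training samples belonging to classes 2, 1, 3 and 2
Y = dummy_matrix([2, 1, 3, 2], classes)
print(Y)

# Real-valued PLS prediction for a new sample, as in the example above
y_hat_new = np.array([[0.1, -0.4, 0.8]])
predicted = assign_by_max(y_hat_new, classes)
print("assigned class:", predicted[0])
```

Because the rule only compares the predicted responses with one another, it always returns exactly one class, even when all the predicted values are far from one.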

# Class-Modeling Methods

As already stated, class-modeling methods aim at identifying a closed (bounded) sub-space, where it is likely to find samples from a particular category, irrespective of whether other classes should also be considered or not. They try to capture the features which make individuals from the same category similar to one another. Operationally, they define the class space by identifying the "normal" variability which can be expected among samples belonging to that category and, accordingly, introducing a "distance-to-the-model" criterion which accounts for the degree of outlyingness of any new sample. Among the different class-modeling techniques proposed in the literature, soft independent modeling of class analogies (SIMCA) is by far the most commonly used, especially for spectroscopic data, due to its ability to deal with ill-conditioned experimental data matrices; it will therefore be briefly described below (for more details, the reader is referred to Wold, 1976; Wold and Sjöström, 1977, 1987).

## Soft Independent Modeling of Class Analogies (SIMCA)

The main idea behind SIMCA is that the systematic variability characterizing the samples for a particular category can be captured and accurately accounted for by a PCA model of appropriate dimensionality. This model is built by using only the samples from the investigated category:

$$\mathbf{X}\_g = \mathbf{T}\_g \mathbf{P}\_g^T + \mathbf{E}\_g \tag{27}$$

where the symbols have the same meaning as in Equation (2), and the subscript indicates that the model is calculated by using only the training data from class g. The use of PCA to define the similarities among the samples belonging to the category of interest also provides the machinery to assess whether any new sample is likely to come from that class or not, through the definition of two statistics normally used for outlier detection, namely T<sup>2</sup> and Q. As already introduced in section Principal Component Analysis, the former is the squared Mahalanobis distance of a sample to the center of the scores space, indicating how far the individual is from the distribution of the "normal" samples in the space spanned by the significant PCs (Hotelling, 1931), while the latter is the (Euclidean) distance of the sample to its projection onto the PC space, describing how well that individual is fitted by the PCA model (Jackson and Muldholkar, 1979). In the context of SIMCA, once the PCA model of the gth category is calculated according to Equation (27), any specimen to be predicted is projected onto that model and its values of T<sup>2</sup> and Q are used to calculate an overall distance to the model d<sub>i,g</sub> (Yue and Qin, 2001), which constitutes the basis for class acceptance or rejection:

$$d\_{i,g} = \sqrt{\left(T\_{i,g}^2\right)^2 + \left(Q\_{i,g}\right)^2} \tag{28}$$

where the subscript indicates that the ith sample is tested against the model of the gth category. Accordingly, the boundary of the class space is identified by setting a proper threshold to the distance, so that if a sample has a distance to the model lower than the threshold it is accepted by the category and, otherwise, it is rejected.
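A minimal sketch of the SIMCA decision rule is given below. Several choices are assumptions of this illustration and are not taken from the text: Q is computed as the squared residual norm, both statistics are divided by the 95th percentile of their training values before being combined as in Equation (28) (so that they are on comparable scales), and the acceptance threshold is set to √2. The data are synthetic and the function names are hypothetical.

```python
import numpy as np

def simca_statistics(X, model):
    """T2 and Q of each row of X with respect to a class PCA model."""
    Xc = X - model["mean"]
    T = Xc @ model["P"]
    t2 = np.sum(T ** 2 / model["score_var"], axis=1)   # squared Mahalanobis
    E = Xc - T @ model["P"].T                          # part not fitted by PCA
    q = np.sum(E ** 2, axis=1)                         # squared residual norm
    return t2, q

def simca_fit(Xg, n_components):
    """PCA model of one class (Equation 27) plus empirical 95% limits."""
    mean = Xg.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xg - mean, full_matrices=False)
    model = {
        "mean": mean,
        "P": Vt[:n_components].T,
        "score_var": s[:n_components] ** 2 / (len(Xg) - 1),
    }
    t2, q = simca_statistics(Xg, model)
    model["t2_lim"] = np.percentile(t2, 95)   # assumed normalisation
    model["q_lim"] = np.percentile(q, 95)
    return model

def simca_distance(X, model):
    """Distance to the model: Equation (28) on the normalised statistics."""
    t2, q = simca_statistics(X, model)
    return np.sqrt((t2 / model["t2_lim"]) ** 2 + (q / model["q_lim"]) ** 2)

# Training class: 40 samples, 50 variables, 2 systematic factors plus noise
rng = np.random.default_rng(5)
loadings = rng.normal(size=(2, 50))
X_g = rng.normal(size=(40, 2)) @ loadings + 0.05 * rng.normal(size=(40, 50))

model = simca_fit(X_g, n_components=2)
threshold = np.sqrt(2.0)
d_train = simca_distance(X_g, model)
accepted_fraction = np.mean(d_train < threshold)

# A sample with a pattern the class model has never seen
foreign = X_g.mean(axis=0) + 2.0 * rng.normal(size=50)
d_foreign = simca_distance(foreign[None, :], model)[0]
print(f"training samples accepted: {accepted_fraction:.0%}")
print(f"foreign sample accepted: {d_foreign < threshold}")
```

The foreign sample is rejected mainly through Q, because its pattern lies largely outside the space spanned by the two principal components of the class model.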

# Selected Applications of Classification Approaches for Pharmaceutical Analysis

As mentioned before, classification approaches are widely applied in the quality control of pharmaceuticals, in particular to detect counterfeit drugs, as reported, for instance, in da Silva Fernandes et al. (2012), where NIR and fluorescence spectroscopy were combined with different classification methods to distinguish between pure and adulterated tablets. In Storme-Paris et al. (2010), a non-destructive approach is proposed to distinguish genuine tablets from counterfeit or recalled (from the market) medicines. In order to achieve this, NIR spectra (collected directly on the tablets) are analyzed by SIMCA. The results obtained suggest the validity of this approach; in fact, it allowed highlighting small differences among drugs (e.g., different coating), and it provided an excellent differentiation between genuine and counterfeit products. For the same purpose, namely counterfeit drug detection, NIR spectra have also been widely combined with PLS-DA. To mention only one example, de Peinder et al. (2008) demonstrated the validity of this approach to spot counterfeits of a specific cholesterol-lowering medicine. Although the authors highlighted that the storage conditions appreciably affected the NIR spectra (because of humidity), the PLS-DA model still proved to be robust and provided excellent predictions.

# VALIDATION

Chemometrics relies mainly on the use of empirical models which, given the experimental measurements, should summarize the information of the data, reasonably approximate the system under study, and allow predictions of one or more properties of interest. Bearing this in mind, given the "soft" (i.e., empirical) nature of the models employed, there are many models one could in principle calculate on the same data and their performances could be influenced by different factors (number of samples and their representativeness, the method itself, the algorithm, and so on) (Brereton et al., 2018). Thus, selecting which model is the most appropriate for the data under investigation and verifying how reliable it is, is of fundamental importance and the chemometric strategies for doing so are collectively referred to as validation (Harshman, 1984; Westad and Marini, 2015). To evaluate the quality of the investigated models, the validation process requires the definition of suitable diagnostics, which could be based on model parameters but more often rely on the calculation of some sort of residuals (i.e., error criteria). In this context, in order to avoid overoptimism or, in general, to obtain estimates which are as unbiased as possible, it is fundamental that the residuals which are used for validation are not generated by the application of the model to the data it has been built on, since in almost all cases, they cannot be considered as representative of the outcomes one would obtain on completely new data. For such reason, a correct validation strategy should involve the estimation of the model error on a dataset different than the one used for calculating the model parameters. This is normally accomplished through the use of an external test set or cross-validation.
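The overoptimism of residuals computed on the calibration data can be illustrated with a small simulation. This is only a sketch on synthetic data: a least-squares model with 20 predictors is fitted on 30 samples, although only the first predictor actually carries information, and its error is then evaluated on the same samples and on an independent test set.

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_test, n_vars = 30, 200, 20

# Only the first variable is related to the response; the others are noise
true_b = np.zeros(n_vars)
true_b[0] = 1.0
X_train = rng.normal(size=(n_train, n_vars))
X_test = rng.normal(size=(n_test, n_vars))
y_train = X_train @ true_b + 0.5 * rng.normal(size=n_train)
y_test = X_test @ true_b + 0.5 * rng.normal(size=n_test)

# Least-squares estimate of the regression coefficients on the training set
b_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Error on the data used to build the model versus error on new data
rmsec = np.sqrt(np.mean((y_train - X_train @ b_hat) ** 2))
rmsep = np.sqrt(np.mean((y_test - X_test @ b_hat) ** 2))
print(f"error on calibration data: {rmsec:.3f}")
print(f"error on independent test data: {rmsep:.3f}")
```

The error on the calibration samples is markedly lower than the one on the test samples, because the model has partly fitted the noise of the data it was built on.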

The use of a second, completely independent, set of data for evaluating the performances and, consequently, calculating the residuals (test set validation) is the strategy which best mimics how the model will be routinely used, and it is therefore the one to be preferred whenever possible. On the other hand, cross-validation is based on the repeated resampling of the dataset into training and test sub-sets, so that at each iteration only a part of the original samples is used for model building while the remaining individuals are left out for validation. This procedure is normally repeated until each sample has been left out at least once or, in any case, for a prespecified number of iterations. Cross-validation is particularly suited when the number of available samples is small and there is no possibility of building an external test set, but the resulting estimates can still be biased as the calibration and validation sets are never completely independent of one another. In general, it is used for model selection (e.g., estimating the optimal number of components) rather than for the final validation stage.

# OTHER SELECTED APPLICATIONS

In addition to some specific applications described above, in this paragraph additional examples will be presented to further emphasize the usefulness of chemometrics-based spectroscopy for pharmaceutical analysis.

Morris and Forbes (2001) coupled NIR spectroscopy with multivariate calibration for quantifying narasin chloroform-extracted from granulated samples. In another study, Forbes et al. (2001) proposed a transmission NIR spectroscopy method using multivariate regression for the quantification of potency and lipids in monensin fermentation broth.

Ghasemi and Niazi (2007) developed a spectrophotometric method for the direct quantitative determination of captopril in pharmaceutical preparations and biological fluid (human plasma and urine) samples. Since the spectra were recorded at various pH values (from 2.0 to 12.8), different models were tested, including the possibility of a preliminary spectral deconvolution using multi-way approaches. In particular, the use of PLS on the spectra at pH 2.0 made it possible to build a calibration model with very good accuracy. Li et al. (2014) used Raman spectroscopy to identify anisodamine counterfeit tablets with 100% predictive accuracy and, at the same time, NIR spectroscopy to discriminate genuine anisodamine tablets from 5 different manufacturing plants. In the latter case, PLS-DA models were found to have 100% recognition and rejection rates. Willett and Rodriguez (2018) implemented a rapid Raman assay for on-site analysis of stockpiled drugs in aqueous solution, which was tested on Tamiflu (oseltamivir phosphate) by using three different portable and handheld Raman instruments. PLS regression models yielded an average error, with respect to the reference HPLC values, lower than 0.3%. Other examples of application can be found in Forina et al. (1998), Komsta (2012), Hoang et al. (2013), and Lohumi et al. (2017).

# CONCLUSIONS

Chemometrics provides a wealth of techniques both for the exploratory analysis of multivariate data and for building reliable calibration and classification strategies to predict quantitative and qualitative responses based on the experimental profiles collected on the samples. Coupled to spectroscopic characterization, it represents an indispensable and highly versatile tool for pharmaceutical analysis at all levels.

# AUTHOR CONTRIBUTIONS

AB and FM jointly conceived and designed the paper, and wrote the manuscript. All authors agreed on the content of the paper and approved its submission.

# REFERENCES


regression, calibration and prediction methods. Talanta 43, 2107–2121. doi: 10.1016/S0039-9140(96)01997-2


Kendall, M. G. (1957). A Course in Multivariate Analysis. London: Hafner.


pharmaceuticals by using partial least squares and principal component regression multivariate calibration. Spectrochim. Acta A 75, 1535–1539. doi: 10.1016/j.saa.2010.02.012


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Biancolillo and Marini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.