Diagnosis of Lung Cancer by ATR-FTIR Spectroscopy and Chemometrics

Lung cancer is the leading cause of cancer-related death worldwide, and early diagnosis is of great significance for patient survival. In this paper, attenuated total reflectance Fourier transform infrared (ATR-FTIR) spectroscopy combined with chemometrics was used to study serum samples from patients with lung cancer and from healthy people. Comparison of spectral band areas showed that the concentrations of protein, lipid and nucleic acid molecules in the serum of patients with lung cancer were increased compared with those in healthy people. The original spectra were preprocessed to improve the accuracy of the principal component regression (PCR) and partial least squares-discriminant analysis (PLS-DA) models. The PLS-DA results for first-derivative spectral data in the nucleic acids band (1250-1000 cm⁻¹) showed 80% sensitivity, 91.89% specificity and 87.10% accuracy, with high Rc² of 0.8949 and Rv² of 0.8153 and low RMSEC of 0.3136 and RMSEV of 0.4180. These results suggest that ATR-FTIR spectroscopy combined with chemometrics might be developed into a simple method for clinical screening and diagnosis of lung cancer.


INTRODUCTION
Lung cancer is the leading cause of cancer-related death in the world. In 2020 it accounted for more than 2.2 million new cancer cases and 1.8 million deaths, or 18% of all cancer deaths (1). Lung cancer is largely attributed to smoking and genetic inheritance (2,3). Because it lacks early diagnostic biomarkers, most patients are already at an advanced stage when diagnosed. Treatment outcomes for patients with advanced lung cancer are poor, and the survival rate is still expected to be less than 15% (4,5). Therefore, early diagnosis is of great significance for the survival of patients with lung cancer.
The traditional clinical diagnostic method is based on histological examination of tumor tissue samples, but it is invasive, and repeated biopsies for dynamic monitoring are difficult (6). Common screening methods, such as chest X-rays (CXR), magnetic resonance imaging (MRI) and low-dose computed tomography (LDCT), have some disadvantages. CXR cannot fully reveal early lung lesions and has a high false-negative rate. MRI has low sensitivity and cannot be used in patients with certain metal-based implants or pacemakers, or in those suffering from acute claustrophobia (7). LDCT has a high rate of false-positive results and carries adverse effects caused by exposure to hazardous radiation (4,8-10). The existing light-induced theranostic platforms also have several limitations, such as tissue absorption and limited imaging (11). Therefore, there is a need for a simple diagnostic method for lung cancer.
In recent years, vibrational spectroscopic techniques such as Fourier transform infrared (FTIR) spectroscopy and Raman spectroscopy have been widely applied to biological samples because of their low cost and small sample consumption (12). Raman spectroscopy is limited by its strong fluorescence background and weak signal, whereas infrared spectroscopy is not subject to these limitations (13). Infrared spectroscopy has been reported for the analysis of lung tissue. Bangaoil et al. classified malignant and benign lung tissue sections using infrared spectroscopy combined with principal component analysis (PCA) and hierarchical cluster analysis (HCA), and the final analysis results were consistent with the histopathological conclusions (14). Kaznowska et al. studied tissue samples from healthy people and patients with lung cancer using infrared spectroscopy, and found corresponding wavenumber changes of the functional groups in lipids, carbohydrates, proteins, DNA and phospholipids (15). However, the tissue samples in these studies still had to be collected by surgery, which is highly invasive. Wang et al. studied the FTIR spectra of serum samples by drying the serum on a BaF2 window under vacuum, and found differences in the protein secondary structure of serum between patients with lung cancer and healthy people (16). Nevertheless, much work remains to be done before infrared spectroscopy can be applied in practice to the clinical diagnosis of lung cancer.
In this paper, ATR-FTIR spectroscopy combined with chemometrics was used to study the serum samples from patients with lung cancer and healthy people in order to explore a simple diagnostic method for lung cancer and lay the foundation for the clinical application of infrared spectroscopy in the diagnosis of lung cancer in the future.

Sample Preparation
Serum samples were provided by The First People's Hospital of Yunnan Province. All subjects gave informed consent before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Yunnan Normal University (Number: ynnuethic2021-13). Serum samples were collected from 92 patients with lung cancer and 155 healthy people. Information on the patients with lung cancer and the healthy people is listed in Table 1. Each serum sample (50ml) was placed on a glass slide and dried in a vacuum oven at room temperature (25°C) for 40 minutes. Vacuum was applied to accelerate drying; drying in the vacuum oven also ensures that the sample is not contaminated or oxidized and that organic substances are not destroyed within the 40 minutes. The dried serum was then removed from the slide for ATR-FTIR measurement. Before sampling, the glass slides were soaked in aqua regia for 1 hour, washed with water, soaked in acetone for 1 hour, and finally washed with ultrapure water and dried for use.

ATR-FTIR Spectroscopy
ATR-FTIR spectra were measured in the range of 4000-600 cm⁻¹ with a Frontier spectrometer (Perkin Elmer, UK) coupled with an ATR accessory and a deuterated triglycine sulfate (DTGS) detector. Each spectrum was an accumulation of 32 scans at a resolution of 4 cm⁻¹. The dried serum sample was transferred to the crystal plate and then pressed with the pressure tip to ensure the best contact with the crystal surface. The air background spectrum was recorded before each sample scan and automatically subtracted when the sample was measured. After each measurement, the crystal surface was cleaned with ethanol and ultrapure water and then dried with dust-free paper. Three IR spectra were collected for each serum sample, and the resulting spectra were averaged using OMNIC 8.2 software (Thermo Scientific).

Spectral Data Preprocessing
The influence of noise and irrelevant information can be reduced by proper preprocessing of the original spectra, which improves the signal-to-noise ratio and increases the accuracy of the analytical model (17). Baseline correction (BL) is a necessary processing step in infrared spectroscopy and supports further qualitative or quantitative analysis (18). Savitzky-Golay (SG) smoothing improves spectral quality by suppressing random noise. Derivative processing can eliminate background interference and spectral overlap, and minimize baseline drift caused by differences in optical setups (19). Multiplicative scatter correction (MSC) aims to eliminate the influence of scattering and to enhance the spectral information so that a relatively ideal spectrum is obtained (20). Standard normal variate (SNV) is used to reduce baseline shift or tilt due to scattering and changes in optical path length (21). It subtracts the average intensity from the spectral intensity to achieve offset correction, and then divides by the standard deviation to reduce the multiplicative effect (22).
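The preprocessing steps described above can be sketched in a few lines. This is a minimal illustration rather than the procedure implemented in the software used for the study: the function names, the synthetic three-spectrum data set and the SG polynomial order (2) are assumptions, and only the 9-point window is taken from the text.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(spectra):
    """Standard normal variate: centre each spectrum (row) on its own mean
    and divide by its own standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum
    (default: the mean spectrum of the set)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Least-squares fit: row ≈ intercept + slope * ref
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected


def sg_first_derivative(spectra, window_length=9, polyorder=2, delta=1.0):
    """Savitzky-Golay first derivative along the spectral axis.
    polyorder=2 is an assumed value; the derivative is taken in array
    order, so its sign depends on the direction of the wavenumber axis."""
    return savgol_filter(spectra, window_length=window_length,
                         polyorder=polyorder, deriv=1, delta=delta, axis=1)


# Synthetic example (assumed data): one Gaussian band in 1250-1000 cm^-1
# recorded with three different scale factors and additive offsets.
x = np.linspace(1250.0, 1000.0, 251)
band = np.exp(-0.5 * ((x - 1079.0) / 20.0) ** 2)
scale = np.array([[0.8], [1.0], [1.3]])
offset = np.array([[0.05], [0.0], [-0.02]])
raw = scale * band + offset

snv_spectra = snv(raw)
msc_spectra = msc(raw)
d1_spectra = sg_first_derivative(raw, delta=abs(x[1] - x[0]))
```

In this idealized case SNV and MSC both collapse the three spectra onto a single curve, while the first derivative removes the additive offset but leaves the multiplicative factor in place, which is one reason derivative and scatter corrections are often evaluated separately.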

Spectral Band Area Analysis
The spectral band area was measured using OriginPro 9.1 software (OriginLab Corporation, Northampton, MA). The results are presented as mean ± SEM (standard error of the mean). For statistical analysis, independent-samples t tests were performed using SPSS 19 software (SPSS, Inc., Chicago, IL) and GraphPad Prism 9.0 (GraphPad Software Inc., CA, USA). Statistical significance was denoted as *p < 0.05, **p < 0.01, ***p < 0.001 and ****p < 0.0001.
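The band area analysis can be illustrated with the following sketch, which is not the OriginPro/SPSS workflow itself. It assumes trapezoidal integration without a local baseline, an equal-variance t test, and synthetic area values; the function names and the "ns" label for non-significant results are also assumptions.

```python
import numpy as np
from scipy import stats


def band_area(wavenumbers, absorbance, low, high):
    """Trapezoidal area under a spectrum between low and high (cm^-1)."""
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    absorbance = np.asarray(absorbance, dtype=float)
    order = np.argsort(wavenumbers)  # integrate along increasing wavenumber
    w, a = wavenumbers[order], absorbance[order]
    mask = (w >= low) & (w <= high)
    w, a = w[mask], a[mask]
    return float(np.sum(0.5 * (a[1:] + a[:-1]) * np.diff(w)))


def sem(values):
    """Standard error of the mean (sample SD divided by sqrt(n))."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / np.sqrt(values.size)


def significance_stars(p):
    """Map a p value onto the star notation used in the text."""
    for threshold, stars in ((1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")):
        if p < threshold:
            return stars
    return "ns"  # assumed label for p >= 0.05


# Sanity check: a flat spectrum of height 1 over 1250-1000 cm^-1 has area 250.
w = np.arange(1250.0, 999.0, -1.0)
area_flat = band_area(w, np.ones_like(w), 1000.0, 1250.0)

# Synthetic band areas for two groups (assumed values, not the study data).
rng = np.random.default_rng(0)
healthy_areas = rng.normal(10.0, 1.0, size=30)
cancer_areas = rng.normal(12.0, 1.0, size=30)

result = stats.ttest_ind(cancer_areas, healthy_areas)  # independent-samples t test
label = significance_stars(result.pvalue)
summary = (cancer_areas.mean(), sem(cancer_areas), healthy_areas.mean(), sem(healthy_areas))
```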

Chemometrics Analysis
Principal component regression (PCR) and partial least squares-discriminant analysis (PLS-DA) were performed to analyze the spectral data using Unscrambler X 10.4 software (Camo Software AS, Oslo, Norway). PCR is a regression analysis based on PCA (23). It decomposes the X matrix by PCA, and then takes the transformed new variables as predictive variables for multiple linear regression (MLR) (24). PLS-DA is a linear supervised classification technique combining partial least squares (PLS) regression with linear discriminant analysis (LDA) (25). It can find variables and directions in the multivariate space that distinguish the categories in the calibration set (26). After preprocessing, the spectral data were randomly divided into a calibration set (69 serum samples from patients with lung cancer and 116 serum samples from healthy people) and a validation set (23 serum samples from patients with lung cancer and 39 serum samples from healthy people) at a ratio of 3:1 for model building. The performance of the regression model was evaluated by calculating the square of the correlation coefficient (R²) and the root mean square error (RMSE) (27). The sensitivity, specificity and accuracy were used to evaluate the discriminating ability of the diagnostic model. The corresponding formulas are as follows:
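In their standard form, these three metrics can be written as below. The symbols are an assumption, since the text does not define them: TP and FN are the numbers of lung cancer samples classified correctly and incorrectly, and TN and FP are the numbers of healthy samples classified correctly and incorrectly.

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN} \times 100\% \\
\text{Specificity} &= \frac{TN}{TN + FP} \times 100\% \\
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \times 100\%
\end{align}
```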

RESULTS AND DISCUSSION
ATR-FTIR Spectra of Serum
Figure 1 shows the IR spectra of serum samples from patients with lung cancer and healthy people after baseline correction and SG smoothing (9-point). Their average IR spectra are shown in Figure 2. It can be seen that the main components of serum are protein, lipid and nucleic acids. The amide I protein band (1700-1600 cm⁻¹) mainly originated from the α-helix structure at 1646 cm⁻¹ (28). The amide II protein band (1560-1500 cm⁻¹) mainly originated from the N-H functional group at 1542 cm⁻¹ (29). The peak at 1740 cm⁻¹ was attributed to the C=O stretching vibration of the ester carbonyl in triglycerides (25). The spectral band at 3000-2800 cm⁻¹ was mainly correlated with the lipid-related C-H asymmetric stretching vibrations of CH₃ at 2959 cm⁻¹ and CH₂ at 2930 cm⁻¹ (30,31). The spectral band at 1250-1000 cm⁻¹ was correlated with the P=O asymmetric stretching vibration at 1243 cm⁻¹ and the symmetric stretching vibration at 1079 cm⁻¹ of PO₂⁻ in nucleic acids (32). The absorbance of the average IR spectrum of serum from patients with lung cancer was significantly increased in the nucleic acids band compared with that of healthy people. However, there were no significant differences in the amide I, amide II and lipid bands of the average IR spectra.

Comparison of Spectral Band Area
To further analyze the differences between the serum of patients with lung cancer and that of healthy people in these four bands, the statistical analysis results of the spectral band areas of the serum samples are shown in Figure 3. The spectral band area of patients with lung cancer was significantly increased in the amide I, amide II and nucleic acids bands compared with healthy people (p < 0.0001). Although the increase in the lipid band was less pronounced than in the other three bands, there was still a statistical difference between the two groups of serum samples (p < 0.05). According to the Beer-Lambert law, an increase in the absorbance of a spectral band indicates an increase in the concentration of the corresponding functional group (31). Therefore, the concentrations of protein, lipid and nucleic acid molecules in the serum of patients with lung cancer were increased compared with those in healthy people. This may be due to aerobic glycolysis in cancer cells, which produces a large number of biosynthetic intermediates such as lipids, proteins and nucleotides that are used for the growth and proliferation of cancer cells (33).
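The step from band area to concentration can be written out explicitly. In the sketch below the symbols are standard notation rather than the paper's own: ε is the absorptivity of the functional group, l the effective path length and c its concentration. The proportionality assumes that l is the same for all samples.

```latex
A(\tilde{\nu}) = \varepsilon(\tilde{\nu})\, l\, c
\quad\Longrightarrow\quad
\int_{\tilde{\nu}_1}^{\tilde{\nu}_2} A(\tilde{\nu})\, d\tilde{\nu}
= l\, c \int_{\tilde{\nu}_1}^{\tilde{\nu}_2} \varepsilon(\tilde{\nu})\, d\tilde{\nu}
\;\propto\; c
```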

Chemometrics Analysis
In order to evaluate how well these four bands classify the serum of patients with lung cancer and healthy people, PCR and PLS-DA were performed on the preprocessed spectral data using full cross-validation. A model with a low RMSE and an R² close to 1 is considered to be a good model (24,34). Figure 4 shows the score plot of Factor-1 and Factor-2 of the PLS-DA model for first-derivative spectral data in the nucleic acids band (1250-1000 cm⁻¹). The first two factors account for 65% (X1 36%, X2 27%) of the X variance and explain 58% (Y1 50%, Y2 8%) of the sample classification. The serum samples are distributed into two clusters along Factor-1. The red cluster is mainly composed of serum samples from patients with lung cancer, and the black cluster is mainly composed of serum samples from healthy people. In this model, 80% of patients with lung cancer were correctly identified, 91.89% of healthy people were correctly separated, and the total accuracy was 87.10%. Figure 5 shows the loading plot of Factor-1, which identifies the peaks with high weights in classifying the samples. There are positively weighted peaks around 1176 cm⁻¹, 1130 cm⁻¹, 1085 cm⁻¹ and 1043 cm⁻¹: 1176 cm⁻¹ is related to the vibration band of sugar-phosphate, 1130 cm⁻¹ is assigned to the C=O stretching vibration of ribose in RNA (32), 1085 cm⁻¹ is ascribed to the symmetric phosphate vibrations (35), and 1043 cm⁻¹ is attributed to the stretching and bending vibrations of C-O in carbohydrates (14). Two positively weighted peaks are also found at 1226 cm⁻¹ and 1026 cm⁻¹: 1226 cm⁻¹ is due to the asymmetric stretching vibration of PO₂⁻ in nucleic acids (36), and 1026 cm⁻¹ is related to the stretching vibration of C-O and the bending vibration of C-H in aromatic amino acids (37).
Therefore, the loading plot of Factor-1 showed that PO₂⁻ in nucleic acids plays a key role in distinguishing the serum of patients with lung cancer from that of healthy people. This may be due to DNA damage caused by oxidative chemical mutagenic aberrations in the serum of patients with lung cancer (38).
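The modelling workflow (3:1 split, PLS-DA, and evaluation by R², RMSE, sensitivity, specificity and accuracy) can be sketched as follows. This is not the Unscrambler implementation: PLS-DA is approximated here by a single-response PLS regression (NIPALS) on 0/1 class labels with a 0.5 decision threshold, R² is computed as 1 - SSE/SST, and the spectra are synthetic. All function names, the class coding, the threshold and the data are assumptions.

```python
import numpy as np


def pls1_fit(X, y, n_components=2):
    """Single-response PLS via NIPALS; returns (x_mean, y_mean, coefficients)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # weight vector
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        c = (yk @ t) / tt               # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef


def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (np.asarray(X, dtype=float) - x_mean) @ coef + y_mean


def split_3_to_1(y, rng):
    """Random 3:1 calibration/validation split within each class."""
    cal, val = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_cal = int(0.75 * idx.size)
        cal.extend(idx[:n_cal]); val.extend(idx[n_cal:])
    return np.array(cal), np.array(val)


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))


def r_squared(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))


def diagnostic_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=bool), np.asarray(y_pred, dtype=bool)
    tp, tn = np.sum(y_true & y_pred), np.sum(~y_true & ~y_pred)
    fp, fn = np.sum(~y_true & y_pred), np.sum(y_true & ~y_pred)
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / y_true.size}


# Synthetic spectra (assumed): 92 "cancer" (label 1) and 155 "healthy" (label 0)
# samples that differ only in the height of one band in 1250-1000 cm^-1.
rng = np.random.default_rng(0)
x_axis = np.linspace(1250.0, 1000.0, 126)
band = np.exp(-0.5 * ((x_axis - 1085.0) / 15.0) ** 2)
y = np.r_[np.ones(92), np.zeros(155)]
X = rng.normal(0.0, 0.05, size=(y.size, x_axis.size)) + np.outer(0.5 + 0.5 * y, band)

cal_idx, val_idx = split_3_to_1(y, rng)
model = pls1_fit(X[cal_idx], y[cal_idx], n_components=2)

y_cal_hat = pls1_predict(model, X[cal_idx])
y_val_hat = pls1_predict(model, X[val_idx])
rmsec, rmsev = rmse(y[cal_idx], y_cal_hat), rmse(y[val_idx], y_val_hat)
r2c, r2v = r_squared(y[cal_idx], y_cal_hat), r_squared(y[val_idx], y_val_hat)
metrics = diagnostic_metrics(y[val_idx] == 1, y_val_hat >= 0.5)
```

With these group sizes the split yields 185 calibration and 62 validation samples, matching the sizes given in the text. The regression coefficients in `model[2]` play a role loosely comparable to a loading plot, in that large absolute values mark the wavenumbers that drive the classification.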

CONCLUSIONS
ATR-FTIR spectroscopy combined with chemometrics was used to study the serum of patients with lung cancer and healthy people. Comparison of spectral band areas showed that the concentrations of protein, lipid and nucleic acid molecules in the serum of patients with lung cancer were increased compared with those in healthy people. PCR and PLS-DA were performed on the spectral data after different preprocessing. The PLS-DA model for first-derivative spectral data in the nucleic acids band (1250-1000 cm⁻¹) was the best model, with high Rc² of 0.8949 and Rv² of 0.8153 and low RMSEC of 0.3136 and RMSEV of 0.4180. The corresponding PLS-DA results showed 80% sensitivity, 91.89% specificity and 87.10% accuracy. These results show that ATR-FTIR spectroscopy combined with chemometrics could effectively distinguish the serum of patients with lung cancer from that of healthy people.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The protocol was approved by the Ethics Committee of Yunnan Normal University (Number: ynnuethic2021-13). The patients/participants provided their written informed consent to participate in this study.