An Integrated Quantitative Proteomics Workflow for Cancer Biomarker Discovery and Validation in Plasma

Blood plasma is one of the most widely used samples for cancer biomarker discovery research as well as clinical investigations for diagnostic and therapeutic purposes. However, the plasma proteome is extremely complex due to its wide dynamic range of protein concentrations and the presence of high-abundance proteins. Here we have described an optimized, integrated quantitative proteomics pipeline combining the label-free and multiplexed-labeling-based (iTRAQ and TMT) plasma proteome profiling methods for biomarker discovery, followed by the targeted approaches for validation of the identified potential marker proteins. In this workflow, the targeted quantitation of proteins is carried out by multiple-reaction monitoring (MRM) and parallel-reaction monitoring (PRM) mass spectrometry. Thus, our approach enables both unbiased screenings of biomarkers and their subsequent selective validation in human plasma. The overall procedure takes only ~2 days to complete, including the time for data acquisition (excluding database searching). This protocol is quick, flexible, and eliminates the need for a separate immunoassay-based validation workflow in blood cancer biomarker investigations. We anticipate that this plasma proteomics workflow will help to accelerate the cancer biomarker discovery program and provide a valuable resource to the cancer research community.


INTRODUCTION
Plasma is an attractive and reliable sample for cancer research due to its easy accessibility, and the plasma proteome can provide a plethora of important information regarding normal physiological states as well as cancer-induced alterations in the body (1,2). Importantly, recent studies have demonstrated the use of whole blood as a specimen for liquid biopsy in personalized medicine applications and for monitoring therapeutic responses to cancer treatment (3,4). Mass spectrometry (MS)-based label-free and multiplexed label-based proteomics profiling of the plasma or serum proteome is widely used for unbiased discovery of potential biomarkers for diverse types of human diseases, including cancers, infectious diseases, and cardiovascular and metabolic disorders (2,5-8).
In recent years, multiple-reaction monitoring (MRM) and parallel-reaction monitoring (PRM) mass spectrometry approaches have emerged as attractive alternatives to protein immunoassays (9). These targeted proteomics approaches can accurately measure concentrations of multiple proteins in complex biological samples, such as plasma (10-13). Importantly, results obtained in multiplexed MRM/PRM-MS assays correlate well with immunoassay-based measurements (10,14). One key advantage of these targeted MS assays is that they allow quantification of variants and modified forms of proteins by targeting their specific peptide sequences (15,16). Quantification by traditional immunoassay-based techniques such as Western blotting relies on a single antibody that is often inadequately characterized, and protein quantification depends solely on a single signal (17). In contrast, the quality of the isotopically labeled reference peptides used in MRM or PRM-based methods can be easily evaluated from a fragment ion spectrum, and these approaches use multiple signals to obtain more reliable and robust quantification (17). Moreover, immunoassay-based techniques are often difficult to perform for multiple targets due to their low throughput and the unavailability of suitable antibodies for many proteins. MRM and PRM-based approaches, by comparison, allow accurate quantification of hundreds of peptides in a single injection/run of mass spectrometry and therefore offer higher throughput than conventional immunoassay-based measurements. Consequently, a combined workflow involving both discovery and validation phase quantitative proteomics techniques would be extremely beneficial for cancer biomarker research.
Several methods describe sample processing for quantitative proteomics analysis of plasma samples in various cancers; here we demonstrate a combined method for both discovery and validation of protein markers in plasma samples. In this respect, we have extensive experience in applying label-based multiplexed quantitative proteomics for the discovery of biomarker panels in cancer and other diseases (18-24). Such multiplexing using stable isotope labeling results in increased throughput, higher precision, better reproducibility, reduced technical variation, and fewer missing values (8,20,25-30). Further, proteomic profiling using label-free quantitation (LFQ) is another attractive method for cancer biomarker quantification (23,31). In recent years, we have reported targeted quantitation of proteins by MRM mass spectrometry (18,32). Here, we describe an integrated analysis pipeline for plasma biomarker analysis that combines the know-how of different quantitative and targeted proteomics methods (Figures 1A-D).

EXPERIMENTAL DESIGN
In this integrated quantitative proteomics pipeline, three biological pools of plasma samples were analyzed to obtain a comprehensive proteome profile, followed by validation of a few selected peptides. Each of the three plasma pools (named samples A, B, and C) was a uniform mixture of ten different individual plasma samples. To perform targeted proteomics analyses, a pool of 21 heavy-labeled synthetic peptides was spiked into the plasma samples at different ratios. Global quantitative proteomics was performed using both label-free and label-based approaches, the latter comprising isobaric tags for relative and absolute quantitation (iTRAQ 4-plex) and Tandem Mass Tag (TMT 6-plex) quantitation (Figures 1B,C), while the targeted proteomics was carried out using MRM and PRM-based MS assays (Figure 1D). In the iTRAQ experiment, we used different amounts of digested peptides per label to determine the minimum amount of peptides to be labeled and the accuracy of the quantitation.
This protocol consists of label-free and label-based (iTRAQ and TMT) proteome profiling methods for cancer biomarker discovery, followed by the targeted approaches (MRM and PRM) for validation of a few potential marker proteins.

STEPWISE PROCEDURE
Plasma Sample Preparation (Timing: 20 min; Figure S1)
19. Activate the desalting column with 50 µl of methanol. Centrifuge at 1,000 g for 2 min at RT and discard the liquid from the collection vial. Repeat this step two times. CRITICAL: All subsequent centrifugation steps for desalting are for the same duration at the same speed and RT.
20. Wash the desalting column with 50 µl of acetonitrile. Centrifuge at 1,000 g for 2 min at RT and discard the liquid from the collection vial. Repeat this step two times.
39. The desalted peptides can be run for label-free quantitation using the LC (Section A) and MS (Section B) parameters given below. We observed good reproducibility between three technical replicates (see anticipated results below).

A. LC Parameters
i. Take 2 µg of digested peptides and make up the volume to 10 µl.
ii. CRITICAL: The concentration of desalted peptides will be 200 ng/µl.
iii. Place the vials in the auto-sampler stand of the nLC 1200.
iv. Equilibrate the pre-column (Thermo Fisher Scientific, P/N 164564, S/N 10694527) and the analytical column (Thermo Fisher Scientific, P/N ES803A, S/N 10918620) with five column volumes of 0.1% (v/v) FA.
v. Load 1 µg of digested peptides onto the column using the nLC 1200 system.
vi. Set up the LC gradient based on sample complexity. We used a 120 min LC gradient for label-free quantitation of the plasma samples. Brief details of the LC gradient are given in Supplementary Methods 1 and 2.
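The arithmetic behind steps i, ii, and v can be checked with a short sketch. The function names below are illustrative and not part of the protocol; the numbers are the ones stated in the steps (2 µg in 10 µl, 1 µg loaded).

```python
def concentration_ng_per_ul(peptide_ug: float, volume_ul: float) -> float:
    """Peptide concentration in ng/µl for a given mass (µg) and volume (µl)."""
    if volume_ul <= 0:
        raise ValueError("volume must be positive")
    return peptide_ug * 1000.0 / volume_ul


def injection_volume_ul(load_ug: float, conc_ng_per_ul: float) -> float:
    """Volume (µl) to inject so that `load_ug` of peptide reaches the column."""
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return load_ug * 1000.0 / conc_ng_per_ul


# Values taken from the LC parameter steps.
conc = concentration_ng_per_ul(peptide_ug=2.0, volume_ul=10.0)
volume = injection_volume_ul(load_ug=1.0, conc_ng_per_ul=conc)
print(f"concentration: {conc:.0f} ng/µl, injection volume: {volume:.1f} µl")
```

With the stated values this gives 200 ng/µl, so loading 1 µg corresponds to a 5 µl injection.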

B. MS Parameters
i. Open Thermo Scientific Xcalibur software, double click on instrument setup, and select the template from peptides-ID with default parameters (Figure S2).
ii. Populate the MS parameters from Figure S3 and save as a new method (Supplementary Method 3).
iii. Open Thermo Scientific Xcalibur software, double click on sequence setup, and fill in sample details such as sample type, sample name, file save location, instrument method file, injection volume, and sample position.
iv. Select the row and click on run sequence.

Instrument Method Generation for MRM Using Skyline
i. Load the sequences of the synthetic peptides into Skyline and set the peptide and transition settings as shown in Figures S4, S5.
ii. Export the unscheduled transition list as a single method from Skyline (Figure S6A).
iii. Import the unscheduled transition list as an inclusion list in an MRM acquisition method in Xcalibur.

Instrument Method Generation for PRM Using Skyline
i. Load the sequences of the synthetic peptides into Skyline and set the parameters as shown in Figures S4, S5.
ii. Export the unscheduled isolation list as a single method from Skyline (Figure S6B).
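An isolation list is essentially a table of precursor m/z values, so it can be useful to sanity-check what Skyline exports. The sketch below computes monoisotopic precursor m/z from a peptide sequence. It is not Skyline's implementation. The heavy-label type is not stated in this protocol, so the C-terminal 13C6,15N2-lysine (+8.014199 Da) and 13C6,15N4-arginine (+10.008269 Da) shifts are assumptions, as are the charge state and the function name.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276
CARBAMIDOMETHYL = 57.02146  # static modification on cysteine
# ASSUMPTION: heavy peptides carry a labeled C-terminal K or R.
HEAVY_CTERM = {"K": 8.014199, "R": 10.008269}


def precursor_mz(sequence: str, charge: int, heavy: bool = False) -> float:
    """Monoisotopic m/z of a peptide at the given positive charge state."""
    if charge < 1:
        raise ValueError("charge must be >= 1")
    mass = WATER + sum(MONO[aa] for aa in sequence)
    mass += CARBAMIDOMETHYL * sequence.count("C")
    if heavy:
        mass += HEAVY_CTERM.get(sequence[-1], 0.0)
    return (mass + charge * PROTON) / charge


# Peptide sequence from the Results; the 2+ charge state is an assumption.
light = precursor_mz("DPTFIPAPIQAK", charge=2)
heavy = precursor_mz("DPTFIPAPIQAK", charge=2, heavy=True)
print(f"light: {light:.4f}  heavy: {heavy:.4f}  delta: {heavy - light:.4f}")
```

For a doubly charged lysine-terminated peptide, the heavy and light precursors should differ by half the label mass; a mismatch against the exported list would point to a different label or charge state.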

LC Parameters
i. Follow steps 39Ai-39Avi from experiment #1.
ii. We used the same LC gradient for PRM as for MRM.

Set Up Instrument Method for PRM
i. Open Thermo Scientific Xcalibur software, double click on instrument setup, and select the template from MSn with default parameters.
ii. Import the unscheduled isolation list as an inclusion list in a Targeted-MS2 acquisition method in Xcalibur, populate the MS parameters from Figure S8, and save as a new method.
iii. Open Thermo Scientific Xcalibur software, double click on sequence setup, and fill in sample details such as sample type, sample name, file save location, instrument method file, injection volume, and sample position.
iv. Select the row and click on run sequence.
Da) and static modification as carbamidomethyl (+57.021 Da) on cysteine, monoisotopic masses, and trypsin cleavage (max 2 missed cleavages). The peptide precursor mass tolerance was 10 ppm, and the MS/MS tolerance was 0.05 Da. The false discovery rate (FDR) for proteins, peptides, and peptide spectral matches (PSMs) was kept at 1%. The quantification values for proteins were exported from Proteome Discoverer 2.2. The parameters are summarized in Table 1. The .raw files from the label-free method were searched against the same database. Most of the Proteome Discoverer parameters were kept the same as mentioned above for the iTRAQ 4-plex method, except for the dynamic modifications for the iTRAQ reagents (+144.102 Da) on lysines and peptide N-termini and for the TMT reagents (+229.163 Da) on lysines and peptide N-termini.
ii. We normalized the data sets using the total peptide abundance for the identification of differentially expressed proteins. Normalization by total peptide amount is the default option in Proteome Discoverer (v2.2). In this mode, it sums the peptide group abundances for each sample, determines the maximum sum across all files, and calculates the normalization factor from the sum of the sample and the maximum sum across all files.
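The total-peptide-amount normalization described above can be sketched as follows. This is an illustration of the stated logic, not Proteome Discoverer's code. The assumption that the factor is the maximum sum divided by the sample sum, as well as the function names and abundance values, are ours.

```python
def total_amount_factors(abundances: dict[str, list[float]]) -> dict[str, float]:
    """Per-sample factor = (largest total abundance) / (this sample's total).

    ASSUMPTION: the factor is the ratio max_sum / sample_sum; the text only
    says it is derived from these two quantities.
    """
    sums = {sample: sum(values) for sample, values in abundances.items()}
    if any(total <= 0 for total in sums.values()):
        raise ValueError("every sample needs a positive total abundance")
    max_sum = max(sums.values())
    return {sample: max_sum / total for sample, total in sums.items()}


def normalize(abundances: dict[str, list[float]]) -> dict[str, list[float]]:
    """Scale every peptide-group abundance by its sample's factor."""
    factors = total_amount_factors(abundances)
    return {
        sample: [value * factors[sample] for value in values]
        for sample, values in abundances.items()
    }


# Hypothetical peptide-group abundances for three samples.
raw = {
    "A": [100.0, 200.0, 300.0],
    "B": [50.0, 100.0, 150.0],
    "C": [120.0, 240.0, 240.0],
}
factors = total_amount_factors(raw)
normalized = normalize(raw)
print(factors)
print({sample: sum(values) for sample, values in normalized.items()})
```

After scaling, every sample has the same total abundance, which is the property this normalization is meant to enforce.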
The users may also use additional data normalization in subsequent steps. There are several normalization approaches, including central tendency, linear regression, locally weighted regression, quantile techniques, and others (33). The normalization methods are evaluated in terms of their ability to reduce variation between technical replicates. Although all these methods can reduce the systematic bias to some extent, each approach has its own advantages and disadvantages (33)(34)(35). Therefore, the selection of the normalization approaches also depends on the experimental designs and type of data sets.
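As one hedged illustration of a central-tendency approach and of the evaluation criterion mentioned above (reduced variation between technical replicates), the sketch below median-centers log2 intensities and compares the average between-replicate standard deviation before and after. The data and function names are hypothetical, and real data sets would also need missing-value handling.

```python
import statistics


def median_center(log2_columns: dict[str, list[float]]) -> dict[str, list[float]]:
    """Subtract each replicate's median so that all replicates share median 0."""
    centered = {}
    for name, values in log2_columns.items():
        med = statistics.median(values)
        centered[name] = [v - med for v in values]
    return centered


def mean_replicate_sd(log2_columns: dict[str, list[float]]) -> float:
    """Average, over peptides, of the SD across replicate columns."""
    rows = zip(*log2_columns.values())  # one row per peptide
    return statistics.mean(statistics.stdev(row) for row in rows)


# Hypothetical log2 intensities; R2 and R3 carry a constant loading bias.
replicates = {
    "R1": [3.0, 4.0, 5.0],
    "R2": [4.0, 5.0, 6.0],
    "R3": [3.5, 4.5, 5.5],
}
before = mean_replicate_sd(replicates)
after = mean_replicate_sd(median_center(replicates))
print(f"mean replicate SD before: {before:.3f}, after: {after:.3f}")
```

In this toy case the bias is a pure per-replicate offset, so median centering removes it completely; real systematic bias is rarely this simple, which is why the choice of method depends on the data.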

Targeted Proteomics Data Analysis
The steps for data analysis of MRM and PRM are the same. We performed data analysis using Skyline (Skyline-daily 19.1.9.350).
i. Open the Skyline document containing the list of transitions.
ii. Click on import results, located under the file tab, as shown in Figure S9A.
iii. Locate the folder containing the results and upload all the files at once. A window like the one shown in Figure S9B will appear.
iv. Once the import is completed, look at the retention times of the peaks that Skyline detects automatically. To ensure that the right peak has been detected, go to the "View" tab and select replicate comparison under the retention time option.
v. Correct the retention times of peptides that have been wrongly annotated by Skyline.
CRITICAL STEP: Consider the dotp values and the shape and intensity of the peak, among other parameters, while deciding on the right peak. The re-annotation involves dragging the mouse cursor below the X-axis from the start time to the end time of the eluted peak.
vi. Once the re-annotation is complete and the areas of all the peaks have been corrected, save the document.
vii. Export the data and perform statistical data analysis.
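To give an intuition for what a dot-product score measures when judging a peak, the sketch below computes a plain cosine similarity between library fragment intensities and observed transition peak areas. This is a simplified stand-in: Skyline's own dotp calculation may apply additional transformations, so values will not necessarily match what Skyline reports. The intensities are hypothetical.

```python
import math


def cosine_similarity(library: list[float], observed: list[float]) -> float:
    """Normalized dot product of two equally ordered intensity vectors."""
    if len(library) != len(observed):
        raise ValueError("vectors must list the same transitions in the same order")
    dot = sum(a * b for a, b in zip(library, observed))
    norm = math.sqrt(sum(a * a for a in library)) * math.sqrt(sum(b * b for b in observed))
    if norm == 0:
        return 0.0
    return dot / norm


# Hypothetical relative intensities for three transitions of one peptide.
library_intensities = [3.0, 2.0, 1.0]
good_peak_areas = [6.0, 4.0, 2.0]       # same proportions as the library
interfered_peak_areas = [1.0, 2.0, 6.0]  # proportions distorted

print(f"good peak: {cosine_similarity(library_intensities, good_peak_areas):.3f}")
print(f"interfered peak: {cosine_similarity(library_intensities, interfered_peak_areas):.3f}")
```

The score depends only on the relative proportions of the transitions, not on absolute intensity, which is why peak intensity and shape are listed as separate criteria in the critical step.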

RESULTS
One of the major challenges of cancer plasma proteomics has been its inability to discover markers with clinical implications. However, improvements in instrumentation and mass spectrometry-based platforms have contributed to the revival of plasma proteomics (36-39). Currently, several proteomics techniques are being used for MS-based quantitation of plasma proteins in different cancer projects. This study provides a complete proteomics workflow for the discovery and validation of potential biomarker candidates from plasma samples using mass spectrometry. Additionally, the study provides an optimized sample preparation strategy to obtain good coverage of the plasma proteome, which is essential for cancer biomarker discovery projects.
We used a 120 min LC gradient for label-free quantitation (Figure S10A) and detected 2332 peptides corresponding to 241 proteins with at least one unique peptide at 1% FDR (Table S1). We identified 183 proteins common to all three samples (Figure 2A). The heatmap and correlation matrix indicate high consistency (Pearson r > 0.99) (Figure 2B and Figure S10B) between the technical replicates (R1, R2, and R3) of the different biological samples (Samples A, B, and C). On average, 965 peptides and 170 proteins had a coefficient of variation (CV) below 20% (Figures 2C,D). In the iTRAQ experiment, we labeled varying amounts of peptides (100, 50, 25, and 12.5 µg) with iTRAQ reagents to determine the minimum amount of peptide to be labeled, and observed that a minimum of 50 µg of peptide is needed for a good quantitative proteomics experiment (Figure S11). However, the number of proteins identified in the 114-labeled sample was lower than in the other three labels, i.e., 115, 116, and 117. This could be a result of labeling a considerably lower amount of peptide with the 114 label than with the other three labels. Around 219 proteins were identified and quantified using iTRAQ-based multiplexed quantitative proteomics (Figure 3A, Table S2). In the TMT experiments, we identified 376 proteins, and 182 proteins were common across all three quantitative proteomics techniques (LFQ, iTRAQ 4-plex, and TMT 6-plex) (Figure 3B, Table S3). Studies performing in-depth comparisons of label-free and label-based quantitation (37,40-42) are also available. We observed a slight increase in protein identifications using fractionated samples (six fractions) in the TMT 6-plex experiment in comparison to label-free quantitation and iTRAQ 4-plex, with a 43.3% overlap between the proteins identified using all three approaches (Figure 3B).
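The two reproducibility metrics used above, the CV across technical replicates and the Pearson correlation between replicates, can be computed as in the following sketch. The peptide names and intensities are hypothetical; only the 20% CV threshold comes from the text.

```python
import math
import statistics


def cv_percent(values: list[float]) -> float:
    """Coefficient of variation (%) using the sample standard deviation."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return 100.0 * statistics.stdev(values) / mean


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equally long vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical intensities: peptide -> [R1, R2, R3].
intensities = {
    "pep1": [100.0, 110.0, 90.0],
    "pep2": [50.0, 100.0, 150.0],
    "pep3": [200.0, 210.0, 190.0],
}
cvs = {pep: cv_percent(vals) for pep, vals in intensities.items()}
n_below_20 = sum(1 for cv in cvs.values() if cv < 20.0)
r1 = [vals[0] for vals in intensities.values()]
r2 = [vals[1] for vals in intensities.values()]
print(cvs, n_below_20, round(pearson_r(r1, r2), 3))
```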
Further, LFQ provides the flexibility to process and run clinical samples as and when they become available, generating individual datasets. The resulting peptide/protein datasets can be analyzed in different contexts based on IHC, radiology, and other known clinical parameters to address various clinical questions in cancer biology.
The recent developments in the field of targeted proteomics show promise in bridging the gap between biomarker discovery and validation of the potential biomarkers (15,30,43). We have provided here a workflow for targeted proteomics using    retention times and peak areas across various dilutions for the peptide DPTFIPAPIQAK as observed in the MRM and PRM experiments (Figures 4C,D). The intensity of the synthetic peptides measured by MRM and PRM correlated with the levels of synthetic peptides spiked into samples A, B, and C (Figures 4E,F). We monitored the levels of a few potential cancer biomarkers in plasma samples using the MRM and PRM approaches. The peptide AGALNSNDAFVLK from Gelsolin-1 and the peptide SGLSTGWTQLSK from Alpha-1B-glycoprotein showed a good response (intensities of 10^3 in MRM and 10^6 in PRM) and a good spectral match with the library (dotp value > 0.93) in both targeted approaches (Figures 5A,B).
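One simple way to check that measured intensity tracks the spiked-in amount is an ordinary least-squares fit of peak area against amount. The sketch below is illustrative only: the spike amounts and peak areas are hypothetical, and the function name is ours.

```python
def linear_fit(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Least-squares slope, intercept, and R^2 for y as a function of x."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical dilution series: relative spiked amount vs. integrated peak area.
spiked_amount = [1.0, 2.0, 4.0, 8.0]
peak_area = [10.0, 20.0, 40.0, 80.0]
slope, intercept, r_squared = linear_fit(spiked_amount, peak_area)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.3f}")
```

An R^2 close to 1 with a near-zero intercept would be consistent with the linear response described above; curvature at the low or high end would suggest the assay's working range has been exceeded.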

DISCUSSION
Quantitative approaches involving ultra-sensitive mass spectrometers, often presented as the pinnacle of promising proteomics technologies, have been among the most widely used approaches in biomarker discovery in recent years. The integrated quantitative proteomics pipeline combining global and targeted approaches described here could be extremely useful for cancer biomarker discovery and validation in plasma samples without the need for a separate immunoassay-based validation method.
Preanalytical variables introduced during blood collection, plasma separation, and storage can adversely influence the quantification of proteins in plasma samples (44), and thereby the outcome of the overall analysis. Potential cancer biomarkers are often very low-abundance proteins, and the number of detectable proteins is restricted by the complexity of the plasma or serum proteome (6,45,46). Comprehensive coverage of the plasma proteome therefore requires extensive depletion of the high-abundance proteins and fractionation, which adds substantial experimental time and cost to such quantitative proteomics workflows. In general, the establishment of any clinically relevant protein biomarker panel requires analysis of large clinical cohorts, including multiple types of control populations (2,23), which is even more crucial for cancer biomarker projects due to inter- and intra-tumoral heterogeneity. However, the sample throughput of discovery phase quantitative proteomics is still moderate and not sufficient to satisfy this need (47). Finally, due to the requirement for sophisticated instrumentation and experienced personnel, such MS-based quantitative proteomics workflows are not suitable for routine screening of blood samples in clinical setups.
Analysis of the plasma proteome using two complementary quantitation methods, as described here, provided satisfactory coverage. Despite advancements in biomarker discovery, there is still no consensus on whether pooling serum samples for shotgun proteomics experiments is always advisable in the discovery phase. While many studies have used serum pooling as a strategy for cancer biomarker discovery (48-51), other studies advocate against it (52,53). Pooling of clinical samples is often practiced in quantitative proteomics analysis when large numbers of samples need to be studied or when there is not an adequate amount of each sample for individual analysis. If sample pooling is performed during the discovery phase of the analysis, it is essential to validate the results in individual diseased and control samples selected randomly from the pooled populations.
In this workflow, the discovery phase experiments [label-free (LFQ) and label-based (iTRAQ or TMT)] were performed using an Orbitrap Fusion instrument. The targeted (validation) experiments were performed using two different platforms: multiple reaction monitoring (MRM) using a triple quadrupole instrument, and parallel reaction monitoring (PRM) using an Orbitrap Fusion instrument. These two techniques are based on similar principles, and the choice of method largely depends on the type of instrument available to the users. The plasma abundance of a potential cancer biomarker, Alpha-1B-glycoprotein, was monitored in the pooled samples and further validated in individual samples using MS-based targeted approaches (Figure S13). Using this integrated quantitative proteomics workflow, we were also able to quantify relatively low-abundance plasma proteins (Figure S14). The targeted approaches were found to be superior in quantification accuracy to the shotgun proteomics approaches. While MRM experiments can be carried out on low-resolution instruments such as the triple quadrupole LC-MS (QqQ LC-MS), PRM experiments require high-resolution LC-MS instruments with the QTOF or Q-Orbitrap configuration. Taken together, we conclude that plasma proteomics-based cancer biomarker projects could benefit considerably from the detailed quantitative and targeted proteomics workflows provided in this study. We have demonstrated multiple possible quantitative approaches in the discovery and validation phases of this combined workflow, but not all of the methods need to be performed together. Different combinations of these discovery and validation phase approaches could be implemented in biomarker research.
Selection of the specific label-based or label-free quantification approach for discovery workflow and MRM or PRM for targeted workflow may depend on the key biological question to be addressed, number of samples, and availability of MS instruments and resources.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ebi.ac.uk/pride/archive/, PXD017834, http://www.peptideatlas.org/ (54), PASS01619.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by institutional review boards and ethics committee of the Indian Institute of Technology Bombay (IITB-IEC/2016/026). The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
VK, SS, and SR conceived and designed the experiments. VK performed the MS-based quantitative proteomics experiments and data were analyzed by VK, SR, and SG. The manuscript was written by VK, SR, SG, and SS. All authors agreed on the interpretation of data and approved the final version of the manuscript.