Differential Gene Expression Landscape of Co-Existing Cervical Pre-Cancer Lesions Using RNA-seq

Genetic changes occurring in different stages of pre-cancer lesions reflect causal events initiating and promoting the progression to cancer. Co-existing pre-cancerous lesions including low- and high-grade squamous intraepithelial lesion (LGSIL and HGSIL), and adjacent “normal” cervical epithelium from six formalin-fixed paraffin-embedded samples were selected. Tissues from these 18 samples were isolated using laser-capture microdissection, RNA was extracted and sequenced. RNA-sequencing generated 2.4 billion raw reads in 18 samples, of which ~50.1% mapped to known and annotated genes in the human genome. There were 40 genes up-regulated and 3 down-regulated (normal to LGSIL) in at least one-third of the sample pairs (same direction and FDR p < 0.05) including S100A7 and KLK6. Previous studies have shown that S110A7 and KLK7 are up-regulated in several other cancers, whereas CCL18, CFTR, and SLC6A14, also differentially expressed in two samples, are up-regulated specifically in cervical cancer. These differentially expressed genes in normal to LGSIL progression were enriched in pathways related to epithelial cell differentiation, keratinocyte differentiation, peptidase, and extracellular activities. In progression from LGSIL to HGSIL, two genes were up-regulated and five down-regulated in at least two samples. Further investigations using co-existing samples, which account for all internal confounders, will provide insights to better understand progression of cervical pre-cancer.


INTRODUCTION
Squamous cell carcinoma of the cervix results from a sequence of well delineated non-invasive pre-cancer stages. Based on the cytological findings of Bethesda Classification system (1), these pre-cancer stages are classified as low-grade squamous intraepithelial lesion (LGSIL), consisting of cytological atypia and histological cervical intraepithelial lesion (CIN) 1, and high-grade squamous intraepithelial lesion (HGSIL), consisting primarily of CIN 2-3 plus carcinoma in situ (2,3). In 2014, it is estimated that 12,360 cases of invasive cervical cancer will occur in the United States; however, 1.25 million women are expected to be diagnosed with pre-cancer by a Papanicolaou (Pap) test (4). Similarly,~600,000 cervical cancers are expected worldwide, with epidemic proportions of pre-cancers, mostly undiagnosed (5). Studies of cervical disease progression (6) suggest that lesions in~60% of women with LGSIL will spontaneously regress, another 20-30% will persist unchanged, about 5-10% will progress to high-grade HGSIL, and only 1% will develop invasive carcinoma (6,7). The likelihood of HGSIL regression is 33%; progression to invasion is >12% (6). Persistent high-risk HPV (HR-HPV) infection (8), high-viral load (9), and integration of HPV DNA (10) are likely markers or determinants of progression of pre-cancer lesions to cervical cancer; however, host factors have not been thoroughly studied. While genetic alterations in cancer are common, changes found in different grades of pre-cancer lesions are more likely to reflect causal events initiating and promoting the progression to cancer, yet little is known about these genomic changes that occur.
Some experts arguably consider that it is reasonable standardcare to follow low-grade pre-cancer lesions and HPV infections without active treatment. Since HPV infection and LGSIL are diagnosed in epidemic proportions, novel biomarkers with higher specificity for cervical lesions would improve cervical cancer screening. Better early detection biomarkers would also greatly assist in the stratification of patients for chemoprevention trials of pre-neoplasia. Currently, there are no validated diagnostic or prognostic biomarkers that identify LGSIL destined toward HGSIL or cervical cancer. Since HGSIL are near-obligate precursors of cervical cancer, it is standard clinical practice to use invasive surgical interventions to reduce the burden of progression to cancer. The detection of LGSIL that may progress to HGSIL and the prevention of this progression is an important and suitable goal for noninvasive medical intervention to reduce the incidence of cervical cancer.
Gene expression studies are quite sparse for HPV-related cervical dysplasia. Squamous cell carcinoma of the cervix results from a sequence of well delineated non-invasive pre-cancer stages. Quantifying the decisive physical changes, i.e., differential function and expression of genes in the co-existing "normal" cervical epithelium and neighboring pre-cancerous lesions; low-grade squamous intraepithelial lesions (LGSIL) and high-grade squamous intraepithelial lesions (HGSIL) will elucidate when, where, and to what extent genomic variations facilitate development and progression of pre-cancer. Since LGSIL and HGSIL lesions are generally related when they co-occur, these types of samples provide an opportunity to assess morphologic progression with regard to space and time, while controlling for internal confounders (11). Specifically, RNA sequencing (RNA-seq) can examine expression patterns of genes from formalin-fixed paraffin-embedded (FFPE) tissues, which can be used to study pre-cancer progression in co-existing LGSIL and HGSIL tissues (12). In the current study, we utilized laser-capture microdissection to extract specific tissues from co-existing neoplastic stages (normal, LGSIL, and HGSIL) on FFPE samples from six women who underwent loop electrosurgical extraction procedures (LEEP), and performed innovative RNA extraction and sequencing (RNA-seq) technologies to enable comprehensive gene expression profiling of selected cell types for comparison within and between individuals.

STUDY POPULATION AND PATIENT SAMPLE SELECTION
The patient samples were obtained from the University of Alabama at Birmingham (UAB) Comprehensive Cancer Center (UAB-CCC) tissue procurement shared facility (TPSF) where standard protocols are followed to routinely collect cervical samples from the UAB colposcopy clinic and preserve as FFPE tissues (13). Since formalin fixation methodology, which affects nucleic acid integrity, can vary among hospital laboratories, we used only samples from women that received LEEP treatment at the UAB colposcopy clinics during June 2010 to April 2012 for abnormal cytology. A UAB pathologist prospectively reviewed cervical tissue samples from~850 women, aged 20-25 years old to confirm if they had a co-existing spectrum of normal and pre-cancer LGSIL and HGSIL on the sample block (Figure 1). If co-existing lesions (LGSIL and HGSIL) were found for LEEP treatments during that period, women must have had HGSIL (CIN2/3) confirmed biopsies within 6 months of their treatment. Samples with evidence of immunosuppression or HIV infection were excluded due to known differences in rates of cervical abnormalities and pathogenesis (14,15). Following these criteria, we identified 10 FFPE blocks from European American women ( Figure S1 in Supplementary Material) of whom 6 were similar in relation to information on referral, cytology, demographic characteristics, and subsequent histologic biopsy ( Table 1) are used for the study. Specific laboratory methods (Figure 1) were followed as specified below. The study protocols using these samples conformed to human-experimentation guidelines set forth by the United States Department of Health and Human Services and was reviewed and approved by institutional review board (IRB) at UAB.

SLIDE PREPARATION
Formalin-fixed paraffin-embedded blocks from six participants were used for slide preparation and microdissection. First, the pathologist, an expert in gynecologic lesion analysis, identified and marked 12 pre-cancerous lesions (LGSIL and HGSIL in each) and 6 co-existing normal tissues, respectively, to ensure the correct cell populations. Then, depending on the size of the tissues on the block, two to six 10 µm sections from the FFPE blocks were cut using RNAse-free blades at each area identified by the pathologist and mounted on the coated slide (Leica Microsystems, Buffalo Grove, IL, USA). To prevent contamination, a new blade and a different slide was prepared for each cell type (normal, LGSIL, and HGSIL) and captured separately. The samples were heated in a water bath (58°) for 2 min and the paraffin debris was carefully scraped off. No staining of samples was performed since it can further degrade RNA in FFPE tissues.

MICRODISSECTION METHOD
To reduce sample degradation, all laboratory procedures including sample preparation, microdissection, and RNA extraction were performed within a 24 h period. Microdissection of the 18 specific tissues (adjacent normal, LGSIL, and HGSIL) from 6 participants was performed to reduce possible contamination of different cell types in heterogeneous tissue, which could mask the detection of gene expression alterations in the cells of interest. Slide preparation, laser-capture microdissection (LCM), and RNA extraction was conducted at LifeSpan Molecular Pathology Core Services lab in Providence, Rhode Island. LCM was conducted using the Arcturus Veritas Laser-Capture Microdissection (Mountain View, CA, USA) to isolate individual cell populations from distinct normal and pre-cancer LGSIL and HGSIL cells, which were then placed Frontiers in Oncology | Cancer Genetics on CapSure HS LCM Caps using the UV capture laser protocols. Adjacent stained slides were used as a roadmap to determine which areas would be collected. The relatively low intensity of the laser does not damage DNA, RNA, or proteins in the captured cells (16). Approximately 10,000 cells were collected per tissue sample onto a single LCM cap. Each population of captured cells was re-examined by microscopic visualization for confirmation, and before and after images of each sample were taken ( Figure  S1 in Supplementary Material) for additional confirmation by a gynecologic pathologist specialized in cancers.

RNA EXTRACTION
Total RNA was extracted from all 18 microdissected tissues from 6 to 10 LCM using the Qiagen AllPrep DNA/RNA Micro Kit (Qiagen, CA, USA), according to the manufacturer's instructions, with extended proteinase K digestion. Prior to amplification, the quality and level of degradation of the extracted RNA was assessed with RIN (RNA integrity number) assigned by the Agilent 2100 Bioanalyzer instrument using the RNA 6000 Pico kit (Agilent Technologies, Santa Clara, CA, USA). All 18 samples were stored in nucleasefree tubes and stored at −80°C until shipment for analysis.

RNA PRE-PROCESSING, LIBRARY PREPARATION, AND SEQUENCING
To increase low-RNA yields, the Nugen Ovation © FFPE RNA-Seq System (Nugen, CA, USA) was used to amplify cDNA from total RNA for transcriptome sequencing. Amplification was initiated at the 3 end and also randomly throughout the whole transcriptome in the sample. As described in user's manual, amplification of both mRNA and non-polyadenylated transcripts makes the Ovation RNA-Seq System (Nugen, San Carlos, CA, USA) ideal for amplification prior to next generation sequencing (NGS

DATA PRE-PROCESSING AND ALIGNMENT OF SEQUENCED READS
FASTQ sequence files were generated from the raw base-call data for all 18 samples using CASAVA 1.8 (Illumina, CA, USA), the standard data processing package from Illumina. De-multiplexing of sample indices was set with one mismatch tolerance to separate the two samples within each lane. Quality assessment of raw FASTQ reads was performed using the FASTQC program, as previously described (17).
The paired end reads were aligned against the Ensembl GRCh37.62 B (hg19) reference genome using 150 as the mate pair means inner distance and the pre-set settings in TopHat v2.0.6. The alignment quality and distribution of the reads were estimated using SAMTools v1.18 (Illumina, CA, USA). The reference genome-guided transcript assembly of the aligned reads was performed using quartile normalization, bias correction, and the default assembly settings for Cufflinks v2.1.1. The transcript abundance was calculated by estimating the fragments per kilobase of exon per million mapped fragments (FPKM), and all expressed transcripts were binned (0, <1, 1-10, and >10) on the basis of their abundance (FPKM).

DATA VISUALIZATION AND CLUSTERING ANALYSIS
The similarity of the relative gene transcript abundances (using Log2 transformed values of FPKM) for each of the 18 samples were compared using Spearman's correlation calculation in SAS and an unsupervised hierarchical clustering analysis in Gene-E, a visualization tool developed by the Broad Institute (http://www. broadinstitute.org/cancer/software/GENE-E/). The pairwise comparisons of FPKM between normal to LGSIL and LGSIL to HGSIL in each sample were compared between the tissues from these three stages and visualized as stacked histograms by genomic location in each chromosome using Circos plots (18).

DIFFERENTIAL GENE EXPRESSION ANALYSIS
UCSC known genes annotation was used to assess differentially expressed genes for the pre-cancer stage effect using CuffDiff tools (19) in two different methods: (1) grouping all six samples by www.frontiersin.org their tissue classification and comparing the geometric averages progressively between normal vs. LGSIL and LGSIL vs. HGSIL; and (2) comparing the geometric averages progressively within each individual separately (six pairs for normal to LGSIL and six pairs for LGSIL to HGSIL) using CuffDiff, and then merging the results across all individuals. Because of the small sample size and unique sample type, both methods were assessed to fully capture the variability in the structured experimental design. To account and adjust for multiple testing, the FDR q-values were calculated by CuffDiff from the raw p-values, which are estimated using beta negative binomial tests of variance in read counts (19). Transcripts were considered to be differentially expressed if their expression values (log2) differed by a factor of 1.5 and FDR < 0.05. For FPKM = 0, zero to positive KPKM was considered positive infinity and positive KPFM to zero was considered negative infinity (Tables S3A,B in Supplementary Material). Volcano plots of -log 10 (p-value) vs. Log2 (FPKM) fold-change were made to examine these association in each tissue pair within each individual.

PATHWAY ENRICHMENT ANALYSIS
WEB-based GEne SeT AnaLysis Toolkit (WebGestalt) was used for gene ontology (GO) analysis to identify pathways that were enriched in all significant gene lists by each of the progression stage pairs (LGSIL vs. Normal, HGSIL vs. LGSIL) (20). Only genes that were differentially expressed in at least two participant sample comparisons (both in direction and significance at FDR q ≤ 0.05) were included in the analysis. Statistical significance was estimated using the hypergeometric test and GO categories were considered to be significant if the pathway included at least five genes and the adjusted p-value of enrichment using a Benjamini-Hochberg FDR correction was ≤0.05.

RESULTS
The mRNA sequencing of the co-existing spectrum of normal and pre-cancer LGSIL and HGSIL on the sample block from 6 patients (18 samples) generated a total 2.4 billion raw reads, ranging from 264 to 470 million reads per person (Figure 2; Table S1 in Supplementary Material). The overall raw read mean quality score was high (mean Phred Quality Score = 36.76) with 93.5% of bases above Q30. Among the 2.4 billion high-quality raw reads, 50.1% of the reads were mapped to the human genome with known gene annotations. Of the 29,061 genes across 22 autosomal and X chromosomes, 3727 of them had zero FPKM in all samples so 25,334 genes were examined for differential gene expression (Table S2 in Supplementary Material).
As shown in the unsupervised hierarchical clustering dendogram ( Figure S2A in Supplementary Material), most of the samples from each cervical pre-cancer stage clustered together by sample and then by stage in our analysis. Sample #111 showed the most differences within the three stages and sample #115, which clustered all three stages, seemed to be in a different branch. Overall, more genes were up-regulated during progression from normal to LGSIL, with samples #115 and #116 having the largest numbers   Table S3A in Supplementary Material). On the other hand, more genes were down-regulated during progression from LGSIL to HGSIL, with samples #111 and #115 having the largest number of genes (Figure 2; Table S3B in Supplementary Material). Down-regulation of genes is a hallmark signature of advanced precancer progression to cancer. One patient sample (#115), which consistently had a larger number of differentially expressed genes in both progression from normal to LGSIL and LGSIL to HGSIL, was one of two samples that displayed moderate dysplasia (CIN2).
None of the 25,334 genes were differentially expressed after stringent statistical correction (FDR p ≤ 0.05), using the first method when the analysis performed was combined for all merged samples (by pre-cancer stage) (Tables S3A,B in Supplementary Material). However, three genes (C1orf120, EPGN, and NDRG4) were differentially expressed (FDR q ≤ 0.20) during progression from normal to LGSIL and six genes (ADH7, AX746562, CRNN, FABP, SLN, and SPRR3) were differentially expressed (FDR q ≤ 0.20) during progression from LGSIL to HGSIL as listed in Table 2 (all results in Tables S3A,B in Supplementary Material).
The main analysis using the second method was focused on the differentially expressed genes (FDR p ≤ 0.05) identified by pairwise comparison of each sample separately using CuffDiff ( Figure  S3 in Supplementary Material). Between 6 patient samples of normal to LGSIL pairs, there were a total of 211 genes down-regulated and 429 up-regulated ( Figure 3A). Of these, S100A7 and KLK6 were differentially up-regulated in 3 of 6 patient samples (FDR q ≤ 0.05); whereas, CCL18, FLT3, and RORC were down-regulated and 38 genes were up-regulated in 2 of 6 patients ( Figure 3B). In progression to high-grade pre-cancer (LGSIL to HGSIL), although there were a total of 187 different genes up-regulated in various samples, the majority (165) were present in one patient sample (#115), which displayed moderate dysplasia. Overall, C12orf63, KRTDAP, SBSN, FABP4, CBLN1 were down-regulated whereas CTCFL and PLAC8L1 were up-regulated in at least two of six samples ( Table 3). When comparing the progression from LGSIL to HGSIL in these six patient samples, seven genes were differentially expressed in at least two samples ( Table 3).

DISCUSSION
We used next generation RNA-seq methods to study the transcriptomic landscapes of co-existing pre-cancer lesions and understand the mechanism of pre-cancer progression from normal to LGSIL and LGSIL to HGSIL in six individuals. To our knowledge, there have been no published reports of assessment of transcriptomic expression patterns from co-existing spectrum of neoplasia from a single sample. In most studies, samples from different individuals or from different times from the same individual are compared. A

Common pairs Normal-to-LGSIL (q-value)
LGSIL-to-HGSIL (q-value) major advance in functional genomic investigations is the use of NGS, RNA-seq with ex vivo-derived genetic material originating from morphologically distinct cellular subpopulations within tissue. The first application combining LCM and cDNA microarray technologies to analyze gene expression in breast cancer specimens was reported in 1999 (22). Since then, most investigations of changes in gene expression associated with the progression stages in cancer have been targeted for breast cancer, prostate cancer, lung cancer, and colorectal cancer (23)(24)(25)(26)(27)(28)(29)(30). Reports of gene expression studies of cervical cancer are sparse, specifically for pre-cancer lesions (31)(32)(33).

3/6 Pairs
Using gene expression transcripts clustered by pre-cancer lesion grades, several genes were either down-or up-regulated during the progression process in each individual (Figures 3A,B). However, Frontiers in Oncology | Cancer Genetics 43 genes were differentially expressed (in the same direction and statistically significant) in the samples from normal to LGSIL and 7 genes were differentially expressed from LGSIL to HGSIL in 2 or more individuals of the 6 participants ( Table 3). In particular, S100A7 and KLK6 genes were differentially expressed during progression from normal tissue to LGSIL in 3/6 samples and could be key members of signature networks in cervical pre-cancer progression. The KLK6 gene was differentially up-regulated in 3/6 LGSIL samples compared to HGSIL; whereas, KLK10, 11, 12, and 13 were also up-regulated in 2/6 samples. The KLK gene family is a member of the protease clan PA, protease family S1 with subfamily A, located on chromosome 19q13.3-q13.4. KLK6 along with 10 and 11 and 13 are emerging biomarkers for ovarian and cancer (34). Moreover, KLK6 has been differentially expressed in breast, uterine, and colon cancers (35)(36)(37). The S100A7 gene, also known as psoriasin, is located within the epidermal differentiation complex on human chromosome 1q21 (38) and plays an important role as an immunomodulatory protein in skin (39). Increased S100A7 expression has been reported in several epithelial malignancies, including head and neck squamous cell carcinoma (HNSCC) and oral dysplasia (40), which shares many features, including HPV infection, with cervical SCC, as well as skin, bladder, breast cancer, and adenocarcinomas of the stomach (41). However, we did not test for HPV and could not assess the interactions of these genes with HPV.
Three other genes are also noteworthy, chemokine (C-C motif) ligand 18 (CCL18) (42), cystic fibrosis transmembrane conductance regulator (CFTR) (43), and solute carrier family 6 member 14 (SLC6A14), as they have been reported to be up-regulated in cervical cancer malignancy and in the same direction as in progression from normal to LGSIL in our study (44). CCL18 is known to be involved in adaptive immune system and its role in cancer is not fully known but is best known for inducing metastasis of breast cancer cells by binding to PITPNM3 (45). CFTR, found in epithelial cells functions as a cAMP-activated ATP-gated anion channel and SLC6A14 functions as Na+ and Cl− dependent neutral and basic amino acid transporter, but their roles in cancer are not known. Of note, two genes ZBTB7C and GALNT5 were differentially expressed in the same direction in two samples but opposite in the third in normal to LGSIL progression. The pathway analysis of these genes provided further insight into the expression profiling of genes involved in pre-cancer progression. The functions of the genes identified in the progression from normal to LGSIL were enriched in the biological processes and pathways of keratinocyte differentiation, epidermis development, peptidase activities, and extracellular region activities.
Inclusion of pre-cancer tissues from the same individual with respect to time and space is a major strength of this approach where all internal confounders are controlled for in determining signature gene expression patterns. However, there might be residual heterogeneity in samples between patients ( Figure  S2 in Supplementary Material); thus, the analysis from progression within each individual might be more informative than the combined analysis taking the mean or median values across different stages from different individuals. For instance, four patient samples had ICD9 code 233.1 (carcinoma in situ) ( Table 1) and three of them (samples #109, #110, and #116) had S100A7 genes up-regulated and another set (samples #110, #111, and #116) had KLK6 up-regulated in normal to LGSIL pairs. Samples #113 and #115 may have differing cellular makeups since their HGSIL were diagnosed as moderate dysplasia (CIN2) (ICD9 code 622.12). Microdissection methods create a homogeneous sample, but any error in procedures can include adjacent heterogeneous tissues. For example, in samples #109 and #110, HGSIL cells were adjacent to normal samples ( Table 1) and contaminated samples can also misclassify the expression patterns. Also, microarray technologies typically require fresh tissues; however, FFPE tissues have been regularly used in clinical research. With careful selection of samples, such as the age of the sample, length, and conditions of storage (13,46), gene expression analyses of FFPE tissues have shown comparable results to frozen tissues (47)(48)(49). In addition, various new technologies have proven that RNA acquired by lasercapture microdissection from FFPE samples also yields reliable microarray and NGS data (50). Our use of LCM to separate a heterogeneous sample as well as presence of common gene signatures for cervical pre-cancer lead us to have more confidence in the phenotype and therefore results. Utilization of FFPE blocks in such genomic approaches is innovative as they can provide tremendous resources and opportunities for epidemiological, basic, and translational studies.
While effective prophylactic vaccines are available (for HPV types 6, 11, 16, and 18), there is no treatment for infected individuals and logistic issues of vaccine delivery still exist. Thus, discovering biomarkers that are associated with the progression of pre-cancer lesions and cancer are extremely important. We are cognizant that our study is limited by number of patient samples and our findings need to be replicated in larger study. Thus, it is unknown if some of these gene patterns observed might be rare and specific to individual patients and thus may not be used as universal biomarker for progression. However, our approach of using co-existing samples is novel and may enable accurate aggregate genomic information regarding the potential mechanism or pathway of progression, since all known and unknown confounding factors are controlled for by internal comparisons. The identified differentially expressed genes (validated in two or three samples), specifically S100A7, KLK6, CCL18, CFTR, and SLC6A14, with reports of involvement in other cancers or novel ones with potentially related biological pathways, represent targets for understanding mechanisms of pre-cancer progression.
Since complex processes regulate gene expression, it is very likely that not all important transcriptome genes were identified and conversely that some gene expression changes identified in this study may not later be confirmed, and therefore, the results of this study should be replicated in larger studies. Additionally, we chose to only assess differential gene expression in co-existing lesions although there are other processes and markers of expression including RNA-splicing and transcript expression that were not assessed in the analysis. Despite these limitations, we successfully designed and analyzed a novel application of RNA-Seq technology to identify and annotate genes and networks that may be present in cervical co-existing lesions. Future studies can examine other complexities of the transcriptome using the RNA-seq data including splice junctions, fusion, allelic variants including somatic mutations, and HPV integration that could aggregately either www.frontiersin.org explain these differential gene expression patterns or progression of pre-cancer.

AUTHOR CONTRIBUTIONS
Sadeep Shrestha directed all aspects of the cervical pre-cancer transcriptome project. Kathryn E. Royse and Sadeep Shrestha designed the experiments and wrote the main manuscript text with help from Degui Zhi and Christine F. Skibola. Degui Zhi led the overall data QC and analyses with assistance from Kathryn E. Royse, Vinodh Srinivasasainagendra, Laura Kelly Vaughan, and David K. Crossman. Michael G. Conner helped identify the samples and pre-cancer lesions, Buffie Clodfelder-Miller provided technical assistance with microdissection protocol, and Shawn Levy directed the RNA sequencing experiments.