Cell-of-Origin Subtyping of Diffuse Large B-Cell Lymphoma by Using a qPCR-based Gene Expression Assay on Formalin-Fixed Paraffin-Embedded Tissues

The well-established cell-of-origin (COO) algorithm categorizes diffuse large B-cell lymphoma (DLBCL) into activated B-cell-like (ABC) and germinal center B-cell-like (GCB) subgroups through gene expression profiling. We aimed to develop and validate a qPCR-based gene expression assay to determine the COO subgroups of DLBCL with formalin-fixed paraffin-embedded (FFPE) tissue. We first established a DLBCL transcriptome database of 1,016 samples retrieved from three published datasets (GSE10846, GSE22470, and GSE31312). With this database, we identified a qPCR-based 32-gene expression signature (DLBCL-COO assay) that is significantly associated with the COO subgroups. The DLBCL-COO assay was further validated in a cohort of 160 Chinese DLBCL patients. Biopsy samples from DLBCL patients with paired FFPE and fresh frozen tissue were collected to assign COO subtypes based on the immunohistochemistry (IHC) algorithm (Hans algorithm), the DLBCL-COO assay, and global gene expression profiling with RNA-seq. For 111 paired FFPE and fresh frozen DLBCL samples, concordance with RNA-seq was 77.5% for IHC and 91.9% for the qPCR-based DLBCL-COO assay. The DLBCL-COO assay demonstrated significantly superior concordance with the "gold standard" RNA-seq compared with IHC assignment by the Hans algorithm (P = 0.005). Furthermore, the overall survival of GCB patients defined by the DLBCL-COO assay was significantly superior to that of ABC patients (P = 0.023). This effect was not seen when the tumors were classified by the IHC algorithm. The DLBCL-COO assay provides flexibility and accuracy in DLBCL subtype characterization. These findings suggest that the DLBCL-COO assay may serve as a useful tool for guiding prognostic and therapeutic options for DLBCL patients.

Keywords: diffuse large B-cell lymphoma, cell-of-origin, gene expression profiling, immunohistochemistry, quantitative polymerase chain reaction (qPCR), formalin-fixed paraffin-embedded tissue

INTRODUCTION
Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of malignant lymphomas, accounting for more than 40% of newly diagnosed cases. Although DLBCL is potentially curable with standard treatment, there is an urgent need for new therapies since most refractory or relapsed patients will eventually die from the disease. DLBCL has been recognized as a group of heterogeneous diseases with diverse genetic features and variable clinical outcomes. Almost two decades ago, Alizadeh et al. (1) performed gene expression profiling (GEP) with cDNA microarrays to explore unrecognized molecular heterogeneity in DLBCL. Hierarchical clustering revealed at least two distinct groups within DLBCL: the germinal center B-cell-like (GCB) group and the activated B-cell-like (ABC) group. This method is widely recognized as the cell-of-origin (COO) classification algorithm. In a series of large randomized clinical studies following the establishment of COO classification, DLBCL patients with the ABC subtype showed significantly inferior outcomes compared with those with the GCB subtype, even in the clinical study evaluating the efficacy of immunochemotherapy (2).
In recent years, COO classification has been not only recognized as a prognostic factor but also used to tailor therapies for DLBCL patients (3). Additionally, COO classification or its surrogates are widely incorporated into the clinical development of state-of-the-art therapies for de novo and refractory/relapsed DLBCL patients (4). Thus, the World Health Organization (WHO) Classification for Lymphoid Malignancies required the determination of COO for every newly diagnosed DLBCL case. However, COO classification using cDNA microarrays or RNA-seq is not economical or flexible for surgical pathology laboratories and is not compatible with formalin-fixed, paraffin-embedded (FFPE) samples. Immunohistochemistry (IHC) panels, such as the Hans algorithm, may be applied as surrogates and are widely used, but there is low concordance with cDNA microarray or RNA-seq classification, and intraobserver and interobserver variation may undermine their accuracy (5). Although medium-throughput assays, such as NanoString, may be applied to FFPE samples and may be accurate compared with the "gold standard" assay (6), the integrated and enclosed platform, high price, and sophisticated workflow may limit their routine application.
In the current study, we developed a novel gene expression assay (DLBCL-COO assay) that allows differentiation between the GCB and ABC DLBCL subtypes in FFPE specimens using a quantitative reverse transcription polymerase chain reaction (qPCR) platform and evaluated the DLBCL-COO assay against RNA-seq and IHC assays. We further discussed its potential application in routine clinical practice as well as the clinical development of novel therapies for DLBCL patients.

Gene Expression Database Curation
The DLBCL gene expression datasets with confirmed COO subtypes were collected from a public data repository, the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database, and curated to form a comprehensive DLBCL transcriptome database. The gene expression datasets retrieved from three GEO series (GSE10846, GSE22470, and GSE31312) were mainly conducted on two different Affymetrix oligonucleotide microarray platforms, including the GeneChip Human Genome U133A Array and the U133 Plus 2.0 Array. Detailed descriptions of the specimen characteristics and clinical features are provided in the original studies.

Gene Expression Data Analysis
Normalization and analysis of gene expression data were performed using R software and packages available from the Bioconductor project (www.bioconductor.org). The single-channel array normalization (SCAN) approach from the SCAN-UPC package was used to process the Affymetrix microarray data (7,8). Upon normalizing each raw CEL file, SCAN outputs probe-level expression values. We further used custom mapping files from the BrainArray resource to summarize the probe-level intensities directly to gene-level expression values (9). Thus, probes mapping to multiple genes and other problems associated with older generations of Affymetrix probe designs were avoided. After normalization, we applied the ComBat approach to adjust for batch effects (10). To identify a gene expression signature, we used the recursive feature elimination-support vector machine (RFE-SVM) algorithm for feature selection and classification modeling (11). A linear SVM classifier was derived using the training samples with known ABC or GCB labels and applied to the test samples. A specimen was assigned to the ABC or GCB subtype when the probability predicted by the DLBCL-COO assay of belonging to that subgroup exceeded 75%; specimens for which neither probability exceeded 75% were considered unclassified. The Database for Annotation, Visualization and Integrated Discovery (DAVID) bioinformatics resource was used to integrate functional genomic annotations (12). A biological network was constructed by NetworkAnalyst software (www.networkanalyst.cn, version 3.0) (13,14). Protein-protein interactions were retrieved from the IMEx Interactome Database (15).
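The 75% probability rule described above can be illustrated with a minimal Python sketch. It assumes a two-class model whose ABC and GCB probabilities sum to 1; the function name `call_coo` and the example probabilities are hypothetical and are not taken from the study.

```python
def call_coo(p_abc, threshold=0.75):
    """Map a predicted ABC probability to a COO call.

    Assumption: a two-class model, so P(GCB) = 1 - P(ABC).
    A subtype is called only when its probability exceeds the threshold.
    """
    if not 0.0 <= p_abc <= 1.0:
        raise ValueError("p_abc must be a probability between 0 and 1")
    p_gcb = 1.0 - p_abc
    if p_abc > threshold:
        return "ABC"
    if p_gcb > threshold:
        return "GCB"
    return "Unclassified"


# Hypothetical predicted probabilities for three specimens.
for p in (0.92, 0.60, 0.10):
    print(p, call_coo(p))
```

Under this reading, a specimen with an ABC probability anywhere from 25% to 75% falls into the unclassified band.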

Development of the DLBCL-COO Assay
The DLBCL-COO assay was developed on the Applied Biosystems 7500 Real-Time PCR system (Applied Biosystems, Foster City, CA, USA), targeting 32 candidate markers and three housekeeping genes identified with microarray analysis. To support clinical applications using FFPE samples with poor RNA quality, primers were designed to amplify short template mRNA regions of exon-spanning junctions. In addition, the TaqMan MGB probes incorporate a 5′ fluorescent reporter dye and a 3′ nonfluorescent quencher, which offers the advantage of lower background signal, resulting in better precision in quantitation.

Case Selection
The study was approved by the ethical committee of Fudan University Shanghai Cancer Center (Approval case number:

Morphology and Immunohistochemistry
Cases were designated as GCB or non-GCB using the algorithm specified by Hans et al. (16). The morphological and IHC results were independently evaluated by two pathologists (W-HY and X-QL).

Sample Processing and qPCR Analysis
Total RNA was isolated from FFPE tissue and fresh tissue using the RecoverAll Total Nucleic Acid Isolation Kit (Thermo Fisher Scientific, Waltham, MA, USA) per the manufacturer's guidelines. The concentration of total RNA was quantified by a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA), while RNA integrity and quality were further appraised using agarose gel electrophoresis. For each sample, reverse transcription was performed on isolated total RNA using the High-Capacity cDNA Reverse Transcription Kit with RNase Inhibitor (Applied Biosystems, Foster City, CA, USA). The PCR program consisted of an initiation step at 95 °C for 10 min, followed by 40 cycles at 95 °C for 15 s and 60 °C for 1 min. All measurements were taken in triplicate. The melting curves of each measurement were checked; only consistent results were included in the subsequent analysis. Three genes (IPO8, PGK1, and TFRC) that have been reported to be consistently expressed in DLBCL cells were selected as housekeeping genes. First, the qPCR results of the housekeeping genes were investigated across various sample storage durations and RNA qualities. Then, the average Ct value of each target gene minus the mean Ct of the three housekeeping genes was calculated as ΔCt. The -ΔCt value of each gene was used for downstream analysis.
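The normalization step can be sketched in Python as follows. This is an illustration only: the function name `neg_delta_ct`, the dictionary layout, and all Ct values are hypothetical, and the sketch assumes that triplicates are averaged per gene before the housekeeping mean is subtracted.

```python
from statistics import mean

# Housekeeping genes named in the text.
HOUSEKEEPING = ("IPO8", "PGK1", "TFRC")


def neg_delta_ct(ct_replicates, housekeeping=HOUSEKEEPING):
    """Return -dCt per target gene.

    ct_replicates maps gene name -> list of replicate Ct values.
    dCt = mean Ct of the target - mean of the housekeeping gene means.
    """
    mean_ct = {gene: mean(values) for gene, values in ct_replicates.items()}
    reference = mean([mean_ct[gene] for gene in housekeeping])
    return {
        gene: -(ct - reference)
        for gene, ct in mean_ct.items()
        if gene not in housekeeping
    }


# Hypothetical triplicate Ct values for one specimen.
sample = {
    "IPO8": [26.0, 26.0, 26.0],
    "PGK1": [24.0, 24.0, 24.0],
    "TFRC": [25.0, 25.0, 25.0],
    "BCL6": [27.0, 27.5, 26.5],
    "LMO2": [23.0, 23.0, 23.0],
}
print(neg_delta_ct(sample))
```

Because a lower Ct means more template, negating ΔCt gives a value that increases with expression, which is convenient as classifier input.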

RNA Sequencing and Data Analyses
RNA-seq was performed on the NovaSeq 6000 system (Illumina, San Diego, CA, USA) using 1 µg of RNA extracted from fresh tumor tissue according to the manufacturer's instructions. The raw sequencing data were preprocessed using the BRB-SeqTools suite (https://github.com/DeplanckeLab/BRB-seqTools). The GEP-based classification method was performed to determine the COO molecular subtype of each specimen as described in Wright et al. (17) and Reddy et al. (18).

Statistical Analysis
For comparison with the Hans-based IHC method, all COO subtypes of samples from the GEP methods were categorized as either "GCB" or "non-GCB." All GCB predictions remained GCB, and any "ABC" or "UNC" subtype predictions from the RNA sequencing and qPCR assays were converted to the "non-GCB" subtype. The concordance between any pair of assays was calculated using only the total number of samples that could be called by both of those assays. The overall percent agreement and asymptotic 95% confidence intervals (CIs) are presented. To determine the positive percent agreement (PPA) and negative percent agreement (NPA), the global GEP-based subtyping method served as a standard reference in each comparison.
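A minimal Python sketch of these agreement statistics is given below. It rests on several assumptions that the text does not spell out: GCB is treated as the positive class, "asymptotic" is read as the normal-approximation (Wald) interval, and samples that an assay could not call are encoded as `None`. All labels in the example are invented.

```python
from math import sqrt


def to_binary(label):
    """Collapse calls: GCB stays GCB; ABC, UNC and non-GCB become non-GCB."""
    return "GCB" if label == "GCB" else "non-GCB"


def agreement(test_calls, reference_calls, z=1.96):
    """Overall, positive and negative percent agreement versus a reference.

    Only samples called by both assays (neither value is None) are used.
    """
    pairs = [
        (to_binary(t), to_binary(r))
        for t, r in zip(test_calls, reference_calls)
        if t is not None and r is not None
    ]
    n = len(pairs)
    both_gcb = sum(1 for t, r in pairs if t == "GCB" and r == "GCB")
    both_non = sum(1 for t, r in pairs if t == "non-GCB" and r == "non-GCB")
    ref_gcb = sum(1 for _, r in pairs if r == "GCB")
    ref_non = n - ref_gcb
    opa = (both_gcb + both_non) / n
    half_width = z * sqrt(opa * (1.0 - opa) / n)  # Wald interval (assumed)
    return {
        "n": n,
        "opa": opa,
        "opa_ci": (max(0.0, opa - half_width), min(1.0, opa + half_width)),
        "ppa": both_gcb / ref_gcb if ref_gcb else None,
        "npa": both_non / ref_non if ref_non else None,
    }


# Invented calls: the fifth sample failed the test assay and is excluded.
qpcr = ["GCB", "ABC", "UNC", "GCB", None, "ABC"]
rnaseq = ["GCB", "ABC", "ABC", "UNC", "GCB", "GCB"]
print(agreement(qpcr, rnaseq))
```

Restricting the denominator to samples called by both assays mirrors the pairwise rule stated above; a different interval method (for example an exact binomial interval) would give different bounds.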
Overall survival (OS) was defined as the time from diagnosis

Identification of a 32-Gene Expression Signature in the Training Set
The training set consisted of 167 ABC and 183 GCB samples. After the data normalization and annotation steps, a matrix of 20,342 unique genes in 350 samples (≈7.12 million data points) was prepared for downstream bioinformatics analyses. Extracting a subset of informative genes from high-dimensional genomic data is a critical step for gene expression signature identification. Here, we deployed the RFE-SVM algorithm with the linear SVM classification model and the parameter C equal to 1. The algorithm identified a compact panel of 32 genes that are significantly associated with the two molecular subtypes. As listed in Table 3, 16 genes were overexpressed in the ABC subtype, and 16 genes were overexpressed in the GCB subtype. We further investigated whether the 32 candidate genes exhibited biological features relevant to the DLBCL molecular subtypes. As shown in Table 4, the most significantly enriched gene categories are involved in B-cell differentiation, B-cell activation, humoral immune response, and hemopoiesis. We also explored the underlying biological networks of the selected candidate genes. We used the 32 genes as seeds to generate a minimum protein-protein interaction network. The network comprised 21 genes of the 32-gene set and was centered on essential nodes such as BCL6, UBC, AICDA, LMO2, UCHL1, and MME (Figure S1).
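The idea behind recursive feature elimination with a linear SVM can be shown on a toy problem. The sketch below is not the authors' implementation, which used R and Bioconductor packages: a small full-batch subgradient solver stands in for a library SVM, one feature is removed per round using the squared weight as the ranking criterion, and the four-sample, three-gene data set is invented.

```python
def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.05):
    """Fit w, b by subgradient descent on 0.5*||w||^2 + C * sum(hinge loss).

    A simple stand-in for a library SVM solver; labels must be +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rfe(X, y, n_keep, C=1.0):
    """Drop the feature with the smallest squared weight until n_keep remain."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y, C=C)
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        del remaining[worst]
    return remaining


# Invented data: gene 0 is high in ABC (+1), gene 1 is high in GCB (-1),
# gene 2 is uninformative noise.
X = [[2, -2, 1], [2, -2, -1], [-2, 2, 1], [-2, 2, -1]]
y = [1, 1, -1, -1]
print(rfe(X, y, n_keep=2))
```

On real expression matrices a tested solver and cross-validated selection of the final panel size would be needed; the sketch only shows why uninformative genes, which receive near-zero weights, are eliminated first.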

Independent Validation in Fresh and FFPE DLBCL Samples
The classification model comprising 32 subtype-specific genes was established using the entire training set and then applied to Test Set 1, which was composed of 71 ABC and 144 GCB fresh frozen samples. With the 32-gene expression signature, 69 samples were classified as ABC and 146 as GCB. The overall agreement between the 32-gene expression signature

Clinical Validation of the 32-Gene Expression Signature by qPCR Analysis
A total of 160 DLBCL patients with confirmed COO subtypes based on IHC assignment were enrolled in the current study.
The Hans algorithm assigned 60 cases (37.5%) as GCB and 100 cases (62.5%) as non-GCB. One hundred fifty-nine of 160 FFPE specimens met all criteria and were successfully assayed by the DLBCL-COO assay. We first evaluated the hierarchical clustering of the 32 genes and 159 samples based on the qPCR data. Complete linkage hierarchical clustering analysis was performed where the metric of similarity was Pearson's correlation between the 32-gene expression profiles of the samples. As shown in Figure 2, the samples were clustered into distinct groups that followed the COO subtypes. Among the three subtypes, most GCB samples clustered together, whereas the unclassified samples were more similar to ABC samples. According to the predictions by the 32-gene signature, 89 cases (56.0%) were classified as ABC, 51 cases (32.1%) as GCB, and 19 cases (11.9%) as unclassified. In addition, 113 DLBCL patients had paired fresh frozen tissue, and 111 cases passed stringent quality control for RNA-seq analysis. The gold standard RNA-seq method defined 34 cases (30.6%) as GCB, 50 cases (45.1%) as ABC, and 27 cases (24.3%) as unclassified. The concordance between DLBCL-COO and RNA-seq and the concordance between the IHC Hans algorithm and RNA-seq are summarized in Table 6. The DLBCL-COO assay demonstrated a significantly superior concordance of COO determination with the gold standard RNA-seq compared with the IHC assignment with the Hans algorithm (91.9 vs. 77.5%; P = 0.005). Additionally, the PPA and NPA of the DLBCL-COO assay assigning GCB/non-GCB were 88.2% (30 of 34, 95% CI: 0.72-0.96) and 93.5% (72 of 77, 95% CI: 0.85-0.98), respectively. One hundred twenty-nine DLBCL cases with survival information, IHC assignment results, and DLBCL-COO assay results were identified. The clinical information related to the IHC and DLBCL-COO assignment results is summarized in Table 2.
The Hans algorithm failed to stratify DLBCL patients, mostly treated with the R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone) regimen, into different prognostic groups (Figure 3A) (P = 0.091). However, the OS of GCB patients defined by the DLBCL-COO assay was significantly superior to that of ABC patients (Figure 3B) (P = 0.023). The patients assigned as unclassified DLBCL had an intermediate OS (Figure 3B).
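As a consistency check on the agreement figures reported in this section, the counts can be recombined by hand. The calculation assumes that GCB is the positive class and that the 34 GCB and 77 non-GCB RNA-seq calls together make up the 111 paired cases; both assumptions are inferred from the reported counts rather than stated explicitly.

```latex
\begin{align*}
\mathrm{PPA} &= \frac{30}{34} \approx 88.2\% \\
\mathrm{NPA} &= \frac{72}{77} \approx 93.5\% \\
\text{Overall agreement} &= \frac{30 + 72}{34 + 77} = \frac{102}{111} \approx 91.9\%
\end{align*}
```

The overall figure obtained this way matches the 91.9% concordance between the DLBCL-COO assay and RNA-seq.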

DISCUSSION
As DLBCL is a heterogeneous disease in genetic, biological, and clinical behavior, precise classification is critical for predicting prognosis or the efficacy of therapies. Characterizing DLBCL into GCB and ABC based on COO represents a milestone in the heterogeneity delineation of DLBCL. These COO classification results successfully correlated with the patient outcome, even in the era of immunochemotherapy with rituximab (2). The COO classification system demonstrated the different cancer biology and etiologies in DLBCL, making it possible to tailor therapies to different subgroups of patients. The most exciting application of COO classification may be the efficacy prediction of BTK inhibitors and lenalidomide in treating refractory or relapsed DLBCL patients (19,20). Although randomized phase 3 clinical studies evaluating the efficacy of BTK inhibitors and lenalidomide in treating treatment-naive DLBCL based on the COO classification failed (21,22), COO determination for newly diagnosed DLBCL patients is still mandatory. Several novel classification systems based on the DLBCL genetic landscape have been proposed recently, such as four genetic subtypes based on the status of MYD88 L265P, CD79b mutations, NOTCH1 mutations, BCL6 fusion, NOTCH2 mutations, BCL2 translocations, and EZH2 mutations (23). However, these systems overlapped substantially with the COO classification (23,24), indicating that COO classification may be the backbone of other state-of-the-art classification algorithms.
As the gold standard of COO determination is global GEP based on cDNA microarray or RNA-seq, which is inaccessible for routine testing, the most widely used and flexible COO surrogate is IHC. However, the interobserver and intraobserver reproducibility of IHC COO assignment are not satisfactory, and the concordance across different IHC COO algorithms is quite low (25). The IHC COO assignment failed to predict the outcome of DLBCL patients treated with immunochemotherapy (26) and failed to predict the efficacy of the BTK inhibitor ibrutinib in treating DLBCL (27). Several medium-throughput assays compatible with FFPE samples have been developed in recent years, demonstrating high concordance with global GEP based on cDNA microarray or RNA-seq (6). Nonetheless, the complexity of assays based on a specific platform (NanoString) or the Illumina sequencer and their high cost may limit their wide application in routine practice, especially in resource-poor areas.
Therefore, it is necessary to establish a COO determination assay with appropriate cost, comparable accuracy with the gold standard assay, and compatibility with FFPE samples. The qPCR technology is generally considered the "gold standard" procedure for measuring individual gene expression and is often used to confirm the findings of DNA microarray and RNA-seq analyses. Furthermore, the qPCR technology can be easily applied to FFPE specimens, and thus, it is widely applicable in clinical practice. Recently, Tekin et al. reported a successful validation of a qPCR-based six-gene predictor for DLBCL prognosis in an international clinical study (28). Herein, the DLBCL-COO assay is a qPCR assay that detects a 32-gene expression profile for DLBCL molecular classification. The DLBCL-COO assay was trained against the so-called gold standard of COO assignment using GEP on fresh frozen tissue, tested, and then validated in multiple independent cohorts. Although a slight loss in signal intensities was observed when FFPE sample storage duration increased (Figure S2), the qPCR-based TaqMan assays remained accurate and robust for gene expression profiling. The overall success rate of the DLBCL-COO assay is satisfactory (159/160, 99%), even for the FFPE samples archived 5 years ago, indicating satisfactory compatibility with FFPE samples. This may be critical for relapsed or refractory DLBCL, as biopsied samples may be archived for several years. Regarding accuracy, the concordance of the DLBCL-COO assay with the gold standard RNA-seq assay was 91.9%, which was comparable to that reported for the NanoString and HTG assays, although head-to-head studies are lacking, suggesting that the COO assignment by DLBCL-COO is accurate.
In addition to routine clinical practice, the clinical development of novel therapies for DLBCL also requires a COO assignment assay with high accuracy and consistency and a short turnaround time. In the PHOENIX study, which is a randomized, double-blind, placebo-controlled, multicenter, phase 3 study comparing the efficacy and safety of ibrutinib in combination with R-CHOP vs. placebo in combination with R-CHOP in patients with the newly diagnosed non-GCB subtype of DLBCL, GEP showed that 75.9% of patients with non-GCB DLBCL assigned by IHC had ABC DLBCL (23). As central pathology COO assignment and review were applied in this well-controlled study, the concordance between the IHC COO assignment and GEP assignment may be much lower in routine practice. In another phase 3 study evaluating the efficacy of R-CHOP plus lenalidomide in previously untreated ABC DLBCL (ROBUST study), the NanoString Lymph2Cx GEP assay was applied to assign COO, with a test failure rate of 15% (29). As the samples from previously untreated DLBCL patients were recently biopsied in the ROBUST study, the failure rate may be higher for the long-archived samples of relapsed and refractory DLBCL patients. In these settings, an assay that is more compatible with archived samples than GEP and more accurate than IHC could be incorporated into clinical development more effectively.

CONCLUSION
In conclusion, the DLBCL-COO assay provides flexibility and accuracy in DLBCL subtype characterization into GCB and ABC. These subtype distinctions should help guide disease prognosis and treatment options in DLBCL clinical practice. Further prospective studies including incorporation into prospective interventional studies will be needed to evaluate the performance in detail.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study; these can be found in the NCBI Gene Expression Omnibus (GSE10846, GSE22470, GSE147986, and GSE31312).

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Clinical Research Ethics Committee of Fudan University Shanghai Cancer Center. The patients/participants provided their written informed consent to participate in this study.