Biomarkers and molecular endotypes of sarcoidosis: lessons from omics and non-omics studies

Sarcoidosis is a chronic granulomatous disorder characterized by unknown etiology, undetermined mechanisms, and non-specific therapies except TNF blockade. To improve our understanding of its pathogenicity and to predict disease outcomes, new biomarkers and molecular endotypes are sorely needed. In this review, we systematically evaluate the biomarkers identified through Omics and non-Omics approaches in sarcoidosis. Most currently documented biomarkers for sarcoidosis were identified through conventional “one-for-all” non-Omics targeted studies. Although the application of machine learning algorithms to identify biomarkers and endotypes from unbiased, comprehensive Omics studies is still in its infancy, a series of biomarkers, overwhelmingly diagnostic markers that differentiate sarcoidosis from healthy controls, have been reported. Because current biomarker profiles in sarcoidosis are scarce, fragmented, and mostly not validated, there is an urgent need to identify novel sarcoidosis biomarkers and molecular endotypes using more advanced Omics approaches to facilitate disease diagnosis and prognosis, resolve disease heterogeneity, and enable personalized medicine.


Background
Sarcoidosis is an inflammatory disorder characterized by granuloma formation in affected organs, most often in the lung (~90%) (1,2). Scadding staging of sarcoidosis is therefore based on intrathoracic involvement. The most common organs involved are the lung, skin, eyes, liver, lymph nodes, salivary glands, bone/joints/muscle, spleen, nervous system, kidneys, sinuses, and heart. For example, lung sarcoidosis features coalescing, tightly clustered, non-necrotizing granulomas and is complicated by lung fibrosis in up to 20% of progressive patients, who account for 75% of sarcoidosis-related deaths from respiratory causes (2)(3)(4)(5). The etiology of sarcoidosis remains unknown a century and a half after the disease was first described (1). The prevalence, presentation, prognosis, and triggering antigen are extremely variable (6). The outcomes of lung sarcoidosis are affected by age, gender, race, income, environment, lung morbidity, lung leucocyte infiltration, requirement of treatment, and genetic variants of associated genes (e.g., human leucocyte antigen class II (HLA class II), tumor necrosis factor α (TNF-α), and annexin XI (ANXA11), among others) (1,2). However, there are no reliable biomarkers for predicting the propensity of sarcoidosis to progress to lung fibrosis (7).
The combination of advanced Omics techniques and machine learning algorithms holds the potential to facilitate mechanistic investigations and the discovery of novel biomarkers for this intriguing illness. Molecular endotypes and biomarkers provide critical information for the development of new therapeutics. Several studies on the phenotypes of sarcoidosis have been published (8)(9)(10), but they are outside the scope of this review. In sharp contrast, only one study on the molecular endotypes of the disease has been reported to date (11). Biomarkers are widely applied to diagnosis, differential diagnosis, prognosis, treatment response, disease activity, severity assessment, chronicity evaluation, and the implementation of interventions. In this paper, we systematically review the identified biomarkers based on their clinical relevance. The clinical data were analyzed with differential analysis, correlation analysis, biomarker-specific tests, and machine learning algorithms. Recently, several publications have comprehensively reviewed sarcoidosis biomarkers identified using genomics (12)(13)(14). We will not reiterate this topic here. Instead, we will focus on the transcriptomics (and other Omics) studies published since 2018, as earlier studies have already been covered in a prior review (15).

Workflow of omics analysis for biomarkers and endotypes
The workflow of clinical omics studies is generally consistent, yet it adapts to the specific objectives, materials, platforms, and data analysis strategies involved (Figure 1). Omics studies can be applied to the identification of candidate biomarkers, molecular endotypes, gene expression signatures, mechanisms, and druggable targets. The workflow begins with the collection of relevant biologic specimens, including cells, tissues, and body fluid samples from sarcoidosis patients and healthy controls. The specimens are then pre-processed (e.g., preparation of single-cell suspensions, extracellular vesicle isolation, or DNA/RNA/protein extraction) for analysis on Omics platforms to generate Omics datasets. Once the datasets become available, data cleaning is required before downstream analysis, including normalization, missing data imputation, batch effect correction, and application of cut-off criteria. Next, the datasets are profiled to identify differentially expressed genes/transcripts/proteins/metabolites. Various statistical and machine learning techniques are employed depending on the task at hand. For instance, unsupervised clustering can be used to uncover novel molecular endotypes, while supervised logistic regression is appropriate for finding potential biomarkers. Omics studies can shed light on disease signatures, candidate biomarkers, molecular endotypes, and clinical disease-omics correlations. Finally, these observations are cross-referenced to functional annotations, correlations with clinical variables, and literature reports. Validation of candidate biomarkers and molecular endotypes in a separate, independent cohort is also required. An alternative strategy is to split one cohort into training and validation groups. In addition, preclinical animal models of the disease and organ-on-chip systems can be used to validate the identified markers and mechanisms of action (16, 17).
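The data-cleaning and cohort-splitting steps described above can be illustrated with a minimal sketch. It assumes a single analyte measured across samples, median imputation, a log2(x + 1) transform followed by z-scoring, and a random 70/30 split; the function names, the sample identifiers, and the numerical values are hypothetical, and batch effect correction is not shown.

```python
import math
import random
from statistics import mean, median, pstdev


def impute_median(column):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def log2_zscore(column):
    """Apply log2(x + 1), then z-score the values across samples."""
    logged = [math.log2(v + 1.0) for v in column]
    mu, sd = mean(logged), pstdev(logged)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in logged]


def split_cohort(sample_ids, validation_fraction=0.3, seed=0):
    """Randomly split one cohort into training and validation groups."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * validation_fraction))
    return ids[n_val:], ids[:n_val]


# Hypothetical expression values for one transcript (None = missing).
raw = [10.0, None, 250.0, 31.0, None, 7.0]
clean = log2_zscore(impute_median(raw))

# Hypothetical sample identifiers for a six-patient cohort.
train_ids, val_ids = split_cohort(["S1", "S2", "S3", "S4", "S5", "S6"])
```

In practice the imputation and scaling parameters would be estimated on the training group only and then applied to the validation group, so that no information leaks between the two.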

Selection criteria of clinical studies to review
We searched PubMed using "sarcoidosis", "biomarker", and "endotype" as keywords. Only original clinical studies were included. The hits were subsequently categorized by the data analysis strategies employed, including classical statistics, AUC (area under the curve), machine learning algorithms, and the use of omics platforms. We documented and compared the types of biomarkers, sources of specimens, sample sizes, statistical and machine learning methods used, performance metrics of identified biomarkers and endotypes, and validation procedures.

Types of biomarkers by purposes
The term biomarker refers to a broad spectrum of biomedical features, including specific analytes, anatomic features, pathophysiological characteristics, and pharmacologic responses to therapeutic interventions that can be measured accurately and reproducibly. The biomarkers of sarcoidosis can be categorized based on their clinical applications. "Diagnostic biomarkers" serve to differentiate sarcoidosis patients from healthy individuals. "Differential diagnosis biomarkers" aid in differentiating sarcoidosis from other similar conditions. Some biomarkers exhibit specificity for disease severity, inflammation, disease activity, and/or chronicity. "Prognostic biomarkers" are predictive of the outcomes of the disease, including mortality, lung fibrosis, and organ failure. Additionally, "predictive" biomarkers are indicative of treatment response and intervention efficacy (12)(13)(14)(18)(19)(20)(21)(22)(23)(24)(25). As shown in Tables 1, 2, the most common biomarkers reported for sarcoidosis are for diagnosis, followed by prognosis, organ involvement, and treatment response.
Biomarkers are generally identified by combining biomedical tests with biostatistical/bioinformatic analyses. Biomedical measurements include ELISA, MRI, IHC, echocardiogram, flow cytometry, etc., as summarized in Tables 1-4 for sarcoidosis. These data are then analyzed using t-tests, odds ratios, AUC, logistic regression, and machine learning algorithms, depending on the nature of the data. Validation in a separate cohort is essential for any potential biomarker candidate. Receiver operating characteristic (ROC) analysis, summarized by the AUC, is an accepted approach to assess the sensitivity, specificity, and accuracy of an individual biomarker or a biomarker panel. This is a common strategy to identify and evaluate the biomarkers of sarcoidosis in both Omics and non-Omics studies.
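As an illustration of how the AUC, sensitivity, and specificity of a single biomarker can be computed, the sketch below uses the rank-based definition of the AUC (the probability that a randomly chosen patient has a higher marker level than a randomly chosen control) and selects a cut-off by Youden's J statistic. The marker values are hypothetical and not taken from any cited study, and the choice of Youden's J is an assumption made for the example.

```python
def roc_auc(case_values, control_values):
    """AUC as the probability that a random case scores above a random control.

    Ties count as one half, which makes this the Mann-Whitney U statistic
    divided by the number of case-control pairs.
    """
    wins = 0.0
    for c in case_values:
        for h in control_values:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


def sensitivity_specificity(case_values, control_values, cutoff):
    """Call a sample positive when its marker value is >= cutoff."""
    true_pos = sum(v >= cutoff for v in case_values)
    true_neg = sum(v < cutoff for v in control_values)
    return true_pos / len(case_values), true_neg / len(control_values)


def best_cutoff(case_values, control_values):
    """Return the observed value that maximises Youden's J = sens + spec - 1."""
    candidates = sorted(set(case_values) | set(control_values))

    def youden(cutoff):
        sens, spec = sensitivity_specificity(case_values, control_values, cutoff)
        return sens + spec - 1.0

    return max(candidates, key=youden)


# Hypothetical serum marker levels (arbitrary units), not real patient data.
cases = [5.1, 6.3, 4.8, 7.0, 5.9]
controls = [3.2, 4.9, 3.8, 4.1, 5.0]

auc = roc_auc(cases, controls)          # 23 of 25 pairs are concordant: 0.92
cutoff = best_cutoff(cases, controls)   # 5.1 for these values
sens, spec = sensitivity_specificity(cases, controls, cutoff)  # 0.8 and 1.0
```

A cut-off chosen in this way is optimistic when evaluated on the same samples, which is one reason why validation in a separate cohort matters.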

Sources of specimens for biomarker study
Although sarcoidosis is a systemic disorder, both local and circulating biomarkers have been explored. Common specimens include whole blood, plasma, serum, blood cells, tissues from different organs, and liquid biosamples (bronchoalveolar lavage (BAL) fluid, joint fluid, spinal fluid, urine, and lymph node puncture fluid), along with their derivatives (cells, extracellular vesicles (EVs), DNA, microRNA (miRNA), etc.). Both Omics and non-Omics approaches have been reported. Targeted non-Omics methods include colorimetry, fluorimetry, enzyme-linked immunosorbent assay (ELISA), Western blot (WB), quantitative polymerase chain reaction (qPCR), microarray, flow cytometry, and immunohistochemistry (IHC) analyses of targeted molecules. The non-targeted Omics studies, on the other hand, utilized gene chips, advanced arrays, RNA sequencing (RNA-seq), mass spectrometry, nuclear magnetic resonance (NMR), and 16S rRNA-seq (see Section 4), permitting a more comprehensive and unbiased screen, albeit at greater expense and computational cost.

Sarcoidosis biomarkers identified by targeted non-omics studies
Studies designed to identify diagnostic markers typically compare patients with healthy controls. For the identification of differential diagnosis biomarkers, other diseases often serve as control groups. In contrast, individuals with only sarcoidosis are required for studies that aim to identify predictive biomarkers and biomarkers for the response to therapy, activity, chronicity, organ involvement, and severity of the disease, with longitudinal follow-up in some cases. The average sample size across 30 non-omics studies is 88, ranging from 6 to 694. Only two studies included a separate validation cohort (27, 33).

FIGURE 1
General workflow of Omics study to identify biomarkers and endotypes. First, liquid specimens and tissue samples collected from an exploratory cohort are prepared for analysis on the omics platforms. Second, omics datasets, including genomics, transcriptomics, proteomics, metabolomics, and microbiomics, are generated by experiments. Next, the datasets are pre-processed before statistical and machine learning analyses. Statistical analyses include differential expression tests, correlation analysis, logistic regression, Bayesian models, etc. Machine learning algorithms encompass supervised methods (e.g., gradient boosting, deep neural network, and latent classification and prediction) and unsupervised methods (e.g., PCA, k-means clustering, hierarchical clustering). Furthermore, network analysis can be performed with weighted gene co-expression network analysis (WGCNA), weighted protein correlation network analysis (WPCNA), genome-wide association study (GWAS), and meta-analysis. In general, functional annotations are essential to rank differential gene-, transcript-, protein-, or metabolite-related signaling pathways, functions, and interactions. These analyses help identify candidate biomarkers, molecular endotypes, and unique phenotypes, along with developed predictive models. Specific metrics such as the ROC AUC (c statistic), odds ratio, risk ratio, and others are applied for biomarker comparisons. Finally, it is critical to validate the results in independent cohorts of multi-ethnic origins and under-represented populations, particularly since sarcoidosis is common in minorities.

Omics-based identification of biomarkers for sarcoidosis
In recent years, Omics studies for the identification of biomarkers and molecular endotypes have been emerging, using a wide array of platforms. They apply both machine learning algorithms and traditional biostatistics for data analysis. The biomarkers identified by genome-wide association and transcriptomics studies have been comprehensively reviewed recently (15, 20, 69, 70). Thus, this review focuses on transcriptomics studies published since 2018. Finally, we review other Omics studies, including proteomics, metabolomics, and microbiomics.
Other transcriptomic studies have attempted to differentiate sarcoidosis patients by clinical phenotypes. Unsupervised clustering analyses have been used to identify biomarkers associated with pulmonary sarcoidosis, cardiac sarcoidosis, and other phenotypes (Table 4). miR-155 and miR-223 were associated with pulmonary sarcoidosis. In addition, upregulated TLR/NOD (toll-like receptor/nucleotide-binding oligomerization domain) signaling, intrinsic apoptosis, and inflammatory pathways were uncovered in sarcoidosis patients, compared to healthy controls (61). A clustering of 12 patients suggested an increase in mRNA for ribosome biogenesis and lymphocytes in BAL cells in all patients (62). Moreover, GPNMB (transmembrane glycoprotein NMB) emerged as a potential biomarker for multinucleated giant cells associated with cardiac sarcoidosis (63). It is worth noting that these studies are limited by their small sample sizes, which pose challenges in identifying highly sensitive and specific biomarkers. Another concern is the lack of cross-validation. Consequently, further investigations using independent cohorts, along with the integration of machine learning algorithms and classical statistics (AUC), warrant consideration.

Proteomics
Proteins are the main vehicles of cellular function, and their abnormal alterations can result in organ disorders. A few proteomic studies of sarcoidosis have been reported. Serum zinc finger protein 687 (ZNF688), ADP-ribosylation factor GTPase-activating protein 1 (ARFGAP1), CD14, and LBP (lipopolysaccharide-binding protein) have been identified as validated biomarkers for differentiating sarcoidosis from healthy controls (23,57). By comparison, the serum α-2 chain of haptoglobin and amyloid A show potential as diagnostic biomarkers but await validation. Clustering analyses, including principal component analysis (PCA) and oracle product lifecycle analytics (OPLA), have been applied to select proteomics-based serum biomarkers for identifying sarcoidosis (Table 4). Typically, these analyses cluster protein panels to differentiate controls and disease cases. For example, a 25-protein panel related to Fcγ-mediated phagocytosis and clathrin-mediated endocytosis exhibited diagnostic potential, as did panels involving regulation-associated factors (64). Furthermore, Fc-regulation-associated factors and IgG-related factors have been reported to be biomarkers for the presence of Löfgren's syndrome, a distinct phenotype of sarcoidosis (65, 66). The sample sizes of the studies using LC-MS (liquid chromatography-mass spectrometry) were generally small, with fewer than 10 cases. Each member of the identified proteins and protein panels should be evaluated individually for its AUC value. Moreover, targeted proteomics using antibodies or aptamers as ligands, which offers increased technical sensitivity, has barely been applied to the study of sarcoidosis. Given that global proteomics is time-consuming and resource-intensive, there is a long way to go before individual proteins are identified for clinical use and proteomics biomarkers are translated to the bedside.

Metabolomics
To date, four studies have analyzed blood and saliva metabolites in sarcoidosis (58)(59)(60)(67). Either LC-MS or NMR was used to detect metabolites. Notably, a saliva-based panel of six metabolites differentiated sarcoidosis patients from healthy controls with an AUC of 0.87 (58). Moreover, plasma p-coumaroylagmatine and palmitoylcarnitine were identified as differential diagnosis biomarkers for lung fibrosis, with their involvement extending to collagen and arginine-proline pathways (59). To distinguish veteran (military or other occupation) from civilian sarcoidosis patients, one study identified six differentially expressed metabolites and 33 trace elements (60). Metabolomics has also been applied to characterize lipid profile responses to exercise in sarcoidosis patients (67). Fatty acids, triglycerides, and total cholesterol were significantly reduced in patients on an exercise regimen, suggesting the potential of using the blood lipid profile as a prognosticator for recovery of lung function. Clearly, metabolomics is a powerful approach to identify potential biomarkers for sarcoidosis. More studies are warranted to validate the identified metabolic biomarkers and to identify additional biomarkers for diagnosis, differential diagnosis, activity, severity, chronicity of the disease, outcomes, and response to treatment.
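To make concrete how several metabolites can be combined into one diagnostic score, the following minimal sketch fits a logistic regression by batch gradient descent. Logistic regression is used here only as an example of a supervised method; it is not claimed to be the model used in the cited studies. The two-metabolite panel, the standardized values, and the learning-rate and epoch settings are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    features: one list of panel values per sample.
    labels:   0 for control, 1 for sarcoidosis.
    Returns (weights, bias).
    """
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def panel_score(weights, bias, x):
    """Predicted probability of disease for one sample."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Hypothetical standardized levels of two metabolites (not real data).
case_panel = [[1.2, 0.8], [0.9, 1.5], [1.4, 0.3], [0.2, 1.1]]
control_panel = [[-1.0, -0.7], [-0.4, -1.3], [-1.2, 0.1], [0.1, -0.9]]

weights, bias = fit_logistic(case_panel + control_panel, [1] * 4 + [0] * 4)
case_scores = [panel_score(weights, bias, x) for x in case_panel]
control_scores = [panel_score(weights, bias, x) for x in control_panel]
```

The resulting scores can be passed to an ROC analysis in the same way as a single marker, ideally on samples that were not used for fitting.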

Microbiome
Accumulating evidence suggests that the crosstalk between the gut microbiota and the lung, known as the gut-lung axis, is critical. The lung microbiota of sarcoidosis has been reviewed recently (71). The identification of microbial biomarkers for sarcoidosis is an emerging direction. To date, only one study has performed 16S rRNA-seq of BAL from 8 sarcoidosis cases with the aim of identifying microbial markers for diagnosis. This study identified three taxa as potential biomarkers in BAL specific to sarcoidosis: Corynebacteriales, Corynebacterium, and Neisseria.OTUID_476, each with a linear discriminant analysis (LDA) score greater than 3.0 (68). Nonetheless, these potential microbial biomarkers await confirmation and rigorous statistical analyses. In addition to lung microbiota, microbial markers should be identified from the gut or other involved organs in future studies.

Molecular endotypes
The identification of molecular endotypes for a given disease will shed light on disease pathogenicity, diagnosis, stratification, prognosis, and the development of new personalized therapies. Currently, there is only one report on the "transcriptomic" endotypes of sarcoidosis (11). Four potential endotypes were identified by unsupervised analysis of RNA-seq data in BAL cells, including hilar lymphadenopathy with increased acute T-cell immune response, extraocular organ involvement with phosphatidylinositol-3-kinase (PI3K) activation pathways, chronic and multiorgan disease with increased immune response pathways, and multiorgan disease with increased IL-1 and IL-18 immune and inflammatory responses. These mRNA-based endotypes, derived from signatures in BAL cells, await independent validation (72). In addition, a clinical trial has recently been registered to define the endotypes of CD4 T helper and T regulatory cells in sarcoidosis (73). Clearly, molecular endotype studies of sarcoidosis are still in their infancy.
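The idea of deriving endotypes by unsupervised analysis can be sketched with k-means clustering (Lloyd's algorithm). This is a generic illustration and not the method of the cited report; the two-dimensional "pathway scores", the number of clusters, and the random seed are hypothetical choices made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: returns (cluster label per point, centroids)."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct samples.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical per-patient scores for two pathways (not real data).
patients = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],   # low scores on both pathways
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],   # high scores on both pathways
]
labels, centroids = k_means(patients, k=2)
```

Cluster labels are arbitrary, so candidate endotypes are compared by their membership and by the clinical features of each cluster, and they still require validation in an independent cohort.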

Conclusion
Despite decades of both basic and clinical research, our understanding of sarcoidosis remains limited. No specific interventions exist for systemic or single-organ sarcoidosis due to the incomplete understanding of its pathogenicity. Consequently, there is an urgent need to find the molecular basis of various phenotypes and biomarkers. This pursuit is crucial for predicting long-term outcomes and responses to therapies targeting the different manifestations of sarcoidosis. So far, a substantial portion (70%) of biomarkers identified through targeted non-Omics studies have been cross-validated by different groups, in various phenotypes, or in distinct organ involvements (Table 5). These non-Omics-derived biomarkers may serve multiple purposes, including diagnosis, differential diagnosis, prognosis determination, assessment of disease activity, chronicity, and severity, and evaluation of therapeutic response. However, these focused studies need to be expanded to or fortified with larger, unbiased Omics-based studies in order to uncover improved biomarkers for this disease.
Very few individual biomarkers are identified by both Omics and non-Omics studies. This is most likely due to divergent approaches and tissues. Non-Omics studies measured one or a few biomarkers at the protein level. In contrast, Omics studies profile the landscape of genes, transcripts, proteins, or metabolites and identify a panel of top-ranked biomarkers. These Omics biomarkers should be validated independently in other clinical studies. One challenge of implementing these biomarkers clinically is the difficulty of identifying these phenotypes at the bedside using the same approaches. It remains a question whether genomics, transcriptomics, and metabolomics biomarkers could be validated by proteomics. In addition, the organ specificity of identified biomarkers may lead to inconsistency between identified Omics and non-Omics biomarkers. Of note, organ-specific biomarkers could be applied to differentiate involved organs. All validated high-quality biomarkers are highlighted in italics in Table 5. These validated biomarkers are recommended for sarcoidosis. Without doubt, the combination of non-Omics and Omics assays will improve the identification of biomarkers in sarcoidosis.
One advantage of Omics studies is their high-throughput capacity. This feature enables the identification of molecular endotypes in sarcoidosis and of multiple biomarkers ranked by their importance. The development of new bioinformatics and machine learning algorithms holds significant potential for extracting more accurate or predictive information from Omics datasets, to prioritize critical biomarkers, meet clinical needs, and identify molecular endotypes associated with different phenotypes. With respect to sarcoidosis, Omics studies are still in their infancy. The reported Omics studies allude to several potential biomarkers/panels, signaling pathways, and integrated networks, but the general paucity of sensitivity, specificity, and accuracy comparisons, insufficient statistical power, and few cross-validations prevent rigorous conclusions from being drawn. Most Omics-derived biomarker panels are not ready to be translated to the bedside. More Omics studies and multi-Omic integrative investigations are needed to validate published biomarkers and to identify more accurate biomarkers for sarcoidosis, using well-annotated reference cohorts. Clinical trials are necessary to evaluate the clinical application of top-ranked biomarkers, each on an individual basis. In summary, there is an urgent need to identify novel sarcoidosis biomarkers and molecular endotypes to facilitate disease diagnosis and prognosis, resolve disease heterogeneity, and facilitate personalized medicine.

TABLE 2
Sarcoidosis biomarkers identified in non-omics studies that have been assessed for their potential to accurately distinguish disease groups.

TABLE 1
Biomarkers associated or correlated with sarcoidosis, identified using non-Omics approaches.
The listed studies did not attempt to compute the biomarker potential to accurately distinguish sarcoidosis from controls. SA, sarcoidosis; Ctl, controls; CSF, cerebrospinal fluid; NS, neurosarcoidosis; BAL, bronchoalveolar lavage fluid; EVs, extracellular vesicles; PS, pulmonary sarcoidosis; HP, hypersensitivity pneumonitis; AHC, agglomerative hierarchical clustering; PCA, principal component analysis; RF, random forest; IPA, ingenuity pathway analysis; BAFF, B-cell-activating factor. * Reported by more than one study. Biomarkers from 4 of 10 studies have been validated.

TABLE 3
Sarcoidosis biomarkers identified in Omics studies that have been assessed for their potential to accurately distinguish disease groups.

TABLE 5
Summary of identified biomarkers by non-Omics and Omics studies.