Uniformly shaped harmonization combines human transcriptomic data from different platforms while retaining their biological properties and differential gene expression patterns

Introduction: Co-normalization of RNA profiles obtained using different experimental platforms and protocols opens avenue for comprehensive comparison of relevant features like differentially expressed genes associated with disease. Currently, most of bioinformatic tools enable normalization in a flexible format that depends on the individual datasets under analysis. Thus, the output data of such normalizations will be poorly compatible with each other. Recently we proposed a new approach to gene expression data normalization termed Shambhala which returns harmonized data in a uniform shape, where every expression profile is transformed into a pre-defined universal format. We previously showed that following shambhalization of human RNA profiles, overall tissue-specific clustering features are strongly retained while platform-specific clustering is dramatically reduced. Methods: Here, we tested Shambhala performance in retention of fold-change gene expression features and other functional characteristics of gene clusters such as pathway activation levels and predicted cancer drug activity scores. Results: Using 6,793 cancer and 11,135 normal tissue gene expression profiles from the literature and experimental datasets, we applied twelve performance criteria for different versions of Shambhala and other methods of transcriptomic harmonization with flexible output data format. Such criteria dealt with the biological type classifiers, hierarchical clustering, correlation/regression properties, stability of drug efficiency scores, and data quality for using machine learning classifiers. Discussion: Shambhala-2 harmonizer demonstrated the best results with the close to 1 correlation and linear regression coefficients for the comparison of training vs validation datasets and more than two times lesser instability for calculation of drug efficiency scores compared to other methods.

Introduction: Co-normalization of RNA profiles obtained using different experimental platforms and protocols opens avenue for comprehensive comparison of relevant features like differentially expressed genes associated with disease.Currently, most of bioinformatic tools enable normalization in a flexible format that depends on the individual datasets under analysis.Thus, the output data of such normalizations will be poorly compatible with each other.Recently we proposed a new approach to gene expression data normalization termed Shambhala which returns harmonized data in a uniform shape, where every expression profile is transformed into a pre-defined universal format.We previously showed that following shambhalization of human RNA profiles, overall tissue-specific clustering features are strongly retained while platform-specific clustering is dramatically reduced.
Methods: Here, we tested Shambhala performance in retention of fold-change gene expression features and other functional characteristics of gene clusters such as pathway activation levels and predicted cancer drug activity scores.

Introduction
Gene expression data are widely used in the fields of functional genomics and molecular medicine, e.g., in cancer research (~350,000 PubMed papers found using search terms gene expression and cancer in April 2023).Two major approaches are used nowadays for large-scale transcriptional profiling: microarray hybridization (MH) of mRNA (Lashkari et al., 1997;Bednár, 2000;King and Sinha, 2001;Rew, 2001) and mRNA sequencing (RNAseq) (Nagalakshmi et al., 2008;Maher et al., 2009;Wang et al., 2009;Chu and Corey, 2012;Ingolia et al., 2012;Korir et al., 2015;Taylor et al., 2016).Both approaches utilize different rationales and can be further subdivided in several technological platforms.Consequently, the output data of MH and RNAseq are essentially platform-biased and require specific normalization/harmonization procedures in case of any intercomparison may be needed.
Most of such cross-platform harmonization/normalization methods return the results in a flexible format.As such, the shape of the output normalized gene expression profiles fully depends on the group of samples under normalization and can be poorly compatible with the results of another normalization involving different transcriptional profiles.Thus, a new normalization procedure is typically required for every comparison.
More recently, a new concept was formulated for the harmonization methods: conversion of a whole set of profiles into the shape of a predefined experimental platform, e.g., in the Training Distribution Machine (TDM) method (Thompson et al., 2016).According to this paradigm, the harmonized results should look as if they were obtained using a single pre-defined gene expression profiling platform.
Current versions of Shambhala utilize transformation of a fraction of ~8,000 most strongly expressed human genes because their transcriptional activities can be assessed with the greatest precision compared to the low-expressed genes (Borisov et al., 2019).In our previous report (Borisov et al., 2022) we showed that Shambhala-2 returns transformed gene expression profiles that are clustered according to their biological origin rather than by their experimental platform.However, it remained unexplored whether this harmonization also retains differential gene expression features that can functionally characterize the samples under analysis.
Here, we tested Shambhala performance in retention of foldchange gene expression features and other functional characteristics of gene clusters such as pathway activation levels and predicted cancer drug activity scores.Using 6,793 cancer and 11,135 normal tissue gene expression profiles from the literature and experimental datasets, we applied twelve performance criteria for different versions of Shambhala and other methods of transcriptomic harmonization with flexible output data format.Such criteria dealt with the biological type classifiers, hierarchical clustering, correlation/regression properties, and stability of drug efficiency scores.We also assessed the quality of Shambhala output data for building both local and global machine learning expression-based classifiers of human cancer and normal tissue types.The piecewisecubic Shambhala-2 harmonizer demonstrated the best results with the close to 1 correlation and linear regression coefficients for the comparison of training vs. validation datasets and more than two times lesser instability for calculation of drug efficiency scores compared to other methods.

Study design
Our validation of the Shambhala method for uniformly-shaped harmonization included the following tests (Figure 1). 1) Comparison of different cancer and normal tissue datasets of gene expression profiles obtained using different equipment/protocols.The normalization methods included QN (Bolstad et al., 2003), DESeq2 (Love et al., 2014), Empirical Bayes a.k.aComBat (Johnson et al., 2007;Lagani et al., 2016), and different modification of linear and cubic Shambhala.The quality control metrics included: -the accuracy of the transfer learning classifier, where the ML model is trained on profiles obtained in one batch and validated on profiles obtained in different batches and using different platforms/protocols; -clustering quality of expression profiles after harmonization: "good harmonization" means that such clustering corresponds to biological type of the sample rather than to batch or experimental platform; -correlation and linear regression coefficient between expression profiles for the same tissue type, obtained using different equipment/protocols: should be close to 1 for a good harmonization.2) Analysis of correlation and linear regression coefficients, as well as sign stability for downstream measures of gene expression derivatives (case-no-control log-fold changes, pathway activation levels (PALs) (Aliper et al., 2017), and drug balanced efficiency score (BES) (Tkachev et al., 2020b).For a good harmonization, the correlation and linear regression coefficients should be close to 1, and the sign change rates should be as low as possible.The correlation, linear regression and sign stability tests were performed in the following conditions: -different linear and cubic Shambhala modes vs. QN; -LFC, PAL, and BES values calculated using Oncobox ANTE control sampling vs. GTEx control sampling.3) Retention of biologically relevant differences between various types of profiles after harmonization (QN, DESeq2, and multiple Shambhala modifications): -differences between the male and female individual samples for the genes located on sex chromosomes, -differences between cancer expression profiles in hormonedependent, HER2-positive, and triple-negative breast cancer patients.

Gene expression datasets
We curated six gene expression datasets for cancer and corresponding normal transcriptomic profiles, namely, The Cancer Genome Atlas (TCGA) (Tomczak et al., 2015)-both for cancer (i) and normal (ii) tissues; Gene-Tissue Expression Consortium (GTEx) (GTEx Consortium, 2013;The GTEx Consortium et al., 2015)-normal samples obtained using both RNAseq (iii) and MH (iv); (v) Oncobox Atlas of Normal Tissue Expression, ANTE, (Suntsova et al., 2019), and (vi) Oncobox experimental collection of human cancer expression profiles.Among these six gene expression datasets, five were obtained by RNA sequencing (with platforms Illumina HiSeq 2000 and 3000, and one-using expression microarray platform Affymetrix Human Gene 1.1 ST Array (Table 1).Four RNA expression profiling protocols were used to obtain the above six datasets.Two datasets represented cancer samples, and four were obtained for normal human tissues (Table 1).All sequencing data used here represented full-length RNA sequencing.

Shambhala harmonization of gene expression profiles
Harmonization of datasets using Shambhala-1 and Shambhala-2 methods was done as previously described (Borisov et al., 2022), Figure 2.Both methods perform harmonization of each gene expression profile independently in the initial raw (R) dataset.The procedure relies on two preselected auxiliary datasets: the calibration (P) and reference definitive (Q) datasets.Every single profile is taken from the R-dataset, and quantile-normalize (Bolstad et al., 2003) with the P-dataset, to form the transformed dataset Pʹ.Then Pʹ-dataset is normalized using the XPN (Shabalin et al., 2008) or CuBlock (Junet et al., 2021) protocols for Shambhala-1 and -2 methods, respectively, to produce the double transformed dataset Pʺ.From the dataset Pʺ, the finally harmonized individual profile is obtained; the harmonization procedure is repeated for every different profile in the dataset R. The whole procedure converts the initial dataset R into the harmonized dataset H.
We used three alternative rescaling modes for Shambhala-2 method.The P-based rescaling (Borisov et al., 2022) utilizes simple translation and multiplication of the log-expression level (LE g ʺ) for each gene g in the dataset H as follows: LE g = μ gQ + LE g ʺ•σ gQ , where μ gQ and + σ gQ are mean value and standard deviation, respectively, for log-expression level of gene g in the Q-dataset.Second mode termed Q-based rescaling, utilizes setting the mean value, and the standard deviation, for the log-expression levels of each gene in the dataset H, to the corresponding levels of the dataset Q, equal to μ gQ and σ gQ , respectively: LE g = μ gQ + (LE g ʺ − μ gH )•σ gQ /σ gH .The third mode termed R-based rescaling, has the mean value, and the standard deviation, for the log-expression levels of each gene g that are used in the dataset H to the corresponding levels of the dataset R, equal to μ gR and σ gR , respectively: As the auxiliary datasets P and Q, we used the dataset P 0 (obtained using the MH platform Affymetrix Human Genome U133A 2.0 Array) and Q 0 (obtained using the NGS platform GTEx Ilumina HiSeq 2000), respectively.Among other tested Pand Q-datasets, the datasets P 0 and Q 0 showed the best results in our previous studies (Borisov et al., 2022).
To avoid overtraining, we applied the transfer learning approach to the sample tissue type classifiers.After harmonization of gene expression profiles, we used one group of samples as the training dataset, and another group as the validation dataset (Table 1).The number of classes, i.e., biological sample types, was fifteen for the cancer type classifier, and varied from 8 to 20 (depending on the selection of training and validation datasets) for normal tissue classifiers.All the classifiers used the Euclidean feature space of log-expression of each gene (for the kNN approach) and of 20 principal components (for the SVM approach), applying three machine learning methods: (i) Shambhala-1/2 approach (Borisov et al., 2019;2022;Borisov and Buzdin, 2022) to uniformly shaped harmonization of gene expression data.Gene expression profiles for samples (1, . . .,N), e.g., obtained using different experimental platforms are taken one by one, separately merged, and quantilenormalized with an auxiliary calibration dataset P. The resulting dataset Pʹ is then transformed into the format of the reference definitive dataset Q, which results in dataset Pʺ.The latter contains the finally transformed profile of initial sample i, which is considered harmonized (H).The profiles of all other samples are harmonized one by one using the same algorithm.

Clustering quality assessment of harmonized expression profiles
We used Watermelon Multisection (WM) method to quantitatively assess the quality of clustering for the expression profiles under analysis according to (Zolotovskaia M. A. et al., 2020;Borisov and Buzdin, 2022;Borisov et al., 2022).This method returns a specific metric for the assessment of an entropy-based quality of clustering on dendrograms according to known predefined classes.WM can evaluate performance of hierarchical clustering relative to a trait of interest, e.g., known tissue type in our case.When moving from the root of the dendrogram to its distal branches, one can measure information gain (IG) at each node of the dendrogram.In WM, the overall process of gradual information gain at each node is referred to as the observed information gain trajectory.Shortly, the WM metric is the normalized difference between the observed IG, theoretically maximal IG that corresponds to the fastest separation of predefined classes into the distinct branches, and null IG 0 trajectory that describes the worst (totally random) distribution of predefined classes on the dendrogram (Zolotovskaia M. A. et al., 2020;Borisov et al., 2022).Therefore, 0 < WM < 1, and the higher value means better class separation on the dendrogram.Highquality harmonization is expected to result in clustering according to the tissue type, but not according to the experimental platform or other technical factors.Thus, the ratio R WMS WMP of WM-metrics according to biological sample classes (WM S ) and experimental platform classes (WM P ) may be used for evaluating harmonization quality: the higher R means better harmonization quality (Borisov and Buzdin, 2022;Borisov et al., 2022).
We used the following protocol for WM metric evaluation (Borisov et al., 2022).For each harmonizing dataset, we randomly selected five samples for each known combination of tissue type and experimental profiling platform and then calculated the WM metrics for such a selection.Each selection was randomized and repeated 25 times according to (Borisov et al., 2022).

Correlation/regression analysis for median gene expression vectors
For each pair of training and validation datasets, we calculated the median log-expression levels for each gene and each sample type.Let us call the median log-expression level vectors for a certain biological type as v 1 and v 2 , for the training and validation datasets, respectively.For each of v 1 vs. v 2 pairs, we calculated the Spearman correlation and linear regression coefficient (k).The value k was the geometric mean over the values k 1 and k 2 .Here k 1 and k 2 , are the linear regression coefficients with and without the offset item b, in the regression models: then we set the resulting k value to zero.

Calculation of cancer-to-normal logfold change, pathway activation level, and anticancer drug efficiency score values
Cancer-to-normal log-fold change (LFC) of gene expression, pathway activation levels (PALs), and balanced efficiency score of anticancer targeted drugs (BES) were calculated according to (Borisov et al., 2020a;Tkachev et al., 2020b).Cancer and normal tissue samples included in the analysis are listed in Table 2.Note that BES values are calculated using not only drug target genes, but also on the basis of PAL values, which calculation in turn requires nearly 8,000 genes that survive the Shambhala harmonization.

Correlation, regression, and stability analysis for BES values
Shambhala applicability for BES calculations was tested as follows.We performed the correlation and linear regression analysis for two comparisons of BES vectors v 1 vs. v 2 .
For each comparison, we calculated the correlation and linear regression coefficient (k), similarly to the way that we applied to median log-expression values between similar biological types in the training and validation datasets.
Additionally, for these two comparisons, we calculated the percentage of sign-changed BES values as a function of the width (w) of a significance threshold around zero.If for the i-th component (corresponding to the drug i) of the vector v 1 , v 1i < − w/2, and, simultaneously, v 2i > w/2, or vice versa, then the i-th component is considered sign-changing for the comparison of v 1 vs. v 2 .The resulting percentage of sign-changed values is the ratio of the number of signchanged components and of the total number of components.
Since different modes of Shambhala harmonization could affect the absolute values of LFC, PAL and BES, we calculated the percentage of the sign-changed BES values using two modes.1) For the BES values without correction, marked "as is"; 2) For the BES values divided by the corresponding linear regression coefficient k, marked as "divided by k."

Results
In this study, we tried to characterize the ability of Shambhala approach to provide tissue specific clustering of the harmonized gene expression profiles, and at the same time to retain their characteristic differential gene expression patterns.These points were assessed after co-harmonization of gene expression profiles obtained using two different experimental platforms and four library preparation protocols (Table 1).

Differential clustering of human normal and cancer expression profiles
We first investigated the ability of Shambhala to support tissue specific clustering of harmonized expression profiles in comparison with the other, non-uniformly shaped harmonization methods.We took human tissue gene expression datasets from the Gene-Tissue Expression (GTEx) repository (GTEx Consortium, 2013;The GTEx Consortium et al., 2015), The Cancer Genome Atlas (TCGA) database (Tomczak et al., 2015), Atlas of Normal Tissue Expression (ANTE) collection of expression profiles (Suntsova et al., 2019), and from the Oncobox experimental collection of cancer tissue RNA sequencing profiles.The samples were obtained using the versions of NGS platform Illumina: HiSeq 2000 for the TCGA and GTEx RNA sequencing data, and HiSeq 3000 for the ANTE and Oncobox cancer data, and the microarray hybridization (MH) platform Affymetrix Human Gene 1.1 ST Array for the GTEx MH data.Taken together, they represented 37 human tissue types including 15 cancer and 22 normal tissue types.Four different gene library preparation protocols were used for obtaining these datasets (Table 1), one common for the ANTE and Oncobox data; one common for the normal and cancer TCGA data, and specific protocols for the GTEx RNA sequencing, and MH data (Table 1).
The P-, Q-, and R-rescaling modes for Shambhala-2 differ by different mean and standard deviation values, which are applied to set the log-expression levels after the harmonization procedure.Note that all modes of Shambhala reduced the number of genes after harmonization to ~8,000 most strongly expressed "reaper" human genes (Borisov et al., 2019), Table 3.The genes in the harmonized output are formed by the intersection of genes in three datasets: the raw (R), and two auxiliaries (P and Q).The best results, in terms of stressing the biological origin of the profiles and banning the artifacts generated by platform/protocol-specific bias, were obtained for the P-dataset Affymetrix Human Genome U133A 2.0 Array (dataset P 0 ) with 8,385 genes (Borisov et al., 2022).This MH platform is often used for routine transcriptomic profiling.Thus, using this P-dataset acts fairly similarly to the explicit expression filtering, and increases the signal-to-noise ratio, thereby highlighting the biological origin of the samples.For the sake of comparability between different normalization methods, we made all QN, DESeq2, and ComBat calculations with this set of highly expressed ~8,000 genes, which help determine the biological origin of the profile.
We then applied the following machine learning (ML) methods to build tissue type classifiers for the harmonized profiles: (i) 11 nearest neighbors; (ii) first nearest neighbor; (iii) linear support vector machine (SVM).We performed 6 ML tests, one with cancer samples, and five with normal samples.In such tests, training and validation groups of samples were taken from different initial datasets (Table 3).
For each transfer learning test, and each sample type in the classifier, we calculated the accuracy, i.e., percentage of correct tissue type predictions in the validation dataset.
Supplementary Material S1 (Supplementary Figures S1-S6) demonstrates accuracy trends for predicting tissue types.Importantly, uniform output Shambhala method showed comparable performance with the gold standard flexible output methods QN and DESeq2, for both local (kNN-based) and global (SVM-based) classifiers (Figures 3A, C, D; Figure 4).Note that QN, DESeq2, and ComBat methods were also applied to the ~8,000 genes, which survived Shambhala harmonization.Note also that the ComBat method designed for the batch affect removal, artificially makes two or more datasets under analysis to look similar, but mixes up the profiles with the different biological origin obtained using the same platform, which results in poor ML performance (Figures 3B, 4).
In addition, bigger number of samples in the validation dataset generally corresponded to greater prediction accuracy, although this trend is rather vague (Supplementary Material S1).From the Supplementary Figures S3A-C it looks like a number of samples higher than approximately 30 is required for all three classifiers tested to get a high accuracy, with a few exceptions.The reason for this phenomenon that can be seen in Supplementary Figure S1 is unknown to the authors, since the validation dataset is not used for ML model construction, and the number of cases in the validation dataset theoretically should not affect the expected accuracy of prediction.Interestingly, correlation of the prediction accuracy with the number of samples in the training dataset was even lower (data not shown).
Furthermore, Shambhala harmonization helped to restore distant order in the correct grouping of biologically similar samples (Table 4; Supplementary Material S2; Supplementary Figures S1, S2 to Supplementary Figures S2-S5).We assessed elimination of the platform/protocol-specific bias after harmonization using statistical method Watermelon Multisection (WM) (Zolotovskaia M. A. et al., 2020;Borisov and Buzdin, 2022;Borisov et al., 2022).WM enables tracking the entropy loss/ information gain at each node of the clustering dendrogram when moving in the direction from the root to the distal branches, thus giving WM metric for a given dendrogram.It can assess the quality of sample clustering according to known predefined groups (in our case, tissue types).The higher is WM metric, the better is the clustering according to known tissue types, and vice versa.
We, therefore, calculated WM metrics for the harmonized tissue samples in two major settings: WM P for the classes corresponding to experimental platforms/protocols (e.g., TCGA-RNAseq, Oncobox-RNAseq, etc.), and WM S for the classes corresponding to tissue types.Thus, the ratio R WMS WMP may serve as the measure for harmonization quality.The higher is R, the better is clustering according to tissue type in relation to platform-specific bias, and vice versa.
In our analysis, the WM metric showed that the ComBat method had the best performance for platform bias elimination for most ML trials with cancer and normal human tissues, except for merging GTEx and TCGA normal RNAseq profiles, and merging GTEx and ANTE normal RNAseq profiles (Table 4; Supplementary Figures S2-S6 to Supplementary Figures S2-S10 in Supplementary Material S2).However, this elimination of the batch effect is only apparent, and does not result in proper clustering of the same sample types.The high values of R for WM metrics are provided by low values of WM P , rather than the high values of WM S , and profile clustering after ComBat is not done according to the tissue type (Supplementary Figures S2-S6 to Supplementary Figures S2-S10).Thus, according to our findings, ComBat method did not demonstrate an overall superior output as it blurs the similarity between the profiles of the same biological type, even when obtained using one experimental platform.
In addition, Sh2PBR harmonization mode also showed the optimal performance in the correlation-regression analysis of profiles for the same tissue samples.Having averaged the logexpression levels for each gene over all samples of certain type in the training and validation datasets, we arrived at the expression level vectors v 1 and v 2 , respectively, for each tissue type.The distribution of Spearman correlation and linear regression coefficients between the corresponding v 1 and v 2 vectors are shown on Figure 4 and Supplementary Material S3.
An ideal method for cross-platform harmonization should result in as similar as possible transcriptomic profiles for the same samples or the same tissue types, even when obtained using different platforms or protocols.Consequently, the correlation and linear regression coefficients between the v 1 and v 2 vectors should be as close as possible to 1 (note that the linear regression coefficients may be also lower or greater than 1).We found that Sh2PBR method results in the correlation and linear regression coefficients very close to 1, thus being the method of choice for harmonization of human normal transcriptomic profiles (Figure 4 and Supplementary Material S3).
In brief, we may summarize that.-In transfer learning tests, when the ML models were trained on gene expression profiles obtained using one experimental platform and validated on profiles obtained using another platform, the Shambhala modes showed performance better or comparable to QN and DESeq2, and better than ComBat; -In terms of correlation/linear regression coefficients, the Sh2PBR mode showed the results maximally close to 1; -The ComBat method, which is designed to eliminate the batch effect intentionally, showed the least performance in the tests with one cancer and four normal tissue datasets.

Correlation, regression, and sign-change analysis of cancer drug balanced efficiency score (BES) after application of different methods of harmonization
The universal harmonization of gene expression data is of interest not only for proper classification of biosamples.It has also important clinical implications.For example, bioinformatic platform Oncobox is designed for personalized prediction of cancer drug activities using cancer and normal gene expression profiles (Poddubskaya et al., 2019;Borisov et al., 2020a;Tkachev et al., 2020b;Zolotovskaia et al., 2022).It evaluates differential gene expression (reflected by log-fold change (LFC) values) and identifies altered molecular pathways (reflected by pathway activation levels, PALs) which may serve as the molecular targets for cancer drugs, thus giving balanced efficiency score (BES) for every drug under analysis.Other possible gene expression-based methods of drug efficiency scoring should be also mentioned in this context (Lazar et al., 2015;Solomon et al., 2022).However, finding proper tumormatching normal tissues is frequently challenging and cannot accommodate for statistically correct differential gene analysis.It is, therefore, important to compare experimental cancer expression with the normal tissue datasets, e.g., published as the part of GTEx, TCGA, and ANTE projects.Such a comparison may require data harmonization, provided that different equipment, reagents and protocols could be employed for obtaining different datasets.The Shambhala harmonization has filtered away approximately 60% of protein-coding genes with the lowest expression (Table 3, Supplementary Material S4).However, among all 167 drugs in the Oncobox database, 129 (77%) retained all target genes after harmonization (Table 5).The enrichment analysis of Gene Ontology terms for the survived vs. filtered out genes (Figure 5) shows that the majority of survived genes govern important physiological process, including cell cycle, whereas the majority of rejected genes are related to sensory/olfactory mechanism, and only a minor part of neglected genes deal with the mitosis and cancer.Note also that important genes distinguishing between cancer subtypes, e.g., for breast cancer, like ERBB2 (HER2), ESR1 (ER), and two forms of membrane component of PR, PGRMC1 and PGRMC2, survived shambhalization.
Thus, we assessed the influence of Shambhala harmonization on the BES values for these 129 targeted cancer drugs, for totally 605 experimental RNA sequencing samples of 20 cancer types (Table 2).We then analyzed performance of four Shambhala modifications (Sh1, Sh2PBR, Sh2QBR, and Sh2RBR) in a series of tests for correlation, regression, and sign-change assessments.
(i) first, comparison of BES values (v 1 ), which were obtained using QN for Oncobox cancer and ANTE normal gene expression profiles, which is a routinely used protocol for BES calculation in clinical use (Tkachev et al., 2020b), vs. those (v 2 ) which were obtained after Shambhala harmonization of cancer and normal profiles (Figure 6, Supplementary Figures S1-S6) (ii) second, comparison of BES values (v 1 ), which were obtained using ANTE normal gene expression profiles (Suntsova et al., 2019), vs. those (v 2 ) obtained with the corresponding GTEx (GTEx Consortium, 2013) normal profiles (Figure 7, Supplementary Figures S2-S6).
We found that the BES values for QN and all modes of Shambhala harmonization were strongly correlated in all cancer types under analysis (Figure 6, Supplementary Figures S1-S6).It can be expected that ideal data harmonization will keep stable the collection of normal samples; instead, different protocol but the same equipment was used for sequencing in GTEx project.Thus, technically ANTE norms better correspond to the Oncobox cancers than GTEx norms for the same tissues.However, the number of Oncobox ANTE normal samples is limited, whereas GTEx tissue samples are much more numerous.Figure 7,   Frontiers in Molecular Biosciences frontiersin.org16 Borisov et al. 10.3389/fmolb.2023.1237129

Retention of biological properties after uniformly shaped harmonization
Besides many statistical performance indicators such as correlation and regression coefficients, sign stability rates, and ML accuracy, it is also important whether the harmonization retains general biological characteristics of biosamples.Otherwise harmonization output will have little sense even when the statistical metrics look acceptable.One such simple test for retention of biological significance can be done with the sex-specific gene expression.Out of 848 X-chromosome genes, 285 survived Shambhala harmonization, and none survived among the 47 Y-chromosome genes (Supplementary Material S8).Using expression of these survivor X-inked genes as the biomarkers, we performed a principal component analysis assay to check for the presence of distinct clusters formed by the male and female patient biosamples from the Oncobox database (for 202 male and 357 female patients).It can be seen from Figure 10 that all Shambhala modes retained sex specific gene expression pattern comparable to QN, and better than for DESeq2.
We also considered the differential gene patterns between different breast cancer molecular subtypes.Although we expect no essential batch effect in the Oncobox data, the overall expression pattern seems tangled.In terms of clustering dendrograms, the hormone-dependent (ESR1 and PGR-positive subtypes), HER2positive, and triple negative cancer samples are mixed together for all normalization/harmonization methods (Supplementary Material S9).The standard pipelines for differential gene expression analysis (Kuznetsova et al., 2021) seem inappropriate since Shambhala affects the absolute values of LFC.However, Shambhala showed strong ability to retain ROC AUC 2 metrics between these tumor subtypes.Here we obtained  (Lex et al., 2014).
2 The ROC (receiver-operator curve) is a widely-used graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.The ROC is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.The area under the ROC curve, called ROC AUC, or simply AUC, is routinely employed for assessment of any classifier's quality.
pools of marker genes by calculating AUC (with the threshold of AUC > 0.8) for two pairwise comparisons: (i) HER2-positive vs. hormone-dependent BC; (ii), Triple negative vs. HER2-positive BC.For triple negative vs. HER2-positive, and for HER2-positive vs. hormone-dependent breast cancer, the Sh2QBR and Sh2RBR methods introduced numerous artifacts with the marker genes that were absent for QN.Interestingly, DESeq2 also introduced many genes with AUC > 0.8 for HER2-positive vs. hormonedependent comparison that were absent for QN, Sh2PBR, and Sh1 (Figure 11, Supplementary Material S10).Table 6 and Supplementary Material S11 show the most frequently prescribed (by the Oncobox drug scoring system) drugs, and most up-and downregulated genes (with gene description) and pathways for three BC subtypes and different harmonization methods.Thus, our tests with the expression of the X-chromosome genes in male and female cancer patients, and with marker genes for the hormone-dependent vs. HER2-positive vs. triple-negative breast cancers confirmed the best performance of the Sh1 and Sh2PBR modes over other methods tested.

Discussion
Dozens of methods for cross-platform normalization have been developed for the analysis of both microarray hybridization and RNA sequencing gene expression data, yet none of them is currently recognized as an overall gold standard (Borisov and Buzdin, 2022).The majority of these methods return output data in a flexible format (Piccolo et al., 2013;Thompson et al., 2016;Franks et al., 2018;Maleknia et al., 2020;Zhang et al., 2020;Tang et al., 2021;Huang et al., 2022) which requires recalculation of all expression samples every time upon the addition of new gene expression profiles.In search for the universal output format, we developed a novel approach termed Shambhala for uniformly shaped crossplatform harmonization of gene expression data (Borisov et al., 2019;Borisov et al., 2022;Borisov and Buzdin, 2022).The key feature of Shambhala is the one-by-one conversion of each individual profile into the universal shape of the reference definitive dataset independently from the other profiles.This not only creates the basis for unlimited number of further updates of the gene expression bank(s), but also makes it possible to combine together any number of expression datasets.Shambhala allows adding new samples to previous normalized/harmonized data set without the need for renormalizing them, with no currently known limitations for the number of merged datasets and number of samples in such datasets.
In our opinion, the current versions of Shambhala have the following major limitations: limited repertoire of normalized human genes (~8,000, or ~40% of all protein-coding genes) and higher calculation costs per individual profile due to the algorithm complexity.However, Shambhala has the following advantage over other normalization methods: its output is returned in a universal format comparable with all other "shambhalized" human profiles obtained using any experimental platform or protocol.This requires no recalculation of the whole dataset upon the addition of new sample(s).In the current version of Shambhala, the maximum amplification of signal-to-noise ratio was prioritized whereas it resulted in a reduced spectrum of genes in the harmonized output.Thus, we added reliability expression filters (Borisov et al., 2019) and the selection of the most highly expressed genes during intersection with the auxiliary datasets (Borisov et al., 2022).The resulting reduced gene set, however, includes 73%-89% of genes participating in molecular pathways, depending on the pathway database (Nishimura, 2001;Schaefer et al., 2009;Croft et al., 2014;Zolotovskaia et al., 2022) and ~90% of molecular targets of targeted cancer therapeutics (Table 5) which makes such an analysis meaningful for cancer research.However, developing the next versions of Shambhala enabling the high-quality transformation of a greater proportion (ideally all) of human genes, and also non-human genes for other model objects will be a matter of our further studies.
Our previous analysis (Borisov et al., 2022) has demonstrated the strong capacity of the cubic transformation-based Shambhala method (Shambhala-2) to eliminate the platform bias from the clustering dendrograms of gene expression profiles, which assured -Clustering of samples according to their biological features, and not according to technical factors like platform/protocol used; -Close to 1 correlation coefficients for gene expression values in comparison with the initial datasets; -Close to 1 linear regression coefficients for relative gene expression values and their derivatives like pathway activation levels (PAL) and drug efficiency scores (BES) in comparison with the initial datasets; -Stability of sign for logarithmic relative gene expression values like fold-change and their derivatives like drug efficiency score in comparison with the initial datasets.
-Conservation or "common sense" biological properties, such as expression levels of sex-related genes, or marker genes between disease subtypes.
In addition, taking into account Big Data analytic approaches, another desirable feature is data suitability for various ML methods, e.g., local and global.An ideal harmonization of gene expression data has to generate similar molecular profiles for the samples of similar biological origin, even when obtained using different equipment and protocols.This means that the ML process trained on the harmonized data obtained using one platform and validated on the harmonized data obtained using another platform should show high overall accuracy.Likewise, the profiles of similar biological origin but obtained using different equipment/protocols, should be strongly correlated.
We compared performance of four available versions of Shambhala with the gold standard flexible output methods QN and DESeq2 according to twelve analytic criteria reflecting the above considerations (Table 7).For most criteria the Sh2PBR mode was either better (criteria #3, 4, 5, 8, 11, and 12) or at least comparable (#1, 2, and 6) than any other method under consideration.Only for the criterion #7, the method Sh2PBR showed unusually low linear regression coefficient k ~0.20÷0.25,which indicates that Sh2PBR causes four-tofive-fold decrease of case-to-control log-fold changes in gene expression levels (Figure 6B).However, this proportional variation of case-to-control log-fold changes does not perturb the correlation coefficients, order, and expression ranks of the individual genes and molecular pathways, and also of the predicted drug efficiency scores (Supplementary Figures S1-S6).Moreover, after the Sh2PRB harmonization, the percentage of sign-changed BES values remained lower than all other harmonization methods by approximately 20% (Figures 8A,B).
In Table 8 we summarized the recommendations for the use of different normalization/harmonization methods.Overall, our correlation, regression, and sign-change analysis has demonstrated the best results of the Sh2PBR version of Shambhala.We, therefore, suggest that Sh2PBR can be considered as the method of choice for harmonization of various types of human gene expression data.

FIGURE 1
FIGURE 1 Study design.(A): Comparison of different cancer and normal tissue datasets of gene expression profiles obtained using different equipment/ toolkits.(B) Analysis of correlation and linear regression coefficients, as well as sign stability for downstream measures of gene expression derivatives.(C) Analysis for retention of biologically relevant differences between various cancer profiles after harmonization.

FIGURE 7
FIGURE 7 Spearman correlation and linear regression coefficients between the corresponding BES values for the same cancer drugs calculated with ANTE vs. GTEx normal references for 20 cancer types and 129 targeted cancer drugs.(A) QN, (B) Sh1, (C) Sh2PBR, (D) Sh2QBR, (E) Sh2RBR.

FIGURE 8
FIGURE 8Percentage of BES values, which change their sign when using Shambhala harmonization instead of QN (A), and when using GTEx instead of the ANTE normal references (B), as a function of width (w) for the sign-changing significance threshold in all 605 experimental Oncobox samples of 20 cancer types, and 129 anti-cancer target drugs.Mode "as is" is given for BES values without correction.Mode "divided by k" is given for BES values divided by the corresponding linear regression coefficients (see Figure3for (A); Figure 4 for (B)).The width (w) was measured in terms of median absolute values of BES for all cancer cases.

FIGURE 9
FIGURE 9Percentage of LFC (A), PAL (B), and BES (C) values, which change their sign when using GTEx instead of the ANTE normal references, as a function of width (w) for the sign-changing significance threshold in 74 breast cancer samples with known cancer subtype (hormone-dependent, HERS-positive, and triple negative, Supplementary Material S7).The width (w) was measured in terms of median absolute values of LFC, PAL, or BES for breast cancer cases.

FIGURE 11
FIGURE 11UpSet intersection diagrams(Conway et al., 2017) of marker genes (AUC > 0.80) for the comparison of breast cancer (BC) expression profiles of different molecular subtypes.(A) HER2-positive vs. hormone-dependent BC. (B) triple negative vs. HER2-positive BC.Along the x-axis of the UpSet diagrams, all observed intersections of the analyzed sets are represented.The intersections are marked with turquoise dots, with the rows of the dot matrix corresponding to different sets, and columns showing the intersection combinations.UpSet diagrams were proposed as more informative alternative to traditional Venn diagrams for the analysis of more than four intersecting datasets(Lex et al., 2014).

TABLE 1
Gene expression sample types and group sizes used for the tissue type classifier comparisons.
(Continued on following page)Frontiers in Molecular Biosciences frontiersin.org

TABLE 1 (
Continued) Gene expression sample types and group sizes used for the tissue type classifier comparisons.

TABLE 2
Number of cancer and normal tissue gene expression profiles used for the functional tests of harmonization methods.

TABLE 3
Overview of tissue type transfer learning classifiers.

TABLE 4 Median
R values for WM-based quality metrics of different harmonization experiments.Genes survived after Shambhala harmonization which are included in the molecular pathway and drug target databases(Zolotovskaia et al., 2022)used in this study.

TABLE 6
Most frequently prescribed drugs, according to Oncobox drug scoring system for 74 breast cancer samples and different harmonization methods.

TABLE 7
Twelve performance criteria for versions of Shambhala method with universal gene expression harmonization output in comparison with the flexible output methods QN, DESeq2, and ComBat.