Application note: TDbasedUFE and TDbasedUFEadv: bioconductor packages to perform tensor decomposition based unsupervised feature extraction

Motivation: Tensor decomposition (TD)-based unsupervised feature extraction (FE) has proven effective for a wide range of bioinformatics applications, from biomarker identification to the identification of disease-causing genes and drug repositioning. However, TD-based unsupervised FE has failed to gain widespread acceptance because of the lack of user-friendly tools for non-experts.
Results: We developed two Bioconductor packages, TDbasedUFE and TDbasedUFEadv, that enable researchers unfamiliar with TD to use TD-based unsupervised FE. The packages facilitate the identification of differentially expressed genes and multiomics analysis. TDbasedUFE was found to outperform two state-of-the-art methods, DESeq2 and DIABLO.
Availability and implementation: TDbasedUFE and TDbasedUFEadv are freely available as R/Bioconductor packages, which can be accessed at https://bioconductor.org/packages/TDbasedUFE and https://bioconductor.org/packages/TDbasedUFEadv, respectively.


Introduction
Tensor decomposition (TD)-based unsupervised feature extraction (FE) has been successfully applied to a wide range of problems (Taguchi, 2020) since it was introduced several years ago (Taguchi, 2017). Despite its success, the method has failed to gain widespread acceptance, possibly because of the lack of practical tools to perform TD. To this end, we have developed two Bioconductor packages, TDbasedUFE and TDbasedUFEadv, which allow researchers to perform TD-based unsupervised FE easily without detailed knowledge of TD. The purpose of this manuscript is not to demonstrate superiority over other methods, since this has already been demonstrated in the numerous studies cited below. Rather, its purpose is simply to report the implementation of the established method in an easy-to-use environment.

Methods
TD-based unsupervised FE (Taguchi, 2017) was derived from principal component analysis (PCA)-based unsupervised FE (Taguchi and Murakami, 2013), which was introduced 10 years ago. As datasets grew in complexity and began to include multiple measurement conditions, such as comparisons of multiple tissues from human subjects rather than a single tissue from patients, tensors were employed instead of matrices. Tensors, which can have multiple indices, each associated with its own comparison criteria, better accommodate complex data structures. For example, a three-mode tensor $x_{ijk}$ can naturally store the expression of the $i$th gene in the $k$th tissue of the $j$th human subject. In contrast, matrices, which have only two indices corresponding to rows and columns, require the tissue index and the subject index to be combined into a single column index, which makes the data difficult to interpret.
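As a minimal illustration of the difference between the two layouts, the following Python/NumPy sketch stores a small, made-up expression array both as a three-mode tensor and as a matrix whose columns combine the subject and tissue indices. The array sizes, variable names, and filler values are assumptions introduced purely for illustration; they are not part of TDbasedUFE or TDbasedUFEadv.

```python
import numpy as np

# Hypothetical sizes: 4 genes, 3 subjects, 2 tissues.
n_genes, n_subjects, n_tissues = 4, 3, 2

# Three-mode tensor x[i, j, k]: gene i, subject j, tissue k.
# A running counter is used as filler so that entries are easy to trace.
x = np.arange(n_genes * n_subjects * n_tissues, dtype=float).reshape(
    n_genes, n_subjects, n_tissues
)

# In the tensor, each index keeps its own meaning.
i, j, k = 2, 1, 0
tensor_value = x[i, j, k]

# In the matrix layout, subject and tissue are merged into one column index.
matrix = x.reshape(n_genes, n_subjects * n_tissues)
column = j * n_tissues + k
matrix_value = matrix[i, column]

# Recovering (subject, tissue) from a column requires knowing the merge rule.
j_back, k_back = divmod(column, n_tissues)
```

Both layouts hold the same numbers, but only the tensor keeps "subject" and "tissue" as separate axes, which is what allows a decomposition to assign separate singular value vectors to each of them.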
TDbasedUFE and TDbasedUFEadv are user-friendly packages that allow individuals who are unfamiliar with tensors to perform unsupervised feature extraction. Since a matrix can be considered a two-mode tensor, these packages can also be used to apply PCA-based unsupervised FE to a dataset. TDbasedUFE focuses on two popular functions developed for TD-based unsupervised FE, namely the identification of differentially expressed genes (DEGs) and multiomics analysis. For DEG identification, the basic algorithm is based on a recent study (Taguchi and Turki, 2022b) that established a new standard deviation (SD) optimization approach. For multiomics analysis, the basic algorithm is based on a previous study (Taguchi and Turki, 2022c). However, TDbasedUFE also incorporates SD optimization, which was not available when that study was published. Although the algorithm is not specifically designed for DNA methylation profiles, we found that the approach described in Taguchi and Turki (2022b) is also applicable to DNA methylation profiles (Taguchi and Turki, 2023). In this regard, any type of differential analysis on single omics data can be performed by the functions implemented in TDbasedUFE. In fact, we have shown (Turki et al., 2023) that histone modification profiles can be analyzed using the algorithm described in Taguchi and Turki (2022b).
TDbasedUFE and TDbasedUFEadv accept a multiple omics profile dataset formatted as a tensor. TD is applied to this dataset using Tucker decomposition based on the higher order singular value decomposition (HOSVD) algorithm (Taguchi, 2020). For instance, if $x_{ijk} \in \mathbb{R}^{N \times M \times K}$ represents the expression of the $i$th gene in the $k$th tissue of the $j$th human subject (Figure 1 left), TD is applied to $x_{ijk}$, and the following equation is obtained:

$$x_{ijk} = \sum_{\ell_1=1}^{N} \sum_{\ell_2=1}^{M} \sum_{\ell_3=1}^{K} G(\ell_1 \ell_2 \ell_3) u_{\ell_1 i} u_{\ell_2 j} u_{\ell_3 k}$$

where $G \in \mathbb{R}^{N \times M \times K}$ is a core tensor that represents the weight of the product $u_{\ell_1 i} u_{\ell_2 j} u_{\ell_3 k}$ to $x_{ijk}$, and $u_{\ell_1 i} \in \mathbb{R}^{N \times N}$, $u_{\ell_2 j} \in \mathbb{R}^{M \times M}$, and $u_{\ell_3 k} \in \mathbb{R}^{K \times K}$ are singular value matrices, which are orthogonal matrices. Initially, the singular value vectors attributed to samples, $u_{\ell_2 j}$ and $u_{\ell_3 k}$, are investigated to identify those of interest. For instance, a $u_{\ell_2 j}$ of interest may represent the distinction between healthy controls and patients, and a $u_{\ell_3 k}$ of interest may represent tissue specificity (e.g., expressed only in the heart). Then, the singular value vectors attributed to genes (i.e., features), $u_{\ell_1 i}$, that share the $G$ of the largest absolute value with the identified $u_{\ell_2 j}$ and $u_{\ell_3 k}$ are selected. Features ($i$s) with larger absolute values of $u_{\ell_1 i}$ are identified based on P-values computed by assuming that $u_{\ell_1 i}$ obeys a Gaussian distribution (null hypothesis) as follows:

$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{\ell_1 i}}{\sigma_{\ell_1}} \right)^2 \right]$$

where $P_{\chi^2}[> x]$ is the cumulative $\chi^2$ distribution where the argument is larger than $x$, and $\sigma_{\ell_1}$ is the standard deviation optimized such that $u_{\ell_1 i}$ obeys a Gaussian distribution as closely as possible (see Taguchi and Turki, 2022b for more details about how to optimize $\sigma_{\ell_1}$). The $P_i$s are then adjusted using the Benjamini-Hochberg criterion to correct for multiple comparisons. Finally, the $i$s with adjusted $P_i$ less than a threshold value (typically, 0.01) are selected.
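The procedure described above (HOSVD, choice of the gene singular value vector through the core tensor, P-values under a Gaussian null hypothesis, and Benjamini-Hochberg adjustment) can be sketched in Python/NumPy as follows. This is an illustrative sketch under several assumptions, not the code of the packages: all function and variable names are hypothetical, the $\chi^2$ distribution is taken to have one degree of freedom, the truncated ("economy") SVD is used so that a factor matrix may have fewer columns than rows, the sample-side indices are passed in by hand, and, most importantly, $\sigma_{\ell_1}$ is replaced by a simple robust estimate (1.4826 times the median absolute value) rather than the SD optimization of Taguchi and Turki (2022b).

```python
import math

import numpy as np


def hosvd(x):
    """Return (core, factors) such that
    x[i, j, k] = sum_{l1,l2,l3} core[l1, l2, l3] * U1[i, l1] * U2[j, l2] * U3[k, l3],
    where factors = [U1, U2, U3] and column l of a factor is a singular value vector.
    """
    factors = []
    for mode in range(x.ndim):
        # Unfold so that `mode` indexes the rows, then keep the left singular vectors.
        unfolded = np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(u)
    core = x
    for mode, u in enumerate(factors):
        # Project each mode onto its singular value vectors to obtain the core tensor.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors


def chi2_pvalues(u, sigma):
    """P_i = P_chi2[> (u_i / sigma)^2], assuming one degree of freedom.

    For one degree of freedom this upper tail equals erfc(|u_i| / (sigma * sqrt(2))).
    """
    return np.array([math.erfc(abs(v) / (sigma * math.sqrt(2.0))) for v in u])


def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


def select_features(core, factors, l2, l3, threshold=0.01):
    """Pick l1 with the largest |G(l1, l2, l3)|, then select features by adjusted P-value."""
    l1 = int(np.argmax(np.abs(core[:, l2, l3])))
    u = factors[0][:, l1]
    # ASSUMPTION: robust stand-in for sigma, NOT the SD optimization of the original method.
    sigma = 1.4826 * np.median(np.abs(u))
    adjusted = benjamini_hochberg(chi2_pvalues(u, sigma))
    return l1, np.flatnonzero(adjusted < threshold)


# Toy rank-one tensor (hypothetical data): only genes 0 and 1 carry a strong signal.
gene_profile = np.array([1.0, 1.0] + [0.01 * (-1) ** i for i in range(98)])
subject_profile = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])  # controls vs. patients
tissue_profile = np.array([1.0, 0.1, 0.1])  # expressed mostly in one tissue
x = np.einsum("i,j,k->ijk", gene_profile, subject_profile, tissue_profile)

core, factors = hosvd(x)
# l2 and l3 are chosen by inspecting the sample-side vectors; here the first ones.
l1, selected = select_features(core, factors, l2=0, l3=0)
print(l1, selected)
```

In this toy example the first subject vector separates the two groups and the first tissue vector is dominated by one tissue, so `l2=0` and `l3=0` are passed directly; with real data these indices have to be chosen by examining the singular value vectors, as described in the text.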
When TDbasedUFE is applied to multiomics datasets (Figure 1 right), the multiomics profiles are formatted as $x_{i_k j} \in \mathbb{R}^{N_k \times M}$ (i.e., the $k$th omics dataset is associated with $N_k$ features). The $x_{i_k j}$s are multiplied with each other to obtain the following equation:

$$x_{jj'k} = \sum_{i_k=1}^{N_k} x_{i_k j} x_{i_k j'}$$

HOSVD is then applied to $x_{jj'k}$ as follows:

$$x_{jj'k} = \sum_{\ell_1} \sum_{\ell_2} \sum_{\ell_3} G(\ell_1 \ell_2 \ell_3) u_{\ell_1 j} u_{\ell_2 j'} u_{\ell_3 k}$$

After identifying the $u_{\ell_2 j}$ coincident with labels (e.g., patients and healthy controls), the singular value vectors attributed to individual features associated with the $k$th omics are computed as follows:

$$u_{\ell_2 i_k} = \sum_{j=1}^{M} u_{\ell_2 j} x_{i_k j}$$

$P_{i_k}$ is then computed as follows:

$$P_{i_k} = P_{\chi^2}\left[ > \left( \frac{u_{\ell_2 i_k}}{\sigma_{\ell_2}} \right)^2 \right]$$

and the $i_k$s associated with adjusted $P_{i_k}$ less than 0.01 are selected. In contrast to TDbasedUFE, which can perform only two tasks, TDbasedUFEadv can perform more complicated tasks. For example, TDbasedUFEadv can perform integrated analysis of two omics profiles that share samples, reducing the memory required by summing over the sample index (Ng and Taguchi, 2020). TDbasedUFEadv can also perform integrated analysis of two omics profiles that share features (Taguchi and Turki, 2019), as well as integrated analysis of multiple (more than two) omics profiles that share features (Taguchi and Turki, 2022a) or samples (Taguchi and Turki, 2021).

... of 20,501 mRNAs, and 7,295 out of 485,577 methylation probes are associated with adjusted $P_{i_k}$ less than 0.01 (these features are expected to be distinct between the four stages as well).
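The multiomics steps described above (multiplying each omics profile with itself over its feature index, decomposing the resulting sample-by-sample-by-omics tensor, and mapping the label-associated sample vector back to the features of each omics layer) can be sketched in Python/NumPy as follows. The sketch rests on stated assumptions: the function names and the toy data are hypothetical, only the sample-mode singular value vectors are computed (via the SVD of the corresponding unfolding, which is the step HOSVD performs for that mode), and the vector coincident with the labels is picked automatically by its inner product with a known label vector instead of by inspection.

```python
import numpy as np


def sample_tensor(omics):
    """x[j, j', k] = sum over features i_k of x_k[i_k, j] * x_k[i_k, j'].

    `omics` is a list of arrays of shape (N_k, M) that share the M samples.
    """
    return np.stack([x_k.T @ x_k for x_k in omics], axis=2)


def sample_vectors(x):
    """Singular value vectors of the sample mode j (one per column)."""
    u, _, _ = np.linalg.svd(x.reshape(x.shape[0], -1), full_matrices=False)
    return u


def feature_vector(x_k, u_l):
    """u[l, i_k] = sum_j u[l, j] * x_k[i_k, j] for one omics layer."""
    return x_k @ u_l


# Toy data (hypothetical): 6 samples, two omics layers with 4 and 3 features.
labels = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])  # e.g., controls vs. patients
ones = np.ones_like(labels)
omics = [
    # Feature 0 follows the labels; features 1 to 3 are constant across samples.
    np.outer([3.0, 0.0, 0.0, 0.0], labels) + np.outer([0.0, 5.0, 5.0, 5.0], ones),
    # Feature 1 follows the labels; features 0 and 2 are constant across samples.
    np.outer([0.0, 2.0, 0.0], labels) + np.outer([4.0, 0.0, 4.0], ones),
]

x = sample_tensor(omics)  # shape (M, M, K) = (6, 6, 2), independent of N_k
u = sample_vectors(x)

# ASSUMPTION: choose the sample vector most aligned with the known labels.
ell = int(np.argmax(np.abs(labels @ u)))

# Map the chosen sample vector back to the features of each omics layer.
features = [feature_vector(x_k, u[:, ell]) for x_k in omics]
for k, f in enumerate(features):
    print(k, np.round(np.abs(f), 3))
```

Because the feature index is summed out before the decomposition, the tensor being decomposed has a size determined by the number of samples and omics layers only, however many features each layer contains. The feature-side vectors obtained at the end would then be converted to P-values and adjusted in the same way as in the single-omics case.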
To compare the performance of TDbasedUFE with those of state-of-the-art (SOTA) methods, we employed DIABLO, which is implemented in the mixOmics package (Rohart et al., 2017) in Bioconductor (please refer to the Supplementary Document for the R code to perform multiomics analysis using DIABLO). Even when we used the minimum setup (folds=2, nrepeat=1), DIABLO failed to converge to a solution within 3 h. When the setup recommended in the vignette (folds=10, nrepeat=10) was employed, DIABLO did not converge to a solution with sufficiently few errors with up to 10 components (ncomp=10), and the errors showed no tendency to decrease as the number of components increased (Supplementary Figure S3). As a result, we were unable to select features using DIABLO and had to conclude that TDbasedUFE outperformed DIABLO for this multiomics dataset.
To evaluate the biological relevance of the miRNAs, mRNAs, and methylation probes identified by TDbasedUFE, we uploaded them to various databases. First, we uploaded the identified miRNAs to DIANA-miRPath v3.0 (Vlachos et al., 2015) and found that many cancer-related KEGG pathways are enriched (please refer to the Supplementary Document for the URL to DIANA-miRPath using these miRNAs). Next, we uploaded the identified mRNAs to Enrichr (Xie et al., 2021) and found many cancer-related pathways in the "KEGG 2021 Human" category, as well as various cancer cell lines. Finally, we uploaded the 2,668 unique gene symbols associated with the 7,295 identified probes to Enrichr and found several cancer-related pathways in "KEGG 2021 Human", as well as various cancer cell lines. In conclusion, the miRNAs, mRNAs, and methylation probes identified by TDbasedUFE are biologically relevant.

Discussion
Here, we have introduced TDbasedUFE and TDbasedUFEadv, two packages that can perform TD-based unsupervised FE without requiring extensive knowledge of tensor decomposition. Our results demonstrated that these packages outperform two SOTA methods, DESeq2 and DIABLO, when applied to DEG identification and multiomics analysis, respectively. With TDbasedUFE and TDbasedUFEadv, users can perform TD-based unsupervised FE easily and effectively.
In this implementation, TDbasedUFE/TDbasedUFEadv can seamlessly accept a variety of datasets generated by high-throughput sequencing and/or older microarray platforms. TDbasedUFE/TDbasedUFEadv can also accept various combinations of these profiles as inputs (multiomics analysis). TDbasedUFE/TDbasedUFEadv outputs a list of features associated with (adjusted) P-values. The output features depend on the input features: when genes are input, the output features are genes, and when genomic regions are input, the output features are genomic regions. In downstream analyses, the list of features can be subjected to enrichment analysis to interpret its biological meaning.
The current implementation has no specific limitations, since the implemented methods have already been tested on various topics in the numerous previous publications cited in this study. No future directions are proposed, since this is a report on the implementation of an established method.
For other unsupervised gene selection methods, readers may consult the review article by Ang et al. (2016), although it lists only fifteen studies, published between 2006 and 2012, which is relatively few compared with the number of our publications cited in this paper.

Figure 1. Schematic diagram that explains TD-based unsupervised FE. Left: DEG identification. (1) $u_{\ell_2 j}$ associated with the distinction between patients and healthy controls is selected. (2) $u_{\ell_3 k}$ associated with tissue specificity is selected. (3) $G(\ell_1 \ell_2 \ell_3)$ is investigated with fixed $\ell_2$ and $\ell_3$. (4) $u_{\ell_1 i}$ with $G$ of the largest absolute value is selected. (5) $i$s (indicated in red) whose absolute values are significantly larger than expected are selected. Right: Multiomics analysis. (1) $u_{\ell_2 j}$ associated with the distinction between patients and healthy controls is selected. (2) $u_{\ell_2 i}$ is computed from $u_{\ell_2 j}$. (3) $i$s (indicated in red) whose absolute values are significantly larger than expected are selected.