Omics data integration in computational biology viewed through the prism of machine learning paradigms

Large quantities of biological data can today be acquired to characterize cell types and states, from various sources and using a wide diversity of methods, providing scientists with more and more information to answer challenging biological questions. Unfortunately, working with this amount of data comes at the price of ever-increasing data complexity. This is caused by the multiplication of data types and batch effects, which hinders the joint usage of all available data within common analyses. Data integration describes a set of tasks geared towards embedding several datasets of different origins or modalities into a joint representation that can then be used to carry out downstream analyses. In the last decade, dozens of methods have been proposed to tackle the different facets of the data integration problem, relying on various paradigms. This review introduces the most common data types encountered in computational biology and provides systematic definitions of the data integration problems. We then present how machine learning innovations were leveraged to build effective data integration algorithms that are widely used today by computational biologists. We discuss the current state of data integration and important pitfalls to consider when working with data integration tools. Finally, we detail a set of challenges the field will have to overcome in the coming years.


Introduction
This last decade has witnessed a sharp increase in the amount and complexity of data produced for cellular biology, thanks to an ever-growing number of bulk and single-cell profiling assays. These technologies allowed scientists to study heterogeneous cell populations through many biological feature spaces (or modalities) such as mRNA expression (Klein et al., 2015;Macosko et al., 2015), DNA methylation (Guo et al., 2013) and chromatin accessibility (Buenrostro et al., 2015a;Buenrostro et al., 2015b), and protein abundance (Aebersold and Mann, 2003;Westermeier and Marouga, 2005;Tibes et al., 2006). These assays can be carried out either in bulk, which yields for each sample a single averaged molecular profile, or at the single-cell level, which provides an exquisite insight into cell states and types present in the cell population. In particular, carrying out biological assays at the single-cell level snapshots cells at various points of a dynamical process, which can then be leveraged for various applications such as lineage tracing (Schiebinger et al., 2019), transcriptional dynamics (La Manno et al., 2018), inference of transcriptional trajectories (Chen H. et al., 2019) and many more.
In addition, during the last few years, there have been several joint assays proposed to profile single cells through several modalities simultaneously, such as scM&T-seq for transcriptome and methylome (Angermueller et al., 2016), sc-GEM for genotype, transcriptome and methylome (Cheow et al., 2016), CITE-seq for transcriptome and surface proteins (Stoeckius et al., 2017), or SNARE-seq for transcriptome and chromatin accessibility (Chen S. et al., 2019). It is also worth mentioning spatial transcriptomics, which yields measurements from a small number of cells in each well while also providing positional information of cells within the biological tissue (Ståhl et al., 2016). Finally, important phenotypical information can be obtained from microscopic imaging data, such as whole slide imaging (Pantanowitz et al., 2011).
Hand in hand with the surge of biological modalities, there has been an explosion in the number of available datasets, helped by various scientific initiatives to make biological data more easily available (Conesa and Beck, 2019). Among these initiatives, one can mention atlases of entire organisms such as the Tabula Muris (Schaum et al., 2018) and Tabula Sapiens (Tabula Sapiens Consortium et al., 2022) consortia. Disease-based atlases are also worth mentioning, such as The Cancer Genome Atlas (TCGA) database (Weinstein et al., 2013) and the IMMUcan database (Camps et al., 2023), which provides detailed insight into the nature of the tumor microenvironment. When tackling difficult biological questions, using data gathered across different sources or modalities is enticing. On the one hand, combining data from different sources helps to provide a comprehensive view of the biological object of interest. For example, it can facilitate the discovery of rare but relevant cell types or states, or help quantify the relative abundance of cell types across a collection of biological samples. On the other hand, having different modalities at their disposal allows scientists to link them together, possibly leading to exciting mechanistic discoveries. Finally, analyzing a biological object through several modalities simultaneously can reveal emergent information that analyzing each modality individually would miss.
Unfortunately, there are several obstacles to overcome before data from several sources and modalities can be used within an analysis pipeline. First, the multiplicity of sources comes at the price of all sorts of batch effects, as datasets can come from different replicates, technologies, individuals, or even species. Second, combining datasets containing measurements from different modalities is a major computational challenge, especially when samples are not linked across datasets, as there is no trivial common space to embed samples together. Therefore, there is a real need for methods and tools able to tie together biological datasets across sources (or batches) and modalities. In this review, we investigate this question through the prism of machine learning paradigms, and present how a few of these concepts are today widely used within popular, state-of-the-art data integration methods.

Data integration links biological datasets across batches or modalities
Data integration describes a set of problems that represent different facets of the question of tying together biological datasets across batches and modalities: vertical, horizontal, diagonal and mosaic integration (Argelaguet et al., 2021), which indicate the nature of anchors that exist between datasets ( Figure 1A).
In vertical integration (VI), each dataset contains a set of measurements carried out on the same set of samples (separate bulk experiments with matched samples in different modalities or single-cells measured through joint assays) ( Figure 1B). VI identifies links between biological features, such as scRNA-seq transcript counts and scATAC-seq peaks, which can help formulate mechanistic hypotheses across modalities. VI methods usually rely on dimensionality reduction, matrix factorization, or modeling. Some can be endowed with additional biological knowledge, such as pathway data and functional interaction between features across modalities.
Horizontal integration (HI) describes the complementary task where several datasets have been acquired in the same biological modality, allowing multiple batches to be expressed within a common feature space (Figure 1C). HI's primary use is to correct batch effects between datasets that can be explained by experimenter variation, different sequencing technologies, or interindividual biological specificities (e.g., species, sex, or ethnicity). HI has been a very popular research topic for the last few years, and many HI tools have been proposed to this day. They can rely on a large variety of computational paradigms such as nearest neighbors, clustering, deep neural networks, matrix factorization, manifold alignment, and many more. Some tools may require additional priors, such as selecting a reference dataset or having access to cell types as labels.
When no trivial anchoring exists between datasets, diagonal (DI) or mosaic integration (MI) formalisms must be used. DI describes the framework where each dataset is measured in a different biological modality, while MI allows pairs of datasets to be measured in overlapping modalities ( Figure 1D). DI and MI are the most challenging facets of data integration and are subject to active research. Methods proposed to perform DI and MI usually rely on advanced machine learning paradigms capable of high levels of abstraction, such as deep neural networks, manifold alignment, or transport theory. Some tools operate in a completely unsupervised fashion, while others require additional information to help them bridge the gap between modalities.
Data integration of biological data is tightly related to several machine learning topics such as domain adaptation (Pan et al., 2010; You et al., 2019; Farahani et al., 2021), data fusion (Castanedo, 2013; Gao et al., 2020) and manifold alignment (Wang et al., 2011). Therefore, it is unsurprising to observe strategies leveraging similar machine learning paradigms such as supervised dimensionality reduction, matrix factorization, nearest neighbors, optimal transport, or deep autoencoders. Interestingly, new methods in all these domains go hand in hand with advances in machine learning, with many recent methods featuring advanced machine learning concepts. This is arguably a natural evolution as data complexity and quantity increase, which motivates the need for more powerful models capable of increased levels of abstraction.

Horizontal integration (HI) links batches anchored by their common modality
Horizontal integration (HI) describes the situation where several batches are all gathered in a common modality with overlapping feature spaces. It is worth noting that, depending on the tool, it may suffice that each pair of datasets shares an overlapping feature space (e.g., dataset A containing features f1, f2, dataset B containing features f1, f3 and dataset C containing features f2, f3). HI is a convenient framework in which cells can directly be compared across different batches due to their feature space overlap, which allows the use of natural concepts such as distances, neighborhoods, or similarity measures. Many tools have been proposed to tackle HI, and we gathered a non-exhaustive list of them in Table 1. These methods use various strategies to identify similar cells across batches and embed cells into a joint space. Some require additional information, such as reference datasets or cell labels. The remainder of this section is devoted to describing the main computational principles and machine learning paradigms HI methods rely on and providing some rationale and guidelines about each of them.
Many HI methods rely on manifold alignment strategies to integrate batches together (Figure 2A), allowing them to consider the whole data structure instead of matching individual cells. Perhaps the oldest and most natural manifold alignment technique is Procrustes analysis (Gower, 1975), named after the bandit of Greek mythology who cut or stretched his victims so that they fit the length of his bed. This is an old and intuitive machine learning paradigm, mostly used for shape alignment, that aims at projecting query datasets onto a reference one while only allowing simple transformations (rotation, rescaling, and shifting). Procrustes-based methods are not often used to integrate single-cell data, although some attempts can be found in the literature (Eto et al., 2018).
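To make the Procrustes idea concrete, the following minimal sketch aligns a query point cloud onto a reference one using only a shift, a uniform rescaling and a rotation (or reflection). It assumes that rows of both arrays are already matched one-to-one, which is rarely the case for real single-cell batches; the function name `procrustes_align` is ours and does not correspond to any tool cited in this review.

```python
import numpy as np


def procrustes_align(query, reference):
    """Map `query` onto `reference` with a shift, a uniform rescaling and a
    rotation/reflection. Rows of both arrays are assumed to be matched."""
    query = np.asarray(query, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mu_q, mu_r = query.mean(axis=0), reference.mean(axis=0)
    q, r = query - mu_q, reference - mu_r  # remove the shift

    # Orthogonal Procrustes: the best rotation comes from the SVD of q^T r.
    u, s, vt = np.linalg.svd(q.T @ r)
    rotation = u @ vt
    scale = s.sum() / (q ** 2).sum()  # least-squares uniform rescaling

    return scale * (q @ rotation) + mu_r
```

Because only three simple transformations are allowed, this kind of alignment cannot absorb non-linear batch effects, which may partly explain why it is seldom used on single-cell data.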
First applied to single-cell data to infer cell differentiation trajectories (Schiebinger et al., 2019), discrete optimal transport (OT) theory and its extensions (Gromov-Wasserstein, partial OT, unbalanced OT) form the most popular paradigm used for manifold alignment-based HI. OT represents each batch as a discrete probability distribution, that is, a weighted point cloud in a metric space, and aligns batches using pairwise cell-cell cost matrices that are often distance matrices. OT and its extensions have been successfully applied to horizontal and diagonal data integration (Cao et al., 2022b; Demetci et al., 2022). Manifold alignment-based HI is a powerful paradigm, but it can sometimes struggle to solve complex alignment tasks (for instance, when the structure of a dataset presents ambiguous symmetries or when some batches contain specific cell types that must not be aligned).
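As an illustration of the discrete OT formulation, the sketch below computes an entropy-regularized transport plan between two batches with Sinkhorn iterations, then projects the cells of the first batch onto the second one. It is a simplified example under our own assumptions (uniform cell weights, squared Euclidean cost, fixed regularization `epsilon`, fixed number of iterations), not the algorithm of any specific tool; the function names are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(x, y):
    """Squared Euclidean cost matrix between cells of two batches."""
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)


def sinkhorn_plan(cost, epsilon=0.05, n_iter=500):
    """Entropy-regularized OT plan between two uniformly weighted batches."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # weights of cells in the first batch
    b = np.full(m, 1.0 / m)  # weights of cells in the second batch
    # Rescale the cost by its maximum so that epsilon is unit-free.
    kernel = np.exp(-cost / (epsilon * cost.max()))
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def barycentric_projection(plan, target):
    """Move each source cell to the weighted mean of its matched target cells."""
    return (plan @ target) / plan.sum(axis=1, keepdims=True)
```

The transport plan couples every cell of one batch with all cells of the other, which is why OT considers the whole data structure. Extensions such as partial or unbalanced OT relax the constraint that all mass must be transported, which matters when a batch contains cell types absent from the other.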
Another class of HI methods seeks similar cells across batches, operating at the single-cell level rather than at a global level (Figure 2B). Some are based on the nearest neighbors approach, like mutual nearest neighbors (MNN) (Haghverdi et al., 2018), CONOS (Barkas et al., 2019), Scanorama (Hie et al., 2019), Seurat (Butler et al., 2018; Stuart et al., 2019; Hao et al., 2021), which includes different integration schemes such as CCA and reciprocal PCA (RPCA), or BBKNN (Polański et al., 2020). All nearest neighbors-based methods rely on the hypothesis that batch effects are almost orthogonal to biological effects, which would allow identifying similar cells across batches through simple orthogonal projection. They then apply various strategies, like correction vectors or joint graph construction, to end up with a joint representation of cells. These methods tend to work best when facing slight to moderate batch effects and generally fail when batch effects are far from being orthogonal to relevant biological signals. They tend to scale well to large datasets thanks to various optimizations during nearest neighbors computation like nearest neighbors descent (Dong et al., 2011). Another metric-based approach is described in Harmony (Korsunsky et al., 2019), which is probably the most used tool in practice for HI of single-cell data. It uses an iterative algorithm alternating biased clustering across batches and correction. First, cells are clustered across datasets with a bias that penalizes clusters of cells with a homogeneous batch of origin. Then, cells of a given cluster are pulled towards each other. An optimality criterion is tested at each iteration to assess whether batch mixing is sufficient, using a local purity metric called Local Inverse Simpson's Index (LISI).
Due to its simplicity and availability as both Python and R packages, Harmony is widely used today and still achieves respectable results in benchmarks (Anaissi et al., 2022), despite being limited when facing strong batch effects.
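The nearest-neighbors matching step discussed above can be sketched as follows. The example finds mutual nearest neighbors between two batches expressed in the same feature space, then applies a single average correction vector to the second batch. This global shift is a deliberate simplification on our part: published methods compute smoothed, cell-specific correction vectors. Function names are hypothetical.

```python
import numpy as np


def mutual_nearest_neighbors(x, y, k=1):
    """Return pairs (i, j) where y[j] is among the k nearest neighbors of x[i]
    in batch y, and x[i] is among the k nearest neighbors of y[j] in batch x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    nn_of_x = np.argsort(d, axis=1)[:, :k]    # (n_x, k) indices into y
    nn_of_y = np.argsort(d, axis=0)[:k, :].T  # (n_y, k) indices into x
    return [(i, int(j)) for i in range(len(x)) for j in nn_of_x[i]
            if i in nn_of_y[j]]


def mean_shift_correction(x, y, pairs):
    """Shift batch y by the average difference observed over matched pairs."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if not pairs:
        return y.copy()  # nothing to anchor the correction on
    i, j = np.array(pairs).T
    shift = (x[i] - y[j]).mean(axis=0)
    return y + shift
```

Requiring the neighborhood relation to hold in both directions is what protects the matching from cell types present in only one batch: such cells may have nearest neighbors in the other batch, but are unlikely to be chosen in return.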
Deep autoencoders (DAEs), and more recently variational autoencoders, have been popular tools in single-cell analysis for a few years already and excel at performing a variety of complex preprocessing tasks, such as dimensionality reduction (Wang and Gu, 2018) or denoising and correcting dropouts (Eraslan et al., 2019), as well as acting as generative models (Trong et al., 2020). DAEs are neural networks that leverage a bottleneck structure to learn a compressed data representation in a low-dimensional space, which can then be exploited for various tasks (Figure 2C). DAEs provide a powerful framework to carry out horizontal data integration, with tools such as scvi (Lopez et al., 2018), scAlign (Johansen and Quon, 2019) or DESC. In particular, scANVI, part of the scvi framework, is the top-performing tool in the atlas-scale benchmark. DAEs generally scale well computationally because they can exploit GPU acceleration during training. The main downsides of DAEs are the large amounts of data necessary for their training and their lack of interpretability, though there are efforts to improve on the latter point (Svensson et al., 2020; Treppner et al., 2022).
In an attempt to organize these methods into a common framework, we introduced Transmorph (Fouché et al., 2022), an open-source computational framework that allows the user to assemble custom HI pipelines from basic algorithmic blocks. This framework focuses on methods that combine a matching step, identifying similar cells across batches, and an embedding step, where these correspondences are used to generate a joint representation of all datasets. Transmorph also gives access to pre-built HI pipelines, HI quality assessment routines, benchmarking datasets and easy access to other state-of-the-art HI tools such as Harmony (Korsunsky et al., 2019) and scvi (Lopez et al., 2018). We hope to see more initiatives of this kind deployed in the coming years to provide frameworks that can help organize the field of HI methods.
Despite the myriad approaches proposed to tackle HI, it remains challenging today to correct strong batch effects. For instance, Tran et al. (2020) and Luecken et al. (2022) showed that while several methods can satisfactorily remove moderate batch effects, integrating datasets across species remains difficult for unsupervised methods, which do not require cell labeling information. Also, many methods rely on first finding an overlapping feature space between all datasets, which can be an obstacle when building large atlases combining many batches of varying quality, where the number of common features can shrink drastically. Finally, the problem of selecting appropriate metrics to assess data integration quality is still difficult. Most benchmarks use a mixture of metrics to measure different aspects of the data integration task such as batch mixture, label clustering or topology preservation, depending on the information available:
• Batch mixture metrics such as batch-LISI are commonly used to measure how much the data integration procedure brought cells from different datasets close to one another. These metrics are popular because they do not require additional information, such as cell types or states, and can be used as unsupervised tools. Unfortunately, a good integration does not necessarily imply good batch mixture metrics, as two datasets without overlapping cell types should not be mixed after integration. Conversely, projecting all datasets together onto a single point would result in perfect batch mixing, but all the biological information would be lost. For these reasons, even though batch mixture metrics are quite informative and widely used, most benchmarks also include other integration metrics to compensate for these limitations.
• Label clustering metrics, such as normalized mutual information or adjusted Rand index, provide an additional axis to measure data integration quality by assessing whether cells of similar type cluster together after integration. Label clustering metrics are usually reliable for controlling data integration quality if cell types can be identified confidently. The main downside of these metrics is the necessity to have high-confidence cell labels available before integration, which is often not the case (especially as data integration is often meant to be carried out before clustering and cell type inference).
• Finally, topology preservation metrics assess how data integration has preserved relations between the different cells and penalize cases where cells that were close before integration have been brought far apart by the algorithm. Topology can be biology-driven, by observing the conservation of signals related to specific cell processes such as cell cycle or other transcriptomic trajectories, or data-driven, with algorithms as simple as comparing the k-nearest neighbors of a cell before and after integration and penalizing the differences.
Evaluating the quality of an HI can be daunting, as shown by the large variety of metrics that have been developed for it. In practice, we often use a batch mixture metric such as LISI, complemented by a secondary metric that can be either a label clustering metric if high-confidence labels are available or a topology preservation metric otherwise.
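The two label-free families of metrics described above can be approximated in a few lines. The sketch below computes (i) a simplified batch mixture score, the inverse Simpson's index of batch labels among the k nearest neighbors of each cell, and (ii) a data-driven topology preservation score, the average overlap between k-nearest neighbors before and after integration. Note that the published LISI weights neighbors with a perplexity-based kernel; the unweighted version here, as well as both function names, are our own simplifying assumptions.

```python
import numpy as np


def knn_indices(x, k):
    """Indices of the k nearest neighbors of each row (self excluded)."""
    x = np.asarray(x, dtype=float)
    d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]


def neighborhood_inverse_simpson(embedding, labels, k=10):
    """Per-cell inverse Simpson's index of `labels` among k neighbors.
    Ranges from 1 (a single label) to the number of distinct labels."""
    labels = np.asarray(labels)
    scores = np.empty(len(labels))
    for i, idx in enumerate(knn_indices(embedding, k)):
        _, counts = np.unique(labels[idx], return_counts=True)
        p = counts / counts.sum()
        scores[i] = 1.0 / np.sum(p ** 2)
    return scores


def knn_preservation(before, after, k=10):
    """Mean fraction of k nearest neighbors shared before/after integration."""
    nb, na = knn_indices(before, k), knn_indices(after, k)
    overlap = [len(set(b) & set(a)) / k for b, a in zip(nb, na)]
    return float(np.mean(overlap))
```

Passing batch labels to `neighborhood_inverse_simpson` gives a batch mixture score (higher is better mixed), while passing cell type labels gives a purity score (lower is purer). For a multi-batch dataset, `knn_preservation` would typically be computed within each batch separately, since neighborhoods across batches are expected to change.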

Vertical integration (VI) connects modalities measured in the same cells
Vertical integration (VI) uses several datasets containing individual measurements from the same cells, obtained from joint single-cell assays measuring different biological features (e.g., gene expression and chromatin accessibility), to infer relations between the different modalities (Table 2). VI usually comes in two variants, namely, local VI and global VI. Local VI identifies links between individual features (such as genes and methylated promoters), and can be used to formulate hypotheses of direct or indirect biological interactions between the omics layers (e.g., gene expression and accessibility of a chromatin region), with methods like LMM (Van Der Wijst et al., 2018) or Spearman's rank correlation coefficient (Cuomo et al., 2020). On the other hand, global VI links features across different modalities via global factors that can be related to biological processes (e.g., identifying a group of genes and chromatin regions that correspond to proliferation activity).
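As a minimal illustration of local VI, the sketch below computes Spearman's rank correlation between every feature of one modality and every feature of another, given two matrices whose rows are the same cells. The variable names (`rna`, `atac`), the threshold value and the function names are assumptions made for the example; a real analysis would also test significance and correct for multiple comparisons, which is omitted here.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_cross(x, y):
    """Spearman correlation between each column of x and each column of y.
    Rows of x and y must be the same cells. Constant columns yield NaN."""
    rx = rankdata(x, axis=0)  # rank cells within each feature
    ry = rankdata(y, axis=0)
    rx = (rx - rx.mean(axis=0)) / rx.std(axis=0)
    ry = (ry - ry.mean(axis=0)) / ry.std(axis=0)
    return rx.T @ ry / len(rx)  # Pearson correlation of the ranks


def candidate_links(rho, threshold=0.8):
    """Feature pairs (i, j) whose absolute correlation reaches the threshold."""
    rows, cols = np.where(np.abs(rho) >= threshold)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]
```

Because the correlation is computed on ranks, any monotonic relationship between, say, a gene and a chromatin region is detected, whether or not it is linear.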
A family of global VI tools is based on a methodology inspired by canonical correlation analysis (CCA) (Hotelling, 1992), which uses joint feature measurements across datasets to identify correlated features across modalities (Figure 3A). RGCCA (Tenenhaus and Tenenhaus, 2011) extended this framework to allow the simultaneous analysis of more than two datasets. These concepts have been refined in (Tenenhaus et al., 2014) and DIABLO to achieve better feature selection.
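The core computation behind CCA can be sketched with a QR decomposition followed by an SVD. The example below is classical, unregularized CCA between two matched data matrices; it assumes more cells than features and full column rank in both matrices, which rarely holds for raw omics data and is precisely why regularized or sparse extensions exist. The function name `cca` and its signature are ours.

```python
import numpy as np


def cca(x, y, n_components=2):
    """Classical CCA between matched matrices x and y (rows = same cells).
    Returns canonical correlations and the canonical scores of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases of the column spaces of each modality.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of qx^T qy are the canonical correlations.
    u, s, vt = np.linalg.svd(qx.T @ qy)
    x_scores = qx @ u[:, :n_components]
    y_scores = qy @ vt.T[:, :n_components]
    return np.clip(s[:n_components], 0.0, 1.0), x_scores, y_scores
```

Each pair of canonical score columns is a one-dimensional summary of the cells that is maximally correlated across the two modalities, which is the sense in which CCA extracts shared factors.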
On the other hand, other popular global VI tools are based on matrix decomposition algorithms ( Figure 3B) (Lock et al., 2013;Argelaguet et al., 2018;Jin et al., 2020). These tools generally aim to decompose each data matrix into a component explained by global factors, a component containing dataset-specific and modality-specific factors, and a noise term. They mostly differ by their exact decomposition model and specific strategies used to infer its parameters.
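The decomposition into a shared component, a modality-specific component and a noise term can be illustrated with a single, non-iterative pass based on truncated SVDs. This sketch is only loosely inspired by the models cited above: the actual tools use iterative or probabilistic inference and principled rank selection, whereas here the ranks are fixed by the user. The function name and its arguments are hypothetical.

```python
import numpy as np


def joint_individual_decomposition(blocks, joint_rank, individual_rank):
    """Split each (cells x features) block into joint + individual + noise.
    All blocks must describe the same cells, in the same row order."""
    blocks = [np.asarray(b, dtype=float) for b in blocks]
    blocks = [b - b.mean(axis=0) for b in blocks]  # center each feature

    # Shared cell factors: leading left singular vectors of the stacked data.
    u, _, _ = np.linalg.svd(np.hstack(blocks), full_matrices=False)
    scores = u[:, :joint_rank]

    parts = []
    for block in blocks:
        joint = scores @ (scores.T @ block)  # part explained by shared factors
        residual = block - joint
        # Modality-specific structure: leading components of the residual.
        ui, si, vti = np.linalg.svd(residual, full_matrices=False)
        r = individual_rank
        individual = (ui[:, :r] * si[:r]) @ vti[:r]
        parts.append((joint, individual, residual - individual))
    return scores, parts
```

By construction, the three returned components of each block sum to the centered block, and the modality-specific component is orthogonal to the shared cell factors.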
Deep autoencoders, which perform very well for HI, have also been successfully applied to VI problems (Minoura et al., 2021) by training a distinct encoder and decoder for each modality around a shared latent space into which both modalities are projected. This strategy notably allows the network to "translate" one modality into another. We can also mention the recent MIRA method (Lynch et al., 2022), which leverages a variational autoencoder approach to learn topics shared between gene expression and chromatin accessibility.
Overall, the VI framework has allowed the growth of methods taking advantage of the powerful sample anchoring across datasets, with many approaches inspired by statistics and machine learning. A few important benchmarks have been carried out to assess the quality of VI tools, notably (Cantini et al., 2021), which focuses on joint dimensionality reduction (jDR) methods. Due to the difficulty of setting up joint assays and the inability of these methods to function without matched cells, there is a crucial need for diagonal integration (DI) tools that aim to integrate datasets across batches and modalities.
Figure 4. Several strategies can be carried out to tackle the diagonal integration computational challenge. (A) A biological object (e.g., a population of cells) can be profiled using different assays, without obvious means to link both representations. (B) Knowledge of interaction between features across modalities can be obtained from vertical integration of external datasets generated using joint assays. This information can then be leveraged to compare cells between batches even if they are not expressed in the same modality, which allows the use of horizontal integration tools. (C) Datasets can be independently encoded into abstractions that can then be matched in an unsupervised fashion to build a joint representation of datasets. (D) Datasets can be jointly encoded into a unique abstraction, for instance through a learning process using a deep autoencoder framework, that can then be used as a joint embedding of datasets.

Diagonal and mosaic integration jointly embed non- or partially anchored datasets
Diagonal integration (DI) and mosaic integration (MI) are two data integration frameworks for single-cell data that do not require datasets to be acquired through matched biological assays (Table 3). In this section, we use DI interchangeably with MI unless stated otherwise. The goal is to leverage dataset structure and possibly external information, such as genomic locations, pathways, or partial sample or modality overlap, to infer complete bonds between cells across modalities without relying on explicit sample anchoring (Figures 4A, B). DI generally aims to build a joint embedding of datasets into a common latent space, while MI focuses on inferring missing modalities from partially anchored datasets. Let us focus on the two main families of methods that exist for tackling DI: manifold alignment and deep autoencoders. These two machine learning paradigms can handle high levels of abstraction, which seems required to tackle DI in the general case.
Manifold alignment methods (Welch et al., 2017; Liu et al., 2019; Cao et al., 2020; Cao et al., 2022b; Demetci et al., 2022) for DI operate similarly to the HI case and work under the assumption that a smooth alignment of point clouds corresponds to a meaningful biological correspondence (Figure 4C). This allows them to work in an unsupervised fashion without requiring any knowledge other than the data matrices. Although this works accurately in some cases, it has been shown that this hypothesis is far from universal (Xu and McCord, 2022). The authors of this study show that under simple data perturbations, such as missing cell types or different sample sizes, manifold alignment DI methods can generate erroneous embeddings featuring clusters with mixed cell types. This is concerning, as validating DI is a challenging task, given that reliable cell type labels across modalities are rarely available. Therefore, we suggest that these unsupervised manifold alignment methods be used carefully and only when integration quality control is feasible. In other cases, it is preferable to choose another DI method that allows the user to provide additional information that helps bridge the gap across modalities.
As for HI and VI, deep autoencoders are powerful tools for solving DI tasks, with several advantages. First, they can take advantage of GPU acceleration built in deep learning libraries to greatly speed up the training process, and naturally scale to very large datasets. The second benefit of using these neural networks is that they offer the possibility to train a separate encoder and decoder for each biological modality, which helps capture modality-specific factors compared to manifold alignment algorithms where all omics layers are treated similarly. These separate encoders generally share a joint latent space ( Figure 4D), with some form of penalty to force latent representations to overlap. They also present an algorithmic structure that facilitates the introduction of external biological guidance, like in the GLUE tool (Cao and Gao, 2022), which uses a guidance graph as prior knowledge about functional relationships between features across modalities. We would also like to mention in this category the Polarbear tool (Zhang R. et al., 2022), which leverages deep autoencoders to notably translate single-cell data between RNA-seq and ATAC-seq.
To the best of our knowledge, there does not exist at the time of writing a large-scale, independent benchmark of DI methods comparable to those available for HI. Such a benchmark is arguably difficult to set up due to the number of single-cell modalities available today, and because not all methods can deal with all modalities. Some may also require specific prior knowledge, and output type may vary. Furthermore, there is a lack of reliable metrics for assessing the quality of DI methods and of real-life benchmarking datasets. A first breakthrough in this direction is worth noting: a NeurIPS single-cell analysis competition organized recently gave access to a public multimodal dataset containing single-cell gene expression, protein expression, and chromatin accessibility using CITE-seq and Multiome (Lance et al., 2022). With the democratization of such datasets, benchmarking DI methods will become more accessible, which will help standardize the field and identify the best-performing methods for each scenario.
Finally, there is a growing interest in integrating single-cell data with other related data modalities, such as whole slide images or spatial transcriptomics. There is a particular interest in deconvoluting spatial transcriptomic spots by integrating them with a single-cell RNA-seq dataset obtained from a similar tissue. This is a current challenge, and several methods have been proposed and benchmarked for this task.
Overall, DI is arguably the most challenging data integration problem, and solving it is still a very active research area. This very convenient data integration paradigm is extremely versatile, as it theoretically does not need any anchoring (cells or features) between the different datasets. In practice, while many DI tools indeed work in a completely unsupervised way by leveraging data topology, such as MMD-MA (Liu et al., 2019), Pamona (Cao and Gao, 2022) or SCOT (Demetci et al., 2022), others require additional information to bridge the gap between modalities, like GLUE (Cao et al., 2022a) or MultiVI (Ashuach et al., 2021), which can take a covariate design matrix as an optional parameter. For the moment, it appears that these knowledge-guided methods offer more control over the results, as data topology can be misleading in practice and yield aberrant results (Xu and McCord, 2022). Therefore, using DI tools that can be enriched with biological context seems to be the best choice in applications where such context can be obtained in a reliable way, typically when integrating datasets where strong covariates exist between modalities.

Discussion
Data integration consists of distinct challenges depending on the anchoring that exists between datasets, and each facet of DI requires distinct tools that leverage various algorithmic strategies. For instance, metric-based methods excel at solving HI tasks, whereas linear matrix analysis methods excel at solving VI tasks. Machine learning paradigms with high abstraction levels, such as manifold alignment methods and deep neural networks, are excellent assets for dealing with DI and MI problems, the latter also performing well at HI and VI tasks. Overall, VI methods are pretty good at solving the task, HI methods are capable of dealing with small to moderate batch effects but still struggle to mitigate significant batch effects such as inter-species data, and DI/MI problems are arguably still unsolved in the general case.
We talked about the Transmorph framework that articulates computational blocks to conceive HI pipelines, but this is not the only framework that exists which is related to data integration. We can cite MUON (Bredikhin et al., 2022), which facilitates the handling of data consisting of different modalities, Polyphony (Cheng et al., 2022), which carries out transfer learning across datasets by leveraging data integration algorithms, or SinCast (Deng et al., 2022) which is specialized in cell type inference by mapping a query onto an atlas.
It is essential to note that there are important pitfalls to data integration that must not be overlooked. The primary issue that can be encountered is named overcorrection and describes an undesirable event where a data integration method incorrectly aligns cells that do not share the same biological type or state. This typically happens when batch effects are too strong, when a dataset contains specific cell types, when cell type distribution is highly imbalanced, or when there is little anchoring between batches. Overcorrection can be difficult to detect when there is no easy access to cell labels and is a critical issue that hinder every subsequent analysis step. Indeed, it can lead to cells belonging to the same cluster without sharing critical biological properties such as cell type or states. Other issues are worth noting even though they are not exclusive to the data integration task, such as the difficulty in differentiating between true zeros and missing values in RNA-seq datasets or the fact that different modalities are often expressed using different data types (e.g., binary or integer data) which may be difficult to handle jointly within mathematical frameworks. Finally, data integration tools based on abstract machine learning paradigms such as deep autoencoders often comes at the cost of a decrease in model interpretability which is an important downside for any healthrelated application. However, many efforts are made to overcome this issue (Svensson et al., 2020;Treppner et al., 2022) and we expect to see many more in the years to come.
There remains an urgent need for large-scale, independent benchmarks like the HI benchmark proposed in , or the VI benchmark carried out in (Cantini et al., 2021). To the best of our knowledge, there is still a lack of large-scale independent DI and MI benchmarks. Two things are necessary to carry out such benchmarks: high-quality datasets and reliable metrics. A list of potential datasets can be found in (Argelaguet et al., 2021). There is no clear consensus about which quality assessment metric to use, and most benchmarks like  opt for a mixture of metrics that cover two aspects of data integration: conservation of biological variance (CBV) metrics, which measure how close similar cells (in type or state) are after integration, and removal of batch effects (RBE) metrics, which measure how well cells from different batches are mixed. Some CBV metrics are label-based, such as normalized mutual information (NMI), adjusted Rand index (ARI), average silhouette width (ASW), class local inverse Simpson's index (cLISI), isolated label F1 (ILF) and isolated label silhouette (ILS). Others are label-free and generally assess the conservation of biological signals such as cell cycle, highly variable genes, and transcriptomic trajectories. RBE metrics include batch-PC regression, batch-ASW, graph connectivity, iLISI, and kBET. We often observe a tradeoff between CBV and RBE, so the choice of method can depend on the application, according to whether good dataset mixing or the conservation of subtle biological signals is preferable.
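To make two of the label-based CBV metrics concrete, the following minimal sketch computes ARI and NMI from scratch, comparing known cell type labels with cluster assignments obtained after integration. The function names and the toy label vectors are assumptions made for illustration, and NMI is normalized here by the arithmetic mean of the two entropies, which is only one of several conventions. In practice, established implementations such as scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score` would normally be preferred.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2 or n != len(labels_pred):
        raise ValueError("need two labelings of equal length >= 2")
    joint = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in joint.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # chance-level agreement
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("need two non-empty labelings of equal length")
    joint = Counter(zip(labels_true, labels_pred))
    rows, cols = Counter(labels_true), Counter(labels_pred)
    mi = sum(
        (c / n) * log(c * n / (rows[u] * cols[v])) for (u, v), c in joint.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in rows.values())
    h_pred = -sum((c / n) * log(c / n) for c in cols.values())
    mean_entropy = (h_true + h_pred) / 2
    if mean_entropy == 0:  # both labelings contain a single class
        return 1.0
    return mi / mean_entropy


# Toy example (hypothetical labels): cell types versus post-integration clusters.
cell_types = ["T", "T", "B", "B"]
clusters_good = [1, 1, 0, 0]  # same partition, cluster names permuted
clusters_bad = [0, 1, 0, 1]   # every cluster mixes both cell types

ari_good = adjusted_rand_index(cell_types, clusters_good)
ari_bad = adjusted_rand_index(cell_types, clusters_bad)
nmi_good = normalized_mutual_info(cell_types, clusters_good)
nmi_bad = normalized_mutual_info(cell_types, clusters_bad)
print(ari_good, ari_bad, nmi_good, nmi_bad)
```

Both scores are invariant to how clusters are named, which is why the permuted clustering scores as a perfect match. ARI is corrected for chance and can therefore become negative when agreement is worse than expected at random, as in the second toy clustering.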
To conclude, years of algorithmic and computational advances have made it possible to solve most HI and VI problems with satisfactory performance, with only the most complicated instances still being problematic (e.g., HI of many batches with strong batch effects). Solving DI and MI is the next computational challenge. The most promising approaches developed to tackle it are based on deep learning models, particularly deep autoencoders. It has been shown that purely unsupervised DI may not be a well-posed problem and could suffer from fundamental flaws (Xu and McCord, 2022), which greatly incentivizes using knowledge-driven tools that allow the user to enhance models with external functional information that links features across modalities. Finally, apart from developing new tools, there is also an urgent need to enrich the data integration ecosystem with organizing frameworks, standardized benchmarks, datasets, and quality assessment metrics.

Author contributions
AF and AZ wrote the manuscript. All authors contributed to the article and approved the submitted version.

Funding
This work was supported by the French government under the management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). These funding sources had no role in the design, execution, and interpretation of the results of this study.

Conflict of interest
AZ was employed by the Evotec company. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Frontiers in Bioinformatics frontiersin.org