Combining Partially Overlapping Multi-Omics Data in Databases Using Relationship Matrices

Akdemir, Deniz; Knox, Ron; Isidro y Sánchez, Julio

doi:10.3389/fpls.2020.00947

ORIGINAL RESEARCH article

Front. Plant Sci., 14 July 2020

Sec. Computational Genomics

Volume 11 - 2020 | https://doi.org/10.3389/fpls.2020.00947

Combining Partially Overlapping Multi-Omics Data in Databases Using Relationship Matrices

1. Agriculture & Food Science Centre, Animal and Crop Science Division, University College Dublin, Dublin, Ireland
2. SCRDC-CRDSW, Swift Current Research and Developmental Centre, Swift Current, SK, Canada
3. Centro de Biotecnología y Genómica de Plantas (CBGP, UPM – INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, Madrid, Spain

Article metrics

View details

Citations

5,6k

Views

1,7k

Downloads

Abstract

Private and public breeding programs, as well as companies and universities, have developed different genomics technologies that have resulted in the generation of unprecedented amounts of sequence data, which bring new challenges in terms of data management, query, and analysis. The magnitude and complexity of these datasets bring new challenges but also an opportunity to use the data available as a whole. Detailed phenotype data, combined with increasing amounts of genomic data, have an enormous potential to accelerate the identification of key traits to improve our understanding of quantitative genetics. Data harmonization enables cross-national and international comparative research, facilitating the extraction of new scientific knowledge. In this paper, we address the complex issue of combining high dimensional and unbalanced omics data. More specifically, we propose a covariance-based method for combining partial datasets in the genotype to phenotype spectrum. This method can be used to combine partially overlapping relationship/covariance matrices. Here, we show with applications that our approach might be advantageous to feature imputation based approaches; we demonstrate how this method can be used in genomic prediction using heterogeneous marker data and also how to combine the data from multiple phenotypic experiments to make inferences about previously unobserved trait relationships. Our results demonstrate that it is possible to harmonize datasets to improve available information across gene-banks, data repositories, or other data resources.

Introduction

The rapid scientific progress in these genomic approaches is due to the decrease in genotyping costs by the development of next-generation sequencing platforms since 2007 (Mardis, 2008a; Mardis, 2008b). High-throughput instruments are routinely used in laboratories in basic science applications, which has led to the democratization of genome-scale technologies, such as genomic predictions and genome-wide associating mapping studies. Genomic prediction, i.e. predicting an organism’s phenotype using genetic information (Meuwissen et al., 2001), is currently used by many breeding companies because it improves three out of the four factors affecting the breeder equation (Hill and Mackay, 2004). It reduces generation number, improves accuracy of selection, and increases selection intensity for a fixed budget when comparing with marker-assisted selection or phenotypic selection (Heffner et al., 2010; Heffner et al., 2011; de los Campos et al., 2013; Desta and Ortiz, 2014; Juliana et al., 2018). Genomic prediction and selection (GS) are a continuously progressing tool that promises to help meet the human food challenges in the next decades (Crossa et al., 2017). Genome-wide associating mapping studies, which originated in human genetics (Bodmer, 1986; Risch and Merikangas, 1996; Visscher et al., 2017), have also become a routine in plant breeding (Gondro et al., 2013).

The biological data generated in the last few years from this genomic progress have grown exponentially which have led to a high dimensional and unbalanced nature of the “omics” data. Data normally comes in various forms of marker and sequence data: expression, metabolomics, microbiome, classical phenotype, image-based phenotype (Bersanelli et al., 2016). Private and public breeding programs, as well as companies and universities, have developed different genomics technologies that have resulted in the generation of unprecedented levels of sequence data, which bring new challenges in terms of data management, query, and analysis.

It is clear that detailed phenotype data, combined with increasing amounts of genomic data, have an enormous potential to accelerate the identification of key traits to improve our understanding of quantitative genetics (Crossa et al., 2017). Nevertheless, one of the challenges that still need to be addressed is the incompleteness inherent in these data, i.e., several types of genomic/phenotypic information covering only a few of the genotypes under study (Berger et al., 2013). Data harmonization enables cross-national and international comparative research, as well as allows the investigation of whether or not datasets have similarities. In this paper, we address the complex issue of utilizing the high dimensional and unbalanced omics data by combining the relationship information from multiple data sources, and how we can facilitate data integration from interdisciplinary research. The increase of sample size and the improvement of generalizability and validity of research results constitute the most significant benefits of the harmonization process. The ability to effectively harmonize data from different studies and experiments facilitates the rapid extraction of new scientific knowledge.

One way to approach the incompleteness and the disconnection among datasets is to combine the relationship information learned from these datasets. The statistical problem addressed in this paper is the calculation of a combined covariance matrix from incomplete and partially overlapping pieces of covariance matrices that were obtained from independent experiments. We assume that the data is a random sample of partial covariance matrices from a Wishart distribution (Anderson, 2003), then we derive the expectation-maximization algorithm for estimating the parameters of this distribution. According to our best knowledge no such statistical methodology exists, although the proposed method has been inspired by similar methods such as (conditional) iterative proportional fitting for the Gaussian distribution (Cramer, 1998; Cramer, 2000) and a method for combining a pedigree relationship matrix and a genotypic matrix relationship matrix which includes a subset of genotypes from the pedigree-based matrix (Legarra et al., 2009; Christensen et al., 2012) (namely, the H-matrix approach or the related single-step genomic prediction). The applications in this paper are chosen in the area genomic prediction in the case where there is partial genomic and phenotypic information about several populations. However, the statistical method is applicable much beyond the described applications in this article.

The integration of heterogeneous and large omics data constitutes a challenge and an increasing number of scientific studies address this issue. A brief review and classification of some promising statistical approaches are described in Bersanelli et al. (2016). According to this article, our covariance-based method falls in the network-based data integration category (as opposed to non-network based methods such as feature imputation) which include popular methods such as similarity network fusion Wang et al. (2014), weighted multiplex networks Menichetti et al. (2014) both of which can be used to combine several complete networks by suitable weighting. The main breakthrough here is that the proposed method in this article can be used to combine several incomplete but partially overlapping networks and that the proposed approach is supported theoretically by the maximum likelihood formalization.

Methods and Materials

Statistical Methods for Combining Incomplete Data

Imputation

The standard method of dealing with heterogeneous data involves the imputation of features (Shrive et al., 2006). If the datasets to be combined overlap over a substantial number of features then the unobserved features in these datasets can be accurately imputed based on some imputation method (Bertsimas et al., 2017).

Imputation step can be done using many different methods: Several popular approaches include Beagle (Browning and Browning, 2016), random forest (Breiman, 2001) imputation, expectation-maximization based imputation (Endelman, 2011), low-rank matrix factorization methods that are implemented in the R package (Hastie and Mazumder, 2015). In addition, parental information can be used to improve imputation accuracies (Browning and Browning, 2009; Nicolazzi et al., 2013; VanRaden et al., 2015; Gonen et al., 2018). In this study, we used the low-rank matrix factorization method in all of the applications which included an imputation step. The selection of this method was due to the computational burden of the other alternatives.

Combining Genomic Relationship Matrices

In this section, we describe the Wishart EM-Algorithm for combining partial genetic relationship matrices¹.

Wishart EM-Algorithm for Estimation of a Combined Relationship Matrix From Partial Samples

Let A = {a₁, a₂,…,a_m} be the set of partially overlapping subsets of genotypes covering a set of K (i.e., ) with total n genotypes. Let be the relationship matrices for genotypes in sets a₁, a₂,…,a_m We want to estimate the overall relationship matrix Σ for the n genotypes using Moreover, if we focus on one single relationship matrix we drop the subscript and write G_a.

Starting from an initial estimate of the genetic relationship matrix Σ⁽⁰⁾ = νΨ⁽⁰⁾, the Wishart EM-Algorithm repeats updating the estimate of the genetic relationship matrix until convergence:

where a is the set of genotypes in the given partial genomic relationship matrix, b is the set difference of K and a. We assume partitioning of Ψ⁽^t⁾ as

where is the part of matrix that correspond to the genotypes a, is the part of matrix that correspond to the genotypes b, and is the part that correspond to the relationship of genotypes in a and b. The matrices P_a are permutation matrices that put each relationship matrix in the summation in the same order. The superscripts in parenthesis “(t)” denote the iteration number. The estimate Ψ⁽^T⁾ at the last iteration converts to the estimated genomic relationship with . Σ⁽⁰⁾ is the initial estimate of the relationship of the n genotypes that reflects the a priori knowledge about the combined relationship.

A weighted version of this algorithm can be obtained replacing G_a in Equation 1 with for a vector of weights (w_1,w_2,…,w_m)'.

Derivation of the Wishart-EM algorithm and its asymptotic errors are given in Supplementary. We note here that the choice of the degrees of freedom parameter ν does not affect the estimate of the combined relationship matrix but it has an effect on the asymptotic standard errors. While it is possible to estimate this parameter by maximizing the likelihood function, in practice since we are assuming large samples (many features) go into the calculation of the partial matrices, a large value for ν (in the order of the average number of features used in the calculation of the partial matrices) will give reasonable results.

Also, we note that when combining a relationship matrix say A with a relationship matrix nested in say G it the algorithm can be implemented with Σ⁽⁰⁾ = A and the single G to update it. In this case, the algorithm converges in one iteration and the resulting relationship matrix will be the same as the one that would be obtained by the H-Mat and the related to single-step genomic prediction (Legarra et al., 2009; Christensen et al., 2012) approaches; in other words, our algorithm generalizes their approach to two or more relationship matrices not necessarily nested.

Materials: Datasets and Experiments

In this section, we describe the datasets and the experiments we have designed to explore and exploit the Wishart EM-Algorithm.

Note that the applications in the main text involve real datasets and validation with such data can only be as good as the ground truth known about the underlying system. We also included several simulation studies in the Supplementary (Supplementary Applications 1 and 2) using simulated data to show that the algorithm performs as expected (maximizes the likelihood and provides a “good” estimate of the parameter values) when the ground truth is known.

Application 1: Potato Dataset; When Imputation Is Not an Option. Anchoring Independent Pedigree-Based Relationship Matrices Using a Genotypic Relation Matrix

In this application, we demonstrate that genomic relationship matrices can be used to connect several pedigree-based relationship matrices by the Wishart-EM-Algorithm.

The dataset is cited in (Endelman et al., 2018) and is available in the R Package AGHmatrix (Rampazo Amadeu et al., 2016). It consists of the pedigree of 1,138 potato samples, 571 of these genotypes also have data for 3,895 tetraploid markers. The pedigree-based relationship matrix A was calculated with R package AGHmatrix (Rampazo Amadeu et al., 2016) using pedigree records, there were 185 founders (clones with no parent).

The application experiment was structured as follows:

A random sample (Nped) of two non-overlapping pedigree-based relationship matrix Nped ∈ {100, 150, 250} were selected. This means, there is no information in common between pedigree.
A random sample (Ngeno) from half of the genotypes from each pedigree was selected to create a genotypic relationship matrix. This means that in each Ngeno ∈{20,40,80} half of the genotypes come from one pedigree and the other half from the other. This allows us to have partially overlapping data to create a combined relationship matrix.
These genetic relationship matrices were combined to get a combined genetic relationship matrix (See Figure 1).

Figure 1

This combined relationship matrix was compared to the pedigree-based relationship matrix of the corresponding genotypes using mean squared errors and Pearson’s correlations. These correlations and the mean squared errors were calculated only using the unobserved (validation) part of the combined relationship matrix. This experiment was repeated 30 times for each Ngeno, Nped pair.

Application 2: Rice Dataset. Combining Independent Low-Density Marker Datasets

Rice dataset was downloaded from www.ricediversity.org. After curation, the marker dataset consisted of 1,127 genotypes observed for 387,161 markers. We treat the totality of information as the ground truth, i.e., we assume that the true genomic relationship for the 1,127 genotypes is characterized by the 387,161 markers. The purpose of this application is to demonstrate that we can make inferences about the assumed true genomic relationship matrix by observing several smaller heterogeneous subsets of the available. This involves inferring a common estimate for the relationships that are already observed and producing estimates for relationships that haven’t been observed. Supplementary Figure S5 demonstrate this experiment pictorially.

In each instance of the experiment, N_kernel ∈ {3, 5, 10, 20, 40, 80} marker datasets with 200 genotypes and 2,000 markers were created by randomly sampling the genotypes and markers in each genotype file. These datasets were combined using the Wishart EM-Algorithm and also by imputation to give two genomic relationship matrices. For the totality of genotypes in these combined datasets, we also randomly sampled 2,000, 5,000, or 10,000 markers, and calculated the genomic relationships based on these marker subsets. All of these genomic relationship matrices were compared with the corresponding elements of the relationship matrix based on the entire genomic data by calculating the mean squared error and correlation between the upper diagonal elements including the diagonals. This experiment was replicated 20 times. Application results are showed in Figure 8.

Application 3: Wheat Data at Triticeae Toolbox. Combining Genomic Datasets to Use in Genomic Prediction

This application involves estimating breeding values for seven economically important traits for 9,102 wheat lines obtained by combining 16 publicly available genotypic datasets. The genotypic and phenotypic data were downloaded from the Triticeae toolbox database. Each of the marker datasets was pre-processed to produce the corresponding genomic relationship matrices. Table 1 and Supplementary Figure S7 describes the phenotypic records and number of distinct genotypes for each trait.

Table 1

Label	Data	# Genotypes	# Markers
d1	2012_SRWW_ElitePanel	276	90,782
d2	2014_HAPMAP	53	180,198
d3	2014_SRWW_YNVP	307	109,073
d4	2014_TCAPABBSRWMID	365	100,340
d5	CornellMaster_2013	1,128	18,846
d6	Dart_NebDuplicates_2010	278	1,970
d7	HWWAMP_2013	288	32,288
d8	HWWAMP_2014	311	265,551
d9	NSGC9k_spring	2,196	5,303
d10	NSGC9k_winter	1,674	5,010
d11	TCAP90k_HWWAMP_SPRN	20	16,842
d12	TCAP90k_LeafRust	339	24,610
d13	TCAP90k_NAMparents	60	25,851
d14	TCAP90k_SpringAm	248	24,343
d15	TCAP90k_SWW	317	24,978
d16	WWDP9k	2,258	6,232

Marker datasets from Triticeae Toolbox: Labels and names for the datasets, number of genotypes and markers in each of the selected 16 genotypic datasets.

Using the combined relationship matrix we can build genomic prediction models. To test the performance of predictions based on the combined relationship matrix, we formulated two cross-validation scenarios. The common genotypes among the 16 genotypic experiments are shown in Figure 2 and the common markers among genotypic experiments in Figure 3. The availability of the phenotypic data for all the datasets are showed in Figure 4.

Cross-validation scenario 1
The first scenario involved a 10 fold cross-validation based on a random split of the data. For each trait, the available genotypes were split into 10 random folds. The GEBVs for each fold was estimated from a mixed model (see Supplementary Section 5.4 for a description of this model) that was trained on the phenotypes available for the remaining genotypes. The accuracy of the predictions was evaluated by calculating the correlations between the GEBVs and the observed trait values.
Cross-validation scenario 2
Here, we performed a leave one dataset out cross-validation. i.e. we leave out the phenotypic values for the traits of the associated genotypes in one of the 16 genomic datasets and then estimate the trait values of those genotypes based on a mixed trained model. The training population was built on the remaining genotypes and phenotypic information after leaving the phenotypic records out. This scenario was used for each trait, and the accuracies were evaluated by calculating the correlations between the estimated and the observed trait values within each dataset.

Figure 2

Figure 3

Figure 4

Application 4: Maize Data—Genomics and Transcriptomics for Genomic Prediction

In this application, we look into the effects of marker density and data size overlapping on genome-wide relationship matrix and genomic prediction accuracies using a multi-omics data that includes 332,177 genotypic markers and 31,237 feature transcriptomics. The phenotypes used in this application are yield, height, and flowering time from 388 maize lines. More information about the dataset and how it was curated can be found in Azodi et al. (2020).

The aim of this application was i) to study the effect of the number of genotypes common across different populations on the genomic prediction accuracies and ii) to evaluate the effect of the number of genotypes common across different populations and the marker density on the accuracy of predicting unobserved genomic relationships.

To accomplish the first objective we perform the following steps in a cross-validation experiment which was repeated 50 times.

First the genotypes in the dataset were randomly partitioned into three groups with 128, 130, and 130 individuals in them. These groups do not have common genotypes. We named the relationship matrices for these different sets of genotypes as K1, K2, and K3. After this, a percentage (20, 40, 50%) of genotypes from K1 and the same percentage of genotypes from K2 were randomly selected and the relationship matrix for these genotypes is denoted by K12. Similarly, the same percentage of genotypes as above from K2 and the same percentage of genotypes from K3 were randomly selected and the relationship matrix for these genotypes is denoted by K23. Additionally, a random subset of genotypes in K1 that are not in K12 are identified as the Test (validation) genotypes (see Figure 5 for the split of genotypes into these sets).
Two different combined genomic relationship matrices are calculated using two different scenarios. In scenario 1, we assume K1, K2, and K3 are relationship matrices obtained from different partitions of the whole markers dataset divided into three groups. On the other hand, K12 and K23 are obtained from different partitions of the transcriptions divided randomly into two. Since the majority of the individuals have markers we denote this scenario as “Geno.” In scenario 2, the method is the same but we replace the role of genotypic markers and transcriptomics. In this case, K1, K2, and K3 are relationship matrices from transcriptomics and K12 and K23 are obtained from genomic markers.
We used three different training population (TRS) methods. The first training population only uses individuals in K2 as training (Train1, TRS1), the second training population only uses the genotypes in K3 as training (Train2, TRS2). Finally, the union of these individuals makes up what we call Train3 or TRS3 (Figure 5).
CK-BLUP models were trained using the phenotypes from three different training sets and using the two combined relationship matrices. Also, a G-BLUP model using the full genetic information (388 genotypes and 332,177 markers), a G-BLUP model using full transcriptomic information (388 genotypes and 31,237 transcriptomes), and a multiple-kernel mixed-effects model which combined these two matrices were build using the same three training sets.
Each model is used to predict the individuals in the test sets and the predictions were compared to the available phenotypic values using correlation as the agreement measure.

Figure 5

To accomplish the second objective, we devised a similar cross-validation experiment as the first objective with the following changes.

We used only the genomic marker data (no transcriptomics), i.e., K1, K2, K3, K12, K23 are all marker-based genomic relationship matrices.
The number of markers for estimating the partial relationship matrices K1, K2, K3, K12, K23 were changed between 1,000 and 40,000 with no common markers across datasets.
The number of overlap between K12 and K1 (also K12 and K2), similarly the overlap between K23 and K2 (also K23 and K3) is changed between 10 and 60.
The accuracy (Coefficient of determination R²) of estimating the unobserved genomic relationships were calculated after estimating the combined relationship matrix and comparing it to the corresponding elements of the marker-based relationship matrix that was obtained using all 388 genotypes and all of the 332,177 markers (Figure 12).

Application 5: Wheat Data at Triticeae Toolbox. Combining Phenotypic Experiments

The Wishart EM-Algorithm can also be used to combine correlation matrices² obtained from independent phenotypic experiments. One hundred forty-four phenotypic experiments involving 95 traits in total were selected from 2,084 trials and 216 traits available at the Triticeae Toolbox. In this filtered set of trials, each trial and trait combination had at least 100 observations and two traits. Furthermore, the percentage of missingness in these datasets was at most 70%. The mean and the median of the number of traits in these trials were 5.9 and 4 correspondingly (See Figure 6 and Supplementary Figure S6).

Figure 6

The correlation matrix for the traits in each trial was calculated and then combined using the Wishart EM-Algorithm. The resulting covariance matrix was used in learning a directed acyclic graph (DAG) using the qgraph R Package (Epskamp et al., 2012).

Another application that involved combining the phenotypic correlation matrices from oat (78 correlation matrices), barley (143 correlation matrices), and wheat (144 correlation matrices) datasets downloaded and selected in a similar way as above were combined to obtain the DAG involving 196 traits in the Supplementary (Supplementary Application 6.1).

Results

Application 1: When Imputation Is Not an Option: Anchoring Independent Pedigree-Based Relationship Matrices Using a Genotypic Relation Matrix—Potato Data

Figure 7 shows the correlation and mean squared error (MSE) results as either of the sizes of the pedigree matrices and the number of genotypes in the genomic relationship matrices increases. The MSE results for these experiments ranged from 0.004 to 0.017 with a mean of 0.009, and the correlation values ranged from 0.52 to 0.94 with a mean of 0.78.

Figure 7

Application 2: Rice Dataset. Combining Independent Low-Density Marker Datasets

The MSE and correlation results for this experiment are given in Figure 8. In general, as the number of independent datasets increases the accuracy of all of the methods/scenarios increases (decreasing MSEs and increasing correlations). In general, the accuracy of the Wishart EM-algorithm in terms of MSEs ranged from 0.0003 to 0.028 with a mean value of 0.0007. The accuracies measured in correlation ranged from 0.989 to 0.998 with a mean value of 0.995. For the imputation based method MSEs ranged from 0.014 to 0.028 (mean 0.019) and the correlations ranged from 0.805 to 0.970 (mean 0.920).

Figure 8

Figure 9 displays the scatter plot of full genomic relationship matrix (obtained using all 387,161 markers) against the one obtained by combining a sample of partial relationship matrices (200 randomly selected genotypes and 2,000 randomly selected markers each) over varying numbers of samples (3, 5, 10, 20, 40, and 80 partial relationship matrices). Observed parts (observed-diagonal and observed non-diagonal) of the genomic relationship matrix can be predicted with high accuracy and no bias. As the sample size increase, the estimates get closer to the one obtained using all of the data. We observe that the estimates of the unobserved parts of the relationship are biased towards zero but his bias quickly decreases as the sample size increases.

Figure 9

Application 3: Wheat Data at Triticeae Toolbox. Combining Genomic Datasets to Use in Genomic Prediction

The results summarized by Figure 10 indicate that when a random sample of genotypes are selected for the test population, the accuracy of the genomic predictions using the combined genomic relationship matrix can be high (Cross-validation scenario 1). Average accuracy for estimating plant height was about 0.68, and for yield 0.58. The lowest accuracy values were for test weight with a mean value of 0.48. The performance decreases significantly across population predictions (Cross-validation scenario 2). Some populations showed low prediction accuracies such as d5, d6, and d7, but other as d12 and d16 showed high predictability. Average accuracy for estimating plant height was about 0.30, for yield 0.28.

Figure 10

Application 4: Maize Data—Genomics and Transcriptomics for Genomic Prediction

Figures 11 and 12 show comparisons of full data accuracies vs. partial relationship data. As expected, as the number of common genotypes increases there is a decrease on the differences to the full data. Our results show that up to 80% of the genomic prediction accuracy can be recovered using 50% overlap partial relationship data (Figure 11). The results in Figure 11 point to the feasibility of the application of the CK-BLUP approach when only partial data is available. With the CK approach, we can stitch several genetic relationship matrices together to extend genomic predictions although no genomic features are common between the training and test sets. Besides, as the amount of connection between the different genotypic relationship matrices increases the accuracy also improves. For example, as we increase the number of genotypes in K12, and K23, the accuracy of predictions of the unobserved relationships improve as seen in Figure 12. The number of markers seems to have a secondary effect that is more pronounced when the number of genotypes in K12 and K23 becomes larger.

Figure 11

Figure 12

Application 5: Wheat Data at Triticeae Toolbox—Combining Phenotypic Experiments

In this application, we combined correlation matrices obtained from independent phenotypic experiments. Figures 13 and S3 displayed the correlation matrix for the traits in a directed acyclic graph (DAG) and a heatmap, respectively. In Figure 13 each node represents a trait and each edge represents a correlation between two traits. One of the strengths of this representation is that you can elucidate the correlation between traits that you did not measure in your experiment. For example, among all the traits, grain width (GW) and above-ground biomass (Above_bm) are positively correlated (blue arrows) with grain yield. In turn, GW is highly positively correlated with biomass at maturity (Biomass M) but negatively correlated with harvest index (HI). Negative correlations (red) can also be observed among traits. Traditional inverse correlations such as protein (WGP) and GW can be also observed.

Figure 13

Combining datasets by correlation matrices also help to group traits. Figure S3 shows two groups of positively correlated traits. The traits in these two groups are positively correlated within the group but negatively correlated with traits in other groups. For instance, we see that yield-related traits such as grain yield, grain weight, or harvest index, are positively correlated. On the other hand, these traits are negatively correlated with disease-related traits such as bacterial leaf streak, stripe rust traits, and also with quality traits such as protein and nutrient content.

Discussion and Conclusions

Genomic data are now relatively inexpensive to collect and phenotypes remain to be the primary way to define organisms (Lehner, 2013). Many genotyping technologies exist and these technologies evolve which leads to heterogeneity of genomic data across independent experiments (Masseroli et al., 2016; Townend, 2018; Lüth et al., 2018). Similarly, phenotypic experiments, due to the high relative cost of phenotyping, usually can focus only on a set of key traits of interest. Therefore, when looking over several phenotypic datasets, the usual case is that these datasets are extremely heterogeneous and incomplete, and the data from these experiments accumulate in databases (Maiella et al., 2018; Alaux et al., 2018).

This presents a challenge but also an opportunity to make the most of genomic/phenotypic data in the future. In the long term, such databases of genotypic and phenotypic information will be invaluable to scientists as they seek to understand complex biological organisms. Issues and opportunities are beginning to emerge, like the promise of gathering phenotypical knowledge from totally independent datasets for meta-analyses.

To address the challenges of genomic and phenotypic data integration (Suravajhala et al., 2016; Stark et al., 2019), we developed a simple and efficient approach for integrating data from multiple sources. This method can be used to combine information from multiple experiments across all levels of the biological hierarchy such as microarray, gene expression, microfluidics, and proteomics will help scientists to discover new information and to develop new approaches.

For example, Figure 8 shows that we can estimate the full genomic relationship matrix more precisely from 10 independent partially overlapping datasets of 200 genotypes and 2,000 markers each than estimating from a dataset (for the combined set of genotypes) that has 2,000 fixed markers. Twenty independent genomic datasets of 200 genotypes and 2,000 markers are as good as one genomic dataset with 5,000 markers. When we compare it to the rest of the entries, imputation is the least effective for estimating the unobserved parts of the genomic relationship matrix. This suggests that accounting for incomplete genetic relationships would be a more promising approach than estimating the genomic features by imputation and then calculating the genomic relationship matrix.

Figure 7 shows we can accurately estimate the unobserved relationships among the genotypes in two independent pedigree-based relationship matrices by genotyping a small proportion of the genotypes in these datasets. For instance, the mean correlation for the worst-case setting (50 genotypes in each pedigree and 10 from each of the pedigree genotyped) was 0.72. This value increased up to 0.94 for the best case (250 genotypes in each pedigree and 40 from each of the pedigree genotyped).

Linear mixed modes with marker-based additive relationship matrix are the standard approach to estimate GEBVs. If the phenotypic information corresponding to the genotypes in one or more of the component matrices is missing then the genotypic value estimates can be obtained using the available phenotypic information. In this sense, the combined genomic information links all the genotypes and the experiments.

Imputation has been the preferred method when dealing with incomplete and datasets (Browning, 2008; Browning and Browning, 2009; Howie et al., 2011; Druet et al., 2014; Erbe et al., 2016). However, imputation can be inaccurate if the data is very heterogeneous (Van Buuren, 2011). In these cases, as seen in applications above, the proposed approach which uses the relationships instead of the actual features seems to outperform imputation for inferring genomic relationships. Besides, the methods introduced in this article are useful even when imputation is not feasible. For example, two partially overlapping relationship matrices, one pedigree-based and the other can be combined to make inferences about the genetic similarities of genotypes in both of these datasets (Figure 7).

There are also limitations to our approach. In particular when we combine data using relationship matrices original features (markers) are not imputed. Our method may not be the best option when inferences about genomic features are needed, such as in GWAS. We can address this issue by imputing the missing features using the combined relationship matrix, for instance, using a k-nearest neighbor imputation (Hastie et al., 2001) or by kernel smoothing. Moreover, if the marker data in the independent genomic studies can be mapped to local genomic regions, then the combined relationship matrices can be obtained for these genomic regions separately. Then a kernel-based model such as the ones in Yang et al. (2008); Akdemir and Jannink (2015) can be used for association testing. The nature of missingness in data will also affect our algorithm’s performance. Inference based on approaches that ignore the missing data mechanisms is valid for missing completely at random, missing at random but probably not for not missing at random (Rubin, 1976; Little and Rubin, 2002). The results of our algorithm depend on the prior information that is expressed in the initial estimate of the combined relationship matrix. This dependence, on the other hand, will decreases as the number of partial relationship matrices increases since these incomplete relationship matrices take the role of independent samples to update our prior information. When the sample size (i.e., the number of relationship matrices that are combined) is small this matrix should be carefully selected.

As it can be seen in Figure 10B, the genomic prediction accuracies can be low when predicting over heterogeneous populations. Nevertheless, using correlated traits in a multi-trait genomic prediction model can lead to improved prediction accuracies by borrowing information among the traits. In particular, if some unbalanced phenotypic data are available for the target set and a training set of genotypes, these can be used as additional anchors to improve accuracy. Similarly, incomplete environmental data about the different experiments in the target and training sets can be combined using the methods discussed here to possibly improve genomic prediction accuracies. The difficulty in predicting over heterogeneous populations could also be due to genetic variants are specific to particular populations. In this case, the populations could be clustered into groups and genomic prediction can be applied within each group. An alternative way to select a sub-population for training for a specific target set lies in selecting an optimized training population from a large set of candidates for that target set (Isidro et al., 2015).

Software and Data Availability

The software was written using C++ and R and an R (R Core Team (2019) package CovCombR (Akdemir et al., 2020) is made available publicly. The code and data for replicating some of the analysis can be requested from the corresponding authors.

Data Availability Statement

The datasets analyzed in this study are publicly available and can be obtained from the sources cited within.

Author Contributions

DA: Conception and design of the work, statistics, derivations, proofs, R programs, simulations, drafting the article, and critical revision of the article. JI: R graphical programs, statistics, drafting the article, critical revision of the article. RK: Critical revision of the article.

Funding

Results have been achieved within the framework of the first transnational joint call for research projects in the SusCrop ERA-Net Cofund on Sustainable Crop Production, with funding from the Research Council of Norway (NFR grant 299615) “Deutsches Bundesministerium für Bildung und Forschung” (031B0810), “Bundesministerium für Nachhaltigkeit und Tourismus Österreich” (Forschungsprojekt Nr. 101402), the Genome Canada project CTAG2, and the Canadian Agricultural Partnership administered by the Canadian Wheat Research Coalition, and the Department of Agriculture, Food and the Marine (DAFM), Ireland.

Statements

Acknowledgments

This manuscript has been released as a pre-print at biorXiv Akdemir and Sanchéz (2019).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2020.00947/full#supplementary-material

Footnotes

1.^In what follows, we will refer to genetic relationship matrices that measure how genotypes are related (See Supplementary Section 5.3 for a description of how to calculate a genetic relationship matrix from genome-wide markers [genomic relationship matrix]). However, a theme in this article is that a genetic relationship matrix is a special kind of covariance matrix. Therefore, the same arguments below apply to covariance matrices that measure the relationship between traits or features.

2.^We used correlations instead of covariances because the phenotypic experiments were very heterogeneous in terms of the variances of the different traits.

References

1
AkdemirD.JanninkJ.-L. (2015). Locally epistatic genomic relationship matrices for genomic association and prediction. Genetics199, 857–871. doi: 10.1534/genetics.114.173658
- CrossRef
- Google Scholar
2
AkdemirD.SanchézJ.II (2019). Adventures in multi-omics i: Combining heterogeneous data sets via relationships matrices. arXiv preprint arXiv:1912.03358. doi: 10.1101/857425
- CrossRef
- Google Scholar
3
AkdemirD.SomoM.SanchezJ.II (2020). CovCombR: Combine Partial Covariance or Relationship Matrices. R package version 1.0. doi: 10.1111/camh.12387
- CrossRef
- Google Scholar
4
AlauxM.RogersJ.LetellierT.FloresR.AlfamaF.PommierC.et al. (2018). Linking the international wheat genome sequencing consortium bread wheat reference genome sequence to wheat genetic and phenomic data. Genome Biol.19, 111. doi: 10.1186/s13059-018-1491-4
- CrossRef
- Google Scholar
5
AndersonT. (2003). An Introduction to Multivariate Statistical Analysis 3rd ed., vol. 147 (New York: Wiley & Sons). doi: 10.1016/s0377-2217(02)00243-6
- CrossRef
- Google Scholar
6
AndersonT. W. (1984b). An Introduction to Multivariate Statistical Analysis, 2nd Edition(Wiley).
- Google Scholar
7
AzodiC. B.PardoJ.VanBurenR.de los CamposG.ShiuS.-H. (2020). Transcriptome-based prediction of complex traits in maize. Plant Cell32, 139–151. doi: 10.1105/tpc.19.00332
- CrossRef
- Google Scholar
8
BergerB.PengJ.SinghM. (2013). Computational solutions for omics data. Nat. Rev. Genet.14, 333–346. doi: 10.1038/nrg3433
- CrossRef
- Google Scholar
9
BersanelliM.MoscaE.RemondiniD.GiampieriE.SalaC.CastellaniG.et al. (2016). Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinf.17, S15. doi: 10.1186/s12859-015-0857-9
- CrossRef
- Google Scholar
10
BertsimasD.PawlowskiC.ZhuoY. D. (2017). From predictive methods to missing data imputation: an optimization approach. J. Mach. Learn. Res.18, 7133–7171. doi: 10.1287/ijoo.2018.0001
- CrossRef
- Google Scholar
11
BodmerW. F. (1986). “Human genetics: the molecular challenge,” in Cold Spring Harbor symposia on quantitative biology, vol. 51. (Cold Spring Harbor Laboratory Press), 1–13. doi: 10.1002/bies.950070109
- CrossRef
- Google Scholar
12
BreimanL. (2001). Random forests. Mach. Learn.45, 5–32. doi: 10.1023/A:1010933404324
- CrossRef
- Google Scholar
13
BrowningB. L.BrowningS. R. (2009). A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet.84, 210–223. doi: 10.1016/j.ajhg.2009.01.005
- CrossRef
- Google Scholar
14
BrowningB. L.BrowningS. R. (2016). Genotype imputation with millions of reference samples. Am. J. Hum. Genet.98, 116–126. doi: 10.1016/j.ajhg.2015.11.020
- CrossRef
- Google Scholar
15
BrowningS. R. (2008). Missing data imputation and haplotype phase inference for genome-wide association studies. Hum. Genet.124, 439–450. doi: 10.1007/s00439-008-0568-7
- CrossRef
- Google Scholar
16
ChristensenO. F.MadsenP.NielsenB.OstersenT.SuG. (2012). Single-step methods for genomic evaluation in pigs. Animal6, 1565–1571. doi: 10.1017/S1751731112000742
- CrossRef
- Google Scholar
17
CramerE. (1998). Conditional iterative proportional fitting for gaussian distributions. J. Multivar. Anal.65, 261–276. doi: 10.1006/jmva.1998.1739
- CrossRef
- Google Scholar
18
CramerE. (2000). Probability measure with given marginals and conditionals: I-projections and conditional iterative proportional fitting. Stat Risk Model.18, 311–330. doi: 10.1524/strm.2000.18.3.311
- CrossRef
- Google Scholar
19
CrossaJ.Pérez-RodríguezP.CuevasJ.Montesinos-LópezO.JarquínD.de los CamposG.et al. (2017). Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci.22, 961–975. doi: 10.1016/j.tplants.2017.08.011
- CrossRef
- Google Scholar
20
de los CamposG.HickeyJ. M.Pong-WongR.DaetwylerH. D.CalusM. P. (2013). Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics193, 327–345. doi: 10.1534/genetics.112.143313
- CrossRef
- Google Scholar
21
DempsterA.LairdN.RubinD. (1977). Maximum likelihood from incomplete data via the em algorithm. J. R. Stat. Soc. Ser. B (Meth.), 391–38. doi: 10.1111/j.2517-6161.1977.tb01600.x
- CrossRef
- Google Scholar
22
DempsterA. P.RubinD. B.TsutakawaR. K. (1981). Estimation in covariance components models. J. Am. Stat. Assoc.76, 341–353. doi: 10.1080/01621459.1981.10477653
- CrossRef
- Google Scholar
23
DestaZ. A.OrtizR. (2014). Genomic selection: genome-wide prediction in plant improvement. Trends Plant Sci.19, 592–601. doi: 10.1016/j.tplants.2014.05.006
- CrossRef
- Google Scholar
24
DruetT.MacleodI.HayesB. (2014). Toward genomic prediction from whole-genome sequence data: impact of sequencing design on genotype imputation and accuracy of predictions. Heredity112, 39. doi: 10.1038/hdy.2013.13
- CrossRef
- Google Scholar
25
EndelmanJ. B.CarleyC. A. S.BethkeP. C.CoombsJ. J.CloughM. E.da SilvaW. L.et al. (2018). Genetic variance partitioning and genome-wide prediction with allele dosage information in autotetraploid potato. Genetics209, 77–87. doi: 10.1534/genetics.118.300685
- CrossRef
- Google Scholar
26
EndelmanJ. B. (2011). Ridge regression and other kernels for genomic selection with r package rrblup. Plant Genome4, 250–255. doi: 10.3835/plantgenome2011.08.0024
- CrossRef
- Google Scholar
27
EpskampS.CramerA. O. J.WaldorpL. J.SchmittmannV. D.BorsboomD. (2012). qgraph: Network visualizations of relationships in psychometric data. J. Stat. Softw.48, 1–18. doi: 10.18637/jss.v048.i04
- CrossRef
- Google Scholar
28
ErbeM.FrischknechtM.PauschH.EmmerlingR.MeuwissenT.GredlerB.et al. (2016). 0409 genomic prediction using imputed sequence data in dairy and dual purpose breeds. J. Anim. Sci.94, 198–199. doi: 10.2527/jam2016-0409
- CrossRef
- Google Scholar
29
GondroC.Van der WerfJ.HayesB. J. (2013). Genome-wide association studies and genomic prediction(Springer). doi: 10.1007/978-1-62703-447-0_26
- CrossRef
- Google Scholar
30
GonenS.WimmerV.GaynorR. C.ByrneE.GorjancG.HickeyJ. M. (2018). A heuristic method for fast and accurate phasing and imputation of single-nucleotide polymorphism data in bi-parental plant populations. Theor. Appl. Genet.131, 2345–2357. doi: 10.1007/s00122-018-3156-9
- CrossRef
- Google Scholar
31
GuptaA.NagarD. (2000). Matrix Variate Distributions 11 of Chapman and Hall/CRC Monographs and Surveys in Pure and Applied Mathematics (Boca Raton, FL: Chapman and Hall). doi: 10.1515/156939703322386878
- CrossRef
- Google Scholar
32
HastieT.MazumderR. (2015). softImpute: Matrix Completion via Iterative Soft-Thresholded SVD. R package version 1.4.
- Google Scholar
33
HastieT.TibshiraniR.NarasimhanB.ChuG.HastieT.TibshiraniR.et al. (2001). impute: Imputation for microarray data. Bioinformatics17, 520–525. doi: 10.1007/978-3-642-57489-4_7
- CrossRef
- Google Scholar
34
HeffnerE. L.LorenzA. J.JanninkJ.-L.SorrellsM. E. (2010). Plant breeding with genomic selection: gain per unit time and cost. Crop Sci.50, 1681–1690. doi: 10.2135/cropsci2009.11.0662
- CrossRef
- Google Scholar
35
HeffnerE. L.JanninkJ.-L.IwataH.SouzaE.SorrellsM. E. (2011). Genomic selection accuracy for grain quality traits in biparental wheat populations. Crop Sci.51, 2597–2606. doi: 10.2135/cropsci2011.05.0253
- CrossRef
- Google Scholar
36
HillW. G.MackayT. F. (2004). Ds falconer and introduction to quantitative genetics. Genetics167, 1529–1536. doi: 10.1186/jbiol133
- CrossRef
- Google Scholar
37
HowieB.MarchiniJ.StephensM. (2011). Genotype imputation with thousands of genomes. G3: Genes Genomes Genet.1, 457–470. doi: 10.1534/g3.111.001198
- CrossRef
- Google Scholar
38
IsidroJ.JanninkJ.-L.AkdemirD.PolandJ.HeslotN.SorrellsM. E. (2015). Training set optimization under population structure in genomic selection. Theor. Appl. Genet.128, 145–158. doi: 10.1007/s00122-014-2418-4
- CrossRef
- Google Scholar
39
JulianaP.SinghR. P.PolandJ.MondalS.CrossaJ.Montesinos-LópezO. A.et al. (2018). Prospects and challenges of applied genomic selection—a new paradigm in breeding for grain yield in bread wheat. Plant Genome11, 1–17 doi: 10.3835/plantgenome2018.03.0017
- CrossRef
- Google Scholar
40
KolloT.von RosenD. (2006). Advanced multivariate statistics with matrices. vol. 579 (Springer Science & Business Media). doi: 10.1080/03610920903576556
- CrossRef
- Google Scholar
41
LüthS.KletaS.Al DahoukS. (2018). Whole genome sequencing as a typing tool for foodborne pathogens like listeria monocytogenes–the way towards global harmonisation and data exchange. Trends Food Sci. Technol.73, 67–75. doi: 10.1016/j.tifs.2018.01.008
- CrossRef
- Google Scholar
42
LegarraA.AguilarI.MisztalI. (2009). A relationship matrix including full pedigree and genomic information. J. Dairy Sci.92, 4656–4663. doi: 10.3168/jds.2009-2061
- CrossRef
- Google Scholar
43
LehnerB. (2013). Genotype to phenotype: lessons from model organisms for human genetics. Nat. Rev. Genet.14, 168–178. doi: 10.1038/nrg3404
- CrossRef
- Google Scholar
44
LittleR.RubinD. (2002). Statistical analysis with missing data. (New York: wiley).
- Google Scholar
45
MaiellaS.OlryA.HanauerM.LanneauV.LourghiH.DonadilleB.et al. (2018). Harmonising phenomics information for a better interoperability in the rare disease field. Eur. J. Med. Genet.61, 706–714. doi: 10.1016/j.ejmg.2018.01.013
- CrossRef
- Google Scholar
46
MardisE. R. (2008a). The impact of next-generation sequencing technology on genetics. Trends Genet.24, 133–141. doi: 10.1016/j.tig.2007.12.007
- CrossRef
- Google Scholar
47
MardisE. R. (2008b). Next-generation dna sequencing methods. Annu. Rev. Genomics Hum. Genet.9, 387–402. doi: 10.1146/annurev.genom.9.081307.164359
- CrossRef
- Google Scholar
48
MasseroliM.KaitouaA.PinoliP.CeriS. (2016). Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying. Methods111, 3–11. doi: 10.1016/j.ymeth.2016.09.002
- CrossRef
- Google Scholar
49
MenichettiG.RemondiniD.PanzarasaP.MondragónR. J.BianconiG. (2014). Weighted multiplex networks. PloS One9, e97857. doi: 10.1371/journal.pone.0097857
- CrossRef
- Google Scholar
50
MeuwissenT. H. E.HayesB. J.GoddardM. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics157, 1819–1829. doi: 10.1534/genetics.107.070953
- CrossRef
- Google Scholar
51
NicolazziE.BiffaniS.JansenG. (2013). Imputing genotypes using pedimpute fast algorithm combining pedigree and population information. J. Dairy Sci.96, 2649–2653. doi: 10.3168/jds.2012-6062
- CrossRef
- Google Scholar
52
R Core Team (2019). R: A Language and Environment for Statistical Computing (Vienna, Austria: R Foundation for Statistical Computing).
- Google Scholar
53
Rampazo AmadeuR.CellonC.OlmesteadJ. W.Franco GarciaA. A.ResendeM. F.Jr. (2016). Aghmatrix: R package to construct relationship matrices for autotetraploid and diploid species: A blueberry example. Plant Genome9, 1–10. doi: 10.3835/plantgenome2016.01.0009
- CrossRef
- Google Scholar
54
RischN.MerikangasK. (1996). The future of genetic studies of complex human diseases. Science273, 1516–1517. doi: 10.1126/science.273.5281.1516
- CrossRef
- Google Scholar
55
RubinD. B. (1976). Inference and missing data. Biometrika63, 581–592. doi: 10.1093/biomet/63.3.581
- CrossRef
- Google Scholar
56
SchölkopfB.SmolaA. (2005). Learning with kernels (Cambridge, MA: MIT Press).
- Google Scholar
57
ShriveF. M.StuartH.QuanH.GhaliW. A. (2006). Dealing with missing data in a multi-question depression scale: a comparison of imputation methods. BMC Med. Res. Method.6, 57. doi: 10.1186/1471-2288-6-57
- CrossRef
- Google Scholar
58
StarkZ.DolmanL.ManolioT. A.OzenbergerB.HillS. L.CaulfiedM. J.et al. (2019). Integrating genomics into healthcare: a global responsibility. Am. J. Hum. Genet.104, 13–20. doi: 10.1016/j.ajhg.2018.11.014
- CrossRef
- Google Scholar
59
SuravajhalaP.KogelmanL. J.KadarmideenH. N. (2016). Multi-omic data integration and analysis using systems genomics approaches: methods and applications in animal production, health and welfare. Genet. Selection Evol.48, 38. doi: 10.1186/s12711-016-0217-x
- CrossRef
- Google Scholar
60
TownendD. (2018). Conclusion: harmonisation in genomic and health data sharing for research: an impossible dream? Hum. Genet.137, 657–664. doi: 10.1007/s00439-018-1924-x
- CrossRef
- Google Scholar
61
Van BuurenS. (2011). “Multiple imputation of multilevel data,” in Handbook of advanced multilevel analysis, vol. 10. , 173–196.
- Google Scholar
62
VanRadenP. M.SunC.O’ConnellJ. R. (2015). Fast imputation using medium or low-coverage sequence data. BMC Genet.16, 82. doi: 10.1186/s12863-015-0243-7
- CrossRef
- Google Scholar
63
VanRadenP. (2008). Efficient methods to compute genomic predictions. J. Dairy Sci.91, 4414–4423. doi: 10.3168/jds.2007-0980
- CrossRef
- Google Scholar
64
VisscherP. M.WrayN. R.ZhangQ.SklarP.McCarthyM.IIBrownM. A.et al. (2017). 10 years of gwas discovery: biology, function, and translation. Am. J. Hum. Genet.101, 5–22. doi: 10.1016/j.ajhg.2017.06.005
- CrossRef
- Google Scholar
65
WangB.MezliniA. M.DemirF.FiumeM.TuZ.BrudnoM.et al. (2014). Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods11, 333. doi: 10.1038/nmeth.2810
- CrossRef
- Google Scholar
66
YangH.-C.HsiehH.-Y.FannC. S. (2008). Kernel-based association test. Genetics179, 1057–1068. doi: 10.1534/genetics.107.084616
- CrossRef
- Google Scholar

Summary

Keywords

multi-omics, phenomics, genomic selection, multiple kernel learning, mixed models, covariance estimation, expectation-maximization

Citation

Akdemir D, Knox R and Isidro y Sánchez J (2020) Combining Partially Overlapping Multi-Omics Data in Databases Using Relationship Matrices. Front. Plant Sci. 11:947. doi: 10.3389/fpls.2020.00947

Received

16 April 2020

Accepted

10 June 2020

Published

14 July 2020

Volume

11 - 2020

Edited by

Xiaoyong Pan, Shanghai Jiao Tong University, China

Reviewed by

Zhe Zhang, South China Agricultural University, China; Zhihong Zhu, The University of Queensland, Australia

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Deniz Akdemir, deniz.akdemir.work@gmail.com; Julio Isidro y Sánchez, j.isidro@upm.es

This article was submitted to Computational Genomics, a section of the journal Frontiers in Plant Science

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Computational Genomics

ORIGINAL RESEARCH article

Combining Partially Overlapping Multi-Omics Data in Databases Using Relationship Matrices

Abstract

Introduction

Methods and Materials