Data Integration in Poplar: ‘Omics Layers and Integration Strategies

Populus trichocarpa is an important biofuel feedstock that has been the target of extensive research and is emerging as a model organism for plants, especially woody perennials. This research has generated several large ‘omics datasets. However, only a few studies in Populus have attempted to integrate multiple data types. This review summarizes various ‘omics data layers, focusing on their application in Populus species. Network and signal processing techniques for the integration and analysis of these data types are then discussed, with particular reference to examples in Populus.


GWAS APPROACHES AND MULTIPLE HYPOTHESIS CORRECTION
EMMAX (Kang et al., 2010) is a GWAS method that corrects for the effect of relatedness among individuals in the population. It is a faster version of the EMMA method (Kang et al., 2008). GWAS methods such as EMMAX model the relationship between measured phenotypes and SNPs as a linear model:
$$y_i = \sum_{k} \beta_k X_{ik} + \varepsilon_i \tag{S1}$$
where $y_i$ is the measured phenotype for individual $i$, $\beta_k$ is the effect of SNP $k$ on the phenotype, $X$ is a matrix of fixed effects (SNPs) in which $X_{ik}$ is the minor allele count of SNP $k$ in individual $i$, and $\varepsilon_i$ represents environmental variation in the phenotype $y_i$ (Kang et al., 2010). The aim is to determine which of the $\beta_k$ are significantly different from zero, thus identifying which SNPs have a significant effect on the phenotype (Kang et al., 2010). EMMAX accounts for sample structure by calculating a kinship matrix $K$ that contains the pairwise genetic similarities of the individuals under consideration. A variance component model is used, partitioning the phenotypic variance into variance due to environmental factors, $\sigma_e^2$, and variance due to the additive effect of genetic factors, $\sigma_a^2$ (Kang et al., 2010). This variance component model includes the kinship matrix, modeling the variance-covariance structure of the phenotype in terms of the genetic similarity of pairs of individuals defined in the kinship matrix (Kang et al., 2010):
$$\mathrm{Var}(Y) = \sigma_a^2 K + \sigma_e^2 I \tag{S2}$$
where $\mathrm{Var}(Y)$ is the variance-covariance structure of the phenotype and $I$ is the identity matrix. The $\beta_k$ are then estimated using Generalized Least Squares, and an F-test is used to determine which of these $\beta_k$ are statistically different from zero (Kang et al., 2010). Each SNP $k$ whose $\beta_k$ is statistically different from zero thus potentially affects the phenotype. For a given measured phenotype, EMMAX therefore produces a list of all SNPs together with the p-values for their association with the phenotype. A p-value threshold can then be applied to determine which of the associations are significant.
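The estimation step can be illustrated with a minimal sketch of a Generalized Least Squares fit for a single SNP under the covariance structure $\sigma_a^2 K + \sigma_e^2 I$. This is not the EMMAX implementation: the function names (`solve`, `dot`, `gls_snp_test`) are hypothetical, the variance components are assumed to be already known (EMMAX estimates them from the data), the model is assumed to contain only an intercept and one SNP, and the toy data are invented. Only the Python standard library is used.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def gls_snp_test(y, x, K, sigma2_a, sigma2_e):
    """Return (SNP effect estimate, F statistic, residual degrees of freedom).

    Assumes Var(Y) = sigma2_a * K + sigma2_e * I with known variance components.
    """
    n = len(y)
    V = [[sigma2_a * K[i][j] + (sigma2_e if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    cols = [[1.0] * n, [float(v) for v in x]]    # design: intercept, minor allele count
    Vinv_cols = [solve(V, c) for c in cols]      # V^-1 X, one column at a time
    Vinv_y = solve(V, y)                         # V^-1 y
    A = [[dot(cols[r], Vinv_cols[c]) for c in range(2)] for r in range(2)]  # X^T V^-1 X
    b = [dot(cols[r], Vinv_y) for r in range(2)]                            # X^T V^-1 y
    intercept, beta = solve(A, b)                # GLS estimates
    resid = [y[i] - intercept - beta * cols[1][i] for i in range(n)]
    dof = n - 2
    scale = dot(resid, solve(V, resid)) / dof    # estimated residual scale
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    var_beta = scale * A[0][0] / det             # scale * [(X^T V^-1 X)^-1] for the SNP
    return beta, beta ** 2 / var_beta, dof


# Toy data (invented): six individuals, the first two of which are related.
K = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
K[0][1] = K[1][0] = 0.5
phenotype = [1.0, 3.0, 5.0, 2.0, 2.0, 6.0]
allele_counts = [0, 1, 2, 0, 1, 2]
beta, f_stat, dof = gls_snp_test(phenotype, allele_counts, K, 0.5, 0.5)
print(beta, f_stat, dof)
```

The p-value for the SNP would then be obtained by comparing the F statistic with an F distribution on 1 and `dof` degrees of freedom. Repeating this for every SNP yields the list of p-values described above.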
Performing a GWAS involves testing multiple hypotheses, each asking "is SNP k associated with the phenotype p?" for one SNP in the dataset. When testing multiple hypotheses, a so-called family of $m$ hypotheses, the Family-wise Error Rate (FWER) becomes inflated (Johnson et al., 2010). The FWER is defined as the probability that at least one null hypothesis is rejected when it should not have been, that is, the probability of obtaining at least one false positive. Let $\alpha$ represent the originally chosen p-value threshold. When a single statistical test is performed and its p-value is compared with $\alpha$, the threshold sets the Type-1 error rate (or false-positive rate), which is the probability that a true null hypothesis is incorrectly rejected (Johnson et al., 2010). Thus, for each true null hypothesis, the probability that it is incorrectly rejected is $\alpha$, and the probability that it is not rejected is $1 - \alpha$. If we assume that all $m$ null hypotheses are true and that the tests are independent, the probability that none of them is rejected (i.e. the probability of obtaining no false positives) is $(1 - \alpha)^m$. Therefore, given that all null hypotheses are true, the probability of obtaining at least one false positive (the FWER) is (Johnson et al., 2010):
$$\mathrm{FWER} = 1 - (1 - \alpha)^m \tag{S3}$$
As can be seen from Equation S3, the FWER, and with it the probability of obtaining false positives, increases with the number of hypothesis tests performed. Methods for multiple hypothesis correction attempt to control this FWER.
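To give a sense of scale, the short sketch below evaluates the FWER expression for a few family sizes. The function name `fwer` and the chosen values of $m$ are illustrative only, and the calculation rests on the same assumptions as the derivation (all null hypotheses true, independent tests).

```python
def fwer(alpha, m):
    """Probability of at least one false positive among m independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** m


# The FWER grows quickly with the number of tests at a fixed threshold.
for m in (1, 20, 100, 1000):
    print(m, fwer(0.05, m))
```

With $\alpha = 0.05$, only 20 tests are needed for the FWER to reach about 0.64. A GWAS typically tests far more SNPs than this, so an uncorrected threshold makes at least one false positive almost certain.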
Let $H_1, H_2, \dots, H_m$ be a family of $m$ hypotheses and let $P_1, P_2, \dots, P_m$ be their respective p-values. Bonferroni Correction is a simple method which rejects null hypothesis $H_i$ if (Narum, 2006):
$$P_i \le \frac{\alpha}{m}$$
This has been proven to control the FWER, ensuring that $\mathrm{FWER} \le \alpha$.
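A minimal sketch of Bonferroni Correction follows. The function name `bonferroni_reject` and the example p-values are invented for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where P_i <= alpha / m."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Four tests at alpha = 0.05 give a per-test threshold of 0.0125.
print(bonferroni_reject([0.001, 0.012, 0.03, 0.2]))  # [True, True, False, False]
```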
An adaptation of this method is known as Sequential Bonferroni Correction or Holm-Bonferroni Correction (Holm, 1979). This method orders the hypotheses such that $P_1 \le P_2 \le \dots \le P_m$. The index $k$ is then determined such that $k$ is the largest index for which the following holds for every $j \le k$ (Holm, 1979):
$$P_j \le \frac{\alpha}{m + 1 - j}$$
Hypotheses $H_1, H_2, \dots, H_k$ are then rejected and hypotheses $H_{k+1}, H_{k+2}, \dots, H_m$ are not rejected.
Another type of multiple hypothesis correction attempts to control the False Discovery Rate (FDR), which is defined as the expected proportion of rejected null hypotheses that were incorrectly rejected, that is, the proportion of rejections that are Type-1 errors (Benjamini and Hochberg, 1995). This is performed by ordering the p-values in the same way as for Holm-Bonferroni Correction. The index $k$ is then determined such that $k$ is the largest index for which the following holds (Benjamini and Hochberg, 1995):
$$P_k \le \frac{k}{m}\,\alpha$$
Hypotheses $H_1, H_2, \dots, H_k$ are then rejected and hypotheses $H_{k+1}, H_{k+2}, \dots, H_m$ are not rejected. This procedure ensures that the FDR is at most $\alpha$ when the tests are independent.
Mathematically, a graph $G$ is an ordered pair defined as $G = (V, E)$ where $V$ is a set of nodes and $E$ is a set of edges (Golumbic, 2004). Each edge $e_{ij} \in E$ is defined as a set of two nodes:
$$e_{ij} = \{i, j\}$$
where $i \in V$ and $j \in V$. In biological network applications, nodes represent biological objects of interest and edges represent associations, interactions or similarities between these objects.
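Returning to the two step-wise corrections described earlier in this section, the sketch below contrasts Holm-Bonferroni Correction with the Benjamini-Hochberg procedure. The function names and example p-values are invented for illustration. Holm-Bonferroni is written as a step-down procedure that stops at the first ordered p-value exceeding its threshold, whereas Benjamini-Hochberg takes the largest index that satisfies its threshold.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni: walk up the ordered p-values and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if p_values[i] > alpha / (m + 1 - rank):
            break  # this and all larger p-values are not rejected
        reject[i] = True
    return reject


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg: reject the k smallest p-values, k = largest rank with P_k <= k*alpha/m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank  # keep the largest rank that satisfies the condition
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


p = [0.01, 0.04, 0.03, 0.005]
print(holm_reject(p))                # [True, False, False, True]
print(benjamini_hochberg_reject(p))  # [True, True, True, True]
```

On this example, controlling the FDR rejects all four hypotheses while controlling the FWER rejects only two. This illustrates why FDR control is usually the less conservative choice.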
A graph can be represented numerically as a matrix, namely an Adjacency Matrix (Golumbic, 2004). The Adjacency Matrix $A$ is an $n \times n$ matrix where $n = |V|$, the number of nodes in the network. Each entry $a_{ij}$ in an Adjacency Matrix associated with a graph is defined as:
$$a_{ij} = \begin{cases} 1 & \text{if } e_{ij} \in E \\ 0 & \text{otherwise} \end{cases}$$
The Adjacency Matrix associated with the small example graph in Figure 1A is shown in Figure 1B. Each edge $e_{ij}$ in a graph can be assigned a real-valued weight $w_{ij}$ which represents the strength of the relationship between the two nodes it connects. A weighted graph can be mathematically represented as a Weighted Adjacency Matrix. This matrix is constructed in a similar manner to the unweighted Adjacency Matrix. Each entry $a_{ij}$ of the Weighted Adjacency Matrix is defined as:
$$a_{ij} = \begin{cases} w_{ij} & \text{if } e_{ij} \in E \\ 0 & \text{otherwise} \end{cases}$$
where $w_{ij}$ is the weight associated with edge $e_{ij}$ (Golumbic, 2004).
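As a small illustration, the sketch below builds both matrix forms from a node list and an edge list. The graph is invented and is not the one in Figure 1A. The function name `adjacency_matrix`, the node labels and the weights are likewise hypothetical.

```python
def adjacency_matrix(nodes, edges, weights=None):
    """Build a (weighted) adjacency matrix for an undirected graph as a list of rows.

    edges is a list of node pairs; weights, if given, maps each pair to w_ij.
    """
    index = {node: pos for pos, node in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        w = 1 if weights is None else weights[(i, j)]
        A[index[i]][index[j]] = w
        A[index[j]][index[i]] = w  # edges are unordered pairs, so the matrix is symmetric
    return A


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(adjacency_matrix(nodes, edges))
# [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]

weights = {("a", "b"): 0.9, ("b", "c"): 0.4, ("c", "d"): 0.7}
print(adjacency_matrix(nodes, edges, weights))
```

A dense list-of-rows representation is used here for readability. Biological networks are usually sparse, so a sparse format would normally be preferred for large networks.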
A bipartite graph $G = (V, E)$ is a graph in which the nodes of $V$ can be partitioned into two non-overlapping sets, $V_1$ and $V_2$, and each edge $e_{ij} \in E$ is defined as:
$$e_{ij} = \{v_i, v_j\}$$
where $v_i \in V_1$ and $v_j \in V_2$ (Marcus, 2008). Intuitively, this means that a bipartite graph (or a bipartite network) consists of two classes of nodes in which nodes of one class can only be connected to nodes of the other class. An example of a bipartite network is shown in Figure 1C, and its matrix representation in Figure 1D.
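One common matrix form for a bipartite network is a rectangular $|V_1| \times |V_2|$ matrix with one row per node of the first class and one column per node of the second. Whether Figure 1D uses this form or the full square Adjacency Matrix is not assumed here. In the sketch below, the function name `biadjacency_matrix` and the SNP and phenotype labels are hypothetical. The function also enforces the defining constraint that every edge joins the two classes.

```python
def biadjacency_matrix(v1, v2, edges):
    """Return a |V1| x |V2| 0/1 matrix; raise ValueError if an edge stays within one class."""
    if set(v1) & set(v2):
        raise ValueError("V1 and V2 must not overlap")
    rows = {node: r for r, node in enumerate(v1)}
    cols = {node: c for c, node in enumerate(v2)}
    B = [[0] * len(v2) for _ in v1]
    for a, b in edges:
        if a in rows and b in cols:
            B[rows[a]][cols[b]] = 1
        elif b in rows and a in cols:
            B[rows[b]][cols[a]] = 1
        else:
            raise ValueError(f"edge ({a}, {b}) does not join V1 to V2")
    return B


snps = ["snp1", "snp2", "snp3"]
phenotypes = ["height", "lignin"]
associations = [("snp1", "height"), ("snp2", "height"), ("lignin", "snp3")]
print(biadjacency_matrix(snps, phenotypes, associations))
# [[1, 0], [1, 0], [0, 1]]
```

A GWAS result thresholded at a chosen p-value fits this structure naturally, with SNPs in one class, phenotypes in the other, and an edge for each significant association.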