Pleiotropic and Epistatic Network-Based Discovery: Integrated Networks for Target Gene Discovery

Biological organisms are complex systems that are composed of functional networks of interacting molecules and macromolecules. Complex phenotypes are the result of orchestrated, hierarchical, heterogeneous collections of expressed genomic variants. However, the effects of these variants are the result of historic selective pressure and current environmental and epigenetic signals, and, as such, their co-occurrence can be seen as genome-wide correlations in a number of different manners. Biomass recalcitrance (i.e., the resistance of plants to degradation or deconstruction, which ultimately enables access to a plant’s sugars) is a complex polygenic phenotype of high importance to biofuels initiatives. This study makes use of data derived from the re-sequenced genomes from over 800 different Populus trichocarpa genotypes in combination with metabolomic and pyMBMS data across this population, as well as co-expression and co-methylation networks in order to better understand the molecular interactions involved in recalcitrance, and identify target genes involved in lignin biosynthesis/degradation. A Lines Of Evidence (LOE) scoring system is developed to integrate the information in the different layers and quantify the number of lines of evidence linking genes to lignin-related lignin-phenotypes across the network layers. The resulting Genome Wide Association Study networks, integrated with Single Nucleotide Polymorphism (SNP) correlation, co-methylation and co-expression networks through the LOE scores are proving to be a powerful approach to determine the pleiotropic and epistatic relationships underlying cellular functions and, as such, the molecular basis for complex phenotypes, such as recalcitrance.


Introduction
Populus species are promising sources of cellulosic biomass for biofuels because of their fast growth rate, high cellulose content and moderate lignin content (Sannigrahi et al., 2010). Ragauskas et al. (2006) outline areas of This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
research needed "to increase the impact, efficiency, and sustainability of bio-refinery facilities" (Ragauskas et al., 2006), such as research into modifying plants to enhance favorable traits, including altered cell wall structure leading to increased sugar release, as well as resilience to biotic and abiotic stress. One particular research target in Populus species is the decrease/alteration of the lignin content of cell walls.
A large collection of different data types has been generated for Populus trichocarpa. The genome has been sequenced and annotated , and the assembly is currently in its third version of revision. A collection of 1,100 accessions of P. trichocarpa that have been clonally propagated in four different common gardens (Tuskan et al., 2011;Slavov et al., 2012;Evans et al., 2014) have been resequenced, which has provided a large set of ⇠ 28,000,000 Single Nucleotide Polymorphisms (SNPs) that has recently been publicly released (http://bioenergycenter.org/besc/gwas/). Many molecular phenotypes, such as untargetted metabolomics and pyMBMS phenotypes, that have been measured in this population provide an unparalleled resource for Genome Wide Association Studies (for example, see McKown et al. (2014)). DNA methylation data in the form of MeDIP (Methyl-DNA immunoprecipitation)-seq has been performed on 10 different P. trichocarpa tissues (Vining et al., 2012), and gene expression has been measured across various tissues and conditions. This study involves integrating these various data types in order to identify new possible candidate genes involved in lignin biosynthesis/degradation/regulation. Integrating Genome Wide Association Study (GWAS) data with other data types has previously been done to help provide context and identify relevant subnetworks/modules (Calabrese et al., 2017;Bunyavanich et al., 2014). Ritchie et al. (2015) reviewed techniques for integrating various data types for the aim of investigating gene-phenotype associations. Integrating multiple lines of evidence is a useful strategy as the more lines of evidence that connect a gene to a phenotype lowers the chance of false positives. Ritchie et al. (2015) categorized data integration approaches into two main classes, namely multi-staged analysis and meta-dimensional analysis. Multi-staged analysis analyses aims to enrich a biological signal through various steps of analysis. Meta-dimensional analysis involves the concurrent analysis of various data types, and is divided into three subcategories (Ritchie et al., 2015): Concatenation-based integration concatenates the data matrices of different data types into a single matrix on which a model is constructed (for example, see Fridley et al. (2012)). Model-based integration involves constructing a separate model for each dataset and then constructing a final model from the results of the separate models (for example, see Kim et al. (2013)). Transformation-based integration involves transforming transforming each data type into a common form (e.g. a network) before combining them (see for example, Kim et al. (2012)).
This study involves the development of an approach which can be seen as a type of transformation-based integration. Association networks for various different data types were constructed, including a pyMBMS GWAS network, a metabolomics GWAS network, as well as co-expression, co-methylation and SNP correlation networks, and subsequently the information in the different networks was integrated through the calculation of the newly developed Lines Of Evidence (LOE) scores defined in this study. These scores quantify the number of lines of evidence connecting each gene to lignin-related genes and phenotypes. This multi-omic data integration approach allowed for the identification of new possible candidate genes involved in lignin biosynthesis/regulation through multiple lines of evidence.

Overview
This approach involved combining various data types in order to identify new possible target genes involved in lignin biosynthesis/degradation/regulation. Figure 1 summarizes the overall approach. First, association networks were constructed including metabolomics and pyMBMS GWAS networks, co-expression, co-methylation and SNP correlation networks. Known lignin-related genes and phenotypes were then identified, and used as seeds to select lignin-related subnetworks from these various networks. The Lines Of Evidence (LOE) scoring technique was developed, and each gene was then scored based on its Lines Of Evidence linking it to lignin-related genes and phenotypes.

Metabolomics Data
The P. trichocarpa leaf samples for 851 unique clones were collected over three consecutive sunny days in July 2012. For 200 of those clones, a second biological replicate was also sampled. Typically, leaves (leaf plastocron index 9 plus or minus 1) on a south facing branch from the upper canopy of each tree were quickly collected, wiped with a wet tissue to clean both surfaces and the leaf then fast frozen under dry ice. Leaves were kept on dry ice and shipped back to the lab and stored at -80 C until processed for analyses. Metabolites from leaf samples were lyophilized and then ground in a micro-Wiley mill (1 mm mesh size). Approximately 25 mg of each sample was twice extracted in 2.5 mL 80% ethanol (aqueous) for 24 hr with the extracts combined, and 0.5 ml dried in a helium stream. Sorbitol (75 µl of a 1 mg/mL aqueous solution) was added before extraction as an internal standard to correct for differences in extraction efficiency, subsequent differences in derivatization efficiency and changes in sample volume during heating. Metabolites in the dried sample extracts were converted to their trimethylsilyl (TMS) derivatives, and analyzed by gas chromatography-mass spectrometry, as described previously Li et al., 2012). Briefly, dried extracts of metabolites were dissolved in acetonitrile followed by the addition of N-methyl-N-trimethylsilyltrifluoroacetamide (MSTFA) with 1% trimethylchlorosilane (TMCS), and samples then heated for 1 h at 70 C to generate TMS derivatives. After 2 days, aliquots were injected into an Agilent 5975C inert XL gas chromatograph-mass spectrometer (GCMS). The standard quadrupole GCMS is operated in the electron impact (70 eV) ionization mode, targeting 2.5 full-spectrum (50-650 Da) scans per second, as described previously . Metabolite peaks were extracted using a key selected ion, characteristic m/z fragment, rather than the total ion chromatogram, to minimize integrating co-eluting metabolites. The peak areas were normalized to the amount of internal standard (sorbitol) injected and the amount of sample extracted. A large user-created database (>2400 spectra) of mass spectral electron impact ionization (EI) fragmentation patterns of TMS-derivatized metabolites, as well as the Wiley Registry 10th Edition combined with NIST 2014 mass spectral library, are used to identify the metabolites of interest to be quantified.

pyMBMS Data
A commercially available molecular beam mass spectrometer (MBMS) designed specifically for biomass analysis was used for pyrolysis vapor analysis (Evans and Milne, 1987;Tuskan et al., 1999). Approximately 4 mg of air dried 20 mesh biomass was introduced into the quartz pyrolysis reactor via 80 uL deactivated stainless steel Eco-Cups provided with the autosampler. Mass spectral data from m/z 30-450 were acquired on a Merlin Automation data system version 3.0 using 17 eV electron impact ionization.
The pyMBMS mz peaks were annotated as described in , as done previously in (Muchero et al., 2015).

Single Nucleotide Polymorphism Data
A dataset consisting of 28,342,758 SNPs called across 882 P. trichocarpa  genotypes was obtained from http://bioenergycenter.org/besc/gwas/. This dataset is derived from whole genome sequencing of undomesticated P. trichocarpa genotypes collected from the U.S. and Canada, and clonally replicated in common gardens (Tuskan et al., 2011). Genotypes from this population have previously been used for population genomics (Evans et al., 2014) and GWAS studies in P. trichocarpa (McKown et al., 2014) as well as for investigating linkage disequilibrium in the population (Slavov et al., 2012).
Whole genome resequencing was carried out on a sample 882 P. trichocarpa natural individuals to an expected median coverage of 15x using Illumina Genome Analyzer, HiSeq 2000, and HiSeq 2500 sequencing platforms at the DOE Joint Genome Institute. Alignments to the P. trichocarpa Nisqually-1 v.3.0 reference genome were performed using BWA v0.5.9-r16 with default parameters, followed by post-processing with the picard FixMateInformation and MarkDuplicates tools. Genetic variants were called by means of the Genome Analysis Toolkit v. 3.5.0 (GATK; Broad Institute, Cambridge, MA, USA) (McKenna et al., 2010;Van der Auwera et al., 2013). Briefly, variants were called independently for each individual using the concatenation of RealignerTargetCreator, IndelRealigner and HaplotypeCaller tools, and the whole population was combined using GenotypeGVCFs, obtaining a dataset with all the variants detected across the sample population. Biallelic SNPs were extracted using the SelectVariants tool and quality-filtered using the GATKâȂŹs machine-learning implementation Variant Quality Score Recalibration (VQSR). To this end, the tool VariantRecalibrator was used to create the recalibration file and the sensitivity tranches file. As a "truth" dataset, we used SNP calls from a population of seven female and seven male P. trichocarpa that had been crossed in a half diallel design. "True" SNPs were identified by the virtual absence of segregation distortion and Mendelian violations in the progeny of these 49 crosses (ca. 500 offspring in total). As a "non-true" dataset, we used the SNP calls of seven open-pollinated crosses from these 7 females (n = 90), filtered using hard-filtering methods recommended in the GATK documentation (tool: VariantFiltration; quality thresholds: QD < 1.5, FS > 75.0, MQ < 35.0, missing alleles < 0.5 and MAF > 0.05). The prior likelihoods for the true and non-true datasets were Q = 15 and Q = 10, respectively, and the variant quality annotations to define the variant recalibration space were DP, QD, MQ, MQRankSum, ReadPosRankSum, FS, SOR and InbreedingCoeff. Finally, we used the ApplyRecalibration tool on the full GWAS dataset to assign SNPs to tranches representing different levels of confidence. We selected SNPs in the tranche with true sensitivity < 90, which minimizes false positives, but at an expected cost of 10% false negatives. The final filtered dataset had a transition/transversion ratio of 2.07, compared to 1.88 for the unfiltered SNPs. To further validate the quality of these SNP calls, we compared them to an Illumina Infinium BeadArray that had been generated from a subset of this population dataset (Geraldes et al., 2013). The average match rate was 96% (±2% SD) for 641 individuals across 20,723 loci.
SNPs in this dataset were divided into different Tranches, indicating the percentage of "true" SNPs recovered. For further analysis in this study, we made use of the PASS SNPs, corresponding to the most stringent Tranche, recovering 90% of the true SNPs [ see http://gatkforums.broadinstitute.org/gatk/discussion/39/ variant-quality-score-recalibration-vqsr]. VCFtools (Danecek et al., 2011) was used to extract the desired Tranche of SNPs from the VCF file and reformat it into .tfam and .tped files.

GWAS Analysis
The metabolomics and pyMBMS data was used as phenotypes in a genome wide association analysis. The respective phenotype measured over all the genotypes were analyzed to account for potential outliers. A median absolute deviation (MAD) from the median (Leys et al., 2013) cutoff was applied to determine if a particular measurement of a given phenotype was an outlier with respect to all measurements of that phenotype across the population. To account for asymmetry, the deviation values were estimated separately for values below and above the median, respectively. The distribution of the measured values together with the distribution of their estimated deviation was analyzed and a cutoff of 5 was determined to identify putative outlier values. Phenotypes that had non-outlier measurements in at least 20 percent of the population were retained for further analysis, this was to ensure sufficient signal for the genome wide association model. This resulted in 1262 pyMBMS derived phenotypes and 818 metabolomics derived phenotypes.
To estimate the statistical significant associations between the respective phenotypes and the SNPs called across the population, we applied a linear mixed model using EMMAX Kang et al. (2010). Taking into account population structure estimated from a kinship matrix, we tested each of the respective 2080 phenotypes against the highconfidence SNPs and corrected for multiple hypotheses bias using the Benjamini-Hochberg control for false-discovery rate of 0.1 Benjamini and Hochberg (1995). This was done in parallel with a python wrapper that utilized the schwimmbad python package (Price-Whelan and Foreman-Mackey, 2017).
SNP-Phenotype GWAS networks were then pruned to only include SNPs that resided within genes, and SNPs were mapped to their respective genes, resulting in a gene-phenotype network. SNPs were determined to be within genes using the gene boundaries defined in the Ptrichocarpa_210_v3.0.gene.gff3 from the P. trichocarpa version 3.0 genome assembly on Phytozome (Goodstein et al., 2012).

Co-Expression Network Construction
Populus trichocarpa (Nisqually-1) RNA-seq dataset from JGI Plant Gene Atlas project (Sreedasyam et al., unpublished) was obtained from Phytozome. This dataset consists of samples for standard tissues (leaf, stem, root and bud tissue) and libraries generated from nitrogen source study. List of sample descriptions was accessed from: https: //phytozome.jgi.doe.gov/phytomine/aspect.do?name=Expression.

Plant growth and treatment conditions
Populus trichocarpa (Nisqually-1) cuttings were potted in 4 00 X 4 00 X 5 00 containers containing 1:1 mix of peat and perlite. Plants were grown under 16-h-light/8-h-dark conditions, maintained at 20-23 C and an average of 235 µmol m 2 s 1 to generate tissue for (1) standard tissues and (2) nitrogen source study. Plants for standard tissue experiment were watered with McCownâȂŹs woody plant nutrient solution and plants for nitrogen experiment were supplemented with either 10mM KNO3 (NO3 plants) or 10mM NH4Cl (NH4+ plants) or 10 mM urea (urea plants). Once plants reached leaf plastochron index 15 (LPI-15), leaf, stem, root and bud tissues were harvested and immediately flash frozen in liquid nitrogen and stored at -80 C until further processing was done. Every harvest involved at least three independent biological replicates for each condition and a biological replicate consisted of tissue pooled from 3 plants.

RNA extraction and sequencing
Tissue was ground under liquid nitrogen and high quality RNA was extracted using standard Trizol-reagent based extraction (Li and Trick, 2005). The integrity and concentration of the RNA preparations were checked initially using Nano-Drop ND-1000 (Nano-Drop Technologies) and then by BioAnalyzer (Agilent Technologies). Plate-based RNA sample prep was performed on the PerkinElmer Sciclone NGS robotic liquid handling system using Illumina's TruSeq Stranded mRNA HT sample prep kit utilizing poly-A selection of mRNA following the protocol outlined by Illumina in their user guide: http://support.illumina.com/sequencing/sequencing_kits/truseq_stranded_ mrna_ht_sample_prep_kit.html, and with the following conditions: total RNA starting material was 1 ug per sample and 8 cycles of PCR was used for library amplification. The prepared libraries were then quantified by qPCR using the Kapa SYBR Fast Illumina Library Quantification Kit (Kapa Biosystems) and run on a Roche LightCycler 480 real-time PCR instrument. The quantified libraries were then prepared for sequencing on the Illumina HiSeq sequencing platform utilizing a TruSeq paired-end cluster kit, v4, and IlluminaâȂŹs cBot instrument to generate a clustered flowcell for sequencing. Sequencing of the flowcell was performed on the Illumina HiSeq2500 sequencer using HiSeq TruSeq SBS sequencing kits, v4, following a 2x150 indexed run recipe.

Correlation Analysis
Gene expression atlas data for P. trichocarpa consisting of 63 different samples were used to construct a co-expression network. Reads were trimmed using Skewer (Jiang et al., 2014). Star (Dobin et al., 2013) was then used to align the reads to the P. trichocarpa reference genome   Supplementary Figure S1A shows the distribution of Spearman correlation values for the co-expression network. An absolute threshold of 0.85 was applied.
BamTools stats (Barnett et al., 2011) was used to determine basic properties of the reads in each .bam file. Samtools (Li et al., 2009) was then used to extract only mapped reads. The number of reads which mapped to each gene feature was determined using htseq-count (Anders et al., 2014). These read counts were then converted to TPM values (Wagner et al., 2012), providing a methylation score for each gene in each tissue. The TPM value for a gene g in a given sample was defined as: where c g is the number of reads mapped to gene g and l g is the length of gene g in kb, calculated by subtracting the gene start position from the gene end position, and dividing the resulting difference by 1,000. A methylation matrix M was then formed, in which rows represented genes, columns represented tissues and each entry ij represented the methylation score (TPM) of gene i in tissue j. A co-methylation network (see references (Busch et al., 2016;Akulenko and Helms, 2013;Davies et al., 2012)) was then constructed by calculating the Spearman correlation coefficient between the methylation profiles of all pairs of genes using mcxarray and mcxdump programs from the MCL-edge package (Van Dongen, 2008, 2001 http://micans.org/mcl/. Supplementary Figure S1B shows the distribution of Spearman Correlation values. An absolute threshold of 0.95 was applied. Read counting using htseq-count, as well as Spearman correlation calculations were performed in parallel using Perl wrappers making use of the Parallel::MPI::Simple Perl module, developed by Alex Gough and available on The Comprehensive Perl Archive Network (CPAN) at www.cpan.org and used compute resources at the Oak Ridge Leadership Computing Facility (OLCF).

SNP Correlation Network Construction
The Custom Correlation Coefficient (CCC) (Climer et al., 2014b,a) was used to calculate the correlation between the occurrence of pairs of SNPs across the 882 genotypes. The CCC between allele x at position i and allele y and position j is defined as: where R i x j y is the relative co-occurrence of allele x at position i and allele y at position j, f i x is the frequency of allele x at position i and f j y is the frequency of allele y at position j.
This was performed in a parallel fashion using similar computational approaches as described for the co-expression network above. The set of ⇠10 million SNPs was divided into 20 different blocks, and the CCC was calculated for each within-block and cross-block SNPs in separate jobs, to a total of 210 MPI jobs ( Figure 2). A threshold of 0.7 was then applied. The resulting SNP correlation network was pruned to only include SNPs that resided within genes. Gene boundaries used were defined in the Ptrichocarpa_210_v3.0.gene.gff3 from the P. trichocarpa version 3.0 genome assembly on Phytozome (Goodstein et al., 2012). A local LD filter was then set, retaining correlations between SNPs greater than 10kb apart. The distribution of CCC values can be seen in Supplementary Figure S1C 3.6. Gene Annotation P. trichocarpa gene annotations in the Ptrichocarpa_210_v3.0.annotation_info.txt file from the version 3.0 genome assembly were used, available on Phytozome (Goodstein et al., 2012). This included Arabidopsis best hits and corresponding gene descriptions, as well as GO terms (Gene Ontology Consortium, 2017;Ashburner et al., 2000) and Pfam domains (Finn et al., 2016). Genes were also assigned MapMan annotations using the Mercator tool (Lohse et al., 2014).

Scoring Lines of Evidence (LOE)
A scoring system was developed in order to quantify the Lines Of Evidence (LOE) linking each gene to lignin-related genes/phenotypes. The LOE scores quantify the number of lines linking each gene to lignin-related genes and phenotypes across the different network data layers. The process of defining and calculating LOE scores is described below.

Selection of Lignin-related Genes
Lignin building blocks (monolignols) are derived from phenylalanine in the phenylpropanoid and monolignol pathways, and phenylalanine itself is produced from the shikimate pathway (Vanholme et al., 2010). To compile a list of P. trichocarpa genes which are related to the biosynthesis of lignin, P. trichocarpa genes were assigned MapMan annotations using the Mercator tool (Lohse et al., 2014). Genes in the Shikimate (MapMan bins 13.1.6.1, 13.1.6.3 and 13.1.6.4), Phenylpropanoid (MapMan bin 16.2) and Lignin/Lignan (MapMan bin 16.2.1) pathways were then selected. A list of these lignin-related genes and their MapMan annotations can be seen in Supplementary Table S1.

Selection of Lignin-related Phenotypes
Lignin-related pyMBMS peaks, as described in ), Davis et al. (2006 and Muchero et al. (2015) were identified among the pyMBMS GWAS hits, and are shown in Supplementary Table S2. Lignin-related metabolites and metabolites in the lignin pathway were also identified among the metabolomics GWAS hits, a list of which can be seen in Supplementary Table S3. For partially identified metabolites, additional RT and mz information can be seen in Supplementary Table S3.

Extraction of Lignin-Related Subnetworks
Let L G , L M and L P represent our sets of lignin-related genes, metabolites and pyMBMS peaks, respectively (Supplementary Tables S1, S2 and S3). A network can be defined as N = (V, E) where V is the set of nodes and E is the set of edges connecting nodes in V. In particular, let the co-expression network be represented by N coex = (V coex , E coex ), the co-methylation network by N cometh = (V cometh , E cometh ) and the SNP correlation network by N snp = (V snp , E snp ). The GWAS networks can be represented as bipartite networks N = (U, V, E) where U is the set of phenotype nodes, V is the set of gene nodes, and E is the set of edges, with each edge e ij connecting node i 2 U with node j 2 V. Let the metabolomics GWAS network be represented by N metab = (U metab , V metab , E metab ) and the pyMBMS GWAS network by N pymbms = (U pymbms , V pymbms , E pymbms ). We construct the guilt by association subnetworks of genes connected to lignin-related genes/phenotypes as follows: N L coex is the subnetwork of N coex including the lignin related genes l 2 L G and their direct neighbors: N L cometh is the subnetwork of N cometh including the lignin related genes l 2 L G and their direct neighbors: N L snp is the subnetwork of N snp including the lignin related genes l 2 L G and their direct neighbors: N L metab is the subnetwork of N metab including the lignin related metabolites m 2 L M and their direct neighboring genes: N L pymbms is the subnetwork of N pymbms including the lignin related pyMBMS peaks p 2 L P and their direct neighboring genes:

Calculating LOE Scores
For a given gene g, the degree of that gene D(g) indicates the number of connections that the gene has in a given network. Let D coex (g), D cometh (g), D snp (g) , D metab (g) , D pymbms (g) represent the degrees of gene g in the lignin subnetworks N L coex , N L cometh , N L snp , N L metab and N L pymbms , respectively. The LOE breadth score LOE breadth (g) is then defined as The LOE breadth (g) score indicates the number of different types of lines of evidence that exist linking gene g to lignin-related genes/phenotypes.
The LOE depth score LOE depth (g) represents the total number of lines of evidence exist linking gene g to ligninrelated genes/phenotypes, and is defined as The GWAS LOE score LOE gwas (g) indicates the number of lignin-related phenotypes (metabolomic or pyMBMS) that a gene is connected to, and is defined as: Distributions of the LOE scores can be seen in Supplementary Figure S2. Cytoscape version 3.4.0 (Shannon et al., 2003) was used for network visualization.

Packages Used
Networks were visualized using Cytoscape version 3.4.0 (Shannon et al., 2003). Expression, methylation, SNP correlation and GWAS diagrams were created using R (R Core Team, 2017) and various R libraries (de Vries and Ripley, 2016;Auguie, 2017;Wickham, 2007;Arnold, 2017;Wickham, 2009). Data parsing, wrappers and LOE score calculation was performed using Perl. Diagrams were edited to overlay certain text using Microsoft PowerPoint.

Layered Networks, LOE Scores and New Potential Targets
This study involved the construction of a set of networks providing different layers of information about the relationships between genes, and between genes and phenotypes, and the development of a Lines Of Evidence scoring system (LOE scores) which integrate the information in the different network layers and quantify the number of lines of evidence connecting genes to lignin-related genes/phenotypes. The GWAS network layers provide information as to which genes are potentially involved in certain functions because they contain genomic variants significantly associated with measured phenotypes. The co-methylation and co-expression networks provide information on different layers of regulatory mechanisms within the cell. The SNP correlation network provides information about possible co-evolution relationships between genes, through correlated variants across a population.
Marking known genes and phenotypes involved in lignin biosynthesis in these networks allowed for the calculation of a set of LOE (Lines Of Evidence) scores for each gene, indicating the strength of the evidence linking each gene to lignin-related functions. The breadth LOE score indicates the number of types of lines of evidence (number of layers) which connect the gene to lignin-related genes/phenotypes, whereas the depth LOE score indicates the total number of lignin-related genes/phenotypes the gene is associated with. Individual layer LOE scores (e.g. co-expression LOE score or GWAS LOE score) indicate the number of lignin-related associations the gene has within that layer.
To select the top set of potential new candidate genes involved in lignin biosynthesis, genes which showed a number of different lines of evidence connecting them to lignin-related functions were identified by selecting genes with a LOE breadth score >= 3. Since the GWAS networks provide the highest resolution, most direct connections to lignin-related functions, it was also required that our potential new targets had a GWAS score >= 1. This provides a set of 375 new candidate genes potentially involved in lignin biosynthesis, identified through multiple lines of evidence (Supplementary Table S4). This set of Potential New Target genes will be referred to as set of PNTs. A selection of these potential new candidates below and their annotations, derived from their Arabidopsis best hits, will be discussed below.

Agamous-like Genes
Genes in the AGAMOUS-LIKE gene family are MADS-box transcription factors, many of which which have been found to play important roles in floral development (Yoo et al., 2006;Fernandez et al., 2014;Yu et al., 2017Yu et al., , 2004Yu et al., , 2002Lee et al., 2000). Three potential AGAMOUS-LIKE (AGL) genes are found in the set of PNTs, in particular, a homolog of Arabidopsis AGL8 (AT5G60910, also known as FRUITFUL), a homolog of Arabidopsis AGL12 (AT1G71692), and a homolog of Arabidopsis AGL24 (AT4G24540) and AGL22 (AT2G22540).
The first potential AGL gene in our set of PNTs is Potri.012G062300, with a breadth score of 3 and a GWAS score of 2 ( Figure 3A), whose best Arabidopsis thaliana hit is AGL8 (AT5G60910). It has GWAS associations with a lignin-related metabolite (quinic acid) and a lignin pyMBMS peak (syringol) ( Figure 3C, Table 1) and is co-methylated with three lignin-related genes ( Figure 3B, Table 3). There is thus strong evidence for the involvement of P. trichocarpa AGL8 in the regulation of lignin-related functions. There is literature evidence that supports the hypothesis of AGL8's involvement in the regulation of lignin biosynthesis. A patent exists for the use of AGL8 expression in reducing the lignin content of plants (Yanofsky et al., 2004). The role of AGL8 (FUL) was described in Ferrándiz et al. (2000), in which they investigated the differences in lignin deposition in transgenic plants in which AGL8 is constitutively expressed, loss-of-function AGL8 mutants and wild-type Arabidopsis plants (Ferrándiz et al., 2000). In wild-type plants, a single layer of valve cells were lignified. In loss-of-function AGL8 mutants, all valve mesophyl cell layers were lignified, while in the transgenic plants, constitutive expression of AGL8 resulted in loss of lignified cells (Ferrándiz et al., 2000). This study thus showed the involvement of AGL8 in fruit lignification during fruit development.
There is evidence of other AGAMOUS-LIKE genes affecting lignin content. A study by Gimenez et al. (2010) investigated TALG1, an AGAMOUS-LIKE gene in tomato, and found that TAGL1 RNAi-silenced fruits showed increased lignin content, and increased expression levels of lignin biosynthesis genes (Giménez et al., 2010). A recent study by Cosio et al. (2017) showed that AGL15 in Arabidopsis is also involved in regulating lignin-related functions, in that AGL15 binds to the promotor of peroxidase PRX17, and regulates its expression (Cosio et al., 2017). In addition, PRX17 loss of function mutants had reduced lignin content (Cosio et al., 2017).

MYB Transcription Factors
MYB proteins contain the conserved MYB DNA-binding domain, and usually function as transcription factors. R2R3-MYBs have been found to regulate various functions, including flavonol biosynthesis, anthocyanin biosynthesis, lignin biosynthesis, cell fate and developmental functions (Dubos et al., 2010). The set of PNTs contains several genes which are homologs of Arabidopsis MYB transcription factors, including homologs of Arabidopsis MYB66/MYB3, MYB46, MYB36 and MYB111.
There is already existing literature evidence for how some of these MYBs affect lignin biosynthesis. Liu et al. (2015) reviews the involvement of MYB transcription factors in the regulation of phenylpropanoid metabolism. MYB3 in Arabidopsis is known to repress phenylpropanoid biosynthesis (Zhou et al., 2017a), and a P. trichocarpa homolog of MYB3 is found in our set of potential new targets. Another potential new target is the P. trichocarpa homolog of Arabidopsis MYB36 (Potri.006G170800) which is connected to lignin-related functions through multiple lines of evidence ( Figure 6). In Arabidopsis, MYB36 has been found to regulate the local deposition of lignin during casparian strip formation, and myb36 mutants exhibit incorrectly localized lignin deposition (Kamiya et al., 2015).
A MYB transcription factor in the set of PNTs which has, to our knowledge, not yet been directly associated with lignin biosynthesis is MYB111 ( Figure 7A-D). However, with existing literature evidence, one can hypothesize that MYB111 can alter lignin content by redirecting carbon flux from flavonoids to monolignols. There is evidence that MYB111 is involved in crosstalk between lignin and flavonoid pathways. Monolignols and flavonoids are both derived from phenylalanine through the phenylpropanoid pathway (Liu et al., 2015). There is crosstalk between the signalling pathways of ultraviolet-B (UV-B) stress and biotic stress pathways (Schenke et al., 2011). In the study by Schenke et al. (2011), it was shown that under UV-B light stress, Arabidopsis plants produce flavonols as a UV protectant. Also, simultaniously applying the bacterial elicitor flg22, which simulates biotic stress, repressed flavonol biosynthesis genes and induced production of defense compounds including camalexin and scopoletin, as well as lignin, which provides a physical barrier preventing pathogens' entry (Schenke et al., 2011). This crosstalk involved regulation by MYB12 and MYB4 (Schenke et al., 2011). This study by Schenke et al. (2011) was performed using cell cultures. A second study (Zhou et al., 2017b) used Arabidopsis seedlings, and found that MYB111 may be involved in the crosstalk in planta (Zhou et al., 2017b). The multiple lines of evidence connecting the P. trichocarpa homolog of Arabidopsis MYB111 (Potri.010G141000) to lignin related functions, in combination with the above literature evidence suggests the involvement this gene in the regulation of lignin biosynthesis by redirecting carbon flux from flavonol biosynthesis to monolignol biosynthesis, as part of the crosstalk between UV-B protection and biotic stress signalling pathways.

Chloroplast Signal Recognition Particle
Potri.016G078600, a homolog of the Arabidopsis chloropast signal recognition particle cpSRP54 occurs in the set of PNTs (Figure 8). It has a GWAS LOE score of 3, through GWAS associations with salicyl-coumaroyl-glucoside, a caffeoyl conjugate and a feruloyl conjugate ( Figure 8B, Table 1, Supplementary Table S4). It also has a breadth score of 4, indicating that it is linked to lignin-related genes/phenotypes though 4 different types of associations ( Figure 8). CpSRP54 gene has been found to regulate carotenoid accumulation in Arabidopsis (Yu et al., 2012). CpSRP54 and cpSRP43 form a "transit complex" along with a light-harvesting chlorophyll a/b-binding protein (LHCP) family member to transport it to the thylakoid membrane (Groves et al., 2001;Schünemann, 2004). A study in Arabidopsis found that cpSRP43 mutants had reduced lignin content (Klenell et al., 2005). Since CpSRP54 regulates carotenoid accumulation, and cpSRP43 appears to affect lignin content, it is possible that chloroplast signal recognition particles affect lignin and carotenoid content through flux through the phenylpropanoid pathway, the common origin of both of these compounds. In fact, a gene mutation cue1 which causes LHCP underexpression also results in reduced aromatic amino acid biosynthesis (Streatfield et al., 1999). These multiple lines of evidence, combined with the above cited literature suggests that chloroplast signal recognition particles in P. trichocarpa could potentially influence lignin content.

Concluding Remarks
This study made use of high-resolution GWAS data, combined with co-expression, co-methylation and SNP correlation networks in a multi-omic, data layering approach which has allowed the identification of new potential target genes involved in lignin biosynthesis/regulation. Various literature evidence supports the involvement of many of these new target genes in lignin biosynthesis/regulation, and these are suggested for future validation for involvement in the regulation of lignin biosynthesis. The data layering technique and LOE scoring system developed can be applied to other omic data types to assist in the generation of new hypotheses surrounding various functions of interest.

Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Author Contributions DW calculated methylation TPM values, constructed the networks, developed the scoring technique, performed the data layering and scoring analysis and interpreted the results, PJ performed the outlier analysis and GWAS, MS mapped gene expression atlas reads and calculated gene expression TPM values, SD, GT and WM lead the effort on constructing the GWAS population, TJT led the leaf sample collection for GCMS-based metabolomic analyses, identified the peaks, and summarized the metabolomics data, MZM collected the leaf samples and manually extracted the metabolite data, NZ conducted leaf sample preparation, extracted and derivatized and analyzed the metabolites by GCMS, PR aided in peak extraction, JS and AS generated the gene expression atlas data, SD and DMS generated the SNP calls, RS generated the pyMBMS data, DJ conceived of and supervised the project, generated MapMan annotations and edited the manuscript, DW, PJ, SD, DMS, RS, TJT, JS and AS wrote the manuscript.