Identification of structural variation in mouse genomes

Structural variation is variation in the structure of DNA regions that affects DNA sequence length and/or orientation. It generally includes deletions, insertions, copy-number gains, inversions, and transposable elements. Traditionally, the identification of structural variation in genomes has been challenging. However, with recent advances in high-throughput DNA sequencing and paired-end mapping (PEM) methods, the ability to identify structural variants and their association with human diseases has improved considerably. In this review, we describe our current knowledge of structural variation in the mouse, one of the prime model systems for studying human diseases and mammalian biology. We further present the evolutionary implications of structural variation on transposable elements. We conclude with future directions for the study of structural variation in mouse genomes that will increase our understanding of the molecular architecture and functional consequences of structural variation.


INTRODUCTION
Structural variation (SV) is generally defined as rearrangements of DNA regions affecting DNA sequence length and/or orientation in the genome of one species, and includes deletions, insertions, copy-number gains, inversions, and transposable elements. Structural variation has long been known to be pathogenic, resulting in rare genomic disorders such as the well-known Charcot-Marie-Tooth disease (Lupski et al., 1991; reviewed in Lupski, 1998, 2009), or more recently Koolen-de Vries and 16p11.2 micro-deletion syndromes (Walters et al., 2010; Jacquemont et al., 2011; Koolen et al., 2012). Population-based SV has also begun to emerge as an important source of genomic variation contributing to common human diseases (Sebat et al., 2007; Hollox et al., 2008; Stefansson et al., 2008; Conrad et al., 2010; Pinto et al., 2010; Girirajan et al., 2011; Jarick et al., 2011; Malhotra et al., 2011; Elia et al., 2012; Helbig et al., 2014; Ramos-Quiroga et al., 2014), cancer development (Diskin et al., 2009; Stephens et al., 2011; Northcott et al., 2012; Rausch et al., 2012a; Malhotra et al., 2013; Ni et al., 2013), neuronal mosaicism in the human brain (McConnell et al., 2013), and genomic evolution (Perry et al., 2007; Itsara et al., 2010; Sudmant et al., 2013). However, characterizing the sequence flanking the breakpoints of structural variants (which we call breakpoint features), including for example micro-deletions and micro-insertions of 1 base pair (bp) up to several hundred bp, has remained challenging. Such characterization is important not only for the accurate identification of structural variants, but also for interpreting their function and predicting the mechanisms by which they arose (Yalcin et al., 2012a).
SVs have traditionally been detected by array comparative genomic hybridization (aCGH), a method for analyzing copy number variation by comparing fluorescence between two differentially labeled DNA samples (DNA of a test sample compared to a reference sample). Using aCGH, the extent of genome-wide SV in the mouse was first demonstrated in 2007 with the detection of 80 high-confidence copy number variants in 20 inbred strains of mice (Graubert et al., 2007), followed by other studies, summarized in Table 1 (Cutler et al., 2007; Akagi et al., 2008; Cahan et al., 2009; Henrichsen et al., 2009; Agam et al., 2010; Quinlan et al., 2010). These studies, however, have proven difficult to interpret due to their poor reproducibility (Agam et al., 2010) and inability to detect certain types of structural variants. For example, inversions and insertions of novel sequence are invisible to aCGH: inversions do not affect copy number, which is what aCGH measures, and novel sequence insertions have no copy in the reference genome.
With the emergence of next-generation sequencing (NGS) (Mardis, 2011), the Mouse Genomes Project (http://www.sanger.ac.uk/resources/mouse/genomes/) was able to sequence the entire genomes of 18 classical laboratory strains and wild-derived lines of inbred strains of mice, producing detailed maps of SV and retro-transposon elements in each mouse strain, relative to the reference mouse strain C57BL/6J (Nellaker et al., 2012; Wong et al., 2012; Simon et al., 2013). This resulted in the detection of a far larger number of structural variants than previously observed using aCGH, totaling 710,000 novel structural variants affecting 1% of the mouse genome and encompassing 10 times more total nucleotides than single nucleotide polymorphisms. As a comparison, we had identified 121 deletions in a previous aCGH study of SV in DBA/2J, with SV length ranging between a minimum of 5 kilobases (Kb) and a maximum of 260 Kb (median size 48 Kb) (Agam et al., 2010), whereas in a more recent NGS study of SV we found far more deletions (a total of 16,318) in that same strain, of much smaller size (minimum size of 100 bp, maximum of 10 Kb, median of 400 bp) (Figure 1).
Such genome-wide abundance of structural variation has raised several important questions: what is the molecular architecture of these variants, what are the mechanisms of SV formation, and how do they impact gene function? In this review, we address these questions and reassess what we have learnt so far about the nature, origins, and role of structural variation from current studies in the mouse. Finally, we discuss the promise of novel methods that are likely to facilitate access to repeat-rich regions and the assembly of complex genomic regions, in order to assess the origins and functional impact of structural variation in the most challenging regions of the mouse genome.

DETECTION OF STRUCTURAL VARIANTS USING PAIRED-END MAPPING METHODS
While most deep-sequencing applications focus on the identification of single-nucleotide polymorphisms (SNPs) or small insertion-deletion polymorphisms, structural variation can also be identified from the same data. However, while the basic types of structural variants (deletions, insertions, inversions, and duplications) can be identified using a combination of computational methods, the detection of complex rearrangements remains challenging. We define complex rearrangements as those structural variants consisting of a combination of basic types that directly abut each other or that are nested within each other (e.g., an inversion directly flanked by insertions, or a deletion nested within a tandem duplication). Typically, genomic DNA of a test genome is sheared into fragments of 300-500 bp to generate a sequencing library. Short paired reads (50-250 bp) from either extremity of the fragment (called paired-end reads) are sequenced and mapped to the reference genome. Structural variants are then called based on the orientation, distance, and depth of the mapped paired reads (also reviewed in Medvedev et al., 2009; Alkan et al., 2011). Depending on the size and type of structural variant, these methods exploit read pairs (Korbel et al., 2007; Chen et al., 2009), split-reads (Albers et al., 2011), single end clusters, and read depth (Simpson et al., 2010).
The most widely used methods are read pair and read depth methods. Read pair based methods analyze the distance and orientation of paired reads to infer deletion, insertion, inversion, and tandem duplication events, as shown in Figure 2. When the paired-end reads map in the correct orientation ("+/−" is normal) but at a distance that is significantly larger than the average fragment length, this suggests a deletion, whereas if the distance is smaller than the fragment length, it suggests an insertion. When the two sequenced ends map back to the reference genome in the wrong orientation ("+/+" and "−/−"), and at a distance that is significantly larger than the size of the fragment itself, this indicates an inversion. Finally, when paired-end reads map with orientation "−/+" at a large distance, this suggests a tandem duplication. In single-end cluster analysis, one of the paired-end reads maps to the reference while its mate maps to the inserted sequence (de novo sequence or repeat element insertion). Read depth methods take advantage of the high coverage of next-generation sequencing to infer an increase or decrease of reads at a locus. When the coverage is higher than the expected genome coverage, a duplication is inferred, whereas when it is lower or null, a deletion is inferred. Once the structural variant is detected using these analyses, breakpoint refinement is typically achieved using local sequence assembly.

FIGURE 2 | Read mapping patterns used by computational methods to detect basic structural variation from NGS data. This figure shows the principle of SV identification using (i) read-pair analysis, (ii) split-read mapping, (iii) single end cluster analysis, and (iv) read depth analysis. Deletions and insertions are represented using red rectangles, and inversions and duplications using light blue arrows. Reads are represented using solid dark blue arrows. The first step consists in sequencing a test genome. Typically, the genomic test DNA is fragmented into chunks of 300-500 bp. Then, reads of 50-250 bp are sequenced from either side of each fragment (we call these paired-end reads). The second step consists in mapping these paired-end reads to the mouse reference genome. A rightward facing arrow denotes a positive strand alignment, and a leftward facing arrow a negative strand alignment. (i) In the read-pair analysis approach, when the paired-end reads map in the correct orientation ("+/−" is normal) but at a distance that is significantly larger than the average fragment length, this suggests a deletion. For a fragment length of 500 bp, a mapped distance of 1100 bp suggests a deletion of 600 bp, whereas a distance smaller than the fragment length, for example 200 bp, suggests an insertion of 300 bp. When the two sequenced ends of two fragments map back to the reference genome in the wrong orientation ("+/+" and "−/−"), and at a distance that is significantly larger than the size of the fragment itself, this indicates an inversion. Finally, when paired-end reads map with orientation "−/+" at a large distance, this suggests a tandem duplication. (ii) In the split-read approach, one of the paired-end reads maps to the reference genome while its mate contains the structural variant, typically a deletion or an insertion of small length. (iii) In the single-end cluster analysis, one of the paired-end reads maps to the reference while its mate maps to the inserted sequence, which can be either de novo sequence or a repeat element such as a LINE, SINE, or ERV. (iv) Finally, the read depth approach takes advantage of the high coverage of next-generation sequencing, which makes it possible to detect copy number changes. Of note, the coverage drops at insertion and inversion breakpoints, which, when combined with paired-end read analysis, makes the SV call highly reliable.
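The read-pair rules described above can be illustrated with a short Python sketch. It is not taken from any of the published callers: the `ReadPair` record, the function names, the 500 bp mean fragment length, and the 100 bp tolerance are all assumptions chosen for illustration, and a real caller would estimate the fragment size distribution from the library and require clusters of supporting pairs rather than a single pair.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One mapped read pair (hypothetical record, for illustration only)."""
    left_strand: str   # strand of the leftmost-mapping read: "+" or "-" (ASCII)
    right_strand: str  # strand of the rightmost-mapping read
    span: int          # distance between the outer ends on the reference, in bp


def classify_pair(pair, mean_fragment=500, tolerance=100):
    """Apply the orientation and distance rules to a single read pair."""
    orientation = pair.left_strand + "/" + pair.right_strand
    far = pair.span > mean_fragment + tolerance
    if orientation in ("+/+", "-/-"):
        # Wrong orientation at a large distance suggests an inversion.
        return "inversion" if far else "unclassified"
    if orientation == "-/+":
        # Everted pairs at a large distance suggest a tandem duplication.
        return "tandem_duplication" if far else "unclassified"
    # Normal "+/-" orientation: only the mapped distance is informative.
    if far:
        return "deletion"
    if pair.span < mean_fragment - tolerance:
        return "insertion"
    return "concordant"


def implied_indel_size(pair, mean_fragment=500):
    """Size of the deletion or insertion implied by a "+/-" pair, in bp."""
    return abs(pair.span - mean_fragment)
```

With these assumed defaults, a "+/−" pair spanning 1100 bp is classified as a deletion with an implied size of 600 bp, and a pair spanning 200 bp as an insertion of 300 bp, matching the worked figures in the Figure 2 legend.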
Remarkably, in the past several years many algorithms have been developed to discover basic structural variation in paired-end next-generation sequencing data. There are over 50 programs to date (Table 2); however, none is yet considered a community standard, and only a handful combine multiple methods for the detection of structural variation (Medvedev et al., 2010; Wong et al., 2010; Rausch et al., 2012b; Sindi et al., 2012; Hart et al., 2013). Accurate structural variant calling depends on many factors such as sequencing library biases, read length, uniformity of sequencing coverage, and proximity of SVs to repeat sequences. Some of the most frequent sequencing library biases that can detrimentally affect SV detection are high numbers of PCR duplicates, non-normal fragment size distributions, and uneven representation of the genome at varying levels of GC content. As a result, false negative rates of most studies remain high (20-30%) compared to SNP calling (<5%). False positive rates are also high and are often caused by misalignment of the short reads and sometimes by reference genome assembly errors.
There is a growing awareness of complex structural variants (Berger et al., 2011; Stephens et al., 2011; Quinlan and Hall, 2012; Yalcin et al., 2012a; Malhotra et al., 2013); however, their genome-wide detection is much more challenging and less intuitive, as they often generate ambiguous paired-end mapping patterns. Complex structural variants are very often completely or partially missed, or incorrectly classified, because a single method on its own might not be sufficient to capture the whole complexity of the structural variant (e.g., an apparent deletion and inversion may simultaneously be part of a tandem duplication region). Thus, it is important to combine multiple methods, something that the community has begun to do. Sindi and colleagues, for example, used an algorithm combining both read pair and read depth signals into a probabilistic model, implemented in the software GASV-PRO, that significantly improves detection specificity (Sindi et al., 2012). Rausch and colleagues have developed DELLY, which integrates short insert paired-ends, long-range mate-pairs, and split-read alignments to accurately delineate genomic rearrangements at single-nucleotide resolution (Rausch et al., 2012b). In our studies, we used SVMerge (Wong et al., 2010), a pipeline that integrates structural variation calls from five existing software tools and validates breakpoints using local de novo assembly. Unbiased exploration of next-generation sequencing data is laborious, but it is essential for deciphering the true complex nature of structural variants. Toward this goal, we visualized read mappings to the whole of mouse chromosome 19 as well as a random set of regions on other chromosomes using the short-read visualization tool LookSeq (Manske and Kwiatkowski, 2009) in 17 inbred strains of laboratory mice (Yalcin et al., 2012a) as well as in C57BL/6J mice (Simon et al., 2013).
We were able to recognize classical paired-end mapping (PEM) patterns, but unexpectedly we were also able to detect a number of other patterns, of greater diversity and complexity, that would have been missed or miscalled by existing computational SV detection methods. When two (or more) structural variants co-localize at a locus in the genome (right next to each other), or when one or more structural variants are embedded within another one of larger size (nested), this creates confusing paired-end mapping patterns and incoherent read depth. Figure 3 highlights some complex rearrangements that cause conflicting signals during automatic detection. For example, a deletion directly flanked by a large insertion is characterized by null read depth as expected, but paired reads supporting the deletion are missing because of the insertion. Nevertheless, we showed that it is possible to train genome-wide computational analysis to detect most of these atypical patterns by integrating multiple detection methods (Wong et al., 2010).
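One simple way to integrate calls from several detection methods is to keep only those supported by more than one caller, matching calls by reciprocal overlap. The Python sketch below illustrates that idea only; it is not the SVMerge algorithm, and the `Call` record, the 50% overlap threshold, and the two-caller minimum are assumptions made for the example. Calls are assumed to have `end` greater than `start`.

```python
from collections import namedtuple

# Hypothetical call record: coordinates in bp, one row per caller per variant.
Call = namedtuple("Call", "chrom start end sv_type caller")


def reciprocal_overlap(a, b):
    """Smaller of the two fractions of each call covered by their overlap."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def merge_calls(calls, min_overlap=0.5, min_callers=2):
    """Cluster same-type calls and keep clusters seen by several callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.sv_type, c.start)):
        last = clusters[-1] if clusters else None
        # Greedy clustering: compare only against the first call of the
        # most recent cluster, which is enough for a sketch.
        if (last
                and last[0].chrom == call.chrom
                and last[0].sv_type == call.sv_type
                and reciprocal_overlap(last[0], call) >= min_overlap):
            last.append(call)
        else:
            clusters.append([call])
    merged = []
    for cluster in clusters:
        callers = sorted({c.caller for c in cluster})
        if len(callers) >= min_callers:
            merged.append((cluster[0].chrom,
                           min(c.start for c in cluster),
                           max(c.end for c in cluster),
                           cluster[0].sv_type,
                           callers))
    return merged
```

A consensus filter of this kind trades sensitivity for specificity: a complex rearrangement that only one signal can see, such as the deletion flanked by an insertion described above, would be dropped, which is one reason visual inspection and local assembly remain useful.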
In conclusion, to study the whole diversity and complexity of structural variants, future algorithms need to integrate multiple signals and sequence analysis features based on what we have learnt so far about the architecture of structural variants, while visual approaches will continue to increase our understanding of complex forms of structural variants, such as inversions and translocations, that remain to be fully resolved. It is important to gain better sensitivity and specificity in the identification of structural variants, especially those with complex architecture, in order to study accurately their impact on disease-related phenomena such as tumor heterogeneity (Russnes et al., 2011), and on the evolution of genomes.

FIGURE 3 | For each complex rearrangement, we provide: (1) a drawing of the paired-end mapping (PEM) pattern, (2) an illustration using the short read visualization tool LookSeq (Manske and Kwiatkowski, 2009), and (3) PCR validation. We draw paired-end reads (black arrows) and how they map to the reference genome (dashed gray lines). Green arrows represent primer pairs used for PCR validation. PCR amplification was carried out across eight inbred strains of mice (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J, and LP/J), which are the parental strains of the Heterogeneous Stock population (Valdar et al., 2006). Hyperladder II is the size marker. Genomic coordinates refer to the mm9 mouse assembly.

FUNCTIONAL IMPACT OF STRUCTURAL VARIANTS
The functional impact of structural variants is still controversial in the literature. On one hand, some studies showed that SNPs are more likely to contribute to individual phenotypic differences than structural variants (Conrad et al., 2010;Keane et al., 2011); on the other hand, several studies have estimated the impact of structural variation using its effect on gene expression, and these estimates ranged from 10 to 74% (Stranger et al., 2007;Cahan et al., 2009;Henrichsen et al., 2009;Yalcin et al., 2011). It has also been reported that structural variation can influence gene expression both spatially and temporally (Chaignat et al., 2011), including genes outside of SV margins (Henrichsen et al., 2009), and can do so through chromatin conformation changes (Gheldof et al., 2013). The influence of structural variation on gene expression is specifically reviewed in Harewood et al. (2012).
The phenotypic consequences of structural variation can be interpreted using different methods. In this review, we describe three methods, with specific emphasis on genome-wide association studies. Genome-wide association studies (GWASs) identify genomic loci associated with individual differences (these regions are called Quantitative Trait Loci, QTLs) using large populations of outbred mice, while taking advantage of recombinants that have naturally accumulated during breeding (Valdar et al., 2006; Yalcin et al., 2010). When combined with the availability of full genome sequences, GWASs in outbred mice are providing significant advances in the understanding of the genotype-phenotype relationship (reviewed in Yalcin and Flint, 2012), especially the impact of structural variants on phenotypic differences.
To test the causality of a structural variant within a QTL region, Richard Mott and colleagues have developed a statistical test (called merge) to distinguish genomic variants likely to be functional from those less likely to be functional (Yalcin et al., 2005). Unexpectedly, very few SVs (only 12) out of about 100,000 SVs present in classical inbred strains of mice overlapped with a gene within QTL regions identified using an outbred population of mice known as the Heterogeneous Stock mice (Talbot et al., 1999; Valdar et al., 2006; Yalcin et al., 2011). Table 3 lists these structural variants associated with quantitative traits in outbred mice. These were amongst the larger effect size QTLs. Although the number of SVs causing phenotypic differences is small, it is expected that these SVs will provide significant insights into gene function. We highlight two examples here. Figure 4 shows a deletion of 600 bp lying within the promoter region of H2-Ea (histocompatibility 2, class II antigen E alpha) that affects the CD4+/CD8+ ratio in T lymphocytes. This locus was fine-mapped to single-gene resolution using a population of commercial outbred mice (CFW). Causality was confirmed using mouse transgenic data with and without the deletion. The CD4+/CD8+ ratio was significantly increased in transgene positive mice with the deletion when compared to transgene negative mice (without the deletion), both in the spleen and in the thymus. Figure 5 illustrates a transposable element, an intracisternal A-particle (IAP) element of 6400 bp, which has inserted in the promoter region of Eps15 (Epidermal Growth Factor Receptor Pathway Substrate 15). This variant modulates home cage activity in outbred mice. Expression in the brain is decreased in mice with the IAP element. Data from the mouse knockout of Eps15 also show a significant decrease of home cage activity when compared to matched wildtype mice.
A second way to assess the phenotypic consequences of structural variation is to undertake a comprehensive phenotypic comparison between two closely related sub-strains of mice, and examine the relationship between structural variants and phenotypic changes between these strains. In a recent study combining phenotypic and genomic analyses of the C57BL/6J and C57BL/6N sub-strains, 15 structural variants differentiating C57BL/6J and C57BL/6N were identified in genic regions (Table 4). Three of these are linked to Mammalian Phenotype (MP) terms that coincide with the phenotype differentiating C57BL/6J and C57BL/6N. The first is a LINE insertion found in an intron of Chl1 (Cell adhesion molecule with homology to L1CAM). C57BL/6N mice displayed abnormal spatial memory in the Morris water maze test compared to C57BL/6J mice. Interestingly, knockout mice of Chl1 also show abnormal spatial working memory. The second is an intronic ERV insertion in Rptor (Regulatory associated protein of MTOR, complex 1) in C57BL/6J mice. These mice were characterized by decreased fat mass and blood glucose. Interestingly, knockout mice of Rptor also showed decreased fat mass and blood glucose, amongst other metabolic phenotypes. The third is the well-known deletion at the Nnt (Nicotinamide nucleotide transhydrogenase) locus (Freeman et al., 2006) in C57BL/6J, which is associated with significantly impaired glucose tolerance.
A third way is to search for structural variants that affect a coding region of a gene, potentially creating a null or hypomorphic allele. We found about 50 structural variants encompassing a coding segment (reviewed in Yalcin et al., 2012b), affecting eleven already known genes (Amd2, Defb8, Fv1, Skint4, Skint3, Skint9, Soat1, Tas2r103, Tas2r120, Trim5, and Trim12a) (Best et al., 1996; Persson et al., 1999; Bauer et al., 2001; Nelson et al., 2005; Boyden et al., 2008; Tareen et al., 2009; Wu et al., 2010) and, in some cases, giving rise to specific phenotypes in mice. For example, a deletion of 1342 bp affecting the fourth coding exon of Fv1 (Friend-virus-susceptibility-1) is associated with retrovirus replication (Best et al., 1996; Yalcin et al., 2011), and a deletion of 6817 bp in the first exon of Soat1 (Sterol O-acyltransferase 1) results in hair interior defects (Yalcin et al., 2011).

Frontiers in Genetics | Evolutionary and Population Genetics
July 2014 | Volume 5 | Article 192

Human GWAS have shown that common SNPs (minor allele frequency >5%) explain only some fraction of the heritability, suggesting that SVs might also be contributing to individual phenotypic variation (Manolio et al., 2009). Results presented in this review suggest that, given the abundance of structural variants in mouse genomes, SVs make less of a contribution to individual phenotypic variation than SNPs. However, when they do, structural variants have a large effect size on the phenotype, providing a unique opportunity to investigate the relationship between structural variants and phenotypic differences, at a molecular as well as mechanistic level.

EVOLUTIONARY IMPLICATIONS AND TRANSPOSABLE ELEMENTS
Transposable elements (TEs) have been highly influential in shaping the structure and evolution of mammalian genomes, as exemplified by TE-derived sequence contributing between 38 and 69% of genomic sequence (Buzdin, 2004; Cordaux and Batzer, 2009; Shapiro, 2010; de Koning et al., 2011). TE insertions can also influence the transcription, translation, or function of genes. Functional effects of TE insertions include the regulation of transcription, by acting as alternative promoters, enhancer elements, or transcriptional silencers, or via the generation of antisense transcripts. TEs are classified on the basis of their transposition mechanism (Goodier and Kazazian, 2008). Class I retrotransposons propagate in the host genome through an intermediate RNA step, requiring a reverse transcriptase to convert the RNA back to DNA before insertion into the genome. Class II DNA transposons do not have an RNA intermediate, and translocate with the aid of transposases and DNA polymerase. The overwhelming majority, over 96%, of TEs in the mouse genome are of the retrotransposon type. These are further classified into three distinct classes: short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), and the endogenous retrovirus (ERV) superfamily (Stocking and Kozak, 2008). The ERV elements are ancient remnants of exogenous virus infections, consisting of internal sequence that encodes viral genes, flanked by long terminal repeats (LTRs). Because they can alter pre-existing gene function, TEs provide a potential source of variants detrimental to the host.
Previous studies examined two ERV families in eight mouse strains (IAP or ETn/MusD elements in C57BL/6J, A/J, DBA/2J, SPRET/EiJ, CAST/EiJ, MOLF/EiJ, WSB/EiJ, and 129X1/SvJ) (van de Lagemaat et al., 2006; Quinlan et al., 2010; Li et al., 2012), with one study in particular focusing on intronic insertions (Zhang et al., 2011a) and another exploring LINE variation in four strains (A/J, DBA/2J, 129S1/SvImJ, and 129X1/SvJ) (Akagi et al., 2008). However, the largest genome-wide survey of TE polymorphism in multiple laboratory mouse strains was carried out as part of the Mouse Genomes Project (Nellaker et al., 2012). Two types of polymorphic TE were cataloged: those that are present in the reference genome and absent from one or more other strains, and those that are absent from the reference genome and present in one or more other strains. In total, 103,798 TE variants (TEVs) (28,951 SINEs, 40,074 LINEs, and 34,773 ERVs) were computationally predicted among the 17 sequenced mouse strains in addition to the C57BL/6J reference strain. By placing the TE insertions within a primary phylogeny, it was possible to observe the relative expansions of all the TE families over a period of approximately 2 million years. This primary phylogeny matched the phylogeny expected from the heritage of the mouse strains (Beck et al., 2000). This analysis revealed the historic expansion of ERV families, most notably IAPs, in laboratory strains. Another interesting family is the MuLV family, which arose recently and is thus found in a smaller number of copies that together show a higher fraction of variable elements.
TEV density varies by chromosome, by local nucleotide composition (G + C content) (Filipski et al., 1973; Macaya et al., 1976; Thiery et al., 1976), and by position relative to functional sequence, such as exons. LINE TEVs show a bias for being located in A + T-rich sequence, whilst SINE TEVs tend to reside in G + C-rich sequence (Korenberg and Rykowski, 1988; Boyle et al., 1990). It was also observed that ERV TEVs are more heterogeneous than SINEs or LINEs in their G + C bias, with MuLV TEVs being as enriched in high G + C sequence as SINEs. Interestingly, in contrast to monomorphic TEs, polymorphic TEVs are more unevenly distributed among the chromosomes (having accounted for G + C content) with, for example, chromosome 19 exhibiting a significant enrichment of SINEs and the X chromosome showing a strong deficit of all three TEV classes (Nellaker et al., 2012). The depletion of polymorphic LINEs on the X chromosome was previously seen in a study of four mouse strains (A/J, DBA/2J, 129S1/SvImJ, and 129X1/SvJ) (Akagi et al., 2008). TEVs from all three classes show strong and significant depletions in protein-coding gene exons, implying that such insertions are strongly deleterious (assuming that most TEVs across the noncoding genome are neutral or deleterious). The significant deficits of ERV or LINE TEVs in introns indicate that many were deleterious and thus were selectively purged over these strains' evolutionary history. These observations agree with previous findings that LINE TE insertions are less tolerated within gene-rich sequence (Kvikstad and Makova, 2010).
A strong orientation bias is evident for each of the three TE classes (32.6, 41.7, and 41.6% for ERV, LINE, and SINE TEVs, respectively) (Nellaker et al., 2012). The orientation bias for IAP TEVs was recently reported to be 25.9% for a redundant set of 3317 intronic IAPs (Li et al., 2012). The strong biases for ERVs and, to a lesser extent for LINEs, are consistent with these elements being depleted from introns. The large set of TEVs examined in the genome-wide analysis allowed the authors to infer whether the location of a TEV within a gene structure affects the strength by which it is purified from the population. Orientation bias was significantly stronger for ERV TEVs within middle or last introns, and for SINE TEVs within first introns (Nellaker et al., 2012). A recent study of 161 mouse ERV TEVs identified their strongest intronic orientation bias to be in the close vicinity of exon boundaries (Zhang et al., 2011a).
Indeed, using a stringent statistical re-sampling approach to take into account confounding influences of strain and expression divergence, TEVs were found to be twice as likely to reside in a differentially expressed gene as expected by chance (Nellaker et al., 2012). However, when TEVs are considered together with other forms of potentially co-segregating mutations (SNPs, indels, and other structural variations), only 34 TEVs passed a stringent genome-wide test, and these TEVs contain significantly fewer LINEs than expected under the null hypothesis that all TEV classes have equal effects (Nellaker et al., 2012). While it has been extensively documented in the literature that de novo LINE insertions can cause changes in gene expression, it appears that, in Mus musculus, purifying selection has preferentially purged such variants. Overall, the proportion of expression heritability attributable to TEVs is generally no more than 10%.
To summarize, transposable elements make up almost half of the mouse genome (Gogvadze and Buzdin, 2009) and, importantly, their activity is the most prevalent mechanism for generating large structural variations in laboratory inbred mouse strains. However, as discussed in this review, transposable elements appear to be under strong purifying selection against deleterious insertions, with the majority of insertions observable in present day mouse strains having little phenotypic effect (Nellaker et al., 2012).

DATA ACCESS AND VISUALIZATION
The entire set of structural variation calls across 18 mouse genomes (129P2/OlaHsd, 129S1/SvImJ, 129S5SvEvBrd, A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6NJ, CAST/EiJ, CBA/J, DBA/2J, FVB/NJ, LP/J, NOD/ShiLtJ, NZO/HILtJ, PWK/PhJ, SPRET/EiJ, and WSB/EiJ) has been posted on the following ftp site: ftp://ftp-mouse.sanger.ac.uk/. Data sets described in this review are also available under accession numbers "estd118", "estd185" (Yalcin et al., 2012a), "estd200", and "estd204" (Simon et al., 2013) from the Database of Genomic Variants Archive (DGVa). The project website (http://www.sanger.ac.uk/resources/mouse/genomes/) provides tools to automatically search for structural variants by location, gene, strain, type, and functional impact. A workflow of the procedure is explained in Figure 6A. Results can be exported in TSV and CSV formats. Specificity and sensitivity of automatic SV calls are described in detail in Yalcin et al. (2011). To access and query the data manually, alignments can be visualized (both at base-pair and read-pair levels) using LookSeq (Figure 6B) (Manske and Kwiatkowski, 2009), a Web-based tool to visualize paired-end read NGS data, or using the Integrative Genomics Viewer (IGV) (Robinson et al., 2011; Thorvaldsdottir et al., 2013). Structural variants can be visually identified using our comprehensive catalog of paired-end mapping (PEM) patterns described in Yalcin et al. (2012a).

FIGURE 6 | How to access and query the data automatically and manually. (A) Workflow of how to automatically query structural variants. Our work was published relative to the mm9 Genome Build, but data can also be visualized directly on mm10. A gene name or genomic region can be searched for simple and complex structural variants. Results can be exported in TSV and CSV formats. (B) Workflow of how to manually search for structural variants. To do this, we use LookSeq (Manske and Kwiatkowski, 2009) as a Web-based tool to visualize paired-end read NGS data. The choice of the insert size depends on the size of the underlying structural variant, so that when the variant is large the insert size should also be large. Types of structural variants can be recognized using our comprehensive catalog of paired-end mapping (PEM) patterns described in Yalcin et al. (2012a).

FUTURE WORK AND CONCLUDING REMARKS
The current approaches for cataloging mutations are primarily based on aligning sequencing reads to the appropriate reference genome to identify SNPs, indels, and structural variants. The majority of SV discovery methods to date have been based on observing patterns of clusters of aberrant read mappings to the reference genome. However, for many groups of strains or individuals there are many haplotypes that are not present in the reference genome and are therefore excluded from the catalog of mutations. This is especially true for the wild-derived mouse strains such as SPRET/EiJ, CAST/EiJ, and PWK/PhJ. Thus, while the current approaches can often detect the presence of a nonreference haplotype in the form of a large insertion, they are blind to sequence variation occurring on that haplotype.
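The notion of an aberrant read mapping can be made concrete with a small Python sketch that labels a single read pair by its mapped span and strand orientation. It assumes a forward-reverse paired-end library, a roughly normal insert-size distribution, and a cutoff of three standard deviations; all names and thresholds are illustrative assumptions. Only four simplified signatures are covered, which is far fewer than the PEM patterns cataloged in Yalcin et al. (2012a), and practical callers require clusters of supporting pairs rather than a single pair.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One mapped read pair; field names are hypothetical, with pos1 <= pos2."""
    pos1: int      # leftmost reference coordinate of the upstream read
    strand1: str   # '+' or '-'
    pos2: int      # leftmost reference coordinate of the downstream read
    strand2: str   # '+' or '-'


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a pair with the SV type its mapping pattern is consistent with."""
    span = pair.pos2 - pair.pos1
    # Both reads on the same strand: consistent with an inversion breakpoint.
    if pair.strand1 == pair.strand2:
        return "inversion"
    # Reads pointing away from each other: consistent with a tandem duplication.
    if pair.strand1 == "-" and pair.strand2 == "+":
        return "tandem_duplication"
    # Correct orientation but mapped too far apart: sequence missing in the sample.
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"
    # Correct orientation but mapped too close: extra sequence in the sample.
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "concordant"


if __name__ == "__main__":
    pairs = [
        ReadPair(100, "+", 900, "-"),
        ReadPair(100, "+", 150, "-"),
        ReadPair(100, "+", 400, "+"),
        ReadPair(100, "-", 400, "+"),
        ReadPair(100, "+", 410, "-"),
    ]
    for p in pairs:
        print(p, classify_pair(p, mean_insert=300, sd_insert=30))
```

The sketch also shows why such methods are limited for nonreference haplotypes: an insertion longer than the insert size leaves no pair spanning it, so the inserted sequence itself is never examined.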
One solution to this problem is to create data structures capable of representing all of the haplotypes present in a group of related samples. In a recent study, Iqbal et al. developed de Bruijn graph methods for detecting and genotyping simple and complex genetic variants in an individual or population without a reference genome and were able to discover more than 3 Mb of sequence absent from the human reference genome (Iqbal et al., 2012).
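To illustrate the kind of data structure involved, the following Python sketch builds a toy "colored" de Bruijn graph in which every k-mer is annotated with the samples that contain it, and then reports branch points where haplotypes diverge. It is not the implementation of Iqbal et al. (2012): the function names, the choice of k = 4, and the two short sequences are assumptions made for illustration, and the sketch ignores sequencing errors, reverse complements, and coverage.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_colored_graph(samples, k):
    """Map each k-mer to the set of sample names ("colors") containing it.

    `samples` maps a sample name to a list of sequences (reads or contigs).
    """
    colors = defaultdict(set)
    for name, seqs in samples.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                colors[kmer].add(name)
    return colors


def branch_nodes(colors):
    """(k-1)-mers followed by more than one k-mer: candidate variant sites."""
    successors = defaultdict(set)
    for kmer in colors:
        successors[kmer[:-1]].add(kmer)
    return {node: sorted(out) for node, out in successors.items() if len(out) > 1}


def private_kmers(colors, name):
    """k-mers seen only in the named sample, i.e. absent from all others."""
    return {kmer for kmer, names in colors.items() if names == {name}}


if __name__ == "__main__":
    # Two made-up haplotypes differing by a single base.
    samples = {"strain_a": ["ACGGATTCAG"], "strain_b": ["ACGGACTCAG"]}
    graph = build_colored_graph(samples, k=4)
    print(branch_nodes(graph))
    print(sorted(private_kmers(graph, "strain_b")))
```

No reference genome appears anywhere in the sketch: variation is found by comparing the paths taken by different samples through the same graph, which is what allows sequence absent from a reference to be represented.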
The String Graph Assembler (SGA) was the first sequence assembly pipeline for next-generation data based on sequence overlaps (Simpson and Durbin, 2012). At the heart of SGA is the use of a compressed data structure called the FM-index, which is used to model the read sequence overlap graph of all the samples. Recently, work has been carried out to investigate building these structures using reads from multiple samples to represent all of the haplotypes present in the samples (Simpson, 2012).
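The core query supported by an FM-index, counting and locating exact occurrences of a pattern by backward search over the Burrows-Wheeler transform, can be sketched in a few lines of Python. This is not SGA's code: the class and method names are hypothetical, the index is built by naively sorting suffixes (adequate only for short strings), and the occurrence table is stored uncompressed, whereas practical implementations sample and compress these structures.

```python
from collections import Counter


class FMIndex:
    """Toy FM-index over a single string (hypothetical helper, not SGA's API)."""

    def __init__(self, text):
        if "$" in text:
            raise ValueError("'$' is reserved as the end-of-text sentinel")
        text += "$"  # '$' sorts before A, C, G, T in ASCII
        # Naive suffix array: sort suffix start positions by suffix content.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        # BWT: the character preceding each sorted suffix (wraps to '$' at i == 0).
        self.bwt = "".join(text[i - 1] for i in self.sa)
        # first[c]: number of characters in the text strictly smaller than c.
        counts = Counter(self.bwt)
        self.first = {}
        total = 0
        for ch in sorted(counts):
            self.first[ch] = total
            total += counts[ch]
        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {ch: [0] for ch in counts}
        for b in self.bwt:
            for ch, column in self.occ.items():
                column.append(column[-1] + (ch == b))

    def _range(self, pattern):
        """Half-open suffix-array interval of suffixes starting with pattern."""
        lo, hi = 0, len(self.bwt)
        for ch in reversed(pattern):  # extend the match one character leftwards
            if ch not in self.first:
                return 0, 0
            lo = self.first[ch] + self.occ[ch][lo]
            hi = self.first[ch] + self.occ[ch][hi]
            if lo >= hi:
                return 0, 0
        return lo, hi

    def count(self, pattern):
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern):
        lo, hi = self._range(pattern)
        return sorted(self.sa[i] for i in range(lo, hi))


if __name__ == "__main__":
    index = FMIndex("GATTACAGATTACA")
    print(index.count("ATTA"), index.locate("ATTA"))
```

Each step of the backward search costs two table lookups regardless of text length, which is the property that makes searching very large read collections for matching sequence feasible.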
An alternative approach is to first create individual whole-genome de novo assemblies for each sample and then subsequently carry out whole-genome alignments of the preassembled sequences. Several algorithms have been proposed for creating whole-genome alignments taking into account substitutions, insertions, deletions, and larger structural rearrangements. One such implementation of this approach is the combined Progressive Cactus and Hierarchical Alignment (HAL) graph pipeline (Paten et al., 2011). HAL is a graph-based hierarchical alignment format for storing multiple genome alignments arranged phylogenetically with the corresponding ancestral sequence reconstructions as internal nodes (Hickey et al., 2013).
The Mouse Genomes Project (http://www.sanger.ac.uk/resources/mouse/genomes/) has made a substantial contribution toward our understanding of structural variation diversity in mouse genomes and its correlation with phenotypic variation. However, as explained in this review, there are ongoing challenges in the computational detection of SVs with complex molecular architecture. Improved sequencing technologies with longer read lengths, along with the completion of de novo assemblies of mouse genomes, will be crucial in the identification of the remaining structural variants. De novo assembly also avoids reference bias in the ascertainment of SVs (Sousa and Hey, 2013). Using longer fragments in sequencing library construction also aids de novo assembly and SV detection in genomic regions that are "inaccessible" to short-read mapping due to their repetitive nature.