Sequence composition of BAC clones and SSR markers mapped to Upland cotton chromosomes 11 and 21 targeting resistance to soil-borne pathogens

Genetic and physical framework mapping in cotton (Gossypium spp.) were used to discover putative gene sequences involved in resistance to common soil-borne pathogens. Chromosome (Chr) 11 and its homoeologous Chr 21 of Upland cotton (G. hirsutum) are foci for discovery of resistance (R) or pathogen-induced R (PR) genes underlying QTLs involved in response to root-knot nematode (Meloidogyne incognita), reniform nematode (Rotylenchulus reniformis), Fusarium wilt (Fusarium oxysporum f.sp. vasinfectum), Verticillium wilt (Verticillium dahliae), and black root rot (Thielaviopsis basicola). Simple sequence repeat (SSR) markers and bacterial artificial chromosome (BAC) clones from a BAC library developed from the Upland cotton Acala Maxxa were mapped on Chr 11 and Chr 21. DNA sequence through Gene Ontology (GO) of 99 of 256 Chr 11 and 109 of 239 Chr 21 previously mapped SSRs revealed response elements to internal and external stimulus, stress, signaling process, and cell death. The reconciliation between genetic and physical mapping of gene annotations from new DNA sequences of 20 BAC clones revealed 467 (Chr 11) and 285 (Chr 21) G. hirsutum putative coding sequences, plus 146 (Chr 11) and 98 (Chr 21) predicted genes. GO functional profiling of Unigenes uncovered genes involved in different metabolic functions and stress response elements (SRE). Our results revealed that Chrs 11 and 21 harbor resistance gene rich genomic regions. Sequence comparisons with the ancestral diploid D5 (G. raimondii), A2 (G. arboreum) and domesticated tetraploid TM-1 AD1 (G. hirsutum) genomes revealed abundance of transposable elements and confirmed the richness of resistance gene motifs in these chromosomes. The sequence information of SSR markers and BAC clones and the genetic mapping of BAC clones provide enhanced genetic and physical frameworks of resistance gene-rich regions of the cotton genome, thereby aiding discovery of R and PR genes and breeding for resistance to cotton diseases.


Introduction
Cultivated plant species are under continuous attack by pathogens, which imposes a major challenge for growers by causing significant crop yield loss (Blasingame and Patel, 2004;Roberts et al., 2007). The future of crop improvement depends on understanding of the distribution, structure, and organization of disease resistance (R) and pathogen-induced (PR) genes (Ulloa et al., 2011). Plants have a great capacity to recognize pathogen effectors and inducers through different strategies (Dodds and Rathjen, 2010); however, our understanding of these strategies and interactions is still limited. New DNA sequence information coupled with the physical alignment of genomic regions into chromosomal maps and the anchoring of genetic maps are all steps that will improve the accuracy of detecting R or PR genes (van Loon et al., 2006;Bent and Mackey, 2007;Kou and Wang, 2010;Ulloa et al., 2011) and gene functions of important biological processes in crops (Rong et al., 2004;Ulloa et al., 2007;Chaudhary et al., 2009). In addition, these new discoveries will have important implications for breeding effective pest and disease resistance into elite cultivars by marker-assisted selection (MAS) (Ulloa et al., 2011(Ulloa et al., , 2013. Plants express multiple R genes with specificities for different strains of viruses, bacteria, fungi and nematodes, and individual plant genomes include hundreds of R gene-like sequences (Bent and Mackey, 2007;Adams-Phillips et al., 2008;Ulloa et al., 2011). The most studied R genes encode putative intra-cellular proteins with nucleotide binding sites (NBS) and leucine-rich repeat motifs (LRR), which represent the largest R gene family. NBS-LRR proteins can be subdivided in two types based on structural features of the N terminus: TIR-NBS-LRR proteins which resemble the intracellular domains of Drosophila Toll and mammalian IL-1 receptors and CC-NBS-LRR proteins which contain a coiled-coil domain (Jones and Dangl, 2006;Guo et al., 2011;Qi and Innes, 2013). Based on phylogenetic relationships, most R genes reside in clusters either as tandem duplicates on a tree or mixed clusters that contain genes from different branches of a species-wide tree (Meyers et al., 2005). Different R gene-mediated signal transduction pathways may utilize some distinct signaling components and induce a set of plant responses (Sato et al., 2007;Adams-Phillips et al., 2008). In contrast, PR genes have been classified into 17 families of pathogenesisrelated proteins. These proteins are induced through the action of the signaling compounds of salicylic acid, jasmonic acid or ethylene (Fonseca et al., 2009;Panstruga et al., 2009;Stepanova and Alonso, 2009). They possess antimicrobial activities in vitro through hydrolytic activities on cell walls, contact toxicity, and perhaps an involvement in defense signaling. However, these proteins serve essential plant functions (senescence, wounding, cold stress, and present in floral tissue) whether they are used in defense or not (van Loon et al., 2006).
In cotton ( Niu et al., 2008;Dighe et al., 2009;Ulloa et al., 2011Ulloa et al., , 2013Fang et al., 2014;Zhao et al., 2014). Cotton is one of the most economically important crops, providing the world's leading natural fiber, and it is a polyploidy model for cytogenetic, genomic, and evolutionary biology research (Kim and Triplett, 2001;Wendel and Cronn, 2003;Ulloa et al., 2007;Chaudhary et al., 2009). The estimated cotton yield loss due to diseases was 10.93% in the United States in 2004 (Blasingame and Patel, 2004). Increased knowledge of resistance to cotton pathogens such as RKN, REN, FOV, VW, BRR, and of genomic segments housing R or PR genes will help to elucidate the mechanisms of qualitative and quantitative disease resistance.
Knowledge of R and PR genes has increased with the availability of genome data and the increasing number of genes reported to be involved in resistance . New DNA sequences can be examined to discover genes involved in disease resistance by sequence comparisons with existing databases of expressed sequence tags (ESTs) such as GenBank (http://www.ncbi.nlm.nih.gov/). Additional studies using genomic and proteomic technologies have facilitated global comparisons of R and PR expression profiles (Ulloa et al., 2011;Yin et al., 2012;Wang et al., 2013;Wei et al., 2013) and pathway components of genes involved in disease defense and/or response (Chisholm et al., 2006).

Selection and Sequencing of BAC Clones of Upland Cotton Chromosomes 11 and 21
Two strategies were deployed to recruit BAC clones that mapped to Upland cotton Chr 11 and Chr 21 from the cv. Acala Maxxa genomic library (Tomkins et al., 2001). The first strategy used MUSB SSR markers previously mapped to Chr 11 (Frelichowski et al., 2006). Some of these marker-loci were later placed on Chr 21 Yu et al., 2012). We selected BAC clones which contained 12 MUSB SSRs (Table 1) from these two chromosomes. Some of these selected MUSB markers were identified as being associated with FOV resistance, using genetic and QTL mapping methods, and bulked segregant analysis (BSA) on resistant and susceptible progeny with different genetic backgrounds (Ulloa et al., 2011(Ulloa et al., , 2013Ulloa M and Roberts P unpublished information). Other MUSB markers were selected because they were mapped in the vicinity of an underlying QTL involved in pathogen resistance ( Table 2).
The second strategy was to use SSR marker-sequences previously mapped on Chr 11 and Chr 21 (CMD: http://www. cottonmarker.org/) to select BAC clones previously sequenced from the Acala Maxxa library by sequence-comparison. These BAC clones were originally sequenced erroneously as part of the maize sequencing project by the Genome Sequencing Center, Washington University School of Medicine. The DNA sequence information of these BACs was deposited into GenBank under the accession numbers: AC193383, AC187848, AC187214, AC187470, AC202821, AC190836, AC202830, and AC187810. Sequences of each BAC clone ( Table 1) were compared to SSR marker-sequences from Chr 11 and Chr 21. The selection criteria of tagging a BAC clone with mapped SSR markers from these chromosomes were as follows: only the sequence of each SSR marker spanning forward primer to the reverse primer (including the SSR motif) was used for the comparison. DNA sequences were blasted using all six frames (forward +1 to +3 and reverse −1 to −3) base positions. Potential BAC clones were tagged with an SSR marker when both (BAC and SSR) DNA sequences had a similarity >96%.

Sequencing and Assembly of Upland Cotton BAC Clones
A small-insert (3-5 kb) library was constructed from each of the 12 BAC clones, which harbored the selected MUSB markers on Chr 11 and Chr 21 ( Table 1). Small-insert DNA fragments were generated by isolating BAC DNA as a maxi-prep from the BAC clone and subjecting the DNA to random fragmentation by hydroshearing (Digilab R , Digilab Inc., Holliston, MA). Fragments between 3 and 5 kb were size-selected by gel electrophoresis, were end-repaired and cloned into the hicopy plasmid-based cloning vector pBlueskriptKSII+ (Agilent Technologies) and then electroporated into E. coli DH10B host cells. Transformants were selected on Lysogeny broth (LB) plates containing carbenicillan, X-Gal and IPTG. White recombinant colonies were picked robotically using the Genetix Q-bot (Genetix, Boston, MA) and stored as individual clones in Genetix 96-well microtiter plates as glycerol stocks at −80 • C. Sequencing was performed using the Dye-terminator cycle sequencing kit v3.1 (Applied Biosystems, Foster City, CA). Sequence data from the forward and reverse universal priming sites of the shotgun clones were accumulated on an ABI 3730xl DNA analyzer (Applied Biosystems, Foster City, CA). The BAC clones were sequenced to approximately 8X clone coverage (assuming 120 kb average insert size) and assembled with PHRAP software (Ewing et al., 1998), and edited with Consed (Gordon et al., 1998). Sequence contigs were ordered and oriented by the bridging shotgun method, and gaps were joined by the addition of N's giving a single contiguous consensus sequence for analysis. The sequencing of the BAC clones, which harbored the MUSB markers, was performed at Clemson University Genomics Institute, SC, USA. Additional information about the sequencing of these clones can be found in Ulloa et al. (2011). The DNA sequence information of these BACs was deposited into GenBank under the accession numbers: KM396694 (28E08), KM396695 (28O10), KM396696 (26K03), KM396697 (24E04), KM396698 (40I16), KM396699 (34K01), KM396700 (29O06) KM396701 (33K23), KM396702 (18O18), KM396703 (31K15), KM396704 (30E04), and KM396705 (32H19). The numbers and letters identify the BAC clone.

BAC Sequence Annotation of Stress Response Elements
DNA sequence-local alignments were made with the comprehensive G. hirsutum unigene set from http://www.plantgdb.org. The Unigene set consisting of 98,420 Unigenes (G. hirsutum mRNA assembly May 8, 2008; based on GenBank release 165) was downloaded from PlantGDB (www.plantgdb.org). Unigene sequences were BLASTN aligned to each BAC sequence individually with an e ≤1e-5 and identity ≥90%. Gene Ontology (GO) annotation was conducted using the Blast2GO program with default parameters (Gene Ontology Consortium, 2006;Conesa and Gotz, 2008). Gene prediction and annotation were performed using the prediction program Augustus (Stanke and Morgenstern, 2005). The Augustus program was tested on the Arabidopsis gene set, which considers expressed sequence tag (EST) matches as additional support for gene identification. All predicted genes and unigenes were subjected to a similar analysis using BLASTX through the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/) nr protein database with a value of 1e-5 to identify previously established protein motifs. Stress response elements (SRE) were identified based on the description of bioprocess of GO annotation. Genes involved in stress response elements were identified according to associated protein molecular function (MF), bioprocess (BP), and cell component (CC).

Alignment to Gossypium raimondii (D 5 ), G. arboreum (A 2 ), G. hirsutum TM-1 (AD 1 ), and Other Genomes
BAC sequences were aligned to the G. raimondii diploid D 5 whole genome (phytozome.net) (Paterson et al., 2012) through NCBI-nucleotide BLAST, G. arboreum diploid A 2 whole genome (http://cgp.genomics.org.cn)  and TM-1 AD 1 genome (http://cottongen.org) from two independent groups (CGP-BGI group, Li et al., 2015;NAU-NBI group, Zhang et al., 2015) with an e ≤1e-10 and identity ≥90%. The comparisons of the BAC sequences on Chr 11 and Chr 21 with corresponding chromosomes in A 2 , D 5 , AD 1 genome backgrounds were conducted. The average identity and the percentage of mapped BAC sequences were calculated based on consecutive matched sequence with compared genomes. The TM-1 sequence from the    CGP-BGI group was used as a genome background to determine that resistance genes from these BACs are more frequently located in the regions of Chr 11 and Chr 21 with Fisher's exact test (P < 0.05). Comparisons were also made between these BACs and other plant taxa: Arabidopsis thaliana, Vitis vinifera, Populus trichocarpa, and Theobroma cacao.

Marker Analysis and Data Mining
SSR markers previously mapped on Chr 11 and Chr 21 reported in the Cotton Marker Database (CMD: www.cottonmarker.org) were used to investigate DNA sequence composition. Sequences were then BLASTed through the NCBI (http://www.ncbi.nlm. nih.gov/). Sequences were compared against three databases: (a) Nucleotide collection (nr/nt); (b) Expressed Sequence Tags (EST); and (c) Non-Redundant protein sequences (nr). The top sequence hits found for each sequence in all three databases were then BLASTed through GO (http://www.geneontology. org/). The top functional hits given by GO were collected along with their categorized gene products [biological process (BP), cellular component (CC), and molecular function (MF)]. SSR markers involved in defense response or stress response were categorized according to top blasted protein function (receptor, disease protein, transcription factor, and oxygen-reduction and so on) and GO annotation.

BAC Sequence and Annotation for Stress Response Elements
Twenty selected BAC clones were analyzed for potential coding elements involved in response to biotic/abiotic stress mechanisms ( Table 1). Twelve BAC clones tagged with BAC-end MUSB [selected from Frelichowski et al. (2006) and Ulloa et al. (2008Ulloa et al. ( , 2011] markers were sequenced: BAC-derived MUSB0404, MUSB0641, MUSB0827, MUSB0953, MUSB1000, MUSB1015, MUSB1035, MUSB1076, MUSB1163, and MUSB1278 from Chr 11, and MUSB0810 and MUSB0823 from Chr 21 ( Table 1). The estimated BAC clone size according to assembled sequence data ranged from 68 to 140 kb with an average of 106 kb per BAC. The BAC clones were sequenced to an approximate 8X coverage, which resulted in 3-8 ordered contigs spanning up to 140,000 bp. In addition, seven BAC clones tagged to previously mapped SSR markers (25 NAUs and one TMB) on Chr 21 from the Upland cotton cultivar Acala Maxxa genomic library previously sequenced by the Genome Sequencing Center, Washington University School of Medicine were also investigated for potential coding elements: AC193383, AC187848, AC187214, AC187470, AC202821, AC190836, AC202830, and AC187810 (Table 1). These Maxxa BACs, erroneously sequenced by the maize group, were used in a different cotton characterization study by Guo et al. (2008). In this study, the 10 BAC clones from Chr 11 yielded a total of 1,129,445 bp while the 10 BAC clones from Chr 21 yielded 974,552 bp, for a total of 2,103,997 bp sequence data.
BAC sequence annotation by BLASTN alignment to the publicly available G. hirsutum Unigene set (GenBank release 165) revealed 467 (Chr 11) and 285 (Chr 21) putative Unigenes (e ≤ 1e-5). Functional signature annotations of BAC-mapped Unigene sequences were aligned to the non-redundant protein database and assigned GO terms. A total of 238 out of 467 of Chr 11 and 233 out of 285 of Chr 21 putative Unigenes were found to be similar to known protein sequences with e ≤ 1e-5 (Table 1), while 229 putative Unigenes on Chr 11 and 52 on Chr 21 had no match to known protein sequences with e ≤ 1e-5 (Table 1 and Tables S1, S2). There were 41 Unigenes on Chr 11 and 224 on Chr 21 involved in disease defense response or stress response elements (SRE) ( Table 1) based on sequence description from the BLASTed protein database and GO annotations [P (bioprocess), F (molecular function) and C (cell component)] (additional information highlighted in yellow in Tables S1, S2). Stress response elements involved in internal and external stimulus, stress, signaling process and cell death from these Unigenes are shown in Table S3 for Chr 11 and Table  S4 for Chr 21. In addition, 44 transposable elements (TEs) and 120 DNA/RNA polymerase family proteins were identified on Chr 11, and nine TEs but no DNA/RNA polymerase protein on Chr 21 (Table 1).
Augustus gene prediction software revealed 146 genes on Chr 11 and 98 genes on Chr 21. The results indicated abundance of genes with considerable homology to disease response elements for these BAC clones (Table 1 and Tables  S5-S8), with function in cellular growth and development processes, transport, translation, plus metabolic functions and stress response elements. Forty-three genes on Chr 11 BACs and 59 genes on Chr 21 BACs were involved in defense response (Table 1 and Tables S5, S6 highlighted in yellow), including receptor kinase proteins, early-responsive to dehydration stress proteins, subtilisin-like serine endopeptidase family proteins, strictosidine synthase-like, universal stress proteins, auxinresponsive proteins, and disease resistance proteins involved in stress response. GO annotation showed a range of defense associated proteins for MF, and SRE included responses to biotic/abiotic stimulus, signaling, and cell death (Tables S7, S8). The Augustus gene prediction software also indicated 56 TE on Chr 11 BACs and 16 on Chr 21 BACs (Table 1). TE included retrotransposon ty1-copia subclass, retrotransposon ty3-gypsy subclass, gag-pol polyprotein, mutant gag-pol polyprotein, mutator sub-class protein and copia-like retrotransposable elements (Table 3, Tables S5, S6). The longest TE hit length extended 6759 bp. A GO analysis further characterized these TE into a range of defense-related acitivities (Table 3 and Tables S5,  S6). In addition to the TEs, 15 DNA/RNA polymerase family proteins were identified on Chr 11 but none were identified on Chr 21 (Table 1).
Twenty-three disease resistance proteins were identified in four BACs (31K15 on Chr 11, and AC190836, AC202830 and AC187810 on Chr 21). The BAC 31K15 associated with marker MUSB1076 linked to R gene rkn1  and cluster regions containing leucine-rich repeat protein, NBS-LRR resistance protein rgh2 or rgh1, and CC-NBS-LRR resistance protein. Three BAC clones (AC190836, AC202830, and AC187810) on Chr 21 contained R genes harboring NBS-LRR proteins, including CC-NBS-LRR class disease resistance, tmv resistance protein and other disease resistance proteins (Table 3). Based on structural features of the N terminus, NBS-LRR proteins were surrounded by additional receptor proteins such as serine-threonine and kinase-like proteins, and TEs (Table 3). Moreover, NBS-LRR genes were identified within clusters and in the vicinity of the RKN, REN, FOV, VW, and BRR resistance of marker-genes previously reported (Bolek et al., 2005;Wang et al., 2006;Niu et al., 2008;Dighe et al., 2009;Ulloa et al., 2011).
More specifically, a percent identity plot of duplication harboring NBS-LRR resistance motifs for BAC clones AC187810 vs. AC202830 on Chr 21 is given in Figure 2, in which a set of seven regions were found harboring NBS-LRR motifs with a minimum of 70% identity spanning the clone length of ∼90 kb.

Alignment to Gossypium raimondii (D 5 ), G. arboreum (A 2 ), G. hirsutum TM-1 (AD 1 ) and Other Genomes
A synteny block comparison was made of alignment of full length sequences of Chr 11 and 21 BAC clones to the two available assembled whole diploid genome sequences of G. arboreum (A 2 ) and G. raimondii (D 5 ) (Tables S9-S13). The comparisons among the matched sequences showed 84.23% identity with Chr 11 BACs and 98.54% identity with BACs of Chr 21 of the tetraploid (AD) genome, corresponding to D 5 Chr 7 genome sequence (Tables S9, S12, S13). Eight percent and 80% consecutive sequences from chromosomes 11 and 21, respectively were mapped to D 5 Chr 7. Seven Chr 11 BACs with no consecutive mapping sequence were also mapped to D 5 Chr 7 in several regions. Most matched sequences of these seven BACs were TEs (Tables S9, S12, S13) which showed multiple copies through the whole genome, including D 5 Chr 7. More BLAST hits of Chr 11 BACs than Chr 21 BACs with Chr 7 A 2 genome sequence were found (Tables S9-S11). However, only one Chr 11 BAC (29O06) showed consecutive sequence length with Chr 7 A 2 genome (Tables S9-S11). The BAC sequences matched with the A 2 genome were mostly transposable elements which are distributed across the whole genome. Alignment of Chr 11 and Chr 21 BAC sequences from G. hirsutum Maxxa to G. hirsutum TM-1 genome showed slight differences between the two sequencing groups BGI and NBI, possibly due to different assembly methods (Tables S14, S15). In total, 42 and 52% consecutive sequences of Maxxa BACs on chromosomes 11 and 21, respectively, were mapped to TM-1 At-Chr1 (equals Chr 11) and Dt-Chr7 (equals Chr 21) from BGI sequencing data (Tables S14, S15). From NBI sequencing data, 41 and 62% consecutive sequences of Maxxa BACs on chromosomes 11 and 21 were mapped to A11 (equals Chr 11) and D11 (equals Chr 21) of the TM-1 genome, respectively. The identities of matched sequences between Maxxa BACs and TM-1 genome reached 98% for Chr11 comparison and 97% for Chr 21 comparison with both BGI and NBI sequencing data. Some BAC sequences were aligned to unmapped scaffolds and mapped chromosomes, such as 34K01, indicating the unmapped scaffolds might be connected to the mapped chromosome. Partial consecutive sequences of the Maxxa BAC 32H19 on Chr 21 linked to the marker MUSB0823 were mapped to TM-1 genome Chr 11 (Tables S14, S15). Part of Maxxa BAC 40I16 sequence linked to MUSB1278 was mapped to Chr 7 in the TM-1 genome (Tables S14, 15). Most unmapped Maxxa BAC sequences matched with Chr 11 or Chr 21 were transposable elements across the whole genome. The enrichment analysis with Fisher's exact test indicated that 115 out of 168 GOs compared with TM-1 genome sequence from CGP-BGI group were over-represented in Chr 11 and Chr 21 regions with p < 0.05 (range from 8.12E-33 to 0.041). The 115 GOs included stress response elements, such as oxidoreductase activity, cellcell signaling, defense response to virus, syncytium formation, response to abiotic stimulus, MAP kinase kinase kinase activity, and transmembrane receptor protein tyrosine kinase signaling pathway.
Comparison of Chr 11 and Chr 21 BAC sequences with four other plant taxa-Arabidopsis thaliana, Vitis vinifera, Populus trichocarpa, and Theobroma cacao, revealed conserved regions of short sequences with each plant species. Alignments with T. cacao and V. vinifera were especially strong for certain cotton BAC clones, but less so with A. thaliana and P. trichocarpa. Results from these comparisons and subsequent GO analyses did not provide additional information.

SSR Marker Sequence Annotation for Stress Response Elements
Comparison of available sequence information from 256 SSRs on Chr 11 and 239 on Chr 21 to sequences in NCBI EST databases indicated considerable sequence similarity to known genes in plants, with 145 and 142 gene-homologies, respectively, of which 99 on Chr 11 and 109 on Chr 21 were indicated to play a role in plant defense. SSR sequences were similar to transcription factors R 2 R 3 -myb transcription factor, heat shock transcription factor, receptor kinase protein, lightregulated protein, zinc finger protein, leucine-rich repeat family protein, nucleic binding protein, WRKY DNA-binding protein, and Verticillium wilt resistance-like protein (Tables S16, S17). Because of duplicated loci from a single marker mapped on Chr 11 and its homoeolog Chr 21, similar genes, pseudogenes, or gene-forms may be present on both chromosomes (Figure 1; www.cottonmarker.org). Categorization of the gene function revealed that markers of Chrs 11 and 21 mapped to genes associated with all three GO: BP, CC, and MF (Tables S16, S17). GO also revealed similarities to SRE genes involved in internal and external stimulus, stress, signaling process and cell death ( Table 4, Tables S18, S19). The table S20 provides data on the distance between the mapped chromosome-wide and BACspecific markers and the defense gene sequences found on Chrs 11 and 21 listed in Table 3.

Discussion
The approach in this study was to develop a genetic and physical framework for the genomic regions of Upland cotton homoeologous Chr 11 and Chr 21 that contain important nematode and fungal disease resistance associations with molecular markers such as SSRs. While various QTL and other genetic mapping approaches have revealed the importance of this pair of cotton chromosomes in defense to biotic stresses, there has hitherto been little physical structure development and use of sequence annotation to advance our understanding of its genetic organization. The current and previous marker work provided numerous mapped marker sequences for these two chromosomes, some of which are important for use in cotton breeding programs. Furthermore, this resource allowed us to identify existing BAC clones in the G. hirsutum Acala Maxxa BAC library that are from Chr 11 and Chr 21 based on genetic mapping with SSR markers derived from the BAC-end sequences. Targeted full clone sequence of these mapped BAC clones provided a second resource of genomic DNA sequence to investigate defense response motif content of this cotton genome region. The Maxxa BAC clone and marker sequence data were also compared to the whole genome sequence assemblies of the G. raimondii D 5 and G. arboreum A 2 ancestral diploid genomes (Paterson et al., 2012;Li et al., 2014), and two G. hirsutum TM-1 AD 1 whole genome assemblies which are now publicly available Zhang et al., 2015).
Of particular interest is the very high defense response element content of sequences from both the SSR markers and the BAC clones on both Chr 11 and Chr 21. This result is in line with the currently recognized importance of this pair of cotton chromosomes in resistance to a wide range of parasitic nematodes and disease-causing pathogens of cotton revealed through genetic mapping of resistance trait determinants. The gene ontology annotations clearly demonstrate the richness of this region in the evolution of defense genes. Typically resistance loci evolve by tandem duplication followed by mutation and divergence of functional specificity, for example nematode resistance in soybean (Cook et al., 2012), often in response to or as a hedge against similar mutation and evolutionary changes in virulence factors in the nematode or pathogen. The large number of NBS-LRR type motifs with tandem repeats, for example as summarized for one of the two BAC clones in Figure 2 and sequence duplication of the BAC clones on Chr 21 (Figure 2), exemplifies this evolutionary hot-spot of defense gene-rich arrangement.
Comparison of DNA sequence between Chr 11 and Chr 21 for certain BAC clones also indicates the high homology between the sequences of the homoeologous chromosome pair. Thus, herein we not only report apparent large-scale duplication events within an Upland cotton chromosome, but also considerable duplication and an evolving separation of sequence homology between a pair of homoeologous chromosomes. This provides cotton with an enormous reservoir of defense response genes, some of which may be defeated related to prior pathogen forms, while others provide a resource for defense against future pathogen forms.
More TEs were identified on Chr 11 (At subgenome) than on Chr 21 (Dt subgenome) according to both G. hirsutum Unigene (A/D: 44/9) and predicted gene databases (A/D: 56/16) ( Table 1), which might account for the physical difference in size of the A-subgenome in reference to the D-subgenome. Li et al. (2014) reported that there were a total of 4098 TEs on Chr 7 (equivalent to Chr 11 in G. hirsutum) in the diploid G. arboreum A genome and only 1542 TEs on Chr 7 (equivalent to Chr 21 in G. hirsutum) in the diploid G. raimondii D 5 genome even though there were similar numbers of loci identified on Chr 7 in both diploid genomes. At least 64.8% TEs were identified in the TM-1 genome by Zhang et al. (2015) and 66% TEs by Li et al. (2015). More TEs in the A sub-genome (at least 843.5 Mb, genome size 1477 Mb) than in the D sub-genome (at least 433 Mb, genome size 831 Mb) were determined in the TM-1 genome (Zhang et al., 2015). TEs are known to play a dominant role contributing to angiosperm evolution and diversity (Oliver et al., 2013). In cotton, allotetraploid G. hirsutum was derived from reuniting of diploid A and D genomes about 1-2 million years ago (mya) through independent and differential accumulation of TEs 5 mya (Hu et al., 2010;Li et al., 2014). We found that resistance genes in BACs were always surrounded with retrotransposable elements ( Table 3). Retrotransposons based on "cut and paste" mode are more abundant in cotton, including Ty1-copia and Ty3-gypsy elements (Hawkins et al., 2006;Hu et al., 2010). More than 50% retrotransposon frequencies were reported in the TM-1 genome Zhang et al., 2015). TEs involved in abiotic and biotic stress responses have gained more attention recently (Grandbastien, 1998;Grandbastien et al., 2005;Cowley and Oakey, 2013;McDowell and Meyers, 2013;Oliver et al., 2013;Tsuchiya and Eulgem, 2013;Wheeler, 2013). More TEs on the At subgenome might suggest more adaptation to biotic stress response on Chr11 than on Chr 21. In addition, we found 120 DNA-RNA polymerase family protein genes contributing to regulation of transcription on Chr 11 BACs with the G. hirsutum Unigene database but none of these on Chr 21. It is not clear to what extent DNA-RNA polymerase family proteins function in stress response but these results suggest divergent evolution between the A and D genomes.
Comparison of G. hirsutum AD 1 whole genome with A 2 and D 5 were thoroughly conducted by Li et al. (2015) and Zhang et al. (2015) and with other genomes (A. thaliana, T. cacao, Glycine max, and V. vinifera) . However, the 20 Maxxa BACs could not be fully mapped to the TM-1 genome, indicating that differences occur between the two tetraploid G. hirsutum AD 1 cotton varieties. Abundant transposable elements might cause the difference between the two G. hirsutum cotton varieties. In addition, homeologous exchanges were also observed between At subgenome Chr 11 and Dt subgenome Chr 21 (Tables S14, 15). For example, Maxxa BAC 32H19 linked to MUSB0823 on Chr 21 (Figure 1, Yu et al., 2012) was mapped to both Chr11 and Chr 21on TM-1 genome (Tables S14, S15).
Comparisons between Maxxa BACs from the tetraploid AD 1 cotton and the A 2 and D 5 ancestral genomes were made to better understand the evolution of the AD genome, particularly in regard to relationships that may shed light on resistance evolution. Comparison of sequence alignments showed less similarity between tetraploid AD Chr 11 and D 5 genome than between AD Chr 21 and D 5 genome, further supporting independent evolution of the A and D genomes. Likewise, sequence alignments showed less similarity between tetraploid AD Chr 21 and A 2 genome than between AD Chr 11 and A 2 genome. The divergence of the A and D genomes is also reflected in the origins of resistance traits. For example, in a previous study G. hirsutum (AD 1 ) and G. barbadense (AD 2 ) were found to share the same SSR marker MUCS088 alleles as G. arboreum (A 2 ), suggesting nematode resistance introduction was from the diploid cotton (A 2 ) genome .
The comparison of aligned sequences with four other sequenced plant taxa indicated a conservation of genic sequence among these plants. The highest similarities of cotton BAC sequence to the other plant taxa indicated the closest relationship with T. cacao. Both G. raimondii and G. arboreum genomes showed close collinear relationships with T. cacao and both of them might share a common ancestor having diverged from T. cacao 18-58 mya (Paterson et al., 2012;Wang et al., 2012b;Li et al., 2014).
Genome-wide association studies (GWAS) have been utilized successfully to identify genetic variation in plants (Brachi et al., 2011), and the availability of diploid and tetraploid whole genome sequences makes possible GWAS for identifying genetic variation in cotton. A whole genome marker map in cotton was constructed by Wang et al. (2013) based on the G. raimondii D 5 genome (Paterson et al., 2012). Wei et al. (2013) conducted systematic analysis and comparison of nucleotide-binding site disease resistance genes in the G. raimondii D 5 genome (Wang et al., 2012b) and genome-wide analysis of the gene families of resistance gene analogs and their response to Verticillium wilt was made in both the G. raimondii D 5  and G. arboreum A 2 genomes . A comprehensive meta QTL analysis was made for fiber quality, yield, drought tolerance and disease resistance with different cotton populations (Said et al., 2013). GWAS in the tetraploid (AD) TM-1 cotton revealed positively selected genes for fiber improvement in the A genome and for stress tolerance in the D genome (Zhang et al., 2015). GWAS in the allotetraploid cotton to identify resistancerich regions will provide more insights about the evolution of the homoeologous chromosomes 11 and 21 and benefit disease management.
In conclusion, the sequence information and physical mapping of BAC clones provide an additional genomic resource of these resistance gene-rich regions of the Upland cotton genome on Chr 11 and Chr 21. BAC clone sequences are deposited in GenBank (NCBI: http://www.ncbi.nlm. nih.gov). Continuing genetic and physical framework alignment of sequence information in cotton will help to expedite the discovery of R and PR genes and the assembly of a whole Upland cotton tetraploid genome, eventually supporting breeding for disease resistance in cotton production.