Chromosome-Level Alpaca Reference Genome VicPac3.1 Improves Genomic Insight Into the Biology of New World Camelids

The development of high-quality chromosomally assigned reference genomes constitutes a key feature for understanding genome architecture of a species and is critical for the discovery of the genetic blueprints of traits of biological significance. South American camelids serve people in extreme environments and are important fiber and companion animals worldwide. Despite this, the alpaca reference genome lags far behind those available for other domestic species. Here we produced a chromosome-level improved reference assembly for the alpaca genome using the DNA of the same female Huacaya alpaca as in previous assemblies. We generated 190X Illumina short-read, 8X Pacific Biosciences long-read and 60X Dovetail Chicago® chromatin interaction scaffolding data for the assembly, used testis and skin RNAseq data for annotation, and cytogenetic map data for chromosomal assignments. The new assembly VicPac3.1 contains 90% of the alpaca genome in just 103 scaffolds and 76% of all scaffolds are mapped to the 36 pairs of the alpaca autosomes and the X chromosome. Preliminary annotation of the assembly predicted 22,462 coding genes and 29,337 isoforms. Comparative analysis of selected regions of the alpaca genome, such as the major histocompatibility complex (MHC), the region involved in the Minute Chromosome Syndrome (MCS) and candidate genes for high-altitude adaptations, reveal unique features of the alpaca genome. The alpaca reference genome VicPac3.1 presents a significant improvement in completeness, contiguity and accuracy over VicPac2 and is an important tool for the advancement of genomics research in all New World camelids.


INTRODUCTION
Alpacas and llamas were domesticated in the high Andes around 9,000 years ago and have been associated with humans for as long as cattle, horses and dogs (Wheeler, 1995;Bruford et al., 2003). It is thought that the ancient Incan civilization owed success largely to llama dung, which provided fertilizer and enabled corn to be cultivated at very high altitudes. Today, alpacas continue to serve the rural families of the Altiplano as an important source of fiber and meat (Cruz et al., 2017). In addition, alpacas are also gaining popularity worldwide, mainly for their high quality fiber, and as a docile companion species. In addition, alpacas and camelids in general, are species of broader interest for several fields in biology and biomedical sciences. For example, the family Camelidae forms the most basal clade in the phylogeny of the eutherian order Cetartiodactyla (Murphy et al., 2005;Zhou et al., 2011) and is, thus, a key-group in the mammalian evolutionary tree, and is being used to aid in the annotation of the human genome (Genome 10K Community of Scientists, 2009). Further, genetic relationships between South American camelids, the domesticated alpaca (Vicugna pacos) and llama (Lama glama), and the wild guanaco (Lama guanicoe) and vicuña (Vicugna vicugna), are intriguing and still not completely resolved (Bruford et al., 2003;Barreta et al., 2013;Marin et al., 2018). All camelids are uniquely adapted to extreme environments -the New World species to high altitude and the Old World camels to arid desert environments (Wu et al., 2014), due to these adaptations their genomes may reveal important signatures of natural or human selection. Camelids are also of biomedical interest because of the presence of small and functionally efficient heavy chain-only antibodies, which are not found in other mammalian groups (Flajnik et al., 2011;Griffin et al., 2014;Cohen, 2018).
Despite being a species of broad interest, the analysis of camelid genomes, including that of the alpaca, had a late start and lags behind other domesticated species. Camelid karyotypes were described in the 1980s (Bianchi et al., 1986), showing that all extant species have a conserved diploid number (2n = 74) and very similar chromosome morphology. Yet, the first cytogenetic and comparative chromosome maps for these species emerged only recently (Balmus et al., 2007;Avila et al., 2014aAvila et al., ,b, 2015, almost concurrently with genome sequencing projects. At present, there are two annotated sequence assemblies for the alpaca that are available at all main Genome Browsers such as NCBI 1 , UCSC 2 and Ensembl 3 : VicPac1 (version 1.0) and VicPac2 (version 2.0.1). Both used DNA from the same female Huacaya individual. The first assembly was generated at the Broad Institute by Sanger sequencing and has 2.51X genome coverage, the second was assembled at Washington University by combining the former Sanger reads with newly generated 454 GS FLX data. This resulted in an assembly with 22X genome coverage and annotation for 24,553 genes and 33,208 proteins. VicPac1 and VicPac2 form the alpaca reference genome and are currently the main tools for alpaca genomics. There is also a third assembly, Vipacos_V1.0, which was generated for the comparison of genomic signatures of selection and adaptations between the dromedary, Bactrian camel and alpaca (Wu et al., 2014). Vipacos_V1.0 was assembled from short-read Illumina data and reached 72.5X genome coverage, but is not integrated with VicPac1 or VicPac2. Despite this progress, all three alpaca assemblies are relatively short -2 billion DNA base-pairs (2 Gb) instead of the anticipated 2.5-3 Gb; all are fragmented into a large number of contigs and scaffolds, and none have scaffolds assigned to chromosomes. The overall utility of these datasets as an alpaca reference genome to serve the interests of researchers, breeders and the health and welfare of the animals, is therefore limited and needs improvement.
The aim of this study was to re-sequence, re-assemble de novo and re-annotate the alpaca genome using the same female Huacaya DNA donor as in VicPac1 and VicPac2. We used next generation long-and short-read sequencing platforms to generate the data and initial assembly; Dovetail Chicago R scaffolding and HiRise TM for advanced assembly; RNAseq and bioinformatics pipelines for annotation, and cytogenetic comparative map data to anchor sequence scaffolds to chromosomes.

Genome and Assembly Features
The genome of a female Huacaya alpaca was sequenced generating ∼190X genome coverage of paired-end (PE) and mate-pair (MP) short-read Illumina data (2.72 billion PE reads, 272 Gb; 1.52 billion MP reads, 152 Gb), ∼8X genome coverage of Pacific Biosciences (PacBio) long-read data (2.4 million subreads; 18.0 Gb), and ∼60X genome coverage Dovetail Chicago R chromatin interaction scaffolding data (459 million PE reads; 137.7 Gb). A multi-stage assembly improvement strategy was applied through four separate assembly iterations. Firstly, we produced a hybrid de novo assembly using the PE and MP shortread data together with the Sanger and 454 data from the VicPac1 and VicPac2 assemblies, respectively. This assembly (Qmas1) had more contigs and scaffolds than VicPac2, lower scaffold N50, but higher contig N50 (Table 1). Next, we integrated Qmas1 and VicPac2 to produce a meta-assembly (Qmas1/VicPac2) that resulted in contiguity improvements, namely a reduction in the number of contigs and scaffolds and the simultaneous increase in contig and scaffold N50s ( Table 1). The next iteration of the assembly incorporated the ∼ 8X PacBio long-read data and resulted in modest improvements to the assembly (designated VicPac3) compared to previous iterations ( Table 1). The final assembly iteration involved scaffolding the VicPac3 assembly with the MP short read data and Dovetail Chicago R data. This final assembly also resulted in significant improvements in the assembly metrics (see Table 1), including a significant increase of scaffold N50 from 9.86 Mb in VicPac3 to 24 Mb in VicPac3.1. Compared to all previous assemblies, VicPac3.1 has the best assembly metrics and most importantly, 90% of the assembly sequence length (L90) is contained in just 103 scaffolds (0.1% of all scaffolds; Table 1). The remaining 10% of the assembly sequence length is made up of smaller, fragmented scaffolds. Addition of higher coverage long-read data, for example 20X, compared to the 8X we used, may be needed to generate further improvements to the assembly, through filling gaps and joining scaffolds. The most critical improvements in the contiguity and accuracy of the assembly occurred during the meta-assembly of Qmas1 and VicPac2, and subsequent HiRise TM scaffolding of VicPac3. The latter corrected 240 inaccurate assemblies, joined 1813 scaffolds, and essentially improved the size of scaffold N50 and reduced L50 and the total number of scaffolds ( Table 1).
The GC-content of the alpaca genome was ∼41% and remained the same across all our assembly iterations, and is similar to that reported in prior alpaca assemblies ( Table 1). The 2.12 Gb size of the re-assembled genome VicPac3.1 is similar to previous assemblies of the same individual (Table 1) but smaller than the 2.63 Gb estimation by k-mer analysis (Wu et al., 2014). Genome size estimation using a range of k-mer frequencies obtained from our short-read data produced size estimates ranging from 2.05 to 2.29 Gb (Supplementary Figure 1 and Supplementary Table 1), which are very similar to the obtained genome sizes for all assemblies in Table 1 for the same animal, but smaller than the prior k-mer estimation (Wu et al., 2014). On the other hand, measurement of the genome size by flow cytometry using alpaca fibroblasts suggested size of 2.88 Gb with a range of 2.73-3.01 Gb (95% confidence interval; Supplementary  Figure 2), thus larger than the bioinformatic estimates by us or others. However, it must be noted that the available computational and empirical methods for estimating genome size are subject to very large errors. Furthermore, genome size will vary between individuals. These factors combined may account for the differences between the estimates, and the exact size of the alpaca genome is yet to be determined by additional studies.
The Benchmarking Universal Single-Copy Orthologs (BUSCO) 4 mammalian gene set with 4,104 conserved 4 https://busco.ezlab.org/ mammalian orthologs (hereafter BUSCOs) was used to assess genome completeness in terms of recovery of these BUSCOs, to evaluate assembly iterations and compare them to previous alpaca assembly versions. While BUSCO analysis is more appropriate for direct comparison of different genome assemblies within a species, it can provide useful benchmarks when compared to assemblies of other species. Therefore, we also produced BUSCO assessment data for cow, sheep, dromedary and Bactrian camel. Serial improvements in BUSCO scores were observed throughout the iterative assembly process ( Table 2), with the final assembly, VicPac3.1, having the highest BUSCO completeness at 96.1% with 3,944 genes and the lowest number of missing BUSCOs (77 genes; 1.9%). Compared to other available camelid genomes, the final assembly demonstrated comparable, but slightly superior scores across all metrics, suggesting that this assembly is one of the most complete available for camelids, and has completeness scores comparable with the cattle and sheep genomes. The datasets are available in BioProject ID PRJNA544883.

Chromosomal Assignment
Sequences from the available alpaca cytogenetic map (Avila et al., 2014a(Avila et al., ,b, 2015 and comparative data with human, cattle and pig genomes (Balmus et al., 2007) were used to anchor the alpaca genome sequence assembly to physical chromosomes. In VicPac3.1, 75.9% of sequence scaffolds (in bp; ∼1.6 Gb) are mapped to the 36 pairs of alpaca autosomes and the X chromosome (Table 3 and Supplementary Table 2) providing the first chromosome-level assembly for the alpaca, or any camelid genome. Notably, this is a 31.9% increase in the amount of anchored sequence compared to our anchoring of VicPac2 (44% of sequence scaffolds in bp; 0.96 Gb). Among the most notable improvements are assemblies of 14 alpaca chromosomes, viz., chrs2, 5, 7, 8, 10, 17, 19, 22, 24, 27, 28, 31, 33, and 34 that uniquely correspond to a single large scaffold (Table 3); VicPac2 only contains 2 chromosomes made up of single scaffolds (chrs19 and 31; Table 3 and Supplementary Table 3). Additionally, the total number of chromosomally anchored scaffolds was reduced from 129 in VicPac2 to 88 in VicPac3.1, while simultaneously increasing the percentage of the genome anchored, further highlighting the significant improvements in contiguity of VicPac3.1. Currently, the most contiguous and largest is the 121 Mb scaffold of chr2, which likely represents the entire chromosome. In contrast, chr11 and chr16 remain rather fragmented and correspond to six different scaffolds each. It is notable that three scaffolds, with a total size of 5 Mb, correspond to the smallest autosome, chr36, because no sequences were assigned to this chromosome by Zoo-FISH (Balmus et al., 2007). Despite this progress, assemblies of several large chromosomes remain fragmented and incomplete. For example, eight unique scaffolds correspond to chrX, but cover collectively less than 40 Mb of the anticipated 150 Mb, which is the size of an average mammalian X chromosome 5,6,7 .

Genome Annotation
Altogether, preliminary annotation predicted 42,389 genes in the alpaca genome. Of these, 22,462 were coding genes with an average of 14.7 exons per gene. Single-exon genes accounted for 11% (2,519) of these, thus multi-exon genes had an average of 16.5 exons. Overall, the predicted protein coding genes represented 39.6% of assembled sequence with coding exons covering 2.4% of assembled sequence. Coding genes contained 1.3 isoforms on average with a total of 29,337 coding isoforms. The number of genes predicted in the alpaca genome was higher than expected for a vertebrate or mammalian genome (Pennisi, 2012). This may be due to the limited transcriptome depth used to stitch exons into contiguous genes. The human transcriptome now has many terabases of sequencing depth from multiple tissues and conditions to generate a comprehensive (but still incomplete) transcriptome (Steward et al., 2017). The total number of predicted exons in the alpaca is also larger than more comprehensively annotated mammalian genomes, owing to small exonic fragments which may actually be exon extension and truncation events rather than unconnected exons. However, 22,462 predicted coding genes (e-value < 1e-20) are similar to the number of known proteins in the mammalian RefSeq database (Pruitt et al., 2014) 8 . It is possible that the remaining 19,927 predicted genes which have no similarity to any mammalian peptide, may be long non-coding RNA genes, having canonical intron-exon structure, polyadenylation sites and other coding gene-like features, but having a degenerate, or vestigial open reading frame sufficient to avoid nonsense mediated decay (Chang et al., 2007).
Of the predicted peptides, 58% were considered full length, compared to the length of the human best matching peptide, though more transcriptome sequencing is required to improve exon connectivity (Figure 1). Additionally, using OrthoFinder (Emms and Kelly, 2015), which identifies both "orthogroups" (genes descended from a single gene in the last common ancestor of a group of species, allowing manyto-many relationships and gene expansions) and orthologs between each pair of species in the comparison, we identified 21,136 orthogroups between VicPac3.1, dromedary, wild and captive Bactrian camels, cow and sheep with a mean size of 9 genes per orthogroup (Supplementary Table 4). Of these, 17,916 orthogroups contained an alpaca (VicPac3.1) protein, which provides further evidence for the quality of these gene annotations. Notably, 15,777 orthogroups contained all six species and 3,959 orthogroups contained all six species and were comprised of single-copy genes (Supplementary Table 4). The latter is close to the 4,104 mammalian BUSCOs 9 . The quality of the assembly and annotation was further validated by aligning all alpaca orthologs from chromosomally anchored scaffolds to the dromedary, Bactrian camel, cattle and human genomes (Supplementary Tables 5, 6) and compiling conserved synteny data between alpaca-human and alpaca-cattle with 11,765 and 8,494 orthologs, respectively.
Repeat content in VicPac3.1 was 33.5% and largely the same as in Vipacos_V1 (Table 1). RepeatMasker 10 based annotation (Supplementary Table 7) identified over 4.6 million repeat elements, with the most abundant class being LINEs (19.41%), followed by LTR elements (5.81%) and SINEs (3.79%), of which the vast majority were MIRs (3.25%). DNA transposons accounted for 3.25% of sequence and these were largely comprised of the hAT-Charlie superfamily (1.75%).
As 75.9% of scaffolds were chromosomally assigned, annotation this gave a better idea about the gene content and gene density of individual chromosomes (see Table 3 and Figure 2). In total, 31,748 predicted genes were assigned to chromosomes in VicPac3.1. The highest number was assigned to chr11 (1,958 genes), followed by chr2 and chr1 (Table 3 and Figure 2A). Chromosomes 1 and 2 had the longest assemblies, thus it was unsurprising that they contained the most predicted genes. On average, there was a predicted gene every 103 kb of chromosomally assigned sequence, with chr22 being the most gene dense (34.3 genes/Mb) while the most gene sparse chromosome was chr27 with 0.5 genes per Mb (Table 3 and Figure 2B). These numbers, however, are expected to change when the annotation improves and more of the currently unassigned scaffolds will be mapped to chromosomes.

Highlights of Selected Features of the Alpaca Genome
The Major Histocompatibility Complex (MHC) We specifically examined the sequence of the alpaca MHC and characterized MHC organization and gene content in relation to other camelids and cetartiodactyls. The region encodes many proteins of the innate and adaptive immune systems and contains the key immune response genes for hostpathogen interactions (Trowsdale, 1995;Kelley and Trowsdale, 2005;Plasil et al., 2016;Viluma et al., 2017). In order to counteract the high variability of pathogens and pathogenderived molecules, the MHC has evolved into one of the most gene rich, highly polymorphic, and structurally complex regions of the mammalian genome, and is characterized by copy number variations and segmental duplications (Kelley and Trowsdale, 2005;Viluma et al., 2017). This complexity means that this relatively small region, which spans only approximately 4 Mb (Kelley and Trowsdale, 2005), is among the most challenging regions for genome assembly and annotation. The MHC is located in camelid chr20, as revealed by cytogenetic mapping of specific MHC loci in alpaca (Avila et al., 2014a) and dromedary (Plasil et al., 2016). In VicPac3.1, two large scaffolds of 21.1 Mb (ScfyRBE_77293) and 17.6 Mb (ScfyRBE_9351) ( Table 3 and Supplementary Table 2) were uniquely mapped to chr20. This resulted in a 38.7 Mb assembly for chr20, making it among the largest and most contiguous assemblies of the medium size alpaca chromosomes ( Table 3). The assembly of chr20 in VicPac3.1, also served as a good scaffold for placing dromedary and Bactrian camel assemblies on the chromosome, allowing for detailed comparison of the MHC region in New and Old World camelids (Figure 3).
The alpaca MHC spanned two separate scaffolds: Class I (723 kb) and Class III were in ScfyRBE_77293 and Class II (320 kb) was in ScfyRBE_9351 (Figure 3). The overall organization of the alpaca MHC relative to the centromeretelomere axis closely resembled the MHC of the dromedary and Bactrian camel (Plasil et al., 2016), i.e., centromere-Class II-Class III-Class I-telomere ( Figure 3A). This orientation was confirmed with the cytogenetic and sequence map position of the CRISP2 gene, which does not belong to MHC, maps very close to the centromere in chr20q (Avila et al., 2014a), and was found in scaffold ScyRBE_9351 together with MHC Class II sequences ( Figure 3A). The MHC organization, like that seen in camelids where all MHC genes are syntenic and Class III sequences are positioned between Class I and Class II sequences, is typical of all mammalian (Ruan et al., 2016a,b;Viluma et al., 2017) and many vertebrate species (Flajnik, 2018). Alpaca and camel Class II sequences seem to be present in one block as seen in humans, pigs and horses (Li et al., 2012;Viluma et al., 2017). This is in contrast to cattle, sheep and porpoise where Class II has been disrupted by a large inversion into IIa and IIb sub-regions that happened in the ancestral chromosome of these cetartiodactyl lineages (Childers et al., 2006;Gao et al., 2010;Li et al., 2012;Ruan et al., 2016a).
Further, we more closely inspected the gene contents and order of Class I and Class II genes in alpaca, Bactrian camel and dromedary (Figure 3). In general, these MHC regions were collinear in camelids, though a few differences between the species were observed. In Class I, seven genes were annotated in alpaca and Bactrian camel but nine genes in the dromedary, because of an expansion of HLA-A-like sequences in the latter ( Figure 3C). We speculate that the unique and specialized microbiomes of deserts (Bang et al., 2018) may have driven expansion of HLA-A in this genome. Further, no sequences of HLA-G corresponding to Class I heavy chain paralogs were found in alpaca, though these sequences are present in both camel species. In contrast Class I heavy chain paralogs HLA-E and HLA-B were present only in alpaca and not in camels. Class II contained 16 genes in alpaca and 17 genes in camels ( Figure 3B). The difference was due to the HLA-DRB1 locus, which was found in camels but not in the alpaca. Furthermore, an inversion has probably happened in the dromedary Class II changing the relative order of HLA-DRB1 and HLA-DRB4 in relation to that in the Bactrian camel. These minor differences in MHC gene content between camelids may be true, though it is equally plausible that they are due to difficulties associated with annotation of the highly variable MHC sequences. However, a general observation about the camelid Class II region was that even though the region is collinear between the three species, chromosome in camelid genomes for which human and other mammalian homology could not be revealed by Zoo-FISH (Balmus et al., 2007). Because of that, it was suggested that the chromosome is largely heterochromatic and contains very few genes if any at all. This hypothesis changed when two protein coding genes, ZPBP and VWC2, were mapped to chr36 by FISH revealing its homology to a segment in the short arm of human chr7 (HSA7p) at 49-56 Mb (Avila et al., 2014a(Avila et al., , 2015. The second and perhaps even more important reason for interest in chr36 is its involvement in the Minute Chromosome Syndrome (MCS) (Avila et al., 2015). This is a unique chromosomal abnormality in alpacas and llamas with no counterpart in any other mammalian species. The condition is invariably associated with abnormal sexual development and infertility in females and curiously, it is recurrent (Avila et al., 2014b;Fellows et al., 2014;Raudsepp and Chowdhary, 2016). Cytogenetic manifestation of MCS is a dramatic size difference between the homologs of chr36, and because it was thought that the smaller homolog is abnormal, the condition was named minute. Now it is known that the abnormal one is the larger chr36, because it carries a massive nucleolus organizer region (NOR), which is not found in chr36 in normal alpacas (Avila et al., 2015). Molecular causes of MCS are, however, poorly understood and an improved reference assembly for chr36 is an important resource to study the condition at molecular level. In VicPac3.1, chr36 is represented by three scaffolds with a total size of about 5 Mb (Table 4). Altogether, we predicted The order of genes in the table follows their relative order within chr36 scaffolds.

genes in chr36, of which 23 have known orthologs in HSA7p
showing that alpaca chr36 shares homology to a ∼12 Mb region in HSA7p (Table 4).

Adaptations to High Altitude
Alpacas are adapted to high altitude and low oxygen environments, and therefore different evolutionary forces must have shaped their genomes as compared to dromedary and Bactrian camels, the desert species. Therefore, we specifically aimed to identify in the alpaca genome candidate genes for high altitude adaptations. We selected 20 genes for which signatures of positive selection have been reported in other high altitude species (Table 5). Through the application of d N /d S substitution ratio ω (see Material and Methods; Supplementary Table 8), we investigated whether any of these genes exhibit signals of selection in camelids. Nine high altitude adaptation genes exhibited sites that were under negative (purifying) selection in the alpaca compared to other camelids ( Table 5), suggesting selection to remove deleterious mutations that might alter gene function. Three genes in this group, EPAS1, EGLN1, and PPARA regulate or are regulated by hypoxia inducible factor 1α (Hif-1α), which is a master regulator of the cellular response to hypoxia (Qiu et al., 2012;Simonson et al., 2012). All three genes are known to be involved in high altitude adaptations in dogs (Gou et al., 2014), humans (Beall, 2014;Bigham and Lee, 2014;Jeong et al., 2014), and EPAS1 also in Tibetan snakes (Li et al., 2018). EPAS1 genotypes have been associated in several studies with the dampened hemoglobin phenotype, while noncoding variants in and around EPAS1 and EGLN1, are strongly associated with a reduced blood concentration of hemoglobin in Tibetan highlanders (Beall, 2014;Bigham and Lee, 2014;Gou et al., 2014). Under purifying selection in alpacas are also genes encoding for ARG2 and ADAM17 proteins, which both affect Hif-1α stability and activity (Qiu et al., 2012). Alleles of human ADAM17 are present at significantly different frequencies in Tibetans compared to low-altitude dwellers (Simonson et al., 2012). The NFE2L2 gene has unique amino acid residue changes in the dromedary and Bactrian camel genomes and is correlated with the oxidative stress response (Wu et al., 2014). In the present analysis, this gene exhibits signatures of purifying selection in alpaca, but not in camels (Table 5 and  Supplementary Table 8). Among the candidate genes tested, only PPARA and EHHADH were under positive selection in alpacas but not in camels, showing significant higher branch specific ω value (Supplementary Table 8). Signatures of both purifying and positive selection were found in different regions of NOTCH4 and RBPJ, with both genes suggested to be involved in the regulation of responses to hypoxia in deer mouse (Scott et al., 2015) and pig (Jia et al., 2016).

Genes Involved in Fiber Color and Quality
Alpacas produce one of the most highly prized natural fibers in the world. This fiber comes in a large range of natural colors, which is a significant point of differentiation with fine fiber sheep, such as Merinos. The key mammalian color genes MC1R (melanocortin 1 receptor) and ASIP (agouti signaling protein) have also been found to regulate alpaca fiber color (Feeley and Munyard, 2009;Feeley et al., 2011). Interestingly, although the donor of the DNA used for the alpaca genome (Carlotta ;  Table 1) was fawn, the ASIP gene in all VicPac genomes, including VicPac3.1, contains a 57 bp deletion in exon 4 associated with loss of function of ASIP, and black color. However, this is counteracted by epistatic interaction of MC1R, which is homozygous for the alternative allele at two of the three known loss of function SNPs (Feeley and Munyard, 2009), and which prevents the expression of eumelanin (black pigment). Importantly, MC1R is correctly annotated in VicPac3.1 vs. VicPac2, in which it was misnamed and annotated as having three exons instead of one. It was recently shown that alpaca and camel MC1R maps by FISH to chr21 (Alshanbari et al., 2019) and not to chr9 as anticipated by Zoo-FISH (Balmus et al., 2007) and FISH mapping orthologs from HSA16q (Avila et al., 2014a). The location of MC1R in alpaca chr21 is supported by VicPac3.1, showing that chr21 shares conserved synteny with both HSA1 and the terminal part of HSA16q ( Table 6). The annotation of other important mammalian color genes such as Tyrosinase related protein 1 (TYRP1), dopachrome tautomerase (DCT), The order of genes in the table follows their relative order in corresponding scaffolds.
The new assembly also improved sequence quality, annotation and chromosomal assignment of keratin (KRT) and keratinassociated protein (KRTAP) genes, some of which are the primary candidates for fleece and fiber quality (Allain and Renieri, 2010). Like in other mammals, alpaca KRT and KRTAP genes are clustered in gene families and located predominantly in chr12 (25 KRT genes) and chr16 (22 KRT genes and 2 KRTAPs) ( Table 8).

Summary
Reference assembly VicPac3.1 with its improved accuracy, contiguity, chromosomal anchoring and preliminary annotation, constitutes a key resource for understanding the architecture of the alpaca genome, and is critical for the discovery of genetic blueprints of diseases/disease resistance, congenital disorders and traits of biological significance. It will provide a strong basis for whole genome population-scale studies in alpacas and other South American camelids, and for comparative genomics among camelids and with other mammals. High quality assembly is also the prerequisite for in depth functional annotation of the alpaca genome in the future, similar to the FAANG initiatives that are ongoing in other domestic species (Andersson et al., 2015).

Ethics Statement
Procurement of blood and tissue samples followed the United States Government Principles for the Utilization and Care of Vertebrate Animals Used in Testing, Research and Training. These protocols were approved as AUP #2011-96, # 2018-0342 CA and CRRC #09-47 at Texas A&M University.

Samples and DNA Isolation
Peripheral blood was procured from a female Huacaya alpaca Nyala's Accoyo Empress Carlotta -the same DNA donor that was used for the assemblies VicPac1 and VicPac2. Blood DNA was isolated using a Gentra Puregene Blood Kit (Qiagen), following the manufacturer's protocol, and evaluated for quality and quantity on the Agilent 2200 TapeStation. This showed that the gDNA was of high quality and high molecular weight (i.e., fragments larger than 50,000 bp) and suitable not only for Illumina sequencing, but also for both long-read Pacific Biosciences (PacBio) sequencing and constructing Chicago R libraries for HiRise TM scaffolding (Dovetail Genomics).

Genome Sequencing and Assembly
Two sequencing libraries were generated: a shotgun pairedend library (fragment size 200 bp) and a mate-pair library (2-5 kb) which were sequenced across 8 lanes on the Illumina 2500 platform (5 lanes paired-end and 3 lanes mate-pair; both 2 × 100 bp) commercially at Macrogen Inc. (SK) to generate short-read sequence data with > 100X genome coverage. Additionally, one PacBio SMRT cell library was constructed and sequenced across 20 SMRT cells on the RSII platform to generate long-read data with 5-6X genome coverage; conducted commercially by PacBio Sequencing Services at the University of Washington. Short reads were filtered for quality, adaptors removed and filtered for a minimum length of 60 bp using Trimmomatic v0.33 (Bolger et al., 2014) with ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 SLIDINGWINDOW:4:25 MINLEN:60. The final genome assembly was produced through a multi-stage process. First, we generated a hybrid assembly with MaSuRCA v3.2.1 (Zimin et al., 2013) using default parameters, using all the paired-end and mate-pair Illumina data, ∼22X Roche 454 data from VicPac2.0 and ∼3X Sanger data from VicPac1 (both available via PRJNA30567). The hybrid assembly was designated as Qmas1. This was further developed into a meta-assembly guided by the VicPac2 assembly and Illumina mate-pair alignments using Metassembler (Wences and Schatz, 2015) with MUMmer4 (Marcais et al., 2018) and the following parameters: MateAn_A = 2000, MateAn_B = 3000, nucmer_l = 50, nucmer_c = 300) in order to generate a more contiguous assembly for the alpaca genome. This assembly was designated as Qmas1/VicPac2. Using the ∼5X PacBio consensus sub-reads and Illumina mate-pair data, we scaffolded the Qmas1/VicPac2 contigs using OPERA-LG v.2.05 (Gao et al., 2016); assembly designation VicPac3.0. The final assembly, VicPac3.1, was obtained by constructing two 2X 151 bp Chicago R libraries which were then subjected to HiRise TM scaffolding along with VicPac3.0 and the mate-pair libraries, and this was done by Dovetail Genomics (United States).

Chromosomal Assignment
Sequence scaffolds were anchored to alpaca autosomes and the X chromosome with the help of alpaca cytogenetic markers (Avila et al., 2014a(Avila et al., ,b, 2015. Sequences of the overgo and PCR primers that were used for FISH analysis were mapped to VicPac3.1 scaffolds using BBMap v35 11 with the following parameters: pairedonly = t, minid = 0.97 and pairlen = 500, which generated no ambiguous mappings, retaining those with 97% identity (equivalent of 1 bp mismatch) and that the primers were mapped in the correct orientation. Overgo primers were mapped with the same parameters, but allowing for 95% identity matches. We considered scaffolds anchored when the primers uniquely mapped to one scaffold and the primers mapped in the correct orientation.

Preliminary Genome Annotation
A preliminary annotation of the VicPac3.1 was produced primarily for comparative assessment. We used AUGUSTUS v3.3.1 (Hoff and Stanke, 2018) for gene model prediction. First, gene hints were built using VicPac2.0.2 (GCA_000164845.3), human (GCA_000001405.27), cow (Bos taurus; GCA_000003055.3), dromedary (Camelus dromedaries; GCA_000767585.1), and Bactrian camel (GCF_000311805.1 and GCF_000767855.1) peptides. Peptides were mapped to the alpaca genome assembly using BLAT v. 36x2 (Kent, 2002) with default parameters, then converted to hints with blat2hints.pl, taking the best two matches to every peptide. AUGUSTUS was then run with these hints, and human was used as the training species (closest pre-trained species). Finally, the predicted gene models were confirmed by mapping testis and skin RNA-Seq data to VicPac3.1 with STAR v 2.5 (Dobin et al., 2013) in two-pass mode with default parameters and checking correct junctions in the STAR junction files and well-known gene models in IGV (Robinson et al., 2011;Thorvaldsdottir et al., 2013).
The VicPac3.1 scaffolds anchored to chr20 were used to anchor dromedary and Bactrian scaffolds to chr20, using reciprocal besthit Blast v2.2.26+, blastn implemented with default settings. Pairwise comparative alignments were conducted for anchored chr20 scaffolds using MAUVE v 2.4.0 (Darling et al., 2004) with default settings for alpaca (VicPac3.1) vs. dromedary, alpaca vs. Bactrian, alpaca vs. cow and alpaca vs. sheep. Cow and sheep genome assemblies are already chromosomally assigned so we used their respective chr20 sequence fasta. We compared Major Histocompatibility complex (MHC) gene synteny among camelid chr20 (alpaca, dromedary and Bactrian camels), using orthology and syntenic position between anchoring orthologs. All MHC and MHC-like genes (any gene with a blast e-value less than 1e-20 to any human MHC gene) in the MHC Class I and MHC Class II syntenic regions were annotated with respect to human peptide best matches.

Positive Selection
To investigate whether candidate genes involved in adaptation to high altitudes exhibit signals of selection among the Camelidae, coding sequences (CDS) of 23 genes previously identified as potentially having a role in high altitude adaptation (EGLN1, EPAS1, PPARA, IKBKG, KLF6, RBPJ, SF3B1, EFEMP1, HOXB6, ATF6, ADAM17, MMP3, ARG2, ERP44, NFE2L2, MGST2, AQP1, AQP2, AQP3, CKMT1, EHHADH, ACAALA, NOTCH4) were extracted from the VicPac3.1 and from dromedary and wild and domesticated Bactrian camel assemblies. Multiple sequence alignments were conducted using GUIDANCE2 (Sela et al., 2015) with default quality cutoffs, codon alignments with PRANK (Loytynoja and Goldman, 2010) as the MSA program specified with the -F parameter. We used the longest CDS of a gene for alignment when there was more than one per species. Signatures of selection were searched with two d N /d S based tests using HyPhy 15 3pc . First, the aBSREL (Smith et al., 2015) branch-site model, which tests if each branch in the phylogeny has a proportion of sites evolving under positive selection, as we tested all branches we FDR corrected the likelihood-ratio test p-values. Second, the FEL (Kosakovsky Pond and Frost, 2005) model which assumes selective pressure is constant for each site across the phylogeny and calculates whether the nonsynonymous (d N ) substitution rate is significantly different from the synonymous (d S ) rate, using the likelihood ratio test.

Transcriptome Sequencing and Analysis
High quality (RIN > 9.6) RNA was extracted from the testis of one normal male alpaca and one normal male llama using PureLink RNA Mini Kit (Ambion). The RNA was converted into cDNA with NEXTflex Rapid Directional qRNA-Seq kit (BIOO), prepared into 2 × 100 bp PE TruSeq libraries (Illumina), and sequenced on Illumina HiSeq2500 platform. We obtained, on average, 90 million PE reads per sample. The RNA from the skin samples was prepared as reported in Cransberg Ph.D. Thesis (Cransberg, 2017). Briefly, skin biopsies were collected from 20 white, 20 brown and 5 black alpacas, the RNA was extracted using Trizol reagent and the FastPrep system (Thermo Life Sciences) and an RNeasy Kit (Qiagen). After confirmation of RNA quality (Bioanalyser; Agilent) three equi-molar pools of RNA were prepared (one for each color). Sequencing libraries were prepared using an Illumina Tru-seq RNA kit, and sequenced on a single lane of an Illumina Genome Analyser GAIIx to generate 54 bp PE reads.

Genome Size Estimation
Genome size was first estimated using filtered short-read k-mer distributions. k-mer frequencies were calculated using Jellyfish v2.2.8 (Marcais and Kingsford, 2011) with canonical k-mers, for a range of k-values (17, 21, 25, and 31). These k-mer distributions were then analyzed in Genoscope 16 (Vurture et al., 2017) with a maximum k-mer coverage of 1,000 and -1 (where -1 is no maximum coverage). Genome size was also estimated by flow cytometry using the protocol described elsewhere (Zhu et al., 2012) with a modification that the concentration of RNAse was doubled (200 µg/ml). Briefly, primary fibroblast cell lines of 2 alpacas and 2 horses were cultured in T25 culture flasks until 100% confluency. The cells of each individual were trypsinized, washed 6 times in PBS, fixed in cold 70% ethanol and stained with Propidium Iodine. The stained cells were analyzed on a BD Accuri TM C6 personal flow cytometer separately for each animal. Results were gated in order to prevent exogenous DNA from lysed cells from affecting the results. Peaks were observed based on the amount of PI absorbed by each cell population (Supplementary Figure 2). Both horses and one alpaca were measured on 3 separate occasions, the other alpaca was only measured once due to a limited number of cells. The average median PI concentration and a 95% CI was calculated for the horse an alpaca, respectively, using all measurements available. The genome size was estimated using the formula: Size alpaca = PI alpaca /PI horse (2.7 Gb). Where PI alpaca denotes the median amount of PI absorbed by alpaca cells; PI horse denotes the median amount of PI absorbed by horse cells; 2.7 Gb is the expected size of the horse genome (Wade et al., 2009).

ETHICS STATEMENT
Procurement of blood and tissue samples followed the United States Government Principles for the Utilization and Care of Vertebrate Animals Used in Testing, Research and Training. These protocols were approved as AUP #2011-96, # 2018-0342 CA and CRRC #09-47 at Texas A&M University.

AUTHOR CONTRIBUTIONS
TR, BA, and PP designed and initiated the project. MR, KM, LC, FJ, TA, FA, MJ, GW, RC, and TR conducted the experimental work. MR, KM, FJ, LC, and TA carried out the genome assembly, annotation, and bioinformatics analyses. MJ and FA contributed to the testis transcriptome and data analyses. TR, AT, RC, and KM collected the samples for genome and transcriptome sequencing. TR, MR, KM, and LC wrote the manuscript with input from all authors.

FUNDING
This study was supported by grants from the Morris Animal Foundation (D09LA-004, D14LA-005) and the Alpaca Research Foundation; funds from the Curtin University, School of Biomedical Sciences supported PacBio and Dovetail sequencing and skin transcriptome analysis; the authors highly appreciate donations to ARF and MAF by Leslie Herzog of Herzog Alpacas.