Draft Genome Assembly for the Tibetan Black Bear (Ursus thibetanus thibetanus)

Species in the bear genus are well-known for their adaptation to various environments; examples are the polar bear from the Arctic (Thiemann et al., 2008), the sun bear from rainforests (Linkie et al., 2007), and the Tibetan black bear, which inhabits the Qinghai-Tibet Plateau (QTP) (Liu, 2004). The QTP is known as the “Third Pole” since it is the largest and highest plateau in the world, with an average altitude of 4,500m above sea level. Like the polar regions, the environment of the QTP is inimical to living organisms due to the low temperature and low level of oxygen. Large mammals, such as humans and taurine cattle, usually suffer from severe pulmonary hypertension when inhabiting high altitudes (Wu et al., 2007). Previous studies on indigenous species of the QTP have found many adaptive changes in the genes and pathways related to hypoxia responses, oxygen transport, etc. (Qiu et al., 2012). However, when bears arrived on the QTP and how they have been able to survive and thrive there remains largely unknown. The Tibetan black bear (Ursus thibetanus thibetanus) is a subspecies of the Asiatic black bear (Ursus thibetanus) and is one of the four known bear species living on the QTP (Lan et al., 2017). Tibetan black bears are typical forest animals. They are omnivorous but feed mainly on plants. Their sense of smell and hearing are very sensitive. Most Tibetan black bears rely on hibernation to survive the cold and lack of food in winter (Liu, 2004). In recent years, due to the expansion of human activity and poaching, the Tibetan black bear has experienced a rapid population reduction. It is now classified as a vulnerable species by the International Union for Conservation of Nature (IUCN). Here, we report a high-quality genome assembly for the Tibetan black bear based on 10× genomic sequencing technology. This assembly and associated resources will serve as an important genetic resource for specialize, species protection, and the evolution of the bear genus.


INTRODUCTION
Species in the bear genus are well-known for their adaptation to various environments; examples are the polar bear from the Arctic (Thiemann et al., 2008), the sun bear from rainforests (Linkie et al., 2007), and the Tibetan black bear, which inhabits the Qinghai-Tibet Plateau (QTP) (Liu, 2004). The QTP is known as the "Third Pole" since it is the largest and highest plateau in the world, with an average altitude of 4,500 m above sea level. Like the polar regions, the environment of the QTP is inimical to living organisms due to the low temperature and low level of oxygen. Large mammals, such as humans and taurine cattle, usually suffer from severe pulmonary hypertension when inhabiting high altitudes (Wu et al., 2007). Previous studies on indigenous species of the QTP have found many adaptive changes in the genes and pathways related to hypoxia responses, oxygen transport, etc. (Qiu et al., 2012). However, when bears arrived on the QTP and how they have been able to survive and thrive there remains largely unknown.
The Tibetan black bear (Ursus thibetanus thibetanus) is a subspecies of the Asiatic black bear (Ursus thibetanus) and is one of the four known bear species living on the QTP (Lan et al., 2017). Tibetan black bears are typical forest animals. They are omnivorous but feed mainly on plants. Their sense of smell and hearing are very sensitive. Most Tibetan black bears rely on hibernation to survive the cold and lack of food in winter (Liu, 2004). In recent years, due to the expansion of human activity and poaching, the Tibetan black bear has experienced a rapid population reduction. It is now classified as a vulnerable species by the International Union for Conservation of Nature (IUCN).
Here, we report a high-quality genome assembly for the Tibetan black bear based on 10× genomic sequencing technology. This assembly and associated resources will serve as an important genetic resource for specialize, species protection, and the evolution of the bear genus.

DATA
The 10× genomic sequencing technology was applied in this project. After filtering out the redundant and low-quality reads, a total of 270 Gb bases were used for the following analysis ( Table 1 and Table S1). The draft genome has a size of 2.37 Gb, which is close to the estimated genome size of 2.36 Gb (Table 1, Figure 1A, Table S2). We aligned the sequenced reads with the reference genome and found that most bases (98.42%) were covered by more than 20 reads ( Figure 1B). The genome assembly had the longest scaffold of 78,658,804 bp and scaffold N50 and contig N50 sizes of 26.80 Mb and 145.97 Kb, respectively ( Table S2). The longest 27 and 107 scaffolds occupied 50% and 90% of all sequences, suggesting a high level of continuity.Using the mammalian gene set, BUSCO integrity assessment gave a value of 95.9% (Table S3), indicating a high level of completeness.
To further check the quality of this genome assembly, we aligned the Tibetan black bear genome with the dog (Canis lupus familiaris) genome ( Figure 1C), which was the closest species with a chromosome-level assembly currently available. We found that some scaffolds from the Tibetan black bear genome essentially matched the dog genome at the chromosome level; for example, Scaffold00001, Scaffold00008, and Scaffold00021 corresponded to chr22, chr23, and chr31 of the dog genome assembly, respectively. We also observed evidence for several potential chromosome rearrangement events in the Tibetan black bear, including chr13 and chr28 in the dog, which appeared to lie together in the Tibetan black bear genome (Table S4).
About 41.01% of the genome assembly was estimated to be repetitive sequences, a value similar to those in the panda and dog, and LINEs occupied the largest percentage (25.02%) of repeat regions (Table S5). Using a combination of de novo prediction and homologous prediction, we found a total of 18,304 protein-coding genes ( Table 1); of these, a total of 17,814 (97.32%) genes were found to have functional annotation in at least one of the databases mentioned above (Table S6). We compared our gene set with those from other species with respect to average gene length, average CDS length, average exon number, average exon length, and average intron length; all results were comparable to those from other species ( Table S7). The BUSCO results also indicated the completeness of our gene set, with a BUSCO value of 95.4% (complete = 95.4%, single = 94.0%, duplicated = 1.4%, fragmented = 2.6%, missed = 2.0%, genes = 4,104).
We compared the gene sets of the Tibetan black bear, panda, polar bear, and brown bear and identified a total of 18,927 gene families. Within them, we found 460 species-specific genes in the Tibetan black bear, which was similar to the numbers in other bear species ( Figure 1D). Using 5,229 1:1 orthologous genes between nine species, we reconstructed their phylogenetic relationships ( Figure 1E). The Tibetan black bear was found to be an outgroup relative to the American black bear, the brown bear, and the polar bear, indicating a paraphyletic relationship for the black bear species, which is in concordance with a previous study . The divergence time between the Tibetan black bear and the common ancestor of the polar bear, brown bear, and American black bear was estimated to take place at 9.27 million years ago ( Figure 1E). Here, we found that the population history of the polar bear and American black bear was consistent with the previous research (Miller et al., 2012). Interestingly, we found that Tibetan black bear once experienced a significant population growth compared to other bear species after 100,000 years ago ( Figure 1F).

Sample Collection, Library Construction, and Sequencing
We obtained muscle samples from a naturally dying female Tibetan black bear in the Lang county of Tibet, which has an average elevation of about 3,000 m, in 2018. The samples were thoroughly ground, and high molecular weight DNA was isolated using a RecoverEase DNA Isolation Kit (Agilent, PN 720203, San Clara, USA) with minor modifications to the manufacturer's protocol. We then followed the standard process for construction of a Chromium library. The 150 bp pair-end DNA nanospheres used for sequencing were prepared according to a previously published protocol (Drmanac et al., 2010), and the standard cyclic amplification procedure was performed on a NovaSeq 6000. The final raw sequencing file was converted to fastq format using the Supernova v2.0.0 software package (Weisenfeld et al., 2017) with default parameters for subsequent analysis ( Figure S1).

Genome Size Estimation and Genome Assembly
To predict the size of the Tibetan black bear genome, we firstly used Jellyfish v2.2.6 (Marçais and Kingsford, 2011) to count the kmer frequency with a kmer size of 39. GenomeScope web tools (parameters: Kmer length 39, Read length 150) (Vurture et al., 2017) were employed to predict the genome size, and the resulting value was around 2.36 Gb. The draft genome was assembled by the Supernova v2.0.0 software using the recommended pipeline (parameters: mkoutput -style = pseudohap -minsize = 500). Burrows-Wheeler Aligner (BWA, v0.7.15) (Li and Durbin, 2009) was used to confirm the singlebase accuracy of the draft assembly with default parameters. A Benchmarking Universal Single-Copy Orthologs (BUSCO, v3.0.2) (Simao et al., 2015) analysis was applied to evaluate the completeness of the gene set in our draft genome with the library "mammalia_odb9".

Conversation of Syntenic Relationship With Dog Genome
To validate the completeness of our assembly, we compared our scaffolds to the dog chromosome set (GCF_000002285.3) 1 . Specifically, those 74 Tibetan black bear scaffolds at least 10 Mbp in length (total length ∼1.92 Gb) were aligned to the dog chromosomes using LAST v942 (parameters: -E0.05) (Kielbasa et al., 2011). The best one-to-one sequence alignment for each was identified and extracted with a length of at least 10 Kb. The start and end positions of the 74 matching scaffolds that fit these criteria were then used to generate a CIRCOS plot (http://circos. ca/, accessed at Mar. 18, 2018) (Krzywinski et al., 2009), which showed regions of collinearity as well as rearrangements.
One-to-one orthologous gene families were detected based on the gene clusters from these nine species using the result of OrthoFinder. To reconstruct the phylogenetic relationships among these animals, multiple alignments of the genes within each single-copy gene family were performed using MUSCLE 3.8.31 (Edgar, 2004), and then the Perl script pan2nal.pl (Suyama et al., 2006) was used to convert amino acid alignment to codon alignment. Only 4-fold-degenerate sites were extracted in order to reconstruct a precise phylogenetic tree. After alignment with ClustalW2 (parameters: -CONVERT -TYPE=DNA) (McWilliam et al., 2013), the phylogenetic analysis was performed in RAXML 8.2.4 (parameters: -m GTRGAMMA -f a -x 12345 -N 100p 12345) (Stamatakis, 2014) for each alignment. Divergence time estimations were determined using MCMCTree (part of the PAML package) (Yang, 2007) with a correlated rate clock model. The calibration time was obtained from www.timetree. org (accessed at May 30, 2019), which gave human/mouse, human/dog, dog/tiger, and dog/brown bear divergence times as 85-97, 91-101, 51-56, and 42-48 million years ago (Mya), respectively. The analysis was run twice to make sure that the result was reliable.

Population History
Pairwise sequential Markovian coalescent (PSMC) analysis (Li and Durbin, 2011), based on a hidden Markov model, was used to estimate the history of effective population sizes based on genome-wide heterozygous sequence data. The BWA-MEM algorithm (BWA v0.7.15) (Li, 2013) was used to align the clean genome sequence reads to the reference genome with default parameters. Then, by using a combination of SAMtools/BCFtools  with PSMC recommend parameters, we obtained the consensus diploid sequences. Finally, PSMC software was run with 100 bootstraps to generate the effective population history plot. We also downloaded polar bear (SRR942298) 9 , giant panda (SRR504890) 10 , American black bear (SRR518723) 11 , and brown bear (SRR935621) 12 resequencing data from NCBI. The same pipeline was run. The generation time and substitution rate were 11 years and 1.82e-8 per site per generation, respectively (Benazzo et al., 2017).

DATA AVAILABILITY STATEMENT
The whole genome sequencing data were submitted to the NCBI Sequence Read Archive (SRA) database with accession number SRP224444 13 and Bioproject accession PRJNA573607. The assembled draft genome of Tibetan black bear has been deposited at GenBank with accession WEIE00000000. The annotation files of Tibetan black bear have been published 9 NCBI Sequence Read Archive, https://www.ncbi.nlm.nih.gov/sra/SRR942298 (2015). 10

ETHICS STATEMENT
Ethical review and approval was not required for the animal study because The Tibetan black bear that died of natural causes was not applicable for animal ethic. Usage of the tissue samples in our study was approved by the government of China.

AUTHOR CONTRIBUTIONS
XC and JH designed and supervised the project. JL, MH, YZ, and KY prepared the samples. CZ, WX, CL, XS, and YY analyzed the data. CZ, WX, and JL wrote the manuscript with other authors' help and XC and JH revised the manuscript. All authors read and approved the final manuscript.

FUNDING
This study was supported by Research Funds for Interdisciplinary subject, NWPU (19SH030408).