A Method for the Analysis of African Swine Fever by Viral Metagenomic Sequencing

Ji, ChiHai; Jiang, JingZhe; Wei, YingFang; Wang, ZhiYuan; Chen, YongJie; Mai, ZhanZhuo; Cai, MengKai; Qin, ChenXiao; Cai, Yu; Yi, HeYou; Liang, Guan; Lu, Gang; Gong, Lang; Zhang, GuiHong; Wang, Heng

doi:10.3389/fvets.2021.766533

ORIGINAL RESEARCH article

Front. Vet. Sci., 23 November 2021
Sec. Veterinary Infectious Diseases
Volume 8 - 2021 | https://doi.org/10.3389/fvets.2021.766533

ChiHai Ji^1,2,3,4,5^†

JingZhe Jiang^6^†

YingFang Wei^1,2,3,4,5

ZhiYuan Wang^1,3,4,5

YongJie Chen^1,2,3,4,5

ZhanZhuo Mai^1,3,4,5

MengKai Cai^1,3,7

ChenXiao Qin^1,3,4,5

Yu Cai^1,3,4,5

HeYou Yi^1,3,4,5

Guan Liang^1,3,4,5

Gang Lu^1,3

Lang Gong^1,3,4,5

GuiHong Zhang^1,2,3,4,5^*

Heng Wang^1,2,3,4,5^*

¹Guangdong Provincial Key Laboratory of Prevention and Control for Severe Clinical Animal Diseases, College of Veterinary Medicine, South China Agricultural University, Guangzhou, China
²Shenzhen Kingkey Smart Agriculture Times Co., Ltd., Shenzhen, China
³Key Laboratory of Zoonosis Prevention and Control of Guangdong Province, Guangzhou, China
⁴Guangdong Laboratory for Lingnan Modern Agriculture, Guangzhou, China
⁵National Engineering Research Center for Breeding Swine Industry, Guangzhou, China
⁶Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, China
⁷Guangdong Meizhou Vocational and Technical College, Meizhou, China

In 2018, there was an outbreak of African swine fever (ASF) in China, which spread to other provinces in the following 3 years and severely damaged China's pig industry. ASF is caused by the African swine fever virus (ASFV). Because the ASFV genome is very complex and whole genome information is currently inadequate, it is important to obtain virus genome sequences efficiently for genomic and epidemiological studies. The prevalent ASFV strains have low genetic variability; therefore, whole genome sequencing analysis provides a basis for the study of ASFV. We provide a method for the efficient sequencing of whole genomes, which requires only a small amount of tissue. The database construction method was selected according to the genomic types of ASFV, and the whole ASFV genome was obtained through data filtering, host sequence removal, virus classification, data assembly, virus sequence identification, statistical analysis, gene prediction, and functional analysis. Our proposed method will facilitate ASFV genome sequencing and novel virus discovery.

Introduction

ASF is a highly contagious and fatal swine disease. The pathogen of ASF is the African swine fever virus (ASFV), which is the only member of the family Asfarviridae and the only known DNA arbovirus (1, 2). Depending on the strain, ASFV has a large (170–193 kbp) double-stranded DNA genome containing 151–167 genes, which are involved in viral replication and assembly as well as in modulating host cellular functions and immune evasion (3). The virus can be transmitted through direct contact with infected swine, their products, and soft ticks of the genus Ornithodoros (4). ASF was first described in Kenya in 1921; it then spread to other African, European, Caribbean, and South American countries (3, 5, 6). The disease was introduced into Georgia in 2007 and then spread throughout Eastern Europe, including Russia, Belarus, Ukraine, Estonia, Lithuania, Latvia, Romania, Moldova, Czech Republic, and Poland (7–9). In August 2018, China reported its first ASF outbreak. Within 1 year, ASFV spread rapidly into all provinces in mainland China. The spread of ASF has resulted in huge losses to the Chinese pig industry (3).

ASFV has high genetic and antigenic diversity. Based on the p72 protein (B646L), 24 genotypes have been identified, while at least eight serotypes are recognized based on hemadsorption inhibition (10, 11). The spread of ASFV in China is a serious threat to the diversity and survival of pigs. To facilitate much-needed epidemiological investigations, advance research, and support vaccine development, a simple and reproducible method for full genome sequencing of ASFV is needed (12). Initially, we used first-generation sequencing technology to sequence the whole ASFV genome, which was time-consuming and labor-intensive. A faster method for sequencing the whole ASFV genome is therefore required, and second-generation sequencing is an important tool for sequencing large genomes, which is essential for effective emergency management in the event of disease outbreaks. Current methods of virus enrichment for second-generation sequencing are inefficient and time-consuming. By improving the enrichment method, we can increase the proportion of virus in the samples and provide more useful data for subsequent analysis. As the ASFV genome contains many homopolymers and repeat regions, the short-read data generated by second-generation sequencing platforms need to be processed carefully. ASFV genome rearrangements (such as inversions or duplications) may be missed when reads are compared with a reference sequence, and because the quality of the consensus sequence depends heavily on the reference sequence, the genome may be misassembled. After sequencing, data filtration, host sequence removal, virus classification, data assembly, virus sequence identification, statistical analysis of virus abundance, gene prediction, and functional analysis were performed. Through these steps, an accurate sequence is finally obtained. Our results confirm the feasibility of sequencing an ASFV genome directly from positive clinical tissues, and provide a basis for further epidemiological research and evolutionary analysis (13).

Materials and Methods

Experimental Materials

We obtained 0.45 μm and 0.22 μm PVDF filter membranes from Merck Display Materials Co., Ltd. (Shanghai, China), OptiPrep Density Gradient Medium from Guangzhou Fan-si Biotechnology Co., Ltd. (Guangzhou, China), DNase I from Promega Corporation (Madison, USA), BSA (prepared as a 1% BSA-SM solution and filtered through a 0.22 μm membrane), and gelatin from porcine skin from Sigma-Aldrich Trading Co., Ltd. (Shanghai, China). We also used an ultracentrifuge (Beckman Coulter, USA), the HiPure Viral DNA Kit (D3191) from Majorbio (Shanghai, China), and 10 × SM buffer (pH 7.5; 1 M NaCl, 100 mM MgSO4, 500 mM Tris, 0.1% gelatin; diluted with ultra-pure water to a 1 × working concentration). The OptiPrep density gradient solution was prepared by mixing OptiPrep stock solution with 10 × SM buffer (9:1).

Enrichment and Purification of Virus-Like Particles (VLPs)

Cases were identified from the Ministry of Agriculture and Rural Affairs of Huangpu District, Guangzhou (information released by the Ministry of Agriculture and Rural Affairs: http://www.moa.gov.cn/gk/yjgl_1/yqfb/201812/t20181223_6165395.htm). The spleen tissue was completely fixed with 10% formalin solution for 72 h. Two formalin-fixed pig spleen samples (1 g each) were taken, cut into small pieces with sterile scissors, and washed in 2 × sucrose-Triton washing solution. After centrifugation at 1,000 rpm for 15 min, the pellet was resuspended evenly in TE buffer (pH 9.0); SDS was added to a final concentration of 1% and proteinase K to 200 μg/mL, and the mixture was kept in a water bath at 48°C for 48 h. Then, 5–10 mL of pre-cooled 1 × SM buffer (filtered through a 0.22 μm membrane) was added. The spleen tissue was homogenized evenly and placed alternately in liquid nitrogen and a 37°C water bath three times. The homogenate was centrifuged at 4°C for 5 min at 1,000 rpm and 12,000 rpm in succession, and the supernatant was filtered through a 0.45 μm filter. Before use, the 0.45 μm filter membrane was wetted by passing 1% BSA solution through it. Then, qPCR was performed on each liquid layer using specific ASFV primers to determine the virus Cq value in each liquid layer. The liquid layers with low Cq values were selected, every two liquid layers were mixed, and 1 × SM buffer was added and mixed thoroughly. The solution was centrifuged at 160,000 g for 1 h, and the supernatant was discarded. Depending on the amount of pellet, 100–500 μL of 1 × SM buffer was added for resuspension (with gentle, repeated pipetting to avoid vigorous agitation). DNase I was added according to the manufacturer's instructions, with treatment at 37°C for 1 h. Single- and double-stranded DNA were digested simultaneously to fragment the DNA for library construction. EDTA (Promega Corporation, Madison, USA) was added according to the manufacturer's instructions to terminate the reaction. The HiPure Viral DNA Kit (D3191) (Majorbio, Shanghai, China) was used to extract viral DNA, and the concentration was measured. Sequencing libraries were generated using the NEBNext® Ultra™ DNA Library Prep Kit for Illumina® (New England Biolabs, MA, USA) following the manufacturer's recommendations, and index codes were added. Finally, the library was sequenced on an Illumina NovaSeq 6000, and 150 bp paired-end reads were generated. Each sample added 507,333,357 reads. The extracted DNA was sent to Guangdong Magigene Technology Co., Ltd. (Guangdong, China) for sequencing. All experiments involving ASFV were carried out in a biosafety level 3 (BSL-3) laboratory at South China Agricultural University (Guangzhou, China).

Data Filtering

After the metagenomic sequencing data of the sample were obtained, the quality of the sequencing data was evaluated and low-quality data were removed to ensure the credibility of the subsequent analysis results. High-quality sequences obtained by quality control were used for downstream data analysis. Quality control was performed with SOAPnuke (14), using the following steps: (1) removal of paired reads containing adapter sequences; (2) removal of paired reads in which the proportion of N bases (N: uncertain base information) in a single-ended read was >5%; (3) removal of paired reads in which low-quality bases (Q ≤ 20) made up >20% of the total bases of a single-ended read; (4) removal of duplicated reads produced by PCR amplification; (5) removal of polyX (A, T, C, or G) sequences.
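The five filtering rules above can be illustrated with a minimal Python sketch. This is not SOAPnuke and does not reproduce its implementation; it only mirrors the rules as described in the text. The function names, the exact-substring adapter check, the Phred+33 quality encoding, and the polyX cut-off (`poly_fraction=0.9`) are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str]  # (sequence, Phred+33 quality string); encoding is an assumption


def n_fraction(seq: str) -> float:
    """Fraction of uncertain (N) bases in a read."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def low_quality_fraction(qual: str, threshold: int = 20, offset: int = 33) -> float:
    """Fraction of bases whose Phred quality is <= threshold."""
    if not qual:
        return 1.0
    return sum(1 for ch in qual if ord(ch) - offset <= threshold) / len(qual)


def is_poly_x(seq: str, poly_fraction: float = 0.9) -> bool:
    """Hypothetical polyX rule: one base makes up >= poly_fraction of the read."""
    if not seq:
        return True
    s = seq.upper()
    return max(s.count(base) for base in "ACGT") / len(s) >= poly_fraction


def keep_pair(r1: Read, r2: Read, adapter: str,
              max_n: float = 0.05, max_low_q: float = 0.20) -> bool:
    """Apply rules (1), (2), (3) and (5); a pair is dropped if either read fails."""
    for seq, qual in (r1, r2):
        if adapter and adapter in seq:            # (1) adapter contamination
            return False
        if n_fraction(seq) > max_n:               # (2) more than 5% N bases
            return False
        if low_quality_fraction(qual) > max_low_q:  # (3) more than 20% bases with Q <= 20
            return False
        if is_poly_x(seq):                        # (5) polyX read
            return False
    return True


def filter_pairs(pairs: Iterable[Tuple[Read, Read]], adapter: str) -> Iterator[Tuple[Read, Read]]:
    """Yield clean read pairs; rule (4) treats identical sequence pairs as PCR duplicates."""
    seen = set()
    for r1, r2 in pairs:
        key = (r1[0], r2[0])
        if key in seen:
            continue
        seen.add(key)
        if keep_pair(r1, r2, adapter):
            yield r1, r2
```

Dropping the whole pair when either mate fails keeps the paired-end files synchronized, which downstream aligners and assemblers generally expect.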

Removal of Host Sequences

SOAPaligner (15) and BWA (16) were used to align clean reads to the specified host genome and to remove host sequences.

Virus Classification

The alignment tools SOAPaligner (15) and BWA (16) were used to align clean reads to the virus reference database to quickly obtain virus classification information for the samples.

Data Assembly

The assembly software IDBA (17), SPAdes (18), metaSPAdes, MEGAHIT (19), and Trinity (20) were used to assemble the high-quality reads of each sample into longer contig sequences. Specific software information is provided in the Results section. Then, the number, length, and N50 of the assembled sequences were calculated. BWA was used to align high-quality reads to the assembled sequences to calculate the proportion of reads used in the assembly, and assembly quality was evaluated using these statistics.
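As a brief illustration of the assembly statistics mentioned above, the following Python sketch computes the contig count, total length, and N50 from a list of contig lengths. N50 is taken here in its usual sense: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function names are hypothetical, and the sketch is not taken from any of the assemblers listed.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the cumulative length reaches half the total.
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(lengths: List[int]) -> Dict[str, int]:
    """Summarize an assembly by contig number, total length, longest contig, and N50."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```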

Virus Sequence Identification

A variety of methods [including BLAST, HMMSearch (21), and Metagenemark (22)] and databases [including NT, NR, VPFS (23), VFam (24), PFAM (25), and KEGG] were used to identify viral sequences. For reference-based annotation, the virus sequences were extracted from the NT database, and BLASTN was used to compare contigs with the resulting virus database for species annotation. For novel virus identification, contigs were compared with multiple databases, and a contig was retained as a candidate virus sequence if at least one of the following three conditions was satisfied: (1) a BLASTN (v2.9.0+) hit against the virus database extracted from NT (virus-NT, including phages) with e ≤ 1 × 10⁻⁵ (e: E-value); (2) a BLASTX (v2.9.0+) hit against the virus-NR database extracted from NR (including bacteriophages) with e ≤ 1 × 10⁻³; (3) genes predicted by Metagenemark (v3.38) whose protein sequences, compared with the HMM databases (VPFS and VFam) using HMMSearch (v3.2.1), gave hits with e ≤ 1 × 10⁻⁵.

Elimination of False Positives

The candidate virus sequences obtained above were compared with the NT database using BLAST (v2.9.0+) and screened at e ≤ 1 × 10⁻¹⁰. Sequences not aligned in this step were compared with the NR database using Diamond (v0.9.10) and screened at e ≤ 1 × 10⁻³. NCBI taxonomy data were used to annotate these alignment results. If more than 20% of the top 50 alignment results for a sequence were non-viral (annotated as Eukaryota, Bacteria, or Archaea), the sequence was considered non-viral; all remaining sequences were considered viral. Virus contigs were annotated according to their best hit against virus-NT (e ≤ 1 × 10⁻⁵).
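The 20% rule over the top 50 hits can be sketched in Python as follows. The sketch assumes that the alignment results have already been filtered by e-value, sorted from best to worst, and annotated with a superkingdom label from NCBI taxonomy; the function name and the input format are hypothetical. Treating a contig with no remaining hits as viral follows the literal wording ("all remaining sequences") and is an assumption.

```python
from typing import Sequence

NON_VIRAL = {"Eukaryota", "Bacteria", "Archaea"}


def is_viral(hit_superkingdoms: Sequence[str],
             top_n: int = 50,
             max_non_viral: float = 0.20) -> bool:
    """Classify a candidate contig from the taxonomy labels of its best hits.

    hit_superkingdoms: superkingdom label of each hit, best hit first.
    Returns False when more than max_non_viral of the top_n hits are non-viral.
    """
    top = hit_superkingdoms[:top_n]
    if not top:
        # Assumption: a contig without any hits is not rejected by this rule.
        return True
    non_viral = sum(1 for kingdom in top if kingdom in NON_VIRAL)
    return non_viral / len(top) <= max_non_viral
```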

Virus Abundance Statistics

Reads were aligned to the identified virus contigs, and the reads per kilobase per million mapped reads (RPKM) value of each contig was calculated for comparative analysis between samples.

\begin{array}{l} \mathrm{RPKM} = \frac{\text{Contig reads}}{\text{Total mapped reads (millions)} \times \text{Contig length (kbp)}} \end{array}

Note: (1) Contig reads: number of reads mapped to a contig; (2) Total mapped reads: total number of mapped reads, in millions; (3) Contig length: contig length, in kbp.
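The RPKM formula above translates directly into code. The short Python sketch below takes raw counts and a length in base pairs and performs the unit conversions (reads to millions, bp to kbp) internally; the function and parameter names are illustrative only.

```python
def rpkm(contig_reads: int, total_mapped_reads: int, contig_length_bp: int) -> float:
    """Reads per kilobase per million mapped reads for one contig."""
    if total_mapped_reads <= 0 or contig_length_bp <= 0:
        raise ValueError("total_mapped_reads and contig_length_bp must be positive")
    mapped_millions = total_mapped_reads / 1e6   # total mapped reads, in millions
    length_kbp = contig_length_bp / 1e3          # contig length, in kbp
    return contig_reads / (mapped_millions * length_kbp)
```

For example, a 10 kbp contig with 500 mapped reads in a sample with 2 million mapped reads has an RPKM of 500 / (2 × 10) = 25.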

Gene Prediction

Metagenemark (22) was used to predict the gene sequences of virus contigs, and the number and length of the predicted genes were evaluated.

Functional Analysis

The protein sequences of the predicted genes were compared with the virus sequences in the UniProtKB/Swiss-Prot database [ViralZone (26), reviewed proteins, https://viralzone.expasy.org/] to obtain functional annotation information.

ASFV Statistical Analysis

The ASFV contigs were identified from the final virus sequences and compared for analysis. MAFFT (27) software was used to perform multiple sequence alignment of the ASFV genome with the genomes of ASFV reference strains.

Results

DNA Extraction

After centrifugation, the products were transferred layer by layer into Eppendorf tubes, which were weighed before use and after each liquid layer was added. The weights of the liquid layers were recorded and the density of each liquid layer was calculated (Figure 1A) to assess the stratification. The DNA of each liquid layer was extracted with the OMEGA nucleic acid extraction kit, and then qPCR was performed on each liquid layer with specific ASFV primers to determine the virus Cq value in each liquid layer (Figure 1B). The results showed that the Cq values of layers 16–19 were the lowest, indicating that the virions were enriched in layers 16–19. Then, the two liquid layers adjacent to layers 16–19 were mixed, 1 × SM buffer was added, and the virions were fully mixed.

FIGURE 1

Figure 1. Density and Cq values of the gradient layers. (A) Density of each liquid layer after ultracentrifugation; (B) Cq value of each liquid layer detected by qPCR after ultracentrifugation. (A,B) show the same two ultracentrifuged samples.

Quality Control of Sequencing Data

To ensure the accuracy of subsequent analysis, SOAPnuke (v2.0.5) software was first used to process the raw data from the sequencer, and high-quality clean reads were obtained. The quality distribution is shown in Figure 2. The results showed that the average base quality (green line) at all positions was above 30, and the quality of the data was high enough to be used in the following analysis.

FIGURE 2

Figure 2. Base quality values of sequencing data from the Illumina sequencing platform. Quality values are represented by Qphred, where Qphred = −10 × log₁₀(e), with e representing the base sequencing error rate. The relationship between base-calling accuracy and Phred score in the Illumina base-calling software is as follows: a Phred score of 10 (Q10) corresponds to a correct base identification rate of 90%; a Phred score of 20 (Q20), to 99%; a Phred score of 30 (Q30), to 99.9%; and a Phred score of 40 (Q40), to 99.99%. The abscissa represents the position of the base in the read, the ordinate represents the base quality at each position, the left graph shows the data before quality control, and the right graph shows the data after quality control.
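The relationship Qphred = −10 × log₁₀(e) given in the caption of Figure 2 can be checked with a few lines of Python. The function names below are illustrative and are not part of any base-calling software.

```python
import math


def phred_to_error(q: float) -> float:
    """Base-calling error rate e for a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_to_phred(e: float) -> float:
    """Phred quality score Q for a base-calling error rate e (0 < e <= 1)."""
    return -10 * math.log10(e)


def accuracy(q: float) -> float:
    """Probability that a base with Phred score Q was called correctly."""
    return 1 - phred_to_error(q)
```

With these definitions, a score of 30 corresponds to an error rate of 0.001, that is, a correct identification rate of 99.9%, in line with the mean quality above 30 reported for the clean reads.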

Removal of Host Contamination

To avoid the influence of host sequences on subsequent analysis, BWA (v0.7.17; parameters: mem -k 30) was used to align clean reads to the host database. The Sus scrofa reference sequence (accession: NC_010443.5) was used as the host reference. Alignments in which the aligned length was >80% of the total read length were selected, and the corresponding reads were removed. The results showed that 16.5% of clean reads (PE: paired-end) remained after host removal, as shown in Table 1.
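The 80% aligned-length criterion can be sketched in Python by parsing the CIGAR string of each SAM record produced by the aligner. This is an illustration only: the text does not state how the aligned length was computed, so counting the M, = and X operations as aligned bases, and counting soft- and hard-clipped bases towards the read length, are assumptions, as are the function names.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_fraction(cigar: str) -> float:
    """Fraction of the read length covered by aligned (M, =, X) bases."""
    aligned = 0
    read_length = 0
    for count, op in _CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=X":
            aligned += n
        if op in "MIS=XH":  # operations that account for bases of the original read
            read_length += n
    return aligned / read_length if read_length else 0.0


def is_host_read(cigar: str, min_fraction: float = 0.8) -> bool:
    """True if the alignment covers more than min_fraction of the read."""
    return aligned_fraction(cigar) > min_fraction
```

An unmapped read, for which the CIGAR field is `*`, yields a fraction of 0 and is therefore kept. The same criterion could be used to select reads that align to a virus reference rather than to remove them.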

TABLE 1

Table 1. Removal of host sequence and virus sequence statistics.

Virus Composition Analysis

To quickly obtain virus composition information for the samples, BWA (v0.7.17; parameters: mem -k 30) was used to align clean reads to the virus reference data (extracted from NT); alignments were retained if the aligned length was >80% of the total read length. Statistical analysis showed that virus reads (PE) accounted for 3.19% of clean reads (PE), as shown in Table 1. Virus classification information was summarized according to the annotation information of the NCBI Taxonomy Database. To improve the accuracy of the results, only results covered by >5 reads were retained during species annotation. Reads (PE) of the African swine fever virus family (Asfarviridae, red column in the figure) accounted for 87.12% of the total virus reads (PE). Because Asfarviridae has only one member, ASFV, annotation at the genus level was not needed for the display. The statistical results are shown in Figure 3.

FIGURE 3

Figure 3. Family-level annotation of virus reads. Note: only families with a proportion of reads >1% are displayed; proportion of a virus = number of reads of that virus/number of total virus reads × 100.
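The proportions plotted in Figure 3 can be illustrated with a short Python sketch. It takes one family label per virus read, drops families supported by five or fewer reads, and reports families whose share of all virus reads exceeds 1%. Using all virus reads (before the >5 reads filter) as the denominator is an assumption, and the function name and input format are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def family_proportions(read_families: Iterable[str],
                       min_reads: int = 5,
                       min_percent: float = 1.0) -> Dict[str, float]:
    """Percentage of virus reads per family, for families passing both thresholds."""
    counts = Counter(read_families)
    total = sum(counts.values())  # assumption: denominator is all virus reads
    proportions = {}
    for family, n in counts.items():
        if n <= min_reads:
            continue  # keep only families covered by more than min_reads reads
        percent = 100.0 * n / total
        if percent > min_percent:
            proportions[family] = percent
    return proportions
```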