viGEN: An Open Source Pipeline for the Detection and Quantification of Viral RNA in Human Tumors

Bhuvaneshwar, Krithika; Song, Lei; Madhavan, Subha; Gusev, Yuriy

doi:10.3389/fmicb.2018.01172

METHODS article

Front. Microbiol., 05 June 2018

Sec. Systems Microbiology

Volume 9 - 2018 | https://doi.org/10.3389/fmicb.2018.01172

This article is part of the Research TopicUsing Genomics, Metagenomics and Other "Omics" to Assess Valuable Microbial Ecosystem Services and Novel Biotechnological ApplicationsView all 53 articles

viGEN: An Open Source Pipeline for the Detection and Quantification of Viral RNA in Human Tumors

Krithika Bhuvaneshwar^*

Lei Song

Subha Madhavan

Yuriy Gusev^*

Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, United States

An estimated 17% of cancers worldwide are associated with infectious causes. The extent and biological significance of viral presence/infection in actual tumor samples is generally unknown but could be measured using human transcriptome (RNA-seq) data from tumor samples. We present an open source bioinformatics pipeline viGEN, which allows for not only the detection and quantification of viral RNA, but also variants in the viral transcripts. The pipeline includes 4 major modules: The first module aligns and filter out human RNA sequences; the second module maps and count (remaining un-aligned) reads against reference genomes of all known and sequenced human viruses; the third module quantifies read counts at the individual viral-gene level thus allowing for downstream differential expression analysis of viral genes between case and controls groups. The fourth module calls variants in these viruses. To the best of our knowledge, there are no publicly available pipelines or packages that would provide this type of complete analysis in one open source package. In this paper, we applied the viGEN pipeline to two case studies. We first demonstrate the working of our pipeline on a large public dataset, the TCGA cervical cancer cohort. In the second case study, we performed an in-depth analysis on a small focused study of TCGA liver cancer patients. In the latter cohort, we performed viral-gene quantification, viral-variant extraction and survival analysis. This allowed us to find differentially expressed viral-transcripts and viral-variants between the groups of patients, and connect them to clinical outcome. From our analyses, we show that we were able to successfully detect the human papilloma virus among the TCGA cervical cancer patients. We compared the viGEN pipeline with two metagenomics tools and demonstrate similar sensitivity/specificity. We were also able to quantify viral-transcripts and extract viral-variants using the liver cancer dataset. The results presented corresponded with published literature in terms of rate of detection, and impact of several known variants of HBV genome. This pipeline is generalizable, and can be used to provide novel biological insights into microbial infections in complex diseases and tumorigeneses. Our viral pipeline could be used in conjunction with additional type of immuno-oncology analysis based on RNA-seq data of host RNA for cancer immunology applications. The source code, with example data and tutorial is available at: https://github.com/ICBI/viGEN/.

Introduction

An estimated 17% of cancers worldwide are associated with infectious causes. These infectious agents include viruses, bacteria, parasites and other microbes. Examples of viruses include human papilloma viruses (HPVs) in cervical cancer, epstein-Barr virus (EBV) in nasopharyngeal cancer, hepatitis B and C in liver cancer (HBV and HCV), human herpes virus 8 (HHV-8) in Kaposi sarcoma (KS); human T-lymphotrophic virus-1 (HTLV-1) in adult T cell lymphocytic leukemia (ATL) and non-Hodgkin lymphoma; merkel cell polyomavirus (MCV) in Merkel cell carcinoma (ACS, 2007). Bacteria such as Helicobacter pylori have been implicated in stomach cancer. Parasites have also been associated with cancer, examples are Opisthorchis viverrini and Clonorchis sinensis in bile duct cancer and Schistosoma haematobium in bladder cancer (ACS, 2007). Detection and characterization of these infectious agents in tumor samples can give us better insights into disease mechanisms and their treatment (Hausen, 2000).

Vaccines have been developed to help protect against infection from the many cancers. But these vaccines can only be used to help prevent infection and cannot treat existing infections (ACS, 2007). There are several screening methods widely used to detect viral infections, especially for blood borne viruses including HBV, HCV, HIV and HTLV. These include the enzyme linked immunosorbent assay (ELISA or EIA) (Yoshihara, 1995), chemluminescent immunoassay (ChLIA), Indirect fluorescent antibody (IFA), Western blot (WB), Polymerase Chain Reaction (PCR), and Rapid immunoassays¹. ELISA and WB test detects and measures antibodies in serum taken from the patient's blood, and are typically prescribed after certain symptoms are observed in the patient.

There are several challenges in detection of viruses in tumors including loss of viral information in progressed tumors and limited or latent replication resulting in low transcription of tumors (Schelhorn et al., 2013). The extent and biological significance of viral presence/infection in actual tumor samples is generally unknown but could be measured using human transcriptome data from tumor samples.

The popularity of next-generation sequencing (NGS) technology has exploded in the last decade. NGS technologies are able to perform rapid sequencing, and in a massively parallel fashion (Datta et al., 2015). In recent years, applications of NGS technologies in clinical diagnostics have been on the rise (Barzon et al., 2011; Byron et al., 2016). Amongst the various NGS technologies, whole-transcriptome sequencing, also called RNA-seq, has been very popular with methods and tools being actively developed. Exploring the genome using RNA-seq gives a different insight than looking at the DNA since the RNA-seq would have captured actively transcribed regions. Every aspect of data output from this technology is now being used for research, including detection of viruses and bacteria (Khoury et al., 2013; Salyakina and Tsinoremas, 2013; Wang et al., 2016). They are also independent of prior sequence information, and require less starting material compared to conventional cloning based methods, making them powerful and exciting new technologies in virology (Datta et al., 2015). These high throughput technologies give us direct evidence of infection in the tissue as compared to ELISA-based assays, which only proves presence of infection somewhere in the human body. RNA-seq technology has hence enabled the exploration and detection of viral infections in human tumor samples. This technology also enables detection of variants in viral genome, which have been connected to clinical outcome (Moyes et al., 2005; Downey et al., 2015).

In recent years, US regulators approved a viral based cancer therapy (Ledford, 2015), proving that the study of viruses in the human transcriptome has biomedical interest, and is paving the way for promising research and new opportunities.

In this paper, we present our pipeline viGEN to not only detect and quantify read counts at the individual viral-gene level, but also detect viral variants from human RNA-seq data. The characterization of viral variants helps enable better epidemiological analysis. The input file to our pipeline is a fastq (Wikipedia, 2009) file, so our viGEN pipeline can be extended to work with genomic data from any NGS technology. Our pipeline can also be used to detect and explore not only viruses, but other microbes as well, as long as the sequence information is available in NCBI².

We applied our viGEN pipeline to two case studies as a proof of concept - a dataset of 304 cervical cancer patients, and a set of 50 liver cancer patients, both from the TCGA collection. We first applied the pipeline to the transcriptome of cervical cancer patients to see if we are able to detect the human papilloma viruses. We also performed additional in-depth analyses on a small focused study of liver cancer patients. In this cohort, we performed viral-gene quantification, viral-variant extraction and survival analysis.

From our analyses, we show that we were able to successfully detect the human papilloma virus among the TCGA cervical cancer patients. We compared the viGEN pipeline with two metagenomics tools and demonstrate similar sensitivity/specificity. We were also able to quantify viral-transcripts and extract viral-variants using the liver cancer dataset. This enabled us to perform downstream analysis to give us new insights into disease mechanisms.

In addition to the two case studies, we have made available an end-to-end tutorial demonstrated on a publicly available RNA-seq sample from an HBV liver cancer patient from NCBI SRA (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA279878). We also provided step-by-step instructions on how to run our viGEN pipeline on this sample data, along with the code at https://github.com/ICBI/viGEN/ and demonstrated the detection of HBV transcripts in this sample. This allows other users to apply this pipeline to explore viruses in their data and disease of interest. We are currently implementing the viGEN pipeline in the Seven Bridges Cancer Genomics Cloud³.

There are a number of existing pipelines that detect viruses from human transcriptome data. Of these, very few pipelines offer quantification at the gene expression level. A comprehensive comparison of these pipelines is provided in Table 1. Our goal was not to compete with these other tools, but to offer a convenient and complete end–to-end publicly available pipeline to the bioinformatics community. To the best of our knowledge there are no publicly available pipelines or packages that would provide this type of complete analysis in one package. Customized solutions have been reported in the literature however were not made public.

TABLE 1

Table 1. Comparison of existing pipelines that detect viruses from human transcriptome data.

In the future, our plan is to package this pipeline and make it available to users through Bioconductor (Lawrence et al., 2013), allowing users to perform analysis on either their local computer or the cloud.

Materials and Methods

In this paper, we applied our viGEN pipeline to two case studies as a proof of concept - a dataset of 304 cervical cancer patients, and a set of 50 liver cancer patients, both from the TCGA collection (NCI, 2011). We first applied the pipeline to the transcriptome of cervical cancer patients to see if we are able to detect the human papilloma viruses. We also performed additional in-depth analyses on a small focused study of liver cancer patients afflicted with Hepatitis B virus. In this cohort, we perform viral-gene quantification, viral-variant extraction and survival analysis. The results from these analyses allowed us to compare experimental and control groups using viral-gene expression data and viral-variant data, and give us insights into their impacts on the tumor, and disease mechanisms.

In the following sections, we describe the viGEN pipeline, and the two case studies.

The viGEN Pipeline

The viGEN pipeline includes 4 major modules. Figure 1 shows an image of our viGEN pipeline.

FIGURE 1

Figure 1. viGEN pipeline. Each module has a color, shown in the legend.

Module 1: Viral Genome Level Analysis (Filtered Human Sample Input)

In Module 1 (labeled as “filtered human sample input”), the human RNA sequences were aligned to the human-reference genome using the RSEM (Li and Dewey, 2011) tool. One of the outputs of RSEM includes sequences that did not align to the human genome (hence the name “filtered human sample input”). These un-aligned sequences were taken and aligned to the viral reference file using popular alignment tools BWA (Li and Durbin, 2009) and Bowtie2 (Langmead and Salzberg, 2012).

Module 2: Viral Genome Level Analysis (Unfiltered Human Sample Input)

In Module 2 (labeled as “unfiltered human sample input”), the RNA seq sequences were directly aligned to the viral reference using Bowtie2 without any filtering.

The reason for using two methods to obtain the viral genomes in human RNA-seq data (Module 1 and Module 2) was to allow us to be as comprehensive as possible in viral detection.

The aligned reads from Module 1 and 2 were in the form of BAM files (Center-for-Statistical-Genetics, 2013), from which read counts were obtained for each viral genome species (referred to as “genome level counts”) using Samtools idxstats (Li et al., 2009) or Picard BAMIndexStats⁴ tools. Using the genome level counts, we estimated the number of reads that covered the genome, a form of viral copy number. Viral copy number was defined as in equation below:

\begin{array}{l} V i r a l c o p y n u m b e r = \frac{N u m b e r o f m a p p e d r e a d s \times R e a d l e n g t h}{G e n o m e l e n g t h} \end{array}

Only those viral species with copy number more than a threshold are selected for the next module.

Module 3: Viral Gene Expression Analysis

The BAM files from Module 1 and 2 (from Bowtie2 and BWA) were input into Module 3 (referred to as “viral gene expression level analysis”), which calculated quantitate read counts at the individual viral-gene level. We found existing RNA-seq quantification tools to be not sensitive enough for viruses, and hence developed our own algorithm for this module. Our in-house algorithm used region-based information from the general-feature-format (GFF) files⁵ of each viral genome, and the reads from the BAM file. It created a summary file, which had a total count of reads within or on the boundary of each region in the GFF file. This is repeated for each sample and for each viral GFF file. At the end, a matrix is obtained where the features (rows) are regions from the GFF file, and the columns are samples. The read count output from Module 3 (viral gene expression module) allowed for downstream differential expression analysis of viral genes between case and controls groups. The source code for our in-house algorithm, written using the R programming language (R Core Team, 2014), has been made public at available at github.com/ICBI/viGEN.

Module 4: Viral RNA Variant Calling Module

The BAM files from Module 1 and 2 (from Bowtie2) were also input to Module 4 to detect mutations in the transcripts from these viruses (referred to as “viral RNA variant calling module”). The BAM files were first sorted coordinate-wise using Samtools (Li et al., 2009); PCR duplicates were removed using tool Picard⁴, then the chromosomes in the BAM file were ordered in the same way as the reference file using Picard. The Viral reference file was created from combining all known and sequenced human viruses obtained from NCBI². Because viral variants are known to be low frequency, we have selected a variant calling tool Varscan2 (Koboldt et al., 2012), which allows detection of low-frequency variants (Spencer et al., 2014). Low quality and low depth variants were flagged, but not filtered out, in case these low values were due to low viral load. Once the variants were obtained, they were merged to form a multi-sample VCF file. Only variants that had a variant in two or more samples were retained. PLINK (Chang et al., 2015) was used to perform case-control association test (Fishers Exact Test) to compare groups.

Tutorial in Github

The viGEN pipeline is easy to implement because our pipeline incorporates existing best practices and tools available. For Module 3, we developed our own algorithm for viral-gene quantification. The major motivation for this paper was to build on existing viral detection tools, and to build a quantification tool in order to quantify, explore and analyse the genes detected in viruses. The source code for the in-house algorithm, along with a tutorial on how to execute the code on sample data has been made public at https://github.com/ICBI/viGEN/.

Since access to TCGA raw data is controlled access, we could not use this dataset to create a publicly available tutorial. So we used a publicly available RNA-seq dataset to demonstrate our pipeline with an end-to-end workflow. We chose one sample (SRR1946637) from publicly available HBV liver cancer RNA-seq dataset from NCBI SRA (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA279878). This dataset is also available through EBI SRA (http://www.ebi.ac.uk/ena/data/view/SRR1946637). The dataset consisted of 50 HBV Liver cancer patients, and 5 adjacent normal liver tissues. We downloaded the raw reads for one sample, and applied our viGEN pipeline to it and were able to successfully detect HBV transcripts in this sample. A step-by-step workflow that includes – description of tools, code, intermediate and final analysis results are provided in Github: https://github.com/ICBI/viGEN/. This tutorial has also been provided as Additional File 1.

Custom Reference Index

We were interested in exploring all viruses existing in humans. So we first obtained reference genomes of all known and sequenced human viruses obtained from NCBI² (745 viruses) and merged them into one file (referred to as the “viral reference file”) in fasta file format (Wikipedia, 2004). This file has been shared in our Github page.

Case Studies

Cervical Cancer Dataset

Cervical cancer is caused by the Human Papilloma Virus (HPV). This dataset consisted of 304 cervical cancer patients in the TCGA data collection. These samples were primary tumors from either Cervical Squamous Cell Carcinoma or Endocervical Adenocarcinoma where RNA-seq data was available.

We applied our viGEN pipeline on these samples using the Seven Bridges platform (https://cgc.sbgenomics.com). Among the 304 cervical cancer patients, 22 patients had virus detection confirmed by PCR or other lab methods and made available through the clinical data. So we used this information from the 22 patients to estimate the sensitivity and specificity of our viGEN pipeline.

Liver Cancer Dataset

This dataset consisted of 50 liver cancer patients in the TCGA data collection. 25 of these patients were afflicted with Hepatitis B virus (labeled “HepB”), while the rest of the 25 patients had a co-infection of both Hepatitis B and C viruses (labeled “HepB+C”). Information about viral presence was obtained from “Viral Hepatitis Serology” attribute from the clinical information.

We first applied the viGEN pipeline on the 50 samples, using the Globus Genomics platform (Bhuvaneshwar et al., 2015). Once the viral genomes were detected, we then chose only the high abundance viral species for the gene quantification step and viral variant detection steps (Module 3 and 4 respectively).

We then performed a focused analysis on this dataset. We used the viral-gene expression read counts, to examine the differences between “Dead” and “Alive” samples. The Dead/Alive status of the samples was obtained from the clinical data and refers to patients in the cohort that died or not from cancer. We performed this analysis on the 25 patients in the HepB only group to prevent any confounding with the HepB+HepC group. Out of 25 HepB patients, 16 were alive (baseline group), and 9 dead (comparison group) as per the clinical data. The analysis was performed using a Bioconductor software package called EdgeR (Robinson et al., 2010) in the R programming language (http://www.R-project.org).

Cox proportional hazards (Cox PH) regression model (Cox and Oakes, 2000) was then applied to look at the association of viral-gene expression data with overall survival. Thie Cox model was applied on all 50 samples in the cohort (i.e., 25 Hep B and 25 HepB+HepC) samples to maximize power.

We also compared the dead and alive samples at the viral RNA variant level in the HepB group using a tool called PLINK to see if it can add valuable information to the tumor landscape in humans.

Results

Detection of HPV in Cervical Cancer Patients

We used our viGEN pipeline to detect viruses in the RNA of human cervical tissue and obtained viral copy number for each species. We used a threshold copy number of 10 as a “positive” viral detection for both HPV-16, HPV-18 and HPV-26 viruses. Based on this criterion, HPV-16 was detected in 53% of the samples, HPV-18 in 13% of the samples and HPV-26 in 0.3 % of the samples (Figure 2). The threshold copy number limit that defines a “positive” detection is one of the parameters of the software which could be set by the user depending on the specifics of the experiment.

FIGURE 2

Figure 2. The HPV viruses detected in cervical cancer patients using the viGEN pipeline.

We obtained the clinical data for this TCGA cervical cancer cohort from the cBio portal (Cerami et al., 2012). Among the 304 patients, 22 patients had virus detection confirmed by PCR or other lab methods and made available through the clinical data. Out of the 22 patients, 12 patients had the HPV-16 virus, 4 patients had HPV-18, and the rest had other HPV viruses. So we used this information from the clinical data to estimate the sensitivity and specificity of our viGEN pipeline. We got a sensitivity of 83% and specificity of 60% for HPV-16 detection (Table 2A); and a sensitivity of 75% and specificity of 94% for HPV-18 detection (Table 2B).

TABLE 2A

Table 2A. Estimation of sensitivity and specificity for HPV-16 detection in TCGA cervical cancer samples using the viGEN pipeline.

TABLE 2B

Table 2B. Estimation of sensitivity and specificity for HPV-18 detection in TCGA cervical cancer samples using the viGEN pipeline.

Additional Analysis in Liver Cancer Patients

Detection of Hepatitis B Virus at the Genome Level

We applied our viGEN pipeline (modules 1 and 2) on the RNAseq data from the TCGA liver cancer tumors, and obtained genome-level read counts for each viral species. We used a threshold copy number of 10 to define a positive detection of the Hepatitis B virus.

Once the viral genomes were detected, we short-listed the high abundance viral species for the viral-gene quantification step and viral-variant detection steps (Module 3 and 4 respectively). High abundance was defined as those virus species that were detected in at-least 5 samples. In addition to Hepatitis B and C viruses, several other viruses came up in this short list including Human endogenous retrovirus K113 (HERV K113) and others. A complete list is provided in Table 3.

TABLE 3

Table 3. Virus species detected in at-least 5 samples.

Comparing Dead and Alive Samples in the Using Viral Gene Expression Data

To get a more detailed overview of the viral landscape, we applied Module 3 of the viGEN pipeline to the liver cancer dataset. This allowed us to quantify viral-gene expression regions in the RNA of liver tumor tissues. We then used those results to examine the differences between dead and alive samples.

It is known that these patients were afflicted with the Hepatitis B virus and hence many of the differentially expressed regions were from this viral genome. But as we know, other viruses also coexist in humans. This was confirmed by the presence of differentially expressed viral-regions from other viruses.

The differentially expressed regions that were significant among the results are shown in Tables 4A,B. Table 4A lists only the differentially expressed regions from Hepatitis B virus and Table 4B shows the differentially expressed regions from other viruses.

TABLE 4

Table 4. Differential expression analysis of transcript level read counts Liver cancer dataset comparing Dead and Alive samples.

From the differential expression analyses, the two most informative results were (1) a region of the Hepatitis B genome that produced the HBeAg and HBcAg proteins were overexpressed in the dead patients and (2) another region of the Hepatitis B genome that produced HBsAg protein was overexpressed in the alive patients.

In detail, we saw several important findings as described below:

(a) Region NC_003977.1_CDS_1814_2452 of the Hepatitis B genome was 2.18 times overexpressed (log fold change = +1.128) in dead patients. This region contains Gene C that produces pre-code protein external core antigen; HBeAg. HBeAg is produced by proteolytic processing of the pre-core protein

(b) Region NC_003977.1_CDS_1901_2452 which was 1.74 times overexpressed (log fold change = +0.8, FDR = 0.053) in dead patients contains Gene C as above, but encodes a different core antigen HBcAg

(c) Region NC_003977.1_CDS_2848_4050 of the Hepatitis B genome was 6.73 times over expressed (log fold change = −2.7) in the alive patients of compared to the dead'patients. This region encodes Gene S that produces a large surface protein/L glycoprotein/L-HBsAG

(d) We also found several regions of the Human endogenous retrovirus K113 (HERV K113) viral genome (NC_022518.1_region_1112_6746, NC_022518.1_STS_5100_5381 and NC_022518.1_STS_7174_7323) to be about 2 times overexpressed on average in alive patients (log fold change = −1.186, −1.051, −0.992).

Survival Analysis (Cox Regression) Using Viral Gene Expression Data

Based on the results from previous section, we selected two most informative regions from the Hepatitis B genome (log counts per million from NC_003977.1_CDS_2848_4050, NC_003977.1_CDS_1814_2452) for a Cox Proportional Hazard (Cox PH) model to look at association with overall survival event and time. The result from this model (Table 5), are consistent with the results from differential expression analysis:

(a) The Cox PH model shows that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_1814_2452, increases the hazard of event (death) by 70%.

(b) On the other hand, that assuming other covariant to be constant, unit increase in expression of this region NC_003977.1_CDS_2848_4050, decreases the hazard of event (death) by 43%.

TABLE 5

Table 5. Cox proportional hazard survival analysis (across 25 HepB samples and 25 HepB + HepC Samples).

Comparing Dead and Alive Samples Using Viral-Variant Data

We performed variant calling (Module 4) on the liver cancer patients to see if it can add valuable information to the tumor landscape in humans. We then compared the dead and alive samples at the viral-variant level on the 25 HepB patients. For this analysis, the outputs from both Module 1 and 2 were fed into Module 4.

Most of the top variants from filtered human sample (Module 1 + Module 4) and unfiltered human sample (Module 2 + Module 4), were the same. We collated the significant common results (p-value ≤ 0.05) in Tables 6A,B. Among these results, we saw several missense and frameshift variants in Gene X of the Hepatitis B genome (nucleotide 1479), Gene P (2573, 2651, 2813), and a region that overlaps Gene P and PreS1 (nucleotides 2990, 2997, 3105, 3156). All these variants were found mutated more in the cases than controls. Other significant common results included variants in Gene C (nucleotide 1979, 2396) and variants in PreS2 region (nucleotide positions 115, 126 and 148) (Table 6A).

TABLE 6

Table 6. Results of case-control association test applied on the results from viral variant calling.

In addition, there were two missense variants that were common among the top results, but not significant (p-value = 0.06). They were variants in the X gene of the Hepatitis B genome (nucleotides 1762 and 1764) (Table 6A).

Among the significant common results to both, were a few variants of the Human endogenous retrovirus K113 complete genome (HERV K113). These include nucleotide positions 7476, 7426, and 8086. These map to frameshift and missense mutations in the putative envelope protein of this virus (Q779_gp1, also called “env”) (Table 6B).

Discussion

Detection of HPV in Cervical Cancer Patients

The Seven Bridges team used two metagenomic tools, Centrifuge (Kim et al., 2016) and Kraken (Wood and Salzberg, 2014), to detect HPV viruses on the same cohort of TCGA patients (Bridges, 2017; Malhotra et al., 2017), and shared the results with us. They used an abundance of 0.02 as a positive viral detection (Bridges, 2017; Malhotra et al., 2017). We compared viGEN with Kraken and Centrifuge in terms of the percentage of samples where the species was detected (Table 7). We can see that the results are in the same range for all three tools.

TABLE 7

Table 7. Comparing the viral detection ability of viGEN with other tools.

We also estimated the sensitivity and specificity of these tools using the same 22 patients and compared with that of the viGEN pipeline. The Centrifuge tool had a sensitivity of 83% and specificity of 60% for HPV-16 detection; and a sensitivity of 75% and specificity of 94% for HPV-18 detection. The Kraken tool had a sensitivity of 83% and specificity of 20% for HPV-16 detection; and a sensitivity of 75% and specificity of 17% for HPV-18 detection (detailed in Additional File 2). It shows that our viGEN pipeline was able to match the sensitivity and specificity of Centrifuge tool and surpassed that of Kraken (detailed in Additional Files 2, 3).

Additional Analysis on Liver Cancer Patients

We used our viGEN pipeline to get genome-level read counts obtained from viruses detected in the RNA of human liver tissue. In our results, HBV was detected in 20% of the samples. This is similar to earlier analyses of TCGA liver cancer cohort study (Khoury et al., 2013; Tang et al., 2013; The Cancer Genome Atlas Research Network, 2017), which detected the HBV virus in 23 and 32% (with typically low counts range) of cases respectively.

It has also been reported that the viral gene X (HBx) was the most predominately expressed viral gene in liver cancer samples (Tang et al., 2013) which is in concordance with our findings where the peak number of reads were observed for gene X region of the HBV genome.

Comparing Dead and Alive Samples in the Liver Cancer Cohort Using Viral Gene Expression Data

To get a more detailed overview of the viral landscape, we examined the human RNA-seq data to detect and quantify viral gene expression regions. We then examined the differences between dead and alive samples at the viral-transcript level on the Hepatitis B sub-group (Tables 4A,B).

From the differential expression analyses, the two most informative results were (1) a region of the Hepatitis B genome that produced the HBeAg protein was overexpressed in the dead patients and (2) another region of the Hepatitis B genome that produced HBsAg protein was overexpressed in the alive patients.

Presence of HBeAg or HBcAg is an indicator of active viral replication; this means the person infected with Hepatitis B can likely transmit the virus on to another person. Typically, loss of HBeAg is an indicator of recovery from acute Hepatitis B infection. Active viral replication could allow the virus to persist in infected cells, and increase the risk of disease (Tage-Jensen et al., 1985; Liang, 2009). So our results, showing that antigens HBeAg and HBcAg were overexpressed in dead patients compared to alive patients makes sense, indicating that these patients never recovered from acute infection.

The results also indicate a higher level of HBsAg in the alive patients compared to the dead patients. The highest levels of HBsAg in the virus are known to occur in the “immunotolerant phase.” This pattern is seen in patients who are inactive carriers of the virus i.e., they have the wild type DNA, and the virus has been in the host for so long, that the host does not see the virus as a foreign protein in the body, and hence there's no immune reaction against the virus. In this phase, there is known to be minimal liver inflammation and low risk of disease progression (Park, 2004; Tran, 2011; Locarnini and Bowden, 2012). This could explain why we saw higher level of HBsAg in the alive patients compared to the dead patients.

Also among the significant results were three regions from the Human endogenous retrovirus K113 (HERV K113) genome (with negative log fold change) that were overexpressed in the alive patients. Two of these regions were Sequence-tagged sites (STS) and the third region was in the gag-pro-pol region that has frameshifts. HERV could protect the host from invasion from related viral agents through either retroviral receptor blockade or immune response to the undesirable agent (Nelson et al., 2003).

Overall, we found that our results from viral-gene expression level make biological sense, with much of the results validated through published literature.

Comparing Dead and Alive Samples in the Liver Cancer Cohort Using Viral-Variant Data

We performed variant calling on the viral data to see if it can add valuable information to the tumor landscape in humans. We then compared the dead and alive samples at the viral-variant level on the 25 patients in the Hepatitis B sub-group.

Among the significant results (Tables 6A,B) included variants in Gene C (nucleotide 1979, 2396) and variants in PreS2 region (nucleotide positions 115, 126 and 148). The Gene C region creates the pre-capsid protein, which plays a role in regulating genome replication (Tan et al., 2015). The mutation in the 2396 position lies in a known CpG island (ranging from 2215 to 2490), whose methylation level is significantly correlated with hepatocarcinogenesis (Jain et al., 2015). Mutations in PreS2 are associated with persistent HBV infection, and emerge in chronic infections. The PreS1 and PreS2 regions are known to play an essential role in the interaction with immune responses because they contain several epitopes for T or B cells (Cao, 2009).

Mutations in the 1762/1764 positions of the X gene are known to be associated with greater risk of HCC (Cao, 2009; Wang et al., 2014), and is independent of serum HBV DNA level (Wang et al., 2014). This mutation combination is also known to be associated with hepatitis B related acute-on-chronic liver failure (Xiao et al., 2011). It is predicted that mutations associated with HCC variants are likely generated during HBV-induced pathogenesis. The A1762T/G1764A combined mutations was shown to be a valuable biomarker in the predicting the risk of HCC (Cao, 2009; Wang et al., 2014); and are often detected about 10 years before the diagnosis of HCC (Cao, 2009).

Among the significant common results to both, were a few variants of the Human endogenous retrovirus K113 complete genome (HERV K113). These variants map to frameshift and missense mutations in the putative envelope protein of this virus (Q779_gp1, also called “env”). Studies have shown that this envelope protein mediates infections of cells (Robinson and Whelan, 2016). HERV K113 is a provirus and is capable of producing intact viral particles (Boller et al., 2008). Studies have shown a strong association between HERV-K antibodies and clinical manifestation of disease and therapeutic response (Moyes et al., 2005; Downey et al., 2015). It is hypothesized that retroviral gene products can be “reawakened” when genetic damage occurs through mutations, frameshifts and chromosome breaks. Even though the direct oncogenic effects of HERVs in cancer are yet to be completely understood, it has shown potential as diagnostic or prognostic biomarkers and for immunotherapeutic purposes including vaccines (Downey et al., 2015).

We compared various viral detection pipeline using the several criteria (Table 1). Our pipeline provides similar functionality as the tools listed in Table 1 for the detection of viruses from human RNAseq data; but also has an advantage of enabling gene-level expression analysis and quantification, as well as variant analysis of viral genomes in a single open source publicly available package.

Limitations

One limitation of our viGEN pipeline is that it is dependent on sequence information from reference genome. This makes it challenging to detect viral strains where reference sequence information is not known. In the future, we plan to explore de novo assembly incorporating more sophisticated methods like Hidden Markov Models (HMM) (Alves et al., 2016). This would enable us to provide in-depth analysis of strain pathogenicity in the context of clinical outcome.

Biological Significance

We show that our viGEN pipeline can thus be used on cancer and non-cancer human NGS data to provide additional insights into the biological significance of viral and other types of infection in complex diseases, and tumorigeneses. Our viral pipeline could be used in conjunction with additional type of immuno-oncology analysis based on RNA-seq data of host RNA for cancer immunology applications. Detection and characterization of these infectious agents in tumor samples can give us better insights into disease mechanisms and their treatment (Hausen, 2000).

Conclusion

With the decreasing costs of NGS analysis, our results show that it is possible to detect viral sequences from whole-transcriptome (RNA-seq) data in humans. Our analysis shows that it is not easy to detect DNA and RNA viruses from tumor tissue, but certainly possible. We were able to not only quantify them at a viral-gene expression level, but also extract variants. Our goal is to facilitate better understanding and gain new insights in the biology of viral presence/infection in actual tumor samples. The results presented in this paper on two case studies are in correspondence with published literature and are a proof of concept of our pipeline.

This pipeline is generalizable, and can be used to examine viruses present in genomic data from other next generation sequencing (NGS) technologies. It can also be used to detect and explore other types of microbes in humans, as long as the sequence information is available from the National Center for Biotechnology Information (NCBI) resources.

This pipeline can thus be used on cancer and non-cancer human NGS data to provide additional insights into the biological significance of viral and other types of infection in complex diseases, and tumorigeneses. We are planning to package this pipeline and make it open source to the bioinformatics community through Bioconductor.

Availability of Data and Material

The TCGA liver cancer dataset was used in the analysis and writing of this manuscript. The data can be obtained from https://cancergenome.nih.gov/.

Since access to TCGA raw data is controlled access, we could not use this dataset to create a publicly available tutorial. So we looked for publicly available RNA-seq dataset to demonstrate our pipeline with an end-to-end workflow. We chose one sample (SRR1946637) from publicly available liver cancer RNA-seq dataset from NCBI SRA (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA279878). This dataset is also available through EBI SRA (http://www.ebi.ac.uk/ena/data/view/SRR1946637). The dataset consists of 50 Liver cancer patients, and 5 adjacent normal liver tissues. We downloaded the raw reads for one sample, and applied our viGEN pipeline to it. A step-by-step workflow that includes – description of tools, code, intermediate and final analysis results are provided in Github: https://github.com/ICBI/viGEN/.

Project details:

Project name: viGEN

Project home page: https://github.com/ICBI/viGEN/

Operating system(s): The R code is platform independent. The shell scripts can run on Unix, Linux, or iOS environment

Programming language: R, bash/shell

Other requirements: N/A

License: N/A

Any restrictions to use by non-academics: N/A

Author Contributions

KB and YG designed the pipeline. KB and LS implemented the pipeline. KB and YG wrote the manuscript with editorial comments from SM.

Funding

This work was funded by the Lombardi cancer center support grant (P30 CA51008).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2018.01172/full#supplementary-material

Additional File 1. viGEN Github tutorial.

Additional File 2. Detailed results from analysis of TCGA cervical cancer patients.

Additional File 3. Output from Kraken and Centrifuge shared by the Seven Bridges Team.

Abbreviations

HBV, Hepatitis B virus; HCV, Hepatitis C Virus; HERV K113, Human Endogenous Retrovirus K113; TCGA, The Cancer Genome Atlas; HCC, Hepatocellular carcinoma; NAFLD, nonalcoholic fatty liver disease; Hep B, Hepatitis B; Hep C, Hepatitis C; HepB + HepC, coinfected with both Hepatitis B and C virus; HBsAg, Hepatitis B surface antigen; HBeAg, Hepatitis B type e antigen; NGS, next-generation sequencing; RNA-seq, whole transcriptome sequencing; BAM, Binary version of Sequence alignment/map format; CDS, coding sequence; Cox PH, Cox Proportional Hazard; HBx, viral gene X; STS, Sequence-tagged sites; NCBI, National Center for Biotechnology Information; GFF, general-feature-format.

Footnotes

1. ^FDA Complete List of Donor Screening Assays for Infectious Agents and HIV Diagnostic Assays (Accessed March 05, 2016). Available online at: https://www.fda.gov/biologicsbloodvaccines/bloodbloodproducts/approvedproducts/licensedproductsblas/blooddonorscreening/infectiousdisease/ucm080466.htm

2. ^NCBI (2009). NCBI FTP Site for Viruses. Available online at: ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/

3. ^Seven Bridges Cancer Genomics Cloud. Available online at: https://cgc.sbgenomics.com/

4. ^BroadInstitute Picard. Available online at: http://broadinstitute.github.io/picard

5. ^Ensembl GFF/GTF File Format - Definition and Supported Options 2016, (Accessed July, 2016). Available online at: http://useast.ensembl.org/info/website/upload/gff.html

References

ACS (2007). Infections That Can Lead to Cancer 2015. Available online at: http://www.cancer.org/cancer/cancercauses/othercarcinogens/infectiousagents/infectiousagentsandcancer/infectious-agents-and-cancer-viruses

Alves, J. M., de Oliveira, A. L., Sandberg, T. O., Moreno-Gallego, J. L., de Toledo, M. A., de Moura, E. M., et al. (2016). GenSeed-HMM: a tool for progressive assembly using profile HMMs as seeds and its application in alpavirinae viral discovery from metagenomic data. Front. Microbiol. 7:269. doi: 10.3389/fmicb.2016.00269

PubMed Abstract | CrossRef Full Text | Google Scholar

Barzon, L., Lavezzo, E., Militello, V., Toppo, S., and Palù, G. (2011). Applications of next-generation sequencing technologies to diagnostic virology. Int. J. Mol. Sci. 12, 7861–7884. doi: 10.3390/ijms12117861

PubMed Abstract | CrossRef Full Text | Google Scholar

Bhaduri, A., Qu, K., Lee, C. S., Ungewickell, A., and Khavari, P. A. (2012). Rapid identification of non-human sequences in high-throughput sequencing datasets. Bioinformatics 28, 1174–1175. doi: 10.1093/bioinformatics/bts100

PubMed Abstract | CrossRef Full Text | Google Scholar

Bhuvaneshwar, K., Sulakhe, D., Gauba, R., Rodriguez, A., Madduri, R., Dave, U., et al. (2015). A case study for cloud based high throughput analysis of NGS data using the globus genomics system. Comput. Struct. Biotechnol. J. 13, 64–74. doi: 10.1016/j.csbj.2014.11.001

PubMed Abstract | CrossRef Full Text | Google Scholar

Boller, K., Schönfeld, K., Lischer, S., Fischer, N., Hoffmann, A., Kurth, R., et al. (2008). Human endogenous retrovirus HERV-K113 is capable of producing intact viral particles. J. Gen. Virol. 89(Pt 2), 567–572. doi: 10.1099/vir.0.83534-0

PubMed Abstract | CrossRef Full Text | Google Scholar

Bridges, S. (2017). Identifying Viral Sequences In TCGA Data Using Kraken And Centrifuge. (Accessed May 23, 2017). Available online at: https://www.sevenbridges.com/centrifuge/

Byron, S. A., Van Keuren-Jensen, K. R., Engelthaler, D. M., Carpten, J. D., and Craig, D. W. (2016). Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat. Rev. Genet. 17, 257–271. doi: 10.1038/nrg.2016.10

PubMed Abstract | CrossRef Full Text | Google Scholar

Cao, G. W. (2009). Clinical relevance and public health significance of hepatitis B virus genomic variations. World J. Gastroenterol. 15, 5761–5769. doi: 10.3748/wjg.15.5761

PubMed Abstract | CrossRef Full Text | Google Scholar

Center-for-Statistical-Genetics (2013). BAM 2013 (Accessed February 26, 2013). Available online at: http://genome.sph.umich.edu/wiki/BAM

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., et al. (2012). The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2, 401–404. doi: 10.1158/2159-8290.CD-12-0095

PubMed Abstract | CrossRef Full Text | Google Scholar

Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., and Lee, J. J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 4:7. doi: 10.1186/s13742-015-0047-8

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, Y., Yao, H., Thompson, E. J., Tannir, N. M., Weinstein, J. N., and Su, X. (2013). VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue. Bioinformatics. 29, 266–267. doi: 10.1093/bioinformatics/bts665

PubMed Abstract | CrossRef Full Text | Google Scholar

Cox, D. R., and Oakes, D. (2000). Analysis of Survival Data. Boca Raton, FL: Chapman and Hall.

Google Scholar

Datta, S., Budhauliya, R., Das, B., Chatterjee, S., and Vanlalhmuaka Veer, V. (2015). Next-generation sequencing in clinical virology: discovery of new viruses. World J. Virol. 4, 265–276. doi: 10.5501/wjv.v4.i3.265

PubMed Abstract | CrossRef Full Text | Google Scholar

Downey, R. F., Sullivan, F. J., Wang-Johanning, F., Ambs, S., Giles, F. J., and Glynn, S. A. (2015). Human endogenous retrovirus K and cancer: innocent bystander or tumorigenic accomplice? Int J. Cancer. 137, 1249–1257. doi: 10.1002/ijc.29003

PubMed Abstract | CrossRef Full Text | Google Scholar

Hausen, H. Z. (2000). Infections Causing Human Cancer. Weinheim: Wiley.

Google Scholar

Jain, S., Chang, T. T., Chen, S., Boldbaatar, B., Clemens, A., Lin, S. Y., et al. (2015). Comprehensive DNA methylation analysis of hepatitis B virus genome in infected liver tissues. Sci. Rep. 5:10478. doi: 10.1038/srep10478

PubMed Abstract | CrossRef Full Text | Google Scholar

Khoury, J. D., Tannir, N. M., Williams, M. D., Chen, Y., Yao, H., Zhang, J., et al. (2013). Landscape of DNA virus associations across human malignant cancers: analysis of 3,775 cases using RNA-Seq. J. Virol. 87, 8916–8926. doi: 10.1128/JVI.00340-13

PubMed Abstract | CrossRef Full Text | Google Scholar

Kim, D., Song, L., Breitwieser, F. P., and Salzberg, S. L. (2016). Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729. doi: 10.1101/gr.210641.116

PubMed Abstract | CrossRef Full Text | Google Scholar

Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., et al. (2012). VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576. doi: 10.1101/gr.129684.111

PubMed Abstract | CrossRef Full Text | Google Scholar

Kostic, A. D., Ojesina, A. I., Pedamallu, C. S., Jung, J., Verhaak, R. G., Getz, G., et al. (2011). PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat. Biotechnol. 29, 393–396. doi: 10.1038/nbt.1868

PubMed Abstract | CrossRef Full Text | Google Scholar

Langmead, B., and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359. doi: 10.1038/nmeth.1923

PubMed Abstract | CrossRef Full Text | Google Scholar

Lawrence, M., Huber, W., Pagès, H., Aboyoun, P., Carlson, M., Gentleman, R., et al. (2013). Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9:e1003118. doi: 10.1371/journal.pcbi.1003118

PubMed Abstract | CrossRef Full Text | Google Scholar

Ledford, H. (2015). Cancer-fighting viruses win approval. Nature 526, 622–623. doi: 10.1038/526622a

PubMed Abstract | CrossRef Full Text

Li, B., and Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformat. 12:323. doi: 10.1186/1471-2105-12-323

CrossRef Full Text | Google Scholar

Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760. doi: 10.1093/bioinformatics/btp324

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079. doi: 10.1093/bioinformatics/btp352

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, J. W., Wan, R., Yu, C. S., Co, N. N., Wong, N., and Chan, T. F. (2013). ViralFusionSeq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution. Bioinformatics 29, 649–651. doi: 10.1093/bioinformatics/btt011

PubMed Abstract | CrossRef Full Text | Google Scholar

Liang, T. J. (2009). Hepatitis B: the virus and disease. Hepatology 49(5 Suppl), S13–S21. doi: 10.1002/hep.22881

PubMed Abstract | CrossRef Full Text | Google Scholar

Locarnini, S., and Bowden, S. (2012). Hepatitis B surface antigen quantification: not what it seems on the surface. Hepatology 56, 411–414. doi: 10.1002/hep.25732

PubMed Abstract | CrossRef Full Text | Google Scholar

Malhotra, R., Grady, P., Lehnert, E., Sethi, A., Prakasha, P., and Brandi, N. (2017). “Enabling scalable and rapid metagenomic profiling of the transcriptome with the Seven Bridges Cancer Genomics Cloud,” in BioIT World 2017 (Boston, MA).

Moyes, D. L., Martin, A., Sawcer, S., Temperton, N., Worthington, J., Griffiths, D. J., et al. (2005). The distribution of the endogenous retroviruses HERV-K113 and HERV-K115 in health and disease. Genomics 86, 337–341. doi: 10.1016/j.ygeno.2005.06.004

PubMed Abstract | CrossRef Full Text | Google Scholar

NCI (2011). The Cancer Genome Atlas. Available online at: https://tcga-data.nci.nih.gov/

Nelson, P. N., Carnegie, P. R., Martin, J., Davari Ejtehadi, H., Hooley, P., Roden, D., et al. (2003). Demystified. Human endogenous retroviruses. Mol. Pathol. 56, 11–18. doi: 10.1136/mp.56.1.11

PubMed Abstract | CrossRef Full Text | Google Scholar

Park, J. H. (2004). [Hepatitis B virus surface antigen: a multifaceted protein]. Korean J. Hepatol. 10, 248–259.

PubMed Abstract | Google Scholar

R Core Team (2014). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Available online at: http://www.R-project.org/

Robinson, L. R., and Whelan, S. P. (2016). Infectious entry pathway mediated by the human endogenous retrovirus K Envelope Protein. J. Virol. 90, 3640–3649. doi: 10.1128/JVI.03136-15

PubMed Abstract | CrossRef Full Text | Google Scholar

Robinson, M. D., McCarthy, D. J., and Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140. doi: 10.1093/bioinformatics/btp616

PubMed Abstract | CrossRef Full Text | Google Scholar

Salyakina, D., and Tsinoremas, N. F. (2013). Viral expression associated with gastrointestinal adenocarcinomas in TCGA high-throughput sequencing data. Hum. Genomics. 7:23. doi: 10.1186/1479-7364-7-23

PubMed Abstract | CrossRef Full Text | Google Scholar

Schelhorn, S. E., Fischer, M., Tolosi, L., Altmüller, J., Nürnberg, P., Pfister, H., et al. (2013). Sensitive detection of viral transcripts in human tumor transcriptomes. PLoS Comput. Biol. 9:e1003228. doi: 10.1371/journal.pcbi.1003228

PubMed Abstract | CrossRef Full Text | Google Scholar

Spencer, D. H., Tyagi, M., Vallania, F., Bredemeyer, A. J., Pfeifer, J. D., Mitra, R. D., et al. (2014). Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J. Mol. Diagn. 16, 75–88. doi: 10.1016/j.jmoldx.2013.09.003

PubMed Abstract | CrossRef Full Text | Google Scholar

Tage-Jensen, U., Aldershvile, J., and Schlichting, P. (1985). Immunosuppressive treatment of HBsAg-positive chronic liver disease: significance of HBeAg. Hepatology 5, 47–49. doi: 10.1002/hep.1840050111

PubMed Abstract | CrossRef Full Text | Google Scholar

Tan, Z., Pionek, K., Unchwaniwala, N., Maguire, M. L., Loeb, D. D., and Zlotnick, A. (2015). The interface between hepatitis B virus capsid proteins affects self-assembly, pregenomic RNA packaging, and reverse transcription. J. Virol. 89, 3275–3284. doi: 10.1128/JVI.03545-14

PubMed Abstract | CrossRef Full Text | Google Scholar

Tang, K. W., Alaei-Mahabadi, B., Samuelsson, T., Lindh, M., and Larsson, E. (2013). The landscape of viral expression and host gene fusion and adaptation in human cancer. Nat. Commun. 4:2513. doi: 10.1038/ncomms3513

PubMed Abstract | CrossRef Full Text | Google Scholar

The Cancer Genome Atlas Research Network (2017). Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma. Cell 169, 1327–41.e23. doi: 10.1016/j.cell.2017.05.046

CrossRef Full Text

Tran, T. T. (2011). Immune tolerant hepatitis B: a clinical dilemma. Gastroenterol. Hepatol. (N Y). 7, 511–516.

PubMed Abstract | Google Scholar

Wang, F., Sun, Y., Ruan, J., Chen, R., Chen, X., Chen, C., et al. (2016). Using small RNA deep sequencing data to detect human viruses. Biomed. Res. Int. 2016:2596782. doi: 10.1155/2016/2596782

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, Q., Jia, P., and Zhao, Z. (2013). VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data. PLoS ONE 8:e64465. doi: 10.1371/journal.pone.0064465

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, Y. Z., Zhu, Z., Zhang, H. Y., Zhu, M. Z., Xu, X., Chen, C. H., et al. (2014). Detection of hepatitis B virus A1762T/G1764A mutant by amplification refractory mutation system. Braz. J. Infect. Dis. 18, 261–265. doi: 10.1016/j.bjid.2013.09.005

PubMed Abstract | CrossRef Full Text | Google Scholar

Wikipedia (2004). FASTA Format. Available online at: https://en.wikipedia.org/wiki/FASTA_format

Wikipedia (2009). FASTQ Format. Available online at: https://en.wikipedia.org/wiki/FASTQ_format

Wood, D. E., and Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15:R46. doi: 10.1186/gb-2014-15-3-r46

PubMed Abstract | CrossRef Full Text | Google Scholar

Xiao, L., Zhou, B., Gao, H., Ma, S., Yang, G., Xu, M., et al. (2011). Hepatitis B virus genotype B with G1896A and A1762T/G1764A mutations is associated with hepatitis B related acute-on-chronic liver failure. J. Med. Virol. 83, 1544–1550. doi: 10.1002/jmv.22159

PubMed Abstract | CrossRef Full Text | Google Scholar

Yoshihara, N. (1995). [ELISA for diagnosis of infections by viruses]. Nihon Rinsho. 53, 2277–2282.

PubMed Abstract | Google Scholar

Keywords: RNA-seq, viral detection, liver cancer, TCGA, variant analysis, next-generation sequencing, cancer immunology

Citation: Bhuvaneshwar K, Song L, Madhavan S and Gusev Y (2018) viGEN: An Open Source Pipeline for the Detection and Quantification of Viral RNA in Human Tumors. Front. Microbiol. 9:1172. doi: 10.3389/fmicb.2018.01172

Received: 26 January 2018; Accepted: 15 May 2018;
Published: 05 June 2018.

Edited by:

Diana Elizabeth Marco, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina

Reviewed by:

João Marcelo Pereira Alves, Universidade de São Paulo, Brazil
Hetron Mweemba Munang'andu, Norwegian University of Life Sciences, Norway

Copyright © 2018 Bhuvaneshwar, Song, Madhavan and Gusev. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Krithika Bhuvaneshwar, a2I0NzJAZ2VvcmdldG93bi5lZHU=
Yuriy Gusev, eWc2M0BnZW9yZ2V0b3duLmVkdQ==

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.