viGEN: An open source pipeline for the detection and quantification of viral RNA in human tumors
- ¹Innovation Center for Biomedical Informatics, Georgetown University, United States
An estimated 17% of cancers worldwide are associated with infectious causes. The extent and biological significance of viral presence or infection in actual tumor samples are generally unknown, but could be measured using human transcriptome (RNA-seq) data from tumor samples.
We present viGEN, an open source bioinformatics pipeline that allows not only for the detection and quantification of viral RNA, but also for the detection of variants in the viral transcripts. The pipeline includes four major modules. The first module aligns and filters out human RNA sequences. The second module maps the remaining unaligned reads against the reference genomes of all known and sequenced human viruses and counts them. The third module quantifies read counts at the level of individual viral genes, thus allowing downstream differential expression analysis of viral genes between case and control groups. The fourth module calls variants in these viruses. To the best of our knowledge, there are no publicly available pipelines or packages that provide this type of complete analysis in one open source package.
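As an illustration of the gene-level quantification step described above, the following is a minimal Python sketch of counting aligned reads per viral gene by interval overlap. It is not the viGEN implementation: the gene names, coordinates, and the function `count_reads_per_gene` are hypothetical, and the alignments are toy tuples rather than records parsed from a real alignment file.

```python
from collections import Counter

# Hypothetical viral gene annotation: (genome, gene, start, end),
# with 1-based inclusive coordinates. Names and positions are made up.
GENES = [
    ("HBV", "geneA", 1, 800),
    ("HBV", "geneB", 700, 1500),
]


def count_reads_per_gene(alignments, genes):
    """Count reads overlapping each annotated gene.

    alignments: iterable of (genome, read_start, read_end) tuples.
    A read overlapping two genes is counted once for each of them.
    """
    counts = Counter({(genome, name): 0 for genome, name, _, _ in genes})
    for genome, r_start, r_end in alignments:
        for g_genome, name, g_start, g_end in genes:
            # Two closed intervals overlap if each starts before the other ends.
            if genome == g_genome and r_start <= g_end and r_end >= g_start:
                counts[(g_genome, name)] += 1
    return dict(counts)


# Toy alignments standing in for reads that did not map to the human genome.
alignments = [
    ("HBV", 10, 85),
    ("HBV", 750, 825),    # falls in the region shared by geneA and geneB
    ("HBV", 1400, 1475),
    ("HPV16", 5, 80),     # no annotated gene in this toy example
]

gene_counts = count_reads_per_gene(alignments, GENES)
print(gene_counts)
```

A per-sample table of such counts is the kind of input that downstream differential expression analysis between case and control groups would consume.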
In this paper, we applied the viGEN pipeline to two case studies. In the first, we demonstrated the pipeline on a large public dataset, the TCGA cervical cancer cohort. In the second, we performed an in-depth analysis of a small, focused cohort of TCGA liver cancer patients, comprising viral-gene quantification, viral-variant extraction and survival analysis. This allowed us to find viral transcripts and viral variants that differed between the groups of patients, and to connect them to clinical outcome.
Our analyses show that we successfully detected the human papillomavirus among the TCGA cervical cancer patients. We compared the viGEN pipeline with two metagenomics tools and demonstrated similar sensitivity and specificity. We were also able to quantify viral transcripts and extract viral variants in the liver cancer dataset. The results corresponded with published literature in terms of the rate of detection and the impact of several known variants of the HBV genome. This pipeline is generalizable, and can be used to provide novel biological insights into microbial infections in complex diseases and tumorigenesis. The source code, with example data and a tutorial, is available at: https://github.com/ICBI/viGEN/.
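For readers unfamiliar with the comparison metrics mentioned above, the following is a minimal Python sketch of how sensitivity and specificity can be computed from per-sample detection calls against a reference status. The sample identifiers, the calls, and the function `sensitivity_specificity` are hypothetical; the reference standard actually used in the study is not specified in this abstract.

```python
def sensitivity_specificity(predicted, truth):
    """Return (sensitivity, specificity) for boolean per-sample calls.

    predicted, truth: dicts mapping sample id -> True (virus detected/present)
    or False. Samples missing from `predicted` are treated as not detected.
    """
    tp = sum(1 for s, t in truth.items() if t and predicted.get(s, False))
    fn = sum(1 for s, t in truth.items() if t and not predicted.get(s, False))
    tn = sum(1 for s, t in truth.items() if not t and not predicted.get(s, False))
    fp = sum(1 for s, t in truth.items() if not t and predicted.get(s, False))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical reference status and pipeline calls for five samples.
truth = {"s1": True, "s2": True, "s3": False, "s4": False, "s5": True}
predicted = {"s1": True, "s2": False, "s3": False, "s4": True, "s5": True}

sens, spec = sensitivity_specificity(predicted, truth)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Computing both metrics for each tool on the same set of samples is one simple way to place tools side by side.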
Keywords: RNA-Seq, Viral detection, liver cancer, TCGA, variant analysis, Next-generation sequencing, cancer immunology
Received: 26 Jan 2018;
Accepted: 15 May 2018.
Edited by: Diana E. Marco, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina
Reviewed by: Hetron M. Munang'Andu, NMBU, Norway
João Marcelo P. Alves, Universidade de São Paulo, Brazil
Copyright: © 2018 Bhuvaneshwar, Song, Madhavan and Gusev. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Ms. Krithika Bhuvaneshwar, Georgetown University, Innovation Center for Biomedical Informatics, Washington, United States, email@example.com
Dr. Yuriy Gusev, Georgetown University, Innovation Center for Biomedical Informatics, Washington, United States, firstname.lastname@example.org