SciApps: A Cloud-Based Platform for Analyses and Distribution of the MaizeCODE data

MaizeCODE is a project aimed at identifying and analyzing functional elements in the maize genome. In its initial phase, MaizeCODE assayed up to five tissues from four maize strains (B73, NC350, W22, TIL11) by RNA-Seq, ChIP-Seq, RAMPAGE, and small RNA sequencing. To facilitate reproducible science and provide both human and machine access to the MaizeCODE data, we developed SciApps, a cloud-based portal, for analysis and distribution of both raw data and analysis results. Based on the SciApps workflow platform, we generated new components to support the complete cycle of MaizeCODE data management. These include publicly accessible scientific workflows for reproducible and shareable analysis of various functional data, a RESTful API for batch processing and distribution of data and metadata, a searchable data page that lists each MaizeCODE experiment as a reproducible workflow, and integrated JBrowse genome browser tracks linked with workflows and metadata. The SciApps portal is a flexible platform that allows integration of new analysis tools, workflows, and genomic data from multiple projects. Through metadata and a ready-to-compute cloud-based platform, the portal experience improves access to the MaizeCODE data and facilitates its analysis.


which represents an entity that chains together raw data, analysis results, experimental metadata, and computational provenance. SciApps extracts experimental metadata and attaches them to a specific workflow so that users can access them directly on the SciApps portal. Genome browser tracks are automatically generated and displayed within an integrated version of JBrowse (Skinner et al., 2009) by looping through the list of experiments/workflows via the SciApps RESTful API.
Together, the SciApps portal supports the complete cycle of MaizeCODE data management, and the level of automation it provides greatly decreases the chances of human errors in data organization, analysis, and distribution.

Processing and accessing the MaizeCODE data with the SciApps RESTful API
The cloud-based architecture of SciApps (Wang et al., 2018) enables highly scalable processing of MaizeCODE data on the XSEDE/TACC cloud. Both intermediate and final results are archived in the CyVerse Data Store, where the raw data are also hosted. As discussed above, each SciApps workflow captures experimental metadata and computational provenance along with the raw and processed data. Batch processing is supported through the RESTful API; the API endpoints are provided in Table 1. The analysis workflow for a specific assay is typically built interactively within the SciApps GUI using one data set as a template. Once the workflow is captured, it can be applied automatically to analyze other genomes and tissues. Alternatively, users may build workflows entirely programmatically from a series of analysis job IDs via the API. Experimental metadata are retrieved via the CyVerse Terrain API and then attached to the workflow via the SciApps API at runtime. The API supports the MaizeCODE DCC in automatically processing a large amount of data, and also supports retrieval of results and metadata by end users. For example, genome browser tracks can be automatically generated given a workflow ID by the following steps (Figure 1): 1. Retrieve job IDs and inputs with the workflow endpoint, given a workflow ID; 2. Retrieve the output path with the job endpoint, given a job ID; 3. Construct the browser-ready link with the retrieved information.

bioRxiv preprint doi: https://doi.org/10.1101/852269; this version posted November 23, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

To simplify the process, the MaizeCODE DCC encodes the genome, tissue, and replicate information into the input raw data file path, which is also accessible through the workflow metadata endpoints. SciApps also names the output file based on the input filename, with the output ID (defined by the app) as the prefix. As shown in Figure 1, once the input filename, input path, and output path to cloud storage are retrieved by calling the API, the output file path can be constructed to build the browser links. Each workflow has two replicates. $WF_ID is the workflow ID, and $JOB_ID is the job ID of a step of the workflow. As an example, the cURL command for retrieving the input filename of replicate 1 is:
curl -X GET --header "Accept: application/json" "https://www.sciapps.org/workflow/$WF_ID" | jq
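The link-construction logic described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the workflow URL follows the cURL example, but the underscore separator between the output ID and the input filename, the helper names, and the example base URL and paths are hypothetical and not taken from the SciApps API. The output path is assumed to have already been retrieved from the job endpoint (step 2), whose URL and response layout are not reproduced here.

```python
from urllib.parse import quote

# Base URL as it appears in the cURL example above.
SCIAPPS_BASE = "https://www.sciapps.org"


def workflow_url(wf_id: str) -> str:
    """Step 1: URL of the workflow endpoint for a given workflow ID."""
    return f"{SCIAPPS_BASE}/workflow/{quote(wf_id, safe='')}"


def output_filename(output_id: str, input_filename: str, sep: str = "_") -> str:
    """Name an output after its input, with the app-defined output ID as prefix.

    The separator is an assumption; the text only states that the
    output ID is used as the prefix.
    """
    return f"{output_id}{sep}{input_filename}"


def browser_link(link_base: str, output_path: str,
                 output_id: str, input_filename: str) -> str:
    """Step 3: join a base URL, the job's output path, and the output filename.

    link_base is a hypothetical URL prefix under which the cloud
    storage is served; output_path comes from the job endpoint (step 2).
    """
    name = output_filename(output_id, input_filename)
    path = f"{output_path.strip('/')}/{name}"
    return f"{link_base.rstrip('/')}/{quote(path)}"


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(workflow_url("WF_ID_PLACEHOLDER"))
    print(browser_link("https://example.org/data", "/archive/jobs/job-123",
                       "sig", "B73_ears_rep1.fastq.gz"))
```

Keeping the three pieces as separate functions mirrors the three API steps, so the same helpers can be looped over a list of workflow IDs once the real response fields are known.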

Accessing the MaizeCODE experiments as reproducible workflows
The MaizeCODE data page can be accessed under 'Data' from the top navigation bar of SciApps (Figure 2). Keyword search is supported to allow the user to narrow down the list of experiments to a specific genome, tissue, or assay in real time. Once an experiment is selected, the user can access the metadata, workflows, and ground-level analysis results of the experiment, starting from raw sequence data. With the 'Relaunch' tab, the user can reproduce the entire analysis with one click or apply the same analysis workflow to new data. Using the 'Share' tab, the analysis can be shared with others. Users can load the results to the History panel and subject them to further analysis using the modular apps. Because all results are archived in the cloud, downstream analyses can be completed quickly; e.g., differential expression analysis between two tissues can be completed in a few minutes, rather than the hours required when starting from the raw sequence data.

'Relaunch' the analysis, 'Visualize' the graphic diagram of the workflow (with URLs for the raw sequence files from the input file node), 'Load' the results to the History panel, 'Share' the analysis with others, and display the experimental 'Metadata'. Users can perform a keyword search for a specific dataset (e.g., B73 ears RNA-Seq). In the right panel, SciApps displays the history of the selected datasets; the visualization (eye) icon opens a panel where users can generate links to visualize the results in a web browser (e.g., a QC report) or genome browser (e.g., alignments or signal tracks). The left panel shows a list of modular apps that can be launched to perform a variety of downstream analyses with the loaded results.

Accessing the MaizeCODE data as Genome Browser tracks
Once the analysis is completed, genome browser tracks are automatically generated, given the workflow ID, by calling the SciApps API for an integrated version of JBrowse. The browser tracks can be accessed under the 'Tools' menu within the top navigation bar. As shown in Figure 3, tracks are organized by genome, tissue, replicate, and assay. Checking the box next to a track loads it into the browser. The SciApps workflow ID is embedded, so clicking on a track brings up the workflow 'Relaunch' interface, which can be used to reproduce the track signal if needed. In this interface, the user can also check the parameters used for the analysis, as well as additional results in the History panel. At the bottom of the interface, a diagram button visualizes the workflow diagram, and a metadata button displays the experimental metadata associated with the workflow. From the results, the user can also generate additional browser track links through the visualization (eye) icon. For example, this can be used to verify the signal track against the alignment files (in the BAM format). As mentioned earlier, the results can also be used to perform a downstream analysis in the same interface. Finally, the browser tracks are available as a JSON file for integration into other platforms (e.g., the JSON file for B73 is available at https://data.sciapps.org/view2/data2/B73/v4/apollo_data/trackList.json).
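As a rough illustration of how such a track list might be consumed by another platform, the sketch below groups tracks by category. It assumes the common JBrowse 1 trackList.json layout (a top-level "tracks" array whose entries carry "label" and "category" keys); the fields actually present in the SciApps file are not shown in this paper, and the sample tracks and the function name are invented for the example.

```python
import json
from collections import defaultdict


def tracks_by_category(track_list_json: str) -> dict:
    """Group track labels by their 'category' field.

    Assumes a JBrowse 1 style document: {"tracks": [{"label": ..., "category": ...}]}.
    Tracks without a category are collected under 'Uncategorized'.
    """
    config = json.loads(track_list_json)
    groups = defaultdict(list)
    for track in config.get("tracks", []):
        category = track.get("category", "Uncategorized")
        groups[category].append(track.get("label", ""))
    return dict(groups)


# Hypothetical sample; not the content of the real SciApps trackList.json.
SAMPLE = json.dumps({
    "tracks": [
        {"label": "ears_rep1_rnaseq", "category": "B73/ears"},
        {"label": "ears_rep2_rnaseq", "category": "B73/ears"},
        {"label": "root_rep1_rnaseq", "category": "B73/root"},
        {"label": "reference_sequence"},
    ]
})

if __name__ == "__main__":
    for category, labels in tracks_by_category(SAMPLE).items():
        print(category, labels)
```

In practice the JSON text would be downloaded from the URL given above before being passed to the function; the download step is left out so that the sketch runs without network access.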

Accessing the raw reads on CyVerse Data Store
The raw sequence data are deposited into the CyVerse Data Store via iCommands (https://docs.irods.org/4.2.1/icommands/user/), with metadata attached before submission to the NCBI SRA. From there, users can access the raw data in several ways. Within SciApps, the input file node of the graphic diagram for a workflow/experiment is linked to the raw sequence file. Clicking on the input node opens the CyVerse Data Commons landing page in a web browser. The metadata attached to the raw sequence file are also displayed on the same page.
The user can further navigate through all released raw data from the landing page (http://datacommons.cyverse.org/browse/iplant/home/shared/maizecode/released/); the SciApps workflow ID is attached as metadata to the raw data files if they have been processed. The user can use the ID to load the workflow on the SciApps portal. For batch downloading of raw sequence files through the GUI or the command line, we recommend CyberDuck (https://cyberduck.io/) or iCommands, respectively.
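For command-line batch downloads, the Data Store path has to be derived from the landing-page URL. The sketch below assumes that the path is simply the part of the URL after "/browse", which matches the one URL given above but is not documented here as a general rule. The helper names are hypothetical; "iget -r" is the iCommands call for a recursive download, and the command is only built as an argument list, not executed.

```python
from urllib.parse import unquote, urlparse

# Assumed prefix separating the Data Commons site from the Data Store path.
BROWSE_PREFIX = "/browse"


def data_store_path(landing_url: str) -> str:
    """Derive the Data Store path from a Data Commons landing-page URL."""
    path = unquote(urlparse(landing_url).path)
    if not path.startswith(BROWSE_PREFIX + "/"):
        raise ValueError(f"not a Data Commons browse URL: {landing_url}")
    return path[len(BROWSE_PREFIX):].rstrip("/")


def iget_command(landing_url: str, dest: str = ".") -> list:
    """Build (but do not run) an iCommands call for a recursive download."""
    return ["iget", "-r", data_store_path(landing_url), dest]


if __name__ == "__main__":
    url = ("http://datacommons.cyverse.org/browse/"
           "iplant/home/shared/maizecode/released/")
    print(" ".join(iget_command(url, "maizecode_raw")))
```

Returning the command as a list makes it straightforward to hand to subprocess.run on a machine where iCommands is installed and configured.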

Analysis with reproducible workflows
Bioinformatics applications (or apps) are integrated into SciApps as modular components that can be chained with other apps into an automated workflow. Individual apps are built with Singularity images (Kurtzer et al., 2017) from BioConda recipes (Grüning et al., 2018)

RESULTS AND DISCUSSION
A large variety of software is needed to process the MaizeCODE data. For each experiment (consisting of two replicates), a workflow with a unique ID is provided via the SciApps platform.
One major goal of SciApps is to empower anyone in the community to easily repeat an entire analysis, or to use a workflow with alternative parameters for each step if so desired. A second major goal is to empower community members to process their own comparable data sets using protocols validated by MaizeCODE. In the following sections, we describe how RNA-seq data are processed, how the results can be visualized, and how the primary analysis results can be used for differential expression analysis.

Processing the RNA-seq data
Besides the UCSC genome analysis tools (for format conversion and generating browser track signals), the major software tools used in MaizeCODE RNA-seq data analysis are bbduk (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/) for trimming low-quality reads and adapter sequences, FastQC and MultiQC (Ewels et al., 2016) for visually checking read quality, STAR (Dobin et al., 2013) for alignment, RSEM (Li and Dewey, 2011) for quantifying gene expression, and StringTie (Pertea et al., 2015) for transcriptome assembly. All tools are integrated into SciApps individually as separate apps, and are also combined into a single app, MCrna, for rapid batch processing of RNA-seq data without requiring intermediate results to be transferred between the TACC and CyVerse clouds. Figure 5 shows the relationship among these analysis tools within the MCrna app, which is used to process both replicates of an experiment in parallel. For each replicate, the MultiQC software outputs a quality report for the sequence data, before and after trimming, in an interactive HTML format. This report can be accessed via the visualization (eye) icon in the History panel (next to each loaded replicate, as shown in Figure 2). As with the HTML format, text, image, and other web browser-compatible files can be visualized by clicking the icon. For files that can be displayed on a genome browser (e.g., BAM, bigWig, etc.), the user can also generate browser-ready links by clicking the same icon. These links address the cloud storage system of the CyVerse project, so they can be displayed on genome browsers hosted by different portals. If the user clicks on the output file name (from the History panel), they will be directed to the CyVerse Data Commons landing page, where they can preview or download the results. For files larger than a few GB, we recommend that the user download their data using either iCommands or CyberDuck, using the CyVerse Data Store path available in the file URL.

Automated differential expression analysis
One of the key advantages of distributing the MaizeCODE data through SciApps is that it facilitates downstream analysis. In this section, we show how differential expression analysis can be performed, at either the gene or isoform level, using the primary analysis results discussed in the last section. As shown in Figure 6, after loading the results for two experiments from the MaizeCODE data page (Figure 2), one for ear tissue and the other for root tissue, we can launch the RSEM_de app from the 'Comparison' category (or by searching the app panel). For the analysis, users drag and drop output files with names starting with 'rsem_' into the input field for both replicates of each sample. The analysis job can then be submitted to the cloud for running, and the results (i.e., the differentially expressed genes) will be available within a few minutes.
Note that the app is flexible in handling different numbers of replicates per sample. Additional input fields can be added using the '+ Insert' button. For the MaizeCODE project, most data sets are generated with two replicates.
Users can check the results through the History panel or the list of jobs (under the 'Workflow' tab from the top menu). Users can also select jobs from the History panel and save them as new workflows to organize the analysis and/or share it with others. Given that the XSEDE/TACC cloud is a shared resource, and jobs may be queued for several minutes to several hours, we have also established a local cluster (Wang et al., 2015) to quickly process small jobs requiring less than an hour to complete. Powered by the Agave API (Dooley et al., 2012), SciApps treats both the XSEDE/TACC cloud and the local cluster as virtual execution systems, allowing a scientific app to be configured to run on either cloud. As such, by using the CyVerse Data Store

The annotation file is then passed again to the StringTie app to compute gene expression quantification results, which are then input to the Ballgown app (Frazee et al., 2015) to compute differentially expressed isoforms. Again, the user can drag and drop the alignment files and assembled transcripts from the MaizeCODE primary analysis results without repeating the time-consuming alignment and quantification steps. Given that all results are accessible through web URLs, users can also retrieve the data directly to their local server for further interactive

Figure 1. Flowchart of generating genome browser tracks via cURL for a MaizeCODE workflow.

Figure 2. Web browser interface of the MaizeCODE data page. In the middle panel page, a list

Figure 3. Genome browser tracks for the MaizeCODE data. JBrowse is used to hold the

Figure 4. Graphical workflow diagrams for differential expression analysis (top) and MethylC-

Figure 6. Using the RSEM_de app for gene-level differential expression analysis.

analysis. For example, SciApps users can inspect the quantification results of each pair of

Table 1. SciApps release 1.0 RESTful API

(Krueger and Andrews, 2011)s to ensure reproducibility across different cloud resources. To support the analysis of MaizeCODE data, over 20 software tools are integrated. Figure 4 shows two publicly accessible workflows for differential expression analysis and cytosine methylation analysis, building on the popular STAR (Dobin et al., 2013)/RSEM (Li and Dewey, 2011)/StringTie (Pertea et al., 2015) and Bismark (Krueger and Andrews, 2011) pipelines, respectively. These workflows can be constructed either with the SciApps GUI or through the API. The user can retrieve the inputs, metadata, results, and provenance of the software used in the analysis with a unique workflow ID. The interactive graph, along with the platform guide (https://cyverse-sciapps-guide.readthedocs-hosted.com/en/latest/index.html), helps users to understand how multiple apps are used together to analyze a specific assay. For MaizeCODE data, the graph is also helpful for visually inspecting the input-output relationship. Additionally, the user can check the input data (through the input file node) and relaunch each individual step of the analysis, or even the entire analysis, via the web interface or API.