FAST: FAST Analysis of Sequences Toolbox

FAST (FAST Analysis of Sequences Toolbox) provides simple, powerful open source command-line tools to filter, transform, annotate and analyze biological sequence data. Modeled after the GNU (GNU's Not Unix) Textutils such as grep, cut, and tr, FAST tools such as fasgrep, fascut, and fastr make it easy to rapidly prototype expressive bioinformatic workflows in a compact and generic command vocabulary. Compact combinatorial encoding of data workflows with FAST commands can simplify the documentation and reproducibility of bioinformatic protocols, supporting better transparency in biological data science. Interface self-consistency and conformity with conventions of GNU, Matlab, Perl, BioPerl, R, and GenBank help make FAST easy and rewarding to learn. FAST automates numerical, taxonomic, and text-based sorting, selection and transformation of sequence records and alignment sites based on content, index ranges, descriptive tags, annotated features, and in-line calculated analytics, including composition and codon usage. Automated content- and feature-based extraction of sites and support for molecular population genetic statistics make FAST useful for molecular evolutionary analysis. FAST is portable, easy to install and secure thanks to the relative maturity of its Perl and BioPerl foundations, with stable releases posted to CPAN. Development as well as a publicly accessible Cookbook and Wiki are available on the FAST GitHub repository at https://github.com/tlawrence3/FAST. The default data exchange format in FAST is Multi-FastA (specifically, a restriction of BioPerl FastA format). Sanger and Illumina 1.8+ FastQ formatted files are also supported. FAST makes it easier for non-programmer biologists to interactively investigate and control biological data at the speed of thought.


Introduction
Bioinformatic software for non-programmers is traditionally implemented for user convenience in monolithic applications with Graphical User Interfaces (GUIs) (Smith et al., 1994; Stothard, 2000; Rampp et al., 2006; Librado and Rozas, 2009; Waterhouse et al., 2009; Gouy et al., 2010). However, the monolithic application paradigm is easily outscaled by today's big biological data, particularly Next Generation Sequencing (NGS) data at gigabyte- and terabyte-scales. Better empowerment of non-programmers for genome-scale analytics of big biological data has been achieved through web-based genome browser interfaces (Markowitz et al., 2014; Cunningham et al., 2015; Rosenbloom et al., 2015). On the other hand, for smaller datasets, sequence and alignment editor applications encourage manual manipulation of data, which is error-prone and essentially irreproducible. To reduce error and increase reproducibility in the publishing of bioinformatic and biostatistical protocols it is important to facilitate the documentation and automation of data science workflows through scripts and literate programming facilities (Knuth, 1984) such as emacs org-mode (http://orgmode.org, as demonstrated in, for example, Delescluse et al., 2012) that both completely document and encode scientific workflows for machine processing of biological data.
Reproducibility in bioinformatics and biostatistics protocols is crucial to maintaining public trust in the value of its investments in high-throughput and high-dimensional measurements of complex biological systems (Baggerly and Coombes, 2009; Hutson, 2010; Baggerly and Coombes, 2011; Huang and Gottardo, 2013). In one analysis, only two of 18 published microarray gene-expression analyses were completely reproducible, in part because key analysis steps were made with proprietary closed-source software (Ioannidis et al., 2008). Furthermore, even though analytical errors are a major source of retractions in the scientific literature (Casadevall et al., 2014), peer review and publication of scientific data processing protocols are generally not yet required to publish scientific studies. Adequate documentation of bioinformatic and biostatistical workflows and open source sharing of code upon publication (Peng, 2009) facilitate crowd-sourced verification, correction and extension of code-based analyses (Barnes, 2010; Morin et al., 2012), and reuse of software and data to enable more scientific discovery returns from public data (Peng, 2011). Peer review and publication of the data science protocols associated with scientific studies stem the temptation to overinterpret results and encourage more objectivity in data science (Boulesteix, 2010). The ultimate remedy for these problems is to expand literacy in modern computational and statistical data science for science students in general (Morin et al., 2012; Joppa et al., 2013).
Web-based open-source workflow suites such as Galaxy (Blankenberg and Hillman-Jackson, 2014), Taverna (Oinn et al., 2006) and BioExtract (Lushbough et al., 2011) are a recent innovation in the direction of greater reproducibility in bioinformatics protocols for genome-scale analytics. However, the most powerful, transparent and customizable medium for reproducible bioinformatics work is only available to bioinformatics specialists and programmers through Application Programming Interfaces (APIs) such as BioPerl and Ensembl (Yates et al., 2015).
Yet workflow design suites and programming APIs require dedication and time to learn. There is a need for more bioinformatics software in between GUIs and APIs that empowers non-programmer scientists and researchers to interactively and reproducibly control, process and analyze their data without manual interventions. Closer inspection of data and interactive construction and control of data workflows make it much easier to rapidly prototype error-free workflows, nipping in the bud errors that can completely confound downstream analyses. In scientific computing, the time-tested paradigm for rapid prototyping of reproducible data workflows is the Unix command-line.
In this tradition we here present FAST: FAST Analysis of Sequences Toolbox, modeled after the standard Unix toolkit (Peek, 2001), now called Coreutils. The FAST tools follow the Unix philosophy to "do one thing and do it well" and "write programs to work together" (Stutz, 2000). FAST workflows are completely automated; no manual interventions to data are required. FAST falls between a GUI and an API, because it is used through a Command-Line Interface (CLI). Although the FAST tools are written in Perl using BioPerl packages (Stajich et al., 2002), FAST users do not need to be able to program Perl or know BioPerl. FAST users only need basic competence in Unix and the modest skill to compose command pipelines in the Unix shell. FAST therefore supports an emerging movement to empower non-programmer biologists to learn Unix for scientific computing. Books and courses in this emerging market include the recent "UNIX and Perl to the Rescue!" (Bradnam and Korf, 2012) and the Software Carpentry and Data Carpentry Foundations workshops (Wilson, 2014).
Unix command pipelines are the paradigmatic example of the "pipes and filters" design pattern that embodies serial processing of data through sequences of modular and reusable computations. The "pipes and filters" design pattern is a special case of component-based software engineering (McIlroy, 1969) and a core paradigm in software architecture (Garlan and Shaw, 1994). The component-wise organization of FAST affords access to an infinite variety of customizable queries and workflows on biological sequence data using a small command vocabulary and combinatorial logic. Component-based software is easier to learn, maintain and extend. It also makes it easy for users to interactively develop new protocols through the modular extension and recombination of existing protocols. As shown in the examples below, non-trivial computations may be expressed on a single line of the printed page. Thus, FAST can help empower non-programmer biologists to develop and communicate powerful and reproducible bioinformatic workflows for scientific investigations and publishing.
Open-source command-line utilities for bioinformatics such as the EMBOSS package (Rice et al., 2000), the FASTX tools (Gordon, 2009) or the scripts that come with BioPerl (Stajich et al., 2002) typically offer suites of tools with simple, well-defined functions that lend themselves to scripting, but are not necessarily designed according to the Unix toolbox philosophy specifically to interoperate through serial composition over pipes. Similarly, FaBox (Villesen, 2007) is a free and open online server with functions that overlap with FAST tools, but is not designed for serial composition. On the other hand, the Unix toolbox model has been used before in more specialized bioinformatics applications such as the popular SAMTools suite (Li et al., 2009) and in the processing of NMR data (Delaglio et al., 1995). A toolsuite called bp-utils, with a similar design philosophy and some overlapping functionality with FAST, has recently been released at http://diverge.hunter.cuny.edu/labwiki/Bioutils. We have written extensive documentation for each FAST utility along with useful error messages following recommended practice (Seemann, 2013). FAST is free and open source; its code is freely available to anyone to re-use, verify and extend through its GitHub repository.

The FAST Data Model
The Unix Coreutils paradigm allows users to treat plain-text files and data streams as databases in which records correspond to single lines containing fields separated by delimiters such as commas, tabs, or strings of white-space characters. FAST extends this paradigm to biological sequence data, allowing users to treat collections of files and streams of multi-line sequence records as databases for complex queries, transformations and analytics. FAST generalizes the GNU Coreutils model exactly because it models sequence record descriptions as an ordered collection of description fields (see below).
Another design feature of Unix tools that also characterizes the FAST tools is their ability to accept input not only from one or more files but also from what is called standard input, a datastream supported by the Unix shell, and to output analogously to standard output. It is this facility that allows FAST tools to be serially composed in Unix pipelines that compactly represent an infinite variety of expressive bioinformatic workflows.
The default data exchange format for FAST tools is the universally recognized FastA format (Lipman and Pearson, 1985). While no universal standard exists for this format, for FAST, "FastA format" means what is conventionally called "multi-fasta" format of sequence or alignment data, largely as implemented in BioPerl in the module Bio::SeqIO::fasta (Stajich et al., 2002).
In the FAST implementation of FastA format, multiple sequence records may appear in a single file or input stream. Sequence data may contain gap characters. The logical elements (or fields) of a sequence record are its identifier, its description and its sequence. The identifier (indicated with id in the illustration below) and description (desc) together make the identifier line of a sequence record, which must begin with the sequence record start symbol > on a single line. The description begins after the first block of white-space on this line (indicated with <space>). The sequence of a record appears immediately after its identifier line and may continue over multiple lines until the next record starts.
In FAST, users may alter how description fields are defined in sequence records by using Perl-style regular expressions to define delimiters (indicated by <delim>). FAST uses one-based indexing of description fields.
The FAST data model is illustrated as follows:

>id<space>desc-field1<delim>desc-field2<delim>desc-field3
sequence
sequence (continued)
In FAST, the sequence identifier is thought of as the 0th field of the identifier line. One-based indexing of description fields in FAST is therefore consistent with zero-based indexing in Perl and one-based indexing of sequence coordinates, making all indexing consistent and uniform in FAST.
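As a minimal sketch of this indexing scheme (not FAST's Perl implementation), the following Python fragment splits an identifier line into a zeroth field holding the identifier and one-based description fields. The function name, the example record and the " | " delimiter are assumptions made for illustration only.

```python
import re

def parse_identifier_line(line, delim=r"\s+"):
    """Return a list whose item 0 is the identifier and whose items 1..n
    are the description fields, split on the regex `delim` (hypothetical helper)."""
    if not line.startswith(">"):
        raise ValueError("identifier line must start with '>'")
    # The description begins after the first block of white-space.
    parts = line[1:].rstrip("\n").split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    fields = [identifier]
    if description:
        fields.extend(re.split(delim, description))
    return fields

# Illustrative record; the delimiter here is a literal " | ".
fields = parse_identifier_line(">seq1 Homo sapiens | CYP2D6 | liver", delim=r" \| ")
```

With this layout, fields[0] is the identifier and fields[1] is the first description field, so no offset arithmetic is needed when moving between the identifier and the description.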
Most FAST tools extend the field-based paradigm further by supporting tagged values in sequence record descriptions. Tagged values are name-value pairs with a format "name=value" as common in General Feature Format (GFF) used in sequence annotation (see e.g., https://www.sanger.ac.uk/resources/software/gff/) or an alternative "name:value" format that certain FAST tools themselves can annotate in-line into sequence records by appending a new field to sequence record descriptions. Support for tagged values in FAST makes it possible to operate on sequence records with unordered or heterogeneous description fields.
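To make the idea of tagged values concrete, here is a small Python sketch that collects "name=value" and "name:value" pairs from a description regardless of their order. The pattern, the helper name and the example description are illustrative assumptions, not the parsing rules FAST itself applies.

```python
import re

# A tag is a word, then '=' or ':', then a run of non-space characters.
# This simple pattern is an assumption made for illustration.
TAG_PATTERN = re.compile(r"(\w+)[=:](\S+)")

def tagged_values(description):
    """Return a dict of tag names to values found anywhere in a description."""
    return dict(TAG_PATTERN.findall(description))

tags = tagged_values("cytochrome P450 gc:0.48 length=1503 xl0:MKT")
```

Because values are looked up by name, records whose description fields appear in different orders can still be filtered or sorted on the same tag.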

Overview of the FAST Tools
FAST utilities may be assigned to categories according to their default behavior and intended use. There are FAST tools for selection of data from sequence records, transformation of data, annotation of sequence record descriptions with computed characteristics of the data, and analysis. A complete description of all utilities included in the first major release of FAST is shown in Table 1.
The analysis class is distinguished from the other classes because by default, these utilities output tables of plain-text data rather than sequence record data in FastA format. Two other tools, fasconvert and gbfcut, are designed to either input or output FastA format sequence records by default. Standardization of the FAST data model allows users to serially compose FAST tools into pipelines at the Unix command-line, which is indicated as the "main workflow" in the overview of the project shown in Figure 1.

General Implementation and Benchmarking
The BioPerl backend of FAST 1.x is version 1.6.901 downloaded in January, 2012. Bio::SeqIO components were updated to version 1.6.923 on June 4, 2014 and some Bio::Root components were updated on July 10, 2014 (github commit 50f87e9a4d). We introduced a small number of customizations to the BioPerl code-base, primarily to enable the translation of sequences containing gaps. All of the BioPerl dependencies of FAST are isolated under its own FAST name-space.
To help reduce the overall installation footprint of FAST, BioPerl dependencies of FAST scripts were analyzed with the Cava packager (http://www.cavapackager.com).
Nearly all FAST utilities process sequence records inline and therefore have linear runtime complexity in the number of sequences. Exceptions are fassort and fastail, which both require some paging of data into temporary files. We performed benchmarking of FAST tools using randomly generated sequences of even composition generated in Python and the Benchmark v1.15 Perl module on a MacBook Pro 2.5 GHz Intel i7, with 8 GB of RAM. We examined average CPU runtime over 100 replicates, comparing input sizes of 25K, 250K, or 1M sequence records of length 100, 10K, 100K, or 1M bp. Our benchmarking results show that despite data paging, fassort runtimes scale linearly with input size (Figure 2). FAST is not designed to be fastest at computing its solutions. Rather the fastness of FAST lies in how quickly an adept user can interactively prototype, develop, and express bioinformatic workflows with it.

Installation and Dependencies
FAST requires a working Perl installation, with official releases distributed through the Comprehensive Perl Archive Network (CPAN). A small footprint of BioPerl dependencies has been packaged together in the FAST namespace. Other CPAN dependencies may be detected and installed by the cpan package manager. A fully automated install from CPAN may on many systems be initiated by executing perl -MCPAN -e 'install FAST'. A manual install follows standard Perl install procedure. After downloading and unpacking the source directory, change into that directory and execute: perl Makefile.PL; make; make test; (sudo) make install.
We recommend that first-time users first complete the automated install from CPAN, which will handle prerequisites, and then download and open the source code directory in order to practice the example usage commands (such as those in the sequel) on sample data provided within.

FIGURE 2 | Average processor time of 100 repetitions required to complete analysis using indicated utility. Utilities were run on six datasets consisting of (A) 25,000, 250,000, and 1,000,000 100 bp sequences and (B) 10,000, 100,000, and 1,000,000 1000 bp sequences.

Implementation and Usage of Individual Tools
For reference on Perl regular expressions, try executing man perlre or perldoc perlre. For example, to print only protein sequences that do not start with M for methionine, execute:

fasgrep -s -v "^M" t/data/P450.fas
In the above command the -s option directs fasgrep to search the sequence data of each record. The -v option directs fasgrep to print records that do not match the pattern given by its argument, which is the regular expression ^M, in which the anchor ^ specifies the beginning of the sequence data. fasgrep uses the BioPerl Bio::Tools::SeqPattern library to support ambiguity expansion of IUPAC codes in its regular expression arguments. Thus, to show that a segment of Saccharomyces cerevisiae chromosome 1 contains at least one instance of an "Autonomous Consensus Sequence" characteristic of yeast origins of replication (Leonard and McHali, 2013), look whether the following command outputs a sequence or not (note that all commands reproduced here should be entered on a single line at the Unix shell prompt):

fasgrep -se 'WTTTAYRTTTW' t/data/chr01.fas
which is equivalent to:

fasgrep -se '[AT]TTTA[CT][AG]TTT[AT]' t/data/chr01.fas
These examples demonstrate queries on sequence data, but fasgrep may be directed to search against other parts of sequence records including identifiers, descriptions, fields and more.
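The equivalence between the two fasgrep patterns above can be sketched in Python as a plain character-by-character expansion of IUPAC nucleotide ambiguity codes. This is only an illustration of the idea; it is not the Bio::Tools::SeqPattern code that fasgrep relies on, the function name is hypothetical, and the naive expansion assumes the motif contains no regex character classes of its own.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC_DNA = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_iupac(motif):
    """Replace each ambiguity code with a regex character class."""
    return "".join(
        "[" + IUPAC_DNA[base] + "]" if base in IUPAC_DNA else base
        for base in motif
    )

pattern = expand_iupac("WTTTAYRTTTW")
# A made-up sequence containing one instance of the motif.
hit = re.search(pattern, "GGATTTATGTTTTCC")
```

Here pattern is the same bracketed expression as in the second fasgrep command, which is why both commands select the same records.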
fasfilter supports precise numerical-based selections of sequence records from numerical data in identifiers, descriptions, fields or tagged-values in descriptions.
fasfilter supports open ranges such as 100-, meaning "greater than or equal to 100," closed ranges like 1e6-5e8 (meaning 1 × 10^6 to 5 × 10^8) and compound ranges such as 200-400,500-. Ranges may be specified in Perl-style (or GenBank coordinate style) like from..to, in R/Octave-style like from:to or UNIX cut-style as in from-to. For example, to print records with gi numbers between 200 and 500 million, try executing:

fasfilter -x "gi\|(\d+)" 2e8..5e8 t/data/P450.fas
This example uses the -x option, which directs fasfilter to filter on the value within the capture buffer which occurs within the left-most pair of parentheses of the argument, here (\d+), and \d+ is a regular expression matching a string of one or more digits from 0 to 9. The backslash after gi in the first argument quotes the vertical bar character to make it literal, since the vertical bar character is a special character in regular expressions.
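The range notation and capture-based filtering described above might be sketched in Python as follows. This is a simplified illustration under stated assumptions: bounds are treated as inclusive, negative numbers and exponents such as 1e-6 are not handled, and the function names and the example identifier are hypothetical rather than taken from FAST.

```python
import math
import re

def parse_ranges(spec):
    """Parse a compound range such as '200-400,500-' into (low, high) pairs.
    Accepts '..', ':' or '-' as the separator between bounds."""
    ranges = []
    for part in spec.split(","):
        match = re.match(r"^(.*?)(?:\.\.|:|-)(.*)$", part)
        if match is None:  # a single value such as '42'
            low = high = float(part)
        else:
            # An empty bound makes the range open on that side.
            low = float(match.group(1)) if match.group(1) else -math.inf
            high = float(match.group(2)) if match.group(2) else math.inf
        ranges.append((low, high))
    return ranges

def in_ranges(value, ranges):
    """True if value lies in at least one (inclusive) range."""
    return any(low <= value <= high for low, high in ranges)

def capture_value(pattern, text):
    """Return the first capture group of `pattern` in `text` as a number."""
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None

gi = capture_value(r"gi\|(\d+)", "gi|294979683|ref|NP_000001.1|")
selected = in_ranges(gi, parse_ranges("2e8..5e8"))
```

Keeping the range parser separate from the value extractor mirrors the way one range syntax can be applied to identifiers, descriptions, fields or tagged values.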
fascut supports index-based selections of characters and fields in sequence records allowing repetition, reordering, variable steps, and reversals. Ranges are specified otherwise similarly to fasfilter. Negative indices count backwards from last characters and fields. fascut outputs the concatenation of data selections for each sequence record. Variable step-sizes in index ranges conveniently specify first, second or third codon positions in codon sequence records, for example. Examples using this syntax appear in the sequel. To print the last ten residues of each sequence, execute:

fascut -10..-1 t/data/P450.fas
alncut implements content-based selection of sites in alignments including gap-free sites, non-allgap sites, variable or invariant sites and parsimoniously informative sites, or their set-complements, all with the option of state-frequency thresholds applied per site. By default, alncut prints only invariant sites. To print the set-complement, or only variable sites, use the -v option:

alncut -v t/data/popset_32329588.fas
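The distinction between invariant and variable sites can be sketched in a few lines of Python. This is an illustrative fragment on a toy alignment, not alncut's implementation; in particular, treating every distinct character (including a gap) as a separate state is an assumption, and the function name is hypothetical.

```python
def select_sites(alignment, variable=False):
    """Keep invariant columns by default, or variable columns if requested."""
    keep = [
        index
        for index, column in enumerate(zip(*alignment))
        if (len(set(column)) > 1) == variable
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]

# Toy alignment: columns 2 and 5 (one-based) vary, the rest are invariant.
aln = ["ACGTA", "ACGTT", "AAGTA"]
invariant = select_sites(aln)
variable = select_sites(aln, variable=True)
```

Every column lands in exactly one of the two selections, which is the sense in which the variable sites are the set-complement of the invariant ones.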
To print sites in which no more than two sequences contain gaps, execute:

alncut -gf 2 t/data/popset_32329588.fas
gbfcut allows annotation-based sequence-extraction from GenBank format sequence files, useful for extracting all sequences that correspond to sets of the same type of annotated features in genome data. For example, to output 5′ and 3′ Untranslated Region (UTR) sequences from a GenBank formatted sequence of a gene, we use the -k option to restrict matching to features whose "keys" match the regular expression "UTR":

gbfcut -k UTR t/data/AF194338.1.gb
gbfcut can handle split features such as a coding region (CDS) that is split over several exons:

gbfcut -k CDS t/data/AF194338.1.gb
More fine-grained queries of features are possible using qualifiers defined with the -q option. Multiple qualifiers may be provided at once, specifying the selection of records for which all qualifiers apply (conjunction). For example, compare the output of the following two commands:

gbfcut -k tRNA t/data/mito-ascaris.gb
gbfcut -k tRNA -q product=Ser -q note^AGN t/data/mito-ascaris.gb
The second command queries for features with key "tRNA" containing at least one qualifier "/product" whose value matches the string literal "Ser" and no qualifiers of type "/note" whose values match the string literal "AGN."
gbfalncut automates the selection of sites from alignments that correspond to one or more features annotated on one of the sequences in a separate GenBank record. This workflow eliminates the need for manual entry of coordinates and implements a useful bioinformatic query in terms of known and reproducible quantities from public data and sequence records, allowing users to query sites based on biological vocabularies of sequence features. For an example of its use see the section "Composing Workflows in FAST" in the sequel.
faspaste concatenates data from records input in parallel from multiple data-streams or files, record-by-record. The user may paste data from the standard input stream and from multiple input files, in an order defined by the arguments. Records from standard input may be used multiple times in concatenating data. As in some implementations of the Unix tool paste, a hyphen input argument to faspaste refers to the standard input stream and may be used more than once as an input argument. For maximum configurability, faspaste concatenates only one data field type (i.e., sequences or descriptions) at a time. Users may select which data stream will provide templates to receive concatenated data in output records. For example, sequences of corresponding records from two data-files may be pasted together and output with the identifiers and descriptions of the data in the first file.
fassort and fasuniq are designed to be often used together in Unix pipelines. The fassort utility implements numerical and textual sorting of sequence records by specific fields. The fasuniq utility removes (and optionally counts) records that are redundant with respect to a specific field, such as sequences or identifiers. In the implementation of fassort, pages of data are sorted with optimized routines in Perl Sort::Key that, if necessary, are written to temporary files and merged with Sort::MergeSort. Like its Unix Coreutil analog uniq, fasuniq compares only immediately successive input records. Therefore, users will usually want to first sort data with fassort before passing it to fasuniq. To illustrate, the following example combines and sorts input records from two instances of the same file, and then counts and removes each redundant record:

fassort -s t/data/P450.fas t/data/P450.fas | fasuniq -c
This example illustrates that the same file may be specified as an input stream more than once to any FAST command.
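Why sorting should precede duplicate removal can be seen in a short Python sketch. It mimics the compare-only-adjacent-records behaviour described above using itertools.groupby; the record tuples and the function name are hypothetical, and this is not how fassort or fasuniq are implemented.

```python
from itertools import groupby

def uniq_count(records):
    """Collapse runs of adjacent records that share the same sequence,
    returning (count, first_record_of_run) pairs."""
    collapsed = []
    for _, run in groupby(records, key=lambda rec: rec[1]):
        run = list(run)
        collapsed.append((len(run), run[0]))
    return collapsed

# (identifier, sequence) pairs; records "a" and "c" carry the same sequence.
records = [("a", "MKT"), ("b", "MAL"), ("c", "MKT")]

unsorted_counts = uniq_count(records)
sorted_counts = uniq_count(sorted(records, key=lambda rec: rec[1]))
```

Without sorting, the two identical sequences are never adjacent and are reported as separate records; after sorting they form a single run with a count of two.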
fastax and fastaxsort implement taxonomic searching and sorting of sequence records that are already annotated with NCBI taxonomic identifiers, using taxonomic data from NCBI Taxonomy (Sayers et al., 2009). For example, a query of "Metazoa" would match records labeled "Homo sapiens," "Drosophila melanogaster," and "Lepidoptera" but not "Candida albicans" or "Alphaproteobacteria." Taxonomic selections may be logically negated and/or restricted to only those records containing valid NCBI taxonomic identifiers. Purely for historical reasons, the internal implementation of NCBI taxonomic data is custom to FAST rather than the Bio::Taxonomy libraries in BioPerl. A sample of data from tRNAdb-CE (Abe et al., 2014), in which data records are annotated with valid NCBI taxonomic identifiers in specific description fields, is included with the FAST installation package. After downloading datafiles "nodes.dmp" and "names.dmp" from NCBI Taxonomy, the following command filters sequences from Rhizobiales, assuming that records are labeled with their species (and strain) of origin in the third field of the description of the sample data file:

fastax -f 3 -S " \| " nodes.dmp names.dmp Rhizobiales t/data/tRNAdb-CE.sample2000.fas
fastr and fassub handle, respectively, character- and string-based transformations of sequence records. The utility fastr handles character-based transliterations, deletions and "squashing" (deletion of consecutive repeats), sequence degapping, and restriction or remapping of sequence data to strict or IUPAC ambiguity alphabets. For example, to lower-case all sequence characters, execute:

fastr -s 'A-Z' 'a-z' t/data/P450.fas
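The character-based operations that fastr provides (transliteration, squashing and degapping) have close analogs in Python's string methods, sketched below. The helper names and the choice of "-" and "." as gap characters are assumptions for illustration; the fragment does not reproduce fastr's options or alphabets.

```python
import re
import string

def transliterate(seq, source, target):
    """Map each character in `source` to the character at the same
    position in `target`, like the Unix tr utility."""
    return seq.translate(str.maketrans(source, target))

def squash(seq, chars):
    """Collapse runs of any character in `chars` to a single occurrence."""
    return re.sub(r"([%s])\1+" % re.escape(chars), r"\1", seq)

def degap(seq, gaps="-."):
    """Delete gap characters (assumed here to be '-' and '.')."""
    return seq.translate(str.maketrans("", "", gaps))

lowered = transliterate("MKTAYIAK", string.ascii_uppercase, string.ascii_lowercase)
squashed = squash("ACNNNNGT", "N")
ungapped = degap("AC--G.T")
```

Each helper is a pure function of the sequence string, which is what lets such transformations be chained freely in a pipeline.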
Degapping requires only the simple command:

fastr --degap t/data/P450.clustalw2.fas
The utility fassub allows more arbitrary substitutions on sets of strings matched to Perl regexes, implemented through direction of the Perl s/// substitution operator on specific fields. Capture buffers may be used to refer to matched data in substitutions, for example, to reverse the order of genus and species in a file such as t/data/P450.fas in which scientific names occur in descriptions enclosed with square brackets.
fascomp, fasxl and fascodon provide for annotation and analytics of compositions, translations, and codon usage frequencies of sequence records (with start and stop codons counted distinctly, in the last case). All genetic codes included in BioPerl, ultimately from NCBI Entrez, are supported.
alnpi outputs molecular population genetic statistics cited in Table 2 for each alignment on input. It can output a set of statistics for each alignment on input in plain text or LaTeX format. alnpi also supports sliding window and pairwise analysis of input data. Data and command examples are provided to reproduce the tables and sliding window analyses of statistics published in Ardell et al. (2003). Purely for historical reasons, alnpi does not use the perlymorphism routines in the BioPerl library Bio::PopGen (Stajich and Hahn, 2005). However, all of the code for these calculations has been reviewed and compared against calculations produced from DNASP (Librado and Rozas, 2009) as described previously (Ardell, 2004).
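For readers unfamiliar with these statistics, the following Python sketch computes three of them from a toy alignment using the standard textbook formulas: the average number of pairwise differences, the Watterson estimator, and Tajima's D (Tajima, 1989). It is an independent illustration, not alnpi's code; it assumes a gap-free alignment of at least two sequences, reports values per alignment rather than per site, and uses a hypothetical function name.

```python
from itertools import combinations
from math import sqrt

def diversity_statistics(alignment):
    """Return (k, theta_w, tajimas_d) for a gap-free alignment."""
    n = len(alignment)
    # S: number of segregating (variable) sites.
    seg_sites = sum(1 for column in zip(*alignment) if len(set(column)) > 1)
    # k: average number of pairwise differences between sequences.
    pairs = list(combinations(alignment, 2))
    k = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs) / len(pairs)
    # Sample-size constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_w = seg_sites / a1
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # D is undefined when the variance term is zero (e.g. no variable sites).
    d = (k - theta_w) / sqrt(variance) if variance > 0 else float("nan")
    return k, theta_w, d

k, theta_w, d = diversity_statistics(["AAAA", "AAAT", "AATT", "ATTT"])
```

Dividing k by the alignment length gives nucleotide diversity per site; a positive D indicates an excess of intermediate-frequency variants relative to the neutral expectation, and a negative D an excess of rare variants.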

Composing Workflows in FAST
Here we show how to interactively prototype a pipeline that computes the sliding window profile of Tajima's D of Figure 4A in Ardell et al. (2003) from a publicly available datafile. The datafile associated with this figure is an NCBI PopSet with accession ID 32329588 containing an alignment of a fully annotated ciliate gene (accession AF194338.1) against several partially sequenced allelic variants. One of the variants with accession ID AY243496.1 appears to be partly non-functionalized. First, to see this data, we view it in the pager less (press "q" to quit and "space" to page):

less t/data/popset_32329588.fas
A key feature of the Unix shell allows users to recall previous commands in their so-called history, usually by typing the "up-arrow," for possible re-use and editing. To check the number of sequences and characters in the alignment, execute:

faswc t/data/popset_32329588.fas
To compute our population genetic statistics we wish to remove the annotated reference sequence, the deactivated allele, and one additional sequence from analysis, which we can do using fasgrep, verifying that it reduced data by the correct number of records (six) by piping to faswc.
Another example, which reproduces a published result from Ardell et al. (2003), demonstrates the utility of combining gbfalncut with alnpi, allowing users to select sites from alignments corresponding to features annotated on one of the sequences in a separate GenBank file. For example, to calculate a Tajima's D statistic for 5′ UTRs, corresponding to the last line in Table 1 of that work, execute:

gbfalncut -k t/data/AF194338.1.gb 5.UTR t/data/popset_32329588.fas | fasgrep -v "(AF194|349[06])" | fastr --strict -N - | alncut -g | alnpi

TABLE 2 | Molecular population genetic statistics output by alnpi.
Statistic                     Symbol   Citation
…                             …        Nei and Li, 1979
Nucleotide diversity          π        Nei and Li, 1979
Watterson estimator           θ_W      Watterson, 1975
Expected number of alleles    E(K)     Ewens, 1972
Tajima's D                    D        Tajima, 1989
Fu and Li's D*                D*       Fu and Li, 1993
Fu and Li's F*                F*       Fu and Li, 1993; Simonsen et al., 1995
Fu and Li's Eta S             η_S      Fu and Li, 1993
Fu and Li's Eta               η        Fu and Li, 1993

Frontiers in Genetics | www.frontiersin.org

Selecting Sequences by Encoded Motifs
An advantage of the annotation approach in FAST is the ability to select and sort sequences by attributes computed and annotated into data by utilities upstream in the pipeline. For example, to select protein-coding genes from a file cds.fas whose translations contain the N-glycosylation amino acid motif (Kornfeld and Kornfeld, 1985), one could execute:

fasxl -a cds.fas | fasgrep -t xl0
The first command in the pipeline translates each sequence and appends the translation to the description with the tag "xl0" (indicating translation in the zeroth reading frame). The second command in the pipeline uses a regular expression to represent the N-glycosylation amino acid motif pattern as the value of a "name:value" pair in the description with tag "xl0," hence processing the annotations produced by fasxl. The regex argument to fasgrep is quoted to protect the argument from interpretation by the shell. The last command in the pipeline removes the last field in the description, restoring records as they were before they were annotated by fasxl.
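As an illustration of motif-based selection on translated sequences, the Python fragment below searches toy protein strings for an N-glycosylation sequon. The regular expression is the widely used PROSITE-style pattern N-{P}-[ST]-{P}; it is an assumption here and is not necessarily the exact expression passed to fasgrep in the pipeline above. The record names and sequences are made up.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr, then any residue except Pro.
N_GLYCOSYLATION = re.compile(r"N[^P][ST][^P]")

# Toy translations keyed by hypothetical record identifiers.
translations = {
    "p1": "MKNGTALLV",  # contains NGTA, which matches the motif
    "p2": "MKNPTALLV",  # Pro after Asn blocks the motif
    "p3": "MALLVKK",    # no Asn at all
}

hits = [name for name, protein in translations.items()
        if N_GLYCOSYLATION.search(protein)]
```

Selecting on the translation while leaving the nucleotide sequence untouched is what the annotate-then-filter approach in the pipeline achieves.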

Sorting Records by Third Codon Position Composition
Another example illustrates the powerful expression of ranges in fascut. An optional "by" parameter in ranges allows increments or decrements in steps larger than one. To extract third-position bases from codon sequence records, compute and annotate their compositions into record descriptions, ultimately sorting records by their third-position adenosine contents, do:

fascut 3:-1:3 cds.fas | fascomp | fassort -nt comp_A
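The semantics of a stepped range such as 3:-1:3 can be sketched in Python as a one-based, inclusive selection in which negative indices count from the end. The function name and the two toy coding sequences are hypothetical, and the sketch covers only a single range rather than fascut's full syntax.

```python
def select(seq, start, end, step=1):
    """One-based inclusive selection from `start` to `end` by `step`;
    negative indices count backwards from the last character."""
    n = len(seq)
    first = start if start > 0 else n + start + 1
    last = end if end > 0 else n + end + 1
    if step > 0:
        positions = range(first, last + 1, step)
    else:
        positions = range(first, last - 1, step)
    return "".join(seq[p - 1] for p in positions)

# Toy coding sequences (four codons each).
cds = {"geneA": "ATGGCCAAGTGA", "geneB": "ATGAAAGCATAA"}

# Third codon positions: from 3 to the last base in steps of 3.
third = {name: select(seq, 3, -1, 3) for name, seq in cds.items()}

# Sort record names by the fraction of A at third positions.
by_a_content = sorted(third, key=lambda name: third[name].count("A") / len(third[name]))
```

The same function covers the other uses mentioned for fascut: select(seq, -10, -1) returns the last ten characters and select(seq, -1, 1, -1) reverses the sequence.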

More Advanced Merging of Data Records
More advanced usage of faspaste requires Unix pipelines. For example, to join both descriptions and sequences from two data-files, execute:

faspaste data1.fas data2.fas | faspaste -d - data2.fas
The hyphen second argument (-) to the second instance of faspaste refers to the input received from standard input through the pipe. This example works because by default, faspaste uses ("mutates") records from the data stream named in its first argument to receive the data concatenated from all records.
To prepend the first sequence of one file repeatedly to every sequence in another file, execute:

fashead -n 1 t/data/fasxl_test4.fas | faspaste -r - t/data/fasxl_test4.fas
To prepend the first sequence of one file repeatedly to every other sequence in another file, using identifiers and descriptions from the second file in the output, execute:

fashead -n 1 t/data/fasxl_test3.fas | faspaste -r -R 2 - t/data/fasxl_test4.fas

Further Documentation and Usage Examples
Upon installation, FAST generates and installs a complete man page for each FAST utility, which should be accessible by one or both of the following commands:

man fasgrep
perldoc fasgrep
In addition, a FAST Cookbook has been contributed by the authors and is available with the source code distribution or from the project GitHub repository at https://github.com/tlawrence3/FAST.

Concluding Remarks and Future Directions
Planned additions in future versions of FAST include fasrand and alnrand for automated sampling, permutations and bootstrapping of sequences and sites, respectively, and fasgo and fasgosort for selection and sorting of records by Gene Ontology categories (The Gene Ontology Consortium, 2015).