esl-sfetch - Man Page

retrieve (sub-)sequences from a sequence file

Synopsis

esl-sfetch [options] seqfile key
  (retrieve a single sequence by key)

esl-sfetch -c from..to [options] seqfile key
  (retrieve a single subsequence by key and coords)

esl-sfetch -f [options] seqfile keyfile
  (retrieve multiple sequences using a file of keys)

esl-sfetch -Cf [options] seqfile subseq-coord-file
  (retrieve multiple subsequences using file of keys and coords)

esl-sfetch --index seqfile
  (index a sequence file for retrievals)

Description

esl-sfetch retrieves one or more sequences or subsequences from seqfile.

The seqfile must be indexed using esl-sfetch --index seqfile. This creates an SSI index file seqfile.ssi.

To retrieve a single complete sequence, do esl-sfetch seqfile key, where key is the name or accession of the desired sequence.

To retrieve a single subsequence rather than a complete sequence, use the -c start..end option to provide start and end coordinates. The start and end coordinates are provided as one string, separated by any nonnumeric, nonwhitespace character or characters you like; see the -c option below for more details.

To retrieve more than one complete sequence at once, you may use the -f option, and the second command line argument will specify the name of a keyfile that contains a list of names or accessions, one per line; the first whitespace-delimited field on each line of this file is parsed as the name/accession.

To retrieve more than one subsequence at once, use the -C option in addition to -f, and now the second argument is parsed as a list of subsequence coordinate lines. See the -C option below for more details, including the format of these lines.

In DNA/RNA files, you may extract (sub-)sequences in reverse complement orientation in two different ways: either by providing a from coordinate that is greater than to, or by providing the -r option.
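The orientation rule above can be illustrated with a minimal Python sketch. This is not Easel's implementation: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive (this page does not state the convention), and only the plain DNA letters A, C, G, T and N are complemented.

```python
# Hypothetical helpers; a simplified DNA-only complement table (assumption).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def subseq(seq: str, start: int, end: int) -> str:
    """Extract start..end, assuming 1-based inclusive coordinates.

    When start is greater than end, the region end..start is extracted
    and returned in reverse complement orientation.
    """
    if start > end:
        return reverse_complement(seq[end - 1:start])
    return seq[start - 1:end]


example = "AAACCCGGGT"
print(subseq(example, 2, 5))  # AACC
print(subseq(example, 5, 2))  # GGTT
```

How -r interacts with a from coordinate greater than to is not described on this page, so the sketch leaves that combination out.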

When the -f option is used to do multiple (sub-)sequence retrieval, the file argument may be - (a single dash), in which case the list of names/accessions (or subsequence coordinate lines) is read from standard input. However, because a standard input stream can't be SSI indexed, (sub-)sequence retrieval from stdin may be slow.

Options

-h

Print brief help; includes version number and summary of all options, including expert options.

-c coords

Retrieve a subsequence with start and end coordinates specified by the coords string. This string consists of start and end coordinates separated by any nonnumeric, nonwhitespace character or characters you like; for example, -c 23..100, -c 23/100, or -c 23-100 all work. To retrieve a suffix of the sequence, you can omit the end coordinate; for example, -c 23: would work. To specify reverse complement (for DNA/RNA sequence), you can specify from greater than to; for example, -c 100..23 retrieves the reverse complement strand from 100 to 23.
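As an illustration of the coords syntax described for -c, here is a minimal Python sketch of one way such a string could be split. It is not taken from the Easel source; parse_coords and is_reverse are hypothetical names, and the regular expression is only one reading of "any nonnumeric, nonwhitespace character or characters".

```python
import re
from typing import Optional, Tuple


def parse_coords(coords: str) -> Tuple[int, Optional[int]]:
    """Split a -c style string into (start, end); end is None if omitted."""
    # Digits, then one or more characters that are neither digits nor
    # whitespace, then an optional run of digits for the end coordinate.
    match = re.fullmatch(r"(\d+)[^\d\s]+(\d*)", coords)
    if match is None:
        raise ValueError(f"cannot parse coords: {coords!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else None
    return start, end


def is_reverse(start: int, end: Optional[int]) -> bool:
    """A start greater than end requests the reverse complement."""
    return end is not None and start > end


print(parse_coords("23..100"))  # (23, 100)
print(parse_coords("23:"))      # (23, None)
```

All of the separators given as examples (.., / and -) produce the same pair, and a missing end is reported as None so the caller can treat it as "through the end of the sequence".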

-f

Interpret the second argument as a keyfile instead of as just one key. The first whitespace-delimited field on each line of keyfile is interpreted as a name or accession to be fetched. Any other fields on a line after the first one are ignored. Blank lines and lines beginning with # are ignored. This option doesn't work with the --index option.
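The keyfile rules above are simple enough to sketch in a few lines of Python. This is an illustration only, not Easel's parser; read_keys is a hypothetical name, and a comment is assumed to be a line whose very first character is #.

```python
from typing import Iterable, List


def read_keys(lines: Iterable[str]) -> List[str]:
    """Collect the first whitespace-delimited field of each usable line."""
    keys = []
    for line in lines:
        # Skip blank lines and lines beginning with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        # Any fields after the first one are ignored.
        keys.append(line.split()[0])
    return keys


sample = [
    "# keys to fetch\n",
    "\n",
    "SRPA_HUMAN  trailing fields are ignored\n",
    "seq2\n",
]
print(read_keys(sample))  # ['SRPA_HUMAN', 'seq2']
```

Because the function accepts any iterable of lines, the same sketch works for an open file or for sys.stdin, which mirrors the use of - as the keyfile argument.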

-o <f>

Output retrieved sequences to a file <f> instead of to stdout.

-n <s>

Rename the retrieved (sub-)sequence to <s>. Incompatible with -f.

-r

Reverse complement the retrieved (sub-)sequence. Only accepted for DNA/RNA sequences.

-C

Multiple subsequence retrieval mode, with -f option (required). Specifies that the second command line argument is to be parsed as a subsequence coordinate file, consisting of lines containing four whitespace-delimited fields: new_name, from, to, name/accession. For each such line, sequence name/accession is found, a subsequence from..to is extracted, and the subsequence is renamed new_name before being output. Any other fields after the first four are ignored. Blank lines and lines beginning with # are ignored.
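To make the four-field line format concrete, here is a minimal Python sketch that reads subsequence coordinate lines into records. It is an assumption-laden illustration rather than Easel's own parser: SubseqRequest and read_subseq_coords are hypothetical names, and comment lines are assumed to start with # in the first column.

```python
from typing import Iterable, List, NamedTuple


class SubseqRequest(NamedTuple):
    new_name: str   # name given to the extracted subsequence
    start: int      # the "from" field
    end: int        # the "to" field
    source: str     # name or accession of the source sequence


def read_subseq_coords(lines: Iterable[str]) -> List[SubseqRequest]:
    """Parse lines of: new_name from to name/accession."""
    requests = []
    for line in lines:
        # Skip blank lines and lines beginning with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"expected at least 4 fields: {line!r}")
        # Fields after the first four are ignored.
        new_name, start, end, source = fields[:4]
        requests.append(SubseqRequest(new_name, int(start), int(end), source))
    return requests


sample = [
    "# new_name from to source\n",
    "frag1 23 100 SRPA_HUMAN\n",
    "frag2 100 23 seq2 extra\n",
]
for request in read_subseq_coords(sample):
    print(request.new_name, request.start, request.end, request.source)
```

The names frag1, frag2 and seq2 are made up for the example; only SRPA_HUMAN appears elsewhere on this page.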

-O

Output retrieved sequence to a file named key. This is a convenience for saving some typing: instead of

  % esl-sfetch -o SRPA_HUMAN swissprot SRPA_HUMAN

you can just type

  % esl-sfetch -O swissprot SRPA_HUMAN

The -O option only works if you're retrieving a single sequence; it is incompatible with -f.

--index

Instead of retrieving a key, the special command esl-sfetch --index seqfile produces an SSI index of the names and accessions of the sequences in the seqfile. Indexing should be done once on the seqfile to prepare it for all future fetches.

Expert Options

--informat <s>

Assert that seqfile is in format <s>, bypassing format autodetection. Common choices for <s> include: fasta, embl, genbank. Alignment formats also work; common choices include: stockholm, a2m, afa, psiblast, clustal, phylip. For more information, and for codes for some less common formats, see the main documentation. The string <s> is case-insensitive (fasta or FASTA both work).

Copyright

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

Author

http://eddylab.org

Info

Nov 2020 Easel 0.48 Easel Manual