esl-reformat - Man Page

convert sequence file formats

Description

esl-reformat reads the sequence file seqfile in any supported format, reformats it into a new format specified by format, then outputs the reformatted text.

The format argument must match a supported sequence file format; matching is case-insensitive (fasta and FASTA both work). Common choices for format include: fasta, embl, genbank. If seqfile is an alignment file, alignment output formats also work. Common choices include: stockholm, a2m, afa, psiblast, clustal, phylip. For more information, and for codes for some less common formats, see the main documentation.

Unaligned format files cannot be reformatted to aligned formats. However, aligned formats can be reformatted to unaligned formats, in which case gap characters are simply stripped out.
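
As an illustration of the aligned-to-unaligned case, here is a minimal Python sketch of gap stripping. It is not Easel's implementation; the names strip_gaps and unalign are hypothetical, and the set of gap characters in GAP_CHARS is an assumption made for the example.

```python
# Sketch only: not Easel's code. GAP_CHARS is an assumed set of gap symbols.
GAP_CHARS = "-._~"


def strip_gaps(aligned_row: str, gap_chars: str = GAP_CHARS) -> str:
    """Return the aligned row with every gap character removed."""
    return "".join(ch for ch in aligned_row if ch not in gap_chars)


def unalign(alignment: dict) -> dict:
    """Convert {name: aligned_row} into {name: unaligned_sequence}."""
    return {name: strip_gaps(row) for name, row in alignment.items()}


# Two aligned rows of equal length become two sequences of different lengths.
alignment = {"seq1": "AC--GU.A", "seq2": "ACGGGU-A"}
unaligned = unalign(alignment)
```

The reverse direction has no such shortcut: unaligned sequences carry no column information, which is why they cannot be reformatted to an aligned format.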

Options

-d

DNA; convert U's to T's, to make sure a nucleic acid sequence is shown as DNA not RNA. See -r.

-h

Print brief help; includes version number and summary of all options, including expert options.

-l

Lowercase; convert all sequence residues to lower case. See -u.

-n

For DNA/RNA sequences, convert any character that is not an unambiguous DNA/RNA residue (that is, anything other than ACGTU/acgtu) to an N. Used to convert IUPAC ambiguity codes to N's, for software that can't handle all IUPAC codes (some public RNA folding codes, for example). If the file is an alignment, gap characters are left unchanged. If the sequences are not nucleic acid sequences, this option will corrupt the data in a predictable fashion.

-o <f>

Send output to file <f> instead of stdout.

-r

RNA; convert T's to U's, to make sure a nucleic acid sequence is shown as RNA not DNA. See -d.

-u

Uppercase; convert all sequence residues to upper case. See -l.

-x

For DNA sequences, convert non-IUPAC characters (such as X's) to N's. This is for compatibility with benighted people who insist on using X instead of the IUPAC ambiguity character N. (X is for ambiguity in an amino acid residue).

Warning: like the -n option, the code doesn't check that you are actually giving it DNA. It simply converts non-IUPAC DNA symbols to N. So if you accidentally give it protein sequence, it will happily convert most every amino acid residue to an N.
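
The residue conversions described for -d, -r, and -n can be sketched in Python as follows. This is an illustration of the documented behavior, not Easel's code: the function names are hypothetical, the gap character set is an assumption, and replacing with an uppercase N regardless of input case is also an assumption.

```python
# Sketch only: follows the option descriptions literally, not Easel's source.
GAP_CHARS = "-._~"  # assumed gap symbols, left untouched by mask_ambiguous


def to_dna(seq: str) -> str:
    """Like -d: convert U's to T's, preserving case."""
    return seq.translate(str.maketrans("Uu", "Tt"))


def to_rna(seq: str) -> str:
    """Like -r: convert T's to U's, preserving case."""
    return seq.translate(str.maketrans("Tt", "Uu"))


def mask_ambiguous(seq: str, gap_chars: str = GAP_CHARS) -> str:
    """Like -n: anything that is not ACGTU/acgtu (or a gap) becomes N."""
    keep = set("ACGTUacgtu") | set(gap_chars)
    return "".join(ch if ch in keep else "N" for ch in seq)


dna = to_dna("ACGUacgu")
rna = to_rna("ACGTacgt")
masked = mask_ambiguous("ACGTRYN-acgu")

# No alphabet check is made, so protein input is silently damaged.
damaged_protein = mask_ambiguous("MKLVACDG")
```

The last call shows why the warning matters: only the residues that happen to share a letter with A, C, G, T, or U survive.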

Expert Options

--gapsym <c>: Convert all gap characters to <c>. Used to prepare alignment files for programs with strict requirements for gap symbols. Only makes sense if the input seqfile is an alignment.
--informat <s>: Assert that input seqfile is in format <s>, bypassing format autodetection. Common choices for <s> include: fasta, embl, genbank. Alignment formats also work; common choices include: stockholm, a2m, afa, psiblast, clustal, phylip. For more information, and for codes for some less common formats, see the main documentation. The string <s> is case-insensitive (fasta or FASTA both work).
--mingap: If seqfile is an alignment, remove any columns that contain 100% gap or missing data characters, minimizing the overall length of the alignment. (Often useful if you've extracted a subset of aligned sequences from a larger alignment.)
--keeprf: When used in combination with --mingap, never remove a column that is not a gap in the reference (#=GC RF) annotation, even if the column contains 100% gap characters in all aligned sequences. By default with --mingap, nongap RF columns that are 100% gaps in all sequences are removed.
--nogap: Remove any aligned columns that contain any gap or missing data symbols at all. Useful as a prelude to phylogenetic analyses, where you only want to analyze columns containing 100% residues, so you want to strip out any columns with gaps in them. Only makes sense if the file is an alignment file.
--wussify: Convert RNA secondary structure annotation strings (both consensus and individual) from old "KHS" format, ><, to the new WUSS notation, <>. If the notation is already in WUSS format, this option will corrupt it, without warning. Only SELEX and Stockholm format files have secondary structure markup at present.
--dewuss: Convert RNA secondary structure annotation strings from the new WUSS notation, <>, back to the old KHS format, ><. If the annotation is already in KHS, this option will corrupt it, without warning. Only SELEX and Stockholm format files have secondary structure markup.
--fullwuss: Convert RNA secondary structure annotation strings from simple (input) WUSS notation to full (output) WUSS notation.
--replace <s>: <s> must be in the format <s1>:<s2> with equal numbers of characters in <s1> and <s2> separated by a ":" symbol. Each character from <s1> in the input file will be replaced by its counterpart (at the same position) from <s2>. Note that special characters in <s> (such as "~") may need to be prefixed by a "\" character.
--small: Operate in small memory mode for input alignment files in Pfam format. If not used, each alignment is stored in memory so the required memory will be roughly the size of the largest alignment in the input file. With --small, input alignments are not stored in memory. This option only works in combination with --informat pfam and output format pfam or afa.
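
The column filters (--mingap, --keeprf, --nogap) and the character mapping of --replace can be sketched in Python as follows. This is a sketch under stated assumptions, not Easel's implementation: filter_columns and replace_chars are hypothetical names, the alignment is modeled as a list of equal-length strings, and GAP_CHARS is an assumed set of gap and missing data symbols.

```python
# Sketch only: models the documented behavior, not Easel's source code.
GAP_CHARS = "-._~"  # assumed gap / missing data symbols


def filter_columns(rows, mode, rf=None, keep_rf=False, gap_chars=GAP_CHARS):
    """Drop alignment columns.

    mode="mingap": drop columns that are 100% gaps (like --mingap).
    mode="nogap":  drop columns containing any gap at all (like --nogap).
    keep_rf=True with mode="mingap": never drop a column whose RF
    annotation is not a gap (like --keeprf).
    Returns (filtered_rows, filtered_rf).
    """
    if mode not in ("mingap", "nogap"):
        raise ValueError("mode must be 'mingap' or 'nogap'")
    ncols = len(rows[0]) if rows else 0
    keep = []
    for col in range(ncols):
        gaps = [row[col] in gap_chars for row in rows]
        if mode == "mingap":
            drop = all(gaps)
            if drop and keep_rf and rf is not None and rf[col] not in gap_chars:
                drop = False  # nongap RF column is protected
        else:
            drop = any(gaps)
        if not drop:
            keep.append(col)
    new_rows = ["".join(row[c] for c in keep) for row in rows]
    new_rf = None if rf is None else "".join(rf[c] for c in keep)
    return new_rows, new_rf


def replace_chars(text, spec):
    """Map characters per an "<s1>:<s2>" spec, position by position."""
    old, sep, new = spec.partition(":")
    if not sep or not old or len(old) != len(new):
        raise ValueError("spec must be <s1>:<s2> with equal-length halves")
    return text.translate(str.maketrans(old, new))


rows = ["AC---G", "A---UG"]
rf = "x.x..x"  # columns 2 and 3 are all-gap; column 2 is nongap in RF

mingap_rows, mingap_rf = filter_columns(rows, "mingap", rf)
keeprf_rows, keeprf_rf = filter_columns(rows, "mingap", rf, keep_rf=True)
nogap_rows, nogap_rf = filter_columns(rows, "nogap", rf)
replaced = replace_chars("AC~~GU", "~:-")
```

In this toy alignment, plain mingap removes both all-gap columns, keep_rf retains the one that is nongap in RF, and nogap leaves only the two fully occupied columns.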

Copyright

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

Author

http://eddylab.org

Referenced By

esl-alimerge(1).

Nov 2020 Easel 0.48 Easel Manual