esl-alimanip [options] msafile
esl-alimanip can manipulate the multiple sequence alignment(s) in msafile in various ways. Options exist to remove specific sequences, reorder sequences, designate reference columns using Stockholm "#=GC RF" markup, and add annotation that numbers columns.
The alignments can be of protein or DNA/RNA sequences. All alignments in the same msafile must be either protein or DNA/RNA. The alphabet will be autodetected unless one of the options --amino, --dna, or --rna is given.
- -h
Print brief help; includes version number and summary of all options, including expert options.
- -o <f>
Save the resulting, modified alignment in Stockholm format to a file <f>. The default is to write it to standard output.
- --informat <s>
Assert that msafile is in alignment format <s>, bypassing format autodetection. Common choices for <s> include: stockholm, a2m, afa, psiblast, clustal, phylip. For more information, and for codes for some less common formats, see the main documentation. The string <s> is case-insensitive (a2m or A2M both work).
- --outformat <s>
Write the output in alignment format <s>. Common choices for <s> include: stockholm, a2m, afa, psiblast, clustal, phylip. The string <s> is case-insensitive (a2m or A2M both work). Default is stockholm.
- --devhelp
Print help, as with -h, but also include undocumented developer options. These options are not listed below, are under development or experimental, and are not guaranteed to work correctly. Use developer options at your own risk. The only resources for understanding what they actually do are the brief one-line descriptions printed when --devhelp is enabled, and the source code.
- --lnfract <x>
Remove any sequences whose length is less than a fraction <x> of the median sequence length in the alignment.
- --lxfract <x>
Remove any sequences whose length is more than a fraction <x> of the median sequence length in the alignment.
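The two median-based length filters can be illustrated with a short Python sketch. This is not Easel's implementation: the function names, the set of gap characters, and the use of statistics.median (which averages the two middle values for an even number of sequences) are all assumptions made for illustration.

```python
from statistics import median

# Assumed gap symbols; the real program may recognize a different set.
GAP_CHARS = set("-._~")

def unaligned_length(aligned_seq):
    """Number of residues in an aligned sequence, ignoring gap characters."""
    return sum(1 for c in aligned_seq if c not in GAP_CHARS)

def filter_by_median_fraction(msa, min_fract=None, max_fract=None):
    """Keep sequences whose length lies within the given fractions of the median.

    msa is a dict mapping sequence name -> aligned sequence string.
    min_fract mimics --lnfract, max_fract mimics --lxfract (hypothetical names).
    """
    lengths = {name: unaligned_length(seq) for name, seq in msa.items()}
    med = median(lengths.values())
    kept = {}
    for name, seq in msa.items():
        n = lengths[name]
        if min_fract is not None and n < min_fract * med:
            continue  # too short relative to the median
        if max_fract is not None and n > max_fract * med:
            continue  # too long relative to the median
        kept[name] = seq
    return kept

msa = {
    "seq1": "ACGUACGUAC",  # 10 residues
    "seq2": "ACGU--GUAC",  # 8 residues (the median)
    "seq3": "AC--------",  # 2 residues
}
short_removed = filter_by_median_fraction(msa, min_fract=0.5)
long_removed = filter_by_median_fraction(msa, max_fract=1.1)
```

With a median length of 8, a minimum fraction of 0.5 drops seq3, and a maximum fraction of 1.1 drops seq1.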
- --lmin <n>
Remove any sequences with length less than <n> residues.
- --lmax <n>
Remove any sequences with length more than <n> residues.
- --rfnfract <x>
Remove any sequences whose nongap RF length is less than a fraction <x> of the nongap RF length of the alignment.
- --detrunc <n>
Remove any sequences that have all gaps in the first <n> non-gap #=GC RF columns or the last <n> non-gap #=GC RF columns.
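As a rough illustration of the truncation test described for --detrunc, the following Python sketch checks whether a sequence is all gaps in the first or last <n> nongap RF columns. The helper names and the set of gap characters are assumptions, and the sketch says nothing about how Easel implements the check internally.

```python
# Assumed gap symbols; the real program may recognize a different set.
GAP_CHARS = set("-._~")

def rf_match_columns(rf):
    """Indices of alignment columns whose RF character is not a gap."""
    return [i for i, c in enumerate(rf) if c not in GAP_CHARS]

def is_truncated(aligned_seq, rf, n):
    """True if the sequence is all gaps in the first n or last n nongap RF columns."""
    if n < 1:
        raise ValueError("n must be at least 1")
    cols = rf_match_columns(rf)
    first, last = cols[:n], cols[-n:]

    def all_gaps(indices):
        return all(aligned_seq[i] in GAP_CHARS for i in indices)

    return all_gaps(first) or all_gaps(last)

rf = "xx.xxxx.xx"
msa = {
    "full":              "ACgUACGaUU",
    "five_prime_trunc":  "--.-ACG.UU",
    "three_prime_trunc": "AC.UAC-.--",
}
kept = [name for name, seq in msa.items() if not is_truncated(seq, rf, 2)]
```

Only the sequence that has residues at both ends of the reference columns survives.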
- --xambig <n>
Remove any sequences that have more than <n> ambiguous (degenerate) residues.
- --seq-r <f>
Remove any sequences with names listed in file <f>. Sequence names listed in <f> can be separated by tabs, new lines, or spaces. The file must be in Stockholm format for this option to work.
- --seq-k <f>
Keep only sequences with names listed in file <f>. Sequence names listed in <f> can be separated by tabs, new lines, or spaces. By default, the kept sequences will remain in the original order they appeared in msafile, but the order from <f> will be used if the --k-reorder option is enabled. The file must be in Stockholm format for this option to work.
With --seq-k or --seq-r, operate in small memory mode. The alignment(s) will not be stored in memory, thus --seq-k and --seq-r will be able to work on very large alignments regardless of the amount of available RAM. The alignment file must be in Pfam format and --informat pfam and one of --amino, --dna, or --rna must be given as well.
- --k-reorder
With --seq-k <f>, reorder the kept sequences in the output alignment to the order from the list file <f>.
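The difference between keeping sequences in their original order and keeping them in list-file order can be sketched in Python as follows. The function names are hypothetical, and the handling of names that are missing from the alignment (silently ignored here) is an assumption; the real program may treat that case differently.

```python
def read_name_list(text):
    """Split list-file contents on tabs, newlines, or spaces."""
    return text.split()

def keep_sequences(msa, names, reorder=False):
    """Keep only the named sequences from msa (dict of name -> aligned sequence).

    reorder=False keeps the original alignment order (like --seq-k alone);
    reorder=True uses the order of the name list (like --seq-k with --k-reorder).
    Names absent from the alignment are ignored in this sketch (an assumption).
    """
    if reorder:
        return {n: msa[n] for n in names if n in msa}
    wanted = set(names)
    return {n: s for n, s in msa.items() if n in wanted}

msa = {"seqA": "ACGU", "seqB": "AC-U", "seqC": "A-GU"}
names = read_name_list("seqC\tseqA\n")

original_order = list(keep_sequences(msa, names))
list_order = list(keep_sequences(msa, names, reorder=True))
```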
- --seq-ins <n>
Keep only sequences that have at least 1 inserted residue after nongap RF position <n>.
- --seq-ni <n>
With --seq-ins, require at least <n> inserted residues in a sequence for it to be kept.
- --seq-xi <n>
With --seq-ins, allow at most <n> inserted residues in a sequence for it to be kept.
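One plausible reading of the three insert options is sketched below in Python. It assumes that nongap RF positions are numbered from 1, and that "inserted residues after position <n>" means residues in gap RF columns lying between nongap RF positions <n> and <n>+1. Both points, along with the function names and gap characters, are assumptions rather than documented behavior.

```python
# Assumed gap symbols; the real program may recognize a different set.
GAP_CHARS = set("-._~")

def insert_count_after(aligned_seq, rf, pos):
    """Count residues in insert columns between nongap RF positions pos and pos+1.

    Positions are 1-based (an assumption of this sketch).
    """
    match_seen = 0
    count = 0
    for seq_char, rf_char in zip(aligned_seq, rf):
        if rf_char not in GAP_CHARS:
            match_seen += 1
            if match_seen > pos:
                break  # past the insert region of interest
        elif match_seen == pos and seq_char not in GAP_CHARS:
            count += 1
    return count

def keep_by_inserts(msa, rf, pos, min_ins=1, max_ins=None):
    """Keep sequences with between min_ins and max_ins inserted residues after pos."""
    kept = {}
    for name, seq in msa.items():
        n = insert_count_after(seq, rf, pos)
        if n < min_ins:
            continue
        if max_ins is not None and n > max_ins:
            continue
        kept[name] = seq
    return kept

rf = "xx..xx"
msa = {"two_ins": "ACguGU", "no_ins": "AC..GU", "one_ins": "ACg.GU"}

default_keep = list(keep_by_inserts(msa, rf, 2))             # like --seq-ins 2
at_least_two = list(keep_by_inserts(msa, rf, 2, min_ins=2))  # adding --seq-ni 2
at_most_one = list(keep_by_inserts(msa, rf, 2, max_ins=1))   # adding --seq-xi 1
```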
- --trim <f>
File <f> is an unaligned FASTA file containing truncated versions of each sequence in the msafile. Trim the sequences in the alignment to match their truncated versions in <f>. If the alignment output format is Stockholm (the default output format), all per-column (GC) and per-residue (GR) annotation will be removed from the alignment when --trim is used. However, if --t-keeprf is also used, the reference annotation (GC RF) will be kept.
- --t-keeprf
Specify that the 'trimmed' alignment maintain the original reference (GC RF) annotation. Only works in combination with --trim.
- --minpp <x>
Replace all residues in the alignments for which the posterior probability annotation (#=GR PP) is less than <x> with gaps. The PP annotation for these residues is also converted to gaps. <x> must be greater than 0.0 and less than or equal to 0.95.
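The masking step can be pictured with the following Python sketch. The mapping from PP characters to probabilities used here (a digit d is treated as d/10 and '*' as 1.0) is an illustrative assumption and may not match the thresholds Easel actually applies; the gap characters written into the sequence and the PP line are likewise assumed.

```python
def pp_value(ch):
    """Representative probability for a PP character (illustrative assumption)."""
    if ch == "*":
        return 1.0
    if ch.isdigit():
        return int(ch) / 10
    return None  # gap or unrecognized character: no probability

def mask_low_pp(aligned_seq, pp, min_pp, seq_gap="-", pp_gap="."):
    """Replace residues whose PP is below min_pp with gaps, in both lines."""
    new_seq, new_pp = [], []
    for s, p in zip(aligned_seq, pp):
        v = pp_value(p)
        if v is not None and v < min_pp:
            new_seq.append(seq_gap)  # residue becomes a gap
            new_pp.append(pp_gap)    # its PP annotation becomes a gap too
        else:
            new_seq.append(s)
            new_pp.append(p)
    return "".join(new_seq), "".join(new_pp)

masked_seq, masked_pp = mask_low_pp("ACGU-ACG", "*972.38*", 0.5)
```

In this example the residues annotated '2' and '3' fall below the 0.5 threshold and are replaced, while existing gap columns are left untouched.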
- --tree <f>
Reorder sequences by tree order. Perform single linkage clustering on the sequences in the alignment based on sequence identity given the alignment to define a 'tree' of the sequences. The sequences in the alignment are reordered according to the tree, which groups similar sequences together. The tree is output in Newick format to <f>.
- --reorder <f>
Reorder sequences to the order listed in file <f>. Each sequence in the alignment must be listed in <f>. To keep and reorder only a subset of the sequences, use --seq-k <f> together with --k-reorder instead. The file must be in Stockholm format for this option to work.
- --mask2rf <f>
Read in the 'mask' file <f> and use it to define new #=GC RF annotation for the alignment. <f> must be a single line, with exactly <alen> or <rflen> characters, either the full alignment length or the number of nongap #=GC RF characters, respectively. Each character must be either a '1' or a '0'. The new #=GC RF markup will contain an 'x' for each column that is a '1' in the mask file, and a '.' for each column that is a '0'. If the mask is of length <rflen>, it is interpreted as applying only to the nongap RF characters in the existing RF annotation: all gap RF characters will remain gaps, and nongap RF characters will be redefined as above.
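The two mask lengths described above can be sketched in Python as follows. The function name and the set of gap characters are assumptions; the sketch only mirrors the rules stated in the text and is not taken from the Easel source.

```python
# Assumed gap symbols in RF annotation; the real program may use a different set.
GAP_CHARS = set("-._~")

def mask_to_rf(mask, alen, old_rf=None):
    """Build a new RF line from a 0/1 mask of length alen or rflen."""
    if set(mask) - {"0", "1"}:
        raise ValueError("mask may contain only '0' and '1'")
    if len(mask) == alen:
        # Full-length mask: every column is redefined.
        return "".join("x" if m == "1" else "." for m in mask)
    if old_rf is None:
        raise ValueError("a mask shorter than alen requires existing RF annotation")
    rflen = sum(1 for c in old_rf if c not in GAP_CHARS)
    if len(mask) != rflen:
        raise ValueError("mask length must equal alen or rflen")
    bits = iter(mask)
    out = []
    for c in old_rf:
        if c in GAP_CHARS:
            out.append(c)  # gap RF columns remain gaps
        else:
            out.append("x" if next(bits) == "1" else ".")
    return "".join(out)

full_mask_rf = mask_to_rf("110011", 6)
rflen_mask_rf = mask_to_rf("1011", 6, old_rf="x.xx.x")
```

In the second call the mask has four characters, one per nongap RF column, so the two existing gap columns are carried over unchanged.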
With --mask2rf, do not overwrite existing nongap RF characters that are included by the input mask as 'x'; leave them as they are.
Add annotation to the alignment numbering all of the columns in the alignment.
Add annotation to the alignment numbering the non-gap (non '.') #=GC RF columns of the alignment.
- --rm-gc <s>
Remove certain types of #=GC annotation from the alignment. <s> must be one of: RF, SS_cons, SA_cons, PP_cons.
Annotate individual secondary structures for each sequence by imposing the consensus secondary structure defined by the #=GC SS_cons annotation.
Update Infernal's cmalign 0.72-1.0.2 posterior probability "POST" annotation to "PP" annotation, which is read by other miniapps, including esl-alimask and esl-alistat.
- --amino
Assert that the msafile contains protein sequences.
- --dna
Assert that the msafile contains DNA sequences.
- --rna
Assert that the msafile contains RNA sequences.
Copyright (C) 2020 Howard Hughes Medical Institute. Freely distributed under the BSD open source license.