esl-alistat - Man Page
summarize a multiple sequence alignment file
esl-alistat [options] msafile
esl-alistat summarizes the contents of the multiple sequence alignment(s) in msafile, such as the alignment name, format, alignment length (number of aligned columns), number of sequences, average pairwise % identity, and mean, smallest, and largest raw (unaligned) lengths of the sequences.
If msafile is - (a single dash), multiple alignment input is read from stdin.
The --list, --icinfo, --rinfo, --pcinfo, --psinfo, --cinfo, --bpinfo, and --iinfo options allow dumping various statistics on the alignment to optional output files as described for each of those options below.
The --small option summarizes alignments without storing them in memory, which is useful for large alignment files whose size approaches or exceeds the available RAM. When --small is used, esl-alistat prints fewer statistics on the alignment, omitting data on the smallest and largest sequences and the average identity of the alignment. --small only works on Pfam-formatted alignments (a special type of non-interleaved Stockholm alignment in which each sequence occurs on a single line), and --informat pfam must be given with --small. Further, when --small is used, the alphabet must be specified with --amino, --dna, or --rna.
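The constraints on --small can be checked before launching the program. The following is a minimal sketch, not part of Easel: the helper name build_alistat_args and the file name big.sto are hypothetical, and only flags described in this page are used.

```python
def build_alistat_args(msafile, small=False, alphabet=None, informat=None):
    """Assemble an esl-alistat argument list, enforcing the --small rules.

    Hypothetical helper for illustration; it only builds the list and
    does not run the program.
    """
    alphabet_flags = {"amino": "--amino", "dna": "--dna", "rna": "--rna"}
    if alphabet is not None and alphabet not in alphabet_flags:
        raise ValueError("alphabet must be 'amino', 'dna', or 'rna'")

    args = ["esl-alistat"]
    if small:
        # --small needs an explicit alphabet and Pfam format.
        if alphabet is None:
            raise ValueError("--small requires --amino, --dna, or --rna")
        if informat is not None and informat.lower() != "pfam":
            raise ValueError("--small requires --informat pfam")
        informat = "pfam"
        args.append("--small")
    if informat is not None:
        args += ["--informat", informat]
    if alphabet is not None:
        args.append(alphabet_flags[alphabet])
    args.append(msafile)
    return args


cmd = build_alistat_args("big.sto", small=True, alphabet="rna")
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a system where esl-alistat is installed.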
- -h
Print brief help; includes version number and summary of all options, including expert options.
- -1
Use a tabular output format with one line of statistics per alignment in msafile. This is most useful when msafile contains many different alignments (such as a Pfam database in Stockholm format).
- --informat <s>
Assert that input msafile is in alignment format <s>, bypassing format autodetection. Common choices for <s> include: stockholm, a2m, afa, psiblast, clustal, phylip. For more information, and for codes for some less common formats, see the main documentation. The string <s> is case-insensitive (a2m or A2M both work).
- --amino
Assert that the msafile contains protein sequences.
- --dna
Assert that the msafile contains DNA sequences.
- --rna
Assert that the msafile contains RNA sequences.
- --small
Operate in small memory mode for Pfam-formatted alignments. --informat pfam and one of --amino, --dna, or --rna must be given as well.
- --list <f>
List the names of all sequences in all alignments in msafile to file <f>. Each sequence name is written on its own line.
- --icinfo <f>
Dump the information content per position in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
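The tabular files written by the info options share the convention that "#" lines are comments. A minimal sketch of a reader for such a file follows; the sample rows are hypothetical and do not reproduce the actual column layout, which is documented in the comment lines of each file. The sketch splits on any whitespace so that it works whether fields are separated by tabs or spaces.

```python
def read_info_table(lines):
    """Return the data rows of an info file, skipping '#' comment lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment explaining the fields
        rows.append(line.split())
    return rows


# Hypothetical content, for illustration only.
sample = [
    "# hypothetical layout, not the real field list",
    "# col  value",
    "1\t0.50",
    "2\t1.25",
]
rows = read_info_table(sample)
```

For a real file, pass an open file handle instead of the sample list, for example read_info_table(open(path)).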
- --rinfo <f>
Dump information on the frequency of gaps versus nongap residues per position in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
- --pcinfo <f>
Dump per column information on posterior probabilities in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
- --psinfo <f>
Dump per sequence information on posterior probabilities in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
- --iinfo <f>
Dump information on inserted residues in tabular format to file <f>. Insert columns of the alignment are those that are gaps in the reference (#=GC RF) annotation. This option only works if the input file is in Stockholm format with reference annotation. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
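To illustrate what counts as an insert column, the sketch below finds the columns that are gaps in a reference annotation string and counts the residues a sequence places in them. It is an illustration only: the set of gap characters and the 1-based column numbering are assumptions, and the RF line and sequence are invented.

```python
# Assumption: these characters are treated as gaps in the RF line.
GAP_CHARS = ".-_~"


def insert_columns(rf):
    """Return 1-based indices of columns that are gaps in the RF annotation."""
    return [i for i, ch in enumerate(rf, start=1) if ch in GAP_CHARS]


def inserted_residue_count(aligned_seq, rf):
    """Count non-gap residues that a sequence has in insert columns."""
    return sum(
        1
        for col in insert_columns(rf)
        if aligned_seq[col - 1] not in GAP_CHARS
    )


rf = "xx..x.x"          # invented reference annotation
seq = "ACg.Gu-"         # invented aligned sequence
cols = insert_columns(rf)
n_inserted = inserted_residue_count(seq, rf)
```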
- --cinfo <f>
Dump per-column residue counts to file <f>. If used in combination with --noambig, ambiguous (degenerate) residues will be ignored and not counted. Otherwise, they will be marginalized. For example, in an RNA sequence file, an 'N' will be counted as 0.25 'A', 0.25 'C', 0.25 'G', and 0.25 'U'.
- --noambig
With --cinfo, do not count ambiguous (degenerate) residues.
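The difference between marginalizing and ignoring ambiguous residues can be sketched for a single RNA column as follows. This is an illustration of the counting rule described for --cinfo and --noambig, not the program's own code; the function name is hypothetical and the degeneracy table covers only three IUPAC codes.

```python
# Partial IUPAC degeneracy table for RNA (illustration only).
RNA_AMBIG = {"N": "ACGU", "R": "AG", "Y": "CU"}


def column_counts(column, noambig=False):
    """Count residues in one alignment column.

    Ambiguous residues are spread evenly over the bases they stand for,
    or skipped entirely when noambig is True. Gaps are never counted.
    """
    counts = dict.fromkeys("ACGU", 0.0)
    for residue in column.upper():
        if residue in counts:
            counts[residue] += 1.0
        elif residue in RNA_AMBIG and not noambig:
            share = 1.0 / len(RNA_AMBIG[residue])
            for base in RNA_AMBIG[residue]:
                counts[base] += share
    return counts


marginalized = column_counts("AAN-")
ignored = column_counts("AAN-", noambig=True)
```

In this example the 'N' contributes 0.25 to each of the four bases in the marginalized counts and nothing when ambiguous residues are ignored.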
- --bpinfo <f>
Dump per-column basepair counts to file <f>. Counts appear for each basepair in the consensus secondary structure (annotated as "#=GC SS_cons"). Only basepairs from sequences for which both paired positions are canonical residues will be counted. That is, any basepair that is a gap or an ambiguous (degenerate) residue at either position of the pair is ignored and not counted.
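The counting rule for basepairs can be sketched as below. This is an illustration under stated assumptions, not the program's implementation: the structure string is assumed to use only nested bracket pairs, pseudoknot annotation is ignored, column indices are 0-based, and the alignment is invented.

```python
def paired_columns(ss_cons):
    """Return sorted (i, j) 0-based column pairs from nested brackets."""
    closers = {">": "<", ")": "(", "]": "[", "}": "{"}
    stacks = {opener: [] for opener in closers.values()}
    pairs = []
    for j, ch in enumerate(ss_cons):
        if ch in stacks:
            stacks[ch].append(j)
        elif ch in closers:
            pairs.append((stacks[closers[ch]].pop(), j))
    return sorted(pairs)


def basepair_counts(seqs, ss_cons, canonical="ACGU"):
    """Tally basepairs per consensus pair, skipping gaps and ambiguity codes."""
    counts = {}
    for i, j in paired_columns(ss_cons):
        tally = {}
        for seq in seqs:
            left, right = seq[i].upper(), seq[j].upper()
            # Count only when both positions are canonical residues.
            if left in canonical and right in canonical:
                tally[left + right] = tally.get(left + right, 0) + 1
        counts[(i, j)] = tally
    return counts


ss_cons = "<<..>>"
seqs = ["GCAAGC", "GNAAGC", "G-AAUC"]   # invented alignment
bp = basepair_counts(seqs, ss_cons)
```

The second and third sequences contribute nothing to the inner pair, because one has an ambiguous residue and the other a gap at a paired position.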
- --weight
With --icinfo, --rinfo, --pcinfo, --iinfo, --cinfo, and --bpinfo, weight counts based on #=GS WT annotation in the input msafile. A residue or basepair from a sequence with a weight of <x> will be considered <x> counts. By default, raw, unweighted counts are reported, corresponding to each sequence having an equal weight of 1.
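The effect of sequence weights on counts can be sketched for one column as follows. This is an illustration of the rule that a residue from a sequence with weight <x> counts as <x>; the function name, the gap characters, and the example weights are assumptions.

```python
def weighted_column_counts(column, weights=None):
    """Sum per-sequence weights for each residue in one alignment column.

    With weights=None every sequence has weight 1, which reproduces
    raw, unweighted counts.
    """
    if weights is None:
        weights = [1.0] * len(column)
    counts = {}
    for residue, weight in zip(column, weights):
        if residue in "-.":      # assumption: gap characters
            continue
        counts[residue] = counts.get(residue, 0.0) + weight
    return counts


raw = weighted_column_counts("AAG")
weighted = weighted_column_counts("AAG", [0.5, 1.5, 1.0])
```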
Copyright (C) 2020 Howard Hughes Medical Institute. Freely distributed under the BSD open source license.