esl-alistat - Man Page

summarize a multiple sequence alignment file

Synopsis

esl-alistat [options] msafile

Description

esl-alistat summarizes the contents of the multiple sequence alignment(s) in msafile, such as the alignment name, format, alignment length (number of aligned columns), number of sequences, average pairwise % identity, and mean, smallest, and largest raw (unaligned) lengths of the sequences.

If msafile is - (a single dash), multiple alignment input is read from stdin.

The --list, --icinfo, --rinfo, --pcinfo, --psinfo, --cinfo, --bpinfo, and --iinfo options allow dumping various statistics on the alignment to optional output files as described for each of those options below.

The --small option summarizes alignments without storing them in memory, which is useful for alignment files whose size approaches or exceeds the amount of available RAM. When --small is used, esl-alistat prints fewer statistics on the alignment, omitting data on the smallest and largest sequences and the average identity of the alignment. --small only works on Pfam-formatted alignments (a special type of non-interleaved Stockholm alignment in which each sequence occurs on a single line), and --informat pfam must be given with --small. Further, when --small is used, the alphabet must be specified with --amino, --dna, or --rna.
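To illustrate why the one-sequence-per-line Pfam layout makes a small-memory pass possible, here is a minimal Python sketch that streams a single alignment line by line and keeps only running totals. It is not the Easel implementation; the function name summarize_pfam_stream and the returned fields are hypothetical, and the sketch assumes any alphabetic character is a residue.

```python
import io

def summarize_pfam_stream(handle):
    """Summarize one Pfam-format alignment without holding it in memory.

    Hypothetical helper: only running totals are kept, so memory use
    does not grow with the number of sequences.
    """
    nseq = 0
    alen = None
    total_residues = 0
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, the header, and #=G* annotation
        if line == "//":
            break  # end of this alignment
        name, aligned = line.split(None, 1)
        aligned = aligned.strip()
        if alen is None:
            alen = len(aligned)
        elif len(aligned) != alen:
            raise ValueError(f"{name}: expected {alen} columns, got {len(aligned)}")
        nseq += 1
        # Assumption: any alphabetic character is a residue, anything else a gap.
        total_residues += sum(1 for ch in aligned if ch.isalpha())
    mean_len = total_residues / nseq if nseq else 0.0
    return {"nseq": nseq, "alen": alen or 0, "mean_len": mean_len}

example = io.StringIO(
    "# STOCKHOLM 1.0\n"
    "seq1 ACG-U\n"
    "seq2 AC.GU\n"
    "seq3 A-.-U\n"
    "//\n"
)
stats = summarize_pfam_stream(example)
```

Statistics that need to compare sequences with each other, such as average pairwise identity, cannot be computed from running totals like these, which is consistent with --small omitting them.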

Options

-h

Print brief help: the version number and a summary of all options, including expert options.

-1

Use a tabular output format with one line of statistics per alignment in msafile. This is most useful when msafile contains many different alignments (such as a Pfam database in Stockholm format).

Expert Options

--informat <s>

Assert that the input msafile is in alignment format <s>, bypassing format autodetection. Common choices for <s> include: stockholm, a2m, afa, psiblast, clustal, phylip. For more information, and for codes for some less common formats, see the main documentation. The string <s> is case-insensitive (a2m or A2M both work).

--amino

Assert that the msafile contains protein sequences.

--dna

Assert that the msafile contains DNA sequences.

--rna

Assert that the msafile contains RNA sequences.

--small

Operate in small memory mode for Pfam formatted alignments. --informat pfam and one of --amino, --dna, or --rna must be given as well.

--list <f>

List the names of all sequences in all alignments in msafile to file <f>. Each sequence name is written on its own line.

--icinfo <f>

Dump the information content per position in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.

--rinfo <f>

Dump information on the frequency of gaps versus nongap residues per position in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.

--pcinfo <f>

Dump per column information on posterior probabilities in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.

--psinfo <f>

Dump per sequence information on posterior probabilities in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.

--iinfo <f>

Dump information on inserted residues in tabular format to file <f>. Insert columns of the alignment are those that are gaps in the reference (#=GC RF) annotation. This option only works if the input file is in Stockholm format with reference annotation. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
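The definition of insert columns can be sketched in a few lines of Python. This is an illustration only, not the Easel code: the helper names are hypothetical, and the set of characters treated as gaps in the RF line is an assumption.

```python
# Assumption: these characters count as gaps in #=GC RF and in sequences.
GAP_CHARS = set(".-_~")

def insert_columns(rf):
    """Return 1-based positions of columns that are gaps in the RF line."""
    return [i + 1 for i, ch in enumerate(rf) if ch in GAP_CHARS]

def inserted_residue_counts(rf, aligned_seqs):
    """For each insert column, count sequences with a residue there."""
    counts = {}
    for col in insert_columns(rf):
        counts[col] = sum(1 for seq in aligned_seqs if seq[col - 1] not in GAP_CHARS)
    return counts

rf = "xx..x"
seqs = ["AC..G", "ACu.G", "ACuaG"]
cols = insert_columns(rf)
ins = inserted_residue_counts(rf, seqs)
```

Here columns 3 and 4 are insert columns because the RF line has gaps there; two sequences have an inserted residue in column 3 and one has an inserted residue in column 4.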

--cinfo <f>

Dump per-column residue counts to file <f>. If used in combination with --noambig, ambiguous (degenerate) residues will be ignored and not counted. Otherwise, they will be marginalized. For example, in an RNA sequence file, an 'N' will be counted as 0.25 'A', 0.25 'C', 0.25 'G', and 0.25 'U'.
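The marginalization described above can be sketched as follows. This is a minimal Python illustration, not the Easel implementation. The 'N' case follows the example in the text; splitting the other IUPAC codes uniformly over the residues they represent is an assumption, and the function name column_counts is hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes, written for RNA (U instead of T).
RNA_AMBIGUITY = {
    "R": "AG", "Y": "CU", "M": "AC", "K": "GU", "S": "CG", "W": "AU",
    "H": "ACU", "B": "CGU", "V": "ACG", "D": "AGU", "N": "ACGU",
}

def column_counts(column, noambig=False):
    """Count A, C, G, U in one alignment column.

    Ambiguous residues are split evenly over the residues they stand for
    (an assumption for codes other than N), or skipped when noambig is True.
    Gaps and any other characters are ignored.
    """
    counts = {"A": 0.0, "C": 0.0, "G": 0.0, "U": 0.0}
    for ch in column.upper():
        if ch in counts:
            counts[ch] += 1.0
        elif ch in RNA_AMBIGUITY and not noambig:
            options = RNA_AMBIGUITY[ch]
            for residue in options:
                counts[residue] += 1.0 / len(options)
    return counts

column = "AANG-"
marginalized = column_counts(column)
ignored = column_counts(column, noambig=True)
```

For the column "AANG-", the marginalized counts are 2.25 'A', 0.25 'C', 1.25 'G', and 0.25 'U'; with noambig=True the 'N' contributes nothing.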

--noambig

With --cinfo, do not count ambiguous (degenerate) residues.

--bpinfo <f>

Dump per-column basepair counts to file <f>. Counts appear for each basepair in the consensus secondary structure (annotated as "#=GC SS_cons"). Only basepairs from sequences for which both paired positions are canonical residues will be counted. That is, a sequence with a gap or an ambiguous (degenerate) residue at either position of a pair is ignored for that basepair and not counted.
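As a rough illustration of this counting rule, the Python sketch below pairs up brackets in an SS_cons string and tallies, for each consensus basepair, the residue pairs seen in sequences where both positions are unambiguous. It is not the Easel implementation: the helper names are hypothetical, the sketch assumes only nested bracket characters mark pairs (pseudoknot letters are ignored), and it does not check that opening and closing bracket types match.

```python
from collections import Counter

CANONICAL_RNA = set("ACGU")
# Assumption: these bracket characters mark nested basepairs in SS_cons.
OPEN_TO_CLOSE = {"<": ">", "(": ")", "[": "]", "{": "}"}
CLOSE_CHARS = set(OPEN_TO_CLOSE.values())

def consensus_pairs(ss_cons):
    """Return sorted 0-based (i, j) index pairs for matched brackets."""
    stack = []
    pairs = []
    for j, ch in enumerate(ss_cons):
        if ch in OPEN_TO_CLOSE:
            stack.append(j)
        elif ch in CLOSE_CHARS:
            if not stack:
                raise ValueError("unbalanced SS_cons: extra closing bracket")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced SS_cons: unclosed opening bracket")
    return sorted(pairs)

def basepair_counts(ss_cons, aligned_seqs):
    """Map each 1-based (i, j) consensus pair to a Counter of residue pairs."""
    counts = {}
    for i, j in consensus_pairs(ss_cons):
        tally = Counter()
        for seq in aligned_seqs:
            left, right = seq[i].upper(), seq[j].upper()
            # Skip the sequence if either position is a gap or ambiguous.
            if left in CANONICAL_RNA and right in CANONICAL_RNA:
                tally[left + right] += 1
        counts[(i + 1, j + 1)] = tally
    return counts

ss_cons = "<<..>>"
seqs = ["GCAAGC", "GNAAGC", "G-AA-C", "AUAAAU"]
bp = basepair_counts(ss_cons, seqs)
```

In this example the pair at columns 2 and 5 is counted for only two of the four sequences, because one sequence has an 'N' and another has gaps at those positions.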

--weight

With --icinfo, --rinfo, --pcinfo, --iinfo, --cinfo, and --bpinfo, weight counts based on #=GS WT annotation in the input msafile. A residue or basepair from a sequence with a weight of <x> will be considered <x> counts. By default, raw, unweighted counts are reported, which corresponds to each sequence having an equal weight of 1.
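The effect of sequence weights on a count can be sketched as below. This Python snippet is illustrative only and is not taken from Easel: the function name weighted_column_counts and the set of gap characters are assumptions, and the weights stand in for values that would come from #=GS WT annotation.

```python
from collections import defaultdict

# Assumption: these characters are treated as gaps.
GAP_CHARS = set(".-_~")

def weighted_column_counts(column, weights=None):
    """Count residues in one column, each scaled by its sequence's weight.

    With weights=None every sequence has weight 1, which reproduces
    raw, unweighted counts.
    """
    if weights is None:
        weights = [1.0] * len(column)
    if len(weights) != len(column):
        raise ValueError("need exactly one weight per sequence")
    counts = defaultdict(float)
    for ch, wt in zip(column, weights):
        if ch in GAP_CHARS:
            continue
        counts[ch.upper()] += wt
    return dict(counts)

column = "AAC-"
raw = weighted_column_counts(column)
weighted = weighted_column_counts(column, [2.0, 0.5, 1.5, 1.0])
```

With weights of 2.0, 0.5, 1.5, and 1.0, the two 'A' residues contribute 2.5 counts and the single 'C' contributes 1.5; the weight on the gapped sequence has no effect on this column.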

See Also

http://bioeasel.org/

Author

http://eddylab.org

Info

Nov 2020 Easel 0.48 Easel Manual