hmmpgmd_shard - Man Page

sharded daemon for database search web services

The hmmpgmd_shard program is a sharded version of hmmpgmd, the program that we use internally to implement high-performance HMMER services that can be accessed via the internet. See the hmmpgmd man page for a discussion of how the base hmmpgmd program is used. This man page discusses only the differences between hmmpgmd_shard and hmmpgmd.

The base hmmpgmd program loads its entire database file into RAM on every worker node, even though each worker node searches only a predictable fraction of the database(s) in that file. This wastes RAM, particularly when many worker nodes are used to accelerate searches of large databases.

The hmmpgmd_shard program addresses this by dividing protein sequence database files into shards. Each worker node loads only 1/N of the database file, where N is the number of worker nodes attached to the master. HMM database files are not sharded, so every worker node still loads the entire HMM database file into RAM. Current HMM databases are much smaller than current protein sequence databases, and easily fit into the RAM of modern servers even without sharding.
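As a rough illustration of the saving, the sketch below compares the RAM each worker needs for the cached sequence database with and without sharding. It assumes equally sized shards and ignores per-process overhead and the unsharded HMM database. The function name and the example sizes are hypothetical and are not taken from HMMER.

```python
def per_worker_seqdb_gib(seqdb_gib: float, num_workers: int, sharded: bool) -> float:
    """Approximate RAM (GiB) each worker needs for the cached sequence database."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Unsharded (hmmpgmd): every worker caches the whole file.
    # Sharded (hmmpgmd_shard): each worker caches roughly 1/N of it.
    return seqdb_gib / num_workers if sharded else seqdb_gib


# Hypothetical 64 GiB sequence database served by 8 workers.
print(per_worker_seqdb_gib(64.0, 8, sharded=False))  # 64.0
print(per_worker_seqdb_gib(64.0, 8, sharded=True))   # 8.0
```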

The hmmpgmd_shard program is used in the same manner as hmmpgmd, except that it takes one additional argument, --num_shards <n>, which specifies the number of shards that protein sequence databases will be divided into and defaults to 1 if unspecified. This argument is only valid on the master node (i.e., when --master is passed to hmmpgmd_shard), and must equal the number of worker nodes that will connect to the master. The program signals an error if more than <n> worker nodes attempt to connect to the master, or if a search is started while fewer than <n> workers are connected.
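The worker-count rule can be pictured with a small toy model. This is only a sketch of the two error conditions described above, not HMMER's implementation; the class, its method names, and the error messages are all hypothetical.

```python
class ShardedMaster:
    """Toy model of the worker-count rule; not HMMER's implementation."""

    def __init__(self, num_shards: int = 1):
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        self.num_shards = num_shards  # mirrors --num_shards, default 1
        self.workers = []

    def connect_worker(self, name: str) -> None:
        # Error case 1: more than num_shards workers try to connect.
        if len(self.workers) >= self.num_shards:
            raise RuntimeError("more workers than shards")
        self.workers.append(name)

    def start_search(self) -> str:
        # Error case 2: a search starts before every shard has a worker.
        if len(self.workers) < self.num_shards:
            raise RuntimeError("fewer workers connected than shards")
        return "search started"


master = ShardedMaster(num_shards=2)
master.connect_worker("worker-0")
try:
    master.start_search()
except RuntimeError as err:
    print(err)  # fewer workers connected than shards
master.connect_worker("worker-1")
print(master.start_search())  # search started
```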

Options

-h: Help; print a brief reminder of command line usage and all available options.
--master: Run as the master server.
--worker <s>: Run as a worker, connecting to the master server that is running on IP address <s>.
--cport <n>: Port to use for communication between clients and the master server. The default is 51371.
--wport <n>: Port to use for communication between workers and the master server. The default is 51372.
--ccncts <n>: Maximum number of client connections to accept. The default is 16.
--wcncts <n>: Maximum number of worker connections to accept. The default is 32.
--pid <f>: Name of file into which the process id will be written.
--seqdb <f>: Name of the file (in hmmpgmd format) containing protein sequences. The contents of this file will be cached for searches.
--hmmdb <f>: Name of the file containing protein HMMs. The contents of this file will be cached for searches.
--cpu <n>: Number of parallel threads to use (applies only with --worker).
--num_shards <n>: Number of shards to divide cached sequence database(s) into. HMM databases are not sharded, due to their small size. This option is only valid when the --master option is present, and defaults to 1 if not specified. The number of shards must equal the number of worker nodes: hmmpgmd_shard reports an error if more than <n> workers attempt to connect to the master node, or if a search is started with fewer than <n> workers connected to the master.
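To show how the options above fit together, the following sketch assembles command lines for one master and its workers so that --num_shards always matches the number of workers. It uses only options listed on this page. The helper names, the database file name, and the IP address are hypothetical, and a real deployment may need further options (ports, --pid, --hmmdb) that are left out here.

```python
import shlex
from typing import List


def master_command(seqdb: str, num_workers: int) -> List[str]:
    """Master command line; --num_shards is set to the number of workers."""
    return ["hmmpgmd_shard", "--master", "--seqdb", seqdb,
            "--num_shards", str(num_workers)]


def worker_command(master_ip: str, cpu: int) -> List[str]:
    """Worker command line; --num_shards is master-only, so it is omitted."""
    return ["hmmpgmd_shard", "--worker", master_ip, "--cpu", str(cpu)]


num_workers = 4  # hypothetical cluster size
print(shlex.join(master_command("seqs.hmmpgmd", num_workers)))
for _ in range(num_workers):
    print(shlex.join(worker_command("192.0.2.10", cpu=8)))
```

The commands are only printed, not executed, so the output can be reviewed before anything is launched.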

Copyright

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

For additional information on copyright and licensing, see the file called COPYRIGHT in your HMMER source distribution, or see the HMMER web page (http://hmmer.org/).

Author

http://eddylab.org

Info

Nov 2020 HMMER 3.3.2 HMMER Manual
