nettee passes a data stream to one or more child nodes using a daisychain method. On each node nettee may also direct the stream to a file or pipe. nettee allows large amounts of data to be quickly distributed to multiple nodes on a network at a rate limited only by the network bandwidth. The distribution chain is typically linear for each network switch but may branch when nodes utilize multiple switches. For maximum throughput only one instance of nettee should utilize each network interface.
When nettee starts, it waits for a connection from the upstream node before attempting to connect to its downstream nodes. Consequently, nettee may be started on the nodes in any order (by a script, rsh, ssh, and so forth). Typically only the node that reads the data stream from stdin or a file will be set to log messages, so that the progress of the transfer may be monitored. Transmission errors are detected by comparing the total number of bytes read by each child node with the number of bytes transmitted to that child.
By default severe errors cause the entire chain to abort. By using the -conwf and -colwf options, nettee may be instructed to do its best to continue processing in the event of certain write failures of the data stream. Note that failures which occur while the distribution chain is forming are still fatal events. To allow the program to continue with a truncated or alternate chain if chain formation errors are encountered, use the -connf option, and optionally specify alternate targets in each hostlist. If the node above the failed node is allowed to emit messages and errors (for instance, -v 5), messages similar to these will be sent to the log destination (-log):
Failures detected in child 0 [node34]: NWF
Failures detected in child 1 [node35]: NONE
Failures detected in chain: NWF
The first type of message describes the failures that were detected in the named child node, that is, those named in the -next option. The second message describes failures that were detected anywhere further on in the chain. The error codes currently defined are: NONE no errors, NWF network write failure, LWF local write failure, BBC child returned incorrect byte count, BSTAT child returned unknown or bad status, and NNF could not connect to (one or more) downstream chain nodes.
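When many nodes log to files, it can help to pull these failure messages out mechanically. The following is a minimal sketch, not part of nettee, that assumes the log lines look exactly like the samples above and carry a single error code each; the function name `parse_failure_line` and the regular expressions are illustrative assumptions.

```python
import re

# Error codes and meanings as listed in the text above.
ERROR_CODES = {
    "NONE": "no errors",
    "NWF": "network write failure",
    "LWF": "local write failure",
    "BBC": "child returned incorrect byte count",
    "BSTAT": "child returned unknown or bad status",
    "NNF": "could not connect to (one or more) downstream chain nodes",
}

# Assumed line shapes, modelled on the sample messages only.
CHILD_RE = re.compile(r"Failures detected in child (\d+) \[([^\]]+)\]: (\S+)")
CHAIN_RE = re.compile(r"Failures detected in chain: (\S+)")


def parse_failure_line(line):
    """Return a dict describing one failure message, or None if the line does not match."""
    m = CHILD_RE.search(line)
    if m:
        return {"scope": "child", "index": int(m.group(1)),
                "node": m.group(2), "code": m.group(3)}
    m = CHAIN_RE.search(line)
    if m:
        return {"scope": "chain", "index": None, "node": None, "code": m.group(1)}
    return None
```

A caller could look up `ERROR_CODES.get(result["code"])` to print the meaning next to the node name; codes not in the table are returned unchanged rather than rejected.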
nettee will normally emit an EXIT_SUCCESS status (0 on Unix). This is true even if errors were detected and handled in the node itself or in a child node. nettee will emit an EXIT_FAILURE status if it was forced to close by an unhandled event such as a timeout, write failure, or unexpected socket closure.
Print help information.
Print error status codes.
Print version, license, and copyright information.
- -in <SRC>
Reads data from <SRC> which may have one of four values: nettee reads from the upstream node; - reads from stdin; socket reads the output of a command from a socket; filename reads from a file. If no -in option is present the program reads data from the upstream node.
- -out <DST>
Writes data locally to <DST> which may have one of four values: none writes nothing locally; - writes to stdout; socket writes the datastream to a command through a socket; filename writes to a file. If no -out option is present the program writes data to stdout.
- -next <HOSTLISTS>
Writes data to downstream destination[s] hostlist1(,hostlist2(,hostlist3(...))) where the hostlist entries are separated by commas or spaces. A hostlist consists of either a single hostname, or a comma separated list of hostnames enclosed in square brackets. Example: node1,[node2,node3],[node4,node5,node6],node7. The bracketed form allows for automatic failover if unreachable nodes are encountered and if -connf is specified. The first hostname in the list is tried, then the next, and so on. There may be 1-8 hostlists. The number of hostlists controls the topology of the distribution chain. Use a linear distribution chain (a single hostlist) when all nodes share a single network switch. Use a forked distribution chain (multiple hostlists) when nodes are connected to two or more network switches. The End of Chain condition (no downstream write) is indicated by a <HOSTLISTS> value of . , "" , or _EOC_ . An End of Chain condition is also indicated by the absence of a -next option. If End of Chain is indicated there may not be any other hostlists specified.
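The hostlist syntax can be illustrated with a small parser. This is a sketch written for this page, not nettee's own parsing code: the name `parse_hostlists` is hypothetical, unbalanced brackets are not diagnosed, and only the rules stated above (comma or space separators, bracketed failover groups, 1-8 hostlists, the three End of Chain markers) are modelled.

```python
import re

MAX_HOSTLISTS = 8                   # "There may be 1-8 hostlists."
EOC_MARKERS = {".", "", "_EOC_"}    # End of Chain values named in the text


def parse_hostlists(spec):
    """Parse a -next style value into a list of hostlists (each a list of hostnames).

    Returns an empty list when the value indicates End of Chain.
    """
    spec = spec.strip()
    if spec in EOC_MARKERS:
        return []
    # A token is either a bracketed group or a run of non-separator characters.
    tokens = re.findall(r"\[[^\]]*\]|[^\s,\[\]]+", spec)
    hostlists = []
    for tok in tokens:
        if tok.startswith("["):
            hosts = [h.strip() for h in tok[1:-1].split(",") if h.strip()]
        else:
            hosts = [tok]
        if not hosts:
            raise ValueError("empty hostlist")
        if any(h in EOC_MARKERS for h in hosts):
            raise ValueError("End of Chain may not be combined with other hostlists")
        hostlists.append(hosts)
    if not 1 <= len(hostlists) <= MAX_HOSTLISTS:
        raise ValueError("expected 1-8 hostlists, got %d" % len(hostlists))
    return hostlists
```

For the example value above, the result has four hostlists, so the chain forks four ways at this node; the second and third hostlists each carry failover candidates that would only be tried when -connf is in effect.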
- -cmd <COMMAND>
Specifies the command to use in conjunction with an -in socket or -out socket option. Since only a single <COMMAND> may be specified, socket may not be applied to both -in and -out at the same time. When -cmd is used with -in socket, a child process running <COMMAND> reads data from a disk or other device and writes the resulting data stream to stdout. When -cmd is used with -out socket, a child process running <COMMAND> reads the datastream from stdin and writes the processed data to a disk or other device. Typically the <COMMAND> string invokes tar or some other archiving program. In some instances using sockets and -cmd will be faster than using the same command in a pipe due to the larger buffer size used for the socket. Run nettee -hexamples to see a usage example.
- -stm <EOS>
Stream text through a nettee chain until the string <EOS> is encountered, then exit. This allows short text messages to traverse the chain without waiting for a buffer to fill. Since the text message can very rapidly traverse the nettee chain, it can be piped into execinput (or any other program that will execute its stdin as commands) to produce essentially simultaneous execution on all target nodes. The <EOS> string is not passed through the data chain and its length is ignored. When used to start further nettee processes on the target nodes, <PORT> values must be chosen to avoid interference. While this mode may be convenient for setting up Beowulf nodes, it is exceedingly dangerous for general use, since any command introduced into the command stream will execute on all chain nodes as if submitted by the owner of the nettee process on that node. Run nettee -hexamples to see a usage example.
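The idea of forwarding text until an end-of-stream marker appears can be sketched as follows. This is an illustration of the concept only and makes no claim about how nettee buffers data internally; the function name `relay_until_eos` and the chunked-input model are assumptions made for the example.

```python
def relay_until_eos(chunks, eos):
    """Yield text from an iterable of string chunks until eos is seen.

    The eos string itself is never yielded, and nothing after it is forwarded.
    """
    if not eos:
        raise ValueError("eos must be a non-empty string")
    pending = ""
    keep = len(eos) - 1  # longest tail that could still be the start of eos
    for chunk in chunks:
        pending += chunk
        idx = pending.find(eos)
        if idx != -1:
            if idx:
                yield pending[:idx]
            return
        # Forward everything except a tail that might be a split eos.
        safe = len(pending) - keep
        if safe > 0:
            yield pending[:safe]
            pending = pending[safe:]
    if pending:
        yield pending  # input ended without eos; forward what is left
```

Holding back `len(eos) - 1` characters handles the case where the marker arrives split across two reads, which is the main subtlety when text is forwarded as soon as it arrives rather than after a buffer fills.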
- -name <STRING>
Specify the node name used in messages (<=127 characters). If not supplied, the values of the environment variables MYHOSTNAME and HOSTNAME are checked first, and if neither is defined, the result of a gethostname() call is used.
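The lookup order described above can be written out as a short sketch. It is not nettee's code: the function name `resolve_node_name` is hypothetical, an empty environment variable is treated here as undefined (an assumption), and over-long names are rejected rather than truncated because the text does not say which nettee does.

```python
import os
import socket

MAX_NAME_LEN = 127  # limit stated for -name


def resolve_node_name(name_option=None, environ=None):
    """Pick the node name: -name value, then MYHOSTNAME, then HOSTNAME, then gethostname()."""
    environ = os.environ if environ is None else environ
    if name_option:
        if len(name_option) > MAX_NAME_LEN:
            raise ValueError("node name longer than 127 characters")
        return name_option
    for var in ("MYHOSTNAME", "HOSTNAME"):
        value = environ.get(var)
        if value:  # assumption: an empty value counts as "not defined"
            return value
    return socket.gethostname()
```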
- -log <LDST>
Errors and messages are written to <LDST> which may have one of two values: - writes to stderr; filename writes to a file. If no -log option is present the program writes messages to stderr.
- -p,-port <PORT>
First of two consecutive ports used for communication. If no -port option is present the program uses the default value of 9997.
- -v <VERBOSE>
<VERBOSE> is a bit mask which controls the types of warning and error messages which are sent to the -log destination. Bit values indicate: 1 show error messages; 2 show command line settings; 4 show messages; 8 show periodic status messages during transfer; 16 prepend nodename to all messages. Use a <VERBOSE> value of 0 to eliminate all messages. If no -v is present the program uses a default <VERBOSE> value of 1.
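Because <VERBOSE> is a sum of bit values, a given number can be decoded mechanically. The sketch below is illustrative only; the table is copied from the description above and the function name `describe_verbose` is an assumption.

```python
# Bit values and meanings as listed for -v above.
VERBOSE_BITS = {
    1: "show error messages",
    2: "show command line settings",
    4: "show messages",
    8: "show periodic status messages during transfer",
    16: "prepend nodename to all messages",
}


def describe_verbose(mask):
    """Return the list of message types enabled by a <VERBOSE> bit mask."""
    return [label for bit, label in VERBOSE_BITS.items() if mask & bit]
```

For instance, a value of 5 is 1 + 4, so `describe_verbose(5)` lists error messages and messages, while 25 (1 + 8 + 16) adds periodic status messages with the node name prepended.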
Suppress "ignored signal" messages.
- -t <WAIT>
Wait up to <WAIT> seconds for a connection from upstream in the chain to form or data to be received. If neither of these events occurs, exit with an error. A value of 0 waits forever and will only exit on an end of data condition. If no -t is present the program uses a default <WAIT> value of 0. The -connf <WAIT> and -w options control timeouts for downstream connections.
Wait for the next node to boot or attach to the network. If not specified and the next node is not reachable, nettee will exit with an error no matter what the -t <WAIT> and -connf <WAIT> timeout values are.
Continue on Local Write Failure. Normally the failure of a write of the data stream to the local output will be fatal and the entire distribution chain will collapse immediately. (Typically this happens when data is written to disk and a partition fills or there is an ownership problem. A complete disk failure may initially present this way but often goes on to crash the node, resulting also in a network write failure.) When -colwf is set and a local write failure occurs on a node that node will continue to relay data down the chain. The node that failed will not have correctly processed the data stream locally but all other nodes will be unaffected by this failure. The top node will emit an error message when this occurs so that a subsequent analysis with other tools may locate the node(s) which failed. This option may only be employed on a node that reads data from an upstream node.
Continue on Network Write Failure. Normally the failure of a write of the data stream to the next node will be fatal and the entire distribution chain will collapse immediately. (Typically this happens when a node crashes while nettee is running.) When -conwf is set and a network write failure occurs on a node (indicating that the next node has failed) the node will continue to process the data stream locally but will make no further attempts to transfer data to the next node in the chain. This allows the data transfer to complete on a chain down to the node above a failed node. The top node will emit an error message when this occurs so that a subsequent analysis with other tools may locate the node(s) which failed. This option may only be employed on a node that reads data from an upstream node.
- -connf <WAIT>
Continue on Next Node Failure. Give the first node in a hostlist <WAIT> seconds to join the chain. After that, each successive host in the hostlist is given <WAIT> seconds to join, and if none succeeds, no data will be sent to any of those hosts. If -connf is not specified or the wait time is set to zero seconds, the program will wait forever for a connection to the first node in each hostlist.
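The failover behaviour described for a single hostlist can be sketched like this. It is a simplified model, not nettee's implementation: `pick_host` and the `try_connect(host, timeout)` callable are hypothetical, with `try_connect` assumed to return True once the host has joined and False on timeout (a timeout of None meaning "block until connected").

```python
def pick_host(hostlist, try_connect, wait_seconds=0):
    """Return the first host in hostlist that joins the chain, or None if none does."""
    if not wait_seconds:
        # -connf absent or zero: wait indefinitely, and only for the first host.
        first = hostlist[0]
        return first if try_connect(first, None) else None
    for host in hostlist:
        # Each candidate gets its own <WAIT> seconds before the next one is tried.
        if try_connect(host, wait_seconds):
            return host
    return None  # no data will be sent down this branch
```

Passing the connection attempt in as a callable keeps the sketch testable without a network; in practice it might wrap something like `socket.create_connection((host, port), timeout=timeout)`, though how nettee itself establishes its connections is not described here.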
- -progress <INTERVAL>
If -v 8 is used a status message is emitted every <INTERVAL> bytes transferred. The default value of 10000000 will be too small for a very fast network.
nettee is derived from Felix Rauch's dolly which is available here: http://www.cs.inf.ethz.ch/CoPs/patagonia/#dolly
The nettee home page is: http://saf.bio.caltech.edu/nettee.html
Copyright: 2008 David Mathog and Caltech. Copyright: Felix Rauch and ETH Zurich
Freely distributed under version 2 of the GNU General Public License (GPL 2).
David Mathog Biology Division, Caltech