rig - Man Page

Monitor a system for events and trigger specific actions

Usage

rig <RESOURCE OR SUBCOMMAND> [OPTIONS] <Actions> [ACTION OPTIONS]

Description

rig is a tool to assist in troubleshooting seemingly randomly occurring events or events that occur at times that make active monitoring by a sysadmin difficult.

rig sets up detached processes, known as 'rigs', that watch a given resource for a trigger condition and, once that condition is met, take the actions defined by the user.

Global Options

The following are options available to all rigs (resources).

--delay DELAY

Specify the number of seconds to wait after a rig is triggered before running the configured actions. Note that the rig will still trigger and stop all watcher threads immediately - this delay comes after thread termination but before action execution in order to avoid a possible race condition where multiple watcher threads could conceivably trigger during a sufficiently high delay time.

Default: 0 seconds, meaning actions are executed immediately once the rig's trigger condition is met.

--debug

Set logging level to debug instead of the default info level.

--foreground

Run the rig in the foreground, keeping stdout attached.

--interval SECONDS

Specify the number of seconds to wait between a rig's polling cycles. Most rigs monitor their resources in a flow of update -> compare -> wait, where wait is simply sleeping until the next needed update. Use this option to set how long a rig should sleep before updating its monitors again.

Default: 1, meaning update and compare once every second.
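
The update -> compare -> wait flow can be pictured with a short Python sketch. This is an illustration only, not rig's implementation: watch(), update, compare, and the injectable sleep argument are all hypothetical names introduced here.

```python
import time

def watch(update, compare, interval=1.0, sleep=time.sleep):
    """Poll until compare() reports that the trigger condition is met.

    update  -- hypothetical callable returning the latest reading
    compare -- hypothetical callable returning True when the rig should trigger
    """
    polls = 0
    while True:
        reading = update()          # update
        polls += 1
        if compare(reading):        # compare
            return polls
        sleep(interval)             # wait until the next cycle


# Example: readings climb by one per poll; trigger once a reading reaches 3.
readings = iter(range(1, 10))
polls_needed = watch(lambda: next(readings), lambda r: r >= 3,
                     interval=1.0, sleep=lambda seconds: None)
```

Passing a do-nothing sleep keeps the example instant; with the default time.sleep the loop would pause for the interval between polls.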

--name NAME

Give the rig a name, rather than generating a random one at deployment.

By default, rigs are given a randomly generated string as a name, which will appear in rig info output and in the rig's socket name. Using this option will use the provided name instead, and may be useful in distinguishing rigs when several are deployed at one time.

--no-archive

Do not create a tar archive of the collected data after a rig has been triggered.

Normally, once all data has been collected, rig will create a gzip'd tar archive under /var/tmp containing all the files created from the rig's actions - after which, the temp directory at /var/tmp/rig/<id>/ is deleted.

Using this option skips creating the archive and preserves the temp directory.

--repeat COUNT

Repeat certain actions COUNT number of times after the initial execution of the action.

Unless this option is given, actions execute only once. With this option, actions that support repetition are repeated an additional COUNT times. For example, using --repeat 2 will result in repeatable actions being executed three (3) total times.

Not every action supports repetition - in fact most do not. See each action's section for whether it can be repeated.

--repeat-delay SECONDS

Number of seconds to wait between repetitive executions of the same action.

This can be useful when using an action like gcore when you want to get coredumps over a certain time period. For example, using --repeat 1 --repeat-delay 60 will give you two (2) coredumps taken one minute apart.

Defaults to one second.
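
As a rough illustration of how --repeat and --repeat-delay combine, the sketch below lists the offsets, in seconds after the trigger, at which a repeatable action would start. It assumes the action itself takes no time to run, and execution_offsets() is a hypothetical helper, not part of rig.

```python
def execution_offsets(repeat=0, repeat_delay=1):
    """Return the seconds after the trigger at which a repeatable action runs."""
    total_runs = 1 + repeat          # the initial run plus COUNT repeats
    return [n * repeat_delay for n in range(total_runs)]


print(execution_offsets(repeat=2))                    # [0, 1, 2]
print(execution_offsets(repeat=1, repeat_delay=60))   # [0, 60]
```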

--restart COUNT

Restart a configured rig up to COUNT number of times after being triggered.

By default, a rig will trigger once and then terminate. Using this option, an individual rig may restart itself up to COUNT number of times, producing an additional archive of the requested data after the triggering event happens again.

Note that this is the number of times to restart, not the total number of times to run. Using a restart value of '2' means that there will be 3 total archives generated for a rig.

By default, this is set to 0, meaning terminate after the first trigger event. Use a value of '-1' to have a rig perpetually restart itself without limit.
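
The relationship between the --restart value and the number of archives can be written out as a small Python sketch. total_runs() is a hypothetical helper, and rejecting negative values other than -1 is an assumption; the text above does not say how rig treats them.

```python
def total_runs(restart=0):
    """Return how many times a rig may trigger, or None for no limit."""
    if restart == -1:
        return None                  # restart perpetually
    if restart < 0:
        # Assumption: other negative values are treated as invalid.
        raise ValueError("restart must be -1 or a non-negative integer")
    return restart + 1               # the first run plus each restart


print(total_runs(2))                 # 3
```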

Subcommands

rig list

Show a list of known existing rigs and their status. Status information is obtained by querying the socket created for that particular rig.

rig destroy -i [ID or 'all']

Destroy a deployed rig with id ID. If ID is 'all', destroy all known rigs. Note that if another entity kills the pid for the running rig, destroy will fail as the socket is no longer connected to the (now killed) process. In this case use the --force option to clean up the lingering socket.

Any data the rig has generated will be lost when invoking destroy.

rig info -i [ID]

Get detailed information on a rig. This information includes configuration options, the entire cmdline string given to launch the rig, and information on each action the rig is configured to take and the expected result of each action.

Currently, this data is written to stdout in JSON format.

rig trigger -i [ID]

Manually trigger rig with id ID. This will cause the specified rig to begin executing the actions configured for it, as if the trigger condition had been met.

Note that this is only effective on a single rig basis, so using a value of 'all' for the ID will not work.

Resources

These are the system resources that rig can monitor. There may be additional manpages for specific resources. Where applicable this will be noted below.

Note that 'resources', 'monitors', and referencing 'a rig' as a distinct entity all refer to the same thing.

When a rig is created successfully, its ID is printed to the console.

logs

Watch one or more log files and/or journald units for a specified message. When that message is matched in any watched file or journal, the trigger condition is met and configured actions are initiated.

The following options are available for the logs rig:

-m|--message STRING

Define the string that serves as the trigger condition for the rig. This can be a regex string or an exact message. Be very careful in using the '*' regex character as this may cause unintended behavior such as the rig immediately triggering on the first message seen.

Note that a small amount of transformation and testing is done on the provided STRING. First, '*' characters are converted to the python-style regex match of '.*'. Then rig tests whether the provided message will regex-match itself, and if that test fails the rig aborts the creation process.

Aside from the conversion noted above, regexes provided in this option must be python-style and not shell-style.
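
The conversion and self-match test described above can be sketched in Python as follows. This is a guess at the general shape, not rig's actual code: prepare_message() is a hypothetical name, the conversion is a naive string replacement (how rig treats a STRING that already contains '.*' is not stated here), and using re.match() for the self-test is an assumption.

```python
import re

def prepare_message(message):
    """Convert '*' to '.*' and verify that the result matches the original."""
    pattern = message.replace("*", ".*")   # naive conversion; an assumption
    try:
        matched = re.match(pattern, message) is not None
    except re.error:
        matched = False                    # not a valid Python regex
    if not matched:
        raise ValueError(f"message does not match itself: {message!r}")
    return pattern


print(prepare_message("link * is down"))   # link .* is down
```

A STRING such as 'a+b' fails this test, because as a regex it means "one or more 'a' followed by 'b'" and so does not match the literal text 'a+b'.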

--logfile FILE

A comma-delimited list of files to watch. Each FILE specified will be monitored from the current end of the file, so old entries will not set off the rig's actions.

Default: /var/log/messages

--no-files

Do not monitor any log files.

--journal UNIT

A comma-delimited list of journal units to watch. The journal is watched as a singular entity, and will be filtered to only read from the provided UNIT(s). If no UNIT is specified, the whole system journal will be monitored.

Default: 'system'

--no-journal

Do not monitor the journal.

--count COUNT

The number of times the --message string should be matched before the rig is triggered. Default: 1, meaning match on the first occurrence.

ping

Perform a simple ongoing ping test against a specified host. Pings are sent one at a time at a defined interval, and the response is evaluated. Ping-type rigs may monitor for number of lost packets and/or packets exceeding a specified RTT in milliseconds.

Packets are first evaluated for loss (including timeouts), then for RTT time.

The following options are available for the ping rig:

--host ADDRESS

The target IP or hostname to ping. This is a required option in order for a ping rig to be created.

During rig creation, a 'sanity check' ping is sent to the ADDRESS to ensure that it is an address that is reachable on the network and that it will respond to ICMP packets. If this sanity check fails, rig creation is aborted.

--ping-timeout SECONDS

Specify the number of SECONDS to allow for a ping response. If a ping encounters a timeout, then it is considered both a lost packet and a packet exceeding the RTT threshold (see --ping-ms-max and --ping-ms-count).

--lost-count PACKETS

Specify the number of PACKETS to accept being lost or timed-out, before triggering the rig.

Default: 1 (trigger on the first lost packet)

--ping-interval SECONDS

Specify the number of SECONDS to wait between ping requests sent to the target host.

Default: 1

--ping-ms-max MILLISECONDS

Specify the RTT threshold to allow for a returned ping request. If the RTT reported by the ping command is above this value in milliseconds, it is counted against the threshold of packets exceeding this value specified by --ping-ms-count.

By default, this form of checking is disabled. Any integer value passed to this option will enable RTT monitoring.

--ping-ms-count PACKETS

Specify the number of PACKETS that may exceed the defined --ping-ms-max RTT value before triggering the rig.

Default: 5
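
The evaluation order for ping results (loss first, then RTT) can be sketched as below. This is an illustration under stated assumptions: evaluate_pings() is hypothetical, lost or timed-out pings are represented as None, and the rig is assumed to trigger as soon as a counter reaches its configured value.

```python
def evaluate_pings(results, lost_count=1, ms_max=None, ms_count=5):
    """Return (index, reason) for the ping that trips the rig, or None.

    results -- RTTs in milliseconds, with None standing in for a lost or
               timed-out ping (a hypothetical representation).
    """
    lost = slow = 0
    for index, rtt in enumerate(results):
        if rtt is None:
            lost += 1
            slow += 1                 # a timeout also counts against the RTT limit
        elif ms_max is not None and rtt > ms_max:
            slow += 1
        if lost >= lost_count:        # loss is evaluated first
            return index, "lost"
        if ms_max is not None and slow >= ms_count:
            return index, "rtt"
    return None


print(evaluate_pings([10.0, None]))                    # (1, 'lost')
print(evaluate_pings([12.0, 250.0, 300.0, 9.5, 400.0],
                     ms_max=200, ms_count=3))          # (4, 'rtt')
```

With ms_max left at None, RTT values are never counted, mirroring the note that RTT checking is disabled unless --ping-ms-max is given.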

process

Watch a single process or list of processes for state changes or resource consumption thresholds. When the process enters the specified state or the specified resource consumption threshold is met, the trigger condition is met.

The following options are available for the process rig:

--proc

A PID or process name of processes to watch. If a process name is specified, then rig will attempt to convert this to a PID during rig creation. If multiple PIDs are found, the default behavior is to fail creation and exit. To have rig monitor all processes found for a process name, use the --all option.

--state STATE

The state that a process needs to be in, in order to trigger the rig. The following is a list of supported states:

   NAME         DESCRIPTION                     SHORTHAND
   dead         Dead - should never be seen     'X'
   disk-sleep   Uninterruptible sleep           'D' or 'UN'
   running      Currently running               'R' or 'run'
   sleeping     Interruptible sleep             'S' or 'sleep'
   stopped      Stopped                         'T' or 'stop'
   zombie       Exited, still in proc table     'Z' or 'zomb'

Users can use either the full status name, or the shorthand noted in the final column of the table above. Both the names and the shorthand values are case sensitive.

This can also be set to a "not" value by preceding one of the above state strings with an exclamation mark (!), e.g. '!sleeping' will match any non-sleep (S) state status for the process(es). Most shells will require you to quote the state string when using the '!' character.

Note that using '!running' will cause rig to not trigger against a state of 'sleeping', as generally speaking 'running' processes spend much of their time in S state, and it is assumed that triggering against such a process is not desired.

Process status is polled once every second.
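
The state-matching rules above, including negation and the special handling of '!running', can be summarized in a Python sketch. state_triggers() and the two lookup structures are hypothetical and only mirror the table and notes in this section; they are not taken from rig's source.

```python
# Shorthand values copied from the table above; both forms are case sensitive.
SHORTHAND = {
    "X": "dead",
    "D": "disk-sleep", "UN": "disk-sleep",
    "R": "running", "run": "running",
    "S": "sleeping", "sleep": "sleeping",
    "T": "stopped", "stop": "stopped",
    "Z": "zombie", "zomb": "zombie",
}
STATES = set(SHORTHAND.values())

def state_triggers(wanted, current):
    """Return True if a process in state `current` should trigger the rig."""
    negate = wanted.startswith("!")
    name = wanted[1:] if negate else wanted   # strip the leading '!'
    name = SHORTHAND.get(name, name)
    if name not in STATES:
        raise ValueError(f"unsupported state: {wanted!r}")
    if not negate:
        return current == name
    if name == "running" and current == "sleeping":
        return False                  # '!running' ignores the sleeping state
    return current != name


print(state_triggers("!running", "sleeping"))   # False
print(state_triggers("!running", "stopped"))    # True
```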

--rss INTEGER

The amount of rss (resident set size) memory usage to use as a threshold for triggering the rig. If the process' RSS usage goes above this value, trigger.

The value provided here may be suffixed with K, M, or G to denote the IEC unit. Rig will convert the provided value and suffix into a value in bytes.
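
A minimal sketch of the suffix conversion, assuming the usual IEC factors of 1024. to_bytes() is a hypothetical helper, and treating an unsuffixed value as a byte count is an assumption that the text above does not confirm.

```python
# IEC multipliers: each step is a factor of 1024.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def to_bytes(value):
    """Convert a string such as '512M' into a number of bytes."""
    suffix = value[-1]
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)                 # assumption: no suffix means bytes


print(to_bytes("512M"))               # 536870912
```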

--vms INTEGER

The same as --rss but monitoring Virtual Memory Size instead.

--memperc PERCENT

The percentage of total system memory a process is consuming to use as a threshold for triggering the rig. If the process' %mem meets or exceeds this value, trigger.

PERCENT may be a whole integer or a float. When using a float, the process rig respects up to two (2) decimal places of precision. For example, using '--memperc 10.25' is the same as using '--memperc 10.25678'.

--cpuperc PERCENT

The percentage of CPU usage a process is consuming to use as a threshold for triggering the rig. If the process' %cpu meets or exceeds this value, trigger.

PERCENT may be a whole integer or a float. When using a float and monitoring for CPU usage, rig respects one (1) decimal point of precision due to how CPU usage is reported.

PERCENT may be above 100 - as CPU usage can exceed 100 when a process is running on multiple CPUs.

system

Watch the system's utilization of resources as a whole, e.g. total CPU or memory usage. When the utilization of a given resource is either exceeded or falls below the given threshold (determined as appropriate for each resource), the trigger condition is met.

The following options are available for the system rig:

--iowait PERCENT

The amount of %iowait as reported by the kernel to use as a threshold value.

If exceeded, trigger the rig.

--steal PERCENT

The amount of %steal as reported by the kernel to use as a threshold value.

If exceeded, trigger the rig.

--nice PERCENT

The amount of %nice as reported by the kernel to use as a threshold value.

If exceeded, trigger the rig.

--guest PERCENT

The amount of %guest as reported by the kernel to use as a threshold value.

If exceeded, trigger the rig.

--user PERCENT

The amount of %user as reported by the kernel to use as a threshold value.

If exceeded, trigger the rig.

--available INTEGER

The amount of available memory in MiB as reported by the kernel to use as a threshold value.

If the amount of available memory falls below this threshold, trigger the rig.

--free INTEGER

The amount of free memory in MiB as reported by the kernel to use as a threshold value.

If the amount of free memory falls below this threshold, trigger the rig.

--used INTEGER

The amount of used memory in MiB as reported by the kernel to use as a threshold value.

If the amount of used memory exceeds this threshold, trigger the rig.

--slab INTEGER

The amount of slab memory in MiB as reported by the kernel to use as a threshold value.

If the amount of slab memory exceeds this threshold, trigger the rig.

--cpuperc PERCENT

The amount of total CPU usage as reported by the kernel as a percentage to use as a threshold value.

If exceeded, trigger the rig.

This value may be a whole integer or a float. Floats are precise out to one (1) decimal point.

--memperc PERCENT

The amount of total memory usage as reported by the kernel as a percentage to use as a threshold value.

If exceeded, trigger the rig.

This value may be a whole integer or a float. Floats are precise out to one (1) decimal point.

--loadavg FLOAT

System load average as reported by the OS to use as a threshold value. If the reported loadavg exceeds this value, trigger the rig. This option can accept either an integer (1) or a float (1.0).

Linux returns loadavg data for the past 1, 5, and 15 minutes. The system rig will monitor only one (1) of these intervals at a time, as controlled by the --loadavg-interval option.

--loadavg-interval [1, 5, 15]

Which time interval the rig should monitor when watching the system's loadavg. Only 1, 5, and 15 are accepted values for this option, as that is what the Linux kernel returns loadavg data for.

Default: 1
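
How --loadavg and --loadavg-interval fit together can be sketched with Python's os.getloadavg(), which returns the 1, 5, and 15 minute averages as a tuple. loadavg_exceeded() and its optional loadavg argument are hypothetical; the sketch only illustrates selecting one interval and comparing it with the threshold.

```python
import os

# Position of each interval in the (1 min, 5 min, 15 min) tuple.
INTERVAL_INDEX = {1: 0, 5: 1, 15: 2}

def loadavg_exceeded(threshold, interval=1, loadavg=None):
    """Return True if the load average for `interval` minutes exceeds threshold."""
    if interval not in INTERVAL_INDEX:
        raise ValueError("interval must be 1, 5, or 15")
    if loadavg is None:
        loadavg = os.getloadavg()     # live values; Unix only
    return loadavg[INTERVAL_INDEX[interval]] > float(threshold)


# Fixed sample values so that the result does not depend on the host.
print(loadavg_exceeded(1, interval=5, loadavg=(0.5, 1.2, 0.9)))   # True
```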

--temp INTEGER

The CPU temperature, in degrees Celsius, to use as a threshold value. If the CPU temperature meets or exceeds this value, trigger the rig.

This option takes an integer value, though temperature data is single decimal point sensitive, so a temperature of 50.9 degrees will not trigger a rig that sets this option to 51.

By default rig will monitor the first physical CPU package installed on the system. This may be changed via the --cpu-id option. Note that rig will only monitor whole packages and not individual cores, and that package temperatures reported are the highest reported temperature for any core in that package.

--cpu-id ID

If specified, monitor this physical CPU package. By default, rig will monitor physical CPU package 0 - meaning the first physically installed CPU.

When specifying an ID here, remember that in Linux CPU IDs are zero-indexed, so the first CPU will be ID 0, the second ID 1, and so forth.

Default: 0

filesystem

Watch a filesystem, directory, or file for utilization changes. Currently this rig is focused on space consumption, and will trigger when the specified path or backing filesystem exceeds the defined threshold for space utilization.

The following options are available for the filesystem rig:

--path PATH

Specify the filesystem, directory, or file path for the rig to monitor. The location provided must exist when the rig initializes for monitoring to be supported.

--size SIZE

Specify the size threshold to trigger on for the provided --path. The size given must be an integer suffixed with either K, M, G, or T. The provided value will be converted to bytes.

--fs-size SIZE

Use this option instead of --size if you want to monitor the space usage of the backing filesystem for --path rather than the size of the path alone.

Similar to --size this value must be suffixed with either K, M, G, or T.

--fs-used PERCENT

Similar to --fs-size but instead provide a percentage value to trigger on, when the filesystem's %used exceeds this value.

Note that using this option is ultimately the same as --fs-size as rig will convert the specified percentage into a raw bytes value to use for comparisons.
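
The percentage-to-bytes conversion mentioned above might look like the following Python sketch. Both functions are hypothetical, and using the total reported by shutil.disk_usage() as the basis for the percentage is an assumption; how rig determines the filesystem size is not stated here.

```python
import shutil

def percent_to_bytes(percent, total_bytes):
    """Convert a %used threshold into a byte count for a filesystem."""
    return int(total_bytes * percent / 100)

def fs_used_exceeded(path, percent):
    """Return True if the filesystem backing `path` is above `percent` used."""
    usage = shutil.disk_usage(path)   # named tuple: total, used, free
    return usage.used > percent_to_bytes(percent, usage.total)


print(percent_to_bytes(80, 1000))     # 800
```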

Actions

The following actions are supported responses to triggered rigs. These may be chained together on a single rig, so deploying multiple rigs with matching trigger conditions with single, varying actions is unnecessary.

Actions are executed based on a priority weighting system, where lower values represent a higher priority action, and those actions with lower values are executed before those with higher values. This is to allow more time-sensitive actions to be taken before those that may either take a long time to execute or are otherwise unaffected by allowing other actions to run before them. Action priority values are set by the actions directly and are currently not able to be modified by users.
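
The priority ordering can be illustrated with a short Python sketch. The numeric values below are invented for the example; the real values are set inside each action and are not listed in this page. Only the relative order (gcore first, tcpdump shortly after, sosreport late) follows the action descriptions below.

```python
# Hypothetical priority values; the real ones are set inside each action.
ACTIONS = [
    ("sosreport", 100),
    ("tcpdump", 5),
    ("gcore", 1),
]

def execution_order(actions):
    """Return action names sorted so that the lowest priority value runs first."""
    return [name for name, priority in sorted(actions, key=lambda a: a[1])]


print(execution_order(ACTIONS))       # ['gcore', 'tcpdump', 'sosreport']
```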

gcore

Collect a coredump of a given process or processes using GDB's gcore utility.

Note that this does _not_ interrupt the running process(es). Cores are saved to /tmp and will be named either core.$pid or core.$proc_name.$pid depending on whether a PID or process name was provided. This action will be executed first when a rig is triggered and multiple actions are specified.

This action supports repetition via the --repeat option.

The gcore action supports the following options:

--gcore PROCESS

Enables this action and takes either a PID or process name as a value. If a process name is given, the PID is determined at rig creation. If multiple PIDs are found for the same process name, the default behavior is to fail rig creation. Use the --all-pids option to instead use all PIDs discovered for a process name.

This option can be specified multiple times. For example, --gcore 12345 --gcore myprocess will generate a coredump for PID 12345 and a process matching the name 'myprocess'.

--all-pids

Tells this action to collect a coredump for all PIDs found for a provided process name.

--freeze

Freeze the process(es) that will be core dumped by sending a SIGSTOP prior to calling gcore on the discovered pid(s).

If successful, then rig will send a SIGCONT after the gcore execution has completed in order to thaw the process.
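
The freeze, dump, thaw sequence can be sketched in Python with os.kill() and the signal module. freeze_and_dump() and its dump callable are hypothetical stand-ins for whatever rig does internally, and resuming the process even when the dump fails is an assumption made for this sketch.

```python
import os
import signal

def freeze_and_dump(pid, dump, kill=os.kill):
    """Stop `pid`, run dump(pid), then resume the process.

    dump -- hypothetical callable that performs the coredump
    kill -- injectable for testing; defaults to os.kill
    """
    kill(pid, signal.SIGSTOP)         # freeze; raises if the signal cannot be sent
    try:
        return dump(pid)
    finally:
        kill(pid, signal.SIGCONT)     # thaw, even if the dump failed
```

Because SIGCONT is sent from the finally clause, it is only reached when the earlier SIGSTOP call did not raise, which matches the "if successful" wording above.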

kdump

Generate a vmcore by triggering a kernel crash via sysrq.

Note that this action WILL cause node disruption by triggering a kernel panic to generate the vmcore. This means your system will reboot when this action is triggered.

The kdump action does not perform any configuration checks on the system's kdump installation. It is assumed that kdump has been properly configured and tested prior to using this action.

The kdump action supports the following options:

--kdump

Enables this action.

--sysrq INTEGER

When the rig is deployed, if this option is set, rig will set the system's /proc/sys/kernel/sysrq to the value provided. See sysrq kernel documentation for information on what values are supported.

sosreport

Run an sos report after the rig has been triggered. A selection of sos report's plugin enablement options, as well as its --plugin-option, are supported by this action. This action should run after any time-sensitive actions otherwise specified by the user for a given rig.

The sosreport action supports the following options:

--sosreport

Enables this action.

--enable-plugins PLUGINS

Specifically force the specified comma-delimited list of PLUGINS to be enabled.

--plugin-option PLUGOPT

Modify a specific plugin's runtime options. This is passed directly to sos report as the same --plugin-option value, which should take the form 'name.option=value'. For example, to increase the podman plugin timeout use '--plugin-option podman.timeout=600'.

If you need to pass multiple sos report plugin options, use a comma-delimited list here instead of specifying this option multiple times.

--skip-plugins PLUGINS

Do not run these specified plugins. Use a comma-delimited list to skip multiple plugins.

--only-plugins PLUGINS

Only enable these specific plugins, disable all others. Use a comma-delimited list to specify multiple plugins.

tcpdump

Start collecting a tcpdump when the rig is initialized, and stop the collection when the rig triggers. This action will be triggered before most other actions, but after the gcore action.

Note there will be a slight delay in configuring any rig that uses the tcpdump action as rig must verify that the tcpdump process started successfully during the initialization process.

The tcpdump action supports the following options:

--tcpdump

Enables this action.

--iface INTERFACE

Starts the tcpdump to monitor the provided INTERFACE. In almost all situations this should be set to a specific interface on the system; however, the value of 'any' is accepted by the tcpdump command in order to listen on all interfaces. Be wary of using this, as use of 'any' makes it impossible to determine which interface a particular packet came in on in the resulting packet capture.

Default: eth0

--filter FILTER

Provide a filter to use with tcpdump in order to reduce the amount of traffic recorded in the packet capture. This value is passed directly to the tcpdump utility, and thus can be any valid filter accepted by tcpdump.

For most shells you must quote the filter string for rig to pass it correctly.

--snaplen|--snapshot-length LENGTH

Set the snapshot length for the packet capture. Captured packets are truncated to LENGTH bytes. The default value of 0 implies a LENGTH of 262144 bytes.

--dump-size SIZE

Limit the size of the packet capture file(s) to SIZE in MB.

Default: 10

--captures CAPTURES

Specify the number of packet capture files to keep. If more than one (1), then tcpdump will rotate the packet capture file when it reaches the --dump-size value and keep CAPTURES number of files.

For example, with --captures 2 and --dump-size 5, you will have up to two 5 MB packet captures when the rig terminates.

Default: 1 (packet capture file is replaced upon reaching SIZE limit).
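
One plausible way these options could translate into a tcpdump command line is sketched below. The tcpdump flags used (-i, -s, -C, -W, -w) are real tcpdump options, but the mapping itself is an assumption and may differ from what rig runs; tcpdump_argv() and the output file name are hypothetical.

```python
def tcpdump_argv(outfile, iface="eth0", snaplen=0, dump_size=10,
                 captures=1, capture_filter=None):
    """Build a tcpdump command line from the rig options described above."""
    argv = [
        "tcpdump",
        "-i", iface,                  # --iface
        "-s", str(snaplen),           # --snaplen; 0 means tcpdump's default
        "-C", str(dump_size),         # --dump-size; rotate at this file size
        "-W", str(captures),          # --captures; number of files to keep
        "-w", outfile,
    ]
    if capture_filter:
        argv.append(capture_filter)   # --filter, passed through unchanged
    return argv


print(tcpdump_argv("/var/tmp/example.pcap", captures=2, dump_size=5,
                   capture_filter="port 443"))
```

Building the command as a list, rather than a single string, means the filter reaches tcpdump as one argument without further shell quoting.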

monitor

While a rig is running, monitor various system statistics and record them for later review. These statistics may be file contents or command outputs.

This action begins collecting information when the rig is started, and stops when the rig is triggered.

By default, networking-centric information is monitored via commands such as netstat, ss, top, ps, and more. Similarly several networking-related files under /proc/ are monitored.

The rate at which these collections take place is controlled via the --interval option.

The monitor action supports the following options:

--monitor

Enables this action.

--disable-monitor-defaults

Do not monitor or collect any of the default items. This implies that all collections will be specified via --monitor-files and/or --monitor-commands.

--monitor-files FILES

A comma-delimited list of files to monitor. Monitored files have their contents copied to a file within the rig's archive of the same name. The contents will be separated by a timestamp header taken at the time of collection.

--monitor-commands COMMANDS

A comma-delimited list of commands to execute every --interval seconds, and have that output saved to a file within the rig's archive. Output collections are separated by a timestamp header taken at the time of collection.

Note that commands will need to be properly quoted if there are spaces (or other quotes) in the command string. For example, to run 'ps auxwww' the proper invocation would be --monitor-commands='ps auxwww'.

In-line shell scripting is not supported. While it may be possible for such values to function, there are no guarantees as to those executions working properly, at all, not causing unintended side-effects or harm, et cetera. Dragons ahead, and so forth.
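
A single collection cycle for --monitor-commands could be sketched as follows. Everything here is illustrative: run_command(), collect(), and the header format are hypothetical, and the sketch splits the list on every comma, so a command that itself contains a comma is not handled.

```python
import shlex
import subprocess
from datetime import datetime

def run_command(command):
    """Run one command without a shell and return its standard output."""
    result = subprocess.run(shlex.split(command), capture_output=True,
                            text=True, check=False)
    return result.stdout

def collect(commands, run=run_command, now=datetime.now):
    """Return one timestamped entry per command for a single polling cycle."""
    entries = {}
    for command in commands.split(","):
        header = f"==== {now().isoformat()} ====\n"   # hypothetical header format
        entries[command] = header + run(command)
    return entries
```

Running each command as an argument list without a shell is one reason constructs such as pipes or redirection would not behave as they do on an interactive command line.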

noop

Does nothing - this action runs a no-op. It is intended for testing a rig's configuration to make sure the trigger condition is set properly - e.g. a regex string for the logs rig's --message option.

The noop action supports the following options:

--noop

Enables this action.

Maintainer

Jake Hunsaker <jhunsake@redhat.com>

Info

January 2019