sanlock man page

sanlock — shared storage lock manager

Synopsis

sanlock [COMMAND] [ACTION] ...

Description

sanlock is a lock manager built on shared storage.  Hosts with access to the storage can perform locking.  An application running on the hosts is given a small amount of space on the shared block device or file, and uses sanlock for its own application-specific synchronization.  Internally, the sanlock daemon manages locks using two disk-based lease algorithms: delta leases and paxos leases.

Externally, the sanlock daemon exposes a locking interface through libsanlock in terms of "lockspaces" and "resources".  A lockspace is a locking context that an application creates for itself on shared storage. When the application on each host is started, it "joins" the lockspace. It can then create "resources" on the shared storage.  Each resource represents an application-specific entity.  The application can acquire and release leases on resources.

To use sanlock from an application:

How lockspaces and resources translate to delta leases and paxos leases within sanlock:

Lockspaces

Resources

Expiration

Failures

If a process holding resource leases fails or exits without releasing its leases, sanlock will release the leases for it automatically (unless persistent resource leases were used.)

If the sanlock daemon cannot renew a lockspace delta lease for a specific period of time (see Expiration), sanlock will enter "recovery mode" where it attempts to stop and/or kill any processes holding resource leases in the expiring lockspace.  If the processes do not exit in time, sanlock will force the host to be reset using the local watchdog device.

If the sanlock daemon crashes or hangs, it will not renew the expiry time of the per-lockspace connections it had to the wdmd daemon.  This will lead to the expiration of the local watchdog device, and the host will be reset.

Watchdog

sanlock uses the wdmd(8) daemon to access /dev/watchdog.  wdmd multiplexes multiple timeouts onto the single watchdog timer.  This is required because delta leases for each lockspace are renewed and expire independently.

sanlock maintains a wdmd connection for each lockspace delta lease being renewed.  Each connection has an expiry time for some seconds in the future.  After each successful delta lease renewal, the expiry time is renewed for the associated wdmd connection.  If wdmd finds any connection expired, it will not renew the /dev/watchdog timer.  Given enough successive failed renewals, the watchdog device will fire and reset the host.  (Given the multiplexing nature of wdmd, shorter overlapping renewal failures from multiple lockspaces could cause spurious watchdog firing.)

The direct link between delta lease renewals and watchdog renewals provides a predictable watchdog firing time based on delta lease renewal timestamps that are visible from other hosts.  sanlock knows the time the watchdog on another host has fired based on the delta lease time. Furthermore, if the watchdog device on another host fails to fire when it should, the continuation of delta lease renewals from the other host will make this evident and prevent leases from being taken from the failed host.

If sanlock is able to stop/kill all processing using an expiring lockspace, the associated wdmd connection for that lockspace is removed. The expired wdmd connection will no longer block /dev/watchdog renewals, and the host should avoid being reset.

Storage

On devices with 512 byte sectors, lockspaces and resources are 1MB in size.  On devices with 4096 byte sectors, lockspaces and resources are 8MB in size.  sanlock uses 512 byte sectors when shared files are used in place of shared block devices.  Offsets of leases or resources must be multiples of 1MB/8MB according to the sector size.

Using sanlock on shared block devices that do host based mirroring or replication is not likely to work correctly.  When using sanlock on shared files, all sanlock io should go to one file server.

Example

This is an example of creating and using lockspaces and resources from the command line.  (Most applications would use sanlock through libsanlock rather than through the command line.)

1.

Allocate shared storage for sanlock leases.

This example assumes 512 byte sectors on the device, in which case the lockspace needs 1MB and each resource needs 1MB.

# vgcreate vg /dev/sdb
# lvcreate -n leases -L 1GB vg
2.

Start sanlock on all hosts.

The -w 0 disables use of the watchdog for testing.

# sanlock daemon -w 0
3.

Start a dummy application on all hosts.

This sanlock command registers with sanlock, then execs the sleep command which inherits the registered fd.  The sleep process acts as the dummy application.  Because the sleep process is registered with sanlock, leases can be acquired for it.

# sanlock client command -c /bin/sleep 600 &
4.

Create a lockspace for the application (from one host).

The lockspace is named "test".

# sanlock client init -s test:0:/dev/test/leases:0
5.

Join the lockspace for the application.

Use a unique host_id on each host.

host1:
# sanlock client add_lockspace -s test:1/dev/vg/leases:0
host2:
# sanlock client add_lockspace -s test:2/dev/vg/leases:0
6.

Create two resources for the application (from one host).

The resources are named "RA" and "RB".  Offsets are used on the same device as the lockspace.  Different LVs or files could also be used.

# sanlock client init -r test:RA:/dev/vg/leases:1048576
# sanlock client init -r test:RB:/dev/vg/leases:2097152
7.

Acquire resource leases for the application on host1.

Acquire an exclusive lease (the default) on the first resource, and a shared lease (SH) on the second resource.

# export P=`pidof sleep`
# sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P
# sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P
8.

Acquire resource leases for the application on host2.

Acquiring the exclusive lease on the first resource will fail because it is held by host1.  Acquiring the shared lease on the second resource will succeed.

# export P=`pidof sleep`
# sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P
# sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P
9.

Release resource leases for the application on both hosts.

The sleep pid could also be killed, which will result in the sanlock daemon releasing its leases when it exits.

# sanlock client release -r test:RA:/dev/vg/leases:1048576 -p $P
# sanlock client release -r test:RB:/dev/vg/leases:2097152 -p $P
10.

Leave the lockspace for the application.

host1:
# sanlock client rem_lockspace -s test:1/dev/vg/leases:0
host2:
# sanlock client rem_lockspace -s test:2/dev/vg/leases:0
11.

Stop sanlock on all hosts.

# sanlock shutdown

Options

COMMAND can be one of three primary top level choices

sanlock daemon start daemon
sanlock client send request to daemon (default command if none given)
sanlock direct access storage directly (no coordination with daemon)

Daemon Command

sanlock daemon [options]

-D     no fork and print all logging to stderr

-Q 0|1 quiet error messages for common lock contention

-R 0|1 renewal debugging, log debug info for each renewal

-L pri write logging at priority level and up to logfile (-1 none)

-S pri write logging at priority level and up to syslog (-1 none)

-U uid user id

-G gid group id

-t num max worker threads

-g sec seconds for graceful recovery

-w 0|1 use watchdog through wdmd

-h 0|1 use high priority (RR) scheduling

-l num use mlockall (0 none, 1 current, 2 current and future)

-b sec seconds a host id bit will remain set in delta lease bitmap

-e str local host name used in delta leases

Client Command

sanlock client action [options]

sanlock client status

Print processes, lockspaces, and resources being managed by the sanlock daemon.  Add -D to show extra internal daemon status for debugging. Add -o p to show resources by pid, or -o s to show resources by lockspace.

sanlock client host_status

Print state of host_id delta leases read during the last renewal. State of all lockspaces is shown (use -s to select one). Add -D to show extra internal daemon status for debugging.

sanlock client gets

Print lockspaces being managed by the sanlock daemon.  The LOCKSPACE string will be followed by ADD or REM if the lockspace is currently being added or removed.  Add -h 1 to also show hosts in each lockspace.

sanlock client renewal -s LOCKSPACE

Print a history of renewals with timing details. See the Renewal history section below.

sanlock client log_dump

Print the sanlock daemon internal debug log.

sanlock client shutdown

Ask the sanlock daemon to exit.  Without the force option (-f 0), the command will be ignored if any lockspaces exist.  With the force option (-f 1), any registered processes will be killed, their resource leases released, and lockspaces removed.  With the wait option (-w 1), the command will wait for a result from the daemon indicating that it has shut down and is exiting, or cannot shut down because lockspaces exist (command fails).

sanlock client init -s LOCKSPACE

Tell the sanlock daemon to initialize a lockspace on disk.  The -o option can be used to specify the io timeout to be written in the host_id leases. (Also see sanlock direct init.)

sanlock client init -r RESOURCE

Tell the sanlock daemon to initialize a resource lease on disk. (Also see sanlock direct init.)

sanlock client read -s LOCKSPACE

Tell the sanlock daemon to read a lockspace from disk.  Only the LOCKSPACE path and offset are required.  If host_id is zero, the first record at offset (host_id 1) is used.  The complete LOCKSPACE and io timeout are printed.

sanlock client read -r RESOURCE

Tell the sanlock daemon to read a resource lease from disk.  Only the RESOURCE path and offset are required.  The complete RESOURCE is printed. (Also see sanlock direct read_leader.)

sanlock client align -s LOCKSPACE

Tell the sanlock daemon to report the required lease alignment for a storage path.  Only path is used from the LOCKSPACE argument.

sanlock client add_lockspace -s LOCKSPACE

Tell the sanlock daemon to acquire the specified host_id in the lockspace. This will allow resources to be acquired in the lockspace.  The -o option can be used to specify the io timeout of the acquiring host, and will be written in the host_id lease.

sanlock client inq_lockspace -s LOCKSPACE

Inquire about the state of the lockspace in the sanlock daemon, whether it is being added or removed, or is joined.

sanlock client rem_lockspace -s LOCKSPACE

Tell the sanlock daemon to release the specified host_id in the lockspace. Any processes holding resource leases in this lockspace will be killed, and the resource leases not released.

sanlock client command -r RESOURCE -c path args

Register with the sanlock daemon, acquire the specified resource lease, and exec the command at path with args.  When the command exits, the sanlock daemon will release the lease.  -c must be the final option.

sanlock client acquire -r RESOURCE -p pid
sanlock client release -r RESOURCE -p pid

Tell the sanlock daemon to acquire or release the specified resource lease for the given pid.  The pid must be registered with the sanlock daemon. acquire can optionally take a versioned RESOURCE string RESOURCE:lver, where lver is the version of the lease that must be acquired, or fail.

sanlock client convert -r RESOURCE -p pid

Tell the sanlock daemon to convert the mode of the specified resource lease for the given pid.  If the existing mode is exclusive (default), the mode of the lease can be converted to shared with RESOURCE:SH.  If the existing mode is shared, the mode of the lease can be converted to exclusive with RESOURCE (no :SH suffix).

sanlock client inquire -p pid

Print the resource leases held the given pid.  The format is a versioned RESOURCE string "RESOURCE:lver" where lver is the version of the lease held.

sanlock client request -r RESOURCE -f force_mode

Request the owner of a resource do something specified by force_mode.  A versioned RESOURCE:lver string must be used with a greater version than is presently held.  Zero lver and force_mode clears the request.

sanlock client examine -r RESOURCE

Examine the request record for the currently held resource lease and carry out the action specified by the requested force_mode.

sanlock client examine -s LOCKSPACE

Examine requests for all resource leases currently held in the named lockspace.  Only lockspace_name is used from the LOCKSPACE argument.

sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num

Set an event for another host.  When the sanlock daemon next renews its delta lease for the lockspace it will: set the bit for the host_id in its bitmap, and set the generation, event and data values in its own delta lease.  An application that has registered for events from this lockspace on the destination host will get the event that has been set when the destination sees the event during its next delta lease renewal.

sanlock client set_config -s LOCKSPACE

Set a configuration value for a lockspace. Only lockspace_name is used from the LOCKSPACE argument. The USED flag has the same effect on a lockspace as a process holding a resource lease that will not exit.  The USED_BY_ORPHANS flag means that an orphan resource lease will have the same effect as the USED.
-u 0|1 Set (1) or clear (0) the USED flag.
-O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.

Direct Command

sanlock direct action [options]

-o sec io timeout in seconds

sanlock direct init -s LOCKSPACE
sanlock direct init -r RESOURCE

Initialize storage for 2000 host_id (delta) leases for the given lockspace, or initialize storage for one resource (paxos) lease.  Both options require 1MB of space.  The host_id in the LOCKSPACE string is not relevant to initialization, so the value is ignored.  (The default of 2000 host_ids can be changed for special cases using the -n num_hosts and -m max_hosts options.)  With -s, the -o option specifies the io timeout to be written in the host_id leases.  With -r, the -z 1 option invalidates the resource lease on disk so it cannot be used until reinitialized normally.

sanlock direct read_leader -s LOCKSPACE
sanlock direct read_leader -r RESOURCE

Read a leader record from disk and print the fields.  The leader record is the single sector of a delta lease, or the first sector of a paxos lease.

sanlock direct dump path[:offset[:size]]

Read disk sectors and print leader records for delta or paxos leases.  Add -f 1 to print the request record values for paxos leases, and host_ids set in delta lease bitmaps.

LOCKSPACE option string

-s lockspace_name:host_id:path:offset

lockspace_name name of lockspace
host_id local host identifier in lockspace
path path to storage reserved for leases
offset offset on path (bytes)

RESOURCE option string

-r lockspace_name:resource_name:path:offset

lockspace_name name of lockspace
resource_name name of resource
path path to storage reserved for leases
offset offset on path (bytes)

RESOURCE option string with suffix

-r lockspace_name:resource_name:path:offset:lver

lver leader version

-r lockspace_name:resource_name:path:offset:SH

SH indicates shared mode

Defaults

sanlock help shows the default values for the options above.

sanlock version shows the build version.

Other

Request/Examine

The first part of making a request for a resource is writing the request record of the resource (the sector following the leader record).  To make a successful request:

  • RESOURCE:lver must be greater than the lver presently held by the other host.  This implies the leader record must be read to discover the lver, prior to making a request.
  • RESOURCE:lver must be greater than or equal to the lver presently written to the request record.  Two hosts may write a new request at the same time for the same lver, in which case both would succeed, but the force_mode from the last would win.
  • The force_mode must be greater than zero.
  • To unconditionally clear the request record (set both lver and force_mode to 0), make request with RESOURCE:0 and force_mode 0.

The owner of the requested resource will not know of the request unless it is explicitly told to examine its resources via the "examine" api/command, or otherwise notfied.

The second part of making a request is notifying the resource lease owner that it should examine the request records of its resource leases.  The notification will cause the lease owner to automatically run the equivalent of "sanlock client examine -s LOCKSPACE" for the lockspace of the requested resource.

The notification is made using a bitmap in each host_id delta lease.  Each bit represents each of the possible host_ids (1-2000).  If host A wants to notify host B to examine its resources, A sets the bit in its own bitmap that corresponds to the host_id of B.  When B next renews its delta lease, it reads the delta leases for all hosts and checks each bitmap to see if its own host_id has been set.  It finds the bit for its own host_id set in A's bitmap, and examines its resource request records.  (The bit remains set in A's bitmap for set_bitmap_seconds.)

force_mode determines the action the resource lease owner should take:

  • FORCE (1): kill the process holding the resource lease.  When the process has exited, the resource lease will be released, and can then be acquired by anyone.  The kill signal is SIGKILL (or SIGTERM if SIGKILL is restricted.)
  • GRACEFUL (2): run the program configured by sanlock_killpath against the process holding the resource lease.  If no killpath is defined, then FORCE is used.

Persistent and orphan resource leases

A resource lease can be acquired with the PERSISTENT flag (-P 1).  If the process holding the lease exits, the lease will not be released, but kept on an orphan list.  Another local process can acquire an orphan lease using the ORPHAN flag (-O 1), or release the orphan lease using the ORPHAN flag (-O 1).  All orphan leases can be released by setting the lockspace name (-s lockspace_name) with no resource name.

Renewal history

sanlock saves a limited history of lease renewal information in each lockspace. See sanlock.conf renewal_history_size to set the amount of history or to disable (set to 0).

IO times are measured in delta lease renewal (each delta lease renewal includes one read and one write).

For each successful renewal, a record is saved that includes:

  • the timestamp written in the delta lease by the renewal
  • the time in milliseconds taken by the delta lease read
  • the time in milliseconds taken by the delta lease write

 Also counted and recorded are the number io timeouts and other io errors that occur between successful renewals.

Two consecutive successful renewals would be recorded as:

timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0

Those fields are:
   

  • timestamp is the value written into the delta lease during that renewal.
  • read_ms/write_ms are the milliseconds taken for the renewal read/write ios.
       
  • next_timeouts are the number of io timeouts that occured after the renewal recorded on that line, and before the next successful renewal on the following line.
  • next_errors are the number of io errors (not timeouts) that occured after renewal recorded on that line, and before the next successful renewal on the following line.
       

The command 'sanlock client renewal -s lockspace_name' reports the full history of renewals saved by sanlock, which by default is 180 records, about 1 hour of history when using a 20 second renewal interval for a 10 second io timeout.

Internals

Disk Format

  • This example uses 512 byte sectors.
  • Each lockspace is 1MB.  It holds 2000 delta_leases, one per sector, supporting up to 2000 hosts.
  • Each paxos_lease is 1MB.  It is used as a lease for one resource.
  • The leader_record structure is used differently by each lease type.
  • To display all leader_record fields, see sanlock direct read_leader.
  • A lockspace is often followed on disk by the paxos_leases used within that lockspace, but this layout is not required.
  • The request_record and host_id bitmap are used for requests/events.
  • The mode_block contains the SHARED flag indicating a lease is held in the shared mode.
  • In a lockspace, the host using host_id N writes to a single delta_lease in sector N-1.  No other hosts write to this sector.  All hosts read all lockspace sectors when renewing their own delta_lease, and are able to monitor renewals of all delta_leases.
  • In a paxos_lease, each host has a dedicated sector it writes to, containing its own paxos_dblock and mode_block structures.  Its sector is based on its host_id; host_id 1 writes to the dblock/mode_block in sector 2 of the paxos_lease.
  • The paxos_dblock structures are used by the paxos_lease algorithm, and the result is written to the leader_record.

0x000000 lockspace foo:0:/path:0

(There is no representation on disk of the lockspace in general, only the sequence of specific delta_leases which collectively represent the lockspace.)

delta_lease foo:1:/path:0

0x000 0         leader_record         (sector 0, for host_id 1)
                magic: 0x12212010 
                space_name: foo
                resource_name: host uuid/name
                ...
                host_id bitmap        (leader_record + 256)

delta_lease foo:2:/path:0

0x200 512       leader_record         (sector 1, for host_id 2)
                magic: 0x12212010
                space_name: foo
                resource_name: host uuid/name
                ...
                host_id bitmap        (leader_record + 256)

delta_lease foo:3:/path:0

0x400 1024      leader_record         (sector 2, for host_id 3)
                magic: 0x12212010
                space_name: foo
                resource_name: host uuid/name
                ...
                host_id bitmap        (leader_record + 256)

delta_lease foo:2000:/path:0

0xF9E00         leader_record         (sector 1999, for host_id 2000)
                magic: 0x12212010
                space_name: foo
                resource_name: host uuid/name
                ...
                host_id bitmap        (leader_record + 256)

0x100000 paxos_lease foo:example1:/path:1048576

0x000 0         leader_record         (sector 0)
                magic: 0x06152010
                space_name: foo
                resource_name: example1

0x200 512       request_record        (sector 1)
                magic: 0x08292011

0x400 1024      paxos_dblock          (sector 2, for host_id 1)
0x480 1152      mode_block            (paxos_dblock + 128)

0x600 1536      paxos_dblock          (sector 3, for host_id 2)
0x680 1664      mode_block            (paxos_dblock + 128)

0x800 2048      paxos_dblock          (sector 4, for host_id 3)
0x880 2176      mode_block            (paxos_dblock + 128)

0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
0xFA280         mode_block            (paxos_dblock + 128)

0x200000 paxos_lease foo:example2:/path:2097152

0x000 0         leader_record         (sector 0)
                magic: 0x06152010
                space_name: foo
                resource_name: example2

0x200 512       request_record        (sector 1)
                magic: 0x08292011

0x400 1024      paxos_dblock          (sector 2, for host_id 1)
0x480 1152      mode_block            (paxos_dblock + 128)

0x600 1536      paxos_dblock          (sector 3, for host_id 2)
0x680 1664      mode_block            (paxos_dblock + 128)

0x800 2048      paxos_dblock          (sector 4, for host_id 3)
0x880 2176      mode_block            (paxos_dblock + 128)

0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
0xFA280         mode_block            (paxos_dblock + 128)

Lease ownership

Not shown in the leader_record structures above are the owner_id, owner_generation and timestamp fields.  These are the fields that define the lease owner.

The delta_lease at sector N for host_id N+1 has leader_record.owner_id N+1.  The leader_record.owner_generation is incremented each time the delta_lease is acquired.  When a delta_lease is acquired, the leader_record.timestamp field is set to the time of the host and the leader_record.resource_name is set to the unique name of the host.  When the host renews the delta_lease, it writes a new leader_record.timestamp. When a host releases a delta_lease, it writes zero to leader_record.timestamp.

When a host acquires a paxos_lease, it uses the host_id/generation value from the delta_lease it holds in the lockspace.  It uses this host_id/generation to identify itself in the paxos_dblock when running the paxos algorithm.  The result of the algorithm is the winning host_id/generation - the new owner of the paxos_lease.  The winning host_id/generation are written to the paxos_lease leader_record.owner_id and leader_record.owner_generation fields and leader_record.timestamp is set.  When a host releases a paxos_lease, it sets leader_record.timestamp to 0.

When a paxos_lease is free (leader_record.timestamp is 0), multiple hosts may attempt to acquire it.  The paxos algorithm, using the paxos_dblock structures, will select only one of the hosts as the new owner, and that owner is written in the leader_record.  The paxos_lease will no longer be free (non-zero timestamp).  Other hosts will see this and will not attempt to acquire the paxos_lease until it is free again.

If a paxos_lease is owned (non-zero timestamp), but the owner has not renewed its delta_lease for a specific length of time, then the owner value in the paxos_lease becomes expired, and other hosts will use the paxos algorithm to acquire the paxos_lease, and set a new owner.

Files

/etc/sanlock/sanlock.conf

See Also

wdmd(8)

Referenced By

fence_sanlock(8), fence_sanlockd(8), sanlk-reset(8), sanlk-resetd(8), sanlock_selinux(8).

2015-01-23