The High Availability cluster stack's highest priority is protecting the integrity of data. This is achieved by preventing uncoordinated concurrent access to data storage: For example, ext3 file systems are only mounted once in the cluster, and OCFS2 volumes are not mounted unless coordination with other cluster nodes is available. In a well-functioning cluster, Pacemaker detects if resources are active beyond their concurrency limits and initiates recovery. Furthermore, its policy engine never exceeds these limits.
However, network partitioning or software malfunction could potentially cause scenarios where several coordinators are elected. If such split brain scenarios were allowed to unfold, data corruption might occur. Hence, several layers of protection have been added to the cluster stack to mitigate this.
The primary component contributing to this goal is IO fencing/STONITH since it ensures that all other access prior to storage activation is terminated. Other mechanisms are cLVM2 exclusive activation or OCFS2 file locking support to protect your system against administrative or application faults. Combined appropriately for your setup, these can reliably prevent split brain scenarios from causing harm.
This chapter describes an IO fencing mechanism that leverages the storage itself, followed by the description of an additional layer of protection to ensure exclusive storage access. These two mechanisms can be combined for higher levels of protection.
You can reliably avoid split brain scenarios by using one or more
STONITH Block Devices (SBD), watchdog support and
the external/sbd STONITH agent.
In an environment where all nodes have access to shared storage, a small partition of the device is formatted for use with SBD. The size of the partition depends on the block size of the used disk (1 MB for standard SCSI disks with 512 Byte block size, DASD disks with 4 kB block size need 4 MB). After the respective daemon is configured, it is brought online on each node before the rest of the cluster stack is started. It is terminated after all other cluster components have been shut down, thus ensuring that cluster resources are never activated without SBD supervision.
The daemon automatically allocates one of the message slots on the partition to itself, and constantly monitors it for messages addressed to itself. Upon receipt of a message, the daemon immediately complies with the request, such as initiating a power-off or reboot cycle for fencing.
The daemon constantly monitors connectivity to the storage device, and terminates itself in case the partition becomes unreachable. This guarantees that it is not disconnected from fencing messages. If the cluster data resides on the same logical unit in a different partition, this is not an additional point of failure: The workload will terminate anyway if the storage connectivity has been lost.
Increased protection is offered through watchdog
support. Modern systems support a hardware watchdog
that has to be “tickled” or “fed” by a
software component. The software component (usually a daemon) regularly
writes a service pulse to the watchdog—if the daemon stops feeding
the watchdog, the hardware will enforce a system restart. This protects
against failures of the SBD process itself, such as dying, or becoming
stuck on an IO error.
If Pacemaker integration is activated, SBD will not self-fence if device majority is lost. For example, your cluster contains 3 nodes: A, B, and C. Due to a network split, A can only see itself while B and C can still communicate. In this case, there are two cluster partitions, one with quorum due to being the majority (B, C), and one without (A). If this happens while the majority of fencing devices are unreachable, node A would immediately commit suicide, but the nodes B and C would continue to run.
SBD supports the use of 1-3 devices:
One device: The simplest implementation. It is appropriate for clusters where all of your data is on the same shared storage.
Two devices: This configuration is primarily useful for environments that use host-based mirroring but where no third storage device is available. SBD will not terminate itself if it loses access to one mirror leg, allowing the cluster to continue. However, since SBD does not have enough knowledge to detect an asymmetric split of the storage, it will not fence the other side while only one mirror leg is available. Thus, it cannot automatically tolerate a second failure while one of the storage arrays is down.
Three devices: The most reliable configuration. It is resilient against outages of one device, whether due to failures or maintenance. SBD will only terminate itself if more than one device is lost. Fencing messages can be successfully transmitted if at least two devices are still accessible.
This configuration is suitable for more complex scenarios where storage is not restricted to a single array. Host-based mirroring solutions can have one SBD per mirror leg (not mirrored itself), and an additional tie-breaker on iSCSI.
The following steps are necessary to set up storage-based protection:
All of the following procedures must be executed as root. Before
you start, make sure the following requirements are met:
The environment must have shared storage reachable by all nodes.
The shared storage segment must not make use of host-based RAID, cLVM2, or DRBD.
However, using storage-based RAID and multipathing is recommended for increased reliability.
It is recommended to create a 1 MB partition at the start of the device.
If your SBD device resides on a multipath group, you need to adjust the
timeouts SBD uses, as MPIO's path down detection can cause some
latency. After the msgwait timeout, the message is
assumed to have been delivered to the node. For multipath, this should
be the time required for MPIO to detect a path failure and switch to
the next path. You may have to test this in your environment. The node
will terminate itself if the SBD daemon running on it has not updated
the watchdog timer fast enough.
Test your chosen timeouts in your specific environment. In case you use
a multipath storage with just one SBD device, pay special attention to
the failover delays incurred.
In the following, this SBD partition is referred to by
/dev/SBD. Replace it with your actual pathname, for example:
/dev/sdc1.
Make sure the device you want to use for SBD does not hold any data.
The sbd command will overwrite the device without
further requests for confirmation.
Initialize the SBD device with the following command:
root # sbd -d /dev/SBD create
This will write a header to the device, and create slots for up to 255 nodes sharing this device with default timings.
If you want to use more than one device for SBD, provide the devices
by specifying the -d option multiple times, for
example:
root # sbd -d /dev/SBD1 -d /dev/SBD2 -d /dev/SBD3 create
If your SBD device resides on a multipath group, adjust the timeouts SBD uses. This can be specified when the SBD device is initialized (all timeouts are given in seconds):
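root # /usr/sbin/sbd -d /dev/SBD -4 180 -1 90 create
The -4 option specifies the msgwait timeout (180 seconds in this example), and the -1 option specifies the watchdog timeout (90 seconds).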
With the following command, check what has been written to the device:
root # sbd -d /dev/SBD dump
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10
As you can see, the timeouts are also stored in the header, to ensure that all participating nodes agree on them.
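Because the timeouts are stored in the device header, a script can read them back instead of hard-coding them. The following is a minimal sketch, assuming the dump output uses the "Timeout (name) : value" layout shown above. The function name sbd_dump_timeout is hypothetical and not part of SBD.

```shell
#!/bin/sh
# Hypothetical helper (not part of SBD): print the value of one timeout
# from "sbd ... dump" output read on standard input.
# Assumes lines of the form "Timeout (watchdog) : 5".
sbd_dump_timeout() {
    # $1 = timeout name, for example "watchdog" or "msgwait"
    awk -F':' -v name="$1" '
        $1 ~ /^Timeout \(/ {
            key = $1
            sub(/^Timeout \(/, "", key)   # strip the leading "Timeout ("
            sub(/\).*$/, "", key)         # strip ")" and trailing blanks
            if (key == name) {
                gsub(/[[:space:]]/, "", $2)   # drop padding around the value
                print $2
                exit
            }
        }
    '
}

# Usage on a cluster node (as root):
#   sbd -d /dev/SBD dump | sbd_dump_timeout msgwait
```

Reading the value from the header keeps other settings, such as stonith-timeout, in step with what the nodes actually agreed on.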
Watchdog will protect the system against SBD failures, if no other software uses it.
No other software must access the watchdog timer. Some hardware vendors ship systems management software that uses the watchdog for system resets (for example, HP ASR daemon). Disable such software if the watchdog is used by SBD.
In SUSE Linux Enterprise High Availability Extension, watchdog support in the Kernel is enabled by default:
It ships with a number of different Kernel modules that provide
hardware-specific watchdog drivers. The High Availability Extension uses the SBD daemon as
the software component that “feeds” the watchdog. If
configured as described in
Section 17.1.3.3, “Starting the SBD Daemon”, the SBD daemon
will start automatically when the respective node is brought online
with systemctl start pacemaker.service.
Usually, the appropriate watchdog driver for your hardware is
automatically loaded during system boot. softdog is
the most generic driver, but it is recommended to use a driver with
actual hardware integration. For example:
On HP hardware, this is the hpwdt driver.
For systems with an Intel TCO, the iTCO_wdt driver
can be used.
For a list of choices, refer to
/usr/src/KERNEL_VERSION/drivers/watchdog.
Alternatively, list the drivers that have been installed with
your Kernel version with the following command:
root # rpm -ql kernel-VERSION | grep watchdog
As most watchdog driver names contain strings like
wd, wdt, or
dog, use the following command to check
which driver is currently loaded:
root # lsmod | egrep "(wd|dog)"
To automatically load the watchdog driver, create the
file /etc/modules-load.d/watchdog.conf containing
a line with the driver name. For more information refer to the man page
modules-load.d.
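Creating the file described above can be sketched as follows. This is only an illustration: the function name write_watchdog_conf is hypothetical, and the driver you pass must be the one that matches your hardware.

```shell
#!/bin/sh
# Hypothetical helper: write a modules-load.d file naming one watchdog driver.
# $1 = driver name (for example softdog, hpwdt or iTCO_wdt)
# $2 = target file; defaults to the path given in the text
write_watchdog_conf() {
    driver="$1"
    conf="${2:-/etc/modules-load.d/watchdog.conf}"
    if [ -z "$driver" ]; then
        echo "usage: write_watchdog_conf DRIVER [FILE]" >&2
        return 1
    fi
    # The file contains a single line with the driver name.
    printf '%s\n' "$driver" > "$conf"
}

# Usage (as root), here with the generic softdog driver:
#   write_watchdog_conf softdog
#   modprobe softdog     # load the driver now instead of at the next boot
```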
If you change the timeout for watchdog, the other two values
(msgwait and stonith-timeout)
must be changed as well. The watchdog timeout depends mostly on your
storage latency. This value specifies that the majority of devices must
have successfully finished their read operation within this time frame.
If not, the node will self-fence.
The following “formula” roughly expresses the relationship between these three values:
Timeout (msgwait) = Timeout (watchdog) * 2
stonith-timeout = Timeout (msgwait) + 20%
For example, if you set the watchdog timeout to 120, you have to
set msgwait to 240 and
stonith-timeout to 288. You can check the
values stored on the device with cs_make_sbd_devices:
root # cs_make_sbd_devices --dump
==Dumping header on disk /dev/sdb
Header version     : 2.1
UUID               : 619127f4-0e06-434c-84a0-ea82036e144c
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 20
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 40
==Header on disk /dev/sdb is dumped
If you set up a new cluster, the ha-cluster-init
command takes the above considerations into account.
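The rule of thumb relating the three timeouts can be written as a small calculation. The sketch below is illustrative only: the function name sbd_timeouts is hypothetical, and rounding the 20% margin up to a whole second is an assumption, not something SBD prescribes.

```shell
#!/bin/sh
# Hypothetical helper: derive msgwait and stonith-timeout (in seconds)
# from a watchdog timeout, following the rule of thumb in the text.
sbd_timeouts() {
    watchdog="$1"
    # msgwait is twice the watchdog timeout
    msgwait=$((watchdog * 2))
    # stonith-timeout is msgwait plus 20%, rounded up to a whole second
    stonith_timeout=$(( (msgwait * 120 + 99) / 100 ))
    echo "watchdog=$watchdog msgwait=$msgwait stonith-timeout=$stonith_timeout"
}

sbd_timeouts 120
# prints: watchdog=120 msgwait=240 stonith-timeout=288
```

The first two values are the ones passed to sbd with -1 and -4 when the device is created. The third is set as the stonith-timeout cluster property.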
The SBD daemon is a critical piece of the cluster stack. It has to be running whenever the cluster stack is running, even when part of the stack has crashed, so that the crashed node can still be fenced.
Enable the SBD daemon to start at boot time with:
root # systemctl enable sbd.service
Run ha-cluster-init. This script ensures that
SBD is correctly configured and that the configuration file
/etc/sysconfig/sbd is added to the list
of files that need to be synchronized with Csync2.
If you want to configure SBD manually, perform the following step:
To make the Corosync init script start and stop SBD,
edit the file /etc/sysconfig/sbd and
search for the following line, replacing
SBD with your SBD device:
SBD_DEVICE="/dev/SBD"
If you need to specify multiple devices in the first line, separate them by a semicolon (the order of the devices does not matter):
SBD_DEVICE="/dev/SBD1; /dev/SBD2; /dev/SBD3"
If the SBD device is not accessible, the daemon will fail to start and inhibit Corosync startup.
If the SBD device becomes inaccessible from a node, this could cause the node to enter an infinite reboot cycle. This is technically correct behavior, but depending on your administrative policies, most likely a nuisance. In such cases, it is better not to start Corosync and Pacemaker automatically on boot.
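Since an inaccessible device blocks startup, it can help to check every device named in SBD_DEVICE before starting the cluster stack. The following sketch splits the semicolon-separated value into one device per line so that each can be tested in turn. The function name sbd_device_list is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each device from an SBD_DEVICE value on its
# own line. Devices are separated by semicolons; surrounding blanks and
# empty entries are removed.
sbd_device_list() {
    printf '%s\n' "$1" | tr ';' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Usage on a cluster node (as root):
#   . /etc/sysconfig/sbd
#   for dev in $(sbd_device_list "$SBD_DEVICE"); do
#       sbd -d "$dev" dump >/dev/null || echo "cannot read $dev"
#   done
```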
Before proceeding, ensure that SBD has started on all nodes by
executing systemctl restart pacemaker.service.
The following command will dump the node slots and their current messages from the SBD device:
root # sbd -d /dev/SBD list
All cluster nodes that have ever been started with SBD should now be
listed here, and the message slot of each node should show
clear.
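On larger clusters, a quick filter can point out slots that are not clear. The sketch below assumes that each line of the list output consists of whitespace-separated columns in the order slot number, node name, message; verify this against the output on your system before relying on it. The function name sbd_pending_nodes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "sbd ... list" output on standard input and
# print the name of every node whose slot does not show "clear".
# Assumed column order: slot number, node name, message.
sbd_pending_nodes() {
    awk 'NF >= 3 && $3 != "clear" { print $2 }'
}

# Usage on a cluster node (as root):
#   sbd -d /dev/SBD list | sbd_pending_nodes
```

No output means that every listed slot is clear.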
Try sending a test message to one of the nodes:
root # sbd -d /dev/SBD message nodea test
The node will acknowledge the receipt of the message in the system logs:
Aug 29 14:10:00 nodea sbd: [13412]: info: Received command test from nodeb
This confirms that SBD is indeed up and running on the node and that it is ready to receive messages.
To complete the SBD setup, activate SBD as a STONITH/fencing mechanism in the CIB as follows:
root # crm configure
crm(live)configure# property stonith-enabled="true"
crm(live)configure# property stonith-timeout="40s"
crm(live)configure# primitive stonith_sbd stonith:external/sbd \
   op start interval="0" timeout="15" start-delay="10"
crm(live)configure# commit
crm(live)configure# quit
The resource does not need to be cloned, as it would shut down the respective node anyway if there was a problem.
Which value to set for stonith-timeout depends on
the msgwait timeout.
The msgwait timeout should be longer than the
maximum allowed timeout for the underlying IO system. For example,
this is 30 seconds for plain SCSI disks. Provided you set the
msgwait timeout value to 30 seconds, setting
stonith-timeout to 40 seconds is appropriate.
Since node slots are allocated automatically, no manual host list needs to be defined.
Disable any other fencing devices you might have configured before, since the SBD mechanism is used for this function now.
Once the resource has started, your cluster is successfully configured for shared-storage fencing and will utilize this method in case a node needs to be fenced.
Log in as root and start a shell.
Create the configuration file /etc/sg_persist.conf:
sg_persist_resource_MDRAID1() {
      devs="/dev/sdd /dev/sde"
      required_devs_nof=2
}
Run the following commands to create the primitive resource
sg_persist:
root # crm configure
crm(live)configure# primitive sg ocf:heartbeat:sg_persist \
    params config_file=/etc/sg_persist.conf \
       sg_persist_resource=MDRAID1 \
       reservation_type=1 \
    op monitor interval=60 timeout=60
Add the sg_persist primitive to a
master-slave group:
crm(live)configure# ms ms-sg sg \
    meta master-max=1 notify=true
Set the master on the alice server and the slave on the bob node:
crm(live)configure# location ms-sg-alice-loc ms-sg inf: alice
crm(live)configure# location ms-sg-bob-loc ms-sg 100: bob
Do some tests. When the resource is in master/slave status,
on the master server, you can mount and write on /dev/sdc1, while
on the slave server you cannot write.
In most cases you may want to use the above resource with the
Filesystem resource, for example, OCFS2. In this
case, you need to perform the following steps:
Add an OCFS2 primitive:
crm(live)configure# primitive ocfs2 ocf:heartbeat:Filesystem \
    params device="/dev/sdc1" directory="/mnt/ocfs2" fstype=ocfs2
Create a clone from a basegroup:
crm(live)configure# clone cl-group basegroup
Add relationship between ms-sg and
cl-group:
crm(live)configure# colocation ocfs2-group-on-ms-sg inf: cl-group ms-sg:Master
crm(live)configure# order ms-sg-before-ocfs2-group inf: ms-sg:promote cl-group
Check all your changes with the edit
command.
Commit your changes.
This section introduces sfex, an additional low-level
mechanism to lock access to shared storage exclusively to one node. Note
that sfex does not replace STONITH. Since sfex requires shared storage,
it is recommended that the external/sbd fencing
mechanism described above is used on another partition of the storage.
By design, sfex cannot be used in conjunction with workloads that require concurrency (such as OCFS2), but serves as a layer of protection for classic fail-over style workloads. This is similar to a SCSI-2 reservation in effect, but more general.
In a shared storage environment, a small partition of the storage is set aside for storing one or more locks.
Before acquiring protected resources, the node must first acquire the protecting lock. The ordering is enforced by Pacemaker, and the sfex component ensures that even if Pacemaker were subject to a split brain situation, the lock will never be granted more than once.
These locks must also be refreshed periodically, so that a node's death does not permanently block the lock and other nodes can proceed.
In the following, learn how to create a shared partition for use with sfex and how to configure a resource for the sfex lock in the CIB. A single sfex partition can hold any number of locks (the default is one) and needs 1 KB of storage space allocated per lock.
The shared partition for sfex should be on the same logical unit as the data you wish to protect.
The shared sfex partition must not make use of host-based RAID or DRBD.
Using a cLVM2 logical volume is possible.
Create a shared partition for use with sfex. Note the name of this
partition and use it as a substitute for
/dev/sfex below.
Create the sfex meta data with the following command:
root # sfex_init -n 1 /dev/sfex
Verify that the meta data has been created correctly:
root # sfex_stat -i 1 /dev/sfex ; echo $?
This should return 2, since the lock is not
currently held.
The sfex lock is represented via a resource in the CIB, configured as follows:
crm(live)configure# primitive sfex_1 ocf:heartbeat:sfex \
    params device="/dev/sfex" index="1" collision_timeout="1" \
      lock_timeout="70" monitor_interval="10" \
    op monitor interval="10s" timeout="30s" on_fail="fence"
To protect resources via a sfex lock, create mandatory ordering and
placement constraints between the protectees and the sfex resource. If
the resource to be protected has the id
filesystem1:
crm(live)configure# order order-sfex-1 inf: sfex_1 filesystem1
crm(live)configure# colocation colo-sfex-1 inf: filesystem1 sfex_1
If using group syntax, add the sfex resource as the first resource to the group:
crm(live)configure# group LAMP sfex_1 filesystem1 apache ipaddr
See http://www.linux-ha.org/wiki/SBD_Fencing and
man sbd.