OCFS, OCFS2, ASM, RAW Discussion
HEARTBEAT
# How does the disk heartbeat work?
Every node writes every two secs to its block in the heartbeat system file. The block offset is equal to its global node number. So node 0 writes to the first block, node 1 to the second, etc. All the nodes also read the heartbeat sysfile every two secs. As long as the timestamp is changing, that node is deemed alive.
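The heartbeat system file can be examined with debugfs.ocfs2, for example as below (an illustration only; the device name is a placeholder and the exact debugfs.ocfs2 command set depends on the ocfs2-tools version):
# debugfs.ocfs2 -R "stat //heartbeat" /dev/sdX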
# When is a node deemed dead?
An active node is deemed dead if it does not update its timestamp for O2CB_HEARTBEAT_THRESHOLD (default=7) loops. Once a node is deemed dead, the surviving node which manages to cluster lock the dead node's journal, recovers it by replaying the journal.
# What about self fencing?
A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs; with the default threshold of 7, that works out to 12 secs. The [o2hb-xx] kernel thread, after every timestamp write, sets a timer to panic the system after that duration. If the next timestamp is written within that duration, as it should be, it first cancels that timer before setting up a new one. This way it ensures the system will self fence if, for some reason, the [o2hb-xx] kernel thread is unable to update the timestamp and would thus be deemed dead by other nodes in the cluster.
# How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?
This parameter value can be changed by adding it to /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. This value should be the SAME on ALL the nodes in the cluster.
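For example, to set the threshold to 31, add the following line to /etc/sysconfig/o2cb on every node (shown as a sketch; the cluster must be restarted afterwards, as noted above, for the new value to take effect):
O2CB_HEARTBEAT_THRESHOLD=31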
# What should one set O2CB_HEARTBEAT_THRESHOLD to?
It should be set to the timeout value of the io layer. Most multipath solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs, set it to 31. For 120 secs, set it to 61.
O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)
# How does one check the current active O2CB_HEARTBEAT_THRESHOLD value?
# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
7
# What if a node umounts a volume?
During umount, the node will broadcast to all the nodes that have mounted that volume to drop it from their node maps. As the journal is shut down before this broadcast, any node crash after this point is ignored as there is no need for recovery.
# I encounter "Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing" whenever I run a heavy io load?
We have encountered a bug with the default CFQ io scheduler which causes a process doing heavy io to temporarily starve out other processes. While this is not fatal for most environments, it is for OCFS2 as we expect the hb thread to read/write the hb area at least once every 12 secs (default). A bug with the fix has been filed with Red Hat, which is expected to have this fixed in the RHEL4 U4 release. SLES9 SP3 2.6.5-7.257 includes this fix. For the latest, refer to the tracker bug filed on bugzilla. Until this issue is resolved, one is advised to use the DEADLINE io scheduler. To use it, add "elevator=deadline" to the kernel command line as follows:
* For SLES9, edit the command line in /boot/grub/menu.lst.
title Linux 2.6.5-7.244-bigsmp (with deadline)
kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5
vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off
initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp
* For RHEL4, edit the command line in /boot/grub/grub.conf:
title Red Hat Enterprise Linux AS (2.6.9-22.EL) (with deadline)
root (hd0,0)
kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off
initrd /initrd-2.6.9-22.EL.img
To see the current kernel command line, do:
# cat /proc/cmdline
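If the change took effect, the output will include "elevator=deadline", e.g. (illustrative output only; yours will differ):
ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off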
QUORUM AND FENCING
# What is a quorum?
A quorum is a designation given to a group of nodes in a cluster which are still allowed to operate on shared storage. It comes up when there is a failure in the cluster which breaks the nodes up into groups which can communicate in their groups and with the shared storage but not between groups.
# How do OCFS2's cluster services define a quorum?
The quorum decision is made by a single node based on the number of other nodes that are considered alive by heartbeating and the number of other nodes that are reachable via the network.
A node has quorum when:
* it sees an odd number of heartbeating nodes and has network connectivity to more than half of them.
OR,
* it sees an even number of heartbeating nodes and has network connectivity to at least half of them *and* has connectivity to the heartbeating node with the lowest node number.
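In practice this means that in a two-node cluster where the network interconnect fails but both nodes keep heartbeating to disk, the node with the lowest node number keeps quorum and the other node fences itself.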
# What is fencing?
Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it doesn't have quorum in a degraded cluster. It does this so that other nodes won't get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described above, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test.
# How does a node decide that it has connectivity with another?
When a node sees another come to life via heartbeating it will try and establish a TCP connection to that newly live node. It considers that other node connected as long as the TCP connection persists and the connection is not idle for 10 seconds. Once that TCP connection is closed or idle it will not be reestablished until heartbeat thinks the other node has died and come back alive.
# How long does the quorum process take?
First a node will realize that it doesn't have connectivity with another node. This can happen immediately if the connection is closed but can take a maximum of 10 seconds of idle time. Then the node must wait long enough to give heartbeating a chance to declare the node dead. It does this by waiting two iterations longer than the number of iterations needed to consider a node dead (see the Heartbeat section of this FAQ). The current default of 7 iterations of 2 seconds results in waiting for 9 iterations or 18 seconds. By default, then, a maximum of 28 seconds can pass from the time a network fault occurs until a node fences itself.
# How can one prevent a node from panicking when one shuts down the other node in a 2-node cluster?
This typically means that the network is shutting down before all the OCFS2 volumes have been umounted. Ensure the ocfs2 init script is enabled. This script ensures that the OCFS2 volumes are umounted before the network is shut down. To check whether the service is enabled, do:
# chkconfig --list ocfs2
ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off
# How does one list out the startup and shutdown ordering of the OCFS2 related services?
* To list the startup order for runlevel 3 on RHEL4, do:
# cd /etc/rc3.d
# ls S*ocfs2* S*o2cb* S*network*
S10network S24o2cb S25ocfs2
* To list the shutdown order on RHEL4, do:
# cd /etc/rc6.d
# ls K*ocfs2* K*o2cb* K*network*
K19ocfs2 K20o2cb K90network
* To list the startup order for runlevel 3 on SLES9, do:
# cd /etc/init.d/rc3.d
# ls S*ocfs2* S*o2cb* S*network*
S05network S07o2cb S08ocfs2
* To list the shutdown order on SLES9, do:
# cd /etc/init.d/rc3.d
# ls K*ocfs2* K*o2cb* K*network*
K14ocfs2 K15o2cb K17network
Please note that the default ordering in the ocfs2 scripts only includes the network service and not any shared-device specific service, like iscsi. If one is using iscsi or any shared device requiring a service to be started and shut down, please ensure that that service starts before and shuts down after the ocfs2 init service.
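For example, if the shared device is accessed via an init service named iscsi (the service name may differ on your distribution), the relative ordering can be checked the same way:
# cd /etc/rc3.d
# ls S*iscsi* S*o2cb* S*ocfs2*
The iscsi S-number should be lower than that of o2cb/ocfs2 (it starts earlier), and its K-number in the shutdown runlevel should be higher (it stops later).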
NOVELL SLES9
# Why are OCFS2 packages for SLES9 not made available on oss.oracle.com?
OCFS2 packages for SLES9 are available directly from Novell as part of the kernel. The same is true for the various Asianux distributions and for Ubuntu. As OCFS2 is now part of the mainline kernel, we expect more distributions to bundle the product with the kernel.
# What versions of OCFS2 are available with SLES9 and how do they match with the Red Hat versions available on oss.oracle.com?
As both Novell and Oracle ship OCFS2 on different schedules, the package versions do not match. We expect this to resolve itself over time as the number of patch fixes decreases. Novell is shipping two SLES9 releases, viz., SP2 and SP3.
* The latest kernel with the SP2 release is 2.6.5-7.202.7. It ships with OCFS2 1.0.8.
* The latest kernel with the SP3 release is 2.6.5-7.257. It ships with OCFS2 1.2.1.
RELEASE 1.2
# What is new in OCFS2 1.2?
OCFS2 1.2 has two new features:
* It is endian-safe. With this release, one can mount the same volume concurrently on x86, x86-64, ia64 and the big endian architectures ppc64 and s390x.
* It supports readonly mounts. The fs uses this feature to auto remount ro when encountering on-disk corruptions (instead of panicking).
# Do I need to re-make the volume when upgrading?
No. OCFS2 1.2 is fully on-disk compatible with 1.0.
# Do I need to upgrade anything else?
Yes, the tools need to be upgraded to ocfs2-tools 1.2. ocfs2-tools 1.0 will not work with OCFS2 1.2, nor will the 1.2 tools work with the 1.0 modules.
UPGRADE TO THE LATEST RELEASE
# How do I upgrade to the latest release?
* Download the latest ocfs2-tools and ocfs2console for the target platform and the appropriate ocfs2 module package for the kernel version, flavor and architecture. (For more, refer to the "Download and Install" section above.)
* Umount all OCFS2 volumes.
# umount -at ocfs2
* Shutdown the cluster and unload the modules.
# /etc/init.d/o2cb offline
# /etc/init.d/o2cb unload
* If required, upgrade the tools and console.
# rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm
* Upgrade the module.
# rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.2-1.i686.rpm
* Ensure init services ocfs2 and o2cb are enabled.
# chkconfig --add o2cb
# chkconfig --add ocfs2
* To check whether the services are enabled, do:
# chkconfig --list o2cb
o2cb 0:off 1:off 2:on 3:on 4:on 5:on 6:off
# chkconfig --list ocfs2
ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off
* At this stage one could either reboot the node or simply restart the cluster and mount the volumes.
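If not rebooting, the restart amounts to something like the following (a sketch; o2cb online may need the cluster name as an argument, and mount -at assumes the volumes are listed in /etc/fstab):
# /etc/init.d/o2cb online
# mount -at ocfs2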
# Can I do a rolling upgrade from 1.0.x/1.2.x to 1.2.2?
Rolling upgrade to 1.2.2 is not recommended. Shutdown the cluster on all nodes before upgrading the nodes.
# After upgrading, I am getting the following error on mount: "mount.ocfs2: Invalid argument while mounting /dev/sda6 on /ocfs".
Do "dmesg | tail". If you see the error:
ocfs2_parse_options:523 ERROR: Unrecognized mount option "heartbeat=local" or missing value
it means that you are trying to use the 1.2 tools and 1.0 modules. Ensure that you have unloaded the 1.0 modules and installed and loaded the 1.2 modules. Use modinfo to determine the version of the module installed and/or loaded.
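For example (assuming the module for the running kernel is installed as ocfs2):
# modinfo ocfs2 | grep -i version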
# The cluster fails to load. What do I do?
Check "demsg | tail" for any relevant errors. One common error is as follows:
SELinux: initialized (dev configfs, type configfs), not configured for labeling audit(1139964740.184:2): avc: denied { mount } for ...
The above error indicates that you have SELinux activated. A bug in SELinux does not allow configfs to mount. Disable SELinux by setting "SELINUX=disabled" in /etc/selinux/config. The change is activated on reboot.
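For example, one way to make the change (a sketch; back up the file first):
# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
# shutdown -r now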
Reply posted: 2006-08-31 22:02:14
PROCESSES
# List and describe all OCFS2 threads?
[o2net]
One per node. Is a workqueue thread started when the cluster is brought online and stopped when offline. It handles the network communication for all threads. It gets the list of active nodes from the o2hb thread and sets up tcp/ip communication channels with each active node. It sends regular keepalive packets to detect any interruption on the channels.
[user_dlm]
One per node. Is a workqueue thread started when dlmfs is loaded and stopped on unload. (dlmfs is an in-memory file system which allows user space processes to access the dlm in kernel to lock and unlock resources.) Handles lock downconverts when requested by other nodes.
[ocfs2_wq]
One per node. Is a workqueue thread started when ocfs2 module is loaded and stopped on unload. Handles blockable file system tasks like truncate log flush, orphan dir recovery and local alloc recovery, which involve taking dlm locks. Various code paths queue tasks to this thread. For example, ocfs2rec queues orphan dir recovery so that while the task is kicked off as part of recovery, its completion does not affect the recovery time.
[o2hb-14C29A7392]
One per heartbeat device. Is a kernel thread started when the heartbeat region is populated in configfs and stopped when it is removed. It writes every 2 secs to its block in the heartbeat region to indicate to the other nodes that this node is alive. It also reads the region to maintain a nodemap of live nodes. It notifies o2net and the dlm of any changes in the nodemap.
[ocfs2vote-0]
One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. It downgrades locks when requested by other nodes in response to blocking ASTs (BASTs). It also fixes up the dentry cache in response to files unlinked or renamed on other nodes.
[dlm_thread]
One per dlm domain. Is a kernel thread started when a dlm domain is created and stopped when destroyed. This is the core dlm which maintains the list of lock resources and handles the cluster locking infrastructure.
[dlm_reco_thread]
One per dlm domain. Is a kernel thread which handles dlm recovery whenever a node dies. If the node is the dlm recovery master, it remasters all the locks owned by the dead node.
[dlm_wq]
One per dlm domain. Is a workqueue thread. o2net queues dlm tasks on this thread.
[kjournald]
One per mount. Is used as OCFS2 uses JBD for journalling.
[ocfs2cmt-0]
One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. Works in conjunction with kjournald.
[ocfs2rec-0]
Is started whenever another node needs to be recovered. This could be either on mount when it discovers a dirty journal, or during operation when hb detects a dead node. ocfs2rec handles the file system recovery and it runs after the dlm has finished its recovery.
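To see these threads on a live node, something like the following works (the exact names vary; e.g., the o2hb thread name includes the heartbeat region UUID and the vote/commit threads carry a slot number):
# ps -e -o pid,comm | grep -E 'o2net|o2hb|user_dlm|ocfs2|dlm_|kjournald'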