Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions [ID 559365.1]
Modified 11-AUG-2010 Type PROBLEM Status PUBLISHED
Applies to:
Oracle Server - Standard Edition - Version: 10.1.0.5 to 11.2.0.1 - Release: 10.1 to 11.2
Oracle Server - Enterprise Edition - Version: 10.1.0.5 to 11.2.0.1.0 [Release: 10.1 to 11.2]
Linux x86
HP-UX PA-RISC (64-bit)
IBM AIX on POWER Systems (64-bit)
Oracle Solaris on SPARC (64-bit)
HP-UX Itanium
Red Hat Enterprise Linux Advanced Server x86-64 (AMD Opteron Architecture)
Red Hat Enterprise Linux Advanced Server Itanium
Oracle Solaris on x86-64 (64-bit)
Linux x86-64
UnitedLinux Itanium
Oracle Server Enterprise Edition - Version: 10.1.0.5 to 11.1.0.7
Oracle Clusterware
Symptoms
Oracle Clusterware evicts the node from the cluster when
- Node is not pinging via the network heartbeat
- Node is not pinging the Voting disk
- Node is hung/busy and is unable to perform either of the earlier tasks
In most cases when a node is evicted, information is written to the
logs that can be used to analyze the cause of the eviction. In certain
cases, however, this information may be missing. The steps documented in
this note are to be used for those cases where there is not enough
information, or no information, to diagnose the cause of the eviction for
Clusterware versions earlier than 11gR2 (11.2.0.1).
Starting with 11.2.0.1, customers do not need to set diagwait because the architecture has been changed.
Changes
None
Cause
When a node is evicted while it is extremely busy in terms of
CPU (or starved of it), it is possible that the OS did not get time to
flush the logs/traces to the file system. It may be useful to set the
diagwait attribute to delay the node reboot, giving the OS additional
time to write the traces. This setting provides more time for
diagnostic data to be collected safely and will NOT increase the
probability of corruption. After setting diagwait, the Clusterware will
wait an additional 10 seconds (diagwait - reboottime). Customers can
unset diagwait by following the steps documented below after fixing
their OS scheduling issues.
* -- Diagwait can be set on Windows, but it does not change the behaviour there as it does on Unix/Linux platforms
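The "additional 10 seconds" figure can be worked through as simple arithmetic. The sketch below assumes diagwait is set to 13, as in the Solution steps, and takes reboottime as 3 seconds; that second value is inferred from the note's own numbers (13 - 10) rather than read from a live cluster.

```shell
#!/bin/sh
# Illustrative values only. reboottime=3 is an assumption inferred from the
# note's "13" and "additional 10 seconds"; it is not queried from a cluster.
diagwait=13
reboottime=3

# Extra time the Clusterware waits before the reboot once diagwait is set.
extra_wait=$((diagwait - reboottime))
echo "additional wait: ${extra_wait} seconds"
```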
@ For internal Support Staff
The diagwait attribute was introduced in
10.2.0.3 and is included in 10.2.0.4, 11.1.0.6 and higher releases.
It has also subsequently been backported to 10.1.0.5 on most platforms.
This means it is possible to set diagwait on 10.1.0.5 (or
higher), 10.2.0.3 (or higher) and 11.1.0.6 (or higher). If the
command crsctl set/get css diagwait reports "unrecognized parameter diagwait specified", then
it can be safely assumed that the Clusterware version does not have the
necessary fixes to implement diagwait. In that case the customer
is advised to apply the latest available patchset before attempting to
set diagwait.
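The version check described above could be scripted roughly as follows. This is only a sketch: diagwait_supported is a hypothetical helper, not part of crsctl, and it assumes the error text appears in the captured output exactly as quoted in this note.

```shell
#!/bin/sh
# Hypothetical helper (not a crsctl feature): given the captured output of
# "crsctl get css diagwait", report whether this Clusterware version
# recognizes the diagwait parameter.
diagwait_supported() {
    case "$1" in
        *"unrecognized parameter diagwait specified"*) return 1 ;;
        *) return 0 ;;
    esac
}

# Sample string copied from the message quoted in this note.
sample='unrecognized parameter diagwait specified'
if diagwait_supported "$sample"; then
    echo "diagwait is available"
else
    echo "diagwait not recognized; apply the latest patchset first"
fi
```

On a cluster node the helper would be fed real output, for example `diagwait_supported "$(crsctl get css diagwait 2>&1)"`, run as root.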
Solution
It is important that the Clusterware stack is down on all the
nodes when changing diagwait. The following steps provide
step-by-step instructions for setting diagwait.
- Execute as root
#crsctl stop crs
#/bin/oprocd stop
- Ensure that Clusterware stack is down on all nodes by executing
#ps -ef |egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
This should return no processes. If there are Clusterware processes running
and you proceed to the next step, you will corrupt your OCR. Do not
continue until the Clusterware processes are down on all the nodes of
the cluster.
- From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the command as root:
#crsctl set css diagwait 13 -force
- Check that diagwait was set successfully by executing the
following command. The command should return 13. If diagwait is not set,
the following message will be returned: "Configuration parameter
diagwait is not defined"
#crsctl get css diagwait
- Restart the Oracle Clusterware on all the nodes by executing:
#crsctl start crs
- Validate that the node is running by executing:
#crsctl check crs
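The two checks in the steps above (no Clusterware processes running, and diagwait reading back as 13) can be sketched as small shell helpers. Both function names are hypothetical, the sample process listing is made up for illustration (including the /u01/crs path), and the sketch assumes crsctl prints the bare value 13 on success, which may not hold exactly on every version.

```shell
#!/bin/sh
# Hypothetical helpers; neither is part of crsctl.

# Read a "ps -ef" listing on stdin and succeed only if it shows no
# Clusterware daemons. Lines for the grep/egrep search itself are ignored.
stack_is_down() {
    [ -z "$(grep -E 'crsd\.bin|ocssd\.bin|evmd\.bin|oprocd' | grep -v 'grep')" ]
}

# Interpret the captured output of "crsctl get css diagwait".
diagwait_status() {
    case "$1" in
        *"Configuration parameter diagwait is not defined"*) echo "not set" ;;
        13) echo "set to 13" ;;
        *) echo "unexpected output: $1" ;;
    esac
}

# Made-up sample listing: one live daemon plus the egrep command itself.
sample='root 4021 1 0 10:02 ? 00:00:03 /u01/crs/bin/ocssd.bin
root 5120 4990 0 10:05 pts/0 00:00:00 egrep crsd.bin|ocssd.bin|evmd.bin|oprocd'

if printf '%s\n' "$sample" | stack_is_down; then
    echo "stack is down on this node"
else
    echo "Clusterware processes still running; do not continue"
fi

diagwait_status "13"
```

On a real node the helpers would be used as `ps -ef | stack_is_down` (repeated on every node of the cluster) and `diagwait_status "$(crsctl get css diagwait 2>&1)"`.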
Unsetting/Removing diagwait
Customers should not unset diagwait without fixing the OS scheduling
issues, as that can lead to node evictions via reboot. Diagwait delays the
node eviction (and reconfiguration) by diagwait (13) seconds and, as such,
setting diagwait does not affect most customers. If there is a need
to remove diagwait, follow the steps above, replacing the third step
(crsctl set css diagwait 13 -force) with the following command:
#crsctl unset css diagwait -f
(Note: the -f option must be used when unsetting diagwait since CRS will be down when doing so)
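To make the difference between the set and unset procedures explicit, the sketch below prints (without running anything) the command sequence for either mode. The function name print_diagwait_plan is hypothetical; the commands themselves are copied verbatim from the steps in this note.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the sequence from this note for
# mode "set" or "unset". It does not execute any of the commands.
print_diagwait_plan() {
    case "$1" in
        set)   change_cmd='crsctl set css diagwait 13 -force' ;;
        unset) change_cmd='crsctl unset css diagwait -f' ;;
        *)     echo "usage: print_diagwait_plan set|unset" >&2; return 1 ;;
    esac
    echo "# as root, on every node:"
    echo "crsctl stop crs"
    echo "/bin/oprocd stop"
    echo "ps -ef |egrep \"crsd.bin|ocssd.bin|evmd.bin|oprocd\""
    echo "# as root, from one node only, once the stack is down everywhere:"
    echo "$change_cmd"
    echo "crsctl get css diagwait"
    echo "# as root, on every node:"
    echo "crsctl start crs"
    echo "crsctl check crs"
}

print_diagwait_plan unset
```

Only the one line that changes the setting differs between the two modes; the stop, verify, and restart steps are identical.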