Category: Oracle
2010-04-15 09:37:29
3. Check the CRS processes
[oracle@ra1 ~]$ ps -ef | grep crs
root 3241 1 0 08:35 ? 00:00:00 /bin/su -l oracle -c sh -c 'ulimit -c unlimited; cd /u01/app/oracle/product/crs/log/ra1/evmd; exec /u01/app/oracle/product/crs/bin/evmd '
oracle 4787 3241 0 08:36 ? 00:00:00 /u01/app/oracle/product/crs/bin/evmd.bin
root 4892 4774 0 08:36 ? 00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /u01/app/oracle/product/crs/log/ra1/cssd; /u01/app/oracle/product/crs/bin/ocssd || exit $?'
oracle 4893 4892 0 08:36 ? 00:00:00 /bin/sh -c ulimit -c unlimited; cd /u01/app/oracle/product/crs/log/ra1/cssd; /u01/app/oracle/product/crs/bin/ocssd || exit $?
oracle 4918 4893 0 08:36 ? 00:00:01 /u01/app/oracle/product/crs/bin/ocssd.bin
oracle 5189 4787 0 08:36 ? 00:00:00 /u01/app/oracle/product/crs/bin/evmlogger.bin -o /u01/app/oracle/product/crs/evm/log/evmlogger.info -l /u01/app/oracle/product/crs/evm/log/evmlogger.log
oracle 6186 1 0 08:36 ? 00:00:00 /u01/app/oracle/product/crs/opmn/bin/ons -d
oracle 6187 6186 0 08:36 ? 00:00:00 /u01/app/oracle/product/crs/opmn/bin/ons -d
root 19744 1 0 08:48 ? 00:00:00 /u01/app/oracle/product/crs/bin/crsd.bin restart
oracle 8784 9729 0 09:01 pts/1 00:00:00 grep crs
Preliminary conclusion: the abnormal system resource usage is caused by CRS.
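The listing above mixes wrapper shells, the `grep` itself, and the actual daemon binaries. As a minimal sketch, the hypothetical helper below (`running_crs_daemons` is not an Oracle tool) reduces `ps -ef` output to the three daemon binaries that appear in this transcript. It assumes a `grep` that supports `-o`, as GNU and BSD grep do.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints which of
# the CRS daemon binaries seen in the transcript (crsd.bin, ocssd.bin,
# evmd.bin) are present, one per line, without duplicates.
running_crs_daemons() {
    # -o prints only the matched text; evmlogger.bin does not match.
    grep -o -E '(crsd|ocssd|evmd)\.bin' | sort -u
}
```

It could be used as `ps -ef | running_crs_daemons`; an empty result would suggest that none of the three daemons is running.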
4. Stop the CRS resources
This includes the CSS process, the CRS process (database, listener, node), the EVM process, and so on.
[root@ra1 ~]# cd /u01/app/oracle/product/crs/bin
[root@ra1 bin]# ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
[root@ra1 bin]# ./crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
[root@ra1 bin]# ./crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
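The before/after comparison of `crsctl check crs` can also be made in a script. The sketch below is a hypothetical helper (`crs_all_healthy` is an illustrative name) that only looks for the three "appears healthy" lines shown in the transcript above. Other clusterware versions may word their messages differently.

```shell
# Hypothetical helper: reads `crsctl check crs` output on stdin and
# succeeds only if CSS, CRS, and EVM each report "appears healthy",
# as in the transcript above.
crs_all_healthy() {
    output=$(cat)
    for daemon in CSS CRS EVM; do
        printf '%s\n' "$output" | grep -q "^$daemon appears healthy" || return 1
    done
    return 0
}
```

For example, `./crsctl check crs | crs_all_healthy || echo "clusterware is down"` would print the message after the `crsctl stop crs` step.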
5. Check the processes
[root@ra1 bin]# ps -ef|grep ora_
root 23490 9148 0 09:11 pts/1 00:00:00 grep ora_
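6. Switch CRS to manual startup (optional, depending on your situation)
The CRS service is registered in the host's boot scripts and starts automatically on reboot, so it has to be switched to manual startup by hand. In this case we only need the applications on the server and no longer need the database, so changing the default is acceptable. In most production environments, decide based on the actual situation.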
[root@ra1 ~]# cd /u01/app/oracle/product/crs/bin
[root@ra1 bin]# /u01/app/oracle/product/crs/bin/crsctl disable crs
[root@ra1 bin]# more /etc/oracle/scls_scr/ra1/root/crsstart
disable
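To confirm the setting without paging through the file, a small sketch like the following could be used. `crs_autostart_state` is a hypothetical name. The default path is the one shown above for node ra1, and it will differ on other hosts.

```shell
# Hypothetical helper: prints the autostart flag stored in a crsstart file.
# The default path is the one from the transcript (node ra1); pass another
# path as the first argument if your node name differs.
crs_autostart_state() {
    file=${1:-/etc/oracle/scls_scr/ra1/root/crsstart}
    # Report "unknown" when the file is missing or unreadable.
    [ -r "$file" ] || { echo "unknown"; return 1; }
    # Use the first line only and strip surrounding whitespace.
    head -n 1 "$file" | tr -d '[:space:]'
    echo
}
```

After `crsctl disable crs`, calling `crs_autostart_state` on this host would be expected to print `disable`, matching the `more` output above.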
At this point system resources returned to normal. Looking further for the cause:
MetaLink information: Bug No. 7235094
PROBLEM:
--------
racgimon has a file handle leak on the healthcheck file. At the customer's site, ServiceGuard detected a split brain and a node was then bounced. At that time, "ORA-27301: OS failure message: File table overflow" was recorded in alert.log. Also, "glance" showed that racgimon had more than 26,000 file handles open. The racgimon process had been started around 20 days earlier (14th Jun). Because of the handle leak in racgimon, the operating system was exhausting the kernel limit for maximum open files ("nfile" on HP-UX).
DIAGNOSTIC ANALYSIS:
--------------------
"$ORACLE_HOME/log/<NodeName>/racg/imon_<InstanceName>.log"
During the handle leak, the racgimon log recorded the following error every 60 seconds (the health check interval).
- imon_r1024.log
2008-07-04 16:16:24.707: [RACG][20] [25433][20][ora.r1024.r10241.inst]:
GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
The error recorded in imon_r1024.log above seems to be the same as Bug:6931689. On the other hand, Bug:6989661 explains that a looping error in racgimon can result in opened files not being closed. So I guess racgimon was looping on the error from Bug:6931689, and the looping error then caused the handle leak. In the end, it exceeded "nfile" on HP-UX, and ServiceGuard, Oracle, and other applications could not run normally.
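Since the bug note ties the leak to a health check failure repeating every 60 seconds, counting `GIM-00104` entries in the imon log gives a rough signal of whether an instance is affected. The following is a minimal sketch. `count_healthcheck_failures` is a hypothetical helper, and the log path must be supplied by the caller.

```shell
# Hypothetical helper: prints how many GIM-00104 health check failures
# appear in the given imon log file.
count_healthcheck_failures() {
    # Print 0 for a missing or unreadable log instead of a grep error.
    [ -r "$1" ] || { echo 0; return 1; }
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so its exit status is deliberately ignored.
    grep -c 'GIM-00104' "$1" || true
}
```

A count that grows by about one per minute would be consistent with the looping behaviour described in the bug note.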
WORKAROUND:
-----------
Kill racgimon from time to time.
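Rather than killing racgimon on a fixed schedule, one could first check how many descriptors it actually holds. The sketch below assumes a Linux host, where `/proc/<pid>/fd` lists a process's open descriptors. It does not apply to HP-UX, where the bug was filed. The function names, the `pgrep -x racgimon` match, and the 10000 limit are illustrative assumptions, not Oracle recommendations. Run it as root or as the process owner so the `fd` directories are readable.

```shell
# Count the open file descriptors of a PID by listing /proc/<pid>/fd
# (Linux only). The optional second argument overrides the proc root,
# which is mainly useful for testing.
count_open_fds() {
    fd_dir="${2:-/proc}/$1/fd"
    [ -d "$fd_dir" ] || { echo 0; return 1; }
    ls -1 "$fd_dir" | wc -l | tr -d '[:space:]'
    echo
}

# Print the PIDs of racgimon processes holding more descriptors than the
# limit. Assumption: the process name is exactly "racgimon".
leaking_racgimon_pids() {
    limit=${1:-10000}   # arbitrary illustrative threshold
    for pid in $(pgrep -x racgimon); do
        n=$(count_open_fds "$pid") || continue
        [ "$n" -gt "$limit" ] && echo "$pid"
    done
    return 0
}
```

The PIDs printed by `leaking_racgimon_pids` could then be reviewed by hand before killing anything.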
RELATED BUGS:
-------------
Bug:6989661
Bug:6931689
References:
MetaLink:
Bug No. 7235094
Filed: 04-JUL-2008
Updated: 08-JUL-2008
Product: Oracle - Enterprise Edition
Product Version: 10.2.0.4
Platform: HP-UX Itanium
Platform Version: No
Database Version: 10.2.0.4
Affects Platforms: Port-Specific
Severity: Severe Loss of Service
Status: Duplicate Bug. To Filer
Base Bug: 6931689
Fixed in Product Version: No Data