Large numbers of racgmain processes consuming resources - wqqzlm - ChinaUnix blog

Category: Oracle

2010-04-15 09:37:29

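This morning one of our database servers was behaving abnormally: logins were slow and system idle was at 0%, with a large number of racgmain processes consuming the resources.
   Temporary workaround:
1. Check OS resource usage
      Before rebooting the system, I checked resource usage and found a large number of racgmain processes consuming resources.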
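
The post does not show which command was used to spot the racgmain processes. As a minimal sketch, assuming a Linux host where `ps -eo comm=` prints one command name per line, the count per process name could be obtained like this (the helper name `count_processes_by_name` is made up for illustration):

```python
from collections import Counter


def count_processes_by_name(ps_output: str) -> Counter:
    """Count processes per command name.

    Expects the text produced by `ps -eo comm=` (one command name per line,
    no header). Blank lines are ignored.
    """
    names = (line.strip() for line in ps_output.splitlines())
    return Counter(name for name in names if name)


# Possible use on a live host (assumption, not taken from the original post):
#   import subprocess
#   out = subprocess.run(["ps", "-eo", "comm="],
#                        capture_output=True, text=True, check=True).stdout
#   print(count_processes_by_name(out).most_common(5))
```

A process name that dominates the top of `most_common()` is a good candidate for the kind of runaway seen here.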

2. Check the database status
[oracle@ra1 ~]$ sqlplus /nolog
SQL*Plus: Release 10.2.0.1.0 - Production on Mon Jun 15 08:39:04 2009
Copyright (c) 1982, , . All rights reserved.
SQL> conn / as sysdba
Connected.
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE    10.2.0.1.0      Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

3. Check the CRS processes
[oracle@ra1 ~]$ ps -ef|grep crs
root      3241     1 0 08:35 ?        00:00:00 /bin/su -l oracle -c sh -c 'ulimit -c unlimited; cd /u01/app/oracle/product/crs/log/ra1/evmd; exec /u01/app/oracle/product/crs/bin/evmd '
oracle    4787 3241 0 08:36 ?        00:00:00 /u01/app/oracle/product/crs/bin/evmd.bin
root      4892 4774 0 08:36 ?        00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /u01/app/oracle/product/crs/log/ra1/cssd; /u01/app/oracle/product/crs/bin/ocssd || exit $?'
oracle    4893 4892 0 08:36 ?        00:00:00 /bin/sh -c ulimit -c unlimited; cd /u01/app/oracle/product/crs/log/ra1/cssd; /u01/app/oracle/product/crs/bin/ocssd || exit $?
oracle    4918 4893 0 08:36 ?        00:00:01 /u01/app/oracle/product/crs/bin/ocssd.bin
oracle    5189 4787 0 08:36 ?        00:00:00 /u01/app/oracle/product/crs/bin/evmlogger.bin -o /u01/app/oracle/product/crs/evm/log/evmlogger.info -l /u01/app/oracle/product/crs/evm/log/evmlogger.log
oracle    6186     1 0 08:36 ?        00:00:00 /u01/app/oracle/product/crs/opmn/bin/ons -d
oracle    6187 6186 0 08:36 ?        00:00:00 /u01/app/oracle/product/crs/opmn/bin/ons -d
root     19744     1 0 08:48 ?        00:00:00 /u01/app/oracle/product/crs/bin/crsd.bin restart
oracle    8784 9729 0 09:01 pts/1    00:00:00 grep crs

Preliminary conclusion: the abnormal resource usage is caused by CRS.

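4. Stop the CRS resources
This includes the CSS daemon, the CRS daemon and the resources it manages (database, listener, node applications), the EVM daemon, and so on.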
[root@ra1 ~]# cd /u01/app/oracle/product/crs/bin
[root@ra1 bin]# ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

[root@ra1 bin]# ./crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.

[root@ra1 bin]# ./crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
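
The two `crsctl check crs` outputs above can also be told apart by a script. The following is a rough sketch that relies only on the message wording shown in this post (10.2-style output); other Clusterware versions print different text, and the function name `crs_stack_status` is hypothetical:

```python
def crs_stack_status(check_output: str) -> dict:
    """Map each daemon (CSS, CRS, EVM) to True if it is reported healthy.

    Only a line of the exact form "<DAEMON> appears healthy" counts as
    healthy; anything else (errors, missing lines) leaves the value False.
    """
    status = {"CSS": False, "CRS": False, "EVM": False}
    for line in check_output.splitlines():
        line = line.strip()
        for daemon in status:
            if line == f"{daemon} appears healthy":
                status[daemon] = True
    return status
```

After `crsctl stop crs`, all three values are expected to be False, which matches the second check above.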

5. Check the processes (no ora_ background processes should remain)
[root@ra1 bin]# ps -ef|grep ora_
root 23490 9148 0 09:11 pts/1 00:00:00 grep ora_
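
In the output above the only match is the `grep` command itself, so no Oracle background process is left. A small sketch of the same check, assuming the standard eight-column `ps -ef` layout and excluding the `grep` line automatically (the function name and the sample instance name in the comment are illustrative only):

```python
def oracle_background_processes(ps_ef_output: str) -> list:
    """Return the commands of `ps -ef` lines whose command starts with 'ora_'.

    Columns of `ps -ef`: UID PID PPID C STIME TTY TIME CMD. The command is
    the eighth column and may itself contain spaces, hence maxsplit=7.
    """
    found = []
    for line in ps_ef_output.splitlines():
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].startswith("ora_"):
            found.append(fields[7])
    return found


# A line such as "oracle 5000 1 0 08:36 ? 00:00:01 ora_pmon_r1" (hypothetical
# instance name) would be reported; "grep ora_" would not.
```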

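6. Switch CRS to manual start (optional, depending on your situation)
The CRS service is registered in the host's boot scripts and starts automatically, so it has to be switched to manual start by hand. In this case we only needed the applications running on this server and no longer needed the database, so changing the default was acceptable. In most production environments, decide according to the actual situation.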
[root@ra1 ~]# cd /u01/app/oracle/product/crs/bin
[root@ra1 bin]# /u01/app/oracle/product/crs/bin/crsctl disable crs
[root@ra1 bin]# more /etc/oracle/scls_scr/ra1/root/crsstart
disable
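
The `crsstart` file shown above holds the autostart setting as a single word. Below is a hedged sketch for reading it from a script. It assumes the only two valid values are `disable` (as shown above) and `enable` (assumed to be what `crsctl enable crs` writes; the post does not show it), and the function name is made up:

```python
def crs_autostart_enabled(crsstart_content: str) -> bool:
    """Interpret the content of the crsstart file.

    Returns True for "enable", False for "disable". Any other content raises
    ValueError rather than guessing.
    """
    value = crsstart_content.strip()
    if value == "enable":
        return True
    if value == "disable":
        return False
    raise ValueError(f"unexpected crsstart content: {value!r}")


# Possible use with the path from this post (node name ra1):
#   with open("/etc/oracle/scls_scr/ra1/root/crsstart") as f:
#       print(crs_autostart_enabled(f.read()))
```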

At this point system resources returned to normal. Looking further for the cause:
Metalink information: Bug No. 7235094

PROBLEM:
--------
racgimon has a file handle leak on the healthcheck file. At the customer's site, ServiceGuard detected Split Brain and then a node was bounced. At that time, "ORA-27301: OS failure message: File table overflow" was recorded in alert.log. Also, "glance" showed that racgimon had more than 26,000 file handles open. The racgimon process had been started around 20 days earlier (14th Jun). Due to the handle leak in racgimon, the operating system was exhausting the kernel limit for maximum open files ("nfile" on HP-UX).

DIAGNOSTIC ANALYSIS:
--------------------
"$ORACLE_HOME/log/<NodeName>/racg/imon_<InstanceName>.log"
During the handle leak, the racgimon log recorded the following error every 60 seconds (the health check interval).
- imon_r1024.log
2008-07-04 16:16:24.707: [RACG][20] [25433][20][ora.r1024.r10241.inst]:
GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
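
To see how often the health check failed, the imon log can be scanned for GIM-00104 entries. This is only a sketch based on the single excerpt above: it assumes each entry starts with a `YYYY-MM-DD HH:MM:SS.mmm:` timestamp and that the GIM-00104 message follows on the same or a later line. The function name is hypothetical.

```python
import re

# Timestamp prefix as seen in the excerpt, e.g. "2008-07-04 16:16:24.707:"
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+:")


def health_check_failures(log_text: str) -> list:
    """Return the timestamps of log entries that report GIM-00104."""
    failures = []
    last_ts = None
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line.strip())
        if match:
            last_ts = match.group(1)
        if "GIM-00104" in line and last_ts is not None:
            failures.append(last_ts)
            last_ts = None  # count each entry at most once
    return failures
```

Failures spaced about 60 seconds apart would be consistent with the health check interval mentioned in the bug text.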

The error recorded in imon_r1024.log above seems to be the same as Bug:6931689. On the other hand, Bug:6989661 explains that a looping error in racgimon can result in opened files not being closed. So I guess racgimon was looping on the error due to Bug:6931689, and the error loop then caused the handle leak. In the end it exceeded "nfile" on HP-UX, and ServiceGuard, Oracle, and other applications could not run normally.
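
The bug report used "glance" on HP-UX to see the number of open file handles. On a Linux host such as the one in this post, a comparable number can be read from `/proc/<pid>/fd`. The sketch below is an assumption-laden illustration, not part of the bug report: it needs Linux and sufficient privileges, and the helper name is made up.

```python
import os


def open_fd_count(pid):
    """Return the number of open file descriptors of `pid` on Linux.

    Returns None when /proc/<pid>/fd cannot be read (process gone, no
    permission, or a platform without this /proc layout such as HP-UX).
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


# Possible use: sample the count for a racgimon PID at intervals; a value
# that only ever grows would suggest the kind of leak described above.
```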

WORKAROUND:
-----------
Kill racgimon from time to time.

RELATED BUGS:
-------------
Bug:6989661
Bug:6931689

References:
Metalink:
Bug No. 7235094
Filed: 04-JUL-2008    Updated: 08-JUL-2008
Product: Oracle - Enterprise Edition    Product Version: 10.2.0.4
Platform: HP-UX Itanium    Platform Version: No
Database Version: 10.2.0.4    Affects Platforms: Port-Specific
Severity: Severe Loss of Service    Status: Duplicate Bug. To Filer
Base Bug: 6931689    Fixed in Product Version: No Data
