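Problem description:
During a routine health check of a customer's RAC database, I found a large number of errors in the alert log, as shown below: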
Tue Dec 29 10:33:17 2009
Errors in file /oracle/app/admin/BILL/udump/bill2_ora_12230.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 42, # blocks 1) of control file
ORA-00202: control file: '/dev/datavg1/rcontrol3'
ORA-29701: unable to connect to Cluster Manager
ORA-00206: error in writing (block 42, # blocks 1) of control file
ORA-00202: control file: '/dev/datavg1/rcontrol2'
ORA-29701: unable to connect to Cluster Manager
ORA-00206: error in writing (block 42, # blocks 1) of control file
ORA-00202: control file: '/dev/datavg1/rcontrol1'
ORA-29701: unable to connect to Cluster Manager
Contents of the corresponding trace file:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /oracle/app/product/RAC10g
System name: HP-UX
Node name: M-BILL2
Release: B.11.31
Version: U
Machine: ia64
Instance name: BILL2
Redo thread mounted by this instance: 2
Oracle process number: 46
Unix process pid: 12230, image: (TNS V1-V3)
*** ACTION NAME:(0000001 FINISHED70) 2009-12-29 10:33:16.050
*** MODULE NAME:( (TNS V1-V3)) 2009-12-29 10:33:16.050
*** SERVICE NAME:(SYS$USERS) 2009-12-29 10:33:16.050
*** SESSION ID:(704.12066) 2009-12-29 10:33:16.050
clsc_connect: (60000000000fef00) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_m-bill2_crs_bill))
2009-12-29 10:33:16.051: [ CSSCLNT]clsssInitNative: connect failed, rc 9
kgxgncin: CLSS init failed with status 3
kjfmsgr: unable to connect to NM for reg in shared group
ORA-00206: error in writing (block 1633, # blocks 1) of control file
ORA-00202: control file: '/dev/datavg1/rcontrol3'
ORA-29701: unable to connect to Cluster Manager
ORA-00206: error in writing (block 1633, # blocks 1) of control file
ORA-00202: control file: '/dev/datavg1/rcontrol2'
ORA-29701: unable to connect to Cluster Manager
ORA-00206: error in writing (block 1633, # blocks 1) of control file
ORA-00202: control file: '/dev/datavg1/rcontrol1'
ORA-29701: unable to connect to Cluster Manager
Root cause analysis:
Judging from this line in the trace:
*** MODULE NAME:( (TNS V1-V3)) 2009-12-29 10:33:16.050
the error most likely occurred while an RMAN backup was running.
A check confirmed that the directory
/usr/netvault/scripts
does contain a script that backs up the archived logs on a schedule.
Running that script reproduces the error:
RMAN> connect target /
RMAN-06900: WARNING: unable to generate V$RMAN_STATUS or V$RMAN_OUTPUT row
RMAN-06901: WARNING: disabling update of the V$RMAN_STATUS and V$RMAN_OUTPUT rows
ORACLE error from target database:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 1634, # blocks 1) of control file
ORA-00202: control file: '/dev/datavg1/rcontrol3'
ORA-29701: unable to connect to Cluster Manager
ORA-00206: error in writing (block 1634, # blocks 1) of control file
ORA-00202: control file: '/dev/datavg1/rcontrol2'
ORA-29701: unable to connect to Cluster Manager
ORA-00206: error in writing (block 1634, # blocks 1) of control file
ORA-00202: control file: '/dev/datavg1/rcontrol1'
ORA-29701: unable to connect to Cluster Manager
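To gauge how often the clusterware error appears in a log and which control files it touches, the log can be summarized with a few lines of shell. This is only a minimal sketch: `summarize_ora29701` is a hypothetical helper (not an Oracle tool), and it assumes nothing beyond the ORA-00202 and ORA-29701 message formats shown in the excerpts above.

```shell
#!/bin/sh
# Hypothetical helper: summarize ORA-29701 occurrences in a log or trace file.
# Argument: path to the file to scan.
summarize_ora29701() {
    log=$1
    # Count the lines that report the Cluster Manager connection failure.
    count=$(grep -c 'ORA-29701' "$log" || true)
    echo "ORA-29701 lines: $count"
    # Print each distinct control file named in an ORA-00202 line.
    grep 'ORA-00202' "$log" | sed "s/.*control file: '\(.*\)'.*/\1/" | sort -u
}

# Example call against the trace file named in the alert log:
# summarize_ora29701 /oracle/app/admin/BILL/udump/bill2_ora_12230.trc
```

If every control file copy is listed and each ORA-00206 is paired with ORA-29701, the write failure is probably not a storage problem on one device but a clusterware connectivity problem, which is what the rest of the analysis pursues.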
The defining feature of this error is:
ORA-29701: unable to connect to Cluster Manager
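See the MetaLink note:
Unable To Connect To Cluster Manager Ora-29701 [ID 391790.1].
Comparing the two nodes showed that /tmp/.oracle on this node was missing the following files, which are present on the other node: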
srwxrwxrwx 1 oracle10 dba 0 Dec 8 12:43 sAm-bill1_crs_bill_evm
srwxrwxrwx 1 root root 0 Dec 8 12:43 sCRSD_UI_SOCKET
srwxrwxrwx 1 oracle10 dba 0 Dec 8 12:43 sCm-bill1_crs_bill_evm
srwxrwxrwx 1 oracle10 dba 0 Dec 8 12:43 sOCSSD_LL_m-bill1_crs_bill
srwxrwxrwx 1 oracle10 dba 0 Dec 8 12:43 sOracle_CSS_LclLstnr_crs_bill_1
srwxrwxrwx 1 oracle10 dba 0 Dec 8 12:43 sSYSTEM.evm.acceptor.auth
srwxrwxrwx 1 root root 0 Dec 8 12:43 sm-bill1DBG_CRSD
srwxrwxrwx 1 oracle10 dba 0 Dec 8 12:43 sm-bill1DBG_CSSD
srwxrwxrwx 1 oracle10 dba 0 Dec 8 12:43 sm-bill1DBG_EVMD
srwxrwxrwx 1 root root 0 Dec 8 12:43 sora_crsqs
srwxrwxrwx 1 root root 0 Dec 8 12:43 sprocr_local_conn_0_PROC
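Note: these files usually live in /var/tmp/.oracle, but on some operating systems they are placed in /tmp/.oracle.
They are special socket files that local clients use to connect over the IPC protocol to, for example, the TNS listener, the CSS, CRS, and EVM daemons, or even database and ASM instances.
This matches the information in the trace file: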
clsc_connect: (60000000000fef00) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_m-bill2_crs_bill))
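A quick way to test for this condition is to check whether the socket behind a given IPC key exists. The sketch below is an illustration only: `check_ipc_socket` is a hypothetical helper, and the rule that the socket file name is the key with an `s` prefix is an assumption inferred from the listing above (for example `sOCSSD_LL_m-bill1_crs_bill`), not a documented interface.

```shell
#!/bin/sh
# Hypothetical helper: report whether the socket file for an IPC key exists.
# Arguments: socket directory, IPC key as printed in the trace (KEY=...).
check_ipc_socket() {
    dir=$1
    key=$2
    # Assumption: the file name is the key prefixed with "s".
    sock="$dir/s$key"
    if [ -S "$sock" ]; then
        echo "present: $sock"
        return 0
    fi
    echo "missing: $sock"
    return 1
}

# Example call with the key from the trace line above:
# check_ipc_socket /tmp/.oracle OCSSD_LL_m-bill2_crs_bill
```

The `-S` test succeeds only for a socket, so a regular file with the same name is still reported as missing.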
Solution:
The only way to recreate these files is to restart CRS.
Log in as root:
# $ORA_CRS_HOME/bin/crsctl stop crs
# $ORA_CRS_HOME/bin/crsctl start crs
The following error may appear while CRS is stopping:
Stopping resources. This could take several minutes.
Error while stopping resources. Possible cause: CRSD is down.
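This happens because CRSD cannot be reached. The CRS background processes are still running, however, so the system has to be rebooted.
Killing the ocssd.bin process likewise causes the node to reboot.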
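Before deciding on a reboot, it can help to see which clusterware daemons are in fact still alive. A minimal sketch, assuming the usual 10g daemon image names `ocssd.bin`, `crsd.bin` and `evmd.bin`; `filter_crs_daemons` is a hypothetical helper that only filters `ps` output and never signals any process.

```shell
#!/bin/sh
# Hypothetical filter: keep only the ps lines that belong to clusterware daemons.
# Reads ps output on stdin; exits non-zero when no daemon line is found.
filter_crs_daemons() {
    grep -E '(ocssd|crsd|evmd)\.bin'
}

# Usage: any output after "crsctl stop crs" means daemons are still running.
# ps -ef | filter_crs_daemons
```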
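Summary:
If /tmp or /var/tmp is cleaned out periodically, take care not to touch the hidden .oracle directory inside it.
Follow-up experiment:
In a healthy RAC environment on Linux, I deleted the /var/tmp/.oracle directory on one node. Checking CRS on that node then reported errors: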
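One way to make a cleanup job safe is to prune .oracle before anything else is matched. The following is a minimal sketch rather than a drop-in job: `list_stale_tmp_files` is a hypothetical helper, the 7-day threshold in the example is arbitrary, and it only lists candidates so the result can be reviewed before anything is deleted.

```shell
#!/bin/sh
# Hypothetical helper: list regular files older than N days under a temp
# directory, skipping the hidden .oracle directory and everything inside it.
# Arguments: directory to scan, age threshold in days.
list_stale_tmp_files() {
    dir=$1
    days=$2
    # -prune stops find from descending into .oracle; the -o branch prints
    # only regular files whose modification time is more than $days days old.
    find "$dir" -name .oracle -prune -o -type f -mtime +"$days" -print
}

# Example: review the candidates first, then feed them to a delete step.
# list_stale_tmp_files /var/tmp 7
```

Restricting the match to `-type f` adds a second safeguard, since the clusterware files are sockets rather than regular files.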
$crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
$crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
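Checked from the other node, however, all CRS services showed a normal status, and the instance on the failed node kept working normally.
The CRS logs showed the failed node continuously reporting errors like the following: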
2009-12-24 01:50:13.575: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9
2009-12-24 01:50:13.576: [ RACG][1] [29076][1][ora.m-bill2.gsd]: clsrccssgetctx: clsssinit() failed. rc=3
2009-12-24 01:50:13.578: [ RACG][1] [29076][1][ora.m-bill2.gsd]: clsrcgetprsrctx: prsr_init_ext returned rc = 3
2009-12-24 01:50:14.568: [ RACG][1] [29076][1][ora.m-bill2.gsd]: PRKH-1010 : Unable to communicate with CRS services.
[Communications Error(Native: prsr_initCLSS:[3])]
Failed to get list of active nodes from clusterware
......
2009-12-24 01:50:14.600: [ COMMCRS][1]clsc_connect: (6000000000040400) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))
Running RMAN reported:
RMAN> connect target /
RMAN-06900: WARNING: unable to generate V$RMAN_STATUS or V$RMAN_OUTPUT row
RMAN-06901: WARNING: disabling update of the V$RMAN_STATUS and V$RMAN_OUTPUT rows
ORACLE error from target database:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 814, # blocks 1) of control file
ORA-00202: control file: '+RECOVERYDEST/ora10g/controlfile/current.256.688420755'
ORA-29701: unable to connect to Cluster Manager
ORA-00206: error in writing (block 814, # blocks 1) of control file
ORA-00202: control file: '+DG1/ora10g/controlfile/current.256.688420747'
ORA-29701: unable to connect to Cluster Manager
connected to target database: ORA10G (DBID=4007292866)
using target database control file instead of recovery catalog