Chinaunix首页 | 论坛 | 博客
  • 博客访问: 214269
  • 博文数量: 123
  • 博客积分: 3066
  • 博客等级: 中校
  • 技术积分: 1920
  • 用 户 组: 普通用户
  • 注册时间: 2007-08-23 21:22
文章分类

全部博文(123)

文章存档

2012年(1)

2010年(1)

2009年(48)

2008年(73)

我的朋友

分类: Oracle

2009-01-05 09:25:10


UNDO表空间满后发生的事

    前几天我的AIX 下的9206RAC数据库实例一的UNDO表空间用满了,过了一会检查活动的回退段只有几个,但是表空间没有回收没有用的空间。以前也了解9i的UNDO这个问题。可能通通过重新建UNDO表空间,切到新的把原来的删掉。只是在查表空间占用率的时候它是高。后来想把UNDO_RETETION,改小。谁知,在一个实例上改了以后就出现问题了。

  我在实例上执行:

    alter system set undo_retention=600 scope=both sid=sid1;

 但是过了一会用SQLPLUS,客户端,都连不上了。数据库处于HANG的状态下。ALERT日中出现以下信息:

  Tue Oct 30 17:05:21 2007
ALTER SYSTEM SET undo_retention=600 SCOPE=BOTH SID='SID1';
Tue Oct 30 17:15:08 2007
Thread 2 advanced to log sequence 60894
  Current log# 9 seq# 60894 mem# 0: /dev/rlv_redo91
  Current log# 9 seq# 60894 mem# 1: /dev/rlv_redo92
Tue Oct 30 17:35:52 2007
Thread 2 advanced to log sequence 60895
  Current log# 10 seq# 60895 mem# 0: /dev/rlv_redo101
  Current log# 10 seq# 60895 mem# 1: /dev/rlv_redo102
Tue Oct 30 17:55:36 2007
Thread 2 advanced to log sequence 60896
  Current log# 6 seq# 60896 mem# 0: /dev/rlv_redo61
  Current log# 6 seq# 60896 mem# 1: /dev/rlv_redo62
Tue Oct 30 18:15:21 2007
Thread 2 advanced to log sequence 60897
  Current log# 7 seq# 60897 mem# 0: /dev/rlv_redo71
  Current log# 7 seq# 60897 mem# 1: /dev/rlv_redo72
Tue Oct 30 18:17:37 2007
Timed out trying to start process J000.
Tue Oct 30 18:19:41 2007
Timed out trying to start process J000.
Tue Oct 30 18:21:46 2007
Timed out trying to start process J000.
Tue Oct 30 18:23:56 2007
Timed out trying to start process J000.
Tue Oct 30 18:26:01 2007
Timed out trying to start process J000.
Tue Oct 30 18:28:06 2007
Timed out trying to start process J000.
Tue Oct 30 18:33:16 2007
Timed out trying to start process J000.
Tue Oct 30 18:35:21 2007
Timed out trying to start process J000.
Tue Oct 30 18:35:36 2007
Thread 2 advanced to log sequence 60898
  Current log# 8 seq# 60898 mem# 0: /dev/rlv_redo81
  Current log# 8 seq# 60898 mem# 1: /dev/rlv_redo82
Tue Oct 30 18:37:26 2007
Timed out trying to start process J000.
Tue Oct 30 18:39:56 2007
Timed out trying to start process J000.
Tue Oct 30 18:42:01 2007
Timed out trying to start process J000.
Tue Oct 30 18:44:06 2007
Timed out trying to start process J000.
Tue Oct 30 18:49:10 2007
Timed out trying to start process J000.
Tue Oct 30 18:50:39 2007

后来没有办法将数据库重启才恢复正常、

事后到METALINK上弄了一个SR 上面回复为遇到了BUG:

The issue you are facing is Bug 4051184 (Base Bug 3023661). This bug affects 9.2.0.x and 10.1.0.x also. This is fixed only in 10.2 and because of major code ch
ange, the fix is not available for 9.2 or 10.1

As a workaround, you need to set undo_retention in spfile(scope=spfile) or the init.ora file.
To resolve the hang, you can try killing pz99 process instead of aborting the instance.

ps -ef | grep pz99
kill -9

查BUG3023661信息查不到,这个BUG还没有公开。哎!!!

又在上面找到一个关于10G的与我这个相类似的现象。

Doc ID:  Note:397024.1

Applies to:
Oracle Server - Enterprise Edition - Version: 10.1.0.4
This problem can occur on any platform.

Symptoms
3 nodes RAC system on 10.1.0.4. The command "alter system set undo_retention=1800 sid='*';" was issued on node1, immeidately all instances hang. User can not logon to instance 2 and 3. Logon to instance 1 is still OK but query hangs.

Alert log on node 2 and 3 report:

Timed out trying to start process J000.
Timed out trying to start process q000.
kkjcre1p: unable to spawn jobq slave process
Errors in file /opt/oracle/admin/RAC/bdump/rac3_cjq0_4229.trc


Cause
This is caused by unpublished Bug.3023661  RAC INSTANCE HANGS WHEN MODIFYING UNDO_RETENTION PARAMETER
(or published Bug 4220405 ALTER SYSTEM SET UNDO_RETENTION= HANGS RAC INSTANCES which has been closed as base bug 3023661)
Bug 4220405 ALTER SYSTEM SET UNDO_RETENTION= HANGS RAC INSTANCES which has been closed as base bug 3023661)
The problem is caused by deadlock between CKPT and PZ99 slave. The internal algorithm to query the current value of undo_retention of each instance and modify it has problem. It does unnecessary gv$ query and lob$ update when spfile is used. It is coding problem. The bug has been fixed in 10.2 and it is not backportable to 10.1.0.x due to code structure change.

Solution
Workaround is to modify one instance at a time, specify the detailed SID name for alter system command, eg:

alter system set undo_retention=1800 sid='RAC1';
alter system set undo_retention=1800 sid='RAC2';
alter system set undo_retention=1800 sid='RAC3';
......
until all instances are modified.

The following workaround can be used to get out of hanging situation rather than kill the background process on instance 2 or 3 to force them shutdown:

On hanging instance 2 and 3:
ps -ef | grep pz99
kill -9

This should break the deadlock, allow instance 2 and 3 to continue. undo_retention should be reset per above commands.

在网站上就没有找到我这个版本的。系统的相同现象。难道我是这个版本遇到这个BUG的第一人吗。呵呵

最终确定了产生问题原因。有一点不解,好多朋友也都这么改过。可能跟具体的应用有关吧。世界之大什么事都有可能遇到。通过这次也积累了经验。希望通过这个对遇到类似问题的朋友有所帮助。



 发表于: 2007-11-05,修改于: 2007-11-05 10:27 已浏览1587次,有评论1条 推荐 投诉
阅读(1622) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~