【转帖】UNDO表空间满后发生的事-xihii-ChinaUnix博客

xihii

首页　| 　博文目录　| 　关于我

xihii

博客访问： 214269
博文数量： 123
博客积分： 3066
博客等级：中校
技术积分： 1920
用户组：普通用户
注册时间： 2007-08-23 21:22

文章分类

全部博文（123）

我的文章（0）
未分配的博文（123）

文章存档

2012年（1）

2010年（1）

2009年（48）

2008年（73）

我的朋友

相关博文

【转帖】UNDO表空间满后发生的事

分类： Oracle

2009-01-05 09:25:10

UNDO表空间满后发生的事

前几天我的AIX 下的9206RAC数据库实例一的UNDO表空间用满了，过了一会检查活动的回退段只有几个，但是表空间没有回收没有用的空间。以前也了解9i的UNDO这个问题。可能通通过重新建UNDO表空间，切到新的把原来的删掉。只是在查表空间占用率的时候它是高。后来想把UNDO_RETETION,改小。谁知，在一个实例上改了以后就出现问题了。

　　我在实例上执行：

　　　　alter system set undo_retention=600 scope=both sid=sid1;

　但是过了一会用SQLPLUS，客户端，都连不上了。数据库处于HANG的状态下。ALERT日中出现以下信息：

　　Tue Oct 30 17:05:21 2007
ALTER SYSTEM SET undo_retention=600 SCOPE=BOTH SID='SID1';
Tue Oct 30 17:15:08 2007
Thread 2 advanced to log sequence 60894
Current log# 9 seq# 60894 mem# 0: /dev/rlv_redo91
Current log# 9 seq# 60894 mem# 1: /dev/rlv_redo92
Tue Oct 30 17:35:52 2007
Thread 2 advanced to log sequence 60895
Current log# 10 seq# 60895 mem# 0: /dev/rlv_redo101
Current log# 10 seq# 60895 mem# 1: /dev/rlv_redo102
Tue Oct 30 17:55:36 2007
Thread 2 advanced to log sequence 60896
Current log# 6 seq# 60896 mem# 0: /dev/rlv_redo61
Current log# 6 seq# 60896 mem# 1: /dev/rlv_redo62
Tue Oct 30 18:15:21 2007
Thread 2 advanced to log sequence 60897
Current log# 7 seq# 60897 mem# 0: /dev/rlv_redo71
Current log# 7 seq# 60897 mem# 1: /dev/rlv_redo72
Tue Oct 30 18:17:37 2007
Timed out trying to start process J000.
Tue Oct 30 18:19:41 2007
Timed out trying to start process J000.
Tue Oct 30 18:21:46 2007
Timed out trying to start process J000.
Tue Oct 30 18:23:56 2007
Timed out trying to start process J000.
Tue Oct 30 18:26:01 2007
Timed out trying to start process J000.
Tue Oct 30 18:28:06 2007
Timed out trying to start process J000.
Tue Oct 30 18:33:16 2007
Timed out trying to start process J000.
Tue Oct 30 18:35:21 2007
Timed out trying to start process J000.
Tue Oct 30 18:35:36 2007
Thread 2 advanced to log sequence 60898
Current log# 8 seq# 60898 mem# 0: /dev/rlv_redo81
Current log# 8 seq# 60898 mem# 1: /dev/rlv_redo82
Tue Oct 30 18:37:26 2007
Timed out trying to start process J000.
Tue Oct 30 18:39:56 2007
Timed out trying to start process J000.
Tue Oct 30 18:42:01 2007
Timed out trying to start process J000.
Tue Oct 30 18:44:06 2007
Timed out trying to start process J000.
Tue Oct 30 18:49:10 2007
Timed out trying to start process J000.
Tue Oct 30 18:50:39 2007

后来没有办法将数据库重启才恢复正常、

事后到METALINK上弄了一个SR　上面回复为遇到了BUG：

The issue you are facing is Bug 4051184 (Base Bug 3023661). This bug affects 9.2.0.x and 10.1.0.x also. This is fixed only in 10.2 and because of major code ch
ange, the fix is not available for 9.2 or 10.1

As a workaround, you need to set undo_retention in spfile(scope=spfile) or the init.ora file.
To resolve the hang, you can try killing pz99 process instead of aborting the instance.

ps -ef | grep pz99
kill -9

查BUG3023661信息查不到，这个BUG还没有公开。哎！！！

又在上面找到一个关于10G的与我这个相类似的现象。

Doc ID: Note:397024.1

Applies to:
Oracle Server - Enterprise Edition - Version: 10.1.0.4
This problem can occur on any platform.

Symptoms
3 nodes RAC system on 10.1.0.4. The command "alter system set undo_retention=1800 sid='*';" was issued on node1, immeidately all instances hang. User can not logon to instance 2 and 3. Logon to instance 1 is still OK but query hangs.

Alert log on node 2 and 3 report:

Timed out trying to start process J000.
Timed out trying to start process q000.
kkjcre1p: unable to spawn jobq slave process
Errors in file /opt/oracle/admin/RAC/bdump/rac3_cjq0_4229.trc

Cause
This is caused by unpublished Bug.3023661 RAC INSTANCE HANGS WHEN MODIFYING UNDO_RETENTION PARAMETER
(or published Bug 4220405 ALTER SYSTEM SET UNDO_RETENTION= HANGS RAC INSTANCES which has been closed as base bug 3023661)
Bug 4220405 ALTER SYSTEM SET UNDO_RETENTION= HANGS RAC INSTANCES which has been closed as base bug 3023661)
The problem is caused by deadlock between CKPT and PZ99 slave. The internal algorithm to query the current value of undo_retention of each instance and modify it has problem. It does unnecessary gv$ query and lob$ update when spfile is used. It is coding problem. The bug has been fixed in 10.2 and it is not backportable to 10.1.0.x due to code structure change.

Solution
Workaround is to modify one instance at a time, specify the detailed SID name for alter system command, eg:

alter system set undo_retention=1800 sid='RAC1';
alter system set undo_retention=1800 sid='RAC2';
alter system set undo_retention=1800 sid='RAC3';
......
until all instances are modified.

The following workaround can be used to get out of hanging situation rather than kill the background process on instance 2 or 3 to force them shutdown:

On hanging instance 2 and 3:
ps -ef | grep pz99
kill -9

This should break the deadlock, allow instance 2 and 3 to continue. undo_retention should be reset per above commands.

在网站上就没有找到我这个版本的。系统的相同现象。难道我是这个版本遇到这个BUG的第一人吗。呵呵

最终确定了产生问题原因。有一点不解，好多朋友也都这么改过。可能跟具体的应用有关吧。世界之大什么事都有可能遇到。通过这次也积累了经验。希望通过这个对遇到类似问题的朋友有所帮助。

发表于： 2007-11-05，修改于： 2007-11-05 10:27 已浏览1587次，有评论1条推荐投诉

阅读(1622) | 评论(0) | 转发(0) |

上一篇：【RAC（Real Application Cluster真正应用集群）是Oracle9i数据库中采

下一篇：【转帖】关于redo log 和 undo log 这两个概念

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6