全部博文(147)
分类: Oracle
2012-05-07 09:41:53
CRMII ORA-600[KGHLKREM1]
问题分析报告
1 CRMII ORA-600[KGHLKREM1] 1.1问题描述
数据库alert日志中报ORA-600 [KGHLKREM1]的内部错误,日志内容如下:
Tue Apr 10 10:01:27 2012 Errors in file /oracle/admin/crmii/udump/crmii_ora_3305506.trc: ORA-00600: internal error code, arguments: [KGHLKREM1], [0x7000001B5000020], [], [], [], [], [], [] Tue Apr 10 10:01:27 2012 Errors in file /oracle/admin/crmii/udump/crmii_ora_3317898.trc: ORA-00600: internal error code, arguments: [KGHLKREM1], [0x7000001B5000020], [], [], [], [], [], [] Tue Apr 10 10:01:28 2012 Errors in file /oracle/admin/crmii/udump/crmii_ora_3706974.trc: ORA-00600: internal error code, arguments: [KGHLKREM1], [0x7000001B5000020], [], [], [], [], [], [] Tue Apr 10 10:01:28 2012 Errors in file /oracle/admin/crmii/udump/crmii_ora_12157196.trc: ORA-00600: internal error code, arguments: [KGHLKREM1], [0x7000001B5000020], [], [], [], [], [], [] Tue Apr 10 10:01:28 2012 Errors in file /oracle/admin/crmii/udump/crmii_ora_3846228.trc: ORA-00600: internal error code, arguments: [KGHLKREM1], [0x7000001B5000020], [], [], [], [], [], [] …… |
从错误代码可以看出大量进程由于发生了600错误,导致异常终止,并产生相应dump文件
1.2问题分析
(1)错误的含义
具体错误代码为ORA-600 [KGHLKREM1]
Description
A parallel query using a bloom filter can cause SGA memory corruption if the session is terminated abnormally.(eg: killed via ALTER SYSTEM KILL or similar). This memory corruption can lead to various ORA-600 errors and an instancecrash.
Workaround The workaround is to set _bloom_filter_enabled = FALSEat system level to prevent use of bloom filters.
Note: The fix for this issue cannot be provided as a one off patch. To avoid this problem use the workaround above or use a release which has the fix included.
Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. Always consult with Oracle Support for advice. |
(2)错误的原因
Ora-600错误通常是由于oracle软件自身bug造成的,在一定触发条件下产生,ORA-00600[Kghlkrem1]是由于并行查询中某个进程被异常终止,此时若此查询使用bloom filter功能将导致sga内存损坏,使得实例crash
从进程的dump文件下能够找到进程在出错前都调用了哪些函数,根据ORA-00600: [Kghlkrem1] With Parallel Query Execution [ID 389621.1],通过比对发现进程dump文件中的过程为:
kghnerror <- kghadd_reserved_extent <- kghget_reserved_ext <- ent <- kghgex <- kghfnd <- kghalo <- kghgex <- kghalf <- kghalp
1.3原因确认
综上,该错误的起因是由于并行查询中的单个进程被异常终止(原因待查),致使此时的并行查询触发了ora-600 [Kghlkrem1]的错误,在dump出大量进程trc文件的同时,造成sga内存损坏,引起实例crash。
1.4处理建议
1、数据库的当前版本是10.2.0.3,该BUG在10.2.0.4 及以后版本的patchset中修复。
2、根据ID 389621.1可以通过在system层面设置隐含参数 _bloom_filter_enabled为 FALSE禁用并行查询使用bloom filter,规避这个bug。
修改命令:
SQL> alter system set "_bloom_filter_enabled"=false scope=both;
---上述参数的修改,带来的影响是很轻微的,可以参考ORACLE的官方说明:
“Disabling the bloom filter might have a slight negative impact on PQ performance but since this is a new 10gR2 feature, it may not be noticeable.”
回退命令:
SQL> alter system set "_bloom_filter_enabled"=true scope=both;
---通过上述命令可以回退参数的原来设置值
3、根据故障发生前数据库系统的AWR报告可以判断,此时Buffer Cache由8304M下降到6128M,Shared Pool Size则由6976M上升到9152M,由于该数据库的SGA_MAX_SIZE固定为16G,系统在需要更多内存完成语句解析时,无法从OS获得的情况下,从Buffer Cache中抢夺了2176M,导致在语句执行过程中数据无法在内存中获得,这一点可以从Buffer Hit %不到90%的命中率得以印证,在Top 5的等待事件中也可以明显看到存在两个数据读事件。根据现有系统配置,vmstat中free的内存在6G左右,考虑到Shared Pool在高峰时增长了2176M,同时在SGA自动管理的现有模式下,考虑需要分配两倍大小,即4352M(4.25G),建议设置SGA_MAX_SIZE增大到20G左右,SGA_TARGET同样增大到20G。