

Category: Oracle

2017-12-01 16:12:56

Scenario: On a newly deployed RAC database, a large number of concurrent import operations during database initialization (several TB of data loaded in parallel) filled up the server's /u01 directory. In that state the server was rebooted directly, after which the Linux server failed to boot.

Environment: Oracle RAC 11gR2
        Red Hat Enterprise Linux 6.4

The error log is as follows:


    INFO: task bonnie++:31785 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

    udevd[1368]: worker [1528] unexpectedly returned with status 0x0100
    udevd[1368]: worker [1528] failed while handling '/devices/pci0000:00/0000:00:03.2/0000:04:00.1/host1/rport-2:0-5/target2:0:3/2:0:3:32/block/sdb/sdb5'

Root cause analysis (quoted from reference [1]):

This is a known bug. By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached, the file system flushes all outstanding data to disk, causing all following I/Os to go synchronous. There is a default time limit of 120 seconds for flushing this data out to disk. In the case here, the I/O subsystem is not fast enough to flush the data within 120 seconds. This especially happens on systems with a lot of memory.

The problem is solved in later kernels and there is no "fix" from Oracle. I fixed this by lowering the mark for flushing the cache from 40% to 10% by setting "vm.dirty_ratio=10" in /etc/sysctl.conf. This setting does not influence overall database performance since you hopefully use Direct IO and bypass the file system cache completely.
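Before changing anything, it can help to confirm the current thresholds on the affected host. A minimal sketch (standard sysctl usage; the 40% figure above is from the quoted article, and actual defaults vary by kernel version):

    # Inspect the writeback thresholds and the hung-task timeout
    sysctl vm.dirty_ratio                  # percentage of RAM at which writers block and flush synchronously
    sysctl vm.dirty_background_ratio       # percentage at which background flushing starts
    sysctl kernel.hung_task_timeout_secs   # the 120-second limit seen in the log above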


Remediation approach:

In other words: flushing the large amount of cached data to disk is subject to a default time limit of 120 seconds. Under this heavy write load, the data could not be flushed in time, which triggered the issue. Therefore, the following parameter was set in /etc/sysctl.conf:

vm.dirty_ratio=10
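One way to apply the setting both persistently and immediately (a sketch; run as root, and skip the echo if you edit /etc/sysctl.conf by hand as described above):

    # Persist the new threshold, then load it into the running kernel
    echo "vm.dirty_ratio=10" >> /etc/sysctl.conf
    sysctl -p                  # re-reads /etc/sysctl.conf
    sysctl vm.dirty_ratio      # verify: should now print 10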

-------------------------------------------------------------------------------------------------
Even after the change above, the problem was not completely resolved.
Continuing the analysis:

  • The number of spawned udevd workers depends only on the amount of RAM. As a result, on machines with relatively large RAM sizes and lots of disks, many udevd workers run in parallel, maximizing CPU and I/O. This can cause udev events to time out because of hardware bottlenecks.
    [A large number of concurrent udevd worker processes maximize CPU and I/O usage, and udev events eventually time out; the cause is a hardware bottleneck.]

  • A fix helps govern the multiple parallel driver loads that were occurring via modprobe, preventing unnecessary driver loads that contributed to high system resource use during device discovery and could also cause udev events to time out.

  • This and related issues were fixed in udev-147-2.63.el6_7.1 via (private) bugzillas 1281469 and 1281467. Additional fixes are present from (private) bugzillas 1170313, 885978 and 816724 that address other related issues contributing to the 0x100 messages being displayed.
    [The udev patches can fix and mitigate this issue; a version-check sketch follows this list.]
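As a quick check before working around the problem by hand, you can compare the installed udev package against the fixed version named above (a sketch, assuming a yum-managed RHEL 6 host):

    # Check whether the installed udev already contains the fix
    # (the fixed build named above is udev-147-2.63.el6_7.1 or later)
    rpm -q udev
    # If the installed version is older, update it from the repositories:
    yum update udev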


    Fix approach:
    Add the following line to the /lib/udev/rules.d/10-dm.rules file:

    OPTIONS+="event_timeout=600"
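To make udev pick up the edited rules file without waiting for the reboot below, the rules can be reloaded in place (a sketch; udevadm ships with the udev package on RHEL 6):

    # Ask the running udev daemon to re-read its rules so the new
    # event_timeout takes effect for subsequent events
    udevadm control --reload-rules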

    Then, when restarting the server, note: shut it down completely first.

    • shutdown -h now


    Ten minutes later, the Oracle node server was powered back on, after which the Linux server booted normally.




References:
【1】http://blog.ronnyegner-consulting.de/2011/10/13/info-task-blocked-for-more-than-120-seconds/
【2】http://m.blog.csdn.net/vic_qxz/article/details/72781461

