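A customer's HP RX8640 developed a fault. It runs in a two-node cluster, and the machine with the problem was the standby node. The fault was traced to a processor on cell 0. We first shipped the customer a tested processor; the customer swapped it in themselves, but the machine still crashed. The customer suspected our processor was bad and sent it back. I ran it on an RX7640 for a long time and saw no crashes or related alerts. The customer then suggested we send someone to debug on site. To be safe, I brought two CPUs and a cell board: if the CPU turned out to be the problem I would swap the CPU, and if it was the cell board I would swap the board. Before going, my feeling was that the cell board was the likely culprit.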
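On site, I followed the plan and replaced the CPU. The system came up normally and ioscan showed the CPU recognized and in use, but about 10 minutes later it crashed. The log matched the one the customer had originally sent us. So I pulled cell 0 and took a closer look at the machine. The room's ventilation was not the under-floor airflow you usually see nowadays; instead, air conditioners blew from both sides, and the room temperature was close to body temperature. I felt the machine and its front and rear fans: the rear exhaust at cell 0 was the hottest spot compared with cell 1 and the rest of the chassis, and CPU 0 sat right at the exhaust. I then touched the four CPUs on cell 0, and CPU 0 was indeed the hottest. There was also a lot of dust inside the machine.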
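So I did the following: cleaned the dust off the machine's fans and boards, and moved CPU 0 to a position at the air intake. After the change that evening, I powered on and checked the logs; everything was normal, and by the next day there had been no further crash.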
The customer sent the crash log to HP, and HP replied:
From the crash analysis, we can see that there was a panic caused by spinlock deadlock.
And there was no crash dump for cpu 0&1, they belong to the same socket processor, they met a fault and didn’t release the lock as expected, then time out happened, and system panic.
The solution is to replace this processor on cell 0 socket 0.
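To make HP's explanation concrete, here is a minimal Python sketch of the failure pattern it describes: one party takes a lock and never releases it, and a waiter that spins on the lock eventually times out. This is only a user-space analogy, not HP-UX kernel code; the names `spin_acquire`, `simulate`, and `SpinlockTimeout`, and all the timing values, are hypothetical and chosen for illustration.

```python
import threading
import time


class SpinlockTimeout(Exception):
    """Raised when a waiter gives up; stands in for the system panic."""


def spin_acquire(lock, timeout_s, poll_s=0.001):
    """Poll the lock until it is acquired or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while not lock.acquire(blocking=False):
        if time.monotonic() >= deadline:
            raise SpinlockTimeout("lock holder never released the lock")
        time.sleep(poll_s)


def simulate(holder_stalls, hold_s=0.02, timeout_s=0.2):
    """Return "ok" if the waiter gets the lock, "panic" if it times out."""
    lock = threading.Lock()
    lock.acquire()  # the "holder" (the faulty processor in the analogy) takes the lock
    if not holder_stalls:
        # A healthy holder releases the lock after a short critical section.
        threading.Timer(hold_s, lock.release).start()
    try:
        spin_acquire(lock, timeout_s)
    except SpinlockTimeout:
        return "panic"
    lock.release()
    return "ok"


if __name__ == "__main__":
    print(simulate(holder_stalls=False))
    print(simulate(holder_stalls=True))
```

In this analogy the stalled holder plays the role of the processor that met a fault, and the timeout in the waiter plays the role of the spinlock timeout that ends in a panic.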
At the time of the crash, the console showed:
** A system crash has occurred. (See the above messages for details.)
*** The system is now preparing to dump physical memory to disk, for use
*** in debugging the crash.
*** The dump will be compressed.
*** To change this dump type, press any key within 10 seconds.
*** Select one of the following dump types, by pressing the corresponding key:
C) The dump will be compressed.
S) The dump will be without compression.
N) There will be NO DUMP performed
*** Enter your selection now.
*** Unrecognized response. Please try again.
After saving the crash dump, the system reboots automatically and then stops at the following screen:
HP-UX Start-up in progress
__________________________
Configure system crash dumps ........................................ OK
Removing old vxvm files ............................................. OK
VxVM INFO V-5-2-3360 VxVM device node check ......................... OK
VxVM INFO V-5-2-3362 VxVM general startup ........................... OK
VxVM INFO V-5-2-3366 VxVM reconfiguration recovery .................. OK
Mount file systems .................................................. OK
Setting hostname .................................................... OK
Start Kernel Logging facility ....................................... OK
Set privilege group ................................................. OK
Display date ........................................................ N/A
Save system crash dump if needed ....................................
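References: the rx8640 installation guide, for how to disassemble the machine.
The nPar guide, for partitioning.
The Serviceguard (SG) guide, for the two-node cluster.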