2006-10-17 13:09:32

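A once-in-a-century 670 CPU failure, and last week it happened to me. I could have cried.
I was just about to leave on an inspection round when a fault call came in on the way: the 670 had hung and would not boot.
I rushed to the site and found it stopping at 91FF and then rebooting over and over: it reached 91FF again, sat there for quite a while, and rebooted again.
Watching closely, I noticed that the LED sometimes stopped at E51F.
After a long wait, it finally produced an error code: B11B4691
So I went straight to the manual.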
B1xx 4691

Description: System firmware to service processor interface failure. (System firmware surveillance time out)

Action: This code may be informational, or it may indicate a system firmware to service processor interface failure. Before changing any parts, examine word 13 in the service processor error log entry, or bytes 68 and 69 in the AIX error log entry. For detailed instructions on finding word 13 in the service processor error log entry, or bytes 68 and 69 in the AIX error log entry, refer to error code B1xx 4699. If the error code (word 11 value) is B19F4691 and the word 13 value is 230Axxxx, this is an informational message. No action is required by the customer or service representative.

1. Check for system firmware updates.
2. Go to the service processor main menu and select System Information Menu. Then select Read Progress Indicators From Last System Boot. Begin your repair action with the error code or checkpoint immediately preceding B1xx 4691. If a location code displays with the error code or checkpoint, replace the part at that location. If changing that part does not fix the problem, or no location code is specified, and you have an 8-character error code, go to the “Checkpoints and Error Codes Index” on page 374. If changing that part does not fix the problem, or no location code is specified, and you have a 4-character checkpoint, go to “Firmware Checkpoints” on page 337.
3. If the problem is not resolved, call the second level support.
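The manual's "informational" condition is easy to misread in the middle of an outage, so here is a minimal sketch of it as a check. The function name is hypothetical, and the word 13 value used in the example is made up purely for illustration; the real one has to be read from the service processor error log entry.

```python
def is_informational_4691(word11: str, word13: str) -> bool:
    """Apply the rule quoted from the manual: the entry is informational
    only when word 11 is B19F4691 and word 13 has the form 230Axxxx."""
    w11 = word11.strip().upper()
    w13 = word13.strip().upper()
    return w11 == "B19F4691" and len(w13) == 8 and w13.startswith("230A")


# The code seen on this machine was B11B4691, not B19F4691, so the
# informational exception cannot apply, whatever word 13 contains.
# "230A0000" is a made-up word 13 value used only for this example.
print(is_informational_4691("B11B4691", "230A0000"))  # False
```

In other words, for B11B4691 the numbered repair steps still have to be followed.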
 

Next, look up E51F:

E51F :

End of I/O configuration

1. Check for system firmware updates.
2. Go to “MAP 1542: I/O Problem Isolation” on page 276.

 

9xxx :

9xxx checkpoints are displayed by the service processor after the power-on sequence is initiated. A system processor takes control when 91FF displays on the operator panel display.

Note: Certain checkpoints may remain in the display for long periods of time. A spinning cursor is visible in the upper-right corner of the display during these periods to indicate that system activity is continuing.

91FF: Control being handed to system processor from service processor. See note 1 on page 335.

 

Let's also look at the system's boot process, which makes the problem easier to analyze:

IPL Flow

The IPL process starts when ac power is connected to the system. The IPL process has the following phases:

Phase 1: Service Processor Initialization

Phase 1 starts when ac power is connected to the system and ends when OK is displayed in the media subsystem operator panel. 8xxx checkpoints are displayed during this phase. Several 9xxx codes may also be displayed. Service processor menus are available at the end of this phase by striking any key on the console keyboard.

Phase 2: Hardware Initialization by the Service Processor

Phase 2 starts when system power-on is initiated by pressing the power on button on the media subsystem operator panel. 9xxx checkpoints are displayed during this time. 91FF, the last checkpoint in this phase, indicates the transition to phase 3 is taking place.

Phase 3: System Firmware initialization

On a full system partition, at phase 3, a system processor takes over control and continues initializing partition resources. During this phase, checkpoints in the form Exxx are displayed. E105, the last checkpoint in this phase, indicates that control is being passed to the operating system boot program.

On a partitioned system, there is a global systemwide initialization phase 3, during which a system processor continues the initialization process. Checkpoints in this phase are of the form Exxx. This global phase 3 ends with a "LPAR..." on the operator panel. As a logical partition begins a partition-initialization phase 3, one of the system processors assigned to that partition continues initialization of resource assigned to that partition. Checkpoints in this phase are also of the form Exxx. This partition phase 3 ends with an E105 displayed on the partition’s virtual operator panel on the HMC, indicating control has been passed to that logical partition’s operating system boot program. For both the global and partition phase 3, location codes may also be displayed on the physical operator panel and the partition’s virtual terminal, respectively.

Phase 4: Operating System Boot

When the operating system starts to boot, checkpoints in the form 0xxx and 2xxx are displayed. This phase ends when the operating system login prompt displays on the operating system console.
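To place the codes from the operator panel in this flow, the phase descriptions above can be sketched as a small lookup. This is only an illustration built from the checkpoint prefixes quoted from the manual; the function name and the returned strings are my own, not anything from IBM tooling.

```python
def ipl_phase(code: str) -> str:
    """Map an operator-panel code to the IPL phase described above."""
    c = code.strip().upper()
    if len(c) == 8:
        # 8-character values such as B11B4691 are error codes, not checkpoints.
        return "8-character error code, not a checkpoint"
    if len(c) != 4:
        return "unrecognized"
    if c[0] == "8":
        return "phase 1: service processor initialization"
    if c[0] == "9":
        # Several 9xxx codes may also show up during phase 1.
        return "phase 2: hardware initialization by the service processor"
    if c[0] == "E":
        return "phase 3: system firmware initialization"
    if c[0] in "02":
        return "phase 4: operating system boot"
    return "unrecognized"


for code in ("91FF", "E51F", "B11B4691"):
    print(code, "->", ipl_phase(code))
```

Read this way, the machine was getting to the end of phase 2 (91FF), occasionally a little way into phase 3 (E51F), and never anywhere near the operating system boot.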

 

 

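Every one of these codes points to firmware, so I checked whether a firmware update was available for the system. Maybe that would fix it.

I logged in to the SP and checked the firmware level. Ugh, it was already current; I learned that it had been upgraded a while ago.

Hmm... could the upgraded microcode itself be the problem?

But both machines were upgraded at the same time, so why is the other one fine? That makes the firmware look unlikely after all.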

 

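So which hardware is tied to the firmware? Could that part have failed and be causing all this? I looked it up and found that it is the primary I/O book. Could a bad primary I/O book be producing the firmware fault?

So I decided to request a spare primary I/O book and try swapping it.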

 

Then I looked at the SP log again and found the following entries:

                Error Log

 

1.  10/09/2006 05:29:24     The IPL ROS surveillance interval exceeded.

    B11B4691                        

 

2.  10/09/2006 05:05:57     System Processor Failure

    4b2725cf        U1.18-P1-C1

 

3.  10/09/2006 05:05:54     System Processor Failure

    4b2725cf        U1.18-P1-C1

 

4.  10/09/2006 04:51:57     System Memory Failure

    4503269a        U1.18-P1-C18 x2

 

5.  10/09/2006 04:51:57     System Memory Failure

    4503269a        U1.18-P1-C7 x2
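When the log is longer than five entries, it helps to read it oldest first and see which location codes keep coming up. Below is a minimal sketch that parses the text exactly as shown above. The parser is my own illustration, not an IBM tool; it assumes the dates are in MM/DD/YYYY form and keeps the location text (including the trailing "x2") as an uninterpreted string.

```python
import re
from datetime import datetime

# The entries copied from the SP error log above.
LOG = """\
1.  10/09/2006 05:29:24     The IPL ROS surveillance interval exceeded.
    B11B4691
2.  10/09/2006 05:05:57     System Processor Failure
    4b2725cf        U1.18-P1-C1
3.  10/09/2006 05:05:54     System Processor Failure
    4b2725cf        U1.18-P1-C1
4.  10/09/2006 04:51:57     System Memory Failure
    4503269a        U1.18-P1-C18 x2
5.  10/09/2006 04:51:57     System Memory Failure
    4503269a        U1.18-P1-C7 x2
"""

# Header line: entry number, timestamp, description.
HEADER = re.compile(
    r"^\s*(\d+)\.\s+(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(.*\S)\s*$"
)


def parse_sp_log(text):
    """Return one dict per entry: time, description, code, location."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # the original log has blank lines between entries
        m = HEADER.match(line)
        if m:
            entries.append({
                # Assumption: month/day/year order.
                "time": datetime.strptime(m.group(2), "%m/%d/%Y %H:%M:%S"),
                "description": m.group(3),
                "code": None,
                "location": None,
            })
        elif entries:
            # Detail line: error code, then an optional location code.
            parts = line.split(None, 1)
            entries[-1]["code"] = parts[0]
            if len(parts) > 1:
                entries[-1]["location"] = parts[1].strip()
    return entries


entries = parse_sp_log(LOG)
# sorted() is stable, so entries with the same timestamp keep their log order.
for e in sorted(entries, key=lambda e: e["time"]):
    print(e["time"], e["code"], e["location"] or "-", e["description"])
```

Sorted this way, the two memory failures come first, the processor failures at U1.18-P1-C1 follow about fourteen minutes later, and B11B4691 is logged last.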

 

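Ugh. These entries point to CPU and L3 cache failures. If those parts have really failed, we are in serious trouble.

To pin down whether the CPU had really failed, I deconfigured the first CPU and changed the boot mode to slow boot.

The fault was unchanged.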

 

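No way around it: all three are possible culprits, the primary I/O book, the CPU, and the L3 cache.

Time to request spares and get ready to swap parts.

The spares warehouse had no 670 CPU on hand at the moment, so the only option was to test the I/O book first. An I/O book was shipped from Beijing. I installed it very carefully and restarted the machine. Exactly the same fault! Unbelievable!

Not willing to give up, I upgraded the microcode. Incredibly, the error information was still the same.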

......

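With the I/O book ruled out, that left the CPU and the L3 cache.

I kept testing, and testing, day and night.

Finally, once I had disabled all eight processors of the first CPU group, the boot got past that point and the LED showed LPAR.

At last there was hope. The machine came up.

I restarted the LPARs. Some of them came up, but as soon as certain others were started, the whole machine went DOWN again.

With a colleague's help, further testing also identified an L3 cache fault.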

 

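Good. With the fault located, things became much easier: just get the spare parts and replace them.

But the boss had not stocked a 670 CPU. To be fair, this part rarely fails; it really is a once-in-a-century event. We were simply unlucky, and there was nothing to be done. IBM said the same: bad luck, you just happened to hit it. Purchasing from overseas would mean waiting several days, so we decided to buy a 670 CPU from the local IBM office. IBM's procedures are really complicated: faxes went back and forth many times, and it took hours. Then I looked at the price on the quote and nearly fainted: the 670 CPU was listed at 273W (2.73 million). A whole 670 does not cost that much these days. The boss would not accept such a sky-high price for a CPU; millions is no small sum.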

 

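There was no alternative: the application had to keep running on the standby machine while we waited a few more days and our own company's purchasing department hurried things along.

Luckily it went fairly fast. The shipment had just cleared customs, and as soon as it reached Shenzhen I went there in person to pick it up and carefully carried the CPU back. Haha, I was happy, because the problem was about to be solved.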

 

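That night I got to work. Following the steps I had printed from the manual one by one, I had the part swapped within a few hours. I checked that everything was fine and, haha, started the machine. Everything OK.

 

I'm short on time. I'll go and watch it a while longer now, and once it is stable I'll move the application back.

I'd better go take a look. It needs to be looked after properly from now on. Hehe :)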

 

 

 

 
