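Hardware alarms mostly concern disks, power supplies, fans, CPUs, system boards, and memory. Solaris runs on both x86 and SPARC platforms; this section covers SPARC only.
Sun Fire systems provide a System Controller (SC), an embedded system that resides on the IB_SSC assembly attached to the system baseplane. The SC provides the LOM (Lights Out Management) platform. The rear of the machine normally has a serial console port, an Ethernet port, and a LOM port, all of which connect to the SC; this was covered earlier.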
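The following commands, run from the lom shell, display the current state of the system:
lom> showalarm        displays alarm information
lom> showboards       displays board status information
lom> showcomponent    displays component configuration information
lom> showenvironment  displays overall environmental information
lom> showfault        displays system fault information
lom> showlogs         displays log information
These commands help pin down the details behind a system alarm.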
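To switch between the two: from the Solaris $ or # prompt, type the escape sequence (normally "#.") to reach the lom prompt. To go back to the Solaris console, type console at the lom shell.
The most direct check, of course, is the LEDs: look at the System Fault indicator, and if it is flashing amber, the system has a fault.
The lom utility can also be run from within Solaris. It lives at /usr/sbin/lom, and man lom gives the details.
Some related checks follow.
Check the state of the alarms and the fault LED: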
# lom -l
LOM alarm states:
Alarm1=off
Alarm2=off
Alarm3=on
Fault LED=off
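As a minimal sketch of how this output could be checked from a script, the Python below turns the name=state lines into a dictionary. It works on captured text rather than calling lom itself; the function name parse_lom_alarms and the SAMPLE constant are illustrative names, not part of any Sun tool.

```python
def parse_lom_alarms(text):
    """Parse the 'name=state' lines of `lom -l` output into {name: bool}."""
    states = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # header line such as "LOM alarm states:" has no '='
        states[name.strip()] = value.strip().lower() == "on"
    return states


# Sample copied from the transcript above.
SAMPLE = """LOM alarm states:
Alarm1=off
Alarm2=off
Alarm3=on
Fault LED=off
"""

if __name__ == "__main__":
    states = parse_lom_alarms(SAMPLE)
    active = [name for name, is_on in states.items() if is_on]
    print("active:", active)
```

A monitoring job could feed the real command output into the same function and raise a ticket whenever "Fault LED" is true.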
View the event log:
# lom -e n,[x]    n is the number of reports to display; x is the event level: 1 = fatal, 2 = warning, 3 = information, 4 = user (not applicable to Sun Fire)
# lom -e 11
LOMlite Event Log:
Fri Jul 19 15:16:00 commando-sc lom: Boot: ScApp 5.13.0007, RTOS 23
Fri Jul 19 15:16:06 commando-sc lom: Caching ID information
Fri Jul 19 15:16:08 commando-sc lom: Clock Source: 75MHz
Fri Jul 19 15:16:10 commando-sc lom: /N0/PS0: Status is OK
Fri Jul 19 15:16:11 commando-sc lom: /N0/PS1: Status is OK
Fri Jul 19 15:16:11 commando-sc lom: Chassis is in single partition mode.
Fri Jul 19 15:27:29 commando-sc lom: Locator OFF
Fri Jul 19 15:27:46 commando-sc lom: Alarm 1 ON
Fri Jul 19 15:27:52 commando-sc lom: Alarm 2 ON
Fri Jul 19 15:28:03 commando-sc lom: Alarm 1 OFF
Fri Jul 19 15:28:08 commando-sc lom: Alarm 2 OFF
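The event log is line-oriented, so it is easy to split into timestamp, host, and message. The following is a rough sketch that assumes the "weekday month day time host lom: message" layout seen in the transcript. It also tolerates a message that wraps onto a following line, as console captures sometimes do. The names parse_events and EVENT_RE are made up for this example.

```python
import re

# Timestamp, host name, then the message after the literal "lom:" tag.
EVENT_RE = re.compile(
    r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) lom: (.*)$"
)


def parse_events(text):
    """Return a list of {'time', 'host', 'message'} dicts."""
    events = []
    for raw in text.splitlines():
        line = raw.strip()
        match = EVENT_RE.match(line)
        if match:
            events.append({
                "time": match.group(1),
                "host": match.group(2),
                "message": match.group(3),
            })
        elif events and line:
            # No timestamp: treat as the wrapped tail of the previous message.
            events[-1]["message"] += " " + line
    return events


# Lines copied from the log above, including one wrapped message.
SAMPLE = """LOMlite Event Log:
Fri Jul 19 15:16:11 commando-sc lom: /N0/PS1: Status is OK
Fri Jul 19 15:16:11 commando-sc lom: Chassis is in single
partition mode.
Fri Jul 19 15:27:46 commando-sc lom: Alarm 1 ON
Fri Jul 19 15:28:03 commando-sc lom: Alarm 1 OFF
"""

if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        print(event["time"], "|", event["message"])
```

The log timestamps carry no year, so the sketch keeps them as strings instead of guessing one.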
Check the fans:
# lom -f
Fans:
1 OK speed self-regulating
2 OK speed self-regulating
3 OK speed self-regulating
4 OK speed self-regulating
5 OK speed self-regulating
6 OK speed self-regulating
7 OK speed self-regulating
8 OK speed self-regulating
9 OK speed 100 %
10 OK speed 100 %
The same information can also be displayed with prtdiag -v.
Check the internal voltage sensors:
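For scripted checks, the fan listing can be reduced to a list of records. This is only a sketch based on the "number, status, speed" layout shown in the lom -f output; parse_fans is a hypothetical helper name.

```python
import re

# Fan number, status word, then everything after the word "speed".
FAN_RE = re.compile(r"^(\d+)\s+(\S+)\s+speed\s+(.+)$")


def parse_fans(text):
    """Return one dict per fan line; the 'Fans:' header is skipped."""
    fans = []
    for raw in text.splitlines():
        match = FAN_RE.match(raw.strip())
        if match:
            fans.append({
                "id": int(match.group(1)),
                "status": match.group(2),
                "speed": match.group(3).strip(),
            })
    return fans


# Two lines copied from the lom -f output.
SAMPLE = """Fans:
1 OK speed self-regulating
9 OK speed 100 %
"""

if __name__ == "__main__":
    not_ok = [fan for fan in parse_fans(SAMPLE) if fan["status"] != "OK"]
    print("fans needing attention:", not_ok)
```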
# lom -v
Supply voltages:
1 SSC1 v_1.5vdc0 status=ok
2 SSC1 v_3.3vdc0 status=ok
3 SSC1 v_5vdc0 status=ok
4 RP0 v_1.5vdc0 status=ok
5 RP0 v_3.3vdc0 status=ok
6 RP2 v_1.5vdc0 status=ok
7 RP2 v_3.3vdc0 status=ok
8 SB0 v_1.5vdc0 status=ok
9 SB0 v_3.3vdc0 status=ok
10 SB0/P0 v_cheetah0 status=ok
11 SB0/P1 v_cheetah1 status=ok
12 SB0/P2 v_cheetah2 status=ok
13 SB0/P3 v_cheetah3 status=ok
14 SB2 v_1.5vdc0 status=ok
15 SB2 v_3.3vdc0 status=ok
16 SB2/P0 v_cheetah0 status=ok
17 SB2/P1 v_cheetah1 status=ok
18 SB2/P2 v_cheetah2 status=ok
19 SB2/P3 v_cheetah3 status=ok
20 IB6 v_1.5vdc0 status=ok
21 IB6 v_3.3vdc0 status=ok
22 IB6 v_5vdc0 status=ok
23 IB6 v_12vdc0 status=ok
24 IB6 v_3.3vdc1 status=ok
25 IB6 v_3.3vdc2 status=ok
26 IB6 v_1.8vdc0 status=ok
27 IB6 v_2.4vdc0 status=ok
System status flags:
1 PS0 status=okay
2 PS1 status=okay
3 FT0 status=okay
4 FT0/FAN0 status=okay
5 FT0/FAN1 status=okay
6 FT0/FAN2 status=okay
7 FT0/FAN3 status=okay
8 FT0/FAN4 status=okay
9 FT0/FAN5 status=okay
10 FT0/FAN6 status=okay
11 FT0/FAN7 status=okay
12 RP0 status=okay
13 RP2 status=okay
14 SB0 status=ok
15 SB0/P0 status=online
16 SB0/P0/B0/D0 status=okay
17 SB0/P0/B0/D1 status=okay
18 SB0/P0/B0/D2 status=okay
19 SB0/P0/B0/D3 status=okay
20 SB0/P1 status=online
21 SB0/P1/B0/D0 status=okay
22 SB0/P1/B0/D1 status=okay
23 SB0/P1/B0/D2 status=okay
24 SB0/P1/B0/D3 status=okay
25 SB0/P2 status=online
26 SB0/P2/B0/D0 status=okay
27 SB0/P2/B0/D1 status=okay
28 SB0/P2/B0/D2 status=okay
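With dozens of sensors and components, scanning this listing by eye is error-prone. The sketch below reports every entry whose status is not one of the healthy values seen in the transcript (ok, okay, online). One line of the sample was edited by hand to "failed" purely to exercise the code; the source does not show what the utility actually prints for a bad component, so that string is an assumption.

```python
# Status values that appear for healthy entries in the lom -v output above.
HEALTHY = {"ok", "okay", "online"}


def unhealthy_entries(text):
    """Return (section, name, status) for each entry that is not healthy."""
    problems = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(":"):
            section = line[:-1]  # e.g. "Supply voltages"
            continue
        if "status=" not in line:
            continue
        fields = line.split()
        status = fields[-1].split("=", 1)[1]
        name = " ".join(fields[1:-1])  # drop the index and the status field
        if status.lower() not in HEALTHY:
            problems.append((section, name, status))
    return problems


# Lines from the transcript; the FT0/FAN1 status is altered (hypothetical).
SAMPLE = """Supply voltages:
1 SSC1 v_1.5vdc0 status=ok
10 SB0/P0 v_cheetah0 status=ok
System status flags:
1 PS0 status=okay
5 FT0/FAN1 status=failed
15 SB0/P0 status=online
"""

if __name__ == "__main__":
    for section, name, status in unhealthy_entries(SAMPLE):
        print(f"{section}: {name} is {status}")
```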
Check the internal temperatures:
# lom -t
1 SSC1 t_sbbc0 36 degC : warning 102 degC : shutdown 107 degC
2 SSC1 t_cbh0 45 degC : warning 102 degC : shutdown 107 degC
3 SSC1 t_ambient0 23 degC : warning 82 degC : shutdown 87 degC
4 SSC1 t_ambient1 21 degC : warning 82 degC : shutdown 87 degC
5 SSC1 t_ambient2 28 degC : warning 82 degC : shutdown 87 degC
6 RP0 t_ambient0 22 degC : warning 82 degC : shutdown 87 degC
7 RP0 t_ambient1 22 degC : warning 53 degC : shutdown 63 degC
8 RP0 t_sdc0 62 degC : warning 102 degC : shutdown 107 degC
9 RP0 t_ar0 47 degC : warning 102 degC : shutdown 107 degC
10 RP0 t_dx0 62 degC : warning 102 degC : shutdown 107 degC
11 RP0 t_dx1 65 degC : warning 102 degC : shutdown 107 degC
12 RP2 t_ambient0 23 degC : warning 82 degC : shutdown 87 degC
13 RP2 t_ambient1 22 degC : warning 53 degC : shutdown 63 degC
14 RP2 t_sdc0 57 degC : warning 102 degC : shutdown 107 degC
15 RP2 t_ar0 42 degC : warning 102 degC : shutdown 107 degC
16 RP2 t_dx0 53 degC : warning 102 degC : shutdown 107 degC
17 RP2 t_dx1 56 degC : warning 102 degC : shutdown 107 degC
18 SB0 t_sdc0 48 degC : warning 102 degC : shutdown 107 degC
19 SB0 t_ar0 39 degC : warning 102 degC : shutdown 107 degC
20 SB0 t_dx0 49 degC : warning 102 degC : shutdown 107 degC
21 SB0 t_dx1 54 degC : warning 102 degC : shutdown 107 degC
22 SB0 t_dx2 57 degC : warning 102 degC : shutdown 107 degC
23 SB0 t_dx3 53 degC : warning 102 degC : shutdown 107 degC
24 SB0 t_sbbc0 53 degC : warning 102 degC : shutdown 107 degC
25 SB0 t_sbbc1 40 degC : warning 102 degC : shutdown 107 degC
26 SB0/P0 Ambient 29 degC : warning 82 degC : shutdown 87 degC
27 SB0/P0 Die 57 degC : warning 92 degC : shutdown 97 degC
28 SB0/P1 Ambient 27 degC : warning 82 degC : shutdown 87 degC
29 SB0/P1 Die 51 degC : warning 92 degC : shutdown 97 degC
30 SB0/P2 Ambient 27 degC : warning 82 degC : shutdown 87 degC
31 SB0/P2 Die 53 degC : warning 92 degC : shutdown 97 degC
32 SB0/P3 Ambient 29 degC : warning 82 degC : shutdown 87 degC
33 SB0/P3 Die 50 degC : warning 92 degC : shutdown 97 degC
34 SB2 t_sdc0 51 degC : warning 102 degC : shutdown 107 degC
35 SB2 t_ar0 40 degC : warning 102 degC : shutdown 107 degC
36 SB2 t_dx0 52 degC : warning 102 degC : shutdown 107 degC
37 SB2 t_dx1 54 degC : warning 102 degC : shutdown 107 degC
38 SB2 t_dx2 61 degC : warning 102 degC : shutdown 107 degC
39 SB2 t_dx3 53 degC : warning 102 degC : shutdown 107 degC
40 SB2 t_sbbc0 52 degC : warning 102 degC : shutdown 107 degC
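Each temperature line carries the current reading together with its warning and shutdown thresholds, so the useful number is the margin left before the warning level. A small sketch, assuming the exact "N degC : warning N degC : shutdown N degC" layout shown above (parse_temperatures is an illustrative name):

```python
import re

# index, location, sensor, current reading, warning and shutdown thresholds
TEMP_RE = re.compile(
    r"^\d+\s+(\S+)\s+(\S+)\s+(\d+) degC : warning (\d+) degC"
    r" : shutdown (\d+) degC$"
)


def parse_temperatures(text):
    """Return sensor readings sorted by smallest margin to the warning level."""
    readings = []
    for raw in text.splitlines():
        match = TEMP_RE.match(raw.strip())
        if not match:
            continue
        current, warning, shutdown = (int(match.group(i)) for i in (3, 4, 5))
        readings.append({
            "sensor": f"{match.group(1)} {match.group(2)}",
            "current": current,
            "warning": warning,
            "shutdown": shutdown,
            "margin": warning - current,
        })
    return sorted(readings, key=lambda reading: reading["margin"])


# Four lines copied from the lom -t output.
SAMPLE = """1 SSC1 t_sbbc0 36 degC : warning 102 degC : shutdown 107 degC
7 RP0 t_ambient1 22 degC : warning 53 degC : shutdown 63 degC
11 RP0 t_dx1 65 degC : warning 102 degC : shutdown 107 degC
27 SB0/P0 Die 57 degC : warning 92 degC : shutdown 97 degC
"""

if __name__ == "__main__":
    for reading in parse_temperatures(SAMPLE):
        print(f'{reading["sensor"]}: {reading["margin"]} degC below warning')
```

Sorting by margin rather than by raw temperature matters here because thresholds differ per sensor: the 22 degC ambient sensor on RP0 is closer to its 53 degC warning level than the 65 degC ASIC sensor is to its 102 degC level.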
The system has built-in self-diagnosis and recovery mechanisms: an auto-diagnosis (AD) engine analyzes which specific FRU caused a fault. For example:
[AD] Event: E2900.ASIC.AR.ADR_PERR.10473006
CSN: DomainID: A ADInfo: 1.SCAPP.17.0
Time: Fri Dec 12 09:30:20 PST 2003
FRU-List-Count: 2; FRU-PN: 5405564; FRU-SN: A08712; FRU-LOC: /N0/IB6
FRU-PN: 5404974; FRU-SN: 000274; FRU-LOC: /N0/RP2
Recommended-Action: Service action required
[DOM] Event: SFV1280.L2SRAM.SERD.0.60.10040000000128.7fd78d140
CSN: DomainID: A ADInfo: 1.SF-SOLARIS-DE.5_8_Generic_116188-01
Time: Wed Nov 26 12:06:14 PST 2003
FRU-List-Count: 1; FRU-PN: 3704129; FRU-SN: 100ACD; FRU-LOC: /N0/SB0/P0/E0
Recommended-Action: Service action required
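When many of these messages accumulate, it helps to extract just the event code and the suspect FRUs. The sketch below assumes the field layout of the two messages above ("[AD]" or "[DOM]" header, then "FRU-PN; FRU-SN; FRU-LOC" triples); parse_diagnoses and the dictionary keys are names invented for this example.

```python
import re

HEADER_RE = re.compile(r"^\[(AD|DOM)\] Event: (\S+)")
FRU_RE = re.compile(
    r"FRU-PN: ([^;\s]+); FRU-SN: ([^;\s]+); FRU-LOC: (\S+)"
)


def parse_diagnoses(text):
    """Return one record per diagnosis message with its list of FRUs."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            records.append({
                "source": header.group(1),
                "event": header.group(2),
                "frus": [],
            })
        elif records:
            # FRU triples may sit after "FRU-List-Count" or on their own line.
            for pn, sn, loc in FRU_RE.findall(line):
                records[-1]["frus"].append({"pn": pn, "sn": sn, "loc": loc})
    return records


# The two messages from the text above.
SAMPLE = """[AD] Event: E2900.ASIC.AR.ADR_PERR.10473006
CSN: DomainID: A ADInfo: 1.SCAPP.17.0
Time: Fri Dec 12 09:30:20 PST 2003
FRU-List-Count: 2; FRU-PN: 5405564; FRU-SN: A08712; FRU-LOC: /N0/IB6
FRU-PN: 5404974; FRU-SN: 000274; FRU-LOC: /N0/RP2
Recommended-Action: Service action required
[DOM] Event: SFV1280.L2SRAM.SERD.0.60.10040000000128.7fd78d140
CSN: DomainID: A ADInfo: 1.SF-SOLARIS-DE.5_8_Generic_116188-01
Time: Wed Nov 26 12:06:14 PST 2003
FRU-List-Count: 1; FRU-PN: 3704129; FRU-SN: 100ACD; FRU-LOC: /N0/SB0/P0/E0
Recommended-Action: Service action required
"""

if __name__ == "__main__":
    for record in parse_diagnoses(SAMPLE):
        locations = ", ".join(fru["loc"] for fru in record["frus"])
        print(f'[{record["source"]}] {record["event"]} -> {locations}')
```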
This information is also recorded in the Solaris system log file, /var/adm/messages.
For other errors, use:
lom>showerrorbuffer
ErrorData[0]
Date: Fri Jan 30 10:23:32 EST 2004
Device: /SSC1/sbbc0/systemepld
Register: FirstError[0x10] : 0x0200
SB0 encountered the first error
ErrorData[1]
Date: Fri Jan 30 10:23:32 EST 2004
Device: /SB0/bbcGroup0/repeaterepld
Register: FirstError[0x10]: 0x0002
sdc0 encountered the first error
ErrorData[2]
Date: Fri Jan 30 10:23:32 EST 2004
Device: /SB0/sdc0
ErrorID: 0x60171010
Register: SafariPortError0[0x200] : 0x00000002
ParSglErr [01:01] : 0x1 ParitySingle error
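The error buffer is a sequence of ErrorData[n] records, each with a date, a device path, and some register detail. As a sketch under that assumption, the code below groups the lines per record so the first failing device can be read off quickly. parse_error_buffer is a hypothetical helper, and any line that is not a Date, Device, or ErrorID field is simply kept as free-form detail.

```python
import re

RECORD_RE = re.compile(r"^ErrorData\[(\d+)\]$")
KNOWN_FIELDS = ("Date", "Device", "ErrorID")


def parse_error_buffer(text):
    """Group showerrorbuffer output into one dict per ErrorData[n] record."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        start = RECORD_RE.match(line)
        if start:
            records.append({"index": int(start.group(1)), "details": []})
            continue
        if not records:
            continue  # ignore anything before the first record, e.g. a prompt
        key, sep, value = line.partition(":")
        if sep and key in KNOWN_FIELDS:
            records[-1][key] = value.strip()
        else:
            records[-1]["details"].append(line)
    return records


# Two records copied from the output above.
SAMPLE = """ErrorData[0]
Date: Fri Jan 30 10:23:32 EST 2004
Device: /SSC1/sbbc0/systemepld
Register: FirstError[0x10] : 0x0200
SB0 encountered the first error
ErrorData[2]
Date: Fri Jan 30 10:23:32 EST 2004
Device: /SB0/sdc0
ErrorID: 0x60171010
Register: SafariPortError0[0x200] : 0x00000002
ParSglErr [01:01] : 0x1 ParitySingle error
"""

if __name__ == "__main__":
    for record in parse_error_buffer(SAMPLE):
        print(record["index"], record["Device"], record["details"][-1])
```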