Analyzing SCSI FMA Ereports

2010-06-04 14:10:08

By , December 2008

Contents

This is the second article in a series about the SCSI DISK FMA project:

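SCSI DISK FMA Project Part 1: SCSI Device Drivers as FMA Telemetry Detectors -- Describes the concept of "device-as-detector" and gives you a brief introduction to SCSI DISK FMA efforts.
SCSI DISK FMA Project Part 2: Analyzing SCSI FMA Ereports -- Describes how to use existing tools to analyze a structured FMA ereport log instead of searching syslog.
SCSI DISK FMA Project Part 3: FMA Behavior of Retired Faulted SCSI Disks -- Describes what to do when you get an FMA fault on a Solaris system.
SCSI DISK FMA Project Part 4: SD Fault Injection -- Describes a useful tool for programmers to use for testing the sd driver.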

Overview

With SCSI FMA phase III implemented, the sd/ssd (SCSI disk) driver can post FMA telemetry (ereports) when it detects an error condition. By analyzing these ereports, you can find out what is happening at the kernel driver level. This article describes how to use this feature (SCSI FMA) to analyze a potential error condition.

Error Reports and Payloads

Ereports (error reports) are generated upon the detection of an abnormal condition, recorded in persistent storage (for example a file system) in binary format, and used as input to automated diagnosis engines.

An ereport is described by its event class (hierarchy path) and a payload of name-value pairs that can be used for diagnosis and logging.

Six new ereports are introduced by SCSI FMA:

ereport.io.scsi.cmd.disk.dev.rqs.merr -- Media error
ereport.io.scsi.cmd.disk.dev.rqs.derr -- Device error
ereport.io.scsi.cmd.disk.dev.serr -- SCSI command status error
ereport.io.scsi.cmd.disk.dev.uderr -- Unexpected data error
ereport.io.scsi.cmd.disk.recovered -- SCSI command recovered from a failure
ereport.io.scsi.cmd.disk.tran -- SCSI command transport error
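
Because the class names are fixed strings, a small lookup table is enough to turn a class taken from a log into a readable description. The Python sketch below is only an illustration: the helper name describe_ereport is hypothetical, and the descriptions are copied from the list above.

```python
# Descriptions copied from the list of SCSI FMA ereport classes above.
SCSI_EREPORT_CLASSES = {
    "ereport.io.scsi.cmd.disk.dev.rqs.merr": "Media error",
    "ereport.io.scsi.cmd.disk.dev.rqs.derr": "Device error",
    "ereport.io.scsi.cmd.disk.dev.serr": "SCSI command status error",
    "ereport.io.scsi.cmd.disk.dev.uderr": "Unexpected data error",
    "ereport.io.scsi.cmd.disk.recovered": "SCSI command recovered from a failure",
    "ereport.io.scsi.cmd.disk.tran": "SCSI command transport error",
}


def describe_ereport(event_class: str) -> str:
    """Return the description of a SCSI FMA ereport class (hypothetical helper)."""
    return SCSI_EREPORT_CLASSES.get(event_class, "not a SCSI FMA disk ereport")


if __name__ == "__main__":
    print(describe_ereport("ereport.io.scsi.cmd.disk.dev.rqs.merr"))
```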

Each of these ereports carries many payload members. For analyzing problems, ENA and driver-assessment are especially useful.

ENA (error numeric association) is used in SCSI FMA as a link for a sequence of related ereports. For example, a command retried several times that finally succeeds would result in a sequence of posted ereports that are associated by the same ENA value.
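
To see how the ENA ties a sequence together, the following Python sketch groups ereports by ENA. It assumes the text has the three-column TIME / CLASS / ENA layout that fmdump -ev prints in the example later in this article; the function name group_by_ena is hypothetical, and the last sample row (the tran ereport) is made up for illustration.

```python
from collections import defaultdict

# First two rows follow the example in this article; the third row is invented.
SAMPLE = """\
TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:07:01.0000 ereport.io.scsi.cmd.disk.tran         0x0000000000000001
"""


def group_by_ena(fmdump_ev_output: str) -> dict:
    """Map each ENA to the ereport classes posted under it, in log order."""
    groups = defaultdict(list)
    for line in fmdump_ev_output.splitlines():
        fields = line.split()
        # Data rows end with "<class> <ena>"; the header and blank lines do not.
        if (len(fields) >= 2
                and fields[-2].startswith("ereport.")
                and fields[-1].startswith("0x")):
            groups[fields[-1]].append(fields[-2])
    return dict(groups)


if __name__ == "__main__":
    for ena, classes in group_by_ena(SAMPLE).items():
        print(ena, len(classes), "ereport(s)")
```

Each key of the returned dictionary corresponds to one SCSI command and its retries.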

The driver-assessment value indicates the action the driver is going to take. It usually helps the administrator analyze what happened to a specific SCSI command at the kernel level. Table 3 lists the available values of driver-assessment.

There are many other useful payloads for analyzing SCSI FMA ereports. Refer to Table 1 for details.

FMA Utilities for Administrators

Utilities are provided for inspecting details of ereports:

fmdump -- A fault management log viewer. The FMA framework maintains two categories of logs: one for faults, and another for ereports. Using fmdump you can see the detail of a specific pattern of ereports and also the fault list produced by the diagnosis engine.
fmadm -- A tool for fault management configuration. It provides many functions, some of them quite frequently used, including viewing the faulty system component and resolving a fault.

Both tools must be run as the root user. See Table 2 for example usage of these tools. If you need more detailed instructions, refer to the fmdump(1M) and fmadm(1M) man pages.

An Example of Analyzing Ereports

If you are unlucky, one day you might see the following message on your console or in /var/adm/messages (or, even worse, one of your hard drives might no longer be visible when you run format).

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 25 14:06:18 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6  DISK-8000-4Q   Critical 

Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects     : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D//pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
                  faulted and taken out of service
FRU         : "HD_ID_23" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:server-id=icecube:serial=KRVN63ZAJLP44D:part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A/chassis=0/bay=23/disk=0)
                  faulty

Description : The command was terminated with a non-recovered error condition
              that may have been caused by a flaw in the media or an error in
              the recorded data. 
              Refer to  for more information.

Response    : The device may be offlined or degraded.

Impact      : It is likely that continued operation will result in data
              corruption, which may eventually cause the loss of service or the
              service degradation.

Action      : Schedule a repair procedure to replace the affected device. Use
              'fmadm faulty' to find the affected disk.

Step 1: Check to see the ereport class that propagated this fault using fmdump.

bash-3.2# fmdump 

TIME                 UUID                                 SUNW-MSG-ID
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q

bash-3.2# fmdump -V -u 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6

TIME                 UUID                                 SUNW-MSG-ID
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q

TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801

nvlist version: 0
	version = 0x0
	class = list.suspect
	uuid = 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
	code = DISK-8000-4Q
	diag-time = 1222322778 736676
	de = (embedded nvlist)
	nvlist version: 0
		version = 0x0
		scheme = fmd
		authority = (embedded nvlist)
		nvlist version: 0
			version = 0x0
			product-id = Sun Fire X4500
			chassis-id = 00:14:4F:20:E3:08     
			server-id = icecube
		(end authority)

		mod-name = eft
		mod-version = 1.16
	(end de)

	fault-list-sz = 0x1
	fault-list = (array of embedded nvlists)
	(start fault-list[0])
	nvlist version: 0
		version = 0x0
		class = fault.io.scsi.cmd.disk.dev.rqs.merr
		certainty = 0x64
		resource = (embedded nvlist)
		nvlist version: 0
			version = 0x0
			scheme = hc
			hc-root = 
			serial = KRVN63ZAJLP44D
			part = HITACHI-HDS7250SASUN500G-0633KLP44D
			revision = K2AOAJ0A
			authority = (embedded nvlist)
			nvlist version: 0
				product-id = Sun-Fire-X4500
				chassis-id = 00-14-4F-20-E3-08
				server-id = icecube
			(end authority)

			hc-list-sz = 0x3
			hc-list = (array of embedded nvlists)
			(start hc-list[0])
			nvlist version: 0
				hc-name = chassis
				hc-id = 0
			(end hc-list[0])
			(start hc-list[1])
			nvlist version: 0
				hc-name = bay
				hc-id = 23
			(end hc-list[1])
			(start hc-list[2])
			nvlist version: 0
				hc-name = disk
				hc-id = 0
			(end hc-list[2])

			hc-specific = (embedded nvlist)
			nvlist version: 0
				lba = 0x12345678
				ascq = 0x0
				asc = 0x11
				key = 0x3
			(end hc-specific)

		(end resource)

		asru = (embedded nvlist)
		nvlist version: 0
			scheme = dev
			version = 0x0
			device-path = /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
			devid = id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D
		(end asru)

		fru = (embedded nvlist)
		nvlist version: 0
			version = 0x0
			scheme = hc
			hc-root = 
			serial = KRVN63ZAJLP44D
			part = HITACHI-HDS7250SASUN500G-0633KLP44D
			revision = K2AOAJ0A
			authority = (embedded nvlist)
			nvlist version: 0
				product-id = Sun-Fire-X4500
				server-id = icecube
				chassis-id = 00-14-4F-20-E3-08
			(end authority)

			hc-list = (array of embedded nvlists)
			(start hc-list[0])
			nvlist version: 0
				hc-name = chassis
				hc-id = 0
			(end hc-list[0])
			(start hc-list[1])
			nvlist version: 0
				hc-name = bay
				hc-id = 23
			(end hc-list[1])
			(start hc-list[2])
			nvlist version: 0
				hc-name = disk
				hc-id = 0
			(end hc-list[2])

		(end fru)

		location = HD_ID_23
	(end fault-list[0])

	fault-status = 0x1
	__ttl = 0x1
	__tod = 0x48db2a5a 0x2d49f2c8

According to the output of fmdump -V, you can see that this fault was triggered by ereport.io.scsi.cmd.disk.dev.rqs.merr with an ENA of 0x04d1f9bdabb00801.

Step 2: Check the ereport sequence using the ENA you got from Step 1.

bash-3.2# fmdump -ev -n ena=0x04d1f9bdabb00801 

TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep driver-assessment 

	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = fatal

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep op-code 

	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep key 
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3

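Now you can see that the read command (op-code 0x8) was retried five times and finally failed with driver-assessment = fatal. The sense key of 0x3 on every attempt indicates a media error, which matches the merr ereport class. For more information, see the description in Table 3. This is why one of your hard drives was retired.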
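
The three grep commands above can be folded into one pass. The Python sketch below is a rough illustration that works on saved fmdump -eV text: it collects the driver-assessment, op-code, and key members and summarizes the command's history. The function names are hypothetical, the sample text is a reduced stand-in for real fmdump -eV output that keeps only the three members shown above, and the lookup tables list only a few standard SCSI values.

```python
import re

# A few standard SCSI operation codes and sense keys; not an exhaustive list.
SCSI_OPCODES = {0x00: "TEST UNIT READY", 0x08: "READ(6)", 0x0A: "WRITE(6)",
                0x28: "READ(10)", 0x2A: "WRITE(10)"}
SENSE_KEYS = {0x1: "RECOVERED ERROR", 0x2: "NOT READY",
              0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR"}

# Reduced stand-in for `fmdump -eV -n ena=...` output: five retries, then fatal.
SAMPLE = "".join(
    f"\tdriver-assessment = {value}\n\top-code = 0x8\n\tkey = 0x3\n"
    for value in ["retry"] * 5 + ["fatal"]
)


def payload_values(text: str, name: str) -> list:
    """Return every value of one payload member, in log order."""
    pattern = re.compile(
        r"^[ \t]*" + re.escape(name) + r"[ \t]*=[ \t]*(\S+)[ \t]*$", re.MULTILINE)
    return pattern.findall(text)


def summarize_command(text: str) -> dict:
    """Summarize the retries and final outcome of one ENA's ereport sequence."""
    assessments = payload_values(text, "driver-assessment")
    opcodes = {int(v, 16) for v in payload_values(text, "op-code")}
    keys = {int(v, 16) for v in payload_values(text, "key")}
    return {
        "retries": assessments.count("retry"),
        "outcome": assessments[-1] if assessments else None,
        "commands": sorted(SCSI_OPCODES.get(op, hex(op)) for op in opcodes),
        "sense_keys": sorted(SENSE_KEYS.get(k, hex(k)) for k in keys),
    }


if __name__ == "__main__":
    print(summarize_command(SAMPLE))
```

Unlike a plain grep for key, the anchored pattern matches only a member whose name is exactly key.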

Note: An ereport with the value driver-assessment = fatal results in the fault being propagated.

Step 3: Use fmadm to check the faulty device.

Run fmadm faulty -u UUID, where UUID is what you get from the output of fmdump.

bash-3.2# fmadm faulty -u 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 25 14:06:18 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6  DISK-8000-4Q   Critical 

Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects     : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D//pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
                  faulted and taken out of service
FRU         : "HD_ID_23" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:server-id=icecube:serial=KRVN63ZAJLP44D:part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A/chassis=0/bay=23/disk=0)
                  faulty
....................  (details omitted)

Step 4: Replace the faulty drive (ASRU) and use fmadm to clear its faulty status.

bash-3.2# fmadm repair 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 
    
    fmadm: recorded repair to 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6

Now you're able to use these tools to diagnose your system.

Table 1: SCSI FMA Ereport Payload Descriptions

Payload Name	Description

`ENA`	Error Numeric Association. Can be used to associate a series of related ereports.
`detector`	The device that detected the error condition.
`cdb`	Command Descriptor Block.
`driver-assessment`	The action the driver is going to take.
`op-code`	The SCSI command that resulted in the error condition.
`pkt-reason`	Refer to the man page for scsi_pkt(9s), pkt-reason section.
`pkt-state`	Refer to the man page for scsi_pkt(9s), pkt-state section.
`pkt-stats`	Refer to the man page for scsi_pkt(9s), pkt-statistics section.
`stat-code`	SCSI STATUS Code of the SCSI command.
`key`	Sense key of the SCSI command.
`asc`	Additional Sense Code.
`ascq`	Additional Sense Code Qualifier.
`sense-data`	SCSI Sense data sent back from the device.
`lba`	Logical Block Address on the device.
`un-decode-info`	Usually names the payload member that holds an unexpected value, or gives other information as a hint about the undecodable value.
`un-decode-value`	May be empty, or may be used together with un-decode-info to give the undecodable value.

Table 2: Example Usage for FMA Tools

Example Usage	Description

`fmdump -ev`	Show the ereport list with ENA.
`fmdump -e -n <name>=<value>`	Show ereports that match the specified pattern, for example ena=0x04d1f9bdabb00801.
`fmdump -eV`	Show ereport details, usually combined with the -n option.
`fmdump -V -u <uuid>`	Show fault details for the given UUID.
`fmadm faulty -u <uuid>`	Display status information for faulty resources with the given UUID.
`fmadm repair <uuid>`	Set the status of the faulty device with the given UUID back to normal.
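
If you script these steps, it helps to build the command lines in one place. The Python sketch below only constructs argument lists from the invocations used in this article; it does not run anything, and the function names are hypothetical.

```python
def fmdump_fault_details(uuid: str) -> list:
    """Step 1: show the details of one fault."""
    return ["fmdump", "-V", "-u", uuid]


def fmdump_ereports_by_ena(ena: str, verbose: bool = False) -> list:
    """Step 2: list (or, with verbose=True, fully dump) the ereports sharing an ENA."""
    return ["fmdump", "-eV" if verbose else "-ev", "-n", f"ena={ena}"]


def fmadm_faulty(uuid: str) -> list:
    """Step 3: show the faulty resource for one fault."""
    return ["fmadm", "faulty", "-u", uuid]


def fmadm_repair(uuid: str) -> list:
    """Step 4: record the repair after the drive has been replaced."""
    return ["fmadm", "repair", uuid]


if __name__ == "__main__":
    uuid = "1707f2f9-af9c-c76e-a166-bdb5fcc4cad6"
    print(" ".join(fmdump_fault_details(uuid)))
    print(" ".join(fmdump_ereports_by_ena("0x04d1f9bdabb00801", verbose=True)))
```

On a Solaris system, each list could be passed to subprocess.run as root; that part is left out here because it depends on the host.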

Table 3: Available Values of driver-assessment

Value	Description

`fatal`	The sd driver failed the current SCSI command due to a non-recoverable device error (sense key 0x3 or 0x4).
`fail`	The SCSI driver is not going to stop the service, but it cannot guarantee normal service.
`info`	The driver has detected an error, but the services provided by the device instance are unaffected.
`retry`	The SCSI driver is going to retry the failed command, and the service is unaffected.
`recovered`	The sd driver has recovered a SCSI command, and the service is unaffected.
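
When filtering a long ereport log, it can be convenient to encode Table 3 once and ask whether a given value needs attention. The Python sketch below is one possible encoding: the names DriverAssessment and needs_attention are hypothetical, and the split into "service unaffected" and "needs attention" simply follows the wording of the table.

```python
from enum import Enum


class DriverAssessment(Enum):
    """The driver-assessment values listed in Table 3."""
    FATAL = "fatal"
    FAIL = "fail"
    INFO = "info"
    RETRY = "retry"
    RECOVERED = "recovered"


# Values that Table 3 describes as leaving the service unaffected.
SERVICE_UNAFFECTED = {
    DriverAssessment.INFO,
    DriverAssessment.RETRY,
    DriverAssessment.RECOVERED,
}


def needs_attention(value: str) -> bool:
    """True for fatal and fail; raises ValueError for a value not in Table 3."""
    return DriverAssessment(value) not in SERVICE_UNAFFECTED


if __name__ == "__main__":
    for value in ("retry", "fatal"):
        print(value, needs_attention(value))
```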

For More Information

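SCSI DISK FMA Project Part 1: SCSI Device Drivers as FMA Telemetry Detectors -- Describes the concept of "device-as-detector" and gives you a brief introduction to SCSI DISK FMA efforts.
SCSI DISK FMA Project Part 3: FMA Behavior of Retired Faulted SCSI Disks -- Describes what to do when you get an FMA fault on a Solaris system.
SCSI DISK FMA Project Part 4: SD Fault Injection -- Describes a useful tool for programmers to use for testing the sd driver.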

Contents

This is the second article in a series about the SCSI DISK FMA project:

SCSI DISK FMA Project Part 1: SCSI Device Drivers as FMA Telemetry Detectors -- Describes the concept of "device-as-detector" and give you a brief introduction to SCSI DISK FMA efforts.
SCSI DISK FMA Project Part 2: Analyzing SCSI FMA Ereports -- Describes how to use existing tools to analyze a structured FMA ereport log instead of searching syslog.
SCSI DISK FMA Project Part 3: FMA Behavior of Retired Faulted SCSI Disks -- Describes what to do when you get an FMA fault on a Solaris system.
SCSI DISK FMA Project Part 4: SD Fault Injection -- Describes a useful tool for programmers to use for testing the sd driver.

Overview

Error Reports and Payloads

An ereport is described by its event class (hierarchy path) and a payload of name-value pairs that can be used for diagnosis and logging.

Six new ereports are introduced by SCSI FMA:

ereport.io.scsi.cmd.disk.dev.rqs.merr -- Media error
ereport.io.scsi.cmd.disk.dev.rqs.derr -- Device error
ereport.io.scsi.cmd.disk.dev.serr -- SCSI command status error
ereport.io.scsi.cmd.disk.dev.uderr -- Unexpected data error
ereport.io.scsi.cmd.disk.recovered -- SCSI command recovered from a failure
ereport.io.scsi.cmd.disk.tran -- SCSI command transport error

There are many payloads along with these ereports. For analyzing problems, ENA and driver-assessment are really useful.

There are many other useful payloads for analyzing SCSI FMA ereports. Refer to Table 1 for details.

FMA Utilities for Administrators

Utilities are provided for inspecting details of ereports:

fmdump -- A fault management log viewer. The FMA framework maintains two categories of logs: one for faults, and another for ereports. Using fmdump you can see the detail of a specific pattern of ereports and also the fault list produced by the diagnosis engine.
fmadm -- A tool for fault management configuration. It provides many functions, some of them quite frequently used, including viewing the faulty system component and resolving a fault.

Both of these tools need to be run as 'root' user. See Table 2 for example usage of these tools. If you need more detailed instructions, refer to the man page.

An Example of Analyzing Ereports

If you are unlucky one day you might see the following message printed to your console or /var/adm/messages (or even worse, one of your hard drives might be invisible to you when running format).

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 25 14:06:18 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6  DISK-8000-4Q   Critical 

Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects     : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D//pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
                  faulted and taken out of service
FRU         : "HD_ID_23" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:server-id=icecube:serial=KRVN63ZAJLP44D:part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A/chassis=0/bay=23/disk=0)
                  faulty

Description : The command was terminated with a non-recovered error condition
              that may have been caused by a flaw in the media or an error in
              the recorded data. 
              Refer to  for more information.

Response    : The device may be offlined or degraded.

Impact      : It is likely that continued operation will result in data
              corruption, which may eventually cause the loss of service or the
              service degradation.

Action      : Schedule a repair procedure to replace the affected device. Use
              'fmadm faulty' to find the affected disk.

Step 1: Check to see the ereport class that propagated this fault using fmdump.

bash-3.2# fmdump 

TIME                 UUID                                 SUNW-MSG-ID
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q

bash-3.2# fmdump -V -u 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6

TIME                 UUID                                 SUNW-MSG-ID
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q

TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801

nvlist version: 0
	version = 0x0
	class = list.suspect
	uuid = 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
	code = DISK-8000-4Q
	diag-time = 1222322778 736676
	de = (embedded nvlist)
	nvlist version: 0
		version = 0x0
		scheme = fmd
		authority = (embedded nvlist)
		nvlist version: 0
			version = 0x0
			product-id = Sun Fire X4500
			chassis-id = 00:14:4F:20:E3:08     
			server-id = icecube
		(end authority)

		mod-name = eft
		mod-version = 1.16
	(end de)

	fault-list-sz = 0x1
	fault-list = (array of embedded nvlists)
	(start fault-list[0])
	nvlist version: 0
		version = 0x0
		class = fault.io.scsi.cmd.disk.dev.rqs.merr
		certainty = 0x64
		resource = (embedded nvlist)
		nvlist version: 0
			version = 0x0
			scheme = hc
			hc-root = 
			serial = KRVN63ZAJLP44D
			part = HITACHI-HDS7250SASUN500G-0633KLP44D
			revision = K2AOAJ0A
			authority = (embedded nvlist)
			nvlist version: 0
				product-id = Sun-Fire-X4500
				chassis-id = 00-14-4F-20-E3-08
				server-id = icecube
			(end authority)

			hc-list-sz = 0x3
			hc-list = (array of embedded nvlists)
			(start hc-list[0])
			nvlist version: 0
				hc-name = chassis
				hc-id = 0
			(end hc-list[0])
			(start hc-list[1])
			nvlist version: 0
				hc-name = bay
				hc-id = 23
			(end hc-list[1])
			(start hc-list[2])
			nvlist version: 0
				hc-name = disk
				hc-id = 0
			(end hc-list[2])

			hc-specific = (embedded nvlist)
			nvlist version: 0
				lba = 0x12345678
				ascq = 0x0
				asc = 0x11
				key = 0x3
			(end hc-specific)

		(end resource)

		asru = (embedded nvlist)
		nvlist version: 0
			scheme = dev
			version = 0x0
			device-path = /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
			devid = id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D
		(end asru)

		fru = (embedded nvlist)
		nvlist version: 0
			version = 0x0
			scheme = hc
			hc-root = 
			serial = KRVN63ZAJLP44D
			part = HITACHI-HDS7250SASUN500G-0633KLP44D
			revision = K2AOAJ0A
			authority = (embedded nvlist)
			nvlist version: 0
				product-id = Sun-Fire-X4500
				server-id = icecube
				chassis-id = 00-14-4F-20-E3-08
			(end authority)

			hc-list = (array of embedded nvlists)
			(start hc-list[0])
			nvlist version: 0
				hc-name = chassis
				hc-id = 0
			(end hc-list[0])
			(start hc-list[1])
			nvlist version: 0
				hc-name = bay
				hc-id = 23
			(end hc-list[1])
			(start hc-list[2])
			nvlist version: 0
				hc-name = disk
				hc-id = 0
			(end hc-list[2])

		(end fru)

		location = HD_ID_23
	(end fault-list[0])

	fault-status = 0x1
	__ttl = 0x1
	__tod = 0x48db2a5a 0x2d49f2c8

According to the output of fmdump -V, this fault was triggered by ereport.io.scsi.cmd.disk.dev.rqs.merr with an ENA of 0x04d1f9bdabb00801.
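If you save the fmdump -V output to a file, the triggering ereport class and ENA can also be pulled out with a short script. The following is a minimal Python sketch, assuming the text layout shown above; the function name find_trigger_ereports is hypothetical and is not part of any FMA tool.

```python
import re

# Matches a summary line such as:
#   Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
EREPORT_LINE = re.compile(r"(ereport\.[\w.\-]+)\s+(0x[0-9a-fA-F]+)\s*$")


def find_trigger_ereports(fmdump_text):
    """Return (class, ENA) pairs found in saved `fmdump -V` text output."""
    pairs = []
    for line in fmdump_text.splitlines():
        match = EREPORT_LINE.search(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


# Sample copied from the output above (assumed saved as plain text).
sample = """\
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q

TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
"""

for ereport_class, ena in find_trigger_ereports(sample):
    print(ereport_class, ena)
```

The ENA printed here is the value to pass to fmdump -n in the next step.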

Step 2: Check the ereport sequence using the ENA you got from Step 1.

bash-3.2# fmdump -ev -n ena=0x04d1f9bdabb00801 

TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep driver-assessment 

	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = fatal

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep op-code 

	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep key 
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3

Now you see that the read command (op-code = 0x8) was retried five times and then failed with driver-assessment = fatal. See Table 3 for the meaning of each driver-assessment value. This is why one of your hard drives is retired.

Note: An ereport with the value driver-assessment = fatal results in the fault being propagated.
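The same check can be scripted against the grep output. This is a rough Python sketch, assuming the "name = value" lines shown above; summarize_assessments is a hypothetical helper name, and the fault_expected flag only restates the note above (a fatal assessment leads to a fault).

```python
def summarize_assessments(grep_output):
    """Count retries and report the last driver-assessment value seen."""
    values = []
    for line in grep_output.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == "driver-assessment":
            values.append(value.strip())
    return {
        "retries": values.count("retry"),
        "final": values[-1] if values else None,
        # Per the note above, a fatal assessment results in a fault.
        "fault_expected": "fatal" in values,
    }


# Sample copied from the grep output in Step 2.
sample = "\n".join(
    ["\tdriver-assessment = retry"] * 5 + ["\tdriver-assessment = fatal"]
)

summary = summarize_assessments(sample)
print(summary)
```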

Step 3: Use fmadm to check the faulty device.

Run fmadm faulty -u UUID, where UUID is what you get from the output of fmdump.

bash-3.2# fmadm faulty -u 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 25 14:06:18 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6  DISK-8000-4Q   Critical 

Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects     : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D//pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
                  faulted and taken out of service
FRU         : "HD_ID_23" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:server-id=icecube:serial=KRVN63ZAJLP44D:part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A/chassis=0/bay=23/disk=0)
                  faulty
....................  (details omitted)
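The FRU string in this output packs the authority fields and the chassis/bay/disk path into one hc-scheme FMRI. The sketch below, in Python, splits such a string into its parts. It assumes the exact layout printed above and does not handle values that contain ":" or "/"; parse_hc_fmri is a hypothetical name, not an FMA library call.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into (authority dict, path list)."""
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)
    # Everything before the first "/" is the authority section.
    authority_part, _, path_part = fmri[len(prefix):].partition("/")
    authority = {}
    for item in authority_part.split(":"):
        if item:
            name, _, value = item.partition("=")
            authority[name] = value
    path = []
    for element in path_part.split("/"):
        if element:
            name, _, value = element.partition("=")
            path.append((name, value))
    return authority, path


# FRU value copied from the fmadm faulty output above.
fru = ("hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08"
       ":server-id=icecube:serial=KRVN63ZAJLP44D"
       ":part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A"
       "/chassis=0/bay=23/disk=0")

authority, path = parse_hc_fmri(fru)
print(authority["serial"], path)
```

The bay number in the path corresponds to the location label (HD_ID_23) in this example, which is the drive to pull.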

Step 4: Replace the faulty drive (ASRU), then use fmadm to clear the faulty status.

bash-3.2# fmadm repair 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 
    
    fmadm: recorded repair to 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6

Now you're able to use these tools to diagnose your system.

Table 1: SCSI FMA Ereport Payload Descriptions

Payload Name	Description

`ENA`	Error Numeric Association. Can be used to associate a series of related ereports.
`detector`	The device that detected the error condition.
`cdb`	Command Descriptor Block.
`driver-assessment`	The action the driver is going to take.
`op-code`	The SCSI command that resulted in the error condition.
`pkt-reason`	Refer to the man page for scsi_pkt(9s), pkt-reason section.
`pkt-state`	Refer to the man page for scsi_pkt(9s), pkt-state section.
`pkt-stats`	Refer to the man page for scsi_pkt(9s), pkt-statistics section.
`stat-code`	SCSI STATUS Code of the SCSI command.
`key`	Sense key of the SCSI command.
`asc`	Additional Sense Code.
`ascq`	Additional Sense Code Qualifier.
`sense-data`	SCSI Sense data sent back from the device.
`lba`	Logical Block Address on the device.
`un-decode-info`	Usually names the payload that holds an unexpected value, or gives other information as a hint about the undecodable value.
`un-decode-value`	May be empty, or may be used together with un-decode-info to give the undecodable value.
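The numeric op-code, key, asc, and ascq payloads are standard SCSI values, so they can be translated into names. Below is a small Python sketch with deliberately partial lookup tables covering only a few well-known values, including the ones in this article's example. The function name decode_payload and the dictionary layout are assumptions made for illustration.

```python
# Partial lookup tables; only a few well-known SCSI values are listed.
OP_CODES = {0x00: "TEST UNIT READY", 0x08: "READ(6)", 0x0A: "WRITE(6)",
            0x28: "READ(10)", 0x2A: "WRITE(10)"}
SENSE_KEYS = {0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
              0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR",
              0x5: "ILLEGAL REQUEST", 0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def decode_payload(payload):
    """Translate hex payload strings into readable SCSI names."""
    op = int(payload["op-code"], 16)
    key = int(payload["key"], 16)
    asc = int(payload["asc"], 16)
    ascq = int(payload["ascq"], 16)
    return {
        "op-code": OP_CODES.get(op, "unknown (0x%x)" % op),
        "key": SENSE_KEYS.get(key, "unknown (0x%x)" % key),
        "asc/ascq": ASC_ASCQ.get((asc, ascq),
                                 "unknown (0x%x/0x%x)" % (asc, ascq)),
    }


# Values taken from the example ereports and fault in this article.
decoded = decode_payload(
    {"op-code": "0x8", "key": "0x3", "asc": "0x11", "ascq": "0x0"}
)
print(decoded)
```

For the example in this article, the result reads as a READ(6) command that failed with a medium error, which is consistent with the merr ereport class.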

Table 2: Example Usage for FMA Tools

Example Usage	Description

`fmdump -ev`	Show the ereport list with ENA.
`fmdump -e -n name=value`	Show ereports that match the specified pattern, for example -n ena=0x04d1f9bdabb00801.
`fmdump -eV`	Show ereport details, usually combined with the -n option.
`fmdump -V -u UUID`	Show fault details for the given UUID.
`fmadm faulty -u UUID`	Display status information for faulty resources with the given UUID.
`fmadm repair UUID`	Set the status of the faulty device with the given UUID back to normal.
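The commands in Table 2 map directly onto the four steps of the walkthrough. As a rough illustration, the Python sketch below only builds the argument lists for those commands, using the options shown in this article; the function names are hypothetical. Actually running the commands (for example with subprocess.run) is assumed to happen on a Solaris system with sufficient privileges and is not shown.

```python
import shlex


def fault_details(uuid):
    """Step 1: show fault details for a UUID."""
    return ["fmdump", "-V", "-u", uuid]


def ereports_for_ena(ena):
    """Step 2: show detailed ereports that share an ENA."""
    return ["fmdump", "-eV", "-n", "ena=" + ena]


def faulty_status(uuid):
    """Step 3: show the status of the faulty resource."""
    return ["fmadm", "faulty", "-u", uuid]


def mark_repaired(uuid):
    """Step 4: record the repair after the drive is replaced."""
    return ["fmadm", "repair", uuid]


uuid = "1707f2f9-af9c-c76e-a166-bdb5fcc4cad6"
ena = "0x04d1f9bdabb00801"
for argv in (fault_details(uuid), ereports_for_ena(ena),
             faulty_status(uuid), mark_repaired(uuid)):
    print(shlex.join(argv))
```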

Table 3: Available Values of driver-assessment

Value	Description

`fatal`	The SD driver failed the current SCSI command due to a non-recoverable device error (sense key 0x3 or 0x4).
`fail`	The SCSI driver is not going to stop the service, but it cannot guarantee normal service.
`info`	The driver has detected an error, but the services provided by the device instance are unaffected.
`retry`	The SCSI driver is going to retry a failed command and the service is unaffected.
`recovered`	The SD driver has recovered a SCSI command and the service is unaffected.

For More Information

SCSI DISK FMA Project Part 1: SCSI Device Drivers as FMA Telemetry Detectors -- Describes the concept of "device-as-detector" and gives you a brief introduction to SCSI DISK FMA efforts.
SCSI DISK FMA Project Part 3: FMA Behavior of Retired Faulted SCSI Disks -- Describes what to do when you get an FMA fault on a Solaris system.
SCSI DISK FMA Project Part 4: SD Fault Injection -- Describes a useful tool for programmers to use for testing the sd driver.
