2010-06-04 14:10:08

 
 
By , December 2008  
Contents

This is the second article in a series about the SCSI DISK FMA project:

  • -- Describes the concept of "device-as-detector" and gives a brief introduction to the SCSI DISK FMA efforts.
  • -- Describes how to use existing tools to analyze a structured FMA ereport log instead of searching syslog.
  • -- Describes what to do when you get an FMA fault on a Solaris system.
  • -- Describes a useful tool for programmers to use for testing the sd driver.
Overview

With SCSI FMA phase III implemented, the sd/ssd (SCSI disk) driver can emit FMA telemetry (ereports) when it detects an error condition. By analyzing these ereports, you can find out what is happening at the kernel driver level. This article describes how to use this new feature (SCSI FMA) to analyze a potential error condition.

Error Reports and Payloads

Ereports (error reports) are generated upon the detection of an abnormal condition, recorded in persistent storage (for example a file system) in binary format, and used as input to automated diagnosis engines.

An ereport is described by its event class (hierarchy path) and a payload of name-value pairs that can be used for diagnosis and logging.

Six new ereports are introduced by SCSI FMA:

  • ereport.io.scsi.cmd.disk.dev.rqs.merr -- Media error
  • ereport.io.scsi.cmd.disk.dev.rqs.derr -- Device error
  • ereport.io.scsi.cmd.disk.dev.serr -- SCSI command status error
  • ereport.io.scsi.cmd.disk.dev.uderr -- Unexpected data error
  • ereport.io.scsi.cmd.disk.recovered -- SCSI command recovered from a failure
  • ereport.io.scsi.cmd.disk.tran -- SCSI command transport error
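
The six classes above share a common prefix, so they can be kept in a small lookup table. The following is a minimal sketch; the helper name `describe_ereport` is hypothetical and the descriptions are copied from the list above.

```python
# Common prefix of the six SCSI FMA ereport classes listed in the article.
SCSI_EREPORT_PREFIX = "ereport.io.scsi.cmd.disk."

# Class suffix -> description, as given in the list above.
SCSI_EREPORTS = {
    "dev.rqs.merr": "Media error",
    "dev.rqs.derr": "Device error",
    "dev.serr": "SCSI command status error",
    "dev.uderr": "Unexpected data error",
    "recovered": "SCSI command recovered from a failure",
    "tran": "SCSI command transport error",
}


def describe_ereport(event_class):
    """Return the description of a SCSI FMA ereport class, or None."""
    if not event_class.startswith(SCSI_EREPORT_PREFIX):
        return None
    suffix = event_class[len(SCSI_EREPORT_PREFIX):]
    return SCSI_EREPORTS.get(suffix)


print(describe_ereport("ereport.io.scsi.cmd.disk.dev.rqs.merr"))
```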

Each of these ereports carries many payload members. Two of them, ENA and driver-assessment, are particularly useful for analyzing problems.

ENA (error numeric association) is used in SCSI FMA as a link for a sequence of related ereports. For example, a command retried several times that finally succeeds would result in a sequence of posted ereports that are associated by the same ENA value.
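
To illustrate how the ENA ties a sequence together, here is a minimal sketch that groups rows of saved `fmdump -ev` output by their ENA. It assumes the three-column layout (TIME, CLASS, ENA) shown later in this article; the function name `group_by_ena` is hypothetical.

```python
from collections import defaultdict


def group_by_ena(lines):
    """Group 'fmdump -ev' rows by their ENA (assumed to be the last column)."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Data rows end in a hex ENA; skip the header row and blank lines.
        if len(fields) < 2 or not fields[-1].startswith("0x"):
            continue
        ena, event_class = fields[-1], fields[-2]
        groups[ena].append(event_class)
    return dict(groups)


sample = """\
TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
""".splitlines()

for ena, classes in group_by_ena(sample).items():
    print(ena, len(classes), "ereport(s)")
```

All ereports posted for one command, including its retries, end up in the same group.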

The driver-assessment value indicates the action the driver is going to take. It usually helps the administrator work out what happened to a specific SCSI command at the kernel level. Table 3 lists the available values of driver-assessment.

Many other payload members are useful for analyzing SCSI FMA ereports. Refer to Table 1 for details.

FMA Utilities for Administrators

Utilities are provided for inspecting details of ereports:

  • fmdump -- A fault management log viewer. The FMA framework maintains two categories of logs: one for faults, and another for ereports. Using fmdump you can see the detail of a specific pattern of ereports and also the fault list produced by the diagnosis engine.
     
  • fmadm -- A tool for fault management configuration. It provides many functions, some of them quite frequently used, including viewing the faulty system component and resolving a fault.

Both of these tools must be run as the root user. See Table 2 for example usage of these tools. If you need more detailed instructions, refer to the fmdump(1M) and fmadm(1M) man pages.

An Example of Analyzing Ereports

If you are unlucky, one day you might see the following message on your console or in /var/adm/messages (or, even worse, one of your hard drives might no longer be visible when you run format).

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 25 14:06:18 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6  DISK-8000-4Q   Critical 

Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects     : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D//pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
                  faulted and taken out of service
FRU         : "HD_ID_23" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:server-id=icecube:serial=KRVN63ZAJLP44D:part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A/chassis=0/bay=23/disk=0)
                  faulty

Description : The command was terminated with a non-recovered error condition
              that may have been caused by a flaw in the media or an error in
              the recorded data. 
              Refer to  for more information.

Response    : The device may be offlined or degraded.

Impact      : It is likely that continued operation will result in data
              corruption, which may eventually cause the loss of service or the
              service degradation.

Action      : Schedule a repair procedure to replace the affected device. Use
              'fmadm faulty' to find the affected disk.
 
Step 1: Check to see the ereport class that propagated this fault using fmdump.
 
bash-3.2# fmdump 

TIME                 UUID                                 SUNW-MSG-ID
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q

bash-3.2# fmdump -V -u 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6

TIME                 UUID                                 SUNW-MSG-ID
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q

TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801

nvlist version: 0
	version = 0x0
	class = list.suspect
	uuid = 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
	code = DISK-8000-4Q
	diag-time = 1222322778 736676
	de = (embedded nvlist)
	nvlist version: 0
		version = 0x0
		scheme = fmd
		authority = (embedded nvlist)
		nvlist version: 0
			version = 0x0
			product-id = Sun Fire X4500
			chassis-id = 00:14:4F:20:E3:08     
			server-id = icecube
		(end authority)

		mod-name = eft
		mod-version = 1.16
	(end de)

	fault-list-sz = 0x1
	fault-list = (array of embedded nvlists)
	(start fault-list[0])
	nvlist version: 0
		version = 0x0
		class = fault.io.scsi.cmd.disk.dev.rqs.merr
		certainty = 0x64
		resource = (embedded nvlist)
		nvlist version: 0
			version = 0x0
			scheme = hc
			hc-root = 
			serial = KRVN63ZAJLP44D
			part = HITACHI-HDS7250SASUN500G-0633KLP44D
			revision = K2AOAJ0A
			authority = (embedded nvlist)
			nvlist version: 0
				product-id = Sun-Fire-X4500
				chassis-id = 00-14-4F-20-E3-08
				server-id = icecube
			(end authority)

			hc-list-sz = 0x3
			hc-list = (array of embedded nvlists)
			(start hc-list[0])
			nvlist version: 0
				hc-name = chassis
				hc-id = 0
			(end hc-list[0])
			(start hc-list[1])
			nvlist version: 0
				hc-name = bay
				hc-id = 23
			(end hc-list[1])
			(start hc-list[2])
			nvlist version: 0
				hc-name = disk
				hc-id = 0
			(end hc-list[2])

			hc-specific = (embedded nvlist)
			nvlist version: 0
				lba = 0x12345678
				ascq = 0x0
				asc = 0x11
				key = 0x3
			(end hc-specific)

		(end resource)

		asru = (embedded nvlist)
		nvlist version: 0
			scheme = dev
			version = 0x0
			device-path = /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
			devid = id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D
		(end asru)

		fru = (embedded nvlist)
		nvlist version: 0
			version = 0x0
			scheme = hc
			hc-root = 
			serial = KRVN63ZAJLP44D
			part = HITACHI-HDS7250SASUN500G-0633KLP44D
			revision = K2AOAJ0A
			authority = (embedded nvlist)
			nvlist version: 0
				product-id = Sun-Fire-X4500
				server-id = icecube
				chassis-id = 00-14-4F-20-E3-08
			(end authority)

			hc-list = (array of embedded nvlists)
			(start hc-list[0])
			nvlist version: 0
				hc-name = chassis
				hc-id = 0
			(end hc-list[0])
			(start hc-list[1])
			nvlist version: 0
				hc-name = bay
				hc-id = 23
			(end hc-list[1])
			(start hc-list[2])
			nvlist version: 0
				hc-name = disk
				hc-id = 0
			(end hc-list[2])

		(end fru)

		location = HD_ID_23
	(end fault-list[0])

	fault-status = 0x1
	__ttl = 0x1
	__tod = 0x48db2a5a 0x2d49f2c8
 

According to the output of fmdump -V, this fault was triggered by ereport.io.scsi.cmd.disk.dev.rqs.merr with an ENA of 0x04d1f9bdabb00801.
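
Some values in the list.suspect payload are printed in hexadecimal and are easier to read once decoded. The sketch below assumes that `__tod` holds seconds and nanoseconds since the Unix epoch; this is an assumption, although it is consistent with the listing, because the first word equals the `diag-time` seconds (1222322778) and the second matches the `.7598` fraction in the fmdump timestamp. The helper name `decode_tod` is hypothetical.

```python
from datetime import datetime, timezone


def decode_tod(tod):
    """Split a '__tod' value into (UTC datetime, nanoseconds).

    Assumes the two hex words are seconds and nanoseconds since the epoch.
    """
    sec_hex, nsec_hex = tod.split()
    seconds = int(sec_hex, 16)
    nanoseconds = int(nsec_hex, 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanoseconds


when, nsec = decode_tod("0x48db2a5a 0x2d49f2c8")
print(when.isoformat(), nsec)

# 'certainty = 0x64' from the fault-list entry, converted to decimal.
print(int("0x64", 16))
```

The decoded time is 06:06:18 UTC on September 25, 2008, while fmdump printed 14:06:18, which suggests the system in the example was running eight hours ahead of UTC. The certainty value decodes to 100.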

Step 2: Check the ereport sequence using the ENA you got from Step 1.
 
bash-3.2# fmdump -ev -n ena=0x04d1f9bdabb00801 

TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep driver-assessment 

	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = fatal

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep op-code 

	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep key 
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
 

Now you can see that the read command (op-code 0x8) was retried five times and finally failed with driver-assessment = fatal. For more information, see the description in Table 3. This is why one of your hard drives was retired.

Note: An ereport with the value driver-assessment = fatal results in the fault being propagated.
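
The same reading of the grep output can be automated. This is a minimal sketch that counts the retries and reports the last driver-assessment in a sequence; it assumes the `name = value` line format shown above, and the function name `summarize_assessments` is hypothetical.

```python
def summarize_assessments(grep_output):
    """Count retries and report the last driver-assessment in a sequence."""
    values = [
        line.split("=", 1)[1].strip()
        for line in grep_output.splitlines()
        if "driver-assessment" in line and "=" in line
    ]
    return {
        "ereports": len(values),
        "retries": values.count("retry"),
        "final": values[-1] if values else None,
    }


# The six lines printed by the grep command in Step 2.
sample = "\n".join(
    ["\tdriver-assessment = retry"] * 5 + ["\tdriver-assessment = fatal"]
)

summary = summarize_assessments(sample)
print(summary)
```

A final value of fatal is the case that leads to a fault; a sequence that ends in recovered would indicate that a retry eventually succeeded.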

Step 3: Use fmadm to check the faulty device.

Run fmadm faulty -u UUID, where UUID is what you get from the output of fmdump.

bash-3.2# fmadm faulty -u 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 25 14:06:18 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6  DISK-8000-4Q   Critical 

Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects     : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D//pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
                  faulted and taken out of service
FRU         : "HD_ID_23" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:server-id=icecube:serial=KRVN63ZAJLP44D:part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A/chassis=0/bay=23/disk=0)
                  faulty
....................  (details omitted)
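
The FRU line packs the identity and the physical location of the disk into one hc-scheme string. The following sketch splits that string into its parts; it is a simplified parser that assumes no escaped `:` or `/` characters appear in the values, and the function name `parse_hc_fmri` is hypothetical.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into (authority dict, list of (name, id))."""
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI: " + fmri)
    # Everything before the first '/' is the ':'-separated authority part.
    authority_part, _, path_part = fmri[len(prefix):].partition("/")
    authority = dict(
        item.split("=", 1) for item in authority_part.split(":") if item
    )
    path = [tuple(item.split("=", 1)) for item in path_part.split("/") if item]
    return authority, path


fru = (
    "hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08"
    ":server-id=icecube:serial=KRVN63ZAJLP44D"
    ":part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A"
    "/chassis=0/bay=23/disk=0"
)

authority, path = parse_hc_fmri(fru)
print(authority["serial"], path)
```

The serial number and the bay number are usually what you need to locate the physical drive.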
 
Step 4: Replace the faulty drive (ASRU) and use fmadm to clear the faulty status.
 
bash-3.2# fmadm repair 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 
    
    fmadm: recorded repair to 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
 

Now you're able to use these tools to diagnose your system.

Table 1: SCSI FMA Ereport Payload Descriptions

Payload Name       Description
-----------------  ------------------------------------------------------------
ENA                Error Numeric Association. Can be used to associate a series
                   of related ereports.
detector           The device that detected the error condition.
cdb                Command Descriptor Block.
driver-assessment  The action the driver is going to take.
op-code            The SCSI command that resulted in the error condition.
pkt-reason         Refer to the man page for scsi_pkt(9S), pkt-reason section.
pkt-state          Refer to the man page for scsi_pkt(9S), pkt-state section.
pkt-stats          Refer to the man page for scsi_pkt(9S), pkt-statistics
                   section.
stat-code          SCSI status code of the SCSI command.
key                Sense key of the SCSI command.
asc                Additional Sense Code.
ascq               Additional Sense Code Qualifier.
sense-data         SCSI sense data sent back from the device.
lba                Logical Block Address on the device.
un-decode-info     Usually names the payload member that holds an unexpected
                   value, or gives other information as a hint about the
                   undecodable value.
un-decode-value    Can be empty, or can be used together with un-decode-info
                   to indicate the undecodable value.
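
The op-code, key, asc, and ascq values are raw SCSI codes. The sketch below decodes the values seen in the example. The sense key names and the two other entries follow the SCSI standard assignments, but the tables are deliberately incomplete and the function name `decode_payload` is hypothetical.

```python
# Sense key names as assigned by the SCSI standard (subset).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
}

# Only the entries needed for the example in this article.
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}
OP_CODES = {0x08: "READ(6)"}


def decode_payload(payload):
    """Translate hex payload strings into (command, sense key, asc/ascq text)."""
    op = int(payload["op-code"], 16)
    key = int(payload["key"], 16)
    asc = int(payload["asc"], 16)
    ascq = int(payload["ascq"], 16)
    return (
        OP_CODES.get(op, "unknown op-code"),
        SENSE_KEYS.get(key, "unknown sense key"),
        ASC_ASCQ.get((asc, ascq), "unknown asc/ascq"),
    )


example = {"op-code": "0x8", "key": "0x3", "asc": "0x11", "ascq": "0x0"}
print(decode_payload(example))
```

For the example fault this gives a READ(6) command that failed with a medium error, which agrees with the merr ereport class.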
 
 
Table 2: Example Usage for FMA Tools

Example Usage            Description
-----------------------  ------------------------------------------------------
fmdump -ev               Show the ereport list with ENA.
fmdump -e -n name=value  Show ereports that match the specified pattern.
fmdump -eV               Show ereport details, usually combined with the -n
                         option.
fmdump -V -u UUID        Show fault details for the given UUID.
fmadm faulty -u UUID     Display status information for faulty resources with
                         the given UUID.
fmadm repair UUID        Set the status of the faulty device with the given
                         UUID back to normal.
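
If you script these steps, it helps to build the command lines in one place. This is a minimal sketch that only assembles argument lists from the options used in this article; the function names are hypothetical, and nothing is executed here.

```python
def fmdump_fault(uuid):
    """Arguments for showing the details of one fault (Step 1)."""
    return ["fmdump", "-V", "-u", uuid]


def fmdump_ereports_by_ena(ena, details=False):
    """Arguments for listing the ereports that share one ENA (Step 2)."""
    return ["fmdump", "-eV" if details else "-ev", "-n", "ena=" + ena]


def fmadm_faulty(uuid):
    """Arguments for showing the faulty resource of one fault (Step 3)."""
    return ["fmadm", "faulty", "-u", uuid]


uuid = "1707f2f9-af9c-c76e-a166-bdb5fcc4cad6"
ena = "0x04d1f9bdabb00801"

for argv in (fmdump_fault(uuid), fmdump_ereports_by_ena(ena), fmadm_faulty(uuid)):
    print(" ".join(argv))
```

On a Solaris system, each list could be passed to `subprocess.run(argv, check=True)` from a process running as root.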
 
 
Table 3: Available Values of driver-assessment

Value      Description
---------  --------------------------------------------------------------------
fatal      The sd driver failed the current SCSI command due to a
           non-recoverable device error (sense key 0x3 or 0x4).
fail       The SCSI driver is not going to stop the service, but it cannot
           guarantee normal service.
info       The driver has detected an error, but the services provided by the
           device instance are unaffected.
retry      The SCSI driver is going to retry a failed command, and the service
           is unaffected.
recovered  The sd driver has recovered a SCSI command, and the service is
           unaffected.
 
For More Information
  • -- Describes the concept of "device-as-detector" and gives a brief introduction to the SCSI DISK FMA efforts.
  • -- Describes what to do when you get an FMA fault on a Solaris system.
  • -- Describes a useful tool for programmers to use for testing the sd driver.
 

		(end fru)

		location = HD_ID_23
	(end fault-list[0])

	fault-status = 0x1
	__ttl = 0x1
	__tod = 0x48db2a5a 0x2d49f2c8
 

According to the output of fmdump -V, this fault was triggered by ereport.io.scsi.cmd.disk.dev.rqs.merr with an ENA of 0x04d1f9bdabb00801.
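If you save the fmdump -V output to a file, you can pull the ereport class and ENA out of it with a few lines of Python. This is only a sketch: the function name triggering_ereports is hypothetical, and the regular expression assumes the summary rows look exactly like the ones shown above.

```python
import re

# Matches an ereport summary row such as:
#   Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
# Assumption: rows always have the "month day time class ENA" layout shown above.
EREPORT_ROW = re.compile(
    r"^\w{3}\s+\d+\s+[\d:.]+\s+(ereport\.\S+)\s+(0x[0-9a-fA-F]+)\s*$"
)


def triggering_ereports(fmdump_text):
    """Return a list of (ereport class, ENA) pairs found in fmdump -V text."""
    pairs = []
    for line in fmdump_text.splitlines():
        match = EREPORT_ROW.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


# The first lines of the fmdump -V -u output from Step 1.
SAMPLE = """\
TIME                 UUID                                 SUNW-MSG-ID
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q

TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
"""

for ereport_class, ena in triggering_ereports(SAMPLE):
    print(ereport_class, ena)
```

The ENA returned here is the value you pass to fmdump -n in Step 2.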

Step 2: Check the ereport sequence using the ENA you got from Step 1.
 
bash-3.2# fmdump -ev -n ena=0x04d1f9bdabb00801 

TIME                 CLASS                                 ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep driver-assessment 

	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = retry
	driver-assessment = fatal

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep op-code 

	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8
	op-code = 0x8

bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep key 
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
	key = 0x3
 

The output shows that the read command (op-code 0x8) was retried five times, each time with sense key 0x3, and then failed with driver-assessment = fatal. For more information, see the description in Table 3. This is why one of your hard drives was retired.

Note: An ereport with the value driver-assessment = fatal results in the fault being propagated.
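The retry count can also be checked mechanically instead of by eye. The following Python sketch reads saved fmdump -eV text, collects the driver-assessment values in order, and reports how many retries happened and whether the sequence ended in fatal. The function names are hypothetical, and the sketch assumes each payload appears on its own "name = value" line as in the output above.

```python
def assessments(fmdump_ev_text):
    """Collect driver-assessment values in the order they appear."""
    values = []
    for line in fmdump_ev_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == "driver-assessment":
            values.append(value.strip())
    return values


def summarize(values):
    """Count retries and report whether the last assessment was 'fatal'."""
    return {
        "retries": values.count("retry"),
        "ended_fatal": bool(values) and values[-1] == "fatal",
    }


# The grep output from Step 2: five retries followed by a fatal assessment.
SAMPLE = "\n".join(
    ["\tdriver-assessment = retry"] * 5 + ["\tdriver-assessment = fatal"]
)

print(summarize(assessments(SAMPLE)))
```

A result with ended_fatal set to True corresponds to the case described in the note above, where the ereport leads to a fault.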

Step 3: Use fmadm to check the faulty device.

Run fmadm faulty -u UUID, where UUID is what you get from the output of fmdump.

bash-3.2# fmadm faulty -u 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 25 14:06:18 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6  DISK-8000-4Q   Critical 

Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects     : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D//pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
                  faulted and taken out of service
FRU         : "HD_ID_23" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:server-id=icecube:serial=KRVN63ZAJLP44D:part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A/chassis=0/bay=23/disk=0)
                  faulty
....................  (details omitted)
 
Step 4: Replace the faulty drive (ASRU) and use fmadm to clear its faulty status.
 
bash-3.2# fmadm repair 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 
    
    fmadm: recorded repair to 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
 

Now you're able to use these tools to diagnose your system.

Table 1: SCSI FMA Ereport Payload Descriptions

Payload Name       Description
ENA                Error Numeric Association. Can be used to associate a series of related ereports.
detector           The device that detected the error condition.
cdb                Command Descriptor Block.
driver-assessment  The action the driver is going to take.
op-code            The SCSI command that resulted in the error condition.
pkt-reason         Refer to the man page for scsi_pkt(9S), pkt-reason section.
pkt-state          Refer to the man page for scsi_pkt(9S), pkt-state section.
pkt-stats          Refer to the man page for scsi_pkt(9S), pkt-statistics section.
stat-code          SCSI status code of the SCSI command.
key                Sense key of the SCSI command.
asc                Additional Sense Code.
ascq               Additional Sense Code Qualifier.
sense-data         SCSI sense data sent back from the device.
lba                Logical Block Address on the device.
un-decode-info     Usually names the payload that holds an unexpected value, or gives other information that hints at the undecodable value.
un-decode-value    May be empty, or may be used together with un-decode-info to give the undecodable value.
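The numeric key, asc, ascq, and op-code payloads are easier to read once decoded. The Python sketch below maps a handful of values to their names. The names come from the SCSI standards rather than from this article, the lookup tables are deliberately incomplete, and parse_payload and describe are hypothetical helpers.

```python
# Partial lookup tables taken from the SCSI standards (not from this article).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}
OP_CODES = {0x08: "READ(6)", 0x0A: "WRITE(6)", 0x28: "READ(10)", 0x2A: "WRITE(10)"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def parse_payload(text):
    """Parse 'name = 0x...' lines into a dict of integers; skip other lines."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        value = value.strip()
        if sep and value.startswith("0x"):
            try:
                fields[name.strip()] = int(value, 16)
            except ValueError:
                pass  # e.g. a line holding several values
    return fields


def describe(fields):
    """Build a one-line description from the decoded payloads."""
    op = OP_CODES.get(fields.get("op-code"), "unknown command")
    key = SENSE_KEYS.get(fields.get("key"), "unknown sense key")
    detail = ASC_ASCQ.get((fields.get("asc"), fields.get("ascq")), "unlisted ASC/ASCQ")
    return f"{op}: {key} ({detail})"


# Payload values that appear in the fmdump output earlier in this article.
SAMPLE = """\
	op-code = 0x8
	key = 0x3
	asc = 0x11
	ascq = 0x0
	lba = 0x12345678
"""

print(describe(parse_payload(SAMPLE)))
```

For the fault in this article, the decoded values agree with the media error (merr) ereport class: a read command that ended with a medium error.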
 
 
Table 2: Example Usage for FMA Tools

Example Usage             Description
fmdump -ev                Show the ereport list with ENA.
fmdump -e -n name=value   Show ereports that match the specified pattern, for example ena=0x04d1f9bdabb00801.
fmdump -eV                Show ereport details; usually combined with the -n option.
fmdump -V -u UUID         Show fault details for the given UUID.
fmadm faulty -u UUID      Display status information for faulty resources with the given UUID.
fmadm repair UUID         Set the status of the faulty device with the given UUID back to normal.
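Because the ENA associates related ereports, a saved fmdump -ev listing can be grouped by ENA to see each error sequence at a glance. The following Python sketch shows one way to do that. The function name group_by_ena is hypothetical, and the second ENA in the sample is invented purely to illustrate the grouping.

```python
def group_by_ena(fmdump_ev_text):
    """Map each ENA to the list of ereport classes reported under it."""
    groups = {}
    for line in fmdump_ev_text.splitlines():
        parts = line.split()
        # A data row ends with "<class> <ENA>"; headers and blank lines are skipped.
        if (
            len(parts) >= 2
            and parts[-2].startswith("ereport.")
            and parts[-1].startswith("0x")
        ):
            groups.setdefault(parts[-1], []).append(parts[-2])
    return groups


# Row taken from the fmdump -ev output in Step 2 (it appears six times there).
DOC_ROW = "Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801"
# Hypothetical row with a different ENA, added only to show the grouping.
EXTRA_ROW = "Sep 25 14:09:00.0000 ereport.io.scsi.cmd.disk.recovered 0x04d1f9bdabb00c01"

SAMPLE = "\n".join(
    ["TIME                 CLASS                                 ENA"]
    + [DOC_ROW] * 6
    + [EXTRA_ROW]
)

for ena, classes in group_by_ena(SAMPLE).items():
    print(ena, len(classes), classes[0])
```

Each key in the result is an ENA that you can pass to fmdump -eV -n to see the full payloads for that sequence.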
 
 
Table 3: Available Values of driver-assessment

Value      Description
fatal      The sd driver failed the current SCSI command because of a non-recoverable device error (sense key 0x3 or 0x4).
fail       The sd driver is not going to stop the service, but it cannot guarantee normal service.
info       The driver has detected an error, but the services provided by the device instance are unaffected.
retry      The sd driver is going to retry a failed command, and the service is unaffected.
recovered  The sd driver has recovered a SCSI command, and the service is unaffected.
 