Contents
This is the second article in a series about the SCSI DISK FMA project:
- -- Describes the concept of "device-as-detector" and gives you a brief introduction to SCSI DISK FMA efforts.
- -- Describes how to use existing tools to analyze a structured FMA ereport log instead of searching syslog.
- -- Describes what to do when you get an FMA fault on a Solaris system.
- -- Describes a useful tool for programmers to use for testing the sd driver.
Overview
Based on the implementation of SCSI FMA phase III, the sd/ssd (SCSI DISK) driver can send FMA telemetry (ereports) when it detects an error condition. By analyzing the ereports, you can find out what is happening at the kernel driver level. This article describes how to use this new feature (SCSI FMA) to analyze a potential error condition.
Error Reports and Payloads
Ereports (error reports) are generated when an abnormal condition is detected, recorded in persistent storage (for example, a file system) in binary format, and used as input to automated diagnosis engines.
An ereport is described by its event class (hierarchy path) and a payload of name-value pairs that can be used for diagnosis and logging.
Six new ereports are introduced by SCSI FMA:
ereport.io.scsi.cmd.disk.dev.rqs.merr -- Media error
ereport.io.scsi.cmd.disk.dev.rqs.derr -- Device error
ereport.io.scsi.cmd.disk.dev.serr -- SCSI command status error
ereport.io.scsi.cmd.disk.dev.uderr -- Unexpected data error
ereport.io.scsi.cmd.disk.recovered -- SCSI command recovered from a failure
ereport.io.scsi.cmd.disk.tran -- SCSI command transport error
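The class names above form a dotted hierarchy, so a script can recognize them by prefix. The following is a minimal sketch of such a lookup; the function names are hypothetical and the descriptions are copied from the list above.

```python
# Descriptions copied from the list of SCSI FMA ereport classes above.
EREPORT_CLASSES = {
    "ereport.io.scsi.cmd.disk.dev.rqs.merr": "Media error",
    "ereport.io.scsi.cmd.disk.dev.rqs.derr": "Device error",
    "ereport.io.scsi.cmd.disk.dev.serr": "SCSI command status error",
    "ereport.io.scsi.cmd.disk.dev.uderr": "Unexpected data error",
    "ereport.io.scsi.cmd.disk.recovered": "SCSI command recovered from a failure",
    "ereport.io.scsi.cmd.disk.tran": "SCSI command transport error",
}


def describe_ereport(event_class):
    """Return the description of a SCSI FMA ereport class, or None if unknown."""
    return EREPORT_CLASSES.get(event_class)


def is_scsi_disk_ereport(event_class):
    """Return True if the class sits under the ereport.io.scsi.cmd.disk hierarchy."""
    prefix = "ereport.io.scsi.cmd.disk"
    return event_class == prefix or event_class.startswith(prefix + ".")
```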
These ereports carry many payloads. For analyzing problems, ENA and driver-assessment are especially useful.
ENA (error numeric association) is used in SCSI FMA as a link for a sequence of related ereports. For example, a command retried several times that finally succeeds would result in a sequence of posted ereports that are associated by the same ENA value.
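To illustrate how the ENA ties a sequence together, here is a small sketch that groups ereport classes by ENA. It assumes the input is text in the layout printed by fmdump -ev in this article, where each data line ends with the class followed by the ENA; the function name is hypothetical.

```python
from collections import defaultdict


def group_by_ena(fmdump_ev_output):
    """Group ereport classes by ENA from `fmdump -ev` style text.

    Each data line is assumed to end with "<class> <ENA>".
    Header lines and blank lines are skipped.
    """
    groups = defaultdict(list)
    for line in fmdump_ev_output.splitlines():
        fields = line.split()
        # A data line has at least a class and an ENA that starts with 0x.
        if len(fields) < 2 or not fields[-1].startswith("0x"):
            continue
        groups[fields[-1]].append(fields[-2])
    return dict(groups)
```

All ereports posted for one retried command end up in the same list, in the order they were logged.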
The driver-assessment value indicates the action the driver is going to take. This value usually helps the administrator analyze what happened to a specific SCSI command at the kernel level. Table 3 lists the available values of driver-assessment.
There are many other useful payloads for analyzing SCSI FMA ereports. Refer to Table 1 for details.
FMA Utilities for Administrators
Utilities are provided for inspecting details of ereports:
fmdump -- A fault management log viewer. The FMA framework maintains two categories of logs: one for faults, and another for ereports. Using fmdump, you can see the details of ereports that match a specific pattern and also the fault list produced by the diagnosis engine.
fmadm -- A tool for fault management configuration. It provides many functions; the most frequently used include viewing the faulty system component and resolving a fault.
Both of these tools must be run as the root user. See Table 2 for example usage of these tools. If you need more detailed instructions, refer to the man pages.
An Example of Analyzing Ereports
If you are unlucky, one day you might see the following message printed to your console or to /var/adm/messages (or even worse, one of your hard drives might be invisible to you when running format).
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Sep 25 14:06:18 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q Critical
Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D//pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
faulted and taken out of service
FRU : "HD_ID_23" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:server-id=icecube:serial=KRVN63ZAJLP44D:part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A/chassis=0/bay=23/disk=0)
faulty
Description : The command was terminated with a non-recovered error condition
that may have been caused by a flaw in the media or an error in
the recorded data.
Refer to for more information.
Response : The device may be offlined or degraded.
Impact : It is likely that continued operation will result in data
corruption, which may eventually cause the loss of service or the
service degradation.
Action : Schedule a repair procedure to replace the affected device. Use
'fmadm faulty' to find the affected disk.
Step 1: Check the ereport class that propagated this fault using fmdump.
bash-3.2# fmdump
TIME UUID SUNW-MSG-ID
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q
bash-3.2# fmdump -V -u 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
TIME UUID SUNW-MSG-ID
Sep 25 14:06:18.7598 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q
TIME CLASS ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
code = DISK-8000-4Q
diag-time = 1222322778 736676
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = Sun Fire X4500
chassis-id = 00:14:4F:20:E3:08
server-id = icecube
(end authority)
mod-name = eft
mod-version = 1.16
(end de)
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.io.scsi.cmd.disk.dev.rqs.merr
certainty = 0x64
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-root =
serial = KRVN63ZAJLP44D
part = HITACHI-HDS7250SASUN500G-0633KLP44D
revision = K2AOAJ0A
authority = (embedded nvlist)
nvlist version: 0
product-id = Sun-Fire-X4500
chassis-id = 00-14-4F-20-E3-08
server-id = icecube
(end authority)
hc-list-sz = 0x3
hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = chassis
hc-id = 0
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = bay
hc-id = 23
(end hc-list[1])
(start hc-list[2])
nvlist version: 0
hc-name = disk
hc-id = 0
(end hc-list[2])
hc-specific = (embedded nvlist)
nvlist version: 0
lba = 0x12345678
ascq = 0x0
asc = 0x11
key = 0x3
(end hc-specific)
(end resource)
asru = (embedded nvlist)
nvlist version: 0
scheme = dev
version = 0x0
device-path = /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
devid = id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D
(end asru)
fru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-root =
serial = KRVN63ZAJLP44D
part = HITACHI-HDS7250SASUN500G-0633KLP44D
revision = K2AOAJ0A
authority = (embedded nvlist)
nvlist version: 0
product-id = Sun-Fire-X4500
server-id = icecube
chassis-id = 00-14-4F-20-E3-08
(end authority)
hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = chassis
hc-id = 0
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = bay
hc-id = 23
(end hc-list[1])
(start hc-list[2])
nvlist version: 0
hc-name = disk
hc-id = 0
(end hc-list[2])
(end fru)
location = HD_ID_23
(end fault-list[0])
fault-status = 0x1
__ttl = 0x1
__tod = 0x48db2a5a 0x2d49f2c8
According to the output of fmdump -V, this fault was triggered by ereport.io.scsi.cmd.disk.dev.rqs.merr with an ENA of 0x04d1f9bdabb00801.
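Picking the class and ENA out of a long fmdump -V listing by eye is error-prone. As a rough sketch, a regular expression can pull out every ereport class and ENA pair from the saved text; the pattern and function name below are assumptions based only on the layout shown above.

```python
import re

# Matches an "ereport.<class path> 0x<hex ENA>" pair as printed by fmdump.
EREPORT_LINE = re.compile(r"(ereport\.[\w.\-]+)\s+(0x[0-9a-fA-F]+)")


def triggering_ereports(fmdump_v_output):
    """Return (class, ENA) pairs found in `fmdump -V -u UUID` style text."""
    return EREPORT_LINE.findall(fmdump_v_output)
```

The ENA returned here is the value to pass to fmdump -n ena=... in the next step.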
Step 2: Check the ereport sequence using the ENA you got from Step 1.
bash-3.2# fmdump -ev -n ena=0x04d1f9bdabb00801
TIME CLASS ENA
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
Sep 25 14:06:16.6727 ereport.io.scsi.cmd.disk.dev.rqs.merr 0x04d1f9bdabb00801
bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep driver-assessment
driver-assessment = retry
driver-assessment = retry
driver-assessment = retry
driver-assessment = retry
driver-assessment = retry
driver-assessment = fatal
bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep op-code
op-code = 0x8
op-code = 0x8
op-code = 0x8
op-code = 0x8
op-code = 0x8
op-code = 0x8
bash-3.2# fmdump -eV -n ena=0x04d1f9bdabb00801|grep key
key = 0x3
key = 0x3
key = 0x3
key = 0x3
key = 0x3
key = 0x3
Now you see that the read command was retried five times and finally failed with a value of driver-assessment = fatal. For more information, see the description in Table 3. This is why one of your hard drives was retired.
Note: An ereport with the value driver-assessment = fatal results in the fault being propagated.
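The reasoning in this step can be sketched as a short script that counts the retries in one ENA sequence and reports the final assessment. It assumes the input is the text produced by fmdump -eV -n ena=... as shown above; the function name and the keys of the returned dictionary are hypothetical.

```python
def summarize_assessments(fmdump_eV_output):
    """Summarize the driver-assessment values of one ENA sequence.

    Expects text containing lines such as "driver-assessment = retry".
    Lines for other payloads are ignored.
    """
    values = []
    for line in fmdump_eV_output.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == "driver-assessment":
            values.append(value.strip())
    return {
        "retries": values.count("retry"),
        "final": values[-1] if values else None,
        # Per the note above, a fatal assessment results in a fault.
        "fault_expected": "fatal" in values,
    }
```

For the sequence in this example, the summary shows five retries followed by a final value of fatal.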
Step 3: Use fmadm to check the faulty device.
Run fmadm faulty -u UUID, where UUID is the value you get from the output of fmdump.
bash-3.2# fmadm faulty -u 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Sep 25 14:06:18 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6 DISK-8000-4Q Critical
Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects : dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN63ZAJLP44D//pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@5,0
faulted and taken out of service
FRU : "HD_ID_23" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E3-08:server-id=icecube:serial=KRVN63ZAJLP44D:part=HITACHI-HDS7250SASUN500G-0633KLP44D:revision=K2AOAJ0A/chassis=0/bay=23/disk=0)
faulty
.................... (details omitted)
Step 4: Replace the faulty drive (ASRU) and use fmadm to clear the faulty status.
bash-3.2# fmadm repair 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
fmadm: recorded repair to 1707f2f9-af9c-c76e-a166-bdb5fcc4cad6
Now you're able to use these tools to diagnose your system.
Table 1: Ereport Payloads
| Payload | Description |
| ENA | Error Numeric Association. Can be used to associate a series of related ereports. |
| detector | The device that detected the error condition. |
| cdb | Command Descriptor Block. |
| driver-assessment | The action the driver is going to take. |
| op-code | The SCSI command that resulted in the error condition. |
| pkt-reason | Refer to the pkt-reason section of the scsi_pkt(9S) man page. |
| pkt-state | Refer to the pkt-state section of the scsi_pkt(9S) man page. |
| pkt-stats | Refer to the pkt-statistics section of the scsi_pkt(9S) man page. |
| stat-code | SCSI status code of the SCSI command. |
| key | Sense key of the SCSI command. |
| asc | Additional Sense Code. |
| ascq | Additional Sense Code Qualifier. |
| sense-data | SCSI sense data sent back from the device. |
| lba | Logical Block Address on the device. |
| un-decode-info | Usually indicates the payload that stores an unexpected value, or other information that hints at the undecodable value. |
| un-decode-value | Can be empty, or used together with un-decode-info to indicate the undecodable value. |
Table 2: Example Usage of fmdump and fmadm
| Command | Description |
| fmdump -ev | Show the ereport list with ENA. |
| | Show ereports that match the specified pattern. |
| fmdump -eV | Show ereport details, usually combined with the -n option. |
| fmdump -V -u | Show fault details with the given UUID. |
| fmadm faulty -u | Display status information for faulty resources with the given UUID. |
| fmadm repair | Set the status of a faulty device with the given UUID back to normal. |
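If you script this workflow, the commands listed above can be assembled as argument lists. The sketch below only builds the lists and uses only the options that appear in this article; the function names are hypothetical. Passing a list to subprocess.run on a Solaris system, as root, would execute it.

```python
def fmdump_fault_cmd(uuid):
    """Command that shows fault details for the given UUID."""
    return ["fmdump", "-V", "-u", uuid]


def fmdump_ereport_cmd(ena, details=True):
    """Command that shows the ereports matching the given ENA.

    details=True uses -eV (full payloads); details=False uses -ev (list only).
    """
    flag = "-eV" if details else "-ev"
    return ["fmdump", flag, "-n", f"ena={ena}"]


def fmadm_faulty_cmd(uuid):
    """Command that displays the faulty resources for the given UUID."""
    return ["fmadm", "faulty", "-u", uuid]


def fmadm_repair_cmd(uuid):
    """Command that records a repair for the given UUID."""
    return ["fmadm", "repair", uuid]
```

Building argument lists instead of shell strings avoids quoting problems when a UUID or ENA comes from parsed output.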
Table 3: Values of driver-assessment
| Value | Description |
| fatal | The sd driver failed the current SCSI command due to a non-recoverable device error (sense key 0x3 or 0x4). |
| fail | The SCSI driver is not going to stop the service, but it cannot guarantee normal service. |
| info | The driver has detected an error, but the services provided by the device instance are unaffected. |
| retry | The SCSI driver is going to retry a failed command, and the service is unaffected. |
| recovered | The sd driver has recovered a SCSI command, and the service is unaffected. |
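The numeric key and op-code payloads are easier to read once they are mapped to their SCSI names. The sketch below decodes a payload line such as "key = 0x3"; the lookup tables hold only a few well-known values from the SCSI standards and are not complete, and the function name is hypothetical.

```python
# A few sense keys and operation codes from the SCSI standards (not complete).
SENSE_KEYS = {
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}
OP_CODES = {
    0x08: "READ(6)",
    0x0A: "WRITE(6)",
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
}


def decode_payload(payload_line):
    """Decode a "key = 0x3" or "op-code = 0x8" line into a readable label."""
    name, _, value = payload_line.partition("=")
    code = int(value.strip(), 16)  # accepts the 0x prefix used by fmdump
    table = {"key": SENSE_KEYS, "op-code": OP_CODES}.get(name.strip(), {})
    return table.get(code, "unknown")
```

In the example in this article, key = 0x3 decodes to MEDIUM ERROR and op-code = 0x8 decodes to READ(6), which matches the media error fault and the failed read command.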
For More Information
- -- Describes the concept of "device-as-detector" and gives you a brief introduction to SCSI DISK FMA efforts.
- -- Describes what to do when you get an FMA fault on a Solaris system.
- -- Describes a useful tool for programmers to use for testing the sd driver.