2011-08-15 15:36:20

SMART Disk Errors

/var/log/messages

These messages are logged with the service set to smartd, so each line starts as follows:

Nov 14 16:08:23 lsfnfs02 smartd[17081]:

Many messages then continue with

Device: /dev/twe0 [3ware_disk_01]

Since this text is the same for most messages, it is not shown in the message contents below.
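As a rough illustration of the line layout described above, the following sketch splits a smartd syslog line into its prefix fields and the message contents. The regular expression and the function name parse_smartd_line are assumptions made for this example, and the sample line is assembled from the fragments shown above rather than copied from a real log.

```python
import re

# Prefix described above: syslog timestamp, host, "smartd[pid]:", then an
# optional "Device: <path> [<label>]" part. The exact punctuation after the
# device part is an assumption.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+"
    r"(?P<host>\S+)\s+smartd\[(?P<pid>\d+)\]:\s*"
    r"(?:Device:\s*(?P<device>/dev/\w+),?\s*(?:\[(?P<label>[^\]]+)\],?\s*)?)?"
    r"(?P<message>.*)$"
)


def parse_smartd_line(line):
    """Split one smartd syslog line into its fields, or return None."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


# Sample line built from the fragments quoted in the text (illustrative only).
sample = ("Nov 14 16:08:23 lsfnfs02 smartd[17081]: "
          "Device: /dev/twe0 [3ware_disk_01], opened")
fields = parse_smartd_line(sample)
print(fields["device"], fields["label"], fields["message"])
```

The message field is the part that the table below refers to as the message contents.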

not capable of SMART self-check
Error SMART Values Read failed: Input/output error
    Replace disk using

Input/output error
    Replace disk mentioned in the next line using

Read SMART Self Test Log Failed
    Replace disk using

Device /dev/sda, please try '-d 3ware,N'
    An error in the NCM component for smartd has left the 3ware controller entries configured incorrectly. Contact the disk support for analysis.

scsiModePageOffer: raw_curr too small, offset=106
Device: /dev/sda, Bad IEC (SMART) mode page, err=5, skip device
    Caused by entries for a 3ware RAID unit in /etc/smartd.conf. 3ware configurations should not have the entries for /dev/sd in smartd.conf, only the ones for the 3ware physical disks. Check the CDB profile and the response from smartctl -d /dev/sda

Currently unreadable (pending) sectors
    If the number of pending sectors is greater than 5, create a vendor call. Otherwise, keep watching, since the number may increase.

Offline uncorrectable sectors
    If the number of offline uncorrectable sectors is greater than 5, create a vendor call.

not ATA, no IDENTIFY DEVICE Structure
    The disk is broken and should be replaced using

FAILED SMART self-check. BACK UP DATA NOW!
    This message can appear if the self-tests are aborted manually in order to clear out the error log. If this is the case, run a series of short tests to clear the 20-entry error log history. If this is not the case, the disk is failing and should be immediately replaced through

execute Short Self-Test failed.
    Replace disk via

starting scheduled Short Self-Test.
    This is information only. There is no error, so the message can be ignored.

same Attribute has different ID numbers: 196 = 196 = 194
    This is information only. There is no error, so the message can be ignored.

SMART Usage Attribute: 9 Power_On_Hours changed from 57 to 56
    This is information only. There is no error, so the message can be ignored.

SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61
    This is reported on Seagate disks. It is the time between ECC hardware failures (which are recovered automatically). The value will vary and does not indicate a hardware problem unless it is very low (e.g. <5). Typical values are around 60.

is SMART capable. Adding to "monitor" list.
    This is information only. There is no error, so the message can be ignored.

opened
    This is information only. There is no error, so the message can be ignored.

found in smartd database.
    This is information only. There is no error, so the message can be ignored.

Failed SMART usage Attribute: 194 Temperature_Celsius.
    The disk has experienced a high temperature alarm. The current configuration should be reviewed urgently with the TSI section, since this indicates a problem with the cooling around the machine. If only a single disk is affected, a vendor call can be opened via

SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 249 to 248
    This is information only. There is no error, so the message can be ignored.

SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 277 to 276
    This is information only. There is no error, so the message can be ignored.

Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
    This does not seem to be fatal and happens on good disk arrays such as with the 8006-2LP. For the moment, this message can be ignored.

# 9 Short offline Completed: read failure
    The SMART disk test did not complete OK. Identify the failing disk using and open a vendor call to replace with

ERROR: Verify failed: Port #0.
    Some verify operations on 8xxx controllers seem to report a failed verify status to the log although the verify actually seems to have succeeded. Check the status with tw_cli to see that the unit is OK and not DEGRADED. If it is OK, there is no problem. Otherwise, follow the usual procedures using lemon-host-check to identify the error.

SMART Support is: Unavailable - Packet Interface Devices [this device: Array controller] does not support SMART
    Raise a vendor call to replace the disk, since it is not responding to SMART commands.
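The sector-count rules and the information-only entries in the table above can be sketched as a small triage function. This is a minimal illustration covering only a subset of the table. It assumes that smartd writes the sector count directly before the message text (the table omits the number), and the names triage, SECTOR_LIMIT and INFO_MARKERS are hypothetical.

```python
import re

# Threshold from the table: more than 5 sectors warrants a vendor call.
SECTOR_LIMIT = 5

# Assumption: the count precedes the message text in the log line.
PENDING_RE = re.compile(r"(\d+) Currently unreadable \(pending\) sectors")
OFFLINE_RE = re.compile(r"(\d+) Offline uncorrectable sectors")

# A subset of the messages the table marks as information only.
INFO_MARKERS = (
    "starting scheduled Short Self-Test",
    'Adding to "monitor" list',
    "found in smartd database",
)


def triage(message):
    """Return 'vendor call', 'watch', 'ignore' or 'unclassified'."""
    pending = PENDING_RE.search(message)
    if pending:
        # Below the limit the table says to keep watching.
        return "vendor call" if int(pending.group(1)) > SECTOR_LIMIT else "watch"
    offline = OFFLINE_RE.search(message)
    if offline and int(offline.group(1)) > SECTOR_LIMIT:
        return "vendor call"
    if any(marker in message for marker in INFO_MARKERS):
        return "ignore"
    # Everything else still needs a manual look-up in the table.
    return "unclassified"


print(triage("7 Currently unreadable (pending) sectors"))
```

Messages that fall through as unclassified are deliberately not guessed at; the table remains the reference for them.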

To identify which disk has generated a smartd error, the following steps should be taken

Smart Test Errors

The smartctl command gives access to the logs of the self-tests previously run by smartd. For example,

# /usr/sbin/smartctl -l selftest --device=3ware,0 /dev/twe0
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is

Warning! SMART Attribute Thresholds Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2945 -
# 2 Short offline Completed without error 00% 2930 -
# 3 Extended offline Completed without error 00% 2907 -
# 4 Short offline Completed without error 00% 2883 -
# 5 Short offline Completed without error 00% 2859

Test #1 is the most recent test. New tests can be run manually using smartctl -tshort or smartctl -tlong.
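For scripted checks, the rows of the self-test log shown above can be turned into records. The sketch below is one possible parser, assuming the column order of the sample output; the names SelfTest, ROW_RE and parse_selftest_log are made up for this example, and real smartctl output may need a more tolerant pattern.

```python
import re
from typing import NamedTuple, Optional


class SelfTest(NamedTuple):
    num: int
    kind: str
    status: str
    remaining_pct: int
    lifetime_hours: int
    first_error_lba: Optional[str]


# Column order follows the sample log: number, test description, status,
# remaining percentage, lifetime hours, optional LBA of first error.
ROW_RE = re.compile(
    r"^#\s*(\d+)\s+(Short offline|Extended offline|Conveyance offline)\s+"
    r"(.*?)\s+(\d+)%\s+(\d+)(?:\s+(\S+))?\s*$"
)


def parse_selftest_log(text):
    """Return the self-test rows found in 'smartctl -l selftest' output."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            num, kind, status, pct, hours, lba = match.groups()
            rows.append(SelfTest(int(num), kind, status, int(pct), int(hours),
                                 None if lba in (None, "-") else lba))
    return rows


sample = """\
# 1 Short offline Completed without error 00% 2945 -
# 3 Extended offline Completed without error 00% 2907 -
"""
rows = parse_selftest_log(sample)
print(rows[0].kind, rows[0].lifetime_hours)
```

Because the log lists the most recent test first, rows[0] is the latest result.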

The tests are

Short offline
    Scan the disk in the background, sampling for blocks. This can be re-executed using -tshort.

Extended offline
    Read all the blocks on the disk. This can be re-executed using -tlong.

Conveyance offline
    These tests are sometimes run by vendors to test disks after they have been moved. They are not normally run by us. These lines can be ignored.

The NCM component for smartd schedules long tests once a week on Sundays and short tests every day.

The status can be one of the following

Completed without error
    Meaning: All ran OK.
    Action: None required.

Aborted by host
    Meaning: The machine aborted the test. The exact cause is not yet known.
    Action: None recommended, since the cause is not known.

Completed: read failure
    Meaning: The self-test failed to read a block while checking the disk media.
    Action: Contact vendor to replace disk using . Some vendors require that we wait until the disk actually fails, but the SMART test results are generally accepted as an indication that the disk will fail.

Completed: unknown failure
    Meaning: The cause and severity of this message are not known. It is recommended to run a long self-test and check the results. If no fatal error occurs, contact TSI for assistance. For disks in a 3ware RAID array, this will generally cause a SMART-ERROR in the tw_cli show output.
    Action: Raise a vendor call using

Fatal or unknown error
    Meaning: The cause and severity of this message are not known. It is recommended to run a long self-test and check the results. If no fatal error occurs, contact TSI for assistance.

Completed: servo/seek failure
    Meaning: The cause and severity of this message are not known. It is recommended to run a long self-test and then contact TSI for assistance.

Completed: handling_damage??
    Meaning: Cause not known, but seen on some old hardware. May indicate too much vibration of the disk. Run an extended self-test and then contact TSI for assistance.
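The status table above lends itself to a simple look-up. The following sketch transcribes it into a dictionary with shortened wording; STATUS_ACTIONS and action_for are names invented for this example, and statuses missing from the table fall back to the advice given under Further Assistance.

```python
# Status -> recommended action, transcribed from the table (wording shortened).
STATUS_ACTIONS = {
    "Completed without error": "none required",
    "Aborted by host": "none recommended, cause unknown",
    "Completed: read failure": "contact vendor to replace the disk",
    "Completed: unknown failure": "run a long self-test, then raise a vendor call",
    "Fatal or unknown error": "run a long self-test, contact TSI if nothing fatal shows",
    "Completed: servo/seek failure": "run a long self-test, then contact TSI",
    "Completed: handling_damage??": "run an extended self-test, then contact TSI",
}


def action_for(status):
    """Look up the documented action; unlisted statuses go to the service manager."""
    return STATUS_ACTIONS.get(status.strip(),
                              "not listed: contact the disk service manager")


print(action_for("Completed: read failure"))
```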

Counter Analysis

1 Raw Read Error Rate

4 Start Stop Errors
    This has been seen as a failure in smartctl just before a disk failed. It is therefore considered an indication of an upcoming failure.

5 Reallocated Sectors Ct
    This indicates a bad disk where the number of bad sectors has exceeded an acceptable threshold. This has been seen before a disk fails.

10 Spin Retry Count
    This has been reported as FAILING-NOW, such as on lxb0483. Severity unknown at this time.

190 Unknown Attribute
    This has been reported on E4_NOC_2800 systems. The root cause is not known, so it is to be ignored.

196 Reallocated_Event_Count

197 Current_Pending_Sector

198 Offline_Uncorrectable

199 UDMA_CRC_Error_Count
    This may also be related to the internal cabling of the machine. If so, the cabling should be corrected rather than the disk replaced.

200 Multi Zone Error Rate
    This is considered a good indication of a hardware problem. The 3ware CLI will generally indicate a SMART-ERROR for the port.

Lemon Errors

The Lemon Smart sensors can report a number of errors. The table below indicates the list of errors and the actions to take.

There are several sensors

  • SMARTD daemon monitoring. This gives the error SMARTD_WRONG.
  • The SMART self-test failure. This gives the error SMART_SELFTEST. This looks at the most recent Short tests until the first Extended test.
  • Counters exceeding limits. This gives the error SMART_FAILING.
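The self-test sensor is described as looking at the most recent Short tests until the first Extended test. One way to read that rule is sketched below; the actual Lemon sensor code is not shown in this document, so the function names, the exclusion of the Extended test itself, and the definition of a failure as any status other than "Completed without error" are all assumptions.

```python
def recent_short_results(log):
    """Collect statuses of Short tests, newest first, up to the first Extended test.

    log is a list of (test description, status) pairs ordered most recent
    first, as in the self-test log.
    """
    results = []
    for kind, status in log:
        if kind == "Extended offline":
            break  # assumption: older entries are not inspected
        if kind == "Short offline":
            results.append(status)
    return results


def selftest_alarm(log):
    """True if any inspected Short test did not complete without error."""
    return any(status != "Completed without error"
               for status in recent_short_results(log))


log = [
    ("Short offline", "Completed without error"),
    ("Short offline", "Completed without error"),
    ("Extended offline", "Completed without error"),
    ("Short offline", "Completed: read failure"),
]
print(recent_short_results(log), selftest_alarm(log))
```

In this reading, the read failure in the oldest entry is not reported because a later Extended test sits between it and the recent Short tests.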

The SMART status can be found by checking metrics 6130 and 6132 as follows:

# lemon-cli -m ChkSmartFailing

# lemon-cli -m ChkSmartSelftest

The status is reported on the [INFO] line (0 OK in this case). The Actions can be performed only for machines from vendors who accept disk replacement when SMART errors occur. Currently, none of our hardware vendors has agreed to this for Linux systems.

SMARTD_WRONG: The smartd daemon is not running.
    See

A SMART self-test has completed but the error was not one encountered before. Since the SMART error reporting is very conservative, this will not raise an operator alarm.
    Contact TSI as in .

The SMART status Completed: unknown failure was received. This is currently not considered a fatal problem and can be ignored. No operator alarm will have been raised.
    No action required.

The SMART status Completed: electrical failure was received. This is currently not understood and therefore no operator alarm will have been raised. It may indicate problems, but no concrete actions have yet been defined.
    No action required.

The SMART status Completed: servo/seek_failure was received. This is currently not understood and therefore no operator alarm will have been raised. It may indicate problems, but no concrete actions have yet been defined.
    No action required.

The SMART status Interrupted (host reset) was received. This may indicate a problem but is currently not confirmed as being a fatal problem.
    No operator action required.

Spin retry count exceeded.
    This should not be reported as an error to the operators, and no corrective action is known at this time.

An unknown attribute is reported as having failed.
    This should not cause an error for the operators, and there is no known corrective action.

The SMART sense function for a disk is disabled.
    Re-enable the SMART functions (e.g. smartctl -s on -d 3ware,6 /dev/twa0). If this fails (e.g. with an Error SMART Enable failed: Input/output error), open a vendor call.

A disk has failed the SMART short self-test with a cause which is known to show bad disks.
    Run the WIN to determine whether the disk really has errors. If so, it should show the status SMT601E when the extended test fails, and then a vendor call can be raised.

A large number of uncorrectable sectors has been found.
    Raise a vendor call and ask for the disk as identified in the Location to be replaced.
    ALERT! This alarm has been suspended while further investigations on some disk types are being performed.

A large number of write protected sectors has been found.
    Raise a vendor call and ask for the disk as identified in the Location to be replaced.
    ALERT! This alarm has been suspended while further investigations on some disk types are being performed.

A disk is reporting that a SMART counter has been exceeded.
    Raise a vendor call and ask for the disk as identified in the Location to be replaced.

The number of bad sectors on the disk has exceeded recommended levels.
    Raise a vendor call and ask for the disk as identified in the Location to be replaced.

The start stop count does not appear to be an important counter, but it has indicated disk failures in the past.
    Raise a vendor call and ask for the disk as identified in the Location to be replaced.

The raw error rate on the disk has exceeded recommended levels.
    Raise a vendor call and ask for the disk as identified in the Location to be replaced.

The number of reallocated events on the disk has exceeded recommended levels.
    Raise a vendor call and ask for the disk as identified in the Location to be replaced.

The number of current pending sectors has exceeded recommended levels.
    Raise a vendor call and ask for the disk as identified in the Location to be replaced.

The number of offline uncorrectable sectors on the disk has exceeded recommended levels.
    Raise a vendor call and ask for the disk as identified in the Location to be replaced.

The number of UDMA CRC errors reported by the disk has exceeded recommended levels.
    Raise a vendor call. The root cause may be either the disk cabling or the disk at Location, so both should be checked.

A disk has failed the SMART extended self-test with a cause which is known to show bad disks.
    Run the WIN in order to make the disk DEGRADED. In this case, you can raise a vendor call and ask for the disk as identified in the Location to be replaced.

A disk has failed the SMART extended self-test giving the completed: read error status.
    If the disk is behind a 3ware controller, run the WIN in order to make the disk DEGRADED. In this case, you can raise a vendor call and ask for the disk as identified in the Location to be replaced. Otherwise, it is vendor dependent whether they will replace the disk.

A disk has had a multi zone error rate which is high. This indicates a disk which is about to fail. The 3ware CLI will report the port as SMART-ERROR.
    Raise a vendor call for disk replacement.

The SMART test has run with a Fatal or unknown error. This indicates a failed disk.
    Run the WIN in order to make the disk DEGRADED. In this case, you can raise a vendor call and ask for the disk as identified in the Location to be replaced.

The disk has failed so badly that it cannot even be opened by the SMART software.
    The disks are listed in /etc/smartd.conf. Check with the service manager whether any disks have recently been removed from the configuration. If not, raise a vendor call. If a disk has been removed, update the CDB profile and run ncm_wrapper.sh --co smartd to re-generate /etc/smartd.conf.

A controller is not visible for the smartctl command. Unless there have been recent changes to the CDB hardware profile, this error indicates a controller failure.
    Check that the controller is listed in the smartd.conf file. If so, raise a vendor call. Otherwise, run ncm_wrapper.sh --co smartd to re-generate the smartd.conf file.

Status of the self-test is not known. This can occur for disks which have reported a test other than short or long.
    Check the results of smartctl -l selftest. Run to force a long test and contact TSI if this is still producing an alarm on completion of the test.

Further Assistance

If the problem is not listed above, please contact the disk service manager listed at .

Related Documents

Old description of SMART errors
Opening a vendor call for a disk problem
Working instructions for Smart Disk operations
3ware Problem determination guide
Requirements for vendors replacement for SMART errors

-- - 15 Nov 2005
