SMART Disk Errors

/var/log/messages

These messages are logged with the service set to smartd. Each line therefore starts as follows

Nov 14 16:08:23 lsfnfs02 smartd[17081]:

Many messages then follow with

Device: /dev/twe0 [3ware_disk_01]

Since this text is the same for most messages, it is not shown in the message contents below.
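
As a rough illustration of the prefix described above, the sketch below splits a syslog line into its timestamp, host, device and message parts. The function name `parse_smartd_line` and the exact regular expression are assumptions made for this example. They are modelled only on the sample prefix shown here, not on a formal definition of the smartd log format.

```python
import re

# Pattern modelled on the sample line shown above (an assumption, not a spec):
#   Nov 14 16:08:23 lsfnfs02 smartd[17081]: Device: /dev/twe0 [3ware_disk_01], ...
SMARTD_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) smartd\[(?P<pid>\d+)\]: "
    r"(?:Device: (?P<device>[^\s,]+)(?: \[(?P<label>[^\]]+)\])?,?\s*)?"
    r"(?P<message>.*)$"
)


def parse_smartd_line(line):
    """Return a dict of the line's parts, or None if it is not a smartd line."""
    match = SMARTD_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.groupdict()
```

The `message` field is what remains after the common prefix, which is the part listed in the table below. Lines from other services return `None`, so the function can be used as a filter over /var/log/messages.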


not capable of SMART self-check Error SMART Values Read failed: Input/output error	Replace disk using
Input/output error	Replace disk mentioned in the next line using
Read SMART Self Test Log Failed	Replace disk using
Device /dev/sda, please try '-d 3ware,N'	The NCM component for smartd has failed to configure the 3ware controller entries correctly. Contact the disk support for analysis
scsiModePageOffer: raw_curr too small, offset=106 Device: /dev/sda, Bad IEC (SMART) mode page, err=5, skip device	Caused by entries for a 3ware RAID unit in the /etc/smartd.conf. 3ware configurations should not have the entries for `/dev/sd` in smartd.conf, only the ones for the 3ware physical disks. Check the CDB profile and the response from `smartctl -d /dev/sda`
Currently unreadable (pending) sectors	If the number of pending sectors is greater than 5, create a vendor call. Otherwise, keep watching since the number may increase
Offline uncorrectable sectors	If the number of offline uncorrectable sectors is greater than 5, create a vendor call
not ATA, no IDENTIFY DEVICE Structure	The disk is broken and should be replaced using
FAILED SMART self-check. BACK UP DATA NOW!	This message can appear if the self tests are aborted manually in order to clear out the error log. If this is true, run a series of short tests to clear the 20 error log history. If this is not the case, the disk is failing and should be immediately replaced through
execute Short Self-Test failed.	Replace disk via
starting scheduled Short Self-Test.	This is information only. There is no error so the message can be ignored
same Attribute has different ID numbers: 196 = 196 = 194	This is information only. There is no error so the message can be ignored
SMART Usage Attribute: 9 Power_On_Hours changed from 57 to 56	This is information only. There is no error so the message can be ignored
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61	This is reported on Seagate disks. It is the time between ECC hardware failures (which are recovered automatically). The value will vary and does not indicate a hardware problem unless it is very low (e.g. <5). Typical values are around 60.
is SMART capable. Adding to "monitor" list.	This is information only. There is no error so the message can be ignored
opened	This is information only. There is no error so the message can be ignored
found in smartd database.	This is information only. There is no error so the message can be ignored
Failed SMART usage Attribute: 194 Temperature_Celsius.	The disk has experienced a high temperature alarm. The current configuration should be reviewed urgently with the TSI section since it indicates a problem with the cooling around the machine. If this is only a single disk, a vendor call can be opened via
SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 249 to 248	This is information only. There is no error so the message can be ignored
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 277 to 276	This is information only. There is no error so the message can be ignored
Warning! SMART ATA Error Log Structure error: invalid SMART checksum.	This does not seem to be fatal and happens on good disk arrays such as with the 8006-2LP. For the moment, this message can be ignored.
# 9 Short offline Completed: read failure	The smart disk test did not complete ok. Identify the failing disk using and open a vendor call to replace with
ERROR: Verify failed: Port #0.	Some verify operations on 8xxx controllers seem to be reporting failed verify status to the log but the verify actually seems to have succeeded. Check the status with tw_cli to see that the unit is OK and not DEGRADED. If it is ok, there is no problem. Otherwise, follow the usual procedures using `lemon-host-check` to identify the error
SMART Support is: Unavailable - Packet Interface Devices [this device: Array controller] does not support SMART	Raise a vendor call to replace the disk since it is not responding to SMART commands
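
The threshold of 5 sectors used in the two sector rows of the table can be sketched as a small check. The function name `sector_action` is hypothetical, and the message layout (a count followed by the text from the table) is an assumption about how smartd writes these lines. The table gives no action for a low offline count, so the sketch says so rather than guessing one.

```python
import re

# Assumed layout: "<count> Currently unreadable (pending) sectors" or
# "<count> Offline uncorrectable sectors". Both spellings of
# "uncorrectable" are accepted.
SECTOR_MESSAGE = re.compile(
    r"(?P<count>\d+) "
    r"(?P<kind>Currently unreadable \(pending\)|Offline uncorr?ectable) sectors"
)
THRESHOLD = 5  # from the table: more than 5 sectors means a vendor call


def sector_action(message):
    """Map a pending/offline sector message to the action in the table."""
    match = SECTOR_MESSAGE.search(message)
    if match is None:
        return None  # not a sector-count message
    count = int(match.group("count"))
    if count > THRESHOLD:
        return "create a vendor call"
    if match.group("kind").startswith("Currently"):
        return "keep watching"
    return "no action listed"
```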

To identify which disk has generated a smartd error, the following steps should be taken

Smart Test Errors

The smartctl command gives access to the log of previous self tests run by smartd. For example,

# /usr/sbin/smartctl -l selftest --device=3ware,0 /dev/twe0
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is 

Warning! SMART Attribute Thresholds Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2945         -
# 2  Short offline       Completed without error       00%      2930         -
# 3  Extended offline    Completed without error       00%      2907         -
# 4  Short offline       Completed without error       00%      2883         -
# 5  Short offline       Completed without error       00%      2859

Test #1 is the most recent test. New tests can be run manually using smartctl -tshort or smartctl -tlong.
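
For scripted checks, the self-test log shown above can be turned into records. This is a minimal sketch that assumes the column layout of the sample output (fields separated by runs of spaces, with the LBA column possibly missing). The names `SELFTEST_ROW` and `parse_selftest_log` are made up for the example.

```python
import re

# One row of the "smartctl -l selftest" table as it appears in the sample above.
SELFTEST_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"        # e.g. "Short offline"
    r"(?P<status>.+?)\s+"         # e.g. "Completed without error"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<lifetime>\d+)"
    r"(?:\s+(?P<lba>\S+))?\s*$"   # LBA column may be absent
)


def parse_selftest_log(output):
    """Return the self-test entries, most recent first, as a list of dicts."""
    entries = []
    for line in output.splitlines():
        match = SELFTEST_ROW.match(line.strip())
        if match is None:
            continue  # banner, header or blank line
        entries.append({
            "num": int(match.group("num")),
            "test": match.group("test"),
            "status": match.group("status"),
            "remaining_percent": int(match.group("remaining")),
            "lifetime_hours": int(match.group("lifetime")),
            "lba_of_first_error": match.group("lba"),
        })
    return entries
```

The first element of the returned list corresponds to test #1, the most recent one.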

The tests are


Short offline	Scan the disk in the background, sampling blocks. This can be re-executed using `-tshort`
Extended offline	Read all the blocks on the disk. This can be re-executed using `-tlong`.
Conveyance offline	These tests are run sometimes by vendors to test disks after they have been moved. They are not normally run by us. These lines can be ignored.

The NCM component for smartd configures these tests to run once a week on Sundays for long tests and every day for short ones.

The status can be one of the following


Completed without error	All ran ok	None required
Aborted by host	The machine aborted the test. The exact cause of this is still not known at this time	No action to recommend since cause is not known
Completed: read failure	Indicates that selftest failed to read a block while checking the disk media	Contact vendor to replace disk using . Some vendors require that we wait until the disk actually fails but the smart test results are generally accepted as indication that the disk will fail.
Completed: unknown failure	The cause and severity of this message are not known. It is recommended to run a long self test and check the results. If no fatal error occurs, contact TSI for assistance. For disks in a 3ware RAID array, this will generally cause a `SMART-ERROR` in the `tw_cli show` output.	Raise a vendor call using
Fatal or unknown error	The cause and severity of this message are not known. It is recommended to run a long self test and check the results. If no fatal error occurs, contact TSI for assistance
Completed: servo/seek failure	The cause and severity of this message are not known. It is recommended to run a long self test and then contact TSI for assistance
Completed: handling_damage??	Cause not known but seen on some old hardware. May indicate too much vibration of the disk. Run an extended self test and then contact TSI for assistance
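
The status table above can be encoded as a lookup so that a script reports the recommended action next to each test result. This is only a sketch: `STATUS_ACTIONS` and `action_for_status` are hypothetical names, the action strings are shortened paraphrases of the table, and the fallback for unlisted statuses follows the Further Assistance section.

```python
# Keys are lower-cased status strings from the table above.
STATUS_ACTIONS = {
    "completed without error": "none required",
    "aborted by host": "no action recommended, cause not known",
    "completed: read failure": "contact vendor to replace disk",
    "completed: unknown failure": "run a long self test, raise a vendor call",
    "fatal or unknown error": "run a long self test, contact TSI if no fatal error occurs",
    "completed: servo/seek failure": "run a long self test, then contact TSI",
    "completed: handling_damage??": "run an extended self test, then contact TSI",
}


def action_for_status(status):
    """Look up the recommended action, ignoring case and extra spaces."""
    key = " ".join(status.lower().split())
    return STATUS_ACTIONS.get(key, "not listed, contact the disk service manager")
```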

Counter Analysis


1	Raw Read Error Rate
4	Start Stop Errors	This has been seen as a failure on smartctl just before a disk has failed. It is therefore considered as being an indication of an upcoming failure
5	Reallocated Sectors Ct	This indicates a bad disk where the number of bad sectors has exceeded an acceptable threshold. This has been seen before a disk fails.
10	Spin Retry Count	This has been reported as `FAILING-NOW`, such as on lxb0483. Severity unknown at this time
190	Unknown Attribute	This has been reported on E4_NOC_2800 systems. The root cause is not known so it is to be ignored
196	Reallocated_Event_Count
197	Current_Pending_Sector
198	Offline_Uncorrectable
199	UDMA_CRC_Error_Count	This may also be related to internal cabling with the machine. If so, the cabling should be corrected rather than the disk replaced
200	Multi Zone Error Rate	This is considered as a good indication of a hardware problem. The 3ware CLI will generally indicate a `SMART-ERROR` for the port.
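
The counters listed above can be checked against the attribute table printed by `smartctl -A`. The sketch below assumes the usual ten-column layout of that table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) and treats any WHEN_FAILED entry other than `-` as flagged. The sets `WATCHED_IDS` and `IGNORED_IDS` are taken from the list above. The function names are made up for the example.

```python
WATCHED_IDS = {1, 4, 5, 10, 196, 197, 198, 199, 200}  # counters listed above
IGNORED_IDS = {190}  # unknown attribute, to be ignored per the list above


def parse_attributes(output):
    """Parse attribute rows from 'smartctl -A' output into dicts."""
    rows = []
    for line in output.splitlines():
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, banner or blank line
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": fields[9].strip(),
        })
    return rows


def flagged_attributes(output):
    """Return watched attributes that smartctl marks as failed now or in the past."""
    return [
        row for row in parse_attributes(output)
        if row["id"] in WATCHED_IDS
        and row["id"] not in IGNORED_IDS
        and row["when_failed"] != "-"
    ]
```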

Lemon Errors

The Lemon Smart sensors can report a number of errors. The table below indicates the list of errors and the actions to take.

There are several sensors

SMARTD daemon monitoring. This gives the error SMARTD_WRONG.
The Smart self test failure. This gives the error SMART_SELFTEST. This looks at the most recent Short tests until the first Extended test.
Counters exceeding limits. This gives the error SMART_FAILING
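
The window used by the self test sensor (the most recent Short tests until the first Extended test) can be sketched as follows. This is one reading of that description and not the Lemon sensor's actual code: it assumes the entries are ordered most recent first, as in the smartctl log, and that the Extended test itself is excluded from the window. Both function names are hypothetical.

```python
def short_tests_since_last_extended(entries):
    """Collect Short tests from the newest entry down to the first Extended one.

    entries: list of (test_description, status) tuples, most recent first.
    """
    window = []
    for test, status in entries:
        if test.startswith("Extended"):
            break  # stop at the first Extended test (assumed to be excluded)
        if test.startswith("Short"):
            window.append((test, status))
    return window


def selftest_failed(entries):
    """True if any Short test in the window did not complete without error."""
    return any(
        status != "Completed without error"
        for _, status in short_tests_since_last_extended(entries)
    )
```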

The SMART status can be found by checking the metrics 6130 and 6132 as follows

# lemon-cli -m ChkSmartFailing

# lemon-cli -m ChkSmartSelftest

The status is reported on the [INFO] line (0 OK in this case). The Actions can be performed only for machines from vendors who accept disk replacement when SMART errors occur. Currently, none of our hardware vendors has agreed to this for Linux systems.


SMARTD_WRONG	The smartd daemon is not running.	See
	A smart self test has completed but the error was not one encountered before. Since the SMART error reporting is very conservative, this will not raise an operator alarm.	Contact TSI as in .
	The smart status `Completed: unknown failure` was received. This is currently not considered a fatal problem and can be ignored. No operator alarm will have been raised.	No action required
	The smart status `Completed: electrical failure` was received. This is currently not understood and therefore no operator alarm will have been raised. It may indicate problems but no concrete actions have yet been defined	No action required
	The smart status `Completed: servo/seek_failure` was received. This is currently not understood and therefore no operator alarm will have been raised. It may indicate problems but no concrete actions have yet been defined	No action required
	The smart status `Interrupted (host reset)` was received. This may indicate a problem but currently is not confirmed as being a fatal problem.	No operator action required
	Spin retry count exceeded	This should not be reported as an error to the operators and no corrective action is known at this time
	An unknown attribute is reported as having failed.	This should not cause an error for the operators and there is no known corrective action
	The smart sense function for a disk is disabled.	Re-enable the smart functions (e.g. `smartctl -s on -d 3ware,6 /dev/twa0`). If this fails (e.g. with an `Error SMART Enable failed: Input/output error` error), open a vendor call
	A disk has failed the SMART short selftest with a cause which is known to show bad disks	Run the WIN to determine whether the disk really has errors. If so, it should show the status `SMT601E` when the extended test fails and then a vendor call can be raised
	A disk has failed the SMART short selftest with a cause which is known to show bad disks	Run the WIN to determine whether the disk really has errors. If so, it should show the status `SMT601E` when the extended test fails and then a vendor call can be raised
	A large number of uncorrectable sectors has been found.	Raise a and ask for the disk as identified in the `Location` to be replaced. This alarm has been suspended while further investigations on some disk types are being performed.
	A large number of write protected sectors has been found.	Raise a and ask for the disk as identified in the `Location` to be replaced. This alarm has been suspended while further investigations on some disk types are being performed.
	A disk is reporting that a SMART counter has been exceeded.	Raise a and ask for the disk as identified in the `Location` to be replaced
	The number of bad sectors on the disk has exceeded recommended levels	Raise a and ask for the disk as identified in the `Location` to be replaced
	The start stop count does not appear to be an important counter but it has indicated disk failures in the past.	Raise a and ask for the disk as identified in the `Location` to be replaced
	The raw error rate on the disk has exceeded recommended levels	Raise a and ask for the disk as identified in the `Location` to be replaced
	The number of reallocated events on the disk has exceeded recommended levels	Raise a and ask for the disk as identified in the `Location` to be replaced
	The number of current pending sectors has exceeded recommended levels	Raise a and ask for the disk as identified in the `Location` to be replaced
	The number of offline uncorrectable sectors on the disk has exceeded recommended levels	Raise a and ask for the disk as identified in the `Location` to be replaced
	The number of UDMA CRC errors reported by the disk has exceeded recommended levels	Raise a . The root cause may be either disk cabling or the disk at `Location` so both should be checked
	A disk has failed the SMART extended selftest with a cause which is known to show bad disks.	Run the WIN in order to make the disk `DEGRADED`. In this case, you can raise a and ask for the disk as identified in the `Location` to be replaced.
	A disk has failed the SMART extended selftest giving the completed: read error status.	If the disk is behind a 3ware controller, run the WIN in order to make the disk `DEGRADED`. In this case, you can raise a and ask for the disk as identified in the `Location` to be replaced. Otherwise, it is vendor dependent if they will replace the disk.
	A disk has had a multi zone error rate which is high. This indicates a disk which is about to fail. The 3ware CLI will report the port as `SMART-ERROR`.	Raise a vendor call for disk replacement
	The SMART test has run with a `Fatal or unknown error`. This indicates a failed disk	Run the WIN in order to make the disk `DEGRADED`. In this case, you can raise a and ask for the disk as identified in the `Location` to be replaced.
	The disk has failed so badly that it cannot even be opened by the SMART software. The disks are listed in /etc/smartd.conf.	Check with the service manager whether any disks have recently been removed from the configuration. If not, raise a vendor call. If a disk has been removed, update the CDB profile and run `ncm_wrapper.sh --co smartd` to re-generate /etc/smartd.conf
	A controller is not visible for the smartctl command. Unless there have been recent changes to the CDB hardware profile, this error indicates a controller failure.	Check that the controller is listed in the smartd.conf file. If so, raise a vendor call. Otherwise, run `ncm_wrapper.sh --co smartd` to re-generate the smartd.conf file.
	The disk has failed so badly that it cannot even be opened by the SMART software. The disks are listed in /etc/smartd.conf.	Check with the service manager whether any disks have recently been removed from the configuration. If not, raise a vendor call. If a disk has been removed, update the CDB profile and run `ncm_wrapper.sh --co smartd` to re-generate /etc/smartd.conf
	A controller is not visible for the smartctl command. Unless there have been recent changes to the CDB hardware profile, this error indicates a controller failure.	Check that the controller is listed in the smartd.conf file. If so, raise a vendor call. Otherwise, run `ncm_wrapper.sh --co smartd` to re-generate the smartd.conf file.
	Status of the selftest is not known.	This can occur for disks which have reported a test other than short or long. Check the results of `smartctl -l selftest`. Run to force a long test and contact TSI if this is still producing an alarm on completion of the test

Further Assistance

In the event of the problem not being listed above, please contact the disk service manager listed at .