Category: LINUX
2009-12-14 11:12:31
These messages are logged with the service set to smartd. Each line therefore starts as follows
Nov 14 16:08:23 lsfnfs02 smartd[17081]:
Many messages then continue with
Device: /dev/twe0 [3ware_disk_01]
Since this text is the same for most messages, it is not shown in the message contents below.
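As a rough illustration of the prefix described above, the following sketch splits a syslog line into its parts so that only the message content remains. The exact layout (a comma after the device or its bracketed label) is an assumption based on the example prefix, and `parse_smartd_line` is a hypothetical helper name, not part of smartmontools.

```python
import re

# Assumed layout (not confirmed by the page):
# "<Mon DD HH:MM:SS> <host> smartd[<pid>]: Device: <dev> [<label>], <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) smartd\[(?P<pid>\d+)\]: "
    r"(?:Device: (?P<device>\S+?)(?: \[(?P<label>[^\]]+)\])?,? )?"
    r"(?P<message>.*)$"
)


def parse_smartd_line(line):
    """Return the fields of one smartd syslog line, or None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    example = ("Nov 14 16:08:23 lsfnfs02 smartd[17081]: "
               "Device: /dev/twe0 [3ware_disk_01], opened")
    print(parse_smartd_line(example))
```

The `message` field is the part to look up in the table below; `device` and `label` are None for lines that carry no device prefix.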
not capable of SMART self-check Error SMART Values Read failed: Input/output error |
Replace disk using |
Input/output error | Replace disk mentioned in the next line using |
Read SMART Self Test Log Failed | Replace disk using |
Device /dev/sda, please try '-d 3ware,N' | The NCM component for smartd has failed to configure the 3ware controller entries correctly. Contact the disk support for analysis |
scsiModePageOffer: raw_curr too small, offset=106 Device: /dev/sda, Bad IEC (SMART) mode page, err=5, skip device |
Caused by entries for a 3ware RAID unit in the /etc/smartd.conf. 3ware configurations should not have the entries for /dev/sd in smartd.conf, only the ones for the 3ware physical disks. Check the CDB profile and the response from smartctl -d /dev/sda |
Currently unreadable (pending) sectors | If the number of pending sectors is greater than 5, create a vendor call. Otherwise, keep watching, since the number may increase |
Offline uncorrectable sectors | If the number of offline uncorrectable sectors is greater than 5, create a vendor call |
not ATA, no IDENTIFY DEVICE Structure | The disk is broken and should be replaced using |
FAILED SMART self-check. BACK UP DATA NOW! | This message can appear if the self tests are aborted manually in order to clear out the error log. If this is true, run a series of short tests to clear the 20 error log history. If this is not the case, the disk is failing and should be immediately replaced through |
execute Short Self-Test failed. | Replace disk via |
starting scheduled Short Self-Test. | This is information only. There is no error so the message can be ignored |
same Attribute has different ID numbers: 196 = 196 = 194 | This is information only. There is no error so the message can be ignored |
SMART Usage Attribute: 9 Power_On_Hours changed from 57 to 56 | This is information only. There is no error so the message can be ignored |
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61 | This is reported on Seagate disks. It is the time between ECC hardware failures (which are recovered automatically). The value will vary and does not indicate a hardware problem unless it is very low (e.g. <5). Typical values are around 60. |
is SMART capable. Adding to "monitor" list. | This is information only. There is no error so the message can be ignored |
opened | This is information only. There is no error so the message can be ignored |
found in smartd database. | This is information only. There is no error so the message can be ignored |
Failed SMART usage Attribute: 194 Temperature_Celsius. | The disk has experienced a high temperature alarm. The current configuration should be reviewed urgently with the TSI section, since this indicates a problem with the cooling around the machine. If this is only a single disk, a vendor call can be opened via |
SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 249 to 248 | This is information only. There is no error so the message can be ignored |
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 277 to 276 | This is information only. There is no error so the message can be ignored |
Warning! SMART ATA Error Log Structure error: invalid SMART checksum. | This does not seem to be fatal and happens on good disk arrays such as with the 8006-2LP. For the moment, this message can be ignored. |
# 9 Short offline Completed: read failure | The smart disk test did not complete ok. Identify the failing disk using and open a vendor call to replace with |
ERROR: Verify failed: Port #0. | Some verify operations on 8xxx controllers seem to be reporting failed verify status to the log but the verify actually seems to have succeeded. Check the status with tw_cli to see that the unit is OK and not DEGRADED. If it is ok, there is no problem. Otherwise, follow the usual procedures using lemon-host-check to identify the error |
SMART Support is: Unavailable - Packet Interface Devices [this device: Array controller] does not support SMART | Raise a vendor call to replace the disk since it is not responding to SMART commands |
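The sector-count rule from the table above (more than 5 sectors means a vendor call) can be sketched as follows. It assumes the count appears directly before the message text, as in "8 Currently unreadable (pending) sectors"; treating offline uncorrectable counts of 5 or fewer as "keep watching" is also an assumption, since the table only states that action for pending sectors. `sector_action` is a hypothetical helper name.

```python
import re

# Assumption: smartd writes the sector count immediately before the message text.
SECTOR_RE = re.compile(
    r"(?P<count>\d+) "
    r"(?P<kind>Currently unreadable \(pending\)|Offline uncorrectable) sectors"
)

# From the table: more than 5 sectors warrants a vendor call.
VENDOR_CALL_THRESHOLD = 5


def sector_action(message):
    """Map a pending/offline sector message to the action from the table."""
    match = SECTOR_RE.search(message)
    if match is None:
        return None  # not a sector-count message
    if int(match.group("count")) > VENDOR_CALL_THRESHOLD:
        return "vendor call"
    return "keep watching"


if __name__ == "__main__":
    print(sector_action("8 Currently unreadable (pending) sectors"))
    print(sector_action("2 Offline uncorrectable sectors"))
```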
To identify which disk has generated a smartd error, the following steps should be taken
The smartctl command gives access to the logs of previous runs of smartd to perform tests. For example,
# /usr/sbin/smartctl -l selftest --device=3ware,0 /dev/twe0
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is
Warning! SMART Attribute Thresholds Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2945 -
# 2 Short offline Completed without error 00% 2930 -
# 3 Extended offline Completed without error 00% 2907 -
# 4 Short offline Completed without error 00% 2883 -
# 5 Short offline Completed without error 00% 2859 -
Test #1 is the most recent test. New tests can be run manually using smartctl -tshort or smartctl -tlong.
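For scripted checks, the self-test log shown above can be turned into records with a small parser. This is only a sketch: it assumes the row layout of the example output (test number, description, status, remaining percentage, lifetime hours, optional LBA), and `parse_selftest_log` is a hypothetical helper name rather than anything shipped with smartmontools.

```python
import re

# One row of the "smartctl -l selftest" table, as laid out in the example above.
ROW_RE = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>Short offline|Extended offline|Conveyance offline)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<lifetime>\d+)"
    r"(?:\s+(?P<lba>\S+))?\s*$"
)


def parse_selftest_log(text):
    """Return one dict per self-test row; header and banner lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            row = match.groupdict()
            row["num"] = int(row["num"])
            row["lifetime"] = int(row["lifetime"])
            rows.append(row)
    return rows


if __name__ == "__main__":
    sample = (
        "# 1 Short offline Completed without error 00% 2945 -\n"
        "# 2 Short offline Completed without error 00% 2930 -\n"
    )
    rows = parse_selftest_log(sample)
    latest = min(rows, key=lambda r: r["num"])  # test #1 is the most recent
    print(latest["test"], "/", latest["status"])
```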
The tests are
Short offline | Scan the disk in the background, sampling blocks. This can be re-executed using -tshort |
Extended offline | Read all the blocks on the disk. This can be re-executed using -tlong |
Conveyance offline | These tests are run sometimes by vendors to test disks after they have been moved. They are not normally run by us. These lines can be ignored. |
The NCM component for smartd configures these tests to run once a week on Sundays for long tests and every day for short ones.
The status can be one of the following
Completed without error | All ran ok | None required |
Aborted by host | The machine aborted the test. The exact cause of this is still not known at this time | No action to recommend since cause is not known |
Completed: read failure | Indicates that selftest failed to read a block while checking the disk media | Contact vendor to replace disk using . Some vendors require that we wait until the disk actually fails but the smart test results are generally accepted as indication that the disk will fail. |
Completed: unknown failure | The cause and severity of this message are not known. It is recommended to run a long self test and check the results. If no fatal error occurs, contact TSI for assistance. For disks in a 3ware RAID array, this will generally cause a SMART-ERROR in the tw_cli show output. | Raise a vendor call using |
Fatal or unknown error | The cause and severity of this message are not known. It is recommended to run a long self test and check the results. If no fatal error occurs, contact TSI for assistance | |
Completed: servo/seek failure | The cause and severity of this message are not known. It is recommended to run a long self test and then contact TSI for assistance | |
Completed: handling_damage?? | Cause not known but seen on some old hardware. May indicate too much vibration of the disk. Run an extended self test and then contact TSI for assistance |
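The status table above lends itself to a simple lookup when processing self-test results automatically. The sketch below copies the statuses and summarises the actions from the table; the fallback for unlisted statuses follows the advice on this page to contact the disk service manager. `recommended_action` is a hypothetical helper name.

```python
# Status strings as listed in the table above; actions are short summaries of it.
ACTIONS = {
    "Completed without error": "none required",
    "Aborted by host": "no action recommended (cause not known)",
    "Completed: read failure": "contact vendor to replace disk",
    "Completed: unknown failure": "run a long self test; raise a vendor call",
    "Fatal or unknown error": "run a long self test; contact TSI if no fatal error occurs",
    "Completed: servo/seek failure": "run a long self test, then contact TSI",
    "Completed: handling_damage??": "run an extended self test, then contact TSI",
}


def recommended_action(status):
    """Look up the action for a self-test status, with a fallback for unlisted ones."""
    return ACTIONS.get(status, "not listed: contact the disk service manager")


if __name__ == "__main__":
    print(recommended_action("Completed: read failure"))
    print(recommended_action("Interrupted (host reset)"))
```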
1 | Raw Read Error Rate | |
4 | Start Stop Errors | This has been seen as a failure on smartctl just before a disk has failed. It is therefore considered as being an indication of an upcoming failure |
5 | Reallocated Sectors Ct | This indicates a bad disk where the number of bad sectors has exceeded an acceptable threshold. This has been seen before a disk fails. |
10 | Spin Retry Count | This has been reported as FAILING-NOW, such as on lxb0483. Severity unknown at this time |
190 | Unknown Attribute | This has been reported on E4_NOC_2800 systems. The root cause is not known so it is to be ignored |
196 | Reallocated_Event_Count | |
197 | Current_Pending_Sector | |
198 | Offline_Uncorrectable | |
199 | UDMA_CRC_Error_Count | This may also be related to internal cabling within the machine. If so, the cabling should be corrected rather than the disk replaced |
200 | Multi Zone Error Rate | This is considered as a good indication of a hardware problem. The 3ware CLI will generally indicate a SMART-ERROR for the port. |
The Lemon Smart sensors can report a number of errors. The table below indicates the list of errors and the actions to take.
There are several sensors
SMARTD_WRONG
SMART_SELFTEST
This looks at the most recent Short tests until the first Extended test.
SMART_FAILING
The SMART status can be found by checking the metrics 6130 and 6132 as follows
# lemon-cli -m ChkSmartFailing
# lemon-cli -m ChkSmartSelftest
The status is reported on the [INFO] line (0 OK in this case). The Actions can be performed only for machines from vendors who accept disk replacement when SMART errors occur. Currently, none of our hardware vendors has agreed to this for Linux systems.
In the event of the problem not being listed above, please contact the disk service manager listed at .
Old description of SMART errors | |
Opening a vendor call for a disk problem | |
Working instructions for Smart Disk operations | |
3ware Problem determination guide | |
Requirements for vendors replacement for SMART errors |
-- - 15 Nov 2005