
Category: Servers & Storage

2013-04-27 10:10:52

Replacing a drive on a Sun Fire[TM] X4500 that has not been explicitly failed by ZFS [ID 1011391.1]

Symptoms

There are instances when a Sun Fire[TM] X4500 drive's firmware SMART (Self-Monitoring, Analysis and Reporting Technology) predictively fails a disk and reports it to fmadm.

ZFS, however, can still report the disk as healthy. The service manual recommends replacing the drive by running cfgadm -c unconfigure, but because ZFS still considers the drive healthy, the command will fail with the following:

root@th12 # cfgadm -c unconfigure sata1/7::dsk/c1t7d0

Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:7

This operation will suspend activity on the SATA device

Continue (yes/no) yes

cfgadm: Hardware specific failure: Failed to unconfig device at ap_id: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:7

Changes

N/A

Cause

N/A

Solution

Since the drive is still healthy according to ZFS, it needs to be offlined, unconfigured with cfgadm, physically replaced, configured with cfgadm, and finally replaced with zpool replace.
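That sequence can be sketched as a dry-run shell script. This is a sketch only: each command is printed rather than executed, and the pool name zpool1, device c1t7d0, and attachment point sata1/7 are the example values used throughout this document. The physical swap happens between the two cfgadm calls, and each command's result should be checked before running the next.

```shell
#!/bin/sh
# Dry-run sketch of the replacement sequence described below.
# POOL, DISK and AP_ID are the example values from this document;
# adjust them for your system.
POOL=zpool1
DISK=c1t7d0
AP_ID="sata1/7::dsk/$DISK"

# cmd only prints each command; change 'echo "$@"' to '"$@"' to
# actually execute them, one at a time, checking each result.
cmd() { echo "$@"; }

cmd zpool offline "$POOL" "$DISK"          # step 5: offline in ZFS first
cmd cfgadm -c unconfigure "$AP_ID"         # step 7: release from OS control
# ... physically swap the drive here (blue LED = safe to remove) ...
cmd cfgadm -c configure "$AP_ID"           # step 10: bring new drive online
cmd zpool replace "$POOL" "$DISK" "$DISK"  # step 12: resilver onto new drive
cmd fmadm repair 665c1b1a-7405-6f8a-adc5-be4e32dc9232  # step 14: clear fault
```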

1. Prior to replacing the drive, cfgadm -alv will show the following output:


root@th12 # cfgadm -alv

Ap_Id Receptacle Occupant Condition Information When Type Busy Phys_Id

sata0/0::dsk/c0t0d0 connected configured ok Mod: HITACHI HDS7250SASUN500G 0627K7KP8F FRev: K2AOAJ0A SN: KRVN67ZAJ7KP8F unavailable disk n /devices/pci@0,0/pci1022,7458@1/pci11ab,11ab@1:0
sata0/1::dsk/c0t1d0 connected configured ok Mod: HITACHI HDS7250SASUN500G 0628KB06EF FRev: K2AOAJ0A SN: KRVN65ZAJB06EF unavailable disk n /devices/pci@0,0/pci1022,7458@1/pci11ab,11ab@1:1

(output omitted for brevity)


sata1/7::dsk/c1t7d0 connected configured ok Mod: HITACHI HDS7250SASUN500G 0628K8RH1D FRev: K2AOAJ0A SN: KRVN63ZAJ8RH1D unavailable disk n /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:7

2. fmadm and fmdump will show the drives as faulty:


root@th12 # fmadm faulty

STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------

degraded hc:///:serial=KRVN63ZAJ8RH1D/component=sata1/7
665c1b1a-7405-6f8a-adc5-be4e32dc9232

-------- ----------------------------------------------------------------------

root@th12 # fmdump

TIME UUID SUNW-MSG-ID
Dec 01 00:23:09.5984 665c1b1a-7405-6f8a-adc5-be4e32dc9232 DISK-8000-0X

root@th12 # fmdump -v

TIME UUID SUNW-MSG-ID
Dec 01 00:23:09.5984 665c1b1a-7405-6f8a-adc5-be4e32dc9232 DISK-8000-0X

100% fault.io.disk.predictive-failure

Problem in: hc:///:serial=KRVN63ZAJ8RH1D:part=HITACHI-HDS7250SASUN500G-628K8RH1D:revision=K2AOAJ0A/motherboard=0/hostbridge=0/
pcibus=0/pcidev=2/pcifn=0/pcibus=2/pcidev=1/pcifn=0/sata-port=7/disk=0

Affects: hc:///:serial=KRVN63ZAJ8RH1D/component=sata1/7
FRU: hc:///component=HD_ID_45

root@th12 #
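Note that fmadm and fmdump identify the drive by serial number, while ZFS and cfgadm use c#t#d# names. One way to map between them is to search the verbose cfgadm listing, which includes each drive's SN: field. The snippet below greps a saved copy of that listing (a single line trimmed from the cfgadm -alv output in step 1); on the live system you would pipe cfgadm -alv into grep directly.

```shell
# Map the serial number reported by fmadm/fmdump (KRVN63ZAJ8RH1D in
# this example) back to its attachment point and c#t#d# device name.
# cfgadm_line is a saved line from 'cfgadm -alv'; on a live system,
# pipe the real command's output into grep instead.
SERIAL=KRVN63ZAJ8RH1D
cfgadm_line='sata1/7::dsk/c1t7d0 connected configured ok Mod: HITACHI HDS7250SASUN500G FRev: K2AOAJ0A SN: KRVN63ZAJ8RH1D unavailable disk'

# The first field of the matching line is the Ap_Id, which contains
# both the SATA port (sata1/7) and the device name (c1t7d0).
echo "$cfgadm_line" | grep "$SERIAL" | awk '{print $1}'
# -> sata1/7::dsk/c1t7d0
```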

3. The format command will show the following:


root@th12 # format

Searching for disks...done
AVAILABLE DISK SELECTIONS:

0. c0t0d0 /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@0,0
1. c0t1d0 /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@1,0

(output omitted for brevity)

14. c1t6d0 /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@6,0
15. c1t7d0 /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@7,0

Specify disk (enter its number): 15

selecting c1t7d0

[disk formatted]

/dev/dsk/c1t7d0s0 is part of active ZFS pool zpool1. Please see zpool(1M).


FORMAT MENU:

disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
fdisk - run the fdisk program
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
inquiry - show vendor, product and revision
volname - set 8-character volume name

format> p


PARTITION MENU:


0 - change `0' partition
1 - change `1' partition
2 - change `2' partition
3 - change `3' partition
4 - change `4' partition
5 - change `5' partition
6 - change `6' partition
select - select a predefined table
modify - modify a predefined partition table
name - name the current table
print - display the current table
label - write partition map and label to the disk

partition> p


Current partition table (original):


Total disk sectors available: 976756749 + 16384 (reserved sectors)

Part Tag Flag First Sector Size Last Sector

0 usr wm 34 465.75GB 976756749
1 unassigned wm 0 0 0
2 unassigned wm 0 0 0
3 unassigned wm 0 0 0
4 unassigned wm 0 0 0
5 unassigned wm 0 0 0
6 unassigned wm 0 0 0
8 reserved wm 976756750 8.00MB 976773133
partition> q

4. The zpool commands will show that the pool is healthy and online:


root@th12 # zpool list


NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zpool1 20.8T 1.14M 20.8T 0% ONLINE -


root@th12 # zpool status zpool1

pool: zpool1
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
zpool1 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t0d0 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c6t0d0 ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t1d0 ONLINE 0 0 0
c1t1d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
c7t1d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t2d0 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
c5t2d0 ONLINE 0 0 0
c6t2d0 ONLINE 0 0 0
c7t2d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t3d0 ONLINE 0 0 0
c1t3d0 ONLINE 0 0 0
c4t3d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
c7t3d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t4d0 ONLINE 0 0 0
c1t4d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c6t4d0 ONLINE 0 0 0
c7t4d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t5d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c5t5d0 ONLINE 0 0 0
c6t5d0 ONLINE 0 0 0
c7t5d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t6d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t6d0 ONLINE 0 0 0
c7t6d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t7d0 ONLINE 0 0 0
c1t7d0 ONLINE 0 0 0
c4t7d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0
c7t7d0 ONLINE 0 0 0

errors: No known data errors
root@th12 #

5. In order to replace the drive, you first need to offline it in ZFS:

root@th12 # zpool offline zpool1 c1t7d0

Bringing device c1t7d0 offline

root@th12 # 

6. After the drive has been offlined, the zpool status command will show the following:

root@th12 # zpool status zpool1

pool: zpool1
state: DEGRADED
status: One or more devices has been taken offline by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
scrub: none requested
config:

NAME STATE READ WRITE CKSUM

zpool1 DEGRADED 0 0 0
raidz ONLINE 0 0 0
c0t0d0 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c6t0d0 ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t1d0 ONLINE 0 0 0
c1t1d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
c7t1d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t2d0 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
c5t2d0 ONLINE 0 0 0
c6t2d0 ONLINE 0 0 0
c7t2d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t3d0 ONLINE 0 0 0
c1t3d0 ONLINE 0 0 0
c4t3d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
c7t3d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t4d0 ONLINE 0 0 0
c1t4d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c6t4d0 ONLINE 0 0 0
c7t4d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t5d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c5t5d0 ONLINE 0 0 0
c6t5d0 ONLINE 0 0 0
c7t5d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t6d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t6d0 ONLINE 0 0 0
c7t6d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t7d0 ONLINE 0 0 0
c1t7d0 ONLINE 0 0 0
c4t7d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0
c7t7d0 ONLINE 0 0 0

errors: No known data errors
root@th12 #

7. Now that the drive has been offlined from the ZFS pool, it can be dynamically unconfigured from OS control by running the following command:

root@th12 # cfgadm -c unconfigure sata1/7::dsk/c1t7d0

Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:7
This operation will suspend activity on the SATA device

Continue (yes/no) yes


root@th12 # 
Dec 5 14:20:02 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Dec 5 14:20:02 th12 port 7: link lost
Dec 5 14:20:03 th12 sata: WARNING: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Dec 5 14:20:03 th12 SATA device detached at port 7
Dec 5 14:20:29 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Dec 5 14:20:29 th12 port 7: link lost
Dec 5 14:20:29 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Dec 5 14:20:29 th12 port 7: link established
Dec 5 14:21:30 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Dec 5 14:21:30 th12 port 7: device reset
Dec 5 14:21:30 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Dec 5 14:21:30 th12 port 7: device reset
Dec 5 14:21:30 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Dec 5 14:21:30 th12 port 7: link lost
Dec 5 14:21:30 th12 sata: NOTICE: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Dec 5 14:21:30 th12 port 7: link established
Dec 5 14:21:30 th12 sata: WARNING: /pci@0,0/pci1022,7458@2/pci11ab,11ab@1:
Dec 5 14:21:30 th12 SATA device attached at port 7

8. Notice that the drive no longer shows up in cfgadm:


root@th12 # cfgadm -al | grep t7

sata0/7::dsk/c0t7d0 disk connected configured ok
sata2/7::dsk/c4t7d0 disk connected configured ok
sata3/7::dsk/c5t7d0 disk connected configured ok
sata4/7::dsk/c6t7d0 disk connected configured ok
sata5/7::dsk/c7t7d0 disk connected configured ok

9. At this point the drive is safe to remove; its blue light should be lit, indicating that it can be pulled. Physically replace the drive.

 

10. Once the drive has been physically replaced, you can configure it back into OS control by running the following command:


root@th12 # cfgadm -c configure sata1/7::dsk/c1t7d0

11. Notice that cfgadm now shows the drive again:


root@th12 # cfgadm -al | grep t7


sata0/7::dsk/c0t7d0 disk connected configured ok
sata1/7::dsk/c1t7d0 disk connected configured ok
sata2/7::dsk/c4t7d0 disk connected configured ok
sata3/7::dsk/c5t7d0 disk connected configured ok
sata4/7::dsk/c6t7d0 disk connected configured ok
sata5/7::dsk/c7t7d0 disk connected configured ok


root@th12 #

12. You can now put the drive back under ZFS control by running the command below, substituting your drive's c#t#d# for c1t7d0.

root@th12 # zpool replace zpool1 c1t7d0 c1t7d0

root@th12 #

13. The pool is now healthy again:


root@th12 # zpool list


NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zpool1 20.8T 1.38M 20.8T 0% ONLINE -


Notice the resilver message under scrub in the zpool status output below.

root@th12 # zpool status zpool1

pool: zpool1
state: ONLINE
scrub: resilver completed with 0 errors on Tue Dec 5 14:22:46 2006
config:

NAME STATE READ WRITE CKSUM

zpool1 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t0d0 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c6t0d0 ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t1d0 ONLINE 0 0 0
c1t1d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
c7t1d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t2d0 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
c5t2d0 ONLINE 0 0 0
c6t2d0 ONLINE 0 0 0
c7t2d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t3d0 ONLINE 0 0 0
c1t3d0 ONLINE 0 0 0
c4t3d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
c7t3d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t4d0 ONLINE 0 0 0
c1t4d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c6t4d0 ONLINE 0 0 0
c7t4d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t5d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c5t5d0 ONLINE 0 0 0
c6t5d0 ONLINE 0 0 0
c7t5d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t6d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t6d0 ONLINE 0 0 0
c7t6d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c0t7d0 ONLINE 0 0 0
c1t7d0 ONLINE 0 0 0
c4t7d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0
c7t7d0 ONLINE 0 0 0

errors: No known data errors

root@th12 #

14. Use the fmadm command to repair the status of the drive in the fault management service:


root@th12 # fmadm repair 665c1b1a-7405-6f8a-adc5-be4e32dc9232



Summary:
1. ZFS has not yet detected the disk as failed (faulty, unavailable, etc.), but the drive firmware's SMART predicts the failure and reports it to fmadm.
Because ZFS still considers the disk healthy, cfgadm -c unconfigure fails:
root@th12 # cfgadm -c unconfigure sata1/7::dsk/c1t7d0
So the procedure is:
    1.1 Offline the disk in ZFS
        root@th12 # zpool offline zpool1 c1t7d0
    1.2 cfgadm -c unconfigure
        root@th12 # cfgadm -c unconfigure sata1/7::dsk/c1t7d0
        root@th12 # cfgadm -al will now show c1t7d0 as unconfigured
    1.3 Physically replace the disk
        The disk's blue light comes on, indicating it can be pulled; then insert the new disk.
    1.4 cfgadm -c configure
        root@th12 # cfgadm -c configure sata1/7::dsk/c1t7d0
    1.5 zpool replace
        root@th12 # zpool replace zpool1 c1t7d0 c1t7d0

Clear the error record in fmadm:
    root@th12 # fmadm repair 665c1b1a-7405-6f8a-adc5-be4e32dc9232
    Reset the port to clear the amber light (rebooting the machine also clears it).
2. fmadm faulty shows the disk's error records, i.e. what the drive firmware's SMART reported to fmadm.
3. zpool list shows the pools that currently exist.
4. zpool status pool_name shows the state of pool_name.

starwang1112 2016-12-14 12:09:08

How do you map the disk identifier in the fmadm faulty output to the disk name in the zpool?