Category: System Operations & Maintenance

2012-08-06 13:17:04


Sun/Solaris Cluster 3.0/3.1: How to fix wrong DID entries after a disk replacement [ID 1007674.1]
Modified: 29-AUG-2011   Type: PROBLEM   Migrated ID: 210631   Status: PUBLISHED

Applies to: Solaris Cluster - Version: 3.0 and later [Release: 3.0 and later ]
All Platforms
Symptoms

This document explains what to do if you are unable to bring a device group online and the messages show the following:

Dec 13 19:58:21 cronos SC[SUNW.HAStoragePlus,MSG01-RG,disk-group1,hastorageplus_prenet_start_private]: [ID 474256 daemon.info] Validations of all specified global device services complete.
Dec 13 19:58:25 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: becoming primary for disk-group1
Dec 13 19:58:26 cronos Cluster.Framework: [ID 801593 daemon.error] stderr: metaset: cronos: disk-group1: stale databases
Dec 13 19:58:26 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: Stale database for diskset disk-group1
Dec 13 19:58:30 cronos SC[SUNW.HAStoragePlus,MSG01-RG,disk-group1,hastorageplus_prenet_start_private]: [ID 500133 daemon.warning] Device switchover of global service disk-group1 associated with path /global/dbcal to this node failed: Node failed to become the primary.
Dec 13 19:58:34 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: becoming primary for disk-group1
Dec 13 19:58:34 cronos Cluster.Framework: [ID 801593 daemon.error] stderr: metaset: cronos: disk-group1: stale databases
Dec 13 19:58:34 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: Stale database for diskset disk-group1
Dec 13 19:58:38 cronos SC[SUNW.HAStoragePlus,MSG01-RG,disk-group1,hastorageplus_prenet_start_private]: [ID 500133 daemon.warning] Device switchover of global service disk-group1 associated with path /global/jes1 to this node failed: Node failed to become the primary.

It is also possible that 'scdidadm -c' complains about changed or missing DIDs.
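A quick way to confirm such a mismatch (a sketch, not part of the original note) is to re-run the consistency check and list the DID instances that are mapped on only one node. Keep in mind that local devices such as boot disks and CD-ROMs legitimately appear for a single node, so treat the second command's output as a hint rather than proof:

root@cronos # /usr/cluster/bin/scdidadm -c
root@cronos # /usr/cluster/bin/scdidadm -L | awk '{print $1}' | sort -n | uniq -u

The second command prints the DID instance numbers that occur exactly once in the 'scdidadm -L' listing; a shared disk showing up there points to a broken mapping like the one described below.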
Changes

Cause

Typically this problem happens after replacement of a bad disk without following the right procedure. The correct procedure to replace a bad disk with SVM/DiskSuite and Cluster 3.x is detailed in Technical Instruction Document 1004951.1 - Sun/Solaris Cluster 3.x: How to change SCSI JBOD disk with Solstice DiskSuite SDS // Solaris Volume Manager SVM.

For instance, suppose that while following that procedure the cfgadm command was not run on the node that does not own the diskset, say node2. When the scgdevs command is then run, it removes node2 from the list of nodes for the DID instance that corresponds to the replaced disk and adds a new DID instance for node2 only.

You end up with two DID instances for the same physical disk, one for each node. In this scenario node2 may fail to take over the diskset, since the replicas reference a DID instance to which node2 has no access. As an example, in the scdidadm output below you can see that DID instances 13 and 37 both exist for disk c3t1d0:

root@node2 # /usr/cluster/bin/scdidadm -L

13 node1:/dev/rdsk/c3t1d0 /dev/did/rdsk/d13
37 node2:/dev/rdsk/c3t1d0 /dev/did/rdsk/d37

Using the scdidadm command you can verify that the disk IDs for those DIDs are different:

root@node1 # /usr/cluster/bin/scdidadm -o diskid -l d13
46554a495453552030304e3043344e4a2020202000000000

root@node2 # /usr/cluster/bin/scdidadm -o diskid -l d37
46554a495453552030315830383637342020202000000000

and on node1 the 'iostat -E' command returns a serial number for disk c3t1d0 that differs from the one returned on node2 (to find out the sd instance number of a c#t#d# disk, see the "Additional Information" section):

root@node1 # /usr/bin/iostat -E

sd31 Soft Errors: 203 Hard Errors: 242 Transport Errors: 272
Vendor: FUJITSU Product: MAP3367N SUN36G Revision: 0401 Serial No: 00N0C4NJ

root@node2 # /usr/bin/iostat -E

sd31 Soft Errors: 1 Hard Errors: 21 Transport Errors: 31
Vendor: FUJITSU Product: MAN3367M SUN36G Revision: 1502 Serial No: 01X0867
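As a side note (not part of the original document), in this example the diskid strings are just hex-encoded vendor/serial information, so they can be tied directly to the 'iostat -E' output. A minimal sketch, assuming perl is available on the node (any hex-to-ASCII converter will do):

root@node1 # /usr/cluster/bin/scdidadm -o diskid -l d13 | perl -ne 'chomp; print pack("H*", $_), "\n"'
FUJITSU 00N0C4NJ

(the decoded string is followed by padding). The decoded value for d13 matches the serial number iostat reports on node1, while decoding d37 would still show the serial number of the disk that was replaced.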

These outputs occur because node1 correctly references the serial number of the disk currently present, while node2 still references the serial number of the replaced disk.

Solution

If you are lucky, 'scdidadm -L' shows two DID instances (e.g. 13 and 37) for only one shared disk (e.g. c3t1d0).
In this case, first check which of the two actually references a disk that is no longer present in the JBOD.
This is easily done by visually inspecting the serial numbers of the disks currently in the JBOD. Let's say DID 37 is the bad one (the disk with s/n 01X0867 has been replaced).
Now, to fix the issue, remove disk c3t1d0 from node2:

root@node2 # /usr/sbin/cfgadm -c unconfigure c3::dsk/c3t1d0
root@node2 # /usr/sbin/devfsadm -Cv

Remove DID instance 37 from the cluster:

root@node2 # /usr/cluster/bin/scdidadm -C

Verify with 'scdidadm -L' that DID instance 37 has been cleared.
You are now ready to add disk c3t1d0 back on node2 (to find out the sd instance number of a c#t#d# disk, see the "Additional Information" section):

root@node2 # /usr/sbin/cfgadm -c configure c3::sd31
root@node2 # /usr/sbin/devfsadm

On node2, verify that the serial number for disk c3t1d0 has changed:

root@node2 # /usr/bin/iostat -E

sd31 Soft Errors: 1 Hard Errors: 21 Transport Errors: 31
Vendor: FUJITSU Product: MAP3367N SUN36G Revision: 0401 Serial No: 00N0C4NJ

Add node2 to the list of nodes for DID instance 13:

root@node2 # /usr/cluster/bin/scgdevs

Verify with 'scdidadm -L' that there are now two entries for DID instance 13, one for each node:

root@node2 # /usr/cluster/bin/scdidadm -L

13 node1:/dev/rdsk/c3t1d0 /dev/did/rdsk/d13
13 node2:/dev/rdsk/c3t1d0 /dev/did/rdsk/d13
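At this point it is worth confirming that node2 can actually take the device group over again. A sketch, reusing the diskset name disk-group1 from the Symptoms section (substitute your own device group and node names):

root@node1 # /usr/cluster/bin/scswitch -z -D disk-group1 -h node2
root@node2 # /usr/sbin/metaset -s disk-group1

The scswitch command attempts to switch the device group primary over to node2; the metaset output then shows whether node2 is now the owner.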

- If you are unlucky, 'scdidadm -L' shows two DID instances for many shared disks.
In this case, if the nodes can be shut down and you have an old scdidadm output, you can follow the procedure below; otherwise you have to repeat the steps above for each of the affected shared disks.

1. Using an old 'scdidadm -l' output, first check what the DID layout looked like.

# /usr/cluster/bin/scdidadm -l

In this example, the output for node "cronos" should look like:

1 cronos:/dev/rdsk/c0t0d0 /dev/did/rdsk/d1
2 cronos:/dev/rdsk/c1t1d0 /dev/did/rdsk/d2
3 cronos:/dev/rdsk/c1t0d0 /dev/did/rdsk/d3
4 cronos:/dev/rdsk/c2t40d0 /dev/did/rdsk/d4
5 cronos:/dev/rdsk/c3t44d23 /dev/did/rdsk/d5
6 cronos:/dev/rdsk/c2t40d23 /dev/did/rdsk/d6
7 cronos:/dev/rdsk/c2t40d22 /dev/did/rdsk/d7
8 cronos:/dev/rdsk/c2t40d21 /dev/did/rdsk/d8
9 cronos:/dev/rdsk/c2t40d20 /dev/did/rdsk/d9
10 cronos:/dev/rdsk/c2t40d19 /dev/did/rdsk/d10
11 cronos:/dev/rdsk/c2t40d18 /dev/did/rdsk/d11
12 cronos:/dev/rdsk/c2t40d17 /dev/did/rdsk/d12
13 cronos:/dev/rdsk/c2t40d16 /dev/did/rdsk/d13
14 cronos:/dev/rdsk/c2t40d15 /dev/did/rdsk/d14
15 cronos:/dev/rdsk/c2t40d14 /dev/did/rdsk/d15
16 cronos:/dev/rdsk/c2t40d13 /dev/did/rdsk/d16
17 cronos:/dev/rdsk/c2t40d12 /dev/did/rdsk/d17
18 cronos:/dev/rdsk/c2t40d11 /dev/did/rdsk/d18
19 cronos:/dev/rdsk/c2t40d10 /dev/did/rdsk/d19
20 cronos:/dev/rdsk/c2t40d9 /dev/did/rdsk/d20
21 cronos:/dev/rdsk/c2t40d8 /dev/did/rdsk/d21
22 cronos:/dev/rdsk/c2t40d7 /dev/did/rdsk/d22
23 cronos:/dev/rdsk/c2t40d6 /dev/did/rdsk/d23
24 cronos:/dev/rdsk/c2t40d5 /dev/did/rdsk/d24
25 cronos:/dev/rdsk/c2t40d4 /dev/did/rdsk/d25
26 cronos:/dev/rdsk/c2t40d3 /dev/did/rdsk/d26
27 cronos:/dev/rdsk/c2t40d2 /dev/did/rdsk/d27
28 cronos:/dev/rdsk/c2t40d1 /dev/did/rdsk/d28
29 cronos:/dev/rdsk/c3t44d22 /dev/did/rdsk/d29
30 cronos:/dev/rdsk/c3t44d21 /dev/did/rdsk/d30
31 cronos:/dev/rdsk/c3t44d20 /dev/did/rdsk/d31
32 cronos:/dev/rdsk/c3t44d19 /dev/did/rdsk/d32
33 cronos:/dev/rdsk/c3t44d18 /dev/did/rdsk/d33
34 cronos:/dev/rdsk/c3t44d17 /dev/did/rdsk/d34
35 cronos:/dev/rdsk/c3t44d16 /dev/did/rdsk/d35
36 cronos:/dev/rdsk/c3t44d15 /dev/did/rdsk/d36
37 cronos:/dev/rdsk/c3t44d14 /dev/did/rdsk/d37
38 cronos:/dev/rdsk/c3t44d13 /dev/did/rdsk/d38
39 cronos:/dev/rdsk/c3t44d12 /dev/did/rdsk/d39
40 cronos:/dev/rdsk/c3t44d11 /dev/did/rdsk/d40
41 cronos:/dev/rdsk/c3t44d10 /dev/did/rdsk/d41
42 cronos:/dev/rdsk/c3t44d9 /dev/did/rdsk/d42
43 cronos:/dev/rdsk/c3t44d8 /dev/did/rdsk/d43
44 cronos:/dev/rdsk/c3t44d7 /dev/did/rdsk/d44
45 cronos:/dev/rdsk/c3t44d6 /dev/did/rdsk/d45
46 cronos:/dev/rdsk/c3t44d5 /dev/did/rdsk/d46
47 cronos:/dev/rdsk/c3t44d4 /dev/did/rdsk/d47
48 cronos:/dev/rdsk/c3t44d3 /dev/did/rdsk/d48
49 cronos:/dev/rdsk/c3t44d2 /dev/did/rdsk/d49
50 cronos:/dev/rdsk/c3t44d1 /dev/did/rdsk/d50
51 cronos:/dev/rdsk/c3t44d0 /dev/did/rdsk/d51

And the output for node "vulcano" should look like:

4 vulcano:/dev/rdsk/c3t44d0 /dev/did/rdsk/d4
5 vulcano:/dev/rdsk/c2t40d23 /dev/did/rdsk/d5
6 vulcano:/dev/rdsk/c3t44d23 /dev/did/rdsk/d6
7 vulcano:/dev/rdsk/c3t44d22 /dev/did/rdsk/d7
8 vulcano:/dev/rdsk/c3t44d21 /dev/did/rdsk/d8
9 vulcano:/dev/rdsk/c3t44d20 /dev/did/rdsk/d9
10 vulcano:/dev/rdsk/c3t44d19 /dev/did/rdsk/d10
11 vulcano:/dev/rdsk/c3t44d18 /dev/did/rdsk/d11
12 vulcano:/dev/rdsk/c3t44d17 /dev/did/rdsk/d12
13 vulcano:/dev/rdsk/c3t44d16 /dev/did/rdsk/d13
14 vulcano:/dev/rdsk/c3t44d15 /dev/did/rdsk/d14
15 vulcano:/dev/rdsk/c3t44d14 /dev/did/rdsk/d15
16 vulcano:/dev/rdsk/c3t44d13 /dev/did/rdsk/d16
17 vulcano:/dev/rdsk/c3t44d12 /dev/did/rdsk/d17
18 vulcano:/dev/rdsk/c3t44d11 /dev/did/rdsk/d18
19 vulcano:/dev/rdsk/c3t44d10 /dev/did/rdsk/d19
20 vulcano:/dev/rdsk/c3t44d9 /dev/did/rdsk/d20
21 vulcano:/dev/rdsk/c3t44d8 /dev/did/rdsk/d21
22 vulcano:/dev/rdsk/c3t44d7 /dev/did/rdsk/d22
23 vulcano:/dev/rdsk/c3t44d6 /dev/did/rdsk/d23
24 vulcano:/dev/rdsk/c3t44d5 /dev/did/rdsk/d24
25 vulcano:/dev/rdsk/c3t44d4 /dev/did/rdsk/d25
26 vulcano:/dev/rdsk/c3t44d3 /dev/did/rdsk/d26
27 vulcano:/dev/rdsk/c3t44d2 /dev/did/rdsk/d27
28 vulcano:/dev/rdsk/c3t44d1 /dev/did/rdsk/d28
29 vulcano:/dev/rdsk/c2t40d22 /dev/did/rdsk/d29
30 vulcano:/dev/rdsk/c2t40d21 /dev/did/rdsk/d30
31 vulcano:/dev/rdsk/c2t40d20 /dev/did/rdsk/d31
32 vulcano:/dev/rdsk/c2t40d19 /dev/did/rdsk/d32
33 vulcano:/dev/rdsk/c2t40d18 /dev/did/rdsk/d33
34 vulcano:/dev/rdsk/c2t40d17 /dev/did/rdsk/d34
35 vulcano:/dev/rdsk/c2t40d16 /dev/did/rdsk/d35
36 vulcano:/dev/rdsk/c2t40d15 /dev/did/rdsk/d36
37 vulcano:/dev/rdsk/c2t40d14 /dev/did/rdsk/d37
38 vulcano:/dev/rdsk/c2t40d13 /dev/did/rdsk/d38
39 vulcano:/dev/rdsk/c2t40d12 /dev/did/rdsk/d39
40 vulcano:/dev/rdsk/c2t40d11 /dev/did/rdsk/d40
41 vulcano:/dev/rdsk/c2t40d10 /dev/did/rdsk/d41
42 vulcano:/dev/rdsk/c2t40d9 /dev/did/rdsk/d42
43 vulcano:/dev/rdsk/c2t40d8 /dev/did/rdsk/d43
44 vulcano:/dev/rdsk/c2t40d7 /dev/did/rdsk/d44
45 vulcano:/dev/rdsk/c2t40d6 /dev/did/rdsk/d45
46 vulcano:/dev/rdsk/c2t40d5 /dev/did/rdsk/d46
47 vulcano:/dev/rdsk/c2t40d4 /dev/did/rdsk/d47
48 vulcano:/dev/rdsk/c2t40d3 /dev/did/rdsk/d48
49 vulcano:/dev/rdsk/c2t40d2 /dev/did/rdsk/d49
50 vulcano:/dev/rdsk/c2t40d1 /dev/did/rdsk/d50
51 vulcano:/dev/rdsk/c2t40d0 /dev/did/rdsk/d51
52 vulcano:/dev/rdsk/c0t0d0 /dev/did/rdsk/d52
53 vulcano:/dev/rdsk/c1t1d0 /dev/did/rdsk/d53
54 vulcano:/dev/rdsk/c1t0d0 /dev/did/rdsk/d54

Be aware that the sd instance number and the c#t#d# name of a shared disk are not necessarily the same on both nodes.
For instance, in the example above the same shared disk is referenced as c2t40d11 on node "vulcano" and as c3t44d11 on node "cronos".
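If in doubt, you can confirm that two differently named devices are the same physical disk by comparing their disk IDs, as in the sketch below (this assumes 'scdidadm -l' accepts a full /dev/rdsk path as well as a dN name; otherwise use the corresponding DID name):

root@cronos # /usr/cluster/bin/scdidadm -o diskid -l /dev/rdsk/c3t44d11
root@vulcano # /usr/cluster/bin/scdidadm -o diskid -l /dev/rdsk/c2t40d11

If the two strings are identical, both names refer to the same physical spindle.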

2. Now we have to change the affected/missing DIDs. To figure out which DIDs have changed, execute:

# /usr/cluster/bin/scdidadm -L

Then look for DID entries that are not present on both nodes (note that the following example shows output for nodes "cronos" and "vulcano" with the problem present):

40 vulcano:/dev/rdsk/c2t40d11 /dev/did/rdsk/d40
41 vulcano:/dev/rdsk/c2t40d10 /dev/did/rdsk/d41
42 vulcano:/dev/rdsk/c2t40d9 /dev/did/rdsk/d42
43 vulcano:/dev/rdsk/c2t40d8 /dev/did/rdsk/d43
44 vulcano:/dev/rdsk/c2t40d7 /dev/did/rdsk/d44
45 vulcano:/dev/rdsk/c2t40d6 /dev/did/rdsk/d45
46 vulcano:/dev/rdsk/c2t40d5 /dev/did/rdsk/d46
47 vulcano:/dev/rdsk/c2t40d4 /dev/did/rdsk/d47
48 vulcano:/dev/rdsk/c2t40d3 /dev/did/rdsk/d48
49 vulcano:/dev/rdsk/c2t40d2 /dev/did/rdsk/d49
50 vulcano:/dev/rdsk/c2t40d1 /dev/did/rdsk/d50
51 vulcano:/dev/rdsk/c2t40d0 /dev/did/rdsk/d51
...
* 55 cronos:/dev/rdsk/c3t44d11 /dev/did/rdsk/d55
* 56 cronos:/dev/rdsk/c3t44d10 /dev/did/rdsk/d56
* 57 cronos:/dev/rdsk/c3t44d9 /dev/did/rdsk/d57
* 58 cronos:/dev/rdsk/c3t44d8 /dev/did/rdsk/d58
* 59 cronos:/dev/rdsk/c3t44d7 /dev/did/rdsk/d59
* 60 cronos:/dev/rdsk/c3t44d6 /dev/did/rdsk/d60
* 61 cronos:/dev/rdsk/c3t44d5 /dev/did/rdsk/d61
* 62 cronos:/dev/rdsk/c3t44d4 /dev/did/rdsk/d62
* 63 cronos:/dev/rdsk/c3t44d3 /dev/did/rdsk/d63
* 64 cronos:/dev/rdsk/c3t44d2 /dev/did/rdsk/d64
* 65 cronos:/dev/rdsk/c3t44d1 /dev/did/rdsk/d65
* 66 cronos:/dev/rdsk/c3t44d0 /dev/did/rdsk/d66

The DID entries marked with an asterisk '*' have no match on the other node; comparing with a 'scdidadm -l' output taken before the problem shows that these are the DIDs that have changed. The DIDs without a match are the ones that need to be corrected; in this example, DIDs 55 to 66 need to be changed back to 40 to 51. To change them back, do the following:

-Shut down both nodes to the ok prompt.

-Boot a node (the last one to come down) in single user mode, out of the cluster:

boot -sx

-Edit the file:

/etc/cluster/ccr/did_instances

And change entries as follows: 55 to 40, 56 to 41, ... , 66 to 51

-Execute:

# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/did_instances -o

-Boot the node back in cluster mode.

-Now boot the other node in single user mode, out of the cluster:

boot -sx

-Edit the file:

/etc/cluster/ccr/did_instances

And change entries as follows: 55 to 40, 56 to 41, ... , 66 to 51

-Then, execute:

# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/did_instances

-Boot the node back in cluster mode.

-Check if the problem has been fixed (steps 1 and 2 above).

If everything has been fixed:

-Check the output from 'metastat -s' and see if any metadevices need to be re-synced.
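A sketch of that final check, again using disk-group1 from the Symptoms section as a stand-in for your own diskset names (run it on the node that currently owns the diskset):

root@cronos # /usr/sbin/metadb -s disk-group1 -i
root@cronos # /usr/sbin/metastat -s disk-group1 | egrep -i 'maint|resync'

The metadb output should no longer show stale or errored state database replicas, and the egrep highlights any submirrors that are in "Needs maintenance" state or still resyncing.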


Additional Information
To find out the sd instance number of a c#t#d# disk, match the disk path shown in the format output with an sd entry in the /etc/path_to_inst file:

root@node2 # /usr/sbin/format

c3t1d0
/pci@8,700000/scsi@5,1/sd@1,0

root@node2 # /usr/bin/grep "/pci@8,700000/scsi@5,1/sd@1,0" /etc/path_to_inst

"/node@2/pci@8,700000/scsi@5,1/sd@1,0" 31 "sd"

In this case the sd instance number is 31.
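An alternative way to find the physical path (a sketch, not from the original note): the /dev/rdsk entry is a symbolic link into the /devices tree, so its target contains the same physical path that format prints:

root@node2 # ls -l /dev/rdsk/c3t1d0s2

The link target ends in .../pci@8,700000/scsi@5,1/sd@1,0:c,raw; as in the format example above, the /devices path does not carry the /node@2 prefix that appears in /etc/path_to_inst.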

==========
For Solaris Cluster 3.2 it is *much* easier to change DID instance numbers.
Technical Instruction Document 1009730.1 Solaris Cluster 3.2 renaming "did" devices





Related Content

Products
  • Sun Microsystems > Enterprise Computing > High Availability/Clustering > Solaris Cluster
Keywords
CHANGE; DID CHANGE; LOST DID; SDS; STALE DATABASE; SVM
