Sun/Solaris Cluster 3.0/3.1: How to fix wrong DID entries after a disk replacement [ID 1007674.1]

Modified 29-AUG-2011   Type PROBLEM   Migrated ID 210631   Status PUBLISHED

Applies to: Solaris Cluster - Version: 3.0 and later [Release: 3.0 and later ]
All Platforms
Symptoms
This document explains what to do if you are unable to bring a device group online and the messages show the following:
Dec 13 19:58:21 cronos SC[SUNW.HAStoragePlus,MSG01-RG,disk-group1,hastorageplus_prenet_start_private]: [ID 474256 daemon.info] Validations of all specified global device services complete.
Dec 13 19:58:25 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: becoming primary for disk-group1
Dec 13 19:58:26 cronos Cluster.Framework: [ID 801593 daemon.error] stderr: metaset: cronos: disk-group1: stale databases
Dec 13 19:58:26 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: Stale database for diskset disk-group1
Dec 13 19:58:30 cronos SC[SUNW.HAStoragePlus,MSG01-RG,disk-group1,hastorageplus_prenet_start_private]: [ID 500133 daemon.warning] Device switchover of global service disk-group1 associated with path /global/dbcal to this node failed: Node failed to become the primary.
Dec 13 19:58:34 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: becoming primary for disk-group1
Dec 13 19:58:34 cronos Cluster.Framework: [ID 801593 daemon.error] stderr: metaset: cronos: disk-group1: stale databases
Dec 13 19:58:34 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: Stale database for diskset disk-group1
Dec 13 19:58:38 cronos SC[SUNW.HAStoragePlus,MSG01-RG,disk-group1,hastorageplus_prenet_start_private]: [ID 500133 daemon.warning] Device switchover of global service disk-group1 associated with path /global/jes1 to this node failed: Node failed to become the primary.
It is also possible that 'scdidadm -c' reported DIDs as changed or missing.
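You can also run the consistency check manually to confirm; a minimal example (the exact wording of any warnings depends on the cluster release):
# /usr/cluster/bin/scdidadm -c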
Changes
Cause
Typically, this problem happens after a bad disk has been replaced without following the correct procedure. The correct procedure for replacing a bad disk with SVM/DiskSuite and Cluster 3.x is detailed in Technical Instruction Document 1004951.1 - Sun/Solaris Cluster 3.x: How to change SCSI JBOD disk with Solstice DiskSuite SDS // Solaris Volume Manager SVM.
For instance, suppose that while following that procedure the cfgadm command was not run on the node that does not own the diskset, say node2. When the scgdevs command is then run, it removes node2 from the list of nodes for the DID instance that corresponds to the replaced disk and adds a new DID instance for node2 only.
You end up with two DID instances for the same physical disk, one per node. In this scenario node2 can fail to take over the diskset, because the replicas reference a DID instance to which node2 has no access. As an example, in the scdidadm output below you can see that DID instances 13 and 37 are both present for disk c3t1d0:
root@node2 # /usr/cluster/bin/scdidadm -L
13 node1:/dev/rdsk/c3t1d0 /dev/did/rdsk/d13
37 node2:/dev/rdsk/c3t1d0 /dev/did/rdsk/d37
Using the scdidadm command you can verify that the disk ids for those DIDs are different:
root@node1 # /usr/cluster/bin/scdidadm -o diskid -l d13
46554a495453552030304e3043344e4a2020202000000000
root@node2 # /usr/cluster/bin/scdidadm -o diskid -l d37
46554a495453552030315830383637342020202000000000
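Because the disk id string embeds the vendor name and the serial number as hexadecimal ASCII, it can be decoded for easier comparison. A minimal sketch, assuming /usr/bin/perl is available on the node (the hex string is the d13 disk id shown above; trailing padding bytes are not shown):
root@node1 # /usr/bin/perl -e 'print pack("H*", shift), "\n"' 46554a495453552030304e3043344e4a2020202000000000
FUJITSU 00N0C4NJ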
On node1 the 'iostat -E' command reports, for disk c3t1d0, a serial number different from the one reported on node2 (to find out the sd instance number of a c#t#d# disk, see the "Additional Information" section):
root@node1 # /usr/bin/iostat -E
sd31 Soft Errors: 203 Hard Errors: 242 Transport Errors: 272
Vendor: FUJITSU Product: MAP3367N SUN36G Revision: 0401 Serial No: 00N0C4NJ
root@node2 # /usr/bin/iostat -E
sd31 Soft Errors: 1 Hard Errors: 21 Transport Errors: 31
Vendor: FUJITSU Product: MAN3367M SUN36G Revision: 1502 Serial No: 01X0867
These outputs differ because node1 correctly references the serial number of the disk that is currently present, while node2 still references the serial number of the replaced disk.
Solution
If you are lucky, 'scdidadm -L' shows two DID instances (e.g. 13 and 37) for only one shared disk (e.g. c3t1d0).
In this case, first check which of the two instances references a disk that is no longer present in the JBOD.
This can easily be done by visually inspecting the serial numbers of the disks currently installed in the JBOD. Let's say DID 37 is the bad one (the disk with s/n 01X0867 has been replaced).
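If the JBOD cannot conveniently be inspected physically, the serial numbers currently seen by each node can also be listed and compared from the command line; a quick sketch (the exact 'iostat -E' layout may vary slightly between Solaris releases):
root@node1 # /usr/bin/iostat -E | /usr/bin/egrep "^sd|Serial"
root@node2 # /usr/bin/iostat -E | /usr/bin/egrep "^sd|Serial"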
To fix the issue, first remove disk c3t1d0 from node2:
root@node2 # /usr/sbin/cfgadm -c unconfigure c3::dsk/c3t1d0
root@node2 # /usr/sbin/devfsadm -Cv
Remove DID instance 37 from the cluster:
root@node2 # /usr/cluster/bin/scdidadm -C
Verify with 'scdidadm -L' that DID instance 37 has been cleared.
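A quick way to confirm this is to filter the listing for the stale instance; no line should be returned once it is gone (d37 is the example instance used here):
root@node2 # /usr/cluster/bin/scdidadm -L | /usr/bin/grep d37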
You are now ready to add disk c3t1d0 back on node2 (to find out the sd instance number of a c#t#d# disk, see the "Additional Information" section):
root@node2 # /usr/sbin/cfgadm -c configure c3::sd31
root@node2 # /usr/sbin/devfsadm
On node2, verify that the serial number of disk c3t1d0 has changed:
root@node2 # /usr/bin/iostat -E
sd31 Soft Errors: 1 Hard Errors: 21 Transport Errors: 31
Vendor: FUJITSU Product: MAP3367N SUN36G Revision: 0401 Serial No: 00N0C4NJ
Add node2 back to the list of nodes for DID instance 13:
root@node2 # /usr/cluster/bin/scgdevs
Verify with 'scdidadm -L' that you now have two entries for DID instance 13, one for each node:
root@node2 # /usr/cluster/bin/scdidadm -L
13 node1:/dev/rdsk/c3t1d0 /dev/did/rdsk/d13
13 node2:/dev/rdsk/c3t1d0 /dev/did/rdsk/d13
If you are unlucky, 'scdidadm -L' shows two DID instances for many shared disks.
In this case, if the nodes can be shut down and you have an old scdidadm output, you can follow the procedure below; otherwise you have to repeat the steps above for each of the affected shared disks.
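If no old output is available, it can help for the future to save a snapshot of the DID layout before and after any disk maintenance; a minimal sketch (the file name and location are arbitrary examples):
root@cronos # /usr/cluster/bin/scdidadm -L > /var/tmp/scdidadm-L.`date '+%Y%m%d'`   # keep a dated copy for later comparison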
1. Using an old 'scdidadm -l' output, first check what the DID layout should look like.
# /usr/cluster/bin/scdidadm -l
In this example, the output for node "cronos" should look like this:
1 cronos:/dev/rdsk/c0t0d0 /dev/did/rdsk/d1
2 cronos:/dev/rdsk/c1t1d0 /dev/did/rdsk/d2
3 cronos:/dev/rdsk/c1t0d0 /dev/did/rdsk/d3
4 cronos:/dev/rdsk/c2t40d0 /dev/did/rdsk/d4
5 cronos:/dev/rdsk/c3t44d23 /dev/did/rdsk/d5
6 cronos:/dev/rdsk/c2t40d23 /dev/did/rdsk/d6
7 cronos:/dev/rdsk/c2t40d22 /dev/did/rdsk/d7
8 cronos:/dev/rdsk/c2t40d21 /dev/did/rdsk/d8
9 cronos:/dev/rdsk/c2t40d20 /dev/did/rdsk/d9
10 cronos:/dev/rdsk/c2t40d19 /dev/did/rdsk/d10
11 cronos:/dev/rdsk/c2t40d18 /dev/did/rdsk/d11
12 cronos:/dev/rdsk/c2t40d17 /dev/did/rdsk/d12
13 cronos:/dev/rdsk/c2t40d16 /dev/did/rdsk/d13
14 cronos:/dev/rdsk/c2t40d15 /dev/did/rdsk/d14
15 cronos:/dev/rdsk/c2t40d14 /dev/did/rdsk/d15
16 cronos:/dev/rdsk/c2t40d13 /dev/did/rdsk/d16
17 cronos:/dev/rdsk/c2t40d12 /dev/did/rdsk/d17
18 cronos:/dev/rdsk/c2t40d11 /dev/did/rdsk/d18
19 cronos:/dev/rdsk/c2t40d10 /dev/did/rdsk/d19
20 cronos:/dev/rdsk/c2t40d9 /dev/did/rdsk/d20
21 cronos:/dev/rdsk/c2t40d8 /dev/did/rdsk/d21
22 cronos:/dev/rdsk/c2t40d7 /dev/did/rdsk/d22
23 cronos:/dev/rdsk/c2t40d6 /dev/did/rdsk/d23
24 cronos:/dev/rdsk/c2t40d5 /dev/did/rdsk/d24
25 cronos:/dev/rdsk/c2t40d4 /dev/did/rdsk/d25
26 cronos:/dev/rdsk/c2t40d3 /dev/did/rdsk/d26
27 cronos:/dev/rdsk/c2t40d2 /dev/did/rdsk/d27
28 cronos:/dev/rdsk/c2t40d1 /dev/did/rdsk/d28
29 cronos:/dev/rdsk/c3t44d22 /dev/did/rdsk/d29
30 cronos:/dev/rdsk/c3t44d21 /dev/did/rdsk/d30
31 cronos:/dev/rdsk/c3t44d20 /dev/did/rdsk/d31
32 cronos:/dev/rdsk/c3t44d19 /dev/did/rdsk/d32
33 cronos:/dev/rdsk/c3t44d18 /dev/did/rdsk/d33
34 cronos:/dev/rdsk/c3t44d17 /dev/did/rdsk/d34
35 cronos:/dev/rdsk/c3t44d16 /dev/did/rdsk/d35
36 cronos:/dev/rdsk/c3t44d15 /dev/did/rdsk/d36
37 cronos:/dev/rdsk/c3t44d14 /dev/did/rdsk/d37
38 cronos:/dev/rdsk/c3t44d13 /dev/did/rdsk/d38
39 cronos:/dev/rdsk/c3t44d12 /dev/did/rdsk/d39
40 cronos:/dev/rdsk/c3t44d11 /dev/did/rdsk/d40
41 cronos:/dev/rdsk/c3t44d10 /dev/did/rdsk/d41
42 cronos:/dev/rdsk/c3t44d9 /dev/did/rdsk/d42
43 cronos:/dev/rdsk/c3t44d8 /dev/did/rdsk/d43
44 cronos:/dev/rdsk/c3t44d7 /dev/did/rdsk/d44
45 cronos:/dev/rdsk/c3t44d6 /dev/did/rdsk/d45
46 cronos:/dev/rdsk/c3t44d5 /dev/did/rdsk/d46
47 cronos:/dev/rdsk/c3t44d4 /dev/did/rdsk/d47
48 cronos:/dev/rdsk/c3t44d3 /dev/did/rdsk/d48
49 cronos:/dev/rdsk/c3t44d2 /dev/did/rdsk/d49
50 cronos:/dev/rdsk/c3t44d1 /dev/did/rdsk/d50
51 cronos:/dev/rdsk/c3t44d0 /dev/did/rdsk/d51
And the output for node "vulcano" should look like this:
4 vulcano:/dev/rdsk/c3t44d0 /dev/did/rdsk/d4
5 vulcano:/dev/rdsk/c2t40d23 /dev/did/rdsk/d5
6 vulcano:/dev/rdsk/c3t44d23 /dev/did/rdsk/d6
7 vulcano:/dev/rdsk/c3t44d22 /dev/did/rdsk/d7
8 vulcano:/dev/rdsk/c3t44d21 /dev/did/rdsk/d8
9 vulcano:/dev/rdsk/c3t44d20 /dev/did/rdsk/d9
10 vulcano:/dev/rdsk/c3t44d19 /dev/did/rdsk/d10
11 vulcano:/dev/rdsk/c3t44d18 /dev/did/rdsk/d11
12 vulcano:/dev/rdsk/c3t44d17 /dev/did/rdsk/d12
13 vulcano:/dev/rdsk/c3t44d16 /dev/did/rdsk/d13
14 vulcano:/dev/rdsk/c3t44d15 /dev/did/rdsk/d14
15 vulcano:/dev/rdsk/c3t44d14 /dev/did/rdsk/d15
16 vulcano:/dev/rdsk/c3t44d13 /dev/did/rdsk/d16
17 vulcano:/dev/rdsk/c3t44d12 /dev/did/rdsk/d17
18 vulcano:/dev/rdsk/c3t44d11 /dev/did/rdsk/d18
19 vulcano:/dev/rdsk/c3t44d10 /dev/did/rdsk/d19
20 vulcano:/dev/rdsk/c3t44d9 /dev/did/rdsk/d20
21 vulcano:/dev/rdsk/c3t44d8 /dev/did/rdsk/d21
22 vulcano:/dev/rdsk/c3t44d7 /dev/did/rdsk/d22
23 vulcano:/dev/rdsk/c3t44d6 /dev/did/rdsk/d23
24 vulcano:/dev/rdsk/c3t44d5 /dev/did/rdsk/d24
25 vulcano:/dev/rdsk/c3t44d4 /dev/did/rdsk/d25
26 vulcano:/dev/rdsk/c3t44d3 /dev/did/rdsk/d26
27 vulcano:/dev/rdsk/c3t44d2 /dev/did/rdsk/d27
28 vulcano:/dev/rdsk/c3t44d1 /dev/did/rdsk/d28
29 vulcano:/dev/rdsk/c2t40d22 /dev/did/rdsk/d29
30 vulcano:/dev/rdsk/c2t40d21 /dev/did/rdsk/d30
31 vulcano:/dev/rdsk/c2t40d20 /dev/did/rdsk/d31
32 vulcano:/dev/rdsk/c2t40d19 /dev/did/rdsk/d32
33 vulcano:/dev/rdsk/c2t40d18 /dev/did/rdsk/d33
34 vulcano:/dev/rdsk/c2t40d17 /dev/did/rdsk/d34
35 vulcano:/dev/rdsk/c2t40d16 /dev/did/rdsk/d35
36 vulcano:/dev/rdsk/c2t40d15 /dev/did/rdsk/d36
37 vulcano:/dev/rdsk/c2t40d14 /dev/did/rdsk/d37
38 vulcano:/dev/rdsk/c2t40d13 /dev/did/rdsk/d38
39 vulcano:/dev/rdsk/c2t40d12 /dev/did/rdsk/d39
40 vulcano:/dev/rdsk/c2t40d11 /dev/did/rdsk/d40
41 vulcano:/dev/rdsk/c2t40d10 /dev/did/rdsk/d41
42 vulcano:/dev/rdsk/c2t40d9 /dev/did/rdsk/d42
43 vulcano:/dev/rdsk/c2t40d8 /dev/did/rdsk/d43
44 vulcano:/dev/rdsk/c2t40d7 /dev/did/rdsk/d44
45 vulcano:/dev/rdsk/c2t40d6 /dev/did/rdsk/d45
46 vulcano:/dev/rdsk/c2t40d5 /dev/did/rdsk/d46
47 vulcano:/dev/rdsk/c2t40d4 /dev/did/rdsk/d47
48 vulcano:/dev/rdsk/c2t40d3 /dev/did/rdsk/d48
49 vulcano:/dev/rdsk/c2t40d2 /dev/did/rdsk/d49
50 vulcano:/dev/rdsk/c2t40d1 /dev/did/rdsk/d50
51 vulcano:/dev/rdsk/c2t40d0 /dev/did/rdsk/d51
52 vulcano:/dev/rdsk/c0t0d0 /dev/did/rdsk/d52
53 vulcano:/dev/rdsk/c1t1d0 /dev/did/rdsk/d53
54 vulcano:/dev/rdsk/c1t0d0 /dev/did/rdsk/d54
Be aware that the sd instance number and the c#t#d# name of a shared disk are not necessarily the same on both nodes.
For instance, in the example above the same shared disk is referenced as c2t40d11 on node "vulcano" and as c3t44d11 on node "cronos".
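If in doubt whether two different c#t#d# names on the two nodes point to the same physical disk, compare the disk id reported for the shared DID instance on each node; for a healthy shared DID both commands return the same string (d40 is used here only as an example from the layout above):
root@cronos # /usr/cluster/bin/scdidadm -o diskid -l d40
root@vulcano # /usr/cluster/bin/scdidadm -o diskid -l d40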
2. Now we have to change the affected/missing DIDs. To figure out which DIDs have changed, execute:
# /usr/cluster/bin/scdidadm -L
And look for DID entries that have no match on the other node (the following example shows output for nodes "cronos" and "vulcano" with the problem present):
40 vulcano:/dev/rdsk/c2t40d11 /dev/did/rdsk/d40
41 vulcano:/dev/rdsk/c2t40d10 /dev/did/rdsk/d41
42 vulcano:/dev/rdsk/c2t40d9 /dev/did/rdsk/d42
43 vulcano:/dev/rdsk/c2t40d8 /dev/did/rdsk/d43
44 vulcano:/dev/rdsk/c2t40d7 /dev/did/rdsk/d44
45 vulcano:/dev/rdsk/c2t40d6 /dev/did/rdsk/d45
46 vulcano:/dev/rdsk/c2t40d5 /dev/did/rdsk/d46
47 vulcano:/dev/rdsk/c2t40d4 /dev/did/rdsk/d47
48 vulcano:/dev/rdsk/c2t40d3 /dev/did/rdsk/d48
49 vulcano:/dev/rdsk/c2t40d2 /dev/did/rdsk/d49
50 vulcano:/dev/rdsk/c2t40d1 /dev/did/rdsk/d50
51 vulcano:/dev/rdsk/c2t40d0 /dev/did/rdsk/d51
...
* 55 cronos:/dev/rdsk/c3t44d11 /dev/did/rdsk/d55
* 56 cronos:/dev/rdsk/c3t44d10 /dev/did/rdsk/d56
* 57 cronos:/dev/rdsk/c3t44d9 /dev/did/rdsk/d57
* 58 cronos:/dev/rdsk/c3t44d8 /dev/did/rdsk/d58
* 59 cronos:/dev/rdsk/c3t44d7 /dev/did/rdsk/d59
* 60 cronos:/dev/rdsk/c3t44d6 /dev/did/rdsk/d60
* 61 cronos:/dev/rdsk/c3t44d5 /dev/did/rdsk/d61
* 62 cronos:/dev/rdsk/c3t44d4 /dev/did/rdsk/d62
* 63 cronos:/dev/rdsk/c3t44d3 /dev/did/rdsk/d63
* 64 cronos:/dev/rdsk/c3t44d2 /dev/did/rdsk/d64
* 65 cronos:/dev/rdsk/c3t44d1 /dev/did/rdsk/d65
* 66 cronos:/dev/rdsk/c3t44d0 /dev/did/rdsk/d66
The DID entries marked with an asterisk '*' have no match on the other node; they are the ones whose DID numbers have changed, according to the 'scdidadm -l' output taken before the problem. Those unmatched DIDs are the ones that need to be changed: in this example, DIDs 55 through 66 need to be changed back to 40 through 51. To change them back, do the following:
-Shutdown both nodes to the ok prompt.
-Boot a node (the last one to come down) in single user mode, out of the cluster:
boot -sx
-Edit the file:
/etc/cluster/ccr/did_instances
And change entries as follows: 55 to 40, 56 to 41, ... , 66 to 51
-Execute:
# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/did_instances -o
-Boot the node back in cluster mode.
-Now boot the other node in single user mode, out of the cluster:
boot -sx
-Edit the file:
/etc/cluster/ccr/did_instances
And change entries as follows: 55 to 40, 56 to 41, ... , 66 to 51
-Then, execute:
# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/did_instances
-Boot the node back in cluster mode.
-Check if the problem has been fixed (steps 1 and 2 above).
If everything has been fixed:
-Check the output of 'metastat -s <setname>' for each diskset and see whether any metadevices need to be resynchronized.
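As a final sanity check, one possible way to confirm that a repaired DID is again visible from both nodes and that the metadevices are healthy is sketched below (d40, the diskset name disk-group1 and the metastat path are examples and may differ on your system):
root@cronos # /usr/cluster/bin/scdidadm -L | /usr/bin/grep "rdsk/d40"
40       cronos:/dev/rdsk/c3t44d11     /dev/did/rdsk/d40
40       vulcano:/dev/rdsk/c2t40d11    /dev/did/rdsk/d40
root@cronos # /usr/sbin/metastat -s disk-group1 | /usr/bin/grep -i state   # look for anything other than Okay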
Additional Information
To find out the sd instance number of a c#t#d# disk, match the disk path shown in the format output with the corresponding sd entry in the /etc/path_to_inst file:
root@node2 # /usr/sbin/format
c3t1d0
/pci@8,700000/scsi@5,1/sd@1,0
root@node2 # /usr/bin/grep "/pci@8,700000/scsi@5,1/sd@1,0" /etc/path_to_inst
"/node@2/pci@8,700000/scsi@5,1/sd@1,0" 31 "sd"
In this case the sd instance number is 31.
==========
For Solaris Cluster 3.2 it is *much* easier to change did instance numbers.
Technical Instruction Document 1009730.1 Solaris Cluster 3.2 renaming "did" devices