Purpose:
This troubleshooting guide provides a general approach to resolving
the A1000 problem areas most frequently reported by the field in the
Radiance database: the battery FRU, the A1000 controller/HBA, and
A1000 LUN access.
Preliminary inspection by visual fault indications & RM6 healthck:
Check the RAID module for any visible indications of failure, such as
an amber LED on the controller, drives, power supply, fan modules or
battery, or bent pins. This quick diagnosis may shed some light on the
problem. Regardless of whether any indications of failure are found,
run the health check so that the A1000 can detect the failures and
perform an overall root cause analysis, then follow the instructions
provided to replace the component.
Note: See Appendix A for health check procedures.
If the steps in Recovery do not resolve the problem, proceed with the
following steps to identify the root cause and the fix for the
individual A1000 component.
I. For A1000 battery FRU:
Common problems - battery life expiration, or battery failure.
The ultimate solution for these common problems is to replace
the battery and then reset the battery age.
1. Determine the battery age by running the following RM6 CLI command:
raidutil -c <device> -B
For example:
# /usr/lib/osa/bin/raidutil -c c1t0d0 -B
LUNs found on c1t0d0.
LUN 0 RAID 0 10 MB
LUN 1 RAID 5 1000 MB
Battery age is between 720 days and 810 days.
raidutil succeeded!
Interpreting the reported age:
battery age between 630 and 720 days - near expiration
battery age greater than 720 days - expired
The battery should be replaced in both cases.
NOTE: The A1000 battery is not hot-swappable.
2. If the battery age is less than 630 days and the fault LED or
healthck has indicated a failure, it is a battery failure case.
Gather the battery support information from the label on the
battery canister and record it in the Radiance case notes with the
indication 'battery failed/non-expiration'. This information in the
database is valuable for subsequent reliability analysis. The
Radiance Support Type should be set to "Hardware On-Site".
Battery support information example:
Part number : 370-3417-01
Serial number : 17-digit number
Date of manufacture : mm/dd/yy
Date of installation: mm/dd/yy
Date of replacement : mm/dd/yy
After the above information is gathered, replace the battery.
NOTE: The A1000 battery is not hot-swappable.
3. After battery replacement, run the following RM6 command to reset
the battery age: raidutil -c <device> -R
For example:
# /usr/lib/osa/bin/raidutil -c c1t0d0 -R
LUNs found on c1t0d0.
LUN 0 RAID 0 10 MB
LUN 1 RAID 5 1000 MB
raidutil succeeded!
4. Run the RM6 "raidutil -c <device> -B" command again to verify that
the battery age has been reset to zero.
5. Run RM6 Healthck to make sure the battery problem is fixed.
Note: See Appendix A for health check procedures.
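The age thresholds from step 1 can be sketched as a small shell helper.
This is only an illustration: it assumes the "Battery age is between N
days and M days." line always has the form shown in the example output,
and the function name classify_battery_age is hypothetical, not part of
RM6.

```shell
#!/bin/sh
# Classify the A1000 battery age from `raidutil -c <device> -B` output on stdin.
# Thresholds follow the guide: 630-720 days = near expiration, 720+ = expired.
# classify_battery_age is an illustrative name, not an RM6 command.
classify_battery_age() {
    # Extract the lower bound N from "Battery age is between N days and M days."
    low=$(sed -n 's/^[[:space:]]*Battery age is between \([0-9][0-9]*\) days and [0-9][0-9]* days.*$/\1/p' | head -n 1)
    if [ -z "$low" ]; then
        echo "unknown"            # no recognizable battery age line
    elif [ "$low" -ge 720 ]; then
        echo "expired"            # replace the battery
    elif [ "$low" -ge 630 ]; then
        echo "near-expiration"    # replace the battery
    else
        echo "ok"                 # a fault here points to battery failure (step 2)
    fi
}

# Possible usage on a host with RM6 installed:
#   /usr/lib/osa/bin/raidutil -c c1t0d0 -B | classify_battery_age
```

The same helper could be run again after step 3 to confirm that the
reported age has dropped back below the replacement thresholds.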
II. For A1000 controller/HBA:
Common problems - Unable to scan/access the controller, unresponsive/dead
controller or offline controller.
1. Connect a serial console to the A1000 and power cycle the controller.
After the boot cycle completes, check the serial console for the
following message:
"NOTE: Logical Unit 0 is now optimal and online."
If this message does not appear, there is a problem with the A1000
controller. Otherwise, you can be fairly sure that the problem lies with
the host, its device entries, the cable, the terminator or the HBA.
Try running probe-scsi-all at the ok prompt. If the A1000 is listed,
the HBA, SCSI cable and terminator are good.
2. Check for mismatched firmware/NVSRAM versions.
3. Check rmlog to see if any recent failures have been recorded.
4. Use the CLI command "lad" to verify that the controller is visible;
this can show whether the problem exists only in the RM6 GUI.
Note: See Appendix C for "lad" command syntax and example.
5. Check /kernel/drv/sd.conf for proper RM6 entries.
6. Reboot the host to clear any temporary problem with RM6.
7. An "unresponsive/dead controller" can be caused by a power
cycle of the controller during the OS's device scan. Rebooting
the host after the controller completes its initialization may
fix the problem.
8. An "offline controller" can be caused by a faulty HBA. After
replacing the HBA, the controller must be placed back online.
9. Run RM6 Healthck to make sure the controller/HBA problem is fixed.
Note: See Appendix A for health check procedures.
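The decision in step 1 can be sketched as a shell helper that searches
a saved copy of the serial console output for the boot message. This is
a rough illustration only: it assumes the console session was captured
to a file, and both the function name triage_controller and the file
path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Decide where to look next, based on the A1000 boot message from step 1.
# $1: path to a file holding captured serial console output (assumed to exist).
# triage_controller is an illustrative name, not an RM6 command.
triage_controller() {
    if grep 'Logical Unit 0 is now optimal and online' "$1" >/dev/null 2>&1; then
        # Controller finished booting, so look outside the controller.
        echo "boot message found: check host, device entries, cable, terminator, HBA"
    else
        # Message absent (or file unreadable): the controller itself is suspect.
        echo "boot message missing: suspect the A1000 controller"
    fi
}

# Possible usage, with a hypothetical capture file:
#   triage_controller /var/tmp/a1000-console.log
```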
III. For A1000 LUN access:
Common problems - Dead LUN or missing LUN.
1. This can be caused by a failed drive or a wrong drive replacement.
Replacing the failed/wrong drive, formatting the LUN and restoring
the data from backup should fix the problem.
2. This can also be caused by an interrupted write process that has
failed. Stopping all I/O to the LUN, formatting the LUN and
restoring the data from backup should fix the problem.
3. LUN 0 is required for normal communication between the host
and the controller. Check that LUN 0 exists; if it does not, recreate it.
4. Check the System_MaxLunsPerController parameter in RM6
against the current number of LUNs.
5. Run RM6 Healthck to make sure the LUN problem is fixed.
Note: See Appendix A for health check procedures.
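Steps 3 and 4 can be sketched as a check on the "lad" output. This is a
minimal illustration, assuming the "lad" line format shown in Appendix C
(device, serial number, "LUNS:", then the LUN numbers). The function
name check_luns is hypothetical, and the limit is passed in by hand
because this guide does not show how to read
System_MaxLunsPerController.

```shell
#!/bin/sh
# Read `lad` output on stdin and check each controller for LUN 0 and LUN count.
# $1: the LUN limit to compare against (the System_MaxLunsPerController value,
#     looked up by the operator; it is not read automatically here).
# check_luns is an illustrative name, not an RM6 command.
check_luns() {
    awk -v max="$1" '
        {
            # Locate the "LUNS:" marker; the fields after it are LUN numbers.
            start = 0
            for (i = 1; i <= NF; i++) if ($i == "LUNS:") start = i + 1
            if (start == 0) next          # not a controller line

            count = NF - start + 1
            has0 = 0
            for (i = start; i <= NF; i++) if ($i == "0") has0 = 1

            status = "ok"
            if (!has0) status = "missing-lun0"
            else if (count > max) status = "over-limit"

            # Output: device, number of LUNs, status
            print $1, count, status
        }'
}

# Possible usage on a host with RM6 installed (8 is only an example limit):
#   /usr/lib/osa/bin/lad | check_luns 8
```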
Appendix:
A. Two ways to run healthck:
cli: /usr/lib/osa/bin/healthck -a
example:
monty51# /usr/lib/osa/bin/healthck -a
Health Check Summary Information
monty51_001: Unable To Scan Module
healthck succeeded!
gui: /usr/lib/osa/bin/rm6 -> Recovery Guru -> stethoscope icon -> Show Procedure button
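For scripted checks, the CLI summary can be filtered down to the
modules that need attention. This is a sketch under assumptions: it
treats every "<module>: <status>" line as a module entry, as in the
example above, and it assumes a healthy module reports the status
"Optimal", which this guide does not show. The function name
list_unhealthy_modules is hypothetical.

```shell
#!/bin/sh
# Read `healthck -a` output on stdin and print modules that are not healthy.
# Assumption: a healthy module line reads "<module>: Optimal".
# list_unhealthy_modules is an illustrative name, not an RM6 command.
list_unhealthy_modules() {
    awk -F': ' '
        NF >= 2 {
            module = $1
            state = $2
            sub(/^[ \t]+/, "", module)      # trim leading whitespace
            sub(/[ \t\r]+$/, "", state)     # trim trailing whitespace
            if (state != "Optimal") print module " -> " state
        }'
}

# Possible usage on a host with RM6 installed:
#   /usr/lib/osa/bin/healthck -a | list_unhealthy_modules
```

Header and trailer lines such as "healthck succeeded!" contain no
": " separator, so they are skipped.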
B. raidutil -c -B
example:
# /usr/lib/osa/bin/raidutil -c c1t0d0 -B
LUNs found on c1t0d0.
LUN 0 RAID 0 10 MB
LUN 1 RAID 5 1000 MB
Battery age is between 720 days and 810 days.
raidutil succeeded!
C. lad
example:
# /usr/lib/osa/bin/lad
c1t0d0 1T71523997 LUNS: 0 1 2
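A visibility check on a specific controller could be scripted along
these lines. This is a minimal sketch, assuming the device name is the
first field of each "lad" line as in the example above; the function
name controller_visible is hypothetical.

```shell
#!/bin/sh
# Read `lad` output on stdin; exit 0 if device $1 is listed, 1 otherwise.
# controller_visible is an illustrative name, not an RM6 command.
controller_visible() {
    awk -v dev="$1" '
        $1 == dev { found = 1 }           # device name is the first field
        END { exit (found ? 0 : 1) }'
}

# Possible usage on a host with RM6 installed:
#   if /usr/lib/osa/bin/lad | controller_visible c1t0d0; then
#       echo "c1t0d0 is visible to the CLI"
#   else
#       echo "c1t0d0 is not visible to the CLI"
#   fi
```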