2006-04-08 12:33:34

Purpose:
        
         The purpose of this troubleshooting guide is to provide a general
         approach to solving the A1000 problem areas most often reported
         by the field in the Radiance database: the battery FRU, the A1000
         controller/HBA, and A1000 LUN access.
        
         Preliminary inspection by visual fault indications and RM6 healthck:
        
         Check the RAID module for any visible indications of failure, such as
         an amber LED on the controller, drives, power supply, fan modules, or
         battery, or bent pins. This quick diagnosis can shed some light on the
         problem.  Whether or not any indications of failure are found, run
         the health check so that the A1000 can detect the failures and perform
         an overall root cause analysis, and then follow the instructions
         provided to replace the component.
         Note: See Appendix A for health check procedures.
   
         If the steps in Recovery do not resolve the problem, proceed with
         the following steps to identify the root cause and the fix for the
         individual A1000 component.
        
         I.  For A1000 battery FRU:
             Common problems - battery life expiration, or battery failure.
             The ultimate solution for these common problems is to replace
             the battery and then reset the battery age.
             
             1. Determine the battery age by running the following RM6 CLI command:
                raidutil -c <device> -B
               
                For example:
                #/usr/lib/osa/bin/raidutil -c c1t0d0 -B
                LUNs found on c1t0d0.
                LUN 0    RAID 0    10 MB
                LUN 1    RAID 5    1000 MB
                Battery age is between 720 days and 810 days.
                raidutil succeeded!
               
                Battery age between 630 and 720 days - near expiration
                Battery age greater than 720 days    - expired
               
                Replace the battery in either case.
                NOTE: The A1000 battery is not hot-swappable.
             2. If the battery age is less than 630 days and the fault LED or
                healthck has indicated a failure, treat it as a battery failure.
                Gather the battery support information from the label on
                the battery canister and record it in the Radiance case notes,
                marked 'battery failed/non-expiration'.  This information in
                the database is valuable for subsequent reliability analysis.
                Set the Radiance Support Type to "Hardware On-Site".
                Battery support information example:
                Part number         : 370-3417-01
                Serial number       : 17-digit number
                Date of manufacture : mm/dd/yy
                Date of installation: mm/dd/yy
                Date of replacement : mm/dd/yy
                After gathering this information, replace the battery.
                NOTE: The A1000 battery is not hot-swappable.
             3. After replacing the battery, run the following RM6 command to reset
                the battery age:  raidutil -c <device> -R
                 
                For example:
                #/usr/lib/osa/bin/raidutil -c c1t0d0 -R
                LUNs found on c1t0d0.
                LUN 0    RAID 0    10 MB
                LUN 1    RAID 5    1000 MB
                raidutil succeeded!               
               
             4. Run the RM6 "raidutil -c <device> -B" command again to verify that
                the battery age has been reset to zero.
             5. Run RM6 healthck to make sure the battery problem is fixed.
                Note: See Appendix A for health check procedures.
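
The battery age thresholds in section I can be applied to captured "raidutil -c <device> -B" output with a short script. The following is a minimal Python sketch, not part of RM6. The function name classify_battery_age is hypothetical, and using the lower bound of the reported range to pick the category is an assumption (the guide's own example, 720 to 810 days, is treated as expired).

```python
import re

# Thresholds from the guide: 630-720 days is near expiration,
# more than 720 days is expired.
NEAR_EXPIRATION_DAYS = 630
EXPIRED_DAYS = 720

# Matches the age line shown in the raidutil example output.
AGE_LINE = re.compile(r"Battery age is between (\d+) days and (\d+) days")


def classify_battery_age(raidutil_output: str) -> str:
    """Return 'expired', 'near expiration', 'ok', or 'unknown'.

    Assumption: the lower bound of the reported range decides the category.
    """
    match = AGE_LINE.search(raidutil_output)
    if match is None:
        return "unknown"  # no age line found in the captured output
    lower_bound = int(match.group(1))
    if lower_bound >= EXPIRED_DAYS:
        return "expired"
    if lower_bound >= NEAR_EXPIRATION_DAYS:
        return "near expiration"
    return "ok"


# Example output copied from the guide.
sample = """LUNs found on c1t0d0.
LUN 0    RAID 0    10 MB
LUN 1    RAID 5    1000 MB
Battery age is between 720 days and 810 days.
raidutil succeeded!"""

print(classify_battery_age(sample))  # prints "expired"
```

A result of "ok" together with a fault LED or healthck failure corresponds to the battery failure case in step 2.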
                    
          
         II. For A1000 controller/HBA:
             Common problems - unable to scan/access the controller, unresponsive/dead
             controller, or offline controller.
 
               1. Connect a serial console to the A1000 and power cycle the controller.
                  After the boot cycle completes, check the serial console for the
                  following message:
                  "NOTE: Logical Unit 0 is now optimal and online."
                  If this message does not appear, there is a problem with the A1000
                  controller. Otherwise, you can be fairly sure the problem lies with
                  the host, its device entries, the cable, the terminator, or the HBA.
  
                  Run probe-scsi-all at the ok prompt. If the A1000 is listed,
                  the HBA, SCSI cable, and terminator are good.
 
               2. Check for mismatched firmware/NVSRAM versions.
               3. Check rmlog for recently recorded failures.
               4. Use the CLI command "lad" to verify that the controller is visible;
                  this shows whether the problem exists only in the RM6 GUI.
                  Note: See Appendix C for "lad" command syntax and example.
               5. Check /kernel/drv/sd.conf for the proper RM6 entries.
               6. Reboot the host to clear any temporary problem with RM6.
               7. An "unresponsive/dead controller" can be caused by power cycling
                  the controller during the OS's device scan. Rebooting the host
                  after the controller completes its initialization may fix the
                  problem.
               8. An "offline controller" can be caused by a faulty HBA. After
                  replacing the HBA, place the controller back online.
               9. Run RM6 healthck to make sure the controller/HBA problem is fixed.
                  Note: See Appendix A for health check procedures.
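
The reasoning in step 1 of section II can be written down as a small decision function. This is a minimal Python sketch of that triage logic only; the function names and the wording of the returned strings are hypothetical, and the last branch (the A1000 is not listed by probe-scsi-all, so the HBA, cable, or terminator is suspect) is an inference from the guide rather than something it states outright.

```python
# Message the guide says to look for on the serial console after a power cycle.
LUN0_MESSAGE = "NOTE: Logical Unit 0 is now optimal and online."


def console_shows_lun0_optimal(console_log: str) -> bool:
    """True if the captured serial console text contains the expected message."""
    return LUN0_MESSAGE in console_log


def suspect_area(console_log: str, probe_sees_a1000: bool) -> str:
    """Narrow down where to look next, following step 1 of section II.

    probe_sees_a1000: whether probe-scsi-all at the ok prompt listed the A1000.
    """
    if not console_shows_lun0_optimal(console_log):
        return "A1000 controller"
    if probe_sees_a1000:
        # HBA, SCSI cable, and terminator are good; the host side remains.
        return "host or its device entries"
    # Assumption: not seeing the A1000 points at the SCSI path.
    return "HBA, SCSI cable, or terminator"


log = "...\nNOTE: Logical Unit 0 is now optimal and online.\n"
print(suspect_area(log, probe_sees_a1000=True))  # prints "host or its device entries"
```

The remaining steps (firmware/NVSRAM versions, rmlog, lad, sd.conf) still apply whatever this first pass suggests.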
         III. For A1000 LUN access:
              Common problems - dead LUN or missing LUN.
                            
               1. The cause may be a failed drive or a wrong drive replacement.
                  Replace the failed/wrong drive, format the LUN, and restore the
                  data from backup.
               2. The cause may also be an interrupted write process that has
                  failed. Stop all I/O to the LUN, format the LUN, and restore the
                  data from backup.
               3. LUN 0 is required for normal communication between the host
                  and the controller. Check that LUN 0 exists, and recreate it if
                  it does not.
               4. Compare the System_MaxLunsPerController parameter in RM6
                  with the current number of LUNs.
               5. Run RM6 healthck to make sure the LUN problem is fixed.
                  Note: See Appendix A for health check procedures.
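
Steps 3 and 4 of section III can be checked against a line of "lad" output such as the one in Appendix C. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, the second field of the lad line is treated as an opaque controller identifier, and the limit is passed in by the caller rather than read from RM6. Comparing the LUN count with System_MaxLunsPerController is one reading of step 4.

```python
def parse_lad_line(line: str):
    """Split a line like 'c1t0d0 1T71523997 LUNS: 0 1 2' into its parts.

    Returns (device, controller_id, [lun numbers]).
    """
    head, sep, lun_part = line.partition("LUNS:")
    fields = head.split()
    if not sep or len(fields) != 2:
        raise ValueError(f"unexpected lad line: {line!r}")
    device, controller_id = fields
    luns = [int(token) for token in lun_part.split()]
    return device, controller_id, luns


def check_luns(line: str, max_luns_per_controller: int):
    """Return a list of problems found; an empty list means none."""
    _device, _controller_id, luns = parse_lad_line(line)
    problems = []
    if 0 not in luns:
        problems.append("LUN 0 is missing")
    # Assumption: the LUN count must not exceed System_MaxLunsPerController.
    if len(luns) > max_luns_per_controller:
        problems.append(
            f"{len(luns)} LUNs exceed the limit of {max_luns_per_controller}"
        )
    return problems


# Line copied from Appendix C; the limit of 8 is an illustrative value only.
print(check_luns("c1t0d0 1T71523997 LUNS: 0 1 2", 8))  # prints []
```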
           
Appendix:
A. Two ways to run healthck:
   CLI: /usr/lib/osa/bin/healthck -a
        example:
        monty51# /usr/lib/osa/bin/healthck -a
        Health Check Summary Information
        monty51_001:              Unable To Scan Module
        healthck succeeded!
   GUI: /usr/lib/osa/bin/rm6 -> Recovery Guru -> stethoscope icon -> Show Procedure button
B. raidutil -c <device> -B
               
        example:
        #/usr/lib/osa/bin/raidutil -c c1t0d0 -B
        LUNs found on c1t0d0.
        LUN 0    RAID 0    10 MB
        LUN 1    RAID 5    1000 MB
        Battery age is between 720 days and 810 days.
        raidutil succeeded!
              
C. lad
        example:
        #/usr/lib/osa/bin/lad
        c1t0d0 1T71523997 LUNS: 0 1 2
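
The health check summary shown in Appendix A can be scanned for modules that need attention. This is a minimal Python sketch; the function names are hypothetical, and treating "Optimal" as the healthy status string is an assumption, since the guide only shows a failing module.

```python
def parse_healthck_summary(output: str) -> dict:
    """Map each module name to its reported status.

    Only 'name: status' lines are kept; the shell prompt line, the header,
    and the 'healthck succeeded!' line carry no colon-separated status.
    """
    statuses = {}
    for line in output.splitlines():
        name, sep, status = line.partition(":")
        if sep and status.strip():
            statuses[name.strip()] = status.strip()
    return statuses


def modules_needing_attention(output: str, healthy: str = "Optimal") -> list:
    """List modules whose status differs from the assumed healthy string."""
    return [
        name
        for name, status in parse_healthck_summary(output).items()
        if status != healthy
    ]


# Example output copied from Appendix A.
summary = """monty51# /usr/lib/osa/bin/healthck -a
Health Check Summary Information
monty51_001:              Unable To Scan Module
healthck succeeded!"""

print(modules_needing_attention(summary))  # prints ['monty51_001']
```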