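Symptoms:
During routine maintenance of the tape library (ADIC Scalar 1000) and Legato NetWorker, both were restarted. Afterwards, running nsrjb -vHE failed with the error message: Can't fetch old volume. Legato NetWorker then lost the volume information for every tape, and no backup jobs could run.
Fault analysis and handling:
I started with the tape library. It reported no errors, and a quick check with a teach and an inventory showed it working normally. That made me suspect that Legato NetWorker's index db was corrupted, so I decided to recover the index db first.
The index db recovery procedure is as follows.
The first step is to list the bootstrap save sets: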
happy # mminfo -B
date time level ssid file record volume
03/26/07 09:07:05 full 119322881 660 0 index.014
03/27/07 04:14:39 full 136949505 682 0 index.014
03/27/07 09:05:01 full 141409537 686 0 index.014
03/28/07 04:13:26 full 159049217 701 0 index.014
03/28/07 09:03:40 full 163507201 706 0 index.014
03/29/07 04:15:39 full 181201665 723 0 index.014
03/29/07 08:03:30 full 184701441 725 0 index.014
03/29/07 09:03:31 full 185623297 729 0 index.014
03/30/07 04:36:12 full 203635969 745 0 index.014
03/30/07 09:03:16 full 207737857 749 0 index.014
03/31/07 04:28:17 full 225632513 765 0 index.014
03/31/07 09:03:14 full 229855745 769 0 index.014
04/01/07 04:51:28 full 248107009 788 0 index.014
04/01/07 09:02:35 full 251964161 796 0 index.014
04/02/07 04:29:48 full 269892609 812 0 index.014
04/02/07 09:04:24 full 274110465 819 0 index.014
04/03/07 04:13:04 full 291754241 839 0 index.014
04/03/07 09:03:59 full 296222721 843 0 index.014
04/04/07 04:15:42 full 313912833 856 0 index.014
04/04/07 09:02:36 full 318319617 861 0 index.014
04/05/07 04:11:58 full 335973889 875 0 index.014
04/05/07 09:03:46 full 340455937 879 0 index.014
04/06/07 04:12:21 full 358098177 894 0 index.014
04/06/07 09:03:03 full 362563329 898 0 index.014
04/07/07 04:27:09 full 380443905 913 0 index.014
04/07/07 08:03:25 full 383765761 915 0 index.014
04/07/07 09:03:39 full 384690945 919 0 index.014
04/08/07 04:18:33 full 402430209 935 0 index.014
04/08/07 09:02:42 full 406794753 941 0 index.014
04/09/07 04:20:44 full 424582145 956 0 index.014
04/09/07 09:05:48 full 428960769 960 0 index.014
04/10/07 04:23:43 full 446746369 609 0 index.016
04/10/07 09:03:38 full 451045889 616 0 index.016
04/11/07 04:19:19 full 468797185 632 0 index.016
04/11/07 09:03:33 full 473163009 637 0 index.016
04/12/07 04:14:26 full 490840577 653 0 index.016
04/12/07 09:03:28 full 495280129 657 0 index.016
04/13/07 04:16:29 full 512990465 673 0 index.016
04/13/07 09:03:20 full 517396481 677 0 index.016
04/14/07 04:23:42 full 535219713 692 0 index.016
04/14/07 09:03:25 full 539516161 696 0 index.016
04/15/07 04:33:57 full 557495553 714 0 index.016
04/15/07 08:03:36 full 560715777 717 0 index.016
04/15/07 09:03:37 full 561637633 721 0 index.016
04/16/07 04:36:59 full 579660545 738 0 index.016
04/16/07 09:04:43 full 583772929 742 0 index.016
04/17/07 04:14:37 full 601435393 764 0 index.016
04/17/07 09:03:03 full 605865729 768 0 index.016
04/18/07 04:12:54 full 623527425 784 0 index.016
04/18/07 09:02:51 full 627981057 789 0 index.016
04/19/07 04:13:04 full 645648385 805 0 index.016
04/19/07 09:02:51 full 650099457 809 0 index.016
04/20/07 04:33:11 full 668075777 826 0 index.016
04/20/07 09:02:51 full 672217857 831 0 index.016
04/21/07 04:24:36 full 690062337 847 0 index.016
04/21/07 09:02:59 full 694338305 851 0 index.016
04/21/07 20:32:45 full 704933121 860 0 index.016
04/22/07 04:28:59 full 712248065 870 0 index.016
04/22/07 09:03:06 full 716458497 875 0 index.016
04/23/07 04:28:18 full 734355969 890 0 index.016
04/23/07 08:03:10 full 737656321 893 0 index.016
04/23/07 09:05:06 full 738607617 897 0 index.016
04/24/07 04:15:16 full 756274177 28 0 index.017
04/24/07 09:04:37 full 760718593 31 0 index.017
04/25/07 04:12:41 full 778352897 46 0 index.017
04/25/07 09:02:53 full 782810625 51 0 index.017
04/26/07 04:16:58 full 800537089 67 0 index.017
04/26/07 09:04:29 full 804953345 71 0 index.017
04/27/07 04:20:49 full 822714625 87 0 index.017
04/27/07 09:02:55 full 827047681 92 0 index.017
04/28/07 04:26:51 full 844925697 109 0 index.017
04/28/07 09:04:30 full 849190401 113 0 index.017
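The values that mmrecov asks for later (bootstrap save set id, starting file number, starting record number) come from the last row of this listing. Here is a minimal sketch of pulling them out automatically. It assumes the column order shown in the header (date, time, level, ssid, file, record, volume) and that the newest bootstrap is listed last. The function name latest_bootstrap is made up for illustration and is not part of NetWorker.

```shell
#!/bin/sh
# Print the ssid, file, record and volume of the newest bootstrap
# from `mminfo -B` output supplied on stdin.
# Assumption: columns are date time level ssid file record volume,
# and the listing is in chronological order (newest last).
latest_bootstrap() {
    awk 'NF >= 7 && $4 ~ /^[0-9]+$/ {
             # Remember the most recent data row seen so far.
             ssid = $4; file = $5; rec = $6; vol = $7
         }
         END {
             # Fail if no bootstrap row was found at all.
             if (ssid == "") exit 1
             printf "ssid=%s file=%s record=%s volume=%s\n", ssid, file, rec, vol
         }'
}
```

Used as `mminfo -B | latest_bootstrap`, this prints `ssid=849190401 file=113 record=0 volume=index.017` for the listing above, which matches the answers typed into mmrecov further down.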
happy # nsrexecd
happy # nsrd
happy # mmrecov
mmrecov: Using happy as server
NOTICE: mmrecov is used to recover the NetWorker server's media index and
resource files from media (backup tapes or disks) when any of this
critical NetWorker data has been lost or damaged. Note that this
command will OVERWRITE the server's existing media index. mmrecov is not
used to recover NetWorker clients' on-line indexes; normal recover
procedures may be used for this purpose. See the mmrecov(1m) and
nsr_crash(1m) man pages for more details.
/dev/rmt/0cbn
rd= happy1:/dev/rmt/1cbn
rd= happy2:/dev/rmt/3cbn
rd= happy3:/dev/rmt/1cbn
/dev/rmt/1cbn
rd= happy1:/dev/rmt/2cbn
rd= happy2:/dev/rmt/4cbn
rd= happy3:/dev/rmt/2cbn
/dev/rmt/2cbn
rd= happy1:/dev/rmt/3cbn
/dev/rmt/3cbn
rd= happy4:/dev/rmt/1cbn
rd= happy5:/dev/rmt/0cbn
What is the name of the device you plan on using [/dev/rmt/0cbn]?
Enter the latest bootstrap save set id: 849190401
Enter starting file number (if known) [0]: 113
Enter starting record number (if known) [0]: 0
Please insert the volume on which save set id 849190401 started
into /dev/rmt/0cbn. When you have done this, press <RETURN>:
Scanning /dev/rmt/0cbn for save set 849190401; this may take a while...
scanner: scanning LTO Ultrium tape index.017 on /dev/rmt/0cbn
/nsr
/nsr: file exists, overwriting
/export/home/nsr/res.R/nsrla.res
/export/home/nsr/res.R/servers
/export/home/nsr/res.R/nsr.res
/export/home/nsr/res.R/nsrjb.res
/export/home/nsr/res.R/
nsrmmdbasm -r /export/home/nsr/mm/mmvolume6/
/export/home/nsr/mm/
scanner: ssid 849190401: scan complete
scanner: ssid 849190401: 8906 KB, 8 file(s)
/dev/rmt/0cbn: 1:verifying label, moving backward 2 file(s)
/dev/rmt/0cbn: 1:mounted LTO Ultrium tape index.017 (write protected)
If your resource files were lost, they are now recovered in the 'res.R'
directory. Copy or move them to the 'res' directory, after you have shut
down the service. Then restart the service.
Otherwise, just restart the service.
If the on-line index for happy was lost, it can be recovered using
the nsrck command.
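(It is advisable to back up the original directory before recovering. In practice the backup mattered little: after the recovery I found that the original directory had not been overwritten. What you do have to do is rename the original res directory, rename the recovered res.R directory to res as the mmrecov output above instructs, with the service shut down, and then restart Legato NetWorker. The same steps work for migrating a server, but once the new server is set up you also need to run an inventory from Legato NetWorker.)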
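The directory swap described above can be sketched as a small shell function. This is only an illustration, not an official procedure: the function name promote_recovered_res and the res.before-mmrecov backup name are made up, and it assumes the NetWorker services have already been shut down before it is called.

```shell
#!/bin/sh
# Move the recovered res.R directory into place, keeping the old res
# directory under a new name instead of deleting it.
# Assumptions: the argument is the NetWorker directory that contains
# res and res.R; the ".before-mmrecov" suffix is illustrative.
# Run only after the NetWorker services have been shut down.
promote_recovered_res() {
    nsr_dir=$1
    if [ ! -d "$nsr_dir/res.R" ]; then
        echo "no recovered directory at $nsr_dir/res.R" >&2
        return 1
    fi
    backup="$nsr_dir/res.before-mmrecov"
    if [ -e "$backup" ]; then
        echo "$backup already exists, refusing to overwrite" >&2
        return 1
    fi
    # Keep the old resource directory, if there is one.
    if [ -d "$nsr_dir/res" ]; then
        mv "$nsr_dir/res" "$backup" || return 1
    fi
    mv "$nsr_dir/res.R" "$nsr_dir/res"
}
```

With the paths from the mmrecov output above, the call would be `promote_recovered_res /export/home/nsr`. Keeping the old directory under a different name makes it possible to roll back if the recovered resource files turn out to be wrong.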
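Follow-up: about 3 days later, all backup jobs suddenly stopped running and kept reporting that they were waiting for a tape. Every tape operation attempted from Legato NetWorker returned the same waiting message. The problem was most likely in the tape library, so I went on site to deal with it.
On site, I stopped Legato NetWorker with nsr_shutdown -qa and restarted the library. It reported error 75 during startup. After looking up the error number in the manual, I concluded that the fault was in the robot. I still wanted to confirm the state of the drives first, and testing showed that all of them worked normally. When I then exercised the robot, the library raised error 75 as soon as the robot started to move, so I requested spare parts for a replacement.
After removing the robot and inspecting it, I found nothing wrong mechanically, which narrowed the fault down to the LGR card. I replaced the LGR card, reinstalled the robot, restarted the library, and ran a teach and an inventory as a quick check. Everything was normal, so the library appeared to be working again. I then restarted Legato NetWorker with nsrexecd followed by nsrd, and ran nsrjb -vvHE, which completed normally. Finally I started the GUI, nwadmin, and launched a group, which ran normally.
At this point all faults were resolved and the backup jobs were running normally.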
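The check after the repair can be partly scripted. Here is a rough sketch that scans captured inventory output for the error message from the original incident. Only the message text comes from this write-up; nothing else about the nsrjb output format is assumed, and the function name inventory_output_ok is made up for illustration.

```shell
#!/bin/sh
# Read captured jukebox inventory output on stdin and fail if it
# contains the error seen in the original incident.
# Only the message text is taken from this write-up; the rest of the
# output format is not assumed.
inventory_output_ok() {
    # Match on the part of the message that has no apostrophe, so
    # straight and curly quote variants are both caught.
    if grep -q "fetch old volume"; then
        echo "inventory reported: Can't fetch old volume" >&2
        return 1
    fi
    return 0
}
```

One possible use is `nsrjb -vvHE 2>&1 | inventory_output_ok && echo "inventory looks clean"`. A clean result here only means that this one message did not appear. As the incident shows, the hardware can still be failing.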
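Summary:
Looking back at the steps above: when the fault first appeared, a hardware failure should have been the first suspect, but every hardware check passed. The LGR card had probably already begun to fail at that point without reporting any fault, so the hardware was ruled out and I went after the software first. (Normally, Can't fetch old volume tends to appear only after a Legato NetWorker upgrade.) If you do end up having to recover the bootstrap in a situation like this, it helps to have set up a dedicated index pool for the index in your daily backups. The bootstrap is then easy to locate when something goes wrong, and it also makes it easier to deal with NetWorker's recycle period and data period getting out of sync. Without an index pool you have to track down the tape that holds the last successful backup, which is tedious when there are many backup jobs. After the recovery, backups ran normally for 3 days before the hardware fault finally showed itself, and in the end it was fixed by replacing hardware. My advice: when a fault like this occurs, consider both software and hardware in order to resolve it properly.
This write-up is rough; better suggestions and solutions are welcome.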