2013年(350)
分类: Oracle
2013-04-10 13:40:19
背景:某个Oracle 11gR2 Dataguard环境中Primary意外宕机,考虑到业务比较重要,因此没有等待,直接执行了failover,将角色切换到Standby端,切换完之后发现新的Primary端网卡也有异常,时有丢包,虽不严重毕竟也是个故障点,那么就考虑再创建一个数据库,将业务切换到新的ORACLE中,是现成的,但数据量较大,直接复制的方式需要停机并且操作时间很长,因此首选仍然是通过Dataguard的角色切换。
要创建Standby,必须要先有一份Primary的完整映像,一般可以通过在Primary端创建备份集,或者是使用RMAN DUPLICATE FROM ACTIVE DATABASE特性两种方式来创建,不过这两种方案都会导致Primary端负载升高,考虑到当前Primary正承担重要业务,负载已然不低,因此上述两方式均被否决。
检查服务器时看到failover之前创建过全备,并且完整备份及之后产生的所有日志也均在,可否使用这份备份集进行呢?如果是之前版本,那么可以肯定此路不通,不过考虑到10g版本中引入了跨RESETLOGS恢复的特性,原理是将RESETLOGS的操作也记录到REDOLOG中(之前是重置REDO的方式,导致OPEN RESETLOGS操作前后的REDO不再连续),物理Standby能否正确识别failover前后生成的日志并应用呢,考虑到此次切换属于典型的时间紧任务急,因此上述方案值得尝试。创建Standby的准备工作略(含,复制standby控制文件,生成spfile,创建密钥文件,配置监听、网络服务名等);
接下来复制旧的备份集和之后产生的归档文件至Standby端,进入RMAN执行恢复:
RMAN> startup mount;
RMAN> restore database;
恢复虽历时不短,但操作顺利完成,而后进入sqlplus命令行模式下,
SQL> alter database recover managed standby database disconnect from session;
Database altered.
SQL> select * from v$managed_standby;
PROCESS PID STATUS CLIENT_P CLIENT_PID CLIENT_DBID
--------- ---------- ------------ -------- ---------------------------------------- ----------------------------------------
GROUP# RESETLOG_ID THREAD# SEQUENCE# BLOCK# BLOCKS DELAY_MINS KNOWN_AGENTS ACTIVE_AGENTS
---------------------------------------- ----------- ---------- ---------- ---------- ---------- ---------- ------------ -------------
ARCH 15186 CONNECTED ARCH 15186 3423599117
N/A 0 0 0 0 0 0 0 0
ARCH 15188 CONNECTED ARCH 15188 3423599117
N/A 0 0 0 0 0 0 0 0
ARCH 15190 CONNECTED ARCH 15190 3423599117
N/A 0 0 0 0 0 0 0 0
ARCH 15192 CONNECTED ARCH 15192 3423599117
N/A 0 0 0 0 0 0 0 0
MRP0 16800 WAIT_FOR_LOG N/A N/A N/A
N/A 759413453 1 1749 0 0 0 9 9
上述信息可以看到正在等待第1749号日志。
RMAN中查看归档,看看是否完整:
RMAN> list backup of archivelog all;
..............
................
BS Key Size Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ ---------------
502 1.65G DISK 00:07:13 07-DEC-11
BP Key: 502 Status: AVAILABLE Compressed: YES Tag: TAG20111207T021510
Piece Name: /data/backup/rman/fomtiuhf_1_1-20111207.full
List of Archived Logs in backup set 502
Thrd Seq Low SCN Low Time Next SCN Next Time
---- ------- ---------- --------- ---------- ---------
1 1737 1897848796 01-DEC-11 1898158044 01-DEC-11
1 1738 1898158044 01-DEC-11 1898277748 01-DEC-11
1 1739 1898277748 01-DEC-11 1898524562 02-DEC-11
1 1740 1898524562 02-DEC-11 1898736322 02-DEC-11
1 1741 1898736322 02-DEC-11 1899050433 02-DEC-11
1 1742 1899050433 02-DEC-11 1899363421 02-DEC-11
1 1743 1899363421 02-DEC-11 1899691461 02-DEC-11
1 1744 1899691461 02-DEC-11 1900008107 02-DEC-11
1 1745 1900008107 02-DEC-11 1900311751 02-DEC-11
1 1746 1900311751 02-DEC-11 1900470992 02-DEC-11
1 1747 1900470992 02-DEC-11 1900730459 03-DEC-11
1 1748 1900730459 03-DEC-11 1900978002 03-DEC-11
1 1749 1900978002 03-DEC-11 1901054322 03-DEC-11
1 1750 1901054322 03-DEC-11 1901098575 03-DEC-11
1 1751 1901098575 03-DEC-11 1901390021 03-DEC-11
1 1752 1901390021 03-DEC-11 1901717349 03-DEC-11
1 1753 1901717349 03-DEC-11 1902040375 03-DEC-11
1 1754 1902040375 03-DEC-11 1902343084 03-DEC-11
..................
..................
经验证是有的,那么先恢复出来吧,rman命令行中执行restore archivelog命令如下:
RMAN> restore archivelog from sequence 1749;
Starting restore at 07-DEC-11
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 12/07/2011 21:13:46
RMAN-20242: specification does not match any archived log in the repository
............................
...........................
居然报错说未能找到,这就奇怪了,再尝试查看更细粒度的归档呢:
RMAN> list backup of archivelog sequence 1749;
specification does not match any backup in the repository
居然也没有找到。
分析应该是由于,FAILOVER(相当于OPEN RESETLOGS)后,resetlog_id发生了变化,执行的list或restore都是恢复当前resetlog_id的信息,没有找到匹配的记录。
再次执行list archivelog查看:
BS Key Size Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ ---------------
504 1.34G DISK 00:04:28 07-DEC-11
BP Key: 504 Status: AVAILABLE Compressed: YES Tag: TAG20111207T021510
Piece Name: /data/backup/rman/fqmtiuu6_1_1-20111207.full
List of Archived Logs in backup set 504
Thrd Seq Low SCN Low Time Next SCN Next Time
---- ------- ---------- --------- ---------- ---------
1 1772 1906185973 05-DEC-11 1906407667 05-DEC-11
1 1773 1906407667 05-DEC-11 1906414068 05-DEC-11
1 1774 1906414068 05-DEC-11 1906415997 05-DEC-11
1 1775 1906415997 05-DEC-11 1906467127 05-DEC-11
1 1776 1906467127 05-DEC-11 1906707111 06-DEC-11
1 1777 1906707111 06-DEC-11 1906946742 06-DEC-11
1 1778 1906946742 06-DEC-11 1907268990 06-DEC-11
1 1779 1907268990 06-DEC-11 1907582378 06-DEC-11
1 1780 1907582378 06-DEC-11 1907906222 06-DEC-11
1 1781 1907906222 06-DEC-11 1908211804 06-DEC-11
1 1782 1908211804 06-DEC-11 1908532669 06-DEC-11
1 1783 1908532669 06-DEC-11 1908846355 06-DEC-11
1 1784 1908846355 06-DEC-11 1908994400 06-DEC-11
1 1785 1908994400 06-DEC-11 1909129826 06-DEC-11
1 1786 1909129826 06-DEC-11 1909129833 06-DEC-11
1 1787 1909129833 06-DEC-11 1909248075 06-DEC-11 (TERMINAL)
BS Key Size Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ ---------------
503 158.83M DISK 00:00:30 07-DEC-11
BP Key: 503 Status: AVAILABLE Compressed: YES Tag: TAG20111207T021510
Piece Name: /data/backup/rman/frmtiuu6_1_1-20111207.full
List of Archived Logs in backup set 503
Thrd Seq Low SCN Low Time Next SCN Next Time
---- ------- ---------- --------- ---------- ---------
1 1 1909248074 06-DEC-11 1909248078 06-DEC-11
1 2 1909248078 06-DEC-11 1909541467 07-DEC-11
1 3 1909541467 07-DEC-11 1909684157 07-DEC-11
可以看到,1787号日志之后标记了terminal,再之后的日志文件序号自动重置,注意1787号日志文件和1号日志文件的LOW SCN和NEXT SCN,可以看到SCN是连续的。
这最起码确定了日志中的操作是连续的,接下来就是要想办法将之前的归档恢复出来。
先是尝试从指定的备份集中恢复,执行命令如下:
RMAN> restore archivelog all from '/data/backup/rman/fomtiuhf_1_1-20111207.full';
Starting restore at 07-DEC-11
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 12/07/2011 21:19:22
RMAN-06509: only SPFILE or control file can be restored from AUTOBACKUP
看起来这种命令不支持恢复归档,分析备份片段的信息,发现tag是相同的,继续尝试通过tag恢复,执行命令如下:
RMAN> restore archivelog all from tag "TAG20111207T021510";
Starting restore at 07-DEC-11
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 12/07/2011 21:26:54
RMAN-06026: some targets not found - aborting restore
...................
RMAN-06025: no backup of archived log for thread 1 with sequence 1708 and starting SCN of 1893487782 found to restore
RMAN-06025: no backup of archived log for thread 1 with sequence 1707 and starting SCN of 1893344684 found to restore
RMAN-06025: no backup of archived log for thread 1 with sequence 1706 and starting SCN of 1893135723 found to restore
RMAN-06025: no backup of archived log for thread 1 with sequence 1705 and starting SCN of 1892813124 found to restore
RMAN-06025: no backup of archived log for thread 1 with sequence 1704 and starting SCN of 1892478885 found to restore
............................
抛出错误,原因是备份信息不全。这些文件的备份集确实没有,虽然操作失败了,但是却带来了新的希望,是否是由于这些备份不存在所以才失败,虽然已经找不到这些文件的备份集了,但是可以换一个角度思考,如果让RMAN认为这些文件不需要恢复,是否就能够成功执行了呢?
移动备份集到其它路径下,而后执行RMAN命令检查备份集:
RMAN> crosscheck backup of archivelog all;
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1153 device type=DISK
crosschecked backup piece: found to be 'EXPIRED'
backup piece handle=/data/backup/rman/fpmtiuhg_1_1-20111207.full RECID=500 STAMP=769227312
crosschecked backup piece: found to be 'EXPIRED'
backup piece handle=/data/backup/rman/fnmtiuhf_1_1-20111207.full RECID=501 STAMP=769227311
..........................
..........................
删除这些备份信息:
RMAN> delete noprompt expired backup;
然后再将备份集恢复至原始路径,执行catalog命令,将需要的归档文件备份重新注册到控制文件中:
RMAN> catalog backuppiece '/data/backup/rman/fpmtiuhg_1_1-20111207.full';
cataloged backup piece
backup piece handle=/data/backup/rman/fpmtiuhg_1_1-20111207.full RECID=500 STAMP=769277444
RMAN> catalog backuppiece '/data/backup/rman/fomtiuhf_1_1-20111207.full';
cataloged backup piece
backup piece handle=/data/backup/rman/fpmtiuhf_1_1-20111207.full RECID=501 STAMP=769277444
...........................
............................
注意仅注册那些包含恢复用到的归档文件备份集即可,执行完成后再次查看备份的归档文件:
RMAN> list backup of archivelog all;
确认无误后执行命令将其恢复出来:
RMAN> restore archivelog all;
......................
归档文件恢复成功后切换至sqlplus命令行模式下,查看日志应用的情况:
SQL> select * from v$managed_standby;
PROCESS PID STATUS CLIENT_P CLIENT_PID CLIENT_DBID
--------- ---------- ------------ -------- ---------------------------------------- ----------------------------------------
GROUP# RESETLOG_ID THREAD# SEQUENCE# BLOCK# BLOCKS DELAY_MINS KNOWN_AGENTS ACTIVE_AGENTS
---------------------------------------- ----------- ---------- ---------- ---------- ---------- ---------- ------------ -------------
ARCH 15186 CONNECTED ARCH 15186 3423599117
N/A 0 0 0 0 0 0 0 0
ARCH 15188 CONNECTED ARCH 15188 3423599117
N/A 0 0 0 0 0 0 0 0
ARCH 15190 CONNECTED ARCH 15190 3423599117
N/A 0 0 0 0 0 0 0 0
ARCH 15192 CONNECTED ARCH 15192 3423599117
N/A 0 0 0 0 0 0 0 0
MRP0 16800 APPLYING_LOG N/A N/A N/A
N/A 769214827 1 1753 334080 0 0 9 9
从上述信息可以看出,已经开始应用了!!好消息啊,接下来就看是否能够成功迈过resetlogs那个操作了。
查看Alert日志中的信息:
Media Recovery Waiting for thread 1 sequence 1749 branch(resetlogs_id) 759413453
Wed Dec 07 21:46:35 2011
Fetching gap sequence in thread 1 branch(resetlogs_id) 759413453, gap seq 1749-1778
Wed Dec 07 21:47:16 2011
Fetching gap sequence in thread 1 branch(resetlogs_id) 759413453, gap seq 1749-1777
Wed Dec 07 21:47:47 2011
Fetching gap sequence in thread 1 branch(resetlogs_id) 759413453, gap seq 1749-1777
Wed Dec 07 21:48:18 2011
Fetching gap sequence in thread 1 branch(resetlogs_id) 759413453, gap seq 1749-1777
Wed Dec 07 21:48:34 2011
Media Recovery Log /data/ora11g/oradata/oracle9i/archive/1_1749_759413453.dbf
Wed Dec 07 21:49:19 2011
Media Recovery Log /data/ora11g/oradata/oracle9i/archive/1_1750_759413453.dbf
Wed Dec 07 21:49:45 2011
Media Recovery Log /data/ora11g/oradata/oracle9i/archive/1_1751_759413453.dbf
Wed Dec 07 21:50:50 2011
Sqlplus端持续监测v$managed_standby视图中的信息,注意到当应用至1787号日志时MPR进程就消失了,查看alert日志:
Wed Dec 07 22:07:23 2011
Media Recovery Log /data/ora11g/oradata/oracle9i/archive/1_1786_759413453.dbf
Media Recovery Log /data/ora11g/oradata/oracle9i/archive/1_1787_759413453.dbf
Identified End-Of-Redo for thread 1 sequence 1787
Wed Dec 07 22:07:34 2011
Media Recovery End-Of-Redo indicator encountered
Media Recovery Applied until change 1909248073
Signalling error 1152 for datafile 1!
Errors in file /data/ora11g/diag/rdbms/oradb2/oracle9i/trace/oracle9i_pr00_16802.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01152: file 1 was not restored from a sufficiently old backup
ORA-01110: data file 1: '/data/ora11g/oradata/oracle9i/system01.dbf'
Slave exiting with ORA-1547 exception
Errors in file /data/ora11g/diag/rdbms/oradb2/oracle9i/trace/oracle9i_pr00_16802.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01152: file 1 was not restored from a sufficiently old backup
ORA-01110: data file 1: '/data/ora11g/oradata/oracle9i/system01.dbf'
Wed Dec 07 22:07:35 2011
Recovery Slave PR00 previously exited with exception 1547
Errors in file /data/ora11g/diag/rdbms/oradb2/oracle9i/trace/oracle9i_mrp0_16800.trc:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01152: file 1 was not restored from a sufficiently old backup
ORA-01110: data file 1: '/data/ora11g/oradata/oracle9i/system01.dbf'
MRP0: Background Media Recovery process shutdown (oracle9i)
日志端看出,应用进程由于遇到错误,自动停止了,ORA-01547和ORA-01152属于一般错误,常见于执行恢复过程中REDO数据异常。
两项错误的官方解释如下:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
Cause: Media recovery with one of the incomplete recovery options ended without error. However, if the ALTER DATABASE OPEN RESETLOGS command were attempted now, it would fail with the specified error. The most likely cause of this error is forgetting to restore one or more datafiles from a sufficiently old backup before executing the incomplete recovery.
Action: Rerun the incomplete media recovery using different datafile backups, a different control file, or different stop criteria.
ORA-01152: file string was not restored from a sufficiently old backup
Cause: An incomplete recovery session was started, but an insufficient number of logs were applied to make the database consistent. This file is still in the future of the last log applied. The most likely cause of this error is forgetting to restore the file from a backup before doing incomplete recovery.
Action: Either apply more logs until the database is consistent or restore the database file from an older backup and repeat recovery.
错误信息描述的对啊,确实是日志序号不足,因为resetlog后不是递增,而是重置成1了,没啥好说的,继续应用呗,再次尝试重新启动归档应用,
SQL> alter database recover managed standby database disconnect from session;
Database altered.
同时修改Primary的log_archive_dest参数,启动发送归档到当前这个standby端:
SQL> alter system set log_archive_dest_state_2='enable';
注意观察从端的alert日志:
Wed Dec 07 22:18:32 2011
alter database recover managed standby database disconnect from session
Attempt to start background Managed Standby Recovery process (oracle9i)
Wed Dec 07 22:18:32 2011
MRP0 started with pid=26, OS id=17498
MRP0: Background Managed Standby Recovery process started (oracle9i)
started logmerger process
Wed Dec 07 22:18:37 2011
Managed Standby Recovery not using Real Time Apply
Parallel Media Recovery started with 8 slaves
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Media Recovery Log /data/ora11g/oradata/oracle9i/archive/1_1_769214827.dbf
Media Recovery Log /data/ora11g/oradata/oracle9i/archive/1_2_769214827.dbf
Completed: alter database recover managed standby database disconnect from session
Wed Dec 07 22:19:54 2011
Media Recovery Log /data/ora11g/oradata/oracle9i/archive/1_3_769214827.dbf
Wed Dec 07 22:20:28 2011
Media Recovery Waiting for thread 1 sequence 4
日志成功继承到新的序列了,太好啦,成功啦!!
后面,随着primary端的日志不断发送至standby端,standby的日志接收和应用也在继续:
Wed Dec 07 22:21:57 2011
RFS[1]: Assigned to RFS process 17537
RFS[1]: Identified database type as 'physical standby': Client is ARCH pid 23229
RFS[1]: Opened log for thread 1 sequence 9 dbid -871368179 branch 769214827
Wed Dec 07 22:21:57 2011
RFS[2]: Assigned to RFS process 17539
RFS[2]: Identified database type as 'physical standby': Client is ARCH pid 23225
RFS[2]: Opened log for thread 1 sequence 8 dbid -871368179 branch 769214827
Wed Dec 07 22:21:57 2011
RFS[3]: Assigned to RFS process 17541
RFS[3]: Identified database type as 'physical standby': Client is ARCH pid 23233
RFS[3]: Opened log for thread 1 sequence 7 dbid -871368179 branch 769214827
Wed Dec 07 22:21:58 2011
Fetching gap sequence in thread 1, gap sequence 4-9
Wed Dec 07 22:22:02 2011
RFS[4]: Assigned to RFS process 17543
RFS[4]: Identified database type as 'physical standby': Client is LGWR ASYNC pid 17618
Primary database is in MAXIMUM PERFORMANCE mode
RFS[4]: Selected log 4 for thread 1 sequence 11 dbid -871368179 branch 769214827
Wed Dec 07 22:22:09 2011
Archived Log entry 73 added for thread 1 sequence 9 rlc 769214827 ID 0xcc9aa297 dest 2:
RFS[1]: Opened log for thread 1 sequence 4 dbid -871368179 branch 769214827
Wed Dec 07 22:22:12 2011
Archived Log entry 74 added for thread 1 sequence 7 rlc 769214827 ID 0xcc9aa297 dest 2:
RFS[3]: Opened log for thread 1 sequence 5 dbid -871368179 branch 769214827
Wed Dec 07 22:22:14 2011
Archived Log entry 75 added for thread 1 sequence 8 rlc 769214827 ID 0xcc9aa297 dest 2:
RFS[2]: Opened log for thread 1 sequence 6 dbid -871368179 branch 769214827
Wed Dec 07 22:22:23 2011
Archived Log entry 76 added for thread 1 sequence 4 rlc 769214827 ID 0xcc9aa297 dest 2:
Wed Dec 07 22:22:23 2011
Media Recovery Log /data/ora11g/oradata/oracle9i/archive/1_4_769214827.dbf
RFS[1]: Opened log for thread 1 sequence 10 dbid -871368179 branch 769214827
Wed Dec 07 22:22:25 2011
至此,Dataguard环境部署成功。然后,找一个合适的时间点,执行计划内的switchover即可。