早上巡检时发现OGG目标端一个事务处理时间非常长,直接导致REPLICAT进程严重延迟:- GGSCI (sv890n01) 13> info rads_1
- REPLICAT RADS_1 Last Started 2012-12-10 09:46 Status RUNNING
- Checkpoint Lag 01:33:16 (updated 11:35:14 ago)
- Log Read Checkpoint File ./dirdat/pa000021
- 2012-12-11 01:55:10.313205 RBA 1753940084
- SQL> select table_name,num_rows from dba_tables where table_name='GT_MXYE_NEW';
其中 "updated 11:35:14 ago" 记录了上一次检查点发生时间(即目标库上一个事务commit)和当前系统时间的时间差,而 "Checkpoint Lag 01:33:16" 指的是上一个检查点发生时间和上一个事务最后一条record的时间戳之间的时间差。也就是说当前这个事务已经执行了11.5个小时....以下是处理过程:
通过 v$session , v$sql 定位到当前sql为
- DELETE FROM "ADS"."GT_MXYE_NEW" WHERE "ZQDM" = :b0 AND "GTTGZH" = :b1 AND "JYSDM" = :b2 AND "YE" = :b3 AND "DQDM" = :b4 AND "SXDM" = :b5 AND ROWNUM = 1
通过 v$transaction.start_time 发现事务的开始时间与前面的延时是一致的,并且这条sql已经执行了50万次,在源库查询发现晚上"ADS"."GT_MXYE_NEW"这个表删除了500万条记录,sql为
- DELETE FROM GT_MXYE_NEW A WHERE EXISTS ( SELECT 1 FROM ADB_ZW_GTTGYEBDMX B WHERE 1=1 AND A.ZQDM=B.ZQDM AND A.GTTGZH=B.GTTGZH AND A.JYSDM=B.JYSDM AND B.KSRQ=:B1 )
该表没有主键和唯一约束,目标端把源端的sql拆分成了500万条再执行,查看执行计划
- SQL> explain plan for DELETE FROM "ADS"."GT_MXYE_NEW" WHERE "ZQDM" = :b0 AND "GTTGZH" = :b1 AND "JYSDM" = :b2 AND "YE" = :b3 AND "DQDM" = :b4 AND "SXDM" = :b5 AND ROWNUM = 1;
- Explained.
- SQL> @?/rdbms/admin/utlxpls.sql
- PLAN_TABLE_OUTPUT
- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
- Plan hash value: 3603936140
- -------------------------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- -------------------------------------------------------------------------------------------------
- | 0 | DELETE STATEMENT | | 1 | 41 | 629 (1)| 00:00:08 |
- | 1 | DELETE | GT_MXYE_NEW | | | | |
- |* 2 | COUNT STOPKEY | | | | | |
- |* 3 | TABLE ACCESS BY INDEX ROWID| GT_MXYE_NEW | 1 | 41 | 629 (1)| 00:00:08 |
- |* 4 | INDEX RANGE SCAN | IDX_GT_MXYE_NEW | 13477 | | 87 (2)| 00:00:02 |
- -------------------------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter(ROWNUM=1)
- 3 - filter("GTTGZH"=:B1 AND "JYSDM"=:B2 AND "DQDM"=:B4 AND "SXDM"=:B5 AND
- "YE"=TO_NUMBER(:B3))
- 4 - access("ZQDM"=:B0)
- 19 rows selected.
kill replicat 后在目标端新建复合索引(官方文档:KILL REPLICAT 后当前事务回滚并记录检查点,下次重启进程不会丢失数据):
- GGSCI (sv890n01) 15> KILL REPLICAT RADS_1
- Sending KILL request to MANAGER ...
- Killed process (26318) for REPLICAT RADS_1
- GGSCI (sv890n01) 16> info all
- Program Status Group Lag at Chkpt Time Since Chkpt
- MANAGER RUNNING
- REPLICAT ABENDED RADS_1 01:33:16 11:40:01
创建复合索引再看执行计划
- SQL> create index ads.ind_gt_mxye_ogg on ads.GT_MXYE_NEW(ZQDM,GTTGZH,JYSDM,YE,DQDM,SXDM) TABLESPACE TS_ADC_INDEX nologging;
- Index created.
- SQL> explain plan for DELETE FROM "ADS"."GT_MXYE_NEW" WHERE "ZQDM" = :b0 AND "GTTGZH" = :b1 AND "JYSDM" = :b2 AND "YE" = :b3 AND "DQDM" = :b4 AND "SXDM" = :b5 AND ROWNUM = 1;
- Explained.
- SQL> @?/rdbms/admin/utlxpls.sql
- PLAN_TABLE_OUTPUT
- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
- Plan hash value: 415052804
- --------------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
- --------------------------------------------------------------------------------------
- | 0 | DELETE STATEMENT | | 1 | 41 | 3 (0)| 00:00:01 |
- | 1 | DELETE | GT_MXYE_NEW | | | | |
- |* 2 | COUNT STOPKEY | | | | | |
- |* 3 | INDEX RANGE SCAN| IND_GT_MXYE_OGG | 1 | 41 | 3 (0)| 00:00:01 |
- --------------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - filter(ROWNUM=1)
- 3 - access("ZQDM"=:B0 AND "GTTGZH"=:B1 AND "JYSDM"=:B2 AND
- "YE"=TO_NUMBER(:B3) AND "DQDM"=:B4 AND "SXDM"=:B5)
- 17 rows selected.
重新启动REPLICAT进程,运行了11个小时还没运行完的事务15分钟内执行完毕,再看延时情况
- GGSCI (sv890n01) 29> info all
- Program Status Group Lag at Chkpt Time Since Chkpt
- MANAGER RUNNING
- REPLICAT RUNNING RADS_1 09:33:16 00:00:07
由于当前事务已经提交并且写了检查点,相应的延时都发生了变化。
总结:源端的sql和目标端的sql不一定是一致的,原因有可能是OGG的内部机制,也有可能是该表没有外键或者唯一约束,也有可能是REPLICAT进程参数设置不当造成的,需要进一步学习。
阅读(11774) | 评论(0) | 转发(0) |