Category: Oracle

2018-02-25 17:54:16

1. Symptom Description

Application staff on a production system reported that a job they had launched was behaving abnormally: it did not run, so the related tables received no data. They tried to stop the job but failed, and they could not re-launch it either.
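As a first step, the scheduler's own views can confirm what state the job is in. The sketch below is a generic check against the DBA-level scheduler views (assuming the privileges to query them); the job name used is the one from this case:

```sql
-- Check the job's recorded state and history
select job_name, state, last_start_date, failure_count
  from dba_scheduler_jobs
 where job_name = 'CSH_BRANCH_D_1DAY_JOB';

-- Any row here means the scheduler still considers the job to be executing
select job_name, session_id, running_instance, elapsed_time
  from dba_scheduler_running_jobs
 where job_name = 'CSH_BRANCH_D_1DAY_JOB';
```

A job stuck in state RUNNING with no corresponding row in `dba_scheduler_running_jobs` (or a row with no session_id) is consistent with the "zombie job" symptom described below.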

2. Problem Analysis

Sessions started by scheduler jobs normally connect with module = 'DBMS_SCHEDULER', so we connected with a PL/SQL tool and ran the following query:

select S.STATUS,
       S.INST_ID,
       S.SID,
       S.PADDR,
       S.SQL_ID,
       S.SQL_EXEC_START,
       S.ACTION,
       S.LAST_CALL_ET,
       S.BLOCKING_INSTANCE,
       S.BLOCKING_SESSION,
       S.EVENT,
       S.WAIT_CLASS
  from v$session S
 where module = 'DBMS_SCHEDULER';


This found a job session that had been killed: its status was KILLED, but the session had not been released. After obtaining its OS process id (spid) with the query below, we logged in to the server and killed the OS process with kill -9.

select p.spid from v$session s, v$process p where p.addr = s.paddr and s.sid = 508;
We then had the application launch the job again, but it still would not run. On the server side we found the job session waiting on the event row cache lock; the blocking session was 1209, and the wait's P1 value was 11. The query below returned no rows, so we could not tell from it which row cache the lock was on:

select * from v$rowcache where cache#=11;
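On 11g, cache id 11 in a row cache lock wait normally corresponds to the dc_objects row cache. That mapping can be cross-checked against the instance itself; a minimal sketch (listing all row caches, since on some versions several caches share a cache# and the direct predicate may behave unexpectedly):

```sql
-- Map row cache ids to their names; cache id 11 is expected to be dc_objects
select cache#, parameter, count, usage, gets, getmisses
  from v$rowcache
 order by cache#;
```

The systemstate dump later in this article confirms the mapping directly: `cid=11(dc_objects)`.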


Querying session 1209 showed it was the background process CJQ0. CJQn and Jnnn are both Oracle job queue processes, so why would a background process lock a normal job?

select * from v$session where sid=1209;

Through the job view provided by the PL/SQL tool, the application staff observed the following anomaly.

(The screenshot above is missing the task CSH_BRANCH_D_1DAY because it was captured after the problem had been handled. The task was present when the problem occurred; we simply forgot to capture it at the time.)

Many jobs showed as currently running, yet none of them had a session_id: they looked like zombie jobs. Since the jobs could not be killed through their sessions, we tried to stop them with the dbms_scheduler package, but that failed too, again hitting the row cache lock event and again blocked by background session 1209.
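The stop attempt went through the standard scheduler API. A minimal sketch of the calls involved (the `force => true` variant requires the MANAGE SCHEDULER privilege; note that even a forced stop goes through scheduler infrastructure, so it can hang the same way when the job coordinator CJQ0 itself is the blocker, as in this case):

```sql
begin
  -- Normal stop: asks the job slave to interrupt itself at a safe point
  sys.dbms_scheduler.stop_job('CMBD.CSH_BRANCH_D_1DAY_JOB');
end;
/

begin
  -- Forced stop: terminates the job slave process directly
  sys.dbms_scheduler.stop_job('CMBD.CSH_BRANCH_D_1DAY_JOB', force => true);
end;
/
```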



A row cache lock cannot be mapped to a specific locked object from P1 alone, so why would a background process lock a job? Our preliminary conclusion from the results above: many zombie jobs had not been stopped cleanly, so re-launching the same job, or doing anything with it, ran into a lock, and the holder was the background process CJQn.

SQL> show parameter processes

Checking the parameters showed job_queue_processes = 1000 and aq_tm_processes = 10000; neither limit had been reached.


Further analysis

We logged in on the server and collected diagnostics with debug commands.

Connect to the database as sysdba:

$ sqlplus / as sysdba

or

$ sqlplus -prelim / as sysdba   <== when the database is already too slow, or too hung, to accept a normal connection

The following is a screenshot of the operation; we used hanganalyze level 3 and a systemstate dump at level 258.
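The screenshot of the dump commands is not reproduced here. A sketch of the usual oradebug sequence for collecting these dumps (the `-g all` form targets all RAC instances; adjust to your environment, and take hanganalyze twice a minute or two apart so wait chains can be compared):

```sql
SQL> oradebug setmypid
SQL> oradebug unlimit
-- Hang analysis at level 3, cluster-wide on RAC
SQL> oradebug -g all hanganalyze 3
-- wait 1-2 minutes, then take a second sample
SQL> oradebug -g all hanganalyze 3
-- Systemstate dump at level 258 (short stacks + process dump)
SQL> oradebug -g all dump systemstate 258
-- Locate the generated trace file
SQL> oradebug tracefile_name
```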




Notes:

Meaning of the hanganalyze levels:

1-2: hanganalyze output only; no processes are dumped
3: level 2 + dump of processes in the IN_HANG state
4: level 3 + dump of the blockers in the wait chains (state LEAF/LEAF_NW/IGN_DMP)
5: level 4 + dump of all processes in the wait chains (state NLEAF)

Oracle recommends not going above level 3. Level 3 is usually enough to diagnose the problem, and higher levels put extra load on the system.


systemstate dump has several levels:

2:   dump (excluding lock elements)
10:  dump
11:  dump + RAC global cache
256: short stack (function call stacks)
258: 256 + 2  --> short stacks + dump (excluding lock elements)
266: 256 + 10 --> short stacks + dump
267: 256 + 11 --> short stacks + dump + RAC global cache

Levels 11 and 267 dump the global cache and generate very large trace files; they are generally not recommended. If there are not too many processes, level 266 is usually recommended, because the short stacks show what each process is executing. Generating short stacks is time-consuming, however: with a very large number of processes (say 2000), it can take more than 30 minutes. In that case use level 10 or level 258; level 258 collects the short stacks that level 10 lacks, but collects less lock element data than level 10.


The following is an excerpt of the hanganalyze output. The main wait is row cache lock: session id 7 / process id 128 is blocked by session id 1209 / process id 49, consistent with the analysis above.


Chains most likely to have caused the hang:

 [a] Chain 1 Signature: 'rdbms ipc message'<='row cache lock'

     Chain 1 Signature Hash: 0xfe4c0898


Chain 1:

-------------------------------------------------------------------------------

    Oracle session identified by:

    {

                instance: 2 (cmbd.cmbd2)

                   os id: 43516004

              process id: 128, oracle@jkptdb2

              session id: 7

        session serial #: 47563

    }

    is waiting for 'row cache lock' with wait info:

    {

                      p1: 'cache id'=0xb

                      p2: 'mode'=0x0

                      p3: 'request'=0x5

            time in wait: 0.077179 sec

      heur. time in wait: 16 min 18 sec

           timeout after: 2.922821 sec

                 wait id: 13467

                blocking: 0 sessions

            wait history:

              * time between current wait and wait #1: 0.000045 sec

              1.       event: 'row cache lock'

                 time waited: 3.000020 sec

                     wait id: 13466           p1: 'cache id'=0xb

                                              p2: 'mode'=0x0

                                              p3: 'request'=0x5

              * time between wait #1 and #2: 0.000030 sec

              2.       event: 'row cache lock'

                 time waited: 3.000016 sec

                     wait id: 13465           p1: 'cache id'=0xb

                                              p2: 'mode'=0x0

                                              p3: 'request'=0x5

              * time between wait #2 and #3: 0.000032 sec

              3.       event: 'row cache lock'

                 time waited: 3.000016 sec

                     wait id: 13464           p1: 'cache id'=0xb

                                              p2: 'mode'=0x0

                                              p3: 'request'=0x5

    }

    and is blocked by

 => Oracle session identified by:

    {

                instance: 2 (cmbd.cmbd2)

                   os id: 19792004

              process id: 49, oracle@jkptdb2 (CJQ0)

              session id: 1209

        session serial #: 9

    }

    which is waiting for 'rdbms ipc message' with wait info:

    {

                      p1: 'timeout'=0x14

            time in wait: 0.168064 sec

      heur. time in wait: 2.171697 sec

           timeout after: 0.031936 sec

                 wait id: 72756966

                blocking: 1 session

            wait history:

              * time between current wait and wait #1: 0.000350 sec

              1.       event: 'rdbms ipc message'

                 time waited: 0.200014 sec

                     wait id: 72756965        p1: 'timeout'=0x14

              * time between wait #1 and #2: 0.000304 sec

              2.       event: 'rdbms ipc message'

                 time waited: 0.200016 sec

                     wait id: 72756964        p1: 'timeout'=0x14

              * time between wait #2 and #3: 0.000307 sec

              3.       event: 'rdbms ipc message'

                 time waited: 0.200041 sec

                     wait id: 72756963        p1: 'timeout'=0x14

    }


Chain 1 Signature: 'rdbms ipc message'<='row cache lock'

Chain 1 Signature Hash: 0xfe4c0898


The following is an excerpt of the systemstate dump output.

Blocking process (session id 1209):

PROCESS 49: CJQ0

The section below says, roughly: process 49, i.e. session id 1209, is blocking one session, sid 7, whose wait event is row cache lock. This matches the analysis above.

    There are 1 sessions blocked by this session.

    Dumping one waiter:

      inst: 2, sid: 7, ser: 47563

      wait event: 'row cache lock'

        p1: 'cache id'=0xb

        p2: 'mode'=0x0

        p3: 'request'=0x5

      row_wait_obj#: 520, block#: 11728, row#: 0, file# 1

      min_blocked_time: 0 secs, waiter_cache_ver: 21656


Below we can see that the CJQ0 process holds an S lock, and the object is CSH_BRANCH_D_1DAY_JOB: exactly the job we were trying to stop.


kssdmh:  so 0x7000001b029ced8 (type 75, "row cache enqueues") may not have children

      ----------------------------------------

      SO: 0x7000001b02dd868, type: 75, owner: 0x700000191e0d808, flag: INIT/-/-/0x00 if: 0x1 c: 0x1

       proc=0x7000001aa6dc268, name=row cache enqueues, file=kqr.h LINE:2064 ID:, pg=0

      row cache enqueue: count=2 session=7000001adb58460 object=7000001810e7238, mode=S

      savepoint=0x351e5fa5

      row cache parent object: address=7000001810e7238 cid=11(dc_objects)

      hash=af8ef272 typ=21 transaction=0 flags=00000002

      objectno=352585 ownerid=57 nsp=1

      name=CSH_BRANCH_D_1DAY_JOB      own=7000001810e7300[7000001b02dd910,7000001b02dd910] wat=7000001810e7310[7000001b2c1bf00,7000001b2c1bf00] mode=S

      status=VALID/-/-/-/-/-/-/-/-

      request=N release=FALSE flags=8

      instance lock=QI af8ef272 41d157e6

      set=0, complete=FALSE

      data=

      00000039 00154353 485f4252 414e4348 5f445f31 4441595f 4a4f4200 00000000

      00000000 01000000 00000000 00000000 00000000 00000000 00000000 00000000

      00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000



Blocked process (session id 7):

PROCESS 128:

  ----------------------------------------

    Short stack dump:

ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-47dc<-sskgpwwait()+32<-skgpwwait()+180

<-ksliwat()+11032<-kslwaitctx()+180<-kqrigt()+1424<-kqrLockAndPinPo()+532<-kqrpre1()+864<-kqrpre()+28

<-jskaObjGet()+1024<-jskaJobRun()+2788<-jsiRunJob()+776<-jsaRunJob()+780<-spefcmpa()+600

<-spefmccallstd()+1340<-pextproc()+168<-peftrusted()+164<-psdexsp()+276<-rpiswu2()+480

<-psdextp()+632<-pefccal()+588<-pefcal()+260<-pevm_FCAL()+144<-pfrinstr_FCAL()+96

<-pfrrun_no_tool()+96<-pfrrun()+1032<-plsql_run()+716<-peicnt()+320<-kkxexe()+508

<-opiexe()+17168<-kpoal8()+1856<-opiodr()+720<-ttcpip()+1028<-opitsk()+1508<-opiino()+940

<-opiodr()+720<-opidrv()+1132<-sou2o()+136<-opimai_real()+608<-ssthrdmain()+268

<-main()+204<-__start()+112



It is currently waiting on the event row cache lock:

    Current Wait Stack:

     0: waiting for 'row cache lock'

        cache id=0xb, mode=0x0, request=0x5

        wait_id=13602 seq_num=13603 snap_id=1

        wait times: snap=0.003162 sec, exc=0.003162 sec, total=0.003162 sec

        wait times: max=3.000000 sec, heur=23 min 0 sec

        wait counts: calls=1 os=1

        in_wait=1 iflags=0x5a2


The section below shows this process requesting an X lock, and the requested object is again CSH_BRANCH_D_1DAY_JOB, consistent with the analysis above: the CJQ process holds an S lock on CSH_BRANCH_D_1DAY_JOB, while the current process needs an X lock on it in order to operate on the job.


SO: 0x7000001b2c1be58, type: 75, owner: 0x70000019c72b078, flag: INIT/-/-/0x00 if: 0x1 c: 0x1

         proc=0x7000001ab6cb410, name=row cache enqueues, file=kqr.h LINE:2064 ID:, pg=0

        row cache enqueue: count=1 session=7000001ab826ce0 object=7000001810e7238, request=X

        savepoint=0x893a

        row cache parent object: address=7000001810e7238 cid=11(dc_objects)

        hash=af8ef272 typ=21 transaction=0 flags=00000002

        objectno=352585 ownerid=57 nsp=1

        name=CSH_BRANCH_D_1DAY_JOB        own=7000001810e7300[7000001b02dd910,7000001b02dd910] wat=7000001810e7310[7000001b2c1bf00,7000001b2c1bf00] mode=S

        status=VALID/-/-/-/-/-/-/-/-

        request=N release=FALSE flags=8

        instance lock=QI af8ef272 41d157e6

        set=0, complete=FALSE

        data=

        00000039 00154353 485f4252 414e4348 5f445f31 4441595f 4a4f4200 00000000

        00000000 01000000 00000000 00000000 00000000 00000000 00000000 00000000


        011044c6 00000000 07000001 810e7300



The dump also records the statement being executed:

begin sys.dbms_scheduler.run_job('CMBD.CSH_BRANCH_D_1DAY_JOB', false); end;


      ----------------------------------------

      SO: 0x70000015d5f35c8, type: 78, owner: 0x7000001ab826ce0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

       proc=0x7000001ab6cb410, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8548 ID:, pg=0


      LibraryObjectLock:  Address=70000015d5f35c8 Handle=7000000ba27f998 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=1         


        User=7000001ab826ce0 Session=7000001ab826ce0 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=5a52c6da

      LibraryHandle:  Address=7000000ba27f998 Hash=a53cfcdd LockMode=N PinMode=0 LoadLockMode=0 Status=VALD

        ObjectName:Name=begin sys.dbms_scheduler.run_job('CMBD.CSH_BRANCH_D_1DAY_JOB', false); end;


3. Problem Identification

Searching MOS (My Oracle Support) with the keywords row cache lock cjq finally led to Bug 16994952 - Unable to unschedule propagation due to CJQ self-deadlock/hang. The bug description matches the analysis above.

Bug 16994952  Unable to unschedule propagation due to CJQ self-deadlock / hang 

 This note gives a brief overview of bug 16994952.
 The content was last updated on: 23-AUG-2015
 Click here for details of each of the sections below.

Affects:

Product (Component)

Oracle Server (Rdbms)

Range of versions believed to be affected

(Not specified)

Versions confirmed as being affected

· 11.2.0.3 

Platforms affected

Generic (all / most platforms affected)

Fixed:

The fix for 16994952 is first included in 

· 12.1.0.1 (Base Release) 


Interim patches may be available for earlier versions - click here to check. 

Symptoms:

Related To:

· Deadlock 

· Hang (Process Hang) 

· Waits for "row cache lock"

· Stack is likely to include jskaObjGet

· Advanced Queuing 

· Job Queues 

Description

Unable to unschedule propagation due to CJQ self-deadlock / hang.


Rediscovery Information:

 This bug may be the cause if the CJQ0 process is stuck in a

 self-deadlock waiting on "row cache lock" for a dc_objects

 entry in X mode for an object that it already holds in S mode.

 The stack is likely to include jskaObjGet


Workaround:

 Kill the hung CJQ process



Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.

References

Bug:16994952 (This link will only work for PUBLISHED bugs)
Note:245840.1 Information on the sections in this article

4. Problem Resolution

Kill the hung CJQ process:

SQL> select spid from v$session s, v$process p where p.addr = s.paddr and s.sid = 1209;
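With the spid in hand, the coordinator can be killed at the OS level, per the bug's workaround. A sketch of the steps (CJQ0 is a non-fatal background process, so the instance restarts it automatically; still, confirm the spid maps to the CJQ0 process before killing anything):

```shell
# Confirm the pid from the query above really is the CJQ0 coordinator
ps -ef | grep ora_cjq0

# Kill the hung coordinator; the instance will respawn it
kill -9 <spid>
```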


We then stopped the job again through the dbms_scheduler package, and this time it stopped normally. After stopping the zombie jobs one by one, the application could launch the job again, and everything returned to normal.



5. Lessons Learned

How could this problem have been handled faster? Searching up front for "11g cannot stop job" would hardly have found the answer we needed, and working through the full analysis above takes considerable time. In fact, the moment the analysis showed that the blocking process was CJQ, we should have suspected a bug and done a precise search immediately with the keywords: row cache lock CJQ. For 11g and earlier releases, the vast majority of problems we encounter have already been recorded, or already hit by other users. Handling problems quickly, accurately, and reliably therefore requires continuously accumulated experience together with a sound troubleshooting methodology. Equally, after a problem is resolved, write a proper analysis report, keep the habit of logging incidents, and share the case with others.

Below is the fault-analysis workflow of our 新纽 database operations team; we hope it serves as a useful reference.




