全部博文(413)
分类: Oracle
2012-10-11 10:15:46
转自:http://yangtingkun.itpub.net/post/468/526943
客户产品数据库上午出现了严重的性能问题,简单记录一下问题的诊断和解决过程。
Snap Id | Snap Time | Sessions | Cursors/Session | |
Begin Snap: | 7170 | 31-3月 -12 08:00:50 | 395 | 10.5 |
End Snap: | 7172 | 31-3月 -12 09:00:27 | 689 | 11.7 |
Elapsed: | 59.61 (mins) | |||
DB Time: | 17,270.93 (mins) |
从TOP 5看,等待事件中最明显的是latch: cache buffers chains,从当前系统的状态也可以看到这一点:
SQL> SELECT EVENT, COUNT(*) FROM GV$SESSION GROUP BY EVENT HAVING COUNT(*) > 5 ORDER BY 2 DESC;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
latch: cache buffers chains 1043
SQL*Net message from client 500
SQL*Net message to client 35
rdbms ipc message 29
gc cr request 25
latch free 13
gcs remote message 10
gc buffer busy 7
8 rows selected.
系统中部分SQL的执行时间已经长达几个小时:
SQL ordered by Elapsed Time
Elapsed Time (s) | CPU Time (s) | Executions | Elap per Exec (s) | % Total DB Time | SQL Id | SQL Module | SQL Text |
341,821 | 11,917 | 112 | 3051.97 | 32.99 | JDBC Thin Client | SELECT count(*) total FROM ( s... | |
244,752 | 3,380 | 147 | 1664.98 | 23.62 | JDBC Thin Client | SELECT count(*) total FROM ( s... | |
52,419 | 11,331 | 3 | 17472.91 | 5.06 | JDBC Thin Client | select cf.cardid, (CASE WHEN c... | |
38,767 | 532 | 11 | 3524.24 | 3.74 | JDBC Thin Client | SELECT count(*) total FROM ( s... | |
37,146 | 2,524 | 2 | 18573.15 | 3.58 | JDBC Thin Client | SELECT count(*) total FROM ( s... | |
30,796 | 2,331 | 5 | 6159.20 | 2.97 | JDBC Thin Client | SELECT count(*) total FROM ( s... | |
29,991 | 2,506 | 1 | 29991.41 | 2.89 | JDBC Thin Client | SELECT count(*) total FROM ( s... | |
24,762 | 875 | 3 | 8254.06 | 2.39 | JDBC Thin Client | SELECT count(*) total FROM (SE... | |
17,845 | 699 | 2 | 8922.50 | 1.72 | JDBC Thin Client | SELECT count(*) total FROM (SE... | |
17,470 | 1,295 | 0 | 1.69 | JDBC Thin Client | select * from (select aa.*, ro... |
其实根据这些现象,基本上可以判断问题了。肯定是SQL执行计划的改变导致了当前的性能问题。而类似绑定变量窥探之类的问题只会影响个别的SQL,而这么大面积的执行计划的改变,几乎可以断定是统计信息造成的。
询问了一下客户最近的操作,原来昨天夜里通过数据库链的方式导入了一部分数据。而今天一早问题就出现了。
其实问题已经很明显了,在通过数据库链加载数据后,并没有重新收集统计信息,且由于加载时间是在半夜,这批数据也没有赶上每天22:00的统计信息自动收集的工作。这就使得Oracle在生成执行计划时,会依赖数据加载前的统计信息,从而造成了错误的执行计划。
首先解决问题的根源问题,对加载过数据的表重新收集统计:
SQL> SELECT 'EXEC DBMS_STATS.GATHER_TABLE_STATS(''USER1'', ''' || TABLE_NAME || ''', CASCADE => TRUE)' FROM DBA_TABLES WHERE OWNER = 'USER1' AND LAST_ANALYZED < TRUNC(SYSDATE);
'EXECDBMS_STATS.GATHER_TABLE_STATS(''USER1'','''||TABLE_NAME||''',CASCADE=>TRUE)'
-----------------------------------------------------------------------------------------
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_1', CASCADE => TRUE)
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_2', CASCADE => TRUE)
.
.
.
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_3', CASCADE => TRUE)
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_4', CASCADE => TRUE)
12 rows selected.
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_1', CASCADE => TRUE)
PL/SQL procedure successfully completed.
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_2', CASCADE => TRUE)
PL/SQL procedure successfully completed.
.
.
.
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_3', CASCADE => TRUE)
PL/SQL procedure successfully completed.
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_4', CASCADE => TRUE)
PL/SQL procedure successfully completed.
虽然问题的根源已经被解决,但是当前运行的SQL并不会改变执行计划,因此还需要找到问题的SQL,从后台将其结束:
SQL> SELECT 'kill -9 ' || spid from v$session s, v$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains';
'KILL-9'||SPID
--------------------
kill -9 28321
kill -9 25384
kill -9 23697
kill -9 7239
.
.
.
kill -9 9331
kill -9 13759
740 rows selected.
SQL> SELECT 'kill -9 ' || spid from gv$session s, gv$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains' and s.inst_id = p.inst_id and s.inst_id = 2;
'KILL-9'||SPID
--------------------
kill -9 23992
kill -9 5289
kill -9 21067
kill -9 16816
kill -9 16820
kill -9 26767
.
.
.
kill -9 14981
kill -9 26678
kill -9 26682
258 rows selected.
分别在两个节点杀掉这些执行计划存在问题的会话,释放被大量占用的系统资源。
由于Oracle的执行计划并非在收集统计信息后马上生效,因此还有个别的SQL仍然沿用错误的执行计划:
SQL> select distinct inst_id, sql_id from gv$session where event = 'latch: cache buffers chains';
INST_ID SQL_ID
---------- -------------
1 39gvg7vbcm8jx
1 a6aqkm30u7p90
SQL> select address, hash_value from v$sqlarea where sql_id = 'a6aqkm30u7p90';
ADDRESS HASH_VALUE
---------------- ----------
C000000EB7ED3420 3248739616
SQL> exec dbms_shared_pool.purge('C000000EB7ED3420,3248739616','C')
PL/SQL procedure successfully completed.
SQL> select address, hash_value from v$sqlarea where sql_id ='39gvg7vbcm8jx';
ADDRESS HASH_VALUE
---------------- ----------
C000001037B8E308 3603538493
SQL> exec dbms_shared_pool.purge('C000001037B8E308’,‘3603538493', 'C')
PL/SQL procedure successfully completed.
由于当前的数据库版本是10.2.0.5,因此可以很方便的使用dbms_shared_pool将执行计划错误的SQL清除出共享池,强制其重新生成执行计划。
SQL> select event, count(*) from gv$session where username like != user GROUP BY EVENT order by 2 desc;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
SQL*Net message from client 370
SQL*Net message to client 15
gc cr request 10
latch free 4
Streams AQ: waiting for messages in the queue 1
5 rows selected.
数据库中SQL执行计划错误除了导致大量的latch: cache buffers chains等待以外,还存在gc cr request和latch free等这些等待事件,经分析同样是由于错误的执行计划所致。将这些会话采用相同的方法清除后,系统负载恢复到正常范围:
Snap Id | Snap Time | Sessions | Cursors/Session | |
Begin Snap: | 7188 | 31-3月 -12 14:55:17 | 257 | 20.3 |
End Snap: | 7189 | 31-3月 -12 15:03:07 | 256 | 19.9 |
Elapsed: | 7.84 (mins) | |||
DB Time: | 70.46 (mins) |