
Category: Oracle

2013-11-06 16:57:43

Trouble-shooting Oracle Database Performance (by Jonathan Lewis)
A search of the internet will find all sorts of tools and tips on trouble-shooting an Oracle database, but it’s very easy to be drawn into a chase that wastes a huge amount of time and effort if you don’t start with a clear image of what you are trying to achieve and why.
Focusing only on performance trouble-shooting, I’ve gone back to basics to produce a shortlist of targets, to remind you of the only source of information that’s available to help you.

Where’s the problem ?

If you’re looking for a performance problem you’re probably facing one of three scenarios – which can be typified by the questions:
Why is this screen / report always so slow ?
Why did the batch over-run by three hours last night ?
What’s going on right now ?
The third question can, of course, be split into two forms – why is “the system” always slow, and why is “the system” sometimes slow – but the critical feature is that there is (apparently) no specific issue, just a general malaise.
It’s important to remember which of these three scenarios you are addressing, because the most fruitful approach to problem solving should be guided by the class of problem you are addressing. Regardless of the type of problem, however, Oracle has only one source of information that can help you – the dynamic performance views.

[Reposter's note] What is the problem? A clear description of the problem is important; without one you can easily be sent off in the wrong diagnostic direction.

The dynamic performance views.

Internally, Oracle captures a huge amount of information about what’s going on from moment to moment and summarises it in a variety of different ways. Broadly speaking you could divide these summaries across three dimensions: people (sessions), resources, and statements; for example:
People:  v$sesstat, v$session_event, v$sess_time_model
Resources: v$filestat, v$segstat, v$latch, v$event_histogram
Statements: v$sql, v$sqlstats
Summaries can be very helpful, of course, but sometimes what we really need are answers to questions like “how many times did I hit child 43 of the cache buffers chains latch the third time I ran statement X using execution plan Y, and how much CPU time did I spend spinning on that latch and how much time sleeping”.
That kind of detail, of course, is far too voluminous for any realistic capture process, but with 10g we do get some extra sampling across the three dimensions that can be very helpful. Extending from v$session_wait we get v$session_wait_history (a 100% sample of the last 10 wait events for each session), v$active_session_history (a sample once per second for the last hour or so of every session) and dba_hist_active_sess_history (a sample from v$active_session_history that is kept, by default, for the last seven days). The v$active_session_history and dba_hist_active_sess_history, however, are only available to users with the appropriate licences.
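As a sketch of how the 10g sampled views can be interrogated (assuming the diagnostic licence is in place), a query like the following would show the top contributors captured by v$active_session_history over roughly the last five minutes – each row of the view represents about one second of one session's time, with a null event meaning the session was on CPU:

select
        nvl(event, 'ON CPU')    event,
        count(*)                samples
from
        v$active_session_history
where
        sample_time > sysdate - 5/1440
group by
        nvl(event, 'ON CPU')
order by
        samples desc
;

Because it is only a one-second sample, the counts are an estimate of where time went, not an exact accounting – but for a busy system the estimate is usually good enough to direct the search.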

[Reposter's note] The dynamic performance views let Oracle speak for itself; the author analyses the sources of performance data along three dimensions.

Improving performance

So when you go trouble-shooting, what does it mean to “improve performance” ? In simple terms you want something to happen faster; whether it’s a screen response, a report run, or a batch run, time is probably the single most visible measure of success.
The formula: response time = wait time + service time is probably very well known to the Oracle community (largely due to the efforts of Anjo Kolk and Cary Millsap); but I’d also like to throw another equation, or pair of equations, into the pot:
Wait time = competitive wait time + resource wait time
Service time = competitive service time + resource service time
I’m not entirely happy with the wording I have used in these expressions, but the idea I want to get across is that there is a very fluid boundary between how you decide what you call ‘wait’ time and what you call ‘service’ time. 
Consider, for example, a simple disk I/O which, on an idle system, should take 6 milliseconds. From Oracle’s perspective, that would probably be considered virtually 100% wait time. From the viewpoint of ‘response time = wait time + service time’, most of that time is waiting for the heads to settle and waiting for the disk to rotate to the right position – so, at a different level, how much of those 6 milliseconds is wait time, and how much is service time ? But when two users try to access the same disk at the same time, one of them will see a 12 ms response time, of which 6 ms is definitely waiting (on what I call competition time) for the other user to get out of the way.
At the opposite extreme, consider a simple latch acquisition. It takes a few microseconds (service time) to run through the code to acquire a latch; but if someone else is holding a latch that I want, I start to spin on that latch, working to avoid sleeping. Is the spin time “wait time” or “service time” ? I’d like to think of it as service time – I am still working, not waiting – but identify it as competitive service time rather than resource service time. This type of competition can result in a massive CPU overhead.
One guideline to consider when it comes to competition time is that higher frequency tends to lead to more competition. That’s part of the thinking behind strategies which say: “if you’re going to update 1,000 rows do it in one statement, not in 1,000 statements” – the more steps you take, the more times you compete, and the higher the probability of wasting time in competition.
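As an illustration of that guideline (a sketch only, assuming a hypothetical table t1 with columns id and padding), compare:

-- one statement: one set of parses, latch acquisitions, buffer visits
update t1
set    padding = rpad('x',100)
where  id between 1 and 1000;

-- versus 1,000 separate competitions, e.g. from a PL/SQL loop
begin
        for i in 1 .. 1000 loop
                update t1
                set    padding = rpad('x',100)
                where  id = i;
        end loop;
end;
/

Both forms change the same 1,000 rows, but the loop re-enters the engine, re-acquires latches and revisits buffers a thousand times, giving a thousand opportunities to collide with other sessions.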
The reason I am struggling to make a distinction between competition and usage is simple. After the “time-based tuning” revolution, too many people spent too much time focusing on the ‘service time + wait time’ viewpoint, translating this into: “you’ve got to use the 10046 trace to see the time”. That’s an important aspect of trouble-shooting and the best strategy for some circumstances – but you still have to reduce the time, and you can spend too much effort trying to reduce your usage time when you could do better to reduce the competition time by reducing other people’s usage.
Strategies
So let’s put the pieces together and address the three basic questions.
Why is this screen / report always so slow ?
There’s an obvious best strategy on this one: if you can identify the task that precisely, the 10046 trace gives you the maximum detail available for every operation. What it doesn’t show you is much indication of where you spent the CPU and how much of it was used in competition, so you may need to do some work, and make some intelligent guesses, about how reasonable your wait times are.
Obviously the summarized (tkprof) output from the trace will allow you to identify the statements that were responsible for most of the time, and let you guess that some of the statements may have been badly affected by competition time. But you still need to know whether the statement that took 1,000 sequential reads and 10 seconds is behaving reasonably: (a) is that a reasonable number of disk reads to expect, and (b) is the average of 0.01 seconds per read meaningful, or did 90% of the I/Os come from cache with 10% coming from an overloaded disk system ?
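For reference, the 10046 trace for the current session can be switched on at level 8 (which includes wait events) with something like the following – the tracefile_identifier is an optional assumption here, used only to make the trace file easy to find:

alter session set tracefile_identifier = 'slow_report';
alter session set events '10046 trace name context forever, level 8';

-- run the suspect screen / report here

alter session set events '10046 trace name context off';

The resulting trace file can then be summarized from the operating system with something like: tkprof {tracefile} report.txt sort=prsela,exeela,fchela – which lists the statements in descending order of elapsed time.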
Why did the batch over-run by three hours last night ?
Until the advent of 10g and the licences for the diagnostic pack and performance pack, you couldn’t get any fine detail about a single session in the past.  You could only get the system-wide statspack reports, which could give you some clues about resource hogs (which might be heavy users, but might also be sources of intense competition).
So the ideal has always been to do the smallest amount of extra work to get the maximum viable instrumentation into the batch jobs. In the simplest case, all it takes is two basic SQL statements each time you disconnect – with a third, possibly, for 10g. The queries simply report the work done (v$mystat – the slice of v$sesstat for the session), time waited (v$session_event) and, optionally, database time distribution (v$sess_time_model) for the session.
select
 stn.name, mst.value
from
 v$mystat mst, v$statname stn
where
 mst.value != 0
and mst.statistic# = stn.statistic#
;

select 
 /*+ leading(mys) no_merge(mys) */
 event, total_waits, total_timeouts,
 round(time_waited/100,2) time_waited,
 round(max_wait/100,2) max_wait
from
  (select sid from v$mystat where rownum = 1) mys,
 v$session_event       sev
where
  sev.sid = mys.sid
;

select 
 /*+ leading(mys) no_merge(mys) */
 stat_name, value
from
  (select sid from v$mystat where rownum = 1) mys,
 v$sess_time_model      stm
where
 stm.sid = mys.sid
and stm.value != 0
;
With a relatively short report you get a head start on trouble-shooting because you can compare the report for a bad day with the report for the previous day and ask questions like: “did I do far more single block reads today than yesterday, or was the average read time longer ?”, “did more workarea executions spill to disk today ?”, “why has the number of SQL*Net roundtrips gone up by a factor of 10 ?”.
The three queries can’t give you the complete answers because the detail isn’t there, but they will direct you to the right places to look for the answers.
Of course, with 10g, you could start with this report, to find out which jobs did most work, or suffered from the worst competition, and then go into the dba_hist_active_sess_history to see if it gives you a good sample of the critical work this session was doing when it had a problem.
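As a sketch of that last step (assuming the diagnostic licence, and that you know the session id of the batch job – the &sid substitution variable and the time window here are placeholders):

select
        sample_time,
        sql_id,
        nvl(event, 'ON CPU')    event
from
        dba_hist_active_sess_history
where
        session_id = &sid
and     sample_time between timestamp '2013-11-05 22:00:00'
                        and timestamp '2013-11-06 02:00:00'
order by
        sample_time
;

Ideally you would also restrict on session_serial# and instance_number to be sure you are looking at the right incarnation of the session.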

What’s going on right now ?
Finally the generic ‘system is bad’ question. This may be the ‘instant response’ question, or may be the longer term view of a badly performing system. Either way, your best bet is taking snapshots.
For the longer term viewpoint, of course, Statspack (or AWR) give you a default way of collecting all sorts of interesting data in snapshots, and their mechanisms tell you all you need to know about snapshots – copy the data from a dynamic performance view, wait a bit, make another copy, report the difference.
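The snapshot principle can be sketched by hand against v$sysstat – the table name stat_snap is a hypothetical scratch table, and dbms_lock.sleep assumes you have execute privilege on dbms_lock:

create table stat_snap as
select
        sn.name, st.value
from
        v$sysstat st, v$statname sn
where
        st.statistic# = sn.statistic#;

-- wait a representative interval
execute dbms_lock.sleep(60)

select
        sn.name,
        st.value - ss.value     delta
from
        v$sysstat st, v$statname sn, stat_snap ss
where
        st.statistic# = sn.statistic#
and     ss.name = sn.name
and     st.value - ss.value != 0
order by
        delta desc
;

Statspack and AWR do essentially this, across many views at once, with the snapshots stored permanently so that any two can be compared later.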
In the case of the longer term view, you’re probably going to look for competition – what resources are under most pressure, can you see what SQL is (probably) causing that pressure. In many cases this means looking at the “Load Profile” and the “Top 5 Timed events” to get an idea of work done and time lost, then jumping to the most appropriate section of ‘SQL ordered by …’.
When you look at the SQL – there are two things to remember: usage and competition. A statement which uses a lot of CPU is denying CPU to other statements. A statement that does a lot of physical reads is slowing down the physical reads by other statements. A statement that executes a large number of times is interrupting other statements and competing for latches.
When you look at ‘greedy’ statements, don’t forget to check how greedy they are – what impact are they having on the total resource available. Several of the ‘SQL ordered by …’ reports have some text to help, such as:
SQL ordered by Gets                  DB/Inst: XXXXXX/XXXXXX Snaps: 8277-8278
-> Resources reported for PL/SQL code includes the resources used by all SQL
   statements called by the code.
-> Total Buffer Gets:     263,043,933
-> Captured SQL account for      49.8% of Total
This report would be fairly useful as it has captured roughly half the buffer gets accounted for in the period.  But if this turns out to be 30 statements each responsible for about 1.6% of the total then you might move rapidly on to check the SQL ordered by CPU, or the SQL ordered by executions to see if there are any single SQL statements with a larger potential payback.
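If you want to chase the greedy statements in real time rather than from a report, a quick sketch against v$sqlstats (available from 10g) for the top ten by buffer gets might look like this – swap the order by column to chase CPU or executions instead:

select *
from (
        select
                sql_id,
                executions,
                buffer_gets,
                round(cpu_time/1000000,2)       cpu_secs
        from
                v$sqlstats
        order by
                buffer_gets desc
)
where rownum <= 10
;

Remember, though, that these figures are cumulative since the statements were loaded, so two snapshots a few minutes apart tell you much more than one look at the absolute values.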
In the case of the short-term view, you need some quick, lightweight code to take similar snapshots of several of the dynamic performance views. I have various packages to look at v$sess_io, v$sysstat, v$session_event and so on, but one of the best examples I’ve come across of such code is Tanel Poder’s “Session Level Statspack” at http://blog.tanelpoder.com/2007/06/24/session-level-statspack/
The idea is simple – if the system is busy, someone, somewhere, is doing something inefficient – in this case you want to find the culprit quickly – who’s doing the work that is causing a problem. (Notice how the time scale changes the direction of search – although the snapshot principle is the same as the Statspack report, we can search for a person in real-time, but only search for the SQL in Statspack).
The output from a simple snapshot of v$sess_io might look like this:
SQL> execute snap_sess_io.start_snap
PL/SQL procedure successfully completed.

-- wait 5 seconds

SQL> execute snap_sess_io.end_snap
-----------------------------------
Session I/O:-   17-Jun 14:28:40
Interval:-      5 seconds
-----------------------------------
SID          CU Gets       CR Gets         Reads    Blk Change    Con Change
------    ----------    ----------    ----------    ----------    ----------
9               3654       445,630        13,925         7,194         8,539
13                 0        50,157             0             0             0
With a suitable filter to bring out only the larger numbers, you can get a short report that points you to just one or two sessions that might be worth following. (The ‘Reads’ in the above report are probably coming from a file-system cache, by the way, so not only is session 9 threatening the disks, it’s also burning extra CPU).

[Reposter's note] For each of the different problem types, the author offers his own approach.

Conclusion

There are cases where you can identify a business operation that is believed to be inefficient. If you have this luxury, then the 10046 trace is an enormous help to pinning down exactly where you should spend your efforts in fixing the problem.  But there are cases where this direct approach is not possible.
If you have to find problems in the past, or problems that are non-localised, then Oracle offers you summaries of the work and time attributed to users, statements, and resources. You can use snapshots (at various timescales) of these summaries to help you identify the best place to spend your efforts.
Remember that competition for resources can be as great a threat to performance as use of a resource. If you stop me from working too hard, you are also reducing the competition for resources being suffered by other users. So, in the absence of precise targets, identifying the most significant resource threats may be the most cost-effective use of your time.
Footnote: in this article I’ve avoided the question of locking. When a user says – “The system is hung”, one of the first quick checks is v$lock to see if there are some blocking locks causing a problem. Because the view falls outside the normal range of performance issues due to inefficiency, I’ve left the topic aside to be pursued at a later date.
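For completeness, the quick check mentioned in the footnote can be done (in 10g and later, where v$session carries a blocking_session column) with something like:

select
        sid, serial#, blocking_session,
        event, seconds_in_wait
from
        v$session
where
        blocking_session is not null
;

Any rows returned identify waiters and the session blocking them; in earlier versions the equivalent starting point is v$lock, looking for rows with block = 1.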

[Reposter's note] Summary: although the article is aimed at older versions, it still suggests many instructive directions; a feature of the newer releases makes performance data that was hard to obtain in the older versions readily available.
