Chinaunix首页 | 论坛 | 博客
  • 博客访问: 2772052
  • 博文数量: 389
  • 博客积分: 4177
  • 博客等级: 上校
  • 技术积分: 4773
  • 用 户 组: 普通用户
  • 注册时间: 2008-11-16 23:29


分类: Oracle

2014-09-12 15:54:42

              如何对log file sync排错

    log file sync是一个oltp常见的问题,特别在繁忙的oltp系统,如果存储性能不够强,往往成为令
很多Oracle DBA头痛的性能问题之一。在很多文章和MOS文档在内的一个思路就是比较log file sync
的等待时间和log file parallel write时间,然后确认是不是log file parallel write的问题


It is well known that averages and ratios can distort the
reality (Milsap and Holt even wrote a whole book about it). A few
outliers can significantly bias the average, and as James Morle ,
LFS will be affected more than LFPW (and thus the ratio between them
will be distorted).  I will briefly summarize his argument here.

When log writer is writing fast, it generally finishes its work for
one session before the next session commits, but when it experiences an
occasional hiccup, sessions start piling up waiting on LGWR. The
figure below illustrates this mechanism by showing 10 consequtive short
LFPW causing 10 short LFS, followed by a long LFPW, causing 10 longer
LFS events.



You can easily calculate that average LFPW in this example is 1.8 ms, while average LFS is approximately twice as big, 3.7 ms.

I ran some tests to take a closer look at the effect.


I wrote a very simple PL/SQL procedure to perform inserts and commits
inside a loop in several parallel threads. I monitored OS performance
to make sure that there’s no CPU starvation (and I separately checked
LGWR waits from ASH to make sure that it’s wating on I/O and not
something else, like I/O or redo copy latch).  I traced both the user
sessions and the LGWR session and read the trace files into the database
for analysis as described in my to
produce fine-grained histograms. Plotted below are results for 1, 10
and 1000 threads. Note that the plots are in bi-logarithmic scale.


For 1 thread, the shape of LFS and LFPW histograms are almost
identical, with a small shift about 100 microseconds explained by pre-
and post- processing overhead by LGWR. With multiple threads the
behavior is different: the difference starts off as the same ~100 us
shift, but at larger times the difference between LFS and LFPW
grows nearly exponentially and eventually saturates at the number of
threads. This of course makes perfect sense: because of the mechanism
described above, the longer LFPW is, the more LFS waits of comparable
duration it would produce, but the maximum number of LFS per one LFPW of
given duration is limited to the number of threads.

As expected, as the number of threads grows, the average for LFS
changes much more dramatically (from 300 us to 20 ms) compared to LFPW
(from 200 us to 1 ms).

Note that there aren’t any big outliers (e.g. big enough to produce
LGWR warnings) in the picture — normal scatter in I/O service times is
sufficient to produce the effect.


We have seen that in many cases the comparison between the averages
would be very misleading. There would be cases when it wouldn’t, too:
when LFS waits are either produced by one session or by different
sessions, but at non-overlapping times. But in such cases LFS waits
probably wouldn’t be much of a problem.

A busy OLTP system that suffers from LFS is more likely to be similar
to our simulations with threads = 100.  Imagine that we analyze its
behavior, and we only have the averages, without the histogram to tell
us why they are so different. We’d be facing a very confusing picture:
LFS is 20 ms, but LFPW is only 1 ms — if as per the MOS note the average
LFPW represents the I/O part of the LFS, then where did the remaining
19 ms come from? Is it latching? Is it inter-process communication?
Weird effects from ? Or is it CPU? Or a ?
The best case scenario, the DBA wastes a few hours doing research on
the internet and MOS, but worse things can easily happen, ranging from
man-days wasted on combatting non-existent issues to
ill-advised hardware upgrades.

Or take another scenario: on a similar system (OLTP with 50-100
frequently commiting writing sessions and a bunch of reading sessions)
an intermittent I/O problem causes appearance of a small percentage of
anomalously long LFPW events. The percentage is small enough so that
average LFPW doesn’t change much, but average LFS changes significantly.
If the DBA starts the investigation by comparing LFPW/LFS averages he
would conclude that “it’s not I/O” and would go on a wild goose chase
instead of dealing with the real problem.

In order to protect oneself against this pitfall, it’s always better
to look at histograms rather than averages. But if histograms are not
available, there are other possibilities as well: e.g. look at number
concurrent waits on ‘log file sync’ vs time in ASH (group by
sample_time), or if even that is not available, at least compare event
counts for LFS and LFPW (if they are very different, this is a clear
indication of concurrency, i.e. the averages are meaningless).

通过比较ash中的log file sync和log file parallel write次数来决定是不是真正的由于lgwr



阅读(2552) | 评论(0) | 转发(0) |