Thursday, October 28, 2010

Commit time vs log file sync time

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

A while back I was developing predictive performance models with a customer (super senior Oracle DBA) when his phone rang. The caller said the application was taking too long to save an image and therefore there was a database performance problem. (Of course the caller knew nothing about the database but figured if there was a severe performance problem then it must be a database issue.) My client knows Oracle internals and their application very well. From an application perspective, saving an image means database commit time plus the time related to the data flowing through a complex series of routers, application servers, web servers, security processes, and on and on. So he ran a simply query comparing the previous and current day log file sync wait times. He figured that if there was a significant difference in log file sync wait times, then the problem could be database related. And if the log file sync wait times where the same, then the problem was likely to reside somewhere else.

Why did he compare log file sync times when the reported problem was saving an image? The answer is because log file sync wait time changes are a fantastic indicator of database commit time changes. Plus knowing log file sync wait times enables a good database commit time approximation. Exploring the relationship between commit time and log file sync wait time is what this blog entry is all about.

What is a log file sync wait?

When a client process issues DML such as an update, delete, or insert by the time control is returned to the client process the associated server process ensures the appropriate database buffers have been changed and the resulting redo is in the redo log buffer (not log, but buffer). Both the block buffer changes and their associated redo buffer entries must complete or the entire statement is rolled back.

When a client process issues a commit a number of things must occur and they must occur in the correct sequence. Both the server process and the log writer require CPU and IO resources to complete their tasks plus they must communicate to coordinate their individual tasks. Their CPU consumption can be tracked via the v$sess_time_model view and their wait time can be tracked via v$session_event. When the client process receives the "commit complete," from Oracle's perspective the redo now resides on an on-line redo log and both the server process and the log writer are synchronized. (One exception is when Oracle's commit write facility is being used.) When a client process issues a commit and if its server process must wait for the log writer to complete some work (CPU or IO) before the server process can continue, the server process will post the wait event log file sync.

Because significant CPU and IO resources are required for a commit to complete, significant log file sync wait times can occur with either a CPU or IO operating system bottleneck.

Keeping all this in mind, it follows that if log file sync wait times have increased, it is highly likely commit times have also increased. While this makes sense, is it really true? And are there creative ways to use this knowledge? Read on...

Is there a correlation?

More specifically, is there a correlation between database commit times and log file sync wait times? As my experiments indicated, there is indeed a strong correlation.

Here's the core experiment setup:
  1. commit (the "initial commit")
  2. insert some rows
  3. let the system stabilize
  4. get wall time, log file syncs and log file sync wait time for session
  5. commit (the "timing commit")
  6. get wall time, log file syncs and log file sync wait time for session
I created three experimental variations. The first variation is a single process performing the above with a four second stabilization time. I also varied the number of inserts per commit from 1 to 10000. I call this the batch size.  For the second variation I simply eliminated the stabilization time, which created a very intense serial process. The third variation used a stabilization time of two seconds and I added ten other concurrent processes that inserted, commited, deleted, and commited a single row...repeatedly. To avoid table related concurrency issues, each concurrent user worked on a different table. For each insert batch size change, between 90 to 100 samples where taken. I also measured the overhead of the timing collection for both the log file sync wait time and the wall time (which is our commit time), and then subtracted that from the results. One reason I took so many samples was because the wall time collection overhead varied significantly.

Here is a code snippet used to gather performance data:


begin
  select total_waits,
         time_waited_micro,
         to_char(systimestamp,'YYYYMMDDHH24MISSFF')
  into   :waits_total_1,
         :wait_time_micro_1,
         :cur_time_1
  from   v$session_event
  where  sid=:mysid
    and  event='log file sync';
end;
/

Now for the results.

The first variation included a four second delay after the initial commit was issued. This allowed the system to calm down to reduce the likelihood that the second commit, which is timed, was being impacted by previous work. The correlation between commit times and log file sync wait times was a staggering 1.000, with 1 being a perfect correlation. This means when log file sync wait time increased, so did the commit time. Including all the 1170 samples, 87% (stdev 6.2) of the commit time was due to log file sync time. When the insert batch size was one, 77% of the commit time was due to log file sync time.  The numerical and graphic results are shown below.


For the second variation I removed the four second delay. This creates an intense serial server process load on the system. Fortunately, because of Oracle's background processes, even though there is a single server process involved the log writer background process is involved providing much needed parallelization. As with the first experimental variation, the correlation between commit times and log file sync wait times was a staggering 1.000. And when considering all 1300 samples, an average 76% (stdev 17.7) of the commit time was due to log file sync time.


Notice that when the insert batch size was one, log file sync wait time accounted for only 45% of the commit time. Yet when the insert batch size was 10000, log file sync wait time accounted for 96% of the commit time. Regardless of the insert batch size there is some commit related work that must always occur. This overhead can be roughly calculated as the difference between the commit time minus the log file sync time. As our experiments demonstrate, as the insert batch size increases, this overhead takes less and less time per insert. But when the batch size is only one, this overhead took nearly 55% of the commit time. This is why it is much faster to issue 100 updates followed by a single commit compared to doing 100 update and commits. If you want to dig into this topic and actually quantify and predict, check out my blog entry entitled, Altering insert commit batch size.

The third variation incorporated concurrency into the situation. While there was a two second stabilization delay used in the statistics gathering process, the other ten concurrent processes had no delay. They simply and repeatedly inserted a row, commited, deleted, commited, repeat! This is the classic strategy for causing intense log file sync waits. In fact, looking at the experimental results below you can see the average log file sync times are over 800ms. If this was a real OLTP centric system, users would probably be irate!

For this experimental variation, the correlation between commit time and log file sync time was a respectable 0.76. On average around 73% of the commit time was due to log file sync wait time. While the percentage is in the same range as the other experimental variations, because the commit times are much longer, the actual time difference was a few hundred milliseconds not around a single millisecond.


It's important to understand in this experimental variation there is: a single log writer working in conjunction with multiple server processes, each server process must communicate and properly coordinate activity with the log writer, and they all must share resources. As a result, depending on the OS bottleneck, CPU core speed, and IO subsystem speeds, the difference in commit times and log file sync times can vary significantly. What I'm trying to say is that how I designed the experiment and the hardware I was running (Linux with 4 CPU cores and a single IO device, Oracle 11gR2) can effect the specific timed results. But regardless of these things, 1) at no time did the log file sync time exceed commit time. Technically I can't see this ever occurring and it never occurred in any of the experiment's 3,770 samples. 2) There was always respectable correlation between commit time and log file sync time, and 3) log file sync time is a very significant part of the overall commit time.

What does this mean in my work?

What does this mean in practical daily DBA performance analysis work? When commit times increase you can expect log file sync wait times to increase. And if log file sync times come to within around 75% of your commit time SLA, you have most likely breached your service levels. So it turns out my colleague's strategy of comparing log file sync times from the previous day to the current day was right on the money! And in his situation, the performance problem turned out to be a problem with the application servers, not the database server!

Is log file sync time a reasonable commit time approximation?

Yes log file sync wait time is a reasonable commit time approximation, but it is even a better predictor of commit time. In these experiments the log file sync time averaged 73%, 76%, and 87% of the total commit time. Knowing the average log file sync wait times, you can expect your commit times to around between 10% to 25% longer. More practically stated, if your log file sync wait times are approaching your commit time SLAs, you can count on your commit time SLA being breached. Creatively applying this knowledge, instead of instrumenting a user application to gather and monitor commit times, you could use Oracle's wait interface to monitor log file sync wait times instead!

So knowing log file sync times can be very valuable in both performance monitoring and performance firefighting.

Thanks for reading!

Craig.

Friday, October 15, 2010

Locks and Latches...What a Difference!

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Over this past year an unusually large number of my students have looked at me strangely (option 1, options 2) when I state that latches are not locks. There are distinct differences between the two in both purpose, design, and application. Quickly contrasting these two serialization control methods is what this blog entry is all about.

Oracle sometimes needs to be in control...

At some point Oracle must ensure certain actions occur one at a time, that is, serially. Oracle serialization control has two broad applications. First there is relational object (e.g., the employee table) control, in which enqueues are used. Second, there is memory structure (e.g., cache buffer chains) control in which latches and mutexes are used.

Locks employ enqueues to ensure relational structure access control.

Oracle uses locks implemented through enqueues to ensure relational objects are changed in an orderly fashion. When an Oracle process needs to lock a relational structure, either explicitly or implicitly, the lock request is placed into a queue. That's where we get the word "in"-queue. When the lock is given, the request is dequeued. (Kind of like a stack's push and pop functions.) Precise ordering is maintained to ensure a deadlock does not occur. Many years ago as an Oracle application developer I distinctly remember being repeatedly told how to properly lock and unlock a series of Oracle tables. In other words, Oracle's locking mechanism was very well defined and dependable. I had to make sure I understood this to avoid encouraging an application deadlock. I like to say that compared to latches and mutexes, enqueues are very boring and mature. You know exactly what's going to happen and in what order. Absolutely no adventure whatsoever!

A massive requirement difference...

Memory structure list serialization control is an entirely different beast and an exciting one at that! Oracle must closely maintain tight control over memory structure list access, otherwise deep and extreme corruption would occur. Memory structure lists are fundamentally different than relational structures as well as buffers (and a buffer's associated buffer header). For example, the cache buffer chains, the least recently used lists, and the write lists are linked lists made up of buffer headers. Buffer headers do not contains rows, they contain pointers to other buffer headers, their associated data block and buffer, and control information. Buffer headers are not relational structures and they are not Oracle buffers. As a result, Oracle needed a way to control access to these and other memory lists.

Latches/Mutexes use a spin and sleep algorithm to ensure memory structure list serialization control.

In stark contrast to Oracle locks, latches and mutexes do not use an orderly acquisition and release strategy. Latches and mutexes employ a spin (i.e., repeated attempts to acquire the latch/mutex) and sleep (i.e., backing off a bit after failing to acquire the latch/mutex) methodology. The session successfully acquiring the latch/mutex got lucky when it just happened to make a request attempt (by checking a specific memory address) after the latch/mutex became available yet before a competing session attempted the acquisition. So it's not an order thing, it's a probability thing: eventually the session will likely acquire the latch/mutex, but there is no guarantee. This is what makes latching and mutex acquisition so interesting and exciting compared to boring enqueues!

If you have been wondering about the differences between an Oracle lock and latches/mutexes, I hope this entry has helped clear things up a bit. If you want to dig into serialization in more detail, just go to my web-site and search for latch. And of course I dig into this topic in my Oracle Performance Firefighting course and book.

Thanks for reading!

Craig.

P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.