Tuesday, December 14, 2010

IO Read Wait Occurrence Mismatch - Part 1


Only a fool would challenge the assertion that Oracle's wait interface has profoundly changed the terrain of Oracle performance diagnosis. But sometimes we push the wait interface too far and expect too much. In this blog entry I will demonstrate that the number of non buffer cache reads (i.e., "physical" reads) cannot be satisfactorily determined solely through the wait interface. But I will also provide a solution to this problem.

This also implies that, based solely on Oracle's wait interface, the total physical IO read time cannot be satisfactorily determined by simply multiplying the number of physical IO read related wait event occurrences by their respective average wait times. An alternative solution will be presented in a future blog entry.

From a broader performance perspective this is important because many Oracle performance diagnostic methods, tools, and products use Oracle's wait interface to determine the total IO read wait time which feeds into the total non-idle wait time. But as my research demonstrates and as strange as it seems, if you want the real total IO read wait time you must use information from both the classic instance activity views (i.e., v$[sess,sys]stat) and the wait interface views, or use a creative sampling strategy.

Expecting Too Much From the Wait Interface

One such example of expecting too much is using the wait interface to determine the number of non buffer cache reads. I typically call these non buffer cache reads "physical IO reads", "physical IO", or simply "PIO".

There are actually a number of physical read related wait events. In addition to the classic db file sequential read and db file scattered read, there are also direct path read, direct path read temp, db file parallel read, control file sequential read, and log file sequential read. There are others, and I'm sure Oracle will add more, but these are the ones likely to account for significant amounts of wait time. For typical Oracle wait event descriptions, click here.

To demonstrate this hypothetically, suppose db file sequential read and db file scattered read waits occurred 1.2M times over a one hour period and no other physical IO read events occurred. We would expect the instance statistic (v$sysstat) physical read total IO requests to show 1.2M, and tracing the processes at the OS level to also show 1.2M read system calls. While this can occur, as I will demonstrate below and as you can check on your system, don't count on it!

This mismatch occurs because Oracle does not fully instrument its physical reads, which on Linux are currently implemented through the pread64 or readv system calls. In this blog entry I will demonstrate this mismatch and provide a script so you can demonstrate it on one of your own systems.

How Oracle Retrieves & Times Non Oracle Cache Reads

When a server process requires an Oracle block it checks the cache buffer chain structure to determine if the block currently resides in the buffer cache. If the block does not reside in the buffer cache, the Oracle process must ask the operating system for the block. To ask for the block on a Linux system, Oracle may issue a pread64 or a readv call. When Oracle instruments the call, we can query details about the read call through Oracle's wait interface.

We can see an example of how Oracle both instruments the read call and actually makes the read call by operating system tracing a server process. Here is an example of tracing an active server process performing a physical multi block IO request.
$ strace -rp 14876
     0.000027 gettimeofday({1290553854, 957524}, NULL) = 0
     0.000028 pread64(13, "\6\242\0\0p\36\0\1..."..., 131072, 63832064) = 131072
     0.000095 gettimeofday({1290553854, 957647}, NULL) = 0
Process 14876 detached

The pread64 call shown above provides the server process with the requested blocks, and the surrounding gettimeofday calls provide the related timing information. Oracle gives this read the name db file scattered read, with a wait time of just over 0.095ms. I provide much more detail about this process in my book, Oracle Performance Firefighting, and also in my training course titled Oracle Performance Firefighting.

If Oracle, for whatever reason, did not instrument the pread64 call, that is, it did not issue the gettimeofday calls and record their time difference, the read call would obviously still occur, but we would not see its occurrence or the related timing information through Oracle's wait interface.

The Call Occurrence Mismatch

My experiments gathered IO read wait occurrences from three different sources: Oracle's wait interface, Oracle's instance statistics views, and operating system tracing of an Oracle server process. Through Oracle's wait interface, the number of waits can be determined through the total_waits column in v$system_event and v$session_event. Through Oracle's instance statistics, the number of read calls (not the number of blocks) can be determined by referencing the statistic physical read total IO requests. On Linux, the number of read system calls can be determined by tracing a process using strace, grep'ing for readv and pread64, and then counting their occurrences.
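To make the first two sources concrete, here is a minimal sketch (my own illustration, not the experiment script referenced below) that pulls both counts for a single session. The list of read related events is an assumption; adjust it to match the read events active on your system.

-- Compare the wait interface read occurrence count with the instance
-- statistic "physical read total IO requests" for one session (&sid).
select 'wait interface' source, sum(e.total_waits) read_calls
from   v$session_event e
where  e.sid = &sid
and    e.event in ('db file sequential read','db file scattered read',
                   'direct path read','direct path read temp',
                   'db file parallel read','control file sequential read',
                   'log file sequential read')
union all
select 'instance statistic', s.value
from   v$sesstat s, v$statname n
where  s.sid = &sid
and    s.statistic# = n.statistic#
and    n.name = 'physical read total IO requests';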

For my session level experiment I created a simple script that can be downloaded here. Here is an actual output file for a very small run, gathering only six 30 second samples.
sample  stracecnt  sesstatcnt  waitcnt
     1       2215        2215     2215
     2       4242        4238     4237
     3       4285        4283     4276
     4       4287        4287     4286
     5       3528        3383     3397
     6       2180        2180     2180
sample is the sample number.
stracecnt is the count of readv and pread64 system calls gathered by running strace.
sesstatcnt is the value of the v$sesstat statistic physical read total IO requests.
waitcnt is the sum of total_waits from v$session_event.

Remember all the data is related to a single server process and the v$sesstat and v$session_event counts represent the deltas, that is, the difference between the ending value and the beginning value. If you're not on Linux, you may need to make some shell script modifications.

My conclusions were not taken from the above six 30 second samples. My conclusions are based on correspondence with some trusted colleagues and on running the script on my Oracle 11gR2 Linux four CPU core system, collecting sixty samples at 900 seconds each for a single server process with a majority of direct path read wait occurrences and wait time. For each sample the server process issued around 221K IO read calls. The average number of read system calls (readv, pread64) was 221,286 and the average number of read calls recorded in Oracle's instance statistics view v$sesstat (physical read total IO requests) was 221,289. The average difference was only 3 calls, which could easily be accounted for by how I collected the data. You can see the actual experimental output here.

This experiment, along with correspondence with Gopal Kandhasamy (co-author of Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning), indicates Oracle's instance statistics views are a reliable source for IO read occurrences.

While it appears Oracle's instance activity views are a reliable source for IO read call occurrences, is Oracle's wait interface also a reliable source for the number of IO read calls? On my test system the wait interface occurrences closely matched data from both the instance statistics view and from strace. However, looking at real production systems I tend to see something quite different. For example, I randomly picked three Statspack/AWR reports at my disposal and this is what I found:
Ref: SG, Oracle 11.2
  wait interface physical read related wait occurrences: 1,082,064
  instance statistic physical read total IO requests:    1,521,403
  Event occurrences to instance statistic ratio:          0.71

Ref: BF, Oracle 10.2
  wait interface physical read related wait occurrences: 190,272
  instance statistic physical read total IO requests:    230,383
  Event occurrences to instance statistic ratio:          0.83

Ref: PROD25, Oracle 10.2
  wait interface physical read related wait occurrences: 156,563
  instance statistic physical read total IO requests:    156,894
  Event occurrences to instance statistic ratio:          1.00
I encourage you to do the same for one of your production systems.
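If you would rather check the raw values directly instead of pulling numbers from an AWR or Statspack report, a sketch along these lines computes the same ratio instance-wide, cumulative since startup. As above, the read related event list is my assumption and may need adjusting for your system.

-- Ratio of read related wait occurrences (wait interface) to the instance
-- statistic "physical read total IO requests". A ratio well below 1.0
-- suggests the wait interface is missing read call occurrences.
select round(w.waits / s.reads, 2) event_to_stat_ratio,
       w.waits  wait_interface_occurrences,
       s.reads  physical_read_total_io_requests
from  (select sum(total_waits) waits
       from   v$system_event
       where  event in ('db file sequential read','db file scattered read',
                        'direct path read','direct path read temp',
                        'db file parallel read','control file sequential read',
                        'log file sequential read')) w,
      (select value reads
       from   v$sysstat
       where  name = 'physical read total IO requests') s;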

The three observations above, other observations from other systems, data from some of my colleagues, and correspondence with Gopal Kandhasamy (co-author of Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning) all indicate that Oracle's wait interface is not a reliable source for IO read occurrences.

What about timing information?

What I have presented above is specifically related to the number of occurrences, that is, the count. If this is not troubling enough, consider that if the reported number of occurrences is low, then any total IO read time derived from those occurrences will also be low. Therefore any method, analysis, script, tool, or product using wait interface occurrences will produce a low total wait time value. Stated plainly, for read IO related wait events, the column v$[session,system]_event.time_waited is likely under-reported.

While the total IO read wait time appears to be under-reported, the average wait time appears to be solid. When a read call is instrumented, Oracle of course records the time along with the occurrence, providing the data required to calculate an average. Also, when comparing IO read wait times with operating system tools such as iostat and with the observed read system call times, it appears Oracle does record the correct average wait time. I will be blogging about this in my "average" blog series.

It is more difficult to calculate the total IO read time waited than you might think. This difficulty exists because there are multiple read wait events and they don't all have an obvious instance statistic from which to easily determine their count. For quantitative Oracle performance analysis it is important to get a respectable total IO read time, so I am working on a solution, which I plan to blog about soon. Stay tuned...

Concluding Thoughts

Oracle's done a great job instrumenting its kernel code. Such a great job in fact, that most performance analysts rely heavily on the data it provides. But the wait interface is limited when kernel developers limit its use. This blog entry demonstrates one such limit, specifically using the wait interface column total_waits to determine the number of IO read call occurrences.

After Posting Note: Please read Part 2 as it has additional details that bring some closure to this topic.

Thanks for reading!

Craig.


P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.

Saturday, November 20, 2010

The Average Challenge... Part 1


We have all been taught that the average of a list of numbers is their sum divided by the count. Yes, this is true, but when we rely on an average as input into a more complex formula, or use it when making an important decision, what it implies and communicates becomes very important.

Why is this important?

Here are some sobering examples about why this topic is so important.
  • Oracle statistics are commonly provided as an average. But as I will demonstrate (and you can validate) an average can be a poor representation of the underlying values. 
  • Just because the average multiblock read takes 30ms does not mean it usually takes 30ms, yet this is how most DBAs think, talk, and act. In fact, the typical read could take only 5ms!
  • Most DBAs assume that if the average log file parallel write takes 10ms, then most of the writes will center closely around 10ms, and also that the writes are just as likely to be less than 10ms as greater than 10ms...not true.
  • If two different SQL statements retrieve 100 blocks from outside of Oracle's buffer cache it may take one statement 10 seconds and the other statement 2 seconds to prepare the blocks for access.
  • If a SQL statement retrieves 100 blocks from outside of Oracle's buffer cache it may take one execution 10 seconds to complete while another execution only 2 seconds.
My point in this blog series is to highlight that it is important for us as performance analysts to be aware of, understand, and properly use averages, and also to have the ability to derive more appropriate ways to describe what's occurring. Better said, we need to use the best representation of the data to suit our performance analysis needs. If this is something you are interested in, then read on!

Our Misguided Past

Suppose we look at an AWR report that shows the average log file parallel write time is 17ms. This is validated by looking further into the report, which shows that over the AWR report interval the total log file parallel write wait time was 1200 seconds with 70,588 associated waits (the count). What was stated and gathered is true, but from this point on is where the illusions begin...

Many DBAs will make three incorrect assumptions. First, that most of the log file parallel write waits are close to 17ms, and if they are not, then they are anomalies, outliers, and unimportant; because they probably don't occur that often, they are nothing to be concerned about. Second, because most of us are conditioned to think of averages as part of a bell curve (i.e., normal distribution), most DBAs will envision there being just as many waits (i.e., wait occurrences) slightly less than 17ms as there are waits slightly greater than 17ms. Third, that if there is a problem with using the mathematical average (i.e., mean), then everyone else is in the same boat and there is nothing we can realistically do anyway...so doing something about it is unusual, irrelevant, and unrealistic. All three of these assumptions are incorrect, and thankfully we can address each one.

Important side note: For simplicity in this initial "average" blog entry, I am not going to use Oracle wait event times or focus on SQL statements. The reason will be clear in the subsequent blog entries. Instead I will use Oracle server and background process operating system trace output from a stressed Oracle 11.2 instance running on Linux.

The Experimental Setup

To demonstrate these assumptions and help get us thinking correctly, I created a free buffer wait load, which really stresses the IO subsystem, typically both reads and especially writes. I used a four CPU core Dell server running Oracle 11gR2 on Oracle Unbreakable Linux. Here are the OS commands I ran to gather the stats:
$ ps -eaf | grep lgwr
oracle   23162     1  0 Nov08 ?        00:02:30 ora_lgwr_prod18
oracle   29231 13752  0 15:11 pts/1    00:00:00 grep lgwr
$ strace -rp 23162 >lgwr_2min.trc 2>&1
$ ./crunchdata.sh lgwr_2min.trc pwrite64 mb > lgwr_2min.txt 2>&1
After two minutes, I broke out of the trace using a control-C, providing me with a 1.5MB trace file containing over 1000 multiblock pwrite64 system calls. I created a special shell script called crunchdata.sh to extract only the pwrite64 system call times and place the data into a single column text file. I then loaded the single column text file into a fairly straightforward Mathematica notebook to quickly crank out a nice histogram and some associated statistics. All the data shown in this blog entry and the associated analysis programs can be downloaded here.

The Results

Let's first look at the histogram shown below. The horizontal axis is composed of bins, that is, groupings or intervals of how long something took, in milliseconds. For example, the far left bin is related to pwrite64 times that were between 0ms and 10ms. The second bin is related to pwrite64 times between 11ms and 20ms. And on and on. The vertical axis is the number of samples gathered, that is, the number of occurrences associated with a bin. For example, there were just over 400 pwrite64 samples whose time was between 0ms and 10ms. And while you can't see this clearly on the graph, there were 11 samples whose time was between 111ms and 120ms. If we add up all the occurrences shown in the graph it will total the number of samples, which in this case is 1006. If you're wondering why the horizontal axis goes out to over 350ms, it's because there were some writes that actually took over 350ms and I allowed Mathematica to include all our sample data when creating the histogram. These outliers or anomalies are more important than many DBAs think and will be discussed below.
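If you want to build a similar table of bins and counts yourself, here is a minimal sketch. It assumes you have loaded the single column timing file into a hypothetical table io_call_times with a column elapsed_ms; both names are mine and not part of the experiment scripts.

-- Bucket the pwrite64 times into 10ms wide bins (0-400ms) and count the
-- occurrences per bin; values of 400ms or more land in an overflow bucket.
select (width_bucket(elapsed_ms, 0, 400, 40) - 1) * 10 bin_start_ms,
        width_bucket(elapsed_ms, 0, 400, 40) * 10      bin_end_ms,
       count(*)                                         occurrences
from   io_call_times
group by width_bucket(elapsed_ms, 0, 400, 40)
order by 1;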


Looking at the histogram above, the first and most obvious observation is that the pwrite64 duration samples are not normally distributed. Not even close! The distribution is very skewed. To highlight this point, in the graphic below I overlaid a normal distribution based on the same dataset's key normal distribution inputs: average and standard deviation. As you can see, the LGWR's pwrite64 times are clearly not normally distributed.

But is this non-normality just a convenient anomalous sample set I selected from a large number of experiments to make a point? No. To provide additional examples of this skewness, shown below are two other examples (DBWR and a server process) based on the same database server load. Each sample set was gathered like the one detailed above, except the sample time was three minutes instead of two.

The histogram below captured the database writer's 1,832 multiblock pwrite64 calls over the three minute sample period. As you can see, most of the samples were between 0ms and 25ms, which is what we would hope. Also notice there were writes that took over 1400ms (1.4 seconds!) to complete. What do you think the average write time is? (The answer is provided below.)


The histogram below captured a server process's 4,019 multiblock pread64 calls over the three minute sample period. As you can see, most of the samples were between 0ms and 25ms, but there were samples just below and just above 1000ms. What do you think the average read time is? (The answer is provided below.)



What's the big deal?

What's the big deal about a skewed distribution? The samples that take much longer than the rest wreak havoc on creating a meaningful and useful average. Said another way, the calculated average has an implied meaning that is not what most DBAs expect. And to demonstrate this, let's focus on the initial LGWR pwrite64 histogram. To increase clarity I modified the histogram so each bin represents a single millisecond (not 10ms as above) and I also limited the x-axis to 150ms. This allows us to effectively zoom into the area of interest. Looking at the initial and zoomed area on the below histogram, what do you think is the average write call time now? And what do you think is the most common write call duration?


Most of us would guess the average is around 15ms. Visually it seems correct. But statistically the average is 24.7ms. This may seem strange because of the large number of very short duration samples and a grouping around 20ms. The reason is there are enough longer duration samples to effectively pull the average to the right. What may be thought of as outliers or anomalies occurred often enough to significantly impact the average.

This is actually a fairly tame example. I've seen many examples where the skew is more intense and therefore the average is pulled much further to the right. So looking at the histogram directly above, would you feel comfortable stating the average is 25ms? Perhaps, but you would probably feel much more comfortable saying the likely pwrite64 time is 15.6ms and that it is highly likely the time will be less than 31.64ms. Where did I come up with these numbers and what do they mean? Read on...

What's better than the average?

In this case and as we'll see with Oracle wait times and various statistics related to SQL statements, for our purposes the average is not the most useful way to describe what happened. As I have learned with developing predictive models and communicating with others about what has actually occurred (i.e., discussing the sample data), the statistical median and other similar statistics work much better than the average.

The median is the 50th percentile value: if you ordered all the samples and selected the middle value, that would be the median and the 50th percentile value. In our LGWR two minute sample set, the median is 15.6ms, which visually seems more useful for describing our sample set than the average of 24.7ms. The 31.64ms value I wrote above is the 80th percentile; that is, 80% of our samples are 31.64ms or less. Also, the 10th percentile is 0.5ms and the 25th percentile is 2.0ms. Both are typically more useful than simply stating the average is 24.7ms.
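If you want a head start calculating these figures for your own sample set, here is a minimal sketch using the same hypothetical io_call_times(elapsed_ms) table as above; the table and column names are my assumptions.

-- Average, median, and selected percentiles of the timing samples.
select round(avg(elapsed_ms), 1)                                          avg_ms,
       round(median(elapsed_ms), 1)                                       median_ms,
       round(percentile_cont(0.10) within group (order by elapsed_ms), 1) pct10_ms,
       round(percentile_cont(0.25) within group (order by elapsed_ms), 1) pct25_ms,
       round(percentile_cont(0.80) within group (order by elapsed_ms), 1) pct80_ms
from   io_call_times;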

There are other statistics and visualizations that can be used to describe our sample sets. I will introduce some of those in subsequent blog entries and demonstrate how to calculate the median and other percentile figures when you encounter skewed distributions. But for now, the median nearly always provides us with a better way to describe Oracle performance related samples (background and foreground process activity, SQL statement resource consumption and duration).

Referring to the DBWR histogram above, the average pwrite64 took 30.2ms and the median is 1.3ms. Referring to the server process multiblock read histogram above, the average pread64 took 24.1ms and the median is a staggering 0.1ms. This common yet intense skewness in server process physical IO read times has a profound impact on modeling an Oracle system for predictive analysis. Though it is sometimes difficult to see visually without zooming into the histogram, the difference between the average and the median comes from the large variation in sample times combined with a few samples that have relatively large times. But in every case, the larger the difference between the average and the median, the more intense the skewness.

What's the point of all this?

Understanding skewness, knowing that the average calculation is of limited value, and recognizing that the median (among other statistics) better communicates the performance picture allows me to be more effective in my Oracle performance analysis work. More practically, when I talk about IO wait times, I try to differentiate the average wait time from the likely wait time. As I am beginning to demonstrate, the likely wait time will always be less than the average wait time. But how much less? And how can we calculate and predict this? These are the topics for subsequent blog entries.

This skewness actually has far reaching implications in Oracle performance analysis. I have just touched the surface and introduced the situation, which hopefully piqued your interest. As I will subsequently blog about, this skewness also occurs when analyzing Oracle wait time and in many areas of SQL statement analysis.

Thanks for reading!


Craig.

P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.

Thursday, October 28, 2010

Commit time vs log file sync time


A while back I was developing predictive performance models with a customer (a super senior Oracle DBA) when his phone rang. The caller said the application was taking too long to save an image and therefore there was a database performance problem. (Of course the caller knew nothing about the database, but figured if there was a severe performance problem then it must be a database issue.) My client knows Oracle internals and their application very well. From an application perspective, saving an image means database commit time plus the time related to the data flowing through a complex series of routers, application servers, web servers, security processes, and so on. So he ran a simple query comparing the previous and current day's log file sync wait times. He figured that if there was a significant difference in log file sync wait times, then the problem could be database related. And if the log file sync wait times were the same, then the problem was likely to reside somewhere else.

Why did he compare log file sync times when the reported problem was saving an image? The answer is because log file sync wait time changes are a fantastic indicator of database commit time changes. Plus knowing log file sync wait times enables a good database commit time approximation. Exploring the relationship between commit time and log file sync wait time is what this blog entry is all about.

What is a log file sync wait?

When a client process issues DML such as an update, delete, or insert, by the time control is returned to the client process the associated server process has ensured the appropriate database buffers have been changed and the resulting redo is in the redo log buffer (not the log, but the buffer). Both the block buffer changes and their associated redo buffer entries must complete or the entire statement is rolled back.

When a client process issues a commit, a number of things must occur and they must occur in the correct sequence. Both the server process and the log writer require CPU and IO resources to complete their tasks, plus they must communicate to coordinate their individual tasks. Their CPU consumption can be tracked via the v$sess_time_model view and their wait time can be tracked via v$session_event. When the client process receives the "commit complete," from Oracle's perspective the redo now resides in an on-line redo log and both the server process and the log writer are synchronized. (One exception is when Oracle's commit write facility is being used.) When a client process issues a commit, and its server process must wait for the log writer to complete some work (CPU or IO) before the server process can continue, the server process will post the wait event log file sync.

Because significant CPU and IO resources are required for a commit to complete, significant log file sync wait times can occur with either a CPU or IO operating system bottleneck.

Keeping all this in mind, it follows that if log file sync wait times have increased, it is highly likely commit times have also increased. While this makes sense, is it really true? And are there creative ways to use this knowledge? Read on...

Is there a correlation?

More specifically, is there a correlation between database commit times and log file sync wait times? As my experiments indicated, there is indeed a strong correlation.

Here's the core experiment setup:
  1. commit (the "initial commit")
  2. insert some rows
  3. let the system stabilize
  4. get wall time, log file syncs and log file sync wait time for session
  5. commit (the "timing commit")
  6. get wall time, log file syncs and log file sync wait time for session
I created three experimental variations. The first variation is a single process performing the above with a four second stabilization time. I also varied the number of inserts per commit from 1 to 10000. I call this the batch size. For the second variation I simply eliminated the stabilization time, which created a very intense serial process. The third variation used a stabilization time of two seconds and added ten other concurrent processes that inserted, committed, deleted, and committed a single row...repeatedly. To avoid table related concurrency issues, each concurrent user worked on a different table. For each insert batch size change, between 90 and 100 samples were taken. I also measured the overhead of the timing collection for both the log file sync wait time and the wall time (which is our commit time), and then subtracted that from the results. One reason I took so many samples was that the wall time collection overhead varied significantly.

Here is a code snippet used to gather performance data:


begin
  select total_waits,
         time_waited_micro,
         to_char(systimestamp,'YYYYMMDDHH24MISSFF')
  into   :waits_total_1,
         :wait_time_micro_1,
         :cur_time_1
  from   v$session_event
  where  sid=:mysid
    and  event='log file sync';
end;
/

Now for the results.

The first variation included a four second delay after the initial commit was issued. This allowed the system to calm down to reduce the likelihood that the second commit, which is timed, was being impacted by previous work. The correlation between commit times and log file sync wait times was a staggering 1.000, with 1 being a perfect correlation. This means when log file sync wait time increased, so did the commit time. Including all the 1170 samples, 87% (stdev 6.2) of the commit time was due to log file sync time. When the insert batch size was one, 77% of the commit time was due to log file sync time.  The numerical and graphic results are shown below.


For the second variation I removed the four second delay. This creates an intense serial server process load on the system. Fortunately, even though there is only a single server process involved, the log writer background process provides some much needed parallelization. As with the first experimental variation, the correlation between commit times and log file sync wait times was a staggering 1.000. And when considering all 1300 samples, an average of 76% (stdev 17.7) of the commit time was due to log file sync time.


Notice that when the insert batch size was one, log file sync wait time accounted for only 45% of the commit time. Yet when the insert batch size was 10000, log file sync wait time accounted for 96% of the commit time. Regardless of the insert batch size there is some commit related work that must always occur. This overhead can be roughly calculated as the commit time minus the log file sync time. As our experiments demonstrate, as the insert batch size increases, this overhead takes less and less time per insert. But when the batch size is only one, this overhead took nearly 55% of the commit time. This is why it is much faster to issue 100 updates followed by a single commit than to do 100 update-and-commit pairs. If you want to dig into this topic and actually quantify and predict the effect, check out my blog entry entitled, Altering insert commit batch size.

The third variation incorporated concurrency into the situation. While there was a two second stabilization delay used in the statistics gathering process, the other ten concurrent processes had no delay. They simply and repeatedly inserted a row, committed, deleted the row, committed, and repeated! This is the classic strategy for causing intense log file sync waits. In fact, looking at the experimental results below you can see the average log file sync times are over 800ms. If this were a real OLTP centric system, users would probably be irate!

For this experimental variation, the correlation between commit time and log file sync time was a respectable 0.76. On average around 73% of the commit time was due to log file sync wait time. While the percentage is in the same range as the other experimental variations, because the commit times are much longer, the actual time difference was a few hundred milliseconds not around a single millisecond.


It's important to understand that in this experimental variation there is a single log writer working in conjunction with multiple server processes, each server process must communicate and properly coordinate activity with the log writer, and they all must share resources. As a result, depending on the OS bottleneck, CPU core speed, and IO subsystem speed, the difference between commit times and log file sync times can vary significantly. What I'm trying to say is that how I designed the experiment and the hardware I was running on (Linux with 4 CPU cores and a single IO device, Oracle 11gR2) can affect the specific timed results. But regardless of these things: 1) at no time did the log file sync time exceed the commit time (technically I can't see this ever occurring, and it never occurred in any of the experiment's 3,770 samples); 2) there was always a respectable correlation between commit time and log file sync time; and 3) log file sync time is a very significant part of the overall commit time.

What does this mean in my work?

What does this mean in practical daily DBA performance analysis work? When commit times increase you can expect log file sync wait times to increase. And if log file sync times come to within around 75% of your commit time SLA, you have most likely breached your service levels. So it turns out my colleague's strategy of comparing log file sync times from the previous day to the current day was right on the money! And in his situation, the performance problem turned out to be a problem with the application servers, not the database server!

Is log file sync time a reasonable commit time approximation?

Yes, log file sync wait time is a reasonable commit time approximation, and it is an even better predictor of commit time. In these experiments the log file sync time averaged 73%, 76%, and 87% of the total commit time. Knowing the average log file sync wait time, you can expect your commit times to be roughly 10% to 25% longer. More practically stated, if your log file sync wait times are approaching your commit time SLAs, you can count on your commit time SLA being breached. Creatively applying this knowledge, instead of instrumenting a user application to gather and monitor commit times, you could use Oracle's wait interface to monitor log file sync wait times instead!
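As a minimal sketch of that idea, the query below reports the instance-wide average log file sync wait time from v$system_event. Keep in mind these figures are cumulative since instance startup, so for day-to-day comparisons you would snapshot the values periodically and difference them (or pull them from AWR/Statspack), as my client did.

-- Cumulative log file sync occurrences and average wait time (ms).
select event,
       total_waits,
       round(time_waited_micro / total_waits / 1000, 2) avg_wait_ms
from   v$system_event
where  event = 'log file sync';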

So knowing log file sync times can be very valuable in both performance monitoring and performance firefighting.

Thanks for reading!

Craig.

Friday, October 15, 2010

Locks and Latches...What a Difference!


Over this past year an unusually large number of my students have looked at me strangely (option 1, option 2) when I state that latches are not locks. There are distinct differences between the two in purpose, design, and application. Quickly contrasting these two serialization control methods is what this blog entry is all about.

Oracle sometimes needs to be in control...

At some point Oracle must ensure certain actions occur one at a time, that is, serially. Oracle serialization control has two broad applications. First there is relational object (e.g., the employee table) control, in which enqueues are used. Second, there is memory structure (e.g., cache buffer chains) control in which latches and mutexes are used.

Locks employ enqueues to ensure relational structure access control.

Oracle uses locks implemented through enqueues to ensure relational objects are changed in an orderly fashion. When an Oracle process needs to lock a relational structure, either explicitly or implicitly, the lock request is placed into a queue. That's where we get the word "en"-queue. When the lock is given, the request is dequeued. (Kind of like a stack's push and pop functions.) Precise ordering is maintained to ensure a deadlock does not occur. Many years ago as an Oracle application developer I distinctly remember being repeatedly told how to properly lock and unlock a series of Oracle tables. In other words, Oracle's locking mechanism was very well defined and dependable. I had to make sure I understood this to avoid causing an application deadlock. I like to say that compared to latches and mutexes, enqueues are very boring and mature. You know exactly what's going to happen and in what order. Absolutely no adventure whatsoever!

A massive requirement difference...

Memory structure list serialization control is an entirely different beast, and an exciting one at that! Oracle must maintain tight control over memory structure list access, otherwise deep and extreme corruption would occur. Memory structure lists are fundamentally different from relational structures as well as from buffers (and a buffer's associated buffer header). For example, the cache buffer chains, the least recently used lists, and the write lists are linked lists made up of buffer headers. Buffer headers do not contain rows; they contain pointers to other buffer headers, to their associated data block and buffer, and control information. Buffer headers are not relational structures and they are not Oracle buffers. As a result, Oracle needed a way to control access to these and other memory lists.

Latches/Mutexes use a spin and sleep algorithm to ensure memory structure list serialization control.

In stark contrast to Oracle locks, latches and mutexes do not use an orderly acquisition and release strategy. Latches and mutexes employ a spin (i.e., repeated attempts to acquire the latch/mutex) and sleep (i.e., backing off a bit after failing to acquire the latch/mutex) methodology. The session that successfully acquires the latch/mutex got lucky: it just happened to make a request attempt (by checking a specific memory address) after the latch/mutex became available yet before a competing session attempted the acquisition. So it's not an ordering thing, it's a probability thing: eventually the session will likely acquire the latch/mutex, but there is no guarantee. This is what makes latch and mutex acquisition so interesting and exciting compared to boring enqueues!
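Real latch and mutex acquisition happens deep inside Oracle's kernel code; it is not something you implement in PL/SQL. Still, the toy block below sketches the spin-then-sleep idea, with a random function standing in for the test-and-set against a latch memory address, so acquisition is a matter of probability rather than queue position. Everything in it (the spin limit, the success probability, the try_get_latch function) is made up purely for illustration.

declare
  acquired   boolean := false;
  spins      pls_integer;
  spin_limit constant pls_integer := 2000;  -- akin to a spin count limit
  cycles     pls_integer := 0;
  -- Stand-in for testing and setting the latch memory address.
  function try_get_latch return boolean is
  begin
    return dbms_random.value(0, 1) < 0.0005;
  end;
begin
  while not acquired loop
    cycles := cycles + 1;
    spins  := 0;
    while spins < spin_limit and not acquired loop  -- spin phase
      acquired := try_get_latch;
      spins    := spins + 1;
    end loop;
    if not acquired then
      dbms_lock.sleep(0.01);                        -- sleep, then try again
    end if;
  end loop;
  dbms_output.put_line('Acquired after '||cycles||' spin/sleep cycle(s).');
end;
/

Notice there is no queue: a session that starts spinning later can easily acquire the latch before one that has already been spinning and sleeping far longer.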

If you have been wondering about the differences between an Oracle lock and latches/mutexes, I hope this entry has helped clear things up a bit. If you want to dig into serialization in more detail, just go to my web-site and search for latch. And of course I dig into this topic in my Oracle Performance Firefighting course and book.

Thanks for reading!

Craig.

P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.

Wednesday, September 22, 2010

Buffer Cache Visualization and Tool


Oracle's buffer cache is fascinating. And monitoring what's going on inside can be mesmerizing. But trying to convey something so complex using numbers can be, well… complex. So a common technique I use with my clients and students is to use simple graphics and to role play. But lately I've gone a step further and created a tool to visualize the buffer cache. You can download my Buffer Cache Visualization tool and use it for free. It is based on Mathematica from Wolfram, but you only need their free player to use my tool.

Note: Remember this is a model, an abstraction, a learning tool used to focus on specific aspects of interest. Abstractions have been made and many of them are not detailed in this blog. In most cases, the models you create using the visualization tool are massively smaller than any real Oracle system; in fact, the Oracle instance would probably not even start! But we can learn a lot by removing much of the complexity, abstracting, and focusing on the area of interest.

If you spend some time experimenting with the Buffer Cache Visualization tool, I can guarantee you'll achieve a much better understanding of the buffer cache and will visualize how various performance killers manifest. Plus it's a lot of fun!

Visualization Controls

There are a number of controls/options you can apply to the visualization. There are technical controls for the number of buffer headers, cache buffer chains, and least recently used lists, the percentage of dirty buffers, and whether cloning has manifested. There are visual controls that allow you to show only the least recently used lists, the write lists, or the cache buffer chains. Or you can see all the lists together, or, if you want, add in all the buffer header details as well. Because the visualization can become very crowded, I also included options to show only points for the buffer headers, arrows for the links, or the full nodes. There is even an option to change the visual orientation from top to bottom, bottom to top, center, left to right, and right to left. You can also alter the size of the buffer header coloring to make it easier to see. And finally, to allow you to exactly duplicate the structure you create, you can set the random number seed used to create the visualization. As you can see, you'll be able to create an infinite number of unique visualizations, enabling you to find just the right one for your purposes.

Below is a screen shot of the version 1d controls.


Buffer Cache Introduction

The buffer cache contains only the buffers. The buffers contain the block information and have an associated block residing in a database file. There is another structure called a buffer header that is at the center of the list management. Each buffer has an associated buffer header, and all the buffer headers reside in the shared pool, not the buffer cache. While the buffers are an image of the blocks (of course, when they are dirty there is a difference), the buffer headers are not images.

The Buffer Header

The purpose of the buffer headers is to efficiently map the relationships between the buffers, the actual data blocks, the LRUs, the cache buffer chains, and the write lists. Plus each buffer header contains other control information, such as the last time the block was read from disk. So the buffer header is actually at the center of buffer cache management, even though it physically resides in the shared pool. Using OraPub's buffer cache visualization tool, you can get a close look at a buffer header by setting the number of buffer headers, cache buffer chains, and least recently used lists to only one, then choosing the node option. Here is an example:
Above you'll see that buffer header 101 (BH 101) is connected with multiple lists. Looking closely you'll see a to/from pointer to the actual buffer (BUF 101), next/prev pointers related to the cache buffer chain list (CBC 1B and CBC 1E), next/prev pointers to either an LRU or a write list (W 1B and W 1E), and a pointer to block 101 (BLK 101) on disk.

You can learn quite a bit by focusing on a buffer header without it being connected to other buffer headers via the various lists. But let's shift our focus to how the buffer headers are related to the cache buffer chains.

The Cache Buffer Chain

Oracle uses the cache buffer chains to locate a buffer in memory, that is, where it resides in the buffer cache. Without getting into the details, a hashing algorithm is used to direct the searching process to a specific cache buffer chain. If the buffer's associated buffer header is on that chain, then the server process knows the buffer is indeed residing in the buffer cache. If the buffer header is not found on the chain, then the server process knows the buffer is not in the buffer cache and must make a call to the IO subsystem to retrieve the block. Below is an example showing four cache buffer chains: two chains containing all eight buffer headers and the remaining two cache buffer chains containing no buffer headers.

To create the above image using the visualization tool, set: Buffer Headers to 8, CBC Buckets to 4, LRUs to 1, Dirty % to 10, Cloning to No, Focus to CBC, Visualization to Nodes, Orientation to Center, Circle Size to 0.06, and the Random Seed to 1000.

The picture above contains eight buffer headers linked to a possible four cache buffer chains. In real Oracle systems, there will be thousands of buffer headers and potentially over a million cache buffer chains. So if you tried to physically create what you see above, the Oracle instance would not start! Remember this is a model, an abstraction, a learning tool that focuses on specific aspects of interest.

Referencing the image above, if I were a server process and needed to access buffer 106, I would hash to cache buffer chain 2 (CBC 2B) and look for buffer header 106... and I would find it. Since the buffer header contains a link to the physical buffer in the buffer cache, I would now know where to find the buffer!
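Oracle's actual hash function and bucket count are internal to the kernel (and the four chains above are just a teaching simplification). Purely to illustrate the idea of a data block address mapping to a chain, here is a sketch that builds a data block address with dbms_utility.make_data_block_address and folds it into one of four pretend buckets with a simple modulo; the file and block numbers are made up.

-- Illustration only: map file 4, block 106 to one of 4 pretend CBC buckets.
select dbms_utility.make_data_block_address(4, 106)              data_block_address,
       mod(dbms_utility.make_data_block_address(4, 106), 4) + 1  cbc_bucket
from   dual;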

The LRU and Write Lists

The least recently used lists (LRUs) are dual purpose lists. They are used to help keep popular buffers in the buffer cache, to help server processes quickly find a non-popular free buffer, and to help database writers find non-popular dirty buffers. Each buffer header will be on either an LRU or a write list, but not both. And dirty buffers can reside on an LRU if they are deemed popular or have yet to be discovered as not popular. But regardless of a buffer's status of free, pinned, or dirty, if it is not deemed popular then eventually it will be replaced by another buffer. (Something no buffer ever wants, because the party is in the buffer cache!)

Continuing our example, below is an image showing two LRUs (LRU 1 and LRU 2).


To create the above image, the only change from the previous image was the Focus was changed to LRU. The full settings are: Buffer Headers to 8, CBC Buckets to 4, LRUs to 2, Dirty % to 10, Cloning to No, Focus to LRU, Visualization to Nodes, Orientation to Center, Circle Size to 0.06, and the Random Seed to 1000.

The image above contains only six buffer headers linked to two LRUs. In real Oracle systems, there will be thousands of buffer headers linked to probably fewer than 100 LRUs. You may be wondering where the other two buffer headers are. (Remember, our example contains eight buffer headers.) In addition to residing on a cache buffer chain, each buffer header resides on either a least recently used list or a write list. In this case, six buffer headers reside on the two LRUs and the remaining two buffer headers reside on the two write lists (W1 and W2). (There is an associated write list for every least recently used list.)


Looking at the image above, the only thing I changed in the visualization is the Focus was reset to Write. The full settings are: Buffer Headers to 8, CBC Buckets to 4, LRUs to 2, Dirty % to 10, Cloning to No, Focus to Write, Visualization to Nodes, Orientation to Center, Circle Size to 0.06, and the Random Seed to 1000.

Notice that buffer header BH 107 resides on write list W 1B and cache buffer chain CBC 2, but is nowhere to be found on any LRU! Eventually the database writer will write buffer 107 to disk, making it free again, and then buffer header BH 107 will be placed back onto one of the least recently used lists.

Less Clutter and More Information

If you play with this tool for a while, you'll begin to realize that by changing the Visualization setting from Nodes to Points you can cram a lot more information into the visualization. Yes, much of the detail is lost, but you'll begin to notice how complex the buffer cache structures can become and how amazing it is that there aren't more ORA-600 errors! It's not like looking at Yosemite, but you may get a limited sense of awe and wonder.

If you mouse over the nodes (when there are not too many displayed) Mathematica will display the node name. The image below shows this.
I moused over the far left point (with a to/from link) and Mathematica displayed, "LRU 1B" which is the beginning of LRU 1. The above picture was created with these settings: Buffer Headers to 8, CBC Buckets to 1, LRUs to 1, Dirty % to 10, Cloning to No, Focus to Full, Visualization to Points, Orientation to Center, Circle Size to 0.03, and the Random Seed to 1000.

If you look closely at each buffer header node (they are the center nodes of the main flowing list), you will notice that each node, which is a buffer header, has a number of branches. The single branch (that is, link) is to the block in the database file (notice each node has one "to" pointer); the remaining to and from links allow the buffer header to be correctly placed on a cache buffer chain and on a least recently used or a write list, plus there is the pointer to/from the buffer in the buffer cache.

The image above consists of a single least recently used list. Notice that the main trunk looks crowded! And crowding means that the buffer headers are residing on some of the same lists, which is more likely to cause memory related contention, such as some type of latching contention. With this type of visualization it's easy to see that if every server process needs to access this single least recently used list, contention could easily manifest. In fact, we would probably see CPU consumption dramatically increase, resulting in a CPU bottleneck along with the Oracle wait event latch: cache buffers lru chain being the top wait event. The image below is exactly the same as the one above, with the single exception that there are now two least recently used lists.
I purposely moused over least recently used list one's endpoint, LRU 1E. It's actually the node with the to/from links just to the upper left of the mouse-over box's upper left corner. Least recently used list LRU 1 starts at the upper far left of the image and links all the way down to its endpoint, LRU 1E. Notice that where LRU 1 ends, to the list's right the main trunk is not quite as crowded, and in fact further right the main trunk becomes even less crowded. By adding an additional least recently used list, we have reduced the crowding, that is, we have reduced the likelihood of latch related contention. I could go on and on about this, but that's not the purpose of this blog entry.

And of course with the point visualization option, you can create all sorts of interesting images. Here is just one such smiley creation:
The above picture was created with these settings: Buffer Headers to 100, CBC Buckets to 10, LRUs to 6, Dirty % to 10, Cloning to Yes, Focus to Full, Visualization to Points, Orientation to Center, Circle Size to 0.03, and the Random Seed to 1063. Try changing the Random Seed and you'll see a number of variations to the basic structure.

One final picture makes use of the Arrows Visualization option. This is another way to pack lots of information into a single visualization. It is kind of a halfway point between the full nodes and the points. I personally like the color because it adds information to the image, making it easier for me to understand. The cache buffer chains are green, and when multiple lists cross over, they are also shown in green. (I'm still trying to change/fix that!)
The above picture was created with these settings: Buffer Headers to 10, CBC Buckets to 8, LRUs to 4, Dirty % to 10, Cloning to Yes, Focus to Full, Visualization to Arrows, Orientation to Center, Circle Size to 0.03, and the Random Seed to 1014. Again, have some fun by messing with the Random Seed values.

Summary

I hope you take the opportunity to download the buffer cache visualization tool and have some fun yourself! I think you'll be amazed at how much you can learn. Plus, next time you need to help someone understand buffer cache internals or explain why a buffer cache performance situation is occurring, you can actually show them what you're talking about!

And if you enjoy the buffer cache visualization, you'll really enjoy the library cache visualization tool. The images it can create are absolutely stunning.

Thanks for reading,

Craig.

Wednesday, July 21, 2010

More on user calls


On May 7th I posted a blog entry entitled, Understanding user calls. This was a result of my frustration with the words being casually used, yet many DBAs (including myself for many years) not really knowing what they mean. Since user calls is one of the key workload statistics and it can also be used for predictive analysis, I decided to blog about it, running a number of tests to demonstrate its definition.

Since the original posting, I have had some wonderful conversations (especially with my students in Houston and Munich) and have been thinking about other interesting scenarios. The results of these other scenarios are what this blog entry is all about.

For a quick review, the key to understanding a user call is to remember we're referring to an Oracle client/user process making a single request to its server process. Period. As I explore the various scenarios below it's easy to get distracted and confused unless we stick to this definition. And when we do this, regardless of all the activity surrounding the scenario, the number of user calls will clearly emerge. So let's get started!

The DDL effect.

I ran a number of tests creating and dropping both tables and procedures. Without exception, all tests showed two user calls were made: a parse and an execute. I ran create table, drop table, create procedure, and drop procedure, and I changed the number of columns in the table creation statement and the number of SQL statements in the procedure. Every single DDL statement resulted in two user calls. I'm confident that the client process takes the entire statement, creates a single parse and execute call, bundles them into a single SQL*Net action (I don't want to use the word call here), and sends the bundle to the server process. The result was always two user calls (11gR2 v$[sess|sys]stat #7) and one SQL*Net round trip (11gR2 v$[sess|sys]stat #557).
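If you want to repeat these checks yourself, a sketch like the one below reads your own session's counters; run it immediately before and after the test statement and subtract the two readings. The statistic names are real, but the before/after bookkeeping is left to you (my tests used a script for this).

-- Current values of the two counters for this session.
select n.name, m.value
from   v$mystat m, v$statname n
where  m.statistic# = n.statistic#
and    n.name in ('user calls', 'SQL*Net roundtrips to/from client');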

The recursive SQL effect.

The short answer is: no effect. The long answer: As background, a recursive statement is more than just SQL a developer does not type. It is formally defined by its depth. A statement's depth can be found by tracing the SQL and looking for the dep entry. If it is greater than zero, then it has been tagged as recursive SQL. Here is a link to an excerpt from my Firefighting book with more details.

This situation is closely related to the previous situation (The DDL effect). For example, when a table is created Oracle must create additional SQL (that is, recursive SQL) to insert data into the various data dictionary objects, such as tab$ and col$.

Here is the SQL I used for this test. Each of the 500 select statement iterations will be considered recursive SQL.
declare
  looper number;
  the_count number;
begin
  for looper in 1..500
  loop
    select count(*) into the_count from customers;
  end loop;
end;
/
Without exception, the above anonymous PL/SQL resulted in two user calls and a single SQL*Net round trip. It appears the SQL*Plus client process batches the two calls (parse and execute) into a single SQL*Net action and the server process performs the rest of the work. So while the server process may perform lots and lots of work, because the client process only issued a parse and an execute call to the server process, only two user calls result.

The multiple statement effect.

If multiple statements are not included within an anonymous pl/sql block, then each is separately parsed, executed, and if necessary fetches occur. (Picture yourself typing one statement at a time into SQL*Plus and you've got the right idea.) As you can imagine, the user call count can quickly grow.

However, if the statements are contained within a pl/sql block, then the user process batches all the statements together and sends them to its server process as a logical unit. The result is only two user calls; a parse and an execute. If the client process asks for information from its server process (like the put_line in the below code example) then a fetch call is required for a total of three user calls.

For example, take a look at the code below, which is modified code from the previous section. Only three user calls resulted regardless of how many SQL statements were contained in the pl/sql block, their type (select, update, commit, etc.), or their placement in the anonymous block (inside the loop or out). As I'll discuss below, where we place the put_line statement does indeed affect the number of fetches. Notice that this code uses an explicit cursor and the cursor is programmatically fetched one row at a time. The key is that the server process is doing the fetching and only sends the final result back to the client process, resulting in a single user process fetch. This scenario demonstrates the importance of remembering that a user call originates from the user process, not the server process.
set serveroutput on
declare
  cursor customer_cur is select owner,status from customers;
  customer_rec customer_cur%ROWTYPE;
  the_count number;
begin
  the_count := 0;
  open customer_cur;
  loop
    fetch customer_cur into customer_rec;
    exit when customer_cur%NOTFOUND;
    the_count := the_count + 1; 
  end loop;
  dbms_output.put_line('The count is '||the_count);
end;
/
If the above code is modified by simply moving the put_line inside the loop, then the number of fetch-related user calls depends on both the number of rows returned to the client and the fetch array size. In my experiment there are 71,353 rows in the customers table, so the final few lines of output look like this:
...
The count is 71350
The count is 71351
The count is 71352
The count is 71353

PL/SQL procedure successfully completed.

SQL> show arraysize
arraysize 15
If you have been following along closely, you should be able to guess how many user calls executing this anonymous PL/SQL block required. There is a single parse call and a single execute call, plus one fetch-related user call for every 15 rows returned. Since there were 71,353 rows, 4,757 fetches were required (71,353/15 = 4,756.87, rounded up to the nearest integer, that is, the ceiling). So the total number of user calls should be 1+1+4,757, which is 4,759.  And YES, that is exactly what occurred when I tested this multiple times!

Just for fun I reset the array size to 200 and reran the SQL. The number of user calls should be 1+1+ceil(71353/200) which is 359... and it was!
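
If you'd rather let Oracle do the ceiling arithmetic for you, here is a quick sketch using my table's row count:

select 1 + 1 + ceil(71353/15)  as user_calls_arraysize_15,
       1 + 1 + ceil(71353/200) as user_calls_arraysize_200
  from dual;

-- returns 4759 and 359, matching the test results above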

The procedure effect.

Here is an example of a procedure containing the anonymous PL/SQL block from the recursive SQL section above.
create procedure p1
is
  looper number;
  the_count number;
begin
  for looper in 1..500
  loop
    select count(*) into the_count from customers;
  end loop;
end;
/
Once again and without exception, when running the above procedure the user process recorded only two user calls, whether for the procedure creation, dropping, or execution. Even with additional SQL inside and outside of the loop, only two user calls were made by the client process. But as I demonstrated in the Multiple Statement Effect section above, if data is returned to the client process then fetches will occur.
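
For completeness, here is how I would execute the procedure from SQL*Plus; the exec shortcut is simply wrapped into an anonymous block by the client, so once again only a parse and an execute are sent:

exec p1

-- which SQL*Plus sends as:
begin
  p1;
end;
/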

In Conclusion

Obviously, there are an infinite number of scenarios I could have come up with, but I think you get the idea: there is always a single parse and a single execute. And if data is returned to the client process, then for each fetch of that data there is an additional user call. I find it extremely helpful to remember that a user call is always counted from the perspective of the user process, not the server process.

Thanks for reading!

Craig.




P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.

Wednesday, July 7, 2010

The OS CPU run queue; not what it appears

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

When I teach, I often do wacky things, like role playing a server process or a memory structure. But if there is one thing guaranteed to bring everyone to a state of silent anticipation, it is when I say, "The reported operating system run queue includes both processes waiting to be serviced and those currently being serviced." Everyone will stare straight at me, waiting for me to say I was just joking. But I'm not joking; it's the truth.

Ask ten DBAs what the OS run queue is and nine of them will tell you it's the number of processes ready and waiting for CPU. After all, that's what we've been told and the OS manual pages seem vague enough not to contradict this. When I created the HoriZone product (OraPub's now defunct capacity planning product) and was writing my Oracle Performance Firefighting book, I had to verify this; so this became a pretty serious endeavor for me. In this blog entry, I'm going to drill down into what the OS reported run queue is and demonstrate that what most of us have been taught is simply not true.


Looking at the above diagrams, there are three key parts:
  1. An outlined circle represents a CPU core. Diagrams "A", "B", and "C" each have four CPU cores. This could represent a single quad-core CPU, two dual-core CPUs, or four single-core CPUs. The reason cores are used is that they scale well, whereas threads do not. On the flip side, using actual CPUs would understate the CPU subsystem's power and not reflect the actual queuing situation as well. I'll blog about this at a later time, but this is a well documented and accepted capacity planning perspective.
  2. A solid circle represents an OS process; either waiting or being serviced by a CPU core. In diagram "A" above, there is one process in the system, in diagram "B" there are two, and in diagram "C" there are six processes.
  3. A rectangle is the single CPU run queue where all processes must pass and possibly wait before being served by one of the CPU cores. In diagrams "A" and "B" there are no processes in the run queue and in diagram "C" there are two processes in the run queue.


The Shocking Reality

OK, let's put some surprising reality into this. Based on the description of the above diagrams, diagrams "A" and "B" have no processes in the run queue, whereas diagram "C" has two processes in the run queue. However, the standard performance reporting tools sar, vmstat, and top will show the CPU run queue as one for diagram "A", two for diagram "B", and six for diagram "C". Really.

If you're like most DBAs at this point, you're giving me a strange look and waiting for what I have to say next, because I've just poked a hole in something you (and I) have believed to be true for perhaps twenty years. And of course you want some proof... and that is what the rest of this blog entry is: proof. Proof as best I can supply it, in four forms: manual pages, a software vendor's perspective, my performance firefighting experience, and a simple experiment you can run on one of your own idle Unix-like machines that will show, without a doubt, that the OS reported CPU run queue counts both processes ready and waiting for CPU and processes currently receiving CPU resources.

Check the man pages

The first thing I did was to check the sar manual page (linux, solaris, aix, hpux) looking for the -q option (the average CPU run queue) with a column heading of runq-sz. This is what I found:

Linux: run queue of processes in memory and runnable.
Solaris: Run queue length (number of processes waiting for run time).
AIX: Reports the average number of kernel threads in the run queue.
HPUX: Average length of the run queue(s) of processes (in memory and runnable)

That obviously doesn't do much to clear things up. In fact, it brings more confusion than anything else.
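
If you want to glance at these numbers on one of your own Linux boxes before we get to the experiment, here is a quick sketch (column layouts vary slightly between sar and vmstat versions, so check the headings on your system):

sar -q 5 3      # runq-sz is the reported run queue, averaged over each 5 second interval
vmstat 5 3      # the r column is the count of runnable processes (running or waiting to run)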

Contact a forecasting software product company

My next data point was to contact a software company that creates predictive analysis software. Any forecasting product company will have to deal with the reality of gathering OS performance statistics and then correctly interpreting them and integrating them from both a queuing theory and a reality perspective. So I figured they would really, really know the true situation. (I had to do the same thing when I created my now defunct HoriZone predictive analysis product a few years ago.)

There is a UK based company, Metron, that produces a wonderful forecasting product. In the past Metron has sent some of their employees to my Oracle Forecasting & Predictive Analysis training course, so I figured I'd be more likely to get a response from them.

So I emailed their support staff, and the next day I received an email from Colin Woodford that included this: the run-queue length means "the sum of the number of processes that are currently running plus the number that are waiting (queued) to run." Hmm... so far that was the most explicit definition I had received.

My personal experiences

Working on big and mean Oracle production systems for the past twenty years has repeatedly shown me that when the OS reported average CPU run queue nears the number of CPU cores, users will start calling and complaining of poor system responsiveness. I have also noticed that at this point the average CPU utilization is usually between 80% and 95%, depending on the number of CPU cores.

While this is a topic for another blog entry (or two), I've also noticed that for snappy OLTP performance the run queue needs to be clearly below the number of CPU cores, but to minimize the duration of a CPU intensive batch process, the CPU run queue will be just over the number of CPU cores.

Run a straightforward and repeatable experiment

I love to design and run experiments. If you have read my papers and my books you'll notice they are full of experiments. In my earlier Oracle days I spent a lot of time learning how to design a good experiment, which includes isolating the area of interest, running it enough times to get a good statistical sample size, processing the statistics, and correctly interpreting the results. Sometimes an experiment will take less than an hour to run, but some experiments (like the one I'm going to present) took many days to run!

The stronger and more controversial the statement I'm making, the more experimental detail I will provide. From an author's perspective, it's a difficult balance to determine how much detail to supply because some people want all the details while others just want the answer. The other tangle is that computer configurations can differ slightly, so many experiments will not exhibit the same result even if the load placed on the system is the same. But for this experiment we are fortunate: if you have an idle machine sitting around, you can run the exact same experiment and should see the exact same results.

For this experiment, I used a Dell machine with a single four-core Intel 2.6GHz CPU running Oracle Unbreakable Linux 2.6.18-164.el5PAE. Oracle software is not involved in this experiment. This information was gathered using the uname -a and the cat /proc/cpuinfo commands.

The key to this experiment is that I created a very simple yet very CPU intensive process using the bc command. So intensive is this process that it essentially consumes 100% of an entire CPU core. So, for example, when one of the load processes is running (and nothing else), the four-core CPU subsystem will show 25% utilization. When two of the CPU intensive processes are running, the four-core CPU subsystem will show 50% utilization. When you see this type of behavior, you know you're dealing with a full-on CPU sucking process!
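
If you want a quick sanity check before committing to a multi-day run, a minimal sketch is to start a single burner by hand, watch sar for a minute, and then kill it (or use the killbc.sh script shown below):

# start one CPU intensive bc process in the background
echo "scale=12;while(1<2){1+1.243}" | bc > /dev/null &

# on an otherwise idle four-core box, utilization should settle near 25%
sar -u 60 1
sar -q 60 1

# clean up the burner
kill %1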

For every incremental load increase, I gathered 32 samples, each 15 minutes (900 seconds) in duration. The system was loaded starting with one CPU intensive process on up to ten. So for the complete experiment I gathered a total of 320 samples, and with each taking 15 minutes you can see why it took four days to complete. But I wanted at least 30 samples for each incremental load, and I wanted more than just a few minutes duration for each sample to ensure the system had a chance to stabilize and really run at that load intensity.

I used two separate scripts for this experiment. The first script (shown below) is the runqtest.sh shell script. It is the main controlling script that incrementally starts an additional CPU intensive process and records the CPU utilization and run queue for later analysis.

The second script (also shown below), killbc.sh, will search out and kill all bc processes. Anytime I run an experiment where the system can take over the entire machine causing it to become unresponsive, I will create a kill switch type of process. It also allows me to very quickly restart the experiment, which I frequently have to do when I'm developing the experiment.

Shown below are the two scripts:
[oracle@fourcore runqueue]$ cat runqtest.sh

#!/bin/sh

maxprocs=10
calmsecs=5
samples=32
interval=900

tmpfile=/tmp/sar1.tmp
finalfile=spinresults.txt

rm -f $finalfile

echo "Showing all bc processes now."
ps -eaf|grep bc|grep -v grep|grep -v vim
echo "If you see any bc processes, quite now!"
sleep 5

for ((p=1;p<=$maxprocs;p++))
do
  echo "scale=12;while(1<2){1+1.243}" | bc > /dev/null &;
  sleep $calmsecs
  for ((s=1;s<=$samples;s++))
  do
    echo -n "Starting sample now: "
    echo -n "processes=$p sample=$s interval=$interval ..."
    sar -uq $interval 1 > $tmpfile
    echo "done"
    util=`cat $tmpfile | head -4 | tail -1 | awk '{print $4+$5+$6}'`
    runq=`cat $tmpfile | head -7 | tail -1 | awk '{print $3}'`
    echo -e "$p\t$s\t$interval\t\t$util\t$runq" >> $finalfile
  done
done

./killbc.sh >/dev/null 2>&1

echo "Complete runqueue test at `date`" >> $finalfile

echo ""
echo ""
echo -e "Procs\tSample\tInterval(s)\tUtil%\tRunQ"
cat $finalfile

You will notice the runqtest.sh script calls the killbc.sh script, so here it is also:

[oracle@fourcore runqueue]$ cat killbc.sh

#!/bin/sh

ps -eaf|grep bc|grep -v vim|grep -v grep
for p in `ps -eaf|grep bc|grep -v vim|grep -v grep|awk '{print $2}'`
do
  echo "killing process $p now."
  kill -9 $p
done
ps -eaf|grep bc|grep -v vim|grep -v grep

Sample and Population Dispersion

The next couple of paragraphs delve into statistically describing the sample and population dispersion. If you're not interested in this, just jump down to the section entitled, Experimental Results.

To describe our samples we need some, well...samples. But if we are going to make inferences about the entire population (our samples and all the other possible samples), we need to gather 30 or more samples. Why the magical 30 samples? When describing the dispersion of our samples the [sample] standard deviation is used, but when describing the population dispersion, we should use the population standard deviation. In part it's because the only difference between the sample standard deviation and the population standard deviation is that there is either an N or an N-1 in part of the denominator. Below is the formula for the sample standard deviation.
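
Written out (a plain-text rendering, assuming the usual form with N samples x_1 through x_N and sample mean \bar{x}):

s = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2 }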


Notice the 1/N. For describing your samples (as shown above) the formula includes the 1/N. However, for describing all samples, which includes our samples and the samples we did not gather (that is, the population) the 1/N is replaced with 1/(N-1). When N reaches around 30, the difference between 1/N and 1/(N-1) becomes extremely small and insignificant. So when I want to make inferences about the entire population, I always ensure I have at least 30 samples.

If you want to run this experiment but don't want to wait four days to get the results, use a sample duration of perhaps only 1 minute. Since the load is very simple (unlike when Oracle is involved) it will stabilize very quickly. If the results fluctuate wildly, then you'll know you need to increase the sample duration time. In addition to the previous few paragraphs above, it is important to gather 30 or more samples as this will confirm your samples are not the result of randomness or anomalous behavior.
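
If you do shorten things, the only edits needed are the two variables near the top of runqtest.sh; something like this (my suggested values, not what I actually ran):

# quicker, less statistically robust settings for runqtest.sh
samples=30      # still at least 30 samples per load increment
interval=60     # 1 minute samples instead of 15 minutes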

The Experimental Results

Here is my summarized experimental data:

Here are the chart column definitions.

Load Processes. This is the number of CPU intensive bc processes that were running.
RunQ Avg. This is the average OS CPU run queue as reported by the sar -q command.
Util Avg. This is the average OS CPU utilization as reported by the sar -u command.
Runq Stdev. This is the standard deviation of the average run queue samples.
Runq -90%. This is the lower end of the 90% range based on the average run queue and its standard deviation. Mathematically, this is (RunQ Avg - 1.645 * standard deviation). To translate this into English: with the number of load processes at 10, the average run queue was between 9.660 and 10.500, 90% of the time.
Runq +90%. Similar to the Runq -90%, but the upper or higher end of the spectrum.
Util Stdev. This is the standard deviation of the average utilization samples.
Util -90%. This is the lower end of the 90% range based on the average utilization and its standard deviation. Mathematically, this is (Util Avg - 1.645 * standard deviation). To translate this into English: with the number of load processes at 10, the average CPU utilization was between 100.000 and 100.000, 90% of the time... yes, that is extremely consistent.
Util +90%. Similar to the Util -90%, but the upper or higher end of the spectrum.
Samples. This is the number of 15 minute duration samples taken at each load increment.

Key Observations

Here are some key observations:

1. As you can see, the results are extremely consistent. This strengthens our conclusions.

2. There are four CPU cores in this Linux box. This is pretty obvious because each additional process increments the CPU utilization by 25% until, of course, we hit 100% utilization.

3. When there are one to four load processes running, the OS reported run queue matches the number of processes. This demonstrates that the Linux reported run queue does indeed include the running, that is, CPU consuming, processes. Period.

4. When the OS reported run queue surpasses the number of CPU cores while still equaling the number of load processes, this means the Linux reported run queue includes both the processes running (consuming CPU) and processes waiting to run (waiting to consume CPU but in the CPU run queue). Period.

To summarize this experiment and to make sure there is no miscommunication: this experiment demonstrates that the OS reported average CPU run queue contains both processes currently running and processes ready/waiting to run.

To Summarize

Usually I find the manual pages very helpful and enlightening, but in this case they are pretty much useless. I am very appreciative that Metron returned my email and, even more so, that their definition was very specific. I hope this experiment puts to rest any doubts you may have about the definition of the OS reported CPU run queue. While I have not run this experiment on other Linux distributions or Unix flavors, I'm confident the results will be similar. For Windows systems it could be different; after all, Windows is an entirely different beast.

To summarize (again), this experiment demonstrates that the OS reported average CPU run queue contains both processes currently running and processes ready/waiting to run.

Thanks for reading!

Craig.