Friday, December 9, 2011

Singular CBC Latch Acquisition Pattern Diagnosis

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Quick Introduction

In my previous posting on this topic I focused on determining cache buffer chain (CBC) wait time acquisition patterns; disperse or singular. A disperse pattern occurs when many CBC latches are active. In contrast, there are situations when only one or a few CBC child latches are extremely active, hence the singular pattern. In my previous posting, I also listed some disperse pattern solutions.

In this posting I'm focusing on the additional diagnostic steps needed before a specific solution can be determined with a singular wait acquisition pattern.

Important: To understand what I write below you need a basic understanding of Oracle buffer cache internals. In particular the buffer, latch, child latch, cache buffer chain, and buffer header. These are described (including nice diagrams) in my posting here. In that posting a visualization tool is also used which you can download for free here.

Situations with singular CBC acquisition patterns

Two of the most common popular buffer concurrency situations that can stress a single CBC latch are:

First, when a single block contains some application reference information and is not being changed often. The popular buffer causes its hash/cache buffer chain to be popular and then also the CBC child latch covering the hash buffer chain. If concurrency is high enough, this can become an issue. However, if the buffer is being changed often, we'd likely see a buffer busy wait.

Second, relates to a popular index root block. Any time an index is accessed its single root block must first also be accessed, which repeatedly forces the question, "Is this index root block in the buffer cache?". If the concurrency becomes high enough, eventually you'll see CBC latch contention associated with a single CBC child latch.

There are other singular CBC access situations. Next I'll present how to determine the acquisition pattern.

Diagnosing the acquisition pattern

We need to determine if the CBC child latch access pattern is disperse or singular. A way to do this is shown the SQL below, which is essentially my latchchild.sql OSM script. The script collects access details for latch number 150 (the CBC latch) over a 300 second period. Run the script a few times (i.e., collect a few sample sets) to ensure what you think is occurring actually is occurring.
def latch=150
def sleep=300
drop table op_interim
/
create table op_interim as
select addr,gets,sleeps
from   v$latch_children
where  latch# = &latch
/
exec dbms_lock.sleep(&sleep);
select t1.addr,
       t1.gets-t0.gets delta_gets,
       t1.sleeps-t0.sleeps delta_sleeps
from   op_interim t0,
       (
         select addr,gets,sleeps
         from   v$latch_children
         where  latch# = &latch
       ) t1
where  t0.addr = t1.addr
order by 3,2
/

ADDR             DELTA_GETS     DELTA_SLEEPS
-------- ------------------ ----------------
...
76C33818            19,8038                3
76C644A8             8,1535                4
76C7759C             9,6993                4
76C2D1CC            14,5096                4
76E03FEC            14,9355                4
76CDD394            16,0718                4
76C68904            18,2300                4
76D69374             6,5250                5
76D7800C            13,3134                5
76DA2650            15,6578                5
76C02B88            15,8293                5
76CE59E0            14,5169                7
76C7B9F8            15,8243                7
76CFF120             6,5228                9
76CFCF30            14,8187               11
76D38668            96,6345               62
76CDF508            98,1384               96
76DAF26C          1,96,2752              187

1024 rows selected.
Figure 1. A 300 second CBC child latch acquisition activity collection SQL and report. Notice the last three child latches are the most active and have the most sleeps (which we see as "waits"). 

Figure 1 shows three CBC child latches received a high proportional amount of activity, especially the 76DAF26C child latch. Therefore, we'll focus on the three child latches with addresses 76DAF26C, 76DF508, and 76D38668.

Determine the hot buffer(s) and buffer header(s)

More precisely, we need to determine the hot buffer headers that are on the CBC chains protected by the unusually active CBC child latch(es). The x$bh performance table contains information about each buffer header that has an associated buffer in the buffer cache. (More about buffer headers) There are four columns of particular interest to us:

HLADDR is the "hash latch address" hence the column name hladdr. This is a foreign key to the v$latch_children view's addr column.

FILE# is the buffer header's (also Oracle block and buffer) file number that links to many performance views and tables, such as dba_data_files.file_id and dba_segments.header_file.

DBABLK is the buffer header's (also Oracle block and buffer) block number which may be referenced in dba_segments.header_block.

TCH is the buffer header's touch count number, which is a popularity indicator...with a few twists I'll detail below. In every Oracle release I have checked (including 11.2), the tch column is not in v$bh, only x$bh.

Keep in mind that each child latch will normally "cover" multiple cache buffer chains (perhaps >100) and each chain can have zero or more associated buffer headers (average chain length is usually 0 to 1). And of course each buffer header is related to a cached Oracle block residing in the buffer cache.

In the report shown in Figure 1, we saw which CBC child latches are relatively busy; 76DAF26C, 76DF508, and 76D38668. Now we need to know which CBC chains and buffers are related to our relatively active CBC child latches. By querying x$bh we can easily determine the buffer headers associated with a given CBC child latch. We can also get a clue as to the buffer header's popularity. The SQL below is once such query.
SQL> l
  1  select hladdr, file#, dbablk, tch
  2  from   x$bh
  3  where  hladdr in ('76DAF26C','76CDF508','76D38668')
  4* order by 4
SQL> /

HLADDR        FILE#     DBABLK        TCH
-------- ---------- ---------- ----------
...
76D38668          1      70197          5
76D38668          1      39365          5
76DAF26C          1     117328        185
76CDF508          1     117329        186

47 rows selected.
As I wrote many years ago in my touch count algorithm paper and detailed in my Oracle Performance Firefighting book, Oracle uses the "touch count" algorithm to essentially tag popular buffers. There is a little twist though... a buffer's touch count can get reset to zero. So to truly determine the popular buffers, we must repeatedly sample x$bh.

The SQL below can be used to repeatedly sample x$bh finding the most popular buffers given their CBC latch address. If you look closely at the SQL, you'll notice it collects and stores 300, 1 second interval x$bh samples. If you use a larger sleep time, you'll want to increase the number of samples collected. The final select statement reports the key tch based popularity statistics.
drop table op_interim;
create table op_interim (hladdr raw(4), file# number, block# number, tch number);
declare
  i number;
begin
  for i in 1..300
  loop
    insert into op_interim 
      select hladdr,file#, dbablk, tch
      from   x$bh
      where  hladdr in ('76DAF26C','76CDF508','76BE93CC');
    dbms_lock.sleep(1);
  end loop;
end;
/
select hladdr, file#, block#,
       count(*) count, min(tch) min, median(tch) med,
       round(avg(tch)) avg, max(tch) max
from   op_interim
group by hladdr, file#, block#
order by 7
/

HLADDR    FILE#     BLOCK#      COUNT        MIN        MED        AVG        MAX
-------- ------ ---------- ---------- ---------- ---------- ---------- ----------
...
76BE93CC      1      39364        300          6          6          6          6
76BE93CC      1      69730        300         13         14         14         14
76CDF508      1     117329        600          1         36         64        181
76DAF26C      1     117328        300         70        125        125        180
76BE93CC      4    1552339        300         76        131        131        186

42 rows selected.
The statistics that I'm most interested in are the median (MED) and the maximum (MAX). And I'm hoping there is a clear buffer header or very few buffer headers that are relatively really, really active (i.e., hot).

As a side note, you may have noticed that one of the COUNTS is 600 while the others are 300. This could have occurred because there are two buffer headers related to buffer 1,117329. Perhaps one could be the current mode (CU) buffer and the other a consistent read (CR) buffer. The above SQL could hbe improved by adding the state column to more uniquely identify a buffer header by its file#, block#, and state...but I digress.

Based on the above SQL output, we know the hot buffers (file#, block#) are: (1,117328) and (4,1552339). You could also argue to include 1,117329 but this buffer header is not consistently hot because its median is much lower than the other two. But in all honestly, if this was a real production system and because the above SQL could be improved, I would investigate. Now we need to understand why these two buffer headers are so popular.

Determine the hot buffer details

I will investigate the two most popular buffers. But honestly, my focus in this posting is on the second buffer; 4,1552339.

Investigating Buffer 1, 117328

To determine the segment's name and type, I'm going to use my dba_extents based OSM script, objfb.sql. The SQL is a little tricky, so you may want to check it out. Let's first look at block 1,117328.
SQL> @objfb 1     117328

Database: prod18                                               18-NOV-11 09:45am
Report:   objfb.sql            OSM by OraPub, Inc.                Page         1
            Object Details For A Given File #(1) and block #(117328)

File number    :1
Block number   :117328
Owner          :SYSTEM
Segment name   :OP_LOAD_PARAMS
Segment type   :TABLE
Tablespace     :SYSTEM
File name      :/u01/oradata/prod18/system01.dbf
OK, so we're dealing with a table. But this is strange because the op_load_params table is used to interactively change the load intensity of my workload generator tool. (You can download an older version here. I haven't posted the latest version... just lazy I guess.)

When I'm investigating hot table buffers, I also check if the buffer is the segment header block (header blocks contain special stuff...sorry to be so non-specific... but I digress). To determine if the buffer is a segment header block, I ran the below code snippet:
SQL> l
  1  select owner, segment_name, segment_type
  2  from   dba_segments
  3  where  header_file=&hdr_file
  4*   and  header_block=&hdr_block
SQL> /
Enter value for hdr_file: 1
Enter value for hdr_block: 117328

OWNER      SEGMENT_NAME              SEGMENT_TY
---------- ------------------------- ----------
SYSTEM     OP_LOAD_PARAMS            TABLE
Since a row was returned, this popular buffer header is indeed the op_load_params table header block! This is not what I expected and not the focus of this blog posting... so I'll move on. But if this was a production system, you better believe I would figure it out!

Investigating Buffer 4, 1552339

Now let's turn our attention to the other hot buffer header, namely buffer header 4,1552339. First I'll determine the object type by running my objfb.sql script.
SQL> @objfb 4    1552339

Database: prod18                                               18-NOV-11 09:51am
Report:   objfb.sql            OSM by OraPub, Inc.                Page         1
            Object Details For A Given File #(4) and block #(1552339)

File number    :4
Block number   :1552339
Owner          :MG
Segment name   :SPECIAL_CASES_BOGUS
Segment type   :INDEX
Tablespace     :USERS
File name      :/u01/oradata/prod18/users01.dbf

1 row selected.
Very interesting... an index. So we know this hot buffer header is an index block. I'm willing to bet it's the index's root block! Why? Because every time an index is accessed, it's root block buffer header (and buffer) is also accessed. And a very active index root block buffer header can cause its hash chain to be very active which can cause problems when a process attempts to acquire the hash chain's child latch. But how can we tell if an index block is the root block?

As I wrote in my November 11, 2011 posting, we can expect an index's root block to have a block number that is one greater than its segment header block. If you look closely at the SQL below, the block number I entered is one less then the popular block (1552338 = 1552339-1). If this 1552339 block is indeed an index root block, then it's segment header will have a block number of 1552338... let's check it out!
SQL> /
Enter value for hdr_file: 4
Enter value for hdr_block: 1552338

OWNER      SEGMENT_NAME              SEGMENT_TY
---------- ------------------------- ----------
MG         SPECIAL_CASES_BOGUS       INDEX

1 row selected.

SQL> l
  1  select owner, segment_name, segment_type
  2  from   dba_segments
  3  where  header_file=&hdr_file
  4*   and  header_block=&hdr_block
SQL> 
Fantastic! A row was returned, which means the popular buffer header is indeed the index's root block!

So the initial diagnosis is complete and now we need a solution. A poor solution would be to increase the number of CBC latches. Adding CBC latches does indeed significantly help during a disperse CBC latch acquisition pattern because each latch covers fewer chains. But when there exists a singular CBC acquisition pattern, the additional latches won't help much. What we need to do is somehow make the popular index root block buffer header less popular. One solution is to hash partition the index, which effectively creates multiple root blocks because each partition has a root block. At this point we have our diagnosis and a solution.

Summary

This posting has a number of objectives:

  1. How to identify CBC child latch acquisition patterns; disperse or singular
  2. How to determine the hot buffer headers related to specific CBC child latches
  3. What information is needed to further diagnose the hot buffer header
  4. Focus on the situation (diagnosis and solution) when the hot buffer header is an index root block

This is one of my more complicated (and perhaps confusing) posts because to really understand what I'm getting at, you must have a good understanding of Oracle buffer cache internals.

Personally, the hot index root block situation is particularly satisfying. It's a very real application of when the top wait event is "latch: cache buffer chains" simply increasing the number of CBC child latches will not significantly improver performance. But with just a couple of extra diagnostic steps, we can nail the core Oracle contention area.

It took a while to get there but I hope you enjoyed the ride!

Thanks for reading!

Craig.

If you enjoyed this blog entry, I suspect you'll get a lot out of my courses; Oracle Performance Firefighting and Advanced Oracle Performance Analysis. I teach these classes around the world multiple times each year. For the latest schedule, click here. I also offer on-site training and consulting services.

P.S. If you want me to respond with a comment or you have a question, please feel free to email me directly at craig@orapub .com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email. Another option is to send an email to OraPub's general email address, which is currently orapub.general@gmail .com.

Monday, December 5, 2011

Important article published in DB Trends magazine

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Greetings,

This is an unusual post because it's rare my work is published by traditional media. The magazine DB Trends and Applications (DBTA) published an article I wrote entitled, Uniting Operations Research With Time-Based DB Performance Analysis. If you receive the printed magazine it's on page 28 of the December 2011 issue but you can also read it on-line here.

What's the article about? In the short article, I introduce

Unit of Work Time Based Analysis as the intersection of Oracle performance firefighting and Oracle forecasting and predictive analysis.

This intersection is one of the main themes in my two day class, Advanced Oracle Performance Analysis. Understanding and applying the content unlocks deep performance insights and allows you to active a much high level of performance analysis. Why? Because you can objectively compare various performance solutions both numerically and visually. And this comparison can be done from a very high and abstract level (think: pictures) down to a very detailed operations research mathematical level (think: formulas). It's powerful.

It's no coincidence my three classes are similarly entitled: Oracle Performance Firefighting, Advanced Oracle Performance Analysis, and finally Oracle Forecasting & Predictive Analysis.

Because I rarely teach my forecasting class, I want to mention that I will be offering my Oracle Forecasting & Predictive Analysis class in Frankfurt, Germany, February 6 - 8, 2012. I just finished teaching this class in Santa Clara, CA last week and we all had a great time for sure!

My firefighting courses will also be taught in Sweden the week of January 30 and also the week of February 27 in Santa Clara, California.

I hope you enjoy the article! And if you have any questions, feel free to email me at orapub.general@orapub.com.

All the best in your performance endeavors!

Craig.