Learning Oracle

Latches and Latch Contention

Posted in Reading Material,Write-Ups by ealing on February 19, 2008

I’ve taken a short detour into the world of latch contention. I read two papers, Resolving Oracle Latch Contention and Conquering Oracle Latch Contention. The stand-out thing about these two papers is that, modulo a caveat about CPU usage, they recommend exactly opposite ways of reducing latch contention! One of Conquering’s suggestions is to decrease the value of the parameter _spin_count, but Resolving’s main suggestion is to increase the value of _spin_count! Let me try to summarise what I’ve learned about latches before explaining why I think they’ve come to different conclusions about how to deal with latch contention problems.

What is a Latch?

Latches are “lightweight” mechanisms that are used to serialise access to memory structures. It is not possible to queue for a latch – a process either gets the latch or does not. Memory structures protected by latches include things like the buffer cache, the Java pool and the library cache.

Behaviour on Latch Get Failure

When a process tries to obtain a latch, it may succeed or fail. On most hardware, the latch get operation is implemented as an atomic “test-and-set” instruction. Unlike failure to obtain a lock (which may result in a process sleeping or failing), failure to obtain a latch does not immediately result in the process giving up the CPU. Exactly what happens when a latch is not obtained depends on the mode in which the latch was requested. If it was requested in “immediate” mode then the requesting process resumes execution immediately; it may then fail or carry out another action. If the latch was requested in “willing to wait” mode, then the process does not resume immediately if the latch request fails. Instead, the process starts the “spin and sleep” routine (as it’s named in Conquering).

Because of the speed at which latches are obtained and released, the latch-seeking process can remain on the CPU, and is said to be “spin-locked”. Spin-locked processes consume CPU, and may be preventing other processes from executing. If a spin-locked process still has not obtained the latch after _spin_count attempts, it will relinquish the CPU and sleep, allowing other processes to resume execution. The sleeping process will post the “latch free” wait.
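
To make the spin phase concrete, here is a minimal sketch in Python. It is only an illustration of the behaviour described above, not Oracle’s implementation: a threading.Lock stands in for the latch, a non-blocking acquire stands in for the atomic test-and-set, and the names try_get and spin_get are my own.

```python
import threading
from typing import Tuple

SPIN_COUNT = 2000  # stand-in for _spin_count (2000 is the default quoted in this post)


def try_get(latch: threading.Lock) -> bool:
    """One non-blocking attempt; a stand-in for the atomic test-and-set."""
    return latch.acquire(blocking=False)


def spin_get(latch: threading.Lock, spin_count: int = SPIN_COUNT) -> Tuple[bool, int]:
    """Try up to spin_count times without giving up the CPU.

    Returns (got_latch, attempts_made). A False result marks the point where
    a willing-to-wait process would sleep and post the latch free wait.
    """
    for attempt in range(1, spin_count + 1):
        if try_get(latch):
            return True, attempt
    return False, spin_count
```

An “immediate” mode request corresponds to a single call to try_get; a “willing to wait” request corresponds to spin_get followed by a sleep if it returns False.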

When the sleeping process is resumed on the CPU, it will again request the latch. If it fails, it will spin again, and request again, until it has once more made _spin_count attempts to acquire the latch, at which point it will sleep again. The first sleep while waiting for a particular latch will be 10ms long, as will the second. Sleep lengths follow an exponential pattern, so the third and fourth are 20ms long, the fifth and sixth are 40ms long, and so on.
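
The sleep schedule above can be written as a small formula. This sketch assumes only what the paragraph states (a 10ms base, doubling after every second sleep); whether Oracle caps the sleep length at some maximum is not covered here, so no cap is modelled. The function names are mine.

```python
def sleep_ms(sleep_number: int, base_ms: int = 10) -> int:
    """Length in ms of the nth sleep (1-based): 10, 10, 20, 20, 40, 40, ..."""
    if sleep_number < 1:
        raise ValueError("sleep_number starts at 1")
    # Sleeps come in pairs, and each pair is twice as long as the previous pair.
    return base_ms * 2 ** ((sleep_number - 1) // 2)


def total_sleep_ms(sleeps: int, base_ms: int = 10) -> int:
    """Total time slept after the given number of sleeps."""
    return sum(sleep_ms(n, base_ms) for n in range(1, sleeps + 1))


print([sleep_ms(n) for n in range(1, 7)])  # [10, 10, 20, 20, 40, 40]
print(total_sleep_ms(6))                   # 140
```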

Parent and Child Latches

Some latches, such as the cache buffers chains latch protecting cached data blocks, have multiple children. In these cases the parent latch is not, as far as I can see, a useful object. It exists to aggregate statistics in dynamic performance views. Where child latches exist, they offer finer granularity of locking than would otherwise be available.
For instance, if there are N child latches protecting blocks in the buffer cache, then there can be up to N processes altering these blocks concurrently. Because each block “belongs” with a particular latch, it is still possible for running processes to block each other, but this becomes less likely as the granularity of the latching increases.
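
A rough sketch of the idea in Python follows. The mapping from a block to its child latch is an assumption made for illustration (a simple modulo on a made-up integer block address); the post does not describe how Oracle actually assigns blocks to cache buffers chains latches. The class and method names are hypothetical.

```python
import threading


class ChildLatches:
    """N child latches, each protecting the blocks that map to it."""

    def __init__(self, n_children: int) -> None:
        self.children = [threading.Lock() for _ in range(n_children)]

    def child_for(self, block_address: int) -> int:
        # Hypothetical mapping: every block "belongs" to exactly one child.
        return block_address % len(self.children)

    def try_get(self, block_address: int) -> bool:
        """Non-blocking get of the one child latch covering this block."""
        return self.children[self.child_for(block_address)].acquire(blocking=False)

    def release(self, block_address: int) -> None:
        self.children[self.child_for(block_address)].release()


latches = ChildLatches(4)
print(latches.try_get(1))  # True: child 1 was free
print(latches.try_get(2))  # True: a different child, so no contention
print(latches.try_get(5))  # False: block 5 also maps to child 1, which is held
```

With more children, two unrelated blocks are less likely to share a child, which is the sense in which finer granularity reduces contention.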

Oracle Wait Interface[4] is confusing on the exact behaviour of processes trying to obtain latches. On page 145, it says:

If a process fails to get one of the child latches in the first attempt, it will ask for the next, also with the no-wait mode. The willing-to-wait mode is used only on the last latch and when all no-wait requests against other child latches have failed.

That behaviour wouldn’t make sense if the latch requested was protecting a block in the buffer cache. After all, if any of the N child latches was sufficient, then there could be N processes concurrently changing the same block! It’s been suggested to me that the above is wrong in the general case, but may be true in some specific cases like the redo copy latch. This seems plausible to me, but I can’t verify it.

Latch-Related System Information

There are a number of latch-related dynamic performance views. These include:

  • v$latch – lists all latches available in the system
  • v$latchname – a straightforward translation from number to name
  • v$latch_parent / v$latch_children – have nearly identical columns to v$latch, but de-aggregated
  • v$latch_misses – details on where latch acquisition attempts have been missed (but see Ixora article)
  • v$latchholder – contains a SID and PID for each held latch
  • v$event_histogram – contains a histogram of waits, sorted by the event that caused them (including latches) and how many milliseconds they lasted

Dealing With Contention

Resolving makes the point that latch contention must be considered in the context of the whole system. If latch free waits are only a small proportion of the total waits, then there’s little to be gained by trying to eliminate them. Furthermore, it’s not the latch miss ratio that matters, but the total number of misses.
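
A small Python illustration of the last point. The two latches and their figures are entirely made up for the example; the field names gets and misses mirror columns of v$latch, but nothing here queries a database.

```python
from typing import List, NamedTuple


class LatchStat(NamedTuple):
    name: str
    gets: int
    misses: int


def miss_ratio(stat: LatchStat) -> float:
    """Fraction of gets that missed."""
    return stat.misses / stat.gets if stat.gets else 0.0


def rank_by_total_misses(stats: List[LatchStat]) -> List[LatchStat]:
    """Order latches by absolute number of misses, largest first."""
    return sorted(stats, key=lambda s: s.misses, reverse=True)


# Invented figures, for illustration only.
SAMPLE = [
    LatchStat("rarely used latch", gets=1_000, misses=100),        # 10% ratio
    LatchStat("busy latch", gets=50_000_000, misses=500_000),      # 1% ratio
]

worst_by_ratio = max(SAMPLE, key=miss_ratio)
worst_by_total = rank_by_total_misses(SAMPLE)[0]
print(worst_by_ratio.name)  # rarely used latch
print(worst_by_total.name)  # busy latch
```

Ranking by ratio points at the rarely used latch, even though the busy latch accounts for five thousand times as many misses.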

Conquering takes an understanding-led approach. When dealing with latch contention, it suggests three questions:

  1. What is the use of the memory structure protected by this latch?
  2. Why do processes want to access this latch so often?
  3. Why do they hold the latch so long?

After understanding the cause of the problem, possible solutions should present themselves.

_spin_count: Up or Down?

Resolving points out that _spin_count has been an undocumented parameter since version 8i (released in 1999). Its default value, 2000, has apparently not changed in that period. Since CPUs are much faster today than they were in 1999, the time taken by a spinning process to request a latch 2000 times is much lower today than it was when the default was established. So a process that cannot obtain a latch spends a much greater proportion of its time sleeping on modern hardware than it would have on older hardware. This is demonstrated with the results of an experiment, where the optimum value of _spin_count was found to be approximately 12000.
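
The argument can be put as back-of-envelope arithmetic. This sketch is mine, not taken from Resolving: it assumes every attempt in one spin-and-sleep cycle fails, uses the 10ms first sleep mentioned earlier, and plugs in invented per-attempt costs (1 microsecond for “old” hardware, 0.1 microsecond for “new”) that are not measurements of anything.

```python
def sleep_fraction(spin_count: int, attempt_us: float, sleep_ms: float = 10.0) -> float:
    """Share of one failed spin-and-sleep cycle that is spent asleep."""
    spin_ms = spin_count * attempt_us / 1000.0
    return sleep_ms / (spin_ms + sleep_ms)


# Hypothetical per-attempt costs, chosen only to show the direction of the effect.
OLD_ATTEMPT_US = 1.0
NEW_ATTEMPT_US = 0.1

old_default = sleep_fraction(2000, OLD_ATTEMPT_US)   # 2ms spinning, 10ms asleep
new_default = sleep_fraction(2000, NEW_ATTEMPT_US)   # 0.2ms spinning, 10ms asleep
new_raised = sleep_fraction(12000, NEW_ATTEMPT_US)   # 1.2ms spinning, 10ms asleep

print(f"old hardware, 2000 spins:  {old_default:.1%} asleep")
print(f"new hardware, 2000 spins:  {new_default:.1%} asleep")
print(f"new hardware, 12000 spins: {new_raised:.1%} asleep")
```

Under these assumptions, the same _spin_count buys less spinning time on faster hardware, so the waiting process reaches its sleep sooner; raising _spin_count moves the balance back towards spinning.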

Before advising against raising _spin_count, Conquering notes that by the time the DBA sees a latch free wait, the posting process has already consumed CPU time waiting for the latch. The step in the author’s reasoning that I don’t quite get is the next:

“With this in mind, you may be able to see why it is very common to have latch contention when there is an operating system CPU bottleneck.”

This seems back-to-front to me. I can see that:

more latch contention => more CPU consumption by spinning processes => CPU bottleneck

What I can’t see is that a CPU bottleneck would lead to more latch contention, unless a lot of sleeping CPU-starved processes are holding latches required by processes on the CPU. In any case, I do agree that you probably won’t want to increase _spin_count when there is a CPU bottleneck: this certainly would increase CPU demand, but may not increase the rate at which useful work is done.

Reading Material

  1. Resolving Oracle Latch Contention, Guy Harrison
  2. Conquering Oracle Latch Contention, Craig Shallahamer
  3. Oracle Data Dictionary Pocket Reference, D. C. Kreines. O’Reilly, 2003
  4. Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning, R. Shee, K. Deshpande, K. Gopalakrishnan, McGraw-Hill Osborne, 2004

2 Responses to 'Latches and Latch Contention'

  1. Paul Karman said,

    I read the Resolving and the Conquering documents before I found your review. Guy Harrison made a remark that stayed with me. It goes something like “You can play with _spin_count” as long as your CPU run queue is not greater than 1. Whether a right-sized machine will ever have a CPU run queue of 1 is debatable, but in practice you see a lot of CPU-oversized systems. On the other hand, you would probably mainly see latch contention at moments of heavy processing, with a CPU run queue reflecting that. Anyway, I had such a CPU-oversized system at hand, and while using Guy Harrison’s tool (Spotlight) to measure latch contention I changed the default spin count of 2000 to 4000. And sure enough, the red blobs turned beautifully green. The CPU run queue of this 2-CPU AIX box stayed between 1 and 2, and only once in a while went up to 4. CPU idle time did not get worse. But, to be honest, I do not have comparison material in the form of time measurements to back the improvement up, and users generally only complain when a system gets slower. So, I guess I now only need to find out whether the way Spotlight measures is objective 🙂

  2. […] the block contention occurring in the database; • Latch Contention: shows the latch contention occurring in the database; • Top 10 Queries using more disk reads: shows the 10 […]
