On the Importance of Insisting upon Answering the Right Question

Here’s a quiz for you. It’s the intro to a true story with a happy ending:

Executives report a problem with Payroll processing. It’s so slow that they’re not able to pay people on Fridays like they’re supposed to. Employees are retaliating by vandalizing company property over the weekends. This has been going on for three months now; it’s getting really expensive. Database administrators report that the system’s problem is heavy CPU consumption and latch-related syscalls.

How do you solve this problem?

The first step here is vital. It is to commit your focus to the experience the users are feeling, which is Payroll. Not CPU. Not latches. Payroll. That’s the first step. It may sound pedantic, but it’s important to get this right. You’ll see why in a minute.

The next step is to answer the question, “How does Payroll spend its time?”

That’s obvious, right? If we can learn how Payroll spends its time, then we can eliminate any time it’s wasting.

Okay, so how do we find out how Payroll spends its time? It’s a trickier question than it sounds. The DBAs think they’ve already answered it. Remember, right in the problem statement, they said, “…the system’s problem is heavy CPU consumption and latch-related syscalls.” They even had an inch-thick report printed out to prove it.

But the problem is, what they’re saying is not the answer to our actual question. Our actual question is:

“How does Payroll spend its time?”

What the DBAs are telling us is:

“How does the system spend its time while Payroll is running?”

Those are two different questions. They are not equivalent. The second is not a surrogate for the first.

So, I may sound stubborn, but this is important: I don’t want to see the systemwide monitor’s data about the situation. In fact, I specifically want not to see it.

There are two reasons: (1) it won’t answer my question, and (2) even looking at it increases my susceptibility to confirmation bias.

The people onsite already had the confirmation bias problem. They were so sure they had a CPU and latch problem that they couldn’t conceive of their system any other way.

So, when they asked me to review the inch-thick report, I said something like, “You’ve been working on this problem for a long time. Let’s see if looking at things a little differently might help us find a breakthrough.” I pushed the report to the side and noted politely that maybe we’d need it later.

Spoiler alert: We wouldn’t need it later.

So, then, if you can’t find out how Payroll spends its time with a workload report, then how can you find out?

By tracing it.

So we asked the DBA to trace Payroll.

He probably didn’t want to. He might have been offended that we didn’t trust his word about the CPU and latches. He might have thought, what can a trace show you that I haven’t already told you? But he traced Payroll for us, and it worked.

The trace cracked the case pretty much immediately. It showed that over 70% of the Payroll program’s duration was spent on the network. Less than 15% was spent consuming CPU and latches. Surprised everybody.

All it took to make Payroll run twice as fast was to adjust a network configuration parameter. It took less than an hour to figure out how to make the horrible, three-month problem go away.

The end.

So, what’s the deal, then? Were the DBAs just stupid? Were their tools lying?

No, the DBAs were just fine. They were smart people who had looked at exactly the data they’d been trained to look at, and they had drawn exactly the conclusions they’d been trained to draw. And their tools weren’t lying. The system was dominated by CPU and latch activity. It’s just that Payroll wasn’t.

Confused? Here’s a metaphor. Imagine a Skynet “Automatic Wellness Repository” application that shows what percentages of people on Earth have which diseases. It might, with perfect accuracy, tell you that 98.6% of people on Earth who have a health problem right now have the flu. But encouraging our doctor to view the world through flu-colored glasses doesn’t help my daughter’s agonizing dislocated elbow.

We needed a trace instead of a workload report so we could see exactly what Payroll is doing—what only Payroll is doing. (The elbow and only the elbow.)

This company’s workload monitor was a reputable tool, one of the nice ones. But it was never going to help anyone solve the Payroll problem. Such tools are not good for improving user experiences because they get two things wrong: filtering and aggregation.

  • Filtering. People couldn’t see Payroll’s problem because their tools mushed the execution statistics about Payroll into a big pile with the execution statistics about everything else. Payroll dominated the company’s attention, but other programs dominated their machine, and that dominance misdirected their analysis. To diagnose Payroll, you need to look at just Payroll, with everything else filtered out. But workload monitors can’t filter like that. Fun fact: Their workload monitor did do some filtering of its own. It discarded all the network-specific syscalls that we saw in the trace data (the ones that cracked our case).
  • Aggregation. You can’t extrapolate detail from an aggregate. Nobody can. Not even Chuck Norris. It’s impossible to attribute which bits and pieces of which sums and averages from the monitoring tool belong to which user experiences. You need to aggregate by user experience, but workload monitors can’t do that.
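
Here is a minimal sketch of why those two points matter. It is not the monitor's logic or any real tool's. The records, the experience names (`payroll`, `batch_reports`), and every number are invented for illustration. They are chosen only so that the shape matches the story: the machine as a whole is dominated by CPU and latches, while Payroll is dominated by the network.

```python
from collections import defaultdict

# Hypothetical timing records: (experience, category, seconds).
# All names and numbers are invented; they are not data from the story.
RECORDS = [
    ("payroll", "network", 72.0),
    ("payroll", "cpu", 9.0),
    ("payroll", "latch", 4.0),
    ("payroll", "other", 15.0),
    ("batch_reports", "cpu", 600.0),
    ("batch_reports", "latch", 250.0),
    ("batch_reports", "network", 20.0),
    ("batch_reports", "other", 30.0),
]


def profile(records, experience=None):
    """Return {category: fraction of total time}.

    With experience=None, everything is summed together (the systemwide view).
    With an experience name, only that experience's records are counted.
    """
    totals = defaultdict(float)
    for exp, category, seconds in records:
        if experience is None or exp == experience:
            totals[category] += seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing matched the filter
    return {cat: secs / grand_total for cat, secs in totals.items()}


systemwide = profile(RECORDS)                        # the big pile
payroll_only = profile(RECORDS, experience="payroll")  # the elbow and only the elbow

for label, result in (("systemwide", systemwide), ("payroll only", payroll_only)):
    print(label)
    for category, share in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
        print(f"  {category:8s} {share:6.1%}")
```

Both profiles are computed from the same records, and both are accurate. They just answer different questions. Note that the filter only works because each record still carries its experience label; once the numbers have been summed into systemwide totals, that label is gone and no arithmetic brings it back.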

To “tune,” watch the user. Not the DBA.

David Ensor

“Watching the user” is the key to aligning your optimization efforts with business priority. But to really “watch” a user, you have to commit to filtering your performance data to match what your user is experiencing. This may require you to learn how to use data sources you’re not presently using.

In the Oracle ecosystem, traces provide the richest, most flexible data you can get for optimizing user experiences. With the right tools, traces are easy to filter and aggregate. Traces answer questions that workload monitors can’t.
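
To give a rough feel for what "easy to filter and aggregate" can mean, here is a small sketch that sums wait time by event name from trace lines. It rests on some assumptions: that wait events appear on lines shaped like `WAIT #<cursor>: nam='<event>' ela= <duration> ...`, that `ela` is a duration in microseconds, and that the sample lines (which are made up) are representative. It ignores CPU and elapsed time reported on other line types, so it is a partial wait summary, not a full profile, and it is not how Method R Workbench works.

```python
import re
from collections import defaultdict

# Assumed shape of a wait-event line in an Oracle extended SQL trace file.
WAIT_LINE = re.compile(r"^WAIT #\d+: nam='(?P<name>[^']+)' ela=\s*(?P<ela>\d+)")


def wait_profile(lines):
    """Sum the 'ela' durations per wait event name, largest total first."""
    totals = defaultdict(int)
    for line in lines:
        match = WAIT_LINE.match(line)
        if match:  # non-WAIT lines (PARSE, EXEC, FETCH, ...) are skipped
            totals[match.group("name")] += int(match.group("ela"))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Invented sample lines for one traced session; values are illustrative only.
SAMPLE = [
    "WAIT #1: nam='SQL*Net message from client' ela= 4200 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1000",
    "WAIT #1: nam='db file sequential read' ela= 300 file#=4 block#=120 blocks=1 obj#=-1 tim=1400",
    "EXEC #1:c=1000,e=1200,p=0,cr=3,cu=0,mis=0,r=0,dep=0,og=1,tim=1500",
    "WAIT #1: nam='SQL*Net message from client' ela= 3900 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=5500",
    "WAIT #1: nam='latch free' ela= 150 address=0 number=0 tries=0 obj#=-1 tim=5700",
]

result = wait_profile(SAMPLE)
for name, microseconds in result:
    print(f"{name:32s} {microseconds:8d}")
```

Because the input is the trace of one program's session, the filtering happens before any summing does: everything in the totals belongs to the experience you are trying to fix.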

If you’re interested in reading more stories like this, my newest book, How to Make Things Faster, is full of them. If you’re interested in the tools we use to diagnose problems like the one described in this article, have a look at the application called Method R Workbench.




