Finding the root cause of compute resource problems

May 25, 2017

Roger Yao

Introduction:
One of the great things about my role at Uila is that I get to work directly with customers and see firsthand how they use our software in their environment and how it performs against their real-world, everyday issues. While demos, webinars, and whitepapers are excellent sources of product information for customers, the best way to prove product value and maintain a feedback loop to our software designers is to become a detective of sorts, observing and understanding how the software runs once installed in the customer’s data center.

In this post, and more to come, this Root Cause Detective will share as many details and screenshots as possible (removing identifiable information like IP addresses and machine names, of course!). That said, please read on...

Root Cause Detective, Case File #1: Carolina Biological

Recently I had the opportunity to install Uila’s software at Carolina Biological, a company that has been providing scientific supplies to schools and universities for 90 years(!), with a strong and enduring reputation amongst their customer base. For perspective, imagine what Uila’s current Silicon Valley offices would have looked like back when Carolina Biological was already a decade or so old.

Over the decades, Carolina Biological has done an excellent job of keeping their operations in lockstep with modern applications and infrastructure: they use an Oracle JD Edwards (JDE) ERP system that runs on VMware to support all of their critical business functions, including manufacturing, fulfillment, and sales. Since so much of their success rides on their solid reputation, they knew it was time to call in a detective with x-ray vision when some isolated application performance issues became too challenging to resolve on their own. Enter Uila.

Call the Detective:

When I first spoke with Carolina Biological, their users had been complaining for some time that the JDE application would periodically slow down -- even to the point that it was unusable. Despite repeated efforts to identify the root cause, the problem continued to plague the application and IT managers.

Their existing monitoring tools gave them separate views of the performance for virtualization and networking, and couldn’t correlate problems across the infrastructure. They suspected an application performance issue or networking bottleneck, but lacked visibility into either one. There was also no way for them to get alerts specific to the application. This made identifying the problem frustratingly elusive.

Observations Detected & Resolutions:

The first thing we did after we installed Uila was to define departments and sites so we could see performance from the end user perspective. The lack of visibility into end user performance was one clear gap in Carolina Biological’s performance monitoring. In the screenshot below, you can see we immediately identified a problem with the HTTP query. The storage and network layers were both OK, but the application itself was taking 1143 milliseconds to respond (a little over a second on every transaction adds up quickly!).
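To put "adds up quickly" in numbers, here is a minimal sketch of the arithmetic. The 1143 ms figure comes from the dashboard above; the count of 200 transactions per user per day is purely an assumption for illustration, not a number measured at Carolina Biological.

```python
def cumulative_wait_seconds(response_ms, transactions):
    """Total time spent waiting on the application, in seconds."""
    return response_ms * transactions / 1000.0


RESPONSE_MS = 1143            # application response time reported in the post
TRANSACTIONS_PER_DAY = 200    # hypothetical per-user workload (assumption)

wait_s = cumulative_wait_seconds(RESPONSE_MS, TRANSACTIONS_PER_DAY)
print(f"{wait_s:.1f} seconds of waiting per user per day")
print(f"{wait_s / 60:.1f} minutes of waiting per user per day")
```

Under that assumed workload, each user loses several minutes a day just waiting for responses, before counting the periods when the application slows down further.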

Of course, an ERP application relies on multiple servers. How do you find the bottleneck? With Uila, we were able to click in and see the culprit: the primary web server was taking several seconds to respond. The dashboard also shows the probable cause -- CPU health.

By viewing the application topology and dependency map, we were able to immediately see that the slow server was creating a bottleneck that affected the rest of the application.
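The idea behind reading a dependency map this way can be sketched in a few lines. This is not how Uila builds or walks its topology; it is a generic breadth-first search over a hand-written graph, and every node name below is hypothetical.

```python
from collections import deque

# Hypothetical dependency map: each node lists the nodes it calls.
depends_on = {
    "end-users": ["web-primary"],
    "web-primary": ["app-server"],
    "app-server": ["database"],
    "reporting": ["database"],
    "database": [],
}


def affected_by(slow_node, depends_on):
    """Return every node that directly or indirectly calls slow_node."""
    # Invert the map so we can walk from a node to its callers.
    callers = {node: [] for node in depends_on}
    for caller, callees in depends_on.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    seen = set()
    queue = deque([slow_node])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


print(affected_by("web-primary", depends_on))
print(affected_by("database", depends_on))
```

In this toy graph, a slow web server is felt by everything that reaches the application through it, while a slow database would be felt by every tier above it.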

From there, we confirmed the likely root cause by looking at the specific server. The Uila dashboard calculated that there was a 51 percent chance the problem was tied to CPU health, and a 24 percent chance it was tied to memory health.

So we dug a little deeper to look at the actual CPU health on the relevant system. It quickly became clear the VM was CPU-constrained at times -- up to 95% utilization. CPU usage on the host system and across the cluster was fine, which pointed to an undersized VM as the likely culprit. From there, we could pull up the internal system metrics by authenticating to the server.
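The reasoning in that step (VM pegged, host and cluster healthy, therefore the VM is too small) can be written down as a simple rule. The sketch below is an illustration only: the thresholds, the sample values other than the 95% peak, and all names are assumptions, and it does not represent Uila's own analysis.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    vm_pct: float       # VM CPU utilization, percent
    host_pct: float     # host CPU utilization, percent
    cluster_pct: float  # cluster-wide CPU utilization, percent


def classify_cpu_bottleneck(samples, vm_limit=90.0, infra_limit=70.0):
    """Rough label based on peak utilization; thresholds are assumptions."""
    vm_peak = max(s.vm_pct for s in samples)
    host_peak = max(s.host_pct for s in samples)
    cluster_peak = max(s.cluster_pct for s in samples)

    if vm_peak < vm_limit:
        return "vm-not-constrained"
    if host_peak < infra_limit and cluster_peak < infra_limit:
        # The VM is saturated while the hardware underneath has headroom.
        return "vm-undersized"
    return "host-or-cluster-contention"


# Hypothetical samples; only the 95% VM peak is taken from the post.
samples = [
    CpuSample(62.0, 35.0, 30.0),
    CpuSample(95.0, 41.0, 33.0),
    CpuSample(88.0, 38.0, 31.0),
]
print(classify_cpu_bottleneck(samples))
```

The useful distinction is the last branch: if the host or cluster were also running hot, adding vCPUs to the VM would not help, and the fix would be rebalancing or adding capacity instead.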

With Uila’s full stack visibility software installed and in action, Carolina Biological was able to quickly follow the problem from the end user perspective (slow access to the ERP system) all the way through to the underlying server.

Roger’s Case #1 - Root Cause Detective Report:

With the underlying end-user performance issues resolved in Carolina Biological's data center using Uila's x-ray vision (and detection), we anticipate they'll have another 90 years of success and innovation for their customers. Wee-luh (Uila) looks forward to being a part of this.

If you are interested in working with Uila's "Root Cause Detective" to analyze your data center performance issues, please ask. If we can't resolve it, you may be entitled to a $50 Amazon gift card for 'stumping the detective'.
