The Curious Case of a Healthy Service Timing Out



A look at how the scientific process can help figure out very puzzling production issues


For several months, we at Prosple Engineering were plagued by an intermittent stream of 502 and 504 errors in our API Gateway. One of our downstream microservices seemed to be the cause. Curiously, that service's resource consumption was well below its thresholds, and our development and staging environments weren't experiencing the issue. It was a mystery.

Luckily, one of our team members stumbled upon this article, and we theorized that there's an issue with that service's Nginx configuration. Long story short, we implemented the fix and it worked.

But I won't be talking about the solution. What I'll discuss here is how we tested both the hypothesis and the fix. We could've just accepted the solution presented in the article blindly, deployed it, and seen if it worked. But as engineers, we should always exercise healthy skepticism and test things as early in development (or in this case, as early in troubleshooting) as possible to minimize the cost in time, effort, and money.

The Challenge

So we had a hypothesis: the service's Nginx might be misconfigured. The first step was to test that hypothesis. How did we do it? By replicating the errors on our local machine!

That sounded simple. But before we even came up with this hypothesis, we had spent months trying to replicate the issue just so that we could come up with a hypothesis. No one had any idea. Remember, the service's health and resource consumption were more than fine. The only clues we had were the intermittent Bad Gateway and Gateway Timeout alerts that were showing up in our APM (Application Performance Monitoring).

But once we finally had a reasonable hypothesis, we were now able to narrow down our investigation. All efforts shifted to the service's Nginx instance.

The next question was how exactly to test the hypothesis on our local machine. As already mentioned, the errors were intermittent, which led us to another hypothesis: if we repeatedly execute the exact same HTTP request against the service, most requests will succeed and only a few will time out. Fortunately, there is a well-known technique for testing this new hypothesis: load testing!

Load Testing to the Rescue

Simply put, load testing is the act of bombarding an application with a huge number of requests. For this purpose, we chose autocannon, a free and open source load testing tool.

We had two hypotheses, and we tested them separately:

  1. We tested the intermittency hypothesis by load testing without changing any of the app's configurations. This was straightforward. Once we increased the load enough, we started getting the timeout errors for some of the requests.

  2. We tested the Nginx configuration hypothesis by implementing the suggested fix and running the load test again. It sounds counterintuitive to apply the fix in order to test the diagnosis, but the approach is loosely similar to what mathematical logic calls "proof by contradiction": if the misconfiguration were not the cause, changing it should make no difference to the errors. This wasn't as easy as testing the intermittency hypothesis though, as we discuss in the next section.
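The logic of the two checks above can be sketched in a few lines. This is only an illustration of the reasoning, not the code we ran: the `RunSummary` shape and the function names are hypothetical, and the field names are not autocannon's actual result format.

```typescript
// Hypothetical per-run tally. These field names are assumptions made for
// this sketch; they are not taken from autocannon's result object.
interface RunSummary {
  total: number;      // requests sent during the run
  timedOut: number;   // requests that got no response before the timeout
  badGateway: number; // responses with status 502 or 504
}

// Fraction of requests that failed in a run (0 when nothing was sent).
function failureRate(run: RunSummary): number {
  if (run.total === 0) return 0;
  return (run.timedOut + run.badGateway) / run.total;
}

// Intermittency hypothesis: some requests fail, but not all of them.
function isIntermittent(run: RunSummary): boolean {
  const failed = run.timedOut + run.badGateway;
  return failed > 0 && failed < run.total;
}

// Configuration hypothesis: under the same load, the run with the fix
// applied shows a lower failure rate than the run without it.
function fixReducedFailures(before: RunSummary, after: RunSummary): boolean {
  return failureRate(after) < failureRate(before);
}
```

The comparison is only meaningful when both runs use the same load settings and the same machine, which is why stabilizing the test setup mattered so much.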

The Pitfalls of Local Machine Load Testing

Many factors can affect load testing, especially on one's local machine: available RAM, processing power, and other open applications, among others.

This was immediately apparent when we started the tests. We kept hitting the limits of either the Docker containers or the host machine (the developer's laptop). There was no other choice but to tweak the settings until the results were consistent, both before and after applying the fix.

We eventually ended up with these settings for autocannon:

  • connections: 50

  • connectionRate: 100

  • duration: 10

  • pipelining: 1

  • timeout: 30

  • renderStatusCodes: true

  • debug: true

With these settings, we were able to confirm that the fix did decrease the timeout issues.
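To keep the before and after runs comparable, it helps to write the settings down once and reuse them. The sketch below only builds the options object and an equivalent command line; it does not import or run autocannon. The URL is a placeholder, `toCliArgs` is a hypothetical helper, and the assumption that each option maps to a long CLI flag of the same name should be checked against the autocannon documentation for the installed version.

```typescript
// Settings from the list above. The URL is a placeholder for the local
// service under test, not the real endpoint.
const loadTestOptions = {
  url: "http://localhost:8080/",
  connections: 50,
  connectionRate: 100,
  duration: 10,
  pipelining: 1,
  timeout: 30,
  renderStatusCodes: true,
  debug: true,
};

// Hypothetical helper: turns the options into command-line arguments,
// assuming every option name is also a long flag (an assumption to verify).
function toCliArgs(options: Record<string, string | number | boolean>): string[] {
  const args: string[] = [];
  let url = "";
  for (const [key, value] of Object.entries(options)) {
    if (key === "url") {
      url = String(value); // the target URL is appended last
    } else if (typeof value === "boolean") {
      if (value) args.push(`--${key}`); // boolean flags take no value
    } else {
      args.push(`--${key}`, String(value));
    }
  }
  if (url !== "") args.push(url);
  return args;
}

console.log(["autocannon", ...toCliArgs(loadTestOptions)].join(" "));
```

Running the same command before and after applying the fix keeps the load identical, so any difference in timeouts can be attributed to the configuration change rather than to the test setup.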

Summary

In conclusion, diagnosing and resolving intermittent 502 and 504 errors in our API Gateway was a challenging but enlightening experience. By forming and rigorously testing our hypotheses through load testing, we were able to pinpoint and address the misconfiguration in the service's Nginx. This process underscored the importance of methodical troubleshooting and the value of skepticism in engineering. Ultimately, our diligent efforts paid off, leading to a more reliable and robust system.
