Availability, Resiliency, and Retry Storms


Last updated August 22, 2023
Tags: Design Patterns, Resilience
Picture this for a minute. It's peak usage time for your application; users are going about their day-to-day, clicking buttons and typing things. For a short moment, one part of your infrastructure loses availability. Let's say something triggered a database election, and your API nodes haven't figured out the new database topology yet. Maybe there's a DNS/TCP/TLS timeout, maybe queries hit the dead node, or one of a million other possible things.
A few users notice this, as part of their UI fails to load because the underlying API calls failed. So your users do the natural, obvious thing: they try again. They hit refresh. Unfortunately, their increased use of the API increases the load on the new primary, which hasn't finished its election yet. While the new primary is trying to come online while also serving your new surge of requests, it fails to respond to its health checks in time. So the controller decides this new primary is ineffective, kills it, and triggers yet another election.
Whoops! The new election causes more users to notice something is happening, more refresh, more requests, more delays, more failures. You've got a retry storm on your hands.
When I was back at Foko Retail, we had something similar happen. There were a few unoptimized queries performing table scans in one of our API calls. This API call was in a hot path, given that its data was used to render roughly 50% of every page in our app. Even worse, since JS didn't have any built-in promise cancellation pattern, a refresh usually meant that "dead" API calls - whose responses would never be seen by anyone - kept churning away on the server, producing answers nobody would use.
Previously, this hadn't been a problem. After all, databases are ridiculously fast, even when they're slow. We'd never noticed it, until just the right combination of things happened to trigger an application blackout. As we onboarded larger customers, the frequency of the blackouts became unreasonable and warranted some professional poking around.

Time-to-Recovery

Monitoring your time-to-recovery is particularly important for SRE teams. Simply put, this is the amount of time that your application spent being unavailable to users. In practice, the time it takes to fix the underlying issue will often be longer than your TTR should be. Designing a resilient application involves embracing the idea that failure will happen, and being ready for it so that recovery doesn't have to wait on the fix.
Of course, we optimized the queries. We identified the slow queries and forced them to use an index. We also added cancellation functionality to the hot API call, so that we wouldn't over-queue work. But the problem brought forward a more important realization: sometimes unoptimized code will sneak by you, and you need a backup plan that doesn't involve humans the next time there's an issue. When you're trying to optimize your time to recovery, relying on humans isn't terribly smart - we're very slow.
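As an illustration of the cancellation idea (a minimal sketch, not our exact implementation): in modern JS you can abandon a stale in-flight request with AbortController, so a refresh doesn't stack duplicate work on top of a struggling backend. The endpoint and function names here are made up for the example.

```js
// Minimal sketch of client-side request cancellation using AbortController.
// The endpoint and function names are hypothetical.
let inFlightController = null

async function fetchDashboardData() {
  // Abandon the previous request if it's still pending - its response
  // would never be used anyway.
  if (inFlightController) {
    inFlightController.abort()
  }

  inFlightController = new AbortController()
  const response = await fetch('/api/dashboard', {
    signal: inFlightController.signal,
  })
  return response.json()
}
```

Note that aborting only stops the client from waiting; to actually stop the server-side work, the server has to notice the closed connection and cancel its query.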
This is where resilience components come in - they're design patterns meant to add better failure handling to your application. For our critical hot path, the failure handling was terrible (actually, non-existent - we just spat an error out). But resilience design patterns could help maintain availability long enough that perhaps our users would not notice an issue until we were already shipping a fix (or never - let's aim for never).

Thank you, circuit breakers.

If you've ever plugged a vacuum into the wrong electrical socket in your house, and suddenly your housemates are yelling at you about why their TV just turned off during the most climactic scene, then you've experienced the pleasure of the circuit breaker design pattern. A circuit breaker is a component that sits on the client-side between a client and a service. When the service starts to fail, the circuit breaker "trips" and breaks the client's connection to the service (this is called the "open" state - under normal operation the breaker is "closed" and requests flow through). After some time of failing fast, the breaker enters a "half-open" state in which it tests the underlying service's availability for real. At this point, if the service has recovered, the breaker resets to closed and allows the connection again. If the service is still unavailable, the breaker goes back to being open.
TL;DR
  • Circuit breakers sit on the client-side, between the client and the service being consumed
  • Requests are proxied through the breaker and monitored for errors
  • When an error threshold is crossed, the circuit breaker stops allowing requests to be proxied - giving the underlying service a chance to cool off
  • When the service is back, the circuit breaker resets
If you prefer to look at code, here's a tiny localized circuit breaker for an async function:
```js
function breakerify(fn) {
  let allowRequests = true
  let lastRequestTime

  return async function proxy(...args) {
    // While tripped, fail fast until the cooldown window (1s) has passed
    if (!allowRequests && Date.now() - lastRequestTime < 1e3) {
      throw new Error(`Service is using its PTO - please stop.`)
    }

    try {
      lastRequestTime = Date.now()
      const result = await fn(...args)
      allowRequests = true // success: reset the breaker
      return result
    } catch (err) {
      allowRequests = false // failure: trip the breaker
      throw err
    }
  }
}
```
There are many different ways that circuit breakers can be implemented. Depending on your service architecture, your circuit breaker might skip the "half-open" state entirely and use a lower-cost method of testing service availability (e.g. sending a PING to Redis is a cheaper way to test a connection than performing an actual operation). You can also decide how sensitive you want your breaker to be, and how quickly you want it to reset.
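For instance, here's a minimal sketch of that lower-cost probe, assuming an ioredis-style client (ping() is a real ioredis method, but the wrapper functions here are made up):

```js
const Redis = require('ioredis') // assumed client; anything with a ping() works

const redis = new Redis()
let tripped = false

async function healthyOrThrow() {
  if (!tripped) return

  // Probe with a cheap PING instead of letting a real operation through
  try {
    await redis.ping()
    tripped = false // service is back - reset the breaker
  } catch (err) {
    throw new Error('Redis is still unavailable')
  }
}

async function guardedGet(key) {
  await healthyOrThrow()
  try {
    return await redis.get(key)
  } catch (err) {
    tripped = true // trip the breaker on failure
    throw err
  }
}
```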
If your client is distributed, something like a fleet of API nodes working with a database, it might also be a good idea to find a way to synchronize your circuit breakers (i.e. if one API node's circuit breaker trips, they all trip). This can help you avoid a staggered half-open state - which is when your API nodes each go into a half-open state at different times, and possibly accidentally overload the underlying service by testing it too frequently.
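One rough sketch of synchronized breakers (just an illustration, not how rsxjs implements it) is to keep the breaker state in a shared store like Redis, so a trip on one node trips them all. The key name and cooldown below are arbitrary:

```js
const Redis = require('ioredis') // assumed shared store
const redis = new Redis()

const BREAKER_KEY = 'breaker:reports-service' // arbitrary key name
const COOLDOWN_SECONDS = 30 // arbitrary cooldown

async function tripBreaker() {
  // Setting a key with a TTL trips the breaker for every node at once;
  // it re-closes automatically when the key expires.
  await redis.set(BREAKER_KEY, '1', 'EX', COOLDOWN_SECONDS)
}

async function isTripped() {
  return (await redis.get(BREAKER_KEY)) !== null
}

async function callService(fn, ...args) {
  if (await isTripped()) {
    throw new Error('Circuit breaker is open across the cluster')
  }
  try {
    return await fn(...args)
  } catch (err) {
    await tripBreaker()
    throw err
  }
}
```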
If you want a handy little library to help you with your breaker implementations, I created rsxjs, which contains both localized circuit breakers and Redis-powered synchronized breakers. It was used in production at Foko, and continues to be used. I also used some other components from that library in HireFast.

Killing off the weakest link with bulkheads

Breakers are not the only useful resilience pattern that fits this use case. Bulkheads, a cousin of the breaker, address a similar problem. While breakers exist on the client-side, bulkheads exist on the service-side. Sometimes a service can detect when it might be overloaded. In my story at Foko, the API nodes were aware that the database queries were failing, which meant the API was overloaded with queries. Though sometimes you can respond to higher load by scaling out, that wasn't possible when the underlying database was the thing struggling with the load - scaling out the API cluster would increase costs without any benefit.
When your application is able to detect extra load, one possible solution is to start shedding load. By shedding load, you might be able to guarantee partial availability to your users. Deciding what load is actually sheddable is the tough part. At Foko, when we were designing a load shedding strategy, we focused on optimizing around database resource usage. Thanks to the incredible real-time monitoring dashboard provided by MongoDB Atlas, we were able to simulate common user workloads and see which queries, in which workflows, caused excessive CPU usage (usually as a result of complex aggregation pipelines). When there were spikes of excessive load, it made sense to fail those API requests, since they bore the most load and made up a relatively small percentage of total API requests.
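As a rough illustration (a sketch, not Foko's actual implementation), a service can shed its most expensive endpoints once a load signal crosses a threshold. Here the load signal is a simple in-flight request counter, and the endpoint list and threshold are made up, assuming an Express-style app:

```js
const express = require('express') // assumed framework for the sketch
const app = express()

// Endpoints known (from profiling) to be expensive - hypothetical paths
const SHEDDABLE_PATHS = new Set(['/api/reports', '/api/analytics'])
const MAX_IN_FLIGHT = 100 // arbitrary threshold

let inFlight = 0

app.use((req, res, next) => {
  // Under pressure, fail the expensive endpoints fast to protect the rest
  if (inFlight >= MAX_IN_FLIGHT && SHEDDABLE_PATHS.has(req.path)) {
    return res.status(503).json({ error: 'Shedding load, please retry later' })
  }

  inFlight++
  res.on('finish', () => {
    inFlight--
  })
  next()
})
```

The nice property of shedding by endpoint is that the cheap requests keep succeeding, so users still get partial availability.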
For more information about bulkheading, and to read about Shopify's bulkheading strategies, you can check out Shopify's semian library.

Fallbacks

The core idea behind patterns like the circuit breaker is that you can fail fast, which gives you the opportunity to implement fallbacks. Conceptually, the breaker is designed to fail during the open state - since it assumes that the underlying service is going to fail anyway. But how you handle this at a UX level is up to you.
The simplest option is to error out (i.e. no fallback) - this is often used for applications where consistency is a hard constraint, such as booking systems, hardware applications, etc. However, despite what they teach in computer science courses at university, consistency doesn't ALWAYS have to be a hard constraint. There are plenty of sensitive applications that find workarounds - such as banking systems taking time to settle funds (which provides flexibility without losing consistency).
Another possibility is to use a cache as a fallback, trading off consistency for availability. Netflix is known to use this strategy, as are many other large applications with looser consistency constraints. This is only useful if your queries have low variance, so you can achieve a high hit ratio with low memory usage. With HireFast, I used in-memory client-side caching for the majority of the API calls, only reporting failures if no stale response was available. But this wasn't possible with our search feature, where the variance was extremely high - the cache would grow quickly with a low hit ratio.
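Here's a minimal sketch of that stale-cache fallback, using a plain in-memory Map (the names are made up, and a real implementation would also bound the cache's size and age):

```js
// Wraps an async function so that failures fall back to the last good response.
function withStaleFallback(fn, keyFn) {
  const cache = new Map()

  return async function cached(...args) {
    const key = keyFn(...args)
    try {
      const result = await fn(...args)
      cache.set(key, result) // remember the last good response
      return result
    } catch (err) {
      if (cache.has(key)) {
        return cache.get(key) // serve stale data instead of an error
      }
      throw err // nothing cached - surface the failure
    }
  }
}

// Usage: getFolders and its URL are hypothetical
const getFolders = withStaleFallback(
  (teamId) => fetch(`/api/teams/${teamId}/folders`).then((res) => res.json()),
  (teamId) => `folders:${teamId}`
)
```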
You can also selectively render the UI based on what data is actually available. For instance, in HireFast, I would hide non-critical UI elements - like the list of folders or the list of job applications - if their API calls had failed. The calls would transparently retry in the background, and if the data became available, those components would pop up later.
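A small, framework-agnostic sketch of that background-retry behaviour (the fetcher, render callback, and delay are placeholders): the section simply stays hidden until a retry eventually succeeds.

```js
// Retries a fetcher in the background with a fixed delay, and only invokes
// the render callback once data is available. Names here are placeholders.
function loadOptionalSection(fetcher, renderSection, retryDelayMs = 5000) {
  fetcher()
    .then((data) => renderSection(data)) // data arrived - show the section
    .catch(() => {
      // Stay hidden and quietly try again later
      setTimeout(
        () => loadOptionalSection(fetcher, renderSection, retryDelayMs),
        retryDelayMs
      )
    })
}

// Usage with hypothetical names
loadOptionalSection(
  () => fetch('/api/folders').then((res) => res.json()),
  (folders) => showFoldersSidebar(folders)
)
```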