I recently had the opportunity to attend an AWS bootcamp at their Herndon, VA office and a short presentation given by their team on Designing for Failure. It opened my eyes to the reality of application design when dealing with failure or even basic exception handling.
One of the defining differences between a good developer and a great one is how they deal with failures. Good developers will handle the obvious cases in their code: checking for unexpected input, catching library exceptions, and sometimes edge cases. But why do we build resilient applications, and what does failure mean for the end user?
In this blog post, I’ll share with you the key points that a great developer follows when designing resilient applications.
Why build resilient applications?
There are two main reasons that we design applications for failure. As you can probably guess from the horrifying image above, the first reason is User Experience. It’s no secret that you will have user attrition and lost revenue if you cannot shield your end users from issues outside their control. The second reason is Business Services. All business-critical systems require resiliency, and the difference between 99.7% uptime and 99.99% uptime could be hours of lost revenue or interrupted business services.
Given an application load of 1 billion requests per month, 99.7% uptime means 2+ hours of downtime per month versus just 4 minutes for 99.99%. Ouch!
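To make the arithmetic concrete, here is a small sketch of how those numbers fall out. It assumes an average month of 730 hours and traffic spread evenly over time; the function names are mine, purely for illustration.

```python
HOURS_PER_MONTH = 730  # assumption: average month (365 days * 24 h / 12)


def monthly_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per month implied by an uptime percentage."""
    return HOURS_PER_MONTH * 60 * (1 - uptime_percent / 100)


def failed_requests(uptime_percent: float, requests_per_month: int) -> float:
    """Requests lost per month, assuming load is evenly distributed."""
    return requests_per_month * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for uptime in (99.7, 99.99):
        minutes = monthly_downtime_minutes(uptime)
        lost = failed_requests(uptime, 1_000_000_000)
        print(f"{uptime}% uptime: {minutes:.1f} min down, ~{lost:,.0f} requests lost")
```

Under those assumptions, 99.7% works out to roughly 131 minutes and about 3 million lost requests per month, while 99.99% is a little over 4 minutes and about 100,000 requests.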
Werner Vogels, the CTO of Amazon Web Services, once said at a previous re:Invent, “Everything fails, all the time.” It’s a devastating reality and it’s something we all must accept. No matter how mathematically improbable, we simply cannot eliminate all failures. It’s how we reduce the impact of those failures that improves the overall resiliency of our applications.
The way we reduce the impact of failure on our users and business is through graceful degradation. Conceptually it’s very simple: we want to continue to operate in spite of a failure, in some degraded capacity. Keeping with the premise that applications fail all the time, you’ve probably experienced degraded services without even realizing it, and that is the ultimate goal.
Caching is the first line of defense when dealing with a failure. Depending on your application’s reliance on up-to-date, bleeding-edge information, you should consider caching everything. It’s very easy for developers to reject caching because they always want the freshest information for their users. However, when the difference between a happy customer and a sad one is serving data that is a few minutes old, choose the slightly stale data.
As an example, imagine you have a fairly advanced web application. What can you cache?
- Full HTML pages with CloudFront
- Database records with ElastiCache
- Page Fragments with tools such as Varnish
- Remote API calls from your backend with ElastiCache
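As a rough illustration of caching as a defense against failure, the sketch below keeps expired entries around and serves them when the source is down. It is an in-memory stand-in, not ElastiCache or CloudFront code, and `StaleOkCache`, `ttl_seconds` and `fetch` are hypothetical names of my own.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOkCache:
    """In-memory cache that serves stale entries when the source fails."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (stored_at, value)

    def get(self, key: str, fetch: Callable[[], Any]) -> Any:
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # fresh hit: no call to the source
        try:
            value = fetch()
        except Exception:
            if entry is not None:
                return entry[1]  # source is down: few-minute-old data beats an error
            raise  # nothing cached yet, so the failure has to surface
        self._store[key] = (now, value)
        return value
```

The same idea applies at each layer in the list above: a stale page, fragment or record is usually a better answer than no answer.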
As applications get more complex, we rely on more external services than ever before. Whether it’s a third-party service provider or your microservices architecture at work, failures are common and often transient. A common pattern for dealing with transient failures on these types of requests is to implement retry logic. Using exponential backoff or a Fibonacci sequence, you can retry for some time before eventually throwing an exception. It’s important to fail fast and not trigger rate limiting on your source, so don’t continue indefinitely.
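Here is a minimal sketch of that retry pattern using exponential backoff. The function name, the default delays and the attempt limit are assumptions chosen for illustration, not values from the presentation.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call`, doubling the wait after each failure, then give up."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error instead of looping forever
            # Waits of 0.5s, 1s, 2s, 4s ... capped at max_delay
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

Capping both the number of attempts and the delay keeps the caller from waiting indefinitely, which is the "don’t continue indefinitely" part of the advice.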
In the case of denial-of-service attacks, self-imposed or otherwise, your primary defense is rate limiting based on a context. You can limit the number of requests to your application based on user data, source address or both. By imposing a limit on requests you can improve your performance during a failure by reducing the actual load and the load imposed by your retry logic. Also consider using exponential backoff or a Fibonacci increase to help mitigate particularly demanding services.
For example, during a peak in demand that cannot be met immediately, a reduction in load would give your application’s infrastructure time to respond (think auto scaling) before completely failing.
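One common way to implement context-based rate limiting is a token bucket per key, where the key is whatever context you choose (a user ID, a source address, or both combined). The sketch below is one possible approach under that assumption; the class and parameter names are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class RateLimiter:
    """Token bucket per context key (e.g. user ID or source address)."""

    def __init__(self, rate_per_second: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_second  # tokens refilled per second
        self.burst = burst           # maximum tokens a key can hold
        self.clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, key: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(key, (float(self.burst), now))
        # Refill based on elapsed time, never beyond the burst size
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[key] = (tokens - 1, now)
            return True
        self._buckets[key] = (tokens, now)
        return False  # caller should reject the request or serve a fallback
```

A request handler would call `allow(source_address)` first and only do real work when it returns `True`.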
When your application is running out of memory, threads or other resources, you can shorten recovery time by failing fast. You should return an error as soon as the condition is detected. Not only will your users be happier not waiting on your application to respond, you will also avoid cascading the delay into dependent services.
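As a sketch of failing fast on resource exhaustion, the gate below caps the number of in-flight requests and rejects extra work immediately instead of queueing it. `FailFastGate` and `Overloaded` are illustrative names, and a concurrency cap is only one of several signals you might use.

```python
import threading
from typing import Any, Callable


class Overloaded(Exception):
    """Raised immediately when no capacity is left."""


class FailFastGate:
    """Caps concurrent work and rejects the excess without waiting."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, handler: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # blocking=False returns right away instead of waiting for a free slot
        if not self._slots.acquire(blocking=False):
            raise Overloaded("at capacity, try again later")
        try:
            return handler(*args, **kwargs)
        finally:
            self._slots.release()
```

The caller can translate `Overloaded` into a quick error response or a fallback, so nobody upstream sits waiting on a request that was never going to succeed.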
Whether you’re rate limiting or simply cannot fail silently, you’ll need something to fall back to. A static fallback is a way to provide at least some response to your end users without leaving them with confusing error output or no response at all. It’s always better to return content that makes sense in the context of the user, and you’ve probably seen this before if you’re a frequent user of sites like Reddit or Twitter.
In the case of our example web application, you can configure Route 53 to fall back to HTML pages and assets served from Amazon S3 with very little headache. You could set this up today!
When all of your layers of protection have failed to preserve your service, it’s time to fail silently. Failing silently is when you rely on your logging, monitoring and other infrastructure to respond to your errors with the least impact to the end user. It’s better to return a 200 OK with no content and log your errors on the backend than to return a 500 Internal Server Error, a similar HTTP status code or, worse yet, a nasty stack trace or log dump.
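Put together, the last two layers might look something like the sketch below: try the real handler, fall back to static content if there is any, and otherwise return an empty 200 while the error goes to the logs. The `respond` function and its `(status, body)` return shape are assumptions for illustration, not a real framework API.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger("app")


def respond(handler: Callable[[], str],
            static_fallback: Optional[str] = None) -> Tuple[int, str]:
    """Return (status, body), hiding backend errors from the end user."""
    try:
        return 200, handler()
    except Exception:
        # Full stack trace goes to the logs, where monitoring can pick it up
        logger.exception("handler failed; degrading response")
        if static_fallback is not None:
            return 200, static_fallback  # static fallback: still useful content
        return 200, ""  # fail silently: empty body, no stack trace
```

This only works if the logged errors are actually watched, since the user-facing response no longer signals that anything went wrong.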
Failing Fast and You
There are two patterns that you can implement to improve your ability to fail fast: Circuit Breaking and Load Shedding. Generally, you want to leverage your monitoring tools such as CloudWatch and your logs to detect failure early and begin mitigating the impact as soon as possible. At Stelligent, we strongly recommend automation in your infrastructure, and these two patterns are automation at its finest.
Circuit breaking is purposefully degrading performance in response to failure events in your logging or monitoring system. You can utilize any of the degradation patterns mentioned above in the circuit. Finally, by implementing health checks into your service you can restore normal service as soon as possible.
Load shedding is a method of failing fast that occurs at the networking level. Like circuit breaking, you can rely on monitoring data to reroute traffic from your application to a static fallback that you have configured. For example, Route 53 has failover support built right in that would allow you to use this pattern right away.
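A minimal in-process sketch of a circuit breaker follows. It counts consecutive failures rather than reading from a monitoring system, and it treats the first call after a cool-down as the health check; the class name, thresholds and `fallback` parameter are all assumptions for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Degrades to a fallback after repeated failures, then probes for recovery."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: degrade without touching the dependency
            # Cool-down elapsed: fall through and let one trial call act as a health check
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and restore normal service
        self.failures = 0
        self.opened_at = None
        return result
```

The `fallback` here can be any of the degradation patterns above, such as a cached value or a static response.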