A Simple Way to Develop Self-Healing Systems

Olaf Thielke
3 min read · Sep 27, 2019

Ouch! 193 HTTP 500 Internal Server Error responses. 193 requests from our customers had failed on the live site. I had blown our daily goal of zero HTTP 500 errors. It was my fault: I hadn’t cleared our data cache before the deployment.

It wasn’t the end of the world. I realised that I could turn this into something useful: find the benefit in the problem, a benefit not just for me but also for others. This article is my opportunity to make other developers’ lives a little better.

So, What Happened?

Cutting a long story short, I had changed the structure of a particular class of cached objects. I had overlooked that the production cache would still contain entries in the old structure. Whenever the application read one of those old-format entries, a deserialisation exception was thrown.
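
To make the failure mode concrete, here is a minimal sketch in Python. The `Customer` class, its fields, and the use of JSON as the cache format are all assumptions for illustration; the article does not say what the real class or serialiser looked like.

```python
import json
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical new structure: the old release stored a single `name` field.
    first_name: str
    last_name: str


def deserialise(raw: str) -> Customer:
    """Rebuild a Customer from its cached JSON representation."""
    return Customer(**json.loads(raw))


# An entry written to the cache by the previous release (old structure).
stale_entry = json.dumps({"name": "Ada Lovelace"})

try:
    deserialise(stale_entry)
except TypeError as exc:
    # The old field name does not match the new constructor, so reading fails.
    print(f"Deserialisation failed: {exc}")
```

Nothing is wrong with the new code or with the old data on their own. The failure only appears when the two meet, which is why it is easy to miss before a deployment.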

However, the real problem was that this exception bubbled up the call stack, and the application returned an HTTP 500 Internal Server Error to the client.

I believe this behaviour is suboptimal at best and plain wrong at worst.

Why?

Because an optional system component caused the application to fail. Our application did not fulfil a client request because mere caching failed.

Why is caching optional?

Caching exists only to improve performance. The overall system should still work when the cache is functionally absent, just as it does when we have a cache miss and need to get the data from our database.

I mean, if the database were unavailable then I could appreciate why the application should be unavailable. The database is an essential component of the system. But not the cache. The cache is non-essential.

When the cache fails and throws an exception, that failure should not be allowed to cause a failure in the application.
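
One way to picture this is a read path that treats any cache failure exactly like a cache miss. The sketch below is in Python; the function name `read_through` and the two callables it takes are hypothetical stand-ins for whatever cache client and database access the application really uses.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("cache")


def read_through(
    key: str,
    cache_get: Callable[[str], Optional[Any]],
    load_from_database: Callable[[str], Any],
) -> Any:
    """Return the value for `key`, using the cache only when it works."""
    try:
        cached = cache_get(key)
    except Exception:
        # A faulty cache is handled like an empty one: record it and move on.
        logger.exception("Cache read failed for %r; treating it as a miss", key)
        cached = None

    if cached is not None:
        return cached

    # Cache miss (or cache failure): the database is the source of truth.
    return load_from_database(key)
```

The request is slower when the cache misbehaves, but it still succeeds, which is all an optional component needs to guarantee.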

Caching is only one example. Diagnostic logging and tracing could also be considered optional in the same sense.

How can one determine whether a subsystem is optional?

Simple. By asking oneself the question:

“If this component were absent or faulty, should the system still work?”

If the answer is No, then you have a required, essential component.
If Yes, then you have an optional, non-essential component.

Self-Healing

I suggest that we, as developers, should strive to build systems with a degree of self-healing (or graceful degradation) in mind.

OK, I have an optional subsystem. How do I make my system self-healing?

Luckily, this is not difficult.

Handle All Exceptions

An application is destabilised by a non-essential subsystem’s exceptions when those exceptions bubble up the entire call stack.

The solution? Ensure no unhandled exceptions escape from the failing component. The component handles all of its own exceptions: it logs them or notifies someone. Whatever you choose, do something with them. Arguably, even swallowing the exceptions is better than letting them bring down the application.

And ultimately that was the real problem: Our Cache class did not handle all exceptions. Fixed!
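
As a rough illustration of handling exceptions at the component boundary, here is a Python sketch of a wrapper around a cache client. The names `CacheBackend` and `SafeCache` are hypothetical and this is not the actual Cache class from our codebase. Evicting the unreadable entry is an extra assumption of mine: it lets the next write store the value in the new format, so the cache repairs itself over time.

```python
import logging
from typing import Any, Optional, Protocol

logger = logging.getLogger("safe_cache")


class CacheBackend(Protocol):
    """Hypothetical interface of the underlying cache client."""

    def get(self, key: str) -> Optional[Any]: ...
    def set(self, key: str, value: Any) -> None: ...
    def delete(self, key: str) -> None: ...


class SafeCache:
    """Wraps a cache backend so that none of its exceptions reach the caller."""

    def __init__(self, backend: CacheBackend) -> None:
        self._backend = backend

    def get(self, key: str) -> Optional[Any]:
        try:
            return self._backend.get(key)
        except Exception:
            # Log the failure, drop the bad entry, and report a plain miss.
            logger.exception("Cache read failed for %r; reporting a miss", key)
            self._evict(key)
            return None

    def set(self, key: str, value: Any) -> None:
        try:
            self._backend.set(key, value)
        except Exception:
            # A failed write only costs performance, so carry on without it.
            logger.exception("Cache write failed for %r; continuing", key)

    def _evict(self, key: str) -> None:
        try:
            self._backend.delete(key)
        except Exception:
            # Even the clean-up must not escape the component boundary.
            logger.exception("Could not evict %r from the cache", key)
```

Callers see only two outcomes from `get`: a value or `None`. They already have to cope with `None` for ordinary misses, so a broken cache needs no special handling anywhere else in the application.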

Conclusion

Accomplished developers design systems that perform well even when problems occur in non-essential subsystems like logging or caching. One simple technique for ensuring that the application remains operational even when non-essential system modules are failing is to catch and handle all exceptions at the component boundary.
