At FANDOM we believe our applications should be resilient: they should be able to recover from failures and, while a failure is ongoing, still be useful to users. Health checks are one of the ways to make applications self-healing.
The idea is that the health checks detect when an instance of the application reaches an invalid state. A supervisor monitors those health checks and attempts to fix any issues. Depending on your platform the action might be different.
Many frameworks (like Dropwizard and Spring) and platforms (like Kubernetes and Marathon) encourage you to implement health checks for your web applications.
When a health check fails, AWS will stop sending requests to that instance. That’s pretty straightforward: we suspect the application is not healthy, so it might not respond correctly. Decreasing the load on it might also help it recover. In Kubernetes, there’s a readiness probe that behaves like AWS. There’s also a liveness probe, which will terminate an unhealthy instance and start up another one. Marathon (Mesos) will also terminate unhealthy instances and start up new ones.
Do you know how a software engineer fixes a broken car?
The software engineer exits the car, closes the door, opens the door, enters the car and starts up the engine.
Restarting is a pretty well established way of dealing with faults. Although the code does not change, restarting can resolve many issues, especially those related to deadlocks, memory usage and the network. A restart will often jolt your application out of an invalid state that is unlikely to be reached again, or at least not until the next day at the office, when someone can take a look at it and fix it.
So what’s wrong with using health checks?
Well, as we’ve learned the hard way, implementing health checks which rely on the application’s dependencies may result in lower availability, because:
- The application can fulfill some of its functionality even if one of its dependencies is not available.
- Restarting the application degrades its functionality even further.
Let’s say you’ve got a simple monolithic application serving some static assets, using a single database and a message broker doing some background tasks.
Suppose the database becomes unavailable. Command requests (PUT/DELETE/POST) will fail and some queries will fail too, but generally your service should keep serving pages, maybe some of them with “Try again later” messages.
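As a rough illustration of this degraded-but-useful behaviour, here is a minimal Python sketch. The names `handle_request`, `STATIC_PAGES`, `DatabaseUnavailable` and `broken_db` are hypothetical and only stand in for whatever your framework and data layer provide.

```python
class DatabaseUnavailable(Exception):
    """Raised by the hypothetical data layer when the database cannot be reached."""


# Hypothetical static content that needs no database access.
STATIC_PAGES = {"/": "<h1>Welcome</h1>", "/about": "<h1>About</h1>"}


def handle_request(method, path, query_db):
    """Return (status, body) for a request.

    query_db is a callable that talks to the database and may raise
    DatabaseUnavailable.
    """
    # Static pages never touch the database, so they keep working.
    if method == "GET" and path in STATIC_PAGES:
        return 200, STATIC_PAGES[path]
    try:
        return 200, query_db(method, path)
    except DatabaseUnavailable:
        # Degrade instead of failing completely: ask the client to retry.
        return 503, "Try again later"


def broken_db(method, path):
    """Simulates a database outage."""
    raise DatabaseUnavailable()
```

With `broken_db` plugged in, `handle_request("GET", "/", broken_db)` still returns the static page, while a POST gets the “Try again later” response instead of taking the whole application down.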
If your application health checks are affected by the database availability then your supervisor will stop directing traffic to your application or restart it. That means:
- The application will stop serving any requests.
- Restarting will stop background tasks and may cancel the ones in progress.
- Depending on your implementation, your application may not start up if it cannot reach the database.
So what to do about it?
Remove health checks which rely on dependencies
Make sure (experimentally) that your application can recover when a dependency becomes unavailable, starts responding slowly or stops responding at all, and when you hammer it with a high number of requests or with requests that take a long time to handle.
Keep in mind the trade-off: if a bug in a dependency’s client was introduced by your latest upgrade and connection recovery does not work, nothing will restart the application for you, so this may wake you up if you happen to be on call at that time.
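One way to check recovery experimentally is to put a switchable fake in place of the dependency and verify that the client works again once the fake comes back. The sketch below is only an illustration: `FakeDependency` and `RecoveringClient` are hypothetical names, and a real test would exercise your actual client library against a real or containerized dependency.

```python
class FakeDependency:
    """Stand-in for a database or broker that a test can switch off."""

    def __init__(self):
        self.available = True

    def connect(self):
        if not self.available:
            raise ConnectionError("dependency is down")
        return self  # the "connection" is just the fake itself

    def query(self):
        if not self.available:
            raise ConnectionError("connection lost")
        return "data"


class RecoveringClient:
    """Drops a broken connection and reconnects on the next call."""

    def __init__(self, dependency):
        self._dependency = dependency
        self._connection = None

    def fetch(self):
        try:
            if self._connection is None:
                self._connection = self._dependency.connect()
            return self._connection.query()
        except ConnectionError:
            # Forget the dead connection so the next call reconnects.
            self._connection = None
            return None  # the caller can show a fallback message
```

A test then flips `available` to `False`, checks that `fetch()` returns `None` rather than raising, flips it back, and checks that `fetch()` returns data again without any restart.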
Implement your own restart policy
Immediately restarting an unhealthy instance may cause issues, so depending on your platform it may be prudent to change that. Policies that may allow you to use aggressive health checks without decreasing application availability include:
- Restarting only one unhealthy instance at a time and waiting until a new one starts up and becomes healthy.
- Starting up a new instance before killing the unhealthy ones, and killing only the youngest unhealthy ones over the limit of instances.
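The first of these policies, restarting one unhealthy instance at a time and waiting for its replacement, could be sketched roughly as follows. This is not how any particular platform implements it; `Instance` and `pick_instance_to_restart` are hypothetical names for a decision a supervisor would make on each pass.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool
    starting: bool = False  # True while a replacement is still booting


def pick_instance_to_restart(instances):
    """Return the one instance to restart now, or None to wait."""
    # A replacement is still starting up: wait until it becomes healthy.
    if any(instance.starting for instance in instances):
        return None
    unhealthy = [instance for instance in instances if not instance.healthy]
    # Restart at most one unhealthy instance per decision.
    return unhealthy[0] if unhealthy else None
```

Even if every instance reports unhealthy at once (for example because a shared dependency is down), this policy takes only one of them out of service at a time.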
If you’re using Kubernetes, a good place to start might be this page.
What else can you do?
A few good practices we’re following:
- Don’t set up health checks which rely on dependencies.
- Set up health checks verifying that your application is responding.
- If your application is an HTTP server and it can handle a basic HTTP request (in the same thread pool that handles other requests), this indicates that it has enough threads, CPU, memory and sockets to function properly.
- Set up separate health checks for your dependencies, like databases and message brokers.
- Add monitoring and alerting for both your applications and their dependencies.
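To make the split between the two kinds of checks concrete, here is a minimal sketch using Python's standard `http.server` module. The paths `/health` and `/health/dependencies`, the `check_database` stub and the `route` helper are assumptions made for illustration, not a convention of any framework mentioned above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Hypothetical dependency check; a real one would ping the database."""
    return False  # pretend the database is down for this sketch


DEPENDENCY_CHECKS = {"database": check_database}


def route(path, checks=DEPENDENCY_CHECKS):
    """Map a request path to (status, payload)."""
    if path == "/health":
        # Application check: no dependency is consulted, so a database
        # outage does not make the supervisor restart the application.
        return 200, {"status": "ok"}
    if path == "/health/dependencies":
        # Intended for monitoring and alerting, not for the supervisor.
        results = {name: check() for name, check in checks.items()}
        return (200 if all(results.values()) else 503), results
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, payload = route(self.path)
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port):
    """Build (but do not start) a server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling `make_server(8080).serve_forever()` would start it; the port is arbitrary. The supervisor would be pointed at `/health` only, while the dependency endpoint feeds dashboards and alerts.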
There are strategies and tools you can use when a dependency is unavailable, for example caching or fallbacks, but that’s a topic for another blog post.