For the past four years I’ve been supporting a SaaS product that we built from the ground up. Despite some bad initial decisions and staff churn, the service has managed to survive and bring in some revenue for the owners. Today I was paged (received a message) about a critical feature that was still broken in production, related to problems that were identified yesterday. Users could not get their jobs done.
Rather than wait until Monday and let the stress build up, I decided to deploy three bugfixes to production. I’m writing this early on Saturday morning, having just finished deploying and testing in production.
It sucks to be on-call and be exposed to angry customers. I’ve made a lot of changes over the past few years to make on-call suck less. Running a service these days involves more moving parts than it did back when deploying meant FTPing a tarball and bouncing the web server.
I learned the hard way that the software we’ve built (and the dependencies we use) could end up harming us in ways we could not anticipate. I would rather be ready to deal with a problem than try to predict every possible error case. This led to what I would call a mullet model of production: the service has to run smoothly as far as users can tell, and it has to be easy to operate while it is running. Operability is not a new idea, but having worked as a sysadmin, I want the services that I am responsible for to be relatively easy to troubleshoot.
Deploying on a Friday is taboo in some software teams. From what I’ve seen, the taboo is usually a symptom of a bigger problem: for example, not having good tooling for deploying code to production, or a team issue where new developers are left to deal with the consequences of decisions made by former colleagues. The list of problems could go on.
To new developers reading this and frowning about on-call: it’s not all bad, and by being on-call you are preventing a bigger catastrophe from happening. Good luck out there!