Category: Software Ownership

  • The Battle of Helm’s Deep

    I’m currently migrating a production Kubernetes cluster from Helm v2 to v3.

    We’ve been using Helm to install our services for almost four years, but Helm v2 has been deprecated since last year and everyone seems to have moved on to Helm v3.

    Helm v3 no longer depends on a server-side daemon called Tiller, which coordinates the installation of Kubernetes resources from a chart’s template.

    This is not a problem unique to me.

    Props to the Helm team for creating a helpful migration video. It has eased a lot of my worry about breaking not just one, but multiple services running in our production cluster. I was able to go through the tutorial and migrate one Redis release, and I can still use Helm v2 for the deployments that haven’t been migrated yet, which is highly appreciated.
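
    For reference (and so future me doesn’t have to rewatch the video), here is a rough sketch of the flow as I understand it, wrapped in Python so it can be rerun per release. It assumes the official helm-2to3 plugin and a separate helm3 binary sitting next to the existing helm (v2) one; the release name below is just an example, not our actual setup.

    ```python
    import subprocess

    def run(*args):
        """Echo a helm command, then run it and fail loudly if it errors."""
        print("+", " ".join(args))
        subprocess.run(args, check=True)

    # One-time setup: install the 2to3 plugin and move the Helm v2 config/data over.
    run("helm3", "plugin", "install", "https://github.com/helm/helm-2to3")
    run("helm3", "2to3", "move", "config")

    # Convert releases one at a time, starting with a low-risk one (Redis for me).
    for release in ["redis"]:
        run("helm3", "2to3", "convert", release, "--dry-run")  # preview first
        run("helm3", "2to3", "convert", release)

    # Tiller and the old v2 release data are only removed by the cleanup step,
    # so Helm v2 keeps working for anything that hasn't been converted yet.
    # run("helm3", "2to3", "cleanup")
    ```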

  • Build for operability

    In a previous post, I mentioned something about a mullet model of production: operate a service with reliability and simplicity. I intend to expand on the terms I’ve used there.

    In a software-as-a-service (SaaS) business, production refers to the ensemble of software used to deliver a service (e.g. an eCommerce site). If you are a web developer, this includes the code you’ve written in some language, the database where your data is stored, and the other parts needed to run your service (e.g. hosting infrastructure, instrumentation, etc.).

    Consider the Primary Function of your service. Any feature to be built must support that Primary Function. The job of an eCommerce SaaS is to facilitate orders: customers must be able to visit the site, add products to their cart, and check out so that payment can be collected. It is not enough to write the features: there has to be supporting software for these features to do their job well.

    Operability refers to the degree to which a service can be supported as it performs its Primary Function. Operability varies a lot depending on the type of service. A few of my guide questions are: (1) Can you understand what the code does at 2am while it is running in production? (2) How long would it take to recall how a feature works after not making any changes for several months? (3) How difficult would it be to extend an existing feature to support a new requirement? These questions place demands on the software used to run the service and on its supporting tools. Having a simple, understandable codebase with sufficient test coverage helps a lot. So does a good suite of supporting tools (e.g. alert tracking, instrumentation, etc.).
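
    To make the “supporting tools” point concrete, here is a tiny, entirely hypothetical sketch (the helpers are stand-ins, not our actual code): structured logs and timing around the Primary Function give whoever is on-call at 2am something to read, and the failure line is what an alert would hang off.

    ```python
    import logging
    import time
    import uuid

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
    log = logging.getLogger("orders")

    def charge(payment_token):
        pass  # stand-in for the real payment call

    def save_order(order_id, cart_id):
        pass  # stand-in for the real database write

    def place_order(cart_id, payment_token):
        """Charge the customer and record the order, logging what happened."""
        order_id = str(uuid.uuid4())
        started = time.monotonic()
        log.info("order.start order_id=%s cart_id=%s", order_id, cart_id)
        try:
            charge(payment_token)
            save_order(order_id, cart_id)
        except Exception:
            # A failed order should be loud: this is the line an alert would track.
            log.exception("order.failed order_id=%s cart_id=%s", order_id, cart_id)
            raise
        log.info("order.done order_id=%s elapsed_ms=%d",
                 order_id, (time.monotonic() - started) * 1000)
        return order_id

    place_order(cart_id="cart-42", payment_token="tok-example")
    ```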

    Tools and techniques are not enough. Without a team skilled both in building the service and in operating what they’ve built, operability is very difficult to achieve. The team ties everything together. There will be some specialist roles within a team, but everyone on the team should have a good mental model of how production works.

    Recommended resources

    1. Above the Line, Below the Line. Building reliable services requires a working understanding of the continuously shifting dependencies.
    2. The Soviet Union’s Philosophy of Weapons Design (Chapter 87 of Digest). Build tools with simplicity and reliability in mind (e.g. the AK-47).
    3. Charity Majors’ Twitter account.
  • Developers on-call and deploying on a Friday

    I’ve been supporting a SaaS product that we built from the ground up for the past four years. This service, despite some bad initial decisions and staff churn, has managed to survive and bring in some revenue for the owners. Today I was paged (received a message) about a critical feature that is still broken in production, related to problems identified yesterday. Users could not get their jobs done.

    Rather than wait until Monday and let the stress build up, I decided to deploy three bugfixes to production. I’m writing this early on Saturday morning, having just finished deploying and testing in production.

    It sucks to be on-call and be exposed to angry customers. I’ve made a lot of changes to make on-call suck less over the past few years. Running a service these days involves far more moving parts than it did back when we could FTP a tarball and bounce the web server.

    I learned the hard way that the software we’ve built (and the dependencies we rely on) can end up harming us in ways we could not anticipate. I would rather be ready to deal with a problem than try to predict every possible error case. This led to what I refer to as the mullet model of production: the service has to run smoothly as users perceive it and be easy to operate while it’s running. Operability is not a new idea, but having worked as a sysadmin, I want the services I am responsible for to be relatively easy to troubleshoot.

    Deploying on a Friday is taboo in some software teams. What I’ve seen is that the taboo is usually a symptom of a bigger problem: for example, not having good tooling for deploying code to production, or a team issue where new developers are left to deal with the consequences of decisions made by former colleagues. The list of problems could go on.

    To new developers reading this and frowning at the thought of being on-call: not everything is bad, and by being on-call you are preventing a bigger catastrophe from happening. Good luck out there!