Tuesday, February 23, 2021

DevOps and the Definition of Done

An argument for continuous deployment


    It's hard not to be drawn to the ideal of continuous deployment. The benefits are numerous: business value is delivered more quickly; the feedback loop is tightened, enabling quicker reactions to successes and failures; velocity increases because the time and effort required to perform a release are eliminated; and the quality and outcomes of releases improve as human error is removed from the equation. The path to continuous deployment, though long, seems fairly straightforward. First, define and document the deployment pipeline. Next, do the incremental work over time to streamline and, in the best-case scenario, completely automate these steps. As chunks of the release process are subsumed by automation, the level of effort to perform a release goes down, and more frequent releases become feasible. The process slowly but surely converges on complete automation, until the day finally arrives when everything is automated, the level of effort hits zero, and continuous deployment is unlocked.

    I’ve not experienced this ideal. Continuous deployment to production is still a scary proposition for most. Taking steps to automate some of the release process is more palatable, but getting the work prioritized above the urgent new features and defects of the day can be challenging.

    These thoughts and feelings on continuous deployment had been simmering for some time when I got my first proper introduction to DevOps through The Phoenix Project, a book by Gene Kim, Kevin Behr, and George Spafford. I’d of course been aware of DevOps for much longer than that, but as I was soon to find out, the common narrative of DevOps, or at least the one I was exposed to, turns out to be an oversimplification of what DevOps actually stands for. It’s not merely a spirited (and somewhat unsubstantial) call for developers to be more like ops people and ops people to be more like developers. That is a part of it, but there’s much more to the story of DevOps.

    As I learned from The Phoenix Project and later The DevOps Handbook, DevOps encompasses a set of philosophies and practices aimed at delivering the most business value as quickly, efficiently, and securely as possible. It’s an acknowledgment that development is not separate from operations, and engineering is not separate from business; all are on the same team working towards a common goal of growing the business. The movement was largely inspired by lessons from the Toyota Production System, which revolutionized how to assemble cars faster, more efficiently, and with fewer defects. To optimize this assembly, practices were put into place to eliminate bottlenecks, minimize friction between stations of work, and make local optimizations wherever possible. In essence, DevOps recasts software development as an assembly line of sorts, and advocates applying manufacturing best practices to the process of delivering business value through software. Removing the barrier between dev and ops is only one of the many practices that DevOps advocates.

    One of those other practices is working with small batch sizes, with the ideal batch size being one. By working with smaller batches, work can flow through the assembly line more quickly, reducing the time it takes to deliver value. It also allows us to get faster feedback about what's provided to customers, and to respond more quickly to shifting consumer demands. The logical means to achieve small and frequent batch sizes, in the world of software, is continuous deployment.

    So why is continuous deployment still viewed as non-essential in many companies? The adage “we optimize what we measure” applies. One of the fundamental constructs in software development is velocity, which measures the amount of work a team completes in a given period of time, commonly a sprint of a couple of weeks. Work counts towards velocity once it meets the definition of done, which is commonly held to be “merged into master and set to be deployed in the next release”.

    But why is that the definition of done? If work has yet to provide any business value, then is it really done? After all, getting code merged into master is only half the battle. The code has yet to embark on its long and tumultuous journey through the deployment pipeline, which might look something like this, at a minimum: an artifact is built and deployed to some kind of staging environment; a battery of QA tests is run against the release candidate (which might uncover bugs and require rework, exposing that a card wasn’t “done” after all); and the release is deployed to production, followed by a battery of smoke tests and a release notification. Higher-functioning teams may even include a step of measuring the functionality that was deployed to ensure it’s meeting its business objectives.
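
    The pipeline described above can be sketched as an ordered list of stages, where a card is done only when every stage has succeeded. This is a minimal illustration, not a real deployment tool: the stage names follow the paragraph above, while run_pipeline, succeed, and fail are hypothetical placeholders standing in for actual build, test, and deploy steps.

```python
from typing import Callable, List, Tuple

# Each stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (done, completed_stage_names). The work counts as done
    only when every stage has succeeded.
    """
    completed = []
    for name, step in stages:
        if not step():
            return False, completed
        completed.append(name)
    return True, completed


# Hypothetical placeholder steps; a real pipeline would invoke
# build, test, and deployment tooling here.
def succeed() -> bool:
    return True


def fail() -> bool:
    return False


stages = [
    ("build artifact", succeed),
    ("deploy to staging", succeed),
    ("QA tests", fail),  # a bug found here sends the card back for rework
    ("deploy to production", succeed),
    ("smoke tests", succeed),
    ("release notification", succeed),
]

done, completed = run_pipeline(stages)
print(done)       # False
print(completed)  # ['build artifact', 'deploy to staging']
```

    In this sketch the code was merged long before the QA stage failed, yet the card never reaches done, which is the gap the “merged into master” definition hides.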

    Toyota doesn’t count a car as done when it’s halfway through the assembly line, nor should software be counted as done when it’s halfway to providing value in production.

    If the definition of done changes from “merged into master” to “providing business value in production”, what changes? We optimize what we measure. If cards routinely stagnate for several days at a time in the “ready for deployment” lane of the kanban board, holding back the team’s velocity, that lane will begin to be treated as the bottleneck it truly is. If there’s a limit to the amount of work present in any given kanban column, as is often the case to ensure work progresses at a healthy rate, more frequent deployments will naturally be required to flush out the buildup of cards.
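
    To make the effect on the metric concrete, here is a small sketch that computes one sprint’s velocity under each definition of done. Everything in it is assumed for illustration: the Card fields, the velocity helper, the card titles, the point values, and the dates are all made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Optional


@dataclass
class Card:
    """A kanban card. All field names here are hypothetical."""
    title: str
    points: int
    merged_on: Optional[date] = None         # merged into master
    in_production_on: Optional[date] = None  # deployed and serving users


def velocity(
    cards: List[Card],
    sprint_start: date,
    sprint_end: date,
    done_date: Callable[[Card], Optional[date]],
) -> int:
    """Sum the points of cards whose 'done' date falls inside the sprint.

    done_date picks which date counts as done for a card.
    """
    total = 0
    for card in cards:
        finished = done_date(card)
        if finished is not None and sprint_start <= finished <= sprint_end:
            total += card.points
    return total


# Made-up data: three cards merged during the sprint, one deployed.
cards = [
    Card("Search filters", 5, merged_on=date(2021, 2, 9),
         in_production_on=date(2021, 2, 11)),
    Card("Export to CSV", 3, merged_on=date(2021, 2, 12)),         # awaiting deployment
    Card("New onboarding flow", 8, merged_on=date(2021, 2, 16)),   # awaiting deployment
]

sprint = (date(2021, 2, 8), date(2021, 2, 19))

merged_velocity = velocity(cards, *sprint, done_date=lambda c: c.merged_on)
production_velocity = velocity(cards, *sprint, done_date=lambda c: c.in_production_on)

print(merged_velocity)      # 16
print(production_velocity)  # 5
```

    Under the first definition the sprint looks healthy; under the second, the cards sitting in “ready for deployment” show up directly as missing velocity.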

    As it becomes more imperative to deploy functionality, all the deployment work that’s taken for granted is viewed with a more critical eye. Whereas before there was no impetus to improve the deployment pipeline, it’s now a pivotal stream of work required to move cards to done. Automation becomes a first-class citizen that’s leveraged to accelerate work through the pipeline. Infrastructure-as-code becomes a useful tool to enable safer and more predictable deployments. Telemetry is discussed early in the card lifecycle as a tool to measure delivered business value.

    By making a simple yet intuitive change to the definition of done, so much value to the business can be unlocked. As work is measured by the value it provides to the business, rather than being given completion credit once it’s merged into master, engineering’s goals and objectives become better synchronized with those of the business. The change will undoubtedly put stress on a system designed with large and infrequent batches in mind; all the manual deployment work that’s typically allotted a cushy days- or weeks-long window must now turn around much faster to accommodate frequent batches. Production-related problems and tasks that are usually put off until the deployment process begins (how to migrate data, what schema changes need to be made, whether the change is backwards-compatible) must be reckoned with earlier.

    But in the same way the line should be blurred, and even eliminated, between dev and ops, so should the line between development and the work needed to quickly and safely deploy to production. Because software in and of itself has little value; not until it’s working for the business does it prove its worth.