Sievo has a history of shedding its skin. Years ago, the product was a desktop application behind our customer’s firewall. It was originally written in Visual Basic by our founders. Now it’s a SaaS web application. The current technology stack is mostly C# with a touch of F# on the backend and React in the front.
I cannot tell you that story. I was not there to see it. I joined only eleven months ago.
Here is my war story
In May 2016, I joined a company with an ache. Each customer was a separate entity in our systems. With a bit of simplification, this was the structure: a database, the service itself and QlikView reporting. All three had to be on the same version, or else.
Before, when a new version of the software was created, we would contact our customers to schedule an upgrade. First, we ran a test upgrade to test the process and validate the new version. After a few days of waiting for our busy people to give us a thumbs up, we took the service down for maintenance and performed the real upgrade. In May, the average version running in a customer environment was almost half a year old. Even this was an achievement, and it took great work from a few very diligent people: six months prior, the delay had been close to a year.
In practice, our customers were getting neither the latest features nor the fixes to defects in a timely way. In addition, we were supporting more than twenty versions of our product.
In June, we started a BuSu project. BuSu projects are key initiatives to “Build Success” that have a sponsor from our top management. This time the sponsor was our CEO, Matti. The goal was to make version upgrades automatic at the press of a button for ten customers during 2016. The team consisted of a customer relations manager, software engineers, a delivery engineer and myself. The project was led by our chief architect.
The complicated button
To start, the process was automated in a rudimentary fashion. An upgrade meant editing an Excel description of the environment, running a script on a specific host, specifying the intent with a string of parameters and waiting. Naturally, this was after checking that all the parameters were correct.
Our customer relations manager started looking for customers who would be happy to get the latest versions as soon as possible. We, the techies, started simplifying the tools in small increments. I started on a spy mission to see what the people performing the upgrade validations were doing.
Instead of letting our diligent delivery engineer execute the upgrades, we told our chief architect to pretend to be the button that runs the upgrades. One could call it a concierge prototype.
Every time a little better
With every run of the mill, we made small adjustments. In mid-June I knew that validations were not exposing any issues at all. There were 26 steps in the process to upgrade a single environment. In August, we had dropped the execution time by a third by skipping unnecessary calculations. Also, an ExterminateExcels script was created to migrate the old description files to a new format. I smuggled in some ASCII art of a Dalek into the script.
In September, all the usual causes for failures had been removed. There were 20 steps in the process to upgrade the ten customers who opted in. The execution would not halt to prompt the user. The test upgrade and actual upgrade were run back to back.
In December, there were just 5 steps in the process. We had 22 customers opted in to the new process.
The entire process to schedule upgrades for 75% of our internal and customer environments looks like this:
1. Log in
2. Set the dates in `batches.json`
3. Execute `JsonScheduler batches.json`
These three simple steps will schedule most of our systems at preset times. The status will be reported by a bot on our Slack chat.
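To make the idea concrete, here is a minimal sketch of how a batch file could be turned into an upgrade plan. The real `batches.json` schema and the internals of `JsonScheduler` are not shown in this post, so the field names (`batches`, `name`, `start`, `environments`), the example dates and the `plan_upgrades` function are all assumptions made up for illustration.

```python
import json
from datetime import datetime

# Hypothetical file contents: the real batches.json schema is not shown in this post.
EXAMPLE_BATCHES = """
{
  "batches": [
    {"name": "internal", "start": "2017-04-03T18:00", "environments": ["dev", "staging"]},
    {"name": "customers-a", "start": "2017-04-04T18:00", "environments": ["customer-1"]}
  ]
}
"""


def plan_upgrades(raw_json):
    """Return (start, environment) pairs sorted by start time."""
    data = json.loads(raw_json)
    plan = []
    for batch in data["batches"]:
        # Every environment in a batch shares the batch's start time.
        start = datetime.fromisoformat(batch["start"])
        for env in batch["environments"]:
            plan.append((start, env))
    return sorted(plan)


if __name__ == "__main__":
    for start, env in plan_upgrades(EXAMPLE_BATCHES):
        print(f"{start:%Y-%m-%d %H:%M}  upgrade {env}")
```

Grouping environments into batches keeps the only human input down to one date per batch, which is what makes step 2 so small.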
This is not yet perfect. In a few weeks we will remove step 2 by letting the script infer the next available times for upgrades.
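One way such inference could work is sketched below. This is not our actual implementation: the rules (one batch per weekday evening, starting at 18:00) and the `next_free_slots` function are assumptions chosen only to show the shape of the idea.

```python
from datetime import date, datetime, time, timedelta


def next_free_slots(booked, count, today, hour=18):
    """Return `count` weekday evening slots after `today` that are not booked.

    Assumed rules (hypothetical): one batch per evening, weekdays only.
    """
    taken = {b.date() for b in booked}
    slots = []
    day = today + timedelta(days=1)
    while len(slots) < count:
        # weekday() is 0-4 for Monday-Friday, 5-6 for the weekend.
        if day.weekday() < 5 and day not in taken:
            slots.append(datetime.combine(day, time(hour)))
        day += timedelta(days=1)
    return slots


if __name__ == "__main__":
    already_booked = [datetime(2017, 4, 7, 18, 0)]
    for slot in next_free_slots(already_booked, 2, date(2017, 4, 6)):
        print(slot.isoformat())
```

With a rule like this, the dates no longer need to be typed in by hand; the scheduler only needs to know which evenings are already taken.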
What’s the moral of the story? Through decisive action and working together, we have freed the time of our delivery team almost entirely. Instead of starting our mornings with an upgrade, we have increased our coffee intake, mood and quality of humor.
For our customers, there is no stress in finding days when the service can be down. New versions get installed within a week. Our largest customers get the most tangible benefit in service availability: the old method took their service offline for 17 hours, while the last upgrade took it offline for two.
We are very proud of this, but we’re not nearly done.
I am pressing delete on the ExterminateExcels script, stage complete.
We even have an IoT button waiting…