
Lunar Way’s journey towards true autonomy (Part 2)


The first part of this blog post made the point that your organisational structure and collaboration model are tightly coupled with your underlying technical architecture, and that you can’t change your organisational structure and collaboration model without adapting the underlying technical architecture as well.

It was the story of our tech journey during Lunar Way’s early days, from late 2015 to late 2016, and a lot has happened since.

The transitioning phase

When we initially kicked off our feature squad setup, in late 2016, we still had a monolithic architecture.

When a minor bug needed to be fixed we had to stabilise and secure the quality of the whole system across all squads, because we needed to deploy the whole shebang at once.

The risk of introducing new bugs when dealing with so many moving parts at once was simply too high. Because each release was so costly, we had to bundle changes together and release at a slower pace. As you can imagine, this was not a durable foundation for squads that wanted to be in control of their own rapid development cycles.

In order to work the way we wanted we had to change the underlying platform and get the foundation right before moving forward.

Getting the foundation right

The overall long term vision was simple — split up our monolithic system into separate services small enough to allow each service to serve a single purpose and be able to live a decoupled and independent life.

_I like to think of the transition as moving from building and living inside the skyscraper towards building and managing the city, where each property serves its own purpose and still has the freedom to evolve over time at its own pace._

The first step we took towards this goal was to break the monolith into just a few services, while still gaining the benefit of independent deployability within each squad.

Thereby the squad would be in control of their own development and release cycle, which was a crucial step towards true autonomy.

When you break up your architecture into microservices, each service becomes simple and easy to comprehend. The architectural complexity does not disappear, though. It moves into the space around your services, because communication, logging, debugging, tracing and data integrity all become more complex.

The same applies to city planning. In order for each property to be independent it needs electricity, district heating, water supply and sewer systems together with logistical infrastructure to interact with its surroundings, like roads, Internet, telephone lines and so on.

In order to manage the complexity of running a microservice setup we introduced Kubernetes as the overall service orchestration mechanism that gathers all services into a coherent, manageable system. As an overall umbrella for technology decisions, the Cloud Native Computing Foundation (CNCF) has been a great help. From my point of view they have done a good job of bringing all the healthy thoughts from “Continuous Delivery” into a cloud context.

This is also why we try to make an effort to support and facilitate the cloud native community in Denmark.

(If you haven’t read it yet, our DevOps and Infrastructure Engineer, Kasper Nissen, is one of our masterminds behind Kubernetes and Cloud Native, so please take a look at his earlier post on the subject.)

Independent deployability breakthrough

Independent Service Deployability with Kubernetes as the service orchestration mechanism became an important breakthrough for us. We reached this milestone in Q1 2017 and from there we started to gain some of the benefits we were hoping for.

The immediate benefit of independent deployability was a huge relief. Release coordination was reduced significantly, and each squad could now release small improvements on its own.

Autonomous squads

Instead of having seven virtual feature squads we chose to narrow our setup down to four squads — each responsible for building and running different parts of the system.

Squad-Goals

Responsible for our Goals universe.

Squad-Feed

Responsible for our users’ first impressions, like the onboarding flow and the front page concept called Feed.

Squad-Finance

Responsible for Spend, Card, Transfers & Payments.

Squad-Core

Responsible for our overall architecture and microservice initiatives.

We tried to delegate feature and service responsibility so each squad had its own purpose and set of services within a similar domain. Each developer was now fully dedicated to a single squad, leaving each squad more focused and aligned than before.

Instead of having the same process dictated across all squads, each squad chose how to work, which tools to use and how to ensure quality. Each squad also had a deep impact on the product roadmap and on finding the best technical solutions.

From backend bottlenecks to full stack flow

Until this point we had encountered many challenges when balancing the workload between native app developers and backend developers, resulting in bottlenecks where app development would become stuck occasionally.

After the initial microservice split, combined with the ability to launch a local development environment, the app developers had the courage to help out on the backend, which largely eliminated these development bottlenecks.

You build it, you run it

As a squad, you now had the full responsibility for a certain set of services. To make it possible to live up to that responsibility, we did everything we could to expose all relevant metrics and provide the tooling needed to make build, deployment, hosting, monitoring and logging as easy as possible.

An immediate effect was that responsibility for our running software as a whole was very clearly delegated. And when a small team really feels the full responsibility of building and running a service, very little slips through the cracks. The feeling of ownership was firmly planted in their psyche, together with all kinds of healthy worries that come with ownership.

It suddenly became clear to everyone that coding the actual feature is only a minor part of building and running software, whether you like it or not 🙃

Service ownership

All of these new insights led us to consider certain areas more closely. We started to talk about “Service Ownership”: what it requires to fully take responsibility for running software, and how to maintain quality over time.

One of the concrete initiatives we implemented was to define how we wanted 2nd level support to work. A traditional setup would be to introduce an organisational layer (2nd level support) between first level support and tech. The drawback is often that it introduces too many handovers and intermediaries, which in the end hurts response time for the end user.

Instead, we gave our first line supporters the responsibility of prioritising each technical issue as either standard class (handled within twenty-four hours) or expedite (handled within an hour). The first line supporters then hand over the conversation directly to the responsible squad, which is now able to talk directly to the customer.
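To make the rule concrete, here is a minimal Python sketch of that handover. It is an illustration only, not our actual support tooling: the names `ServiceClass`, `Issue`, `SQUAD_BY_DOMAIN` and `hand_over`, the domain keys, and the fallback to Squad-Core are all assumptions made up for the example. Only the two service classes, their time limits and the squad names come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class ServiceClass(Enum):
    """The two priorities a first line supporter can assign."""
    STANDARD = timedelta(hours=24)  # handled within twenty-four hours
    EXPEDITE = timedelta(hours=1)   # handled within an hour


@dataclass(frozen=True)
class Issue:
    summary: str
    domain: str                  # hypothetical product-area tag set by support
    service_class: ServiceClass
    reported_at: datetime


# Hypothetical mapping from product area to the owning squad.
SQUAD_BY_DOMAIN = {
    "goals": "Squad-Goals",
    "feed": "Squad-Feed",
    "spend": "Squad-Finance",
    "card": "Squad-Finance",
    "transfers": "Squad-Finance",
    "payments": "Squad-Finance",
}


def hand_over(issue: Issue) -> tuple[str, datetime]:
    """Return the responsible squad and the time the issue is due.

    Assumption: issues in an unknown domain fall back to Squad-Core.
    """
    squad = SQUAD_BY_DOMAIN.get(issue.domain, "Squad-Core")
    due = issue.reported_at + issue.service_class.value
    return squad, due
```

The point of the sketch is that there is no intermediate queue: the supporter picks the class, and the issue lands directly with the owning squad together with its deadline.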

During this period other positive procedures were established along with a general increase in the focus on availability and overall quality of our running software.

Long term speed

The overall platform foundations were now in place. The sense of autonomy and the responsibility that follows was starting to blossom. The visibility of our systems gave us the awareness and insights that good decisions and initiatives are based upon.

Long term product development speed was still the overall aim of our microservice strategy and squad setup. We knew from day one we had to go faster than all our competitors in order to succeed. At this point (mid 2017) our biggest steps towards autonomy and speed were still in front of us, along with new and interesting challenges — but more about that in a later post…
