Chaos Community Day

Last Friday (September 13th) Kasper Nissen and I visited Amazon in London for a Chaos Community Day. The day was hosted by Russ Miles from ChaosIQ, and the agenda included speakers from several different industries: banking, streaming services, and digital and print publishing, to name just a few. Our expectations were quite high, and it’s safe to say we weren’t disappointed. The talks were of high quality and gave us very different perspectives on the routes and journeys people in the community are taking.

The day kicked off with an introduction by Russ, where he talked about the purpose of a Community Day and what kind of community he would like the Chaos Community to be:

  • Inclusive and value-centered,
  • Fueled by creativity and innovation,
  • With the ability to honestly share experiences and battle-lessons.

This came through in the talks, but also in the discussions in between. The atmosphere was friendly and inclusive, and there was respect for the many different approaches to Chaos Engineering.

What is Chaos Engineering?

The question is not whether failures will happen; sooner or later they will. In Production! It’s better to plan “Chaos” ahead of time, find windows where the impact is small, and make sure the right people are observing and ready to learn and remediate.

Charity Majors said it well in her talk: “Chaos” is a fancy marketing term for running tests later in the software development lifecycle. Indeed, exploring and becoming aware of the known and unknown is something exploratory testers have practiced for decades. The new thing is that we shift right and experiment directly in Production to build confidence in a system’s capability to withstand turbulent and unexpected conditions. Some unexpected conditions could be:

  • CPU spikes,
  • Low memory,
  • Malfunctioning hard drives,
  • Network latency,
  • Service degradations or straight-out outages.

Running such experiments in Production may sound like a terrible idea. In practice, it’s a very good way to build confidence. Test and Staging environments are often not provisioned and configured exactly like Production, and the traffic volumes and patterns are different. You can work on simulating Production in another environment, but this can quickly become a sink for engineering time.
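To make the idea of an experiment a bit more concrete, here is a minimal sketch in Python of what injecting latency and outages into a dependency could look like in a test environment. It is not taken from any of the talks or from a real tool; the names `with_chaos`, `call_with_retry`, `run_experiment`, and the fake balance lookup are all hypothetical and only serve as illustration. The "steady state" we check is simply that most calls still return the real answer when the dependency fails 30% of the time.

```python
import random
import time


class ChaosError(Exception):
    """Raised when the wrapper injects a simulated outage (hypothetical helper)."""


def with_chaos(func, failure_rate=0.0, max_latency_s=0.0, rng=None):
    """Wrap func so that calls may be delayed or fail, like a flaky dependency."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        # Simulate network latency: sleep for a random time up to max_latency_s.
        if max_latency_s > 0:
            time.sleep(rng.uniform(0, max_latency_s))
        # Simulate an outage with probability failure_rate.
        if rng.random() < failure_rate:
            raise ChaosError("injected failure")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, attempts=3, fallback=None):
    """Call func up to `attempts` times; return fallback if every attempt fails."""
    for _ in range(attempts):
        try:
            return func()
        except ChaosError:
            continue
    return fallback


def run_experiment(calls=200, seed=42):
    """Return the fraction of calls that still get the real answer under chaos."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_lookup = with_chaos(
        lambda: "balance: 100",  # stand-in for a real downstream service
        failure_rate=0.3,
        max_latency_s=0.001,
        rng=rng,
    )
    ok = sum(
        call_with_retry(flaky_lookup, fallback="unavailable") == "balance: 100"
        for _ in range(calls)
    )
    return ok / calls
```

Calling `run_experiment()` gives a success ratio that can be compared against whatever threshold the team considers its steady state. The retry-and-fallback logic here is just one possible resilience mechanism; the point is that the hypothesis is stated up front and then checked while the failure is being injected.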

Chaos Engineering Today

Chaos Engineering is by no means a new concept — it’s just new in the software industry where traditional approaches to managing failure in Production have been mainly reactive.

Some examples are:

  • Incident Response,
  • Disaster Recovery,
  • Logging, monitoring, and alerting.

These are all reactive and provide ways to remediate unexpected conditions after they have already happened.

Chaos Engineering is inherently proactive: we learn about system weaknesses before they manifest, which gives us a chance to remediate before disaster strikes.

One of the early practitioners of Chaos Engineering was Netflix, who created Chaos Monkey to weed out availability issues proactively. I heard about Chaos Monkey in 2014 when attending the talk “I Don’t Test Often … But When I Do, I Test in Production” at GTAC (Google’s Test Automation Conference). This was the first time I encountered a real-world application of Chaos Engineering at scale.

Times are changing, and today zero-downtime-deploys, self-healing infrastructure, and resilient systems are rapidly becoming the norm. Today’s customers expect perfect software and 24/7/365 availability; this applies especially to businesses (like ours) that have traditionally relied on physical infrastructure and humans but are becoming completely digital. As one of the speakers (coming from a well-known streaming service) put it:

Readiness to handle failure (or the unknown) is feature zero.

It’s impressive how Chaos Engineering has evolved; today a wealth of companies in very different industries practice it every day. Their journeys are quite different, but there does seem to be some common ground:

  • Building observability into systems; making sure it’s possible to reliably identify and track down anomalies (without observability Chaos Engineering is just, well, Chaos…)
  • Beginning by experimenting in CI or Test Environments to gain confidence. This is perfectly fine. Chaos Engineering doesn’t have to start in Production. There are no rights or wrongs — just focus on what brings value to your business.
  • Organizing “Game Days”: regular, all-day events where engineers purposefully create major failures. This is typically communicated to service teams in advance, and they can then opt out if the timing is bad or they know for certain they are not ready.

The Game Day concept was coined by Jesse Robbins, an ex-Amazon engineer who drew on his experience and training as a volunteer firefighter, as well as on other industries and studies. Game Days have proved very successful, have been adopted by many companies, and are today considered a best practice.

Chaos Engineering at Lunar Way

So is Chaos Engineering something we want to do at Lunar Way?

Absolutely! Like most companies on a similar journey, we have started experimenting in our test environments. This will give us the confidence to proceed further and maybe one day run experiments in Production. We’ll be sharing our experiences along the way.

Thanks to Russ for inviting us, Amazon for hosting, and the community for being awesome. We had a blast and will be very happy to join again next year.

What is Lunar Way?

Lunar Way is a fintech company motivated by rethinking the experience of banking, and the way people perceive money and spending in general. That is why we are using the most innovative and smart technology in order to create the banking solution for tomorrow directly in our app.

