A year ago I shared our first learnings from entering microservice land in the blog post From Rails Monolith to Microservices — Part 1. Back then my intention was to write a follow-up post to finish the adventure and share all the key learnings we gained as part of our journey. Unfortunately, that blog post was never finished and at this point in time, it really doesn’t make sense to write the intended part 2. Instead, this post will be about the shortcomings of the early Lunar Way platform and what we have done to improve the platform to support the plans we have for the future.
In the Monolith to Microservices post, I left the Lunar Way microservice platform after we had implemented the first microservice. Since then, a lot more services have joined, but the fundamental platform architecture has not really changed.
Our backend is a microservice architecture with asynchronous message passing as the first choice of inter-service communication. Synchronous RPC using gRPC is used where the async mechanism is not possible.
Services may subscribe to whatever upstream events they like. This is how a service gains required knowledge about data owned by an upstream service and how business processes are implemented as a flow of messages between services taking part in the process.
Services are deployed in a Kubernetes cluster using a CI/CD pipeline triggered by git commits.
Our early microservice platform had a few other important properties:
We have had great success with the platform: over the last two years, we have deployed around 80 different services, enabling a great number of new features and products for our users. In this sense, the platform has proven its worth: without a platform based on loosely coupled microservices, we would probably never have been able to deliver what we have. Also, the way our tech team is organized, in highly autonomous squads, is very dependent on the platform characteristics: each squad is able to develop, deploy and control its own services with a high degree of independence from the other squads.
However, over time we identified a number of challenges in the platform. Challenges which had an impact on not only our work as developers, but also on our users and our in-house support team.
Let’s go through them.
Messages were published after the state was changed in most of our microservices: the service received an external request, executed the required business logic, persisted the updated entity in the database, and only then published a message.
The problem was that the event was not guaranteed to be published. Persisting the updated entity and publishing the event were not atomic; the message publication could fail, or the service could die at exactly that moment.
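To make the failure mode concrete, here is a minimal sketch of the non-atomic write-then-publish pattern. The types and names are illustrative stand-ins for a database and a message broker, not code from the actual platform:

```go
package main

import (
	"errors"
	"fmt"
)

// store and broker are stand-ins for a database and a message broker.
type store struct{ entities map[string]string }

type broker struct {
	failing   bool
	published []string
}

func (b *broker) publish(msg string) error {
	if b.failing {
		return errors.New("broker unavailable")
	}
	b.published = append(b.published, msg)
	return nil
}

// handleRequest shows the problematic pattern: the entity is persisted
// first, and the event is published afterwards as a separate step.
// If publishing fails (or the process dies in between), the state change
// is committed but the event is silently lost.
func handleRequest(s *store, b *broker, id, newState string) error {
	s.entities[id] = newState              // step 1: persist (committed)
	return b.publish("user-updated:" + id) // step 2: publish (may fail)
}

func main() {
	s := &store{entities: map[string]string{}}
	b := &broker{failing: true}
	err := handleRequest(s, b, "user-1", "activated")
	fmt.Println("state persisted:", s.entities["user-1"]) // activated
	fmt.Println("publish error:", err)                    // broker unavailable
	fmt.Println("events published:", len(b.published))    // 0
}
```

The state change survives while the event does not, and nothing in the service records that the two diverged.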
The severity of this problem was compounded by the fact that, for obvious reasons, we didn't know if and when it happened. We only found out later, when a missing event caused an inconsistency in a downstream service or left a business process incomplete.
We use RabbitMQ as our message broker. This piece of technology has proven to be a reliable and performant way for us to publish messages. Furthermore, the semantics of topic exchanges make it easy for us to add new services as subscribers to specific messages.
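The topic exchange semantics mentioned here are what make adding a subscriber cheap: a new service just declares a binding pattern. As RabbitMQ documents it, `*` matches exactly one dot-separated word and `#` matches zero or more. A small, self-contained sketch of that matching logic (our own illustration, not RabbitMQ code):

```go
package main

import (
	"fmt"
	"strings"
)

// topicMatch reports whether a RabbitMQ-style binding pattern matches
// a routing key: '*' matches exactly one word, '#' matches zero or
// more words, and words are separated by dots.
func topicMatch(pattern, key string) bool {
	return match(strings.Split(pattern, "."), strings.Split(key, "."))
}

func match(p, k []string) bool {
	if len(p) == 0 {
		return len(k) == 0
	}
	switch p[0] {
	case "#": // try consuming zero or more words
		for i := 0; i <= len(k); i++ {
			if match(p[1:], k[i:]) {
				return true
			}
		}
		return false
	case "*": // exactly one word
		return len(k) > 0 && match(p[1:], k[1:])
	default: // literal word
		return len(k) > 0 && p[0] == k[0] && match(p[1:], k[1:])
	}
}

func main() {
	fmt.Println(topicMatch("user.*.created", "user.42.created")) // true
	fmt.Println(topicMatch("user.#", "user.42.created.v2"))      // true
	fmt.Println(topicMatch("user.*", "user.42.created"))         // false
}
```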
However, RabbitMQ was far too important a component in the message delivery chain: in failure scenarios where RabbitMQ for some reason failed to deliver a message to a subscriber, we had no way of re-delivering it. Combined with the non-atomicity problem, message delivery guarantees were therefore nonexistent.
A producing service, in general, did not itself store the messages and events it produced. It therefore could not answer questions about them afterwards; it only knew about the current state of the entities it stored.
Messages had a sequence number which was essentially just a timestamp. However, no inherent guarantees existed about this sequence number: e.g. two events pertaining to the same user from the same service could have sequence numbers out of order if the events originated in two different instances of the service. Furthermore, when creating synthetic events in a bootstrapping use case, the sequence numbers would usually differ from what they would have been in the non-synthetic case.
These facts trickled downstream to consuming services. Since a received event was not a first-class citizen with a well-defined identity and ordering, a consuming service could not reason about it, e.g. determine whether it had already received the message or figure out if it had missed an earlier event.
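A tiny sketch of how timestamp-based sequence numbers go wrong across instances. The clock values are made up to illustrate a small skew between two instances of the same service:

```go
package main

import "fmt"

// instance models one replica of a service whose "sequence number"
// is just its local timestamp at publish time.
type instance struct{ clock int64 }

func (i *instance) publish() int64 {
	i.clock++
	return i.clock // sequence number = local timestamp
}

func main() {
	a := &instance{clock: 1000} // instance A's clock
	b := &instance{clock: 995}  // instance B runs slightly behind
	first := a.publish()        // happens first, stamped 1001
	second := b.publish()       // happens later, stamped 996
	// The later event carries the lower sequence number, so a
	// consumer ordering by sequence number sees them reversed.
	fmt.Println(second < first) // true
}
```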
The impact of the inherent characteristics on the early platform as a whole centers very much around the guarantees — or the lack of guarantees — in relation to consistency across services.
Due to the described characteristics, a downstream service was not guaranteed to receive all events it subscribed to and it had no way of knowing if it had lost events. For the same reason, there was no way a consuming service could, on its own demand, verify the consistency of the upstream data it had received.
Consequently, a downstream consumer had no other choice than to blindly trust the messages received from the upstream.
The bottom line was that there were no consistency guarantees in the early platform. Downstream services might or might not have a consistent view of what had happened in upstream dependencies. In these cases, the data was not necessarily lost from the perspective of the downstream service, but it required manual intervention to fix the situation.
The most problematic consequence of the problems described was, of course, when they directly affected our users. This could happen if events were lost somewhere along the chain from producer to consumer, resulting in inconsistent data in a downstream service or a business process terminating before completion.
Inevitably, this would give rise to a support case and the only way to remedy the situation was by manual intervention of a developer.
Fortunately, this did not happen very often.
When a new service is introduced, it typically requires some information from other services in order to do its job. For example, almost all of our services require some user information in order to determine which actions to take for a specific user.
In the early platform, providing this information to a new service was a manual process, which required a developer to feed the proper messages into the new service. This was either done by side-loading relevant messages from the Poor Man’s event store into the new service or by implementing functionality in an upstream service to provide the required data.
This was a cumbersome process and due to the events not having a strict order, we had to take care that historic events replayed from the event store wouldn’t overwrite live events.
With the realisation of the inherent problems of the early platform also came a desire to fix them and instead build consistency guarantees into the platform.
The key to unlocking these guarantees was to solve the identified problems.
Hence, we set out to improve the platform with a set of characteristics being the logical opposites of the problems above:

- Atomic persistence and publication: a state change and its corresponding event are stored as one atomic operation
- Guaranteed delivery: events can be re-delivered to subscribers if delivery fails
- Event ownership: a producing service stores the events it produces and can answer questions about them
- Strict ordering: events have a well-defined order that consumers can reason about
These characteristics have a number of important implications for the architecture.
These properties in combination deliver the consistency guarantees we regard as a requirement for the Lunar Way platform of the future — a platform which we can scale to 100.000s of users with services always having a consistent view of data of upstream dependencies.
There is more than one way to design an architecture with the 4 desired characteristics. One such solution is event sourcing and this is the design pattern we have chosen to introduce into the Lunar Way platform. I will not give any introduction to event sourcing here — there are a lot of good introductions to this architectural pattern to be found.
Here’s a short list of resources we have found helpful:
By design, event sourcing provides a way to achieve the 4 desired characteristics: events are persisted as the source of truth in the same operation that changes state, the event store makes re-delivery and replay possible, the producing service stores its own events, and events belonging to the same aggregate have a strict order.
Event sourcing is a very different way of thinking about a service. Coming from traditional CRUD services, developers are used to thinking about the state of an entity, with incoming requests modifying that state. The modification itself is not persisted; it only lives in the code processing the request, and maybe in a notification message afterwards. In event sourcing, the current state is secondary, something we delegate to views to care about. What really matters is the actual change.
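The inversion can be sketched in a few lines: the persisted record is the stream of changes, and the current state is merely a fold over that stream. Event kinds and fields here are illustrative, not from our actual domain model:

```go
package main

import "fmt"

// Event is the change itself; in event sourcing it is what gets
// persisted, while the current state is derived on demand.
type Event struct {
	Kind   string // e.g. "Deposited", "Withdrawn" (illustrative names)
	Amount int
}

// Account is a view: it is never stored, only rebuilt from events.
type Account struct{ Balance int }

// apply folds a single event into the state.
func apply(a Account, e Event) Account {
	switch e.Kind {
	case "Deposited":
		a.Balance += e.Amount
	case "Withdrawn":
		a.Balance -= e.Amount
	}
	return a
}

// replay rebuilds the current state from the full event stream.
func replay(events []Event) Account {
	var a Account
	for _, e := range events {
		a = apply(a, e)
	}
	return a
}

func main() {
	stream := []Event{{"Deposited", 100}, {"Withdrawn", 30}, {"Deposited", 5}}
	fmt.Println(replay(stream).Balance) // 75
}
```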
Thinking this way is a change of paradigm and requires an effort to get used to.
Also, event sourcing comes with a lot of new terminology and concepts, which can be overwhelming to begin with.
At Lunar Way we started out with event sourcing as a hackathon project. This project was eventually promoted to a real service and deployed into production without being exposed to user requests. We used this service as a way of getting to know the concepts and terminology and actually ended up implementing our own event sourcing library in Go based on the experience from the first service. We’re planning on open sourcing this library when we think it’s ready.
One source of confusion when talking about event sourcing is the very word “event”. People often tend to think of the events in an event sourced system as events published to the outside world. This is a misconception. The “event” in event sourcing is fundamentally internal to the service — it’s the entity used by the service to store state. It’s a big mistake to publish these internal events as available events for downstream services to consume. Doing this exposes the inner workings of the service to the outside and introduces a very hard coupling between services.
Instead, an event sourced service must publish “domain events” in the DDD meaning of domain. Due to the nature of the internal event stream of an aggregate root in the event sourced service, these domain events may be implemented as projections with the same guarantees about order and reproducibility as the inner event stream. (Check out this blog for an excellent discussion of this)
Implementing an event sourced service as part of an existing platform of non-event sourced services — as in our case — can be a bit of a challenge. If the new event sourced service is at the very top of the dependency hierarchy, it’s not a problem.
However, if the service must consume events from upstream services, there is a challenge if the new event sourced service expects upstream events also to provide the same set of guarantees around delivery, order and reproducibility. If retrofitting upstream services with these guarantees is not an option, you have no other choice than to implement an adapting layer between the upstream and the event sourced service. This adapter must guard the event sourced service against replay of the same events and augment events with an order. The first event sourced service we implemented did have upstream dependencies which we had to adapt to. We did this by implementing an adapting layer inside the service itself.
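The two duties of such an adapter (guarding against replays and imposing an order) can be sketched as follows. The message and adapter shapes are our illustration of the idea, not the actual implementation inside that first service:

```go
package main

import "fmt"

// Msg models an upstream message in the early platform: it carries an
// ID but no reliable order. Field names are illustrative.
type Msg struct {
	ID   string
	Body string
}

// Adapter guards an event sourced service against duplicate delivery
// and stamps each accepted message with a local, strictly increasing
// sequence number, giving downstream code a well-defined order.
type Adapter struct {
	seen map[string]bool
	next uint64
}

func NewAdapter() *Adapter { return &Adapter{seen: map[string]bool{}} }

// Accept returns the assigned sequence number, or ok=false for a replay.
func (a *Adapter) Accept(m Msg) (seq uint64, ok bool) {
	if a.seen[m.ID] {
		return 0, false // duplicate: already forwarded once
	}
	a.seen[m.ID] = true
	a.next++
	return a.next, true
}

func main() {
	ad := NewAdapter()
	for _, m := range []Msg{{"a", "x"}, {"b", "y"}, {"a", "x"}} {
		if seq, ok := ad.Accept(m); ok {
			fmt.Printf("forward %s as #%d\n", m.ID, seq)
		} else {
			fmt.Printf("drop replayed %s\n", m.ID)
		}
	}
}
```

A production version would of course persist the seen-set and counter so they survive restarts; the sketch only shows the contract the adapter upholds.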
Apart from the consistency guarantees, which are what we are really after, event sourcing has other benefits too.
One of the selling points you often hear when it comes to event sourcing is that you get an audit log for free. This is true to the extent that the events in the event sourced system contain the required information to act as an audit log. What you get is a complete log of all the changes in the system, but if all relevant information from the action triggering the change is not available in the events, it is not really an audit log.
Implementing an aggregate root (AR) in an event sourced service follows a very strict pattern. When processing a command, the AR cannot perform any side-effecting actions, i.e. everything the AR requires to determine whether to execute the command must be part of the command. This means the command processing implementation is a pure function without side effects, and the domain logic can therefore be tested easily without requiring mocks or complicated test setups. Along the same lines, bugs from production can be reproduced by replaying the production events and executing the same command.
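A minimal sketch of that pattern: a pure decide function taking state and command and returning either new events or an error. The domain (a withdrawal) and all names are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// State is the aggregate's current, event-derived state.
type State struct{ Balance int }

// WithdrawCmd carries everything the aggregate needs to decide;
// no lookups or side effects happen during command processing.
type WithdrawCmd struct{ Amount int }

// Withdrawn is the event emitted when the command is accepted.
type Withdrawn struct{ Amount int }

// decide is a pure function: (state, command) -> events or error.
// It performs no I/O, so it can be tested with plain assertions
// and replayed against production events to reproduce bugs.
func decide(s State, c WithdrawCmd) ([]Withdrawn, error) {
	if c.Amount <= 0 {
		return nil, errors.New("amount must be positive")
	}
	if c.Amount > s.Balance {
		return nil, errors.New("insufficient funds")
	}
	return []Withdrawn{{Amount: c.Amount}}, nil
}

func main() {
	events, err := decide(State{Balance: 100}, WithdrawCmd{Amount: 40})
	fmt.Println(events, err) // [{40}] <nil>
}
```

Because decide touches nothing outside its arguments, a test is just a function call and an assertion, with no broker, database or mock in sight.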
At Lunar Way we have big plans for the future and in order to achieve these goals, our platform must be able to scale. We believe that building consistency guarantees into the platform is a key element for this.
If you have had similar challenges and found different solutions, we’d love to hear about it. Feel free to leave your comments below.
Lunar Way is a fintech company motivated by rethinking the experience of banking, and the way people perceive money and spending in general. That is why we are using the most innovative and smart technology in order to create the banking solution for tomorrow directly in our app.
Read more on lunarway.com