"The Essentials a Cloud Software Architect Needs to Know – Part I"
He lives and breathes Cloud architecture. Read the first part in a series of ten blogposts from André Andersen, Cloud Architect at Sogeti Sweden, focusing on “Distributed Systems, CAP theorem and beyond”, the road to building resilient systems in the Cloud.
“Either you’re building a software organization, or you’ll be losing to someone who is”, says Andrew Clay Shafer, a thought-leader in the DevOps space. Marc Andreessen also said “Software is eating the world.”
I remember being a kid, every third Friday of the month, a cassette-tape would land in our mailbox from Mr. Music. It was the highlight of the month – a new mix tape with new music!
Today, Spotify has made this kind-of obsolete, even though the charm of the mix tape-era is lost, I think I prefer the digital form. We have experienced the same shift with Netflix making renting DVDs obsolete. These services require quite a bit of engineering, especially when our customer base grows, and our customers expect more and cooler features from us.
A big portion of the customers I am privileged to speak with are to some degree pivoting towards providing digital platforms as their core business, providing a competitive advantage in contrast to their competitors.
To be able to innovate quickly, and be able to scale these innovations, smart organizations invest in understanding and mastering what Cloud providers are offering. It is another competitive advantage, being able to leverage this ever-growing plethora of tools being made available to us.
However – as these tools are becoming available, and many of you probably are already investing to some degree, building them properly as you’re building more and more cool things increase the complexity of the ecosystem with a magnitude. Harnessing the power yet being able to keep it tidy up in the cloud can be a real challenge.
Let us look at this InfoQ’s trend adoption curve, we see how organizations are starting to adopt these new development paradigms yet struggle with building them “correctly”.
A series of 10 blogposts
Cloud software architecture is a key component for business success, today and even more so tomorrow. So, I would like to share my views and get feedback and conversations going through this series of blogposts. In this series, I will cover the following:
- Distributed Systems, CAP theorem and beyond
- Resiliency with distributed systems
- Microservices - Patterns and pitfalls
- A primer to Domain-Driven Design (DDD)
- A primer to Command Query Responsibility Segregation (CQRS)
- A primer to Event Sourcing (ES)
- NoSQL (Cosmos DB)
- Build and Release Automation - Why and how
- Distributed Actors, an unsung hero on the rise?
- Bonus: CRDTs - a way to cheat the CAP theorem?
Hope you enjoy. Let’s begin the first topic!
Distributed Systems, CAP theorem and beyond – The road to building resilient systems in the Cloud
Distributed applications (i.e. microservices) are many times a magnitude more complicated than normal applications. Sure, they're not always complicated rocket science to handle. But quite often, the application with a small scope, became a completely different thing in the end. That's how businesses work, and we need to adapt and embrace this.
The #1 rule in distributing your objects is: Don't! However, there are times when it is necessary.
Why this matters to You
As expectations of software increase dramatically, and the cloud enables us to deliver software faster and more innovatively, we can't forget the challenges we have at hand when designing these systems.
The more components and the more we distribute them to try to achieve resiliency and performance, new potential shortcomings pop up - and sometimes it can feel like a game of Whack-A-Mole fighting all these potential issues.
Keeping these concerns in mind while designing the application mitigates the risk of catastrophic problems.
With this information at hand, you'll also be able to determine if creating a distributed system is necessary and worth it (think Pareto principle). In many cases, it won't be, and you've saved yourself much trouble by first designing yourself a "monolithic application."
In distributed systems, where you have more than one node dealing with the persistence of state, there needs to be agreement between them. We achieve this by using a consensus algorithm.
The CAP theorem by Eric Brewer, which stands for Consistency, Availability and Partition Tolerance, from a presentation he made at the Annual Symposium on Foundations of Computer Science in 2000. It was then formalized in 2002 by Seth Gilbert and Nancy Lynch at MIT.
In short, the CAP theorem is a classical trilemma (where you have three options, and you can only choose two.)
- Consistency meaning all clients retrieves the latest written data immediately after confirmation of write - or receive an error.
- Availability means that each request receives a non-erroneous response, without the guarantee that it contains the most recent write.
- Partition tolerance states that the system continues to operate during network partitions (i.e., when someone disconnects a network cable, a switch reboots).
In all distributed systems, you'd always want to opt for Partition Tolerance, because nobody wants a system that fails catastrophically in the event of a network partition. You're practically choosing between Availability and Consistency.
Consider the constraints of the application you're writing, and what tradeoffs are most appropriate.
One of these issues is that the CAP theorem's only problem statement is the network partitioning, meanwhile, in a production environment, another aspect to consider is latency, when your system is not having other issues.
The PACELC theorem builds on the CAP theorem, adding the notion of latency into the picture. Taking what we've learned so far about the CAP theorem, and that practically you'll always have to pick Partition Tolerance, the PACELC theorem states:
"During Partition Tolerance, you choose either Availability or Consistency, Else Latency or Consistency."
Meaning, when there is no partition (during normal operation), you're left with yet another choice between Latency and Consistency, as message passing between components of your system or database nodes in a distributed database system isn't instantaneous.
PACELC provides a better framework when reasoning about the true nature of distributed computing. From now on, let's assume our database is operating under normal conditions.
In general, you have the option of either choosing between Strong Consistency or Eventual Consistency.
- Strong Consistency, like ACID, means that once the database has confirmed your write, it guarantees that any replica in the distributed database returns the written version. Most reliable, slower.
- Eventual Consistency on the other hand, means the database confirms the write as soon as the data has been written to the first node. Faster, less reliable.
Different systems require different consistency levels, and you will have to consider the trade-offs when designing your system.
- Financial transactions should probably be strongly consistent, meanwhile;
- IoT sensor data can likely be eventual consistent.
In an upcoming post, we’ll uncover some different trade-offs databases have done, specifically with Azure Cosmos DB and their 5 different consistency models.
The fallacies of distributed computing
What we've learned so far about CAP and the PACELC theorem goes hand in hand with a few rules stated in 'Fallacies of distributed computing' by L Peter Deutsch and others from Sun Microsystems.
The fallacies are:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology does not change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Describing them in detail in this post will be too lengthy, but I recommend checking out the link above, and dive deeper.
In summary – we have unreliable, slow, and insecure networks that change over time, where sending stuff over the wire costs time and money. We also have several people and other systems who are in control of our things.
Some state with more modern technology, these fallacies are becoming obsolete. I’d assert this is not true. Mainly because we today, more than ever are relying on distributed components, and meanwhile our network capacity and speeds is vastly improved – we’re also vastly increasing the amount of data transmitted, and our expectations on the latency.
In the next series of posts, we'll uncover a few topics that help you as a software craftsperson to create more resilient and performing systems in the Cloud. Stay tuned! And do not hesitate to connect with me if you would like to discuss the content of my blogpost further!