Other posts in this series:
- Part 1 – Microservices: It’s not (only) the size that matters, it’s (also) how you use them
- Part 3 – Microservices: It’s not (only) the size that matters, it’s (also) how you use them
- Part 4 – Microservices: It’s not (only) the size that matters, it’s (also) how you use them
- Part 5 – Microservices: It’s not (only) the size that matters, it’s (also) how you use them
- Part 6 – Service vs Components vs Microservices
In Microservices: It’s not (only) the size that matters, it’s (also) how you use them – part 1, we discussed how the number of lines of code is a very poor measure of whether a service has the right size, and a totally useless one for determining whether a service has the right responsibilities.
We also discussed how using 2-way (synchronous) communication between our services results in hard coupling and other annoyances:
- Communication-related coupling (because data and logic are not always in the same service)
- Contractual, data and functional coupling, as well as high latency due to network communication
- Layered coupling (persistence is not always in the same service)
- Temporal coupling (our service cannot operate if it is unable to communicate with the services it depends upon)
- Decreased autonomy and reliability, because our service depends on other services
- All of this results in the need for complex compensation logic due to the lack of reliable messaging and transactions
If we combine 2-way (synchronous) communication with small/micro-services, modelled according to e.g. the rule 1 class = 1 service, we are effectively sent back to the 1990s with CORBA, J2EE and distributed objects.
Unfortunately, it seems that a new generation of developers, who did not experience distributed objects and therefore did not take part in realizing how bad the idea was, is trying to repeat history. This time only with new technologies, such as HTTP instead of RMI or IIOP.
Jay Kreps summed up the current Micro Service approach, using two way communication, very aptly:
Just because Micro Services tend to use HTTP, JSON and REST doesn’t make the disadvantages of remote communication disappear. The disadvantages that newbies to distributed computing easily overlook are summarized in the 8 fallacies of distributed computing:
- The network is reliable
However, anyone who has lost the connection to a server or to the Internet knows that network routers, switches, WiFi connections, etc. are more or less unreliable, so this is a fallacy. Even in a perfect setup you will experience crashes in the network equipment from time to time
- Latency is zero
What is easily overlooked is that a network call is very expensive compared to the equivalent in-process call. Over the network, bandwidth is more limited and latency is measured in milliseconds instead of nanoseconds. The more calls that are executed sequentially, the worse the overall latency becomes
- Bandwidth is infinite
In fact, network bandwidth, even on a 10 Gbit network, is much lower than that of the same call made in-memory/in-process. The more data we send, and the more calls we make because of our small services, the greater the impact on the available bandwidth
- The network is secure
Saying “NSA” should be enough to cover why this is a fallacy.
- Topology doesn’t change
Reality is different. Services deployed to production will experience a constantly changing environment: old servers are upgraded or moved (if necessary also changing IP address), network equipment is changed or reconfigured, firewalls change their configuration, etc.
- There is one administrator
In any large-scale installation there will be several administrators: Network administrators, Windows admins, Unix admins, DB admins, etc.
- Transport cost is zero
For a simple example of why this is a fallacy, look at the cost of serializing/deserializing between the internal representation and JSON/XML/…
- The network is homogeneous
Most networks consist of various brands of network equipment, which support various protocols and communicate with computers running different operating systems, etc.
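To make the latency fallacy concrete, here is a small back-of-the-envelope calculation. All figures are rough assumptions for illustration, not measurements:

```python
# Rough, assumed figures: an in-process call costs on the order of 100 ns,
# while a remote call on a fast LAN easily costs ~5 ms once you include the
# network stack, serialization and the wire itself.
IN_PROCESS_CALL_MS = 100 / 1_000_000   # 100 ns expressed in milliseconds
REMOTE_CALL_MS = 5.0

def total_latency_ms(sequential_calls: int, per_call_ms: float) -> float:
    # Sequential calls add up linearly: each call must wait for the previous one.
    return sequential_calls * per_call_ms

# 10 sequential remote calls cost ~50 ms; the same 10 calls in-process cost ~0.001 ms.
print(total_latency_ms(10, REMOTE_CALL_MS))       # 50.0
print(total_latency_ms(10, IN_PROCESS_CALL_MS))   # ~0.001
```

The point is not the exact numbers but the ratio: several orders of magnitude per call, multiplied by every sequential remote call our small services force us to make.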
This review of the 8 fallacies of distributed computing is far from complete. If you are curious, Arnon Rotem-Gal-Oz has written a more thorough review (PDF format).
What is the alternative to 2-way (synchronous) communication between services?
The answer can, among other places, be found in Pat Helland’s “Life Beyond Distributed Transactions – An Apostate’s Opinion” (PDF format).
In his article Pat argues that “adults” do not use distributed transactions to coordinate updates across transactional boundaries (e.g. across databases, services, applications, etc.). There are many good reasons not to use distributed transactions, among them:
- Transactions lock resources while they are active
Services are autonomous, so if another service, through a distributed transaction, is allowed to lock resources in your service, that is a clear violation of your service’s autonomy
- A service cannot be expected to complete its processing within a specified time interval – it is autonomous and therefore in control of how and when it performs its processing. This means that the weakest link (service) in a chain of updates determines the strength of the entire chain.
- Locking keeps other transactions from completing their job
- Locking does not scale
If a transaction takes 200 ms and e.g. holds a table lock, then the service can at most be scaled to 5 simultaneous transactions per second. Adding more machines does not help, as it doesn’t change how long the lock is held by a single transaction
- 2-phase/3-phase/X-phase commit distributed transactions are fragile by design.
So even though X-phase commit distributed transactions solve the problem of coordinating updates across transactional boundaries, at the expense of performance (yes, X-phase commit protocols are expensive), there are still many error scenarios where an X-phase transaction is left in an unknown state. E.g. if a 2-phase commit is interrupted during the commit phase, some participants will have committed their changes while others have not. You’re left in deep water without a boat if just a single participant fails or is unavailable during the commit phase – see the drawing below for the 2-phase commit flow.
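The arithmetic behind “locking does not scale” can be spelled out directly. This is a simplification that assumes a single exclusive lock held for the whole transaction:

```python
def max_transactions_per_second(lock_hold_ms: float) -> float:
    # An exclusively locked resource serializes its transactions, so throughput
    # is capped at 1000 ms divided by the lock hold time. Adding machines does
    # not raise this ceiling, because the lock itself is the bottleneck.
    return 1000.0 / lock_hold_ms

print(max_transactions_per_second(200))  # 5.0 -- the figure from the example above
print(max_transactions_per_second(20))   # 50.0 -- shortening the lock is what helps
```

The only way to scale past the ceiling is to hold locks for less time, or not at all.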
So if distributed transactions aren’t the solution, then what is?
The solution has three parts:
- How we split our data / services
- How we identify our data / services
- How we communicate between our data / services
How do we split our data / services and identify them?
According to Pat Helland, data must be collected in pieces called entities. These entities should be limited in size, so that each of them is consistent after a transaction.
This requires that an entity is no bigger than what fits on one machine (across machines we would have to use distributed transactions to ensure consistency, which is what we wanted to avoid in the first place). It also requires that the entity is not too small in relation to the usecases that update it. If it’s too small, we are back to coordinating updates across services, e.g. by using distributed transactions to ensure a consistent system.
The rule of thumb is: one transaction involves only one entity.
Let us take an example from the real world:
In a previous project I was faced with a textbook example of how misguided reuse ideals and micro-splitting of services undermine service stability, transactionality, low coupling and low latency.
The customer thought they could ensure maximum reuse for two domain concepts, Legal Entities and Addresses, where an address covered everything that could be used to address a legal entity (and most likely everything else on earth), such as home address, work address, email, phone number, mobile number, GPS location, etc.
To ensure reusability and the coordination of creation, updates and reads, they had to introduce a task service called “Legal Entity Task Service” that would coordinate work between the data services “Legal Entity Micro Service” and “Address Micro Service”. They could have chosen to let the “Legal Entity Micro Service” take the role of the Task service, but that wouldn’t have solved the fundamental transaction problem that we’re going to discuss here.
To create a legal entity, such as a person or a company, you first had to create a Legal Entity in the “Legal Entity Micro Service” plus one or more addresses in the “Address Micro Service” (depending on how many were defined in the data given to the CreateLegalEntity() method of the “Legal Entity Task Service”). For each address created, the AddressId returned from the CreateAddress() method in the “Address Micro Service” had to be associated with the LegalEntityId returned from the CreateLegalEntity() method of the “Legal Entity Micro Service”, by calling AssociateLegalEntityWithAddress() in the “Legal Entity Micro Service”:
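In heavily simplified code, the orchestration just described looks something like the sketch below. The in-memory service classes are hypothetical stand-ins for the real remote services; in production, every call in create_legal_entity_task() would be a separate remote call with no transaction spanning them:

```python
class LegalEntityService:
    """In-memory stand-in for the "Legal Entity Micro Service"."""
    def __init__(self):
        self.legal_entities = {}   # LegalEntityId -> list of associated AddressIds
        self.next_id = 1

    def create_legal_entity(self, name):
        legal_entity_id = self.next_id
        self.next_id += 1
        self.legal_entities[legal_entity_id] = []
        return legal_entity_id

    def associate_legal_entity_with_address(self, legal_entity_id, address_id):
        self.legal_entities[legal_entity_id].append(address_id)

class AddressService:
    """In-memory stand-in for the "Address Micro Service"."""
    def __init__(self):
        self.addresses = {}        # AddressId -> address data
        self.next_id = 1

    def create_address(self, address):
        address_id = self.next_id
        self.next_id += 1
        self.addresses[address_id] = address
        return address_id

def create_legal_entity_task(legal_entity_service, address_service, name, addresses):
    # The task service's coordination: 1 + 2N remote calls for N addresses.
    # If any call after the first fails, partial state is left behind in one
    # or both services -- there is no transaction covering the whole sequence.
    legal_entity_id = legal_entity_service.create_legal_entity(name)
    for address in addresses:
        address_id = address_service.create_address(address)
        legal_entity_service.associate_legal_entity_with_address(
            legal_entity_id, address_id)
    return legal_entity_id
```

Even in this toy form, the call pattern makes the problem visible: the happy path works, but every intermediate step is a place where a failure strands the system between consistent states.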
From the sequence diagram above, it should be clear that there is a high degree of coupling (at all levels). If the “Address Micro Service” does not respond, you cannot create any legal entities. The latency of such a solution is also high because of the large number of remote calls. Some of the latency can be minimized by performing several of the calls in parallel, but that is sub-optimization of a fundamentally wrong solution, and our transaction problem remains the same:
If just a single one of the CreateAddress() or AssociateLegalEntityWithAddress() calls fails, we’re left with a nasty problem. Say we have created a Legal Entity and one of the CreateAddress() calls fails. This leaves us with an inconsistent system, because not all of the data we intended to create was created.
It could also be that we created our Legal Entity and all the Addresses, but somehow not all the addresses were successfully associated with the legal entity. Again we’re faced with an inconsistent system.
This form of orchestration places a heavy burden on the CreateLegalEntity() method in the “Legal Entity Task Service”. It is now responsible for retrying any failed calls or cleaning up after them (also known as compensation). Maybe one of the cleanups fails, and what do you do then? What if the CreateLegalEntity() method in the “Legal Entity Task Service” is in the middle of retrying a failed call or performing a cleanup when the physical server it runs on is turned off? Did the developer remember to implement the CreateLegalEntity() method (in the “Legal Entity Task Service”) so that it remembers how far it got and can resume its work when the server is restarted? Did the developers of the CreateAddress() and AssociateLegalEntityWithAddress() methods ensure that the methods are idempotent, so that calls to them can be retried several times without risking double creation or double association?
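One possible way to make a method like CreateAddress() idempotent (an illustration, not the project’s actual solution) is to let the caller supply the identity, so a retried call becomes a harmless no-op instead of a double creation:

```python
class IdempotentAddressService:
    """Minimal sketch: the caller supplies the AddressId, so retrying the same
    create (e.g. after a timeout where the reply was lost) cannot produce a
    duplicate address."""
    def __init__(self):
        self.addresses = {}   # AddressId -> address data

    def create_address(self, address_id, address):
        if address_id in self.addresses:
            return address_id          # already created: the retry is a no-op
        self.addresses[address_id] = address
        return address_id
```

The same trick applies to AssociateLegalEntityWithAddress(): if the association already exists, the call simply succeeds again. This makes retries safe, but it still doesn’t make the overall multi-service sequence atomic.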
The transactionality problem can be solved by looking at the usecase and re-evaluating the reuse thesis
The design of the LegalEntity and Address services came about after a team of architects had designed a logical canonical model and, from it, decided what was reusable and thus should be elevated to services. The problem with this approach is that a canonical data model does not take into account how the data is used, i.e. the usecases that use the data. This is a problem because the way data is changed/created directly determines our transaction boundaries, also known as our consistency boundaries. Data that is changed together in a transaction/usecase, as a rule of thumb, also belongs together, both data-wise and ownership-wise.
Our rule of thumb is therefore expanded to: 1 usecase = 1 transaction = 1 entity.
The second mistake they made was to view the legal entities’ addresses as general addresses and then elevate Address to a service. You could say that their reuse focus meant that everything that smelled of an address had to be shoehorned into the Address service. The hypothesis was that if everyone used this Address service and suddenly a city name or postal code changed, then you only needed to fix it in one place. The latter was perhaps a valid reason to centralize this specific piece of information, but it came at a high cost for everything else.
The data model looked something like this (a lot of details omitted):
As shown in the model, the association between LegalEntity and Address is a shared directed association, indicating that two LegalEntities can share an Address instance. However, this was never the situation, so the association was really a composite directed association, indicating a parent-child relationship. There is e.g. no need to keep an Address for a LegalEntity after the LegalEntity has been deleted (again an indication of a composite association). The parent-child relationship shows that LegalEntity and Address belong closely together: they’re created together, changed together and used together.
This means that instead of having two entities we really only have one entity, LegalEntity (our pivotal point), with one or more Address objects closely linked to it. This is where Pat Helland’s Entity vocabulary can benefit from Domain Driven Design’s (DDD) richer language, which includes:
- Entity – which describes an object that is defined by its identity (and not its data), an example is a legal entity (a Person has a Social Security number, a company has a VAT number, etc.)
- Value Object – which describes an object that is defined by its data and not its identity; examples are an Address, a Name or an Email address. Two value objects of the same type with the same values are said to be equal. A value object never exists alone; it always exists as part of a relationship with an Entity. The value object, so to speak, enriches the Entity with its data.
- Aggregate – a cluster of coherent objects with complex associations. An Aggregate is used to ensure invariants and guarantee the consistency of the relationships between these objects. An Aggregate is also used to control locking and guarantee transactional consistency in the context of distributed systems.
- An Aggregate chooses an Entity to be its root and controls access to the objects within the Aggregate through this root. The root is called the Aggregate Root.
- An Aggregate is uniquely identifiable by its ID (usually a UUID/GUID)
- Aggregates refer to each other by their ID – they NEVER use memory pointers or Join tables (which we will return to in the next blog post)
From this description we can determine that what Pat Helland calls an Entity is, in DDD jargon, called an Aggregate. DDD’s language is richer, so I will continue to use DDD’s naming. If you are interested in Aggregates, I can recommend Gojko Adzic’s article.
From our usecase analysis (LegalEntity and Address are created and changed together) and the use of DDD’s jargon (LegalEntity is an Aggregate Root and an Entity, and Address is a Value Object), we can now redesign the data model (also known as the domain model):
With the design above, the AddressId has disappeared from the Address, since a Value Object doesn’t need one.
Our LegalEntity still has its LegalEntityId, and it is the one we refer to when we communicate with the “Legal Entity Micro Service”.
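A minimal sketch of the redesigned model (field names are illustrative): Address is a Value Object compared purely by its data, and LegalEntity is the Aggregate Root that owns its addresses.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Address:
    # Value Object: defined by its data, no AddressId, compared by value.
    street: str
    city: str
    postal_code: str

@dataclass
class LegalEntity:
    # Aggregate Root: the only object the outside world refers to,
    # via its LegalEntityId.
    legal_entity_id: str
    name: str
    addresses: list = field(default_factory=list)  # child Value Objects

    def add_address(self, address: Address) -> None:
        # Addresses live and die with their LegalEntity (composite association).
        self.addresses.append(address)

# Two Value Objects with the same data are equal -- no identity involved.
print(Address("Main St 1", "Copenhagen", "2100")
      == Address("Main St 1", "Copenhagen", "2100"))  # True
```

Note how the composite (parent-child) association from the model is now explicit: an Address only exists inside a LegalEntity, so deleting the aggregate deletes its addresses with it.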
With the redesign we have made the Address service obsolete and all that is left is the “Legal Entity Micro service”:
With this design our transaction problem has completely disappeared, because there is only one service to talk to in this example. There is still much that can be improved, and we have not yet covered how to communicate between services to ensure coordination and consistency when our processes / usecases cut across aggregates / services.
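With only one service, creating a legal entity together with its addresses becomes a single local operation. A sketch, where the repository is a plain in-memory dict standing in for what would be one database transaction in a real service:

```python
class LegalEntityMicroService:
    """Sketch of the remaining "Legal Entity Micro Service": the whole aggregate
    is stored in one step, so there is no cross-service sequence that can fail
    halfway and no compensation logic to write."""
    def __init__(self):
        self.repository = {}   # stand-in for the service's own single database

    def create_legal_entity(self, legal_entity_id, name, addresses):
        # One write covering the entire aggregate: either the legal entity and
        # all its addresses exist afterwards, or nothing does.
        self.repository[legal_entity_id] = {
            "name": name,
            "addresses": list(addresses),
        }
        return legal_entity_id
```

Compare this with the 1 + 2N remote calls of the task-service orchestration: the consistency boundary now coincides with the aggregate, which is exactly the 1 usecase = 1 transaction = 1 entity rule of thumb.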
This blog post is already getting too long, so I’ll cover that in the next blog post.
Until then, as always, I’m interested in feedback and thoughts 🙂