The title of this article may be a little provocative, but that’s the general idea. This time, it’s not just about sharing knowledge. We also want to start a discussion about different approaches to programming, and about how ignoring or underappreciating certain aspects may harm the final product.
Let’s start by defining a microservice. It’s not that easy, since there are many valid definitions. Some say a microservice is any service with fewer than 1,000 lines of code. Others say it’s something created with 3 pizzas or fewer. Some people claim that a microservice is something you can completely rewrite during a sprint.
The microservices we worked on required more than one sprint, more than 3 pizzas, and more than 1,000 lines of code. Our definition of microservice is something as small as possible and as big as necessary to run. In other words, if there’s something you can remove and your code still works, it’s not a microservice.
Here are the conditions a microservice has to meet:
- It has to have a single responsibility (business domain)
- Its implementation should be exposed through a contract
- It should be possible to implement and deploy it independently of other microservices
Combined, these conditions give us a crash-resistant whole. Even if a single microservice crashes, it doesn’t hurt the other running services.
How to approach a microservice project?
As software engineers, we decided that it was easier, at least for some of us, to design an app as a monolith, but adopt the microservice approach for communication. That’s how fire TMS came to be (read more about the project here).
The stages of development were as follows:
- Start – monolith until business domain boundaries are established,
- Development – monolith with microservice communication (modular monolith),
- Scaling – switch to actual microservices.
The limitations of microservices
Do you know that meme about wanting something cheap, fast, and good, but having to choose just two of these qualities? Another version of this applies to microservices. You can have two of those:
- Data consistency,
- High availability,
- Partition tolerance.
This trade-off is known as the CAP theorem. If the business requires consistency, you can’t guarantee 100% availability. If it calls for partition tolerance and availability, consistency is going to be a problem. We had situations where a system was ready for scaling and crashes and stayed available to the user, but data arrived with a delay, so it was only eventually consistent. However, if the business needs 100% consistent data, availability will become a problem, because scaling simply won’t be possible.
When does a microservice become a distributed monolith?
- a change in one service requires changes in others (the services can’t be implemented and deployed independently),
- services share a database,
- services communicate intensively with each other.
Microservice vs. Monolith
Let’s take a minute to think about why we even create microservices. In theory, it’s because…
- we want to make the programmers’ life easier,
- the app needs to be scalable,
- the app needs to be failure-proof,
- we need to use a technology that better fits a given task,
- we want to distribute tasks between many developers organized into teams.
And from our experience, you usually meet all these goals! That is, until the implementation.
Here’s what a reality check can prove:
Additional problems can and will surface when you launch such microservices. These dangers are well known, but when you first encounter them, they feel new. Some of them can be coded away. If they aren’t, however, they’re going to become a problem for the administrators who deploy and maintain the microservices.
Transferring even a part of error processing onto the infrastructure and administration can result in these problems piling up. The next step is network congestion or alert spam in the monitoring.
And we haven’t even started on the flaws of distributed systems.
The top 8 flaws of distributed systems
The funny thing is that these flaws have been known for over 20 years. In 1994, Peter Deutsch formulated the Seven Fallacies of Distributed Computing, and James Gosling later added an eighth. These two men charted the areas that deserve special attention during the microservice design phase.
1. The network isn’t failure-free
At some point, the network will fail. It’s inevitable. When it does, your microservice can misbehave in a variety of ways. You might be unable to connect at all when the app starts or restarts, or an existing connection can be lost.
Common scenarios:
- the app freezes and requires a restart,
- endless ‘awaiting response’ state with no timeout,
- excessive resource use,
- no reconnect option, no rollback, no resuming,
- wrong message order, data corruption, duplicates.
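Several of the scenarios above (endless waiting, no reconnect) can be partly coded away with bounded retries. The following is a minimal sketch, not a definitive implementation: `TransientNetworkError` and `call_with_retry` are hypothetical names introduced for illustration, and the sketch assumes the wrapped operation already sets its own timeout (for example `socket.create_connection(address, timeout=5)`), so that a dead connection raises instead of hanging.

```python
import random
import time


class TransientNetworkError(Exception):
    """Hypothetical error standing in for a dropped connection or a timeout."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep):
    """Run operation(); retry on TransientNetworkError with capped backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            # The delay doubles on every attempt up to max_delay. Random
            # jitter keeps many clients from reconnecting at the same moment.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Retries alone don’t address duplicates or wrong message order. If the operation isn’t idempotent, a retry after a lost response can apply the same change twice, so the receiving side still needs its own protection.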
2. There will be delays
This is familiar to UI developers who test their app locally and think it works lightning-fast. As soon as they deploy it to a server or move to a mobile app, the client that worked great in the dev environment is painfully slow in RC and production.
The connection is also limited when reaching the client via VPN. Delays may cause the app to act up. The answer would be to use libraries that delay the server response and let you test the app in slow-connection mode. Or you could simply install it somewhere far away and see how it really works.
Common scenarios:
- WAN connections
- mobile apps
- too many connections
- AJAX queries
Results:
- delays in data presentation
- low responsiveness of the app
- too many connections
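If no such library is at hand, artificial latency is easy to add yourself. The sketch below is one possible approach under stated assumptions: `with_latency` and `fetch_profile` are hypothetical names, and the decorated function stands in for whatever call your client makes to the server.

```python
import time
from functools import wraps


def with_latency(seconds, sleep=time.sleep):
    """Decorator that delays every call by `seconds` to mimic a slow link."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            sleep(seconds)  # simulated network round trip
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_latency(0.3)
def fetch_profile(user_id):
    # Hypothetical backend call; returns canned data in this sketch.
    return {"id": user_id, "name": "test user"}
```

Running the UI against such delayed calls in the dev environment tends to reveal missing loading states and chatty request patterns before they show up in RC.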
3. The bandwidth is limited
We encountered this problem when trying to load a large number of short messages into Elasticsearch. Establishing a connection takes time. Sending a message takes time. What can you do about it?
For one, you can aggregate your messages, but if the aggregated messages get too big and packets get lost, they get retransmitted. Then the network clogs up with these retransmissions. That’s why it’s important to find the right balance for the message size. Mind that you can also affect other services using the same network, such as VoIP.
Common scenarios:
- generating too many short messages (no aggregation)
- generating messages that are too large (retransmissions)
Results:
- data flow bottlenecks
- network overloads, lost packets
- limited bandwidth for other services (e.g. VoIP)
- dev / RC vs. production differences
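The balance between too many short messages and too-large ones can be expressed as a size-capped batcher. This is a minimal sketch: `batch_messages` is a hypothetical helper, and the 64 KiB default is an arbitrary assumption that you would tune against your own network and target system.

```python
def batch_messages(messages, max_batch_bytes=64 * 1024):
    """Group byte-string messages into batches of at most max_batch_bytes.

    A single message larger than the limit is sent as a batch of its own.
    """
    batch, size = [], 0
    for msg in messages:
        # Close the current batch if this message would push it over the cap.
        if batch and size + len(msg) > max_batch_bytes:
            yield batch
            batch, size = [], 0
        batch.append(msg)
        size += len(msg)
    if batch:
        yield batch  # flush whatever is left
```

Each yielded batch can then be sent over one connection in one request, which saves the per-message connection and round-trip cost while keeping an upper bound on how much has to be resent if something is lost.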
4. The network isn’t safe
You can’t assume that it is and send unencrypted data. If you do, you’re leaving yourself open to attacks.
Common scenarios:
- transmission of unencrypted data
Results:
- external traffic monitoring and leaks of confidential data
- replay attacks (e.g. on login) – an attacker records your login transmission and sends it again, which lets them log in to your account from a different device and location
- injection of unauthorized data (e.g. JavaScript into HTML) – by inserting a script into the page that’s returned to the user, the attacker gains full control of everything inside the user’s browser
- MitM attacks – the person controlling the access point can hijack the transmission, edit the data, and generally use whatever you send
- automated attacks
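Encrypting the transport (TLS) is the first answer to the problems above. On top of that, replayed requests can be rejected at the application level. The sketch below shows one common approach under stated assumptions: both sides share a secret, every request carries a timestamp and a one-time nonce, and the names `sign`, `new_nonce`, `ReplayGuard` and the 30-second window are all hypothetical choices made for illustration.

```python
import hashlib
import hmac
import secrets
import time

MAX_AGE_SECONDS = 30  # assumed acceptance window


def new_nonce():
    """Random one-time value generated by the client for each request."""
    return secrets.token_hex(16)


def sign(secret, payload, timestamp, nonce):
    """HMAC-SHA256 over the timestamp, the nonce and the payload bytes."""
    message = f"{timestamp}:{nonce}:".encode() + payload
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class ReplayGuard:
    def __init__(self, secret):
        self.secret = secret
        # A real service would expire old entries; this sketch keeps them all.
        self.seen_nonces = set()

    def verify(self, payload, timestamp, nonce, signature, now=None):
        now = time.time() if now is None else now
        expected = sign(self.secret, payload, timestamp, nonce)
        if not hmac.compare_digest(expected, signature):
            return False  # payload, timestamp or nonce was tampered with
        if abs(now - timestamp) > MAX_AGE_SECONDS:
            return False  # recording is too old to be accepted
        if nonce in self.seen_nonces:
            return False  # same request was already seen: a replay
        self.seen_nonces.add(nonce)
        return True
```

This only complements encryption. It doesn’t hide the data from someone monitoring the traffic.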
5. Topology changes
Assuming that the topology never changes is one of the most common mistakes.
Common scenarios:
- adding or removing servers, or server instances – if you’re building a scalable service, you should know that the topology will keep changing because new instances of every service will keep appearing. You will also remove certain elements when a health check detects that a service isn’t working as expected, or when something causes the host to go down
- technology changes – your traffic may go through someone else’s network that you have no control over
- changes in packet routes – dynamic routing
Results:
- no connection with hardcoded servers
The solution for scaling would be service discovery: a dedicated service where every instance checks in, so that others can find it. If your services require a central control point, service discovery also provides master election for HA. If you have a group of microservices, you can elect a leading microservice that sets the pace for all the others. This may prove crucial when you’re building a distributed cron or similar services that need to be started from a single point.
You can also use a VPN to build a network abstraction, regardless of what’s connected to this network. You know the address range, you know the MTU. You can build your own network on this foundation.
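The check-in idea behind service discovery can be sketched in a few lines. This is an in-memory toy that only illustrates the principle, not a replacement for a real discovery service: `ServiceRegistry`, its methods, and the 10-second TTL are hypothetical, and leader election is left out.

```python
import time


class ServiceRegistry:
    """Toy registry: instances check in, stale ones drop out of lookups."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._instances = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Called periodically by every running instance."""
        self._instances[(service, address)] = self.clock()

    def lookup(self, service):
        """Return addresses of instances that checked in within the TTL."""
        now = self.clock()
        return sorted(
            addr
            for (name, addr), seen in self._instances.items()
            if name == service and now - seen <= self.ttl
        )
```

Clients call `lookup("billing")` before connecting instead of keeping a hardcoded address, so an instance that stops sending heartbeats disappears from the results on its own.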
6. Only one administrator?
This problem haunts many companies. Even large ones. It never ends well.
Common scenarios:
- a multi-level production environment with infra/app admins and an L1/L2 line division, where every segment has its own rules and policies. This means that whenever you want to push something to production, you have to consult every team separately. You can try to change this process and convince the client that your deployment doesn’t require everyone’s participation. You can try to automate it and just inform everyone that it happened. However, your client can flat out refuse.
Results:
- difficulties with establishing the scope of competencies
- scheduling multiple meetings to execute a single update
- conflicting policies between the teams
7. Transport isn’t free
This is a common programmer’s mistake: not minding the amount of transmitted data.
Common scenarios:
- SELECT * FROM – this gets you any data you want, but also a lot of data you don’t want. Don’t do this. Overusing the equivalent in Elasticsearch is an admin’s nightmare: the database slows down to a crawl, and you can’t really optimize it. It’s also discouraged in AWS – you could actually get removed for such practices. It’s not only an abuse of the bandwidth but also an additional cost for you. Always think twice about whether you’re getting only the data you need
- insufficient aggregation from the database server
- lack of optimization of graphic files
- serialization that requires additional assets
Results:
- increasing costs due to incoming/outgoing traffic
- higher technical demands (better routers and connections)
- slow UI
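The difference between fetching everything and fetching only what you need can be illustrated with an in-memory SQLite database. This is only a sketch: the `orders` table and its columns are made up, and with SQLite nothing actually crosses a network, so the point is the shape of the queries rather than a measurement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, total REAL, notes TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, total, notes) VALUES (?, ?, ?)",
    [
        ("acme", 120.0, "x" * 10_000),
        ("acme", 80.0, "y" * 10_000),
        ("globex", 50.0, "z" * 10_000),
    ],
)

# Wasteful: every column of every row is returned, including the large
# notes field, only to be summed up on the application side.
all_rows = conn.execute("SELECT * FROM orders").fetchall()

# Leaner: name the columns you need and let the database server aggregate.
totals = conn.execute(
    "SELECT customer, SUM(total) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
```

The second query returns one short row per customer no matter how many orders exist, which covers both the `SELECT *` and the insufficient-aggregation scenarios.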
8. Homogeneous network (that isn’t)
The chance that you’ll experience this problem is rather slight, but it’s still there. When you connect systems belonging to different users, problems may arise.
Common scenarios:
- connections between the systems of different types, from different providers
- different connection speeds
- using closed protocols
- MTU size
Results:
- a complete lack or limited possibilities of connecting with other systems or applications
That’s pretty much it!
The problems above are the ones you should be aware of. They have existed for 26 years, and unfortunately, they’re still relevant. We’ve encountered them ourselves. So, if you have an app in the dev environment and want to move it to RC, take a look at this list. Think about whether any of these problems may rear its ugly head and whether you are ready to handle it. If you need to, try to simulate these situations, especially the network issues, because these are bound to happen at some point.
It could even happen that the admins stage a surprise test by moving the app to another server or disabling a crucial component. Do it before they do. It’s way easier to test things without pressure from the client.