Foundational Patterns of Software Architecture

Fun facts and fallacies of distributed systems
Published on 2024/03/07

Hopefully you're familiar with design patterns. I don't mean any one in particular; I bring them up because having names for specific patterns makes communication simpler. When discussing architecture styles, knowing the common patterns becomes particularly important. If you discuss the Singleton design pattern, it's pretty simple to just show it if there's any doubt. If you discuss a layered monolith, it's not that simple, since the name carries many more assumptions: which architectural characteristics work well, which ones don't, deployment models, and so on. That's why it feels imperative for an architect to have at least a basic understanding of such patterns.

I thought that "Big Ball of Mud" was just an expression (which is still is), but I learned it's a common name for one of seven terrible patterns. I was researching the authors of the paper discussing the Big Ball of Mud pattern (Brian Foote and Joseph Yoder) and stumbled upon this gem which is an introduction to seven hilarious patterns:

  • BIG BALL OF MUD
  • THROWAWAY CODE
  • PIECEMEAL GROWTH
  • KEEP IT WORKING
  • SHEARING LAYERS
  • SWEEPING IT UNDER THE RUG
  • RECONSTRUCTION

I won't go into details about this but I really had to share it!

My attention went to the Fallacies of Distributed Computing when covering the distributed systems patterns. If you've ever worked on such a system, you have most likely been bitten at least once; while I didn't know an official list existed, I most certainly became aware of these pitfalls very quickly in my career. Let's explore them briefly:

  1. The Network Is Not Reliable: More than the network itself, early on I would think in terms of the availability of a service. If service A is trying to communicate with service B but service B is temporarily down or overwhelmed, what can I do? Retries, timeouts, and circuit breakers, depending on the problem, are very common patterns for making your service resilient to an unreliable network (see the first sketch after this list).
  2. Latency Is Zero: The closest I've seen someone fall for this was the belief that service-to-service communication was still "pretty fast" (whatever that meant). My assumption has always been intuitive and maybe naive: any time you have to go over a network, you pay a price. While things have gotten better over time, there's no doubt that to this day in-memory method calls are faster than remote calls.
  3. Bandwidth Is Infinite: You learn this fast if you've interacted with any cloud provider: you get charged for I/O. More generally, though, this is about efficiency. If you only need 100 bytes out of a 250 KB payload, why transfer all that unnecessary data? At 1,000 requests per second with that payload, it makes a big difference. This is one of those things where, if you don't think at scale, it's hard to imagine the price you pay. One of the selling points of GraphQL is that you don't have to specify a fixed contract between the client and the server for every request type; instead, the client is responsible for asking for exactly what it needs (see the second sketch after this list).
  4. The Network Is Secure: As you break apart your system, the attack surface increases. With a monolith you only worry about bad requests sent to a single application; in a distributed system that worry multiplies. Having to secure each endpoint can also impact performance.
  5. The Topology Never Changes: From my perspective, the network hardware is, for the most part, abstracted away. If you have an SRE team at your company, that is most likely what they deal with. This doesn't mean you are always shielded from it; sometimes components that are part of the topology need to be upgraded. Having a place where such updates are communicated explicitly can definitely save you a headache. I almost paid the price for a lack of communication when the person managing our infrastructure wiped my R&D project's data. Thankfully I have trust issues, so I had a way to restore everything, although it took some time. Having a backup/recovery system in place is always a good idea (if not a requirement)!
  6. There Is Only One Administrator: This ties in with the previous fallacy: in most medium-sized companies there's definitely more than one individual managing the system, with the exception of small shops or early startups (see my example above).
  7. Transport Cost Is Zero: Any service-to-service call to an endpoint comes at a cost (a dollar cost). A distributed system inherently needs additional hosts, hardware, gateways, and so on. You should take that into account when evaluating whether a move to a distributed architecture is worth it.
  8. The Network Is Homogeneous: I reframe this one in terms of cloud providers rather than network hardware vendors, which is what the fallacy was originally about. If you're at a scale where you work across cloud providers (like we do at MongoDB), keeping the overall system running smoothly introduces another layer of complexity. At MongoDB this is something the SRE team deals with the most, and it's an incredible amount of effort and work!
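
To make the first fallacy concrete, here's a minimal sketch of a timeout-plus-retry wrapper in Python. It assumes the requests library and a hypothetical service B endpoint; a real setup would likely add jitter and use a proper circuit breaker library instead of hand-rolling everything:

```python
import time

import requests  # assumption: the 'requests' library is available

SERVICE_B_URL = "https://service-b.internal/api/orders"  # hypothetical endpoint


def call_service_b(max_retries: int = 3, timeout_seconds: float = 2.0):
    """Call service B with a timeout and exponential-backoff retries."""
    for attempt in range(max_retries):
        try:
            # A timeout turns a hung connection into an error we can handle.
            response = requests.get(SERVICE_B_URL, timeout=timeout_seconds)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller (or a circuit breaker) decide
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
```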
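For the bandwidth fallacy, here's a small sketch of a GraphQL client asking only for the fields it needs; the endpoint URL and the user query shape are made up for illustration:

```python
import requests  # assumption: the 'requests' library is available

GRAPHQL_URL = "https://api.example.com/graphql"  # hypothetical endpoint

# A REST call such as GET /users/42 might return the full 250 KB user document.
# With GraphQL the client asks for exactly the fields it needs, so only a few
# hundred bytes cross the wire.
query = """
query {
  user(id: "42") {
    id
    email
  }
}
"""

response = requests.post(GRAPHQL_URL, json={"query": query}, timeout=5)
print(response.json())  # e.g. {"data": {"user": {"id": "42", "email": "..."}}}
```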

Thoughts

I think I'll remember the "Big Ball of Mud" as a pattern forever after reading the article. When going over the fundamental patterns of architecture styles I didn't learn anything new. A lof of the low level networking is abstracted away from us. With every cloud provider offering, these fallacies have somewhat translated to something else. I would like to add another to the list: "A Cloud Provider Region is never down". This leads you to believe that you can rely on, for example, us-east-1 in AWS to always be up and running. While the service is highly reliable, giving it for granted that it'll work 100% of the time year-round can give you the false confidence that your system will be as well. This is why availability zones are a thing and why a backup, based on a different region, is a precaution you should take to guarantee that your system is a bit more robust to outages.
