Understanding Software Dynamics - Chapter 1
I joined Phil's book club and we covered the first 2 chapters of Understanding Software Dynamics. It felt a bit like going back to school, in a good way! Although I got the book just a few days before the first two chapters were discussed, I was able to carve out enough time to make it all the way through. I won't lie, I had to take several pauses. I hadn't thought at such a low level in a while, so I had to stretch my muscles a bit. How refreshing!
Let me start with the title of the initial section: measurement. I'm not surprised by the emphasis on proper measurements. I could never understand folks who discuss optimization by starting from a potential approach rather than from a measurement. How would you know what actually needs improvement? How will you know you improved anything at all? What the first chapter of this book made me realize is that measurements are quite difficult to get right.
When a chance to improve performance arises, there's often a missed opportunity for learning. As we discover more details about our system, we can adjust our mental model. The problem is that we tend to measure first, with no expectation to compare the result against. The better way to go about this is to start from a realistic expectation, based on the best you can get to with some "napkin math". The message is that if you want to improve the mental model you have of your system, you need to estimate first, then measure, and then adjust your mental model based on the difference. This is a good exercise and a great learning opportunity. Being familiar with these latency numbers is recommended, since it makes your estimates less wild.
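To make the estimate, measure, adjust loop concrete, here is a minimal sketch in Python. It is my own illustration, not something from the book: the function names are made up, and the 50 ns per item figure is a placeholder guess for a pure-Python loop that you should replace with your own napkin number.

```python
import time

# Assumption: a rough, hypothetical cost per element for a pure-Python loop.
# This is a guess to be checked, not a measured or authoritative figure.
ESTIMATED_NS_PER_ITEM = 50.0


def estimate_seconds(n_items, ns_per_item=ESTIMATED_NS_PER_ITEM):
    """Napkin math: the expected time, written down before measuring."""
    return n_items * ns_per_item / 1e9


def measure_seconds(n_items, repeats=5):
    """Time a simple summing loop and keep the best of several runs."""
    data = list(range(n_items))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        total = 0
        for x in data:
            total += x
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def compare(n_items):
    """Return (expected, actual, actual / expected)."""
    expected = estimate_seconds(n_items)
    actual = measure_seconds(n_items)
    return expected, actual, actual / expected


if __name__ == "__main__":
    expected, actual, ratio = compare(1_000_000)
    print(f"expected {expected * 1e3:.2f} ms, measured {actual * 1e3:.2f} ms")
    print(f"measured / expected = {ratio:.2f}")
```

The interesting part is the ratio. If it is far from 1, either the estimate or my understanding of what the code is doing is off, and that gap is where the learning happens.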
I was surprised by the focus on long-tail latency. As I was reading about it, I imagined that the message was to focus on the tail first. It's way more nuanced than that. While a terrible median is a sign of a poorly performing system, the median otherwise provides little information about the shape and size of the long tail. A good reason to look at the tail is to ensure the latency of our system is predictable. If you know a transaction in your system always takes 50ms with short tail latency, you can pack compute more densely. On the other hand, if the average is below 50ms but your system has long tail latency, the occasional slow transactions make it unpredictable. This means you have to account for them, and the way you pack compute power has to adapt. In short, a system with a higher average and short tail latency is better and more predictable than a system with a lower average and long tail latency (this is pretty much a quote from the book).
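Here is a small sketch of that comparison in Python. The two latency lists are synthetic numbers I made up for illustration, and the nearest-rank percentile helper is just one simple way to compute a percentile, not necessarily what the book uses.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Mean, median (p50) and p99 of a list of latencies in milliseconds."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
    }


# Synthetic data (an assumption for illustration only):
# a steady system where every transaction takes 50 ms ...
steady = [50] * 1000
# ... and a spiky one where most take 30 ms but 2% take a full second.
spiky = [30] * 980 + [1000] * 20

if __name__ == "__main__":
    print("steady:", summarize(steady))
    print("spiky: ", summarize(spiky))
```

With these made-up numbers, the spiky system has the lower mean (49.4 ms against 50 ms) and a much lower median (30 ms), yet its p99 is 1000 ms. The mean and median alone would pick the wrong system.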
I'll wrap this up with the 5 fundamental resources:
- CPU
- Memory
- Disk/SSD
- Network
- Software critical section (with many cooperating threads, this would be the code that accesses shared data)
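The last item is the least obvious one, so here is a minimal sketch of what a software critical section looks like, using Python's `threading` module. The `run_counter` function and its parameters are hypothetical, made up for this example.

```python
import threading


def run_counter(n_threads=4, increments=10_000):
    """Have several threads increment one shared counter under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Critical section: only one thread at a time may touch the
            # shared counter, so the other threads wait here.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_counter())
```

The lock keeps the count correct, but it also means threads queue up behind each other. That waiting is why the critical section behaves like a shared resource, just like CPU or disk.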
Thoughts
Don't be afraid of the latency numbers I mentioned earlier. You can get started by familiarizing yourself with the orders of magnitude. This helps with some basic ideas, like reading from memory being several orders of magnitude faster than reading from disk. I was hoping to discuss a bit of chapter 2 here but I'll save that for another day. So far we are starting to develop a mental model of how to think about software dynamics. Measuring comes with its complexities, but at least we know the role it plays, along with our system's performance expectations. I'm taking my time to digest all the information from these chapters. The pace is quite reasonable, so I can work towards a deeper understanding rather than a superficial one. I have a feeling that, once we settle on some fundamental knowledge, things will pick up in an interesting way!