Why chaos engineering isn't as chaotic as it sounds
Don't let the name fool you. SSQ editors chatted with author Mikolaj Pawlikowski about why chaos engineering can be easily incorporated into your testing environment.
Chaos engineering complements rather than replaces other forms of software testing. In a way, the process is often the inverse of happy path testing, which focuses on scenarios where a software system functions as it should. Chaos experiments instead try to test what happens when things go wrong.
That's the view of software engineer Mikolaj Pawlikowski, who is the author of the new book Chaos Engineering: Site reliability through controlled disruption. On a recent call with the SearchSoftwareQuality team, Pawlikowski discussed what development teams face when they attempt to adopt and implement chaos engineering. Specifically, we talked about why chaos engineering creates reservations among developers, what those reservations are and how to balance the costs and rewards of chaos experiments.
Some dev teams never consider how chaos engineering can be helpful because they assume it requires a high level of maturity or a great amount of disruption. Upon hearing the word chaos, many developers fear the process will increase confusion and uncertainty.
Pawlikowski believes that chaos experiments can do the opposite -- if they're performed scientifically. This means actions should be well-documented and deliberate, not improvised. He makes the case that all will be safe if a team exercises common sense and chooses the scope of its experiments deliberately. Chaos engineering adopters that act accordingly should experience a strong return on investment.
Click here to read an excerpt from Chaos Engineering: Site reliability through controlled disruption by Mikolaj Pawlikowski.
Ryan Black: If you're a software tester and have ever needed to identify the weaknesses in a software system, you might have thought about giving chaos engineering a try. Chaos engineering is a process to see if a computer system breaks when faced with unexpected disruptions. The process is typically done via controlled experiments meant to simulate random and unpredictable behavior. A common question software developers have is whether their team is ready to perform chaos engineering.
Mikolaj Pawlikowski is an experienced software engineer and developer, as well as the author of a new Manning book titled Chaos Engineering: Site reliability through controlled disruption. In the interview you are about to hear, we pick Miko's brain on how involved chaos experiments are, whether the process is a good fit for a dev team, the common concerns teams have, and more. Let's cut straight to that interview.
Ryan Black: Thanks for joining us, Mikolaj. We might as well dive right into it. Our first question for you is: In your experience, what are some of the complexities, practices, etc., that might hinder a team's ability to perform chaos engineering or at least perform it at the scale the team might want?
Mikolaj Pawlikowski: So, I find that most of the people who ask me that kind of question, they typically expect some kind of technical response like, 'Oh, you're not mature enough, because you're not satisfying this and that criteria' -- that, 'You haven't tested this display and, therefore, you're not ready for chaos engineering.'
But what I actually believe to be true is the opposite, that you can kind of jump right into that, regardless of where you're at [on] the maturity scale. It's just that, obviously, you need to apply common sense. And what's typically blocking people from starting is that they have [these] interesting ideas about what it's supposed to be… like, 'Oh, yeah… We should jump right into production and start smashing things and introducing latency and all of that. Otherwise, it's not chaos engineering.' But that's not really what I believe to be the case.
[People say] we're not mature enough, or we don't have this big, you know, Netflix or Google-scale deployment, therefore, it's not applicable for us. [That's] probably what's holding people back more from ... dipping their toes and testing the waters, rather than anything actually technical. Because ... I spend a lot of time convincing people now that you can get quite a lot of value for a small investment.
Tim Culverhouse: That leads into a follow-up question for you, Mikolaj. Where would you say that concern started from? Because you mentioned that it's not really the technical holdups that are preventing a lot of enterprises from making the jump into chaos engineering. Is it how the idea is marketed and talked about? I'm just curious what your thoughts are about why that mindset has become such a hindrance to adopting chaos engineering.
Pawlikowski: Starting from the very beginning, just the name has this kind of element that makes a lot of decision-makers nervous, right? Nobody wants to have said, 'We're introducing chaos.' And that's also something that I often hear, this response where people go, 'Oh, yeah, we already have enough chaos on our own.' Which just kind of shows that they didn't really get what we're trying to do with chaos [engineering].
That's already a little bit of a misnomer, because we're actually trying to reduce the amount of chaos by introducing this kind of scientific approach [where] we test and observe things. And then we reduce the amount of uncertainty rather than adding to it, which the name kind of suggests. So that's definitely already ... from the beginning, a little bit of a problem. But once you pass that, I think the second thing that's kind of prominent is that a lot of [the] initial communications were through blog posts, and people covering things like Chaos Monkey, that was... like v1 of chaos engineering, and was... more or less randomly taking VMs down. So, a lot of this was centered around this idea of going all in and going and randomly turning things off in production, which is pretty radical, right?
So, a lot of [these] initial communications were radical because it makes for good headlines and for ... entertaining reads, but that was a few years ago. We've kind of matured in this sphere, in this ecosystem, with new tools and more of a critical mass of people and companies jumping into that. So, I think this is calming down now. And we're trying to treat it a little bit more like ... all of the boring aspects of running software in general, because ... boring is good if you want this to be adopted by big companies and to be a standard practice.
Black: It sounds like chaos engineering can be easier for teams if they both use common sense and are not overly cautious, like cautious to the extent that they're almost ... superstitious. What are some other measures teams can implement to make chaos experiments less of a hassle?
Pawlikowski: When you really think about it, what you are really after when you're running software, regardless of the scale and regardless of the stakes, is just peace of mind, right? You want to be as confident with what you're running as possible.
And I think the way to look at tools like chaos engineering or ... all the other things that are fashionable, is to basically look at them as something that helps you with that goal in mind, right? So, we have best practices when we deploy software, and chaos engineering is not any different. It's the same kind of code as any other, right? So if you apply the same principles of deploying slowly, progressively going through the stages and whatever your usual precautions are, and you go and kind of do it step by step, you're not risking being too hasty in rolling that out.
So that ... kind of goes again back to this, 'Oh, it has to be improvised in order to count as chaos engineering,' which is a myth.... It's important to look at it as just one of the many tools that help you, and it's potentially very powerful and also reasonably cheap to do, because the return on investment that you get from a potentially cheap [chaos] experiment that can uncover something potentially catastrophic is great. It's one more thing to add to your toolkit. And, you know, it gives you that enhanced peace of mind.
Culverhouse: And it's interesting that you mentioned the cost element as well, Mikolaj, where, by the name alone, you would expect something like this to inherently bring up chaos and concern and breaking and possible cost ramifications. So, it's interesting to hear that dichotomy from you as one of the major hindrances to adoption of this mindset.
Pawlikowski: Yeah, exactly. So, this is what I find pretty interesting... when you look at the return on investment, you have to put some kind of number on the value and on the risks. So, you can really mitigate the risks by just applying the same best principles to deploying it and thinking about things like blast radius, which is what we typically talk about in chaos engineering: the amount of stuff that can get affected by your change. And it's kind of the same as with any new release of your software. And so this balance between the cost and potential reward is sometimes not obvious. And I find it's very useful if you actually try to put some dollar value on that, because dollars speak to managers in general. It's good to be able to compare.
So, sometimes it's straightforward. You know, I remember some articles from a couple years ago, when they were estimating how much Facebook is losing per hour of downtime, right? It was in millions of dollars. And once you have something like that, then you can kind of make your calculation, and depending on what you're doing, you're going to probably look at it slightly differently. And you have the entire spectrum of what your business is doing. If the potential fallout for you is lost money, it's kind of straightforward. You can probably calculate that.
On the other hand, if you're running a hospital, introducing a problem can risk a life, so that's probably something that you shouldn't be doing. … But in general, you're going to find yourself somewhere in that spectrum, where you can estimate whether it's useful for you to fail early, potentially … introduce that failure, potentially even cause some trouble, but find the problem beforehand. And there are kind of auxiliary factors to take into account, too. Because if you don't discover a problem, and it goes unnoticed for a long period of time, what's probably going to happen is that some of your clients will eventually run into it, and they're going to discover it for you. And this is problematic for a number of reasons, one of which is that fixing it is probably going to be a bigger problem than if you had discovered it during working hours.
If you practice chaos engineering when everybody's in the office, during working hours, then even if the worst-case scenario happens and what we uncover has some effect on the end users, there's no penalty of context switching, of waking up at night… Everybody who's ever been [on] pager duty knows that very well. Nothing's particularly pleasant about being called at night, right? You have to wake up, do your coffee, people are panicking. There is noise, all kinds of things. You have to log in, it's a different computer, so it typically takes longer than it would if you were at the office. So, if you try to sum up all of [these] factors, you're probably going to find that failing early, even in the worst-case scenario, and discovering something even in this kind of all-out scenario, when you're so confident with your testing stages and staging and dev, whatnot, whatever your release cycle is, that you're actually doing chaos engineering in production, the Holy Grail, even if you mess something up, it's probably going to be better in certain scenarios than waiting for actual clients to find it for you.
Black: What factors should a team weigh when determining the scope of its chaos engineering experiments? Or, put another way, what parts of the application and system should a team focus on?
Pawlikowski: So, I find that this is one of the more exciting parts of chaos engineering. You can do it at so many different levels, and if you're actually into that, you're going to end up looking at the different styles and languages and different levels.
And that's one of the bigger messages of my book. I'm trying to illustrate that you can do it at various levels. And depending on what makes sense for you, you can kind of go and pick and choose. It goes from the top level: you can build chaos engineering directly into your application. Sometimes it makes sense, sometimes it's cheaper and easier, or sometimes you're just an application team and you don't actually have access to the underlying platform, or maybe [you're] using someone else's managed platform, right? Then you can use it at the level of language interpreters. If you're using the JVM, for example, you can use a Java agent, and even without touching the source code, you can, on the fly, modify the behavior and trigger some events that you would expect to see in problematic situations. And you can test things like that. And then you can go all the way down to the level of the system; regardless of what you're running, things have to use system calls, or syscalls. You can implement chaos engineering at that very level. … You can treat things like black boxes, even if it's kind of like from a vendor and you get it precompiled and everything.
So, there's this entire spectrum of things that you can do. And what you're going to look into is what ... is low cost for you, what you have access to and what makes the most sense for you, depending on the situation. And I find, personally, most of the time that I spend is on the platform side of things, because it's kind of nice to be able to build some of that directly into the platforms if you own a platform, because then you know everybody gets that for free. That's just one of the use cases that are possible. … You have an entire spectrum of possibilities depending on what your area of expertise is, what you have access to, and... any level really. I have a chapter, for example, about doing this on the front-end code. Even if you're a front-end JavaScript developer, you can build things in to do chaos [experiments] and you can verify how things behave when failure is actually happening. So, the answer is really all over the place; you can get value out of it if you just take a step back and look at it from the kind of return-on-investment point of view.
Culverhouse: And going off the return-on-investment element of things, you mentioned earlier on how the cost of chaos engineering really can be calculated by the return on investment. So, would you say that there's either a benefit or a disadvantage to how [widely] you implement chaos engineering across these different elements of your application? Because without that heavy, upfront cost of having to do it on the front end, the back end, whatever area you're doing it in, you're not exactly racking up the billing hours with this, are you?
Pawlikowski: Over the last few years, the tools have become much better. So, it's becoming increasingly easy to go and start doing something. If you have a small developer team, you can pay someone for a commercial tool, and they'll be able to start chaos testing things within minutes.
And then there are open source tools that are just getting better and better. There are certain small pockets, ecosystems, like Kubernetes, for example, which became a de facto API for deploying things. In recent years, a lot of the work around chaos engineering has also been built around Kubernetes, probably just because of the popularity and the scale, and the fact that you can do things programmatically; you can automate a lot of that. The primary driver here is that the barrier to entry and the cost of entry have become so low now that getting started and getting your feet wet is cheap and easy. With all of these chaos experiments, you basically have two outcomes, right? Either you found the problem, and then it's good news, because you found it before your clients did, and you can fix it, hopefully without too big a cost in terms of actually breaking things for users. Hopefully, you caught it at one of the previous stages.
With the diminishing cost of entry, and the potential rewards that are pretty high when you discover things, it's just becoming ridiculously cheap to do this kind of thing. Before, if you wanted to do some lower-level stuff, you would have to actually have people who are experts in, say, the Linux kernel and know how these things work. Now, with the kind of higher-level tools, a lot of that has already been automated. And, therefore, it's just super easy to get started.
Black: You brought up tools. In what ways do a team's tools on hand or the data constraints they're dealing with shape its ability to perform chaos experiments, or its limitations in performing them?
Pawlikowski: In my view, the kind of higher-level tools that are being built are mainly about convenience. Because, when you start with chaos engineering, you don't really need anything too fancy. The simplest chaos experiments that we talk about in my book are just a few lines of bash code that you can run and verify the outputs. But then, obviously, at scale it becomes tedious if you do it, for example, in a kind of long-running, automated manner. One of the nice ways of doing chaos experiments is that you can have them ongoing; you can just automate something and verify continuously that your platform or your product satisfies some kind of SLO or some kind of characteristic. And sometimes even as a way of regression testing: you know that something broke in the way that the entire system behaved, and you fixed it, and you just want to ensure that the system is now immune to that kind of situation.
So, you can simulate that situation, and you can have an experiment that's going to continuously create that situation and verify that your system continues working. To get started, you don't really need anything too fancy, right? But then as you go and you scale up ... what you're doing, behind the scenes, all of these tools are becoming easier and easier to use. I'm partial to PowerfulSeal, a tool that we open-sourced a few years ago that works with Kubernetes and cloud providers. You can just specify a YAML scenario with a bunch of steps: take down this, create this and verify that it keeps working the way that you want; take VMs up and down; ... quite advanced scenarios that you can implement by just writing a very simple list of steps. It's making it much more convenient.
There are other tools like Litmus, which is probably worth mentioning and is also becoming more and more advanced. And that's, again, around Kubernetes. And things like Pumba, which makes it easier to work with Docker containers by automating some of that. And then there are the higher-level tools like Gremlin, where you pay some money and you have a nice UI. … So, it's all about the convenience, right? And my thinking is that the easier it becomes [to get started] and the more straightforward it is, the more people will jump into that, and the easier it will be for the discipline as such to reach critical mass. But, like I said, if you just want to get started, all you need is a bunch of lines of bash.
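Editor's note: As a rough sketch of what 'a bunch of lines of bash' might look like in practice, here is a hypothetical minimal experiment built on the steady-state, inject, verify pattern Pawlikowski describes. The endpoint, process name and timings are illustrative assumptions, not examples from the book.

```
#!/usr/bin/env bash
# Minimal chaos experiment sketch: check steady state, inject a failure, verify the hypothesis.
# ENDPOINT and VICTIM are illustrative assumptions; replace them with your own service details.

ENDPOINT="http://localhost:8080/healthz"   # assumed health-check URL
VICTIM="my-app-worker"                     # assumed name of a redundant worker process

# 1. Steady state: the endpoint should return HTTP 200 before we touch anything.
before=$(curl -s -o /dev/null -w "%{http_code}" "$ENDPOINT")
if [ "$before" != "200" ]; then
  echo "Steady state not met (got $before); aborting experiment"
  exit 1
fi

# 2. Inject the failure: kill one instance of the supposedly redundant worker.
pkill -f "$VICTIM"

# 3. Hypothesis: the endpoint keeps responding while the worker is down or restarting.
sleep 5
after=$(curl -s -o /dev/null -w "%{http_code}" "$ENDPOINT")
if [ "$after" = "200" ]; then
  echo "Hypothesis confirmed: the system survived losing $VICTIM"
else
  echo "Hypothesis rejected: got HTTP $after, so we found a weakness before a client did"
fi
```

Either outcome is useful: a confirmed hypothesis builds confidence, and a rejected one surfaces a weakness during working hours rather than at 3 a.m.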
Culverhouse: You hit on this earlier, Mikolaj, about testing the code under your own umbrella in your enterprise. That led me to ask: How would a tester or a developer go about testing with chaos engineering [for] source code outside of their control versus code that's under their control? At what point are there concerns, maybe about security, compliance or regulatory issues, with these tests that a chaos engineering team needs to be aware of? And how would you mitigate those types of concerns?
Pawlikowski: That's a good question. So, this also goes back to one [question] I get a lot; people ask, 'Okay, so does it actually replace other ways of testing?' And the answer to that is no, because when you think about it, it's about different levels. When you go with your unit tests, you focus on a particular piece of code, right? Be that a functional piece of code that you want to ensure works under specified conditions. It's a very narrow focus, right? Then, when you go up in scale, you're probably going to look at some kind of integration tests, when you take components, put them together and verify that they work. And then probably the step up from that is when you do some kind of end-to-end testing, right? When you take the system as a whole, and you start observing the behaviors of all of these components put together, in a way that you expect it to work in a production environment, hopefully mimicking that, right? And then the way that I see chaos engineering is one layer above that -- one layer further. It's kind of like end-to-end testing, except that instead of focusing on the normal operation, on the happy path, it's kind of like the unhappy, the sad path: when you introduce the problems, a lot of the runtime problems, the resource starvation and race conditions, and what we call emergent properties, where the interactions between the components and the system have the effect of producing properties that you didn't initially expect or account for or design for. So, in that kind of respect, it's a complement to the other forms of testing rather than a replacement, right? Keep doing unit tests. You're not getting out of that just because you started doing chaos engineering.
But kind of going back to the question of code being under your control and not under your control. I was kind of trying to allude to that with the example of using the layer of syscalls for testing and designing chaos experiments: you can do a lot of stuff that's behavioral, rather than looking at the code. Because you can look at it as a black box and be more concerned about what it's supposed to be doing. And make experiments based on how it's supposed to recover from things, how it's supposed to handle load, which is kind of an overlap with the performance testing domain in general, and how it's going to behave under stress: what happens when there aren't enough resources, what happens when the retries are triggered, and that kind of stuff…
Really, the answer is that you can do it in both scenarios, whether you have control over the code or not. Obviously, if you're doing chaos experiments in terms of, like, benchmarking or load testing of someone else's system, that might potentially be a problem. But from the technical point of view, it's kind of nice that you have this entire spectrum of things you can do.
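Editor's note: For readers curious what a behavioral, black-box experiment at the system level might look like, here is a hypothetical sketch that injects network latency with Linux's tc/netem tooling and watches how an endpoint responds. The interface name and URL are assumptions, it requires root access, and it should only be run against systems you own.

```
#!/usr/bin/env bash
# Black-box experiment sketch: inject network latency with tc/netem (Linux, needs root)
# and observe how an endpoint behaves under it. IFACE and TARGET are illustrative assumptions.

IFACE="eth0"                               # assumed network interface
TARGET="http://localhost:8080/orders"      # assumed endpoint that depends on a downstream call

# Inject 200 ms of latency on traffic leaving this interface.
sudo tc qdisc add dev "$IFACE" root netem delay 200ms

# Observe behavior under latency: does the call time out, retry or degrade gracefully?
curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" "$TARGET"

# Always clean up so the interface returns to its normal state.
sudo tc qdisc del dev "$IFACE" root netem
```

Because the failure is injected below the application, the same experiment works whether or not you control the source code of the service being tested.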
Black: We did want to ask you about a term that came up in the chapters we read, and that is RPS, which I believe means responses per second. I was wondering if you could just elaborate on the role and utility of that metric.
Pawlikowski: Well, it could be response or request per second. It's just one of the metrics that you can look at and ends up being pretty useful if you work with some kind of request/response server. … And it goes something like this. First, you need to ensure that you can observe things because -- I keep driving this point home, but if you're not applying scientific principles to these experiments, they're not really experiments, you're just messing around, right?
And there's a quote on LinkedIn I like: 'If you [are] not writing it down, it's not science. You [are] just messing around,' kind of suggesting that the difference between doing science and messing around is just taking notes. But it's kind of true that when you design chaos experiments, one of the things that you really want to make sure you can do is observe things reliably. So, if you want to be able to come up with a hypothesis that you can either ... confirm or deny, you're going to have to be able to reliably measure something. The variables that you look at are typically some important characteristic of the system. And requests per second is a very popular one, because it's easy to measure. And it's easy to look at and compare, right?
But it could be anything. It could be some kind of throughput. It could be the number of bytes you write or send. It could be the number of hashes per second. … Once you have that variable, you can establish the steady state, which is just a fancy way of saying the system's normal range, and you can hypothesize. You can say, 'Okay, so if I have X number of servers and I take down a certain percentage, I would still expect to satisfy this many requests per second.' And you can go measure that and verify whether you're wrong or right. So, yeah, RPS is just kind of one of the various metrics that are pretty useful when you're working with any kind of request/response servers.
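Editor's note: As a hypothetical illustration of using requests per second as the observed variable, this sketch measures a steady state, leaves a spot for the disruption, then re-measures and checks the hypothesis. The endpoint, duration and threshold are illustrative assumptions.

```
#!/usr/bin/env bash
# Steady-state sketch using requests per second as the observed variable.
# ENDPOINT, DURATION and EXPECTED_RPS are illustrative assumptions.

ENDPOINT="http://localhost:8080/api"   # assumed request/response endpoint
DURATION=10                            # seconds per measurement window
EXPECTED_RPS=50                        # hypothesis: at least this many successful requests per second

measure_rps() {
  # Count successful (HTTP 200) requests over DURATION seconds, then report the per-second average.
  local count=0
  local stop=$((SECONDS + DURATION))
  while [ "$SECONDS" -lt "$stop" ]; do
    code=$(curl -s -o /dev/null -w "%{http_code}" "$ENDPOINT")
    [ "$code" = "200" ] && count=$((count + 1))
  done
  echo $((count / DURATION))
}

echo "Steady-state RPS: $(measure_rps)"

# ... introduce the disruption here, e.g. take down a percentage of the servers ...

actual=$(measure_rps)
if [ "$actual" -ge "$EXPECTED_RPS" ]; then
  echo "Hypothesis confirmed: $actual RPS is at or above the expected $EXPECTED_RPS RPS"
else
  echo "Hypothesis rejected: only $actual RPS, expected at least $EXPECTED_RPS RPS"
fi
```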
Black: Thank you, Mikolaj. Before we let you go, can you just remind folks the title of your book, where people can find it and where also maybe people could follow you online?
Pawlikowski: Sure thing. So, the book is called Chaos Engineering: Site reliability through controlled disruption, and it's a mouthful. It's from Manning, available at manning.com and Amazon. If you would like to reach out, mikolajpawlikowski.com is my landing page. You can find me on LinkedIn. And if you'd like to follow news about chaos engineering, you can go to chaosengineering.news. That's my mailing list. You can join and get an occasional update on all things chaos engineering.
Black: Thank you so much, Mikolaj.