How Your Systems Keep Running Day After Day – John Allspaw


My goal today is twofold. One, I’m intending to challenge you. I’m hoping to provoke new thoughts, new questions
in your mind. If, by chance, any of these new questions
give rise to some anxiety, I want to assure you that that’s quite normal. Don’t worry. We’ll get to some sort of resolution at the
end. The anxiety may remain, but before we get
started you’ll notice I changed the title of the talk to How Your Systems Keep Running
Day After Day, because that’s really the general gist of this. Before we get started, I want to start with
something. Can everybody read this? I don’t know why everybody’s laughing. I want to ask … Don’t worry, it’s rhetorical,
because I have the microphone. Is this dangerous? Well, at the very least, my expectation’s
that you’d say that it depends, right? Much like Nicole was saying earlier. All right, so let’s take another one, a little
bit more complicated. Right? What we see here is a diff. You see a change. This change is to an HTML comment. Change the case on the K. Right? Is this dangerous? Would your answer change if I tell you that
this is for a load balancer health check? Okay, so let’s get started. The point of both of those is that all work
is contextual. “It depends,” is an answer we give quite a
lot, and that’s important. We’ll come back to this. Here’s a slide about me. I won’t spend too much time on it. Here are some of the places I’ve worked and
things I’ve written, some places that I’ve studied. As Gene mentioned, I gave this talk, though
I want to just point out that the last time I felt so strongly about the topics that I’m
about to talk about was 2009 when I gave that talk with Paul Hammond. What I want to talk about is new. It is different, and I feel very, very strongly
about this. Another piece that might be relevant is my
degree in Human Factors and System Safety. My thesis was Trade-Offs Under Pressure: Heuristics
and Observations Of Teams Resolving Internet Service Outages. This helps set the stage, I guess, a little
bit. I don’t want you to worry too much about this. Some of you may have
heard of what’s called the Stella Report. I’ll put the link up later. At a high level, this report is the result
of a year-long project of a consortium of industry partners: IBM, Etsy, and IEX, a trading
exchange in Manhattan. Over this year, folks from the Ohio State
University Cognitive Systems Engineering Lab, David Woods, Richard Cook, and a number of
other folks looked deeply at an incident in each of those organizations. Despite the fact that those organizations
differ from a funding, a resourcing, a market, and a population standpoint,
they found these six themes that were common across all of them. What’s most important is … Certainly the
results are quite important. It’s how that research was done that I want
you all to take a look at a little bit later, and yeah, just as a quick little bit of a
cliffhanger, postmortems as recalibration. I’m going to talk a little bit about that. Blameless versus sanctionless. Controlling the cost of coordination. Visualizations, strange loops, and something
that I want to pique your interest on. Dark debt. Okay, so that’s the Stella Report. Here are my main points that I’m going to
give you. One, we have to start taking human performance
seriously in this industry. If we don’t, we will continue to see brittle
systems with ever-increasing impacts on our businesses and on society. Number two is that we can do this by looking
at incidents going beyond what we currently do in postmortems or post-incident reviews
or after-action reviews or whatever the hell you’d call them. Number three is that there do exist methods
and approaches from the study of resilience in other domains, but they require real commitment
to pursue. I’m going to talk about this. Doing this is both necessary and difficult,
but it will prove to be a competitive advantage for businesses who do it well. That’s the high level. First, I want to start with a little bit of
a baseline, a bit of a vocabulary that’s going to be important as I sort of walk you through
this. I’m going to describe a sort of picture, a
representation, like a mental model of your organizations, and it’s going to have an above-the-line
region and a below-the-line region. We’ll start with this. If you imagine what we have depicted here,
don’t worry about it being a cloud. Just think of it as like a bubble. What this is here is your product, your service,
your API, or whatever your business derives value from and gives to customers. Okay? Inside there, what you see is your code. You see your technology stack. You see the data and some various ways of
delivering this, right? Like presumably over the internet or some
other sort of way. Now, if we stay here, nobody’s going to
believe me that that’s what we call the system, because it’s fine, but it’s not really complete. What’s really connected, and
what a lot of people have been talking about here in this community in the last couple
of days, is all of the stuff that we do, and this is really familiar, all the stuff
we do to manipulate what goes on in there. We have testing tools. We’ve got monitoring tools. We’ve got deployment tools and all of the
stuff that’s sort of wired up. These are the things that we use. You could say that this is the system, because
many of us spend our time focused on those things that are not inside the little bubble
there, but all of the things that are around it, but if we were to stay just with this,
we won’t be able to see where real work happens. What we’re going to do here is, we’re going
to draw this line, a line that we call the line of representation, and then play
with this a little bit further. What we see here is you. All the people who are getting stuff ready
to add to the system, to change the system. You’re doing the architectural framing. You’re doing monitoring. Right? You’re keeping track of what it’s doing, how
it’s doing it, and what’s going on with it. Now, you’ll notice that each one of these
people has some sort of mental representation about what that system is. If you look at it a little bit more closely,
you’ll see that none of them are the same. By the way, that’s very characteristic of
these types of roles. Nobody has the same representation of what
is below the line. To summarize, and again, a little bit of a
view here, your product or service is here. This is the stuff you build and maintain with,
and here’s where work actually happens. This is our model of the world, and it includes
not just the things that are running there, but all of you, the kinds of activities you’re
performing, the cognitive work that you’re doing to keep that world functioning. If we play with this a little bit more, we
end up with this kind of model. This model has a line of representation going
through the middle, and you interact with the world below the line via a set of representations. Your interactions are never with the things
themselves. You don’t actually change the systems. What you do is that you interact with the
representation and that representation is something about what’s going on below. You can think of those green things as the
screens that you’re looking at during the day, but the only information that you have
about the system comes from these representations. They’re just a little keyhole. Right? What’s significant about that is that all
the activities that you do, all of the observing, inferring, anticipating, planning, correcting,
all of that sort of stuff has to be done via those representations, so there’s a world
above the line and a world below the line, and although we mostly talk about
the world below the line as if it’s very real, as if it’s very concrete, as though
that’s the thing, here is the surprise. Here is the big deal. You never get to see it. It doesn’t exist. In a real sense, there is no below the line
that you can actually touch. You never, ever see code run. You never, ever see the system actually work. You never touch those things. What you do is that you manipulate it in a
kind of, well, not imaginary, it’s not imaginary, it’s very real, but you manipulate a world
that you cannot see via a set of representations, and that’s why you need to build those mental
models, those conceptions, those understandings about what’s going on. Those are the things that are driving that
manipulation. It’s not the world below the line that’s doing
it. It’s your conceptual ability to understand
the things that have happened in the past, the things that you’re doing now and why you’re
doing those things, what matters, and why what matters matters. Once you adopt this perspective, once you
step away from the idea that below the line is the thing you’re dealing with, and understand
that you’re really working above the line, all sorts of things change. What you see in the Stella Report and that
project and other projects that we’ve been engaged with is taking that view, and understanding
what it really means to take the above-the-line world seriously. This is a big departure from a lot of what
you’ve all seen in the past, but I think it is a fruitful direction that we need to take. This is the bit here that I want to bring
your attention to. In other words, these cognitive activities
in both individuals and collectively in teams up and down the organization are what makes
the business actually work. Now, I’ve been studying this in detail for
quite a while here, and I can tell you this. It doesn’t work the way we think it does. Finally, to sort of set this frame up, the
most important part of this idea is that all of this changes over time. Right? It is a dynamic process that’s ongoing. This is the unit of analysis, the one you
have in your mind as we were talking through here. Once we take that frame, we can ask some questions. We can ask some questions about above the
line like this. “How does our software work really, versus
how it’s described in the wiki and in documentation and in the diagrams? We know that those aren’t
comprehensively accurate.” “How does our software break really, versus
how we thought it would break when we designed safeguards and circuit breakers and guardrails?” “What do we do to keep it all working?” Question. Imagine in all of your, imagine all of your
organizations today, starting today, imagine at six o’clock all of your companies, hands
off keyboard. They don’t answer any pages. They don’t look at any alerts. They do not touch any part of it, application
code or networks or any of it. Raise your hand if you think, raise your hand
if you are confident that your service will be up and running after a day. I thought I’d actually see more hands. For those of you, raise your hands really
high. All right, there you go. Keep it, wait, keep your hands up if you think
it’d still be working after a week. You’re not touching anything. You cannot respond to any bit. Keep your hand up if you think that your system
is still going to be running after a month. All right, I want to talk with you two afterwards. The question then is how to discover what
happens above the line. Well, there’s a couple things. We can learn from the study of other high-tempo,
high-consequence domains, and if we do, we can see that we can study incidents. By the way, when I say “incidents,” I mean
outages, degradations, breaches, accidents, near-misses, “glitches,” basically untoward
or unexpected events. What makes incidents interesting? Well, the obvious one is lost revenue and
reputation impacts on a particular business. I want to assert a couple of other reasons
why incidents are interesting. One is that incidents shape the design
of new component subsystems and architectures. In other words, incidents of yesterday inform
the architectures of tomorrow. Right? That is, incidents help fuel our imaginations
on how to make our systems better, and therefore what I mean is, incidents below the line drive
changes above the line. That’s the thing. This can cost real money. Incidents can have sometimes almost tacit
or invisible effects, sometimes significant. Raise your hand if you’re splitting up a monolith
into micro-services. No kidding. Right. A lot of people do that because it provides
some amount of robustness that you don’t have. Where do you get that? You’re informed by incidents. Another reason to look at incidents is that
incidents tend to give birth to new forms of regulations, policies, norms, compliance,
auditing, constraints, that sort of thing. Another way of saying this is that incidents
of yesterday inform the rules of tomorrow, which influence staffing, budgets, planning,
roadmaps. Let me give you an example. In financial trading, the SEC has put into
place Regulation SCI. SCI, I’m just going to go out on a limb and
say, is probably the most comprehensive and detailed piece of compliance in the modern software
era. The SEC has gone and been very explicit. We have this as a reaction to the flash crash
of 2010, to Knight Capital, the BATS IPO, the Facebook IPO. It is a reaction to incidents. Even if you go back a little bit further,
it’s often cited that PCI DSS came about when MasterCard and Visa compared notes, realized
they lost about 750 million over 10 years. By the
way, as a former CTO of a public company, I can
assure you that this is a very expensive, distracting, and inevitably burdensome albatross
for all of your organizations. Incidents are significant in this way too,
but if we think about incidents as opportunities, if we think about incidents as messages, encoded
messages that below the line is sending above the line, and your job is to decode them,
if you think about incidents as things that actively try to get your attention to parts
of the system that you thought you had a sufficient understanding of but you didn’t, these are
reminders that you have to continually reconsider how confident you are about how it all works. Now, if you do, and if you do take this view,
a whole bunch of things open up. There’s an opportunity for new training, of
course new tooling, new organizational structures, new funding dynamics and possibly insights
that your competitors don’t have. I’m going to unpack this a little bit. Incidents help us gauge the delta between
how your system works and how we think your system works, and this delta is almost always
greater than we imagine. I want to assert perhaps a different take
than you might be used to, and it’s this. Incidents are unplanned investments
in your company’s survival. They are hugely valuable opportunities to
understand how your system works, what vulnerabilities in attention exist, and what competitive advantages
you are not pursuing. If you think about incidents, they burn money,
time, reputation, staff. These are unavoidable sunk costs. Something’s interesting about this type of
investment, though. You don’t control the size of the investment,
so therefore the question remains, how will you maximize the ROI on that investment? Switch gears a little bit. When we look at incidents, these are the types
of questions that we hear, and it’s quite consistent with what researchers find in other
complex-systems domains. What’s it doing? Why is it doing that? What will it do next? How did it get into this state? What the fuck is happening? If we do Y, will it help us figure out what
to do? Is it getting worse? It looks like it’s fixed, but is it? If we do X, will it prevent it from getting
worse, or will it make it worse? Who else should we call that can help us? Is this our issue, or are we being attacked? Right? This is consistent with many other fields. Aviation, air traffic control, especially
in automation-rich domains. Another thing that’s notable is that at the beginning
of any incident, it’s often uncertain or ambiguous whether this is the one, this is the
one that sinks us, this is our Equifax moment, this is our Three Mile Island moment. At the beginning of an incident, we simply
don’t know, especially if it contains huge amounts of uncertainty and huge amounts of
ambiguity. If it’s uncertain and ambiguous, it means
that we’ve exhausted our mental models. They don’t fit with what we’re seeing, and
those questions arise. Only hindsight will tell us if that was the
event that brought the company down or if it was a tough Tuesday afternoon. Incidents provide calibration about how decisions
are focused, about how attention is focused, about how coordination is focused, about how
escalation is focused. The impact of time pressure, the impact of
uncertainty, the impact of ambiguity, and the consequences of consequences. Research validates these opportunities. We should look deeply at incidents, “nonroutine
challenging events, because these tough cases have the greatest potential for uncovering
elements of expertise and related cognitive phenomena.” From Gary Klein, the originator of naturalistic
decision-making research. There’s a family of well-worn methods, approaches
and techniques. Cognitive task analysis. Process tracing. Conversational analysis. The critical decision method. How we think postmortems have value looks
a little bit like this. An incident happens. Maybe somebody will put together a timeline. We have a little bit of a meeting. Maybe you’ve got a template, and you fill
that out, and then somebody might make a report or not, and then you’ve got, yeah, action
items, finally. Right? We think that the greatest value, perhaps
the only value, is this. Right? Where you’re in a debriefing and people are
walking through the timeline and you’re like, “Oh, my God. We know all this. Why are we, can’t we just get to the … ”
This is not what the research bears out. The research bears out that if we gather subjective
and objective data from multiple places, behavioral data, what people said, what people did, where
they looked, what avenues of diagnosis they followed that weren’t fruitful, then well-facilitated debriefings get people to
contrast and compare their mental models, which are necessarily flawed. You can produce different results, including
things like bootcamp, onboarding materials, new hire training. You can have facilitation feedback if you
build a program to train facilitators. You might make roadmap changes, really significant
changes based on what you learn. I can tell you this from some experience. There is nothing more insightful to a new
engineer, or an engineer just starting out in their career, than being in a room with a veteran
engineer who knows all of the nooks and crannies explaining things that they may not have ever
said out loud. They have knowledge. They may draw pictures and diagrams that they’ve
never drawn before because they think everybody else knows it. Guess what? They don’t. The greatest value is actually here, because
the quality of these outcomes depends on the quality of that recalibration. This is an opening to recalibrate mental models. From the Stella Report, it “informs and recalibrates
people’s models of how the system works, their understandings of how it’s vulnerable
and what opportunities are available for exploration.” In a lot of the research, in all of the research
contained in the Stella Report, and it fits with my experience at Etsy as well, one of
the strongest reflections from people who do this comparing and contrasting in a facilitated way
is, “I didn’t know it worked that way.” Then there’s always the other, “How did it ever
work?” Which is funny until you realize it’s serious. What that means is, not only did I think
it worked a different way. Now, I cannot even imagine, I can’t even draw
a picture in my mind of how it could have possibly worked. That should be more unsettling. By the way, I want to say this is not alignment. Like I said, via representations, we necessarily
have incomplete mental models. The idea here is not to have the same mental
models, because they’re always incomplete, because things are always changing, and because
they’re going to be flawed. We don’t want everybody to have the same mental
model. Then everybody’s got the same blind spots. Blameless. This is the blog post that I wrote in 2012. “Blameless” is table stakes. It’s necessary, but it’s not sufficient. You could build an environment, a culture,
an embracing, sort of welcoming organization that supports and allows people to tell stories
in all of the messy details, sometimes embarrassing details, without fear of retribution, so that
you could really make progress in understanding what’s happening. You can set that condition
up and still not learn very much. It’s not sufficient. It’s necessary, but not sufficient. What I’m talking about is much more effort
than typical post-incident reviews. Right? This is where an analyst, a facilitator can
prep, collating, organizing, analyzing behavioral data. What people say, what people do. There’s a raft of data that they can sift
through to prep for debriefings, a group debriefing, or a one-on-one debriefing, going beyond … Postmortems
hint at the richness of incidents. Following up on this takes a lot of work. By the way, everyone’s generally so exhausted
after a really stressful outage or incident or event that sometimes everything becomes
crystal clear. That’s the power of hindsight, and because
it seems so crystal clear, it doesn’t seem productive to have a debriefing, because you think you
already know it all. The other issue is that postmortem briefings
are constrained by time as well. You only have the conference room for an hour
or two. Right? Everybody is really busy, and the clock is
ticking, so this is a challenge for doing this really well, even given those research
methods. There’s a handful of … I want to get this,
there’s a handful of things on this slide that people in the back aren’t going to be
able to read. I’m going to read it to you. Don’t worry about it. The other issue is that even if
you build a debriefing facilitation training program like I did at Etsy, there are
still challenges that show up. What I like to call it is, “Everyone has their
own mystery to solve,” or, “Don’t waste my time on details I already know.” In a cartoonish way, you can think about it
this way. This is before you go into
a debriefing. A network engineer might say, “Oh, this outage
seems pretty straightforward. I just don’t know why they didn’t call me
to help sooner.” A DBA might be thinking, “I understand how the
database got wedged. I hope we don’t waste time going over that
part. I just have no idea about how the load balancer
got involved in all this, and I hope we cover that, because that’s the real mystery.” The CEO might think, “Well, I need answers
quick. I got three board member voicemails lighting
up my phone, and we don’t have time to waste on details that don’t matter.” Customer service agent says, “I hope I can
get a word in edgewise in this. I don’t understand how it’s so hard to give
customers updates more frequently, and that’s the real priority here.” The application engineer says, “I’m glad we
used all those feature flags to turn them off, because without them, it could have been
a lot worse. I just don’t understand how the database got
stuck. It’s such a black box!” By the way, you have an hour. Extract as much learning as you can. All work is contextual. The two examples I gave you earlier. The answer is, it depends. Your job to maximize ROI is to discover, explore,
and rebuild the context in which work is done in an incident, how people worked and how they thought
above the line. Assessments are trade-offs, and those are
contextual. All right. We are almost at time here. I’m going to leave you with a couple of ideas
here. All incidents can be worse. A superficial view is to ask, “What went wrong? How did it break? What do we fix?” These are very reasonable questions. If we were to go a level deeper, we
could ask, “What are the things that went into making it not nearly as bad as it could
have been?” If we don’t pay attention to those things
and don’t identify those things, we might stop supporting those things. Maybe the reason why it didn’t get worse is
because somebody called Lisa, and Lisa knows her shit. Something from research is that experts
can see what is not there. If you don’t support Lisa, and you don’t even
identify that the reason why shit didn’t get worse is because Lisa was there, imagine a
world … Forget about action items for fixing something. Imagine a world where Lisa goes to a new job. Useful at a strategic level is a better question. “How can we support, encourage, advocate,
and fund the continual process of understanding our systems, and really take above the line seriously
in a sustained way?” Here are some challenges for you. One, circulate the Stella Report in your company
and start a dialogue. Even if you’re too busy or you’re not in a
position to read it yourself, give it to people who are. Ask them what resonates. Ask them what doesn’t make sense. Ask them what makes them say, “Yes, this is a thing that I didn’t
have words for before, but now I do.” Start a dialogue. Second, look deeply at how you’re handling
post-event reviews, and most importantly, go find the people who are the most familiar
with the messy details of how shit gets done and ask them this, “What value do you think
our current post-incident reviews really have?” and listen. Will you learn more and faster from incidents
than your competitors? You’re either building a learning organization
or you’re losing to one who is. Lastly, we need to take human performance
seriously. This discussion is happening. It’s happening in nuclear power. It’s happening in medicine. It’s happening in aviation, air traffic control,
in firefighting. All right? The increasing significance of our systems,
the increasing potential for economic, political, and human damage when they don’t work properly,
and the proliferation of dependencies and associated uncertainty all make me very worried. If you look at your own system and its problems,
I think you’ll agree that we have to do a lot more than acknowledge this. We have to embrace it. What you can help me with, please spread this
presentation and these ideas. Come to me. What resonated with you about this? What didn’t? What challenges do you face in your org along
these lines? Come tell me. I’m on Twitter. You can find me. It’s quite easy to find me. It’s actually quite hard to get rid of me. The last is that if you’re interested in talking
more with the group that I’ve been working with, we have two different vehicles. The Stella Report was produced by this consortium. This is otherwise known as the SNAFUcatchers,
and for direct partnership in a more traditional sort of consulting and training situation,
we’ve launched Adaptive Capacity Labs. Thank you
for listening.

About the Author: Oren Garnes
