Getting Unstuck: Routine Postmortems
This post is in the Useful Ideas series.
From Reactive to Proactive
Cybersecurity often gets stuck:
- triaging the same types of alerts and events that came yesterday
- gathering the same kind of data to answer questions that were asked in the last audit
- finding more of the same classes of security bugs in software
- playing whack-a-mole patching the worst vulnerabilities
- writing up risks similar to ones found during the last business initiative
If you aren’t in cybersecurity, you’ve probably seen a similar predicament in your domain. When a team is constantly in reactive mode, it feels inescapable: “I can’t drop any of these required activities. Where would I find the time to change things?” But finding time to invest in doing less of these activities is the only way to break the cycle.
In my work with organizations, I use several methods to break this cycle, but I want to share one option with you here: routine postmortems.
You’re likely familiar with a variety of techniques for diagnosing how something went wrong: Five Whys, Root Cause Analysis, Blameless Postmortems, and more. If run well, these can be very powerful tools for understanding the process, system, and even cultural problems that led to a breakdown. They also level up the organization through knowledge sharing and by implementing changes that stop the same type of problem from resurfacing.
However, if your organization is in reactive mode, you probably don’t feel like you have the bandwidth to perform these analyses for anything but the very worst events, and even when you do, you struggle to find time to implement any changes that would prevent the problem from recurring.
Shifting from ineffective formal emergency retrospectives to ongoing proactive analysis requires a cultural shift.
Routine Postmortems allow you to start to build the change into the fabric of your organization.
Implementing
Start small. Select a recurring time for the team to meet (somewhere between weekly and quarterly), or reserve some time in an existing meeting cadence. In preparation for the meeting, the team(s) should bring one reactive event from the last period for each functional area, along with an initial analysis. They should not spend much time on the analysis (less time than the meeting itself is booked for). It’s better if these are not full emergency incidents. For example, within cybersecurity, items could include:
- a SOC alert
- an audit or pentest finding
- a vulnerability
- a configuration error
- a design weakness
- an unmitigated risk
Walk through a light version of your chosen diagnostic process, giving extra opportunity for team members to ask questions with nonjudgmental curiosity and to chime in with additional ideas on what upstream changes could have avoided the problem.
Go through as many categories as you have time for, but it’s OK if it’s just one! (Simply rotate the focus category next time.)
It’s better if this isn’t heavy with formality, but it’s perfectly fine to jot down takeaways and decisions.
When It’s Working
- preparing for the meeting, and even knowing they will be preparing for the meeting, gets people re-orienting their mindset towards problem prevention over firefighting
- silos of problem-responding are breaking down, people are learning from each other, and solutions (which often require cross-team collaboration) are emerging
- people are seeing that “an ounce of prevention is worth a pound of cure”, and that the effort invested is freeing up time for more strategic work
Failure Modes
Blamestorming: it’s tempting to criticize an initial analysis or to focus on people instead of problems. Use your facilitation skills to align the conversation to the future, not to the past. Practicing this routine will help the team improve at both analysis and problem-solving.
Giving Up on Ops: routine postmortems aren’t an excuse to stop responding entirely. Remember this is a method to carve out a slice of time to turn the ship.
Lack of Follow-Through: if people are having great ideas, but nothing is changing, then you likely have execution problems. That’s another topic, but one thing you can try is picking the quickest win from each session and making sure the right person has ownership to drive that result.
Something Else?: If you’ve tried this and run into other problems that you want to debug, I’m happy to help!
Reminds me of the "Upstream" book here: www.goodreads.com/book/show...