Being on call for incidents can be incredibly stressful. You can be paged at any time of day or night, usually because a critical system is performing poorly and causing customer disruption. The objective is always to restore systems to their nominal state, but a lot of discussion and detail comes up along the way, and it gets lost unless you write it down.
My company has the concept of an incident coordinator, whose job is to coordinate (shocking). If a system is down and we need to page in a site reliability engineer or a database engineer, I need to bring them up to speed on the details quickly. Here is the system I have developed in Notion to help me keep track of the notes, the people, and the links needed later for retrospectives.
Keep It Simple
As I have done with other things, such as my dossier, I want to keep things as simple as possible when dealing with an incident. I don't want to waste mental energy on a complex system; I just want a button and a series of prompts that I can fill in at a later time.
Not only is the system I built simple, but I also try to keep my notes simple by using bullet lists. I don't mind typos, because these notes only feed into my official review later. They are my personal notes, so they may not look polished.
Templates & Template Buttons
Within my notes database in Notion, I have created a template to collect all my on-call notes for the current rotation. This template includes a pre-flight checklist that I must complete before starting my rotation. The checklist consists of reviewing documentation, attending the hand-off meeting, and completing other administrative tasks, such as checking the PagerDuty app on my work phone to ensure it is up to date. Additionally, the template adds various fields, tags, and other data to the note, which further distinguishes it as an on-call note.
I added a template button to the template. When clicked, it will add a new toggle entry with a pre-populated list of items to fill out, such as who is involved in the incident, when it started, when I was paged, and more. At the bottom, there is a section for me to start adding my notes in a bullet list.
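To make the shape of that entry concrete, here is a rough sketch in Python of the kind of skeleton the button produces. It is not how Notion implements template buttons, and it does not call Notion in any way; the prompt names, the `new_incident_entry` helper, and the example incident title are all made up for illustration.

```python
from datetime import datetime, timezone

# Hypothetical prompt names, loosely based on the fields described above.
PROMPTS = ["People involved", "Incident started", "I was paged"]


def new_incident_entry(title: str, now: datetime) -> str:
    """Return a plain-text skeleton for one incident entry."""
    # Header line stands in for the toggle's title.
    lines = [f"> {title} ({now:%Y-%m-%d %H:%M} UTC)"]
    # One empty prompt per field, to be filled in during the incident.
    lines += [f"  {prompt}: " for prompt in PROMPTS]
    # The running notes start as an empty bullet list at the bottom.
    lines += ["  Notes:", "  - "]
    return "\n".join(lines)


# Example incident title and time are invented for this sketch.
entry = new_incident_entry(
    "Checkout latency", datetime(2024, 1, 5, 3, 12, tzinfo=timezone.utc)
)
print(entry)
```

The point is only that every entry starts with the same prompts in the same order, so nothing has to be remembered at 3 a.m.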
Data & Notes
When adding data to toggle entries, I strive for accuracy, as I may need to reference them later in reviews or leadership meetings. I keep every timestamp in the same time zone when logging notes, so reporting is clear. These sections often require multiple updates throughout an incident.
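If timestamps arrive from different sources with different offsets, one way to keep them consistent is to convert everything to a single zone before writing it down. The sketch below assumes UTC as that zone and assumes the inputs are ISO 8601 strings that carry an offset; both are my choices for the example, and `to_utc` is a hypothetical helper, not part of any tool mentioned here.

```python
from datetime import datetime, timezone


def to_utc(stamp: str) -> str:
    """Convert an ISO 8601 timestamp with a UTC offset to 'YYYY-MM-DD HH:MM UTC'."""
    parsed = datetime.fromisoformat(stamp)
    # Refuse timestamps without an offset rather than guess the zone.
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {stamp}")
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


# Two different wall-clock times that refer to the same moment.
print(to_utc("2024-01-05T22:12-05:00"))
print(to_utc("2024-01-06T04:12+01:00"))
```

Both calls print the same UTC time, which is exactly what you want when two people report the same event from different time zones.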
There are areas for relevant links to Slack threads, Datadog dashboards, and, of course, the PagerDuty incident, where I will post official updates on the incident until it is resolved.
The notes are a running list of things that are being said, leads that are being tracked, and things that have been ruled out. If necessary, I can read a list of things we have tried and our current leading theories. Sometimes, it feels like I am a court stenographer.
The final result is a note with a template button and plenty of pre-populated prompts, which I can use to quickly get people up to speed on an ongoing incident and to report on the details later. This template will be included in my upcoming Engineering Management Notion template collection.