The On-Call Rotation Taught Me More About Leadership Than Any Book

TL;DR: Incidents are leadership labs. Triage under pressure, clear narration, knowing when to escalate, blameless retros — those aren't SRE skills, they're the exact skills that make engineering managers, tech leads, and CTOs effective. If you've done on-call well, you've been practicing leadership for years without knowing it.

It's 2:17am. Your phone is screaming. The pager says the payments service is returning 500s. Customers can't check out. Revenue is bleeding.

You have two choices:

A) Panic, wake up everyone on Slack, and start making changes in production while half-asleep
B) Take a breath, check the dashboard, form a hypothesis, and communicate clearly to the people who need to know

The choice you make in that moment — groggy, stressed, under-caffeinated — reveals more about your leadership capacity than any behavioral interview question ever will.

Why is being on-call really a leadership exercise?

I spent years doing on-call for infrastructure at scale. AWS, Splunk, the whole monitoring stack. And here's what I learned: incidents are leadership labs.

Think about what an incident demands:

Triage under pressure: Quickly determine severity without complete information
Clear communication: Update stakeholders who are anxious and non-technical
Delegation: Know when to pull in help and who to pull
Ego management: Admit when your first hypothesis was wrong and pivot
Post-incident ownership: Run a blameless retro that actually improves things

That's not an SRE skill list. That's a leadership skill list. Every single one of those translates directly to managing people, running teams, and navigating organizational complexity.

How do you break down a well-handled incident?

In music production, a stem is an isolated track — just the drums, just the vocals, just the bass. You break a mix into stems so you can hear each element clearly and adjust it independently.

Let's break down a well-handled incident into its stems.

Stem 1: Detection

Good leaders notice problems early. In on-call, that means your alerts are tuned properly — not so sensitive they cry wolf, not so quiet they miss real fires.

In leadership, it's the same. Can you detect when a teammate is struggling before they rage-quit? Can you sense when a project is off-track before the deadline passes? Detection is a skill, not a talent.
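
On the alerting side, one common way to avoid crying wolf is to page only when a bad signal is sustained rather than on a single spike. Here's a minimal sketch of that idea. The function name, the 5% threshold, and the three-sample window are all assumptions for illustration, not values from any particular monitoring tool.

```python
def should_page(error_rates, threshold=0.05, sustained_samples=3):
    """Page only when the most recent readings ALL exceed the threshold.

    error_rates: chronological list of error-rate samples (0.0 to 1.0).
    threshold and sustained_samples are illustrative defaults; tune per service.
    """
    if len(error_rates) < sustained_samples:
        return False  # not enough data to call it sustained
    recent = error_rates[-sustained_samples:]
    return all(rate > threshold for rate in recent)


print(should_page([0.01, 0.09, 0.01, 0.02]))  # False: one spike, then recovery
print(should_page([0.02, 0.08, 0.11, 0.15]))  # True: three bad samples in a row
```

Raising `sustained_samples` makes the alert quieter but slower; lowering it does the opposite. That trade-off is the whole tuning game.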

Stem 2: Communication (narration)

The best incident commanders I've worked with all do the same thing: they narrate. They think out loud in the incident channel.

"Looking at the payment-service logs. Seeing timeout errors from the database connection pool. Hypothesis: we're exhausting connections. Checking pool config now."

This isn't showing off. It's creating shared context so everyone in the channel can follow the investigation, contribute if they see something, and stay calm because someone clearly has a handle on this.

Leadership is the same. The best managers narrate their thinking. "Here's what I'm seeing, here's what I think it means, here's what I'm going to do about it." Transparency under pressure builds trust.

Stem 3: Escalation

Knowing when to escalate is an art. Too early and you're the person who cried wolf. Too late and you're the person who let the building burn down because they didn't want to bother anyone.

In incidents, good on-callers have a mental timer: "If I haven't identified the root cause in 15 minutes, I'm paging the secondary." No ego. No heroics. Just pragmatism.

In leadership, it's: "If I can't resolve this conflict between two team members in one conversation, I'm involving my manager before it becomes a team-wide problem." Same muscle.
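
If you want to take the on-call version of that mental timer out of your head, it can be sketched in a few lines. This is an illustration under assumptions: `should_escalate` is a hypothetical helper, and the 15-minute budget is just the number from the example above.

```python
from datetime import datetime, timedelta

# Assumption: 15 minutes matches the example in the text; pick your own budget.
ESCALATION_BUDGET = timedelta(minutes=15)


def should_escalate(started_at, now, root_cause_found, budget=ESCALATION_BUDGET):
    """Return True once the budget is spent and there is still no root cause."""
    if root_cause_found:
        return False
    return now - started_at >= budget


page_time = datetime(2024, 1, 1, 2, 17)  # the 2:17am page
print(should_escalate(page_time, datetime(2024, 1, 1, 2, 25), False))  # False
print(should_escalate(page_time, datetime(2024, 1, 1, 2, 32), False))  # True
```

The point of writing it down is that the decision gets made before the incident, when you're calm, instead of at 2am when your ego is arguing for five more minutes.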

Stem 4: Retros

This is where most teams fail. The incident is over. Production is stable. Everyone wants to forget it and move on.

But the retro is where all the learning happens. And the way you run it determines whether your team gets better or just gets more anxious.

Blameless retros are a leadership philosophy, not just an SRE practice. "The system failed" vs. "Dave failed" changes everything about how people interact with risk. If people are afraid of being blamed, they hide problems. Hidden problems become catastrophic problems.

The best leaders I've worked for ran every failure — technical or organizational — the same way: "What happened? What did we learn? What do we change?" No finger-pointing. No politics. Just improvement.

Why don't engineers see their on-call experience as leadership experience?

If you've done on-call, you've already practiced:

Making decisions with incomplete information (every page, ever)
Communicating under stress (incident channels, status pages)
Building systems that anticipate failure (circuit breakers, runbooks)
Continuous improvement (post-incident reviews, SLO tuning)

These are the exact same skills that make someone an effective engineering manager, tech lead, or CTO. The context changes; the muscles don't.

The problem is that most engineers don't recognize these as transferable skills. They think leadership is something different — something that happens in board rooms and all-hands meetings. But leadership starts at 2am in an incident channel, and if you've been on-call, you've been practicing it for years.

Why do experienced SREs still struggle when they move into people leadership?

Because there's one critical difference: incidents have clear resolution criteria. The service is up or it's down. Latency is within SLO or it's not. The problem is bounded.

People problems are unbounded. There's no dashboard that tells you a teammate is quietly resentful about being passed over for a project. There's no alert that fires when trust breaks down between two pods. There's no runbook for "your best engineer just got a competing offer."

The detection, communication, and escalation muscles are the same. But the signal-to-noise ratio is far worse with humans than with metrics.

How can I practice the calm-voice playbook before my next 2am page?

Day 6 of the free Developer EQ Sprint workbook walks you through building your own one-page Incident Response Card — dashboards, rollback, escalation, and 4 ready-to-paste phrases in your voice. Pin it for the next 2am page.

The gap from incident leadership to people leadership is exactly what Developer EQ ($20) is about. The book takes the frameworks you already know from engineering (signal processing, feedback loops, gain staging, compression) and applies them to human interaction.

FAQ

Does on-call experience actually count on a promotion case?

Yes, if you frame it that way. Don't write "completed on-call rotations" on your promo brief. Write "ran 12 customer-impacting incidents as IC, turned the root cause into a runbook for 8 of them, and trained 2 juniors to take primary." The skill is leadership; the artifact is the incident.

How do I lead an incident when I'm not the most senior person on the call?

Take the incident commander role anyway. Senior people often prefer to be assistants on calls, not commanders — it lets them think technically without managing the room. Your value isn't more knowledge; it's clear narration, status updates, and keeping the call focused. That's a coordination role, and it's leadership.

What's a "calm-voice" template I can paste during an incident?

"Status update at HH:MM — we're seeing [symptom]. Current hypothesis: [theory]. Next 10 min: [action]. Will update again at HH:MM regardless of progress." Repeat every 10–15 minutes. Beats "still investigating" by an order of magnitude.
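
If you'd rather not fill in the timestamps by hand at 2am, the template above is easy to wrap in a tiny helper. This is a sketch, not a tool recommendation: `status_update` and its parameters are made-up names, and the 10-minute interval is an assumption taken from the template.

```python
from datetime import datetime, timedelta


def status_update(now, symptom, hypothesis, next_action, interval_min=10):
    """Fill in the calm-voice template, including the next promised update time."""
    next_update = now + timedelta(minutes=interval_min)
    # A plain hyphen is used after the time so it pastes cleanly into any chat tool.
    return (
        f"Status update at {now:%H:%M} - we're seeing {symptom}. "
        f"Current hypothesis: {hypothesis}. "
        f"Next {interval_min} min: {next_action}. "
        f"Will update again at {next_update:%H:%M} regardless of progress."
    )


print(status_update(
    datetime(2024, 1, 1, 2, 20),
    symptom="500s from the payments service",
    hypothesis="database connection pool exhaustion",
    next_action="check pool config",
))
```

The part worth automating is the promised next-update time, because it's the first thing people drop when they're heads-down in logs.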

Should I push back if the post-incident retro turns into a blame session?

Yes, immediately and explicitly. "I want to keep this blameless — what was the system or process gap that made it possible for any engineer to do that?" Name the dynamic out loud. Most teams drift into blame; one person pulling them back is usually enough to reset.

How do I avoid burnout from on-call without losing the leadership reps?

Rotate the type of work, not the on-call frequency. Six weeks primary, then six weeks doing runbook authoring or alert tuning. The leadership growth comes from variety of incidents handled, not raw page count. Volunteer for the rare/complex ones; trade off the routine ones.
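
As a rough illustration, the six-on, six-off pattern is simple enough to compute rather than negotiate each cycle. The helper below is hypothetical, and the "primary" and "tooling" labels are placeholders for whatever your team calls those blocks.

```python
def rotation_phase(week_index, block_weeks=6):
    """Return 'primary' or 'tooling' for a zero-based week index.

    Weeks 0-5 are primary on-call, weeks 6-11 are runbook or alert-tuning
    work, and the pattern then repeats. block_weeks=6 follows the text above.
    """
    if week_index < 0:
        raise ValueError("week_index must be non-negative")
    return "primary" if (week_index // block_weeks) % 2 == 0 else "tooling"


print([rotation_phase(w) for w in (0, 5, 6, 11, 12)])
# ['primary', 'primary', 'tooling', 'tooling', 'primary']
```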

This is part of the Developer EQ series on social skills for engineers. Get the free 2-week workbook or the book — $20.