leadership, SRE, incident management

April 22, 2026

The On-Call Rotation Taught Me More About Leadership Than Any Book

It's 2:17am. Your phone is screaming. The pager says the payments service is returning 500s. Customers can't check out. Revenue is bleeding.

You have two choices:

A) Panic, wake up everyone on Slack, and start making changes in production while half-asleep
B) Take a breath, check the dashboard, form a hypothesis, and communicate clearly to the people who need to know

The choice you make in that moment — groggy, stressed, under-caffeinated — reveals more about your leadership capacity than any behavioral interview question ever will.

The Pressure Cooker

I spent years doing on-call for infrastructure at scale. AWS, Splunk, the whole monitoring stack. And here's what I learned: incidents are leadership labs.

Think about what an incident demands:

  • Triage under pressure: Quickly determine severity without complete information
  • Clear communication: Update stakeholders who are anxious and non-technical
  • Delegation: Know when to pull in help and who to pull in
  • Ego management: Admit when your first hypothesis was wrong and pivot
  • Post-incident ownership: Run a blameless retro that actually improves things

That's not an SRE skill list. That's a leadership skill list. Every single one of those translates directly to managing people, running teams, and navigating organizational complexity.

The Stem Session: Breaking Down an Incident

In music production, a stem is an isolated track — just the drums, just the vocals, just the bass. You break a mix into stems so you can hear each element clearly and adjust it independently.

Let's break down a well-handled incident into its stems:

Stem 1: The Detection

Good leaders notice problems early. In on-call, that means your alerts are tuned properly — not so sensitive they cry wolf, not so quiet they miss real fires.

In leadership, it's the same. Can you detect when a teammate is struggling before they rage-quit? Can you sense when a project is off-track before the deadline passes? Detection is a skill, not a talent.
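
On the on-call side, one common way to strike that balance is to alert only when a signal stays bad for several checks in a row, rather than on a single spike. Here is a minimal Python sketch of that idea. The class name, the 5% error-rate threshold, and the three-sample window are illustrative assumptions, not values from any particular monitoring tool.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when a metric exceeds a threshold for N consecutive samples.

    Hypothetical helper for illustration; not part of any monitoring product.
    """

    def __init__(self, threshold: float, required_samples: int):
        self.threshold = threshold
        # Keep only the most recent N breach/no-breach results.
        self.recent = deque(maxlen=required_samples)

    def observe(self, value: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        self.recent.append(value > self.threshold)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)


# Assumed example data: error rate per check interval.
alert = SustainedThresholdAlert(threshold=0.05, required_samples=3)
error_rates = [0.01, 0.09, 0.02, 0.07, 0.08, 0.11]
fired = [alert.observe(rate) for rate in error_rates]
print(fired)
```

The lone spike at 0.09 does not page anyone; only the final three consecutive breaches do. Raising `required_samples` makes the alert quieter, lowering it makes the alert more sensitive, and picking the value is the tuning work.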

Stem 2: The Communication

The best incident commanders I've worked with all do the same thing: they narrate. They think out loud in the incident channel.

"Looking at the payment-service logs. Seeing timeout errors from the database connection pool. Hypothesis: we're exhausting connections. Checking pool config now."

This isn't showing off. It's creating shared context so everyone in the channel can follow the investigation, contribute if they see something, and stay calm because someone clearly has a handle on this.

Leadership is the same. The best managers narrate their thinking. "Here's what I'm seeing, here's what I think it means, here's what I'm going to do about it." Transparency under pressure builds trust.

Stem 3: The Escalation

Knowing when to escalate is an art. Too early and you're the person who cried wolf. Too late and you're the person who let the building burn down because they didn't want to bother anyone.

In incidents, good on-callers have a mental timer: "If I haven't identified the root cause in 15 minutes, I'm paging the secondary." No ego. No heroics. Just pragmatism.

In leadership, it's: "If I can't resolve this conflict between two team members in one conversation, I'm involving my manager before it becomes a team-wide problem." Same muscle.
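
The mental timer on the on-call side is simple enough to write down. Here is a rough Python sketch, assuming a hypothetical `should_page_secondary` helper and reusing the 15-minute budget from the example above. Real paging tools have their own escalation policies, and this does not model any of them.

```python
from datetime import datetime, timedelta

# Assumed budget, taken from the 15-minute example in the text.
ESCALATION_BUDGET = timedelta(minutes=15)


def should_page_secondary(started_at: datetime, now: datetime,
                          root_cause_found: bool) -> bool:
    """Return True once the budget is spent without a root cause."""
    if root_cause_found:
        return False
    return now - started_at >= ESCALATION_BUDGET


# Illustrative timestamps only.
started = datetime(2026, 4, 22, 2, 17)
print(should_page_secondary(started, started + timedelta(minutes=10), False))
print(should_page_secondary(started, started + timedelta(minutes=15), False))
```

The point of writing the rule down is that the decision is made before the incident, when nobody's ego is involved.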

Stem 4: The Retro

This is where most teams fail. The incident is over. Production is stable. Everyone wants to forget it and move on.

But the retro is where all the learning happens. And the way you run it determines whether your team gets better or just gets more anxious.

Blameless retros are a leadership philosophy, not just an SRE practice. "The system failed" vs. "Dave failed" changes everything about how people interact with risk. If people are afraid of being blamed, they hide problems. Hidden problems become catastrophic problems.

The best leaders I've worked for ran every failure — technical or organizational — the same way: "What happened? What did we learn? What do we change?" No finger-pointing. No politics. Just improvement.

Why Engineers Make Great Leaders (But Don't Know It)

If you've done on-call, you've already practiced:

  • Making decisions with incomplete information (every page, ever)
  • Communicating under stress (incident channels, status pages)
  • Building systems that anticipate failure (circuit breakers, runbooks)
  • Continuous improvement (post-incident reviews, SLO tuning)

These are the exact same skills that make someone an effective engineering manager, tech lead, or CTO. The context changes; the muscles don't.

The problem is that most engineers don't recognize these as transferable skills. They think leadership is something different, something that happens in boardrooms and all-hands meetings. But leadership starts at 2am in an incident channel, and if you've been on-call, you've been practicing it for years.

The Gap

So why do battle-tested SREs still struggle when they move into leadership roles?

Because there's one critical difference: incidents have clear resolution criteria. The service is up or it's down. Latency is within SLO or it's not. The problem is bounded.
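
To make "bounded" concrete, here is a small Python sketch of a latency SLO check. The 300 ms threshold, the 99% target, and the function name are assumptions chosen for illustration; a real SLO defines its own numbers.

```python
def within_latency_slo(latencies_ms: list[float],
                       threshold_ms: float = 300.0,
                       target: float = 0.99) -> bool:
    """Return True if enough requests finished within the threshold.

    Hypothetical check: the SLO is met when at least `target` of the
    sampled requests completed in `threshold_ms` or less.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms) >= target


# Assumed sample data: mostly fast requests with a few slow outliers.
healthy = [100.0] * 995 + [900.0] * 5
degraded = [100.0] * 98 + [900.0] * 2
print(within_latency_slo(healthy), within_latency_slo(degraded))
```

Whatever the numbers are, the function returns a yes or a no, and that is the property people problems lack.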

People problems are unbounded. There's no dashboard that tells you a teammate is quietly resentful about being passed over for a project. There's no alert that fires when trust breaks down between two pods. There's no runbook for "your best engineer just got a competing offer."

The detection, communication, and escalation muscles are the same. But the signal-to-noise ratio is far worse with humans than with metrics.

Building the Bridge

That gap — from incident leadership to people leadership — is exactly what Developer EQ is about. The book takes the frameworks you already know from engineering (signal processing, feedback loops, gain staging, compression) and applies them to human interaction.

If you've ever wished people came with an observability stack, this is the closest you'll get.

The live cohort puts it into practice. We run scenarios that are basically social incidents — difficult conversations, conflicting priorities, career negotiations — and debrief them the same way you'd run a post-incident review. Because the process works. You just need to apply it to a different domain.


This is part of the Developer EQ series on social skills for engineers.

Developer EQ is a 16-chapter guide to mastering the human side of engineering — using music production as the metaphor.