Velocinator
Engineering Metrics · 7 min read

Measuring MTTR: Building a Culture of Observability and Incident Response

January 26, 2026

Of the four DORA metrics, Mean Time to Recovery (MTTR) is often the hardest to improve—and the most revealing about organizational health.

MTTR measures how long it takes to restore service after a failure. Elite performers recover in less than an hour. Low performers take more than a week.

The gap isn't primarily about technology. It's about culture, process, and preparation.

What MTTR Actually Measures

MTTR spans the entire recovery lifecycle:

  1. Detection: How long until you know something's wrong?
  2. Diagnosis: How long until you understand the cause?
  3. Remediation: How long until you fix it?
  4. Verification: How long until you confirm it's fixed?

Each phase can be a bottleneck. Teams often focus on remediation speed when detection is their real problem.
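
Here's a rough sketch of how the phases add up: an incident modeled with one timestamp per phase boundary (the field names are illustrative, not a Velocinator schema), with each phase duration and the total derived from them.

  from dataclasses import dataclass
  from datetime import datetime, timedelta

  @dataclass
  class Incident:
      began_at: datetime       # failure actually started
      detected_at: datetime    # someone (or something) noticed
      diagnosed_at: datetime   # cause understood
      remediated_at: datetime  # fix shipped
      verified_at: datetime    # confirmed healthy again

      def phases(self) -> dict[str, timedelta]:
          return {
              "detection": self.detected_at - self.began_at,
              "diagnosis": self.diagnosed_at - self.detected_at,
              "remediation": self.remediated_at - self.diagnosed_at,
              "verification": self.verified_at - self.remediated_at,
          }

      def time_to_recovery(self) -> timedelta:
          # Recovery is measured from when the failure began, not when it was noticed
          return self.verified_at - self.began_at

Whichever phase dominates phases() is where improvement effort pays off first.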

Calculating MTTR Automatically

Manual incident tracking is unreliable. People forget to update timestamps. They estimate poorly. The data is inconsistent.

Velocinator calculates MTTR from system data:

From Jira

  • Incident ticket created → Detection time
  • Status changes → Diagnosis and remediation phases
  • Ticket resolved → Recovery time

Configure Jira issue types (Bug, Incident) and priority levels to identify which tickets count as incidents.
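
As a minimal sketch, here's that filter expressed against Jira's REST search endpoint, assuming a hypothetical OPS project where the Incident issue type and high priorities mark real incidents (the base URL, credentials, and JQL are placeholders to adjust to your own conventions):

  import requests

  JIRA_BASE = "https://your-company.atlassian.net"   # placeholder
  AUTH = ("you@example.com", "api-token")            # placeholder credentials

  # The JQL encodes "which tickets count as incidents" for this example
  jql = ('project = OPS AND issuetype = Incident '
         'AND priority in (Highest, High) AND resolved >= -30d')

  resp = requests.get(
      f"{JIRA_BASE}/rest/api/2/search",
      params={"jql": jql, "fields": "created,resolutiondate,summary", "maxResults": 100},
      auth=AUTH,
  )
  resp.raise_for_status()

  for issue in resp.json()["issues"]:
      fields = issue["fields"]
      print(issue["key"], fields["created"], "->", fields["resolutiondate"])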

From GitHub

  • Hotfix PR opened → Remediation started (see the sketch after this list)
  • Hotfix merged → Remediation complete
  • Associated rollback PRs → Additional signal
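
Here's a companion sketch for the GitHub side, assuming hotfix PRs are identifiable by a label (the "hotfix" label, repo name, and token are placeholders) and using the standard pulls endpoint:

  import requests

  REPO = "your-org/your-service"   # placeholder
  TOKEN = "ghp_your_token"         # placeholder personal access token

  resp = requests.get(
      f"https://api.github.com/repos/{REPO}/pulls",
      params={"state": "closed", "per_page": 100},
      headers={"Authorization": f"Bearer {TOKEN}"},
  )
  resp.raise_for_status()

  for pr in resp.json():
      labels = {label["name"] for label in pr["labels"]}
      if "hotfix" in labels and pr["merged_at"]:
          # created_at -> remediation started, merged_at -> remediation complete
          print(pr["number"], pr["created_at"], "->", pr["merged_at"])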

Combined View

By correlating Jira and GitHub, we can trace the full timeline (recomputed in the sketch after this list):

  • Incident created in Jira at 2:15 PM
  • PR opened at 2:45 PM (30 minutes to start fix)
  • PR merged at 3:30 PM (45 minutes to complete fix)
  • Incident marked resolved at 3:45 PM (verification)
  • Total MTTR: 90 minutes
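
The arithmetic is simple once the timestamps are lined up; here it is with the timeline above hard-coded (in practice the values come straight from the Jira and GitHub data):

  from datetime import datetime

  fmt = "%Y-%m-%d %H:%M"
  incident_created  = datetime.strptime("2026-01-26 14:15", fmt)  # Jira ticket opened
  pr_opened         = datetime.strptime("2026-01-26 14:45", fmt)  # hotfix PR opened
  pr_merged         = datetime.strptime("2026-01-26 15:30", fmt)  # hotfix merged
  incident_resolved = datetime.strptime("2026-01-26 15:45", fmt)  # Jira ticket resolved

  def minutes(delta):
      return int(delta.total_seconds() // 60)

  print("Time to start fix:   ", minutes(pr_opened - incident_created), "min")          # 30
  print("Time to complete fix:", minutes(pr_merged - pr_opened), "min")                 # 45
  print("Verification:        ", minutes(incident_resolved - pr_merged), "min")         # 15
  print("Total MTTR:          ", minutes(incident_resolved - incident_created), "min")  # 90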

Benchmarks and Targets

DORA research provides benchmarks:

  Performance    MTTR
  Elite          < 1 hour
  High           < 1 day
  Medium         1 day to 1 week
  Low            > 1 week

Where does your team fall? More importantly, what's driving your number?
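
The table translates directly into code if you want to label each incident (or a rolling average) with its DORA tier:

  from datetime import timedelta

  def dora_tier(mttr: timedelta) -> str:
      # Thresholds from the DORA benchmarks table above
      if mttr < timedelta(hours=1):
          return "Elite"
      if mttr < timedelta(days=1):
          return "High"
      if mttr <= timedelta(weeks=1):
          return "Medium"
      return "Low"

  print(dora_tier(timedelta(minutes=90)))  # High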

Improving Detection

You can't fix what you don't know is broken.

Invest in Monitoring

  • Application Performance Monitoring (APM)
  • Error tracking (Sentry, Rollbar)
  • Infrastructure monitoring
  • Real User Monitoring (RUM)

Reduce Alert Noise

Too many alerts lead to alert fatigue. Teams start ignoring them. Critical alerts get lost in the noise.

  • Tune alert thresholds
  • Eliminate flapping alerts
  • Prioritize alerts by business impact

User Feedback Loop

Sometimes users detect problems before systems do. Make it easy to report issues:

  • In-app feedback mechanism
  • Customer support → Engineering escalation path
  • Social media monitoring

Track Detection Time Separately

Measure time from "incident began" to "we knew about it." If this is long, your monitoring needs work.
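
A small sketch of that measurement, assuming you can estimate when the failure actually began (say, from the first bad metric sample) as well as when the first alert or ticket fired; the 5-minute budget is an arbitrary example:

  from datetime import datetime, timedelta

  def detection_lag(failure_began: datetime, first_alert: datetime,
                    budget: timedelta = timedelta(minutes=5)) -> timedelta:
      """Time from 'incident began' to 'we knew about it'."""
      lag = first_alert - failure_began
      if lag > budget:
          print(f"Detection took {lag} (budget {budget}): monitoring gap?")
      return lag

  detection_lag(datetime(2026, 1, 26, 14, 3), datetime(2026, 1, 26, 14, 15))
  # Detection took 0:12:00 (budget 0:05:00): monitoring gap?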

Improving Diagnosis

Once you know something's wrong, how fast can you identify the cause?

Observability Investment

The three pillars:

  • Logs: Structured, searchable, correlated
  • Metrics: System and business metrics
  • Traces: Request flow through distributed systems

Teams with good observability diagnose in minutes. Teams without it diagnose in hours.
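
To make "structured, searchable, correlated" concrete, here's a minimal sketch that emits JSON log lines carrying a correlation ID, so every line from one request can be pulled up together during diagnosis (the field names are arbitrary, not a prescribed schema):

  import json, logging, time, uuid

  class JsonFormatter(logging.Formatter):
      def format(self, record):
          return json.dumps({
              "ts": time.time(),
              "level": record.levelname,
              "msg": record.getMessage(),
              # The correlation ID ties every log line from one request together
              "correlation_id": getattr(record, "correlation_id", None),
          })

  handler = logging.StreamHandler()
  handler.setFormatter(JsonFormatter())
  log = logging.getLogger("checkout")
  log.addHandler(handler)
  log.setLevel(logging.INFO)

  request_id = str(uuid.uuid4())
  log.info("payment authorized", extra={"correlation_id": request_id})
  log.info("order persisted", extra={"correlation_id": request_id})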

Runbooks

Document diagnosis procedures for known failure modes:

  • "If API latency spikes, check: database connections, cache hit rate, external service status"
  • "If checkout fails, check: payment gateway, inventory service, session service"

Runbooks convert diagnosis from "figure it out" to "follow the checklist."
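
Runbooks can live in a wiki, but they can also live next to the alerting code. A toy sketch, using the two examples above as the checklists:

  RUNBOOKS = {
      "api_latency_spike": [
          "check database connection pool",
          "check cache hit rate",
          "check external service status pages",
      ],
      "checkout_failure": [
          "check payment gateway status",
          "check inventory service health",
          "check session service health",
      ],
  }

  def print_runbook(symptom: str) -> None:
      for step in RUNBOOKS.get(symptom, ["no runbook yet: write one in the postmortem"]):
          print(f"[ ] {step}")

  print_runbook("api_latency_spike")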

War Room Practices

When incidents occur, how does communication flow?

  • Clear roles (incident commander, communications, technical lead)
  • Dedicated channel (Slack, Zoom)
  • Regular status updates

Well-rehearsed coordination reduces chaos and speeds diagnosis.

Improving Remediation

Once you've diagnosed the problem, how quickly can you ship the fix?

Safe Deployment Patterns

  • Feature flags to disable broken features without deploy
  • Quick rollback capability (one-click, < 5 minutes)
  • Canary deployments to catch problems before full rollout

If you can roll back in 5 minutes, many incidents have a 5-minute MTTR.
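
A minimal sketch of the first of those patterns, the feature-flag kill switch. The flag store here is an environment variable for brevity; a real kill switch reads from a flag service (LaunchDarkly, Unleash, or your own config store) so it can flip without a restart.

  import os

  def flag_enabled(name: str, default: bool = True) -> bool:
      # Checked at request time, so flipping the flag takes effect without a deploy
      return os.environ.get(f"FLAG_{name}", str(default)).lower() in ("1", "true")

  def legacy_pricing(cart):  # stand-ins for the old and new code paths
      return sum(cart)

  def new_pricing(cart):
      return round(sum(cart) * 0.98, 2)

  def price(cart):
      if not flag_enabled("NEW_PRICING_ENGINE"):
          return legacy_pricing(cart)   # the instant "rollback" path
      return new_pricing(cart)

  print(price([10.00, 24.50]))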

Reduce Deploy Friction

Every minute your deploy pipeline takes is a minute added to incident duration. Optimize:

  • Faster builds
  • Faster tests (or smart test selection)
  • Faster deployment mechanisms

On-Call Effectiveness

  • Clear escalation paths
  • On-call has authority and tools to act
  • No single points of failure (primary and backup)

Game Days: Practice Makes Prepared

You don't want your first experience with a major incident to be a major incident.

Chaos Engineering

Deliberately inject failures in controlled conditions:

  • Kill a service
  • Spike latency
  • Corrupt data

See how the system and team respond. Learn while stakes are low.
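
Here's a toy sketch of the "spike latency" experiment: a wrapper that delays a fraction of calls while a chaos flag is on, so you can watch timeouts, retries, and alerts behave while the stakes are low (dedicated tools like Chaos Monkey or Toxiproxy do this at the infrastructure layer).

  import functools, os, random, time

  def chaos_latency(p: float = 0.2, delay_s: float = 2.0):
      """Delay a fraction p of calls by delay_s seconds when CHAOS_ENABLED=1."""
      def decorator(fn):
          @functools.wraps(fn)
          def wrapper(*args, **kwargs):
              if os.environ.get("CHAOS_ENABLED") == "1" and random.random() < p:
                  time.sleep(delay_s)   # simulated slow dependency
              return fn(*args, **kwargs)
          return wrapper
      return decorator

  @chaos_latency(p=0.5, delay_s=1.0)
  def get_user(user_id: int) -> dict:
      return {"id": user_id, "name": "demo"}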

Tabletop Exercises

Walk through incident scenarios without actually breaking anything:

  • "The database is down. What do we do?"
  • "A security breach is detected. Who gets notified?"
  • "The primary data center is unreachable. How do we failover?"

Identify gaps in runbooks and coordination.

Measure Game Day MTTR

Treat game days like real incidents. Measure detection, diagnosis, remediation times. This is your baseline.

Building the MTTR Culture

Blameless Postmortems

After every significant incident:

  • Timeline: What happened and when?
  • Contributing factors: What allowed this to happen?
  • Remediation: What did we do to fix it?
  • Prevention: What will we do to prevent recurrence?

The goal is learning, not blame. If people fear punishment, they'll hide incidents instead of learning from them.

Track Incident Trends

  • Incident volume over time
  • MTTR trend over time
  • Common root causes
  • Repeat incidents (same failure happening again)

Improving systems means fewer and shorter incidents.
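
A sketch of that bookkeeping, assuming each incident record carries a resolved date, an MTTR, and a root-cause label (the record shape and values are made up for illustration):

  from collections import Counter, defaultdict
  from datetime import date, timedelta
  from statistics import median

  incidents = [  # illustrative records, not a real export
      {"resolved": date(2026, 1, 9),  "mttr": timedelta(hours=3),    "cause": "config"},
      {"resolved": date(2026, 1, 21), "mttr": timedelta(minutes=45), "cause": "bad deploy"},
      {"resolved": date(2026, 2, 4),  "mttr": timedelta(hours=26),   "cause": "config"},
      {"resolved": date(2026, 2, 18), "mttr": timedelta(minutes=30), "cause": "external"},
  ]

  by_month = defaultdict(list)
  for inc in incidents:
      by_month[inc["resolved"].strftime("%Y-%m")].append(inc["mttr"])

  for month, mttrs in sorted(by_month.items()):
      print(month, "incidents:", len(mttrs), "median MTTR:", median(mttrs))

  # Repeat root causes are the candidates for systemic fixes
  print(Counter(inc["cause"] for inc in incidents).most_common(3))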

Celebrate Recovery Excellence

When the team handles an incident exceptionally well, recognize it. Fast detection, calm coordination, quick recovery—these are skills worth celebrating.

The MTTR Dashboard

Velocinator provides MTTR analytics:

Summary Metrics

  • Overall MTTR (median and percentiles)
  • Trend over time
  • By severity level
  • By team/service

Incident Breakdown

  • Each incident with timeline
  • Detection, diagnosis, remediation phases
  • Associated PRs and commits

Root Cause Categories

  • Infrastructure vs. code vs. external
  • Patterns over time
  • Most common failure modes

Getting Started

If you don't currently track MTTR:

  1. Define what counts as an incident: Not every bug qualifies. Define incidents by severity, user impact, or SLA breach (see the sketch after this list).

  2. Configure incident tracking: Set up Jira issue types and workflows.

  3. Connect to Velocinator: Link Jira and GitHub for automated calculation.

  4. Baseline: What's your current MTTR? Don't guess—measure.

  5. Identify the bottleneck: Is it detection, diagnosis, or remediation? Focus improvement there.

  6. Iterate: Each quarter, review MTTR trends and invest in the next improvement.
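
As a sketch of step 1, here's one possible predicate; the severity scale, user-impact threshold, and SLA check are all placeholders to tune to your own policy:

  def is_incident(severity: int, users_affected: int, sla_breached: bool) -> bool:
      """One example definition of 'what counts as an incident'."""
      # Sev1/Sev2, meaningful user impact, or any SLA breach
      return sla_breached or severity <= 2 or users_affected >= 100

  print(is_incident(severity=3, users_affected=500, sla_breached=False))  # True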

MTTR isn't just a number. It's a measure of how prepared your team is for the unexpected. The lower it is, the more resilient you are.
