Of the four DORA metrics, Mean Time to Recovery (MTTR) is often the hardest to improve—and the most revealing about organizational health.
MTTR measures how long it takes to restore service after a failure. Elite performers recover in less than an hour. Low performers take more than a week.
The gap isn't primarily about technology. It's about culture, process, and preparation.
What MTTR Actually Measures
MTTR spans the entire recovery lifecycle:
- Detection: How long until you know something's wrong?
- Diagnosis: How long until you understand the cause?
- Remediation: How long until you fix it?
- Verification: How long until you confirm it's fixed?
Each phase can be a bottleneck. Teams often focus on remediation speed when detection is their real problem.
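To make that breakdown concrete, here is a minimal sketch of an incident modeled as the four phase boundaries above. The `Incident` fields and the example timestamps are illustrative, not Velocinator's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    began: datetime       # the failure actually started
    detected: datetime    # a person or monitor noticed
    diagnosed: datetime   # the cause was understood
    remediated: datetime  # the fix shipped or the change was rolled back
    verified: datetime    # confirmed healthy again

    def phases(self) -> dict[str, timedelta]:
        """Split total recovery time into the four phases listed above."""
        return {
            "detection": self.detected - self.began,
            "diagnosis": self.diagnosed - self.detected,
            "remediation": self.remediated - self.diagnosed,
            "verification": self.verified - self.remediated,
        }

    def time_to_recover(self) -> timedelta:
        return self.verified - self.began

# Illustrative incident: slow detection costs as much as shipping the fix.
incident = Incident(
    began=datetime(2024, 5, 1, 13, 30),
    detected=datetime(2024, 5, 1, 14, 15),
    diagnosed=datetime(2024, 5, 1, 14, 45),
    remediated=datetime(2024, 5, 1, 15, 30),
    verified=datetime(2024, 5, 1, 15, 45),
)
for phase, duration in incident.phases().items():
    print(f"{phase:>12}: {duration}")
print(f"{'total':>12}: {incident.time_to_recover()}")
```

Laying the phases out like this makes the bottleneck visible: in this example, detection alone takes as long as writing and shipping the fix.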
Calculating MTTR Automatically
Manual incident tracking is unreliable. People forget to update timestamps. They estimate poorly. The data is inconsistent.
Velocinator calculates MTTR from system data:
From Jira
- Incident ticket created → Detection time
- Status changes → Diagnosis and remediation phases
- Ticket resolved → Recovery time
Configure Jira issue types (Bug, Incident) and priority levels to identify which tickets count as incidents.
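As a sketch of what that automation looks like, the snippet below pulls resolved incident tickets from the Jira Cloud REST API and measures creation to resolution. The site URL, credentials, project key, and JQL filter are placeholders to adapt to your own incident definition.

```python
from datetime import datetime

import requests

JIRA_URL = "https://your-company.atlassian.net"   # placeholder
AUTH = ("you@example.com", "api-token")           # placeholder credentials

def fetch_incident_durations(jql: str) -> list[float]:
    """Return hours from ticket creation to resolution for matching issues."""
    resp = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={"jql": jql, "fields": "created,resolutiondate", "maxResults": 100},
        auth=AUTH,
    )
    resp.raise_for_status()
    durations = []
    for issue in resp.json()["issues"]:
        created = datetime.strptime(issue["fields"]["created"], "%Y-%m-%dT%H:%M:%S.%f%z")
        resolved_raw = issue["fields"]["resolutiondate"]
        if not resolved_raw:
            continue  # still open; excluded from MTTR
        resolved = datetime.strptime(resolved_raw, "%Y-%m-%dT%H:%M:%S.%f%z")
        durations.append((resolved - created).total_seconds() / 3600)
    return durations

# Only tickets matching your incident definition count, e.g.:
hours = fetch_incident_durations(
    "project = OPS AND issuetype = Incident AND priority in (Highest, High) AND resolved >= -30d"
)
print(f"{len(hours)} incidents, mean recovery {sum(hours) / max(len(hours), 1):.1f}h")
```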
From GitHub
- Hotfix PR opened → Remediation started
- Hotfix merged → Remediation complete
- Associated rollback PRs → Additional signal
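A companion sketch against the GitHub REST API, treating merged pull requests that carry a `hotfix` label as the remediation signal. The repository, token, and label convention are assumptions you would replace with your own.

```python
from datetime import datetime

import requests

REPO = "your-org/your-service"                        # placeholder
HEADERS = {"Authorization": "Bearer <github-token>"}  # placeholder token

def hotfix_remediation_minutes() -> list[float]:
    """Minutes from hotfix PR opened to merged, for recently closed PRs."""
    resp = requests.get(
        f"https://api.github.com/repos/{REPO}/pulls",
        params={"state": "closed", "per_page": 50},
        headers=HEADERS,
    )
    resp.raise_for_status()
    minutes = []
    for pr in resp.json():
        labels = {label["name"] for label in pr["labels"]}
        if "hotfix" not in labels or not pr["merged_at"]:
            continue  # not a hotfix, or closed without merging
        opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
        merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
        minutes.append((merged - opened).total_seconds() / 60)
    return minutes

for m in hotfix_remediation_minutes():
    print(f"hotfix remediated in {m:.0f} minutes")
```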
Combined View
By correlating Jira and GitHub data, we can trace a single incident end to end:
- Incident created in Jira at 2:15 PM
- PR opened at 2:45 PM (30 minutes to start fix)
- PR merged at 3:30 PM (45 minutes to complete fix)
- Incident marked resolved at 3:45 PM (verification)
- Total MTTR: 90 minutes
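The same trace expressed as arithmetic, using the timestamps above:

```python
from datetime import datetime

created   = datetime(2024, 5, 1, 14, 15)  # incident created in Jira, 2:15 PM
pr_opened = datetime(2024, 5, 1, 14, 45)  # hotfix PR opened, 2:45 PM
pr_merged = datetime(2024, 5, 1, 15, 30)  # hotfix PR merged, 3:30 PM
resolved  = datetime(2024, 5, 1, 15, 45)  # incident marked resolved, 3:45 PM

print("time to start fix:   ", pr_opened - created)    # 0:30:00
print("time to complete fix:", pr_merged - pr_opened)  # 0:45:00
print("verification:        ", resolved - pr_merged)   # 0:15:00
print("total MTTR:          ", resolved - created)     # 1:30:00
```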
Benchmarks and Targets
DORA research provides benchmarks:
| Performance | MTTR |
|---|---|
| Elite | < 1 hour |
| High | < 1 day |
| Medium | 1 day to 1 week |
| Low | > 1 week |
Where does your team fall? More importantly, what's driving your number?
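If you want to place your number in those bands programmatically, the thresholds from the table map to a small helper like this:

```python
from datetime import timedelta

def dora_band(mttr: timedelta) -> str:
    """Map an MTTR value to the DORA performance band from the table above."""
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr < timedelta(days=1):
        return "High"
    if mttr <= timedelta(weeks=1):
        return "Medium"
    return "Low"

print(dora_band(timedelta(minutes=90)))  # High
print(dora_band(timedelta(days=3)))      # Medium
```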
Improving Detection
You can't fix what you don't know is broken.
Invest in Monitoring
- Application Performance Monitoring (APM)
- Error tracking (Sentry, Rollbar)
- Infrastructure monitoring
- Real User Monitoring (RUM)
Reduce Alert Noise
Too many alerts lead to alert fatigue. Teams start ignoring them, and critical alerts get lost in the noise.
- Tune alert thresholds
- Eliminate flapping alerts
- Prioritize alerts by business impact
User Feedback Loop
Sometimes users detect problems before systems do. Make it easy to report issues:
- In-app feedback mechanism
- Customer support → Engineering escalation path
- Social media monitoring
Track Detection Time Separately
Measure time from "incident began" to "we knew about it." If this is long, your monitoring needs work.
Improving Diagnosis
Once you know something's wrong, how fast can you identify the cause?
Observability Investment
The three pillars:
- Logs: Structured, searchable, correlated
- Metrics: System and business metrics
- Traces: Request flow through distributed systems
Teams with good observability diagnose in minutes. Teams without it diagnose in hours.
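On the logs pillar, "structured and correlated" can be as simple as emitting one JSON object per line with a shared request ID. The sketch below uses only the standard library and is not tied to any particular logging vendor.

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line so logs are searchable and correlatable."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # the correlation id ties this line to a specific request/trace
            "request_id": getattr(record, "request_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

request_id = str(uuid.uuid4())
log.info("payment gateway timeout", extra={"request_id": request_id})
log.info("retrying payment", extra={"request_id": request_id})
```

During diagnosis, filtering on one `request_id` turns "search everything" into "read one request's story."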
Runbooks
Document diagnosis procedures for known failure modes:
- "If API latency spikes, check: database connections, cache hit rate, external service status"
- "If checkout fails, check: payment gateway, inventory service, session service"
Runbooks convert diagnosis from "figure it out" to "follow the checklist."
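Runbooks can also live in the repository as data, so they're versioned and reviewed like code. Here's a sketch built from the two examples above; the symptom names and checks are illustrative.

```python
# Checklists keyed by symptom; each entry is something to check, in order.
RUNBOOKS: dict[str, list[str]] = {
    "api latency spike": [
        "database connection pool saturation",
        "cache hit rate",
        "external service status pages",
    ],
    "checkout failures": [
        "payment gateway status",
        "inventory service health",
        "session service health",
    ],
}

def print_runbook(symptom: str) -> None:
    steps = RUNBOOKS.get(symptom.lower())
    if steps is None:
        print(f"No runbook for '{symptom}'; diagnose manually, then write one.")
        return
    for i, step in enumerate(steps, start=1):
        print(f"{i}. Check {step}")

print_runbook("API latency spike")
```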
War Room Practices
When incidents occur, how does communication flow?
- Clear roles (incident commander, communications, technical lead)
- Dedicated channel (Slack, Zoom)
- Regular status updates
Well-rehearsed coordination reduces chaos and speeds diagnosis.
Improving Remediation
Once you've identified the cause, how fast can you ship a fix?
Safe Deployment Patterns
- Feature flags to disable broken features without deploy
- Quick rollback capability (one-click, < 5 minutes)
- Canary deployments to catch problems before full rollout
If you can roll back in five minutes, the remediation phase of many incidents shrinks to five minutes.
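As a concrete illustration of the feature-flag point, a kill switch can be as small as a lookup that short-circuits the broken path without a deploy. The flag name, environment-variable convention, and checkout function below are hypothetical.

```python
import os

def feature_enabled(name: str) -> bool:
    # Flags are read from the environment so a broken feature can be turned
    # off by flipping a value, not by shipping a deploy. A real setup would
    # use a flag service; this only sketches the shape of the idea.
    return os.environ.get(f"FEATURE_{name.upper()}", "on") == "on"

def checkout(cart_total: float) -> str:
    if feature_enabled("new_checkout"):
        return f"new checkout flow: charged ${cart_total:.2f}"
    return f"legacy checkout flow: charged ${cart_total:.2f}"

# During an incident: set FEATURE_NEW_CHECKOUT=off and the next request
# falls back to the old path, no rollback required.
print(checkout(42.00))
```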
Reduce Deploy Friction
Every minute your deploy pipeline takes is a minute added to the incident. Optimize:
- Faster builds
- Faster tests (or smart test selection)
- Faster deployment mechanisms
On-Call Effectiveness
- Clear escalation paths
- On-call has authority and tools to act
- No single points of failure (primary and backup)
Game Days: Practice Makes Prepared
You don't want your team's first run-through of a major incident to be an actual major incident.
Chaos Engineering
Deliberately inject failures in controlled conditions:
- Kill a service
- Spike latency
- Corrupt data
See how the system and team respond. Learn while stakes are low.
Tabletop Exercises
Walk through incident scenarios without actually breaking anything:
- "The database is down. What do we do?"
- "A security breach is detected. Who gets notified?"
- "The primary data center is unreachable. How do we failover?"
Identify gaps in runbooks and coordination.
Measure Game Day MTTR
Treat game days like real incidents. Measure detection, diagnosis, remediation times. This is your baseline.
Building the MTTR Culture
Blameless Postmortems
After every significant incident:
- Timeline: What happened and when?
- Contributing factors: What allowed this to happen?
- Remediation: What did we do to fix it?
- Prevention: What will we do to prevent recurrence?
The goal is learning, not blame. If people fear punishment, they'll hide incidents instead of learning from them.
Track Incident Trends
- Incident volume over time
- MTTR trend over time
- Common root causes
- Repeat incidents (same failure happening again)
If your systems and practices are genuinely improving, these trends show it: fewer incidents, resolved faster.
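Counting repeat failure modes doesn't require heavy tooling to start. Here's a sketch with hypothetical root-cause tags pulled from postmortems:

```python
from collections import Counter

# Root-cause tags recorded in postmortems (illustrative values).
root_causes = [
    "config change", "external dependency", "config change",
    "code defect", "config change", "capacity",
]

for cause, count in Counter(root_causes).most_common():
    flag = "  <- repeat offender" if count > 1 else ""
    print(f"{cause}: {count}{flag}")
```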
Celebrate Recovery Excellence
When the team handles an incident exceptionally well, recognize it. Fast detection, calm coordination, quick recovery—these are skills worth celebrating.
The MTTR Dashboard
Velocinator provides MTTR analytics:
Summary Metrics
- Overall MTTR (median and percentiles)
- Trend over time
- By severity level
- By team/service
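The median and percentile math behind those summary metrics is simple enough to sanity-check by hand. A sketch with illustrative recovery times:

```python
import statistics

# Recovery times in minutes for recent incidents (illustrative numbers).
recovery_minutes = [12, 25, 40, 55, 90, 90, 130, 240, 480, 1100]

median = statistics.median(recovery_minutes)
p90 = statistics.quantiles(recovery_minutes, n=10)[8]  # 90th percentile

print(f"median MTTR: {median:.0f} min")
print(f"p90 MTTR:    {p90:.0f} min")
# The median describes the typical incident; the p90 describes the bad days,
# which is usually where the improvement work is.
```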
Incident Breakdown
- Each incident with timeline
- Detection, diagnosis, remediation phases
- Associated PRs and commits
Root Cause Categories
- Infrastructure vs. code vs. external
- Patterns over time
- Most common failure modes
Getting Started
If you don't currently track MTTR:
- Define what counts as an incident: Not every bug. Define by severity, user impact, or SLA breach.
- Configure incident tracking: Set up Jira issue types and workflows.
- Connect to Velocinator: Link Jira and GitHub for automated calculation.
- Baseline: What's your current MTTR? Don't guess—measure.
- Identify the bottleneck: Is it detection, diagnosis, or remediation? Focus improvement there.
- Iterate: Each quarter, review MTTR trends and invest in the next improvement.
MTTR isn't just a number. It's a measure of how prepared your team is for the unexpected. The lower it is, the more resilient you are.