Beyond Firefighting: Engineering with Intent through Triage, Tech Debt, and Root Cause Thinking
When teams are caught in chaos, it's clarity, not speed, that shapes the outcome.
TL;DR
Every engineering team runs into those moments: the eleventh-hour bug, the broken build right before release, the stakeholder panic call.
In this blog, I reflect on how we can respond better by leaning on triage, managing tech debt with a long view, and applying root cause analysis that helps, not haunts.
Along the way, I borrow ideas from software engineering research and blend them with lessons learned on the job.
The War Room Moment
It started like most critical incidents do, with a ping during a quiet evening. A security loophole had been flagged, and within minutes, we had a virtual war room running.
Devs, architects, and program folks were all jumping in with urgency, trading whiteboard discussions and suggestions. Everyone had the same goal: fix it fast. But amid that rush, one question lingered in my head: are we solving the right problem in the right way?
Why Triage is the Backbone of Better Decisions
Triage isn’t just tagging issues as P1 or P2. It's about understanding impact, not just on the system, but on the team too. Engineering decisions don’t happen in a vacuum. Teams are already running sprints, pushing deliverables, and balancing long-term goals with short-term firefights. If you're not careful, you're not just fixing a bug, you're breaking your people.
Here’s how I approach triage:
Assess the impact: Is this breaking functionality, trust, or just the CI pipeline?
Check team load: Are we in the middle of a release or end-of-sprint crunch?
Involve wisely: Not everyone needs to be in every call. Bring in only those who matter, but ensure diversity of thought.
This aligns with Mahvish Khurum’s [1] value-based engineering frameworks, where decisions are measured not just by urgency but by what stakeholders truly value.
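As a rough sketch, the three triage questions above could be captured in a few lines of Python. Everything named here (`Impact`, `TriageInput`, `triage`), the severity ordering, and the P1/P2/P3 mapping are my own illustrative assumptions, not part of Khurum's work or any standard tool.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Impact(IntEnum):
    """What the issue breaks. The ordering (least to most severe) is an assumption."""
    CI_PIPELINE = 1
    FUNCTIONALITY = 2
    TRUST = 3


@dataclass
class TriageInput:
    impact: Impact
    in_release_crunch: bool        # mid-release or end-of-sprint?
    roles_with_context: List[str]  # only the people who need to be on the call


def triage(issue: TriageInput) -> dict:
    """Return a hypothetical triage decision; the thresholds are illustrative."""
    if issue.impact >= Impact.TRUST:
        priority = "P1"
    elif issue.impact == Impact.FUNCTIONALITY:
        priority = "P2"
    else:
        priority = "P3"

    # Interrupt a team that is in crunch only for the most severe issues.
    interrupt_now = priority == "P1" or not issue.in_release_crunch

    return {
        "priority": priority,
        "interrupt_now": interrupt_now,
        "invite": sorted(set(issue.roles_with_context)),  # de-duplicated invite list
    }
```

The point is not the scoring itself but that team load and the invite list sit next to impact as first-class inputs, so they cannot be skipped when the pressure is on.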
Tech Debt: Short-Term Win or Long-Term Loss?
We’ve all been in those calls: “Let’s push a hotfix now and clean it up later.” Except ‘later’ never really comes, does it? To me, tech debt is not about bad code. It’s about trade-offs. Sometimes you choose speed, but you note down what needs to be revisited. The key is not to hide it or excuse it but to track it, quantify it, and plan around it.
Ask these questions:
If we fix it now, what’s the technical compromise?
Can we log this as part of the next sprint or backlog grooming?
Will this turn into a bigger mess later?
Again, this aligns with Khurum’s Software Value Map, where engineering choices are viewed not in isolation but in relation to business and team goals [1].
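To make "track it, quantify it, and plan around it" concrete, here is a minimal sketch of a debt log. The `DebtItem` fields and the 1 to 5 `growth_risk` scale are hypothetical; in practice this would more likely live in your issue tracker than in code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DebtItem:
    summary: str
    compromise: str                      # what the quick fix traded away
    logged_on: date
    target_sprint: Optional[str] = None  # None means "later", i.e. unscheduled
    growth_risk: int = 1                 # assumed scale: 1 (stays small) to 5 (bigger mess)


def unscheduled_by_risk(items: List[DebtItem]) -> List[DebtItem]:
    """List debt that has no planned sprint yet, riskiest first."""
    pending = [item for item in items if item.target_sprint is None]
    return sorted(pending, key=lambda item: item.growth_risk, reverse=True)
```

Reviewing the output of something like `unscheduled_by_risk` during backlog grooming is one way to keep "later" from quietly becoming "never".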
Root Cause Thinking: Not a Blame Game
Many teams think RCA is just post-mortem paperwork. But done right, it’s one of the best learning tools you can build into your culture. When something breaks, don’t just ask “what went wrong”. Ask these questions:
Who had context? Who didn’t?
Were the docs missing? Was the handoff unclear?
Was it a dev slip or a system design flaw?
Tony Gorschek’s FLEX-RCA and other frameworks emphasise that [3]:
RCA must scale.
It must involve multiple voices.
Most importantly, it must seek systemic causes, not individual errors.
When Everyone Wants to Fix Fast — Pause and Reflect
One recent incident stayed with me. A critical vulnerability emerged mid-sprint.
PMs wanted a quick patch to meet dates. Architects were worried about design integrity, and developers, of course, were stuck balancing deliverables with urgency. But here’s the thing: everyone had valid concerns. What was missing was coordination and a shared mental model.
I paused the room and proposed this:
Let’s triage the issue first, with all inputs.
Let’s identify both the fastest fix and the cleanest long-term plan.
Let’s protect the team from overburn while meeting security needs.
It worked. Not because we were perfect, but because we aligned.
My Framework: Value-Driven Incident Flow (VDIF)
Here’s the structure I now follow for critical issues:
Triage Quickly, Include the Right People: Get inputs fast, but don’t crowd the room.
Log Tech Debt as a First-Class Concern: Even if the fix is tactical, track what got skipped.
Run RCA with Care and Breadth: Involve multiple roles, and avoid the blame trap.
Talk Clearly Up and Down: Keep execs in the loop, but shield teams from panic.
Embed Learnings Back: Update onboarding, SOPs, or tooling as needed.
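If it helps to see the flow as a checklist, here is a small sketch. Treating the five steps as strictly ordered is my simplifying assumption, and `Step` and `next_step` are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Optional, Set


class Step(Enum):
    TRIAGE = "Triage quickly, include the right people"
    LOG_DEBT = "Log tech debt as a first-class concern"
    RCA = "Run RCA with care and breadth"
    COMMUNICATE = "Talk clearly up and down"
    EMBED = "Embed learnings back"


def next_step(done: Set[Step]) -> Optional[Step]:
    """Return the first step not yet completed, or None when the incident is closed out."""
    for step in Step:  # Enum members iterate in definition order
        if step not in done:
            return step
    return None
```

An incident is only "done" when `next_step` returns `None`, which makes it harder to stop at the patch and skip the RCA and the learnings.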
The Bigger Picture: Process with Empathy
Engineering is not just code. It’s about building systems, including how we respond under pressure. If our only answer to a crisis is speed, we miss the chance to improve. We need:
Capacity for calm thinking
Space for structured RCA
Voices from across the stack (devs, leads, QA, ops, PMs) to be heard and aligned
Because that’s how resilient systems and teams are made.
References
[1] Khurum, M., et al. (2013). Software Value Map.
[2] Petersen, K., & Wohlin, C. (2010). Empirical Research in Software Engineering.
[3] Gorschek, T., et al. (2021). FLEX-RCA Framework.