At times, the software world can feel disconnected from the tangible realities of other professions. How many other fields measure progress in two-week sprints or require candidates to complete hypothetical challenges to prove their skills?
But every so often, something happens that bridges that gap and highlights how much we share with traditional engineering disciplines. For me, that moment came this week when Richmond’s 100-year-old water treatment plant failed catastrophically. Ironically, just three days before the failure, the water department assured federal regulators the system was reliable, citing redundant backups. Whoops.
I don’t live in the city. I’m in a surrounding county with its own water system, or so I thought. It turns out my half of the county gets its water from Richmond, a fact neither I nor my neighbors knew until now. To make matters worse, several other counties also rely on the same system, amplifying the impact of the failure.
This failure feels eerily familiar to anyone who’s dealt with software outages caused by tech debt. Deferred maintenance, flawed processes, and overconfidence in backups create the perfect storm, whether it’s a water plant or a legacy API.
Deferred Maintenance and the Cost of Neglect
The water treatment plant was over 100 years old, with critical maintenance deferred for decades. In just the last decade, four separate bills were proposed to fund upgrades, but none were approved. In software, we often see the same pattern: systems or APIs are left untouched until they fail, despite repeated discussions about their vulnerabilities.
Many systems rely on layered architectures, where newer features depend on older components. Legacy layers, often tied to outdated APIs, are rarely modernized, creating hidden vulnerabilities. No matter how polished the newer layers might seem, they are destined to fail if the foundation is rotting.
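As a rough sketch of what that hidden dependency looks like in code (every name here is made up for illustration), picture a polished new feature that quietly funnels every request through a legacy client nobody has touched in years:

```python
# Hypothetical sketch; none of these names come from a real system.
class LegacyBillingClient:
    """Wraps a decades-old internal service that nobody wants to touch."""

    def fetch_balance(self, account_id: str) -> float:
        # Quietly assumes the old service is up, fast, and well-formed.
        # No timeout, no retry, no validation: it "just works", until it doesn't.
        return 42.0  # stand-in for the real call


class AccountDashboard:
    """The polished new layer users actually see."""

    def __init__(self, billing: LegacyBillingClient) -> None:
        self._billing = billing

    def summary(self, account_id: str) -> dict:
        # However modern this layer is, it inherits every weakness below it.
        return {"account": account_id, "balance": self._billing.fetch_balance(account_id)}


print(AccountDashboard(LegacyBillingClient()).summary("acct-123"))
```

The dashboard can be rewritten a dozen times and it won’t matter; the failure mode lives in the layer nobody budgets for.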
The cost of that neglect accumulates invisibly. Each passing year without investment increases the risk, but until something fails spectacularly, it’s easy to justify prioritizing flashier projects over essential maintenance.
Processes That Work—Until They Don’t
Both software engineers and utility managers often rely on processes to prevent or mitigate failures. The problem? Processes can only address what they’re designed to anticipate. In Richmond’s case, when the water treatment plant went offline, the backup systems failed to pick up the slack. The process for switching to alternate power sources, like batteries and generators, had critical flaws, leaving the system vulnerable at the worst possible moment.
In software, this happens when incident response playbooks fall short during multi-system outages or when dependencies fail in unexpected ways. The issue isn’t that the processes are useless—it’s that they’re built on assumptions: that redundancy or failover will always work as intended, or that the playbook only needs to handle a single system failure at a time.
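Here is a back-of-the-napkin sketch of what those quiet assumptions look like in a failover step. The `Standby` interface and the lag threshold are invented for the example, not taken from any real playbook:

```python
# Hypothetical failover sketch; the Standby interface and thresholds are made up.
class Standby:
    """Stand-in for a replica the playbook assumes is ready to take over."""

    def is_reachable(self) -> bool:
        return True

    def replication_lag_seconds(self) -> int:
        return 5

    def promote(self) -> None:
        print("standby promoted to primary")


def promote_standby_naive(standby: Standby) -> None:
    # What many playbooks actually encode: the backup exists, so it must work.
    standby.promote()


def promote_standby_checked(standby: Standby) -> None:
    # Same step, but the quiet assumptions are made explicit and verified.
    if not standby.is_reachable():
        raise RuntimeError("standby unreachable: the playbook assumed it wouldn't be")
    if standby.replication_lag_seconds() > 30:
        raise RuntimeError("standby is stale: promoting it would lose data")
    standby.promote()


promote_standby_checked(Standby())
```

The two functions do the same thing on a good day. The difference only shows up on the bad day the playbook was written for.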
Redundancy Isn’t a Silver Bullet
Perhaps the most chilling similarity was the misplaced faith in redundancy. The letter to the federal regulators explicitly cited the system’s redundant components as a reason why catastrophic failure was unlikely. But redundancy only works when each layer is independently reliable. Two-thirds of the plant’s battery backup system was offline at the moment it was needed most. If every backup is neglected like the primary system—or shares the same vulnerabilities—it’s useless when failure strikes.
In Richmond’s case, with both the primary and battery backup systems offline, there wasn’t enough time to switch to generators before flooding occurred in the pump house. The flooding damaged the remaining equipment, and even after power was restored, crews spent several days repairing electrical systems, pumps, and filters before they could begin repressurizing the system.
This is a lesson software engineers know all too well. A backup database isn’t much help if it hasn’t been tested or if it relies on the same compromised network as the primary one. Redundancy is not a magic solution; it’s part of a broader strategy that must include regular testing, maintenance, and validation.
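What “regular testing” can mean in practice is something like a scheduled restore drill. This is a toy sketch, with placeholder functions standing in for whatever backup tooling a real team already has:

```python
# Hypothetical restore-drill sketch; restore_latest_backup and run_smoke_checks
# are placeholders for real tooling, not an actual library's API.
import datetime


def restore_latest_backup(target: str) -> None:
    """Placeholder: restore the most recent backup into a scratch environment."""
    print(f"restoring latest backup into {target}")


def run_smoke_checks(target: str) -> bool:
    """Placeholder: row counts, checksums, a handful of known-good queries."""
    print(f"running smoke checks against {target}")
    return True


def restore_drill() -> None:
    # A backup only counts once a restore has been proven, recently and on purpose.
    scratch = "scratch-restore-test"
    restore_latest_backup(scratch)
    if not run_smoke_checks(scratch):
        raise RuntimeError("restore drill failed: that backup is not really a backup")
    print(f"restore drill passed at {datetime.datetime.now().isoformat()}")


restore_drill()
```

Run something like this on a schedule and a dead backup becomes a ticket instead of a headline.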
What We Can Learn
Both the Richmond water treatment failure and software outages stem from the same root cause: a lack of long-term investment. Maintenance, whether for a century-old plant or a legacy API, is neither glamorous nor immediately rewarding, but it’s essential.
As engineers, we can learn from these parallels. Neglect in software can lead to downtime and lost revenue, but in real-world systems, the stakes are even higher: public health and safety. It’s a sobering reminder that while the consequences may differ, the lessons are universal.
So the next time someone tells you tech debt isn’t a priority, just remind them of Richmond’s water treatment plant. Sometimes, the best way to avoid a crisis is to act like you’re already in one and address your vulnerabilities before they turn into headlines.