Infrastructure Debt Shows Up on Your Credit Card
A costly mistake hiding in your AWS bill can almost always be avoided — and I learned that the expensive way.
Two months ago we migrated our database from Aurora to RDS. The migration went smoothly, with zero downtime, and the team rightly celebrated. Then this week I sat down to review the AWS bill and realised something uncomfortable: we'd been running two databases the entire time.
What actually happened
After the cutover, we stopped the old Aurora cluster rather than deleting it, leaving a cool-down period so we could be confident the new database was stable before decommissioning the old one. That's standard practice. I even knew the relevant detail: Aurora automatically starts a cluster again once it has been stopped for seven days. I told myself we'd deal with it soon.
Then we got busy shipping features. The cluster quietly restarted itself, exactly as documented, and nobody noticed. Two months later, there it was on the bill — we were paying for two databases instead of one.
Here's the part worth sitting with: the problem was never a lack of knowledge. I knew about the seven-day restart. I knew the cluster needed to be destroyed. The failure was the absence of a forcing function — anything that would make the cleanup happen whether or not I remembered it.
"We'll clean it up soon" is not a plan
"Soon" is a good intention, and good intentions don't close tickets. Stopped resources are especially dangerous because they feel handled — out of sight, seemingly off the clock — right up until a default behaviour turns them back on and the meter starts running again.
This is the trap with anything you defer in infrastructure: the cost of forgetting isn't a bug report or a failing test. It's a line item that compounds silently, month after month, until someone happens to look.
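To put a number on "feel handled": stopping an Aurora cluster buys at most seven days, not an open-ended pause. The minimal sketch below uses only the Python standard library to compute when a stopped cluster will come back on its own. The seven-day limit is documented Aurora behaviour. Treating the restart as happening exactly seven days after the stop is a simplifying assumption, and the dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Aurora automatically starts a cluster that has been stopped for seven days.
# Even while stopped, storage and backups are still billed; only instance hours pause.
AURORA_MAX_STOP = timedelta(days=7)


def auto_restart_at(stopped_at: datetime) -> datetime:
    """Approximate moment a stopped Aurora cluster starts billing instance hours again."""
    return stopped_at + AURORA_MAX_STOP


def days_until_restart(stopped_at: datetime, now: datetime) -> float:
    """Days left before the automatic restart; zero or negative means it already happened."""
    return (auto_restart_at(stopped_at) - now) / timedelta(days=1)


if __name__ == "__main__":
    stopped = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)  # hypothetical stop time
    bill_review = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)  # hypothetical review date
    print(auto_restart_at(stopped).isoformat())  # prints 2024-03-08T09:00:00+00:00
    print(days_until_restart(stopped, bill_review))  # prints -54.0
```

In this hypothetical timeline, a cluster stopped on 1 March is running again by 8 March, so a bill reviewed on 1 May already covers about 54 days of a database that everyone believed was switched off.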
What I'd do differently
The fix isn't "be more careful." It's to design the cleanup so carefulness isn't required:
- Treat decommissioning as part of the migration, not a follow-up task. The migration isn't done when traffic moves — it's done when the old thing is gone. Put "destroy the old cluster" in the same plan, with the same owner, as the cutover itself.
- Set a hard calendar reminder — a literal "destroy old DB by [date]" with a real deadline, not a vague someday.
- Tag the old cluster and wire up a billing alert on that tag, so that for as long as it's still costing money, something is actively nagging you about it.
Each of these turns "remember to do it" into "you will be told until it's done." That's the whole point of a forcing function.
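The tag-plus-deadline idea can be sketched as a small scheduled check that fails until the old cluster is actually gone. This is a minimal sketch, not the setup described above. It assumes a hypothetical `decommission-by` tag holding an ISO date, plus boto3 with credentials for the account. The deadline logic is a plain function so it can be tested without touching AWS.

```python
from datetime import date

# Hypothetical tag convention (not an AWS default): the cluster being retired
# carries the tag "decommission-by" with an ISO date such as "2024-03-15".
DEADLINE_TAG = "decommission-by"


def overdue_clusters(clusters, today):
    """Return (identifier, status, deadline) for tagged clusters past their deadline.

    `clusters` follows the shape of the "DBClusters" list returned by the RDS
    DescribeDBClusters API: dicts with DBClusterIdentifier, Status and TagList.
    """
    overdue = []
    for cluster in clusters:
        tags = {tag["Key"]: tag["Value"] for tag in cluster.get("TagList", [])}
        raw_deadline = tags.get(DEADLINE_TAG)
        if raw_deadline is None:
            continue  # untagged clusters are not scheduled for deletion
        deadline = date.fromisoformat(raw_deadline)  # malformed dates raise ValueError
        # Status is deliberately ignored: a "stopped" cluster is still overdue,
        # because Aurora will start it again within seven days.
        if today > deadline:
            overdue.append((cluster["DBClusterIdentifier"], cluster["Status"], deadline))
    return overdue


def fetch_clusters():
    """Yield every DB cluster in the client's configured region."""
    import boto3  # imported here so overdue_clusters() can be tested without AWS access

    paginator = boto3.client("rds").get_paginator("describe_db_clusters")
    for page in paginator.paginate():
        yield from page["DBClusters"]


def main():
    """Print overdue clusters and return a non-zero exit code if any exist."""
    late = overdue_clusters(fetch_clusters(), date.today())
    for identifier, status, deadline in late:
        print(f"OVERDUE: {identifier} ({status}) was due for deletion on {deadline}")
    return 1 if late else 0
```

Run it daily from a scheduler or CI job as `sys.exit(main())`. The failing job is the nag, and it stays red until the cluster is deleted and drops out of the listing. The billing side could be covered by an AWS Budgets alert filtered on the same tag. Note that user-defined cost allocation tags must be activated in the Billing console before they can be used that way.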
The real takeaway
Infrastructure debt is a lot like technical debt — except this kind shows up on your credit card. It accrues quietly, it rarely breaks anything in an obvious way, and the bill always arrives later than the decision that caused it.
So now, post-migration cleanup isn't a chore I hope to get to. It's a tracked, owned, deadlined step in the migration itself — because the only reliable way to beat "we'll clean it up soon" is to never rely on remembering in the first place.
If you run cloud infrastructure, it's worth asking: what's quietly restarted, scaled back up, or kept running in your account right now — and what forcing function would have caught it?
About the Author
Ankit Bhardwaj
Site Reliability Engineer with 12+ years in software engineering and 4+ years operating production cloud infrastructure on AWS and Kubernetes. Currently running six Kubernetes clusters at 99.99% uptime.