How To Safely Use AI In Legacy Codebases

Have you ever opened a legacy service, followed one method call, then another, then another, and suddenly you're reading a 900-line class that handles billing, emails, analytics, and one mysterious flag nobody wants to remove?

That's exactly where AI feels tempting. You want to paste the file into an agent and say, "Please make this sane." I get it. I've had that feeling too.

But legacy code is not just bad code. Legacy code is code with history. Some of that history is ugly. Some of it is business-critical. AI can help, but only if you make it respect the ghosts in the machine.

Figure: a safe AI workflow for legacy code (read existing behavior, identify business rules, add characterization tests, make small changes, review every step).

Legacy Code Is A Museum Of Business Rules

Legacy code often looks messy because it survived real customers, real production incidents, and real deadlines.

A weird if statement might represent a payment gateway edge case. A duplicated query might support an old reporting screen. A strange default value might exist because a mobile app version from three years ago still sends incomplete payloads.

Treat legacy code like an old city. The roads may look irrational until you learn where the rivers, hills, and old walls used to be.
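To make the "strange default value" case concrete, here's a rough sketch. Everything in it is made up for illustration: the function name, the payload fields, and the idea that old clients omit the currency are assumptions, not code from a real service.

```php
<?php
declare(strict_types=1);

// Hypothetical example: none of these names come from a real codebase.
// Assumption: older mobile clients omit "currency", so the service falls
// back to a default that looks arbitrary when read out of context.
function normalizeOrderPayload(array $payload): array
{
    return [
        'amount'   => (int) ($payload['amount'] ?? 0),
        // Looks like a cleanup candidate, but removing it would break
        // every client that still sends the old payload shape.
        'currency' => $payload['currency'] ?? 'USD',
    ];
}
```

Nothing in the code says why the fallback exists. That knowledge lives in incident reports and in people's heads, which is exactly why an agent is tempted to delete it.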

First, Ask AI To Explain

Before asking for changes, ask for a map:

Text

Read this service and explain:
1. The main responsibility of the class.
2. The external systems it depends on.
3. The business rules you can infer.
4. The riskiest parts to change.
Do not suggest code changes yet.

That one instruction changes the whole interaction. You're not asking the AI to be a hero. You're asking it to be a careful analyst.

A good AI response should mention uncertainty. If it says everything is obvious, be suspicious. Legacy code is rarely obvious.

Tests First, Always

In legacy work, tests are not paperwork. Tests are a seatbelt.

Before refactoring, you need to capture current behavior. Not ideal behavior. Not "what the code should have done." Current behavior. That's the contract you can safely improve around.

Characterization Tests

Characterization tests describe what the existing system does today. They're useful when nobody fully trusts the code but everyone depends on it.

Here's a small example:

PHP tests/Feature/PaymentRetryTest.php

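public function test_it_does_not_retry_hard_declines(): void
{
    // Arrange: a payment the gateway declined with a hard reason.
    $payment = Payment::factory()->declined('stolen_card')->create();

    // Act: ask the existing service, untouched, what it would do.
    $result = app(PaymentRetryService::class)->shouldRetry($payment);

    // Assert: pin today's answer so a refactor can't silently change it.
    $this->assertFalse($result);
}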

This test doesn't refactor anything. It freezes one important behavior so the AI can't "clean it up" accidentally.

Once you have tests around the risky behavior, AI becomes much safer. Not safe. Safer. Big difference.
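When one test per behavior gets tedious, a snapshot-style approach can cover more ground. Here's a minimal sketch in plain PHP, without a test framework. The `legacyShouldRetry` function is a hypothetical stand-in for the real method, and its three-attempt limit is an assumption I invented for the example.

```php
<?php
declare(strict_types=1);

// Stand-in for the legacy code under test. Hypothetical: in a real codebase
// this would be the existing method, called unchanged.
function legacyShouldRetry(string $declineReason, int $attempts): bool
{
    if ($declineReason === 'stolen_card' || $declineReason === 'fraud_suspected') {
        return false;
    }
    return $attempts < 3; // assumed limit, for illustration only
}

/**
 * Runs every input through the given callable and records the result.
 * The returned map is the snapshot of current behavior.
 *
 * @param list<array{0: string, 1: int}> $cases
 * @return array<string, bool>
 */
function recordBehavior(callable $subject, array $cases): array
{
    $snapshot = [];
    foreach ($cases as [$reason, $attempts]) {
        $snapshot["{$reason}#{$attempts}"] = $subject($reason, $attempts);
    }
    return $snapshot;
}

/**
 * Returns the keys whose recorded result is missing or different in $after.
 *
 * @return list<string>
 */
function diffSnapshots(array $before, array $after): array
{
    $changed = [];
    foreach ($before as $key => $value) {
        if (!array_key_exists($key, $after) || $after[$key] !== $value) {
            $changed[] = $key;
        }
    }
    return $changed;
}
```

The idea is to record a snapshot before the change, record another one after it, and treat any non-empty diff as a behavior change that needs an explicit decision.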

Keep Diffs Small Enough To Review

AI is very good at creating big diffs. Unfortunately, big diffs are where legacy systems go to hide bugs.

A giant refactor can look elegant while changing behavior in five places. That's dangerous because reviewers get tired. The larger the diff, the easier it is for one tiny behavior change to sneak through.

Think of legacy refactoring like defusing wires. You don't cut all of them because the bundle looks messy. You isolate one wire, understand it, test it, then move to the next.
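If you want a mechanical nudge toward small diffs, a tiny guard can help. This sketch parses the text that `git diff --numstat` prints (added count, deleted count, and path, separated by tabs, with "-" for binary files). The function names and the default limits of 5 files and 200 lines are my own assumptions, not a standard.

```php
<?php
declare(strict_types=1);

/**
 * Summarizes the output of `git diff --numstat`, where each line is
 * "<added>\t<deleted>\t<path>" and binary files report "-" for both counts.
 *
 * @return array{files: int, lines: int}
 */
function summarizeNumstat(string $numstat): array
{
    $files = 0;
    $lines = 0;
    foreach (preg_split('/\R/', trim($numstat)) as $row) {
        if ($row === '') {
            continue;
        }
        $parts = explode("\t", $row, 3);
        if (count($parts) < 3) {
            continue; // not a numstat row
        }
        $files++;
        // Binary files show "-"; count them as files but not as lines.
        $lines += (ctype_digit($parts[0]) ? (int) $parts[0] : 0)
                + (ctype_digit($parts[1]) ? (int) $parts[1] : 0);
    }
    return ['files' => $files, 'lines' => $lines];
}

// The limits are arbitrary assumptions; pick numbers your reviewers can sustain.
function isReviewable(array $summary, int $maxFiles = 5, int $maxLines = 200): bool
{
    return $summary['files'] <= $maxFiles && $summary['lines'] <= $maxLines;
}
```

A check like this doesn't prove a diff is safe. It only tells you when to stop and ask the agent to split the work.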

Figure: an unsafe sweeping AI refactor vs. a small AI-assisted change with tests and review. Same files, very different risk.

A Safer Refactor Sequence

1. Add characterization tests. Lock down what the code currently does.
2. Extract pure logic. Move calculation or decision logic into small methods.
3. Reduce duplication. Only after tests prove behavior.
4. Improve naming. Names are cheap and often high-value.
5. Change behavior last. Bug fixes should be explicit, reviewed, and tested.

Here's a small extraction that AI can usually handle well:

PHP app/Services/PaymentRetryService.php

private function isHardDecline(string $reason): bool
{
    return in_array($reason, [
        'stolen_card',
        'do_not_honor',
        'fraud_suspected',
    ], true);
}

The value here is not the code itself. The value is that the rule now has a name, and a named rule is easier to test, review, and discuss.
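To show what "easier to test" can look like, here's a self-contained sketch. `RetryPolicy` is a simplified, hypothetical stand-in for the service, and I'm assuming the retry decision is just "not a hard decline," which the real class may not do.

```php
<?php
declare(strict_types=1);

// Simplified stand-in for the service. The class name, the string argument,
// and the "retry unless hard decline" logic are assumptions for this sketch.
final class RetryPolicy
{
    private const HARD_DECLINES = ['stolen_card', 'do_not_honor', 'fraud_suspected'];

    public function shouldRetry(string $declineReason): bool
    {
        return !$this->isHardDecline($declineReason);
    }

    // The named rule: one place to read, review, and argue about.
    private function isHardDecline(string $reason): bool
    {
        return in_array($reason, self::HARD_DECLINES, true);
    }
}
```

Because the rule is private, the checks go through `shouldRetry`. That keeps the tests tied to behavior rather than to the shape of the refactor.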

Make The AI Show Its Work

When AI changes legacy code, don't accept "I fixed it" as an answer.

Ask for the reasoning, the affected behavior, the tests run, and the remaining risks. This is not about making the model sound smart. It's about forcing reviewable output.

A Useful Review Prompt

Text

Before I review the diff, summarize:
- Which behavior is intended to stay the same.
- Which behavior intentionally changed.
- Which tests prove that.
- Which areas still feel risky.
- Any assumptions you made.

This kind of summary is like a PR description from a careful engineer. It doesn't replace review, but it helps you focus your review.

Also, compare the summary against the diff. AI summaries can be incomplete. The diff is the source of truth.
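One crude way to start that comparison is to list changed files the summary never mentions. The sketch below takes the summary text and the output of `git diff --name-only` (one path per line). The function name is hypothetical, the name matching is a rough heuristic, and it needs PHP 8 for `str_contains`.

```php
<?php
declare(strict_types=1);

/**
 * Returns changed paths that the summary never mentions, matching on either
 * the full path or the file name without its extension.
 *
 * @param string $summary        The AI's written summary of the change.
 * @param string $nameOnlyOutput Output of `git diff --name-only`, one path per line.
 * @return list<string>
 */
function unmentionedFiles(string $summary, string $nameOnlyOutput): array
{
    // Split into lines and drop blanks.
    $paths = array_filter(
        array_map('trim', preg_split('/\R/', $nameOnlyOutput)),
        fn (string $path): bool => $path !== ''
    );

    // Keep only paths the summary does not reference at all.
    return array_values(array_filter(
        $paths,
        function (string $path) use ($summary): bool {
            $name = pathinfo($path, PATHINFO_FILENAME);
            return !str_contains($summary, $path) && !str_contains($summary, $name);
        }
    ));
}
```

A file showing up here isn't proof of a problem. It's a pointer to the part of the diff you should read first.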

Don't Let AI Rewrite Architecture In One Pass

One of the funniest and most dangerous AI habits is architectural enthusiasm.

You ask it to fix a bug. It discovers a service locator, old static calls, missing interfaces, and a controller doing too much. Five seconds later, it wants to introduce a new module boundary, repository layer, DTO structure, and event system.

I respect the ambition. I do not merge it.

Common Legacy AI Mistakes

- Inventing abstractions too early. The agent adds interfaces before the team understands the domain.
- Normalizing weird behavior away. It removes edge cases because they look accidental.
- Changing dependency lifetimes. It turns lazy work into eager work or vice versa.
- Breaking old integrations. It assumes current tests cover all external clients.
- Mixing refactor and feature work. That makes review much harder.

A better instruction is:

Text

Do not introduce new architecture.
Make the smallest change that fixes the tested behavior.
If you see larger design issues, list them separately.

That last sentence is powerful. It lets the AI be helpful without turning your bug fix into a surprise rewrite.
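Of the mistakes listed above, the dependency lifetime one is the easiest to miss in review, so here's a small sketch of it. All three classes are hypothetical, and the counter exists only to make construction visible. It assumes PHP 8 for constructor property promotion.

```php
<?php
declare(strict_types=1);

// Hypothetical client that records when it is constructed, standing in for
// anything expensive or side-effecting (a connection, an HTTP handshake).
final class GatewayClient
{
    public static int $constructed = 0;

    public function __construct()
    {
        self::$constructed++;
    }
}

// Legacy shape: the client is only built the first time it is needed.
final class LazyRetryService
{
    private ?GatewayClient $client = null;

    public function retry(bool $needed): void
    {
        if (!$needed) {
            return; // the common path never touches the gateway
        }
        $this->client ??= new GatewayClient();
    }
}

// "Cleaned up" shape: constructor injection looks tidier, but the client now
// has to exist before the service does, even for requests that never retry.
final class EagerRetryService
{
    public function __construct(private GatewayClient $client)
    {
    }

    public function retry(bool $needed): void
    {
        // Same decision logic as before; only the lifetime changed.
    }
}
```

Both versions can pass the same unit tests. The difference only shows up as extra connections or slower requests in production, which is why it deserves its own line in the review.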

Final Tips

The safest AI workflow I've found for legacy code is boring: understand first, test second, change third, review always. It doesn't feel as flashy as "agent refactors entire module," but it keeps you out of trouble.

My opinion: legacy codebases are where senior engineers will get the most value from AI, because seniors know what not to touch. That judgment is the real accelerator.

Good luck with your next legacy cleanup. Move slowly enough to stay fast 👊
