Building A Safe AI Assistant For Pull Request Summaries

Pull request summaries look simple. A developer changes code, AI reads the diff, AI writes a summary, reviewers save time. Nice, right?

But a safe pull request summary is not just a nicer version of git diff. A good PR summary should help reviewers understand what changed, what behavior is different, where the risk is, which tests were run, which tests are missing, and whether migrations or operational steps are involved.

A bad PR summary can do the opposite. It can hide risk behind confident language, claim tests were run when they were not, describe intent instead of actual code, miss a dangerous file, and make reviewers trust the pull request too quickly.

So the goal is not "generate a beautiful summary." The goal is "generate a reviewer-friendly, evidence-based summary." That means the assistant must be grounded in the diff, CI results, file metadata, and repository rules.

Safe PR Summary Pipeline diagram: Diff flows through a File Classifier, Risk Detector, Test Detector, and Migration Detector, then into the AI Summary step and finally to a Human Reviewer.

What a safe PR summary must include

A useful PR summary should answer these questions. What changed and why does it matter? What user-visible or system behavior changed? Which files are risky? Were tests added or updated, which were run, and are there missing tests? Are there database migrations or configuration changes? Does a reviewer need to focus on anything specific?

A weak summary says:

Markdown

This PR updates the billing logic and improves tests.

That is almost useless. A stronger summary says:

Markdown

This PR changes payment retry behavior so recoverable gateway errors are retried up to 3 times, while hard declines are not retried.

Reviewer focus:
- Confirm the recoverable error list matches gateway documentation.
- Check retry count behavior in PaymentRetryService.
- Review the new tests for hard declines and max attempts.

Tests:
- Added PaymentRetryServiceTest.
- CI reports php artisan test passed.

Risk:
- Billing behavior changed.
- No database migration.

This is better because it gives reviewers a map.

Step 1: Read the diff safely

The assistant should not read only the PR title and description. Titles can be wrong and descriptions can be outdated. The diff is the source of truth.

You can collect diff metadata with Git:

Bash

git diff --name-status origin/main...HEAD
git diff --stat origin/main...HEAD
git diff --unified=80 origin/main...HEAD
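The --name-status output is tab-separated: a status letter (A, M, D, or R plus a similarity score for renames) followed by one or two paths. A minimal parser might look like the sketch below. It assumes the default output without -z, so paths containing unusual characters (which Git quotes) are not handled. The NameStatusEntry type and parseNameStatus function are hypothetical names. Addition and deletion counts would come separately, for example from git diff --numstat.

```typescript
// Hypothetical shape for one line of `git diff --name-status` output (illustrative, not a library type).
type NameStatusEntry = {
  path: string;
  oldPath?: string; // only set for renames
  status: 'added' | 'modified' | 'deleted' | 'renamed' | 'other';
};

// Parse the default (non -z) output of `git diff --name-status`.
// Lines look like "M\tapp/Foo.php" or "R087\told/path.php\tnew/path.php".
function parseNameStatus(output: string): NameStatusEntry[] {
  const entries: NameStatusEntry[] = [];

  for (const line of output.split('\n')) {
    if (line.trim() === '') continue; // skip the trailing empty line
    const [code, first, second] = line.split('\t');
    const letter = code.charAt(0);

    if (letter === 'R' && second !== undefined) {
      entries.push({ path: second, oldPath: first, status: 'renamed' });
    } else if (letter === 'A') {
      entries.push({ path: first, status: 'added' });
    } else if (letter === 'M') {
      entries.push({ path: first, status: 'modified' });
    } else if (letter === 'D') {
      entries.push({ path: first, status: 'deleted' });
    } else {
      // Copies (C), type changes (T), and other codes are kept but not interpreted.
      entries.push({ path: second ?? first, status: 'other' });
    }
  }

  return entries;
}
```

Keeping unknown status codes as 'other' instead of dropping them means no changed file silently disappears from the later classification step.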

For large pull requests, do not blindly send the entire diff to the model. First classify files:

TypeScript

type ChangedFile = {
  path: string;
  status: 'added' | 'modified' | 'deleted' | 'renamed';
  additions: number;
  deletions: number;
};

function classifyFile(path: string): string[] {
  // Normalize case so paths like "app/Billing/..." and "app/Auth/..." match the lowercase keywords.
  const p = path.toLowerCase();
  const tags: string[] = [];

  if (p.includes('migrations/')) tags.push('database_migration');
  if (p.includes('routes/')) tags.push('route_change');
  // Substring checks are deliberately broad: "auth" also matches "author", so expect some false positives.
  if (p.includes('auth') || p.includes('permissions')) tags.push('auth');
  if (p.includes('billing') || p.includes('payment')) tags.push('billing');
  if (p.includes('config/')) tags.push('configuration');
  if (p.includes('tests/')) tags.push('test');
  if (p.endsWith('.yml') || p.endsWith('.yaml')) tags.push('ci_or_config');

  return tags;
}

This simple classifier helps the assistant know what deserves attention.
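For large pull requests, those tags can decide which files reach the model first. The sketch below is one possible policy, assuming a budget measured in changed lines; the priority order, the TaggedFile shape, and selectContextFiles are illustrative, not a standard. Files that do not fit are returned separately so they can be reported instead of silently ignored.

```typescript
// Hypothetical input: a changed file plus tags from a classifier such as classifyFile.
type TaggedFile = { path: string; additions: number; deletions: number; tags: string[] };

// Illustrative priority order: earlier tags win a place in the model's context first.
const PRIORITY_TAGS = ['database_migration', 'auth', 'billing', 'configuration', 'route_change'];

// Lower number = more important; untagged files rank last.
function priorityOf(file: TaggedFile): number {
  const ranks = file.tags.map(tag => PRIORITY_TAGS.indexOf(tag)).filter(rank => rank >= 0);
  return ranks.length > 0 ? Math.min(...ranks) : PRIORITY_TAGS.length;
}

// Greedily include files in priority order until the changed-line budget is used up.
function selectContextFiles(files: TaggedFile[], maxChangedLines: number) {
  const sorted = [...files].sort((a, b) => priorityOf(a) - priorityOf(b));
  const included: TaggedFile[] = [];
  const skipped: TaggedFile[] = [];
  let used = 0;

  for (const file of sorted) {
    const size = file.additions + file.deletions;
    if (used + size <= maxChangedLines) {
      included.push(file);
      used += size;
    } else {
      skipped.push(file); // surface these later as "not summarized"
    }
  }
  return { included, skipped };
}
```

A greedy budget can still skip a very large high-risk file, which is exactly why the skipped list matters: it belongs in the summary's unknowns, not in the trash.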

Step 2: Explain changed behavior, not only changed files

Reviewers do not only need a file list. They need behavior. For example:

Diff

- if ($attempts > 3) {
+ if ($attempts >= 3) {
    return false;
  }

A file-level summary may say:

Markdown

Updated retry condition in PaymentRetryService.

A behavior-level summary says:

Markdown

The service now stops retrying once $attempts reaches 3; previously, a retry was still allowed when $attempts was exactly 3.
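A behavior claim like this can be checked by evaluating both conditions over a range of inputs. The sketch ports the two PHP conditions to TypeScript and assumes, since the diff does not show it, that the surrounding method returns false to stop retrying and true otherwise. The helper names are hypothetical.

```typescript
// Assumption: the surrounding method returns false to stop retrying and true otherwise.
const oldAllowsRetry = (attempts: number): boolean => !(attempts > 3);
const newAllowsRetry = (attempts: number): boolean => !(attempts >= 3);

// List the attempt counts where the old and new conditions give different answers.
function behaviorDiff(maxAttempts: number): number[] {
  const changed: number[] = [];
  for (let attempts = 0; attempts <= maxAttempts; attempts++) {
    if (oldAllowsRetry(attempts) !== newAllowsRetry(attempts)) changed.push(attempts);
  }
  return changed;
}

console.log(behaviorDiff(6)); // [3]: the only attempt count whose outcome changed
```

Exactly one input changes outcome. That is the precise kind of statement a behavior-level summary should make.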

That difference matters. Prompt the model to focus on behavior:

Markdown

You are summarizing a pull request for human reviewers.

Do not only list changed files.
Explain the behavior that changed.
Base your summary only on the diff and provided CI/test data.
If behavior is unclear, say it is unclear.
Do not claim tests were run unless test results are provided.

That last line is critical. AI should not invent confidence.
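That rule is easier for the model to follow when missing evidence is stated explicitly in the prompt instead of being silently omitted. A minimal sketch, where PromptInput and buildPrompt are hypothetical names and the section headings are illustrative:

```typescript
// Hypothetical prompt input; field names are illustrative.
type PromptInput = {
  diff: string;
  riskNotes: string[]; // deterministic findings, e.g. from a risk detector
  ciOutput?: string;   // raw CI/test output, if any was collected
};

const INSTRUCTIONS = [
  'You are summarizing a pull request for human reviewers.',
  'Do not only list changed files.',
  'Explain the behavior that changed.',
  'Base your summary only on the diff and provided CI/test data.',
  'If behavior is unclear, say it is unclear.',
  'Do not claim tests were run unless test results are provided.',
].join('\n');

// Assemble the prompt; absent evidence is written out so the model cannot fill the gap itself.
function buildPrompt(input: PromptInput): string {
  const rawCi = input.ciOutput?.trim() ?? '';
  const ci = rawCi !== '' ? rawCi : 'NO CI OR TEST RESULTS WERE PROVIDED.';
  const risks = input.riskNotes.length > 0
    ? input.riskNotes.map(note => `- ${note}`).join('\n')
    : '- None detected by deterministic checks.';

  return [INSTRUCTIONS, '## Deterministic risk findings', risks, '## CI / test results', ci, '## Diff', input.diff]
    .join('\n\n');
}
```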

Step 3: Detect risky files before asking AI

Some risk detection should be deterministic. If a PR touches migrations, auth, billing, infrastructure, or dependency lock files, mark it.

TypeScript

type Risk = {
  level: 'low' | 'medium' | 'high';
  reason: string;
  file: string;
};

function detectRisk(file: ChangedFile): Risk[] {
  const tags = classifyFile(file.path);
  const risks: Risk[] = [];

  if (tags.includes('database_migration')) {
    risks.push({
      level: 'high',
      file: file.path,
      reason: 'Database migration can affect production data and deployment order.',
    });
  }

  if (tags.includes('auth')) {
    risks.push({
      level: 'high',
      file: file.path,
      reason: 'Authentication or authorization behavior changed.',
    });
  }

  if (tags.includes('billing')) {
    risks.push({
      level: 'high',
      file: file.path,
      reason: 'Billing or payment behavior changed.',
    });
  }

  if (tags.includes('configuration')) {
    risks.push({
      level: 'medium',
      file: file.path,
      reason: 'Configuration change may affect runtime behavior.',
    });
  }

  return risks;
}

Give these risks to the model. Do not make the model rediscover obvious things from scratch.
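It also helps to collapse per-file findings into one overall level and a stable order before they go into the prompt. The sketch re-declares the Risk shape so it runs on its own; "highest level wins" is an assumed policy, and overallRisk and sortRisks are hypothetical helpers.

```typescript
type RiskLevel = 'low' | 'medium' | 'high';
type Risk = { level: RiskLevel; reason: string; file: string };

const LEVEL_ORDER: Record<RiskLevel, number> = { low: 0, medium: 1, high: 2 };

// Assumed policy: the overall PR risk is the highest individual finding ('low' if there are none).
function overallRisk(risks: Risk[]): RiskLevel {
  let result: RiskLevel = 'low';
  for (const risk of risks) {
    if (LEVEL_ORDER[risk.level] > LEVEL_ORDER[result]) result = risk.level;
  }
  return result;
}

// High-risk files first, then alphabetical, so the prompt and the rendered summary are stable.
function sortRisks(risks: Risk[]): Risk[] {
  return [...risks].sort(
    (a, b) => LEVEL_ORDER[b.level] - LEVEL_ORDER[a.level] || a.file.localeCompare(b.file),
  );
}
```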

Risk heatmap for pull request files: rows for Auth, Billing, Migrations, Config, Tests, and UI, each banded by low, medium, or high risk.

Step 4: List tests run and missing tests separately

This is one of the most important safety rules. Do not mix "tests that exist" with "tests that were run." A PR may include test files but CI may fail, and a developer may add tests but not run them locally. The assistant should be precise.

Example summary section:

Markdown

Tests found in diff:
- Added tests/Feature/Billing/PaymentRetryServiceTest.php

Tests reported by CI:
- php artisan test: passed
- vendor/bin/phpstan analyse: passed

Potential missing tests:
- No test found for retry behavior when attempts = 3.
- No test found for unknown gateway error codes.

To support this, collect CI data separately:

JSON

{
  "ci": {
    "status": "passed",
    "checks": [
      {
        "name": "php artisan test",
        "status": "passed",
        "duration_seconds": 84
      },
      {
        "name": "phpstan",
        "status": "passed",
        "duration_seconds": 31
      }
    ]
  }
}

Then instruct the model:

Markdown

Only list a test under "Tests run" if it appears in the CI data or user-provided command output.
If test data is missing, write: "No test run information was provided."

This prevents fake certainty.
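The same rule can be enforced in code as well as in the prompt, so the "Tests run" list comes from CI data and never from model output. A sketch assuming the CI JSON shape shown above; CiCheck, CiData, and testsRunFromCi are hypothetical names.

```typescript
// Types mirroring the example CI JSON above (illustrative).
type CiCheck = { name: string; status: string; duration_seconds?: number };
type CiData = { ci?: { status: string; checks: CiCheck[] } };

const NO_TEST_INFO = 'No test run information was provided.';

// Build the "Tests run" lines strictly from CI data; the diff and the model are never consulted.
function testsRunFromCi(data: CiData | undefined): string[] {
  const checks = data?.ci?.checks ?? [];
  if (checks.length === 0) return [NO_TEST_INFO];
  return checks.map(check => `${check.name}: ${check.status}`);
}
```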

Step 5: Explain migrations like a production engineer

Database migrations deserve special treatment. A summary should explain what table changes, whether columns are added, removed, renamed, or indexed, whether the migration may lock a large table, whether backfill is involved, whether rollback is safe, and whether application code depends on the migration order.

For example, this migration is not just "adds an index":

PHP

Schema::table('orders', function (Blueprint $table) {
    $table->index(['user_id', 'created_at']);
});

A safer summary says:

Markdown

Migration note:
- Adds a compound index on orders(user_id, created_at).
- This can improve queries that filter by user and sort/filter by creation date.
- On large orders tables, index creation may be slow or require an online migration strategy depending on the database engine and deployment setup.

The assistant does not need to pretend it knows table size. It should say what must be checked.
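A deterministic pre-pass can flag which schema operations appear in a migration, so the model is told what to explain and the reviewer sees what to check. The patterns below are a rough heuristic for Laravel schema builder calls (index, unique, dropColumn, renameColumn, Schema::drop, Schema::dropIfExists), not a PHP parser, and they will miss unusual code styles. MIGRATION_CHECKS and migrationChecklist are hypothetical names.

```typescript
// Heuristic only: each pattern maps a schema-builder call to a question a reviewer should answer.
const MIGRATION_CHECKS: Array<{ pattern: RegExp; note: string }> = [
  { pattern: /->(index|unique)\(/, note: 'Adds an index: check table size and whether an online migration is needed.' },
  { pattern: /->dropColumn\(/, note: 'Drops a column: check that no deployed code still reads it and that rollback is possible.' },
  { pattern: /->renameColumn\(/, note: 'Renames a column: check deployment order between code and migration.' },
  { pattern: /Schema::(drop|dropIfExists)\(/, note: 'Drops a table: confirm that data loss is intended.' },
];

// Non-global regexes are used so .test() has no lastIndex state between calls.
function migrationChecklist(migrationSource: string): string[] {
  return MIGRATION_CHECKS
    .filter(check => check.pattern.test(migrationSource))
    .map(check => check.note);
}
```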

Step 6: Create structured output

Structured output makes PR summaries easier to review and validate. Use a schema:

TypeScript

import { z } from 'zod';

const PullRequestSummarySchema = z.object({
  overview: z.string(),
  changedBehavior: z.array(z.string()),
  reviewerFocus: z.array(z.string()),
  riskyFiles: z.array(z.object({
    file: z.string(),
    reason: z.string(),
    level: z.enum(['low', 'medium', 'high']),
  })),
  testsFound: z.array(z.string()),
  testsRun: z.array(z.string()),
  missingTests: z.array(z.string()),
  migrationNotes: z.array(z.string()),
  unknowns: z.array(z.string()),
});

Then render it as Markdown:

TypeScript

type PullRequestSummary = z.infer<typeof PullRequestSummarySchema>;

// Render a bullet list, or a single fallback bullet when the section is empty.
function bullets(items: string[], fallback: string): string {
  return items.length > 0 ? items.map(item => `- ${item}`).join('\n') : `- ${fallback}`;
}

function renderSummary(summary: PullRequestSummary): string {
  const risks = summary.riskyFiles.map(
    risk => `**${risk.level}** \`${risk.file}\`: ${risk.reason}`,
  );

  return `
## Summary

${summary.overview}

## Changed behavior
${bullets(summary.changedBehavior, 'No behavior change identified')}

## Reviewer focus
${bullets(summary.reviewerFocus, 'No specific focus areas identified')}

## Risky files
${bullets(risks, 'None detected')}

## Tests

**Tests found:**
${bullets(summary.testsFound, 'None found')}

**Tests run:**
${bullets(summary.testsRun, 'No test run information was provided')}

**Potential missing tests:**
${bullets(summary.missingTests, 'None detected')}

## Migration notes
${bullets(summary.migrationNotes, 'No migration notes')}

## Unknowns
${bullets(summary.unknowns, 'None')}
`.trim();
}

This output is predictable. Reviewers know where to look.
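Schema validation confirms the shape of the model's answer, not its truthfulness. One extra check, sketched here on the assumption that CI check names are available as plain strings, is to flag any testsRun entry that does not mention a check CI actually reported. The function name is hypothetical, and the substring match is intentionally loose.

```typescript
// Return the testsRun entries that cannot be traced back to any CI check name.
function unsupportedTestClaims(testsRun: string[], ciCheckNames: string[]): string[] {
  const names = ciCheckNames.map(name => name.toLowerCase());
  return testsRun.filter(claim => {
    const text = claim.toLowerCase();
    // The fallback sentence makes no claim, so it is always allowed.
    if (text.startsWith('no test run information')) return false;
    return !names.some(name => text.includes(name));
  });
}
```

If the result is non-empty, the pipeline can regenerate the summary or replace testsRun with a list built directly from CI data.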

Mockup of an AI-generated pull request summary with sections for Summary, Changed Behavior, Reviewer Focus, Risky Files, Tests, Migration Notes, and Unknowns, plus risk badges.

Step 7: Make the assistant admit uncertainty

A safe assistant must be allowed to say "I do not know." That is not weakness. That is a safety feature.

Examples:

Markdown

Unknowns:
- The diff changes retry behavior, but no gateway documentation was provided to confirm the recoverable error list.
- No CI result was provided, so tests run cannot be verified.
- The migration adds an index to orders, but table size is unknown.

This helps reviewers focus. A summary that claims everything is fine is less useful than a summary that highlights uncertainty.
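Some unknowns do not need the model at all: they follow directly from which inputs are missing. A minimal sketch, where the Evidence flags are hypothetical values your pipeline would set before calling the model; the results can be merged with the model's own unknowns.

```typescript
// Hypothetical evidence flags collected by the pipeline before the model runs.
type Evidence = {
  ciProvided: boolean;
  hasMigration: boolean;
  tableSizesKnown: boolean;
  skippedFiles: string[]; // files left out of the model's context because of size limits
};

// Unknowns derived purely from missing evidence.
function deterministicUnknowns(evidence: Evidence): string[] {
  const unknowns: string[] = [];
  if (!evidence.ciProvided) {
    unknowns.push('No CI result was provided, so tests run cannot be verified.');
  }
  if (evidence.hasMigration && !evidence.tableSizesKnown) {
    unknowns.push('The PR includes a migration, but the affected table sizes are unknown.');
  }
  for (const file of evidence.skippedFiles) {
    unknowns.push(`${file} was not included in the AI context and was not summarized.`);
  }
  return unknowns;
}
```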

Step 8: Avoid dangerous summary language

Do not let the assistant write:

Markdown

This PR is safe to merge.

That is not the assistant's decision. Better:

Markdown

No blocker was detected from the provided diff and CI data, but human review is still required.

Do not write:

Markdown

All edge cases are covered.

Better:

Markdown

The diff includes tests for the main success path and max retry count. No test was found for unknown gateway error codes.

Safe wording matters because summaries shape reviewer trust.
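Wording rules are easy to put in a prompt and easy for a model to ignore, so a final lint over the rendered text is a cheap backstop. The phrase list is illustrative; each team would maintain its own, and findOverconfidentLanguage is a hypothetical name.

```typescript
// Illustrative phrases that claim approval or completeness the assistant cannot verify.
const OVERCONFIDENT_PATTERNS: RegExp[] = [
  /\bsafe to merge\b/i,
  /\ball edge cases (are|have been) (covered|handled)\b/i,
  /\bfully tested\b/i,
];

// Return the matched phrases so the pipeline can regenerate or flag the summary.
function findOverconfidentLanguage(summary: string): string[] {
  const hits: string[] = [];
  for (const pattern of OVERCONFIDENT_PATTERNS) {
    const match = summary.match(pattern);
    if (match) hits.push(match[0]);
  }
  return hits;
}
```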

Step 9: Add repository-specific instructions

Generic summaries are okay. Repository-specific summaries are better. Create a file like this:

Markdown

# AI PR Summary Instructions

For this repository:

- Treat changes under app/Billing as high risk.
- Treat changes under app/Auth as high risk.
- Mention database migrations clearly.
- Mention queue/job changes because they may affect async behavior.
- Do not say tests passed unless CI data confirms it.
- Always include reviewer focus.
- Keep the summary concise and technical.

The assistant should read this file before summarizing. This turns team knowledge into a repeatable rule.
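Parts of that file can also be read mechanically, so the deterministic risk detector and the model follow the same rules. This sketch assumes the exact phrasing "Treat changes under PATH as high risk." used above; any other line is simply passed to the model as text. Both function names are hypothetical.

```typescript
// Extract path prefixes from lines like "- Treat changes under app/Billing as high risk."
function highRiskPrefixes(instructions: string): string[] {
  const prefixes: string[] = [];
  const pattern = /Treat changes under (\S+) as high risk/;
  for (const line of instructions.split('\n')) {
    const match = pattern.exec(line);
    if (match) prefixes.push(match[1]);
  }
  return prefixes;
}

// Match whole path segments, so "app/Billing" does not match "app/BillingReports".
function isHighRiskPath(path: string, prefixes: string[]): boolean {
  return prefixes.some(prefix => {
    const base = prefix.replace(/\/$/, '');
    return path === base || path.startsWith(base + '/');
  });
}
```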

Step 10: Keep humans in control

AI-generated PR summaries should help reviewers. They should not replace review. The final summary can include a clear note:

Markdown

Generated summary based on the PR diff and available CI data. Please verify behavior and risk before merging.

This is honest and useful.
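If the summary is regenerated on every push, the note should appear once, not stack up. A trivial sketch with a hypothetical helper name:

```typescript
const GENERATED_NOTE =
  'Generated summary based on the PR diff and available CI data. Please verify behavior and risk before merging.';

// Append the note exactly once, even when re-run on a summary that already has it.
function withGeneratedNote(markdown: string): string {
  const body = markdown.replace(/\s+$/, ''); // drop trailing whitespace before comparing
  return body.endsWith(GENERATED_NOTE) ? body : `${body}\n\n---\n${GENERATED_NOTE}`;
}
```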

A complete workflow

Here is a practical flow:

Text

1. Pull request opens.
2. System collects changed files, diff stats, and CI results.
3. Deterministic classifier marks risky areas.
4. AI receives curated diff context, risk metadata, and repository instructions.
5. AI returns structured JSON.
6. System validates JSON schema.
7. System renders Markdown summary.
8. Summary is posted as a PR comment or used to update the PR description.
9. Human reviewer reviews the PR.

This flow is safer because the AI does not operate in a vacuum: it is grounded in the diff, CI results, and repository rules.
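The steps of that flow can be wired together with the model call injected as a plain function, which keeps the deterministic parts testable without a real LLM. Everything here is an illustrative sketch: SummaryModel, PrInput, and summarizePullRequest are hypothetical, the risk check is deliberately simplified, the hand-written shape check stands in for a schema validator such as zod, and posting the comment is left to your Git host's API.

```typescript
type PrInput = { diff: string; changedPaths: string[]; ciChecks: string[] };
type ModelSummary = { overview: string; testsRun: string[] };

// Any LLM client can sit behind this signature: prompt in, raw JSON text out.
type SummaryModel = (prompt: string) => Promise<string>;

async function summarizePullRequest(input: PrInput, model: SummaryModel): Promise<string> {
  // Step 3: deterministic risk marking (simplified keyword check).
  const risky = input.changedPaths.filter(p => /migrations\/|auth|billing|payment/i.test(p));

  // Step 4: curated context plus facts the model should not rediscover.
  const prompt = [
    `Risky files: ${risky.join(', ') || 'none detected'}`,
    `CI checks: ${input.ciChecks.join(', ') || 'No test run information was provided.'}`,
    input.diff,
  ].join('\n\n');

  // Steps 5-6: parse the JSON and check its shape before trusting it.
  const parsed = JSON.parse(await model(prompt)) as Partial<ModelSummary>;
  if (typeof parsed.overview !== 'string' || !Array.isArray(parsed.testsRun)) {
    throw new Error('Model output did not match the expected summary shape.');
  }

  // Keep only "tests run" claims that mention a check CI actually reported.
  const testsRun = parsed.testsRun.filter(t => input.ciChecks.some(c => t.includes(c)));

  // Step 7: render Markdown. Posting (step 8) and human review (step 9) happen outside this function.
  return [
    `## Summary\n\n${parsed.overview}`,
    `## Risky files\n${risky.map(f => `- ${f}`).join('\n') || '- None detected'}`,
    `## Tests run\n${testsRun.map(t => `- ${t}`).join('\n') || '- No test run information was provided'}`,
  ].join('\n\n');
}
```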

Final thoughts

AI-generated pull request summaries can be genuinely helpful, but they must be designed for review, not for decoration.

A safe PR summary is evidence-based. It separates changed behavior from changed files. It separates tests found from tests run. It highlights risky files. It explains migrations. It admits unknowns. It avoids claiming approval.

That is the difference between a pretty summary and a useful engineering tool. The best AI PR assistant does not say, "Trust me." It says, "Here is what changed, here is why it matters, here is what needs your attention." That is exactly what reviewers need.
