Refactor with Adversarial Review
License: MIT
A Claude Code agent performs a refactor, a hard `npm test` gate proves nothing broke, and then a Codex reviewer adversarially audits the change on the same warm branch, hunting regressions, weak tests, and shortcuts and fixing them in a loop until it approves.
Disclosures
Everything below runs on your machine or inside the sandbox when you use this workflow. Publishing is blocked if these declarations do not match the actual code.
Host hooks
Commands executed on YOUR host machine by Sandcastle lifecycle hooks.
None declared.
Sandbox hooks
Commands executed inside the sandbox container.
npm install
Network access
None. Both agents operate only on the local repository inside the sandbox; the sandbox hook runs `npm install` against your declared package registry.
Shell expansion
No shell-expansion blocks in prompt files.
Files
Diff vs the stock Sandcastle 0.12.0 template Dockerfile — green lines were added by the author, red lines were removed from stock.
+# Sandbox image for the Refactor with Adversarial Review workflow.
+# Installs both agent CLIs — Claude Code (implementer) and Codex (reviewer) —
+# on the stock Sandcastle base, running as a non-root `agent` user.
 FROM node:22-bookworm
 # System dependencies.
 RUN apt-get update && apt-get install -y --no-install-recommends \
     git \
     curl \
     jq \
     ca-certificates \
     && rm -rf /var/lib/apt/lists/*
-# Claude Code CLI (the agent runtime).
-RUN npm install -g @anthropic-ai/claude-code
+# Agent CLIs: Claude Code (implementer) and Codex (reviewer).
+RUN npm install -g @anthropic-ai/claude-code @openai/codex
 # Non-root agent user. `sandcastle docker build-image` aligns AGENT_UID/GID to
 # the host user via --build-arg to avoid permission errors on bind mounts.
+# node:22-bookworm already ships a "node" user at UID/GID 1000, so we RENAME it
+# (the stock Sandcastle template pattern) — groupadd/useradd would collide with
+# the existing IDs on a default build.
 ARG AGENT_UID=1000
 ARG AGENT_GID=1000
-RUN groupadd --gid ${AGENT_GID} agent \
-    && useradd --uid ${AGENT_UID} --gid ${AGENT_GID} --create-home --shell /bin/bash agent
+RUN groupmod -o -g ${AGENT_GID} node \
+    && usermod -o -u ${AGENT_UID} -g ${AGENT_GID} -d /home/agent -m -l agent node
-USER agent
-WORKDIR /workspace
+USER ${AGENT_UID}:${AGENT_GID}
+WORKDIR /home/agent
+
+# Sandcastle bind-mounts the worktree and sets the working directory at
+# container start; the container just needs to stay alive until then.
+ENTRYPOINT ["sleep", "infinity"]
Dockerfile
# Sandbox image for the Refactor with Adversarial Review workflow.
# Installs both agent CLIs — Claude Code (implementer) and Codex (reviewer) —
# on the stock Sandcastle base, running as a non-root `agent` user.
FROM node:22-bookworm
# System dependencies.
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    curl \
    jq \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*
# Agent CLIs: Claude Code (implementer) and Codex (reviewer).
RUN npm install -g @anthropic-ai/claude-code @openai/codex
# Non-root agent user. `sandcastle docker build-image` aligns AGENT_UID/GID to
# the host user via --build-arg to avoid permission errors on bind mounts.
# node:22-bookworm already ships a "node" user at UID/GID 1000, so we RENAME it
# (the stock Sandcastle template pattern) — groupadd/useradd would collide with
# the existing IDs on a default build.
ARG AGENT_UID=1000
ARG AGENT_GID=1000
RUN groupmod -o -g ${AGENT_GID} node \
    && usermod -o -u ${AGENT_UID} -g ${AGENT_GID} -d /home/agent -m -l agent node
USER ${AGENT_UID}:${AGENT_GID}
WORKDIR /home/agent
# Sandcastle bind-mounts the worktree and sets the working directory at
# container start; the container just needs to stay alive until then.
ENTRYPOINT ["sleep", "infinity"]
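The Dockerfile comment says `sandcastle docker build-image` passes the host UID/GID in through `--build-arg`. The helper below shows what that alignment amounts to. It is a hypothetical sketch, not Sandcastle's code: `dockerBuildArgs` and its fallback to 1000 (the Dockerfile's `ARG` defaults) are assumptions. It only builds the flags a manual `docker build` would need so that files the `agent` user writes into the bind-mounted worktree are owned by you on the host. On a POSIX host you would pass `process.getuid()` and `process.getgid()`.

```typescript
// Hypothetical helper (not Sandcastle's code): the --build-arg flags that make
// the image's `agent` user share the host user's UID and GID.
function dockerBuildArgs(uid?: number, gid?: number): string[] {
  // Fall back to the Dockerfile's ARG defaults when an ID is unknown, e.g. on
  // Windows, where Node's process.getuid()/getgid() do not exist.
  const resolvedUid = uid ?? 1000;
  const resolvedGid = gid ?? 1000;
  for (const id of [resolvedUid, resolvedGid]) {
    if (!Number.isInteger(id) || id < 0) {
      throw new RangeError(`invalid UID/GID: ${id}`);
    }
  }
  return [
    "--build-arg", `AGENT_UID=${resolvedUid}`,
    "--build-arg", `AGENT_GID=${resolvedGid}`,
  ];
}

// Example for a host user with UID and GID 1001.
console.log(["docker", "build", ...dockerBuildArgs(1001, 1001), "."].join(" "));
// docker build --build-arg AGENT_UID=1001 --build-arg AGENT_GID=1001 .
```

In normal use the Sandcastle CLI does this for you. The sketch only makes the mechanism visible.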
.sandcastle/.env (example)
# Auth for the Claude Code implementer.
# Run `claude setup-token` on your host to generate a token, then paste it here
# in your local .sandcastle/.env (never commit the real .env).
CLAUDE_CODE_OAUTH_TOKEN=
# Auth for the Codex reviewer.
OPENAI_API_KEY=
.sandcastle/main.ts
import { createSandbox, claudeCode, codex } from "@ai-hero/sandcastle";
import { docker } from "@ai-hero/sandcastle/sandboxes/docker";
// One warm Docker sandbox shared by both agents. Dependencies are installed
// exactly once via the onSandboxReady hook; the Claude Code implementer and the
// Codex reviewer then work the same branch inside the same warm container.
// `await using` tears the sandbox down on exit — if the branch still has
// uncommitted changes the worktree is preserved on disk for inspection.
await using sandbox = await createSandbox({
branch: "agent/refactor",
sandbox: docker(),
hooks: {
sandbox: {
onSandboxReady: [{ command: "npm install", timeoutMs: 300_000 }],
},
},
});
// 1. Implement the refactor with Claude Code.
const implementation = await sandbox.run({
name: "refactor",
agent: claudeCode("claude-sonnet-4-6"),
promptFile: ".sandcastle/refactor-prompt.md",
maxIterations: 5,
});
console.log(`Implementer made ${implementation.commits.length} commit(s).`);
// 2. Hard gate: the suite must be green before the reviewer takes over. A
// non-zero exit code is returned (not thrown), so we can fail loudly instead
// of reviewing a broken tree.
const tests = await sandbox.exec("npm test");
if (tests.exitCode !== 0) {
throw new Error(
`Refactor left the suite red — aborting before review:\n${tests.stdout}\n${tests.stderr}`,
);
}
console.log("Suite green after refactor — handing off to the reviewer.");
// 3. Adversarial review with Codex on the same warm branch and container. The
// reviewer hunts for regressions and weak tests, fixes what it finds, and
// loops until it is satisfied and emits the approval signal.
const review = await sandbox.run({
name: "adversarial-review",
agent: codex("gpt-5.4", { effort: "high" }),
promptFile: ".sandcastle/review-prompt.md",
maxIterations: 5,
completionSignal: "<promise>APPROVED</promise>",
});
console.log(`Reviewer finished with signal: ${review.completionSignal ?? "none"}`);
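The header comment in `main.ts` relies on `await using` (TypeScript 5.2+ explicit resource management). When the enclosing scope exits, whether normally or via an error such as the one the test gate throws, the runtime awaits the resource's async dispose method. The sketch below approximates that guarantee with `try`/`finally` so the control flow is visible. `AsyncDisposableLike`, `FakeSandbox`, and `withResource` are hypothetical stand-ins, and Sandcastle's real teardown (including preserving a worktree that still has uncommitted changes) is not modeled.

```typescript
// Hypothetical stand-in for anything with an async teardown, such as a sandbox.
interface AsyncDisposableLike {
  dispose(): Promise<void>;
}

// Hypothetical fake sandbox that records what happened to it.
class FakeSandbox implements AsyncDisposableLike {
  readonly events: string[] = [];

  async exec(command: string): Promise<{ exitCode: number }> {
    this.events.push(`exec ${command}`);
    // Simulate a red suite so the caller's gate throws.
    return { exitCode: command === "npm test" ? 1 : 0 };
  }

  async dispose(): Promise<void> {
    this.events.push("disposed");
  }
}

// Approximates `await using r = await create();` followed by `body(r)`:
// dispose() is awaited once the body settles, whether it returned or threw.
// (Real `await using` wraps a failing dispose in a SuppressedError; here a
// dispose error would simply replace the body's error.)
async function withResource<R extends AsyncDisposableLike, T>(
  create: () => Promise<R>,
  body: (resource: R) => Promise<T>,
): Promise<T> {
  const resource = await create();
  try {
    return await body(resource);
  } finally {
    await resource.dispose();
  }
}

// Usage: the gate throws, yet teardown still runs.
const sandbox = new FakeSandbox();
withResource(
  async () => sandbox,
  async (s) => {
    const tests = await s.exec("npm test");
    if (tests.exitCode !== 0) throw new Error("suite red");
  },
).catch((err: Error) => {
  // Logs "suite red", then the event order "exec npm test, disposed".
  console.log(err.message, sandbox.events.join(", "));
});
```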
.sandcastle/refactor-prompt.md
Refactor this codebase, carefully
You are a senior engineer performing a behaviour-preserving refactor. Your job is to improve the internal structure of the code without changing what it does from the outside.
What to do
- Read `REFACTOR.md` in the repository root if it exists; it describes the specific refactor requested (which module, what smell, the target shape). If there is no such file, pick the single worst-structured module that carries real logic and improve it.
- Refactor with intent. Good targets: extract duplicated logic, break up a god-function, name things clearly, tighten types, remove dead code, replace a leaky abstraction. Keep each change focused.
- Preserve behaviour. Do not change public APIs, output, or semantics. If a test encodes current behaviour, it must still pass unchanged.
- Run the test suite (`npm test`) and make sure it is green before you finish. If coverage is thin around what you touched, add tests that pin the existing behaviour.
- Commit in small, reviewable steps with clear messages (e.g. `refactor: extract <thing> from <module>`).
Work only on this branch. Leave the working tree clean and the suite green. In the next step an adversarial reviewer will audit everything you did, so make it defensible.
.sandcastle/review-prompt.md
Adversarially review the refactor
You are a skeptical staff engineer reviewing the refactor that just landed on
this branch. Assume nothing is correct until you have proven it. Read
REFACTOR.md (if present) for the intended change, then audit every commit
against it.
What to hunt for
- Behaviour drift: did any public API, return value, error path, or edge case quietly change? A refactor that alters output is a bug.
- Weak or missing tests: is the changed surface actually covered? Are the tests meaningful, or do they assert trivia? Add tests that pin real behaviour.
- Regressions and dead ends: broken imports, unhandled cases, resource leaks, off-by-one errors, swallowed errors introduced by the restructure.
- Shortcuts: commented-out code, `// TODO` left in place of real work, overly broad types papering over a problem, or "simplifications" that lose functionality.
When you find a problem, fix it directly on this branch with a focused
commit and re-run the suite (npm test) to confirm it is green. Do not just
describe problems — resolve them.
Only when you are genuinely satisfied that the refactor is correct, complete,
behaviour-preserving, and well-tested, output the exact line
<promise>APPROVED</promise> and stop.
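The review prompt asks for the approval marker as an exact line. This listing does not describe how Sandcastle matches `completionSignal` against agent output, so the following only illustrates a stricter check you could apply yourself. `hasApprovalLine` is a hypothetical helper that accepts the marker only when it stands on its own line. A reviewer that merely mentions the marker mid-sentence therefore does not count as approving.

```typescript
const APPROVAL = "<promise>APPROVED</promise>";

// Hypothetical helper: true only if some line of the agent's output is exactly
// the approval marker once surrounding whitespace is trimmed. A plain substring
// search would also fire on a sentence that merely mentions the marker.
function hasApprovalLine(output: string, marker: string = APPROVAL): boolean {
  return output.split(/\r?\n/).some((line) => line.trim() === marker);
}

console.log(hasApprovalLine("All checks pass.\n<promise>APPROVED</promise>\n")); // true
console.log(hasApprovalLine(`I will output ${APPROVAL} once the tests pass.`)); // false
```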
README
Refactor with Adversarial Review
Let one model do the refactor and a different model try to tear it apart. This workflow pairs a Claude Code implementer with a Codex reviewer so your structural changes survive a genuinely independent second opinion — not the same model rubber-stamping its own work.
What it does
Refactors are risky precisely because they are supposed to change nothing
observable. This pipeline enforces that discipline. A Claude Code agent performs
a behaviour-preserving refactor (from your REFACTOR.md brief, or its own pick
of the worst-structured module), a hard npm test gate proves the suite is still
green, and then a Codex reviewer adversarially audits every commit — hunting
behaviour drift, weak tests, regressions, and shortcuts. The reviewer fixes what
it finds directly on the branch and only emits <promise>APPROVED</promise> once
it is genuinely satisfied.
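A characterization test makes "change nothing observable" concrete. You feed the pre- and post-refactor versions of a function the same inputs, edge cases and error paths included, and compare what comes out. The sketch below is illustrative only. `observe`, `findDrift`, and both `parsePrice` variants are hypothetical, and the "refactor" deliberately drops a `trim()` call so the comparison has something to catch. Thrown error messages are compared too, because a changed error path counts as drift.

```typescript
type Outcome = { ok: true; value: unknown } | { ok: false; error: string };

// Capture either the return value or the thrown error message.
function observe<A>(fn: (arg: A) => unknown, arg: A): Outcome {
  try {
    return { ok: true, value: fn(arg) };
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e.message : String(e) };
  }
}

// Describe every input whose observable outcome differs between versions.
function findDrift<A>(
  before: (arg: A) => unknown,
  after: (arg: A) => unknown,
  inputs: A[],
): string[] {
  const drift: string[] = [];
  for (const input of inputs) {
    const a = JSON.stringify(observe(before, input));
    const b = JSON.stringify(observe(after, input));
    if (a !== b) drift.push(`${JSON.stringify(input)}: ${a} -> ${b}`);
  }
  return drift;
}

// Hypothetical example: parse "$12.50" into cents.
function legacyParsePrice(s: string): number {
  const m = /^\$(\d+)(?:\.(\d{2}))?$/.exec(s.trim());
  if (!m) throw new Error(`bad price: ${s}`);
  return Number(m[1]) * 100 + Number(m[2] ?? "0");
}

// A "refactor" that quietly drops the trim() call.
function refactoredParsePrice(s: string): number {
  const m = /^\$(\d+)(?:\.(\d{2}))?$/.exec(s);
  if (!m) throw new Error(`bad price: ${s}`);
  return Number(m[1]) * 100 + Number(m[2] ?? "0");
}

// Reports exactly one drifted input: " $1.00" used to parse, now it throws.
console.log(findDrift(legacyParsePrice, refactoredParsePrice, ["$12.50", "$3", " $1.00", "abc"]));
```

The same harness works for any pure function. For code with side effects, the captured outcome would need to include whatever state the function touches.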
How it works
main.ts opens one warm Docker sandbox with createSandbox() and installs
dependencies once via an onSandboxReady hook. Three steps then run on the same
warm branch: (1) the implementer (claude-sonnet-4-6) refactors over up to five
iterations; (2) a hard sandbox.exec("npm test") gate aborts loudly if the
tree is red, so the reviewer never audits broken code; (3) the reviewer
(gpt-5.4 via Codex) loops for up to five passes, fixing and re-testing, until
it emits the approval signal or runs out of passes, and main.ts logs which
signal, if any, it saw. Using a second provider means the critique comes from
a different model than the one that wrote the refactor.
The topology is install → refactor → verify → review (fix and re-test loop) → approved.
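Step (3) is, in effect, a bounded loop: run the reviewer up to five times and stop at the first output that contains the completion signal. The sketch below models only that contract, using a stubbed agent. `runUntilSignal` and `AgentTurn` are hypothetical names, not Sandcastle's `sandbox.run()` implementation, and the check is a plain substring match. Returning `approved: false` when the cap is reached makes the "five passes, no approval" outcome explicit rather than silent.

```typescript
// One agent pass: receives the 1-based iteration number, returns its output.
type AgentTurn = (iteration: number) => Promise<string>;

interface LoopResult {
  approved: boolean;
  iterations: number;
}

// Hypothetical bounded loop: stop at the first output containing the signal,
// or give up after maxIterations passes.
async function runUntilSignal(
  agent: AgentTurn,
  signal: string,
  maxIterations: number,
): Promise<LoopResult> {
  for (let i = 1; i <= maxIterations; i++) {
    const output = await agent(i);
    if (output.includes(signal)) {
      return { approved: true, iterations: i };
    }
  }
  return { approved: false, iterations: maxIterations };
}

// Usage with a stub reviewer that approves on its third pass.
const stubReviewer: AgentTurn = async (i) =>
  i < 3 ? `pass ${i}: fixed a weak test` : "<promise>APPROVED</promise>";

runUntilSignal(stubReviewer, "<promise>APPROVED</promise>", 5).then((r) => {
  console.log(r); // { approved: true, iterations: 3 }
});
```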
Requirements
Set both CLAUDE_CODE_OAUTH_TOKEN (run claude setup-token) and OPENAI_API_KEY
in .sandcastle/.env. Your repo should expose a working npm test. Optionally
add a REFACTOR.md describing the exact change you want. Build the image once
with npx @ai-hero/sandcastle docker build-image, then run the workflow with
npx tsx .sandcastle/main.ts.
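The workflow needs both tokens, so a quick pre-flight check of `.sandcastle/.env` can catch an empty value before the sandbox starts. This is a minimal sketch that assumes simple `KEY=VALUE` lines like the ones in the environment file shown earlier. `missingKeys` is a hypothetical helper, not part of Sandcastle, and it ignores the quoting, `export` prefixes, and multi-line values that fuller `.env` parsers handle.

```typescript
const REQUIRED = ["CLAUDE_CODE_OAUTH_TOKEN", "OPENAI_API_KEY"];

// Hypothetical pre-flight check: returns the required keys that are missing
// or have an empty value in the given .env text.
function missingKeys(envText: string, required: string[] = REQUIRED): string[] {
  const values = new Map<string, string>();
  for (const rawLine of envText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // blank or comment
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // not a KEY=VALUE line
    values.set(line.slice(0, eq).trim(), line.slice(eq + 1).trim());
  }
  return required.filter((key) => !values.get(key));
}

const example = [
  "# Auth for the Claude Code implementer.",
  "CLAUDE_CODE_OAUTH_TOKEN=example-token",
  "# Auth for the Codex reviewer.",
  "OPENAI_API_KEY=",
].join("\n");

console.log(missingKeys(example)); // [ 'OPENAI_API_KEY' ]
```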