Run one task across Codex, DeepSeek, Claude Code (or any CLI agent) in parallel — each in an isolated git worktree — then pick the best diff.
forge run "add retry logic to send_message()" --engine codex,deepseek
        your task
        /       \
    Codex      DeepSeek
   worktree    worktree
   (branch)    (branch)
        \       /
      forge compare
    → LLM picks winner
Stop being the manual copy-paste buffer between AI coding tools.
Zero external dependencies. Pure Python stdlib (3.11+). Works today.
If you maintain real projects (especially multiple ones), you already use several coding agents:
- OpenAI Codex / GPT-4o / o1 for one class of tasks
- Claude Code / Sonnet for architecture and large refactors
- DeepSeek, Qwen, local models for speed and cost
Every time you switch, you lose context. You manually copy prompts, diffs, and decisions. You become the bottleneck.
agent-forge removes that friction.
Task (or task.md)
→ forge run task.md --engine codex,claude,deepseek
→ each engine gets its own git worktree (full isolation)
→ parallel execution
→ forge compare <id> (LLM-assisted diff review)
→ forge review <run> (deeper analysis of one result)
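The flow above can be sketched in a few lines of stdlib Python. This is a minimal illustration, not agent-forge's actual implementation: the branch naming scheme, the `.forge/worktrees` location, and the function names `worktree_command`, `engine_command`, `run_all`, and `subprocess_runner` are all assumptions made for the example. Worktrees are created one at a time, then the engines run in parallel, each inside its own worktree.

```python
import shlex
import subprocess
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

Runner = Callable[[list[str], Path], int]


def worktree_command(repo: Path, run_id: int, engine: str) -> tuple[list[str], Path]:
    """Build the `git worktree add` call that gives one engine its own branch."""
    branch = f"forge/{run_id}/{engine}"  # hypothetical naming scheme
    path = repo / ".forge" / "worktrees" / f"{run_id}-{engine}"  # assumed location
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)], path


def engine_command(template: str, prompt: str) -> list[str]:
    """Split the template into argv first, so the prompt stays a single argument."""
    return [part.replace("{prompt}", prompt) for part in shlex.split(template)]


def subprocess_runner(argv: list[str], cwd: Path) -> int:
    """Run a command in the given directory and return its exit code."""
    return subprocess.run(argv, cwd=cwd, check=False).returncode


def run_all(engines: dict[str, str], prompt: str, repo: Path, run_id: int,
            runner: Runner = subprocess_runner) -> dict[str, int]:
    """Create one worktree per engine, then run every engine in parallel."""
    paths: dict[str, Path] = {}
    for name in engines:  # sequential: avoids concurrent writes to the repo's git metadata
        add_cmd, paths[name] = worktree_command(repo, run_id, name)
        runner(add_cmd, repo)

    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {
            name: pool.submit(runner, engine_command(template, prompt), paths[name])
            for name, template in engines.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Passing the runner as a parameter keeps the orchestration testable without git or any agent CLI installed; threads are sufficient here because the work happens in child processes.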
Web tools (Cursor, Claude.ai, Grok) stay as your upstream for strategy. agent-forge orchestrates only tools that expose a CLI or API.
- True parallel execution with git worktree isolation (no branch pollution)
- Pluggable engines via simple TOML config — add any CLI or OpenRouter-compatible API in 3 lines
- Built-in review & compare using your preferred model (OpenRouter)
- Per-repo SQLite tracking (.forge/forge.db)
- digest command, perfect for daily/launchd summaries
- Clean, auditable diffs after every run
git clone https://github.com/molty/agent-forge.git
cd agent-forge
pip install -e .
cp forge.toml.example forge.toml
Then configure your engines in forge.toml.
forge run "add rate limiting to the auth middleware" --engine codex,deepseek
forge run --file task.md --engine codex,sonnet
forge list
forge show 42
forge compare 42
forge review 87
forge digest --hours 24
forge clean 42
[engines.codex]
type = "cli"
cmd = "codex exec --sandbox workspace-write {prompt}"
timeout = 1800
[engines.deepseek]
type = "cli"
cmd = "commandcode run --prompt-file {prompt_file} --model deepseek/deepseek-v4-pro"
[engines.sonnet]
type = "cli"
cmd = "claude -p {prompt} --model sonnet-3.7"
[review]
model = "anthropic/claude-3.7-sonnet"
api_key_env = "OPENROUTER_API_KEY"
Adding a new engine is one [engines.*] section.
This tool was built because the author maintains multiple active production-grade systems (prediction market research platform, on-chain Solana automation, high-volume Telegram infrastructure). It is used daily to accelerate implementation across different agent strengths without losing context.
If you are an open-source maintainer juggling several coding agents while trying to keep velocity on issues and PRs — this is for you.
- Fast triage of complex issues across multiple models
- Consistent review quality via compare
- Easy to extend with your own preferred agents and review models
We actively welcome contributions from other maintainers.
- Resume previous task runs
- Auto-apply clean reviews
- Better multi-repo support
- Native Telegram / Slack digest
Good first issues are labeled good-first-issue.
Before opening a PR:
- Run python -m pytest (when tests appear)
- Keep the zero-dependency promise for the core
- Update this README if behavior changes
MIT
MVP, but the author's daily driver since early 2026.
Production-ready for power users and maintainers who live in the terminal and work with multiple agents.