swe-bench

Here are 71 public repositories matching this topic...

smallcloudai / refact

AI Agent that handles engineering tasks end-to-end: integrates with developers’ tools, plans, executes, and iterates until it achieves a successful result.

open-source enterprise vscode self-hosted developer-tools on-prem fine-tuning rag ai-agent swe-bench

Updated May 30, 2026
Rust

Human-Agent-Society / CORAL

CORAL is a robust, lightweight infrastructure for multi-agent autonomous self-evolution, built for autoresearch. Works with Claude Code, Codex, Cursor, OpenCode, Kiro, and more.

opencode multi-agent code-generation evolutionary-algorithm codex autonomous-agents agent-framework large-language-models llm-agents agentic-ai self-evolving claude-code coding-agent alpha-evolve swe-bench self-evolving-agents autoresearch

Updated Jun 2, 2026
Python

sipyourdrink-ltd / bernstein

Audit-grade multi-agent orchestration for CLI coding agents (Claude Code, Codex, Gemini CLI, +40 more). HMAC-chained audit log, signed agent cards, per-artefact lineage, air-gap deploy. The orchestrator your compliance team will sign off on. https://bernstein.run

Updated Jun 2, 2026
Python

JARVIS-Xs / SE-Agent

SE-Agent is a self-evolution framework for LLM code agents. It enables trajectory-level evolution to exchange information across reasoning paths via Revision, Recombination, and Refinement, expanding the search space and escaping local optima. On SWE-bench Verified, it achieves SOTA performance.

mcts code-fix swe-agent test-time-scaling claude-code code-agent swe-bench self-evolve

Updated Sep 23, 2025
Python

hwfengcs / DM-Code-Agent

Lightweight, auditable Python code agent (~1500 LOC) — ReAct + Planner + Reflexion + Hybrid RAG, with SWE-bench Lite eval and trace replay.

agent mcp rag llm llm-agent react-agent agent-skills agent-evaluation reflexion-agent code-agent swe-bench

Updated May 28, 2026
Python

usetig / sage

An LLM council that reviews your coding agent's every move

Updated Apr 28, 2026
TypeScript

logic-star-ai / insights

We track and analyze the activity and performance of autonomous code agents in the wild

agents swe-agent swe-bench

Updated Dec 5, 2025
TypeScript

HumphreySun98 / repoagentbench

SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: claude-code, aider (Opus 4.7 / GPT-5.5 / Sonnet 4.6 / Gemini 3.1 Pro).

benchmark developer-tools ai-agents aider llm-eval coding-agents agent-evals swe-bench gemini-3-1-pro claude-opus-4-7 gpt-5-5

Updated Apr 30, 2026
Python

shreyash-sharma / provenant

Wiki-based retrieval for AI coding agents. 65× token reduction, +24pp Coverage@5 on SWE-bench Verified.

python ai mcp developer-tools llm retrieval-augmented-generation codebase-indexing swe-bench

Updated May 28, 2026
Python

KRLabsOrg / squeez

Squeeze verbose LLM agent tool output down to only the relevant lines

python pytorch lora tool-use llm context-compression coding-agent swe-bench

Updated Apr 27, 2026
Python

verseles / showdown

Comprehensive LLM leaderboard aggregating multiple benchmarks into transparent rankings. Open data, community-driven, built with Svelte.

benchmark ai score lmarena swe-bench

Updated Apr 23, 2026
HTML

greynewell / mcpbr

Benchmark your MCP server.

python benchmarking machine-learning mcp ml-evaluation llm-evaluation model-context-protocol swe-bench

Updated Apr 28, 2026
Python

xmpuspus / ai-workflow-benchmark

Benchmark harness measuring AI coding tool+workflow performance, not just model capability. 100 tasks, sigmoid scoring, 12 capability dimensions, gap analysis.

benchmark developer-tools code-generation ai-agents llm-evaluation llm-benchmarking coding-agents ai-coding claude-code swe-bench

Updated May 30, 2026
Python

Vexp-ai / vexp-swe-bench

Open benchmark for AI coding agents on SWE-bench Verified. Compare resolution rates, cost, and unique wins.

benchmark mcp developer-tools ai-agents ai-coding claude-code swe-bench context-engineering

Updated May 2, 2026
Shell

agentic-trust-labs / glassbox-ai

Lean orchestration platform for enterprise AI, where each decision costs hundreds. State machine core, HITL as a first-class state, corrections that accumulate. The first use case is a coding agent. Open research, early stage.

platform enterprise state-machine orchestration transparency human-in-the-loop ai-agent hitl enterprise-ai auditability coding-agent swe-bench

Updated Mar 17, 2026
HTML

jmerelnyc / repair-agent

Repository-level automated code repair agent using the SWE-bench dataset

python program-repair automated-repair ai-agent code-repair swe-bench repository-level

Updated May 1, 2026
Python

greynewell / mcp-serialization-repro

Do MCP tools serialize in Claude Code? Empirical study: readOnlyHint controls parallelism, IPC overhead is ~5ms/call. Reproduces #14353.

mcp llm-agents model-context-protocol claude-code swe-bench tool-parallelism readonlyhint

Updated Feb 15, 2026
Python

s1liconcow / repogauge

Build a private evaluation dataset to optimize your organization's token costs.

token-cost agent-evals swe-bench

Updated Apr 26, 2026
Python

abhaymundhara / llm-benchmark-suite

Benchmark suite for evaluating LLMs and SLMs on coding and SE tasks. Features HumanEval, MBPP, SWE-bench, and BigCodeBench with an interactive Streamlit UI. Supports cloud APIs (OpenAI, Anthropic, Google) and local models via Ollama. Tracks pass rates, latency, token usage, and costs.

python benchmark evaluation gemini openai code-generation claude streamlit humaneval llm ollama swe-bench mbpp bigcodebench

Updated Apr 23, 2026
Python

scitix / Agent-Sandbox

Fast, Multi-Cloud Sandboxes for AI Agents.

kubernetes reinforcement-learning sandbox agents e2b agent-sandbox rlvr swe-bench terminal-bench agentic-rl e2b-compatible swe-rex

Updated Jun 1, 2026
Go
