llm-benchmarks

Here are 12 public repositories matching this topic...

IsThatYou / auto-bench-audit

Automated auditing pipeline for LLM and agent benchmarks — surfaces task ambiguity, environment conflicts, and evaluation bugs.

auditing benchmarking benchmark evaluation agents ai-agents large-language-models llm llm-evaluation agentic-ai agent-evaluation llm-benchmarks

Updated May 27, 2026
HTML

MadsDoodle / Detecting-the-Machine-A-Comprehensive-Benchmark-of-AI-Generated-Text-Detectors-Across-Architectures

A systematic, controlled study of how detectable LLM-generated text is compared with human-written text, using paired question–answer datasets. Rather than proposing a novel detection architecture, the project analyzes detection robustness, failure modes, and the impact of adversarial humanization strategies.

nlp benchmarking text-classification transformers xgboost stylometry bert model-evaluation electra perplexity roberta domain-generalization adversarial-ml ai-evaluation llm-detection ai-generated-text-detection llm-benchmarks

Updated Mar 19, 2026
Jupyter Notebook

ejentum / benchmarks

Benchmark methodology, task sets, and evaluation results for RA²R

benchmarks elephant bbh musr scicode agentic-ai livecodebench reasoning-evaluation arc-agi-3 llm-benchmarks ejentum ra2r causalbench

Updated Jun 1, 2026
Python

ejentum / eval

A/B evaluate any LLM task with and without Ejentum cognitive injection. n8n workflow + TypeScript module.

n8n llm-eval agentic-ai llm-benchmarks reasoning-harness ejentum ab-evaluation blind-eval

Updated May 30, 2026
Python

ky2renzz / irequiv-benchmark

Interaction Net Equivalence Testing for LLMs

lambda-calculus interaction-nets automated-reasoning team-collaboration symbolic-reasoning equivalence-testing llm bend-lang hvm2 llm-benchmarks pytorch-to-bend

Updated Apr 25, 2026
Python

keez97 / claude-architecture-skills

Seven Claude Code skills for software architecture review (Python, web, cloud, microservices). Includes A/B benchmarks against an unskilled baseline, an assertion-graded eval suite, and interactive dashboards.

python tdd architecture cloud-infrastructure code-review software-architecture ai-agents fastapi architecture-review prompt-engineering anthropic llm-evals claude-code claude-code-skills llm-benchmarks

Updated May 8, 2026
HTML

zeroneverload / Local-LLM-Benchmark

A local LLM benchmarking framework designed to evaluate model performance across multiple backends (LM Studio, Ollama, etc.), including metrics for speed, quality, and instruction adherence. Supports structured test runs, result analysis, and reproducible evaluations.

docker benchmarking model-evaluation model-evaluation-metrics llm prompt-testing local-llm llm-tools ollama llm-benchmarking meta-evaluation prompt-test openai-compatible llm-benchmark llm-benchmarks

Updated May 3, 2026
HTML

JeroenVanGorsel / stock-bench

Stock Bench is an LLM benchmarking system where LLMs compete in a prediction market, making bets on how well they’ll perform on tasks. This makes it possible to measure each model's performance, as well as how accurate and self-aware each model is about its own performance.

llm llms llm-evaluation llm-benchmarking llm-benchmark llm-benchmarks

Updated Mar 13, 2026
Python

slobodanmargetic988 / weeyuga-benchmarks-public

Open benchmarks for local LLMs — same prompts, same suites, run on whatever hardware you've got, results compose into one ladder. Crowdsourced data + agent runbook for friends to clone-and-run.

open-data benchmarks gemma crowdsourced llamacpp local-llm ollama qwen agent-friendly llm-benchmarks