Automated auditing pipeline for LLM and agent benchmarks — surfaces task ambiguity, environment conflicts, and evaluation bugs.
Updated May 27, 2026 - HTML
A systematic, controlled study of human versus LLM-generated text detectability using paired question–answer datasets. Rather than proposing a novel detection architecture, the project analyzes detection robustness, failure modes, and the impact of adversarial humanization strategies.
Benchmark methodology, task sets, and evaluation results for RA²R
A/B evaluate any LLM task with and without Ejentum cognitive injection. n8n workflow + TypeScript module.
Interaction Net Equivalence Testing for LLMs
7 Claude Code skills for software architecture review (Python, web, cloud, microservices). Includes A/B benchmarks against unskilled baseline, assertion-graded eval suite, and interactive dashboards.
A local LLM benchmarking framework designed to evaluate model performance across multiple backends (LM Studio, Ollama, etc.), including metrics for speed, quality, and instruction adherence. Supports structured test runs, result analysis, and reproducible evaluations.
Stock Bench is an LLM benchmarking system where LLMs compete in a prediction market, betting on how well they’ll perform on tasks. This makes it possible to measure each model's performance, as well as how accurately each model judges its own performance.
Open benchmarks for local LLMs — same prompts, same suites, run on whatever hardware you've got, results compose into one ladder. Crowdsourced data + agent runbook for friends to clone-and-run.
Experimental data and technical audit of LLM content detectors (GPT-5, Claude 4) vs. humanization algorithms.
Live, open-source benchmark for comparing AI coding agents on real GitHub issues
Compare AI coding models with coding-focused benchmarks weighted your way — cross-verified, contradiction-flagged, in Turkish and English