Changes related to running benchmark experiments for the paper: Support Qwen3.5 and thinking models, Skywork, truncation tracking, benchmark changes etc by ErlisLushtaku · Pull Request #32 · OpenEuroLLM/JudgeArena

ErlisLushtaku · 2026-04-06T21:57:42Z

Updated dependencies to support Qwen3.5.
Added a thinking-token budget to prevent Qwen from exhausting its generation budget before outputting a verdict.
...

Add generation-only runs, turn-1 thinking cleanup for MT-Bench carryover, native baseline resolution across pairwise tasks, and API-reported token usage accounting for OpenRouter-style models. Add vLLM init retries for clearly transient CUDA startup failures.
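
The retry behaviour mentioned above could look roughly like the sketch below. It is not the PR's implementation: `init_with_retries`, `is_transient`, and the `TRANSIENT_MARKERS` list are hypothetical names, and the real criteria for "clearly transient" are not shown on this page. The idea is to retry only when the error message matches a known transient pattern and to re-raise everything else immediately.

```python
import time

# Assumption: illustrative substrings only; the PR's actual patterns are not shown.
TRANSIENT_MARKERS = ("CUDA error", "NCCL")


def is_transient(exc: Exception) -> bool:
    """Return True if the error message looks like a transient startup failure."""
    message = str(exc)
    return any(marker in message for marker in TRANSIENT_MARKERS)


def init_with_retries(factory, max_attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call factory() until it succeeds, retrying only on transient errors.

    `factory` is any zero-argument callable that builds the engine.
    Non-transient errors and the final failed attempt are re-raised unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return factory()
        except Exception as exc:
            if attempt == max_attempts or not is_transient(exc):
                raise
            # Exponential backoff: base, 2*base, 4*base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.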

geoalgo · 2026-05-26T13:12:29Z

Closed, as it was split into independent PRs.

ErlisLushtaku and others added 5 commits April 6, 2026 23:02

update dependencies to support Qwen 3.5

c6b2b0a

slurmpilot scripts

1f4bae8

update dep versions

25b0355

fix support for VLLM

ab065fd

- fix dependencies
- add structured output to prevent judge from not respecting the prompt

remove qwen35 smoke launcher

ef1c92c

ErlisLushtaku changed the title ~~Support qwen 3.5~~ Support Qwen3.5 Apr 6, 2026

kargibora reviewed Apr 7, 2026

Comment thread pyproject.toml Outdated

Comment thread judgearena/evaluate.py Outdated

Comment thread judgearena/evaluate.py Outdated

ErlisLushtaku force-pushed the erlislushtaku/fix/support-qwen-3.5 branch from ab3db1b to ef1c92c Compare April 7, 2026 14:19

ErlisLushtaku added 6 commits April 7, 2026 16:23

use json schema structured outputs, tighten vllm range

32f2e7e

- Switch from choice-based structured outputs to JSON schema constraint
- Tighten vllm version range from >=0.17.0,<1.0.0 to >=0.17.0,<0.19.0
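
As an illustration of what a JSON schema constraint for a judge verdict might look like, here is a minimal sketch using only the standard library. The schema contents, the `verdict` field, the label set, and `parse_verdict` are all assumptions made for this example; the PR's actual schema is not shown on this page, and a later commit reverts to free-form generation. How the schema is passed to vLLM is deliberately left out.

```python
import json

# Assumption: a single-field schema with three hypothetical verdict labels.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {"verdict": {"type": "string", "enum": ["A", "B", "tie"]}},
    "required": ["verdict"],
    "additionalProperties": False,
}


def parse_verdict(raw: str):
    """Return the verdict label, or None if the output does not fit the schema."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Exactly one key named "verdict" is allowed.
    if not isinstance(payload, dict) or set(payload) != {"verdict"}:
        return None
    verdict = payload["verdict"]
    allowed = VERDICT_SCHEMA["properties"]["verdict"]["enum"]
    return verdict if verdict in allowed else None
```

Compared with a choice-based constraint, a schema leaves room for extra fields (for example a score) while still keeping the verdict machine-readable.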

fix formatting

5f2edf0

Fix Qwen3.5 with mt-bench

cffb6dd

use latest vllm with thinking tokens limits and thinking field in the output

ac243aa

Merge remote-tracking branch 'origin/main' into erlislushtaku/fix/support-qwen-3.5

41298a4

thinking token handling improvements, mt-bench improvements, use mt-bench baseline from huggingface and update huggingface repo

319050d

ErlisLushtaku changed the title ~~Support Qwen3.5~~ Support Qwen3.5, fix mt-bench runs and other fixes Apr 14, 2026

Revert to free form generation, and use thinking token budget with regex stripping since the structured output wasn't working for isolating thinking tokens anyway

cb7ada5
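
A rough sketch of regex-based stripping of thinking tokens follows. It assumes the model wraps its reasoning in `<think>...</think>` tags, as Qwen-style thinking models typically do; `strip_thinking` is a hypothetical helper, not the function used in this PR. It also handles two edge cases that can arise with a token budget: an unclosed block, and a closing tag whose opening tag was emitted as part of the prompt.

```python
import re

# Non-greedy match so several separate thinking blocks are removed independently.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove thinking spans and return only the visible answer text."""
    cleaned = _THINK_BLOCK.sub("", text)
    # Unclosed block (generation stopped mid-thought): drop everything after it.
    open_idx = cleaned.find("<think>")
    if open_idx != -1:
        cleaned = cleaned[:open_idx]
    # Closing tag without an opening tag: keep only what follows it.
    close_idx = cleaned.rfind("</think>")
    if close_idx != -1:
        cleaned = cleaned[close_idx + len("</think>"):]
    return cleaned.strip()
```

For example, `strip_thinking("<think>compare both</think>\nVerdict: A")` returns `"Verdict: A"`.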

ErlisLushtaku changed the title ~~Support Qwen3.5, fix mt-bench runs and other fixes~~ Changes related to running benchmark experiments for the paper: Support Qwen3.5, mt-bench, Skywork, and other changes Apr 17, 2026

ErlisLushtaku added 2 commits April 17, 2026 14:06

revert unnecessary changes and relics from earlier trials

84faa05

delete slurmpilot script

c063f3d

ErlisLushtaku commented Apr 17, 2026

Comment thread judgearena/utils.py Outdated

ErlisLushtaku and others added 7 commits April 17, 2026 14:24

Revert comment removal

ec7fc95

simplify and revert unnecessary changes

20ca9a5

Support Skywork

217dc8d

Add judge input character truncation and model length configurations so that we have more customizability

8087c15

- Introduced `truncate_judge_input_chars` and `max_judge_model_len` to `BaseCliArgs` for better control over judge-side input limits.
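
To make the character-level limit concrete, here is a minimal sketch of how a setting like `truncate_judge_input_chars` might be applied. The helper `truncate_chars` is hypothetical and the PR's actual logic is not shown here; returning a flag alongside the text is an assumption based on the "truncation tracking" mentioned in the PR title.

```python
from typing import Optional, Tuple


def truncate_chars(text: str, max_chars: Optional[int]) -> Tuple[str, bool]:
    """Cut text to at most max_chars characters.

    Returns the (possibly shortened) text and a flag saying whether anything
    was cut, so callers can count how many judge inputs were truncated.
    A max_chars of None disables truncation.
    """
    if max_chars is None or len(text) <= max_chars:
        return text, False
    return text[:max_chars], True
```

A character limit is only a coarse proxy for the token limit implied by `max_judge_model_len`, so the two settings are presumably tuned together.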

add llmcompressor dev dependency for quantization

91d67ef

Update baseline handling for Arena-Hard datasets

5e8efc9

- Refactor baseline assignment for Arena-Hard datasets to support category-specific baselines, as in the original benchmark.
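
Category-dependent baseline selection can be sketched as a simple lookup. Everything named below is a placeholder: the category keys, the model names, and `resolve_baseline` are hypothetical and are not taken from this PR or from the Arena-Hard configuration.

```python
from typing import Optional

# Placeholders only: real category names and baseline models are not shown here.
DEFAULT_BASELINE = "baseline-default"
BASELINE_BY_CATEGORY = {
    "hard_prompt": "baseline-model-a",
    "creative_writing": "baseline-model-b",
}


def resolve_baseline(category: str, override: Optional[str] = None) -> str:
    """Pick the baseline for a category; an explicit override always wins."""
    if override is not None:
        return override
    return BASELINE_BY_CATEGORY.get(category, DEFAULT_BASELINE)
```

Keeping the mapping in data rather than in branching code makes it straightforward to add a dataset with a different set of categories.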

Add m-arenahard-v2.0

2af4714

ErlisLushtaku added 2 commits April 22, 2026 01:08

add default baseline for mt-bench

da6818e

handle prohibited content errors for gemini in openrouter

891c417

ErlisLushtaku added 7 commits April 22, 2026 02:44

update system prompt with alpaca eval version, fix mismatch for expected token count for max_model_len

fb36154

roll back to the default system prompt

f33f191

update dependencies for qwen3.5 and gemma4 runs

e21639e

Merge origin/main into support-qwen-3.5

157d939

Includes-AI-Code: true

Clean up judge argument handling

16dc5e1

Make MT-Bench judge budget floors warning-only, remove implicit judge argument fallbacks, collapse MT-Bench thinking stripping onto the judge stripping flag, and restore default prompt completion-label rendering.

Add default score-based verdict mode for fastchat

5411ff8

This was referenced Apr 29, 2026

- Pin dataset revisions for reproducibility #39 (Open)
- Add judge-prompt registry with per-task defaults #40 (Closed)

ErlisLushtaku added 8 commits May 18, 2026 15:06

m-arena-hard localized prompts

bf2d59a

m-arena-hard localized prompts wiring

4a9df8a

random 1k sampling for elo estimation

2333463
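
A seeded subsample of about 1k items could be drawn as in the sketch below. `sample_items`, the default seed, and the choice to return everything when fewer than `k` items exist are assumptions for illustration; the commit's actual sampling code is not shown on this page.

```python
import random


def sample_items(items, k=1000, seed=0):
    """Return a reproducible random subset of at most k items.

    A dedicated Random instance keeps the draw independent of global
    random state, so repeated runs with the same seed pick the same subset.
    """
    items = list(items)
    if len(items) <= k:
        return items
    rng = random.Random(seed)
    return rng.sample(items, k)
```

Fixing the seed matters for Elo estimation because ratings computed on different subsamples are not directly comparable.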

olmo3 reasoning parser

b8a8700

paper experiment scripts

fae90de

tests

26fd6f2

Cleanup openrouter pricing logic and wiring as well as some tests

a8f9f52

cut excessive tests

a9c081f

geoalgo closed this May 26, 2026

ErlisLushtaku wants to merge 38 commits into main from erlislushtaku/fix/support-qwen-3.5

3 participants
