Changes related to running benchmark experiments for the paper: Support Qwen3.5 and thinking models, Skywork, truncation tracking, benchmark changes etc #32

Closed
ErlisLushtaku wants to merge 38 commits into main from erlislushtaku/fix/support-qwen-3.5
Conversation

@ErlisLushtaku
Collaborator

@ErlisLushtaku ErlisLushtaku commented Apr 6, 2026

  • Updated dependencies to support Qwen3.5
  • Added a thinking-token budget to prevent Qwen from exhausting its entire generation budget without outputting a verdict.
  • ...
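
One way to read the thinking-budget bullet is as a split of the generation limit into a capped reasoning part and a reserved verdict part. The sketch below is only an illustration under that assumption; `split_generation_budget` and `min_verdict_tokens` are hypothetical names, not identifiers from this PR.

```python
def split_generation_budget(
    max_tokens: int,
    thinking_budget: int,
    min_verdict_tokens: int = 64,  # hypothetical reserve for the final verdict
) -> tuple:
    """Return (thinking_tokens, verdict_tokens) so a verdict always fits."""
    if max_tokens <= min_verdict_tokens:
        raise ValueError("max_tokens must exceed the verdict reserve")
    # Cap reasoning so that at least `min_verdict_tokens` remain afterwards.
    thinking = max(0, min(thinking_budget, max_tokens - min_verdict_tokens))
    return thinking, max_tokens - thinking
```

With a 1024-token limit and a requested 4096-token thinking budget, this caps reasoning at 960 tokens and keeps 64 for the verdict.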

@ErlisLushtaku ErlisLushtaku changed the title Support qwen 3.5 Support Qwen3.5 Apr 6, 2026
Comment thread pyproject.toml Outdated
Comment thread judgearena/evaluate.py Outdated
Comment thread judgearena/evaluate.py Outdated
@ErlisLushtaku ErlisLushtaku force-pushed the erlislushtaku/fix/support-qwen-3.5 branch from ab3db1b to ef1c92c April 7, 2026 14:19
- Switch from choice-based structured outputs to JSON schema constraint
- Tighten vllm version range from >=0.17.0,<1.0.0 to >=0.17.0,<0.19.0
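
To illustrate the move from a choice list to a JSON schema constraint, here is a minimal sketch of what such a schema and a matching parser might look like. The field name `verdict` and the labels `A`, `B`, `tie` are assumptions for illustration; the schema actually used in `judgearena` is not shown in this thread, and the call that hands the schema to vLLM is deliberately omitted.

```python
import json

# Hypothetical verdict schema; the field name and labels are assumptions.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {"verdict": {"type": "string", "enum": ["A", "B", "tie"]}},
    "required": ["verdict"],
    "additionalProperties": False,
}


def parse_verdict(raw: str) -> str:
    """Parse a judge completion and check it against VERDICT_SCHEMA by hand."""
    payload = json.loads(raw)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(payload, dict) or set(payload) != {"verdict"}:
        raise ValueError("expected an object with exactly one key: 'verdict'")
    allowed = VERDICT_SCHEMA["properties"]["verdict"]["enum"]
    if payload["verdict"] not in allowed:
        raise ValueError(f"verdict must be one of {allowed}")
    return payload["verdict"]
```

Validating on the client side as well keeps the evaluation robust if the constraint is not enforced, for example when a completion is cut off before the closing brace.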
…ench baseline from huggingface and update huggingface repo
@ErlisLushtaku ErlisLushtaku changed the title Support Qwen3.5 Support Qwen3.5, fix mt-bench runs and other fixes Apr 14, 2026
…gex stripping since the structured output wasn't working for isolating thinking tokens anyway
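
The stripping step mentioned in this commit message could look roughly like the sketch below. It assumes the model wraps its reasoning in `<think>...</think>` tags, as Qwen3-style thinking models do; `strip_thinking` is a hypothetical helper name, not the function from this PR.

```python
import re

# Matches a complete reasoning block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove reasoning segments and return only the visible answer."""
    cleaned = _THINK_BLOCK.sub("", text)
    # Some chat templates open <think> in the prompt, so the completion
    # contains only the closing tag: keep what follows the last one.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[1]
    # An unclosed <think> means the model ran out of tokens mid-reasoning:
    # drop everything from the opening tag onward.
    unclosed = cleaned.find("<think>")
    if unclosed != -1:
        cleaned = cleaned[:unclosed]
    return cleaned.strip()
```

For example, `strip_thinking("<think>a\nb</think>\nVerdict: A")` returns `"Verdict: A"`.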
@ErlisLushtaku ErlisLushtaku changed the title Support Qwen3.5, fix mt-bench runs and other fixes Changes related to running benchmark experiments for the paper: Support Qwen3.5, mt-bench, Skywork, and other changes Apr 17, 2026
Comment thread judgearena/utils.py Outdated
ErlisLushtaku and others added 7 commits April 17, 2026 14:24
…so that we have more customizability

- Added `truncate_judge_input_chars` and `max_judge_model_len` to `BaseCliArgs` for better control over judge-side input limits.
- Refactored baseline assignment for Arena-Hard datasets to support category-specific baselines, as in the original benchmark.
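
As a rough illustration of how a character limit on judge inputs can be combined with truncation tracking, consider the sketch below. Only the two field names come from the commit message; the `JudgeInputLimits` class, the types, the defaults, and the `truncate_for_judge` helper are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class JudgeInputLimits:
    # Field names mirror the commit message; types and defaults are assumed.
    truncate_judge_input_chars: Optional[int] = None
    # Presumably forwarded to the judge engine's context-length setting;
    # not used in this sketch.
    max_judge_model_len: Optional[int] = None


def truncate_for_judge(text: str, limits: JudgeInputLimits) -> Tuple[str, bool]:
    """Return the (possibly shortened) text and whether it was truncated."""
    limit = limits.truncate_judge_input_chars
    if limit is None or len(text) <= limit:
        return text, False
    return text[:limit], True
```

Returning the flag alongside the text lets the caller count how many completions were cut before judging, which is what truncation tracking needs.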
@ErlisLushtaku ErlisLushtaku changed the title Changes related to running benchmark experiments for the paper: Support Qwen3.5, mt-bench, Skywork, and other changes Changes related to running benchmark experiments for the paper: Support Qwen3.5 and thinking models, Skywork, truncation tracking, benchmark changes etc Apr 21, 2026
Add generation-only runs, turn-1 thinking cleanup for MT-Bench carryover, native baseline resolution across pairwise tasks, and API-reported token usage accounting for OpenRouter-style models. Add vLLM init retries for clearly transient CUDA startup failures.
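
The init-retry idea from this commit can be sketched as a generic wrapper around an engine factory. Everything here is an assumption for illustration: the helper name, the marker strings used to classify an error as transient, and the linear backoff. The PR's actual criteria are not shown in this thread.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def init_with_retries(
    factory: Callable[[], T],
    max_attempts: int = 3,
    delay_s: float = 5.0,
    transient_markers: Sequence[str] = ("CUDA",),  # hypothetical markers
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `factory`, retrying only errors whose message looks transient."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return factory()
        except RuntimeError as exc:
            message = str(exc).lower()
            transient = any(m.lower() in message for m in transient_markers)
            # Re-raise non-transient errors at once, and the last failure.
            if not transient or attempt == max_attempts:
                raise
            sleep(delay_s * attempt)  # linear backoff between attempts
    raise AssertionError("unreachable")
```

A caller would pass a zero-argument callable that builds the engine, so the wrapper stays independent of any particular vLLM constructor signature.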
Make MT-Bench judge budget floors warning-only, remove implicit judge argument fallbacks, collapse MT-Bench thinking stripping onto the judge stripping flag, and restore default prompt completion-label rendering.
@geoalgo
Collaborator

geoalgo commented May 26, 2026

Closed because it was split into independent PRs.

@geoalgo geoalgo closed this May 26, 2026

3 participants