com.microsoft::QMoE should prepack int4/int8 weights in PrePack(), like MatMulNBits does

## Inconsistency

`com.microsoft::MatMulNBits` calls `preprocess_weights_for_mixed_gemm_cuda` in its `PrePack()` hook ([`matmul_nbits.cc:164`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc#L164)). Users therefore hand it the raw `[N, K/2]` packed int4 weights produced by `quantize_matmul_4bits`, and ORT does the CUTLASS fpA_intB layout conversion (row permutation + sub-byte transpose + column interleave + bias) automatically at session-load time.
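For reference, the raw `[N, K/2]` layout mentioned above just stores two 4-bit values per byte. Here is a minimal sketch in plain Python. It assumes the even-indexed element goes in the low nibble, which is an assumption about nibble order and not something taken from the ORT source. `pack_int4_rows` is a hypothetical helper name, not an ORT API.

```python
def pack_int4_rows(rows):
    """Pack an N x K grid of unsigned 4-bit values into N rows of K/2 bytes."""
    packed = []
    for row in rows:
        if len(row) % 2 != 0:
            raise ValueError("K must be even to pack two int4 values per byte")
        if any(not 0 <= v <= 15 for v in row):
            raise ValueError("every value must fit in 4 bits")
        # Assumed nibble order: even-indexed element in the low nibble,
        # odd-indexed element in the high nibble.
        packed.append(
            bytes(row[i] | (row[i + 1] << 4) for i in range(0, len(row), 2))
        )
    return packed


weights = [[1, 2, 3, 4], [15, 0, 7, 8]]  # N = 2, K = 4
packed = pack_int4_rows(weights)         # 2 rows of K/2 = 2 bytes each
```

Nothing in this buffer records which layout it is in. That is why the consumer has to know whether the CUTLASS conversion has already been applied.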

`com.microsoft::QMoE` does **not** do this. Its [`PrePack()`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc#L942) explicitly sets `is_packed = false` for input slots 2 and 5 (`fc1_experts_weights` / `fc2_experts_weights`) when `quant_type == 'int'`, and the compute path passes `tensor->DataRaw()` straight into the CUTLASS runner. In practice, users must hand QMoE **already-prepacked** weights; otherwise the kernel silently produces garbage output.

## Impact

Producing those prepacked weights requires `pack_weights_for_cuda_mixed_gemm`, which only ships in CUDA-built ORT (`USE_CUDA`). That forces any offline quantization tool to:

- depend on a CUDA-built ORT installation just to write out a QMoE model, even though the actual quantization math (`quantize_matmul_4bits`) is CPU-side, and
- duplicate the CUTLASS layout transform on the offline side. We just hit this in microsoft/Olive#2491 (offline MoE→QMoE rewrite for mobius-exported Gemma 4 MoE models), where we end up calling `pack_weights_for_cuda_mixed_gemm` per expert as part of the Olive pass.

By contrast, models that consume `MatMulNBits` can be quantized on a CPU-only host and then loaded anywhere, because ORT handles the prepack at `InferenceSession` creation time.

## Proposal

Add a `PrePack()` branch for `(input_idx == 2 || input_idx == 5)` when `quant_type_ == "int"`:

1. Slice the `[E, N, K/(8/bits)]` weight tensor into `E` per-expert `[N, K/(8/bits)]` slices.
2. Call the existing `preprocess_weights_for_mixed_gemm_cuda` on each slice.
3. Stack the results back into a `[E, K, N/(8/bits)]` tensor that the compute path can use directly.
4. Set `is_packed = true` and store the result so the original initializer can be freed.

This is symmetric with what `MatMulNBits` already does, just looped over the `E` axis. The compute path then reads `packed_fc1_weights_` / `packed_fc2_weights_` instead of `fc1_experts_weights->DataRaw()` directly.
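The four steps above can be sketched in plain Python. This only illustrates the per-expert loop; the actual change would live in the C++ `PrePack()`. Assumptions: `prepack_experts` is a hypothetical name, the weights are a flat byte buffer, and `prepack_one` is a caller-supplied stand-in for `preprocess_weights_for_mixed_gemm_cuda`, whose real signature is not reproduced here.

```python
def prepack_experts(weights, num_experts, n, k, bits, prepack_one):
    """Apply a per-expert layout transform to a flat [E, N, K/(8/bits)] buffer.

    `prepack_one(raw, n, k, bits)` is a hypothetical callable that converts one
    expert's [N, K/(8/bits)] bytes into the [K, N/(8/bits)] kernel layout.
    """
    if bits not in (4, 8):
        raise ValueError("only int4 and int8 weights are considered here")
    pack = 8 // bits  # quantized values stored per byte
    if k % pack or n % pack:
        raise ValueError("N and K must be divisible by the pack factor")
    expert_bytes = n * k // pack
    if len(weights) != num_experts * expert_bytes:
        raise ValueError("buffer size does not match [E, N, K/(8/bits)]")

    out = bytearray()
    for e in range(num_experts):
        # Step 1: slice out one expert.
        raw = weights[e * expert_bytes:(e + 1) * expert_bytes]
        # Step 2: run the per-matrix conversion on that slice.
        converted = prepack_one(raw, n, k, bits)
        if len(converted) != expert_bytes:
            raise ValueError("layout transform must preserve the byte count")
        # Step 3: stack the converted slices back along the E axis.
        out += converted
    # Step 4: the caller stores this buffer and marks the input as packed.
    return bytes(out)


def reverse_bytes(raw, n, k, bits):
    # Placeholder transform for demonstration only; NOT the CUTLASS layout.
    return raw[::-1]


demo = prepack_experts(bytes(range(16)), num_experts=2, n=2, k=8, bits=4,
                       prepack_one=reverse_bytes)
```

The transform is applied independently to each expert, so the result is the same size as the input and the loop needs no cross-expert state.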

## Why this matters now

The recent #28467 work (QMoE CUDA EP + MoE GEMM refactor) opens up QMoE as a serious target for standard MoE models like Gemma 4. The offline tooling story is going to get used a lot more, and the asymmetry with MatMulNBits is going to confuse everyone who runs into it.

I'm happy to send a PR if there's agreement on the approach.
