com.microsoft::QMoE should prepack int4/int8 weights in PrePack(), like MatMulNBits does #28748

@justinchuby

Inconsistency

com.microsoft::MatMulNBits calls preprocess_weights_for_mixed_gemm_cuda in its PrePack() hook (matmul_nbits.cc:164). Users therefore hand it the raw [N, K/2] packed int4 weights produced by quantize_matmul_4bits, and ORT performs the CUTLASS fpA_intB layout conversion (row permutation + sub-byte transpose + column interleave + bias) automatically at session-load time.

com.microsoft::QMoE does not do this. Its PrePack() explicitly sets is_packed = false for input slots 2 and 5 (fc1_experts_weights / fc2_experts_weights) when quant_type == 'int', and the compute path passes tensor->DataRaw() straight into the CUTLASS runner. Concretely: users must hand QMoE already-prepacked weights or the kernel silently produces garbage output.

Impact

Producing those prepacked weights requires pack_weights_for_cuda_mixed_gemm, which only ships in CUDA-built ORT (USE_CUDA). That forces any offline quantization tool to:

  • depend on a CUDA-built ORT installation just to write out a QMoE model, even though the actual quantization math (quantize_matmul_4bits) is CPU-side, and
  • duplicate the CUTLASS layout transform on the offline side. We just hit this in Olive#2491, "Add OnnxMoEQuantization pass (com.microsoft::MoE → QMoE)", an offline MoE→QMoE rewrite for mobius-exported Gemma 4 MoE models: the Olive pass ends up calling pack_weights_for_cuda_mixed_gemm once per expert.

By contrast, models that consume MatMulNBits can be quantized on a CPU-only host and then loaded anywhere — ORT handles the prepack at InferenceSession creation time.

Proposal

Add a PrePack() branch for (input_idx == 2 || input_idx == 5) when quant_type_ == "int":

  1. Slice the [E, N, K/(8/bits)] weight tensor into E per-expert [N, K/(8/bits)] slices.
  2. Call the existing preprocess_weights_for_mixed_gemm_cuda on each slice.
  3. Stack the results back into a [E, K, N/(8/bits)] tensor that the compute path can use directly.
  4. Set is_packed = true and store the result so the original initializer can be freed.

This is symmetric with what MatMulNBits already does, just looped over the E axis. The compute path then reads packed_fc1_weights_ / packed_fc2_weights_ instead of fc1_experts_weights->DataRaw() directly.
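The four steps above can be sketched in a few lines. This is only an illustration of the slice / transform / restack loop, written in pure Python over a flat byte buffer; it is not ORT code. `prepack_experts` and `placeholder_transpose` are hypothetical names, and `placeholder_transpose` is a plain byte-level transpose standing in for preprocess_weights_for_mixed_gemm_cuda. It does not reproduce the CUTLASS fpA_intB layout.

```python
from typing import Callable

# Hypothetical signature for a per-expert transform: (bytes, N, K, bits) -> bytes.
SliceTransform = Callable[[bytes, int, int, int], bytes]


def prepack_experts(weights: bytes, num_experts: int, n: int, k: int,
                    bits: int, prepack_slice: SliceTransform) -> bytes:
    """Apply a per-expert layout transform to a flat [E, N, K/(8/bits)] buffer."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    pack = 8 // bits                      # quantized values stored per byte
    if k % pack:
        raise ValueError("K must be a multiple of 8/bits")
    slice_bytes = n * (k // pack)         # bytes in one [N, K/(8/bits)] expert
    if len(weights) != num_experts * slice_bytes:
        raise ValueError("buffer size does not match [E, N, K/(8/bits)]")

    packed = bytearray()
    for e in range(num_experts):
        # Step 1: slice out one expert.
        expert = weights[e * slice_bytes:(e + 1) * slice_bytes]
        # Step 2: run the per-slice transform (a stand-in in this sketch).
        out = prepack_slice(expert, n, k, bits)
        if len(out) != slice_bytes:
            raise ValueError("transform must preserve the byte count")
        # Step 3: stack the transformed slices back along the E axis.
        packed += out
    # Step 4 (setting is_packed and freeing the initializer) is ORT-side
    # bookkeeping and is not modelled here.
    return bytes(packed)


def placeholder_transpose(expert: bytes, n: int, k: int, bits: int) -> bytes:
    """Stand-in transform: byte-level transpose [N, cols] -> [cols, N].

    NOT the real CUTLASS layout conversion; illustration only.
    """
    cols = k // (8 // bits)
    return bytes(expert[r * cols + c] for c in range(cols) for r in range(n))


# Two experts, each a 2 x 2-byte matrix (N=2, K=4, int4).
raw = bytes(range(8))
packed = prepack_experts(raw, 2, 2, 4, 4, placeholder_transpose)
print(list(packed))  # [0, 2, 1, 3, 4, 6, 5, 7]
```

Each expert is transformed independently, so the byte count per expert is unchanged and only the layout within each slice differs; that is why the loop can reuse the existing per-matrix routine without modification.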

Why this matters now

The recent #28467 work (QMoE CUDA EP + MoE GEMM refactor) opens up QMoE as a serious target for standard MoE models like Gemma 4. The offline tooling story is going to get used a lot more, and the asymmetry with MatMulNBits is going to confuse everyone who runs into it.

I'm happy to send a PR if there's agreement on the approach.
