Inconsistency
com.microsoft::MatMulNBits calls preprocess_weights_for_mixed_gemm_cuda in its PrePack() hook (matmul_nbits.cc:164), so users hand it the raw [N, K/2] packed int4 weights produced by quantize_matmul_4bits and ORT does the CUTLASS fpA_intB layout conversion (row permutation + sub-byte transpose + column interleave + bias) automatically at session-load time.
com.microsoft::QMoE does not do this. Its PrePack() explicitly sets is_packed = false for input slots 2 and 5 (fc1_experts_weights / fc2_experts_weights) when quant_type == 'int', and the compute path passes tensor->DataRaw() straight into the CUTLASS runner. Concretely: users must hand QMoE already-prepacked weights or the kernel silently produces garbage output.
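To make the "raw [N, K/2] packed int4" layout concrete, here is a minimal sketch of packing two 4-bit values into each byte. The helper names are hypothetical, and the nibble order (low nibble first) is an assumption for illustration; this is not the code in quantize_matmul_4bits, and it performs none of the CUTLASS layout conversion.

```python
from typing import List


def pack_int4_row(values: List[int]) -> bytes:
    """Pack unsigned 4-bit values (0..15) two per byte.

    Assumption: the first value of each pair goes in the low nibble.
    """
    if len(values) % 2 != 0:
        raise ValueError("K must be even to pack two int4 values per byte")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("int4 values must be in the range 0..15")
    # Pair up consecutive values: (v0, v1), (v2, v3), ...
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def pack_int4_matrix(rows: List[List[int]]) -> List[bytes]:
    """Pack an [N, K] matrix of int4 values into N rows of K/2 bytes each."""
    return [pack_int4_row(row) for row in rows]


# An [N=2, K=4] matrix becomes 2 rows of K/2 = 2 bytes.
packed = pack_int4_matrix([[1, 2, 3, 4], [15, 0, 0, 15]])
```

This unpermuted buffer is the kind of input MatMulNBits accepts. Per the description above, QMoE currently expects the result of a further CUTLASS-specific transform.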
Impact
Producing those prepacked weights requires pack_weights_for_cuda_mixed_gemm, which only ships in CUDA-built ORT (USE_CUDA). That forces any offline quantization tool to:
depend on a CUDA-built ORT installation just to write out a QMoE model, even though the actual quantization math (quantize_matmul_4bits) is CPU-side, and
duplicate the CUTLASS layout transform on the offline side. We just hit this in Olive#2491, "Add OnnxMoEQuantization pass (com.microsoft::MoE → QMoE)", an offline MoE→QMoE rewrite for mobius-exported Gemma 4 MoE models: the Olive pass ends up calling pack_weights_for_cuda_mixed_gemm once per expert.
By contrast, models that consume MatMulNBits can be quantized on a CPU-only host and then loaded anywhere — ORT handles the prepack at InferenceSession creation time.
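The CUDA-build dependency can be illustrated with a small guard an offline tool might use before trying to write a QMoE model. This is only a sketch: require_cuda_packer is a hypothetical helper, and because this issue does not say which Python module exposes pack_weights_for_cuda_mixed_gemm, the caller passes the module object in rather than the sketch guessing an import path.

```python
from types import SimpleNamespace
from typing import Any, Callable

PACK_FN_NAME = "pack_weights_for_cuda_mixed_gemm"


def require_cuda_packer(module: Any) -> Callable[..., Any]:
    """Return the packing helper from `module`, or fail with a clear message.

    `module` is whatever object the installed ORT build exposes the helper on;
    its location is an assumption left to the caller.
    """
    fn = getattr(module, PACK_FN_NAME, None)
    if not callable(fn):
        raise RuntimeError(
            f"{PACK_FN_NAME} is not available; a CUDA-built ORT is needed "
            "to prepack QMoE weights offline"
        )
    return fn


# Stand-ins for a CPU-only build (helper missing) and a CUDA build (helper present).
cpu_only_build = SimpleNamespace()
cuda_build = SimpleNamespace(**{PACK_FN_NAME: lambda *args: b""})

packer = require_cuda_packer(cuda_build)
```

With the PrePack() change proposed below, a guard like this would no longer be needed, since the offline tool would only write the raw quantized weights.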
Proposal
Add a PrePack() branch for (input_idx == 2 || input_idx == 5) when quant_type_ == "int":
Slice the [E, N, K/(8/bits)] weight tensor into E per-expert [N, K/(8/bits)] slices.
Call the existing preprocess_weights_for_mixed_gemm_cuda on each slice.
Stack the results back into a [E, K, N/(8/bits)] tensor that the compute path can use directly.
Set is_packed = true and store the result so the original initializer can be freed.
This is symmetric with what MatMulNBits already does, just looped over the E axis. The compute path then reads packed_fc1_weights_ / packed_fc2_weights_ instead of fc1_experts_weights->DataRaw() directly.
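The slice / transform / stack loop described above can be sketched in a few lines. This is an illustration of the control flow only, written in Python rather than as the actual C++ PrePack() change: prepack_experts and the pack_one callback are hypothetical names, and pack_one stands in for preprocess_weights_for_mixed_gemm_cuda, whose real signature is not reproduced here.

```python
from typing import Callable

# (expert_slice, n, k, bits) -> packed bytes for one expert
PackOne = Callable[[bytes, int, int, int], bytes]


def prepack_experts(weights: bytes, num_experts: int, n: int, k: int,
                    bits: int, pack_one: PackOne) -> bytes:
    """Apply a per-expert layout transform to a flat [E, N, K/(8/bits)] buffer."""
    pack_size = 8 // bits  # quantized values per byte
    if n % pack_size != 0 or k % pack_size != 0:
        raise ValueError("N and K must be multiples of 8/bits")
    slice_bytes = n * (k // pack_size)  # bytes in one [N, K/(8/bits)] slice
    if len(weights) != num_experts * slice_bytes:
        raise ValueError("buffer size does not match [E, N, K/(8/bits)]")

    out = bytearray()
    for e in range(num_experts):
        # Step 1: slice out expert e.
        expert = weights[e * slice_bytes:(e + 1) * slice_bytes]
        # Step 2: run the per-matrix transform on that slice.
        packed = pack_one(expert, n, k, bits)
        # [K, N/(8/bits)] holds the same number of bytes as the input slice.
        if len(packed) != k * (n // pack_size):
            raise ValueError("per-expert transform changed the byte count")
        # Step 3: stack the result back along the E axis.
        out += packed
    return bytes(out)


def reverse_bytes(expert: bytes, n: int, k: int, bits: int) -> bytes:
    """Placeholder transform for demonstration; NOT the CUTLASS layout."""
    return expert[::-1]


# E=2 experts, N=2, K=6, 4-bit weights: each expert slice is 2 * 3 = 6 bytes.
result = prepack_experts(bytes(range(12)), 2, 2, 6, 4, reverse_bytes)
```

Each expert is transformed independently, so the loop adds no cross-expert state. The packed buffer is what the compute path would read in place of the original initializer.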
Why this matters now
The recent #28467 work (QMoE CUDA EP + MoE GEMM refactor) opens up QMoE as a serious target for standard MoE models like Gemma 4. The offline tooling story is going to get used a lot more, and the asymmetry with MatMulNBits is going to confuse everyone who runs into it.
I'm happy to send a PR if there's agreement on the approach.