Community setup guides, benchmark recipes, troubleshooting notes, and patches for Intel XPU local AI work.
This repository is meant to become a stable community hub for Intel XPU local AI:
| I want to… | Go here |
|---|---|
| Ask for setup help | Discussions |
| Read community-maintained notes | Wiki |
| Deploy MiniMax M2.7 INT4 on 4x B70 | MiniMax Ubuntu 24 guide |
| Find model-specific recipes | Model recipes |
| Share a benchmark | Community results guide |
| Compare GPUs | GPU comparison |
| Send Intel feedback | Feedback for Intel |
The best documented fresh install today is:
0.0.0.0:800032768 tokens by default110.90 total tok/s, 83.17 output tok/s for the strict p512/n1536
2K comparison lane; warm OpenAI-compatible serving reached 84.12 output
tok/s at the 32K served setting.This is a deployable baseline, not the final speed ceiling. The strict
benchmark/quality lane remains p512/n1536 at context 2048 for comparability;
the served OpenAI-compatible endpoint now defaults to 32768 and validated a
32,408-token prompt plus 64 generated tokens without OOM.
Open a discussion with:
Good categories for discussion:
The rest of this README is dense historical lab context. New users should start with the links above.
xpu-smi v1.3.6: all four B70s have slot-facing host/root and card-upstream links at 32GT/s x16 (PCIe 5.0 x16). The 2.5GT/s x1 values shown by /sys/class/drm/*/device/current_link_* are the internal downstream bridge/GPU endpoint view described by Intel’s Arc PCIe hierarchy note, not the external slot link. xpu-smi reports PCIe downgrade disabled on all four cards; no power or frequency limits were changed.60.497227 output tok/s and 80.662970 total tok/s at p512/n1536. Finite tracing showed the prior fast full-logits path can produce NaNs at the TP full-vocab logits gather; the promoted path preserves temperature=0 argmax semantics by gathering only per-rank top (value,index) pairs. Quality gates passed 32x arithmetic repeat, six-prompt semantic repeat, and raw145 exact token hashes at n64/n256. LocalMaxxing accepted this result as cmp940h1703tpo401scj5tftf. Older 67-73 tok/s graph submissions remain historical speed clues until the full-logits gather corruption is fixed.Q4_0 GGUF on llama.cpp. Current quality-cleared no-root GGUF result is 50.130 tok/s on three B70s at 512 prompt / 512 output using the experimental flat Qwen35 fused beta/alpha GGUF, SYCL tensor split, -ub 128, --poll 25, Q8 activation cache, fused MMVQ2, fused MMVQ2+SwiGLU, fused RMS_NORM+scale-MUL, fused allreduce+ADD, fused final allreduce+GET_ROWS, single-kernel allreduce, and GGML_SYCL_COMM_SYNC_AFTER=2, with GGML_SYCL_COMM_FUSEADD_ROOT_RESIDUAL=0. LocalMaxxing accepted this experimental result as cmov6p4r7007tqr01yi8ug4un.GGML_SYCL_COMM_FUSEADD_ROOT_RESIDUAL=1 plus GGML_META_FUSE_ALLREDUCE_ADD=1. The accepted LocalMaxxing IDs are historical pending a root-residual ordering fix.-ts 1/1/1/0.05) after the guard-fix refresh. This improves the older assist result by 12.46% and equal 4x by 26.22%, but still trails 3x. Equal four-card split remains a negative scaling diagnostic at 34.929 tok/s.vrfai/Qwen3.6-27B-FP8, local XPU patches, 4x B70 TP4, CPU n-gram speculative decode, 49.582 output tok/s at 512 prompt / 512 output. This preserves target-model quality through verified speculative decoding, but now trails the current Q4_0 TP3 decode result by about 2.6%.max_model_len=32768 reaches 42.996 tok/s for 2048 prompt / 256 output and reports 1,133,163 GPU KV-cache tokens. The 2026-05-07 512/512 refresh kept TP4 ahead (45.865 tok/s no-spec, 48.082 tok/s n-gram). TP2/PP2 fits but is much slower for batch-1 decode (27.722 tok/s at 512/512) and should be treated as a capacity layout, not the speed path.CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0 was a tiny FP8 TP4 no-spec win (46.386 tok/s vs 45.865 tok/s) but regressed n-gram speculation (44.439 tok/s) because draft acceptance collapsed. Keep default topology recognition for speculative TP4.SYCL2/SYCL3 smoke hit UR_RESULT_ERROR_DEVICE_LOST in the existing SYCL split matmul path.mtp.safetensors now loads cleanly with an opt-in local vLLM patch, but the corrected MTP path is too slow (2.36 tok/s eager smoke, 1.84 tok/s compiled smoke) and is not a LocalMaxxing result.MUL_MAT_ID masking only moved the failure to coarse buffer allocation.p0/n64; the current GGUF best is 17.697772 tok/s with corrected RPC device mapping, -nkvo 0, fast IQ4_XS MUL_MAT_ID, runtime MMV row packing (GGML_SYCL_MMV_Y_RUNTIME=2), -ub 64, fused RMSNorm enabled, DNN disabled, and merged gate/up expert tensors (-muge 1). LocalMaxxing accepted this as cmox103ol0040ml019yzs6gvs; the same stack at p512/n128 reached 54.506 prompt tok/s and 17.693 decode tok/s as cmox1gcxl0049ml01kiijqbpo.UD-IQ4_XS p0/n16 reached only 11.746 tok/s with layer split and 12.559 tok/s with row split; tensor split failed at model load. Keep GGUF work on the current SYCL/RPC recipe unless doing a deeper graph/tensor-parallel rewrite.llm_load_tensors on a 19.028 GB regular SYCL model-buffer allocation on GPU0. The current RPC-worker layout remains useful because it avoids that large single-process buffer path. A layer-placement sweep topped out at 16.358 tok/s, so placement is not the route to the >30 tok/s target.GGML_MINIMAX_NO_DEFER_REDUCE=1 and GGML_RPC_REDUCE_MIRROR=1, but it is diagnostic only: the one-token smoke reached 2.034 tok/s after forcing real reductions at nonlinear boundaries. The faster branch-fused graph path is not promoted because deferred partial reductions can cross RMSNorm/router/MoE boundaries and change the math.-t is not limiting, -fa 1 currently aborts in the SYCL RPC worker due unsupported FLASH_ATTN_EXT, disabling fused MMAD/MoE is slower, oneDNN enabled is slower, same-type contiguous copy memcpy is neutral, and an 8-expert MUL_MULTI_ADD unroll regressed and was removed.FUSED_RMS_NORM, converting a previous unsupported-op abort into a valid path. It reached 16.308 tok/s at p0/n64/r1, so it is functional but not a speed record.FusedMoE to MoeWNA16Config patch and targeted vLLM package-skew repairs. Switching CCL_ZE_IPC_EXCHANGE from sockets to pidfd raised the p512/n128 result to 19.85 output tok/s and 99.231 total tok/s, accepted on LocalMaxxing as cmox6tys30085ml0125gihg18. The log still shows the next bottleneck: no B70-specific tuned MoE config for E=256,N=384,dtype=int4_w4a16. An AMD-derived config seed was accepted only after stripping an unsupported key, but it regressed to 1.73 output tok/s on p64/n16.VLLM_XPU_ENABLE_XPU_GRAPH=1 is disabled by vLLM because TP4 communication ops cannot be captured, and MiniMax QK-norm fusion is blocked because this XPU build lacks torch.ops._C.minimax_allreduce_rms_qk.x.shape[0] <= 4) through the custom raw-u4 kernel. The p512/n128 result improved from the FP16 baseline 20.17 output tok/s to 29.74843 output tok/s (148.742151 total), and p512/n256 reached 33.033788 output tok/s once prefill was better amortized. No speculative decode, no expert dropping, and no power-limit change. LocalMaxxing accepted these as cmoxptkfd00hsml01hf2ajhhp and cmoxq7cww00i8ml019ihbeqc9. MiniMax ngram_gpu with the same decode path failed/stalled during generation, so speculation remains negative for this harness.16.860287 to 33.681326 output tok/s at p512/n256, and reaches 36.607699 output tok/s at p512/n512, only about 1.4% behind the FP16 p512/n512 speed reference. LocalMaxxing accepted this BF16 result as cmoyr84ol000rtl01o4z9fwdm.--gpu-memory-utilization 0.95 is the quality-conservative MiniMax capacity recipe: default-memory BF16 has no usable KV cache headroom, while BF16 0.95 completed p512/n1024 at 37.303654 then 35.953772 output tok/s with 18,880 KV tokens. Treat this as capacity-mode validation, not a clear speed breakthrough over FP16 default-memory mode. LocalMaxxing accepted the first run as cmoz632kr0068tl017a1z6r0u.41.130667 output tok/s (cmoz8cow60001pd010klrb8g8) is reclassified as likely quality-valid after the generated-cache analyzer was fixed: the archived c15860... AOT graph contains 8 visible f32[s72,2] Q/K RMS variance allreduces plus hidden-state collectives. It is still not reproducibly recovered in the current runtime, so keep it as the accepted speed target. The lower quality-conservative reference remains 37.552538 output tok/s / 50.070051 total tok/s at p512/n1536, accepted on LocalMaxxing as cmozow03v005wlo01q81bnspx. No speculative decode, no expert dropping, and no power-limit change.max_model_len=4096 with the same FP16 p512/n1536 request reached 33.258227 output tok/s and was accepted as LocalMaxxing capacity datapoint cmoz8k9z40008pd01rhu50c0n. GPU KV cache dropped from 17,216 to 9,408 tokens, so the next capacity test should use gpu_memory_utilization=0.95.gpu_memory_utilization=0.95 is the better FP16 recipe: GPU KV cache rises to 33,408 tokens and the same p512/n1536 request improves to 36.616486 output tok/s, accepted as cmoz8ryb9000bpd014xhl3pxu. It is close to the 2048-window quality-conservative reference, but still capacity-oriented rather than a raw-speed path.33.308012 output tok/s with 25,600 GPU KV-cache tokens and was accepted as cmoz90lg0000wpd018x3zuukw; a warmed rerun reported 33,408 KV tokens and improved to 36.805228 output tok/s, accepted as cmoz9ayax001cpd01xkr0w54l. A real larger-prompt p4096/n512 run reached 31.287419 output tok/s and 281.586772 total tok/s, accepted as cmoz97d350015pd01smqui7lk. These validate usable 8192-context capacity, but not a raw-speed path.45 ms/token for runner.forward, while moe.router_select + moe.quant_apply was about 0.24 ms/layer, or about 15 ms/token across 62 layers. This points the next speed work toward attention/KV, Q/K RMS plus TP collectives, projections, and compiled graph boundaries rather than raw u4 matvec alone. The run is not a LocalMaxxing result because synchronized timing distorts throughput.0.465 ms/layer, decode MoE expert work around 0.580 ms/layer, and each direct TP allreduce around 0.084-0.088 ms in steady samples. oneCCL CCL_SYCL_ALLREDUCE_TMP_BUF=1, CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0, and CCL_ALLREDUCE_SMALL_THRESHOLD=0 all regressed the fast-NVMe p512/n512 baseline slightly, so keep default oneCCL settings for now and move to source-level fusion.CCL_WORKER_COUNT=2 hung during XCCL initialization before model loading, and vLLM’s built-in fuse_allreduce_rms pass is explicitly disabled on XPU and currently backed by FlashInfer/ROCm fusion machinery rather than Level Zero/XCCL. Keep default oneCCL worker count and build any allreduce+RMS fusion as an XPU-specific path.fuse_allreduce_rms is not a B70 shortcut. With VLLM_XPU_EXPERIMENTAL_FUSE_ALLREDUCE_RMS=1, vLLM enables allreduce_rms, then worker startup imports flashinfer.comm; FlashInfer imports CUDA GDN kernels and calls torch.cuda.get_device_properties(0), failing in the XPU-only PyTorch build. Keep that env unset outside the diagnostic screen.computation_graph.py is absent. On the current MiniMax generated cache it reports representative categories, not the old full graph count: 20 hidden f16[s72,3072] collectives and 8 Q/K variance f32[s72,2] collectives. A MiniMax env-constant helper cleanup was tested and reverted because p512/n1536 regressed to 36.450 output tok/s.c15860... reports 40 generated allreduce sites, including 8 Q/K variance sites, so the accepted 41.130667 p512/n1536 result is likely valid. Current recovery attempts did not reproduce it: graph-shaped warm reload reached 37.69, FP8 KV E4M3 reached only 37.15 and carries a quality warning, FP8 E5M2 fails in XPU FlashAttention, and FP16 router reached 38.13 but changes expert-routing precision.sigmoid(router_logits) + e_score_correction_bias still had 128 unordered top-8 set mismatches across the TP4 p64/n16 smoke. The useful lead is candidate coverage: FP16 top-16 and top-32 both contained the exact biased FP32 top-8 for every audited token/layer, so a future quality-preserving candidate-repair router should compute exact FP32 scores for the FP16 top-16 and feed exact top-k ids/weights into MoE apply. No LocalMaxxing submission was made for the direct FP16-router screen.35.54 to 36.27 output tok/s in the same active runtime. Top-12 and top-16 were effectively tied, and both remain below prior accepted references. Keep this as a source lead for a fused XPU candidate-selection kernel, not a promoted speed path.CCL_WORKER_AFFINITY=auto completed p512/n512 at 36.496 output tok/s, and CCL_WORKER_AFFINITY=0,1,2,3 completed at 35.568, both below the accepted 39.611 p512/n512 reference with the same AOT hash and 17,216-token KV cache. Keep default worker affinity.39.610585 to 35.681825 output tok/s. Eager timing looked better, so the likely issue is compiled graph scheduling rather than raw math correctness. Keep VLLM_MINIMAX_QK_APPLY_ROPE_XPU_HELPER unset.35.722 output tok/s and warmed p512/n1536 reached 36.572, below the current quality-conservative p512/n1536 reference. It reproduces the cold isolated AOT artifact (9,408 KV tokens, 28.664 output tok/s) but does not recover a faster valid schedule. Keep VLLM_MINIMAX_QK_RMS_XPU_HELPER unset.qkv and then applies Q/K RMS after the normal TP allreduce is also negative: warm p512/n512 reached 34.718 output tok/s versus the same-shape baseline 35.820, and AOT reload failed until recompilation because the custom op namespace was not registered early enough. Keep VLLM_MINIMAX_QK_RMS_XPU_DIRECT unset.o_proj allreduce from RowParallelLinear into the decoder layer before residual-add RMSNorm gave a small p512/n512 lift (36.540 vs 35.820 output tok/s) but p512/n1536 reached only 36.449, below the current 37.552538 quality-conservative reference. Archive the scheduling clue, but keep VLLM_MINIMAX_O_PROJ_DELAY_ALLREDUCE unset.36.621 output tok/s and one p512/n1536 run reached 37.716, but the repeat fell to 37.021, below the 37.552538 quality-conservative reference. This suggests the MoE allreduce boundary matters, but just moving the allreduce is not reliable; keep VLLM_MINIMAX_MOE_DELAY_ALLREDUCE unset.c15860... AOT graph cache and recovered the runtime after the htile/rebuild work: p512/n512 returned to 38.998 output tok/s, and p512/n1536 repeated at 39.450 then 39.961 output tok/s. The updated generated-cache analyzer later found the expected Q/K RMS variance allreduce signature in the archived c15860... graph, so those runs are now scheduling clues for a likely valid but currently unreproduced faster graph.c15860... MiniMax AOT artifact. The current reproducible valid floor is 36-38 output tok/s after restoring the graph-shaped MiniMax source, while the accepted 41.131 LocalMaxxing result is now the speed target again rather than a suspect-only clue. Early clean-source/no-autotune/combo-kernel-off screens were negative (26.69-28.64 tok/s). Keep future compile experiments in isolated VLLM_CACHE_ROOT directories and require generated-cache Q/K variance allreduce census before promoting a result.28.03 output tok/s, the archived larger c15860...slow-after-timing-noop payload only reached 35.22, the current live p512/n1536 repeat is 37.05, oneCCL topology-bypass retest is 36.13, and aggressive Inductor autotune either leaves no KV cache at default memory or falls to 23.15 with gpu_memory_utilization=0.95. The next quality-preserving MiniMax speed path is still a real graph-safe XPU Q/K allreduce+RMS fusion, not more generic graph-boundary or autotune knobs.minimax_m2.py. In an isolated NVMe cache, the cleaned source cold p512/n512 run showed the usual 9,408-KV artifact and 27.47 output tok/s, while warm p512/n512 and p512/n1536 direct-load runs recovered 17,216 KV tokens and reached 36.14 and 36.63 output tok/s. This confirms those dormant branches were not the main lost-speed cause; keep the active runtime clean and keep future helpers as archived patches unless they beat the floor.timed_region(...) wrappers from active allreduce, attention, MoE, and runner paths is neutral-to-slightly-positive hygiene, not a recovery path. The isolated no-timing AOT cache hit the same cold artifact (28.05 output tok/s, 9,408 KV tokens), then warm p512/n512 and p512/n1536 reached 35.85 and 37.37 output tok/s with 17,216 KV tokens. Keep timing wrappers out of the active runtime unless running a dedicated diagnostic.minimax_qk_rms_xpu helper extension and avoiding the CUDA Lamport workspace. It is mechanically useful but not a speed path: warm p512/n512 reached 36.44 output tok/s and warm p512/n1536 reached 37.24, both within the current floor. Keep VLLM_MINIMAX_QK_NORM_XPU_HELPER_FUSION unset for real runs.CCL_ALLREDUCE=direct is very negative on MiniMax TP4 p512/n512 (16.145345 output tok/s). Intel documents that non-topo GPU-buffer allreduce algorithms copy GPU data through host; keep default topo.xpu-smi stats polling is too intrusive for MiniMax TP4 benchmarking: per-device JSON polling caused a 60-second vLLM shared-memory broadcast warning and the run had to be killed. The partial trace still showed all four B70s at 2800 MHz with roughly 95-116 W per card and about 95-96% memory utilization, so there was no obvious frequency-throttling signal; future profiling should use lower-overhead Level Zero or kernel-side timers.25.076 to 29.892 output tok/s at p512/n512 and reached 30.911 at p512/n1536, accepted on LocalMaxxing as cmozofyv5005hlo01puv9rjs6. This is valid and quality-preserving, but still below the current Q/K-allreduce quality-conservative TP4 reference of 37.552538; round-robin placement falls back to linear, and --enable-dbo is blocked because XPU only has the AgRs/allgather-reducescatter all2all manager.35.72 output tok/s and warmed p512/n1536 reached 36.69, both below the quality-conservative MiniMax reference. Fresh isolated AOT compiles also reproducibly show a cold-run artifact: KV cache drops from 17,216 to 9,408 tokens and p512/n512 falls to about 27 output tok/s until the same cache is reloaded.ACCESS, all self-pairs report ACCESS|ATOMICS, cross-card atomics are not advertised, same-process P2P remote fills pass for all 16 pairs, and forked-process Level Zero IPC open/fill/verify passes for all 16 pairs. This makes a Lamport/mailbox-style XPU Q/K RMS fusion feasible without remote atomics.[tokens,2] Q/K variance values, publishes a sequence counter, polls peer counters, and reads peer payloads from XPU kernels. The device-counter path passed 4 ranks, 50 iterations, 32 token rows, and only 3 reused mailbox slots with exact [51.5,515.0] final averages; the 512-row prefill-sized test also passed. A no-host-barrier one-token correctness smoke passed when slots were not reused too early, but the same protocol can hang when a rank misses a slot sequence after wraparound. The naive float-sentinel single-kernel variant remains a negative result because some ranks read stale sentinels.VLLM_MINIMAX_QK_RMS_XPU_IPC_COMPILED=1 is explicitly set. When the IPC op actually runs in vLLM it is far too slow: the scalar path measured about 0.03 output tok/s eager and 0.02 output tok/s compiled on p1/n4, while the compiled counter path timed out after a long generation. A standalone device-counter microbench measured 416-418 ms per one-token [1,2] reduction versus about 0.016 ms for XCCL. Keep all IPC env flags unset for real benchmarks; future work should avoid standalone mailbox allreduce and only revisit peer-memory reads inside larger fused kernels.48.092807 output tok/s and 64.123742 total tok/s at p512/n1536 with vLLM/XPU TP4, --async-engine, the static decode graph compile_sizes=[1], AOT 3e2cefa134c3aecc743c56d36960e4cb0a8ac7d2adc73c3f2a078cc8b6164846, and preserved Q/K RMS variance allreduce. LocalMaxxing accepted it as cmp3cgooj0019s401d7p1ks3e.gpu_memory_utilization=0.95 with the current async/static graph produced one p512/n1536 run at 48.42 output tok/s but repeated at 46.21, so it is not submitted as a new record. vLLM compilation mode=3 produced the same AOT hash as the current best but hit the 9,408 KV-token cold-cache artifact and reached only 33.24 output tok/s at p512/n512.46.19 output tok/s), but the p512/n1536 repeat reached only 47.05, below the accepted async/static graph best. Keep prefix caching enabled for promoted MiniMax runs.60 tok/s. UR_L0_USE_IMMEDIATE_COMMANDLISTS=1 completed but regressed slightly to 45.49 output tok/s at p512/n512; UR_L0_USE_IMMEDIATE_COMMANDLISTS=2, UR_L0_DEVICE_SCOPE_EVENTS=2, and CCL_ATL_TRANSPORT=mpi stalled or hung. The new MirecX/MiniMax-M2.7-L3H5-DFlash drafter was downloaded and tested, but both the original 196608-position config and a local max-512 smoke copy stalled before generation. Keep current UR defaults and CCL_ATL_TRANSPORT=ofi; DFlash remains an infrastructure/debug track, not a current speed route.esimd_resadd_norm_gemv_int4_pert helper is not safe as a MiniMax projection fusion. A synthetic TP4 probe found a cross-workgroup residual mutation race on the actual o_proj shape (N=3072,K=1536), with about 10.3% fused relative error. A temporary no-store diagnostic confirmed the race, but the corrected standalone shape was slower than oneDNN INT4-only. Repro script: benchmarks/b70_resadd_norm_gemv_int4_race_probe.py.50 tok/s at p512/n1536 as the next repeatable conservative milestone, 60 tok/s as the main four-B70 goal, and 75+ tok/s only for verified speculative decoding or deeper source-level fusion that preserves target logits. The detailed roadmap is in plans/2026-05-10-minimax-60tok-roadmap.md.1,496 visible _c10d_functional.all_reduce_ call lines and 1,496 immediate wait_tensor call lines across generated Inductor Python, split evenly as 187 collectives in each generated rank/shape file. The stock vLLM allreduce+RMS pass remains XPU-disabled and FlashInfer/AITER oriented, so the next patch should be XPU-specific allreduce/residual/RMSNorm fusion.--no-async-scheduling fell to 27.309053 output tok/s, and --compilation-config={"compile_sizes":[1]} fell to 30.721984 despite successfully creating a token-count-1 compile range.0.016 ms for the 8-byte Q/K variance allreduce and 0.014 ms for the 6144-byte hidden allreduce on 4x B70. A follow-up out-of-place clone/copy mode that mimics vLLM’s output = input_.clone() path was only about 0.005 ms slower for tiny allreduces (0.021 ms at 8 B, 0.020 ms at 6144 B). This points next work toward vLLM graph/fence/collective placement, not more oneCCL algorithm toggles.33.925 output tok/s, while --gpu-memory-utilization 0.95 improved to 36.772 output tok/s and raised KV cache from 9,408 to 33,408 tokens. Treat 0.95 as a capacity setting, not a new speed record.140.425 us to 44.430 us, with exact synthetic output match, but full vLLM p512/n512 regressed to 35.067 output tok/s. The patch is retained as a negative artifact and should not be promoted.libsycl.so.8 runtime. Rebuilding with oneAPI 2026.0 produced a SYCL image-registration segfault on import. Rebuilding the FP16 u4 extension with oneAPI 2025.3 restored the p512/n512 baseline to 36.025 output tok/s. Follow-up toggles were negative: XPU graph requested was disabled for TP communication and fell to 29.562 output tok/s, CCL_ZE_IPC_EXCHANGE=pidfd was slightly slower at 35.534, and max_model_len=1024 fell to 28.909.VLLM_XPU_USE_LLM_SCALER_MOE_LOGITS unset for real benchmarks.MirecX/MiniMax-M2.7-L3H5-DFlash loads, compiles, shares target embeddings/lm head, and selects the expected target taps (2, 16, 30, 43, 57), but retries with num_speculative_tokens=3 were blocked by KV memory pressure, one Level Zero UR_RESULT_ERROR_DEVICE_LOST, and a generation hang after KV allocation. The drafter card reports m_accept ~= 1.38, already below expected break-even, so keep MiniMax optimization focused on non-speculative Q/K collective fusion and MoE decode work for now.num_speculative_tokens=4, p64/n32, and draft_tensor_parallel_size=1 also stalled after successful target/drafter load and AOT compile. It stayed at Processed prompts: 0/1, emitted repeated shared-memory broadcast warnings, and produced no benchmark JSON, so it was not submitted to LocalMaxxing.-1 and skipping them inside the llm-scaler u4 kernels only moved a BF16 p1/n8 EP smoke from 16.795602 to 16.883004 total tok/s, far below the stable non-EP BF16 u4 p512/n512 result of 36.607699 output tok/s. Treat EP loss as communication/scheduler/all-to-all dominated until proven otherwise.VLLM_XPU_ALLREDUCE_ASYNC_WAIT=1 hook completed a full BF16 0.95 MiniMax p512/n512 run at 35.949 output tok/s, but the hook is disabled inside compiled collectives. It stays as an eager-only diagnostic, not a speed setting or LocalMaxxing result.f16[s72,2] variance collectives, but warm p512/n512 reached only 35.316 output tok/s and carries a normalization-precision tradeoff. The active runtime was reverted to FP32 variance allreduce; keep VLLM_MINIMAX_QK_VAR_ALLREDUCE_DTYPE unset.MiniMaxText01LinearAttention is not applicable to the active MiniMax M2 AutoRound model: it uses minimax_m2.py normal attention, produced the same 4799a3c8... AOT hash, and the temporary gate was removed.fused_add_rms_norm is a useful diagnostic but not a speed path yet. After adding the B70 MiniMax MoE config to /home/steve/src/vllm, source default warmed to 34.602 output tok/s, source with --enable-flashinfer-autotune warmed to 35.781, and source with fused_add_rms_norm=["xpu_kernels","native"] warmed to 35.649, all below the installed-runtime p512/n512 reference. The installed custom_ops=["none","+rms_norm"] path also warmed to only 36.159. Do not submit these to LocalMaxxing.VLLM_MINIMAX_POST_ATTN_FUSED_ADD_RMS_XPU=1 warmed to 35.077 output tok/s at p512/n512, and pairing it with delayed output-projection allreduce warmed to 35.804, still below the accepted 39.611 reference. Keep both env flags unset and move to true collective-plus-epilogue fusion.32.611 output tok/s. This confirms that the next MiniMax fusion must be C++/SYCL or compiler-level, not a Python custom-op wrapper around existing allreduce and RMS kernels.37.17 output tok/s / 49.558 total tok/s with 17,216 GPU KV-cache tokens. The loaded AOT graph contains 187 TP allreduces per generated-token graph on rank 0: 62 Q/K variance reductions, 62 output-projection hidden reductions, 62 MoE hidden reductions, and one vocab-embedding reduction. The next target is reducing the effective cost of the hidden-state allreduce plus residual/RMSNorm boundaries, not more standalone MoE microkernels.origin/fix_27b_kernel (db05b45) fixes a large-N dense INT4 ResAddNormGEMV race reported on Qwen3.6-27B gate_up (N=8704,K=5120,TP=4). It is relevant if we return to dense Qwen3.6 INT4 AutoRound/sym-int4, but not to the current MiniMax u4 MoE bridge, Qwen Q4_0 GGUF, or Qwen static FP8 paths.28.036 tok/s), llm-scaler MoE logits path (35.899), TP2/PP2 (24.976), and generic FP8 KV (28.104) all underperformed the quality-cleared TP4 p512/n512 reference (39.611). Explicit fp8_e5m2 KV fails in the XPU FlashAttention metadata path. These were not submitted to LocalMaxxing; they are recorded as pruning data.plans/q4_0-gguf-b70-optimization-plan.md: active quality-preserving GGUF optimization plan.plans/2026-05-05-negative-followups-addendum.md: latest plan addendum after the PP2, CCL topology, small-F32, and MiniMax guard screens.notes/b70-llm-lab-notes.md: running investigation log, benchmarks, TODOs, and lessons learned.notes/2026-05-04-qwen36-fp8-b70-fa2.md: focused writeup for the Qwen3.6 27B static FP8 / vLLM XPU FA2 result on 4x B70.notes/2026-05-04-vllm-xpu-ngram4-fp8-validation.md: current best static FP8 n-gram speculative validation.notes/2026-05-04-qwen36-q4-eventbarrier.md: current best Q4_0 three-B70 event-barrier allreduce validation.notes/2026-05-04-minimax-row-split-ncmoe-staircase.md: MiniMax row-split expert allocation staircase.notes/2026-05-05-negative-followups.md: negative follow-up screens and backend bugs found after the current best results.notes/2026-05-06-fp8-mtp-block-fp8-clean.md: clean-load but slow Qwen3.6 FP8 MTP hybrid follow-up.notes/2026-05-06-llm-scaler-source-mining.md: llm-scaler ESIMD source-mining notes for the next Q4 kernel/fusion work.notes/2026-05-06-q4-esimd-blockscales.md: ESIMD harness block-loaded scale metadata win; positive standalone kernel direction.notes/2026-05-06-q4-graph-pattern-probe.md: Q4_0 decode graph probe showing same-activation multi-GEMV fusion opportunities.notes/2026-05-06-q4-active-device-row-split.md: focused active-device row-split patch and row-split safety failure.notes/2026-05-06-q4-fused-mmvq2-swiglu.md: opt-in Q4_0 gate/up matvec plus SwiGLU fusion and validation.notes/2026-05-06-q4-rmsnormmul.md: opt-in RMS_NORM+scale-MUL fusion and current best Q4_0 GGUF validation.notes/2026-05-06-q4-getrows-fusion-neutral.md: opt-in allreduce+GET_ROWS fusion; initially neutral, later a small current-stack win.notes/2026-05-06-q4-projection-epilogue-diagnostic.md: diagnostic MUL_MAT+allreduce+ADD scheduler hook; path works with Q8 disabled but regresses short decode, so it stays off.notes/2026-05-06-q4-single-subgroup-current-negative.md: current-stack single-B70 subgroup runtime sweep; default remains best.notes/2026-05-06-q4-vdr4-negative.md: runtime-gated one-lane-per-Q4_0-block reordered MMVQ screen; regressed short decode, so keep it off.notes/2026-05-06-q4-allreduce-max-bytes.md: opt-in larger fused allreduce ceiling probe; useful diagnostic but not a speed win.notes/2026-05-06-fp8-pp2-postreboot-validation.md: post-reboot FP8 PP2xTP2 XCCL/load/speculative plumbing validation.notes/2026-05-07-q4-q8-allreduce-add-guardfix.md: regression fix for the misplaced Q8-cache guard that disabled the validated allreduce+ADD path.notes/2026-05-07-fp8-tp4-pp2-refresh.md: FP8 TP4 vs PP2xTP2 post-reboot refresh, including the oneCCL topology-toggle screen.notes/2026-05-07-q4-quad-assist-refresh.md: current best four-card Q4_0 assist split refresh after the guard fix.notes/2026-05-07-q4-root-residual-tp3.md: three-card Q4_0 TP3 root-residual performance ceiling, now annotated as not currently quality-cleared after the later token/logit failure.notes/2026-05-07-q4-fused-beta-alpha-experimental.md: flat-layout Qwen35 fused ssm_beta/ssm_alpha GGUF experiment, quality-cleared with root-residual disabled.notes/2026-05-07-model-retention-cleanup.md: model-tree cleanup record and current keep set.notes/2026-05-07-minimax-ikrpc-sycl-13tok-baseline.md: ik_llama.cpp RPC+SYCL process-per-GPU baseline that reached 13.754 tok/s on MiniMax M2.7 UD-IQ4_XS.notes/2026-05-08-minimax-direct-sycl-and-placement.md: direct-SYCL allocation blocker and MiniMax RPC layer-placement sweep.notes/2026-05-08-minimax-correct-graph-reduce.md: quality-correct MiniMax graph reduce diagnostic and why host-mediated reduce/broadcast is too slow.notes/2026-05-08-minimax-layer-knob-and-kernel-screens.md: MiniMax layer-mode runtime knob, unsupported-op, and small-kernel screens.notes/2026-05-08-minimax-cpy-shape-trace.md: MiniMax SYCL CPY shape trace and negative shape-specific copy fast path.notes/2026-05-08-minimax-fused-rmsnorm-sycl.md: MiniMax SYCL RPC worker FUSED_RMS_NORM implementation and speed screen.notes/2026-05-08-minimax-autoround-vllm-xpu.md: MiniMax AutoRound INT4 vLLM/XPU bring-up, including the quantized-MoE fit patch and remaining blockers.notes/2026-05-09-minimax-u4-decode-path.md: unsigned llm-scaler u4 MiniMax decode path, p512/n128 29.74843 output tok/s result, and negative ngram_gpu follow-up.notes/2026-05-09-minimax-comm-and-ws-moe-followups.md: oneCCL small-payload, MoERunner timing, direct-dispatch, and ESIMD work-sharing u4 follow-ups; all kept as diagnostics/negatives.notes/2026-05-09-minimax-dflash-speculative-blocker.md: DFlash speculative drafter smoke; model loads and compiles, then stalls before producing a 16-token result.notes/2026-05-09-minimax-bf16-u4-decode.md: BF16-capable MiniMax AutoRound u4 decode path; restores BF16 speed to near the FP16 reference without forcing FP16 hidden states.notes/2026-05-09-minimax-oneapi-compiler-compat.md: oneAPI 2025.3 compiler compatibility finding, FP16 u4 restore, and negative XPU graph / CCL IPC / context-size screens.notes/2026-05-09-minimax-router-logits-fusion-negative.md: default-off fused top-2/router logits experiment; standalone math passes, full TP4 vLLM run hangs.notes/2026-05-09-minimax-ep-skip-and-dflash-update.md: EP non-local expert skip smoke and updated DFlash speculative retry matrix; both negative for speed.notes/2026-05-10-llm-scaler-fix-27b-kernel.md: upstream llm-scaler Qwen3.6 27B dense INT4 correctness fix discovered after fetching origin/fix_27b_kernel.notes/2026-05-10-minimax-qk-allreduce-diagnostic-and-39tok.md: correctness-breaking Q/K allreduce skip diagnostic plus the valid repeated 39 tok/s MiniMax AutoRound result.notes/2026-05-10-minimax-timing-and-ccl-sweep.md: MiniMax synchronized timing and oneCCL environment sweep showing default CCL behavior remains best so far.notes/2026-05-10-minimax-qk-apply-rope-helper-negative.md: default-off Q/K apply+RoPE helper, oneAPI 2025.3 build note, and negative compiled-throughput result.notes/2026-05-10-minimax-qk-rms-helper-retest.md: standalone Q/K RMS var/apply helper retest after the AOT-cache regression; functional but still slower than the stock path.notes/2026-05-10-minimax-scheduler-and-xccl-microbench.md: vLLM scheduler/compile screens and XCCL allreduce microbench at MiniMax decode tensor sizes.notes/2026-05-10-minimax-restore-and-htile-negative.md: active-runtime restore after removing the default-off Q/K apply+RoPE branch, p512/n2048 screens, out-of-place XCCL microbench, and negative llm-scaler htile experiment.notes/2026-05-10-minimax-aot-cache-regression.md: MiniMax AOT-cache regression record after the timing-helper no-op experiment, isolated Inductor knob negatives, and current reproducible 35-36 tok/s floor.notes/2026-05-10-minimax-aot-followups.md: follow-up AOT recovery screens: extra graph boundaries, archived AOT check, topology retest, current live repeat, and aggressive Inductor autotune negatives.notes/2026-05-10-minimax-inplace-allreduce-and-xpu-fusion.md: guarded compiled-XPU allreduce clone-skip negative and the next Level Zero IPC/P2P-backed Q/K RMS fusion direction.notes/2026-05-10-xpu-level-zero-peer-probe.md: Level Zero B70 P2P/IPC peer-memory feasibility probe for the next MiniMax Q/K RMS fusion prototype.notes/2026-05-10-minimax-xpu-ipc-qk-var-prototype.md: PyTorch/SYCL Level Zero IPC Q/K variance prototype; correctness passes under controlled slots, but standalone decode-sized IPC is hundreds of ms per call.notes/2026-05-10-minimax-vllm-ipc-qk-var-integration.md: default-off vLLM MiniMax Q/K variance IPC allreduce integration; eager liveness passes, preinit avoids the first compile crash, but actual IPC execution is currently much too slow.data/qwen36-fp8-32k-tp4-vs-pp2-20260506.json: post-reboot Q4 sanity plus FP8 32k-context TP4 vs TP2/PP2 validation.data/q4-esimd-blockscales-20260506.json: structured ESIMD block-loaded scale metadata screen.data/q4-active-device-row-split-20260506.json: structured active-device row-split patch validation and negative row-split smoke.data/qwen36-q4-fused-mmvq2-swiglu-20260506.json: structured fused MMVQ2+SwiGLU correctness, performance, and LocalMaxxing record.data/qwen36-q4-rmsnormmul-20260506.json: structured RMS_NORM+scale-MUL correctness, performance, failed 4x diagnostic, and LocalMaxxing record.data/qwen36-q4-getrows-fusion-20260506.json: structured allreduce+GET_ROWS A/B data, correctness check, and LocalMaxxing record.data/qwen36-q4-projection-epilogue-diagnostic-20260506.json: structured Q8 guard, path smoke, and negative A/B for the projection epilogue scheduler hook.data/qwen36-q4-single-subgroup-current-20260506.json: structured current-stack single-B70 subgroup runtime sweep.data/qwen36-q4-vdr4-negative-20260506.json: structured Q4_0 reordered MMVQ VDR4 negative screen.data/qwen36-q4-allreduce-max-bytes-20260506.json: structured Q4_0 larger allreduce-fusion ceiling probe.data/qwen36-fp8-pp2-postreboot-validation-20260506.json: structured FP8 PP2xTP2 post-reboot validation data.data/qwen36-q4-q8-allreduce-add-guardfix-20260507.json: structured Q4_0 guard-fix trace, restored throughput, and LocalMaxxing record.data/qwen36-fp8-tp4-pp2-refresh-20260507.json: structured FP8 TP4, PP2xTP2, n-gram, and oneCCL topology-toggle refresh data.data/qwen36-q4-quad-assist-refresh-20260507.json: structured four-card Q4_0 assist split refresh data and LocalMaxxing record.data/qwen36-q4-root-residual-tp3-20260507.json: structured root-residual TP3 performance ceiling, negative follow-up screens, LocalMaxxing IDs, and later correctness correction.data/qwen36-q4-fused-beta-alpha-20260507.json: structured flat-layout fused beta-alpha GGUF experiment data and final no-root correctness/performance validation.data/minimax-m27-ikrpc-sycl-13tok-baseline-20260507.json: structured MiniMax RPC+SYCL result table, command, source patches, LocalMaxxing ID, and next blockers.data/minimax-m27-direct-sycl-placement-20260508.json: structured direct-SYCL allocation failures and layer-placement sweep.data/minimax-m27-correct-graph-reduce-20260508.json: structured corrected MiniMax graph reduce diagnostic.data/minimax-m27-layer-knob-and-kernel-screens-20260508.json: structured MiniMax layer-mode knob and kernel screens.data/minimax-m27-cpy-shape-trace-20260508.json: structured MiniMax CPY shape trace and negative fast-path test.data/minimax-m27-fused-rmsnorm-sycl-20260508.json: structured MiniMax fused RMSNorm implementation result.data/minimax-m27-autoround-vllm-xpu-20260508.json: structured MiniMax AutoRound INT4 vLLM/XPU bring-up result and remaining MoE tuning blocker.data/minimax-m27-autoround-u4-decode-20260509.json: structured unsigned llm-scaler u4 decode path result, patch references, LocalMaxxing payload, and negative speculative follow-up.data/minimax-m27-compiled-timing-summary-20260509.json: synchronized rank-0 compiled timing summary showing the post-u4 remaining ceiling is outside the raw MoE bridge.data/minimax-m27-comm-direct-ws-followups-20260509.json: structured oneCCL env, MoE timing, direct-dispatch, and work-sharing u4 follow-up outcomes.data/minimax-m27-dflash-speculative-blocker-20260509.json: structured DFlash load/compile/smoke-stall result.data/minimax-m27-bf16-u4-decode-20260509.json: structured BF16 u4 decode results and patch references.data/minimax-m27-autoround-oneapi2025-recovery-20260509.json: structured compiler compatibility restore data and follow-up toggle results.data/minimax-m27-ep-skip-and-dflash-20260509.json: structured EP skip and DFlash retry matrix.data/minimax-m27-fast-nvme-scheduler-xpu-graph-20260510.json: fast-NVMe MiniMax scheduler, XPU graph, GPU-memory-utilization, B70 MoE config, and BF16 capacity-mode follow-up data.data/llm-scaler-fix-27b-kernel-20260510.json: structured record of upstream llm-scaler fix_27b_kernel and its relevance boundaries.data/minimax-m27-qk-allreduce-diagnostic-39tok-20260510.json: structured MiniMax Q/K allreduce diagnostic, p512/n512 repeat, and LocalMaxxing ID.data/minimax-m27-timing-and-ccl-sweep-20260510.json: structured MiniMax timing diagnostics and oneCCL sweep results.data/minimax-m27-qk-apply-rope-helper-negative-20260510.json: structured Q/K apply+RoPE helper validation, negative benchmark, and CCL_ALLREDUCE=direct screen.data/minimax-m27-qk-rms-helper-retest-20260510.json: structured standalone Q/K RMS var/apply helper retest and negative decision.data/minimax-m27-scheduler-and-xccl-microbench-20260510.json: structured scheduler/compile negatives and XCCL microbench data.data/minimax-m27-restore-longoutput-htile-20260510.json: structured restore, long-output, XCCL out-of-place, XPU graph follow-up, and htile negative results.data/minimax-m27-aot-cache-regression-20260510.json: structured AOT-cache regression, graph-shape, no-autotune, combo-kernel-off, and current floor measurements.data/minimax-m27-aot-followups-20260510.json: structured follow-up AOT recovery attempts and current repeat results after the c15860... regression.data/minimax-m27-inplace-allreduce-and-xpu-fusion-20260510.json: structured compiled-XPU allreduce clone-skip negative and XPU fusion next-direction record.data/xpu-level-zero-peer-probe-b70-20260510.json: structured Level Zero B70 P2P/IPC peer-memory feasibility result.data/minimax-xpu-ipc-qk-var-prototype-20260510.json: structured first XPU IPC Q/K variance prototype results.data/minimax-m27-vllm-xpu-ipc-qk-var-integration-20260510.json: structured vLLM MiniMax IPC Q/K variance integration smoke results and compiled-mode blocker.notes/2026-05-10-minimax-ccl-worker-affinity-negative.md: oneCCL worker-affinity screen showing affinity pinning underperformed the accepted MiniMax references.data/minimax-m27-ccl-worker-affinity-negative-20260510.json: structured oneCCL worker-affinity screen data.notes/2026-05-10-minimax-aot-collective-inspection.md: MiniMax AOT graph inspection identifying allreduce/wait boundaries as the source-level fusion target.data/minimax-m27-aot-collective-inspection-20260510.json: structured AOT collective/RMS inspection data.notes/2026-05-13-minimax-aot-collective-classifier.md: updated current-AOT classifier result showing 1,496 actual allreduce/wait pairs.data/minimax-m27-aot-collective-classification-20260513.json: structured current-AOT allreduce/wait classification data.notes/2026-05-10-minimax-negative-screens.md: direct Q/K helper, MoE logits, TP2/PP2, and FP8 KV negative screens after raising the MiniMax target to 60 tok/s.data/minimax-m27-negative-screens-20260510.json: structured data for those negative screens and external reference notes.benchmarks/b70_minimax_qk_boundary_bench.py: XCCL/QK variance/helper microbench for MiniMax TP4 layer shapes.data/minimax-m27-qk-boundary-microbench-20260510.json: structured microbench data explaining why the direct helper looked promising in isolation but regressed in the full compiled model.data/localmaxxing-submission-minimax-m27-autoround-bf16-u4-decode-20260509.json: LocalMaxxing response for the BF16 u4 decode p512/n512 result.data/localmaxxing-minimax-m27-autoround-u4-decode-bf16-gpumem095-p512n1024-20260510.payload.json: LocalMaxxing payload for the fast-NVMe BF16 0.95 MiniMax capacity-mode result.data/localmaxxing-responses/minimax-m27-autoround-u4-decode-bf16-gpumem095-p512n1024-20260510.response.json: LocalMaxxing response for the fast-NVMe BF16 0.95 MiniMax capacity-mode result.data/localmaxxing-minimax-m27-autoround-u4-decode-fp16-fast-nvme-p512n512-20260510.payload.json: LocalMaxxing payload for the valid fast-NVMe FP16 p512/n512 MiniMax AutoRound high.data/localmaxxing-responses/minimax-m27-autoround-u4-decode-fp16-fast-nvme-p512n512-20260510.response.json: LocalMaxxing response for the valid fast-NVMe FP16 p512/n512 MiniMax AutoRound high.data/localmaxxing-minimax-m27-autoround-u4-decode-fp16-fast-nvme-p512n1024-20260510-refresh.payload.json: LocalMaxxing payload for the valid refreshed fast-NVMe FP16 p512/n1024 MiniMax AutoRound high.data/localmaxxing-responses/minimax-m27-autoround-u4-decode-fp16-fast-nvme-p512n1024-20260510-refresh.response.json: LocalMaxxing response for the valid refreshed fast-NVMe FP16 p512/n1024 MiniMax AutoRound high.data/localmaxxing-minimax-m27-autoround-u4-decode-fp16-fast-nvme-p512n1536-20260510.payload.json: LocalMaxxing payload for the valid fast-NVMe FP16 p512/n1536 MiniMax AutoRound high.data/localmaxxing-responses/minimax-m27-autoround-u4-decode-fp16-fast-nvme-p512n1536-20260510.response.json: LocalMaxxing response for the valid fast-NVMe FP16 p512/n1536 MiniMax AutoRound high.data/localmaxxing-minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen4096-p512n1536-20260510.payload.json: LocalMaxxing payload for the valid max_model_len=4096 MiniMax AutoRound capacity datapoint.data/localmaxxing-responses/minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen4096-p512n1536-20260510.response.json: LocalMaxxing response for the valid max_model_len=4096 MiniMax AutoRound capacity datapoint.data/localmaxxing-minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen4096-gpumem095-p512n1536-20260510.payload.json: LocalMaxxing payload for the valid max_model_len=4096, gpu_memory_utilization=0.95 MiniMax AutoRound capacity datapoint.data/localmaxxing-responses/minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen4096-gpumem095-p512n1536-20260510.response.json: LocalMaxxing response for the valid max_model_len=4096, gpu_memory_utilization=0.95 MiniMax AutoRound capacity datapoint.data/localmaxxing-minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen8192-gpumem095-p512n1536-20260510.payload.json: LocalMaxxing payload for the valid max_model_len=8192, gpu_memory_utilization=0.95 MiniMax AutoRound capacity datapoint.data/localmaxxing-responses/minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen8192-gpumem095-p512n1536-20260510.response.json: LocalMaxxing response for the valid max_model_len=8192, gpu_memory_utilization=0.95 MiniMax AutoRound capacity datapoint.data/localmaxxing-minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen8192-gpumem095-p4096n512-20260510.payload.json: LocalMaxxing payload for the valid max_model_len=8192, gpu_memory_utilization=0.95, p4096/n512 MiniMax AutoRound larger-prompt capacity datapoint.data/localmaxxing-responses/minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen8192-gpumem095-p4096n512-20260510.response.json: LocalMaxxing response for the valid max_model_len=8192, gpu_memory_utilization=0.95, p4096/n512 MiniMax AutoRound larger-prompt capacity datapoint.data/localmaxxing-minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen8192-gpumem095-p512n1536-refresh-20260510.payload.json: LocalMaxxing payload for the refreshed max_model_len=8192, gpu_memory_utilization=0.95, p512/n1536 MiniMax AutoRound capacity datapoint.data/localmaxxing-responses/minimax-m27-autoround-u4-decode-fp16-fast-nvme-maxlen8192-gpumem095-p512n1536-refresh-20260510.response.json: LocalMaxxing response for the refreshed max_model_len=8192, gpu_memory_utilization=0.95, p512/n1536 MiniMax AutoRound capacity datapoint.data/minimax-m27-autoround-dflash-fast-nvme-negative-20260510.json: structured negative DFlash speculative-decode retest from fast NVMe.data/minimax-m27-source-ir-fusedadd-screen-20260510.json: structured source-tree and installed-runtime RMS/fused-add implementation screen.data/minimax-m27-postattn-fusedadd-delay-negative-20260510.json: structured installed-runtime post-attention fused-add RMS and delayed-output-allreduce negative screen.data/minimax-m27-python-ar-fused-customop-negative-20260510.json: structured Python custom-op allreduce plus fused-add RMS negative screen.notes/2026-05-10-b70-pcie-and-xpu-smi.md: B70 PCIe hierarchy and xpu-smi setup note; all four slot-facing links are PCIe 5.0 x16, PCIe downgrade is disabled, and a local no-file libze1 shim keeps the existing Intel Level Zero loader intact.data/b70-pcie-and-xpu-smi-20260510.json: structured PCIe bridge table and xpu-smi package state.configs/vllm/minimax-m27-b70-int4-w4a16-moe-hybrid-20260508.json: hybrid B70 MoE config for MiniMax AutoRound vLLM/XPU, tuned key 1 plus default prompt-size keys.configs/vllm/minimax-m27-b70-int4-w4a16-moe-ep-negative-20260508.json: expert-parallel MiniMax MoE config retained as a negative/blocked result after EP underperformed and the tuned-config run OOMed.scripts/bench-qwen36-q4_0-gguf-vulkan-matrix.sh: Q4_0 GGUF Vulkan benchmark sweep harness.scripts/bench-qwen36-q4_0-gguf-sycl-matrix.sh: Q4_0 GGUF SYCL benchmark sweep harness.scripts/bench-qwen36-b70-single-mtp.sh: single-B70 vLLM INT4 MTP benchmark wrapper.scripts/bench-qwen36-b70-tp2.sh: dual-B70 vLLM TP2 benchmark wrapper.scripts/bench-vllm-qwen36-fp8.sh: reusable Qwen3.6 FP8 vLLM latency wrapper with TP/PP/speculative knobs.scripts/bench-vllm-minimax-autoround-xpu.sh: reusable MiniMax M2.7 AutoRound INT4 vLLM/XPU throughput wrapper for TP4 B70 bring-up.scripts/build-llm-scaler-moe-int4-xpu.sh: reproducible llm-scaler INT4 extension rebuild wrapper that sources oneAPI 2025.3 for PyTorch XPU libsycl.so.8 compatibility.scripts/summarize-vllm-aot-collectives.sh: helper for inspecting vLLM AOT cache allreduce/wait/RMS patterns.scripts/add-qwen35-fused-ba-gguf.py: experimental augmented-GGUF generator that adds fused Qwen35 ssm_ba tensors from separate alpha/beta tensors.scripts/submit_localmaxxing_results.py: LocalMaxxing submission helper. Requires LMX_API_KEY in the environment; no API key is stored in this repo.benchmarks/b70_xccl_allreduce_bench.py: XPU all-reduce/P2P microbenchmark.data/localmaxxing_payloads.json: sanitized benchmark payloads submitted or queued for LocalMaxxing.notes/2026-05-10-fast-nvme-model-placement.md: model placement update after moving the MiniMax GGUF shards to /mnt/fast-ai and preserving the original path as a symlink.data/qwen36-q4-eventbarrier-20260504.json: structured Q4_0 event-barrier validation data.data/minimax-m27-row-split-ncmoe-staircase-20260504.json: structured MiniMax staircase failure data.data/2026-05-05-negative-followups.json: structured negative follow-up screens.patches/llama-b70-openvino-vulkan.patch: local llama.cpp OpenVINO/Vulkan exploratory patch set.patches/llama-cpp-sycl-allreduce-event-barrier.patch: incremental event-barrier allreduce marker patch.patches/llama-cpp-sycl-minimax-mulmatid-guard.patch: diagnostic MiniMax MUL_MAT_ID split-buffer guard patch.patches/ik-llama-minimax-rpc-sycl-20260507.patch: ik_llama.cpp patch set for the MiniMax RPC+SYCL baseline, including llama-bench -no-mmad, SYCL SIGMOID, SYCL MULTI_ADD, experimental MUL_MULTI_ADD, and fused-op fallback env flags.patches/llama-cpp-active-device-row-split-current-20260506.patch: focused row-split selected-device to physical-backend split mapping patch.patches/llama-cpp-sycl-fused-mmvq2-swiglu-current-20260506.patch.gz.b64: current SYCL source diff containing the fused MMVQ2+SwiGLU path.patches/llama-cpp-sycl-rmsnormmul-current-20260506.patch.gz.b64: current SYCL source diff containing the RMS_NORM+scale-MUL path and allocator diagnostics.patches/llama-cpp-sycl-meta-mulmat-add-diagnostic-current-20260506.patch.gz.b64: current llama.cpp diff containing the diagnostic MUL_MAT+allreduce+ADD scheduler hook.patches/llama-cpp-sycl-q4-current-guardfix-20260507.patch.gz.b64: current llama.cpp diff after restoring Q8-cache compatibility for the validated allreduce+ADD path.patches/llama-cpp-qwen35-fused-beta-alpha-experimental-20260507.patch.gz.b64: experimental Qwen35 fused ssm_beta/ssm_alpha GGUF source branch; quality-cleared only with root-residual disabled.patches/llama-cpp-sycl-q4_1-mmvq-experiment-20260507.patch: focused default-off Q4_1 MMVQ dispatch experiment, retained as a negative result.patches/llama-cpp-sycl-q4-vdr4-experiment-current-20260506.patch.gz.b64: current llama.cpp diff containing the runtime-gated Q4_0 reordered MMVQ VDR4 experiment.patches/llama-cpp-meta-allreduce-max-bytes-20260506.patch: focused opt-in max-byte knob for fused meta allreduce diagnostics.patches/vllm-xpu-mtp-fallback.patch: vLLM 0.20.1 XPU speculative/MTP fallback patch.patches/vllm-xpu-force-graph-with-comm-experiment.patch: failed TP2 graph-capture experiment knob retained as a negative result.patches/vllm-xpu-fa2-compressed-tensors-scalar-scales.patch: vLLM compressed-tensors singleton attention scale fix for Intel XPU FlashAttention2.patches/vllm-xpu-qwen35-gdn-spec-fallback-contiguous-state.patch: XPU Gated DeltaNet speculative metadata/fallback patch used by the n-gram runs.patches/vllm-inc-xpu-autoround-fusedmoe-wna16-20260508.patch: experimental vLLM patch that lets INC/AutoRound XPU quantization apply WNA16 MoE quantization to MiniMax FusedMoE layers instead of falling back to unquantized MoE.patches/vllm-minimax-qknorm-passmanager-xpu-guard-20260508.patch: guard patch so enabling MiniMax QK-norm fusion on XPU does not crash when the fused Lamport op is absent.patches/vllm-benchmark-moe-xpu-tune-harness-20260508.patch: local vLLM MoE benchmark harness patch for XPU/Ray device exposure, XPU eager timing, and pruned small-M decode tuning.patches/llm-scaler-moe-int4-u4-decode-20260509.patch: llm-scaler MoE-only unsigned uint4 tiny decode kernel and Python binding.patches/vllm-minimax-llm-scaler-u4-decode-20260509.patch: vLLM WNA16 MiniMax gate that enables the llm-scaler u4 path only for tiny FP16 decode batches.patches/llm-scaler-moe-int4-u4-bf16-decode-20260509.patch: BF16-capable llm-scaler u4 decode kernel diff.patches/vllm-minimax-llm-scaler-u4-bf16-decode-20260509.patch: vLLM WNA16 MiniMax gate update for BF16 decode activations and BF16 checkpoint scales.patches/vllm-xpu-decode-timing-summary-helper-20260509.patch: standalone timing helper file with atexit summaries and corrected PRINT_EVERY=0 summary-only behavior.patches/vllm-minimax-ep-u4-expert-map-skip-20260509.patch: vLLM MiniMax WNA16/runner diff including the decode u4 bridge, router/logits gate, and EP non-local expert-map skip.patches/llm-scaler-minimax-ep-u4-skip-20260509.patch: llm-scaler MiniMax u4 kernel diff including BF16/logits helpers and expert < 0 skip handling.patches/vllm-xpu-allreduce-async-wait-guard-20260510.patch: default-off XPU allreduce async-wait diagnostic hook; compiled MiniMax is blocked because TorchDynamo rejects async_op=True collectives.patches/vllm-minimax-qk-skip-tp-allreduce-diagnostic-20260510.patch: correctness-breaking MiniMax Q/K TP allreduce skip diagnostic; retained only to prove naive Q/K collective removal is slower and invalid.patches/vllm-minimax-xpu-ipc-qk-var-20260510.patch: default-off MiniMax Q/K variance Level Zero IPC allreduce hook; eager liveness passes, compiled integration remains blocked by Python-side initialization.patches/vllm-minimax-remove-qk-apply-rope-branch-restore-c158-20260510.patch: active-runtime cleanup that removes the default-off Q/K apply+RoPE helper branch and restores the fast c15860... AOT graph cache.patches/vllm-minimax-graph-shaped-c158-floor-20260510.patch: current MiniMax graph-shaped source state after the AOT regression: keeps timing boundaries/default-off Q/K RMS helper and uses the simple K-norm constructor that recovers the reproducible c15860... floor for this MiniMax M2.7 TP4 config.patches/llm-scaler-minimax-u4-down-htile-negative-20260510.patch: negative llm-scaler htile experiment artifact; includes the prior u4 MiniMax work plus the failed htile addition, so do not reverse-apply it over the active runtime.patches/vllm-minimax-qk-var-allreduce-dtype-negative-20260510.patch: negative MiniMax Q/K variance dtype experiment; FP16 variance collectives compiled but underperformed and were reverted from the active runtime.patches/vllm-source-b70-minimax-moe-config-20260510.patch: source-tree B70 MiniMax MoE config needed for fair /home/steve/src/vllm import tests.patches/vllm-minimax-postattn-fusedadd-delay-negative-20260510.patch: default-off installed-runtime post-attention fused-add RMS and delayed-output-allreduce experiment; archived as negative.patches/vllm-minimax-postattn-ar-fused-customop-negative-20260510.patch: Python-level allreduce plus fused-add RMS custom-op wrapper; compiled but warmed to only 32.611 output tok/s.plans/2026-05-10-minimax-60tok-roadmap.md: raised MiniMax AutoRound target ladder and next workstreams for quality-preserving 4x B70 optimization.data/minimax-m27-60tok-roadmap-update-20260510.json: structured raised-target roadmap, quality guardrails, and external reference points.notes/2026-05-10-minimax-current-baseline-and-collective-census.md: clean p512/n1536 MiniMax baseline refresh plus AOT allreduce-shape census.data/minimax-m27-current-baseline-collectives-20260510.json: structured current baseline and collective census.notes/2026-05-10-minimax-dflash-fast-nvme-retest.md: fast-NVMe DFlash retest; load/compile works, generation still stalls before any throughput result.notes/2026-05-10-minimax-source-ir-fusedadd-screen.md: source-tree IR fused-add RMS screen; mechanically works but remains below the installed-runtime reference.notes/2026-05-10-minimax-postattn-fusedadd-delay-negative.md: installed-runtime post-attention fused-add RMS and delayed-output-allreduce screen; both variants remained below the accepted MiniMax AutoRound reference.notes/2026-05-10-minimax-python-ar-fused-customop-negative.md: Python custom-op allreduce plus fused-add RMS wrapper; liveness passes after registration fix, but p512/n512 throughput is clearly negative.notes/2026-05-10-minimax-60tok-collective-plan-update.md: raised MiniMax AutoRound 4x B70 targets, latest upstream/speculation references, short negative screens, and the next hidden-state collective-boundary implementation plan.data/minimax-m27-collective-plan-update-20260510.json: structured current anchors, short screens, timing diagnostics, external references, and next implementation order for the 60+ tok/s MiniMax target.patches/vllm-xpu-allreduce-moe-timing-20260510.patch: opt-in allreduce/MoE timing hook artifact for reproducing the p64/n32 collective census.notes/2026-05-10-minimax-callsite-timing-diagnostic.md: call-site timing label experiment; confirms per-token collective counts but is archived as a negative active-runtime patch because it slows compiled MiniMax.data/minimax-m27-callsite-timing-diagnostic-20260510.json: structured call-site timing counts, negative compiled runs, and recovery validation after reverting the active runtime.notes/2026-05-13-minimax-async-runtime-screens.md: vLLM async-engine, stream-interval, max-seqs, and no-prefix/no-chunk screens after the static decode compile win; records the current 48.092807 output tok/s best.notes/2026-05-13-minimax-fused-gemv-and-runtime-screens.md: gmem 0.95 and mode3 runtime screens plus the llm-scaler fused ResAdd/RMS/INT4 GEMV race diagnosis.benchmarks/b70_resadd_norm_gemv_int4_race_probe.py: synthetic B70/XPU probe comparing vLLM oneDNN W4A16 against llm-scaler fused ResAdd/RMS/INT4 GEMV on MiniMax TP4 projection shapes.data/minimax-m27-20260513-fused-gemv-and-runtime-screens.json: structured data for the 2026-05-13 gmem, mode3, and fused-GEMV race screens.data/localmaxxing-minimax-m27-autoround-static-compile1-asyncengine-p512n1536-20260513.payload.json: LocalMaxxing payload for the current quality-preserving MiniMax AutoRound async-engine best.data/localmaxxing-responses/minimax-m27-autoround-static-compile1-asyncengine-p512n1536-20260513.response.json: LocalMaxxing response for the current quality-preserving MiniMax AutoRound async-engine best.patches/vllm-xpu-allreduce-callsite-timing-20260510.patch: archived diagnostic-only call-site label patch; do not apply to production MiniMax runs.notes/2026-05-11-minimax-stock-allreduce-rms-xpu-screen.md: stock vLLM fuse_allreduce_rms XPU screen; enabling the pass reaches FlashInfer CUDA imports and fails before any benchmark.data/minimax-m27-stock-allreduce-rms-xpu-screen-20260511.json: structured control run and failure-mode record for the stock allreduce/RMS pass screen.patches/vllm-xpu-enable-stock-allreduce-rms-screen-20260511.patch: minimal default-off XPU gate used only to reproduce the stock pass failure.notes/2026-05-11-minimax-generated-aot-analyzer-and-envconst.md: generated-cache analyzer update plus reverted MiniMax env-constant helper screen.data/minimax-m27-generated-aot-analyzer-envconst-20260511.json: structured data for the generated-cache analyzer and env-constant screen.notes/2026-05-11-minimax-c158-recheck-router-kv-screens.md: corrected c15860... fast-result recheck plus graph-shaped recovery, FP8 KV, and FP16-router screens.data/minimax-m27-c158-recheck-router-kv-screens-20260511.json: structured data for the C158 recheck and follow-up screens.notes/2026-05-11-minimax-fp16-router-audit.md: MiniMax FP16-router route-agreement audit; direct FP16 changes expert sets, while FP16 top-16 fully covered exact biased FP32 top-8 in the smoke.data/minimax-m27-fp16-router-audit-20260511.json: structured data for the FP16-router audit and candidate-repair lead.notes/2026-05-11-minimax-candidate-router-screen.md: default-off MiniMax candidate-repair router prototype and negative p512/n512 screen.data/minimax-m27-candidate-router-screen-20260511.json: structured data for the candidate-repair router screen.patches/vllm-minimax-candidate-router-screen-20260511.patch: source diff for the candidate-router prototype.patches/vllm-minimax-graph-shaped-router-kv-screens-20260511.patch: current vLLM MiniMax graph-shaped runtime diff used for the recovery, FP8 KV, and FP16-router screens.patches/llm-scaler-minimax-u4-logits-topk-negative-20260511.patch: llm-scaler logits-topk experiment diff retained as a negative artifact.The strongest quality-preserving paths are now Q4_0 GGUF TP3 with root-residual disabled and static FP8 TP4 with verified n-gram speculative decoding. The INT4 AutoRound path remains interesting for maximum speed, but it should be treated separately because it changes quantization quality more aggressively.