This folder documents a fresh Ubuntu 24.04 bring-up performed on 2026-05-23 for:
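- Model: Lasimeri/MiniMax-M2.7-int4-AutoRound
- Served context: 32768 tokens by default
- Strict benchmark context: 2048
- Strict total throughput: 110.896 tok/s
- Strict output throughput: 83.172 tok/s
- 83.8 output tok/s
- Prefill: 1.7k-1.8k prompt tok/s

The run satisfies the user’s requested >90 tok/s effective total throughput,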
but it is not a new output-token speed record. Earlier repo memory files mention
an 89-class strict lane and a later 94-class structured lane. This 2026-05-23
setup is valuable because it is deployable, quality-gated, OpenAI-compatible,
serves roughly 32k context, and is documented from a mostly fresh machine state.
The Intel Arc Pro B70 is a workstation GPU. Four of them can run a large quantized MiniMax model locally if the software stack is aligned carefully:
This is not a one-command consumer install yet. It is a reproducible lab setup for people and agents who are comfortable running shell scripts, reading logs, and rebooting when drivers change.
The scripts are written for:
The native vllm-xpu-kernels source build can use more than 120 GB resident memory for one compiler process. On machines with 16-64 GB RAM, use a temporary swap file on SSD. The provided build script creates one by default and removes it when the build finishes.
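A quick preflight check can confirm that RAM plus swap covers the 120 GB figure before starting a long build. The sketch below is not taken from the provided build script; the function names `mem_plus_swap_gib` and `has_build_headroom` are hypothetical, and the threshold is simply the number quoted above.

```shell
#!/bin/sh
# Sum MemTotal and SwapTotal (reported in kB) and print the result in GiB.
# The optional argument lets the function read a file other than /proc/meminfo.
mem_plus_swap_gib() {
    awk '/^(MemTotal|SwapTotal):/ { kb += $2 }
         END { printf "%.1f\n", kb / 1048576 }' "${1:-/proc/meminfo}"
}

# Succeed when RAM plus swap reaches the required size in GiB (default 120).
has_build_headroom() {
    required_gib="${1:-120}"
    available_gib="$(mem_plus_swap_gib "${2:-/proc/meminfo}")"
    awk -v a="$available_gib" -v r="$required_gib" 'BEGIN { exit !(a >= r) }'
}

# Example (not executed here):
# has_build_headroom 120 || echo "add swap before building vllm-xpu-kernels" >&2
```

Because the build script creates its own swap file by default, this check mainly matters when that default is overridden or when the SSD is short on free space.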
Plan for roughly:
- 113 GB
- 100-140 GB during download unless cleaned
- 20-80 GB
- 10-60 GB
- 160 GB

512 GB is tight but workable if /mnt/fast-ai has most of the space and old build trees/caches are cleaned.
Observed on the working system:
- Ubuntu 24.04.4 LTS
- intel-opencl-icd 26.18.38308.1-0
- libze-intel-gpu1 26.18.38308.1-0
- intel-ocloc 26.18.38308.1-0
- intel-igc-core-2 2.34.4
- intel-igc-opencl-2 2.34.4
- libigdgmm12 22.10.0
- libze1 1.28.2-1~24.04~ppa1
- xpu-smi 1.3.6-1~24.04~ppa1
- /opt/intel/oneapi/compiler/2025.3
- 3.12
- torch 2.11.0+xpu
- c51df43005726a09c6eb7348e8c1b00501c70a8e
- 4bfc0070090cc54afdb2d46b8e57882359141568
- 28e1f5e74c15744b69cf3b760f6160ceabd15de0
- https://github.com/vllm-project/vllm-xpu-kernels

Important: source /opt/intel/oneapi/compiler/2025.3/env/vars.sh for builds. On this machine, the umbrella /opt/intel/oneapi/setvars.sh selected oneAPI 2026 and caused SYCL header/build trouble.
Install provenance: earlier 2026-05-20 repro scripts installed
vllm-xpu-kernels==0.1.7 as a wheel. This 2026-05-23 Ubuntu 24.04
bring-up first tested wheel installs, then settled on an editable source build
from vllm-project/vllm-xpu-kernels at the commit above to resolve the active
PyTorch ABI mismatch.
Clone this repository:
git clone https://github.com/steveseguin/llm-optimizations.git
cd llm-optimizations/repro/minimax-m27-b70-110tps-ubuntu24-20260523
Install system packages, Intel GPU userspace, and oneAPI:
sudo bash scripts/00-install-system-deps.sh
sudo reboot
After reboot, verify GPUs:
xpu-smi discovery
clinfo -l
Set up storage, download the model, and build the Python/native stack:
sudo bash scripts/01-prepare-storage.sh
# Optional if needed:
# export HF_TOKEN=...
bash scripts/02-download-model.sh
bash scripts/03-build-stack.sh
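The troubleshooting notes later in this document mention a retrying hf download loop. The contents of scripts/02-download-model.sh are not reproduced here, so the following is only a minimal sketch of that idea; the `retry` helper, the attempt count, and the delay are assumptions, and the local directory in the commented example is the model id path used elsewhere in this document.

```shell
#!/bin/sh
# Run a command up to max_attempts times, sleeping delay_s seconds after each failure.
retry() {
    max_attempts="$1"
    delay_s="$2"
    shift 2
    attempt=1
    while :; do
        # Stop as soon as the wrapped command succeeds.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay_s}s" >&2
        sleep "$delay_s"
        attempt=$((attempt + 1))
    done
}

# Example (not executed here):
# export HF_HUB_DISABLE_XET=1
# retry 5 30 hf download Lasimeri/MiniMax-M2.7-int4-AutoRound \
#     --local-dir /mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround
```

Re-running hf download against the same local directory should pick up where the previous attempt stopped, which is what makes a plain retry loop useful for a download of this size.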
Verify imports and runtime:
bash scripts/04-verify-runtime.sh
Run the quality and benchmark gate before serving:
bash scripts/05-run-quality-and-benchmark.sh
Start the OpenAI-compatible endpoint:
bash scripts/06-serve-openai-compatible.sh
The serve script listens on 0.0.0.0:8000 and defaults to:
max_model_len=32768
gpu_memory_utilization=0.95
Override only if you are intentionally retesting capacity:
VLLM_MAX_MODEL_LEN=16384 bash scripts/06-serve-openai-compatible.sh
From the same machine:
curl http://127.0.0.1:8000/v1/models
From another LAN machine, replace the IP:
curl http://<server-lan-ip>:8000/v1/models
For this exact setup, the strict gate produced:
- Status: quality_passed
- Total throughput: 110.8962457692952 tok/s, the mean of 109.82634516832029, 111.29770064350417, 111.0660964054885, and 111.39484085986787
- Output throughput: 83.1721843269714 tok/s, the mean of 82.36975887624021, 83.47327548262813, 83.29957230411637, and 83.5461306449009

Primary local artifact from the originating machine:
/mnt/fast-ai/bench-results/minimax-m27-b70-89tps/strict/minimax-repro-minimax-moe-full-forward-customop-plus-output-ar-strict-tp4-ctx2048-mbt512-bs256-20260523T145201Z-summary.json
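The two headline figures from the strict gate are the arithmetic means of the four values recorded for each metric. A small sketch to recompute them follows; the `mean` helper is hypothetical and is not part of the repository scripts.

```shell
#!/bin/sh
# Print the arithmetic mean of all arguments with three decimal places.
mean() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%.3f\n", sum / NR }'
}

# Total throughput values from the strict gate.
mean 109.82634516832029 111.29770064350417 111.0660964054885 111.39484085986787   # 110.896

# Output throughput values from the strict gate.
mean 82.36975887624021 83.47327548262813 83.29957230411637 83.5461306449009       # 83.172
```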
Label the benchmark before comparing numbers:
- Current endpoint: 84.1 output tok/s warm through the OpenAI-compatible API, with a 32768 token served context.
- Older strict result: 89.314 output tok/s for p512/n1536, context 2048.
- 94 class result: constrained structured-output lane, not the same random p512/n1536 decode test.

The current host likely gives up some generated-token throughput because it is running the B70s over PCIe4 x16 instead of the earlier PCIe5-class path:

- Current host link: 16.0 GT/s, width 16, or PCIe4 x16.
- Earlier host link: 32.0 GT/s, width 16, or PCIe5 x16.
- Earlier measured bandwidth: 27.88 GB/s.
- Current measured bandwidth: 13.79 GB/s.
- 13.79 / 27.88 = 0.495, so the measured fabric bandwidth is about half.
- 83.8 / 89.314 = 0.938, so the end-to-end decode rate is about 94% of the older strict result.

That lines up in a practical way: a 2x drop in GPU-to-GPU communication bandwidth causes a smaller end-to-end decode drop because only part of each generated-token step is communication. The rest is local GPU compute, memory traffic, vLLM scheduling, and sampling. This is not proof that PCIe explains everything; the current vLLM source also differs from the older promoted stack. It is, however, a credible non-quality explanation for why this host lands near 83 instead of the older 89-93 class.
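The arithmetic behind that comparison can be sketched as follows. The `ratio` helper reproduces the two quotients. The `implied_comm_share` helper is a deliberately simple illustration and an assumption of this sketch, not a measurement: it supposes that communication time scales inversely with bandwidth, that nothing overlaps, and that the whole throughput gap comes from the bandwidth change. Under those assumptions it solves new_time / old_time = 1 + f * (old_bw / new_bw - 1) for f, the fraction of the older decode step spent on communication.

```shell
#!/bin/sh
# ratio A B: print A / B with three decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a / b }'
}

# implied_comm_share OLD_TPS NEW_TPS OLD_BW NEW_BW
# Fraction of the older step time that would have to be communication
# for the bandwidth change alone to explain the throughput change.
implied_comm_share() {
    awk -v ot="$1" -v nt="$2" -v ob="$3" -v nb="$4" \
        'BEGIN { printf "%.3f\n", (ot / nt - 1) / (ob / nb - 1) }'
}

ratio 13.79 27.88                            # 0.495
ratio 83.8 89.314                            # 0.938
implied_comm_share 89.314 83.8 27.88 13.79   # 0.064
```

Read this way, a communication share of roughly 6% of the older step would be enough to account for the gap, which is why halving the fabric bandwidth does not halve decode speed. Because the vLLM source also changed between the two runs, treat the figure as an upper-bound illustration.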
Warm versus cold also matters. Cold runs can include model load, graph capture,
kernel compilation, and cache creation. The older repro saw 69.33 output tok/s
on a first post-reboot pass and 88.72 output tok/s on the immediate warm
rerun. Compare warm runs to warm runs.
The endpoint was expanded after the strict speed gate. The practical server
default is now 32768 tokens, not 2048.
Observed on 2026-05-23:
- 32768 failed vLLM’s KV-cache memory check and 24576 was the practical default.
- With xe.disable_display=1, 32768 started successfully with gpu_memory_utilization=0.95.
- vLLM reported 33,792 GPU KV-cache tokens and 1.03x maximum concurrency for a 32,768-token request.
- xpu-smi showed about 32651 MiB used per B70 after startup. This is expected because vLLM preallocates KV cache memory.
- 84.12 output tok/s for prompt 510 / output 1536.
- 85.45 output tok/s after first streamed chunk, 111.64 total tok/s, 351 ms vLLM TTFT, and a conservative 1446 tok/s prefill lower-bound for prompt 510 / output 1536 at the same 32K served context. See ../../data/minimax-m27-openai-endpoint-metrics-32k-20260524.json.
- 32408 prompt tokens plus 64 generated tokens completed without OOM.
- Requests with max_tokens=1 measured about 1.7k-1.8k prompt tok/s for 2k-16k prompt sizes. A 16k prompt took about 9 s before the generated token, so long-prompt TTFT is visible even though prefill throughput is healthy.
- 33792 was tested because vLLM reported 33,792 GPU KV-cache tokens at the 32k setting, but it did not expose /v1/models within the wait window and is not promoted.

Structured result: results/context-window-20260523.json
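Two of the figures above follow from simple division, sketched here for anyone re-checking them on a different host. The helper names are hypothetical, and treating "16k" as 16384 tokens is an assumption of this sketch.

```shell
#!/bin/sh
# kv_concurrency KV_TOKENS REQUEST_TOKENS: how many full-length requests fit in the KV cache.
kv_concurrency() {
    awk -v kv="$1" -v req="$2" 'BEGIN { printf "%.2f\n", kv / req }'
}

# prefill_seconds PROMPT_TOKENS TOKENS_PER_SECOND: rough time spent on prefill alone.
prefill_seconds() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

kv_concurrency 33792 32768    # 1.03
prefill_seconds 16384 1800    # 9.1
prefill_seconds 16384 1700    # 9.6
```

The first result matches the 1.03x concurrency vLLM reported, and the last two bracket the roughly 9 s wait observed for a 16k prompt.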
Detailed PCIe/prefill follow-up:
../../notes/2026-05-23-current-host-pcie4-prefill-check.md
../../notes/2026-05-23-b70-display-disable-32768-context.md
The strict run passed:
- raw145-n64-exact
- raw145-n256-exact
- semantic-suite-n64-r2
- arithmetic-repeat-n64-r16
- extended-sixpack-n64-r2

Do not publish or trust new speed claims unless these gates pass first. Fast nonsense output is not a valid optimization.
The vLLM server exposes:
- GET /health
- GET /v1/models
- POST /v1/chat/completions
- POST /v1/completions
- POST /v1/responses
- POST /v1/messages

Working listener from the originating machine:
0.0.0.0:8000
Model id:
/mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround
Expected /v1/models field:
{
"max_model_len": 32768
}
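To confirm the served context from a script, the field can be pulled out of the /v1/models response. The sketch below uses only grep and head so it does not depend on jq; the `extract_max_model_len` name is hypothetical, and the helper assumes the field appears as a plain integer, as in the expected output above.

```shell
#!/bin/sh
# Read a /v1/models JSON document on stdin and print the first max_model_len value.
extract_max_model_len() {
    grep -o '"max_model_len"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
        | head -n 1 \
        | grep -o '[0-9][0-9]*$'
}

# Example against the local server (not executed here):
# curl -sS http://127.0.0.1:8000/v1/models | extract_max_model_len
```

If the printed value is not 32768, check whether VLLM_MAX_MODEL_LEN was overridden when the serve script was started.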
Example request:
curl -sS http://127.0.0.1:8000/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "/mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround",
"prompt": "Return only the number 42:\n",
"max_tokens": 16,
"temperature": 0
}'
- Model download: set HF_HUB_DISABLE_XET=1 and use a retrying hf download loop.
- torch 2.13.0.dev20260520+xpu loaded the model but failed the quality gate during torch.compile/AOT with AssertionError: Handler already registered for current_stream. Workaround: use torch==2.11.0+xpu.
- vllm-xpu-kernels wheels were ABI-incompatible with the selected stack. Workaround: build vllm-xpu-kernels from source after installing PyTorch.
- The vllm-xpu-kernels source build can require more than 120 GB RSS for paged_decode_xe2.cpp. Workaround: create temporary SSD swap and build conservatively.
- The umbrella setvars.sh caused build trouble. Workaround: source the 2025.3 compiler env directly.
- --compilation-config-json was passed to the Python quality harness, which did not accept it. Fix: add a compatibility parser argument.
- A jq expression in strict summary generation needed parentheses around the boolean comparison. Fix included in this commit.
- vllm serve in this checkout does not accept --async-engine. Workaround: remove the stale flag; async scheduling is still enabled by vLLM in this config.
- ocloc sometimes logged internal compiler errors for a Triton reduction kernel, then the process continued and the server started. Treat this as an Intel compiler/runtime bug to report, but it was not fatal in this run.

MiniMax M2.7 here is a quantized MoE model. The heavy path is not a plain dense GEMM workload. The optimized lane depends on:
All 62 MoE layers reported the llm-scaler XPU INT4 decode path during quality and serving startup.
- ../../docs/b70-minimax-ubuntu24-deployment.md
- ../../docs/intel-b70-minimax-feedback-20260523.md
- ../../AGENT_HANDOFF.md