b70-optimization-lab

MiniMax M2.7 INT4 on Ubuntu 24.04 with 4x Intel Arc Pro B70

This folder documents a fresh Ubuntu 24.04 bring-up, performed on 2026-05-23, of MiniMax M2.7 INT4 on four Intel Arc Pro B70 GPUs.

The run satisfies the user’s requested >90 tok/s effective total throughput, but it is not a new output-token speed record. Earlier repo memory files mention an 89-class strict lane and a later 94-class structured lane. This 2026-05-23 setup is valuable because it is deployable, quality-gated, OpenAI-compatible, serves roughly 32k context, and is documented from a mostly fresh machine state.

Plain-English Summary

The Intel Arc Pro B70 is a workstation GPU. Four of them can run a large quantized MiniMax model locally if the software stack is aligned carefully:

  1. Ubuntu needs Intel’s current GPU userspace packages and oneAPI compiler.
  2. PyTorch must be the XPU build, not regular CUDA or CPU PyTorch.
  3. vLLM needs local Intel/XPU patches from this repo.
  4. The MiniMax INT4 MoE path needs llm-scaler custom kernels.
  5. Native XPU extensions must be compiled against the exact PyTorch ABI in the venv.
  6. Quality must be tested before speed numbers are trusted.

This is not a one-command consumer install yet. It is a reproducible lab setup for people and agents who are comfortable running shell scripts, reading logs, and rebooting when drivers change.

Minimum Assumptions

The scripts are written for:

The native vllm-xpu-kernels source build can use more than 120 GB resident memory for one compiler process. On machines with 16-64 GB RAM, use a temporary swap file on SSD. The provided build script creates one by default and removes it when the build finishes.
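As a rough sizing sketch for such a swap file: the helper below estimates how much swap is needed so that RAM plus swap reaches a target, then prints (without running) the standard Linux commands for a temporary swap file. The 128 GiB target, the /swapfile-build path, and both function names are assumptions for illustration; the provided build script may size and place its swap file differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: GiB of swap needed so RAM + swap reaches a target.
# The 128 GiB default is an assumed headroom figure, not a measured requirement.
swap_gib_needed() {
  local ram_gib="$1" target_gib="${2:-128}"
  local need=$(( target_gib - ram_gib ))
  if [ "$need" -lt 0 ]; then need=0; fi
  echo "$need"
}

# Print, but do not execute, the usual commands for a temporary swap file.
# On filesystems where fallocate-backed swap is unsupported, dd is the usual alternative.
print_swap_commands() {
  local size_gib="$1" path="${2:-/swapfile-build}"
  cat <<EOF
sudo fallocate -l ${size_gib}G ${path}
sudo chmod 600 ${path}
sudo mkswap ${path}
sudo swapon ${path}
# after the build finishes:
sudo swapoff ${path} && sudo rm ${path}
EOF
}
```

For example, `print_swap_commands "$(swap_gib_needed 64)"` prints the commands for a 64 GiB swap file on a 64 GiB machine under the assumed target.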

Disk Budget

Plan for roughly:

A 512 GB disk is tight but workable if /mnt/fast-ai holds most of the space and old build trees and caches are cleaned out.

Current Known-Good Versions

Observed on the working system:

Important: for builds, source /opt/intel/oneapi/compiler/2025.3/env/vars.sh. On this machine, the umbrella /opt/intel/oneapi/setvars.sh selected oneAPI 2026 instead, which caused SYCL header and build problems.
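A minimal sketch of guarding against the wrong environment: resolve the version-pinned vars.sh explicitly and fail early if it is missing, rather than falling back to the umbrella script. The function name is hypothetical, and the overridable root argument exists only so the check can be exercised outside /opt/intel/oneapi.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the pinned compiler env script,
# or fail if it is not readable. The 2025.3 path comes from the note above.
oneapi_compiler_env() {
  local root="${1:-/opt/intel/oneapi}"
  local vars="$root/compiler/2025.3/env/vars.sh"
  if [ -r "$vars" ]; then
    printf '%s\n' "$vars"
  else
    echo "missing: $vars" >&2
    return 1
  fi
}
```

In a build shell this could be used as `source "$(oneapi_compiler_env)"`, followed by `icpx --version` to confirm which compiler release is active.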

Install provenance: earlier 2026-05-20 repro scripts installed vllm-xpu-kernels==0.1.7 as a wheel. This 2026-05-23 Ubuntu 24.04 bring-up first tested wheel installs, then settled on an editable source build from vllm-project/vllm-xpu-kernels at the commit above to resolve the active PyTorch ABI mismatch.

Fresh-System Quick Start

Clone this repository:

git clone https://github.com/steveseguin/llm-optimizations.git
cd llm-optimizations/repro/minimax-m27-b70-110tps-ubuntu24-20260523

Install system packages, Intel GPU userspace, and oneAPI:

sudo bash scripts/00-install-system-deps.sh
sudo reboot

After reboot, verify GPUs:

xpu-smi discovery
clinfo -l
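If either command is not found, the install step probably did not complete. A small sketch of a PATH check that can run before the two commands above; `missing_cmds` is a hypothetical helper, not part of the repo scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each named command that is not on PATH.
# Returns non-zero if at least one command is missing.
missing_cmds() {
  local rc=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `missing_cmds xpu-smi clinfo || echo "rerun scripts/00-install-system-deps.sh"` reports which of the two tools is absent.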

Set up storage, download the model, and build the Python/native stack:

sudo bash scripts/01-prepare-storage.sh

# Optional if needed:
# export HF_TOKEN=...
bash scripts/02-download-model.sh

bash scripts/03-build-stack.sh

Verify imports and runtime:

bash scripts/04-verify-runtime.sh

Run the quality and benchmark gate before serving:

bash scripts/05-run-quality-and-benchmark.sh

Start the OpenAI-compatible endpoint:

bash scripts/06-serve-openai-compatible.sh

The serve script listens on 0.0.0.0:8000 and defaults to:

max_model_len=32768
gpu_memory_utilization=0.95

Override only if you are intentionally retesting capacity:

VLLM_MAX_MODEL_LEN=16384 bash scripts/06-serve-openai-compatible.sh
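For readers unfamiliar with this override style, the sketch below shows how an environment variable with a default is commonly resolved in a shell script. It assumes ordinary parameter expansion; the actual serve script may implement the default differently, and `effective_max_model_len` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: use VLLM_MAX_MODEL_LEN if set, else the documented default.
# Rejects non-numeric values so a typo does not reach the server command line.
effective_max_model_len() {
  local value="${VLLM_MAX_MODEL_LEN:-32768}"
  case "$value" in
    ''|*[!0-9]*)
      echo "VLLM_MAX_MODEL_LEN must be a positive integer" >&2
      return 1
      ;;
  esac
  echo "$value"
}
```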

From the same machine:

curl http://127.0.0.1:8000/v1/models

From another machine on the LAN, substitute the server's IP address:

curl http://<server-lan-ip>:8000/v1/models
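Model load can take a while, so the endpoint may refuse connections at first. A minimal polling sketch, assuming only standard curl flags; the function name, retry count, and delay are illustrative choices rather than values from this repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll <base_url>/v1/models until it answers or tries run out.
# Arguments: base URL, number of tries (default 60), delay in seconds (default 5).
wait_for_models() {
  local base_url="$1" tries="${2:-60}" delay="${3:-5}" i=0
  while [ "$i" -lt "$tries" ]; do
    # -f turns HTTP errors into failures; --max-time bounds each attempt.
    if curl -fsS --max-time 5 "$base_url/v1/models" >/dev/null 2>&1; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_models http://127.0.0.1:8000 && echo "endpoint is up"`.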

Expected Results

For this exact setup, the strict gate produced:

Primary local artifact from the originating machine:

/mnt/fast-ai/bench-results/minimax-m27-b70-89tps/strict/minimax-repro-minimax-moe-full-forward-customop-plus-output-ar-strict-tp4-ctx2048-mbt512-bs256-20260523T145201Z-summary.json

How To Compare 83, 89, 93, and 94 tok/s Notes

Label the benchmark before comparing numbers:

The current host likely gives up some generated-token throughput because it is running the B70s over PCIe4 x16 instead of the earlier PCIe5-class path:

The size of the gap is plausible: halving GPU-to-GPU communication bandwidth causes a smaller end-to-end decode drop, because only part of each generated-token step is communication. The rest is local GPU compute, memory traffic, vLLM scheduling, and sampling. This is not proof that PCIe explains everything; the current vLLM source also differs from the older promoted stack. It is, however, a credible non-quality explanation for why this host lands near 83 tok/s instead of the older 89-93 class.
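A back-of-the-envelope check of that reasoning, under a deliberately simplified model: communication time doubles when bandwidth halves, everything else is unchanged, and nothing overlaps. If f is the communication share of the older per-token step, the new step takes (1 + f) times as long, so old_tps / new_tps = 1 + f. The function name is hypothetical and the model is an assumption, not a measurement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: communication share f implied by a throughput drop,
# assuming communication time exactly doubles and nothing else changes.
implied_comm_fraction() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.3f\n", old / new - 1 }'
}

implied_comm_fraction 89 83   # prints 0.072
implied_comm_fraction 93 83   # prints 0.120
```

Under these assumptions, a communication share of roughly 7 to 12 percent of the older per-token step would be enough to account for the gap. Real runs overlap communication with compute, so this is only a plausibility check.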

Warm versus cold also matters. Cold runs can include model load, graph capture, kernel compilation, and cache creation. The older repro saw 69.33 output tok/s on a first post-reboot pass and 88.72 output tok/s on the immediate warm rerun. Compare warm runs to warm runs.
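To put the cold-start gap in relative terms, the sketch below computes how far a cold run falls short of a warm run, using the two figures quoted above. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage by which cold throughput is below warm throughput.
cold_penalty_pct() {
  awk -v cold="$1" -v warm="$2" 'BEGIN { printf "%.1f\n", (1 - cold / warm) * 100 }'
}

cold_penalty_pct 69.33 88.72   # prints 21.9
```

A penalty of about 22 percent is larger than the difference between the throughput classes being compared, which is why mixing cold and warm numbers can easily mislead.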

Served Context Window

The endpoint was expanded after the strict speed gate. The practical server default is now 32768 tokens, not 2048.

Observed on 2026-05-23:

Structured result: results/context-window-20260523.json

Detailed PCIe/prefill follow-up:

../../notes/2026-05-23-current-host-pcie4-prefill-check.md

../../notes/2026-05-23-b70-display-disable-32768-context.md

Quality Gates That Passed

The strict run passed:

Do not publish or trust new speed claims unless these gates pass first. Fast nonsense output is not a valid optimization.
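The ordering can be enforced mechanically. scripts/05-run-quality-and-benchmark.sh already combines both steps; the sketch below only illustrates the same gate-then-measure pattern for ad-hoc experiments. `run_gated` is a hypothetical helper, and the two commands passed to it are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the benchmark command only if the quality command succeeds.
# Both arguments are shell command strings supplied by the caller.
run_gated() {
  local quality_cmd="$1" bench_cmd="$2"
  if ! bash -c "$quality_cmd"; then
    echo "quality gate failed; skipping benchmark" >&2
    return 1
  fi
  bash -c "$bench_cmd"
}
```

With this shape, a failing quality check produces no benchmark output at all, so there is no speed number to misquote.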

OpenAI-Compatible API

The vLLM server exposes:

Working listener from the originating machine:

0.0.0.0:8000

Model id:

/mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround

Expected /v1/models field:

{
  "max_model_len": 32768
}
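A quick way to check that field from the command line, sketched with plain grep so it needs no JSON tooling. This is a text match rather than a real JSON parser, and `extract_max_model_len` is a hypothetical helper; it prints the first max_model_len value it finds on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read JSON on stdin, print the first "max_model_len" integer.
# Uses a text match only; it does not validate the JSON structure.
extract_max_model_len() {
  grep -o '"max_model_len"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | head -n 1 \
    | grep -o '[0-9][0-9]*$'
}
```

For example, `curl -sS http://127.0.0.1:8000/v1/models | extract_max_model_len` should print 32768 on a server running with the default settings.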

Example request:

curl -sS http://127.0.0.1:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround",
    "prompt": "Return only the number 42:\n",
    "max_tokens": 16,
    "temperature": 0
  }'

Bugs and Fixes Found During Bring-Up

What MiniMax M2.7 Looks Like In This Stack

MiniMax M2.7 here is a quantized MoE model. The heavy path is not a plain dense GEMM workload. The optimized lane depends on:

All 62 MoE layers reported the llm-scaler XPU INT4 decode path during quality and serving startup.
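One way to verify a claim like this after a restart is to count matching lines in the server log. The exact log text is not shown in this README, so the pattern must be supplied by the reader; the helper names and the log path in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count log lines containing a fixed string.
count_log_matches() {
  local logfile="$1" pattern="$2"
  # grep -c exits non-zero on zero matches; keep the printed count regardless.
  grep -c -F -- "$pattern" "$logfile" || true
}

# Hypothetical helper: fail unless the count equals the expected layer count (default 62).
expect_layer_count() {
  local logfile="$1" pattern="$2" expected="${3:-62}" found
  found="$(count_log_matches "$logfile" "$pattern" 2>/dev/null)"
  found="${found:-0}"   # unreadable log file counts as zero matches
  if [ "$found" -ne "$expected" ]; then
    echo "expected $expected matching lines, found $found" >&2
    return 1
  fi
}
```

For example, `expect_layer_count /path/to/server.log "<string your build logs per layer>"` would fail if fewer than 62 layers reported the expected path.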

More Details