b70-optimization-lab

Deploying MiniMax M2.7 INT4 on 4x Intel Arc Pro B70

This note is for a future user or agent starting with a fresh Ubuntu 24.04 machine. It explains what is being built, why the steps matter, and how to validate the result before using the endpoint.

What You Are Building

You are building a local OpenAI-compatible text-generation server: MiniMax M2.7 INT4 served by vLLM across four Intel Arc Pro B70 GPUs.

The endpoint is compatible with common OpenAI-style clients:

The server in this repro binds with --host 0.0.0.0, so the endpoint is reachable from the LAN if the host firewall and network allow it.

The current serving default is a 32768-token (32k) context window. The comparable quality/speed gate still uses a 2048-token context (512 prompt tokens plus 1536 generated tokens) so new runs can be compared against the original p512/n1536 benchmark.

Hardware and OS Checklist

Before installing:

After the Intel packages and reboot:

xpu-smi discovery
clinfo -l
lspci | grep -i intel

Expected:

Why 16 GB RAM Is Still Possible

The model is far too large to hold in 16 GB of system RAM, but the runtime still works because the weights live in GPU memory, while the disk holds the model files and compile caches.

The build is the hard part. Compiling paged_decode_xe2.cpp in vllm-xpu-kernels peaked at more than 120 GB RSS on the originating host. The repro script creates a temporary 160G swap file so that the compile survives on machines with only 16 GB RAM. This is slow but practical on an SSD.

If the compiler is killed:

  1. Keep or recreate the swap file.
  2. Rerun scripts/03-build-stack.sh.
  3. Use conservative parallelism: VLLM_XPU_KERNELS_MAX_JOBS=4 or 8.
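
Before rerunning the build, it can help to confirm that RAM plus swap actually covers the peak described above. The following is a minimal sketch, not part of the repro scripts: the function names and the optional file argument are illustrative, and the 120 GiB default is only an approximation of the observed "120+ GB" peak.

```shell
# Sketch: check that RAM + swap covers the compile peak noted above.
# Function names and the 120 GiB default are illustrative assumptions.

# Prints total RAM + swap in whole GiB, read from a meminfo-style file.
mem_plus_swap_gib() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:|^SwapTotal:/ { kib += $2 }
         END { printf "%d\n", kib / 1048576 }' "$meminfo"
}

# Returns 0 if RAM + swap is at least the required GiB (default 120).
build_memory_ok() {
    local meminfo="${1:-/proc/meminfo}"
    local required_gib="${2:-120}"
    [ "$(mem_plus_swap_gib "$meminfo")" -ge "$required_gib" ]
}
```

A possible use is `build_memory_ok || echo "Recreate the swap file before rerunning scripts/03-build-stack.sh"`.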

The scripts default to:

/mnt/fast-ai/
  llm-models/
  llm-cache/hf/
  bench-results/
  vllm-cache-exp/
  src/

The model path used by vLLM:

/mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround

Install Flow

Use the new repro folder:

cd llm-optimizations/repro/minimax-m27-b70-110tps-ubuntu24-20260523
sudo bash scripts/00-install-system-deps.sh
sudo reboot

After reboot:

sudo bash scripts/01-prepare-storage.sh
bash scripts/02-download-model.sh
bash scripts/03-build-stack.sh
bash scripts/04-verify-runtime.sh
bash scripts/05-run-quality-and-benchmark.sh
bash scripts/06-serve-openai-compatible.sh

In another terminal:

bash scripts/07-smoke-test-endpoint.sh

Build Order Matters

The successful order was:

  1. Install Intel system packages and oneAPI.
  2. Install PyTorch 2.11.0+xpu.
  3. Build vllm-xpu-kernels from source against that PyTorch.
  4. Build llm-scaler custom MiniMax INT4 kernels against that PyTorch.
  5. Install patched vLLM editable.
  6. Run runtime import checks.
  7. Run quality gates.
  8. Run benchmarks.
  9. Serve.

Do not install a prebuilt vllm-xpu-kernels wheel and assume it is ABI-compatible. It was not compatible during this bring-up.
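
Because both native builds are compiled against one specific PyTorch, a guard that confirms the pinned version before step 3 can catch a mismatch early. This is a hedged sketch: the function names are hypothetical, and it assumes the interpreter that will run vLLM is the `python3` on the PATH.

```shell
# Sketch: confirm the pinned PyTorch build before compiling native kernels.
# Function names are illustrative; the pinned version comes from this note.

# Returns 0 only if the given version string is exactly the pinned build.
torch_version_ok() {
    [ "$1" = "2.11.0+xpu" ]
}

# Prints the installed torch version, or nothing if the import fails.
installed_torch_version() {
    python3 -c 'import torch; print(torch.__version__)' 2>/dev/null
}
```

For example, `torch_version_ok "$(installed_torch_version)" || echo "Install torch==2.11.0+xpu first"` stops before a long build is started against the wrong PyTorch.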

Critical Environment Choices

Use:

source /opt/intel/oneapi/compiler/2025.3/env/vars.sh

Avoid:

source /opt/intel/oneapi/setvars.sh

On the originating host, the umbrella setvars.sh selected oneAPI 2026 and caused SYCL build problems. The direct 2025.3 compiler environment built successfully.
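
One way to notice that the wrong toolchain is active is to inspect the compiler banner after sourcing the environment. The sketch below assumes that `icpx --version` prints a first line containing the release number (for example "... Compiler 2025.3.0 ..."); that banner format is an assumption, and the function name is hypothetical.

```shell
# Sketch: detect whether the active oneAPI compiler is a 2025.3.x release.
# Assumes the banner text contains the release number preceded by a space.

# Returns 0 if the given banner text mentions a 2025.3.x release.
oneapi_is_2025_3() {
    case "$1" in
        *" 2025.3."*) return 0 ;;
        *) return 1 ;;
    esac
}
```

After sourcing the 2025.3 vars.sh, `oneapi_is_2025_3 "$(icpx --version 2>&1 | head -n 1)" || echo "Unexpected oneAPI compiler version"` would flag a shell that picked up the 2026 toolchain.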

Runtime flags are collected in:

repro/minimax-m27-b70-110tps-ubuntu24-20260523/configs/runtime-env.sh

Notable flags:

Some flags show as “unknown vLLM environment variable” in vLLM’s generic env scanner. They are still consumed by patched code paths or local integration logic. Do not remove them without rerunning the full quality gate.

Quality Before Speed

The quality gate prevents false wins. It checks exact token hashes and semantic canaries. This matters because many low-level optimizations can produce fast but degraded, nondeterministic, or degenerate output.
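
The exact-hash idea can be illustrated with a small sketch. This is not the gate's actual implementation: it assumes each run writes its generated token IDs to a text file, and the function names and the choice of SHA-256 are illustrative.

```shell
# Sketch: compare two runs by hashing their generated token IDs.
# File format, function names, and SHA-256 are assumptions for illustration.

# Prints the SHA-256 hex digest of the token IDs read from stdin.
token_hash() {
    sha256sum | awk '{ print $1 }'
}

# Returns 0 if two token-ID files hash identically.
tokens_match() {
    [ "$(token_hash < "$1")" = "$(token_hash < "$2")" ]
}
```

A single differing token changes the digest, which is why an exact hash catches silent numerical drift that a throughput number alone would hide.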

Passed checks on 2026-05-23:

If any of these fail, stop and fix quality before benchmarking.

Throughput Interpretation

vLLM benchmark output reports total tokens per second. For a 512 input / 1536 output request, total token throughput includes prompt processing and generation.

2026-05-23 result:

The user’s target was >90 tok/s; this setup clears that target when measured as total (effective) throughput. Counting generated tokens only, the current fresh deployment is an 83 tok/s-class setup.
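
The relationship between the two numbers is simple arithmetic if prompt and generated tokens are counted over the same wall-clock time. The helper below is a sketch under that assumption; the function name is hypothetical and it is not taken from the benchmark tooling.

```shell
# Sketch: convert generated-token tok/s to total tok/s for one request shape,
# assuming prompt and generated tokens are counted over the same wall time.
output_to_total_tps() {
    local output_tps="$1"
    local prompt_tokens="$2"
    local output_tokens="$3"
    awk -v o="$output_tps" -v p="$prompt_tokens" -v n="$output_tokens" \
        'BEGIN { printf "%.2f\n", o * (p + n) / n }'
}
```

For the p512/n1536 shape the factor is 2048/1536 = 4/3, so `output_to_total_tps 83 512 1536` prints 110.67. This is why a generated-token rate in the low 80s can coexist with a total rate above the 90 tok/s target.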

Do not compare the old 94 tok/s note directly to this number. That was a constrained structured-output lane with short output and validation/retry accounting, not the same random p512/n1536 decode benchmark. The older apples-to-apples strict random decode lane was 89.314 output tok/s.

The most likely non-quality explanation for the 89 -> 83 gap on this host is PCIe fabric bandwidth:

MiniMax tensor-parallel decode uses many small cross-GPU reductions, so slower inter-GPU communication can show up as lower generated-token throughput. This does not prove PCIe explains every token of the gap: the current vLLM source is also newer than the original promoted stack. But the bandwidth math lines up well enough that PCIe4 is a credible primary cause.

Warm versus cold also matters. Cold starts can include model load, graph capture, kernel compilation, and cache effects. The old repro saw a first post-reboot pass at only 69.33 output tok/s, then an immediate warm rerun at 88.72 output tok/s. Compare warm runs to warm runs.

The served 32768-token endpoint was also checked through the OpenAI-compatible API:

This means the larger served context did not reduce the normal short-request decode lane after warmup.

Serving

Start:

bash scripts/06-serve-openai-compatible.sh

Default bind:

0.0.0.0:8000

Default context settings:

VLLM_MAX_MODEL_LEN=32768
VLLM_GPU_MEMORY_UTILIZATION=0.95

The script allows overrides for controlled retests:

VLLM_MAX_MODEL_LEN=16384 bash scripts/06-serve-openai-compatible.sh
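
Overrides of this kind are usually implemented with shell default expansion. The sketch below shows the general pattern only; it is a hypothetical illustration, and the real scripts/06-serve-openai-compatible.sh may handle these variables differently.

```shell
# Hypothetical sketch of default-with-override handling; the real serve
# script may differ. A value already set in the environment is kept.
resolve_serving_defaults() {
    : "${VLLM_MAX_MODEL_LEN:=32768}"
    : "${VLLM_GPU_MEMORY_UTILIZATION:=0.95}"
    echo "max_model_len=${VLLM_MAX_MODEL_LEN} gpu_memory_utilization=${VLLM_GPU_MEMORY_UTILIZATION}"
}
```

With this pattern, a prefix assignment such as VLLM_MAX_MODEL_LEN=16384 changes one setting for a single run without editing the script.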

Find LAN IP:

hostname -I

Test:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Expected /v1/models includes:

{
  "max_model_len": 32768
}
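
To check the reported context length without reading the JSON by eye, the value can be pulled out with standard text tools. This is a minimal sketch with a hypothetical function name; it assumes the field appears as an integer named max_model_len, as in the expected output above, and it does not validate the rest of the response.

```shell
# Sketch: read a /v1/models JSON body on stdin and print the first
# max_model_len value. Uses grep rather than a JSON parser, so it only
# suits simple checks like this one.
extract_max_model_len() {
    grep -o '"max_model_len": *[0-9][0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}
```

For example, `curl -s http://127.0.0.1:8000/v1/models | extract_max_model_len` should print 32768 for the default configuration.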

Example OpenAI-compatible client base URL:

http://<server-ip>:8000/v1

Known Failure Modes

Hugging Face download hangs

Symptom: no progress, stale sockets, process appears wedged.

Workaround:

export HF_HUB_DISABLE_XET=1
bash scripts/02-download-model.sh

PyTorch 2.13 XPU AOT failure

Symptom during quality gate:

AssertionError: Handler already registered for <function current_stream ...>

Workaround: use torch==2.11.0+xpu.

vllm-xpu-kernels ABI mismatch

Symptom:

undefined symbol: _ZNR5torch7Library4_defEON3c1014FunctionSchema...

Workaround: build vllm-xpu-kernels from source after PyTorch is installed.

Native build OOM

Symptom: compiler killed or system becomes unresponsive during paged_decode_xe2.cpp.

Workaround: use SSD swap and lower VLLM_XPU_KERNELS_MAX_JOBS.
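
To confirm that the kernel OOM killer, rather than a compiler bug, ended the build, the kernel log can be filtered for its kill messages. The sketch below assumes the usual Linux wording "Out of memory: Killed process"; the function name is hypothetical.

```shell
# Sketch: read kernel log text on stdin and print any OOM-killer lines.
# Always returns 0 so it is safe to use in scripts running under "set -e".
oom_kill_lines() {
    grep -i 'out of memory: killed process' || true
}
```

Typical inputs are `sudo dmesg | oom_kill_lines` or `journalctl -k -b | oom_kill_lines`. Any output naming the compiler process points to memory pressure, in which case more swap or a lower VLLM_XPU_KERNELS_MAX_JOBS is the fix.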

--async-engine rejected by vLLM serve

Symptom:

vllm: error: unrecognized arguments: --async-engine

Workaround: remove --async-engine. This checkout enables async scheduling internally.

Intel ocloc internal compiler error

Symptom during server compile:

IGC: Internal Compiler Error: Floating point exception

Observed behavior: the error was nonfatal in the final server start; vLLM continued compiling and capturing graphs, then served successfully. If it becomes fatal, wipe the relevant compile cache, reduce concurrency, and rerun. Preserve the logs for Intel.

Context window does not fit

Symptom:

ValueError: To serve at least one request with the models's max seq len ...
Try increasing `gpu_memory_utilization` or decreasing `max_model_len`

Observed limits on the originating 4x B70 host:

This is VRAM/KV-cache headroom, not system RAM overflow. vLLM preallocates KV cache memory, so xpu-smi can show roughly 32651 MiB used and 100% memory utilization even when the server is idle.

What To Improve Next

Useful next experiments: