b70-optimization-lab

Notes To Intel: Making 4x B70 vLLM/MiniMax Deployment Easier

This note summarizes pain points and requests found while deploying MiniMax M2.7 INT4 on 4x Intel Arc Pro B70 GPUs with Ubuntu 24.04, PyTorch XPU, vLLM, llm-scaler, and Level Zero.

The final setup worked and passed its quality gates, but the path to get there is still too fragile for ordinary users.

What Worked

Main Friction Points

1. Native XPU kernel builds are extremely memory hungry

Building vllm-xpu-kernels from source required a single compiler process that used roughly 120 GB or more of RSS while compiling paged_decode_xe2.cpp.
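To make this kind of report reproducible, the peak memory of a build can be measured rather than eyeballed. The following is a minimal sketch, assuming Linux (where ru_maxrss is reported in KiB) and a build command of your choosing; the helper name run_and_report_peak_rss is hypothetical and not part of any Intel or vLLM tooling.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd and return (exit_code, peak_rss_gib).

    RUSAGE_CHILDREN reports the largest resident set size among all
    waited-for descendants of this process, so a compiler started by
    make or ninja is covered as long as its parent waits for it.
    Assumption: Linux, where ru_maxrss is expressed in KiB.
    """
    result = subprocess.run(cmd)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return result.returncode, peak_kib / (1024 ** 2)


if __name__ == "__main__":
    # Example: python peak_rss.py <your build command and arguments>
    code, peak_gib = run_and_report_peak_rss(sys.argv[1:])
    print(f"exit code {code}, peak child RSS {peak_gib:.1f} GiB")
```

Because the value is the maximum over single processes and not a sum, it corresponds to the "one compiler process" figure above and does not reflect total memory use of a parallel build.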

Impact:

Requests:

2. Version compatibility is too implicit

The successful stack required:

A newer PyTorch XPU nightly was also tried. It loaded the model but failed during vLLM's torch.compile/AOT step with a duplicate-handler assertion.

Requests:

3. ocloc/IGC internal compiler errors need better handling

During server startup, Triton/Inductor compilation emitted:

IGC: Internal Compiler Error: Floating point exception
Build failed with error code: -11

The process eventually continued and the server started, but the message is alarming and ambiguous: it is not clear from the output whether the failure was handled or what it affected.
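Until the messages are clearer, a startup log can at least be scanned so these errors are counted and not lost in the scroll. This is a minimal sketch that matches only the two message formats quoted above; the function name summarize_igc_errors is hypothetical, and other IGC or ocloc message formats are not covered.

```python
import re

# Patterns match the two log lines quoted in this section, nothing more.
ICE_PATTERN = re.compile(r"IGC: Internal Compiler Error: (?P<reason>.+)")
BUILD_FAIL_PATTERN = re.compile(r"Build failed with error code: (?P<code>-?\d+)")


def summarize_igc_errors(log_text):
    """Return the ICE reasons and build error codes found in a log."""
    reasons = [m.group("reason").strip() for m in ICE_PATTERN.finditer(log_text)]
    codes = [int(m.group("code")) for m in BUILD_FAIL_PATTERN.finditer(log_text)]
    return {"ice_reasons": reasons, "build_error_codes": codes}


if __name__ == "__main__":
    import sys

    # Example: python scan_igc.py server.log
    with open(sys.argv[1], errors="replace") as handle:
        print(summarize_igc_errors(handle.read()))
```

A summary like this does not say whether the server recovered correctly. It only makes the occurrence visible so it can be attached to a report.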

Requests:

4. oneAPI environment selection is easy to get wrong

Sourcing /opt/intel/oneapi/setvars.sh selected a newer compiler stack that caused build trouble. Sourcing /opt/intel/oneapi/compiler/2025.3/env/vars.sh directly worked.
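A build script can guard against this by checking which compiler root is active before it starts. The sketch below assumes that the compiler's vars.sh exports CMPLR_ROOT pointing at the versioned install directory; that variable name is an assumption about the oneAPI environment scripts, and compiler_env_matches is a hypothetical helper.

```python
import os
from pathlib import PurePosixPath


def compiler_env_matches(expected_version, environ=None):
    """Check whether the active oneAPI compiler root matches a version.

    Assumption: CMPLR_ROOT is set by the sourced environment script and
    contains the version as a path component, for example
    /opt/intel/oneapi/compiler/2025.3. Symlinks such as "latest" are
    resolved first when the path exists.
    """
    environ = os.environ if environ is None else environ
    root = environ.get("CMPLR_ROOT")
    if not root:
        return False, "CMPLR_ROOT is not set; no compiler environment sourced"
    resolved = os.path.realpath(root)
    if expected_version in PurePosixPath(resolved).parts:
        return True, f"compiler root {resolved} matches {expected_version}"
    return False, f"compiler root {resolved} does not match {expected_version}"


if __name__ == "__main__":
    ok, message = compiler_env_matches("2025.3")
    print(("OK: " if ok else "MISMATCH: ") + message)
```

Failing fast on a mismatch is cheaper than finding out partway through a long kernel build.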

Requests:

5. The “unknown vLLM environment variable” warnings hide useful integration flags

Patched/local vLLM paths used several VLLM_* flags, while vLLM’s generic environment scanner warned they were unknown.
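One way to see which flags fall into this gap is to list the VLLM_* variables in the environment and compare them against a set of names the running build is known to register. This is a minimal sketch: the known set is passed in by the caller, and how a patched build should publish its own flag names is left open. The flag name in the example is made up for illustration.

```python
import os


def unknown_vllm_vars(known_names, environ=None):
    """Return the sorted VLLM_* variable names that are not in known_names."""
    environ = os.environ if environ is None else environ
    return sorted(
        name
        for name in environ
        if name.startswith("VLLM_") and name not in known_names
    )


if __name__ == "__main__":
    # Hypothetical example environment; VLLM_HYPOTHETICAL_XPU_FLAG is invented.
    example_env = {
        "VLLM_LOGGING_LEVEL": "INFO",
        "VLLM_HYPOTHETICAL_XPU_FLAG": "1",
        "PATH": "/usr/bin",
    }
    print(unknown_vllm_vars({"VLLM_LOGGING_LEVEL"}, example_env))
```

Printing this list at startup next to the warnings would make it easier to tell a typo from an integration flag that a local patch actually reads.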

Impact:

Requests:

6. Quality validation needs to be first-class

Performance work on MoE, graph capture, custom allreduce, and logits paths can silently change output quality.
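A simple form of such a gate compares greedy outputs from a trusted baseline run against outputs from the optimized build on the same prompts. The sketch below is one possible shape, not the gate used for this deployment; the function name quality_gate and the default threshold are hypothetical.

```python
def quality_gate(baseline_outputs, candidate_outputs, min_match_rate=0.98):
    """Compare per-prompt outputs and return (passed, match_rate).

    Both lists must hold outputs for the same prompts in the same order,
    generated with deterministic (greedy) decoding. The threshold is an
    illustrative default and should be tuned per model and workload.
    """
    if len(baseline_outputs) != len(candidate_outputs):
        raise ValueError("baseline and candidate must cover the same prompts")
    if not baseline_outputs:
        raise ValueError("at least one prompt is required")
    matches = sum(b == c for b, c in zip(baseline_outputs, candidate_outputs))
    rate = matches / len(baseline_outputs)
    return rate >= min_match_rate, rate


if __name__ == "__main__":
    baseline = ["4", "Paris", "def add(a, b): return a + b"]
    candidate = ["4", "Paris", "def add(a, b): return a - b"]
    print(quality_gate(baseline, candidate))
```

Exact match is strict and will flag benign numeric drift as well as real regressions, so a production gate would likely pair it with task-level accuracy checks.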

Requests:

7. CPU KV offload is CUDA-shaped and needs an XPU path

The native vLLM CPU KV offload path rejected XPU before a local prototype was added:

CPU Offloading is currently only supported on CUDA-alike GPUs

A local XPU worker prototype showed that pinned host RAM, XPU streams/events, and multi-GB KV movement can work on this stack. In session-cache tests it moved KV data from CPU to GPU at roughly 14-16 GB/s. The remaining limitation is that XPU FlashAttention still needs the active request's working KV blocks resident in live GPU KV memory.
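For sizing, the measured rate translates into reload latency with simple arithmetic. The sketch below assumes GB/s means 10^9 bytes per second and that the rate is sustained for the whole copy; both are assumptions, and the helper names are hypothetical. With the 8.0 GiB per-worker CPU KV budget mentioned in the next section, a full reload would take a little over half a second under these assumptions.

```python
GIB = 1024 ** 3  # bytes per GiB


def transfer_seconds(size_gib, bandwidth_gb_per_s):
    """Time to move size_gib of data at a sustained bandwidth.

    Assumption: bandwidth is in decimal GB/s (1e9 bytes per second).
    """
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gib * GIB / (bandwidth_gb_per_s * 1e9)


def reload_window(size_gib, slow_gb_per_s=14.0, fast_gb_per_s=16.0):
    """Return (best_case_seconds, worst_case_seconds) for the measured range."""
    return (
        transfer_seconds(size_gib, fast_gb_per_s),
        transfer_seconds(size_gib, slow_gb_per_s),
    )


if __name__ == "__main__":
    best, worst = reload_window(8.0)
    print(f"8.0 GiB reload: {best:.2f} s to {worst:.2f} s")
```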

Requests:

8. Session-cache reload can hit Level Zero device loss

In c4 session-cache service testing, vLLM started with 34304 GPU KV tokens and an 8.0 GiB CPU KV budget per tensor-parallel worker. A later operational smoke test hit a Level Zero device-lost error while copying vLLM block-table state back to the GPU:

RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)

The stack was:

gpu_model_runner.py:_prepare_inputs
block_table.py:commit_block_table
vllm/v1/utils.py:copy_to_gpu

Requests:

9. TurboQuant on XPU needs workspace and quality guidance

TurboQuant is promising for KV capacity, but the first B70/XPU run failed with a locked-workspace allocation error:

Workspace is locked but allocation from turboquant_attn.py:_decode_attention
requires 0.19 MB, current size is 0.00 MB.

A local fallback patch allowed forward progress by allocating temporary buffers when the shared workspace was locked. After that, turboquant_k8v4 could start at 32K and report about 80128 GPU KV tokens, but sustained decode was much slower than the normal FP16-family KV lane.

Requests:

10. Long-context logprobs exposed NaN JSON serialization

Small OpenAI-compatible logprob requests worked, but long-context logprob checks failed because NaN values escaped into JSON:

Out of range float values are not JSON compliant: nan
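The message above is the one Python's json module raises when it is asked to serialize NaN or infinity with allow_nan=False. A minimal sketch of a defensive fix is to replace non-finite floats with null before serializing. The helper names are hypothetical, and this is not the change vLLM itself would necessarily make; it also hides the NaN without explaining why the logprob became NaN in the first place.

```python
import json
import math


def sanitize_floats(value):
    """Recursively replace NaN and infinite floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize_floats(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_floats(item) for item in value]
    return value


def dumps_strict(payload):
    """Serialize to standards-compliant JSON, mapping non-finite floats to null."""
    return json.dumps(sanitize_floats(payload), allow_nan=False)


if __name__ == "__main__":
    response = {"token": "x", "logprobs": [-0.5, float("nan")]}
    print(dumps_strict(response))
```

Clients that consume logprobs then need to treat null as "not available" and not as a probability.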

Requests:

Suggestions For A Streamlined Intel-Supported Path

An ideal user path would be:

sudo apt install intel-gpu-ai-runtime-b70-vllm
python -m venv .venv
. .venv/bin/activate
pip install torch-xpu-b70 vllm-xpu-b70 llm-scaler-minimax-kernels
intel-ai doctor --model minimax-m27-int4 --gpus 4
intel-ai serve-minimax --model /models/minimax --host 0.0.0.0 --port 8000

The intel-ai doctor command should verify:

It should also explain how much disk, RAM, and swap are needed.
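The intel-ai commands above are a wish list and do not exist today. As a rough illustration of the resource part of such a check, the sketch below reports free disk, total RAM, and total swap on Linux by reading /proc/meminfo. The function names are hypothetical, and the thresholds a real tool should enforce are deliberately left out because they depend on the model and build path.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: integer value}.

    Size fields are reported by the kernel in kB (KiB).
    """
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def resource_report(path=".", meminfo_path="/proc/meminfo"):
    """Return free disk at path plus total RAM and swap, all in GiB."""
    with open(meminfo_path) as handle:
        info = parse_meminfo(handle.read())
    gib = 1024 ** 3
    return {
        "disk_free_gib": shutil.disk_usage(path).free / gib,
        "ram_total_gib": info.get("MemTotal", 0) * 1024 / gib,
        "swap_total_gib": info.get("SwapTotal", 0) * 1024 / gib,
    }


if __name__ == "__main__":
    for name, value in resource_report().items():
        print(f"{name}: {value:.1f}")
```

Given the kernel build memory use described in section 1, reporting RAM and swap together matters more than either number alone.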

Concrete Bugs To Investigate

What To Preserve For Repro Reports

When reporting issues, capture: