b70-optimization-lab

Notes To Intel: Making 4x B70 vLLM/MiniMax Deployment Easier

This note summarizes pain points and requests found while deploying MiniMax M2.7 INT4 on 4x Intel Arc Pro B70 GPUs with Ubuntu 24.04, PyTorch XPU, vLLM, llm-scaler, and Level Zero.

The final setup worked and passed its quality gate, but the path to get there is still too fragile for ordinary users.

What Worked

Main Friction Points

1. Native XPU kernel builds are extremely memory hungry

Building vllm-xpu-kernels from source required a single compiler process that peaked at more than 120 GB RSS while compiling paged_decode_xe2.cpp.
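One way to avoid an out-of-memory kill is to check, before starting the build, how many such compile jobs the machine can hold at once. The sketch below is a minimal illustration, not part of any Intel or vLLM tooling. The function names are hypothetical, and treating every job as if it needed the roughly 120 GB peak seen here is a deliberately pessimistic assumption.

```python
import re


def parse_meminfo_kib(text, key):
    """Return one /proc/meminfo field in KiB, or 0 if the field is absent."""
    match = re.search(rf"^{re.escape(key)}:\s+(\d+)\s+kB", text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def safe_build_jobs(meminfo_text, peak_gib_per_job=120, use_swap=True):
    """Estimate how many compile jobs fit in memory at the same time.

    peak_gib_per_job defaults to the peak reported in this note; it is an
    assumption that other translation units could be equally heavy.
    """
    available_kib = parse_meminfo_kib(meminfo_text, "MemAvailable")
    if use_swap:
        available_kib += parse_meminfo_kib(meminfo_text, "SwapFree")
    available_gib = available_kib / (1024 * 1024)
    return int(available_gib // peak_gib_per_job)


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory summary (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Calling safe_build_jobs(read_meminfo()) on the build host gives a rough upper bound on parallel jobs. A result of 0 suggests the build is unlikely to finish without more RAM or swap.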

Impact:

Requests:

2. Version compatibility is too implicit

The successful stack required:

A newer PyTorch XPU nightly was also tried. It loaded the model but then failed during vLLM's torch.compile/AOT step with a duplicate-handler assertion.
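Until compatibility is documented explicitly, a known-good stack can at least be recorded and compared mechanically. The following is a small sketch using only the standard library. The package names passed in and any pinned versions are placeholders chosen by the user, not the versions of the working stack described here.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version string."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def find_mismatches(installed, pinned):
    """Return {name: (installed, wanted)} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pinned.items():
        have = installed.get(name, "not installed")
        if have != wanted:
            mismatches[name] = (have, wanted)
    return mismatches
```

For example, collect_versions(["torch", "vllm"]) can be saved next to a working deployment, and find_mismatches can later flag drift after an upgrade. The distribution names are assumptions and may differ for XPU-specific wheels.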

Requests:

3. ocloc/IGC internal compiler errors need better handling

During server startup, Triton/Inductor compilation emitted:

IGC: Internal Compiler Error: Floating point exception
Build failed with error code: -11

The process eventually continued and the server started, but the message is alarming and does not say whether the failure was recovered from or which kernels were affected.
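Because startup continues after these messages, they are easy to miss in a long log. A minimal sketch of a log scanner is shown below. It matches only the two message shapes quoted above, and the function name and result layout are illustrative choices rather than an existing tool.

```python
import re

# Patterns taken from the two log lines quoted in this note.
ICE_PATTERN = re.compile(r"IGC: Internal Compiler Error: (?P<reason>.+)")
BUILD_FAIL_PATTERN = re.compile(r"Build failed with error code: (?P<code>-?\d+)")


def scan_log(lines):
    """Return (line_number, kind, detail) for each compiler failure found."""
    findings = []
    for number, line in enumerate(lines, start=1):
        ice = ICE_PATTERN.search(line)
        if ice:
            findings.append((number, "ice", ice.group("reason").strip()))
            continue
        failed = BUILD_FAIL_PATTERN.search(line)
        if failed:
            findings.append((number, "build_failed", int(failed.group("code"))))
    return findings
```

Running scan_log over the server log after startup gives a short list that can be attached to a bug report, instead of relying on someone noticing the lines as they scroll past.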

Requests:

4. oneAPI environment selection is easy to get wrong

Using /opt/intel/oneapi/setvars.sh selected a newer compiler stack that caused build trouble. Directly sourcing /opt/intel/oneapi/compiler/2025.3/env/vars.sh worked.
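A quick check of which compiler is actually on PATH can catch this before a long build starts. The sketch below assumes the compiler driver is named icpx and that the version appears as a directory component of its install path, as in the path above. Both are assumptions about the local layout, and the helper names are hypothetical.

```python
import os
import shutil
from pathlib import Path


def compiler_matches(compiler_path, expected_version):
    """True if expected_version is a directory component of compiler_path."""
    if compiler_path is None:
        return False
    return expected_version in Path(compiler_path).parts


def active_compiler(binary="icpx"):
    """Resolve the compiler found on PATH, following symlinks such as 'latest'."""
    found = shutil.which(binary)
    return os.path.realpath(found) if found else None
```

After sourcing an environment script, compiler_matches(active_compiler(), "2025.3") returning False indicates that a different compiler stack was selected than the one that worked here.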

Requests:

5. The “unknown vLLM environment variable” warnings hide useful integration flags

Patched/local vLLM paths used several VLLM_* flags, while vLLM’s generic environment scanner warned they were unknown.
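To see which flags fall into this gap, the set VLLM_* variables can be listed and compared against whatever the installed vLLM recognizes. In the sketch below, known_names is supplied by the caller. Taking it from vLLM's own environment-variable registry is one option, but that is left as an assumption rather than coded here.

```python
def unrecognized_vllm_flags(environ, known_names):
    """List VLLM_* variables that are set but absent from known_names.

    environ is any mapping of variable names to values, such as os.environ.
    known_names is the caller's set of recognized variable names.
    """
    return sorted(
        name
        for name in environ
        if name.startswith("VLLM_") and name not in known_names
    )
```

Logging this list at startup makes it explicit which flags are consumed only by patched code paths, so a warning about an unknown variable can be told apart from a genuine typo.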

Impact:

Requests:

6. Quality validation needs to be first-class

Performance work on MoE, graph capture, custom allreduce, and logits paths can silently change output quality.
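A simple gate of this kind compares outputs from the optimized configuration against a saved baseline on a fixed prompt set. The sketch below uses exact string match with a pass threshold. Both the metric and the 0.95 default are illustrative assumptions, not the gate used for this deployment.

```python
def exact_match_rate(baseline, candidate):
    """Fraction of positions where the candidate output equals the baseline."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must have the same length")
    if not baseline:
        return 1.0
    matches = sum(1 for b, c in zip(baseline, candidate) if b == c)
    return matches / len(baseline)


def quality_gate(baseline, candidate, min_rate=0.95):
    """Return (passed, rate); min_rate is a hypothetical threshold."""
    rate = exact_match_rate(baseline, candidate)
    return rate >= min_rate, rate
```

Exact match is only meaningful with deterministic decoding, for example greedy sampling with a fixed prompt set. A task-level score may be a better metric when small numerical differences are expected.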

Requests:

Suggestions For A Streamlined Intel-Supported Path

An ideal user path would be:

sudo apt install intel-gpu-ai-runtime-b70-vllm
python -m venv .venv
. .venv/bin/activate
pip install torch-xpu-b70 vllm-xpu-b70 llm-scaler-minimax-kernels
intel-ai doctor --model minimax-m27-int4 --gpus 4
intel-ai serve-minimax --model /models/minimax --host 0.0.0.0 --port 8000

The intel-ai doctor command should verify:

It should also explain how much disk, RAM, and swap are needed.
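The resource part of such a check could be as simple as comparing requirements against what the host reports. The sketch below is hypothetical and is not an implementation of intel-ai doctor, which does not exist today. The resource names and any required amounts are supplied by the caller.

```python
import shutil


def disk_free_gib(path):
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


def shortfalls(required_gib, available_gib):
    """Return {resource: missing GiB} for each requirement that is not met."""
    missing = {}
    for name, need in required_gib.items():
        have = available_gib.get(name, 0.0)
        if have < need:
            missing[name] = need - have
    return missing
```

An empty result means every stated requirement is met. Otherwise each entry says how much more of that resource is needed, which is the kind of explanation a user can act on.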

Concrete Bugs To Investigate

What To Preserve For Repro Reports

When reporting issues, capture: