b70-optimization-lab

MiniMax XPU CPU KV Offload Research Lane

Date started: 2026-05-24

This folder tracks the experimental path toward serving MiniMax M2.7 on Intel Arc Pro B70 at context lengths beyond the current high-performance 32768-token endpoint by spilling KV cache to host RAM.

Execution plan:

../../plans/2026-05-24-minimax-xpu-kv-offload-plan.md

Quick reproduction guide:

REPRODUCE.md

Tracked artifact index:

ARTIFACTS.md

The stable production lane remains the 32768-token FP16-family KV endpoint.

Do not replace that lane with this work until correctness, stability, and quality are proven.

Why This Matters

MiniMax advertises 196608 max position embeddings. The current B70 endpoint serves a reliable 32768 tokens because the FP16-family KV cache must fit in GPU memory. CPU KV offload would let the server keep less-active KV blocks in system RAM and page them back as needed.

Useful targets:

| Target | Purpose |
| --- | --- |
| 32768, c1 | Current fast baseline; must remain easy to restore. |
| 65536, c1 | First large-context milestone. |
| 131072, c1 | Prove CPU KV offload is genuinely useful. |
| 196608, c1 | Full MiniMax advertised context. |
| 196608, c2-c4 | Long-context concurrency research, likely slow but valuable. |

Expected performance with CPU KV offload is much lower than full-VRAM decode. That is acceptable for this lane if it enables otherwise impossible sessions and does not degrade model quality.

Quality Rules

This lane may experiment with memory movement, cache layout, TurboQuant, and runtime scheduling. It must not silently lower answer quality.

Promotion requires:

FP8 KV, TurboQuant, or other compressed KV modes must be labeled as compressed KV experiments and compared against the FP16-family baseline.

2026-05-24 Experiment Summary

All experiments were temporary. The normal 32768 server was restored.

Attempt 1: CPU Weight Offload

Command shape:

VLLM_MAX_MODEL_LEN=196608 /home/steve/bin/minimax-vllm-serve \
  --max-num-seqs 4 \
  --cpu-offload-gb 16

Result:

AssertionError: CPU tensor must be pinned

This is model-weight offload, not KV offload. It failed during model load in vLLM’s UVA offloader before any useful long-context test could run.

Log:

/mnt/fast-ai/bench-results/minimax-m27-b70-serve/serve-196k-c4-cpuoffload16-20260524T215219Z.log

Attempt 2: No CPU Offload

Command shape:

VLLM_MAX_MODEL_LEN=196608 /home/steve/bin/minimax-vllm-serve \
  --max-num-seqs 4

Result:

To serve at least one request with max seq len 196608,
11.62 GiB KV cache is needed, larger than available KV cache memory 1.56 GiB.
Based on the available memory, the estimated maximum model length is 26368.

Log:

/mnt/fast-ai/bench-results/minimax-m27-b70-serve/serve-196k-c4-nooffload-20260524T215257Z.log
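The figures in that message can be cross-checked with simple arithmetic. The sketch below assumes KV memory grows linearly with sequence length and uses only the rounded GiB values quoted above; the helper name is hypothetical. The linear estimate lands slightly above the logged 26368, which would be consistent with vLLM rounding down to a whole number of KV blocks (an assumption, not confirmed from the log).

```python
# Cross-check of the preflight message using the rounded figures from the log.
# Assumption: KV memory scales linearly with the number of tokens.

def estimate_max_model_len(needed_gib: float, needed_tokens: int,
                           available_gib: float) -> float:
    """Linear estimate of how many tokens fit in the available KV memory."""
    gib_per_token = needed_gib / needed_tokens
    return available_gib / gib_per_token


needed_gib = 11.62     # KV cache needed for one 196608-token request
available_gib = 1.56   # KV cache memory reported as available
max_len = 196608

estimate = estimate_max_model_len(needed_gib, max_len, available_gib)
kib_per_token = needed_gib * 1024 * 1024 / max_len

print(f"KV per token: about {kib_per_token:.1f} KiB")
print(f"Linear estimate: about {estimate:.0f} tokens (log reports 26368)")
```

Because the inputs are rounded to two decimals, the estimate is only good to within a few dozen tokens.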

Attempt 3: Native CPU KV Offload Flag

Command shape:

VLLM_MAX_MODEL_LEN=196608 /home/steve/bin/minimax-vllm-serve \
  --max-num-seqs 4 \
  --kv-offloading-size 64

Result:

vLLM accepted the flag, but the KV preflight check still counted only GPU KV capacity and rejected the run before the offload connector could initialize.

Log:

/mnt/fast-ai/bench-results/minimax-m27-b70-serve/serve-196k-c4-kvoffload64-20260524T221520Z.log

Attempt 4: Temporary Admission-Check Patch

A local patch added the per-worker CPU KV budget to the preflight capacity calculation. This got past the prior GPU-only KV check.

The next blocker:

Exception: CPU Offloading is currently only supported on CUDA-alike GPUs

The native CPU KV offload path then tried to initialize OffloadingConnector / CPUOffloadingSpec, but rejected XPU explicitly.

Log:

/mnt/fast-ai/bench-results/minimax-m27-b70-serve/serve-196k-c4-kvoffload64-admissionpatch-20260524T223029Z.log

Patch sketch:

patches/kv-offload-admission-check-xpu-experiment-20260524.patch

Current Root Cause

vLLM’s native CPU KV offload implementation is CUDA-oriented. The XPU run gets past the scheduler-side configuration only after an admission-check patch, then fails in the worker-side offload handler because the CPU KV path uses CUDA concepts such as CUDA streams/events and cudaHostRegister.

The guard is in:

vllm/v1/kv_offload/cpu/spec.py

The CUDA-specific worker code is in:

vllm/v1/kv_offload/cpu/gpu_worker.py

Candidate Work Plan

  1. Keep the 32768 FP16-family KV endpoint as the stable fallback.
  2. Create an XPU implementation parallel to the CUDA CPU KV worker rather than weakening CUDA assumptions in place.
  3. Replace CUDA streams/events with XPU stream/event equivalents if available in the installed PyTorch XPU stack.
  4. Replace cudaHostRegister with a Level Zero / SYCL / PyTorch XPU pinned host-memory path. torch.empty(..., pin_memory=True) and .pin_memory() already work locally in torch 2.11.0+xpu.
  5. Start with a context only modestly over GPU capacity, not the full 196608: 49152 or 65536, c1.
  6. Measure decode with long prompt plus small output first, then short prompt plus long output, then concurrency.
  7. Only after c1 works, test c2/c4.

Phase 1 Probe Result

Initial PyTorch XPU primitive probe passed on 2026-05-24 while the normal 32K server was running:

Artifacts:

Decision: proceed to an XPU CPU KV worker prototype rather than another launch-flag experiment.

Phase 2 Block Copy Probe Result

Initial KV-shaped block-copy probe passed on 2026-05-24:

Artifacts:

Decision: prototype an XPU CPU KV worker around range coalescing plus fast slice copies, with the loop path as a correctness fallback for fragmented transfers.
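The range-coalescing idea can be sketched in plain Python. This is an illustration only, not the prototype worker: lists stand in for KV block storage, and `coalesce_pairs`, `copy_blocks`, and `min_run` are hypothetical names. Consecutive (source, destination) block pairs are merged into runs; long runs use one slice copy, and isolated blocks fall back to a per-block loop.

```python
from typing import Iterable, List, Tuple


def coalesce_pairs(pairs: Iterable[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Merge (src, dst) block pairs into (src_start, dst_start, length) runs."""
    runs: List[Tuple[int, int, int]] = []
    for src, dst in pairs:
        if runs:
            s0, d0, n = runs[-1]
            # Extend the current run only if both sides advance by one block.
            if src == s0 + n and dst == d0 + n:
                runs[-1] = (s0, d0, n + 1)
                continue
        runs.append((src, dst, 1))
    return runs


def copy_blocks(src_buf: list, dst_buf: list,
                pairs: Iterable[Tuple[int, int]],
                min_run: int = 2) -> Tuple[int, int]:
    """Copy blocks and return (slice_copies, loop_copies) for inspection."""
    slice_copies = loop_copies = 0
    for s0, d0, n in coalesce_pairs(pairs):
        if n >= min_run:
            dst_buf[d0:d0 + n] = src_buf[s0:s0 + n]  # one bulk copy per run
            slice_copies += 1
        else:
            for i in range(n):                        # fragmented fallback
                dst_buf[d0 + i] = src_buf[s0 + i]
                loop_copies += 1
    return slice_copies, loop_copies


src = list(range(100, 116))   # 16 fake "blocks"
dst = [None] * 16
pairs = [(0, 10), (1, 11), (2, 12), (7, 3), (9, 4)]
runs = coalesce_pairs(pairs)
counts = copy_blocks(src, dst, pairs)
print("runs:", runs)
print("slice copies, loop copies:", counts)
```

On real tensors the slice branch would become one contiguous device-host copy, which is where the fast path gets its advantage over per-block transfers.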

Phase 3 XPU Worker Live Server Result

Status on 2026-05-25: partially working, not production-ready.

Prototype patch:

Detailed notes:

What now works:

Measured live transfer example at about 34500 prompt tokens:

| Direction | Bytes | Time | Effective rate |
| --- | --- | --- | --- |
| GPU -> CPU | 8.45152256 GB | 0.779795848 s | about 10.8 GB/s |
| CPU -> GPU | 8.321499136 GB | 0.508610908 s | about 16.4 GB/s |
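The effective rates follow directly from the byte counts and timings. A minimal check, assuming GB here means 10^9 bytes:

```python
# Recompute the effective transfer rates from the measured bytes and times.
# Assumption: the table's GB values are decimal gigabytes (1e9 bytes).

transfers = {
    "GPU -> CPU": (8.45152256e9, 0.779795848),
    "CPU -> GPU": (8.321499136e9, 0.508610908),
}


def effective_rate_gb_s(num_bytes: float, seconds: float) -> float:
    """Bytes moved per second, expressed in GB/s."""
    return num_bytes / seconds / 1e9


rates = {name: effective_rate_gb_s(b, s) for name, (b, s) in transfers.items()}
round_trip_s = sum(seconds for _, seconds in transfers.values())

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} GB/s")
print(f"park + reload time: {round_trip_s:.3f} s")
```

The summed time is a rough lower bound on the cost of parking and then reloading one session of this size; it ignores scheduling and any overlap with compute.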

Current blocker:

Do not use this lane as the default server yet. The stable 32768 endpoint is still the recommended path for real use.

Phase 4 Active Context Limit Finding

Follow-up validation on 2026-05-25 refined the blocker:

Detailed note:

notes-20260525-phase4-active-context-limit.md

Interpretation: this prototype is useful groundwork for XPU CPU KV movement and may support exact-quality session swapping for contexts that individually fit in GPU KV. It is not yet true active-context overflow. Full 196608 active exact context needs CPU-paged attention, host-readable KV kernels, or quality-gated KV compression.

Phase 5 C2 Session-Swap Smoke

Follow-up validation on 2026-05-25 found the first practical RAM-backed use case:

Detailed note:

notes-20260525-phase5-session-swap-smoke.md

Interpretation: CPU KV offload can already act as an exact session cache for contexts that fit individually in GPU KV. This is a useful community-facing capability, but it is separate from full active-context overflow.

Phase 6 Session-Cache Canaries And C4 Ladder

Follow-up validation on 2026-05-25 added a reusable OpenAI endpoint canary script and tested longer reload decode:

Important caveat: longer free-form concurrent completions do not produce exact text-hash matches across passes. One-token c2 matched for A/B. One-token c4 matched for A/B/D but not C. Treat CPU KV session caching as mechanically working and promising, but do not promote it as production quality-equivalent until a stricter deterministic canary and semantic quality gate pass.
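An exact text-hash comparison of this kind can be sketched as follows. This is not the canary script itself; the function names are hypothetical and the completions are placeholder strings, not measured output.

```python
import hashlib
from typing import Dict


def text_hash(text: str) -> str:
    """SHA-256 of the completion text, used as an exact-match fingerprint."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def compare_passes(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, bool]:
    """Map each session label to True when both passes returned identical text."""
    return {
        label: label in second and text_hash(text) == text_hash(second[label])
        for label, text in first.items()
    }


# Placeholder completions for four sessions (illustrative only).
first_pass = {"A": "amber", "B": "birch", "C": "cedar", "D": "delta"}
second_pass = {"A": "amber", "B": "birch", "C": "cedar.", "D": "delta"}

report = compare_passes(first_pass, second_pass)
mismatched = sorted(label for label, ok in report.items() if not ok)
print("exact-match report:", report)
print("sessions needing review:", mismatched)
```

A hash mismatch only says the text differs; it cannot distinguish benign sampling divergence from a corrupted KV reload, which is why a semantic quality gate is still needed alongside it.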

Strict-word follow-up:

Detailed note:

notes-20260525-phase6-session-cache-canaries.md

C2 capacity ladder follow-up:

Detailed note:

notes-20260525-c2-session-cache-ladder.md

C2 quality and sustained-decode follow-up:

Detailed note:

notes-20260525-c2-quality-and-turboquant.md

C4/C8 session-cache ladder follow-up:

Detailed note:

notes-20260525-c4-c8-session-cache-ladder.md

Operational note:

notes-20260525-session-cache-operations.md

Practical mental model: clients keep and resend their full chat history. vLLM recognizes repeated token prefixes and can reload parked prefix KV from CPU RAM. There is no separate server-side session ID; exact prefix stability is what lets cache reuse work.
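The following is a simplified model of why exact prefix stability matters. It assumes block-granular caching in which each block's hash chains in the hash of the block before it; the block size of 4 and all names are illustrative, not vLLM's actual values or API.

```python
import hashlib
from typing import List, Sequence, Set

BLOCK_SIZE = 4  # illustrative only; real block sizes are much larger


def block_hashes(token_ids: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with the hash of its predecessor."""
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def reusable_prefix_tokens(cached: Set[str], token_ids: Sequence[int],
                           block_size: int = BLOCK_SIZE) -> int:
    """Count leading tokens whose blocks are all present in the cache."""
    matched = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cached:
            break
        matched += 1
    return matched * block_size


turn1 = list(range(10))                   # first request: 10 tokens
cache = set(block_hashes(turn1))          # two full blocks get cached

turn2 = turn1 + [50, 51, 52]              # same history plus a new message
edited = [99] + turn1[1:] + [50, 51, 52]  # history changed at the first token

reused_same = reusable_prefix_tokens(cache, turn2)
reused_edited = reusable_prefix_tokens(cache, edited)
print("reusable tokens, unchanged history:", reused_same)
print("reusable tokens, edited history:", reused_edited)
```

Because each hash depends on everything before it, any change early in the resent history (a rewritten system prompt, different whitespace, a reordered message) invalidates reuse for every later block.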

Live c4 operations caveat: after adding the profile switcher, c4 started and reported the expected 34304 GPU KV tokens, but an operational smoke hit a second-pass waiting/deferred stall and a rerun hit UR_RESULT_ERROR_DEVICE_LOST while copying vLLM block-table state to GPU. Keep c1 as production and use c2 as the safer correctness lane until c4 is debugged. The same switcher successfully ran a smaller c2 operations smoke with two concurrent 22540-token fact-word sessions; both matched exact output hashes across passes, with second-pass reload TTFT of 0.320-0.570 s. Do not present that smoke as the c2 context limit; the c2 target remains two 32768-token request windows with practical output headroom.

Sustained concurrent decode follow-up:

Detailed note:

notes-20260525-sustained-concurrency-decode.md

Additional deployment observation: after the B70s were no longer used for the Ubuntu display, experimental c2/c4 launches could report 34304 GPU KV tokens instead of the earlier 26112 c2 result. Display ownership and compile/cache shape can materially affect available KV budget.

TurboQuant Interaction

TurboQuant remains interesting because it reduces KV footprint and therefore reduces both RAM capacity pressure and PCIe transfer volume.

Current TurboQuant status after the 2026-05-25 workspace fallback experiment:

Interpretation: TurboQuant is now mechanically past the first XPU workspace blockers and can raise the live GPU KV ceiling, but it is not a production replacement. Capacity improved substantially; decode speed, stability, and quality still need work. Most importantly, TurboQuant plus CPU KV offload still does not provide true active-context overflow: the active request must fit in live GPU KV blocks.

Relevant repro:

scripts/repro-minimax-turboquant-xpu-workspace-bug.sh

Detailed note:

notes-20260525-c2-quality-and-turboquant.md

Active-boundary note:

notes-20260525-turboquant-active-context-boundary.md

CPU-Paged Attention Path

The next full-context path is CPU-paged attention, not a larger --kv-offloading-size setting. Current CPU KV offload can park/reload sessions, but XPU FlashAttention still needs the active request’s KV blocks in live GPU memory.

The proposed exact path is:

  1. Keep recent/current KV in normal GPU KV blocks.
  2. Keep older logical KV blocks in CPU offload storage.
  3. Stage old CPU-resident KV chunks into a small GPU scratch workspace.
  4. Run FlashAttention over each staged chunk with softmax LSE returned.
  5. Merge partial attention outputs using vLLM’s existing merge_attn_states() log-sum-exp merge.
  6. Merge that old-context result with normal attention over the live GPU suffix.

This mirrors two existing vLLM patterns:

New artifacts:

Standalone split-attention math probe results:

| Shape | Chunks | Max output abs error | Max LSE abs error | Result |
| --- | --- | --- | --- | --- |
| 4096 KV tokens, 4 queries, 8 heads | 8 | 6.71e-08 | 9.54e-07 | pass |
| 5000 KV tokens, 7 queries, 8 heads | 7 | 8.94e-08 | 1.91e-06 | pass |

Interpretation: the core split-and-merge softmax math is sound. The remaining work is vLLM integration: logical-vs-physical KV accounting, CPU block range queries, GPU staging workspace, temporary block tables, and XPU FlashAttention calls that return LSE for merging.
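The split-and-merge math can be reproduced in a few lines of pure Python. This is a sketch for a single query and a single head in float64, not the probe script and not vLLM's `merge_attn_states()` implementation; all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def attend(q: Vector, keys: Sequence[Vector],
           values: Sequence[Vector]) -> Tuple[Vector, float]:
    """Softmax attention for one query; returns (output, log-sum-exp of scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) / total
           for i in range(dim)]
    return out, m + math.log(total)


def merge_states(out_a: Vector, lse_a: float,
                 out_b: Vector, lse_b: float) -> Tuple[Vector, float]:
    """Combine two partial results computed over disjoint KV chunks."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    total = wa + wb
    merged = [(wa * a + wb * b) / total for a, b in zip(out_a, out_b)]
    return merged, m + math.log(total)


def chunked_attend(q: Vector, keys: Sequence[Vector], values: Sequence[Vector],
                   chunk_size: int) -> Tuple[Vector, float]:
    """Attend chunk by chunk, merging each partial state into the running one."""
    out, lse = attend(q, keys[:chunk_size], values[:chunk_size])
    for start in range(chunk_size, len(keys), chunk_size):
        stop = start + chunk_size
        part_out, part_lse = attend(q, keys[start:stop], values[start:stop])
        out, lse = merge_states(out, lse, part_out, part_lse)
    return out, lse


rng = random.Random(0)
dim, n_kv = 8, 50
q = [rng.gauss(0, 1) for _ in range(dim)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_kv)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_kv)]

full_out, full_lse = attend(q, keys, values)
split_out, split_lse = chunked_attend(q, keys, values, chunk_size=7)
max_out_err = max(abs(a - b) for a, b in zip(full_out, split_out))
lse_err = abs(full_lse - split_lse)
print(f"max output abs error: {max_out_err:.2e}, LSE abs error: {lse_err:.2e}")
```

The merge only needs each chunk's output and its LSE, which is why the integration work depends on XPU FlashAttention calls that return LSE.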

Stage A vLLM integration attempt:

Dense-Scratch CPU-Staged Attention

Follow-up probes on 2026-05-25 narrowed the active-overflow path:

Decision: stop pursuing paged scratch merging for true active overflow. The next vLLM prototype should use dense scratch chunks for both CPU-resident old KV and GPU-resident suffix KV, then merge dense attention states.

Detailed note:

notes-20260525-dense-staged-cpu-attention.md

Current 196K / Multi-Session Conclusion

The requested end goal is useful: four or more concurrent sessions, ideally with the full 196608 token MiniMax context, backed by system RAM when needed.

The current stack cannot do that yet.

The best active capacity observed in the TurboQuant 4-bit NC lane was 98304 tokens. One full MiniMax context is 196608 tokens, so a single full active context needs 2x the best live GPU KV capacity observed. Four full active contexts total 786432 tokens, 8x that live capacity.

CPU KV offload is still useful, but today it is session caching: it stores and reloads KV for sessions whose active working set fits in GPU KV. It does not make XPU attention read arbitrary old KV blocks directly from host RAM.

Production should stay on the 32768 FP16-family KV endpoint. The next R&D step is dense-scratch CPU-staged attention, not just larger --kv-offloading-size values or higher max_model_len. Paged scratch merging is currently blocked by unreliable XPU paged-attention LSE values.

Restore note: a fatal near-limit 196K TurboQuant run left orphan VLLM::Worker_TP* processes holding XPU memory. If the normal server fails on restart with near-zero free XPU memory, kill the orphan workers and stale multiprocessing.resource_tracker, then remove stale /dev/shm/psm_* and /dev/shm/sem.mp-* files. Details are in the active-boundary note.
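Before deleting anything, it can help to list what would be removed. The sketch below is a dry run only: it matches the two filename patterns named above and deletes nothing. The function name is hypothetical, and it does not check whether a live server still owns a file, so confirm no vLLM process is running first.

```python
import glob
import os
from typing import List

# Filename patterns taken from the restore note above.
STALE_PATTERNS = ("psm_*", "sem.mp-*")


def find_stale_shm_files(shm_dir: str = "/dev/shm") -> List[str]:
    """List files matching the stale patterns without deleting anything."""
    matches: List[str] = []
    for pattern in STALE_PATTERNS:
        matches.extend(glob.glob(os.path.join(shm_dir, pattern)))
    return sorted(matches)


for path in find_stale_shm_files():
    print("candidate for removal:", path)
```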

Open Questions

Stable Restore Command

If an experiment leaves the server down, restore the stable endpoint with:

pkill -f 'vllm serve' || true
VLLM_MAX_MODEL_LEN=32768 /home/steve/bin/minimax-vllm-serve

Expected /v1/models:

{
  "id": "/mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround",
  "max_model_len": 32768
}
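A small check for that response could look like the sketch below. It assumes the endpoint returns the usual OpenAI-style list payload with model entries under `data`, and also accepts a bare entry like the one shown above; the function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional

EXPECTED_ID = "/mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround"
EXPECTED_MAX_LEN = 32768


def fetch_models(base_url: str) -> dict:
    """GET <base_url>/v1/models and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return json.load(resp)


def find_model(payload: dict, model_id: str) -> Optional[dict]:
    """Find a model entry in a list-style payload or in a bare entry."""
    data = payload.get("data")
    entries = data if isinstance(data, list) else [payload]
    for entry in entries:
        if entry.get("id") == model_id:
            return entry
    return None


def is_stable_endpoint(payload: dict) -> bool:
    """True when the expected model is served with the stable context length."""
    entry = find_model(payload, EXPECTED_ID)
    return entry is not None and entry.get("max_model_len") == EXPECTED_MAX_LEN


# Offline example payload; a live check would pass fetch_models(...) instead.
sample = {"object": "list",
          "data": [{"id": EXPECTED_ID, "max_model_len": 32768}]}
print("stable endpoint restored:", is_stable_endpoint(sample))
```

Against a running server this would be `is_stable_endpoint(fetch_models(base_url))`, where `base_url` is whatever host and port the serve wrapper binds; that address is not stated in this document.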