Date started: 2026-05-24
This folder tracks the experimental path toward serving MiniMax M2.7 on Intel
Arc Pro B70 with context beyond the current high-performance 32768 token
endpoint by spilling KV cache to host RAM.
Execution plan:
../../plans/2026-05-24-minimax-xpu-kv-offload-plan.md
Quick reproduction guide:
REPRODUCE.md
Tracked artifact index:
ARTIFACTS.md
The stable production lane remains:
- Model: Lasimeri/MiniMax-M2.7-int4-AutoRound
- Context: 32768
- KV cache: auto / FP16-family
- Endpoint: 0.0.0.0:8000
- Speed: 84-95 tok/s, depending on warm state and the exact benchmark shape

Do not replace that lane with this work until correctness, stability, and quality are proven.
MiniMax advertises 196608 max position embeddings. The current B70 endpoint
serves a reliable 32768 tokens because the FP16-family KV cache must fit in
GPU memory. CPU KV offload would let the server keep less-active KV blocks in
system RAM and page them back as needed.
Useful targets:
| Target | Purpose |
|---|---|
| 32768, c1 | Current fast baseline; must remain easy to restore. |
| 65536, c1 | First large-context milestone. |
| 131072, c1 | Prove CPU KV offload is genuinely useful. |
| 196608, c1 | Full MiniMax advertised context. |
| 196608, c2-c4 | Long-context concurrency research, likely slow but valuable. |
Expected performance with CPU KV offload is much lower than full-VRAM decode. That is acceptable for this lane if it enables otherwise impossible sessions and does not degrade model quality.
This lane may experiment with memory movement, cache layout, TurboQuant, and runtime scheduling. It must not silently lower answer quality.
Promotion requires:
FP8 KV, TurboQuant, or other compressed KV modes must be labeled as compressed KV experiments and compared against the FP16-family baseline.
All experiments were temporary. The normal 32768 server was restored.
Command shape:
VLLM_MAX_MODEL_LEN=196608 /home/steve/bin/minimax-vllm-serve \
--max-num-seqs 4 \
--cpu-offload-gb 16
Result:
AssertionError: CPU tensor must be pinned
This is model-weight offload, not KV offload. It failed during model load in vLLM’s UVA offloader before any useful long-context test could run.
Log:
/mnt/fast-ai/bench-results/minimax-m27-b70-serve/serve-196k-c4-cpuoffload16-20260524T215219Z.log
Command shape:
VLLM_MAX_MODEL_LEN=196608 /home/steve/bin/minimax-vllm-serve \
--max-num-seqs 4
Result:
To serve at least one request with max seq len 196608,
11.62 GiB KV cache is needed, larger than available KV cache memory 1.56 GiB.
Based on the available memory, the estimated maximum model length is 26368.
Log:
/mnt/fast-ai/bench-results/minimax-m27-b70-serve/serve-196k-c4-nooffload-20260524T215257Z.log
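The numbers in that rejection message are consistent with simple proportional scaling. The sketch below is illustrative only: it assumes KV memory grows linearly with token count and that the estimate is rounded down to whole 256-token blocks (the block size reported in the later phase notes). The function name and the rounding rule are assumptions, not confirmed vLLM internals.

```python
def estimate_max_model_len(needed_gib, needed_tokens, available_gib, block_size=256):
    """Scale the requested token count by available KV memory, rounded down to whole blocks."""
    gib_per_token = needed_gib / needed_tokens
    raw_tokens = available_gib / gib_per_token
    # Assumption: capacity is counted in whole KV blocks.
    return int(raw_tokens // block_size) * block_size


# Values copied from the log message above.
print(estimate_max_model_len(11.62, 196608, 1.56))
```

Under these assumptions the result lands on the 26368 figure that vLLM printed, which suggests the GPU-only budget is the whole story for this run.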
Command shape:
VLLM_MAX_MODEL_LEN=196608 /home/steve/bin/minimax-vllm-serve \
--max-num-seqs 4 \
--kv-offloading-size 64
Result:
vLLM accepted the flag but the KV preflight check still counted only GPU KV capacity and rejected the run before the offload connector could initialize.
Log:
/mnt/fast-ai/bench-results/minimax-m27-b70-serve/serve-196k-c4-kvoffload64-20260524T221520Z.log
A local patch added the per-worker CPU KV budget to the preflight capacity calculation. This got past the prior GPU-only KV check.
The next blocker:
Exception: CPU Offloading is currently only supported on CUDA-alike GPUs
The native CPU KV offload path then tried to initialize
OffloadingConnector / CPUOffloadingSpec, but rejected XPU explicitly.
Log:
/mnt/fast-ai/bench-results/minimax-m27-b70-serve/serve-196k-c4-kvoffload64-admissionpatch-20260524T223029Z.log
Patch sketch:
patches/kv-offload-admission-check-xpu-experiment-20260524.patch
vLLM’s native CPU KV offload implementation is CUDA-oriented. The XPU run gets past the scheduler-side configuration only after an admission-check patch, then fails in the worker-side offload handler because the CPU KV path uses CUDA concepts:
- torch.cuda.Stream
- torch.cuda.current_stream
- cudaHostRegister

The guard is in:
vllm/v1/kv_offload/cpu/spec.py
The CUDA-specific worker code is in:
vllm/v1/kv_offload/cpu/gpu_worker.py
- 32768 FP16-family KV endpoint as the stable fallback.
- cudaHostRegister with a Level Zero / SYCL / PyTorch XPU pinned host-memory path. torch.empty(..., pin_memory=True) and .pin_memory() already work locally in torch 2.11.0+xpu.
- 196608: 49152 or 65536, c1.

Initial PyTorch XPU primitive probe passed on 2026-05-24 while the normal 32K server was running:
- torch.xpu.Stream: available
- torch.xpu.Event: available
- 64 MiB
- 25-28 GB/s

Artifacts:
- probes/xpu_stream_copy_probe.py
- xpu_stream_copy_probe_20260524.json
- notes-20260524-phase1-probe.md

Decision: proceed to an XPU CPU KV worker prototype rather than another launch-flag experiment.
Initial KV-shaped block-copy probe passed on 2026-05-24:
- 2.1-2.4 GB/s
- index_copy_ requires same-device source/destination
- 28 GB/s for 64-256 MiB one-way timed regions

Artifacts:
- probes/xpu_kv_block_copy_probe.py
- xpu_kv_block_copy_probe_20260524.json
- xpu_kv_block_copy_probe_20260524-indexed-fail.json
- xpu_kv_block_copy_probe_20260524-slice.json
- notes-20260524-phase2-block-copy-probe.md

Decision: prototype an XPU CPU KV worker around range coalescing plus fast slice copies, with the loop path as a correctness fallback for fragmented transfers.
Status on 2026-05-25: partially working, not production-ready.
Prototype patch:
patches/xpu-cpu-kv-worker-prototype-20260525.patch
Detailed notes:
notes-20260525-phase3-xpu-worker-live-server.md
What now works:
- CPUOffloadingSpec on XPU with a new XPU worker.
- 49152 context starts and /v1/models reports max_model_len=49152.
- 33792 tokens, rather than being accidentally inflated by CPU offload admission bytes.
- 64 output tokens in 0.903 s (70.88 tok/s wall output, 84.17 tok/s total).

Measured live transfer example at about 34500 prompt tokens:
| Direction | Bytes | Time | Effective rate |
|---|---|---|---|
| GPU -> CPU | 8.45152256 GB | 0.779795848 s | about 10.8 GB/s |
| CPU -> GPU | 8.321499136 GB | 0.508610908 s | about 16.4 GB/s |
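The effective rates in the table are just bytes divided by wall time. A minimal sketch that reproduces them, assuming the GB figures are decimal gigabytes as written (the helper name is hypothetical):

```python
def effective_rate_gb_s(gigabytes, seconds):
    """Effective transfer rate in GB/s for one timed copy."""
    return gigabytes / seconds


# Byte counts and timings copied from the table above.
transfers = {
    "GPU -> CPU": (8.45152256, 0.779795848),
    "CPU -> GPU": (8.321499136, 0.508610908),
}
for direction, (gb, seconds) in transfers.items():
    print(f"{direction}: {effective_rate_gb_s(gb, seconds):.1f} GB/s")
```

Both directions sit well below the 25-28 GB/s seen in the raw pinned-copy probe, which is consistent with the live path moving many KV blocks rather than one contiguous buffer.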
Current blocker:
- 49152 server starts, but prompts that require the last GPU KV block or more can park in deferred instead of generating.

Do not use this lane as the default server yet. The stable 32768 endpoint is still the recommended path for real use.
Follow-up validation on 2026-05-25 refined the blocker:
- 49152/c1 with --kv-offloading-size 16 reported 33792 GPU KV tokens, or 132 blocks at block size 256.
- 33350 token prompt (131 blocks) plus one output token completed with the CPU KV connector present.
- 33580 token prompt (132 blocks) plus one output token timed out after 300 s and parked with 131/132 GPU KV blocks occupied.
- 34500 token prompt had already shown real multi-GB CPU KV store/load, but still did not return a completion.

Detailed note:
notes-20260525-phase4-active-context-limit.md
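The block counts quoted above follow from ceiling division by the block size. The sketch below reproduces them and encodes the observed boundary (a request that needs the last GPU KV block parks). That predicate is an empirical rule from these runs, not a documented vLLM guarantee, and the helper names are hypothetical.

```python
import math

BLOCK_SIZE = 256
GPU_KV_TOKENS = 33792


def blocks_needed(prompt_tokens, output_tokens=1, block_size=BLOCK_SIZE):
    """KV blocks required to hold the prompt plus the requested output tokens."""
    return math.ceil((prompt_tokens + output_tokens) / block_size)


total_blocks = GPU_KV_TOKENS // BLOCK_SIZE
for prompt in (33350, 33580, 34500):
    need = blocks_needed(prompt)
    # Observed in these runs: needing the final block is already enough to park.
    status = "completes" if need < total_blocks else "parks"
    print(f"{prompt} tokens -> {need}/{total_blocks} blocks: {status}")
```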
Interpretation: this prototype is useful groundwork for XPU CPU KV movement
and may support exact-quality session swapping for contexts that individually
fit in GPU KV. It is not yet true active-context overflow. Full 196608
active exact context needs CPU-paged attention, host-readable KV kernels, or
quality-gated KV compression.
Follow-up validation on 2026-05-25 found the first practical RAM-backed use case:
- 32768/c2 with --kv-offloading-size 16 started successfully.
- 26112 tokens because of extra runtime and graph memory.
- 14000-token prompts completed even though their combined prompt length was 28000 tokens.
- 4.29 GB from GPU to CPU at roughly 10-11 GB/s.
- 7.02 GB in 0.467 s, roughly 15.0 GB/s.
- 49.4%.

Detailed note:
notes-20260525-phase5-session-swap-smoke.md
Interpretation: CPU KV offload can already act as an exact session cache for contexts that fit individually in GPU KV. This is a useful community-facing capability, but it is separate from full active-context overflow.
Follow-up validation on 2026-05-25 added a reusable OpenAI endpoint canary script and tested longer reload decode:
- scripts/session_cache_canary.py.
- 16134-token prompts and 128 generated tokens measured about 74 tok/s after TTFT for each sequential request.
- 2.785 s per request with about 60 tok/s after TTFT.
- 13.9 GB/s.
- 9234-token prompts worked after an expensive first compile. The second pass returned in 1.6-2.6 s per request, with about 52-79 tok/s after TTFT.
- 15.8 GB/s.
- ocloc / IGC internal compiler error (error code 245, floating point exception) before fallback compilation completed.

Important caveat: longer free-form concurrent completions do not produce exact text-hash matches across passes. One-token c2 matched for A/B. One-token c4 matched for A/B/D but not C. Treat CPU KV session caching as mechanically working and promising, but do not promote it as production quality-equivalent until a stricter deterministic canary and semantic quality gate pass.
Strict-word follow-up:
- --prompt-mode strict-word, which buries a long context and asks the model to copy one target word.
- 21073-token prompts produced stable exact hashes for A/B/C/D across two passes.
- 21073-token prompts, about 42146 combined prompt tokens against 34304 GPU KV tokens, matched the GPU-only baseline by expected word and exact output hash for A/B.
- 0.506-0.885 s with CPU-to-GPU KV transfer around 15.3 GB/s.
- 12073-token prompts, about 48292 combined prompt tokens against 34304 GPU KV tokens, produced the expected first word for A/B/C/D on both passes.

Detailed note:
notes-20260525-phase6-session-cache-canaries.md
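The exact-hash comparison used in these canaries can be sketched as follows. This is a hypothetical illustration, not the code in scripts/session_cache_canary.py: the choice of SHA-256 truncated to 16 hex characters and both function names are assumptions.

```python
import hashlib


def output_hash(text, length=16):
    """Short, stable fingerprint of a completion (assumed: truncated SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def changed_sessions(first_pass, second_pass):
    """Return session labels whose output hash differs between the two passes."""
    return sorted(
        label
        for label, text in first_pass.items()
        if output_hash(text) != output_hash(second_pass.get(label, ""))
    )


# Hypothetical outputs: session B gains a trailing space on the second pass.
pass_one = {"A": "cobalt", "B": "amber"}
pass_two = {"A": "cobalt", "B": "amber "}
print(changed_sessions(pass_one, pass_two))
```

An exact hash is deliberately strict: any whitespace or token drift counts as a mismatch, which is why the strict-word mode constrains the answer to a single target word.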
C2 capacity ladder follow-up:
- strict-word-answer-space-v2.
- 8K, 16K, 21K, 30K, and 32.5K prompt tokens per session.
- 64948 tokens against a 34304 GPU KV budget.
- 32474-token prompts had first-pass TTFT of 24.758-48.363 s, then second-pass reload TTFT of 0.668-1.232 s.
- 14-15 GB/s.
- 4.0 GiB CPU KV budget per worker for --kv-offloading-size 16, about 16 GiB total across four tensor-parallel workers.
- 22540-token fact-word run is only an operations smoke, not a target context ceiling.

Detailed note:
notes-20260525-c2-session-cache-ladder.md
C2 quality and sustained-decode follow-up:
- fact-word and --logprobs modes to scripts/session_cache_canary.py.
- 8K, 21K, and 32.5K prompt tokens per session.
- cobalt and D=amber answers at about 6.6K, 17.5K, and 27K prompt tokens per session, with one brittle GPU-only baseline caveat at the middle size.
- 24874 prompt tokens per session measured about 66 tok/s per request after TTFT on the second pass. Total wall output for that two-request pass was about 53.35 tok/s.

Detailed note:
notes-20260525-c2-quality-and-turboquant.md
C4/C8 session-cache ladder follow-up:
- scripts/serve_session_cache.sh as a tracked helper for c1/c2/c4/c8 launch shapes.
- scripts/switch_session_cache_profile.sh and scripts/session_cache_status.sh so the live endpoint can be switched between c1/c2/c4/c8 with readiness checks and timestamped logs.
- VLLM_MAX_NUM_SEQS, VLLM_MAX_NUM_BATCHED_TOKENS, VLLM_KV_OFFLOADING_SIZE, and VLLM_NO_SCHEDULER_RESERVE_FULL_ISL.
- --kv-offloading-size 32 reported 34304 live GPU KV tokens and 8.0 GiB CPU KV budget per TP worker.
- 22540-token sessions passed expected-word checks across two passes. Second-pass reload TTFT was 0.390-1.211 s.
- --kv-offloading-size 64 reported only 22784 live GPU KV tokens and 16.0 GiB CPU KV budget per TP worker. More RAM budget reduced live GPU headroom.
- 315.97 s engine init, including 234.78 s of compile time.
- 12540-token sessions passed expected-word checks. Second-pass reload TTFT was 0.552-3.709 s.
- 17540-token sessions also passed. This is about 140321 combined prompt tokens. Second-pass reload TTFT was 0.415-3.247 s.
- 750, 800, 850, and 900 line fact-word runs left some or all requests waiting/deferred near 100% GPU KV. Killing the canary client cleared the queue.

Detailed note:
notes-20260525-c4-c8-session-cache-ladder.md
Operational note:
notes-20260525-session-cache-operations.md
Practical mental model: clients keep and resend their full chat history. vLLM recognizes repeated token prefixes and can reload parked prefix KV from CPU RAM. There is no separate server-side session ID; exact prefix stability is what lets cache reuse work.
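The prefix-reuse idea can be illustrated with a toy model. This is a simplified sketch of the general technique (chained per-block hashes over token IDs), not vLLM's actual hashing code; the function names and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib


def block_hashes(token_ids, block_size=256):
    """Chain-hash each full block so one hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


def reusable_blocks(cached_tokens, new_tokens, block_size=256):
    """Count leading blocks of new_tokens that match the cached prefix exactly."""
    count = 0
    cached = block_hashes(cached_tokens, block_size)
    fresh = block_hashes(new_tokens, block_size)
    for old, new in zip(cached, fresh):
        if old != new:
            break
        count += 1
    return count


history = list(range(1000))             # tokens from an earlier turn
follow_up = history + [7] * 300         # same history plus a new message
edited = [999] + history[1:]            # first token changed by the client
print(reusable_blocks(history, follow_up), reusable_blocks(history, edited))
```

In this toy model, changing a single early token invalidates every later block, which is why clients must resend their history byte-for-byte to benefit from parked KV.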
Live c4 operations caveat: after adding the profile switcher, c4 started and
reported the expected 34304 GPU KV tokens, but an operational smoke hit a
second-pass waiting/deferred stall and a rerun hit
UR_RESULT_ERROR_DEVICE_LOST while copying vLLM block-table state to GPU. Keep
c1 as production and use c2 as the safer correctness lane until c4 is debugged.
The same switcher successfully ran a smaller c2 operations smoke with two
concurrent 22540-token fact-word sessions; both matched exact output hashes
across passes, with second-pass reload TTFT of 0.320-0.570 s. Do not present
that smoke as the c2 context limit; the c2 target remains two 32768-token
request windows with practical output headroom.
Sustained concurrent decode follow-up:
- 9234-token prompts with 128 requested output tokens each measured 109.76 tok/s total warmed wall output on the second pass.
- 9234-token prompts with 128 requested output tokens each measured 110.34 tok/s total warmed wall output on the second pass.
- n128 attempts at 16134 and 22459 prompt tokens per session stalled after partial completion even though shorter correctness canaries at larger contexts can pass.

Detailed note:
notes-20260525-sustained-concurrency-decode.md
Additional deployment observation: after the B70s were no longer used for the
Ubuntu display, experimental c2/c4 launches could report 34304 GPU KV tokens
instead of the earlier 26112 c2 result. Display ownership and compile/cache
shape can materially affect available KV budget.
TurboQuant remains interesting because it reduces KV footprint and therefore reduces both RAM capacity pressure and PCIe transfer volume.
Current TurboQuant status after the 2026-05-25 workspace fallback experiment:
- turboquant_attn.py: _decode_attention and _continuation_prefill.
- ../../patches/vllm-turboquant-xpu-workspace-fallback-20260525.patch.
- turboquant_k8v4 and max_model_len=32768, vLLM reported 80128 GPU KV tokens and 2.45x max concurrency for a 32K request.
- 8K and 32.5K prompt tokens.
- 24874 prompt tokens was only about 16.5 tok/s after TTFT, much slower than the normal FP16-family KV lane.
- max_model_len=65536 failed; vLLM estimated the maximum at 60672.
- max_model_len=60000 started and answered a 58874 token strict-word canary, but TTFT was about 53-54 s and decode was not interactive.
- turboquant_4bit_nc, max_model_len=100000, c4, and --kv-offloading-size 32, vLLM reported 84654 GPU KV tokens and 0.85x max concurrency for a 100K request.
- 84074 and 84374 prompt tokens, but timed out or parked near 84644+ tokens because the active request exhausted live GPU KV blocks.
- turboquant_4bit_nc, max_model_len=196608, c1, --max-num-batched-tokens 512, and --gpu-memory-utilization 0.959, vLLM reported 98304 GPU KV tokens and 0.50x max concurrency for a full 196K request.
- 84074 token strict-word prompt, but TTFT was 114.342 s. A near-limit prompt around 97800 tokens filled GPU KV (kv_cache_usage=1.0) and killed the engine with TimeoutError: RPC call to sample_tokens timed out.
- ocloc / IGC error 245 still appears during compile fallback.

Interpretation: TurboQuant is now mechanically past the first XPU workspace blockers and can raise the live GPU KV ceiling, but it is not a production replacement. Capacity improved substantially; decode speed, stability, and quality still need work. Most importantly, TurboQuant plus CPU KV offload still does not provide true active-context overflow: the active request must fit in live GPU KV blocks.
Relevant repro:
scripts/repro-minimax-turboquant-xpu-workspace-bug.sh
Detailed note:
notes-20260525-c2-quality-and-turboquant.md
Active-boundary note:
notes-20260525-turboquant-active-context-boundary.md
The next full-context path is CPU-paged attention, not a larger
--kv-offloading-size setting. Current CPU KV offload can park/reload sessions,
but XPU FlashAttention still needs the active request’s KV blocks in live GPU
memory.
The proposed exact path is:
- merge_attn_states() log-sum-exp merge.

This mirrors two existing vLLM patterns:
- cascade_attention() already splits prefix/suffix attention and merges LSE-backed partial outputs.
- extend_forward() already gathers KV chunks into a workspace, runs attention by chunk, and merges the results.

New artifacts:
- notes-20260525-cpu-paged-attention-design.md
- notes-20260525-stagea-gpu-split-attention.md
- probes/split_attention_merge_probe.py
- probes/xpu_flash_attn_split_probe.py
- split_attention_merge_probe_20260525.json
- split_attention_merge_probe_20260525-uneven.json
- xpu_flash_attn_split_probe_20260525.json

Standalone split-attention math probe results:
| Shape | Chunks | Max output abs error | Max LSE abs error | Result |
|---|---|---|---|---|
| 4096 KV tokens, 4 queries, 8 heads | 8 | 6.71e-08 | 9.54e-07 | pass |
| 5000 KV tokens, 7 queries, 8 heads | 7 | 8.94e-08 | 1.91e-06 | pass |
Interpretation: the core split-and-merge softmax math is sound. The remaining work is vLLM integration: logical-vs-physical KV accounting, CPU block range queries, GPU staging workspace, temporary block tables, and XPU FlashAttention calls that return LSE for merging.
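The split-and-merge math can be checked in a few lines of plain Python. This is an independent minimal sketch, not the contents of probes/split_attention_merge_probe.py: it uses one query, one head, no causal mask, and float64 arithmetic, and the function names are hypothetical.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention for one query; returns (output, log-sum-exp of scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, top + math.log(total)


def merge(partials):
    """Combine per-chunk (output, lse) pairs into the full-attention result."""
    lse_max = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - lse_max) for _, lse in partials]
    norm = sum(coeffs)
    dim = len(partials[0][0])
    out = [sum(c * o[d] for c, (o, _) in zip(coeffs, partials)) / norm for d in range(dim)]
    return out, lse_max + math.log(norm)


random.seed(0)
dim, n_kv, chunk = 16, 500, 73  # 500 is not a multiple of 73: uneven last chunk
q = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_kv)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_kv)]

full_out, full_lse = attend(q, keys, values)
parts = [attend(q, keys[i:i + chunk], values[i:i + chunk]) for i in range(0, n_kv, chunk)]
merged_out, merged_lse = merge(parts)
max_err = max(abs(a - b) for a, b in zip(full_out, merged_out))
print(f"max output error {max_err:.2e}, lse error {abs(full_lse - merged_lse):.2e}")
```

Each chunk's output is weighted by exp(lse) of that chunk, so the merge is exact up to rounding. This is why a correct LSE from every chunk is a hard requirement: an offset in one chunk's LSE rescales that chunk's whole contribution.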
Stage A vLLM integration attempt:
- cascade_attention() path.
- q_descale in the prefix cascade call.
- causal=True produced a large output mismatch (0.128 max abs error), while suffix causal=False was close (0.0015 max abs error) but had an LSE offset.
- 60.75 tok/s after TTFT but still did not match the baseline hash.
- 3714 prompt tokens, 64 output tokens, 91.44 tok/s after TTFT, hash 5afda1f4fa37f3d3.
- 3714 prompt tokens, 64 output tokens, 13.29 tok/s after TTFT, hash 2fb45f78a286e529, text diverged.

Follow-up probes on 2026-05-25 narrowed the active-overflow path:
- probes/xpu_cpu_staged_attention_probe.py to test paged XPU scratch. It is a useful negative diagnostic: paged scratch output can be close, but returned LSE is unstable enough to break exact chunk merging.
- probes/xpu_cpu_dense_staged_attention_probe.py to test dense XPU scratch. This is the promising route.
- 3e-05 to 6e-05, despite paged LSE being unreliable.
- 32768 tokens with 8 KV heads, 128 head size, 16384 CPU-staged prefix tokens, and 8192 token dense scratch chunks.
- 3.0517578125e-05 and matched dense full LSE within 9.5367431640625e-07.

Decision: stop pursuing paged scratch merging for true active overflow. The next vLLM prototype should use dense scratch chunks for both CPU-resident old KV and GPU-resident suffix KV, then merge dense attention states.
Detailed note:
notes-20260525-dense-staged-cpu-attention.md
The requested end goal is useful: four or more concurrent sessions, ideally
with the full 196608 token MiniMax context, backed by system RAM when needed.
The current stack cannot do that yet.
The best active capacity observed in the TurboQuant 4-bit NC lane was 98304
tokens. One full MiniMax context is 196608 tokens, so a single full active
context is about 2x beyond the best live GPU KV capacity observed. Four full
active contexts are 786432 tokens, about 8x beyond that live capacity.
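The gap can be restated as a small calculation. A minimal sketch using only the figures quoted above (the helper name is hypothetical):

```python
FULL_CONTEXT = 196608      # MiniMax advertised max position embeddings
BEST_LIVE_GPU_KV = 98304   # best live GPU KV capacity observed (TurboQuant 4-bit NC)


def overflow_factor(sessions, context=FULL_CONTEXT, live=BEST_LIVE_GPU_KV):
    """How many times the requested active tokens exceed live GPU KV capacity."""
    return sessions * context / live


for sessions in (1, 4):
    tokens = sessions * FULL_CONTEXT
    print(f"{sessions} session(s): {tokens} tokens, {overflow_factor(sessions):.0f}x live capacity")
```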
CPU KV offload is still useful, but today it is session caching: it stores and reloads KV for sessions whose active working set fits in GPU KV. It does not make XPU attention read arbitrary old KV blocks directly from host RAM.
Production should stay on the 32768 FP16-family KV endpoint. The next R&D
step is dense-scratch CPU-staged attention, not just larger
--kv-offloading-size values or higher max_model_len. Paged scratch merging
is currently blocked by unreliable XPU paged-attention LSE values.
Restore note: a fatal near-limit 196K TurboQuant run left orphan
VLLM::Worker_TP* processes holding XPU memory. If the normal server fails on
restart with near-zero free XPU memory, kill the orphan workers and stale
multiprocessing.resource_tracker, then remove stale /dev/shm/psm_* and
/dev/shm/sem.mp-* files. Details are in the active-boundary note.
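The shared-memory cleanup step can be scripted cautiously. The sketch below is a hypothetical helper, not part of this folder: it only lists matches by default and deletes nothing unless dry_run=False is passed. It assumes no other process on the host legitimately owns files with these name patterns, so run it only after the orphan workers are confirmed dead.

```python
import glob
import os

STALE_PATTERNS = ("psm_*", "sem.mp-*")


def find_stale_shm(root="/dev/shm", patterns=STALE_PATTERNS):
    """List files under root whose names match the stale shared-memory patterns."""
    matches = []
    for pattern in patterns:
        matches.extend(glob.glob(os.path.join(root, pattern)))
    return sorted(matches)


def remove_stale_shm(paths, dry_run=True):
    """Delete the given files; with dry_run=True only report what would be removed."""
    for path in paths:
        if dry_run:
            print(f"would remove {path}")
        else:
            os.remove(path)


if __name__ == "__main__":
    # Dry run by default: review the list before deleting anything.
    remove_stale_shm(find_stale_shm())
```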
- cudaHostRegister for pinned CPU KV pages?

If an experiment leaves the server down, restore the stable endpoint with:
pkill -f 'vllm serve' || true
VLLM_MAX_MODEL_LEN=32768 /home/steve/bin/minimax-vllm-serve
Expected /v1/models:
{
"id": "/mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround",
"max_model_len": 32768
}
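A restore check can be automated by parsing the /v1/models body. This is a minimal sketch with a hypothetical helper name; it assumes the full response may wrap model entries in an OpenAI-style "data" list, and it also accepts the bare object shown above.

```python
import json


def max_model_len_from_models(payload, model_id=None):
    """Extract max_model_len from a /v1/models JSON body.

    Accepts a bare model object or an OpenAI-style {"data": [...]} wrapper.
    """
    body = json.loads(payload) if isinstance(payload, str) else payload
    models = body.get("data", [body])
    for model in models:
        if model_id is None or model.get("id") == model_id:
            return model.get("max_model_len")
    return None


# Sample body built from the expected values above; the wrapper is an assumption.
sample = json.dumps({
    "object": "list",
    "data": [{
        "id": "/mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround",
        "max_model_len": 32768,
    }],
})
print(max_model_len_from_models(sample))
```

A returned value other than 32768 after a restore indicates that an experimental launch shape is still being served.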