This page connects the active Gemma 4 service, the deployable MiniMax baseline, the session-cache experiments, the TurboQuant patch, and the long-context research path. It is meant for a fresh human or agent who needs to reproduce or review the current work without reading every historical note first.
The active LAN endpoint on this host is the Gemma 4 c8 model-slot profile:
- Model: Intel/gemma-4-12B-it-int4-AutoRound
- Model path: /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel
- Listen address: 0.0.0.0:8000
- Context window: 32768
- Concurrency: 8

Restore production Gemma 4 c8:
cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8
Current Gemma 4 recipe and results:
- ../experiments/gemma4-12b-int4-autoround-vllm/README.md
- ../experiments/gemma4-12b-int4-autoround-vllm/results-20260607-c10-c12-32k-boundary.json
- model-slot-switching.md

Latest full-32K concurrency conclusion:
- 22.20 s to 12.45 s.

The MiniMax 32K FP16-family KV c1 endpoint remains the deployable baseline recipe and optimization reference:
- Model: Lasimeri/MiniMax-M2.7-int4-AutoRound
- Model path: /mnt/fast-ai/llm-models/minimax-m2.7-int4-autoround
- Listen address: 0.0.0.0:8000
- Context window: 32768
- Concurrency: 1
- KV cache: auto / FP16-family

Fresh install guide:
../repro/minimax-m27-b70-110tps-ubuntu24-20260523/README.md
Human deployment guide:
b70-minimax-ubuntu24-deployment.md
Main server script:
../repro/minimax-m27-b70-110tps-ubuntu24-20260523/scripts/06-serve-openai-compatible.sh
Operational profile switcher:
../experiments/minimax_xpu_kv_offload/scripts/switch_session_cache_profile.sh
Restore c1:
cd /home/steve/llm-optimizations
experiments/minimax_xpu_kv_offload/scripts/switch_session_cache_profile.sh c1
Check status:
experiments/minimax_xpu_kv_offload/scripts/session_cache_status.sh
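Alongside the status script, a client can check which model the endpoint currently advertises. The following is a minimal sketch, assuming the server exposes the standard OpenAI-compatible `/v1/models` route on port 8000. The helper names and `BASE_URL` are illustrative and are not part of this repo.

```python
import json
import urllib.request

# Assumption: probing from the serving host; use the LAN address from another machine.
BASE_URL = "http://127.0.0.1:8000"


def models_url(base_url: str) -> str:
    """Build the OpenAI-style model-listing URL for a base address."""
    return base_url.rstrip("/") + "/v1/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Query the endpoint and return the model ids it reports."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `fetch_model_ids()` after a profile switch should list the model that the new profile serves. An empty list or a connection error suggests the backend is still loading or not running.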
The fresh Ubuntu 24 repro builds from source and applies two compressed patch artifacts from the older strict-speed repro:
- ../repro/minimax-m27-b70-89tps-20260520/patches/vllm-active-promoted-minimax-89tps-20260520.patch.gz.b64
- ../repro/minimax-m27-b70-89tps-20260520/patches/llm-scaler-active-promoted-minimax-89tps-20260520.patch.gz.b64

The build script decodes and applies those patches automatically:
../repro/minimax-m27-b70-110tps-ubuntu24-20260523/scripts/03-build-stack.sh
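To inspect a patch bundle without running the build, the `.patch.gz.b64` suffix suggests a gzip-compressed patch wrapped in base64. The sketch below decodes that layering. It is an assumption drawn from the file extension, not a copy of what `03-build-stack.sh` does, and the function names are illustrative.

```python
import base64
import gzip


def decode_patch_bundle(b64_payload: bytes) -> bytes:
    """Decode a .patch.gz.b64 payload: base64 first, then gzip."""
    compressed = base64.b64decode(b64_payload)  # newlines in the payload are ignored
    return gzip.decompress(compressed)


def encode_patch_bundle(patch: bytes) -> bytes:
    """Inverse of decode_patch_bundle, handy for round-trip checks."""
    return base64.encodebytes(gzip.compress(patch))
```

Reading a bundle with `Path(...).read_bytes()` and passing it to `decode_patch_bundle` should yield plain unified-diff text that can be reviewed before anything is applied.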
Pinned source commits are listed in:
../repro/minimax-m27-b70-110tps-ubuntu24-20260523/README.md
Live-source audit snapshots from the originating machine are also tracked:
- ../patches/vllm-live-src-snapshot-20260525.patch
- ../patches/llm-scaler-live-src-snapshot-20260525.patch

These snapshots capture the dirty local /home/steve/src/vllm and
/home/steve/src/llm-scaler trees after the session-cache and TurboQuant
research. Treat them as review/audit artifacts, not as clean upstream-ready
patches. The clean fresh-install repro still uses the two compressed promoted
patch bundles listed above.
The fresh deployable baseline records:
- 83.172 output tok/s, 110.896 total tok/s
- 83.8-84.1 output tok/s
- 1.7k-1.8k prompt tok/s
- 32768
- 32408, output 64, no OOM

Tracked summaries:
- ../repro/minimax-m27-b70-110tps-ubuntu24-20260523/results/summary-20260523.json
- ../repro/minimax-m27-b70-110tps-ubuntu24-20260523/results/context-window-32768-20260523.json
- ../data/localmaxxing-minimax-m27-autoround-openai-32k-context-20260523.payload.json
- ../data/localmaxxing-minimax-m27-autoround-openai-32k-endpoint-metrics-20260524.payload.json

Detailed notes:
- ../notes/2026-05-23-b70-display-disable-32768-context.md
- ../notes/2026-05-23-current-host-pcie4-prefill-check.md

This is the main experimental path for keeping multiple long conversations warm. It is not one huge active context.
Mental model:
Entry points:
- ../experiments/minimax_xpu_kv_offload/REPRODUCE.md
- ../experiments/minimax_xpu_kv_offload/ARTIFACTS.md
- ../experiments/minimax_xpu_kv_offload/README.md
- ../experiments/minimax_xpu_kv_offload/notes-20260525-session-cache-operations.md

Scripts:
- ../scripts/install-minimax-vllm-service.sh
- ../scripts/openai-lan-frontdoor.py
- ../scripts/minimax-prod-health.py
- ../scripts/minimax-prod-benchmark.py
- ../deploy/systemd/minimax-vllm.service
- ../deploy/systemd/minimax-openai-frontdoor.service
- ../experiments/minimax_xpu_kv_offload/scripts/serve_session_cache.sh
- ../experiments/minimax_xpu_kv_offload/scripts/switch_session_cache_profile.sh
- ../experiments/minimax_xpu_kv_offload/scripts/session_cache_status.sh
- ../experiments/minimax_xpu_kv_offload/scripts/session_cache_canary.py

Current operational recommendation:
- minimax-vllm.service as a localhost backend on 127.0.0.1:18080 and minimax-openai-frontdoor.service as the no-auth LAN OpenAI-compatible endpoint on 0.0.0.0:8000.
- cmpm35jsa0003rt01zghtmwip for prompt 32264, output 64, 63.91 output tok/s after TTFT, 1382.57 approximate prefill tok/s, 23.336 s TTFT.
- 32768-token window sessions.

Near-full c2 validation:
- 32474 prompt tokens per session, 64948 combined prompt tokens
- 0.668-1.232 s
- 14-15 GB/s

Known-good c2 operations smoke:
- 22540 prompt tokens per session
- 0.320-0.570 s
- 16.2 GB/s

The operations smoke is intentionally smaller and cleaner. It does not define the desired c2 context ceiling; c2 should be presented as a 32K-window profile.
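The token figures quoted in this section can be cross-checked with simple arithmetic. The sketch below assumes that c2 means two sessions, that each session has its own 32768-token window, and that the approximate prefill rate is prompt tokens divided by TTFT. These are readings of the numbers above, not definitions taken from the benchmark scripts.

```python
WINDOW_TOKENS = 32768  # per-session context window quoted on this page


def combined_prompt_tokens(per_session: int, sessions: int) -> int:
    """Total prompt tokens held warm across all sessions."""
    return per_session * sessions


def window_headroom(prompt_tokens: int, window: int = WINDOW_TOKENS) -> int:
    """Tokens left in one session's window after the prompt."""
    return window - prompt_tokens


def approx_prefill_rate(prompt_tokens: int, ttft_s: float) -> float:
    """Assumed definition: prompt tokens processed per second of TTFT."""
    return prompt_tokens / ttft_s


print(combined_prompt_tokens(32474, 2))   # 64948, matching the near-full c2 validation
print(window_headroom(32474))             # 294 tokens left per session
print(window_headroom(22540))             # 10228 tokens left in the operations smoke
print(approx_prefill_rate(32264, 23.336)) # close to the quoted 1382.57 tok/s
```

The small headroom in the near-full case is why the operations smoke, with roughly 10K tokens to spare per session, is the cleaner day-to-day check.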
Result file from the originating host:
/mnt/fast-ai/bench-results/minimax-m27-b70-serve/session-cache-c2-ops-fact-900lines-20260525T223527Z.json
The raw /mnt/fast-ai file is not in GitHub; the result is summarized in:
../experiments/minimax_xpu_kv_offload/notes-20260525-session-cache-operations.md
Concurrency/sustained decode notes:
- ../experiments/minimax_xpu_kv_offload/notes-20260525-c2-session-cache-ladder.md
- ../experiments/minimax_xpu_kv_offload/notes-20260525-c4-c8-session-cache-ladder.md
- ../experiments/minimax_xpu_kv_offload/notes-20260525-sustained-concurrency-decode.md

Headline sustained warm results:
- 9234-token prompts, 128 requested output tokens: about 109.76 tok/s total warmed wall output
- 9234-token prompts, 128 requested output tokens: about 110.34 tok/s total warmed wall output

Live c4 caveat:
- 34304 GPU KV tokens
- UR_RESULT_ERROR_DEVICE_LOST while copying vLLM block-table state to GPU

TurboQuant is a compressed-KV research lane. It can raise the live KV ceiling, but it is not the production mode.
Patch artifact:
../patches/vllm-turboquant-xpu-workspace-fallback-20260525.patch
Repro script:
../scripts/repro-minimax-turboquant-xpu-workspace-bug.sh
Detailed notes:
- ../experiments/minimax_xpu_kv_offload/notes-20260525-c2-quality-and-turboquant.md
- ../experiments/minimax_xpu_kv_offload/notes-20260525-turboquant-active-context-boundary.md

Current status:
- turboquant_attn.py:_decode_attention and _continuation_prefill
- turboquant_k8v4 at 32K reported 80128 GPU KV tokens and 2.45x max concurrency for a 32K request
- 8K and 32.5K prompt tokens
- 24874 token prompt was only about 16.5 tok/s after TTFT
- turboquant_4bit_nc with max_model_len=196608 reported 98304 GPU KV tokens but still could not serve a true 196K active request

Important boundary:
TurboQuant plus CPU KV offload still requires the active request’s working KV blocks to fit in live GPU memory. It helps capacity, but it is not active-context overflow.
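The capacity figures in the status list illustrate this boundary. The sketch below assumes the reported max concurrency is GPU KV tokens divided by request length, which matches the 2.45x figure, and that an active request needs its full length in live GPU KV tokens. Both function names are illustrative.

```python
def max_concurrency(gpu_kv_tokens: int, request_tokens: int) -> float:
    """Assumed reading: how many full-length requests fit in live GPU KV."""
    return gpu_kv_tokens / request_tokens


def fits_active(gpu_kv_tokens: int, request_tokens: int) -> bool:
    """True when one request's whole KV working set fits on the GPU."""
    return request_tokens <= gpu_kv_tokens


print(round(max_concurrency(80128, 32768), 2))  # 2.45 for turboquant_k8v4 at 32K
print(max_concurrency(98304, 196608))           # 0.5 for turboquant_4bit_nc at 196608
print(fits_active(98304, 196608))               # False: a 196K request cannot be active
```

A ratio below 1.0 means no single request of that length can be resident, regardless of how much CPU offload capacity is configured.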
The credible exact-quality path is CPU-paged attention, not simply increasing
--kv-offloading-size.
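Paged or split attention can stay exact because softmax attention computed over disjoint KV chunks can be recombined from each chunk's output and log-sum-exp. The following plain-Python sketch illustrates that identity for a single query vector. It is not the repo's `split_attention_merge_probe.py`, and the function names are illustrative.

```python
import math


def attention_chunk(q, keys, values):
    """Attention over one KV chunk: returns (normalized output, log-sum-exp)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]
    return out, peak + math.log(total)


def merge_chunks(partials):
    """Combine per-chunk (output, lse) pairs into the full-attention output."""
    top = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - top) for _, lse in partials]  # each chunk's softmax mass
    norm = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[i] for c, (out, _) in zip(coeffs, partials)) / norm
            for i in range(dim)]
```

In a CPU-paged design, one chunk could be computed where the KV blocks live and only the small output and log-sum-exp pair would need to move. Whether that is fast enough is what the probes listed below investigate.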
Design notes and probes:
- ../experiments/minimax_xpu_kv_offload/notes-20260525-cpu-paged-attention-design.md
- ../experiments/minimax_xpu_kv_offload/notes-20260525-dense-staged-cpu-attention.md
- ../experiments/minimax_xpu_kv_offload/notes-20260525-stagea-gpu-split-attention.md
- ../experiments/minimax_xpu_kv_offload/probes/split_attention_merge_probe.py
- ../experiments/minimax_xpu_kv_offload/probes/xpu_flash_attn_split_probe.py
- ../experiments/minimax_xpu_kv_offload/probes/xpu_cpu_dense_staged_attention_probe.py

Experimental patches:
- ../experiments/minimax_xpu_kv_offload/patches/kv-offload-admission-check-xpu-experiment-20260524.patch
- ../experiments/minimax_xpu_kv_offload/patches/xpu-cpu-kv-worker-prototype-20260525.patch
- ../experiments/minimax_xpu_kv_offload/patches/vllm-xpu-gpu-split-attn-stagea-failed-20260525.patch

Current design direction:
This is still a research path, not a serving recipe.
GitHub has:
GitHub does not include:
- /mnt/fast-ai/bench-results tree

When a note references a raw /mnt/fast-ai log or JSON, use the summarized
values in GitHub unless you are on the originating machine.
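A review script can apply the same rule by preferring the raw file only when it exists locally. This is a minimal sketch with an illustrative helper name; the paths passed in are whatever the note being reviewed references.

```python
from pathlib import Path


def pick_result_source(raw_path: str, summary_path: str) -> Path:
    """Return the raw result if present on this machine, else the tracked summary."""
    raw = Path(raw_path)
    return raw if raw.is_file() else Path(summary_path)
```

On any machine other than the originating host, the raw path will not exist and the helper falls back to the summary tracked in GitHub.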