b70-optimization-lab

Current Reproducibility Map

This page connects the active Gemma 4 service, the deployable MiniMax baseline, the session-cache experiments, the TurboQuant patch, and the long-context research path. It is meant for a new reader, human or agent, who needs to reproduce or review the current work without first reading every historical note.

What Is Production Today

The active LAN endpoint on this host is the Gemma 4 c8 model-slot profile:

Restore production Gemma 4 c8:

cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
  scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8

Current Gemma 4 recipe and results:

Latest full-32K concurrency conclusion:

MiniMax Deployable Baseline

The MiniMax 32K FP16-family KV c1 endpoint remains the deployable baseline recipe and optimization reference:

Fresh install guide:

../repro/minimax-m27-b70-110tps-ubuntu24-20260523/README.md

Human deployment guide:

b70-minimax-ubuntu24-deployment.md

Main server script:

../repro/minimax-m27-b70-110tps-ubuntu24-20260523/scripts/06-serve-openai-compatible.sh

Operational profile switcher:

../experiments/minimax_xpu_kv_offload/scripts/switch_session_cache_profile.sh

Restore c1:

cd /home/steve/llm-optimizations
experiments/minimax_xpu_kv_offload/scripts/switch_session_cache_profile.sh c1

Check status:

experiments/minimax_xpu_kv_offload/scripts/session_cache_status.sh

Baseline Build Inputs

The fresh Ubuntu 24 repro builds from source and applies two compressed patch artifacts from the older strict-speed repro:

The build script decodes and applies those patches automatically:

../repro/minimax-m27-b70-110tps-ubuntu24-20260523/scripts/03-build-stack.sh

Pinned source commits are listed in:

../repro/minimax-m27-b70-110tps-ubuntu24-20260523/README.md

Live-source audit snapshots from the originating machine are also tracked:

These snapshots capture the dirty local /home/steve/src/vllm and /home/steve/src/llm-scaler trees after the session-cache and TurboQuant research. Treat them as review/audit artifacts, not as clean upstream-ready patches. The clean fresh-install repro still uses the two compressed promoted patch bundles listed above.

Baseline Results

The fresh deployable baseline records:

Tracked summaries:

Detailed notes:

Session-Cache / RAM-Backed Juggling

This is the main experimental path for keeping multiple long conversations warm. It does not provide one huge active context.

Mental model:

Entry points:

Scripts:

Current operational recommendation:

Near-full c2 validation:

Known-good c2 operations smoke:

The operations smoke is intentionally smaller and cleaner than the near-full c2 validation. It does not define the desired c2 context ceiling; c2 should be presented as a 32K-window profile.

Result file from the originating host:

/mnt/fast-ai/bench-results/minimax-m27-b70-serve/session-cache-c2-ops-fact-900lines-20260525T223527Z.json

The raw /mnt/fast-ai file is not in GitHub; the result is summarized in:

../experiments/minimax_xpu_kv_offload/notes-20260525-session-cache-operations.md

Concurrency/sustained decode notes:

Headline sustained warm results:

Live c4 caveat:

TurboQuant

TurboQuant is a compressed-KV research lane. It can raise the live KV ceiling, but it is not the production mode.

Patch artifact:

../patches/vllm-turboquant-xpu-workspace-fallback-20260525.patch

Repro script:

../scripts/repro-minimax-turboquant-xpu-workspace-bug.sh

Detailed notes:

Current status:

Important boundary:

TurboQuant plus CPU KV offload still requires the active request’s working KV blocks to fit in live GPU memory. It raises capacity, but it does not let an active context overflow GPU memory.
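The boundary can be illustrated with a small capacity check. This is a minimal sketch, not the serving stack's real accounting: the function names, the block size, the per-block byte count, and the GPU KV budget are all hypothetical values chosen for the example. It assumes only that KV is allocated in fixed-size blocks and that every block of the active request must be GPU-resident at once.

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of fixed-size KV blocks required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)


def active_request_fits(num_tokens: int, block_size: int,
                        bytes_per_block: int, gpu_kv_budget_bytes: int) -> bool:
    """True if the active request's working KV blocks fit in the live GPU KV budget.

    CPU offload capacity is deliberately not a parameter: in this model it can
    park idle KV, but it cannot hold blocks that the active request is using.
    """
    return blocks_needed(num_tokens, block_size) * bytes_per_block <= gpu_kv_budget_bytes


# Hypothetical numbers, for illustration only.
BLOCK_SIZE = 16                    # tokens per KV block (assumed)
FP16_BYTES_PER_BLOCK = 1 << 20     # 1 MiB per block (assumed)
GPU_KV_BUDGET = 2 << 30            # 2 GiB of live GPU KV (assumed)

# Uncompressed KV: 2048 blocks fit, so 32768 tokens is the ceiling.
fits_at_ceiling = active_request_fits(32768, BLOCK_SIZE, FP16_BYTES_PER_BLOCK, GPU_KV_BUDGET)
fits_one_over = active_request_fits(32769, BLOCK_SIZE, FP16_BYTES_PER_BLOCK, GPU_KV_BUDGET)

# Halving bytes per block (a stand-in for KV compression) doubles the ceiling,
# but the check is still against GPU memory, not CPU offload size.
fits_compressed = active_request_fits(65536, BLOCK_SIZE, FP16_BYTES_PER_BLOCK // 2, GPU_KV_BUDGET)
```

In this toy model, compression moves the ceiling and CPU offload does not, which is the distinction the boundary above draws.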

Full 196K Active Context Path

The credible exact-quality path is CPU-paged attention, not simply increasing --kv-offloading-size.

Design notes and probes:

Experimental patches:

Current design direction:

  1. Keep recent/current KV in normal GPU KV blocks.
  2. Keep older logical KV blocks in CPU offload storage.
  3. Stage old CPU-resident KV chunks into GPU scratch.
  4. Run attention over each chunk.
  5. Merge partial attention outputs using log-sum-exp/LSE state.
  6. Merge old-context attention with normal attention over the live GPU suffix.

This is still a research path, not a serving recipe.
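Steps 4 to 6 rest on the fact that softmax attention over a long key/value sequence can be computed chunk by chunk and merged exactly, provided each chunk also returns the log-sum-exp (LSE) of its scores. The sketch below shows that merge for a single query in plain Python. It is an illustration of the math only, not the experimental patch: the function names are hypothetical, and real kernels operate on batched GPU tensors rather than Python lists.

```python
import math


def attend_chunk(q, keys, values):
    """Attention of one query over one KV chunk.

    Returns (output, lse), where output is the softmax-weighted sum of values
    and lse is log(sum(exp(score))) over the chunk's scaled dot-product scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)                              # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge_partials(partials):
    """Merge per-chunk (output, lse) pairs into the result for the union of chunks.

    Each chunk's output is weighted by exp(lse_chunk - lse_total), which equals
    that chunk's share of the full softmax denominator.
    """
    m = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - m) for _, lse in partials]
    total = sum(coeffs)
    dim = len(partials[0][0])
    out = [sum(c * o[d] for c, (o, _) in zip(coeffs, partials)) / total for d in range(dim)]
    return out, m + math.log(total)


def chunked_attention(q, keys, values, chunk_size):
    """Attend over keys/values one chunk at a time, then merge the partial results."""
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]
    return merge_partials(partials)


if __name__ == "__main__":
    # Small deterministic example: 10 keys/values, processed 3 at a time.
    q = [0.3, -1.2, 0.5, 2.0]
    keys = [[math.sin(4 * i + d) for d in range(4)] for i in range(10)]
    values = [[math.cos(3 * i + d) for d in range(3)] for i in range(10)]
    full_out, full_lse = attend_chunk(q, keys, values)
    chunk_out, chunk_lse = chunked_attention(q, keys, values, chunk_size=3)
    print(max(abs(a - b) for a, b in zip(full_out, chunk_out)))
```

The same `merge_partials` call covers step 6: the merged old-context result and the attention over the live GPU suffix are two more (output, LSE) pairs. Because the merge is exact up to floating-point rounding, staging chunks through GPU scratch changes memory traffic rather than the attention result.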

What GitHub Does Not Include

GitHub has:

GitHub does not include:

When a note references a raw /mnt/fast-ai log or JSON, use the summarized values in GitHub unless you are on the originating machine.
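That rule can be expressed as a small helper for tooling that reads results. This is a hedged sketch: `pick_result_source` is a hypothetical name, and the two paths in the example are simply the raw result and summary note mentioned earlier on this page.

```python
from pathlib import Path


def pick_result_source(raw_path: str, summary_path: str) -> Path:
    """Prefer the raw result file when it exists locally (originating machine);
    otherwise fall back to the summary tracked in the repository."""
    raw = Path(raw_path)
    return raw if raw.is_file() else Path(summary_path)


source = pick_result_source(
    "/mnt/fast-ai/bench-results/minimax-m27-b70-serve/"
    "session-cache-c2-ops-fact-900lines-20260525T223527Z.json",
    "../experiments/minimax_xpu_kv_offload/notes-20260525-session-cache-operations.md",
)
```

On any machine other than the originating host, `source` resolves to the summary note.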