This page is the community-facing index for reproducible model deployments. Each recipe should make clear what it proves: installation, quality, benchmark speed, serving, or all of the above.
Every complete recipe should include:
Do not compare two results unless their model, quantization, prompt length, output length, context length, batch/concurrency, and quality gate are clear.
| Recipe | Status | What It Is For |
|---|---|---|
../repro/minimax-m27-b70-110tps-ubuntu24-20260523/ |
Deployable baseline | Fresh Ubuntu 24.04 setup for 4x B70, MiniMax M2.7 INT4 AutoRound, vLLM OpenAI-compatible endpoint on 0.0.0.0:8000. |
../repro/minimax-m27-b70-89tps-20260520/ |
Strict speed baseline | Older strict quality-passed MiniMax M2.7 INT4 lane with higher output-token throughput. Useful for optimization comparisons. |
../experiments/minimax_xpu_kv_offload/ |
Experimental | Session-cache c2/c4/c8, TurboQuant, and CPU-paged attention research. Use for review and experiments, not as the production recipe. |
../experiments/gemma4-12b-int4-autoround-vllm/ |
Production slot plus research profiles | Gemma 4 12B IT INT4 AutoRound image+text endpoint on vLLM/XPU. Current production is c8 with 32K context and 8 active generations; c10/c12/c16/c64 are documented research or rejected profiles. |
Start with:
cd repro/minimax-m27-b70-110tps-ubuntu24-20260523
sudo bash scripts/00-install-system-deps.sh
sudo reboot
After reboot:
sudo bash scripts/01-prepare-storage.sh
bash scripts/02-download-model.sh
bash scripts/03-build-stack.sh
bash scripts/04-verify-runtime.sh
bash scripts/05-run-quality-and-benchmark.sh
bash scripts/06-serve-openai-compatible.sh
Then in another terminal:
bash scripts/07-smoke-test-endpoint.sh
See the full deployment guide for explanation and troubleshooting.
For RAM-backed session-cache experiments, start with:
cd experiments/minimax_xpu_kv_offload
less REPRODUCE.md
Current status:
See Current Reproducibility Map for the full artifact map.
The current image+text research profile is:
cd /home/steve/llm-optimizations
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8
It serves Intel/gemma-4-12B-it-int4-AutoRound through the same no-auth
OpenAI-compatible LAN endpoint on 0.0.0.0:8000, with 32K context, 8 active
generations, prefix caching, and XPU graph capture. It needs the local vLLM
gemma4_unified backport captured in
patches/vllm-gemma4-unified-backport-b70-20260607.patch.
See the Gemma 4 experiment guide for the exact slot profiles, smoke tests, known bad multimedia-limit setting, 2K/512 concurrency results, c10/c12 full-32K boundary tests, and prefix-cache results for shared-prefix plus unique-tail prompts.
These are useful community targets to add as separate repro folders:
gemma4_unified backport is either upstream or packaged as a smaller patch.Use a name that includes model, hardware, headline result, OS/date, and avoid spaces:
repro/<model>-<hardware>-<headline>-<os>-<yyyymmdd>/
README.md
configs/
scripts/
patches/
results/
notes/
For example:
repro/minimax-m27-b70-110tps-ubuntu24-20260523/
For a result to be useful to other people, include: