b70-optimization-lab

Single Model Slot Switching

This host should normally run one large model at a time. The public LAN API can stay stable while the backend model changes.

Public endpoint:

http://<server-lan-ip>:8000/v1

Backend slot:

127.0.0.1:18080

The frontdoor is OpenAI-compatible and has no bearer-token requirement. It limits concurrent generation requests according to the active model profile.

Why One Slot

Four B70s can run a large model well, but two large vLLM backends at once would fight for VRAM, compile cache, device handles, and port 8000. The model-slot setup makes the intended behavior explicit:

Install Once

cd /home/steve/llm-optimizations
scripts/install-vllm-model-slot-service.sh --profile minimax-m27-c1

Install, enable at boot, and immediately move to the slot-managed MiniMax profile:

cd /home/steve/llm-optimizations
scripts/install-vllm-model-slot-service.sh --profile minimax-m27-c1 --start

This installs:

/etc/systemd/system/b70-vllm-slot.service
/etc/systemd/system/b70-openai-frontdoor.service
/etc/b70-vllm-slot/current.env

Tracked source files:

deploy/systemd/b70-vllm-slot.service
deploy/systemd/b70-openai-frontdoor.service
configs/model-slots/*.env
scripts/serve-vllm-profile.sh
scripts/run-openai-frontdoor-profile.sh
scripts/switch-vllm-model-slot.sh

Switch Models

List available profiles:

scripts/switch-vllm-model-slot.sh list

Switch back to the known-good MiniMax profile:

scripts/switch-vllm-model-slot.sh switch minimax-m27-c1

Try the preferred Qwen 35B INT4 AutoRound candidate after its weights are present:

scripts/switch-vllm-model-slot.sh switch qwen36-35b-a3b-int4-autoround

Try the tested Gemma 3 12B INT4 AutoRound image+text candidate after its weights are present:

scripts/switch-vllm-model-slot.sh switch gemma3-12b-it-int4-autoround

Try the current Gemma 4 12B INT4 AutoRound image+text candidate after the local gemma4_unified vLLM backport is applied:

scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround

Try the full-32K Gemma 4 profile with 8 active generations:

scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8

Try the full-32K Gemma 4 research profile with 10 active generations:

scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c10

The switch command stops the generic slot services and the older MiniMax-specific services before starting the selected slot. This avoids two large models being loaded at the same time. It also disables the older MiniMax-specific units when the generic slot is activated, so reboot behavior stays single-model.

Current Profiles

Profile Status Modalities Purpose
minimax-m27-c1 baseline text Deployable MiniMax M2.7 INT4 AutoRound baseline endpoint, 32K context, one active generation.
qwen36-35b-a3b-int4-autoround research text,image Preferred Qwen 35B candidate. Public W4A16 AutoRound checkpoint, working after the local XPU Mamba pointer patch.
gemma3-12b-it-int4-autoround research text,image Tested Gemma fallback. Public 12B INT4 AutoRound checkpoint, much faster than Qwen 35B on the 2K/128 concurrency ladder.
gemma4-12b-it-int4-autoround research text,image Current Gemma 4 candidate. Intel W4A16 AutoRound checkpoint, working after Transformers 5.10.2 plus the local vLLM gemma4_unified backport.
gemma4-12b-it-int4-autoround-c8 production text,image Active full-context Gemma 4 profile: 32768-token context, 8 active generations, prefix caching, XPU graph capture, no-auth LAN endpoint.
gemma4-12b-it-int4-autoround-c10 experiment-c10 text,image Full-32K Gemma 4 research profile: 10 active generations. Short prompts are faster in aggregate than c8, but near-32K throughput is not better and TTFT is worse.
gemma4-12b-it-int4-autoround-c12 rejected-device-lost text,image Full-32K Gemma 4 failure-boundary profile. Startup succeeded, then burst load hit Level Zero out-of-resources and device-lost. Retained for reproduction only.
gemma4-12b-it-int4-autoround-c16 experiment-c16 text,image Full-32K Gemma 4 research profile with 16 active generations. Do not use as production without a fresh stability/quality pass.
gemma4-12b-it-int4-autoround-c64 research-c64 text,image High-concurrency Gemma 4 profile: 64 active generations, prefix caching, 4480-token context selected to fit 64 full contexts in VRAM.
gemma4-12b-it-int4-autoround-c8-gmem096 rejected-startup-memory text,image Same as c8 but gpu_memory_utilization=0.96; rejected after engine startup failure near the VRAM boundary.
gemma4-12b-it-int4-autoround-c8-gmem097 rejected-startup-memory text,image Same as c8 but gpu_memory_utilization=0.97; rejected because startup free VRAM was below requested utilization.
gemma4-12b-it-int4-autoround-c8-cs1-8 rejected-device-lost text,image Same as c8 but compile_sizes=[1,8]; rejected after IGC fallback compile and UR_RESULT_ERROR_DEVICE_LOST during repeat validation.
gemma4-12b-it-int4-autoround-c8-xpugraph-mbt2048 rejected-device-lost text,image Same as c8 graph but max_num_batched_tokens=2048; gained KV headroom but hit UR_RESULT_ERROR_DEVICE_LOST during canary/sampling.
gemma4-12b-it-int4-autoround-c8-nolog experiment-nolog text,image Same as c8 but frontdoor event logging disabled; quality matched but no clear throughput win.
qwen36-27b-fp8-vrfai rejected-diagnostic text Do not use as a recommended lane. It only worked here with an opt-in BF16 dequant fallback for a failing XPU FP8 primitive.
qwen36-35b-a3b-fp8 blocked-native-xpu-fp8 text,image Official FP8 checkpoint is interesting, but the current local XPU path lacks native block-FP8 W8A8 support.
qwen3-vl-30b-a3b-fp8 blocked-native-xpu-fp8 text,image Multimodal FP8 candidate, blocked by the same native XPU block-FP8 concern until proven otherwise.

Check Status

scripts/switch-vllm-model-slot.sh status
curl http://127.0.0.1:8000/status
curl http://127.0.0.1:8000/v1/models

The frontdoor status includes the active profile metadata:

{
  "model_slot": {
    "name": "minimax-m27-c1",
    "modalities": "text",
    "status": "production"
  },
  "frontdoor": {
    "auth": "none",
    "max_active_generations": 1
  }
}

Validation Gate For A New Profile

Do not call a new profile production-ready until it passes:

  1. /v1/models reports the expected model and context length.
  2. A text completion returns valid tokens.
  3. For VL profiles, /v1/chat/completions accepts a real image request.
  4. Short decode throughput is measured after warmup.
  5. 16K and 32K prompt TTFT are measured.
  6. c2/c4 concurrency tests report both throughput and latency.
  7. Quality smoke tests pass with the same sampling settings used for service.

Notes On The Candidate Models

qwen36-35b-a3b-int4-autoround is the current first-choice Qwen profile. The checkpoint is abhinand/Qwen3.6-35B-A3B-int4-AutoRound, a public W4A16 AutoRound model with quant_method=auto-round and packing_format=auto_round:auto_gptq. In local vLLM, that AutoRound format maps to the INC XPU W4A16 path and Intel int4_gemm_w4a16, which is the quality-preserving hardware path we want to test. Qwen 35B also needed the local vLLM patch in patches/vllm-xpu-mamba-copy-pointer-uint64-20260606.patch; without it, a 2K prompt crashed in vllm/v1/worker/mamba_utils.py with OverflowError: Python int too large to convert to C long while copying Mamba cache pointers.

gemma3-12b-it-int4-autoround is the tested fast Gemma fallback. The checkpoint is OPEA/gemma-3-12b-it-int4-AutoRound, a Gemma 3 image+text model with 4-bit AutoRound/GPTQ-style weights. vLLM rejects Gemma 3 with float16 for numerical-stability reasons, so the profile uses bfloat16 as the runtime activation dtype while keeping INT4 weights. That is not the same as the rejected Qwen FP8 BF16-dequant fallback.

gemma4-12b-it-int4-autoround is now the current Gemma 4 image+text candidate. The checkpoint is Intel/gemma-4-12B-it-int4-AutoRound, a W4A16 AutoRound model with model_type=gemma4_unified. It needed Transformers 5.10.2 and the local vLLM backport in patches/vllm-gemma4-unified-backport-b70-20260607.patch. The active profile uses bfloat16 as the 16-bit activation dtype while the weights remain the INT4 AutoRound checkpoint. The production c8 profile keeps the same no-auth LAN endpoint on 0.0.0.0:8000, supports text and image requests, reports max_model_len=32768, and caps active generations at 8 for predictable full-context behavior.

gemma4-12b-it-int4-autoround-c64 is the high-concurrency variant for many shorter live sessions. It uses the same model and endpoint, but sets max_num_seqs=64, FRONTDOOR_MAX_ACTIVE_GENERATIONS=64, keeps prefix caching enabled, and lowers max_model_len to 4480. This was selected empirically: vLLM reported 292317 GPU KV tokens and 65.25x full-context concurrency at 4480, while 4864 only reached 60.15x. Do not derive the c64 context by dividing the c16 KV budget; the c64 scheduler/compile profile changes the available KV budget.

gemma4-12b-it-int4-autoround-c8 is the active production full-context variant selected after the c64 tradeoff became too short. It uses the same Gemma 4 INT4 AutoRound checkpoint and same no-auth LAN endpoint, but keeps max_model_len=32768 and caps vLLM/frontdoor live generations at 8. After the 2026-06-07 XPU graph promotion, startup reported 1,004,909 GPU KV-cache tokens and 30.67x theoretical full-32K concurrency, so the c8 cap is conservative. The profile passed an 8-way near-limit probe with 32703 prompt tokens and one output token per request. For normal generation, leave output headroom below 32768 total prompt+generated tokens.

gemma4-12b-it-int4-autoround-c10 is a research-only profile for testing whether the large reported KV pool can support more full-32K active clients. It passed the quality canary and improved short-prompt aggregate decode from about 755 to 850 wall output tok/s versus c8 load. It did not improve near-32K workloads: unique near-32K 128-token output stayed around 23 tok/s wall aggregate and half-shared-prefix near-32K output stayed around 40 tok/s wall aggregate, with c10 showing higher TTFT than c8.

gemma4-12b-it-int4-autoround-c12 is retained only to reproduce the failure boundary. It reported about 991K GPU KV tokens and 30.26x theoretical 32K concurrency, then failed under burst load with Level Zero UR_RESULT_ERROR_OUT_OF_RESOURCES followed by UR_RESULT_ERROR_DEVICE_LOST.

Avoid treating the Qwen FP8 profiles as production candidates right now. On 2026-06-06, the native compressed-tensors FP8 path failed during profiling on this host with RuntimeError: could not set scales primitive attribute in torch.ops._xpu_C.fp8_gemm_w8a16. The local BF16 fallback patch avoids that crash by dequantizing FP8 weights into BF16 and using F.linear, but the user explicitly rejected BF16 fallback as the active model direction. The block-FP8 Qwen 35B family also needs native XPU 128x128 block-FP8 W8A8 GEMM support; the local alternatives are BF16 dequant fallback or requantized FP8, neither of which should be promoted as quality-equivalent without a separate eval.

The rejected Qwen 27B BF16-fallback diagnostic used max_model_len=4096, about 2071 prompt tokens per request, and 512 generated tokens:

Concurrency Aggregate output tok/s, wall Mean request decode tok/s Mean TTFT
1 20.48 20.91 0.51 s
16 243.01 17.73 4.48 s
32 402.51 16.21 8.21 s
64 556.55 12.57 15.65 s

Those numbers are useful as a scheduler/concurrency diagnostic only. Full details are in notes/2026-06-06-qwen36-fp8-bf16-fallback-concurrency.md.

2026-06-06 INT4 AutoRound Results

These are text decode throughput measurements through the no-auth LAN frontdoor at about 2K prompt tokens and 128 generated tokens per request. Use aggregate output tok/s, wall as the primary number; XPU/vLLM sometimes coalesces stream chunks, so the post-TTFT derived field can be misleading.

Qwen 35B INT4 AutoRound, after applying the Mamba pointer uint64 patch:

Concurrency Prompt tokens each Output tokens each Aggregate output tok/s, wall Mean TTFT
1 2071 128 17.28 7.41 s
2 2071 128 33.91 7.49 s
4 2071 128 61.54 8.21 s

Qwen notes:

Gemma 3 12B INT4 AutoRound:

Concurrency Prompt tokens each Output tokens each Aggregate output tok/s, wall Mean TTFT
1 2072 128 31.31 4.09 s
2 2072 128 62.52 4.08 s
4 2072 128 124.48 4.09 s
8 2072 128 245.88 4.14 s
16 2072 128 166.14 9.31 s

Gemma notes:

Raw local result files:

/mnt/fast-ai/bench-results/qwen36-35b-int4-vllm-serve/qwen36-35b-int4-c1-2k-128-after-mamba-uint64-20260607T020923Z.json
/mnt/fast-ai/bench-results/qwen36-35b-int4-vllm-serve/qwen36-35b-int4-c2-c4-2k-128-after-mamba-uint64-20260607T021010Z.json
/mnt/fast-ai/bench-results/gemma3-12b-int4-vllm-serve/gemma3-12b-int4-concurrency-2k-128-warm-20260607T024447Z.json

2026-06-07 Gemma 4 INT4 AutoRound Results

Gemma 4 12B INT4 AutoRound:

scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround

Model:

Intel/gemma-4-12B-it-int4-AutoRound

Local path:

/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel

Serving facts:

Text and image smoke tests passed after the final restart. Text returned OK; a base64 PNG image request returned Blue.

Benchmark shape:

Concurrency Aggregate decode tok/s after first text Aggregate output tok/s, wall Mean request decode tok/s Mean TTFT
1 58.22 30.39 58.22 8.05 s
2 117.27 59.70 58.89 8.44 s
4 236.10 116.33 59.71 9.00 s
8 467.76 217.39 59.63 10.20 s
16 922.18 396.11 59.60 11.97 s

Important notes:

Raw local result files:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c1-c2-c4-c8-20260607T034622Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c16-20260607T035019Z.json

Full reproduction notes:

../experiments/gemma4-12b-int4-autoround-vllm/README.md

2026-06-07 Gemma 4 C64 Profile

Switch command:

scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c64

Serving facts:

Context search:

Max model len GPU KV tokens vLLM full-context concurrency Outcome
15616 667253 42.73x too high
10368 507081 48.91x too high
7680 405664 52.82x too high
4864 292589 60.15x too high
4096 291995 71.29x fits
4480 292317 65.25x selected

Short-prompt c64 TTFT with about 123 prompt tokens and one generated token:

Mean TTFT p50 TTFT p95 TTFT Max TTFT
0.882 s 0.712 s 1.139 s 1.140 s

Short-prompt c64 output run with about 123 prompt tokens and 128 generated tokens per request:

Concurrency Aggregate output tok/s, wall
64 1614.41

Near-limit prompt smoke:

Actual prompt tokens Output tokens TTFT
4323 1 0.631 s

Raw local result files:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-ttft-100p-1o-20260607T062129Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-decode-100p-128o-20260607T062143Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-longprompt-4300p-1o-20260607T062228Z.json

2026-06-07 Gemma 4 C8 Production Full-Context Profile

Switch command:

scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8

Serving facts:

Short-prompt c8 TTFT with about 119 prompt tokens and one generated token:

Mean TTFT Max TTFT
0.151 s 0.183 s

Short-prompt c8 output run with about 119 prompt tokens and 128 generated tokens per request:

Profile Concurrency Aggregate output tok/s, wall
pre-graph c8 8 247.49
XPU graph c8, promoted mean of 3 8 703.59

The XPU graph promotion matched the saved quality canary hashes before it was copied into the canonical production profile. It keeps the same model, quantization, context length, concurrency cap, and prefix caching; it changes only the compile/graph execution path:

XPU_GRAPH=1
VLLM_XPU_ENABLE_XPU_GRAPH=1
VLLM_XPU_FORCE_GRAPH_WITH_COMM=1
VLLM_XPU_GRAPH_NOOP_COMM_CAPTURE=1
VLLM_COMPILATION_CONFIG='{"use_inductor_graph_partition":true,"compile_sizes":[1],"cudagraph_mode":"PIECEWISE"}'

Post-promotion startup facts:

torch.compile took 4.12 s
Graph capturing finished in 4 s
init engine took 19.05 s
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x

Sustained decode checks on the promoted graph profile:

Shape Concurrency Wall aggregate output tok/s
119 prompt tokens, 256 output tokens 8 796.18
119 prompt tokens, 512 output tokens 8 780.97
119 prompt tokens, 1024 output tokens 8 731.12

Production soak on 2026-06-07 used the active c8 profile on the no-auth LAN endpoint with 32768 max model length, 8 active generations, prefix caching, and XPU graph capture. The scheduled main soak cycles 1-32 ran from 20260607T090844Z through 20260607T165344Z: 31/32 cycles passed the quality gate and c8/512 decode check. Clean cycles averaged 781.04 tok/s wall aggregate output at 2.551 s mean TTFT, with a 765.44-784.37 tok/s range.

One scheduled quality anomaly occurred at cycle 17: copy_phrase returned slh cobalt orbit instead of satin cobalt orbit. Three immediate reruns and a 25-repeat quality stress loop passed exactly, and the next scheduled cycle returned to normal. Do not hide this in future reports; treat it as an isolated deterministic-canary anomaly unless it repeats.

The first soak run also exposed a harness bug after the last scheduled cycle: the script spun about every 10 s instead of sleeping to the deadline. Those rapid cycles are not normal soak data. scripts/run-gemma4-production-soak.sh now sleeps to the deadline and exits cleanly.

Fixed-harness continuation:

Cycle UTC Quality Wall aggregate output tok/s Mean TTFT
1 20260607T170030Z pass 779.20 2.559 s
2 20260607T171530Z pass 784.75 2.546 s

Frontdoor streaming fix: an apparent c1 TTFT regression after the soak was traced to the LAN proxy, not the vLLM backend. The backend streamed first text in about 34 ms, but scripts/openai-lan-frontdoor.py was forwarding with response.read(65536), which buffered SSE events. The frontdoor now forwards text/event-stream responses line-by-line and flushes each line.

After restarting only b70-openai-frontdoor.service, public endpoint TTFT matched the backend again:

Shape Concurrency Mean TTFT Wall aggregate output tok/s
119 prompt tokens, 512 output tokens 1 0.036 s 112.87
119 prompt tokens, 512 output tokens 8 0.099 s 783.62

The production frontdoor is not a batch queue. It admits up to 8 active generation requests immediately. When all 8 slots are busy, the next generation request fails fast with HTTP 503 instead of waiting behind a long queue. A 9-way overload check after setting FRONTDOOR_QUEUE_TIMEOUT_S=0 admitted 8 requests and returned one 503 in 0.221 s.

512-output scaling on the same production endpoint:

Concurrency Wall aggregate output tok/s
1 112.77
2 205.47
4 398.98
8 784.69

Long-prompt c8 decode is prefill-bound when cold, but prefix caching helps repeated sessions:

Shape Mean TTFT Wall aggregate output tok/s
15357 prompt tokens, 128 output tokens, c8 21.46 s 47.12
28774 prompt tokens, 128 output tokens, c8 22.44 s 45.13
repeated 28774 prompt-token shape after prefix cache warm 3.75 s 270.74

Near-full-context probes:

Shape Prompt tokens each Output tokens each Mean TTFT Max TTFT Notes
c8 cold-ish long prefill 30690 1 22.17 s 39.03 s Before the repeated prefix was warm.
c8 prefix-cache-warm near limit 32703 1 1.94 s 3.22 s Reused most of the prior repeated prefix.

Over-limit canary:

32894 input tokens + 1 output token was rejected, correctly, because it exceeds 32768 total tokens.

Raw local result files:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-1o-fastprompt-20260607T065104Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-128o-fastprompt-20260607T065116Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32000p-1o-20260607T064822Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32703p-1o-20260607T065041Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-promoted-20260607T080322Z

Repo summary:

../experiments/gemma4-12b-int4-autoround-vllm/results-20260607-b70-c8-32768.json