This host should normally run one large model at a time. The public LAN API can stay stable while the backend model changes.
Public endpoint:
http://<server-lan-ip>:8000/v1
Backend slot:
127.0.0.1:18080
The frontdoor is OpenAI-compatible and has no bearer-token requirement. It limits concurrent generation requests according to the active model profile.
Four B70s can run a large model well, but two large vLLM backends at once would
fight for VRAM, compile cache, device handles, and port 8000. The model-slot
setup makes the intended behavior explicit:
cd /home/steve/llm-optimizations
scripts/install-vllm-model-slot-service.sh --profile minimax-m27-c1
Install, enable at boot, and immediately move to the slot-managed MiniMax profile:
cd /home/steve/llm-optimizations
scripts/install-vllm-model-slot-service.sh --profile minimax-m27-c1 --start
This installs:
/etc/systemd/system/b70-vllm-slot.service
/etc/systemd/system/b70-openai-frontdoor.service
/etc/b70-vllm-slot/current.env
Tracked source files:
deploy/systemd/b70-vllm-slot.service
deploy/systemd/b70-openai-frontdoor.service
configs/model-slots/*.env
scripts/serve-vllm-profile.sh
scripts/run-openai-frontdoor-profile.sh
scripts/switch-vllm-model-slot.sh
List available profiles:
scripts/switch-vllm-model-slot.sh list
Switch back to the known-good MiniMax profile:
scripts/switch-vllm-model-slot.sh switch minimax-m27-c1
Try the preferred Qwen 35B INT4 AutoRound candidate after its weights are present:
scripts/switch-vllm-model-slot.sh switch qwen36-35b-a3b-int4-autoround
Try the tested Gemma 3 12B INT4 AutoRound image+text candidate after its weights are present:
scripts/switch-vllm-model-slot.sh switch gemma3-12b-it-int4-autoround
Try the current Gemma 4 12B INT4 AutoRound image+text candidate after the local
gemma4_unified vLLM backport is applied:
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround
Try the full-32K Gemma 4 profile with 8 active generations:
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8
Try the full-32K Gemma 4 research profile with 10 active generations:
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c10
The switch command stops the generic slot services and the older MiniMax-specific services before starting the selected slot. This avoids two large models being loaded at the same time. It also disables the older MiniMax-specific units when the generic slot is activated, so reboot behavior stays single-model.
| Profile | Status | Modalities | Purpose |
|---|---|---|---|
minimax-m27-c1 |
baseline | text | Deployable MiniMax M2.7 INT4 AutoRound baseline endpoint, 32K context, one active generation. |
qwen36-35b-a3b-int4-autoround |
research | text,image | Preferred Qwen 35B candidate. Public W4A16 AutoRound checkpoint, working after the local XPU Mamba pointer patch. |
gemma3-12b-it-int4-autoround |
research | text,image | Tested Gemma fallback. Public 12B INT4 AutoRound checkpoint, much faster than Qwen 35B on the 2K/128 concurrency ladder. |
gemma4-12b-it-int4-autoround |
research | text,image | Current Gemma 4 candidate. Intel W4A16 AutoRound checkpoint, working after Transformers 5.10.2 plus the local vLLM gemma4_unified backport. |
gemma4-12b-it-int4-autoround-c8 |
production | text,image | Active full-context Gemma 4 profile: 32768-token context, 8 active generations, prefix caching, XPU graph capture, no-auth LAN endpoint. |
gemma4-12b-it-int4-autoround-c10 |
experiment-c10 | text,image | Full-32K Gemma 4 research profile: 10 active generations. Short prompts are faster in aggregate than c8, but near-32K throughput is not better and TTFT is worse. |
gemma4-12b-it-int4-autoround-c12 |
rejected-device-lost | text,image | Full-32K Gemma 4 failure-boundary profile. Startup succeeded, then burst load hit Level Zero out-of-resources and device-lost. Retained for reproduction only. |
gemma4-12b-it-int4-autoround-c16 |
experiment-c16 | text,image | Full-32K Gemma 4 research profile with 16 active generations. Do not use as production without a fresh stability/quality pass. |
gemma4-12b-it-int4-autoround-c64 |
research-c64 | text,image | High-concurrency Gemma 4 profile: 64 active generations, prefix caching, 4480-token context selected to fit 64 full contexts in VRAM. |
gemma4-12b-it-int4-autoround-c8-gmem096 |
rejected-startup-memory | text,image | Same as c8 but gpu_memory_utilization=0.96; rejected after engine startup failure near the VRAM boundary. |
gemma4-12b-it-int4-autoround-c8-gmem097 |
rejected-startup-memory | text,image | Same as c8 but gpu_memory_utilization=0.97; rejected because startup free VRAM was below requested utilization. |
gemma4-12b-it-int4-autoround-c8-cs1-8 |
rejected-device-lost | text,image | Same as c8 but compile_sizes=[1,8]; rejected after IGC fallback compile and UR_RESULT_ERROR_DEVICE_LOST during repeat validation. |
gemma4-12b-it-int4-autoround-c8-xpugraph-mbt2048 |
rejected-device-lost | text,image | Same as c8 graph but max_num_batched_tokens=2048; gained KV headroom but hit UR_RESULT_ERROR_DEVICE_LOST during canary/sampling. |
gemma4-12b-it-int4-autoround-c8-nolog |
experiment-nolog | text,image | Same as c8 but frontdoor event logging disabled; quality matched but no clear throughput win. |
qwen36-27b-fp8-vrfai |
rejected-diagnostic | text | Do not use as a recommended lane. It only worked here with an opt-in BF16 dequant fallback for a failing XPU FP8 primitive. |
qwen36-35b-a3b-fp8 |
blocked-native-xpu-fp8 | text,image | Official FP8 checkpoint is interesting, but the current local XPU path lacks native block-FP8 W8A8 support. |
qwen3-vl-30b-a3b-fp8 |
blocked-native-xpu-fp8 | text,image | Multimodal FP8 candidate, blocked by the same native XPU block-FP8 concern until proven otherwise. |
scripts/switch-vllm-model-slot.sh status
curl http://127.0.0.1:8000/status
curl http://127.0.0.1:8000/v1/models
The frontdoor status includes the active profile metadata:
{
"model_slot": {
"name": "minimax-m27-c1",
"modalities": "text",
"status": "production"
},
"frontdoor": {
"auth": "none",
"max_active_generations": 1
}
}
Do not call a new profile production-ready until it passes:
/v1/models reports the expected model and context length./v1/chat/completions accepts a real image request.qwen36-35b-a3b-int4-autoround is the current first-choice Qwen profile. The
checkpoint is abhinand/Qwen3.6-35B-A3B-int4-AutoRound, a public W4A16
AutoRound model with quant_method=auto-round and
packing_format=auto_round:auto_gptq. In local vLLM, that AutoRound format
maps to the INC XPU W4A16 path and Intel int4_gemm_w4a16, which is the
quality-preserving hardware path we want to test. Qwen 35B also needed the
local vLLM patch in
patches/vllm-xpu-mamba-copy-pointer-uint64-20260606.patch; without it, a
2K prompt crashed in vllm/v1/worker/mamba_utils.py with OverflowError:
Python int too large to convert to C long while copying Mamba cache pointers.
gemma3-12b-it-int4-autoround is the tested fast Gemma fallback. The
checkpoint is OPEA/gemma-3-12b-it-int4-AutoRound, a Gemma 3 image+text model
with 4-bit AutoRound/GPTQ-style weights. vLLM rejects Gemma 3 with float16
for numerical-stability reasons, so the profile uses bfloat16 as the
runtime activation dtype while keeping INT4 weights. That is not the same as
the rejected Qwen FP8 BF16-dequant fallback.
gemma4-12b-it-int4-autoround is now the current Gemma 4 image+text candidate.
The checkpoint is Intel/gemma-4-12B-it-int4-AutoRound, a W4A16 AutoRound
model with model_type=gemma4_unified. It needed Transformers 5.10.2 and
the local vLLM backport in
patches/vllm-gemma4-unified-backport-b70-20260607.patch. The active profile
uses bfloat16 as the 16-bit activation dtype while the weights remain the
INT4 AutoRound checkpoint. The production c8 profile keeps the same no-auth
LAN endpoint on 0.0.0.0:8000, supports text and image requests, reports
max_model_len=32768, and caps active generations at 8 for predictable
full-context behavior.
gemma4-12b-it-int4-autoround-c64 is the high-concurrency variant for many
shorter live sessions. It uses the same model and endpoint, but sets
max_num_seqs=64, FRONTDOOR_MAX_ACTIVE_GENERATIONS=64, keeps prefix caching
enabled, and lowers max_model_len to 4480. This was selected empirically:
vLLM reported 292317 GPU KV tokens and 65.25x full-context concurrency at
4480, while 4864 only reached 60.15x. Do not derive the c64 context by
dividing the c16 KV budget; the c64 scheduler/compile profile changes the
available KV budget.
gemma4-12b-it-int4-autoround-c8 is the active production full-context variant
selected after the c64 tradeoff became too short. It uses the same Gemma 4 INT4 AutoRound
checkpoint and same no-auth LAN endpoint, but keeps max_model_len=32768 and
caps vLLM/frontdoor live generations at 8. After the 2026-06-07 XPU graph
promotion, startup reported 1,004,909 GPU KV-cache tokens and 30.67x
theoretical full-32K concurrency, so the c8 cap is conservative. The profile
passed an 8-way near-limit probe with 32703 prompt tokens and one output token
per request. For normal generation, leave output headroom below 32768 total
prompt+generated tokens.
gemma4-12b-it-int4-autoround-c10 is a research-only profile for testing
whether the large reported KV pool can support more full-32K active clients.
It passed the quality canary and improved short-prompt aggregate decode from
about 755 to 850 wall output tok/s versus c8 load. It did not improve
near-32K workloads: unique near-32K 128-token output stayed around 23 tok/s
wall aggregate and half-shared-prefix near-32K output stayed around 40 tok/s
wall aggregate, with c10 showing higher TTFT than c8.
gemma4-12b-it-int4-autoround-c12 is retained only to reproduce the failure
boundary. It reported about 991K GPU KV tokens and 30.26x theoretical
32K concurrency, then failed under burst load with Level Zero
UR_RESULT_ERROR_OUT_OF_RESOURCES followed by UR_RESULT_ERROR_DEVICE_LOST.
Avoid treating the Qwen FP8 profiles as production candidates right now. On
2026-06-06, the native compressed-tensors FP8 path failed during profiling on
this host with RuntimeError: could not set scales primitive attribute in
torch.ops._xpu_C.fp8_gemm_w8a16. The local BF16 fallback patch avoids that
crash by dequantizing FP8 weights into BF16 and using F.linear, but the user
explicitly rejected BF16 fallback as the active model direction. The block-FP8
Qwen 35B family also needs native XPU 128x128 block-FP8 W8A8 GEMM support;
the local alternatives are BF16 dequant fallback or requantized FP8, neither
of which should be promoted as quality-equivalent without a separate eval.
The rejected Qwen 27B BF16-fallback diagnostic used max_model_len=4096, about
2071 prompt tokens per request, and 512 generated tokens:
| Concurrency | Aggregate output tok/s, wall | Mean request decode tok/s | Mean TTFT |
|---|---|---|---|
| 1 | 20.48 |
20.91 |
0.51 s |
| 16 | 243.01 |
17.73 |
4.48 s |
| 32 | 402.51 |
16.21 |
8.21 s |
| 64 | 556.55 |
12.57 |
15.65 s |
Those numbers are useful as a scheduler/concurrency diagnostic only. Full
details are in notes/2026-06-06-qwen36-fp8-bf16-fallback-concurrency.md.
These are text decode throughput measurements through the no-auth LAN
frontdoor at about 2K prompt tokens and 128 generated tokens per request.
Use aggregate output tok/s, wall as the primary number; XPU/vLLM sometimes
coalesces stream chunks, so the post-TTFT derived field can be misleading.
Qwen 35B INT4 AutoRound, after applying the Mamba pointer uint64 patch:
| Concurrency | Prompt tokens each | Output tokens each | Aggregate output tok/s, wall | Mean TTFT |
|---|---|---|---|---|
| 1 | 2071 |
128 |
17.28 |
7.41 s |
| 2 | 2071 |
128 |
33.91 |
7.49 s |
| 4 | 2071 |
128 |
61.54 |
8.21 s |
Qwen notes:
/v1/chat/completions text worked after the frontdoor injected
chat_template_kwargs={"enable_thinking":false} by default.Red.quantization=inc, consistent with the XPU INT4 W4A16 path.E=256,N=128,device_name=Intel(R)_Arc(TM)_Pro_B70_Graphics,dtype=int4_w4a16.
That likely leaves performance on the table.Gemma 3 12B INT4 AutoRound:
| Concurrency | Prompt tokens each | Output tokens each | Aggregate output tok/s, wall | Mean TTFT |
|---|---|---|---|---|
| 1 | 2072 |
128 |
31.31 |
4.09 s |
| 2 | 2072 |
128 |
62.52 |
4.08 s |
| 4 | 2072 |
128 |
124.48 |
4.09 s |
| 8 | 2072 |
128 |
245.88 |
4.14 s |
| 16 | 2072 |
128 |
166.14 |
9.31 s |
Gemma notes:
OK. I'm Gemma, a large language model from Google
DeepMind.Red.max_model_len=32768, prefix caching enabled, and about
970,354 GPU KV-cache tokens, or about 29.61x theoretical concurrency for
full 32K requests.Raw local result files:
/mnt/fast-ai/bench-results/qwen36-35b-int4-vllm-serve/qwen36-35b-int4-c1-2k-128-after-mamba-uint64-20260607T020923Z.json
/mnt/fast-ai/bench-results/qwen36-35b-int4-vllm-serve/qwen36-35b-int4-c2-c4-2k-128-after-mamba-uint64-20260607T021010Z.json
/mnt/fast-ai/bench-results/gemma3-12b-int4-vllm-serve/gemma3-12b-int4-concurrency-2k-128-warm-20260607T024447Z.json
Gemma 4 12B INT4 AutoRound:
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround
Model:
Intel/gemma-4-12B-it-int4-AutoRound
Local path:
/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel
Serving facts:
gemma4-12b-it-int4-autoroundhttp://0.0.0.0:8000/v14bfloat16inc3276816--limit-mm-per-prompt '{"image":4}'Text and image smoke tests passed after the final restart. Text returned
OK; a base64 PNG image request returned Blue.
Benchmark shape:
2048 tokens2071 per request512 per requestignore_eos=truescripts/bench-openai-concurrency.py| Concurrency | Aggregate decode tok/s after first text | Aggregate output tok/s, wall | Mean request decode tok/s | Mean TTFT |
|---|---|---|---|---|
| 1 | 58.22 |
30.39 |
58.22 |
8.05 s |
| 2 | 117.27 |
59.70 |
58.89 |
8.44 s |
| 4 | 236.10 |
116.33 |
59.71 |
9.00 s |
| 8 | 467.76 |
217.39 |
59.63 |
10.20 s |
| 16 | 922.18 |
396.11 |
59.60 |
11.97 s |
Important notes:
58-60 tok/s per active request.396 output tok/s.GPU KV cache size: 1,004,337 tokens, or 30.65x theoretical
concurrency for 32K requests. Only c16 has been validated.--limit-mm-per-prompt '{"image":4,"video":0,"audio":0}'; that
failed during Gemma4 unified dummy multimodal profiling. Keep
--limit-mm-per-prompt '{"image":4}'.Raw local result files:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c1-c2-c4-c8-20260607T034622Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c16-20260607T035019Z.json
Full reproduction notes:
../experiments/gemma4-12b-int4-autoround-vllm/README.md
Switch command:
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c64
Serving facts:
http://0.0.0.0:8000/v1Intel/gemma-4-12B-it-int4-AutoRoundgemma4-12b-it-int4-autoround44806465.25xContext search:
| Max model len | GPU KV tokens | vLLM full-context concurrency | Outcome |
|---|---|---|---|
15616 |
667253 |
42.73x |
too high |
10368 |
507081 |
48.91x |
too high |
7680 |
405664 |
52.82x |
too high |
4864 |
292589 |
60.15x |
too high |
4096 |
291995 |
71.29x |
fits |
4480 |
292317 |
65.25x |
selected |
Short-prompt c64 TTFT with about 123 prompt tokens and one generated token:
| Mean TTFT | p50 TTFT | p95 TTFT | Max TTFT |
|---|---|---|---|
0.882 s |
0.712 s |
1.139 s |
1.140 s |
Short-prompt c64 output run with about 123 prompt tokens and 128 generated
tokens per request:
| Concurrency | Aggregate output tok/s, wall |
|---|---|
64 |
1614.41 |
Near-limit prompt smoke:
| Actual prompt tokens | Output tokens | TTFT |
|---|---|---|
4323 |
1 |
0.631 s |
Raw local result files:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-ttft-100p-1o-20260607T062129Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-decode-100p-128o-20260607T062143Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-longprompt-4300p-1o-20260607T062228Z.json
Switch command:
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8
Serving facts:
http://0.0.0.0:8000/v1Intel/gemma-4-12B-it-int4-AutoRoundgemma4-12b-it-int4-autoround3276880 s fail-fast when all generation slots are busy30.67x1004909 tokensShort-prompt c8 TTFT with about 119 prompt tokens and one generated token:
| Mean TTFT | Max TTFT |
|---|---|
0.151 s |
0.183 s |
Short-prompt c8 output run with about 119 prompt tokens and 128 generated
tokens per request:
| Profile | Concurrency | Aggregate output tok/s, wall |
|---|---|---|
| pre-graph c8 | 8 |
247.49 |
| XPU graph c8, promoted mean of 3 | 8 |
703.59 |
The XPU graph promotion matched the saved quality canary hashes before it was copied into the canonical production profile. It keeps the same model, quantization, context length, concurrency cap, and prefix caching; it changes only the compile/graph execution path:
XPU_GRAPH=1
VLLM_XPU_ENABLE_XPU_GRAPH=1
VLLM_XPU_FORCE_GRAPH_WITH_COMM=1
VLLM_XPU_GRAPH_NOOP_COMM_CAPTURE=1
VLLM_COMPILATION_CONFIG='{"use_inductor_graph_partition":true,"compile_sizes":[1],"cudagraph_mode":"PIECEWISE"}'
Post-promotion startup facts:
torch.compile took 4.12 s
Graph capturing finished in 4 s
init engine took 19.05 s
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x
Sustained decode checks on the promoted graph profile:
| Shape | Concurrency | Wall aggregate output tok/s |
|---|---|---|
119 prompt tokens, 256 output tokens |
8 |
796.18 |
119 prompt tokens, 512 output tokens |
8 |
780.97 |
119 prompt tokens, 1024 output tokens |
8 |
731.12 |
Production soak on 2026-06-07 used the active c8 profile on the no-auth LAN
endpoint with 32768 max model length, 8 active generations, prefix caching,
and XPU graph capture. The scheduled main soak cycles 1-32 ran from
20260607T090844Z through 20260607T165344Z: 31/32 cycles passed the quality
gate and c8/512 decode check. Clean cycles averaged 781.04 tok/s wall
aggregate output at 2.551 s mean TTFT, with a 765.44-784.37 tok/s range.
One scheduled quality anomaly occurred at cycle 17: copy_phrase returned
slh cobalt orbit instead of satin cobalt orbit. Three immediate reruns and
a 25-repeat quality stress loop passed exactly, and the next scheduled cycle
returned to normal. Do not hide this in future reports; treat it as an isolated
deterministic-canary anomaly unless it repeats.
The first soak run also exposed a harness bug after the last scheduled cycle:
the script spun about every 10 s instead of sleeping to the deadline. Those
rapid cycles are not normal soak data. scripts/run-gemma4-production-soak.sh
now sleeps to the deadline and exits cleanly.
Fixed-harness continuation:
| Cycle | UTC | Quality | Wall aggregate output tok/s | Mean TTFT |
|---|---|---|---|---|
1 |
20260607T170030Z |
pass | 779.20 |
2.559 s |
2 |
20260607T171530Z |
pass | 784.75 |
2.546 s |
Frontdoor streaming fix: an apparent c1 TTFT regression after the soak was
traced to the LAN proxy, not the vLLM backend. The backend streamed first text
in about 34 ms, but scripts/openai-lan-frontdoor.py was forwarding with
response.read(65536), which buffered SSE events. The frontdoor now forwards
text/event-stream responses line-by-line and flushes each line.
After restarting only b70-openai-frontdoor.service, public endpoint TTFT
matched the backend again:
| Shape | Concurrency | Mean TTFT | Wall aggregate output tok/s |
|---|---|---|---|
119 prompt tokens, 512 output tokens |
1 |
0.036 s |
112.87 |
119 prompt tokens, 512 output tokens |
8 |
0.099 s |
783.62 |
The production frontdoor is not a batch queue. It admits up to 8 active
generation requests immediately. When all 8 slots are busy, the next
generation request fails fast with HTTP 503 instead of waiting behind a long
queue. A 9-way overload check after setting FRONTDOOR_QUEUE_TIMEOUT_S=0
admitted 8 requests and returned one 503 in 0.221 s.
512-output scaling on the same production endpoint:
| Concurrency | Wall aggregate output tok/s |
|---|---|
1 |
112.77 |
2 |
205.47 |
4 |
398.98 |
8 |
784.69 |
Long-prompt c8 decode is prefill-bound when cold, but prefix caching helps repeated sessions:
| Shape | Mean TTFT | Wall aggregate output tok/s |
|---|---|---|
15357 prompt tokens, 128 output tokens, c8 |
21.46 s |
47.12 |
28774 prompt tokens, 128 output tokens, c8 |
22.44 s |
45.13 |
repeated 28774 prompt-token shape after prefix cache warm |
3.75 s |
270.74 |
Near-full-context probes:
| Shape | Prompt tokens each | Output tokens each | Mean TTFT | Max TTFT | Notes |
|---|---|---|---|---|---|
c8 cold-ish long prefill |
30690 |
1 |
22.17 s |
39.03 s |
Before the repeated prefix was warm. |
c8 prefix-cache-warm near limit |
32703 |
1 |
1.94 s |
3.22 s |
Reused most of the prior repeated prefix. |
Over-limit canary:
32894 input tokens + 1 output token was rejected, correctly, because it exceeds 32768 total tokens.
Raw local result files:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-1o-fastprompt-20260607T065104Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-128o-fastprompt-20260607T065116Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32000p-1o-20260607T064822Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32703p-1o-20260607T065041Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-promoted-20260607T080322Z
Repo summary:
../experiments/gemma4-12b-int4-autoround-vllm/results-20260607-b70-c8-32768.json