This folder tracks the Gemma 4 bring-up on the 4x Intel Arc Pro B70 host.
Goal:
Model: Intel/gemma-4-12B-it-int4-AutoRound
Frontdoor bind: 0.0.0.0:8000

Status on 2026-06-07: c8 is the active production profile. It uses XPU graph capture, prefix caching, 32K context, and a fail-fast 8-active-generation LAN frontdoor. c10 is kept as a research profile for extra short-prompt concurrency, but c8 remains the better full-context production choice. c12 was rejected after a Level Zero out-of-resources/device-lost failure under burst load. c16 and c64 remain documented alternate/research profiles.
The endpoint is running through the generic model-slot services:
Public endpoint: http://<server-lan-ip>:8000/v1
Auth: none
Served model name: gemma4-12b-it-int4-autoround
Backend: vLLM/XPU on 127.0.0.1:18080
Production c8 profile: 32768 context, 8 live generations
Research c10 profile: 32768 context, 10 live generations
Rejected c12 profile: 32768 context, 12 live generations
High-concurrency c64 profile: 4480 context, 64 live generations
Modalities tested: text, image
Text smoke after the final restart returned exactly OK.
A real base64 PNG image request returned Blue.
Hugging Face:
https://huggingface.co/Intel/gemma-4-12B-it-int4-AutoRound
The Intel model card describes this as a W4A16 AutoRound quantization of
google/gemma-4-12B-it, with group size 128 and symmetric quantization. The
local vLLM startup logs report quantization=inc, which means the checkpoint is
using the Intel AutoRound/INC INT4 path rather than the rejected FP8/BF16
fallback direction.
Local model path:
/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel
Downloaded footprint on this host is about 7.3G.
The original local stack had Transformers 5.7.0, which did not recognize
model_type=gemma4_unified. The fix was:
- Upgrade Transformers to 5.10.2.
- Add the gemma4_unified.py model implementation.
- Register Gemma4UnifiedForConditionalGeneration in vLLM’s model registry.
- Add the gemma4_mm.py helper used by the unified implementation.

Patch snapshot:
../../patches/vllm-gemma4-unified-backport-b70-20260607.patch
Validation commands used after patching:
/home/steve/.venvs/vllm-xpu/bin/python -m py_compile \
/home/steve/src/vllm/vllm/model_executor/models/gemma4_mm.py \
/home/steve/src/vllm/vllm/model_executor/models/gemma4_unified.py \
/home/steve/src/vllm/vllm/model_executor/models/registry.py
/home/steve/.venvs/vllm-xpu/bin/python - <<'PY'
from transformers import AutoConfig, AutoProcessor
from vllm.model_executor.models.registry import ModelRegistry
model = "/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel"
cfg = AutoConfig.from_pretrained(model, trust_remote_code=True)
proc = AutoProcessor.from_pretrained(model, trust_remote_code=True)
print(type(cfg).__name__, cfg.model_type, cfg.architectures)
print(type(proc).__name__)
print(ModelRegistry.resolve_model_cls(cfg.architectures)[0].__name__)
PY
Expected output includes:
Gemma4UnifiedConfig gemma4_unified ['Gemma4UnifiedForConditionalGeneration']
Gemma4UnifiedProcessor
Gemma4UnifiedForConditionalGeneration
Base c16 profile:
../../configs/model-slots/gemma4-12b-it-int4-autoround.env
Key settings:
MODEL_SLOT_HF_ID="Intel/gemma-4-12B-it-int4-AutoRound"
MODEL_DIR=/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel
VLLM_DTYPE=bfloat16
VLLM_QUANTIZATION=auto
VLLM_TENSOR_PARALLEL_SIZE=4
VLLM_MAX_MODEL_LEN=32768
VLLM_MAX_NUM_BATCHED_TOKENS=4096
VLLM_MAX_NUM_SEQS=16
VLLM_ENABLE_PREFIX_CACHING=1
VLLM_EXTRA_ARGS=(--limit-mm-per-prompt '{"image":4}')
FRONTDOOR_HOST=0.0.0.0
FRONTDOOR_PORT=8000
FRONTDOOR_MAX_ACTIVE_GENERATIONS=16
High-concurrency profile:
../../configs/model-slots/gemma4-12b-it-int4-autoround-c64.env
Key c64 settings:
VLLM_MAX_MODEL_LEN=4480
VLLM_MAX_NUM_BATCHED_TOKENS=4096
VLLM_MAX_NUM_SEQS=64
VLLM_ENABLE_PREFIX_CACHING=1
FRONTDOOR_MAX_ACTIVE_GENERATIONS=64
Active production c8 profile:
../../configs/model-slots/gemma4-12b-it-int4-autoround-c8.env
Key c8 settings:
VLLM_MAX_MODEL_LEN=32768
VLLM_MAX_NUM_BATCHED_TOKENS=4096
VLLM_MAX_NUM_SEQS=8
VLLM_ENABLE_PREFIX_CACHING=1
XPU_GRAPH=1
VLLM_XPU_ENABLE_XPU_GRAPH=1
VLLM_XPU_FORCE_GRAPH_WITH_COMM=1
VLLM_XPU_GRAPH_NOOP_COMM_CAPTURE=1
VLLM_COMPILATION_CONFIG='{"use_inductor_graph_partition":true,"compile_sizes":[1],"cudagraph_mode":"PIECEWISE"}'
FRONTDOOR_MAX_ACTIVE_GENERATIONS=8
FRONTDOOR_QUEUE_TIMEOUT_S=0
The c8 production frontdoor is intentionally fail-fast, not a bulk queue. It
admits up to 8 active generation requests. A 9th generation request receives
HTTP 503 immediately when all slots are busy.
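
The fail-fast policy described above can be illustrated with a small admission gate. This is a minimal sketch, not the code in scripts/openai-lan-frontdoor.py: the `FailFastGate` class and `status_for_request` helper are hypothetical names, and the only behavior taken from this document is "8 slots, no waiting, 503 when full".

```python
import threading


class FailFastGate:
    """Admit up to `limit` concurrent generations; reject the rest at once."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_admit(self) -> bool:
        # blocking=False is the fail-fast part: never wait for a free slot.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call when an admitted generation finishes or its client disconnects.
        self._slots.release()


def status_for_request(gate: FailFastGate) -> int:
    """Return the HTTP status a new generation request would receive."""
    return 200 if gate.try_admit() else 503


# Nine back-to-back requests against an 8-slot gate, none finished yet.
gate = FailFastGate(limit=8)
statuses = [status_for_request(gate) for _ in range(9)]
print(statuses)
```

A queueing frontdoor would instead block on the semaphore with a timeout; the non-blocking acquire is what turns overload into an immediate, visible error for the client.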
VLLM_DTYPE=bfloat16 is the 16-bit activation/runtime dtype. The weights remain
the INT4 AutoRound checkpoint; this is not the rejected Qwen FP8 BF16-dequant
fallback.
Switch to the base c16 profile:
cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround
Switch to the 64-active-client profile:
cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c64
Switch to the full-32K, 8-active-generation profile:
cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8
Try the full-32K, 10-active-generation research profile:
cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c10
Do not use gemma4-12b-it-int4-autoround-c12 for production. It is retained
only to reproduce the 2026-06-07 failure boundary.
Check status:
curl -fsS http://127.0.0.1:8000/status
curl -fsS http://127.0.0.1:8000/v1/models
Expected /v1/models includes:
{
"id": "gemma4-12b-it-int4-autoround",
"max_model_len": 32768
}
The c64 profile reports max_model_len=4480.
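
A scripted version of this check might look like the sketch below. It parses a saved /v1/models response rather than calling the endpoint, so it runs offline. The `SAMPLE` document is an assumption modeled on the excerpt above plus the usual OpenAI list wrapper (`object`, `data`); a live response may carry more fields. `served_context` is a hypothetical helper name.

```python
import json
from typing import Optional

# Assumed sample: the fields shown in the excerpt above, wrapped in the
# OpenAI-style list envelope.
SAMPLE = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "gemma4-12b-it-int4-autoround", "max_model_len": 32768}
  ]
}
""")


def served_context(models_response: dict, model_id: str) -> Optional[int]:
    """Return max_model_len for model_id, or None if it is not served."""
    for entry in models_response.get("data", []):
        if entry.get("id") == model_id:
            return entry.get("max_model_len")
    return None


context = served_context(SAMPLE, "gemma4-12b-it-int4-autoround")
# 32768 means a full-context profile; 4480 means the c64 profile is loaded.
print("full-context profile" if context == 32768 else f"context={context}")
```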
Text:
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"gemma4-12b-it-int4-autoround",
"messages":[{"role":"user","content":"Reply with exactly: OK"}],
"max_tokens":8,
"temperature":0
}'
Image requests use the normal OpenAI image_url content shape. The validated
smoke used a base64 PNG data URL and asked for the dominant color.
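
As an illustration of that request shape, the sketch below builds a solid-color PNG with the standard library and wraps it in an image_url content part. It only constructs the JSON body; sending it is left to curl or any HTTP client. The helper names, the 16x16 image size, and the question wording are assumptions, not the exact smoke test used here.

```python
import base64
import json
import struct
import zlib


def solid_png(width: int, height: int, rgb: tuple) -> bytes:
    """Encode a single-color 8-bit RGB PNG using only the standard library."""

    def chunk(kind: bytes, data: bytes) -> bytes:
        body = kind + data
        crc = zlib.crc32(body) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)

    # IHDR: width, height, bit depth 8, color type 2 (RGB), no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    row = b"\x00" + bytes(rgb) * width  # filter byte 0, then RGB pixels
    idat = zlib.compress(row * height)
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", idat) + chunk(b"IEND", b""))


def image_chat_payload(model: str, png: bytes, question: str) -> dict:
    """Build a chat completion body with one text part and one image part."""
    data_url = "data:image/png;base64," + base64.b64encode(png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "max_tokens": 8,
        "temperature": 0,
    }


payload = image_chat_payload(
    "gemma4-12b-it-int4-autoround",
    solid_png(16, 16, (0, 0, 255)),
    "What is the dominant color? Reply with one word.",  # assumed wording
)
print(len(json.dumps(payload)), "bytes of JSON")
```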
The useful benchmark run used forced generation with ignore_eos=true:
cd /home/steve/llm-optimizations
/home/steve/.venvs/vllm-xpu/bin/python scripts/bench-openai-concurrency.py \
--base-url http://127.0.0.1:8000 \
--tokenizer /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel \
--prompt-tokens 2048 \
--output-tokens 512 \
--concurrency 1 \
--concurrency 2 \
--concurrency 4 \
--concurrency 8 \
--warmups 1 \
--timeout 1800 \
--output-json /mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c1-c2-c4-c8-20260607T034622Z.json
/home/steve/.venvs/vllm-xpu/bin/python scripts/bench-openai-concurrency.py \
--base-url http://127.0.0.1:8000 \
--tokenizer /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel \
--prompt-tokens 2048 \
--output-tokens 512 \
--concurrency 16 \
--warmups 1 \
--timeout 1800 \
--output-json /mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c16-20260607T035019Z.json
The earlier single-2k-512 file is not a valid decode-rate measurement because
it allowed early EOS and the model stopped after only a few generated tokens.
Prompt tokens were about 2071 per request. Each request generated 512
tokens. Decode rates below are output-token rates after first streamed text,
which isolates generation from the long 2K prefill/TTFT portion. Wall aggregate
is also included because it is what a user feels for full prompt+decode runs.
| Concurrency | Aggregate decode tok/s after first text | Aggregate output tok/s wall | Mean per-request decode tok/s | Mean TTFT |
|---|---|---|---|---|
| 1 | 58.22 | 30.39 | 58.22 | 8.05 s |
| 2 | 117.27 | 59.70 | 58.89 | 8.44 s |
| 4 | 236.10 | 116.33 | 59.71 | 9.00 s |
| 8 | 467.76 | 217.39 | 59.63 | 10.20 s |
| 16 | 922.18 | 396.11 | 59.60 | 11.97 s |
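
The relationship between the two rate columns can be sketched with the c1 row. This is an illustration under an assumption: that both rates count all 512 output tokens, with the decode rate measured from first streamed text and the wall rate measured from request start. The benchmark script's exact accounting is not shown in this document, and the function names are hypothetical.

```python
def decode_rate_after_first_text(output_tokens: int, ttft_s: float,
                                 total_s: float) -> float:
    """Output tokens per second, counted only after the first streamed text."""
    return output_tokens / (total_s - ttft_s)


def wall_rate(output_tokens: int, total_s: float) -> float:
    """Output tokens per second over the whole request, prefill included."""
    return output_tokens / total_s


# c1 row: 512 output tokens, 8.05 s mean TTFT, 58.22 tok/s decode rate.
OUTPUT_TOKENS = 512
TTFT_S = 8.05
DECODE_TOK_S = 58.22

# Reconstruct the request duration implied by those two figures.
total_s = TTFT_S + OUTPUT_TOKENS / DECODE_TOK_S
c1_wall = wall_rate(OUTPUT_TOKENS, total_s)
c1_decode = decode_rate_after_first_text(OUTPUT_TOKENS, TTFT_S, total_s)
print(f"total {total_s:.2f} s, wall {c1_wall:.2f} tok/s, "
      f"decode {c1_decode:.2f} tok/s")
```

Under that assumption the implied wall rate lands within a few hundredths of the 30.39 tok/s reported for c1, which is why roughly half of the wall time for this 2K-prompt shape is prefill rather than decode.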
Interpretation:
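- Per-request decode speed was about 58-60 tok/s.
- Per-request decode stayed near 60 tok/s through c16.
- Aggregate decode scaled almost linearly through c16, reaching about 922 tok/s after TTFT.
- Wall aggregate throughput, including prefill, at c16 was about 396 output tok/s.

The 64-active-client profile cannot keep the 32K context window. The important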
finding is that changing max_num_seqs from 16 to 64 changes vLLM’s profiled
KV budget and compile shape. Do not estimate c64 context by dividing the c16
1,004,337 KV-token budget by 64.
Search results:
| Max model len | GPU KV tokens | vLLM full-context concurrency | Outcome |
|---|---|---|---|
| 15616 | 667253 | 42.73x | too high for 64 full contexts |
| 10368 | 507081 | 48.91x | too high for 64 full contexts |
| 7680 | 405664 | 52.82x | too high for 64 full contexts |
| 4864 | 292589 | 60.15x | too high for 64 full contexts |
| 4096 | 291995 | 71.29x | fits, but leaves more headroom |
| 4480 | 292317 | 65.25x | selected c64 profile |
Selected c64 profile:
max_model_len=4480
max_num_seqs=64
max_active_generations=64
prefix caching enabled
The selected profile uses 4480 * 64 = 286720 logical KV tokens against a
profiled 292317 token KV budget. That is close to full without crossing the
full-64 admission boundary.
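
The fit check behind the search table is simple arithmetic, sketched below with the logged values. The important input is the KV token count that vLLM reports at startup for each shape; as noted above, it changes with max_num_seqs and max_model_len, so it has to be read from the log rather than extrapolated. The function names are illustrative.

```python
# (max_model_len, GPU KV tokens reported at startup), from the search table.
SEARCH = [
    (15616, 667253),
    (10368, 507081),
    (7680, 405664),
    (4864, 292589),
    (4096, 291995),
    (4480, 292317),
]
TARGET_SEQS = 64


def full_context_concurrency(kv_tokens: int, max_model_len: int) -> float:
    """How many full-length contexts the KV budget can hold."""
    return kv_tokens / max_model_len


def fits(kv_tokens: int, max_model_len: int, seqs: int) -> bool:
    """True when `seqs` full-length contexts fit inside the KV budget."""
    return max_model_len * seqs <= kv_tokens


for max_model_len, kv_tokens in SEARCH:
    ratio = full_context_concurrency(kv_tokens, max_model_len)
    verdict = "fits" if fits(kv_tokens, max_model_len, TARGET_SEQS) else "too high"
    print(f"{max_model_len:>6} {kv_tokens:>7} {ratio:6.2f}x {verdict}")
```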
Short-prompt one-token TTFT probe:
| Concurrency | Prompt tokens each | Output tokens each | Mean TTFT | p50 TTFT | p95 TTFT | Max TTFT |
|---|---|---|---|---|---|---|
| 64 | 123 | 1 | 0.882 s | 0.712 s | 1.139 s | 1.140 s |
Short-prompt 128-token output run:
| Concurrency | Prompt tokens each | Output tokens each | Aggregate output tok/s wall | Mean TTFT field |
|---|---|---|---|---|
| 64 | 123 | 128 | 1614.41 | 5.04 s |
For the 128-token run, use the wall aggregate as the useful throughput number. The post-first-text decode field is not reliable for this short-prompt shape because vLLM/XPU coalesces streamed output chunks.
Near-limit prompt smoke:
| Requested prompt | Actual prompt tokens | Output tokens | TTFT |
|---|---|---|---|
| 4300 | 4323 | 1 | 0.631 s |
Raw files:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-ttft-100p-1o-20260607T062129Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-decode-100p-128o-20260607T062143Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-longprompt-4300p-1o-20260607T062228Z.json
The active production Gemma 4 profile keeps max_model_len=32768 and
caps live requests at 8. This keeps the LAN endpoint useful for full-context
clients without dropping to the c64 profile’s shorter 4480-token window.
Selected c8 profile:
max_model_len=32768
max_num_seqs=8
max_active_generations=8
prefix caching enabled
XPU graph capture enabled
Current production startup with XPU graph capture reported:
torch.compile took 4.12 s on cached graph restart
Graph capturing finished in 4 s
init engine took 19.05 s
Available KV cache memory: 27.48 GiB
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x
The 30.67x line is vLLM’s KV-capacity estimate. The service is intentionally
capped at 8 live generations because the goal for this profile is predictable
LAN behavior with full 32K context, not maximum theoretical admission.
Short-prompt one-token TTFT probe after rebuilding the benchmark prompt generator:
| Concurrency | Prompt tokens each | Output tokens each | Mean TTFT | Max TTFT |
|---|---|---|---|---|
| 8 | 119 | 1 | 0.151 s | 0.183 s |
Short-prompt 128-token output run:
| Profile | Concurrency | Prompt tokens each | Output tokens each | Aggregate output tok/s wall | Mean TTFT field |
|---|---|---|---|---|---|
| pre-graph c8 | 8 | 119 | 128 | 247.49 | 4.12 s |
| XPU graph c8 promoted mean | 8 | 119 | 128 | 703.59 | 1.46 s |
For the 128-token run, use the wall aggregate as the useful throughput number. The post-first-text decode field is not reliable for this short-prompt shape because vLLM/XPU coalesces streamed output chunks.
Near-full-context probes:
| Shape | Prompt tokens each | Output tokens each | Mean TTFT | Max TTFT | Notes |
|---|---|---|---|---|---|
| c8 cold-ish long prefill | 30690 | 1 | 22.17 s | 39.03 s | Useful long-prefill signal before the repeated prefix was warm. |
| c8 prefix-cache-warm near limit | 32703 | 1 | 1.94 s | 3.22 s | Reused most of the earlier repeated 30.7K-token prefix. |
Over-limit canary:
requested target 34300 -> 32894 input tokens per c8 lane
requested output tokens: 1
result: rejected as expected
reason: 32894 + 1 exceeds the 32768 max context window
This is the right behavior. For one output token, the prompt must stay at or
below 32767 input tokens. For normal generation, leave more room for output.
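
The admission rule the canary exercised is just prompt tokens plus requested output tokens against the context window. A minimal sketch, with illustrative function names and the window size taken from the c8 profile:

```python
MAX_MODEL_LEN = 32768  # c8 profile context window


def max_prompt_tokens(output_tokens: int,
                      max_model_len: int = MAX_MODEL_LEN) -> int:
    """Largest prompt that still leaves room for the requested output."""
    return max_model_len - output_tokens


def fits_context(prompt_tokens: int, output_tokens: int,
                 max_model_len: int = MAX_MODEL_LEN) -> bool:
    """True when prompt plus requested output stays inside the window."""
    return prompt_tokens + output_tokens <= max_model_len


print(fits_context(32894, 1))   # the over-limit canary
print(fits_context(32703, 1))   # the near-limit probe
print(max_prompt_tokens(512))   # prompt budget for a 512-token reply
```

Clients that tokenize locally can run this check before sending, which avoids spending one of the 8 production slots on a request that will be rejected anyway.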
Raw files:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-1o-fastprompt-20260607T065104Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-128o-fastprompt-20260607T065116Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32000p-1o-20260607T064822Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32703p-1o-20260607T065041Z.json
Repo summary:
results-20260607-b70-c8-32768.json
After promoting c8 to production metadata, the same profile was restarted and
revalidated. Cached restart loaded AOT compile in 4.91 s, initialized the
engine in 17.88 s, and reported 1,004,337 GPU KV tokens, or 30.65x
theoretical full-32K concurrency.
Quality canary:
exact_ok: OK
copy_phrase: satin cobalt orbit
small_arithmetic: 7
red_image: Red
Sequential production benchmark baseline:
| Shape | Prompt tokens each | Output tokens each | Concurrency | Primary result |
|---|---|---|---|---|
| short TTFT | 119 | 1 | 8 | mean TTFT 0.106 s |
| short decode | 119 | 128 | 8 | wall aggregate 250.21 tok/s |
| long prefill | 30690 | 1 | 8 | mean TTFT 22.21 s |
Repo summary:
results-20260607-production-c8-baseline.json
On 2026-06-07, the c8 production profile was updated to enable XPU graph
capture with communication ops. This keeps the same model, INT4 AutoRound
weights, bfloat16 activation dtype, 32768 context, c8 concurrency cap, and
prefix caching. It changes only the compile/graph execution path.
Validated settings:
XPU_GRAPH=1
VLLM_XPU_ENABLE_XPU_GRAPH=1
VLLM_XPU_FORCE_GRAPH_WITH_COMM=1
VLLM_XPU_GRAPH_NOOP_COMM_CAPTURE=1
VLLM_COMPILATION_CONFIG='{"use_inductor_graph_partition":true,"compile_sizes":[1],"cudagraph_mode":"PIECEWISE"}'
Post-promotion startup on the canonical production slot reported:
torch.compile took 4.12 s in total
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x
Graph capturing finished in 4 s
init engine took 19.05 s
Quality gate:
expected outputs: pass
baseline text/hash comparison: pass
text checks: OK, satin cobalt orbit, 7
image check: Red
Repeated production c8 short-decode checks after promotion:
| Run | Prompt tokens each | Output tokens each | Concurrency | Mean TTFT | Wall aggregate output tok/s |
|---|---|---|---|---|---|
| 1 | 119 | 128 | 8 | 1.658 s | 613.72 |
| 2 | 119 | 128 | 8 | 1.354 s | 751.58 |
| 3 | 119 | 128 | 8 | 1.365 s | 745.48 |
| mean | 119 | 128 | 8 | 1.459 s | 703.59 |
This is the current best production profile. The earlier non-graph c8
production baseline was about 240-250 tok/s for the same short-decode shape.
The graph branch also ran a longer validation loop before promotion:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/xpugraph-validation-20260607T075901Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-promoted-20260607T080322Z
The 32K prefill leg did not materially improve with XPU graph. It stayed around
22.28 s mean TTFT for eight concurrent 30690-token prompts with one output
token. Treat the win as a decode-path improvement, not a long-prefill fix.
Sustained decode characterization on the promoted production c8 graph profile:
| Shape | Prompt tokens each | Output tokens each | Concurrency | Mean TTFT | Wall aggregate output tok/s |
|---|---|---|---|---|---|
| repeat mean | 119 | 256 | 8 | 2.530 s | 796.18 |
| repeat mean | 119 | 512 | 8 | 2.542 s | 780.97 |
| repeat mean | 119 | 1024 | 8 | 2.519 s | 731.12 |
512-token scaling on the same production endpoint:
| Concurrency | Prompt tokens each | Output tokens each | Mean TTFT | Wall aggregate output tok/s |
|---|---|---|---|---|
| 1 | 119 | 512 | 2.190 s | 112.77 |
| 2 | 119 | 512 | 2.415 s | 205.47 |
| 4 | 119 | 512 | 2.500 s | 398.98 |
| 8 | 119 | 512 | 2.526 s | 784.69 |
Long-prompt decode with the same c8 production endpoint:
| Shape | Prompt tokens each | Output tokens each | Concurrency | Mean TTFT | Wall aggregate output tok/s | Notes |
|---|---|---|---|---|---|---|
| cold-ish 16K | 15357 | 128 | 8 | 21.46 s | 47.12 | Includes prefill. |
| cold-ish 30K | 28774 | 128 | 8 | 22.44 s | 45.13 | Includes prefill. |
| prefix-cache repeat 30K | 28774 | 128 | 8 | 3.75 s | 270.74 | Same repeated prompt shape after cache warm. |
The post-first-text decode fields in these long-prompt runs are not useful: vLLM/XPU coalesced the streamed output into large chunks. Use TTFT and wall aggregate throughput for long-prompt comparisons.
Raw result directories:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-256o-repeat-20260607T084540Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-512o-repeat-20260607T084633Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-1024o-repeat-20260607T084718Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-scaling-512o-20260607T084806Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-longprompt-decode-20260607T090436Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-longprompt-30000p-128o-cache-repeat-20260607T090558Z.json
On 2026-06-07, the apparent spare KV budget was tested by raising the full-32K
active generation cap above c8. The important lesson is that vLLM’s reported KV
token budget is only a capacity estimate. It does not guarantee that higher
max_num_seqs shapes are stable or faster under real scheduler/runtime
pressure.
c12 startup did succeed and reported:
GPU KV cache size: 991,437 tokens
Maximum concurrency for 32,768 tokens per request: 30.26x
But the first bursty client run accidentally sent too much concurrent work while quality and benchmark clients overlapped. c12 then failed with:
RuntimeError: level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)
That makes c12 a rejected profile for now. It is kept as
configs/model-slots/gemma4-12b-it-int4-autoround-c12.env so another run can
reproduce or debug the boundary.
c10 was then tested as the first safer step above production c8. c10 startup
also reported 991,437 GPU KV tokens and 30.26x theoretical full-32K
concurrency. First launch compiled:
compile range (1, 1): 83.14 s
compile range (1, 4096): 100.85 s
c10 quality matched the production canary:
exact_ok: OK
copy_phrase: satin cobalt orbit
small_arithmetic: 7
red_image: Red
Short-prompt decode:
| Profile | Prompt tokens each | Output tokens each | Concurrency | Mean TTFT | Wall aggregate output tok/s |
|---|---|---|---|---|---|
| c10 backend, c8 load | 119 | 512 | 8 | 0.140 s | 754.88 |
| c10 backend, c10 load | 120 | 512 | 10 | 0.221 s | 849.59 |
Short-prompt one-token TTFT:
| Concurrency | Prompt tokens each | Output tokens each | Mean TTFT |
|---|---|---|---|
| 1 | 119 | 1 | 0.036 s |
| 8 | 119 | 1 | 0.166 s |
| 10 | 120 | 1 | 0.198 s |
For short prompts, c10 is useful: it raised aggregate 512-token decode from
about 755 tok/s at c8 load to about 850 tok/s at c10 load. The tradeoff was
slightly higher TTFT and lower per-request decode speed.
For full-context-style requests, c10 was not better than c8:
| Shape | Profile load | Prompt tokens each | Output tokens each | Mean TTFT | Wall aggregate output tok/s |
|---|---|---|---|---|---|
| unique near-32K TTFT | c8 | 30690 | 1 | 22.20 s | 0.205 |
| unique near-32K TTFT | c10 | 30744 | 1 | 27.21 s | 0.204 |
| unique near-32K decode | c8 | 30690 | 128 | 23.61 s | 22.73 |
| unique near-32K decode | c10 | 30744 | 128 | 29.25 s | 22.76 |
The service can accept c10 full-context work, but vLLM internally stages long prefills. c10 adds latency and does not improve aggregate full-context output throughput.
The benchmark harness now supports fixed-prefix tests:
/home/steve/.venvs/vllm-xpu/bin/python scripts/bench-openai-concurrency.py \
--base-url http://127.0.0.1:8000 \
--tokenizer /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel \
--prompt-tokens 32000 \
--shared-prefix-tokens 16000 \
--prompt-salt unique-a \
--output-tokens 1 \
--concurrency 8 \
--concurrency 10 \
--warmups 0 \
--timeout 2400 \
--output-json /mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency32k-20260607T190702Z/c10-shared-prefix-salted-ttft-c8-c10-32000p-16000shared-1o.json
Clean fixed-prefix plus unique-tail results, with about 16K shared leading tokens and about 16K unique tail tokens:
| Shape | Profile load | Prompt tokens each | Output tokens each | Mean TTFT | Wall aggregate output tok/s |
|---|---|---|---|---|---|
| half-shared 32K TTFT | c8 | 31021 | 1 | 12.45 s | 0.371 |
| half-shared 32K TTFT | c10 | 31034 | 1 | 15.11 s | 0.372 |
| half-shared 32K decode | c8 | 31021 | 128 | 13.24 s | 39.20 |
| half-shared 32K decode | c10 | 31034 | 128 | 16.04 s | 39.67 |
Prefix caching is therefore valuable for website-generation traffic with a
fixed system/project prefix and unique user content: it roughly cut TTFT for
near-32K c8 requests from 22.20 s to 12.45 s in this synthetic half-shared
test. It did not make c10 materially better than c8 for full-context requests.
Recommendation: keep production on c8 for 32K request support. Use c10 only as a research profile for short-prompt high-throughput traffic. Do not use c12 unless debugging the device-lost boundary.
Raw result directory:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency32k-20260607T190702Z
On 2026-06-07, the active c8 XPU-graph production profile was left running on the public no-auth LAN endpoint and checked with the reusable soak harness:
DURATION_S=28800 INTERVAL_S=900 PROMPT_TOKENS=100 OUTPUT_TOKENS=512 \
CONCURRENCY=8 scripts/run-gemma4-production-soak.sh
Scheduled cycles 1-32 covered the main soak window from 20260607T090844Z
through 20260607T165344Z. The production endpoint stayed on:
model slot: gemma4-12b-it-int4-autoround-c8
max_model_len: 32768
max active generations: 8
prefix caching: enabled
auth: none
Scheduled result summary:
| Metric | Value |
|---|---|
| Scheduled cycles | 32 |
| Clean cycles | 31 |
| Quality anomaly cycles | 1 |
| Mean wall aggregate output tok/s, clean cycles | 781.04 |
| Min wall aggregate output tok/s, clean cycles | 765.44 |
| Max wall aggregate output tok/s, clean cycles | 784.37 |
| Mean TTFT, clean cycles | 2.551 s |
The one scheduled quality anomaly was cycle 17: the copy_phrase canary
returned slh cobalt orbit instead of satin cobalt orbit. Exact OK,
arithmetic, and image-color canaries still passed. Three immediate manual
reruns and a 25-repeat quality-only stress loop then matched the baseline
hashes exactly, and cycle 18 returned to normal.
The original soak harness had a final-interval bug: after the last scheduled
cycle, it stopped sleeping and produced rapid cycles about every 10 s until
the deadline. Treat cycles 33+ in the first run as harness-bug data, not
normal soak samples. The harness now sleeps to the deadline and exits instead
of spinning.
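
The corrected scheduling can be sketched as follows. This is a Python illustration of the idea, not the shell harness itself: `run_soak`, the injectable `now`/`sleep` parameters, and the fake clock are all hypothetical and exist so the schedule can be checked without waiting eight hours. The demo assumes each cycle takes about 10 s.

```python
import time


def run_soak(duration_s, interval_s, run_cycle,
             now=time.monotonic, sleep=time.sleep):
    """Run one cycle per interval, then sleep to the deadline and exit."""
    start = now()
    deadline = start + duration_s
    cycles = 0
    while True:
        next_start = start + cycles * interval_s
        if next_start >= deadline:
            break  # no scheduled slot left: do not spin through extra cycles
        delay = next_start - now()
        if delay > 0:
            sleep(delay)
        run_cycle(cycles + 1)
        cycles += 1
    remaining = deadline - now()
    if remaining > 0:
        sleep(remaining)  # idle out the final interval instead of looping
    return cycles


class FakeClock:
    """Deterministic clock so the schedule can be verified instantly."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


clock = FakeClock()
starts = []


def fake_cycle(number):
    starts.append(clock.now())
    clock.sleep(10)  # assumed cycle duration for the demo


count = run_soak(28800, 900, fake_cycle, now=clock.now, sleep=clock.sleep)
print(count, starts[-1], clock.now())
```

With DURATION_S=28800 and INTERVAL_S=900 this yields the 32 scheduled cycles seen in the summary table, and the clock ends exactly at the deadline.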
A clean continuation with the fixed harness ran after the bug was patched:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-soak-20260607T170030Z
Continuation result:
| Cycle | UTC | Quality | Wall aggregate output tok/s | Mean TTFT |
|---|---|---|---|---|
| 1 | 20260607T170030Z | pass | 779.20 | 2.559 s |
| 2 | 20260607T171530Z | pass | 784.75 | 2.546 s |
The first c1 post-soak benchmark appeared to show about 2.16 s TTFT for a
119-token prompt with 512 generated tokens. Direct backend testing showed
this was not model/prefill latency: vLLM on 127.0.0.1:18080 streamed first
text in about 34 ms.
Cause: the LAN frontdoor was proxying backend responses with
response.read(65536). For text/event-stream responses this can buffer many
SSE events before forwarding them, making client-observed TTFT look much worse
than backend TTFT.
Fix: scripts/openai-lan-frontdoor.py now forwards text/event-stream
responses line-by-line and flushes after every SSE line.
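
The difference between chunked and line-by-line forwarding can be illustrated with in-memory streams. This is a sketch of the technique only, not the code in scripts/openai-lan-frontdoor.py; `forward_sse`, `RecordingSink`, and the sample event payloads are assumptions made for the example.

```python
import io


def forward_sse(upstream, downstream) -> int:
    """Copy an SSE byte stream line by line, flushing after every line."""
    lines = 0
    # readline() returns as soon as one line is available, whereas
    # read(65536) may wait until many events have accumulated.
    for line in iter(upstream.readline, b""):
        downstream.write(line)
        downstream.flush()
        lines += 1
    return lines


class RecordingSink(io.BytesIO):
    """In-memory stand-in for the client socket that counts flushes."""

    def __init__(self):
        super().__init__()
        self.flushes = 0

    def flush(self):
        self.flushes += 1
        super().flush()


# Made-up events; each SSE event is a data line followed by a blank line.
EVENTS = (b'data: {"delta":"O"}\n\n'
          b'data: {"delta":"K"}\n\n'
          b'data: [DONE]\n\n')

sink = RecordingSink()
forwarded = forward_sse(io.BytesIO(EVENTS), sink)
print(forwarded, sink.flushes)
```

Flushing on the blank separator line matters as much as flushing on the data line, because a client-side SSE parser dispatches an event only once it sees that separator.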
Clean public-endpoint checks after restarting only the frontdoor:
| Shape | Concurrency | Prompt tokens each | Output tokens each | Mean TTFT | Wall aggregate output tok/s |
|---|---|---|---|---|---|
| short decode | 1 | 119 | 512 | 0.036 s | 112.87 |
| short decode | 8 | 119 | 512 | 0.099 s | 783.62 |
The queue policy was then changed from the old one-hour timeout to fail-fast:
FRONTDOOR_QUEUE_TIMEOUT_S=0. A 9-way overload test admitted 8 requests and
returned one HTTP 503 in 0.221 s, while the admitted requests streamed
normally.
Direct frontdoor/backend streaming comparison after the fix:
| Endpoint | Max tokens | First text |
|---|---|---|
| frontdoor :8000 | 128 | 0.034 s |
| backend :18080 | 128 | 0.034 s |
| frontdoor :8000 | 512 | 0.037 s |
| backend :18080 | 512 | 0.034 s |
Raw soak paths:
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-soak-20260607T090844Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-soak-20260607T170030Z
LocalMaxxing submissions:
| Shape | tok/s | ID |
|---|---|---|
| c8, 119 prompt, 256 output, repeat mean | 796.18 | cmq3jm75g000tlj01bx4frdf0 |
| c8, 119 prompt, 512 output, repeat mean | 780.97 | cmq3jm7cx000wlj01wm75wqmk |
Rejected same-day branches:
| Branch | Result |
|---|---|
| gemma4-12b-it-int4-autoround-c10 | Research only. Short prompts improved from about 755 to 850 wall aggregate tok/s versus c8 load, but near-32K request throughput did not improve and TTFT worsened. |
| gemma4-12b-it-int4-autoround-c12 | Rejected. Startup succeeded with 991,437 KV tokens and 30.26x theoretical 32K concurrency, but burst load hit Level Zero UR_RESULT_ERROR_OUT_OF_RESOURCES followed by UR_RESULT_ERROR_DEVICE_LOST. |
| gemma4-12b-it-int4-autoround-c8-mbt8192 | Quality matched, but c8 short decode fell to about 245.75 tok/s, short TTFT worsened, and GPU KV dropped to 730,379 tokens. |
| gemma4-12b-it-int4-autoround-c8-mbt2048 | Quality matched and GPU KV rose to 1,201,507 tokens, but c8 short decode fell to about 235.37 tok/s. |
| gemma4-12b-it-int4-autoround-c8-gmem097 | Rejected at startup. Free memory on xpu:0 was about 30.61/31.89 GiB, below the 0.97 utilization request of about 30.93 GiB. |
| gemma4-12b-it-int4-autoround-c8-gmem096 | Rejected at startup/engine init near the same memory boundary. Keep production at VLLM_GPU_MEMORY_UTILIZATION=0.95. |
| gemma4-12b-it-int4-autoround-c8-cs1-8 | Rejected. First compile had six ocloc/IGC error code 245 fallbacks and torch.compile took 315.80 s; cached repeat validation later hit UR_RESULT_ERROR_DEVICE_LOST during sampling. |
| gemma4-12b-it-int4-autoround-c8-xpugraph-mbt2048 | Rejected. It raised GPU KV to 1,201,940 tokens and 36.68x theoretical 32K concurrency, but first compile took 214.55 s and it hit UR_RESULT_ERROR_DEVICE_LOST during the canary/sampling path. |
| gemma4-12b-it-int4-autoround-c8-nolog | Quality matched, but no clear win. c8 short decode was 714.81 tok/s, within the promoted graph profile’s normal variance, while one-token TTFT worsened. |
Those branches are kept as reproducible profiles for future tuning, but they are not the active production path.
Known-good c16 startup reported:
Resolved architecture: Gemma4UnifiedForConditionalGeneration
quantization=inc
max_seq_len=32768
enable_prefix_caching=True
max_num_batched_tokens=4096
max_num_seqs=16
Loading weights took 0.73 s
torch.compile took 4.66 s in total on cached restart
GPU KV cache size: 1,004,337 tokens
Maximum concurrency for 32,768 tokens per request: 30.65x
The 30.65x line is a theoretical KV-capacity statement, not a claim that
c30 has been benchmarked. After later full-context testing, c8 is the
production profile because higher full-32K concurrency did not improve
long-context throughput.
Rejected c12 startup and failure:
max_seq_len=32768
max_num_seqs=12
GPU KV cache size: 991,437 tokens
Maximum concurrency for 32,768 tokens per request: 30.26x
ocloc failed with error code 245
IGC: Internal Compiler Error: Floating point exception
UR_RESULT_ERROR_OUT_OF_RESOURCES
UR_RESULT_ERROR_DEVICE_LOST
The c12 30.26x KV estimate was real, but runtime stability under burst load
was not. Treat this as an Intel/vLLM/XPU scheduler/runtime boundary, not a VRAM
capacity limit.
Research c10 startup:
max_seq_len=32768
max_num_seqs=10
compile range (1, 1) took 83.14 s on first launch
compile range (1, 4096) took 100.85 s on first launch
GPU KV cache size: 991,437 tokens
Maximum concurrency for 32,768 tokens per request: 30.26x
c10 passed the quality canary and short-prompt tests, but it is not the recommended 32K production profile because full-context throughput did not improve versus c8.
Known-good c64 startup reported:
max_seq_len=4480
enable_prefix_caching=True
max_num_batched_tokens=4096
max_num_seqs=64
torch.compile took 67.31 s on first launch for this shape
GPU KV cache size: 292,317 tokens
Maximum concurrency for 4,480 tokens per request: 65.25x
Each tried max-context value created a new torch.compile cache key and took about 66-67 seconds on first launch. Cached restarts should be faster, but c64 has a real first-start operational cost.
Current production c8 startup with XPU graph reported:
max_seq_len=32768
enable_prefix_caching=True
max_num_batched_tokens=4096
max_num_seqs=8
torch.compile took 4.12 s on cached graph restart
Graph capturing finished in 4 s
init engine took 19.05 s
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x
Do not set this:
VLLM_EXTRA_ARGS=(--limit-mm-per-prompt '{"image":4,"video":0,"audio":0}')
That launch failed during Gemma4 unified dummy multimodal profiling:
ValueError: Found 1 <|image|> tokens in the text but no images were passed.
RuntimeError: Engine core initialization failed.
The working setting is:
VLLM_EXTRA_ARGS=(--limit-mm-per-prompt '{"image":4}')
vLLM still logs a multimodal warmup warning and profiles a video-sized encoder cache budget, but real image requests work.