b70-optimization-lab

Gemma 4 12B IT INT4 AutoRound On B70 vLLM/XPU

This folder tracks the Gemma 4 bring-up on the 4x Intel Arc Pro B70 host.

Goal:

run Intel/gemma-4-12B-it-int4-AutoRound;
keep the public endpoint OpenAI-compatible on 0.0.0.0:8000;
support text and image requests;
keep a 32K context window (the current c8 profile keeps the full 32K and caps live generations at 8);
characterize the shorter c64 profile needed for 64 active clients;
measure single-user and concurrent decode behavior;
keep the setup reproducible for another Ubuntu 24.04 B70 system.

Current Status

Status on 2026-06-07: c8 is the active production profile. It uses XPU graph capture, prefix caching, 32K context, and a fail-fast 8-active-generation LAN frontdoor. c10 is kept as a research profile for extra short-prompt concurrency, but c8 remains the better full-context production choice. c12 was rejected after a Level Zero out-of-resources/device-lost failure under burst load. c16 and c64 remain documented alternate/research profiles.

The endpoint is running through the generic model-slot services:

Public endpoint: http://<server-lan-ip>:8000/v1
Auth: none
Served model name: gemma4-12b-it-int4-autoround
Backend: vLLM/XPU on 127.0.0.1:18080
Production c8 profile: 32768 context, 8 live generations
Research c10 profile: 32768 context, 10 live generations
Rejected c12 profile: 32768 context, 12 live generations
High-concurrency c64 profile: 4480 context, 64 live generations
Modalities tested: text, image

Text smoke after the final restart returned exactly OK. A real base64 PNG image request returned Blue.

Model

Hugging Face:

https://huggingface.co/Intel/gemma-4-12B-it-int4-AutoRound

The Intel model card describes this as a W4A16 AutoRound quantization of google/gemma-4-12B-it, with group size 128 and symmetric quantization. The local vLLM startup logs report quantization=inc, which means the checkpoint is using the Intel AutoRound/INC INT4 path rather than the rejected FP8/BF16 fallback direction.

Local model path:

/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel

Downloaded footprint on this host is about 7.3G.

Why It Needed Patching

The original local stack had Transformers 5.7.0, which did not recognize model_type=gemma4_unified. The fix was:

Upgrade the serving venv to Transformers 5.10.2.
Backport vLLM’s gemma4_unified.py model implementation.
Register Gemma4UnifiedForConditionalGeneration in vLLM’s model registry.
Add the missing gemma4_mm.py helper used by the unified implementation.

Patch snapshot:

../../patches/vllm-gemma4-unified-backport-b70-20260607.patch

Validation commands used after patching:

/home/steve/.venvs/vllm-xpu/bin/python -m py_compile \
  /home/steve/src/vllm/vllm/model_executor/models/gemma4_mm.py \
  /home/steve/src/vllm/vllm/model_executor/models/gemma4_unified.py \
  /home/steve/src/vllm/vllm/model_executor/models/registry.py

/home/steve/.venvs/vllm-xpu/bin/python - <<'PY'
# Check that Transformers resolves the config and processor, and that vLLM's
# registry maps the checkpoint architecture to the backported model class.
from transformers import AutoConfig, AutoProcessor
from vllm.model_executor.models.registry import ModelRegistry
model = "/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel"
cfg = AutoConfig.from_pretrained(model, trust_remote_code=True)
proc = AutoProcessor.from_pretrained(model, trust_remote_code=True)
print(type(cfg).__name__, cfg.model_type, cfg.architectures)
print(type(proc).__name__)
print(ModelRegistry.resolve_model_cls(cfg.architectures)[0].__name__)
PY

Expected output includes:

Gemma4UnifiedConfig gemma4_unified ['Gemma4UnifiedForConditionalGeneration']
Gemma4UnifiedProcessor
Gemma4UnifiedForConditionalGeneration

Slot Profile

Base c16 profile:

../../configs/model-slots/gemma4-12b-it-int4-autoround.env

Key settings:

MODEL_SLOT_HF_ID="Intel/gemma-4-12B-it-int4-AutoRound"
MODEL_DIR=/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel
VLLM_DTYPE=bfloat16
VLLM_QUANTIZATION=auto
VLLM_TENSOR_PARALLEL_SIZE=4
VLLM_MAX_MODEL_LEN=32768
VLLM_MAX_NUM_BATCHED_TOKENS=4096
VLLM_MAX_NUM_SEQS=16
VLLM_ENABLE_PREFIX_CACHING=1
VLLM_EXTRA_ARGS=(--limit-mm-per-prompt '{"image":4}')
FRONTDOOR_HOST=0.0.0.0
FRONTDOOR_PORT=8000
FRONTDOOR_MAX_ACTIVE_GENERATIONS=16

High-concurrency profile:

../../configs/model-slots/gemma4-12b-it-int4-autoround-c64.env

Key c64 settings:

VLLM_MAX_MODEL_LEN=4480
VLLM_MAX_NUM_BATCHED_TOKENS=4096
VLLM_MAX_NUM_SEQS=64
VLLM_ENABLE_PREFIX_CACHING=1
FRONTDOOR_MAX_ACTIVE_GENERATIONS=64

Active production c8 profile:

../../configs/model-slots/gemma4-12b-it-int4-autoround-c8.env

Key c8 settings:

VLLM_MAX_MODEL_LEN=32768
VLLM_MAX_NUM_BATCHED_TOKENS=4096
VLLM_MAX_NUM_SEQS=8
VLLM_ENABLE_PREFIX_CACHING=1
XPU_GRAPH=1
VLLM_XPU_ENABLE_XPU_GRAPH=1
VLLM_XPU_FORCE_GRAPH_WITH_COMM=1
VLLM_XPU_GRAPH_NOOP_COMM_CAPTURE=1
VLLM_COMPILATION_CONFIG='{"use_inductor_graph_partition":true,"compile_sizes":[1],"cudagraph_mode":"PIECEWISE"}'
FRONTDOOR_MAX_ACTIVE_GENERATIONS=8
FRONTDOOR_QUEUE_TIMEOUT_S=0

The c8 production frontdoor is intentionally fail-fast, not a bulk queue. It admits up to 8 active generation requests. A 9th generation request receives HTTP 503 immediately when all slots are busy.
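
A minimal sketch of that admission policy, assuming a semaphore-style gate. FailFastGate and status_for_new_request are hypothetical names used for illustration; this is not the code in scripts/openai-lan-frontdoor.py.

```python
import threading


class FailFastGate:
    """Admit up to `limit` concurrent generations and never queue."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_admit(self) -> bool:
        # blocking=False returns immediately instead of waiting for a slot.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call once when an admitted generation finishes.
        self._slots.release()


def status_for_new_request(gate: FailFastGate) -> int:
    """HTTP status a fail-fast frontdoor would answer with (illustrative)."""
    return 200 if gate.try_admit() else 503


gate = FailFastGate(limit=8)
statuses = [status_for_new_request(gate) for _ in range(9)]
print(statuses)
```

Because nothing waits in a queue, a client that wants queueing has to retry on 503 itself.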

VLLM_DTYPE=bfloat16 is the 16-bit activation/runtime dtype. The weights remain the INT4 AutoRound checkpoint; this is not the rejected Qwen FP8 BF16-dequant fallback.

Start Or Switch To This Model

cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
  scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround

Switch to the 64-active-client profile:

cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
  scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c64

Switch to the full-32K, 8-active-generation profile:

cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
  scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8

Try the full-32K, 10-active-generation research profile:

cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
  scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c10

Do not use gemma4-12b-it-int4-autoround-c12 for production. It is retained only to reproduce the 2026-06-07 failure boundary.

Check status:

curl -fsS http://127.0.0.1:8000/status
curl -fsS http://127.0.0.1:8000/v1/models

Expected /v1/models includes:

{
  "id": "gemma4-12b-it-int4-autoround",
  "max_model_len": 32768
}

The c64 profile reports max_model_len=4480.

Smoke Tests

Text:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gemma4-12b-it-int4-autoround",
    "messages":[{"role":"user","content":"Reply with exactly: OK"}],
    "max_tokens":8,
    "temperature":0
  }'

Image requests use the normal OpenAI image_url content shape. The validated smoke used a base64 PNG data URL and asked for the dominant color.
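
As an illustration of that request shape, the sketch below builds a solid-color PNG with the standard library and wraps it in an image_url data URL. The solid_png helper, the 16x16 size, and the prompt wording are assumptions for this example; the validated smoke test may have used a different image and prompt.

```python
import base64
import json
import struct
import zlib


def solid_png(width: int, height: int, rgb: tuple[int, int, int]) -> bytes:
    """Encode a solid-color 8-bit RGB PNG using only the standard library."""

    def chunk(kind: bytes, data: bytes) -> bytes:
        body = kind + data
        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

    # Each scanline starts with filter byte 0 (no filtering).
    row = b"\x00" + bytes(rgb) * width
    # IHDR: width, height, bit depth 8, color type 2 (RGB), no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", ihdr)
        + chunk(b"IDAT", zlib.compress(row * height))
        + chunk(b"IEND", b"")
    )


png = solid_png(16, 16, (255, 0, 0))
data_url = "data:image/png;base64," + base64.b64encode(png).decode("ascii")

payload = {
    "model": "gemma4-12b-it-int4-autoround",
    "messages": [
        {
            "role": "user",
            "content": [
                # Hypothetical prompt text, not the exact smoke-test wording.
                {"type": "text", "text": "What is the dominant color? Reply with one word."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ],
    "max_tokens": 8,
    "temperature": 0,
}

body = json.dumps(payload)
print(len(body), "bytes of JSON ready to POST to /v1/chat/completions")
```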

Benchmark Shape

The useful benchmark run used forced generation with ignore_eos=true:

cd /home/steve/llm-optimizations
/home/steve/.venvs/vllm-xpu/bin/python scripts/bench-openai-concurrency.py \
  --base-url http://127.0.0.1:8000 \
  --tokenizer /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel \
  --prompt-tokens 2048 \
  --output-tokens 512 \
  --concurrency 1 \
  --concurrency 2 \
  --concurrency 4 \
  --concurrency 8 \
  --warmups 1 \
  --timeout 1800 \
  --output-json /mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c1-c2-c4-c8-20260607T034622Z.json

/home/steve/.venvs/vllm-xpu/bin/python scripts/bench-openai-concurrency.py \
  --base-url http://127.0.0.1:8000 \
  --tokenizer /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel \
  --prompt-tokens 2048 \
  --output-tokens 512 \
  --concurrency 16 \
  --warmups 1 \
  --timeout 1800 \
  --output-json /mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c16-20260607T035019Z.json

The earlier single-2k-512 file is not a valid decode-rate measurement because it allowed early EOS and the model stopped after only a few generated tokens.

Results

Prompt tokens were about 2071 per request. Each request generated 512 tokens. Decode rates below are output-token rates after first streamed text, which isolates generation from the long 2K prefill/TTFT portion. Wall aggregate is also included because it is what a user feels for full prompt+decode runs.

Concurrency	Aggregate decode tok/s after first text	Aggregate output tok/s wall	Mean per-request decode tok/s	Mean TTFT
1	`58.22`	`30.39`	`58.22`	`8.05 s`
2	`117.27`	`59.70`	`58.89`	`8.44 s`
4	`236.10`	`116.33`	`59.71`	`9.00 s`
8	`467.76`	`217.39`	`59.63`	`10.20 s`
16	`922.18`	`396.11`	`59.60`	`11.97 s`

Interpretation:

Single-stream decode is about 58-60 tok/s.
Per-request decode stays close to 60 tok/s through c16.
Aggregate warmed decode scales almost linearly to c16, reaching about 922 tok/s after TTFT.
The 2K prefill path is the current latency cost; wall throughput at c16 was about 396 output tok/s.
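
The relationship between the two throughput columns above can be sketched as follows. This assumes the decode rate counts output tokens from first streamed text to the end of the request and the wall rate counts them over the whole request; the benchmark harness may define the fields slightly differently. The function names are illustrative.

```python
def decode_tok_s(output_tokens: int, ttft_s: float, total_s: float) -> float:
    """Output-token rate measured from first streamed text to the last token."""
    return output_tokens / (total_s - ttft_s)


def wall_tok_s(output_tokens: int, total_s: float) -> float:
    """Output-token rate over the whole request, prefill included."""
    return output_tokens / total_s


# Concurrency-1 row: 512 output tokens, 8.05 s TTFT, 58.22 tok/s decode.
output_tokens = 512
ttft_s = 8.05
total_s = ttft_s + output_tokens / 58.22  # reconstructed request duration

print(f"decode: {decode_tok_s(output_tokens, ttft_s, total_s):.2f} tok/s")
print(f"wall:   {wall_tok_s(output_tokens, total_s):.2f} tok/s")
```

Under these assumptions the wall figure comes out near 30.4 tok/s, in line with the 30.39 reported for concurrency 1: roughly half of each request's wall time is the 2K prefill.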

C64 Profile

The 64-active-client profile cannot keep the 32K context window. The important finding is that changing max_num_seqs from 16 to 64 changes vLLM’s profiled KV budget and compile shape. Do not estimate c64 context by dividing the c16 1,004,337 KV-token budget by 64.

Search results:

Max model len	GPU KV tokens	vLLM full-context concurrency	Outcome
`15616`	`667253`	`42.73x`	too high for 64 full contexts
`10368`	`507081`	`48.91x`	too high for 64 full contexts
`7680`	`405664`	`52.82x`	too high for 64 full contexts
`4864`	`292589`	`60.15x`	too high for 64 full contexts
`4096`	`291995`	`71.29x`	fits, but leaves more headroom
`4480`	`292317`	`65.25x`	selected c64 profile

Selected c64 profile:

max_model_len=4480
max_num_seqs=64
max_active_generations=64
prefix caching enabled

The selected profile uses 4480 * 64 = 286720 logical KV tokens against a profiled 292317 token KV budget. That is close to full without crossing the full-64 admission boundary.
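
The arithmetic behind the search table can be reproduced with a few lines. This sketch only divides the reported KV budget by max_model_len, which matches the ratios vLLM logged here; the function name and the "fits" threshold of 64 are illustrative.

```python
TARGET_SEQS = 64


def full_context_concurrency(kv_tokens: int, max_model_len: int) -> float:
    """How many full-length contexts fit in the profiled KV budget."""
    return kv_tokens / max_model_len


# (max_model_len, profiled GPU KV tokens) pairs from the search table.
search = [
    (15616, 667253),
    (10368, 507081),
    (7680, 405664),
    (4864, 292589),
    (4096, 291995),
    (4480, 292317),
]

for max_model_len, kv_tokens in search:
    ratio = full_context_concurrency(kv_tokens, max_model_len)
    verdict = "fits" if ratio >= TARGET_SEQS else "too high"
    print(f"{max_model_len:>6} {kv_tokens:>8} {ratio:6.2f}x {verdict}")

# The naive estimate the text warns against: divide the c16 budget by 64.
naive_max_len = 1_004_337 // TARGET_SEQS
print("naive c64 max_model_len:", naive_max_len)
print("selected logical KV tokens:", 4480 * TARGET_SEQS)
```

The naive estimate lands near the 15616 row, where the profiled budget only covered about 42.73 full contexts, which is why the context length had to be found by search.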

Short-prompt one-token TTFT probe:

Concurrency	Prompt tokens each	Output tokens each	Mean TTFT	p50 TTFT	p95 TTFT	Max TTFT
`64`	`123`	`1`	`0.882 s`	`0.712 s`	`1.139 s`	`1.140 s`

Short-prompt 128-token output run:

Concurrency	Prompt tokens each	Output tokens each	Aggregate output tok/s wall	Mean TTFT field
`64`	`123`	`128`	`1614.41`	`5.04 s`

For the 128-token run, use the wall aggregate as the useful throughput number. The post-first-text decode field is not reliable for this short-prompt shape because vLLM/XPU coalesces streamed output chunks.

Near-limit prompt smoke:

Requested prompt	Actual prompt tokens	Output tokens	TTFT
`4300`	`4323`	`1`	`0.631 s`

Raw files:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-ttft-100p-1o-20260607T062129Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-decode-100p-128o-20260607T062143Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-longprompt-4300p-1o-20260607T062228Z.json

C8 Full-Context Profile

The active production Gemma 4 profile keeps max_model_len=32768 and caps live requests at 8. This keeps the LAN endpoint useful for full-context clients without dropping to the c64 profile’s shorter 4480-token window.

Selected c8 profile:

max_model_len=32768
max_num_seqs=8
max_active_generations=8
prefix caching enabled
XPU graph capture enabled

Current production startup with XPU graph capture reported:

torch.compile took 4.12 s on cached graph restart
Graph capturing finished in 4 s
init engine took 19.05 s
Available KV cache memory: 27.48 GiB
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x

The 30.67x line is vLLM’s KV-capacity estimate. The service is intentionally capped at 8 live generations because the goal for this profile is predictable LAN behavior with full 32K context, not maximum theoretical admission.

Short-prompt one-token TTFT probe after rebuilding the benchmark prompt generator:

Concurrency	Prompt tokens each	Output tokens each	Mean TTFT	Max TTFT
`8`	`119`	`1`	`0.151 s`	`0.183 s`

Short-prompt 128-token output run:

Profile	Concurrency	Prompt tokens each	Output tokens each	Aggregate output tok/s wall	Mean TTFT field
pre-graph c8	`8`	`119`	`128`	`247.49`	`4.12 s`
XPU graph c8 promoted mean	`8`	`119`	`128`	`703.59`	`1.46 s`

Near-full-context probes:

Shape	Prompt tokens each	Output tokens each	Mean TTFT	Max TTFT	Notes
`c8` cold-ish long prefill	`30690`	`1`	`22.17 s`	`39.03 s`	Useful long-prefill signal before the repeated prefix was warm.
`c8` prefix-cache-warm near limit	`32703`	`1`	`1.94 s`	`3.22 s`	Reused most of the earlier repeated 30.7K-token prefix.

Over-limit canary:

requested target 34300 -> 32894 input tokens per c8 lane
requested output tokens: 1
result: rejected as expected
reason: 32894 + 1 exceeds the 32768 max context window

This is the right behavior. For one output token, the prompt must stay at or below 32767 input tokens. For normal generation, leave more room for output.
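
The same budget check can be written down directly. This is a sketch of the rule stated above, with illustrative function names, not the validation code in vLLM or the frontdoor.

```python
MAX_MODEL_LEN = 32768


def fits_context(prompt_tokens: int, output_tokens: int,
                 max_model_len: int = MAX_MODEL_LEN) -> bool:
    """A request fits when prompt plus requested output stays within the window."""
    return prompt_tokens + output_tokens <= max_model_len


def max_prompt_tokens(output_tokens: int, max_model_len: int = MAX_MODEL_LEN) -> int:
    """Largest prompt that still leaves room for the requested output."""
    return max_model_len - output_tokens


print(fits_context(32894, 1))   # over-limit canary
print(fits_context(32703, 1))   # prefix-cache-warm near-limit probe
print(max_prompt_tokens(1))
print(max_prompt_tokens(512))
```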

Raw files:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-1o-fastprompt-20260607T065104Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-128o-fastprompt-20260607T065116Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32000p-1o-20260607T064822Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32703p-1o-20260607T065041Z.json

Repo summary:

results-20260607-b70-c8-32768.json

Production C8 Baseline

After promoting c8 to production metadata, the same profile was restarted and revalidated. Cached restart loaded AOT compile in 4.91 s, initialized the engine in 17.88 s, and reported 1,004,337 GPU KV tokens, or 30.65x theoretical full-32K concurrency.

Quality canary:

exact_ok: OK
copy_phrase: satin cobalt orbit
small_arithmetic: 7
red_image: Red

Sequential production benchmark baseline:

Shape	Prompt tokens each	Output tokens each	Concurrency	Primary result
short TTFT	`119`	`1`	`8`	mean TTFT `0.106 s`
short decode	`119`	`128`	`8`	wall aggregate `250.21 tok/s`
long prefill	`30690`	`1`	`8`	mean TTFT `22.21 s`

Repo summary:

results-20260607-production-c8-baseline.json

XPU Graph Promotion

On 2026-06-07, the c8 production profile was updated to enable XPU graph capture with communication ops. This keeps the same model, INT4 AutoRound weights, bfloat16 activation dtype, 32768 context, c8 concurrency cap, and prefix caching. It changes only the compile/graph execution path.

Validated settings:

XPU_GRAPH=1
VLLM_XPU_ENABLE_XPU_GRAPH=1
VLLM_XPU_FORCE_GRAPH_WITH_COMM=1
VLLM_XPU_GRAPH_NOOP_COMM_CAPTURE=1
VLLM_COMPILATION_CONFIG='{"use_inductor_graph_partition":true,"compile_sizes":[1],"cudagraph_mode":"PIECEWISE"}'

Post-promotion startup on the canonical production slot reported:

torch.compile took 4.12 s in total
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x
Graph capturing finished in 4 s
init engine took 19.05 s

Quality gate:

expected outputs: pass
baseline text/hash comparison: pass
text checks: OK, satin cobalt orbit, 7
image check: Red

Repeated production c8 short-decode checks after promotion:

Run	Prompt tokens each	Output tokens each	Concurrency	Mean TTFT	Wall aggregate output tok/s
1	`119`	`128`	`8`	`1.658 s`	`613.72`
2	`119`	`128`	`8`	`1.354 s`	`751.58`
3	`119`	`128`	`8`	`1.365 s`	`745.48`
mean	`119`	`128`	`8`	`1.459 s`	`703.59`

This is the current best production profile. The earlier non-graph c8 production baseline was about 240-250 tok/s for the same short-decode shape. The graph branch also ran a longer validation loop before promotion:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/xpugraph-validation-20260607T075901Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-promoted-20260607T080322Z

The 32K prefill leg did not materially improve with XPU graph. It stayed around 22.28 s mean TTFT for eight concurrent 30690-token prompts with one output token. Treat the win as a decode-path improvement, not a long-prefill fix.

Sustained decode characterization on the promoted production c8 graph profile:

Shape	Prompt tokens each	Output tokens each	Concurrency	Mean TTFT	Wall aggregate output tok/s
repeat mean	`119`	`256`	`8`	`2.530 s`	`796.18`
repeat mean	`119`	`512`	`8`	`2.542 s`	`780.97`
repeat mean	`119`	`1024`	`8`	`2.519 s`	`731.12`

512-token scaling on the same production endpoint:

Concurrency	Prompt tokens each	Output tokens each	Mean TTFT	Wall aggregate output tok/s
`1`	`119`	`512`	`2.190 s`	`112.77`
`2`	`119`	`512`	`2.415 s`	`205.47`
`4`	`119`	`512`	`2.500 s`	`398.98`
`8`	`119`	`512`	`2.526 s`	`784.69`

Long-prompt decode with the same c8 production endpoint:

Shape	Prompt tokens each	Output tokens each	Concurrency	Mean TTFT	Wall aggregate output tok/s	Notes
cold-ish 16K	`15357`	`128`	`8`	`21.46 s`	`47.12`	Includes prefill.
cold-ish 30K	`28774`	`128`	`8`	`22.44 s`	`45.13`	Includes prefill.
prefix-cache repeat 30K	`28774`	`128`	`8`	`3.75 s`	`270.74`	Same repeated prompt shape after cache warm.

The post-first-text decode fields in these long-prompt runs are not useful: vLLM/XPU coalesced the streamed output into large chunks. Use TTFT and wall aggregate throughput for long-prompt comparisons.

Raw result directories:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-256o-repeat-20260607T084540Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-512o-repeat-20260607T084633Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-1024o-repeat-20260607T084718Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-scaling-512o-20260607T084806Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-longprompt-decode-20260607T090436Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-longprompt-30000p-128o-cache-repeat-20260607T090558Z.json

C10/C12 32K Concurrency Boundary

On 2026-06-07, the apparent spare KV budget was tested by raising the full-32K active generation cap above c8. The important lesson is that vLLM’s reported KV token budget is only a capacity estimate. It does not guarantee that higher max_num_seqs shapes are stable or faster under real scheduler/runtime pressure.

c12 startup did succeed and reported:

GPU KV cache size: 991,437 tokens
Maximum concurrency for 32,768 tokens per request: 30.26x

However, the first burst run accidentally overlapped the quality and benchmark clients, so the endpoint received more concurrent work than intended. c12 then failed with:

RuntimeError: level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)

That makes c12 a rejected profile for now. It is kept as configs/model-slots/gemma4-12b-it-int4-autoround-c12.env so another run can reproduce or debug the boundary.

c10 was then tested as the first safer step above production c8. c10 startup also reported 991,437 GPU KV tokens and 30.26x theoretical full-32K concurrency. First launch compiled:

compile range (1, 1): 83.14 s
compile range (1, 4096): 100.85 s

c10 quality matched the production canary:

exact_ok: OK
copy_phrase: satin cobalt orbit
small_arithmetic: 7
red_image: Red

Short-prompt decode:

Profile	Prompt tokens each	Output tokens each	Concurrency	Mean TTFT	Wall aggregate output tok/s
c10 backend, c8 load	`119`	`512`	`8`	`0.140 s`	`754.88`
c10 backend, c10 load	`120`	`512`	`10`	`0.221 s`	`849.59`

Short-prompt one-token TTFT:

Concurrency	Prompt tokens each	Output tokens each	Mean TTFT
`1`	`119`	`1`	`0.036 s`
`8`	`119`	`1`	`0.166 s`
`10`	`120`	`1`	`0.198 s`

For short prompts, c10 is useful: it raised aggregate 512-token decode from about 755 tok/s at c8 load to about 850 tok/s at c10 load. The tradeoff was slightly higher TTFT and lower per-request decode speed.

For full-context-style requests, c10 was not better than c8:

Shape	Profile load	Prompt tokens each	Output tokens each	Mean TTFT	Wall aggregate output tok/s
unique near-32K TTFT	c8	`30690`	`1`	`22.20 s`	`0.205`
unique near-32K TTFT	c10	`30744`	`1`	`27.21 s`	`0.204`
unique near-32K decode	c8	`30690`	`128`	`23.61 s`	`22.73`
unique near-32K decode	c10	`30744`	`128`	`29.25 s`	`22.76`

The service can accept c10 full-context work, but vLLM internally stages long prefills. c10 adds latency and does not improve aggregate full-context output throughput.

The benchmark harness now supports fixed-prefix tests:

/home/steve/.venvs/vllm-xpu/bin/python scripts/bench-openai-concurrency.py \
  --base-url http://127.0.0.1:8000 \
  --tokenizer /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel \
  --prompt-tokens 32000 \
  --shared-prefix-tokens 16000 \
  --prompt-salt unique-a \
  --output-tokens 1 \
  --concurrency 8 \
  --concurrency 10 \
  --warmups 0 \
  --timeout 2400 \
  --output-json /mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency32k-20260607T190702Z/c10-shared-prefix-salted-ttft-c8-c10-32000p-16000shared-1o.json

Clean fixed-prefix plus unique-tail results, with about 16K shared leading tokens and about 16K unique tail tokens:

Shape	Profile load	Prompt tokens each	Output tokens each	Mean TTFT	Wall aggregate output tok/s
half-shared 32K TTFT	c8	`31021`	`1`	`12.45 s`	`0.371`
half-shared 32K TTFT	c10	`31034`	`1`	`15.11 s`	`0.372`
half-shared 32K decode	c8	`31021`	`128`	`13.24 s`	`39.20`
half-shared 32K decode	c10	`31034`	`128`	`16.04 s`	`39.67`

Prefix caching is therefore valuable for website-generation traffic with a fixed system/project prefix and unique user content: in this synthetic half-shared test it cut mean TTFT for near-32K c8 requests from 22.20 s to 12.45 s, about 44% lower. It did not make c10 materially better than c8 for full-context requests.
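
To make the half-shared shape concrete, here is a rough sketch of prompts that share a fixed leading block and differ in a salted tail. It uses whitespace-separated words as a stand-in for tokens, and build_prompts is a hypothetical helper; the real harness sizes prompts with the model tokenizer and may construct them differently.

```python
import hashlib


def build_prompts(lanes: int, total_words: int, shared_words: int, salt: str) -> list[str]:
    """Return `lanes` prompts with an identical prefix and unique tails."""
    prefix = " ".join(f"shared{i}" for i in range(shared_words))
    prompts = []
    for lane in range(lanes):
        # The salt and lane index make every tail unique, so only the prefix
        # can be served from the prefix cache.
        tag = hashlib.sha256(f"{salt}:{lane}".encode()).hexdigest()[:8]
        tail = " ".join(f"{tag}-{i}" for i in range(total_words - shared_words))
        prompts.append(f"{prefix} {tail}")
    return prompts


prompts = build_prompts(lanes=8, total_words=32, shared_words=16, salt="unique-a")
shared_prefix = " ".join(f"shared{i}" for i in range(16))
print(all(p.startswith(shared_prefix) for p in prompts), len(set(prompts)))
```

Changing the salt between runs keeps the tails from being cache hits left over from an earlier run, which is the role --prompt-salt appears to play in the command above.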

Recommendation: keep production on c8 for 32K request support. Use c10 only as a research profile for short-prompt high-throughput traffic. Do not use c12 unless debugging the device-lost boundary.

Raw result directory:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency32k-20260607T190702Z

Production Soak

On 2026-06-07, the active c8 XPU-graph production profile was left running on the public no-auth LAN endpoint and checked with the reusable soak harness:

DURATION_S=28800 INTERVAL_S=900 PROMPT_TOKENS=100 OUTPUT_TOKENS=512 \
  CONCURRENCY=8 scripts/run-gemma4-production-soak.sh

Scheduled cycles 1-32 covered the main soak window from 20260607T090844Z through 20260607T165344Z. The production endpoint stayed on:

model slot: gemma4-12b-it-int4-autoround-c8
max_model_len: 32768
max active generations: 8
prefix caching: enabled
auth: none

Scheduled result summary:

Metric	Value
Scheduled cycles	`32`
Clean cycles	`31`
Quality anomaly cycles	`1`
Mean wall aggregate output tok/s, clean cycles	`781.04`
Min wall aggregate output tok/s, clean cycles	`765.44`
Max wall aggregate output tok/s, clean cycles	`784.37`
Mean TTFT, clean cycles	`2.551 s`

The one scheduled quality anomaly was cycle 17: the copy_phrase canary returned slh cobalt orbit instead of satin cobalt orbit. Exact OK, arithmetic, and image-color canaries still passed. Three immediate manual reruns and a 25-repeat quality-only stress loop then matched the baseline hashes exactly, and cycle 18 returned to normal.
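
A baseline comparison of this kind can be sketched as below. The canary names and expected texts come from the quality canary above; the use of SHA-256 over stripped text and the failed_canaries helper are assumptions for illustration, not the soak harness's actual implementation.

```python
import hashlib

BASELINE = {
    "exact_ok": "OK",
    "copy_phrase": "satin cobalt orbit",
    "small_arithmetic": "7",
    "red_image": "Red",
}


def digest(text: str) -> str:
    """Hash the canary text after trimming surrounding whitespace."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def failed_canaries(outputs: dict[str, str], baseline: dict[str, str]) -> list[str]:
    """Names of canaries whose output hash differs from the baseline hash."""
    return [
        name
        for name, expected in baseline.items()
        if digest(outputs.get(name, "")) != digest(expected)
    ]


# Cycle 17 as described: only the copy_phrase canary drifted.
cycle_17 = dict(BASELINE, copy_phrase="slh cobalt orbit")
print(failed_canaries(cycle_17, BASELINE))
print(failed_canaries(dict(BASELINE), BASELINE))
```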

The original soak harness had a final-interval bug: after the last scheduled cycle, it stopped sleeping and ran rapid cycles about every 10 s until the deadline. Treat cycles 33+ in the first run as harness-bug data, not normal soak samples. The harness now sleeps to the deadline and exits instead of spinning.

A clean continuation ran with the fixed harness:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-soak-20260607T170030Z

Continuation result:

Cycle	UTC	Quality	Wall aggregate output tok/s	Mean TTFT
`1`	`20260607T170030Z`	pass	`779.20`	`2.559 s`
`2`	`20260607T171530Z`	pass	`784.75`	`2.546 s`

Frontdoor Streaming Fix

The first c1 post-soak benchmark appeared to show about 2.16 s TTFT for a 119-token prompt with 512 generated tokens. Direct backend testing showed this was not model/prefill latency: vLLM on 127.0.0.1:18080 streamed first text in about 34 ms.

Cause: the LAN frontdoor was proxying backend responses with response.read(65536). For text/event-stream responses this can buffer many SSE events before forwarding them, making client-observed TTFT look much worse than backend TTFT.

Fix: scripts/openai-lan-frontdoor.py now forwards text/event-stream responses line-by-line and flushes after every SSE line.
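
The difference between the two forwarding styles can be illustrated with in-memory streams. This is a sketch, not the frontdoor code: forward_chunked and forward_lines are hypothetical names, and io.BytesIO stands in for the backend response. On a real socket the fixed-size read can additionally wait while it accumulates events, which is what inflated client-observed TTFT.

```python
import io

# Three SSE events as a backend might stream them (payloads are made up).
SSE_BODY = (
    b'data: {"delta":"Hel"}\n\n'
    b'data: {"delta":"lo"}\n\n'
    b"data: [DONE]\n\n"
)


def forward_chunked(upstream, downstream, size: int = 65536) -> int:
    """Old style: each write may carry many SSE events at once."""
    writes = 0
    while chunk := upstream.read(size):
        downstream.write(chunk)
        downstream.flush()
        writes += 1
    return writes


def forward_lines(upstream, downstream) -> int:
    """Fixed style: forward and flush every SSE line as soon as it is read."""
    writes = 0
    for line in iter(upstream.readline, b""):
        downstream.write(line)
        downstream.flush()
        writes += 1
    return writes


chunked_out, line_out = io.BytesIO(), io.BytesIO()
chunked_writes = forward_chunked(io.BytesIO(SSE_BODY), chunked_out)
line_writes = forward_lines(io.BytesIO(SSE_BODY), line_out)
print("chunked writes:", chunked_writes, "line writes:", line_writes)
```

Both styles deliver identical bytes; only the granularity of delivery changes, so the fix affects latency and not content.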

Clean public-endpoint checks after restarting only the frontdoor:

Shape	Concurrency	Prompt tokens each	Output tokens each	Mean TTFT	Wall aggregate output tok/s
short decode	`1`	`119`	`512`	`0.036 s`	`112.87`
short decode	`8`	`119`	`512`	`0.099 s`	`783.62`

The queue policy was then changed from the old one-hour timeout to fail-fast: FRONTDOOR_QUEUE_TIMEOUT_S=0. A 9-way overload test admitted 8 requests and returned one HTTP 503 in 0.221 s, while the admitted requests streamed normally.

Direct frontdoor/backend streaming comparison after the fix:

Endpoint	Max tokens	First text
frontdoor `:8000`	`128`	`0.034 s`
backend `:18080`	`128`	`0.034 s`
frontdoor `:8000`	`512`	`0.037 s`
backend `:18080`	`512`	`0.034 s`

Raw soak paths:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-soak-20260607T090844Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-soak-20260607T170030Z

LocalMaxxing submissions:

Shape	tok/s	ID
c8, 119 prompt, 256 output, repeat mean	`796.18`	`cmq3jm75g000tlj01bx4frdf0`
c8, 119 prompt, 512 output, repeat mean	`780.97`	`cmq3jm7cx000wlj01wm75wqmk`

Rejected same-day branches:

Branch	Result
`gemma4-12b-it-int4-autoround-c10`	Research only. Short prompts improved from about `755` to `850` wall aggregate tok/s versus c8 load, but near-32K request throughput did not improve and TTFT worsened.
`gemma4-12b-it-int4-autoround-c12`	Rejected. Startup succeeded with `991,437` KV tokens and `30.26x` theoretical 32K concurrency, but burst load hit Level Zero `UR_RESULT_ERROR_OUT_OF_RESOURCES` followed by `UR_RESULT_ERROR_DEVICE_LOST`.
`gemma4-12b-it-int4-autoround-c8-mbt8192`	Quality matched, but c8 short decode fell to about `245.75 tok/s`, short TTFT worsened, and GPU KV dropped to `730,379` tokens.
`gemma4-12b-it-int4-autoround-c8-mbt2048`	Quality matched and GPU KV rose to `1,201,507` tokens, but c8 short decode fell to about `235.37 tok/s`.
`gemma4-12b-it-int4-autoround-c8-gmem097`	Rejected at startup. Free memory on `xpu:0` was about `30.61/31.89 GiB`, below the `0.97` utilization request of about `30.93 GiB`.
`gemma4-12b-it-int4-autoround-c8-gmem096`	Rejected at startup/engine init near the same memory boundary. Keep production at `VLLM_GPU_MEMORY_UTILIZATION=0.95`.
`gemma4-12b-it-int4-autoround-c8-cs1-8`	Rejected. First compile had six `ocloc`/IGC `error code 245` fallbacks and `torch.compile` took `315.80 s`; cached repeat validation later hit `UR_RESULT_ERROR_DEVICE_LOST` during sampling.
`gemma4-12b-it-int4-autoround-c8-xpugraph-mbt2048`	Rejected. It raised GPU KV to `1,201,940` tokens and `36.68x` theoretical 32K concurrency, but first compile took `214.55 s` and it hit `UR_RESULT_ERROR_DEVICE_LOST` during the canary/sampling path.
`gemma4-12b-it-int4-autoround-c8-nolog`	Quality matched, but no clear win. c8 short decode was `714.81 tok/s`, within the promoted graph profile’s normal variance, while one-token TTFT worsened.

Those branches are kept as reproducible profiles for future tuning, but they are not the active production path.
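
The memory-utilization rejections follow from simple arithmetic on the figures quoted for the gmem097 branch. The sketch below assumes startup requires the requested share of total device memory to be no larger than free memory; the inputs are the rounded values from the table, so the 0.96 case sits within a few MiB of the boundary and should be read as indicative only.

```python
FREE_GIB = 30.61   # free memory reported on xpu:0 (rounded)
TOTAL_GIB = 31.89  # total memory reported on xpu:0 (rounded)


def requested_gib(utilization: float, total_gib: float = TOTAL_GIB) -> float:
    """Memory the engine asks for at a given utilization fraction."""
    return utilization * total_gib


def startup_fits(utilization: float, free_gib: float = FREE_GIB,
                 total_gib: float = TOTAL_GIB) -> bool:
    """Assumed check: the request must not exceed currently free memory."""
    return requested_gib(utilization, total_gib) <= free_gib


for utilization in (0.95, 0.96, 0.97):
    print(
        f"{utilization:.2f} -> {requested_gib(utilization):.2f} GiB requested, "
        f"fits={startup_fits(utilization)}"
    )
```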

Startup Observations

Known-good c16 startup reported:

Resolved architecture: Gemma4UnifiedForConditionalGeneration
quantization=inc
max_seq_len=32768
enable_prefix_caching=True
max_num_batched_tokens=4096
max_num_seqs=16
Loading weights took 0.73 s
torch.compile took 4.66 s in total on cached restart
GPU KV cache size: 1,004,337 tokens
Maximum concurrency for 32,768 tokens per request: 30.65x

The 30.65x line is a theoretical KV-capacity statement, not a claim that c30 has been benchmarked. After later full-context testing, c8 is the production profile because higher full-32K concurrency did not improve long-context throughput.

Rejected c12 startup and failure:

max_seq_len=32768
max_num_seqs=12
GPU KV cache size: 991,437 tokens
Maximum concurrency for 32,768 tokens per request: 30.26x
ocloc failed with error code 245
IGC: Internal Compiler Error: Floating point exception
UR_RESULT_ERROR_OUT_OF_RESOURCES
UR_RESULT_ERROR_DEVICE_LOST

The c12 30.26x KV estimate was real, but runtime stability under burst load was not. Treat this as an Intel/vLLM/XPU scheduler/runtime boundary, not a VRAM capacity limit.

Research c10 startup:

max_seq_len=32768
max_num_seqs=10
compile range (1, 1) took 83.14 s on first launch
compile range (1, 4096) took 100.85 s on first launch
GPU KV cache size: 991,437 tokens
Maximum concurrency for 32,768 tokens per request: 30.26x

c10 passed the quality canary and short-prompt tests, but it is not the recommended 32K production profile because full-context throughput did not improve versus c8.

Known-good c64 startup reported:

max_seq_len=4480
enable_prefix_caching=True
max_num_batched_tokens=4096
max_num_seqs=64
torch.compile took 67.31 s on first launch for this shape
GPU KV cache size: 292,317 tokens
Maximum concurrency for 4,480 tokens per request: 65.25x

Each tried max-context value created a new torch.compile cache key and took about 66-67 seconds on first launch. Cached restarts should be faster, but c64 has a real first-start operational cost.

Current production c8 startup with XPU graph reported:

max_seq_len=32768
enable_prefix_caching=True
max_num_batched_tokens=4096
max_num_seqs=8
torch.compile took 4.12 s on cached graph restart
Graph capturing finished in 4 s
init engine took 19.05 s
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x

Known Bad Setting

Do not set this:

VLLM_EXTRA_ARGS=(--limit-mm-per-prompt '{"image":4,"video":0,"audio":0}')

That launch failed during Gemma4 unified dummy multimodal profiling:

ValueError: Found 1 <|image|> tokens in the text but no images were passed.
RuntimeError: Engine core initialization failed.

The working setting is:

VLLM_EXTRA_ARGS=(--limit-mm-per-prompt '{"image":4}')

vLLM still logs a multimodal warmup warning and profiles a video-sized encoder cache budget, but real image requests work.

Next Work

Measure 16K and 32K prompt TTFT for c1/c4/c8/c16.
Compare c64 at 4480 against c16 at 32768 for real chat traffic, not just synthetic short prompts.
Revisit vLLM’s Gemma4 unified dummy multimodal profiling so image-only limits do not accidentally trip video/audio warmup behavior.
Submit a minimal upstream vLLM issue or PR once the backport is narrowed to the registry/helper delta needed by this release branch.
