b70-optimization-lab

Gemma 4 12B IT INT4 AutoRound On B70 vLLM/XPU

This folder tracks the Gemma 4 bring-up on the 4x Intel Arc Pro B70 host.

Goal:

Current Status

Status on 2026-06-07: c8 is the active production profile. It uses XPU graph capture, prefix caching, 32K context, and a fail-fast 8-active-generation LAN frontdoor. c10 is kept as a research profile for extra short-prompt concurrency, but c8 remains the better full-context production choice. c12 was rejected after a Level Zero out-of-resources/device-lost failure under burst load. c16 and c64 remain documented alternate/research profiles.

The endpoint is running through the generic model-slot services:

Public endpoint: http://<server-lan-ip>:8000/v1
Auth: none
Served model name: gemma4-12b-it-int4-autoround
Backend: vLLM/XPU on 127.0.0.1:18080
Production c8 profile: 32768 context, 8 live generations
Research c10 profile: 32768 context, 10 live generations
Rejected c12 profile: 32768 context, 12 live generations
High-concurrency c64 profile: 4480 context, 64 live generations
Modalities tested: text, image

Text smoke after the final restart returned exactly OK. A real base64 PNG image request returned Blue.

Model

Hugging Face:

https://huggingface.co/Intel/gemma-4-12B-it-int4-AutoRound

The Intel model card describes this as a W4A16 AutoRound quantization of google/gemma-4-12B-it, with group size 128 and symmetric quantization. The local vLLM startup logs report quantization=inc, which indicates the checkpoint is loaded through the Intel AutoRound/INC INT4 path rather than the rejected FP8/BF16 fallback path.

Local model path:

/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel

Downloaded footprint on this host is about 7.3G.

Why It Needed Patching

The original local stack had Transformers 5.7.0, which did not recognize model_type=gemma4_unified. The fix was:

  1. Upgrade the serving venv to Transformers 5.10.2.
  2. Backport vLLM’s gemma4_unified.py model implementation.
  3. Register Gemma4UnifiedForConditionalGeneration in vLLM’s model registry.
  4. Add the missing gemma4_mm.py helper used by the unified implementation.

Patch snapshot:

../../patches/vllm-gemma4-unified-backport-b70-20260607.patch

Validation commands used after patching:

/home/steve/.venvs/vllm-xpu/bin/python -m py_compile \
  /home/steve/src/vllm/vllm/model_executor/models/gemma4_mm.py \
  /home/steve/src/vllm/vllm/model_executor/models/gemma4_unified.py \
  /home/steve/src/vllm/vllm/model_executor/models/registry.py

/home/steve/.venvs/vllm-xpu/bin/python - <<'PY'
from transformers import AutoConfig, AutoProcessor
from vllm.model_executor.models.registry import ModelRegistry
model = "/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel"
cfg = AutoConfig.from_pretrained(model, trust_remote_code=True)
proc = AutoProcessor.from_pretrained(model, trust_remote_code=True)
print(type(cfg).__name__, cfg.model_type, cfg.architectures)
print(type(proc).__name__)
print(ModelRegistry.resolve_model_cls(cfg.architectures)[0].__name__)
PY

Expected output includes:

Gemma4UnifiedConfig gemma4_unified ['Gemma4UnifiedForConditionalGeneration']
Gemma4UnifiedProcessor
Gemma4UnifiedForConditionalGeneration

Slot Profile

Base c16 profile:

../../configs/model-slots/gemma4-12b-it-int4-autoround.env

Key settings:

MODEL_SLOT_HF_ID="Intel/gemma-4-12B-it-int4-AutoRound"
MODEL_DIR=/mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel
VLLM_DTYPE=bfloat16
VLLM_QUANTIZATION=auto
VLLM_TENSOR_PARALLEL_SIZE=4
VLLM_MAX_MODEL_LEN=32768
VLLM_MAX_NUM_BATCHED_TOKENS=4096
VLLM_MAX_NUM_SEQS=16
VLLM_ENABLE_PREFIX_CACHING=1
VLLM_EXTRA_ARGS=(--limit-mm-per-prompt '{"image":4}')
FRONTDOOR_HOST=0.0.0.0
FRONTDOOR_PORT=8000
FRONTDOOR_MAX_ACTIVE_GENERATIONS=16

High-concurrency profile:

../../configs/model-slots/gemma4-12b-it-int4-autoround-c64.env

Key c64 settings:

VLLM_MAX_MODEL_LEN=4480
VLLM_MAX_NUM_BATCHED_TOKENS=4096
VLLM_MAX_NUM_SEQS=64
VLLM_ENABLE_PREFIX_CACHING=1
FRONTDOOR_MAX_ACTIVE_GENERATIONS=64

Active production c8 profile:

../../configs/model-slots/gemma4-12b-it-int4-autoround-c8.env

Key c8 settings:

VLLM_MAX_MODEL_LEN=32768
VLLM_MAX_NUM_BATCHED_TOKENS=4096
VLLM_MAX_NUM_SEQS=8
VLLM_ENABLE_PREFIX_CACHING=1
XPU_GRAPH=1
VLLM_XPU_ENABLE_XPU_GRAPH=1
VLLM_XPU_FORCE_GRAPH_WITH_COMM=1
VLLM_XPU_GRAPH_NOOP_COMM_CAPTURE=1
VLLM_COMPILATION_CONFIG='{"use_inductor_graph_partition":true,"compile_sizes":[1],"cudagraph_mode":"PIECEWISE"}'
FRONTDOOR_MAX_ACTIVE_GENERATIONS=8
FRONTDOOR_QUEUE_TIMEOUT_S=0

The c8 production frontdoor is intentionally fail-fast, not a bulk queue. It admits up to 8 active generation requests. A 9th generation request receives HTTP 503 immediately when all slots are busy.
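The admission rule above can be sketched with a non-blocking semaphore. This is a minimal illustration, not the code in scripts/openai-lan-frontdoor.py: the class name FailFastGate and its methods are hypothetical, and only the limit of 8 and the HTTP 503 status come from the profile.

```python
import threading


class FailFastGate:
    """Admit at most `limit` concurrent generations; never queue (hypothetical helper)."""

    def __init__(self, limit: int = 8) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_admit(self) -> int:
        """Return 200 if a slot was taken, or 503 if all slots are busy."""
        # blocking=False mirrors a queue timeout of 0: no waiting at all.
        if self._slots.acquire(blocking=False):
            return 200
        return 503

    def release(self) -> None:
        """Free a slot when an admitted generation finishes."""
        self._slots.release()


gate = FailFastGate(limit=8)
statuses = [gate.try_admit() for _ in range(9)]
print(statuses)  # eight admissions followed by one rejection
```

A real frontdoor would call release() in a finally block so that a failed or cancelled generation still frees its slot.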

VLLM_DTYPE=bfloat16 is the 16-bit activation/runtime dtype. The weights remain the INT4 AutoRound checkpoint; this is not the rejected Qwen FP8 BF16-dequant fallback.

Start Or Switch To This Model

cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
  scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround

Switch to the 64-active-client profile:

cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
  scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c64

Switch to the full-32K, 8-active-generation profile:

cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
  scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c8

Try the full-32K, 10-active-generation research profile:

cd /home/steve/llm-optimizations
printf '%s\n' "/'" | sudo -S -p '' \
  scripts/switch-vllm-model-slot.sh switch gemma4-12b-it-int4-autoround-c10

Do not use gemma4-12b-it-int4-autoround-c12 for production. It is retained only to reproduce the 2026-06-07 failure boundary.

Check status:

curl -fsS http://127.0.0.1:8000/status
curl -fsS http://127.0.0.1:8000/v1/models

Expected /v1/models includes:

{
  "id": "gemma4-12b-it-int4-autoround",
  "max_model_len": 32768
}

The c64 profile reports max_model_len=4480.

Smoke Tests

Text:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gemma4-12b-it-int4-autoround",
    "messages":[{"role":"user","content":"Reply with exactly: OK"}],
    "max_tokens":8,
    "temperature":0
  }'

Image requests use the normal OpenAI image_url content shape. The validated smoke used a base64 PNG data URL and asked for the dominant color.
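A request body of that shape could be built as in the sketch below. It is an illustration only: the helper names, the generated 8x8 solid-blue PNG, and the question wording are assumptions, not the exact smoke test that was run. The resulting dictionary would be posted as JSON to /v1/chat/completions in the same way as the text smoke.

```python
import base64
import struct
import zlib

MODEL = "gemma4-12b-it-int4-autoround"


def _chunk(tag: bytes, data: bytes) -> bytes:
    """Encode one PNG chunk: length, tag, data, CRC."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def solid_png(rgb: tuple, size: int = 8) -> bytes:
    """Build a small solid-color 8-bit RGB PNG using only the standard library."""
    ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0)
    row = b"\x00" + bytes(rgb) * size  # filter byte 0, then the pixels
    idat = zlib.compress(row * size)
    return (b"\x89PNG\r\n\x1a\n" + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", idat) + _chunk(b"IEND", b""))


def image_chat_payload(png: bytes, question: str) -> dict:
    """Wrap PNG bytes in the OpenAI image_url content shape as a data URL."""
    data_url = "data:image/png;base64," + base64.b64encode(png).decode("ascii")
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "max_tokens": 8,
        "temperature": 0,
    }


# Hypothetical prompt wording; the validated smoke only states that it asked
# for the dominant color.
payload = image_chat_payload(solid_png((0, 0, 255)),
                             "What is the dominant color? Reply with one word.")
print(payload["messages"][0]["content"][1]["image_url"]["url"][:40])
```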

Benchmark Shape

The useful benchmark run used forced generation with ignore_eos=true:

cd /home/steve/llm-optimizations
/home/steve/.venvs/vllm-xpu/bin/python scripts/bench-openai-concurrency.py \
  --base-url http://127.0.0.1:8000 \
  --tokenizer /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel \
  --prompt-tokens 2048 \
  --output-tokens 512 \
  --concurrency 1 \
  --concurrency 2 \
  --concurrency 4 \
  --concurrency 8 \
  --warmups 1 \
  --timeout 1800 \
  --output-json /mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c1-c2-c4-c8-20260607T034622Z.json

/home/steve/.venvs/vllm-xpu/bin/python scripts/bench-openai-concurrency.py \
  --base-url http://127.0.0.1:8000 \
  --tokenizer /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel \
  --prompt-tokens 2048 \
  --output-tokens 512 \
  --concurrency 16 \
  --warmups 1 \
  --timeout 1800 \
  --output-json /mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency-2k-512-c16-20260607T035019Z.json

The earlier single-2k-512 file is not a valid decode-rate measurement because it allowed early EOS and the model stopped after only a few generated tokens.

Results

Prompt tokens were about 2071 per request. Each request generated 512 tokens. Decode rates below are output-token rates measured after the first streamed text, which isolates generation from the long 2K prefill/TTFT portion. The wall aggregate is also included because it reflects what a user experiences across the full prompt-plus-decode request.

| Concurrency | Aggregate decode tok/s after first text | Aggregate output tok/s wall | Mean per-request decode tok/s | Mean TTFT |
| --- | --- | --- | --- | --- |
| 1 | 58.22 | 30.39 | 58.22 | 8.05 s |
| 2 | 117.27 | 59.70 | 58.89 | 8.44 s |
| 4 | 236.10 | 116.33 | 59.71 | 9.00 s |
| 8 | 467.76 | 217.39 | 59.63 | 10.20 s |
| 16 | 922.18 | 396.11 | 59.60 | 11.97 s |
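The relationship between the decode and wall columns can be sketched as follows. This assumes the decode rate is counted over the time after first text and the wall rate over the whole request; the function names are hypothetical, and the benchmark script may count tokens slightly differently.

```python
def decode_rate(output_tokens: int, ttft_s: float, total_s: float) -> float:
    """Output tokens per second counted only after the first streamed text."""
    return output_tokens / (total_s - ttft_s)


def wall_rate(output_tokens: int, total_s: float) -> float:
    """Output tokens per second over the whole request, prefill included."""
    return output_tokens / total_s


# Concurrency-1 figures: 512 output tokens, 8.05 s TTFT, 58.22 tok/s decode.
ttft_s = 8.05
total_s = ttft_s + 512 / 58.22  # reconstruct the total request time
print(round(decode_rate(512, ttft_s, total_s), 2))
print(round(wall_rate(512, total_s), 2))
```

Under these assumptions the reconstructed wall rate comes out at about 30.4 tok/s, close to the reported 30.39. Roughly half of each request's wall time at this shape is prefill, which is why the wall figure is about half the decode figure.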

Interpretation:

C64 Profile

The 64-active-client profile cannot keep the 32K context window. The important finding is that changing max_num_seqs from 16 to 64 changes vLLM’s profiled KV budget and compile shape. Do not estimate c64 context by dividing the c16 1,004,337 KV-token budget by 64.

Search results:

| Max model len | GPU KV tokens | vLLM full-context concurrency | Outcome |
| --- | --- | --- | --- |
| 15616 | 667253 | 42.73x | too high for 64 full contexts |
| 10368 | 507081 | 48.91x | too high for 64 full contexts |
| 7680 | 405664 | 52.82x | too high for 64 full contexts |
| 4864 | 292589 | 60.15x | too high for 64 full contexts |
| 4096 | 291995 | 71.29x | fits, but leaves more headroom |
| 4480 | 292317 | 65.25x | selected c64 profile |

Selected c64 profile:

max_model_len=4480
max_num_seqs=64
max_active_generations=64
prefix caching enabled

The selected profile uses 4480 * 64 = 286720 logical KV tokens against a profiled 292317 token KV budget. That is close to full without crossing the full-64 admission boundary.
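The admission arithmetic behind the search results can be reproduced with a short sketch. The pairs below are copied from the search results; the function name and the "at least 64" test are illustrative assumptions about how the boundary was judged, not vLLM internals.

```python
FULL_CONTEXTS_WANTED = 64

# (max_model_len, profiled GPU KV tokens) from the search results.
SEARCH_RESULTS = [
    (15616, 667253),
    (10368, 507081),
    (7680, 405664),
    (4864, 292589),
    (4096, 291995),
    (4480, 292317),
]


def full_context_concurrency(kv_tokens: int, max_model_len: int) -> float:
    """How many max-length requests the profiled KV budget can hold."""
    return kv_tokens / max_model_len


for max_len, kv_tokens in SEARCH_RESULTS:
    ratio = full_context_concurrency(kv_tokens, max_len)
    fits = ratio >= FULL_CONTEXTS_WANTED
    print(f"{max_len:>6} {kv_tokens:>7} {ratio:6.2f}x fits_64={fits}")

# Logical KV tokens needed by the selected profile.
print(4480 * FULL_CONTEXTS_WANTED)
```

Note that the KV budget itself shrinks as max_model_len changes, so the ratio has to be read from a real startup for each candidate rather than extrapolated from another profile.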

Short-prompt one-token TTFT probe:

| Concurrency | Prompt tokens each | Output tokens each | Mean TTFT | p50 TTFT | p95 TTFT | Max TTFT |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | 123 | 1 | 0.882 s | 0.712 s | 1.139 s | 1.140 s |

Short-prompt 128-token output run:

| Concurrency | Prompt tokens each | Output tokens each | Aggregate output tok/s wall | Mean TTFT field |
| --- | --- | --- | --- | --- |
| 64 | 123 | 128 | 1614.41 | 5.04 s |

For the 128-token run, use the wall aggregate as the useful throughput number. The post-first-text decode field is not reliable for this short-prompt shape because vLLM/XPU coalesces streamed output chunks.

Near-limit prompt smoke:

| Requested prompt | Actual prompt tokens | Output tokens | TTFT |
| --- | --- | --- | --- |
| 4300 | 4323 | 1 | 0.631 s |

Raw files:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-ttft-100p-1o-20260607T062129Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-decode-100p-128o-20260607T062143Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c64-4480-longprompt-4300p-1o-20260607T062228Z.json

C8 Full-Context Profile

The active production Gemma 4 profile keeps max_model_len=32768 and caps live requests at 8. This keeps the LAN endpoint useful for full-context clients without dropping to the c64 profile’s shorter 4480-token window.

Selected c8 profile:

max_model_len=32768
max_num_seqs=8
max_active_generations=8
prefix caching enabled
XPU graph capture enabled

Current production startup with XPU graph capture reported:

torch.compile took 4.12 s on cached graph restart
Graph capturing finished in 4 s
init engine took 19.05 s
Available KV cache memory: 27.48 GiB
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x

The 30.67x line is vLLM’s KV-capacity estimate. The service is intentionally capped at 8 live generations because the goal for this profile is predictable LAN behavior with full 32K context, not maximum theoretical admission.

Short-prompt one-token TTFT probe after rebuilding the benchmark prompt generator:

| Concurrency | Prompt tokens each | Output tokens each | Mean TTFT | Max TTFT |
| --- | --- | --- | --- | --- |
| 8 | 119 | 1 | 0.151 s | 0.183 s |

Short-prompt 128-token output run:

| Profile | Concurrency | Prompt tokens each | Output tokens each | Aggregate output tok/s wall | Mean TTFT field |
| --- | --- | --- | --- | --- | --- |
| pre-graph c8 | 8 | 119 | 128 | 247.49 | 4.12 s |
| XPU graph c8 promoted mean | 8 | 119 | 128 | 703.59 | 1.46 s |

For the 128-token run, use the wall aggregate as the useful throughput number. The post-first-text decode field is not reliable for this short-prompt shape because vLLM/XPU coalesces streamed output chunks.

Near-full-context probes:

| Shape | Prompt tokens each | Output tokens each | Mean TTFT | Max TTFT | Notes |
| --- | --- | --- | --- | --- | --- |
| c8 cold-ish long prefill | 30690 | 1 | 22.17 s | 39.03 s | Useful long-prefill signal before the repeated prefix was warm. |
| c8 prefix-cache-warm near limit | 32703 | 1 | 1.94 s | 3.22 s | Reused most of the earlier repeated 30.7K-token prefix. |

Over-limit canary:

requested target 34300 -> 32894 input tokens per c8 lane
requested output tokens: 1
result: rejected as expected
reason: 32894 + 1 exceeds the 32768 max context window

This is the right behavior. For one output token, the prompt must stay at or below 32767 input tokens. For normal generation, leave more room for output.
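The limit check is simple enough to state as code. This sketch only restates the arithmetic above; the helper names are hypothetical, and clients would still need to count prompt tokens with the model's own tokenizer.

```python
MAX_MODEL_LEN = 32768


def fits_context(prompt_tokens: int, output_tokens: int,
                 max_model_len: int = MAX_MODEL_LEN) -> bool:
    """True when prompt plus requested output stays inside the context window."""
    return prompt_tokens + output_tokens <= max_model_len


def max_prompt_tokens(output_tokens: int,
                      max_model_len: int = MAX_MODEL_LEN) -> int:
    """Largest prompt that still leaves room for the requested output."""
    return max_model_len - output_tokens


print(fits_context(32894, 1))   # the over-limit canary
print(max_prompt_tokens(1))     # one-token probe
print(max_prompt_tokens(512))   # a 512-token generation
```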

Raw files:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-1o-fastprompt-20260607T065104Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-100p-128o-fastprompt-20260607T065116Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32000p-1o-20260607T064822Z.json
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/c8-32768-8x32703p-1o-20260607T065041Z.json

Repo summary:

results-20260607-b70-c8-32768.json

Production C8 Baseline

After promoting c8 to production metadata, the same profile was restarted and revalidated. Cached restart loaded AOT compile in 4.91 s, initialized the engine in 17.88 s, and reported 1,004,337 GPU KV tokens, or 30.65x theoretical full-32K concurrency.

Quality canary:

exact_ok: OK
copy_phrase: satin cobalt orbit
small_arithmetic: 7
red_image: Red

Sequential production benchmark baseline:

| Shape | Prompt tokens each | Output tokens each | Concurrency | Primary result |
| --- | --- | --- | --- | --- |
| short TTFT | 119 | 1 | 8 | mean TTFT 0.106 s |
| short decode | 119 | 128 | 8 | wall aggregate 250.21 tok/s |
| long prefill | 30690 | 1 | 8 | mean TTFT 22.21 s |

Repo summary:

results-20260607-production-c8-baseline.json

XPU Graph Promotion

On 2026-06-07, the c8 production profile was updated to enable XPU graph capture with communication ops. This keeps the same model, INT4 AutoRound weights, bfloat16 activation dtype, 32768 context, c8 concurrency cap, and prefix caching. It changes only the compile/graph execution path.

Validated settings:

XPU_GRAPH=1
VLLM_XPU_ENABLE_XPU_GRAPH=1
VLLM_XPU_FORCE_GRAPH_WITH_COMM=1
VLLM_XPU_GRAPH_NOOP_COMM_CAPTURE=1
VLLM_COMPILATION_CONFIG='{"use_inductor_graph_partition":true,"compile_sizes":[1],"cudagraph_mode":"PIECEWISE"}'

Post-promotion startup on the canonical production slot reported:

torch.compile took 4.12 s in total
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x
Graph capturing finished in 4 s
init engine took 19.05 s

Quality gate:

expected outputs: pass
baseline text/hash comparison: pass
text checks: OK, satin cobalt orbit, 7
image check: Red

Repeated production c8 short-decode checks after promotion:

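| Run | Prompt tokens each | Output tokens each | Concurrency | Mean TTFT | Wall aggregate output tok/s |
| --- | --- | --- | --- | --- | --- |
| 1 | 119 | 128 | 8 | 1.658 s | 613.72 |
| 2 | 119 | 128 | 8 | 1.354 s | 751.58 |
| 3 | 119 | 128 | 8 | 1.365 s | 745.48 |
| mean | 119 | 128 | 8 | 1.459 s | 703.59 |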

This is the current best production profile. The earlier non-graph c8 production baseline was about 240-250 tok/s for the same short-decode shape. The graph branch also ran a longer validation loop before promotion:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/xpugraph-validation-20260607T075901Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-promoted-20260607T080322Z

The 32K prefill leg did not materially improve with XPU graph. It stayed around 22.28 s mean TTFT for eight concurrent 30690-token prompts with one output token. Treat the win as a decode-path improvement, not a long-prefill fix.

Sustained decode characterization on the promoted production c8 graph profile:

| Shape | Prompt tokens each | Output tokens each | Concurrency | Mean TTFT | Wall aggregate output tok/s |
| --- | --- | --- | --- | --- | --- |
| repeat mean | 119 | 256 | 8 | 2.530 s | 796.18 |
| repeat mean | 119 | 512 | 8 | 2.542 s | 780.97 |
| repeat mean | 119 | 1024 | 8 | 2.519 s | 731.12 |

512-token scaling on the same production endpoint:

| Concurrency | Prompt tokens each | Output tokens each | Mean TTFT | Wall aggregate output tok/s |
| --- | --- | --- | --- | --- |
| 1 | 119 | 512 | 2.190 s | 112.77 |
| 2 | 119 | 512 | 2.415 s | 205.47 |
| 4 | 119 | 512 | 2.500 s | 398.98 |
| 8 | 119 | 512 | 2.526 s | 784.69 |

Long-prompt decode with the same c8 production endpoint:

| Shape | Prompt tokens each | Output tokens each | Concurrency | Mean TTFT | Wall aggregate output tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| cold-ish 16K | 15357 | 128 | 8 | 21.46 s | 47.12 | Includes prefill. |
| cold-ish 30K | 28774 | 128 | 8 | 22.44 s | 45.13 | Includes prefill. |
| prefix-cache repeat 30K | 28774 | 128 | 8 | 3.75 s | 270.74 | Same repeated prompt shape after cache warm. |

The post-first-text decode fields in these long-prompt runs are not useful: vLLM/XPU coalesced the streamed output into large chunks. Use TTFT and wall aggregate throughput for long-prompt comparisons.

Raw result directories:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-256o-repeat-20260607T084540Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-512o-repeat-20260607T084633Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-1024o-repeat-20260607T084718Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-scaling-512o-20260607T084806Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-longprompt-decode-20260607T090436Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-xpugraph-longprompt-30000p-128o-cache-repeat-20260607T090558Z.json

C10/C12 32K Concurrency Boundary

On 2026-06-07, the apparent spare KV budget was tested by raising the full-32K active generation cap above c8. The important lesson is that vLLM’s reported KV token budget is only a capacity estimate. It does not guarantee that higher max_num_seqs shapes are stable or faster under real scheduler/runtime pressure.

c12 startup did succeed and reported:

GPU KV cache size: 991,437 tokens
Maximum concurrency for 32,768 tokens per request: 30.26x

However, the first bursty client run accidentally sent more concurrent work than intended because the quality and benchmark clients overlapped. c12 then failed with:

RuntimeError: level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)

That makes c12 a rejected profile for now. It is kept as configs/model-slots/gemma4-12b-it-int4-autoround-c12.env so another run can reproduce or debug the boundary.

c10 was then tested as the first safer step above production c8. c10 startup also reported 991,437 GPU KV tokens and 30.26x theoretical full-32K concurrency. First launch compiled:

compile range (1, 1): 83.14 s
compile range (1, 4096): 100.85 s

c10 quality matched the production canary:

exact_ok: OK
copy_phrase: satin cobalt orbit
small_arithmetic: 7
red_image: Red

Short-prompt decode:

| Profile | Prompt tokens each | Output tokens each | Concurrency | Mean TTFT | Wall aggregate output tok/s |
| --- | --- | --- | --- | --- | --- |
| c10 backend, c8 load | 119 | 512 | 8 | 0.140 s | 754.88 |
| c10 backend, c10 load | 120 | 512 | 10 | 0.221 s | 849.59 |

Short-prompt one-token TTFT:

| Concurrency | Prompt tokens each | Output tokens each | Mean TTFT |
| --- | --- | --- | --- |
| 1 | 119 | 1 | 0.036 s |
| 8 | 119 | 1 | 0.166 s |
| 10 | 120 | 1 | 0.198 s |

For short prompts, c10 is useful: it raised aggregate 512-token decode from about 755 tok/s at c8 load to about 850 tok/s at c10 load. The tradeoff was slightly higher TTFT and lower per-request decode speed.

For full-context-style requests, c10 was not better than c8:

| Shape | Profile load | Prompt tokens each | Output tokens each | Mean TTFT | Wall aggregate output tok/s |
| --- | --- | --- | --- | --- | --- |
| unique near-32K TTFT | c8 | 30690 | 1 | 22.20 s | 0.205 |
| unique near-32K TTFT | c10 | 30744 | 1 | 27.21 s | 0.204 |
| unique near-32K decode | c8 | 30690 | 128 | 23.61 s | 22.73 |
| unique near-32K decode | c10 | 30744 | 128 | 29.25 s | 22.76 |

The service can accept c10 full-context work, but vLLM internally stages long prefills. c10 adds latency and does not improve aggregate full-context output throughput.

The benchmark harness now supports fixed-prefix tests:

/home/steve/.venvs/vllm-xpu/bin/python scripts/bench-openai-concurrency.py \
  --base-url http://127.0.0.1:8000 \
  --tokenizer /mnt/fast-ai/llm-models/gemma4-12b-it-int4-autoround-intel \
  --prompt-tokens 32000 \
  --shared-prefix-tokens 16000 \
  --prompt-salt unique-a \
  --output-tokens 1 \
  --concurrency 8 \
  --concurrency 10 \
  --warmups 0 \
  --timeout 2400 \
  --output-json /mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency32k-20260607T190702Z/c10-shared-prefix-salted-ttft-c8-c10-32000p-16000shared-1o.json

Clean fixed-prefix plus unique-tail results, with about 16K shared leading tokens and about 16K unique tail tokens:

| Shape | Profile load | Prompt tokens each | Output tokens each | Mean TTFT | Wall aggregate output tok/s |
| --- | --- | --- | --- | --- | --- |
| half-shared 32K TTFT | c8 | 31021 | 1 | 12.45 s | 0.371 |
| half-shared 32K TTFT | c10 | 31034 | 1 | 15.11 s | 0.372 |
| half-shared 32K decode | c8 | 31021 | 128 | 13.24 s | 39.20 |
| half-shared 32K decode | c10 | 31034 | 128 | 16.04 s | 39.67 |

Prefix caching is therefore valuable for website-generation traffic with a fixed system/project prefix and unique user content: it roughly cut TTFT for near-32K c8 requests from 22.20 s to 12.45 s in this synthetic half-shared test. It did not make c10 materially better than c8 for full-context requests.
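The half-shared prompt shape can be pictured with a toy generator. This is not the logic in scripts/bench-openai-concurrency.py: the function name is hypothetical, whitespace-separated words stand in for tokens, and the salt handling is an assumption about how unique tails could be produced.

```python
import hashlib


def build_prompts(count: int, shared_words: int, tail_words: int, salt: str) -> list:
    """Build prompts with an identical leading prefix and a unique tail per lane."""
    shared = " ".join(f"shared{i}" for i in range(shared_words))
    prompts = []
    for lane in range(count):
        # A per-lane tag derived from the salt keeps tails unique across lanes
        # and across runs that use a different salt.
        tag = hashlib.sha256(f"{salt}:{lane}".encode()).hexdigest()[:8]
        tail = " ".join(f"{tag}-{i}" for i in range(tail_words))
        prompts.append(f"{shared} {tail}")
    return prompts


prompts = build_prompts(count=8, shared_words=16, tail_words=16, salt="unique-a")
shared_prefix = " ".join(f"shared{i}" for i in range(16))
print(all(p.startswith(shared_prefix) for p in prompts), len(set(prompts)))
```

Only the identical leading portion can be served from the prefix cache; each unique tail still has to be prefilled, which is consistent with TTFT dropping by roughly half rather than to near zero in the half-shared test.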

Recommendation: keep production on c8 for 32K request support. Use c10 only as a research profile for short-prompt high-throughput traffic. Do not use c12 unless debugging the device-lost boundary.

Raw result directory:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/concurrency32k-20260607T190702Z

Production Soak

On 2026-06-07, the active c8 XPU-graph production profile was left running on the public no-auth LAN endpoint and checked with the reusable soak harness:

DURATION_S=28800 INTERVAL_S=900 PROMPT_TOKENS=100 OUTPUT_TOKENS=512 \
  CONCURRENCY=8 scripts/run-gemma4-production-soak.sh

Scheduled cycles 1-32 covered the main soak window from 20260607T090844Z through 20260607T165344Z. The production endpoint stayed on:

model slot: gemma4-12b-it-int4-autoround-c8
max_model_len: 32768
max active generations: 8
prefix caching: enabled
auth: none

Scheduled result summary:

| Metric | Value |
| --- | --- |
| Scheduled cycles | 32 |
| Clean cycles | 31 |
| Quality anomaly cycles | 1 |
| Mean wall aggregate output tok/s, clean cycles | 781.04 |
| Min wall aggregate output tok/s, clean cycles | 765.44 |
| Max wall aggregate output tok/s, clean cycles | 784.37 |
| Mean TTFT, clean cycles | 2.551 s |

The one scheduled quality anomaly was cycle 17: the copy_phrase canary returned slh cobalt orbit instead of satin cobalt orbit. Exact OK, arithmetic, and image-color canaries still passed. Three immediate manual reruns and a 25-repeat quality-only stress loop then matched the baseline hashes exactly, and cycle 18 returned to normal.

The original soak harness had a final-interval bug: after the last scheduled cycle, it stopped sleeping and ran rapid cycles roughly every 10 s until the deadline. Treat cycles 33+ in the first run as harness-bug data, not normal soak samples. The harness now sleeps to the deadline and exits instead of spinning.
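The corrected scheduling behavior can be sketched as a small loop. This is an illustration, not the contents of scripts/run-gemma4-production-soak.sh: the function, its parameters, and the fake clock are hypothetical, and only the 28800 s duration and 900 s interval come from the soak command.

```python
import time


def run_soak(duration_s, interval_s, run_cycle, now=time.monotonic, sleep=time.sleep):
    """Run cycles on a fixed interval, then idle until the deadline and exit."""
    start = now()
    deadline = start + duration_s
    cycles = 0
    while True:
        next_start = start + cycles * interval_s
        if next_start >= deadline:
            break  # no scheduled slot left before the deadline
        delay = next_start - now()
        if delay > 0:
            sleep(delay)
        cycles += 1
        run_cycle(cycles)
    # Sleep out the remainder instead of spinning extra rapid cycles.
    remaining = deadline - now()
    if remaining > 0:
        sleep(remaining)
    return cycles


class FakeClock:
    """Deterministic clock so the schedule can be checked without waiting."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


clock = FakeClock()
# Each simulated cycle takes 10 s of clock time.
count = run_soak(28800, 900, lambda n: clock.sleep(10), now=clock.now, sleep=clock.sleep)
print(count, clock.t)
```

With these inputs the loop schedules 32 cycles and then idles to the 8-hour deadline, which matches the scheduled cycle count reported for the main soak window.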

A clean continuation with the fixed harness ran after the bug was patched:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-soak-20260607T170030Z

Continuation result:

| Cycle | UTC | Quality | Wall aggregate output tok/s | Mean TTFT |
| --- | --- | --- | --- | --- |
| 1 | 20260607T170030Z | pass | 779.20 | 2.559 s |
| 2 | 20260607T171530Z | pass | 784.75 | 2.546 s |

Frontdoor Streaming Fix

The first c1 post-soak benchmark appeared to show about 2.16 s TTFT for a 119-token prompt with 512 generated tokens. Direct backend testing showed this was not model/prefill latency: vLLM on 127.0.0.1:18080 streamed first text in about 34 ms.

Cause: the LAN frontdoor was proxying backend responses with response.read(65536). For text/event-stream responses this can buffer many SSE events before forwarding them, making client-observed TTFT look much worse than backend TTFT.

Fix: scripts/openai-lan-frontdoor.py now forwards text/event-stream responses line-by-line and flushes after every SSE line.
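The difference between chunked reads and line-by-line forwarding can be sketched as below. This is not the actual frontdoor code: forward_sse and CountingSink are hypothetical names, and in-memory byte streams stand in for the backend response and the client socket.

```python
import io


def forward_sse(upstream, downstream) -> int:
    """Copy an SSE body line by line, flushing after each line."""
    lines = 0
    # readline() returns as soon as one line is available, whereas a large
    # fixed-size read can wait for many events before returning.
    for line in iter(upstream.readline, b""):
        downstream.write(line)
        downstream.flush()
        lines += 1
    return lines


class CountingSink(io.BytesIO):
    """In-memory stand-in for the client socket that counts flushes."""

    def __init__(self) -> None:
        super().__init__()
        self.flushes = 0

    def flush(self) -> None:
        self.flushes += 1
        super().flush()


body = b'data: {"delta":"O"}\n\ndata: {"delta":"K"}\n\ndata: [DONE]\n\n'
sink = CountingSink()
forwarded = forward_sse(io.BytesIO(body), sink)
print(forwarded, sink.flushes)
```

Flushing on the blank separator lines matters as well, because an SSE client only dispatches an event once it sees the empty line that terminates it.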

Clean public-endpoint checks after restarting only the frontdoor:

| Shape | Concurrency | Prompt tokens each | Output tokens each | Mean TTFT | Wall aggregate output tok/s |
| --- | --- | --- | --- | --- | --- |
| short decode | 1 | 119 | 512 | 0.036 s | 112.87 |
| short decode | 8 | 119 | 512 | 0.099 s | 783.62 |

The queue policy was then changed from the old one-hour timeout to fail-fast: FRONTDOOR_QUEUE_TIMEOUT_S=0. A 9-way overload test admitted 8 requests and returned one HTTP 503 in 0.221 s, while the admitted requests streamed normally.

Direct frontdoor/backend streaming comparison after the fix:

| Endpoint | Max tokens | First text |
| --- | --- | --- |
| frontdoor :8000 | 128 | 0.034 s |
| backend :18080 | 128 | 0.034 s |
| frontdoor :8000 | 512 | 0.037 s |
| backend :18080 | 512 | 0.034 s |

Raw soak paths:

/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-soak-20260607T090844Z
/mnt/fast-ai/bench-results/gemma4-12b-it-int4-autoround/prod-c8-soak-20260607T170030Z

LocalMaxxing submissions:

| Shape | tok/s | ID |
| --- | --- | --- |
| c8, 119 prompt, 256 output, repeat mean | 796.18 | cmq3jm75g000tlj01bx4frdf0 |
| c8, 119 prompt, 512 output, repeat mean | 780.97 | cmq3jm7cx000wlj01wm75wqmk |

Rejected same-day branches:

| Branch | Result |
| --- | --- |
| gemma4-12b-it-int4-autoround-c10 | Research only. Short prompts improved from about 755 to 850 wall aggregate tok/s versus c8 load, but near-32K request throughput did not improve and TTFT worsened. |
| gemma4-12b-it-int4-autoround-c12 | Rejected. Startup succeeded with 991,437 KV tokens and 30.26x theoretical 32K concurrency, but burst load hit Level Zero UR_RESULT_ERROR_OUT_OF_RESOURCES followed by UR_RESULT_ERROR_DEVICE_LOST. |
| gemma4-12b-it-int4-autoround-c8-mbt8192 | Quality matched, but c8 short decode fell to about 245.75 tok/s, short TTFT worsened, and GPU KV dropped to 730,379 tokens. |
| gemma4-12b-it-int4-autoround-c8-mbt2048 | Quality matched and GPU KV rose to 1,201,507 tokens, but c8 short decode fell to about 235.37 tok/s. |
| gemma4-12b-it-int4-autoround-c8-gmem097 | Rejected at startup. Free memory on xpu:0 was about 30.61/31.89 GiB, below the 0.97 utilization request of about 30.93 GiB. |
| gemma4-12b-it-int4-autoround-c8-gmem096 | Rejected at startup/engine init near the same memory boundary. Keep production at VLLM_GPU_MEMORY_UTILIZATION=0.95. |
| gemma4-12b-it-int4-autoround-c8-cs1-8 | Rejected. First compile had six ocloc/IGC error code 245 fallbacks and torch.compile took 315.80 s; cached repeat validation later hit UR_RESULT_ERROR_DEVICE_LOST during sampling. |
| gemma4-12b-it-int4-autoround-c8-xpugraph-mbt2048 | Rejected. It raised GPU KV to 1,201,940 tokens and 36.68x theoretical 32K concurrency, but first compile took 214.55 s and it hit UR_RESULT_ERROR_DEVICE_LOST during the canary/sampling path. |
| gemma4-12b-it-int4-autoround-c8-nolog | Quality matched, but no clear win. c8 short decode was 714.81 tok/s, within the promoted graph profile’s normal variance, while one-token TTFT worsened. |

Those branches are kept as reproducible profiles for future tuning, but they are not the active production path.
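The memory-utilization boundary behind the gmem097 and gmem096 rejections is plain arithmetic, sketched below. The total and free figures are the approximate values quoted for xpu:0; the function name is hypothetical, and the comparison is a simplified reading of the startup check rather than vLLM's exact logic.

```python
TOTAL_GIB = 31.89  # approximate total memory reported for xpu:0
FREE_GIB = 30.61   # approximate free memory reported at startup


def requested_gib(utilization: float, total_gib: float = TOTAL_GIB) -> float:
    """Memory a given utilization fraction asks for on one device."""
    return utilization * total_gib


for utilization in (0.95, 0.96, 0.97):
    need = requested_gib(utilization)
    print(f"{utilization:.2f} -> {need:.2f} GiB requested, free {FREE_GIB:.2f} GiB")
```

The 0.97 request works out to about 30.93 GiB, above the free memory. The 0.96 request is about 30.61 GiB, essentially equal to the reported free memory, which is consistent with that branch failing near the same boundary. The 0.95 request is about 30.30 GiB and leaves a small margin.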

Startup Observations

Known-good c16 startup reported:

Resolved architecture: Gemma4UnifiedForConditionalGeneration
quantization=inc
max_seq_len=32768
enable_prefix_caching=True
max_num_batched_tokens=4096
max_num_seqs=16
Loading weights took 0.73 s
torch.compile took 4.66 s in total on cached restart
GPU KV cache size: 1,004,337 tokens
Maximum concurrency for 32,768 tokens per request: 30.65x

The 30.65x line is a theoretical KV-capacity statement, not a claim that c30 has been benchmarked. After later full-context testing, c8 is the production profile because higher full-32K concurrency did not improve long-context throughput.

Rejected c12 startup and failure:

max_seq_len=32768
max_num_seqs=12
GPU KV cache size: 991,437 tokens
Maximum concurrency for 32,768 tokens per request: 30.26x
ocloc failed with error code 245
IGC: Internal Compiler Error: Floating point exception
UR_RESULT_ERROR_OUT_OF_RESOURCES
UR_RESULT_ERROR_DEVICE_LOST

The c12 30.26x KV estimate was real, but runtime stability under burst load was not. Treat this as an Intel/vLLM/XPU scheduler/runtime boundary, not a VRAM capacity limit.

Research c10 startup:

max_seq_len=32768
max_num_seqs=10
compile range (1, 1) took 83.14 s on first launch
compile range (1, 4096) took 100.85 s on first launch
GPU KV cache size: 991,437 tokens
Maximum concurrency for 32,768 tokens per request: 30.26x

c10 passed the quality canary and short-prompt tests, but it is not the recommended 32K production profile because full-context throughput did not improve versus c8.

Known-good c64 startup reported:

max_seq_len=4480
enable_prefix_caching=True
max_num_batched_tokens=4096
max_num_seqs=64
torch.compile took 67.31 s on first launch for this shape
GPU KV cache size: 292,317 tokens
Maximum concurrency for 4,480 tokens per request: 65.25x

Each tried max-context value created a new torch.compile cache key and took about 66-67 seconds on first launch. Cached restarts should be faster, but c64 has a real first-start operational cost.

Current production c8 startup with XPU graph reported:

max_seq_len=32768
enable_prefix_caching=True
max_num_batched_tokens=4096
max_num_seqs=8
torch.compile took 4.12 s on cached graph restart
Graph capturing finished in 4 s
init engine took 19.05 s
GPU KV cache size: 1,004,909 tokens
Maximum concurrency for 32,768 tokens per request: 30.67x

Known Bad Setting

Do not set this:

VLLM_EXTRA_ARGS=(--limit-mm-per-prompt '{"image":4,"video":0,"audio":0}')

That launch failed during Gemma4 unified dummy multimodal profiling:

ValueError: Found 1 <|image|> tokens in the text but no images were passed.
RuntimeError: Engine core initialization failed.

The working setting is:

VLLM_EXTRA_ARGS=(--limit-mm-per-prompt '{"image":4}')

vLLM still logs a multimodal warmup warning and profiles a video-sized encoder cache budget, but real image requests work.

Next Work