b70-optimization-lab

Unofficial Intel XPU Optimization Lab

Community setup guides, benchmark recipes, troubleshooting notes, and patches for local AI work on Intel XPUs. This is not a single-model repo. It is a working lab notebook and reproducibility collection for multiple model lanes that we revisit as new runtime, compiler, and kernel ideas appear.

This is experimental research, not a supported product. Commands, patches, and benchmark observations are provided under the repository LICENSE and the risks described in DISCLAIMER.md; review and use them at your own risk.

Who This Is For

Local AI users who want reproducible Intel Arc/B-series commands, patches, and benchmark recipes.
Anyone deciding whether Intel Arc/B-series hardware is worth it for local inference and who wants to see the current state of support.
Optimization agents and contributors who need a map of current work, archived lessons, and validity rules.
Upstream vLLM, llama.cpp, oneAPI, SYCL, and Intel/XPU developers looking for concrete repros and failure signatures.

Start Here

Need	Entry Point
Understand the repo structure	Docs index
See every active, paused, and archived model lane	Model effort index
Reproduce promoted results	Results index and model recipes
Start optimizing a new model	Model optimization guide
Compare expected model performance	Performance scoreboard
Contribute a result, patch, or correction	Contribution guide and verification policy
Review or validate incoming work	Manager playbook
Reuse the best research prompts/workflows	Research workflow playbook
Find the current host/service map	Current reproducibility map
Submit or audit LocalMaxxing records	LocalMaxxing submissions
Handle local ops, secrets, sudo, and cross-agent delegation	Local ops
Review Intel-facing issues and asks	Feedback for Intel

How The Repo Is Organized

The repo is organized around model lanes, not branches or one-off leaderboard rows. A serious lane should leave behind:

results/<model>-<quant>-<hardware>/: promoted or closed-out result packet, validity gates, best commands, invalid fast lanes, and lessons.
repro/<model>-.../: copy-ready runnable recipe for a promoted result.
experiments/<model>-.../: active research lanes that are not production recipes yet.
notes/: chronological lab notebook entries, including negative results and postmortems.
patches/: patch snapshots and source/config deltas, including failed experiments worth preserving.
data/: compact structured benchmark records, payloads, responses, and logs.
scripts/: reusable harnesses, analyzers, launchers, and submission helpers.
community/<contributor>-...: work contributed from outside the reference lab, at any evidence level. Contributions land here first and move into repro/ or results/ only after they are reproduced on B70. See community/README.md.
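
As a rough illustration of the layout above, the sketch below builds a `results/<model>-<quant>-<hardware>/` path and encodes the community placement rule. The slug normalization (lowercase, whitespace turned into hyphens) and the names `lane_slug`, `results_dir`, and `allowed_roots` are assumptions made for this example, not conventions confirmed by the repository.

```python
from pathlib import PurePosixPath


def lane_slug(model: str, quant: str, hardware: str) -> str:
    """Build a <model>-<quant>-<hardware> slug.

    Assumption: each part is lowercased and internal whitespace becomes hyphens.
    """
    parts = []
    for part in (model, quant, hardware):
        words = part.lower().split()
        if not words:
            raise ValueError("model, quant, and hardware must be non-empty")
        parts.append("-".join(words))
    return "-".join(parts)


def results_dir(model: str, quant: str, hardware: str) -> PurePosixPath:
    """Path of the result packet for one lane."""
    return PurePosixPath("results") / lane_slug(model, quant, hardware)


def allowed_roots(reproduced_on_b70: bool) -> tuple[str, ...]:
    """Top-level directories an outside contribution may live in.

    Outside work lands under community/ first; repro/ and results/ open up
    only after the work has been reproduced on B70.
    """
    if reproduced_on_b70:
        return ("community", "repro", "results")
    return ("community",)


if __name__ == "__main__":
    print(results_dir("Gemma 4 26B", "Q8", "1x B70"))
    print(allowed_roots(reproduced_on_b70=False))
```

Keeping the placement rule in one small function makes it easy to check a proposed path in review before anything is promoted.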

The point is to make model switching cheap. Gemma, Qwen, MiniMax, and future lanes should all share validation discipline, result-packet shape, and reusable kernel/runtime lessons without dragging stale worktrees or huge artifacts forward.

Representative Promoted Results

CURRENT.md alone owns the live service and active research state. The rows below are evidence-backed examples; the broader expected-performance view is the performance index.

These are entry points, not the whole repo:

Lane	Status	Best Current Pointer
Poolside Laguna S 2.1 INT4 on 4x B70	Exact target-verified DFlash depth 11 on the audited width-12 Breakable PIECEWISE graph, with persistent attention metadata and 31 W8A16 draft projections/rank: approved published-convention row `102.971436 tok/s`; conventional 99-interval rate `101.941721 tok/s`. 13/13 token-and-text exact vs canonical q1, all `cached_tokens=0`; LocalMaxxing `cms2ccv2d00lps201rej94pjy`	qualified result, standalone repro, accounting correction, transfer ledger
DeepSeek V4 Flash experimental uniform-K160 on 4x B70	Paused/closed frontier; target-verified DSpark7 record `80.820 tok/s` high and `78.287 tok/s` three-suite median-of-medians; exact source bundles and fail-closed launcher preserved; LocalMaxxing `cmrquta9905w3lg013m5vxoqx`	result packet, standalone repro, closeout
Qwen3.6 27B INT4 AutoRound on 1-2x B70	Closed reference lane; strict fresh-response TP2 record `95.385 tok/s`; pinned public oneCCL, captured MTP draft, graph-safe FlashAttention full target graph, and exact ReplaySSM pending/direct-output transaction fusions; exact cases, repeat128, baseline parity, 1K needle, unique cold prompts, and `cached_tokens=0` all passed; both swapped four-GPU crossover assignments favored the candidate; LocalMaxxing `cmrh35ct50092mj01h7jgydqj`; current service ladder passes exact cold retrieval through `17706` actual prompt tokens at `MAX_MODEL_LEN=32768`, but the forced chunk-decode record path is short-context-only	current record packet, handoff, record note, service ladder, general repro
Gemma 4 26B A4B Q8 / INT8 on 1x B70	Production-servable backend plus current strict fresh-response speed frontier; noisy near-record support band	handoff, production service, 125 tok/s repro
Gemma 4 26B long-context/prompt-processing service lane	Separate from short-decode record; approved LocalMaxxing service entry `cmr47ivql0045nv011pfdjlaa`; service gates must not regress short decode	Gemma result packet, service gate script
Rapid one-B70 model snapshots	Quick strict/fresh decode baselines across practical GGUF/vLLM candidates; current promoted rows include Qwen3 30B-A3B `107.484 tok/s`, Qwen3-Coder 30B-A3B `108.117 tok/s`, Phi-4 mini Q4 `96.548 tok/s`, GLM-4.7-Flash `40.769 tok/s`, and Mistral Small 3.2 24B `27.297 tok/s`	rapid result ledger, rapid experiment notes
Qwen3.6 35B A3B Quark W8A8 INT8 on B70	Closed reference packet for now; preserve lessons for future return	Qwen result packet, research map
MiniMax M2.7 INT4 AutoRound on 4x B70	Deployable baseline plus older strict-speed and source-fusion research leads	result packet, Ubuntu 24 deploy repro, production service notes
Gemma 4 12B IT INT4 AutoRound	Current model-slot production profile and multimodal service lane	experiment packet, slot switching

For the full queue and archive, use docs/model-effort-index.md.

Validity Rules For Speed Claims

Diagnostic runs are allowed and useful, but headline records require the model-lane gate. For Gemma/Qwen-style fresh-response records, that means:

fixed realistic prompt suite;
each prompt run once as a cold response;
cached_tokens=0 for every request;
no prompt/KV cache reuse, context checkpoints, response reuse, warmed repeated prompts, or n-gram/history acceleration;
target model and quantization unchanged;
speculative decoding/MTP allowed only when accepted tokens are verified by the declared target model;
primary metric is the median conventional rate across the 99 inter-token intervals between timestamps 1 and 100 after TTFT, reported together with p10, mean, TTFT, wall-clock full-output throughput, full-output after-TTFT throughput, hashes, runtime identity, env vars, flags, and logs. Historical 100-event/99-interval compatibility fields must be labeled as such.
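
The primary metric in the last rule can be sketched as follows. This is a minimal reading of the rule, not the lab's harness: it assumes one arrival timestamp per generated token (so the first entry marks the end of TTFT), computes each request's rate as 99 intervals divided by the elapsed time between token events 1 and 100, and takes the median across requests in the suite. The function names and the nearest-rank percentile method are illustrative assumptions.

```python
from statistics import mean, median


def conventional_rate(token_times: list[float]) -> float:
    """Tokens per second over the 99 intervals between token events 1 and 100.

    token_times holds one arrival timestamp (seconds) per generated token.
    """
    if len(token_times) < 100:
        raise ValueError("need at least 100 token timestamps")
    elapsed = token_times[99] - token_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return 99 / elapsed


def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile (the choice of method is an assumption)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def summarize_suite(per_request_times: list[list[float]]) -> dict[str, float]:
    """Median, p10, and mean of the per-request conventional rates."""
    rates = [conventional_rate(times) for times in per_request_times]
    return {
        "median": median(rates),
        "p10": nearest_rank_percentile(rates, 10),
        "mean": mean(rates),
    }


if __name__ == "__main__":
    # Synthetic timestamps, for illustration only: one token every 10 ms.
    demo = [[i * 0.01 for i in range(100)]]
    print(summarize_suite(demo))
```

Dividing by the interval count rather than the event count keeps the rate from being overstated when only the span between the first and hundredth token is timed.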

Synthetic, repeated, warmed, cached, or history-assisted rows stay diagnostic unless revalidated by the lane’s promotion gate. This matters because several very fast historical rows were useful optimization clues but not real-world fresh-response claims.
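
One way to keep diagnostic rows from being promoted by accident is to check every request against the rules above before a run is labeled. The sketch below is hypothetical: the `RunRecord` fields, the violation strings, and the two labels are invented for illustration and do not describe the repository's actual record schema or gate scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-request record; every field name is illustrative."""
    cached_tokens: int
    prompt_repeated: bool          # prompt already sent or warmed in this run
    cache_or_history_assist: bool  # KV/prompt reuse, checkpoints, n-gram/history
    target_unchanged: bool         # same target model and quantization
    speculative: bool              # speculative decoding or MTP enabled
    target_verified: bool          # accepted tokens verified by declared target


def gate_violations(record: RunRecord) -> list[str]:
    """List the fresh-response rules this request breaks."""
    violations = []
    if record.cached_tokens != 0:
        violations.append("cached_tokens != 0")
    if record.prompt_repeated:
        violations.append("prompt repeated or warmed")
    if record.cache_or_history_assist:
        violations.append("cache or history assistance")
    if not record.target_unchanged:
        violations.append("target model or quantization changed")
    if record.speculative and not record.target_verified:
        violations.append("speculative tokens not target-verified")
    return violations


def classify(records: list[RunRecord]) -> str:
    """A run is headline-eligible only if every request is clean."""
    if records and all(not gate_violations(r) for r in records):
        return "headline-eligible"
    return "diagnostic"


if __name__ == "__main__":
    clean = RunRecord(0, False, False, True, True, True)
    print(classify([clean]))
```

Passing this kind of check would be necessary but not sufficient: the lane's own promotion gate still decides whether a row becomes a headline record.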

Hardware Scope

The reference lab has four Intel Arc Pro B70 32 GB cards (128 GB aggregate VRAM). B70 is the platform on which maintainers can independently reproduce and verify submitted patches, but this is an Intel Arc/XPU project, not a B70-only repository.

Results, fixes, and portability reports from Intel Arc Pro B50, B60, B65, and B70 owners are welcome, as are useful observations from other Intel Arc and XPU systems. A B70 rerun verifies what a patch does on B70; it does not certify a contributor’s score on hardware the maintainers do not possess. Hardware and verification status therefore stay explicit in every promoted result.

The four-card B70 host is enough for useful vLLM/XPU, llama.cpp/SYCL, driver, and model-port work, and it can run four independent one-GPU screens when a model fits. Higher-memory Intel devices would broaden model coverage, but that is not a prerequisite for contributing useful patches, results, failures, or optimization lessons.

Steve Seguin maintains this repo and posts ongoing build notes at https://x.com/xyster.

How To Contribute

This remains primarily a working optimization lab, but careful outside contributions are welcome. Start with CONTRIBUTING.md and the manual verification policy. No result needs to be a record to be useful.

Useful contributions include:

reproducing a result on another Intel/XPU stack and sharing exact versions;
testing a failed lane after a driver, PyTorch XPU, vLLM, llama.cpp, or oneAPI update;
turning local patches into clean upstream issues or PRs;
adding quality canaries for new model families;
sharing high-signal failure logs with model, quantization, graph mode, and hardware identity intact;
providing temporary access to larger Intel hardware for models that do not fit cleanly on 32 GB cards.

When opening an issue or discussion, include GPU, OS, model, quantization, runtime, exact command, benchmark shape, quality gate, result JSON/log paths, and what changed from the closest known-good run.
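
A quick completeness check can catch missing details before a report is filed. The sketch below is only a convenience idea: the key names in `REQUIRED_FIELDS` are hypothetical labels for the items listed above, and the repository does not prescribe this format.

```python
# Hypothetical keys, one per item requested for an issue or discussion.
REQUIRED_FIELDS = (
    "gpu",
    "os",
    "model",
    "quantization",
    "runtime",
    "command",
    "benchmark_shape",
    "quality_gate",
    "result_paths",
    "changes_from_known_good",
)


def missing_fields(report: dict[str, str]) -> list[str]:
    """Return required keys that are absent or blank, in checklist order."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(report.get(name, "")).strip()
    ]


if __name__ == "__main__":
    draft = {"gpu": "Intel Arc Pro B70", "model": "<model name>"}
    print("Still missing:", ", ".join(missing_fields(draft)))
```

Returning the gaps in checklist order lets a contributor fill them in the same sequence the maintainers read them.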

Reusable Optimization Lessons

The durable value of this lab is not only its fastest rows. Start with the cross-model pattern catalog for evidence-linked strategies that transfer across model lanes, and the model optimization guide for the identity, quality, A/B, variance, and negative-result process used to test them. Model-specific packets retain the commands and caveats behind each lesson.

Deep Historical Notes Below

The former top-level chronology was preserved verbatim in notes/2026-07-11-readme-historical-b70-archive.md. It remains useful for chronology and link recovery, but maintained navigation now lives in the result packets, model-effort index, reproduction recipes, and optimization guides.

Historical B70 Findings And Open Leads

Use the historical archive for the original long-form findings. Current model-specific starting points are the results index, including the MiniMax M2.7 packet, and the cross-model effort index. Underlying notes, data, patches, experiments, and reproduction paths were intentionally left in place.

Layout

Use the docs index for maintained navigation and the repository organization above for placement rules. The former file-by-file inventory is retained in the historical archive; it is not duplicated here because detailed inventories become stale quickly.
