b70-optimization-lab

Community Results And Build Notes

This repository is meant to do more than document one machine. It should help people reproduce, compare, and improve local AI deployments on B70s and other accessible GPUs.

What To Share

A useful community result includes:

Suggested Result Format

Model:
Quantization:
Hardware:
OS/kernel:
Engine/backend:
Recipe or commit:
Prompt/output/context:
Batch/concurrency:
Quality check:
Output tok/s:
Total tok/s:
Notes:
Artifacts:
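
To keep submissions consistent, the template above can be filled in programmatically. The following is a minimal sketch, not a tool shipped with this repository: the field labels are copied from the suggested format, while `RESULT_FIELDS` and `render_result` are hypothetical names introduced only for illustration.

```python
# Field labels copied from the suggested result format above.
# RESULT_FIELDS and render_result are hypothetical helper names.
RESULT_FIELDS = [
    "Model", "Quantization", "Hardware", "OS/kernel", "Engine/backend",
    "Recipe or commit", "Prompt/output/context", "Batch/concurrency",
    "Quality check", "Output tok/s", "Total tok/s", "Notes", "Artifacts",
]


def render_result(values):
    """Return (text, missing): the filled-in template and the labels left blank."""
    unknown = [key for key in values if key not in RESULT_FIELDS]
    if unknown:
        raise KeyError(f"unknown fields: {unknown}")
    lines, missing = [], []
    for label in RESULT_FIELDS:
        value = str(values.get(label, "")).strip()
        if not value:
            missing.append(label)
        # rstrip() avoids a trailing space after the colon on blank fields.
        lines.append(f"{label}: {value}".rstrip())
    return "\n".join(lines), missing
```

Calling `render_result({"Model": "example-model"})` returns the full template with one field filled in, plus a list of the twelve labels still missing, which makes incomplete submissions easy to spot before they are posted.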

Build Photos

Build photos are useful because multi-GPU local AI can fail for physical reasons:

Photos should ideally show:

Example public build-photo links from Steve’s X feed:

Prefer linking to public posts or images instead of committing large photos to the repo. If a photo is critical to a reproducible build, add a small compressed copy plus a short caption explaining what it proves.

Example Build Photos

B70 build photo showing a dense multi-GPU physical setup

This wide photo is useful as a quick visual reference for the density and physical layout of a multi-B70 build. When publishing similar photos, add notes about motherboard, slot order, power cabling, and airflow direction.

B70 build photo showing card placement and workstation layout

This taller build photo is useful for discussing card spacing, blower intake clearance, case/workbench layout, and whether the system is being used as a lab rig or a finished workstation.

Common Build Discussion Themes

These came up repeatedly in community discussion and should be captured in future build notes:

Two Cards Versus Four Cards

Use two B70s when:

Use four B70s when:

Four cards can be worse than two or three for some workloads if communication overhead dominates. Treat card count as an experimental variable.
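
One way to treat card count as an experimental variable is to report speedup and scaling efficiency relative to the smallest configuration measured. The sketch below assumes one throughput number per card count for the same model, prompt, and settings; `scaling_report` is a hypothetical helper, and the numbers in the usage lines are placeholders, not measurements.

```python
def scaling_report(tok_s_by_cards):
    """Map card count -> (speedup, efficiency) relative to the smallest measured count."""
    if not tok_s_by_cards:
        raise ValueError("need at least one measurement")
    base_cards = min(tok_s_by_cards)
    base_rate = tok_s_by_cards[base_cards]
    if base_rate <= 0:
        raise ValueError("baseline throughput must be positive")
    report = {}
    for cards in sorted(tok_s_by_cards):
        speedup = tok_s_by_cards[cards] / base_rate
        # Efficiency of 1.0 means throughput grew in proportion to card count.
        efficiency = speedup / (cards / base_cards)
        report[cards] = (speedup, efficiency)
    return report


# Placeholder numbers for illustration only, not measured results.
example = scaling_report({2: 100.0, 4: 150.0})
print(example[4])  # (1.5, 0.75)
```

An efficiency well below 1.0 when going from two to four cards suggests that communication overhead is consuming the extra compute. A speedup below 1.0 means the larger configuration is slower outright.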

The X feed is useful for chronology and informal discussion. The repo should remain the source for reproducible commands, patches, artifacts, and final notes.

Records Need Labels

For community records, avoid a single naked “tok/s” number. At minimum, label:

This matters because a result can be excellent on one metric (prefill, decode, chat latency, or multi-user serving) while looking mediocre on another.

Example from the current 4x B70 MiniMax host:

Those numbers explain different parts of the user experience. Decode speed controls how fast text streams once generation starts. Prefill speed largely sets time to first token (TTFT): how long a long prompt waits before the first generated token appears. Interconnect can cap multi-GPU decode even when each card has enough VRAM.
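
The split between TTFT, prefill, and decode can be derived from three timestamps per request. This is a rough sketch under stated assumptions: `RequestTiming` and `summarize` are hypothetical names, timestamps are client-side seconds, and all of TTFT is attributed to prefill, which overstates prefill time when queueing or network delay is significant.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    t_request: float      # seconds, when the request was sent
    t_first_token: float  # seconds, when the first generated token arrived
    t_last_token: float   # seconds, when the last generated token arrived
    prompt_tokens: int
    output_tokens: int


def summarize(t):
    """Split one request into TTFT, approximate prefill rate, and decode rate."""
    ttft = t.t_first_token - t.t_request
    decode_time = t.t_last_token - t.t_first_token
    can_decode = decode_time > 0 and t.output_tokens > 1
    return {
        "ttft_s": ttft,
        # Assumption: all of TTFT is prefill (ignores queueing and network delay).
        "prefill_tok_s": t.prompt_tokens / ttft if ttft > 0 else None,
        # The first token is covered by TTFT, so decode counts the remaining tokens.
        "decode_tok_s": (t.output_tokens - 1) / decode_time if can_decode else None,
        "total_s": t.t_last_token - t.t_request,
    }
```

Reporting all four values for the same request shows why a single tok/s figure is ambiguous: a long prompt can have a high prefill rate and still a multi-second wait before the first token.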

Metrics Beyond Tok/s

Useful community metrics include:

For agentic use, concurrency and correctness can matter more than a single clean decode number.
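
A small sketch of how concurrency and correctness could be reported together. It assumes a batch of requests that ran concurrently over one wall-clock window, each with an output token count and a pass/fail quality check; `serving_summary` and its field names are hypothetical.

```python
def serving_summary(requests, wall_time_s):
    """Summarize a concurrent batch.

    requests: list of (output_tokens, passed_quality_check) tuples for
    requests that ran concurrently during wall_time_s seconds.
    """
    if wall_time_s <= 0 or not requests:
        raise ValueError("need a positive wall time and at least one request")
    total_tokens = sum(tokens for tokens, _ in requests)
    passed = sum(1 for _, ok in requests if ok)
    aggregate = total_tokens / wall_time_s
    return {
        "concurrency": len(requests),
        "aggregate_output_tok_s": aggregate,
        # Average rate each stream sees, which is what a single user experiences.
        "mean_per_stream_tok_s": aggregate / len(requests),
        "pass_rate": passed / len(requests),
    }
```

Aggregate throughput can look strong while each stream is slow or while a meaningful share of outputs fail the quality check, so the three numbers are most useful side by side.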

Recipe Versus Lab Note

Use repro/ for recipes someone can run. Use notes/ for lab history, failed attempts, and investigation details.

A result should become a repro/ folder when:

Keep exploratory work in notes/ until it is ready.

Discussion Topics

Useful discussion threads include: