b70-optimization-lab

MiniMax Production C1 Service

This is the recommended production-friendly shape for the current 4x B70 MiniMax endpoint.

Policy

Run c1 for real use:

The frontdoor is intentionally open on the LAN. It does not require a bearer token. Put it behind a firewall or reverse proxy if the LAN itself is not the trust boundary.

c2/c4/c8 and TurboQuant stay research profiles until their larger-context sustained decode and scheduler behavior are production-safe.

Install The Service

From the repo:

cd /home/steve/llm-optimizations
scripts/install-minimax-vllm-service.sh

Install and immediately move the current manual server under systemd:

cd /home/steve/llm-optimizations
scripts/install-minimax-vllm-service.sh --restart

The installed unit is:

/etc/systemd/system/minimax-vllm.service
/etc/systemd/system/minimax-openai-frontdoor.service

The tracked source unit is:

deploy/systemd/minimax-vllm.service
deploy/systemd/minimax-openai-frontdoor.service

Operate

systemctl status minimax-vllm.service --no-pager
systemctl status minimax-openai-frontdoor.service --no-pager
journalctl -u minimax-vllm.service -f
journalctl -u minimax-openai-frontdoor.service -f
sudo systemctl restart minimax-vllm.service
sudo systemctl restart minimax-openai-frontdoor.service
sudo systemctl stop minimax-vllm.service
sudo systemctl stop minimax-openai-frontdoor.service

Health check:

scripts/minimax-prod-health.py

Expected model report:

{
  "max_model_len": 32768
}

Frontdoor status:

curl http://127.0.0.1:8000/status

Expected shape:

{
  "frontdoor": {
    "backend": "http://127.0.0.1:18080",
    "max_active_generations": 1,
    "auth": "none"
  }
}

Run a small endpoint benchmark:

ts=$(date -u +%Y%m%dT%H%M%SZ)
scripts/minimax-prod-benchmark.py \
  --output-json "/mnt/fast-ai/bench-results/minimax-m27-b70-serve/prod-c1-benchmark-${ts}.json"

Reference systemd-run result from 2026-05-26:

Scenario Prompt tokens Output tokens Mean output tok/s after TTFT Mean TTFT Notes
short_decode 372 512 96.82 0.227 s decode-heavy smoke
prefill_16k 18924 16 77.89 13.085 s prompt-heavy mid-context
near32k 32264 64 63.91 23.336 s near-full 32K request

The near-32K result was accepted by LocalMaxxing as cmpm35jsa0003rt01zghtmwip. Payload and response:

Frontdoor concurrency smoke after moving vLLM behind localhost:

Production Notes

Use a queue or proxy in front of this service if more than one user or agent may call it at the same time. The tracked frontdoor does this for the common LAN case by serializing generation endpoints while allowing health/model requests through. c1 intentionally runs one active generation at a time because that is the fastest and most reliable path on this host.

For LAN use, restrict access with the firewall or a reverse proxy if needed. The default frontdoor does not provide authentication.

For remote OpenAI-compatible clients, use:

http://<server-lan-ip>:8000/v1