LLM Inference on RTX PRO 6000 Blackwell
In February 2026 I ran a bunch of experiments trying to figure out which open-source models work best for self-hosted inference on workstation hardware. I rented 2x RTX PRO 6000 Blackwell GPUs on RunPod, served models with vLLM, and used benchmarks like HumanEval+, GSM8K, IFEval, and MATH Hard as sanity checks — not to score the models, but to make sure quantization and inference were working correctly and to find the right parameters. This page has all the raw data from those experiments.
Related post: Running Open Models (2026-Feb)
Key flags used throughout: `NCCL_P2P_DISABLE=1` · `--disable-custom-all-reduce` · `--enforce-eager` (some configs)
Summary
These benchmark scores are for verifying that inference is working, not for ranking models against each other. Most of these benchmarks are saturated and don't reflect real-world capability. Take the numbers as “is vLLM producing correct output with this quantization?” rather than “which model is smarter?”
Quality Sanity Checks
Speed: Batched Throughput (tok/s)
Speed: Single-Stream Latency (tok/s)
True single-user experience at concurrency=1 (one request finishes before the next starts). This is the interactive speed you feel.
Concurrent Full-Context Sessions (128K tokens)
How many simultaneous 128K-token conversations fit in VRAM after loading model
weights. Determined by remaining memory available for KV cache at
gpu_memory_utilization=0.90.
Full Results Table
| Model | Quant | IFEval | MATH Hard | Perplexity | 1-stream | rate=16 | GiB/GPU | 128K Sess. |
|---|---|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B | NVFP4 | 85.6% | 57.3% | 7.48 | 102 | 1,759 | 35.8 | 28 |
| Qwen3.5-122B-A10B | FP8 | 86.9% | 56.8% | 7.07 | 91 | 1,569 | 64 | 14 |
| Qwen3.5-35B-A3B | FP8 | 86.1% | — | 9.59 | — | 2,623 | — | 49 |
| Qwen3.5-35B-A3B | AWQ | 85.4% | — | — | — | 2,817 | — | 54 |
| gpt-oss-120b | MXFP4 | 76.9% | 55.5% | 30.25 | 62 | 2,656 | 32 | 20 |
| MiniMax M2.5 | NVFP4 | 37.2% | 44.9% | 12.45 | — | 1,688 | — | 2 |
Experiment Details
Project 1: Initial Pod Validation Feb 19
First time setting up the RunPod pod with these GPUs. Immediately ran into
a hang — the RTX PRO 6000 reports SM120 compute capability, and
vLLM's detection logic (120 // 10 = 12) doesn't recognize it
as SM100 family. So a lot of optimized kernels just don't load:
FlashInfer, VLLM_CUTLASS, some DeepGEMM variants — all gated behind
is_device_capability_family(100). The kind of thing you only
find out by running into it.
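The failure mode can be sketched in a few lines. This is a simplified illustration of the check described above, not vLLM's actual code:

```python
# Simplified sketch of the capability-family gate (assumption: the real
# vLLM logic differs in detail). Integer division by 10 maps SM100-SM109
# into family 100, but SM120 lands in its own bucket and misses the
# optimized-kernel paths.
def is_device_capability_family(device_cc: int, family: int) -> bool:
    return device_cc // 10 == family // 10

print(is_device_capability_family(100, 100))  # True:  B200 gets fast kernels
print(is_device_capability_family(120, 100))  # False: RTX PRO 6000 falls back
```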
I needed `NCCL_P2P_DISABLE=1` and `--disable-custom-all-reduce` to get tensor parallelism working over PCIe at all.
Project 2: MiniMax M2.5 vs Step 3.5 Flash Feb 19
Tried to get MiniMax M2.5 NVFP4 (~126 GB) and Step 3.5 Flash running. MiniMax loaded fine.
Step 3.5 Flash couldn't run at all on this hardware/vLLM combo:
- NVFP4 — not supported for MoE models (vLLM issue #31782)
- GGUF Q4 — vLLM dequantizes to BF16, exceeding VRAM
- FP8 — weights alone exceed 192 GB capacity
Not all quantization formats work with all architectures. NVFP4 + MoE support is incomplete in vLLM.
Project 3: MiniMax M2.5 vs Qwen3-235B (Bugs Found) Feb 20
This one was a mess. The MiniMax quality scores came out terribly wrong and it took a while to figure out why. Turned out to be three bugs stacking on top of each other.
Speed Results (Valid)
| Metric | MiniMax M2.5 | Qwen3-235B |
|---|---|---|
| Output tok/s | 66 | 42 |
| Throughput (rate=20) | 2.16 req/s | 1.64 req/s |
Original Quality (Broken)
| Benchmark | Qwen3-235B | MiniMax M2.5 |
|---|---|---|
| HumanEval+ | 94.5% | 49.7% |
| GSM8K | 88.2% | 54.8% |
| IFEval | 86.0% | 25.1% |
MiniMax's chat template injects <think> as a forced
prefix — it's not part of generated output. The model only generates
</think>. The proxy regex required both tags, so it
never matched. Reasoning content was never stripped. Zero opening tags
found in 1,450+ GSM8K samples.
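The failure is easy to reproduce in isolation. A minimal sketch, assuming a strip regex like the one described (the actual proxy code differs):

```python
import re

# Sketch of the failure mode: stripping requires BOTH tags, but the chat
# template injects <think> into the prompt, so generated text contains
# only the closing tag. (The regex here is an assumption for illustration.)
strip_reasoning = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

generated = "reasoning steps here</think>The answer is 42."
cleaned = strip_reasoning.sub("", generated)
print(cleaned == generated)   # True: nothing stripped, reasoning leaks

# One possible fix: treat a lone closing tag as ending an injected prefix.
fixed = re.sub(r"^.*?</think>\s*", "", generated, flags=re.DOTALL)
print(fixed)                  # The answer is 42.
```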
Published benchmarks are for MiniMax-Text-01 (Jan 2025, 456B params). The model tested was MiniMax M2.5 (Feb 2026, reasoning/agentic model). M2.5 does not publish HumanEval, GSM8K, or IFEval scores.
Used mmlu_generative instead of standard loglikelihood due
to API constraints. Post-processing extraction succeeded at only
65–75%. Scores unreliable for both models.
After Fix
With the proxy fixed and `max_gen_toks` increased from 256 to 4096, the benchmarks were re-run; the corrected setup carries through all later projects.
Project 4: OpenAI gpt-oss-120b on SM120 Feb 21
Tried running gpt-oss-120b
(117B MoE, 128 experts, 4 active) on SM120. This model uses OpenAI's
Harmony response format (reasoning_content field) instead of
<think> tags, which avoids the whole think-tag mess.
Backend Compatibility on SM120
| Variant | Backend | Status |
|---|---|---|
| MXFP4 | Marlin | works |
| FP8-dynamic | Triton | garbage output |
FP8 backend cascade on SM120:
- FLASHINFER — SM100 only, unavailable
- DEEPGEMM — disabled or produces garbage
- TRITON — selected but produces garbage tokens
- MARLIN — crashes with thread config errors
Quality and speed results for the MXFP4 build are summarized in the Full Results Table above.
Working Configuration
```shell
NCCL_P2P_DISABLE=1 vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --tool-call-parser openai \
  --reasoning-effort low
```
Project 5: MiniMax M2.5 FP8-INT4-AWQ artifact Feb 21
Ran mratsim/MiniMax-M2.5-FP8-INT4-AWQ (~92.5 GB) through
the full benchmark suite. Found a weird perplexity artifact baked into the
AWQ weights.
Quality with Reasoning Parser
| Benchmark | With Parser | Without | Effect |
|---|---|---|---|
| HumanEval+ pass@1 | 29.9% | 23.2% | +29% relative |
| GSM8K | 96.59% | — | — |
| IFEval | 48.8% | — | — |
| MATH Hard | 45.8% | — | — |
| GPQA Diamond | 77.78% | — | — |
Measured perplexity: 362.56 (vs 12.46 for NVFP4 of the
same model). Token 104256 (每) dominates the logprob
distribution. This is a weight quantization artifact — not a
backend issue. Loglikelihood evaluations (MMLU, GPQA loglik) are unusable
for this quantization.
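To see why one dominant token is so destructive, recall that perplexity is the exponential of the mean negative log-likelihood. A toy illustration with synthetic logprobs:

```python
import math

# Toy illustration (synthetic numbers, not measured logprobs): a handful
# of near-zero-probability tokens blows perplexity up fast.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

healthy = [-2.0, -1.5, -2.5, -2.0]       # plausible per-token logprobs
print(round(perplexity(healthy), 2))      # 7.39

# If one unrelated token soaks up probability mass, the correct tokens'
# logprobs sink, and ppl jumps to the ~350 range seen with this AWQ build.
corrupted = [-6.0, -5.5, -6.5, -5.5]
print(round(perplexity(corrupted)))       # 356
```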
Project 6: B200 SM100 Comparison Feb 22
Ran the same models on a B200 (SM100) to see if the kernel issues were Blackwell-wide or specific to SM120. Turns out SM100 supports different and often better kernel paths. So the problems I hit are SM120-specific — workstation Blackwell, not data center Blackwell.
SM120 (workstation) has fewer optimized paths than SM100 (data center). If you're deploying on RTX PRO cards, expect to hit kernel issues that B200 users won't see.
Project 7: Marlin vs Triton MXFP4 Backends Feb 23
I found a vLLM PR (#31089) that enables the Triton MXFP4 path on SM120, so I tested it head-to-head against the stock Marlin backend using gpt-oss-120b. The Triton patch runs, but performance collapses.
| Metric | Marlin | Triton | Ratio |
|---|---|---|---|
| Output tok/s | 575.67 | 101.12 | 5.7x |
| Mean TPOT | 16.64ms | 146.62ms | 8.8x slower |
| Median TTFT | 165.87ms | 11,079ms | 66.8x slower |
Just use Marlin. The Triton SM120 path technically runs but it's not usable in practice.
Project 8: AWQ Logprob & Reasoning Parser A/B Feb 22
Question 1: Is the AWQ artifact backend-dependent?
| Configuration | Perplexity |
|---|---|
| Baseline (DeepGEMM) | 350.19 |
| No DeepGEMM + eager | 346.50 |
| FlashInfer | 345.51 |
All variants show ~350 perplexity. The artifact is in the model weights, not the backend.
Question 2: Reasoning parser per-benchmark effect
| Benchmark | With Parser | Without | Effect |
|---|---|---|---|
| HumanEval+ | 32.9% | 23.8% | +38% |
| GSM8K | 96.51% | 96.66% | Neutral |
| IFEval | 47.72% | 48.92% | Slight − |
| MATH Hard | 44.64% | 44.86% | Neutral |
The reasoning parser helps a lot for code generation (+38% on HumanEval) but doesn't do much for anything else. Use it selectively.
Project 9: MoE vs Dense Architecture (Headline Result) Feb 24
This was the experiment I was most curious about. Same model family, two architectures: MoE (35B total, only 3B active per token) vs Dense (27B, all active). The MoE model worked way better than I expected.
- MoE: Qwen3.5-35B-A3B (35B total, 3B active per token)
- Dense: Qwen3.5-27B (all 27B active)
Speed Comparison
Both models served on 2x RTX PRO 6000 (TP=2) via vLLM. “rate=N” means N new requests arriving per second. Throughput is total output tokens per second across all in-flight requests. TPOT (time per output token) is per-request latency—how long a user waits between each streamed word. TTFT (time to first token) is how long until the first word appears after sending a prompt.
| Metric | MoE (35B-A3B) | Dense (27B) | MoE vs Dense |
|---|---|---|---|
| Throughput (rate=4: 4 req/s arriving) | 879.1 tok/s | 445.3 tok/s | 1.97x |
| Throughput (rate=16: 16 req/s arriving) | 2,332 tok/s | 1,369.8 tok/s | 1.70x |
| TPOT (rate=4: per-user streaming latency) | 26.66ms | 65.12ms | 2.4x lower |
| TTFT (rate=4: wait for first word) | 2,008ms | 26,792ms | 13.3x lower |
Simultaneous Full-Context Sessions
After loading weights, remaining VRAM determines how many 131K-token conversations can run at the same time. Per-token KV-cache size depends on the attention configuration (layers, KV heads, head dimension) rather than on active parameter count, and the MoE model's layout produces smaller entries, leaving more room for concurrent sessions.
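A back-of-envelope estimate makes the sizing concrete. The architecture numbers below are hypothetical for illustration, not the real Qwen3.5 config:

```python
# KV-cache sizing sketch: bytes per cached token, then how many tokens
# fit in the VRAM left after weights. Config values are assumptions.
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2x: keys + values

def kv_token_capacity(free_gib, per_token_bytes):
    return int(free_gib * 1024**3 // per_token_bytes)

# Hypothetical GQA config with an FP8 (1-byte) KV cache:
per_tok = kv_bytes_per_token(layers=40, kv_heads=4, head_dim=128, dtype_bytes=1)
print(per_tok)                         # 40960 bytes = 40 KiB per token
print(kv_token_capacity(64, per_tok))  # 1677721 (~1.68M tokens in 64 GiB)
```

Divide the token capacity by the context length to bound the number of simultaneous full-context sessions.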
MoE with only 3B active parameters gives you about 2x the batched throughput and nearly 3x as many simultaneous conversations in the same VRAM. Despite having more total parameters, it wins on every metric that matters for serving.
Project 10: Qwen3.5 Thinking-Mode Quality Fix Feb 25
The quality scores from Project 9 were suspiciously bad (~29% IFEval, ~9% MATH Hard). Turns out thinking mode was silently corrupting all the benchmark results. I'm still not fully sure if this is a fundamental limitation of the benchmark tooling with reasoning models, or just a configuration problem — but disabling thinking fixed everything.
The Problem
- Chat template injects `<think>` as a prefix (not generated by the model)
- Model only generates `</think>` as a closing tag
- vLLM's `--reasoning-parser qwen3` expects both tags → fails silently
- Thinking content remains in the `content` field, corrupting extraction
- 96% of MATH samples had untagged reasoning that received no stripping
- Greedy decoding with a 1024-token cap causes thinking to exhaust the budget before the answer
The Fix
```shell
--default-chat-template-kwargs '{"enable_thinking": false}'
```
Results
| Variant | Config | MATH Hard | IFEval |
|---|---|---|---|
| V0 (baseline) | qwen3 parser, proxy, greedy | 9.0% | 29.0% |
| V1 (no-think) | enable_thinking=False, greedy | 58.2% | 84.8% |
| V3 (no-think-warm) | enable_thinking=False, temp=0.7 | 57.5% | 86.1% |
| V2 (deepseek_r1) | deepseek_r1 parser, proxy | 0.3% | 7.9% |
| V4 (full budget) | deepseek_r1 parser, 32768 tokens | 70.5% | 8.9% |
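The same toggle is also available per-request through vLLM's OpenAI-compatible API via `chat_template_kwargs`. This sketch just builds the request payload; the model name and prompt are illustrative:

```python
# Per-request thinking toggle (sketch): vLLM forwards chat_template_kwargs
# from the request body into the chat template.
def build_request(prompt, enable_thinking=False,
                  model="Qwen/Qwen3.5-35B-A3B-FP8"):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

payload = build_request("What is 7*8?")
print(payload["chat_template_kwargs"])  # {'enable_thinking': False}
```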
Project 11: Qwen3.5 Vision — 1 GPU vs 2 GPU Feb 27
Good to know: Qwen3.5-35B-A3B handles images natively (early-fusion, same weights, no separate adapter). I tested vision on 1 GPU vs 2 to see if splitting helps.
Text Quality (No Degradation)
| Benchmark | 1-GPU | 2-GPU |
|---|---|---|
| IFEval (strict) | 85.03% | 85.58% |
| MATH Hard | 57.40% | 56.80% |
Vision Quality
| Benchmark | 1-GPU | 2-GPU |
|---|---|---|
| MMMU (val) | 52.67% | 51.78% |
| MMBench EN | 87.88% | 87.88% |
Speed
| Metric | What It Measures | 1-GPU | 2-GPU | Speedup |
|---|---|---|---|---|
| Text throughput | Output tokens generated per second across concurrent text requests (rate=inf) | 1,119 tok/s | 1,630 tok/s | 1.46x |
| Vision sequential | Image+text requests completed per second, one at a time (20 requests) | 1.91 req/s | 2.24 req/s | 1.17x |
| Vision concurrent | Image+text requests completed per second with 8 in-flight at once (50 requests) | 3.44 req/s | 4.37 req/s | 1.27x |
| Vision first-request | Time from sending the first image+text request to receiving a complete response | 4.14s | 0.46s | 9x lower latency |
Two GPUs help — 1.46x more text throughput and 9x lower vision first-request latency (4.14s → 0.46s) — without hurting quality on any benchmark.
Project 12: Quantization Comparison (Single GPU) Feb 28
Compared BF16, FP8, and AWQ-4bit of Qwen3.5-35B-A3B on a single card to see how much speed you gain and how much quality you lose.
Quality
| Quant | Perplexity | IFEval | MMMU |
|---|---|---|---|
| BF16 | 9.56 | 7.6%* | 25.0%* |
| FP8 | 9.58 | 7.6%* | 24.8%* |
| AWQ-4bit | — | 8.9%* | 26.1%* |
*Scores collapsed due to thinking mode being enabled. FP8 perplexity is essentially identical to BF16 (9.58 vs 9.56).
Speed
| Quant | TPOT | TTFT | Throughput | vs BF16 |
|---|---|---|---|---|
| BF16 | 26.3ms | 287.4ms | 1,103 tok/s | baseline |
| FP8 | 17.8ms | 269.9ms | 1,581 tok/s | +43% |
| AWQ-4bit | 13.9ms | 278.6ms | 1,955 tok/s | +77% |
FP8 is the sweet spot. Same quality as BF16, 43% more throughput on a
single GPU (1,581 vs 1,103 tok/s), and at
--gpu-memory-utilization 0.62 it only uses ~62 GB on a
96 GB card — leaving 34 GB for KV cache. AWQ-4bit is even quicker
but the perplexity artifact makes it harder to trust.
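For reference, a single-GPU launch matching this run might look like the following. The flags are assembled from the text; treat it as a sketch, not the exact command used:

```shell
# Single-GPU FP8 sketch: ~62 GB of weights at 0.62 utilization on a
# 96 GB card leaves ~34 GB for KV cache.
vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.62 \
  --default-chat-template-kwargs '{"enable_thinking": false}'
```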
Project 13: Final Verification — 6 Model Variants Mar 1
Re-ran all models with consistent methodology: vllm bench serve
with 512 input / 256 output tokens, same gpu_memory_utilization=0.90
and max_model_len=131072 across the board. Added Qwen3.5-122B-A10B
(the bigger MoE) to the lineup.
Speed (512in/256out, 2x RTX PRO 6000)
| Model | Params | Quant | TTFT | rate=1 | rate=4 | rate=16 |
|---|---|---|---|---|---|---|
| Qwen3.5-35B-A3B | 35B (3B active) | FP8 | 61ms | 245 | 918 | 2,623 |
| Qwen3.5-35B-A3B | 35B (3B active) | AWQ | 66ms | 247 | 941 | 2,817 |
| Qwen3.5-122B-A10B | 122B (10B active) | FP8 | 105ms | 233 | 744 | 1,569 |
| Qwen3.5-122B-A10B | 122B (10B active) | NVFP4 | 106ms | 236 | 789 | 1,694 |
| MiniMax M2.5 | 456B | NVFP4 | 147ms | 227 | 726 | 1,688 |
| gpt-oss-120b | 120B | MXFP4 | 70ms | 237 | 876 | 2,656 |
Quality
| Model | IFEval | MATH Hard | Perplexity |
|---|---|---|---|
| Qwen3.5-35B FP8 | 86.1% | —* | 9.59 |
| Qwen3.5-35B AWQ | 85.4% | —* | — |
| Qwen3.5-122B FP8 | 86.9% | 56.8% | 7.07 |
| Qwen3.5-122B NVFP4 | 85.6% | 57.3% | 7.48 |
| MiniMax M2.5 | 37.2% | 44.9% | 12.45 |
| gpt-oss-120b | 76.9% | 55.5% | 30.25 |
*MATH Hard failed for Qwen3.5-35B variants due to missing
math-verify dependency (installed mid-run).
KV Cache & Concurrency
At gpu_memory_utilization=0.90 and
max_model_len=131072, how many concurrent 128K-token
conversations fit in the remaining VRAM after loading weights:
| Model | Quant | KV Memory | KV Tokens | 128K Sessions |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | AWQ | 69.96 GiB | 1,833,216 | 54 |
| Qwen3.5-35B-A3B | FP8 | 63.96 GiB | 1,675,872 | 49 |
| Qwen3.5-122B-A10B | NVFP4 | 45.35 GiB | 989,312 | 28 |
| gpt-oss-120b | MXFP4 | 47.84 GiB | 1,393,504 | 20 |
| Qwen3.5-122B-A10B | FP8 | 22.66 GiB | 494,656 | 14 |
| MiniMax M2.5 | NVFP4 | 19.26 GiB | 325,728 | 2 |
Key Takeaways
Qwen3.5-122B-A10B FP8 is the quality winner — best IFEval (86.9%), best perplexity (7.07), and solid MATH Hard (56.8%). Still 233 tok/s single-user.
gpt-oss-120b is surprisingly fast — 2,656 tok/s at rate=16 and only 70ms TTFT, matching the 35B models despite being 4x larger.
NVFP4 doubles KV cache for 122B — 28 vs 14 concurrent 128K conversations, because model weights are ~half the size, freeing 23 GiB for KV cache.
MiniMax M2.5 disappoints on quality — 37.2% IFEval is very low, likely a prompt template issue with the think-tag proxy. Only 2 concurrent 128K sessions.
Project 14: SM120 Backend Deep Dive & NVFP4 Optimization Mar 2
Two-part investigation: (1) why SM120 is stuck on the Triton fallback for FP8 MoE, and (2) whether NVFP4 quantization unlocks better performance via native SM120 kernels.
SM120 vs Datacenter Blackwell
The RTX PRO 6000 (SM120) shares the Blackwell brand with B200/B100 (SM100) but has a fundamentally different instruction set:
| Feature | SM90 (Hopper) | SM100 (B200) | SM120 (RTX PRO 6000) |
|---|---|---|---|
| `wgmma` (Warp Group MMA) | Yes | Yes | No |
| `tcgen05`/UMMA | No | Yes | No |
| FP8 tensor cores | Yes | Yes | Yes |
| FP4 tensor cores | No | Yes | Yes |
| Max shared memory | 228 KB | 232 KB | 99 KB |
| Memory bandwidth | 3.35 TB/s | 8 TB/s | 1.79 TB/s |
SM120 has FP8 and FP4 tensor cores but can only access them via
mma.sync (SM89-era instruction), not the high-throughput
wgmma or tcgen05 used by optimized backends.
Backend Feasibility for FP8 MoE on SM120
| Backend | Status | Blocker |
|---|---|---|
| FlashInfer TRT-LLM | Impossible | Pre-compiled SM100 cubins only, no source |
| DeepGEMM | Impossible | wgmma + 232KB shared mem required |
| FlashInfer CUTLASS FP8 | Very hard | SM90-gated block-scale kernels |
| vLLM CUTLASS | Impossible | Not compiled for SM120 |
| Triton | Works | Fallback, SM ≥ 8.9 |
| Marlin | Works | Weight-only dequant to BF16 |
The NVFP4 Path Forward
SM120 has one capability SM90 does not: FP4 tensor cores.
FlashInfer CUTLASS already ships SM120-specific FP4 MoE kernels that use
mma.sync with FP4 data types and fit within 99 KB shared memory.
NVFP4 Backend Comparison (Qwen3.5-122B-A10B)
| Config | MoE Backend | 1-stream tok/s | TPOT | rate=16 | GiB/GPU |
|---|---|---|---|---|---|
| FP8 Triton (baseline) | TRITON | 91.4 | 10.36ms | 1,569 | ~64 |
| NVFP4 CUTLASS | FLASHINFER_CUTLASS | 89.3 | 10.59ms | 1,737 | 35.8 |
| NVFP4 Marlin | MARLIN | 102.1 | 9.19ms | 1,759 | 35.8 |
| NVFP4 Marlin+EP | MARLIN | 98.5 | 9.57ms | 1,745 | 35.8 |
NVFP4 + Marlin is the optimal 122B configuration on SM120. 102 tok/s single-stream (12% faster than FP8), 9.19ms per-token latency, 1,759 tok/s batched (12% more throughput), and 44% less memory per GPU (35.8 vs 64 GiB).
Marlin beats native CUTLASS at single-stream (102 vs 89 tok/s). At batch size 1 the workload is memory-bound (streaming expert weights from HBM), and Marlin’s BF16 dequant-and-compute path is faster than CUTLASS’s native FP4. Under concurrency both converge (within 1.3% at rate=16).
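A roofline-style bound shows why halving weight bytes helps at batch size 1. This is illustrative only: it assumes every active weight byte streams from VRAM once per decoded token, and real kernels land well below the ceiling:

```python
# Memory-bound decode ceiling: tok/s <= aggregate bandwidth / bytes of
# active weights read per token. Parameter counts and bandwidths are
# taken from the text; the model itself is a simplification.
def decode_tok_s_ceiling(bandwidth_tb_s, active_params_b, bytes_per_param):
    return (bandwidth_tb_s * 1e12) / (active_params_b * 1e9 * bytes_per_param)

BW = 2 * 1.79  # TB/s across two RTX PRO 6000s (1.79 TB/s each)
print(round(decode_tok_s_ceiling(BW, 10, 1.0)))  # 358  FP8, 10B active
print(round(decode_tok_s_ceiling(BW, 10, 0.5)))  # 716  NVFP4 doubles the bound
```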
Quality verified — 12-test evaluation (math, logic, code, translation, creative writing) showed no degradation from NVFP4 quantization.
Project 15: Agent Harness Testing Mar 2
Tested two open-source AI coding agents — opencode and hermes-agent — against local models on a practical task: clone a real project, understand it, install dependencies, and investigate vLLM integration.
Round 1: Small Models
| Agent | Model | Duration | Tool Calls | Quality |
|---|---|---|---|---|
| opencode | MiniMax M2.5 (hosted) | ~2 min | 8 | Excellent |
| opencode | Trinity Large (hosted) | ~2 min | 12+ | Excellent |
| hermes | Qwen3-30B-A3B (local) | ~16s | 2 | Poor |
Round 2: Qwen3.5-122B-A10B (2x GPU)
| Agent | Quant | Duration | Tool Calls | Quality |
|---|---|---|---|---|
| opencode | FP8 | 35s | 4+ | Good* |
| opencode | NVFP4 | 21s | 6+ | Good* |
| hermes | FP8 | 88s | 48 | Excellent |
| hermes | NVFP4 | 79s | 39 | Excellent |
*opencode hit context window limits (32K output default consumes most of 65K context).
FP8 vs NVFP4 for Agentic Use
| Metric | FP8 | NVFP4+Marlin | Delta |
|---|---|---|---|
| GPU memory (model) | 58.2 GiB | 35.75 GiB | -39% |
| vLLM ready time | ~175s | ~135s | -23% |
| opencode duration | 35s | 21s | -40% |
| hermes duration | 88s | 79s | -10% |
| Output quality | Excellent | Excellent | = |
Model size is the dominant factor for agentic quality. Qwen3-30B-A3B (3B active) made 2 tool calls. Qwen3.5-122B (10B active) made 39–48 tool calls with correct conclusions. The 3.3x increase in active parameters produced a qualitative leap.
hermes-agent is excellent with a capable model. Round 1’s poor results were entirely model-limited. With 122B, hermes produced the most thorough reports of any combination — reading docs, searching code, installing dependencies, and producing structured reports.
NVFP4+Marlin shows no quality difference from FP8 in agentic use, while being 39% lighter on memory and 10–40% faster end-to-end.
Cross-Cutting Lessons
Every model handles reasoning output differently. MiniMax injects <think>
as a template prefix, Qwen3.5 does the same, gpt-oss-120b uses a
completely separate field. There's no universal way to strip reasoning
content. For benchmarks, just disable thinking mode entirely.
The RTX PRO 6000 shares the Blackwell brand with B200 but has a fundamentally
different ISA. It lacks wgmma and tcgen05 instructions,
blocking DeepGEMM, TRT-LLM MoE, and CUTLASS FP8 block-scale kernels. It also
has only 99 KB shared memory vs 228+ KB on datacenter GPUs. The fallback is
Triton for FP8, or Marlin for NVFP4/MXFP4 (which is actually faster for
memory-bound single-stream workloads).
The AWQ logprob anomaly (token 104256 dominating everything) showed up identically across DeepGEMM, eager mode, and FlashInfer. It's baked into the checkpoint. You have to validate quantized models independently.
Qwen3.5-35B-A3B with only 3B active parameters gives you 2x the batched throughput and nearly 3x the concurrent sessions vs a dense 27B model. For self-hosted inference where you're paying for VRAM, MoE is a no-brainer.
For Qwen3.5-122B on workstation Blackwell: NVFP4 with
--moe-backend marlin gives 102 tok/s single-stream (12% faster
than FP8 Triton), uses 44% less GPU memory, and shows no quality degradation
in either benchmarks or agentic tool use. Marlin's BF16 dequant-and-compute
beats CUTLASS native FP4 when memory-bound at batch size 1.
Default configs produce silently wrong results. Token budgets need to be 4096+, greedy decoding might be incompatible, and reasoning content leaks into answers unless you explicitly disable it. I'm not sure if this is a benchmark tooling problem or something more fundamental.
lm-eval, evalplus, and lmms-eval all pin different vLLM versions and will
silently downgrade your install. Always install vLLM last with
--upgrade.
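A sketch of the safe install order (the package list is illustrative):

```shell
# Eval harnesses pin older vLLM versions and will silently downgrade it,
# so install vLLM last and force the upgrade.
pip install lm-eval evalplus lmms-eval
pip install --upgrade vllm
python -c "import vllm; print(vllm.__version__)"  # confirm no downgrade
```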
Production Recommendation
```shell
vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --dtype auto \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --seed 42 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --disable-custom-all-reduce \
  --port 8000
```
Why this model
I'm going with Qwen3.5-35B-A3B in FP8. It's the best combination of speed and capability I found across all these experiments:
- 2,623 output tok/s batched on 2x GPUs — the 1000+ tok/s range makes it a completely different game
- 49 simultaneous 128K-token conversations in 192 GB VRAM
- 86% IFEval with thinking disabled (sanity check: inference is correct)
- Vision works natively through the same weights, no adapter needed
- FP8 is quality-identical to BF16 (perplexity 9.59 vs 9.56) — 43% more throughput for free
- Fits on a single 96 GB card with room to spare
- Works really well in practice with opencode and other coding tools
For harder tasks: Qwen3.5-122B-A10B NVFP4
If you need more reasoning depth, Qwen3.5-122B-A10B in NVFP4 with
--moe-backend marlin is a strong alternative — 102 tok/s
single-stream (9.2ms/tok), best perplexity (7.48), 57.3% MATH Hard, and
only 35.8 GiB/GPU leaving room for 28 concurrent 128K sessions. It also
performs excellently as an agentic coding backbone with opencode and hermes-agent.
```shell
vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --tensor-parallel-size 2 \
  --dtype auto \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --moe-backend marlin \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --disable-custom-all-reduce \
  --port 8000
```