Running Open Models (2026-Feb)
For the past two weeks I’ve been benchmarking open-source language models you can download for self-hosted AI inference on workstation-class hardware. I’ve been eyeing a pair of RTX PRO 6000 Blackwell GPUs (192 GB of VRAM combined), so to get a feel for the hardware and for how powerful the models are, I rented virtual machines on RunPod: mostly two RTX PRO 6000 cards, with some experiments on a single card. I used vLLM as the serving engine throughout. The detailed results with tables and methodology are here (slop warning: it was 100% AI-generated).
I ran benchmarks like HumanEval+, GSM8K, IFEval, and MATH Hard, but I want to be clear about why: I don’t think evals are the right way to measure how powerful a model is. I used them to verify that vLLM was running inference correctly, that quantization and my serving setup weren’t introducing errors, and to find the parameters for the best inference setup. A sanity check, not a scorecard.
I tried loading some optimized kernels – FlashInfer, VLLM_CUTLASS, certain
DeepGEMM variants – but most of them didn’t load. These Blackwell GPUs report
as SM120, which vLLM doesn’t recognize as part of the SM100 family, so a lot of
fast paths are just unavailable. I also had to set NCCL_P2P_DISABLE=1 and
pass --disable-custom-all-reduce to get tensor parallelism working over PCIe.
These are the kinds of things you only find out by running into them.
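For reference, here is a minimal sketch of the kind of launch I ended up with. The model id is a placeholder, but the environment variable and the flag are exactly the two workarounds described above:

```python
import os
import subprocess

# Tensor parallelism over PCIe: disable NCCL peer-to-peer and the custom
# all-reduce path, both of which caused hangs on this hardware.
env = dict(os.environ, NCCL_P2P_DISABLE="1")    # NCCL P2P off
cmd = [
    "vllm", "serve", "some-org/some-model",     # placeholder model id
    "--tensor-parallel-size", "2",              # split across both cards
    "--disable-custom-all-reduce",              # fall back to NCCL all-reduce
]
# subprocess.run(cmd, env=env, check=True)      # run on a box with vLLM installed
```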
The most important result: the Qwen 3.5 122B-A10B model worked far better than I expected. It’s a Mixture-of-Experts architecture with 122 billion total parameters but only 10 billion active per token, and it delivered really good output throughput with very low streaming latency.
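Back-of-envelope numbers make the appeal of this shape clear (a sketch, assuming roughly one byte per parameter at FP8):

```python
# MoE sizing: all experts must fit in VRAM, but per-token compute only
# touches the active subset.
total_params_b = 122    # billions of parameters, all experts
active_params_b = 10    # billions active per token
vram_gb = 192           # two RTX PRO 6000 cards

fp8_weights_gb = total_params_b * 1.0           # ~1 byte/param at FP8
active_fraction = active_params_b / total_params_b

print(f"weights: ~{fp8_weights_gb:.0f} GB of {vram_gb} GB available")
print(f"active per token: {active_fraction:.1%}")
```

So the weights fit comfortably on two cards, while each token pays the compute cost of a much smaller model.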
FP8 quantization turned out to be very good: perplexity was 9.58 vs. 9.56 for full BF16 (essentially identical) while delivering 43% higher throughput on a single GPU. I also tried 4-bit AWQ quantization, which was faster still, but it introduced a strange logprob artifact where a single token dominated the probability distribution. Generative benchmarks still worked fine, but perplexity-based evaluation became unreliable.
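Perplexity here is just the exponentiated mean negative log-likelihood per token, which is why one dominating token can skew it. A minimal sketch with toy logprobs (not the measured values):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability over the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy values only. The AWQ artifact shows up as one near-zero logprob
# (an artificially dominant token) dragging the mean toward zero.
print(round(perplexity([-2.0, -2.0, -2.0]), 2))
```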
I had a lot of problems with XML-like thinking tags, particularly with MiniMax
M2.5. Its chat template injects a <think> opening tag as a forced prefix that
isn’t part of the generated output – the model only produces the closing
</think> tag. The same class of problem appeared with Qwen 3.5’s thinking
mode. Eventually I was able to fix all of this with chat template arguments
that disabled thinking: --default-chat-template-kwargs '{"enable_thinking":
false}'. With that change, MATH Hard jumped from 9% to 58%. I’m still not
sure whether the low scores came from incorrectly combining thinking mode
with the benchmarks, or from a fundamental limitation of the benchmark
tooling when it comes to reasoning models.
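When disabling thinking isn’t an option, the benchmark side has to cope with the orphan closing tag itself. A sketch of the post-processing I have in mind (the function name is mine):

```python
import re

def strip_thinking(text: str) -> str:
    # Case 1: the model emitted a full <think>...</think> block.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Case 2: the template injected <think> as a forced prefix, so the
    # output contains only the orphan closing tag; keep what follows it.
    if "</think>" in text:
        text = text.split("</think>", 1)[1]
    return text.strip()

print(strip_thinking("first I add 40 and 2...</think>The answer is 42."))
```

Handling both cases matters because some templates inject the prefix and some models emit the full pair.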
The MXFP4 quantization for gpt-oss-120b works pretty well, but the FP8-dynamic variant produced garbage. I found a patch in a GitHub issue that enabled Triton on SM120; it technically worked, but performance cratered. The Marlin kernel delivered 5.7x the output throughput.
Vision capabilities work well on the Qwen 3.5 122B-A10B. It’s an early-fusion multimodal model, so vision goes through the same weights with no separate adapter.
In conclusion, I’m pretty sure I’m going to use Qwen-3.5-122B-A10B in FP8 mode as my capable open model for now. It’s noticeably faster than MiniMax in batched throughput, and even in a vibe check with the opencode harness it holds up really well. The full breakdown with per-experiment details is here (warning: this link contains slop).