Running Open Models (2026-Feb)
For the past two weeks I’ve been benchmarking open-source language models you can download for self-hosted AI inference on workstation-class hardware. I’ve been eyeing a pair of RTX PRO 6000 Blackwell GPUs (192 GB of VRAM combined), so to get a feel for the hardware and for how powerful the models are, I rented virtual machines on RunPod: mostly two RTX PRO 6000 cards, with some experiments on a single card. I used vLLM as the serving engine throughout. The detailed results with tables and methodology are here (slop warning: it was 100% AI-generated).
I ran benchmarks like HumanEval+, GSM8K, IFEval, and MATH Hard, but I want to be clear about why: I don’t think evals are the right way to measure how powerful a model is. I used them to verify that vLLM was running inference correctly, that quantization and my serving setup weren’t introducing errors, and to find the parameters for the best inference setup. A sanity check, not a scorecard.
I tried loading some optimized kernels – FlashInfer, VLLM_CUTLASS, certain
DeepGEMM variants – but most of them didn’t load. These Blackwell GPUs report
as SM120, which vLLM doesn’t recognize as part of the SM100 family, so a lot of
fast paths are just unavailable. I also had to set NCCL_P2P_DISABLE=1 and
pass --disable-custom-all-reduce to get tensor parallelism working over PCIe.
These are the kinds of things you only find out by running into them.
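For reference, here is a minimal sketch of the kind of launch I ended up with. The model id is a placeholder, but the environment variable and the flag are exactly the two workarounds described above:

```python
import os
import subprocess

# Tensor parallelism over PCIe: disable NCCL peer-to-peer and the custom
# all-reduce path, both of which caused hangs on this hardware.
env = dict(os.environ, NCCL_P2P_DISABLE="1")    # NCCL P2P off
cmd = [
    "vllm", "serve", "some-org/some-model",     # placeholder model id
    "--tensor-parallel-size", "2",              # split across both cards
    "--disable-custom-all-reduce",              # fall back to NCCL all-reduce
]
# subprocess.run(cmd, env=env, check=True)      # run on a box with vLLM installed
```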
The most important result: the Qwen 3.5 122B-A10B model worked far better than I expected. It’s a Mixture-of-Experts architecture with 122 billion total parameters but only 10 billion active per token, and it delivered really good output throughput with very low streaming latency.
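Back-of-envelope numbers make the appeal of this shape clear (a sketch, assuming roughly one byte per parameter at FP8):

```python
# MoE sizing: all experts must fit in VRAM, but per-token compute only
# touches the active subset.
total_params_b = 122    # billions of parameters, all experts
active_params_b = 10    # billions active per token
vram_gb = 192           # two RTX PRO 6000 cards

fp8_weights_gb = total_params_b * 1.0           # ~1 byte/param at FP8
active_fraction = active_params_b / total_params_b

print(f"weights: ~{fp8_weights_gb:.0f} GB of {vram_gb} GB available")
print(f"active per token: {active_fraction:.1%}")
```

So the weights fit comfortably on two cards, while each token pays the compute cost of a much smaller model.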
FP8 quantization turned out to be very good: perplexity was 9.58 vs. 9.56 for full BF16 (essentially identical) while delivering 43% higher throughput on a single GPU. I also tried 4-bit AWQ quantization, which was faster still, but it introduced a strange logprob artifact where a single token dominated the probability distribution. Generative benchmarks still worked fine, but perplexity-based evaluation became unreliable.
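Perplexity here is just the exponentiated mean negative log-likelihood per token, which is why one dominating token can skew it. A minimal sketch with toy logprobs (not the measured values):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability over the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy values only. The AWQ artifact shows up as one near-zero logprob
# (an artificially dominant token) dragging the mean toward zero.
print(round(perplexity([-2.0, -2.0, -2.0]), 2))
```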
I had a lot of problems with XML-like thinking tags, particularly with MiniMax
M2.5. Its chat template injects a <think> opening tag as a forced prefix that
isn’t part of the generated output – the model only produces the closing
</think> tag. The same class of problem appeared with Qwen 3.5’s thinking
mode. Eventually I was able to fix all of this with chat template arguments
that disabled thinking: --default-chat-template-kwargs '{"enable_thinking":
false}'. With that change, MATH Hard jumped from 9% to 58%. I’m still not
sure whether the low scores came from incorrectly combining thinking mode
with the benchmarks, or from a fundamental limitation of the benchmark
tooling when it comes to reasoning models.
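When disabling thinking isn’t an option, the benchmark side has to cope with the orphan closing tag itself. A sketch of the post-processing I have in mind (the function name is mine):

```python
import re

def strip_thinking(text: str) -> str:
    # Case 1: the model emitted a full <think>...</think> block.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Case 2: the template injected <think> as a forced prefix, so the
    # output contains only the orphan closing tag; keep what follows it.
    if "</think>" in text:
        text = text.split("</think>", 1)[1]
    return text.strip()

print(strip_thinking("first I add 40 and 2...</think>The answer is 42."))
```

Handling both cases matters because some templates inject the prefix and some models emit the full pair.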
The MXFP4 quantization for gpt-oss-120b works pretty well, but the FP8-dynamic variant produced garbage. I found a patch in a GitHub issue that enabled Triton on SM120; it technically worked, but performance cratered. The Marlin kernel delivered 5.7x the output throughput.
Vision capabilities work well on the Qwen 3.5 122B-A10B. It’s an early-fusion multimodal model, so vision goes through the same weights with no separate adapter.
In conclusion, I’m pretty sure I’m going to use Qwen-3.5-122B-A10B in FP8 mode as my capable open model for now. It’s noticeably faster than MiniMax in batched throughput, and even in a vibe check with the opencode harness it holds up really well. The full breakdown with per-experiment details is here (warning: this link contains slop).