LLMs, Interpolation, and the Ossification of CLIs
A recent podcast with Jeremy Howard on Machine Learning Street Talk crystallized an important point about what LLMs can do (“interpolate”) and what they can’t (“creative research”):
“You have to be so nuanced about this stuff because if you say ‘they’re not creative’, it can give the wrong idea, because they can do very creative seeming things. But if it’s like, well, can they really extrapolate outside the training distribution? The answer is no, they can’t. But the training distribution is so big, and the number of ways to interpolate between them is so vast, we don’t really know yet what the limitations of that is.”
He describes how at Answer.AI, doing novel R&D work constantly pushes him past the boundary of the training distribution. The LLM goes from “incredibly clever to, like, worse than stupid, like not understanding the most basic fundamental premise.” Anyone who has used an LLM for something novel has felt this cliff.
I don’t agree that “interpolation” is all that LLMs can do. There is some “extrapolation” too.
In-Context Learning and CLIs
There’s a 2022 paper by Garg et al. (“What Can Transformers Learn In-Context?”) showing that transformers can effectively learn to run algorithms like least squares and gradient descent in a single forward pass. They interpolate not only from the training data but also from the context.
This is why the quality of tool output matters so much: tool output also ends up in the context. When a CLI tool returns a clear error message that explains what went wrong and how to fix it, the LLM can interpolate from that context even if the specific failure mode was never in the training set. The combination of non-deterministic reasoning (the LLM) with deterministic feedback (the tool output) is remarkably powerful. So it’s not just interpolating training data; it’s also extrapolating into the actual use case.
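The loop is simple to sketch: the model’s guess is non-deterministic, but the tool’s response is a deterministic function of that guess, and feeding it back narrows the next guess. A minimal toy sketch in Python, where `run_tool` and the `propose` callback are hypothetical stand-ins, not any real agent framework:

```python
def run_tool(command: str) -> str:
    """Deterministic stand-in for a CLI: same input, same output."""
    if command.startswith("headscale routes"):
        return 'Error: unknown command "routes" for "headscale"'
    if command.startswith("headscale nodes list-routes"):
        return "ID | Node    | Approved routes\n1  | gateway | 10.0.0.0/24"
    return f"Error: unknown command: {command}"


def agent_loop(propose, goal: str, max_steps: int = 5) -> str:
    """Feed each deterministic tool result back into the model's context."""
    context = [goal]
    for _ in range(max_steps):
        command = propose(context)   # non-deterministic reasoning
        output = run_tool(command)   # deterministic feedback
        context.append(f"$ {command}\n{output}")
        if not output.startswith("Error"):
            return output
    return context[-1]
```

The interesting property is that `propose` can start out wrong: as long as the error text lands in `context`, the next proposal has something concrete to extrapolate from.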
For the most popular tools – git, npm, cargo – maybe that’s already
well understood. The training data is saturated with examples, and the LLM
knows what to do. But for less common tools, the quality of the error output
becomes the difference between the LLM solving the problem and the LLM
spiraling into nonsense.
The Headscale Problem
I’ve seen this play out concretely with Headscale, the open-source control server for Tailscale. If you want to run Tailscale completely self-hosted without relying on Tailscale’s coordination servers, Headscale is great.
In v0.26.0 (May 2025), the route management CLI was completely rewritten. The old syntax:
headscale routes list
headscale routes enable -r <route_id>
became:
headscale nodes list-routes
headscale nodes approve-routes --identifier <node_id> --routes <CIDR,...>
The entire routes subcommand was removed. Route acceptance became route
approval. The mental model changed from enabling route IDs to approving CIDR
blocks per node, and the behavior also changed: approve-routes replaces all
approved routes with whatever you pass, so if you only specify one route, you
lose the rest.
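That replace-not-merge behavior is worth spelling out, because it is exactly the kind of silent semantic change an LLM trained on the old behavior will miss. A toy model of the semantics (not Headscale’s actual code, just the shape of the footgun):

```python
def approve_routes(approved: dict, node_id: int, routes: list[str]) -> None:
    # Assignment, not union: the new list replaces the node's approved set.
    approved[node_id] = set(routes)


approved = {5: {"10.0.0.0/24", "192.168.1.0/24"}}
approve_routes(approved, 5, ["10.0.0.0/24"])
# Node 5 has now silently lost 192.168.1.0/24.
```

To add a route you must re-list every route you want to keep, plus the new one.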
Every time I ask Claude Code to work with Headscale, it tries the old syntax. And it
tries really hard. It generates headscale routes enable, gets an error,
and then tries variations of the old syntax rather than recognizing that the
command structure has fundamentally changed. Eventually it finds the GitHub
issue and understands the problem. The old syntax is baked deep into the
training distribution, and the new syntax barely exists there yet, if at all.
The Ossification Risk
This creates a subtle pressure toward software ossification. If every CLI change breaks the LLM-assisted workflow for thousands of users, there’s a real incentive to never change anything. The training data becomes a form of technical debt that the entire ecosystem inherits.
What can be done?
First, good error messages matter more than ever. When the old syntax is
used, the tool should explain what changed and show the new equivalent. This
gives the LLM the context it needs to adapt. Headscale does some of this,
but not quite enough. The message it outputs says Error: unknown
command "routes" for "headscale", where it should say something like Error:
command "routes" has been superseded by "nodes list-routes" and "nodes
approve-routes" to give these eager and stubborn LLMs a chance.
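Such a lookup is cheap to build. A minimal sketch of the idea in Python (the table and function are hypothetical illustrations, not Headscale’s actual Go code):

```python
# Hypothetical map from retired subcommands to their replacements, so the
# error message itself teaches the new syntax.
SUPERSEDED = {
    "routes": '"nodes list-routes" and "nodes approve-routes"',
}


def unknown_command_error(cmd: str) -> str:
    replacement = SUPERSEDED.get(cmd)
    if replacement:
        return f'Error: command "{cmd}" has been superseded by {replacement}'
    return f'Error: unknown command "{cmd}" for "headscale"'
```

One static table, maintained alongside the breaking change itself, is all it takes to turn a dead-end error into a recoverable one.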
Second, versioned command documentation in the output helps a lot. If
the error message, or --help output includes migration notes or links to
changelogs, the LLM can pick those up from the context window.
Finally, accepting old syntax with deprecation warnings is the gentlest path, and maybe mandatory now. Keep the old commands working but print a warning with the new equivalent. Do not introduce breaking changes!
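A shim like that can be little more than a translation table over argv. A hypothetical sketch (the names and the table are illustrative, not Headscale’s implementation):

```python
import sys

# Hypothetical table mapping old argv prefixes to their new equivalents.
ALIASES = {
    ("routes", "list"): ("nodes", "list-routes"),
    ("routes", "enable"): ("nodes", "approve-routes"),
}


def rewrite_argv(argv: list[str]) -> list[str]:
    """Keep old commands working, but warn with the new equivalent."""
    key = tuple(argv[:2])
    if key in ALIASES:
        print(
            f"Warning: '{' '.join(key)}' is deprecated; "
            f"use '{' '.join(ALIASES[key])}' instead.",
            file=sys.stderr,
        )
        return list(ALIASES[key]) + argv[2:]
    return argv
```

The warning text doubles as migration documentation: both the human and the LLM see the new syntax at exactly the moment they need it.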
Ben Thompson argues that the most important moat Anthropic currently has is the combination of model, harness, and a model trained for that harness. A good CLI acts as a “skill” for the harness in real time, creating valuable signal to course-correct the agent. The better the tools communicate, the further the LLM can extrapolate beyond its training data, and the fewer tokens it uses. As LLMs become a primary interface to CLIs, the quality of that communication becomes a first-class design concern.
Other people saying similar things
- I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed: It’s better to think of “the AI” as the whole cybernetic system of feedback loops joining the LLM and its harness, because the harness can make as much of a difference as improvements to the model itself.
- The Unreasonable Effectiveness of an LLM Agent Loop with Tool Use: On how astonishingly well a loop with an LLM that can call tools works for all kinds of tasks.
- A strong commitment to backwards compatibility means keeping your mistakes: The eternal tension between stability and the ability to fix design errors.