Every time a new AI model is released, the conversation follows a familiar pattern. People compare benchmarks, debate reasoning capabilities, and celebrate coding scores and context window sizes. We all get excited about what the latest model can do.
I get excited too. But recently I started asking myself a different question: Do I actually need the most advanced model for my day-to-day work?
My Default Choice
For a long time, Sonnet has been my primary coding assistant. Even after newer and more capable models were released, I continued using Sonnet because it consistently delivered strong results for software development tasks. Recently, while working on a code refactoring task, I became curious. Instead of automatically using my usual model, I decided to compare it with a smaller and cheaper alternative.
A Simple Experiment
The setup was straightforward. I gave both models the same file and the same task: refactor the code. The first thing I noticed was the difference in cost:
- Sonnet: 76.1 credits
- Haiku: 13.3 credits
That makes Haiku roughly 5.7× cheaper for the same task. Naturally, I expected the more expensive model to produce the better result. But that’s not what happened.
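For reference, the ratio follows directly from the two credit figures. A quick check in Python (the variable names are mine, chosen for illustration):

```python
# Credit figures reported for the same refactoring task.
sonnet_credits = 76.1
haiku_credits = 13.3

# How many times more the larger model cost for this one task.
cost_ratio = sonnet_credits / haiku_credits

# Share of the spend saved by choosing the cheaper model.
savings_fraction = 1 - haiku_credits / sonnet_credits

print(f"Sonnet cost {cost_ratio:.1f}x as much as Haiku")       # 5.7x
print(f"Switching saved {savings_fraction:.0%} of the credits")  # 83%
```

Keep in mind this is a single task on a single file, so the ratio describes this run rather than a general price difference.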
The Output That Surprised Me
To be completely honest, I preferred the solution from Haiku. Instead of making incremental changes to the existing file, it split the code into three smaller files. The structure felt cleaner and easier to maintain. What surprised me even more was that it followed the coding standards defined in our Copilot instructions more consistently.
The output wasn’t perfect. Neither was Sonnet’s. But when I compared the final results, I found myself preferring the work produced by the model that cost nearly six times less. That forced me to rethink an assumption many of us make: Bigger and more expensive doesn’t automatically mean better.
The Harness Matters More Than We Think
This experiment reinforced something I’ve been learning over the past year: Model capability is only one part of the equation.
In our repositories, we’ve spent considerable effort building what I call an AI development harness. Instead of treating AI as a chatbot that magically understands our codebase, we provide it with a structured environment:
- Repository-specific instructions
- Coding standards and conventions
- Architectural guidance
- Development workflows
- Context about how the project is organized
- Review and validation expectations
I wrote about this approach in a previous article: Beyond Coding: How I Built an AI Harness to Automate My Development Lifecycle.
What I’ve discovered is that once these guardrails are in place, smaller models become far more capable than many people expect. The model is no longer trying to guess what “good” looks like; the repository, architecture, and instructions already define the expectations. In many ways, the AI harness becomes a force multiplier.
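To make the idea concrete, here is a minimal sketch of how harness guidance might be placed in front of a task before it reaches a model. This is not our actual harness: the `Harness` class, its section titles, and the sample rules are all hypothetical, and real tools load this kind of guidance from files in the repository rather than from a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical container for the guidance a repository gives a model."""

    sections: dict[str, str] = field(default_factory=dict)

    def build_prompt(self, task: str) -> str:
        # Put every non-empty section ahead of the task so the model reads
        # the expectations before it reads the request.
        parts = [
            f"## {title}\n{body.strip()}"
            for title, body in self.sections.items()
            if body.strip()
        ]
        parts.append(f"## Task\n{task.strip()}")
        return "\n\n".join(parts)


# Illustrative content only; these rules are invented for the example.
harness = Harness(sections={
    "Coding standards": "Keep functions small. Add type hints.",
    "Architecture": "Services must not import from the web layer.",
    "Validation": "",  # empty sections are skipped
})

prompt = harness.build_prompt("Refactor the billing module.")
print(prompt)
```

The point of the sketch is the ordering: the definition of "good" arrives with every request, so the model does not have to infer it.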
Maybe We Are Optimizing the Wrong Thing
When teams experience inconsistent AI results, the first reaction is often: “Let’s use a bigger model.” Sometimes that’s the right decision, but I increasingly wonder whether we are overlooking a more important question: Have we created the right environment for the model to succeed?
Prompt Quality Still Matters
Another lesson from this experiment is that task clarity still matters. A well-described task can significantly reduce the gap between a flagship model and a smaller one. Most day-to-day software engineering tasks (refactoring, writing tests, creating documentation, or implementing known patterns) are not cutting-edge research problems. For these, a smaller model is often more than capable.
Optimizing for Outcomes Instead of Benchmarks
The AI industry naturally focuses on intelligence, but in production environments, the question isn’t “Which model scored highest on a benchmark?” The questions are:
- Did the task get completed?
- Was the result maintainable?
- Did it follow project standards?
- Was the cost justified?
- Can the team scale its usage economically?
Sometimes the answer will be to use the most advanced model available. But increasingly, I’m finding that the better answer is: Use the least expensive model that can reliably solve the problem.
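That rule can be sketched as a small selection routine: try candidates from cheapest to most expensive and stop at the first one that passes your checks. Everything here is an assumption for illustration, including the model names, the costs, and `demo_checks`, which stands in for whatever real evaluation a team uses (tests passing, standards followed, review approved).

```python
from typing import Callable, Optional

# Hypothetical candidate: (model name, cost in credits per task).
Candidate = tuple[str, float]


def cheapest_reliable_model(
    candidates: list[Candidate],
    passes_checks: Callable[[str], bool],
) -> Optional[str]:
    """Return the lowest-cost model whose output passes the given checks."""
    for name, _cost in sorted(candidates, key=lambda c: c[1]):
        if passes_checks(name):
            return name
    return None  # nothing qualified; escalate or rework the task


def demo_checks(model_name: str) -> bool:
    # Stand-in for a real evaluation; pretend only these two models pass.
    return model_name in {"medium", "large"}


# Made-up costs, listed out of order to show that sorting handles it.
candidates = [("large", 10.0), ("small", 1.0), ("medium", 4.0)]
choice = cheapest_reliable_model(candidates, demo_checks)
print(choice)  # medium
```

The word "reliably" carries most of the weight: a check that runs once on one file is weaker evidence than one that runs across many representative tasks.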
Final Thoughts
I’m still excited every time a new model is released, but this experiment reminded me that the goal isn’t to use the smartest model—it’s to get the best outcome.
The AI industry spends a lot of time discussing model intelligence. I think we should spend more time discussing harness quality. Because once the guardrails, standards, and context are in place, a model that’s 5.7× cheaper can sometimes deliver results that are just as good—or even better.