Model Buzz Roundup: Week of June 3, 2026

#llm #openrouter #modelroundup #aiassisteddevelopment

Three of the four scoreboards I trust say MiniMax M3 is the best deal in open-weights AI right now. The fourth says nobody has actually checked.

That gap is the whole story this week: a model near the top of the usage charts that nobody has independently verified. M3 launched June 1, rocketed up the OpenRouter rankings on a wave of launch hype and a half-price coupon, and landed at the top of the open-weights pile on a serious benchmark. It also shipped without its weights, without a technical report, and with almost no Arena votes to its name. So I spent the week doing what I always do: cross-checking the four places that measure models against each other, because any one of them on its own will lie to you.

And while I was untangling M3, Claude Opus 4.8, the model sitting at #1 on raw intelligence, was quietly setting people’s money on fire. More on that below. Let’s go.

MiniMax M3: The Best Model Nobody’s Verified
The Smartest Model Is Also the One Eating Your Tokens
The Cheapskate Picks: Where You’re Actually Wasting Money
The Map Just Redrew Itself
Coming Soon (Allegedly)
What I’d Actually Run This Week

MiniMax M3: The Best Model Nobody’s Verified

Here’s the case for M3, and it’s a real one.

On OpenRouter it jumped to #3 by weekly token volume at 2.89 trillion tokens, with a week-over-week delta my scraper rendered as “>999%,” which is what you get when a model goes from not existing to being everywhere in seven days. On Artificial Analysis it scored 54.7 on the v4.0 Intelligence Index, good for #7 overall and, more importantly, the highest score of any open-weights model on the board, edging out Kimi K2.6 (53.9) and Xiaomi’s MiMo-V2.5-Pro (53.8). It also sits on AA’s Intelligence-vs-Cost Pareto frontier, which is the chart I care about most, because it answers “is this smart for the money” instead of just “is this smart.” And the pricing is genuinely cheap: $0.30 per million input tokens and $1.20 per million output, roughly a tenth of what the frontier closed models charge.

The demos are wild, too. One agentic evaluation had M3 autonomously optimizing a CUDA kernel from 7.6% to 71.3% hardware utilization for a 9.4x speedup, across 1,959 tool calls over 24 hours with zero human babysitting. VentureBeat ran the headline that it “eclipses GPT-5.5 and Gemini 3.1 Pro on key benchmarks for 5-10% of the cost.” If you only read those two sentences, you’d switch today.

So here’s the part where I ruin it.

Every one of those benchmark numbers is vendor-published. As of this writing, MiniMax has not released the weights or a technical report. AA’s own changelog literally describes it as the “leading open weights model, once the weights are released.” That “open-weights” label is currently a company promise, not a thing you can download and verify. TechTimes called it exactly what it is: “Frontier Claims, Unverified Benchmarks.”

And then there’s the coupon. M3’s launch ran a 50%-off promo on the MiniMax provider through June 7. So that “>999%” OpenRouter spike? Part real curiosity, part “free money expires Sunday.” We’ve seen this movie. Tencent’s Hy3 pulled the same free-period stunt back in May and I got burned calling it a flash in the pan, so I’m not going to pretend the usage is meaningless. But I’m also not going to pretend a discount-driven launch spike is the same thing as adoption.

The tell that keeps me honest is Arena. The Arena leaderboard runs on human head-to-head votes, and M3 is completely absent from the Overall board. Its only appearance anywhere is a single early entry in the Math category at 1487. That’s not a knock on the model. Arena always lags new releases by a week or two because votes have to accumulate, which means the one source that measures lived human preference has no read on M3 yet. Three scoreboards say buy. The fourth says “who?”

One more asterisk for the agent crowd: M3 is the slowest model in the top tier on AA’s speed chart at 41 output tokens per second. If you’re running it in a long agent loop, that slowness compounds into real wall-clock cost. Cheap per token isn’t cheap per task when every task takes twice as long.
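To put rough numbers on that, here is a minimal sketch of the wall-clock arithmetic. The 41 tokens-per-second figure is AA's number quoted above; the 200,000-token agent run and the 82 tps comparison model are made-up assumptions for illustration, and the function deliberately ignores input processing, tool latency, and retries.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent purely generating output tokens (no tool or network latency)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


AGENT_RUN_OUTPUT_TOKENS = 200_000  # hypothetical size of one long agent run
M3_TPS = 41.0                      # AA speed figure quoted in this post
FASTER_TPS = 82.0                  # hypothetical model that is twice as fast

m3_minutes = generation_seconds(AGENT_RUN_OUTPUT_TOKENS, M3_TPS) / 60
faster_minutes = generation_seconds(AGENT_RUN_OUTPUT_TOKENS, FASTER_TPS) / 60

print(f"M3: {m3_minutes:.0f} min of generation")
print(f"2x-faster model: {faster_minutes:.0f} min of generation")
```

Under those assumptions the slower model spends roughly 80 minutes generating where the faster one spends roughly 40, which is the gap that a per-token price never shows you.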

My take: M3 is probably real and probably very good. But “probably” is doing a lot of work, and I don’t move my daily driver on a coupon and a press release. Watch for the weights to actually drop and for Arena to fill in. Until then it’s the most interesting model of the week, not the one I’d bet a production workload on.

The Smartest Model Is Also the One Eating Your Tokens

If M3 is the hype story, Claude Opus 4.8 is the inconvenient-truth story.

On the AA Intelligence Index, Opus 4.8 is #1, full stop: 61.4, ahead of GPT-5.5’s 60.2 and everything else. It is, by that measure, the smartest model you can rent right now. On OpenRouter it climbed 199% week-over-week to 1.26T tokens as people piled in roughly two weeks post-launch. So far, so good.

Then you read GitHub issue #64961 on the Claude Code repo, and the picture sours fast. Users are reporting that Opus 4.8 (and 4.7) now burn 2-3x more tokens for equivalent work than they did before the update. One logged case: Opus 4.8 on medium effort spent 46,000 output tokens on hidden “thinking” for a simple coding turn. People are also seeing it re-fetch identical tool results 2-3x more often than 4.7, plus frequent disconnects that force a resume-and-retry, which burns even more tokens. If you’re on a five-hour session budget, the smartest model on the planet is also the one quietly chewing through your quota with nothing visible to show for it.

This is the paradox I keep running into lately: peak intelligence and peak cost-efficiency have fully decoupled. The model that wins the benchmark is not the model that wins your invoice. AA’s blended price chart puts Opus 4.8 at $4.10 per million, the priciest of the leaders, before you account for the token inflation on top. You’re paying a premium rate to burn premium volume.
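Here is that "premium rate times premium volume" arithmetic as a small sketch. It assumes the reported 2-3x token inflation applies evenly to the whole blended mix, which is a simplification of mine rather than anything the issue thread or AA states.

```python
def effective_price_per_million(list_price: float, token_inflation: float) -> float:
    """Dollars per million tokens of pre-regression work, once the model
    emits `token_inflation` times as many tokens for the same job."""
    if token_inflation <= 0:
        raise ValueError("token_inflation must be positive")
    return list_price * token_inflation


OPUS_BLENDED_PRICE = 4.10  # $ per million tokens, AA blended figure quoted above

for inflation in (1.0, 2.0, 3.0):
    cost = effective_price_per_million(OPUS_BLENDED_PRICE, inflation)
    print(f"{inflation:.0f}x tokens -> ${cost:.2f} per million tokens of equivalent work")
```

On those assumptions the sticker price of $4.10 behaves more like $8.20 to $12.30 for the same amount of work.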

I still reach for Opus when a problem genuinely needs the extra IQ. But “genuinely needs” is carrying weight now, because the default-to-the-smartest-model habit got a lot more expensive this month.

The Cheapskate Picks: Where You’re Actually Wasting Money

This is the part of the roundup I actually use myself, so here’s the method in one breath: take the Arena leader in a category, draw a band 50 rating points below it, and find the cheapest model still inside that band. Arena’s top end is compressed; the whole competitive set usually fits inside 50 points on a 1400+ scale, so “cheapest in band” is a real choice between near-equivalents, not “settle for worse.”
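As a rough sketch, the band method looks like this in code. The Gemini 3.5 Flash and MiMo-V2.5-Pro numbers are the Math figures quoted elsewhere in this post (MiMo's 1484 is the 1519 leader minus its 35-point gap); the third model is hypothetical and exists only to show the filter rejecting something cheap but out of band. The `Model` class and `cheapskate_pick` function are illustrative names, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    arena_rating: int
    output_price: float  # $ per 1M output tokens


def cheapskate_pick(models: list[Model], band: int = 50) -> tuple[Model, Model]:
    """Return (leader, cheapest model within `band` rating points of the leader)."""
    if not models:
        raise ValueError("need at least one model")
    leader = max(models, key=lambda m: m.arena_rating)
    floor = leader.arena_rating - band
    in_band = [m for m in models if m.arena_rating >= floor]
    return leader, min(in_band, key=lambda m: m.output_price)


MATH = [
    Model("Gemini 3.5 Flash", 1519, 9.00),
    Model("MiMo-V2.5-Pro", 1484, 0.87),
    Model("hypothetical-bargain", 1450, 0.20),  # made up: cheaper, but outside the band
]

leader, pick = cheapskate_pick(MATH)
gap = pick.arena_rating - leader.arena_rating
ratio = leader.output_price / pick.output_price
print(f"{leader.name} leads; pick {pick.name} ({gap:+d} points, ~{ratio:.0f}x cheaper)")
```

The leader is always inside its own band, so if nothing cheaper qualifies the function simply returns the leader as the pick.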

Here’s where that landed this week:

Category	Leader	Cheapskate pick	Pick price (output)	Rating gap vs leader	Price vs leader	On AA Pareto frontier
Overall	Claude Opus 4.6 (thinking)	GLM-5.1	$3.08/1M	−29	~8x cheaper	nearby
Coding	Claude Opus 4.6 (thinking)	GLM-5.1	$3.08/1M	−24	~8x cheaper	nearby
Creative Writing	Claude Opus 4.6 (thinking)	Gemini 3 Flash	~$3/1M	−39	~8x cheaper	n/a
Instruction Following	Claude Opus 4.6 (thinking)	MiMo-V2.5-Pro	$0.87/1M	−42	~29x cheaper	✓
Hard Prompts	Claude Opus 4.6 (thinking)	GLM-5.1	$3.08/1M	−34	~8x cheaper	nearby
Math	Gemini 3.5 Flash	MiMo-V2.5-Pro	$0.87/1M	−35	~10x cheaper	✓

Two models do all the heavy lifting here, and neither one is the model everybody spent the week talking about.

GLM-5.1 (Z.ai) takes Overall, Coding, and Hard Prompts. At $0.98 in / $3.08 out it’s roughly an eighth of what Opus costs, and it gives up something like 1.5-2% on the Arena scale to do it. It’s the boring correct answer that refuses to make headlines: top-ten across three categories and not a single viral thread about it.

MiMo-V2.5-Pro (Xiaomi) is the one I want to put a flag on. It wins Math and Instruction Following outright, and in Instruction Following it’s ~29x cheaper than the Opus leader for a 42-point gap. More importantly, it’s confirmed on AA’s Intelligence-vs-Cost Pareto frontier in both categories, meaning two completely independent methodologies (my Arena-band math and AA’s benchmark-vs-price chart) point at the same model. That convergence is the highest-confidence signal this whole method produces, and it’s pointing at a Xiaomi model trading at 87 cents per million output tokens.
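For anyone who hasn't worked with a Pareto frontier before, here is a minimal sketch of the textbook definition: a model is on the frontier if no other model is at least as cheap and at least as smart. This is not AA's actual code or methodology, which I haven't seen, and every model and number below is invented for illustration.

```python
def pareto_frontier(models: list[tuple[str, float, float]]) -> list[str]:
    """models: (name, price, score) tuples. Returns names on the
    price-vs-score frontier, ordered from cheapest to priciest."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to priciest; at equal price, best score first.
    for name, price, score in sorted(models, key=lambda m: (m[1], -m[2])):
        # A pricier model only earns a spot if it beats everything cheaper.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Entirely hypothetical models: (name, $ per 1M tokens, intelligence score)
HYPOTHETICAL = [
    ("model-a", 0.50, 48.0),
    ("model-b", 1.00, 54.0),
    ("model-c", 2.00, 52.0),   # dominated: model-b is cheaper and smarter
    ("model-d", 10.00, 61.0),
]

print(pareto_frontier(HYPOTHETICAL))
```

In this toy data model-c drops out because model-b beats it on both axes, while the expensive model-d stays on the frontier because nothing cheaper matches its score.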

The one inversion worth flagging: in Math, the leader is the expensive one. Gemini 3.5 Flash tops the category at 1519 but costs $9 per million output, a “Flash” model priced like a flagship, which is a rant I already went on last month. MiMo gives you within ~2% of it for a tenth of the price. When your cheapest competitive option is also 10x cheaper than the leader, that’s not a tradeoff, that’s just the answer.

Notice who’s not in this table: MiniMax M3. The breakout of the week isn’t a cheapskate winner anywhere, largely because Arena has barely rated it: one early Math entry and nothing else. The proven value plays are the quiet models. Funny how that keeps working out.

The Map Just Redrew Itself

Step back from individual models and look at the OpenRouter market-share board, because it’s telling a bigger story than any single launch.

DeepSeek is the #1 author on the entire platform at 19.4% of all tokens, with DeepSeek V4 Flash sitting at #1 overall (4.07T tokens, $0.10 in / $0.20 out, still the cheap workhorse everybody actually runs). Add up DeepSeek, Tencent, Xiaomi, MiniMax, and Qwen and you’ve got more than half of OpenRouter’s token volume flowing through Chinese labs. Anthropic holds second at 15.6%. OpenAI? Down at 6.8%, which for the company that started this whole gold rush is a genuinely striking number.

Into that gap, NVIDIA planted a flag. Nemotron 3 Ultra, a 550B/55B-active MoE, got announced at Computex on June 1 and dropped its weights on Hugging Face June 4. It’s fast (171 output tps, second only to gpt-oss-120b on AA’s chart) and genuinely open, weights and recipes and all. NVIDIA’s framing is “the most capable US-developed open model ever,” and that’s true: it beats Gemma 4 and gpt-oss-120b. But scope that claim carefully: its Intelligence Index of 47.7 lands it a solid six-plus points behind Kimi K2.6, MiMo, and MiniMax M3. The best US open model is still chasing the Chinese open models. That’s the actual state of play in June 2026, and no amount of Computex keynote energy changes it. (It’s also not on OpenRouter yet; DeepInfra and HF only for now.)

If you’ve been ignoring the non-Western labs because the Reddit chatter is thinner over there, this is your reminder that the usage numbers don’t care about your feed.

Coming Soon (Allegedly)

The rumor pile, labeled honestly:

Gemini 3.5 Pro: announced. Google said at I/O (May 19) it’d ship “next month,” which is now. No date, no model ID yet. This is the one I’m actually watching.
Grok 5: rumored, low confidence. 6 trillion parameters on the Colossus 2 supercluster, Q2 window per xAI, but prediction markets give it only about a 33% chance of shipping by June 30. Translation: don’t hold your breath.
MiniMax M3 weights + technical report: committed, not shipped. The single release that would flip M3 from “interesting” to “verified.” Watch for it.
Claude Mythos (Mythos 1): restricted. Still locked to ~50 Project Glasswing partners for defensive cybersecurity work. No general availability, no timeline.
GPT-6: speculation. No announcement, no signal. OpenAI’s public ceiling is still GPT-5.5, which, as I covered above, might be part of why their token share looks the way it does.

What I’d Actually Run This Week

No inspiration porn, just the shortlist:

If you need raw intelligence for a hard problem and you can stomach the bill, Opus 4.8 is the smartest thing going, though watch issue #64961 and keep an eye on your token counter, because it’ll spend 46k tokens thinking about a one-liner if you let it. For the 90% of work that doesn’t need that, GLM-5.1 is the unglamorous ~8x-cheaper answer across general, coding, and hard prompts, and MiMo-V2.5-Pro is the genuine steal for math and instruction-following at 87 cents a million with two independent methodologies vouching for it.

And MiniMax M3 is the model I’m most excited about and least willing to recommend, which is a weird sentence to write but an honest one. Three scoreboards love it. The fourth hasn’t met it. The weights aren’t out. The benchmarks are self-reported. The launch spike rode a coupon. Every individual flag is yellow, not red. I’ve been doing this long enough to know that “probably great, just trust us” is exactly the pitch that’s burned me before. I’ll move my workload when the weights drop and the votes come in. Until then I’ll keep doing the boring thing: cross-checking four scoreboards and reaching for the cheap model that already proved itself.

See you next week, when half of this is wrong and there’s a new stealth model nobody can identify.