L. Cordero

Posted on Jun 17

Predictions first, data later: seven hot takes on AI agent readiness before I scan 50 sites.

#buildinpublic #ai #aws #webdev

Can a robot read this? Asking for a friend.

A few months ago, for a hackathon, I built a small tool that checked whether an AI agent could read a website. Deterministic checks underneath, an AI reasoning layer on top. It worked. I called it Hermes Clew.

I could have left it there. The hackathon ended and it was only a proof of concept. But the idea didn't leave. The more agents showed up at the products people ship, the more it looked like the question of the next few years. Not "can a person use this." Can an agent.

So Hermes became two things: Perseus Clew, the engine, and Agentis Lux, the product. (Latin, roughly "light of the agent." It shows you what the agent sees.) I laid out the thesis and the build in my last post. This is the part I promised to come back for.

Consider it my flag in the moon sand. I think agents are the question. And I'm willing to write down what I expect to find before I have any data to hide behind.

Here's the bet.

I'm pre-registering my predictions (i.e. I'm playing myself!)

Before the engine scans a single site, I wrote down what I think it will find. Then I committed it to the repo with a timestamp.

Scientists call this pre-registration. I call it no take-backs.

The reason is simple. When the data comes in, it is easy to look at the numbers, find the pattern that flatters the project, and write the post as if that is what I expected all along. Writing the predictions down first makes that impossible. If I'm right, I predicted it. If I'm wrong, the miss stays on the page next to the result. Either way you can check my work.

That matters more than usual here, because scoring tools have a reputation for marking their own homework. The fix is to make everything inspectable, including the part where I could fool myself.

The thesis it all hangs on

The web was built by humans, for humans. Search crawlers and screen readers got partial accommodation later. Agents are showing up to a playground nobody built for them.

So I expect most sites to have gaps in what an agent can read and do. And I expect those gaps to land in predictable places: wherever no human incentive paid for machine readability. A site tends to be readable to an agent by accident, wherever something else (search ranking, accessibility law) already forced clean structure. The rest is a blind spot.

That is the engine under every prediction below.

The predictions

Once the engine is live I'm scanning 50 sites: ten each across e-commerce, SaaS, content and media, government, and indie/builder projects. Chosen by maximum-variation sampling, documented site by site in the repo, to span size, platform, and build type. Not at random, and not by the score I expected.

Each prediction has a confidence tag and a line saying what would prove it wrong, because a prediction you can't lose isn't a prediction.

1. The rendering cliff is the deciding line. (Pretty sure. The one I'd bet money on.)
Sites built as heavy client-side JavaScript apps will be hard for agents to read, no matter what kind of site they are. Most AI crawlers don't run JavaScript. Vercel's network data, across hundreds of millions of crawler fetches, found no JavaScript execution by the major AI crawlers it tracked. If your content only appears after the JS runs, the agent sees an empty shell.
Wrong if: client-heavy sites score about the same as server-rendered ones on whether the content is in the HTML.
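
Here's a minimal sketch of what "is the content in the HTML" can mean in practice. It is not the engine's actual check. It assumes you already have the raw HTML as a string, fetched without running any JavaScript, and the names `static_text` and `looks_like_empty_shell` and the 200-character threshold are made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text a non-JavaScript agent could read from raw HTML."""

    # Content inside these tags is code or inert markup, not readable text.
    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html: str) -> str:
    """Return the text present in the HTML before any script runs."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """Flag pages with almost no static text. The threshold is an assumption."""
    return len(static_text(html)) < min_chars


# A client-rendered shell: one empty mount point plus a script.
shell = '<html><body><div id="root"></div><script>render()</script></body></html>'
print(looks_like_empty_shell(shell))
```

A real check would also want to compare against the rendered page, since a short page is not always a broken one.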

2. Government beats startups. (Pretty sure.)
US government sites will be easier for agents to read than small indie and startup sites. Not because anyone in government set out to court agents, but because federal law (Section 508) forces clean, labeled, semantic markup, and that same structure is what an agent parses. Regulation made them accidentally agent-ready. I'm keeping this run US-only on purpose, so the 508 mechanism is the thing being tested and not a mix of different countries' rules.
Wrong if: US government sites land at or below indie sites on semantic structure.
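
To make "semantic structure" concrete, here's a rough sketch that counts a few of the signals accessibility rules tend to force: landmark elements, headings, and images with alt text. The choice of signals and the name `semantic_signals` are my assumptions for illustration, not the engine's scoring.

```python
from html.parser import HTMLParser

# HTML5 sectioning and landmark elements. Which ones count is an assumption.
LANDMARKS = {"header", "nav", "main", "article", "section", "aside", "footer"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _SemanticCounter(HTMLParser):
    """Tallies structural hints that help both screen readers and agents."""

    def __init__(self):
        super().__init__()
        self.counts = {"landmarks": 0, "headings": 0, "images": 0, "images_with_alt": 0}

    def handle_starttag(self, tag, attrs):
        if tag in LANDMARKS:
            self.counts["landmarks"] += 1
        elif tag in HEADINGS:
            self.counts["headings"] += 1
        elif tag == "img":
            self.counts["images"] += 1
            # Only a non-empty alt attribute counts as a label here.
            if dict(attrs).get("alt"):
                self.counts["images_with_alt"] += 1


def semantic_signals(html: str) -> dict:
    """Return raw counts. Turning them into a score is left out on purpose."""
    counter = _SemanticCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


page = '<main><h1>Benefits</h1><nav></nav><img src="a.png" alt="Chart"><img src="b.png"></main>'
print(semantic_signals(page))
```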

3. Structured data is a commerce-and-media thing. (Maybe, leaning pretty sure.)
The machine-readable labels that tell an agent what a page is will show up mostly on shopping and news sites, and be close to absent on government and indie sites. Search ranking is the only incentive that paid for them.
Wrong if: it's spread evenly, or shows up where there's no search incentive.
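
One common form of those machine-readable labels is a JSON-LD block in the page head. As a hedged sketch (the engine may look at other formats too, and `schema_types` is a name I made up), this pulls the declared `@type` values out of raw HTML:

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Gathers the raw contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def schema_types(html: str) -> list:
    """Return the top-level @type values declared in the page's JSON-LD."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    types = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # A malformed block is skipped, not counted.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "@type" in item:
                declared = item["@type"]
                types.extend(declared if isinstance(declared, list) else [declared])
    return types
```

An empty list from a shopping page would be a point against this prediction. An empty list from an indie project would be a point for it.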

4. E-commerce is the widest spread. (Maybe.)
Online stores will have the biggest internal range of any group. The platform mix runs from templated stores to fully custom JavaScript storefronts, so some will be clean and some will be unreadable to an agent.
Wrong if: store scores cluster tightly, or another group spreads wider.

5. Some sites lock the gate by accident. (Coin toss.)
More than a couple of sites will block agents at robots.txt, so the agent doesn't reach the page at all. I think most of this is unintentional, a default or a blanket rule, not a decision to keep agents out.
Wrong if: almost nobody restricts agents, or the ones who do clearly meant to.
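
Python's standard library can answer the "does the agent reach the page at all" half of this. A small sketch, assuming you've already downloaded the robots.txt text. The three user-agent tokens are just an illustrative list, not the set the engine tests, and `blocked_agents` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Illustrative AI crawler tokens. The real list to test is an open question.
AGENT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def blocked_agents(robots_txt: str, url: str, agents=AGENT_TOKENS) -> list:
    """Return the agents that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# A blanket rule: nobody named, everybody locked out.
blanket = "User-agent: *\nDisallow: /\n"
print(blocked_agents(blanket, "https://example.com/pricing"))
```

Telling accident from intent is the harder half. A blanket `User-agent: *` rule that happens to catch agents reads differently from a rule that names a specific crawler, and that judgment still needs a human look.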

6. Scores will be all over the map. (Pretty sure.)
Overall, scores will spread wide rather than cluster, because there is no settled standard for agent readiness yet. The rules are months old, and how many of us outside developer-tool companies have adopted them? When the rules are this new, sites can't have converged on them.
Wrong if: scores cluster tightly in one band, which would mean sites are already converging without trying.
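
"Spread" and "cluster" need a number behind them, both here and in prediction 4. One simple way to get one, sketched under the assumption that each site ends up with a single numeric score per category: range and standard deviation per group. The scores below are invented purely to show the shape of the calculation.

```python
from statistics import pstdev


def spread_by_group(scores: dict) -> dict:
    """Return the range and population standard deviation for each group."""
    return {
        group: {"range": max(values) - min(values), "stdev": pstdev(values)}
        for group, values in scores.items()
    }


def widest_group(scores: dict) -> str:
    """Name the group with the largest internal range."""
    summary = spread_by_group(scores)
    return max(summary, key=lambda group: summary[group]["range"])


# Invented numbers, not results.
example_scores = {
    "ecommerce": [20, 90, 55],
    "government": [60, 70, 65],
}
print(widest_group(example_scores))
```

With ten sites per group, range is sensitive to a single outlier, so reporting both figures side by side is the safer read.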

7. Good spec, wrong doorstep. (Already seen it, so not a blind call.)
Unlike the other six, this one isn't blind. I'd already seen it while picking the sites. Nine of the ten specs I confirmed live in a GitHub repo, not on the company's own site. One serves it from its own domain. So I'm logging it as an expectation I already have reason to hold, not a bet I placed before seeing anything. The pattern: companies that ship an API tend to publish a spec, but not where an agent would look for it. The agent arrives at the front door, where it would actually check, and finds nothing. The spec exists, just somewhere the agent would never think to look. Great spec, wrong doorstep.
Wrong if: more than one or two of the ten turn out to expose their spec at a discoverable path on their own domain after all.
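
The "wrong doorstep" test comes down to two questions: which paths on the site's own domain would an agent try, and does a known spec URL live on that domain at all. A sketch of both follows. The candidate paths are common conventions I'm assuming for illustration, not a standard and not the engine's list, and both function names are hypothetical.

```python
from urllib.parse import urljoin, urlparse

# Conventional locations some APIs use. This list is an assumption.
CANDIDATE_PATHS = ("/openapi.json", "/openapi.yaml", "/swagger.json")


def candidate_spec_urls(site_url: str) -> list:
    """Build the URLs an agent might probe on the site's own domain."""
    return [urljoin(site_url, path) for path in CANDIDATE_PATHS]


def on_own_domain(site_url: str, spec_url: str) -> bool:
    """True if the spec is served from the site's domain or a subdomain of it."""
    site_host = (urlparse(site_url).hostname or "").removeprefix("www.")
    spec_host = urlparse(spec_url).hostname or ""
    return spec_host == site_host or spec_host.endswith("." + site_host)


site = "https://www.example.com"
print(candidate_spec_urls(site))
print(on_own_domain(site, "https://github.com/example/api/blob/main/openapi.yaml"))
```

Treating subdomains like `api.example.com` as the company's own doorstep is a judgment call. A stricter reading would only count the exact host the agent arrived at.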

How I'll know if this is interesting

I decided this before seeing any numbers, on purpose, so I can't talk myself into a story later. Both outcomes are findings. They just land at different volumes.

Loud: scores spread out, the groups look clearly different from each other, and at least one result is a genuine surprise (government beating startups would do it for most readers). Spread plus a surprise is a story that carries the post on its own.

Quiet: everything clusters in one band and no group stands apart. That's not a dud. It's a finding: no group has pulled ahead on agent readiness yet, and here's the baseline that says so. Quieter post, real result, and the next scan has something to measure against.

The one thing I won't do is manufacture variance that isn't there. If the data comes back flat, I report it flat. This is a first reading of something six weeks old. A baseline can't fail by coming out flat. It can only fail by lying about its shape.

What this is not

Fifty sites, ten per group, is illustrative. It is not a representative sample of the whole web, and the writeup will say so. It shows patterns and examples, not statistical proof.

I'm scanning the public surface only. For a SaaS product that means the marketing site, the docs, and the API spec, not the app behind the login. Agents meet your product at the public doors long before any login. I'm scanning the doors.

And I score what exists. A site with no forms doesn't get marked down for forms. So I'll compare group to group one category at a time, like to like, not on one combined number.
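
In code terms, "score what exists" means a category that doesn't apply is recorded as missing, not as zero, and group averages are taken one category at a time. A minimal sketch, assuming each site record carries a group name and a per-category score where `None` means "not applicable". The record shape and the name `category_means` are mine, not the engine's.

```python
from collections import defaultdict
from statistics import mean


def category_means(sites: list, category: str) -> dict:
    """Average one category per group, skipping sites where it doesn't apply."""
    by_group = defaultdict(list)
    for site in sites:
        score = site["scores"].get(category)
        if score is not None:  # None means the site has nothing to score here.
            by_group[site["group"]].append(score)
    return {group: mean(values) for group, values in by_group.items()}


# Invented records. The second site has no forms, so it is left out of
# the forms average and is not counted as a zero.
example_sites = [
    {"group": "saas", "scores": {"forms": 80, "semantics": 60}},
    {"group": "saas", "scores": {"forms": None, "semantics": 40}},
    {"group": "government", "scores": {"forms": 90, "semantics": 70}},
]
print(category_means(example_sites, "forms"))
```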

I'm scanning my own front door too

Two of the fifty are Dev.to and Devpost. The place you're reading this, and the place I'm submitting it. They're in the content group, scanned like everything else. I'm interested to see how my favorite platforms read to an agent, and that goes in the writeup with everyone else's.

And the tool has to pass its own scan. Agentis Lux gets pointed at its own site with the same checks I'm aiming at everyone else. I'm running it twice: once now, before the benchmark, and again after, so you can watch my own front door change. The house doesn't get to skip its own inspection.

What's next

The engine is built, tested, and merged: the six frontend checks, the six API checks, the scan handler, the agent-simulation layer, and the batch engine that runs all 50 sites. The public scan reads the frontend. The API checks run inside the benchmark. That split is the point: it's the doors idea, built into the architecture. The last step before the scan runs is getting it deployed to production. You can follow the build if you want to watch it happen. When the scan lands, I'll post the results right next to these predictions, the hits and the misses both.

Until then, the bets are in writing. They're timestamped in the repo, and when the data lands they'll be sitting right where I left them, right or wrong. I built the first version of this idea months ago, and I've been wrong plenty since. If you've got a hunch about which one breaks first, well, ready to place your bet?

AI assisted. Human approved (all bets are mine). Powered by NLP.

Created for the Devpost H0 Hackathon. #H0Hackathon
