Tony Wang

Posted on Jun 18 • Edited on Jun 20 • Originally published at crawlora.net

14% of the Web Is Actually Dead — But Not How You Think (We Scanned 10M Domains)

#webdev #webscraping #dns #datascience

When you hit a dead URL in production, do you know whether the domain is gone, or whether an anti-bot system just blocked your crawler? Both surface as a failed request, but they are completely different failures, and most tools don't tell them apart.

We scanned the DomCop top 10 million domains to find out how much of the popular web is actually dead. The short version: about 14% — not the ~27% you've probably seen quoted.

Dead and blocked are not the same failure

A domain that won't load failed for one of two reasons:

It's gone. No DNS record, or nothing accepts a TCP connection. Genuinely dead.
It's alive and blocking you. A real server returning a 403 or 429 to anything that looks like a bot.

Most "dead web" studies count both as dead. They shouldn't, because the two call for opposite responses:

A dead domain has nothing left to answer. Retrying it, whether by rotating proxies or escalating clients, is wasted compute.
A blocked domain is live. It needs a different client, not more retries.
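The split above can be sketched as a small classifier. This is a minimal illustration, not the study's actual code: the `ProbeResult` fields and the `Outcome` names are assumptions made up for the example, and only the two rules themselves (no DNS or no TCP means dead, 403/429 means blocked) come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DEAD = "dead"        # gone: nothing to retry
    BLOCKED = "blocked"  # live server refusing this client
    LIVE = "live"        # live server that answered


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical summary of one probe; field names are illustrative."""
    dns_resolved: bool
    tcp_connected: bool
    http_status: Optional[int] = None


def classify(probe: ProbeResult) -> Outcome:
    # No DNS record, or nothing accepted the TCP connection: genuinely dead.
    if not probe.dns_resolved or not probe.tcp_connected:
        return Outcome.DEAD
    # Assumption: a connection that never yields an HTTP response is
    # treated as dead in this sketch.
    if probe.http_status is None:
        return Outcome.DEAD
    # A real server answering 403 or 429 is alive and blocking this client.
    if probe.http_status in (403, 429):
        return Outcome.BLOCKED
    # Assumption: any other HTTP response counts as a live server.
    return Outcome.LIVE


print(classify(ProbeResult(dns_resolved=True, tcp_connected=True, http_status=403)))
```

Keeping the outcome as an enum rather than a boolean is the point of the exercise: a plain "failed" flag is what lets blocked servers get counted as dead.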

The numbers

We probed every domain over HTTP and classified each as alive, redirect, blocked, or dead:

14.1% genuinely dead — overwhelmingly vanished DNS (76% of the dead bucket). The server is gone.
8.9% blocked — live servers returning 403/429 to automated clients.
76.6% alive, 0.3% redirect.

The widely-cited "~27% of the web has rotted" figure conflates blocked-but-live servers (and 404/5xx responses — still a live server answering) with the genuinely gone. Separate them honestly and the truly-dead web is about half what people assume.

Proof: same domains, different client

To show the 8.9% "blocked" really are alive, we re-probed them with a real Chrome TLS/JA3 fingerprint — an HTTP client that speaks Chrome's exact TLS handshake and header order (not a headless browser, no canvas/WebGL).

~72,000 of the blocked domains served content normally. Same URLs, same network: the only thing that changed was the client, so nothing was dead except our access. That dropped the blocked rate from 8.9% to 8.2%.
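The arithmetic behind that drop is easy to check. The sketch below assumes the list holds exactly 10,000,000 domains and treats the approximate ~72,000 as an exact count, so the results are indicative rather than precise.

```python
TOTAL_DOMAINS = 10_000_000  # assumption: the list holds exactly 10M domains
BLOCKED_PCT = 8.9           # first-pass blocked share, in percent
RECOVERED = 72_000          # approximate count quoted in the text

# How many domains the first pass labelled as blocked.
blocked_count = TOTAL_DOMAINS * BLOCKED_PCT / 100
# The recovered domains expressed as percentage points of the whole scan.
recovered_points = RECOVERED / TOTAL_DOMAINS * 100
# Blocked rate once the recovered domains are reclassified as alive.
blocked_after = BLOCKED_PCT - recovered_points
# Fraction of the blocked bucket that the browser-like client got through.
recovered_share = RECOVERED / blocked_count

print(f"blocked on first pass: {blocked_count:,.0f} domains")
print(f"recovered: {recovered_points:.2f} percentage points")
print(f"blocked rate after re-probe: {blocked_after:.1f}%")
print(f"share of blocked domains recovered: {recovered_share:.1%}")
```

Under those assumptions the re-probe recovered roughly one in twelve blocked domains, so a browser-like TLS fingerprint helps but does not get through most walls on its own.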

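The takeaway for anyone building crawlers or link-checkers: about 9% of the domains in this scan looked dead to an automated client but were live servers with anti-bot deployed. That is roughly 39% of everything a checker that lumps the two together would have reported as dead (8.9 of 23.0 percentage points). NXDOMAIN/REFUSED → dead, skip it. 403/429 → alive, recheck with a real browser TLS context before you mark it dead.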
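One way to wire that rule into a link-checker is sketched below. It is an illustration under assumptions, not our production logic: `triage`, its parameters, and the `browser_fetch` callable are hypothetical names, and `browser_fetch` stands in for whatever client you use that presents a browser-like TLS fingerprint.

```python
from typing import Callable, Optional

# Failure reasons that mean the server is gone (per the rule above).
DEAD_REASONS = {"NXDOMAIN", "REFUSED"}
# Statuses that mean a live server is blocking an automated client.
BLOCK_STATUSES = {403, 429}


def triage(
    domain: str,
    reason: Optional[str],
    status: Optional[int],
    browser_fetch: Callable[[str], Optional[int]],
) -> str:
    """Return 'dead', 'alive', 'blocked', or 'unknown' for one failed check.

    `browser_fetch` is a hypothetical callable that requests the domain with
    a browser-like TLS fingerprint and returns the HTTP status, or None if
    the request failed.
    """
    if reason in DEAD_REASONS:
        return "dead"  # skip it: no client will get an answer
    if status in BLOCK_STATUSES:
        retry_status = browser_fetch(domain)
        if retry_status is not None and retry_status not in BLOCK_STATUSES:
            return "alive"  # the block was the only problem
        return "blocked"  # still a live server, so never mark it dead
    # Assumption: anything else is left to the caller's normal handling.
    return "unknown"


def fake_browser_fetch(domain: str) -> Optional[int]:
    # Stand-in for a real client; always reports success.
    return 200


print(triage("example.com", None, 403, fake_browser_fetch))
```

Passing the fetcher in as a parameter keeps the decision logic testable without network access, and makes the escalation step a single call instead of a retry loop.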

The web rots unevenly

Death rate isn't uniform. By country-code TLD:

China's .cn: 33% dead
Germany's .de: 7.6% dead

A 4× gap. Institutional TLDs fare badly too — .gov 26%, .edu 22% — matching Pew Research's finding that government and reference pages suffer the worst link rot.

The famous dead

The casualties are all in the data: Grooveshark, Gfycat, del.icio.us, Yahoo Pipes, AddThis, DMOZ, OpenSolaris, GeoCities. Two decades of the social and developer web's graveyard.

The open dataset

Every domain, both probe arms, is open under CC BY 4.0 (one JSON row per domain per arm: domain, tld, rank, mode, outcome, reason, HTTP statuses, redirect hops, parked flag):

Dataset: github.com/Crawlora-org/dead-web-index-data
Interactive explorer (look up any domain, browse the country map): crawlora.net/dead-web-index
Full write-up + methodology: crawlora.net/blog/how-much-of-the-web-is-dead-2026
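As a rough sketch of how the per-TLD death rates could be recomputed from the rows, the function below aggregates one probe arm by TLD. It assumes the files are JSON Lines and that the keys are literally `tld`, `mode`, and `outcome` with an outcome value of `"dead"`; the mode value `"plain"` and the sample rows are made up for the example, so check the repository for the real names before relying on it.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def dead_rate_by_tld(lines: Iterable[str], mode: str) -> Dict[str, float]:
    """Percent of domains with outcome 'dead' per TLD, for one probe arm."""
    totals: Counter = Counter()
    dead: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if row["mode"] != mode:
            continue  # keep the two probe arms separate
        totals[row["tld"]] += 1
        if row["outcome"] == "dead":
            dead[row["tld"]] += 1
    return {tld: 100.0 * dead[tld] / n for tld, n in totals.items()}


# Tiny made-up sample, not real rows from the dataset.
sample = [
    '{"domain": "a.example", "tld": "example", "mode": "plain", "outcome": "dead"}',
    '{"domain": "b.example", "tld": "example", "mode": "plain", "outcome": "alive"}',
    '{"domain": "c.test", "tld": "test", "mode": "plain", "outcome": "blocked"}',
    '{"domain": "a.example", "tld": "example", "mode": "chrome", "outcome": "alive"}',
]

rates = dead_rate_by_tld(sample, mode="plain")
print(rates)
```

With the real data you would pass an open file handle instead of the list. Filtering on one arm matters because each domain appears once per arm, and mixing them would double-count.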

(Disclosure: we build a web-scraping API, which is why the dead-vs-blocked distinction bites us daily.)

Top comments (1)

Luis • Jun 18

Strong empirical breakdown on “dead vs blocked” classification. The separation of NXDOMAIN/REFUSED vs 403/429 is particularly valuable for anyone building crawler reliability layers—this is usually where production systems silently accumulate bias in failure labeling.
There’s also a clear systems engineering opportunity here beyond measurement.
We’ve been working on a similar problem space in large-scale crawl orchestration, specifically around failure-state classification pipelines and adaptive retry policies. Your dataset and methodology could integrate well into a shared benchmarking layer for crawler intelligence systems.
A potential collaboration angle:

plug your dead/blocked taxonomy into a crawler decision engine to dynamically adjust retry strategies
extend the dataset into a real-time “domain health scoring API” for scraping infrastructure
compare Chrome-fingerprint revalidation against headless + TLS signature hybrid models
co-develop a shared evaluation framework for link rot vs anti-bot interference across large-scale indexing systems

If you’re open to it, there’s real value in turning this from a static study into a reusable infrastructure signal layer for scraping and indexing pipelines.