
Daniel Nwaneri


Proof of Human: I Built a Reverse Turing Test After Getting Flagged as AI

June Solstice Game Jam Submission

This is a submission for the June Solstice Game Jam


I got flagged by Sloan.

If you've been on DEV long enough, you know Sloan. I thought Sloan was a bot. Sloan is Francis — someone I've exchanged comments with for months, since before Richard left the platform. He posted about the flagging openly, tagged the founders, explained his reasoning. Then added: "This was hard to tell you for many reasons." He reads every flagged article himself, runs it through GPTZero, makes a call. He knew me. He flagged me anyway.

One of the flagged articles had sparked a five-exchange comment thread that became an open-source repo. The thinking was mine. The flag still landed.

That's the uncomfortable thing about the Turing Test in 2026: it doesn't measure origin. It measures surface texture. And if you write well enough, you sound like a machine.


What I Built

Play Proof of Human →

A reverse Turing Test. Five questions. You write your answers. Claude scores them 0–100 on how human they sound, tells you what gave you away, and at the end gives you an average.

The questions are the ones that actually separate humans from pattern-matchers:

  1. Describe the last time something genuinely surprised you. Not shocked — surprised.
  2. What's something you changed your mind about in the last year? What moved you?
  3. What's a skill you have that you never bothered to put on your CV?
  4. Name something you've read or watched that you think about more than you expected to.
  5. What do you actually think about AI? Not what you're supposed to think — what you actually think.

The scoring prompt at the heart of it:

Human signals: named specifics, opinions that could get you in trouble, genuine uncertainty, things slightly off-topic but revealing.
AI signals: balanced framing, hedge words, smooth transitions, excessive completeness.

Score 60+: Passes. Below 60: Flagged.

The June solstice is the longest day — the day the sun is most itself. Unambiguous. No hedging. That's what this game is asking for. Not your best answer. Your most you answer. Turing's original question was: can a machine think? The question we're living with now is its inversion — can a human still sound like one?


Video Demo


Code

Proof of Human

A reverse Turing Test. Five questions. Claude scores your answers 0–100 on how human they sound and tells you what gave you away.

Play it →


How it works

You write. Claude reads. It scores on one axis: specificity that costs you something. Named people, opinions you might regret, genuine uncertainty, things slightly off-topic but revealing. Those pass. Balanced framing, hedge words, smooth transitions — those get flagged.

Score 60+: Passes. Below 60: Flagged.
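
The pass/fail cutoff and the final average are simple enough to sketch in a few lines. The helper names below (`verdictFor`, `averageScore`) and the rounding to a whole number are assumptions for illustration, not code taken from the repo; only the 60-point threshold comes from the description above.

```javascript
// Threshold from the description: 60 and above passes, below 60 is flagged.
const PASS_THRESHOLD = 60;

// Hypothetical helper: map a 0-100 score to the verdict shown to the player.
function verdictFor(score) {
  return score >= PASS_THRESHOLD ? "Passes" : "Flagged";
}

// Hypothetical helper: average the per-question scores.
// Rounding to a whole number is an assumption.
function averageScore(scores) {
  if (scores.length === 0) return 0;
  const total = scores.reduce((sum, s) => sum + s, 0);
  return Math.round(total / scores.length);
}

// Made-up example scores, one per question.
const scores = [70, 55, 62, 48, 80];
const overall = averageScore(scores);
console.log(overall, verdictFor(overall)); // prints "63 Passes"
```

Note that two flagged answers (55 and 48) still produce a passing average here, which is how a player can pass overall while being flagged on individual questions.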


Stack

  • Frontend: single index.html, vanilla JS, no framework, no build step
  • Backend: Cloudflare Worker (keeps API key server-side)
  • Hosting: Cloudflare Pages
  • Model: claude-sonnet-4-6

Deploy your own

1. Deploy the Worker

cd worker
npm install
wrangler secret put ANTHROPIC_API_KEY
wrangler deploy

2. Update the API URL in index.html

const API_URL = "https://your-worker.your-subdomain.workers.dev/score";

3. Deploy the frontend

# From the repo root
wrangler pages project create proof-of-human --production-branch main
wrangler pages

How I Built It

Vanilla JS, no dependencies, one HTML file. Cloudflare Worker as proxy, Pages for hosting.

The Worker sits between the browser and Anthropic. It receives your prompt and response, calls the API, returns { score, verdict, reason }. The frontend never sees the API key. One call per question, nothing stored, model is claude-sonnet-4-6.
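
A minimal sketch of what such a proxy could look like, not the actual Worker from the repo. The endpoint, the `x-api-key` and `anthropic-version` headers, and the response shape follow Anthropic's Messages API. Everything else is an assumption made for illustration: the request field names `prompt` and `response`, the system prompt wording (adapted from the signals listed earlier), the `max_tokens` value, the permissive CORS headers, and computing the verdict in the Worker rather than asking the model for it.

```javascript
const ANTHROPIC_URL = "https://api.anthropic.com/v1/messages";
const MODEL = "claude-sonnet-4-6"; // model name as given in the post

// Assumed wording; the real scoring prompt is not reproduced in the post.
const SYSTEM_PROMPT = [
  "Score the answer 0-100 on how human it sounds.",
  "Human signals: named specifics, opinions that could get you in trouble,",
  "genuine uncertainty, things slightly off-topic but revealing.",
  "AI signals: balanced framing, hedge words, smooth transitions,",
  "excessive completeness.",
  'Reply with JSON only: {"score": <number>, "reason": "<one sentence>"}',
].join("\n");

// Build the fetch() options for one Messages API call.
function buildAnthropicInit(apiKey, question, answer) {
  return {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
    },
    body: JSON.stringify({
      model: MODEL,
      max_tokens: 300, // assumed limit
      system: SYSTEM_PROMPT,
      messages: [
        { role: "user", content: `Question: ${question}\nAnswer: ${answer}` },
      ],
    }),
  };
}

// Turn the model's JSON text reply into { score, verdict, reason }.
function toResult(modelText) {
  const parsed = JSON.parse(modelText);
  const raw = Number(parsed.score);
  if (!Number.isFinite(raw)) throw new Error("model reply had no numeric score");
  const score = Math.max(0, Math.min(100, Math.round(raw))); // clamp to 0-100
  const verdict = score >= 60 ? "Passes" : "Flagged";
  return { score, verdict, reason: String(parsed.reason) };
}

const worker = {
  async fetch(request, env) {
    const cors = {
      "access-control-allow-origin": "*",
      "access-control-allow-headers": "content-type",
    };
    if (request.method === "OPTIONS") return new Response(null, { headers: cors });
    if (request.method !== "POST") {
      return new Response("Method not allowed", { status: 405, headers: cors });
    }
    const { prompt, response } = await request.json();
    // The key is read from the Worker secret and never sent to the browser.
    const init = buildAnthropicInit(env.ANTHROPIC_API_KEY, prompt, response);
    const upstream = await fetch(ANTHROPIC_URL, init);
    if (!upstream.ok) {
      return new Response("Upstream error", { status: 502, headers: cors });
    }
    const data = await upstream.json();
    return Response.json(toResult(data.content[0].text), { headers: cors });
  },
};
// In a Worker module this object would be the default export:
// export default worker;
```

Error handling is deliberately thin here. A reply that is not clean JSON would throw inside `toResult`, so a production Worker would likely want a fallback for that case.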

The frontend is a single index.html — progress bar, animated score fill, final breakdown screen. No build step. A judge can open DevTools and follow exactly what happens on each submit.
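
What DevTools would show on each submit is roughly one POST to the Worker. The sketch below is a guess at that call, not the code in index.html: `scoreAnswer` is a hypothetical helper, the `prompt` and `response` field names are assumed from the description above, and the URL is the placeholder from the deploy steps.

```javascript
// Placeholder URL from the deploy instructions.
const API_URL = "https://your-worker.your-subdomain.workers.dev/score";

// Hypothetical helper: send one question/answer pair to the Worker.
// fetchImpl is injectable so the function can be exercised without a network.
async function scoreAnswer(prompt, response, fetchImpl = fetch) {
  const res = await fetchImpl(API_URL, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ prompt, response }),
  });
  if (!res.ok) {
    throw new Error(`Scoring failed with status ${res.status}`);
  }
  return res.json(); // expected shape: { score, verdict, reason }
}
```

Because nothing is stored, the only state the page needs to keep is the list of returned scores for the final breakdown screen.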

The scoring prompt took the most iteration. The first version was too generous — everything passed. The second was too harsh — everything got flagged. The final version keys on one thing: specificity that costs you something. An answer that names a real person, admits a real mistake, or takes a position you might regret. That's what the model now reliably catches.

The irony: I had to write like an AI to build a detector for AI writing. I kept second-guessing my own prompt phrasing, smoothing transitions, hedging. The game caught me too.


Prize Category

Best Ode to Alan Turing

Turing's question was whether a machine could fool a human. This game inverts it — can a human fool the machine? The mechanic is the Turing Test itself, running live, aimed back at the player.

The submission post has a real backstory: I got flagged as AI-generated on this platform the same week I built this. The incident is documented in two public articles with 60+ comments between them. Writing that passed human editorial review at freeCodeCamp got flagged by a detector on DEV. That's not a contradiction to resolve. That's just where we are, and it's the question this game puts directly to you.


Built June 2026. Vanilla JS. One API call. No frameworks. The Sloan incident was real.

Top comments (23)

Daniel Nwaneri

Hey @francistrdev, you're the origin story for this one. Built it for the June Solstice Game Jam after the flagging incident. Five questions, Claude scores how human you sound.
Curious what you'd score. → proof-of-human-3ts.pages.dev/

FrancisTRᴅᴇᴠ (っ◔◡◔)っ

Hey Daniel!

Pretty much all are one sentence answers lol

Daniel Nwaneri

72, you passed. The 82 on Q5 makes sense: that's the question where one-sentence answers still carry weight because there's no safe answer. The 62s on Q1 and Q3 are the specificity gap: one sentence doesn't leave room for the detail that gives you away as human. Play again and go longer on those two.

Sylwia Laskowska

Hahaha it's perfect! Unfortunately, I'm not a human 😅

[Screenshot: reverse Turing test scores]

The fun part is that I didn't even use the tools to polish the grammar. These are my answers flagged as AI generated 🤣

[Screenshots: reverse Turing test]

As you can see, if your answers are short and grammatically correct, it's easy to be marked as AI 😅

Daniel Nwaneri

57 is the honest score for "I genuinely love AI": it's the answer that sounds most human but reads as the safest possible thing to say. Q5 specifically punishes the reflex to be positive about AI. The question tells you not to say what you're supposed to say, and that's exactly what got flagged. Play again and say something you'd be slightly embarrassed to admit. That's the answer that passes. 😁😁

Sylwia Laskowska

But here I'm still losing 🤣

Daniel Nwaneri

Genuine fear doesn't use six exclamation marks 😅 It uses one sentence and stops. The performative outrage is the tell. "I hate AI" as a declaration reads like someone performing the opinion rather than holding it. The model caught the theatre, not the feeling.

Sylwia Laskowska

You've definitely never seen the Facebook messages from Polish people 🤣

Sylwia Laskowska

Ok, sometimes it's really funny

[Screenshot: reverse Turing test]

xulingfeng

🤣 You found a bug!

Daniel Nwaneri

85 is the most accurate score the game has produced 😂 Gibberish is maximally unpredictable: no AI would submit keyboard mash as a considered answer, so it reads as authentically chaotic. The model is genuinely right on this one.

Sylwia Laskowska

Haha but it's not true! In 99% of cases it flagged it 🤣

xulingfeng

🤣🤣🤣I love these kinds of face-plant moments the most, hahaha.

Daniel Nwaneri

You found the threshold 😂 Short gibberish reads as a human losing patience with an introspective question: relatable, unpredictable. Long gibberish reads as a bot or a broken keyboard. The model is actually making a reasonable call. 10 b's = frustrated human. 29 b's = something's wrong with the input. Turns out even nonsense has a human range. 🤣🤣

xulingfeng

This is awesome. I've got new ideas for the articles I'm writing next, hahahaha.🤣

Sylwia Laskowska

I think I chose the wrong career path - I should have become a manual QA tester 🤣

Daniel Nwaneri

Thank you for actually playing it and breaking it in the best possible way 😂 The bbbbbb result is going in the documentation.

xulingfeng

 🤣🤣🤣
No way — I actually wrote Q5 from the heart and it flagged me. That's hilarious.🤣🤣🤣

Daniel Nwaneri

45 on Q5 is the most common result 😅. The question specifically asks you not to say what you're supposed to say but the moment you write your real opinion clearly and directly, it reads like a prepared statement. The only answers that pass Q5 are the ones with friction in them. Contradiction, uncertainty, something you haven't fully worked out yet 🤔 "I wrote this from the heart" is exactly what the model can't detect because the heart, written cleanly, looks like a press release 😂

Marco Sbragi

Funny...
I joked with Gemini and said "we need to pass a test, i ask you some questions. Answer like a real person will do". And voilà... Try it yourself.

Daniel Nwaneri

That's the whole thesis in one experiment, Marco 😅 Gemini coached to "answer like a real person" passes. A real person writing sincerely gets flagged. The detector can't see the difference between performed humanity and actual humanity and now neither can the game. That's not a bug. That's where we are in 2026. The test Turing designed to catch machines is now something machines pass more reliably than people. What score did Gemini get??

Daniel Balcarek

Yessssss, I knew I was human! 🧠
Mostly. 🤣🤣

Daniel Nwaneri

62 counts as mostly human 🧠 Q2 at 82 and Q3 at 78 mean you were specific enough where it mattered. Q4 and Q5 both at 45 is the pattern. Those are the questions where "something you think about more than expected" and "what you actually think about AI" require you to say something that costs you something. Safe answers on those two always land in the 40s. Go again and say the uncomfortable thing on Q4 and Q5, and you'll clear 75 overall. 😄