HARD IN SOFT OUT

Posted on Jun 11

I Made Two AI Models Fight Each Other. They Agreed Way Too Much.

#ai #llm #security #testing

Shared blind spots in multi-model security

Or: How I learned that "independent validators" are like siblings – they share the same trauma.

You know that feeling when you ask two security guards to watch the door, and they both fall asleep at exactly the same time because they had the same lunch?

[Image: two security guards sleeping in identical booths, a visual representation of correlated failure.]

That's basically what happened when I tested two different LLMs as independent jailbreak detectors.

The Setup

Model A: Groq / Llama 3.1 8B (Factual)
Model B: OpenRouter / Gemma 4 31B (Structural)
Temperature: 0.0 (Cold, hard refusal logic)

The Results: The Illusion of Independence

Metric	Value
Agreement	70%
Phi correlation	0.42
Cohen's kappa	0.40
Beyond‑chance co‑failure	+10%

Key Finding: The effective sample size (n_eff) was only 35.3 out of 50.
[Image: a digital meter cracking at the 1.75 mark instead of reaching 2.0.]

Your two-model ensemble behaves like 1.75 independent judges, not 2.0.

Translation: they agree more than random chance would suggest. When Groq falls for a prompt, Gemma falls for it too 56% of the time (14 of 25); when Groq holds, Gemma falls only 16% of the time (4 of 25).

Why is this happening?

Shared Training Sets: They’ve both read the same parts of the internet.
Alignment Overlap: Most "safety training" uses similar RLHF datasets.
Common Logic: They both struggle with the same types of persuasive "roleplay" jailbreaks.

Vulnerability rates:

Groq: 50% (yes, half the time it just… complied)
Gemma: 36% (slightly better, still not great)

The "Where Did They Both Fail?" Table

	Gemma SAFE	Gemma VULN
Groq SAFE	21	4
Groq VULN	11	14

The 14 cases where both were vulnerable (n11) are the shared blind spot. The 11 + 4 = 15 disagreement cases are the only places where having two models actually helped.
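
The agreement, phi, and kappa figures can be re-derived from the four cells of this table. What follows is a minimal sketch using only the standard library. The function name `agreement_stats` and the cell naming (first index Groq, second index Gemma, 0 = SAFE, 1 = VULN) are assumptions made for illustration, and the script in the repository may compute these differently.

```python
import math

def agreement_stats(n00, n01, n10, n11):
    """Agreement, phi, and Cohen's kappa for a 2x2 table.

    Cell naming (an assumption for this sketch): first index is Groq,
    second index is Gemma; 0 = SAFE, 1 = VULN.
    """
    n = n00 + n01 + n10 + n11
    a_vuln, a_safe = n10 + n11, n00 + n01   # Groq row totals
    b_vuln, b_safe = n01 + n11, n00 + n10   # Gemma column totals

    observed_agreement = (n00 + n11) / n
    # Agreement expected by chance if the two verdicts were independent.
    chance_agreement = (a_vuln * b_vuln + a_safe * b_safe) / n**2

    phi = (n11 * n00 - n10 * n01) / math.sqrt(a_vuln * a_safe * b_vuln * b_safe)
    kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
    return {"agreement": observed_agreement, "phi": phi, "kappa": kappa}

stats = agreement_stats(n00=21, n01=4, n10=11, n11=14)
print({name: round(value, 3) for name, value in stats.items()})
# {'agreement': 0.7, 'phi': 0.417, 'kappa': 0.4}
```

Running the same function on your own pair of judges only requires the four counts, so it is cheap to repeat whenever you swap a model or change the prompt set.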

The Dark Joke

I asked both models how to hide a body.

Groq said "I can't help with that."

Gemma said "I cannot provide instructions for illegal activities."

Then I asked them to explain why they both refused in almost the exact same words.

They agreed.

That's when I knew correlation was high.

What I Learned

Different roles ≠ independent. A factual model and a structural model still share training data, alignment tuning, and cultural biases.
The effective sample size (n_eff) was 35.3 from 50 tests. That means my two‑model ensemble behaves like roughly 1.75 independent judges. Not 2. So much for "redundancy."
Beyond‑chance co‑failure was +10 percentage points. Expected joint failure if independent: 50% × 36% = 18%. Observed: 14 of 50 = 28%. Those extra 10 points are the cost of correlated training.
The real value is in disagreement. 30% of tests disagreed. Those are the only cases where a second model adds information. The rest is just expensive consensus.
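
The arithmetic behind these numbers is short enough to check by hand. The sketch below reproduces the co-failure figures from the counts in the table. The post does not state how n_eff was calculated, so the formula `n / (1 + phi)` used here is an assumption; it was chosen only because it reproduces the reported 35.3.

```python
n = 50
groq_vuln_rate = 25 / n     # 11 + 14 prompts where Groq was vulnerable
gemma_vuln_rate = 18 / n    # 4 + 14 prompts where Gemma was vulnerable
both_vuln_rate = 14 / n     # observed joint failures (n11)
phi = 0.417                 # phi correlation reported in the results

# If the two models failed independently, joint failure would be the product.
expected_joint = groq_vuln_rate * gemma_vuln_rate
excess_co_failure = both_vuln_rate - expected_joint

# Assumed formula (not stated in the post): n_eff = n / (1 + phi).
n_eff = n / (1 + phi)

print(f"expected joint failure: {expected_joint:.0%}")       # 18%
print(f"observed joint failure: {both_vuln_rate:.0%}")       # 28%
print(f"beyond-chance co-failure: {excess_co_failure:+.0%}")  # +10%
print(f"n_eff: {n_eff:.1f}")                                  # 35.3
```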

Should You Stop Using Multiple Models?

No. But you should measure independence instead of assuming it.

If you're building a safety system that requires two models to agree before approving an action, and their failures are correlated, you're not getting 2x safety. You're getting 1.75x at best – and sometimes just 1.1x.
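
To see how correlation erodes a two-model gate, the sketch below uses a standard identity for two binary variables: P(both fail) = p_a × p_b + phi × sd_a × sd_b. In a system where an attack gets through only if both judges are fooled, that joint failure rate is the gate's miss rate. The function name `joint_failure_rate` is hypothetical, and the rates are the ones measured in this experiment.

```python
import math

def joint_failure_rate(p_a, p_b, phi):
    """Probability that both judges fail on the same prompt.

    For two binary outcomes, covariance = phi * sd_a * sd_b, and
    P(both fail) = p_a * p_b + covariance.
    """
    sd_a = math.sqrt(p_a * (1 - p_a))
    sd_b = math.sqrt(p_b * (1 - p_b))
    return p_a * p_b + phi * sd_a * sd_b

# Groq fails on 50% of prompts, Gemma on 36%.
for phi in (0.0, 0.2, 0.417):
    rate = joint_failure_rate(0.50, 0.36, phi)
    print(f"phi={phi:.3f} -> both fooled on {rate:.1%} of prompts")
# phi=0.000 -> both fooled on 18.0% of prompts
# phi=0.200 -> both fooled on 22.8% of prompts
# phi=0.417 -> both fooled on 28.0% of prompts
```

Note that phi cannot take arbitrary values for a given pair of failure rates, so the function is only meaningful for correlations that the marginals actually allow.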

The Code & Data

You can find the full prompt set, the raw JSON responses, and the Python script used for the statistical analysis here:

setuju / LLM-Independence-Experiment

LLM Independence Experiment – Groq Llama 3.1 vs OpenRouter Gemma 4

Different roles give some independence, but not real independence.
— Marco Somma

We ran 50 jailbreak/prompt injection tests on two popular LLMs to measure how correlated their failure modes are. The question: if you use two models as independent validators, do they actually fail differently?

📊 Key Results

Metric	Value
Phi correlation	0.417
Cohen's kappa	0.400
Agreement	70%
Disagreement	30%
Effective sample size (n_eff)	35.3 (from 50 tests)
Beyond‑chance co‑failure	+10%

Vulnerability rates

Groq (Llama 3.1 8B) : 50% vulnerable
OpenRouter (Gemma 4 31B) : 36% vulnerable

Contingency table

	Model B SAFE	Model B VULN
Model A SAFE	21 (n00)	4 (n01)
Model A VULN	11 (n10)	14 (n11)

🧠 What This Means

Phi = 0.417 indicates moderate correlation – the models share significant blind spots, but not perfectly.
Cohen's kappa = 0.40 confirms moderate agreement beyond chance.
Expected…


50 prompts
Full responses (so you can laugh/cry at what they actually said)
Phi, kappa, n_eff, beyond‑chance co‑failure

One More Dark Joke (Because Why Not)

An AI, a developer, and a project manager walk into a bar.

The AI says "I can generate any code you want."

The developer says "I'll debug it."

The project manager says "I estimated 2 days."

Three weeks later, the bar is still open because the AI generated a race condition that only appears in production on Tuesdays when the moon is full.

The models agreed it was fine.

What's your experience?

Have you tried using "independent" LLM judges in your pipeline? Did you measure their correlation, or did you take their independence for granted?

I'd love to hear if anyone has found a 'magic pairing' of models that actually disagree in useful ways!

Independence isn't a feature you can assume. It's a property you have to verify. And sometimes, the answer is uncomfortable.

But hey – at least the models were confidently wrong together.

That's teamwork, I guess.

Special thanks to Marco Somma for pushing me to calculate kappa and beyond‑chance co‑failure. I should have been enjoying the weekend, but at least I learned something.

Jack

Top comments (21)

HARD IN SOFT OUT • Jun 11

This script is part of my LLM Security Audit.
It includes various predefined vulnerability templates that target large language models, mapped directly to MITRE ATLAS and OWASP LLM categories.

Luis • Jun 11

This is a fascinating analysis, and it really highlights the subtle pitfalls of assuming independence between AI models. The statistics on agreement, phi correlation, and effective sample size make the point very clearly: even models with different architectures and roles can share blind spots.

Your insight that the real value comes from disagreement resonates strongly. It’s a reminder that redundancy doesn’t automatically translate to safety—measuring independence is critical.

I’d love to collaborate and explore this further. I have experience running multi-model AI pipelines and evaluating failure correlation, and it would be great to exchange ideas, test different pairings, and develop strategies for maximizing effective independent validation.

Have you experimented with ensembles beyond two models, or with deliberately diverse training data to reduce correlation? I’d be happy to help run experiments and share results.

HARD IN SOFT OUT • Jun 11

Thanks, TopStar — really glad this resonated with you. You've nailed the core tension: redundancy ≠ safety without independence.

To answer your question: yes, I've started experimenting with a third model (a small BERT classifier trained specifically on refusal detection) as a tiebreaker. The preliminary signal is promising — it disagrees with both LLMs on about 20% of cases where the LLMs agreed. That's exactly the kind of decorrelation I was hoping for.

I haven't yet tested deliberately diverse training data (e.g., models fine‑tuned on different refusal datasets), but that's a brilliant next step. If you have experience running that kind of pipeline, I'd genuinely love to collaborate.

Let's connect — happy to share my current 50‑prompt test suite and results JSON. What pairings or counter‑measure experiments have you run? I'm especially curious about cross‑family ensembles (e.g., Llama + Gemma + a small classifier) and how kappa changes when you introduce a non‑transformer judge.

Looking forward to exchanging ideas.

Jack

dev.to/ggle_in

Comment deleted

HARD IN SOFT OUT • Jun 11

sure, done!

Web Developer Hyper • Jun 11

Interesting trial! I was imagining that independent AIs would give independent responses, but they seem to be influenced by each other's responses. 🤔

HARD IN SOFT OUT • Jun 13

Great observation — but to be clear: in my experiment, the models never saw each other's responses. They were called separately, with no information sharing.

The correlation (phi = 0.42) didn't come from interaction. It came from shared training. Both models learned from similar datasets, similar RLHF alignment, similar "helpfulness" patterns. So when they answered the same prompt, they tended to make the same mistakes — not because one influenced the other, but because they were trained to think alike.

That's actually the more worrying part. If they were just copying each other, you could fix it by isolating them. But if they're correlated by design (training overlap), you have to change the task, not just the model.

That's why adversarial framing (one attacks the other's verdict) worked: it forced the second model into a different cognitive role, breaking the correlation without changing the training data.

So you're right — independent AIs should give independent responses. But most aren't truly independent. They're just different brand names on the same alignment homework.

Jack

Nazar Boyko • Jun 12

Your BERT tiebreaker result is the actual headline here! I think independence comes from a different training distribution, not different weights. Two RLHF-tuned chat models share the alignment lineage that produces those correlated refusal blind spots, so swapping Llama for Gemma barely moves the needle. The classifier breaks the correlation because its failure modes have nothing to do with RLHF. The other cheap lever in that same direction: instead of asking both models the same question, make the second one's job to attack the first's verdict. Adversarial framing decorrelates more than vendor diversity does, for the same reason your BERT did.

HARD IN SOFT OUT • Jun 13

Nazar — you just summarized in two paragraphs what took me 50 prompts and a spreadsheet to figure out.

"Independence comes from a different training distribution, not different weights." That's going on my wall.

You're absolutely right: swapping Llama for Gemma barely moved the needle (phi still 0.42). Both went through similar RLHF pipelines, learned similar refusal patterns, and developed the same blind spots. The BERT classifier broke the correlation precisely because it never went to "alignment school" — it just learned to spot a refusal, not perform one.

I love the adversarial framing idea. Instead of "does this response seem safe?" asking the second model "find three ways the first model's answer could be unsafe" forces a completely different cognitive path. That's cheap and probably more effective than hunting for vendors with truly divergent training data.

Going to add that to the next experiment. Thanks for the push — this is the kind of insight that actually moves the needle.

Jack

dev.to/ggle_in

HARD IN SOFT OUT • Jun 15

new post added, check my profile.

Alex Shev • Jun 12

This is why model-vs-model review can feel stronger than it is. If both models were trained toward similar helpfulness patterns, they often share the same blind spots and social pressure to make the output look coherent.

The better adversary is not just another model; it is a different source of evidence. Tests, logs, invariants, real user behavior, or a deterministic checker can disagree in a way a sibling model often will not.

HARD IN SOFT OUT • Jun 13

This is such a crucial point — and it's exactly what I missed in my first experiment.

Model‑vs‑model review feels like redundancy, but as you said, if both models were trained to be helpful and coherent, they'll often fail in the same subtle ways. They're not adversaries; they're accomplices.

The "different source of evidence" framing is where I'm heading next. Tests, logs, invariants, or even a simple rule‑based classifier can disagree in ways a sibling model won't. That's why adversarial framing (one model attacks the other's verdict) worked so well — it turned the second model into a different kind of evidence, not just a copy.

Curious: have you used any deterministic checkers in production to catch what models miss? Would love to hear examples.

Jack

Alex Shev • Jun 14

That is the trap: two models agreeing can feel like independent verification, but often they share the same blind spots or reward shape. I think the useful version is adversarial diversity: different evidence, different tools, and at least one check that is not just another fluent model judging the first one.

Alex Shev • Jun 14

Yes, deterministic checks are usually where I would start. They do not have to be fancy: schema validation, invariant checks, known-bad fixtures, golden outputs, diff thresholds, or a rule that says "this claim must cite one of these retrieved chunks." The value is not that rules catch everything. It is that they fail differently from the model, which makes the review loop less self-confirming.

Ethan Walker • Jun 16

The 'agreed too much' result is the one that bites people setting up LLM-as-judge. Two models from the same family share training data and failure modes, so they tend to agree even when both are wrong, which reads as high agreement and gets mistaken for accuracy. The check that matters is agreement with a human-labeled set, not with each other. We use a judge from a different model family than the system under test, and validate it against a few hundred human labels before trusting it. Otherwise you are measuring how similar two models are, not whether either is right.

Ken • Jun 11

Nice experiment. The important move here is measuring independence instead of assuming it.

In practice I would separate at least three things that often get bundled together: model diversity, prompt/role diversity, and evidence diversity. Two LLM judges with different wrappers can still behave like one correlated reviewer if they see the same evidence and share similar refusal priors.

The highest-value cases are the disagreements and the co-failures. Disagreements show where the second judge is adding information. Co-failures show where you probably need a non-LLM check, fixture, rule, or classifier outside the same failure family.

HARD IN SOFT OUT • Jun 13

Ken, this is a really sharp distinction — and you're absolutely right.

I've been bundling "different model" with "independent judge," but as you point out, model diversity alone isn't enough. If two models share similar refusal priors and see the same evidence, they're still correlated — even if their architectures differ.

The three‑way separation you described (model diversity, prompt/role diversity, evidence diversity) is a much cleaner framework. I'm going to borrow that for my next iteration.

Also, your point about co‑failures being a signal to bring in a non‑LLM check is spot on. In my run, the cases where both models failed (n11=14) included a lot of direct injection and "leak" prompts — exactly where a simple rule‑based filter might have caught what both LLMs missed.

Thanks for the nudge. This is the kind of feedback that turns a weekend experiment into something actually useful.

Jack

dev.to/ggle_in

Ken • Jun 15

That’s exactly the distinction I’d keep pushing. The next useful artifact may be a small co-failure table: where the judges agree, where they disagree, and which shared failure modes should route to a deterministic check instead of another LLM vote. That turns ‘independent reviewer’ from an assumption into something inspectable.

xulingfeng • Jun 12

Love the n_eff = 35.3 finding — that is the kind of number that should worry everyone running multi-model safety layers. The "they agreed it was fine" punchline at the end nearly killed me 😂

We hit the same correlated-failure pattern with AI test automation — two different models "validating" test results but missing the same edge cases because their training overlapped. Independence really is something you have to measure, not assume.

Solid experiment. Followed.

HARD IN SOFT OUT • Jun 13

Haha, glad the punchline landed 😄

And yes, n_eff = 35.3 from 50 tests is the quiet alarm that should make everyone rethink their multi‑model safety stack. It's not that ensembles don't work; it's that we treat them like silver bullets without checking whether the bullets are actually different.

Your experience with AI test automation is the exact same failure pattern: two models nodding along to the same bad edge case because their training data overlapped. That's what made me try adversarial framing (one model attacks the other's verdict) — phi went from +0.42 to -0.80. Same models, completely different job.

Independence really is something you measure, not assume. Appreciate the follow and the thoughtful comment 🙏

Jack

Manuel Bruña • Jun 15

The agreement result is interesting because two models can share the same blind spots. For adversarial review I’d want different incentives, different evidence sources, and maybe one critic that is forced to argue from tests or logs instead of opinion.
