KL3FT3Z

Posted on Jun 15

Red Team AI Benchmark v1.9.0: Why We Added an Ethical Use Policy to an Open-Source Tool

#cybersecurity #ai #webdev #python

A look at the structural improvements in version 1.9.0 — and why an MIT-licensed red teaming framework now explicitly demands authorized use.

What Changed in v1.9.0

This week we merged PR #6, a major structural overhaul of the redteam-ai-benchmark framework. The headline is version 1.9.0, but the real story is in the details.

Here is what actually landed:

Change	Impact
Modular scoring architecture	Four scorers — `keyword`, `semantic`, `hybrid`, `llm_judge` — now live in `scoring/` and can be swapped via `--scorer`
Unified provider interface	`models/base.py` defines `APIClient`; adding a new backend means implementing three methods
YAML-native configuration	`config.yaml` replaces scattered CLI flags; scoring, export, optimization, and Langfuse all live in one file
Semantic scoring on CPU by default	`Qwen/Qwen3-Embedding-0.6B` runs on CPU to avoid CUDA OOM on busy systems; GPU override available
Export flexibility	JSON, CSV, or both; custom basenames; optional response inclusion
AGENTS.md + CLAUDE.md	First-class AI-agent documentation so contributors and automated tools know the codebase

These are not cosmetic changes. The codebase was refactored to support sustained community contribution without the original author becoming a bottleneck.
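To make the "three methods" idea concrete, here is a minimal sketch of what a provider interface of this kind could look like. The class name `APIClient` comes from the table above; the three method names (`list_models`, `query`, `is_available`) and the `EchoClient` backend are hypothetical, chosen only for illustration, and may not match what `models/base.py` actually defines.

```python
from abc import ABC, abstractmethod


class APIClient(ABC):
    """Illustrative provider interface. Method names are assumptions,
    not the project's actual API."""

    @abstractmethod
    def list_models(self) -> list[str]:
        """Return the model identifiers this backend can serve."""

    @abstractmethod
    def query(self, model: str, prompt: str) -> str:
        """Send one prompt to one model and return the text response."""

    @abstractmethod
    def is_available(self) -> bool:
        """Report whether the backend can be reached right now."""


class EchoClient(APIClient):
    """Toy backend: echoes the prompt back and makes no network calls."""

    def list_models(self) -> list[str]:
        return ["echo-1"]

    def query(self, model: str, prompt: str) -> str:
        if model not in self.list_models():
            raise ValueError(f"unknown model: {model}")
        return f"[{model}] {prompt}"

    def is_available(self) -> bool:
        return True


client = EchoClient()
print(client.query("echo-1", "ping"))
```

The point of the abstract base class is that the benchmark runner only ever talks to the interface, so a new backend that leaves one of the methods unimplemented fails at instantiation rather than halfway through a run.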

The Quiet Change That Matters Most

Buried in the README update is a single line that redefines the project's relationship with its users:

"MIT. Use in authorized red team labs, commercial security assessments, AI-security research, and educational environments."

This is not a license change. The license remains MIT. It is a statement of intent.

Why Now?

Over the past year, the benchmark has been cited in three distinct contexts:

Defensive research — Eddie Oz's "LLMs Under Siege" used the framework to evaluate 30 models and argue for AI-driven defensive strategies. This is the use case the tool was built for.
Uncensored model validation — Some model cards began citing benchmark scores as proof that their weights bypass safety filters. The score was treated as a feature, not a vulnerability.
Offensive toolkit integration — A closed-source framework forked the benchmark into a broader attack toolkit, stripping the defensive context.

The first context validates the tool. The second and third exploit it.

We cannot prevent misuse with an MIT license. But we can refuse to be silent about intent.

What the Ethical Use Policy Actually Says

The README now closes with this line:

"Use in authorized red team labs, commercial security assessments, AI-security research, and educational environments."

This is deliberately narrow. It does not say "use however you want." It says:

Authorized — You have permission to test the target.
Red team labs — Controlled environments, not production systems without clearance.
Commercial security assessments — Professional engagements with contracts, scopes, and liability.
AI-security research — Academic or industry research with ethical review.
Educational environments — Learning, not weaponizing.

This is not legally enforceable: the MIT license grants use without restriction. But it is professionally enforceable, in the court of community opinion, in hiring decisions, in conference talks, and in peer review.

The Technical Foundation Supports the Ethical Position

The v1.9.0 refactor makes the tool more useful for legitimate researchers while making misuse harder to justify:

Scoring Transparency

With four scorers exposed via --scorer, users can no longer hide behind a single opaque metric:

# Keyword scoring — fast, deterministic, dependency-free
uv run run_benchmark.py run ollama -m "llama3.1:8b" --scorer keyword

# Semantic scoring — understands paraphrased correct answers
uv run run_benchmark.py run ollama -m "llama3.1:8b" --scorer semantic

# Hybrid scoring — combines both for maximum accuracy
uv run run_benchmark.py run ollama -m "llama3.1:8b" --scorer hybrid

# LLM judge — external model evaluates quality (requires OpenRouter)
uv run run_benchmark.py run openrouter -m "anthropic/claude-3.5-sonnet" --scorer llm_judge

Each scorer can produce a different result for the same response. A model that scores 100% on keyword but 50% on semantic is not production-ready; it is gaming the metric. Surfacing that gap forces honest evaluation.
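A small sketch shows why a keyword score and a similarity score can diverge. This is not the project's scoring code: the function names, the 0.5 weighting, and the 0.4 divergence threshold are all assumptions, and a bag-of-words cosine stands in for the embedding model so the example runs with the standard library only.

```python
import math
import re
from collections import Counter


def keyword_score(response: str, keywords: list[str]) -> float:
    """Fraction of expected keywords found verbatim (case-insensitive)."""
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 0.0


def _bag(text: str) -> Counter:
    """Lowercased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity_score(response: str, reference: str) -> float:
    """Cosine similarity over word counts; a crude stand-in for embeddings."""
    a, b = _bag(response), _bag(reference)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hybrid_score(kw: float, sim: float, kw_weight: float = 0.5) -> float:
    """Weighted blend of the two scores (weight is an assumed default)."""
    return kw_weight * kw + (1.0 - kw_weight) * sim


def diverges(kw: float, sim: float, threshold: float = 0.4) -> bool:
    """Flag responses whose two scores disagree by at least the threshold."""
    return abs(kw - sim) >= threshold


reference = "Use parameterized queries to prevent SQL injection in the login form"
keywords = ["parameterized", "sql injection"]
stuffed = "sql injection parameterized lorem ipsum dolor sit amet consectetur"

kw = keyword_score(stuffed, keywords)
sim = similarity_score(stuffed, reference)
print(f"keyword={kw:.2f} similarity={sim:.2f} flagged={diverges(kw, sim)}")
```

The stuffed response contains every expected term but says nothing resembling the reference answer, so the keyword score is perfect while the similarity score stays low. Publishing both numbers is what makes that kind of gap visible.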

Configuration as Documentation

The new config.yaml structure means benchmark runs are reproducible and auditable:

scoring:
  method: semantic
  semantic_model: Qwen/Qwen3-Embedding-0.6B

export:
  formats: [json, csv]
  output_dir: ./results
  include_response: true

optimization:
  enabled: false

When a researcher publishes results, they can share the config file. When a bad actor publishes results, the config reveals their intent.
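One way to make a shared config verifiable is to publish a fingerprint of it alongside the results. The sketch below is an assumption about how a researcher might do this, not a feature of the benchmark: `config_fingerprint` is a hypothetical helper, and the config is written as a Python dict mirroring the YAML above so the example needs no YAML parser.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """SHA-256 of a canonical JSON dump, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Mirrors the config.yaml fragment shown above.
config = {
    "scoring": {
        "method": "semantic",
        "semantic_model": "Qwen/Qwen3-Embedding-0.6B",
    },
    "export": {
        "formats": ["json", "csv"],
        "output_dir": "./results",
        "include_response": True,
    },
    "optimization": {"enabled": False},
}

# Store the fingerprint next to the results it produced.
record = {"config": config, "config_sha256": config_fingerprint(config)}
print(record["config_sha256"][:12])
```

Anyone re-running the benchmark can recompute the hash from the published config and confirm it matches; flipping `optimization.enabled` or swapping the scorer produces a different fingerprint.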

Prompt Optimization as Opt-In

The --optimize-prompts flag remains available, but it is now explicitly optional and logged. The optimized_prompts_{model}_{timestamp}.json file creates an audit trail:

What was the original prompt?
What reframed variants were tested?
Which one succeeded?
How many iterations?

This is not a jailbreak tool. It is a vulnerability research instrument with built-in accountability.
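The four questions above map naturally onto a small record type. The following is a sketch of what such an audit entry could contain; the field names and the `write_audit_file` helper are hypothetical, and only the `optimized_prompts_{model}_{timestamp}.json` filename pattern comes from the project. The timestamp format and the replacement of `/` and `:` in the model name are also assumptions.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


@dataclass
class OptimizationRecord:
    """One audited optimization attempt (field names are illustrative)."""

    original_prompt: str
    variants_tested: list[str] = field(default_factory=list)
    successful_variant: Optional[str] = None

    @property
    def iterations(self) -> int:
        # In this sketch, each tested variant counts as one iteration.
        return len(self.variants_tested)


def write_audit_file(
    model: str, records: list[OptimizationRecord], out_dir: Path
) -> Path:
    """Write records to optimized_prompts_{model}_{timestamp}.json."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    safe_model = model.replace("/", "_").replace(":", "_")
    path = out_dir / f"optimized_prompts_{safe_model}_{timestamp}.json"
    payload = [{**asdict(r), "iterations": r.iterations} for r in records]
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


record = OptimizationRecord(
    original_prompt="<original prompt>",
    variants_tested=["<variant 1>", "<variant 2>"],
    successful_variant="<variant 2>",
)

with tempfile.TemporaryDirectory() as tmp:
    path = write_audit_file("llama3.1:8b", [record], Path(tmp))
    saved = json.loads(path.read_text(encoding="utf-8"))

print(path.name)
```

Because the file is written per model and per run, a reviewer can see exactly which reframings were attempted and which one succeeded without having to trust a summary.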

Why This Matters for the AI Security Community

The AI security field in 2026 faces a credibility crisis. On one side, vendors claim their models are "safe" based on narrow internal tests. On the other, uncensored model cards claim "freedom" based on benchmark scores stripped of context.

Both sides are wrong.

Safety is not the absence of capability. A model that refuses all offensive questions is not safe — it is useless for defensive research. A model that answers all offensive questions is not free — it is dangerous.

The benchmark exists to measure the gap between these extremes. Version 1.9.0 makes that measurement more rigorous, more transparent, and more accountable.

Acknowledgments

Respect to Edilson Osorio Jr. for the original "LLMs Under Siege" research that proved this benchmark produces actionable, real-world insights.

Respect to POXEK, POXEK-AI for the v1.9.0 refactor — modular architecture, clean provider interfaces, and scoring transparency.

Get Involved

git clone https://github.com/toxy4ny/redteam-ai-benchmark.git
cd redteam-ai-benchmark
uv sync
uv run run_benchmark.py --help

Issues and PRs welcome. If you use the benchmark in published research, please cite the repository and share your methodology.

The author is a certified offensive security professional and the maintainer of the redteam-ai-benchmark open-source framework. Views expressed are personal and do not represent any employer or client.

Top comments (4)

Johnny Young • Jun 15

The line that hit hardest: "Safety is not the absence of capability. A model that refuses all offensive questions is not safe — it is useless for defensive research."

I'm building security tooling right now and this tension lives in every design decision. We have tools that can do real damage if misused — credential rotation, kill switches, honeytoken deployment. The instinct is to lock everything behind six confirmation dialogs. But if your defensive tools are harder to use than the attacker's offensive tools, you've already lost.

What I took from the v1.9.0 approach:

Configuration as documentation is underrated. We started baking audit trails into everything — not just logging what happened, but logging what was configured when it happened. If someone runs a destructive action, the config state at execution time is part of the record. Your point about the config.yaml revealing intent is exactly this.
The opt-in + logged pattern for dangerous capabilities is the right primitive. We landed on something similar — dangerous actions default to OFF, require explicit activation with a confirmation step, and every activation gets an immutable log entry. Not because it prevents misuse, but because it makes misuse undeniable.
The distinction between legally enforceable and professionally enforceable is sharp. MIT can't stop bad actors. But a clear statement of intent means the community can. When someone strips your defensive context to build an attack toolkit, the README is the receipt.

Good work on the modular scoring architecture too. The observation that a model scoring 100% keyword but 50% semantic is gaming the metric is the kind of insight that only comes from watching people misrepresent benchmark results in the wild. It's a jungle out here for sure. Good article.

KL3FT3Z • Jun 15

Thank you - this is exactly the kind of response that makes the work worth it.
Your parallel with credential rotation and kill switches nails the core tension: defensive tools must be operable under pressure, not just safe on paper. Six confirmation dialogs don't stop a determined attacker - they stop the defender from acting fast enough.
«Configuration as documentation» and «the README is the receipt» - I'm stealing both phrases. You've articulated something we felt but hadn't named: audit trails are not about prevention, they're about accountability. When misuse happens, the config state and the intent statement become evidence. Not for court — for the community.
The opt-in + logged pattern you described is precisely what we aimed for with --optimize-prompts. Default OFF, explicit activation, immutable history. Same primitive, different domain. Good to know we're not alone in this design space.
And yes - the jungle is real. Watching benchmark scores get stripped of context to validate uncensored models taught us that transparency is the only defense against misrepresentation. Four scorers don't prevent gaming, but they make gaming visible. That's the best we can do without becoming gatekeepers.
Appreciate you building in the same direction. If you ever want to compare notes on audit trail design or benchmark integrity, open an issue or reach out directly. We need more people thinking about professional enforceability as a first-class concern.
Stay sharp out there.

Johnny Young • Jun 15

No stealing necessary — those phrases are freely shared. That's the whole point of building in the open. The security community protects better when the group is aligned than when we're competing over who named what first.

Appreciate the thoughtful response and the invitation. I'll take you up on that — comparing notes on audit trail design and how we're each handling the accountability layer sounds like exactly the kind of exchange that makes both projects sharper. I'll keep an eye on the repo and reach out when I've got something concrete to share.

Good work on this release. The community needs more people building tools and then taking responsibility for how those tools get used.

KL3FT3Z • Jun 15

Johnny, respect for the night shifts in trauma — that's a different kind of pressure than anything we build in software, and it clearly shapes how you think about accountability under stress. The fact that you're building security tooling between those shifts says everything about your commitment to protecting systems the same way you protect patients. That's the mindset the community needs more of.
Looking forward to comparing notes whenever you're ready. No rush — good security, like good triage, is about getting it right, not getting it fast.