A working Telegram bot that stress-tests trainees with a pressure character, coaches them in real time with Socratic hints, and scores their performance. All from a single message.
The Problem
Most AI training tools do one of two things: they quiz you, or they roleplay with you. Neither is quite right for professional skill training. A quiz tells you what you know. A roleplay with a helpful AI tells you what you should have said.
The missing piece is pressure. Something that watches how you respond under realistic conditions, nudges you when you're heading the wrong way, and then gives you an honest account of how you actually performed.
That's what I wanted to build.
What TeachSim Is
TeachSim is a Telegram-based training simulation. A trainee starts a session, gets dropped into a realistic workplace scenario, and has to navigate it in real time. What they don't see: two additional AI agents are watching the whole conversation. One is ready to offer a Socratic hint if the trainee stalls or heads in the wrong direction. The other is building a scored performance report that fires the moment the session ends.
Three agents. One conversation. The trainee only ever talks to one of them.
The Architecture
The three agents run as nodes in a LangGraph StateGraph, sharing a single state object that tracks everything (conversation history, escalation level, hints used, resolution signals, scoring dimensions) across every exchange.
User message → Chaos node → Mentor node → [conditional]
│
session_active=False → Score → END
session_active=True → wait → END
Each agent gets its own LLM client tuned for its role:
get_chaos_client() # temperature=0.75 — natural, varied character dialogue
get_mentor_client() # temperature=0.3 — measured Socratic hints
get_scoring_client() # temperature=0.1 — deterministic JSON report
All three call DeepSeek V4 Flash via OpenRouter. The model string is deepseek/deepseek-v4-flash. Switching providers means changing one file in the environment.
The Two Live Scenarios
Girls in STEM: Responding to an Excluded Student
The trainee receives a message from a student named Jamie who feels pushed out of a STEM group. They have to draft a reply. The Chaos Persona (Jamie) escalates if the trainee stalls, over-apologises, or writes something technically correct but emotionally tone-deaf.
This scenario is grounded in the Brooks (2025) TALK framework, with three patterns the scoring rubric explicitly tracks: responsiveness (did you address what Jamie actually said, not just the emotional category?), superfluous apology (hedging that signals your discomfort rather than addressing hers), and topic pyramid (connection first, explanation second, practical close third). The mentor reads the trainee's actual drafted reply before deciding whether to intervene, not just a keyword trigger.
Claude Code Assessment
The trainee is assessed by an AI trainer persona, Alex (warm, patient) or Jordan (direct, challenging), against a tiered competency framework before being granted access to Claude Code. Up to 16 exchanges, three difficulty levels, coverage thresholds that change by tier.
Novice difficulty always routes to Alex regardless of selection. Expert difficulty requires 80% coverage across all three tiers before the readiness verdict fires.
Under the Hood
The Chaos Persona escalates. Escalation level runs from 0 to 4 and never decrements. If the trainee stalls, makes repeated mistakes, or produces something clearly off-target, the persona gets more direct and less patient. The trainee can't reset the mood by being polite. They have to solve the actual problem.
The Mentor is silent until it identifies an issue. Mentor triggers are defined per scenario as structured conditions with IDs (MT-01 through MT-99), severity weights, and a max_fires ceiling. The same trigger won't fire twice. When it does fire, the hint appears as a coaching note, separate from the main conversation. The trainee knows the system is watching; they don't know exactly when it will speak.
New scenarios need no Python. The entire scenario definition lives in a JSON file: persona, escalation arc, resolution conditions, mentor triggers, scoring rubric, difficulty variants. Pydantic validates every JSON at startup. Adding a new simulation topic is a schema problem, not a code problem.
Try It
The bot is live at @teachsim_bot. Start a session, pick a scenario and difficulty, and see how the Chaos Persona responds to a weak first reply.
1. Open Telegram → search @teachsim_bot
2. /start
3. Pick a scenario
4. Pick a difficulty (novice / standard / expert)
5. Send your first message
Sessions time out after 20 minutes. The score report fires automatically when the session resolves or the trainee runs out of exchanges.
Honest State of It
Two scenarios in production. Four were built; two are deprecated because the training value is not as high as the two in production. The repo is currently private while I decide whether to open source it. The bot is single-instance, single-region, not designed for concurrent scale yet.
On Open Sourcing
I'm still deciding. There's enough going on here (three-agent LangGraph with per-agent temperature tuning, data-driven scenario schema, Socratic mentor triggering) that it might be more useful as a reference implementation than as a closed tool. If there's interest from people who want to build domain-specific training simulations on top of it, open source makes sense. If you'd use this for something, tell me what.
What's Coming Next
TeachSim was built around workplace and tool scenarios: high stakes, clear right answers, measurable outcomes. The architecture works well for that. But the same three-agent design (pressure character, silent mentor, scored report) maps onto a much older and harder problem: everyday conversation.
I'm planning a second, more substantial simulation platform on the same foundation. This one is grounded in Alison Wood Brooks' conversation research, specifically her work on topic flow, follow-up questions, the patterns that make people feel genuinely heard versus politely processed. The mentor in TeachSim watches for technical mistakes. The mentor in this one watches for the conversational habits most people don't know they have: the question that shuts a topic down instead of opening it, the pivot that signals discomfort, the apology that's really about the speaker.
38 situations are designed across six tiers, from a first meeting and a first date through to emotionally complex, high-stakes conversations. Nine JSON files are built. Two are bot-tested. The architecture is the same; the theory layer underneath is different and deeper.
More on that when it's ready. If conversation science and AI simulation overlap with something you're working on, follow along.
Closing Thought
The bet TeachSim makes is that pressure-testing is the missing layer in AI training tools. Most tools will tell you the right answer after you get it wrong. This one makes you find it under conditions that feel like the real thing. Whether that produces better retention, faster skill transfer, or just higher stress is something I'd like to measure. If you run a session and have a reaction, good or bad, I want to hear it.
TeachSim is live at @teachsim_bot. Built with LangGraph, DeepSeek V4 Flash, and python-telegram-bot. Repo currently private; open source decision pending. A second simulation platform based on Alison Wood Brooks' conversation research is in development on the same architecture.
Find me on GitHub: github.com/mediblacksand
Reference: Brooks, Alison Wood. Talk: The Science of Conversation and the Art of Being Ourselves. Crown, 2025.
In loving memory of Zhang Fu, 1950–2026.
Top comments (2)
The three-agent split is the strongest design choice here. Trainer, evaluator, and learner are different jobs, and collapsing them into one agent usually makes the feedback too soft.
I would be especially interested in how you score improvement across sessions. For training sims, the hard part is not generating scenarios; it is proving the learner is getting better instead of just seeing more varied prompts.
Thanks Alex. You've identified the gap I haven't even considered yet. Right now scoring is per-session only; there's no persistence across sessions to track improvement over time. The per-session report gives dimensional scores (responsiveness, apology quality, topic pyramid execution) but I'm not comparing run 3 against run 1 for the same learner.
The honest answer is I haven't solved the cross session problem yet. My instinct is that it needs a baseline session first. Something to serve as a meaningful reference point. But I haven't built it. If you've seen a sim that handles this well I'd be curious how they approached it.