A story about knowledge extraction, Kafka consumer rebalance, and what happens when a company discovers their AI Skill knows the past — but not th...
The 5x callback says everything about where we are right now. Packaging 12 years into a Skill captures the steps but not the judgment about when a step is wrong. That gap is what bit them on the Kafka rebalance, and what they paid 5x to hire back.
You hit it exactly. The Skill could learn the poll timeout config, but it couldn't learn why it was 5 seconds and not 10 — that decision came from a 3 AM outage postmortem that never made it into the training data.
The 3 AM postmortem detail is the whole thing. A Skill copies the config but not the scar that set it, and that scar is what seniority actually is. So we let the agents do the work and keep a human on the calls that need the context behind them.
This is exactly the split — let the AI do the work, keep the human on the call that explains why.
That's the line I keep landing on too. Automating the work is the cheap part, the judgment about when it's actually done is what I still want a person standing behind.
Andrii, exactly this. Automating the work is the cheap part — it's the judgment call on "when is it actually done" that I still want a real person on the other end of the phone for. The AI can do the work, sure, but done doesn't mean done right. What clients are paying for is knowing someone's there to catch it.
This is the uncomfortable side of AI that doesn’t get discussed enough.
Capturing someone’s knowledge is useful, but replacing the person who understands the judgment behind that knowledge is where companies usually create bigger problems later.
That phone call was decided two months before it happened — from the moment they chose to replace the person with the model, the missing part was already missing. The call was just a confirmation.
This is kinda wild. It's very important to understand the limitations of AI and what it is and isn't capable of. At the end of the day you always need the developer in the loop to be able to build something that is maintainable and scalable. If most folks haven't gotten this wake up call yet, they will sooner rather than later.
Everyone says "developer in the loop" until it's 3 AM and you're the loop.🤣
Love the comeback, Mark. What's remarkable is what this says about how AI gets used: once again, by a market that only cares about the bottom line, not the amplification it can have on software development as a whole. Obviously short-sighted, but it's the path of least resistance. It's much harder to sell a new product than to just cut costs. The software industry as a whole would benefit if engineers weren't viewed as cost of goods sold, but as the force that gets multiplied by AI.
Man, "engineers as COGS" is going to stick with me. That's the whole thing in four words.
Every company says they want innovation. Then they look at the spreadsheet and realize innovation costs money. So they cut. And call it "AI transformation."
It works until it doesn't. Then they call someone like Mark. At 5x.
Oh, if it were possible to fit the entire 12 years of experience into one knowledge transfer ... then it would be possible not to raise anyone's salary at all. (please don't show this idea to the business) 😅
Too late. Pretty sure the CTO in the story already wrote this down. 🤣
The real issue isn't that the AI captured domain knowledge, it's that the company assumed a frozen snapshot of expertise equals a continuously adapting engineer. Skills decay without maintenance. The moment their business context shifted even slightly, the AI skill broke because nobody was updating the underlying assumptions. Companies treating AI as a headcount reduction tool instead of a force multiplier will keep learning this lesson the expensive way.
"Skills decay without maintenance" — that's the line that hit hardest for me.
What I didn't put in the story: maintaining that AI Skill would've cost almost as much as keeping me. Same engineering hours, different line item. They just couldn't see the second bill coming.
How's the maintenance story look on your side — you seen teams that actually budget for it upfront?
That was a good way to get a promotion 😂😂😂
Right? Walked in on day one and they were already logging my Slack messages for Skill 2.0. Next layoff's gonna cost 'em 10x.🤣🤣🤣
🤣🤣🤣🤣🤣
The tell is in Act 1, when Caleb stopped asking "how would you figure this out" and started asking "what was the answer." Those capture different things. "Why 450ms" has both an answer, the number, and a generator, the reasoning that picks 450ms from this broker, this load, this backpressure curve. He wrote down the number. When the consumer rebalance changed the conditions, the number was stale and the Skill had no generator to recompute it, because nobody recorded the procedure, only its output. A transfer that captures decisions without the decision procedure is a snapshot that stays correct until the inputs move. Scenario 313 was the first time they moved after the recording stopped. Calling it an edge case hides that every scenario eventually becomes a 313. The 5x call was the company paying retail for the generator they assumed the transcript already held.
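One way to picture the answer/generator split in code. This is a minimal sketch, not anything from the story: `Decision`, `toy_timeout_ms`, and the input names are hypothetical, and the formula is a made-up stand-in for whatever reasoning really picked 450ms from the broker, the load, and the backpressure curve.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

Inputs = Mapping[str, float]


@dataclass(frozen=True)
class Decision:
    value: float                          # the recorded answer (the "number")
    inputs: Inputs                        # the conditions it was derived under
    generator: Callable[[Inputs], float]  # the procedure that produced it

    def for_conditions(self, current: Inputs) -> float:
        # The recorded answer is only reused while the inputs are unchanged.
        if dict(current) == dict(self.inputs):
            return self.value
        # Inputs moved: recompute instead of trusting the snapshot.
        return self.generator(current)


def toy_timeout_ms(inputs: Inputs) -> float:
    # Hypothetical formula, for illustration only.
    return inputs["pause_ms"] * inputs["safety_factor"]


recorded = Decision(
    value=450.0,
    inputs={"pause_ms": 300.0, "safety_factor": 1.5},
    generator=toy_timeout_ms,
)
```

A transcript that stores only `value` has nothing to call when `inputs` change, which is the gap described above.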
You nailed the thing I was trying to say but couldn't quite land.
The answer vs generator split — that's the whole story right there. Caleb wasn't malicious. He just did what every knowledge extraction does: recorded the output, not the thing that produces it. But the thing that produces it is designed for a world that moves. The output only works as long as the world stays still.
"Calling it an edge case hides that every scenario eventually becomes a 313." — I genuinely wish I'd written that sentence.
Thanks for reading this carefully. Means a lot 🙏
"They wanted my judgment, not just my output." That line hits like a truck. This is the ultimate proof that AI can clone past patterns, but it cannot navigate unprecedented chaos or real-world edge cases. Your 12 years of experience wasn't just a dataset; it was the actual structural pillars keeping that system standing.
You nailed it. Data records experience, but it isn't experience. The day that Skill went live they thought they'd cloned me — until things broke and they realized they'd copied my actions, not my hesitation. And it's the hesitation moments — when to say no, when to wait — that were worth the price tag.
This is the risk of turning experience into a static artifact. The painful parts of senior knowledge are often current context, weak signals, and knowing when an old rule no longer applies. A skill can preserve patterns, but it cannot replace live ownership of the system.
"Live ownership" — that's the bit. Everything else can be documented. That part can't.
The Knowledge Transfer Initiative excels at distilling 12 years of engineering judgment into concrete decision frameworks rather than generic documentation. This approach creates transferable problem-solving patterns; how will you measure the reduction in critical incident resolution time when applying these frameworks to novel system failures?
Great question. My metric was: number of 3 AM phone calls after the Skill went live. Turns out that number went up, not down. So they went back to the old solution — me.
This resonates. I worked on an AI robot project where we spent months tuning the model — only to realize the real bottleneck was how the robot stored and retrieved context from sensor streams. The model was fine. The memory layer was a flat text log with no structure. We ended up building a proper multimodal database (moteDB) that could index camera frames, audio snippets, and telemetry together with temporal relationships. Retrieval time dropped from seconds to milliseconds, and the robot actually started completing tasks it previously failed. The lesson: no amount of prompting fixes a broken memory architecture.
Memory layer being the real bottleneck — that hit home. Model was fine, but the pipeline feeding it collapsed. That's the kind of thing no accuracy benchmark catches. Appreciate you sharing this, man 🙌
Love the story, but I feel it's more AI generated than real.
I will give my reason also.
You mention:
Later you say:
At first I thought these were referring to the same incident, but then the timeline became a bit confusing.
Then you mention:
But earlier you also say:
This made me confused about when the Kafka migration actually happened. If RabbitMQ was retired three years ago, but the migration happened after Mark left, the timeline doesn't seem to line up.
The timeline just feels a bit messy to me. Maybe I am missing something, but I had to stop and think about the sequence of events instead of following the story.
You read that more carefully than I wrote it — and yeah, the timeline is a mess, that's on me. The 450ms thing showing up as both "a year ago" and "five years ago" isn't some clever narrative trick — it's just me not catching that I had two references to the same utility that needed cleaning up before publish.
Funny thing is I went through this story 30+ times myself and never spotted it. Guess that proves the point — you always need fresh eyes.
Thanks for actually walking through it. I'm planning to put this series together into a proper book at some point, and this is exactly the kind of thing that needs fixing in a revision.
Really appreciate you reading this closely.
The "right about yesterday" failure mode exists outside AI too. I maintain a salary calculator and the tax law it encodes changes several times a year. A formula that was 100% correct in January quietly becomes wrong in July, and nothing crashes: it just keeps returning plausible numbers. The only defense I found is the boring one: treat every rule as data with a validity window and a link to its source, and re-run validation when the world changes, not when the code changes. A 96.8% snapshot is worthless if nobody owns the re-validation calendar.
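For anyone who wants to see the "rule as data with a validity window" idea spelled out, here is a rough sketch. The `Rule` type, the `rule_for` helper, the rates, and the source strings are all placeholders invented for illustration, not the commenter's actual calculator.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    value: float
    valid_from: date          # first day the rule applies
    valid_to: Optional[date]  # last day it applies; None means still open
    source: str               # where the rule came from (placeholder text here)

    def is_valid_on(self, day: date) -> bool:
        if day < self.valid_from:
            return False
        return self.valid_to is None or day <= self.valid_to


def rule_for(rules: list[Rule], name: str, day: date) -> Rule:
    """Return the single rule valid on `day`, or fail loudly."""
    matches = [r for r in rules if r.name == name and r.is_valid_on(day)]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one valid '{name}' rule on {day}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical data: rates and sources are made up for illustration.
RULES = [
    Rule("income_tax_rate", 0.20, date(2024, 1, 1), date(2024, 6, 30),
         "placeholder: January bulletin"),
    Rule("income_tax_rate", 0.22, date(2024, 7, 1), date(2024, 12, 31),
         "placeholder: July amendment"),
]
```

The design choice worth noting is the loud failure: a date with no valid rule raises instead of quietly returning last period's plausible number.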
Nail on the head. The "doesn't crash, just returns plausible numbers" bit is exactly why this scares me — same with AI, it passes every pipeline check but the output is just wrong.
Your "treat every rule as data with a validity window" is basically what the story ends on too. Turns out the boring solution is the only one that actually works, whether you're dealing with tax tables or AI test results.
That last line about the re-validation calendar though — that's the real gut punch. A 96.8% snapshot means nothing if nobody remembers to check next quarter.
Nice article for a non-person (AI). I think it lacks the human touch you usually see in text written by real people.
You probably should've pushed it to the MICROSECONDS, for example: "the CTO called me at 3:47:12:675ms and I answered at exactly 3:47:14:500ms, an inhuman reaction, but I was expecting it, since it was quite obvious it would happen. At least I had a feeling, or a hope, that it would."
Then it would probably win all the wins.
Bro, you got me on the human touch point — fair one, I'll take it. As for the microseconds suggestion... my phone's clock only goes down to the minute anyway 🤣
The uncomfortable lesson here is that a skill can package steps, but it cannot fully package context. The edge cases, tradeoffs, and recovery paths are usually where the senior experience lives.
That's it. The happy path is easy to document. It's the "what if this fails" paths that take 12 years to collect — and most of them never get written down.
Exactly. The missing artifact is usually the failure map: what changed, what to check first, what not to reuse, and when to stop the automation and call a human.
That part is rarely in the docs because it was learned through incidents, not written during the happy build.
"Failure map" — that's the exact term for it. Documentation tells you how the system is supposed to work. What lives in a senior engineer's head is when it shouldn't work that way. And that second list... you can't package it into a skill. You have to earn it.
I agree with the spirit, but I would split it in two.
You cannot package all senior judgment into a skill. But you can package some of the warning signs that trigger judgment: strange inputs, missing source data, repeated retries, outputs that look plausible but changed the business meaning, and anything that requires an owner decision.
A good skill should not pretend to replace that experience. It should preserve enough of it to know when to stop.
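The warning signs listed above could be encoded as explicit stop conditions. A minimal sketch, assuming the skill can observe these signals at all; `RunState`, `stop_reasons`, and the `MAX_RETRIES` threshold are hypothetical names and values, not part of any real skill format.

```python
from dataclasses import dataclass


@dataclass
class RunState:
    input_matches_known_shape: bool
    source_data_present: bool
    retries: int
    business_meaning_changed: bool
    needs_owner_decision: bool


MAX_RETRIES = 3  # assumed threshold; tune per system


def stop_reasons(state: RunState) -> list[str]:
    """Collect every warning sign that should hand control to a human."""
    reasons = []
    if not state.input_matches_known_shape:
        reasons.append("strange input")
    if not state.source_data_present:
        reasons.append("missing source data")
    if state.retries >= MAX_RETRIES:
        reasons.append("repeated retries")
    if state.business_meaning_changed:
        reasons.append("output changed the business meaning")
    if state.needs_owner_decision:
        reasons.append("owner decision required")
    return reasons


def should_escalate(state: RunState) -> bool:
    # Any single warning sign is enough to stop the automation.
    return bool(stop_reasons(state))
```

This does not capture the judgment itself, only the triggers for it, which is the split being proposed.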
Runtime drift + architectural drift + unmonitored assumptions = catastrophic failure
That's the equation I couldn't get them to put on the slide. They had the 96.8% — they skipped the drift column.
Every time a company fires a senior with 10+ years there, I just count the months until they call them back. You can't tell me you are going to replace 12 years of experience with a tool that predicts based on what it has learned. You've just limited your company's growth to always relying on the same knowledge, with no new components introduced, and that's exactly what 12 years of experience is there for.
The callback timing thing is spot on. I've seen companies bring people back in under 6 months, and it ends up costing more than keeping them in the first place. And it's not just the knowledge they take. It's the judgment. A 12-year veteran knows what doesn't work and why, and those failures never make it into any training set. That's what you lose when you let them walk.
The Kafka consumer rebalance piece is what makes this story technically precise. The AI applied the correct fix for a protocol that no longer existed — 100% accurate about RabbitMQ, 100% wrong about the current architecture.
That's the failure mode I think about running a payments platform. The AI doesn't know which rule was written for the old system. Only the person who lived through the migration does. The 96.8% validation rate is a snapshot. Nobody asked the harder question: who reruns it every time the infrastructure changes?
You can't package that into a Skill. The Skill knows the fix. It doesn't know why the fix was safe.
Yeah the RabbitMQ thing is the scariest part — you can write something textbook perfect and be completely wrong at the same time. Validation passes, logic checks out, just the wrong planet. The skill doesn't know which planet it's on.
🤣🤣🤣
Did the company really think they could replace an expert with just a SKILL.md? I don't get it. AI lacks in-person context, hints from past conversations, etc. It's inevitable.
Right? But that's why it happened — someone who'd never written code saw "12 years in a file" and thought they'd found a shortcut. The people who knew better weren't in the room.
I love these stories...
brutal 😂
🤣🤣🤣
This is good writing I read after a long time. Cheers to you!
Appreciate that, Aditya. Means a lot 🙏
the 450ms match to the RabbitMQ GC window is the exact failure mode. the skill didn't fail because it was wrong — it failed because it had no signal its assumptions had expired. correct about RabbitMQ, fatal on Kafka, same confidence score on both.
hit the same pattern building a RAG pipeline for infra runbooks. retrieval surfaced correct answers about the old API surface: high similarity scores, completely wrong context. fix that worked: timestamp metadata on every chunk plus a deployment tag. if the tag doesn't match the current deploy, the agent surfaces ‘knowledge may be stale’ instead of proceeding.
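a rough sketch of that check, for the curious. this is not the actual pipeline: `Chunk`, `answer_or_warn`, and the tag values are hypothetical, and a real RAG system would run this after retrieval, per chunk.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Chunk:
    text: str
    recorded_at: datetime  # when the chunk was written or last verified
    deploy_tag: str        # the deployment the chunk describes


def answer_or_warn(chunk: Chunk, current_deploy_tag: str) -> str:
    """Return the chunk text only if it describes the current deployment."""
    if chunk.deploy_tag != current_deploy_tag:
        return (
            f"knowledge may be stale: recorded for '{chunk.deploy_tag}' "
            f"on {chunk.recorded_at:%Y-%m-%d}, "
            f"current deploy is '{current_deploy_tag}'"
        )
    return chunk.text


# Hypothetical runbook entry; the tag names are invented for illustration.
runbook_entry = Chunk(
    text="restart the consumer after changing the timeout",
    recorded_at=datetime(2023, 5, 2, tzinfo=timezone.utc),
    deploy_tag="queue-v1",
)
```

similarity score never enters the check, which is the point: a high-scoring chunk from the wrong deployment still gets flagged.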
who owns the revalidation calendar when infrastructure migrates?
The timestamp + deployment tag approach is solid. We only had a global "last reviewed" date on the whole skill — completely useless when only part of the context drifts.
The revalidation question is the real pain point. Nobody on our side owned it either. The team that built the skill assumed the agent would self-detect staleness. The team that migrated the infra assumed the skill would pick it up automatically. Two assumptions, one gap.
A SKILL.md may cover the past.
Your Skills can cover the future.
Wow, a poet in the Dev.to comments. 🙌🤩
12 years of know-how packed into an AI feature, then laid off? Talent should not be a disposable asset - start valuing people over metrics.
Well said. The irony is — the day they packaged 12 years of experience into a Skill, they genuinely believed that Skill was "you." Until it crashed. You can compress experience, but you can't compress judgment. That 5x salary call tells you everything — they always knew what was valuable. They just didn't want to pay for it before something broke.
Your people matter more than any metric.
... and I should have read this before reading the new one! That's awesome!
"It was right about yesterday — and yesterday wasn't running anymore."
That single sentence captures the most dangerous blind spot in AI knowledge extraction.
The Skill didn't fail because it was poorly built. It failed because the world moved and nobody re‑validated the assumptions. 312 correct answers created a false sense of permanence. Number 313 exposed the truth: the Skill was frozen in time. Mark wasn't.
What strikes me most is the validation gap. Companies love the 96.8% number. They put it on slides, in board decks. But validation is a snapshot, not a contract. Once the person who holds the context leaves, who re‑runs the tests? Who knows which edge cases were borderline? Who remembers that 450ms was RabbitMQ‑specific, not a universal constant?
The CTO's call at 4 AM says everything. They didn't need the Skill. They needed Mark. But Mark's price was 5x. That's the "context tax" — what you pay when you realize that documentation and AI can't replicate years of lived experience.
I've been building SHALA (a supportive agent for Human) and working on LLM security audits. This story reinforces something I keep telling teams: AI can compress experience, but it cannot own the responsibility for when that compression fails. Someone still has to hold the bag. And that someone should be paid accordingly.
Thanks for sharing this (and to the original storyteller). It's a necessary counterweight to the "just extract and replace" narrative.
Cheers,
Jack
DEV.to/ggle.in
Jack, thanks for reading — and for picking out that sentence. It was the one I kept coming back to while writing.
"It was right about yesterday — and yesterday wasn't running anymore."
You're spot on about validation being a snapshot, not a contract. What I think makes it even more dangerous is the gap between snapshots. Companies love to freeze-frame at 96.8%, frame it, put it on slides. But nobody accounts for what drifts between screenshot A and screenshot B. Configs change. Dependencies roll. Data distributions shift. None of that shows up on a dashboard, but it's exactly what kills you at 4 AM.
The CTO calling at 4 AM wasn't paying for Mark's knowledge. He was paying for someone willing to be the person who picks up at 4 AM. You can compress knowledge into a Skill. You can't compress "I'll take the call" into one. That's the real context tax — and there's no automation that avoids it.
SHALA and LLM security audits sound like exactly the kind of work the industry needs more of. Would love to hear how it goes. The line between "compression works" and "compression failed" — I think that's the question none of us have a good answer for yet, and we're all figuring it out in real time.
Cheers,
Xu
What an awesome achievement!
Thanks for reading and for the comment. Hope you'll enjoy the next one too.
wow
😁
Karma has a ticket price, and it's 5x your old salary. Hope you made them prepay.
Nah, I invoiced them net-30. Still waiting. 😅