Stephan Miller
Originally published at stephanmiller.com

I Built an Obsidian OCR Plugin for My Notebooks, Then Started Talking to OpenClaw Instead

Rotring and Notebook

I use a physical notebook to collect ideas. Eventually I add these notes to Obsidian. The problem is that “eventually” can be a long time.

The pages sit in the notebook. The notebook goes in the bag. The bag goes under the desk. Three weeks later I’m digging through it trying to remember that idea I had at the coffee shop that I was absolutely certain I would remember. The transcription never happens because transcription is boring and I am lazy.

So I built an Obsidian plugin to do it for me. It works. The OCR is good, the rule engine is clever, and the folder monitor runs automatically. I’m genuinely proud of how it came together.

Then I mostly stopped using it. Not because it’s broken, but because it solved the wrong problem.

The Hardware Setup Nobody Asked For

Before we get into the plugin, let me explain why this problem exists, because “just type your notes” is not a real answer.

The leather case and the Rotring 600 are not affectations. The case means the notebook survives being thrown in a bag with keys and cables. The Rotring 600 is a metal drafting pencil that weighs enough to feel like an actual tool and writes consistently whether you’re at a desk or scribbling at a coffee shop. Together they make writing fast and comfortable enough that I actually do it.

The kraft paper notebooks are the recent upgrade. Field Notebooks are beautiful but not cheap, and I fill them fast. Kraft paper composition books in the same size cost a fraction of that and I’ve stopped caring about them being pretty. Turns out “not precious” helps with actually using them. More ideas. More pages filled. More stuff sitting unread in the analog void.

The switch to cheaper notebooks was supposed to make transcription feel less painful. It didn’t. What it actually did was produce more notes that I wasn’t transcribing. Good problem to have, mostly. So I decided to automate it.

Spec Mode With Kiro: Let AI Design the Damn Thing

I’ve written before about using Kiro for Obsidian plugin development. It’s become my go-to for this kind of plugin or extension project: reliable, follows best practices better than I do when I’m in a hurry, and the spec mode is legitimately useful.

Spec mode is where you tell Kiro what you want to build before you build it. Instead of jumping straight to code, it produces a detailed spec, you review it, and then it builds from the spec. The result is usually more coherent than “AI, build me a thing” with no upfront planning. I’ve been burned enough times by starting from nothing to appreciate this.

Here’s approximately what I handed it:

I want to create an Obsidian plugin designed for processing pictures of 3.5" x 5.5"
field notebook pages with OCR and importing the resulting data.

Basic MVP:
- An Obsidian command
- It launches a file picker
- You select one or multiple image files
- It uses OCR to put this data in the daily note for the day
- Config: a heading to put the imported notes under in the daily note

Enhancements:
- Detect patterns in text to separate and route data to correct notes
  - *[Project Name]: [TODO item] → add to that project's task list
  - this/hierarchical/tag: [Description] → create new idea note in specific folder
  - Unmatched notes → dump into daily note as bullet list
- Macro system: patterns configurable in settings (regex + template),
  not hard-coded rules based on my examples
- Regularly check a specific folder for new images to process (hourly or daily)

Nice to have: mobile support with camera

Kiro came back with a spec, I approved it, and it built the thing.

Obsidian OCR Plugin

The OCR Backend

Here’s where the project got more interesting than I planned.

Tesseract first. Kiro built the initial version using Tesseract.js, which is the WebAssembly port of the open-source Tesseract OCR engine. It runs entirely in the browser/Electron environment: no API keys, no internet required, no cost. For printed text, it’s genuinely good. I tested it on a few photos of typed documents and it came back clean.

My handwriting is not printed text. Tesseract read my handwriting the way someone might read a foreign language they’ve seen but never studied: confident and completely wrong. The words it produced were adjacent to reality at best.

OpenAI Vision next. The plugin has a clean interface for OCR backends, so swapping was straightforward:

interface OCRService {
  initialize(): Promise<void>;
  processImage(imageData: ArrayBuffer): Promise<OCRResult>;
  isAvailable(): boolean;
}

// Tesseract implementation
class TesseractOCRService implements OCRService { ... }

// Swap in OpenAI
class OpenAIVisionService implements OCRService { ... }

Set up the API key, pointed it at my notebook photos. Better than Tesseract. Not dramatically better. OpenAI Vision handles handwriting, but my particular combination of fast writing and cramped field notebook pages wasn’t making it easy. Legible enough to be useful about 70% of the time. Not good enough to trust.

Google Cloud Vision won. This one took longer to set up. The credential flow for Google Cloud is always a little more involved than it should be. But the results were noticeably better. Google Vision handles handwriting well, and the confidence scores it returns are actually useful for separating the garbage OCR results from the merely mediocre ones.
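Filtering on confidence might look roughly like this. It is a minimal sketch, not the plugin's code: the `OcrLine` shape is a simplified assumption (Google Vision's real response is nested more deeply), and `splitByConfidence` and the 0.8 threshold are illustrative names and values.

```typescript
// Hypothetical, simplified OCR output: one entry per recognized line.
interface OcrLine {
  text: string;
  confidence: number; // 0.0 (garbage) to 1.0 (certain)
}

interface ConfidenceSplit {
  trusted: OcrLine[];
  needsReview: OcrLine[];
}

// Separate lines worth importing from lines that should be checked by hand.
function splitByConfidence(lines: OcrLine[], threshold: number): ConfidenceSplit {
  const result: ConfidenceSplit = { trusted: [], needsReview: [] };
  for (const line of lines) {
    if (line.confidence >= threshold) {
      result.trusted.push(line);
    } else {
      result.needsReview.push(line);
    }
  }
  return result;
}

// Made-up sample data, only to show the call.
const sampleLines: OcrLine[] = [
  { text: "- call the plumber", confidence: 0.93 },
  { text: "- qxv7 l1rnp", confidence: 0.41 },
];
const sampleSplit = splitByConfidence(sampleLines, 0.8);
console.log(sampleSplit.trusted.map((l) => l.text));
```

Keeping the low-confidence lines in a separate bucket, rather than dropping them, leaves room to review them against the original photo.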

The kicker: the free tier is 1,000 OCR units per month. That’s more than enough for my actual usage. The plugin currently supports all three backends, selectable in settings. Tesseract is the offline/free default. Google Vision is what I actually use when I want results I trust.
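Choosing a backend from settings could be sketched as below. This models only the `isAvailable()` part of the `OCRService` interface shown earlier. The `name` field, the `OcrSettings` shape, and the fall-back-to-Tesseract behavior are assumptions made for illustration, not details confirmed by the plugin.

```typescript
type BackendName = "tesseract" | "openai" | "google";

// Only what selection needs; `name` is an addition for this sketch.
interface SelectableBackend {
  name: BackendName;
  isAvailable(): boolean;
}

// Hypothetical settings shape; the plugin's real settings are not shown.
interface OcrSettings {
  backend: BackendName;
  openAiKey: string;
  googleKey: string;
}

// Pick the configured backend, falling back to the offline default
// when the preferred one is missing its credentials.
function pickBackend(settings: OcrSettings): SelectableBackend {
  // Tesseract runs locally, so it is always available.
  const tesseract: SelectableBackend = { name: "tesseract", isAvailable: () => true };
  const backends: SelectableBackend[] = [
    tesseract,
    { name: "openai", isAvailable: () => settings.openAiKey.length > 0 },
    { name: "google", isAvailable: () => settings.googleKey.length > 0 },
  ];
  for (const backend of backends) {
    if (backend.name === settings.backend && backend.isAvailable()) {
      return backend;
    }
  }
  return tesseract;
}

// Google is selected but no key is set, so this falls back.
const chosenBackend = pickBackend({ backend: "google", openAiKey: "", googleKey: "" });
console.log(chosenBackend.name); // prints "tesseract"
```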

The Rule Engine: When Your Handwriting Has Structure

This is the part of the plugin I’m most pleased with, and also the part that took the most explaining to Kiro.

The dumb version of OCR import is: scan image → dump text into daily note. That’s fine for random notes. But my notebook has more structure than that. I developed shorthand over years of using these notebooks, and I wanted the plugin to understand it.

My notation system, roughly:

  • A dash (-) means a plain note. Just information.
  • An asterisk (*) means it’s tied to an active project. Format: *[Project Name]: thing I need to do
  • A hierarchical path followed by a colon means it’s a new idea. Format: ideas/software: description of idea

The plugin’s rule engine lets you configure these as regex patterns with templates. Here’s what one looks like:

Pattern: \*\[(.+?)\]:\s*(.+)
Template: "## 2\n\nAdded from notebook import."
Target Note: Projects/1/Tasks.md
Action: insert-content (at end of file)

When OCR text matches *[SomeProject]: some task, the rule extracts SomeProject and some task as capture groups (the 1 and 2 in the template and target note above stand for the first and second group), renders the template with them, and inserts the result into Projects/SomeProject/Tasks.md. Anything that doesn’t match a rule falls back to the daily note as a bullet point.
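The matching step can be sketched in a few lines. The regex is the one from the rule above. Everything else is an assumption for illustration: the `$1`/`$2` placeholder syntax, the `ImportRule` and `RoutedNote` shapes, and the `Daily/today.md` path are not taken from the plugin.

```typescript
// A rule as described above: regex, template, target note.
// The `$1`/`$2` placeholder syntax is an assumption for this sketch.
interface ImportRule {
  pattern: RegExp;
  template: string;
  targetNote: string;
}

interface RoutedNote {
  targetNote: string;
  content: string;
}

// Replace $1, $2, ... with the matching capture groups.
function fillPlaceholders(text: string, groups: string[]): string {
  return text.replace(/\$(\d+)/g, (whole: string, index: string) => {
    const value = groups[Number(index)];
    return value === undefined ? whole : value;
  });
}

// Try each rule in order; unmatched lines fall back to the daily note.
function routeLine(line: string, rules: ImportRule[], dailyNote: string): RoutedNote {
  for (const rule of rules) {
    const match = rule.pattern.exec(line);
    if (match !== null) {
      return {
        targetNote: fillPlaceholders(rule.targetNote, match),
        content: fillPlaceholders(rule.template, match),
      };
    }
  }
  // Strip a leading dash so plain notes do not end up with two bullets.
  const plain = line.trim().replace(/^-\s*/, "");
  return { targetNote: dailyNote, content: "- " + plain };
}

const projectTaskRule: ImportRule = {
  pattern: /\*\[(.+?)\]:\s*(.+)/,
  template: "## $2\n\nAdded from notebook import.",
  targetNote: "Projects/$1/Tasks.md",
};

const routedTask = routeLine("*[SomeProject]: some task", [projectTaskRule], "Daily/today.md");
console.log(routedTask.targetNote); // prints "Projects/SomeProject/Tasks.md"
```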

The macro system means these rules are configurable in the plugin settings. You’re not stuck with my notation. You can define any regex pattern, any template, any target file or folder, and any of five insertion strategies (beginning, end, before/after a pattern, or under a heading).
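The five insertion strategies could be modeled as a small tagged union. This is a sketch under assumptions: the `Insertion` type, the strategy names, and the choice to append at the end when a pattern or heading is not found are all illustrative, not the plugin's documented behavior.

```typescript
type Insertion =
  | { kind: "beginning" }
  | { kind: "end" }
  | { kind: "before-pattern"; pattern: RegExp }
  | { kind: "after-pattern"; pattern: RegExp }
  | { kind: "under-heading"; heading: string };

// Index of the first line passing the test, or -1 if none does.
function indexOfLine(lines: string[], test: (line: string) => boolean): number {
  let index = 0;
  for (const line of lines) {
    if (test(line)) return index;
    index++;
  }
  return -1;
}

function insertContent(note: string, content: string, where: Insertion): string {
  const lines = note.split("\n");
  let at = lines.length; // default: append at the end
  switch (where.kind) {
    case "beginning":
      at = 0;
      break;
    case "end":
      break;
    case "before-pattern": {
      const pattern = where.pattern;
      const found = indexOfLine(lines, (l) => pattern.test(l));
      if (found >= 0) at = found;
      break;
    }
    case "after-pattern": {
      const pattern = where.pattern;
      const found = indexOfLine(lines, (l) => pattern.test(l));
      if (found >= 0) at = found + 1;
      break;
    }
    case "under-heading": {
      const heading = where.heading;
      const found = indexOfLine(lines, (l) => l.trim() === heading);
      if (found >= 0) at = found + 1;
      break;
    }
  }
  lines.splice(at, 0, content);
  return lines.join("\n");
}

const dailyNoteText = "# Daily note\n## Notebook\n## Other";
console.log(insertContent(dailyNoteText, "- idea", { kind: "under-heading", heading: "## Notebook" }));
```

Patterns passed in should not carry the `g` flag, since `RegExp.prototype.test` keeps state between calls on global regexes.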

The Physical Reality

There’s a gap between the physical notebook and the digital image that’s its own problem, separate from OCR accuracy.

Scanning flat is a trap. If you lay a notebook flat on a flatbed scanner, the scanner sees two pages. OCR then produces text that intermingles left and right pages in the order the lines appear on the scan, which is not the order you wrote them. The output is a word salad of two separate notes. I discovered this after my first real import batch and had to re-photograph everything by hand.

The workaround is either to photograph individual pages with a camera or to scan one page at a time by folding the notebook back and covering the other page. Camera photos work fine and are faster.

Line breaks are OCR’s favorite lie. When OCR reads a handwritten page, it sees line endings as, well, line endings. Every physical line in the notebook becomes a line break in the output. Notes that span multiple physical lines get chopped up, and the regex patterns that depend on a consistent format start failing on the second line. The plugin does some normalization but it’s imperfect: notes that wrap in the notebook still come out fragmented.
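One heuristic for the wrapping problem is to treat any line that does not start with a notation marker as a continuation of the previous entry. This is a sketch of that idea, not the plugin's actual normalization. It assumes every entry starts with one of the three markers from the notation above, and that idea paths contain at least one slash, so a continuation line like "remember: bring cash" is not mistaken for a new idea.

```typescript
// Does this line start a new entry in the notation above?
//   "- note", "*[Project]: task", or "some/path: idea"
function startsEntry(line: string): boolean {
  return /^(-\s|\*\[|[\w-]+(\/[\w-]+)+:\s)/.test(line);
}

// Glue wrapped physical lines back onto the entry they belong to.
function joinWrappedLines(ocrText: string): string[] {
  const entries: string[] = [];
  let current = "";
  for (const raw of ocrText.split("\n")) {
    const line = raw.trim();
    if (line.length === 0) continue;
    if (current !== "" && startsEntry(line)) {
      // A marker begins a new entry, so the previous one is complete.
      entries.push(current);
      current = line;
    } else {
      current = current === "" ? line : current + " " + line;
    }
  }
  if (current !== "") entries.push(current);
  return entries;
}

const wrappedPage = [
  "*[Blog]: write the OCR",
  "plugin post",
  "- buy pencil lead",
  "ideas/software: voice notes",
  "filed by an agent",
].join("\n");
console.log(joinWrappedLines(wrappedPage));
```

The weakness is the same one described above in a different form: a wrapped line that happens to begin with a dash or a path-like word will be split off as its own entry.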

The Part Where the Plugin Wins and I Stopped Using It Anyway

The plugin works. I’ve used it to clear out a backlog of older notebooks: pages that were just sitting there, never going to get manually transcribed, information that would have stayed locked in paper forever. For that use case, it’s great. Take a batch of photos, drop them in the Inbox/ folder, let the monitor run, and an hour later the notes are in Obsidian where they can at least be searched.

But my daily workflow? I stopped feeding it new notebooks almost immediately.

Here’s the thing I didn’t expect: the process of sitting down with a two-week-old notebook and actually reading through it is not the friction I thought it was. It’s the point. When I flip through pages I wrote two weeks ago, I see ideas I’d forgotten about with fresh eyes. The idea that seemed obvious when I wrote it down looks different now that I’ve been thinking about other things. Connections form. An idea from page 4 relates to something I was doing last week that didn’t exist yet when I wrote page 4. A project note that felt stalled when I wrote it suddenly has a new angle.

None of that happens when the notes get automatically inserted into Obsidian without a human in the loop. They go in, they get tagged, they sit in the daily note at the right date, and I never look at them again because the capture already happened and my brain considers it done. The ideas don’t get that second read. They don’t get the benefit of time and distance. They just disappear into the vault.

So I went back to reviewing notebooks by hand, and I wrote most of this post ready to land on a tidy little lesson: manual transcription was the review, and I’d been trying to automate away the one part that mattered.

That lesson is half right. I just had the wrong half.

The Fix Was Talking to a Robot

The thing I’d actually stumbled onto wasn’t “manual good, automated bad.” It was “the review needs a mind in the loop.” The OCR plugin failed not because it was automated but because it pulled my brain out of the process entirely. Scan, route, done: no thinking required, so no thinking happened.

Which raised a question: could I automate the filing, the boring part, while keeping my brain in the capturing?

Yes, and the answer was already running on a server in my house.

I have an AI agent I talk to over Telegram. Not the plugin, but a general-purpose agent with a set of skills, one of which knows how to write to this vault. So now my notebook-to-Obsidian flow looks like this: I open Telegram, hit the voice button, and just talk my notes. Speech-to-text transcribes the ramble, hands it to the agent, and the agent figures out where it goes. Is this an idea, a daily note, a to-do for a specific project, a blog draft? It pulls the vault, drafts the note from the right template, and then, this is the part that matters, reads it back to me and asks before it writes anything.

That last step is the whole game. It’s a conversation, not a capture. And two things fall out of it that hand-review gave me and silent OCR never did:

Speaking the notes is re-remembering them. When I transcribe by hand I copy the line and move on. When I have to say a note out loud, I can’t just copy it. I have to re-explain it to myself. I ramble. And rambling is thinking. Half the time a better version of the idea falls out of my mouth than the four cramped words I scrawled. Saying it out loud grows the thing.

The agent talks back. It asks which project a task belongs to. It notices that what I’m describing relates to a note from last week and offers to cross-link them. It pushes back when I’m vague. The note that lands in the vault is richer than anything that was on the paper, because two minds touched it instead of zero. Then it commits and pushes, and the note syncs back to every device before I’ve set my phone down.

It’s automated and it’s a review, more of one than reading by hand ever was. Hand-review is one tired brain reprocessing old notes. This is two brains building the note up in real time, at the moment I’ve got the most context.
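The read-back-then-write flow described above boils down to a gate: nothing reaches the vault without a yes. A minimal sketch follows, with every name hypothetical (`Draft`, `AgentHooks`, `fileDictation`). The real agent's classification, templates, and Telegram handling are not shown in this post and are stubbed out as callbacks here.

```typescript
type NoteKind = "idea" | "daily" | "todo" | "blog-draft";

interface Draft {
  kind: NoteKind;
  targetNote: string;
  body: string;
}

// Everything the agent touches is passed in, so the gate itself stays testable.
interface AgentHooks {
  draftNote(transcript: string): Draft; // classify and fill the template
  confirm(draft: Draft): boolean;       // read it back, wait for yes or no
  write(draft: Draft): void;            // write to the vault
}

// Returns true only if the draft was approved and written.
function fileDictation(transcript: string, hooks: AgentHooks): boolean {
  const draft = hooks.draftNote(transcript);
  if (!hooks.confirm(draft)) {
    return false; // nothing is written without approval
  }
  hooks.write(draft);
  return true;
}

// Stub hooks that reject every draft, to show that rejection blocks the write.
const rejectedWrites: Draft[] = [];
const rejectingHooks: AgentHooks = {
  draftNote: (t) => ({ kind: "idea", targetNote: "Ideas/example.md", body: t }),
  confirm: () => false,
  write: (d) => { rejectedWrites.push(d); },
};
console.log(fileDictation("voice notes filed by an agent", rejectingHooks)); // prints false
```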

Where It Actually Fits

So I’m not abandoning anything. I ended up with three tools for three jobs, which is more than I set out to build and exactly right.

  • Old notebooks and backlog → the OCR plugin. I have notebooks going back years full of ideas that are never getting a careful read. The plugin burns through those and gets them into a searchable state. Fire and forget. This is the job it’s genuinely good at.
  • Notes I want to actually sit with → read by hand. Still valid, still happens. Sometimes the right move is to close the laptop, put down the phone, and just turn pages.
  • Daily capture → dictate to the agent. This is the new default for anything fresh. Automated filing, human-in-the-loop thinking, a second mind that makes the note better on the way in.

The things about the OCR plugin that are still rough, in case you want to build something similar:

  • Line break normalization works for single-line entries, breaks down for anything that wraps
  • The confidence threshold for Google Vision needs tuning per handwriting style: start conservative
  • Mobile support didn’t make the first version; camera integration on mobile Obsidian is a whole other project

The things worth stealing:

  • Kiro spec mode for plugin development is underrated. Write the whole spec first, let it plan, build from the plan. The 3,500-line single-file result was architecturally weird but delivered a working plugin without babysitting.
  • Google Vision’s free tier (1,000 units/month) covers any reasonable personal usage.
  • The strategy pattern for OCR backends is genuinely useful: Tesseract gets you most of the way for zero cost, and swapping to Google Vision when it matters is two lines of config.
  • The real one: if you’re automating a personal workflow, figure out which step is secretly doing work you don’t want to lose. Automate around it, not through it.

I built a tool to eliminate the boring part of my workflow and discovered the boring part was a feature. Then I found a way to keep the feature and still kill the boredom. That second realization only happened because the first tool failed in a specific, instructive way.

This is actually one of the underrated advantages of AI-assisted development. Sometimes your dreams are stupid and you don’t know it until you test them. Before, you’d pine away for months waiting for a spare weekend to finally build the thing, and by the time you got there you’d either talked yourself into it being a great idea or life had moved on and you never built it at all. Now you can build it in a day and realize the error of your ways in a few hours or a few days of actual use. The feedback loop went from months to a long weekend. Not that an Obsidian OCR plugin is a life-changing idea (I know what I’m working with here), but the same principle applies to things that actually matter. Build the dumb dream fast, find out if it’s actually dumb, and let what you learn point you at the thing you should have built instead.
