kent-tokyo

Posted on Jun 19

Building chematic: Why I Wrote a Pure-Rust Cheminformatics Library from Scratch

#rust #cheminformatics #webassembly #opensource

I'm building chematic, a cheminformatics library in pure Rust, from scratch. Here's why I started, and what kept tripping me up along the way.

How it started

"I just want to run RDKit in the browser" — that was it.

RDKit.js exists. But the WASM binary is over 30 MB, and building it requires cmake, clang, and the Emscripten SDK. CI kept breaking, Docker images bloated, and every deploy meant setting up the build environment again. Too much overhead for what I wanted to do.

I tried OCL.js (OpenChemLib) too, but it's Java code transpiled via GWT, so the API feels Java-shaped and TypeScript types don't come out cleanly.

That's when I thought: what if I just wrote it in pure Rust? At the time, I figured a few weeks would be enough.

Why I imposed the pure-Rust constraint

I didn't set "no FFI" as a rule from the start. But as I kept writing, I could see that allowing one exception would cascade.

Take InChI. The official IUPAC implementation is in C, and reimplementing it correctly in pure Rust isn't realistic. The moment I said "I'll use FFI just for this," cmake and Emscripten dependencies would come back. Better to draw the line early. No FFI, no unsafe, no random number generation. Those three constraints went in first.

That "InChI is out" decision would later come back to bite me, this time through a bug report from outside the project.

Getting to 15 crates

I started with a single crate. As files multiplied and compile times grew, I split out chematic-core and chematic-smiles first. Every time a new feature landed, I carved it out, and now there are 15 crates.

chematic-core       → atom/bond/molecule primitives, kekulization
chematic-smiles     → OpenSMILES parser, canonical SMILES writer
chematic-perception → ring perception (SSSR), aromaticity
chematic-mol        → MOL/SDF file I/O
chematic-depict     → 2D SVG rendering (CPK colors, SMARTS highlighting)
chematic-chem       → 70+ descriptors, pKa prediction, ADMET profiling
chematic-fp         → 6 fingerprint types (ECFP, MACCS, MAP4, etc.)
chematic-smarts     → SMARTS parser, substructure search (VF2)
chematic-ff         → force field implementations (UFF, DREIDING, MMFF94)
chematic-3d         → 3D coordinate generation (ETKDG), conformer handling
chematic-rxn        → reaction SMILES/SMIRKS parser
chematic-wasm       → JavaScript/TypeScript WASM bindings
chematic-iupac      → IUPAC name generation (25+ compound classes)
chematic-mcp        → MCP server for AI agents (14 tools)
chematic            → umbrella crate integrating everything

The trickiest part was the dependency graph. chematic-perception (ring perception) depends on chematic-core, but the kekulization code inside core needs ring perception results. To break the cycle, I put the kekulization interface in core and the implementation in perception. Not the cleanest design, but it works for now.
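The interface-in-one-crate, implementation-in-another split can be sketched in a single file. This is a minimal illustration, not chematic's actual code: the `Kekulizer` trait, `AlternatingKekulizer`, and `count_double_bonds` are hypothetical names, and the alternating logic is a toy stand-in for a real ring-aware implementation.

```rust
// ---- would live in chematic-core: data types plus the interface only ----
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BondOrder {
    Single,
    Double,
    Aromatic,
}

pub struct Molecule {
    /// (atom a, atom b, order)
    pub bonds: Vec<(usize, usize, BondOrder)>,
}

/// Hypothetical interface: core declares it but never names an implementation.
pub trait Kekulizer {
    fn kekulize(&self, mol: &mut Molecule) -> Result<(), String>;
}

/// Core-side code takes whichever implementation the caller passes in.
pub fn count_double_bonds(mol: &mut Molecule, k: &dyn Kekulizer) -> Result<usize, String> {
    k.kekulize(mol)?;
    Ok(mol.bonds.iter().filter(|b| b.2 == BondOrder::Double).count())
}

// ---- would live in chematic-perception, which depends on core ----
pub struct AlternatingKekulizer;

impl Kekulizer for AlternatingKekulizer {
    fn kekulize(&self, mol: &mut Molecule) -> Result<(), String> {
        // Toy stand-in: alternate double/single along the bond list. A real
        // implementation would consult ring perception results here.
        let mut make_double = true;
        for bond in mol.bonds.iter_mut() {
            if bond.2 == BondOrder::Aromatic {
                bond.2 = if make_double { BondOrder::Double } else { BondOrder::Single };
                make_double = !make_double;
            }
        }
        Ok(())
    }
}
```

Because the dependency only points from the implementing crate to the crate that owns the trait, the cycle disappears. The cost is that callers have to pass the implementation in.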

Getting stuck on the SMILES parser

"SMILES is a simple notation, so parsing it should be simple" — wrong. The OpenSMILES spec has enough ambiguity that for edge cases I ended up looking at RDKit's behavior and using that as the reference.

The first wall was branch handling. Writing a recursive-descent parser for deeply nested structures like C(CC(N)CC)(=O)O made stack management awkward. I rewrote it as an iterative implementation with an explicit stack.
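Here is a minimal sketch of the explicit-stack approach, assuming a tiny SMILES subset: single-letter uppercase atoms, the bond symbols `-`, `=`, `#`, and parentheses. There are no ring closures, bracket atoms, or two-letter elements. The function name `parse_branches` and the tuple-based output are illustrative, not chematic's API.

```rust
/// Parses a tiny SMILES subset and returns (atoms, bonds), where each bond is
/// (from atom index, to atom index, bond order).
fn parse_branches(smiles: &str) -> Result<(Vec<char>, Vec<(usize, usize, u8)>), String> {
    let mut atoms: Vec<char> = Vec::new();
    let mut bonds: Vec<(usize, usize, u8)> = Vec::new();
    // Explicit stack of attachment points, one entry per open '('.
    let mut stack: Vec<usize> = Vec::new();
    // Atom that the next atom will bond to.
    let mut prev: Option<usize> = None;
    // Bond order to use for the next bond; resets to single after each atom.
    let mut pending_order: u8 = 1;

    for (pos, ch) in smiles.chars().enumerate() {
        match ch {
            '(' => {
                let anchor = prev.ok_or_else(|| format!("'(' before any atom at {}", pos))?;
                stack.push(anchor);
            }
            ')' => {
                // Closing a branch: continue from the atom that opened it.
                let anchor = stack.pop().ok_or_else(|| format!("unmatched ')' at {}", pos))?;
                prev = Some(anchor);
            }
            '-' => pending_order = 1,
            '=' => pending_order = 2,
            '#' => pending_order = 3,
            c if c.is_ascii_uppercase() => {
                let idx = atoms.len();
                atoms.push(c);
                if let Some(p) = prev {
                    bonds.push((p, idx, pending_order));
                }
                prev = Some(idx);
                pending_order = 1;
            }
            other => return Err(format!("unsupported character '{}' at {}", other, pos)),
        }
    }
    if !stack.is_empty() {
        return Err("unclosed '('".to_string());
    }
    Ok((atoms, bonds))
}
```

Nesting depth only grows the `Vec`, not the call stack, and unbalanced parentheses become ordinary error values rather than something to unwind out of a recursion.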

Implicit hydrogens were also quietly painful. SMILES usually omits hydrogens and leaves the parser to compute them from valence. That calculation is more complex than it looks, because atom type, charge, and valence model all interact. When I ran ChEMBL molecules through it, mismatches kept showing up, and it took several rounds of fixes.
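The simplest case of that calculation, as a hedged sketch: neutral, non-aromatic atoms from the OpenSMILES organic subset, where the hydrogen count fills the atom up to the lowest default valence that accommodates its explicit bonds. Charges, bracket atoms, and aromatic atoms are deliberately left out, and those are exactly where the interactions mentioned above come in. The function name is illustrative.

```rust
/// Implicit hydrogen count for a neutral, non-aromatic organic-subset atom.
/// `bond_order_sum` is the sum of the orders of the atom's explicit bonds.
/// Returns None for elements outside the organic subset.
fn implicit_hydrogens(symbol: &str, bond_order_sum: u32) -> Option<u32> {
    // Default valences from the OpenSMILES organic subset.
    let valences: &[u32] = match symbol {
        "B" => &[3],
        "C" => &[4],
        "N" | "P" => &[3, 5],
        "O" => &[2],
        "S" => &[2, 4, 6],
        "F" | "Cl" | "Br" | "I" => &[1],
        _ => return None,
    };
    // Pick the lowest default valence that fits the explicit bonds. If the
    // bonds already exceed every default valence, there are no implicit H.
    let count = valences
        .iter()
        .find(|&&v| v >= bond_order_sum)
        .map_or(0, |&v| v - bond_order_sum);
    Some(count)
}
```

For example, a bare `C` gets four hydrogens, while a nitrogen with a bond order sum of 4 falls through to valence 5 and gets one.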

Getting stuck on kekulization

Kekulization is the conversion of aromatic SMILES (lowercase atoms like c1ccccc1) into explicit single/double bond form (C1=CC=CC=C1).

It's a graph maximum matching problem (bipartite only when every aromatic ring is even-membered), and the algorithm itself is well known. What I didn't anticipate was scale: for large fused ring systems like porphyrins and natural products, the matching search blows up exponentially. I only noticed this when I ran the full ChEMBL dataset and molecules over MW 5000 started timing out.

Atoms with only two adjacent aromatic bonds have a unique solution for which bond gets the double bond. Processing those first and reducing the graph before running the matching fixed the timeouts for large molecules. That fix went into v0.4.6.
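To make the "resolve the forced choices first" idea concrete, here is a small backtracking sketch. It is not chematic's implementation but a related heuristic: at each step it picks the unmatched atom with the fewest remaining candidate partners, so atoms with a single option are settled before any guessing happens. It assumes every atom in the input graph needs exactly one double bond (atoms that don't would be filtered out beforehand), and the names `kekulize` and `solve` are hypothetical. The worst case is still exponential.

```rust
/// Finds a set of double bonds covering every atom exactly once.
/// `adj[i]` lists the aromatic neighbors of atom i.
/// Returns pairs (i, j) with i < j, or None if no Kekulé structure exists.
fn kekulize(adj: &[Vec<usize>]) -> Option<Vec<(usize, usize)>> {
    let mut mate: Vec<Option<usize>> = vec![None; adj.len()];
    if !solve(adj, &mut mate) {
        return None;
    }
    let mut pairs = Vec::new();
    for (i, m) in mate.iter().enumerate() {
        if let Some(j) = *m {
            if i < j {
                pairs.push((i, j));
            }
        }
    }
    Some(pairs)
}

fn solve(adj: &[Vec<usize>], mate: &mut Vec<Option<usize>>) -> bool {
    // Find the unmatched atom with the fewest unmatched neighbors.
    let mut best: Option<(usize, usize)> = None; // (atom, free neighbor count)
    for a in 0..adj.len() {
        if mate[a].is_some() {
            continue;
        }
        let free = adj[a].iter().filter(|&&n| mate[n].is_none()).count();
        if free == 0 {
            return false; // dead end: this atom can no longer get a double bond
        }
        if best.map_or(true, |(_, c)| free < c) {
            best = Some((a, free));
        }
    }
    // No unmatched atoms left means the assignment is complete.
    let Some((a, _)) = best else { return true; };

    let candidates: Vec<usize> = adj[a]
        .iter()
        .copied()
        .filter(|&n| mate[n].is_none())
        .collect();
    for n in candidates {
        mate[a] = Some(n);
        mate[n] = Some(a);
        if solve(adj, mate) {
            return true;
        }
        // Undo and try the next candidate.
        mate[a] = None;
        mate[n] = None;
    }
    false
}
```

On a six-membered ring the first choice is a free guess, after which every remaining atom has only one candidate left and the rest of the assignment is forced.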

Fingerprint determinism

The thing that caught me in the ECFP (Morgan algorithm) implementation was atom traversal order.

Rust's HashMap randomizes its seed by default (HashDoS mitigation). So when traversing molecular graph atoms via a hash map, the order changes every run — same molecule, different bit vector. The symptom was "Tanimoto similarity varies between runs," and it took a while to track down.

I switched to sorting all atoms by atomic number, degree, charge, and isotope mass before computing, and replaced AHashMap with IndexMap, which iterates in insertion order. Setting the no-random-numbers constraint up front probably saved me from shipping a "works for now" implementation and hitting the reproducibility bug later.
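The sort-before-hashing part can be sketched with the standard library alone. This is an illustration under assumptions, not chematic's ECFP code: `AtomInvariant` and `fingerprint_bits` are hypothetical names, the hash is a hand-written FNV-1a with fixed constants, and `n_bits` is assumed to be non-zero. The input deliberately arrives in a randomly seeded `HashMap` to show that the output no longer depends on its iteration order.

```rust
use std::collections::{BTreeSet, HashMap};

const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// Sort key: derived Ord compares fields in declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct AtomInvariant {
    atomic_number: u8,
    degree: u8,
    charge: i8,
    isotope: u16,
}

/// Folds atom invariants into a running FNV-1a hash and records one bit per atom.
/// The running hash is order-sensitive, so atoms are sorted first.
fn fingerprint_bits(atoms: &HashMap<usize, AtomInvariant>, n_bits: u64) -> BTreeSet<u64> {
    let mut invariants: Vec<AtomInvariant> = atoms.values().copied().collect();
    invariants.sort(); // removes any dependence on HashMap iteration order

    let mut state = FNV_OFFSET;
    let mut bits = BTreeSet::new();
    for inv in &invariants {
        let iso = inv.isotope.to_le_bytes();
        for byte in [inv.atomic_number, inv.degree, inv.charge as u8, iso[0], iso[1]] {
            state ^= byte as u64;
            state = state.wrapping_mul(FNV_PRIME);
        }
        bits.insert(state % n_bits);
    }
    bits
}
```

Without the `sort()` call, two runs over the same molecule could feed the atoms into the running hash in different orders and set different bits, which is the "Tanimoto similarity varies between runs" symptom.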

Fighting the borrow checker in CIP

CIP (Cahn–Ingold–Prelog) rules determine R/S for stereocenters and E/Z for double bonds. Assigning priority to an atom requires the priorities of its neighbors, which requires their neighbors, and so on — a recursive dependency.

Understanding the algorithm wasn't the hard part. Getting it past the borrow checker was. Holding the graph as Rc<RefCell<Node>> caused other issues; I tried an Arena pattern too. Eventually I settled on a flat Vec<Atom> with AtomIdx (usize newtype) for indirection. No need to hold &mut and & simultaneously during traversal, so the borrow checker was happy. Took a few days, but once this pattern clicked, I used it for other algorithms too.
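A minimal sketch of the flat `Vec<Atom>` plus index pattern. The field names and the toy ranking rule (sum of neighbor atomic numbers) are assumptions for illustration only; this is not the CIP algorithm and not chematic's data model.

```rust
/// Index into `Molecule::atoms`. The newtype keeps atom indices from being
/// mixed up with other integers.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct AtomIdx(usize);

struct Atom {
    atomic_number: u8,
    neighbors: Vec<AtomIdx>,
    rank: u32,
}

struct Molecule {
    atoms: Vec<Atom>,
}

impl Molecule {
    /// Toy ranking: each atom's rank is the sum of its neighbors' atomic numbers.
    fn assign_neighbor_ranks(&mut self) {
        for i in 0..self.atoms.len() {
            // Read phase: only shared borrows of `self.atoms`, which end once
            // the sum has been computed into a plain integer.
            let rank: u32 = self.atoms[i]
                .neighbors
                .iter()
                .map(|n| self.atoms[n.0].atomic_number as u32)
                .sum();
            // Write phase: a short mutable borrow with no outstanding references.
            self.atoms[i].rank = rank;
        }
    }
}
```

Neighbors are stored as indices rather than references, so reading a neighbor and updating the current atom never need to overlap as borrows.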

ChEMBL full validation

Mid-development, unit tests would pass completely while the parser crashed on specific ChEMBL molecules.

The causes fell into roughly three groups: SSSR count mismatches on rare fused ring systems, hydrogen valence miscalculations triggered by non-standard SMILES, and the kekulization timeouts mentioned above. None of these were cases I would have thought to write unit tests for. They only surface when you run real data.

Eventually I parsed all 2,897,819 molecules from ChEMBL 37 without a failure. In the RDKit compatibility benchmark on 5,000 molecules, HBA (H-bond acceptor) count agreement reached 99.98%, and aromatic ring count agreement hit 95.6% — the remaining gap comes from differing treatment of fused N-heterocycles.

Issue #11: dropping InChI

On June 16, 2026, an external issue came in:

"100% errors on InChI generation vs IUPAC reference InChI"

Even benzene produced the wrong connectivity block. Four root causes were identified: InChI's own canonical numbering algorithm (different from Morgan ordering) wasn't implemented; spurious stereochemistry layers were being added to molecules without stereocenters; tautomer and mobile-hydrogen normalization was missing; and the InChIKey hash conversion diverged from the spec. These weren't isolated bugs but a design-level failure.

InChI's spec is effectively defined by the IUPAC C library implementation. Reimplementing that correctly from scratch in pure Rust, without following the C library, isn't realistic. I'd expected this moment when I banned FFI, so I deleted chematic-inchi and documented InChI/InChIKey as an unsupported limitation.

WASM binding design

I use wasm-bindgen to expose Rust functions to JavaScript. For complex return types, wasm-bindgen can't pass Rust types directly, so I serialize to JSON via serde_json and call JSON.parse() on the JS side.

const desc = JSON.parse(get_descriptors_json(mol));
console.log(`MW: ${desc.mw}, TPSA: ${desc.tpsa}, LogP: ${desc.logP}`);

Round-tripping through JSON isn't suited to high-frequency calls, but chemistry calculations don't typically need that, so it hasn't been a bottleneck. TypeScript type definitions are auto-generated by wasm-bindgen, so IDE autocomplete works.

Where Rust actually helped

Not all of it was painful.

BondOrder is an enum, so calling "count double bonds" on an un-kekulized aromatic molecule is a compile error. In C++ that silently returns a wrong value.
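One way an enum can surface this at compile time is exhaustive matching. The sketch below assumes a hypothetical `BondOrder` with an explicit `Aromatic` variant and a hypothetical `NotKekulized` error; chematic's real types may enforce this differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BondOrder {
    Single,
    Double,
    Triple,
    Aromatic,
}

#[derive(Debug, PartialEq, Eq)]
struct NotKekulized;

/// Counts double bonds, refusing to answer if any bond is still aromatic.
fn count_double_bonds(bonds: &[BondOrder]) -> Result<usize, NotKekulized> {
    let mut count = 0;
    for bond in bonds {
        match bond {
            BondOrder::Double => count += 1,
            BondOrder::Single | BondOrder::Triple => {}
            // Deleting this arm makes the match non-exhaustive, which the
            // compiler rejects, so the aromatic case cannot be forgotten.
            BondOrder::Aromatic => return Err(NotKekulized),
        }
    }
    Ok(count)
}
```

With integer-coded bond orders, a comparison like `order == 2` would quietly skip aromatic bonds instead.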

The 3D force field implementations (UFF, DREIDING, MMFF94) were also written without a single line of unsafe. Translating the math into safe Rust was more straightforward than I expected.

Current state

v0.4.6 (June 19, 2026) is the latest release. 1,991 tests pass across the workspace. The WASM binary is ~550 KB after wasm-opt, available as an npm package (@kent-tokyo/chematic).

Recent additions include pKa prediction, ADMET profiling, BOILED-Egg passive permeability classification, and an MCP server (14 tools) for AI agent integration. The live demo runs all of it client-side in WASM.

https://kent-tokyo.github.io/chematic/

https://github.com/kent-tokyo/chematic