Jaydeep Shah (JD)

Posted on Jun 16

The Invisible Layer Between My Prompt and the Model

#edgeai #android #litertlm

I typed a prompt into my app, hit send, and the model responded with incoherent fragments. No error. No crash. Just wrong output.

I checked the weights - fine. Checked the tokenizer - fine. Rebuilt the export pipeline twice. Still garbage. It took far longer than it should have (a hackathon does not give you two days) before I found the problem: a single configuration layer I had never thought about. The chat template.

What I did not realize about how models see conversations

When you call a chat API, you pass structured messages with roles:

[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is PII?"}
]

I assumed the model understood this structure natively. It does not. A large language model sees a flat stream of tokens - numbers - with no inherent notion of "system" or "user" or "assistant." The chat template is a Jinja2 template that takes your structured messages and renders them into the exact text format the model was trained on, special tokens and all.

If the template is wrong, the model does not complain. It treats the malformed input as an unfamiliar continuation and produces incoherent output. Silently.
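To make the flattening concrete, here is a minimal sketch in plain Python of what a template does to a message list. It is a hand-written stand-in for a Gemma-style format, not the real Jinja2 template: the function name render_gemma_style is hypothetical, and the real template handles more cases (system prompts, the beginning-of-sequence token, and so on).

```python
def render_gemma_style(messages, add_generation_prompt=True):
    """Flatten role-tagged messages into one Gemma-style prompt string.

    Illustrative only: this mimics the turn markers, not the full template.
    """
    parts = []
    for message in messages:
        # Gemma's format labels assistant turns "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so the next tokens generated are the reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is PII?"}]
prompt = render_gemma_style(messages)
print(prompt)
```

The structured list is gone by the time the tokenizer sees anything. All that remains is one string whose markers tell the model where each turn starts and ends.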

The format problem I did not expect

There is no universal standard. Each model family was trained with its own special tokens. Here are three I kept running into:

Gemma (Google):

<start_of_turn>user
What is PII?<end_of_turn>
<start_of_turn>model
PII stands for Personally Identifiable Information.<end_of_turn>

Llama 3+ (Meta):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is PII?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Mistral 7B Instruct:

[INST] What is PII? [/INST]

These are not cosmetic differences. The model's attention patterns were shaped during training by these exact token boundaries. Swap Gemma's <start_of_turn> for Llama's <|start_header_id|> and the output degrades - sometimes subtly, sometimes catastrophically.
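One cheap guard against this kind of mix-up is to check a rendered prompt for the markers you expect before it reaches the model. The sketch below is a heuristic under stated assumptions: the marker lists cover only the three formats shown above, and the names FAMILY_MARKERS, missing_markers, and foreign_markers are hypothetical.

```python
# Marker strings taken from the three example formats above; not exhaustive.
FAMILY_MARKERS = {
    "gemma": ["<start_of_turn>", "<end_of_turn>"],
    "llama3": ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>"],
    "mistral": ["[INST]", "[/INST]"],
}


def missing_markers(prompt, expected_family):
    """Markers of the expected family that do not appear in the prompt."""
    return [m for m in FAMILY_MARKERS[expected_family] if m not in prompt]


def foreign_markers(prompt, expected_family):
    """Markers of other families that do appear in the prompt."""
    found = []
    for family, markers in FAMILY_MARKERS.items():
        if family == expected_family:
            continue
        found.extend(m for m in markers if m in prompt)
    return found


# A Llama-style prompt about to be sent to a Gemma model.
llama_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is PII?<|eot_id|>"
print("missing:", missing_markers(llama_prompt, "gemma"))
print("foreign:", foreign_markers(llama_prompt, "gemma"))
```

A non-empty result from either function is a hint that the wrong template was applied. An empty result proves nothing beyond the markers being present.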

Where the template actually lives

The chat template is stored as a Jinja2 string inside tokenizer_config.json in the model's HuggingFace repository, under the chat_template key. When you call tokenizer.apply_chat_template(), the Transformers library reads this string and renders your messages through it.
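Reading the template needs nothing beyond the standard library. A minimal sketch, assuming chat_template is stored as a plain string under that key (the case described above); the helper name load_chat_template is hypothetical, and the demo runs against a throwaway file rather than a real model repository.

```python
import json
import tempfile
from pathlib import Path


def load_chat_template(config_path):
    """Return the chat_template string from a tokenizer_config.json file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    template = config.get("chat_template")
    if template is None:
        raise KeyError(f"no chat_template key in {config_path}")
    if not isinstance(template, str):
        # Assumption: this sketch only handles the plain-string case.
        raise TypeError("chat_template is not a plain string")
    return template


# Demo with a throwaway config; point the function at a real file instead.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "tokenizer_config.json"
    sample_path.write_text(
        json.dumps({"chat_template": "{{ messages[0]['content'] }}"}),
        encoding="utf-8",
    )
    template = load_chat_template(sample_path)

print(template)
```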

For on-device deployment with LiteRT-LM, the template gets baked into the .litertlm file at export time. Once exported, you cannot change it. If the template is wrong, you go back to the source weights, fix the template in tokenizer_config.json, and re-export. There is no runtime patching. There is no config flag.

The trap we hit on-device

This is where we got burned building Redacto.

LiteRT-LM includes a lightweight template parser for rendering chat templates on-device. It is not a full Jinja2 engine - it supports a subset of Jinja2 syntax. When we fine-tuned Gemma 4 E2B and exported it to .litertlm, the export pipeline bundled the chat template from google/gemma-4-E2B-it. That template calls .get() on a map - the Python dict method, which Jinja2 exposes inside templates, for reading a key with a default value. Perfectly valid Jinja2. Works in Python.

On-device, the LiteRT-LM parser produced:

Failed to apply template: unknown method: map has no method named get
(in template:238)

The model loaded. The tokenizer loaded. The weights were fine. But inference failed because the template could not be rendered. And this error does not appear until runtime on the device - there is no validation step during export.

How we fixed it

The solution was non-obvious. We downloaded tokenizer_config.json from google/gemma-3-1b-it - a template version that LiteRT-LM's on-device parser supports. We patched our fine-tuned model's config with the compatible chat_template value, removed the standalone chat_template.jinja file (which the export tool reads if present, overriding the JSON config), and re-exported.
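The patch step can be sketched in a few lines of Python. This is an illustration of the steps just described, not the exact script we ran: the function name patch_chat_template and the directory layout in the demo are hypothetical, and the demo uses placeholder strings instead of real templates.

```python
import json
import tempfile
from pathlib import Path


def patch_chat_template(model_dir, donor_config_path):
    """Copy chat_template from a donor config into model_dir's config.

    Also deletes a standalone chat_template.jinja if one exists, since it
    would otherwise take precedence at export. Returns True if it was removed.
    """
    model_dir = Path(model_dir)
    target_path = model_dir / "tokenizer_config.json"
    donor = json.loads(Path(donor_config_path).read_text(encoding="utf-8"))
    target = json.loads(target_path.read_text(encoding="utf-8"))

    # Only the template changes; every other key in the target is preserved.
    target["chat_template"] = donor["chat_template"]
    target_path.write_text(
        json.dumps(target, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    standalone = model_dir / "chat_template.jinja"
    removed = standalone.exists()
    if removed:
        standalone.unlink()
    return removed


# Demo on throwaway files with placeholder template strings.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "finetuned"
    model_dir.mkdir()
    (model_dir / "tokenizer_config.json").write_text(
        json.dumps({"chat_template": "OLD", "model_max_length": 8192}),
        encoding="utf-8",
    )
    (model_dir / "chat_template.jinja").write_text("OLD", encoding="utf-8")
    donor_path = Path(tmp) / "donor_tokenizer_config.json"
    donor_path.write_text(json.dumps({"chat_template": "NEW"}), encoding="utf-8")

    removed_standalone = patch_chat_template(model_dir, donor_path)
    patched = json.loads(
        (model_dir / "tokenizer_config.json").read_text(encoding="utf-8")
    )
    standalone_still_exists = (model_dir / "chat_template.jinja").exists()

print(patched["chat_template"], removed_standalone)
```

Work on a copy of the model directory. The function overwrites tokenizer_config.json in place and deletes the standalone file without a backup.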

To be clear: this is not a LiteRT oversight. A lightweight on-device Jinja parser is the right call - you do not want a full Python runtime embedded in a mobile inference engine. And Gemma 4's template grew more complex specifically to support tool-calling and multimodal turns, which is also the right call for the model's capabilities. The two simply evolved independently. Under hackathon time pressure, we landed at the intersection, and it was on us to piece the cause together from the error message and the template source.

What I check now before every export

1. Open tokenizer_config.json and read the chat_template field.
2. Check whether it uses Jinja features beyond basic control flow (if, for, variable substitution).
3. If it uses methods like .get(), .items(), or complex filters, test against your target runtime's parser - or swap it for a known-compatible template from the same model family.
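The second and third checks can be partly automated. Below is a rough heuristic, assuming regular expressions are good enough for a first pass: it lists method calls and filters found anywhere in the template text, so it can over-report, and it knows nothing about what any particular runtime actually supports. The name find_risky_features and the sample template are hypothetical.

```python
import re

# ".name(" anywhere in the text, e.g. message.get('role', 'user')
METHOD_CALL = re.compile(r"\.([A-Za-z_]\w*)\s*\(")
# "| name" anywhere in the text, e.g. message['content'] | trim
FILTER = re.compile(r"\|\s*([A-Za-z_]\w*)")


def find_risky_features(template):
    """Return (methods, filters) named in the template, sorted and deduplicated."""
    methods = sorted(set(METHOD_CALL.findall(template)))
    filters = sorted(set(FILTER.findall(template)))
    return methods, filters


# Made-up template fragment for illustration; not from any real model.
sample = (
    "{% for message in messages %}"
    "{{ message.get('role', 'user') }}: {{ message['content'] | trim }}\n"
    "{% endfor %}"
)
methods, filters = find_risky_features(sample)
print("methods:", methods)
print("filters:", filters)
```

Anything this reports is a prompt to test on the target runtime before export, not a verdict. A clean report does not guarantee the template will render on-device.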

The chat template is invisible infrastructure. When it works, you never think about it. When it breaks, the failure mode is silence - the model runs, produces tokens, and those tokens are wrong. Or in the on-device case, inference fails at runtime with a cryptic parser error pointing to a line inside a template you never wrote.

Related in this series

I Opened a .litertlm File. Here Is What Is Actually in There. - the bundle where the chat template gets embedded at export time
The Chat Template Trap - the full debugging story when our on-device parser hit the unsupported Jinja feature
Prompt Engineering on a 2B Model - how conversation formatting interacts with prompt design on small models

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.


Last updated: June 2026
7th of 22 posts in the "Edge AI from the Trenches" series
