I opened a .litertlm file. Here is what is actually in there

#edgeai #android #litertlm

You download a 2.59 GB file called gemma-4-E2B-it.litertlm. You push it to your phone. An LLM starts running - no cloud, no API key. But what is that file?

I had assumed that a model file equals weights. That is roughly true for .safetensors, which stores raw tensors behind a small JSON header and nothing executable. When I started digging into what a .litertlm file actually contains, I found something fundamentally different.
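To make the contrast concrete, here is a minimal sketch of how little structure a .safetensors file has: an 8-byte little-endian length, a JSON header describing each tensor, then raw bytes. The helper names and the toy 2x2 tensor are made up for illustration; the sketch uses only the Python standard library rather than the safetensors package.

```python
import json
import struct


def read_safetensors_header(blob: bytes) -> dict:
    """Parse the JSON header at the front of a .safetensors payload."""
    # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len].decode("utf-8"))


def build_tiny_safetensors() -> bytes:
    """Build a toy in-memory payload with one 2x2 float32 tensor (illustrative only)."""
    data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
    header = {"w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, len(data)]}}
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


header = read_safetensors_header(build_tiny_safetensors())
for name, info in header.items():
    # Each entry is only a name, a dtype, a shape, and byte offsets into the data.
    print(name, info["dtype"], info["shape"])
```

Nothing in that header says how to run the tensors, how to tokenize text, or how to format a prompt. That gap is what the rest of this post is about.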

What I found inside

Based on export tool behavior, runtime error messages, and what breaks when components are missing, a .litertlm file contains:

Quantized weights - the model's parameters, compressed to INT4 or similar precision to fit in mobile RAM
Execution graph - the compiled computational graph that defines how tensors flow through the model's layers
Tokenizer - the vocabulary and encoding rules that convert text to token IDs and back
LLM metadata - serialized configuration for token roles, KV-cache sizing, and context management
Chat template - the formatting rules that wrap user messages into the turn structure the model expects (e.g., <start_of_turn>user\n...<end_of_turn>)
Vision encoder (optional) - for multimodal models like Gemma 4 E2B, the image-processing sub-network
Audio encoder (optional) - for models that handle audio input

It is not a "model file" in the traditional sense. It is closer to an APK for inference: one package that carries everything the runtime needs.
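Of the components listed above, the chat template is the easiest to picture in code. The sketch below wraps a single user message using the turn markers quoted in the list. The function name is hypothetical, and the trailing model-turn prompt is my assumption based on the Gemma convention; the template embedded in a given bundle may differ.

```python
def format_turn(user_message: str) -> str:
    """Wrap one user message in Gemma-style turn markers (hypothetical helper)."""
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        # Assumption: the prompt ends by opening the model's turn.
        "<start_of_turn>model\n"
    )


print(format_turn("Redact the phone number in this message."))
```

With a .litertlm bundle the app does not write this function. The runtime applies the embedded template, which is why a wrong template cannot be fixed from application code.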

How it compares to other model formats

Format	What It Contains	Primary Use Case	LLM Support	Hardware Targeting
`.litertlm`	Weights + tokenizer + chat template + execution graph + optional encoders	On-device LLM inference via LiteRT-LM	Yes - built for LLMs	Device-specific (separate GPU/NPU builds)
`.tflite`	Weights + execution graph	Classical ML models (image classification, object detection)	Not on its own (no tokenizer or chat template)	Delegate-based (CPU/GPU/NPU at runtime)
`.gguf`	Weights + tokenizer + metadata	LLM inference via llama.cpp	Yes	CPU-first, with GPU offloading (Metal, CUDA, Vulkan)
`.onnx`	Weights + execution graph	Cross-runtime model interchange	Yes, via ONNX Runtime	Runtime-dependent
`.safetensors`	Raw weights only	Storage and transfer of model parameters	No runtime	None - just data

.litertlm is purpose-built for one job: running an LLM on a mobile device through the LiteRT-LM runtime. The other formats each serve different roles - .tflite for classical ML, .gguf for CPU-first cross-platform LLM inference, .onnx for runtime interoperability, .safetensors for raw weight storage.

Why GPU and NPU need separate files

This is the detail that tripped me up the most when building Redacto.

We ship two .litertlm files for the same model:

gemma-4-E2B-it.litertlm (2.59 GB) - works on CPU and GPU
gemma-4-E2B-it_qualcomm_sm8750.litertlm (3.02 GB) - works ONLY on the Hexagon V79 NPU

They are not the same file with different labels. The NPU variant was compiled through Qualcomm's QNN toolchain, and its execution graph contains DISPATCH_OP custom ops: pre-compiled subgraphs that run directly on the Hexagon NPU. DISPATCH_OP is registered at runtime by libLiteRtDispatch_Qualcomm.so; without that library and the matching hardware, the ops have no implementation.

If you try to load the NPU model on a CPU or GPU backend:

E tflite : Encountered unresolved custom op: DISPATCH_OP.
E tflite : Node number 0 (DISPATCH_OP) failed to prepare.

This is not a compatibility issue you can work around: the operations have no CPU or GPU implementations. An earlier version of our app tried to fall back from NPU to GPU using the same NPU model file. It failed immediately.
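The fix is to treat the backend and the model file as a pair, so that a fallback switches both together. Here is a minimal sketch of that idea in Python. The file names are the two from this post; the enum, the helper names, and the NPU-then-GPU-then-CPU order are illustrative assumptions, not LiteRT-LM API or our app's actual code.

```python
from enum import Enum


class Backend(Enum):
    CPU = "cpu"
    GPU = "gpu"
    NPU = "npu"


# File names come from this post; the mapping itself is an illustrative assumption.
GENERIC_MODEL = "gemma-4-E2B-it.litertlm"
NPU_MODEL = "gemma-4-E2B-it_qualcomm_sm8750.litertlm"


def model_file_for(backend: Backend) -> str:
    """Return the bundle that matches the backend; the NPU build is never reused elsewhere."""
    return NPU_MODEL if backend is Backend.NPU else GENERIC_MODEL


def fallback_chain(preferred: Backend) -> list[tuple[Backend, str]]:
    """Ordered (backend, file) attempts: each fallback step carries its own file."""
    order = [Backend.NPU, Backend.GPU, Backend.CPU]
    start = order.index(preferred)
    return [(b, model_file_for(b)) for b in order[start:]]


for backend, path in fallback_chain(Backend.NPU):
    print(backend.value, "->", path)
```

The point of the design is that no code path can pass the NPU file to a GPU or CPU backend, which is exactly the mistake that produced the DISPATCH_OP error.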

PS: A separate post will dig into how GPU and NPU graphs and subgraphs differ and how they are compiled.

You cannot go back

A .litertlm file is the END of the pipeline, not a waypoint.

You cannot:

Fine-tune it - the weights are quantized and compiled. The training-time representation is gone.
Change its quantization - INT4 is baked in at export time. You cannot re-quantize to INT8 from the .litertlm.
Swap its chat template - the template is embedded during export. If it is wrong, you re-export from scratch.
Add or remove the vision encoder - the model's modality support is fixed at compile time.

The tool that produces these files is litert-torch export_hf, and every decision is made at export time:

litert-torch export_hf \
  --model=google/gemma-4-E2B-it \
  --output_dir=./exported_model \
  --externalize_embedder \
  --quantization_recipe=dynamic_wi4_afp32

Prefill chunk size, KV cache length, quantization scheme - all locked in. The resulting .litertlm is immutable.
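Because nothing can be changed afterwards, it helps to keep the export settings as a record next to the bundle. The sketch below rebuilds the command shown above from a frozen dataclass. The flags are copied from that command as-is; the ExportConfig class and export_command helper are hypothetical names, and the sketch only prints the command rather than running it.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportConfig:
    """Export-time decisions; frozen to mirror the fact that they cannot change later."""
    model: str
    output_dir: str
    quantization_recipe: str
    externalize_embedder: bool = True


def export_command(cfg: ExportConfig) -> list[str]:
    """Build the argv for the export command (flags copied from this post)."""
    argv = [
        "litert-torch", "export_hf",
        f"--model={cfg.model}",
        f"--output_dir={cfg.output_dir}",
    ]
    if cfg.externalize_embedder:
        argv.append("--externalize_embedder")
    argv.append(f"--quantization_recipe={cfg.quantization_recipe}")
    return argv


cfg = ExportConfig(
    model="google/gemma-4-E2B-it",
    output_dir="./exported_model",
    quantization_recipe="dynamic_wi4_afp32",
)
# Print the command so it can be logged or saved alongside the exported file.
print(shlex.join(export_command(cfg)))
```

Saving that one line with each .litertlm makes it much easier to explain later why two bundles of the same model behave differently.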

Three files, three lessons

During the hackathon, we ended up with three .litertlm files on the device:

File	Size	Works On	Vision	Notes
`gemma-4-E2B-it.litertlm`	2.59 GB	CPU, GPU	Yes	Standard litert-community export
`gemma-4-E2B-it_qualcomm_sm8750.litertlm`	3.02 GB	NPU only	Yes	QNN-compiled with DISPATCH_OP for Hexagon V79
`gemma4_ft.litertlm`	4.7 GB	CPU, GPU	No	Custom fine-tuned, different quantization granularity

The fine-tuned model was about 1.8 times the size of the standard export (4.7 GB versus 2.59 GB). It does not have more parameters; it was exported with different quantization settings. It was also missing the vision encoder entirely (TF_LITE_VISION_ENCODER not found), so image redaction was unavailable. That was not a bug in our fine-tuning but a consequence of how the export was configured.

The lesson: what goes in at export time is what you get. No runtime inspection, no patching, no "just swap out the tokenizer." If something is wrong, you go back to the source weights and re-export.

The mental model

Think of it this way:

.safetensors is source code
.litertlm is a compiled binary

You do not debug a compiled binary by editing its bytes. You go back to the source, fix it, and recompile. Everything upstream - fine-tuning, quantization, chat template, modality support - must be correct before the .litertlm is produced. After that, it is sealed.

Related posts in the "Edge AI from the Trenches" series

What I Learned Turning a HuggingFace Model Into Something My Phone Can Run - the full pipeline that produces the .litertlm bundle
What Is a Chat Template and Why Does It Matter? - the template component baked into every .litertlm file

Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.
