You download a 2.59 GB file called gemma-4-E2B-it.litertlm. You push it to your phone. An LLM starts running - no cloud, no API key. But what is that file?
I had assumed "model file equals weights." That is true for .safetensors, which stores raw weight matrices and nothing else. When I started digging into what a .litertlm file actually contains, I found something fundamentally different.
What I found inside
Based on export tool behavior, runtime error messages, and what breaks when components are missing, a .litertlm file contains:
- Quantized weights - the model's parameters, compressed to INT4 or similar precision to fit in mobile RAM
- Execution graph - the compiled computational graph that defines how tensors flow through the model's layers
- Tokenizer - the vocabulary and encoding rules that convert text to token IDs and back
- LLM metadata - serialized configuration for token roles, KV-cache sizing, and context management
- Chat template - the formatting rules that wrap user messages into the turn structure the model expects (e.g., <start_of_turn>user\n...<end_of_turn>)
- Vision encoder (optional) - for multimodal models like Gemma 4 E2B, the image-processing sub-network
- Audio encoder (optional) - for models that handle audio input
This is not a "model file" in the traditional sense. It is closer to an APK for inference.
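To make the chat template component concrete, here is a minimal Python sketch of the wrapping it performs. The user-turn markers come from the example above; the trailing model-turn opener and both function names are my assumptions for illustration, not something read out of the bundle. On device, the runtime applies the embedded template for you.

```python
def wrap_user_turn(message: str) -> str:
    """Wrap one user message in the turn markers quoted in the list above."""
    return f"<start_of_turn>user\n{message}<end_of_turn>"


def build_prompt(message: str) -> str:
    """Return the wrapped user turn followed by an opener for the model's reply.

    Assumption: the reply is opened with a "model" turn marker. Treat this as
    illustrative; the authoritative template is the one embedded at export time.
    """
    return wrap_user_turn(message) + "\n<start_of_turn>model\n"


if __name__ == "__main__":
    print(build_prompt("Redact the phone numbers in this text."))
```

If the embedded template disagrees with what the model was trained on, the model still runs, but it sees turns in a shape it does not expect.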
How it compares to other model formats
| Format | What It Contains | Primary Use Case | LLM Support | Hardware Targeting |
|---|---|---|---|---|
| .litertlm | Weights + tokenizer + chat template + execution graph + optional encoders | On-device LLM inference via LiteRT-LM | Yes - built for LLMs | Device-specific (separate GPU/NPU builds) |
| .tflite | Weights + execution graph | Classical ML models (image classification, object detection) | No | Delegate-based (CPU/GPU/NPU at runtime) |
| .gguf | Weights + tokenizer + metadata | LLM inference via llama.cpp | Yes | CPU-first, with GPU offloading (Metal, CUDA, Vulkan) |
| .onnx | Weights + execution graph | Cross-runtime model interchange | Yes, via ONNX Runtime | Runtime-dependent |
| .safetensors | Raw weights only | Storage and transfer of model parameters | No runtime | None - just data |
.litertlm is purpose-built for one job: running an LLM on a mobile device through the LiteRT-LM runtime. The other formats each serve different roles - .tflite for classical ML, .gguf for CPU-first cross-platform LLM inference, .onnx for runtime interoperability, .safetensors for raw weight storage.
Why GPU and NPU need separate files
This is the detail that tripped me up the hardest when building Redacto.
We ship two .litertlm files for the same model:
- gemma-4-E2B-it.litertlm (2.59 GB) - works on CPU and GPU
- gemma-4-E2B-it_qualcomm_sm8750.litertlm (3.02 GB) - works ONLY on the Hexagon V79 NPU
They are not the same file with different labels. The NPU variant was compiled through Qualcomm's QNN toolchain, and its execution graph contains DISPATCH_OP custom ops - pre-compiled subgraphs that run directly on the Hexagon DSP. The DISPATCH_OP is registered at runtime by libLiteRtDispatch_Qualcomm.so; without that library and the matching hardware, the ops have no implementation.
If you try to load the NPU model on a CPU or GPU backend:
E tflite : Encountered unresolved custom op: DISPATCH_OP.
E tflite : Node number 0 (DISPATCH_OP) failed to prepare.
This is not a compatibility issue you can work around: the operations have no CPU or GPU implementations. An earlier version of our app tried to fall back from NPU to GPU using the same NPU model file. It failed immediately.
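The fix is to treat the backend and the model file as a pair, so that a fallback switches both. A minimal sketch in Python, using the two file names from this post; the `Backend` enum, the mapping, and `fallback_plan` are hypothetical helpers for illustration, not part of any LiteRT-LM API:

```python
from enum import Enum


class Backend(Enum):
    CPU = "cpu"
    GPU = "gpu"
    NPU = "npu"


# File names are the ones shipped in this post; the mapping is illustrative.
MODEL_FOR_BACKEND = {
    Backend.CPU: "gemma-4-E2B-it.litertlm",
    Backend.GPU: "gemma-4-E2B-it.litertlm",
    Backend.NPU: "gemma-4-E2B-it_qualcomm_sm8750.litertlm",
}

# Assumed preference order: fastest accelerator first, CPU last.
FALLBACK_ORDER = [Backend.NPU, Backend.GPU, Backend.CPU]


def fallback_plan(preferred: Backend) -> list[tuple[Backend, str]]:
    """Return (backend, model file) pairs to try, starting at `preferred`.

    Each step carries its own file, so falling back from NPU to GPU never
    reuses the NPU-only bundle.
    """
    start = FALLBACK_ORDER.index(preferred)
    return [(b, MODEL_FOR_BACKEND[b]) for b in FALLBACK_ORDER[start:]]


if __name__ == "__main__":
    for backend, model_file in fallback_plan(Backend.NPU):
        print(f"try {backend.value}: {model_file}")
```

The caller walks the plan in order and loads the paired file at each step, which means both files have to be present on the device if NPU-to-GPU fallback is a requirement.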
PS: I will cover how GPU and NPU graphs and subgraphs differ, and how they are compiled, in a separate post.
You cannot go back
A .litertlm file is the END of the pipeline, not a waypoint.
You cannot:
- Fine-tune it - the weights are quantized and compiled. The training-time representation is gone.
- Change its quantization - INT4 is baked in at export time. You cannot re-quantize to INT8 from the .litertlm.
- Swap its chat template - the template is embedded during export. If it is wrong, you re-export from scratch.
- Add or remove the vision encoder - the model's modality support is fixed at compile time.
The tool that produces these files is litert-torch export_hf, and every decision is made at export time:
litert-torch export_hf \
--model=google/gemma-4-E2B-it \
--output_dir=./exported_model \
--externalize_embedder \
--quantization_recipe=dynamic_wi4_afp32
Prefill chunk size, KV cache length, quantization scheme - all locked in. The resulting .litertlm is immutable.
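Because nothing can be changed afterwards, it helps to keep the export settings in one place and write them down next to the output. A minimal Python sketch that rebuilds the command above from a config object; it uses only the flags shown in this post, and `ExportConfig` and `build_export_command` are hypothetical names of my own:

```python
import json
import shlex
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExportConfig:
    """Export-time decisions, limited to the flags shown in this post."""
    model: str
    output_dir: str
    quantization_recipe: str
    externalize_embedder: bool = True


def build_export_command(cfg: ExportConfig) -> list[str]:
    """Build the argv list for the export command without running it."""
    cmd = [
        "litert-torch", "export_hf",
        f"--model={cfg.model}",
        f"--output_dir={cfg.output_dir}",
    ]
    if cfg.externalize_embedder:
        cmd.append("--externalize_embedder")
    cmd.append(f"--quantization_recipe={cfg.quantization_recipe}")
    return cmd


if __name__ == "__main__":
    cfg = ExportConfig(
        model="google/gemma-4-E2B-it",
        output_dir="./exported_model",
        quantization_recipe="dynamic_wi4_afp32",
    )
    # Print the shell command and a JSON record to store beside the bundle.
    print(shlex.join(build_export_command(cfg)))
    print(json.dumps(asdict(cfg), indent=2))
```

Keeping the JSON record beside the bundle answers "how was this file exported?" later, which the file itself will not tell you.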
Three files, three lessons
During the hackathon, we ended up with three .litertlm files on the device:
| File | Size | Works On | Vision | Notes |
|---|---|---|---|---|
| gemma-4-E2B-it.litertlm | 2.59 GB | CPU, GPU | Yes | Standard litert-community export |
| gemma-4-E2B-it_qualcomm_sm8750.litertlm | 3.02 GB | NPU only | Yes | QNN-compiled with DISPATCH_OP for Hexagon V79 |
| gemma4_ft.litertlm | 4.7 GB | CPU, GPU | No | Custom fine-tuned, different quantization granularity |
The fine-tuned model was nearly double the size, not because it had more parameters but because of different quantization settings at export. It was also missing the vision encoder entirely (TF_LITE_VISION_ENCODER not found), so no image redaction. That was not a bug in our fine-tuning; it was a consequence of how the export was configured.
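To see how quantization settings alone can move file size, here is a rough back-of-the-envelope sketch in Python. The parameter count, block sizes, and scale width are hypothetical values I picked for illustration; I have not confirmed which settings produced our 4.7 GB file, and a real bundle also holds the tokenizer, metadata, and any encoders.

```python
def effective_bits(weight_bits: int, scale_bits: int, block_size: int) -> float:
    """Bits per weight once one scale per block of weights is included."""
    return weight_bits + scale_bits / block_size


def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Storage for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


# Hypothetical parameter count for illustration, NOT a published figure.
PARAMS = 5_000_000_000

# Hypothetical settings: same parameter count, different bits and granularity.
SETTINGS = [
    ("4-bit weights, one 32-bit scale per 128 weights", effective_bits(4, 32, 128)),
    ("4-bit weights, one 32-bit scale per 32 weights", effective_bits(4, 32, 32)),
    ("8-bit weights, one 32-bit scale per 128 weights", effective_bits(8, 32, 128)),
]

if __name__ == "__main__":
    for label, bits in SETTINGS:
        size_gb = weight_bytes(PARAMS, bits) / 1e9
        print(f"{label}: {bits:.2f} bits/weight, {size_gb:.2f} GB")
    # Ratio of the two CPU/GPU files from the table above.
    print(f"observed size ratio: {4.7 / 2.59:.2f}x")
```

The same parameter count gives quite different sizes depending on bit width and how many weights share a scale, which is why the export settings are the first thing to compare when two bundles of the same model differ in size.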
The lesson: what goes in at export time is what you get. No runtime inspection, no patching, no "just swap out the tokenizer." If something is wrong, you go back to the source weights and re-export.
The mental model
Think of it this way:
- .safetensors is source code
- .litertlm is a compiled binary
You do not debug a compiled binary by editing its bytes. You go back to the source, fix it, and recompile. Everything upstream - fine-tuning, quantization, chat template, modality support - must be correct before the .litertlm is produced. After that, it is sealed.
Related in this series of "Edge AI from the Trenches"
- What I Learned Turning a HuggingFace Model Into Something My Phone Can Run - the full pipeline that produces the .litertlm bundle
- What Is a Chat Template and Why Does It Matter? - the template component baked into every .litertlm file
Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.
Sources:
- LiteRT-LM Overview
- litert-community/gemma-4-E2B-it-litert-lm - pre-compiled .litertlm files
- litert-torch on PyPI - the export tool
- GGUF format specification
Last updated: June 2026
6th of 22 posts in the "Edge AI from the Trenches" series
