Jaydeep Shah (JD)

I opened a .litertlm file. Here is what is actually in there

You download a 2.59 GB file called gemma-4-E2B-it.litertlm. You push it to your phone. An LLM starts running - no cloud, no API key. But what is that file?

I had assumed "model file equals weights." That is true for .safetensors, which stores raw weight tensors plus a small JSON header describing them, and nothing else. When I started digging into what a .litertlm file actually contains, I found something fundamentally different.
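To make the contrast concrete, here is a minimal sketch of how little structure a .safetensors file has: an 8-byte little-endian header length, a JSON header listing each tensor's dtype, shape, and byte offsets, then the raw tensor bytes. The tensor name "w" and its values are made up for illustration; the blob is built in memory so nothing has to be downloaded.

```python
import json
import struct


def read_safetensors_header(blob: bytes) -> dict:
    """Return the JSON header of a safetensors blob."""
    # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len])


# Build a tiny synthetic file in memory: one float32 tensor with two values.
# The tensor name "w" is a placeholder, not a real model weight.
payload = struct.pack("<2f", 1.0, 2.0)
header = json.dumps(
    {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, len(payload)]}}
).encode("utf-8")
blob = struct.pack("<Q", len(header)) + header + payload

parsed = read_safetensors_header(blob)
for name, info in parsed.items():
    print(name, info["dtype"], info["shape"], info["data_offsets"])
```

There is no graph, tokenizer, or template anywhere in that layout, which is exactly why a .safetensors file cannot run on its own.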

What I found inside

[Figure: What is inside a .litertlm file - a sealed bundle containing weights, execution graph, tokenizer, metadata, chat template, and optional encoders]

Based on export tool behavior, runtime error messages, and what breaks when components are missing, a .litertlm file contains:

  • Quantized weights - the model's parameters, compressed to INT4 or similar precision to fit in mobile RAM
  • Execution graph - the compiled computational graph that defines how tensors flow through the model's layers
  • Tokenizer - the vocabulary and encoding rules that convert text to token IDs and back
  • LLM metadata - serialized configuration for token roles, KV-cache sizing, and context management
  • Chat template - the formatting rules that wrap user messages into the turn structure the model expects (e.g., <start_of_turn>user\n...<end_of_turn>)
  • Vision encoder (optional) - for multimodal models like Gemma 4 E2B, the image-processing sub-network
  • Audio encoder (optional) - for models that handle audio input
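As an illustration of what the chat template component does, here is a minimal sketch that wraps messages in the turn markers quoted above. The "model" role name and the trailing open turn are assumptions carried over from earlier Gemma releases; the template actually embedded in a given .litertlm file may differ, and format_turns is a hypothetical helper, not a LiteRT-LM API.

```python
def format_turns(messages: list[dict[str, str]]) -> str:
    """Wrap chat messages in <start_of_turn>/<end_of_turn> markers."""
    parts = []
    for message in messages:
        parts.append(
            f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n"
        )
    # Assumption: leave a model turn open so generation continues as the model.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = format_turns(
    [{"role": "user", "content": "Redact the names in this note."}]
)
print(prompt)
```

With a .litertlm file this formatting happens inside the bundle, so the app sends plain messages and never builds the prompt string itself.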

It is not a "model file" in the traditional sense. It is closer to an APK for inference.

How it compares to other model formats

| Format | What It Contains | Primary Use Case | LLM Support | Hardware Targeting |
|---|---|---|---|---|
| .litertlm | Weights + tokenizer + chat template + execution graph + optional encoders | On-device LLM inference via LiteRT-LM | Yes - built for LLMs | Device-specific (separate GPU/NPU builds) |
| .tflite | Weights + execution graph | Classical ML models (image classification, object detection) | No | Delegate-based (CPU/GPU/NPU at runtime) |
| .gguf | Weights + tokenizer + metadata | LLM inference via llama.cpp | Yes | CPU-first, with GPU offloading (Metal, CUDA, Vulkan) |
| .onnx | Weights + execution graph | Cross-runtime model interchange | Yes, via ONNX Runtime | Runtime-dependent |
| .safetensors | Raw weights only | Storage and transfer of model parameters | No runtime | None - just data |

.litertlm is purpose-built for one job: running an LLM on a mobile device through the LiteRT-LM runtime. The other formats each serve different roles - .tflite for classical ML, .gguf for CPU-first cross-platform LLM inference, .onnx for runtime interoperability, .safetensors for raw weight storage.

Why GPU and NPU need separate files

This is the detail that tripped me up the hardest when building Redacto.

We ship two .litertlm files for the same model:

  • gemma-4-E2B-it.litertlm (2.59 GB) - works on CPU and GPU
  • gemma-4-E2B-it_qualcomm_sm8750.litertlm (3.02 GB) - works ONLY on the Hexagon V79 NPU

They are not the same file with different labels. The NPU variant was compiled through Qualcomm's QNN toolchain, and its execution graph contains DISPATCH_OP custom ops - pre-compiled subgraphs that run directly on the Hexagon DSP. The DISPATCH_OP is registered at runtime by libLiteRtDispatch_Qualcomm.so; without that library and the matching hardware, the ops have no implementation.

If you try to load the NPU model on a CPU or GPU backend:

E tflite : Encountered unresolved custom op: DISPATCH_OP.
E tflite : Node number 0 (DISPATCH_OP) failed to prepare.

It is not a compatibility issue you can work around - the operations literally do not have CPU or GPU implementations. An earlier version of our app tried to fall back from NPU to GPU using the same NPU model file. It failed immediately.
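A minimal sketch of the rule this implies: pick the model file per backend, and never retry a different backend with the same NPU file. The file names come from this post; the Backend enum and load_plan helper are hypothetical and only model the selection logic, not the LiteRT-LM loading API.

```python
from enum import Enum


class Backend(Enum):
    CPU = "cpu"
    GPU = "gpu"
    NPU = "npu"


# Each backend maps to the file compiled for it. CPU and GPU share one file.
MODEL_FOR_BACKEND = {
    Backend.CPU: "gemma-4-E2B-it.litertlm",
    Backend.GPU: "gemma-4-E2B-it.litertlm",
    Backend.NPU: "gemma-4-E2B-it_qualcomm_sm8750.litertlm",
}


def load_plan(
    preferred: list[Backend], available_files: set[str]
) -> list[tuple[Backend, str]]:
    """Return (backend, file) attempts in order, skipping files not on device."""
    plan = []
    for backend in preferred:
        model_file = MODEL_FOR_BACKEND[backend]
        if model_file in available_files:
            plan.append((backend, model_file))
    return plan


on_device = {
    "gemma-4-E2B-it.litertlm",
    "gemma-4-E2B-it_qualcomm_sm8750.litertlm",
}
for backend, model_file in load_plan(
    [Backend.NPU, Backend.GPU, Backend.CPU], on_device
):
    print(backend.value, "->", model_file)
```

If only the NPU file is on the device, the plan contains a single NPU attempt and no GPU or CPU fallback. That matches the behavior described above.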

PS: I will do a deep dive on the details of how GPU and NPU graphs / subgraphs are different and how they are compiled in a separate post.

You cannot go back

A .litertlm file is the END of the pipeline, not a waypoint.

You cannot:

  • Fine-tune it - the weights are quantized and compiled. The training-time representation is gone.
  • Change its quantization - INT4 is baked in at export time. You cannot re-quantize to INT8 from the .litertlm.
  • Swap its chat template - the template is embedded during export. If it is wrong, you re-export from scratch.
  • Add or remove the vision encoder - the model's modality support is fixed at compile time.

The tool that produces these files is litert-torch export_hf, and every decision is made at export time:

litert-torch export_hf \
  --model=google/gemma-4-E2B-it \
  --output_dir=./exported_model \
  --externalize_embedder \
  --quantization_recipe=dynamic_wi4_afp32

Prefill chunk size, KV cache length, quantization scheme - all locked in. The resulting .litertlm is immutable.
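One way to picture "locked in" is a frozen record of export-time decisions. This is only a mental-model sketch: ExportDecisions and its field names are hypothetical, the quantization recipe string is the one from the command above, and the prefill and KV-cache numbers are placeholder values, not the real settings of any shipped file.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ExportDecisions:
    quantization_recipe: str
    prefill_chunk_size: int  # placeholder value below, not a real setting
    kv_cache_length: int  # placeholder value below, not a real setting
    has_vision_encoder: bool


shipped = ExportDecisions("dynamic_wi4_afp32", 128, 4096, True)

# Editing the shipped artifact in place is not possible.
try:
    shipped.quantization_recipe = "some_other_recipe"
    mutated = True
except FrozenInstanceError:
    mutated = False

# Changing a decision means producing a new artifact, i.e. re-exporting.
re_exported = replace(shipped, has_vision_encoder=False)
print(mutated, shipped.has_vision_encoder, re_exported.has_vision_encoder)
```

The analogy is loose, but the shape is right: any change produces a new object from the same inputs and leaves the existing one untouched.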

Three files, three lessons

During the hackathon, we ended up with three .litertlm files on the device:

| File | Size | Works On | Vision | Notes |
|---|---|---|---|---|
| gemma-4-E2B-it.litertlm | 2.59 GB | CPU, GPU | Yes | Standard litert-community export |
| gemma-4-E2B-it_qualcomm_sm8750.litertlm | 3.02 GB | NPU only | Yes | QNN-compiled with DISPATCH_OP for Hexagon V79 |
| gemma4_ft.litertlm | 4.7 GB | CPU, GPU | No | Custom fine-tuned, different quantization granularity |

The fine-tuned model was nearly double the size (4.7 GB versus 2.59 GB). It did not have more parameters; the export used different quantization settings. It was also missing the vision encoder entirely (TF_LITE_VISION_ENCODER not found), so image redaction did not work. Neither was a bug in our fine-tuning. Both were consequences of how the export was configured.
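For a quick sanity check on the sizes, the ratios relative to the standard export can be computed directly from the figures in this post. It is only arithmetic on file sizes; it says nothing about how the bytes are split between weights, graph, and encoders.

```python
# File sizes in GB, as reported in this post.
sizes_gb = {
    "gemma-4-E2B-it.litertlm": 2.59,
    "gemma-4-E2B-it_qualcomm_sm8750.litertlm": 3.02,
    "gemma4_ft.litertlm": 4.7,
}

baseline = sizes_gb["gemma-4-E2B-it.litertlm"]
ratios = {name: round(size / baseline, 2) for name, size in sizes_gb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the standard export")
```

The fine-tuned file comes out at roughly 1.8 times the baseline even though it lacks the vision encoder, which is consistent with the size difference coming from quantization settings rather than from added components.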

The lesson: what goes in at export time is what you get. No runtime inspection, no patching, no "just swap out the tokenizer." If something is wrong, you go back to the source weights and re-export.

The mental model

Think of it this way:

  • .safetensors is source code
  • .litertlm is a compiled binary

You do not debug a compiled binary by editing its bytes. You go back to the source, fix it, and recompile. Everything upstream - fine-tuning, quantization, chat template, modality support - must be correct before the .litertlm is produced. After that, it is sealed.



Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.


Last updated: June 2026
6th of 22 posts in the "Edge AI from the Trenches" series