What makes the 'web' model different?
I am trying to learn about finetuning models. I would love to finetune a Gemma model, but I am looking to use it in a web application, which needs special consideration regarding the type of model I use.
In your web version of this model, what special settings or operations are used - I assume during conversion - that make a model web-friendly vs. non-web-friendly? Thank you!
Hi @nlisac ,
Welcome to Gemma models, thanks for reaching out to us. The most direct indicators of web-friendliness are the model's file format and the runtime environment it's designed for.
Standard Model: Typically distributed in formats like PyTorch (.pth, .safetensors) or Hugging Face Transformers, which are designed for powerful server-side GPUs.
Web-Friendly Model: The suffix litert-lm in the model name stands for LiteRT-LM, which is Google's dedicated framework for on-device or edge inference.
- The model is converted into a proprietary, highly optimized format like .litertlm (or sometimes a .task file bundle containing a mix of smaller optimized TFLite files).
- This format is designed to be used with the MediaPipe LLM Inference API or the underlying LiteRT-LM runtime, which can run efficiently within a web worker thread using WebAssembly (Wasm).
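To make this concrete, here is a hedged sketch of what consuming a `.litertlm` model from a web page looks like with the MediaPipe LLM Inference API for Web (the `@mediapipe/tasks-genai` package). The CDN URL, model path, and option values below are placeholders, not official recommendations - check the MediaPipe documentation for the current API surface.

```js
// Sketch: running a converted .litertlm Gemma model in the browser with
// the MediaPipe LLM Inference API (package: @mediapipe/tasks-genai).
// The Wasm CDN URL and model path are placeholder assumptions.
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

async function runGemma() {
  // Resolve the Wasm files that back the on-device runtime.
  const genai = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
  );

  // Load the converted, quantized model bundle served alongside the app.
  const llm = await LlmInference.createFromOptions(genai, {
    baseOptions: { modelAssetPath: '/models/gemma-3n-E2B-it-int4.litertlm' },
    maxTokens: 512,
  });

  const answer = await llm.generateResponse('Why is the sky blue?');
  console.log(answer);
}

runGemma();
```

Because the heavy lifting happens in Wasm, this can be moved off the main thread into a web worker so the UI stays responsive while the model generates.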
The base Gemma 3n architecture already includes features that make it inherently more web-friendly before conversion:
1. Selective Parameter Activation (MatFormer): Gemma 3n uses the Matryoshka Transformer architecture, which allows for the selective activation of only a subset of the total parameters based on the task or device resources. This reduces the effective parameter count (the "E2B" in the model name stands for "effective 2B") and the computational cost per request.
2. Per-Layer Embedding (PLE) Caching: This technique allows key embedding parameters to be cached to fast, local storage, reducing runtime memory requirements.
3. On-Device Focus: The Gemma family, particularly the 3n variants, is specifically engineered by Google for efficient execution on low-resource devices like mobile phones and, by extension, web browsers.
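The Matryoshka idea behind point 1 can be illustrated with a toy example. This is purely illustrative and is not Gemma's actual implementation: the point is just that a larger weight matrix can nest smaller sub-matrices that work on their own, so a constrained device activates only the nested block instead of the full model.

```javascript
// Toy illustration of the Matryoshka (MatFormer) idea: a large weight
// matrix nests smaller sub-matrices, so a resource-constrained device
// can activate only the top-left d×d block. Illustrative only - not
// how Gemma 3n is actually implemented.

// Extract the nested d×d sub-model from a full weight matrix.
function nestedSubMatrix(weights, d) {
  return weights.slice(0, d).map((row) => row.slice(0, d));
}

// Apply a (sub-)matrix to an input vector: y = W · x.
function matVec(weights, x) {
  return weights.map((row) =>
    row.reduce((sum, w, j) => sum + w * x[j], 0)
  );
}

// A 4×4 "full" model; the 2×2 top-left block is the nested small model.
const full = [
  [1, 0, 0, 0],
  [0, 1, 0, 0],
  [0, 0, 1, 0],
  [0, 0, 0, 1],
];

const small = nestedSubMatrix(full, 2); // [[1, 0], [0, 1]]
const y = matVec(small, [3, 4]);        // [3, 4]
console.log(small, y);
```

The same input can be run through either the full matrix or the nested block; the nested block costs a fraction of the multiply-adds, which is the trade-off MatFormer exploits per task or per device.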
Thanks.
Thanks for the detailed answer.
A follow up question from my side:
What's the difference then between gemma-3n-E2B-it-int4-Web.litertlm and gemma-3n-E2B-it-int4.litertlm? Both models are already in the litertlm format.
Could you please clarify the sources from which you are referencing these two models?
Thanks.
@BalakrishnaCh
From this repository (the Files tab):
https://huggingface.co/google/gemma-3n-E2B-it-litert-lm/tree/main