SGLang

SGLang is a low-latency, high-throughput inference engine for large language models (LLMs). It also includes a frontend language for building agentic workflows.

Set model_impl="transformers" to load a model through the Transformers modeling backend.

import sglang as sgl

# Engine takes the same arguments as sglang.launch_server, passed as keywords.
llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", model_impl="transformers")

# generate returns one dict per prompt; the completion text is in the "text" field.
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0]["text"])

Pass --model-impl transformers to the sglang.launch_server command for online serving.

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --model-impl transformers \
  --host 0.0.0.0 \
  --port 30000
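
Once the server is up, it exposes an OpenAI-compatible API. Below is a minimal client sketch using the openai Python package, assuming the port from the command above and a server running locally:

import openai

# The API key is unused by SGLang but required by the client.
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=20,
)
print(response.choices[0].message.content)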

Transformers integration

Setting model_impl="transformers" tells SGLang to skip its native model matching and use the Transformers model directly.

  1. PreTrainedConfig.from_pretrained() loads the model’s config.json from the Hub or your Hugging Face cache.
  2. AutoModel.from_config() resolves the model class based on the config.
  3. During loading, _attn_implementation is set to "sglang". This routes attention calls through SGLang’s RadixAttention kernels.
  4. SGLang replaces the model’s linear layers with its parallel linear classes to enable tensor parallelism.
  5. The load_weights function populates the model with weights from safetensors files.

The model benefits from all SGLang optimizations while using the Transformers model structure.
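
Conceptually, the first three steps map onto plain Transformers calls like the ones below. This is an illustrative sketch, not SGLang's actual code: the "sglang" attention backend only exists after SGLang registers it, so the stock "sdpa" backend stands in here.

from transformers import AutoConfig, AutoModelForCausalLM

# Step 1: fetch config.json from the Hub or the local cache.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Steps 2-3: resolve the model class from the config and instantiate it with
# the requested attention backend. Weights are still random at this point;
# SGLang fills them in later via load_weights (step 5).
model = AutoModelForCausalLM.from_config(config, attn_implementation="sdpa")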

Compatible models require _supports_attention_backend=True so SGLang can control attention execution. See the Building a compatible model backend for inference guide for details.
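
As a rough sketch of what that guide covers, a compatible model declares the class attribute and routes attention through the backend selected by config._attn_implementation. MyAttention and MyModel are hypothetical placeholders; the guide covers the full requirements.

from torch import nn
from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

class MyAttention(nn.Module):
    # Hypothetical attention module following the Transformers attention interface.
    def __init__(self, config):
        super().__init__()
        self.config = config

    def forward(self, query, key, value, attention_mask=None, **kwargs):
        # Dispatch to whichever backend was set at load time; under SGLang,
        # config._attn_implementation is "sglang", which resolves to its
        # RadixAttention kernels.
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self, query, key, value, attention_mask, **kwargs
        )
        return attn_output

class MyModel(PreTrainedModel):
    # Opting in lets engines like SGLang swap in their own attention backend.
    _supports_attention_backend = True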
