starbix committed
Commit ba62146 · verified · 1 Parent(s): 5ceac10

Update README.md

Files changed (1)
  1. README.md +0 -98
README.md CHANGED
@@ -27,101 +27,3 @@ This is an FP8 dynamically quantized version of [swiss-ai/Apertus-8B-Instruct-25
  - **Ignored Layers**: `lm_head` (kept in higher precision for better output quality)
  - **Tool**: llm-compressor (Neural Magic)
 
- ## Benefits
-
- FP8 quantization provides:
- - **Reduced model size**: ~50% smaller than FP16
- - **Faster inference**: Especially on hardware with FP8 support (e.g., NVIDIA H100, H200)
- - **Lower memory usage**: Enables larger batch sizes
- - **Maintained quality**: Minimal accuracy loss compared to full precision
-
- ## Usage
-
- ### With Transformers
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model = AutoModelForCausalLM.from_pretrained(
-     "starbix/Apertus-8B-Instruct-2509-FP8_dynamic",
-     device_map="auto",
-     trust_remote_code=True,
- )
- tokenizer = AutoTokenizer.from_pretrained("starbix/Apertus-8B-Instruct-2509-FP8_dynamic")
-
- # Generate text
- messages = [
-     {"role": "user", "content": "What is the capital of Switzerland?"}
- ]
- inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
- outputs = model.generate(inputs, max_new_tokens=256)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- ```
-
- ### With vLLM (Recommended for FP8)
-
- ```python
- from vllm import LLM, SamplingParams
-
- llm = LLM(
-     model="starbix/Apertus-8B-Instruct-2509-FP8_dynamic",
-     trust_remote_code=True,
- )
-
- sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
- prompts = ["What is the capital of Switzerland?"]
- outputs = llm.generate(prompts, sampling_params)
-
- for output in outputs:
-     print(output.outputs[0].text)
- ```
-
- ## Performance Comparison
-
- Compared to the base model:
- - **Model size**: ~50% reduction
- - **Inference speed**: Up to 2x faster on FP8-capable hardware
- - **Memory usage**: ~50% reduction
-
- ## Hardware Requirements
-
- - **GPU**: Recommended for best performance
-   - NVIDIA H100/H200: Native FP8 support for optimal performance
-   - NVIDIA A100/A10: Compatible but may not see full speedup
- - **CPU**: Supported but slower
- - **Memory**: ~8-10 GB GPU memory for inference
-
- ## Limitations
-
- - May have slight accuracy differences compared to the full precision model
- - FP8 speedups are most pronounced on hardware with native FP8 support
- - Not all operations may be quantized
-
- ## Base Model
-
- For more information about the base model, capabilities, and training details, please see:
- [swiss-ai/Apertus-8B-Instruct-2509](https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509)
-
- ## Citation
-
- If you use this quantized model, please cite both the base model and llm-compressor:
-
- ```bibtex
- @misc{apertus-8b-instruct-2509,
-   title={Apertus-8B-Instruct-2509},
-   author={Swiss AI},
-   url={https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509},
-   year={2025}
- }
-
- @software{llm-compressor,
-   title={LLM Compressor},
-   author={Neural Magic},
-   url={https://github.com/vllm-project/llm-compressor},
-   year={2024}
- }
- ```
-
- ## License
-
- This model inherits the Apache 2.0 license from the base model.
 
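The context lines kept by this commit record only the quantization setup: dynamic FP8 via llm-compressor, with `lm_head` left in higher precision. For reference, the sketch below shows how such a checkpoint is typically produced with llm-compressor's FP8 dynamic recipe. It is a minimal sketch, not taken from this commit: the base-model ID is inferred from the removed README, the output directory name is hypothetical, and exact import paths and argument names can differ between llm-compressor versions.

```python
# Sketch: producing an FP8-dynamic checkpoint with llm-compressor.
# Assumes a recent llm-compressor release; on older versions the import
# path is `from llmcompressor.transformers import oneshot`.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

BASE_ID = "swiss-ai/Apertus-8B-Instruct-2509"
OUTPUT_DIR = "Apertus-8B-Instruct-2509-FP8_dynamic"  # hypothetical local path

model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# Dynamic FP8 quantization of the Linear layers, keeping lm_head unquantized
# (matching the "Ignored Layers: lm_head" note in the README).
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Dynamic activation scales are computed at runtime, so no calibration
# dataset is needed for this one-shot pass.
oneshot(model=model, recipe=recipe)

# Save weights in compressed (FP8) format alongside the tokenizer.
model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```

The resulting directory can then be loaded with Transformers or served with vLLM exactly as shown in the usage examples from the removed README.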