starbix committed
Commit ba62146 · verified · 1 Parent(s): 5ceac10

Update README.md

Files changed (1)
  1. README.md +0 -98
README.md CHANGED
@@ -27,101 +27,3 @@ This is an FP8 dynamically quantized version of [swiss-ai/Apertus-8B-Instruct-25
  - **Ignored Layers**: `lm_head` (kept in higher precision for better output quality)
  - **Tool**: llm-compressor (Neural Magic)
 
- ## Benefits
-
- FP8 quantization provides:
- - **Reduced model size**: ~50% smaller than FP16
- - **Faster inference**: Especially on hardware with FP8 support (e.g., NVIDIA H100, H200)
- - **Lower memory usage**: Enables larger batch sizes
- - **Maintained quality**: Minimal accuracy loss compared to full precision
-
- ## Usage
-
- ### With Transformers
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model = AutoModelForCausalLM.from_pretrained(
-     "starbix/Apertus-8B-Instruct-2509-FP8_dynamic",
-     device_map="auto",
-     trust_remote_code=True,
- )
- tokenizer = AutoTokenizer.from_pretrained("starbix/Apertus-8B-Instruct-2509-FP8_dynamic")
-
- # Generate text
- messages = [
-     {"role": "user", "content": "What is the capital of Switzerland?"}
- ]
- inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
- outputs = model.generate(inputs, max_new_tokens=256)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- ```
-
- ### With vLLM (Recommended for FP8)
-
- ```python
- from vllm import LLM, SamplingParams
-
- llm = LLM(
-     model="starbix/Apertus-8B-Instruct-2509-FP8_dynamic",
-     trust_remote_code=True,
- )
-
- sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
- prompts = ["What is the capital of Switzerland?"]
- outputs = llm.generate(prompts, sampling_params)
-
- for output in outputs:
-     print(output.outputs[0].text)
- ```
-
- ## Performance Comparison
-
- Compared to the base model:
- - **Model size**: ~50% reduction
- - **Inference speed**: Up to 2x faster on FP8-capable hardware
- - **Memory usage**: ~50% reduction
-
- ## Hardware Requirements
-
- - **GPU**: Recommended for best performance
-   - NVIDIA H100/H200: Native FP8 support for optimal performance
-   - NVIDIA A100/A10: Compatible but may not see full speedup
- - **CPU**: Supported but slower
- - **Memory**: ~8-10 GB GPU memory for inference
-
- ## Limitations
-
- - May have slight accuracy differences compared to the full precision model
- - FP8 speedups are most pronounced on hardware with native FP8 support
- - Not all operations may be quantized
-
- ## Base Model
-
- For more information about the base model, capabilities, and training details, please see:
- [swiss-ai/Apertus-8B-Instruct-2509](https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509)
-
- ## Citation
-
- If you use this quantized model, please cite both the base model and llm-compressor:
-
- ```bibtex
- @misc{apertus-8b-instruct-2509,
-   title={Apertus-8B-Instruct-2509},
-   author={Swiss AI},
-   url={https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509},
-   year={2025}
- }
-
- @software{llm-compressor,
-   title={LLM Compressor},
-   author={Neural Magic},
-   url={https://github.com/vllm-project/llm-compressor},
-   year={2024}
- }
- ```
-
- ## License
-
- This model inherits the Apache 2.0 license from the base model.
 
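The context lines kept by this commit record only the quantization setup: dynamic FP8 via llm-compressor, with `lm_head` left in higher precision. For reference, the sketch below shows how such a checkpoint is typically produced with llm-compressor's FP8 dynamic recipe. It is a minimal sketch, not taken from this commit: the base-model ID is inferred from the removed README, the output directory name is hypothetical, and exact import paths and argument names can differ between llm-compressor versions.

```python
# Sketch: producing an FP8-dynamic checkpoint with llm-compressor.
# Assumes a recent llm-compressor release; on older versions the import
# path is `from llmcompressor.transformers import oneshot`.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

BASE_ID = "swiss-ai/Apertus-8B-Instruct-2509"
OUTPUT_DIR = "Apertus-8B-Instruct-2509-FP8_dynamic"  # hypothetical local path

model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# Dynamic FP8 quantization of the Linear layers, keeping lm_head unquantized
# (matching the "Ignored Layers: lm_head" note in the README).
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Dynamic activation scales are computed at runtime, so no calibration
# dataset is needed for this one-shot pass.
oneshot(model=model, recipe=recipe)

# Save weights in compressed (FP8) format alongside the tokenizer.
model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```

The resulting directory can then be loaded with Transformers or served with vLLM exactly as shown in the usage examples from the removed README.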