YAML Metadata Warning: empty or missing yaml metadata in repo card

Check out the documentation for more information.

腾云智算（Tenyunw）专注于大模型推理优化和GPU云基础设施。

我们在做什么：

推理加速：实现了 Eagle3 for training on Qwen3-8B（20,000+ downloads），显著提升LLM推理速度
GPU云平台：为AIGC应用提供性价比最优的推理部署方案
量化优化：基于QAT+ DPO的训练和推理框架，帮助客户在Blackwell架构硬件上翻倍并发能力

核心团队来自腾讯云、华为云、面壁智能等，在AI基础设施领域有15年以上实战经验。

🎁 限时福利：如果你在做大模型应用部署，我们提供免费的推理优化咨询（30分钟技术诊断），帮你分析：

当前部署方案的性能瓶颈
成本优化的具体路径
适合你场景的推理加速方案

适合谁：

月推理费用 > $5K的团队
需要降低推理成本30%+
考虑自建推理集群

联系方式：

创始人/CTO Rocky（前腾讯云行业架构师团队负责人）
- 微信：[rocket-assassin]

LinkedIn: [https://www.linkedin.com/in/wangchao0808/]
官网：https://www.tenyunw.com/
Email: rockywang@tenyunw.com

We help AI builders deploy faster and cheaper. Let's talk.

install

pip install git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python

serve

python -m sglang.launch_server     --model-path Qwen/Qwen3-8B     --speculative-algorithm DFLASH     --speculative-draft-model-path Tengyunw/Qwen3-8B-DFlash     --tp-size 1     --dtype bfloat16     --attention-backend fa3     --mem-fraction-static 0.75     --trust-remote-code

performance

DFLASH Bench Report

Settings

dataset: math500
max_new_tokens: 2048
attention_backends: fa3, flashinfer
tp_size: 1
concurrencies: 1, 4, 8, 16, 32
questions_per_concurrency: base=128
device_sm: 90
is_blackwell: False
skip_baseline: False
drop_first_batch: true

Backend: `fa3`

Baseline output tok/s

conc	1	4	8	16	32
value	146.15	557.57	1,073.81	1,995.19	3,522.30

DFLASH output tok/s

conc	1	4	8	16	32
value	507.64	1,820.02	3,335.61	5,365.14	6,985.11

Speedup (DFLASH / baseline)

conc	1	4	8	16	32
value	3.474	3.264	3.106	2.689	1.983

DFLASH acceptance length

conc	1	4	8	16	32
value	5.758	5.682	5.665	5.680	5.696

Backend: `flashinfer`

Baseline output tok/s

conc	1	4	8	16	32
value	147.22	563.03	1,079.81	1,991.50	3,414.32

DFLASH output tok/s

conc	1	4	8	16	32
value	485.97	1,686.93	2,958.82	4,580.78	5,861.75

Speedup (DFLASH / baseline)

conc	1	4	8	16	32
value	3.301	2.996	2.740	2.300	1.717

DFLASH acceptance length

conc	1	4	8	16	32
value	5.714	5.669	5.675	5.688	5.702

Downloads last month: 74

Safetensors

Model size

1B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

install

serve

performance

DFLASH Bench Report

Settings

Backend: fa3

Baseline output tok/s

DFLASH output tok/s

Speedup (DFLASH / baseline)

DFLASH acceptance length

Backend: flashinfer

Baseline output tok/s

DFLASH output tok/s

Speedup (DFLASH / baseline)

DFLASH acceptance length

Backend: `fa3`

Backend: `flashinfer`