
Tenyunw (腾云智算) focuses on large-model inference optimization and GPU cloud infrastructure.

What we do:

  • Inference acceleration: trained an Eagle3 speculative-decoding draft model for Qwen3-8B (20,000+ downloads), significantly speeding up LLM inference
  • GPU cloud platform: cost-effective inference deployment for AIGC applications
  • Quantization: a QAT + DPO training and inference framework that helps customers double concurrency on Blackwell-architecture hardware

Our core team comes from Tencent Cloud, Huawei Cloud, and ModelBest (面壁智能), with 15+ years of hands-on experience in AI infrastructure.

🎁 Limited-time offer: if you are deploying large-model applications, we provide a free inference-optimization consultation (a 30-minute technical diagnosis) covering:

  • Performance bottlenecks in your current deployment
  • Concrete paths to cost reduction
  • Inference-acceleration options suited to your workload

Who this is for:

  • Teams spending > $5K/month on inference
  • Teams needing to cut inference costs by 30%+
  • Teams considering a self-hosted inference cluster

Contact:

  • Founder/CTO Rocky (former lead of Tencent Cloud's industry architect team)
    • WeChat: [rocket-assassin]


We help AI builders deploy faster and cheaper. Let's talk.

install

pip install git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python

serve

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path Tengyunw/Qwen3-8B-DFlash \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code
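Once the server is up, it can be queried over its OpenAI-compatible HTTP API. A minimal client sketch, assuming SGLang's default address of http://localhost:30000 (adjust the host/port if you launched with different settings):

```python
# Minimal client for the SGLang server launched above. Assumes the default
# address http://localhost:30000 and the OpenAI-compatible
# /v1/chat/completions endpoint; stdlib only, no extra dependencies.
import json
import urllib.request


def build_request(prompt: str, max_tokens: int = 256) -> dict:
    # Payload in the OpenAI chat-completions format.
    return {
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = "http://localhost:30000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#   print(chat("What is speculative decoding?"))
```

Speculative decoding with the DFlash draft model is transparent to the client: requests and responses look exactly like plain Qwen3-8B serving, only faster.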

performance

DFLASH Bench Report

Settings

  • dataset: math500
  • max_new_tokens: 2048
  • attention_backends: fa3, flashinfer
  • tp_size: 1
  • concurrencies: 1, 4, 8, 16, 32
  • questions_per_concurrency: base=128
  • device_sm: 90
  • is_blackwell: False
  • skip_baseline: False
  • drop_first_batch: True

Backend: fa3

concurrency                  1         4         8        16        32
baseline output tok/s   146.15    557.57   1073.81   1995.19   3522.30
DFLASH output tok/s     507.64   1820.02   3335.61   5365.14   6985.11
speedup (DFLASH/base)    3.474     3.264     3.106     2.689     1.983
acceptance length        5.758     5.682     5.665     5.680     5.696

Backend: flashinfer

concurrency                  1         4         8        16        32
baseline output tok/s   147.22    563.03   1079.81   1991.50   3414.32
DFLASH output tok/s     485.97   1686.93   2958.82   4580.78   5861.75
speedup (DFLASH/base)    3.301     2.996     2.740     2.300     1.717
acceptance length        5.714     5.669     5.675     5.688     5.702
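The speedup rows are simply DFLASH throughput divided by baseline throughput at each concurrency. A quick sanity-check sketch using the fa3 numbers from the tables above (results match the reported speedups to within rounding of the tok/s figures):

```python
# Recompute the fa3 speedup row from the reported throughputs.
# Keys are concurrency levels; values are output tokens/second from the tables.
baseline = {1: 146.15, 4: 557.57, 8: 1073.81, 16: 1995.19, 32: 3522.30}
dflash = {1: 507.64, 4: 1820.02, 8: 3335.61, 16: 5365.14, 32: 6985.11}

# Speedup = DFLASH tok/s / baseline tok/s at the same concurrency.
speedup = {c: dflash[c] / baseline[c] for c in baseline}

for c in sorted(speedup):
    print(f"concurrency {c:>2}: {speedup[c]:.3f}x")
```

Note how the gain shrinks as concurrency rises (about 3.5x at concurrency 1 down to about 2x at 32): at high batch sizes the GPU is already compute-saturated, so the extra tokens verified per step buy less wall-clock time.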