Kokoro Huggingface Inference Demo

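This project provides a Kokoro speech synthesis inference demo for the Axera platform. It supports multilingual text-to-speech (TTS) inference and runs in both on-board and compute-card environments.

Python example
C++ example

Model conversion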

kokoro convert

Supported platforms

AX650N
AX630C

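Main features

Speech synthesis: converts text in multiple languages into speech.

On-board deployment (using the AX650N as an example)

The AX650N device ships with Ubuntu 22.04 preinstalled.
Log in to the AX650N board as root.
Connect the board to the internet and make sure commands such as apt install and pip install run normally on it.
Verified device: AX650N DEMO Board

Dependencies

Verified on Python 3.10
Required packages are listed in requirements.txt
The axengine inference library must be installed (if it is not, follow the steps below)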

Directory structure

.
|-- demo_kokoro_ax.py           # Main inference demo script
|-- inference_utils.py          # Inference helper script
|-- requirements.txt            # Dependency list
|-- checkpoints/                # Configuration and voice files
|   |-- config.json
|   |-- voices/*.pt
|-- models/                     # Model files
|   |-- kokoro_part1_96.axmodel
|   |-- kokoro_part2_96.axmodel
|   |-- kokoro_part3_96.axmodel
|   |-- model4_har_sim.onnx
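
Before running the demo, it can help to confirm that the files in the tree above are actually in place. The following is a minimal sketch and not part of the project: the helper name missing_files is hypothetical, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# Paths copied from the directory tree above; voices/*.pt is checked as a glob.
EXPECTED_FILES = [
    "demo_kokoro_ax.py",
    "inference_utils.py",
    "requirements.txt",
    "checkpoints/config.json",
    "models/kokoro_part1_96.axmodel",
    "models/kokoro_part2_96.axmodel",
    "models/kokoro_part3_96.axmodel",
    "models/model4_har_sim.onnx",
]


def missing_files(root="."):
    """Return the expected paths that do not exist under root (hypothetical helper)."""
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # The demo needs at least one voice file to pass to --voice.
    if not any((root / "checkpoints" / "voices").glob("*.pt")):
        missing.append("checkpoints/voices/*.pt")
    return missing


for name in missing_files():
    print("missing:", name)
```

Run it from the project root; it prints nothing when every listed file is found.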

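Quick start

1. Install dependencies

# Create and activate a virtual environment
conda create -n kokoro_test python=3.10
conda activate kokoro_test

# Install axengine (if it is not already installed)
hf download AXERA-TECH/PyAXEngine --local-dir PyAXEngine
cd PyAXEngine
pip install axengine-0.1.3-py3-none-any.whl

# Install the project dependencies
# First cd to the project root directory, then run:
pip install -r requirements.txt
# Then download the spaCy English model:
python -m spacy download en_core_web_sm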
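
To confirm that the installation steps took effect, the environment can be probed from Python. This is a minimal sketch rather than a project script; the module names are assumed from the install commands above (the axengine wheel, spaCy, and the en_core_web_sm model package), and missing_modules is a hypothetical helper.

```python
import importlib.util

# Module names assumed from the install steps above.
REQUIRED_MODULES = ["axengine", "spacy", "en_core_web_sm"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
print("all dependencies found" if not missing else f"missing: {missing}")
```

It only checks that each module can be located; it does not import them or verify versions.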

2. Models

# Place the Kokoro model files and voice files in the models/ and checkpoints/ directories

3. Run the inference demo

Chinese:
python demo_kokoro_ax.py --text "致力于打造世界领先的人工智能感知与边缘计算芯片。" --lang z --voice checkpoints/voices/zf_xiaoyi.pt --output output_zh.wav -d models -f 0.3
English:
python demo_kokoro_ax.py --text "The sky above the port was the color of television, tuned to a dead channel." --lang a --voice checkpoints/voices/af_heart.pt --output output_en.wav -d models -f 0.3
Japanese:
python demo_kokoro_ax.py --text "「もしおれがただ偶然、そしてこうしようというつもりでなくここに立っているのなら、ちょっとばかり絶望するところだな」と、そんなことが彼の頭に思い浮かんだ。" --lang j --voice checkpoints/voices/jm_kumo.pt --output output_jp.wav -d models -f 0.3

Note: to run inference on the AX630C platform, only the model directory needs to change. Replace -d models with -d models_620E, as follows:

python demo_kokoro_ax.py --text "The sky above the port was the color of television, tuned to a dead channel." --lang a --voice checkpoints/voices/af_heart.pt --output output_en.wav -d models_620E -f 0.3
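
The commands above differ only in the text, language code, voice file, output name, and model directory. As a rough sketch, they can be assembled from Python using only the flags shown in this README; build_demo_command is a hypothetical helper, not something the project provides.

```python
import shlex
import sys


def build_demo_command(text, lang, voice, output, model_dir="models", fade_out=0.3):
    """Build the argument list for demo_kokoro_ax.py (hypothetical helper)."""
    return [
        sys.executable, "demo_kokoro_ax.py",
        "--text", text,
        "--lang", lang,
        "--voice", voice,
        "--output", output,
        "-d", model_dir,
        "-f", str(fade_out),
    ]


# Use model_dir="models_620E" on the AX630C, as noted above.
cmd = build_demo_command(
    "The sky above the port was the color of television, tuned to a dead channel.",
    "a", "checkpoints/voices/af_heart.pt", "output_en.wav",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run(cmd, check=True) from the project root.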

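Parameters:

Parameter (short form)	Description
--axmodel_dir, -d	Model file directory (default: models)
--voice, -v	Path to the voice file (required)
--text, -t	Text to synthesize (multiple languages supported)
--lang, -l	Language code (e.g. a, z, j, ...); only Chinese, English, and Japanese have been tested so far
--config, -c	Path to the configuration file
--output, -o	Output wav file name
--fade_out, -f	Fade-out duration at the end of the audio (seconds); reduces noise at the end of the audio
--max_len, -m	Maximum sentence-segment length (default: 96)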
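
The table describes --fade_out as a duration in seconds applied to the end of the audio. How the demo implements it is not shown here; one common approach is a linear gain ramp over the last samples, sketched below in plain Python. The function name and the linear shape of the ramp are assumptions.

```python
def apply_fade_out(samples, sample_rate, fade_seconds):
    """Return a copy of samples whose last fade_seconds ramp linearly to zero."""
    out = list(samples)
    # Number of samples covered by the fade, capped at the signal length.
    n = min(len(out), int(round(fade_seconds * sample_rate)))
    start = len(out) - n
    for i in range(n):
        # Gain falls from 1.0 at the first faded sample to 0.0 at the last one.
        gain = 1.0 - i / (n - 1) if n > 1 else 0.0
        out[start + i] *= gain
    return out


# Ten samples at 10 Hz with a 0.5 s fade: only the last five samples are attenuated.
print(apply_fade_out([1.0] * 10, 10, 0.5))
```

Ending on a zero-gain sample avoids an abrupt step at the end of the file, which is a typical source of an audible click.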

4. Output

The generated audio is saved to the wav file specified by --output.
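
A quick way to sanity-check the result is to read the wav header back. The sketch below uses Python's standard wave module and assumes the output is PCM-encoded, which this README does not state; wav_info is a hypothetical helper.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM wav file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate
```

For example, print(wav_info("output_zh.wav")) reports the sample rate, channel count, and duration of the file written by the Chinese command above.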

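Timing is shown below; the lowest RTF (real-time factor) observed is about 0.07. In the log, 输出 is the output file, 时长 the audio duration, 初始化 the initialization time, 音频推理 the audio inference time (共1次 means one run in total), and 平均 the average per run.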

输出: output_zh.wav | 时长: 4.80s

初始化: 5.107s
音频推理: 0.322s (共1次)
  ├─ Model1: 0.022s (平均22.1ms)
  ├─ Model2: 0.017s (平均17.4ms)
  ├─ Model3: 0.185s (平均185.3ms)
  └─ Model4 onnx: 0.074s (平均73.7ms)

 rtf:0.067
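
RTF is conventionally the processing time divided by the duration of the generated audio, and the figures in the log above are consistent with that definition. A small worked example follows; the function name is illustrative only.

```python
def real_time_factor(process_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return process_seconds / audio_seconds


# Values from the log above: 0.322 s of inference for 4.80 s of audio.
print(f"{real_time_factor(0.322, 4.80):.3f}")  # 0.067
```

Note that the reported RTF covers the audio inference time only; the 5.107 s initialization is a one-time cost and is not included.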

TTS service

Start the service

python kokoro_svr.py --port 28000
[INFO] Available providers:  ['AXCLRTExecutionProvider']
TTS Server started at http://0.0.0.0:28000

Call the service

curl -X POST "http://127.0.0.1:28000/tts" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "sentence=爱芯元智半导体股份有限公司，致力于打造世界领先的人工智能感知与边缘计算芯片。" \
  --output tts.wav

import requests

url = "http://127.0.0.1:28000/tts"  # port must match the --port passed to kokoro_svr.py
data = {
    "sentence": "爱芯元智半导体股份有限公司，致力于打造世界领先的人工智能感知与边缘计算芯片。",
    "language": "z",
    "speed": "1.0",
    "sample_rate": "24000",
}

resp = requests.post(url, data=data)
if resp.status_code == 200:
    with open("tts_output.wav", "wb") as f:
        f.write(resp.content)
    print("Saved to tts_output.wav")
else:
    print("Error:", resp.status_code, resp.text)

CPP

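Prebuilt binaries

# Command-line program
/cpp/bin/kokoro

# HTTP server
/cpp/bin/kokoro_srv

Cross-compilation (if you need to modify the source code)

Done on a PC (tested on Ubuntu 22.04)

Install the development environment: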

sudo apt update
sudo apt install build-essential cmake

Get the cross-compilation toolchain: link
Add the cross-compilation toolchain path to PATH

Build

cd cpp
./download_bsp.sh
./build.sh

After the build finishes, the binaries are in the install directory.

Run

Command-line tool

./bin/kokoro

Output

(base) root@ax650:/mnt/data/HF/kokoro.axera/cpp# ./bin/kokoro
[I][                            main][  41]: Args:
[I][                            main][  42]: axmodel_dir: ../models
[I][                            main][  43]: text: 我想留在大家身边，从过去，一同迈向明天
[I][                            main][  44]: lang: z
[I][                            main][  45]: voice_path: ./voices
[I][                            main][  46]: voice_name: zf_xiaoxiao
[I][                            main][  47]: output: output.wav
[I][                            main][  48]: fade_out: 0.00
[I][                            main][  49]: max_len: 96
[I][                            init][ 306]: Loaded 114 tokens from dict/vocab.txt
[INFO] total pinyin character count: 26749
[INFO] total pinyin phrase count: 411946
Loading English G2P dict from dict/cmudict-0.7b/cmudict.dict...
[EnG2P] Loaded 126052 words from CMU dict.
[I][                            main][  90]: Init kokoro take 7.3640 seconds
[I][                            main][ 109]: Audio save to output.wav
[I][                            main][ 113]: RTF: 0.1322, process_time: 0.5620 seconds, audio duration: 4.25 seconds

Program usage help:

(base) root@ax650:/mnt/data/HF/kokoro.axera/cpp# ./bin/kokoro --help
usage: ./bin/kokoro [options] ...
options:
      --axmodel_dir    model path (string [=../models])
  -t, --text           Text to be generated (string [=我想留在大家身边，从过去，一同迈向明天])
  -l, --lang           Language code, Only support a(American English) or z(Chinese) currently (string [=z])
      --voice_path     Binary voices store path (string [=./voices])
  -v, --voice_name     Speaker voice name, check possible choices from voices/ (string [=zf_xiaoxiao])
  -o, --output         Output file path (string [=output.wav])
  -f, --fade_out       Fade out ratio between sentences (float [=0])
  -m, --max_len        Max input token num, fixed by model, no need to change usually (int [=96])
  -?, --help           print this message
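
The --max_len option in the help output above is described as fixed by the model, which implies that longer input has to be split before inference. How the tools actually split text is not shown in this README. The sketch below is one plausible approach under stated assumptions: it counts characters as a stand-in for tokens, splits at punctuation, and packs the pieces greedily; split_text is a hypothetical name.

```python
import re


def split_text(text, max_len=96):
    """Split text at punctuation, then pack pieces into chunks of at most max_len characters."""
    # Zero-width split that keeps each punctuation mark attached to the piece it ends.
    pieces = [p for p in re.split(r"(?<=[。！？!?.,，；;])", text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        # Hard-split any single piece that is longer than max_len on its own.
        while len(piece) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_len])
            piece = piece[max_len:]
        if len(current) + len(piece) > max_len:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


print(split_text("一二三。四五六。七八九。", max_len=8))
```

Each chunk would then be synthesized separately and the audio segments concatenated in order.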

Server

./bin/kokoro_srv

Sample run:

(base) root@ax650:/mnt/data/HF/kokoro.axera/cpp# ./install/kokoro_srv
[I][                            main][  32]: Args:
[I][                            main][  33]: port: 8080
[I][                            main][  34]: axmodel_dir: ../models
[I][                            main][  35]: lang: z
[I][                            main][  36]: voice_path: ./voices
[I][                            main][  37]: voice_name: zf_xiaoxiao
[I][                            main][  38]: max_len: 96
[I][                            init][  55]: Initializing Kokoro TTS model...
[I][                            init][  56]: Model path: ../models
[I][                            init][  57]: Voices path: ./voices
[I][                            init][  58]: Default voice: zf_xiaoxiao
[I][                            init][ 306]: Loaded 114 tokens from dict/vocab.txt
[INFO] total pinyin character count: 26749
[INFO] total pinyin phrase count: 411946
Loading English G2P dict from dict/cmudict-0.7b/cmudict.dict...
[EnG2P] Loaded 126052 words from CMU dict.
[I][                            init][  65]: Kokoro model initialized successfully!
[I][                            main][  73]: Start TTS Server at port 8080
[I][                           start][  72]: Starting TTS server at port 8080
[I][                           start][  73]: Endpoints:
[I][                           start][  74]:   POST /tts          - Text-to-Speech (returns WAV audio)
[I][                           start][  75]:   POST /tts_raw      - Text-to-Speech (returns raw PCM)
[I][                           start][  76]:   GET  /health       - Health check
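
The log above lists a GET /health endpoint, which is convenient for checking that the server is up before sending text. The sketch below uses only the Python standard library and checks the HTTP status alone, since the response body format is not documented here; the helper names and the 127.0.0.1 host are assumptions, and the port is the default 8080 from the log.

```python
import urllib.request


def endpoint_url(path, host="127.0.0.1", port=8080):
    """Build a URL for one of the endpoints listed in the server log."""
    return f"http://{host}:{port}{path}"


def is_healthy(host="127.0.0.1", port=8080, timeout=2.0):
    """Return True if GET /health answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(endpoint_url("/health", host, port), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False
```

Calling is_healthy() on the board returns False until kokoro_srv has finished initializing and started listening.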

Server usage help:

(base) root@ax650:/mnt/data/HF/kokoro.axera/cpp# ./bin/kokoro_srv --help
usage: ./bin/kokoro_srv [options] ...
options:
  -p, --port           Server port (int [=8080])
      --axmodel_dir    model path (string [=../models])
  -l, --lang           Language code, Only support a(American English) or z(Chinese) currently (string [=z])
      --voice_path     Binary voices store path (string [=./voices])
  -v, --voice_name     Speaker voice name, check possible choices from voices/ (string [=zf_xiaoxiao])
  -m, --max_len        Max input token num, fixed by model, no need to change usually (int [=96])
  -?, --help           print this message

Client

cd cpp
python kokoro_cli.py

This generates output.wav.

Model memory usage

CMM stands for the physical memory used by Axera modules such as VDEC (video decoder), VENC (video encoder), and the NPU.

lang	OS(MB)	CMM(MB)
z(Chinese)	233	237
a(English)	23	237

References

Technical support

Github issues
QQ group: 139953715
