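This project provides a Kokoro speech-synthesis inference demo for the Axera platform. It supports multilingual text-to-speech (TTS) inference and runs on both development boards and accelerator cards.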
Python dependencies are listed in requirements.txt.
|-- demo_kokoro_ax.py # main inference demo script
|-- inference_utils.py # inference helper script
|-- requirements.txt # dependency list
|-- checkpoints/ # configuration and voice files
| |-- config.json
| |-- voices/*.pt
|-- models/ # model files
| |-- kokoro_part1_96.axmodel
| |-- kokoro_part2_96.axmodel
| |-- kokoro_part3_96.axmodel
| |-- model4_har_sim.onnx
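Before running the demo, it can help to confirm that the files listed above are in place. The following is a minimal sketch and not part of the repository: the `missing_files` helper is hypothetical, and the expected paths are simply copied from the tree above.

```python
from pathlib import Path

# Paths copied from the directory tree above.
EXPECTED_FILES = [
    "demo_kokoro_ax.py",
    "inference_utils.py",
    "requirements.txt",
    "checkpoints/config.json",
    "models/kokoro_part1_96.axmodel",
    "models/kokoro_part2_96.axmodel",
    "models/kokoro_part3_96.axmodel",
    "models/model4_har_sim.onnx",
]


def missing_files(root: str) -> list[str]:
    """Return the expected entries that are absent under `root` (hypothetical helper)."""
    base = Path(root)
    missing = [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]
    # At least one voice file is needed for the --voice argument.
    voices_dir = base / "checkpoints" / "voices"
    if not (voices_dir.is_dir() and any(voices_dir.glob("*.pt"))):
        missing.append("checkpoints/voices/*.pt")
    return missing
```

Calling `missing_files(".")` from the project root returns an empty list when everything is present, or the relative paths that still need to be downloaded.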
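# Create and activate a virtual environment
conda create -n kokoro_test python=3.10
conda activate kokoro_test
# Install axengine (if not already installed)
hf download AXERA-TECH/PyAXEngine --local-dir PyAXEngine
cd PyAXEngine
pip install axengine-0.1.3-py3-none-any.whl
# Install the project dependencies
# (cd to the project root directory first)
pip install -r requirements.txt
# Then run the following command:
python -m spacy download en_core_web_sm
# Place the Kokoro model and voice files in the models/ and checkpoints/ directories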
Chinese:
python demo_kokoro_ax.py --text "致力于打造世界领先的人工智能感知与边缘计算芯片。" --lang z --voice checkpoints/voices/zf_xiaoyi.pt --output output_zh.wav -d models -f 0.3
English:
python demo_kokoro_ax.py --text "The sky above the port was the color of television, tuned to a dead channel." --lang a --voice checkpoints/voices/af_heart.pt --output output_en.wav -d models -f 0.3
Japanese:
python demo_kokoro_ax.py --text "「もしおれがただ偶然、そしてこうしようというつもりでなくここに立っているのなら、ちょっとばかり絶望するところだな」と、そんなことが彼の頭に思い浮かんだ。" --lang j --voice checkpoints/voices/jm_kumo.pt --output output_jp.wav -d models -f 0.3
Note: for inference on the AX630C platform, only the model directory needs to change. Replace -d models with -d models_620E, as follows:
python demo_kokoro_ax.py --text "The sky above the port was the color of television, tuned to a dead channel." --lang a --voice checkpoints/voices/af_heart.pt --output output_en.wav -d models_620E -f 0.3
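Parameters:
| Parameter (short form) | Description |
|---|---|
| --axmodel_dir, -d | Model file directory (default: models) |
| --voice, -v | Path to the voice file (required) |
| --text, -t | Text to synthesize (multilingual) |
| --lang, -l | Language code (e.g. a, z, j, ...); only Chinese, English, and Japanese have been tested so far |
| --config, -c | Path to the configuration file |
| --output, -o | Output wav file name |
| --fade_out, -f | Fade-out duration at the end of the audio, in seconds; reduces noise at the end of the audio |
| --max_len, -m | Maximum length of each sentence chunk (default: 96) |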
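The `--fade_out` option is described only as a fade applied to the end of the audio. As an illustration, a linear fade over the last `fade_seconds` of a mono signal could look like the sketch below. The function name, the linear ramp, and the 24000 Hz default sample rate are all assumptions; the demo's actual implementation may differ.

```python
def apply_fade_out(samples: list[float], fade_seconds: float,
                   sample_rate: int = 24000) -> list[float]:
    """Linearly ramp the last `fade_seconds` of audio down to silence.

    Hypothetical helper: the linear shape and default sample rate are assumptions.
    """
    n_fade = min(len(samples), int(fade_seconds * sample_rate))
    out = list(samples)  # work on a copy, leave the input untouched
    if n_fade <= 0:
        return out
    start = len(out) - n_fade
    for i in range(n_fade):
        # Gain falls from just under 1.0 at the start of the fade to 0.0 at the last sample.
        gain = (n_fade - 1 - i) / n_fade
        out[start + i] *= gain
    return out
```

With `-f 0.3` as in the commands above, such a fade would cover the final 0.3 s of the clip, which is where the table says trailing noise is reduced.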
The synthesized audio is written to the wav file specified by --output. Timing is shown below; the lowest RTF observed is about 0.07.
输出: output_zh.wav | 时长: 4.80s
初始化: 5.107s
音频推理: 0.322s (共1次)
├─ Model1: 0.022s (平均22.1ms)
├─ Model2: 0.017s (平均17.4ms)
├─ Model3: 0.185s (平均185.3ms)
└─ Model4 onnx: 0.074s (平均73.7ms)
rtf:0.067
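The real-time factor (RTF) in the log above is the audio inference time (`音频推理`) divided by the duration of the generated audio (`时长`); initialization time is not included, and values below 1 mean synthesis runs faster than playback. A minimal check using the logged numbers (the `rtf` helper is only for illustration):

```python
def rtf(process_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return process_seconds / audio_seconds


# Values taken from the log above: 0.322 s of inference for 4.80 s of audio.
print(f"rtf: {rtf(0.322, 4.80):.3f}")  # rtf: 0.067
```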
python kokoro_svr.py --port 28000
[INFO] Available providers: ['AXCLRTExecutionProvider']
TTS Server started at http://0.0.0.0:28000
curl -X POST "http://127.0.0.1:28000/tts" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "sentence=爱芯元智半导体股份有限公司,致力于打造世界领先的人工智能感知与边缘计算芯片。" \
  --output tts.wav
import requests

# The port must match the one passed to kokoro_svr.py (--port 28000 above).
url = "http://127.0.0.1:28000/tts"
data = {
    "sentence": "爱芯元智半导体股份有限公司,致力于打造世界领先的人工智能感知与边缘计算芯片。",
    "language": "z",
    "speed": "1.0",
    "sample_rate": "24000",
}
resp = requests.post(url, data=data)
if resp.status_code == 200:
    # The response body is the synthesized audio; save it as a wav file.
    with open("tts_output.wav", "wb") as f:
        f.write(resp.content)
    print("Saved to tts_output.wav")
else:
    print("Error:", resp.status_code, resp.text)
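To sanity-check the file saved by the client, the standard-library `wave` module can read its header. This is a sketch that assumes the server returns an uncompressed PCM WAV file, which the source does not state explicitly (`wave` raises `wave.Error` for other encodings); `wav_info` is a hypothetical helper.

```python
import wave


def wav_info(path: str) -> dict:
    """Read basic header fields from a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate": rate,
            "duration_seconds": frames / rate,
        }
```

For example, `wav_info("tts_output.wav")` would show whether the sample rate matches the `sample_rate` value sent in the request.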
# Command-line program
/cpp/bin/kokoro
# HTTP server
/cpp/bin/kokoro_srv
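Build on a PC (tested on Ubuntu 22.04).
Install the development environment:
sudo apt update
sudo apt install build-essential cmake
Download the cross-compilation toolchain: link
Add the cross-compilation toolchain path to PATH.
Build:
cd cpp
./download_bsp.sh
./build.sh
After the build completes, the binaries are in the install directory.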
./bin/kokoro
Output:
(base) root@ax650:/mnt/data/HF/kokoro.axera/cpp# ./bin/kokoro
[I][ main][ 41]: Args:
[I][ main][ 42]: axmodel_dir: ../models
[I][ main][ 43]: text: 我想留在大家身边,从过去,一同迈向明天
[I][ main][ 44]: lang: z
[I][ main][ 45]: voice_path: ./voices
[I][ main][ 46]: voice_name: zf_xiaoxiao
[I][ main][ 47]: output: output.wav
[I][ main][ 48]: fade_out: 0.00
[I][ main][ 49]: max_len: 96
[I][ init][ 306]: Loaded 114 tokens from dict/vocab.txt
[INFO] total pinyin character count: 26749
[INFO] total pinyin phrase count: 411946
Loading English G2P dict from dict/cmudict-0.7b/cmudict.dict...
[EnG2P] Loaded 126052 words from CMU dict.
[I][ main][ 90]: Init kokoro take 7.3640 seconds
[I][ main][ 109]: Audio save to output.wav
[I][ main][ 113]: RTF: 0.1322, process_time: 0.5620 seconds, audio duration: 4.25 seconds
Command-line help:
(base) root@ax650:/mnt/data/HF/kokoro.axera/cpp# ./bin/kokoro --help
usage: ./bin/kokoro [options] ...
options:
--axmodel_dir model path (string [=../models])
-t, --text Text to be generated (string [=我想留在大家身边,从过去,一同迈向明天])
-l, --lang Language code, Only support a(American English) or z(Chinese) currently (string [=z])
--voice_path Binary voices store path (string [=./voices])
-v, --voice_name Speaker voice name, check possible choices from voices/ (string [=zf_xiaoxiao])
-o, --output Output file path (string [=output.wav])
-f, --fade_out Fade out ratio between sentences (float [=0])
-m, --max_len Max input token num, fixed by model, no need to change usually (int [=96])
-?, --help print this message
./bin/kokoro_srv
Example output:
(base) root@ax650:/mnt/data/HF/kokoro.axera/cpp# ./install/kokoro_srv
[I][ main][ 32]: Args:
[I][ main][ 33]: port: 8080
[I][ main][ 34]: axmodel_dir: ../models
[I][ main][ 35]: lang: z
[I][ main][ 36]: voice_path: ./voices
[I][ main][ 37]: voice_name: zf_xiaoxiao
[I][ main][ 38]: max_len: 96
[I][ init][ 55]: Initializing Kokoro TTS model...
[I][ init][ 56]: Model path: ../models
[I][ init][ 57]: Voices path: ./voices
[I][ init][ 58]: Default voice: zf_xiaoxiao
[I][ init][ 306]: Loaded 114 tokens from dict/vocab.txt
[INFO] total pinyin character count: 26749
[INFO] total pinyin phrase count: 411946
Loading English G2P dict from dict/cmudict-0.7b/cmudict.dict...
[EnG2P] Loaded 126052 words from CMU dict.
[I][ init][ 65]: Kokoro model initialized successfully!
[I][ main][ 73]: Start TTS Server at port 8080
[I][ start][ 72]: Starting TTS server at port 8080
[I][ start][ 73]: Endpoints:
[I][ start][ 74]: POST /tts - Text-to-Speech (returns WAV audio)
[I][ start][ 75]: POST /tts_raw - Text-to-Speech (returns raw PCM)
[I][ start][ 76]: GET /health - Health check
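The `/tts_raw` endpoint is listed as returning raw PCM, which carries no header, so a client has to supply the audio format itself when saving the data. The sketch below wraps raw bytes in a WAV container using the standard-library `wave` module. The 16-bit mono 24000 Hz defaults are assumptions not confirmed by the source; check the server's actual output format before relying on them.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV header and return the file contents.

    Hypothetical helper: the default format values are assumptions.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # bytes per sample, 2 = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```

The returned bytes can be written directly to a `.wav` file or passed to an audio player.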
Server help:
(base) root@ax650:/mnt/data/HF/kokoro.axera/cpp# ./bin/kokoro_srv --help
usage: ./bin/kokoro_srv [options] ...
options:
-p, --port Server port (int [=8080])
--axmodel_dir model path (string [=../models])
-l, --lang Language code, Only support a(American English) or z(Chinese) currently (string [=z])
--voice_path Binary voices store path (string [=./voices])
-v, --voice_name Speaker voice name, check possible choices from voices/ (string [=zf_xiaoxiao])
-m, --max_len Max input token num, fixed by model, no need to change usually (int [=96])
-?, --help print this message
cd cpp
python kokoro_cli.py
This generates output.wav.
| lang | OS(MB) | CMM(MB) |
|---|---|---|
| z(Chinese) | 233 | 237 |
| a(English) | 23 | 237 |