Qwen3 Phoneme (ARPABET) Converter — T2P / P2T(Unsloth + Dolly text-only)
In this repository (notebook), a Qwen3-family model is fine-tuned (SFT) with Unsloth so that a single model handles two tasks: (1) Text → ARPABET phonemes (T2P) and (2) ARPABET phonemes → Text (P2T).
It was created for Brain-to-text '25.
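1. What it does (overview)
- T2P (Text to Phoneme): converts English text into an ARPABET phoneme sequence (Brain-to-text '25 compatible tokens)
- P2T (Phoneme to Text): restores English text from an ARPABET phoneme sequence (same tokens)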
2. Added token specification (important)
2.1 Control tokens
These are added in the notebook via tokenizer.add_special_tokens.
- START_TOKEN: <|PHONEME_START|>
- END_TOKEN: <|PHONEME_END|>
- T2P_TOKEN: <|CVT2P_START|> (start marker for Text→Phoneme)
- P2T_TOKEN: <|CVP2T_START|> (start marker for Phoneme→Text)
2.2 Silence token
- SILENCE_TOKEN: <|PNM_SIL|>
2.3 ARPABET phoneme tokens (39 phonemes)
The following 39 phonemes are added in the form <|PNM_XX|>.
AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH
Total number of added tokens (as defined in the notebook)
- 4 control + 1 silence + 39 phoneme = 44 tokens in total
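The token inventory above can be assembled programmatically. The following is a minimal sketch, not the notebook's own code: the helper name build_add_tokens and the ordering of the tokens are assumptions made for illustration, while the token strings themselves are the ones listed in this section.

```python
# The 39 ARPABET phonemes listed in section 2.3.
PHONEMES = (
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH"
).split()

# Control tokens from section 2.1 and the silence token from section 2.2.
CONTROL_TOKENS = [
    "<|PHONEME_START|>",
    "<|PHONEME_END|>",
    "<|CVT2P_START|>",
    "<|CVP2T_START|>",
]
SILENCE_TOKEN = "<|PNM_SIL|>"


def build_add_tokens() -> list[str]:
    """Hypothetical helper: control + silence + phoneme tokens (order is an assumption)."""
    phoneme_tokens = [f"<|PNM_{ph}|>" for ph in PHONEMES]
    return CONTROL_TOKENS + [SILENCE_TOKEN] + phoneme_tokens


add_tokens = build_add_tokens()
print(len(add_tokens))  # 4 + 1 + 39
```

A list like this is what would be passed as "additional_special_tokens" to tokenizer.add_special_tokens before resizing the embeddings.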
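2.4 Text → phoneme sequence rules (encode_text_to_pnm, as in the notebook)
- Words are converted to ARPABET with g2p_en
- Stress digits are removed (e.g. IY1 is normalized to IY)
- Phonemes not contained in PHONEME_SET are discarded
- SILENCE_TOKEN (<|PNM_SIL|>) can be inserted between words
- START_TOKEN / END_TOKEN are added if requested
- The return value is, in general, a single string with the tokens concatenated without separators (the notebook uses return "".join(tokens))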
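To make the rules above concrete without depending on g2p_en, here is a small sketch that applies them to G2P-style output written by hand. The function name phones_to_tokens and the example phone lists are hypothetical; the real pipeline obtains the phones from g2p_en, as shown in the inference example in section 6.1.

```python
import re

PHONEME_SET = set(
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH".split()
)
SILENCE_TOKEN = "<|PNM_SIL|>"
START_TOKEN = "<|PHONEME_START|>"
END_TOKEN = "<|PHONEME_END|>"


def phones_to_tokens(words_phones, add_start_end=True, insert_silence=True):
    """Hypothetical helper: apply the section 2.4 rules to per-word phone lists."""
    tokens = []
    for i, phones in enumerate(words_phones):
        for ph in phones:
            ph = re.sub(r"\d", "", ph)      # strip stress digits: IY1 -> IY
            if ph in PHONEME_SET:           # discard anything outside the set
                tokens.append(f"<|PNM_{ph}|>")
        # silence between words, but not after the last word
        if insert_silence and i < len(words_phones) - 1:
            tokens.append(SILENCE_TOKEN)
    if add_start_end:
        tokens = [START_TOKEN] + tokens + [END_TOKEN]
    return "".join(tokens)


# Hand-written phones for "i'm in"; the "?" stands in for a symbol outside PHONEME_SET.
example = [["AY1", "M"], ["IH0", "N", "?"]]
print(phones_to_tokens(example))
```

The "?" entry is dropped silently, which mirrors the "discard unknown phonemes" rule; whether silent dropping is desirable for a given dataset is a design choice worth checking.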
3. Prompt specification (T2P / P2T)
3.1 T2P (Text → ARPABET phonemes)
User
Convert to ARPABET phonemes:
<|CVT2P_START|>{TEXT}
Assistant (training target)
{encode_text_to_pnm(TEXT)}
In the notebook, the conversation data is generated as follows (conceptually):
- user: Convert to ARPABET phonemes:\n{T2P_TOKEN}{input_text}
- assistant: encode_text_to_pnm(input_text)
3.2 P2T (ARPABET phonemes → Text)
User
Convert ARPABET phonemes to text:
<|CVP2T_START|>{PHONEME_SEQUENCE}
Assistant (training target)
{TEXT}
In the notebook, the conversation data is generated as follows (conceptually):
- user: Convert ARPABET phonemes to text:\n{P2T_TOKEN}{encode_text_to_pnm(input_text)}
- assistant: input_text
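The two prompt formats can be expressed as a pair of small builders. This is a sketch under assumptions: the function names make_t2p_conversation and make_p2t_conversation are hypothetical (the notebook uses formatting_prompts_func_t2p / formatting_prompts_func_p2t, whose bodies are not shown here), and the phoneme string is passed in precomputed rather than produced by encode_text_to_pnm.

```python
T2P_TOKEN = "<|CVT2P_START|>"
P2T_TOKEN = "<|CVP2T_START|>"


def make_t2p_conversation(text: str, phonemes: str) -> list[dict]:
    """Hypothetical builder: user gives text, assistant answers with phoneme tokens."""
    return [
        {"role": "user", "content": f"Convert to ARPABET phonemes:\n{T2P_TOKEN}{text}"},
        {"role": "assistant", "content": phonemes},
    ]


def make_p2t_conversation(text: str, phonemes: str) -> list[dict]:
    """Hypothetical builder: user gives phoneme tokens, assistant answers with text."""
    return [
        {"role": "user", "content": f"Convert ARPABET phonemes to text:\n{P2T_TOKEN}{phonemes}"},
        {"role": "assistant", "content": text},
    ]


# In the real pipeline, `pnm` would come from encode_text_to_pnm(text).
text = "i'm"
pnm = "<|PHONEME_START|><|PNM_AY|><|PNM_M|><|PHONEME_END|>"
t2p = make_t2p_conversation(text, pnm)
p2t = make_p2t_conversation(text, pnm)
print(t2p[0]["content"])
print(p2t[0]["content"])
```

Each message list can then be rendered with tokenizer.apply_chat_template to obtain the training text; both directions are derived from the same source sentence, so the two tasks see mirrored data.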
4. Training flow (notebook steps)
- Define G2P and the phoneme encoding functions
  - PHONEME_SET / SILENCE_TOKEN / START_TOKEN / END_TOKEN / T2P_TOKEN / P2T_TOKEN
  - _normalize_text, word_to_arpabet, encode_text_to_pnm
- Load the model and tokenizer with Unsloth
  - FastLanguageModel.from_pretrained(...)
- Register the added tokens with the tokenizer and resize the embeddings
  - tokenizer.add_special_tokens({"additional_special_tokens": add_tokens})
  - model.resize_token_embeddings(len(tokenizer))
- Set the chat template
  - get_chat_template(..., chat_template="qwen3-instruct")
- Load the Dolly text-only dataset
  - load_from_disk(dir_dolly_textonly)
  - Expected column: text_line (ROW_ID="text_line" in the notebook)
- Generate the T2P / P2T training conversation text and concatenate it
  - formatting_prompts_func_t2p / formatting_prompts_func_p2t
  - Applied to each split with map
  - concatenate_datasets([t2p, p2t])
  - Splits: train, test, and eval are built the same way
- Wrap the model for PEFT with the LoRA settings
  - FastLanguageModel.get_peft_model(...)
- Train with SFTTrainer (SFTConfig)
  - SFTTrainer(model, tokenizer, train_dataset, eval_dataset, args=SFTConfig(...))
- Compute the loss on the assistant responses only
  - train_on_responses_only(trainer, instruction_part="<|im_start|>user\n", response_part="<|im_start|>assistant\n")
- Run training
  - trainer.train()
- Save
  - LoRA adapters: model.save_pretrained(dir_save_lora) + tokenizer.save_pretrained(dir_save_lora)
  - 16-bit merge: model.save_pretrained_merged(dir_save_model, tokenizer, save_method="merged_16bit")
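The "loss on assistant responses only" step can be illustrated with a toy label-masking routine. This is a conceptual sketch, not Unsloth's implementation: the function response_only_labels is hypothetical, and it treats each marker as a single pre-split piece, whereas the real train_on_responses_only works on tokenized sequences. The value -100 is the default ignore_index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def response_only_labels(pieces, ids, instruction_part, response_part):
    """Toy masking: keep label ids only for pieces that follow a response marker."""
    labels = []
    in_response = False
    for piece, tok_id in zip(pieces, ids):
        if piece == instruction_part:
            in_response = False          # a new user turn starts: mask again
        labels.append(tok_id if in_response else IGNORE_INDEX)
        if piece == response_part:
            in_response = True           # everything after this marker is trained on
    return labels


pieces = [
    "<|im_start|>user\n", "Convert to ARPABET phonemes:", "<|im_end|>\n",
    "<|im_start|>assistant\n", "<|PNM_AY|>", "<|im_end|>\n",
]
ids = [1, 2, 3, 4, 5, 6]  # made-up token ids for illustration
labels = response_only_labels(
    pieces, ids,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
print(labels)
```

Only the positions after the assistant marker keep their ids, so the prompt (including the task marker token and the input text or phonemes) does not contribute to the loss.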
5. Hyperparameters (as in the notebook)
5.1 Model loading
- max_seq_length: 2048
- load_in_4bit: False
- load_in_8bit: False
- full_finetuning: True
- cache_dir: dir_cache (environment dependent)
- chat_template: "qwen3-instruct"
5.2 LoRA (PEFT)
Settings passed to FastLanguageModel.get_peft_model:
- r: 32
- lora_alpha: 32
- lora_dropout: 0
- bias: "none"
- use_gradient_checkpointing: "unsloth"
- random_state: 3407
- use_rslora: False
- loftq_config: None
- target_modules:
  - q_proj, k_proj, v_proj, o_proj
  - gate_proj, up_proj, down_proj
  - lm_head, embed_tokens
5.3 SFT (SFTTrainer / SFTConfig)
- dataset_text_field: "text"
- per_device_train_batch_size: 24
- gradient_accumulation_steps: 2 (effective batch = 48)
- warmup_steps: 5
- num_train_epochs: 2
- learning_rate: 5e-5
- logging_steps: 50
- optim: "adamw_8bit"
- weight_decay: 0.001
- lr_scheduler_type: "cosine"
- seed: 3407
- save_strategy: "epoch"
- report_to: "none"
- eval_dataset: ds_eval (note: the evaluation strategy depends on the configuration)
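The effective batch size quoted above follows directly from the two batch settings. The sketch below reproduces that arithmetic and gives a rough estimate of optimizer steps; the dataset size of 30,000 examples and the single-device assumption are hypothetical, and the exact step count reported by the trainer may differ slightly depending on how the last partial batch is handled.

```python
import math


def estimate_steps(num_examples, per_device_batch=24, grad_accum=2,
                   num_devices=1, epochs=2):
    """Rough estimate of effective batch size and optimizer steps (an approximation)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# 30_000 is a made-up dataset size used only to show the calculation.
effective_batch, steps_per_epoch, total_steps = estimate_steps(30_000)
print(effective_batch, steps_per_epoch, total_steps)
```

With warmup_steps set to 5, warmup covers only a very small fraction of such a run, so the cosine schedule governs almost all of training.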
6. Inference (T2P / P2T)
6.1 Example
from g2p_en import G2p
import re
import unicodedata
from typing import List
from IPython.display import display
# ARPABET phonemes matching the Brain-to-text '25 label set (silence ' | ' is handled separately)
PHONEME_SET = {
    'AA','AE','AH','AO','AW','AY','B','CH','D','DH','EH','ER','EY','F','G',
    'HH','IH','IY','JH','K','L','M','N','NG','OW','OY','P','R','S','SH','T',
    'TH','UH','UW','V','W','Y','Z','ZH'
}
SILENCE_TOKEN = '<|PNM_SIL|>'
START_TOKEN = '<|PHONEME_START|>'
END_TOKEN = '<|PHONEME_END|>'
T2P_TOKEN = "<|CVT2P_START|>"
P2T_TOKEN = "<|CVP2T_START|>"
PH_PREFIX = 'PNM_'  # change this to use a different token prefix
_g2p = G2p()
def _normalize_text(s: str) -> str:
    """Normalize symbols and remove unneeded punctuation (apostrophes are kept)."""
    s = unicodedata.normalize("NFKC", s).replace("’", "'")
    # Hyphens and underscores become spaces; other punctuation is replaced by spaces
    s = re.sub(r"[-_]+", " ", s)
    s = re.sub(r"[^A-Za-z0-9'\s]", " ", s)
    # Collapse runs of whitespace
    s = re.sub(r"\s+", " ", s).strip()
    return s
def word_to_arpabet(word: str) -> List[str]:
    """Word -> ARPABET (stress digits removed, unknown symbols dropped)."""
    phones = _g2p(word)  # e.g. ["AY1","M"]
    cleaned = []
    for ph in phones:
        ph = re.sub(r"\d", "", ph)  # IY1 -> IY
        if ph in PHONEME_SET:
            cleaned.append(ph)
    return cleaned
def encode_text_to_pnm(text: str,
                       add_start_end: bool = True,
                       insert_silence_between_words: bool = True) -> str:
    """
    Text -> sequence of <|PNM_XX|> tokens (optionally with START/END and silence ' | ').
    """
    text = _normalize_text(text)
    if not text:
        return f"{START_TOKEN} {END_TOKEN}" if add_start_end else ""
    words = text.split()
    tokens: List[str] = []
    for i, w in enumerate(words):
        phones = word_to_arpabet(w)
        tokens.extend([f"<|{PH_PREFIX}{ph}|>" for ph in phones])
        # Insert silence ' | ' between words (not after the last word)
        if insert_silence_between_words and i < len(words) - 1:
            tokens.append(SILENCE_TOKEN)
    if add_start_end:
        tokens = [START_TOKEN] + tokens + [END_TOKEN]
    return "".join(tokens)
from unsloth import FastLanguageModel
import torch
import os
fourbit_models = [
    "unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit", # Qwen 14B 2x faster
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth
dir_cache = r"/media/kurogane/kioxia1/cache"
model_id = r"/media/kurogane/kioxia1/unsloth/phenome/phenome_qwen3_06_dolly_test1_/lora_model" #"unsloth/Qwen3-0.6B" #"unsloth/Qwen3-4B-Instruct-2507"
i_ctx = 2048
b_load_in_4bit = False
b_load_in_8bit = False
dir_save_base = r"/media/kurogane/kioxia1/unsloth/phenome/phenome_qwen3_06_dolly_test1"
dir_output = os.path.join(dir_save_base, "outputs")
dir_save_lora = os.path.join(dir_save_base, "lora_model")
dir_save_model = os.path.join(dir_save_base, "model_phenome")
os.makedirs(dir_output, exist_ok=True)
os.makedirs(dir_save_lora, exist_ok=True)
os.makedirs(dir_save_model, exist_ok=True)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = i_ctx, # Choose any for long context!
    load_in_4bit = b_load_in_4bit, # 4 bit quantization to reduce memory
    load_in_8bit = b_load_in_8bit, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = True, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
    cache_dir = dir_cache,
)
sample = "hello. i'm still in the meeting. how are you?"
s_tkns_phenomes = encode_text_to_pnm(sample)
print(f"Input: {sample}\nPhenome: {s_tkns_phenomes}")
inputs_t2p = tokenizer.apply_chat_template(
    [
        {"role" : "user", "content" : f"Convert to ARPABET phonemes:\n{T2P_TOKEN}{sample}"},
        # {"role" : "assistant", "content" : s_tkns_phenomes}
    ],
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")
inputs_p2t = tokenizer.apply_chat_template(
    [
        {"role" : "user", "content" : f"Convert ARPABET phonemes to text:\n{P2T_TOKEN}{s_tkns_phenomes}"},
        # {"role" : "assistant", "content" : sample}
    ],
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")
from transformers import TextStreamer
# print(inputs_t2p)
print("===T2P===")
_ = model.generate(
    **inputs_t2p,
    max_new_tokens = 128, # Increase for longer outputs!
    # Recommended Liquid settings!
    temperature = 0.3, min_p = 0.15, repetition_penalty = 1.05,
    # streamer = TextStreamer(tokenizer, skip_prompt = True),
)
print(tokenizer.decode(_[0]))
print("===P2T===")
_ = model.generate(
    **inputs_p2t,
    max_new_tokens = 128, # Increase for longer outputs!
    # Recommended Liquid settings!
    temperature = 0.3, min_p = 0.15, repetition_penalty = 1.05,
    # streamer = TextStreamer(tokenizer, skip_prompt = True),
)
print(tokenizer.decode(_[0]))
Input: hello. i'm still in the meeting. how are you?
Phenome: <|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|><|PNM_SIL|><|PNM_AY|><|PNM_M|><|PNM_SIL|><|PNM_S|><|PNM_T|><|PNM_IH|><|PNM_L|><|PNM_SIL|><|PNM_IH|><|PNM_N|><|PNM_SIL|><|PNM_DH|><|PNM_AH|><|PNM_SIL|><|PNM_M|><|PNM_IY|><|PNM_T|><|PNM_IH|><|PNM_NG|><|PNM_SIL|><|PNM_HH|><|PNM_AW|><|PNM_SIL|><|PNM_AA|><|PNM_R|><|PNM_SIL|><|PNM_Y|><|PNM_UW|><|PHONEME_END|>
===T2P===
<|im_start|>user
Convert to ARPABET phonemes:
<|CVT2P_START|>hello. i'm still in the meeting. how are you?<|im_end|>
<|im_start|>assistant
<|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|><|PNM_SIL|><|PNM_AY|><|PNM_M|><|PNM_SIL|><|PNM_S|><|PNM_T|><|PNM_IH|><|PNM_L|><|PNM_SIL|><|PNM_IH|><|PNM_N|><|PNM_SIL|><|PNM_DH|><|PNM_AH|><|PNM_SIL|><|PNM_M|><|PNM_IY|><|PNM_T|><|PNM_IH|><|PNM_NG|><|PNM_SIL|><|PNM_HH|><|PNM_AW|><|PNM_SIL|><|PNM_AA|><|PNM_R|><|PNM_SIL|><|PNM_Y|><|PNM_UW|><|PHONEME_END|><|im_end|>
===P2T===
<|im_start|>user
Convert ARPABET phonemes to text:
<|CVP2T_START|><|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|><|PNM_SIL|><|PNM_AY|><|PNM_M|><|PNM_SIL|><|PNM_S|><|PNM_T|><|PNM_IH|><|PNM_L|><|PNM_SIL|><|PNM_IH|><|PNM_N|><|PNM_SIL|><|PNM_DH|><|PNM_AH|><|PNM_SIL|><|PNM_M|><|PNM_IY|><|PNM_T|><|PNM_IH|><|PNM_NG|><|PNM_SIL|><|PNM_HH|><|PNM_AW|><|PNM_SIL|><|PNM_AA|><|PNM_R|><|PNM_SIL|><|PNM_Y|><|PNM_UW|><|PHONEME_END|><|im_end|>
<|im_start|>assistant
hello, i'm still in the meeting. how are you?<|im_end|>
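For downstream use, the decoded T2P output usually needs to be turned back into plain ARPABET symbols. The following is a minimal sketch of such a parser; the function name extract_phonemes is hypothetical and it assumes the token format shown in the output above (a START/END pair wrapping <|PNM_XX|> tokens, with <|PNM_SIL|> separating words).

```python
import re

PNM_PATTERN = re.compile(r"<\|PNM_([A-Z]+)\|>")
SPAN_PATTERN = re.compile(r"<\|PHONEME_START\|>(.*?)<\|PHONEME_END\|>", flags=re.S)


def extract_phonemes(decoded: str) -> list[list[str]]:
    """Hypothetical parser: decoded text -> list of words, each a list of ARPABET symbols."""
    # Use the first START..END span if present, otherwise the whole string.
    match = SPAN_PATTERN.search(decoded)
    body = match.group(1) if match else decoded
    words = []
    for chunk in body.split("<|PNM_SIL|>"):   # silence tokens mark word boundaries
        phones = PNM_PATTERN.findall(chunk)
        if phones:
            words.append(phones)
    return words


decoded = (
    "<|im_start|>assistant\n"
    "<|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|>"
    "<|PNM_SIL|><|PNM_AY|><|PNM_M|><|PHONEME_END|><|im_end|>"
)
words = extract_phonemes(decoded)
print(words)
```

Note that in the P2T direction the prompt itself contains a START/END span, so a parser like this is intended for the T2P output only.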