Qwen3 Phoneme (ARPABET) Converter — T2P / P2T(Unsloth + Dolly text-only)

This repository (notebook) fine-tunes a Qwen3-family model with Unsloth SFT so that a single model handles two tasks:
(1) Text → ARPABET phonemes (T2P) and (2) ARPABET phonemes → Text (P2T).

It was created for Brain-to-text '25.


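1. What it does (overview)

  • T2P (Text to Phoneme)
    Converts English text into an ARPABET phoneme sequence (tokens compatible with Brain-to-text '25)

  • P2T (Phoneme to Text)
    Recovers English text from an ARPABET phoneme sequence (same tokens)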


2. Added token specification (important)

2.1 Control tokens

The notebook adds these with tokenizer.add_special_tokens.

  • START_TOKEN : <|PHONEME_START|>
  • END_TOKEN : <|PHONEME_END|>
  • T2P_TOKEN : <|CVT2P_START|> (start marker for Text→Phoneme)
  • P2T_TOKEN : <|CVP2T_START|> (start marker for Phoneme→Text)

2.2 Silence token

  • SILENCE_TOKEN : <|PNM_SIL|>

2.3 ARPABET phoneme tokens (39 phonemes)

The following 39 phonemes are added in the form <|PNM_XX|>.

AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH

Total number of added tokens (as defined in the notebook)

  • 4 control + 1 silence + 39 phoneme = 44 tokens in total
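
As a quick check of the token count above, the following minimal sketch rebuilds the added-token list from the names in sections 2.1 to 2.3. The helper name build_added_tokens and the ordering of the list are assumptions for illustration; the notebook may assemble the list differently.

```python
# Phoneme inventory copied from section 2.3 (39 ARPABET symbols, no stress digits).
PHONEMES = (
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH"
).split()

CONTROL_TOKENS = [
    "<|PHONEME_START|>",
    "<|PHONEME_END|>",
    "<|CVT2P_START|>",
    "<|CVP2T_START|>",
]
SILENCE_TOKEN = "<|PNM_SIL|>"


def build_added_tokens():
    """Return control + silence + phoneme tokens (the order is an assumption)."""
    phoneme_tokens = [f"<|PNM_{ph}|>" for ph in PHONEMES]
    return CONTROL_TOKENS + [SILENCE_TOKEN] + phoneme_tokens


add_tokens = build_added_tokens()
print(len(add_tokens))  # 4 + 1 + 39 = 44
```

A list like this is what would be passed as additional_special_tokens before resizing the embedding matrix.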

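2.4 Text → phoneme sequence rules (encode_text_to_pnm, as in the notebook)

  • Words are converted to ARPABET with g2p_en
  • Stress digits (e.g. IY1) are stripped, normalizing to IY
  • Phonemes not in PHONEME_SET are discarded
  • SILENCE_TOKEN (<|PNM_SIL|>) can be inserted between words
  • START_TOKEN / END_TOKEN are added if requested
  • The return value is a single string with the tokens joined without a separator (the notebook uses return "".join(tokens)); the one exception is empty input with START/END enabled, where the two tokens are separated by a space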
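
Because the encoder returns one concatenated string, downstream code usually has to split it back into phonemes. The sketch below shows one way to do that with a regular expression; split_pnm_string is a hypothetical helper and is not part of the notebook.

```python
import re

# Matches any token of the form <|NAME|>, e.g. <|PNM_HH|> or <|PHONEME_START|>.
_TOKEN_RE = re.compile(r"<\|([A-Z0-9_]+)\|>")


def split_pnm_string(encoded):
    """Split an encoded string into words, each a list of bare ARPABET symbols."""
    words, current = [], []
    for name in _TOKEN_RE.findall(encoded):
        if name == "PNM_SIL":
            # The silence token marks a word boundary.
            words.append(current)
            current = []
        elif name.startswith("PNM_"):
            current.append(name[len("PNM_"):])
        # Control tokens such as PHONEME_START / PHONEME_END are skipped.
    if current:
        words.append(current)
    return words


encoded = (
    "<|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|>"
    "<|PNM_SIL|><|PNM_AY|><|PNM_M|><|PHONEME_END|>"
)
print(split_pnm_string(encoded))
```

Grouping by the silence token keeps word boundaries, which is convenient when comparing a T2P output against a reference word by word.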

3. Prompt specification (T2P / P2T)

3.1 T2P (Text → ARPABET phonemes)

User

Convert to ARPABET phonemes:
<|CVT2P_START|>{TEXT}

Assistant (training target)

{encode_text_to_pnm(TEXT)}

The notebook builds the conversation data as follows (conceptually):

  • user: Convert to ARPABET phonemes:\n{T2P_TOKEN}{input_text}
  • assistant: encode_text_to_pnm(input_text)

3.2 P2T (ARPABET phonemes → Text)

User

Convert ARPABET phonemes to text:
<|CVP2T_START|>{PHONEME_SEQUENCE}

Assistant (training target)

{TEXT}

The notebook builds the conversation data as follows (conceptually):

  • user: Convert ARPABET phonemes to text:\n{P2T_TOKEN}{encode_text_to_pnm(input_text)}
  • assistant: input_text
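
The two prompt formats above can be expressed as small message builders. This is a minimal sketch: build_t2p_messages and build_p2t_messages are hypothetical names, and the notebook's own formatting_prompts_func_t2p / formatting_prompts_func_p2t may be structured differently. The phoneme string is passed in ready-made so that the example does not depend on g2p_en.

```python
T2P_TOKEN = "<|CVT2P_START|>"
P2T_TOKEN = "<|CVP2T_START|>"


def build_t2p_messages(text, phonemes=None):
    """User turn for T2P; the assistant turn is added only when a target is given."""
    messages = [
        {"role": "user", "content": f"Convert to ARPABET phonemes:\n{T2P_TOKEN}{text}"}
    ]
    if phonemes is not None:
        messages.append({"role": "assistant", "content": phonemes})
    return messages


def build_p2t_messages(phonemes, text=None):
    """User turn for P2T; the assistant turn is added only when a target is given."""
    messages = [
        {"role": "user", "content": f"Convert ARPABET phonemes to text:\n{P2T_TOKEN}{phonemes}"}
    ]
    if text is not None:
        messages.append({"role": "assistant", "content": text})
    return messages


pnm = "<|PHONEME_START|><|PNM_HH|><|PNM_AY|><|PHONEME_END|>"
train_pair = build_t2p_messages("hi", pnm)   # two turns, usable as a training example
infer_only = build_p2t_messages(pnm)         # user turn only, usable for generation
```

With a target the result is a full training conversation; without one it is the prompt to feed to the chat template at inference time.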

4. Training flow (steps in the notebook)

  1. Define G2P and the phoneme encoding functions
    • PHONEME_SET / SILENCE_TOKEN / START_TOKEN / END_TOKEN / T2P_TOKEN / P2T_TOKEN
    • _normalize_text, word_to_arpabet, encode_text_to_pnm
  2. Load the model and tokenizer with Unsloth
    • FastLanguageModel.from_pretrained(...)
  3. Register the added tokens with the tokenizer and resize the embeddings
    • tokenizer.add_special_tokens({"additional_special_tokens": add_tokens})
    • model.resize_token_embeddings(len(tokenizer))
  4. Set the chat template
    • get_chat_template(..., chat_template="qwen3-instruct")
  5. Load the Dolly text-only dataset
    • load_from_disk(dir_dolly_textonly)
    • Expected column: text_line (the notebook sets ROW_ID="text_line")
  6. Generate the T2P / P2T training conversations and concatenate them
    • formatting_prompts_func_t2p / formatting_prompts_func_p2t
    • Applied to each split with map
    • concatenate_datasets([t2p, p2t])
    • The train, test, and eval splits are built the same way
  7. Wrap the model with PEFT using the LoRA settings
    • FastLanguageModel.get_peft_model(...)
  8. Train with SFTTrainer (SFTConfig)
    • SFTTrainer(model, tokenizer, train_dataset, eval_dataset, args=SFTConfig(...))
  9. Compute the loss on the assistant response only
    • train_on_responses_only(trainer, instruction_part="<|im_start|>user\n", response_part="<|im_start|>assistant\n")
  10. Run training
    • trainer.train()
  11. Save
    • LoRA adapters: model.save_pretrained(dir_save_lora) + tokenizer.save_pretrained(dir_save_lora)
    • 16bit merge: model.save_pretrained_merged(dir_save_model, tokenizer, save_method="merged_16bit")
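
To make the train_on_responses_only step above more concrete, here is a conceptual sketch of response-only label masking on a plain list of token ids. It is not Unsloth's implementation: the function names and the marker ids (90 to 93) are made up for illustration, and the real helper works on the tokenized instruction_part / response_part strings. The only library fact relied on is that -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def find_sub(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_to_responses(ids, instruction_ids, response_ids):
    """Keep labels only for tokens after a response marker, up to the next instruction marker."""
    labels = [IGNORE_INDEX] * len(ids)
    pos = 0
    while True:
        r = find_sub(ids, response_ids, pos)
        if r == -1:
            break
        begin = r + len(response_ids)
        nxt = find_sub(ids, instruction_ids, begin)
        end = len(ids) if nxt == -1 else nxt
        labels[begin:end] = ids[begin:end]
        pos = end
    return labels


# Hypothetical ids: [90, 91] stands for the user marker, [92, 93] for the assistant marker.
ids = [90, 91, 5, 6, 92, 93, 7, 8, 9]
print(mask_to_responses(ids, [90, 91], [92, 93]))
```

In this toy example only the last three positions keep their labels, so the prompt and the markers themselves do not contribute to the loss.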

5. Hyperparameters (as in the notebook)

5.1 Model loading

  • max_seq_length: 2048
  • load_in_4bit: False
  • load_in_8bit: False
  • full_finetuning: True
  • cache_dir: dir_cache (environment-dependent)
  • chat_template: "qwen3-instruct"

5.2 LoRA (PEFT)

Settings passed to FastLanguageModel.get_peft_model:

  • r: 32
  • lora_alpha: 32
  • lora_dropout: 0
  • bias: "none"
  • use_gradient_checkpointing: "unsloth"
  • random_state: 3407
  • use_rslora: False
  • loftq_config: None
  • target_modules:
    • q_proj, k_proj, v_proj, o_proj
    • gate_proj, up_proj, down_proj
    • lm_head, embed_tokens
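
For reference, the factor by which the LoRA update is scaled follows from r, lora_alpha, and use_rslora. The sketch below uses the standard definitions (alpha / r for plain LoRA, alpha / sqrt(r) for rank-stabilized LoRA); lora_scaling is a hypothetical helper, not a function from Unsloth or PEFT.

```python
import math


def lora_scaling(r, lora_alpha, use_rslora=False):
    """Scaling applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r                 # standard LoRA


print(lora_scaling(r=32, lora_alpha=32, use_rslora=False))  # 1.0
```

With r = lora_alpha = 32 and use_rslora disabled, the update is applied with a scale of 1.0.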

5.3 SFT (SFTTrainer / SFTConfig)

  • dataset_text_field: "text"
  • per_device_train_batch_size: 24
  • gradient_accumulation_steps: 2 (effective batch size = 48)
  • warmup_steps: 5
  • num_train_epochs: 2
  • learning_rate: 5e-5
  • logging_steps: 50
  • optim: "adamw_8bit"
  • weight_decay: 0.001
  • lr_scheduler_type: "cosine"
  • seed: 3407
  • save_strategy: "epoch"
  • report_to: "none"
  • eval_dataset: ds_eval (note: the evaluation strategy depends on the configuration)
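
The batch-size and schedule values above can be worked through numerically. The sketch below assumes a single device and the usual linear-warmup plus cosine-decay shape; the helper names and total_steps = 1000 are hypothetical, since the real step count depends on the dataset size.

```python
import math


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Samples contributing to one optimizer step (num_devices=1 is an assumption)."""
    return per_device * grad_accum * num_devices


def lr_at_step(step, total_steps, base_lr=5e-5, warmup_steps=5):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(effective_batch_size(24, 2))  # 48
for step in (0, 5, 500, 1000):      # total_steps=1000 is a made-up value
    print(step, lr_at_step(step, total_steps=1000))
```

With only 5 warmup steps the learning rate reaches its peak of 5e-5 almost immediately, so nearly the whole run is spent on the cosine decay.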

6. Inference (T2P / P2T)

6.1 Example

from g2p_en import G2p
import re
import unicodedata
from typing import List
from IPython.display import display

# ARPABET phonemes matching the Brain-to-text '25 label set (silence ' | ' is handled separately)
PHONEME_SET = {
    'AA','AE','AH','AO','AW','AY','B','CH','D','DH','EH','ER','EY','F','G',
    'HH','IH','IY','JH','K','L','M','N','NG','OW','OY','P','R','S','SH','T',
    'TH','UH','UW','V','W','Y','Z','ZH'
}
SILENCE_TOKEN = '<|PNM_SIL|>'
START_TOKEN = '<|PHONEME_START|>'
END_TOKEN   = '<|PHONEME_END|>'
T2P_TOKEN = "<|CVT2P_START|>"
P2T_TOKEN = "<|CVP2T_START|>"
PH_PREFIX   = 'PNM_'  # change this if you want a different prefix

_g2p = G2p()

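def _normalize_text(s: str) -> str:
    """Normalize symbols and drop unwanted ones (apostrophes are kept)."""
    s = unicodedata.normalize("NFKC", s).replace("’", "'")
    # Hyphens and underscores become spaces; other punctuation is replaced with a space too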
    s = re.sub(r"[-_]+", " ", s)
    s = re.sub(r"[^A-Za-z0-9'\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def word_to_arpabet(word: str) -> List[str]:
    """Word -> ARPABET (stress digits stripped, unknown symbols dropped)."""
    phones = _g2p(word)  # e.g. ["AY1","M"]
    cleaned = []
    for ph in phones:
        ph = re.sub(r"\d", "", ph)  # IY1 -> IY
        if ph in PHONEME_SET:
            cleaned.append(ph)
    return cleaned

def encode_text_to_pnm(text: str,
                       add_start_end: bool = True,
                       insert_silence_between_words: bool = True) -> str:
    """
    Text -> sequence of <|PNM_XX|> tokens (adds START/END and silence ' | ' if requested)
    """
    text = _normalize_text(text)
    if not text:
        return f"{START_TOKEN} {END_TOKEN}" if add_start_end else ""

    words = text.split()
    tokens: List[str] = []
    for i, w in enumerate(words):
        phones = word_to_arpabet(w)
        tokens.extend([f"<|{PH_PREFIX}{ph}|>" for ph in phones])
        # Insert silence ' | ' between words (not after the last word)
        if insert_silence_between_words and i < len(words) - 1:
            tokens.append(SILENCE_TOKEN)

    if add_start_end:
        tokens = [START_TOKEN] + tokens + [END_TOKEN]
    return "".join(tokens)


from unsloth import FastLanguageModel
import torch
import os

fourbit_models = [
    "unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit", # Qwen3 4B Instruct, pre-quantized to 4-bit
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth


dir_cache = r"/media/kurogane/kioxia1/cache"
model_id = r"/media/kurogane/kioxia1/unsloth/phenome/phenome_qwen3_06_dolly_test1_/lora_model" #"unsloth/Qwen3-0.6B" #"unsloth/Qwen3-4B-Instruct-2507"
i_ctx = 2048
b_load_in_4bit = False
b_load_in_8bit = False



dir_save_base = r"/media/kurogane/kioxia1/unsloth/phenome/phenome_qwen3_06_dolly_test1"

dir_output = os.path.join(dir_save_base, "outputs")
dir_save_lora = os.path.join(dir_save_base, "lora_model")
dir_save_model = os.path.join(dir_save_base, "model_phenome")
os.makedirs(dir_output, exist_ok=True)
os.makedirs(dir_save_lora, exist_ok=True)
os.makedirs(dir_save_model, exist_ok=True)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = i_ctx, # Choose any for long context!
    load_in_4bit = b_load_in_4bit,  # 4 bit quantization to reduce memory
    load_in_8bit = b_load_in_8bit, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = True, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
    cache_dir = dir_cache,
)

sample = "hello. i'm still in the meeting. how are you?"
s_tkns_phenomes = encode_text_to_pnm(sample)
print(f"Input: {sample}\nPhenome: {s_tkns_phenomes}")

inputs_t2p = tokenizer.apply_chat_template(
    [
        {"role" : "user", "content" : f"Convert to ARPABET phonemes:\n{T2P_TOKEN}{sample}"},
        # {"role" : "assistant", "content" : s_tkns_phenomes}
    ],
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")


inputs_p2t = tokenizer.apply_chat_template(
    [
        {"role" : "user", "content" : f"Convert ARPABET phonemes to text:\n{P2T_TOKEN}{s_tkns_phenomes}"},
        # {"role" : "assistant", "content" : sample}
    ], 
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")


from transformers import TextStreamer
# print(inputs_t2p)
print("===T2P===")
_ = model.generate(
    **inputs_t2p,
    max_new_tokens = 128, # Increase for longer outputs!
    # Recommended Liquid settings!
    temperature = 0.3, min_p = 0.15, repetition_penalty = 1.05,
    # streamer = TextStreamer(tokenizer, skip_prompt = True),
)
print(tokenizer.decode(_[0]))

print("===P2T===")
_ = model.generate(
    **inputs_p2t,
    max_new_tokens = 128, # Increase for longer outputs!
    # Recommended Liquid settings!
    temperature = 0.3, min_p = 0.15, repetition_penalty = 1.05,
    # streamer = TextStreamer(tokenizer, skip_prompt = True),
)
print(tokenizer.decode(_[0]))

Input: hello. i'm still in the meeting. how are you?
Phenome: <|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|><|PNM_SIL|><|PNM_AY|><|PNM_M|><|PNM_SIL|><|PNM_S|><|PNM_T|><|PNM_IH|><|PNM_L|><|PNM_SIL|><|PNM_IH|><|PNM_N|><|PNM_SIL|><|PNM_DH|><|PNM_AH|><|PNM_SIL|><|PNM_M|><|PNM_IY|><|PNM_T|><|PNM_IH|><|PNM_NG|><|PNM_SIL|><|PNM_HH|><|PNM_AW|><|PNM_SIL|><|PNM_AA|><|PNM_R|><|PNM_SIL|><|PNM_Y|><|PNM_UW|><|PHONEME_END|>
===T2P===
<|im_start|>user
Convert to ARPABET phonemes:
<|CVT2P_START|>hello. i'm still in the meeting. how are you?<|im_end|>
<|im_start|>assistant
<|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|><|PNM_SIL|><|PNM_AY|><|PNM_M|><|PNM_SIL|><|PNM_S|><|PNM_T|><|PNM_IH|><|PNM_L|><|PNM_SIL|><|PNM_IH|><|PNM_N|><|PNM_SIL|><|PNM_DH|><|PNM_AH|><|PNM_SIL|><|PNM_M|><|PNM_IY|><|PNM_T|><|PNM_IH|><|PNM_NG|><|PNM_SIL|><|PNM_HH|><|PNM_AW|><|PNM_SIL|><|PNM_AA|><|PNM_R|><|PNM_SIL|><|PNM_Y|><|PNM_UW|><|PHONEME_END|><|im_end|>
===P2T===
<|im_start|>user
Convert ARPABET phonemes to text:
<|CVP2T_START|><|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|><|PNM_SIL|><|PNM_AY|><|PNM_M|><|PNM_SIL|><|PNM_S|><|PNM_T|><|PNM_IH|><|PNM_L|><|PNM_SIL|><|PNM_IH|><|PNM_N|><|PNM_SIL|><|PNM_DH|><|PNM_AH|><|PNM_SIL|><|PNM_M|><|PNM_IY|><|PNM_T|><|PNM_IH|><|PNM_NG|><|PNM_SIL|><|PNM_HH|><|PNM_AW|><|PNM_SIL|><|PNM_AA|><|PNM_R|><|PNM_SIL|><|PNM_Y|><|PNM_UW|><|PHONEME_END|><|im_end|>
<|im_start|>assistant
hello, i'm still in the meeting. how are you?<|im_end|>
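
Since tokenizer.decode returns the whole conversation, the prompt usually has to be stripped before the result is used. The following minimal sketch pulls out the last assistant turn using the <|im_start|> / <|im_end|> markers visible in the output above; extract_assistant_reply is a hypothetical helper, not part of the notebook.

```python
def extract_assistant_reply(decoded):
    """Return the text of the last assistant turn in a decoded chat string."""
    start_marker = "<|im_start|>assistant\n"
    end_marker = "<|im_end|>"
    start = decoded.rfind(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    end = decoded.find(end_marker, start)
    reply = decoded[start:] if end == -1 else decoded[start:end]
    return reply.strip()


decoded = (
    "<|im_start|>user\n"
    "Convert ARPABET phonemes to text:\n"
    "<|CVP2T_START|><|PHONEME_START|><|PNM_HH|><|PNM_AY|><|PHONEME_END|><|im_end|>\n"
    "<|im_start|>assistant\n"
    "hi<|im_end|>"
)
print(extract_assistant_reply(decoded))
```

An alternative that avoids string parsing is to decode only the newly generated ids, that is, the part of the output after the prompt length.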

Model: kurogane/Qwen3-0.6B-phenome-estimation, fine-tuned from Qwen/Qwen3-0.6B (Safetensors, 0.6B params, BF16)