Qwen3 Phoneme (ARPABET) Converter: T2P / P2T (Unsloth + Dolly text-only)

This repository (notebook) fine-tunes a Qwen3-family model with Unsloth SFT so that a single model handles two tasks:
(1) Text → ARPABET phonemes (T2P) and (2) ARPABET phonemes → Text (P2T).

It was built for Brain-to-text '25.

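1. What it does (overview)

T2P (Text to Phoneme)
Converts English text into an ARPABET phoneme sequence (tokens compatible with Brain-to-text '25)
P2T (Phoneme to Text)
Reconstructs English text from an ARPABET phoneme sequence (same tokens)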

2. Added token specification (important)

2.1 Control tokens

The notebook adds these with tokenizer.add_special_tokens.

START_TOKEN : <|PHONEME_START|>
END_TOKEN : <|PHONEME_END|>
T2P_TOKEN : <|CVT2P_START|> (start marker for Text→Phoneme)
P2T_TOKEN : <|CVP2T_START|> (start marker for Phoneme→Text)

2.2 Silence token

SILENCE_TOKEN : <|PNM_SIL|>

2.3 ARPABET phoneme tokens (39 phonemes)

The following 39 phonemes are added in the form <|PNM_XX|>.

AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH

Total number of added tokens (as defined in the notebook)

4 control + 1 silence + 39 phoneme = 44 tokens in total
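
The count can be checked with a small sketch that assembles the token list. The name add_tokens matches the variable passed to tokenizer.add_special_tokens in the training steps. The ordering (control, silence, then phonemes) and the names CONTROL_TOKENS and PHONEMES are assumptions made for illustration and may not match the notebook.

```python
# Sketch: assemble the 44 added tokens (the ordering is an assumption).
PHONEMES = (
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH"
).split()

CONTROL_TOKENS = [
    "<|PHONEME_START|>",
    "<|PHONEME_END|>",
    "<|CVT2P_START|>",
    "<|CVP2T_START|>",
]
SILENCE_TOKEN = "<|PNM_SIL|>"

# Control tokens, then silence, then one <|PNM_XX|> token per phoneme.
add_tokens = CONTROL_TOKENS + [SILENCE_TOKEN] + [f"<|PNM_{p}|>" for p in PHONEMES]

print(len(PHONEMES), len(add_tokens))  # 39 44
```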

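2.4 Text → phoneme sequence (`encode_text_to_pnm`) rules (per the notebook)

Each word is converted to ARPABET with g2p_en
Stress digits (e.g. IY1) are stripped, normalizing to IY
Phonemes not in PHONEME_SET are dropped
SILENCE_TOKEN (<|PNM_SIL|>) can be inserted between words
START_TOKEN / END_TOKEN are added if requested
The return value is a single string with the tokens concatenated without a separator (the notebook uses return "".join(tokens)); the only exception is empty input, which yields START_TOKEN and END_TOKEN separated by a space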
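
Because the result is one separator-free string, downstream code usually has to split it back into phonemes. A minimal regex-based sketch follows. split_pnm_string is a hypothetical helper, not part of the notebook, and it assumes the <|PNM_XX|> format with <|PNM_SIL|> as the word separator.

```python
import re
from typing import List

# Matches <|PNM_XX|> tokens (including <|PNM_SIL|>) and captures XX.
_PNM_RE = re.compile(r"<\|PNM_([A-Z]+)\|>")

def split_pnm_string(encoded: str) -> List[List[str]]:
    """Split an encoded string into per-word phoneme lists.

    START/END markers are skipped because they do not match the PNM_ pattern;
    <|PNM_SIL|> closes the current word.
    """
    words: List[List[str]] = [[]]
    for name in _PNM_RE.findall(encoded):
        if name == "SIL":
            words.append([])
        else:
            words[-1].append(name)
    # Drop empty groups (e.g. leading, trailing, or doubled silence).
    return [w for w in words if w]

encoded = (
    "<|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|>"
    "<|PNM_SIL|><|PNM_AY|><|PNM_M|><|PHONEME_END|>"
)
print(split_pnm_string(encoded))  # [['HH', 'AH', 'L', 'OW'], ['AY', 'M']]
```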

3. Prompt specification (T2P / P2T)

3.1 T2P (Text → ARPABET phonemes)

User

Convert to ARPABET phonemes:
<|CVT2P_START|>{TEXT}

Assistant (training target)

{encode_text_to_pnm(TEXT)}

The notebook builds the conversation data as follows (conceptually):

user: Convert to ARPABET phonemes:\n{T2P_TOKEN}{input_text}
assistant: encode_text_to_pnm(input_text)

3.2 P2T (ARPABET phonemes → Text)

User

Convert ARPABET phonemes to text:
<|CVP2T_START|>{PHONEME_SEQUENCE}

Assistant (training target)

{TEXT}

The notebook builds the conversation data as follows (conceptually):

user: Convert ARPABET phonemes to text:\n{P2T_TOKEN}{encode_text_to_pnm(input_text)}
assistant: input_text
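
The two prompt formats can be sketched as plain message builders. build_t2p_messages and build_p2t_messages are hypothetical names. The notebook's own formatting_prompts_func_t2p / formatting_prompts_func_p2t are not shown in this document and may differ, for example by applying the chat template directly. The phoneme string is passed in precomputed so the sketch does not depend on g2p_en.

```python
from typing import Dict, List

T2P_TOKEN = "<|CVT2P_START|>"
P2T_TOKEN = "<|CVP2T_START|>"

Message = Dict[str, str]

def build_t2p_messages(text: str, phonemes: str) -> List[Message]:
    """Text -> phonemes: the user sends text, the assistant answers with tokens."""
    return [
        {"role": "user", "content": f"Convert to ARPABET phonemes:\n{T2P_TOKEN}{text}"},
        {"role": "assistant", "content": phonemes},
    ]

def build_p2t_messages(text: str, phonemes: str) -> List[Message]:
    """Phonemes -> text: the same pair with the roles of text and tokens swapped."""
    return [
        {"role": "user", "content": f"Convert ARPABET phonemes to text:\n{P2T_TOKEN}{phonemes}"},
        {"role": "assistant", "content": text},
    ]

# Stand-in for encode_text_to_pnm("i'm"); written out by hand for this sketch.
pnm = "<|PHONEME_START|><|PNM_AY|><|PNM_M|><|PHONEME_END|>"
for msg in build_t2p_messages("i'm", pnm) + build_p2t_messages("i'm", pnm):
    print(msg["role"], "->", repr(msg["content"]))
```

Building both conversations from the same (text, phonemes) pair is what lets one dataset row feed both tasks.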

4. Training flow (notebook steps)

Define G2P and the phoneme encoding functions

PHONEME_SET / SILENCE_TOKEN / START_TOKEN / END_TOKEN / T2P_TOKEN / P2T_TOKEN
_normalize_text, word_to_arpabet, encode_text_to_pnm

Load the model and tokenizer with Unsloth

FastLanguageModel.from_pretrained(...)

Register the added tokens with the tokenizer and resize the embeddings

tokenizer.add_special_tokens({"additional_special_tokens": add_tokens})
model.resize_token_embeddings(len(tokenizer))

Set the chat template

get_chat_template(..., chat_template="qwen3-instruct")

Load the Dolly text-only dataset

load_from_disk(dir_dolly_textonly)
Expected column: text_line (the notebook sets ROW_ID="text_line")

Generate the T2P / P2T training conversations and concatenate them

formatting_prompts_func_t2p / formatting_prompts_func_p2t
Apply each to every split with map
concatenate_datasets([t2p, p2t])
Splits: train, test, and eval are built the same way

Apply LoRA via PEFT

FastLanguageModel.get_peft_model(...)

Train with SFTTrainer (SFTConfig)

SFTTrainer(model, tokenizer, train_dataset, eval_dataset, args=SFTConfig(...))

Compute the loss only on assistant responses

train_on_responses_only(trainer, instruction_part="<|im_start|>user\n", response_part="<|im_start|>assistant\n")

Run training

trainer.train()

Save

LoRA adapters: model.save_pretrained(dir_save_lora) + tokenizer.save_pretrained(dir_save_lora)
16-bit merge: model.save_pretrained_merged(dir_save_model, tokenizer, save_method="merged_16bit")
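
The step that restricts the loss to assistant responses can be illustrated on plain token-id lists. This is a conceptual sketch, not Unsloth's train_on_responses_only implementation. mask_non_response is a hypothetical helper, the token ids are made up, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
from typing import List, Sequence

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def _find_all(haystack: Sequence[int], needle: Sequence[int]) -> List[int]:
    """Return every start index at which needle occurs in haystack."""
    n = len(needle)
    return [
        i for i in range(len(haystack) - n + 1)
        if list(haystack[i:i + n]) == list(needle)
    ]

def mask_non_response(
    input_ids: List[int],
    instruction_ids: Sequence[int],
    response_ids: Sequence[int],
) -> List[int]:
    """Build labels that keep only the tokens inside assistant turns.

    A turn starts right after each occurrence of response_ids and runs until
    the next occurrence of instruction_ids (or the end of the sequence).
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = _find_all(input_ids, instruction_ids)
    for start in _find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels

# Made-up ids: [1, 2] stands for "<|im_start|>user\n",
# [1, 3] for "<|im_start|>assistant\n", and 9 for "<|im_end|>".
ids = [1, 2, 10, 11, 9, 1, 3, 20, 21, 9]
print(mask_non_response(ids, [1, 2], [1, 3]))
# [-100, -100, -100, -100, -100, -100, -100, 20, 21, 9]
```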

5. Hyperparameters (per the notebook)

5.1 Model loading

max_seq_length: 2048
load_in_4bit: False
load_in_8bit: False
full_finetuning: True
cache_dir: dir_cache (environment dependent)
chat_template: "qwen3-instruct"

5.2 LoRA (PEFT)

Settings for FastLanguageModel.get_peft_model:

r: 32
lora_alpha: 32
lora_dropout: 0
bias: "none"
use_gradient_checkpointing: "unsloth"
random_state: 3407
use_rslora: False
loftq_config: None
target_modules:
- q_proj, k_proj, v_proj, o_proj
- gate_proj, up_proj, down_proj
- lm_head, embed_tokens
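
With use_rslora set to False, LoRA implementations such as PEFT scale the low-rank update by lora_alpha / r, while rank-stabilized LoRA uses lora_alpha / sqrt(r). The sketch below only does this arithmetic for the values listed above. lora_scaling is a hypothetical helper, not a library function.

```python
import math

def lora_scaling(r: int, lora_alpha: float, use_rslora: bool = False) -> float:
    """Scaling factor applied to the LoRA update (B @ A)."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r

print(lora_scaling(32, 32))                   # 1.0
print(lora_scaling(32, 32, use_rslora=True))  # sqrt(32), about 5.657
```

With r equal to lora_alpha, the adapter update is applied at unit scale.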

5.3 SFT（SFTTrainer / SFTConfig）

dataset_text_field: "text"
per_device_train_batch_size: 24
gradient_accumulation_steps: 2 (effective batch = 24 × 2 = 48 per device)
warmup_steps: 5
num_train_epochs: 2
learning_rate: 5e-5
logging_steps: 50
optim: "adamw_8bit"
weight_decay: 0.001
lr_scheduler_type: "cosine"
seed: 3407
save_strategy: "epoch"
report_to: "none"
eval_dataset: ds_eval (note: the evaluation strategy depends on the configuration)
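
To get a feel for these settings, the sketch below computes the effective batch size, an approximate optimizer step count, and a linear-warmup-then-cosine learning rate curve. The dataset size (9600) and the single-device setup are hypothetical, since neither is stated here. lr_at follows the common warmup plus cosine-decay shape and is not guaranteed to match the trainer's scheduler step for step.

```python
import math

PER_DEVICE_BATCH = 24
GRAD_ACCUM = 2
WARMUP_STEPS = 5
EPOCHS = 2
BASE_LR = 5e-5

def effective_batch(num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return PER_DEVICE_BATCH * GRAD_ACCUM * num_devices

def total_steps(num_examples: int, num_devices: int = 1) -> int:
    """Approximate optimizer steps over all epochs, rounding each epoch up."""
    return EPOCHS * math.ceil(num_examples / effective_batch(num_devices))

def lr_at(step: int, num_training_steps: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay towards zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, num_training_steps - WARMUP_STEPS)
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

steps = total_steps(num_examples=9600)  # hypothetical dataset size
print(effective_batch(), steps)         # 48 400
for s in (0, WARMUP_STEPS, steps // 2, steps):
    print(s, lr_at(s, steps))
```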

6. Inference (T2P / P2T)

6.1 Example

from g2p_en import G2p
import re
import unicodedata
from typing import List
from IPython.display import display

# ARPABET phonemes matching the Brain-to-text '25 label set (silence ' | ' is handled separately)
PHONEME_SET = {
    'AA','AE','AH','AO','AW','AY','B','CH','D','DH','EH','ER','EY','F','G',
    'HH','IH','IY','JH','K','L','M','N','NG','OW','OY','P','R','S','SH','T',
    'TH','UH','UW','V','W','Y','Z','ZH'
}
SILENCE_TOKEN = '<|PNM_SIL|>'
START_TOKEN = '<|PHONEME_START|>'
END_TOKEN   = '<|PHONEME_END|>'
T2P_TOKEN = "<|CVT2P_START|>"
P2T_TOKEN = "<|CVP2T_START|>"
PH_PREFIX   = 'PNM_'  # change this to use a different prefix

_g2p = G2p()

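def _normalize_text(s: str) -> str:
    """Normalize symbols and strip unwanted punctuation (apostrophes are kept)."""
    s = unicodedata.normalize("NFKC", s).replace("’", "'")
    # Hyphens and underscores become spaces; other punctuation is also replaced by a space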
    s = re.sub(r"[-_]+", " ", s)
    s = re.sub(r"[^A-Za-z0-9'\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def word_to_arpabet(word: str) -> List[str]:
    """Word -> ARPABET (stress digits stripped, unknown phonemes dropped)."""
    phones = _g2p(word)  # e.g. ["AY1","M"]
    cleaned = []
    for ph in phones:
        ph = re.sub(r"\d", "", ph)  # IY1 -> IY
        if ph in PHONEME_SET:
            cleaned.append(ph)
    return cleaned

def encode_text_to_pnm(text: str,
                       add_start_end: bool = True,
                       insert_silence_between_words: bool = True) -> str:
    """
    Text -> a sequence in <|PNM_XX|> form (START/END and the silence token, ' | ' in the Brain-to-text labels, are added on request)
    """
    text = _normalize_text(text)
    if not text:
        return f"{START_TOKEN} {END_TOKEN}" if add_start_end else ""

    words = text.split()
    tokens: List[str] = []
    for i, w in enumerate(words):
        phones = word_to_arpabet(w)
        tokens.extend([f"<|{PH_PREFIX}{ph}|>" for ph in phones])
        # Insert silence between words (not after the last word)
        if insert_silence_between_words and i < len(words) - 1:
            tokens.append(SILENCE_TOKEN)

    if add_start_end:
        tokens = [START_TOKEN] + tokens + [END_TOKEN]
    return "".join(tokens)


from unsloth import FastLanguageModel
import torch
import os

fourbit_models = [
    "unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit", # Qwen 14B 2x faster
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth


dir_cache = r"/media/kurogane/kioxia1/cache"
model_id = r"/media/kurogane/kioxia1/unsloth/phenome/phenome_qwen3_06_dolly_test1_/lora_model" #"unsloth/Qwen3-0.6B" #"unsloth/Qwen3-4B-Instruct-2507"
i_ctx = 2048
b_load_in_4bit = False
b_load_in_8bit = False



dir_save_base = r"/media/kurogane/kioxia1/unsloth/phenome/phenome_qwen3_06_dolly_test1"

dir_output = os.path.join(dir_save_base, "outputs")
dir_save_lora = os.path.join(dir_save_base, "lora_model")
dir_save_model = os.path.join(dir_save_base, "model_phenome")
os.makedirs(dir_output, exist_ok=True)
os.makedirs(dir_save_lora, exist_ok=True)
os.makedirs(dir_save_model, exist_ok=True)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = i_ctx, # Choose any for long context!
    load_in_4bit = b_load_in_4bit,  # 4 bit quantization to reduce memory
    load_in_8bit = b_load_in_8bit, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = True, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
    cache_dir = dir_cache,
)

sample = "hello. i'm still in the meeting. how are you?"
s_tkns_phenomes = encode_text_to_pnm(sample)
print(f"Input: {sample}\nPhenome: {s_tkns_phenomes}")

inputs_t2p = tokenizer.apply_chat_template(
    [
        {"role" : "user", "content" : f"Convert to ARPABET phonemes:\n{T2P_TOKEN}{sample}"},
        # {"role" : "assistant", "content" : s_tkns_phenomes}
    ],
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")


inputs_p2t = tokenizer.apply_chat_template(
    [
        {"role" : "user", "content" : f"Convert ARPABET phonemes to text:\n{P2T_TOKEN}{s_tkns_phenomes}"},
        # {"role" : "assistant", "content" : sample}
    ], 
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")


from transformers import TextStreamer
# print(inputs_t2p)
print("===T2P===")
outputs = model.generate(
    **inputs_t2p,
    max_new_tokens = 128, # Increase for longer outputs!
    # Sampling settings used in this example
    temperature = 0.3, min_p = 0.15, repetition_penalty = 1.05,
    # streamer = TextStreamer(tokenizer, skip_prompt = True),
)
print(tokenizer.decode(outputs[0]))

print("===P2T===")
outputs = model.generate(
    **inputs_p2t,
    max_new_tokens = 128, # Increase for longer outputs!
    # Sampling settings used in this example
    temperature = 0.3, min_p = 0.15, repetition_penalty = 1.05,
    # streamer = TextStreamer(tokenizer, skip_prompt = True),
)
print(tokenizer.decode(outputs[0]))

Input: hello. i'm still in the meeting. how are you?
Phenome: <|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|><|PNM_SIL|><|PNM_AY|><|PNM_M|><|PNM_SIL|><|PNM_S|><|PNM_T|><|PNM_IH|><|PNM_L|><|PNM_SIL|><|PNM_IH|><|PNM_N|><|PNM_SIL|><|PNM_DH|><|PNM_AH|><|PNM_SIL|><|PNM_M|><|PNM_IY|><|PNM_T|><|PNM_IH|><|PNM_NG|><|PNM_SIL|><|PNM_HH|><|PNM_AW|><|PNM_SIL|><|PNM_AA|><|PNM_R|><|PNM_SIL|><|PNM_Y|><|PNM_UW|><|PHONEME_END|>
===T2P===
<|im_start|>user
Convert to ARPABET phonemes:
<|CVT2P_START|>hello. i'm still in the meeting. how are you?<|im_end|>
<|im_start|>assistant
<|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|><|PNM_SIL|><|PNM_AY|><|PNM_M|><|PNM_SIL|><|PNM_S|><|PNM_T|><|PNM_IH|><|PNM_L|><|PNM_SIL|><|PNM_IH|><|PNM_N|><|PNM_SIL|><|PNM_DH|><|PNM_AH|><|PNM_SIL|><|PNM_M|><|PNM_IY|><|PNM_T|><|PNM_IH|><|PNM_NG|><|PNM_SIL|><|PNM_HH|><|PNM_AW|><|PNM_SIL|><|PNM_AA|><|PNM_R|><|PNM_SIL|><|PNM_Y|><|PNM_UW|><|PHONEME_END|><|im_end|>
===P2T===
<|im_start|>user
Convert ARPABET phonemes to text:
<|CVP2T_START|><|PHONEME_START|><|PNM_HH|><|PNM_AH|><|PNM_L|><|PNM_OW|><|PNM_SIL|><|PNM_AY|><|PNM_M|><|PNM_SIL|><|PNM_S|><|PNM_T|><|PNM_IH|><|PNM_L|><|PNM_SIL|><|PNM_IH|><|PNM_N|><|PNM_SIL|><|PNM_DH|><|PNM_AH|><|PNM_SIL|><|PNM_M|><|PNM_IY|><|PNM_T|><|PNM_IH|><|PNM_NG|><|PNM_SIL|><|PNM_HH|><|PNM_AW|><|PNM_SIL|><|PNM_AA|><|PNM_R|><|PNM_SIL|><|PNM_Y|><|PNM_UW|><|PHONEME_END|><|im_end|>
<|im_start|>assistant
hello, i'm still in the meeting. how are you?<|im_end|>
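
Since the decoded output contains the whole conversation including the prompt, it can be handy to pull out just the assistant turn. A minimal string-based sketch follows. extract_assistant_text is a hypothetical helper that assumes the <|im_start|> / <|im_end|> markers seen in the output above, and the decoded string here is a shortened, hand-written stand-in for a real model output.

```python
def extract_assistant_text(decoded: str) -> str:
    """Return the text of the last assistant turn, without chat markers."""
    marker = "<|im_start|>assistant\n"
    start = decoded.rfind(marker)
    if start == -1:
        return ""  # no assistant turn found
    body = decoded[start + len(marker):]
    end = body.find("<|im_end|>")
    return (body if end == -1 else body[:end]).strip()

# Shortened, hand-written example in the same layout as the P2T output.
decoded = (
    "<|im_start|>user\n"
    "Convert ARPABET phonemes to text:\n"
    "<|CVP2T_START|><|PHONEME_START|><|PNM_AY|><|PNM_M|><|PHONEME_END|><|im_end|>\n"
    "<|im_start|>assistant\n"
    "i'm<|im_end|>"
)
print(extract_assistant_text(decoded))  # i'm
```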

Model: kurogane/Qwen3-0.6B-phenome-estimation
Model size: 0.6B params
Tensor type: BF16 (Safetensors)
Base model: Qwen/Qwen3-0.6B-Base
Finetuned: Qwen/Qwen3-0.6B