---
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-en-zh
tags:
- generated_from_trainer
- translation
- machine-translation
- english
- traditional-chinese
- transformer
- fine-tuned
datasets:
- agentlans/en-zhtw-google-translate
language:
- en
- zh
pipeline_tag: translation
---
<details>
<summary>English-to-Traditional Chinese Translator</summary>

This model is a fine-tuned version of [Helsinki-NLP/opus-mt-en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh), trained on the [agentlans/en-zhtw-google-translate](https://huggingface.co/datasets/agentlans/en-zhtw-google-translate) dataset.

It is optimized to produce **Traditional Chinese translations by default**, enhancing the naturalness and fluency of the output.

## Model Description

- **Input:** English text only
- **Output:** Traditional Chinese translation

</details>

<details>
<summary>英文至繁體中文翻譯模型</summary>

本模型為 [Helsinki-NLP/opus-mt-en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh) 的微調版本,使用 [agentlans/en-zhtw-google-translate](https://huggingface.co/datasets/agentlans/en-zhtw-google-translate) 資料集進行訓練。

模型已針對輸出繁體中文進行最佳化,提升了翻譯結果的自然度與流暢性。

## 模型說明

- **輸入:** 僅支援英文文本
- **輸出:** 繁體中文翻譯

</details>

## How to use / 如何使用

```python
from transformers import pipeline

# Load the translation model
# 載入翻譯模型
model_checkpoint = "agentlans/en-zhtw"
translator = pipeline("translation", model=model_checkpoint)

# Convert English punctuation marks to their Traditional Chinese equivalents.
# 將英文標點符號轉換為繁體中文標點。
def en_to_zh_punct(text):
    punct = {
        '!': '!', '?': '?', ',': ',', '.': '。',
        ':': ':', ';': ';', '(': '(', ')': ')',
        '[': '【', ']': '】', '{': '{', '}': '}'
    }
    result, in_dq, in_sq = [], False, False
    for ch in text:
        if ch == '"':
            # Alternate between opening 「 and closing 」 double quotes
            result.append("」" if in_dq else "「")
            in_dq = not in_dq
        elif ch == "'":
            # Alternate between opening 『 and closing 』 single quotes
            result.append("』" if in_sq else "『")
            in_sq = not in_sq
        else:
            result.append(punct.get(ch, ch))
    return "".join(result)

# The main function for translating English to Traditional Chinese
# 將英文翻譯成繁體中文的主要函式
def translate(en_text):
    return [en_to_zh_punct(x["translation_text"]) for x in translator(en_text)]

# Example
# 範例
translate(
    [
        "Trump announces new tariffs on penguin islands. The penguins plan to tax U.S. imports in retaliation.",
        "We now return to the White House for the latest developments on the trade war.",
    ]
)
# ['川普宣佈對企鵝島徵收新關稅,企鵝打算對美國進口產品徵稅報復。', '我們現在回到白宮尋找貿易戰的最新發展。']
```
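
If you prefer to call the model directly rather than through `pipeline`, the minimal sketch below loads it with `AutoTokenizer` and `AutoModelForSeq2SeqLM` and reuses the `en_to_zh_punct` helper defined above. The generation settings (`num_beams`, `max_new_tokens`) are illustrative assumptions, not values prescribed by this model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("agentlans/en-zhtw")
model = AutoModelForSeq2SeqLM.from_pretrained("agentlans/en-zhtw")

def translate_direct(texts):
    # Tokenize a batch of English sentences and generate translations.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model.generate(**inputs, num_beams=4, max_new_tokens=128)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return [en_to_zh_punct(t) for t in decoded]

print(translate_direct(["How is the weather today?"]))
```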

## Limitations / 限制

<details>
<summary>Limitations</summary>

- Handles only one- or two-sentence English inputs effectively; quality degrades on longer passages (a sentence-splitting workaround is sketched below).
- Struggles with English spelling, names, abbreviations, and especially technical terminology.
- Sometimes produces unusual punctuation, such as the English comma instead of the Chinese comma (the `en_to_zh_punct` helper above corrects this).
- Has difficulty understanding context, and as a result may generate inaccurate information or omit important details.
- Sometimes uses incorrect words because the base model was trained primarily on Simplified Chinese, whose vocabulary does not always correspond directly to Traditional Chinese usage (see the OpenCC sketch further below).

</details>
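
A simple workaround for the input-length limitation is to split longer passages into sentences before translating. This is a minimal sketch, not part of the model: the regex splitter is naive, and `translate` is the helper defined in the usage section above.

```python
import re

def translate_long(text):
    # Naive splitter: break after ., !, or ? followed by whitespace.
    # A dedicated sentence segmenter (e.g. NLTK's sent_tokenize) is more robust.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "".join(translate(sentences))
```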

<details>
<summary>限制</summary>

- 僅適用於處理一至兩句英文句子的輸入,處理較長段落時效果有限(可先將長文切分成句子,參見上方範例)。
- 難以準確掌握英語拼字、專有名詞及縮寫,尤其在處理技術術語時表現不佳。
- 常出現標點符號使用不當的情況,例如以英文逗號取代中文逗號(可使用上方的 en_to_zh_punct 函式校正)。
- 對語境的理解能力有限,可能導致資訊不準確或遺漏重要細節。
- 由於基礎模型主要以簡體中文語料訓練,有時會使用不自然或錯誤的詞語,簡體與繁體用語之間也未必能精確對應(可搭配 OpenCC 做後處理,參見下方範例)。

</details>
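
To normalize any Simplified-Chinese vocabulary that slips into the output, an optional post-processing pass with OpenCC can help. This is a hedged suggestion, not part of the model: it assumes the `opencc` Python package (for example `pip install opencc-python-reimplemented`), whose `s2twp` profile converts Simplified text to Traditional Chinese with Taiwan-style phrasing.

```python
from opencc import OpenCC  # assumes: pip install opencc-python-reimplemented

cc = OpenCC("s2twp")  # Simplified -> Traditional (Taiwan), with phrase conversion

def translate_tw(texts):
    # Translate with the fine-tuned model, then convert any residual
    # Simplified characters or mainland phrasing to Taiwan conventions.
    return [cc.convert(t) for t in translate(texts)]
```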

## Training procedure / 訓練過程

<details>
<summary>Click here / 點這裡</summary>

### Training hyperparameters

The following hyperparameters were used during training (a sketch of the corresponding training configuration follows this list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 5.0
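
As a rough guide, the sketch below maps these hyperparameters onto `Seq2SeqTrainingArguments`. It is a reconstruction, not the actual training script: `output_dir` and every setting not listed above (warmup, weight decay, precision, ...) are assumptions or library defaults.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-en-zhtw",  # placeholder, not the actual run directory
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",           # AdamW with betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="linear",
    num_train_epochs=5.0,
)
```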

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Input Tokens Seen |
|:-------------:|:-----:|:------:|:---------------:|:-----------------:|
| 1.3993        | 1.0   | 99952  | 1.2487          | 54454616          |
| 1.2801        | 2.0   | 199904 | 1.1701          | 108935048         |
| 1.1728        | 3.0   | 299856 | 1.1232          | 163424808         |
| 1.1001        | 4.0   | 399808 | 1.0871          | 217911400         |
| 1.0243        | 5.0   | 499760 | 1.0584          | 272407288         |
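
Validation loss fell monotonically over the five epochs. Assuming it is the usual per-token cross-entropy, the final value corresponds to a validation perplexity of roughly 2.9:

```python
import math

# Perplexity implied by the final validation loss in the table above,
# assuming the loss is per-token cross-entropy.
print(math.exp(1.0584))  # ≈ 2.88
```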

### Framework versions

- Transformers 4.51.3
- PyTorch 2.6.0+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0

</details>