Gemma-3-TAIDE-12b-Chat-GGUF / README_en.md

Chinese version (中文版)

Introduction

The TAIDE project aims to develop a generative AI dialogue engine tailored to the linguistic and cultural characteristics of Taiwan, while also establishing a trustworthy AI environment. By combining academic, industrial, and research resources, the project seeks to advance trustworthy generative AI, enhance Taiwan's international competitiveness, promote industrial development, and reduce dependence on foreign technologies.
The Gemma 3 TAIDE series models are based on Gemma 3 12b, released by Google, and incorporate text and training materials from various fields in Taiwan to improve the model's ability to respond in Traditional Chinese and perform specific tasks. The publicly released models are as follows:
- Gemma-3-TAIDE-12b-Chat: Based on Gemma-3-12b-pt, continually pretrained on Traditional Chinese data, then instruction-tuned to strengthen office tasks and multi-turn dialogue. Suitable for chat dialogue or task assistance scenarios.

Supported Modalities

Input: Text (text-only input is recommended)
Output: Text

Model Parameters

Parameters: 12.4B
Max context length: 128K
Traditional Chinese training tokens: 46B
Training time: 642 H200 GPU Hours

Features

- Strictly vetted training data to improve the model's trustworthiness and applicability.
- Enhanced performance on common office tasks such as summarization, article writing, letter writing, and Chinese-English translation.
- Strengthened understanding of Taiwan-specific culture, terminology, and context.
- Multi-turn conversation capabilities.

Examples

Task	User input	Model output
Applied Writing	我正在準備一場有關網路詐騙的宣導講座，請給我一份大約一小時的講座大綱，必須安排至少三個台灣常見網路詐騙手法的解析。	好的，以下為一份約一小時的網路詐騙宣導講座大綱，包含開場、詐騙解析、預防與應對，以及Q&A。講座名稱：小心！網路詐騙就在你身邊 – 提升自我保護力一、開場 (5分鐘) * 引言：網路世界便利，但危機四伏。詐騙手法日新月異，一不小心就可能成為受害者。 * 詐騙的普遍性：根據統計，網路詐騙案件逐年攀升，不分年齡、學歷，任何人都可能受害。 * 講座目的：提升大家對網路詐騙的警覺性，學習辨識詐騙手法的能力，並採取正確的應對措施。 * 強調：預防勝於治療，保護個人資訊，才能遠離詐騙風險。二、台灣常見網路詐騙手法解析 (30分鐘) * 第一彈：投資詐騙 (10分鐘) * 手法描述：詐騙集團假冒投資專家、理財顧問，利用LINE、Facebook等平台，以「高獲利、低風險」的誘餌，吸引投資人加入投資群組或網站。 * 常見話術：穩賺不賠、內線消息、快速致富、虛擬貨幣投資等。 * 案例分析： 2023年常見的「投資高報酬」詐騙，例如： * 「穩賺型股票」：鼓吹投資冷門股票，聲稱短期內可翻倍獲利。 * 「虛擬貨幣」：以高額利潤吸引投資，但平台最終關閉或資金被提領一空。 * 提醒：投資務必謹慎，不輕信高獲利承諾，投資前務必做好功課，並選擇合法合規的投資管道。 * 第二彈：愛情詐騙 (10分鐘) * 手法描述：詐騙集團在交友App或網站，以俊男美女的照片，假冒身份，與受害者建立感情，再編造各種理由（如：生病、親人過世、工作緊急）要求金錢援助。 * 常見話術：甜言蜜語、關懷問候、緊急求助、請求金錢援助等。 * 案例分析： 2022年常見的「假交友真詐財」案例，例如： * 「海外軍官」：假冒海外軍官，以戰爭或政治情勢為由，請求金錢援助。 * 「生病的戀人」：假冒生病的戀人，以醫療費用或生活困難為由，請求金錢援助。 * 提醒：網路交友要謹慎，不輕易相信陌生人，更不要隨意匯款。 * 第三彈：釣魚詐騙 (10分鐘) * 手法描述：詐騙集團假冒銀行、郵局、電商等機構的名義，發送詐騙郵件或簡訊，誘使受害者點擊惡意連結或下載附件，竊取個人資訊（如：帳號密碼、信用卡號）。 * 常見話術：帳戶安全警告、密碼重設、訂單確認、退稅通知等。 * 案例分析： 2023年常見的「釣魚郵件」案例，例如： * 「銀行安全提醒」：假冒銀行名義，要求點擊連結確認帳戶安全。 * 「包裹待領」：假冒物流公司名義，要求點擊連結領取包裹。 * 提醒：不要輕易點擊不明連結或下載附件，保護個人資訊，定期更新密碼。三、預防與應對 (15分鐘) * 保護個人資訊： * 不隨意透露個人資訊（如：生日、地址、電話、銀行資料）。 * 使用複雜且獨特的密碼。 * 定期更新密碼。 * 啟用雙重驗證（2FA）。 * 辨識詐騙徵兆： * 不輕信高獲利、低風險的投資。 * 不隨意相信陌生人的請求。 * 不點擊不明連結或下載附件。 * 保持懷疑態度。 * 應對詐騙： * 不要回應可疑郵件或訊息。 * 向相關單位檢舉（165反詐騙專線、110報案）。 * 向親友求助。 * 保留證據（詐騙郵件、訊息、對話紀錄）。四、Q&A (10分鐘) * 開放現場提問，解答聽眾的疑問。 * 提供相關資源連結（165反詐騙專線、警政署110報案、消費者保護官網站）。備註： * 講座時間可依實際情況調整。 * 可準備一些詐騙案例的圖片或影片，增加講座的生動性。 * 鼓勵聽眾分享自身經驗，增加互動性。 * 講座結束後，可提供反詐騙宣導單或相關資料。希望這份講座大綱對您有幫助！
Letter Writing	近來主管對我下班已讀不回訊息頗有微詞，甚至在整個部門的email串中提及此事。請幫我寫一封信，以委婉但堅定的語氣說明我希望在上班時間內專心完成工作，下班後另有私人要事，實在不便處理工作事務，然若是可以立即回答的問題，會盡可能即時回覆。整封信的語氣必須專業有禮貌，且必須考慮部門中其他同事也會看到。	主旨：關於下班時間回覆工作訊息一事敬愛的 [主管的名字] 您好：感謝您對我工作的指導與關心。近日收到您對於本人下班時間後回覆工作訊息一事的提醒，我理解您希望我們能保持高度警覺性與即時性的工作態度。我向來以工作為重，並盡力在上班時間內高效完成所有任務。然而，下班後我另有私人要事需要處理，時間安排上實在有所限制。我深知即時回覆工作訊息的重要性，因此，對於上班時間內收到的工作訊息，我會盡力迅速且妥善地回覆。若訊息內容涉及需要立即處理的事項，我定會盡快處理。未來，我會更加注意工作時間的分配，並盡力在上班時間內完成所有工作。同時，也請您理解我在下班後的時間安排，我仍會以負責任的態度面對我的工作。再次感謝您對我的指導，我將持續努力，為團隊貢獻我的力量。敬祝工作順心 [你的名字] [你的職稱] [你的部門] [你的聯絡方式] --- 注意事項： * 請將 [主管的名字] 替換為實際主管的名字。 * 請確認你的職稱、部門和聯絡方式是否正確。 * 這封信的語氣是委婉但堅定的，表達了你的立場，同時也展現了你對工作的尊重和承諾。 * 如果你覺得這封信的語氣太過強硬，可以再稍微修改，使其更柔和一些。 * 如果主管的微詞是出於誤會，你可以在信中加入一些解釋，例如你過去有盡力回覆訊息，或你過去有處理過緊急的工作經驗。
Summarization	請將這篇文章精簡條理化:「產業創新條例第10條之2及第72條條文修正案」俗稱「台版晶片法」,針對半導體、電動車、5G等技術創新且居國際供應鏈關鍵地位公司,提供最高25%營所稅投抵優惠,企業適用要件包含當年度研發費用、研發密度達一定規模,且有效稅率達一定比率。為因應經濟合作暨發展組織(OECD)國家最低稅負制調整,其中有效稅率門檻,民國112年訂為12%,113年料將提高至15%,但仍得審酌國際間最低稅負制實施情形。經濟部官員表示,已和財政部協商進入最後階段,除企業研發密度訂在6%,目前已確認,企業購置先進製程的設備投資金額達100億元以上可抵減。財政部官員表示,研商過程中,針對台灣產業與其在國際間類似的公司進行深入研究,在設備部分,畢竟適用產創10之2的業者是代表台灣隊打「國際盃」,投入金額不達100億元,可能也打不了。至於備受關注的研發費用門檻,經濟部官員表示,歷經與財政部來回密切討論,研發費用門檻有望落在60億至70億元之間。財政部官員指出,研發攸關台灣未來經濟成長動能,門檻不能「高不可攀」,起初雖設定在100億元,之所以會調降,正是盼讓企業覺得有辦法達得到門檻、進而適用租稅優惠,才有動力繼續投入研發,維持國際供應鏈關鍵地位。經濟部官員表示,因廠商研發費用平均為30、40億元,其中,IC設計業者介於30億至60億元範圍,若將門檻訂在100億元,符合條件的業者較少、刺激誘因不足;此外,若符合申請門檻的業者增加,將可提高企業在台投資金額,財政部稅收也能因此獲得挹注。 IC設計業者近日頻頻針對產創10之2發聲,希望降低適用門檻,加上各國力拚供應鏈自主化、加碼補助半導體產業,經濟部官員表示,經濟部和財政部就產創10之2達成共識,爭取讓更多業者受惠,盼增強企業投資力道及鞏固台灣技術地位。財政部官員表示,租稅獎勵的制定必須「有為有守」,並以達到獎勵設置目的為最高原則,現階段在打「國內盃」的企業仍可適用產創第10條、10之1的租稅優惠,共同壯大台灣經濟發展。經濟部和財政部正就研發費用門檻做最後確認,待今明兩天預告子法之後,約有30天時間,可與業界進一步討論及調整,盼產創10之2能在6月上路。	--- 台版晶片法 - 產業創新條例第10條之2及第72條修正案精簡版一、優惠內容： * 針對半導體、電動車、5G 等產業的國際關鍵企業，提供最高 25% 營所稅投抵。二、企業適用資格： * 研發費用：研發費用門檻擬落在 60-70 億元之間。 * 研發密度：研發費用占營收比率至少 6%。 * 有效稅率： * 112 年：12% * 113 年 (可能) 15% * 調整後,將配合 OECD 最低稅負制調整。 * 設備投資：購置先進製程設備投資達 100 億元以上者，可申請抵減。三、政策背景與目的： * 配合 OECD 全球最低稅負制，調整有效稅率門檻。 * 鼓勵企業持續投入研發，維持台灣技術優勢及國際競爭力。 * 盼吸引更多企業投資台灣，增加國家稅收。四、時程： * 預計 6 月上路，草案預告後進行 30 天的業界意見徵詢期。五、政策立場： * 經濟部與財政部：共同協商，力求企業在可控制合理範圍內能獲得租稅優惠，以鼓勵企業發展。 * 租稅獎勵考量「有為有守」，訂定適切門檻，促進台灣經濟發展。
Translation	以下提供英文內容，請幫我翻譯成中文。Dongshan coffee is famous for its unique position, and the constant refinement of production methods. The flavor is admired by many caffeine afficionados.	東山咖啡以其獨特的位置和持續改良的製程而聞名，其風味被許多咖啡愛好者所推崇。

Usage

Sample code
- The sample code and documentation will be released on GitHub later.

Prompt template

Normal QA

```
chat = [
    {"role": "user", "content": "{question}"},
]
prompt = tokenizer.apply_chat_template(chat)
```
- Replace {question} with the user input.

QA with system prompt

```
chat = [
    {"role": "system", "content": "{sys}"},
    {"role": "user", "content": "{question}"},
]
prompt = tokenizer.apply_chat_template(chat)
```
- Replace {sys} with the system prompt.
- Replace {question} with the user input.

Multi-turn conversation
```
chat = [
    {"role": "system", "content": "{sys}"},
    {"role": "user", "content": "{question1}"},
    {"role": "assistant", "content": "{model_answer_1}"},
    {"role": "user", "content": "{question2}"},
]
prompt = tokenizer.apply_chat_template(chat)
```
- Replace {sys} with the system prompt, e.g.: 你是一個來自台灣的AI助理，你的名字是 TAIDE，樂於以台灣人的立場幫助使用者，會用繁體中文回答問題。
- Replace {question1} with user input 1.
- Replace {model_answer_1} with model response 1.
- Replace {question2} with user input 2.

For more details, please refer to the Gemma 3 Model Card.
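
When no `tokenizer` object is at hand (for example, when passing a raw prompt string to a GGUF runtime), it can help to see roughly what the chat template expands to. The following is a minimal sketch, not the official template: it assumes this model keeps the upstream Gemma 3 turn markers (`<bos>`, `<start_of_turn>`, `<end_of_turn>`), that the assistant role is rendered as `model`, and that the system prompt is folded into the first user turn. The helper name `render_gemma_prompt` and its parameters are hypothetical. Check the chat template shipped with the model before relying on it.

```python
def render_gemma_prompt(chat, add_generation_prompt=True, include_bos=True):
    """Render a chat list into a Gemma-style prompt string (illustrative sketch).

    Assumes the upstream Gemma 3 conventions; verify against the model's own template.
    """
    messages = list(chat)
    system_prefix = ""
    # Assumption: there is no separate system turn, so the system text is
    # prepended to the first remaining turn, separated by a blank line.
    if messages and messages[0]["role"] == "system":
        system_prefix = messages[0]["content"] + "\n\n"
        messages = messages[1:]

    parts = ["<bos>"] if include_bos else []
    for index, message in enumerate(messages):
        # Assumption: the "assistant" role is written as "model" in the prompt.
        role = "model" if message["role"] == "assistant" else message["role"]
        prefix = system_prefix if index == 0 else ""
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{prefix}{content}<end_of_turn>\n")

    if add_generation_prompt:
        # Open a model turn so that generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "{sys}"},
    {"role": "user", "content": "{question1}"},
    {"role": "assistant", "content": "{model_answer_1}"},
    {"role": "user", "content": "{question2}"},
]
prompt = render_gemma_prompt(chat)
print(prompt)
```

Some runtimes insert the beginning-of-sequence token themselves; in that case pass `include_bos=False` to avoid duplicating it.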

Training methods

Software / hardware spec
- GPU: H200
- Training Framework: PyTorch
Data preprocessing
- Character normalization
- Deduplication
- Denoising
  - HTML tags and JavaScript in web content
  - Non-standard characters or garbage characters
  - Posts with an insufficient number of characters
  - Specific formats such as extra line breaks added for formatting purposes
- Removing personal information such as emails and phone numbers.
- Removing inappropriate content such as gambling, pornography, etc.
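
The preprocessing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the TAIDE pipeline: the regular expressions, the `[EMAIL]` and `[PHONE]` placeholders, the `MIN_CHARS` threshold, the use of NFC for character normalization, and exact-match deduplication by hash are all hypothetical choices. The inappropriate-content filter is omitted because the source does not describe how it works.

```python
import hashlib
import re
import unicodedata

# Hypothetical patterns and threshold, chosen only for illustration.
TAG_RE = re.compile(r"<(script|style)\b.*?</\1>|<[^>]+>", re.IGNORECASE | re.DOTALL)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", re.ASCII)
PHONE_RE = re.compile(r"(?<!\d)0\d{1,3}[-\s]?\d{3,4}[-\s]?\d{3,4}(?!\d)")
MIN_CHARS = 20


def clean_document(text):
    """Normalize characters, strip markup, mask personal data, tidy whitespace."""
    text = unicodedata.normalize("NFC", text)   # character normalization
    text = TAG_RE.sub(" ", text)                # HTML tags and inline scripts
    text = EMAIL_RE.sub("[EMAIL]", text)        # personal information
    text = PHONE_RE.sub("[PHONE]", text)
    text = re.sub(r"[ \t]+", " ", text)         # runs of spaces and tabs
    text = re.sub(r"\n{2,}", "\n", text)        # extra formatting line breaks
    return text.strip()


def preprocess(documents):
    """Clean each document, drop short ones, and remove exact duplicates."""
    seen = set()
    kept = []
    for raw in documents:
        text = clean_document(raw)
        if len(text) < MIN_CHARS:
            continue  # too few characters to be useful
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        kept.append(text)
    return kept


docs = [
    "<p>聯絡我：test@example.com 或 02-2345-6789，謝謝您的來信與支持！</p>",
    "<p>聯絡我：test@example.com 或 02-2345-6789，謝謝您的來信與支持！</p>",
    "太短",
    "<script>alert(1)</script>這是一段長度足夠的正常內文，用來示範清理流程的結果。",
]
cleaned = preprocess(docs)
```

A production pipeline would typically add near-duplicate detection and language-specific character mapping; both are outside what the source specifies.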
Continual pretraining (CP)
- Supplementing the model with a large amount of reliable Traditional Chinese knowledge.
- Hyperparameters
  - optimizer: FusedAdam
  - learning rate: 1e-4
  - batch size: 4M tokens
  - epochs: 1
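
As a rough sanity check on these settings, the number of optimizer updates follows from the token count and the batch size. The sketch below is an estimate under assumptions, not a figure reported by the TAIDE team: it counts only the 46B Traditional Chinese tokens listed under Model Parameters (the continual pretraining mix also contains English, math, and code data, so the real step count is likely higher), and it shows both readings of "4M" because the source does not say which one applies. The function name `optimizer_steps` is hypothetical.

```python
def optimizer_steps(total_tokens, batch_tokens, epochs=1):
    """Return the number of optimizer updates, rounding a partial final batch up."""
    return (total_tokens * epochs + batch_tokens - 1) // batch_tokens


# "46B" from the Model Parameters section; covers Traditional Chinese tokens only.
TRADITIONAL_CHINESE_TOKENS = 46_000_000_000

# Assumption: "4M tokens" means 4 * 10**6 tokens per batch.
steps_decimal = optimizer_steps(TRADITIONAL_CHINESE_TOKENS, 4_000_000)

# Alternative assumption: "4M tokens" means 4 * 2**20 tokens per batch.
steps_binary = optimizer_steps(TRADITIONAL_CHINESE_TOKENS, 4 * 2**20)

print(steps_decimal, steps_binary)
```

Either way, one epoch over the Traditional Chinese tokens alone corresponds to roughly eleven thousand updates at this batch size.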
Fine-tuning (FT)
- Enabling the model to answer questions in Traditional Chinese.
- Hyperparameters
  - optimizer: FusedAdam
  - learning rate: 3e-6
  - batch size: 4M tokens
  - epochs: 5

Training Data

Continual pre-training data (about 240GB)

Dataset	Description
Litigation Data	Civil litigation rulings from courts at all levels, covering 2013/01 to 2023/12.
CNA news	The CNA news includes daily news articles from June 1993 to December 2024, spanning a period of 31.5 years. The content covers various domains such as domestic and international politics, society, economy, culture, education, and lifestyle.
ETtoday news	ETtoday news data, including data from 2011/10 to 2024/11.
Legislative Yuan Gazette	The Legislative Yuan Gazette contains data from the 1st session of the 8th term to the 7th session of the 10th term.
Publisher Website Book Introduction	Book introduction data from the websites of the publishers SunColor and Gotop.
Abstracts of GRB research projects	GRB is an information system that compiles research projects funded by government grants and their outcome reports. This dataset primarily includes research project abstracts from 1993 to 2023, including both Chinese and their English counterparts.
Academic conference proceedings abstracts	The database contains academic conference proceedings held in Taiwan from 1988 to 2009.
Taiwan Panorama magazine	Taiwan Panorama magazine contains articles from July 1993 to June 2023, spanning 30 years. The content focuses on Taiwanese culture, tourism, and local customs.
樂詞網	《樂詞網》 covers approximately 187,000 academic terms in the humanities and social sciences, along with their translations.
Data from various ministries and commissions	Including partial data from government department websites such as the Executive Yuan's "National Overview", the Ministry of Culture's "National Cultural Memory Bank", the National Development Council's "Archives Support Teaching Network", the Ministry of Transportation's "Traffic Safety Portal", etc.
Business Today	Business Today Magazine is a weekly magazine focused on finance. The dataset includes articles from 2008/01 to 2023/07.
Mandarin and idiom dictionary from the Ministry of Education	Three dictionaries: the Idiom Dictionary, with 5,338 idioms including definitions, original stories, usage explanations, and example sentences; the Revised Mandarin Dictionary, with approximately 165,539 entries of Chinese words and other vocabulary, including pronunciation, radicals, definitions, and other information; and the Concise Mandarin Dictionary, a condensed version of the Revised Mandarin Dictionary with 45,247 entries.
SCITechVista	The dataset includes science news and popular science articles from the SCITechVista website.
iKnow	The iKnow platform provides information on market trends, strategic analysis, patent knowledge, and technology transaction information for Taiwan and the global technology industry. The dataset includes data from 2005/01 to 2023/07.
Science Development Monthly Magazine	Science Development Monthly Magazine is a popular science publication published by the National Science Council (NSC) to promote science education. It includes articles from 2004/10 to 2020/12. In 2021, the magazine was relaunched as "CharmingSCITech" quarterly, providing new knowledge on international technology issues.
Legislation Database	The Legislation Database includes the latest central regulations, rules, draft bills, and local regulations issued by government agencies as of 2023/10.
Local Government Tourism Websites	Covering partial data from tourism websites of local government counties and cities in Taiwan.
Curriculum Guidelines from the National Institute of Education	The dataset includes curriculum guidelines for different subjects at various levels of education.
CNA's English and Chinese Name Translation Database	The English and Chinese Name Translation Database of the Central News Agency (CNA) collects translations of foreign and Chinese surnames, personal names, organizations, and place names used in news.
RedPajama-Data-V2	English data extracted from the RedPajama-Data-V2 multilingual dataset.
MathPile-commercial	A mathematics-focused dataset obtained from MathPile-commercial
Traditional Chinese Wikipedia Articles	The content of all articles in Traditional Chinese Wikipedia, up to January 2023.
github-code-clean	An open-source code dataset from GitHub, with unlicensed code and documents removed.

Fine-tuning data
- The TAIDE team uses the Llama 3 series and Gemma 3 series models to generate fine-tuning data, covering tasks such as world knowledge, creative writing, general knowledge, translation, summarization, programming, and Taiwanese values in both single-turn and multi-turn dialogues. In total, there are 220K prompt-response pairs.

Evaluation

TMMLU+

Evaluation Data
- A total of 20,118 questions across 66 subcategories, grouped into four categories: STEM, Humanities, Social Sciences, and Other (Business, Health, Misc.).
Evaluation Method
- Scoring Program: lm-evaluation-harness

Evaluation Scores

Model	STEM	Humanities	Social Sciences	Other	Average
Gemma-3-TAIDE-12b-Chat	48.00	47.59	58.19	45.68	49.95
Gemma-3-12b-it	56.00	44.41	60.52	50.36	53.82
Llama-3.1-TAIDE-LX-8B-Chat	42.11	46.23	50.96	43.22	45.58

Because Gemma-3-TAIDE-12b-Chat has been specifically enhanced with Taiwan-related knowledge, the scores for the Taiwan-specific subjects within TMMLU+ are also provided for reference.

Model	Geography of Taiwan	Taiwanese Trivia Question Answering v2	Taiwanese Hokkien
Gemma-3-TAIDE-12b-Chat	69.14	80.53	38.76
Gemma-3-12b-it	60.93	71.68	28.68
Llama-3.1-TAIDE-LX-8B-Chat	62.36	75.10	31.78

taide-bench

Evaluation Data
- Tasks include writing articles, writing letters, summarizing articles, translating from English to Traditional Chinese, translating from Traditional Chinese to English. There are 500 questions in total.
- Data link: taide-bench
Evaluation method
- LLM-as-a-judge, with gpt-4o as the judge
- Code link: taide-bench-eval

Scores

Model	Translating from Traditional Chinese to English	Translating from English to Traditional Chinese	Summarization	Writing articles	Writing letters	Average
Gemma-3-TAIDE-12b-Chat	7.99	8.46	7.64	9.73	9.57	8.67
Gemma-3-12b-it	8.24	8.44	7.77	9.60	9.46	8.70
Llama-3.1-TAIDE-LX-8B-Chat*	6.99	7.26	9.17	8.80	7.18	7.88

*Note: The Llama-3.1-TAIDE-LX-8B-Chat score on taide-bench has been updated and differs from the previous release because the evaluation judge prompt changed.
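
The Average column in the taide-bench table appears to be the unweighted mean of the five task scores. This is inferred from the numbers themselves, not stated in the source, and the small check below only confirms that each reported average lies within 0.01 of that mean. The values are copied from the Scores table in column order; the variable names are illustrative.

```python
# (per-task scores in table column order, reported average)
scores = {
    "Gemma-3-TAIDE-12b-Chat": ([7.99, 8.46, 7.64, 9.73, 9.57], 8.67),
    "Gemma-3-12b-it": ([8.24, 8.44, 7.77, 9.60, 9.46], 8.70),
    "Llama-3.1-TAIDE-LX-8B-Chat": ([6.99, 7.26, 9.17, 8.80, 7.18], 7.88),
}


def mean(values):
    """Unweighted arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Absolute gap between the recomputed mean and the reported average.
deviations = {
    model: abs(mean(tasks) - reported)
    for model, (tasks, reported) in scores.items()
}

for model, gap in deviations.items():
    print(f"{model}: gap {gap:.3f}")
```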

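License

(Gemma Version)-TAIDE Models License Agreement

Disclaimer

Because of limitations in LLM design and architecture, and because training data inevitably contains biases, no response from the language model represents the position of TAIDE. Additional safety mechanisms should be added before use. Responses may also contain incorrect information, so users should not trust them uncritically.

Development Team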

https://taide.tw/index/teamList

Citation

TAIDE official website
