Greek Dialect LoRA — Llama-3 8B Instruct Adapter

LoRA adapter trained by the CLLT Lab (University of Crete) for dialectal Greek generation on top of meta-llama/Meta-Llama-3-8B-Instruct. The adapter follows the same natural-prompt pipeline as the Krikri variant but uses Meta’s instruct-tuned Llama 3 backbone. Training ran for 4,173 steps (3 epochs); the best checkpoint was saved at step 4,000 (eval loss 1.874).

Project website: https://stergioscha.github.io/CLLT/

Model Details

  • Developer: CLLT Lab, University of Crete
  • Adapter type: LoRA (PEFT) with r=16, α=32, dropout=0.1 applied to the q/k/v/o/gate/up/down projections (see the configuration sketch after this list)
  • Dataset: 23k+ instruction-following pairs covering the Pontic, Cretan, Northern Greek, and Cypriot dialects (derived from GRDD)
  • Split: 95% train / 5% validation using Hugging Face datasets random split
  • Precision: bfloat16, gradient accumulation 8 → effective batch size 16
  • License: Research purposes only, subject to the Meta Llama 3 license terms
  • Compute: AWS GPU resources via GRNET & EU Recovery and Resilience Facility funding
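
A minimal PEFT configuration sketch matching the hyperparameters above; the projection-module names are the standard Llama-3 names in Transformers and are an assumption here, not quoted from the training script:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # LoRA rank
    lora_alpha=32,             # scaling α
    lora_dropout=0.1,
    target_modules=[           # q/k/v/o/gate/up/down projections (assumed names)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)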

Intended Use

Direct

  • Generate or continue prompts in specific Greek dialects for cultural documentation or experimentation
  • Build dialogue systems that can answer in Pontic, Cretan, Northern Greek, or Cypriot when prompted explicitly

Downstream

  • Plug into RAG/chat pipelines that rely on Meta-Llama-3-8B-Instruct as a base (the adapter can also be merged into the base weights; see the sketch under Usage)
  • Evaluate dialectal control against GRDD+ or bespoke benchmarks

Out-of-scope

  • Critical or safety-sensitive deployments without native-speaker review
  • Automatic translation or identification of dialects (model produces text; it is not a classifier)
  • Standard Modern Greek generation (Standard Modern Greek entries were filtered out of the training data)

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base instruct model in bfloat16, sharded across available devices
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Attach the dialect LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base, "Stergios/llama3-8b-instruct-lora")

# Natural-language dialect instruction: "Answer in Cretan: Where shall we meet?"
prompt = "Απάντησε στα κρητικά: Πού θα συναντηθούμε;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=160, temperature=0.8, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
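
For downstream pipelines that build on Meta-Llama-3-8B-Instruct (see Downstream above), the adapter can optionally be folded into the base weights so the result deploys like a plain checkpoint. A minimal sketch continuing from the code above, with an illustrative output directory name:

# Merge the LoRA weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("llama3-8b-greek-dialects-merged")
tokenizer.save_pretrained("llama3-8b-greek-dialects-merged")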

Training Data & Procedure

  • Preparation: convert_to_natural_prompts_dialects_only.py converts <po>/<cr>/<no>/<cy> tags into natural Greek instructions (e.g., “Γράψε στην κρητική διάλεκτο: …”, i.e., “Write in the Cretan dialect: …”); a sketch of the full preparation and training setup follows this list.
  • Filtering: Removed Standard Modern Greek entries to keep the adapter dialect-focused.
  • Tokenization: 512 tokens, padding to max length, labels = input IDs.
  • Hyperparameters: epochs=3, lr=3e-4, warmup=100, save/eval every 200 steps, load_best_model_at_end=True.
  • Checkpoint size: adapter ≈ 170 MB (adapter_model.safetensors).
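
A minimal sketch of this preparation and training setup, assuming a recent Transformers version, illustrative file and column names, and a hypothetical tag-to-instruction mapping; the lab’s actual conversion script (convert_to_natural_prompts_dialects_only.py) is not reproduced here:

from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Hypothetical mapping from GRDD dialect tags to natural Greek instructions
TAG_TO_PROMPT = {
    "<po>": "Γράψε στην ποντιακή διάλεκτο: ",        # "Write in the Pontic dialect: "
    "<cr>": "Γράψε στην κρητική διάλεκτο: ",          # "Write in the Cretan dialect: "
    "<no>": "Γράψε στη βόρεια ελληνική διάλεκτο: ",   # "Write in the Northern Greek dialect: "
    "<cy>": "Γράψε στην κυπριακή διάλεκτο: ",         # "Write in the Cypriot dialect: "
}

def to_natural_prompt(example):
    text = example["text"]
    for tag, instruction in TAG_TO_PROMPT.items():
        text = text.replace(tag, instruction)
    return {"text": text}

def tokenize(example):
    # 512-token context, padded to max length; causal-LM labels mirror the inputs
    tokens = tokenizer(example["text"], truncation=True, max_length=512,
                       padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Illustrative data file; 95/5 random split as described above
dataset = load_dataset("json", data_files="grdd_dialect_pairs.jsonl")["train"]
dataset = dataset.map(to_natural_prompt).map(tokenize)
dataset = dataset.train_test_split(test_size=0.05)

# These arguments plug into a standard transformers.Trainer with the PEFT-wrapped model
training_args = TrainingArguments(
    output_dir="llama3-8b-instruct-lora",
    num_train_epochs=3,
    learning_rate=3e-4,
    per_device_train_batch_size=2,      # with grad-accum 8 → effective batch 16
    gradient_accumulation_steps=8,
    bf16=True,
    warmup_steps=100,
    eval_strategy="steps",
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,
)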

Evaluation

  • Automatic: Validation loss tracked every 200 steps; best checkpoint at step 4,000 (eval loss 1.874).
  • Recommended manual checks: Have native speakers verify correctness, register, and cultural sensitivity.

Limitations & Risks

  • Dialect mixing can occur if prompts are vague. Specify the dialect explicitly.
  • Model inherits any biases present in GRDD (topics, speaker demographics, orthography).
  • The Llama 3 family license disallows certain use cases; comply with Meta’s terms in addition to the “research only” clause here.

Acknowledgments

  • National Infrastructures for Research and Technology (GRNET) for AWS credits
  • EU Recovery & Resilience Facility for funding
  • Meta for the base Llama 3 models

Contact

Questions or issues? Open an issue on the GitHub repository or reach out to the CLLT Lab (University of Crete).
