---
license: mit
datasets:
- Darsala/english_georgian_corpora
language:
- ka
- en
metrics:
- comet
- bleu
- chrf
pipeline_tag: translation
tags:
- translation
- Georgian
- NMT
- MT
- encoder-decoder
model-index:
- name: Georgian-Translation
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: FLORES Test Set
      type: flores
    metrics:
    - type: comet
      value: 0.79
      name: COMET Score
base_model: bert-base-uncased
---

# Georgian Translation Model

## Model Description

This is an English-to-Georgian neural machine translation model developed as part of a bachelor's thesis project. The model uses an encoder-decoder architecture with a pretrained BERT encoder and a randomly initialized decoder.

## Architecture

- **Model Type**: Encoder-decoder
- **Encoder**: Pretrained BERT model (`bert-base-uncased`)
- **Decoder**: Randomly initialized with a custom configuration
- **Decoder Tokenizer**: `RichNachos/georgian-corpus-tokenizer-test`
- **Parameters**: 266M total parameters

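The exact decoder configuration is not published here, but a model of this shape can be described with Transformers' `EncoderDecoderConfig`. This is a sketch only: the decoder vocabulary size below is a hypothetical stand-in for the Georgian tokenizer's true size.

```python
from transformers import BertConfig, EncoderDecoderConfig

# Encoder: BertConfig() defaults match the bert-base-uncased architecture.
encoder_config = BertConfig()

# Decoder: a BERT-style stack trained from scratch. The vocab size is a
# hypothetical stand-in for the Georgian tokenizer's real vocabulary.
decoder_config = BertConfig(
    vocab_size=50_000,
    is_decoder=True,
    add_cross_attention=True,  # decoder attends to encoder hidden states
)

config = EncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
print(config.is_encoder_decoder)  # True
```

`from_encoder_decoder_configs` is the same mechanism `EncoderDecoderModel` uses internally to pair a pretrained encoder with a freshly initialized decoder.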
## Training Details

- **Training Data**: English-Georgian parallel corpus (see [Darsala/english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora))
- **Training Duration**: 16 epochs
- **Hardware**: NVIDIA A100 80GB
- **Batch Size**: 128, with 2 gradient accumulation steps (effective batch size 256)
- **Scheduler**: Cosine learning rate schedule
- **Training Pipeline**: Complete data cleaning, preprocessing, and augmentation pipeline

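The cosine scheduler anneals the learning rate from its peak down to (near) zero over training. A minimal stdlib sketch of that schedule, with an assumed peak learning rate of 5e-5 since the card does not state the actual value:

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 5e-5, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000))     # peak: 5e-05
print(cosine_lr(1000, 1000))  # end: 0.0
```

In practice this is what `lr_scheduler_type="cosine"` in the Transformers trainer computes per optimizer step (optionally after a warmup phase).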
## Performance

- **COMET Score**: 0.79 on the FLORES test set
- **Comparison**: Google Translate scores 0.83 and Kona 0.84 on the same dataset
- **Translation Style**: Produces more literary and natural-sounding Georgian than Google Translate

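chrF, one of the metrics listed in the metadata, measures character n-gram overlap and suits a morphologically rich language like Georgian. A simplified stdlib sketch of the idea (real evaluations should use sacreBLEU's chrF implementation):

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Character n-gram counts, ignoring whitespace."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str,
         max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-score over character 1..max_n-grams."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string shorter than n
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Identical strings score 1.0; disjoint strings score 0.0
print(round(chrf("გამარჯობა", "გამარჯობა"), 2))  # 1.0
```

This sketch omits sacreBLEU's word n-gram extension (chrF++) and whitespace handling details, but captures why character-level matching rewards correct Georgian morphology even when whole words differ.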
## Usage

**Important**: This model uses a custom `EncoderDecoderTokenizer` that is included in the repository, so you need to download the repository locally before loading the model.

```python
import re
import sys

import torch
from huggingface_hub import snapshot_download
from transformers import EncoderDecoderModel

# Download the repo to a local folder
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
    local_dir_use_symlinks=False,
)

# Add the downloaded folder to the Python path so the custom tokenizer can be imported
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer

# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """Translate a single English string to Georgian."""
    # The model expects lowercased, whitespace-normalized input
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()

    # Tokenize and move tensors to the model's device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest",
    ).to(device)

    # Beam-search generation
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )

    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")

    return output

# Example usage
translation = translate("Hello, how are you?")
```
## Strengths and Limitations

### Strengths

- Produces more literary and natural Georgian translations
- Good performance on general-domain text
- Specialized for the characteristics of the Georgian language

### Limitations

- Struggles with proper names and company names
- Has trouble with terms that should be copied verbatim from the English source
- Limited by the tokenizer's coverage of certain English terms

## Demo

Try the model in the interactive demo: [Georgian Translation Space](https://huggingface.co/spaces/Darsala/Georgian-Translation)

## Citation

```bibtex
@mastersthesis{darsalia2025georgian,
  title  = {English Translation Quality Assessment and Computer Translation},
  author = {Luka Darsalia},
  year   = {2025},
  school = {Tbilisi University},
  type   = {Bachelor's Thesis},
  note   = {Computer Science}
}
```

## Related Resources

- **Training Data**: [english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora)
- **Georgian COMET Model**: [georgian_comet](https://huggingface.co/Darsala/georgian_comet)
- **Evaluation Data**: [georgian_metric_evaluation](https://huggingface.co/datasets/Darsala/georgian_metric_evaluation)