# masri-embed-student-10k
An embedding model for Egyptian Arabic (Masri), built on:
- Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Fine-tuned on a subset (~20,000 samples) of the EgyTriplets-250K dataset
- Training objective: triplet loss over (anchor, positive, negative) triplets of Egyptian Arabic sentences (a training sketch follows this list)
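
For illustration, here is a minimal sketch of what such triplet fine-tuning could look like with the sentence-transformers library. The toy triplets, batch size, epoch count, and warmup steps are assumptions for demonstration only, not the actual training script:

```python
# Hypothetical triplet fine-tuning sketch (illustrative, not the original script).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Toy triplets standing in for the EgyTriplets-250K subset:
# texts = [anchor, positive (paraphrase), negative (unrelated sentence)].
train_examples = [
    InputExample(texts=["عايز أروح الساحل", "محتاج أجازة على البحر", "بحب أقرأ كتب"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# TripletLoss pulls anchor/positive embeddings together and pushes
# anchor/negative embeddings apart by at least a margin.
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```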
## Usage (Python)
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

model_id = "Ahmedhisham/queen_of_embedded_egy_20k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

def encode(texts, max_length=128, device="cuda" if torch.cuda.is_available() else "cpu"):
    model.to(device)
    model.eval()
    with torch.no_grad():
        enc = tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        ).to(device)
        out = model(**enc)
        last_hidden = out.last_hidden_state
        # Mean pooling: average the token embeddings, ignoring padding positions.
        mask = enc["attention_mask"].unsqueeze(-1).expand(last_hidden.size()).float()
        masked = last_hidden * mask
        summed = masked.sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        emb = summed / counts
        # L2-normalize so dot products between embeddings are cosine similarities.
        emb = F.normalize(emb, p=2, dim=-1)
    return emb

# Example
texts = [
    "عايز أروح الساحل أغير جو وأرتاح شوية",  # "I want to go to the coast for a change of scenery and some rest"
    "محتاج أجازة على البحر كام يوم",  # "I need a few days' vacation by the sea"
    "بحب أقرأ كتب عن الذكاء الاصطناعي",  # "I like reading books about artificial intelligence"
]
embs = encode(texts)
sim = torch.matmul(embs, embs.T)  # cosine similarity matrix (embeddings are normalized)
print(sim)
```
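
Since the first two sentences both describe taking a beach vacation while the third is about reading AI books, `sim[0][1]` should come out clearly higher than `sim[0][2]` or `sim[1][2]`.

If the model repository also ships sentence-transformers configuration files (an assumption, not verified here), the same embeddings can be computed more compactly; the pooling is then handled by the model's pooling module, and `normalize_embeddings=True` applies the L2 normalization:

```python
# Assumes the repo includes sentence-transformers config files (unverified).
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("Ahmedhisham/queen_of_embedded_egy_20k")
embs = st_model.encode(texts, normalize_embeddings=True)  # L2-normalized numpy array
```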