---
base_model: unsloth/Phi-4-unsloth-bnb-4bit
library_name: peft
pipeline_tag: text-generation
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Model Card for Model ID

A LoRA adapter for the ORKG Ask information-extraction use case. It expects a list of properties and the source text to extract them from. The model supports 13 languages and 4 different language tones.

## Model Details

Fine-tuned using `unsloth` on a dataset created with larger LLMs such as GPT-4o.

Supported languages:

- English
- Spanish
- German
- Dutch
- French
- Italian
- Portuguese
- Russian
- Chinese
- Japanese
- Korean
- Arabic
- Farsi

Supported language tones:

- Researcher
- Adult
- Teenager
- Child

### Model Description

The language tones are defined as follows:

1. Child (10–11 years old):
   - Simple, short sentences and basic but accurate explanations.
   - No advanced jargon.
   - Everyday examples that tie into the research findings.
2. Teenager:
   - Casual, engaging manner; relevant slang used in moderation.
   - Emphasis on interesting and emotionally engaging research findings.
   - Relatable explanations, referencing everyday scenarios or pop culture where applicable.
3. Adult:
   - Concise detail with a polished, clear tone.
   - Moderate, non-technical vocabulary where possible.
   - Essential context and logical flow, focusing on practical applications of the research.
4. Researcher:
   - Formal, precise language with clear references to methodologies or data.
   - Discipline-specific terminology as needed.
   - Balanced, objective presentation of research complexities.

The system prompt of the model is:

```
You will get as an input: a research paper's content and a set of properties/criteria to extract. You will extract the values corresponding to the list of provided predicates. You should stick to the schema and description of the properties (if available). Use the ORKG Ask structured information extraction XML output format.
The extractions must be in the "{language}" language and the complexity of the language should be for a "{tone}".
```

The user prompt should look like this:

```
# Specifications of what to extract:
{properties}
# Extraction source:
{source}
```

Properties look like this:

```json
[
  {"label": "Methods", "desc": "The methods used in the study", "schema": "multiple values of type string"},
  {"label": "TL;DR", "desc": null, "schema": null}
]
```

The output of the model is produced in XML format:

```xml
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
```

The model should be used in chat mode, or apply the chat template (see the tokenizer) and feed the rendered prompt to a standard generation endpoint.

## Training Details

### LoRA details

```python
r=16
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
lora_alpha=32
lora_dropout=0
seed=42
```

### SFT details

```python
per_device_train_batch_size=8
gradient_accumulation_steps=8
warmup_steps=5
num_train_epochs=3
learning_rate=2e-4
bf16=True
optim="adamw_torch_fused"
weight_decay=0.01
lr_scheduler_type="linear"
seed=42
```

Trained on responses only.

## Model Card Contact

ORKG Ask Team - info@orkg.org

### Framework versions

- PEFT 0.14.0
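The prompt assembly described above can be sketched as follows. This is a minimal illustration, not part of the released code: the two template strings are taken verbatim from this card, while `build_messages` is a hypothetical helper showing how to fill in `{language}`, `{tone}`, `{properties}`, and `{source}` before applying the tokenizer's chat template.

```python
import json

# System prompt template copied from this card; {language} and {tone}
# are filled in per request.
SYSTEM_TEMPLATE = (
    "You will get as an input: a research paper's content and a set of "
    "properties/criteria to extract. You will extract the values corresponding "
    "to the list of provided predicates. You should stick to the schema and "
    "description of the properties (if available). Use the ORKG Ask structured "
    "information extraction XML output format.\n"
    'The extractions must be in the "{language}" language and the complexity '
    'of the language should be for a "{tone}".'
)

# User prompt template copied from this card.
USER_TEMPLATE = (
    "# Specifications of what to extract:\n{properties}\n"
    "# Extraction source:\n{source}"
)


def build_messages(properties, source, language="English", tone="Researcher"):
    """Assemble the chat messages expected by the adapter (illustrative helper)."""
    return [
        {
            "role": "system",
            "content": SYSTEM_TEMPLATE.format(language=language, tone=tone),
        },
        {
            "role": "user",
            "content": USER_TEMPLATE.format(
                properties=json.dumps(properties), source=source
            ),
        },
    ]


messages = build_messages(
    [{"label": "TL;DR", "desc": None, "schema": None}],
    "Example paper text ...",
    language="German",
    tone="Child",
)
# `messages` can then be rendered with tokenizer.apply_chat_template(...)
# and sent to a normal generation endpoint (model loading omitted here).
```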