AgriLLM

Model Card for AI71ai/Llama-agrillm-3.3-70B

AI71ai/Llama-agrillm-3.3-70B is an agriculture-focused foundation Large Language Model (LLM) developed by AI71 with support from leading institutions in the agricultural sector. It is fine-tuned using LoRA on top of meta-llama/Llama-3.3-70B-Instruct.

This model is developed as part of the AgriLLM initiative, a multi-stakeholder collaboration involving the International Affairs Office at the UAE Presidential Court, the Gates Foundation, CGIAR, Embrapa, ECHO, FAO, IFAD, the World Bank, Digital Green, and other leading organizations in the agriculture domain.

The model is fine-tuned on agriculture-specific Q&A pairs to strengthen its ability to understand agricultural contexts, provide accurate agronomic guidance, and generate reliable, expert-aligned responses, while still preserving the broad capabilities of the underlying Llama 3.3-70B base model.

It is primarily designed for use within Retrieval-Augmented Generation (RAG) systems - where it can leverage external agricultural knowledge bases for accuracy and contextual grounding - rather than as a standalone model.

🚜 What is AgriLLM?

AgriLLM is an initiative to provide the global agriculture community with open, foundation AI building blocks that support wider AI adoption and help close the information gap faced by smallholder farmers and agricultural professionals worldwide.

As part of the initiative, four open-source public goods will be released:

  1. A set of fine-tuned LLMs specialized for agriculture
  2. The supervised training dataset of agriculture-focused Q&A pairs used for fine-tuning, enabling anyone to train their own models
  3. An agriculture evaluation benchmark (datasets and metrics) providing a common standard to assess and compare model performance
  4. A corpus of agricultural documents for building Retrieval-Augmented Generation (RAG) pipelines

The initiative’s philosophy is to empower the community with practical, high-quality AI resources - allowing researchers, developers, and institutions to create their own downstream agricultural AI applications built on top of these open building blocks, including the fine-tuned AgriLLMs.

How to Get Started with the Model

Use the code below to get started with the model.

Transformers Code Example
# Load the model and tokenizer directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AI71ai/Llama-agrillm-3.3-70B")
model = AutoModelForCausalLM.from_pretrained(
    "AI71ai/Llama-agrillm-3.3-70B",
    torch_dtype="auto",   # checkpoint ships in 16-bit precision
    device_map="auto",    # shard the 70B model across available GPUs
)

messages = [
    {"role": "user", "content": "How to grow maize in Kenya?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
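
Because the model is intended primarily for RAG pipelines, the sketch below shows one way to ground it with retrieved context, reusing the tokenizer and model loaded above. The retrieval step is a placeholder: the retrieved_chunks list and the system prompt are illustrative assumptions, and any retriever (BM25, vector search, etc.) can supply the passages.

# Hedged sketch: grounding the model with retrieved context (RAG-style).
# `retrieved_chunks` is a placeholder for the output of your retriever.
retrieved_chunks = [
    "Maize in Kenya is typically planted at the onset of the long rains ...",
]

context = "\n\n".join(retrieved_chunks)
messages = [
    {
        "role": "system",
        "content": "Answer using only the provided context. "
                   "If the context is insufficient, say so.",
    },
    {
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: How to grow maize in Kenya?",
    },
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))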

Model Details

Model Description

🔧 Base Model

  • Base: meta-llama/Llama-3.3-70B-Instruct
  • Fine-tuning method: LoRA (parameter-efficient fine-tuning)
  • Architecture: 70B-parameter transformer (Llama 3.3 family)

🎯 Objectives of Fine-tuning

  • Strengthen agricultural reasoning: Improve understanding of agronomic concepts and the ability to reason about agricultural scenarios
  • Generate expert-aligned, domain-relevant outputs: Produce advisory responses that reflect best practices and validated knowledge, using terminology appropriate to the agricultural context
  • Ensure reliability and safety: Reduce hallucinations, maintain factual accuracy, and preserve groundedness in agriculture-related queries
  • Leverage retrieved context effectively: Enhance the model’s ability to interpret retrieved information and identify the most relevant agricultural content in RAG applications
  • Preserve general capabilities: Maintain the broad reasoning and generative abilities of the Llama-3.3-70B base model

🌾 Model Capabilities

AI71ai/Llama-agrillm-3.3-70B provides foundational AI capabilities that can be applied across the agricultural domain, including:

  • Question Answering: Responds accurately to agricultural queries based on provided or retrieved information
  • Summarization: Condenses technical agricultural documents, research papers, and policy briefs into concise summaries
  • Advisory Generation: Produces structured guidance or recommendations based on domain knowledge
  • Reasoning: Supports scenario analysis, domain-specific reasoning, and decision support
  • Context Evaluation: Assesses retrieved content for relevance when generating outputs (optimized for RAG pipelines)

Uses

AI71ai/Llama-agrillm-3.3-70B is primarily intended as a foundation-model building block. It is designed to better understand agricultural contexts and perform effectively when connected to internal knowledge bases or used in Retrieval-Augmented Generation (RAG) pipelines. This model is not a standalone solution with universal answers; instead, it provides specialized capabilities that downstream applications can leverage to deliver accurate, grounded, and context-aware outputs in agriculture.

The model can assist multiple personas across the agricultural ecosystem. Example capabilities include:

Farmers

  • Answer questions on crop production (sowing, irrigation, harvesting)
  • Provide pest and disease management guidance based on symptoms
  • Recommend fertilizers and nutrient applications
  • Advise on livestock care

Field Extension Agents

  • Generate advisory responses for farmers
  • Support diagnostic workflows and on-field problem-solving
  • Prepare step-by-step field instructions and protocols
  • Interpret technical manuals, guidelines, and extension materials

Academics & Researchers

  • Summarize agricultural literature and research papers
  • Explain research methodologies and concepts
  • Analyze and interpret policy briefs and technical reports
  • Support domain-specific reasoning and scenario modeling

Policymakers & Project Managers

  • Assist in agricultural program assessments and evaluations
  • Support impact analysis and data-driven recommendations
  • Generate evidence-based policy or project briefs
  • Provide reasoning grounded in agricultural principles and best practices

⚠️ Out-of-Scope Use

This model is not a universal source of answers and is not intended as a standalone solution. It is designed to support agricultural workflows but cannot replace expert knowledge. All outputs should be verified, especially in high-stakes contexts.

Moreover, the model is not intended for:

  • Medical or veterinary diagnosis
  • Producing legally binding recommendations or official documents
  • High-risk decision-making without expert supervision
  • Replacing certified agronomists, extension agents, or researchers
  • Providing real-time field measurements or monitoring (e.g., soil moisture, weather, or crop sensor data)
  • Making financial, legal, or regulatory decisions in agricultural projects

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

Training Details

Training Data

The model was fine-tuned on 200k high-quality examples (146k of which were domain-specific), combining:

  1. Human expert-generated Q&A pairs: Written or approved by agricultural domain specialists (e.g., agronomists, researchers, extension agents)
  2. Q&A pairs extracted from real-world interactions: Drawn from forums, email threads, SMS-based extension services, and other practical agricultural communications
  3. Synthetic Q&A pairs: Generated and curated through controlled extraction from agricultural documents, using LLMs with carefully designed prompts to minimize hallucination
  4. Domain-specific tasks:
    • Summarization of agronomy texts
    • Reading comprehension of agricultural guidelines
    • Soil, crop, and livestock reasoning tasks
    • Policy, research, and project-management reasoning
  5. General ability tasks:
    • Included to prevent catastrophic forgetting
    • Maintains strong general reasoning, math, and language skills

No private or personal data is included. All partner datasets were anonymized and ethically prepared.

Training Procedure

This model was created by performing LoRA (parameter-efficient) fine-tuning on top of the instruction-tuned foundation model meta-llama/Llama-3.3-70B-Instruct.

  • Hardware: 8× NVIDIA H100 GPUs
  • CPU: 150 cores
  • Training time: ~7 hours
  • Batch size: 8
  • Gradient accumulation steps: 4
  • Model checkpoint: 350 steps
  • LoRA rank: 128
  • LoRA alpha: 256
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head, embed_tokens
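
For reference, these hyperparameters map onto a LoRA configuration roughly as follows. This is a minimal sketch assuming the Hugging Face PEFT library; the actual training stack used is not specified in this card.

# Minimal sketch of the LoRA setup described above, using Hugging Face PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

lora_config = LoraConfig(
    r=128,                # lora_rank
    lora_alpha=256,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "lm_head", "embed_tokens",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # adapters only; the 70B base stays frozen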

Evaluation

An initial evaluation was performed, covering:

  • Basic generation quality
  • Domain consistency
  • Manual SME review of sample outputs

A full benchmark suite (field Q&A accuracy, agronomy reasoning, livestock QA, safety evaluation) will be added in a future update.

Model                              Answer Correctness   Factual Correctness (Recall)   Factual Correctness (Precision)
GPT-4o                             0.405                0.496                          0.240
GPT-4o with RAG                    0.489                0.631                          0.355
Llama-3.3-70B                      0.383                0.462                          0.232
Llama-3.3-70B Finetuned            0.474                0.358                          0.359
Llama-3.3-70B Finetuned with RAG   0.557                0.487                          0.532
Qwen3-30B-A3B                      0.377                0.480                          0.180
Qwen3-30B-A3B Finetuned            0.488                0.355                          0.391
Qwen3-30B-A3B Finetuned with RAG   0.565                0.496                          0.509
Falcon3-10B Finetuned              0.452                0.351                          0.350
Falcon3-10B Finetuned with RAG     0.545                0.459                          0.509

Testing Data, Factors & Metrics

Testing Data

We evaluated the model on the agrillm-qa-eval-800 dataset, which contains 800 Q&A pairs covering multiple agricultural topics, crops, geographies, and tasks.
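
If the evaluation set is published on the Hugging Face Hub, it can presumably be loaded with the datasets library. The hub id and split below are assumptions inferred from the dataset name, not confirmed identifiers.

# Hypothetical: loading the 800-pair evaluation set with the `datasets` library.
# The hub id "AI71ai/agrillm-qa-eval-800" and the split name are assumptions.
from datasets import load_dataset

eval_set = load_dataset("AI71ai/agrillm-qa-eval-800", split="test")
print(len(eval_set))  # expected: 800 Q&A pairs, per the description above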

Factors

The creation of the evaluation set considered multiple dimensions:

  • User personas: 4 primary personas spanning farmers, extension agents, researchers, and policymakers
  • Topics and domains: Multiple crops, produce types, and subdomains of agricultural knowledge
  • Answer types: Single-turn and multi-turn Q&A, as well as domain ‘tasks’ (summarization, classification, etc.)

Metrics

The evaluation pipeline uses RAGAS with GPT-4o as an LLM judge: the model’s responses are automatically assessed against reference answers for correctness, completeness, and language quality.

Example (Agriculture – Rice Irrigation):

Ground-truth facts:

  • Rice needs standing water during early growth.
  • Drip irrigation is rarely used for rice.
  • Water requirement is highest during tillering stage.

LLM Response:

  • Rice generally grows best with standing water.
  • Restrictive licenses are good.

When we grade the response, we find:

  ✅ Rice generally grows best with standing water.
  ❌ Restrictive licenses are good.

factual_correctness_precision: Of the 2 claims stated by the model, 1 is correct (standing water) → Precision = 1/2.

factual_correctness_recall: Of the 3 ground-truth facts, the model mentioned only 1 → Recall = 1/3.
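
The arithmetic behind this worked example is simple enough to write out directly (an illustration, not the RAGAS implementation):

# Worked arithmetic for the rice-irrigation example above.
supported_claims = 1          # "standing water" is backed by the ground truth
total_response_claims = 2     # the response made two claims
covered_facts = 1             # only one of the three ground-truth facts appears
total_groundtruth_facts = 3

precision = supported_claims / total_response_claims   # 1/2 = 0.5
recall = covered_facts / total_groundtruth_facts       # 1/3 ≈ 0.333
print(f"precision={precision:.3f}, recall={recall:.3f}")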

Results

Check our leaderboard for more information: https://huggingface.co/spaces/AI71ai/agrillm-leaderboard

Environmental Impact

Training Hardware

  • 8× NVIDIA H100 GPUs
  • 150-core CPU
  • 500 GB storage
  • Single-node configuration in a UAE data center
  • ~7 hours

Estimated Energy Consumption

  • Estimated IT power (GPUs + CPU + system): ≈ 7.1 kW
  • Data-center Power Usage Effectiveness (PUE): 1.4
  • Estimated total facility power: ≈ 9.94 kW
  • Total energy consumed: 9.94 kW × 7 h ≈ 69.6 kWh

Estimated Carbon Emissions

  • UAE grid emission factor: 0.40 kg CO₂e/kWh
  • Total carbon emissions: 69.6 kWh × 0.40 kg CO₂e/kWh ≈ 28 kg CO₂e

Summary

  • Total training energy: ~69.6 kWh
  • Total training emissions: ~28 kg CO₂e

Assumptions and Methodology

  • GPU power based on NVIDIA H100 SXM maximum TDP of ~700 W per GPU.
  • CPU + platform power estimated at ~1.5 kW under load.
  • IT load assumed to be fully utilized during training.
  • Data-center overhead modeled using PUE = 1.4.
  • UAE grid intensity assumed at 0.40 kg CO₂e/kWh.
  • Estimates include only operational electricity use; hardware manufacturing and external networking emissions are excluded.
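
Under these assumptions, the estimate can be reproduced in a few lines:

# Reproducing the back-of-envelope energy and emissions estimate above.
gpu_power_kw = 8 * 0.700                      # 8x H100 SXM at ~700 W TDP each
cpu_platform_kw = 1.5                         # CPU + platform estimate under load
it_load_kw = gpu_power_kw + cpu_platform_kw   # 7.1 kW
facility_kw = it_load_kw * 1.4                # PUE of 1.4 -> 9.94 kW
energy_kwh = facility_kw * 7                  # ~69.6 kWh over 7 hours
emissions_kg = energy_kwh * 0.40              # UAE grid factor -> ~28 kg CO2e
print(f"{energy_kwh:.1f} kWh, {emissions_kg:.1f} kg CO2e")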

Acknowledgements & Data Sources

We gratefully acknowledge the contributions of our partners and collaborators who made this work possible:

  • The International Affairs Office of the UAE Presidential Court
  • Gates Foundation
  • CGIAR – Consultative Group on International Agricultural Research
  • Embrapa – Empresa Brasileira de Pesquisa Agropecuária
  • ECHO
  • FAO – Food and Agriculture Organization of the United Nations
  • IFAD – International Fund for Agricultural Development
  • The World Bank
  • Digital Green
  • KIADPAI – Khalifa International Award for Date Palm and Agricultural Innovation
  • KALRO – Kenya Agricultural and Livestock Research Organization
  • Extension Foundation

Special thanks to all partners for their invaluable support, including:

  • Data preparation: Curating agricultural documents and Q&A pairs, with manual verification by domain experts
  • Expert guidance: Supporting the verification of synthetic Q&A pairs generated for model fine-tuning
  • AI assistant design: Providing expertise on designing downstream AI applications to test the models
  • Model testing: Manually evaluating model outputs to ensure quality and relevance
  • Field engagement: Collaborating with end-users in agricultural settings to support adoption and collect current needs and feedback

All datasets used were anonymized and ethically prepared, and no private or personal data was included.

Citation

If you find this model useful, please cite us:

@misc{Llama-agrillm-3.3-70B,
      title={Llama-agrillm-3.3-70B},
      author={Mamoun Alaoui and Ojas Agarwal and Zafar Shadman and Derek Thomas},
      year={2025},
}
