SmolLM3-Energy-RAG (LoRA Adapter)

This repository contains a LoRA fine-tuned version of SmolLM3-3B for AI energy sustainability question answering and retrieval-augmented generation.

Introduction/Motivation

Artificial intelligence is expanding at an unprecedented rate, but the energy demands behind modern AI systems are a growing environmental and sustainability challenge due to large-scale data centers, intensive computation, and the need for constant model retraining. General-purpose LLMs are poorly suited for this domain because they are updated infrequently, struggle to capture technical nuance, and cannot reliably ground claims in domain-specific sources. To address these limitations, this project develops a domain-specific Retrieval-Augmented Generation (RAG) system built from 824 curated GDELT news articles on AI energy usage, paired with a FAISS vector store using MPNet embeddings and cosine similarity. The system also incorporates a small LoRA-tuned instruction dataset to improve structured summarization and comparative reasoning within the domain. Designed for students, researchers, policymakers, and technically curious users, this RAG model provides timely, source-grounded insights into AI energy trends and sustainability challenges, offering far greater relevance and specificity than general-purpose LLMs.

Data

The RAG corpus for this project was built entirely from the GDELT news API, using targeted AI-energy keywords (e.g., AI energy usage, GPU power consumption, data center emissions, AI sustainability) to collect relevant URLs, remove duplicates, and scrape full-text articles with standardized metadata. This process produced a final set of 824 articles focused on the energy and environmental impacts of modern AI systems. This article dataset was assembled for class purposes and only covers news from the past year, but it could easily be expanded by changing the date range in the base code. Separately, a small instruction-tuning dataset was created using the free-tier GNews API to gather a distinct set of articles that served as the basis for generating four types of prompts: summarization, comparison/synthesis, analytical reasoning, and limited forward-looking questions. Instruction–response pairs were generated using GPT and validated through a combination of manual review and secondary verification by Claude to ensure factual accuracy and consistency. This instruction dataset was intentionally kept small to provide a lightweight refinement of the model’s response style without overwriting SmolLM3’s strong general semantic capabilities.
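
For reference, below is a minimal sketch of the collection step, assuming the public GDELT DOC 2.0 API; the keyword list, date range, and helper function are illustrative assumptions and do not reproduce the exact pipeline code.

import requests

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"
KEYWORDS = ["AI energy usage", "GPU power consumption", "data center emissions", "AI sustainability"]

def collect_article_urls(keywords, max_per_query=250):
    # Query GDELT for each keyword, then de-duplicate article URLs across queries.
    seen, articles = set(), []
    for kw in keywords:
        params = {
            "query": f'"{kw}"',                    # phrase search for the keyword
            "mode": "artlist",                      # return a list of matching articles
            "format": "json",
            "maxrecords": max_per_query,
            "startdatetime": "20241201000000",      # illustrative date range (YYYYMMDDHHMMSS)
            "enddatetime": "20251129000000",
        }
        resp = requests.get(GDELT_DOC_API, params=params, timeout=30)
        for art in resp.json().get("articles", []):
            url = art.get("url")
            if url and url not in seen:             # keep only previously unseen URLs
                seen.add(url)
                articles.append({"url": url, "title": art.get("title"), "date": art.get("seendate")})
    return articles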

Methodology

This project uses a finetuned variant of HuggingFaceTB/SmolLM3-3B, trained with lightweight LoRA instruction tuning, and a separate Retrieval-Augmented Generation (RAG) pipeline connected to a FAISS vector store built from MPNet embeddings with cosine similarity. The LoRA-tuned model is meant to improve the style of responses and support domain-specific generation, while the FAISS-based retriever supplies up-to-date, semantically relevant context drawn from the curated GDELT corpus.

To improve the model’s ability to generate structured, domain-specific answers while preserving its base reasoning abilities, I applied LoRA instruction tuning using the curated GNews-based dataset. Initial experiments used LoRA rank r = 16 with dropout = 0.05, but ablation studies showed that while these settings improved ROUGE scores, BERTScores declined significantly, indicating a loss of the model’s semantic interpretability. The final configuration therefore reduced the rank to r = 8 and the dropout to 0.02, which struck a better balance between adapting the style of the instruction-pair responses and maintaining core semantic capabilities. Prompt engineering also played a critical role in improving model output: I tested several templates, and the final prompt that produced the most natural, reliable, and well-structured outputs is detailed in the ‘Prompt Format’ section of this documentation.

I then combined the tuned model with the RAG component, which retrieves the top-matching articles for each query, injects their content into the model’s prompt, and lets the system incorporate new information without additional retraining. To select the retrieval setup, I compared MPNet against MiniLM as an alternative embedding model and cosine similarity against Euclidean distance as an alternative metric. In practice, MPNet with cosine similarity consistently retrieved articles that were more contextually meaningful and better aligned with the nuances of AI energy and sustainability topics. Together, the MPNet–cosine retrieval setup and the LoRA-tuned model keep the system grounded in domain-specific context while preserving the flexibility and reasoning depth of the SmolLM3 base model.
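
For reference, below is a minimal sketch of the final adapter configuration and the cosine-similarity index construction, assuming the peft, sentence-transformers, and faiss libraries; the LoRA alpha value, target modules, and the `corpus` variable are illustrative assumptions rather than the exact training code.

from peft import LoraConfig, get_peft_model
from sentence_transformers import SentenceTransformer
import faiss

# Final adapter settings described above (r=8, dropout=0.02); lora_alpha and
# target_modules are assumptions, not values taken from the training script.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.02,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # base_model = SmolLM3-3B loaded separately

# Cosine-similarity FAISS index: normalize MPNet embeddings, then use inner product.
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
texts = [doc["content"] for doc in corpus]           # `corpus` = the 824 scraped GDELT articles
doc_embs = embedder.encode(texts).astype("float32")
faiss.normalize_L2(doc_embs)                         # normalized vectors make inner product equal cosine
index = faiss.IndexFlatIP(doc_embs.shape[1])
index.add(doc_embs)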

Evaluation

To evaluate the system, I conducted two complementary assessments: (1) model-level evaluation measuring how LoRA instruction tuning changed the generative behavior of SmolLM3, and (2) retrieval-level evaluation measuring how effectively the RAG pipeline identifies relevant documents from the GDELT corpus. For model-level testing, I used a 90–10 train/test split of my instruction dataset and measured BoolQ accuracy, ROUGE, and BERTScore before and after LoRA training. BoolQ (via lm_eval) was selected to ensure that the model’s general reasoning ability remained intact. ROUGE measures how well the model adapts to the structure and style of the instruction-response pairs, while BERTScore quantifies the semantic similarity between the model’s generated answer and the reference answer.
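
As an illustration, the ROUGE and BERTScore numbers reported below can be computed on the held-out split with the Hugging Face evaluate library; `predictions` and `references` are placeholders for the model generations and gold responses, and this sketch is not the exact evaluation script.

import evaluate

# Placeholders: `predictions` = generations on the 10% test split, `references` = gold responses.
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print("ROUGE-L:", rouge_scores["rougeL"])
print("BERTScore F1:", sum(bert_scores["f1"]) / len(bert_scores["f1"]))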

For retrieval evaluation, I constructed my own benchmark: for each query, I created a reference list of the 10 most relevant articles in my vector store. I generated two versions of this reference set, one using TF-IDF ranking and one using GPT-generated relevance judgments, with the goal that the two methods would balance each other out and improve the reliability of the evaluation. Although the two lists had little overlap, both produced similar retrieval statistics. Using MPNet embeddings with cosine similarity, the retriever matched an average of about three of the ten reference articles per query. RAG pipelines typically use only the top few retrieved documents for context injection, so retrieving “the right three” is far more important than attempting to fill all ten. A minimal sketch of the Precision@10 scoring is shown below, followed by the quantitative benchmark results on the train/test split.
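
This sketch shows how Precision@10 can be computed against the reference lists; the `benchmark` mapping and `retrieve_ids` helper are placeholders, and the weighted score in the table uses its own scoring scheme that is not reproduced here.

def precision_at_k(retrieved_ids, reference_ids, k=10):
    # Fraction of the top-k retrieved articles that appear in the reference list.
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in set(reference_ids))
    return hits / k

# `benchmark` maps each query to its list of 10 reference article ids (placeholder),
# and `retrieve_ids(query)` returns the ids of the retriever's top-10 results (placeholder).
# precisions = [precision_at_k(retrieve_ids(q), refs) for q, refs in benchmark.items()]
# mean_precision_at_10 = sum(precisions) / len(precisions)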

Model and Benchmark Results - Quantitative

| Model / Benchmark | BoolQ Accuracy | ROUGE (Test Split) | BERTScore F1 | Retrieval Weighted Score |
|---|---|---|---|---|
| SmolLM3-3B (Base Model) | 0.8248 | 0.0762 | 0.8382 | N/A |
| Finetuned SmolLM3-3B (LoRA r=8, d=0.02) | 0.8254 | 0.0966 | 0.8332 | N/A |
| RAG Retriever (MPNet + cosine) | N/A | N/A | N/A | 421 (Precision@10 = 0.276) |

Model and Benchmark Results - Qualitative

Below are example outputs from the model before and after applying LoRA tuning and RAG retrieval.

Before RAG + LoRA

[Example output image: base model response before LoRA tuning and RAG]

After RAG + LoRA

[Example output images: two responses after LoRA tuning and RAG]

Performance Summary

Overall, LoRA instruction tuning improved stylistic alignment and response structure, shown by increased ROUGE, while maintaining the model’s semantic reasoning ability, as evidenced by stable BoolQ accuracy and only a minor decrease in BERTScore. Retrieval performance with MPNet-cosine produced a small but meaningful set of relevant documents for each query. The largest improvements, however, appear in the qualitative outputs. Before tuning, for one prompt the base model reframed the question as a multiple-choice question and explained why it would pick one of its own options, even though the question was not multiple choice. For the second prompt, it gave an answer that was technically correct but lacked detail. As shown above, after training the model no longer produces the wrong style of output, and its answers have much more depth.

Usage and Intended Uses

The model is designed for students, researchers, and policymakers who want to explore topics in AI energy usage, sustainability challenges, and hardware efficiency trends. The finetuned SmolLM3 model is intended for tasks such as summarizing energy-related news, extracting key claims from news articles, comparing technologies, and answering analytic questions using retrieved context from the RAG pipeline. This model is not designed for general-purpose knowledge beyond the AI sustainability domain and should not be used for high-stakes decision making. The RAG retrieval system allows the model to incorporate up-to-date information from the curated GDELT corpus, while the LoRA tuning helps align responses to the desired explanatory style without altering core reasoning abilities. Below is code to load the model, FAISS Vector Store and metadata, and embedder in your local environment.

import torch
import faiss
import json
import numpy as np
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "rrallan/smollm3-energy-rag-lora"

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Download the FAISS index and article metadata stored in the model repo
faiss_path = hf_hub_download(model_name, "rag_index/faiss.index")
metadata_path = hf_hub_download(model_name, "rag_index/metadata.json")
index = faiss.read_index(faiss_path)
with open(metadata_path, "r") as f:
    metadata = json.load(f)

# Embedding model used to build the vector store (MPNet)
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

Prompt Format

This model uses a structured RAG prompt that injects the retrieved documents directly into the system message before generation. The prompt format was selected after iterative experimentation to encourage concise, grounded, evidence-based responses, but users are free to adjust or simplify the template as needed. The core prompt consists of a short system role description, the user’s question, and the retrieved documents.

Example Code: Retrieval, Prompt Construction, and Answer Generation

def retrieve(query, top_k=3):
    # Embed and normalize the query so inner-product search equals cosine similarity
    q_emb = embedder.encode([query]).astype("float32")
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, top_k)
    docs = [metadata[i] for i in indices[0]]
    return docs

def build_context(docs, max_chars=600):
    blocks = []
    for doc in docs:
        snippet = doc["content"][:max_chars]
        blocks.append(f"Document: {doc['title']}\n{snippet}")
    return "\n\n".join(blocks)

def query_rag(question, top_k=3):
    retrieved_docs = retrieve(question, top_k=top_k)
    context = build_context(retrieved_docs)

    messages = [
        {
            "role": "system",
            "content": """
You are a question answering system.

Rules:
- Answer the question directly.
- Use ONLY the provided documents, making sure to incorporate information from each document.
- Do NOT ask questions.
- If the answer is not in the documents, say: "Not enough information in the provided documents."
- Be concise, write a max of 3 paragraphs at 5-7 sentences per paragraph, but do not write this much if it is not needed.
- If possible and reasonable, cite specific statistics and metrics from the provided documents.
"""
        },
        {
            "role": "user",
            "content": f"""
Question: {question}

Documents:
{context}

Answer:
"""
        }
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=2000,
            do_sample=False,
            temperature=0.0,
            pad_token_id=tokenizer.pad_token_id
        )

    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    if "assistant" in full_response.lower():
        answer = full_response.split("assistant")[-1].strip()
    else:
        answer = full_response.strip()

    return answer

# Example Call
response = query_rag("How much electricity do modern AI models consume?", top_k=3)
print(response)

Optional: Add Confidence Scores and Source Reporting

If users want additional transparency, they can extend the pipeline with similarity-score-based confidence labels and a list of retrieved article titles.

import re

def confidence_label(score):
    if score >= 0.85:
        return "High"
    elif score >= 0.7:
        return "Medium"
    else:
        return "Low"

def retrieve_with_scores(query, top_k=3):
    q_emb = embedder.encode([query]).astype("float32")
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, top_k)
    return indices[0], scores[0]

# Example extension inside query_rag (fragment):
# `similarity_scores` are the scores returned by retrieve_with_scores,
# `retrieved_docs` are the documents retrieved earlier in query_rag,
# and `cleaned_answer` is the answer string produced by the existing post-processing.
avg_score = float(np.mean(similarity_scores))
label = confidence_label(avg_score)

return {
    "query": question,
    "answer": cleaned_answer,
    "confidence_score": round(avg_score, 3),
    "confidence_label": label,
    "sources": [doc["title"] for doc in retrieved_docs]
}

Expected Output Format

The model returns a concise, grounded answer based solely on the retrieved documents. A typical response consists of 1-3 short paragraphs that synthesize information across sources without adding unsupported claims. Users may optionally enable confidence scores and a list of retrieved document titles for greater transparency into the retrieval process.

Example Output:

Answer:
The energy demands of AI training across cloud, edge, and large-scale data centers are significant and growing rapidly. Here's a comparison based on the provided documents:

1. Large-Scale Data Centers: These are the primary locations for AI training, as they provide the necessary high-performance computing (HPC) infrastructure, specialized chips, and large memory. According to the GlobalData report, AI workloads are pushing the boundaries of data center capacity, with most AI training occurring in large-scale facilities. The International Energy Agency (IEA) estimates that data centers' electricity consumption will more than double to approximately 945 terawatt-hours (TWh) by 2030, up from 415 TWh in 2024. This growth is driven by the computational intensity of AI models, which require significant energy for training and inference.

2. Edge Computing: Edge computing involves processing data closer to the source, reducing the need for data to be sent to large data centers. This can lead to lower energy consumption for AI training, as less data needs to be transmitted. However, edge devices still require energy for processing and inference. The document from The IOWN Global Forum suggests that all-photonic networks (APNs) could help reduce energy consumption in AI training by enabling remote GPU services, which can offload energy-intensive tasks to specialized, eco-friendly data centers. This approach can significantly reduce the energy footprint of AI training.

3. Cloud Computing: Cloud providers like Oracle, Nvidia, and Google are investing heavily in AI infrastructure, which drives demand for energy. The document from The IOWN Global Forum highlights that cloud providers are exploring innovative solutions like liquid cooling and advanced cooling systems to manage extreme rack densities and improve thermal performance. These solutions can help reduce energy consumption in cloud-based AI training.

In summary, large-scale data centers are the primary drivers of AI energy demand, but edge computing and cloud computing can also play a role in reducing energy consumption through more efficient processing and offloading of tasks. However, the overall energy demand for AI training is expected to continue growing unless significant innovations in energy efficiency and sustainable practices are implemented.

Optional Output Extension (Confidence + Sources):

Confidence:
Medium 0.727

Sources:
- Data centers must tackle AI, sustainability challenges: report
- The AI energy paradox: Turning a power surge into a climate opportunity
- The AI Energy Crisis: A Looming Threat to Sustainability and Tech Green Ambitions
- The Digital Gold Rush: Why Investors Back Sustainable Data Centres
- How can we create a sustainable AI future?
- How much energy does AI really use? The answer is surprising - and a little complicated
- How photonics could save AI from its own energy appetite
- What the Tech: AI environmental impact
- Google Reveals Gemini AI Energy, CO2, Water Use Per Prompt
- Why AI energy demand needs transparency, not just efficiency

Limitations

While this domain-specific RAG system meaningfully improves the relevance and timeliness of answers related to AI energy usage, several important limitations remain. First, the model can still generate inaccurate or incomplete statements, especially when retrieved documents lack specific details, and all outputs should be independently fact-checked. Second, the system does not maintain conversational memory: users cannot build iterative dialogue or follow-up questions without explicitly re-including all prior context in each new query, which may limit usability for deeper analysis. Third, the vector store powering retrieval was last updated on November 29, 2025, meaning that news or research published after that date will not be surfaced by the RAG pipeline unless the index is refreshed. Finally, the system relies entirely on news sources rather than peer-reviewed academic literature or structured energy datasets, which may introduce bias toward journalistic framing or omit critical quantitative metrics. Together, these limitations indicate that the model is best used as an assistive research aid rather than an authoritative analytical tool.
