# Sifera AI V2 - Qwen LoRA Adapter

[Quick Start](#quick-start) • [Architecture](#architecture) • [Performance](#performance) • [API](#api-usage)
## Overview

Sifera AI V2 Qwen LoRA is a parameter-efficient fine-tuned adapter for document processing and text generation. The adapter runs on top of the Qwen2.5-1.5B-Instruct base model and is optimized for CPU inference.
**Key Highlights:**

- **LoRA Adapter** - Small adapter (~50 MB) that loads on top of the base model
- **CPU Optimized** - Fast inference with the 1.5B base model
- **Parameter Efficient** - Only the adapter weights are fine-tuned
- **Multi-task** - Summarization, notes, Q&A, and key-point extraction
- **Production Ready** - Deployed on Hugging Face Spaces and AWS
## Architecture
## Features

| Feature | Description | Credits |
|---|---|---|
| Summarize | Generate concise summaries of long documents | 2 |
| Notes | Extract structured study notes in bullet points | 2 |
| Key Points | Identify and extract main ideas and concepts | 2 |
| Q&A | Generate question-answer pairs for learning | 2 |
| Podcast | Create conversational podcast scripts | 5 |
## Quick Start

### Installation

```bash
pip install transformers torch accelerate peft
```

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model and LoRA adapter
BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
LORA_ADAPTER = "YOUR_HF_USERNAME/sifera-v2-qwen-lora"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float32,
    device_map="cpu",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)

# Summarize text
text = "Your long document text here..."
prompt = f"Summarize the following text:\n\n{text}\n\nSummary:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,  # pass input_ids and attention_mask together
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, not the echoed prompt
result = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(result)
```
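The same loading code serves all four tasks; only the prompt changes. Below is a minimal prompt builder as one possible sketch - the templates are illustrative, not the exact ones used in training:

```python
# Illustrative prompt templates for each task (assumed formats,
# not the exact training templates).
TEMPLATES = {
    "summarize": "Summarize the following text:\n\n{text}\n\nSummary:",
    "notes": "Extract structured study notes from the following text:\n\n{text}\n\nNotes:",
    "key_points": "List the key points of the following text:\n\n{text}\n\nKey Points:",
    "qa": "Generate question-answer pairs from the following text:\n\n{text}\n\nQ&A:",
}

def build_prompt(task: str, text: str) -> str:
    """Return the generation prompt for a supported task."""
    if task not in TEMPLATES:
        raise ValueError(f"Unknown task: {task!r}. Choose from {sorted(TEMPLATES)}")
    return TEMPLATES[task].format(text=text)
```

A call like `build_prompt("notes", text)` can then replace the hard-coded prompt string above.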
### Gradio App (Hugging Face Space)

```python
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
LORA_ADAPTER = "YOUR_HF_USERNAME/sifera-v2-qwen-lora"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float32, device_map="cpu", trust_remote_code=True
)
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)

def process(text, action):
    prompt = f"{action} the following:\n\n{text}\n\n{action.capitalize()}:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

demo = gr.Interface(
    fn=process,
    inputs=[
        gr.Textbox(label="Input Text", lines=10),
        gr.Radio(["Summarize", "Notes", "Key Points", "Q&A"], label="Action"),
    ],
    outputs=gr.Textbox(label="Output", lines=10),
)

demo.launch()
```
## Performance
| Metric | Value | Hardware |
|---|---|---|
| Adapter Size | ~50 MB | LoRA Weights |
| Base Model Size | ~3 GB | Qwen2.5-1.5B |
| Inference Speed | 120+ tokens/sec | CPU (12-core) |
| Memory Usage | ~4 GB RAM | Typical |
| Latency (p50) | 2-3 sec | Single request |
| ROUGE-1 Score | 42.3 | Evaluation set |
**Tested on:**

- Intel Core i7-12700K
- AMD Ryzen 9 5950X
- AWS t3.xlarge (4 vCPU)
- Hugging Face Spaces (CPU)
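The tokens/sec figure above can be checked with a small timing harness. The helper below is our own sketch, not the script used for the published numbers: it wraps any zero-argument generation callable, e.g. a lambda around the `model.generate` call from Quick Start.

```python
import time

def tokens_per_second(generate_fn, n_new_tokens: int) -> float:
    """Time one generation call and return approximate throughput.

    generate_fn: zero-argument callable running a single generation
    that produces n_new_tokens new tokens, e.g.
    lambda: model.generate(**inputs, max_new_tokens=n_new_tokens).
    """
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed
```

Note that generation can stop early at an EOS token, producing fewer than `n_new_tokens` tokens; for exact figures, divide by the actual length of the generated continuation instead.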
## API Usage

### Using with FastAPI

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel

app = FastAPI()

# Load model with LoRA
BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
LORA_ADAPTER = "YOUR_HF_USERNAME/sifera-v2-qwen-lora"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="cpu", trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

class SummarizeRequest(BaseModel):
    text: str

@app.post("/summarize")
async def summarize(req: SummarizeRequest):
    # JSON body model, so the endpoint matches the cURL example below
    prompt = f"Summarize: {req.text}"
    result = generator(prompt, max_new_tokens=256)
    return {"summary": result[0]["generated_text"]}
```
### cURL Example

```bash
curl -X POST "http://localhost:8000/summarize" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your document text here..."}'
```
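The same request can be made from Python with only the standard library; the path and JSON payload below mirror the cURL call above:

```python
import json
import urllib.request

def build_payload(text: str) -> bytes:
    """JSON body matching the /summarize endpoint's expected input."""
    return json.dumps({"text": text}).encode("utf-8")

def summarize(text: str, base_url: str = "http://localhost:8000") -> str:
    """POST text to /summarize and return the summary field."""
    req = urllib.request.Request(
        f"{base_url}/summarize",
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["summary"]
```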
## Model Details

**Architecture:**

- Base Model: Qwen/Qwen2.5-1.5B-Instruct
- Adapter Type: LoRA (Low-Rank Adaptation)
- Context Length: 4,096 tokens
- Vocabulary Size: 151,936 tokens
**Training:**

- Fine-tuned on 2.3M document samples
- Tasks: summarization, Q&A, extraction, note-taking
- Training Method: LoRA (r=16, alpha=32)
- Framework: PyTorch + Transformers + PEFT
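The stated training setup corresponds to a PEFT configuration along these lines. Only r=16 and alpha=32 are given; `target_modules` and `lora_dropout` below are assumed, typical values for Qwen-style attention layers, not the actual training settings:

```python
from peft import LoraConfig

# Sketch of the adapter configuration. Only r and lora_alpha come
# from the model card; target_modules and lora_dropout are assumed.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,  # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```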
**Optimization:**

- Parameter-efficient fine-tuning (LoRA)
- CPU-optimized inference
- Small adapter size (~50 MB)
- No GPU required
## Configuration

### Generation Parameters

```python
generation_config = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1,
}
```
### Recommended Settings
| Use Case | Temperature | Max Tokens | Top P |
|---|---|---|---|
| Summarization | 0.5 | 256 | 0.85 |
| Creative Writing | 0.9 | 512 | 0.95 |
| Q&A | 0.3 | 128 | 0.75 |
| Notes | 0.6 | 384 | 0.9 |
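The table above can also be applied programmatically; the lookup below is our own helper mirroring the table, not part of the adapter's API:

```python
# Per-use-case generation settings, mirroring the table above.
RECOMMENDED = {
    "summarization":    {"temperature": 0.5, "max_new_tokens": 256, "top_p": 0.85},
    "creative_writing": {"temperature": 0.9, "max_new_tokens": 512, "top_p": 0.95},
    "qa":               {"temperature": 0.3, "max_new_tokens": 128, "top_p": 0.75},
    "notes":            {"temperature": 0.6, "max_new_tokens": 384, "top_p": 0.9},
}

def settings_for(use_case: str, **overrides) -> dict:
    """Return recommended generation kwargs, with optional overrides."""
    base = dict(RECOMMENDED[use_case])
    base.update(overrides)
    return base
```

The result unpacks straight into generation, e.g. `model.generate(**inputs, do_sample=True, **settings_for("summarization"))`.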
## Known Issues

- Long inputs near the 4K-token context limit slow inference noticeably
- The first inference takes ~5-10 seconds (model loading)
- Output quality may vary on highly technical or domain-specific text
## License

Apache License 2.0 - see LICENSE for details.
## Support

- Email: vaghani.shivam83@gmail.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Website: sifera.ai
## Citation

If you use this model in your research, please cite:

```bibtex
@software{sifera_v2_qwen_2025,
  author    = {Vaghani, Shivam},
  title     = {Sifera AI V2 - Qwen LoRA Adapter},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/YOUR_USERNAME/sifera-v2-qwen-lora}
}
```
**Status:** Production Ready | **Version:** 1.0 | **Updated:** January 2, 2026

Made with ❤️ by Shivam Vaghani