# nolitai-vision: Meeting Vision Model (LoRA Adapter)
A LoRA adapter for FastVLM-1.5B-Stage3 fine-tuned for visual meeting intelligence tasks. Designed for on-device inference on Apple Silicon via MLX.
**Status: Early checkpoint.** This is an initial training run with limited data (~81 examples, 3 epochs). Performance is not yet production-ready; we're sharing it for research and community collaboration.
## Model Details
| Property | Value |
|---|---|
| Base Model | zhaode/FastVLM-1.5B-Stage3 |
| Architecture | LlavaQwen2 (MobileClip vision + Qwen2 language model) |
| Adapter Size | 8.3 MB (LoRA weights only) |
| Training | LoRA (rank=8, alpha=16) on q/k/v/o attention projections |
| Framework | PyTorch (PEFT), convertible to MLX |
## Capabilities
Given a video call screenshot, the model can:
- **Speaker Identification**: Detect the active/highlighted speaker in a video call grid
- **Participant Listing**: List all visible participants by name
- **Platform Detection**: Identify the meeting platform (Zoom, Teams, Meet, etc.)
- **Slide OCR**: Extract title and content from shared presentation slides
## Example Tasks
**Speaker ID**

Input: A screenshot of a Zoom call with a highlighted speaker tile

Expected output:

```json
{"speaker": "Sarah Chen"}
```
**Platform Detection**

Input: A screenshot of a video call

Expected output:

```json
{"platform": "Microsoft Teams"}
```
## Current Performance
| Task | Score | Notes |
|---|---|---|
| Speaker ID | 0% | Needs more diverse training examples |
| Participants | 0% | Needs more training data |
| Platform Detection | 60% | Partially learned |
| Slide OCR | 0% | Needs more training data |
| Overall | 10% | Early checkpoint, needs more data |
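The table does not specify the metric; assuming exact-match scoring on a held-out set, a minimal sketch of the evaluation helper could look like this (the example predictions and references are placeholders, not real evaluation data):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference string."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Placeholder data for illustration only
preds = ['{"platform": "Zoom"}', '{"platform": "Microsoft Teams"}']
refs  = ['{"platform": "Zoom"}', '{"platform": "Google Meet"}']
print(exact_match_accuracy(preds, refs))  # 0.5
```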
## Training Details
- Method: LoRA (full precision base model, adapter-only training)
- LoRA Config: rank=8, alpha=16, dropout=0.05
- Target Modules: q_proj, k_proj, v_proj, o_proj (language model only)
- Frozen Components: vision_tower (MobileClip), mm_projector (MLP)
- Dataset: ~81 synthetic video call screenshots with annotations
- Epochs: 3
- Learning Rate: 2e-5 (cosine scheduler, 5% warmup)
- Hardware: NVIDIA A40 48GB (RunPod)
- Training Time: ~3 minutes
- Final Train Loss: 2.50
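The adapter configuration above corresponds to roughly the following PEFT setup. This is a sketch reconstructed from the listed hyperparameters, not the actual training script:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                  # LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # language model attention only
    task_type="CAUSAL_LM",
)

# The vision_tower and mm_projector stay frozen because LoRA only
# wraps the listed attention projections:
# model = get_peft_model(base_model, lora_config)
```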
## Usage with PyTorch
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor, CLIPImageProcessor
from PIL import Image

# Load base model
base = AutoModelForCausalLM.from_pretrained(
    "zhaode/FastVLM-1.5B-Stage3",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load and merge adapter
model = PeftModel.from_pretrained(base, "SearchingBinary/nolitai-vision")
model = model.merge_and_unload()
model.eval()

# Load processors
processor = AutoProcessor.from_pretrained("zhaode/FastVLM-1.5B-Stage3", trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained("zhaode/FastVLM-1.5B-Stage3")
tokenizer = processor.tokenizer

# Inference
image = Image.open("meeting_screenshot.png").convert("RGB")
image_tensor = image_processor.preprocess(image, return_tensors="pt")["pixel_values"]
image_tensor = image_tensor.to(device=model.device, dtype=torch.bfloat16)

prompt = 'Identify the active speaker. Respond with JSON: {"speaker": "Name"}'
chat = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# NOTE: FastVLM expects input_ids as a positional argument, not a keyword
outputs = model.generate(inputs["input_ids"], images=image_tensor, max_new_tokens=256)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
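Generated text may include stray tokens around the JSON object. A small helper to pull the first JSON object out of the decoded response (a sketch, not part of the model's API; it assumes the object is not nested):

```python
import json
import re

def extract_json(text: str):
    """Return the first flat {...} object found in the text, parsed, or None."""
    # Non-greedy match: first "{" through the next "}"; nested objects
    # would need a real parser, but these responses are single-level.
    match = re.search(r"\{.*?\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

print(extract_json('Sure! {"speaker": "Sarah Chen"}'))  # {'speaker': 'Sarah Chen'}
```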
## Roadmap
- Expand training dataset to 1000+ examples
- Add more diverse meeting platforms and layouts
- Train for more epochs (target: >90% overall)
- Convert to MLX format for Apple Silicon deployment
- Integrate with nolitai-2b for full meeting intelligence pipeline
## Part of nolit.ai
This model is part of nolit.ai, a native macOS meeting copilot that processes everything locally on your Mac. The vision model handles real-time speaker identification during video calls.
## License
Apache 2.0