
mxbai-edge-colbert-v0-17m - ONNX export (ColBERT, ModernBERT backbone)

This repository contains an ONNX export of mixedbread-ai/mxbai-edge-colbert-v0-17m produced with PyLate + a ColBERT-aware wrapper. It preserves the projection stack and the ColBERT query/document markers ([Q] / [D]), and ships a skiplist of token IDs to ignore during MaxSim scoring.

Contents

  • onnx/model.onnx - FP32 export, opset 17 (✅ cosine 1.0 vs PyTorch)
  • onnx/model_quantized.onnx - dynamically quantized INT8 (⚠️ cosine ~0.972 vs PyTorch; measurable quality loss)
  • tokenizer.json, tokenizer_config.json, special_tokens_map.json - saved from the PyLate-modified tokenizer, with the [Q]/[D] markers registered
  • config.json - model config
  • conversion_metadata.json - minimal export metadata
  • skiplist.json - token IDs to skip on the document side during MaxSim (32 punctuation IDs for this model)

Architecture (verified)

Transformer (hidden size 256) → Projection (512) → Projection (48)

Special tokens

  • [Q] : 50368
  • [D] : 50369
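
A quick sanity check that the exported tokenizer maps the markers to these IDs (path placeholder as in the usage snippets below):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/this/repo")
# The markers are registered as special tokens, so each maps to a single ID
print(tok.convert_tokens_to_ids("[Q]"), tok.convert_tokens_to_ids("[D]"))  # expect: 50368 50369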

Quality checks

  • PyTorch vs ONNX FP32: cosine 1.00000000, MSE 0.0
  • PyTorch vs ONNX INT8: cosine ~0.9719 (degradation observed)
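
The reported numbers compare against PyTorch. To get a quick read on quantization drift without a PyTorch install, a sketch along these lines compares the two ONNX files directly (since FP32 ONNX matches PyTorch at cosine 1.0, this isolates the INT8 drift; not the exact script used for the numbers above):

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

model_dir = "path/to/this/repo"
tok = AutoTokenizer.from_pretrained(model_dir)
enc = tok("[Q] what is colbert?", return_tensors="np", padding="max_length", max_length=128, truncation=True)
feeds = {"input_ids": enc["input_ids"].astype(np.int64),
         "attention_mask": enc["attention_mask"].astype(np.int64)}

# Flatten both outputs and compare with a single cosine similarity
a = ort.InferenceSession(f"{model_dir}/onnx/model.onnx").run(None, feeds)[0].ravel()
b = ort.InferenceSession(f"{model_dir}/onnx/model_quantized.onnx").run(None, feeds)[0].ravel()
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # expect ~0.97 for INT8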

Usage (Python, onnxruntime)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

model_dir = "path/to/this/repo"
sess = ort.InferenceSession(f"{model_dir}/onnx/model.onnx", providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained(model_dir)

q = "[Q] what is colbert?"
enc = tok(q, return_tensors="np", padding="max_length", max_length=128, truncation=True)
# ONNX Runtime expects int64 inputs; cast defensively (the tokenizer may return int32 on some platforms)
feeds = {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
}
out = sess.run(None, feeds)[0]
print(out.shape)  # (batch, seq_len, 48) token embeddings
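
The model emits per-token embeddings, so ranking needs a late-interaction step. Continuing from the snippet above, here is a minimal MaxSim sketch: maxsim is a hypothetical helper, not part of this repo; doc_enc/doc_out are assumed to come from encoding a "[D] ..." passage the same way; and masking query padding is a simplification of classic ColBERT's mask-token query augmentation.

import json
import numpy as np

def maxsim(q_emb, q_mask, d_emb, d_mask, d_ids, skiplist):
    # Drop padding on both sides, and skiplisted token IDs on the document side
    q = q_emb[q_mask.astype(bool)]
    keep = d_mask.astype(bool) & ~np.isin(d_ids, list(skiplist))
    d = d_emb[keep]
    # L2-normalize so dot products are cosines (a no-op if the export already normalizes)
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    # MaxSim: each query token takes its best-matching document token; sum over query tokens
    return float((q @ d.T).max(axis=1).sum())

skiplist = set(json.load(open(f"{model_dir}/skiplist.json")))
score = maxsim(out[0], enc["attention_mask"][0],
               doc_out[0], doc_enc["attention_mask"][0],
               doc_enc["input_ids"][0], skiplist)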

Usage (Node, onnxruntime-node)

import { AutoTokenizer } from "@huggingface/transformers";
import * as ort from "onnxruntime-node";
import fs from "fs";

const modelDir = "path/to/this/repo";
const tokenizer = await AutoTokenizer.from_pretrained(modelDir);
const session = await ort.InferenceSession.create(`${modelDir}/onnx/model.onnx`);

const q = "[Q] what is colbert?";
const encoded = await tokenizer(q, { padding: "max_length", max_length: 128, truncation: true });

// transformers.js returns its own Tensor objects; rewrap them as ort.Tensor for onnxruntime-node
const feeds = {
  input_ids: new ort.Tensor("int64", encoded.input_ids.data, encoded.input_ids.dims),
  attention_mask: new ort.Tensor("int64", encoded.attention_mask.data, encoded.attention_mask.dims),
};
const outputs = await session.run(feeds);
console.log(outputs[session.outputNames[0]].dims); // [1, 128, 48]

// Token IDs to drop on the document side when computing MaxSim
const skiplist = new Set(JSON.parse(fs.readFileSync(`${modelDir}/skiplist.json`, "utf8")));

Notes

  • The INT8 file is provided but shows measurable drift; prefer FP32, or convert to FP16 if size matters (a possible conversion sketch follows after this list).
  • The skiplist here is punctuation-only because the model does not expose additional skip words; adjust downstream if you maintain a richer skiplist.
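
No FP16 file ships in this repo; one possible route is onnxconverter-common (not part of this repo's tooling; verify parity against FP32 afterwards):

import onnx
from onnxconverter_common import float16

model = onnx.load("onnx/model.onnx")
# keep_io_types=True leaves graph inputs/outputs in their original types
# (here, the float32 output embeddings), so callers need no dtype changes
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "onnx/model_fp16.onnx")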

Conversion

  • Tooling: PyLate + custom ColBERT wrapper
  • Opset: 17
  • Date: 2025-02 (see conversion_metadata.json for details)