auto-g-embed-st
auto-g-embed-st is, in practice, all-MiniLM-L6-v2 with a light fine-tune. It starts directly from MiniLM weights, is trained for a single epoch on ~24k contrastive pairs, and keeps the same architecture, tokenizer, pooling, and normalization. Its MTEB scores are nearly identical to the base model's (a difference of ~0.0003, within statistical noise), so the fine-tuning does not meaningfully change retrieval performance. The broader project also ships a separate, much faster Rust-native embedder, but the published and evaluated model is the MiniLM-derived one.
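Both models share MiniLM's mean pooling and L2 normalization. As an illustrative sketch (not the project's actual code), that post-transformer step amounts to:

```rust
/// Mean-pool per-token embeddings into one sentence vector, then
/// L2-normalize it -- the pooling scheme all-MiniLM-L6-v2 and its
/// fine-tuned derivative both use. Illustrative only.
fn mean_pool_normalize(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let dim = token_embeddings[0].len();
    let n = token_embeddings.len() as f32;
    // Average each dimension across all tokens.
    let mut pooled = vec![0.0f32; dim];
    for tok in token_embeddings {
        for (p, x) in pooled.iter_mut().zip(tok) {
            *p += x / n;
        }
    }
    // L2-normalize so cosine similarity reduces to a dot product.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    for p in &mut pooled {
        *p /= norm;
    }
    pooled
}
```

Because both models end with this identical pooling and normalization, any behavioral difference can only come from the (minimal) weight updates.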
Local semantic embedding pipeline with a Rust-native runtime.
What this repo provides
- Contrastive dataset preparation (`prepare_contrastive`)
- Rust-native embedder training (`train_rust_embedder`)
- Runtime embedding APIs and examples
- Optional ONNX/SentenceTransformer path in `training/`
Quick start
```sh
cargo test

./training/run_pipeline.sh \
  --profile kaggle_questions_million \
  --source-csv data/kaggle/one-million-reddit-questions.csv
```
Run the Rust embedding example:
```sh
cargo run --example rust_embed -- \
  artifacts/model/rust-embedder \
  "A quick test sentence for semantic embeddings."
```
Model artifacts
Published model artifacts are available on Hugging Face:
Project layout
- `src/`: library modules and binaries
- `examples/`: runnable embedding demos
- `tests/`: integration/performance tests
- `training/`: pipeline scripts and dataset adapters
Development checks
```sh
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test
```
Community Benchmark
Run the reproducible benchmark CLI:
```sh
cargo run --release --bin community_benchmark -- \
  --output artifacts/benchmarks/latest.json
```
The output includes throughput, latency percentiles (p50/p95/p99), retrieval quality metrics, and environment metadata for publishing.
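As an illustrative sketch of how latency percentiles such as p50/p95/p99 can be computed from raw per-call samples (the benchmark binary's exact interpolation rule is not specified here), a nearest-rank definition looks like:

```rust
/// Nearest-rank percentile over latency samples (microseconds).
/// The community_benchmark binary may use a different interpolation
/// rule; this is only an illustrative definition.
fn percentile_us(samples: &[f64], pct: f64) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest rank: ceil(pct/100 * n), clamped into bounds.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}
```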
Methodology and reporting guidance: `BENCHMARKS.md`.
Latest Benchmark (M4 Max) (February 8, 2026):
```sh
cargo run --release --bin community_benchmark -- \
  --eval-count 500 --warmup-count 100 --query-count 32 \
  --output artifacts/benchmarks/smoke.json
```
```
embeds_per_second: 219595.18
p50_us: 3.88
p95_us: 6.54
p99_us: 6.71
top1_accuracy: 0.9375
separation: 0.2886
```
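`top1_accuracy` measures how often a query's nearest corpus vector (here assumed to mean highest dot product over L2-normalized embeddings; see `BENCHMARKS.md` for the authoritative definition) is the expected match. A minimal sketch of that computation, independent of the project's actual benchmark code:

```rust
/// Dot product; equals cosine similarity when both vectors are
/// L2-normalized.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Fraction of queries whose most similar corpus vector is the
/// expected one. Illustrative definition of top-1 accuracy.
fn top1_accuracy(queries: &[Vec<f32>], corpus: &[Vec<f32>], expected: &[usize]) -> f32 {
    let mut hits = 0;
    for (q, &want) in queries.iter().zip(expected) {
        // Index of the corpus vector with the highest similarity to q.
        let mut best = 0;
        for i in 1..corpus.len() {
            if dot(q, &corpus[i]) > dot(q, &corpus[best]) {
                best = i;
            }
        }
        if best == want {
            hits += 1;
        }
    }
    hits as f32 / queries.len() as f32
}
```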
Additional docs
- Training and pipeline details: `training/README.md`
- Test data notes: `test-data/README.md`