---
inference: false
language: en
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
datasets:
- burgerbee/wikipedia-en-20241020
---
# Wikipedia txtai embeddings index
This is a [txtai](https://github.com/neuml/txtai) embeddings index (5GB of embeddings plus 25GB of documents) for the [English edition of Wikipedia](https://en.wikipedia.org/).

Embeddings are the engine behind semantic search: data is transformed into embedding vectors, where similar concepts produce similar vectors.

An embeddings index generated by txtai is a fully encapsulated index format. It does not require a database server.

This index is built from the [Wikipedia October 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020).
The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field, which can be used to restrict matches to commonly visited pages.
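The `percentile` value is a rank over page-view counts: a page at percentile 0.99 is viewed more often than 99% of pages. The following is an illustrative sketch of how such a rank could be computed from raw view counts; the page names and the helper function here are assumptions for illustration, not the actual build pipeline of this index.

```python
# Toy page-view counts (hypothetical values, for illustration only)
views = {
    "Main Page": 5_000_000,
    "Python (programming language)": 250_000,
    "Obscure topic": 120,
}

def view_percentile(page, counts):
    """Fraction of pages whose view count is <= this page's count."""
    values = sorted(counts.values())
    rank = sum(1 for v in values if v <= counts[page])
    return rank / len(values)

# Keep only pages above the 0.99 percentile, mirroring the
# `percentile >= 0.99` SQL filter shown in the example below
popular = [p for p in views if view_percentile(p, views) >= 0.99]
```

With these toy counts, only the most-viewed page clears the 0.99 threshold, which is exactly the effect of filtering on `percentile` in a txtai SQL query.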
txtai must be [installed](https://neuml.github.io/txtai/install/) (e.g. with pip) to use this index.
## Example code
```python
from txtai.embeddings import Embeddings
import json
# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-en-wikipedia")
# Run a search
for x in embeddings.search("Bob Dylan's second album", 1):
    print(x["text"])
# Run a search and filter on popular results (page views).
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('Where in the World Is Carmen Sandiego?') AND percentile >= 0.99", 1):
    print(json.dumps(x, indent=2))
```
## Example output
```
The Freewheelin' Bob Dylan is the second studio album by American singer-songwriter Bob Dylan, released on May 27, 1963 by Columbia Records... (full article)

{
  "id": "Where in the World Is Carmen Sandiego? (game show)",
  "text": "Where in the World Is Carmen Sandiego? is an American half-hour children's television game show based on... (full article)",
  "score": 0.8537465929985046,
  "percentile": 0.996002961084341
}
```
## Data source
- https://dumps.wikimedia.org/enwiki/
- https://dumps.wikimedia.org/other/pageview_complete/
- https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020