---
inference: false
language: en
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
datasets:
- burgerbee/wikipedia-en-20241020
---
# Wikipedia txtai embeddings index
This is a [txtai](https://github.com/neuml/txtai) embeddings index (5GB embeddings + 25GB documents) for the [English edition of Wikipedia](https://en.wikipedia.org/).

Embeddings are the engine that delivers semantic search: data is transformed into embedding vectors, and similar concepts produce similar vectors.
An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.
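As a rough sketch of what that means in practice (the model name, sample texts and directory name below are just example choices), a small index can be built, saved to a plain local directory and reloaded later without any server:

```python
from txtai.embeddings import Embeddings

# Example model choice; any sentence-transformers model works here
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})

# Index a few (id, text, tags) tuples
embeddings.index([
    (0, "The Freewheelin' Bob Dylan was released in 1963.", None),
    (1, "Where in the World Is Carmen Sandiego? is a game show.", None),
])

# The whole index is written to a plain local directory
embeddings.save("wikipedia-demo-index")

# Reload and search - no database server needed
embeddings = Embeddings()
embeddings.load("wikipedia-demo-index")
print(embeddings.search("Bob Dylan album", 1))
```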

This index is built from the [Wikipedia October 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020).
The Wikipedia index works well as a fact-based context source for retrieval-augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field, which can be used to restrict matches to commonly visited pages.
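As a rough sketch of that RAG usage (the prompt template and the downstream LLM step are assumptions, not part of this index), the top matches filtered by `percentile` can be concatenated into grounding context:

```python
from txtai.embeddings import Embeddings

# Load the index from the HF Hub (same as the example code below)
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-en-wikipedia")

# Keep the question free of single quotes since it is interpolated into the SQL query
question = "When was the second Bob Dylan studio album released?"

# Retrieve relevant passages from commonly visited pages as grounding context
results = embeddings.search(
    f"SELECT text FROM txtai WHERE similar('{question}') AND percentile >= 0.99", 3
)
context = "\n\n".join(x["text"] for x in results)

# Hand this prompt to any LLM; the generation step itself is out of scope here
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```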

txtai must be [installed](https://neuml.github.io/txtai/install/) (for example with `pip install txtai`) to use this index.
## Example code
```python
from txtai.embeddings import Embeddings
import json

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-en-wikipedia")

# Run a search
for x in embeddings.search("Bob Dylans second album", 1):
  print(x["text"])

# Run a search and filter on popular results (page views)
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('Where in the World Is Carmen Sandiego?') AND percentile >= 0.99", 1):
    print(json.dumps(x, indent=2))
```
## Example output
```json
The Freewheelin' Bob Dylan is the second studio album by American singer-songwriter Bob Dylan, released on May 27, 1963 by Columbia Records... (full article)

{
  "id": "Where in the World Is Carmen Sandiego? (game show)",
  "text": "Where in the World Is Carmen Sandiego? is an American half-hour children's television game show based on... (full article)
  "score": 0.8537465929985046,
  "percentile": 0.996002961084341
}
```
## Data source
https://dumps.wikimedia.org/enwiki/

https://dumps.wikimedia.org/other/pageview_complete/

https://huggingface.co/datasets/burgerbee/wikipedia-en-20241020