---
license: cc-by-4.0
task_categories:
- text-to-video
- video-classification
language:
- en
tags:
- text-to-video
- video-search
pretty_name: openvid-lance
size_categories:
- 100K<n<1M
---
# OpenVid Dataset (Lance Format)

Lance format version of the [OpenVid dataset](https://huggingface.co/datasets/nkp37/OpenVid-1M) with **937,957 high-quality videos**, stored with inline video blobs, embeddings, and rich metadata.
## Why Lance?

Lance is an open-source format designed for multimodal AI data, offering significant advantages over traditional formats for modern AI workloads.

- **Blazing Fast Random Access**: Optimized for fetching scattered rows, making it ideal for random sampling, real-time ML serving, and interactive applications without performance degradation.
- **Native Multimodal Support**: Store text, embeddings, and other data types together in a single file. Large binary objects are loaded lazily, and vectors are optimized for fast similarity search.
- **Efficient Data Evolution**: Add new columns and backfill data without rewriting the entire dataset. This is perfect for evolving ML features, adding new embeddings, or introducing moderation tags over time.
- **Versatile Querying**: Combine vector similarity search, full-text search, and SQL-style filtering in a single query, accelerated by on-disk indexes.
## Key Features

- **Videos stored inline as blobs**: No external files to manage
- **Efficient column access**: Load metadata without touching video data
- **Prebuilt indices**: An IVF_PQ index for similarity search and an FTS index on captions
- **Fast random access**: Read any video by index
- **HuggingFace integration**: Load directly from the Hub
## Quick Start

### Load with `datasets.load_dataset`

```python
import datasets

hf_ds = datasets.load_dataset(
    "lance-format/openvid-lance",
    split="train",
    streaming=True,
)

# Take the first three rows and print their captions
for row in hf_ds.take(3):
    print(row["caption"])
```
### Load with Lance

Use Lance for ANN search, retrieving specific blob bytes, or advanced indexing, while still pointing at the dataset on the Hub:

```python
import lance

lance_ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")
blob_file = lance_ds.take_blobs("video_blob", ids=[0])[0]
video_bytes = blob_file.read()
```
### Load with LanceDB

These tables can also be consumed by [LanceDB](https://docs.lancedb.com/), the multimodal lakehouse for AI built on top of Lance. LanceDB provides convenience APIs for search, index creation, and data updates on top of the Lance format.

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data")
tbl = db.open_table("train")
print(f"LanceDB table opened with {len(tbl)} videos")
```
## Blob API

Lance stores videos as **inline blobs** - binary data embedded directly in the dataset. This provides:

- **Single source of truth** - videos and metadata live together in one dataset
- **Lazy loading** - videos are only loaded when you explicitly request them
- **Efficient storage** - optimized encoding for large binary data

```python
import lance

ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")

# 1. Browse metadata without loading video data
metadata = ds.scanner(
    columns=["caption", "aesthetic_score"],  # No video_blob column!
    filter="aesthetic_score >= 4.5",
    limit=10,
).to_table().to_pylist()

# 2. User selects a video to watch
selected_index = 3

# 3. Load only that video blob
blob_file = ds.take_blobs("video_blob", ids=[selected_index])[0]
video_bytes = blob_file.read()

# 4. Save to disk
with open("video.mp4", "wb") as f:
    f.write(video_bytes)
```
> **⚠️ HuggingFace Streaming Note**
>
> When streaming from HuggingFace (as shown above), some operations use minimal parameters to avoid rate limits:
> - `nprobes=1` for vector search (the lowest value)
> - Column selection to reduce I/O
>
> **You may hit rate limits on HuggingFace's free tier.** For best performance and to avoid rate limits, pass a token for an account with a Pro, Teams, or Enterprise subscription (these come with much higher rate limits), or download the dataset locally:
>
> ```bash
> # Download once
> huggingface-cli download lance-format/openvid-lance --repo-type dataset --local-dir ./openvid
> ```
>
> ```python
> # Then load locally
> ds = lance.dataset("./openvid/data/train.lance")
> ```
>
> Streaming is recommended only for quick exploration and testing.
## Usage Examples

### 1. Browse metadata quickly (fast, no video loading)

```python
# Load only metadata, without the heavy video blobs
scanner = ds.scanner(
    columns=["caption", "aesthetic_score", "motion_score"],
    limit=10,
)
videos = scanner.to_table().to_pylist()
for video in videos:
    print(f"{video['caption']} - Quality: {video['aesthetic_score']:.2f}")
```
### 2. Export videos from blobs

To work with a subset of the data, retrieve specific videos by index and export them to files on your local machine:

```python
# Load specific videos by index
indices = [0, 100, 500]
blob_files = ds.take_blobs("video_blob", ids=indices)

# Save to disk
for i, blob_file in enumerate(blob_files):
    with open(f"video_{i}.mp4", "wb") as f:
        f.write(blob_file.read())
```
### 3. Open inline videos with PyAV and seek directly on the blob file

Because a blob file is a seekable, file-like object, PyAV can open it directly and decode frames at specific timestamps without reading the whole video:

```python
import av

selected_index = 123
blob_file = ds.take_blobs("video_blob", ids=[selected_index])[0]

with av.open(blob_file) as container:
    stream = container.streams.video[0]
    for seconds in (0.0, 1.0, 2.5):
        # Convert the target time to the stream's time base and seek to it
        target_pts = int(seconds / stream.time_base)
        container.seek(target_pts, stream=stream)
        # Decode forward until we reach the requested timestamp
        frame = None
        for candidate in container.decode(stream):
            if candidate.time is None:
                continue
            frame = candidate
            if frame.time >= seconds:
                break
        print(
            f"Seek {seconds:.1f}s -> {frame.width}x{frame.height} "
            f"(pts={frame.pts}, time={frame.time:.2f}s)"
        )
```
### 4. Inspecting Existing Indices

You can inspect the prebuilt indices on the dataset:

```python
import lance

# Open the dataset
dataset = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")

# List all indices
indices = dataset.list_indices()
print(indices)
```
### 5. Create New Index

While this dataset ships with prebuilt indices, you can also create your own. The example below creates a vector index on the `embedding` column (index creation writes to the dataset, so run it against a local copy):

```python
# ds is a local Lance dataset
ds.create_index(
    "embedding",
    index_type="IVF_PQ",
    num_partitions=256,
    num_sub_vectors=96,
    replace=True,
)
```
### 6. Vector Similarity Search

```python
# Use an existing video's embedding as the query vector
ref_video = ds.take([0], columns=["embedding"]).to_pylist()[0]
query_vector = ref_video["embedding"]

results = ds.scanner(
    nearest={
        "column": "embedding",
        "q": query_vector,
        "k": 5,
        "nprobes": 1,
        "refine_factor": 1,
    }
).to_table().to_pylist()

for video in results[1:]:  # Skip the first result (the query itself)
    print(video["caption"])
```
### 7. Full-Text Search

```python
# Search captions using the FTS index
results = ds.scanner(
    full_text_query="sunset beach",
    columns=["caption", "aesthetic_score"],
    limit=10,
    fast_search=True,
).to_table().to_pylist()

for video in results:
    print(f"{video['caption']} - {video['aesthetic_score']:.2f}")
```
## Dataset Evolution

Lance supports flexible schema and data evolution ([docs](https://lance.org/guide/data_evolution/?h=evol)). You can add or drop columns, backfill with SQL or Python, rename fields, or change data types without rewriting the whole dataset. In practice this lets you:

- Introduce fresh metadata (moderation labels, embeddings, quality scores) as new signals become available.
- Add new columns to existing datasets without re-exporting terabytes of video.
- Rename columns or shrink storage (e.g., cast embeddings to float16) while keeping previous snapshots queryable for reproducibility.

```python
import lance
import numpy as np
import pyarrow as pa

base = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(base, "openvid_evolution", mode="overwrite")

# 1. Grow the schema instantly (metadata-only operation)
dataset.add_columns(pa.field("quality_bucket", pa.string()))

# 2. Backfill with SQL expressions or constants
dataset.add_columns({"status": "'active'"})

# 3. Generate rich columns via Python batch UDFs
@lance.batch_udf()
def random_embedding(batch):
    arr = np.random.rand(batch.num_rows, 128).astype("float32")
    return pa.RecordBatch.from_arrays(
        [pa.FixedSizeListArray.from_arrays(arr.ravel(), 128)],
        names=["embedding"],
    )

dataset.add_columns(random_embedding)

# 4. Bring in offline annotations with merge
labels = pa.table({
    "id": pa.array([1, 2, 3]),
    "label": pa.array(["horse", "rabbit", "cat"]),
})
dataset.merge(labels, "id")

# 5. Rename or cast columns as needs change
dataset.alter_columns({"path": "quality_bucket", "name": "quality_tier"})
dataset.alter_columns({"path": "embedding", "data_type": pa.list_(pa.float16(), 128)})
```
These operations are automatically versioned, so prior experiments can keep pointing at earlier versions while the dataset evolves.
## LanceDB

LanceDB users can run the same search queries through the LanceDB API:
### LanceDB Vector Similarity Search

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data")
tbl = db.open_table("train")

# Get a video to use as the query
ref_video = tbl.search().select(["embedding", "caption"]).limit(1).to_list()[0]
query_embedding = ref_video["embedding"]

results = tbl.search(query_embedding, vector_column_name="embedding") \
    .metric("L2") \
    .nprobes(1) \
    .limit(5) \
    .to_list()

for video in results[1:]:  # Skip the first result (the query itself)
    print(f"{video['caption'][:60]}...")
```
### LanceDB Full-Text Search

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/openvid-lance/data")
tbl = db.open_table("train")

results = tbl.search("sunset beach", query_type="fts") \
    .select(["caption", "aesthetic_score"]) \
    .limit(10) \
    .to_list()

for video in results:
    print(f"{video['caption']} - {video['aesthetic_score']:.2f}")
```
## Citation

```bibtex
@article{nan2024openvid,
  title={OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation},
  author={Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying},
  journal={arXiv preprint arXiv:2407.02371},
  year={2024}
}
```
## License

Please check the original [OpenVid dataset](https://huggingface.co/datasets/nkp37/OpenVid-1M) license for usage terms.