Add `notebook.ipynb` to the model repo

Hey hey, this adds a customised inference notebook that users can open directly to play with the model.

Try it by going to Use this model -> Google Colab / Kaggle.

You can find more details about this feature here: https://huggingface.co/docs/hub/en/notebooks

Files changed (1)

notebook.ipynb +157 -0

notebook.ipynb ADDED

	@@ -0,0 +1,157 @@

+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "gpuType": "T4"
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    },
+    "accelerator": "GPU"
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Use V-JEPA 2"
+      ],
+      "metadata": {
+        "id": "02ruu54h4yLc"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "V-JEPA 2 is a new open 1.2B video embedding model by Meta that aims to model the physical world through video ⏯️\n",
+        "\n",
+        "The model can be used for various video tasks: fine-tuning for downstream tasks like video classification, or any task involving embeddings (similarity, retrieval and more!).\n",
+        "\n",
+        "You can check all V-JEPA 2 checkpoints and the datasets that come with this release [in this collection](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6)."
+      ],
+      "metadata": {
+        "id": "ol0IGYCd4hg4"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We need to install the release-specific branch of transformers."
+      ],
+      "metadata": {
+        "id": "kIIBxYOA41Ga"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install -q git+https://github.com/huggingface/transformers@v4.52.4-VJEPA-2-preview"
+      ],
+      "metadata": {
+        "id": "4D4D1hC940yX"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from huggingface_hub import login # only needed if you later push a model to the Hub\n",
+        "\n",
+        "login()"
+      ],
+      "metadata": {
+        "id": "Ne2rU68Ep1On"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "As of now, Colab supports torchcodec==0.2.1, which is compatible with torch==2.6.0."
+      ],
+      "metadata": {
+        "id": "dJWXmFu53Ap6"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install -q torch==2.6.0 torchvision==0.21.0\n",
+        "!pip install -q torchcodec==0.2.1\n",
+        "\n",
+        "import torch\n",
+        "print(\"Torch:\", torch.__version__)\n",
+        "from torchcodec.decoders import VideoDecoder # verify that torchcodec imports correctly"
+      ],
+      "metadata": {
+        "id": "JIoq84ze2_Ls"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Initialize the model and the processor"
+      ],
+      "metadata": {
+        "id": "-7OATf5S20U_"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from transformers import AutoVideoProcessor, AutoModel\n",
+        "\n",
+        "hf_repo = \"facebook/vjepa2-vith-fpc64-256\"\n",
+        "\n",
+        "model = AutoModel.from_pretrained(hf_repo).to(\"cuda\")\n",
+        "processor = AutoVideoProcessor.from_pretrained(hf_repo)"
+      ],
+      "metadata": {
+        "id": "K8oSsy7Y2zQK"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Extract video embeddings from the model"
+      ],
+      "metadata": {
+        "id": "ZJ_DUR9f22Uc"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import torch\n",
+        "from torchcodec.decoders import VideoDecoder\n",
+        "import numpy as np\n",
+        "\n",
+        "video_url = \"https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4\"\n",
+        "vr = VideoDecoder(video_url)\n",
+        "frame_idx = np.arange(0, 64) # take the first 64 frames; you can define a more complex sampling strategy here\n",
+        "video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W\n",
+        "video = processor(video, return_tensors=\"pt\").to(model.device)\n",
+        "with torch.no_grad():\n",
+        "    video_embeddings = model.get_vision_features(**video)\n",
+        "\n",
+        "print(video_embeddings.shape)"
+      ],
+      "metadata": {
+        "id": "kAgWZJHt24px"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
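
The last cell of the notebook takes the first 64 frames and notes that a more complex sampling strategy is possible. One such strategy is uniform sampling across the whole clip, sketched below. The helper name `uniform_frame_indices` and its parameters are hypothetical; they are not part of the notebook, transformers, or torchcodec.

```python
def uniform_frame_indices(num_frames: int, clip_len: int = 64) -> list[int]:
    """Return clip_len frame indices spread evenly over num_frames frames.

    Hypothetical helper for illustration only. If the video has fewer
    frames than clip_len, some indices repeat.
    """
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    if clip_len == 1:
        return [0]
    last = num_frames - 1
    # Integer arithmetic keeps the first index at 0 and the last at num_frames - 1.
    return [(i * last) // (clip_len - 1) for i in range(clip_len)]


if __name__ == "__main__":
    # Example: a 300-frame clip reduced to 64 evenly spaced frame indices.
    indices = uniform_frame_indices(300, 64)
    print(len(indices), indices[0], indices[-1])
```

In the notebook this could replace the `np.arange(0, 64)` line, e.g. `frame_idx = uniform_frame_indices(len(vr), 64)`, assuming the decoder reports its frame count via `len()`.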