A0lgk
/

MNIST

+---
+license: mit
+language:
+- en
+library_name: sklearn
+tags:
+- mnist
+- image-classification
+- digits
+- handwritten
+- computer-vision
+- logistic-regression
+- machine-learning
+datasets:
+- ylecun/mnist
+metrics:
+- accuracy
+- f1
+- precision
+- recall
+pipeline_tag: image-classification
+---
+# MNIST Handwritten Digit Classifier
+A classical machine learning approach to handwritten digit recognition using Logistic Regression on the MNIST dataset.
+## Model Description
+This model classifies 28x28 grayscale images of handwritten digits (0-9) using a simple yet effective Logistic Regression classifier. The project serves as an introduction to image classification and the MNIST dataset.
+### Intended Uses
+- **Educational**: Learning image classification fundamentals
+- **Benchmarking**: Baseline for comparing more complex models
+- **Research**: Exploring classical ML on image data
+- **Prototyping**: Quick digit recognition experiments
+## Training Data
+**Dataset**: [ylecun/mnist](https://huggingface.co/datasets/ylecun/mnist)
+| Split | Images |
+|-------|--------|
+| Train | 60,000 |
+| Test | 10,000 |
+| **Total** | **70,000** |
+### Data Characteristics
+| Property | Value |
+|----------|-------|
+| Image Size | 28 x 28 pixels |
+| Channels | 1 (Grayscale) |
+| Classes | 10 (digits 0-9) |
+| Pixel Range | 0-255 (raw), 0-1 (normalized) |
+| Format | PNG/NumPy arrays |
+### Class Distribution
+The dataset is roughly balanced across all 10 digit classes: each digit has between about 5,400 and 6,700 training images.
+## Model Architecture
+### Preprocessing Pipeline
+```
+Raw Image (28x28, uint8)
+    ↓
+Normalize to [0, 1] (divide by 255)
+    ↓
+Flatten to vector (784 dimensions)
+    ↓
+Logistic Regression Classifier
+    ↓
+Softmax Probabilities (10 classes)
+```
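As a dependency-free illustration of the pipeline above, the sketch below scales pixels to [0, 1], flattens the image row by row, and applies a softmax over ten linear scores. The function names (`preprocess`, `softmax`, `predict_proba`), the synthetic input image, and the all-zero weights are assumptions made for this example; a fitted scikit-learn model keeps its learned weights in `coef_` and `intercept_`.

```python
import math

def preprocess(image):
    """Scale a 28x28 grid of 0-255 pixel values to [0, 1] and flatten it to 784 values."""
    return [pixel / 255.0 for row in image for pixel in row]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(features, weights, biases):
    """Multinomial logistic regression forward pass: softmax(W x + b)."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical input: a blank image with one bright vertical stroke.
image = [[255 if col == 14 else 0 for col in range(28)] for _ in range(28)]
features = preprocess(image)

# Placeholder parameters (all zeros), so every class receives the same probability.
weights = [[0.0] * 784 for _ in range(10)]
biases = [0.0] * 10
probabilities = predict_proba(features, weights, biases)
print(len(features), max(features), round(sum(probabilities), 6))
```

With trained weights in place of the zeros, the largest entry of `probabilities` corresponds to the predicted digit.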
+### Classifier Configuration
+```python
+LogisticRegression(
+    max_iter=100,
+    solver='lbfgs',
+    multi_class='multinomial',
+    n_jobs=-1
+)
+```
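+| Parameter | Value | Description |
+|-----------|-------|-------------|
+| max_iter | 100 | Maximum number of solver iterations; L-BFGS may stop before fully converging at this limit |
+| solver | lbfgs | L-BFGS quasi-Newton optimization algorithm |
+| multi_class | multinomial | Single softmax model over all classes (not OvR); deprecated since scikit-learn 1.5, where this is already the default for `lbfgs` |
+| n_jobs | -1 | Requests all CPU cores; scikit-learn only uses it for one-vs-rest fitting, so it has no effect with the multinomial setting |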
+## Performance
+### Test Set Results
+| Metric | Score |
+|--------|-------|
+| Accuracy | ~92% |
+| Macro F1 | ~92% |
+| Macro Precision | ~92% |
+| Macro Recall | ~92% |
+### Per-Class Performance
+| Digit | Precision | Recall | F1-Score |
+|-------|-----------|--------|----------|
+| 0 | ~0.95 | ~0.97 | ~0.96 |
+| 1 | ~0.95 | ~0.97 | ~0.96 |
+| 2 | ~0.91 | ~0.89 | ~0.90 |
+| 3 | ~0.89 | ~0.90 | ~0.90 |
+| 4 | ~0.92 | ~0.92 | ~0.92 |
+| 5 | ~0.88 | ~0.87 | ~0.87 |
+| 6 | ~0.94 | ~0.95 | ~0.94 |
+| 7 | ~0.93 | ~0.91 | ~0.92 |
+| 8 | ~0.88 | ~0.87 | ~0.88 |
+| 9 | ~0.89 | ~0.90 | ~0.90 |
+*Note: Scores are approximate and may vary slightly with library versions and solver settings.*
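The macro scores reported above average the per-class values with equal weight per digit. A minimal sketch of that computation from a confusion matrix follows. The 3-class matrix is made-up toy data, not output from this model, and `macro_scores` is a hypothetical helper name. Rows are true labels and columns are predictions, matching the layout returned by scikit-learn's `confusion_matrix`.

```python
def macro_scores(cm):
    """Return (accuracy, macro precision, macro recall, macro F1) for a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: everything predicted as k
        actual_k = sum(cm[k])                          # row sum: everything truly k
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy data for illustration only.
toy_cm = [
    [5, 1, 0],
    [1, 3, 1],
    [0, 0, 4],
]
accuracy, precision, recall, f1 = macro_scores(toy_cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because every class counts equally, a weak class such as 5 or 8 pulls the macro average down as much as a strong class pulls it up, regardless of how many test images each has.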
+### Common Confusion Pairs
+- 4 ↔ 9 (similar upper loops)
+- 3 ↔ 8 (curved shapes)
+- 5 ↔ 3 (similar strokes)
+- 7 ↔ 1 (vertical strokes)
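Pairs like these can be read off a confusion matrix by adding the two off-diagonal cells for each pair of digits and sorting by the total. The sketch below does this on made-up toy counts; `top_confusions` is a hypothetical helper name, and the matrix is not output from this model.

```python
def top_confusions(cm, k=3):
    """Return the k most-confused label pairs as ((a, b), count), counting errors in both directions."""
    pairs = []
    for a in range(len(cm)):
        for b in range(a + 1, len(cm)):
            pairs.append(((a, b), cm[a][b] + cm[b][a]))
    # Largest count first; sorted() is stable, so ties keep label order.
    return sorted(pairs, key=lambda item: item[1], reverse=True)[:k]

# Toy 4-class confusion matrix (rows: true label, columns: predicted label).
toy_cm = [
    [50, 0, 1, 4],
    [0, 48, 2, 0],
    [2, 3, 45, 0],
    [6, 0, 1, 43],
]
worst = top_confusions(toy_cm, k=2)
print(worst)
```

The same helper could be applied to the 10x10 matrix from the Visualization section, after converting it with `cm.tolist()`.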
+## Usage
+### Installation
+```bash
+pip install scikit-learn pandas numpy matplotlib seaborn pillow pyarrow huggingface_hub
+```
+### Load and Preprocess Data
+```python
+from io import BytesIO
+
+import numpy as np
+import pandas as pd
+from PIL import Image
+
+# Load from Hugging Face (hf:// paths need huggingface_hub; parquet needs pyarrow)
+df_train = pd.read_parquet("hf://datasets/ylecun/mnist/mnist/train-00000-of-00001.parquet")
+df_test = pd.read_parquet("hf://datasets/ylecun/mnist/mnist/test-00000-of-00001.parquet")
+
+def extract_image(row):
+    """Return the 'image' entry of a DataFrame row as a 28x28 NumPy array."""
+    img_data = row['image']
+    if isinstance(img_data, dict) and 'bytes' in img_data:
+        # The parquet files store each image as encoded bytes
+        return np.array(Image.open(BytesIO(img_data['bytes'])))
+    # PIL images and array-likes convert directly
+    return np.array(img_data)
+
+# Prepare both splits
+X_train = np.array([extract_image(row) for _, row in df_train.iterrows()])
+y_train = df_train['label'].values
+X_test = np.array([extract_image(row) for _, row in df_test.iterrows()])
+y_test = df_test['label'].values
+
+# Flatten each image to 784 features and normalize to [0, 1]
+X_train_flat = X_train.astype('float32').reshape(-1, 784) / 255.0
+X_test_flat = X_test.astype('float32').reshape(-1, 784) / 255.0
+```
+### Train Model
+```python
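+import joblib
+from sklearn.linear_model import LogisticRegression
+
+model = LogisticRegression(
+    max_iter=100,
+    solver='lbfgs',
+    multi_class='multinomial',  # deprecated since scikit-learn 1.5; it can be omitted there
+    n_jobs=-1
+)
+model.fit(X_train_flat, y_train)
+
+# Persist the fitted model so it can be reloaded later with joblib.load
+joblib.dump(model, 'mnist_model.pkl')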
+```
+### Inference
+```python
+import joblib
+import numpy as np
+from PIL import Image
+
+# Load a model previously saved with joblib.dump
+model = joblib.load('mnist_model.pkl')
+
+def predict_digit(image):
+    """
+    image: 28x28 numpy array or PIL Image (grayscale, values 0-255)
+    returns: (predicted digit 0-9, array of 10 class probabilities)
+    """
+    if isinstance(image, Image.Image):
+        image = np.array(image)
+    # Preprocess exactly as in training: flatten, then scale to [0, 1]
+    image_flat = image.astype('float32').reshape(1, 784) / 255.0
+    # Predict
+    prediction = model.predict(image_flat)[0]
+    probabilities = model.predict_proba(image_flat)[0]
+    return prediction, probabilities
+
+# Example: test_image is any 28x28 uint8 image, such as one decoded test-set image
+digit, probs = predict_digit(test_image)
+print(f"Predicted: {digit} (confidence: {probs[digit]:.2%})")
+```
+### Visualization
+```python
+import matplotlib.pyplot as plt
+from sklearn.metrics import confusion_matrix
+import seaborn as sns
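+# Confusion matrix on the test split; X_test_flat and y_test are assumed to be
+# prepared the same way as the training data (decode, flatten, divide by 255)
+y_pred = model.predict(X_test_flat)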
+cm = confusion_matrix(y_test, y_pred)
+plt.figure(figsize=(10, 8))
+sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
+            xticklabels=range(10), yticklabels=range(10))
+plt.xlabel('Predicted')
+plt.ylabel('True')
+plt.title('Confusion Matrix - MNIST')
+plt.show()
+```
+### Average Digit Visualization
+```python
+# Compute mean image per digit
+fig, axes = plt.subplots(2, 5, figsize=(12, 5))
+for digit in range(10):
+    ax = axes[digit // 5, digit % 5]
+    mask = y_train == digit
+    mean_img = X_train[mask].mean(axis=0)
+    ax.imshow(mean_img, cmap='hot')
+    ax.set_title(f'Digit: {digit}')
+    ax.axis('off')
+plt.tight_layout()
+plt.show()
+```
+## Limitations
+- **Simple Model**: Logistic Regression doesn't capture spatial relationships
+- **No Data Augmentation**: Sensitive to rotation, scaling, translation
+- **Grayscale Only**: Won't work with color images
+- **Fixed Size**: Requires exactly 28x28 input
+- **Clean Data**: Struggles with noisy or poorly centered digits
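The sensitivity to translation follows from the model learning one weight per pixel position: shifting a stroke moves its ink onto different input features. The sketch below illustrates this with a synthetic one-pixel-wide stroke, which is an assumption made for clarity; real digits are thicker, so their overlap degrades more gradually. `shift_right` and `ink_overlap` are hypothetical helper names.

```python
def shift_right(image, pixels):
    """Shift a 2D image to the right, padding the left edge with zeros."""
    width = len(image[0])
    return [[0] * pixels + row[: width - pixels] for row in image]

def ink_overlap(a, b):
    """Fraction of the non-zero pixels in `a` that are also non-zero in `b`."""
    ink_a = [(r, c) for r, row in enumerate(a) for c, value in enumerate(row) if value]
    shared = sum(1 for r, c in ink_a if b[r][c])
    return shared / len(ink_a)

# Synthetic 28x28 image: a single vertical stroke in column 10.
stroke = [[255 if col == 10 else 0 for col in range(28)] for _ in range(28)]
shifted = shift_right(stroke, 2)

print(ink_overlap(stroke, stroke), ink_overlap(stroke, shifted))
```

After a two-pixel shift, none of the originally active features are active any more, so the weights the model learned for the original position no longer see the stroke.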
+## Comparison with Other Approaches
+| Model | MNIST Accuracy |
+|-------|----------------|
+| **Logistic Regression** | **~92%** |
+| Random Forest | ~97% |
+| SVM (RBF kernel) | ~98% |
+| MLP (2 hidden layers) | ~98% |
+| CNN (LeNet-5) | ~99% |
+| Modern CNNs | ~99.7% |
+## Technical Specifications
+### Dependencies
+```
+scikit-learn>=1.0.0
+pandas>=1.3.0
+numpy>=1.20.0
+matplotlib>=3.4.0
+seaborn>=0.11.0
+pillow>=8.0.0
+```
+### Hardware Requirements
+| Resource | Hardware | Estimate |
+|----------|----------|----------|
+| Training time | CPU | ~2-5 min |
+| Inference time | CPU | < 1 ms per image |
+| Memory | RAM | ~500 MB |
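For a rough sense of where the memory goes, the arithmetic below estimates the size of the flattened float32 training matrix and the number of learned parameters. It is a back-of-the-envelope sketch: it ignores the pandas DataFrame, the decoded uint8 copy of the images, and solver workspace, all of which add to the real footprint.

```python
n_train, n_features, n_classes = 60_000, 28 * 28, 10
bytes_per_float32 = 4

# Flattened training matrix: one float32 per pixel per image.
train_matrix_bytes = n_train * n_features * bytes_per_float32

# One weight per (class, pixel) plus one bias per class.
n_parameters = n_classes * n_features + n_classes

print(f"float32 training matrix: {train_matrix_bytes / 1e6:.1f} MB")
print(f"model parameters: {n_parameters}")
```

The model itself is tiny; almost all of the memory is taken by the data held during training.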
+## Files
+```
+MNIST/
+├── README_HF.md          # This model card
+├── mnist_exploration.ipynb  # Full exploration notebook
+├── mnist_model.pkl       # Trained model (generated)
+└── figures/              # Visualizations (generated)
+```
+## Citation
+```bibtex
+@article{lecun1998mnist,
+  title={Gradient-based learning applied to document recognition},
+  author={LeCun, Yann and Bottou, L{\'e}on and Bengio, Yoshua and Haffner, Patrick},
+  journal={Proceedings of the IEEE},
+  volume={86},
+  number={11},
+  pages={2278--2324},
+  year={1998}
+}
+@misc{mnist_hf,
+  title={MNIST Dataset},
+  author={LeCun, Yann and Cortes, Corinna and Burges, Christopher J.C.},
+  howpublished={Hugging Face Datasets},
+  url={https://huggingface.co/datasets/ylecun/mnist}
+}
+```
+## License
+MIT License
+## Acknowledgments
+- Yann LeCun for creating MNIST
+- Scikit-learn team for the ML library
+- Hugging Face for dataset hosting
+---
+## Next Steps
+For better performance, consider:
+1. **More Complex Models**: SVM, Random Forest, Neural Networks
+2. **Deep Learning**: CNNs with PyTorch/TensorFlow
+3. **Data Augmentation**: Rotation, scaling, elastic deformations
+4. **Feature Engineering**: HOG, SIFT features
+5. **Ensemble Methods**: Combine multiple classifiers

mnist_exploration.ipynb ADDED

	@@ -0,0 +1,383 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 🔢 Exploring the MNIST Dataset\n",
+    "\n",
+    "This notebook explores the well-known MNIST dataset of handwritten digits."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Loading the data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "from PIL import Image\n",
+    "\n",
+    "# Load the dataset from Hugging Face\n",
+    "splits = {\n",
+    "    'train': 'mnist/train-00000-of-00001.parquet',\n",
+    "    'test': 'mnist/test-00000-of-00001.parquet'\n",
+    "}\n",
+    "\n",
+    "df_train = pd.read_parquet(\"hf://datasets/ylecun/mnist/\" + splits[\"train\"])\n",
+    "df_test = pd.read_parquet(\"hf://datasets/ylecun/mnist/\" + splits[\"test\"])\n",
+    "\n",
+    "print(\"✅ Data loaded successfully!\")\n",
+    "print(f\"📊 Training set size: {len(df_train)} images\")\n",
+    "print(f\"📊 Test set size: {len(df_test)} images\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Exploring the data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# DataFrame structure\n",
+    "print(\"Dataset columns:\")\n",
+    "print(df_train.columns.tolist())\n",
+    "print(\"\\nPreview of the first rows:\")\n",
+    "df_train.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Label distribution\n",
+    "print(\"Distribution of digits in the training set:\")\n",
+    "label_counts = df_train['label'].value_counts().sort_index()\n",
+    "\n",
+    "plt.figure(figsize=(10, 5))\n",
+    "plt.bar(label_counts.index, label_counts.values, color='steelblue', edgecolor='black')\n",
+    "plt.xlabel('Digit', fontsize=12)\n",
+    "plt.ylabel('Number of images', fontsize=12)\n",
+    "plt.title('Distribution of digits in MNIST (train)', fontsize=14)\n",
+    "plt.xticks(range(10))\n",
+    "for i, v in enumerate(label_counts.values):\n",
+    "    plt.text(i, v + 100, str(v), ha='center', fontsize=9)\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Visualizing the images"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def extract_image(row):\n",
+    "    \"\"\"Extract the image from the DataFrame's 'image' column.\"\"\"\n",
+    "    img_data = row['image']\n",
+    "    if isinstance(img_data, dict) and 'bytes' in img_data:\n",
+    "        # Image stored as encoded bytes\n",
+    "        from io import BytesIO\n",
+    "        img = Image.open(BytesIO(img_data['bytes']))\n",
+    "        return np.array(img)\n",
+    "    elif isinstance(img_data, Image.Image):\n",
+    "        return np.array(img_data)\n",
+    "    elif isinstance(img_data, np.ndarray):\n",
+    "        return img_data\n",
+    "    else:\n",
+    "        # Fall back to a direct conversion\n",
+    "        return np.array(img_data)\n",
+    "\n",
+    "# Show a few examples\n",
+    "fig, axes = plt.subplots(2, 5, figsize=(12, 5))\n",
+    "fig.suptitle('MNIST image examples', fontsize=14)\n",
+    "\n",
+    "for idx, ax in enumerate(axes.flat):\n",
+    "    img = extract_image(df_train.iloc[idx])\n",
+    "    label = df_train.iloc[idx]['label']\n",
+    "    ax.imshow(img, cmap='gray')\n",
+    "    ax.set_title(f'Label: {label}', fontsize=11)\n",
+    "    ax.axis('off')\n",
+    "\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Show one example of each digit\n",
+    "fig, axes = plt.subplots(2, 5, figsize=(12, 5))\n",
+    "fig.suptitle('One example of each digit (0-9)', fontsize=14)\n",
+    "\n",
+    "for digit in range(10):\n",
+    "    ax = axes[digit // 5, digit % 5]\n",
+    "    sample = df_train[df_train['label'] == digit].iloc[0]\n",
+    "    img = extract_image(sample)\n",
+    "    ax.imshow(img, cmap='gray')\n",
+    "    ax.set_title(f'Digit: {digit}', fontsize=11)\n",
+    "    ax.axis('off')\n",
+    "\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Preparing the data for machine learning"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Convert all images to NumPy arrays\n",
+    "print(\"Converting images to NumPy arrays...\")\n",
+    "\n",
+    "X_train = np.array([extract_image(row) for _, row in df_train.iterrows()])\n",
+    "y_train = df_train['label'].values\n",
+    "\n",
+    "X_test = np.array([extract_image(row) for _, row in df_test.iterrows()])\n",
+    "y_test = df_test['label'].values\n",
+    "\n",
+    "print(\"\\n✅ Conversion complete!\")\n",
+    "print(f\"X_train shape: {X_train.shape}\")\n",
+    "print(f\"y_train shape: {y_train.shape}\")\n",
+    "print(f\"X_test shape: {X_test.shape}\")\n",
+    "print(f\"y_test shape: {y_test.shape}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Normalize the data (0-1)\n",
+    "X_train_norm = X_train.astype('float32') / 255.0\n",
+    "X_test_norm = X_test.astype('float32') / 255.0\n",
+    "\n",
+    "# Flatten the images for classical models (28x28 -> 784)\n",
+    "X_train_flat = X_train_norm.reshape(X_train_norm.shape[0], -1)\n",
+    "X_test_flat = X_test_norm.reshape(X_test_norm.shape[0], -1)\n",
+    "\n",
+    "print(\"Normalized and flattened data:\")\n",
+    "print(f\"X_train_flat shape: {X_train_flat.shape}\")\n",
+    "print(f\"X_test_flat shape: {X_test_flat.shape}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. A simple classification model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n",
+    "import seaborn as sns\n",
+    "\n",
+    "# Train a logistic regression model\n",
+    "print(\"🔄 Training the logistic regression model...\")\n",
+    "print(\"(This may take a few minutes)\\n\")\n",
+    "\n",
+    "model = LogisticRegression(max_iter=100, solver='lbfgs', multi_class='multinomial', n_jobs=-1)\n",
+    "model.fit(X_train_flat, y_train)\n",
+    "\n",
+    "print(\"✅ Training complete!\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Evaluate the model\n",
+    "y_pred = model.predict(X_test_flat)\n",
+    "accuracy = accuracy_score(y_test, y_pred)\n",
+    "\n",
+    "print(f\"🎯 Test set accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)\\n\")\n",
+    "print(\"Classification report:\")\n",
+    "print(classification_report(y_test, y_pred))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Confusion matrix\n",
+    "cm = confusion_matrix(y_test, y_pred)\n",
+    "\n",
+    "plt.figure(figsize=(10, 8))\n",
+    "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', \n",
+    "            xticklabels=range(10), yticklabels=range(10))\n",
+    "plt.xlabel('Predicted', fontsize=12)\n",
+    "plt.ylabel('True', fontsize=12)\n",
+    "plt.title('Confusion Matrix - MNIST', fontsize=14)\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Visualizing predictions"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Show a few predictions\n",
+    "fig, axes = plt.subplots(3, 5, figsize=(14, 8))\n",
+    "fig.suptitle('Prediction examples', fontsize=14)\n",
+    "\n",
+    "indices = np.random.choice(len(X_test), 15, replace=False)\n",
+    "\n",
+    "for i, (ax, idx) in enumerate(zip(axes.flat, indices)):\n",
+    "    ax.imshow(X_test[idx], cmap='gray')\n",
+    "    pred = y_pred[idx]\n",
+    "    true = y_test[idx]\n",
+    "    color = 'green' if pred == true else 'red'\n",
+    "    ax.set_title(f'Pred: {pred} | True: {true}', color=color, fontsize=10)\n",
+    "    ax.axis('off')\n",
+    "\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Find the misclassified test images\n",
+    "errors = np.where(y_pred != y_test)[0]\n",
+    "print(f\"Number of errors: {len(errors)} out of {len(y_test)} ({len(errors)/len(y_test)*100:.2f}%)\\n\")\n",
+    "\n",
+    "# Show a few errors\n",
+    "fig, axes = plt.subplots(2, 5, figsize=(14, 6))\n",
+    "fig.suptitle('Examples of misclassifications', fontsize=14)\n",
+    "\n",
+    "for i, ax in enumerate(axes.flat):\n",
+    "    if i < len(errors):\n",
+    "        idx = errors[i]\n",
+    "        ax.imshow(X_test[idx], cmap='gray')\n",
+    "        ax.set_title(f'Pred: {y_pred[idx]} | True: {y_test[idx]}', color='red', fontsize=10)\n",
+    "    ax.axis('off')\n",
+    "\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Mean pixel analysis per digit"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Compute the mean image for each digit\n",
+    "fig, axes = plt.subplots(2, 5, figsize=(12, 5))\n",
+    "fig.suptitle('Mean image for each digit', fontsize=14)\n",
+    "\n",
+    "for digit in range(10):\n",
+    "    ax = axes[digit // 5, digit % 5]\n",
+    "    mask = y_train == digit\n",
+    "    mean_img = X_train[mask].mean(axis=0)\n",
+    "    ax.imshow(mean_img, cmap='hot')\n",
+    "    ax.set_title(f'Digit: {digit}', fontsize=11)\n",
+    "    ax.axis('off')\n",
+    "\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## 📝 Summary\n",
+    "\n",
+    "In this notebook, we:\n",
+    "1. Loaded the MNIST dataset from Hugging Face\n",
+    "2. Explored the structure and distribution of the data\n",
+    "3. Visualized example images\n",
+    "4. Prepared the data for machine learning\n",
+    "5. Trained a simple logistic regression model\n",
+    "6. Evaluated the model's performance\n",
+    "7. Analyzed the mean images per digit\n",
+    "\n",
+    "**Possible next steps:**\n",
+    "- Try other models (SVM, Random Forest, KNN)\n",
+    "- Implement a neural network with TensorFlow/PyTorch\n",
+    "- Apply data augmentation techniques\n",
+    "- Explore dimensionality reduction (PCA, t-SNE)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}