---
license: apache-2.0
tags:
- conversational
- efficient
- i3-architecture
- custom_code
datasets:
- starhopp3r/TinyChat
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# i3 Model - Ultra-Efficient Pretraining Language Model

## Model Description

The **i3 Model** is designed to optimize **pretraining efficiency** while retaining core language modeling capabilities. Its architecture allows training on **memory-constrained hardware**, including CPU-only setups, without sacrificing sequence modeling performance.

> [!NOTE]
> The i3 architecture enables highly efficient pretraining: it is designed to **reduce memory usage**, **speed up training**, and make pretraining from scratch feasible on tiny hardware. Internal details are abstracted for simplicity.

---

## Use

```python
from transformers import pipeline

# The custom_code tag indicates this model ships custom modeling code,
# so loading requires trust_remote_code=True.
pipe = pipeline("text-generation", model="FlameF0X/i3-12m", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

---

## Model Statistics

* **Vocabulary Size:** 4,466 (variable-length chunks)
* **Hidden Dimension:** 512
* **Number of Layers:** 12
* **Max Sequence Length:** 256
* **Total Parameters:** 12,691,186
* **Tokenization:** Memory-efficient variable-length chunking (2–3 characters)
* **Total Tokens:** 334,524,736
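
The card does not document the chunking algorithm itself, so here is a minimal greedy sketch of what 2–3 character variable-length chunking could look like. The `chunk_tokenize` function and the toy vocabulary are illustrative assumptions, not the model's actual tokenizer:

```python
def chunk_tokenize(text, vocab):
    """Greedy longest-match chunking: prefer a known 3-char chunk,
    then a known 2-char chunk, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if size == 1 or chunk in vocab:
                tokens.append(chunk)
                i += size
                break
    return tokens

# Toy vocabulary of frequent chunks (illustrative only)
vocab = {"the", "ing", "he", "in"}
print(chunk_tokenize("the thing", vocab))  # ['the', ' ', 't', 'h', 'ing']
```

Because most tokens cover 2–3 characters, the vocabulary stays small (4,466 entries here), which keeps the embedding table compact.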

---

## Key Features

1. **Memory-Optimized:** Designed to train on tiny hardware with minimal RAM usage
2. **Pretraining-Focused Architecture:** i3 layers provide efficient sequence modeling, low-rank linear updates, and factorized attention
3. **Variable-Length Tokenization:** 2–3 character chunks for compact embeddings
4. **Conversational Readiness:** Optimized for dialogue and text generation

---

## i3 Architecture (Abstract Overview)

### Design Philosophy

The i3 model targets **CPU-friendly, memory-constrained pretraining**, emphasizing:

* Long-range sequence modeling
* Low-rank weight updates for memory savings
* Efficient factorized attention
* 4-bit weights and microbatching for a minimal memory footprint
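
The microbatching idea can be sketched without any framework: gradients from small microbatches are accumulated and averaged, producing the same update as one large batch while only one microbatch needs to be in memory at a time. The toy linear model and data below are illustrative assumptions:

```python
def grad(w, x, y):
    # Gradient of the squared error (w*x - y)**2 with respect to w
    return 2.0 * (w * x - y) * x

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5

# Full-batch mean gradient
full = sum(grad(w, x, y) for x, y in data) / len(data)

# Microbatching: process two examples at a time, accumulate, divide once
acc = 0.0
for i in range(0, len(data), 2):
    for x, y in data[i:i + 2]:
        acc += grad(w, x, y)
micro = acc / len(data)

assert abs(full - micro) < 1e-12  # mathematically identical updates
```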

### Technologies used in the i3 architecture that I have open-sourced

* [Low-Rank Pre-training](https://github.com/FlameF0X/Low-Rank-Pretraining) - LoRA-style low-rank factorization applied to pre-training.
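
As a back-of-the-envelope illustration of why low-rank factorization saves memory at the card's hidden size of 512 (the rank of 16 is an assumed example; the actual rank is not documented):

```python
d = 512  # hidden dimension from the model card
r = 16   # illustrative rank, not the model's actual setting

full_params = d * d              # dense weight W: d x d
low_rank_params = d * r + r * d  # factors A (d x r) and B (r x d), W ~= A @ B

print(full_params)                    # 262144
print(low_rank_params)                # 16384
print(full_params / low_rank_params)  # 16.0x fewer parameters per layer
```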

### Conceptual Layout

```
Input Tokens
      ↓
+-----------------+
| Embedding Layer |
+-----------------+
      ↓
+-----------------+
| i3 Architecture |
+-----------------+
      ↓
+------------------------+
| KQV Low-Rank Attention |
+------------------------+
      ↓
+-----------------------+
| LayerNorm + Residuals |
+-----------------------+
      ↓
+-------------------+
| Output Projection |
+-------------------+
      ↓
Predicted Tokens
```

> Key idea: Every component is optimized for **memory efficiency** and **pretraining speed** on small hardware, while preserving essential transformer dynamics.
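
To make the "KQV Low-Rank Attention" box concrete, here is a toy, dependency-free sketch of single-head attention whose Q/K/V projections are factorized as `(X @ A) @ B`, so the full d×d projection matrix is never materialized. All sizes and weights below are illustrative assumptions, not the model's actual internals:

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def low_rank_proj(X, A, B):
    # Apply W = A @ B as (X @ A) @ B: only d*r + r*d weights are stored.
    return matmul(matmul(X, A), B)

d, r, seq = 4, 2, 3  # toy sizes; the card's hidden dimension is 512
X = [[0.1 * (i + j) for j in range(d)] for i in range(seq)]
A = [[0.1] * r for _ in range(d)]  # illustrative fixed factors,
B = [[0.1] * d for _ in range(r)]  # not trained weights

Q, K, V = (low_rank_proj(X, A, B) for _ in range(3))
scores = matmul(Q, [list(col) for col in zip(*K)])           # Q @ K^T
weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
out = matmul(weights, V)
print(len(out), len(out[0]))  # 3 4  (seq x d: output shape is preserved)
```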

---

## Training Details

* **Sequence length:** 128–512 tokens
* **Model size:** ~12M parameters (CPU-friendly)
* **Optimizer:** AdamW or Lion (4-bit / mixed precision)
* **Dataset:** TinyChat (~50–200 MB)
* **Training loop:** gradient checkpointing + recomputation
* **Objective:** next-token prediction / text generation
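
The training objective is standard next-token prediction: at each position the model's logits are scored against the token that actually followed, using cross-entropy. A dependency-free sketch with made-up logits over a 3-token toy vocabulary:

```python
import math

def cross_entropy(logits, target):
    # Negative log-likelihood of the target token under softmax(logits)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

# Illustrative logits for two sequence positions (not real model output)
logits_per_pos = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
next_tokens = [0, 1]  # the token id that actually followed at each position

loss = sum(cross_entropy(l, t)
           for l, t in zip(logits_per_pos, next_tokens)) / len(next_tokens)
print(round(loss, 4))
```

Minimizing this mean loss over the 334M training tokens is what drives the model toward fluent text generation.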

---

## Citation

```bibtex
@software{lorpt2025,
  title={LoRPt: Low-Rank Pretraining for Resource-Efficient Language Models},
  author={FlameF0X},
  year={2025},
  url={https://github.com/FlameF0X/Low-Rank-Pretraining}
}
```