--- base_model: - HuggingFaceTB/SmolVLM-256M-Instruct language: - en library_name: transformers license: cdla-permissive-2.0 pipeline_tag: image-text-to-text ---
SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.
DocTags create a clear and structured system of tags and rules that separate text from the document's structure. This makes things easier for Image-to-Sequence models by reducing confusion. On the other hand, converting directly to formats like HTML or Markdown can be messyβit often loses details, doesnβt clearly show the documentβs layout, and increases the number of tokens, making processing less efficient.
DocTags are integrated with Docling, which allows export to HTML, Markdown, and JSON. These exports can be offloaded to the CPU, reducing token generation overhead and improving efficiency.
## Supported Instructions
| Description | Instruction | Comment |
| Full conversion | Convert this page to docling. | DocTags represetation |
| Chart | Convert chart to table. | (e.g., <chart>) |
| Formula | Convert formula to LaTeX. | (e.g., <formula>) |
| Code | Convert code to text. | (e.g., <code>) |
| Table | Convert table to OTSL. | (e.g., <otsl>) OTSL: Lysak et al., 2023 |
| Actions and Pipelines | OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237> | |
| Identify element at: <loc_247><loc_482><10c_252><loc_486> | ||
| Find all 'text' elements on the page, retrieve all section headers. | ||
| Detect footer elements on the page. |