Typst Image Dataset
This dataset was generated with a fork of tex2typ and the hoang-quoc-trung/fusion-image-to-latex-datasets dataset, which itself is a compilation of LaTeX labels and images of equations.
The hoang-quoc-trung dataset is difficult to work with in that it has the image data stored in a large compressed RAR archive, which does not permit efficient random read access. Additionally, it appears to have a larger number of corrupted filenames inside the archive, which has been mended in this dataset.
This dataset instead opts to use a WebDataset for convenient and efficient storage of the image files and associated metadata.
The code used to generate this dataset can be found at here: https://github.com/JeppeKlitgaard/DTU-02456-Deep-Learning-Project (this is currently private but should be released after examination. If this is not the case prod me at huggingface@jeppe.science)