Model Card for Model ID
Predicts the per-page text quality (BLEU) of various document parsers' output, given the PyMuPDF-extracted text of that page.
Model Details
Allen AI's Specter fine-tuned for page-wise document quality prediction.
Model Description
- Developed by: Carlo Siebenschuh
Model Sources
- Repository: AdaParse@GitHub
- Paper: AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
- Presentation: MLSys25
Uses
Predict the quality of parser output given the extracted text.
Direct Use
Document quality prediction for resource-optimal delegation within AdaParse (version 2 for this particular instance).
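As an illustration of what delegation driven by predicted quality could look like, here is a minimal sketch. It is not AdaParse's actual routing logic: the `Parser` class, the `delegate` function, the `min_gain` threshold, the cost values, and the parser name `high_quality_parser` are all hypothetical and chosen only for this example. The idea is to keep the cheapest parser unless another parser is predicted to improve BLEU by a meaningful margin.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Parser:
    name: str
    cost: float  # hypothetical relative compute cost per page


def delegate(predicted_bleu: Dict[str, float],
             parsers: List[Parser],
             min_gain: float = 0.05) -> str:
    """Return the name of the parser to use for one page.

    Keeps the cheapest parser unless the parser with the highest
    predicted BLEU beats it by at least `min_gain` (an assumed threshold).
    """
    cheapest = min(parsers, key=lambda p: p.cost)
    best = max(parsers, key=lambda p: predicted_bleu[p.name])
    gain = predicted_bleu[best.name] - predicted_bleu[cheapest.name]
    return best.name if gain >= min_gain else cheapest.name


# Hypothetical parsers and predicted scores, for illustration only.
parsers = [Parser("pymupdf", cost=1.0), Parser("high_quality_parser", cost=50.0)]
small_gap = delegate({"pymupdf": 0.62, "high_quality_parser": 0.65}, parsers)
large_gap = delegate({"pymupdf": 0.30, "high_quality_parser": 0.70}, parsers)
```

In this sketch a page is only sent to the costly parser when the predicted gain justifies the extra compute; the threshold would need tuning on your own corpus.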
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
Quality prediction for documents that are out-of-distribution (e.g., non-scientific), or for parsers that were not part of the fine-tuning set.
Bias, Risks, and Limitations
Bias: The model was trained on tens of thousands of scientific documents from several journals across eight scientific disciplines (mathematics, engineering, biology, physics, etc.), so it is biased towards STEM documents.
Limitations: Predicting quality from a single page's text, as extracted by one particular tool (PyMuPDF), is challenging.
Recommendations
Fine-tune further on your document corpus.
How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
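Until official instructions are provided, the following is a rough sketch of how the model might be queried. It rests on several assumptions that this card does not confirm: that the checkpoint loads with `AutoModelForSequenceClassification` and exposes its regression output(s) as `logits`, that the input is the text of the first page (suggested by `--max_page_idx 0` in the training command), and that a 512-token limit applies. The `model_id` argument is a placeholder you must supply, and both function names are hypothetical.

```python
from typing import List


def first_page_text(pdf_path: str) -> str:
    """Extract the text of the first page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the module loads without it

    with fitz.open(pdf_path) as doc:
        return doc[0].get_text()


def predict_page_quality(text: str, model_id: str) -> List[float]:
    """Return the raw regression output(s) of the model for one page of text.

    Assumption: the checkpoint is compatible with
    AutoModelForSequenceClassification. Adjust if it is not.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Truncate to 512 tokens (assumed limit of the BERT-style encoder).
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze(0).tolist()
```

A call would then look like `predict_page_quality(first_page_text("paper.pdf"), model_id)`, where `paper.pdf` is any local file and `model_id` is this model's repository name. How the returned values map to individual parsers is not documented here.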
Training Details
Training Data
~30K documents. Not public.
Training Procedure
Internal software:
run_training.py ... --parser pymupdf --max_page_idx 0 --task reg --alpha 0.5 --multi --batch_size 64 --n_epochs 12 --learn_rate 3e-5
Preprocessing [optional]
None
Training Hyperparameters
- Training regime: fp32
Speeds, Sizes, Times [optional]
None
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
Summary
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
High-performance compute (Aurora, Polaris, Sophia, Lambda) at Argonne National Laboratory (ANL)/Argonne Leadership Computing Facility (ALCF).
Citation
BibTeX:
@article{siebenschuh2025adaparse,
title={AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine},
author={Siebenschuh, Carlo and Hippe, Kyle and Gokdemir, Ozan and Brace, Alexander and Khan, Arham and Hossain, Khalid and Babuji, Yadu and Chia, Nicholas and Vishwanath, Venkatram and Stevens, Rick and others},
journal={arXiv preprint arXiv:2505.01435},
year={2025}
}
APA:
Siebenschuh, C., Hippe, K., Gokdemir, O., Brace, A., Khan, A., Hossain, K., ... & Underwood, R. (2025). AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine. arXiv preprint arXiv:2505.01435.
Model Card Authors [optional]
Carlo Siebenschuh
Model Card Contact
7shoe