Model Card for Model ID

Predicts the per-page text quality (BLEU) of various document parsers, given only the PyMuPDF-extracted text of the page.

Model Details

Allen AI's Specter fine-tuned for page-wise document quality prediction.

Model Description

Developed by: Carlo Siebenschuh

Model Sources

Repository: AdaParse@GitHub
Paper: AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Presentation: MLSys25

Uses

Predict the quality of parser output given the extracted text.
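To make the prediction target concrete, here is a minimal sketch of how a BLEU score between a parser's output and a ground-truth page text could be computed. It is the textbook unsmoothed form with whitespace tokenization; the exact BLEU variant (tokenizer, smoothing, n-gram order) used to produce this model's training labels is not stated in this card, so treat those choices as assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one candidate text against one reference text.

    Assumption: whitespace tokenization and uniform weights over 1..max_n.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precision_sum += math.log(matches / total)
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    truth = "the cat sat on a mat"
    print(bleu("the cat sat on the mat", truth))  # partial overlap
    print(bleu(truth, truth))                     # identical text
```

The model does not compute this score at inference time. It estimates it from the PyMuPDF text alone, without access to a ground-truth page.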

Direct Use

Document quality prediction for resource-optimal delegation within AdaParse (version 2 for this particular instance).
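As an illustration of what delegation based on predicted quality can look like, the sketch below picks a parser for a single page. It is not AdaParse's actual policy (see the paper and repository for that). It assumes the model yields one predicted BLEU per parser, and the function name `choose_parser`, the parser names other than `pymupdf`, the cost values, and the `min_gain` threshold are all hypothetical.

```python
def choose_parser(predicted_bleu, cost, default="pymupdf", min_gain=0.05):
    """Pick a parser for one page from predicted per-parser BLEU scores.

    predicted_bleu: dict mapping parser name -> predicted BLEU in [0, 1]
    cost:           dict mapping parser name -> relative compute cost
                    (hypothetical units)
    The cheap default is kept unless another parser is predicted to beat
    it by at least `min_gain`.
    """
    baseline = predicted_bleu[default]
    # Candidates: parsers predicted to improve on the default by min_gain.
    better = [
        name for name, score in predicted_bleu.items()
        if name != default and score - baseline >= min_gain
    ]
    if not better:
        return default
    # Highest predicted quality wins; lower cost breaks ties.
    return max(better, key=lambda name: (predicted_bleu[name], -cost[name]))


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    scores = {"pymupdf": 0.62, "parser_b": 0.64, "parser_c": 0.90}
    costs = {"pymupdf": 1, "parser_b": 20, "parser_c": 50}
    print(choose_parser(scores, costs))
```

A threshold rule like this keeps most pages on the cheap extractor and spends expensive parsing only where the predicted gain is large. A budget-constrained variant would instead rank pages by predicted gain and delegate the top fraction.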

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

Quality prediction (a) for documents that are out-of-distribution (e.g., non-scientific) or (b) for parsers that were not part of the fine-tuning set.

Bias, Risks, and Limitations

Bias: The model was trained on tens of thousands of scientific documents from several journals across eight scientific disciplines (mathematics, engineering, biology, physics, etc.), so it is biased towards STEM documents.

Limitations: Predicting quality from a single page's text, as extracted by one particular tool (PyMuPDF), is challenging.

Recommendations

Fine-tune further on your document corpus.

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

Training Details

Training Data

~30K documents. Not public.

Training Procedure

Internal software:

run_training.py ... --parser pymupdf --max_page_idx 0 --task reg --alpha 0.5 --multi --batch_size 64 --n_epochs 12 --learn_rate 3e-5

Preprocessing [optional]

None

Training Hyperparameters

Training regime: fp32

Speeds, Sizes, Times [optional]

None

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: [More Information Needed]
Hours used: [More Information Needed]
Cloud Provider: [More Information Needed]
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

High-performance compute (Aurora, Polaris, Sophia, Lambda) at Argonne National Laboratory (ANL)/Argonne Leadership Computing Facility (ALCF).

Citation

BibTeX:

@article{siebenschuh2025adaparse,
  title={AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine},
  author={Siebenschuh, Carlo and Hippe, Kyle and Gokdemir, Ozan and Brace, Alexander and Khan, Arham and Hossain, Khalid and Babuji, Yadu and Chia, Nicholas and Vishwanath, Venkatram and Stevens, Rick and others},
  journal={arXiv preprint arXiv:2505.01435},
  year={2025}
}

APA:

Siebenschuh, C., Hippe, K., Gokdemir, O., Brace, A., Khan, A., Hossain, K., ... & Underwood, R. (2025). AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine. arXiv preprint arXiv:2505.01435.

Model Card Authors [optional]

Carlo Siebenschuh

Model Card Contact

7shoe

Safetensors

Model size: 0.1B params
Tensor type: F32