---
base_model: answerdotai/ModernBERT-base
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- reasoning
- complexity
- education
- regression
- fineweb-edu
---

# Reasoning Complexity Classifier

A ModernBERT-base model fine-tuned to predict the **reasoning complexity** of educational text on a continuous 1–4 scale. Trained on FineWeb-Edu documents labeled by GPT-5-nano via the OpenAI Batch API (~$20 in credits).

## Model Description

This is a regression model (`num_labels=1`, `problem_type="regression"`) that outputs a continuous score. The score can be rounded to the nearest integer to obtain a discrete complexity level. Level 5 (Formal/Abstract reasoning) was excluded from training due to data scarcity; the model's effective range is **1.0–4.0**.
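
A minimal inference sketch, assuming the standard `transformers` regression head. The repo id is a placeholder (the actual checkpoint path is not stated here), and the imports are deferred so the snippet can be imported without `torch` installed:

```python
def predict_complexity(texts, model_id="your-org/reasoning-complexity-classifier"):
    """Return one raw continuous complexity score (~1.0-4.0) per input text.

    The model id above is a placeholder; substitute the actual checkpoint.
    Imports are deferred so this file can be loaded without torch installed.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    enc = tokenizer(texts, truncation=True, max_length=512,
                    padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits  # shape (batch, 1) -- regression head
    return logits.squeeze(-1).tolist()
```

With `num_labels=1` the single logit is the score itself; no softmax or sigmoid is applied.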

### Complexity Levels

| Level | Name | Description | Example |
|-------|------|-------------|---------|
| 1 | Factual/Declarative | States facts with no reasoning | "The Pacific Ocean covers ~165 million km²." |
| 2 | Single-step reasoning | One inference or comparison | "Because boiling point decreases at altitude, water boils faster in Denver than Miami." |
| 3 | Multi-step reasoning | 2–4 chained logical steps | "Demand rose while supply held fixed → prices rose → consumer spending fell → GDP slowed." |
| 4 | Complex reasoning | 5+ steps, conditionals, competing factors | Medical differential diagnosis with branching conditions and exclusion criteria. |

## Training Details

### Data

- **Source**: [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), a curated subset of Common Crawl filtered for educational content.
- **Labeling**: ~100,000 documents reservoir-sampled from ~6,000 records per subject category, then labeled with GPT-5-nano via the OpenAI Batch API using structured output (integer 1–5).
- **Splits**: 80% train / 10% validation / 10% test (stratified by integer complexity level).
- **Preprocessing**: Texts truncated to 8,000 characters before labeling; tokenized to 512 tokens during training with dynamic padding.
- **Level 5 exclusion**: Rows labeled as level 5 were excluded from the training set.
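
The per-category sampling step can be sketched with standard reservoir sampling (Algorithm R); the function below is illustrative, not the project's actual labeling code:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from an iterable of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

This yields a uniform sample in one pass, which is convenient when streaming a large dataset shard without knowing its length in advance.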

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | `answerdotai/ModernBERT-base` |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max token length | 512 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| AMP | bf16 (CUDA) |
| Loss | MSE |
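
Under the `transformers` Trainer API, the table above maps onto a `TrainingArguments` roughly like this. This is a sketch, not the published training script, and `eval_mae` assumes a `compute_metrics` function that reports MAE:

```python
from transformers import TrainingArguments

# Sketch matching the reported hyperparameters; not the project's actual script.
args = TrainingArguments(
    output_dir="reasoning-complexity",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    bf16=True,                         # AMP on CUDA
    eval_strategy="epoch",             # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_mae",  # best checkpoint chosen by validation MAE
    greater_is_better=False,
)
```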

### Training History

| Epoch | Train Loss | Val MAE | Val Acc (rounded) | Val Spearman r |
|-------|-----------|---------|-------------------|----------------|
| 1 | 0.6002 | 0.5190 | 56.98% | 0.7533 |
| **2** | **0.3631** | **0.5040** | **58.43%** | **0.7597** |
| 3 | 0.2040 | 0.5114 | 58.19% | 0.7485 |

The best checkpoint (by validation MAE) was saved at **epoch 2**.

## Evaluation Results

Evaluated on the held-out test set:

| Metric | Value |
|--------|-------|
| MSE | 0.4388 |
| MAE | 0.5063 |
| Rounded accuracy | 58.6% |
| Spearman r | 0.7527 |

**Interpretation**: The model achieves a Spearman correlation of ~0.75 with the gold labels, indicating strong ordinal ranking ability. The MAE of ~0.51 means predictions deviate from the true score by about half a level on average when treated as a continuous signal.
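
These metrics are straightforward to recompute from raw predictions. A sketch using numpy and scipy (the function name and example arrays are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(preds, labels):
    """Compute MSE, MAE, rounded accuracy, and Spearman r for the regression output."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mse = float(np.mean((preds - labels) ** 2))
    mae = float(np.mean(np.abs(preds - labels)))
    # Discretize as the card suggests: clip to [1, 4], then round.
    rounded = np.round(np.clip(preds, 1, 4))
    rounded_acc = float(np.mean(rounded == labels))
    rho = float(spearmanr(preds, labels)[0])  # index 0 = correlation statistic
    return {"mse": mse, "mae": mae, "rounded_accuracy": rounded_acc, "spearman_r": rho}
```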

### Output Interpretation

| Raw score | Meaning |
|-----------|---------|
| ~1.0 | Factual/Declarative |
| ~2.0 | Single-step reasoning |
| ~3.0 | Multi-step reasoning |
| ~4.0 | Complex reasoning |

Clip and round the raw float output to `[1, 4]` for a discrete level.
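
A one-liner for that mapping (the helper name is illustrative); note that Python's built-in `round` uses banker's rounding at exact .5 boundaries:

```python
def to_level(score: float) -> int:
    """Clip a raw regression score to the effective range, then round to a level in {1, 2, 3, 4}."""
    return round(min(max(score, 1.0), 4.0))
```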

## Architecture

Based on `answerdotai/ModernBERT-base`:

- **Layers**: 22 transformer layers (alternating global and sliding-window attention)
- **Hidden size**: 768
- **Attention heads**: 12
- **Intermediate size**: 1,152
- **Max position embeddings**: 8,192
- **Classifier pooling**: mean
- **Classifier activation**: GELU
|
| | ## Limitations |
| |
|
| | - Labels are silver-standard (GPT-5-nano), not human-annotated; label noise may affect the ~1.5% of ambiguous texts. |
| | - Texts are truncated to 512 tokens; very long documents are judged on their first ~512 tokens only. |
| | - Trained primarily on English educational web text; performance may degrade on other domains or languages. |
| |
|
| | ## Intended Use |
| |
|
| | Designed for data curation pipelines that need to filter or balance training corpora by reasoning complexity β for example, constructing curriculum-ordered datasets for language model training. |
| |
|