Introduction
Teaching is a lot like herding cats or playing whack-a-mole. Everyone learns in different styles, at different rates, and has different questions. While there is much to be said for the valiant effort our teachers give in addressing all of this and more within a classroom setting, teachers are finite. Students need more help than teachers can provide, at times when teachers cannot provide it. Thus enters the role of tutors, but tutoring is often scheduled and tied to a specific subject. What happens when a student needs help outside of their normal scheduled programming? Enter the rise of AI tutors, which can be accessed at any time, for any subject. We do want to ensure that our students are getting accurate information and not a hallucination that will steer them down the wrong path. This finetuned LLM seeks to do just that, specifically for AP Statistics. Due to the broad scope of AP Statistics and the text-only nature of this LLM, it was necessary to narrow the model to a task that can be done (mostly) in text and that carries real weight on the AP exam: inference. Inference procedures count for about 15% of the multiple choice section and are guaranteed to appear as one full free response question (in the vein of the example below), with the possibility of more. That makes this topic within AP Statistics perfect for a finetuned model to help students understand it and score higher on the AP exam. This model aims to help students fill the gaps in their knowledge and understand the topic of inference better.
Data
The dataset used in training consists of 1014 question-answer pairs on the topic of inference for the AP Statistics exam, created by the owner of this model. Specifically, the problems involve one- and two-sample means and proportions confidence intervals and significance tests (no chi-square or inference for slope). These problems were taken from three textbooks (two paid copies, one open source). There are 928 free response questions and 86 multiple choice questions. The dataset was split 80/20 between a training set and a validation set using a random seed of 42. The dataset can be found here.
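The exact split code is not published with the model; the following is a minimal, hypothetical sketch of a deterministic 80/20 split with seed 42 using only the standard library (the "q"/"a" field names are assumptions):

```python
# Hypothetical sketch: the actual split code is not published.
# Shuffles the pairs deterministically with seed 42, then splits 80/20.
import random

def split_dataset(pairs, val_fraction=0.2, seed=42):
    """Shuffle question-answer pairs deterministically and split them."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)

# 1014 placeholder pairs standing in for the real dataset.
pairs = [{"q": f"question {i}", "a": f"answer {i}"} for i in range(1014)]
train, val = split_dataset(pairs)
print(len(train), len(val))  # 812 202
```

With 1014 pairs, this yields 812 training and 202 validation examples.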
Methodology
The finetuning method used was LoRA (Low-Rank Adaptation). LoRA is a moderate intervention that is well suited to this task: it is computationally efficient, preserves the knowledge of the base model well, and produces small adapter files, so the latency of the model is minimally impacted. This is perfect for an AI tutor, since students these days need answers immediately or they move on to other things. They also tend to go down rabbit holes, so while this model is specifically trained as a statistics tutor, keeping the base model's general knowledge intact is important when they explore those rabbit holes.
The hyperparameters are as follows:
LoRA R: 64
LoRA Alpha: 64
LoRA Dropout: 0.05
Learning Rate: 0.00001
Epochs: 3
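These hyperparameters map directly onto a PEFT `LoraConfig`. The sketch below is illustrative rather than the actual training script; in particular, the `target_modules` choice is an assumption (a common choice for Qwen-style attention layers):

```python
# Illustrative configuration only -- the exact training script is not published.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                # LoRA R
    lora_alpha=64,       # LoRA Alpha
    lora_dropout=0.05,   # LoRA Dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
# Training then ran for 3 epochs at a learning rate of 1e-5.
```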
Evaluation
The metrics used to evaluate this model are the mmlu_high_school_statistics, minerva_math, and race benchmarks; BERTScore results are also reported. The mmlu_high_school_statistics benchmark is the main statistics knowledge benchmark; minerva_math serves as a general math benchmark to check that general math knowledge was not lost while gaining statistics knowledge; and the race benchmark establishes that the model does not suffer from catastrophic forgetting. As the table shows, this new model does not improve on the scores of the base model on any of the evaluation metrics. Not improving on the minerva_math and race benchmarks should be expected, and the fact that they did not decrease is a good sign. Not improving on mmlu_high_school_statistics is a bit disheartening. However, the benchmark score is one thing; how the model actually performs is another. In testing, the base model frequently made math errors and, though it "answered" the question, left off key pieces required for the AP exam. The finetuned model fixes these errors and gives answers in a format much more suitable for the AP exam, with fewer mistakes (see the Expected Output Format section). This gap may be due to the specific objective of the model (inference problems for the AP test) versus the broad nature of the benchmark (all of high school statistics in a non-AP setting). So, while the benchmark scores do not indicate success, the model does perform better in real-world scenarios, indicating the finetuning was a success. The model was also compared to Llama-3.2-3B-Instruct and Mistral7B-Instruct-v0.2 and shows superior metrics on mmlu_high_school_statistics and minerva_math while having a comparable race metric. Similar BERTScores were also calculated across all models.
| Model | mmlu_high_school_statistics | minerva_math | race | BERT: precision | BERT: recall | BERT: f1 |
|---|---|---|---|---|---|---|
| AP_Stat_Inference_Helper | 0.72 | 0.45 | 0.32 | 0.75 | 0.85 | 0.80 |
| Qwen3-4B-Instruct-2507 | 0.72 | 0.45 | 0.32 | 0.75 | 0.85 | 0.80 |
| Llama-3.2-3B-Instruct | 0.30 | 0.29 | 0.38 | 0.77 | 0.86 | 0.81 |
| Mistral7B-Instruct-v0.2 | 0.46 | 0.09 | 0.38 | 0.79 | 0.86 | 0.82 |
Usage and Intended Use
To use this model, simply use the following code:
# Load the model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("bnolton/AP_Stat_Inference_Helper", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("bnolton/AP_Stat_Inference_Helper")
The intended use of this model is geared at AP Statistics students wanting to ensure they understanding the topic of inference. The student will enter their question into the model and the model will output a response in the AP style format.
Prompt Format
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    dtype = torch.bfloat16,
    device_map = "auto",
    max_new_tokens = 500,
    do_sample = False)
formatted_prompt = "Q: YOUR QUESTION HERE \n\nA: "
text = pipe(formatted_prompt)
print(text[0]['generated_text'])
Expected Output Format
Q: Past experience is that when individuals are approached with a request to fill out and return a particular questionnaire in a provided stamped and addressed envelope, the response rate is 40%. An investigator believes that if the person distributing the questionnaire were stigmatized in some obvious way, potential respondents would feel sorry for the distributor and thus tend to respond at a rate higher than 40%. To test this theory, a distributor wore an eye patch. Of the 200 questionnaires distributed by this individual, 109 were returned. Does this provide evidence that the response rate in this situation is greater than the previous rate of 40%? State and test the appropriate hypotheses using a significance level of 0.05.
A:
State the hypotheses:
H0: p = 0.40 (the response rate is the same as the previous rate)
Ha: p > 0.40 (the response rate is greater than the previous rate)
Check the conditions:
- Randomness: The 200 questionnaires were randomly distributed to individuals. (Assuming the investigator randomly selected the individuals to distribute the questionnaire to.)
- Independence: The 200 individuals are independent of each other. (Assuming the individuals are randomly selected and not related to each other.)
- Normality: np0 = 200(0.40) = 80 and n(1 - p0) = 200(1 - 0.40) = 120. Both are greater than 10, so the sampling distribution of the sample proportion is approximately normal.
- Calculate the test statistic:
- Sample proportion: p-hat = 109 / 200 = 0.545
- Standard error: SE = sqrt(p0(1 - p0) / n) = sqrt(0.40(1 - 0.40) / 200) = sqrt(0.24 / 200) = 0.0346
- Test statistic: z = (p-hat - p0) / SE = (0.545 - 0.40) / 0.0346 = 4.19
Determine the p-value:
Using a standard normal distribution table or calculator, find the probability of obtaining a z-score of 4.19 or greater. This is a very small probability (p-value < 0.0001).
Make a decision:
Since the p-value is less than the significance level of 0.05, we reject the null hypothesis.
Conclusion:
There is sufficient evidence at the 0.05 significance level to conclude that the response rate in this situation is greater than the previous rate of 40%. The investigator's theory that people would respond at a higher rate when the distributor was stigmatized appears to be supported by the data.
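The arithmetic in this sample answer can be verified independently. The short check below recomputes the standard error, test statistic, and a normal-tail p-value using only the Python standard library (it is a verification aid, not part of the model's output):

```python
import math

# Recompute the worked example's numbers: H0: p = 0.40, n = 200, 109 returned.
p0, n, successes = 0.40, 200, 109
p_hat = successes / n                  # sample proportion
se = math.sqrt(p0 * (1 - p0) / n)      # standard error under H0
z = (p_hat - p0) / se                  # test statistic
# Upper-tail p-value for the one-sided test, via the complementary error function.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(round(p_hat, 3), round(se, 4), round(z, 2))  # 0.545 0.0346 4.19
print(p_value < 0.0001)  # True
```

The recomputed values match the model's output, confirming the sample answer's arithmetic.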
Limitations
The dataset for this model is solely focused on inference procedures for the AP Statistics course, specifically one- and two-sample means and proportions confidence intervals and significance tests (no chi-square or inference for slope). While the AP exam would require drawing a curve for some of these problems, this model is text only. The model may use some terms that are being phased out, because the problems in the dataset come from sources published before the AP Statistics rework (for example, "independence" instead of the 10% condition and "normality" instead of the large counts condition). The example above was copied from actual output of this finetuned LLM. While it is a great improvement on the base model, it still leaves a few things to be desired, though it will help students more than the base model would. The model also does not always know when to stop: an end-of-sequence token could not be found to stop generation, so the model sometimes carries on after the question is answered. Limiting the output to 500 new tokens is usually enough to answer the question without rambling on.
Model tree for bnolton/AP_Stat_Inference_Helper
Base model
Qwen/Qwen3-4B-Instruct-2507