Introduction

This model was fine-tuned to extract scientific equipment details from the methods sections of academic articles. The goal is to generate structured JSON outputs capturing the name, model, brand, and country of origin (if available) of research equipment. This task is important because existing metadata sources like Web of Science or PubMed often lack consistent or complete information about research infrastructure. Understanding what tools are used in emerging scientific domains can support both academic analysis and strategic insights for industry. While general-purpose LLMs can parse text well, they struggle with producing structured, domain-specific outputs. Fine-tuning was necessary to teach the model how to identify and organize equipment details reliably. The resulting model shows strong performance on extracting relevant information and provides a valuable foundation for downstream research infrastructure analysis.

Training Data

The training dataset was constructed from a custom corpus of academic articles drawn from bioRxiv.org, the open-access preprint server for biology. For each article, the methods section was extracted and paired with structured annotations specifying the scientific equipment mentioned, including fields for name, brand, model, and country of origin when available. These annotations were generated using a few-shot prompting strategy with OpenAI’s GPT-4o mini model and then manually reviewed for quality. The data was formatted as JSONL with instruction-following input/output fields suitable for supervised fine-tuning. The dataset did not come with a built-in test set, so a train/test split was created manually from the full annotated corpus, using a fixed random seed of 42 for reproducibility.
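
A minimal sketch of what one JSONL record and the seeded split might look like is shown below; the field names, the example record, and the 90/10 split ratio are illustrative assumptions rather than the exact schema used.

import json
import random

# Illustrative training record; the "instruction"/"output" field names and the
# example text are assumptions, not the exact schema used for fine-tuning.
records = [
    {
        "instruction": (
            "Extract structured information about supplies and equipment used "
            "in the scientific article. Return ONLY valid JSON.\n\n"
            "Cells were sorted using a FACSAria III (BD Biosciences)."
        ),
        "output": json.dumps({
            "equipment_list": [
                {"name": "FACSAria III", "brand": "BD Biosciences",
                 "model": "Unknown", "country_of_origin": "Unknown"}
            ]
        }),
    },
    # ... one record per annotated methods section
]

random.seed(42)                   # fixed seed for a reproducible split
random.shuffle(records)
cut = int(0.9 * len(records))     # assumed 90/10 train/test ratio
train, test = records[:cut], records[cut:]

with open("train.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in train)
with open("test.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in test)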

Training Method

I fine-tuned the Mistral-7B-Instruct-v0.1 model using the Hugging Face Transformers library with parameter-efficient fine-tuning via LoRA. I chose this method to reduce computational cost while still allowing the model to specialize in structured extraction tasks. The model was trained to generate JSON-style outputs containing equipment details (name, brand, model, country of origin) from the methods sections of academic articles. I used a learning rate of 2e-5, a batch size of 4, and trained for 3 epochs with early stopping based on validation loss. The fine-tuning setup used the AdamW optimizer with a warmup ratio of 0.1 and a linear learning rate scheduler. These hyperparameters were chosen after initial experiments balancing performance and training efficiency.
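
A condensed sketch of this setup with the PEFT library is shown below. The learning rate, batch size, epochs, warmup ratio, optimizer, and scheduler come from the description above; the LoRA rank, alpha, dropout, and target modules are illustrative assumptions rather than the exact configuration used.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_model = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)

# LoRA rank, alpha, dropout, and target modules below are assumed values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters taken from the description above
training_args = TrainingArguments(
    output_dir="mistral_equipment_lora",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    eval_strategy="epoch",           # evaluate each epoch for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the checkpoint with the lowest validation loss
    metric_for_best_model="eval_loss",
)

# Training then proceeds with transformers.Trainer (plus an EarlyStoppingCallback)
# on the tokenized train/validation splits from the JSONL corpus.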

Evaluation

To evaluate the effectiveness of the fine-tuned model on scientific equipment extraction, I tested performance across three benchmark tasks: WikiANN and CoNLL-2003 (standard named entity recognition datasets), as well as my own custom dataset of scientific methods sections manually annotated for equipment mentions. I also evaluated on the held-out test set created from the annotated corpus to generate metrics for comparison.

The CoNLL-2003 and WikiANN datasets served as standard NER benchmarks to measure general extraction ability, while my manually tagged dataset and custom test set focused on domain-specific extraction of scientific equipment from academic articles. I selected Zephyr-7B and OpenChat-3.5 as comparison models because they are similar in size to my Mistral-7B base model and represent open-source instruction-tuned alternatives commonly used in applied NLP. Compared to these models, my LoRA fine-tuned Mistral outperformed both on all four benchmarks, with the largest gains on the domain-specific tasks: it achieved more than five times the F1 score of Zephyr-7B on both the manually tagged dataset and the equipment test set, and roughly double the F1 of OpenChat-3.5 on the manually tagged dataset.
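
The precision, recall, and F1 values reported below are entity-level scores. A rough sketch of how such set-based scores can be computed follows; exact string matching after lowercasing is a simplifying assumption, not necessarily the matching rule used for the reported numbers.

def entity_scores(predicted, gold):
    # Compare predicted and gold entity mentions as normalized string sets
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: two of three predictions match the gold annotations
print(entity_scores(["FACSAria III", "RPMI media", "incubator"],
                    ["FACSAria III", "RPMI media"]))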

F1 Score Comparison Across Benchmarks

Benchmark            Mistral (pre)   Mistral (post)   Zephyr-7B   OpenChat-3.5
CoNLL-2003           0.167           0.257            0.119       0.120
Manual Tagged        0.073           0.525            0.036       0.260
WikiANN              0.111           0.154            0.010       0.029
Equipment Test Set   0.222           0.539            0.074       0.434

Precision Comparison Across Benchmarks

Benchmark            Mistral (pre)   Mistral (post)   Zephyr-7B   OpenChat-3.5
CoNLL-2003           0.136           0.221            0.134       0.199
Manual Tagged        0.074           0.452            0.061       0.304
WikiANN              0.132           0.182            0.009       0.036
Equipment Test Set   0.219           0.501            0.082       0.478

Recall Comparison Across Benchmarks

Benchmark            Mistral (pre)   Mistral (post)   Zephyr-7B   OpenChat-3.5
CoNLL-2003           0.214           0.307            0.107       0.086
Manual Tagged        0.072           0.627            0.025       0.227
WikiANN              0.096           0.133            0.011       0.024
Equipment Test Set   0.226           0.584            0.068       0.397

I compared my model to the base Mistral-7B model (used without fine-tuning) and observed substantial improvements across all evaluation settings. Notably, fine-tuning improved F1 on my custom equipment test set from 0.222 to 0.539 and on the manually tagged dataset from 0.073 to 0.525. Even on general-purpose NER tasks like CoNLL-2003 and WikiANN, performance improved meaningfully, suggesting that the instruction tuning led to more generalizable entity extraction behavior.

Usage and Intended Uses

This model was fine-tuned to extract structured information about scientific supplies and equipment from the methods sections of academic articles. It is designed for use cases that require converting unstructured text into structured JSON, such as research metadata pipelines, equipment indexing, or automated review support. While optimized for biomedical research articles, the model may generalize to similar domains that use technical descriptions of laboratory processes and materials. The example below shows how to load the adapter and run inference.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "cvillejustin/mistral_custom_lora"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model using efficient memory settings
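# Note: loading a LoRA adapter repository directly this way relies on the
# Transformers PEFT integration, so the `peft` package must be installed.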
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True
)

# Example inference
prompt = (
    "[INST] Extract structured information about supplies and equipment used in the scientific article. "
    "Return ONLY valid JSON—no explanations, extra text, or reformatting. "
    "Extract only equipment and materials mentioned in the text. "
    "If any field (brand, model, or country of origin) is missing, mark it as 'Unknown'. "
    "If no equipment is found, return: { \"equipment_list\": [] }. "
    "Do not repeat the article text in the response. "
    "Do not add explanations, opinions, or anything outside the JSON format. "
    "\n\nMice were housed in polycarbonate cages (Tecniplast) and fed with standard chow (Envigo 2018S). Cells were sorted using a FACSAria III (BD Biosciences). [/INST]"
)

# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

# Decode response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Prompt Format

This model expects input in a simple instruction format that clearly specifies the task. Prompts begin with an [INST] tag, followed by the instruction and the target text, and end with a closing [/INST] tag.

[INST] Extract structured information about supplies and equipment used in the scientific article. 
Return ONLY valid JSON—no explanations, extra text, or reformatting. 
Extract only equipment and materials mentioned in the text. 
If any field (brand, model, or country of origin) is missing, mark it as 'Unknown'. 
If no equipment is found, return: { "equipment_list": [] }. 

C57BL/6J mice (8–18-week-old male) were purchased from Orient Bio. Jurkat T cells (Clone E6-1, ATCC) and RPMI media (Gibco) were used... [/INST]
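
For convenience, the instruction block can be wrapped in a small helper that builds the full prompt from a methods-section string; the function name below is illustrative and not part of the released model.

# Hypothetical helper that wraps a methods section in the [INST] ... [/INST]
# instruction format shown above.
INSTRUCTION = (
    "Extract structured information about supplies and equipment used in the "
    "scientific article. Return ONLY valid JSON—no explanations, extra text, "
    "or reformatting. Extract only equipment and materials mentioned in the "
    "text. If any field (brand, model, or country of origin) is missing, mark "
    "it as 'Unknown'. If no equipment is found, return: "
    "{ \"equipment_list\": [] }."
)

def build_prompt(methods_text: str) -> str:
    return f"[INST] {INSTRUCTION}\n\n{methods_text} [/INST]"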

Expected Output Format

The model returns a JSON object containing an equipment_list field. This field holds a list of dictionaries, each representing a detected piece of equipment with fields for name, brand, model, and country_of_origin. Missing information is labeled as "Unknown".

{
  "equipment_list": [
    {
      "name": "RPMI media",
      "brand": "Gibco",
      "model": "Unknown",
      "country_of_origin": "Unknown"
    },
    {
      "name": "Jurkat T cells",
      "brand": "ATCC",
      "model": "Clone E6-1",
      "country_of_origin": "Unknown"
    }
  ]
}
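
Because the inference example above decodes the full generated sequence, the returned string still contains the prompt. Downstream code typically keeps only the text after the closing [/INST] tag and parses it as JSON; a minimal sketch follows, where falling back to an empty equipment_list on parse failure is an assumption rather than model behavior.

import json

def parse_equipment(generated_text: str) -> dict:
    # Keep only the text after the closing instruction tag, if present
    answer = generated_text.split("[/INST]")[-1].strip()
    try:
        # Isolate the outermost JSON object before parsing
        start, end = answer.index("{"), answer.rindex("}") + 1
        return json.loads(answer[start:end])
    except (ValueError, json.JSONDecodeError):
        return {"equipment_list": []}

# Example usage with a toy generated string
example = '[INST] instruction and article text [/INST] {"equipment_list": []}'
print(parse_equipment(example))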

Limitations

While the model performs well on structured extraction of equipment from scientific texts, it has several limitations. First, its accuracy depends on the clarity and specificity of the language in the article; ambiguous or highly technical phrasing can reduce extraction quality. Second, the model occasionally misses equipment that is mentioned implicitly or across multiple sentences. Finally, when brand or model details are not explicitly stated, the model may inconsistently default to “Unknown”.

Framework versions

  • PEFT 0.14.0