INTRODUCTION

The field of geoscience is critical yet underrepresented, both in general and in machine learning research. Geoscientific information is crucial for advancing our understanding of Earth's dynamic systems, including energy resources, hydrology, mineral exploration, natural hazards, and atmospheric processes. To make geoscientific information more accessible, I present a large language model (LLM) tailored for long-form text summarization in the geoscience domain. Starting from the google-t5/t5-large model, I fine-tune it with Low-Rank Adaptation (LoRA) on the ArXiv Summarization dataset, which consists of technical academic articles. Performance was evaluated on three long-document summarization benchmarks from the lm_eval harness. While the results were modest, they illustrate the model's potential to condense geoscientific literature into useful summaries.

TRAINING DATA

The training data used for this model is the ArXiv Summarization dataset, available on HuggingFace. The dataset includes full-text scientific articles and their corresponding abstracts. Though this dataset does not focus exclusively on geoscience papers, it includes papers from fields such as physics, mathematics, biology, and chemistry, all of which are relevant to geoscience. The dataset was split 80/20 into training and validation sets using a fixed random seed of 42.
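The 80/20 split with a fixed seed can be sketched in plain Python (an illustrative stand-in for the actual splitting call; the function name and toy data are mine, not from the repository):

```python
import random

def split_dataset(examples, train_frac=0.8, seed=42):
    """Shuffle with a fixed seed, then split into train/validation sets."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    cut = int(len(examples) * train_frac)
    train = [examples[i] for i in indices[:cut]]
    val = [examples[i] for i in indices[cut:]]
    return train, val

train, val = split_dataset(list(range(1000)))
print(len(train), len(val))  # 800 200
```

Fixing the seed makes the split reproducible, so the same validation articles are held out on every run.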

TRAINING METHOD

The training method implemented for this project was LoRA. The google-t5/t5-large model was fine-tuned with LoRA to gain efficiency without heavy computational cost. The target modules were SelfAttention.q and SelfAttention.v, with the LoRA configuration as follows: r set to 64, lora_alpha set to 64, lora_dropout set to 0.05. Training on the ArXiv dataset described in the previous section was conducted for a single epoch using the Seq2SeqTrainer, with a per-device batch size of 2, gradient accumulation steps of 4 (an effective batch size of 8), and FP16 mixed precision.
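Under these hyperparameters, the adapter setup would look roughly like the following (a sketch assuming the Hugging Face peft library; the trainer wiring is omitted and the variable names are mine):

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters as described above; target_modules "q" and "v"
# match T5's SelfAttention.q / SelfAttention.v projections.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q", "v"],
)
# The base T5 model would then be wrapped before passing it to Seq2SeqTrainer:
# model = get_peft_model(base_model, lora_config)
```

Because lora_alpha equals r, the adapter scaling factor (alpha / r) is 1, so the LoRA updates are applied at full strength.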

EVALUATION

This model was evaluated on three benchmarks from the lm_eval harness:

- scrolls_govreport: summarization of government reports
- scrolls_qasper: scientific question-answer pairs
- scrolls_summscreenfd: dialogue summarization

These benchmarks are useful for our task because they all involve processing long, structured, technical documents, much like geoscientific literature. We compare the fine-tuned T5 LoRA model against the base model:

Dataset               Metric    Base T5-Large   LoRA Fine-Tuned
scrolls_govreport     ROUGE-1   0.2848          0.2848
                      ROUGE-2   0.0000          0.0000
                      ROUGE-L   0.2848          0.2848
scrolls_qasper        F1        11.0256         11.0256
scrolls_summscreenfd  ROUGE-1   0.0000          0.0000
                      ROUGE-2   0.0000          0.0000
                      ROUGE-L   0.0000          0.0000
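For reference, ROUGE-1 F1 scores the unigram overlap between a candidate summary and a reference. A minimal sketch (the harness uses its own, more careful implementation with stemming and tokenization):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the model summarizes papers",
                  "the model summarizes geoscience papers")
print(round(score, 4))  # 0.8889
```

A score of 0.0000, as on scrolls_summscreenfd, means the generated summaries shared no scored unigrams with the references at all.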

These results show that the LoRA adaptation did not alter performance at all. These modest results are likely a consequence of the constraints discussed in the limitations section at the bottom of this repository. It is important to note that two additional models that would be beneficial to compare against are meta-llama/Llama-3.2-1B and facebook/bart-large-cnn.

USAGE AND INTENDED USES

The usage for this model is as follows:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

## Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("isabellafpaolucci/geosum")
model = AutoModelForSeq2SeqLM.from_pretrained("isabellafpaolucci/geosum")

## Example usage
input_text = (
    "Summarize the following geoscience article:\n\n"
    "[Insert user geoscientific text]"
)
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True)
summary_ids = model.generate(**inputs, max_length=150, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

The intended use case for this model is long-form summarization of geoscientific literature. It is designed to support researchers, students, academic personnel, and professionals by condensing complex technical papers and exhaustive reports into concise summaries. This would enable users to grasp the general ideas and key concepts of their input more efficiently.

PROMPT FORMAT

The prompt format for this model is a simple written request as well as input literature.

Summarize the following geoscience article:

[Insert your geoscientific text here]
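This format can be produced with a trivial helper (the function name is mine, not part of the repository):

```python
def build_prompt(article_text: str) -> str:
    """Prepend the summarization instruction to the input article."""
    return "Summarize the following geoscience article:\n\n" + article_text

prompt = build_prompt("Plate tectonics describes the motion of lithospheric plates ...")
print(prompt.splitlines()[0])  # Summarize the following geoscience article:
```

The resulting string is what gets passed to the tokenizer in the usage example above.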

EXPECTED OUTPUT

The expected output for this model is a concise and comprehensive summary, typically around a paragraph long, summarizing the input literature.

[output paragraph]

LIMITATIONS

Fine-tuning the google-t5/t5-large model with LoRA offers computational efficiency; however, due to the short training time, small batch size, significantly reduced data size, and lack of geoscience-specific data, the fine-tuning made no meaningful contribution. The model's outputs show poor generalization and formatting, and the model underfits the data. These limitations exemplify some of the challenges involved in developing a domain-specific long-form text summarization model with limited computational resources.
