INTRODUCTION
The field of geoscience is critical yet underrepresented, both in general and in machine learning research.
Geoscientific information is crucial for advancing our understanding of Earth's dynamic systems, including energy
resources, hydrology, mineral exploration, natural hazards, atmospheric processes, and more. To make
geoscientific information more accessible, I present a large language model (LLM) tailored for long-form
text summarization in the geoscience domain. I fine-tune the google-t5/t5-large model using Low-Rank
Adaptation (LoRA) on the ArXiv Summarization dataset, which comprises technical academic articles.
Performance was evaluated on three long-document summarization benchmarks from the lm_eval harness.
While the results were modest, they illustrate the model's potential to condense geoscientific
literature into useful summaries.
TRAINING DATA
The training data used for this model is the ArXiv Summarization dataset, available on HuggingFace. The dataset includes full-text scientific articles paired with their corresponding abstracts. Though this dataset does not focus exclusively on geoscience papers, it includes papers from fields such as physics, mathematics, biology, and chemistry, all of which are relevant to geoscience. The dataset was split 80/20 into training and validation sets using a fixed random seed of 42, as sketched below.
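As a minimal sketch, the split described above can be reproduced with the datasets library; the dataset id ccdv/arxiv-summarization is an assumption about which HuggingFace copy was used:

```python
# Sketch of the data preparation described above.
# Assumption: the HuggingFace dataset id is "ccdv/arxiv-summarization".
from datasets import load_dataset

dataset = load_dataset("ccdv/arxiv-summarization", split="train")

# 80/20 train/validation split with a fixed random seed of 42
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_data = splits["train"]
val_data = splits["test"]  # used as the validation set
```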
TRAINING METHOD
The training method implemented for this project was LoRA. The google-t5/t5-large
model was fine-tuned using LoRA to gain parameter efficiency without heavy computational
cost. The target modules were SelfAttention.q and SelfAttention.v, with
the LoRA configuration as follows: r set to 64, lora_alpha set to 64, and
lora_dropout set to 0.05. Training was conducted on the ArXiv dataset described
in the previous section for a single epoch using the Seq2SeqTrainer,
with a per-device batch size of 2, gradient accumulation of 4 steps,
and FP16 mixed precision.
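A minimal sketch of this setup with the peft and transformers libraries follows; the hyperparameters come from the description above, while the output directory, data collator, and the tokenized train_data/val_data splits are illustrative assumptions:

```python
# Sketch of the LoRA fine-tuning setup described above; paths and
# preprocessing details are assumptions, hyperparameters are as stated.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-large")
tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-large")

# Adapters on the query and value projections of T5's self-attention
# blocks (SelfAttention.q and SelfAttention.v)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["q", "v"],
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)

training_args = Seq2SeqTrainingArguments(
    output_dir="geosum-lora",  # illustrative path
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,  # tokenized splits from the previous section
    eval_dataset=val_data,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```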
EVALUATION
This model was evaluated on three benchmarks from the lm_eval harness: scrolls_govreport,
scrolls_qasper, and scrolls_summscreenfd (a sketch of the evaluation call follows this list):

- scrolls_govreport: focuses on the summarization of government reports
- scrolls_qasper: focuses on scientific QA pairs
- scrolls_summscreenfd: focuses on dialogue summarization
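As a sketch (assuming a recent version of lm-evaluation-harness; argument names can vary across versions), the tasks can be invoked through the harness's Python API:

```python
# Sketch of running the three SCROLLS tasks via lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=isabellafpaolucci/geosum",
    tasks=["scrolls_govreport", "scrolls_qasper", "scrolls_summscreenfd"],
)
print(results["results"])
```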
These benchmarks are useful for our task because they involve processing long, structured, and technical documents, much like geoscientific literature. We compare the fine-tuned T5 LoRA model against the base model:
| Dataset | Metric | Base T5-Large | LoRA Fine-Tuned |
|---|---|---|---|
| scrolls_govreport | ROUGE-1 | 0.2848 | 0.2848 |
| scrolls_govreport | ROUGE-2 | 0.0000 | 0.0000 |
| scrolls_govreport | ROUGE-L | 0.2848 | 0.2848 |
| scrolls_qasper | F1 | 11.0256 | 11.0256 |
| scrolls_summscreenfd | ROUGE-1 | 0.0000 | 0.0000 |
| scrolls_summscreenfd | ROUGE-2 | 0.0000 | 0.0000 |
| scrolls_summscreenfd | ROUGE-L | 0.0000 | 0.0000 |
These results show that the LoRA adaptation did not alter performance at all.
These modest results are likely a consequence of the factors discussed in the
Limitations section below. It is important to note that two additional models
that would be beneficial to compare against are meta-llama/Llama-3.2-1B
and facebook/bart-large-cnn.
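As an illustrative sketch only (not an evaluation that was run), a qualitative baseline summary from facebook/bart-large-cnn could be generated with the transformers summarization pipeline; Llama-3.2-1B, being a causal LM, would require a different setup:

```python
# Illustrative baseline for qualitative comparison; not part of the
# reported results above. The input string is a placeholder.
from transformers import pipeline

baseline = pipeline("summarization", model="facebook/bart-large-cnn")
summary = baseline("[Insert geoscientific text]", max_length=150, num_beams=4)
print(summary[0]["summary_text"])
```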
USAGE AND INTENDED USES
The usage for this model is as follows:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("isabellafpaolucci/geosum")
model = AutoModelForSeq2SeqLM.from_pretrained("isabellafpaolucci/geosum")

# Example usage
input_text = (
    "Summarize the following geoscience article:\n\n"
    "[Insert user geoscientific text]"
)
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True)
summary_ids = model.generate(**inputs, max_length=150, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
The intended use case for this model is long-form summarization of geoscientific literature. It is designed to support researchers, students, academic personnel, and professionals by condensing complex technical papers and lengthy reports into concise summaries, enabling users to grasp the general ideas and key concepts of their input more efficiently.
PROMPT FORMAT
The prompt format for this model is a simple written instruction followed by the input literature:

```
Summarize the following geoscience article:

[Insert your geoscientific text here]
```
EXPECTED OUTPUT
The expected output for this model is a concise yet comprehensive summary, typically around a paragraph long, covering the key points of the input literature:

```
[output paragraph]
```
LIMITATIONS
Fine-tuning the google-t5/t5-large model offers computational efficiency;
however, due to the shortened training time, smaller batch sizes, significantly
reduced data size, and lack of geoscience-specific data, the training produced
no meaningful improvement. The resulting model generalizes poorly, formats its
outputs poorly, and underfits the data. These limitations exemplify
some of the challenges involved in developing a long-form text summarization
model tailored to a specific domain with limited computational resources.