Fine-tuning Flan-T5 to generate concise "approach" descriptions from scientific papers. Given a paper's Abstract + Introduction + Conclusion (AIC) text, the model produces a short, human-readable summary of the research approach.
CMPSC 497 Final Project — Manay Lodha
Automatic generation of concise approach descriptions can greatly accelerate literature reviews and proposal writing. This project explores how well a medium-sized open-source LLM can learn this extreme summarization task from a domain-specific dataset.
We use the SciTLDR dataset (5.4K TLDR summaries across 3.2K papers), specifically the AIC configuration. The expert-derived TLDR serves as the target summary.
Construction steps:
- Load and concatenate `source` sentences into a single prompt; concatenate `target` sentences into a single summary.
- Filter to pairs where the prompt has 20–300 words and the target has 10–100 words.
- Split into train / validation / test sets.
```python
from datasets import load_dataset

raw = load_dataset("allenai/scitldr", "AIC", split="train+validation+test")
```

| Split | Examples |
|---|---|
| Train | ~4,320 |
| Validation | ~540 |
| Test | ~540 |
| Total | ~5,400 |
All examples are saved as JSONL under data/.
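A minimal sketch of what `prepare_data.py` might do under the steps above. The `prompt`/`summary` field names, the 80/10/10 re-split proportions, and the seed are assumptions, not values taken from the project.

```python
from datasets import load_dataset

# Pool all official SciTLDR splits before re-splitting, as described above.
raw = load_dataset("allenai/scitldr", "AIC", split="train+validation+test")

def build_pair(ex):
    # SciTLDR stores source and target as lists of sentences; join them.
    return {"prompt": " ".join(ex["source"]), "summary": " ".join(ex["target"])}

def keep(ex):
    # Keep 20-300 word prompts and 10-100 word targets, per the filter above.
    return (20 <= len(ex["prompt"].split()) <= 300
            and 10 <= len(ex["summary"].split()) <= 100)

pairs = raw.map(build_pair).filter(keep)

# Re-split roughly 80/10/10; exact proportions and seed are assumptions.
tmp = pairs.train_test_split(test_size=0.2, seed=42)
heldout = tmp["test"].train_test_split(test_size=0.5, seed=42)
tmp["train"].to_json("data/train.jsonl")
heldout["train"].to_json("data/validation.jsonl")
heldout["test"].to_json("data/test.jsonl")
```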
- Tokenizer: `AutoTokenizer.from_pretrained("google/flan-t5-base")`
- Max prompt length: 512 tokens
- Max target length: 128 tokens
- Base model: `google/flan-t5-base` (250M parameters)
- Framework: HuggingFace Transformers `Trainer`, single GPU
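A preprocessing sketch under these settings. The `data/*.jsonl` paths and the `prompt`/`summary` column names follow the preparation sketch above and are assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Assumed file layout under data/, matching the preparation sketch above.
dataset = load_dataset(
    "json",
    data_files={"train": "data/train.jsonl",
                "validation": "data/validation.jsonl",
                "test": "data/test.jsonl"},
)

def preprocess(batch):
    # Truncate AIC prompts and TLDR targets to the max lengths listed above.
    model_inputs = tokenizer(batch["prompt"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)
```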
| Hyperparameter | Value |
|---|---|
| Batch size | 4 |
| Learning rate | 5 × 10⁻⁵ |
| Epochs | 3 |
| FP16 | True |
| Eval & save strategy | per-epoch |
| Metric | Score |
|---|---|
| ROUGE-1 | 0.1841 |
| ROUGE-2 | 0.0753 |
| ROUGE-L | 0.1470 |
| ROUGE-Lsum | 0.1489 |
| Avg PPL | 1,777,852.93 |
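The average perplexity is presumably exp of the mean token-level cross-entropy the model assigns to the reference summaries. A sketch of how the ROUGE scores might be computed with the `rouge-score` package installed below; `predictions` and `references` are assumed to be parallel lists of generated and gold TLDRs.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)

# predictions/references: parallel lists of strings (assumed to exist).
scores = [scorer.score(ref, pred) for ref, pred in zip(references, predictions)]
for key in ("rouge1", "rouge2", "rougeL", "rougeLsum"):
    mean_f1 = sum(s[key].fmeasure for s in scores) / len(scores)
    print(f"{key}: {mean_f1:.4f}")
```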
Prompt (truncated):
We introduce a new procedural dynamic system that can generate a variety of shapes that often appear as curves…
Generated:
We introduce a new procedural dynamic system that can generate a variety of shapes that often appear as curves… We introduce a new procedural dynamic system…
Reference:
A new, very simple dynamic system is introduced that generates pretty patterns; properties are proved and possibilities are explored.
The model exhibits near-verbatim copying of the input prompt rather than producing concise, abstractive summaries.
- Copying behavior: The model frequently echoes the input, leading to poor abstractive summarization.
- High perplexity: The fine-tuned model assigns very low probability to reference summaries, suggesting over-reliance on the input distribution.
- Possible causes:
- Insufficient training epochs / dataset size for generalization.
- Learning rate too high, causing early convergence to copying.
- Lack of explicit instruction prompting (raw AIC text fed directly).
- Prompt engineering: prepend explicit instructions (e.g., "Summarize the methods in one sentence:") to guide abstraction; see the inference sketch after this list.
- Longer training / larger model: experiment with `flan-t5-large` or more epochs at a lower learning rate.
- Regularization: apply label smoothing or dropout to mitigate copying.
- Data augmentation: incorporate additional summarization resources (e.g., abstract-to-TLDR pairs).
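A sketch of instruction-prefixed inference that also constrains decoding to discourage the verbatim repetition shown above; the instruction string, checkpoint path, and generation settings are illustrative, not the project's.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("checkpoints")  # hypothetical fine-tuned dir

aic_text = "..."  # Abstract + Introduction + Conclusion of a paper
prompt = "Summarize the research approach in one sentence: " + aic_text

inputs = tokenizer(prompt, max_length=512, truncation=True, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=4,
    no_repeat_ngram_size=3,  # blocks the repeated-span copying seen above
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```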
```bash
# Install dependencies
pip install transformers datasets rouge-score

# Prepare the dataset
python prepare_data.py

# Fine-tune
python train.py

# Evaluate
python evaluate.py
```

This project is for academic purposes (CMPSC 497).