I fine-tune and red-team LLMs for safety and reliability. Recent: DPO pipeline (+3 pp paraphrase accuracy, +7 pp human preference), adversarial red-teaming (refusal rate 6% → 89%), METEOR reproducibility audit (10-point gap found). Currently: shutdown-resistance evaluations at Algoverse. Published on arXiv, 1st place Apart Research hackathon (35+ teams).
Pinned
-
ai-incident-forecasting (Public)
🥇 1st place, Apart Research AI Forecasting Hackathon (35+ teams). Poisson GAM + multinomial share model forecasting AI incidents with calibrated 90% prediction intervals and rolling backtests. Pred…
Jupyter Notebook 4
-
dpo-rlhf-paraphrase-types (Public)
End-to-end DPO fine-tuning pipeline for paraphrase-type generation (M.Sc. thesis, arXiv:2506.02018). DPO on 1,040 human-ranked pairs raised type accuracy +3 pp and human preference +7 pp over SFT b…
Jupyter Notebook
-
adverserial-paraphrasing (Public)
Red-teaming harness for open-weight LLMs (LLaMA, Mistral, Pythia). LoRA-SFT on 580 examples raised refusal rate from ~6% to 89% and cut harmful replies to 8%. Includes adversarial prompt dataset, S…
Jupyter Notebook 1
-
Reproducibility-METEOR-NLP (Public)
Reproducibility audit of METEOR across 73k ACL papers. Compared 8 Python/Java implementations; found up to 10-point score divergence from undocumented parameter choices. Includes Docker test harnes…
Jupyter Notebook 2
-
gmf-annotation-platform (Public)
MVP pipeline for LLM-assisted annotation of AI incidents from the AIID. Imports AIID backup data, runs GPT structured-output classification against GMF taxonomy categories, and compares predictions…
Python
-
NLP_DeepLearning_Spring2023 (Public)
Multitask BERT fine-tuning for sentiment analysis, paraphrase detection, and semantic similarity (SST, QQP, STS-B). Explores SMART regularization, Sophia optimizer, and gradient surgery. ~5% robust…
Python

