A curated collection of tools, frameworks, and resources for AI-driven automated model training — letting AI agents autonomously run experiments, fine-tune models, optimize hyperparameters, and evolve themselves.
Inspired by Karpathy's AutoResearch, HuggingFace Skills, and the broader AutoML movement.
The paradigm is shifting: instead of manually tuning models, we now have tools that let AI agents design experiments, modify training code, evaluate results, and iterate autonomously — while you sleep.
This repository collects the best open-source tools and frameworks that make this possible across the full training lifecycle.
- Autonomous Experiment / Research Frameworks
- Agent-Driven Training Skills (HuggingFace Ecosystem)
- LLM Fine-Tuning Frameworks
- RL Alignment Training Frameworks (RLHF / GRPO)
- Automated Hyperparameter Optimization / AutoML
- Self-Evolving / Self-Play Training
- Lightweight Pretraining Frameworks
- Experiment Tracking & Orchestration
- Benchmarks & Evaluation
- Coding Agents (for Training Script Development)
- Recommended Stacks
Core idea: AI agents autonomously design experiments, modify training code, evaluate results, and iterate. You sleep, AI experiments.
| Project | Description | Key Highlight |
|---|---|---|
| AutoResearch | AI agent runs autonomous ML experiments in a loop | 630 lines of Python, ~100 experiments overnight, 11% efficiency gain on GPT-2 training |
| AI Scientist v2 | Fully automated scientific discovery with agentic tree search | Hypothesis → Experiment → Paper, no human templates needed |
| auto-ml-agent | LLM-orchestrated autonomous ML pipeline | End-to-end: data preprocessing → model deployment, multi-agent architecture |
| MLAgentBench | Benchmark for evaluating AI agents on ML experimentation | 13 end-to-end ML tasks from CIFAR-10 to BabyLM |
| AutoAgent | Zero-code LLM agent framework with self-play customization | Create agents via natural language, iterative self-improvement |
| ShinkaEvolve | LLM-as-mutation-operator program evolution framework | Evolves programs for scientific discovery |
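The loop these frameworks share (propose a config, run it, score it, mutate the best so far) fits in a few lines of plain Python. The sketch below is a toy: `run_experiment` is a hypothetical stand-in for launching a real training run, and the greedy hill-climb is a deliberately simple baseline where a real agent would reason over the full history.

```python
import math
import random

def run_experiment(config, rng):
    """Hypothetical stand-in for a training run: returns a validation loss.
    Here, a noisy quadratic with its optimum at lr = 1e-3."""
    return (math.log10(config["lr"]) + 3) ** 2 + rng.uniform(0, 0.05)

def mutate(config, rng):
    """Perturb the best config so the agent explores nearby settings."""
    return {"lr": config["lr"] * 10 ** rng.uniform(-0.5, 0.5)}

rng = random.Random(0)
best = {"lr": 1e-1}
best_loss = run_experiment(best, rng)
history = [(best, best_loss)]

for _ in range(100):                      # "~100 experiments overnight"
    candidate = mutate(best, rng)
    loss = run_experiment(candidate, rng)
    history.append((candidate, loss))
    if loss < best_loss:                  # greedy acceptance; real agents
        best, best_loss = candidate, loss # also analyze failed runs
```

Frameworks in the table above differ mainly in what replaces each piece: an LLM instead of random mutation, real training jobs instead of the toy objective, and richer memory than a single best-so-far.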
"Vibe Training" — use natural language to drive the full model training lifecycle through coding agents.
| Project | Description | Key Highlight |
|---|---|---|
| HuggingFace Skills | Standardized ML skill packages for coding agents | 12 skills: model training (SFT/DPO/GRPO), vision training, experiment tracking, evaluation, dataset management |
| HuggingFace AutoTrain | No-code training platform | Upload data → auto model selection → training → evaluation → Hub publishing |
HF Skills covers:
- `hugging-face-model-trainer` — Fine-tune LLMs with TRL (SFT, DPO, GRPO), 0.5B to 70B parameters
- `hugging-face-vision-trainer` — Train object detection & image classification (RTDETRv2, YOLOS, ViT)
- `hugging-face-jobs` — Run compute jobs on HF infrastructure with cost estimation
- `hugging-face-trackio` — ML experiment tracking with real-time metrics
- `hugging-face-evaluation` — Model evaluation with lighteval
- `hugging-face-datasets` — Dataset creation and management
- Compatible with: Claude Code, OpenAI Codex, Google Gemini CLI, Cursor
The training engines. Upper-level agents (AutoResearch, HF Skills) ultimately call these frameworks to execute training.
| Project | Description | Key Highlight |
|---|---|---|
| Unsloth | Ultra-efficient LLM fine-tuning & RL | 2x faster, 70% less VRAM; custom CUDA kernels; MoE 12x faster; MCP Server available |
| Axolotl | Flexible, production-ready fine-tuning | YAML-driven; v0.8.x: QAT, sequence parallelism, GRPO, full RLHF pipeline |
| LlamaFactory | Unified fine-tuning with Web UI | LlamaBoard browser UI; 100+ models; SFT/RLHF/DPO/PPO |
| TRL | HuggingFace's RL training library | SFT, DPO, GRPO, PPO, KTO, ORPO; deep Transformers/PEFT integration |
| torchtune | PyTorch-native fine-tuning | No extra abstractions; multi-node support (Feb 2025) |
| NeMo AutoModel | NVIDIA's DTensor-native training library | Day-0 HuggingFace support; single-to-multi-node scaling |
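As a concrete taste of the YAML-driven style, a minimal Axolotl-like QLoRA fine-tuning config might look as follows. Exact keys vary by Axolotl version, and the model/dataset ids are illustrative placeholders; treat this as a sketch, not a copy-paste config.

```yaml
base_model: meta-llama/Llama-3.1-8B     # any Hugging Face model id
datasets:
  - path: tatsu-lab/alpaca              # example instruction dataset
    type: alpaca
adapter: qlora                          # 4-bit LoRA fine-tuning
lora_r: 16
lora_alpha: 32
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2e-4
output_dir: ./outputs/llama3-alpaca-qlora
```

This declarative shape is what makes these frameworks agent-friendly: an LLM can propose, diff, and rerun a config file far more reliably than it can edit an imperative training script.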
2025-2026 trend: GRPO (Group Relative Policy Optimization) is replacing PPO as the default alignment method — no critic model needed, simpler and more stable.
| Project | Description | Key Highlight |
|---|---|---|
| OpenRLHF | High-performance RLHF framework on Ray + vLLM | 70B+ full tuning; PPO/DAPO/REINFORCE++; async agent RLHF; MARTI fork for multi-agent RL |
| rLLM | Post-training RL framework for language agents | Custom agents + environments → RL training → deployment; rLLM-FinQA-4B beats Qwen3-235B |
| LlamaGym | Online RL fine-tuning for LLM agents | Define agent → create LLM → write RL loop |
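The "no critic model" part of GRPO is easy to see in code: instead of a learned value function, each sampled completion's reward is normalized against the other completions in its group. A minimal sketch of that advantage computation (the reward values are made up for illustration):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score each reward within its group.
    This replaces the critic/value model that PPO trains alongside the policy."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

# One prompt, a group of 4 sampled completions scored by a reward model:
advs = grpo_advantages([0.1, 0.4, 0.4, 0.9])
```

Because the baseline comes from the group itself, there is no second network to train or keep in sync, which is where the "simpler and more stable" claim comes from.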
| Project | Description | Key Highlight |
|---|---|---|
| AgentHPO | LLM-driven hyperparameter optimization | Matches/surpasses human best trials on 12 ML tasks with explainable results |
| Optuna | Industry-standard HPO framework | Bayesian search, pruning, distributed execution, visualization dashboard |
| Microsoft NNI | Full AutoML toolkit | Neural Architecture Search + HPO + model compression + feature engineering |
| W&B Sweeps | Automated hyperparameter search + tracking | Bayesian/Grid/Random search; Hyperband early stopping; cross-machine parallelism |
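To make the mechanics concrete, here is what these frameworks automate, reduced to bare random search in plain Python. The objective function is a hypothetical stand-in for "train a model, return validation loss"; tools like Optuna add smarter samplers (TPE/Bayesian), pruning of bad trials, and distributed execution on top of this skeleton.

```python
import math
import random

def objective(lr, batch_size):
    """Hypothetical validation-loss surface; a real objective would
    launch a training run and return its validation metric."""
    return (math.log10(lr) + 3) ** 2 + 0.01 * abs(batch_size - 64)

rng = random.Random(42)
trials = []
for _ in range(50):
    lr = 10 ** rng.uniform(-5, -1)          # log-uniform sample over [1e-5, 1e-1]
    batch_size = rng.choice([16, 32, 64, 128])
    trials.append(((lr, batch_size), objective(lr, batch_size)))

(best_lr, best_bs), best_loss = min(trials, key=lambda t: t[1])
```

LLM-driven HPO tools like AgentHPO keep the same trial loop but let a language model choose the next configuration and explain why.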
Core idea: Models generate their own training data to train themselves, reducing dependence on human annotations.
| Project | Description | Key Highlight |
|---|---|---|
| SPIN | Self-Play Fine-Tuning | Model plays against its previous iterations; outperforms DPO + GPT-4 preference data without extra annotations |
| SPPO | Self-Play Preference Optimization | Iterative policy updates approximating Nash equilibrium with convergence guarantees |
| Multi-Agent Evolve | One LLM plays Proposer + Solver + Judge roles | Verified improvements on math, coding, reasoning with Qwen2.5-3B |
| Multiagent Finetuning | Multi-agent society from same base model | Multi-agent iteration keeps improving where single-model self-training plateaus |
| CORY | Cooperative multi-agent RL fine-tuning | Pioneer + Observer dual-agent paradigm (NeurIPS 2024) |
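The shared pattern across this section is an outer loop where the model trains against data produced by its own previous iteration. The toy sketch below shows only that loop structure: every function is a hypothetical stand-in, and a single number stands in for model quality (1.0 representing the quality of the human reference data).

```python
import random
from statistics import mean

def generate_from(model_quality, rng, n=32):
    """Stand-in for sampling n responses from the previous model iteration."""
    return [model_quality + rng.uniform(-0.1, 0.1) for _ in range(n)]

def spin_update(model_quality, opponent_samples):
    """Stand-in for one SPIN-style round: train the current model to prefer
    reference (human) data over its predecessor's samples."""
    gap = 1.0 - mean(opponent_samples)
    return model_quality + 0.5 * gap     # close part of the remaining gap

rng = random.Random(0)
quality, trajectory = 0.2, [0.2]
for _ in range(5):                       # SPIN runs a handful of iterations
    samples = generate_from(quality, rng)   # opponent = previous iteration
    quality = spin_update(quality, samples)
    trajectory.append(quality)
```

The multi-agent variants in the table (Multi-Agent Evolve, Multiagent Finetuning, CORY) change who the opponent is, which is what keeps improvement going after single-model self-play plateaus.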
Pair these with autonomous experiment frameworks — fast, small-scale training is the foundation for autonomous experimentation.
| Project | Description | Key Highlight |
|---|---|---|
| nanochat | Minimal LLM training harness (AutoResearch's engine) | Single GPU; tokenization → pretrain → finetune → eval → chat; GPT-2 for ~$48 |
| Nanotron | Minimal 3D-parallel LLM pretraining | Data + Tensor + Pipeline parallelism; scales from experiments to production |
| Project | Description | Key Highlight |
|---|---|---|
| Weights & Biases | Experiment tracking + sweeps + model registry | Industry standard; integrates with all major frameworks |
| MLflow 3.0 | Open-source experiment tracking + model serving | Self-hosted; nested experiments; model registry |
| HF Trackio | Lightweight experiment tracking in HF ecosystem | Deep integration with HF Skills; agents can read metrics and make decisions |
| Benchmark | Description | Key Highlight |
|---|---|---|
| MLE-bench | 75 Kaggle ML engineering competition tasks | Evaluates AI agents on real ML engineering: training, data prep, experiments |
| MLAgentBench | 13 end-to-end ML experimentation tasks | Stanford SNAP; Claude v3 Opus best at 37.5% |
| MLRC-Bench | ML Research Competition challenges | Tests novel methodology development |
| LiveCodeBench | Contamination-free coding benchmark | Fresh problems from LeetCode/AtCoder/Codeforces |
These agents don't train models directly, but can write and debug training code, completing the automation loop when paired with HF Skills.
| Project | Description | Key Highlight |
|---|---|---|
| Aider | Terminal AI pair programming | Git integration; supports Claude/GPT/DeepSeek/local models |
| OpenHands | AI-driven software development (open-source Devin) | Autonomous code editing + execution + debugging; MIT license |
| SWE-agent | Autonomous GitHub issue fixer | SWE-bench open-source SOTA (NeurIPS 2024) |
HuggingFace Skills + Claude Code + Unsloth + W&B
Natural language → Claude Code orchestrates → HF Skills calls Unsloth for training → W&B tracks experiments.
AutoResearch + nanochat (single GPU)
Start before bed, wake up to ~100 autonomous experiment results.
Axolotl / LlamaFactory + OpenRLHF + Optuna + MLflow
YAML-configured training + automated HPO + full experiment tracking.
- AutoResearch Paradigm: Karpathy proved "AI autonomously doing ML research" works with just 630 lines of code
- "Vibe Training": HF Skills enables natural-language-driven model training lifecycle
- GRPO > PPO: DeepSeek's GRPO is becoming the default alignment method (no critic model, simpler, more stable)
- Self-Play Breakthrough: Multi-agent self-evolution (SPIN, Multi-Agent Evolve) overcomes single-model self-training plateaus
- MCP Standardization: Model Context Protocol adopted by OpenAI/Google/Microsoft as the "USB-C for AI agents"
- Single-GPU Research: Unsloth + nanochat + AutoResearch enables individual developers to do serious LLM research
Contributions are welcome! Please open an issue or submit a PR if you know of tools that fit this collection.
Criteria for inclusion:
- Must be directly usable for automated model training workflows
- Preference for open-source projects with active maintenance
- Focus on tools that leverage AI/LLMs to automate the training process itself
This curated list is released under CC0 1.0.
Compiled March 2026. Project statuses may change — check individual GitHub repos for the latest.