
Share training code between backends#626

Open
Kovbo wants to merge 5 commits into main from feat/shared-training-code

Conversation


@Kovbo Kovbo commented Mar 21, 2026

We have a lot of similar training code implemented differently across the local and serverless backends. This makes it hard to maintain.
We should keep the training code in ART and import it into the serverless backend.
This PR implements that change.

    Mental Model

    There are 3 layers:

    • ART backend entrypoints: turn user-facing train() / train_sft() calls into a concrete training request.
    • Job handoff/orchestration: move that request into a worker process and stream metrics back.
    • Unsloth/Megatron worker execution: actually run forward/backward/optimizer/save loops.

    The refactor mostly changed layer 3, and a small part of layer 1.
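    The three layers can be pictured as plain functions handing a request downward and streaming metrics back up. This is an illustrative sketch only; the request fields and function names here are hypothetical, not ART's actual API:

```python
from dataclasses import dataclass, field

# Layer 1 artifact: a concrete training request built from a user-facing
# train() call. Field names are hypothetical.
@dataclass
class TrainRequest:
    model: str
    learning_rate: float
    num_steps: int
    metrics: list = field(default_factory=list)

def backend_train(model: str, lr: float, steps: int) -> list:
    """Layer 1: turn a user-facing train() call into a concrete request."""
    request = TrainRequest(model=model, learning_rate=lr, num_steps=steps)
    return orchestrate(request)

def orchestrate(request: TrainRequest) -> list:
    """Layer 2: hand the request to a worker and stream metrics back."""
    for step_metrics in worker_execute(request):
        request.metrics.append(step_metrics)
    return request.metrics

def worker_execute(request: TrainRequest):
    """Layer 3: run the forward/backward/optimizer/save loop (stubbed)."""
    for step in range(request.num_steps):
        yield {"step": step, "loss": 1.0 / (step + 1)}
```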

    Before, the same Megatron execution logic existed in two places:

    • ART’s local Megatron worker had its own inline runtime setup and RL loop in what is now .venv/lib/python3.12/site-packages/art/megatron/train.py:1.
    • The serverless Megatron worker had its own inline runtime setup plus RL and SFT loops in what is now megatron/train.py:1.

    Those two scripts both did the same kinds of work:

    • set CUDA / Triton / inductor env
    • build the Megatron provider/model/optimizer
    • patch GPT preprocess
    • load LoRA + optimizer state
    • run the RL step loop
    • save LoRA + optimizer shards
    • write metrics to a log file
    • clean up the job directory

    The serverless worker also had its own separate SFT loop.

    There was also smaller duplication at the ART backend layer:

    • .venv/lib/python3.12/site-packages/art/local/backend.py:501
    • .venv/lib/python3.12/site-packages/art/serverless/backend.py:193

    Both of those built RL config objects and aggregated training metrics separately.

    So before, changing RL training logic meant touching multiple files.

    How It Works Now

    Now the shared Megatron execution lives in one ART module:

    • .venv/lib/python3.12/site-packages/art/megatron/shared.py:58
    • .venv/lib/python3.12/site-packages/art/megatron/shared.py:113
    • .venv/lib/python3.12/site-packages/art/megatron/shared.py:230

    That module owns the actual training process:

    • model/optimizer context creation
    • LoRA + optimizer load/save
    • RL loop
    • SFT loop
    • per-rank gradient reduction
    • metrics logging
    • job completion cleanup

    The wrappers are now thin:

    • ART local Megatron wrapper: .venv/lib/python3.12/site-packages/art/megatron/train.py:1
    • Repo serverless worker wrapper: megatron/train.py:1

    The important design choice is that offload/reload stayed outside the shared runner:

    • Local ART wrapper does reload_to_gpu(...), calls run_megatron_rl_job(...), then offload_to_cpu(...) in finally at .venv/lib/python3.12/site-packages/art/megatron/train.py:41.
    • Serverless wrapper just calls the shared ART runner directly at megatron/train.py:42.

    So the shared code is “training logic only”, and the local-only memory management stays local-only.
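    The local wrapper's reload/run/offload-in-finally shape is the key to that separation. A minimal sketch (the memory-management callables are passed in here purely for illustration):

```python
def local_wrapper(job_dir, reload_to_gpu, run_job, offload_to_cpu):
    """Sketch of the local-only memory management wrapper: reload
    weights onto the GPU, run the shared job, and always offload
    afterwards, even if training raises."""
    reload_to_gpu()
    try:
        return run_job(job_dir)
    finally:
        # runs on success and on error, so GPU memory is always released
        offload_to_cpu()
```

    Because offload/reload live in the wrapper and not the shared runner, the serverless worker can call the same runner without ever paying for (or knowing about) local GPU memory juggling.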

    At the backend layer, the shared RL config/metrics glue now lives in .venv/lib/python3.12/site-packages/art/_backend_training.py:15.
    Both .venv/lib/python3.12/site-packages/art/local/backend.py:613 and .venv/lib/python3.12/site-packages/art/serverless/backend.py:260 use it.

    Current End-to-End Flows

    Local RL with ART + Megatron:

    • LocalBackend.train() builds config and delegates to _train_model() in .venv/lib/python3.12/site-packages/art/local/backend.py:501.
    • _train_model() packs trajectory groups to disk and calls the service in .venv/lib/python3.12/site-packages/art/local/backend.py:670.
    • MegatronService.train() pauses vLLM, writes a job JSON into /tmp/megatron_training_jobs, and tails the shared log file at .venv/lib/python3.12/site-packages/art/megatron/service.py:209.
    • The local Megatron worker .venv/lib/python3.12/site-packages/art/megatron/train.py:28 polls that directory, reloads GPU state, and calls .venv/lib/python3.12/site-packages/art/megatron/shared.py:113.
    • After completion, MegatronService merges shards, creates the next checkpoint, wakes vLLM, and registers the new LoRA alias at .venv/lib/python3.12/site-packages/art/megatron/service.py:271.
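    The file-based handoff at the heart of this flow — the service writes a job JSON into a directory, the worker polls it — can be sketched as below. Directory layout and field names are simplified assumptions, not the real protocol:

```python
import json
import time
from pathlib import Path

def submit_job(jobs_dir: Path, job_id: str, spec: dict) -> Path:
    """Service side: write a job JSON for the worker to pick up.
    Write to a temp name first, then rename, so the worker never
    observes a partially written file."""
    tmp = jobs_dir / f".{job_id}.tmp"
    tmp.write_text(json.dumps(spec))
    final = jobs_dir / f"{job_id}.json"
    tmp.rename(final)  # atomic on POSIX filesystems
    return final

def poll_for_job(jobs_dir: Path, timeout: float = 5.0, interval: float = 0.05):
    """Worker side: poll the directory until a job file appears,
    claim it by deleting it, and return the parsed spec."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        jobs = sorted(jobs_dir.glob("*.json"))
        if jobs:
            spec = json.loads(jobs[0].read_text())
            jobs[0].unlink()  # claim the job
            return spec
        time.sleep(interval)
    return None
```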

    Serverless RL:

    • ServerlessBackend.train() builds config and submits a training job through the API at .venv/lib/python3.12/site-packages/art/serverless/backend.py:193.
    • _train_model() calls client.training_jobs.create(...) and polls event streams at .venv/lib/python3.12/site-packages/art/serverless/backend.py:319 using the client defined at .venv/lib/python3.12/site-packages/art/serverless/client.py:183.
    • On the server side, the repo’s MegatronTrainer.train() writes a worker job file and tails the per-job log at trainers/megatron_trainer.py:97.
    • The repo worker megatron/train.py:26 polls the same job directory and dispatches to .venv/lib/python3.12/site-packages/art/megatron/shared.py:113.
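    The client-side event-stream polling reduces to "consume events, collect metrics, stop on a terminal status". A minimal sketch with invented event shapes (the real client's events will differ):

```python
def wait_for_training_job(events):
    """Consume a stream of training-job events, collect any metrics
    payloads, and stop when a terminal status arrives. Event shapes
    here are illustrative assumptions."""
    metrics = []
    for event in events:
        if event.get("type") == "metrics":
            metrics.append(event["data"])
        if event.get("status") in {"completed", "failed"}:
            return event["status"], metrics
    raise RuntimeError("event stream ended without a terminal status")
```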

    Serverless SFT:

    • The server workflow downloads the artifact and tokenizes it into SFTBatch objects in app/temporal/workflows/training_workflows.py:825.

    • The repo’s MegatronTrainer.train_sft() writes the tokenized batches to disk and writes an SFT job file at trainers/megatron_trainer.py:149.

    • The repo worker megatron/train.py:42 dispatches to .venv/lib/python3.12/site-packages/art/megatron/shared.py:230.

    Across both backends, RL config building and metric aggregation now live in one shared place.
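    The SFT handoff follows the same write-data-then-write-job-file pattern: batches are persisted to disk first, and the job file is written last so the worker only ever sees a job whose data is complete. A hedged sketch with invented file names and batch fields:

```python
import json
from pathlib import Path

def write_sft_job(job_dir: Path, batches: list[dict]) -> None:
    """Trainer side (sketch): persist tokenized batches, then write
    the job file last so the worker never picks up a half-written job."""
    batches_path = job_dir / "batches.jsonl"
    with batches_path.open("w") as f:
        for batch in batches:
            f.write(json.dumps(batch) + "\n")
    (job_dir / "sft_job.json").write_text(
        json.dumps({"kind": "sft", "batches": str(batches_path)})
    )

def read_sft_job(job_dir: Path):
    """Worker side (sketch): load the job file and the batches it
    points to."""
    spec = json.loads((job_dir / "sft_job.json").read_text())
    batches = [
        json.loads(line)
        for line in Path(spec["batches"]).read_text().splitlines()
    ]
    return spec, batches
```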

@Kovbo Kovbo force-pushed the feat/shared-training-code branch from c1abe2e to a1b8efc Compare March 21, 2026 02:03
@Kovbo Kovbo requested a review from bradhilton March 24, 2026 20:32
@Kovbo Kovbo marked this pull request as ready for review March 24, 2026 20:59