This repo provides extended environments for CoMLRL.
Install CoMLRL:

```bash
pip install comlrl
# Install PyTorch compatible with your device
```

Or via conda-forge:

```bash
conda install -c conda-forge comlrl
# Install PyTorch compatible with your device
```

- Contents: each sample includes a class skeleton, method stubs (with docstrings or `pass`), and canonical hidden tests.
- Splitting: `train/train_magrpo.py` loads explicit HF slices from `dataset.train_split` and `dataset.eval_split` (e.g., `test[:50]` and `test[50:]`).
- Subsetting: if a split name is missing (e.g., ClassEval only has `test`), the loader falls back to the first available split before slicing (see the loading sketch after this list).
- Prompting: prompts include the sanitized class skeleton plus per-agent method assignments. The default strategy assigns 1-parameter methods to agent 0 and all other methods to agent 1 (see the assignment sketch below).
- Testing: reward code merges agent completions back into the skeleton and runs the provided unit tests inside a temporary directory to isolate state.
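A minimal sketch of the split-loading-with-fallback behavior, assuming the Hugging Face `datasets` library; the helper name `load_split_with_fallback` and the example dataset path are illustrative, not taken from the repo:

```python
from datasets import load_dataset, get_dataset_split_names

def load_split_with_fallback(name: str, split_spec: str):
    """Load an explicit HF slice such as "test[:50]".

    If the base split name is missing (e.g., the dataset only ships "test"),
    fall back to the first available split and keep the slice suffix.
    """
    base, _, slice_part = split_spec.partition("[")
    available = get_dataset_split_names(name)
    if base not in available:
        base = available[0]  # fall back before slicing
    spec = base + ("[" + slice_part if slice_part else "")
    return load_dataset(name, split=spec)

# e.g., train = load_split_with_fallback("FudanSELab/ClassEval", "test[:50]")
```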
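And an illustrative version of the default assignment strategy (1-parameter methods to agent 0, everything else to agent 1); this is a sketch of the described rule, not the repo's actual implementation, and whether `self` counts toward the parameter tally is an assumption:

```python
import ast

def assign_methods(class_source: str) -> dict[int, list[str]]:
    """Split a class skeleton's methods between two agents by parameter count.

    Methods with exactly one parameter (counting `self`; an assumption)
    go to agent 0; all other methods go to agent 1.
    """
    tree = ast.parse(class_source)
    assignments: dict[int, list[str]] = {0: [], 1: []}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            agent = 0 if len(node.args.args) == 1 else 1
            assignments[agent].append(node.name)
    return assignments
```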
Key sections in `configs/magrpo_classeval_config.yaml`:

- `model`: base checkpoint (`Qwen/Qwen2.5-Coder-3B-Instruct` by default), tokenizer/model kwargs, and device mapping.
- `dataset`: dataset name and split strings (`train_split`, `eval_split`) for ClassEval sub-slices or local mirrors.
- `external`: feedback configuration (use `code_feedback` for syntax/test diagnostics).
- `magrpo`: forwarded to `comlrl.trainers.magrpo.MAGRPOTrainer`. Includes collaboration (`num_agents`, param-count assignment), sampling settings (`num_generations`, `num_turns`, temperature/top_p), rollout buffering (`rollout_buffer_size`), optimization hyperparameters, and IO controls.
- `reward_processor`: optional post-processing for rewards (scale, shift).
- `output`: persistence knobs (save final model, output paths, verbose debug prints).
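An illustrative shape for that config; all values are placeholders, and any key not named above (`name`, `save_final_model`, the `code_feedback` boolean form) is an assumption about the file's layout:

```yaml
model:
  name: Qwen/Qwen2.5-Coder-3B-Instruct
dataset:
  name: FudanSELab/ClassEval   # illustrative dataset path
  train_split: "test[:50]"
  eval_split: "test[50:]"
external:
  code_feedback: true          # syntax/test diagnostics
magrpo:
  num_agents: 2
  num_generations: 4
  num_turns: 2
  temperature: 0.8
  top_p: 0.95
  rollout_buffer_size: 8
reward_processor:
  shift: 0.0
output:
  save_final_model: true
```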
`rewards/CE_reward.py` computes structured rewards (a combined-score sketch follows this list):

- `lv1`: syntax score proportional to valid method outputs (range [0, 2]).
- `lv2`: unit-test bonus based on pass rate (passed/total), scaled to [0, 4].
- `lv3`: overlap penalty normalized by total methods (range [-1, 0]).
- Reward shift: optional post-processing shift via `reward_processor.shift`.
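A minimal sketch of how those three levels could combine into one scalar, assuming the components simply sum and that `total_methods > 0`; this is illustrative, not the repo's exact scoring code:

```python
def combined_reward(valid_methods: int, total_methods: int,
                    passed: int, total_tests: int,
                    overlapping: int, shift: float = 0.0) -> float:
    """Sum the three reward levels, then apply the optional shift."""
    lv1 = 2.0 * valid_methods / total_methods                  # syntax score in [0, 2]
    lv2 = 4.0 * passed / total_tests if total_tests else 0.0   # test bonus in [0, 4]
    lv3 = -overlapping / total_methods                         # overlap penalty in [-1, 0]
    return lv1 + lv2 + lv3 + shift                             # in [-1, 6] plus shift
```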
- Tests execute inside per-sample temporary directories to avoid polluted state and are automatically terminated on timeout.
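A sketch of that isolation pattern using only the standard library; the file names and the timeout value are assumptions:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_tests_isolated(merged_source: str, test_source: str, timeout_s: int = 30):
    """Write the merged class and its tests into a fresh temp dir, then run
    the tests in a subprocess that is killed on timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(merged_source)
        Path(tmp, "test_solution.py").write_text(test_source)
        try:
            proc = subprocess.run(
                [sys.executable, "-m", "unittest", "test_solution"],
                cwd=tmp, capture_output=True, text=True, timeout=timeout_s,
            )
            return proc.returncode == 0, proc.stdout + proc.stderr
        except subprocess.TimeoutExpired:
            return False, "timeout"
```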
- Loggers are inherited from CoMLRL. Enable Weights & Biases by filling `wandb.entity`, or disable it for offline debugging (see the snippet below).
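For example, in the config (the `enabled` switch is an assumption; the repo may use a different flag to turn logging off):

```yaml
wandb:
  entity: your-entity   # fill in to enable W&B logging
  enabled: false        # illustrative switch for offline debugging
```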
