134 changes: 134 additions & 0 deletions DOCUMENTATION.md
@@ -0,0 +1,134 @@
# Baby Dragon Hatchling (BDH) Guide

This guide explains what the project does, how to run it end-to-end, and how to interpret the outputs.

## What This Generates

- Training checkpoints at every 10% of the run.
- A consolidated generation log and a Markdown report comparing checkpoints.
- A loss curve plot (loss vs steps).
- A memory run log and a small loss plot for the model-memory fine-tune.

## Deliverables Summary

1. **End-to-End Training**: `train.py` supports configurable steps and automatic checkpointing at every 10% of progress.
2. **Checkpoint Comparison**: `compare_checkpoints.py` evaluates all checkpoints with a fixed prompt and creates a report + consolidated log.
3. **Memory Script**: `memory.py` demonstrates:
- **Fast Memory (Context)**: A lightweight memory store ingests facts from a prompt and answers from memory without keeping the fact in the user prompt.
- **Model Memory (Weights)**: Facts consolidated into model weights via a short fine-tune.
4. **One-Click Run**: `main.py` runs training, evaluation, and memory demos in order.
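
To make the "fast memory" idea concrete, here is a minimal sketch of a memory store that ingests a fact once and then answers from the store, so the fact never needs to remain in the user prompt. This is an illustration only; the class name, methods, and the example fact are assumptions, not the actual `memory.py` implementation.

```python
# Sketch of the fast-memory pattern: ingest a fact once, recall it later
# from the store instead of repeating it in every user prompt.
# Hypothetical API -- not the code in memory.py.

class FastMemory:
    def __init__(self):
        self.facts = {}  # topic -> fact text

    def ingest(self, topic, fact):
        """Store a fact under a topic key."""
        self.facts[topic] = fact

    def recall(self, topic):
        """Answer from memory; the fact is no longer part of the prompt."""
        return self.facts.get(topic)

mem = FastMemory()
mem.ingest("capital", "The capital of the demo country is Exampleville.")
print(mem.recall("capital"))
```

The model-memory path in `memory.py` takes the complementary approach: instead of a side store, the fact is consolidated into the weights by a short fine-tune.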

## How to Run

### Using Docker (Recommended for Reproducibility)

1. **Build the CPU image**:

```bash
docker build -t bdh-project -f Dockerfile .
```

2. **Build the GPU image** (CUDA):

```bash
docker build -t bdh-project-gpu -f Dockerfile.gpu .
```

3. **Run Training** (saves checkpoints to `outputs/`):

```bash
docker run -v $(pwd)/outputs:/app/outputs bdh-project python train.py --max_iters 1000
```

4. **Compare Checkpoints**:

```bash
docker run -v $(pwd)/outputs:/app/outputs bdh-project python compare_checkpoints.py --input_dir outputs/training/checkpoints --output_dir outputs/evaluation
```

5. **Run Memory Script** (requires training checkpoints):

```bash
docker run -v $(pwd)/outputs:/app/outputs bdh-project python memory.py
```

6. **Run All in Order**:

```bash
docker run -v $(pwd)/outputs:/app/outputs bdh-project python main.py --max_iters 1000
```

**GPU usage**:

```bash
docker run --gpus all -v $(pwd)/outputs:/app/outputs bdh-project-gpu python main.py --max_iters 1000
```

### Running Locally

Ensure you have Python and pip installed; the command below installs PyTorch and the other dependencies.

1. **Install requirements**:

```bash
pip install numpy torch requests psutil pandas matplotlib
```

2. **Train**:

```bash
python train.py --max_iters 1000 --out_dir outputs
```

- You can control training length with `--max_iters` (e.g., 100 for a quick smoke test, 3000 for full training).
- CPU training is slower, and runtime scales with the iteration count, so prefer a small `--max_iters` for quick tests.

3. **Visuals & Logs**:

- Training logs (loss vs step) are saved to `outputs/training/logs/training_log.csv`.
- Run `python compare_checkpoints.py --input_dir outputs/training/checkpoints` to see the text generation progress.
- Consolidated generations are written to `outputs/evaluation/checkpoint_generations.log`.

4. **Run Memory Script** (requires training checkpoints):

```bash
python memory.py
```

5. **Run All in Order**:

```bash
python main.py --max_iters 1000
```

## Outputs and Where to Find Them

- Training outputs: `outputs/training/`
- Checkpoints: `outputs/training/checkpoints/`
- Training log CSV: `outputs/training/logs/training_log.csv`
- Evaluation outputs: `outputs/evaluation/`
- Report: `outputs/evaluation/report.md`
- Consolidated generations log: `outputs/evaluation/checkpoint_generations.log`
- Loss plot: `outputs/evaluation/figs/loss_curve.png`
- Checkpoint samples: `outputs/evaluation/figs/checkpoint_samples.png`
- Output diversity plot: `outputs/evaluation/figs/output_quality.png`
- Multi-prompt runs: `outputs/evaluation/multi_prompt/`
- Memory outputs: `outputs/memory/`
- Memory log: `outputs/memory/memory_log.txt`
- Fine-tune loss plot: `outputs/memory/figs/memory_fine_tuning_loss.png`

## Results Interpretation

- **Training**: You will see the loss decrease over time. Around 100-200 steps, the model starts forming recognizable words. By 1000 steps, it should produce Shakespeare-like structure.
- **Checkpoint Comparison**: Early checkpoints produce noise; later checkpoints produce English-like text.
- **Fast Memory**: The memory layer recalls facts without placing them in the user prompt; for transparency, the log also records the model's response to a memory-injected prompt.
- **Model Memory**: After fine-tuning, the model recalls the fact even without any context provided at runtime.

## What the Loss Curve Means

- The loss curve (`outputs/evaluation/figs/loss_curve.png`) shows how prediction error changes over training steps.
- Lower loss means the model is better at predicting the next character/byte.
- A steady downward trend is expected; it indicates the model is learning patterns from the dataset.
- Minor bumps are normal (training is noisy), but the overall trend should go down.
- If validation loss is shown, it should track training loss but stay slightly higher.
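
The "steady downward trend" can be checked numerically from the log rather than by eye, by comparing mean loss over the first and last stretches of training. The sketch below assumes `training_log.csv` has `step` and `loss` columns, which is an assumption about the exact header names; the demo data is synthetic.

```python
# Sanity-check a loss log: mean loss over the first 10% of steps should be
# clearly higher than over the last 10%. Column names "step" and "loss"
# are assumed, not verified against training_log.csv.
import csv

def loss_trend(rows):
    """rows: iterable of (step, loss) pairs. Returns (early_mean, late_mean)."""
    losses = [loss for _, loss in sorted(rows)]
    k = max(1, len(losses) // 10)
    early = sum(losses[:k]) / k
    late = sum(losses[-k:]) / k
    return early, late

def load_log(path):
    """Load (step, loss) pairs from a CSV with 'step' and 'loss' columns."""
    with open(path, newline="") as f:
        return [(int(r["step"]), float(r["loss"])) for r in csv.DictReader(f)]

# Synthetic example: a smoothly decreasing curve standing in for real data.
demo = [(s, 4.0 * (0.999 ** s)) for s in range(0, 1000, 10)]
early, late = loss_trend(demo)
print(f"early mean {early:.3f}, late mean {late:.3f}")
```

If `early` is not noticeably larger than `late`, the run likely needs more iterations or a different learning rate.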

## Additional Visuals Included

- **Checkpoint samples**: A small panel showing early/mid/final outputs for quick comparison.
- **Output diversity**: Unique character ratio per checkpoint to show output variation over time.
- **Before vs After table**: Inline table in the report with early/mid/late outputs.
- **Multi-prompt checks**: Same checkpoints evaluated on 3 prompts for consistency.
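
Per the description above, the output-diversity plot tracks the ratio of unique characters to total characters in each checkpoint's sample. A sketch of that computation (the exact details in `compare_checkpoints.py` may differ):

```python
# Unique-character ratio: a crude diversity signal. Early checkpoints that
# emit one repeated byte score near 0; varied text scores higher.

def unique_char_ratio(text):
    """Fraction of distinct characters in a sample; higher = more varied."""
    return len(set(text)) / len(text) if text else 0.0

print(unique_char_ratio("aaaa"))  # low diversity
print(unique_char_ratio("abcd"))  # every character distinct
```

Note this measures character variety, not quality: gibberish can score high, so read it alongside the generation samples.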
19 changes: 19 additions & 0 deletions Dockerfile
@@ -0,0 +1,19 @@
# CPU-only, minimal base image for smaller size
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PIP_INDEX_URL=https://download.pytorch.org/whl/cpu
ENV PIP_EXTRA_INDEX_URL=https://pypi.org/simple

WORKDIR /app

# Copy and install deps first for better layer caching
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Copy project files
COPY . /app

# Default command: show help
CMD ["python", "main.py", "--help"]
14 changes: 14 additions & 0 deletions Dockerfile.gpu
@@ -0,0 +1,14 @@
# GPU-enabled image (CUDA runtime)
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime

WORKDIR /app

# Install dependencies
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Copy project files
COPY . /app

# Default command: show help
CMD ["python", "main.py", "--help"]
23 changes: 0 additions & 23 deletions README.md
@@ -50,29 +50,6 @@ BDH follows **Transformer-like scaling laws**, maintaining parameter efficiency

***

## Latest research update: Sudoku Benchmark

Note: The Sudoku Extreme result refers to Pathway’s internal BDH implementation, not to the current open-source repository. This repository contains the implementation of the baseline variant as described in our [public paper](https://arxiv.org/abs/2509.26507) and does not reproduce the 97.4% benchmark result out of the box. See the dedicated Extreme Sudoku research blog post for additional benchmark context and the reported results.

On Sudoku Extreme, BDH reaches 97.4% accuracy across roughly 250,000 difficult puzzles, without chain-of-thought, solution backtracking, or external tool use, while leading LLMs struggle to perform on the benchmark at all.

Language is not enough for intelligence. Transformers process information token by token with limited internal state, which makes search-heavy, non-linguistic reasoning tasks like Sudoku awkward. BDH uses a larger latent reasoning space with intrinsic memory that supports learning and adaptation during use.

We believe that the future of AI will belong to systems that can reason natively across domains, that can hold multiple possibilities in a rich latent space, and that can converge on solutions without needing to verbalize every step. BDH is our answer to that challenge. It is designed to be a universal reasoning system that can speak our language without being trapped inside it. And yes, it solves Sudoku.

Read more: [Post-transformers: Sudoku Bench](https://pathway.com/research/beyond-transformers-sudoku-bench)

### Performance Comparison

| Model | Sudoku Extreme Accuracy | Relative Cost |
|------|------------------------|--------------|
| Pathway BDH | 97.4% | 10× lower, No chain-of-thought |
| Leading LLMs (O3-mini, DeepSeek R1, Claude 3.7 8K) | ~0% | High (chain-of-thought) |

*Table 1: Performance comparison on extreme Sudoku benchmarks (~250,000 difficult puzzles).*
*Source: Pathway internal data and https://arxiv.org/pdf/2506.21734 for the Leading LLMs’ accuracy score. Pathway’s approach reflects top-1 accuracy and does not rely on chain-of-thought or solution backtracking.*


## Installation and Training

```bash