134 changes: 134 additions & 0 deletions DOCUMENTATION.md
@@ -0,0 +1,134 @@
# Baby Dragon Hatchling (BDH) Guide

This guide explains what the project does, how to run it end-to-end, and how to interpret the outputs.

## What This Generates

- Training checkpoints at every 10% of the run.
- A consolidated generation log and a Markdown report comparing checkpoints.
- A loss curve plot (loss vs steps).
- A memory run log and a small loss plot for the model-memory fine-tune.

## Deliverables Summary

1. **End-to-End Training**: `train.py` supports configurable steps and automatic checkpointing at every 10% of progress.
2. **Checkpoint Comparison**: `compare_checkpoints.py` evaluates all checkpoints with a fixed prompt and creates a report + consolidated log.
3. **Memory Script**: `memory.py` demonstrates:
- **Fast Memory (Context)**: A lightweight memory store ingests facts from a prompt and answers from memory without keeping the fact in the user prompt.
- **Model Memory (Weights)**: Facts consolidated into model weights via a short fine-tune.
4. **One-Click Run**: `main.py` runs training, evaluation, and memory demos in order.
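
To make the "fast memory" idea concrete, here is a minimal sketch of a memory store that ingests a fact once and then answers from the store, so the fact never needs to remain in the user prompt. This is an illustration only; the class name, methods, and the example fact are assumptions, not the actual `memory.py` implementation.

```python
# Sketch of the fast-memory pattern: ingest a fact once, recall it later
# from the store instead of repeating it in every user prompt.
# Hypothetical API -- not the code in memory.py.

class FastMemory:
    def __init__(self):
        self.facts = {}  # topic -> fact text

    def ingest(self, topic, fact):
        """Store a fact under a topic key."""
        self.facts[topic] = fact

    def recall(self, topic):
        """Answer from memory; the fact is no longer part of the prompt."""
        return self.facts.get(topic)

mem = FastMemory()
mem.ingest("capital", "The capital of the demo country is Exampleville.")
print(mem.recall("capital"))
```

The model-memory path in `memory.py` takes the complementary approach: instead of a side store, the fact is consolidated into the weights by a short fine-tune.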

## How to Run

### Using Docker (Recommended for Reproducibility)

1. **Build the CPU image**:

```bash
docker build -t bdh-project -f Dockerfile .
```

2. **Build the GPU image** (CUDA):

```bash
docker build -t bdh-project-gpu -f Dockerfile.gpu .
```

3. **Run Training** (saves checkpoints to `outputs/`):

```bash
docker run -v $(pwd)/outputs:/app/outputs bdh-project python train.py --max_iters 1000
```

4. **Compare Checkpoints**:

```bash
docker run -v $(pwd)/outputs:/app/outputs bdh-project python compare_checkpoints.py --input_dir outputs/training/checkpoints --output_dir outputs/evaluation
```

5. **Run Memory Script** (requires training checkpoints):

```bash
docker run -v $(pwd)/outputs:/app/outputs bdh-project python memory.py
```

6. **Run All in Order**:

```bash
docker run -v $(pwd)/outputs:/app/outputs bdh-project python main.py --max_iters 1000
```

**GPU usage**:

```bash
docker run --gpus all -v $(pwd)/outputs:/app/outputs bdh-project-gpu python main.py --max_iters 1000
```

### Running Locally

Ensure you have Python and pip installed; the command below installs PyTorch and the other dependencies.

1. **Install requirements**:

```bash
pip install numpy torch requests psutil pandas matplotlib
```

2. **Train**:

```bash
python train.py --max_iters 1000 --out_dir outputs
```

- You can control training length with `--max_iters` (e.g., 100 for a quick smoke test, 3000 for full training).
- CPU training is slower, and runtime scales with the iteration count, so prefer a small `--max_iters` for quick tests.

3. **Visuals & Logs**:

- Training logs (loss vs step) are saved to `outputs/training/logs/training_log.csv`.
- Run `python compare_checkpoints.py --input_dir outputs/training/checkpoints` to see the text generation progress.
- Consolidated generations are written to `outputs/evaluation/checkpoint_generations.log`.

4. **Run Memory Script** (requires training checkpoints):

```bash
python memory.py
```

5. **Run All in Order**:

```bash
python main.py --max_iters 1000
```

## Outputs and Where to Find Them

- Training outputs: `outputs/training/`
- Checkpoints: `outputs/training/checkpoints/`
- Training log CSV: `outputs/training/logs/training_log.csv`
- Evaluation outputs: `outputs/evaluation/`
- Report: `outputs/evaluation/report.md`
- Consolidated generations log: `outputs/evaluation/checkpoint_generations.log`
- Loss plot: `outputs/evaluation/figs/loss_curve.png`
- Checkpoint samples: `outputs/evaluation/figs/checkpoint_samples.png`
- Output diversity plot: `outputs/evaluation/figs/output_quality.png`
- Multi-prompt runs: `outputs/evaluation/multi_prompt/`
- Memory outputs: `outputs/memory/`
- Memory log: `outputs/memory/memory_log.txt`
- Fine-tune loss plot: `outputs/memory/figs/memory_fine_tuning_loss.png`

## Results Interpretation

- **Training**: You will see the loss decrease over time. Around 100-200 steps, the model starts forming recognizable words. By 1000 steps, it should produce Shakespeare-like structure.
- **Checkpoint Comparison**: Early checkpoints produce noise; later checkpoints produce English-like text.
- **Fast Memory**: The memory layer recalls facts without placing them in the user prompt; for transparency, the log also records the model's response to a memory-injected prompt.
- **Model Memory**: After fine-tuning, the model recalls the fact even without any context provided at runtime.

## What the Loss Curve Means

- The loss curve (`outputs/evaluation/figs/loss_curve.png`) shows how prediction error changes over training steps.
- Lower loss means the model is better at predicting the next character/byte.
- A steady downward trend is expected; it indicates the model is learning patterns from the dataset.
- Minor bumps are normal (training is noisy), but the overall trend should go down.
- If validation loss is shown, it should track training loss but stay slightly higher.
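
The "steady downward trend" can be checked numerically from the log rather than by eye, by comparing mean loss over the first and last stretches of training. The sketch below assumes `training_log.csv` has `step` and `loss` columns, which is an assumption about the exact header names; the demo data is synthetic.

```python
# Sanity-check a loss log: mean loss over the first 10% of steps should be
# clearly higher than over the last 10%. Column names "step" and "loss"
# are assumed, not verified against training_log.csv.
import csv

def loss_trend(rows):
    """rows: iterable of (step, loss) pairs. Returns (early_mean, late_mean)."""
    losses = [loss for _, loss in sorted(rows)]
    k = max(1, len(losses) // 10)
    early = sum(losses[:k]) / k
    late = sum(losses[-k:]) / k
    return early, late

def load_log(path):
    """Load (step, loss) pairs from a CSV with 'step' and 'loss' columns."""
    with open(path, newline="") as f:
        return [(int(r["step"]), float(r["loss"])) for r in csv.DictReader(f)]

# Synthetic example: a smoothly decreasing curve standing in for real data.
demo = [(s, 4.0 * (0.999 ** s)) for s in range(0, 1000, 10)]
early, late = loss_trend(demo)
print(f"early mean {early:.3f}, late mean {late:.3f}")
```

If `early` is not noticeably larger than `late`, the run likely needs more iterations or a different learning rate.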

## Additional Visuals Included

- **Checkpoint samples**: A small panel showing early/mid/final outputs for quick comparison.
- **Output diversity**: Unique character ratio per checkpoint to show output variation over time.
- **Before vs After table**: Inline table in the report with early/mid/late outputs.
- **Multi-prompt checks**: Same checkpoints evaluated on 3 prompts for consistency.
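
Per the description above, the output-diversity plot tracks the ratio of unique characters to total characters in each checkpoint's sample. A sketch of that computation (the exact details in `compare_checkpoints.py` may differ):

```python
# Unique-character ratio: a crude diversity signal. Early checkpoints that
# emit one repeated byte score near 0; varied text scores higher.

def unique_char_ratio(text):
    """Fraction of distinct characters in a sample; higher = more varied."""
    return len(set(text)) / len(text) if text else 0.0

print(unique_char_ratio("aaaa"))  # low diversity
print(unique_char_ratio("abcd"))  # every character distinct
```

Note this measures character variety, not quality: gibberish can score high, so read it alongside the generation samples.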
19 changes: 19 additions & 0 deletions Dockerfile
@@ -0,0 +1,19 @@
# CPU-only, minimal base image for smaller size
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PIP_INDEX_URL=https://download.pytorch.org/whl/cpu
ENV PIP_EXTRA_INDEX_URL=https://pypi.org/simple

WORKDIR /app

# Copy and install deps first for better layer caching
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Copy project files
COPY . /app

# Default command: show help
CMD ["python", "main.py", "--help"]
14 changes: 14 additions & 0 deletions Dockerfile.gpu
@@ -0,0 +1,14 @@
# GPU-enabled image (CUDA runtime)
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime

WORKDIR /app

# Install dependencies
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Copy project files
COPY . /app

# Default command: show help
CMD ["python", "main.py", "--help"]
23 changes: 0 additions & 23 deletions README.md
@@ -50,29 +50,6 @@ BDH follows **Transformer-like scaling laws**, maintaining parameter efficiency

***

## Latest research update: Sudoku Benchmark

Note: The Sudoku Extreme result refers to Pathway’s internal BDH implementation, not to the current open-source repository. This repository contains the implementation of the baseline variant as described in our [public paper](https://arxiv.org/abs/2509.26507) and does not reproduce the 97.4% benchmark result out of the box. See the dedicated Extreme Sudoku research blog post for additional benchmark context and the reported results.

On Sudoku Extreme, BDH reaches 97.4% accuracy across roughly 250,000 difficult puzzles, without chain-of-thought, solution backtracking, or external tool use, while leading LLMs struggle to perform on the benchmark at all.

Language is not enough for intelligence. Transformers process information token by token with limited internal state, which makes search-heavy, non-linguistic reasoning tasks like Sudoku awkward. BDH uses a larger latent reasoning space with intrinsic memory that supports learning and adaptation during use.

We believe that the future of AI will belong to systems that can reason natively across domains, that can hold multiple possibilities in a rich latent space, and that can converge on solutions without needing to verbalize every step. BDH is our answer to that challenge. It is designed to be a universal reasoning system that can speak our language without being trapped inside it. And yes, it solves Sudoku.

Read more: [Post-transformers: Sudoku Bench](https://pathway.com/research/beyond-transformers-sudoku-bench)

### Performance Comparison

| Model | Sudoku Extreme Accuracy | Relative Cost |
|------|------------------------|--------------|
| Pathway BDH | 97.4% | 10× lower, No chain-of-thought |
| Leading LLMs (O3-mini, DeepSeek R1, Claude 3.7 8K) | ~0% | High (chain-of-thought) |

*Table 1: Performance comparison on extreme Sudoku benchmarks (~250,000 difficult puzzles).*
*Source: Pathway internal data and https://arxiv.org/pdf/2506.21734 for the Leading LLMs’ accuracy score. Pathway’s approach reflects top-1 accuracy and does not rely on chain-of-thought or solution backtracking.*


## Installation and Training

```bash