Practical setup guides, reusable templates, and sync workflows for running local AI coding agents with Ollama and any other compatible tools.
OpenClaw personal AI assistant running via Telegram with both local and cloud models
This repository documents a real developer stack built around:
- Aider — primary Windows-native repo editing agent
- OpenClaw — personal AI assistant via Telegram with multi-agent and tool calling (replaced Hermes)
- Pi — lean customizable coding harness
- Ollama — local model runtime (9 models, auto-swap fallback chain)
- Continue.dev — editor integration and autocomplete
- OpenCode — hosted coding workflow option
It is built for people who want a real local workflow for create, read, update, and delete tasks on actual project files, not just chat demos.
Most AI-agent setup notes are either:
- too generic to reproduce
- too personal to reuse
- too fragile to maintain
This repository tries to be more useful:
- real-world setup patterns
- model-role mapping that matches actual hardware limits
- public-safe templates
- sync workflows for keeping docs aligned with a tested local setup
- a contribution-friendly structure for people who want to share their own setups
- Fullstack developers
- IT generalists
- Local-first AI users
- Windows users with low-VRAM GPUs
- People who want to back up and version their agent configuration cleanly
| Agent | Runtime | Default Model | Best Use |
|---|---|---|---|
| Aider | Windows native | Jan-code | Primary repo editing, file CRUD |
| OpenClaw | Windows native (Node.js) | Phi4-mini | Personal AI assistant via Telegram, tool calling, multi-agent |
| Pi | Windows native | Jan-code | Lean customizable coding harness |
OpenClaw replaced Hermes Agent (WSL-based). No WSL dependency required anymore.
These are not all configured as primary agents in the same way, but they are part of the practical tool ecosystem around this setup:
| Tool | Role |
|---|---|
| Ollama | local model runtime |
| OpenCode | hosted coding workflow option |
| Continue.dev | editor integration, autocomplete, and inline assistance |
| CodeGemma | autocomplete-oriented local model |
| Zed IDE | modern editor worth testing for AI-native workflows |
- https://context7.com -> useful for fetching fresher framework and library documentation context
- https://opencode.ai or your OpenCode access path -> strong hosted coding workflow option, often good for free-tier experimentation depending on account access
- https://build.nvidia.com -> worth testing for free hosted model access and experimentation
- https://zed.dev -> Zed IDE is worth trying if you want another AI-friendly editor workflow
- Another free LLM APIs list, Free AI Tools Compare
- Awesome MCP Servers — curated list of MCP servers for extending tool capabilities
- `local-gpu-4gb` - fully documented from a real working setup
- `local-cpu-only` - recommended setup path using NovaforgeAI models
| Model | Size | Role | Tool Calling |
|---|---|---|---|
| `phi4-mini:3.8b-q4_K_M` | 2.5 GB | OpenClaw primary: tool calling + step-by-step reasoning | Yes |
| `fredrezones55/Jan-code:Q4_K_M` | 2.7 GB | Aider/Pi default: balanced daily coding | No |
| `aikid123/Qwen3-coder:latest` | 1.4 GB | Fast code and chat with thinking | No |
| `fredrezones55/qwen3.5-opus:4b` | 3.4 GB | Complex reasoning and heavier work | No |
| `softw8/Nanbeige4.1-3B-q4_K_M` | 2.5 GB | Backup tool calling | Yes |
| `relational/orlex:latest` | 3.3 GB | Backup reasoning and planning | No |
| `rardiolata/CodeTito:latest` | 2.0 GB | Backup coding | No |
| `codegemma:2b` | 1.6 GB | Autocomplete (editor integration) | No |
| `nomic-embed-text:latest` | 274 MB | Embeddings only, not for chat/coding | No |
| Provider | Model | Purpose |
|---|---|---|
| Google | Gemini 2.5 Flash (free tier) | Long context (1M tokens), tool calling |
| NVIDIA | Llama 3.3 70B (free tier) | High-quality fallback |
Cloud models are only used when all local models fail. See OpenClaw Model Guide for the full fallback chain.
Phi4-mini → Qwen3-coder → Jan-code → Qwen3.5-opus → Orlex → CodeTito → Nanbeige → Gemini Flash → NVIDIA Llama
(local) (local) (local) (local) (local) (local) (local) (cloud free) (cloud free)
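
The chain above can be sketched in a few lines. This is only an illustration of the ordering, not how OpenClaw implements its fallback: `run_with_fallback` and the `call_model` callable are hypothetical names, the list uses the short display names from the chain rather than exact Ollama tags, and "any exception means try the next model" is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Order copied from the chain above (short display names, not exact model tags).
FALLBACK_CHAIN: List[str] = [
    "Phi4-mini", "Qwen3-coder", "Jan-code", "Qwen3.5-opus",
    "Orlex", "CodeTito", "Nanbeige",   # local
    "Gemini Flash", "NVIDIA Llama",    # cloud free tier
]


def run_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    chain: Sequence[str] = FALLBACK_CHAIN,
) -> Tuple[str, str]:
    """Try each model in order and return (model_name, reply) from the first success.

    `call_model(name, prompt)` is a hypothetical adapter that you would write
    around your own runtime; it should raise an exception on failure.
    """
    errors = []
    for name in chain:
        try:
            return name, call_model(name, prompt)
        except Exception as exc:  # assumption: any failure moves to the next model
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Passing the model call in as a function keeps the ordering logic independent of any specific client library, so the same loop can cover both the local and the cloud entries.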
| Environment | Status | Best Default | Notes |
|---|---|---|---|
| Local GPU 4GB | Tested | `phi4-mini:3.8b-q4_K_M` (OpenClaw) / Jan-code (Aider) | Multi-model role split with auto-swap fallback chain |
| Local CPU Only | Recommended | NovaforgeAI small models or API-backed providers | Good contribution target for future validation |
There are technically ways to exploit models and quotas using OpenCode, but I will not discuss those here. Instead, I will share some of the best strategies for maximizing output while minimizing API token costs, which is a critical part of running AI agents sustainably.
InshaAllah, with this blessed approach, the savings achieved will be more meaningful and sustainable. 🙏
"Halal sustenance brings long-term blessings."
One of the most powerful patterns is combining local models for execution with premium API models (Claude Opus, Gemini Pro, GPT-4+) for architecture and complex reasoning. This dramatically reduces API token costs while maintaining high-quality output.
You can also use Claude Code as an agent with free-tier API keys from OpenRouter (e.g., qwen3 models) instead of the paid Claude API. See the Claude Code + OpenRouter Integration Guide for setup instructions.
Recommendation: Use Claude Pro ($20/month) with the official Claude Code extension rather than OpenRouter pay-as-you-go API.
In agentic mode (Zed, Claude Code), a single prompt can trigger 20–50+ API calls automatically, each with full context. A real case from my own usage: $11.65 spent in one day from just 2 prompts using Claude Opus via OpenRouter.
If you must use an API, always set a spending limit on your platform if available and use cheaper models (Sonnet / Qwen free) for execution tasks. OpenRouter also offers an opt-in setting, "Enable 1% discount on all LLMs", which requires consenting to OpenRouter using your inputs/outputs to improve the product.
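
Where the platform has no spending limit, the same idea can be approximated on the client side. The sketch below is a hedged illustration only: `SpendGuard` and `BudgetExceeded` are hypothetical names, and the per-million-token prices are parameters you must fill in from your provider's current price list, not real quotes.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next call would push estimated spend over the limit."""


class SpendGuard:
    """Tracks estimated API spend and blocks calls that would exceed a limit."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(
        self,
        input_tokens: int,
        output_tokens: int,
        usd_per_m_input: float,
        usd_per_m_output: float,
    ) -> float:
        """Record one call's estimated cost, or raise before the limit is crossed."""
        cost = (
            input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
        ) / 1_000_000
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(
                f"call would cost ${cost:.3f}; "
                f"${self.spent_usd:.3f} of ${self.limit_usd:.2f} already spent"
            )
        self.spent_usd += cost
        return cost
```

Calling `guard.charge(...)` before every request in an agent loop makes the "20–50+ calls per prompt" pattern stop at a known ceiling instead of at the end of the day. It is an estimate, so a platform-side limit is still the safer control when one exists.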
For actual code implementation, these local models provide excellent cost savings:
Fast Execution (Recommended):
- `qwen2.5-coder:3b` - Fast and capable, works best when tasks are broken into multiple focused plan files rather than one long context
Better Quality:
- `deepseek-coder:6.7b-instruct-q4_0` - Higher quality output, slightly slower but more reliable for complex tasks
Don't waste expensive API tokens on boilerplate code. Use premium models purely for high-level design.
Step 1: The Architect (Claude Opus / Gemini Pro)
Prompt the premium model with this instruction:
Act as an Expert Software Architect and Principal Engineer. Your task is to design a comprehensive Technical Specification Document for the feature/system requested below.
CRITICAL INSTRUCTION TO SAVE TOKENS:
DO NOT write the actual implementation code. DO NOT write full functions. Your output must ONLY contain high-level architecture, logic flows, file structures, and strict pseudocode.
Please provide the output in clean Markdown format with the following sections:
1. System Overview: A brief summary of how the solution works.
2. File Structure: A tree representation of the files to be created or modified.
3. Tech Stack & Dependencies: Any specific libraries, modules, or APIs needed.
4. Data Flow / State Management: How data moves between components.
5. Step-by-Step Logic (Pseudocode): The exact logical steps for core algorithms, edge cases, and security considerations.
6. Execution Order: A numbered list of which file/component should be built first.
Here is the feature I want to build:
[DESCRIBE YOUR FEATURE HERE]
Step 2: The Builder (Gemini Flash / Minimax 2.5 / Local Model)
Take the Technical Spec from Claude and feed it to a cheaper/free executor:
Act as a Senior Full-Stack Developer. I will provide you with a Technical Specification Document written by an Expert Architect.
Your task is to write the COMPLETE, production-ready, and fully functional code based EXACTLY on this specification.
Rules:
1. Follow the file structure and execution order provided.
2. Write clean, well-commented code.
3. Do not skip any logic mentioned in the pseudocode.
4. Output the code block by block, clearly stating the filename above each code block.
Here is the Technical Specification:
[PASTE CLAUDE'S SPEC HERE]
Use:
- Gemini 3 Flash - Blazing fast, huge context window, very cheap for mass execution
- Local qwen2.5-coder:3b - Free, fast, works great with focused specs
- Local deepseek-coder:6.7b - Free, better quality for complex implementations
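
The two steps above can be chained programmatically. This is a minimal sketch under stated assumptions: the templates are shortened stand-ins for the full prompts, and `architect` and `builder` are hypothetical callables that wrap whichever premium and cheap models you use.

```python
from typing import Callable, Tuple

# Shortened stand-ins for the full Step 1 and Step 2 prompts above.
ARCHITECT_TEMPLATE = (
    "Act as an Expert Software Architect and Principal Engineer. "
    "DO NOT write the actual implementation code.\n\n"
    "Here is the feature I want to build:\n{feature}"
)
BUILDER_TEMPLATE = (
    "Act as a Senior Full-Stack Developer. Write the COMPLETE code "
    "based EXACTLY on this specification.\n\n"
    "Here is the Technical Specification:\n{spec}"
)


def architect_then_build(
    feature: str,
    architect: Callable[[str], str],
    builder: Callable[[str], str],
) -> Tuple[str, str]:
    """Run the premium model once for the spec, then hand the spec to the builder.

    Both callables take a prompt string and return the model's reply; they are
    hypothetical adapters, not part of any specific SDK.
    """
    spec = architect(ARCHITECT_TEMPLATE.format(feature=feature))
    code = builder(BUILDER_TEMPLATE.format(spec=spec))
    return spec, code
```

Only the first call touches the premium model, so its cost is bounded by the size of the spec rather than by the size of the generated code.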
One of the biggest token drains is an AI rewriting 500 lines when you only asked it to change 3.
Always end your prompts with:
Only provide the code that changed, with comments indicating where to insert it.
NEVER rewrite entire files that haven't changed.
When debugging large projects, don't paste entire files into chat.
Solution:
- Generate a directory tree structure (use the `tree` command or OpenCode)
- Provide the "Directory Map" to the AI
- Let the AI determine which files it actually needs to see
- Only then provide the specific file contents
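
If the `tree` command is not available, a directory map can be produced with a short script. This is a minimal sketch using only the standard library; `directory_map`, the depth limit, and the skip list are illustrative choices, not part of this repository.

```python
import os
from typing import List, Tuple


def directory_map(
    root: str,
    max_depth: int = 3,
    skip: Tuple[str, ...] = (".git", "node_modules", "__pycache__"),
) -> str:
    """Return an indented text tree of `root`, directories first, names only."""
    lines: List[str] = [os.path.basename(os.path.abspath(root)) + "/"]

    def walk(path: str, depth: int) -> None:
        if depth > max_depth:
            return
        # Sort directories before files, then alphabetically (case-insensitive).
        entries = sorted(os.scandir(path), key=lambda e: (e.is_file(), e.name.lower()))
        for entry in entries:
            if entry.name in skip:
                continue
            indent = "  " * depth
            if entry.is_dir(follow_symlinks=False):
                lines.append(f"{indent}{entry.name}/")
                walk(entry.path, depth + 1)
            else:
                lines.append(f"{indent}{entry.name}")

    walk(root, 1)
    return "\n".join(lines)


if __name__ == "__main__":
    print(directory_map("."))
```

The output contains file names only, so it usually costs a small fraction of the tokens that pasting file contents would.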
If you use Ollama and VSCodium (Continue.dev/OpenCode), use them as your first line of defense:
Autocomplete (0 tokens):
- Let `qwen2.5-coder:1.5b-base` handle line-by-line autocomplete as you type
Small Refactoring (0 tokens):
- Use local `deepseek-r1:1.5b` in OpenCode terminal for:
  - Regex generation
  - Code formatting
  - Simple unit tests
  - Variable renaming
Only escalate to premium APIs (Claude Opus/Gemini Pro) when:
- Hitting a dead end (complex Nginx config bugs, intricate state management)
- Architectural decisions needed
- Complex debugging requiring deep reasoning
Traditional approach (all premium API):
- Architect + Implementation: ~50,000 tokens ($$$)
Hybrid approach:
- Architect (Claude Opus): ~5,000 tokens ($)
- Implementation (Gemini Flash or local): ~0-1,000 tokens (¢ or free)
- Savings: ~90%
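
The "~90%" figure follows directly from the token counts above. The snippet below only redoes that arithmetic; it compares billed token counts and deliberately ignores price differences between models, which is a simplification.

```python
def savings_percent(baseline_tokens: int, hybrid_tokens: int) -> float:
    """Percentage of billed tokens saved relative to the all-premium baseline."""
    return 100.0 * (baseline_tokens - hybrid_tokens) / baseline_tokens


BASELINE = 50_000   # architect + implementation, all on a premium API
ARCHITECT = 5_000   # premium tokens spent on the spec only

# Implementation on a local model adds 0 billed tokens; on a cheap API up to ~1,000.
print(savings_percent(BASELINE, ARCHITECT + 0))      # 90.0
print(savings_percent(BASELINE, ARCHITECT + 1_000))  # 88.0
```

So the hybrid approach lands between 88% and 90% fewer billed tokens for these example numbers, which is where the rounded "~90%" comes from.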
GPT and Llama models are also good in tool-calling mode, so you can use them for tool orchestration, bash, deployment, and similar tasks while letting other models handle the actual code generation.
For complete master prompts and detailed strategies, see:
- `guides/COST_SAVING_PROMPTS.md` - Battle-tested prompts for the Architect & Builder workflow

Also use DCP plugins to manage your context and token usage in OpenCode. Here is a quick summary of my patterns for maximizing output while minimizing tokens:
1. AGENTS.md → model fallback chain
2. DCP → 20-40% saving ctx
3. Plan mode → focused & minimal iterations
4. Session/task → keep context clean
5. /dcp sweep {n} → prune irrelevant context
6. /compact → aggressive compression
1. /dcp context ← check current state
2. /dcp sweep 10 ← prune least relevant ctx
3. /dcp context ← confirm pruning results
4. /compact ← last resort
- Read `docs/quick-start.md`
- For the tested setup, start with `docs/installation/local-gpu-4gb.md`
- For the lighter CPU-oriented path, read `docs/installation/local-cpu-only.md`
docs/ -> setup guides, operations, troubleshooting, contribution notes
templates/ -> reusable public-safe config templates by environment
guides/ -> quick command references
scripts/ -> sync helpers and utilities (including Ollama auto-start fix)
assets/ -> screenshots and images
Automated solution for Ollama startup issues after PC restart. Solves the common problem where Ollama processes hang and require manual restart after reboot.
Location: scripts/ollama/
Quick Setup:
cscript scripts\ollama\setup-ollama-startup.vbs
Features:
- Automatic startup after PC restart
- Kills hung processes automatically
- 10-second delay for system stability
- Runs minimized in background
- No admin privileges required
Documentation:
- `docs/agents/openclaw-agent.md` - OpenClaw setup, multi-agent, Telegram, skills, cloud fallback
- `docs/agents/aider.md` - Aider configuration and usage
- `docs/agents/pi-agent.md` - Pi agent setup
- `docs/agents/continue-dev.md` - Continue.dev editor integration
- `docs/agents/opencode.md` - OpenCode hosted workflow

- `guides/OPENCLAW_MODEL_GUIDE.md` - Model strategy, Phi4-mini, fallback chain, VRAM tuning
- `guides/OPENCLAW_PERFORMANCE_GUIDE.md` - 4GB VRAM optimization, timeouts, context window, disk management
- `guides/AGENT_COMMANDS.md` - All launcher commands for Aider, OpenClaw, Pi
- `guides/COST_SAVING_PROMPTS.md` - Architect & Builder workflow prompts

- `docs/overview.md`
- `docs/quick-start.md`
- `docs/installation/local-gpu-4gb.md`
- `docs/installation/local-cpu-only.md`
- `docs/models/model-strategy.md`

- `docs/operations/security.md` - Security model, pairing, data privacy
- `docs/operations/backup-and-restore.md`
- `docs/operations/update-workflow.md`
- `docs/operations/contributor-sync-guide.md`
- `docs/operations/troubleshooting.md`

- `scripts/ollama/README.md` - Ollama auto-start utilities
If you use Aider, OpenClaw, Pi, Ollama, custom launchers, and local model routing, your setup becomes part of your engineering environment.
Backing it up gives you:
- a faster restore path on a new machine
- versioned changes to prompts, configs, and launchers
- cleaner experimentation with different models
- easier sharing of sanitized setups with other people
This project works best with two repositories:
- public repo -> reusable docs, templates, contributor-friendly structure
- private repo -> your live working backup and machine-specific copies
Recommended rule:
- private repo first
- public repo second
This repository is designed to be updated from a real local working setup.
Recommended order:
- update and validate the local machine setup
- sync the private backup repository
- refresh the public templates
- update the docs if commands or model strategy changed
Useful scripts:
# from the public repo
powershell -ExecutionPolicy Bypass -File .\scripts\sync-all.ps1
# or run the public step only
powershell -ExecutionPolicy Bypass -File .\scripts\refresh-public-templates.ps1
For more detail, read:
- `docs/operations/update-workflow.md`
- `docs/operations/contributor-sync-guide.md`
This repository is intentionally structured so other people can contribute their own working AI-agent setups.
Useful contributions include:
- cloud VPS without GPU
- Linux desktop
- macOS
- larger local GPU setups
- better CPU-only model recommendations
- alternative local model families
- Continue.dev presets and editor configs
- OpenCode or Build NVIDIA hosted usage notes
- Zed IDE workflow notes
- different launcher strategies
If you maintain your own agent stack, consider contributing:
- sanitized config templates
- documented model choices
- launcher patterns
- hardware notes
- troubleshooting notes
The goal is simple: make it easier for other people to back up, understand, and share agent setups that actually work.
If your setup differs from this one, that is a feature, not a problem. Different hardware tiers, operating systems, model families, and launcher strategies are all useful contributions when they are documented clearly.
Contributions are welcome.
See docs/contribution-guide.md for the preferred direction and docs/operations/contributor-sync-guide.md for the sync/update workflow.
Good contribution examples:
- a tested CPU-only setup with measured tradeoffs
- a Linux-native workflow
- a cloud VPS no-GPU workflow
- a cleaner launcher design
- better restore or troubleshooting notes
- Do not commit real API keys, auth files, or session state
- Treat these templates as public-safe examples unless explicitly marked as private backup material
- Review all launcher scripts before running them on production machines
MIT