OpenClaw Multi-Model Router — Local Gemma + Qwen + Claude
A three-model AI setup that routes requests intelligently between local models and Claude. Simple questions stay local (free, private, fast). Complex reasoning escalates to the Claude API.
Prerequisite. You need OpenClaw already running, with your agent (the “Operator”) configured. If you haven’t done that yet, follow the OpenClaw Setup Guide first — this post picks up where that one ends.
The Architecture
Every response is labeled [Gemma], [Qwen], or [Claude] so you always know which model answered.
Customize the labels. I personally use [EDI] for Claude (named after the AI in Mass Effect) and a custom name for my Operator. Pick whatever fits your setup — just keep the labeling consistent so you always know who’s talking.
Hardware tested on: Mac Mini M4, 24GB unified memory, ~60GB free disk for models.
Step 1 — Install Ollama and Python 3.11
brew install ollama python@3.11
Verify:
ollama --version
/opt/homebrew/bin/python3.11 --version
Step 2 — Start Ollama
brew services start ollama
Verify with ollama list.
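If you prefer to check the service itself rather than the CLI, Ollama's HTTP API answers on port 11434 once it's running; `/api/tags` lists the installed models. A quick sketch:

```shell
# Health check: /api/tags responds once the Ollama service is up.
if curl -sf --max-time 5 http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "ollama: up"
else
  echo "ollama: down"
fi
```

If this prints "ollama: down", re-run `brew services start ollama` before continuing.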
Step 3 — Pull the Models
ollama pull gemma4:31b
ollama pull qwen2.5-coder:32b
- Gemma 4 31B — Google DeepMind. General-purpose, 256K context, multimodal.
- Qwen 2.5 Coder 32B — Alibaba. Coding specialist, 92 languages.
Each is ~19–20 GB. Disk needed: ~40 GB. RAM needed: 24 GB (one model loads at a time).
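Pulls can silently fail on a flaky connection, so it's worth confirming both models actually landed. A small check, using the same model names as the pull commands above:

```shell
# Sanity check: print "present" or "missing" for each expected model.
for m in gemma4:31b qwen2.5-coder:32b; do
  if ollama list 2>/dev/null | grep -qF "$m"; then
    echo "$m: present"
  else
    echo "$m: missing"
  fi
done
```

Any "missing" line means that pull needs to be re-run before the router will work.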
Step 4 — Install Python Packages
/opt/homebrew/bin/python3.11 -m pip install litellm anthropic uvicorn fastapi
Step 5 — Create the Router Project
mkdir -p ~/.openclaw/workspace/openclaw-router
Create ~/.openclaw/workspace/openclaw-router/start.sh:
#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
[ -f "$SCRIPT_DIR/.env" ] && export $(grep -v '^#' "$SCRIPT_DIR/.env" | xargs)
[ -z "$ANTHROPIC_API_KEY" ] && echo "ERROR: ANTHROPIC_API_KEY not set" && exit 1
exec /opt/homebrew/bin/python3.11 -m uvicorn router:app \
--host 0.0.0.0 --port 4242 --log-level info --app-dir "$SCRIPT_DIR"
chmod +x ~/.openclaw/workspace/openclaw-router/start.sh
Create ~/.openclaw/workspace/openclaw-router/.env:
ANTHROPIC_API_KEY=sk-ant-...
Then drop in the router.py source from the repo.
Never commit your .env. The repo’s .gitignore excludes it for a reason.
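If you're curious what the router does before grabbing the repo's version, here's a minimal, hypothetical sketch of the routing heuristic. The function name `choose_model` and the keyword lists are illustrative only; the repo's actual router.py wraps this kind of decision in a FastAPI endpoint and forwards the request to Ollama or the Anthropic API accordingly.

```python
# Hypothetical sketch of the routing decision (not the repo's router.py).
# Mirrors the rules from Step 7: code-flavored prompts -> qwen,
# short casual prompts -> gemma, long or complex input -> claude.

CODE_HINTS = ("def ", "class ", "```", "function", "debug", "script",
              "python", "bash", "error", "traceback", "compile")

def choose_model(prompt: str) -> str:
    """Return 'gemma', 'qwen', or 'claude' for a given prompt."""
    text = prompt.lower()
    if any(hint in text for hint in CODE_HINTS):
        return "qwen"        # coding specialist handles anything code-shaped
    if len(prompt.split()) > 500:
        return "claude"      # long input stays with Claude
    if len(prompt.split()) < 20:
        return "gemma"       # short, casual -> local generalist
    return "claude"          # multi-step reasoning by default

print(choose_model("hi there!"))                    # short chat -> gemma
print(choose_model("Debug this Python traceback"))  # code -> qwen
```

The real router also handles timeouts and streaming; this sketch only shows where each request would go.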
Step 6 — Auto-Start on Login (Optional)
Save this as ~/Library/LaunchAgents/com.openclaw.router.plist — replace YOUR_USERNAME with your Mac username:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.openclaw.router</string>
<key>ProgramArguments</key>
<array>
<string>/Users/YOUR_USERNAME/.openclaw/workspace/openclaw-router/start.sh</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/Users/YOUR_USERNAME/.openclaw/workspace/openclaw-router/router.log</string>
<key>StandardErrorPath</key>
<string>/Users/YOUR_USERNAME/.openclaw/workspace/openclaw-router/router.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin</string>
</dict>
</dict>
</plist>
Load it:
launchctl load ~/Library/LaunchAgents/com.openclaw.router.plist
The router will now start automatically every time you log in, and restart if it crashes.
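To confirm the agent actually registered, `launchctl list` shows one line per loaded job, so a match on the label means the router will start and restart automatically:

```shell
# Check whether the LaunchAgent from Step 6 is registered with launchd.
if launchctl list 2>/dev/null | grep -q com.openclaw.router; then
  echo "router agent: loaded"
else
  echo "router agent: not loaded"
fi
```

If it reports "not loaded", re-run the `launchctl load` command above and check router.log for errors.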
Step 7 — Configure Your Operator to Route Automatically
Add this section to ~/.openclaw/workspace/AGENTS.md before the “Make It Yours” section:
## Smart Routing — Three-Way: Gemma / Qwen / Claude
| Model | Label | Role |
|--------------------|----------|-------------------------|
| Gemma 4 31B | [Gemma] | On-premise generalist |
| Qwen 2.5 Coder 32B | [Qwen] | Coding agent |
| Claude (you) | [Claude] | Deep reasoning |
You are the Operator: route each request to the right model and label every reply with the model that produced it.
Route to Gemma: trivia, definitions, casual chat, greetings, sign-offs,
anything under ~20 words with no technical content.
Route to Qwen: writing/debugging/explaining code, scripting, architecture,
anything involving a code block or programming language.
Keep with Claude: multi-step reasoning, analysis, tradeoffs, personal context
(user's family/work/schedule), tool use, long input (>500 words),
anything requiring Bash/file/web access.
How to call Gemma (Bash tool):
curl -s --max-time 90 http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{"model":"gemma4:31b","messages":[{"role":"user","content":"PROMPT"}],"stream":false}' \
| /opt/homebrew/bin/python3.11 -c "import sys,json; print(json.load(sys.stdin)['message']['content'])"
How to call Qwen (Bash tool):
curl -s --max-time 90 http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-coder:32b","messages":[{"role":"user","content":"PROMPT"}],"stream":false}' \
| /opt/homebrew/bin/python3.11 -c "import sys,json; print(json.load(sys.stdin)['message']['content'])"
Always start every reply with [Gemma], [Qwen], or [Claude].
If a local model times out, fall back to Claude silently and label [Claude].
Step 8 — Test Everything
# Test Gemma
curl -s --max-time 90 http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{"model":"gemma4:31b","messages":[{"role":"user","content":"What is the capital of France?"}],"stream":false}' \
| /opt/homebrew/bin/python3.11 -c "import sys,json; print(json.load(sys.stdin)['message']['content'])"
# Test Qwen
curl -s --max-time 90 http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-coder:32b","messages":[{"role":"user","content":"Write a Python function to reverse a string."}],"stream":false}' \
| /opt/homebrew/bin/python3.11 -c "import sys,json; print(json.load(sys.stdin)['message']['content'])"
Both should respond within about 30 seconds; the first call takes longer while Ollama loads the model into memory.
Routing at a Glance
| Signal | Goes to |
|---|---|
| Greeting, joke, trivia, simple question | Gemma |
| Code, debugging, scripting, architecture | Qwen |
| Planning, analysis, personal info, tools | Claude |
| Local model timeout | Claude (fallback) |
Troubleshooting
| Problem | Fix |
|---|---|
| Ollama not responding | brew services restart ollama |
| Model not found | ollama pull gemma4:31b or ollama pull qwen2.5-coder:32b |
| Slow first response | Normal — model loads on first query (~30 sec) |
| Router won’t start | cat ~/.openclaw/workspace/openclaw-router/router.log |
| Out of memory | Restart Ollama: brew services restart ollama |
| API key not set | Check .env exists in openclaw-router/ and contains the key |
Why this exists
Running everything through the Claude API gets expensive fast, especially for casual chat. Running everything locally means slow, inconsistent answers on hard problems. This router gives you the best of both: cheap local speed for the 80% of requests that don’t need a frontier model, and Claude’s deep reasoning for the 20% that do.
The labels ([Gemma], [Qwen], [Claude]) make the trade-off transparent. You always know which model answered, so you can tell the Operator to “ask Claude this one” if the local answer was weak.
Source code & further reading
- mcornelia/openclaw-multi-model-router — the GitHub repo (start.sh, plist, router.py, full README)
- OpenClaw Setup Guide — the prerequisite (sets up your local Operator)
- Gemma 4 on Ollama · Qwen 2.5 Coder on Ollama
- LiteLLM docs · FastAPI docs
Questions, ideas, or want to share your own routing rules? Reach out.