# Local-Model Wrap Demo (Pi + Ollama / llama.cpp / MLX / ds4)

Experimental lab runbook for demoing the staged `django-resume` Electron wrap driven by a local model through Pi, on each of four serving runtimes. It is the companion to the runtime comparison in [Agent Use](agent-use.md) (see its "Local runtime comparison" section).

The commands below assume the lab layout used to produce the comparison (`~/projects/desktop-django-starter`, `~/projects/django-resume`, and the local model servers). The reproduction tooling lives in the starter checkout under `.bench-qwen36/`.

## What the demo does

1. Makes a clean clone of `django-resume` (the target).
2. Runs the deterministic **Stage 1 scaffold** (not the model), which lays down `electron/` and the Django desktop baseline.
3. Drives the local model through Pi for **Stage 2** (Electron) and **Stage 3** (Django). Both stages are verification-first; on this target both are zero-edit passes.
4. Runs an independent **Pi judge** (`openai-codex/gpt-5.5`) that re-runs the packaged smoke and returns PASS/FAIL.

The mechanical wrapping is done by Stage 1; the model's job is to drive and verify Stages 2–3. That is what these runs measure.

## Prerequisites (host: Apple Silicon, 128 GB used here)

| Need | Install / location |
|---|---|
| `pi` | already installed (`pi --version`) |
| Source repo | `~/projects/django-resume` (clean checkout) |
| Starter + scaffold | `~/projects/desktop-django-starter` (this repo) |
| Ollama | `ollama serve` running; model `qwen36-27b-tools` (built below) |
| llama.cpp | `brew install llama.cpp`; GGUF `~/models/gguf/qwen3.6-27b/Qwen3.6-27B-Q4_K_M.gguf` |
| MLX | `uv tool install mlx-lm`; model `mlx-community/Qwen3.6-27B-4bit` (HF cache) |
| ds4 | built `ds4-server` + `ds4flash.gguf` in `~/workspaces/ds4-pi-django-resume/ds4-pi` |
| `uv`, `node`, `npm` | for the Django/Electron verification commands |

### One-time: build the tool-capable Ollama model

The raw Ollama GGUF import shipped a bare `{{ .Prompt }}` template and cannot call tools. Rebuild it with the Qwen3 ChatML template:

```bash
cd ~/projects/desktop-django-starter/.bench-qwen36
ollama create qwen36-27b-tools -f Modelfile.qwen36-27b-tools
```

## Start the model server for the runtime you want to demo

Pick one. Each exposes an OpenAI-compatible `/v1` endpoint that a Pi provider extension in `.bench-qwen36/` points at.

```bash
# Ollama: already serving on :11434, nothing to start (uses qwen36-27b-tools)

# llama.cpp: native tool calling via --jinja (reads the GGUF chat template)
llama-server -m ~/models/gguf/qwen3.6-27b/Qwen3.6-27B-Q4_K_M.gguf \
  --jinja --host 127.0.0.1 --port 8080 -c 32768 -ngl 999

# MLX
mlx_lm.server --model mlx-community/Qwen3.6-27B-4bit --host 127.0.0.1 --port 8081

# ds4 (DeepSeek V4 Flash): port 8002 to avoid the default :8000
cd ~/workspaces/ds4-pi-django-resume/ds4-pi
./ds4-server -m ds4flash.gguf --host 127.0.0.1 --port 8002 -c 32768 \
  --kv-disk-dir /tmp/ds4-bench-kv --kv-disk-space-mb 8192
```

Wait until ready: `curl -s http://127.0.0.1:<port>/v1/models` (with the port of the runtime you started) returns the model. (llama.cpp loads in seconds; ds4 maps 86 GB and may download weights first.)

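The readiness check can be wrapped in a small polling helper so a slow-loading runtime does not need to be watched by hand. This is a minimal sketch, not part of the `.bench-qwen36/` tooling: the function name `wait_for_server`, the 2-second poll interval, and the 600-second default timeout are all assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repo): poll an OpenAI-compatible
# /v1/models endpoint on localhost until it answers or a timeout expires.
# Usage: wait_for_server PORT [TIMEOUT_SECONDS]
wait_for_server() {
  local port="$1"
  local timeout="${2:-600}"   # assumed default; large models can load slowly
  local waited=0

  # curl -f makes HTTP errors (e.g. 503 while loading) count as failures.
  until curl -sf "http://127.0.0.1:${port}/v1/models" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server on :${port} not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "server on :${port} is ready"
}

# Example: wait_for_server 8080        (llama.cpp)
# Example: wait_for_server 8002 1800   (ds4, longer timeout)
```

The helper only confirms that the endpoint responds; it does not check which model name is listed, so inspecting the `curl` output once by hand is still worthwhile.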
## Run the full wrap (one command)

The runner does a clean `rm -rf` + fresh `git clone` of the target, Stage 1 scaffold, `npm install`, then drives Pi for Stage 2 + Stage 3, then an independent verification smoke. Results land in `results-