Local-Model Wrap Demo (Pi + Ollama / llama.cpp / MLX / ds4)

Experimental lab runbook for demoing the staged django-resume Electron wrap driven by a local model through Pi, on each of four serving runtimes. It is the companion to the runtime comparison in Agent Use (see its “Local runtime comparison” section).

The commands below assume the lab layout used to produce the comparison (~/projects/desktop-django-starter, ~/projects/django-resume, and the local model servers). The reproduction tooling lives in the starter checkout under .bench-qwen36/.

What the demo does

  1. Makes a clean clone of django-resume (the target).

  2. Runs the deterministic Stage 1 scaffold (not the model) — lays down electron/ and the Django desktop baseline.

  3. Drives the local model through Pi for Stage 2 (Electron) and Stage 3 (Django) — verification-first; on this target both are zero-edit passes.

  4. Runs an independent Pi judge (openai-codex/gpt-5.5) that re-runs the packaged smoke and returns PASS/FAIL.

The mechanical wrapping is done by Stage 1; the model’s job is to drive and verify Stages 2–3. That is what these runs measure.

Prerequisites (host: Apple Silicon, 128 GB used here)

| Need | Install / location |
| --- | --- |
| pi | already installed (pi --version) |
| Source repo | ~/projects/django-resume (clean checkout) |
| Starter + scaffold | ~/projects/desktop-django-starter (this repo) |
| Ollama | ollama serve running; model qwen36-27b-tools (built below) |
| llama.cpp | brew install llama.cpp; GGUF ~/models/gguf/qwen3.6-27b/Qwen3.6-27B-Q4_K_M.gguf |
| MLX | uv tool install mlx-lm; model mlx-community/Qwen3.6-27B-4bit (HF cache) |
| ds4 | built ds4-server + ds4flash.gguf in ~/workspaces/ds4-pi-django-resume/ds4-pi |
| uv, node, npm | for the Django/Electron verification commands |

One-time: build the tool-capable Ollama model

The raw Ollama GGUF import ships with a bare {{ .Prompt }} template, so the model cannot call tools. Rebuild it with the Qwen3 ChatML template:

cd ~/projects/desktop-django-starter/.bench-qwen36
ollama create qwen36-27b-tools -f Modelfile.qwen36-27b-tools
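
To confirm the rebuild took effect, one option is to inspect the template Ollama reports for the model. The sketch below assumes that `ollama show <model> --template` prints the active template; `template_is_bare` is a hypothetical helper, not part of .bench-qwen36/, and it only detects the bare {{ .Prompt }} case described above.

```shell
# template_is_bare: read a chat template on stdin and succeed only if it is
# nothing more than the bare {{ .Prompt }} placeholder.
# Hypothetical helper for illustration; not shipped with the bench tooling.
template_is_bare() {
  local tpl
  # Strip all whitespace so "{{ .Prompt }}" and "{{.Prompt}}" compare equal
  tpl=$(tr -d '[:space:]')
  [ "$tpl" = "{{.Prompt}}" ]
}

# Usage (needs a running Ollama; assumes `ollama show --template`):
#   ollama show qwen36-27b-tools --template | template_is_bare \
#     && echo "still the bare template: rebuild with ollama create"
```

A non-bare template does not prove tool calling works; it only rules out the known failure mode.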

Start the model server for the runtime you want to demo

Pick one. Each exposes an OpenAI-compatible /v1 endpoint that a Pi provider extension in .bench-qwen36/ points at.

# Ollama — already serving on :11434, nothing to start (uses qwen36-27b-tools)

# llama.cpp — native tool calling via --jinja (reads the GGUF chat template)
llama-server -m ~/models/gguf/qwen3.6-27b/Qwen3.6-27B-Q4_K_M.gguf \
  --jinja --host 127.0.0.1 --port 8080 -c 32768 -ngl 999

# MLX
mlx_lm.server --model mlx-community/Qwen3.6-27B-4bit --host 127.0.0.1 --port 8081

# ds4 (DeepSeek V4 Flash) — port 8002 to avoid the default :8000
cd ~/workspaces/ds4-pi-django-resume/ds4-pi
./ds4-server -m ds4flash.gguf --host 127.0.0.1 --port 8002 -c 32768 \
  --kv-disk-dir /tmp/ds4-bench-kv --kv-disk-space-mb 8192

Wait until the server is ready: curl -s http://127.0.0.1:<port>/v1/models should return the model. (llama.cpp loads in seconds; ds4 maps 86 GB and may download weights first.)
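
The readiness check can be scripted as a polling loop. This is a minimal sketch: `wait_for_server` is a hypothetical helper (not part of .bench-qwen36/), and the 300-second default timeout is an assumption chosen to leave room for the ds4 load.

```shell
# wait_for_server PORT [TIMEOUT_SECONDS]
# Poll the OpenAI-compatible /v1/models endpoint until it answers or we give up.
# Hypothetical helper; the default timeout is an assumption.
wait_for_server() {
  local port=$1
  local timeout=${2:-300}
  local elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f makes curl exit non-zero on HTTP errors; --max-time bounds each probe
    if curl -sf --max-time 5 "http://127.0.0.1:${port}/v1/models" >/dev/null; then
      echo "ready on :${port} after ${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "no answer on :${port} after ${timeout}s" >&2
  return 1
}

# Usage: wait_for_server 8080 && echo "llama.cpp is up"
```

The elapsed counter only counts the sleeps, so the real wait can exceed the timeout by the time spent inside each probe.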

Run the full wrap (one command)

The runner deletes any previous target (rm -rf), makes a fresh git clone, runs the Stage 1 scaffold and npm install, drives Pi through Stage 2 and Stage 3, and finishes with an independent verification smoke. Results land in results-<label>/.

cd ~/projects/desktop-django-starter/.bench-qwen36
EXT_LOCAL=$PWD/pi-localserver-provider.ts   # llamacpp + mlx
EXT_OLLAMA=$PWD/pi-ollama-provider.ts
EXT_DS4=$PWD/pi-ds4local-provider.ts

# usage: ./run-staged-wrap.sh <label> <pi-extension> <provider> <model>
./run-staged-wrap.sh ollama   "$EXT_OLLAMA" ollama   qwen36-27b-tools
./run-staged-wrap.sh llamacpp "$EXT_LOCAL"  llamacpp qwen3.6-27b
./run-staged-wrap.sh mlx      "$EXT_LOCAL"  mlx      mlx-community/Qwen3.6-27B-4bit
./run-staged-wrap.sh ds4      "$EXT_DS4"    ds4local deepseek-v4-flash

A pass prints health200=1 root302=1 resume200=1 and git_changed_files=11 (the 9 scaffold files + uv.lock + .stage/, i.e. zero model edits).
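
A pass can also be checked mechanically rather than by eye. The sketch below assumes the key=value tokens quoted above appear verbatim somewhere in the runner's output; `check_tokens` is a hypothetical helper, not part of the runner.

```shell
# check_tokens TOKEN...
# Read text on stdin and succeed only if every expected token is present.
# Hypothetical helper; assumes the runner prints the tokens verbatim.
check_tokens() {
  local text token missing=0
  text=$(cat)
  for token in "$@"; do
    # -w matches whole words, so git_changed_files=11 does not match ...=110
    if ! printf '%s\n' "$text" | grep -qwF -- "$token"; then
      echo "missing: $token" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   ./run-staged-wrap.sh llamacpp "$EXT_LOCAL" llamacpp qwen3.6-27b 2>&1 \
#     | check_tokens health200=1 root302=1 resume200=1 git_changed_files=11
```

A git_changed_files value other than 11 is the quickest signal that the model edited something, since the scaffold output is deterministic.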

Run the model stages manually (better for a live demo)

If you want to show the agent working step by step instead of the runner:

TARGET=~/projects/django-resume-clean
rm -rf "$TARGET" && git clone ~/projects/django-resume "$TARGET"
~/projects/desktop-django-starter/skills/wrap-existing-django-in-electron-staged/scripts/scaffold-target.sh "$TARGET"
npm --prefix "$TARGET/electron" install
cd "$TARGET"

EXT=~/projects/desktop-django-starter/.bench-qwen36/pi-localserver-provider.ts
B=~/projects/desktop-django-starter/.bench-qwen36

# Stage 2 (Electron). Swap --provider/--model for the runtime you started.
pi -e "$EXT" --provider llamacpp --model qwen3.6-27b \
   --no-session --thinking off -nc -ns -np -p "$(cat "$B/stage-2-filled.md")"

# Stage 3 (Django).
pi -e "$EXT" --provider llamacpp --model qwen3.6-27b \
   --no-session --thinking off -nc -ns -np -p "$(cat "$B/stage-3-filled.md")"

Provider/model per runtime:

| Runtime | --provider | --model | extension |
| --- | --- | --- | --- |
| Ollama | ollama | qwen36-27b-tools | pi-ollama-provider.ts |
| llama.cpp | llamacpp | qwen3.6-27b | pi-localserver-provider.ts |
| MLX | mlx | mlx-community/Qwen3.6-27B-4bit | pi-localserver-provider.ts |
| ds4 | ds4local | deepseek-v4-flash | pi-ds4local-provider.ts |
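
For a live demo it can help to switch runtimes with one word instead of retyping three arguments. The lookup below simply encodes the provider/model/extension values listed above; `runtime_args` is a hypothetical helper, and the short runtime keys are an assumption borrowed from the runner labels.

```shell
# runtime_args RUNTIME
# Print "<provider> <model> <extension-file>" for one runtime.
# Hypothetical helper; values are copied from the provider/model listing.
runtime_args() {
  case "$1" in
    ollama)   echo "ollama qwen36-27b-tools pi-ollama-provider.ts" ;;
    llamacpp) echo "llamacpp qwen3.6-27b pi-localserver-provider.ts" ;;
    mlx)      echo "mlx mlx-community/Qwen3.6-27B-4bit pi-localserver-provider.ts" ;;
    ds4)      echo "ds4local deepseek-v4-flash pi-ds4local-provider.ts" ;;
    *)        echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}

# Usage (B is the .bench-qwen36 directory, as in the manual stage commands):
#   set -- $(runtime_args mlx)
#   pi -e "$B/$3" --provider "$1" --model "$2" \
#      --no-session --thinking off -nc -ns -np -p "$(cat "$B/stage-2-filled.md")"
```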

Note

Keep --thinking off. In thinking mode ds4 was about 4x slower overall (404.4s vs 98.7s total wrap time) and derailed Stage 3 on malformed inline tool output. Non-thinking was both fastest and cleanest.

Independent Pi judge

cd ~/projects/django-resume-clean
pi --no-session -nc --print "$(cat ~/projects/desktop-django-starter/.bench-qwen36/judge-prompt.md)"

It ends with VERDICT: PASS (or FAIL) after re-running the packaged smoke itself.
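
If the judge run is scripted, its final verdict can be turned into an exit code. This sketch assumes the judge output contains a line with the literal text VERDICT: PASS or VERDICT: FAIL, as described above; `verdict_of` is a hypothetical helper.

```shell
# verdict_of
# Read judge output on stdin; print PASS, FAIL, or UNKNOWN and set the exit
# status to 0, 1, or 2 respectively.
# Hypothetical helper; assumes the literal "VERDICT: PASS|FAIL" marker.
verdict_of() {
  local line
  # Use the last matching line, in case the judge mentions a verdict earlier
  line=$(grep -E 'VERDICT: (PASS|FAIL)' | tail -n 1)
  case "$line" in
    *"VERDICT: PASS"*) echo PASS ;;
    *"VERDICT: FAIL"*) echo FAIL; return 1 ;;
    *)                 echo UNKNOWN; return 2 ;;
  esac
}

# Usage:
#   pi --no-session -nc --print "$(cat judge-prompt.md)" | verdict_of
```

Treating a missing marker as its own status (2) keeps a crashed or truncated judge run from being mistaken for a FAIL.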

Absolute time per model — target 1: django-resume (Stage 2 + Stage 3)

Measured 2026-06-02 on studio (Apple M4 Max, 128 GB). Stage times are wall-clock including the fixed tool execution each stage runs (uv, npm, node tests, smoke).

| Runtime / model | Stage 2 | Stage 3 | Total wrap time |
| --- | --- | --- | --- |
| Ollama / qwen 3.6 27b Q4_K_M | 73.7s | 97.4s | 171.1s (~2m51s) |
| llama.cpp / qwen 3.6 27b Q4_K_M | 41.7s | 70.7s | 112.4s (~1m52s) |
| MLX / qwen 3.6 27b 4-bit | 40.0s | 70.2s | 110.2s (~1m50s) |
| ds4 / DeepSeek V4 Flash (non-thinking) | 28.6s | 70.1s | 98.7s (~1m39s) |
| ds4 / DeepSeek V4 Flash (thinking=high, 06-01) | 86.4s | 318.0s | 404.4s (~6m44s) |
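
Each total is Stage 2 plus Stage 3, with the minutes figure rounded to the nearest second. The arithmetic can be reproduced with a few lines of awk; `fmt_total` is a hypothetical helper written for this illustration.

```shell
# fmt_total STAGE2_SECONDS STAGE3_SECONDS
# Print the summed wall-clock time in the "171.1s (~2m51s)" style.
# Hypothetical helper for illustration.
fmt_total() {
  awk -v s2="$1" -v s3="$2" 'BEGIN {
    total = s2 + s3
    secs = int(total + 0.5)   # round to the nearest whole second
    printf "%.1fs (~%dm%02ds)\n", total, int(secs / 60), secs % 60
  }'
}

fmt_total 73.7 97.4   # Ollama row:           171.1s (~2m51s)
fmt_total 28.6 70.1   # ds4 non-thinking row: 98.7s (~1m39s)
```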

For the full end-to-end time, add npm install (~3–20s, cached after the first run), the Stage 1 scaffold (~0.2s), and a ~7s verification smoke. Raw decode speed (isolated): ds4 29.9 > MLX 26.9 > llama.cpp 22.5 > Ollama 16.0 tok/s. For a smooth demo, start the server and send one warm-up request before the audience is watching.
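
One way to do the warm-up is a tiny chat completion against the server's OpenAI-compatible endpoint. This is a sketch under assumptions: `warm_up` is a hypothetical helper, the prompt and the 8-token cap are arbitrary, and the model name a given server accepts in the request body may differ from the --model value passed to Pi.

```shell
# warm_up PORT MODEL
# Send one short chat completion so the weights and caches are hot before the
# demo. Hypothetical helper; prompt, token cap, and timeout are assumptions.
warm_up() {
  local port=$1
  local model=$2
  curl -sf --max-time 120 "http://127.0.0.1:${port}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"Say ok.\"}], \"max_tokens\": 8}" \
    >/dev/null
}

# Usage: warm_up 8080 qwen3.6-27b && echo "llama.cpp is warm"
```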

Second target: django-wiki (harder app)

The same harness was replicated against a harder real-world target: django-wiki’s testproject (auth + article permissions + media + MPTT + plugins; a settings package; no committed seed DB; / serves the wiki root article). Six cells, drawn from 4 engines, 2 models, and thinking off/high, all reached a working packaged wrap with zero model edits, judged by pi / openai-codex/gpt-5.5 (live judge PASS + goal-coverage judge GOAL_REACHED: YES).

| Engine | Model | Thinking | Stage 2 | Stage 3 | Total |
| --- | --- | --- | --- | --- | --- |
| Ollama | qwen 3.6 27b | off | 56.2s | 91.2s | 147.4s |
| llama.cpp | qwen 3.6 27b | off | 46.7s | 71.7s | 118.4s |
| MLX | qwen 3.6 27b | off | 60.3s | 72.0s | 132.3s |
| llama.cpp | qwen 3.6 27b | high | 32.2s | 73.7s | 105.9s |
| ds4 | DeepSeek V4 Flash | off | 55.5s | 71.0s | 126.5s |
| ds4 | DeepSeek V4 Flash | high | 29.7s | 64.1s | 93.8s |

The outcome matches django-resume (zero edits, verification-only), with times in the same ballpark despite the heavier app. Notably, thinking=high did not derail here, unlike ds4 thinking=high on django-resume: because the deterministic scaffold fully covered django-wiki, both thinking modes were clean passes. The full cross-target comparison is in .bench-qwen36/RUNTIME-COMPARISON.md.

Reproduce django-wiki via the Pi agent from a clean workspace

# 1. Get a clean django-wiki source clone (the runner re-clones from here per run)
git clone https://github.com/django-wiki/django-wiki ~/projects/django-wiki

# 2. Start the model server you want (same as the django-resume section above):
#    Ollama (:11434, qwen36-27b-tools) / llama.cpp (:8080) / MLX (:8081) / ds4 (:8002)

# 3. Run a wrap cell. The runner is env-parametrized for the target + prompts.
cd ~/projects/desktop-django-starter/.bench-qwen36
export BENCH_SRC=~/projects/django-wiki BENCH_TGT=~/projects/django-wiki-clean
export BENCH_S2=$PWD/wiki-stage-2-filled.md BENCH_S3=$PWD/wiki-stage-3-filled.md

# qwen on each engine (thinking off):
BENCH_THINKING=off ./run-staged-wrap.sh wiki-ollama   "$PWD/pi-ollama-provider.ts"      ollama   qwen36-27b-tools
BENCH_THINKING=off ./run-staged-wrap.sh wiki-llamacpp "$PWD/pi-localserver-provider.ts" llamacpp qwen3.6-27b
BENCH_THINKING=off ./run-staged-wrap.sh wiki-mlx      "$PWD/pi-localserver-provider.ts" mlx      mlx-community/Qwen3.6-27B-4bit

# DeepSeek V4 Flash on ds4, and thinking-mode variation (high):
BENCH_THINKING=off  ./run-staged-wrap.sh wiki-ds4 "$PWD/pi-ds4local-provider.ts" ds4local deepseek-v4-flash
BENCH_THINKING=high BENCH_OUT_SUFFIX=-thinkhigh ./run-staged-wrap.sh wiki-ds4 "$PWD/pi-ds4local-provider.ts" ds4local deepseek-v4-flash
BENCH_THINKING=high BENCH_OUT_SUFFIX=-thinkhigh ./run-staged-wrap.sh wiki-llamacpp "$PWD/pi-localserver-provider.ts" llamacpp qwen3.6-27b

Each run prints health200=1 root200=1 app_served=1 and git_changed_files=8 on a clean pass (the deterministic scaffold output + uv.lock, i.e. zero model edits), and writes logs under results-wiki-<label>/.
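
The six cells differ only in engine, provider extension, model, and thinking mode, so they can be generated from a small table. The sketch below is a dry run that prints the commands instead of executing them, which makes it easy to review before a long session; `print_matrix` is a hypothetical helper, and starting or stopping the matching model server is still left to you.

```shell
# print_matrix PREFIX
# Print (not run) the six run-staged-wrap.sh commands for one target prefix,
# for example "wiki". Hypothetical helper; rows mirror the cells listed above.
print_matrix() {
  local prefix=$1
  local engine ext provider model thinking suffix
  while read -r engine ext provider model thinking; do
    suffix=""
    if [ "$thinking" = "high" ]; then
      suffix=" BENCH_OUT_SUFFIX=-thinkhigh"
    fi
    printf '%s\n' "BENCH_THINKING=${thinking}${suffix} ./run-staged-wrap.sh ${prefix}-${engine} \"\$PWD/${ext}\" ${provider} ${model}"
  done <<'EOF'
ollama   pi-ollama-provider.ts      ollama   qwen36-27b-tools               off
llamacpp pi-localserver-provider.ts llamacpp qwen3.6-27b                    off
mlx      pi-localserver-provider.ts mlx      mlx-community/Qwen3.6-27B-4bit off
ds4      pi-ds4local-provider.ts    ds4local deepseek-v4-flash              off
ds4      pi-ds4local-provider.ts    ds4local deepseek-v4-flash              high
llamacpp pi-localserver-provider.ts llamacpp qwen3.6-27b                    high
EOF
}

# Usage: print_matrix wiki
```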

To run the stages manually for a live demo, use the same pi -e ... --provider ... invocation as the django-resume section but with the wiki prompts (wiki-stage-2-filled.md, wiki-stage-3-filled.md). Judge a single result with wiki-judge-prompt.md, and confirm full coverage across all cells with wiki-goal-judge.md:

cd ~/projects/django-wiki-clean
pi --no-session -nc --print "$(sed 's/{{MODEL_DESC}}/the model under test/' \
  ~/projects/desktop-django-starter/.bench-qwen36/wiki-judge-prompt.md)"

pi --no-session -nc --print "$(cat ~/projects/desktop-django-starter/.bench-qwen36/wiki-goal-judge.md)"

Third target: django-cast (Wagtail CMS)

The same harness was replicated again against django-cast, a Wagtail-based podcast/blog CMS, using its example/ project (settings package example_site.settings; ManifestStaticFilesStorage; django-vite; no committed seed DB; / serves the Wagtail root page). The deterministic Stage 1 scaffold already covered this target with no new generalizations (the settings-package handling added for django-wiki was sufficient), so the same six cells (drawn from 4 engines, 2 models, and thinking off/high) ran as clean verification-only passes, judged by pi / openai-codex/gpt-5.5 (live judge PASS + goal-coverage judge GOAL_REACHED: YES).

The packaged smoke contract matches the other targets: GET /health/ 200 and GET / 200 (the Wagtail welcome root page). A benign django_vite.W001 warning about a missing Vite manifest for app default is expected and does not affect the root page or the smoke; serving deeper cast blog/podcast pages (e.g. /test-blog/) additionally needs target-side packaged DJANGO_VITE["cast"] + plain static storage and is out of scope for this benchmark’s contract.
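
The two-request contract can be checked by hand against any running instance of the app. This is a sketch, not the harness's own smoke: `http_code` and `smoke_check` are hypothetical helpers, and the base URL is an assumption, since this runbook does not state which port the packaged app listens on.

```shell
# http_code URL
# Print only the HTTP status code for a GET request (000 if unreachable).
http_code() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# smoke_check BASE_URL
# Succeed only if GET /health/ and GET / both answer 200.
# Hypothetical helpers; the base URL is supplied by the caller.
smoke_check() {
  local base=$1
  local health root
  health=$(http_code "${base}/health/") || true
  root=$(http_code "${base}/") || true
  echo "health=${health} root=${root}"
  [ "$health" = "200" ] && [ "$root" = "200" ]
}

# Usage: smoke_check "http://127.0.0.1:<port>"
```

This matches the django-wiki and django-cast contract; django-resume answers / with a 302, so it would need a different expected code for the root request.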

| Engine | Model | Thinking | Stage 2 | Stage 3 | Total |
| --- | --- | --- | --- | --- | --- |
| Ollama | qwen 3.6 27b | off | 63.0s | 107.8s | 170.8s |
| llama.cpp | qwen 3.6 27b | off | 43.8s | 94.8s | 138.6s |
| MLX | qwen 3.6 27b | off | 43.1s | 82.4s | 125.5s |
| llama.cpp | qwen 3.6 27b | high | 41.7s | 84.6s | 126.3s |
| ds4 | DeepSeek V4 Flash | off | 37.0s | 87.8s | 124.8s |
| ds4 | DeepSeek V4 Flash | high | 33.3s | 78.5s | 111.8s |

All six cells reported app_served=1 (smoke /health/ 200, / 200) with zero model edits (git_changed_files=9). The cross-target comparison is in .bench-qwen36/RUNTIME-COMPARISON.md.
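
The three targets have different pass signatures, so a small lookup keeps them in one place when scripting checks. The values are the ones quoted in this runbook for each target; `expected_tokens` is a hypothetical helper, and for django-cast it lists only the two tokens stated above.

```shell
# expected_tokens TARGET
# Print the pass tokens this runbook quotes for a target (resume, wiki, cast).
# Hypothetical helper; the short target names are an assumption.
expected_tokens() {
  case "$1" in
    resume) echo "health200=1 root302=1 resume200=1 git_changed_files=11" ;;
    wiki)   echo "health200=1 root200=1 app_served=1 git_changed_files=8" ;;
    cast)   echo "app_served=1 git_changed_files=9" ;;
    *)      echo "unknown target: $1" >&2; return 1 ;;
  esac
}

# Usage (run.log is a hypothetical capture of the runner's output):
#   for token in $(expected_tokens cast); do
#     grep -qwF -- "$token" run.log || echo "missing: $token"
#   done
```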

Reproduce django-cast via the Pi agent from a clean workspace

# 1. django-cast source clone (the runner re-clones from here per run)
git clone https://github.com/ephes/django-cast ~/projects/django-cast

# 2. Start the model server you want (same as the django-resume section above):
#    Ollama (:11434, qwen36-27b-tools) / llama.cpp (:8080) / MLX (:8081) / ds4 (:8002)

# 3. Run a wrap cell. The runner is env-parametrized for the target + prompts.
cd ~/projects/desktop-django-starter/.bench-qwen36
export BENCH_SRC=~/projects/django-cast BENCH_TGT=~/projects/django-cast-clean
export BENCH_S2=$PWD/cast-stage-2-filled.md BENCH_S3=$PWD/cast-stage-3-filled.md

# qwen on each engine (thinking off):
BENCH_THINKING=off ./run-staged-wrap.sh cast-ollama   "$PWD/pi-ollama-provider.ts"      ollama   qwen36-27b-tools
BENCH_THINKING=off ./run-staged-wrap.sh cast-llamacpp "$PWD/pi-localserver-provider.ts" llamacpp qwen3.6-27b
BENCH_THINKING=off ./run-staged-wrap.sh cast-mlx      "$PWD/pi-localserver-provider.ts" mlx      mlx-community/Qwen3.6-27B-4bit

# DeepSeek V4 Flash on ds4, and thinking-mode variation (high):
BENCH_THINKING=off  ./run-staged-wrap.sh cast-ds4 "$PWD/pi-ds4local-provider.ts" ds4local deepseek-v4-flash
BENCH_THINKING=high BENCH_OUT_SUFFIX=-thinkhigh ./run-staged-wrap.sh cast-ds4 "$PWD/pi-ds4local-provider.ts" ds4local deepseek-v4-flash
BENCH_THINKING=high BENCH_OUT_SUFFIX=-thinkhigh ./run-staged-wrap.sh cast-llamacpp "$PWD/pi-localserver-provider.ts" llamacpp qwen3.6-27b

Or drive all six cells (with automatic server start/stop) via ./run-cast-matrix.sh. Judge a single result with cast-judge-prompt.md and confirm full coverage with cast-goal-judge.md:

cd ~/projects/django-cast-clean
pi --no-session -nc --print "$(sed 's/{{MODEL_DESC}}/the model under test/' \
  ~/projects/desktop-django-starter/.bench-qwen36/cast-judge-prompt.md)"

pi --no-session -nc --print "$(cat ~/projects/desktop-django-starter/.bench-qwen36/cast-goal-judge.md)"

Cleanup

pkill -f "llama-server -m .*Qwen3.6-27B"       # free ~16 GB
pkill -f "mlx_lm.server --model mlx-community"  # free ~15 GB
pkill -f "ds4-server -m ds4flash.gguf"          # free ~86 GB
rm -rf ~/projects/django-resume-clean ~/projects/django-wiki-clean ~/projects/django-cast-clean