Run Ollama in Docker: a local fallback for your AI orchestration stack
Frontier APIs go down, don't work offline, and meter every token. A local Ollama in Docker fixes all three in under a minute.
Why your stack needs a local fallback
The case for a local model isn't that it beats the frontier. It doesn't. The case is that it covers four scenarios the frontier can't:
- Cloud outages. Status pages exist for a reason. Your work shouldn't pause when someone else's does.
- Air-gapped environments. Banking, defence, medical, legal. Pasting a codebase into a hosted API is a non-starter, and the list of contexts where that's true keeps growing.
- Volume experimentation. Local inference is free at any rate you can run it. Cloud inference is not.
- Offline work. Trains, planes, conference Wi-Fi. The frontier isn't there. The local model is.
Why Docker, not a native install
The native Ollama installer is excellent for solo use. LM Studio and vLLM are reasonable alternatives if your priority is a polished GUI or production-scale serving. For the specific job of standing up a local model the same way on every developer's machine and every server you deploy it to, use Docker.
- Portable. A compose.yml in the repo means every teammate gets the same Ollama version, the same cache layout, the same port. No brew-vs-apt-vs-winget drift.
- Disposable. Tear it down, rebuild, snapshot the volume, attach it to a different host. The native install is a fixture; the container is furniture.
- Isolated. The model cache lives in a named volume, not buried in ~/.ollama alongside whatever else you've installed. GPU drivers, CUDA toolchains, and Python environments stay out of each other's way.
- Composable. If your dev environment already runs Postgres, Redis, and a couple of services in Compose, the local model becomes part of the same docker compose up.
The fastest path to a running container
Two commands are enough:
docker pull ollama/ollama
docker run -d -p 11434:11434 -v ollama-models:/root/.ollama --name ollama ollama/ollama
NVIDIA users add --gpus=all to the docker run line; nvidia-container-toolkit must already be installed on the host. Apple Silicon users get no equivalent flag: Docker's Linux VM can't reach Metal, so the containerised model runs on the CPU. That's fine for the smaller tiers below; if you need Metal acceleration on a Mac, the native install is the route.
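For NVIDIA hosts, the GPU-enabled variant is the same run command with the flag added (this sketch assumes the toolkit is already installed):
docker run -d --gpus=all -p 11434:11434 -v ollama-models:/root/.ollama --name ollama ollama/ollama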
Confirm it's alive:
curl http://localhost:11434/api/tags
Empty list. Correct: no models yet. The ollama-models named volume is where they'll live, so blowing away the container doesn't blow away gigabytes of weights.
To stop the container when you're done:
docker stop ollama
Restart it later with docker start ollama — same name, same volume, same models.
Optional: remove the stopped container entirely. Do this when you want to recreate it from scratch (e.g. after editing the docker run flags):
docker rm ollama
The ollama-models volume survives the rm; the models don't get re-downloaded next time you start.
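If you want to confirm the cache is still there after the rm, list the volume by name (the filter just narrows the output):
docker volume ls --filter name=ollama-models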
Pulling models that fit your hardware
Ollama ships with a registry. One command, one model, one cache. Pull through the running container:
docker exec ollama ollama pull llama3.1:8b
docker exec ollama ollama pull qwen2.5-coder:7b
Pick the model that fits the RAM or VRAM you actually have, with margin. The numbers below are working estimates for typical 4-bit quantised builds (Ollama's default), not benchmarks. Match a tier to your hardware before you pull anything large:
- 1–3B parameters (llama3.2:1b, llama3.2:3b, phi3:mini, gemma2:2b): around 1–3 GB resident. Runs on an 8 GB MacBook Air, a Raspberry Pi 5, or a four-year-old laptop. Fast first token, weak on multi-step reasoning. Useful for autocomplete, classification, and quick summarisation; not for code generation worth shipping. Best pick: docker exec ollama ollama pull llama3.2:3b
- 7–9B (llama3.1:8b, qwen2.5:7b, qwen2.5-coder:7b, mistral:7b, gemma2:9b): around 5–6 GB. The sweet spot for a 16 GB MacBook Air or any modern laptop with integrated graphics. Capable for chat, code edits, small refactors. qwen2.5-coder:7b is the strong default if your work is coding. Best pick: docker exec ollama ollama pull qwen2.5-coder:7b
- 13–14B (qwen2.5:14b, qwen2.5-coder:14b, phi3:14b, mistral-nemo:12b): around 8–10 GB. A 32 GB MacBook Pro or a desktop with a 16 GB GPU (RTX 4070 Ti, RTX 4080) keeps the whole model resident. Notably better than 7B on long-context coherence and tool-call structure. Best pick: docker exec ollama ollama pull qwen2.5-coder:14b
- 27–34B (gemma2:27b, qwen2.5:32b, qwen2.5-coder:32b, codellama:34b): around 18–22 GB. A 48 GB+ unified-memory Mac, a 24 GB consumer GPU like the RTX 4090, or a 32 GB workstation card. qwen2.5-coder:32b is the best open coding model in this range and the right reach when you actually need a competent local agent, not just a fallback. Best pick: docker exec ollama ollama pull qwen2.5-coder:32b
- 70B+ (llama3.1:70b, llama3.3:70b, qwen2.5:72b): around 40–45 GB. Workstation territory: a 64 GB+ Mac Studio, a dual-GPU box (2× RTX 6000 Ada at 48 GB each), or a single A100 / H100. Frontier-adjacent quality without the cloud, but expect roughly 5–15 tokens per second on consumer-tier hardware in our testing. Best pick: docker exec ollama ollama pull llama3.3:70b
Pick a coding-tuned model for coding work. qwen2.5-coder:7b is the strong default for laptops; qwen2.5-coder:32b is the strong default for workstations. Both are sharp on small edits and weaker than the frontier on long-context reasoning. The frontier they aren't. Useful they are.
One latency footgun. The first request after a model loads pays a cold-start cost while the weights are read into memory; subsequent requests don't. Ollama keeps the model resident for five minutes by default, then evicts it. If your fallback path is hit irregularly, every fallback eats that cold-start hit. Set OLLAMA_KEEP_ALIVE=24h in the container's environment to pin a model in memory across the working day.
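Environment variables only apply when the container is created, so with the imperative docker run that means recreating it. A sketch, same flags as before plus the new variable:
docker stop ollama && docker rm ollama
docker run -d -p 11434:11434 -v ollama-models:/root/.ollama \
  -e OLLAMA_KEEP_ALIVE=24h \
  --name ollama ollama/ollama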
Raising Docker Desktop's memory cap
If a pull succeeds but loading the model fails with model requires more system memory (X GiB) than is available (Y GiB), the limit is Docker's VM, not your hardware. Docker Desktop on macOS and Windows runs containers inside a Linux VM with a fixed memory allocation — around 8 GB by default, regardless of how much RAM the host has. Even a 32 GB MacBook Pro will block a 19 GB model out of the box.
The fix lives in Docker Desktop's own settings: Settings → Resources → Memory. Drag the slider above your largest model with margin (24 GB for a 19 GB model is comfortable), then click Apply & Restart. The Ollama container picks up the new ceiling on next start.
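To see the ceiling the VM currently has, docker info reports it; after Apply & Restart the number should match the slider:
docker info | grep "Total Memory"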
Linux native users are unaffected; containers there share host memory directly with no VM in between.
Cleaning up
Models don't get garbage-collected automatically; a few 14B pulls will fill a developer laptop's drive faster than you'd expect. Two commands keep the cache honest.
List what's resident and how much disk each model takes:
docker exec ollama ollama list
Evict one when you need the space back:
docker exec ollama ollama rm <model-name>
The named volume survives the rm; only the targeted model's weights disappear.
Pulling models that fit the job
Hardware is the floor; the job is the ceiling. The right model for code review isn't the right model for embeddings, and an 8 GB chat model can't see a screenshot. Match the family to the job.
Coding (chat, edits, refactors)
Coder-tuned families read diffs better, hold imports in their head, and follow framework idioms more reliably than general chat models at the same size. qwen2.5-coder:7b on a laptop, qwen2.5-coder:32b on a workstation. Best pick:
docker exec ollama ollama pull qwen2.5-coder:7b
General chat & reasoning
Multi-turn questions, design discussion, planning, code review at a thousand-foot level. llama3.1:8b is the balanced laptop generalist; llama3.3:70b if you have the VRAM. Best pick:
docker exec ollama ollama pull llama3.1:8b
Writing & long-form prose
Drafts, summarisation, release notes, polishing READMEs. Gemma's prose register is noticeably more natural than Llama's at the same size; reach for it when the output is meant for humans, not compilers. Best pick:
docker exec ollama ollama pull gemma2:9b
Embeddings (RAG, semantic search, dedup)
A different model class — tiny, fast, no chat interface. nomic-embed-text is around 270 MB and the de facto default for local retrieval pipelines. Pair it with any chat model for RAG; query it via Ollama's /api/embeddings endpoint, not /v1/chat/completions. Best pick:
docker exec ollama ollama pull nomic-embed-text
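A quick smoke test against the embeddings endpoint looks like this (the prompt text is just a placeholder); the response is a JSON object carrying a single embedding vector:
curl http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "prompt": "the text to embed"}'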
Vision (screenshots, OCR, image Q&A)
Most chat models can't see images. llama3.2-vision:11b is the strongest small open multimodal in Ollama's registry; pull it when you need a model that can read a UI screenshot, describe a diagram, or extract text from a scanned page. Best pick:
docker exec ollama ollama pull llama3.2-vision:11b
Multiple jobs in one stack? Pull more than one. The same Ollama container serves all of them on the same port; switching is a model-field change in the request, not a container rebuild. Disk is the only real cost.
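To make that concrete, here are two requests to the same endpoint that differ only in the model field (both models assumed already pulled; the endpoint itself is covered properly further down):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder:7b", "messages": [{"role": "user", "content": "Refactor this loop."}]}'

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma2:9b", "messages": [{"role": "user", "content": "Draft the release notes."}]}'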
This is enough for one developer on one machine. The moment you want the same setup running the same way on every laptop, formalise it.
The minimum viable compose file
Stop the imperative container first if it's still running:
docker stop ollama && docker rm ollama
Then drop this into compose.yml at the root of your project, or merge it into the file you already have:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    restart: unless-stopped
    # Uncomment if you have an NVIDIA GPU and nvidia-container-toolkit installed.
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

volumes:
  ollama-models:
Bring it up with docker compose up -d. The running state is the same as before: Ollama on port 11434, model cache in the ollama-models named volume. The difference is reproducibility. Every teammate runs the same command and gets the same setup.
If you're on NVIDIA and uncommented the GPU block, verify the container actually sees the card before pulling anything large:
docker compose exec ollama nvidia-smi
If nvidia-smi errors out, your container isn't seeing the GPU. The fix is almost always the host-level nvidia-container-toolkit install, not the compose file.
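On an apt-based distro the host-level sequence is roughly the one below, per NVIDIA's container-toolkit docs (it assumes NVIDIA's apt repository is already configured; see the official install guide for that step):
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker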
Useful Docker commands
Once the container is running, these are the commands you'll reach for routinely:
- docker pull ollama/ollama: fetch the image without starting anything. Useful in CI, in air-gapped environments where you stage images ahead of time, or to refresh the :latest tag without interrupting a running container.
- docker compose logs -f ollama: tail the container logs. The first place to look when a request hangs or returns an empty response.
- docker compose restart ollama: restart the container without rebuilding. Faster than down-then-up when you've only changed an environment variable.
- docker compose down: stop and remove the container. The named volume survives, so your pulled models stay on disk.
- docker exec -it ollama bash: shell into the running container for debugging. From inside, ollama list, ollama show <model>, and nvidia-smi all work without the docker compose exec prefix.
- docker volume inspect ollama-models: shows the host path Docker uses for the model cache. Useful when you want to know where 40 GB of weights actually live.
OpenCode as the front end for your local model
Ollama exposes two HTTP APIs on port 11434: a native one at /api/... and an OpenAI-compatible one at /v1/.... Use the OpenAI-compatible one. Every modern dev tool already speaks it.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-coder:7b",
"messages": [
{"role": "user", "content": "Write a Python function that returns the factorial of n."}
]
}'
Curl is a sanity check, not a workflow. The tool that should sit in front of your local model is OpenCode: an open-source terminal-native coding agent in the same shape as Claude Code, but with bring-your-own-model as a first-class concern. It handles the agentic loop (plan, edit files, run commands, observe, iterate) so the model doesn't have to. Point it at Ollama and you get a fully local agent.
Install:
curl -fsSL https://opencode.ai/install | bash
Add Ollama as a provider in ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "Ollama (local)",
"options": {
"baseURL": "http://localhost:11434/v1"
},
"models": {
"qwen2.5-coder:7b": { "name": "Qwen 2.5 Coder 7B" },
"qwen2.5-coder:32b": { "name": "Qwen 2.5 Coder 32B" }
}
}
}
}
Then run opencode from your project root. The TUI launches, the model loads on first request, and you're working against an agent that lives entirely on your machine.
One context-length gotcha
Agentic loops want more context than Ollama gives by default. If OpenCode's tool calls drift, stall, or fail to parse, the cause is almost always Ollama's default 2k context window starving the agent. Set OLLAMA_CONTEXT_LENGTH=32768 in the container's environment (or bump num_ctx via a Modelfile), and the same model that was misbehaving will start finishing tasks.
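Compose users add the variable under an environment: key on the ollama service and run docker compose up -d again; with the imperative docker run it's a container recreation, sketched here alongside the keep-alive setting from earlier:
docker stop ollama && docker rm ollama
docker run -d -p 11434:11434 -v ollama-models:/root/.ollama \
  -e OLLAMA_CONTEXT_LENGTH=32768 \
  -e OLLAMA_KEEP_ALIVE=24h \
  --name ollama ollama/ollama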
Other clients work too, but OpenCode is the one to reach for first. Quick mentions:
- Aider: aider --model openai/qwen2.5-coder:7b --openai-api-base http://localhost:11434/v1 for diff-aware editing without an agent loop.
- Continue and Cline: IDE-native chat panels; set the base URL and model name in their custom-provider config.
For your own scripts, the official openai Python and Node SDKs accept a base_url override. Same library, different endpoint:
import os

from openai import OpenAI

# Cloud
cloud = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Local Ollama (api_key value is ignored)
local = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

client = local if offline else cloud
resp = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": prompt}],
)
The pattern that matters most in production: wrap both clients behind a thin function of your own that handles retries, timeouts, and a cloud-to-local fallback path:
from openai import APIConnectionError, APITimeoutError

def complete(prompt: str, *, prefer_cloud: bool = True) -> str:
    primary = cloud if prefer_cloud else local
    fallback = local if prefer_cloud else cloud
    try:
        return primary.chat.completions.create(
            model="gpt-5" if prefer_cloud else "qwen2.5-coder:7b",
            messages=[{"role": "user", "content": prompt}],
            timeout=20,
        ).choices[0].message.content
    except (APIConnectionError, APITimeoutError):
        return fallback.chat.completions.create(
            model="qwen2.5-coder:7b",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
That's the whole pattern. Cloud first, local on failure, one timeout to bound the wait. Everything else (retries, model routing, quality gating) layers on top of this one function.
Where local models fall down
The point of running a local model is to know its limits, not to pretend it doesn't have them.
- The quality gap is real. Frontier models are months ahead on reasoning-heavy benchmarks, and the gap widens with long context or complex tool use. Don't expect Claude-level work from a 7B model.
- Tool-calling is patchy. Open models support function calling, but JSON drifts more often and agent loops need more retries. Run agentic flows on the cloud; keep local for chat and completion.
- Context windows are smaller in practice. Advertised length and useful length aren't the same number. Coherence over long documents is where local still struggles most.
- Speed is your hardware. An M4 Mac feels snappy. A four-year-old laptop with no GPU does not. Run it. Time it. Decide.
Run a local model knowing what it's for: the fallback, not the primary. Pretend otherwise and you'll be disappointed by both.
The cloud gives you the smartest model on demand. The local container gives you the model you can't lose. A serious AI orchestration stack runs both. Resilience is part of the craft now.