Overview
This project asks a deceptively simple question: how much genuinely useful AI can you run entirely under your own control — and where is the line at which you must reach for rented, higher-caliber hardware? For regulated or privacy-sensitive work — health records, legal discovery, financial PII — that line matters enormously, because the moment your data leaves your perimeter a whole set of guarantees changes. The through-line for everything here is sovereignty: keeping both your data and the AI service itself under your own control, not as a nice-to-have but as a tier-one requirement.
The centerpiece is a live demo of a complete sovereign stack: Gemma 4 26B-A4B QAT running on a single RTX 3090 workstation, served by Ollama behind an Open WebUI front end. We run a genuinely useful query locally, then unplug the network cable mid-session to show the model still answering with nothing connected; the data has nowhere to go. From there the demo steps through a deliberate two-rung comparison against a rented H100 in a data center. On the speed rung, the same model runs faster but returns the same answer. On the capability rung, a larger 70–120B model that needs the H100's 80 GB gives a better answer than the local card can produce, at the cost of sending data across the line. That is a trade to make deliberately, not by accident. The line that ties it together: local privacy is guaranteed by physics; cloud privacy is guaranteed by a contract.
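As a rough sketch of what a local query looks like underneath the chat front end, the snippet below posts a prompt to Ollama's HTTP API on its default local port (11434). The model tag `gemma4:26b-a4b-qat` is a placeholder rather than a confirmed tag, and the helper names are invented for illustration; substitute whatever `ollama list` reports on your machine.

```python
import json
import urllib.request

# Ollama's default local endpoint; the request goes to loopback only.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
# Hypothetical model tag for illustration; check `ollama list` for the real one.
MODEL_TAG = "gemma4:26b-a4b-qat"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request aimed at the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask_local("Summarize this clause in plain language: ...")` returns the model's reply as a string. Because the URL is a loopback address, the same call should behave identically with the network cable unplugged.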
Key Concepts
- The "word calculator": an LLM is autocomplete that read a library — given the text so far, it calculates the most likely next token, appends it, and repeats, feeding its own output back in
- A frozen neural network: the model is a fixed grid of learned numbers (parameters); training tunes them, inference only calculates with them — so by default the model neither learns from nor retains your prompt
- Attention, transformers, and the KV cache: attention weighs everything in your context to pick the next token, transformers run it at massive scale in parallel, and the KV cache (the stored attention keys and values for the conversation so far) lives in the GPU's VRAM
- The knowledge boundary: what a model knows is its frozen training (with a cutoff date) plus whatever you place in its context window — which turns "where does the model run?" into the real question, "where does my data go?"
- VRAM as the binding constraint: the RTX 3090's 24 GB of VRAM (backed by 128 GB of DDR5 system RAM) sets a hard ceiling on model size, making right-sizing the model to the card the core hardware skill
- Mixture of Experts (MoE): Gemma 4 26B-A4B activates only 4B parameters per token (the "A4B" in its name), which is what keeps a capable model responsive on a single consumer card
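The "word calculator" loop from the first concept above can be illustrated with a toy stand-in. This is not how a transformer scores tokens: the hand-written probability table is invented purely for illustration, and it looks only at the last token rather than the whole context. What it does show is the shape of the loop (pick the most likely next token, append it, repeat) and the frozen part: the table is read during generation and never modified.

```python
# Toy "frozen model": fixed next-token probabilities keyed by the previous token.
# The table is made up for illustration; a real LLM computes these scores from
# billions of parameters and the full context, not just the last token.
FROZEN_TABLE = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"model": 0.6, "card": 0.4},
    "model": {"runs": 0.7, "fits": 0.3},
    "runs": {"locally": 0.8, "fast": 0.2},
    "locally": {"<end>": 1.0},
}


def next_token(context: list[str]) -> str:
    """Greedy step: return the most likely token given the last token."""
    candidates = FROZEN_TABLE[context[-1]]
    return max(candidates, key=candidates.get)


def generate(max_tokens: int = 10) -> list[str]:
    """Append one token at a time, feeding the output back in as context."""
    context = ["<start>"]
    for _ in range(max_tokens):
        token = next_token(context)
        if token == "<end>":
            break
        context.append(token)
    return context[1:]  # drop the start marker


print(" ".join(generate()))  # the model runs locally
```

Nothing in `generate` writes to `FROZEN_TABLE`, which mirrors the point that inference only calculates with the parameters and does not update them.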
Learning Outcomes
By the end of this session, attendees will be able to:
- Run a genuinely useful model on a single consumer GPU, served end-to-end through a self-hosted chat front end
- Prove air-gapped operation — serve the model with the network physically disconnected and confirm nothing in the path silently depends on the internet
- Enforce access control at the chat front end, where per-user accounts, permissions, and audit logs (via Open WebUI) are the concrete enforcement point the model itself has no notion of
- Distinguish the speed rung from the capability rung: when data-center hardware only buys throughput for the same answer, versus when a better answer requires a larger model that fits only on data-center hardware
- Place a workload on the sovereignty ladder (workstation, colocated box, and private endpoint all stay above the sovereignty line; a public API crosses it) and apply the utilization break-even: a card kept busy pays for itself, while bursty or low-volume work is usually cheaper to rent
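The utilization break-even in the last outcome comes down to simple arithmetic, sketched below. Every figure in the example is a placeholder, not a real price or measurement, and the model assumes a workload that either option can serve. It compares an owned card's amortized cost plus electricity against an hourly rental rate.

```python
def break_even_hours(
    purchase_price: float,
    lifetime_months: int,
    power_kw: float,
    price_per_kwh: float,
    rental_per_hour: float,
) -> float:
    """Busy hours per month above which owning costs less than renting.

    Owning costs:  purchase_price / lifetime_months + power_kw * price_per_kwh * h
    Renting costs: rental_per_hour * h
    Setting the two equal and solving for h gives the break-even point.
    """
    amortized_per_month = purchase_price / lifetime_months
    saving_per_busy_hour = rental_per_hour - power_kw * price_per_kwh
    if saving_per_busy_hour <= 0:
        return float("inf")  # renting is never more expensive per hour
    return amortized_per_month / saving_per_busy_hour


# Placeholder figures for illustration only.
hours = break_even_hours(
    purchase_price=3600.0,   # workstation cost attributed to the GPU workload
    lifetime_months=36,      # amortization period
    power_kw=0.5,            # draw under load
    price_per_kwh=0.20,      # electricity price
    rental_per_hour=2.10,    # cloud GPU hourly rate
)
print(f"Owning wins above roughly {hours:.0f} busy hours per month")
```

With these placeholder inputs the break-even lands at about 50 busy hours per month: a card used more than that pays for itself, while a bursty workload below it is cheaper to rent. The sketch ignores staff time, cooling, and resale value, all of which shift the line.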
Deliverables
A documented, reproducible local stack — Ollama, Open WebUI, and Gemma 4 26B-A4B QAT on an RTX 3090 — demonstrating private, self-hosted inference proven air-gapped with the network physically disconnected. Paired with it is a structured comparison against a rented H100 (80 GB): the same model as a speed rung (a faster path to the same answer) and a larger 70–120B model as a capability rung (one that only fits on data-center hardware). Together they frame the capability-and-cost trade as a deliberate decision rather than an accident — directly applicable to organizations with data-residency requirements or air-gapped security postures.
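One way to document the air-gap claim alongside the physical demo is a small connectivity probe, sketched here under stated assumptions: the local server listens on Ollama's default port 11434, and `1.1.1.1:443` stands in for "any public endpoint" (an IP address is used so the check does not depend on DNS). The function names are invented for illustration. A single probe is supporting evidence, not proof; the unplugged cable remains the actual guarantee.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def air_gap_check(
    local=("127.0.0.1", 11434),   # assumed: Ollama on its default port
    external=("1.1.1.1", 443),    # assumed stand-in for a public endpoint
    probe=can_connect,
) -> dict:
    """Report whether the local server answers while the outside world does not."""
    local_up = probe(*local)
    external_reachable = probe(*external)
    return {
        "local_server_up": local_up,
        "external_reachable": external_reachable,
        "air_gapped_and_serving": local_up and not external_reachable,
    }
```

Running `air_gap_check()` once with the cable plugged in and once unplugged gives a before-and-after record: `external_reachable` should flip to `False` while `local_server_up` stays `True`. The `probe` parameter exists so the logic can be exercised without touching a real network.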
Applied Skills
- Local LLM deployment with Ollama and Open WebUI, including per-user authentication as the access-control boundary
- Model selection across QAT, MoE, and quantization tradeoffs (Gemma 4 26B-A4B QAT as the primary model; Gemma 4 31B dense QAT as a higher-quality stretch option)
- VRAM/RAM sizing and right-sizing a model to a 24 GB card
- Air-gap validation of a self-hosted inference stack
- Local-vs-rented (H100) capability and cost analysis — reading the speed rung against the capability rung, plus the utilization break-even
- Articulating data and service sovereignty and placing workloads on the sovereignty ladder
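The VRAM right-sizing skill listed above can be approximated with back-of-the-envelope arithmetic, sketched below. Two assumptions are baked in and should be checked against the actual model files: the QAT weights are stored at roughly 4 bits per parameter, and about 4 GB is reserved for the KV cache and runtime overhead. Note that for an MoE model the full parameter count must fit in memory; the smaller active count (the "A4B") helps speed, not footprint.

```python
def weight_footprint_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billions * bits_per_weight / 8


def fits_in_vram(
    total_params_billions: float,
    bits_per_weight: float,
    vram_gb: float,
    reserve_gb: float = 4.0,  # assumed headroom for KV cache and runtime overhead
) -> bool:
    """True if the weights plus the reserved headroom fit on the card."""
    needed = weight_footprint_gb(total_params_billions, bits_per_weight) + reserve_gb
    return needed <= vram_gb


# (label, total parameters in billions, assumed bits per weight)
candidates = [
    ("26B MoE, 4-bit", 26, 4),
    ("31B dense, 4-bit", 31, 4),
    ("70B, 4-bit", 70, 4),
    ("120B, 4-bit", 120, 4),
    ("26B MoE, 16-bit", 26, 16),
]

for label, params_b, bits in candidates:
    size = weight_footprint_gb(params_b, bits)
    print(
        f"{label:18s} ~{size:5.1f} GB weights | "
        f"24 GB card: {fits_in_vram(params_b, bits, 24)} | "
        f"80 GB card: {fits_in_vram(params_b, bits, 80)}"
    )
```

Under these assumptions the 26B and 31B models fit on a 24 GB card at 4 bits, while 70B and 120B models need the 80 GB class of hardware, which matches the split between the local card and the rented H100. Longer contexts grow the KV cache, so the reserve should be raised accordingly.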