BOF.team
active · Wave 1 · Foundations

Local LLM Primer & Private Client Demo

Stand up a sovereign local AI stack — Gemma 4 26B-A4B QAT on a single RTX 3090 via Ollama and Open WebUI, proven air-gapped — then weigh it against a rented H100 to map the capability/cost ladder for privacy-sensitive work.

llm · local-inference · hardware · sovereignty · ollama

Overview

This project asks a deceptively simple question: how much useful AI can you run entirely under your own control, and at what point must you reach for rented, higher-end hardware? For regulated or privacy-sensitive work (health records, legal discovery, financial PII) that line matters, because once data leaves your perimeter its protection depends on someone else's systems and promises rather than your own. The through-line for everything here is sovereignty: keeping both your data and the AI service itself under your own control, as a tier-one requirement rather than a nice-to-have.

The centerpiece is a live demo of a complete sovereign stack: Gemma 4 26B-A4B QAT running on a single RTX 3090 workstation, served by Ollama behind an Open WebUI front end. We run a useful query locally, then unplug the network cable mid-session to show the model still answering with nothing connected; the data has nowhere to go. From there the demo steps through a deliberate comparison against a rented H100 in a data center. On the speed rung, the same model runs faster but returns the same answer. On the capability rung, a larger 70–120B model that needs the H100's 80 GB gives a better answer than the local card can produce, at the cost of sending data across the line: a trade to make deliberately, not by accident. The line that ties it together: local privacy is guaranteed by physics; cloud privacy is guaranteed by a contract.
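
The local query in the demo can be sketched against Ollama's HTTP API, which by default listens on localhost port 11434. This is a minimal sketch rather than the project's actual script: the model tag `gemma4-26b-a4b-qat` is a hypothetical placeholder (use whatever `ollama list` reports on your machine), and the helper names `build_request` and `ask_local` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4-26b-a4b-qat"  # hypothetical tag; check `ollama list` for the real one


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires a running Ollama server with the model pulled):
#   print(ask_local("Summarize the key obligations in a data-processing agreement."))
```

Because the request targets localhost, the same call should keep working with the network cable unplugged, which is the point the demo makes on stage.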

Key Concepts

  • The "word calculator": an LLM is autocomplete that read a library — given the text so far, it calculates the most likely next token, appends it, and repeats, feeding its own output back in
  • A frozen neural network: the model is a fixed grid of learned numbers (parameters); training tunes them, inference only calculates with them — so by default the model neither learns from nor retains your prompt
  • Attention, transformers, and the KV cache: attention weighs everything in your context to pick the next token, transformers run it at massive scale in parallel, and the conversation's KV cache lives in the GPU's VRAM
  • The knowledge boundary: what a model knows is its frozen training (with a cutoff date) plus whatever you place in its context window — which turns "where does the model run?" into the real question, "where does my data go?"
  • VRAM as the binding constraint: the RTX 3090's 24 GB of VRAM (backed by 128 GB of DDR5 system RAM) sets a hard ceiling on model size, making right-sizing the model to the card the core hardware skill
  • Mixture of Experts (MoE): Gemma 4 26B-A4B activates only 4B parameters per token (the "A4B" in its name), which is what keeps a capable model responsive on a single consumer card
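
The "word calculator" loop and the frozen-parameters idea can be illustrated with a toy. The sketch below is not a transformer: it swaps the neural network for a simple table of "which word followed which" counts, purely to show the shape of the loop (predict the most likely next token, append it, repeat) and the fact that inference only reads the learned numbers. The corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict[str, Counter]:
    """'Training': count which token follows which. After this the table is frozen."""
    tokens = text.split()
    table: dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        table[current][following] += 1
    return table


def generate(table: dict[str, Counter], prompt: str, max_new_tokens: int = 5) -> str:
    """'Inference': repeatedly pick the most likely next token and append it."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        candidates = table.get(tokens[-1])  # read-only lookup; nothing is learned here
        if not candidates:
            break  # nothing ever followed this token in training
        next_token, _count = candidates.most_common(1)[0]
        tokens.append(next_token)  # the output is fed back in as new context
    return " ".join(tokens)


corpus = "the model reads the prompt and the model reads the answer"
table = train_bigram(corpus)
print(generate(table, "the", max_new_tokens=3))  # -> the model reads the
```

A real model conditions on the whole context through attention rather than on the last word alone, but the outer loop has the same structure, and the prompt passes through without changing the stored parameters.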

Learning Outcomes

By the end of this session, attendees will be able to:

  • Run a genuinely useful model on a single consumer GPU, served end-to-end through a self-hosted chat front end
  • Prove air-gapped operation — serve the model with the network physically disconnected and confirm nothing in the path silently depends on the internet
  • Enforce access control at the chat front end: per-user accounts, permissions, and audit logs in Open WebUI are the concrete enforcement point, since the model itself has no notion of users
  • Distinguish the speed rung from the capability rung: when data-center hardware only buys throughput for the same answer, versus when it is the only place a larger, more capable model fits
  • Place a workload on the sovereignty ladder: workstation, colocated box, and private endpoint all stay above the sovereignty line, while a public API crosses it. Then apply the utilization break-even: a card kept busy pays for itself, while bursty or low-volume work is usually cheaper to rent
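
The utilization break-even in the last outcome comes down to simple arithmetic, sketched below. Every number here is a placeholder chosen to keep the math readable, not a quoted price: substitute your own hardware cost, your own running cost per hour (power and upkeep), and the hourly rate of the rental you would otherwise use.

```python
def break_even_hours(hardware_cost: float,
                     local_cost_per_hour: float,
                     rental_rate_per_hour: float) -> float:
    """Busy hours after which owning the card is cheaper than renting."""
    saving_per_hour = rental_rate_per_hour - local_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("renting costs no more per hour than owning; no break-even exists")
    return hardware_cost / saving_per_hour


def months_to_break_even(hours_needed: float, busy_hours_per_month: float) -> float:
    """Convert break-even hours into calendar months at a given utilization."""
    return hours_needed / busy_hours_per_month


# Placeholder figures only (currency units are arbitrary).
hours = break_even_hours(hardware_cost=2000.0,
                         local_cost_per_hour=0.5,
                         rental_rate_per_hour=2.5)
print(hours)                                # 1000.0 busy hours
print(months_to_break_even(hours, 200.0))   # 5.0 months when the card is kept busy
print(months_to_break_even(hours, 10.0))    # 100.0 months for bursty, low-volume use
```

With these placeholders, a card that is busy 200 hours a month pays for itself in 5 months, while one used 10 hours a month takes 100 months, which is the sense in which bursty work is usually cheaper to rent.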

Deliverables

A documented, reproducible local stack — Ollama, Open WebUI, and Gemma 4 26B-A4B QAT on an RTX 3090 — demonstrating private, self-hosted inference proven air-gapped with the network physically disconnected. Paired with it is a structured comparison against a rented H100 (80 GB): the same model as a speed rung (a faster path to the same answer) and a larger 70–120B model as a capability rung (one that only fits on data-center hardware). Together they frame the capability-and-cost trade as a deliberate decision rather than an accident — directly applicable to organizations with data-residency requirements or air-gapped security postures.
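
One way to record the air-gapped state for the documentation is a small connectivity check, sketched here under assumptions: Ollama is on its default port 11434, and `1.1.1.1:443` is used as an arbitrary public address so that no DNS lookup is involved. A failed outbound probe is supporting evidence only; the physical disconnection remains the actual proof. The function names are illustrative.

```python
import socket
from typing import Callable

Probe = Callable[[str, int], bool]


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unreachable networks
        return False


def air_gap_status(probe: Probe = can_connect) -> dict[str, bool]:
    """Check that the local model server answers while the outside world does not."""
    local_up = probe("127.0.0.1", 11434)   # assumed: Ollama on its default port
    internet_up = probe("1.1.1.1", 443)    # arbitrary public IP, chosen for illustration
    return {
        "local_up": local_up,
        "internet_up": internet_up,
        "air_gapped_and_serving": local_up and not internet_up,
    }


# Usage on the demo machine, after unplugging the cable:
#   print(air_gap_status())
```

The probe is passed in as a parameter so the logic can be exercised without touching the network, for example with a stand-in function that reports only localhost as reachable.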

Applied Skills

  • Local LLM deployment with Ollama and Open WebUI, including per-user authentication as the access-control boundary
  • Model selection across QAT, MoE, and quantization tradeoffs (Gemma 4 26B-A4B QAT as the primary model; Gemma 4 31B dense QAT as a higher-quality stretch option)
  • VRAM/RAM sizing and right-sizing a model to a 24 GB card
  • Air-gap validation of a self-hosted inference stack
  • Local-vs-rented (H100) capability and cost analysis — reading the speed rung against the capability rung, plus the utilization break-even
  • Articulating data and service sovereignty and placing workloads on the sovereignty ladder
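
The right-sizing skill above starts with a back-of-the-envelope weight estimate: parameters times bits per weight, divided by eight. The sketch below assumes 4-bit weights for the quantized models and a flat 4 GB allowance for KV cache and runtime overhead; both are illustrative assumptions, and real quantized files typically run somewhat larger than the raw figure.

```python
GB = 1e9  # decimal gigabytes, matching how VRAM sizes are usually quoted


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed overhead allowance fit in the given VRAM."""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


# An MoE model must hold ALL of its parameters in memory, not just the active ones,
# so the 26B total (not the 4B active) is what counts against VRAM.
print(weight_gb(26, 4))    # 13.0 GB of weights at an assumed 4 bits each
print(fits(26, 4, 24))     # True:  fits the RTX 3090's 24 GB
print(fits(26, 16, 24))    # False: the same model at 16-bit does not
print(fits(70, 4, 24))     # False: a 70B model exceeds the local card
print(fits(120, 4, 80))    # True:  a 120B model fits the H100's 80 GB
```

This is why the active-parameter count explains the model's speed, while the total parameter count decides whether it fits on the card at all.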