BOF.team — Where Funny Meets Forward

Overview

This project asks a deceptively simple question: how much genuinely useful AI can you run entirely under your own control — and where is the line at which you must reach for rented, higher-caliber hardware? For regulated or privacy-sensitive work — health records, legal discovery, financial PII — that line matters enormously, because the moment your data leaves your perimeter a whole set of guarantees changes. The through-line for everything here is sovereignty: keeping both your data and the AI service itself under your own control, not as a nice-to-have but as a tier-one requirement.

The centerpiece is a live demo of a complete sovereign stack: Gemma 4 26B-A4B QAT running on a single RTX 3090 workstation, served by Ollama behind an Open WebUI front end. We run a genuinely useful query locally, then unplug the network cable mid-session to show the model still answering with nothing connected; the data has nowhere to go. From there the demo climbs a deliberate comparison against a rented H100 in a data center. On the speed rung, the same model runs faster but returns the same answer. On the capability rung, a larger 70–120B model that needs the H100's 80 GB gives a better answer than anything the local card can run, at the cost of sending data across the line. That is a trade to make deliberately, not by accident. The line that ties it together: local privacy is guaranteed by physics; cloud privacy is guaranteed by a contract.
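As a rough illustration of the local half of the demo, the sketch below sends one prompt to an Ollama server over its HTTP API on the loopback interface. It assumes Ollama's default port (11434) and its `/api/generate` endpoint. The model tag `gemma-local` is a placeholder rather than a real tag and should be replaced with whatever `ollama list` reports on the workstation. In the demo, Open WebUI sits in front of this same server.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
# Placeholder tag (hypothetical); use the name shown by `ollama list`.
MODEL_TAG = "gemma-local"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request aimed at the loopback interface."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = MODEL_TAG, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.load(resp)
    return body["response"]
```

With the server running, `ask("Summarize the following note: ...")` should return the model's text. Because the URL is a loopback address, the request never touches the network interface, which is why the same call is expected to keep working after the cable is unplugged.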

Key Concepts

The "word calculator": an LLM is autocomplete that read a library — given the text so far, it calculates the most likely next token, appends it, and repeats, feeding its own output back in
A frozen neural network: the model is a fixed grid of learned numbers (parameters); training tunes them, inference only calculates with them — so by default the model neither learns from nor retains your prompt
Attention, transformers, and the KV cache: attention weighs everything in your context to pick the next token, transformers run it at massive scale in parallel, and the conversation's KV cache lives in the GPU's VRAM
The knowledge boundary: what a model knows is its frozen training (with a cutoff date) plus whatever you place in its context window — which turns "where does the model run?" into the real question, "where does my data go?"
VRAM as the binding constraint: the RTX 3090's 24 GB of VRAM (backed by 128 GB of DDR5 system RAM) sets a hard ceiling on model size, making right-sizing the model to the card the core hardware skill
Mixture of Experts (MoE): Gemma 4 26B-A4B activates only 4B parameters per token (the "A4B" in its name), which is what keeps a capable model responsive on a single consumer card
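The VRAM ceiling described above can be approximated with back-of-envelope arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. The sketch below is an estimate under stated assumptions, not a measurement. It assumes 4-bit weights for the QAT checkpoint and a guessed 4 GB of headroom for the KV cache and runtime overhead; both numbers are illustrative. Note that an MoE model still has to hold all of its parameters in memory: activating 4B per token reduces compute per token, not the size of the weights.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed headroom fit in VRAM.

    headroom_gb is a guess covering KV cache and runtime overhead;
    the real figure depends on context length and the serving runtime.
    """
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


# Illustrative cases: (label, parameters in billions, bits per weight, VRAM in GB).
CASES = [
    ("26B at 4-bit on a 24 GB card", 26, 4, 24),
    ("26B at 16-bit on a 24 GB card", 26, 16, 24),
    ("70B at 4-bit on a 24 GB card", 70, 4, 24),
    ("120B at 4-bit on an 80 GB card", 120, 4, 80),
]

for label, params, bits, vram in CASES:
    size = weight_gb(params, bits)
    verdict = "fits" if fits(params, bits, vram) else "does not fit"
    print(f"{label}: ~{size:.1f} GB of weights, {verdict}")
```

Under these assumptions, 26B parameters at 4 bits come to about 13 GB of weights, which leaves room on a 24 GB card, while the same model at 16 bits (about 52 GB) or a 70B model at 4 bits (about 35 GB) does not fit.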

Learning Outcomes

By the end of this session, attendees will be able to:

Run a genuinely useful model on a single consumer GPU, served end-to-end through a self-hosted chat front end
Prove air-gapped operation — serve the model with the network physically disconnected and confirm nothing in the path silently depends on the internet
Enforce access control at the chat front end: per-user accounts, permissions, and audit logs in Open WebUI are the concrete enforcement point, because the model itself has no notion of users or permissions
Distinguish the speed rung from the capability rung: when data-center hardware only buys throughput for the same answer, versus when it is the only hardware on which a larger, more capable model fits
Place a workload on the sovereignty ladder, where a workstation, a colocated box, and a private endpoint all stay above the sovereignty line and a public API crosses it, and apply the utilization break-even: a card kept busy pays for itself, while bursty or low-volume work is usually cheaper to rent
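The utilization break-even mentioned in the last outcome reduces to simple arithmetic. The sketch below is one way to frame it, assuming a one-time purchase cost, an hourly rental rate for comparable capacity, and an hourly running cost (power, hosting) for the owned card. The function names and all figures used with them are hypothetical; real prices should be substituted.

```python
def break_even_hours(purchase_cost: float, rent_per_hour: float,
                     own_running_cost_per_hour: float = 0.0) -> float:
    """Busy hours after which owning becomes cheaper than renting."""
    saving_per_hour = rent_per_hour - own_running_cost_per_hour
    if saving_per_hour <= 0:
        # Owning costs at least as much per hour as renting: it never pays off.
        raise ValueError("no break-even: renting is not more expensive per hour")
    return purchase_cost / saving_per_hour


def break_even_utilization(purchase_cost: float, rent_per_hour: float,
                           lifetime_hours: float,
                           own_running_cost_per_hour: float = 0.0) -> float:
    """Fraction of the card's service life it must be busy to pay for itself."""
    hours = break_even_hours(purchase_cost, rent_per_hour, own_running_cost_per_hour)
    return hours / lifetime_hours


# Hypothetical figures: 1000 currency units to buy, 2.0 per hour to rent.
print(break_even_hours(1000, 2.0))                # busy hours needed
print(break_even_utilization(1000, 2.0, 10_000))  # share of a 10,000-hour life
```

A workload that keeps the card busy well past the break-even fraction favors owning; a workload that only runs in short bursts sits below it and is usually cheaper to rent.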

Deliverables

A documented, reproducible local stack — Ollama, Open WebUI, and Gemma 4 26B-A4B QAT on an RTX 3090 — demonstrating private, self-hosted inference proven air-gapped with the network physically disconnected. Paired with it is a structured comparison against a rented H100 (80 GB): the same model as a speed rung (a faster path to the same answer) and a larger 70–120B model as a capability rung (one that only fits on data-center hardware). Together they frame the capability-and-cost trade as a deliberate decision rather than an accident — directly applicable to organizations with data-residency requirements or air-gapped security postures.

Applied Skills

Local LLM deployment with Ollama and Open WebUI, including per-user authentication as the access-control boundary
Model selection across QAT, MoE, and quantization tradeoffs (Gemma 4 26B-A4B QAT as the primary model; Gemma 4 31B dense QAT as a higher-quality stretch option)
VRAM/RAM sizing and right-sizing a model to a 24 GB card
Air-gap validation of a self-hosted inference stack
Local-vs-rented (H100) capability and cost analysis — reading the speed rung against the capability rung, plus the utilization break-even
Articulating data and service sovereignty and placing workloads on the sovereignty ladder
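For the air-gap validation skill, a small probe can complement physically pulling the cable. The sketch below checks that the local inference port answers while a list of external addresses does not. The local port (11434, Ollama's default) and the external probe target are assumptions, and the function names are hypothetical. A passing probe only shows that no outbound TCP route existed at that moment; it does not by itself prove that nothing in the stack depends on the internet.

```python
import socket
from typing import Callable, Iterable, Tuple

Target = Tuple[str, int]  # (host, port)


def is_reachable(target: Target, timeout: float = 2.0,
                 connect: Callable = socket.create_connection) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        conn = connect(target, timeout)
    except OSError:
        # Covers refused connections, unreachable networks, timeouts, DNS failures.
        return False
    conn.close()
    return True


def air_gap_ok(local: Target, external: Iterable[Target], timeout: float = 2.0,
               connect: Callable = socket.create_connection) -> bool:
    """Air gap holds if the local service answers and no external target does."""
    if not is_reachable(local, timeout, connect):
        return False
    return not any(is_reachable(t, timeout, connect) for t in external)


# Assumed targets: local Ollama port, plus one public address as an outside probe.
LOCAL_SERVICE = ("127.0.0.1", 11434)
EXTERNAL_PROBES = [("1.1.1.1", 443)]
```

Running `air_gap_ok(LOCAL_SERVICE, EXTERNAL_PROBES)` with the cable unplugged and the model server up should return `True`; with the network connected it should return `False`. The `connect` parameter exists so the check can be exercised with a stand-in function instead of real sockets.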