
From “It Works” to “It Thinks”: Building a Real Agentic AI System at Home

Or: why context windows, KV cache, and humility matter more than bigger models.

My last blog post ended on a high note: I had a functional, fully local LLM stack running in my homelab. Models were answering questions, Open-WebUI was humming along, and Ollama had made the early experimentation delightfully easy.

Then I tried to build something… agentic.

That’s where things got interesting.

This post is about what I learned when I pushed past “chatbot that responds” and started building a manager-driven, multi-agent system—one that can plan, delegate, validate, retry, and stream responses cleanly back to a UI. Along the way, I learned some hard truths about context windows, KV cache, model size, quantization, and why Ollama—while excellent for beginners—starts to creak under real architectural pressure.


Ollama Is Great… Until It Isn’t

Let me be clear: Ollama is fantastic.

If you want:

  • quick local inference
  • minimal configuration
  • a friendly on-ramp to local models

…it’s hard to beat.

But once you move into agentic systems, Ollama starts to show its limits.

Agentic systems require:

  • explicit control over context length
  • predictable memory behavior
  • multi-model routing
  • streaming that you can intercept, validate, and rewrite
  • tight integration with external orchestration logic

Ollama abstracts too much of this away. That abstraction is its strength—until you need to reason about why your system is failing.

When I started building a system where one model plans, another classifies, a third executes, and the first one validates, I needed:

  • explicit model servers
  • explicit APIs
  • explicit memory control

That’s when I moved to vLLM.
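
For a sense of what "explicit" means in practice, here is a minimal sketch using vLLM's offline Python API (the model name and numbers are illustrative, not my exact production settings). Context length, memory budget, parallelism, and quantization are all things you decide, not things the runtime guesses for you.

  from vllm import LLM, SamplingParams

  # Every memory-relevant decision is stated up front.
  llm = LLM(
      model="Qwen/Qwen2.5-32B-Instruct-AWQ",   # which weights, in which quantization
      quantization="awq",                      # how they sit in VRAM
      tensor_parallel_size=2,                  # split across both GPUs
      max_model_len=16384,                     # the context window I actually budget for
      gpu_memory_utilization=0.90,             # how much VRAM vLLM is allowed to claim
  )

  params = SamplingParams(temperature=0.2, max_tokens=512)
  out = llm.generate(["Explain the KV cache in one paragraph."], params)
  print(out[0].outputs[0].text)

The same knobs exist as flags on vLLM's OpenAI-compatible server, which is what actually runs behind the gateway described later in this post.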


Context Windows vs KV Cache: Same Sentence, Very Different Things

One of the most persistent misconceptions I had early on was assuming that context window size was the primary constraint on complex tasks.

It isn’t.

Context Window

The context window defines how many tokens a model can see at once. This includes:

  • system prompts
  • user messages
  • tool outputs
  • prior assistant responses

If you exceed it, your request fails. Hard.
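
Everything in that list competes for the same budget, and it fills up faster than you expect. A quick way to see it (a sketch using the Hugging Face tokenizer for one of the Qwen models; the 8K budget is just an example):

  from transformers import AutoTokenizer

  tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

  # System prompt, chat history, and tool output all count against the same window.
  messages = [
      {"role": "system", "content": "You are the planner. Always answer in JSON."},
      {"role": "user", "content": "Here are the scan results. What should we do next?"},
      {"role": "assistant", "content": "First, summarize the open ports..."},
      {"role": "user", "content": "Tool output:\n" + "port 22/tcp open ssh\n" * 500},
  ]

  ids = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
  print(f"{len(ids)} tokens used of an 8192-token window")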

But just because a model supports 32K, 64K, or even 128K tokens does not mean you can use all of that in practice.

KV Cache

The KV (Key–Value) cache is where the model stores the attention keys and values for every token it has already processed, so it doesn't have to recompute them on each generation step. It lives in VRAM, and it scales with:

  • context length
  • number of concurrent sequences
  • precision (FP16 vs INT4, etc.)

Here’s the uncomfortable truth:

You can have a huge context window and still run out of memory instantly.

That’s exactly what happened to me.

Unquantized models happily advertised massive context sizes—right up until they exhausted VRAM loading the model weights, leaving nothing for KV cache. The result was endless CUDA OOM crashes, even at modest context lengths.

This is where the penny dropped:

  • Context window is a logical limit
  • KV cache is a physical limit

You need both.
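
The arithmetic makes the physical limit concrete. A back-of-envelope sketch, assuming a model shaped roughly like Qwen2.5-32B (64 layers, 8 KV heads of dimension 128 under grouped-query attention) and an FP16 cache:

  # KV cache per token = 2 (keys + values) * layers * kv_heads * head_dim * bytes per value
  layers, kv_heads, head_dim, fp16_bytes = 64, 8, 128, 2
  per_token = 2 * layers * kv_heads * head_dim * fp16_bytes   # ~256 KiB per token

  for ctx in (8_192, 32_768, 131_072):
      print(f"{ctx:>7} tokens -> {per_token * ctx / 1024**3:4.0f} GiB of KV cache per sequence")

  # Roughly 2 GiB at 8K, 8 GiB at 32K, 32 GiB at 128K -- per sequence, and only
  # after the model weights have already taken their share of VRAM.

So a 128K window on the model card means nothing if the weights have already eaten most of two 24GB cards.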


Bigger Models Are Not Always Better (Ask My GPUs)

My first instinct—because of course it was—was to go big:

  • Qwen 32B
  • full precision
  • dual RTX 3090s
  • tensor parallelism
  • “What could possibly go wrong?”

Everything.

The Failure Mode

On paper, the math almost worked. In reality:

  • FP16 model weights consumed ~23GB per GPU
  • CUDA graphs and overhead ate more
  • KV cache had nowhere to live
  • Containers entered infinite restart loops

I tried all the usual tuning:

  • smaller context windows
  • gpu-memory-utilization
  • eager mode
  • allocator tweaks

None of it mattered. The model was simply too large unquantized.
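
The back-of-envelope numbers explain why no flag could save it (parameter count and bytes per parameter are approximations; AWQ keeps some layers at higher precision, so the real footprint is a bit larger):

  params = 32e9                 # ~32B parameters
  vram_gb = 2 * 24              # two RTX 3090s

  fp16_gb = params * 2 / 1e9    # ~64 GB of weights: the cards fill before the KV cache gets a byte
  awq_gb = params * 0.5 / 1e9   # ~16 GB nominal at INT4, closer to ~19 GB in practice

  print(f"FP16 weights: {fp16_gb:.0f} GB against {vram_gb} GB of VRAM")
  print(f"AWQ INT4    : {awq_gb:.0f} GB, leaving real room for KV cache and overhead")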

Quantization to the Rescue

The breakthrough was AWQ INT4 quantization.

By switching to quantized models:

  • model memory dropped by ~4×
  • inference quality loss was negligible
  • KV cache suddenly had room to breathe

This wasn’t theoretical—it was immediately observable in stable containers and usable token capacity.

The final lineup looked like this:

  • Reasoner: Qwen2.5-32B-Instruct-AWQ on dual 3090s
  • Coder: Qwen2.5-Coder-7B-AWQ on a 3080 Ti
  • Summarizer / Classifier: Qwen2.5-3B on a 3070

Each model was right-sized for its job, not its ego.
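
Because each specialist is just another OpenAI-compatible vLLM endpoint, the whole lineup reduces to a small routing table inside the gateway. Hostnames here are placeholders; the ports match the architecture diagram further down.

  # Hypothetical routing table used by the Manager Gateway.
  SPECIALISTS = {
      "reasoner":   {"base_url": "http://server-a:8001/v1", "model": "Qwen2.5-32B-Instruct-AWQ"},
      "coder":      {"base_url": "http://server-b:8011/v1", "model": "Qwen2.5-Coder-7B-AWQ"},
      "summarizer": {"base_url": "http://server-b:8012/v1", "model": "Qwen2.5-3B"},
  }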


The Manager Gateway: Where the System Became Agentic

Even with multiple models running, nothing “agentic” happens automatically.

Open-WebUI can only talk to one endpoint. Without intervention, your biggest model will happily do everything itself—including writing code it shouldn’t be writing.

The solution was the Manager Gateway.

This is a FastAPI service that:

  • exposes a single OpenAI-compatible API
  • receives every request from Open-WebUI
  • orchestrates the multi-agent workflow behind the scenes
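
The skeleton of that surface is small. A simplified sketch (names are illustrative, streaming and error handling omitted; run_workflow is the orchestration loop sketched after the workflow diagram below):

  from fastapi import FastAPI
  from pydantic import BaseModel

  app = FastAPI()

  class ChatRequest(BaseModel):
      model: str
      messages: list[dict]
      stream: bool = False

  async def run_workflow(messages: list[dict]) -> str:
      # Placeholder: the real plan/classify/execute/validate loop is sketched below.
      return "TODO"

  @app.post("/v1/chat/completions")
  async def chat_completions(req: ChatRequest):
      # Open-WebUI thinks it is talking to one model; the gateway decides what
      # actually happens behind the scenes.
      answer = await run_workflow(req.messages)
      return {
          "object": "chat.completion",
          "model": req.model,
          "choices": [{
              "index": 0,
              "message": {"role": "assistant", "content": answer},
              "finish_reason": "stop",
          }],
      }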

The Agentic Workflow

┌─────────────────────────────────────────────────────────────┐
│                         USER PROMPT                         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        MANAGER GATEWAY                      │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
                 ┌────────────────────────────┐
                 │ Phase 1: PLANNING          │
                 │ Reasoner:                  │
                 │  - suggest route           │
                 │  - define validation rules │
                 └──────────────┬─────────────┘
                                │
                                ▼
                 ┌─────────────────────────────┐
                 │ Phase 2: CLASSIFICATION     │
                 │ Summarizer returns exactly: │
                 │  {reasoner|coder|summarizer}│
                 └──────────────┬──────────────┘
                                │
                                ▼
                 ┌────────────────────────────┐
                 │ Phase 3: EXECUTION         │
                 │ Route request to specialist│
                 │  - coder OR summarizer OR  │
                 │    reasoner                │
                 └──────────────┬─────────────┘
                                │
                                ▼
                 ┌────────────────────────────┐
                 │ Phase 4: VALIDATION        │
                 │ Reasoner checks output vs  │
                 │ criteria. If fail:         │
                 │  - inject feedback         │
                 │  - retry (max N)           │
                 └──────────────┬─────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────┐
│                       FINAL RESPONSE                        │
└─────────────────────────────────────────────────────────────┘
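
Compressed into code, the four phases are a loop with the reasoner on both ends. This is a sketch that fills in the run_workflow placeholder from the gateway skeleton above; the prompts are heavily abbreviated, and the real gateway parses the plan far more defensively.

  import json
  from openai import AsyncOpenAI

  async def call_model(role: str, prompt: str) -> str:
      # SPECIALISTS is the role -> {base_url, model} routing table shown earlier.
      cfg = SPECIALISTS[role]
      client = AsyncOpenAI(base_url=cfg["base_url"], api_key="not-used-locally")
      resp = await client.chat.completions.create(
          model=cfg["model"], messages=[{"role": "user", "content": prompt}])
      return resp.choices[0].message.content

  async def run_workflow(messages: list[dict], max_retries: int = 2) -> str:
      task = messages[-1]["content"]

      # Phase 1: the reasoner plans and defines what "good" looks like.
      plan = json.loads(await call_model("reasoner",
          "Return JSON with 'suggested_route' and 'validation_rules' for this task:\n" + task))

      # Phase 2: the small model must answer with exactly one route token.
      route = (await call_model("summarizer",
          "Answer with exactly one word (reasoner, coder or summarizer):\n" + task)).strip().lower()
      if route not in SPECIALISTS:
          route = plan.get("suggested_route", "reasoner")

      # Phases 3 + 4: execute with the specialist, validate with the reasoner, retry on failure.
      feedback = ""
      for _ in range(max_retries + 1):
          draft = await call_model(route, task + feedback)
          verdict = await call_model("reasoner",
              f"Check this against {plan['validation_rules']}. Reply PASS or explain the failure:\n{draft}")
          if verdict.strip().upper().startswith("PASS"):
              return draft
          feedback = "\n\nReviewer feedback, please address it: " + verdict
      return draft  # best effort after exhausting retries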

This entire system—routing, retries, validation, streaming translation—lives in the gateway, not in prompt gymnastics.

That’s the key distinction.


What’s Done (So Far)

At this point, the system has:

  • Fully local inference
  • Multiple specialized models across two servers
  • Explicit context + memory control
  • Quantized, hardware-aligned models
  • Manager Gateway with planning, routing, validation, and retries
  • Open-WebUI integration that “just works”

High-Level Architecture

                         ┌───────────────────────────────┐
                         │          Open WebUI           │
                         │  (UI: chats + streaming UX)   │
                         └───────────────┬───────────────┘
                                         │  OpenAI-compatible /v1
                                         ▼
                         ┌───────────────────────────────┐
                         │       Manager Gateway         │
                          │ (FastAPI: routing + workflow) │
                         └───────┬───────────┬───────────┘
                                 │           │
                                 │           │
                      ┌──────────▼───┐   ┌───▼───────────┐
                      │ Server A     │   │   Server B    │
                      │ (2x RTX3090) │   │(3080Ti + 3070)│
                      └───────┬──────┘   └───────┬───────┘
                              │                  │
                              │                  │
                     ┌─────────▼─────────┐  ┌─────▼────────────────┐
                     │ vLLM Reasoner     │  │ vLLM Coder           │
                     │ Qwen2.5-32B AWQ   │  │ Qwen2.5-Coder-7B AWQ │
                     │ :8001             │  │ :8011                │
                     └───────────────────┘  └─────┬────────────────┘
                                                 │
                                                 │
                                         ┌───────▼─────────────┐
                                         │ vLLM Summarizer     │
                                         │ Qwen2.5-3B          │
                                         │ :8012               │
                                         └─────────────────────┘

And Now the Real Fun Begins

Up to now, I’ve been building infrastructure.

What comes next is purpose:

  • the idea of what I am building all of this for
  • in 2.5 words: “Automated Pentesting”

The system also still needs more:

  • richer tool execution
  • persistent memory beyond prompts
  • smarter planning heuristics
  • better validation
  • observability and metrics

Ollama got me started. vLLM gave me control (and taught me a lot about LLMs along the way). The Manager Gateway gave the system a basic brain… sort of, anyway.

Now it’s finally time to see what this thing can do.

Written by David and his AI assistant.¹

  1. There is a lot of work here, and I tend to go down technical rabbit holes; this post would have been at least three times longer, but I used AI to help keep me a little more on track and keep things a little higher level. ↩︎