
AI Adventures – Part Two – Agents and Architecture

When you’re trying to turn your homelab into SkyNet, it helps to start with a blueprint. So before we dive into the debugging hijinks, existential model crises, and browser-induced gaslighting that dominated this chapter, let’s walk through the real architecture I built, the one that turned my homelab into a fully offline, containerized AI city‑state, complete with agents, tooling, translation layers, and a Python sandbox that still hasn’t forgiven me for the constraints I put on it.

🏗️ The Architecture (a.k.a. “How to Build a Tiny AI Metropolis”)

Here’s the high‑level layout I engineered—a stack designed to be local‑only, fully containerized, modular, and capable of running multiple cooperating LangChain agents.

🧱 The Foundation (also known as the hardware)

So here’s what we have to work with:

  • CPU: AMD Ryzen 7 3800X 8-Core Processor @ 4.5GHz
  • RAM: 32 GB @ 3,600 MT/s
  • GPUs: Dual overclocked NVIDIA RTX 3090s, each with 24 GB VRAM @ 10,000 MHz

I overclocked the VRAM a little, from 9,750 MHz to 10,000 MHz, and left the GPU core clocks alone. With LLM inference, the main limitation is VRAM size and speed, not the actual GPU clock speeds. I may eventually drop the core clocks to gain some memory clock speed, but for now our biggest limit is our mere 48 GB of VRAM. I have pondered quadrupling down and getting a pair of RTX PRO 6000s with 96 GB of VRAM each, for a total of 192 GB, but they are not very checking-account friendly, even at half price on eBay…

🧠 Ollama — The Model Workhorse

Runs all local models, including some medium-weight ones like gpt-oss:20b.
This is where the actual LLM inference happens, but Ollama speaks its own dialect of responses—especially for tool calls—so we introduced…

🛡️ Assistant Proxy — The Great Translator

A custom FastAPI service that:

  • Normalizes Ollama output
  • Converts Ollama tool-call dicts ↔ OpenAI JSON strings
  • Implements /v1/chat/completions and /v1/models
  • Enforces a 131k-token context window on every request
  • Provides fully OpenAI-compatible streaming behavior

    This layer is the diplomatic corps between all services.
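
To make that less abstract, here is a minimal sketch of the shape of that proxy, assuming a FastAPI app talking to Ollama at a container hostname of ollama:11434 (the endpoint paths and the 131k context enforcement match what the service does; the helper code and trimmed response fields are simplified, not the production code):

    # assistant-proxy, heavily condensed: an OpenAI-compatible façade over Ollama
    import httpx
    from fastapi import FastAPI, Request

    app = FastAPI()
    OLLAMA_URL = "http://ollama:11434"   # assumed container hostname
    NUM_CTX = 131_072                    # the enforced context window

    @app.get("/v1/models")
    async def list_models():
        # Ask Ollama what it has pulled and reshape it into OpenAI's model-list format
        async with httpx.AsyncClient() as client:
            tags = (await client.get(f"{OLLAMA_URL}/api/tags")).json()
        return {"object": "list",
                "data": [{"id": m["name"], "object": "model"} for m in tags.get("models", [])]}

    @app.post("/v1/chat/completions")
    async def chat_completions(request: Request):
        body = await request.json()
        payload = {
            "model": body["model"],
            "messages": body["messages"],
            "stream": False,
            "options": {"num_ctx": NUM_CTX},  # nothing gets through with a small context
        }
        async with httpx.AsyncClient(timeout=None) as client:
            resp = (await client.post(f"{OLLAMA_URL}/api/chat", json=payload)).json()
        # Reshape Ollama's reply into an OpenAI-style chat completion (id/usage fields omitted)
        return {"object": "chat.completion",
                "model": body["model"],
                "choices": [{"index": 0,
                             "message": resp.get("message", {}),
                             "finish_reason": "stop"}]}

The real service also handles streaming and the tool-call translation described further below.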

🖥️ Open WebUI — Mission Control

Our primary user interface.
Instead of talking to Ollama directly, it talks to the Assistant Proxy using the OpenAI API.
This allows:

  • Multi‑agent orchestration
  • Context handling
  • Streaming
  • Tool calling
  • File downloads

    Everything goes through one unified UI.

🔒 Python Sandbox — A Safe, Disposable Code Execution Environment

A Docker-in-Docker sandbox used to execute arbitrary Python safely.
I originally tried using mounted temp files, but reality had opinions.
I pivoted to directly executing code via python -c inside the container—simpler, cleaner, more obedient.

🧩 LangChain Multi-Agent Service — The Brains of the Operation

A standalone container running multiple agents:

  • A resume agent (reads, writes, evaluates documents, outputs DOCX)
  • A code agent (executes Python in the sandbox)
  • Other agents that I will talk about in later posts
  • Tools for file management, evaluation, and document generation

This service also exposes:

  • An /outputs directory mounted to the host
  • A /download/{filename} endpoint for retrieving generated files
  • A clean REST interface via LangServe-style routes

Everything plugs neatly into the Open WebUI → Assistant Proxy → LangChain pipeline.

🕸️ How All Components Talk to Each Other

Open WebUI
     ↓ (OpenAI API)
Assistant Proxy
     ↓ (Ollama API + normalized tool calls)
Ollama ←→ LangChain Agents ←→ Python Sandbox
     ↓
Host-mounted /outputs directory
     ↓
Download links presented to user

The result?
A fully offline multi-agent AI system that behaves like a miniature cloud platform, almost… but running entirely in my basement.


🚀 Building the Multi-Agent AI Stack

The plan was simple: build a fully offline, containerized, multi-agent AI system, with one agent that could read files, process job descriptions, generate tailored documents, and—because I’m me—also evaluate them. Another agent can execute arbitrary code, but that is for later. Today’s post is about getting the framework working with a simple workflow: resume building.

To make this happen, I deployed a stack that would make DevOps engineers blink twice:

  • Ollama for running the models
  • Assistant Proxy to make everything speak OpenAI API
  • Open WebUI as the command center
  • A Python sandbox running Docker-in-Docker, because obviously code execution has to be safe… for now…
  • A LangChain multi-agent service, armed with tools and zero hesitation

That’s right. I built a small AI city-state. And it will grow as I have plans for more tasks.


🤖 When APIs Start Arguing

Very early on, we discovered that Ollama and OpenAI have creative differences in how they handle tool calls. Ollama happily returns dicts. OpenAI insists on JSON strings. LangChain pretends both should “just get along.”

They did not.

Enter custom conversion functions, revised endpoints, and a proxy that now serves as marriage counselor between competing LLM formats. The assistant-proxy grew a full OpenAI-compatible layer (/v1/models, /v1/chat/completions) and learned how to translate tool calls both ways without dropping arguments on the floor or hallucinating data types.
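
The heart of that translation is small but fussy. Here’s a hedged sketch of the idea (the function names are mine, not the proxy’s actual API): Ollama returns tool-call arguments as a dict, OpenAI clients expect a JSON string, and the proxy converts in both directions.

    # Sketch of the two-way tool-call translation, based on the common shapes of both formats
    import json

    def ollama_to_openai(tool_calls):
        """Ollama arguments arrive as a dict; OpenAI wants a JSON string."""
        converted = []
        for i, call in enumerate(tool_calls or []):
            fn = call.get("function", {})
            converted.append({
                "id": call.get("id", f"call_{i}"),
                "type": "function",
                "function": {"name": fn.get("name"),
                             "arguments": json.dumps(fn.get("arguments", {}))},
            })
        return converted

    def openai_to_ollama(tool_calls):
        """OpenAI arguments are a JSON string; Ollama wants a dict."""
        converted = []
        for call in tool_calls or []:
            fn = call.get("function", {})
            args = fn.get("arguments", "{}")
            converted.append({
                "function": {"name": fn.get("name"),
                             # Tolerate either form so no arguments get dropped on the floor
                             "arguments": json.loads(args) if isinstance(args, str) else args},
            })
        return converted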


🔒 The Python Sandbox Learns Who’s Boss

I had a vision: a pristine, isolated environment where code could run safely without risk to the host.

The sandbox had a different vision: existential freedom.

The original approach used temp files mounted into Docker-in-Docker containers, which sounded good on paper and failed spectacularly in reality. Files created inside the sandbox container weren’t visible to the host Docker daemon that was supposed to run the “inner” containers. The result was a sort of quantum file system: the files existed and didn’t exist at the same time, depending on which process you asked.

The fix? I stopped negotiating and simply executed code directly with python -c inside the sandbox container. No shared temp files, no awkward mount gymnastics. The sandbox still runs nested Docker where needed, but with a much simpler and more predictable execution path. It’s now obedient, probably resentful, but obedient nonetheless.
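
The execution path now boils down to something like this (a minimal sketch; the real tool layers on timeouts, image pinning, and resource limits I’m glossing over, and the image name here is just a placeholder):

    # Simplified sandbox execution: no temp-file mounts, just python -c in a throwaway container
    import subprocess

    def run_in_sandbox(code: str, timeout: int = 30) -> str:
        """Run untrusted Python inside a disposable container and return its output."""
        result = subprocess.run(
            ["docker", "run", "--rm",
             "--network", "none",      # the untrusted code gets no network
             "--memory", "512m",       # and a memory cap so it can't eat the host
             "python:3.11-slim",
             "python", "-c", code],    # the code travels as an argument, not a mounted file
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout if result.returncode == 0 else result.stderr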


🧠 Building a Workforce with LangChain

Once everything communicated properly, I introduced multiple agents. Suddenly, I had:

  • A code agent with a Python execution tool
  • A resume agent with tools to read, write, and evaluate documents
  • A system prompt so strict it may qualify as a military order

The resume agent can read a master resume, apply ATS-friendly formatting rules, generate a .docx file, and then run that resume through a three-stage evaluation pipeline (ATS → HR → Hiring Manager) that’s perfectly willing to tell me, “No, you probably shouldn’t apply for this one.”
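
To give a feel for that pipeline, here is a rough sketch of the three-pass evaluation (the prompts are heavily abbreviated and the model wiring is illustrative, not the exact configuration in the service):

    # Sketch of the three-stage evaluation: ATS -> HR -> Hiring Manager
    from langchain_ollama import ChatOllama

    llm = ChatOllama(model="gpt-oss:20b", base_url="http://ollama:11434")  # assumed wiring

    STAGES = {
        "ATS": "Score this resume against the job description for keyword and formatting fit. Be blunt.",
        "HR": "Review this resume as an HR screener: experience level, gaps, red flags. Be blunt.",
        "Hiring Manager": "Review this resume as the hiring manager: can this person actually do the job? Be blunt.",
    }

    def evaluate_resume(resume_text: str, job_description: str) -> dict:
        """Return one blunt verdict per stage—honesty over flattery."""
        verdicts = {}
        for stage, instructions in STAGES.items():
            prompt = f"{instructions}\n\nJOB DESCRIPTION:\n{job_description}\n\nRESUME:\n{resume_text}"
            verdicts[stage] = llm.invoke(prompt).content
        return verdicts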

I also wired in a dedicated output directory and volume mounts so resumes could survive container boundaries. More on that in a moment.


🧵 Streaming, Versions, and Other Shenanigans

This is the part of the story where all the “minor details” conspired to become major incidents.

📂 Files, Paths, and the Great Escape from the Container

First, the file problem.

The resume agent generates Word documents inside the langchain-service container. From the model’s point of view, they lived in something like:

  • Inside container: /app/outputs/Your_Tailored_Resume.docx

Which is great—unless you’re a human using a browser outside the container who would rather not docker exec into anything just to grab a resume.

To solve this, we did three things:

  1. Defined a clear output directory inside the container
    The service writes all generated resumes to an OUTPUT_DIR, e.g. /app/outputs.

  2. Mounted that directory to the host
    In docker-compose.yml we wired it up so that /app/outputs (in the container) maps to something like ./outputs on the host (e.g., /opt/stacks/localLLMs/outputs). That means any .docx created by the agent is instantly visible to the host filesystem.

  3. Exposed a proper download endpoint
    The first iteration of the tool helpfully returned paths like:
   file:///app/outputs/Your_Tailored_Resume.docx

Which, from a browser’s perspective, might as well be:
“Somewhere you can’t go, peasant!”

The fix was to add a dedicated HTTP endpoint to langchain-service:

   GET /download/{filename}

That endpoint:

  • Validates the filename (no ../ shenanigans)
  • Looks for the file in OUTPUT_DIR
  • Serves it back with the correct MIME type for Word documents
  • Returns an actual HTTP URL like: http://<langchain-host>:9100/download/Your_Tailored_Resume.docx
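
In FastAPI terms, the endpoint boils down to something like this (a sketch of the shape, not the exact code in langchain-service):

    # Sketch of the /download/{filename} endpoint in langchain-service
    import os
    from fastapi import FastAPI, HTTPException
    from fastapi.responses import FileResponse

    app = FastAPI()
    OUTPUT_DIR = "/app/outputs"
    DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

    @app.get("/download/{filename}")
    async def download(filename: str):
        # Reject anything trying to escape the output directory (no ../ shenanigans)
        safe_name = os.path.basename(filename)
        path = os.path.join(OUTPUT_DIR, safe_name)
        if safe_name != filename or not os.path.isfile(path):
            raise HTTPException(status_code=404, detail="File not found")
        return FileResponse(path, media_type=DOCX_MIME, filename=safe_name)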

🧠 When Context Windows Attack (or: Why VRAM Usage Matters)

Next, the context window saga.

At one point, the model started asking me for the job description… that I had already pasted into the conversation. Repeatedly. It was like arguing with a forgetful oracle.

Two clues stood out:

  1. The model kept “forgetting” earlier parts of the conversation
    It would behave as if the job description simply never existed.

  2. GPU memory usage looked suspiciously low
    For a model that should be chewing through a decent chunk of VRAM with long contexts, usage stayed too small, like it had a tiny context window configured.

This is where Part 1 of the AI adventures paid off: I had already learned that context size is everything. If the context window is too small, your giant job description gets silently truncated, and the model politely gaslights you about what it remembers.

The fix was implemented in the assistant-proxy:

  • We started explicitly injecting a large num_ctx value into every Ollama call.
  • The context window was bumped up to 131,072 tokens.
  • Both the Ollama-native /api/chat and OpenAI-compatible /v1/chat/completions endpoints were updated so nothing slipped through without the larger context configured.

We also cleaned up the LangChain LLM initialization to remove invalid or conflicting parameters, letting Ollama handle the context logic cleanly.
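
The enforcement itself is tiny. Something like the helper below (the name is mine, not the proxy’s) gets applied to every outgoing payload in both the /api/chat and /v1/chat/completions handlers before it is forwarded to Ollama:

    # Sketch: pin the large context window onto every outgoing Ollama payload
    NUM_CTX = 131_072  # 131k tokens

    def with_big_context(payload: dict) -> dict:
        """Return a copy of the request payload with num_ctx set, regardless of what the caller sent."""
        options = dict(payload.get("options") or {})
        options["num_ctx"] = NUM_CTX
        return {**payload, "options": options}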

Once that was in place, VRAM usage finally looked like the model was awake, and, more importantly, it stopped pretending it had never seen the job description.

🌊 Streaming: When “Done” Isn’t Done

Then there was the streaming issue.

Open WebUI expects streamed responses in a particular pattern. The agent, meanwhile, was doing this:

  1. Send a first chunk with content and "done": false
  2. Send a final chunk with "done": true… and no content

From a human perspective, that seems reasonable. From Open WebUI’s perspective, it looked like:

“Cool, thanks for the empty message, I’ll just wait here for the actual last bit.”

So the UI would just sit there, happily waiting forever, while the backend logs insisted everything had finished successfully.

Since LangChain agents don’t really do token-by-token incremental streaming, the solution was to stop pretending they did:

  • I changed the implementation so that the agent runs to completion.
  • Then we send one single streaming chunk back to Open WebUI:
  {
    "model": "resume-agent:latest",
    "message": { "role": "assistant", "content": "..." },
    "done": true
  }

No partial fragments, no empty “done” frame at the end. Just one clean, final message that Open WebUI is happy to render.
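
Under the hood, that pattern is just a generator that waits for the agent to finish and then emits a single newline-delimited JSON frame. A hedged sketch (the surrounding handler is simplified away):

    # Sketch: "streaming" that runs to completion first, then sends exactly one final chunk
    import json
    from fastapi.responses import StreamingResponse

    def stream_final_answer(answer_text: str) -> StreamingResponse:
        """Wrap a completed agent answer as a single Ollama-style streaming frame."""
        async def one_chunk():
            chunk = {
                "model": "resume-agent:latest",
                "message": {"role": "assistant", "content": answer_text},
                "done": True,
            }
            yield json.dumps(chunk) + "\n"   # one clean final frame, no empty "done" trailer
        return StreamingResponse(one_chunk(), media_type="application/x-ndjson")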

🧼 The Browser Was Lying to Me

Finally, the most humbling bug: browser caching.

At one point, the agent appeared to stop responding:

  • The backend logs showed requests coming in and completing with HTTP 200.
  • The resume files were being created in the outputs folder.
  • The endpoints were doing their job.

And yet… nothing visible in the UI.

The culprit turned out to be stale frontend assets cached by the browser. Somewhere along the way, I’d updated enough moving pieces that Open WebUI and my browser no longer agreed on what the current reality was.

A good old-fashioned Ctrl + F5 (hard reload) cleared the cache, forced fresh assets to load, and the “frozen” behavior vanished instantly. It was the web equivalent of “Did you try turning it off and on again?”, and yes, it worked.


📄 The Final Form

At the end of this adventure, I now have a system that:

  • Creates tailored resumes
  • Generates DOCX files
  • Evaluates fit from an ATS, HR, and hiring-manager perspective
  • Scores honesty over flattery
  • Provides download links
  • Handles massive job descriptions
  • And stays entirely inside the homelab

It’s fast, reliable, and—best of all—doesn’t hallucinate a fictional PhD from “Harvard Technical Institute of Midwest Florida.”


🕰️ The Real Timeline

Now, that was all a pretty short read. In reality, it represents about two days of work and troubleshooting, roughly 20 hours in total. (Yes, my days are 10 hours long; this is what I do for fun.) I also haven’t finished tuning the resume specifications I’m giving the agent, which will probably take an additional day or two.


⏭️ Coming Up in Part 3

Do I stop here?
Absolutely not.

Next up:
A supervisor agent to rule them all! Or really just an agent to delegate tasks to other more specialized agents. A skill I had to learn the hard way in the last two years. Because even I have a context window limit, and delegating to others can help you keep track of the bigger picture. This also helps the LLMs manage the context size that they are processing and make sure they don’t get overloaded and start talking to you like they have lost their digital minds. (Because when context overflows that is exactly what happens.)

Stay tuned. The AI city-state must grow!


Written by David, made funnier with AI.