Part 2 of 2. Part 1 covers the motivations, hardware, and requirements.
Round One: Six Prompts, Six Blueprints
I fed the requirements document to five AI systems: Claude, ChatGPT, Grok, Lumo, and Meta AI. Each received the same document. The prompt was simple:
Prepare Architecture recommendations in a markdown document.
The output was six architecture proposals, ranging from terse (Meta AI with 536 words) to torrential (Claude with 4,255 words). Reading them back to back was instructive. There was significant consensus on the fundamentals – Proxmox VE was recommended by almost every system, ZFS for storage, Docker Compose as the service layer – but they diverged in revealing ways.
One system suggested running k3s across three machines spanning Sandy Bridge, Kaby Lake, and Raptor Lake architectures. Three CPU generations, three GPU generations, heterogeneous storage, aging SATA. (No.) One suggested I flip which machine is primary in a way that would have dedicated my i9-14900K/RTX 4080 Super to light transcription duties while the i7-7700K ran primary AI inference. (Also no.) One documented a recovery point objective of zero, which felt more like aspiration than engineering.
The disagreements were the interesting part. Agreement is cheap. Disagreement forces you to actually think.
Round Two: Three Comparative Analyses
I took the six responses and fed them – all of them, together – to three different systems (Meta AI, Lumo, and Claude) and asked each to compare, rank, and critique. What did they agree on? Where were the meaningful differences? Which recommendations were well-reasoned versus well-marketed?
This round was useful in a specific way: it surfaced the shape of the disagreements. Not just “System A said X and System B said Y,” but why those differences existed and what they revealed about each system’s underlying assumptions. Systems that had been confidently wrong in Round One looked different in aggregate than systems that had been thoughtfully uncertain.
Round Three: The Synthesis
For the final round, I sat down with Claude to write a synthesis – a document that takes everything useful from all six responses, resolves every contested decision with explicit reasoning, and throws out anything over-engineered for this scale. Whether that last point was effectively navigated I’ll leave as an exercise for the reader.
At this stage I also shifted the emphasis from human to AI operation, asking Claude to imagine implementing and operating the entire stack with little to no interaction from me. That framing shaped the output significantly.
The resulting document – Homelab Architecture Synthesis: Claude-Implementable Design – is sitting at version 1.0 in my research folder. It runs to about sixty kilobytes of markdown, which means it’s either comprehensive or I have a problem. Plausibly both.
Where the Contested Decisions Landed
Eight decisions diverged meaningfully across the six responses. Here’s where the synthesis came out:
Proxmox VE. Almost unanimous, and correct. FOSS, first-class ZFS, LXC containers with GPU passthrough, a purpose-built backup server. Unraid has been fine – but, to my sorrow and somewhat unforgivably, it expects to operate as root. That’s a hard thing to build an agentic management model on top of.
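As a taste of what LXC-with-GPU looks like on Proxmox, here's a minimal sketch of an NVIDIA passthrough stanza for a container config. The container ID and device major number are assumptions – check `ls -l /dev/nvidia*` on your own host – and the guest still needs a matching NVIDIA driver installed:

```
# /etc/pve/lxc/101.conf (excerpt) -- container ID and device majors are
# illustrative; verify with `ls -l /dev/nvidia*` on the Proxmox host
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
```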
Machine roles. My desktop – the i9-14900K/RTX 4080 Super – becomes the primary production host. My existing server becomes secondary production and storage. The Sandy Bridge box gets a quiet semi-retirement running AdGuard Home, Uptime Kuma, and dev instances of services. Things where “14-year-old hardware” genuinely doesn’t matter. RIP ultimate resolution on new-release video games.
Terraform + Ansible + Git-controlled Compose files as the IaC stack. This is the decision I’m most excited about. Right now, if my Unraid server died, I’d be reconstructing container configurations from memory and half-remembered XML templates. With this stack, recovery is a `terraform apply`.
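To make that concrete, here's a sketch of what one container looks like when it's declared rather than remembered – using the community Telmate/proxmox provider, with hostname, storage, and addresses all made up for illustration:

```hcl
# Illustrative only: one LXC container declared via the community
# Telmate/proxmox provider. Node, storage, and IPs are placeholders.
resource "proxmox_lxc" "forgejo" {
  target_node  = "pve-primary"
  hostname     = "forgejo"
  ostemplate   = "local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst"
  cores        = 2
  memory       = 2048
  unprivileged = true

  rootfs {
    storage = "tank"
    size    = "16G"
  }

  network {
    name   = "eth0"
    bridge = "vmbr0"
    ip     = "192.168.1.20/24"
    gw     = "192.168.1.1"
  }
}
```

Ansible then configures what's inside the container, and the Compose files in git define the services. The point isn't this exact resource – it's that every one of these decisions lives in version control instead of my head.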
SOPS + Age for secrets management. Encrypted in git. No plaintext credentials in compose files. (Yes, I currently have a Forgejo database running with the password `changeme`. That’s on the list. It’s been on the list.)
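The setup is small enough to sketch. A `.sops.yaml` at the repo root tells SOPS which files to encrypt and for which recipients – the path regex and the Age public key below are placeholders (generate a real keypair with `age-keygen -o keys.txt`):

```yaml
# .sops.yaml (repo root) -- path regex and recipient key are placeholders
creation_rules:
  - path_regex: .*/secrets\.ya?ml$
    age: age1qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
```

With that in place, running `sops` against a matching file opens it decrypted in your editor and re-encrypts on save, so the committed version is always ciphertext.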
Caddy as the reverse proxy. No more bare port numbers on every service URL. Finally becoming an adult.
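A Caddyfile for this is almost embarrassingly short. Hostnames and upstream addresses below are examples; `tls internal` uses Caddy's built-in CA, which is sensible for LAN-only names:

```
# Caddyfile sketch -- hostnames and upstreams are examples
forgejo.lab.internal {
    tls internal
    reverse_proxy 192.168.1.20:3000
}

uptime.lab.internal {
    tls internal
    reverse_proxy 192.168.1.30:3001
}
```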
RPO honesty. Most systems told me my recovery point objective would be zero. One said it would be fifteen minutes and here's why. The honest answer was more useful, even though it was less impressive. ZFS snapshots every fifteen minutes get you to RPO ≤ 15 minutes. Critical databases get WAL archiving to approach true zero. Document what you can actually deliver.
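The fifteen-minute cadence can be as simple as a root crontab entry – pool and dataset names below are assumptions, and a purpose-built tool like sanoid does the same thing (plus pruning old snapshots) more robustly:

```
# root crontab sketch -- dataset name is an assumption; note the
# escaped % signs, which cron would otherwise treat as newlines
*/15 * * * * /usr/sbin/zfs snapshot -r tank/appdata@auto-$(date +\%Y\%m\%d-\%H\%M)
```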
Claude’s Implementation Model
The section I'm most excited about, and most scared of.
None of the Round One responses had an opportunity to address agentic administration – it wasn’t an explicit requirement, and no system volunteered it. The synthesis adds a layer no other document addressed: a defined model for how Claude operates on the infrastructure. What network access it has (WireGuard peer). What credentials (SSH keys, Proxmox API, Forgejo API). What it can do autonomously versus what requires my approval. What it is explicitly never allowed to do without human confirmation. How it responds to Alertmanager incidents – fetching a runbook, executing the procedure, reporting back.
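To show the shape of that model, here's a hypothetical policy file. The schema is my own invention, not any standard – the value is simply that the autonomy tiers become explicit, reviewable, and diffable like everything else in the repo:

```yaml
# Hypothetical policy sketch -- schema invented for illustration
agent: claude
access:
  network: wireguard-peer
  credentials: [ssh-key, proxmox-api-token, forgejo-api-token]
autonomous:            # no human in the loop
  - restart-service
  - execute-runbook
  - collect-diagnostics
requires_approval:     # proposed by the agent, confirmed by me
  - change-firewall-rules
  - modify-zfs-layout
  - terraform-apply
never:                 # hard stops, regardless of confidence
  - destroy-snapshots
  - rotate-root-credentials
```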
The whole thing is designed to be end-to-end manageable by either a human or an AI agent. That discipline turns out to improve the infrastructure design regardless of whether the AI ever actually runs it.
Phase -1
The synthesis has a four-phase implementation roadmap, starting with installing Proxmox on physical hardware, which requires hands and scheduling. But I’m not in Phase 0 yet.
A lot is riding on runbooks that aren’t written. We need a disaster recovery plan for when the AI is unavailable or gets fired. A consistent infrastructure idiom for nomenclature and design choices across disparate surfaces. Baked-in patterns for knowledge reinvestment – making sure the system gets smarter from operating, not dumber.
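None of those runbooks exist yet, but the shape I have in mind looks something like this – a sketch with an invented alert name and invented steps:

```markdown
# Runbook: forgejo-down (sketch -- alert name and steps are invented)

Trigger: Alertmanager `ServiceDown{service="forgejo"}` firing for 5m.

1. Check container status with `docker compose ps` in the stack directory
2. Inspect recent logs for the crash cause
3. If the database is unreachable, follow the database runbook first
4. Restart the stack; confirm the health endpoint responds
5. Report: cause, action taken, whether human follow-up is needed

Escalate without acting: any step that would require restoring data.
```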
I have in mind four core skill documents – shared DNA, different mindsets appropriate to each phase: implementation, management, refinement, dev/testing. That’s where the principles, guardrails, and operating posture get encoded. That work comes before Phase 0.
So: a Phase -1, perhaps?
What Running the Tournament Taught Me
The majority is right more often than any individual. On every major decision where four or five systems agreed, they were right. The outliers were usually chasing novelty or solving problems I don’t have.
But sometimes… honesty about tradeoffs is rare and valuable. The response that told me my RPO was fifteen minutes was more useful than the five that told me it was zero.
No Round One response had an opportunity to address exclusive agentic administration. Few mentioned shared duties, though I'd contemplated that in my requirements. My outcomes might have been substantively different if exclusivity had been a core requirement from the start, though specifying IaC early probably helped. All six gave me plausible results, but the comparative process produced something that feels more considered.
The plan is written. The decisions are made. The document exists. I can still see gaps, which is either a sign of maturity or a sign I need to stop looking.
Send a rescue team if you don’t hear from me in a week.
This is part of an ongoing series about running an obsessively documented homelab and learning something new every time I break it.