Blog

  • The AI Architecture Tournament — Three Rounds and Eight Contested Decisions

    Part 2 of 2. Part 1 covers the motivations, hardware, and requirements.


    Round One: Six Prompts, Six Blueprints

    I fed the requirements document to five AI systems: Claude, ChatGPT, Grok, Lumo, and Meta AI. Each received the same document. The prompt was simple:

    Prepare Architecture recommendations in a markdown document.

    The output was six architecture proposals, ranging from terse (Meta AI at 536 words) to torrential (Claude at 4,255 words). Reading them back to back was instructive. There was significant consensus on the fundamentals – Proxmox VE was recommended by almost every system, ZFS for storage, Docker Compose as the service layer – but the disagreements were where it got interesting.

    One system suggested running k3s across three machines spanning Sandy Bridge, Kaby Lake, and Raptor Lake architectures. Three CPU generations, three GPU generations, heterogeneous storage, aging SATA. (No.) One suggested flipping which machine was primary, in a way that would have dedicated my i9-14900K/RTX 4080 Super to light transcription duties while the i7-7700K ran primary AI inference. (Also no.) One documented a recovery point objective of zero, which felt more like aspiration than engineering.

    The disagreements were the interesting part. Agreement is cheap. Disagreement forces you to actually think.


    Round Two: Three Comparative Analyses

    I took the six responses and fed them – all of them, together – to three different systems (Meta, Lumo, and Claude) and asked each to compare, rank, and critique. What did they agree on? Where were the meaningful differences? Which recommendations were well-reasoned versus well-marketed?

    This round was useful in a specific way: it surfaced the shape of the disagreements. Not just “System A said X and System B said Y,” but why those differences existed and what they revealed about each system’s underlying assumptions. Systems that had been confidently wrong in Round One looked different in aggregate than systems that had been thoughtfully uncertain.


    Round Three: The Synthesis

    For the final round, I sat down with Claude to write a synthesis – a document that takes everything useful from all six responses, resolves every contested decision with explicit reasoning, and throws out anything over-engineered for this scale. Whether that last point was effectively navigated I’ll leave as an exercise for the reader.

    At this stage I also shifted the emphasis from human operation to AI operation, asking Claude to imagine implementing and running the entire stack with little to no interaction from me. That framing shaped the output significantly.

    The resulting document – Homelab Architecture Synthesis: Claude-Implementable Design – is sitting at version 1.0 in my research folder. It runs to about sixty kilobytes of markdown, which means it’s either comprehensive or I have a problem. Plausibly both.


    Where the Contested Decisions Landed

    Eight decisions diverged meaningfully across the six responses. Here’s where the synthesis came out:

    Proxmox VE. Almost unanimous, and correct. FOSS, first-class ZFS, LXC containers with GPU passthrough, a purpose-built backup server. Unraid has been fine – but, to my sorrow and somewhat unforgivably, it expects to operate as root. That’s a hard thing to build an agentic management model on top of.

    Machine roles. My desktop – the i9-14900K/RTX 4080 Super – becomes the primary production host. My existing server becomes secondary production and storage. The Sandy Bridge box gets a quiet semi-retirement running AdGuard Home, Uptime Kuma, and dev instances of services. Things where “14-year-old hardware” genuinely doesn’t matter. RIP ultimate resolution on new-release video games.

    Terraform + Ansible + Git-controlled Compose files as the IaC stack. This is the decision I’m most excited about. Right now, if my Unraid server died, I’d be reconstructing container configurations from memory and half-remembered XML templates. With this stack, recovery is a `terraform apply`.
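
    A sketch of what that recovery could look like – every name here (repo URL, inventory, playbook, service paths) is hypothetical rather than lifted from the synthesis document:

    ```shell
    # Hypothetical disaster-recovery flow; the tools are real, the names are illustrative
    git clone https://forgejo.example.lan/homelab/infrastructure.git
    cd infrastructure

    # Recreate VMs and LXC containers on Proxmox from declared state
    terraform init && terraform apply

    # Configure hosts: users, ZFS datasets, Docker, monitoring agents
    ansible-playbook -i inventory/production site.yml

    # Bring services back up from version-controlled Compose files
    docker compose -f services/forgejo/compose.yaml up -d
    ```

    The point isn’t the exact commands – it’s that every one of them reads from version control instead of from my memory.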

    SOPS + Age for secrets management. Encrypted in git. No plaintext credentials in compose files. (Yes, I currently have a Forgejo database running with the password `changeme`. That’s on the list. It’s been on the list.)
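
    For the curious, the shape of that setup is small: a `.sops.yaml` at the repo root tells SOPS which files to encrypt and to whose age key. This fragment is illustrative – the recipient below is an obvious placeholder, not a real key:

    ```yaml
    # .sops.yaml – illustrative; replace the recipient with a real age public key
    creation_rules:
      - path_regex: .*secrets\.enc\.yaml$
        age: age1qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
    ```

    Then `sops -e -i secrets.enc.yaml` encrypts in place, the ciphertext gets committed, and only holders of the age private key can decrypt at deploy time.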

    Caddy as the reverse proxy. No more bare port numbers on every service URL. Finally becoming an adult.
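
    The appeal fits in four lines. A hypothetical Caddyfile entry (hostname and upstream invented for illustration):

    ```
    # Caddyfile – illustrative site block
    forgejo.home.example {
        reverse_proxy forgejo:3000
        tls internal
    }
    ```

    One block per service, HTTPS via Caddy’s internal CA, and the bare port number disappears behind a name.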

    RPO honesty. Most systems told me my recovery point objective would be zero. One said it would be fifteen minutes, and explained why. The honest answer was more useful, even though it was less impressive. ZFS snapshots every fifteen minutes get you to RPO ≤ 15 minutes. Critical databases get WAL archiving to approach true zero. Document what you can actually deliver.
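
    The fifteen-minute cadence is mundane to implement. In practice you’d reach for something like sanoid or zfs-auto-snapshot, but even a bare cron entry illustrates it (the dataset name is hypothetical, and note that cron requires percent signs to be escaped):

    ```shell
    # /etc/crontab – snapshot the services dataset every 15 minutes
    */15 * * * * root zfs snapshot tank/services@auto-$(date +\%Y\%m\%d-\%H\%M)
    ```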


    Claude’s Implementation Model

    The section I’m most excited about, and most scared of.

    None of the Round One responses had an opportunity to address agentic administration – it wasn’t an explicit requirement, and no system volunteered it. The synthesis adds a layer no other document addressed: a defined model for how Claude operates on the infrastructure. What network access it has (WireGuard peer). What credentials (SSH keys, Proxmox API, Forgejo API). What it can do autonomously versus what requires my approval. What it is explicitly never allowed to do without human confirmation. How it responds to Alertmanager incidents – fetching a runbook, executing the procedure, reporting back.
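
    To make that concrete, imagine the permission tiers written down as a ledger. This sketch is entirely mine – the synthesis document defines the model, but these specific entries are illustrative, not a real Claude feature:

    ```yaml
    # Hypothetical agent-operations policy – illustrative entries only
    autonomous:
      - restart unhealthy containers
      - prune images, rotate logs
      - run read-only diagnostics from runbooks
    requires_approval:
      - terraform apply that destroys resources
      - firewall or WireGuard peer changes
    never_without_human:
      - delete ZFS snapshots or backup archives
      - modify Proxmox cluster membership
    ```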

    The whole thing is designed to be end-to-end manageable by either a human or an AI agent. That discipline turns out to improve the infrastructure design regardless of whether the AI ever actually runs it.


    Phase -1

    The synthesis has a four-phase implementation roadmap, starting with installing Proxmox on physical hardware, which requires hands and scheduling. But I’m not in Phase 0 yet.

    A lot is riding on runbooks that aren’t written. We need a disaster recovery plan for when the AI is unavailable or gets fired. A consistent infrastructure idiom for nomenclature and design choices across disparate surfaces. Baked-in patterns for knowledge reinvestment – making sure the system gets smarter from operating, not dumber.

    I have in mind four core skill documents – shared DNA, different mindsets appropriate to each phase: implementation, management, refinement, dev/testing. That’s where the principles, guardrails, and operating posture get encoded. That work comes before Phase 0.

    So: a Phase -1, perhaps?


    What Running the Tournament Taught Me

    The majority is right more often than any individual. On every major decision where four or five systems agreed, they were right. The outliers were usually chasing novelty or solving problems I don’t have.

    But honesty about tradeoffs is rare and valuable. The response that told me my RPO was fifteen minutes was more useful than the five that told me it was zero.

    No Round One response had an opportunity to address exclusive agentic administration. Few of them mentioned shared duties, though I’d contemplated that in my requirements. My outcomes might have been substantively different if exclusivity had been an initial core requirement, though the early specification of IaC probably helped. All six gave me plausible results, but the comparative process produced something that feels more considered.


    The plan is written. The decisions are made. The document exists. I can still see gaps, which is either a sign of maturity or a sign I need to stop looking.

    Send a rescue team if you don’t hear from me in a week.


    This is part of an ongoing series about running an obsessively documented homelab and learning something new every time I break it.

  • The AI Architecture Tournament — Motivation, Resources, and Requirements

    Part 1 of 2. Part 2 covers the three rounds of AI input and the contested decisions.


    There’s a certain kind of homelab project that starts as a reasonable question and ends with you staring at a 60-page architecture document wondering how you got here.

    That’s where I am.


    A Working Mess

    My homelab has been running on Unraid. It works. I have fourteen Docker containers running, a Synology Diskstation for backups, a Home Assistant Green for automation, my gaming PC doing AI inference, and enough Shelly devices to control the lighting in every room of my house. It’s not elegant, but it’s mine, and it mostly does what I want.

    The problem isn’t that it’s broken. The problem is that the more I’ve learned, the more I can see the places where it’s more accreted than designed. More generative than purposeful. No reverse proxy. No infrastructure-as-code. Secrets half-managed. Backups partially verified. A git-watcher script that was never actually version-controlled. The kind of debt that doesn’t break you today but makes every future change a little harder than it needs to be.

    I’d been thinking about a proper rearchitecture for a while. But thinking and doing are different things, and I needed a question sharp enough to actually move on.


    The Question

    What if I put everything on the table? No sacred cows. What would the setup look like if every hardware decision were optimized for what each machine does best, and every software decision were driven by use cases and outcomes rather than favorites or familiarity?

    I’d be deeply interested in the answer to this question, but the legwork to get there is daunting to my ADHD. Thankfully we’ve got AI to offload that kind of cognitive gruntwork to.

    I also wanted to avoid problem-solving with my credit card. A hard constraint of ‘no new purchases’ was applied. Work with what you have.

    So, here’s the full list:

    • Selected configurations are aligned with best practices
    • Infrastructure as code
    • No new hardware. All architecture, method, process, software and OS are fair game.
    • 3-2-1 backups
    • Free and Open Source software
    • Development vs Production environments, where possible
    • End-to-end administrable by human or by agent
    • Rigorous changelog
    • Securely accessible inside and outside the home
    • Tier 2 Availability (redundancy primarily in storage)
    • RTO: 24hrs max
    • RPO: No data loss is acceptable
    • MTTR: 4 hrs
    • Failover: Manual
    • Migration downtime should be minimized, but is the least concern. Home Assistant is the largest concern for migration downtime.

    What I Have to Work With

    Three machines, a Synology, and a Home Assistant Green.

    The primary server is a 2017-era repurposed HP workstation — i7-7700K, 32GB DDR4, a GTX 1070 I use for audio transcription, and a mix of spinning rust and SSDs. It’s been getting the job done.

    The beast machine is my desktop — where I’m typing this. An i9-14900K with an RTX 4080 Super, 48GB of DDR5, and two 4TB NVMe drives. It currently runs Ollama for local LLM inference and games on the weekends. The “games” part of that may be in jeopardy.

    The third machine is a Sandy Bridge i7-2600K from 2011. It runs Ubuntu and various things I’ve tried over the years. She’s old and no longer as mighty as when I specced her back in the day, but she has sentimental value and still shows up to work on time if the task is sized right.

    The Synology is a DS418 with 4×8TB drives. It handles backups. The Home Assistant Green runs home automation and will be staying exactly where it is for as long as I can manage it.


    Writing the Requirements Document

    Before I could ask anyone — AI or otherwise — for a design, I had to know what I was asking for. So I wrote a requirements document.

    It covered the hardware above in detail, including exact CPU/RAM/storage/GPU specifications per machine. It listed every service I run and why I run it. It articulated the constraints.

    It also specified something I hadn’t entirely thought through until I wrote it down: I wanted the final architecture to be end-to-end administrable by an AI agent. Not because I’m looking to fully hand over the keys, but because if Claude can autonomously execute operations, that means the operations are well-documented, idempotent, and testable. The discipline of designing for AI operation will hopefully produce better infrastructure for human operation too.

    Some of you may be screaming inside about the sustainability (along many axes) of letting the AI run the show. And rightly so. But at this point we were at the thought exercise stage of the game and there was plenty of time to navigate risk as we kicked off the opening ceremony.


    What Came Next

    With a requirements document in hand, I did something I hadn’t done much of since diving into LLM tools: a structured response comparison. Not just asking one system and running with the answer, but treating each response as an input to a larger process.

    Part 2 is where the tournament happens.


    This is part of an ongoing series about running an obsessively documented homelab and learning something new every time I break it.