I Built an AI Medical Research Partner — Here’s Why It’s Not a Diagnosis Bot

Illustration of a person studying a prescription bottle at a desk with a doctor visible through a doorway

You are the person responsible for your healthcare. Not your doctor, not your insurance company, not the hospital system — you. They all play roles, but when the results come back and the decisions have to be made, nobody else lives with the consequences.

This is uncomfortable to say plainly, because the immediate follow-up is also true: you are not the expert. You didn’t go to medical school. You don’t read imaging. You can’t interpret your own bloodwork with the confidence of someone who’s seen ten thousand panels. The gap between “this is my responsibility” and “I am not qualified to make these decisions alone” is where most people get stuck — and where most of the bad information on the internet lives, waiting to fill the vacuum.

I built something to help with that gap. Not to close it — nobody’s closing it with a config file — but to make it easier to stand in.

The skill is open source: Medical Research Thinking Partner on GitHub


The Problem With “Just Google It”

The information exists. The literacy doesn’t.

Search for a symptom online and you’ll get a results page that casually spans the range from “this is nothing” to “you might die.” That’s not because the sources are wrong. It’s because serious conditions are disproportionately represented in medical literature relative to how often they actually occur. The internet doesn’t know your pre-test probability. It just shows you everything. It’s gotten worse, not better — a Guardian investigation in January 2026 found Google’s AI Overviews returning inaccurate and potentially harmful health information at the top of search results.

Health headlines are worse. “New drug reduces heart attack risk by 50%” is technically accurate and almost completely misleading. That 50% is a relative risk reduction. If your baseline risk was 2%, the drug brought it to 1% — an absolute benefit of one percentage point. One person in a hundred benefits. The other ninety-nine took a drug with side effects for nothing. But “Drug provides 1% absolute risk reduction” doesn’t generate clicks.

The tools that already exist in this space — symptom checkers, chatbot triage systems, “ask an AI doctor” products — are mostly solving the wrong problem. A February 2026 study from Mount Sinai found that LLMs can amplify medical misinformation when used without safeguards — they’re agreeable by design, and agreeableness is dangerous when the user’s premise is wrong. These tools are trying to be the expert. They want to take your symptoms and hand you a diagnosis or a triage recommendation. That’s probably not what most people need. What most people need is the ability to evaluate the information they’re already drowning in and translate it into better conversations with the actual experts. There may be a future where technology can navigate all this nuance, but evidence suggests that we aren’t there yet.

What I Actually Built

I built a Claude Code skill — a plain-text instruction file that loads structured frameworks into an AI conversation. It’s not an app. It’s not even code, exactly. It’s roughly 600 lines of English across five files that tell Claude how to think about medical research when I bring it a question.

For readers unfamiliar with Claude Code: it’s Anthropic’s CLI tool for working with Claude. “Skills” are instruction files you drop into a directory that activate when relevant — they load context and behavioral rules into the conversation. Think of them as a role description with reference materials attached.

The skill has one orchestrator file (SKILL.md) and four reference documents: an evidence hierarchy guide, a medical statistics primer, a source routing table, and an appointment prep scaffold. When I invoke the skill, it identifies what mode I’m operating in — new diagnosis, chronic condition, treatment research, appointment prep, interpreting results — and applies the relevant frameworks.

The entire thing is a set of instructions — all of it is on GitHub. There’s no supporting code, no API calls, no database. It works because the frameworks themselves are the value — PICO question framing, evidence hierarchy evaluation, statistics translation, cognitive bias flags. Technically, you could simply read the files and apply them yourself. But “simply” is doing a lot of heavy lifting there. Most of us don’t have the cognitive stamina to load and apply a framework like this when the material we’re reviewing is itself incredibly taxing. The AI is the delivery mechanism for structured thinking that already exists in evidence-based medicine — it just isn’t typically accessible to non-clinicians.
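Based on the description above, the layout looks something like this. The filenames here are illustrative, reconstructed from the prose; check the GitHub repo for the real ones:

```text
medical-research-thinking-partner/
├── SKILL.md                 # orchestrator: identifies the mode, loads the right references
├── evidence-hierarchy.md    # how to rank study types (meta-analysis > RCT > observational...)
├── statistics-primer.md     # relative vs. absolute risk, NNT, NNH
├── source-routing.md        # which sources to consult for which kinds of questions
└── appointment-prep.md      # scaffold for building a ranked question list
```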

The Design Philosophy: Thinking Partner, Not Diagnostician

The most important design decision was the first one: the skill never attempts diagnosis or treatment recommendation. Not because of liability concerns — though those are real — but because that’s genuinely the wrong use of the tool. An AI doesn’t have your labs, your imaging, your clinical history, or the ability to perform a physical exam. It has no longitudinal[1] relationship with you. Pretending otherwise isn’t just irresponsible; it produces worse outcomes than doing nothing, because it generates false confidence.

What the skill does instead is teach you how to evaluate evidence. When you bring it a question — “my doctor wants to start me on a statin, is that right for me?” — it doesn’t answer the question. It reframes it using PICO:

  • P (Patient/Problem): What’s your specific cardiovascular risk profile? Age, sex, existing conditions, family history.
  • I (Intervention): Which statin, at what dose?
  • C (Comparison): Compared to what — lifestyle changes alone? A different medication? Watchful waiting?
  • O (Outcome): What matters to you? Reducing heart attack risk? Avoiding side effects? Both, weighted how?

A vague question produces vague answers. PICO produces a question you can actually research — and, more importantly, a question you can bring to your doctor that invites a specific, useful response.
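The reframe is simple enough to sketch as a tiny data structure. This is my illustration of the idea, not code from the skill (the skill itself is plain English), and names like `PicoQuestion` are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class PicoQuestion:
    """A vague health question restructured into researchable parts."""
    patient: str       # who, with what specific risk profile
    intervention: str  # the specific treatment being considered
    comparison: str    # the alternative(s) it's measured against
    outcome: str       # the result that actually matters to you

    def render(self) -> str:
        return (f"In {self.patient}, does {self.intervention}, "
                f"compared with {self.comparison}, affect {self.outcome}?")

# "Is a statin right for me?" becomes:
q = PicoQuestion(
    patient="a 52-year-old with borderline LDL and a family history of heart disease",
    intervention="moderate-intensity statin therapy",
    comparison="lifestyle changes alone",
    outcome="10-year risk of heart attack and stroke",
)
print(q.render())
```

The rendered sentence is something you can paste into a literature search or read aloud in an exam room; the vague original is neither.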

This is where the tension between self-advocacy and inexpertise matters most. You are responsible for understanding your own care well enough to participate in decisions about it. You are not responsible for knowing more than your cardiologist about lipid management. The skill is designed to live in that gap — to help you become a better-informed participant without pretending you’re a peer reviewer.

A Session, Annotated

Here’s what using the skill actually looks like. The scenario is fabricated but realistic — the kind of situation that sends most people straight to Google at 2 AM.

The setup: I’ve just gotten blood work back. My A1C came back at 6.8% — just over the Type 2 diabetes threshold. My doctor mentioned starting metformin and scheduled a follow-up for next week. I don’t know what any of this means.

I invoke the skill and tell it what happened.

Me: Just got blood work back. A1C is 6.8%. My doctor says that’s Type 2 diabetes and wants to start me on metformin. I’m seeing him next week to discuss. I don’t really understand what any of this means or whether metformin is the right call.

Claude: This is a thinking-partner conversation to help you understand information and ask better questions — not medical advice. Your doctor has access to your full clinical picture.

Before I dig into this, I want to make sure I’m helping with the right things. A few questions:

  1. Is this your first time hearing “diabetes” in connection with your own health, or has prediabetes come up before?
  2. Do you have any other conditions your doctor manages — high blood pressure, high cholesterol, anything else?
  3. What matters most to you right now — understanding the diagnosis itself, evaluating whether metformin is right, prepping for next week’s appointment, or all of the above?

This is the first thing that surprised me. I expected an info dump about diabetes. Instead, it asked three targeted questions. Each one matters: question 1 determines how much baseline explanation I need, question 2 affects which evidence is relevant (comorbidities change the risk calculus), and question 3 lets me steer. It’s triaging before it teaches.

I tell it this is brand new, no other conditions, and I want all three. It starts with what the numbers mean, then flags something I wouldn’t have thought to ask about:

Claude: A1C measures your average blood sugar over the past 2–3 months. The thresholds: below 5.7% is normal, 5.7–6.4% is prediabetes, 6.5% and above is Type 2 diabetes. You’re at 6.8% — past the threshold, but early-stage.

One thing worth knowing: the ADA recommends confirming a diabetes diagnosis with a repeat test in asymptomatic patients. Did your doctor mention a confirmation test, or was this treated as definitive?

A question, not a conclusion. It caught something that’s in the clinical guidelines but doesn’t typically make it into a fifteen-minute appointment. I didn’t know to ask about confirmation testing. Now I do.

The conversation continues. Eventually I ask the big question:

Me: So is metformin the right move?

Claude: Let’s frame that more precisely. Using the PICO framework:

  • P (Patient): Newly diagnosed T2D, A1C 6.8%, no comorbidities
  • I (Intervention): Metformin, likely 500mg starting dose
  • C (Comparison): Lifestyle intervention alone — diet, exercise, weight management
  • O (Outcome): What matters to you? Reducing A1C? Avoiding long-term complications — kidney, eye, nerve damage? Minimizing side effects? Some combination?

“Should I take metformin?” is actually several different questions depending on which outcome you prioritize. Which of those matters most to you?

There’s the PICO reframe in practice. My vague question — “is metformin right?” — became a structured, answerable question. And it forced me to articulate what I actually care about, which is something I hadn’t done.

I chose avoiding long-term complications. The skill walked through the evidence — the UKPDS trial, the landmark metformin study — and then translated the statistics:

Claude: The UKPDS showed metformin reduced the risk of any diabetes-related endpoint by about 32% in overweight patients. But that’s the relative number. In absolute terms, over 10 years, the Number Needed to Treat was roughly 14 — meaning for every 14 people treated, one avoids a complication they otherwise would have had. An NNT of 14 is strong. That’s better than many widely prescribed preventive medications.

But here’s what’s worth noting for your specific situation: there’s evidence that intensive lifestyle intervention — from the Diabetes Prevention Program trial — can reduce progression by 58% in prediabetes and early diabetes. Your doctor may be recommending metformin alongside lifestyle changes, not instead of them. That’s a question worth asking next week.

Two things happened here that wouldn’t happen in a Google search. First, it translated a relative risk reduction into an NNT and told me whether that number is good or bad — context the raw study doesn’t provide for a lay reader. Second, it caught a false binary in my thinking (metformin OR lifestyle) and reframed it (probably both). That’s the cognitive trap flagging working in real time.

By the end of the session, it generated an appointment prep list:

Claude: Your prioritized question list for next week. These three must get answered:

  1. “Should we confirm the diagnosis with a second A1C, given that I’m asymptomatic?” — The ADA recommends confirmation. This is a legitimate clinical question, not a challenge.
  2. “Are you recommending metformin alongside lifestyle changes, or instead of them? What’s the target A1C, and when do we reassess?” — Establishes measurable goals and a timeline.
  3. “What should I be monitoring at home, and what would prompt me to call before the next visit?” — Your safety net.

If time allows: What are metformin’s common side effects and how long do they last? Should I see an endocrinologist, or is this manageable in primary care at this stage?

Bring your full medication list including supplements. If you can, bring someone with you to take notes.

I went from “I don’t understand any of this” to a ranked question list, a framework for evaluating the evidence my doctor cites, and a clearer picture of what my numbers actually mean. The entire conversation took about twenty minutes. None of it replaced the appointment. All of it made the appointment more useful.


The Statistics Problem

Illustration of a skeptical person reading a newspaper at a cafe with a thought bubble showing a pie chart

This section is worth reading even if you never touch the skill. It might be the most practically useful thing in this post.

Medical statistics are routinely presented in ways that mislead non-experts. This isn’t always intentional — relative risk is a legitimate statistical measure — but the effect is the same: people make decisions based on numbers that sound more dramatic than they are. And, unfortunately, this isn’t always unintentional either.

The headline test: When you see a health claim with a percentage, run it through this filter.

A drug “reduces your risk of heart attack by 50%.” Step one: is that relative or absolute? Almost always relative. Step two: what’s the baseline rate? Say your ten-year risk of a heart attack is 2%. Step three: compute the absolute reduction. 2% times 50% equals 1 percentage point — your risk goes from 2% to 1%. Step four: compute the Number Needed to Treat (NNT). One divided by 0.01 equals 100. One hundred people take this drug for ten years. One of them avoids a heart attack. Ninety-nine took a drug with potential side effects — muscle pain, liver enzyme elevation, diabetes risk — for no personal benefit.

That doesn’t mean the drug is bad. At a population level, an NNT of 100 for a serious outcome is meaningful. But it does mean the decision is more nuanced than “50% reduction” suggests, and it means the side effect profile matters a lot more than the headline implies.
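The four steps above reduce to two lines of arithmetic. A minimal sketch of the same calculation (the function name is mine, not part of the skill):

```python
def absolute_effect(baseline_risk: float, relative_risk_reduction: float):
    """Translate a headline relative risk reduction into absolute terms."""
    arr = baseline_risk * relative_risk_reduction  # absolute risk reduction
    nnt = 1 / arr                                  # number needed to treat
    return arr, nnt

# The headline: "reduces heart attack risk by 50%".
# Your assumed 10-year baseline risk: 2%.
arr, nnt = absolute_effect(baseline_risk=0.02, relative_risk_reduction=0.50)
print(f"Absolute reduction: {arr:.1%}")  # 1.0% -- from 2% down to 1%
print(f"NNT over 10 years: {nnt:.0f}")   # 100 treated per heart attack avoided
```

The only input that isn’t in the headline is the baseline risk, which is exactly the number headlines omit and your doctor can estimate for you.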

The skill’s statistics primer includes a rough benchmark table:

  • NNT 2–5: Excellent — strong individual benefit
  • NNT 10–20: Good for serious conditions
  • NNT 50–100: Modest — worth scrutiny for the individual
  • NNT 200+: Low — carefully weigh against side effects

There’s a companion concept: Number Needed to Harm (NNH). If the NNT is 50 and the NNH is 30, the drug harms more people than it helps. Comparing both numbers before forming an opinion is essential, and almost nobody does it — because almost nobody is taught to. (If you want to look up pre-computed NNTs for common treatments, TheNNT.com is a free resource.)
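One common way to compare the two numbers is the ratio of your chance of being helped to your chance of being harmed. The helper below is my sketch of that arithmetic, not something the skill ships:

```python
def helped_vs_harmed(nnt: float, nnh: float) -> float:
    """Ratio of chance of benefit (1/NNT) to chance of harm (1/NNH).
    Above 1: more people are helped than harmed. Below 1: the reverse."""
    return (1 / nnt) / (1 / nnh)  # algebraically, nnh / nnt

# The example from the text: NNT of 50, NNH of 30.
ratio = helped_vs_harmed(nnt=50, nnh=30)
print(f"{ratio:.2f}")  # 0.60 -- for every person helped, ~1.7 are harmed
```

Whether a given harm is worth trading against a given benefit still depends on how severe each is, so treat the ratio as a starting point for the conversation, not a verdict.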

The skill also flags cognitive traps: surrogate endpoints (a drug that improves a lab marker but doesn’t improve actual outcomes), publication bias (positive studies are published more than negative ones, so the literature systematically overstates benefits — Cochrane systematic reviews are one of the few sources that actively seek unpublished data to counter this), and the correlation-causation conflation that plagues observational studies. These aren’t exotic epistemological concerns. They’re the basic mechanics of how medical evidence works, and most patients — most people — have never encountered them.

The Humans in the System

Here’s another thing that’s uncomfortable to say plainly: every person in the healthcare system — your doctor, the specialist, the nurse, the insurance reviewer, the hospital administrator — is a human being with biases and incentive structures that may not align perfectly with your best outcome.

This is not a conspiracy theory. It’s not even a criticism. It’s just how systems made of humans work. A surgeon’s training and livelihood are oriented around surgery; they may be more likely to recommend it than a non-surgical specialist would. An insurance company’s incentive is to manage costs; they may deny a treatment that’s clinically appropriate. A busy primary care physician with a fifteen-minute appointment slot may default to the most common recommendation rather than the most tailored one. A pharmaceutical rep’s job is to present their product favorably. None of these people are villains. They’re professionals operating within systems that create predictable biases.

Treating this as true is essential. Treating it dispassionately is equally essential.

The productive response isn’t suspicion — it’s structured skepticism. Ask your doctor why they’re recommending Treatment A over Treatment B. Ask what the evidence base is. Ask whether industry guidelines align with the latest meta-analyses. Do it with genuine curiosity, not accusation. “I want to make sure I understand the reasoning” is a collaboration. “I think you’re biased” is a fight.

The skill is designed to help you arrive at that collaboration. The evidence hierarchy exists so you can evaluate the quality of what’s being cited. The statistics primer exists so you can understand the magnitude of what’s being claimed. The appointment prep scaffold exists so you can structure the conversation to get the most out of limited time. None of it is adversarial. All of it is about being a more effective participant in a system that, for all its flaws, is still staffed by people who overwhelmingly went into medicine to help.

Bringing Research to Your Doctor

The appointment prep framework is where the theoretical becomes practical. You’ve done the research. You’ve framed questions with PICO. You’ve computed some NNTs. Now you’re sitting in the exam room with ten minutes. How do you use this without coming across as the patient who “did their own research”?

The skill’s framing advice is simple: present findings as an invitation to collaborate, not a challenge to authority.

Instead of “I read that Drug X is better than what you prescribed,” try: “I came across a comparison between Drug X and what you’ve recommended. I’d love your perspective on whether that’s relevant to my situation.”

The first version puts the doctor on the defensive. The second positions you as engaged and informed — exactly what you are — while acknowledging that they have clinical context you don’t. Most physicians respond well to patients who bring questions rather than conclusions.

The skill also teaches the teach-back method: after your doctor explains something important, restate it in your own words and ask if you’ve got it right. “So if I understand correctly, what you’re saying is [your version] — is that right?” This catches misunderstandings in the room instead of at home three hours later when you’re trying to remember what they said.

Other practical pieces: bring a ranked question list (you have three to five questions that must get answered, everything else is a bonus), bring a medication list (all of them, including supplements), bring someone else if you can (they’ll remember things you won’t and ask questions you’re too anxious to raise), and take notes or ask to record.

What It Can’t Do

The skill can’t examine you. It can’t order labs. It can’t see your imaging. It doesn’t know your full medical history, your family history in context, or the clinical gestalt that an experienced physician develops over years of practice. It has no longitudinal[1] relationship with you — it doesn’t remember last year’s labs or notice that your weight has been trending in a direction that matters.

It also can’t navigate the emotional weight of medical decisions. It can help you frame a question about whether to pursue chemotherapy, but it can’t sit with you while you decide. That’s what your care team, your family, and your own judgment are for.

These aren’t limitations to apologize for. They’re the boundaries that make the tool honest. The skill is designed to make you better at one specific thing: evaluating medical information and translating it into productive action. It does that by loading frameworks that already exist in evidence-based medicine — PICO, evidence hierarchies, NNT analysis, structured appointment prep — and making them accessible in a conversation.

The name of this blog is “I Don’t Know Anything.” That applies here more than anywhere else I’ve written about. I’m not a clinician. I built a tool that helps me think more clearly about medical information because I needed one, and because the frameworks it uses are well-established and freely available — just not widely taught to the people who need them most.

The goal isn’t to become your own doctor. It’s to be less of a passive recipient of information you can’t evaluate — less exclusively guided by systems and processes shaped as much by profit motive as by interest in your well-being. The evidence hierarchy exists. NNT exists. PICO exists. Most people just never encounter them until they’re sitting in an exam room, overwhelmed, nodding along to a recommendation they don’t fully understand.

If a config file and a structured conversation can close that gap even slightly — can turn “I don’t understand any of this” into “I have three specific questions about this” — that feels like it was worth building.

[1] Longitudinal means tracking the same thing over time instead of looking at a single snapshot. In research, this means following a group of people across years or decades to see how patterns emerge — what causes disease, what prevents it, what changes. At the individual level, it means the same principle applied to you: your doctor tracking your blood pressure, A1C, or cholesterol across visits to spot trends that a single reading would miss. Both scales depend on the same insight — one measurement tells you where you are, but a series of measurements tells you where you’re heading. For more: longitudinal research studies | longitudinal patient data in clinical care
