Why does AI-generated R / Bioconductor code fail to run?

Because the model predicts plausible tokens rather than consulting each package's real contract. Failures cluster into three types: stale function signatures from old releases, namespace conflation (the right verb attributed to the wrong package), and invented APIs (e.g. results.shrink() instead of lfcShrink()). In a 20-task benchmark, an ungrounded LLM's function citations were correct only 71.4% of the time.

What is structure grounding?

Checking every function the model proposes against the package's actual NAMESPACE export manifest for the targeted Bioconductor release, and feeding that structure back into generation. It raises function-citation accuracy from 71.4% to 88.2% and cuts wrong-package citations from 14.3% to 2.1%.

What is execution grounding?

Actually running the workflow end to end in a dependency-complete environment on realistic input and checking it produces the outputs it claims. A workflow can pass every NAMESPACE check and still fail at runtime through version drift or data-contract mismatch, so execution is the only ground truth.

Why AI Writes Bioconductor Code That Doesn't Run (and the Fix)

If you've asked ChatGPT or Claude to write you a DESeq2 or Seurat workflow, you've probably seen it: code that looks completely right, references functions that sound real, and then fails the moment you run it. This isn't a prompt problem. It's a grounding problem — and the fix tells us something useful about how AI should be built for science.

The thing every bioinformatician has quietly noticed

Ask a large language model for a Bioconductor analysis workflow and you get something remarkable: fluent R, plausible package names, a confident narrative of steps. Try to actually run it, and a meaningful fraction of the time it fails. A function that doesn't exist. A package that was renamed two releases ago. An argument that belongs to a different function entirely. The code reads as expert; it just isn't grounded in what the packages really expose.

We wanted to measure this rather than gesture at it. Across a 20-task benchmark spanning seven biomedical domains, an ungrounded LLM — a strong general model, no retrieval, no structural scaffolding — produced function citations that were correct only 71.4% of the time. Put plainly: roughly three in ten function calls in generated Bioconductor workflows pointed at something that wasn't actually an exported function of the package it was attributed to. Wrong-package citations alone ran at 14.3%.

None of this is a knock on the models, or on the people using them. It's a structural consequence of how these models learn. And it has a structural fix.

Why it happens: three failure modes

When we looked at what the failures were, they clustered into three recognizable types:

Stale signatures. Bioconductor moves fast — a release every six months, functions deprecated and renamed, arguments added and removed. A model trained on a snapshot of the internet has seen years of overlapping versions blended together. It confidently emits an argument that was valid in 2021 and removed in 2023.

Namespace conflation. Many packages export a filter(), a select(), a plot(). The model knows the verb but attributes it to the wrong package: for example, emitting dplyr::select() semantics in a flow that needs AnnotationDbi::select(). The name is real; the attribution is wrong.

Invented APIs. Sometimes the model simply hallucinates a function that should exist by analogy — results.shrink() instead of the real lfcShrink(). It's a reasonable guess about how the world might be organized. It just isn't how DESeq2 is organized.

All three share a root cause: the model is predicting plausible tokens, not consulting the package's actual contract.
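The three failure modes can be told apart mechanically once you have an export table to check against. What follows is a minimal sketch in Python, not our benchmark harness: the EXPORTS table is a small hand-written subset of real exports, KNOWN_ARGS is a partial argument list, and old_arg is a made-up argument name used only for illustration.

```python
# Toy export manifest: package -> exported symbols (a small, incomplete subset).
EXPORTS = {
    "DESeq2": {"DESeq", "results", "lfcShrink"},
    "dplyr": {"filter", "select", "mutate"},
    "AnnotationDbi": {"select", "keys", "columns"},
}

# Partial argument list for one function; "old_arg" below is a made-up name.
KNOWN_ARGS = {
    ("DESeq2", "lfcShrink"): {"dds", "coef", "contrast", "res", "type"},
}


def classify(package, function, args=()):
    """Label one proposed call as ok or as one of the three failure modes."""
    if function in EXPORTS.get(package, set()):
        known = KNOWN_ARGS.get((package, function))
        # The symbol is real; check whether any argument is unrecognized.
        if known is not None and any(a not in known for a in args):
            return "stale_signature"
        return "ok"
    # The symbol is not exported here, but some other package exports it.
    if any(function in symbols for symbols in EXPORTS.values()):
        return "namespace_conflation"
    # No package in the table exports it at all.
    return "invented_api"


if __name__ == "__main__":
    print(classify("DESeq2", "lfcShrink", ["dds", "coef", "type"]))
    print(classify("DESeq2", "lfcShrink", ["dds", "old_arg"]))
    print(classify("DESeq2", "select"))
    print(classify("DESeq2", "results.shrink"))
```

Run as a script, the four calls print ok, stale_signature, namespace_conflation and invented_api in that order. An unrecognized argument is only a hint of a stale signature, since R functions can also accept extra arguments through "...", so a real checker would need per-release signatures rather than a flat set of names.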

The fix: ground generation in the structure that already exists

Here's the encouraging part. Bioconductor is unusually well-structured for exactly this problem. Every package ships a formal NAMESPACE manifest of what it actually exports. Classes are declared as typed S4 hierarchies. Packages are organized under the controlled biocViews vocabulary. And nearly every package ships standardized vignettes: worked, runnable examples maintained by the authors.

That structure is a grounding scaffold. Check every function the model proposes against the real NAMESPACE export manifest for the targeted Bioconductor release (is this symbol actually exported by this package?), feed that structure back into generation, and the numbers move sharply:

20 workflow-generation tasks across seven domains, validated against Bioconductor 3.20 NAMESPACE export manifests. Both arms scored by the identical protocol.
Metric	Ungrounded LLM	Structure-grounded	Change
Function citation accuracy	71.4%	88.2%	+16.8 pp
Wrong-package citations	14.3%	2.1%	−12.2 pp

That's the difference between "looks right" and "is right", achieved not by making the model bigger but by anchoring it to the ecosystem's own ground truth. The residual ~12% is not all hallucination, either: much of it consists of legitimate cross-package co-imports that weren't yet in our validation allowlist, and expanding the allowlist to the full registry lifts confirmed coverage further.
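To make the check concrete, here is a rough Python sketch of the two pieces involved: pulling exported symbols out of a NAMESPACE file, and scoring a list of proposed citations against the resulting allowlist. It is a simplification under stated assumptions, not the scoring protocol behind the table. It reads only export() and exportMethods() directives and ignores exportPattern(), exportClasses() and S3method(); the NAMESPACE fragment and the package names pkgA and pkgB are invented for the example.

```python
import re

# Invented NAMESPACE fragment for a hypothetical package.
SAMPLE_NAMESPACE = """
# illustrative fragment only
export(fit, shrink)
exportMethods(counts, "counts<-")
exportPattern("^plot")
"""


def parse_exports(namespace_text):
    """Collect symbols listed in export(...) and exportMethods(...) directives."""
    text = re.sub(r"#.*", "", namespace_text)  # drop comments
    symbols = set()
    for match in re.finditer(r"\b(?:export|exportMethods)\s*\(([^)]*)\)", text):
        for token in match.group(1).split(","):
            token = token.strip().strip("\"'`")
            if token:
                symbols.add(token)
    return symbols


def score(citations, manifests):
    """Return (citation accuracy, wrong-package rate) for (package, function) pairs."""
    correct = wrong_package = 0
    for package, function in citations:
        if function in manifests.get(package, set()):
            correct += 1
        elif any(function in symbols for symbols in manifests.values()):
            wrong_package += 1
    total = len(citations)
    return correct / total, wrong_package / total


if __name__ == "__main__":
    manifests = {"pkgA": parse_exports(SAMPLE_NAMESPACE), "pkgB": {"select"}}
    proposed = [("pkgA", "fit"), ("pkgA", "select"),
                ("pkgA", "made_up"), ("pkgB", "select")]
    print(score(proposed, manifests))
```

A citation counted as wrong here may still be legitimate if the allowlist is incomplete, which is the co-import effect described above: a symbol absent from the manifests dictionary is indistinguishable from an invented one until the allowlist covers its package.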

But grounding in structure still isn't enough

Here's the part that surprised us, and it's the more important lesson.

A workflow can cite only real, verified functions — pass every NAMESPACE check — and still fail to run. Versions drift between when a vignette was written and the environment you're in. The data contract between step three's output and step four's input doesn't quite match. An input can't be synthesized in a form the next tool accepts. Structural correctness is necessary. It is not sufficient.
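The data-contract failure is easy to picture with a toy model. In the hedged Python sketch below, each step declares what it requires and what it produces; the step names, field names and type labels are all hypothetical, and real Bioconductor steps carry no such declarations unless someone writes them.

```python
def contract_mismatches(steps):
    """Walk the steps in order and report inputs that are missing or of the wrong type.

    Each step is a dict with 'name', 'requires' and 'produces', where the
    latter two map a field name to a type label.
    """
    problems = []
    available = {}
    for step in steps:
        for field, expected in step["requires"].items():
            actual = available.get(field)
            if actual is None:
                problems.append(f"{step['name']}: missing input '{field}'")
            elif actual != expected:
                problems.append(
                    f"{step['name']}: '{field}' is {actual}, expected {expected}"
                )
        # Outputs of this step become available to later steps.
        available.update(step["produces"])
    return problems


if __name__ == "__main__":
    workflow = [
        {"name": "normalize", "requires": {}, "produces": {"counts": "matrix"}},
        {"name": "test", "requires": {"counts": "SummarizedExperiment"},
         "produces": {"table": "data.frame"}},
    ]
    for problem in contract_mismatches(workflow):
        print(problem)
```

Every function in such a workflow can be a real export and the chain still breaks at the second step. A static check like this only catches mismatches that someone has declared; the undeclared ones surface when the workflow is run.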

So the only thing that actually tells you a workflow works is the obvious thing: run it. End to end, in a dependency-complete environment, on realistic input, and check that it produces the outputs it claims to. That binary — did it run and produce what it promised? — is a correctness signal that needs no human reviewer and no second AI to judge it. It either ran or it didn't.
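That binary check is simple to express. The Python sketch below is one possible shape for it, under assumptions: the function name, the timeout default and the file names are ours for illustration. In practice the command would be something like an Rscript invocation inside a dependency-complete container; the demo uses the Python interpreter as a stand-in so the sketch runs without R installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_grounded(command, workdir, expected_outputs, timeout=600):
    """Pass only if the command exits 0 and every claimed output exists and is non-empty."""
    try:
        proc = subprocess.run(
            command, cwd=workdir, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    if proc.returncode != 0:
        return False
    # A clean exit is not enough: the promised files must actually be there.
    return all(
        (Path(workdir) / name).is_file() and (Path(workdir) / name).stat().st_size > 0
        for name in expected_outputs
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Stand-in for a workflow that writes the output it claims.
        writes = [sys.executable, "-c", "open('results.csv', 'w').write('x')"]
        print(execution_grounded(writes, workdir, ["results.csv"]))
        # Stand-in for a workflow that exits cleanly but produces nothing.
        silent = [sys.executable, "-c", "pass"]
        print(execution_grounded(silent, workdir, ["missing.csv"]))
```

The second case is the instructive one: a zero exit code alone would have passed it. Checking that outputs exist still says nothing about whether their contents are right, which is a separate question from whether the workflow ran.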

A grounding hierarchy: lexical grounding (the words look right) sits inside structural grounding (the functions are real), which in turn sits inside execution grounding (the workflow actually runs). Execution is the floor you can't fall through: the ground truth.

Two claims need separating here. "The code runs" and "the code is scientifically correct" are not the same claim, and fluency guarantees neither. Execution settles the first: the discipline that closes the gap between fluent and runnable is execution, not eloquence.

What we built on top of this

This is the foundation of BioMate-KB, an open knowledge base of 15,641 workflow steps extracted from across the Bioconductor 3.20 ecosystem. Steps are NAMESPACE-validated, annotated with EDAM ontology terms, linked to container images and software DOIs, and — for the validated head of the collection — confirmed by real end-to-end execution. The public skill bundle covering the top 200 packages is free, under CC-BY-4.0.

The methodology and the full validation results are documented in two preprints:

BioMate-KB: A Real-Execution-Validated Workflow Knowledge Base for Bioconductor — doi.org/10.5281/zenodo.20616355
Structure Grounding Is Not Enough: Real Execution as the Ground Truth for LLM-Generated Bioinformatics Workflows — doi.org/10.5281/zenodo.20616543

The knowledge layer is a gift to the community — clone it, use it, build on it: github.com/bioMate-AI/biomate-bioconductor-kb.

Already 200+ stars from the community.

And the execution layer is what BioMate does for you. You describe the analysis in plain language; BioMate routes only to workflows that have been validated to actually run, in environments that are already dependency-complete — so you skip the version-mismatch hunt, the container wrangling, and the "why doesn't this example work anymore" afternoon entirely. The grounding we measured isn't a paper result you have to take on faith; it's the thing standing between you and a workflow that runs the first time.

BioMate is a life-sciences AI platform that turns validated computational workflows into something you can run by asking. If you spend afternoons getting other people's bioinformatics code to run, that's the afternoon we're trying to give you back.

The takeaway

Fluent AI code is not the same as runnable AI code. Structure grounding — checking every function against the real NAMESPACE — lifts citation accuracy from 71.4% to 88.2%. But only real execution proves a workflow runs. BioMate routes you to workflows validated to actually run.