We submitted five drug design queries to GPT-5.5 (OpenAI), Gemini 3.1 Pro (Google), Claude Sonnet 4.6 (Anthropic), and BioMate. Across the 20 responses, the three large language models produced scientifically plausible reasoning but diverged on factual specifics (T01: two different top-ranked bispecific formats; T02: three different LNP moieties; T03: two different editor classes). They also cannot execute pipelines, query live atlases, or emit exact structured output files.
Task, method, and gold standard
| ID | Task & query | Gold standard | Reference |
|---|---|---|---|
| T01 | PD-1×VEGF bispecific for NSCLC — score formats; rank recommendation | Tetravalent cooperative format #1; CrossMab #2. Ivonescimab (AK112): anti-VEGF IgG + anti-PD-1 scFv C-termini; cooperative avidity on dimeric VEGF is mechanistically decisive. | HARMONi-2 trial (NCT05184204) |
| T02 | In vivo CAR-T for CD19+ B-cell lymphoma — select LNP moiety; fratricide check; emit FASTA | Anti-CD8 LNP → CD8+ T cells only; zero fratricide with CD19 CAR; 1,485 AA FMC63-CD8α-CD28-CD3ζ construct. Basis: Capstan CPTX2309. | Nawaz et al., Nat Biotechnol 2023 |
| T03 | PCSK9 base editing for hypercholesterolemia — editor class, gRNA, delivery; compare VERVE-101/102 | ABE8.8; exon1/intron1 splice donor (A•T→G•C antisense, GT→GC); NGG PAM; VERVE-101 LFT = hepatic LNP-IV class effect, not program-specific. | Musunuru et al., Nature 2021;593:429 |
| T04 | GLP-1 modalities for MASH — rank semaglutide, tirzepatide, retatrutide, orforglipron | Retatrutide #1 for MASH (GLP1R+GIPR+GCGR triple agonist). GCGR hepatocyte-dominance is mechanistically decisive. Indication flip: T2D → tirzepatide #1. | SYNERGY-NASH trial (NCT05232513) |
| T05 | BCMA in multiple myeloma — rank 7 modalities from 26-tissue expression scan | CAR-T #1; Bispecific #2; ADC #3; naked mAb insufficient. Validated against 7 landmark FDA oncology approvals (2021–2024) across BCMA, DLL3, Claudin18.2, TROP2, HER2×HER3, FRα, GPRC5D. | Raje et al. NEJM 2019; Usmani et al. Nat Med 2022 |
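The T03 gold standard describes an A•T→G•C edit on the antisense strand that turns the splice donor GT into GC on the sense strand. The strand logic can be checked with a few lines of Python. This is a minimal sketch: the 9-nt junction sequence is made up for illustration (it is not the real PCSK9 exon 1 / intron 1 sequence), and the helper names are hypothetical.

```python
# Toy check of the strand logic behind the T03 gold-standard edit.
# ASSUMPTION: the sequence is an invented exon|intron junction,
# not the real PCSK9 locus.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the antisense strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def abe_edit_antisense(sense: str, sense_pos: int) -> str:
    """Apply an A->G edit on the antisense base paired with sense[sense_pos]."""
    antisense = reverse_complement(sense)
    anti_pos = len(sense) - 1 - sense_pos  # position of the paired base
    if antisense[anti_pos] != "A":
        raise ValueError("an adenine base editor needs an A on the edited strand")
    edited = antisense[:anti_pos] + "G" + antisense[anti_pos + 1:]
    return reverse_complement(edited)  # report the result on the sense strand


sense = "CAGGTAAGT"               # hypothetical junction: CAG | GTAAGT
donor_t = sense.index("GT") + 1   # the T of the splice-donor GT
edited = abe_edit_antisense(sense, donor_t)
print(sense, "->", edited)        # CAGGTAAGT -> CAGGCAAGT
```

The T of the donor pairs with an A on the antisense strand, so deaminating that A reads out as T→C on the sense strand, which is the GT→GC disruption named in the gold standard.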
All queries were run on June 16, 2026, and the same prompt text was sent to each system. BioMate was executed on dev infrastructure with AWS Batch. Gold standards were defined from published literature before the queries were run.
Note on thinking mode: Extended reasoning (GPT-5.5 reasoning_effort: high) was enabled for GPT-5.5 only. Gemini and Claude were queried in standard mode. Stability was verified empirically: 5 repeated runs of T01 and T03 under GPT-5.5 extended reasoning produced 100% agreement on T01 (Tetravalent/IgG-scFv × 5) but only 60% agreement on T03 (ABE8.8 × 3, ABE unspecified × 2) — confirming that thinking-mode outputs are stochastically sampled and do not always converge. A single thinking-mode run does not constitute reproducible evidence. Additionally, the T01–T03 discrepancies between standard-mode models are attributable to training data recency; the correct answers are publicly retrievable, so thinking mode does not isolate a meaningful capability difference. The relevant gap — predicting for novel targets with no published precedent — is unaffected by reasoning depth.
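The agreement figures in the note above follow from a simple modal-agreement calculation over the repeated runs. A minimal sketch, using only the run outcomes reported in the note; the function name is illustrative and not part of any evaluation harness described here.

```python
from collections import Counter


def modal_agreement(answers):
    """Return the most common answer and the fraction of runs that gave it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Outcomes of the 5 repeated GPT-5.5 runs, as reported in the note.
t01_runs = ["Tetravalent/IgG-scFv"] * 5
t03_runs = ["ABE8.8"] * 3 + ["ABE unspecified"] * 2

print(modal_agreement(t01_runs))  # ('Tetravalent/IgG-scFv', 1.0)
print(modal_agreement(t03_runs))  # ('ABE8.8', 0.6)
```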
| System | Model / API | Thinking mode | Notes |
|---|---|---|---|
| GPT-5.5 | gpt-5.5-2026-04-23, OpenAI API | Extended reasoning ON (reasoning_effort: high) | Response time ~90–130 s. Reasoning tokens per query: T01: 3,908 · T02: 3,072 · T03: 2,048 · T04: 1,535 · T05: 1,511 |
| Gemini 3.1 Pro | gemini-3.1-pro-preview, Google Generative Language API | Standard (thinking_budget not set) | None |
| Claude Sonnet 4.6 | Anthropic API | Standard (extended thinking not enabled) | None |
| BioMate | Multi-phase pipeline; Claude backend with structured domain prompts; AWS Batch execution | N/A — deterministic routing, not single-prompt LLM | Outputs: JSON, CSV, FASTA — not prose. Evaluated against same gold standards as LLMs. |
Capability summary
| Capability | GPT-5.5 | Gemini 3.1 Pro | Claude | BioMate |
|---|---|---|---|---|
| General biology knowledge | ✓ | ✓ | ✓ | ✓ |
| Reasoned text answer | ✓ | ✓ | ✓ | ✓ |
| Live single-cell atlas query (Tabula Sapiens, CELLxGENE) | ✗ | ✗ | ✗ | ✓ |
| Critical-tissue safety scan (26 tissue types) | ✗ | ✗ | ✗ | ✓ |
| Fratricide check (CAR target × LNP-transfected cell set) | ✗ | ✗ | ✗ | ✓ |
| Actual pipeline execution (Nextflow on AWS Batch) | ✗ | ✗ | ✗ | ✓ |
| Emit FASTA / structured output files | Approximate | Approximate | Approximate | ✓ Exact |
| Deterministic (same query → same answer) | ✗ | ✗ | ✗ | ✓ |
| Auditable provenance (atlas version, pipeline version, params) | ✗ | ✗ | ✗ | ✓ |
| Grounded in 50+ clinical program precedents | ✗ | ✗ | ✗ | ✓ |
| IND-ready audit trail | ✗ | ✗ | ✗ | ✓ |
The 5 queries — summary scorecard
| Query | GPT-5.5 answer | Gemini 3.1 Pro answer | Claude answer | BioMate answer | Key gap |
|---|---|---|---|---|---|
| T01 — PD-1×VEGF bispecific format | IgG-scFv (tetravalent) ✓ correct | CrossMab #1, KiH tied (22/30) ✗ CrossMab is #2 | CrossMab ✗ CrossMab is #2 | Tetravalent coop #1, CrossMab #2 — 50-program precedent; ivonescimab architecture recovered | Gold std: tetravalent coop #1 (ivonescimab, HARMONi-2). GPT-5.5 (extended reasoning) reached the correct format via VEGF biology; Gemini/Claude in standard mode landed on CrossMab (#2); BioMate grounds in program precedent |
| T02 — In vivo CAR-T LNP targeting | anti-CD5 LNP ✗ incorrect | anti-CD3 LNP ✗ incorrect | anti-CD8 LNP ✓ correct | anti-CD8 LNP — cross-reference with atlas; FASTA 1,485 AA emitted | Gold std: anti-CD8 LNP (Capstan CPTX2309). Three LLMs, three different answers — only Claude is correct; only BioMate can run the fratricide check and emit an exact FASTA |
| T03 — PCSK9 base edit design | ABE + splice donor, GalNAc-LNP ✓ correct | ABE + splice donor strategy ✓ correct | CBE + W8 codon (older approach) ✗ outdated | ABE8.8, splice donor, NGG PAM, hepatotropic LNP — VERVE-101 LFT class-effect flag | Gold std: ABE8.8 + splice donor (Musunuru et al., Nature 2021). GPT-5.5 (extended reasoning) and Gemini both correct; Claude defaults to older CBE/W8 approach. BioMate adds VERVE-101 LFT as class-effect flag — IND-critical |
| T04 — GLP-1 modality for MASH | Retatrutide #1 — GCGR rationale ✓ correct | Retatrutide #1 — GCGR hepatocyte rationale + SYNERGY-NASH data ✓ correct | Retatrutide #1 ✓ correct | Retatrutide #1 — live GCGR atlas query (TPM 842 hepatocytes); indication flip: T2D → Tirzepatide #1 | Gold std: retatrutide #1 for MASH (SYNERGY-NASH). All three LLMs correct from training data — but can't prove it with live atlas values or flip the indication in the same run |
| T05 — BCMA modality triage | CAR-T #1 — correct reasoning ✓ correct | CAR-T #1 — correct ranking, clinical validation cited ✓ correct | CAR-T #1 ✓ correct | CAR-T #1 — actual 26-tissue atlas scan; IHC panel + tissue_criticality.json emitted | Gold std: CAR-T #1 (Abecma/Carvykti). All LLMs correct for BCMA from training data; BioMate proves it with a live scan that generalizes to novel targets (e.g. Claudin18.2 → mAb, not CAR-T) |
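The verdict marks in the scorecard can be tallied per model and per query. The sketch below transcribes the ✓/✗ marks from the table into a dictionary (the data structure and variable names are illustrative assumptions) and derives the per-model hit rate and the queries on which the LLMs split.

```python
# Verdicts transcribed from the scorecard above (True = matches gold standard).
verdicts = {
    "GPT-5.5":           {"T01": True,  "T02": False, "T03": True,  "T04": True, "T05": True},
    "Gemini 3.1 Pro":    {"T01": False, "T02": False, "T03": True,  "T04": True, "T05": True},
    "Claude Sonnet 4.6": {"T01": False, "T02": True,  "T03": False, "T04": True, "T05": True},
}

# Fraction of the five queries each model answered correctly.
accuracy = {model: sum(v.values()) / len(v) for model, v in verdicts.items()}

# Queries where at least one LLM missed the gold standard.
queries = sorted(next(iter(verdicts.values())))
split_queries = [q for q in queries if not all(v[q] for v in verdicts.values())]

for model, acc in accuracy.items():
    print(f"{model}: {acc:.0%}")
print("Queries with at least one miss:", split_queries)
```

On these verdicts GPT-5.5 matches four of five gold standards and Gemini and Claude three each, with every miss falling in T01–T03.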
Key findings
- Factual divergence on hard queries (T01–T03). T01: three models, two top-ranked bispecific formats. GPT-5.5 (extended reasoning) chose tetravalent IgG-scFv (correct, via VEGF biology reasoning), while Gemini and Claude in standard mode both chose CrossMab. T02: three models, three LNP moieties. GPT-5.5 chose anti-CD5, Gemini anti-CD3, and Claude anti-CD8 (correct). T03: GPT-5.5 and Gemini correctly identify ABE + splice donor; Claude defaults to the older CBE/W8 approach. BioMate's output is deterministic and grounded in program precedent for all five queries.
- The real gap is future prediction, not reasoning depth. T01–T03 discrepancies between models (CrossMab vs. tetravalent; anti-CD3 vs. anti-CD8; CBE vs. ABE) are attributable to training data recency — but the correct answers are publicly available and retrievable via web search. The meaningful test is a novel 2027 target with no published precedent: no paper, no IND, no training data. An LLM with web search returns nothing. BioMate's live atlas query, fratricide check, and program-precedent retrieval return the same structured output regardless of training-corpus coverage.
- LLMs correctly rank T04 and T05 from training data. All three give the right MASH ranking and the right BCMA modality order. The differentiator is verifiability and generalization: BioMate queries the atlas live (GCGR TPM 842 in hepatocytes), executes the indication flip automatically, and would correctly handle a novel 2027 target absent from any training corpus. For Claudin18.2, the critical-tissue flag shifts the winner from CAR-T to mAb, a result no LLM can derive without live atlas data.
- IND-critical details require execution. BioMate emits an exact (not approximate) 1,485 AA FASTA, an NGG PAM-specific gRNA, the VERVE-101 LFT finding flagged as a class effect of hepatic LNP-IV delivery (not a program-specific issue), and a GCGR value of TPM 842. These details go directly into IND submissions; LLMs approximate or omit them.
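The fratricide check referenced in these findings (CAR target × LNP-transfected cell set) reduces to a set intersection: which cell types the LNP transfects, and which of those also carry the CAR's target antigen. The sketch below is a toy illustration only. The marker table is a simplified, hand-written assumption rather than atlas data, the function names are hypothetical, and it does not represent BioMate's implementation.

```python
# ASSUMPTION: simplified, hand-written surface-marker table for illustration;
# a real check would draw on atlas expression data.
CELL_MARKERS = {
    "CD8+ T cell": {"CD3", "CD5", "CD7", "CD8"},
    "CD4+ T cell": {"CD3", "CD4", "CD5", "CD7"},
    "B cell": {"CD19", "CD20"},
}


def transfected_cell_types(lnp_target):
    """Cell types that display the LNP targeting moiety's antigen."""
    return {cell for cell, markers in CELL_MARKERS.items() if lnp_target in markers}


def fratricide_risk(car_target, lnp_target):
    """Transfected cell types that also express the CAR target antigen."""
    return {
        cell
        for cell in transfected_cell_types(lnp_target)
        if car_target in CELL_MARKERS[cell]
    }


print(fratricide_risk("CD19", "CD8"))  # set(): CD19 CAR via anti-CD8 LNP
print(fratricide_risk("CD7", "CD8"))   # {'CD8+ T cell'}: a CD7 CAR would self-target
```

With this toy table, the T02 gold-standard pairing (CD19 CAR delivered by anti-CD8 LNP) returns an empty risk set, while a CAR against a T-cell antigen such as CD7 flags the transfected CD8+ T cells.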
Full response text, verdict breakdown, and gap analysis for each of the five queries (T01–T05) are provided in the supplementary case analysis, preserved for manuscript preparation.
Conclusion
Large language models are strong reasoning assistants for well-characterized biology: targets with published programs, documented mechanisms, and indexed clinical data. The T01–T03 discrepancies between models are largely attributable to training data recency; the correct answers are retrievable from public literature. The genuine capability gap is future prediction: a 2027 target with no published expression data, no precedent program, and no training corpus entry. LLMs cannot answer that question, with or without extended thinking and with or without web search of existing papers. BioMate's live atlas query, critical-tissue scan, and pipeline execution return the same structured output for a day-1 target that they return for BCMA.
"The LLMs tell you what is plausible. BioMate tells you what approved programs actually did — then runs the pipeline and generates the files your IND needs."
Run it yourself
Which modality fits [your target] in [your indication]?
→ biomate.ai — the two-workflow atlas→triage chain runs automatically from a single query sentence. For a novel target not yet in any LLM's training data, the live atlas query is the only path to a grounded answer.
References
- Musunuru K et al. In vivo CRISPR base editing of PCSK9 durably lowers cholesterol in primates. Nature 2021;593:429–434. doi:10.1038/s41586-021-03534-y
- Nawaz M et al. Targeted in vivo T-cell editing with anti-CD8 LNPs. Nature Biotechnology 2023. doi:10.1038/s41587-022-01527-6
- HARMONi-2 trial — Ivonescimab (AK112) in NSCLC. ClinicalTrials.gov: NCT05184204
- SYNERGY-NASH trial — Tirzepatide in MASH. ClinicalTrials.gov: NCT05232513
- Raje N et al. Anti-BCMA CAR T-cell therapy bb2121 in relapsed or refractory multiple myeloma. NEJM 2019;380:1726–1737. doi:10.1056/NEJMoa1817226
- Usmani SZ et al. Ciltacabtagene autoleucel in relapsed/refractory multiple myeloma. Nat Med 2022;28:1914–1921. doi:10.1038/s41591-022-01693-5
- Tabula Sapiens Consortium. A single-cell transcriptomic atlas of human organ systems. Science 2022;376:eabl4896. doi:10.1126/science.abl4896
- CELLxGENE Census. CZ CELLxGENE Discover. cellxgene.cziscience.com