We submitted five drug design queries to GPT-5.5 (OpenAI), Gemini 3.1 Pro (Google), Claude Sonnet 4.6 (Anthropic), and BioMate. Across the 20 responses, the large language models produce scientifically plausible reasoning but diverge on factual specifics (T01: two different bispecific formats; T02: three different LNP moieties; T03: two different editor classes) and cannot execute pipelines, query live atlases, or emit structured output files.

Task, method, and gold standard

| ID | Task & query | Gold standard | Reference |
| --- | --- | --- | --- |
| T01 | PD-1×VEGF bispecific for NSCLC: score formats; rank recommendation | Tetravalent cooperative format #1; CrossMab #2. Ivonescimab (AK112): anti-VEGF IgG + anti-PD-1 scFv C-termini; cooperative avidity on dimeric VEGF is mechanistically decisive. | HARMONi-2 trial (NCT05184204) |
| T02 | In vivo CAR-T for CD19+ B-cell lymphoma: select LNP moiety; fratricide check; emit FASTA | Anti-CD8 LNP → CD8+ T cells only; zero fratricide with CD19 CAR; 1,485 AA FMC63-CD8α-CD28-CD3ζ construct. Basis: Capstan CPTX2309. | Nawaz et al., Nat Biotechnol 2023 |
| T03 | PCSK9 base editing for hypercholesterolemia: editor class, gRNA, delivery; compare VERVE-101/102 | ABE8.8; exon1/intron1 splice donor (A•T→G•C antisense, GT→GC); NGG PAM; VERVE-101 LFT = hepatic LNP-IV class effect, not program-specific. | Musunuru et al., Nature 2021;593:429 |
| T04 | GLP-1 modalities for MASH: rank semaglutide, tirzepatide, retatrutide, orforglipron | Retatrutide #1 for MASH (GLP1R+GIPR+GCGR triple agonist). GCGR hepatocyte-dominance is mechanistically decisive. Indication flip: T2D → tirzepatide #1. | SYNERGY-NASH trial (NCT05232513) |
| T05 | BCMA in multiple myeloma: rank 7 modalities from 26-tissue expression scan | CAR-T #1; Bispecific #2; ADC #3; naked mAb insufficient. Validated against 7 landmark FDA oncology approvals (2021–2024) across BCMA, DLL3, Claudin18.2, TROP2, HER2×HER3, FRα, GPRC5D. | Raje et al. NEJM 2019; Usmani et al. Nat Med 2022 |

Methodology: Models & Evaluation Conditions

All queries were run on June 16, 2026. The same prompt text was sent to each system. BioMate executed on dev infrastructure with AWS Batch. Gold standards were defined from published literature before the queries were run.

Note on thinking mode: Extended reasoning (reasoning_effort: high) was enabled for GPT-5.5 only; Gemini and Claude were queried in standard mode. Stability was tested empirically: 5 repeated runs of T01 and T03 under GPT-5.5 extended reasoning produced 100% agreement on T01 (Tetravalent/IgG-scFv × 5) but only 60% agreement on T03 (ABE8.8 × 3, ABE unspecified × 2). Thinking-mode outputs are therefore stochastically sampled and do not always converge, so a single thinking-mode run does not constitute reproducible evidence. Additionally, the T01–T03 discrepancies between standard-mode models are attributable to training data recency; the correct answers are publicly retrievable, so thinking mode does not isolate a meaningful capability difference. The relevant gap, predicting for novel targets with no published precedent, is unaffected by reasoning depth.
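The agreement figures in the note can be reproduced with a few lines of Python. This is a minimal sketch that assumes "agreement" means the share of runs matching the most common answer; the report does not define the metric, and the function name `modal_agreement` is illustrative.

```python
from collections import Counter


def modal_agreement(answers):
    """Fraction of runs that match the most common answer.

    Assumption: this is how "agreement" is defined; the report
    does not spell out the metric.
    """
    if not answers:
        raise ValueError("need at least one run")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


# Run outcomes as listed in the note (5 repeats per query).
t01_runs = ["Tetravalent/IgG-scFv"] * 5
t03_runs = ["ABE8.8"] * 3 + ["ABE unspecified"] * 2

print(f"T01 agreement: {modal_agreement(t01_runs):.0%}")  # 100%
print(f"T03 agreement: {modal_agreement(t03_runs):.0%}")  # 60%
```

With only five runs per query these fractions are coarse, which is consistent with the note's point that a single run is not reproducible evidence.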

| System | Model / API | Thinking mode | Notes |
| --- | --- | --- | --- |
| GPT-5.5 | gpt-5.5-2026-04-23, OpenAI API | Extended reasoning ON; reasoning_effort: high | Response time ~90–130 s. Reasoning tokens per query: T01: 3,908 · T02: 3,072 · T03: 2,048 · T04: 1,535 · T05: 1,511 |
| Gemini 3.1 Pro | gemini-3.1-pro-preview, Google Generative Language API | Standard; thinking_budget not set | |
| Claude Sonnet 4.6 | Anthropic API | Standard; extended thinking not enabled | |
| BioMate | Multi-phase pipeline; Claude backend with structured domain prompts; AWS Batch execution | N/A: deterministic routing, not single-prompt LLM | Outputs: JSON, CSV, FASTA (not prose). Evaluated against same gold standards as LLMs. |

Capability summary

| Capability | GPT-5.5 | Gemini 3.1 Pro | Claude | BioMate |
| --- | --- | --- | --- | --- |
| General biology knowledge | | | | |
| Reasoned text answer | | | | |
| Live single-cell atlas query (Tabula Sapiens, CELLxGENE) | | | | |
| Critical-tissue safety scan (26 tissue types) | | | | |
| Fratricide check (CAR target × LNP-transfected cell set) | | | | |
| Actual pipeline execution (Nextflow on AWS Batch) | | | | |
| Emit FASTA / structured output files | Approximate | Approximate | Approximate | ✓ Exact |
| Deterministic (same query → same answer) | | | | |
| Auditable provenance (atlas version, pipeline version, params) | | | | |
| Grounded in 50+ clinical program precedents | | | | |
| IND-ready audit trail | | | | |
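The fratricide check listed above (CAR target × LNP-transfected cell set) can be pictured as a set intersection. The following is a minimal sketch, not BioMate's implementation: the `EXPRESSES` table is a hand-written, hypothetical lookup standing in for atlas expression data, and both function names are illustrative.

```python
# Hypothetical antigen-by-cell-type table for illustration only.
# A real check would derive this from atlas expression data.
EXPRESSES = {
    "CD8+ T cell": {"CD3", "CD5", "CD8"},
    "CD4+ T cell": {"CD3", "CD4", "CD5"},
    "B cell": {"CD19", "CD20"},
}


def transfected_cells(lnp_moiety_target):
    """Cell types the targeted LNP would bind and transfect."""
    return {
        cell for cell, antigens in EXPRESSES.items()
        if lnp_moiety_target in antigens
    }


def fratricide_risk(car_target, lnp_moiety_target):
    """Transfected cell types that also express the CAR target.

    A non-empty result means the CAR-expressing cells could
    recognize and kill each other.
    """
    return {
        cell for cell in transfected_cells(lnp_moiety_target)
        if car_target in EXPRESSES[cell]
    }


print(fratricide_risk("CD19", "CD8"))  # set(): no overlap
print(fratricide_risk("CD5", "CD8"))   # {'CD8+ T cell'}
```

Under these assumed sets, a CD19 CAR delivered by an anti-CD8 LNP yields an empty overlap, matching the T02 gold standard of zero fratricide; the CD5 CAR case is included only as a contrast.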

The 5 queries — summary scorecard

| Query | GPT-5.5 answer | Gemini 3.1 Pro answer | Claude answer | BioMate answer | Key gap |
| --- | --- | --- | --- | --- | --- |
| T01: PD-1×VEGF bispecific format | IgG-scFv (tetravalent) ✓ correct | CrossMab #1, KiH tied (22/30) ✗ CrossMab is #2 | CrossMab ✗ CrossMab is #2 | Tetravalent coop #1, CrossMab #2; 50-program precedent; ivonescimab architecture recovered | Gold std: tetravalent coop #1 (ivonescimab, HARMONi-2). GPT-5.5 (extended reasoning) reached the correct format via VEGF biology; Gemini/Claude in standard mode landed on CrossMab (#2); BioMate grounds in program precedent |
| T02: In vivo CAR-T LNP targeting | anti-CD5 LNP ✗ incorrect | anti-CD3 LNP ✗ incorrect | anti-CD8 LNP ✓ correct | anti-CD8 LNP; cross-reference with atlas; FASTA 1,485 AA emitted | Gold std: anti-CD8 LNP (Capstan CPTX2309). Three LLMs, three different answers; only Claude is correct, and only BioMate can run the fratricide check and emit an exact FASTA |
| T03: PCSK9 base edit design | ABE + splice donor, GalNAc-LNP ✓ correct | ABE + splice donor strategy ✓ correct | CBE + W8 codon (older approach) ✗ outdated | ABE8.8, splice donor, NGG PAM, hepatotropic LNP; VERVE-101 LFT class-effect flag | Gold std: ABE8.8 + splice donor (Musunuru et al., Nature 2021). GPT-5.5 (extended reasoning) and Gemini both correct; Claude defaults to older CBE/W8 approach. BioMate adds VERVE-101 LFT as class-effect flag (IND-critical) |
| T04: GLP-1 modality for MASH | Retatrutide #1; GCGR rationale ✓ correct | Retatrutide #1; GCGR hepatocyte rationale + SYNERGY-NASH data ✓ correct | Retatrutide #1 ✓ correct | Retatrutide #1; live GCGR atlas query (TPM 842 hepatocytes); indication flip: T2D → Tirzepatide #1 | Gold std: retatrutide #1 for MASH (SYNERGY-NASH). All three LLMs correct from training data, but cannot prove it with live atlas values or flip the indication in the same run |
| T05: BCMA modality triage | CAR-T #1; correct reasoning ✓ correct | CAR-T #1; correct ranking, clinical validation cited ✓ correct | CAR-T #1 ✓ correct | CAR-T #1; actual 26-tissue atlas scan; IHC panel + tissue_criticality.json emitted | Gold std: CAR-T #1 (Abecma/Carvykti). All LLMs correct for BCMA from training data; BioMate proves it with a live scan that generalizes to novel targets (e.g. Claudin18.2 → mAb, not CAR-T) |
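The verdicts in the scorecard can be tallied per model. The sketch below transcribes the ✓/✗ marks from the table into a dictionary (the `SCORECARD` and `tally` names are illustrative) and counts correct answers; the totals are derived here and are not stated in the report itself.

```python
# Verdicts transcribed from the scorecard: True = ✓ correct, False = ✗.
SCORECARD = {
    "T01": {"GPT-5.5": True, "Gemini 3.1 Pro": False, "Claude Sonnet 4.6": False},
    "T02": {"GPT-5.5": False, "Gemini 3.1 Pro": False, "Claude Sonnet 4.6": True},
    "T03": {"GPT-5.5": True, "Gemini 3.1 Pro": True, "Claude Sonnet 4.6": False},
    "T04": {"GPT-5.5": True, "Gemini 3.1 Pro": True, "Claude Sonnet 4.6": True},
    "T05": {"GPT-5.5": True, "Gemini 3.1 Pro": True, "Claude Sonnet 4.6": True},
}


def tally(scorecard):
    """Count correct answers per model across all queries."""
    totals = {}
    for verdicts in scorecard.values():
        for model, correct in verdicts.items():
            totals[model] = totals.get(model, 0) + int(correct)
    return totals


def unanimous(scorecard):
    """Queries that every model answered correctly."""
    return [q for q, verdicts in scorecard.items() if all(verdicts.values())]


print(tally(SCORECARD))      # GPT-5.5: 4, Gemini 3.1 Pro: 3, Claude Sonnet 4.6: 3
print(unanimous(SCORECARD))  # ['T04', 'T05']
```

Each cell reflects a single run per model, and GPT-5.5 ran with extended reasoning while the other two did not, so the counts summarize this scorecard only.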

Key findings

  1. Factual non-determinism across models on hard queries (T01–T03). T01: three models, two bispecific formats: GPT-5.5 → tetravalent IgG-scFv (correct, via VEGF biology reasoning); Gemini → CrossMab; Claude → CrossMab. T02: three models, three LNP moieties: GPT-5.5 → anti-CD5, Gemini → anti-CD3, Claude → anti-CD8 (correct). T03: GPT-5.5 and Gemini correctly identify ABE + splice donor; Claude defaults to the older CBE/W8 approach. BioMate's output is deterministic and grounded in program precedent for all five queries.
  2. The real gap is future prediction, not reasoning depth. T01–T03 discrepancies between models (CrossMab vs. tetravalent; anti-CD3 vs. anti-CD8; CBE vs. ABE) are attributable to training data recency — but the correct answers are publicly available and retrievable via web search. The meaningful test is a novel 2027 target with no published precedent: no paper, no IND, no training data. An LLM with web search returns nothing. BioMate's live atlas query, fratricide check, and program-precedent retrieval return the same structured output regardless of training-corpus coverage.
  3. LLMs correctly rank T04 and T05 from training data. All three give the right MASH ranking and the right BCMA modality order. The differentiator is verifiability and generalization: BioMate queries the atlas live (GCGR TPM 842 in hepatocytes), executes the indication flip automatically, and would correctly handle a novel 2027 target absent from any training corpus. For Claudin18.2, the critical-tissue flag shifts the winner from CAR-T to mAb, a result no LLM can derive without live atlas data.
  4. IND-critical details require execution. BioMate emits 1,485 AA exact FASTA (not approximate), NGG PAM-specific gRNA, VERVE-101 LFT as a class effect for hepatic LNP-IV delivery (not a program-specific issue), and TPM 842 for GCGR. These details go directly into IND submissions. LLMs approximate or omit them.
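
The exact-length requirement in finding 4 is something that can be checked mechanically once a FASTA file exists. The sketch below writes a FASTA record and verifies that its residue count round-trips. It uses a placeholder sequence of the stated length (1,485), not the real construct, and the helper names `to_fasta` and `fasta_length` are illustrative rather than part of BioMate.

```python
def to_fasta(header, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


def fasta_length(fasta_text):
    """Residue count of a single-record FASTA string."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("missing FASTA header line")
    return sum(len(line.strip()) for line in lines[1:])


EXPECTED_LENGTH = 1485  # length stated in the report

# Placeholder residues only; this is NOT the actual CAR construct.
placeholder = "M" + "A" * (EXPECTED_LENGTH - 1)
record = to_fasta("placeholder_construct", placeholder)

if fasta_length(record) != EXPECTED_LENGTH:
    raise SystemExit("length mismatch: record is not exact")
print("length check passed:", fasta_length(record))
```

A check like this confirms only the length; it says nothing about whether the residues themselves are correct.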

Per-query response excerpts

Full response text, verdict breakdown, and gap analysis for each of the five queries (T01–T05) are provided in the supplementary case analysis, preserved for manuscript preparation.

Conclusion

Large language models are strong reasoning assistants for well-characterized biology: targets with published programs, documented mechanisms, and indexed clinical data. The T01–T03 discrepancies between models are largely attributable to training data recency; the correct answers are retrievable from public literature. The genuine capability gap is future prediction: a 2027 target with no published expression data, no precedent program, no training corpus entry. LLMs, with or without extended thinking and with or without web search of existing papers, cannot answer that question. BioMate's live atlas query, critical-tissue scan, and pipeline execution return the same output for a day-1 target that they return for BCMA.

"The LLMs tell you what is plausible. BioMate tells you what approved programs actually did — then runs the pipeline and generates the files your IND needs."

Run it yourself

Which modality fits [your target] in [your indication]?

biomate.ai — the two-workflow atlas→triage chain runs automatically from a single query sentence. For a novel target not yet in any LLM's training data, the live atlas query is the only path to a grounded answer.

References

  1. Musunuru K et al. In vivo CRISPR base editing of PCSK9 durably lowers cholesterol in primates. Nature 2021;593:429–434. doi:10.1038/s41586-021-03534-y
  2. Nawaz M et al. Targeted in vivo T-cell editing with anti-CD8 LNPs. Nature Biotechnology 2023. doi:10.1038/s41587-022-01527-6
  3. HARMONi-2 trial — Ivonescimab (AK112) in NSCLC. ClinicalTrials.gov: NCT05184204
  4. SYNERGY-NASH trial — Tirzepatide in MASH. ClinicalTrials.gov: NCT05232513
  5. Raje N et al. Anti-BCMA CAR T-cell therapy bb2121 in relapsed/refractory multiple myeloma. NEJM 2019;380:1726–1737. doi:10.1056/NEJMoa2002011
  6. Usmani SZ et al. Ciltacabtagene autoleucel in relapsed/refractory multiple myeloma. Nat Med 2022;28:1914–1921. doi:10.1038/s41591-022-01693-5
  7. Tabula Sapiens Consortium. A single-cell transcriptomic atlas of human organ systems. Science 2022;376:eabl4896. doi:10.1126/science.abl4896
  8. CELLxGENE Census. CZ CELLxGENE Discover. cellxgene.cziscience.com