BioMate vs GPT-5.5 vs Gemini 3.1 Pro vs Claude — Five Drug Design Queries, Head to Head

We submitted five drug design queries to GPT-5.5 (OpenAI), Gemini 3.1 Pro (Google), Claude Sonnet 4.6 (Anthropic), and BioMate. Across the 20 responses, the large language models produce scientifically plausible reasoning but diverge on factual specifics (T01: two different bispecific formats; T02: three different LNP moieties; T03: two different editor classes) and cannot execute pipelines, query live atlases, or emit structured output files.

Task, method, and gold standard

ID	Task & query	Gold standard	Reference
T01	PD-1×VEGF bispecific for NSCLC — score formats; rank recommendation	Tetravalent cooperative format #1; CrossMab #2. Ivonescimab (AK112): anti-VEGF IgG + anti-PD-1 scFv C-termini; cooperative avidity on dimeric VEGF is mechanistically decisive.	HARMONi-2 trial (NCT05184204)
T02	In vivo CAR-T for CD19+ B-cell lymphoma — select LNP moiety; fratricide check; emit FASTA	Anti-CD8 LNP → CD8+ T cells only; zero fratricide with CD19 CAR; 1,485 AA FMC63-CD8α-CD28-CD3ζ construct. Basis: Capstan CPTX2309.	Nawaz et al., Nat Biotechnol 2023
T03	PCSK9 base editing for hypercholesterolemia — editor class, gRNA, delivery; compare VERVE-101/102	ABE8.8; exon1/intron1 splice donor (A•T→G•C antisense, GT→GC); NGG PAM; VERVE-101 LFT = hepatic LNP-IV class effect, not program-specific.	Musunuru et al., Nature 2021;593:429
T04	GLP-1 modalities for MASH — rank semaglutide, tirzepatide, retatrutide, orforglipron	Retatrutide #1 for MASH (GLP1R+GIPR+GCGR triple agonist). GCGR hepatocyte-dominance is mechanistically decisive. Indication flip: T2D → tirzepatide #1.	SYNERGY-NASH trial (NCT05232513)
T05	BCMA in multiple myeloma — rank 7 modalities from 26-tissue expression scan	CAR-T #1; Bispecific #2; ADC #3; naked mAb insufficient. Validated against 7 landmark FDA oncology approvals (2021–2024) across BCMA, DLL3, Claudin18.2, TROP2, HER2×HER3, FRα, GPRC5D.	Raje et al. NEJM 2019; Usmani et al. Nat Med 2022

Methodology — Models & Evaluation Conditions

All queries were run on June 16, 2026, with the same prompt text sent to each system. BioMate executed on dev infrastructure with AWS Batch. Gold standards were defined from published literature before the queries were run.

Note on thinking mode: Extended reasoning (reasoning_effort: high) was enabled for GPT-5.5 only; Gemini and Claude were queried in standard mode. Stability was checked empirically: five repeated runs each of T01 and T03 under GPT-5.5 extended reasoning produced 100% agreement on T01 (Tetravalent/IgG-scFv × 5) but only 60% agreement on T03 (ABE8.8 × 3, ABE unspecified × 2). Thinking-mode outputs are stochastically sampled and do not always converge, so a single thinking-mode run does not constitute reproducible evidence. In addition, the T01–T03 discrepancies between models are attributable to training data recency; the correct answers are publicly retrievable, so thinking mode does not isolate a meaningful capability difference. The relevant gap, predicting for novel targets with no published precedent, is unaffected by reasoning depth.
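The agreement figures in the note above can be reproduced with a few lines of Python. This is a minimal sketch, assuming "agreement" means the share of runs that return the most common answer; the function name `modal_agreement` is ours, and the run labels are transcribed from the note rather than pulled from raw logs.

```python
from collections import Counter

def modal_agreement(answers):
    """Return (most common answer, fraction of runs that gave it)."""
    if not answers:
        raise ValueError("need at least one run")
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

# Outcomes as reported in the note (5 runs each, GPT-5.5 extended reasoning).
t01_runs = ["Tetravalent/IgG-scFv"] * 5
t03_runs = ["ABE8.8"] * 3 + ["ABE unspecified"] * 2

for task, runs in (("T01", t01_runs), ("T03", t03_runs)):
    answer, rate = modal_agreement(runs)
    print(f"{task}: modal answer {answer!r}, agreement {rate:.0%}")
# T01: modal answer 'Tetravalent/IgG-scFv', agreement 100%
# T03: modal answer 'ABE8.8', agreement 60%
```

With only five runs per task, these rates are coarse; a larger sample would be needed before treating either figure as a stable estimate.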

System	Model / API	Thinking mode	Notes
GPT-5.5	`gpt-5.5-2026-04-23`, OpenAI API	Extended reasoning ON `reasoning_effort: high`	Response time ~90–130 s. Reasoning tokens per query — T01: 3,908 · T02: 3,072 · T03: 2,048 · T04: 1,535 · T05: 1,511
Gemini 3.1 Pro	`gemini-3.1-pro-preview`, Google Generative Language API	Standard — `thinking_budget` not set	—
Claude Sonnet 4.6	Anthropic API	Standard — extended thinking not enabled	—
BioMate	Multi-phase pipeline; Claude backend with structured domain prompts; AWS Batch execution	N/A — deterministic routing, not single-prompt LLM	Outputs: JSON, CSV, FASTA — not prose. Evaluated against same gold standards as LLMs.

Capability summary

Capability	GPT-5.5	Gemini 3.1 Pro	Claude	BioMate
General biology knowledge	✓	✓	✓	✓
Reasoned text answer	✓	✓	✓	✓
Live single-cell atlas query (Tabula Sapiens, CELLxGENE)	✗	✗	✗	✓
Critical-tissue safety scan (26 tissue types)	✗	✗	✗	✓
Fratricide check (CAR target × LNP-transfected cell set)	✗	✗	✗	✓
Actual pipeline execution (Nextflow on AWS Batch)	✗	✗	✗	✓
Emit FASTA / structured output files	Approximate	Approximate	Approximate	✓ Exact
Deterministic (same query → same answer)	✗	✗	✗	✓
Auditable provenance (atlas version, pipeline version, params)	✗	✗	✗	✓
Grounded in 50+ clinical program precedents	✗	✗	✗	✓
IND-ready audit trail	✗	✗	✗	✓
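
The fratricide check listed in the table (CAR target × LNP-transfected cell set) can be pictured as a set intersection: find the cell populations the LNP would transfect, then ask whether any of them also display the CAR's target antigen. The sketch below is illustrative only. The marker sets are toy placeholders, not atlas values, and the function names are hypothetical; it does not describe BioMate's actual implementation.

```python
# Toy surface-marker sets per cell population. Placeholders for the sketch,
# NOT values retrieved from Tabula Sapiens or CELLxGENE.
TOY_MARKERS = {
    "CD8+ T cell": {"CD3", "CD5", "CD8"},
    "CD4+ T cell": {"CD3", "CD4", "CD5"},
    "B cell": {"CD19", "CD20"},
}

def transfected_populations(lnp_target, markers):
    """Populations that display the antigen the LNP is targeted to."""
    return {cell for cell, expressed in markers.items() if lnp_target in expressed}

def fratricide_risk(car_target, lnp_target, markers):
    """Transfected populations that also display the CAR's target antigen."""
    return {
        cell
        for cell in transfected_populations(lnp_target, markers)
        if car_target in markers[cell]
    }

# CD19 CAR delivered by anti-CD8 LNP: no overlap in this toy table.
print(sorted(fratricide_risk("CD19", "CD8", TOY_MARKERS)))  # []
# Contrast case: a CD5-directed CAR delivered to CD8+ T cells overlaps.
print(sorted(fratricide_risk("CD5", "CD8", TOY_MARKERS)))   # ['CD8+ T cell']
```

A real check would replace the hand-written dictionary with expression calls from an atlas query and a stated expression threshold; the set logic stays the same.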

The 5 queries — summary scorecard

Query	GPT-5.5 answer	Gemini 3.1 Pro answer	Claude answer	BioMate answer	Key gap
T01 — PD-1×VEGF bispecific format	IgG-scFv (tetravalent) ✓ correct	CrossMab #1, KiH tied (22/30) ✗ CrossMab is #2	CrossMab ✗ CrossMab is #2	Tetravalent coop #1, CrossMab #2 — 50-program precedent; ivonescimab architecture recovered	Gold std: tetravalent coop #1 (ivonescimab, HARMONi-2). GPT-5.5 (extended reasoning) reached the correct format via VEGF biology; Gemini/Claude in standard mode landed on CrossMab (#2); BioMate grounds in program precedent
T02 — In vivo CAR-T LNP targeting	anti-CD5 LNP ✗ incorrect	anti-CD3 LNP ✗ incorrect	anti-CD8 LNP ✓ correct	anti-CD8 LNP — cross-reference with atlas; FASTA 1,485 AA emitted	Gold std: anti-CD8 LNP (Capstan CPTX2309). Three LLMs, three different answers — only Claude is correct; only BioMate can run the fratricide check and emit an exact FASTA
T03 — PCSK9 base edit design	ABE + splice donor, GalNAc-LNP ✓ correct	ABE + splice donor strategy ✓ correct	CBE + W8 codon (older approach) ✗ outdated	ABE8.8, splice donor, NGG PAM, hepatotropic LNP — VERVE-101 LFT class-effect flag	Gold std: ABE8.8 + splice donor (Musunuru et al., Nature 2021). GPT-5.5 (extended reasoning) and Gemini both correct; Claude defaults to older CBE/W8 approach. BioMate adds VERVE-101 LFT as class-effect flag — IND-critical
T04 — GLP-1 modality for MASH	Retatrutide #1 — GCGR rationale ✓ correct	Retatrutide #1 — GCGR hepatocyte rationale + SYNERGY-NASH data ✓ correct	Retatrutide #1 ✓ correct	Retatrutide #1 — live GCGR atlas query (TPM 842 hepatocytes); indication flip: T2D → Tirzepatide #1	Gold std: retatrutide #1 for MASH (SYNERGY-NASH). All three LLMs correct from training data — but can't prove it with live atlas values or flip the indication in the same run
T05 — BCMA modality triage	CAR-T #1 — correct reasoning ✓ correct	CAR-T #1 — correct ranking, clinical validation cited ✓ correct	CAR-T #1 ✓ correct	CAR-T #1 — actual 26-tissue atlas scan; IHC panel + tissue_criticality.json emitted	Gold std: CAR-T #1 (Abecma/Carvykti). All LLMs correct for BCMA from training data; BioMate proves it with a live scan that generalizes to novel targets (e.g. Claudin18.2 → mAb, not CAR-T)
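
To make the scorecard easier to check, the verdict marks can be tallied per model. The snippet below is a small sketch that transcribes the ✓/✗ marks from the table (True means the answer matched the gold standard); BioMate is left out because the table gives it no ✓/✗ mark. The helper names are ours.

```python
# Verdicts transcribed from the scorecard (True = matches the gold standard).
VERDICTS = {
    "GPT-5.5":        {"T01": True,  "T02": False, "T03": True,  "T04": True, "T05": True},
    "Gemini 3.1 Pro": {"T01": False, "T02": False, "T03": True,  "T04": True, "T05": True},
    "Claude":         {"T01": False, "T02": True,  "T03": False, "T04": True, "T05": True},
}

def tally(verdicts):
    """Number of gold-standard matches per model."""
    return {model: sum(by_task.values()) for model, by_task in verdicts.items()}

def contested_tasks(verdicts):
    """Tasks on which the models do not all receive the same verdict."""
    tasks = sorted(next(iter(verdicts.values())))
    return [t for t in tasks if len({v[t] for v in verdicts.values()}) > 1]

print(tally(VERDICTS))            # {'GPT-5.5': 4, 'Gemini 3.1 Pro': 3, 'Claude': 3}
print(contested_tasks(VERDICTS))  # ['T01', 'T02', 'T03']
```

The contested set is T01–T03, and T04 and T05 are unanimous, which matches the split the key findings draw between hard and well-characterized queries.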

Key findings

Factual divergence on hard queries (T01–T03; GPT-5.5 in extended reasoning, Gemini and Claude in standard mode). T01: three models, two bispecific formats: GPT-5.5 → tetravalent IgG-scFv (correct, via VEGF biology reasoning); Gemini → CrossMab; Claude → CrossMab. T02: three models, three LNP moieties: GPT-5.5 → anti-CD5, Gemini → anti-CD3, Claude → anti-CD8 (correct). T03: GPT-5.5 and Gemini correctly identify ABE + splice donor; Claude defaults to the older CBE/W8 approach. BioMate's output is deterministic and grounded in program precedent for all five queries.
The real gap is future prediction, not reasoning depth. T01–T03 discrepancies between models (CrossMab vs. tetravalent; anti-CD3 vs. anti-CD8; CBE vs. ABE) are attributable to training data recency — but the correct answers are publicly available and retrievable via web search. The meaningful test is a novel 2027 target with no published precedent: no paper, no IND, no training data. An LLM with web search returns nothing. BioMate's live atlas query, fratricide check, and program-precedent retrieval return the same structured output regardless of training-corpus coverage.
LLMs correctly rank T04 and T05 from training data. All three give the right MASH ranking and the right BCMA modality order. The differentiator is verifiability and generalization: BioMate queries the atlas live (GCGR TPM 842 in hepatocytes), executes the indication flip automatically, and would correctly handle a novel 2027 target absent from any training corpus. For Claudin18.2, the critical-tissue flag shifts the winner from CAR-T to mAb, a result no LLM can derive without live atlas data.
IND-critical details require execution. BioMate emits an exact 1,485 AA FASTA (not an approximation), an NGG PAM-specific gRNA, the VERVE-101 LFT finding flagged as a class effect of hepatic LNP-IV delivery (not a program-specific issue), and a GCGR value of TPM 842 in hepatocytes. These details go directly into IND submissions. LLMs approximate or omit them.
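
One reason an "exact" FASTA matters is that it can be checked mechanically: the residue count and alphabet either match the specification or they do not. The following is a minimal sketch of emitting and re-validating a protein FASTA record. The sequence is a placeholder, not the CAR construct discussed here, and the function names and the 60-column wrap are assumptions for illustration.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def to_fasta(header, sequence, width=60):
    """Format a protein sequence as a FASTA record with fixed-width lines."""
    seq = "".join(sequence.split()).upper()
    bad = set(seq) - AMINO_ACIDS
    if bad:
        raise ValueError(f"non-standard residues: {sorted(bad)}")
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"

def fasta_length(record):
    """Count residues in a single-record FASTA string."""
    lines = record.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("record must start with a '>' header line")
    return sum(len(line.strip()) for line in lines[1:])

# Placeholder sequence for demonstration only; not a real construct.
placeholder = "MKT" * 50  # 150 residues
record = to_fasta("demo_construct placeholder", placeholder)
print(fasta_length(record))  # 150
```

Applied to a real output file, the same length check would be compared against the expected construct size, and a mismatch would flag an approximate or truncated sequence.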

Per-query response excerpts

Full response text, verdict breakdown, and gap analysis for each of the five queries (T01–T05) are provided in the supplementary case analysis, preserved for manuscript preparation.

Conclusion

Large language models are strong reasoning assistants for well-characterized biology: targets with published programs, documented mechanisms, and indexed clinical data. The T01–T03 discrepancies between models are largely attributable to training data recency; the correct answers are retrievable from public literature. The genuine capability gap is future prediction: a 2027 target with no published expression data, no precedent program, and no training corpus entry. LLMs, with or without extended thinking and with or without web search of existing papers, cannot answer that question. BioMate's live atlas query, critical-tissue scan, and pipeline execution return the same output for a day-1 target that they return for BCMA.

"The LLMs tell you what is plausible. BioMate tells you what approved programs actually did — then runs the pipeline and generates the files your IND needs."

Run it yourself

Which modality fits [your target] in [your indication]?

→ biomate.ai — the two-workflow atlas→triage chain runs automatically from a single query sentence. For a novel target not yet in any LLM's training data, the live atlas query is the only path to a grounded answer.

References

Musunuru K et al. In vivo CRISPR base editing of PCSK9 durably lowers cholesterol in primates. Nature 2021;593:429–434. doi:10.1038/s41586-021-03534-y
Nawaz M et al. Targeted in vivo T-cell editing with anti-CD8 LNPs. Nature Biotechnology 2023. doi:10.1038/s41587-022-01527-6
HARMONi-2 trial — Ivonescimab (AK112) in NSCLC. ClinicalTrials.gov: NCT05184204
SYNERGY-NASH trial — Tirzepatide in MASH. ClinicalTrials.gov: NCT05232513
Raje N et al. Anti-BCMA CAR T-cell therapy bb2121 in relapsed or refractory multiple myeloma. NEJM 2019;380:1726–1737. doi:10.1056/NEJMoa1817226
Usmani SZ et al. Ciltacabtagene autoleucel in relapsed/refractory multiple myeloma. Nat Med 2022;28:1914–1921. doi:10.1038/s41591-022-01693-5
Tabula Sapiens Consortium. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 2022;376:eabl4896. doi:10.1126/science.abl4896
CELLxGENE Census. CZ CELLxGENE Discover. cellxgene.cziscience.com