GPT-5.5: extended reasoning enabled (reasoning_effort: high); token counts shown per query.
Gemini 3.1 Pro: standard mode — thinking_budget not set.
Claude Sonnet 4.6: standard API — extended thinking parameter not set.
BioMate: multi-phase AI coordination on dev infrastructure with AWS Batch.
T01 — Bispecific Format Triage
PD-1 × VEGF bispecific antibody for NSCLC — which format?
GPT-5.5: #1 IgG-scFv (tetravalent); CrossMab listed as #3 backup. Rationale: bivalent VEGF capture favors a tetravalent format. Closest to ivonescimab's actual architecture, reached via VEGF biology rather than program precedent.
Gemini 3.1 Pro: #1 CrossMab (tied with KiH at 22/30). Scoring matrix is explicit. CrossMab ranks #1 for CMC/half-life reasons, not cooperative avidity.
Claude Sonnet 4.6: #1 CrossMab, with the same CMC-precedent reasoning as Gemini. CrossMab is #2 in the corrected ranking.
Phase A: 50 PD-1/VEGF bispecific precedents queried; ivonescimab's tetravalent architecture (anti-VEGF IgG with anti-PD-1 scFvs at the C-termini) found.
Phase B: 6 formats × 6 axes + cooperative_avidity weight. Tetravalent 100/100, CrossMab 78, KiH 70, BiTE 25.
Phase C: Tetravalent #1. CrossMab #2 (user-specified format retained). AK112 cited.
Phase D: Report + 6-axis CSV emitted.
Gold standard: tetravalent cooperative format #1 (ivonescimab / AK112; HARMONi-2 NCT05184204). GPT-5.5 (extended reasoning) reached the correct format by reasoning through VEGF biology; Gemini and Claude, both in standard mode, landed on CrossMab (#2) via CMC/precedent heuristics. The ivonescimab program data is publicly available, so an LLM with web search could retrieve it. The real test is a novel 2027 target pair with no published clinical precedent: BioMate queries 50+ program precedents live, whereas an LLM without retrieval reasons from first principles alone.
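The effect of the cooperative_avidity weight in Phase B can be illustrated with a minimal weighted-sum sketch. Everything numeric here is an assumption: the axis names other than cooperative_avidity, the weights, and the 0-10 scores are hypothetical placeholders (only three axes are shown), not BioMate's scoring matrix. They are chosen only to show how up-weighting one axis can move a tetravalent format ahead of CrossMab.

```python
# Illustrative only: axis names, weights and scores are hypothetical placeholders,
# not values taken from BioMate's scoring matrix.
FORMAT_SCORES = {
    "IgG-scFv (tetravalent)": {"cmc": 5, "half_life": 7, "cooperative_avidity": 9},
    "CrossMab": {"cmc": 9, "half_life": 9, "cooperative_avidity": 4},
    "KiH": {"cmc": 8, "half_life": 8, "cooperative_avidity": 4},
    "BiTE": {"cmc": 5, "half_life": 2, "cooperative_avidity": 2},
}

EQUAL_WEIGHTS = {"cmc": 1.0, "half_life": 1.0, "cooperative_avidity": 1.0}
AVIDITY_WEIGHTED = {"cmc": 1.0, "half_life": 1.0, "cooperative_avidity": 2.0}


def rank_formats(scores, weights):
    """Return (format, weighted total) pairs sorted from best to worst."""
    totals = {
        name: sum(weights[axis] * value for axis, value in axes.items())
        for name, axes in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for label, weights in [("equal", EQUAL_WEIGHTS), ("avidity x2", AVIDITY_WEIGHTED)]:
        print(label, rank_formats(FORMAT_SCORES, weights))
```

With these placeholder scores, equal weights put CrossMab first, while doubling the avidity weight puts the tetravalent format first. That mirrors the split between the CMC-driven and biology-driven answers above, but it says nothing about the real magnitudes.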
T02 — In Vivo CAR-T
In vivo CAR-T for CD19 B-cell lymphoma — which LNP targeting moiety?
GPT-5.5: #1 anti-CD5 (with caveats). CAR: FMC63 scFv + CD8α hinge + 4-1BB + CD3ζ, ~490 AA (estimated).
Gemini 3.1 Pro: #1 anti-CD3. Argues that CD4/CD8 synergy is required. Fratricide check: "negligible — CD19 is B-cell marker."
Claude Sonnet 4.6: #1 anti-CD8, aligned with BioMate and Capstan. Cannot emit a verified FASTA or run a cell-set cross-reference.
Phase A: CD19 atlas query — B-cell restricted, no critical-tissue hits → viable.
Phase B: anti-CD8 = CD8+ T cells only. anti-CD7: FRATRICIDE RISK flagged.
Phase C: Fratricide — CD19 (B-cell) × anti-CD8 (T-cell) = zero overlap → SAFE.
Phase D: FMC63 scFv + CD8α hinge/TM + CD28 costim + CD3ζ FASTA: 1,485 AA (exact). Note: Capstan CPTX2309 uses CD28 costim; Kymriah uses 4-1BB.
Gold standard: anti-CD8 LNP (Capstan CPTX2309). The three LLMs gave three different LNP moieties. Capstan's choice is in the public literature, so an LLM with web search could retrieve it. The non-retrievable gap: none of the LLMs can run a cell-type expression cross-reference to catch fratricide, or emit an exact FASTA. Those require live computation, not recall. The critical test is a novel target combination that appears in no IND or paper, where there is no precedent to retrieve.
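The cell-set cross-reference behind the Phase C fratricide check can be sketched as a set intersection: which cell types express both the CAR antigen and the LNP targeting marker? The marker table below is a hand-written, simplified assumption for illustration, not an atlas query, and the function names are hypothetical.

```python
# Simplified, hand-written marker table (an assumption for illustration).
# A real check would pull cell-type expression from an atlas instead.
MARKER_CELL_TYPES = {
    "CD19": {"B cell"},
    "CD8": {"CD8+ T cell"},
    "CD7": {"CD4+ T cell", "CD8+ T cell", "NK cell"},
}


def fratricide_overlap(car_antigen, lnp_marker, table=MARKER_CELL_TYPES):
    """Cell types that would both receive the CAR and display its antigen."""
    return table[car_antigen] & table[lnp_marker]


def fratricide_call(car_antigen, lnp_marker, table=MARKER_CELL_TYPES):
    """Return 'SAFE' when the two cell sets do not overlap, else name the overlap."""
    overlap = fratricide_overlap(car_antigen, lnp_marker, table)
    if not overlap:
        return "SAFE"
    return "FRATRICIDE RISK: " + ", ".join(sorted(overlap))


if __name__ == "__main__":
    print(fratricide_call("CD19", "CD8"))  # CAR antigen on B cells, LNP aimed at CD8+ T cells
    print(fratricide_call("CD7", "CD7"))   # antigen present on the engineered cells themselves
```

The CD19 × anti-CD8 case returns SAFE because the two sets share no cell type. The second call shows how a shared marker would be flagged under the same rule.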
T03 — Base Editing
PCSK9 base edit for hypercholesterolemia — editor class, gRNA, delivery
GPT-5.5: ABE + splice donor, correct per the VERVE-101/102 mechanism (Musunuru et al., Nature 2021). LFT (liver function test) flag mentioned but not called out as a class effect.
Gemini 3.1 Pro: ABE + splice-site strategy, correct. Excellent mechanistic explanation of NMD (nonsense-mediated decay). Does not surface the VERVE-101 LFT signal as a class effect.
Claude Sonnet 4.6: Leads with CBE + W8 codon, the older, less accurate description of the VERVE mechanism. Flags LFT as a delivery concern, not an editor-class signal.
Phase A: PCSK9 hepatocyte-restricted (Tabula Sapiens) → LNP-IV.
Phase B: ABE8.8 selected — A•T→G•C, splice donor target (GT→GC → NMD).
Phase C: gRNA — splice donor, NGG PAM, 20-nt, no off-targets.
Phase D: Hepatotropic LNP-IV.
Phase E: VERVE-101 LFT (Nov 2023) flagged as hepatic LNP-IV class effect.
Gold standard: ABE8.8 + exon 1/intron 1 splice-donor gRNA + hepatotropic LNP-IV + the VERVE-101 LFT signal treated as a hepatic LNP-IV class effect (Musunuru et al., Nature 2021). GPT-5.5 (extended reasoning) and Gemini correctly identify ABE + splice donor; Claude, in standard mode, leads with the older CBE/W8 framing from earlier preclinical literature. All three can retrieve the VERVE-101 mechanism from the published record, since the Musunuru paper is publicly available. The non-retrievable gap: none classifies the LFT signal as a hepatic LNP-IV class effect rather than a program-specific issue. BioMate's IND flag applies to any future hepatocyte-targeted base-editing program, not just PCSK9.
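The Phase C constraints (20-nt spacer, NGG PAM) can be sketched as a simple scan of both strands. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input is a toy string and not PCSK9 sequence, and the sketch does not check the ABE editing window or search for off-targets.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_ngg_sites(seq, spacer_len=20):
    """List (strand, offset, spacer, pam) for each spacer followed by an NGG PAM.

    Offsets on the '-' strand are relative to the reverse-complemented sequence.
    """
    # Lookahead keeps the match zero-width so overlapping candidates are all found.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % spacer_len)
    forward = seq.upper()
    sites = []
    for strand, strand_seq in (("+", forward), ("-", reverse_complement(forward))):
        for match in pattern.finditer(strand_seq):
            sites.append((strand, match.start(), match.group(1), match.group(2)))
    return sites


# Toy sequence for demonstration only; it is not taken from PCSK9.
TOY_SEQUENCE = "ACGTACGTACGTACGTACGTAGGTT"

if __name__ == "__main__":
    for site in find_ngg_sites(TOY_SEQUENCE):
        print(site)
```

A real design step would further require the target adenine of the splice donor to fall inside the editor's activity window and would score each candidate against the genome for off-targets.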
T04 — GLP-1 Modality Bakeoff
GLP-1 modalities for MASH — retatrutide vs tirzepatide vs semaglutide vs orforglipron
GPT-5.5: Retatrutide > Tirzepatide > Semaglutide > Orforglipron. Ranking correct, drawn from training data.
Gemini 3.1 Pro: Ranking correct. Cites specific SYNERGY-NASH Phase 2 data, the most clinical detail of the three LLMs.
Claude Sonnet 4.6: Ranking correct. Cannot provide live atlas expression values.
Phase A: Live Tabula Sapiens — GCGR: hepatocyte-dominant (TPM 842); GLP1R: broad; GIPR: adipose.
Phase B: 6-candidate × 6-axis scoring.
Phase C: MASH: GCGR +10. T2D: GCGR −10.
Phase D: Retatrutide #1 (MASH). indication=t2d: Tirzepatide #1.
All four systems give the same top-line ranking for this well-characterized receptor class. The differentiator is verifiability and generalization: BioMate queries live atlas TPM values (GCGR 842 in hepatocytes), runs the indication flip automatically, and would correctly answer for a novel receptor combination with no training precedent. GCGR hepatocyte dominance is publicly documented, so any model with web search can retrieve it. The irreplaceable gap is a 2027 orphan-receptor program for which no expression atlas query has been published.
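The indication flip in Phases C and D can be sketched as a signed adjustment applied only to candidates with glucagon-receptor (GCGR) agonism. The +10 and -10 adjustments come from Phase C; the base scores are hypothetical placeholders chosen for illustration, and the function name is an assumption.

```python
# Base scores are hypothetical placeholders, not BioMate output.
BASE_SCORES = {
    "retatrutide": 80,
    "tirzepatide": 78,
    "semaglutide": 70,
    "orforglipron": 60,
}

# Retatrutide is the only candidate here with glucagon-receptor agonism.
GCGR_AGONISTS = {"retatrutide"}

# Adjustment per indication, as stated in Phase C.
GCGR_ADJUSTMENT = {"mash": 10, "t2d": -10}


def rank_candidates(indication):
    """Apply the GCGR adjustment for the indication and sort best-first."""
    delta = GCGR_ADJUSTMENT[indication]
    adjusted = {
        drug: score + (delta if drug in GCGR_AGONISTS else 0)
        for drug, score in BASE_SCORES.items()
    }
    return sorted(adjusted.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print("mash:", rank_candidates("mash"))
    print("t2d:", rank_candidates("t2d"))
```

With these placeholders the same table yields retatrutide first for MASH and tirzepatide first for T2D, which is the behavior Phase D reports.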
T05 — Modality Triage
BCMA in multiple myeloma — rank all 7 modalities
GPT-5.5: Ranking correct. Acknowledges that it cannot query live databases. Notes soluble BCMA shedding.
Gemini 3.1 Pro: Ranking correct. Notes the PROTAC spatial-biology limitation. Cannot run the 26-tissue scan. References teclistamab (Moreau et al., NEJM 2022).
Claude Sonnet 4.6: Ranking correct. Correctly identifies the 4 discriminating factors. Cannot produce ihc_followup_panel.txt.
W1 — atlas_expression_query: BCMA across 26 tissues. Plasma cells: high. CNS/heart/kidney/liver: not detected.
W2 — modality_triage: CAR-T gate clears. ADC internalization confirmed. CD3 bispecific viable.
Outputs: modality_ranking.csv, tissue_criticality.json, ihc_followup_panel.txt.
Claudin18.2 → critical-tissue flag → mAb #1 (not CAR-T). 7/7 recent FDA approvals (2021–2024) recovered from expression data alone.
All four systems reach the same conclusion for BCMA, a well-characterized target with published single-cell expression data. The real test is a novel target. Asked about Claudin18.2, any LLM returns a paragraph; BioMate's atlas scan flags epithelial lung/pancreas expression and shifts the winner from CAR-T to mAb (Vyloy/zolbetuximab, Oct 2024). For a completely unpublished 2027 target, LLMs have no atlas data to recall. BioMate's live query returns the same answer regardless of training-corpus coverage.
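The critical-tissue gate described above can be sketched as a threshold check over a tissue expression profile. This is a deliberately reduced illustration: the critical-tissue list, the 1.0 TPM threshold, the two-way CAR-T/mAb rule, and all TPM values are hypothetical assumptions, not atlas data or BioMate's triage logic.

```python
# Hypothetical critical-tissue list and threshold; assumptions for illustration.
CRITICAL_TISSUES = {"CNS", "heart", "kidney", "liver", "lung", "pancreas"}


def critical_tissue_hits(expression_tpm, threshold=1.0):
    """Critical tissues where the target is expressed at or above the threshold."""
    return sorted(
        tissue
        for tissue, tpm in expression_tpm.items()
        if tissue in CRITICAL_TISSUES and tpm >= threshold
    )


def triage(expression_tpm, threshold=1.0):
    """Toy rule: any critical-tissue hit closes the CAR-T gate and favors a mAb."""
    hits = critical_tissue_hits(expression_tpm, threshold)
    return {
        "car_t_gate_clears": not hits,
        "critical_hits": hits,
        "lead_modality": "CAR-T" if not hits else "mAb",
    }


# Placeholder profiles with made-up TPM values, shaped like the two cases in the text.
BCMA_LIKE = {"plasma cell": 150.0, "CNS": 0.0, "heart": 0.0, "kidney": 0.0, "liver": 0.0}
CLDN18_2_LIKE = {"stomach": 200.0, "lung": 4.0, "pancreas": 3.0}

if __name__ == "__main__":
    print(triage(BCMA_LIKE))
    print(triage(CLDN18_2_LIKE))
```

The first profile clears the gate because no critical tissue reaches the threshold; the second is flagged on lung and pancreas, which is the kind of shift from CAR-T to mAb the text describes for Claudin18.2.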