LLM Comparison — Supplementary Case Analysis (T01–T05)

Note on thinking modes

GPT-5.5: extended reasoning enabled (reasoning_effort: high); token counts shown per query.

Gemini 3.1 Pro: standard mode — thinking_budget not set.

Claude Sonnet 4.6: standard API — extended thinking parameter not set.

BioMate: multi-phase AI coordination on dev infrastructure with AWS Batch.

T01 — Bispecific Format Triage

PD-1 × VEGF bispecific antibody for NSCLC — which format?

Query: "Design a PD-1 × VEGF bispecific antibody for NSCLC. Score the main formats — CrossMab, IgG-scFv, BiTE, DART, KiH — across CMC, immunogenicity, half-life, tumor penetration, ADCC, manufacturability. Give a ranked recommendation."

GPT-5.5 · Extended reasoning ON · 129 s · 3,908 tokens

"Lead concept: an IgG-like bispecific with two anti-PD-1 Fab arms and two anti-VEGF-A scFv domains fused to the Fc/HC C-termini, with an effector-silenced Fc... Recommended lead: tetravalent Fc-silent IgG-scFv. KiH as developability backup. CrossMab as alternative."

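#1: IgG-scFv (tetravalent); KiH listed as the developability backup and CrossMab third, as an alternative. Rationale: bivalent VEGF capture favors a tetravalent format. Closest to ivonescimab's actual architecture (though with the Fab and scFv specificities swapped relative to it), reached via VEGF biology rather than program precedent.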

Closest to correct — tetravalent format ≈ ivonescimab's actual architecture

Gemini 3.1 Pro · Thinking: OFF (standard)

"CrossMab (22/30) and KiH (22/30) both score identically on the matrix. CrossMab is the preferred recommendation due to established CMC track record (Roche platform)... IgG-scFv scores 21/30 — slightly penalized for CMC complexity of the scFv fusion."

#1: CrossMab (tied KiH at 22/30). Scoring matrix is explicit. CrossMab ranks #1 for CMC/half-life reasons, not cooperative avidity.

Partial — CrossMab #2 in corrected gold standard (tetravalent #1)

Claude Sonnet 4.6 · Thinking: OFF (standard)

"CrossMab solves the CMC bottleneck — the forced symmetry approach eliminates chain-mispairing, which is the primary manufacturability problem for IgG-based bispecifics. For a dual checkpoint/anti-angiogenic mechanism, CrossMab's IgG-like half-life and established CMC track record make it the strongest lead..."

#1: CrossMab — same CMC-precedent reasoning as Gemini. CrossMab is #2 in the corrected ranking.

Partial — CrossMab #2 in corrected gold standard (tetravalent #1)

BioMate · 4-phase pipeline

Phase A: 50 PD-1/VEGF bispecific precedents queried — Ivonescimab tetravalent architecture (anti-VEGF IgG + anti-PD-1 scFv C-termini) found.

Phase B: 6 formats × 6 axes + cooperative_avidity weight. Tetravalent 100/100, CrossMab 78, KiH 70, BiTE 25.

Phase C: Tetravalent #1. CrossMab #2 (user-specified format retained). AK112 cited.

Phase D: Report + 6-axis CSV emitted.

Tetravalent coop #1 — ivonescimab architecture recovered from precedent
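Phase B's scoring step can be read as a weighted sum over axes, normalized so that the top format scores 100. The following is a minimal Python sketch of that idea, not BioMate's implementation. The per-axis scores, the format keys, and the weight on cooperative_avidity are hypothetical placeholders, so the output will not reproduce the 100/78/70/25 figures above.

```python
from __future__ import annotations

# Hypothetical 1-5 scores per axis. Placeholders for illustration only;
# these are NOT BioMate's values or measured data.
SCORES = {
    "tetravalent_igg_scfv": {"cmc": 3, "immunogenicity": 3, "half_life": 5,
                             "tumor_penetration": 3, "adcc": 3,
                             "manufacturability": 3, "cooperative_avidity": 5},
    "crossmab": {"cmc": 4, "immunogenicity": 4, "half_life": 5,
                 "tumor_penetration": 3, "adcc": 3,
                 "manufacturability": 4, "cooperative_avidity": 2},
    "kih": {"cmc": 4, "immunogenicity": 4, "half_life": 5,
            "tumor_penetration": 3, "adcc": 3,
            "manufacturability": 3, "cooperative_avidity": 2},
    "bite": {"cmc": 3, "immunogenicity": 3, "half_life": 1,
             "tumor_penetration": 5, "adcc": 1,
             "manufacturability": 3, "cooperative_avidity": 1},
}


def rank_formats(scores: dict, weights: dict) -> list[tuple[str, float]]:
    """Weighted sum per format, rescaled so the best format scores 100."""
    totals = {
        fmt: sum(weights.get(axis, 1.0) * value for axis, value in axes.items())
        for fmt, axes in scores.items()
    }
    best = max(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(fmt, round(100 * total / best, 1)) for fmt, total in ranked]


# Axes not named in `weights` default to weight 1.0.
with_avidity = rank_formats(SCORES, {"cooperative_avidity": 3.0})     # assumed weight
without_avidity = rank_formats(SCORES, {"cooperative_avidity": 0.0})  # avidity ignored
```

With these placeholder numbers, the tetravalent format leads when cooperative avidity is weighted, and CrossMab leads when it is ignored. That mirrors the split between the precedent-driven and CMC-driven rankings in this case, but only as an illustration of how a single weight can flip a ranking.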

Key gap

Gold standard: tetravalent cooperative format #1 (ivonescimab / AK112; HARMONi-2 NCT05184204). GPT-5.5 (extended reasoning) reached the correct format by reasoning through VEGF biology; Gemini and Claude in standard mode landed on CrossMab (#2) via CMC/precedent heuristics. The ivonescimab program data is publicly available — an LLM with web search could retrieve it. The real test is a novel 2027 target pair with no published clinical precedent: BioMate queries 50+ program precedents live; an LLM without retrieval is reasoning from first principles alone.

T02 — In Vivo CAR-T

In vivo CAR-T for CD19 B-cell lymphoma — which LNP targeting moiety?

Query: "Design an in vivo CAR-T for CD19+ B-cell lymphoma using LNP delivery. Check fratricide risk. Recommend the LNP targeting moiety (anti-CD8, anti-CD3, anti-CD5, anti-CD7) and give the CAR construct domain architecture."

GPT-5.5 · Extended reasoning ON · 82 s · 3,072 tokens

"Recommended targeting moiety: anti-CD5. CD5 is broadly expressed on mature T cells including both CD4+ and CD8+ subsets... Important caveat: some B-cell lymphomas are CD5+ — mantle cell lymphoma, CLL/SLL, some DLBCL. If the patient's lymphoma is CD5+, anti-CD5 LNPs could bind/transfect tumor cells."

#1: anti-CD5 (with caveats). CAR: FMC63 scFv + CD8α hinge + 4-1BB + CD3ζ. ~490 AA (estimated).

Wrong LNP — anti-CD5 ≠ Capstan CPTX2309 (anti-CD8)

Gemini 3.1 Pro · Thinking: OFF (standard)

"Recommendation: Anti-CD3 (specifically an anti-CD3ε scFv such as humanized OKT3). By targeting CD3, LNPs will transfect both CD8+ cytotoxic T cells and CD4+ helper T cells. Anti-CD8 is sub-optimal — it ignores CD4+ helper T cells and a CD8-only response often lacks durability."

#1: anti-CD3. Argues CD4/CD8 synergy required. Fratricide check: "negligible — CD19 is B-cell marker."

Wrong LNP — anti-CD3 ≠ Capstan CPTX2309 (anti-CD8)

Claude Sonnet 4.6 · Thinking: OFF (standard)

"Anti-CD8 LNP is the most selective choice — delivers exclusively to CD8+ cytotoxic T cells, which are the effector population you want expressing the CAR. Anti-CD3 risks activation and CRS. Anti-CD7 risks fratricide if CD7 appears on tumor cells. For CD19 lymphoma: CD19 is B-cell restricted — no fratricide."

#1: anti-CD8 — aligned with BioMate and Capstan. Cannot emit verified FASTA or run cell-set cross-reference.

Correct LNP — but no FASTA, no atlas check

BioMate · 4-phase pipeline

Phase A: CD19 atlas query — B-cell restricted, no critical-tissue hits → viable.

Phase B: anti-CD8 = CD8+ T cells only. anti-CD7: FRATRICIDE RISK flagged.

Phase C: Fratricide — CD19 (B-cell) × anti-CD8 (T-cell) = zero overlap → SAFE.

Phase D: FMC63 scFv + CD8α hinge/TM + CD28 costim + CD3ζ FASTA: 1,485 AA (exact). Note: Capstan CPTX2309 uses CD28 costim; Kymriah uses 4-1BB.

anti-CD8 + FASTA 1,485 AA + auditable fratricide log
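The Phase C check above amounts to a set intersection: the cell types an LNP moiety delivers to, against the cell types that express the CAR's target antigen. A minimal Python sketch follows, assuming a small hand-written expression map. The map, the cell-type labels, and the function name are illustrative assumptions; a real run would draw the sets from a single-cell atlas.

```python
from __future__ import annotations

# Hypothetical, hand-written map of marker -> expressing cell types.
# Simplified for illustration; not atlas output.
EXPRESSING_CELLS = {
    "CD19": {"B cell"},
    "CD8": {"CD8+ T cell"},
    "CD7": {"CD4+ T cell", "CD8+ T cell", "NK cell"},
}


def fratricide_overlap(car_target: str, lnp_moiety_target: str) -> set[str]:
    """Cell types that would both receive the CAR payload (via the LNP moiety)
    and express the antigen the CAR recognizes. Empty set = no overlap."""
    transfected = EXPRESSING_CELLS[lnp_moiety_target]
    antigen_positive = EXPRESSING_CELLS[car_target]
    return transfected & antigen_positive


cd19_via_cd8 = fratricide_overlap("CD19", "CD8")  # the T02 design
cd7_via_cd8 = fratricide_overlap("CD7", "CD8")    # contrast: a CD7-directed CAR
```

For the T02 design the intersection is empty, matching the "zero overlap" result. The contrast case is a hypothetical CD7-directed CAR rather than the anti-CD7 LNP moiety discussed above; it shows the check returning a non-empty set when the CAR's antigen sits on the transfected T cells themselves.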

Key gap

Gold standard: anti-CD8 LNP (Capstan CPTX2309). Three LLMs gave three different LNP moieties. Capstan's choice is in the public literature — an LLM with web search could retrieve it. The non-retrievable gap: none of the LLMs can run a cell-type expression cross-reference to catch fratricide, or emit an exact FASTA. Those require live computation, not recall. The critical test is a novel target combination unpublished in any IND or paper — where there is no precedent to retrieve.

T03 — Base Editing

PCSK9 base edit for hypercholesterolemia — editor class, gRNA, delivery

Query: "Design an in vivo base editing strategy for PCSK9 loss-of-function. Recommend editor class (CBE or ABE), guide RNA target site, and LNP delivery. Compare VERVE-101 vs VERVE-102."

GPT-5.5 · Extended reasoning ON · 99 s · 2,048 tokens

"Editor class: ABE preferred. Target the PCSK9 exon 1/intron 1 canonical splice donor — convert GT to GC by editing the A on the antisense strand... Delivery: VERVE-102-like GalNAc-targeted LNP preferred over first-gen VERVE-101 LNP."

ABE + splice donor — correct per VERVE-101/102 mechanism (Musunuru et al., Nature 2021). LFT flag mentioned but not called out as a class effect.

Correct editor + target site — VERVE-101 LFT flag underplayed, no exact gRNA

Gemini 3.1 Pro · Thinking: OFF (standard)

"Recommendation: Adenine Base Editor (ABE). By changing a highly conserved Adenine to a Guanine at an intron/exon boundary, the ABE destroys the splice site — leads to mis-splicing, frameshift, and degradation via NMD."

ABE + splice-site strategy — correct. Excellent NMD mechanistic explanation. Does not surface VERVE-101 LFT signal as a class effect.

Correct editor + mechanism — LFT class effect not raised, no exact gRNA

Claude Sonnet 4.6 · Thinking: OFF (standard)

"CBE creates C→T transitions, suitable for introducing premature stop codons. For PCSK9 LoF, the W8 codon (TGG→TAG nonsense via CBE) is the canonical approach used in early programs. ABE is preferred in VERVE-102 after the VERVE-101 LFT signal highlighted a hepatocyte delivery concern..."

Leads with CBE + W8 codon — the older, less accurate description of the VERVE mechanism. Flags LFT as a delivery concern, not an editor-class signal.

Older CBE/W8 framing — ABE + splice site is the correct VERVE-101/102 mechanism

BioMate · 5-phase pipeline

Phase A: PCSK9 hepatocyte-restricted (Tabula Sapiens) → LNP-IV.

Phase B: ABE8.8 selected — A•T→G•C, splice donor target (GT→GC → NMD).

Phase C: gRNA — splice donor, NGG PAM, 20-nt, no off-targets.

Phase D: Hepatotropic LNP-IV.

Phase E: VERVE-101 LFT (Nov 2023) flagged as hepatic LNP-IV class effect.

ABE8.8 + splice donor + NGG PAM + LFT class-effect flag — IND-ready
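The guide-selection logic in Phases B and C (a 20-nt protospacer, an NGG PAM, and an editable adenine on the antisense strand of the GT splice donor) can be sketched in Python as below. This is a simplified illustration under stated assumptions: the toy sequence is invented and is not PCSK9, the editing window of protospacer positions 4 to 8 is an assumed approximation, and no off-target scoring is attempted.

```python
from __future__ import annotations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def abe_guides_for_base(sense: str, target_idx: int, window=(4, 8)):
    """Find 20-nt protospacers with a 3' NGG PAM that place an adenine
    (the base at target_idx, read on either strand) inside `window`,
    counted 1-based from the PAM-distal end of the protospacer.
    Returns a list of (strand, protospacer, pam, position)."""
    hits = []
    candidates = (
        ("+", sense, target_idx),
        ("-", reverse_complement(sense), len(sense) - 1 - target_idx),
    )
    for strand, seq, idx in candidates:
        if seq[idx] != "A":
            continue  # ABE acts on adenine in the protospacer strand
        for pos in range(window[0], window[1] + 1):
            start = idx - (pos - 1)
            end = start + 20
            if start < 0 or end + 3 > len(seq):
                continue
            pam = seq[end:end + 3]
            if pam[1:] == "GG":  # NGG PAM
                hits.append((strand, seq[start:end], pam, pos))
    return hits


# Invented toy sequence (NOT PCSK9). The "GT" at indices 21-22 stands in
# for a splice donor; the T at index 22 pairs with the antisense adenine.
TOY_SENSE = "TGATGCCAAGAGAAGAAGCAGGTAGAAGGAAGG"
guides = abe_guides_for_base(TOY_SENSE, target_idx=22)
```

On the toy sequence this yields a single minus-strand guide with the adenine at protospacer position 6 and a TGG PAM. Converting that adenine to guanine changes the sense-strand GT to GC, which is the splice-donor disruption described in Phase B.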

Key gap

Gold standard: ABE8.8 + exon 1/intron 1 splice-donor gRNA + hepatotropic LNP-IV (Musunuru et al., Nature 2021), with the VERVE-101 LFT signal treated as a hepatic LNP-IV class effect. GPT-5.5 (extended reasoning) and Gemini correctly identify ABE + splice donor; Claude in standard mode leads with the older CBE/W8 framing from earlier preclinical literature. All three can retrieve the VERVE-101 mechanism from the published record, since the Musunuru paper is publicly available. The non-retrievable gap: none classifies the LFT signal as a hepatic LNP-IV class effect rather than a program-specific issue. BioMate's IND flag applies to any future hepatocyte-targeted base-editing program, not just PCSK9.

T04 — GLP-1 Modality Bakeoff

GLP-1 modalities for MASH — retatrutide vs tirzepatide vs semaglutide vs orforglipron

Query: "Compare GLP-1 modalities for MASH. Use GLP1R, GIPR, GCGR receptor expression profiles to rank candidates. Which is best?"

GPT-5.5 · Extended reasoning ON · 73 s · 1,535 tokens

"Retatrutide #1 for MASH. GCGR is the most hepatocyte-expressed receptor — gives retatrutide a direct hepatic pharmacology advantage. GLP1R is broadly expressed, GIPR adipose-dominant, GCGR hepatocyte-dominant."

Ranking: Retatrutide > Tirzepatide > Semaglutide > Orforglipron. Correct. From training data.

Correct ranking — but no live atlas query

Gemini 3.1 Pro · Thinking: OFF (standard)

"Retatrutide ranks #1 for MASH... GCGR is abundantly expressed directly on hepatocytes. Tirzepatide Phase 2 SYNERGY-NASH: 73.3% achieved MASH resolution. Retatrutide Phase 2 sub-studies: >80% relative reduction in liver fat."

Ranking correct. Cites specific SYNERGY-NASH Phase 2 data — most clinical detail of the three LLMs.

Correct ranking with clinical trial citations — no live atlas query

Claude Sonnet 4.6 · Thinking: OFF (standard)

"Retatrutide is the strongest MASH candidate because GCGR — the uniquely hepatocyte-dominant receptor — drives hepatic fat oxidation. For MASH, direct hepatic engagement via GCGR separates retatrutide from the other candidates."

Ranking correct. Cannot provide live atlas expression values.

Correct ranking — but no live atlas query

BioMate · 4-phase pipeline

Phase A: Live Tabula Sapiens — GCGR: hepatocyte-dominant (TPM 842); GLP1R: broad; GIPR: adipose.

Phase B: 6-candidate × 6-axis scoring.

Phase C: MASH: GCGR +10. T2D: GCGR −10.

Phase D: Retatrutide #1 (MASH). indication=t2d: Tirzepatide #1.

Same ranking + live TPM values + indication flip in same run
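The indication flip in Phases C and D can be illustrated as a single receptor-conditional adjustment applied before ranking. The Python sketch below uses the +10/−10 GCGR adjustment stated in Phase C; the base scores are hypothetical placeholders chosen for illustration, and only the four candidates named in this case are included.

```python
from __future__ import annotations

# Receptors each candidate engages.
RECEPTORS = {
    "retatrutide": {"GLP1R", "GIPR", "GCGR"},
    "tirzepatide": {"GLP1R", "GIPR"},
    "semaglutide": {"GLP1R"},
    "orforglipron": {"GLP1R"},
}

# Hypothetical base scores before any indication-specific adjustment.
# Placeholders only; NOT BioMate's 6-axis scores.
BASE_SCORE = {
    "retatrutide": 80,
    "tirzepatide": 78,
    "semaglutide": 65,
    "orforglipron": 55,
}

# GCGR adjustment per indication, as stated in Phase C.
GCGR_ADJUSTMENT = {"mash": +10, "t2d": -10}


def rank_candidates(indication: str) -> list[str]:
    """Apply the GCGR adjustment to GCGR-engaging candidates, then rank."""
    adjustment = GCGR_ADJUSTMENT[indication]
    scores = {
        drug: base + (adjustment if "GCGR" in RECEPTORS[drug] else 0)
        for drug, base in BASE_SCORE.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


mash_ranking = rank_candidates("mash")
t2d_ranking = rank_candidates("t2d")
```

With these placeholder base scores, retatrutide ranks first for MASH and tirzepatide ranks first for T2D, matching the flip reported in Phase D. The size of the gap between the two candidates is an artifact of the invented numbers.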

Key gap

All four systems give the same top-line ranking for this well-characterized receptor class. The differentiator is verifiability and generalization: BioMate queries live atlas TPM values (GCGR 842 in hepatocytes), runs the indication-flip automatically, and would correctly answer for a novel receptor combination with no training precedent. GCGR hepatocyte-dominance is publicly documented — any model with web search retrieves it. The irreplaceable gap is a 2027 orphan-receptor program where no expression atlas query has been published.

T05 — Modality Triage

BCMA in multiple myeloma — rank all 7 modalities

Query: "Which therapeutic modality is best for BCMA in multiple myeloma? Query single-cell expression atlases for BCMA tissue expression across 26 tissue types, run a critical-tissue safety scan, then rank CAR-T, bispecific, ADC, naked mAb, PROTAC, ASO, small molecule."

GPT-5.5 · Extended reasoning ON · 66 s · 1,511 tokens

"CAR-T #1. BCMA is largely restricted to plasmablasts and plasma cells. CAR-T viability gate: plasma-cell restricted + no critical-tissue hits = gold standard. Ranking: CAR-T > bispecific > ADC > naked mAb > ASO > PROTAC > small molecule."

Ranking correct. Acknowledged cannot query live databases. Soluble BCMA shedding noted.

Correct ranking — but no actual 26-tissue scan

Gemini 3.1 Pro · Thinking: OFF (standard)

"CAR-T #1 (cilta-cel, ide-cel validated). Bispecific TCE #2 (teclistamab, elranatamab). ADC #3 (belantamab mafodotin). Naked mAb #4 (limited ADCC). Small molecule #5. ASO #6. PROTAC #7 (extracellular target, PROTACs can't reach)."

Ranking correct. Notes PROTAC spatial biology limitation. Cannot run 26-tissue scan. References teclistamab (Moreau et al., NEJM 2022).

Correct ranking with per-modality rationale — no atlas scan

Claude Sonnet 4.6 · Thinking: OFF (standard)

"BCMA is one of the cleanest modality decisions in oncology — all four discriminating factors point in the same direction: cell-type restriction, no critical-tissue expression, internalization (ADC viable), and CD3-paired bispecific opportunity. CAR-T is #1. But I can't run the 26-tissue scan or generate the IHC panel."

Ranking correct. Correctly identifies the 4 discriminating factors. Cannot produce ihc_followup_panel.txt.

Correct reasoning — but no atlas output files

BioMate · two-workflow chain

W1 — atlas_expression_query: BCMA across 26 tissues. Plasma cells: high. CNS/heart/kidney/liver: not detected.

W2 — modality_triage: CAR-T gate clears. ADC internalization confirmed. CD3 bispecific viable.

Outputs: modality_ranking.csv, tissue_criticality.json, ihc_followup_panel.txt.

Claudin18.2 → critical-tissue flag → mAb #1 (not CAR-T). 7/7 recent FDA approvals (2021–2024) recovered from expression data alone.

CAR-T #1 + structured output files, including the IHC panel
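The W1 to W2 handoff above reduces to a gate: scan the expression profile for hits in critical tissues, and clear CAR-T only if there are none. A minimal Python sketch follows. The critical-tissue list, the expression labels, and both toy profiles are assumptions assembled from the tissues named in this case, not atlas output.

```python
from __future__ import annotations

# Assumed critical-tissue list, taken from the tissues named in this case.
CRITICAL_TISSUES = {"CNS", "heart", "kidney", "liver", "lung", "pancreas"}

# Toy profiles (tissue -> qualitative level). Illustrative only.
BCMA_LIKE = {
    "plasma cells": "high",
    "CNS": "not detected",
    "heart": "not detected",
    "kidney": "not detected",
    "liver": "not detected",
}
CLDN18_2_LIKE = {
    "lung": "detected",
    "pancreas": "detected",
    "heart": "not detected",
}


def critical_tissue_hits(expression: dict[str, str]) -> list[str]:
    """Critical tissues in which the target is detected at any level."""
    return sorted(
        tissue
        for tissue, level in expression.items()
        if tissue in CRITICAL_TISSUES and level != "not detected"
    )


def car_t_gate(expression: dict[str, str]) -> bool:
    """CAR-T clears the gate only when no critical tissue expresses the target."""
    return not critical_tissue_hits(expression)


bcma_clears = car_t_gate(BCMA_LIKE)
cldn_hits = critical_tissue_hits(CLDN18_2_LIKE)
```

The BCMA-like profile clears the gate, while the Claudin18.2-like profile returns lung and pancreas as hits and fails it. That is the kind of flag that would push a ranking away from CAR-T.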

Key gap

All four systems reach the same conclusion for BCMA, a well-characterized target with published single-cell expression data. The real test is a novel target. Asked about Claudin18.2, any LLM returns a paragraph; BioMate's atlas scan flags epithelial lung/pancreas expression and shifts the winner from CAR-T to mAb (Vyloy/zolbetuximab, Oct 2024). For a completely unpublished 2027 target, LLMs have no atlas data to recall. BioMate's live query returns the same answer regardless of training-corpus coverage.
