BioMate's Live Database Connections

Every answer BioMate gives draws from authoritative sources that researchers, clinicians, and drug developers trust. BioMate maintains live connections to more than 60 named biomedical databases — UniProt, PDB, AlphaFold, ChEMBL, Ensembl, gnomAD, ClinVar, GEO, KEGG, Reactome, Open Targets, and more. These are not static snapshots baked into a language model; they are live connections queried in real time, every time you ask.

When a researcher asks BioMate about a variant, it does not rely on what a model memorized during training. It queries ClinVar, gnomAD, and Ensembl for current classifications and population frequencies. When a medicinal chemist asks about a compound's target activity, BioMate fetches the actual ChEMBL bioactivity records. When a bioinformatician pastes a GEO accession, BioMate detects it and routes it directly to the right analysis pipeline without being asked.

This post walks through the major categories of built-in database connections with worked examples — then covers the connector framework that lets teams extend BioMate with any data source of their own.

Protein Biology: UniProt, PDB, AlphaFold, InterPro, and STRING

Protein information is the foundation of both structural biology and drug discovery. BioMate connects to five complementary sources that together cover sequence, function, three-dimensional structure, domain architecture, and interaction networks.

Example query: "What is known about EGFR and why is the T790M mutation clinically significant?"

UniProt: Fetches the canonical Swiss-Prot entry for EGFR (P00533), covering protein function, subcellular localization, known post-translational modifications, disease associations, and the complete variant table, including the documented clinical effect of T790M.

PDB: Retrieves available crystal structures, including the T790M mutant apo form and complexes with osimertinib, which provide the structural rationale for why the mutation confers resistance to first-generation EGFR inhibitors.

AlphaFold DB: Pulls the predicted structure with per-residue confidence (pLDDT) for the full-length receptor, covering regions absent from crystallographic models, including the flexible ectodomain and intracellular C-terminal tail.

InterPro: Maps domain architecture (the receptor L-domain pair, furin-like cysteine-rich regions, and kinase domain), providing context for where T790M sits relative to the ATP-binding pocket.

STRING-db: Returns the high-confidence interaction network, including partners (ERBB2, ERBB3, MET, KRAS) that help explain bypass resistance mechanisms observed clinically.

None of this required separate database searches, accession number lookups, or format conversions. The answer arrives as a coherent synthesis, with every claim traceable to its source.
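The per-residue pLDDT scores mentioned for AlphaFold DB are conventionally read in four bands: very high (above 90), confident (70 to 90), low (50 to 70), and very low (below 50). The sketch below shows one way to summarise a list of scores into those bands. The function names and the example scores are illustrative assumptions, not BioMate's implementation and not real EGFR values.

```python
from collections import Counter


def plddt_band(score: float) -> str:
    """Map a pLDDT score (0-100) to the conventional AlphaFold confidence band."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must be in [0, 100], got {score}")
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


def summarise_plddt(scores):
    """Count how many residues fall into each confidence band."""
    return Counter(plddt_band(s) for s in scores)


# Made-up scores for illustration only (not taken from any real model).
example_scores = [96.1, 91.4, 88.0, 72.5, 64.3, 41.0, 38.7]
summary = summarise_plddt(example_scores)
for band in ("very high", "confident", "low", "very low"):
    print(f"{band:>10}: {summary[band]} residues")
```

Flexible regions that are missing from crystal structures often fall into the low or very low bands, so a summary like this is a quick check on how much weight a predicted region can bear.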

Genomics and Variants: Ensembl, NCBI Gene, gnomAD, ClinVar, dbSNP, and iGenomes

Genomic questions span a wide range of needs — from understanding what a gene does, to interpreting a specific variant in a patient sample, to running an analysis pipeline that needs the right reference genome. BioMate connects to the authoritative sources for each layer.

Example query: "My patient has a BRCA2 c.5946delT variant. What is its clinical significance and how common is it in the general population?"

ClinVar: Retrieves the current classification for BRCA2 c.5946delT, which is Pathogenic, established by multiple submitters and associated with hereditary breast and ovarian cancer syndrome.

gnomAD: Returns allele frequencies across gnomAD v4 cohorts. The variant is rare overall but enriched in the Ashkenazi Jewish ancestry group, where it is a well-known founder variant, so the frequency should be read per ancestry group rather than from the global figure alone.

Ensembl: Provides transcript-level consequence annotation. The deletion causes a frameshift in the canonical transcript, predicted to introduce a premature stop codon and trigger nonsense-mediated decay.

NCBI Gene: Returns the gene summary, covering BRCA2's role in homologous recombination repair, its interaction with RAD51, and a curated list of associated conditions with OMIM cross-references.
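The population-frequency step in this example comes down to simple arithmetic on allele counts (AC) and allele numbers (AN). The sketch below shows why per-group frequencies matter alongside the combined figure. The class, the function names, and all counts are made up for illustration. They are not gnomAD data and not BioMate's code.

```python
from dataclasses import dataclass


@dataclass
class PopulationCounts:
    population: str
    allele_count: int   # AC: observed copies of the variant allele
    allele_number: int  # AN: total alleles genotyped at this site

    @property
    def frequency(self) -> float:
        """Allele frequency AC / AN (0.0 when nothing was genotyped)."""
        return self.allele_count / self.allele_number if self.allele_number else 0.0


def combined_frequency(groups) -> float:
    """Pooled allele frequency across all groups: sum(AC) / sum(AN)."""
    total_ac = sum(g.allele_count for g in groups)
    total_an = sum(g.allele_number for g in groups)
    return total_ac / total_an if total_an else 0.0


def most_enriched(groups) -> PopulationCounts:
    """Return the group with the highest allele frequency."""
    return max(groups, key=lambda g: g.frequency)


# Hypothetical counts, chosen only to show the effect.
groups = [
    PopulationCounts("group_a", allele_count=3, allele_number=100_000),
    PopulationCounts("group_b", allele_count=40, allele_number=20_000),
    PopulationCounts("group_c", allele_count=0, allele_number=50_000),
]
print(f"combined: {combined_frequency(groups):.6f}")
top = most_enriched(groups)
print(f"highest:  {top.population} at {top.frequency:.6f}")
```

A variant can look very rare in the pooled figure while being markedly more frequent in one group, which is why both numbers are worth reporting.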

For analysis workflows, BioMate also auto-selects the correct reference genome from iGenomes — the cloud-hosted reference collection on AWS S3. When a researcher specifies human samples, BioMate uses GRCh38 with the matching GATK bundle by default, without requiring the user to specify a path or version.

Omics Data Accessions: GEO, SRA, ENA, and PRIDE

One of BioMate's most practically useful capabilities is accession detection. Researchers frequently work with published datasets deposited in GEO, SRA, ENA, or PRIDE — and the conventional workflow involves manually downloading files and constructing a pipeline. BioMate eliminates that friction entirely.

Example query: "Run differential expression analysis on GSE89225 comparing tumor vs. normal samples."

GEO: BioMate detects the GSE89225 accession pattern automatically. It fetches the series metadata (platform, organism, sample count, and groupings) and identifies which sample accessions correspond to tumor vs. normal.

SRA: Resolves the linked SRR accessions and passes them directly to nf-core/rnaseq as input. Raw FASTQ retrieval is handled by the pipeline itself, so no manual download is required.

iGenomes: Selects GRCh38 as the reference genome from iGenomes S3, including the STAR index, transcript FASTA, and GTF annotation appropriate for the pipeline version.

The same pattern applies to proteomics experiments (PRIDE accessions starting with PXD), raw sequencing runs (SRR, ERR, and DRR prefixes, assigned by SRA, ENA, and DDBJ respectively and mirrored across the three archives), and controlled-access genomic data (EGA accessions).
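Accession detection of this kind can be approximated with a handful of regular expressions. The following is a minimal sketch, assuming the common public formats: GSE plus digits for GEO series, SRR/ERR/DRR plus digits for sequencing runs, PXD plus six digits for ProteomeXchange/PRIDE datasets, and EGAS/EGAD plus eleven digits for EGA studies and datasets. The pattern table and function name are hypothetical. How BioMate actually implements detection is not described in this post.

```python
import re

# Hypothetical pattern table: label -> compiled regex for one accession family.
ACCESSION_PATTERNS = {
    "GEO series": re.compile(r"\bGSE\d+\b"),
    "sequencing run (SRA/ENA/DDBJ)": re.compile(r"\b[SED]RR\d{6,}\b"),
    "ProteomeXchange/PRIDE dataset": re.compile(r"\bPXD\d{6}\b"),
    "EGA study or dataset": re.compile(r"\bEGA[SD]\d{11}\b"),
}


def detect_accessions(text: str):
    """Return (accession, kind) pairs found in free text, grouped by pattern order."""
    found = []
    for kind, pattern in ACCESSION_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.group(0), kind))
    return found


sentence = "samples were deposited in GEO under accession GSE147495"
print(detect_accessions(sentence))
```

A production detector would also need to cover sample-level and platform-level identifiers and to verify that a matched string resolves to a real record. The regex pass is only the first filter.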

Chemical Biology and Drug Discovery: ChEMBL and ChEBI

For drug discovery workflows, BioMate connects to the two principal chemical databases maintained by the European Bioinformatics Institute.

Example query: "What are the most potent known inhibitors of CDK4 and what is their selectivity profile across the kinome?"

ChEMBL: Queries ChEMBL for all bioactivity records where the target is CDK4 and the assay type is binding or functional. Returns a table of compounds ranked by IC50/Ki, including palbociclib, ribociclib, and abemaciclib, with their reported selectivity data across CDK family members.

ChEBI: Provides the chemical ontology for each compound (SMILES, InChI, molecular formula, and classification within the chemical entity hierarchy), used to ensure downstream ADMET workflows receive correctly structured inputs.

ChEMBL contains bioactivity data from more than 1.9 million assays across 15,000+ targets. BioMate queries it live — not from a local copy — so any recent additions from published literature are immediately available.
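Ranking compounds by IC50/Ki, as in the CDK4 example, is usually done on a negative log scale so that larger numbers mean higher potency: for a value in nM, p = 9 - log10(value). The sketch below ranks a few records this way. The field names mirror those used in ChEMBL activity records (`standard_type`, `standard_value`, `standard_units`), but the compound identifiers and values are placeholders, and the ranking rule (best value per compound) is an assumption rather than BioMate's documented behaviour.

```python
import math


def p_activity(value_nm: float) -> float:
    """Convert a concentration in nM to its negative log10 molar value (e.g. pIC50)."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    return 9.0 - math.log10(value_nm)


def rank_by_potency(records, types=("IC50", "Ki")):
    """Keep nM records of the requested types and rank compounds by their best value.

    IC50 and Ki are not strictly comparable; pooling them here only mirrors
    the combined "IC50/Ki" ranking described in the text.
    """
    best = {}
    for rec in records:
        if rec["standard_type"] not in types or rec["standard_units"] != "nM":
            continue
        if rec["standard_value"] is None:
            continue
        score = p_activity(float(rec["standard_value"]))
        name = rec["molecule_chembl_id"]
        if name not in best or score > best[name]:
            best[name] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Placeholder records for illustration only (not real ChEMBL data).
records = [
    {"molecule_chembl_id": "compound_A", "standard_type": "IC50", "standard_value": 10, "standard_units": "nM"},
    {"molecule_chembl_id": "compound_A", "standard_type": "IC50", "standard_value": 100, "standard_units": "nM"},
    {"molecule_chembl_id": "compound_B", "standard_type": "Ki", "standard_value": 1000, "standard_units": "nM"},
    {"molecule_chembl_id": "compound_C", "standard_type": "EC50", "standard_value": 1, "standard_units": "nM"},
    {"molecule_chembl_id": "compound_D", "standard_type": "IC50", "standard_value": 5, "standard_units": "uM"},
]
for name, score in rank_by_potency(records):
    print(f"{name}: {score:.2f}")
```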

Pathways and Biological Context: KEGG, Reactome, Gene Ontology, and HPO

Understanding what a gene list, protein set, or variant panel means biologically requires mapping to pathways, processes, and disease phenotypes. BioMate connects to all four major resources in this space.

Example query: "I ran differential expression and got 340 upregulated genes. What pathways are enriched, and are any linked to autoimmune disease phenotypes?"

Gene Ontology: Runs GO enrichment testing across biological process, molecular function, and cellular component annotations. Returns ranked GO terms with adjusted p-values, gene counts, and the subset of input genes driving each enrichment.

KEGG: Maps the gene list to KEGG pathways (metabolic, signaling, and disease pathways), identifying overrepresented modules such as the JAK-STAT signaling pathway or the cytokine-cytokine receptor interaction network.

Reactome: Provides hierarchical pathway enrichment with curated mechanistic detail, identifying enrichment in T cell receptor signaling and interleukin signaling as sub-pathways within the broader immune system hierarchy.

HPO: Cross-references the enriched gene set against the Human Phenotype Ontology, surfacing associations with autoimmune hepatitis, inflammatory bowel disease, and systemic lupus erythematosus, and linking the transcriptomic signature to clinical phenotype vocabulary.
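Enrichment results with adjusted p-values, as described above, are commonly produced by an over-representation test: a one-sided hypergeometric test per term, followed by Benjamini-Hochberg correction. The post does not say which test BioMate uses, so the following is a sketch of that standard approach. The term names, counts, and the background size of 20,000 genes are made up for illustration.

```python
from math import comb


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) when drawing n genes from N, of which K carry the annotation."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical inputs: 340 query genes against a 20,000-gene background.
background, query_size = 20_000, 340
terms = {  # term -> (genes annotated in background, annotated genes in query)
    "term_A": (150, 12),
    "term_B": (400, 8),
    "term_C": (90, 1),
}
names = list(terms)
pvals = [hypergeom_upper_tail(terms[t][1], terms[t][0], query_size, background) for t in names]
for name, p, q in zip(names, pvals, benjamini_hochberg(pvals)):
    print(f"{name}: p={p:.3g} adjusted={q:.3g}")
```

Exact integer arithmetic via `math.comb` keeps the sketch dependency-free. For thousands of terms, a statistics library's hypergeometric distribution would be the more practical choice.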

Disease Targets and Clinical Data: Open Targets, cBioPortal, GWAS Catalog, and ClinicalTrials.gov

For translational and drug discovery questions, BioMate connects to the databases that bridge molecular biology with clinical evidence.

Open Targets — integrates genetic, somatic, and functional evidence to score target-disease associations across 60,000+ targets and 23,000+ diseases.
cBioPortal — provides somatic mutation, copy number, and expression data from hundreds of cancer studies, including TCGA and MSK-IMPACT cohorts.
GWAS Catalog — the curated collection of genome-wide association studies, with trait-variant associations, effect sizes, and ancestry information.
ClinicalTrials.gov — allows BioMate to surface active and completed trials for a target or indication, contextualized alongside the molecular evidence.

Literature: PubMed, arXiv, Semantic Scholar, and Europe PMC

BioMate integrates live literature queries to ground answers in primary sources. When a question touches on a specific gene, disease association, or experimental method, BioMate surfaces the key supporting publications — not as a static list from training data, but as a live query returning current results.

This is particularly valuable for rapidly evolving fields: variant classification, emerging drug targets, and newly published clinical associations are available as soon as they are indexed, without waiting for a model retraining cycle.
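To make "live query" concrete: a PubMed search is an ordinary HTTP request to the NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL and sends nothing over the network. The helper name and the example search term are illustrative, and this is not a description of how BioMate issues its queries.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term: str, max_results: int = 20) -> str:
    """Build an esearch URL that asks PubMed for matching record IDs as JSON."""
    params = {
        "db": "pubmed",         # which Entrez database to search
        "term": term,           # free-text or fielded query
        "retmode": "json",      # response format
        "retmax": max_results,  # maximum number of IDs returned
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = pubmed_search_url("EGFR T790M osimertinib resistance", max_results=5)
print(url)
```

Because the request is evaluated against the index at query time, newly indexed papers appear in results without any change on the client side.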

"The difference between a good answer and a trustworthy answer is knowing exactly where the data came from."

The Unified Interface

Across all of these databases, BioMate's interface is the same: plain language. You do not need to know that gnomAD has a GraphQL API, or that UniProt uses a REST query syntax, or that GEO metadata lives in a SOFT file format. You describe what you want to know, and BioMate resolves which databases to query, retrieves the relevant records, and incorporates the results into a coherent, cited response.

The accession detection layer is particularly powerful in practice. Paste a line from a paper — "samples were deposited in GEO under accession GSE147495" — into BioMate's chat, and it will automatically recognize the accession, fetch the series metadata, and offer to run the appropriate analysis pipeline on that dataset.

Beyond the Built-Ins: The Connector Framework

The databases described above represent BioMate's built-in integrations — pre-configured and available to every user from day one. But research organizations almost always have data that lives outside public repositories: proprietary compound libraries, internal patient cohorts, institutional genomics databases, legacy LIMS systems, or custom annotation sources developed in-house over years of work.

BioMate's connector framework makes these private sources first-class citizens alongside the public databases. A connector is a lightweight configuration that tells BioMate how to reach a data source, what it contains, and how to interpret the response. Once registered, it behaves exactly like a built-in integration — BioMate routes queries to it automatically when the content is relevant, cites it alongside public sources in responses, and makes it available as an input to workflows.

What a connector can point to

Any REST API with a JSON response, any SQL database (PostgreSQL, for example) exposed over a read-only endpoint, any S3 bucket containing structured annotation files (BED, VCF, TSV, FASTA), any internal LIMS or ELN with an API, or any commercial database your institution has licensed. BioMate handles authentication, query construction, response parsing, and caching; teams provide only the endpoint and a schema description.

Practical examples from teams using the connector framework:

A pharma team registered their internal compound library (120,000 proprietary structures with in-house ADMET measurements) so BioMate can cross-reference ChEMBL hits against internal data without ever exporting to a spreadsheet.
A genomics core facility connected their institutional variant database — curated from years of clinical sequencing — so BioMate considers internal population frequencies alongside gnomAD when interpreting variants in their specific patient ancestry mix.
A structural biology group linked their private AlphaFold prediction archive, which contains models for proteins they study that aren't yet in the public AlphaFold DB, making them queryable in exactly the same way as public structures.
A CRO connected their internal assay result database so project teams can ask BioMate questions that span public bioactivity data and their own experimental records in one query.

Connectors are scoped per workspace and do not cross team boundaries. They can be marked read-only, restricted to specific users or projects, or set to require explicit invocation rather than automatic routing.
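The scoping rules above can be pictured as a small configuration record plus a routing check. The sketch below is entirely hypothetical: the `Connector` fields, the `may_route` function, and the example endpoint are invented to illustrate the described behaviour (workspace scoping, user restrictions, explicit invocation) and do not reflect BioMate's actual connector schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical connector record; field names are illustrative only."""
    name: str
    workspace: str                 # connectors never cross workspace boundaries
    endpoint: str                  # where the data source lives
    kind: str                      # e.g. "rest", "sql", "s3"
    description: str               # what the source contains
    read_only: bool = True
    allowed_users: frozenset = frozenset()  # empty means everyone in the workspace
    explicit_only: bool = False    # True: skip automatic routing


def may_route(connector: Connector, user: str, workspace: str,
              explicitly_invoked: bool = False) -> bool:
    """Decide whether a query from this user may be sent to the connector."""
    if workspace != connector.workspace:
        return False
    if connector.allowed_users and user not in connector.allowed_users:
        return False
    if connector.explicit_only and not explicitly_invoked:
        return False
    return True


compound_library = Connector(
    name="internal-compounds",
    workspace="medchem",
    endpoint="https://lims.example.org/api",  # placeholder URL
    kind="rest",
    description="Proprietary structures with in-house ADMET measurements",
    allowed_users=frozenset({"alice", "bob"}),
    explicit_only=True,
)
print(may_route(compound_library, "alice", "medchem", explicitly_invoked=True))
```

Keeping the record immutable (`frozen=True`) means a registered connector's scope cannot be changed accidentally at query time.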

The full list of live connections

UniProt, PDB, AlphaFold DB, InterPro, STRING-db, Ensembl, NCBI Gene, NCBI Entrez, gnomAD, ClinVar, dbSNP, iGenomes, GEO, SRA, ENA, PRIDE, EGA, Metabolights, ChEMBL, ChEBI, KEGG, Reactome, Gene Ontology, HPO, PubMed, arXiv, Semantic Scholar, Europe PMC, GTEx, UCSC Genome Browser, Open Targets, MyGene.info, OMIM, Monarch Initiative, cBioPortal, TCGA, GWAS Catalog, JASPAR, IEDB, REGULOMEDB, REMAP, MaveDB, OmicsDI, SYNAPSE, EMDB, GTOPDB, NetMHCPan, NetMHCiiPan, ClinicalTrials.gov, AREsite2, DoRiNA, iReceptor, WORMS, PALEOBIOLOGY, Worldclim, IUCN, OpenFDA, BARIC Archive, and IDR/OMERO. QC thresholds are grounded in ENCODE metrics and GATK Best Practices. Contact us if a source you rely on is not yet covered.

Complete Database Reference

Database	Category	What BioMate retrieves
UniProt / Swiss-Prot	Protein	Sequence, function, variants, disease associations, PTMs
PDB / RCSB	Structure	Experimental 3D structures, ligand complexes, resolution
AlphaFold DB (EBI)	Structure	Predicted structures, per-residue pLDDT confidence scores
EMDB	Structure	Cryo-EM density maps, fitting models, resolution statistics
InterPro	Protein	Domain architecture, family classification, Pfam entries
STRING-db	Protein	Protein-protein interaction networks, confidence scores
Ensembl	Genomics	Gene annotation, transcript models, variant consequences
NCBI Gene	Genomics	Gene summaries, chromosomal location, OMIM cross-references
gnomAD	Variants	Population allele frequencies across diverse cohorts
ClinVar	Variants	Clinical variant classifications, submitter evidence
dbSNP	Variants	Variant identifiers (rs numbers), validation status
GWAS Catalog	Variants	Trait-variant associations, effect sizes, ancestry
iGenomes (AWS S3)	Reference data	Reference genomes, STAR/BWA indices, GTF annotations, GATK bundles
GEO	Omics data	Series metadata, sample tables, platform annotations
SRA / ENA	Omics data	Raw sequencing runs (FASTQ), experiment metadata
PRIDE	Omics data	Proteomics raw data, search results
EGA	Omics data	Controlled-access genomic and phenotypic data
Metabolights	Omics data	Metabolomics studies, raw spectra, sample metadata
OmicsDI	Omics data	Cross-repository omics dataset discovery
SYNAPSE	Omics data	Open biomedical datasets and analysis results (Sage Bionetworks)
ChEMBL	Chemical / Drug	Bioactivity records, target associations, drug-likeness metrics
ChEBI	Chemical / Drug	Chemical identity, SMILES, InChI, ontology classification
GTOPDB	Chemical / Drug	Pharmacology targets, ligand-receptor data (Guide to Pharmacology)
MaveDB	Variants	Variant effect scores from multiplexed assays of variant effect (MAVEs)
KEGG	Pathways	Metabolic and signaling pathway maps, gene-pathway membership
Reactome	Pathways	Curated mechanistic pathway hierarchy, reaction-level detail
Gene Ontology	Ontologies	Biological process, molecular function, cellular component terms
HPO	Ontologies	Human phenotype terms, gene-disease-phenotype associations
Monarch Initiative	Ontologies	Cross-species phenotype-gene associations
Open Targets	Disease / Target	Target-disease association scores, genetic and functional evidence
cBioPortal / TCGA	Disease / Target	Somatic mutation, copy number, expression data across cancer studies
ClinicalTrials.gov	Clinical	Active and completed trials by target, indication, or drug
OpenFDA	Clinical	Drug adverse events, labels, and recall data
IEDB	Immunology	Immune epitope binding data, T/B cell assays
NetMHCPan / NetMHCiiPan	Immunology	MHC-I and MHC-II peptide binding predictions
iReceptor	Immunology	Immune receptor repertoire sequences
JASPAR	Epigenomics	Transcription factor binding profiles
REGULOMEDB	Epigenomics	Regulatory variant annotation, chromatin state evidence
REMAP	Epigenomics	Regulatory regions from ChIP-seq and ATAC-seq experiments
GTEx	Expression	Tissue-specific gene expression and eQTLs
UCSC Genome Browser	Expression	Genome tracks, conservation, regulatory annotations
PubMed	Literature	Publication records, abstracts, MeSH terms
arXiv / Europe PMC	Literature	Preprints and open-access primary literature
Semantic Scholar	Literature	Citation graph, semantic search across scientific papers
WORMS / IUCN	Ecology	Species taxonomy, conservation status
PALEOBIOLOGY	Ecology	Fossil occurrence and taxonomic data
Worldclim	Ecology	Global climate and environmental raster data
IDR / OMERO	Imaging	High-content imaging datasets, phenotypic screen results
ENCODE QC Metrics	Standards	QC thresholds for RNA-seq, ChIP-seq, ATAC-seq
GATK Best Practices	Standards	Variant calling pipeline standards, VQSR truth resources