Every answer BioMate gives draws from authoritative sources that researchers, clinicians, and drug developers trust. BioMate maintains live connections to more than 60 named biomedical databases, including UniProt, PDB, AlphaFold, ChEMBL, Ensembl, gnomAD, ClinVar, GEO, KEGG, Reactome, and Open Targets. These are not static snapshots baked into a language model; each source is queried in real time, every time you ask.
When a researcher asks BioMate about a variant, it does not rely on what a model memorized during training. It queries ClinVar, gnomAD, and Ensembl for current classifications and population frequencies. When a medicinal chemist asks about a compound's target activity, BioMate fetches the actual ChEMBL bioactivity records. When a bioinformatician pastes a GEO accession, BioMate detects it and routes it directly to the right analysis pipeline without being asked.
This post walks through the major categories of built-in database connections with worked examples — then covers the connector framework that lets teams extend BioMate with any data source of their own.
Protein Biology: UniProt, PDB, AlphaFold, InterPro, and STRING
Protein information is the foundation of both structural biology and drug discovery. BioMate connects to five complementary sources that together cover sequence, function, three-dimensional structure, domain architecture, and interaction networks.
Answering a protein question this way requires no separate database searches, accession number lookups, or format conversions. The answer arrives as a coherent synthesis, with every claim traceable to its source.
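As a rough illustration of the lookups this replaces, the sketch below builds one request URL per source from a single UniProt accession. The endpoint patterns are the public REST routes of each resource as commonly documented; the function name `protein_source_urls` is hypothetical, the PDB lookup goes through PDBe's SIFTS mapping route as a stand-in, and none of this describes BioMate's internal implementation.

```python
from urllib.parse import quote, urlencode


def protein_source_urls(accession: str, taxon_id: int = 9606) -> dict:
    """Return one lookup URL per protein source for a UniProt accession."""
    acc = quote(accession.strip().upper())
    # STRING takes identifiers and an NCBI taxon ID (9606 is human) as query parameters.
    string_query = urlencode({"identifiers": acc, "species": taxon_id})
    return {
        "UniProt": f"https://rest.uniprot.org/uniprotkb/{acc}.json",
        # Assumption: PDBe's "best structures" mapping stands in for a PDB search.
        "PDB": f"https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/{acc}",
        "AlphaFold": f"https://alphafold.ebi.ac.uk/api/prediction/{acc}",
        "InterPro": (
            "https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/"
            f"{acc}"
        ),
        "STRING": f"https://string-db.org/api/json/network?{string_query}",
    }


if __name__ == "__main__":
    # P04637 is the UniProt accession for human p53.
    for source, url in protein_source_urls("P04637").items():
        print(f"{source:10s} {url}")
```

Five sources means five URL shapes and five response formats, which is the manual work a single plain-language question is meant to absorb.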
Genomics and Variants: Ensembl, NCBI Gene, gnomAD, ClinVar, dbSNP, and iGenomes
Genomic questions span a wide range of needs — from understanding what a gene does, to interpreting a specific variant in a patient sample, to running an analysis pipeline that needs the right reference genome. BioMate connects to the authoritative sources for each layer.
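A minimal sketch of the variant layer, assuming the Ensembl REST `variation/:species/:id` route for the lookup and the usual AC/AN definition of allele frequency. The function names and the shape of the summary record are hypothetical; they only illustrate keeping each value tied to the source it came from.

```python
from urllib.parse import quote

ENSEMBL_REST = "https://rest.ensembl.org"


def ensembl_variant_url(rsid: str, species: str = "human") -> str:
    """Build the Ensembl REST lookup URL for one variant identifier."""
    return (
        f"{ENSEMBL_REST}/variation/{quote(species)}/{quote(rsid)}"
        "?content-type=application/json"
    )


def allele_frequency(allele_count: int, allele_number: int):
    """Allele frequency as AC / AN, or None when nothing was genotyped."""
    if allele_number <= 0:
        return None
    return allele_count / allele_number


def summarize_variant(rsid, consequence, classification, allele_count, allele_number):
    """Combine per-source values into one record that names each source."""
    return {
        "variant": rsid,
        "consequence": {"value": consequence, "source": "Ensembl"},
        "classification": {"value": classification, "source": "ClinVar"},
        "allele_frequency": {
            "value": allele_frequency(allele_count, allele_number),
            "source": "gnomAD",
        },
    }
```

Returning `None` for an allele number of zero keeps "not observed" distinct from "not genotyped", which matters when a frequency feeds into variant interpretation.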
For analysis workflows, BioMate also auto-selects the correct reference genome from iGenomes — the cloud-hosted reference collection on AWS S3. When a researcher specifies human samples, BioMate uses GRCh38 with the matching GATK bundle by default, without requiring the user to specify a path or version.
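The default selection can be pictured as a small lookup table. In the sketch below, only the human entry (GRCh38 from the GATK bundle) comes from the text above; the bucket path follows the layout commonly used for AWS iGenomes and is an assumption, as are the names `DEFAULT_BUILDS` and `reference_prefix`.

```python
from typing import Optional

# Assumed base path, following the public AWS iGenomes bucket layout.
IGENOMES_BASE = "s3://ngi-igenomes/igenomes"

# organism -> (species directory, annotation source, genome build)
DEFAULT_BUILDS = {
    "human": ("Homo_sapiens", "GATK", "GRCh38"),
}


def reference_prefix(organism: str, overrides: Optional[dict] = None) -> str:
    """Return the S3 prefix of the default reference build for an organism."""
    table = {**DEFAULT_BUILDS, **(overrides or {})}
    key = organism.strip().lower()
    if key not in table:
        raise KeyError(f"no default reference build configured for {organism!r}")
    species, source, build = table[key]
    return f"{IGENOMES_BASE}/{species}/{source}/{build}/"


if __name__ == "__main__":
    print(reference_prefix("human"))
```

Raising on an unknown organism, rather than silently falling back to human, avoids aligning reads against the wrong genome.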
Omics Data Accessions: GEO, SRA, ENA, and PRIDE
One of BioMate's most practically useful capabilities is accession detection. Researchers frequently work with published datasets deposited in GEO, SRA, ENA, or PRIDE, and the conventional workflow involves manually downloading files and constructing a pipeline. BioMate removes those manual steps: it recognizes the accession and routes it to the right analysis pipeline.
The same pattern applies to proteomics experiments (PRIDE accessions starting with PXD), raw sequencing runs (SRR, ERR, and DRR prefixes, issued by SRA, ENA, and DDBJ respectively and mirrored across all three archives), and controlled-access datasets in the European Genome-phenome Archive (EGA accessions).
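Accession detection of this kind can be approximated with a handful of regular expressions. The sketch below is illustrative only: the pattern table and `detect_accessions` are hypothetical names, the patterns cover just the prefixes mentioned here, and BioMate's actual detector is not described in this post.

```python
import re

# One pattern per accession family; \b keeps matches from firing inside longer tokens.
ACCESSION_PATTERNS = {
    "GEO series": re.compile(r"\bGSE\d+\b"),
    "SRA run": re.compile(r"\bSRR\d+\b"),
    "ENA run": re.compile(r"\bERR\d+\b"),
    "DDBJ run": re.compile(r"\bDRR\d+\b"),
    "PRIDE project": re.compile(r"\bPXD\d{6}\b"),
    "EGA dataset or study": re.compile(r"\bEGA[DS]\d{11}\b"),
}


def detect_accessions(text: str) -> list:
    """Return (kind, accession) pairs in the order they appear in the text."""
    hits = []
    for kind, pattern in ACCESSION_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group()))
    return [(kind, accession) for _, kind, accession in sorted(hits)]
```

Called on the sentence "samples were deposited in GEO under accession GSE147495", this returns `[("GEO series", "GSE147495")]`. The detected kind is what a router would use to pick a pipeline.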
Chemical Biology and Drug Discovery: ChEMBL and ChEBI
For drug discovery workflows, BioMate connects to the two principal chemical databases maintained by the European Bioinformatics Institute.
ChEMBL contains bioactivity data from more than 1.9 million assays across 15,000+ targets. BioMate queries it live — not from a local copy — so any recent additions from published literature are immediately available.
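For a sense of what a live bioactivity query involves, here is a minimal sketch against the public ChEMBL web services `activity` resource. The field names (`target_chembl_id`, `standard_type`, `standard_value`, `standard_units`) follow that API; the helper names are hypothetical, and the median summary is an illustrative choice rather than BioMate's method.

```python
from statistics import median
from urllib.parse import urlencode

CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"


def activity_url(target_chembl_id: str, standard_type: str = "IC50", limit: int = 20) -> str:
    """Build a ChEMBL activity query filtered by target and measurement type."""
    params = urlencode({
        "target_chembl_id": target_chembl_id,
        "standard_type": standard_type,
        "limit": limit,
    })
    return f"{CHEMBL_API}/activity.json?{params}"


def median_potency_nm(payload: dict):
    """Median standard_value over records reported in nM; None if there are none."""
    values = [
        float(record["standard_value"])
        for record in payload.get("activities", [])
        if record.get("standard_units") == "nM"
        and record.get("standard_value") is not None
    ]
    return median(values) if values else None
```

Filtering on `standard_units` before aggregating matters because activity records mix units; averaging nM and µM values together would give a meaningless number.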
Pathways and Biological Context: KEGG, Reactome, Gene Ontology, and HPO
Understanding what a gene list, protein set, or variant panel means biologically requires mapping to pathways, processes, and disease phenotypes. BioMate connects to all four major resources in this space.
Disease Targets and Clinical Data: Open Targets, cBioPortal, GWAS Catalog, and ClinicalTrials.gov
For translational and drug discovery questions, BioMate connects to the databases that bridge molecular biology with clinical evidence.
- Open Targets — integrates genetic, somatic, and functional evidence to score target-disease associations across 60,000+ targets and 23,000+ diseases.
- cBioPortal — provides somatic mutation, copy number, and expression data from hundreds of cancer studies, including TCGA and MSK-IMPACT cohorts.
- GWAS Catalog — the curated collection of genome-wide association studies, with trait-variant associations, effect sizes, and ancestry information.
- ClinicalTrials.gov — allows BioMate to surface active and completed trials for a target or indication, contextualized alongside the molecular evidence.
Literature: PubMed, arXiv, Semantic Scholar, and Europe PMC
BioMate integrates live literature queries to ground answers in primary sources. When a question touches on a specific gene, disease association, or experimental method, BioMate surfaces the key supporting publications — not as a static list from training data, but as a live query returning current results.
This is particularly valuable for rapidly evolving fields: variant classification, emerging drug targets, and newly published clinical associations are available as soon as they are indexed, without waiting for a model retraining cycle.
"The difference between a good answer and a trustworthy answer is knowing exactly where the data came from."
The Unified Interface
Across all of these databases, BioMate's interface is the same: plain language. You do not need to know that gnomAD has a GraphQL API, or that UniProt uses a REST query syntax, or that GEO metadata lives in a SOFT file format. You describe what you want to know, and BioMate resolves which databases to query, retrieves the relevant records, and incorporates the results into a coherent, cited response.
The accession detection layer is particularly powerful in practice. Paste a line from a paper — "samples were deposited in GEO under accession GSE147495" — into BioMate's chat, and it will automatically recognize the accession, fetch the series metadata, and offer to run the appropriate analysis pipeline on that dataset.
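To make the SOFT example concrete: a GEO series record is a plain-text file in which `^` lines open an entity and `!` lines carry its attributes. The parser below is a minimal sketch of reading that metadata; `parse_soft_metadata` is a hypothetical name, and the sample values other than the accession are placeholders.

```python
from collections import defaultdict


def parse_soft_metadata(text: str) -> dict:
    """Collect '!key = value' attributes under each '^ENTITY = id' block."""
    entities = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and table column descriptions
        if line.startswith("^"):
            kind, _, ident = line[1:].partition("=")
            current = defaultdict(list)
            entities[(kind.strip(), ident.strip())] = current
        elif line.startswith("!") and current is not None:
            key, _, value = line[1:].partition("=")
            # Attributes may repeat (one line per sample), so values are lists.
            current[key.strip()].append(value.strip())
    return {key: dict(attrs) for key, attrs in entities.items()}


SAMPLE = """\
^SERIES = GSE147495
!Series_title = Placeholder title
!Series_sample_id = GSM0000001
!Series_sample_id = GSM0000002
"""

if __name__ == "__main__":
    print(parse_soft_metadata(SAMPLE))
```

Every source has a format quirk of this kind, which is the argument for keeping them behind one plain-language interface.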
Beyond the Built-Ins: The Connector Framework
The databases described above represent BioMate's built-in integrations — pre-configured and available to every user from day one. But research organizations almost always have data that lives outside public repositories: proprietary compound libraries, internal patient cohorts, institutional genomics databases, legacy LIMS systems, or custom annotation sources developed in-house over years of work.
BioMate's connector framework makes these private sources first-class citizens alongside the public databases. A connector is a lightweight configuration that tells BioMate how to reach a data source, what it contains, and how to interpret the response. Once registered, it behaves exactly like a built-in integration — BioMate routes queries to it automatically when the content is relevant, cites it alongside public sources in responses, and makes it available as an input to workflows.
A connector can wrap any REST API with a JSON response, any SQL database (PostgreSQL, for example) exposed over a read-only endpoint, any S3 bucket containing structured annotation files (BED, VCF, TSV, FASTA), any internal LIMS or ELN with an API, or any commercial database your institution has licensed. BioMate handles authentication, query construction, response parsing, and caching; teams provide only the endpoint and a schema description.
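This post does not show the connector format itself, so the following is only a sketch of what "an endpoint and a schema description" might look like once written down. Every name here (`ConnectorSpec`, `SCHEMES_BY_KIND`, the field names, the example URL) is hypothetical and not BioMate's actual configuration syntax.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Hypothetical mapping from connector kind to the URL schemes it accepts.
SCHEMES_BY_KIND = {
    "rest": {"https"},
    "sql": {"postgresql"},
    "s3": {"s3"},
}


@dataclass(frozen=True)
class ConnectorSpec:
    name: str
    kind: str
    endpoint: str      # how to reach the source
    description: str   # what the source contains
    schema: dict = field(default_factory=dict)  # field name -> meaning

    def problems(self) -> list:
        """Return human-readable issues; an empty list means the spec looks usable."""
        found = []
        allowed = SCHEMES_BY_KIND.get(self.kind)
        if allowed is None:
            found.append(f"unknown kind {self.kind!r}")
        elif urlparse(self.endpoint).scheme not in allowed:
            found.append(f"endpoint scheme does not match kind {self.kind!r}")
        if not self.description.strip():
            found.append("description is empty")
        if not self.schema:
            found.append("no fields described")
        return found


compound_library = ConnectorSpec(
    name="internal-compounds",
    kind="rest",
    endpoint="https://compounds.example.org/api/v1",  # placeholder URL
    description="Proprietary structures with in-house ADMET measurements",
    schema={"smiles": "canonical SMILES", "logd": "measured logD at pH 7.4"},
)
```

The description and schema text do the routing work: they are what would let a system decide that a question about ADMET properties belongs to this source.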
Practical examples from teams using the connector framework:
- A pharma team registered their internal compound library (120,000 proprietary structures with in-house ADMET measurements) so BioMate can cross-reference ChEMBL hits against internal data without ever exporting to a spreadsheet.
- A genomics core facility connected their institutional variant database — curated from years of clinical sequencing — so BioMate considers internal population frequencies alongside gnomAD when interpreting variants in their specific patient ancestry mix.
- A structural biology group linked their private AlphaFold prediction archive, which contains models for proteins they study that aren't yet in the public AlphaFold DB, making them queryable in exactly the same way as public structures.
- A CRO connected their internal assay result database so project teams can ask BioMate questions that span public bioactivity data and their own experimental records in one query.
Connectors are scoped per workspace and do not cross team boundaries. They can be marked read-only, restricted to specific users or projects, or set to require explicit invocation rather than automatic routing.
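Those scoping rules reduce to a short series of checks. The sketch below is a hypothetical model of them, with `ConnectorPolicy` and `may_route` as invented names; it covers workspace scoping, user restriction, and explicit invocation, and leaves out the read-only flag, which would apply to writes rather than to query routing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorPolicy:
    workspace: str
    allowed_users: frozenset = frozenset()  # empty means every workspace member
    explicit_only: bool = False             # True disables automatic routing


def may_route(policy: ConnectorPolicy, workspace: str, user: str,
              explicitly_invoked: bool = False) -> bool:
    """Decide whether a query from this user may be sent to the connector."""
    if workspace != policy.workspace:
        return False  # connectors never cross workspace boundaries
    if policy.allowed_users and user not in policy.allowed_users:
        return False  # restricted to a named set of users
    if policy.explicit_only and not explicitly_invoked:
        return False  # requires the user to ask for this source by name
    return True


if __name__ == "__main__":
    policy = ConnectorPolicy("oncology", frozenset({"ana"}), explicit_only=True)
    print(may_route(policy, "oncology", "ana", explicitly_invoked=True))  # True
    print(may_route(policy, "oncology", "ana"))                           # False
```

Checking the workspace first means a misconfigured user list can never leak a connector to another team.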
BioMate's built-in sources at a glance: UniProt, PDB, AlphaFold DB, InterPro, STRING-db, Ensembl, NCBI Gene, NCBI Entrez, gnomAD, ClinVar, dbSNP, iGenomes, GEO, SRA, ENA, PRIDE, EGA, MetaboLights, ChEMBL, ChEBI, KEGG, Reactome, Gene Ontology, HPO, PubMed, arXiv, Semantic Scholar, Europe PMC, GTEx, UCSC Genome Browser, Open Targets, MyGene.info, OMIM, Monarch Initiative, cBioPortal, TCGA, GWAS Catalog, JASPAR, IEDB, RegulomeDB, ReMap, MaveDB, OmicsDI, Synapse, EMDB, GtoPdb, NetMHCpan, NetMHCIIpan, ClinicalTrials.gov, AREsite2, DoRiNA, iReceptor, WoRMS, Paleobiology Database, WorldClim, IUCN, openFDA, BARIC Archive, and IDR/OMERO. QC thresholds are grounded in ENCODE metrics and GATK Best Practices. Contact us if a source you rely on is not yet covered.
Complete Database Reference
| Database | Category | What BioMate retrieves |
|---|---|---|
| UniProt / Swiss-Prot | Protein | Sequence, function, variants, disease associations, PTMs |
| PDB / RCSB | Structure | Experimental 3D structures, ligand complexes, resolution |
| AlphaFold DB (EBI) | Structure | Predicted structures, per-residue pLDDT confidence scores |
| EMDB | Structure | Cryo-EM density maps, fitting models, resolution statistics |
| InterPro | Protein | Domain architecture, family classification, Pfam entries |
| STRING-db | Protein | Protein-protein interaction networks, confidence scores |
| Ensembl | Genomics | Gene annotation, transcript models, variant consequences |
| NCBI Gene | Genomics | Gene summaries, chromosomal location, OMIM cross-references |
| gnomAD | Variants | Population allele frequencies across diverse cohorts |
| ClinVar | Variants | Clinical variant classifications, submitter evidence |
| dbSNP | Variants | Variant identifiers (rs numbers), validation status |
| GWAS Catalog | Variants | Trait-variant associations, effect sizes, ancestry |
| iGenomes (AWS S3) | Reference data | Reference genomes, STAR/BWA indices, GTF annotations, GATK bundles |
| GEO | Omics data | Series metadata, sample tables, platform annotations |
| SRA / ENA | Omics data | Raw sequencing runs (FASTQ), experiment metadata |
| PRIDE | Omics data | Proteomics raw data, search results |
| EGA | Omics data | Controlled-access genomic and phenotypic data |
| MetaboLights | Omics data | Metabolomics studies, raw spectra, sample metadata |
| OmicsDI | Omics data | Cross-repository omics dataset discovery |
| Synapse | Omics data | Open biomedical datasets and analysis results (Sage Bionetworks) |
| ChEMBL | Chemical / Drug | Bioactivity records, target associations, drug-likeness metrics |
| ChEBI | Chemical / Drug | Chemical identity, SMILES, InChI, ontology classification |
| GtoPdb | Chemical / Drug | Pharmacology targets, ligand-receptor data (Guide to Pharmacology) |
| MaveDB | Variants | Variant effect scores from multiplexed assays of variant effect (MAVEs) |
| KEGG | Pathways | Metabolic and signaling pathway maps, gene-pathway membership |
| Reactome | Pathways | Curated mechanistic pathway hierarchy, reaction-level detail |
| Gene Ontology | Ontologies | Biological process, molecular function, cellular component terms |
| HPO | Ontologies | Human phenotype terms, gene-disease-phenotype associations |
| Monarch Initiative | Ontologies | Cross-species phenotype-gene associations |
| Open Targets | Disease / Target | Target-disease association scores, genetic and functional evidence |
| cBioPortal / TCGA | Disease / Target | Somatic mutation, copy number, expression data across cancer studies |
| ClinicalTrials.gov | Clinical | Active and completed trials by target, indication, or drug |
| openFDA | Clinical | Drug adverse events, labels, and recall data |
| IEDB | Immunology | Immune epitope binding data, T/B cell assays |
| NetMHCpan / NetMHCIIpan | Immunology | MHC-I and MHC-II peptide binding predictions |
| iReceptor | Immunology | Immune receptor repertoire sequences |
| JASPAR | Epigenomics | Transcription factor binding profiles |
| RegulomeDB | Epigenomics | Regulatory variant annotation, chromatin state evidence |
| ReMap | Epigenomics | Transcriptional regulator binding regions from ChIP-seq, ChIP-exo, and DAP-seq experiments |
| GTEx | Expression | Tissue-specific gene expression and eQTLs |
| UCSC Genome Browser | Expression | Genome tracks, conservation, regulatory annotations |
| PubMed | Literature | Publication records, abstracts, MeSH terms |
| arXiv / Europe PMC | Literature | Preprints and open-access primary literature |
| Semantic Scholar | Literature | Citation graph, semantic search across scientific papers |
| WoRMS / IUCN | Ecology | Species taxonomy, conservation status |
| Paleobiology Database | Ecology | Fossil occurrence and taxonomic data |
| WorldClim | Ecology | Global climate and environmental raster data |
| IDR / OMERO | Imaging | High-content imaging datasets, phenotypic screen results |
| ENCODE QC Metrics | Standards | QC thresholds for RNA-seq, ChIP-seq, ATAC-seq |
| GATK Best Practices | Standards | Variant calling pipeline standards, VQSR truth resources |