What is ADMET in drug discovery?

ADMET stands for Absorption, Distribution, Metabolism, Excretion, and Toxicity. In drug discovery, ADMET profiling predicts whether a drug candidate will be absorbed through the gut (or other routes), distributed to target tissues, metabolized at an appropriate rate, excreted without harmful accumulation, and whether it has toxic liabilities (e.g., hERG cardiotoxicity, DILI, Ames mutagenicity). In silico ADMET prediction allows teams to filter out problematic candidates before expensive wet-lab synthesis, dramatically reducing attrition in the development pipeline.

What is PBPK modeling?

PBPK (Physiologically Based Pharmacokinetic) modeling uses mathematical equations parameterized by real anatomical and physiological measurements (organ sizes, blood flows, enzyme activities) to simulate how a drug behaves in the body over time. Unlike empirical PK models, PBPK can extrapolate across species (mouse to human), predict drug interactions, and model special populations (pediatric, renal impairment). It is routinely used in IND submissions and FDA review to justify first-in-human dosing.

What is cryo-EM and how is it different from X-ray crystallography?

Cryo-electron microscopy (cryo-EM) and X-ray crystallography both determine 3D protein structures, but through different methods. X-ray crystallography requires growing protein crystals (often difficult or impossible) and measures diffraction patterns. Cryo-EM images individual protein molecules frozen in vitreous ice with an electron microscope, then uses computational averaging of many particle images to reconstruct the 3D density map. Cryo-EM excels for large protein complexes, membrane proteins, and flexible molecules that cannot be crystallized. Modern cryo-EM routinely achieves 2-3 Angstrom resolution.

What is the difference between WGS and WES?

WGS (Whole Genome Sequencing) sequences the entire genome (~3 billion base pairs in humans), including coding and non-coding regions, intronic sequences, and regulatory elements. WES (Whole Exome Sequencing) captures only the exome — the ~1-2% of the genome that encodes proteins. WGS is more comprehensive and can detect structural variants, non-coding mutations, and copy number changes, but costs more and requires greater computational resources. WES is cost-effective for Mendelian disease genetics and protein-coding variant discovery.

What is single-cell RNA sequencing (scRNA-seq)?

Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells rather than bulk populations. By profiling thousands of cells simultaneously, it reveals cell-type heterogeneity, rare cell populations, developmental trajectories, and cell-state transitions that are invisible in bulk RNA-seq. Common platforms include 10x Genomics Chromium (droplet-based) and Smart-seq2 (plate-based). Analysis involves dimensionality reduction (PCA, UMAP), clustering, marker gene identification, and trajectory inference using tools like Seurat and Scanpy.

What is the nf-core project?

nf-core is a community-curated collection of analysis pipelines built with the Nextflow workflow management system. Each pipeline (e.g., nf-core/rnaseq, nf-core/sarek for WGS, nf-core/chipseq) follows best practices, uses containerized software (Docker/Singularity), and passes rigorous CI testing. BioMate AI runs nf-core pipelines under the hood on AWS Batch, making the same gold-standard tools accessible through plain-English requests without requiring Nextflow expertise.

Computational Biology Glossary

ADMET

Absorption · Distribution · Metabolism · Excretion · Toxicity

ADMET profiling predicts the five pharmacokinetic and safety properties that determine whether a drug candidate is suitable for development. Absorption assesses oral bioavailability (Caco-2 permeability, P-glycoprotein efflux). Distribution covers plasma protein binding and volume of distribution. Metabolism identifies CYP450 interactions, metabolic stability, and drug-drug interaction potential. Excretion models renal and hepatic clearance. Toxicity flags hERG cardiac risk, DILI (drug-induced liver injury), Ames mutagenicity, and other liabilities. In silico ADMET is performed before synthesis to filter out high-risk scaffolds: a compound that fails ADMET in a cell assay after months of chemistry work is a sunk cost that computational screening could have flagged in minutes. BioMate AI runs full ADMET panels with Gold/Silver/Bronze evidence-graded QC, citing the threshold source (e.g., FDA guidance, published hERG IC₅₀ benchmarks) for every metric.

PBPK Modeling

Physiologically Based Pharmacokinetics

PBPK models simulate drug behavior in the body using compartments representing real anatomical organs (liver, kidney, lungs, gut, plasma) connected by blood flow rates that match known physiology. Unlike empirical two-compartment models, whose parameters are fitted to observed concentration data, PBPK models encode physiological parameters directly, so they can extrapolate across species (mouse → rat → human) by swapping in each species' organ volumes, blood flows, and clearance values. This makes them a mechanistic complement to simple allometric scaling when translating animal study doses to human equivalents for IND submissions, and a practical way to predict behavior in special populations (children, elderly, hepatic impairment). Both FDA and EMA have published guidance on how PBPK analyses should be reported in regulatory submissions. BioMate runs PBPK simulation as part of its preclinical development pipeline, producing PK profiles, species scaling tables, and a draft §2.6.4 PK Written Summary for IND Module 2.
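
The compartment-and-blood-flow idea can be sketched in a few lines. The example below is a minimal, flow-limited model with only three compartments (blood, liver, rest of body) integrated with a simple Euler step. Every parameter value, and the function name simulate_pbpk, is a hypothetical placeholder chosen for illustration, not a validated human parameter set or BioMate's implementation.

```python
def simulate_pbpk(dose_mg, hours=24.0, dt=0.001):
    """Euler integration of a three-compartment, flow-limited PBPK sketch."""
    # Hypothetical, roughly human-scale parameters (illustration only).
    vol = {"blood": 5.0, "liver": 1.8, "rest": 60.0}   # compartment volumes, L
    flow = {"liver": 90.0, "rest": 260.0}              # blood flows, L/h
    kp = {"liver": 4.0, "rest": 1.5}                   # tissue:blood partition coefficients
    cl_int = 30.0                                      # hepatic intrinsic clearance, L/h

    amount = {"blood": dose_mg, "liver": 0.0, "rest": 0.0}  # IV bolus into blood
    eliminated = 0.0
    profile = []                                       # (time in h, blood conc in mg/L)

    steps = int(round(hours / dt))
    for step in range(steps):
        conc = {name: amount[name] / vol[name] for name in amount}
        profile.append((step * dt, conc["blood"]))
        # Concentration in blood leaving each tissue (flow-limited assumption).
        venous = {t: conc[t] / kp[t] for t in flow}
        # Net transfer from blood into each tissue, mg/h.
        uptake = {t: flow[t] * (conc["blood"] - venous[t]) for t in flow}
        metabolism = cl_int * venous["liver"]          # hepatic elimination, mg/h
        amount["blood"] -= dt * sum(uptake.values())
        amount["liver"] += dt * (uptake["liver"] - metabolism)
        amount["rest"] += dt * uptake["rest"]
        eliminated += dt * metabolism
    profile.append((steps * dt, amount["blood"] / vol["blood"]))
    return profile, amount, eliminated


profile, final_amount, eliminated = simulate_pbpk(dose_mg=100.0)
# Mass balance check: drug still in the body plus drug eliminated equals the dose.
remaining = sum(final_amount.values())
```

Scaling to another species in this sketch would mean replacing the volumes, flows, and clearance with that species' values while leaving the equations unchanged. Production PBPK tools use many more compartments and stiff ODE solvers.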

IND

Investigational New Drug Application

An IND is the regulatory submission filed with the U.S. FDA (or equivalent in other jurisdictions) before a pharmaceutical company can begin clinical trials of a new drug in human subjects. The nonclinical IND package (CTD Module 2.6) contains computational and experimental evidence on the drug's pharmacology, pharmacokinetics, and toxicology. Sections §2.6.1–2.6.7 cover: introduction, pharmacology summary, pharmacology tables, PK written summary, PK tables, toxicology summary, and toxicology tables. BioMate AI automates the computational sections of Module 2.6 — generating CTD-formatted DOCX documents with narrative prose, tabulated data, and QC-graded citations from run data — leaving wet-lab sections as clearly marked placeholders for the regulatory affairs team.

Target Identification & Validation

Target identification is the process of finding a biological molecule (usually a protein) whose modulation could treat a disease. Computational target identification uses genetic support scoring (GWAS, rare variant burden), CRISPR essentiality data (DepMap, Project Score) to rank targets by cancer cell line dependence, pocket druggability analysis (FPocket, SiteMap) to assess whether the target has a ligandable cavity, and functional genomics to identify pathway relationships. Validation confirms that the target is expressed in the disease tissue, is not safety-critical in normal tissues, and has prior art suggesting modulation is feasible. BioMate AI runs automated target scoring workflows using Open Targets, ChEMBL, and AlphaFold-predicted structures for pocket analysis.

Virtual Screening

Virtual screening computationally filters large compound libraries (millions to billions of molecules) to identify candidates likely to bind a target protein. Structure-based virtual screening uses molecular docking to score binding poses within the target's active site. Ligand-based screening identifies compounds similar in shape or pharmacophore to known active molecules and is the usual choice when no target structure is available. AI-guided screening (GNN scoring functions, diffusion-based pose prediction) has been reported to improve enrichment over classical scoring on several benchmarks. Virtual screening precedes wet-lab high-throughput screening (HTS) to reduce the chemical space to a tractable hit list of 50–500 compounds for experimental validation. BioMate runs virtual screening using AutoDock Vina, DiffDock, and pharmacophore filtering with ADMET pre-screening of the hit list.
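
As a rough illustration of ligand-based screening, the sketch below ranks a toy library by Tanimoto similarity to a known active. It swaps in 2D fingerprint similarity for the shape and pharmacophore comparison described above, and it represents each fingerprint as a plain set of on-bit indices. The compound names, bit sets, and the 0.5 threshold are all invented for the example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def screen(library, query_fp, threshold=0.5):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(fp, query_fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data: the query is a known active; library entries are hypothetical.
query = {1, 4, 7, 9, 12}
library = {
    "cmpd_A": {1, 4, 7, 9, 13},   # shares 4 of 6 distinct bits with the query
    "cmpd_B": {2, 3, 5},          # shares nothing with the query
    "cmpd_C": {1, 4, 7, 9, 12},   # identical to the query
}
hits = screen(library, query)
```

In practice the fingerprints would come from a cheminformatics toolkit, and the hit list would feed into docking or ADMET filtering rather than being used on its own.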

BOIN

Bayesian Optimal Interval Design

BOIN is a model-assisted design for Phase I dose-escalation trials. It decides whether to escalate, stay at, or de-escalate the current dose by comparing the observed dose-limiting toxicity (DLT) rate at that dose, pooled over all patients treated there so far, with a pair of fixed escalation and de-escalation boundaries. Unlike the rule-based 3+3 design, BOIN chooses these boundaries to minimize the probability of an incorrect dosing decision, and because they depend only on the target DLT rate, the full decision table can be computed before the trial starts. In simulation studies BOIN identifies the Maximum Tolerated Dose (MTD) more accurately than 3+3, and it works with any target DLT rate and cohort size. BOIN has been widely adopted by oncology programs and is accepted by the FDA. BioMate simulates BOIN dose-escalation trials computationally, allowing clinical teams to explore different target DLT rates and sample size assumptions before trial initiation.
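
The decision rule can be written out directly. The sketch below computes the escalation and de-escalation boundaries from the target DLT rate using the boundary formulas published for the design, with the commonly used defaults of 0.6 and 1.4 times the target for the sub-therapeutic and overly toxic reference rates. The function names are illustrative, and the overdose-control (dose elimination) rule that full BOIN implementations include is deliberately left out.

```python
import math


def boin_boundaries(target, phi1=None, phi2=None):
    """Escalation (lambda_e) and de-escalation (lambda_d) boundaries for a target DLT rate."""
    phi1 = 0.6 * target if phi1 is None else phi1   # highest rate deemed sub-therapeutic
    phi2 = 1.4 * target if phi2 is None else phi2   # lowest rate deemed overly toxic
    lambda_e = math.log((1 - phi1) / (1 - target)) / math.log(
        target * (1 - phi1) / (phi1 * (1 - target)))
    lambda_d = math.log((1 - target) / (1 - phi2)) / math.log(
        phi2 * (1 - target) / (target * (1 - phi2)))
    return lambda_e, lambda_d


def boin_decision(n_dlt, n_treated, target=0.3):
    """Decide the next step from cumulative DLTs and patients at the current dose."""
    lambda_e, lambda_d = boin_boundaries(target)
    rate = n_dlt / n_treated
    if rate <= lambda_e:
        return "escalate"
    if rate >= lambda_d:
        return "de-escalate"
    return "stay"


lam_e, lam_d = boin_boundaries(0.3)   # approximately 0.236 and 0.358
# Decision table for 3 and 6 patients treated at the current dose.
table = {(d, n): boin_decision(d, n) for n in (3, 6) for d in range(n + 1)}
```

With a 30% target, one DLT in three patients falls between the boundaries (stay), while one DLT in six falls below the lower boundary (escalate).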

RNA-seq

RNA Sequencing / Transcriptomics

RNA-seq measures the transcriptome (the complete set of RNA transcripts in a cell or tissue at a given time) by converting RNA to cDNA, fragmenting it, and sequencing millions of fragments. The key readout is read counts per gene, which serve as proxies for expression levels. Differential expression analysis (DESeq2, edgeR, limma-voom) identifies genes whose expression changes significantly between conditions (treated vs. control, disease vs. healthy). A standard workflow runs quality control with FastQC, adapter trimming (Trim Galore), alignment (STAR or HISAT2), quantification (Salmon or featureCounts), and differential expression with visualization (volcano plots, heatmaps, GSEA). BioMate runs the complete nf-core/rnaseq pipeline on AWS Batch from FASTQ input to DEG tables and pathway enrichment, with ENCODE-graded QC at every step.

scRNA-seq

Single-Cell RNA Sequencing

scRNA-seq profiles gene expression in individual cells rather than a bulk tissue average, revealing cellular heterogeneity invisible to bulk RNA-seq. The most common platform is 10x Genomics Chromium (droplet-based), which captures thousands of cells with unique molecular identifiers (UMIs) for accurate quantification. Analysis starts with a sparse count matrix and proceeds through: quality filtering, normalization, dimensionality reduction (PCA, UMAP/t-SNE), unsupervised clustering, cell-type annotation with marker genes, trajectory inference, and differential expression between cell populations. Tools include Seurat (R) and Scanpy (Python), both available in BioMate's workflow library. Applications include tumor microenvironment characterization, developmental biology, and drug mechanism of action studies.
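
The first two analysis steps, quality filtering and normalization, are simple enough to show without any library. The sketch below stores each cell as a sparse dict of gene to UMI count, drops cells with too few detected genes or too high a mitochondrial fraction, then scales each remaining cell to 10,000 counts and applies log1p. The thresholds, barcodes, and function name are assumptions for the example; Seurat and Scanpy provide their own, more complete versions of these steps.

```python
import math


def qc_and_normalize(cells, min_genes=2, max_mito_frac=0.2, scale=10_000):
    """Filter low-quality cells, then log-normalize counts per cell.

    cells: dict mapping barcode -> {gene: UMI count} (sparse, zeros omitted).
    """
    normalized = {}
    for barcode, counts in cells.items():
        total = sum(counts.values())
        n_genes = sum(1 for c in counts.values() if c > 0)
        mito = sum(c for gene, c in counts.items() if gene.startswith("MT-"))
        # Drop empty cells, cells with few detected genes, or high mito fraction.
        if total == 0 or n_genes < min_genes or mito / total > max_mito_frac:
            continue
        normalized[barcode] = {
            gene: math.log1p(c / total * scale)
            for gene, c in counts.items() if c > 0
        }
    return normalized


# Toy count matrix with hypothetical barcodes.
cells = {
    "AAAC": {"CD3E": 5, "GAPDH": 10, "MT-CO1": 1},   # passes QC
    "AAAG": {"MT-CO1": 9, "GAPDH": 1},               # 90% mitochondrial reads
    "AAAT": {"GAPDH": 3},                            # only one gene detected
}
kept = qc_and_normalize(cells)
```

Dimensionality reduction, clustering, and annotation would then operate on the normalized values for the cells that remain.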

WGS / WES

Whole Genome Sequencing / Whole Exome Sequencing

WGS sequences the entire ~3-billion-base-pair human genome, including coding and non-coding regions, typically at 30x–60x depth. WES captures only the exome (~50 Mb, roughly 1–2% of the genome), the protein-coding regions, at deeper coverage (~100x). WGS advantages: detects structural variants (SVs), copy number variants (CNVs), and non-coding variants, with more uniform coverage because no capture step is involved. WES advantages: lower cost, smaller data volumes, easier interpretation for rare Mendelian disease genetics, and deeper coverage of coding regions, which helps with low-allele-frequency variants there. Both use the same alignment (BWA-MEM), base quality score recalibration (BQSR), and variant calling (GATK HaplotypeCaller or DeepVariant) workflows. BioMate runs WGS/WES through nf-core/sarek with GATK Best Practices or DeepVariant, ACMG/AMP variant classification, and VCF annotation via VEP or ANNOVAR.
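
The depth figures translate into sequencing volume with simple arithmetic: mean depth equals sequenced on-target bases divided by target size. The sketch below assumes paired-end 150 bp reads and, for WES, a hypothetical 70% on-target rate; both numbers are illustrative assumptions rather than values from any particular kit or run.

```python
def read_pairs_needed(target_bp, depth, read_len=150, on_target=1.0):
    """Read pairs required for a mean depth over a target region.

    depth = pairs * 2 * read_len * on_target / target_bp, solved for pairs.
    """
    return depth * target_bp / (2 * read_len * on_target)


# WGS: ~3 billion bp at 30x, every read counted as on-target.
wgs_pairs = read_pairs_needed(3e9, 30)                      # 3.0e8 read pairs
# WES: ~50 Mb at 100x, assuming 70% of reads land on the capture target.
wes_pairs = read_pairs_needed(50e6, 100, on_target=0.7)     # about 2.4e7 read pairs

# Ratio of raw sequenced bases, a rough proxy for data volume and compute.
volume_ratio = wgs_pairs / wes_pairs
```

Under these assumptions WGS at 30x needs roughly an order of magnitude more sequencing than WES at 100x, which is where most of the cost and storage difference comes from.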

GATK

Genome Analysis Toolkit

GATK is the industry-standard framework for variant calling in sequencing data, developed and maintained by the Broad Institute. The GATK Best Practices pipeline for germline short variant discovery covers: Picard MarkDuplicates, Base Quality Score Recalibration (BQSR), HaplotypeCaller (per-sample GVCF mode), GenomicsDBImport, GenotypeGVCFs, and Variant Quality Score Recalibration (VQSR). For somatic variant calling, Mutect2 identifies somatic SNVs and indels. GATK4 (the current version) is Java-based with Spark support for scalable cloud compute. DeepVariant (Google) is a deep learning alternative that can outperform GATK on certain datasets, particularly for non-human genomes and PacBio long reads.

Variant Calling

Variant calling is the computational process of identifying positions in a sequenced genome that differ from a reference sequence. Variants include: SNVs (single nucleotide variants — single base changes), indels (insertions and deletions of 1–50 bp), CNVs (copy number variants — large-scale duplications or deletions), and SVs (structural variants — translocations, inversions, large indels). Germline variant calling identifies inherited variants (typically 4–5 million per human genome). Somatic variant calling identifies mutations acquired in tumor cells by comparing tumor vs. matched normal tissue. Variant annotation adds biological context: predicted functional impact (VEP/ANNOVAR), ClinVar pathogenicity classification, population allele frequencies (gnomAD), and ACMG/AMP clinical significance criteria.
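
The size-based categories above map directly onto the REF and ALT fields of a VCF record. The sketch below is a simplified classifier using the 50 bp indel cutoff from the definition; the function name is illustrative, the "MNV" label for same-length multi-base substitutions is an addition not covered in the glossary, and CNVs are not distinguished from other structural variants here.

```python
def classify_variant(ref, alt, max_indel_len=50):
    """Classify a single REF/ALT allele pair by size, VCF-style."""
    # Symbolic alleles (<DEL>, <DUP>, <INV>) and breakend notation mark SVs.
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "SV"
    if len(ref) == len(alt):
        # Same length: single-base change, or a multi-base substitution.
        return "SNV" if len(ref) == 1 else "MNV"
    size = abs(len(ref) - len(alt))
    return "indel" if size <= max_indel_len else "SV"


examples = [
    ("A", "G"),                # single base change
    ("AT", "A"),               # 1 bp deletion
    ("A", "A" + "T" * 60),     # 60 bp insertion, above the indel cutoff
    ("N", "<DEL>"),            # symbolic deletion allele
]
labels = [classify_variant(ref, alt) for ref, alt in examples]
```

Real callers and annotators apply more nuanced rules (normalization, left-alignment, multi-allelic sites), so this is only a first-pass bucketing.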

DESeq2

DESeq2 is a widely used R/Bioconductor package for differential expression analysis of RNA-seq count data. It models counts using a negative binomial distribution and applies empirical Bayes shrinkage of log2 fold change estimates to reduce noise for low-count genes. Key features: size factor normalization to correct for library size differences; dispersion estimation sharing information across genes; Wald test for significance; and independent filtering to maximize the number of genes passing an FDR threshold (typically padj < 0.05). Results include log2 fold change, Wald statistic, p-value, and adjusted p-value (Benjamini-Hochberg) for every gene. edgeR and limma-voom are the two most common alternatives with different statistical approaches. BioMate runs all three and returns the union of significant hits with QC-graded concordance.
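
Size factor normalization is the easiest of these steps to illustrate. The sketch below follows the median-of-ratios idea: compute each gene's geometric mean across samples, divide each sample's count by it, and take the per-sample median of those ratios. It is a simplified stand-alone illustration in Python with made-up counts, not DESeq2's own code.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one entry per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes with a zero in any sample are skipped.
    log_geo_mean = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(counts[g][j]) - lgm for g, lgm in log_geo_mean.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
counts = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [5, 10],
    "geneD": [0, 3],      # excluded from the reference because of the zero
}
factors = size_factors(counts)
normalized = {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}
```

Dividing by the size factors puts both samples on a common scale, so geneA through geneC have equal normalized counts in this toy example.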

Multi-omics

Multi-omics refers to the integration of data from multiple omic layers — genomics (DNA variants), transcriptomics (RNA expression), proteomics (protein abundance), metabolomics (metabolite levels), and epigenomics (chromatin accessibility, methylation) — to build a more complete biological picture than any single modality can provide. Integration approaches include: MOFA+ (multi-omics factor analysis) for unsupervised latent factor decomposition; iCluster for cancer subtyping; mixOmics for supervised integration; and network-based methods (OmicsNet, WGCNA). Multi-omics integration is particularly powerful for biomarker discovery, drug mechanism of action studies, and patient stratification in clinical trials.

Cryo-EM

Cryo-Electron Microscopy

Cryo-EM determines the 3D structure of biological molecules by imaging them in a near-native, frozen-hydrated state using an electron beam. Samples are vitrified in a thin film of amorphous ice by plunge-freezing in liquid ethane. An electron microscope then records thousands of micrographs, each containing 2D projections of many randomly oriented particles. Computational reconstruction (single-particle analysis, SPA) aligns and averages these projections to produce a 3D density map, from which an atomic model is built. Modern cryo-EM achieves 2–3 Å resolution, comparable to X-ray crystallography, without requiring crystals. This has been transformative for membrane proteins, large complexes (>150 kDa), and dynamic assemblies. The 2017 Nobel Prize in Chemistry went to Jacques Dubochet, Joachim Frank, and Richard Henderson for developing the technique. Software: CryoSPARC and RELION are the dominant platforms.

SPA

Single-Particle Analysis

Single-particle analysis (SPA) is the dominant cryo-EM workflow for determining structures of purified, homogeneous protein complexes. The SPA pipeline includes: motion correction (aligning movie frames to compensate for beam-induced motion and stage drift), CTF estimation (determining defocus and other contrast transfer function parameters for each micrograph), particle picking (identifying individual protein molecules using templates or neural networks such as Topaz), 2D classification (grouping particles by view and removing junk), 3D classification (separating conformational or compositional states), homogeneous refinement (maximizing resolution), and map validation (FSC curves, local resolution estimation). BioMate runs CryoSPARC SPA as a 5-phase, 23-step pipeline on AWS GPU instances, with evidence-graded QC at each phase.

AlphaFold

AlphaFold is an AI system developed by Google DeepMind that predicts protein 3D structures from amino acid sequences, often with near-experimental accuracy. AlphaFold2, assessed at CASP14 in 2020 and published in 2021, uses an attention-based architecture operating on multiple sequence alignments (MSAs) and pairwise residue representations; it achieved a median GDT_TS of 92.4 across CASP14 targets, surpassing all other methods and approaching experimental accuracy for many of them. AlphaFold3 (2024) extends prediction to protein–DNA, protein–RNA, and protein–ligand complexes using a diffusion-based architecture. The AlphaFold Protein Structure Database (AFDB) provides predicted structures for over 200 million proteins, covering nearly all catalogued protein sequences. Each prediction comes with confidence metrics: pLDDT (per-residue confidence, 0–100) and PAE (predicted aligned error, used to judge the relative placement of domains). BioMate integrates AlphaFold2 and AF3 predictions directly into docking and binding site analysis pipelines.
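
Reading pLDDT is straightforward because AlphaFold PDB files store it in the B-factor column of each ATOM record. The sketch below pulls one value per residue from the C-alpha atoms and assigns the confidence bands commonly used when viewing AFDB models (above 90, 70 to 90, 50 to 70, below 50). The helper make_ca_line exists only to build correctly aligned demo records, and the residues and scores are invented.

```python
def plddt_by_residue(pdb_text):
    """Map (chain, residue number) -> pLDDT, read from C-alpha B-factor fields."""
    scores = {}
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name 13-16, chain 22, resSeq 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[(line[21], int(line[22:26]))] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Bucket a pLDDT value into the commonly used confidence bands."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "high"
    if plddt > 50:
        return "low"
    return "very low"


def make_ca_line(serial, res, chain, resseq, plddt):
    """Demo helper: build a column-aligned C-alpha ATOM record (coordinates zeroed)."""
    return (f"ATOM  {serial:>5}  CA  {res:>3} {chain}{resseq:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


demo = "\n".join([
    make_ca_line(1, "MET", "A", 1, 95.3),
    make_ca_line(2, "GLY", "A", 2, 62.1),
    make_ca_line(3, "SER", "A", 3, 41.0),
])
scores = plddt_by_residue(demo)
bands = {key: confidence_band(value) for key, value in scores.items()}
```

A docking workflow might use such bands to exclude pockets lined by low-confidence residues, although where to draw that line is a judgment call.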

Molecular Docking

Molecular docking predicts the preferred orientation (binding pose) of a small molecule (ligand) within a target protein's binding site and estimates its binding affinity (ΔG). Rigid docking holds both protein and ligand rigid; flexible docking samples ligand conformations and optionally protein side-chain flexibility; induced-fit docking additionally lets the binding site rearrange around the ligand. Key software: AutoDock Vina (widely used; iterated local search with gradient-based optimization, whereas the Lamarckian genetic algorithm belongs to the older AutoDock 4), Glide (Schrödinger, hierarchical filtering and scoring), DiffDock (diffusion-based, no predefined binding site needed). Docking scores correlate only roughly with measured binding affinity and should be read as a ranking rather than a quantitative prediction, so top-scoring poses are prioritized for experimental validation. BioMate chains AlphaFold structure prediction directly into AutoDock Vina docking in a single automated pipeline.
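
To give a feel for the energy scale, the standard thermodynamic relation ΔG = RT ln(Kd) converts a binding free energy into a dissociation constant. The sketch below applies it at 298.15 K. Treating a docking score as if it were a true ΔG is an assumption made only for illustration, since docking scores are not quantitative.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal/(mol*K)


def kd_from_dg(dg_kcal_per_mol, temp_k=298.15):
    """Dissociation constant (mol/L) implied by a binding free energy."""
    return math.exp(dg_kcal_per_mol / (R_KCAL * temp_k))


def dg_from_kd(kd_molar, temp_k=298.15):
    """Binding free energy (kcal/mol) implied by a dissociation constant."""
    return R_KCAL * temp_k * math.log(kd_molar)


kd = kd_from_dg(-9.0)                        # hypothetical score of -9 kcal/mol
per_decade = dg_from_kd(1e-6) - dg_from_kd(1e-5)   # energy per 10x change in Kd
```

A score of -9 kcal/mol corresponds to a Kd in the sub-micromolar range, and each tenfold change in Kd shifts ΔG by about 1.4 kcal/mol, which is comparable to typical docking score error and explains why scores are used for ranking only.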

Molecular Dynamics (MD)

Molecular dynamics simulation numerically integrates Newton's equations of motion for all atoms in a molecular system (protein, ligand, solvent, membrane) using a force field (CHARMM36, AMBER ff14SB, OPLS-AA) to describe inter-atomic forces. MD reveals conformational dynamics that static crystal or cryo-EM structures cannot capture: protein flexibility, induced-fit binding, allosteric communication, and membrane permeability. Key analyses: RMSD (structural stability), RMSF (per-residue flexibility), hydrogen bond occupancy, and binding free energy (MM/GBSA, FEP). GPU-accelerated MD with GROMACS or OpenMM can simulate microsecond timescales. BioMate runs MD as part of ADMET lead optimization (binding stability validation) and standalone conformational analysis.
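
The core loop of an MD engine, numerically integrating Newton's equations, can be shown on a one-dimensional toy system. The sketch below uses the velocity Verlet scheme, one common integrator of this family, on a single particle attached to a harmonic spring. The force function, mass, spring constant, and time step are arbitrary illustrative values in reduced units, standing in for a real force field.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one particle in 1D; returns the list of (position, velocity)."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass     # half-step velocity update
        x = x + dt * v_half                  # full-step position update
        f = force(x)                         # force at the new position
        v = v_half + 0.5 * dt * f / mass     # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


# Toy "force field": a harmonic spring, F = -k * x (reduced units).
k, m = 1.0, 1.0


def spring_force(x):
    return -k * x


trajectory = velocity_verlet(x=1.0, v=0.0, force=spring_force, mass=m,
                             dt=0.01, n_steps=10_000)
# Total energy should stay close to its initial value of 0.5 throughout.
energies = [0.5 * m * v * v + 0.5 * k * x * x for x, v in trajectory]
max_drift = max(abs(e - energies[0]) for e in energies)
```

Good long-term energy conservation is the main reason integrators of this type are preferred; a real simulation applies the same update to every atom in three dimensions with forces from the chosen force field.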

Nextflow / nf-core

Nextflow is a workflow management system (WMS) that enables scalable, reproducible bioinformatics pipelines. Pipelines are written in Nextflow DSL2 (a Groovy-based domain-specific language) and run unchanged on HPC schedulers (SLURM, PBS), cloud services (AWS Batch, Google Cloud Batch), or local machines, with containers keeping results consistent across environments. nf-core is a community organization that maintains a library of peer-reviewed, continuously tested Nextflow pipelines following strict coding standards: containerization (Docker/Singularity), comprehensive test datasets, CI/CD, and extensive documentation. Key nf-core pipelines: nf-core/rnaseq, nf-core/sarek (WGS/WES), nf-core/atacseq, nf-core/chipseq, nf-core/proteomicslfq, nf-core/differentialabundance. BioMate AI runs nf-core pipelines on AWS Batch, exposing them through plain-English requests without requiring Nextflow expertise.

Bioconductor

Bioconductor is a collection of open-source R packages for the analysis of genomic data, maintained by an international community with a rigorous review and release process. It provides: core data structures (SummarizedExperiment, SingleCellExperiment, GRanges), statistical methods (DESeq2, edgeR, limma, clusterProfiler, MAST), annotation resources (Homo.sapiens, BSgenome, GenomicFeatures), and visualization tools (EnhancedVolcano, ComplexHeatmap). Bioconductor has ~2,300 packages covering bulk RNA-seq, single-cell, methylation, ChIP-seq, spatial transcriptomics, proteomics, and metabolomics. BioMate AI indexes 3,300+ Bioconductor workflows and makes them accessible through plain-English requests without requiring R programming.