Modeling an FGFR Kinase Inhibitor with Boltz-1 on DiPhyx: A Sequence-to-Structure-to-Function Workflow
Why This Case Study Matters
FGFR kinases (FGFR1–3) are implicated in diverse cancers. Infigratinib (BGJ-398), an ATP-competitive inhibitor, received FDA accelerated approval in 2021 for previously treated FGFR2 fusion-positive cholangiocarcinoma and has been in trials for other FGFR-addicted cancers. Understanding the structural basis of its activity supports:
Precision medicine — anticipating resistance mutations and optimizing analogs.
Biomarker discovery — linking structure to downstream gene expression.
In-silico screening — evaluating analogs computationally before synthesis.
Boltz-1, a generative diffusion model, predicts protein-ligand complex structures. This notebook demonstrates how to run Boltz-1 within the DiPhyx platform, integrate molecular modeling and transcriptomics, and interpret biological outcomes.
Pipeline Overview
Stage | Key Tool | Output |
---|---|---|
A. Target & ligand prep | UniProt, RDKit | FGFR1-3 kinase sequences; 3D mol of Infigratinib |
B. Structure prediction | Boltz-1 | PDBs of FGFR–drug complexes + per-model confidence |
C. Expression signature | Scanpy + GSEApy (bulk/SC datasets) | Differential gene lists & pathway NES |
D. Interpretation | PyMOL, volcano/heat-maps | Structure-function narrative & design hypotheses |
Compute Recommendations: Run this notebook on GPU-enabled units on DiPhyx. Recommended instances include:
g4dn.4xlarge (16 cores, 64 GB RAM, Tesla T4 GPU)
g4dn.2xlarge (8 cores, 16 GB RAM, Tesla T4 GPU)
g6.2xlarge (8 cores, 32 GB RAM, NVIDIA L4 GPU)
Practical Walk-through
Prepare Inputs
Fetch the Infigratinib SMILES from PubChem and generate a 3D conformer (the FGFR sequences are fetched in the next step):
"""Build a 3-D conformer of Infigratinib from its PubChem SMILES, in pure
Python so it can run inside a notebook."""
import requests
from pathlib import Path
from rdkit import Chem
from rdkit.Chem import AllChem

boltz_input_path = Path("boltz_inputs")
boltz_input_path.mkdir(exist_ok=True)

# Fetch the isomeric SMILES for Infigratinib (PubChem CID 50909836)
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/50909836/property/IsomericSMILES/JSON"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
props = resp.json()['PropertyTable']['Properties'][0]
# PubChem may return this property under the key "SMILES"; accept either
smiles = props.get('IsomericSMILES') or props['SMILES']
print("SMILES:", smiles[:60], "…")

# Build the 3-D ligand: add hydrogens, embed a conformer, relax it with UFF
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise ValueError("RDKit could not parse the SMILES string")
mol = Chem.AddHs(mol)
if AllChem.EmbedMolecule(mol, randomSeed=42) != 0:  # returns -1 on failure
    raise RuntimeError("RDKit could not embed a 3-D conformer")
AllChem.UFFOptimizeMolecule(mol)
mol_file = boltz_input_path / "Infigratinib.mol"
Chem.MolToMolFile(mol, str(mol_file))  # str() for RDKit builds that need a string path
print("Wrote 3D MOL →", mol_file)
Note: this cell needs RDKit. If it stops with ModuleNotFoundError: No module named 'rdkit', install the package first (pip install rdkit) and re-run the cell.
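A missing-module failure like the one for rdkit can be caught before any download or computation starts. The sketch below checks which import names are unavailable in the current kernel; the helper name `missing_modules` and the list of required modules are illustrative assumptions, not part of the original notebook.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# These are import names, not pip package names (PyYAML is imported as "yaml").
required = ["requests", "rdkit", "yaml"]
missing = missing_modules(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Running this at the top of the notebook turns a mid-cell traceback into a short list of packages to install.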
Generate YAML programmatically
The snippet below downloads the full FGFR1–3 sequences from UniProt, slices each kinase domain, and writes one Boltz-1 YAML per receptor in boltz_inputs/. It re-uses the smiles variable created in the previous cell. Domain boundaries differ between receptors, so each is sliced with its own range, taken here from the UniProt protein kinase domain annotation (FGFR1 478–767, FGFR2 481–770, FGFR3 472–761; confirm against the current entries). These ranges include the residues inspected later, Cys488 and the hinge Ala564 of FGFR1:
import requests, yaml
from pathlib import Path

boltz_input_path = Path("boltz_inputs")
boltz_input_path.mkdir(exist_ok=True)

# UniProt accession and kinase-domain range (1-based, inclusive) per receptor
targets = {
    "FGFR1": ("P11362", 478, 767),
    "FGFR2": ("P21802", 481, 770),
    "FGFR3": ("P22607", 472, 761),
}

seq_dict = {}
for name, (uid, start, end) in targets.items():
    url = f"https://rest.uniprot.org/uniprotkb/{uid}.fasta"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Drop the FASTA header line and join the sequence lines
    full_seq = "".join(l.strip() for l in resp.text.splitlines() if not l.startswith(">"))
    kd_seq = full_seq[start - 1:end]  # 1-based inclusive range → 0-based slice
    seq_dict[name] = kd_seq
    print(f"{name}: kinase domain length = {len(kd_seq)} aa")

inputs_dict = {}
for name, kd_seq in seq_dict.items():
    yaml_dict = {
        "version": 1,
        "sequences": [
            # Pass the sequence as one unbroken string; wrapping would embed newlines
            {"protein": {"id": "A", "sequence": kd_seq}},
            {"ligand": {"id": "B", "smiles": smiles}},
        ],
    }
    outfile = boltz_input_path / f"{name.lower()}_infig.yaml"
    with open(outfile, "w") as fh:
        yaml.safe_dump(yaml_dict, fh, sort_keys=False)
    inputs_dict[name] = str(outfile)
    print("Wrote", outfile)
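Because only the sliced domain goes into the YAML, the predicted model is expected to number its residues from 1 at the start of the slice rather than in full-length UniProt numbering. The helper below is a minimal sketch for converting between the two when looking up residues such as Cys488 or Ala564. The function names are hypothetical, and the slice 478–767 used in the example is an assumption (the UniProt-annotated FGFR1 protein kinase domain); substitute the first and last residue of whatever slice you actually wrote.

```python
def uniprot_to_model(resnum, slice_start, slice_end):
    """Map a full-length (UniProt) residue number to its 1-based position
    in a model built from residues slice_start..slice_end (inclusive)."""
    if not slice_start <= resnum <= slice_end:
        raise ValueError(
            f"residue {resnum} is outside the slice {slice_start}-{slice_end}"
        )
    return resnum - slice_start + 1

def model_to_uniprot(index, slice_start):
    """Inverse mapping: 1-based model position back to UniProt numbering."""
    return index + slice_start - 1

# Assumed slice for FGFR1; replace with the range used in your YAML.
SLICE_START, SLICE_END = 478, 767
for label, resnum in [("Cys488", 488), ("Ala564", 564)]:
    idx = uniprot_to_model(resnum, SLICE_START, SLICE_END)
    print(f"{label} → model residue {idx}")
```

The range check is deliberate: asking for a residue that is not in the slice raises an error instead of silently returning a meaningless index.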
Run Boltz-1
The first step is to install Boltz-1. The commands below clone the Boltz-1 GitHub repository and install it in editable mode (the commented-out pip install boltz line is the PyPI alternative). Build dependencies such as CMake and the C++ and Fortran compilers must be installed first; see the installation instructions in the Boltz-1 GitHub repository.
!conda install -y -c conda-forge gfortran_linux-64 compilers git cmake openblas openblas-devel > /dev/null 2>&1
!pip install rdkit
!pip install pyyaml
# !pip install boltz
!git clone https://github.com/jwohlwend/boltz.git
!cd boltz; pip install -e .
import os
import subprocess

boltz_output_path = "boltz_output"
os.makedirs(boltz_output_path, exist_ok=True)  # ensure the output directory exists

def run_and_stream(cmd):
    """Run a command, echoing its combined stdout/stderr in real time."""
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in process.stdout:
        print(line, end='')
    process.wait()
    if process.returncode != 0:
        print(f"Process exited with code {process.returncode}")
    return process.returncode

# Run one prediction per receptor (FGFR1, FGFR2, FGFR3)
for name, yaml_input_file in inputs_dict.items():
    cmd = [
        "boltz", "predict",
        yaml_input_file,
        "--out_dir", boltz_output_path,
        "--recycling_steps", "10",
        "--diffusion_samples", "8",
        "--output_format", "pdb",  # write PDB files (the default is mmCIF)
        "--cache", "/volume/boltz_cache",
        "--use_msa_server",
    ]
    print(f"Running command: {' '.join(cmd)}")
    run_and_stream(cmd)
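After the runs finish, it helps to know where each model and its confidence file should be. The sketch below builds the expected paths, assuming Boltz-1 writes results under `<out_dir>/boltz_results_<input stem>/predictions/<input stem>/` with files named `<stem>_model_<i>.<ext>` and `confidence_<stem>_model_<i>.json`. That layout is an assumption and may differ between Boltz versions, so compare it with your own output directory. The helper name `expected_outputs` is hypothetical, and `ext` depends on the output format requested ("cif" for mmCIF, "pdb" for PDB).

```python
from pathlib import Path

def expected_outputs(out_dir, yaml_file, n_samples, ext="cif"):
    """Return (structure_path, confidence_path) pairs, one per diffusion sample,
    under the assumed Boltz-1 output layout."""
    stem = Path(yaml_file).stem
    pred_dir = Path(out_dir) / f"boltz_results_{stem}" / "predictions" / stem
    return [
        (
            pred_dir / f"{stem}_model_{i}.{ext}",
            pred_dir / f"confidence_{stem}_model_{i}.json",
        )
        for i in range(n_samples)
    ]

# Report which of the expected FGFR1 models are actually on disk.
pairs = expected_outputs("boltz_output", "boltz_inputs/fgfr1_infig.yaml", 8, ext="pdb")
for structure, confidence in pairs:
    status = "found" if structure.exists() else "missing"
    print(f"{structure} [{status}]")
```

A "missing" entry usually means the run failed or the layout assumption does not hold for the installed version.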
Binding Pose Visualization
To inspect the binding pose of Infigratinib in the FGFR1 kinase domain, load fgfr1_infig_model_0.pdb (generated by Boltz-1) in PyMOL, a molecular visualization tool for viewing and manipulating 3D structures of proteins and ligands.
You can launch PyMOL on the compute unit of your choice: go to the flow, find PyMOL, click the “Tryout” button, and select a compute unit. This opens a new PyMOL instance in your browser.
Once the structure is open in PyMOL, check the following (residue numbers are full-length FGFR1 numbers; the model file itself is expected to number residues from 1 at the start of the sliced sequence):
Dichloro-dimethoxyphenyl ring buried in the hydrophobic back pocket next to the gatekeeper Val561. Infigratinib is a reversible inhibitor with no acrylamide warhead, so no covalent bond to Cys488 is expected.
Hinge hydrogen bonds to the Ala564 backbone.
Confidence JSON → ligand_iptm > 0.6 suggests a confidently placed pose (a rule of thumb, not proof of a stable complex).
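The ligand_iptm check can be applied to all diffusion samples at once. The sketch below reads the per-model confidence JSON files from a prediction folder and ranks the models by ligand_iptm, flagging those above the 0.6 rule of thumb. The file pattern `confidence_*.json` and the helper names are assumptions, and the numbers in the example are made-up placeholders, not Boltz-1 output.

```python
import json
from pathlib import Path

def load_confidences(pred_dir):
    """Read every confidence_*.json in pred_dir into {filename: parsed dict}."""
    files = sorted(Path(pred_dir).glob("confidence_*.json"))
    return {p.name: json.loads(p.read_text()) for p in files}

def rank_by_ligand_iptm(confidences, threshold=0.6):
    """Return (name, ligand_iptm, passes_threshold) tuples, best model first."""
    rows = [
        (name, conf["ligand_iptm"], conf["ligand_iptm"] > threshold)
        for name, conf in confidences.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Placeholder values for illustration only.
example = {
    "confidence_fgfr1_infig_model_0.json": {"ligand_iptm": 0.55},
    "confidence_fgfr1_infig_model_1.json": {"ligand_iptm": 0.82},
    "confidence_fgfr1_infig_model_2.json": {"ligand_iptm": 0.61},
}
for name, score, ok in rank_by_ligand_iptm(example):
    print(f"{name}: ligand_iptm={score:.2f} {'pass' if ok else 'below threshold'}")
```

With real output, pass the result of `load_confidences(<prediction folder>)` to `rank_by_ligand_iptm` and open the top-ranked model in PyMOL first.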
Link Structure to Transcriptional Response
Obtain a public RNA-seq dataset in which FGFR-addicted cells are treated with BGJ-398 (e.g. GEO GSE65324), then analyse it with Scanpy and GSEApy:
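import scanpy as sc
import gseapy as gp

# Expects log-normalized expression and a 'condition' column in adata.obs
adata = sc.read_h5ad("BGJ398_treated_vs_control.h5ad")
sc.tl.rank_genes_groups(adata, 'condition', groups=['treated'], reference='control')
deg = sc.get.rank_genes_groups_df(adata, 'treated')

# Pre-ranked list: one score per gene, missing values dropped, highest log fold change first
rank = (deg[['names', 'logfoldchanges']]
        .dropna()
        .drop_duplicates('names')
        .sort_values('logfoldchanges', ascending=False))
enrich = gp.prerank(rnk=rank, gene_sets='MSigDB_Hallmark_2020')
enrich.res2d.head(10)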
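A pre-ranked enrichment is only as good as the ranking it receives: genes should appear once and carry a finite score. The pure-Python sketch below shows one way to clean such a list. The helper name `clean_rank` is hypothetical, the choice to keep the entry with the largest absolute score for a duplicated gene is an assumption (other policies are equally defensible), and the gene scores in the example are placeholders, not measured values.

```python
import math

def clean_rank(pairs):
    """Take (gene, score) pairs, drop missing or non-finite scores, keep the
    strongest score per gene, and return pairs sorted from highest to lowest."""
    best = {}
    for gene, score in pairs:
        if score is None or not math.isfinite(score):
            continue  # NaN and +/-inf cannot be ranked meaningfully
        if gene not in best or abs(score) > abs(best[gene]):
            best[gene] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Placeholder scores for illustration only.
raw = [
    ("MYC", -2.1),
    ("CDKN1A", 1.8),
    ("MYC", -0.4),             # duplicate gene, weaker score
    ("E2F1", float("nan")),    # missing estimate
    ("DUSP6", float("-inf")),  # zero counts in one group
]
print(clean_rank(raw))
```

Infinite log fold changes typically come from genes with zero counts in one condition, so dropping them here is a pragmatic choice rather than a statistical one.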
Expected results
Observation | Structural rationale |
---|---|
Down-reg of E2F targets, MYC targets | Loss of FGFR/ERK proliferative signalling |
Up-reg of p53 pathway, apoptosis | FGFR blockade induces cell-cycle arrest |
Feedback ↓ in FGFR1/2 mRNA | Kinase pocket occupancy disrupts receptor recycling |
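The expectations in the table can be compared against the enrichment output in a few lines. The sketch below takes a mapping from pathway term to normalized enrichment score (NES) and reports whether each expected direction is observed. The term names are assumptions about how the hallmark sets are labelled in the gene-set library, the helper name is hypothetical, and the NES values are placeholders, not results.

```python
def check_directions(nes_by_term, expected):
    """Compare the sign of each observed NES with the expected direction
    ('up' or 'down'); terms absent from the results are reported as such."""
    report = {}
    for term, direction in expected.items():
        nes = nes_by_term.get(term)
        if nes is None:
            report[term] = "not in results"
            continue
        observed = "up" if nes > 0 else "down" if nes < 0 else "flat"
        report[term] = "as expected" if observed == direction else f"unexpected ({observed})"
    return report

# Expected directions from the table; term labels are assumed.
expected = {
    "E2F Targets": "down",
    "Myc Targets V1": "down",
    "p53 Pathway": "up",
    "Apoptosis": "up",
}
# Placeholder NES values for illustration only.
observed_nes = {"E2F Targets": -2.3, "Myc Targets V1": -1.9, "p53 Pathway": 1.4}

for term, verdict in check_directions(observed_nes, expected).items():
    print(f"{term}: {verdict}")
```

In practice the NES mapping would be built from the term and NES columns of the enrichment results table.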
Combining a volcano plot of the DEGs with a PyMOL snapshot of the predicted pose gives a coherent narrative from pocket blockade to pathway shutdown.