Add revisions for 0.0.3 #9

Merged: 6 commits, Mar 21, 2025
2 changes: 1 addition & 1 deletion .env.example
@@ -3,4 +3,4 @@
# .env is loaded by train.py automatically
# hydra allows you to reference variables in .yaml configs with special syntax: ${oc.env:MY_VAR}

MY_VAR="/home/user/my/system/path"
PLINDER_MOUNT="$(pwd)/data/PLINDER"
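Note on the `PLINDER_MOUNT` value: whether `$(pwd)` is expanded depends on what reads the file. A shell that sources `.env` would run the command substitution, but a plain key/value parser generally would not. The sketch below is a hypothetical stdlib-only parser, not the loader `train.py` actually uses (this diff does not show it); `parse_env_line` and `expand_pwd` are illustrative names. It shows the value staying literal unless it is expanded explicitly.

```python
import os
from typing import Optional, Tuple


def parse_env_line(line: str) -> Optional[Tuple[str, str]]:
    """Parse one KEY="value" line; comments and blank lines yield None."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or "=" not in stripped:
        return None
    key, _, value = stripped.partition("=")
    # Only surrounding double quotes are removed; nothing is executed.
    return key.strip(), value.strip().strip('"')


def expand_pwd(value: str, cwd: Optional[str] = None) -> str:
    """Replace a literal `$(pwd)` with a working directory (no shell involved)."""
    return value.replace("$(pwd)", cwd if cwd is not None else os.getcwd())


parsed = parse_env_line('PLINDER_MOUNT="$(pwd)/data/PLINDER"')
if parsed is not None:
    key, raw = parsed
    print(key, "(raw):", raw)
    print(key, "(expanded):", expand_pwd(raw))
```

If the literal string turns out to be what the training code receives, exporting the variable from the shell, as the README hunk further down does, would sidestep the question.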
16 changes: 8 additions & 8 deletions .pre-commit-config.yaml
@@ -18,18 +18,18 @@ repos:
- id: check-toml
- id: check-case-conflict
- id: check-added-large-files
args: ["--maxkb=10000"]
args: ["--maxkb=20000"]

# python code formatting
- repo: https://github.com/psf/black
rev: 24.10.0
rev: 25.1.0
hooks:
- id: black
args: [--line-length, "99"]

# python import sorting
- repo: https://github.com/PyCQA/isort
rev: 5.13.2
rev: 6.0.1
hooks:
- id: isort
args: ["--profile", "black", "--filter-files"]
@@ -43,7 +43,7 @@ repos:

# python docstring formatting
- repo: https://github.com/myint/docformatter
rev: v1.7.5
rev: eb1df347edd128b30cd3368dddc3aa65edcfac38 # Don't autoupdate until https://github.com/PyCQA/docformatter/issues/293 is fixed
hooks:
- id: docformatter
args:
@@ -73,7 +73,7 @@ repos:

# python check (PEP8), programming errors and code complexity
- repo: https://github.com/PyCQA/flake8
rev: 7.1.1
rev: 7.1.2
hooks:
- id: flake8
args:
@@ -87,7 +87,7 @@ repos:

# python security linter
- repo: https://github.com/PyCQA/bandit
rev: "1.8.2"
rev: "1.8.3"
hooks:
- id: bandit
args: ["-s", "B101"]
@@ -108,7 +108,7 @@ repos:

# md formatting
- repo: https://github.com/executablebooks/mdformat
rev: 0.7.21
rev: 0.7.22
hooks:
- id: mdformat
args: ["--number"]
@@ -121,7 +121,7 @@ repos:

# word spelling linter
- repo: https://github.com/codespell-project/codespell
rev: v2.3.0
rev: v2.4.1
hooks:
- id: codespell
args:
7 changes: 5 additions & 2 deletions Dockerfile
@@ -20,18 +20,21 @@ RUN mkdir -p /software/flowdock
WORKDIR /software/flowdock

## Clone project
RUN git clone https://github.com/BioinfoMachineLearning/FlowDock /software/flowdock
RUN git clone https://github.com/BioinfoMachineLearning/FlowDock /software/flowdock

## Create conda environment
# RUN conda env create -f environments/flowdock_environment.yaml
COPY environments/flowdock_environment_docker.yaml /software/flowdock/environments/flowdock_environment_docker.yaml
RUN conda env create -f environments/flowdock_environment_docker.yaml

# Install ProDy without NumPy dependency
RUN python -m pip install --no-cache-dir --no-dependencies prody==2.4.1

## Automatically activate conda environment
RUN echo "source activate flowdock" >> /etc/profile.d/conda.sh && \
echo "source /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
echo "conda activate flowdock" >> ~/.bashrc

## Default shell and command
SHELL ["/bin/bash", "-l", "-c"]
CMD ["/bin/bash"]
CMD ["/bin/bash"]
69 changes: 44 additions & 25 deletions README.md
@@ -12,7 +12,7 @@

<!-- [![Conference](http://img.shields.io/badge/AnyConference-year-4b44ce.svg)](https://papers.nips.cc/paper/2020) -->

[![Data DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.14660031.svg)](https://doi.org/10.5281/zenodo.14660031)
[![Data DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.15066450.svg)](https://doi.org/10.5281/zenodo.15066450)

<img src="./img/FlowDock.png" width="600">

@@ -24,7 +24,7 @@ This is the official codebase of the paper

**FlowDock: Geometric Flow Matching for Generative Protein-Ligand Docking and Affinity Prediction**

\[[arXiv](https://arxiv.org/abs/2412.10966)\]
\[[arXiv](https://arxiv.org/abs/2412.10966)\] \[[Neurosnap](https://neurosnap.ai/service/FlowDock)\] \[[Tamarind Bio](https://app.tamarind.bio/tools/flowdock)\]

<div align="center">

@@ -76,6 +76,7 @@ cd FlowDock
mamba env create -f environments/flowdock_environment.yaml
conda activate FlowDock # NOTE: one still needs to use `conda` to (de)activate environments
pip3 install -e . # install local project as package
pip3 install prody==2.4.1 --no-dependencies # install ProDy without NumPy dependency
```

Download checkpoints
@@ -91,7 +92,7 @@ cd ../

```bash
# pretrained FlowDock weights
wget https://zenodo.org/records/14660031/files/flowdock_checkpoints.tar.gz
wget https://zenodo.org/records/15066450/files/flowdock_checkpoints.tar.gz
tar -xzf flowdock_checkpoints.tar.gz
rm flowdock_checkpoints.tar.gz
```
@@ -105,19 +106,19 @@ tar -xzf flowdock_data_cache.tar.gz
rm flowdock_data_cache.tar.gz

# cached data for PDBBind, Binding MOAD, DockGen, and the PDB-based van der Mers (vdM) dataset
wget https://zenodo.org/records/14660031/files/flowdock_pdbbind_data.tar.gz
wget https://zenodo.org/records/15066450/files/flowdock_pdbbind_data.tar.gz
tar -xzf flowdock_pdbbind_data.tar.gz
rm flowdock_pdbbind_data.tar.gz

wget https://zenodo.org/records/14660031/files/flowdock_moad_data.tar.gz
wget https://zenodo.org/records/15066450/files/flowdock_moad_data.tar.gz
tar -xzf flowdock_moad_data.tar.gz
rm flowdock_moad_data.tar.gz

wget https://zenodo.org/records/14660031/files/flowdock_dockgen_data.tar.gz
wget https://zenodo.org/records/15066450/files/flowdock_dockgen_data.tar.gz
tar -xzf flowdock_dockgen_data.tar.gz
rm flowdock_dockgen_data.tar.gz

wget https://zenodo.org/records/14660031/files/flowdock_pdbsidechain_data.tar.gz
wget https://zenodo.org/records/15066450/files/flowdock_pdbsidechain_data.tar.gz
tar -xzf flowdock_pdbsidechain_data.tar.gz
rm flowdock_pdbsidechain_data.tar.gz
```
@@ -129,7 +130,7 @@ rm flowdock_pdbsidechain_data.tar.gz
<details>

**NOTE:** The following steps (besides downloading PDBBind and Binding MOAD's PDB files) are only necessary if one wants to fully process each of the following datasets manually.
Otherwise, preprocessed versions of each dataset can be found on [Zenodo](https://zenodo.org/records/14660031).
Otherwise, preprocessed versions of each dataset can be found on [Zenodo](https://zenodo.org/records/15066450).

Download data

@@ -159,6 +160,16 @@ mv pdb_2021aug02/ pdbsidechain/
cd ../
```

Lastly, to finetune `FlowDock` using the `PLINDER` dataset, one must first prepare this data for training

```bash
# fetch PLINDER data (NOTE: requires ~1 hour to download and ~750G of storage)
export PLINDER_MOUNT="$(pwd)/data/PLINDER"
mkdir -p "$PLINDER_MOUNT" # create the directory if it doesn't exist

plinder_download -y
```
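
Since the note above quotes roughly 750G of storage, it may be worth checking free space under `PLINDER_MOUNT` before running `plinder_download`. The following is a minimal sketch using only the Python standard library; `free_gb` and `has_room` are hypothetical helpers, and reading "750G" as 750 × 10^9 bytes is an assumption.

```python
import os
import shutil
from pathlib import Path

REQUIRED_GB = 750  # approximate figure quoted in the README note (assumed decimal GB)


def free_gb(path: str) -> float:
    """Return free space, in GB, on the filesystem that would hold `path`."""
    probe = Path(path).expanduser().resolve()
    # The target directory may not exist yet, so walk up to an existing ancestor.
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


def has_room(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """True when the filesystem holding `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    mount = os.environ.get("PLINDER_MOUNT", "data/PLINDER")
    print(f"{free_gb(mount):.0f} GB free under {mount}; enough: {has_room(mount)}")
```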

### Generating ESM2 embeddings for each protein (optional, cached input data available on SharePoint)

To generate the ESM2 embeddings for the protein inputs,
@@ -260,10 +271,10 @@ python flowdock/train.py experiment=flowdock_fm
python flowdock/train.py experiment=flowdock_fm trainer.max_epochs=20 data.batch_size=8
```

For example, override parameters to finetune `FlowDock`'s pretrained weights using a new dataset
For example, override parameters to finetune `FlowDock`'s pretrained weights using a new dataset such as [PLINDER](https://www.plinder.sh/)

```bash
python flowdock/train.py experiment=flowdock_fm data=my_new_datamodule ckpt_path=checkpoints/esmfold_prior_paper_weights.ckpt
python flowdock/train.py experiment=flowdock_fm data=plinder ckpt_path=checkpoints/esmfold_prior_paper_weights.ckpt
```

</details>
@@ -277,7 +288,7 @@ To reproduce `FlowDock`'s evaluation results for structure prediction, please re
To reproduce `FlowDock`'s evaluation results for binding affinity prediction using the PDBBind dataset

```bash
python flowdock/eval.py data.test_datasets=[pdbbind] ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt trainer=gpu
python flowdock/eval.py data.test_datasets=[pdbbind] ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt trainer=gpu
... # re-run two more times to gather triplicate results
```
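
The triplicate protocol above can be scripted. This is a hedged sketch: `eval_command` only rebuilds the command shown above, `run_triplicate` repeats it, and `summarize` is a hypothetical helper that reports the mean and sample standard deviation of metric values collected by hand. How `eval.py` reports its metrics is not shown here, so no output parsing is attempted.

```python
import statistics
import subprocess
from typing import List, Sequence, Tuple

CKPT = "checkpoints/esmfold_prior_paper_weights-EMA.ckpt"


def eval_command(ckpt_path: str = CKPT) -> List[str]:
    """Argument list equivalent to the PDBBind evaluation command above."""
    return [
        "python",
        "flowdock/eval.py",
        "data.test_datasets=[pdbbind]",
        f"ckpt_path={ckpt_path}",
        "trainer=gpu",
    ]


def run_triplicate(runs: int = 3) -> None:
    """Run the evaluation `runs` times, stopping at the first failure."""
    for _ in range(runs):
        subprocess.run(eval_command(), check=True)


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-run metric values."""
    return statistics.mean(values), statistics.stdev(values)
```

Passing the arguments as a list avoids shell quoting issues with the square brackets in `data.test_datasets=[pdbbind]`.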

@@ -291,47 +302,55 @@ Download baseline method predictions and results

```bash
# cached predictions and evaluation metrics for reproducing structure prediction paper results
wget https://zenodo.org/records/14660031/files/alphafold3_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/alphafold3_baseline_method_predictions.tar.gz
tar -xzf alphafold3_baseline_method_predictions.tar.gz
rm alphafold3_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/chai_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/chai_baseline_method_predictions.tar.gz
tar -xzf chai_baseline_method_predictions.tar.gz
rm chai_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/diffdock_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/diffdock_baseline_method_predictions.tar.gz
tar -xzf diffdock_baseline_method_predictions.tar.gz
rm diffdock_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/dynamicbind_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/dynamicbind_baseline_method_predictions.tar.gz
tar -xzf dynamicbind_baseline_method_predictions.tar.gz
rm dynamicbind_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/flowdock_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/flowdock_baseline_method_predictions.tar.gz
tar -xzf flowdock_baseline_method_predictions.tar.gz
rm flowdock_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/flowdock_aft_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/flowdock_aft_baseline_method_predictions.tar.gz
tar -xzf flowdock_aft_baseline_method_predictions.tar.gz
rm flowdock_aft_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/flowdock_esmfold_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/flowdock_pft_baseline_method_predictions.tar.gz
tar -xzf flowdock_pft_baseline_method_predictions.tar.gz
rm flowdock_pft_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/15066450/files/flowdock_esmfold_baseline_method_predictions.tar.gz
tar -xzf flowdock_esmfold_baseline_method_predictions.tar.gz
rm flowdock_esmfold_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/flowdock_hp_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/flowdock_chai_baseline_method_predictions.tar.gz
tar -xzf flowdock_chai_baseline_method_predictions.tar.gz
rm flowdock_chai_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/15066450/files/flowdock_hp_baseline_method_predictions.tar.gz
tar -xzf flowdock_hp_baseline_method_predictions.tar.gz
rm flowdock_hp_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/neuralplexer_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/neuralplexer_baseline_method_predictions.tar.gz
tar -xzf neuralplexer_baseline_method_predictions.tar.gz
rm neuralplexer_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/vina_p2rank_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/vina_p2rank_baseline_method_predictions.tar.gz
tar -xzf vina_p2rank_baseline_method_predictions.tar.gz
rm vina_p2rank_baseline_method_predictions.tar.gz

wget https://zenodo.org/records/14660031/files/rfaa_baseline_method_predictions.tar.gz
wget https://zenodo.org/records/15066450/files/rfaa_baseline_method_predictions.tar.gz
tar -xzf rfaa_baseline_method_predictions.tar.gz
rm rfaa_baseline_method_predictions.tar.gz
```
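
The download, extract, and delete steps above repeat one pattern per method, so they can be expressed as a loop. The sketch below uses only the Python standard library; the method list and record ID are copied from the updated URLs in this hunk, while `archive_name`, `archive_url`, and `fetch_and_extract` are hypothetical helper names.

```python
import tarfile
import urllib.request
from pathlib import Path

ZENODO_RECORD = "15066450"  # record ID used by the updated URLs above
METHODS = [
    "alphafold3", "chai", "diffdock", "dynamicbind", "flowdock",
    "flowdock_aft", "flowdock_pft", "flowdock_esmfold", "flowdock_chai",
    "flowdock_hp", "neuralplexer", "vina_p2rank", "rfaa",
]


def archive_name(method: str) -> str:
    """File name of the prediction archive for one baseline method."""
    return f"{method}_baseline_method_predictions.tar.gz"


def archive_url(method: str, record: str = ZENODO_RECORD) -> str:
    """Zenodo download URL for one baseline method's archive."""
    return f"https://zenodo.org/records/{record}/files/{archive_name(method)}"


def fetch_and_extract(method: str, dest: str = ".") -> None:
    """Mirror the wget / tar -xzf / rm sequence for a single method."""
    archive = Path(dest) / archive_name(method)
    urllib.request.urlretrieve(archive_url(method), str(archive))
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    archive.unlink()  # delete the archive once it has been unpacked
```

Calling `fetch_and_extract(m)` for each `m` in `METHODS` would reproduce the commands above. On recent Python versions, passing `filter="data"` to `extractall` is the more cautious choice for archives fetched over the network.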
@@ -353,13 +372,13 @@ jupyter notebook notebooks/casp16_binding_affinity_prediction_results_plotting.i
For example, generate new protein-ligand complexes for a pair of protein sequence and ligand SMILES strings such as those of the PDBBind 2020 test target `6i67`

```bash
python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='YNKIVHLLVAEPEKIYAMPDPTVPDSDIKALTTLCDLADRELVVIIGWAKHIPGFSTLSLADQMSLLQSAWMEILILGVVYRSLFEDELVYADDYIMDEDQSKLAGLLDLNNAILQLVKKYKSMKLEKEEFVTLKAIALANSDSMHIEDVEAVQKLQDVLHEALQDYEAGQHMEDPRRAGKMLMTLPLLRQTSTKAVQHFYNKLEGKVPMHKLFLEMLEAKV' input_ligand='"c1cc2c(cc1O)CCCC2"' input_template=data/pdbbind/pdbbind_holo_aligned_esmfold_structures/6i67_holo_aligned_esmfold_protein.pdb sample_id='6i67' out_path='./6i67_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='YNKIVHLLVAEPEKIYAMPDPTVPDSDIKALTTLCDLADRELVVIIGWAKHIPGFSTLSLADQMSLLQSAWMEILILGVVYRSLFEDELVYADDYIMDEDQSKLAGLLDLNNAILQLVKKYKSMKLEKEEFVTLKAIALANSDSMHIEDVEAVQKLQDVLHEALQDYEAGQHMEDPRRAGKMLMTLPLLRQTSTKAVQHFYNKLEGKVPMHKLFLEMLEAKV' input_ligand='"c1cc2c(cc1O)CCCC2"' input_template=data/pdbbind/pdbbind_holo_aligned_esmfold_structures/6i67_holo_aligned_esmfold_protein.pdb sample_id='6i67' out_path='./6i67_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
```

Or, for example, generate new protein-ligand complexes for pairs of protein sequences and (multi-)ligand SMILES strings (delimited via `|`) such as those of the CASP15 target `T1152`

```bash
python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIPN' input_ligand='"CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)C(OC3OC(CO)C(O)C(O)C3NC(C)=O)C(O)C2NC(C)=O)C1O"' input_template=data/test_cases/predicted_structures/T1152.pdb sample_id='T1152' out_path='./T1152_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIPN' input_ligand='"CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)C(OC3OC(CO)C(O)C(O)C3NC(C)=O)C(O)C2NC(C)=O)C1O"' input_template=data/test_cases/predicted_structures/T1152.pdb sample_id='T1152' out_path='./T1152_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
```

If you do not already have a template protein structure available for your target of interest, set `input_template=null` to instead have the sampling script predict the ESMFold structure of your provided `input_protein` sequence before running the sampling pipeline. For more information regarding the input arguments available for sampling, please refer to the config at `configs/sample.yaml`.
@@ -369,7 +388,7 @@ If you do not already have a template protein structure available for your targe
For instance, one can perform batched prediction as follows:

```bash
python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling csv_path='./data/test_cases/prediction_inputs/flowdock_batched_inputs.csv' out_path='./T1152_batch_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=false auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling csv_path='./data/test_cases/prediction_inputs/flowdock_batched_inputs.csv' out_path='./T1152_batch_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=false auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
```

</details>
18 changes: 18 additions & 0 deletions configs/data/plinder.yaml
@@ -0,0 +1,18 @@
_target_: flowdock.data.plinder_datamodule.PlinderDataModule
data_dir: ${paths.data_dir}/PLINDER/
batch_size: 16 # Needs to be divisible by the number of devices (e.g., if in a distributed setup)
num_workers: 4
pin_memory: True
# overfitting arguments
overfitting_example_name: null # NOTE: currently not used
# model arguments
n_protein_patches: 96
n_lig_patches: 32
epoch_frac: 1.0
edge_crop_size: 400000
esm_version: ${model.cfg.protein_encoder.esm_version}
esm_repr_layer: ${model.cfg.protein_encoder.esm_repr_layer}
# general dataset arguments
plinder_offline: False
min_protein_length: 50
max_protein_length: 750
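
Two of the settings above imply simple checks: `batch_size` must be divisible by the number of devices, and proteins are bounded by `min_protein_length` and `max_protein_length`. The sketch below illustrates both under stated assumptions. It assumes `batch_size` is a global value split evenly across devices and that the length bounds are inclusive; neither is confirmed by this diff, and `per_device_batch_size` and `within_length_bounds` are hypothetical names, not functions from `PlinderDataModule`.

```python
def per_device_batch_size(batch_size: int, num_devices: int) -> int:
    """Split a global batch size evenly across devices, or fail loudly."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if batch_size % num_devices != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by num_devices={num_devices}"
        )
    return batch_size // num_devices


def within_length_bounds(n_residues: int, min_len: int = 50, max_len: int = 750) -> bool:
    """True when a protein's length falls inside the configured range (assumed inclusive)."""
    return min_len <= n_residues <= max_len


if __name__ == "__main__":
    # The default batch_size of 16 splits cleanly across 1, 2, 4, 8, or 16 devices.
    for devices in (1, 2, 4, 8, 16):
        print(devices, "device(s):", per_device_batch_size(16, devices), "per device")
```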