Commit 6bf96dc

Merge pull request #9 from BioinfoMachineLearning/0.0.3
Add revisions for `0.0.3`
2 parents e1d169d + 021e9b7 commit 6bf96dc

36 files changed: +33149 -470 lines changed

.env.example

Lines changed: 1 addition & 1 deletion
@@ -3,4 +3,4 @@
 # .env is loaded by train.py automatically
 # hydra allows you to reference variables in .yaml configs with special syntax: ${oc.env:MY_VAR}

-MY_VAR="/home/user/my/system/path"
+PLINDER_MOUNT="$(pwd)/data/PLINDER"
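
Whether the `$(pwd)` in the new `PLINDER_MOUNT` value is expanded depends on what reads the file: a shell that sources it runs the command substitution, whereas a dotenv-style parser may keep the literal text. The following is a minimal sketch, not part of this commit, of resolving such a value to an absolute path in Python. The helper name `resolve_plinder_mount` and its `cwd` parameter are hypothetical.

```python
import os
from typing import Optional


def resolve_plinder_mount(raw_value: str, cwd: Optional[str] = None) -> str:
    """Replace a literal `$(pwd)` with a working directory and normalize the path.

    Hypothetical helper: FlowDock's own handling of `.env` values is not shown here.
    """
    base = cwd if cwd is not None else os.getcwd()
    return os.path.normpath(raw_value.replace("$(pwd)", base))


if __name__ == "__main__":
    # Resolves against the current working directory when `cwd` is omitted.
    print(resolve_plinder_mount("$(pwd)/data/PLINDER"))
```

Passing `cwd` explicitly keeps the result independent of where the script is launched from, which may matter when training is started from a directory other than the repository root.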

.pre-commit-config.yaml

Lines changed: 8 additions & 8 deletions
@@ -18,18 +18,18 @@ repos:
       - id: check-toml
       - id: check-case-conflict
       - id: check-added-large-files
-        args: ["--maxkb=10000"]
+        args: ["--maxkb=20000"]

   # python code formatting
   - repo: https://github.com/psf/black
-    rev: 24.10.0
+    rev: 25.1.0
     hooks:
       - id: black
         args: [--line-length, "99"]

   # python import sorting
   - repo: https://github.com/PyCQA/isort
-    rev: 5.13.2
+    rev: 6.0.1
     hooks:
       - id: isort
         args: ["--profile", "black", "--filter-files"]
@@ -43,7 +43,7 @@ repos:

   # python docstring formatting
   - repo: https://github.com/myint/docformatter
-    rev: v1.7.5
+    rev: eb1df347edd128b30cd3368dddc3aa65edcfac38 # Don't autoupdate until https://github.com/PyCQA/docformatter/issues/293 is fixed
     hooks:
       - id: docformatter
         args:
@@ -73,7 +73,7 @@ repos:

   # python check (PEP8), programming errors and code complexity
   - repo: https://github.com/PyCQA/flake8
-    rev: 7.1.1
+    rev: 7.1.2
     hooks:
       - id: flake8
         args:
@@ -87,7 +87,7 @@ repos:

   # python security linter
   - repo: https://github.com/PyCQA/bandit
-    rev: "1.8.2"
+    rev: "1.8.3"
     hooks:
       - id: bandit
         args: ["-s", "B101"]
@@ -108,7 +108,7 @@ repos:

   # md formatting
   - repo: https://github.com/executablebooks/mdformat
-    rev: 0.7.21
+    rev: 0.7.22
     hooks:
       - id: mdformat
         args: ["--number"]
@@ -121,7 +121,7 @@ repos:

   # word spelling linter
   - repo: https://github.com/codespell-project/codespell
-    rev: v2.3.0
+    rev: v2.4.1
     hooks:
       - id: codespell
         args:

Dockerfile

Lines changed: 5 additions & 2 deletions
@@ -20,18 +20,21 @@ RUN mkdir -p /software/flowdock
 WORKDIR /software/flowdock

 ## Clone project
-RUN git clone https://github.com/BioinfoMachineLearning/FlowDock /software/flowdock
+RUN git clone https://github.com/BioinfoMachineLearning/FlowDock /software/flowdock

 ## Create conda environment
 # RUN conda env create -f environments/flowdock_environment.yaml
 COPY environments/flowdock_environment_docker.yaml /software/flowdock/environments/flowdock_environment_docker.yaml
 RUN conda env create -f environments/flowdock_environment_docker.yaml

+# Install ProDy without NumPy dependency
+RUN python -m pip install --no-cache-dir --no-dependencies prody==2.4.1
+
 ## Automatically activate conda environment
 RUN echo "source activate flowdock" >> /etc/profile.d/conda.sh && \
 echo "source /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
 echo "conda activate flowdock" >> ~/.bashrc

 ## Default shell and command
 SHELL ["/bin/bash", "-l", "-c"]
-CMD ["/bin/bash"]
+CMD ["/bin/bash"]

README.md

Lines changed: 43 additions & 24 deletions
@@ -12,7 +12,7 @@

 <!-- [![Conference](http://img.shields.io/badge/AnyConference-year-4b44ce.svg)](https://papers.nips.cc/paper/2020) -->

-[![Data DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.14660031.svg)](https://doi.org/10.5281/zenodo.14660031)
+[![Data DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.15066450.svg)](https://doi.org/10.5281/zenodo.15066450)

 <img src="./img/FlowDock.png" width="600">

@@ -76,6 +76,7 @@ cd FlowDock
 mamba env create -f environments/flowdock_environment.yaml
 conda activate FlowDock # NOTE: one still needs to use `conda` to (de)activate environments
 pip3 install -e . # install local project as package
+pip3 install prody==2.4.1 --no-dependencies # install ProDy without NumPy dependency
 ```

 Download checkpoints
@@ -91,7 +92,7 @@ cd ../

 ```bash
 # pretrained FlowDock weights
-wget https://zenodo.org/records/14660031/files/flowdock_checkpoints.tar.gz
+wget https://zenodo.org/records/15066450/files/flowdock_checkpoints.tar.gz
 tar -xzf flowdock_checkpoints.tar.gz
 rm flowdock_checkpoints.tar.gz
 ```
@@ -105,19 +106,19 @@ tar -xzf flowdock_data_cache.tar.gz
 rm flowdock_data_cache.tar.gz

 # cached data for PDBBind, Binding MOAD, DockGen, and the PDB-based van der Mers (vdM) dataset
-wget https://zenodo.org/records/14660031/files/flowdock_pdbbind_data.tar.gz
+wget https://zenodo.org/records/15066450/files/flowdock_pdbbind_data.tar.gz
 tar -xzf flowdock_pdbbind_data.tar.gz
 rm flowdock_pdbbind_data.tar.gz

-wget https://zenodo.org/records/14660031/files/flowdock_moad_data.tar.gz
+wget https://zenodo.org/records/15066450/files/flowdock_moad_data.tar.gz
 tar -xzf flowdock_moad_data.tar.gz
 rm flowdock_moad_data.tar.gz

-wget https://zenodo.org/records/14660031/files/flowdock_dockgen_data.tar.gz
+wget https://zenodo.org/records/15066450/files/flowdock_dockgen_data.tar.gz
 tar -xzf flowdock_dockgen_data.tar.gz
 rm flowdock_dockgen_data.tar.gz

-wget https://zenodo.org/records/14660031/files/flowdock_pdbsidechain_data.tar.gz
+wget https://zenodo.org/records/15066450/files/flowdock_pdbsidechain_data.tar.gz
 tar -xzf flowdock_pdbsidechain_data.tar.gz
 rm flowdock_pdbsidechain_data.tar.gz
 ```
@@ -129,7 +130,7 @@ rm flowdock_pdbsidechain_data.tar.gz
 <details>

 **NOTE:** The following steps (besides downloading PDBBind and Binding MOAD's PDB files) are only necessary if one wants to fully process each of the following datasets manually.
-Otherwise, preprocessed versions of each dataset can be found on [Zenodo](https://zenodo.org/records/14660031).
+Otherwise, preprocessed versions of each dataset can be found on [Zenodo](https://zenodo.org/records/15066450).

 Download data

@@ -159,6 +160,16 @@ mv pdb_2021aug02/ pdbsidechain/
 cd ../
 ```

+Lastly, to finetune `FlowDock` using the `PLINDER` dataset, one must first prepare this data for training
+
+```bash
+# fetch PLINDER data (NOTE: requires ~1 hour to download and ~750G of storage)
+export PLINDER_MOUNT="$(pwd)/data/PLINDER"
+mkdir -p "$PLINDER_MOUNT" # create the directory if it doesn't exist
+
+plinder_download -y
+```
+
 ### Generating ESM2 embeddings for each protein (optional, cached input data available on SharePoint)

 To generate the ESM2 embeddings for the protein inputs,
@@ -260,10 +271,10 @@ python flowdock/train.py experiment=flowdock_fm
 python flowdock/train.py experiment=flowdock_fm trainer.max_epochs=20 data.batch_size=8
 ```

-For example, override parameters to finetune `FlowDock`'s pretrained weights using a new dataset
+For example, override parameters to finetune `FlowDock`'s pretrained weights using a new dataset such as [PLINDER](https://www.plinder.sh/)

 ```bash
-python flowdock/train.py experiment=flowdock_fm data=my_new_datamodule ckpt_path=checkpoints/esmfold_prior_paper_weights.ckpt
+python flowdock/train.py experiment=flowdock_fm data=plinder ckpt_path=checkpoints/esmfold_prior_paper_weights.ckpt
 ```

 </details>
@@ -277,7 +288,7 @@ To reproduce `FlowDock`'s evaluation results for structure prediction, please re
 To reproduce `FlowDock`'s evaluation results for binding affinity prediction using the PDBBind dataset

 ```bash
-python flowdock/eval.py data.test_datasets=[pdbbind] ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt trainer=gpu
+python flowdock/eval.py data.test_datasets=[pdbbind] ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt trainer=gpu
 ... # re-run two more times to gather triplicate results
 ```

@@ -291,47 +302,55 @@ Download baseline method predictions and results

 ```bash
 # cached predictions and evaluation metrics for reproducing structure prediction paper results
-wget https://zenodo.org/records/14660031/files/alphafold3_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/alphafold3_baseline_method_predictions.tar.gz
 tar -xzf alphafold3_baseline_method_predictions.tar.gz
 rm alphafold3_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/chai_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/chai_baseline_method_predictions.tar.gz
 tar -xzf chai_baseline_method_predictions.tar.gz
 rm chai_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/diffdock_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/diffdock_baseline_method_predictions.tar.gz
 tar -xzf diffdock_baseline_method_predictions.tar.gz
 rm diffdock_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/dynamicbind_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/dynamicbind_baseline_method_predictions.tar.gz
 tar -xzf dynamicbind_baseline_method_predictions.tar.gz
 rm dynamicbind_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/flowdock_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/flowdock_baseline_method_predictions.tar.gz
 tar -xzf flowdock_baseline_method_predictions.tar.gz
 rm flowdock_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/flowdock_aft_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/flowdock_aft_baseline_method_predictions.tar.gz
 tar -xzf flowdock_aft_baseline_method_predictions.tar.gz
 rm flowdock_aft_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/flowdock_esmfold_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/flowdock_pft_baseline_method_predictions.tar.gz
+tar -xzf flowdock_pft_baseline_method_predictions.tar.gz
+rm flowdock_pft_baseline_method_predictions.tar.gz
+
+wget https://zenodo.org/records/15066450/files/flowdock_esmfold_baseline_method_predictions.tar.gz
 tar -xzf flowdock_esmfold_baseline_method_predictions.tar.gz
 rm flowdock_esmfold_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/flowdock_hp_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/flowdock_chai_baseline_method_predictions.tar.gz
+tar -xzf flowdock_chai_baseline_method_predictions.tar.gz
+rm flowdock_chai_baseline_method_predictions.tar.gz
+
+wget https://zenodo.org/records/15066450/files/flowdock_hp_baseline_method_predictions.tar.gz
 tar -xzf flowdock_hp_baseline_method_predictions.tar.gz
 rm flowdock_hp_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/neuralplexer_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/neuralplexer_baseline_method_predictions.tar.gz
 tar -xzf neuralplexer_baseline_method_predictions.tar.gz
 rm neuralplexer_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/vina_p2rank_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/vina_p2rank_baseline_method_predictions.tar.gz
 tar -xzf vina_p2rank_baseline_method_predictions.tar.gz
 rm vina_p2rank_baseline_method_predictions.tar.gz

-wget https://zenodo.org/records/14660031/files/rfaa_baseline_method_predictions.tar.gz
+wget https://zenodo.org/records/15066450/files/rfaa_baseline_method_predictions.tar.gz
 tar -xzf rfaa_baseline_method_predictions.tar.gz
 rm rfaa_baseline_method_predictions.tar.gz
 ```
@@ -353,13 +372,13 @@ jupyter notebook notebooks/casp16_binding_affinity_prediction_results_plotting.i
 For example, generate new protein-ligand complexes for a pair of protein sequence and ligand SMILES strings such as those of the PDBBind 2020 test target `6i67`

 ```bash
-python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='YNKIVHLLVAEPEKIYAMPDPTVPDSDIKALTTLCDLADRELVVIIGWAKHIPGFSTLSLADQMSLLQSAWMEILILGVVYRSLFEDELVYADDYIMDEDQSKLAGLLDLNNAILQLVKKYKSMKLEKEEFVTLKAIALANSDSMHIEDVEAVQKLQDVLHEALQDYEAGQHMEDPRRAGKMLMTLPLLRQTSTKAVQHFYNKLEGKVPMHKLFLEMLEAKV' input_ligand='"c1cc2c(cc1O)CCCC2"' input_template=data/pdbbind/pdbbind_holo_aligned_esmfold_structures/6i67_holo_aligned_esmfold_protein.pdb sample_id='6i67' out_path='./6i67_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
+python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='YNKIVHLLVAEPEKIYAMPDPTVPDSDIKALTTLCDLADRELVVIIGWAKHIPGFSTLSLADQMSLLQSAWMEILILGVVYRSLFEDELVYADDYIMDEDQSKLAGLLDLNNAILQLVKKYKSMKLEKEEFVTLKAIALANSDSMHIEDVEAVQKLQDVLHEALQDYEAGQHMEDPRRAGKMLMTLPLLRQTSTKAVQHFYNKLEGKVPMHKLFLEMLEAKV' input_ligand='"c1cc2c(cc1O)CCCC2"' input_template=data/pdbbind/pdbbind_holo_aligned_esmfold_structures/6i67_holo_aligned_esmfold_protein.pdb sample_id='6i67' out_path='./6i67_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
 ```

 Or, for example, generate new protein-ligand complexes for pairs of protein sequences and (multi-)ligand SMILES strings (delimited via `|`) such as those of the CASP15 target `T1152`

 ```bash
-python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIPN' input_ligand='"CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)C(OC3OC(CO)C(O)C(O)C3NC(C)=O)C(O)C2NC(C)=O)C1O"' input_template=data/test_cases/predicted_structures/T1152.pdb sample_id='T1152' out_path='./T1152_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
+python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIPN' input_ligand='"CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)C(OC3OC(CO)C(O)C(O)C3NC(C)=O)C(O)C2NC(C)=O)C1O"' input_template=data/test_cases/predicted_structures/T1152.pdb sample_id='T1152' out_path='./T1152_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
 ```

 If you do not already have a template protein structure available for your target of interest, set `input_template=null` to instead have the sampling script predict the ESMFold structure of your provided `input_protein` sequence before running the sampling pipeline. For more information regarding the input arguments available for sampling, please refer to the config at `configs/sample.yaml`.
@@ -369,7 +388,7 @@ If you do not already have a template protein structure available for your targe
 For instance, one can perform batched prediction as follows:

 ```bash
-python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling csv_path='./data/test_cases/prediction_inputs/flowdock_batched_inputs.csv' out_path='./T1152_batch_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=false auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
+python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling csv_path='./data/test_cases/prediction_inputs/flowdock_batched_inputs.csv' out_path='./T1152_batch_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=false auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
 ```

 </details>
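
The README's sampling commands pass multi-chain receptors and (multi-)ligand SMILES as single strings delimited via `|`, with the ligand value wrapped in an extra pair of double quotes. How `flowdock/sample.py` parses these arguments is not shown in this commit, so the following is only a sketch of one plausible way to split such inputs. The function name `split_delimited_input` is hypothetical.

```python
from typing import List


def split_delimited_input(value: str, delimiter: str = "|") -> List[str]:
    """Split a `|`-delimited input string into its individual entries.

    Hypothetical helper for illustration; not taken from the FlowDock code base.
    """
    cleaned = value.strip()
    # Drop one layer of surrounding double quotes, as used around SMILES strings.
    if len(cleaned) >= 2 and cleaned[0] == '"' and cleaned[-1] == '"':
        cleaned = cleaned[1:-1]
    entries = [entry.strip() for entry in cleaned.split(delimiter)]
    if any(not entry for entry in entries):
        raise ValueError("empty entry found in delimited input")
    return entries


if __name__ == "__main__":
    chains = split_delimited_input("MYTVKPGD|MYTVKPGD|MYTVKPGDN")
    print(len(chains), "chains")
```

Rejecting empty entries up front surfaces a stray leading or trailing `|` before any expensive structure prediction is attempted.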

configs/data/plinder.yaml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+_target_: flowdock.data.plinder_datamodule.PlinderDataModule
+data_dir: ${paths.data_dir}/PLINDER/
+batch_size: 16 # Needs to be divisible by the number of devices (e.g., if in a distributed setup)
+num_workers: 4
+pin_memory: True
+# overfitting arguments
+overfitting_example_name: null # NOTE: currently not used
+# model arguments
+n_protein_patches: 96
+n_lig_patches: 32
+epoch_frac: 1.0
+edge_crop_size: 400000
+esm_version: ${model.cfg.protein_encoder.esm_version}
+esm_repr_layer: ${model.cfg.protein_encoder.esm_repr_layer}
+# general dataset arguments
+plinder_offline: False
+min_protein_length: 50
+max_protein_length: 750
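
The `batch_size` comment in `configs/data/plinder.yaml` says the value needs to be divisible by the number of devices. A minimal sketch of that check follows, assuming the global batch is split evenly across devices; the commit itself does not show where or how FlowDock enforces this, and `per_device_batch_size` is a hypothetical name.

```python
def per_device_batch_size(batch_size: int, num_devices: int) -> int:
    """Return the per-device batch size, or raise if the split is uneven.

    Assumption: the configured `batch_size` is a global value divided across devices.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if batch_size % num_devices != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by num_devices={num_devices}"
        )
    return batch_size // num_devices


if __name__ == "__main__":
    # With the default of 16 from plinder.yaml and, say, 4 devices.
    print(per_device_batch_size(16, 4))
```

Under this assumption the default of 16 splits evenly across 1, 2, 4, 8, or 16 devices, while a 3- or 6-device setup would need a different `data.batch_size` override.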
