Commit cf3c29b: Simplify README.md and environment.yaml

1 parent 3042a26

File tree: README.md, environment.yaml

2 files changed (+11, -134 lines)

Diff for: README.md (+11, -133)
@@ -11,6 +11,8 @@

 **Update**: Also consider checking out our new diffusion generative model, GCDM, that uses GCPNet to improve equivariant diffusion models for 3D molecule generation in multiple ways. The GCDM [GitHub](https://github.com/BioinfoMachineLearning/bio-diffusion) and [paper](https://arxiv.org/abs/2302.04313).

+**Update**: Also consider checking out the new ProteinWorkshop benchmark which features GCPNet as a state-of-the-art geometric graph neural network for representation learning of 3D protein structures. [GitHub](https://github.com/a-r-j/ProteinWorkshop).
+
 ![GCP_Architecture.png](./img/GCPNet.png)

 </div>
@@ -22,10 +24,9 @@ A PyTorch implementation of Geometry-Complete SE(3)-Equivariant Perceptron Netwo
 <details open><summary><b>Table of contents</b></summary>

 - [Creating a Virtual Environment](#virtual-environment-creation)
-- [GCPNet Foundational Tasks and Models](#gcpnet-foundational)
-- [Model Training](#gcpnet-foundational-training)
-- [Model Evaluation](#gcpnet-foundational-evaluation)
-- [GCPNet for Protein Structure EMA (GCPNet-EMA)](#gcpnet-ema)
+- [GCPNet Tasks and Models](#gcpnet)
+- [Model Training](#gcpnet-training)
+- [Model Evaluation](#gcpnet-evaluation)
 - [Acknowledgements](#acknowledgements)
 - [Citations](#citations)
 </details>
@@ -56,9 +57,9 @@ conda activate gcpnet # note: one still needs to use `conda` to (de)activate en
 pip3 install -e .
 ```

-## GCPNet Foundational Tasks and Models <a name="gcpnet-foundational"></a>
+## GCPNet Tasks and Models <a name="gcpnet"></a>

-Download data for foundational tasks
+Download data for tasks
 ```bash
 # initialize data directory structure
 mkdir -p data
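Note (not part of this commit's diff): the hunk above shows only the tail of the environment-setup instructions. A minimal sketch of the full flow implied by this section is given below; the assumption that the environment is created from the repository's `environment.yaml` with `mamba` comes from that file's presence in this commit, not from README lines shown here.

```bash
# Minimal setup sketch, assuming the environment is created from the repo's
# environment.yaml via mamba (the exact command in the unshown README lines may differ).
mamba env create -f environment.yaml
conda activate gcpnet   # note: `conda` is still used to (de)activate environments

pip3 install -e .       # install the project in editable mode, as shown in the diff above

mkdir -p data           # initialize the data directory structure before fetching task data
```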
@@ -81,25 +82,7 @@ navigating to https://figshare.com/s/e23be65a884ce7fc8543 and downloading the th

 **Note**: The ATOM3D datasets (i.e., the LBA and PSR datasets) as well as the CATH dataset we use will automatically be downloaded during execution of `src/train.py` or `src/eval.py` if they have not already been downloaded. However, data for the NMS and RS tasks must be downloaded manually.

-**Another Note**: TM-score and MolProbity are required to score protein structures, where one can install them as follows:
-```bash
-# download and compile TM-score
-mkdir -p ~/Programs && cd ~/Programs
-wget https://zhanggroup.org/TM-score/TMscore.cpp
-g++ -static -O3 -ffast-math -lm -o TMscore TMscore.cpp
-rm TMscore.cpp
-
-# download and compile MolProbity
-# note: beforehand, if not already installed within the `gcpnet` environment by `mamba`, make sure `svn` is installed locally using e.g., `apt install subversion` or `yum install subversion`
-mkdir -p ~/Programs/MolProbity && cd ~/Programs/MolProbity
-wget https://raw.githubusercontent.com/rlabduke/MolProbity/master/install_via_bootstrap.sh
-conda activate gcpnet # ensure the `gcpnet` Conda environment is activated for installation
-bash install_via_bootstrap.sh 4 # note: `4` here indicates the number of processes to run in parallel for faster installation
-bash molprobity/setup.sh # note: this command will likely fail due to not being run inside a GUI, but nonetheless installation should now be completed
-```
-Make sure to update the `tmscore_exec_path` and `molprobity_exec_path` values in e.g., `configs/paths/default.yaml` to reflect where you have placed the TM-score and MolProbity executables on your machine. Also, make sure that `lddt_exec_path` points to the `bin/lddt` path within your `gcpnet` Conda environment, where `lddt` is installed automatically as described in `environment.yaml`.
-
-## How to train foundational models <a name="gcpnet-foundational-training"></a>
+## How to train models <a name="gcpnet-training"></a>

 Train model with default configuration

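Note (not part of this commit's diff): the removed note above asks users to point `tmscore_exec_path`, `molprobity_exec_path`, and `lddt_exec_path` in `configs/paths/default.yaml` at local executables. For anyone still on a pre-commit revision that needs TM-score and MolProbity, a hedged sketch of that check is shown below; the idea that these keys can be overridden on the command line under a `paths.` Hydra group is inferred only from the config file's location and is an assumption.

```bash
# Hedged sketch: verify the executables referenced by the removed note, then
# (assuming the keys live under a Hydra `paths` config group) override them at run time.
ls ~/Programs/TMscore              # TM-score binary built per the removed instructions
ls ~/Programs/MolProbity           # MolProbity install directory (exact executable path may vary)
ls "$CONDA_PREFIX"/bin/lddt        # lddt inside the activated `gcpnet` environment

# Hypothetical overrides -- key names come from the removed note; the `paths.` prefix is an assumption.
python3 src/train.py \
  paths.tmscore_exec_path="$HOME"/Programs/TMscore \
  paths.molprobity_exec_path="$HOME"/Programs/MolProbity \
  paths.lddt_exec_path="$CONDA_PREFIX"/bin/lddt
```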
@@ -159,7 +142,7 @@ _**New**_: For tasks that may benefit from it, you can now enable E(3) equivaria
 python3 src/train.py model.module_cfg.enable_e3_equivariance=true
 ```

-## How to evaluate foundational models <a name="gcpnet-foundational-evaluation"></a>
+## How to evaluate models <a name="gcpnet-evaluation"></a>
 Reproduce our results for the LBA task

 ```bash
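Note (not part of this commit's diff): the `enable_e3_equivariance` override above composes with other Hydra-style options on the same command line. In the sketch below, only the equivariance flag is taken from the diff; the `experiment=...` pattern is borrowed from elsewhere in this README (with a placeholder name), and `trainer.max_epochs` is an assumed lightning-hydra-template option.

```bash
# Hedged sketch of composing Hydra overrides on the training entry point.
# Only `model.module_cfg.enable_e3_equivariance=true` is taken from the diff above;
# the experiment name is a placeholder and `trainer.max_epochs` is an assumed option.
python3 src/train.py \
  experiment=my_experiment.yaml \
  model.module_cfg.enable_e3_equivariance=true \
  trainer.max_epochs=100
```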
@@ -339,114 +322,19 @@ CPD Model
 └──────────────────────────────┴──────────────────────────────┴──────────────────────────────┴──────────────────────────────┘
 ```

-## GCPNet-EMA <a name="gcpnet-ema"></a>
-**Note**: Make sure the `gcpnet` Mamba environment has previously been created as outlined above in the section [Creating a Virtual Environment](#virtual-environment-creation).
-
-Download training and evaluation data
-
-```bash
-cd data/EQ/
-wget https://zenodo.org/record/8150859/files/ema_decoy_model.tar.gz
-wget https://zenodo.org/record/8150859/files/ema_true_model.tar.gz
-tar -xzf ema_decoy_model.tar.gz
-tar -xzf ema_true_model.tar.gz
-cd ../../ # head back to the root project directory
-```
-
-Train a model for the estimation of protein structure model accuracy (**EMA**) task (A.K.A. equivariant quality (EQ) assessment of protein structures)
-
-```bash
-python3 src/train.py experiment=gcpnet_eq.yaml
-```
-
-Reproduce our results for the EMA task
-
-```bash
-eq_model_1_ckpt_path="checkpoints/EQ/model_1_epoch_53_per_res_pearson_0_7443.ckpt"
-eq_model_2_ckpt_path="checkpoints/EQ/model_2_epoch_25_per_res_pearson_0_7426.ckpt"
-eq_model_3_ckpt_path="checkpoints/EQ/model_3_epoch_14_per_res_pearson_0_7133.ckpt"
-
-python3 src/eval.py datamodule=eq model=gcpnet_eq logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$eq_model_1_ckpt_path"
-python3 src/eval.py datamodule=eq model=gcpnet_eq logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$eq_model_2_ckpt_path"
-python3 src/eval.py datamodule=eq model=gcpnet_eq logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$eq_model_3_ckpt_path"
-```
-
-```bash
-EMA Model 1
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
-┃ Test metric ┃ DataLoader 0 ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
-│ test/PerModelMAE │ 0.04894806072115898 │
-│ test/PerModelMSE │ 0.004262289963662624 │
-│ test/PerModelPearsonCorrCoef │ 0.8362738490104675 │
-│ test/PerResidueMAE │ 0.06654192507266998 │
-│ test/PerResidueMSE │ 0.009298641234636307 │
-│ test/PerResiduePearsonCorrCoef │ 0.7442569732666016 │
-│ test/loss │ 0.005294517148286104 │
-└────────────────────────────────┴────────────────────────────────┘
-
-EMA Model 2
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
-┃ Test metric ┃ DataLoader 0 ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
-│ test/PerModelMAE │ 0.04955434799194336 │
-│ test/PerModelMSE │ 0.004251933190971613 │
-│ test/PerModelPearsonCorrCoef │ 0.841285228729248 │
-│ test/PerResidueMAE │ 0.06787651032209396 │
-│ test/PerResidueMSE │ 0.009320290759205818 │
-│ test/PerResiduePearsonCorrCoef │ 0.7426220774650574 │
-│ test/loss │ 0.005294565111398697 │
-└────────────────────────────────┴────────────────────────────────┘
-
-EMA Model 3
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
-┃ Test metric ┃ DataLoader 0 ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
-│ test/PerModelMAE │ 0.05056113377213478 │
-│ test/PerModelMSE │ 0.004722739569842815 │
-│ test/PerModelPearsonCorrCoef │ 0.8154276013374329 │
-│ test/PerResidueMAE │ 0.07143402099609375 │
-│ test/PerResidueMSE │ 0.01017170213162899 │
-│ test/PerResiduePearsonCorrCoef │ 0.7132763266563416 │
-│ test/loss │ 0.005769775714725256 │
-└────────────────────────────────┴────────────────────────────────┘
-```
-
-Predict per-residue and per-model lDDT scores for computationally-predicted (e.g., AlphaFold 2) protein structure decoys
-
-```bash
-eq_model_ckpt_path="checkpoints/EQ/model_1_epoch_53_per_res_pearson_0_7443.ckpt"
-predict_batch_size=1 # adjust as desired according to available GPU memory
-num_workers=0 # note: required when initially processing new PDB file inputs, due to ESM's GPU usage
-
-python3 src/predict.py model=gcpnet_eq datamodule=eq datamodule.predict_input_dir=$MY_INPUT_PDB_DIR datamodule.predict_true_dir=$MY_OPTIONAL_TRUE_PDB_DIR datamodule.predict_output_dir=$MY_OUTPUTS_DIR datamodule.predict_batch_size=$predict_batch_size datamodule.num_workers=$num_workers logger=csv trainer.accelerator=gpu trainer.devices=1 ckpt_path="$eq_model_ckpt_path"
-```
-
-For example, one can predict per-residue and per-model lDDT scores for a batch of tertiary protein structure inputs, `6W6VE.pdb` and `6W77K.pdb` within `data/EQ/examples/decoy_model`, as follows
-
-```bash
-python3 src/predict.py model=gcpnet_eq datamodule=eq datamodule.predict_input_dir=data/EQ/examples/decoy_model datamodule.predict_output_dir=data/EQ/examples/outputs datamodule.predict_batch_size=1 datamodule.num_workers=0 datamodule.python_exec_path="$HOME"/mambaforge/envs/gcpnet/bin/python datamodule.lddt_exec_path="$HOME"/mambaforge/envs/gcpnet/bin/lddt datamodule.pdbtools_dir="$HOME"/mambaforge/envs/gcpnet/lib/python3.9/site-packages/pdbtools/ logger=csv trainer.accelerator=gpu trainer.devices=[0] ckpt_path=checkpoints/EQ/model_1_epoch_53_per_res_pearson_0_7443.ckpt
-```
-
-**Note**: After running the above command, an output CSV containing metadata for the predictions will be located at `logs/predict/runs/YYYY-MM-DD_HH-MM-SS/predict_YYYYMMDD_HHMMSS_rank_0_predictions.csv`, with text substitutions for the time at which the above command was completed. This CSV will contain a column called `predicted_annotated_pdb_filepath` that identifies the temporary location of each input PDB file after annotating it with GCPNet-EMA's predicted lDDT scores for each residue. If a directory containing ground-truth PDB files corresponding one-to-one with the inputs in `datamodule.predict_input_dir` is provided as `datamodule.predict_true_dir`, then metrics and PDB annotation filepaths will also be reported in the output CSV to quantitatively and qualitatively describe how well GCPNet-EMA was able to improve upon AlphaFold's initial per-residue plDDT values.
-
 ## Acknowledgements <a name="acknowledgements"></a>

-GCPNet foundational models build upon the source code and data from the following projects:
+GCPNet builds upon the source code and data from the following projects:
 * [ClofNet](https://github.com/mouthful/ClofNet)
 * [GBPNet](https://github.com/sarpaykent/GBPNet)
 * [gvp-pytorch](https://github.com/drorlab/gvp-pytorch)
 * [lightning-hydra-template](https://github.com/ashleve/lightning-hydra-template)

-GCPNet-EMA builds upon the source code and data from the following project(s):
-* [EnQA](https://github.com/BioinfoMachineLearning/EnQA)
-
 We thank all their contributors and maintainers!

-
 ## Citing this work <a name="citations"></a>

-If you use the code or data associated with the GCPNet foundational models within this package or otherwise find such work useful, please cite:
+If you use the code or data associated with the GCPNet models within this package or otherwise find such work useful, please cite:

 ```bibtex
 @article{morehead2023gcpnet,
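Note (not part of this commit's diff): the removed **Note** above describes a predictions CSV with a `predicted_annotated_pdb_filepath` column. A hedged sketch for pulling those paths out of the CSV is shown below; it assumes the column values are plain, unquoted CSV fields, and the timestamped path must be replaced with the one from your actual run.

```bash
# Hedged sketch: list the annotated PDB files referenced by the predictions CSV.
# Replace the timestamp placeholders with those of your actual run.
predictions_csv="logs/predict/runs/YYYY-MM-DD_HH-MM-SS/predict_YYYYMMDD_HHMMSS_rank_0_predictions.csv"

# Locate the `predicted_annotated_pdb_filepath` column by its header name and print its values
# (assumes the CSV fields are not quoted and contain no embedded commas).
awk -F',' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "predicted_annotated_pdb_filepath") col = i }
  NR > 1 && col { print $col }
' "$predictions_csv"
```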
@@ -455,14 +343,4 @@ If you use the code or data associated with the GCPNet foundational models withi
 journal={AAAI Workshop on Deep Learning on Graphs: Methods and Applications},
 year={2023}
 }
-```
-
-If you use the code or data associated with the GCPNet-EMA model within this package or otherwise find such work useful, please cite:
-
-```bibtex
-@article{morehead2023gcpnet_ema,
-title={Protein Structure Accuracy Estimation using Geometry-Complete Perceptron Networks},
-author={Morehead, Alex and Cheng, Jianlin},
-year={2023}
-}
 ```

Diff for: environment.yaml (-1)
@@ -82,7 +82,6 @@ dependencies:
 - lame=3.100=h7f98852_1001
 - lcms2=2.15=hfd0df8a_0
 - ld_impl_linux-64=2.40=h41732ed_0
-- lddt=2.2=h9ee0642_0
 - lerc=4.0.0=h27087fc_0
 - libabseil=20230125.3=cxx17_h59595ed_0
 - libapr=1.7.0=h7f98852_5
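Note (not part of this commit's diff): the only change to `environment.yaml` is dropping the pinned `lddt` package that the removed GCPNet-EMA instructions relied on for lDDT scoring. If an existing workflow still needs that binary, a hedged sketch for restoring it is below; it assumes the pinned package remains resolvable from the channels already listed in `environment.yaml`.

```bash
# Hedged sketch: reinstall the lddt binary that this commit removes from environment.yaml,
# assuming the pinned package is still resolvable from the channels the file already lists.
mamba install -n gcpnet lddt=2.2

# Confirm the location the old README note expected (`bin/lddt` inside the gcpnet environment).
conda run -n gcpnet which lddt
```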
