Commit ccd3704

Update README
1 parent dc9e7c7 commit ccd3704

File tree

5 files changed: +349 −48 lines changed


.Rprofile (−1 line)

```diff
@@ -1,2 +1 @@
 reticulate::use_condaenv("VitalWikiClassifier")
-#reticulate::use_python("~/anaconda3/bin/python")
```

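The `.Rprofile` change pins R's reticulate to the `VitalWikiClassifier` conda environment. On the Python side, a quick sanity check can confirm the scripts run in that same environment; this is a minimal sketch (the helper name is illustrative, not part of the repository):

```python
import os
import sys

def in_conda_env(name: str) -> bool:
    """True if this interpreter was launched from the named conda env.

    `conda activate` exports CONDA_DEFAULT_ENV; outside any conda
    environment the variable is unset and the check fails.
    """
    return os.environ.get("CONDA_DEFAULT_ENV") == name

# Warn early if a script is started outside the project environment.
if not in_conda_env("VitalWikiClassifier"):
    print(f"Warning: not in the VitalWikiClassifier env; "
          f"interpreter is {sys.executable}")
```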
.idea/workspace.xml (+2 −9)

Some generated files are not rendered by default.

README.md (+41 −17)
```diff
@@ -1,15 +1,38 @@
 # PatternRecognition
-Project files for pattern recognition group assignment
 
-## Files
+Project files for the pattern recognition group assignment. Our project is about the classification of Wikipedia articles into one of the 11 top-level categories of the [Vital Articles Wikipedia list, level 4](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/4).
+
+The vital articles list was downloaded using scrapy. The code can be found in the [WikiVitalArticles](https://github.com/JasperHG90/WikiVitalArticles) repository. The raw data is included in this repository.
+
+Given the different skill sets in our group, we use a mix of R, Python, Keras and PyTorch to build our models. However, we make sure that each model uses the same train/test splits.
 
-Currently contains the following files:
+Our presentation can be found here. Our final paper can be found here.
 
-1. `data/WikiEssentials_L4.7z`: output file of the WikiVitalArticles program. Each document is included in its entirety (but split by paragraph).
-1. `preprocess_utils.py`: preprocessing functions for Wiki data.
-2. `model_utils.py`: various utility functions used for modeling (e.g. loading embeddings).
-3. `1_preprocess_raw_data.py`: preprocessing of raw input data. Currently shortens each article to first 8 sentences.
-4. `2_baseline_model.py`: tokenization, vectorization of input data and baseline model (1-layer NN with softmax classifier).
+## Files
+
+Currently contains the following files (not all files are listed here):
+
+1. `data/raw/WikiEssentials_L4.7z`: output file of the WikiVitalArticles program. Each document is included in its entirety (but split by paragraph).
+2. `preprocess_utils.py`: preprocessing functions for Wiki data.
+3. `model_utils.py`: various utility functions used for modeling (e.g. loading embeddings).
+4. `1_preprocess_raw_data.py`: preprocessing of raw input data. Currently shortens each article to its first 8 sentences.
+5. `2_baseline_model.py`: PyTorch implementation of the baseline model (1-layer NN with softmax classifier).
+6. `3_cnn_model.R`: Keras implementation of a 1D convolutional neural network.
+7. `4_lstm_model.py`: PyTorch implementation of a Long Short-Term Memory (LSTM) network.
+8. `5_han_model.py`: PyTorch implementation of a Hierarchical Attention Network (HAN).
+9. `6_statistical_test.R`: contains R code to perform the Stuart-Maxwell test on the classification outcomes.
+10. `HAN.py`: contains the PyTorch module implementation of the HAN.
+11. `LSTM.py`: contains the PyTorch module implementation of the LSTM.
+
+It contains the following folders:
+
+1. `data`: contains raw and pre-processed data used by the models. To understand the pipeline from raw to preprocessed data, see the `preprocess_utils.py` file.
+2. `embeddings`: folder in which FastText embeddings should be downloaded and unzipped.
+3. `img`: contains images.
+4. `model_cnn`: final model for the convolutional neural network after hyperparameter optimization.
+5. `models`: final PyTorch model weights for the baseline, HAN and LSTM.
+6. `predictions`: CSV files containing the predictions and ground-truth labels for each model.
+7. `results`: CSV files containing the results of the hyperparameter search we conducted using [Hyperopt](https://github.com/hyperopt/hyperopt).
 
 ## Setup
 
```
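The README states that `1_preprocess_raw_data.py` shortens each article to its first 8 sentences. A minimal sketch of that idea, assuming a naive regex sentence splitter (the function name and splitting rule are illustrative, not the repository's actual code, which may use a proper sentence tokenizer):

```python
import re

def truncate_article(text: str, max_sentences: int = 8) -> str:
    """Keep only the first `max_sentences` sentences of an article.

    Splits on whitespace that follows sentence-ending punctuation;
    a real pipeline would likely use a trained sentence tokenizer.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])

article = "One. Two. Three. Four. Five. Six. Seven. Eight. Nine. Ten."
print(truncate_article(article))  # keeps sentences one through eight
```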
````diff
@@ -21,18 +44,19 @@ Currently contains the following files:
 conda env create -f environment.yml
 ```
 
-4. Install PyTorch with cuda 9.2 support
-
-```shell
-conda activate VitalWikiClassifier
-conda install pytorch torchvision cudatoolkit=9.2 -c pytorch -c defaults -c numba/label/dev
-```
+Note that this will install both the Python and the R requirements. We use a separate R library location that is set in the `.Renviron` file.
 
-5. In R, install the `reticulate` library:
+4. In R, install the following libraries:
 
 ```r
-install.packages("reticulate")
+install.packages(c("yardstick", "rBayesianOptimization", "DescTools", "ggExtra"))
 ```
 
-6. Check the `.Rprofile` file to ensure that R knows where to find your anaconda distribution.
+5. Check the `.Rprofile` file to ensure that R knows where to find your Anaconda distribution, and the `.Renviron` file to ensure that the path to the Anaconda environment is set correctly.
+
+## Shiny application
+
+We created a small Shiny application that allows you to input a document and visualize the HAN attention predictions and score. Find the repository for this Shiny app [here](https://github.com/JasperHG90/shiny_han).
+
+
 
````
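The new `6_statistical_test.R` applies the Stuart-Maxwell test of marginal homogeneity to paired classifier outcomes. As a sketch of the statistic it computes, here is an illustrative Python version with numpy/scipy (the repository's actual implementation is in R, e.g. via `DescTools`): for a k×k table of paired predictions, the test compares row and column marginals via the quadratic form d′S⁻¹d with k−1 degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def stuart_maxwell(table):
    """Stuart-Maxwell test of marginal homogeneity for a k x k
    contingency table of paired categorical outcomes.

    Returns (chi-square statistic, degrees of freedom, p-value).
    """
    table = np.asarray(table, dtype=float)
    k = table.shape[0]
    # Differences between row and column marginals.
    d = table.sum(axis=1) - table.sum(axis=0)
    # Covariance matrix of the marginal differences:
    # off-diagonal -(n_ij + n_ji), diagonal n_i. + n_.i - 2 n_ii.
    S = -(table + table.T)
    np.fill_diagonal(S, table.sum(axis=1) + table.sum(axis=0)
                     - 2 * np.diag(table))
    # Drop one (redundant) category to make S invertible.
    d, S = d[:-1], S[:-1, :-1]
    stat = float(d @ np.linalg.solve(S, d))
    df = k - 1
    return stat, df, float(chi2.sf(stat, df))
```

A perfectly symmetric table has identical marginals and yields a statistic of 0 (p = 1); the larger the marginal disagreement between two models' predictions, the larger the statistic.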
