Project files for pattern recognition group assignment. Our project is about the classification of Wikipedia articles into one of the 11 top-level categories of the [Vital Articles Wikipedia list, level 4](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/4).
The vital articles list was downloaded using scrapy. The code can be found in the [WikiVitalArticles](https://github.com/JasperHG90/WikiVitalArticles) repository. The raw data is included in this repository.
Given the different skill sets in our group, we use a mix of R, Python, Keras and Pytorch to build our models. However, we make sure that each model uses the same train/test splits.
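To illustrate (this is a hedged sketch, not the project's actual split code, and the file names are hypothetical), one way to guarantee identical splits across R and Python models is to draw the split once with a fixed seed and write the document ids to disk, so every pipeline reads the same ids:

```python
# Hypothetical sketch: create one train/test split and persist it so that
# every model (Keras, Pytorch, R) filters on the same document ids.
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed input: a CSV with one row per article (columns: doc_id, text, category).
docs = pd.read_csv("data/articles.csv")

train_ids, test_ids = train_test_split(
    docs["doc_id"],
    test_size=0.1,               # held-out fraction (assumption)
    stratify=docs["category"],   # keep category proportions comparable
    random_state=42,             # fixed seed so the split is reproducible
)

# Both the R and Python pipelines can then read these id files.
train_ids.to_csv("data/train_ids.csv", index=False)
test_ids.to_csv("data/test_ids.csv", index=False)
```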
Our presentation can be found here. Our final paper can be found here.
## Files
Currently contains the following files (not all files are listed here):

1. `data/raw/WikiEssentials_L4.7z`: output file of the WikiVitalArticles program. Each document is included in its entirety (but split by paragraph).
2. `preprocess_utils.py`: preprocessing functions for Wiki data.
3. `model_utils.py`: various utility functions used for modeling (e.g. loading embeddings).
4. `1_preprocess_raw_data.py`: preprocessing of the raw input data. Currently shortens each article to its first 8 sentences.
5. `2_baseline_model.py`: Pytorch implementation of the baseline model (a 1-layer NN with a softmax classifier); a sketch of such a model is given after this list.
6. `3_cnn_model.R`: Keras implementation of a 1D convolutional neural network.
7. `4_lstm_model.py`: Pytorch implementation of a Long Short-Term Memory (LSTM) recurrent neural network.
8. `5_han_model.py`: Pytorch implementation of a Hierarchical Attention Network (HAN).
9. `6_statistical_test.R`: Contains R code to perform the Stuart-Maxwell test on the classification outcomes.
10. `HAN.py`: Contains the Pytorch module implementation of the HAN.
11. `LSTM.py`: Contains the Pytorch module implementation of the LSTM.
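As a rough illustration of what such a 1-layer softmax baseline can look like in Pytorch (a minimal sketch with assumed input dimensions, not the exact contents of `2_baseline_model.py`):

```python
# Minimal sketch of the baseline: a single linear layer over a document
# vector, trained with cross-entropy (which applies softmax internally).
import torch
import torch.nn as nn

INPUT_DIM = 10_000    # assumed size of the document vectors
NUM_CLASSES = 11      # the 11 top-level Vital Articles categories


class BaselineClassifier(nn.Module):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(input_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns unnormalised class scores; softmax is applied by the loss
        # during training or explicitly at prediction time.
        return self.linear(x)


model = BaselineClassifier(INPUT_DIM, NUM_CLASSES)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random data.
x = torch.rand(32, INPUT_DIM)                # batch of document vectors
y = torch.randint(0, NUM_CLASSES, (32,))     # ground-truth category indices
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```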
It contains the following folders:

1. `data`: Contains raw and pre-processed data used by the models. To understand the pipeline from raw to preprocessed data, see the `preprocess_utils.py` file.
2. `embeddings`: Folder in which the FastText embeddings should be downloaded and unzipped (a sketch of how such embeddings can be loaded is given after this list).
3. `img`: Contains images.
4. `model_cnn`: Final model for the convolutional neural network after hyperparameter optimization.
5. `models`: Final Pytorch model weights for the baseline, HAN and LSTM.
6. `predictions`: CSV files containing the predictions and ground-truth labels for each model.
7. `results`: CSV files containing the results of the hyperparameter search we conducted using [Hyperopt](https://github.com/hyperopt/hyperopt).
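As a rough sketch of how pre-trained FastText vectors in the plain-text `.vec` format can be read into memory (the file name, vocabulary limit and helper name below are assumptions for illustration, not this repository's loading code):

```python
# Sketch: read a FastText .vec file (plain text: a header line with
# "<vocab_size> <dim>", then one word and its vector per line) into a dict.
import numpy as np


def load_fasttext_vectors(path, limit=100_000):
    vectors = {}
    with open(path, encoding="utf-8", errors="ignore") as fh:
        n_words, dim = map(int, fh.readline().split())  # header line
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors


# Hypothetical path; the actual file depends on which FastText model is downloaded.
embeddings = load_fasttext_vectors("embeddings/wiki-news-300d-1M.vec")
```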
## Setup
Note that this will install both the Python requirements and the R requirements. We use a separate R library location that is set in the `.Renviron` file.
5. Check the `.Rprofile` file to ensure that R knows where to find your Anaconda distribution. Check the `.Renviron` file to ensure that the path to the Anaconda environment is set correctly.
## Shiny application
We created a small Shiny application that allows you to input a document and visualize the HAN's attention weights and prediction scores. The repository for this Shiny app can be found [here](https://github.com/JasperHG90/shiny_han).
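The weights that the app visualizes come from the HAN's attention layers. As a simplified sketch of how word-level attention can be computed (dimensions and names are assumptions for illustration, not the code in `HAN.py`):

```python
# Simplified sketch of HAN-style word attention: each hidden state is
# projected, compared against a learned context vector, and the softmax
# weights are used to form a weighted sentence vector.
import torch
import torch.nn as nn


class WordAttention(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.projection = nn.Linear(hidden_dim, hidden_dim)
        self.context = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, words, hidden_dim) word-level encoder outputs
        u = torch.tanh(self.projection(hidden))                  # (batch, words, hidden_dim)
        scores = u @ self.context                                # (batch, words)
        weights = torch.softmax(scores, dim=1)                   # attention per word
        sentence = (weights.unsqueeze(-1) * hidden).sum(dim=1)   # (batch, hidden_dim)
        return sentence, weights                                 # weights are what gets visualized


attn = WordAttention(hidden_dim=64)
sentence_vec, word_weights = attn(torch.rand(2, 15, 64))
```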