This is intended as a simple end-to-end example of how to get your data into
the format that PyTorch BigGraph expects using SQL. It's implemented in SQLite
for portability, but similar techniques scale to billions of edges using cloud
databases such as BigQuery or Snowflake. This pipeline can be split into three
different components:

1. Data preparation
2. Data verification/checking

In the data preparation stage, we first load the graph
into a SQLite database and then we transform and partition it. The transformation
can be understood as first partitioning the entities, then generating a mapping
between the graph-ids and ordinal ids per-type that PBG will expect, and finally
writing out all the files required to train, including the config file. By
keeping track of the vertex types, we're able to verify our mappings in a fully
self-consistent fashion.
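
To make the mapping step concrete, here is a minimal sketch of how the per-type
ordinal ids and partition assignments could be produced with SQLite window functions
(available in SQLite 3.25 and later), together with the kind of self-consistency
check the verification step performs. The `graph.db` filename, the
`nodes(graph_id, type)` table, and the choice of four partitions are illustrative
assumptions, not the exact schema or partitioner used by `data_prep.py`:

```
import sqlite3

NUM_PARTITIONS = 4  # assumption: chosen so one partition's embeddings fit in memory

# Hypothetical database created by the load step, with a nodes(graph_id, type) table.
conn = sqlite3.connect("graph.db")

conn.executescript(f"""
    -- Round-robin every node of each type into a partition, then give it a 0-based
    -- ordinal id that is contiguous within its (type, partition) bucket, which is
    -- the numbering PBG expects.
    DROP TABLE IF EXISTS node_mapping;
    CREATE TABLE node_mapping AS
    WITH assigned AS (
        SELECT graph_id,
               type,
               (ROW_NUMBER() OVER (PARTITION BY type ORDER BY graph_id) - 1)
                   % {NUM_PARTITIONS} AS part_id
        FROM nodes
    )
    SELECT graph_id,
           type,
           part_id,
           ROW_NUMBER() OVER (PARTITION BY type, part_id ORDER BY graph_id) - 1
               AS ordinal_id
    FROM assigned;
""")

# Self-consistency check: ordinal ids must be dense (0..N-1) inside every
# (type, part_id) bucket, i.e. MAX(ordinal_id) + 1 == COUNT(*).
bad = conn.execute("""
    SELECT type, part_id FROM node_mapping
    GROUP BY type, part_id
    HAVING MAX(ordinal_id) + 1 <> COUNT(*)
""").fetchall()
assert not bad, f"non-contiguous ordinal ids in buckets: {bad}"
```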
Once the data has been prepared and generated, we're ready to embed the graph. We
do this by passing the generated config to `torchbiggraph_train` in the following
way:

```
torchbiggraph_train \
    path/to/generated/config.py
```
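
For orientation, a PBG config is an ordinary Python module that exposes a
`get_torchbiggraph_config()` function returning a dict of settings. A rough sketch
for a graph with a single entity type might look like the following; the paths,
entity and relation names, and hyperparameter values are placeholders, not what the
generated config actually contains:

```
def get_torchbiggraph_config():
    return dict(
        # where the prepared entities, edge buckets, and checkpoints live (placeholder paths)
        entity_path="data/example_graph",
        edge_paths=["data/example_graph/edges"],
        checkpoint_path="model/example_graph",
        # one entity type, split into the same number of partitions as the prepared data
        entities={"node": {"num_partitions": 4}},
        relations=[
            {"name": "follows", "lhs": "node", "rhs": "node", "operator": "translation"},
        ],
        # model size and training length (placeholder hyperparameters)
        dimension=200,
        num_epochs=10,
    )
```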
The `data_prep.py` script will also compute the approximate amount of shared memory
that will be needed for training. If the training demands exceed the available shared
memory, you'll need to regenerate your data with more partitions than you currently
have. If you're seeing either a bus error or an OOM kill message in the kernel ring
buffer even though your machine has enough RAM, you'll want to verify that `/dev/shm`
is large enough to accommodate your embedding table.
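
As a rough rule of thumb, the embedding table takes about `num_entities * dimension * 4`
bytes, since PBG stores embeddings as float32; that's the kind of estimate the script
reports. A hypothetical back-of-the-envelope check against the prepared database,
assuming the `node_mapping` table from the earlier sketch and a dimension of 200, might
look like:

```
import sqlite3

DIMENSION = 200      # assumption: must match the dimension in the generated config
BYTES_PER_FLOAT = 4  # PBG embeddings are float32

conn = sqlite3.connect("graph.db")  # hypothetical database from the prep step
(num_entities,) = conn.execute("SELECT COUNT(*) FROM node_mapping").fetchone()

# Size of the full embedding table; with more partitions, fewer buckets need to be
# resident in shared memory at any one time during training.
approx_bytes = num_entities * DIMENSION * BYTES_PER_FLOAT
print(f"~{approx_bytes / 2**30:.2f} GiB of embeddings for {num_entities:,} entities")
```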