The Gen-Parallel-Workloads repository contains generated and training data for job traces from various high-performance computing clusters, including BW, Theta, Philly, and Helios, designed to facilitate the comparison of machine learning models for synthetic job trace generation.
The table below lists all the traces in this repo and their download links. These data can be used for training models and for benchmarking scheduling decisions.
Note that all generated job traces contain 15,000 jobs. The original job traces are also cut to their latest 15,000 jobs.
Original Job Traces | Metadata | GAN-Gen | CTGAN-Gen* | TVAE-Gen* | GC-Gen | CGAN-Gen |
---|---|---|---|---|---|---|
Blue Waters | NCSA, 26,864 Nodes, 396K Cores, 4,228 GPUs | BW-GAN | BW-CTGAN | BW-TVAE | BW-GC | BW-CGAN |
Theta | ALCF, 4,392 Nodes, 281,088 Cores | Theta-GAN | Theta-CTGAN | Theta-TVAE | Theta-GC | Theta-CGAN |
Helios | SenseTime, 802 Nodes, 6,416 GPUs | Helios-GAN | Helios-CTGAN | Helios-TVAE | Helios-GC | Helios-CGAN |
Philly | Microsoft, 552 Nodes, 2,490 GPUs | Philly-GAN | Philly-CTGAN | Philly-TVAE | Philly-GC | Philly-CGAN |
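As noted above, each original trace is cut to its latest 15,000 jobs. A minimal sketch of that kind of truncation is shown here, assuming the rows are already in submission order; the helper name `latest_jobs` and the in-memory sample data are hypothetical and are not taken from the repository.

```python
import csv
import io


def latest_jobs(rows, limit=15000):
    """Return the last `limit` rows, assuming rows are in submission order."""
    return rows[-limit:] if limit > 0 else []


# Tiny in-memory trace standing in for a real trace file (hypothetical data).
sample = io.StringIO("u id,run time\n1,30\n2,45\n3,10\n4,60\n")
rows = list(csv.DictReader(sample))

# Keep only the two most recent jobs of the four in the sample.
trimmed = latest_jobs(rows, limit=2)
print([r["u id"] for r in trimmed])
```

With the default limit, a trace shorter than 15,000 jobs is returned unchanged.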
- BW, Theta, Philly, Helios: Directories for each cluster, containing:
  - generated_data: Synthetic traces generated by different ML models.
  - training_data: Original traces used to train the models.
- SDSC-95: Additional data including traces generated by statistical methods.
- Readme.md: Documentation of the repository.
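The layout above can be checked after cloning with a short script. This is a sketch only: it assumes the directory names match the list exactly, and the helper names `expected_dirs` and `missing_dirs` as well as the local checkout path are hypothetical.

```python
from pathlib import Path

# Directory names as listed in this README.
CLUSTERS = ["BW", "Theta", "Philly", "Helios"]
SUBDIRS = ["generated_data", "training_data"]


def expected_dirs(repo_root):
    """Build the per-cluster data directories expected under repo_root."""
    root = Path(repo_root)
    return [root / cluster / sub for cluster in CLUSTERS for sub in SUBDIRS]


def missing_dirs(repo_root):
    """Return the expected directories that do not exist on disk."""
    return [p for p in expected_dirs(repo_root) if not p.is_dir()]


if __name__ == "__main__":
    # "Gen-Parallel-Workloads" is an assumed local checkout path.
    for path in missing_dirs("Gen-Parallel-Workloads"):
        print("missing:", path)
```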
Five machine learning models, listed below, are utilized to generate synthetic traces for each original workload or job trace. Please refer to the Example section below for more details on how these models are applied.
- GAN (Generative Adversarial Network)
- CTGAN (Conditional Tabular GAN)
- TVAE (Tabular Variational Autoencoder)
- Gaussian Copula
- Copula GAN
Original job traces from the Blue Waters dataset were used to train five models, producing five synthetic traces for each original trace (as shown in the image below). This process was repeated for all listed datasets, so the output of every model can be compared against the original trace on each cluster.
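One simple way to compare a generated trace with its original is to measure how far apart the distributions of a single numeric column are. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) in plain Python. It is an illustration only and is not claimed to be the evaluation method used by the authors; the function name `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical run times (seconds) from an original and a generated trace.
original = [10, 20, 30, 40]
generated = [12, 18, 33, 41]
print(ks_statistic(original, generated))
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.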
Each trace includes several key columns such as:
Column Name | Description |
---|---|
u id | A unique identifier assigned to each job |
user | User ID, an identifier assigned to distinct users |
gpu num | Number of GPUs a job uses |
cpu num | Number of CPUs a job uses |
node num | Number of Nodes a job uses |
interval | Time taken for a job to arrive after the previous job was submitted |
run time | Total time a job was running |
wall time | Total time a job spent in the system from submit to completion |
new status | Status of the job when it completed (Pass, Failed, Killed) |
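Because `interval` is defined relative to the previous job's submission, absolute submit times can be rebuilt with a running sum. Likewise, under the definitions in the table, the difference between wall time and run time gives the time a job spent in the system without running. The sketch below illustrates both; the function names, the zero start time, and the sample values are assumptions, and plain lists are used so that no particular column spelling in the files is assumed.

```python
from itertools import accumulate


def submit_times(intervals, start=0.0):
    """Rebuild absolute submit times from inter-arrival intervals."""
    return [start + t for t in accumulate(intervals)]


def non_running_time(wall_time, run_time):
    """Time in the system but not running, per the table's definitions.

    Clamped at zero because a generated trace may not satisfy
    wall time >= run time for every job.
    """
    return max(0.0, wall_time - run_time)


# Hypothetical values for three consecutive jobs.
intervals = [0, 5, 10]
print(submit_times(intervals))        # [0.0, 5.0, 15.0]
print(non_running_time(120.0, 90.0))  # 30.0
```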
Please cite the following paper if you use this dataset or repository in your research:
@inproceedings{SoundarRaj2024Empirical,
title={An Empirical Study of Machine Learning-based Synthetic Job Trace Generation Methods},
author={Monish Soundar Raj and Thomas MacDougall and Di Zhang and Dong Dai},
booktitle={Workshop on Job Scheduling Strategies for Parallel Processing},
year={2024},
organization={Springer}
}