DIR-LAB/Gen-Parallel-Workloads: Generated Parallel Workloads Archive

Overview

The Gen-Parallel-Workloads repository contains synthetic job traces, together with the original traces used to train the generators, for four large-scale computing clusters: the HPC systems Blue Waters (BW) and Theta, and the GPU clusters Philly and Helios. It is designed to facilitate the comparison of machine learning models for synthetic job trace generation.

The table below lists every trace in this repository with its download link. The traces can be used to train and benchmark job scheduling policies.

Note that every generated job trace contains 15,000 jobs, and each original job trace has been truncated to its most recent 15,000 jobs.
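Truncating an original trace to its latest 15,000 jobs can be sketched with Python's standard library. This is a minimal sketch, assuming each trace is a CSV file with a header row and that jobs are already ordered by submission; `keep_latest` and `truncate_trace` are hypothetical names, not part of this repository.

```python
import csv
from collections import deque
from typing import Iterable


def keep_latest(rows: Iterable[dict], n: int = 15_000) -> list[dict]:
    """Return the last n rows, assuming rows arrive in submission order."""
    # A bounded deque drops the oldest row once it holds n rows,
    # so memory stays O(n) even for very long traces.
    return list(deque(rows, maxlen=n))


def truncate_trace(src_path: str, dst_path: str, n: int = 15_000) -> int:
    """Copy the header plus the most recent n jobs of a CSV trace (assumed layout)."""
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        latest = keep_latest(reader, n)
        fieldnames = reader.fieldnames
    with open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(latest)
    return len(latest)
```

Streaming through a bounded deque avoids loading the full trace before slicing, which matters for multi-year traces such as Blue Waters.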

Original Job Traces	Metadata	GAN-Gen	CTGAN-Gen*	TVAE-Gen*	GC-Gen	CGAN-Gen
Blue Waters	NCSA, 26,864 Nodes, 396K Cores, 4,228 GPUs	BW-GAN	BW-CTGAN	BW-TVAE	BW-GC	BW-CGAN
Theta	ALCF, 4,392 Nodes, 281,088 Cores	Theta-GAN	Theta-CTGAN	Theta-TVAE	Theta-GC	Theta-CGAN
Helios	SenseTime, 802 Nodes, 6,416 GPUs	Helios-GAN	Helios-CTGAN	Helios-TVAE	Helios-GC	Helios-CGAN
Philly	Microsoft, 552 Nodes, 2,490 GPUs	Philly-GAN	Philly-CTGAN	Philly-TVAE	Philly-GC	Philly-CGAN

Structure

BW, Theta, Philly, Helios: Directories for each cluster, containing:
- generated_data: Synthetic traces generated by different ML models.
- training_data: Original traces used to train the models.
SDSC-95: Additional data including traces generated by statistical methods.
Readme.md: Documentation of the repository.
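A small helper for locating trace files under this layout is sketched below. It assumes the traces are stored as `*.csv` files (an assumption; adjust `pattern` if the files use another extension), and `index_traces` is a hypothetical name.

```python
from pathlib import Path

CLUSTERS = ("BW", "Theta", "Philly", "Helios")
SUBDIRS = ("generated_data", "training_data")


def index_traces(repo_root: str, pattern: str = "*.csv") -> dict[str, dict[str, list[Path]]]:
    """Map cluster -> subdirectory -> sorted list of trace files.

    The default '*.csv' pattern is an assumption about file names,
    not a documented property of the repository.
    """
    root = Path(repo_root)
    index: dict[str, dict[str, list[Path]]] = {}
    for cluster in CLUSTERS:
        index[cluster] = {}
        for sub in SUBDIRS:
            folder = root / cluster / sub
            # A missing folder yields an empty list instead of an error.
            index[cluster][sub] = sorted(folder.glob(pattern)) if folder.is_dir() else []
    return index
```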

Models

Five machine learning models, listed below, are used to generate synthetic traces from each original job trace. See the Example section for how the models are applied.

GAN (Generative Adversarial Network)
CTGAN (Conditional Tabular GAN)
TVAE (Tabular Variational Autoencoder)
Gaussian Copula (GC)
Copula GAN (CGAN)

Example

The original Blue Waters job trace was used to train each of the five models, producing five synthetic traces from that one original trace, as illustrated in the repository's example_image.png. The same process was repeated for every listed cluster, yielding 20 generated traces (5 models × 4 clusters) and a common basis for comparing the machine learning techniques.
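The experimental design described above, in which every model is trained separately on every original trace and each produces a 15,000-job trace, can be sketched as a simple loop. The `Synthesizer` protocol, `ResampleSynthesizer`, and `generate_all` are hypothetical names; `ResampleSynthesizer` only bootstraps training rows and stands in for a real model so the loop can run. It is not one of the five models.

```python
import random
from typing import Callable, Protocol, Sequence

MODELS = ("GAN", "CTGAN", "TVAE", "GC", "CGAN")  # suffixes used in the trace table
N_JOBS = 15_000  # every generated trace has 15,000 jobs


class Synthesizer(Protocol):
    """Hypothetical interface: any model exposing fit/sample."""

    def fit(self, rows: Sequence[dict]) -> "Synthesizer": ...

    def sample(self, n: int) -> list[dict]: ...


class ResampleSynthesizer:
    """Hypothetical placeholder that bootstraps training rows.

    It is NOT one of the five models; it only lets the loop run end to end.
    """

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._rows: list[dict] = []

    def fit(self, rows: Sequence[dict]) -> "ResampleSynthesizer":
        self._rows = list(rows)
        return self

    def sample(self, n: int) -> list[dict]:
        return [dict(self._rng.choice(self._rows)) for _ in range(n)]


def generate_all(
    traces: dict[str, list[dict]],
    make_model: Callable[[str], Synthesizer],
    n_jobs: int = N_JOBS,
) -> dict[str, list[dict]]:
    """Train a fresh model per (cluster, model) pair; keys follow 'BW-GAN' naming."""
    generated: dict[str, list[dict]] = {}
    for cluster, rows in traces.items():
        for name in MODELS:
            model = make_model(name).fit(rows)
            generated[f"{cluster}-{name}"] = model.sample(n_jobs)
    return generated
```

Applied to the four original traces, the loop produces the 20 keys in the table above, from `BW-GAN` to `Philly-CGAN`. Creating a fresh model per pair keeps the runs independent, so no model sees more than one cluster's data.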

Data Format

Key columns in each trace include:

Column Name	Description
u id	A unique identifier assigned to each job
user	User ID, an identifier assigned to distinct users
gpu num	Number of GPUs a job uses
cpu num	Number of CPUs a job uses
node num	Number of nodes a job uses
interval	Time between the previous job's submission and this job's submission (inter-arrival time)
run time	Total time the job spent running
wall time	Total time the job spent in the system, from submission to completion
new status	Final status of the job on completion (Pass, Failed, or Killed)
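Two quantities that are not stored as columns can be derived from the ones above: a job's submit time is the running sum of `interval` values, and, since `wall time` spans submission to completion while `run time` covers only execution, the queueing delay is `wall time - run time`. The sketch below assumes the CSV header uses exactly the column names in the table (an assumption; the real headers may differ, for example by using underscores) and that all times share one unit. Submit times are relative to an arbitrary origin, and `load_trace` and `derive_times` are hypothetical names.

```python
import csv
import io
from itertools import accumulate

# Assumed header names, copied from the table above; adjust to the actual CSV files.
INTERVAL, RUN, WALL = "interval", "run time", "wall time"


def load_trace(text: str) -> list[dict]:
    """Parse CSV text into one dict per job."""
    return list(csv.DictReader(io.StringIO(text)))


def derive_times(rows: list[dict]) -> list[dict]:
    """Add 'submit' (cumulative interval) and 'wait' (wall time minus run time)."""
    intervals = [float(r[INTERVAL]) for r in rows]
    out = []
    for r, submit in zip(rows, accumulate(intervals)):
        # Assumes the job ran continuously once started, so the rest of wall time is queueing.
        wait = float(r[WALL]) - float(r[RUN])
        out.append({**r, "submit": submit, "wait": wait})
    return out
```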

Citation

Please cite the following paper if you use this dataset or repository in your research:

@inproceedings{SoundarRaj2024Empirical,
  title={An Empirical Study of Machine Learning-based Synthetic Job Trace Generation Methods},
  author={Monish Soundar Raj and Thomas MacDougall and Di Zhang and Dong Dai},
  booktitle={Workshop on Job Scheduling Strategies for Parallel Processing},
  year={2024},
  organization={Springer}
}
