Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡

lopezdar222/spark-standalone-cluster-on-docker

This branch is up to date with cluster-apps-on-docker/spark-standalone-cluster-on-docker:master.

Apache Spark Standalone Cluster on Docker

The project was featured in an article on the official MongoDB tech blog! 😱

The project just got its own article on the Towards Data Science blog on Medium! ✨

Introduction

This project gives you an Apache Spark cluster in standalone mode, with a JupyterLab interface, built on top of Docker. Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) APIs by running the bundled Jupyter notebooks, which contain examples of how to read, process and write data.

TL;DR

curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
docker-compose up

Quick Start

Cluster overview

| Application     | URL            | Description                                                |
|-----------------|----------------|------------------------------------------------------------|
| JupyterLab      | localhost:8888 | Cluster interface with built-in Jupyter notebooks          |
| Spark Driver    | localhost:4040 | Spark Driver web UI                                        |
| Spark Master    | localhost:8080 | Spark Master node                                          |
| Spark Worker I  | localhost:8081 | Spark Worker node with 1 core and 512m of memory (default) |
| Spark Worker II | localhost:8082 | Spark Worker node with 1 core and 512m of memory (default) |
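
Once the cluster is started, a quick way to see which of the interfaces above are reachable is to probe their ports. The following is a minimal sketch using only the Python standard library; the helper names `is_listening` and `cluster_status` are made up for illustration and are not part of this project. It assumes the cluster runs on the same machine with the default port mappings from the table. Note that the Spark Driver UI on port 4040 is normally served only while a Spark application is running, so it may show as down on an idle cluster.

```python
import socket

# Ports taken from the cluster overview table above.
CLUSTER_PORTS = {
    "JupyterLab": 8888,
    "Spark Driver": 4040,
    "Spark Master": 8080,
    "Spark Worker I": 8081,
    "Spark Worker II": 8082,
}


def is_listening(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


def cluster_status(ports=CLUSTER_PORTS, host="localhost"):
    """Map each application name to whether its port accepts connections."""
    return {name: is_listening(port, host) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in cluster_status().items():
        print(f"{name:<16} {'up' if up else 'down'}")
```

A port that accepts connections only shows that something is listening there; it does not confirm that the service behind it is healthy.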

Prerequisites

Install Docker Engine and Docker Compose; the supported versions are listed under Tech Stack.

Download from Docker Hub (easier)

  1. Download the docker compose file;
     curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
  2. Edit the docker compose file to use your preferred tech stack versions (see the supported app versions under Tech Stack);
  3. Start the cluster;
     docker-compose up
  4. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
  5. Stop the cluster by pressing Ctrl+C in the terminal;
  6. Run step 3 to restart the cluster.

Build from your local machine

Note: Local builds are currently supported only on Linux distributions.

  1. Download the source code or clone the repository;
  2. Move to the build directory;
     cd build
  3. Edit the build.yml file to use your preferred tech stack versions;
  4. Match those versions in the docker compose file;
  5. Build the images;
     chmod +x build.sh ; ./build.sh
  6. Start the cluster;
     docker-compose up
  7. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
  8. Stop the cluster by pressing Ctrl+C in the terminal;
  9. Run step 6 to restart the cluster.

Tech Stack

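  • Infra

| Component      | Version |
|----------------|---------|
| Docker Engine  | 1.13.0+ |
| Docker Compose | 1.10.0+ |

  • Languages and Kernels

| Spark | Hadoop | Scala   | Scala Kernel | Python | Python Kernel | R     | R Kernel |
|-------|--------|---------|--------------|--------|---------------|-------|----------|
| 3.x   | 3.2    | 2.12.10 | 0.10.9       | 3.7.3  | 7.19.0        | 3.5.2 | 1.1.1    |
| 2.x   | 2.7    | 2.11.12 | 0.6.0        | 3.7.3  | 7.19.0        | 3.5.2 | 1.1.1    |

  • Apps

| Component    | Version             | Docker Tag                                 |
|--------------|---------------------|--------------------------------------------|
| Apache Spark | 2.4.0, 2.4.4, 3.0.0 | <spark-version>                            |
| JupyterLab   | 2.1.4, 3.0.0        | <jupyterlab-version>-spark-<spark-version> |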
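
When editing the docker compose file, the image tags have to follow the patterns in the Apps table. The sketch below shows how those tag strings could be assembled and checked against the listed versions. The function names `spark_tag` and `jupyterlab_tag` are hypothetical, and the version tuples simply copy the table; whether every JupyterLab and Spark combination is actually published on Docker Hub is not stated here, so treat the result as a candidate tag to verify.

```python
# Versions copied from the Apps table above.
SUPPORTED_SPARK = ("2.4.0", "2.4.4", "3.0.0")
SUPPORTED_JUPYTERLAB = ("2.1.4", "3.0.0")


def spark_tag(spark_version):
    """Tag pattern <spark-version> (hypothetical helper)."""
    if spark_version not in SUPPORTED_SPARK:
        raise ValueError(f"unsupported Spark version: {spark_version}")
    return spark_version


def jupyterlab_tag(jupyterlab_version, spark_version):
    """Tag pattern <jupyterlab-version>-spark-<spark-version> (hypothetical helper)."""
    if jupyterlab_version not in SUPPORTED_JUPYTERLAB:
        raise ValueError(f"unsupported JupyterLab version: {jupyterlab_version}")
    return f"{jupyterlab_version}-spark-{spark_tag(spark_version)}"


if __name__ == "__main__":
    print(spark_tag("3.0.0"))
    print(jupyterlab_tag("3.0.0", "3.0.0"))
```

Rejecting versions outside the table up front is a design choice for this sketch: it surfaces a typo before docker-compose tries to pull an image that may not exist.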

Contributing

We'd love some help. To contribute, please read this file.

Contributors

A list of the amazing people who have contributed to the project can be found in this file. This project is maintained by:

André Perez - dekoperez - andre.marcos.perez@gmail.com

Support

Support us on GitHub by starring this project ⭐

Support us on Patreon. 💖

