
Commit 55942ca

added sparkr support and its notebook
1 parent 128c0f4 commit 55942ca

10 files changed: +719 −41 lines changed


CONTRIBUTING.md (+1 −1)

@@ -21,6 +21,6 @@ parallel computing in distributed environments through our projects. :sparkles:
 - [x] JupyterLab Scala kernel;
 - [x] Jupyter notebook with Apache Spark Scala API examples;
 - [x] JupyterLab R kernel;
-- [ ] Jupyter notebook with Apache Spark R API examples;
+- [x] Jupyter notebook with Apache Spark R API examples;
 - [ ] Test coverage;
 - [ ] Ever growing examples.

README.md (+23 −19)

@@ -1,8 +1,9 @@
 # Apache Spark Standalone Cluster on Docker
+
 > The project just got its [own article](https://towardsdatascience.com/apache-spark-cluster-on-docker-ft-a-juyterlab-interface-418383c95445) at Towards Data Science Medium blog! :sparkles:
 
 This project gives you an **Apache Spark** cluster in standalone mode with a **JupyterLab** interface built on top of **Docker**.
-Learn Apache Spark through its Scala and Python API (PySpark) by running the Jupyter [notebooks](build/workspace/) with examples on how to read, process and write data.
+Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) APIs by running the Jupyter [notebooks](build/workspace/) with examples on how to read, process and write data.
 
 <p align="center"><img src="docs/image/cluster-architecture.png"></p>
 
@@ -13,6 +14,7 @@ Learn Apache Spark through its Scala and Python API (PySpark) by running the Jup
 ![docker-compose-file-version](https://img.shields.io/badge/docker--compose-v1.10.0%2B-blue)
 ![spark-scala-api](https://img.shields.io/badge/spark%20api-scala-red)
 ![spark-pyspark-api](https://img.shields.io/badge/spark%20api-pyspark-red)
+![spark-sparkr-api](https://img.shields.io/badge/spark%20api-sparkr-red)
 
 ## TL;DR
 
@@ -25,20 +27,20 @@ docker-compose up
 
 - [Quick Start](#quick-start)
 - [Tech Stack](#tech-stack)
-- [Docker Hub Metrics](#docker-hub-metrics)
 - [Contributing](#contributing)
 - [Contributors](#contributors)
+- [Downloads](#downloads)
 
 ## <a name="quick-start"></a>Quick Start
 
 ### Cluster overview
 
-| Application            | URL                                      | Description                                                   |
-| ---------------------- | ---------------------------------------- | ------------------------------------------------------------- |
-| JupyterLab             | [localhost:8888](http://localhost:8888/) | Cluster interface with Scala and PySpark built-in notebooks   |
-| Apache Spark Master    | [localhost:8080](http://localhost:8080/) | Spark Master node                                             |
-| Apache Spark Worker I  | [localhost:8081](http://localhost:8081/) | Spark Worker node with 1 core and 512m of memory (default)    |
-| Apache Spark Worker II | [localhost:8082](http://localhost:8082/) | Spark Worker node with 1 core and 512m of memory (default)    |
+| Application            | URL                                      | Description                                                          |
+| ---------------------- | ---------------------------------------- | -------------------------------------------------------------------- |
+| JupyterLab             | [localhost:8888](http://localhost:8888/) | Cluster interface with Scala, PySpark and SparkR built-in notebooks  |
+| Apache Spark Master    | [localhost:8080](http://localhost:8080/) | Spark Master node                                                    |
+| Apache Spark Worker I  | [localhost:8081](http://localhost:8081/) | Spark Worker node with 1 core and 512m of memory (default)           |
+| Apache Spark Worker II | [localhost:8082](http://localhost:8082/) | Spark Worker node with 1 core and 512m of memory (default)           |
 
 ### Prerequisites
 
@@ -54,7 +56,7 @@ docker-compose up
 docker-compose up
 ```
 
-4. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala and PySpark examples;
+4. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala, PySpark and SparkR examples;
 5. Stop the cluster by typing `ctrl+c`.
 
 ### Build from your local machine
@@ -82,7 +84,7 @@ chmod +x build.sh ; ./build.sh
 docker-compose up
 ```
 
-7. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala and PySpark examples;
+7. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala, PySpark and SparkR examples;
 8. Stop the cluster by typing `ctrl+c`.
 
 ## <a name="tech-stack"></a>Tech Stack
@@ -114,18 +116,20 @@ docker-compose up
 
 > Apache Spark R API (SparkR) is only supported on version **2.4.4**. Full list can be found [here](https://cran.r-project.org/src/contrib/Archive/SparkR/).
 
-## <a name="docker-hub-metrics"></a>Docker Hub Metrics
-
-| Image | Latest Version Size (Compressed) | Downloads |
-| -------------------------------------------------------------- | ------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
-| [JupyterLab](https://hub.docker.com/r/andreper/jupyterlab) | ![docker-size](https://img.shields.io/docker/image-size/andreper/jupyterlab/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/jupyterlab) |
-| [Spark Master](https://hub.docker.com/r/andreper/spark-master) | ![docker-size](https://img.shields.io/docker/image-size/andreper/spark-master/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-master) |
-| [Spark Worker](https://hub.docker.com/r/andreper/spark-worker) | ![docker-size](https://img.shields.io/docker/image-size/andreper/spark-worker/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-worker) |
-
 ## <a name="contributing"></a>Contributing
 
 We'd love some help. To contribute, please read [this file](CONTRIBUTING.md).
 
+> Starring us on GitHub is also an awesome way to show your support :star:
+
 ## <a name="contributors"></a>Contributors
 
-- **André Perez** - [dekoperez](https://twitter.com/dekoperez) - [email protected]
+- **André Perez** - [dekoperez](https://twitter.com/dekoperez) - [email protected]
+
+## <a name="downloads"></a>Downloads
+
+| Image | Latest Version Size (Compressed) | Downloads |
+| -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
+| [JupyterLab](https://hub.docker.com/r/andreper/jupyterlab) | ![docker-size-jupyterlab](https://img.shields.io/docker/image-size/andreper/jupyterlab/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/jupyterlab) |
+| [Spark Master](https://hub.docker.com/r/andreper/spark-master) | ![docker-size-master](https://img.shields.io/docker/image-size/andreper/spark-master/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-master) |
+| [Spark Worker](https://hub.docker.com/r/andreper/spark-worker) | ![docker-size-worker](https://img.shields.io/docker/image-size/andreper/spark-worker/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-worker) |
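
For readers landing here from the README changes above, a minimal sketch of what a SparkR session against this cluster looks like. This is illustrative only: the master URL, port and memory limit are taken from build/docker-compose.yml, and the notebook actually shipped by this commit may differ.

```r
# Connect the JupyterLab R kernel to the standalone Spark master.
# "spark-master" and 7077 match the service name and port in build/docker-compose.yml;
# executor memory must stay within the workers' SPARK_WORKER_MEMORY.
library(SparkR)

sparkR.session(
  appName = "sparkr-notebook",
  master = "spark://spark-master:7077",
  sparkConfig = list(spark.executor.memory = "512m")
)

df <- as.DataFrame(faithful)  # distribute a built-in R data set
head(df)

sparkR.session.stop()
```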

build/build.yml (+1 −1)

@@ -1,6 +1,6 @@
 applications:
   scala: "2.12.11"
-  spark: "3.0.0"
+  spark: "2.4.4"
   hadoop: "2.7"
   jupyterlab: "2.1.4"
 build:

build/docker-compose.yml (+4 −4)

@@ -9,22 +9,22 @@ volumes:
     driver: local
 services:
   jupyterlab:
-    image: jupyterlab:2.1.4-spark-3.0.0
+    image: jupyterlab:2.1.4-spark-2.4.4
     container_name: jupyterlab
     ports:
       - 8888:8888
     volumes:
       - shared-workspace:/opt/workspace
   spark-master:
-    image: spark-master:3.0.0-hadoop-2.7
+    image: spark-master:2.4.4-hadoop-2.7
     container_name: spark-master
     ports:
       - 8080:8080
       - 7077:7077
     volumes:
       - shared-workspace:/opt/workspace
   spark-worker-1:
-    image: spark-worker:3.0.0-hadoop-2.7
+    image: spark-worker:2.4.4-hadoop-2.7
     container_name: spark-worker-1
     environment:
       - SPARK_WORKER_CORES=1
@@ -36,7 +36,7 @@ services:
     depends_on:
       - spark-master
   spark-worker-2:
-    image: spark-worker:3.0.0-hadoop-2.7
+    image: spark-worker:2.4.4-hadoop-2.7
     container_name: spark-worker-2
     environment:
       - SPARK_WORKER_CORES=1
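
Since all three images are now pinned to Spark **2.4.4**, the SparkR package baked into the JupyterLab image has to agree with the cluster version. A quick sanity check from the R kernel (a hedged sketch, assuming a session as in the README example above):

```r
# Both values below should report 2.4.4 to match the pinned images.
library(SparkR)
sparkR.session(master = "spark://spark-master:7077")
sparkR.version()          # Spark version of the cluster the session talks to
packageVersion("SparkR")  # version of the locally installed SparkR package
sparkR.session.stop()
```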

build/docker/jupyterlab/Dockerfile (+6 −1)

@@ -33,9 +33,14 @@ RUN apt-get install -y ca-certificates-java --no-install-recommends && \
 
 # -- Layer: R kernel for SparkR
 
+COPY ./script/sparkr.sh ./sparkr.sh
+
 RUN apt-get install -y r-base-dev && \
     R -e "install.packages('IRkernel')" && \
-    R -e "IRkernel::installspec(displayname = 'R 3.5', user = FALSE)"
+    R -e "IRkernel::installspec(displayname = 'R 3.5', user = FALSE)" && \
+    chmod +x ./sparkr.sh && \
+    ./sparkr.sh ${spark_version} && \
+    rm -f sparkr.sh
 
 # -- Runtime
 
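
A hedged way to confirm this layer did its job from inside the built image; these are standard IRkernel/utils calls, not commands the Dockerfile itself runs:

```r
# The layer installs two things: the R Jupyter kernel and the SparkR package.
library(IRkernel)         # errors if the kernel package is missing
packageVersion("SparkR")  # should print the pinned Spark version, e.g. 2.4.4
```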

build/script/sparkr.sh (+20, new file)

@@ -0,0 +1,20 @@
+#!/bin/bash
+#
+# -- Download and install Apache Spark R API (SparkR)
+
+# ----------------------------------------------------------------------------------------------------------------------
+# -- Variables ---------------------------------------------------------------------------------------------------------
+# ----------------------------------------------------------------------------------------------------------------------
+
+SPARK_VERSION="${1}"
+
+# ----------------------------------------------------------------------------------------------------------------------
+# -- Main --------------------------------------------------------------------------------------------------------------
+# ----------------------------------------------------------------------------------------------------------------------
+
+# Only versions with a SparkR release in the CRAN archive are accepted
+# (dots escaped so the pattern matches literal version strings).
+if [[ "${SPARK_VERSION}" =~ ^(2\.1\.2|2\.3\.0|2\.4\.1|2\.4\.2|2\.4\.3|2\.4\.4|2\.4\.5|2\.4\.6)$ ]]
+then
+  curl https://cran.r-project.org/src/contrib/Archive/SparkR/SparkR_${SPARK_VERSION}.tar.gz -k -o sparkr.tar.gz
+  R CMD INSTALL sparkr.tar.gz
+  rm -f sparkr.tar.gz
+fi
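
For reference, the same download-and-install the script performs, done from within R rather than the shell. This is an illustrative equivalent, not what the image build runs; the build invokes the script from the Dockerfile with the version pinned in build/build.yml:

```r
# R equivalent of build/script/sparkr.sh: fetch the SparkR source package for a
# given Spark version from the CRAN archive and install it from the tarball.
spark_version <- "2.4.4"
url <- sprintf(
  "https://cran.r-project.org/src/contrib/Archive/SparkR/SparkR_%s.tar.gz",
  spark_version
)
download.file(url, destfile = "sparkr.tar.gz")
install.packages("sparkr.tar.gz", repos = NULL, type = "source")
unlink("sparkr.tar.gz")  # clean up the tarball, as the script does
```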

build/workspace/pyspark.ipynb (+14 −7)

@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# **PySpark**: The Spark Python API"
+    "# **PySpark**: The Apache Spark Python API"
    ]
   },
   {
@@ -33,7 +33,7 @@
     "\n",
     "+ **appName:** application name displayed at the [Spark Master Web UI](http://localhost:8080/);\n",
     "+ **master:** Spark Master URL, same used by Spark Workers;\n",
-    "+ **config:** must be less than or equals to docker compose SPARK_WORKER_MEMORY config."
+    "+ **spark.executor.memory:** must be less than or equal to the docker-compose SPARK_WORKER_MEMORY config."
    ]
   },
   {
@@ -52,6 +52,13 @@
     "    getOrCreate()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "More confs for the SparkSession object in standalone mode can be added using the **config** method. Check out the API docs [here](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession)."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -162,7 +169,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "unemployment = data.select(['Description', 'Population (GB+NI)', 'Unemployment rate'])"
+    "unemployment = data.select([\"Description\", \"Population (GB+NI)\", \"Unemployment rate\"])"
    ]
   },
   {
@@ -308,9 +315,9 @@
    "outputs": [],
    "source": [
     "unemployment = unemployment.\\\n",
-    "    withColumnRenamed('Description', 'year').\\\n",
-    "    withColumnRenamed('Population (GB+NI)', 'population').\\\n",
-    "    withColumnRenamed('Unemployment rate', 'unemployment_rate')"
+    "    withColumnRenamed(\"Description\", \"year\").\\\n",
+    "    withColumnRenamed(\"Population (GB+NI)\", \"population\").\\\n",
+    "    withColumnRenamed(\"Unemployment rate\", \"unemployment_rate\")"
    ]
   },
   {
@@ -365,7 +372,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "unemployment.repartition(1).write.csv(path=\"data/uk-macroeconomic-unemployment-data.csv\", sep=\",\", header=True, mode='overwrite')"
+    "unemployment.repartition(1).write.csv(path=\"data/uk-macroeconomic-unemployment-data.csv\", sep=\",\", header=True, mode=\"overwrite\")"
    ]
   }
 ],

build/workspace/scala.ipynb (+11 −4)

@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# The Spark Scala API"
+    "# The Apache Spark Scala API"
    ]
   },
   {
@@ -29,7 +29,7 @@
    "source": [
     "### 2.1. Get Spark\n",
     "\n",
-    "Let's start by importing Apache Spark from Maven repository (mind the version)."
+    "Let's start by importing Apache Spark from the Maven repository (mind the Apache Spark **version**)."
    ]
   },
   {
@@ -49,7 +49,7 @@
     }
    ],
    "source": [
-    "import $ivy.`org.apache.spark::spark-sql:3.0.0`;"
+    "import $ivy.`org.apache.spark::spark-sql:2.4.4`;"
    ]
   },
   {
@@ -91,7 +91,7 @@
     "\n",
     "+ **appName:** application name displayed at the [Spark Master Web UI](http://localhost:8080/);\n",
     "+ **master:** Spark Master URL, same used by Spark Workers;\n",
-    "+ **config:** must be less than or equals to docker compose SPARK_WORKER_MEMORY config."
+    "+ **spark.executor.memory:** must be less than or equal to the docker-compose SPARK_WORKER_MEMORY config."
    ]
   },
   {
@@ -110,6 +110,13 @@
     "    getOrCreate()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "More confs for the SparkSession object in standalone mode can be added using the **config** method. Check out the API docs [here](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html)."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
