
Commit 55942ca

added sparkr support and its notebook
1 parent 128c0f4 commit 55942ca

10 files changed: +719 −41 lines changed


CONTRIBUTING.md (+1 −1)

@@ -21,6 +21,6 @@ parallel computing in distributed environments through our projects. :sparkles:
 - [x] JupyterLab Scala kernel;
 - [x] Jupyter notebook with Apache Spark Scala API examples;
 - [x] JupyterLab R kernel;
-- [ ] Jupyter notebook with Apache Spark R API examples;
+- [x] Jupyter notebook with Apache Spark R API examples;
 - [ ] Test coverage;
 - [ ] Ever growing examples.

README.md (+23 −19)

@@ -1,8 +1,9 @@
 # Apache Spark Standalone Cluster on Docker
+
 > The project just got its [own article](https://towardsdatascience.com/apache-spark-cluster-on-docker-ft-a-juyterlab-interface-418383c95445) at Towards Data Science Medium blog! :sparkles:
 
 This project gives you an **Apache Spark** cluster in standalone mode with a **JupyterLab** interface built on top of **Docker**.
-Learn Apache Spark through its Scala and Python API (PySpark) by running the Jupyter [notebooks](build/workspace/) with examples on how to read, process and write data.
+Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) APIs by running the Jupyter [notebooks](build/workspace/) with examples on how to read, process and write data.
 
 <p align="center"><img src="docs/image/cluster-architecture.png"></p>
 
@@ -13,6 +14,7 @@ Learn Apache Spark through its Scala and Python API (PySpark) by running the Jup
 ![docker-compose-file-version](https://img.shields.io/badge/docker--compose-v1.10.0%2B-blue)
 ![spark-scala-api](https://img.shields.io/badge/spark%20api-scala-red)
 ![spark-pyspark-api](https://img.shields.io/badge/spark%20api-pyspark-red)
+![spark-sparkr-api](https://img.shields.io/badge/spark%20api-sparkr-red)
 
 ## TL;DR
 
@@ -25,20 +27,20 @@ docker-compose up
 
 - [Quick Start](#quick-start)
 - [Tech Stack](#tech-stack)
-- [Docker Hub Metrics](#docker-hub-metrics)
 - [Contributing](#contributing)
 - [Contributors](#contributors)
+- [Downloads](#downloads)
 
 ## <a name="quick-start"></a>Quick Start
 
 ### Cluster overview
 
-| Application            | URL                                      | Description                                                   |
-| ---------------------- | ---------------------------------------- | ------------------------------------------------------------- |
-| JupyterLab             | [localhost:8888](http://localhost:8888/) | Cluster interface with Scala and PySpark built-in notebooks   |
-| Apache Spark Master    | [localhost:8080](http://localhost:8080/) | Spark Master node                                             |
-| Apache Spark Worker I  | [localhost:8081](http://localhost:8081/) | Spark Worker node with 1 core and 512m of memory (default)    |
-| Apache Spark Worker II | [localhost:8082](http://localhost:8082/) | Spark Worker node with 1 core and 512m of memory (default)    |
+| Application            | URL                                      | Description                                                          |
+| ---------------------- | ---------------------------------------- | -------------------------------------------------------------------- |
+| JupyterLab             | [localhost:8888](http://localhost:8888/) | Cluster interface with Scala, PySpark and SparkR built-in notebooks  |
+| Apache Spark Master    | [localhost:8080](http://localhost:8080/) | Spark Master node                                                    |
+| Apache Spark Worker I  | [localhost:8081](http://localhost:8081/) | Spark Worker node with 1 core and 512m of memory (default)           |
+| Apache Spark Worker II | [localhost:8082](http://localhost:8082/) | Spark Worker node with 1 core and 512m of memory (default)           |
 
 ### Prerequisites
 
@@ -54,7 +56,7 @@ docker-compose up
 docker-compose up
 ```
 
-4. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala and PySpark examples;
+4. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala, PySpark and SparkR examples;
 5. Stop the cluster by typing `ctrl+c`.
 
 ### Build from your local machine
@@ -82,7 +84,7 @@ chmod +x build.sh ; ./build.sh
 docker-compose up
 ```
 
-7. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala and PySpark examples;
+7. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala, PySpark and SparkR examples;
 8. Stop the cluster by typing `ctrl+c`.
 
 ## <a name="tech-stack"></a>Tech Stack
@@ -114,18 +116,20 @@ docker-compose up
 
 > Apache Spark R API (SparkR) is only supported on version **2.4.4**. Full list can be found [here](https://cran.r-project.org/src/contrib/Archive/SparkR/).
 
-## <a name="docker-hub-metrics"></a>Docker Hub Metrics
-
-| Image | Latest Version Size (Compressed) | Downloads |
-| -------------------------------------------------------------- | ------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
-| [JupyterLab](https://hub.docker.com/r/andreper/jupyterlab) | ![docker-size](https://img.shields.io/docker/image-size/andreper/jupyterlab/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/jupyterlab) |
-| [Spark Master](https://hub.docker.com/r/andreper/spark-master) | ![docker-size](https://img.shields.io/docker/image-size/andreper/spark-master/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-master) |
-| [Spark Worker](https://hub.docker.com/r/andreper/spark-worker) | ![docker-size](https://img.shields.io/docker/image-size/andreper/spark-worker/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-worker) |
-
 ## <a name="contributing"></a>Contributing
 
 We'd love some help. To contribute, please read [this file](CONTRIBUTING.md).
 
+> Starring us on GitHub is also an awesome way to show your support :star:
+
 ## <a name="contributors"></a>Contributors
 
-- **André Perez** - [dekoperez](https://twitter.com/dekoperez) - [email protected]
+- **André Perez** - [dekoperez](https://twitter.com/dekoperez) - [email protected]
+
+## <a name="downloads"></a>Downloads
+
+| Image | Latest Version Size (Compressed) | Downloads |
+| -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
+| [JupyterLab](https://hub.docker.com/r/andreper/jupyterlab) | ![docker-size-jupyterlab](https://img.shields.io/docker/image-size/andreper/jupyterlab/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/jupyterlab) |
+| [Spark Master](https://hub.docker.com/r/andreper/spark-master) | ![docker-size-master](https://img.shields.io/docker/image-size/andreper/spark-master/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-master) |
+| [Spark Worker](https://hub.docker.com/r/andreper/spark-worker) | ![docker-size-worker](https://img.shields.io/docker/image-size/andreper/spark-worker/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-worker) |
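
For readers landing here from the README changes above, a minimal sketch of what a SparkR session against this cluster looks like. This is illustrative only: the master URL, port and memory limit are taken from build/docker-compose.yml, and the notebook actually shipped by this commit may differ.

```r
# Connect the JupyterLab R kernel to the standalone Spark master.
# "spark-master" and 7077 match the service name and port in build/docker-compose.yml;
# executor memory must stay within the workers' SPARK_WORKER_MEMORY.
library(SparkR)

sparkR.session(
  appName = "sparkr-notebook",
  master = "spark://spark-master:7077",
  sparkConfig = list(spark.executor.memory = "512m")
)

df <- as.DataFrame(faithful)  # distribute a built-in R data set
head(df)

sparkR.session.stop()
```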

build/build.yml (+1 −1)

@@ -1,6 +1,6 @@
 applications:
   scala: "2.12.11"
-  spark: "3.0.0"
+  spark: "2.4.4"
   hadoop: "2.7"
   jupyterlab: "2.1.4"
 build:

build/docker-compose.yml (+4 −4)

@@ -9,22 +9,22 @@ volumes:
     driver: local
 services:
   jupyterlab:
-    image: jupyterlab:2.1.4-spark-3.0.0
+    image: jupyterlab:2.1.4-spark-2.4.4
     container_name: jupyterlab
     ports:
       - 8888:8888
     volumes:
       - shared-workspace:/opt/workspace
   spark-master:
-    image: spark-master:3.0.0-hadoop-2.7
+    image: spark-master:2.4.4-hadoop-2.7
     container_name: spark-master
     ports:
       - 8080:8080
       - 7077:7077
     volumes:
       - shared-workspace:/opt/workspace
   spark-worker-1:
-    image: spark-worker:3.0.0-hadoop-2.7
+    image: spark-worker:2.4.4-hadoop-2.7
     container_name: spark-worker-1
     environment:
       - SPARK_WORKER_CORES=1
@@ -36,7 +36,7 @@ services:
     depends_on:
       - spark-master
   spark-worker-2:
-    image: spark-worker:3.0.0-hadoop-2.7
+    image: spark-worker:2.4.4-hadoop-2.7
     container_name: spark-worker-2
     environment:
       - SPARK_WORKER_CORES=1
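
Since all three images are now pinned to Spark **2.4.4**, the SparkR package baked into the JupyterLab image has to agree with the cluster version. A quick sanity check from the R kernel (a hedged sketch, assuming a session as in the README example above):

```r
# Both values below should report 2.4.4 to match the pinned images.
library(SparkR)
sparkR.session(master = "spark://spark-master:7077")
sparkR.version()          # Spark version of the cluster the session talks to
packageVersion("SparkR")  # version of the locally installed SparkR package
sparkR.session.stop()
```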

build/docker/jupyterlab/Dockerfile (+6 −1)

@@ -33,9 +33,14 @@ RUN apt-get install -y ca-certificates-java --no-install-recommends && \
 
 # -- Layer: R kernel for SparkR
 
+COPY ./script/sparkr.sh ./sparkr.sh
+
 RUN apt-get install -y r-base-dev && \
     R -e "install.packages('IRkernel')" && \
-    R -e "IRkernel::installspec(displayname = 'R 3.5', user = FALSE)"
+    R -e "IRkernel::installspec(displayname = 'R 3.5', user = FALSE)" && \
+    chmod +x ./sparkr.sh && \
+    ./sparkr.sh ${spark_version} && \
+    rm -f sparkr.sh
 
 # -- Runtime
 
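
A hedged way to confirm this layer did its job from inside the built image; these are standard IRkernel/utils calls, not commands the Dockerfile itself runs:

```r
# The layer installs two things: the R Jupyter kernel and the SparkR package.
library(IRkernel)         # errors if the kernel package is missing
packageVersion("SparkR")  # should print the pinned Spark version, e.g. 2.4.4
```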

build/script/sparkr.sh (+20, new file)

@@ -0,0 +1,20 @@
+#!/bin/bash
+#
+# -- Download and install Apache Spark R API (SparkR)
+
+# ----------------------------------------------------------------------------------------------------------------------
+# -- Variables ---------------------------------------------------------------------------------------------------------
+# ----------------------------------------------------------------------------------------------------------------------
+
+SPARK_VERSION="${1}"
+
+# ----------------------------------------------------------------------------------------------------------------------
+# -- Main --------------------------------------------------------------------------------------------------------------
+# ----------------------------------------------------------------------------------------------------------------------
+
+# Only versions with a SparkR release in the CRAN archive are accepted
+# (dots escaped so the pattern matches literal version strings).
+if [[ "${SPARK_VERSION}" =~ ^(2\.1\.2|2\.3\.0|2\.4\.1|2\.4\.2|2\.4\.3|2\.4\.4|2\.4\.5|2\.4\.6)$ ]]
+then
+  curl https://cran.r-project.org/src/contrib/Archive/SparkR/SparkR_${SPARK_VERSION}.tar.gz -k -o sparkr.tar.gz
+  R CMD INSTALL sparkr.tar.gz
+  rm -f sparkr.tar.gz
+fi
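
For reference, the same download-and-install the script performs, done from within R rather than the shell. This is an illustrative equivalent, not what the image build runs; the build invokes the script from the Dockerfile with the version pinned in build/build.yml:

```r
# R equivalent of build/script/sparkr.sh: fetch the SparkR source package for a
# given Spark version from the CRAN archive and install it from the tarball.
spark_version <- "2.4.4"
url <- sprintf(
  "https://cran.r-project.org/src/contrib/Archive/SparkR/SparkR_%s.tar.gz",
  spark_version
)
download.file(url, destfile = "sparkr.tar.gz")
install.packages("sparkr.tar.gz", repos = NULL, type = "source")
unlink("sparkr.tar.gz")  # clean up the tarball, as the script does
```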

build/workspace/pyspark.ipynb (+14 −7)

@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# **PySpark**: The Spark Python API"
+    "# **PySpark**: The Apache Spark Python API"
    ]
   },
   {
@@ -33,7 +33,7 @@
     "\n",
     "+ **appName:** application name displayed at the [Spark Master Web UI](http://localhost:8080/);\n",
     "+ **master:** Spark Master URL, same used by Spark Workers;\n",
-    "+ **config:** must be less than or equals to docker compose SPARK_WORKER_MEMORY config."
+    "+ **spark.executor.memory:** must be less than or equal to the docker-compose SPARK_WORKER_MEMORY config."
    ]
   },
   {
@@ -52,6 +52,13 @@
     "    getOrCreate()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "More confs for the SparkSession object in standalone mode can be added using the **config** method. Check out the API docs [here](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession)."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -162,7 +169,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "unemployment = data.select(['Description', 'Population (GB+NI)', 'Unemployment rate'])"
+    "unemployment = data.select([\"Description\", \"Population (GB+NI)\", \"Unemployment rate\"])"
    ]
   },
   {
@@ -308,9 +315,9 @@
    "outputs": [],
    "source": [
     "unemployment = unemployment.\\\n",
-    "    withColumnRenamed('Description', 'year').\\\n",
-    "    withColumnRenamed('Population (GB+NI)', 'population').\\\n",
-    "    withColumnRenamed('Unemployment rate', 'unemployment_rate')"
+    "    withColumnRenamed(\"Description\", \"year\").\\\n",
+    "    withColumnRenamed(\"Population (GB+NI)\", \"population\").\\\n",
+    "    withColumnRenamed(\"Unemployment rate\", \"unemployment_rate\")"
    ]
   },
   {
@@ -365,7 +372,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "unemployment.repartition(1).write.csv(path=\"data/uk-macroeconomic-unemployment-data.csv\", sep=\",\", header=True, mode='overwrite')"
+    "unemployment.repartition(1).write.csv(path=\"data/uk-macroeconomic-unemployment-data.csv\", sep=\",\", header=True, mode=\"overwrite\")"
    ]
   }
 ],

build/workspace/scala.ipynb (+11 −4)

@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# The Spark Scala API"
+    "# The Apache Spark Scala API"
    ]
   },
   {
@@ -29,7 +29,7 @@
    "source": [
     "### 2.1. Get Spark\n",
     "\n",
-    "Let's start by importing Apache Spark from Maven repository (mind the version)."
+    "Let's start by importing Apache Spark from the Maven repository (mind the Apache Spark **version**)."
    ]
   },
   {
@@ -49,7 +49,7 @@
     }
    ],
    "source": [
-    "import $ivy.`org.apache.spark::spark-sql:3.0.0`;"
+    "import $ivy.`org.apache.spark::spark-sql:2.4.4`;"
    ]
   },
   {
@@ -91,7 +91,7 @@
     "\n",
     "+ **appName:** application name displayed at the [Spark Master Web UI](http://localhost:8080/);\n",
     "+ **master:** Spark Master URL, same used by Spark Workers;\n",
-    "+ **config:** must be less than or equals to docker compose SPARK_WORKER_MEMORY config."
+    "+ **spark.executor.memory:** must be less than or equal to the docker-compose SPARK_WORKER_MEMORY config."
    ]
   },
   {
@@ -110,6 +110,13 @@
     "    getOrCreate()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "More confs for the SparkSession object in standalone mode can be added using the **config** method. Check out the API docs [here](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html)."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
