
Commit 75189d8

Author: Bill Prin
Dataproc GCS sample plus doc touchups

1 parent 1f8b255

3 files changed: 75 additions & 16 deletions


dataproc/README.md

Lines changed: 44 additions & 15 deletions
@@ -8,7 +8,7 @@ demonstrated here could also be accomplished using the Cloud Console or the gcloud
 `list_clusters.py` is a simple command-line program to demonstrate connecting to the
 Dataproc API and listing the clusters in a region

-`create_cluster_and_submit_job.py` demonstrates how to create a cluster, submit the
+`create_cluster_and_submit_job.py` demonstrates how to create a cluster, submit the
 `pyspark_sort.py` job, download the output from Google Cloud Storage, and output the result.

 ## Prerequisites to run locally:
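The README lines in the hunk above describe `list_clusters.py`, which connects to the Dataproc API and lists the clusters in a region. For orientation only, a minimal sketch of such a listing with the google-api-python-client might look like the following; the argument handling and printed fields are assumptions, not code from this commit:

    # list_clusters sketch -- illustrative only, not part of this commit.
    # Assumes google-api-python-client is installed and Application Default
    # Credentials are configured (see the README setup section below).
    import argparse

    from googleapiclient import discovery


    def list_clusters(project_id, region):
        # discovery.build() picks up Application Default Credentials automatically.
        dataproc = discovery.build('dataproc', 'v1')
        response = dataproc.projects().regions().clusters().list(
            projectId=project_id, region=region).execute()
        for cluster in response.get('clusters', []):
            print('{} - {}'.format(cluster['clusterName'],
                                   cluster['status']['state']))


    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('project_id')
        parser.add_argument('--region', default='us-central1')
        args = parser.parse_args()
        list_clusters(args.project_id, args.region)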
@@ -20,49 +20,78 @@ Go to the [Google Cloud Console](https://console.cloud.google.com).
 Under API Manager, search for the Google Cloud Dataproc API and enable it.


-# Set Up Your Local Dev Environment
+## Set Up Your Local Dev Environment
+
 To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
 (recommended), run the commands within a virtualenv.

 * pip install -r requirements.txt

 Create local credentials by running the following command and following the oauth2 flow:

-    gcloud beta auth application-default login
+    gcloud auth application-default login
+
+Set the following environment variables:
+
+    GOOGLE_CLOUD_PROJECT=your-project-id
+    REGION=us-central1 # or your region
+    CLUSTER_NAME=waprin-spark7
+    ZONE=us-central1-b

 To run list_clusters.py:

-    python list_clusters.py <YOUR-PROJECT-ID> --region=us-central1
+    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION
+
+`submit_job_to_cluster.py` can create the Dataproc cluster, or use an existing one.
+If you'd like to create a cluster ahead of time, either use the
+[Cloud Console](https://console.cloud.google.com) or run:

+    gcloud dataproc clusters create your-cluster-name

-To run submit_job_to_cluster.py, first create a GCS bucket, from the Cloud Console or with
+To run submit_job_to_cluster.py, first create a GCS bucket for Dataproc to stage files, from the Cloud Console or with
 gsutil:

-    gsutil mb gs://<your-input-bucket-name>
-
+    gsutil mb gs://<your-staging-bucket-name>
+
+Set these additional environment variables:
+
+    BUCKET=your-staging-bucket
+    CLUSTER=your-cluster-name
+
 Then, if you want to rely on an existing cluster, run:
-
-    python submit_job_to_cluster.py --project_id=<your-project-id> --zone=us-central1-b --cluster_name=testcluster --gcs_bucket=<your-input-bucket-name>
-
-Otherwise, if you want the script to create a new cluster for you:

-    python submit_job_to_cluster.py --project_id=<your-project-id> --zone=us-central1-b --cluster_name=testcluster --gcs_bucket=<your-input-bucket-name> --create_new_cluster
+    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET
+
+Otherwise, if you want the script to create a new cluster for you:

+    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

 This will set up a cluster, upload the PySpark file, submit the job, print the result, then
 delete the cluster.

-You can optionally specify a `--pyspark_file` argument to change from the default
+You can optionally specify a `--pyspark_file` argument to change from the default
 `pyspark_sort.py` included in this script to a new script.

+## Reading Data from Google Cloud Storage
+
+Included in this directory is `pyspark_sort_gcs.py`, which demonstrates how
+you might read a file from Google Cloud Storage. To use it, replace
+`path-to-your-GCS-file` with the path to the text input the job sorts.
+
+On Cloud Dataproc, the [GCS Connector](https://cloud.google.com/dataproc/docs/connectors/cloud-storage)
+is automatically installed. This means anywhere you read from a path starting with `gs://`,
+Spark will automatically know how to read from the GCS bucket. If you wish to use GCS with another Spark installation,
+including locally, you will have to [install the connector](https://cloud.google.com/dataproc/docs/connectors/install-storage-connector).
+
+
 ## Running on GCE, GAE, or other environments

 On Google App Engine, the credentials should be found automatically.

 On Google Compute Engine, the credentials should be found automatically, but require that
-you create the instance with the correct scopes.
+you create the instance with the correct scopes.

     gcloud compute instances create --scopes="https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/compute,https://www.googleapis.com/auth/compute.readonly" test-instance

-If you did not create the instance with the right scopes, you can still upload a JSON service
+If you did not create the instance with the right scopes, you can still upload a JSON service
 account and set `GOOGLE_APPLICATION_CREDENTIALS`. See [Google Application Default Credentials](https://developers.google.com/identity/protocols/application-default-credentials) for more details.
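As the README changes above explain, `submit_job_to_cluster.py` uploads the PySpark file to the staging bucket and submits it to the cluster. Under the Dataproc v1 REST API, that submission is essentially a `jobs().submit()` call pointing at the staged file. A hedged sketch follows, with placeholder names and without the upload, result download, and cluster cleanup steps the real script performs:

    # Sketch of submitting a PySpark job via the Dataproc v1 API.
    # Illustrative only; bucket, cluster, and file names are placeholders.
    from googleapiclient import discovery


    def submit_pyspark_job(project_id, region, cluster_name, bucket, filename):
        dataproc = discovery.build('dataproc', 'v1')
        job_details = {
            'job': {
                'placement': {'clusterName': cluster_name},
                'pysparkJob': {
                    # The script must already be staged in the GCS bucket.
                    'mainPythonFileUri': 'gs://{}/{}'.format(bucket, filename)
                }
            }
        }
        result = dataproc.projects().regions().jobs().submit(
            projectId=project_id, region=region, body=job_details).execute()
        return result['reference']['jobId']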
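On the credentials note at the end of the README: once Application Default Credentials are available, whether from instance scopes or from `GOOGLE_APPLICATION_CREDENTIALS`, client libraries locate them without extra configuration. A quick, hedged way to verify this from Python, using the google-auth package (an assumption, not a dependency stated in this commit):

    # Sanity check that Application Default Credentials can be found.
    import google.auth

    credentials, project_id = google.auth.default()
    print('Found application default credentials for project: {}'.format(project_id))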

dataproc/pyspark_sort.py

Lines changed: 1 addition & 1 deletion
@@ -24,5 +24,5 @@
 sc = pyspark.SparkContext()
 rdd = sc.parallelize(['Hello,', 'world!', 'dog', 'elephant', 'panther'])
 words = sorted(rdd.collect())
-print words
+print(words)
 # [END pyspark]

dataproc/pyspark_sort_gcs.py

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+#!/usr/bin/env python
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" Sample pyspark script to be uploaded to Cloud Storage and run on
+Cloud Dataproc.
+
+Note this file is not intended to be run directly, but run inside a PySpark
+environment.
+
+This file demonstrates how to read from a GCS bucket. See README.md for more
+information.
+"""
+
+# [START pyspark]
+import pyspark
+
+sc = pyspark.SparkContext()
+rdd = sc.textFile('gs://path-to-your-GCS-file')
+print(sorted(rdd.collect()))
+# [END pyspark]
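Because the GCS connector implements a Hadoop-compatible filesystem, `gs://` paths work for writes as well as reads on Dataproc. A small variation on the script above writes the sorted result back to a bucket instead of printing it; the bucket and object names are placeholders, and the write step is an assumption beyond what the sample demonstrates:

    # Sketch: read a text file from GCS, sort it, and write the result back.
    # Paths are placeholders; the output directory must not already exist.
    import pyspark

    sc = pyspark.SparkContext()
    rdd = sc.textFile('gs://your-bucket/input.txt')
    sorted_rdd = sc.parallelize(sorted(rdd.collect()))
    sorted_rdd.saveAsTextFile('gs://your-bucket/sorted-output')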
