This repository was archived by the owner on Nov 29, 2023. It is now read-only.

Commit ccf7fb0

waprin authored and Jon Wayne Parrott committed
Dataproc GCS sample plus doc touchups [(#1151)](GoogleCloudPlatform/python-docs-samples#1151)
1 parent 3b59b75 commit ccf7fb0

File tree

3 files changed: +74, -27 lines


samples/snippets/README.md

Lines changed: 43 additions & 26 deletions
@@ -2,15 +2,23 @@

 Sample command-line programs for interacting with the Cloud Dataproc API.

+
+Please see [the tutorial on using the Dataproc API with the Python client
+library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example)
+for more information.
+
 Note that while this sample demonstrates interacting with Dataproc via the API, the functionality
 demonstrated here could also be accomplished using the Cloud Console or the gcloud CLI.

 `list_clusters.py` is a simple command-line program to demonstrate connecting to the
 Dataproc API and listing the clusters in a region

-`create_cluster_and_submit_job.py` demonstrates how to create a cluster, submit the
+`create_cluster_and_submit_job.py` demonstrates how to create a cluster, submit the
 `pyspark_sort.py` job, download the output from Google Cloud Storage, and output the result.

+`pyspark_sort_gcs.py` is the same as `pyspark_sort.py` but demonstrates
+reading from a GCS bucket.
+
 ## Prerequisites to run locally:

 * [pip](https://pypi.python.org/pypi/pip)
@@ -19,50 +27,59 @@ Go to the [Google Cloud Console](https://console.cloud.google.com).
 Under API Manager, search for the Google Cloud Dataproc API and enable it.

+## Set Up Your Local Dev Environment

-# Set Up Your Local Dev Environment
 To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
 (recommended), run the commands within a virtualenv.

 * pip install -r requirements.txt

-Create local credentials by running the following command and following the oauth2 flow:
+## Authentication
+
+Please see the [Google Cloud authentication guide](https://cloud.google.com/docs/authentication/).
+The recommended approach to running these samples is a Service Account with a JSON key.
+
+## Environment Variables

-    gcloud beta auth application-default login
+Set the following environment variables:
+
+    GOOGLE_CLOUD_PROJECT=your-project-id
+    REGION=us-central1 # or your region
+    CLUSTER_NAME=waprin-spark7
+    ZONE=us-central1-b
+
+## Running the samples

 To run list_clusters.py:

-    python list_clusters.py <YOUR-PROJECT-ID> --region=us-central1
+    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

+`submit_job_to_cluster.py` can create the Dataproc cluster, or use an existing one.
+If you'd like to create a cluster ahead of time, either use the
+[Cloud Console](https://console.cloud.google.com) or run:

-To run submit_job_to_cluster.py, first create a GCS bucket, from the Cloud Console or with
-gsutil:
+    gcloud dataproc clusters create your-cluster-name

-    gsutil mb gs://<your-input-bucket-name>
-
-Then, if you want to rely on an existing cluster, run:
-
-    python submit_job_to_cluster.py --project_id=<your-project-id> --zone=us-central1-b --cluster_name=testcluster --gcs_bucket=<your-input-bucket-name>
-
-Otherwise, if you want the script to create a new cluster for you:
+To run submit_job_to_cluster.py, first create a GCS bucket for Dataproc to stage files, from the Cloud Console or with
+gsutil:

-    python submit_job_to_cluster.py --project_id=<your-project-id> --zone=us-central1-b --cluster_name=testcluster --gcs_bucket=<your-input-bucket-name> --create_new_cluster
+    gsutil mb gs://<your-staging-bucket-name>

+Then set these environment variables:

-This will setup a cluster, upload the PySpark file, submit the job, print the result, then
-delete the cluster.
+    BUCKET=your-staging-bucket
+    CLUSTER=your-cluster-name

-You can optionally specify a `--pyspark_file` argument to change from the default
-`pyspark_sort.py` included in this script to a new script.
+Then, if you want to rely on an existing cluster, run:

-## Running on GCE, GAE, or other environments
+    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

-On Google App Engine, the credentials should be found automatically.
+Otherwise, if you want the script to create a new cluster for you:

-On Google Compute Engine, the credentials should be found automatically, but require that
-you create the instance with the correct scopes.
+    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

-    gcloud compute instances create --scopes="https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/compute,https://www.googleapis.com/auth/compute.readonly" test-instance
+This will set up a cluster, upload the PySpark file, submit the job, print the result, and then
+delete the cluster.

-If you did not create the instance with the right scopes, you can still upload a JSON service
-account and set `GOOGLE_APPLICATION_CREDENTIALS`. See [Google Application Default Credentials](https://developers.google.com/identity/protocols/application-default-credentials) for more details.
+You can optionally specify a `--pyspark_file` argument to change from the default
+`pyspark_sort.py` included in this script to a new script.
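
For context on the `list_clusters.py` step documented above, here is a minimal sketch of listing Dataproc clusters with the discovery-based `google-api-python-client`. It is not code from this commit, and the project and region values are illustrative placeholders.

```python
# Minimal sketch (not part of this commit): list Dataproc clusters using the
# discovery-based google-api-python-client. Assumes Application Default
# Credentials are configured; project and region below are placeholders.
from googleapiclient import discovery

dataproc = discovery.build('dataproc', 'v1')
result = dataproc.projects().regions().clusters().list(
    projectId='your-project-id', region='us-central1').execute()

for cluster in result.get('clusters', []):
    print(cluster['clusterName'], cluster['status']['state'])
```

Running `python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION`, as documented in the README, performs the same listing from the command line.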

samples/snippets/pyspark_sort.py

Lines changed: 1 addition & 1 deletion
@@ -24,5 +24,5 @@
 sc = pyspark.SparkContext()
 rdd = sc.parallelize(['Hello,', 'world!', 'dog', 'elephant', 'panther'])
 words = sorted(rdd.collect())
-print words
+print(words)
 # [END pyspark]
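
The README above says `submit_job_to_cluster.py` downloads the job output from Google Cloud Storage and prints it. A hedged sketch of that download step with the `google-cloud-storage` client follows; the bucket name and object path are hypothetical placeholders, not values from this commit.

```python
# Hedged sketch (not part of this commit): read a job's driver output from the
# staging bucket with the google-cloud-storage client. The bucket name and
# object path are hypothetical placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('your-staging-bucket-name')
blob = bucket.blob('path/to/driver-output')  # placeholder object path
print(blob.download_as_string().decode('utf-8'))
```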

samples/snippets/pyspark_sort_gcs.py

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+#!/usr/bin/env python
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" Sample pyspark script to be uploaded to Cloud Storage and run on
+Cloud Dataproc.
+
+Note this file is not intended to be run directly, but run inside a PySpark
+environment.
+
+This file demonstrates how to read from a GCS bucket. See README.md for more
+information.
+"""
+
+# [START pyspark]
+import pyspark
+
+sc = pyspark.SparkContext()
+rdd = sc.textFile('gs://path-to-your-GCS-file')
+print(sorted(rdd.collect()))
+# [END pyspark]
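
Once `pyspark_sort_gcs.py` has been uploaded to a bucket, a job can be submitted against a running cluster through the Dataproc `jobs.submit` API. The sketch below uses the discovery-based client; it is not part of this commit, and the project, region, cluster, and bucket values are placeholders.

```python
# Minimal sketch (not part of this commit): submit pyspark_sort_gcs.py, already
# uploaded to GCS, as a PySpark job on an existing cluster. Project, region,
# cluster, and bucket values are placeholders.
from googleapiclient import discovery

dataproc = discovery.build('dataproc', 'v1')
job_details = {
    'job': {
        'placement': {'clusterName': 'your-cluster-name'},
        'pysparkJob': {
            'mainPythonFileUri': 'gs://your-staging-bucket/pyspark_sort_gcs.py',
        },
    },
}
result = dataproc.projects().regions().jobs().submit(
    projectId='your-project-id', region='us-central1',
    body=job_details).execute()
print('Submitted job:', result['reference']['jobId'])
```

Roughly the same submission can be done from the command line with `gcloud dataproc jobs submit pyspark gs://your-staging-bucket/pyspark_sort_gcs.py --cluster=your-cluster-name` (placeholder names).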
