Sample command-line programs for interacting with the Cloud Dataproc API.

Please see [the tutorial on using the Dataproc API with the Python client
library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example)
for more information.
Note that while this sample demonstrates interacting with Dataproc via the API, the functionality
demonstrated here could also be accomplished using the Cloud Console or the gcloud CLI.

`list_clusters.py` is a simple command-line program to demonstrate connecting to the
Dataproc API and listing the clusters in a region.
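For reference, the core of `list_clusters.py` boils down to a single `clusters.list` call
against the Dataproc v1 REST API. The sketch below is illustrative only and assumes the
discovery-based `google-api-python-client` and `google-auth` libraries (see
`requirements.txt` for the exact dependencies the samples use):

    # Illustrative sketch, not the shipped sample: list Dataproc clusters in one region.
    import google.auth
    from googleapiclient import discovery

    def list_clusters(project_id, region):
        # Application Default Credentials pick up the service account JSON key
        # referenced by GOOGLE_APPLICATION_CREDENTIALS.
        credentials, _ = google.auth.default()
        dataproc = discovery.build('dataproc', 'v1', credentials=credentials)

        request = dataproc.projects().regions().clusters().list(
            projectId=project_id, region=region)
        while request is not None:
            response = request.execute()
            for cluster in response.get('clusters', []):
                print(cluster['clusterName'], cluster['status']['state'])
            # Follow the page token, if any.
            request = dataproc.projects().regions().clusters().list_next(
                previous_request=request, previous_response=response)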
`submit_job_to_cluster.py` demonstrates how to create a cluster, submit the
`pyspark_sort.py` job, download the output from Google Cloud Storage, and output the result.

`pyspark_sort.py_gcs` is the same as `pyspark_sort.py` but demonstrates
reading from a GCS bucket.
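For a sense of scale, a job like `pyspark_sort.py` can be just a few lines of PySpark. The
sketch below is a rough approximation, not the exact contents of the bundled file:

    # Rough sketch of a PySpark sort job; the bundled pyspark_sort.py may differ.
    import pyspark

    sc = pyspark.SparkContext()
    # A GCS-reading variant would instead build the RDD with something like
    # sc.textFile('gs://<your-bucket>/input.txt').
    rdd = sc.parallelize(['Hello,', 'dog', 'elephant', 'panther', 'world!'])
    print(sorted(rdd.collect()))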
## Prerequisites to run locally:

* [pip](https://pypi.python.org/pypi/pip)
Go to the [Google Cloud Console](https://console.cloud.google.com).

Under API Manager, search for the Google Cloud Dataproc API and enable it.
## Set Up Your Local Dev Environment

To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
(recommended), run the commands within a virtualenv.

* pip install -r requirements.txt
## Authentication

Please see the [Google Cloud authentication guide](https://cloud.google.com/docs/authentication/).
The recommended approach to running these samples is to use a Service Account with a JSON key.

## Environment Variables

Set the following environment variables:

    GOOGLE_CLOUD_PROJECT=your-project-id
    REGION=us-central1 # or your region
    CLUSTER_NAME=your-cluster-name
    ZONE=us-central1-b
## Running the samples

To run list_clusters.py:

    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION
`submit_job_to_cluster.py` can create the Dataproc cluster or use an existing one.
If you'd like to create a cluster ahead of time, either use the
[Cloud Console](https://console.cloud.google.com) or run:

    gcloud dataproc clusters create your-cluster-name
To run submit_job_to_cluster.py, first create a GCS bucket for Dataproc to stage files,
from the Cloud Console or with gsutil:

    gsutil mb gs://<your-staging-bucket-name>
Set two more environment variables for the bucket and cluster name:

    BUCKET=your-staging-bucket
    CLUSTER=your-cluster-name
Then, if you want to rely on an existing cluster, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=$ZONE --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

Otherwise, if you want the script to create a new cluster for you:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=$ZONE --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

This will set up a cluster, upload the PySpark file, submit the job, print the result, and then
delete the cluster.
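At its core, the job-submission step is a single `jobs.submit` call against the Dataproc v1
REST API. The sketch below is illustrative only (it assumes the same discovery client as the
earlier sketch, the environment variables set above, and that `pyspark_sort.py` has already
been uploaded to the bucket); the shipped script wraps this call with cluster creation,
output download, and cleanup:

    # Illustrative sketch of submitting a PySpark job with the Dataproc v1 API.
    import os

    import google.auth
    from googleapiclient import discovery

    credentials, _ = google.auth.default()
    dataproc = discovery.build('dataproc', 'v1', credentials=credentials)

    project_id = os.environ['GOOGLE_CLOUD_PROJECT']
    region = os.environ['REGION']

    job = {
        'placement': {'clusterName': os.environ['CLUSTER']},
        # Assumes pyspark_sort.py is already staged in the bucket.
        'pysparkJob': {
            'mainPythonFileUri': 'gs://{}/pyspark_sort.py'.format(os.environ['BUCKET'])
        },
    }
    result = dataproc.projects().regions().jobs().submit(
        projectId=project_id, region=region, body={'job': job}).execute()
    print('Submitted job {}'.format(result['reference']['jobId']))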
You can optionally specify a `--pyspark_file` argument to change from the default
`pyspark_sort.py` included in this script to a new script.