`list_clusters.py` is a simple command-line program to demonstrate connecting to the
Dataproc API and listing the clusters in a region.

`submit_job_to_cluster.py` demonstrates how to create a cluster, submit the
`pyspark_sort.py` job, download the output from Google Cloud Storage, and output the result.

## Prerequisites to run locally:

Go to the [Google Cloud Console](https://console.cloud.google.com).

Under API Manager, search for the Google Cloud Dataproc API and enable it.

## Set Up Your Local Dev Environment

To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
(recommended), run the commands within a virtualenv.

* pip install -r requirements.txt

Create local credentials by running the following command and following the oauth2 flow:

    gcloud auth application-default login

Set the following environment variables:

    GOOGLE_CLOUD_PROJECT=your-project-id
    REGION=us-central1   # or your region
    CLUSTER=your-cluster-name
    ZONE=us-central1-b

To run `list_clusters.py`:

    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

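
Under the hood, listing clusters is a single API call. The following is a minimal
sketch, assuming the `google-api-python-client` discovery client and placeholder
project/region values; the sample itself may be structured differently:

    # Minimal sketch (not the sample's exact code): list the Dataproc clusters
    # in a region using the discovery-based client and the Application Default
    # Credentials set up above.
    from googleapiclient import discovery

    dataproc = discovery.build('dataproc', 'v1')
    response = dataproc.projects().regions().clusters().list(
        projectId='your-project-id',    # placeholder
        region='us-central1').execute()

    for cluster in response.get('clusters', []):
        print(cluster['clusterName'])
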
`submit_job_to_cluster.py` can create the Dataproc cluster or use an existing one.
If you'd like to create a cluster ahead of time, either use the
[Cloud Console](https://console.cloud.google.com) or run:

    gcloud dataproc clusters create $CLUSTER

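
Creating a cluster can also be done through the API itself. Below is a rough,
hedged sketch using the discovery client; the project, cluster name, and zone are
placeholders, and `submit_job_to_cluster.py` may configure its clusters differently:

    # Rough sketch (placeholder values): create a Dataproc cluster with a
    # minimal configuration via the v1 API. The call returns a long-running
    # operation; waiting for it to finish is omitted here.
    from googleapiclient import discovery

    dataproc = discovery.build('dataproc', 'v1')
    cluster_data = {
        'projectId': 'your-project-id',       # placeholder
        'clusterName': 'your-cluster-name',   # placeholder
        'config': {
            'gceClusterConfig': {
                'zoneUri': 'us-central1-b'    # placeholder zone
            }
        }
    }
    dataproc.projects().regions().clusters().create(
        projectId='your-project-id',
        region='us-central1',
        body=cluster_data).execute()
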
To run `submit_job_to_cluster.py`, first create a GCS bucket for Dataproc to stage files in,
either from the Cloud Console or with gsutil:

    gsutil mb gs://<your-staging-bucket-name>

Set the following environment variable to the name of the bucket you created:

    BUCKET=your-staging-bucket

Then, if you want to rely on an existing cluster, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=$ZONE --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

Otherwise, if you want the script to create a new cluster for you:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=$ZONE --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

This will set up a cluster, upload the PySpark file, submit the job, print the result, and then
delete the cluster.

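
At the API level, the job-submission step boils down to a `jobs.submit` call. This
is a hedged sketch with placeholder names; the actual script also polls the job for
completion and downloads the driver output from GCS, which is omitted here:

    # Sketch only (placeholder project, cluster, and bucket): submit a PySpark
    # job to an existing cluster via the Dataproc v1 API.
    from googleapiclient import discovery

    dataproc = discovery.build('dataproc', 'v1')
    job_details = {
        'job': {
            'placement': {'clusterName': 'your-cluster-name'},
            'pysparkJob': {
                'mainPythonFileUri': 'gs://your-staging-bucket/pyspark_sort.py'
            }
        }
    }
    result = dataproc.projects().regions().jobs().submit(
        projectId='your-project-id',
        region='us-central1',
        body=job_details).execute()
    print('Submitted job {}'.format(result['reference']['jobId']))
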
You can optionally specify a `--pyspark_file` argument to change from the default
`pyspark_sort.py` included with this sample to a different PySpark script.

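
For reference, the default job is a very small PySpark program. An illustrative
sketch of a sort job along those lines (not necessarily identical to the bundled
`pyspark_sort.py`):

    # Illustrative sketch of a tiny PySpark sort job: sort a handful of words
    # and print the result, which ends up in the job's driver output.
    import pyspark

    sc = pyspark.SparkContext()
    rdd = sc.parallelize(['Hello,', 'dog', 'elephant', 'panther', 'world!'])
    print(sorted(rdd.collect()))
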
## Reading Data from Google Cloud Storage

Included in this directory is `pyspark_sort_gcs.py`, which demonstrates how
you might read a file from Google Cloud Storage. To use it, replace
`path-to-your-GCS-file` with the path to a text file in GCS; the file's contents are the input the job sorts.

On Cloud Dataproc, the [GCS Connector](https://cloud.google.com/dataproc/docs/connectors/cloud-storage)
is automatically installed, so anywhere you read from a path starting with `gs://`,
Spark knows how to read from the GCS bucket. If you wish to use GCS with another Spark installation,
including a local one, you will have to [install the connector](https://cloud.google.com/dataproc/docs/connectors/install-storage-connector).

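
To illustrate, reading a `gs://` path from PySpark is an ordinary `textFile` call.
This sketch uses a placeholder path and may differ in detail from
`pyspark_sort_gcs.py`:

    # Sketch with a placeholder path: sort the lines of a text file stored in GCS.
    import pyspark

    sc = pyspark.SparkContext()
    rdd = sc.textFile('gs://path-to-your-GCS-file')  # e.g. gs://your-bucket/input.txt
    print(sorted(rdd.collect()))
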
## Running on GCE, GAE, or other environments

On Google App Engine, the credentials should be found automatically.

On Google Compute Engine, the credentials should be found automatically, but require that
you create the instance with the correct scopes:

    gcloud compute instances create --scopes="https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/compute,https://www.googleapis.com/auth/compute.readonly" test-instance

If you did not create the instance with the right scopes, you can still upload a JSON service
account key and set `GOOGLE_APPLICATION_CREDENTIALS`. See [Google Application Default Credentials](https://developers.google.com/identity/protocols/application-default-credentials) for more details.
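
One way to point the client at such a key from Python, shown here as a small sketch
with a placeholder path, is to set the variable before building the client:

    # Sketch: point Application Default Credentials at an uploaded service
    # account key (placeholder path) before creating the Dataproc client.
    import os
    from googleapiclient import discovery

    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account-key.json'
    dataproc = discovery.build('dataproc', 'v1')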