@@ -13,10 +13,9 @@ As you follow this walkthrough, you run Python code that calls
[Dataproc gRPC APIs](https://cloud.google.com/dataproc/docs/reference/rpc/)
to:

- * create a Dataproc cluster
- * submit a small PySpark word sort job to run on the cluster
- * get job status
- * tear down the cluster after job completion
+ * Create a Dataproc cluster
+ * Submit a PySpark word sort job to the cluster
+ * Delete the cluster after job completion

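For orientation, the sketch below shows the kind of client-library calls such a script can make for these three steps. It is illustrative only: it assumes the `google-cloud-dataproc` Python client (v2 or later) and placeholder values for the project, region, cluster, and bucket, and it is not the exact code in the walkthrough's `submit_job_to_cluster.py`.

```python
# Illustrative only: assumes the google-cloud-dataproc client library (v2+)
# and placeholder values; the walkthrough's submit_job_to_cluster.py differs.
from google.cloud import dataproc_v1

project_id = "your-project-id"      # placeholder
region = "us-central1"              # placeholder
cluster_name = "new-cluster-name"   # placeholder
bucket = "your-bucket-name"         # placeholder
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

# Create a Dataproc cluster and wait for the operation to finish.
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
cluster_client.create_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster": {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
            },
        },
    }
).result()

# Submit the PySpark word sort job and block until it completes.
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)
job_client.submit_job_as_operation(
    request={
        "project_id": project_id,
        "region": region,
        "job": {
            "placement": {"cluster_name": cluster_name},
            "pyspark_job": {"main_python_file_uri": f"gs://{bucket}/pyspark_sort.py"},
        },
    }
).result()

# Delete the cluster after the job completes.
cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()
```
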
## Using the walkthrough

@@ -32,144 +31,127 @@ an explanation of how the code works.

    cloudshell launch-tutorial python-api-walkthrough.md

- **To copy and run commands**: Click the "Paste in Cloud Shell" button
+ **To copy and run commands**: Click the "Copy to Cloud Shell" button
  (<walkthrough-cloud-shell-icon></walkthrough-cloud-shell-icon>)
  on the side of a code box, then press `Enter` to run the command.

## Prerequisites (1)

- <walkthrough-watcher-constant key="project_id" value="<project_id>"
- ></walkthrough-watcher-constant>
+ <walkthrough-watcher-constant key="project_id" value="<project_id>"></walkthrough-watcher-constant>

1. Create or select a Google Cloud project to use for this
-    tutorial.
-    * <walkthrough-project-setup billing="true"></walkthrough-project-setup>
+    tutorial.
+    * <walkthrough-project-setup billing="true"></walkthrough-project-setup>

1. Enable the Dataproc, Compute Engine, and Cloud Storage APIs in your
-    project.
-    ```sh
-    gcloud services enable dataproc.googleapis.com \
-      compute.googleapis.com \
-      storage-component.googleapis.com \
-      --project={{project_id}}
-    ```
+    project.
+
+    ```bash
+    gcloud services enable dataproc.googleapis.com \
+      compute.googleapis.com \
+      storage-component.googleapis.com \
+      --project={{project_id}}
+    ```

## Prerequisites (2)

1. This walkthrough uploads a PySpark file (`pyspark_sort.py`) to a
   [Cloud Storage bucket](https://cloud.google.com/storage/docs/key-terms#buckets) in
   your project (a sketch of this job appears at the end of this section).
   * You can use the [Cloud Storage browser page](https://console.cloud.google.com/storage/browser)
-     in Google Cloud Platform Console to view existing buckets in your project.
+     in Google Cloud Console to view existing buckets in your project.

-     &nbsp;&nbsp;&nbsp;&nbsp;**OR**
+     **OR**

   * To create a new bucket, run the following command. Your bucket name must be unique.
-     ```bash
-     gsutil mb -p {{project-id}} gs://your-bucket-name
-     ```

- 1. Set environment variables.
+        gsutil mb -p {{project-id}} gs://your-bucket-name
+
-    * Set the name of your bucket.
-      ```bash
-      BUCKET=your-bucket-name
-      ```
+ 2. Set environment variables.
+    * Set the name of your bucket.
+
+        BUCKET=your-bucket-name

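For reference, here is a minimal, illustrative sketch of what a word-sort job like `pyspark_sort.py` can look like. The word list is an assumption chosen to match the sample output shown later in this walkthrough; the actual file in the sample repository may differ.

```python
#!/usr/bin/env python
# Illustrative sketch of a PySpark word-sort job; the actual pyspark_sort.py
# in the sample repository may differ.
import pyspark

sc = pyspark.SparkContext()
# Distribute a small list of words, collect it back, and print it in sorted order.
rdd = sc.parallelize(["Hello,", "world!", "dog", "elephant", "panther"])
print(sorted(rdd.collect()))
```
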
## Prerequisites (3)

1. Set up a Python
-  [virtual environment](https://virtualenv.readthedocs.org/en/latest/)
-  in Cloud Shell.
+  [virtual environment](https://virtualenv.readthedocs.org/en/latest/).

   * Create the virtual environment.
-     ```bash
-     virtualenv ENV
-     ```
+
+        virtualenv ENV
+
   * Activate the virtual environment.
-     ```bash
-     source ENV/bin/activate
-     ```
+
+        source ENV/bin/activate

- 1. Install library dependencies in Cloud Shell.
-    ```bash
-    pip install -r requirements.txt
-    ```
+ 1. Install library dependencies.
+
+        pip install -r requirements.txt

## Create a cluster and submit a job

1. Set a name for your new cluster.
-    ```bash
-    CLUSTER=new-cluster-name
-    ```

- 1. Set a [zone](https://cloud.google.com/compute/docs/regions-zones/#available)
-    where your new cluster will be located. You can change the
-    "us-central1-a" zone that is pre-set in the following command.
-    ```bash
-    ZONE=us-central1-a
-    ```
+        CLUSTER=new-cluster-name

- 1. Run `submit_job.py` with the `--create_new_cluster` flag
-    to create a new cluster and submit the `pyspark_sort.py` job
-    to the cluster.
+ 1. Set a [region](https://cloud.google.com/compute/docs/regions-zones/#available)
+    where your new cluster will be located. You can change the pre-set
+    "us-central1" region before you copy and run the following command.

-    ```bash
-    python submit_job_to_cluster.py \
-        --project_id={{project-id}} \
-        --cluster_name=$CLUSTER \
-        --zone=$ZONE \
-        --gcs_bucket=$BUCKET \
-        --create_new_cluster
-    ```
+        REGION=us-central1
+
+ 1. Run `submit_job_to_cluster.py` to create a new cluster and run the
+    `pyspark_sort.py` job on the cluster.
+
+        python submit_job_to_cluster.py \
+            --project_id={{project-id}} \
+            --cluster_name=$CLUSTER \
+            --region=$REGION \
+            --gcs_bucket=$BUCKET

## Job Output

- Job output in Cloud Shell shows cluster creation, job submission,
- job completion, and then tear-down of the cluster.
-
-     ...
-     Creating cluster...
-     Cluster created.
-     Uploading pyspark file to Cloud Storage.
-     new-cluster-name - RUNNING
-     Submitted job ID ...
-     Waiting for job to finish...
-     Job finished.
-     Downloading output file
-     .....
-     ['Hello,', 'dog', 'elephant', 'panther', 'world!']
-     ...
-     Tearing down cluster
-     ```
- ## Congratulations on Completing the Walkthrough!
+ Job output displayed in the Cloud Shell terminal shows cluster creation,
+ job completion, sorted job output, and then deletion of the cluster.
+
+ ```
+ Cluster created successfully: cluster-name.
+ ...
+ Job finished successfully.
+ ...
+ ['Hello,', 'dog', 'elephant', 'panther', 'world!']
+ ...
+ Cluster cluster-name successfully deleted.
+ ```
+
+ ## Congratulations on completing the walkthrough!

<walkthrough-conclusion-trophy></walkthrough-conclusion-trophy>

---

### Next Steps:

- * **View job details from the Console.** View job details by selecting the
-   PySpark job from the Dataproc
-   =
+ * **View job details in the Cloud Console.** View job details by selecting the
+   PySpark job name on the Dataproc
  [Jobs page](https://console.cloud.google.com/dataproc/jobs)
- in the Google Cloud Platform Console.
+ in the Cloud console.

* **Delete resources used in the walkthrough.**
- The `submit_job_to_cluster.py` job deletes the cluster that it created for this
+ The `submit_job_to_cluster.py` code deletes the cluster that it created for this
  walkthrough.

- If you created a bucket to use for this walkthrough,
- you can run the following command to delete the
- Cloud Storage bucket (the bucket must be empty).
- ```bash
- gsutil rb gs://$BUCKET
- ```
- You can run the following command to delete the bucket **and all
- objects within it. Note: the deleted objects cannot be recovered.**
- ```bash
- gsutil rm -r gs://$BUCKET
- ```
+ If you created a Cloud Storage bucket to use for this walkthrough,
+ you can run the following command to delete the bucket (the bucket must be empty).
+
+     gsutil rb gs://$BUCKET
+
+ * You can run the following command to **delete the bucket and all
+ objects within it. Note: the deleted objects cannot be recovered.**
+
+     gsutil rm -r gs://$BUCKET
+

* **For more information.** See the [Dataproc documentation](https://cloud.google.com/dataproc/docs/)
  for API reference and product feature information.