
Commit d864ffa

aman-ebay, meredithslota, and loferris authored
Update python-api-walkthrough.md (#398)
Co-authored-by: meredithslota <[email protected]>
Co-authored-by: Lo Ferris <[email protected]>
1 parent cca6c6a commit d864ffa

1 file changed

dataproc/snippets/python-api-walkthrough.md

Lines changed: 74 additions & 92 deletions
As you follow this walkthrough, you run Python code that calls
[Dataproc gRPC APIs](https://cloud.google.com/dataproc/docs/reference/rpc/)
to:

* Create a Dataproc cluster
* Submit a PySpark word sort job to the cluster
* Delete the cluster after job completion
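
For reference, here is a minimal sketch of those three calls using the
`google-cloud-dataproc` client library. This is not the walkthrough's
`submit_job_to_cluster.py`; the project, cluster name, machine types, and
cluster config below are illustrative placeholders.

```python
# Hedged sketch: create a cluster, run a PySpark job, delete the cluster.
from google.cloud import dataproc_v1

project_id, region = "your-project-id", "us-central1"  # illustrative values
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

clusters = dataproc_v1.ClusterControllerClient(client_options=endpoint)
jobs = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Create a small cluster and block until it is ready.
cluster = {
    "project_id": project_id,
    "cluster_name": "new-cluster-name",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
clusters.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# 2. Submit the PySpark job and wait for it to finish.
job = {
    "placement": {"cluster_name": "new-cluster-name"},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket-name/pyspark_sort.py"},
}
jobs.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()

# 3. Delete the cluster after the job completes.
clusters.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": "new-cluster-name"}
).result()
```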

## Using the walkthrough

...an explanation of how the code works.

    cloudshell launch-tutorial python-api-walkthrough.md

**To copy and run commands**: Click the "Copy to Cloud Shell" button
(<walkthrough-cloud-shell-icon></walkthrough-cloud-shell-icon>)
on the side of a code box, then press `Enter` to run the command.

## Prerequisites (1)

<walkthrough-watcher-constant key="project_id" value="<project_id>"></walkthrough-watcher-constant>

1. Create or select a Google Cloud project to use for this
   tutorial.
   * <walkthrough-project-setup billing="true"></walkthrough-project-setup>

1. Enable the Dataproc, Compute Engine, and Cloud Storage APIs in your
   project.

   ```bash
   gcloud services enable dataproc.googleapis.com \
     compute.googleapis.com \
     storage-component.googleapis.com \
     --project={{project_id}}
   ```

## Prerequisites (2)

1. This walkthrough uploads a PySpark file (`pyspark_sort.py`) to a
   [Cloud Storage bucket](https://cloud.google.com/storage/docs/key-terms#buckets) in
   your project.
   * You can use the [Cloud Storage browser page](https://console.cloud.google.com/storage/browser)
     in the Google Cloud Console to view existing buckets in your project.

     **OR**

   * To create a new bucket, run the following command. Your bucket name must be unique.

     ```bash
     gsutil mb -p {{project_id}} gs://your-bucket-name
     ```

2. Set environment variables.
   * Set the name of your bucket.

     ```bash
     BUCKET=your-bucket-name
     ```
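
The walkthrough code copies `pyspark_sort.py` into this bucket for the job to
read, presumably via the `google-cloud-storage` client library; a rough sketch
of that upload step (the project and bucket names are the placeholders from
above):

```python
# Hedged sketch of the upload step, assuming google-cloud-storage is installed.
from google.cloud import storage

client = storage.Client(project="your-project-id")  # illustrative project ID
bucket = client.bucket("your-bucket-name")          # the $BUCKET set above
blob = bucket.blob("pyspark_sort.py")
blob.upload_from_filename("pyspark_sort.py")        # local file -> gs://your-bucket-name/pyspark_sort.py
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```
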
## Prerequisites (3)

1. Set up a Python
   [virtual environment](https://virtualenv.readthedocs.org/en/latest/).

   * Create the virtual environment.

     ```bash
     virtualenv ENV
     ```

   * Activate the virtual environment.

     ```bash
     source ENV/bin/activate
     ```

1. Install library dependencies.

   ```bash
   pip install -r requirements.txt
   ```
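
A quick way to confirm the install worked is to import the two client
libraries the walkthrough relies on (assuming `requirements.txt` pins
`google-cloud-dataproc` and `google-cloud-storage`; check that file for the
authoritative list).

```python
# Sanity check inside the activated virtual environment.
from google.cloud import dataproc_v1, storage

print("client libraries imported OK")
```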

## Create a cluster and submit a job

1. Set a name for your new cluster.

   ```bash
   CLUSTER=new-cluster-name
   ```

1. Set a [region](https://cloud.google.com/compute/docs/regions-zones/#available)
   where your new cluster will be located. You can change the pre-set
   "us-central1" region before you copy and run the following command.

   ```bash
   REGION=us-central1
   ```

1. Run `submit_job_to_cluster.py` to create a new cluster and run the
   `pyspark_sort.py` job on the cluster.

   ```bash
   python submit_job_to_cluster.py \
     --project_id={{project_id}} \
     --cluster_name=$CLUSTER \
     --region=$REGION \
     --gcs_bucket=$BUCKET
   ```
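
For context, the word sort itself is tiny; here is a representative sketch of
what a `pyspark_sort.py` of this kind contains (the repo's actual file may
differ in detail), consistent with the sample output shown in the next section:

```python
# Representative PySpark word sort.
import pyspark

sc = pyspark.SparkContext()
rdd = sc.parallelize(["Hello,", "world!", "dog", "elephant", "panther"])
print(sorted(rdd.collect()))  # ['Hello,', 'dog', 'elephant', 'panther', 'world!']
```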

## Job output

Job output displayed in the Cloud Shell terminal shows cluster creation,
job completion, sorted job output, and then deletion of the cluster.

```
Cluster created successfully: cluster-name.
...
Job finished successfully.
...
['Hello,', 'dog', 'elephant', 'panther', 'world!']
...
Cluster cluster-name successfully deleted.
```

## Congratulations on completing the walkthrough!
<walkthrough-conclusion-trophy></walkthrough-conclusion-trophy>

---

### Next steps:

* **View job details in the Cloud Console.** View job details by selecting the
  PySpark job name on the Dataproc
  [Jobs page](https://console.cloud.google.com/dataproc/jobs)
  in the Cloud Console.

* **Delete resources used in the walkthrough.**
  The `submit_job_to_cluster.py` code deletes the cluster that it created for this
  walkthrough.

  If you created a Cloud Storage bucket to use for this walkthrough,
  you can run the following command to delete the bucket (the bucket must be empty).

  ```bash
  gsutil rb gs://$BUCKET
  ```

  You can run the following command to delete the bucket **and all
  objects within it. Note: the deleted objects cannot be recovered.**

  ```bash
  gsutil rm -r gs://$BUCKET
  ```
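
  If you prefer Python over `gsutil`, here is a sketch of the same cleanup
  with the `google-cloud-storage` library; note that `force=True` deletes the
  contained objects first and works only for buckets holding at most a few
  hundred objects.

  ```python
  # Delete the walkthrough bucket and its objects from Python.
  from google.cloud import storage

  client = storage.Client()
  bucket = client.bucket("your-bucket-name")  # the $BUCKET used above
  bucket.delete(force=True)                   # also deletes objects; irreversible
  ```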

* **For more information.** See the [Dataproc documentation](https://cloud.google.com/dataproc/docs/)
  for API reference and product feature information.
