This repository was archived by the owner on Jul 18, 2024. It is now read-only.

Commit 856af3d

Add in eval model yaml and instructions
1 parent c8e3f6d commit 856af3d

File tree: 4 files changed (+187 −20)


Diff for: Dockerfile (+4 −4)

```diff
@@ -7,11 +7,11 @@ RUN apt-get update && \
     mkdir /model /data && \
     git clone https://github.com/tensorflow/models.git && \
     cp -r /models/research/slim/* /model/ && \
-    rm -rf /models
+    rm -rf /models

 COPY dataset_factory.py /model/datasets/.
-COPY arts.py /model/datasets/.
-COPY data/*.tfrecord /data/
-COPY data/labels.txt /data/.
+COPY arts.py /model/datasets/.
+COPY classify.py /model/.
+COPY data/ /data/

 ENTRYPOINT ["python", "/model/train_image_classifier.py"]
```

Diff for: README.md (+73 −16)

````diff
@@ -127,17 +127,15 @@ SELECT department, culture, link_resource
         LIMIT 200
 ```

-You can enter these strings on the Google BigQuery console to see the data.
-The journey also provides convenient script to query the attributes.
-First clone the journey git repository:
+You can enter these strings on the Google BigQuery console to see the data. The journey also provides a convenient script
+to query the attributes. First clone the journey git repository:

 ```
 cd ~
 git clone https://github.com/IBM/tensorflow-kubernetes-art-classification.git
 ```

-The script to query Google BigQuery is bigquery.py.
-Edit the script to put the appropriate SQL string and run the script:
+The script to query Google BigQuery is bigquery.py. Edit the script to put in the appropriate SQL string and run the script:

 ```
 cd tensorflow-kubernetes-art-classification
@@ -262,12 +260,12 @@ within a reasonable amount of time. In practice, you would use a larger dataset
 such as multiple CPU cores and GPU. Depending on the amount of computation resources, the training can run for days
 or over a week.

-Next follow this [instructions](https://console.bluemix.net/docs/containers/cs_cluster.html#bx_registry_other) to
+Next follow these [instructions](https://console.bluemix.net/docs/containers/cs_cluster.html#bx_registry_other) to
 1. create a namespace in Bluemix Container Registry and upload the image to this namespace
 2. create a non-expiring registry token
 3. create a Kubernetes secret to store the Bluemix token information

-Update met-art.yaml file with your images name and secret name
+Update the train-model.yaml file with your image name and secret name

 ```
 apiVersion: v1
@@ -299,39 +297,98 @@ spec:
     persistentVolumeClaim:
       claimName: met-art-logs
   imagePullSecrets:
-  - name: bluemix-token
+  - name: bluemix-secret
   restartPolicy: Never
 ```

 ```
 # For Mac OS
-sed -i '.original' 's/registry.ng.bluemix.net\/tf_ns\/met-art:v1/registry.<region>.bluemix.net\/<my_namespace>\/<my_image>:<tag>/' met-art.yaml
-sed -i '.original' 's/bluemix-token/<my_token>/' met-art.yaml
+sed -i '.original' 's/registry.ng.bluemix.net\/tf_ns\/met-art:v1/registry.<region>.bluemix.net\/<my_namespace>\/<my_image>:<tag>/' train-model.yaml
+sed -i '.original' 's/bluemix-secret/<my_token>/' train-model.yaml
 # For all other Linux platforms
-sed -i 's/registry.ng.bluemix.net\/tf_ns\/met-art:v1/registry.<region>.bluemix.net\/<my_namespace>\/<my_image>:<tag>/' met-art.yaml
-sed -i 's/bluemix-token/<my_token>/' met-art.yaml
+sed -i 's/registry.ng.bluemix.net\/tf_ns\/met-art:v1/registry.<region>.bluemix.net\/<my_namespace>\/<my_image>:<tag>/' train-model.yaml
+sed -i 's/bluemix-secret/<my_token>/' train-model.yaml
 ```

 Deploy the pod with the following command:

 ```
-kubectl create -f met-art.yaml
+kubectl create -f train-model.yaml
+```
+
+Check the training status with the following command:
+
+```
+kubectl logs train-met-art-model
 ```

 Along with the pod, a local volume will be created and mounted to the pod to hold the output of the training.
 This includes the checkpoints, which are used for resuming after a crash and saving a trained model,
 and the event file, which is used for visualization. Further, the restart policy for the pod is set to "Never",
 because once the training complete there is no need to restart the pod again.

+### 7. Evaluate model performance
+
+Evaluate the model from the last checkpoint in the training step above:
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: eval-met-art-model
+spec:
+  containers:
+  - name: tensorflow
+    image: registry.ng.bluemix.net/tf_ns/met-art:v1
+    volumeMounts:
+    - name: model-logs
+      mountPath: /logs
+    ports:
+    - containerPort: 5000
+    command:
+    - "/usr/bin/python"
+    - "/model/eval_image_classifier.py"
+    args:
+    - "--alsologtostderr"
+    - "--checkpoint_path=/logs/model.ckpt-100"
+    - "--eval_dir=/logs"
+    - "--dataset_dir=/data"
+    - "--dataset_name=arts"
+    - "--dataset_split_name=validation"
+    - "--model_name=inception_v3"
+    - "--clone_on_cpu=True"
+    - "--batch_size=10"
+  volumes:
+  - name: model-logs
+    persistentVolumeClaim:
+      claimName: met-art-logs
+  imagePullSecrets:
+  - name: bluemix-secret
+  restartPolicy: Never
+```
+
+Update the eval-model.yaml file with your image name and secret name, just like in step 6.
+
+Deploy the pod with the following command:
+
+```
+kubectl create -f eval-model.yaml
+```
+
+Check the evaluation status with the following command:
+
+```
+kubectl logs eval-met-art-model
+```

-### 7. Save trained model
+### 8. Save trained model

 Copy the files from the Kubernetes local volume.

 The trained model is the last checkpoint file.


-### 8. Visualize
+### 9. Visualize

 The event file copied from the Kubernetes local volume contains the log data for TensorBoard.
 Start the TensorBoard and point to the local directory with the event file:
@@ -342,7 +399,7 @@ tensorboard --logdir=<path_to_dir>

 Then open your browser with the link displayed from the command.

-### 9. Run inference
+### 10. Run inference

 Now that you have trained a model to classify art image by culture, you can provide
 a new art image to see how it will be classified by the model.
````
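The sed commands in step 6 above rewrite the sample image and pull-secret names in the manifest. The same customization can be sketched in Python (a hypothetical helper, not part of the repo; the replacement image and secret names below are placeholders):

```python
# Sample values that ship in the repo's manifests.
SAMPLE_IMAGE = "registry.ng.bluemix.net/tf_ns/met-art:v1"
SAMPLE_SECRET = "bluemix-secret"

def customize_manifest(text, image, secret):
    """Swap the sample image and pull-secret names for your own."""
    return text.replace(SAMPLE_IMAGE, image).replace(SAMPLE_SECRET, secret)

manifest = (
    "    image: registry.ng.bluemix.net/tf_ns/met-art:v1\n"
    "  imagePullSecrets:\n"
    "  - name: bluemix-secret\n"
)
# Placeholder registry/namespace/secret, as in the <region>/<my_namespace> sed pattern.
print(customize_manifest(manifest,
                         "registry.eu-de.bluemix.net/my_ns/met-art:v2",
                         "my-registry-secret"))
```

Unlike `sed -i`, this leaves the original file untouched unless you write the result back.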

Diff for: eval-model.yaml (new file, +42)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: eval-met-art-model
spec:
  selector:
    name: eval-met-art-model
  ports:
  - port: 5000
---
apiVersion: v1
kind: Pod
metadata:
  name: eval-met-art-model
spec:
  containers:
  - name: tensorflow
    image: registry.ng.bluemix.net/tf_ns/met-art:v1
    volumeMounts:
    - name: model-logs
      mountPath: /logs
    ports:
    - containerPort: 5000
    command:
    - "/usr/bin/python"
    - "/model/eval_image_classifier.py"
    args:
    - "--alsologtostderr"
    - "--checkpoint_path=/logs/model.ckpt-100"
    - "--eval_dir=/logs"
    - "--dataset_dir=/data"
    - "--dataset_name=arts"
    - "--dataset_split_name=validation"
    - "--model_name=inception_v3"
    - "--batch_size=10"
  volumes:
  - name: model-logs
    persistentVolumeClaim:
      claimName: met-art-logs
  imagePullSecrets:
  - name: bluemix-secret
  restartPolicy: Never
```

Diff for: train-model.yaml (new file, +68)

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: persistent-volume-1
  labels:
    type: local
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteMany
  hostPath:
    path: /tmp/data/pv-1
  persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: met-art-logs
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: train-met-art-model
spec:
  selector:
    name: train-met-art-model
  ports:
  - port: 5000
---
apiVersion: v1
kind: Pod
metadata:
  name: train-met-art-model
spec:
  containers:
  - name: tensorflow
    image: registry.ng.bluemix.net/tf_ns/met-art:v1
    volumeMounts:
    - name: model-logs
      mountPath: /logs
    ports:
    - containerPort: 5000
    command:
    - "/usr/bin/python"
    - "/model/train_image_classifier.py"
    args:
    - "--train_dir=/logs"
    - "--dataset_name=arts"
    - "--dataset_split_name=train"
    - "--dataset_dir=/data"
    - "--model_name=inception_v3"
    - "--clone_on_cpu=True"
    - "--batch_size=10"
    - "--max_number_of_steps=100"
  volumes:
  - name: model-logs
    persistentVolumeClaim:
      claimName: met-art-logs
  imagePullSecrets:
  - name: bluemix-secret
  restartPolicy: Never
```
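The PersistentVolume above offers 1Gi of capacity and the met-art-logs claim requests 1Gi, so the claim can bind. That kind of pairing can be sanity-checked with a small sketch (a hypothetical helper using the binary quantity suffixes Kubernetes accepts, not part of the repo):

```python
# Binary suffixes as used in Kubernetes resource quantities.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

def to_bytes(qty):
    """Convert a quantity string like '1Gi' to bytes."""
    for suffix, mult in UNITS.items():
        if qty.endswith(suffix):
            return int(qty[:-len(suffix)]) * mult
    return int(qty)  # plain byte count

def claim_fits(request, capacity):
    """True if a PVC request fits within a PV's capacity."""
    return to_bytes(request) <= to_bytes(capacity)

print(claim_fits("1Gi", "1Gi"))  # the manifest's request vs. capacity -> True
```

In a real cluster the scheduler also matches access modes (`ReadWriteMany` here) and labels, not just size.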

0 commit comments