docs: iteration 3 #492
Changes from all commits
af6445f
01cc68b
b5cefe6
91e57c4
739ffea
1df2ec1
eb8e209
215d1c1
5db25b0
ad941c4
1e2434c
64c45ec
4aeffa8
2a30b67
ad2c377
0468e53
bba3c83
9dc7c5b
6a6f94a
3917887
81551d5
041e34b
1ff0ba1
1bb47a3
1ad01a9
1887610
d2edbc9
54ab9d1
a1f1c8c
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -1,25 +1,48 @@ | ||||||||||||||||||||||||||||||||||||||||||
 | ||||||||||||||||||||||||||||||||||||||||||
 | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
# Terraform Provider Iterative (TPI) | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
[](https://registry.terraform.io/providers/iterative/iterative/latest/docs) | ||||||||||||||||||||||||||||||||||||||||||
[](https://github.com/iterative/terraform-provider-iterative/actions/workflows/test.yml) | ||||||||||||||||||||||||||||||||||||||||||
[![Apache-2.0][licence-badge]][licence-file] | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. Full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert. | ||||||||||||||||||||||||||||||||||||||||||
TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert. | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
- **Provision Resources**: create cloud compute (CPU, GPU, RAM) & storage resources without reading pages of documentation | ||||||||||||||||||||||||||||||||||||||||||
- **Sync & Execute**: easily sync & run local data & code in the cloud | ||||||||||||||||||||||||||||||||||||||||||
- **Low cost**: transparent auto-recovery from interrupted low-cost spot/preemptible instances | ||||||||||||||||||||||||||||||||||||||||||
- **No waste**: auto-cleanup unused resources (terminate compute instances upon job completion/failure & remove storage upon download of results) | ||||||||||||||||||||||||||||||||||||||||||
- **No lock-in**: switch between several cloud vendors with ease due to concise unified configuration | ||||||||||||||||||||||||||||||||||||||||||
- **Lower cost with spot recovery**: transparent auto-recovery from interrupted low-cost spot/preemptible instances | ||||||||||||||||||||||||||||||||||||||||||
- **No cloud vendor lock-in**: switch between clouds with just one line thanks to unified abstraction | ||||||||||||||||||||||||||||||||||||||||||
- **No waste**: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use | ||||||||||||||||||||||||||||||||||||||||||
Comment on lines +12 to +13:
💅🏼 missing |
||||||||||||||||||||||||||||||||||||||||||
- **Developer-first experience**: one-command data sync & code execution with no external server, making the cloud feel like a laptop | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
Supported cloud vendors include: | ||||||||||||||||||||||||||||||||||||||||||
Supported cloud vendors [include][auth]: | ||||||||||||||||||||||||||||||||||||||||||
Review comment: 💅🏼 I'd link the other 3 words instead. |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
- Amazon Web Services (AWS) | ||||||||||||||||||||||||||||||||||||||||||
- Microsoft Azure | ||||||||||||||||||||||||||||||||||||||||||
- Google Cloud Platform (GCP) | ||||||||||||||||||||||||||||||||||||||||||
- Kubernetes (K8s) | ||||||||||||||||||||||||||||||||||||||||||
| [![Amazon Web Services (AWS)][aws-badge]][aws] | [![Microsoft Azure][azure-badge]][azure] | [![Google Cloud Platform (GCP)][gcp-badge]][gcp] | [![Kubernetes (K8s)][k8s-badge]][k8s] | | ||||||||||||||||||||||||||||||||||||||||||
| ---------------------------------------------- | ---------------------------------------- | ------------------------------------------------ | ------------------------------------- | | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
[aws-badge]: https://img.shields.io/badge/AWS-Amazon_Web_Services-black?colorA=white&logoColor=232F3E&logo=amazonaws | ||||||||||||||||||||||||||||||||||||||||||
[aws]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#amazon-web-services | ||||||||||||||||||||||||||||||||||||||||||
[azure-badge]: https://img.shields.io/badge/Azure-Microsoft_Azure-black?colorA=white&logoColor=0078D4&logo=microsoftazure | ||||||||||||||||||||||||||||||||||||||||||
[azure]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#microsoft-azure | ||||||||||||||||||||||||||||||||||||||||||
[gcp-badge]: https://img.shields.io/badge/GCP-Google_Cloud_Platform-black?colorA=white&logoColor=4285F4&logo=googlecloud | ||||||||||||||||||||||||||||||||||||||||||
[gcp]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#google-cloud-platform | ||||||||||||||||||||||||||||||||||||||||||
[k8s-badge]: https://img.shields.io/badge/K8s-Kubernetes-black?colorA=white&logoColor=326CE5&logo=kubernetes | ||||||||||||||||||||||||||||||||||||||||||
[k8s]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#kubernetes | ||||||||||||||||||||||||||||||||||||||||||
[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication | ||||||||||||||||||||||||||||||||||||||||||
Review comment: 💅🏼 💅🏼 💅🏼 Shouldn't it be at the beginning of the link list? 😋 |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
casperdcl marked this conversation as resolved.
|
||||||||||||||||||||||||||||||||||||||||||
 | ||||||||||||||||||||||||||||||||||||||||||
 | ||||||||||||||||||||||||||||||||||||||||||
Comment on lines +31 to +32:
Woah how does that work? Just curious |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
## What's Special | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
There are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators): | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
1. **Reduced management overhead and infrastructure cost**: | ||||||||||||||||||||||||||||||||||||||||||
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline. | ||||||||||||||||||||||||||||||||||||||||||
Comment on lines +34 to +39:
When we number the list it seems like we mean that it is an exhaustive list. Is it? Otherwise maybe use bullets. But also it somewhat overlaps with the previous bullet list. Maybe they can be combined, so that the reader can get to the Usage right after the vendors figure? |
||||||||||||||||||||||||||||||||||||||||||
2. **Unified tool for data science and software development teams**: | ||||||||||||||||||||||||||||||||||||||||||
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production. | ||||||||||||||||||||||||||||||||||||||||||
Review comment: Is this a direct effect of using TPI for development? Given that this tool is not intended for model serving, that assertion might be slightly misleading. |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
Review comment: Suggested change
|
||||||||||||||||||||||||||||||||||||||||||
[^scalers]: [AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html), [Azure VM Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets), [GCP managed instance groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups), and [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job). | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
<img width=24px src="https://static.iterative.ai/logo/cml.svg"/> TPI is used to power [CML runners](https://cml.dev/doc/self-hosted-runners), bringing cloud providers to existing CI/CD workflows. | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
## Usage | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
@@ -36,7 +59,7 @@ Supported cloud vendors include: | |||||||||||||||||||||||||||||||||||||||||
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | ||||||||||||||||||||||||||||||||||||||||||
sudo apt-get update && sudo apt-get install terraform | ||||||||||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||||||||||
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication) | ||||||||||||||||||||||||||||||||||||||||||
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth] | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
### Define a Task | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
@@ -47,6 +70,7 @@ terraform { | |||||||||||||||||||||||||||||||||||||||||
required_providers { iterative = { source = "iterative/iterative" } } | ||||||||||||||||||||||||||||||||||||||||||
} | ||||||||||||||||||||||||||||||||||||||||||
provider "iterative" {} | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
resource "iterative_task" "example" { | ||||||||||||||||||||||||||||||||||||||||||
cloud = "aws" # or any of: gcp, az, k8s | ||||||||||||||||||||||||||||||||||||||||||
machine = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ... | ||||||||||||||||||||||||||||||||||||||||||
|
@@ -59,8 +83,18 @@ resource "iterative_task" "example" { | |||||||||||||||||||||||||||||||||||||||||
} | ||||||||||||||||||||||||||||||||||||||||||
script = <<-END | ||||||||||||||||||||||||||||||||||||||||||
#!/bin/bash | ||||||||||||||||||||||||||||||||||||||||||
mkdir results | ||||||||||||||||||||||||||||||||||||||||||
echo "Hello World!" > results/greeting.txt | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
# create output directory if needed | ||||||||||||||||||||||||||||||||||||||||||
mkdir -p results | ||||||||||||||||||||||||||||||||||||||||||
# read last result (in case of spot/preemptible instance recovery) | ||||||||||||||||||||||||||||||||||||||||||
if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi | ||||||||||||||||||||||||||||||||||||||||||
EPOCH=$${EPOCH:-1} # start from 1 if last result not found | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
echo "(re)starting training loop from $EPOCH up to 1337 epochs" | ||||||||||||||||||||||||||||||||||||||||||
for epoch in $(seq $EPOCH 1337); do | ||||||||||||||||||||||||||||||||||||||||||
sleep 1 | ||||||||||||||||||||||||||||||||||||||||||
echo "$epoch" | tee results/epoch.txt | ||||||||||||||||||||||||||||||||||||||||||
done | ||||||||||||||||||||||||||||||||||||||||||
END | ||||||||||||||||||||||||||||||||||||||||||
} | ||||||||||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||||||||||
|
@@ -81,7 +115,7 @@ TF_LOG_PROVIDER=INFO terraform apply | |||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
This launches a `machine` in the `cloud`, uploads `workdir`, and runs the `script`. Upon completion (or error), the `machine` is terminated. | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent storage will be used to relaunch interrupted tasks. | ||||||||||||||||||||||||||||||||||||||||||
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent (`disk_size`) storage will be used to relaunch interrupted tasks. | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
### Query Status | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
@@ -92,14 +126,52 @@ TF_LOG_PROVIDER=INFO terraform refresh | |||||||||||||||||||||||||||||||||||||||||
TF_LOG_PROVIDER=INFO terraform show | ||||||||||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
### Stop Tasks | ||||||||||||||||||||||||||||||||||||||||||
### Stop Task | ||||||||||||||||||||||||||||||||||||||||||
Review comment: End/Delete a Task? |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||||||||||
TF_LOG_PROVIDER=INFO terraform destroy | ||||||||||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
This terminates the `machine` (if still running), downloads `output`, and removes the persistent `disk_size` storage. | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
## How it Works | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
This diagram may help to see what TPI does under-the-hood: | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
```mermaid | ||||||||||||||||||||||||||||||||||||||||||
flowchart LR | ||||||||||||||||||||||||||||||||||||||||||
subgraph tpi [what TPI manages] | ||||||||||||||||||||||||||||||||||||||||||
direction LR | ||||||||||||||||||||||||||||||||||||||||||
subgraph you [what you manage] | ||||||||||||||||||||||||||||||||||||||||||
direction LR | ||||||||||||||||||||||||||||||||||||||||||
A([Personal Computer]) | ||||||||||||||||||||||||||||||||||||||||||
end | ||||||||||||||||||||||||||||||||||||||||||
B[("Cloud Storage (low cost)")] | ||||||||||||||||||||||||||||||||||||||||||
C{{"Cloud instance scaler (zero cost)"}} | ||||||||||||||||||||||||||||||||||||||||||
D[["Cloud (spot) Instance"]] | ||||||||||||||||||||||||||||||||||||||||||
A ---> |create cloud storage| B | ||||||||||||||||||||||||||||||||||||||||||
A --> |create cloud instance scaler| C | ||||||||||||||||||||||||||||||||||||||||||
A ==> |upload script & workdir| B | ||||||||||||||||||||||||||||||||||||||||||
A -.-> |"offline (lunch break)"| A | ||||||||||||||||||||||||||||||||||||||||||
C -.-> |"(re)provision instance"| D | ||||||||||||||||||||||||||||||||||||||||||
D ==> |run script| D | ||||||||||||||||||||||||||||||||||||||||||
B <-.-> |persistent workdir cache| D | ||||||||||||||||||||||||||||||||||||||||||
D ==> |script end,\nshutdown instance| B | ||||||||||||||||||||||||||||||||||||||||||
D -.-> |outage| C | ||||||||||||||||||||||||||||||||||||||||||
B ==> |download output| A | ||||||||||||||||||||||||||||||||||||||||||
Comment on lines +152 to +161:
Suggested change
old: flowchart LR
subgraph tpi [what TPI manages]
direction LR
subgraph you [what you manage]
direction LR
A([Personal Computer])
end
B[("Cloud Storage (low cost)")]
C{{"Cloud instance scaler (zero cost)"}}
D[["Cloud (spot) Instance"]]
A ---> |create cloud storage| B
A --> |create cloud instance scaler| C
A ==> |upload script & workdir| B
A -.-> |"offline (lunch break)"| A
C -.-> |"(re)provision instance"| D
D ==> |run script| D
B <-.-> |persistent workdir cache| D
D ==> |script end,\nshutdown instance| B
D -.-> |outage| C
B ==> |download output| A
end
style you fill:#FFFFFF00,stroke:#13ADC7
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
style A fill:#13ADC7,stroke:#333333,color:#000000
style B fill:#945DD5,stroke:#333333,color:#000000
style D fill:#F46737,stroke:#333333,color:#000000
style C fill:#7B61FF,stroke:#333333,color:#000000
new: flowchart LR
subgraph tpi [what TPI manages]
direction LR
subgraph you [what you manage]
direction LR
A([Personal Computer])
end
B[("Cloud Storage (low cost)")]
C{{"Cloud instance scaler (zero cost)"}}
D[["Cloud (spot) Instance"]]
A ---> |2. create cloud storage| B
A --> |1. create cloud instance scaler| C
A ==> |3. upload script & workdir| B
A -.-> |"4. offline (lunch break)"| A
C -.-> |"5. (re)provision instance"| D
D ==> |7. run script| D
B <-.-> |6. persistent workdir cache| D
D ==> |8. script end,\nshutdown instance| B
D -.-> |outage| C
B ==> |9. download output| A
end
style you fill:#FFFFFF00,stroke:#13ADC7
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
style A fill:#13ADC7,stroke:#333333,color:#000000
style B fill:#945DD5,stroke:#333333,color:#000000
style D fill:#F46737,stroke:#333333,color:#000000
style C fill:#7B61FF,stroke:#333333,color:#000000
Review comment: Looks good.
Review comment: Cool diagram but it's not readable without clicking the |
||||||||||||||||||||||||||||||||||||||||||
end | ||||||||||||||||||||||||||||||||||||||||||
style you fill:#FFFFFF00,stroke:#13ADC7 | ||||||||||||||||||||||||||||||||||||||||||
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px | ||||||||||||||||||||||||||||||||||||||||||
style A fill:#13ADC7,stroke:#333333,color:#000000 | ||||||||||||||||||||||||||||||||||||||||||
style B fill:#945DD5,stroke:#333333,color:#000000 | ||||||||||||||||||||||||||||||||||||||||||
style D fill:#F46737,stroke:#333333,color:#000000 | ||||||||||||||||||||||||||||||||||||||||||
style C fill:#7B61FF,stroke:#333333,color:#000000 | ||||||||||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
## Future Plans | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
TPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations & options, and tighter ecosystem integration with tools such as [DVC](https://dvc.org). | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
## Help | ||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`. | ||||||||||||||||||||||||||||||||||||||||||
Review comment: 💅🏼 Suggested change
|
||||||||||||||||||||||||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,11 +20,11 @@ page_title: Getting Started | |
sudo apt-get update && sudo apt-get install terraform | ||
``` | ||
|
||
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][authentication] | ||
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth] | ||
|
||
[authentication]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication | ||
[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication | ||
Comment on lines -23 to +25:
BTW I think the nav would be more logical with Getting Started on top, since it covers installation and then links to the auth section. |
||
|
||
## Defining a Task | ||
## Define a Task | ||
|
||
In a project root directory, create a file named `main.tf` with the following contents: | ||
|
||
|
@@ -33,6 +33,7 @@ terraform { | |
required_providers { iterative = { source = "iterative/iterative" } } | ||
} | ||
provider "iterative" {} | ||
|
||
resource "iterative_task" "example" { | ||
cloud = "aws" # or any of: gcp, az, k8s | ||
machine = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ... | ||
|
@@ -45,8 +46,18 @@ resource "iterative_task" "example" { | |
} | ||
script = <<-END | ||
#!/bin/bash | ||
mkdir results | ||
echo "Hello World!" > results/greeting.txt | ||
|
||
# create output directory if needed | ||
mkdir -p results | ||
# read last result (in case of spot/preemptible instance recovery) | ||
if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi | ||
EPOCH=$${EPOCH:-1} # start from 1 if last result not found | ||
|
||
echo "(re)starting training loop from $EPOCH up to 1337 epochs" | ||
for epoch in $(seq $EPOCH 1337); do | ||
sleep 1 | ||
echo "$epoch" | tee results/epoch.txt | ||
done | ||
END | ||
} | ||
``` | ||
|
@@ -61,7 +72,7 @@ The project layout should look similar to this: | |
project/ | ||
├── main.tf | ||
└── results/ | ||
└── greeting.txt (created in the cloud and downloaded locally) | ||
└── epoch.txt (created in the cloud and downloaded locally) | ||
``` | ||
|
||
## Initialise Terraform | ||
|
@@ -72,7 +83,7 @@ $ terraform init | |
|
||
This command will check `main.tf` and download the required TPI plugin. | ||
|
||
~> **Warning:** None of the subsequent commands will work without first setting some [authentication environment variables][authentication]. | ||
~> **Warning:** None of the subsequent commands will work without first setting some [authentication environment variables][auth]. | ||
|
||
## Run Task | ||
|
||
|
@@ -82,14 +93,16 @@ $ TF_LOG_PROVIDER=INFO terraform apply | |
|
||
This command will: | ||
|
||
1. Create all the required cloud resources. | ||
1. Create all the required cloud resources (provisioning a `machine` with `disk_size` storage). | ||
2. Upload the working directory (`workdir`) to the cloud. | ||
3. Launch the task `script`. | ||
|
||
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent storage will be used to relaunch interrupted tasks. | ||
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent (`disk_size`) storage will be used to relaunch interrupted tasks. | ||
|
||
-> **Note:** A large `workdir` may take a long time to upload. | ||
|
||
~> **Warning:** To take full advantage of spot instance recovery, a `script` should start by checking the disk for results (recovered from a previous interrupted run). | ||
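The check-then-resume pattern from the task `script` can be tried locally before paying for a cloud instance. The following is only a minimal sketch: `train` is a hypothetical helper, not part of TPI, and it takes the output directory and final epoch as arguments instead of hard-coding `results` and `1337`. Outside of a Terraform heredoc, plain `${...}` is used; inside `main.tf` the same expansion has to be written as `$${...}` so Terraform does not treat it as interpolation.

```shell
#!/bin/bash
# Hypothetical helper (not part of TPI): a resumable loop that records
# its progress in <outdir>/epoch.txt, mirroring the task script above.
train() {
  local outdir="$1" last="$2" start=1
  # create output directory if needed
  mkdir -p "$outdir"
  # read last result (left behind by a previous, interrupted run)
  if test -f "$outdir/epoch.txt"; then start="$(cat "$outdir/epoch.txt")"; fi
  echo "(re)starting from $start up to $last"
  # the last recorded epoch is repeated, since it may not have finished
  for epoch in $(seq "$start" "$last"); do
    echo "$epoch" | tee "$outdir/epoch.txt"
  done
}
```

Calling `train results 10`, interrupting it with Ctrl+C, and calling it again should continue from the epoch last written to `results/epoch.txt` rather than from 1.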
|
||
-> **Note:** The [`id`](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#id) returned by `terraform apply` (i.e. `[id=tpi-···]`) can be used to locate the created cloud resources through the cloud's web console or command-line tool. | ||
|
||
## Query Status | ||
|
@@ -113,9 +126,9 @@ $ TF_LOG_PROVIDER=INFO terraform destroy | |
This command will: | ||
|
||
1. Download the `output` directory from the cloud. | ||
2. Delete all the cloud resources created by `terraform apply`. | ||
2. Delete all the cloud resources created by `terraform apply` (terminating `machine` if it's still running and removing the persistent `disk_size` storage). | ||
|
||
In this example, after running `terraform destroy`, the `results` directory should contain a file named `greeting.txt` with the text `Hello, World!` | ||
In this example, after running `terraform destroy`, the `results` directory should contain a file named `epoch.txt` with the text `1337`. | ||
|
||
-> **Note:** A large `output` directory may take a long time to download. | ||
|
||
|
Review comment: made a lot of changes since this review, a couple outstanding things:
technically we are stopping tasks <- terminating instances <- destroying resources
implied by transparent spot auto-recovery, right?
Review comment: Yes and no, you make the users think, or at least guess. There is a phrase in design (also the title of a praised book): "Don't Make Me Think".
Review comment: No, technically tasks stop by themselves. What I want to say is that the user might be confused by the ambiguity of the "stop" concept in the cloud, where they can stop a machine and then restart it at some later point. Here, after destroy, there is no possible resume.