Commit d24f99c
casperdcl, DavidGOrtega, restyled-commits, and dacbd authored
docs: iteration 3 (#492)
* more info on cloud account creation
* update USPs
* update example scripts
* low-level diagram
* bi-directional cache
* minor shading
* udpate diagram styles
* fix cross-refs
* fix epochs
* readme: how it works
* Apply suggestions from code review

  Co-authored-by: DavidGOrtega <[email protected]>
* split features & USPs
* fix example
* add high-level diagram
* minify list
* update diagrams
* more work on features/USPs
* explicit epochs
* copyediting
* table providers
* sync docs
* fix light/dark
* auth: GCP env vars
* POSIX compliance
* better badge links
* Restyled by prettier-markdown
* colonic problems
* @jurv11 copyedits
* logging typo (#512)

  Let me slide this typo I found in with the docs pr

Co-authored-by: DavidGOrtega <[email protected]>
Co-authored-by: Restyled.io <[email protected]>
Co-authored-by: Daniel Barnes <[email protected]>
1 parent 520bd85 commit d24f99c

7 files changed

+174
-50
lines changed

README.md

+89-17
Original file line number | Diff line number | Diff line change
@@ -1,25 +1,48 @@
1-
![TPI](https://static.iterative.ai/img/cml/banner-tpi.svg)
1+
![TPI](https://static.iterative.ai/img/tpi/banner.svg)
22

33
# Terraform Provider Iterative (TPI)
44

55
[![docs](https://img.shields.io/badge/-docs-5c4ee5?logo=terraform)](https://registry.terraform.io/providers/iterative/iterative/latest/docs)
66
[![tests](https://img.shields.io/github/workflow/status/iterative/terraform-provider-iterative/Test?label=tests&logo=GitHub)](https://github.com/iterative/terraform-provider-iterative/actions/workflows/test.yml)
77
[![Apache-2.0][licence-badge]][licence-file]
88

9-
TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. Full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
9+
TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
1010

11-
- **Provision Resources**: create cloud compute (CPU, GPU, RAM) & storage resources without reading pages of documentation
12-
- **Sync & Execute**: easily sync & run local data & code in the cloud
13-
- **Low cost**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
14-
- **No waste**: auto-cleanup unused resources (terminate compute instances upon job completion/failure & remove storage upon download of results)
15-
- **No lock-in**: switch between several cloud vendors with ease due to concise unified configuration
11+
- **Lower cost with spot recovery**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
12+
- **No cloud vendor lock-in**: switch between clouds with just one line thanks to unified abstraction
13+
- **No waste**: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
14+
- **Developer-first experience**: one-command data sync & code execution with no external server, making the cloud feel like a laptop
1615

17-
Supported cloud vendors include:
16+
Supported cloud vendors [include][auth]:
1817

19-
- Amazon Web Services (AWS)
20-
- Microsoft Azure
21-
- Google Cloud Platform (GCP)
22-
- Kubernetes (K8s)
18+
| [![Amazon Web Services (AWS)][aws-badge]][aws] | [![Microsoft Azure][azure-badge]][azure] | [![Google Cloud Platform (GCP)][gcp-badge]][gcp] | [![Kubernetes (K8s)][k8s-badge]][k8s] |
19+
| ---------------------------------------------- | ---------------------------------------- | ------------------------------------------------ | ------------------------------------- |
20+
21+
[aws-badge]: https://img.shields.io/badge/AWS-Amazon_Web_Services-black?colorA=white&logoColor=232F3E&logo=amazonaws
22+
[aws]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#amazon-web-services
23+
[azure-badge]: https://img.shields.io/badge/Azure-Microsoft_Azure-black?colorA=white&logoColor=0078D4&logo=microsoftazure
24+
[azure]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#microsoft-azure
25+
[gcp-badge]: https://img.shields.io/badge/GCP-Google_Cloud_Platform-black?colorA=white&logoColor=4285F4&logo=googlecloud
26+
[gcp]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#google-cloud-platform
27+
[k8s-badge]: https://img.shields.io/badge/K8s-Kubernetes-black?colorA=white&logoColor=326CE5&logo=kubernetes
28+
[k8s]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#kubernetes
29+
[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
30+
31+
![](https://github.com/iterative/static/raw/main/img/tpi/high-level-light.png#gh-light-mode-only)
32+
![](https://github.com/iterative/static/raw/main/img/tpi/high-level-dark.png#gh-dark-mode-only)
33+
34+
## What's Special
35+
36+
There are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators):
37+
38+
1. **Reduced management overhead and infrastructure cost**:
39+
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline.
40+
2. **Unified tool for data science and software development teams**:
41+
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.
42+
43+
[^scalers]: [AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html), [Azure VM Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets), [GCP managed instance groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups), and [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job).
44+
45+
<img width=24px src="https://static.iterative.ai/logo/cml.svg"/> TPI is used to power [CML runners](https://cml.dev/doc/self-hosted-runners), bringing cloud providers to existing CI/CD workflows.
2346

2447
## Usage
2548

@@ -36,7 +59,7 @@ Supported cloud vendors include:
3659
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
3760
sudo apt-get update && sudo apt-get install terraform
3861
```
39-
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication)
62+
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth]
4063
4164
### Define a Task
4265
@@ -47,6 +70,7 @@ terraform {
4770
required_providers { iterative = { source = "iterative/iterative" } }
4871
}
4972
provider "iterative" {}
73+
5074
resource "iterative_task" "example" {
5175
cloud = "aws" # or any of: gcp, az, k8s
5276
machine = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
@@ -59,8 +83,18 @@ resource "iterative_task" "example" {
5983
}
6084
script = <<-END
6185
#!/bin/bash
62-
mkdir results
63-
echo "Hello World!" > results/greeting.txt
86+
87+
# create output directory if needed
88+
mkdir -p results
89+
# read last result (in case of spot/preemptible instance recovery)
90+
if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi
91+
EPOCH=$${EPOCH:-1} # start from 1 if last result not found
92+
93+
echo "(re)starting training loop from $EPOCH up to 1337 epochs"
94+
for epoch in $(seq $EPOCH 1337); do
95+
sleep 1
96+
echo "$epoch" | tee results/epoch.txt
97+
done
6498
END
6599
}
66100
```
@@ -81,7 +115,7 @@ TF_LOG_PROVIDER=INFO terraform apply
81115

82116
This launches a `machine` in the `cloud`, uploads `workdir`, and runs the `script`. Upon completion (or error), the `machine` is terminated.
83117

84-
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent storage will be used to relaunch interrupted tasks.
118+
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent (`disk_size`) storage will be used to relaunch interrupted tasks.
85119

86120
### Query Status
87121

@@ -92,14 +126,52 @@ TF_LOG_PROVIDER=INFO terraform refresh
92126
TF_LOG_PROVIDER=INFO terraform show
93127
```
94128

95-
### Stop Tasks
129+
### Stop Task
96130

97131
```
98132
TF_LOG_PROVIDER=INFO terraform destroy
99133
```
100134

101135
This terminates the `machine` (if still running), downloads `output`, and removes the persistent `disk_size` storage.
102136

137+
## How it Works
138+
139+
This diagram may help to see what TPI does under the hood:
140+
141+
```mermaid
142+
flowchart LR
143+
subgraph tpi [what TPI manages]
144+
direction LR
145+
subgraph you [what you manage]
146+
direction LR
147+
A([Personal Computer])
148+
end
149+
B[("Cloud Storage (low cost)")]
150+
C{{"Cloud instance scaler (zero cost)"}}
151+
D[["Cloud (spot) Instance"]]
152+
A ---> |create cloud storage| B
153+
A --> |create cloud instance scaler| C
154+
A ==> |upload script & workdir| B
155+
A -.-> |"offline (lunch break)"| A
156+
C -.-> |"(re)provision instance"| D
157+
D ==> |run script| D
158+
B <-.-> |persistent workdir cache| D
159+
D ==> |script end,\nshutdown instance| B
160+
D -.-> |outage| C
161+
B ==> |download output| A
162+
end
163+
style you fill:#FFFFFF00,stroke:#13ADC7
164+
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
165+
style A fill:#13ADC7,stroke:#333333,color:#000000
166+
style B fill:#945DD5,stroke:#333333,color:#000000
167+
style D fill:#F46737,stroke:#333333,color:#000000
168+
style C fill:#7B61FF,stroke:#333333,color:#000000
169+
```
170+
171+
## Future Plans
172+
173+
TPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations & options, and tighter ecosystem integration with tools such as [DVC](https://dvc.org).
174+
103175
## Help
104176

105177
The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`.

docs/guides/authentication.md

+8-1
@@ -13,6 +13,8 @@ TF_LOG_PROVIDER=INFO terraform apply
1313

1414
## Amazon Web Services
1515

16+
[Create an AWS account](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/) if needed, and then set these environment variables:
17+
1618
- `AWS_ACCESS_KEY_ID` - Access key identifier.
1719
- `AWS_SECRET_ACCESS_KEY` - Secret access key.
1820
- `AWS_SESSION_TOKEN` - (Optional) Session token.
@@ -29,6 +31,8 @@ export AWS_SECRET_ACCESS_KEY="$(terraform output --raw aws_secret_access_key)"
2931

3032
## Microsoft Azure
3133

34+
[Create an Azure account](https://docs.microsoft.com/en-us/learn/modules/create-an-azure-account/) if needed, and then set these environment variables:
35+
3236
- `AZURE_CLIENT_ID` - Client identifier.
3337
- `AZURE_CLIENT_SECRET` - Client secret.
3438
- `AZURE_SUBSCRIPTION_ID` - Subscription identifier.
@@ -48,7 +52,10 @@ export AZURE_CLIENT_SECRET="$(terraform output --raw azure_client_secret)"
4852

4953
## Google Cloud Platform
5054

51-
- `GOOGLE_APPLICATION_CREDENTIALS` - Path to (or contents of) a service account JSON key file.
55+
[Create a GCP account](https://cloud.google.com/free) if needed, and then set either one of these environment variables:
56+
57+
- `GOOGLE_APPLICATION_CREDENTIALS` - **Path** to a service account JSON key file.
58+
- `GOOGLE_APPLICATION_CREDENTIALS_DATA` - Alternatively, **contents** of a service account JSON key file.
5259

5360
See the [GCP documentation](https://cloud.google.com/docs/authentication/getting-started#creating_a_service_account) to obtain these variables directly.
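
Since either variable is accepted, it can help to confirm which one is set before running `terraform apply`. The following is a minimal sketch, not part of TPI: `gcp_credentials_mode` is a hypothetical helper, and because the guide does not state which variable takes precedence when both are set, the sketch only reports that case.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not provided by TPI): report which of the
# two GCP credential variables from the list above is set.
gcp_credentials_mode() {
  path="${GOOGLE_APPLICATION_CREDENTIALS:-}"
  data="${GOOGLE_APPLICATION_CREDENTIALS_DATA:-}"
  if [ -n "$path" ] && [ -n "$data" ]; then
    echo "both"      # precedence is not documented here; prefer setting one
  elif [ -n "$path" ]; then
    echo "path"      # path to a service account JSON key file
  elif [ -n "$data" ]; then
    echo "contents"  # contents of a service account JSON key file
  else
    echo "unset"
    return 1
  fi
}
```

For example, after exporting only `GOOGLE_APPLICATION_CREDENTIALS_DATA`, calling `gcp_credentials_mode` prints `contents`.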
5461

docs/guides/getting-started.md

+24-11
@@ -20,11 +20,11 @@ page_title: Getting Started
2020
sudo apt-get update && sudo apt-get install terraform
2121
```
2222

23-
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][authentication]
23+
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth]
2424

25-
[authentication]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
25+
[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
2626

27-
## Defining a Task
27+
## Define a Task
2828

2929
In a project root directory, create a file named `main.tf` with the following contents:
3030

@@ -33,6 +33,7 @@ terraform {
3333
required_providers { iterative = { source = "iterative/iterative" } }
3434
}
3535
provider "iterative" {}
36+
3637
resource "iterative_task" "example" {
3738
cloud = "aws" # or any of: gcp, az, k8s
3839
machine = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
@@ -45,8 +46,18 @@ resource "iterative_task" "example" {
4546
}
4647
script = <<-END
4748
#!/bin/bash
48-
mkdir results
49-
echo "Hello World!" > results/greeting.txt
49+
50+
# create output directory if needed
51+
mkdir -p results
52+
# read last result (in case of spot/preemptible instance recovery)
53+
if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi
54+
EPOCH=$${EPOCH:-1} # start from 1 if last result not found
55+
56+
echo "(re)starting training loop from $EPOCH up to 1337 epochs"
57+
for epoch in $(seq $EPOCH 1337); do
58+
sleep 1
59+
echo "$epoch" | tee results/epoch.txt
60+
done
5061
END
5162
}
5263
```
@@ -61,7 +72,7 @@ The project layout should look similar to this:
6172
project/
6273
├── main.tf
6374
└── results/
64-
└── greeting.txt (created in the cloud and downloaded locally)
75+
└── epoch.txt (created in the cloud and downloaded locally)
6576
```
6677

6778
## Initialise Terraform
@@ -72,7 +83,7 @@ $ terraform init
7283

7384
This command will check `main.tf` and download the required TPI plugin.
7485

75-
~> **Warning:** None of the subsequent commands will work without first setting some [authentication environment variables][authentication].
86+
~> **Warning:** None of the subsequent commands will work without first setting some [authentication environment variables][auth].
7687

7788
## Run Task
7889

@@ -82,14 +93,16 @@ $ TF_LOG_PROVIDER=INFO terraform apply
8293

8394
This command will:
8495

85-
1. Create all the required cloud resources.
96+
1. Create all the required cloud resources (provisioning a `machine` with `disk_size` storage).
8697
2. Upload the working directory (`workdir`) to the cloud.
8798
3. Launch the task `script`.
8899

89-
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent storage will be used to relaunch interrupted tasks.
100+
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent (`disk_size`) storage will be used to relaunch interrupted tasks.
90101

91102
-> **Note:** A large `workdir` may take a long time to upload.
92103

104+
~> **Warning:** To take full advantage of spot instance recovery, a `script` should start by checking the disk for results (recovered from a previous interrupted run).
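
The recovery pattern in this warning can be tried locally before spending any cloud time. The sketch below is an illustration only, not part of TPI: `train` and its optional second "stop early" argument are hypothetical, and the early stop merely simulates a spot interruption. Like the example `script`, it resumes from the last epoch recorded in `results/epoch.txt`.

```shell
#!/bin/sh
# Hypothetical helper: run a resumable loop up to $1 epochs, optionally
# stopping after epoch $2 to simulate an interrupted spot instance.
train() {
  total="$1"
  stop_at="${2:-$1}"
  mkdir -p results
  # start by checking the disk for results from a previous interrupted run
  if test -f results/epoch.txt; then
    start="$(cat results/epoch.txt)"
  else
    start=1
  fi
  epoch="$start"
  while [ "$epoch" -le "$total" ]; do
    echo "$epoch" > results/epoch.txt   # record progress after each epoch
    if [ "$epoch" -ge "$stop_at" ]; then
      return 0                          # simulated interruption (or finish)
    fi
    epoch=$((epoch + 1))
  done
}

# Work in a scratch directory so the demo leaves the project untouched.
workdir="$(mktemp -d)"
cd "$workdir"
train 10 4              # first run: "interrupted" after epoch 4
train 10                # second run: resumes from epoch 4 and finishes
cat results/epoch.txt   # prints 10
```

When a snippet like this is pasted into a `script` heredoc in `main.tf`, each `${...}` must be written as `$${...}` (as the example task does with `$${EPOCH:-1}`) so that Terraform does not treat it as an interpolation.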
105+
93106
-> **Note:** The [`id`](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#id) returned by `terraform apply` (i.e. `[id=tpi-···]`) can be used to locate the created cloud resources through the cloud's web console or command–line tool.
94107

95108
## Query Status
@@ -113,9 +126,9 @@ $ TF_LOG_PROVIDER=INFO terraform destroy
113126
This command will:
114127

115128
1. Download the `output` directory from the cloud.
116-
2. Delete all the cloud resources created by `terraform apply`.
129+
2. Delete all the cloud resources created by `terraform apply` (terminating `machine` if it's still running and removing the persistent `disk_size` storage).
117130

118-
In this example, after running `terraform destroy`, the `results` directory should contain a file named `greeting.txt` with the text `Hello, World!`
131+
In this example, after running `terraform destroy`, the `results` directory should contain a file named `epoch.txt` with the text `1337`.
119132

120133
-> **Note:** A large `output` directory may take a long time to download.
121134
