docs: iteration 3 #492

Merged
merged 29 commits into from
Apr 15, 2022
106 changes: 89 additions & 17 deletions README.md
@@ -1,25 +1,48 @@
![TPI](https://static.iterative.ai/img/cml/banner-tpi.svg)
![TPI](https://static.iterative.ai/img/tpi/banner.svg)

# Terraform Provider Iterative (TPI)

[![docs](https://img.shields.io/badge/-docs-5c4ee5?logo=terraform)](https://registry.terraform.io/providers/iterative/iterative/latest/docs)
[![tests](https://img.shields.io/github/workflow/status/iterative/terraform-provider-iterative/Test?label=tests&logo=GitHub)](https://github.com/iterative/terraform-provider-iterative/actions/workflows/test.yml)
[![Apache-2.0][licence-badge]][licence-file]

TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. Full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.

- **Provision Resources**: create cloud compute (CPU, GPU, RAM) & storage resources without reading pages of documentation
- **Sync & Execute**: easily sync & run local data & code in the cloud
- **Low cost**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
- **No waste**: auto-cleanup unused resources (terminate compute instances upon job completion/failure & remove storage upon download of results)
- **No lock-in**: switch between several cloud vendors with ease due to concise unified configuration
- **Lower cost with spot recovery**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
Contributor

Suggested change
- **Lower cost with spot recovery**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
- **Lower cost with spot recovery**: transparent auto-recovery from interrupted low-cost spot/preemptible instances and automatic intermediate data backup

Contributor Author

Made a lot of changes since this review; a couple of outstanding things:

We are not stopping tasks, we are destroying tasks

technically we are stopping tasks <- terminating instances <- destroying resources

automatic intermediate data backup

implied by transparent spot auto-recovery, right?

Contributor

implied by transparent spot auto-recovery, right?

Yes and no: you make users think, or at least guess. There is a phrase in design (also the title of a praised book): "Don't Make Me Think".

Contributor

technically we are stopping tasks

No, technically tasks stop by themselves. What I mean is that users may find the cloud concept of "stop" ambiguous: they can stop a machine and then restart it at some later point. Here, after destroy, there is no way to resume.

- **No cloud vendor lock-in**: switch between clouds with just one line thanks to unified abstraction
- **No waste**: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
Comment on lines +12 to +13
Contributor
@jorgeorpinel, Apr 27, 2022

💅🏼 Missing periods (`.`) in these?

- **Developer-first experience**: one-command data sync & code execution with no external server, making the cloud feel like a laptop

Supported cloud vendors include:
Supported cloud vendors [include][auth]:
Contributor
@jorgeorpinel, Apr 27, 2022

💅🏼 I'd link the other 3 words instead.


- Amazon Web Services (AWS)
- Microsoft Azure
- Google Cloud Platform (GCP)
- Kubernetes (K8s)
| [![Amazon Web Services (AWS)][aws-badge]][aws] | [![Microsoft Azure][azure-badge]][azure] | [![Google Cloud Platform (GCP)][gcp-badge]][gcp] | [![Kubernetes (K8s)][k8s-badge]][k8s] |
| ---------------------------------------------- | ---------------------------------------- | ------------------------------------------------ | ------------------------------------- |

[aws-badge]: https://img.shields.io/badge/AWS-Amazon_Web_Services-black?colorA=white&logoColor=232F3E&logo=amazonaws
[aws]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#amazon-web-services
[azure-badge]: https://img.shields.io/badge/Azure-Microsoft_Azure-black?colorA=white&logoColor=0078D4&logo=microsoftazure
[azure]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#microsoft-azure
[gcp-badge]: https://img.shields.io/badge/GCP-Google_Cloud_Platform-black?colorA=white&logoColor=4285F4&logo=googlecloud
[gcp]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#google-cloud-platform
[k8s-badge]: https://img.shields.io/badge/K8s-Kubernetes-black?colorA=white&logoColor=326CE5&logo=kubernetes
[k8s]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#kubernetes
[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
Contributor
@jorgeorpinel, Apr 27, 2022

💅🏼 💅🏼 💅🏼 Shouldn't it be at the beginning of the link list? 😋


![](https://github.com/iterative/static/raw/main/img/tpi/high-level-light.png#gh-light-mode-only)
![](https://github.com/iterative/static/raw/main/img/tpi/high-level-dark.png#gh-dark-mode-only)
Comment on lines +31 to +32
Contributor

#gh-(light|dark)-mode-only

Woah how does that work? Just curious


## What's Special

There are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators):

1. **Reduced management overhead and infrastructure cost**:
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline.
Comment on lines +34 to +39
Contributor

When we number the list it seems like we mean that it is an exhaustive list. Is it? Otherwise maybe use bullets.

But also it somewhat overlaps with the previous bullet list. Maybe they can be combined? So that the reader can get to the Usage right after the vendors figure

2. **Unified tool for data science and software development teams**:
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.
Member

and reduces time to deliver ML models into production

Is this a direct effect of using TPI for development? Given that this tool is not intended for model serving, that assertion might be slightly misleading.


Contributor Author

Suggested change
3. **Reproducible, codified environments**: Store hardware requirements & pipelines in a single configuration file with the rest of your ML project code.

[^scalers]: [AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html), [Azure VM Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets), [GCP managed instance groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups), and [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job).

<img width=24px src="https://static.iterative.ai/logo/cml.svg"/> TPI is used to power [CML runners](https://cml.dev/doc/self-hosted-runners), bringing cloud providers to existing CI/CD workflows.

## Usage

Expand All @@ -36,7 +59,7 @@ Supported cloud vendors include:
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install terraform
```
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication)
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth]

### Define a Task

Expand All @@ -47,6 +70,7 @@ terraform {
required_providers { iterative = { source = "iterative/iterative" } }
}
provider "iterative" {}

resource "iterative_task" "example" {
cloud = "aws" # or any of: gcp, az, k8s
machine = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
Expand All @@ -59,8 +83,18 @@ resource "iterative_task" "example" {
}
script = <<-END
#!/bin/bash
mkdir results
echo "Hello World!" > results/greeting.txt

# create output directory if needed
mkdir -p results
# read last result (in case of spot/preemptible instance recovery)
if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi
EPOCH=$${EPOCH:-1} # start from 1 if last result not found

echo "(re)starting training loop from $EPOCH up to 1337 epochs"
for epoch in $(seq $EPOCH 1337); do
sleep 1
echo "$epoch" | tee results/epoch.txt
done
END
}
```
Expand All @@ -81,7 +115,7 @@ TF_LOG_PROVIDER=INFO terraform apply

This launches a `machine` in the `cloud`, uploads `workdir`, and runs the `script`. Upon completion (or error), the `machine` is terminated.

With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent storage will be used to relaunch interrupted tasks.
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent (`disk_size`) storage will be used to relaunch interrupted tasks.

### Query Status

Expand All @@ -92,14 +126,52 @@ TF_LOG_PROVIDER=INFO terraform refresh
TF_LOG_PROVIDER=INFO terraform show
```

### Stop Tasks
### Stop Task
Contributor
@jorgeorpinel, Apr 27, 2022

End/Delete a Task?
" the Task?


```
TF_LOG_PROVIDER=INFO terraform destroy
```

This terminates the `machine` (if still running), downloads `output`, and removes the persistent `disk_size` storage.

## How it Works

This diagram may help to see what TPI does under the hood:

```mermaid
flowchart LR
subgraph tpi [what TPI manages]
direction LR
subgraph you [what you manage]
direction LR
A([Personal Computer])
end
B[("Cloud Storage (low cost)")]
C{{"Cloud instance scaler (zero cost)"}}
D[["Cloud (spot) Instance"]]
A ---> |create cloud storage| B
A --> |create cloud instance scaler| C
A ==> |upload script & workdir| B
A -.-> |"offline (lunch break)"| A
C -.-> |"(re)provision instance"| D
D ==> |run script| D
B <-.-> |persistent workdir cache| D
D ==> |script end,\nshutdown instance| B
D -.-> |outage| C
B ==> |download output| A
Comment on lines +152 to +161
Contributor Author
@casperdcl, Apr 14, 2022

Suggested change
A ---> |create cloud storage| B
A --> |create cloud instance scaler| C
A ==> |upload script & workdir| B
A -.-> |"offline (lunch break)"| A
C -.-> |"(re)provision instance"| D
D ==> |run script| D
B <-.-> |persistent workdir cache| D
D ==> |script end,\nshutdown instance| B
D -.-> |outage| C
B ==> |download output| A
A ---> |2. create cloud storage| B
A --> |1. create cloud instance scaler| C
A ==> |3. upload script & workdir| B
A -.-> |"4. offline (lunch break)"| A
C -.-> |"5. (re)provision instance"| D
D ==> |7. run script| D
B <-.-> |6. persistent workdir cache| D
D ==> |8. script end,\nshutdown instance| B
D -.-> |outage| C
B ==> |9. download output| A

old:

flowchart LR
subgraph tpi [what TPI manages]
direction LR
    subgraph you [what you manage]
        direction LR
        A([Personal Computer])
    end
    B[("Cloud Storage (low cost)")]
    C{{"Cloud instance scaler (zero cost)"}}
    D[["Cloud (spot) Instance"]]
    A ---> |create cloud storage| B
    A --> |create cloud instance scaler| C
    A ==> |upload script & workdir| B
    A -.-> |"offline (lunch break)"| A
    C -.-> |"(re)provision instance"| D
    D ==> |run script| D
    B <-.-> |persistent workdir cache| D
    D ==> |script end,\nshutdown instance| B
    D -.-> |outage| C
    B ==> |download output| A
end
style you fill:#FFFFFF00,stroke:#13ADC7
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
style A fill:#13ADC7,stroke:#333333,color:#000000
style B fill:#945DD5,stroke:#333333,color:#000000
style D fill:#F46737,stroke:#333333,color:#000000
style C fill:#7B61FF,stroke:#333333,color:#000000

new:

flowchart LR
subgraph tpi [what TPI manages]
direction LR
    subgraph you [what you manage]
        direction LR
        A([Personal Computer])
    end
    B[("Cloud Storage (low cost)")]
    C{{"Cloud instance scaler (zero cost)"}}
    D[["Cloud (spot) Instance"]]
    A ---> |2. create cloud storage| B
    A --> |1. create cloud instance scaler| C
    A ==> |3. upload script & workdir| B
    A -.-> |"4. offline (lunch break)"| A
    C -.-> |"5. (re)provision instance"| D
    D ==> |7. run script| D
    B <-.-> |6. persistent workdir cache| D
    D ==> |8. script end,\nshutdown instance| B
    D -.-> |outage| C
    B ==> |9. download output| A
end
style you fill:#FFFFFF00,stroke:#13ADC7
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
style A fill:#13ADC7,stroke:#333333,color:#000000
style B fill:#945DD5,stroke:#333333,color:#000000
style D fill:#F46737,stroke:#333333,color:#000000
style C fill:#7B61FF,stroke:#333333,color:#000000

Member

Looks good.

Contributor

Cool diagram, but it's not readable without clicking the <-> button (not necessarily obvious usability). Any chance to make it taller and narrower, i.e. fold the diagram somehow?

end
style you fill:#FFFFFF00,stroke:#13ADC7
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
style A fill:#13ADC7,stroke:#333333,color:#000000
style B fill:#945DD5,stroke:#333333,color:#000000
style D fill:#F46737,stroke:#333333,color:#000000
style C fill:#7B61FF,stroke:#333333,color:#000000
```

## Future Plans

TPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations & options, and tighter ecosystem integration with tools such as [DVC](https://dvc.org).

## Help

The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`.
Contributor

💅🏼

Suggested change
The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`.
The [Getting Started](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) guide has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`.

9 changes: 8 additions & 1 deletion docs/guides/authentication.md
Expand Up @@ -13,6 +13,8 @@ TF_LOG_PROVIDER=INFO terraform apply

## Amazon Web Services

[Create an AWS account](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/) if needed, and then set these environment variables:

- `AWS_ACCESS_KEY_ID` - Access key identifier.
- `AWS_SECRET_ACCESS_KEY` - Secret access key.
- `AWS_SESSION_TOKEN` - (Optional) Session token.
Expand All @@ -29,6 +31,8 @@ export AWS_SECRET_ACCESS_KEY="$(terraform output --raw aws_secret_access_key)"

## Microsoft Azure

[Create an Azure account](https://docs.microsoft.com/en-us/learn/modules/create-an-azure-account/) if needed, and then set these environment variables:

- `AZURE_CLIENT_ID` - Client identifier.
- `AZURE_CLIENT_SECRET` - Client secret.
- `AZURE_SUBSCRIPTION_ID` - Subscription identifier.
Expand All @@ -48,7 +52,10 @@ export AZURE_CLIENT_SECRET="$(terraform output --raw azure_client_secret)"

## Google Cloud Platform

- `GOOGLE_APPLICATION_CREDENTIALS` - Path to (or contents of) a service account JSON key file.
[Create a GCP account](https://cloud.google.com/free) if needed, and then set either one of these environment variables:

- `GOOGLE_APPLICATION_CREDENTIALS` - **Path** to a service account JSON key file.
- `GOOGLE_APPLICATION_CREDENTIALS_DATA` - Alternatively, **contents** of a service account JSON key file.

See the [GCP documentation](https://cloud.google.com/docs/authentication/getting-started#creating_a_service_account) to obtain these variables directly.
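
The second option (key contents rather than a path) can be sketched in shell as follows. This is only an illustration: `load_gcp_key` is a hypothetical helper and not part of TPI or the Google Cloud tooling; only the variable name `GOOGLE_APPLICATION_CREDENTIALS_DATA` comes from the list above.

```shell
# Hypothetical helper (not part of TPI): read a service account JSON key file
# and export its contents, rather than its path, for the provider to use.
load_gcp_key() {
  local key_file="$1"
  if [ ! -r "$key_file" ]; then
    echo "cannot read key file: $key_file" >&2
    return 1
  fi
  GOOGLE_APPLICATION_CREDENTIALS_DATA="$(cat "$key_file")"
  export GOOGLE_APPLICATION_CREDENTIALS_DATA
}
```

Passing the contents can be convenient in CI systems that store secrets as strings instead of files, for example `load_gcp_key path/to/key.json` with a path of your own.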

35 changes: 24 additions & 11 deletions docs/guides/getting-started.md
Expand Up @@ -20,11 +20,11 @@ page_title: Getting Started
sudo apt-get update && sudo apt-get install terraform
```

- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][authentication]
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth]

[authentication]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
Comment on lines -23 to +25
Contributor

BTW I think the nav would be more logical with Getting Started on top, since it covers installation and then it links to the auth section.


## Defining a Task
## Define a Task

In a project root directory, create a file named `main.tf` with the following contents:

Expand All @@ -33,6 +33,7 @@ terraform {
required_providers { iterative = { source = "iterative/iterative" } }
}
provider "iterative" {}

resource "iterative_task" "example" {
cloud = "aws" # or any of: gcp, az, k8s
machine = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
Expand All @@ -45,8 +46,18 @@ resource "iterative_task" "example" {
}
script = <<-END
#!/bin/bash
mkdir results
echo "Hello World!" > results/greeting.txt

# create output directory if needed
mkdir -p results
# read last result (in case of spot/preemptible instance recovery)
if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi
EPOCH=$${EPOCH:-1} # start from 1 if last result not found

echo "(re)starting training loop from $EPOCH up to 1337 epochs"
for epoch in $(seq $EPOCH 1337); do
sleep 1
echo "$epoch" | tee results/epoch.txt
done
END
}
```
Expand All @@ -61,7 +72,7 @@ The project layout should look similar to this:
project/
├── main.tf
└── results/
└── greeting.txt (created in the cloud and downloaded locally)
└── epoch.txt (created in the cloud and downloaded locally)
```

## Initialise Terraform
Expand All @@ -72,7 +83,7 @@ $ terraform init

This command will check `main.tf` and download the required TPI plugin.

~> **Warning:** None of the subsequent commands will work without first setting some [authentication environment variables][authentication].
~> **Warning:** None of the subsequent commands will work without first setting some [authentication environment variables][auth].
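
A minimal pre-flight check for this warning might look like the sketch below. It assumes the AWS variable names from the authentication guide; `missing_vars` is a hypothetical helper, not something TPI or Terraform provides.

```shell
# Hypothetical helper (not part of TPI): print the name of every listed
# environment variable that is unset or empty, one per line.
missing_vars() {
  local name value
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}
```

For example, `missing_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY` prints nothing when both variables are set, so an empty result suggests it is safe to continue with `terraform apply`.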

## Run Task

Expand All @@ -82,14 +93,16 @@ $ TF_LOG_PROVIDER=INFO terraform apply

This command will:

1. Create all the required cloud resources.
1. Create all the required cloud resources (provisioning a `machine` with `disk_size` storage).
2. Upload the working directory (`workdir`) to the cloud.
3. Launch the task `script`.

With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent storage will be used to relaunch interrupted tasks.
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent (`disk_size`) storage will be used to relaunch interrupted tasks.

-> **Note:** A large `workdir` may take a long time to upload.

~> **Warning:** To take full advantage of spot instance recovery, a `script` should start by checking the disk for results (recovered from a previous interrupted run).
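
The check-then-resume pattern from the warning above can be tried locally, outside Terraform, with a sketch like the following. The function names `last_epoch` and `train` are hypothetical, and the loop body stands in for real training work; the file layout (`results/epoch.txt`) mirrors the example task.

```shell
# Hypothetical helpers for illustration; not part of TPI.

# Print the epoch to start from: the saved value if a checkpoint file
# exists (e.g. recovered after a spot interruption), otherwise 1.
last_epoch() {
  local file="$1"
  if test -f "$file"; then
    cat "$file"
  else
    echo 1
  fi
}

# Run a placeholder training loop, recording progress after every epoch
# so that an interrupted run can pick up where it left off.
train() {
  local dir="$1" total="$2" start epoch
  mkdir -p "$dir"
  start="$(last_epoch "$dir/epoch.txt")"
  for epoch in $(seq "$start" "$total"); do
    echo "$epoch" > "$dir/epoch.txt"
  done
}
```

Calling `train results 1337` a second time after an interruption resumes from the last recorded epoch rather than from 1. The sketch avoids `${...}` expansions on purpose: inside a Terraform heredoc they would need to be written as `$${...}`, as in the example `script`.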

-> **Note:** The [`id`](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#id) returned by `terraform apply` (i.e. `[id=tpi-···]`) can be used to locate the created cloud resources through the cloud's web console or command-line tool.

## Query Status
Expand All @@ -113,9 +126,9 @@ $ TF_LOG_PROVIDER=INFO terraform destroy
This command will:

1. Download the `output` directory from the cloud.
2. Delete all the cloud resources created by `terraform apply`.
2. Delete all the cloud resources created by `terraform apply` (terminating `machine` if it's still running and removing the persistent `disk_size` storage).

In this example, after running `terraform destroy`, the `results` directory should contain a file named `greeting.txt` with the text `Hello, World!`
In this example, after running `terraform destroy`, the `results` directory should contain a file named `epoch.txt` with the text `1337`.
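
One way to confirm that the downloaded output matches this expectation is sketched below. `task_finished` is a hypothetical helper, and the expected value `1337` is simply the final epoch from the example `script`.

```shell
# Hypothetical helper (not part of TPI): succeed only if epoch.txt exists in
# the given results directory and holds the expected final epoch.
task_finished() {
  local results_dir="$1" expected="$2"
  [ -f "$results_dir/epoch.txt" ] &&
    [ "$(cat "$results_dir/epoch.txt")" = "$expected" ]
}
```

For instance, `task_finished results 1337 && echo "task completed"` only prints the message when the task ran all the way to the last epoch before `terraform destroy` was called.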

-> **Note:** A large `output` directory may take a long time to download.
