docs: iteration 3 #492

Merged
merged 29 commits into from
Apr 15, 2022
Changes from 6 commits
57 changes: 50 additions & 7 deletions README.md
@@ -8,19 +8,21 @@

TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. Full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.

- **Provision Resources**: create cloud compute (CPU, GPU, RAM) & storage resources without reading pages of documentation
- **Sync & Execute**: easily sync & run local data & code in the cloud
- **Easy to use**: create cloud compute (CPU, GPU, RAM) & storage resources without reading pages of documentation
- **Low cost**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
- **No cloud vendor lock-in**: switch between several cloud vendors with ease due to concise unified configuration
- **Seamless developer experience**: sync & run data & code in the cloud as easily as on a local laptop
- **No waste**: auto-cleanup unused resources (terminate compute instances upon job completion/failure & remove storage upon download of results)
- **No lock-in**: switch between several cloud vendors with ease due to concise unified configuration

Supported cloud vendors include:
Supported cloud vendors [include][auth]:
@jorgeorpinel (Contributor), Apr 27, 2022:

💅🏼 I'd link the other 3 words instead.


- Amazon Web Services (AWS)
- Microsoft Azure
- Google Cloud Platform (GCP)
- Kubernetes (K8s)

[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication

## Usage

### Requirements
@@ -36,7 +38,7 @@ Supported cloud vendors include:
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install terraform
```
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication)
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth]

### Define a Task

@@ -59,8 +61,17 @@ resource "iterative_task" "example" {
}
script = <<-END
#!/bin/bash
mkdir results
echo "Hello World!" > results/greeting.txt

# create output directory if needed
mkdir -p results
# read last result (in case of spot/preemptible instance recovery)
if [[ -f results/epoch.txt ]]; then EPOCH="$(cat results/epoch.txt)"; fi

# (re)start training loop up to 10 epochs
for epoch in $(seq ${EPOCH:-1} 10); do
sleep 1
echo "$epoch" > results/epoch.txt
done
END
}
```
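The recovery logic in the script above can be tried locally, without any cloud resources. The following is a minimal sketch and not part of TPI: `run_training`, `TOTAL_EPOCHS`, and the temporary directory are hypothetical names introduced for illustration, and the `sleep` is dropped so the loop finishes immediately.

```shell
# Hypothetical stand-in for the task script: resume from the last recorded epoch.
TOTAL_EPOCHS=10

# The parenthesised body runs in a subshell, so its variables stay private.
run_training() (
  workdir="$1"
  start=1
  # create output directory if needed
  mkdir -p "$workdir/results"
  # read last result (as after a spot/preemptible instance recovery)
  if [ -f "$workdir/results/epoch.txt" ]; then
    start="$(cat "$workdir/results/epoch.txt")"
  fi
  # (re)start the loop from the recorded epoch
  for epoch in $(seq "$start" "$TOTAL_EPOCHS"); do
    echo "$epoch" > "$workdir/results/epoch.txt"
  done
)

# Simulate an interruption after epoch 4, followed by a recovery run.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/results"
echo 4 > "$demo_dir/results/epoch.txt"
run_training "$demo_dir"
cat "$demo_dir/results/epoch.txt"
```

Because the loop restarts at the recorded value, the last recorded epoch is repeated after a recovery. That is harmless here, but a real training job would normally save a model checkpoint alongside the epoch counter.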
@@ -119,6 +130,38 @@ Instead of using the latest stable release, a local copy of the repository must
```
3. Use `source = "github.com/iterative/iterative"` in your `main.tf` to use the local repository (`source = "iterative/iterative"` will download the latest release instead), and run `terraform init --upgrade`

This diagram may also help show what TPI does under the hood:

```mermaid
flowchart LR
subgraph tpi [what TPI manages]
direction LR
subgraph you [what you manage]
direction LR
A([Personal Computer])
end
B[("Cloud Storage (low cost)")]
C{{"Cloud Orchestrator (zero cost)"}}
D[["Cloud (spot) Instance"]]
A ---> |create cloud storage| B
A --> |create cloud orchestrator| C
A ==> |upload script & workdir| B
A -.-> |"offline (lunch break)"| A
C -.-> |"(re)provision instance"| D
D ==> |run script| D
B <-.-> |persistent workdir cache| D
D ==> |script end,\nshutdown instance| B
D -.-> |outage| C
B ==> |download output| A
end
style you fill:#fff,stroke:#13ADC7
style tpi fill:#fff,stroke:#fff,stroke-width:0px
style A fill:#13ADC7,stroke:#333
style B fill:#945DD5,stroke:#333
style D fill:#F46737,stroke:#333
style C fill:#7B61FF,stroke:#333
```

## Copyright

This project and all contributions to it are distributed under [![Apache-2.0][licence-badge]][licence-file]
6 changes: 6 additions & 0 deletions docs/guides/authentication.md
@@ -13,6 +13,8 @@ TF_LOG_PROVIDER=INFO terraform apply

## Amazon Web Services

[Create an AWS account](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/) if needed, and then set these environment variables:

- `AWS_ACCESS_KEY_ID` - Access key identifier.
- `AWS_SECRET_ACCESS_KEY` - Secret access key.
- `AWS_SESSION_TOKEN` - (Optional) Session token.
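
Before running `terraform apply`, it can help to confirm that the required variables are actually set in the current shell. The snippet below is a minimal sketch: `require_env` is a hypothetical helper written for this example, not something TPI or Terraform provides.

```shell
# Hypothetical helper: report any of the named variables that are unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names come from the caller, not user input).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# AWS_SESSION_TOKEN is optional, so only the two required variables are checked.
if require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; then
  echo "AWS credentials found"
else
  echo "AWS credentials incomplete"
fi
```

The same helper could be called with the Azure variable names listed in the next section.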
@@ -29,6 +31,8 @@ export AWS_SECRET_ACCESS_KEY="$(terraform output --raw aws_secret_access_key)"

## Microsoft Azure

[Create an Azure account](https://docs.microsoft.com/en-us/learn/modules/create-an-azure-account/) if needed, and then set these environment variables:

- `AZURE_CLIENT_ID` - Client identifier.
- `AZURE_CLIENT_SECRET` - Client secret.
- `AZURE_SUBSCRIPTION_ID` - Subscription identifier.
@@ -48,6 +52,8 @@ export AZURE_CLIENT_SECRET="$(terraform output --raw azure_client_secret)"

## Google Cloud Platform

[Create a GCP account](https://cloud.google.com/free) if needed, and then set the environment variable:

- `GOOGLE_APPLICATION_CREDENTIALS` - Path to (or contents of) a service account JSON key file.

See the [GCP documentation](https://cloud.google.com/docs/authentication/getting-started#creating_a_service_account) to obtain this key file directly.
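
Since `GOOGLE_APPLICATION_CREDENTIALS` may hold either a path or the JSON contents themselves, a quick check of which form is currently set can save some debugging. This is only an illustrative sketch: `classify_gcp_credentials` is a hypothetical helper, and the "starts with `{`" test is an assumption about what JSON contents look like, not a description of how TPI parses the value.

```shell
# Hypothetical helper: describe what GOOGLE_APPLICATION_CREDENTIALS currently holds.
classify_gcp_credentials() {
  value="${GOOGLE_APPLICATION_CREDENTIALS:-}"
  if [ -z "$value" ]; then
    echo "unset"
  elif [ -f "$value" ]; then
    # The value names an existing file, so treat it as a path to the key file.
    echo "path"
  else
    case "$value" in
      # Assumption: inline JSON contents begin with an opening brace.
      "{"*) echo "contents" ;;
      *) echo "unknown" ;;
    esac
  fi
}

classify_gcp_credentials
```

An `unknown` result usually points to a mistyped path.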
15 changes: 12 additions & 3 deletions docs/guides/getting-started.md
@@ -45,8 +45,17 @@ resource "iterative_task" "example" {
}
script = <<-END
#!/bin/bash
mkdir results
echo "Hello World!" > results/greeting.txt

# create output directory if needed
mkdir -p results
# read last result (in case of spot/preemptible instance recovery)
if [[ -f results/epoch.txt ]]; then EPOCH="$(cat results/epoch.txt)"; fi

# (re)start training loop up to 10 epochs
for epoch in $(seq ${EPOCH:-1} 10); do
sleep 1
echo "$epoch" > results/epoch.txt
done
END
}
```
@@ -115,7 +124,7 @@ This command will:
1. Download the `output` directory from the cloud.
2. Delete all the cloud resources created by `terraform apply`.

In this example, after running `terraform destroy`, the `results` directory should contain a file named `greeting.txt` with the text `Hello, World!`
In this example, after running `terraform destroy`, the `results` directory should contain a file named `epoch.txt` with the text `10`.
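
The expected outcome can also be checked from the shell. The function below is a small sketch: `check_final_epoch` is a hypothetical name, and the default path and value simply mirror the example above.

```shell
# Hypothetical helper: confirm the downloaded result holds the expected final epoch.
check_final_epoch() {
  file="${1:-results/epoch.txt}"
  expected="${2:-10}"
  if [ ! -f "$file" ]; then
    echo "missing result file: $file" >&2
    return 1
  fi
  actual="$(cat "$file")"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $file contains $actual"
  else
    echo "unexpected value in $file: $actual (expected $expected)" >&2
    return 1
  fi
}
```

Running `check_final_epoch` from the project directory after `terraform destroy` should report `ok` if the task ran to completion. A lower number suggests the task was destroyed before the loop finished.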

-> **Note:** A large `output` directory may take a long time to download.

12 changes: 7 additions & 5 deletions docs/index.md
@@ -7,21 +7,23 @@

TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. Full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.

- **Provision Resources**: create cloud compute (CPU, GPU, RAM) & storage resources without reading pages of documentation
- **Sync & Execute**: easily sync & run local data & code in the cloud
- **Easy to use**: create cloud compute (CPU, GPU, RAM) & storage resources without reading pages of documentation
- **Low cost**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
- **No cloud vendor lock-in**: switch between several cloud vendors with ease due to concise unified configuration
- **Seamless developer experience**: sync & run data & code in the cloud as easily as on a local laptop
- **No waste**: auto-cleanup unused resources (terminate compute instances upon job completion/failure & remove storage upon download of results)
- **No lock-in**: switch between several cloud vendors with ease due to concise unified configuration

Supported cloud vendors include:
Supported cloud vendors [include][auth]:

- Amazon Web Services (AWS)
- Microsoft Azure
- Google Cloud Platform (GCP)
- Kubernetes (K8s)

[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication

## Links

- [Getting Started](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started)
- [Authentication](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication)
- [Authentication][auth]
- [Full reference](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task)
1 change: 1 addition & 0 deletions docs/resources/task.md
@@ -27,6 +27,7 @@ resource "iterative_task" "example" {
}
script = <<-END
#!/bin/bash
mkdir -p results
echo "$GREETING" | tee results/$(uuidgen)
END
# or: script = file("example.sh")