TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. Full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
9
+
TPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.
10
10
11
-
- **Provision Resources**: create cloud compute (CPU, GPU, RAM) & storage resources without reading pages of documentation
12
-
- **Sync & Execute**: easily sync & run local data & code in the cloud
13
-
- **Low cost**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
14
-
- **No waste**: auto-cleanup unused resources (terminate compute instances upon job completion/failure & remove storage upon download of results)
15
-
- **No lock-in**: switch between several cloud vendors with ease due to concise unified configuration
11
+
- **Lower cost with spot recovery**: transparent auto-recovery from interrupted low-cost spot/preemptible instances
12
+
- **No cloud vendor lock-in**: switch between clouds with just one line thanks to unified abstraction
13
+
- **No waste**: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
14
+
- **Developer-first experience**: one-command data sync & code execution with no external server, making the cloud feel like a laptop
16
15
17
-
Supported cloud vendors include:
16
+
Supported cloud vendors [include][auth]:
18
17
19
-
- Amazon Web Services (AWS)
20
-
- Microsoft Azure
21
-
- Google Cloud Platform (GCP)
22
-
- Kubernetes (K8s)
18
+
|[![Amazon Web Services (AWS)][aws-badge]][aws]|[![Microsoft Azure][azure-badge]][azure]|[![Google Cloud Platform (GCP)][gcp-badge]][gcp]|[![Kubernetes (K8s)][k8s-badge]][k8s]|
There are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators):
37
+
38
+
1. **Reduced management overhead and infrastructure cost**:
39
+
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline.
40
+
2. **Unified tool for data science and software development teams**:
41
+
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.
42
+
43
+
[^scalers]: [AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html), [Azure VM Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets), [GCP managed instance groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups), and [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job).
44
+
45
+
<img width=24px src="https://static.iterative.ai/logo/cml.svg"/> TPI is used to power [CML runners](https://cml.dev/doc/self-hosted-runners), bringing cloud providers to existing CI/CD workflows.
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication)
62
+
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth]
TPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations & options, and tighter ecosystem integration with tools such as [DVC](https://dvc.org).
174
+
103
175
## Help
104
176
105
177
The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`.
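
As a minimal sketch of the debugging tip above, the wrapper below runs any `terraform` subcommand with `TF_LOG_PROVIDER=DEBUG` set for that invocation only. The helper name `tf_debug` is hypothetical and not part of TPI or Terraform.

```shell
# Hypothetical helper (not part of TPI): run a terraform subcommand
# with verbose provider logging enabled for that single invocation.
tf_debug() {
  TF_LOG_PROVIDER=DEBUG terraform "$@"
}
```

For example, `tf_debug apply` would behave like `terraform apply` but print the provider's debug logs.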
[Create an AWS account](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/) if needed, and then set these environment variables:
[Create an Azure account](https://docs.microsoft.com/en-us/learn/modules/create-an-azure-account/) if needed, and then set these environment variables:
- `GOOGLE_APPLICATION_CREDENTIALS` - Path to (or contents of) a service account JSON key file.
55
+
[Create a GCP account](https://cloud.google.com/free) if needed, and then set either one of these environment variables:
56
+
57
+
- `GOOGLE_APPLICATION_CREDENTIALS` - **Path** to a service account JSON key file.
58
+
- `GOOGLE_APPLICATION_CREDENTIALS_DATA` - Alternatively, **contents** of a service account JSON key file.
52
59
53
60
See the [GCP documentation](https://cloud.google.com/docs/authentication/getting-started#creating_a_service_account) to obtain these variables directly.
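
The two GCP options could be set up roughly as follows. This is only a sketch: the key file location and the `GCP_KEY_JSON` variable (imagined here as a CI secret holding the key contents) are assumptions for illustration, not names defined by TPI.

```shell
# Hypothetical location of a downloaded service account key; adjust as needed.
KEY_FILE="${KEY_FILE:-$HOME/gcp-key.json}"

if [ -n "${GCP_KEY_JSON:-}" ]; then
  # Contents variant: GCP_KEY_JSON is an assumed variable (e.g. a CI secret)
  # that already holds the JSON key itself.
  export GOOGLE_APPLICATION_CREDENTIALS_DATA="$GCP_KEY_JSON"
else
  # Path variant: point at the key file on disk.
  export GOOGLE_APPLICATION_CREDENTIALS="$KEY_FILE"
fi
```

Only one of the two variables is exported, matching the "either one" wording above.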
1. Create all the required cloud resources (provisioning a `machine` with `disk_size` storage).
86
97
2. Upload the working directory (`workdir`) to the cloud.
87
98
3. Launch the task `script`.
88
99
89
-
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent storage will be used to relaunch interrupted tasks.
100
+
With spot/preemptible instances (`spot >= 0`), auto-recovery logic and persistent (`disk_size`) storage will be used to relaunch interrupted tasks.
90
101
91
102
-> **Note:** A large `workdir` may take a long time to upload.
92
103
104
+
~> **Warning:** To take full advantage of spot instance recovery, a `script` should start by checking the disk for results (recovered from a previous interrupted run).
105
+
93
106
-> **Note:** The [`id`](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#id) returned by `terraform apply` (i.e. `[id=tpi-···]`) can be used to locate the created cloud resources through the cloud's web console or command-line tool.
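
The spot-recovery advice above (start by checking the disk for earlier results) might look like the sketch below. It assumes the task's output directory is named `results` and that progress is tracked in `epoch.txt`; the `OUTPUT_DIR` and `TOTAL_EPOCHS` variables and the placeholder loop body are illustrative, not the project's actual example.

```shell
#!/bin/bash
# Resumable task sketch: record progress in the output directory and
# continue from the last completed epoch after a spot interruption.
OUTPUT_DIR="${OUTPUT_DIR:-results}"    # assumed to match the task's output directory
TOTAL_EPOCHS="${TOTAL_EPOCHS:-1337}"   # illustrative workload size

mkdir -p "$OUTPUT_DIR"

# If a previous (interrupted) run left a progress file behind, read it.
if [ -f "$OUTPUT_DIR/epoch.txt" ]; then
  LAST="$(cat "$OUTPUT_DIR/epoch.txt")"
fi
START=$(( ${LAST:-0} + 1 ))            # start from 1 when nothing was recovered

echo "(re)starting from epoch $START of $TOTAL_EPOCHS"
epoch="$START"
while [ "$epoch" -le "$TOTAL_EPOCHS" ]; do
  # ... one unit of real work (e.g. a training epoch) would go here ...
  echo "$epoch" > "$OUTPUT_DIR/epoch.txt"   # persist progress after each epoch
  epoch=$(( epoch + 1 ))
done
```

When a script like this is embedded in a Terraform heredoc, `${...}` sequences need to be escaped as `$${...}` so that Terraform does not treat them as interpolation.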
1. Download the `output` directory from the cloud.
116
-
2. Delete all the cloud resources created by `terraform apply`.
129
+
2. Delete all the cloud resources created by `terraform apply` (terminating `machine` if it's still running and removing the persistent `disk_size` storage).
117
130
118
-
In this example, after running `terraform destroy`, the `results` directory should contain a file named `greeting.txt` with the text `Hello, World!`
131
+
In this example, after running `terraform destroy`, the `results` directory should contain a file named `epoch.txt` with the text `1337`.
119
132
120
133
-> **Note:** A large `output` directory may take a long time to download.