docs: iteration 3 #492
Question: Where in the README will the new banner image go? I see where the low-level diagram goes, but not the banner. It's not replacing the Terraform/Iterative banner, right?
I was thinking somewhere midway. Potentially just below the USP bullet list.
It feels like we are lacking the "unique competitive advantage" or "Why TPI?". Is there a way to introduce it?
Co-authored-by: DavidGOrtega <[email protected]>
potential 3rd point (IaC/HaC)
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline. | ||
2. **Unified tool for data science and software development teams**: | ||
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production. | ||
|
3. **Reproducible, codified environments**: Store hardware requirements & pipelines in a single configuration file with the rest of your ML project code. | |
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups ([AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html), [Azure VM Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets), [GCP managed instance groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups), and [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job)), taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline. | ||
2. **Unified tool for data science and software development teams**: | ||
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production. | ||
|
3. **Reproducible, codified environments**: Store hardware requirements & pipelines in a single configuration file with the rest of your ML project code. | |
Looks great!
A couple of minor changes:
- It might be better to link to the CML repository and cml.dev, not a doc
- It was a good idea to add a 3rd item to the list, like "Reproducible, codified environments", "Extend your GitOps and CI/CD-oriented workflows", or "hardware as code" for AI/ML, or several of these.
Let me slip this typo fix I found in with the docs PR
🔔 @dmpetrov & @iterative/cml, we still have some unresolved conversations that are unlikely to have a noticeable influence on the result: can we merge this in its current state and address them in a separate pull request?
Merging as per this conversation
1. **Reduced management overhead and infrastructure cost**: | ||
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline. | ||
2. **Unified tool for data science and software development teams**: | ||
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production. |
> and reduces time to deliver ML models into production

Is this a direct effect of using TPI for development? Given that this tool is not intended for model serving, that assertion might be slightly misleading.
- **No cloud vendor lock-in**: switch between clouds with just one line thanks to unified abstraction | ||
- **No waste**: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use |
💅🏼 Missing periods in these?
|
||
Supported cloud vendors include: | ||
Supported cloud vendors [include][auth]: |
💅🏼 I'd link the other 3 words instead.
[gcp]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#google-cloud-platform | ||
[k8s-badge]: https://img.shields.io/badge/K8s-Kubernetes-black?colorA=white&logoColor=326CE5&logo=kubernetes | ||
[k8s]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#kubernetes | ||
[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication |
💅🏼 💅🏼 💅🏼 Shouldn't it be at the beginning of the link list? 😋
 | ||
 |
#gh-(light|dark)-mode-only
Woah how does that work? Just curious
## What's Special | ||
|
||
There are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators): | ||
|
||
1. **Reduced management overhead and infrastructure cost**: | ||
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline. |
Numbering the list suggests that it is exhaustive. Is it? Otherwise maybe use bullets.
It also somewhat overlaps with the previous bullet list. Maybe they can be combined, so that the reader gets to Usage right after the vendors figure?
@@ -92,14 +126,52 @@ TF_LOG_PROVIDER=INFO terraform refresh | |||
TF_LOG_PROVIDER=INFO terraform show | |||
``` | |||
|
|||
### Stop Tasks | |||
### Stop Task |
End/Delete a Task?
" the Task?
## Future Plans | ||
|
||
TPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations & options, and tighter ecosystem integration with tools such as [DVC](https://dvc.org). | ||
|
||
## Help | ||
|
||
The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`. |
💅🏼
The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`. | |
The [Getting Started](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) guide has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`. |
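
As a rough illustration of the debugging tip in the Help paragraph, the sketch below wraps a command so that it runs with `TF_LOG_PROVIDER=DEBUG`. The `tf_debug` helper is hypothetical and not part of TPI or Terraform; it only sets the environment variable for the wrapped command.

```shell
# Hypothetical helper (not part of TPI): run any command with
# debug-level provider logging enabled for that command only.
tf_debug() {
  TF_LOG_PROVIDER=DEBUG "$@"
}

# Usage, assuming terraform is installed and the project is initialised:
#   tf_debug terraform refresh
#   tf_debug terraform show
```

Calling `TF_LOG_PROVIDER=DEBUG terraform refresh` directly has the same effect; the wrapper just saves retyping the variable.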
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][authentication] | ||
- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth] | ||
|
||
[authentication]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication | ||
[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication |
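
For the prerequisite above, exposing credentials might look like the following for AWS. This is a sketch only: the variable names are the standard AWS SDK ones and are an assumption here, the values are placeholders, and the linked authentication guide remains the reference for what each vendor requires.

```shell
# Assumption: standard AWS SDK variable names; placeholder values, not real credentials.
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
```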
BTW I think the nav would be more logical with Getting Started on top, since it covers installation and then it links to the auth section.
update README USPs
split into features & USPs
docs/index.md
high-level
inspired by
low-level