docs: iteration 3 #492

Merged
merged 29 commits into from
Apr 15, 2022

Conversation

Contributor

@casperdcl casperdcl commented Apr 11, 2022

  • update README: split USPs into features & USPs
    • sync with docs/index.md
  • update description
  • add high-level diagram
  • more info on cloud account creation
  • update example scripts (make the spot recovery use case clearer)
  • fix cross-references/pluralisation
  • add low-level diagram

high-level


inspired by

low-level

flowchart LR
subgraph tpi [what TPI manages]
direction LR
    subgraph you [what you manage]
        direction LR
        A([Personal Computer])
    end
    B[("Cloud Storage (low cost)")]
    C{{"Cloud instance scaler (zero cost)"}}
    D[["Cloud (spot) Instance"]]
    A ---> |create cloud storage| B
    A --> |create cloud instance scaler| C
    A ==> |upload script & workdir| B
    A -.-> |"offline (lunch break)"| A
    C -.-> |"(re)provision instance"| D
    D ==> |run script| D
    B <-.-> |persistent workdir cache| D
    D ==> |script end,\nshutdown instance| B
    D -.-> |outage| C
    B ==> |download output| A
end
style you fill:#FFFFFF00,stroke:#13ADC7
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
style A fill:#13ADC7,stroke:#333333,color:#000000
style B fill:#945DD5,stroke:#333333,color:#000000
style D fill:#F46737,stroke:#333333,color:#000000
style C fill:#7B61FF,stroke:#333333,color:#000000

@casperdcl casperdcl self-assigned this Apr 11, 2022
@casperdcl casperdcl changed the title docs: diagrams docs: iteration 3 Apr 11, 2022
@casperdcl casperdcl requested review from dmpetrov and a team April 11, 2022 11:42
@casperdcl casperdcl added documentation Markdown files resource-task iterative_task TF resource labels Apr 11, 2022
@jendefig

Question: Where in the Readme will the new banner image go? I see where the low-level diagram goes, but not the banner. It's not replacing the Terraform/Iterative banner, right?

@casperdcl
Contributor Author

> Where in the Readme will the new [high-level] banner image go?

I was thinking somewhere mid-way. Potentially just below the USP bullet list.

Member

@dmpetrov dmpetrov left a comment

It feels like we are lacking the unique competitive advantage, or "Why TPI?". Is there a way to introduce it?

@casperdcl casperdcl temporarily deployed to automatic April 14, 2022 21:45 Inactive
Contributor Author

@casperdcl casperdcl left a comment

potential 3rd point (IaC/HaC)

TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline.
2. **Unified tool for data science and software development teams**:
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.

Contributor Author

Suggested change
3. **Reproducible, codified environments**: Store hardware requirements & pipelines in a single configuration file with the rest of your ML project code.

TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups ([AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html), [Azure VM Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets), [GCP managed instance groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups), and [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job)), taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline.
2. **Unified tool for data science and software development teams**:
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.

Contributor Author

Suggested change
3. **Reproducible, codified environments**: Store hardware requirements & pipelines in a single configuration file with the rest of your ML project code.

@casperdcl casperdcl temporarily deployed to automatic April 14, 2022 22:08 Inactive
@casperdcl casperdcl temporarily deployed to automatic April 14, 2022 22:09 Inactive
@casperdcl casperdcl temporarily deployed to automatic April 14, 2022 22:09 Inactive
@casperdcl casperdcl temporarily deployed to automatic April 14, 2022 22:09 Inactive
Member

@dmpetrov dmpetrov left a comment

Looks great!

A couple of minor changes:

  1. It might be better to link to the CML repository and cml.dev, not a doc
  2. Adding a 3rd item to the list was a good idea, such as "Reproducible, codified environments", "Extend your GitOps and CI/CD-oriented workflows", "hardware as code" for AI/ML, or several of these.

Let me slide this typo I found in with the docs pr
@0x2b3bfa0 0x2b3bfa0 temporarily deployed to automatic April 15, 2022 22:05 Inactive
@0x2b3bfa0 0x2b3bfa0 temporarily deployed to automatic April 15, 2022 22:06 Inactive
@0x2b3bfa0 0x2b3bfa0 temporarily deployed to automatic April 15, 2022 22:06 Inactive
@0x2b3bfa0
Member

0x2b3bfa0 commented Apr 15, 2022

🔔 @dmpetrov & @iterative/cml, we still have some unresolved conversations that are unlikely to have a noticeable influence on the result:

Can we merge this in the current state and address them in a separate pull request?

@0x2b3bfa0
Member

Merging as per this conversation

@0x2b3bfa0 0x2b3bfa0 merged commit d24f99c into master Apr 15, 2022
@0x2b3bfa0 0x2b3bfa0 deleted the docs-iter branch April 15, 2022 23:24
@0x2b3bfa0 0x2b3bfa0 mentioned this pull request Apr 16, 2022
6 tasks
1. **Reduced management overhead and infrastructure cost**:
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline.
2. **Unified tool for data science and software development teams**:
TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.
Member

> and reduces time to deliver ML models into production

Is this a direct effect of using TPI for development? Given that this tool is not intended for model serving, that assertion might be slightly misleading.

@casperdcl casperdcl mentioned this pull request Apr 19, 2022
21 tasks
Comment on lines +12 to +13
- **No cloud vendor lock-in**: switch between clouds with just one line thanks to unified abstraction
- **No waste**: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
Contributor

@jorgeorpinel jorgeorpinel Apr 27, 2022

💅🏼 missing . periods in these ?


-Supported cloud vendors include:
+Supported cloud vendors [include][auth]:
Contributor

@jorgeorpinel jorgeorpinel Apr 27, 2022

💅🏼 I'd link the other 3 words instead.

[gcp]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#google-cloud-platform
[k8s-badge]: https://img.shields.io/badge/K8s-Kubernetes-black?colorA=white&logoColor=326CE5&logo=kubernetes
[k8s]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#kubernetes
[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
Contributor

@jorgeorpinel jorgeorpinel Apr 27, 2022

💅🏼 💅🏼 💅🏼 Shouldn't it be at the beginning of the link list? 😋

Comment on lines +31 to +32
![](https://github.com/iterative/static/raw/main/img/tpi/high-level-light.png#gh-light-mode-only)
![](https://github.com/iterative/static/raw/main/img/tpi/high-level-dark.png#gh-dark-mode-only)
Contributor

#gh-(light|dark)-mode-only

Woah how does that work? Just curious

Comment on lines +34 to +39
## What's Special

There are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators):

1. **Reduced management overhead and infrastructure cost**:
TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead & infrastructure costs. You can close your laptop while cloud tasks are running -- auto-recovery happens even if you are offline.
Contributor

When we number the list it seems like we mean that it is an exhaustive list. Is it? Otherwise maybe use bullets.

But also it somewhat overlaps with the previous bullet list. Maybe they can be combined? So that the reader can get to the Usage right after the vendors figure

@@ -92,14 +126,52 @@ TF_LOG_PROVIDER=INFO terraform refresh
TF_LOG_PROVIDER=INFO terraform show
```

-### Stop Tasks
+### Stop Task
Contributor

@jorgeorpinel jorgeorpinel Apr 27, 2022

End/Delete a Task?
" the Task?

## Future Plans

TPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations & options, and tighter ecosystem integration with tools such as [DVC](https://dvc.org).

## Help

The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`.
Contributor

💅🏼

Suggested change
-The [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`.
+The [Getting Started](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) guide has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`.
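
The debugging tip in the Help text above could be wrapped in a small shell helper. This is only an illustrative sketch: `tf_with_log` and the `TERRAFORM` override variable are hypothetical names, not part of TPI or Terraform. Only `TF_LOG_PROVIDER` and its `INFO`/`DEBUG` values come from the README under review.

```shell
# Hypothetical helper (not part of TPI): run a Terraform subcommand
# with a chosen provider log level.
# TERRAFORM is an assumed override for the binary name; defaults to "terraform".
tf_with_log() {
  level="$1"   # log level, e.g. INFO or DEBUG
  shift        # remaining arguments are the Terraform subcommand and its flags
  TF_LOG_PROVIDER="$level" "${TERRAFORM:-terraform}" "$@"
}
```

For instance, `tf_with_log DEBUG refresh` followed by `tf_with_log DEBUG show` would mirror the README's `INFO` commands with more verbose provider output.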

Comment on lines -23 to +25
-- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][authentication]
+- Create an account with any supported cloud vendor and expose its [authentication credentials via environment variables][auth]

-[authentication]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
+[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication
Contributor

BTW I think the nav would be more logical with Getting Started on top, since it covers installation and then it links to the auth section.

Labels
documentation Markdown files resource-task iterative_task TF resource

8 participants