
Separate input and output directory settings #395


Merged 3 commits into master from separate-input-output on Feb 21, 2022

Conversation

@0x2b3bfa0 (Member) commented Feb 15, 2022

Follow-up of #340.

Old behavior

When output was unset, "", or null, it implicitly mirrored the value of input.

resource "iterative_task" "example" {
  ···
  workdir {
    input = "/path/to/input"
    # output = "/path/to/input" is implicit
  }
}

New behavior

When output is unset, "", or null, it is disabled; it no longer implicitly takes the value of input.

resource "iterative_task" "example" {
  ···
  workdir {
    input = "/path/to/input"
    output = "/path/to/output"
  }
}
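
For completeness, a minimal sketch of the new behavior with output omitted (or set to null): the input directory is uploaded on apply, but nothing is downloaded back on destroy. The path is only a placeholder.

resource "iterative_task" "example" {
  ···
  workdir {
    input = "/path/to/input"
    # output omitted: nothing is synchronized back on destroy
  }
}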

@0x2b3bfa0 self-assigned this Feb 15, 2022
@0x2b3bfa0 requested a review from a team February 15, 2022 15:44
@0x2b3bfa0 added the p1-important (High priority) label Feb 15, 2022
@0x2b3bfa0 marked this pull request as ready for review February 15, 2022 15:46
@DavidGOrtega (Contributor)

I'm not sure about this PR. What's the goal? There are scenarios where syncing back to the current workdir is fine (CI, for example).

@0x2b3bfa0 (Member, Author) commented Feb 16, 2022

What's the goal?

Fix #340: setting workdir.output to null will not synchronize back any artifacts, as proposed.

There are scenarios where syncing back to the current workdir is fine (CI, for example)

In those scenarios, you can set workdir.input and workdir.output to the same value.
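
For example, a CI job that wants the old mirroring behavior back could point both settings at the same directory. A minimal sketch, with "." standing in for whatever the actual working directory is:

resource "iterative_task" "example" {
  ···
  workdir {
    input  = "."
    output = "."
  }
}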

@0x2b3bfa0 (Member, Author)

If I recall correctly, 🔔 @casperdcl also had a preference for the old output ||= input behavior. 🤔 Still, I believe that making input and output fully independent will be better in the long run.

🔔 @iterative/cml, any opinions/alternatives?

@DavidGOrtega (Contributor)

I see, but it's an odd mechanism for users to have to guess. We should at least document it thoroughly.
Also, it's not as interesting to me as the one I proposed in the original issue.
Reminder:

  • use a skip-output-sync prop

I consider this superior because it would allow you to change your mind later if we make this property non-destructive.
It's also more explicit.

@DavidGOrtega

This comment was marked as off-topic.

@0x2b3bfa0

This comment was marked as off-topic.

@0x2b3bfa0 (Member, Author) commented Feb 16, 2022

I believe that the behavior introduced with this pull request is cleaner and more explicit:1

  • workdir {input = null, output = null} — don't push nor pull any data
  • workdir {input = "a", output = null} — push a on apply and don't pull on destroy
  • workdir {input = null, output = "b"} — don't push on apply and pull to b on destroy
  • workdir {input = "a", output = "b"} — push a on apply and pull to b on destroy

What you propose is a bit more complex to understand, at least for me: 🤔

  • workdir {input = null, output = null} — don't push nor pull any data
  • workdir {input = "a", output = null} — push a on apply and implicitly pull to a on destroy
  • workdir {input = "a", output = "b"} — push a on apply and pull to b on destroy
  • workdir {input = "a", output = null, skip_output_sync = true} — push a on apply and don't pull on destroy

Footnotes

  1. explicit is better than implicit

@0x2b3bfa0 (Member, Author)

We should at least document it thoroughly

I wholeheartedly agree, no matter which option we choose.

I consider this superior because it would allow you to change your mind later if we make this property non-destructive

Can't we just configure input and output with ForceNew: false in the same way? 🤔

It's also more explicit

To my mind, it's quite the opposite, but I guess that it's a subjective perception in both cases.

@DavidGOrtega (Contributor) commented Feb 16, 2022

I have never contemplated the idea of not pushing input data. I do not even envision many scenarios like that.

workdir {input = "a", output = null} — push a on apply and implicitly pull to a on destroy

To my mind, this is the most common and expected case, and something that already happens in CI. If I generate artefacts during the task, I want them to be in the same place they would have been if I had run it locally.

@0x2b3bfa0 (Member, Author) commented Feb 16, 2022

I have never contemplated the idea of not pushing input data. I do not even envision many scenarios like that.

There might be some users who don't want to synchronize a local working directory but still want to retrieve the results after execution, e.g. by cloning the repository as part of the script to reduce wait times, or DVC-driven tasks.
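
A rough sketch of that kind of task, assuming the resource's script argument and a placeholder repository URL: nothing is uploaded from the local machine, and only the results are pulled back on destroy.

resource "iterative_task" "example" {
  ···
  workdir {
    # no input: nothing is uploaded from the local machine
    output = "/path/to/results"
  }
  script = <<-END
    #!/bin/bash
    # clone inside the task instead of uploading a local working directory
    git clone https://example.com/user/project .
    ./train.sh
  END
}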

@0x2b3bfa0 (Member, Author) commented Feb 16, 2022

To my mind, this is the most common and expected case, and something that already happens in CI.

You seem particularly focused on CI/CD scenarios. I wonder why. 😉

If I generate artefacts during the task, I want them to be in the same place they would have been if I had run it locally.

Are you sure? Such a behavior would overwrite local files with potentially outdated copies, as happened in #306.

@DavidGOrtega (Contributor) commented Feb 16, 2022

I believe that the behavior introduced with this pull request is cleaner and more explicit:1

To my mind that's not the case because, as I stated, I think the most expected behavior is:

push a on apply and implicitly pull to a on destroy

cloning the repository as part of the script to reduce wait times

I do not envision cloning the repo and working locally, because that kills experimentation. Should they make a change and immediately commit?
Also, when working in CI, the repository has already been cloned.

or DVC-driven tasks

Don't DVC-driven tasks usually need at least a DVC pipeline that has to be pushed?

I might be wrong, but we should probably discuss this urgently instead of relying on personal assumptions.

@0x2b3bfa0 (Member, Author)

To my mind that's not the case because, as I stated, I think the most expected behavior is: “push a on apply and implicitly pull to a on destroy”

In CI/CD systems this may hold true, but overwriting files, as warned in #395 (comment), doesn't look like the most user-friendly experience for local users.

I do not envision cloning the repo and working locally, because that kills experimentation. Should they make a change and immediately commit?

🤔 Should we even assume that users have some version control system in place?

Also, when working in CI, the repository has already been cloned

CI/CD scenarios are pretty clear and work as expected. 👍🏼

Don't DVC-driven tasks usually need at least a DVC pipeline that has to be pushed?

But probably pushed with DVC itself instead of rclone 🤷🏼‍♂️ Just thinking ahead of time.

I might be wrong, but we should probably discuss this urgently instead of relying on personal assumptions

Shall we return to this page and refine the existing proposals or suggest new ones? 🤔

@DavidGOrtega (Contributor)

🤔 Should we even assume that users have some version control system in place?

No. But I'm not expecting them to clone.

@DavidGOrtega (Contributor) commented Feb 16, 2022

overwriting files, as warned in #395 (comment), doesn't look like the most user-friendly experience for local users.

Overwriting is the most common behaviour in the majority of commands involving files. The issue here is wanting to do multiple runs in the same folder, but this is a scenario that is managed by a regular user on a daily basis, I would say.

@DavidGOrtega (Contributor)

Shall we return to this page and refine the existing proposals or suggest new ones? 🤔

We already had a behaviour in place. The issue is that introducing skip would completely alter that at-least-known, if not yet validated, behaviour.

You seem particularly focused on CI/CD scenarios. I wonder why. 😉

It's not because of the CI. It's because, as stated, I think that's the most common way. Let's imagine that the task is not running in the cloud; it's just a magical power that enhances our laptop, making it capable of training with a GPU. As I stated, the most common way to understand the output is that the artefacts are going to be in the place where the script generates them.

Also, we should always reduce the complexity of the story. We cannot say “for local do a, for CI do b”; that's more confusing than saying “for local or CI, do a”.

@0x2b3bfa0 (Member, Author)

I'm not expecting them to clone

Sounds like a valid use case? 🤔

Overwriting is the most common behaviour in the majority of commands involving files.

Overwriting is a common behavior in operations that explicitly take a list of files. It doesn't seem like the expected behavior for such a complex process, especially when the working directory contains both the user's project and the Terraform code & state files.

The issue here is wanting to do multiple runs in the same folder, but this is a scenario that is managed by a regular user on a daily basis.

What do you mean by “managed by a regular user on a daily basis”? Is this process slow enough to not require any automation? 🤔

We already had a behaviour in place. The issue is that introducing skip would completely alter that at-least-known, if not yet validated, behaviour.

Do you mean the original behavior, prior to #340?

It's not because of the CI. It's because, as stated, I think that's the most common way. Let's imagine that the task is not running in the cloud; it's just a magical power that enhances our laptop, making it capable of training with a GPU. As I stated, the most common way to understand the output is that the artefacts are going to be in the place where the script generates them.

Agreed! That's the behavior we had in place before #340, but that behavior caused #306 and overwrote local files with outdated copies.

Also, we should always reduce the complexity of the story.

Agreed!

We cannot say “for local do a, for CI do b”; that's more confusing than saying “for local or CI, do a”

Sometimes, both a and b make sense for different use cases, and overloading the API (#324, #339, #389, #390) to be “consistent” for all of these use cases might be worse than letting users be aware of the differences. 🤔

Co-authored-by: Casper da Costa-Luis <[email protected]>
@casperdcl merged commit ed322f3 into master Feb 21, 2022
@casperdcl deleted the separate-input-output branch February 21, 2022 18:24
@casperdcl added the resource-task (iterative_task TF resource) label Feb 21, 2022
Successfully merging this pull request may close these issues.

task: optionally skip downloading artefacts on destroy