ENH: Caching of inputs #338

Closed
hmgaudecker opened this issue Jan 23, 2023 · 8 comments
Labels
enhancement New feature or request

@hmgaudecker
Contributor

Is your feature request related to a problem?

Yes and no: at least, pytask's behavior and my expectations were not aligned when I ran the code snippet below.

In particular, I expected the task to be re-run whenever the result of load_model_dict() changes; that function is defined in a central module.

Describe the solution you'd like

I would love to see the possibility to hash Python inputs, similar to what is done for file contents. Usually these objects will be much smaller, and hashing them allows for more granularity. (In the example above I could of course specify the central config.py as a dependency, but that would mean doing so everywhere, and whenever it changes the entire pipeline would be re-run. Splitting its contents across many files would also be possible, but ugly.)

I would also like the default of hash_python_inputs (or whatever such a decorator might be called) to be true, in the spirit of "correctness trumps performance" (typically, these objects will be small relative to files).
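To make "hash Python inputs" concrete: a content hash only needs a canonical serialization of the input. A minimal sketch of such a hash (not pytask's actual implementation; `stable_hash` is a hypothetical helper, and `default=str` is a simplifying assumption for non-JSON types):

```python
import hashlib
import json

def stable_hash(obj):
    # json.dumps with sort_keys=True yields a canonical text form, so two
    # equal dicts hash identically regardless of key insertion order.
    canonical = json.dumps(obj, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same content, different insertion order -> same hash.
a = stable_hash({"model": "probit", "n_draws": 100})
b = stable_hash({"n_draws": 100, "model": "probit"})
assert a == b
```

Because such objects are small, hashing them on every run is cheap relative to re-running a task unnecessarily.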

API breaking implications

If the default is set as suggested above, behavior might change in some cases. Otherwise this is purely an addition.

Additional context


def _get_parametrization(models):
    id_to_kwargs = {}
    for model in models:
        id_to_kwargs[model] = {
            "model_dict": load_model_dict(model_name=model),
            "depends_on": ORIGINAL_DATA["data.dta"],
            "produces": BLD["data"][model]["result.pickle"],
        }
    return id_to_kwargs


for id_, kwargs in _get_parametrization(MODELS).items():

    @pytask.mark.task(id=id_, kwargs=kwargs)
    def task_final_data(depends_on, produces, model_dict):
        pass
@hmgaudecker hmgaudecker added the enhancement New feature or request label Jan 23, 2023
@tobiasraabe
Member

I like the idea! What are possible interfaces?

Extending pytask.mark.depends_on

@pytask.mark.depends_on({"first": ..., "second": ...}, hash=<>)

where <> can be

  • True to hash all.
  • "first" to hash only the first dependency. (Positional indices if not a dict?)

Extra decorator

@pytask.mark.hashed_depends_on()

Wrapper for dependencies

Something similar to pytest.param https://docs.pytest.org/en/7.1.x/example/parametrize.html#set-marks-or-test-id-for-individual-parametrized-test
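The pytest.param analogy might look like a small wrapper that tags individual values for hashing. `Hashed` and `collect_dependency` below are hypothetical names, a sketch rather than a proposed API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Hashed:
    # Hypothetical wrapper, analogous to pytest.param: it tags a single
    # value so the collector knows to hash it rather than treat it as a path.
    value: Any

def collect_dependency(dep):
    # Sketch of the collection step branching on the wrapper.
    if isinstance(dep, Hashed):
        return ("hash", dep.value)
    return ("path", dep)
```

The advantage over a global flag is that hashing can be opted into per value, e.g. `{"first": Hashed(model_dict), "second": "data.dta"}`.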

@hmgaudecker
Contributor Author

  1. Love the idea, but I don't see how it would work in practice. That is, how would pytask be able to differentiate between a dependency that is a file and a dependency that is a Python object? In the example above, model_dict is an input to the task function just like depends_on, and I do think that the leaves of depends_on should always be files.
  2. That is what I had in mind, but given the name I am not sure whether we are on the same page?
  3. I don't follow how that would work; could you be more explicit?

IIUC, the above is mostly about hashing files, not about hashing Python objects that are inputs to a task function?

@tobiasraabe
Member

I had two things in mind.

  1. depends_on already exists and should be the entry point for all dependencies regardless of type. It would feel counterintuitive to have it only for file path dependencies and not for everything else.
  2. For every dependency, pytask goes through a list of hooks to collect it (same with tasks, by the way), but there is currently only a FilePathNode, which collects files. pytask could be extended with a PythonObjectNode.

And, then, hashing can be turned on if necessary. Paths are also currently not hashed; instead, we use the last modified date.
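A PythonObjectNode along these lines could expose the same kind of state as the file node, but return a content hash instead of a last-modified date. A minimal sketch (hypothetical class, not pytask's internals):

```python
import hashlib
import pickle

class PythonObjectNode:
    # Hypothetical node type: where file paths are tracked via their
    # last-modified date, the state here is a content hash, so any change
    # to the object invalidates the task. Caveat: pickle output is only
    # stable for simple, deterministic types (dicts, lists, numbers, strings).
    def __init__(self, value):
        self.value = value

    def state(self):
        return hashlib.sha256(pickle.dumps(self.value)).hexdigest()
```

Comparing the stored state against the current one on each run would then decide whether the task needs to be executed again.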

@hmgaudecker
Contributor Author

Sounds good, but it might be hard to distinguish between FilePathNodes and PythonObjectNodes; e.g., a string could be both. Would it be acceptable to explicitly require PythonObjectNode(model_dict) in the above example?

I was genuinely surprised that the above did not work -- for me as a user it seems the same whether the body of a function changes or some input does. Is there any reason to allow task functions to have arguments beyond depends_on and produces, then?

@tobiasraabe
Member

Sounds good, but it might be hard to distinguish between FilePathNodes and PythonObjectNodes. E.g., a string could be both? Would be fine to explicitly require PythonObjectNode(model_dict) in the above example?

Maybe we deprecate strings as file paths, or add some logic to differentiate them. The second option can indeed be ugly.

I was genuinely surprised that the above did not work -- for me as a user it seems the same whether the body of a function changes or some input does.

Not all inputs are tracked; only depends_on and produces are. Sometimes additional inputs change the function's signature, which triggers a re-run.

Is there any reason to allow task functions to have arguments beyond depends_on and produces, then?

Everything should be a dependency except for the products. Thus, we could remove pytask.mark.depends_on, but how would we define products? With the decorator, or is there something less bulky?

@hmgaudecker
Contributor Author

Wild thought: can't we borrow from dags logic and just use reserved keywords

file_deps
outputs

for task functions instead of the depends_on and produces decorators? (I picked those names only for obvious differentiation.) These two would only ever refer to files; everything else would be a PythonObjectNode unless a FilePathNode is explicitly passed.
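Under a reserved-keyword scheme, classifying a task function's arguments could be as simple as inspecting its signature. A sketch using the hypothetical names from the proposal:

```python
import inspect

# Hypothetical reserved names from the proposal above.
RESERVED_FILE_KEYWORDS = {"file_deps", "outputs"}

def classify_arguments(func):
    # Reserved names are file dependencies/products; everything else is
    # treated as a Python-object input to be hashed.
    return {
        name: "file" if name in RESERVED_FILE_KEYWORDS else "python_object"
        for name in inspect.signature(func).parameters
    }

def task_final_data(file_deps, outputs, model_dict):
    pass

# classify_arguments(task_final_data)
# -> {"file_deps": "file", "outputs": "file", "model_dict": "python_object"}
```

This keeps file handling explicit while letting arbitrary extra arguments participate in change detection by default.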

I was genuinely surprised that the above did not work -- for me as a user it seems the same whether the body of a function changes or some input does.

Not all input is tracked. Only depends_on and produces are. Sometimes additional inputs change the signature that triggers a re-run.

Sure, I understand that now; it just was not intuitive to me 😇

@tobiasraabe
Member

A new feature dropped in FastAPI that uses Annotated to add more metadata to function arguments. This could be very interesting for the implementation.
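For reference, metadata attached via typing.Annotated can be read back at collection time with get_type_hints(..., include_extras=True). `HashedInput` below is a hypothetical marker, not a pytask or FastAPI API:

```python
from typing import Annotated, get_type_hints

class HashedInput:
    # Hypothetical marker: its presence in the Annotated metadata would
    # tell the collector to hash this argument's value.
    pass

def task_final_data(model_dict: Annotated[dict, HashedInput]):
    pass

# The collector can recover the marker from the annotation.
hints = get_type_hints(task_final_data, include_extras=True)
metadata = hints["model_dict"].__metadata__  # (HashedInput,)
```

This would keep the task signature as the single source of truth, with no extra decorator needed.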

@tobiasraabe
Member

The feature will be available in v0.4. It is documented here: https://pytask-dev.readthedocs.io/en/latest/how_to_guides/hashing_inputs_of_tasks.html.
