ENH: Caching of inputs #338

Closed
hmgaudecker opened this issue Jan 23, 2023 · 8 comments
Labels
enhancement New feature or request

@hmgaudecker
Contributor

Is your feature request related to a problem?

Yes and no: at least, pytask's behavior and my expectations were not aligned when I ran the code snippet below.

In particular, I expected the task to be re-run whenever the result of load_model_dict() changes; that function is defined in a central module.

Describe the solution you'd like

I would love to see the possibility to hash Python inputs, similar to what is done for file contents. Usually these objects will be much smaller, and hashing them allows for more granularity. (In the example above I could of course specify the central config.py as a dependency, but that would mean doing so everywhere, and whenever it changes the entire pipeline would be re-run. Splitting its contents across many files would also be possible, but ugly.)

I would also like the default of hash_python_inputs (or whatever such a decorator might be called) to be true, in the spirit of "correctness trumps performance" (typically, these objects will be small relative to files).
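To make "hash Python inputs" concrete: a content hash only needs a canonical serialization of the input. A minimal sketch of such a hash (not pytask's actual implementation; `stable_hash` is a hypothetical helper, and `default=str` is a simplifying assumption for non-JSON types):

```python
import hashlib
import json

def stable_hash(obj):
    # json.dumps with sort_keys=True yields a canonical text form, so two
    # equal dicts hash identically regardless of key insertion order.
    canonical = json.dumps(obj, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same content, different insertion order -> same hash.
a = stable_hash({"model": "probit", "n_draws": 100})
b = stable_hash({"n_draws": 100, "model": "probit"})
assert a == b
```

Because such objects are small, hashing them on every run is cheap relative to re-running a task unnecessarily.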

API breaking implications

If the default is set as suggested above, behavior might change in some cases. Otherwise this is purely an addition.

Additional context


def _get_parametrization(models):
    id_to_kwargs = {}
    for model in models:
        id_to_kwargs[model] = {
            "model_dict": load_model_dict(model_name=model),
            "depends_on": ORIGINAL_DATA["data.dta"],
            "produces": BLD["data"][model]["result.pickle"],
        }
    return id_to_kwargs


for id_, kwargs in _get_parametrization(MODELS).items():

    @pytask.mark.task(id=id_, kwargs=kwargs)
    def task_final_data(depends_on, produces, model_dict):
        pass
@hmgaudecker hmgaudecker added the enhancement New feature or request label Jan 23, 2023
@tobiasraabe
Member

I like the idea! What are possible interfaces?

Extending pytask.mark.depends_on

@pytask.mark.depends_on({"first": ..., "second": ...}, hash=<>)

where <> can be

  • True to hash all.
  • "first" to hash only the first dependency. (Positional indices if not a dict?)

Extra decorator

@pytask.mark.hashed_depends_on()

Wrapper for dependencies

Something similar to pytest.param https://docs.pytest.org/en/7.1.x/example/parametrize.html#set-marks-or-test-id-for-individual-parametrized-test
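The pytest.param analogy might look like a small wrapper that tags individual values for hashing. `Hashed` and `collect_dependency` below are hypothetical names, a sketch rather than a proposed API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Hashed:
    # Hypothetical wrapper, analogous to pytest.param: it tags a single
    # value so the collector knows to hash it rather than treat it as a path.
    value: Any

def collect_dependency(dep):
    # Sketch of the collection step branching on the wrapper.
    if isinstance(dep, Hashed):
        return ("hash", dep.value)
    return ("path", dep)
```

The advantage over a global flag is that hashing can be opted into per value, e.g. `{"first": Hashed(model_dict), "second": "data.dta"}`.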

@hmgaudecker
Contributor Author

  1. Love the idea, but I don't see how it would work in practice. That is, how would pytask be able to differentiate between a dependency that is a file and a dependency that is a Python object? In the example above, model_dict is an input to the task function just like depends_on, and I do think that the leaves of depends_on should always be files.
  2. That is what I had in mind, but given the name I am not sure whether we are on the same page?
  3. I don't follow how that would work; could you be more explicit?

IIUC, the above is mostly about hashing files, not about hashing Python objects that are inputs to a task function?

@tobiasraabe
Member

I had two things in mind.

  1. depends_on already exists and should be the entry point for all dependencies regardless of type. It would feel counterintuitive to have it only for file path dependencies and not for everything else.
  2. For every dependency, pytask goes through a list of hooks to collect it (same with tasks, by the way), but there is currently only a FilePathNode, which collects files. pytask could be extended with a PythonObjectNode.

And, then, hashing can be turned on if necessary. Paths are also currently not hashed; instead, we use the last modified date.
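A PythonObjectNode along these lines could expose the same kind of state as the file node, but return a content hash instead of a last-modified date. A minimal sketch (hypothetical class, not pytask's internals):

```python
import hashlib
import pickle

class PythonObjectNode:
    # Hypothetical node type: where file paths are tracked via their
    # last-modified date, the state here is a content hash, so any change
    # to the object invalidates the task. Caveat: pickle output is only
    # stable for simple, deterministic types (dicts, lists, numbers, strings).
    def __init__(self, value):
        self.value = value

    def state(self):
        return hashlib.sha256(pickle.dumps(self.value)).hexdigest()
```

Comparing the stored state against the current one on each run would then decide whether the task needs to be executed again.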

@hmgaudecker
Contributor Author

Sounds good, but it might be hard to distinguish between FilePathNodes and PythonObjectNodes; e.g., a string could be both. Would it be acceptable to explicitly require PythonObjectNode(model_dict) in the above example?

I was genuinely surprised that the above did not work -- for me as a user it seems the same whether the body of a function changes or some input does. Is there any reason to allow task functions to have arguments beyond depends_on and produces, then?

@tobiasraabe
Member

Sounds good, but it might be hard to distinguish between FilePathNodes and PythonObjectNodes. E.g., a string could be both? Would be fine to explicitly require PythonObjectNode(model_dict) in the above example?

Maybe we deprecate strings as file paths, or add some logic to differentiate them. The second option can indeed be ugly.

I was genuinely surprised that the above did not work -- for me as a user it seems the same whether the body of a function changes or some input does.

Not all inputs are tracked; only depends_on and produces are. Sometimes additional inputs change the function's signature, which triggers a re-run.

Is there any reason to allow task functions to have arguments beyond depends_on and produces, then?

Everything should be a dependency except for the products. Thus, we could remove pytask.mark.depends_on, but how would we define products? With the decorator, or is there something less bulky?

@hmgaudecker
Contributor Author

Wild thought: can't we borrow from dags logic and just use reserved keywords

file_deps
outputs

for task functions instead of the depends_on and produces decorators? (I picked those names only for obvious differentiation.) These two would only ever refer to files; everything else would be a PythonObjectNode unless a FilePathNode is explicitly passed.
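Under a reserved-keyword scheme, classifying a task function's arguments could be as simple as inspecting its signature. A sketch using the hypothetical names from the proposal:

```python
import inspect

# Hypothetical reserved names from the proposal above.
RESERVED_FILE_KEYWORDS = {"file_deps", "outputs"}

def classify_arguments(func):
    # Reserved names are file dependencies/products; everything else is
    # treated as a Python-object input to be hashed.
    return {
        name: "file" if name in RESERVED_FILE_KEYWORDS else "python_object"
        for name in inspect.signature(func).parameters
    }

def task_final_data(file_deps, outputs, model_dict):
    pass

# classify_arguments(task_final_data)
# -> {"file_deps": "file", "outputs": "file", "model_dict": "python_object"}
```

This keeps file handling explicit while letting arbitrary extra arguments participate in change detection by default.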

I was genuinely surprised that the above did not work -- for me as a user it seems the same whether the body of a function changes or some input does.

Not all input is tracked. Only depends_on and produces are. Sometimes additional inputs change the signature that triggers a re-run.

Sure, I understand that now; it just was not intuitive to me 😇

@tobiasraabe
Member

A new feature dropped in FastAPI that uses Annotated to add more metadata to function arguments. This could be very interesting for the implementation.
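For reference, metadata attached via typing.Annotated can be read back at collection time with get_type_hints(..., include_extras=True). `HashedInput` below is a hypothetical marker, not a pytask or FastAPI API:

```python
from typing import Annotated, get_type_hints

class HashedInput:
    # Hypothetical marker: its presence in the Annotated metadata would
    # tell the collector to hash this argument's value.
    pass

def task_final_data(model_dict: Annotated[dict, HashedInput]):
    pass

# The collector can recover the marker from the annotation.
hints = get_type_hints(task_final_data, include_extras=True)
metadata = hints["model_dict"].__metadata__  # (HashedInput,)
```

This would keep the task signature as the single source of truth, with no extra decorator needed.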

@tobiasraabe
Member

The feature will be available in v0.4. It is documented here: https://pytask-dev.readthedocs.io/en/latest/how_to_guides/hashing_inputs_of_tasks.html.
