Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to call get_results lazily #1596

Open
sk1p opened this issue Feb 19, 2024 · 3 comments · May be fixed by #1632
Open

Allow to call get_results lazily #1596

sk1p opened this issue Feb 19, 2024 · 3 comments · May be fixed by #1632
Milestone

Comments

@sk1p
Copy link
Member

sk1p commented Feb 19, 2024

Right now, UDF.get_results is called for every intermediate result, for every UDF that is running, when using run_udf_iter. The user of run_udf_iter may not be interested in all of these intermediates, especially if they are very small. Additionally, if the get_results function is relatively expensive, and partitions comparatively small, this can result in a potential slow-down, as get_results becomes the bottleneck. A sample profile where this happened in practice looks like this:

image

(get_results is highlighted and takes ~50% in this case)

Instead of eagerly generating results for every intermediate result, if we only lazily call it, we can decouple the computation and visualization in a way that we only pay the get_results cost if we are actually "drawing" an updated visualization frame (in this case, we send it over the network, but the same principle applies by backpressure)

This would remove another reason to manually set the frames_per_partition parameter in LiberTEM-live use cases.

@uellue
Copy link
Member

uellue commented Feb 19, 2024

The Analysis interface is doing this already, perhaps this can be applied as a blueprint?

Maybe we can also use dask.delayed?

@uellue
Copy link
Member

uellue commented Feb 19, 2024

This callable is returned as an analysis result, for example.

@sk1p
Copy link
Member Author

sk1p commented Feb 19, 2024

The Analysis interface is doing this already, perhaps this can be applied as a blueprint?
[...]
This callable is returned as an analysis result, for example.

I think this can be done in a generic way that works for all UDFs the same way, instead of allowing UDFs to opt-in via changing get_results or similar, like it's done for analyses. We can just decide not to call get_results if a result is not used by the consumer of the iteration. There's no explicit guarantee in the UDF interface that each merge call means get_results will be called afterwards, for example (another case to think about: per-node merge trees in the distributed use-case, where results from worker processes on a node are merged together before being sent to a main node, which is an option we should keep open for ourselves). We can clarify this in the documentation, to make clear that get_results should be a pure function.

An example implementation could switch UDFResults with a UDFResultsLazy class, with the same outer interface, just instead of having the final results, is has the merge results plus a reference to the UDFs themselves. On accessing UDFResults.buffers, get_results is called (need to be careful about buffers that are re-used, concurrency etc., where it gets a bit interesting - we may actually need a lock to guarantee atomic access to partial results, without interleaving merge calls with result accesses)

Maybe we can also use dask.delayed?

I think that would be overkill for this usage (would create a hard dependency on dask in core UDF code, and always need dask resources/threads/clients/...) - we are not building up complicated graphs I think, and pure-Python should be enough (lazy == a function instead of a value). If you are already using the dask delayed executor for the computation, the result is already lazy (well, and you can't actually use run_udf_iter so 🤷)

@sk1p sk1p added this to the 0.15 milestone Apr 23, 2024
sk1p added a commit to sk1p/LiberTEM that referenced this issue Apr 25, 2024
@sk1p sk1p linked a pull request Apr 25, 2024 that will close this issue
10 tasks
sk1p added a commit to sk1p/LiberTEM that referenced this issue Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants