Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lead / lag / first_value / last_value - like reducers. #18

Open
ilyanoskov opened this issue Mar 17, 2024 · 3 comments
Open

Add lead / lag / first_value / last_value - like reducers. #18

ilyanoskov opened this issue Mar 17, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@ilyanoskov
Copy link

I tried to implement some functionality that would give me the previous seen value inside a window and I have realised that it is currently quite cumbersome to do it in Pathway.

I think this feature will be useful for anyone trying to do not-so-trivial aggregations.

@ilyanoskov ilyanoskov added the enhancement New feature or request label Mar 17, 2024
@ilyanoskov
Copy link
Author

To add more context, I was looking for a way to do an operation that is similar to diff - take a previous row value, manipulate it, and then multiply by the next row value. I defined a window with hop=1, duration=1 but was only able to get those values after an additional join - which I found cumbersome.

@dxtrous
Copy link
Member

dxtrous commented Mar 17, 2024

This is a great point. Actually, diff is a particularly unpleasant (slow and memory-inefficient) operation to perform in a streaming or distributed system in its full generality - and may cause unexpected performance bottlenecks. At the same time, it can be extremely useful. Whether or not it should be present as a keyword here is a recurring topic among repo maintainers.

If by any chance you are in the lucky place that your data has a column with sequential row numbers (seq = 1, 2, 3,...), compute seq_next = seq+1 and join the table with itself on table_copy1.seq_next = table_copy2.seq. It's not beautiful but it's (fairly) fast.

Next best case, if you can localize all the adjacent rows inside a small window, a reducer running a UDF over this window is a good idea.

In the most general case, you can use the sort syntax to achieve the diff like here: https://pathway.com/developers/showcases/event_stream_processing_time_between_occurrences. We are working to optimize this approach for speed, but even then, it will always be a bit unwieldy for streaming systems at scale (especially with partitioned data).

@dxtrous
Copy link
Member

dxtrous commented Mar 18, 2024

Comment: I'm bookmarking https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions to have a terminology reference for Topic for future participants of the discussion (taken from the non-streaming window world).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants