Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aw-transform: Add union_events_split #179

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

johan-bjareholt
Copy link
Member

One step in fixing the "Merge web watcher events into window events in the query? (to allow for classifying by url/domain)" ActivityWatch/aw-webui#151. We already have it in the same query, but the data is not merged.
Hopefully this will be a good transform to merge window and browser data in that manner.

@codecov
Copy link

codecov bot commented Oct 18, 2020

Codecov Report

Merging #179 (77eef3d) into master (7d55fca) will increase coverage by 13.67%.
The diff coverage is 5.06%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #179       +/-   ##
===========================================
+ Coverage   46.30%   59.97%   +13.67%     
===========================================
  Files          51       44        -7     
  Lines        6148     4765     -1383     
  Branches     1454        0     -1454     
===========================================
+ Hits         2847     2858       +11     
+ Misses       2442     1907      -535     
+ Partials      859        0      -859     
Impacted Files Coverage Δ
aw-transform/src/union.rs 0.00% <0.00%> (ø)
aw-query/src/functions.rs 82.92% <28.57%> (+16.05%) ⬆️
aw-server/src/dirs.rs 0.00% <0.00%> (-73.69%) ⬇️
aw-server/src/logging.rs 0.00% <0.00%> (-62.50%) ⬇️
aw-transform/src/split_url.rs 12.50% <0.00%> (-54.17%) ⬇️
aw-query/src/ast.rs 0.00% <0.00%> (-37.26%) ⬇️
aw-models/src/key_value.rs 15.38% <0.00%> (-34.62%) ⬇️
aw-transform/src/flood.rs 66.66% <0.00%> (-16.56%) ⬇️
aw-transform/src/filter_keyvals.rs 77.27% <0.00%> (-6.37%) ⬇️
aw-transform/src/find_bucket.rs 80.00% <0.00%> (-4.22%) ⬇️
... and 45 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d55fca...77eef3d. Read the comment docs.

Copy link
Member

@ErikBjare ErikBjare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only taken a quick look but looks alright.

However, as we discussed AFK, I'm not convinced this is the best approach to merging the web events compared to simply replacing all browser-window events when an active browser event exists, but that has its own issues (like having to inject the browser appname, slightly changing the title by removing the appended - Mozilla Firefox, and maybe more).


'event1: for mut event1 in events1 {
let event1_endtime = event1.calculate_endtime();
'event2: for event2 in events2 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a nested loop one might want to step through each list under certain conditions (similarly to how we do it in aw-server-python for some transforms: e1_i++ and e2_i++).

The borrow checker would probably hate that though, and I guess the timestamp checks are pretty fast despite the worst-case O(N^2).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the borrow checker would have any issues with that, you could just use iterators which should work I think.

@johan-bjareholt
Copy link
Member Author

However, as we discussed AFK, I'm not convinced this is the best approach to merging the web events compared to simply replacing all browser-window events when an active browser event exists, but that has its own issues (like having to inject the browser appname, slightly changing the title by removing the appended - Mozilla Firefox, and maybe more).

So like the opposite of filter_period_intersect and then concatenating window and browser events?

@ErikBjare
Copy link
Member

ErikBjare commented Oct 30, 2020

@johan-bjareholt I think something like:

window_events = query(...)   # query and do AFK filtering

for browser in browsers:  # this for-loop would need to be expressed in the function constructing the query
    web_events = query("aw-watcher-web-$browser")
    window_events_browser = filter_by(window_events, {app: $browserAppname})  # can't remember what transform is used to do this
    web_events_active = filter_period_intersect(browser_events, window_events_browser)`  # filter web events by browser being the active window
    web_events_active = amend_data(web_events_active, {app: $browserAppname})  # optional, amends the missing {app: $browserAppname}
    window_events = union(web_events_active, window_events)  # picks events from web_events_active first, then fills rest with window_events

Not sure if we have something like union (in the way I mean it) already, but I mean something that does:

"bucket"          | events
---------------------------------------------------------------
web_events_active |           [github  ]           [youtube   ]
window            |[spotify  ][browser ][terminal ][browser   ]
---------------------------------------------------------------
result            |[spotify  ][github  ][terminal ][youtube   ]

A complete end-to-end example would be:

bucket  | events
---------------------------------------------
window  | [terminal  ][firefox  ][terminal  ]
browser | [youtube   ][google               ]
---------------------------------------------
result  | [terminal  ][google   ][terminal  ]

@johan-bjareholt
Copy link
Member Author

web_events_active = amend_data(web_events_active, {app: $browserAppname})  # optional, amends the missing {app: $browserAppname}

I think this would be a strange transform, inserting data into an event which is hardcoded in the query and does not come from a bucket feels like something we should try to avoid. One reason would be because it encourages people to write dynamically generated queries like we do in the web-ui which is something we don't want.

window_events = union(web_events_active, window_events) # picks events from web_events_active first, then fills rest with window_events

To call that "union" is very misleading in my opinion. A union should be a common ground, not exclude anything.

@johan-bjareholt
Copy link
Member Author

johan-bjareholt commented Oct 30, 2020

However, as we discussed AFK, I'm not convinced this is the best approach to merging the web events compared to simply replacing all browser-window events when an active browser event exists, but that has its own issues (like having to inject the browser appname, slightly changing the title by removing the appended - Mozilla Firefox, and maybe more).

What this transform does is essentially this, but it also injects the browsers appname. See the example in the code

///     |---------|--------------------|
///     | events1 |[a     ][b     ]    |
///     | events2 |    [c     ]    [d ]|
///     | result  |[a ][ac][bc][b ]    |
///     |---------|--------------------|

So if we put that into perspective of window events and browser events already filtered with the browser window:

///     |---------|--------------------------|
///     | window  |[firefox   ][terminal    ]|
///     | browser |    [google    ]          |
///     | result  |[f ][f+g   ][terminal    ]|
///     |---------|--------------------------|

Which is pretty much exacly what you just wrote?

@ErikBjare
Copy link
Member

ErikBjare commented Nov 1, 2020

To call that "union" is very misleading in my opinion. A union should be a common ground, not exclude anything.

For sure, I was just unsure what to call it.

I think this would be a strange transform, inserting data into an event which is hardcoded in the query and does not come from a bucket feels like something we should try to avoid.

I agree in general, but in this case the data does come from a bucket (and doesn't actually introduce new hardcoding, since we already have our list of $browserAppname). We're reinserting exactly what we know we lost before (with filter_by and filter_period_intersect).

But on second thought I realize it'll still be messy, due to there being multiple possible appnames for each browser.

So if we put that into perspective of window events and browser events already filtered with the browser window: <example>

I thought it would become:

///     |---------|---------------------------|
///     | window  |[firefox   ][terminal     ]|
///     | browser |    [google     ]          |
///     | result  |[f ][f+g   ][t+g][terminal]|
///     |---------|---------------------------|

Basically what I'm trying to get rid of splitting events into two.

Edit: Ah nvm, your example would indeed become as you wrote after filtering the browser events.

But it would still lead to things like this, no?:

/// Abbreviations:
///  - ff: Firefox
///  - t1: Tab 1
///  - ff(t1): Firefox window event polled from when `t1` was active
///  - ff(t1)+t2: Firefox event with title from `t1`, but URL from `t2` (after `merge_map`)
///
///     |---------|------------------------------------|
///     | window  |[ff(t1)                ][ff(t2)    ]|
///     | browser |[t1        ][t2                    ]|
///     | result  |[ff(t1)+t1 ][ff(t1)+t2 ][ff(t2)+t2 ]|
///     |---------|------------------------------------|

So if I'm not mistaken, this would result in 'middle-events' where the title (which is gotten from the window event, if I understood merge_map correctly) and URLs (from the web event) are misaligned. This will happen anytime two events don't perfectly overlap (which is always), leading to a lot of small misaligned events like this.

Edit 2: I'm not sure, but maybe this could be resolved by using non-flooded window-events for the union_events_split and flood after?

Maybe that would lead to:

///     |---------|-------------------------------|
///     | window  |[ff(t1)    ]      [ff(t2)     ]|
///     | browser |[t1           ][t2            ]|
///     | result  |[ff(t1)+t1 ]      [ff(t2)+t2  ]|
///     |---------|-------------------------------|

Which after flooding would become the correct:

///     |---------|-------------------------------|
///     | window  |[ff(t1)    ]      [ff(t2)     ]|
///     | browser |[t1           ][t2            ]|
///     | result  |[ff(t1)+t1    ][ff(t2)+t2     ]|
///     |---------|-------------------------------|

(or something similar, depending on flooding strategy)

This might require a lot of extra memory though, since we'd need both window_events (not flooded, for filtering the browser events that go into union_events_split) and window_events_flooded (flooded, for filtering the browser events that go into filter_period_intersect).

Edit 3: Regardless, I'd be happy to merge this if there were more comprehensive tests (for example, checking that these 'middle-events' get created as expected).

Unless we can come up with a neat solution to the problem (which I'm no longer sure there is) I think it's better to just merge this in the meantime, and work on perfecting the transforms/queries later (as this'll probably work good enough).

Copy link
Member

@ErikBjare ErikBjare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second review, now that I understand things better.


/* test non-object conflict, prefer map1 value */
// TODO: This does not work yet!
// It should be a pretty rare use-case anyway
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wouldn't be rare? Both window-events and web-events have a title?

duration: Duration::seconds(3),
data: json_map! {"test": json!(1)},
};
let mut e2 = e1.clone();
Copy link
Member

@ErikBjare ErikBjare Nov 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to have more realistic data here, for example:

For e1: {app: "firefox", title: "google - firefox"}
For e2: {title: "google", url: "google.com"}

I'd also like event series that are at least two events long (like the examples I recently commented about). Would also help with ensuring behavior stays consistent when we eventually remove the nested loops.

@johan-bjareholt
Copy link
Member Author

So if I'm not mistaken, this would result in 'middle-events' where the title (which is gotten from the window event, if I understood merge_map correctly) and URLs (from the web event) are misaligned. This will happen anytime two events don't perfectly overlap (which is always), leading to a lot of small misaligned events like this.

That's a very good point which I thought of before but forgot.

Edit 2: I'm not sure, but maybe this could be resolved by using non-flooded window-events for the union_events_split and flood after?

That sounds like a very clever way of solving it, will try that out.

This might require a lot of extra memory though, since we'd need both window_events (not flooded, for filtering the browser events that go into union_events_split) and window_events_flooded (flooded, for filtering the browser events that go into filter_period_intersect).

I took a deep dive into the query code and it seems like you are correct here. Here's an example of how we would have to change our transforms

- events = flood(query_bucket("bucketname"));
+ events_unflooded = query_bucket("bucketname");
+ events_flooded = flood(events_unflooded)

The query language is very inefficient with its assignments, every time we assign something it gets cloned every time it's used afterwards because we do not know if the variable will be used afterwards or not and we need to guarantee that the variable won't change. So previously when we just called flood(query_bucket("bucketname")) we could just re-use the previous events while when we assign it we need to first keep the value in the variable and then clone it on each reference as well as not release the value of the variable until the whole query is complete. This can be improved in the future by #119.

But I think that the impact of just one more clone would be minimal compared to the whole issue we have today with #119, there are lots of more unnecessary clones than just this.

@johan-bjareholt
Copy link
Member Author

johan-bjareholt commented Nov 16, 2020

Here's a query that works with this transform, the web-ui queries.ts code is really messy though so I'm having a hard time adapting it, especially now that "canonicalQuery" should probably include browser events.

{
  "timeperiods": [
    "2020-11-01T01:00:00Z/2020-11-30T01:00:00Z"
  ],
  "query": [
    "window_events = query_bucket(\"aw-watcher-window_johan-laptop2\");",
    "browser_events = query_bucket(\"aw-watcher-web-firefox\");",
    "browser_events = split_url_events(browser_events);",
    "firefox_events = filter_keyvals(window_events, \"app\", [\"firefox\"]);",
    "events = union_events_split(firefox_events, browser_events);",
    "events = merge_events_by_keys(events, [\"$domain\"]);",
    "events = categorize(events, [
        [[\"docs.rs\"], {\"type\": \"regex\", \"regex\": \"^docs.rs$\"}],
        [[\"reddit\"], {\"type\": \"regex\", \"regex\": \"^reddit.com$\"}],
        [[\"crates.io\"], {\"type\": \"regex\", \"regex\": \"^crates.io$\"}]
    ]);",
    "events = sort_by_duration(events);",
    "RETURN = events;"
  ]
}

@AllanChain
Copy link

Hi! Seems that all major problems have been solved in above discussions. What's currently blocking this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants