Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More options for history completion entries order #5481

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lufte
Copy link
Member

@lufte lufte commented Jun 2, 2020

This is based on the work of @vchimishuk in #4403.

I started by reviewing that pull request and testing it, but I figured that if we used a simpler formula for frecency we wouldn't need the periodic calculation nor the threading stuff. So this approach uses the last visit timestamp plus a bonus for each visit as the frecency. This way the frecency doesn't depend on the current time and can be calculated just once when inserting (or updating) the record in CompletionHistory.

I also added frequency as an option as well since it was trivial. I will test this for a bit and add unit tests afterwards.

Fixes #601.

@lufte lufte requested a review from rcorre as a code owner June 2, 2020 03:43
@project-bot project-bot bot added this to Inbox in Pull request backlog Jun 2, 2020
Copy link
Contributor

@rcorre rcorre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I like the relative simplicity of it. I've got a few comments, and I think we could use some more tests in test_histcategory.py.

qutebrowser/browser/history.py Outdated Show resolved Hide resolved
qutebrowser/browser/history.py Outdated Show resolved Hide resolved
qutebrowser/completion/models/histcategory.py Show resolved Hide resolved
@lufte
Copy link
Member Author

lufte commented Jun 3, 2020

Looks good! I like the relative simplicity of it. I've got a few comments, and I think we could use some more tests in test_histcategory.py.

Thanks. I'm definitely adding tests later on if we all agree on this approach.

@lufte lufte changed the title More options for completion entries order More options for history completion entries order Jun 5, 2020
Pull request backlog automation moved this from Inbox to Focus Jun 7, 2020
Copy link
Contributor

@rcorre rcorre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I'll let @The-Compiler give the final sign-off.

tests/unit/browser/test_history.py Show resolved Hide resolved
@lufte
Copy link
Member Author

lufte commented Jun 7, 2020

Hmm, some of the environments now fail because of the UPSERT syntax. It seems I checked all the sqlite versions while thinking "we need 3.8 for upsert", when actually we need 3.24... I even said that myself in the previous comment! 🤦

So it turns out we can't use it after all. Back to insert+update it is...

@The-Compiler
Copy link
Member

@lufte @rcorre Could you please enlighten me on what benefits UPSERT / fa2b280 would bring? Is it just slightly nicer code, or are there other benefits (performance?) as well?

@The-Compiler The-Compiler moved this from Focus to Backlog in Pull request backlog Jun 10, 2020
@rcorre
Copy link
Contributor

rcorre commented Jun 11, 2020

@The-Compiler I believe UPSERT would avoid looking up the record twice (once to find the duplicate on insert, then again to update the row on the update). Hopefully that isn't a big deal on a primary key lookup.

@lufte
Copy link
Member Author

lufte commented Jun 11, 2020

Usually there is a performance improvement when talking to a RDBMS over the network, but I wasn't sure about SQLite so I ran some tests:

>>> def insert_update():
...  q = c.execute("insert or ignore into CompletionHistory values ('url', 'title', 1, 1, 1)")
...  if q.rowcount == 0:
...   c.execute("update CompletionHistory set url='url', title='title', last_atime=1, visits=100, frecency=1 where url='url'")
... 
>>> def upsert():
...  c.execute("insert into CompletionHistory values ('url', 'title', 1, 1, 1) on conflict(url) do update set url='url', title='title', last_atime=1, visits=1, frecency=1 where url='url'")
... 
>>> from timeit import timeit
>>> import sqlite3
>>> conn=sqlite3.connect("/tmp/qutebrowser-basedir/data/history.sqlite")
>>> c=conn.cursor()
>>> timeit('insert_update()', globals=globals())
7.934143277000658
>>> timeit('upsert()', globals=globals())
5.431534209999882
>>> from qutebrowser.misc import sql
>>> def insert_update():                                                                                                                               
...  q = sql.Query("insert or ignore into CompletionHistory values ('url', 'title', 1, 1, 1)").run()                                                   
...  if q.rows_affected() == 0:                                                                                                                        
...    sql.Query("update CompletionHistory set url='url', title='title', last_atime=1, visits=100, frecency=1 where url='url'").run()
...    
>>> def upsert():
...  sql.Query("insert into CompletionHistory values ('url', 'title', 1, 1, 1) on conflict(url) do update set url='url', title='title', last_atime=1, visits=1, frecency=1 where url='url'").run()
...  
>>> from timeit import timeit
>>> timeit('insert_update()', globals=globals(), setup='from qutebrowser.misc import sql;sql.init("/tmp/qutebrowser-basedir/data/history.sqlite")')
QSqlDatabasePrivate::addDatabase: duplicate connection name 'qt_sql_default_connection', old connection removed.
289.1367021160004
>>> timeit('upsert()', globals=globals(), setup='from qutebrowser.misc import sql;sql.init("/tmp/qutebrowser-basedir/data/history.sqlite")')
QSqlDatabasePrivate::addDatabase: duplicate connection name 'qt_sql_default_connection', old connection removed.
209.05686401799994

So it seems there is some improvement. By the way, the gain is proportionally the same in both cases (~30% faster) but oh boy does it take longer when using QtSql. What could be going on there? I believe both tests run with autocommit so that shouldn't be it.

@The-Compiler I believe UPSERT would avoid looking up the record twice (once to find the duplicate on insert, then again to update the row on the update). Hopefully that isn't a big deal on a primary key lookup.

I'm not sure this is the case because the DO UPDATE part requires its own WHERE clause, meaning that the update it's not done automatically on the record that failed and could even update multiple records.

@lufte
Copy link
Member Author

lufte commented Dec 4, 2020

Since I was around I fixed the conflicts here. I've been using this patch since writing it by the way, and so far everything works perfectly. @treeshateorcs if you are interested in this feature (I saw your comment in the old PR) and you are able to apply custom patches, it would be great if you could give this a try.

@treeshateorcs
Copy link

thank you, i just applied the patch. i need to give it a few days

@treeshateorcs
Copy link

hmm, for some reason the patch can be applied against git, but not release?

@The-Compiler
Copy link
Member

@treeshateorcs It happens - master is heading towards v2.0.0 so there are a lot of bigger changes in there which aren't in any release so far.

@treeshateorcs
Copy link

been working great for me so far

@treeshateorcs
Copy link

@lufte please update this to current master, i tried myself but failed

@lufte
Copy link
Member Author

lufte commented Jan 6, 2021

You got it @treeshateorcs!

@treeshateorcs
Copy link

treeshateorcs commented Jan 18, 2021

is there a chance this will be merged? it broke again. i can't believe i'm the only one interested in this feature. to me it's essential

p.s. also it seems like this patch alters the database or something, because i'm not able to use qutebrowser with my current data and config files without it. it just crashes

@The-Compiler
Copy link
Member

I'll go back to merging older PRs once v2.0.0 is released - but this isn't exactly on the top of my priorities, see the PR backlog project board.

@vchimishuk
Copy link
Contributor

Glad to hear this feature is moving. Lack of it force me to use my own build of QB v1.5.2. I got tired to constantly rebase my patch, so I just live with very old (2 years) version or QB. :)
Thank you, @lufte.

@lufte
Copy link
Member Author

lufte commented Feb 15, 2021

The upgrade to v2.0 forced me to refactor a good part of this work so I decided to squash all the old commits into one and re-apply it on top of master. Sorry about the loss of history (note to self: I left the old branch in https://github.com/lufte/qutebrowser/tree/issue601old) but this makes it easier to keep the patch up to date every time a new conflict is introduced, without having merge commits everywhere.

@treeshateorcs You should be able to apply the patch on top of master or v2.0.2. Indeed, this patch alters the database, but it only adds columns and indexes. I'm not sure why the same database wouldn't work with upstream. If you have more details on this please let me know, because it affects one of the changes I had to make now.

@The-Compiler While working on this I noticed a potential problem with the new way of setting PRAGMA user_version: if something goes wrong while re-creating CompletionHistory and qutebrowser crashes, user_version is already set to the newer number so the user will be left in an inconsistent state, and user_version will need to be set back manually to reproduce the issue.

@The-Compiler
Copy link
Member

@The-Compiler While working on this I noticed a potential problem with the new way of setting PRAGMA user_version: if something goes wrong while re-creating CompletionHistory and qutebrowser crashes, user_version is already set to the newer number so the user will be left in an inconsistent state, and user_version will need to be set back manually to reproduce the issue

That should be fixed in 2b47bd0 already, no?

@lufte
Copy link
Member Author

lufte commented Feb 15, 2021

I was talking about re-creating the actual table (to add the new columns), not only re-creating its records. Unless we make it so force_rebuild always drops the table too... it shouldn't even be noticeable compared to the time it takes to rebuild all the rows.

@The-Compiler
Copy link
Member

Ah, I see... I'm not sure if this is really fixable in a sane way before #6039, given that everything is scattered around history.py and sql.py at the moment...

@The-Compiler The-Compiler moved this from Inbox to Focus in Pull request backlog Jun 15, 2021
Copy link
Member

@The-Compiler The-Compiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things I noticed in a first pass. Now that I got some context again, this indeed seems a very nice and simple solution for the problem - great work! 👍

If you want, it'd be great if you could add some type annotations, especially for arguments and return values in any new/changed functions. It's by no means required though, it's just something I started doing for new code.

I'll answer to some of the earlier discussion comments about upsert and #6039 too, but let me do that in a separate comment.

qutebrowser/browser/history.py Outdated Show resolved Hide resolved
qutebrowser/browser/history.py Outdated Show resolved Hide resolved
qutebrowser/browser/history.py Outdated Show resolved Hide resolved
qutebrowser/browser/history.py Outdated Show resolved Hide resolved
qutebrowser/completion/models/histcategory.py Outdated Show resolved Hide resolved
Comment on lines 1116 to 1120
- recency: Entries are sorted from most recently visited to least
recently visited.
- frequency: Entries are sorted from most visited to least visited.
- frecency: Entries are sorted using a combination of recency and
frequency.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While those terms are probably the correct technical terms, I wonder if they aren't a bit confusing (and also hard to read/distinguish), especially for non-English users.

Maybe they should be something like most-recent, most-frequent, combined or so instead?

@mschilli87 any opinion on this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think they are too bad, unless for what I imagine is the typical qutebrowser user, but I'm happy to hear more opinions.

qutebrowser/misc/sql.py Outdated Show resolved Hide resolved
qutebrowser/misc/sql.py Outdated Show resolved Hide resolved
qutebrowser/misc/sql.py Outdated Show resolved Hide resolved
tests/unit/completion/test_histcategory.py Outdated Show resolved Hide resolved
@The-Compiler
Copy link
Member

Related issues:

For #5551, I still think it'd be nice to use UPSERT here and have the logic in the SQL layer somehow, instead of the ignore=True thing we have now. I'll let you decide though, @lufte - if you believe it's simpler to keep things as-is for this PR and move to upsert (where available) later, that's fine by me as well. As long as #6548 is still around, I guess taking a stab at that would yield a more significant result than insert/update vs. upsert anyways.

As for #6039, like you said:

While working on this I noticed a potential problem with the new way of setting PRAGMA user_version: if something goes wrong while re-creating CompletionHistory and qutebrowser crashes, user_version is already set to the newer number so the user will be left in an inconsistent state, and user_version will need to be set back manually to reproduce the issue.

I guess it'd be nice to have that fixed, especially since the rebuild logic is getting a bit more complex here. However, I don't think I want to block this PR based on that. Is this something you can see yourself working on, @lufte? If so, maybe let's focus on getting this in first, but make sure #6039 is fixed before the next release?


Also, I'm guessing this will potentially conflict with URL and Title changes trigger history entry updates by andrewcarlotti · Pull Request #5337 · qutebrowser/qutebrowser (cc @andrewcarlotti). I'm afraid I still haven't had a chance to take a closer look at that one. I'd prefer to merge this first, since I've already looked over it now. If you're still interested to get yours in, @andrewcarlotti, I guess it might make sense for me to look at that one next. If not, I totally understand and I'll pick up the loose ends, but that'll probably take a bit longer then.

@lufte
Copy link
Member Author

lufte commented Jun 15, 2021

Related issues:

For #5551, I still think it'd be nice to use UPSERT here and have the logic in the SQL layer somehow, instead of the ignore=True thing we have now. I'll let you decide though, @lufte - if you believe it's simpler to keep things as-is for this PR and move to upsert (where available) later, that's fine by me as well. As long as #6548 is still around, I guess taking a stab at that would yield a more significant result than insert/update vs. upsert anyways.

Agreed. I'm not paricularly worried about this specially because this occurs when updating history instead of querying it, which is when performance problems would be more noticeable (as in big history + typing quickly). I say let's leave this for after.

As for #6039, like you said:

While working on this I noticed a potential problem with the new way of setting PRAGMA user_version: if something goes wrong while re-creating CompletionHistory and qutebrowser crashes, user_version is already set to the newer number so the user will be left in an inconsistent state, and user_version will need to be set back manually to reproduce the issue.

I guess it'd be nice to have that fixed, especially since the rebuild logic is getting a bit more complex here. However, I don't think I want to block this PR based on that. Is this something you can see yourself working on, @lufte? If so, maybe let's focus on getting this in first, but make sure #6039 is fixed before the next release?

This one looks more important though. I'm happy to work on #6039 first. Not sure how much of an effort it is though, I would need to take a look at the code. Do you think it's a lot?
If I start with this one but we keep it blocked until #6039 is done, then it sounds like I may need to get back here afterwards to take care of any potential conflict.

Also, I'm guessing this will potentially conflict with URL and Title changes trigger history entry updates by andrewcarlotti · Pull Request #5337 · qutebrowser/qutebrowser (cc @andrewcarlotti). I'm afraid I still haven't had a chance to take a closer look at that one. I'd prefer to merge this first, since I've already looked over it now. If you're still interested to get yours in, @andrewcarlotti, I guess it might make sense for me to look at that one next. If not, I totally understand and I'll pick up the loose ends, but that'll probably take a bit longer then.

If I start with #6039 we also leave some time to maybe get this one in. That's another conflict that I can deal with later along with the others, in one go.

@The-Compiler
Copy link
Member

#6039 is in now - so I'm assuming you'll take care of updating this accordingly and fixing the conflicts. Let me know once this is updated and I'll take a look - however, I'll be away for most of next week.

@lufte
Copy link
Member Author

lufte commented Jul 10, 2021

#6039 is in now - so I'm assuming you'll take care of updating this accordingly and fixing the conflicts. Let me know once this is updated and I'll take a look - however, I'll be away for most of next week.

Correct! I'm working on it now. I'll ping you when it's ready.

@@ -177,7 +183,6 @@ def __init__(self, database: sql.Database, progress: HistoryProgress,
# Store the last saved url to avoid duplicate immediate saves.
self._last_url = None

self.completion = CompletionHistory(database, parent=self)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to move this to the bottom because otherwise it will try to create indexes on columns it doesn't yet have. Maybe index creation should happen inside SqlTable.__init__ instead of it being a separate method?

Copy link
Member Author

@lufte lufte Jul 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think we need something like that, otherwise now every time we re-create CompletionHistory without user version changing, its indexes wont be re-created. I'll come back to this during the week.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a force parameter to create_index, with the same behavior as the one in _create_table. I think this is enough for now.

@@ -1156,6 +1156,30 @@ completion.cmd_history_max_items:

0: no history / -1: unlimited

completion.web_history.sort_criterion:
default: recency
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we set frecency as the new default?

@lufte
Copy link
Member Author

lufte commented Jul 11, 2021

I went through all the old changes and re-applied them on top of the new master. I left a few open discussions that still need solving, and afterwards this would be ready for review again.

@lufte lufte requested a review from The-Compiler July 11, 2021 02:16
@project-bot project-bot bot moved this from WIP to Inbox in Pull request backlog Jul 11, 2021
@The-Compiler The-Compiler moved this from Inbox to Focus in Pull request backlog Jul 19, 2021
@The-Compiler
Copy link
Member

Thanks for the update! I'll take another look soon. There's a lot going on in the next month (including some desperately needed 1.5 weeks of holidays, at least) so it might still take a bit of time I'm afraid, but this definitely is the next non-trivial PR I want to look at.

@treeshateorcs
Copy link

thank you for still working on this! @lufte

@lufte
Copy link
Member Author

lufte commented Mar 2, 2023

thank you for still working on this! @lufte

Just keeping my forked version up to date 😅

@treeshateorcs
Copy link

treeshateorcs commented Mar 12, 2023

this patch no longer applies against master (nor a previous version of this patch that i had saved previously) :(

edit: nevermind, i had messed up the previous version of the patch, now it works alright

@lufte
Copy link
Member Author

lufte commented Mar 12, 2023

Yes, according to this PR, there are many conflicts against master at the moment. The patch is valid with the latest release tags though, which is what I apply it against. I will update it once again when Qt 6 is merged and we're ready to merge this into master.

@vchimishuk
Copy link
Contributor

Hi @lufte,
Do you still have any plans to update the branch?

P. S.
It is sad, that such a great and important feature is still in review after many years.

@lufte
Copy link
Member Author

lufte commented May 11, 2024

Hi @lufte, Do you still have any plans to update the branch?

I've been lazy but I should do it. There is a rain forecast for tomorrow here, so maybe that will be a good opportunity :)

I'll try to apply it against 3.1.

@vchimishuk
Copy link
Contributor

@lufte, thanks for updating PR.
I've started using your solution and it works great in general. However, I've found one issue with the proposed formula.
The issue is that links do not die for a very long time when they are not in use (for a long time) any more.

Here is my case. Year ago Web Telegram link has been changed from https://web.telegram.org/z/ (8K visits) to https://web.telegram.org/a/ (5K visits). So, /z/ is still higher than /a/ despite it has not been used for a year now.
Original Mozilla's formula handles such situations pretty good.

I've tweaked your formula to limit number of visits down to 100. In this case it seems to work better for me.

update CompletionHistory set frecency = min((visits - 1), 100) * 43200 + last_atime;

Is it worth to add one more configuration property next to frecency_bonus to limit the number of visits it takes into account?

@lufte
Copy link
Member Author

lufte commented May 20, 2024

@vchimishuk ideally that's what frecency_bonus is for. If you feel URLs with too many old visits are still too relevant, maybe the default value for frecency_bonus is too high. If you halve it, does the new URL rank higher?

@vchimishuk
Copy link
Contributor

@vchimishuk ideally that's what frecency_bonus is for. If you feel URLs with too many old visits are still too relevant, maybe the default value for frecency_bonus is too high. If you halve it, does the new URL rank higher?

@lufte, nope. Even 6 hours bonus gives old URL 10 years of credits, which means it will be hard to beat it by any new URL.
Using 5 minutes bonus works good for a 2-3 top rows, but then fresh one-visitors beats regular time-to-time used pages.

Limiting number of visits in the formula works the best for me. Probably it can add extra flexibility and improve user experience.
Anyway, your solution works great in general, especially comparing to what we have available in QB master. Just want to share with you my experience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

Customizable sorting for completion
5 participants