New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve unique key generation logic #193
Conversation
@fnesveda Just FYI; not requesting a review from you, taking into account the current situation, so I added Jirka as a reviewer instead. |
src/apify/_utils.py
Outdated
parsed_url = urlparse(url.strip()) | ||
search_params = dict(parse_qsl(parsed_url.query)) # Convert query to a dict | ||
|
||
# Remove any 'utm_' parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be more generic, utm_
isn't the only tracking parameter I'd say?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, it's the same as in the #193 (comment).
@B4nan Should we add the removal of more tracking parameters (if there are any, TBH I don't know)? Or rather keep the parity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm convinced that this is way out of scope of the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as with the other problem, I'd say we want consistency now, if we should change something, it should be changed in the JS version too (which we surely can do as well).
@jirimoravcik do you have some suggestions on what else to ignore? I don't think we can find a "more generic solution", but we can blacklist more common parameters (at the same time we need to be careful to not blacklist something that might be used for other purposes).
I'm convinced that this is way out of scope of the PR.
Indeed, if we want to change this, let's just create an issue for it and keep the same behavior as the JS version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I just wanted to point it out for discussion. E.g. for tracking, there's https://easylist.to/easylist/easyprivacy.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
Request Queue
RequestQueue.add_request
method by improving its unique key generation logic.add_request
.apify/_utils.py
:get_short_base64_hash
based on the _hashPayload.normalize_url
based on the normalizeUrl.compute_unique_key
based on the _computeUniqueKey.Scrapy integration
scrapy.Request.dont_filter
field in theto_apify_request
.Issues
uniqueKey
generation logic聽#141Testing
Unit tests
Manual testing / execution
The
YieldPostSpider
tests the case of the POST requests to the same URL.The
DontFilterSpider
tests the case of the Scrapy request withdont_filter
option.And
src/main.py
to execute these spiders with Apify:Execute it using Scrapy:
And using Apify (need to change the Spider in the
main.py
manually):And it produces the same output 馃帀.