Improve unique key generation logic #193

Merged: 7 commits merged into master from unique-key-fix on Mar 12, 2024

Conversation

@vdusek (Contributor) commented Mar 11, 2024

Description

Request Queue

  • Updates the RequestQueue.add_request method by improving its unique key generation logic.
    • The logic is exposed via new optional parameters of add_request.
  • Also extends the docstring to make clear what the method does (mainly regarding deduplication).
  • Introduces three new helper functions in apify/_utils.py (a rough sketch of the idea follows below).
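
To sketch the idea (the names, parameters, and exact rules here are assumptions for illustration; the actual helpers in apify/_utils.py may differ):

# Rough sketch only; not the actual implementation in apify/_utils.py.
from __future__ import annotations

from hashlib import sha256
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def normalize_url(url: str, *, keep_url_fragment: bool = False) -> str:
    """Normalize a URL so that equivalent URLs produce the same unique key."""
    parsed = urlparse(url.strip())

    # Convert the query string to a dict and drop any 'utm_' tracking parameters.
    search_params = dict(parse_qsl(parsed.query))
    filtered = [(k, v) for k, v in sorted(search_params.items()) if not k.startswith('utm_')]

    return urlunparse((
        parsed.scheme.lower(),
        parsed.netloc.lower(),
        parsed.path,
        parsed.params,
        urlencode(filtered),  # Query parameters in a stable, sorted order.
        parsed.fragment if keep_url_fragment else '',
    ))


def compute_unique_key(url: str, method: str = 'GET', payload: bytes | None = None) -> str:
    """Compute a unique key; non-GET requests also take the method and payload into account."""
    normalized = normalize_url(url)
    if method == 'GET' and not payload:
        return normalized
    payload_hash = sha256(payload or b'').hexdigest()
    return f'{method}({payload_hash}):{normalized}'

The point of including the method and payload is that POST requests to the same URL but with different bodies (as in the YieldPostSpider below) get distinct unique keys instead of being deduplicated into a single request.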

Scrapy integration

  • Uses the new Request Queue functionality in the Scrapy integration.
  • Adds support for the scrapy.Request.dont_filter field in to_apify_request (see the sketch below).
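
A rough sketch of the dont_filter handling (the function and field names are simplified assumptions, not the SDK's actual to_apify_request implementation):

# Rough sketch only; the real to_apify_request in the SDK may differ.
from __future__ import annotations

from uuid import uuid4

from scrapy import Request


def to_apify_request_sketch(scrapy_request: Request) -> dict:
    """Convert a Scrapy request into an Apify request dictionary (simplified)."""
    apify_request: dict = {
        'url': scrapy_request.url,
        'method': scrapy_request.method,
    }

    # A request with dont_filter=True must never be deduplicated by the Request Queue,
    # so it gets a random unique key instead of one derived from the normalized URL.
    if scrapy_request.dont_filter:
        apify_request['uniqueKey'] = uuid4().hex

    return apify_request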

Issues

Testing

Unit tests

  • The new code is covered by unit tests.

Manual testing / execution

The YieldPostSpider tests the case of POST requests to the same URL, differing only in their payloads.

# spiders/yield_post.py
import json
from typing import Generator, cast

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class YieldPostSpider(BaseSpider):
    name = 'yield-post'

    def start_requests(self) -> Generator[Request, None, None]:
        for number in range(3):
            yield Request(
                'https://httpbin.org/post',
                method='POST',
                body=json.dumps(dict(code=f'CODE{number:0>4}', rate=number)),
                headers={'Content-Type': 'application/json'},
            )

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        data = json.loads(cast(dict, response.json())['data'])
        yield data

The DontFilterSpider tests the case of Scrapy requests with the dont_filter option, i.e. identical requests that should not be deduplicated.

# spiders/dont_filter.py
from typing import Generator

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class DontFilterSpider(BaseSpider):
    name = 'dont-filter'

    def start_requests(self) -> Generator[Request, None, None]:
        for _ in range(3):
            yield Request('https://httpbin.org/get', method='GET', dont_filter=True)

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        yield {'something': True}

And here is src/main.py, used to execute these spiders with Apify:

from __future__ import annotations

from scrapy.crawler import CrawlerProcess

from apify import Actor
from apify.scrapy.utils import apply_apify_settings

from .spiders.yield_post import YieldPostSpider as Spider
# from .spiders.dont_filter import DontFilterSpider as Spider


async def main() -> None:
    """Apify Actor main coroutine for executing the Scrapy spider."""
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Apply Apify settings; they will override the Scrapy project settings
        settings = apply_apify_settings()

        # Execute the spider using Scrapy CrawlerProcess
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()

Execute it using Scrapy:

scrapy crawl 'dont-filter' -o dont_filter_output.json
scrapy crawl 'yield-post' -o yield_post_output.json

And using Apify (the Spider import in main.py needs to be switched manually):

apify run --purge

And it produces the same output 🎉.

github-actions bot added this to the "85th sprint - Tooling team" milestone on Mar 11, 2024
github-actions bot added the labels t-tooling (Issues with this label are in the ownership of the tooling team.) and tested (Temporary label used only programmatically for some analytics.) on Mar 11, 2024

@vdusek (Contributor, Author) commented Mar 11, 2024

@fnesveda Just FYI; not requesting a review from you, taking into account the current situation, so I added Jirka as a reviewer instead.

Two earlier review comments on src/apify/_utils.py were marked as outdated and resolved.

Another review thread was opened on this snippet from src/apify/_utils.py:
parsed_url = urlparse(url.strip())
search_params = dict(parse_qsl(parsed_url.query)) # Convert query to a dict

# Remove any 'utm_' parameters

Member
Shouldn't this be more generic? utm_ isn't the only tracking parameter, I'd say.

Contributor Author (vdusek)
Basically, it's the same as in #193 (comment).

@B4nan Should we add the removal of more tracking parameters (if there are any; TBH I don't know)? Or should we rather keep parity?

Contributor
I'm convinced that this is way out of scope of the PR.

Member (@B4nan) commented Mar 12, 2024
Same as with the other problem: I'd say we want consistency now. If we should change something, it should be changed in the JS version too (which we surely can do as well).

@jirimoravcik do you have some suggestions on what else to ignore? I don't think we can find a "more generic solution", but we can blacklist more common parameters (at the same time, we need to be careful not to blacklist something that might be used for other purposes).

As for "I'm convinced that this is way out of scope of the PR": indeed, if we want to change this, let's just create an issue for it and keep the same behavior as the JS version.

Member
Yeah, I just wanted to point it out for discussion. E.g. for tracking, there's https://easylist.to/easylist/easyprivacy.txt
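
For illustration only (not something this PR changes, since it keeps parity with the JS version), a broader blacklist could be expressed as a small set of well-known tracking parameters; the names below are common examples, not taken from the PR:

# Illustration only; not part of this PR.
TRACKING_PARAM_PREFIXES = ('utm_',)
TRACKING_PARAM_NAMES = {'fbclid', 'gclid', 'msclkid', 'mc_eid'}


def is_tracking_param(name: str) -> bool:
    """Return True for query parameters commonly used only for click/campaign tracking."""
    return name.startswith(TRACKING_PARAM_PREFIXES) or name in TRACKING_PARAM_NAMES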

@janbuchar (Contributor) left a review
LGTM

Further review threads on src/apify/_utils.py and tests/unit/test_utils.py were resolved.
@vdusek vdusek merged commit 49265e8 into master Mar 12, 2024
21 of 26 checks passed
@vdusek vdusek deleted the unique-key-fix branch March 12, 2024 15:20