
Scrapy integration silently does just one POST request #190

Closed
honzajavorek opened this issue Mar 5, 2024 · 8 comments
Assignees
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@honzajavorek
Contributor

We've successfully solved #185, that's awesome! 🎉 It seems to process redirects correctly now. However, I still struggle to get more than a single POST request through.

Running the code of my spider through Scrapy's crawl command gives me:

  • downloader/request_count: 1919
  • downloader/request_method_count/GET: 1723
  • downloader/request_method_count/POST: 196
  • downloader/response_count: 1919
  • downloader/response_status_count/200: 1130
  • downloader/response_status_count/302: 789
  • dupefilter/filtered: 223
  • item_scraped_count: 741

I tried twice without cache and got the same numbers. Running the very same code through the Apify integration gives me:

  • downloader/request_count: 1721
  • downloader/request_method_count/GET: 1720
  • downloader/request_method_count/POST: 1
  • downloader/response_count: 1721
  • downloader/response_status_count/200: 934
  • downloader/response_status_count/302: 787
  • item_scraped_count: 546

I don't understand why the number of GET requests differs slightly, but the difference in POSTs is the bigger concern for now. Looking at the log with the debug level turned on, I noticed that this one thing repeats:

[apify] DEBUG [nfTk1APL]: rq.add_request.result={
  'wasAlreadyPresent': True,
  'wasAlreadyHandled': False,
  'requestId': 'uPDbxqGboeKtU6J',
  'uniqueKey': 'https://api.example.com/api/graphql/widget'}...

The DEBUG [...] part changes, but the uniqueKey never changes and wasAlreadyPresent is suspiciously always True. Is it possible that Apify's request queue dedupes requests based only on the URL? The POSTs all have the same URL, just different payloads. That should be very common, both by the definition of what POST is and in practical terms with all the GraphQL APIs around.

@honzajavorek
Contributor Author

I was able to confirm my hypothesis with the following minimal example:

import json
from typing import Generator, cast

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class Spider(BaseSpider):
    name = "minimal-example"

    def start_requests(self) -> Generator[Request, None, None]:
        for number in range(3):
            yield Request(
                "https://httpbin.org/post",
                method="POST",
                body=json.dumps(dict(code=f"CODE{number:0>4}", rate=number)),
                headers={"Content-Type": "application/json"},
            )

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        data = json.loads(cast(dict, response.json())["data"])
        yield data

Scrapy results:

  • downloader/request_count: 3
  • downloader/request_method_count/POST: 3
  • downloader/response_count: 3
  • downloader/response_status_count/200: 3

Apify results:

  • downloader/request_count: 1
  • downloader/request_method_count/POST: 1
  • downloader/response_count: 1
  • downloader/response_status_count/200: 1

And no warnings, just silently deduping and dropping the requests.

@honzajavorek
Contributor Author

honzajavorek commented Mar 5, 2024

Playing around with another of my scrapers, it seems that using dont_filter=True with manually created requests also doesn't work. The docs say:

indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops.

I'm adding it under this issue because I think it's related. Part of the problem is that all of this happens silently. If something isn't supported by the Apify scheduler, it should warn or fail. Minimal example:

from typing import Generator

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class Spider(BaseSpider):
    name = "minimal-example"

    def start_requests(self) -> Generator[Request, None, None]:
        for _ in range(3):
            yield Request("https://httpbin.org/get", method="GET", dont_filter=True)

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        yield {"something": True}

The code above makes 3 requests with Scrapy, but only 1 request with Apify. This suggests that the whole request-deduplication logic works very differently. I believe Scrapy fingerprints the requests to assess whether they're the same, and the spider author can override that using dont_filter=True if the situation requires it.

For example, I have a scraper which sometimes redirects me to a dummy page. When I detect it, I want to retry the original URL. By default that would be a duplicate request and Scrapy would ignore it, but using dont_filter=True I can override the behavior and try again.
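
For context, here's a minimal sketch of that retry pattern in plain Scrapy (the spider name, URL, and dummy-page check are made up for illustration):

from typing import Generator, Union

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class RetrySpider(BaseSpider):
    name = "retry-example"
    start_urls = ["https://example.com/products"]

    def parse(self, response: TextResponse) -> Generator[Union[Request, dict], None, None]:
        if "dummy page" in response.text:
            # Retry the original URL; dont_filter=True tells Scrapy's
            # scheduler not to drop it as a duplicate of the first attempt.
            yield response.request.replace(dont_filter=True)
        else:
            yield {"url": response.url}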

@vdusek vdusek added bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team. labels Mar 6, 2024
@vdusek
Contributor

vdusek commented Mar 6, 2024

Hi @honzajavorek, thank you for reporting these 🙏.

I'm going to investigate the deduplication process within our request queue and try to resolve the underlying issues you've pointed out.

Playing around with another of my scrapers, it seems that using dont_filter=True with manually created requests also doesn't work.

Yeah, sorry, this is bad and not transparent.

Users should be informed when certain requests are not scheduled (and processed), especially if they are not supported by our ApifyScheduler. The initial step is PR #191, which addresses failures with serialization.

@vdusek
Contributor

vdusek commented Mar 7, 2024

POST requests to GraphQL APIs not working

The DEBUG [...] part changes, but the uniqueKey never changes and wasAlreadyPresent is suspiciously always True. Is it possible that Apify's request queue dedupes requests based only on the URL? The POSTs all have the same URL, just different payloads. That should be very common, both by the definition of what POST is and in practical terms with all the GraphQL APIs around.

Deduplication is based on the uniqueKey field, which by default equals the URL (it's done here: https://github.com/apify/apify-sdk-python/blob/master/src/apify/storages/request_queue.py#L161-L164). Let's change this default behavior. Maybe we can hash URL + method in general, and in the case of the POST request, we can hash URL + method + payload.
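
For illustration, a minimal sketch of what such a key derivation could look like, assuming the key is built from the method, the URL, and a payload hash for POST requests (the helper name compute_unique_key is hypothetical, not the SDK's actual API):

import hashlib

from scrapy import Request


def compute_unique_key(request: Request) -> str:
    """Hypothetical helper: derive a uniqueKey from method + URL,
    plus a payload hash for POST requests."""
    parts = [request.method, request.url]
    if request.method == "POST" and request.body:
        # Hash the payload so two POSTs to the same URL with
        # different bodies get different keys.
        parts.append(hashlib.sha256(request.body).hexdigest())
    return "|".join(parts)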

dont_filter=True not working

We will need to check whether the Scrapy Request has the dont_filter field set to True, and if so, we can generate a random uniqueKey and set it explicitly before passing the request to the request queue.
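
A rough sketch of that idea, again hypothetical rather than the actual scheduler code:

from uuid import uuid4

from scrapy import Request


def unique_key_for(request: Request) -> str:
    """Hypothetical helper: requests with dont_filter=True get a random
    uniqueKey so the request queue never treats them as duplicates."""
    if request.dont_filter:
        return uuid4().hex
    # Otherwise fall back to a deterministic key, e.g. the URL-based
    # default or the URL + method + payload hash sketched above.
    return request.url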

This also relates to #141, and I will resolve it all in a single PR.

cc @fnesveda

@vdusek vdusek added this to the 85th sprint - Tooling team milestone Mar 7, 2024
@vdusek vdusek self-assigned this Mar 7, 2024
@honzajavorek
Contributor Author

Maybe we can hash URL + method in general, and in the case of the POST request, we can hash URL + method + payload.

I'm not sure whether methods that aren't idempotent/safe should be deduped at all, but that's up to you (Apify) to figure out and decide the best way forward. I understand that theoretical purity is one thing and practical use another, but these things should probably be considered.

Also wondering whether two identical requests with one different HTTP header should be considered same or different. Even with a simple GET request, I could make one with Accept-Language: cs, another with Accept-Language: en, and I can get two wildly different responses from the same server.

However, if dont_filter is supported, I guess the calculation of the uniqueKey can be left as a simple heuristic such as the one you outlined, and if that doesn't work for a particular case, there's at least an option to override the behavior and prevent deduping. And, of course, the log should warn me that my requests are getting deduped, so I can notice it and figure out whether that's okay for my particular case (I can imagine wide crawls where deduping is a feature, not a bug).

@vdusek
Contributor

vdusek commented Mar 12, 2024

Resolved in #193.

@vdusek vdusek closed this as completed Mar 12, 2024
@vdusek
Contributor

vdusek commented Mar 12, 2024

However, if dont_filter is supported, I guess the calculation of the uniqueKey can be left as a simple heuristic such as the one you outlined, and if that doesn't work for a particular case, there's at least an option to override the behavior and prevent deduping. And, of course, the log should warn me that my requests are getting deduped, so I can notice it and figure out whether that's okay for my particular case (I can imagine wide crawls where deduping is a feature, not a bug).

Support for dont_filter was added.

Also wondering whether two identical requests with one different HTTP header should be considered same or different. Even with a simple GET request, I could make one with Accept-Language: cs, another with Accept-Language: en, and I can get two wildly different responses from the same server.

I will not implement this feature now. However, you can provisionally enforce it by using the dont_filter parameter. Our current approach aims to maintain consistency with the Crawlee JS implementation, which also does not include this feature. Nevertheless, I think you raise a good point, and I plan to propose its inclusion in future versions of both the Python and JavaScript implementations.
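
For instance, a minimal sketch of that workaround, with requests that differ only in a header (the URL is illustrative):

from typing import Generator

from scrapy import Request


def language_requests() -> Generator[Request, None, None]:
    # Two requests that differ only in Accept-Language; with URL-based
    # deduplication they would collapse into one, so dont_filter=True
    # keeps both in the queue.
    for lang in ("cs", "en"):
        yield Request(
            "https://example.com/",
            headers={"Accept-Language": lang},
            dont_filter=True,
        )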

And once again, thank you very much for the input.

@honzajavorek
Contributor Author

Thanks! Looking forward to trying out the new version!
