
Scrapy integration silently does just one POST request #190

Closed
honzajavorek opened this issue Mar 5, 2024 · 8 comments
Assignees
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@honzajavorek
Contributor

We've successfully solved #185, that's awesome! 🎉 It seems to process redirects correctly now. However, I still struggle to get more than a single POST request through.

Running the code of my spider through Scrapy's crawl command gives me:

  • downloader/request_count: 1919
  • downloader/request_method_count/GET: 1723
  • downloader/request_method_count/POST: 196
  • downloader/response_count: 1919
  • downloader/response_status_count/200: 1130
  • downloader/response_status_count/302: 789
  • dupefilter/filtered: 223
  • item_scraped_count: 741

I tried twice without cache and got the same numbers. Running the very same code through the Apify integration gives me:

  • downloader/request_count: 1721
  • downloader/request_method_count/GET: 1720
  • downloader/request_method_count/POST: 1
  • downloader/response_count: 1721
  • downloader/response_status_count/200: 934
  • downloader/response_status_count/302: 787
  • item_scraped_count: 546

I don't understand why the number of GET requests differs slightly, but the difference in POSTs is the bigger concern for now. Looking at the log with the debug level turned on, I noticed that this one thing repeats:

[apify] DEBUG [nfTk1APL]: rq.add_request.result={
  'wasAlreadyPresent': True,
  'wasAlreadyHandled': False,
  'requestId': 'uPDbxqGboeKtU6J',
  'uniqueKey': 'https://api.example.com/api/graphql/widget'}...

The DEBUG [...] part changes, but the uniqueKey never changes and wasAlreadyPresent is suspiciously always True. Is it possible that Apify's request queue dedupes requests based only on the URL? The POSTs all have the same URL, just different payloads. That should be very common, both by the definition of what POST is and in practical terms with all the GraphQL APIs around.

@honzajavorek
Contributor Author

I was able to confirm my hypothesis with the following minimal example:

import json
from typing import Generator, cast

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class Spider(BaseSpider):
    name = "minimal-example"

    def start_requests(self) -> Generator[Request, None, None]:
        for number in range(3):
            yield Request(
                "https://httpbin.org/post",
                method="POST",
                body=json.dumps(dict(code=f"CODE{number:0>4}", rate=number)),
                headers={"Content-Type": "application/json"},
            )

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        data = json.loads(cast(dict, response.json())["data"])
        yield data

Scrapy results:

  • downloader/request_count: 3
  • downloader/request_method_count/POST: 3
  • downloader/response_count: 3
  • downloader/response_status_count/200: 3

Apify results:

  • downloader/request_count: 1
  • downloader/request_method_count/POST: 1
  • downloader/response_count: 1
  • downloader/response_status_count/200: 1

And no warnings, just silently deduping and dropping the requests.

@honzajavorek
Contributor Author

honzajavorek commented Mar 5, 2024

Playing around with another of my scrapers, it seems that using dont_filter=True with manually created requests also doesn't work. The docs say:

indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops.

I'm adding it under this issue because I think it's related. Part of the problem is that all of this happens silently. If something isn't supported by the Apify scheduler, it should warn or fail. Minimal example:

from typing import Generator

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class Spider(BaseSpider):
    name = "minimal-example"

    def start_requests(self) -> Generator[Request, None, None]:
        for _ in range(3):
            yield Request("https://httpbin.org/get", method="GET", dont_filter=True)

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        yield {"something": True}

The code above makes 3 requests with Scrapy, but only 1 request with Apify. This suggests that the whole request-deduplication logic works very differently. I believe Scrapy fingerprints the requests to assess whether they're the same, and the spider author can override that using dont_filter=True if the situation requires it.

For example, I have a scraper which sometimes redirects me to a dummy page. When I detect it, I want to retry the original URL. By default that would be a duplicate request and Scrapy would ignore it, but using dont_filter=True I can override the behavior and try again.
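
For context, here's a minimal sketch of that retry pattern in plain Scrapy (the spider name, URL, and dummy-page check are made up for illustration):

from typing import Generator, Union

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class RetrySpider(BaseSpider):
    name = "retry-example"
    start_urls = ["https://example.com/products"]

    def parse(self, response: TextResponse) -> Generator[Union[Request, dict], None, None]:
        if "dummy page" in response.text:
            # Retry the original URL; dont_filter=True tells Scrapy's
            # scheduler not to drop it as a duplicate of the first attempt.
            yield response.request.replace(dont_filter=True)
        else:
            yield {"url": response.url}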

@vdusek vdusek added bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team. labels Mar 6, 2024
@vdusek
Contributor

vdusek commented Mar 6, 2024

Hi @honzajavorek, thank you for reporting these 🙏.

I'm going to investigate the deduplication process within our request queue and try to resolve the underlying issues you've pointed out.

Playing around with another of my scrapers, it seems that using dont_filter=True with manually created requests also doesn't work.

Yeah, sorry, this is bad and not transparent.

Users should be informed when certain requests are not scheduled (and processed), especially if they are not supported by our ApifyScheduler. The initial step is PR #191, which addresses failures with serialization.

@vdusek
Contributor

vdusek commented Mar 7, 2024

POST requests to GraphQL APIs not working

The DEBUG [...] part changes, but the uniqueKey never changes and wasAlreadyPresent is suspiciously always True. Is it possible that Apify's request queue dedupes requests based only on the URL? The POSTs all have the same URL, just different payloads. That should be very common, both by the definition of what POST is and in practical terms with all the GraphQL APIs around.

Deduplication is based on the uniqueKey field, which by default equals the URL (it's done here: https://github.com/apify/apify-sdk-python/blob/master/src/apify/storages/request_queue.py#L161-L164). Let's change this default behavior. Maybe we can hash URL + method in general, and in the case of the POST request, we can hash URL + method + payload.
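
For illustration, a minimal sketch of what such a key derivation could look like, assuming the key is built from the method, the URL, and a payload hash for POST requests (the helper name compute_unique_key is hypothetical, not the SDK's actual API):

import hashlib

from scrapy import Request


def compute_unique_key(request: Request) -> str:
    """Hypothetical helper: derive a uniqueKey from method + URL,
    plus a payload hash for POST requests."""
    parts = [request.method, request.url]
    if request.method == "POST" and request.body:
        # Hash the payload so two POSTs to the same URL with
        # different bodies get different keys.
        parts.append(hashlib.sha256(request.body).hexdigest())
    return "|".join(parts)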

dont_filter=True not working

We will need to check whether the Scrapy Request has the dont_filter field set to True, and if so, we can generate a random uniqueKey and set it explicitly before passing the request to the request queue.
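
A rough sketch of that idea, again hypothetical rather than the actual scheduler code:

from uuid import uuid4

from scrapy import Request


def unique_key_for(request: Request) -> str:
    """Hypothetical helper: requests with dont_filter=True get a random
    uniqueKey so the request queue never treats them as duplicates."""
    if request.dont_filter:
        return uuid4().hex
    # Otherwise fall back to a deterministic key, e.g. the URL-based
    # default or the URL + method + payload hash sketched above.
    return request.url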

This also relates to #141, and I will resolve it all in a single PR.

cc @fnesveda

@vdusek vdusek added this to the 85th sprint - Tooling team milestone Mar 7, 2024
@vdusek vdusek self-assigned this Mar 7, 2024
@honzajavorek
Contributor Author

Maybe we can hash URL + method in general, and in the case of the POST request, we can hash URL + method + payload.

I'm not sure whether methods that aren't idempotent/safe should be deduped at all, but that's up to you (Apify) to figure out and decide the best way forward. I understand that theoretical purity is one thing and practical use another, but these things should probably be considered.

Also wondering whether two identical requests with one different HTTP header should be considered same or different. Even with a simple GET request, I could make one with Accept-Language: cs, another with Accept-Language: en, and I can get two wildly different responses from the same server.

However, if dont_filter is supported, I guess the calculation of the uniqueKey can be left as a simple heuristic such as the one you outlined, and if that doesn't work for a particular case, there's at least an option to override the behavior and prevent deduping. And, of course, the log should warn me that my requests are getting deduped, so I can notice it and figure out whether that's okay for my particular case (I can imagine wide crawls where deduping is a feature, not a bug).

@vdusek
Contributor

vdusek commented Mar 12, 2024

Resolved in #193.

@vdusek vdusek closed this as completed Mar 12, 2024
@vdusek
Contributor

vdusek commented Mar 12, 2024

However, if dont_filter is supported, I guess the calculation of the uniqueKey can be left as a simple heuristic such as the one you outlined, and if that doesn't work for a particular case, there's at least an option to override the behavior and prevent deduping. And, of course, the log should warn me that my requests are getting deduped, so I can notice it and figure out whether that's okay for my particular case (I can imagine wide crawls where deduping is a feature, not a bug).

Support for dont_filter was added.

Also wondering whether two identical requests with one different HTTP header should be considered same or different. Even with a simple GET request, I could make one with Accept-Language: cs, another with Accept-Language: en, and I can get two wildly different responses from the same server.

I will not implement this feature now. However, you can provisionally enforce it by using the dont_filter parameter. Our current approach aims to maintain consistency with the Crawlee JS implementation, which also does not include this feature. Nevertheless, I think you raise a good point, and I plan to propose its inclusion in future versions of both the Python and JavaScript implementations.
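
For instance, a minimal sketch of that workaround, with requests that differ only in a header (the URL is illustrative):

from typing import Generator

from scrapy import Request


def language_requests() -> Generator[Request, None, None]:
    # Two requests that differ only in Accept-Language; with URL-based
    # deduplication they would collapse into one, so dont_filter=True
    # keeps both in the queue.
    for lang in ("cs", "en"):
        yield Request(
            "https://example.com/",
            headers={"Accept-Language": lang},
            dont_filter=True,
        )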

And once again, thank you very much for the input.

@honzajavorek
Contributor Author

Thanks! Looking forward to trying out the new version!
