Scrapy integration silently does just one POST request #190
I was able to confirm my hypothesis with the following minimal example:

```python
import json
from typing import Generator, cast

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class Spider(BaseSpider):
    name = "minimal-example"

    def start_requests(self) -> Generator[Request, None, None]:
        for number in range(3):
            yield Request(
                "https://httpbin.org/post",
                method="POST",
                body=json.dumps(dict(code=f"CODE{number:0>4}", rate=number)),
                headers={"Content-Type": "application/json"},
            )

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        data = json.loads(cast(dict, response.json())["data"])
        yield data
```

Scrapy results:
Apify results:
And there are no warnings; the requests are just silently deduplicated and dropped.
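To illustrate the difference between the two deduplication strategies, here is a stdlib-only sketch: a fingerprint built from the URL alone collapses the three POSTs above into one, while a fingerprint that also covers the method and body keeps them distinct. The helper names are mine for illustration, not the actual API of either library.

```python
import hashlib
import json


def url_fingerprint(url: str, method: str = "GET", body: bytes = b"") -> str:
    # URL-only key: roughly what the Apify queue appears to be doing.
    return hashlib.sha1(url.encode()).hexdigest()


def full_fingerprint(url: str, method: str = "GET", body: bytes = b"") -> str:
    # Key over method + URL + body: closer to Scrapy's fingerprinting.
    h = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        h.update(part)
    return h.hexdigest()


requests = [
    ("https://httpbin.org/post", "POST",
     json.dumps({"code": f"CODE{n:0>4}", "rate": n}).encode())
    for n in range(3)
]

print(len({url_fingerprint(*r) for r in requests}))   # 1: all three collapse
print(len({full_fingerprint(*r) for r in requests}))  # 3: all three survive
```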
Playing around with another of my scrapers, it seems that using `dont_filter=True` has no effect either.
I'm adding it under this issue because I think it's related. The problem, again, is that all of this happens silently. If something isn't supported by the Apify scheduler, it should warn or fail. Minimal example:

```python
from typing import Generator

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse


class Spider(BaseSpider):
    name = "minimal-example"

    def start_requests(self) -> Generator[Request, None, None]:
        for _ in range(3):
            yield Request("https://httpbin.org/get", method="GET", dont_filter=True)

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        yield {"something": True}
```

The code above makes 3 requests with Scrapy, but only 1 request with Apify. This means the whole logic of deduplicating requests probably works very differently. I believe Scrapy fingerprints the requests to assess whether they're the same or not, and the spider author can opt out per request with `dont_filter=True`. For example, I have a scraper which sometimes redirects me to a dummy page. When I detect that, I want to retry the original URL. By default that would be a duplicate request and Scrapy would ignore it, but with `dont_filter=True` the retry goes through.
Hi @honzajavorek, thank you for reporting these 🙏. I'm going to investigate the deduplication process within our request queue and try to resolve the underlying issues you've pointed out.
Yeah, sorry, this is bad and not transparent. Users should be informed when certain requests are not being scheduled (and processed), especially if they are not supported by our request queue.
Not working: GraphQL APIs

Deduplication is done based on the `uniqueKey`.
I'm not sure whether methods which are not idempotent/safe should be deduplicated at all, but that's up to you (Apify) to figure out and decide the best way forward. I can understand that theoretical purity is one thing and practical use is another, but these things should probably be considered. I'm also wondering whether two otherwise identical requests that differ in a single HTTP header should be considered the same or different. Even with a simple GET request, I could make one with a header that changes the response.
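One way to handle the header question is to ignore headers by default but let the caller opt selected ones into the fingerprint, which is roughly how Scrapy approaches it. A stdlib-only sketch of that idea (the function name and signature are mine for illustration):

```python
import hashlib


def fingerprint(url: str, headers: dict[str, str],
                include_headers: tuple[str, ...] = ()) -> str:
    # Headers are ignored unless explicitly listed in include_headers.
    h = hashlib.sha1(url.encode())
    for name in sorted(n.lower() for n in include_headers):
        h.update(name.encode() + b":" + headers.get(name, "").encode())
    return h.hexdigest()


a = fingerprint("https://httpbin.org/get", {"accept": "text/html"})
b = fingerprint("https://httpbin.org/get", {"accept": "application/json"})
print(a == b)  # True: headers are ignored by default

a = fingerprint("https://httpbin.org/get", {"accept": "text/html"},
                include_headers=("Accept",))
b = fingerprint("https://httpbin.org/get", {"accept": "application/json"},
                include_headers=("Accept",))
print(a == b)  # False: the Accept header now distinguishes them
```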
Resolved in #193.
Support for `dont_filter`

I will not implement this feature now. However, you can provisionally enforce it by utilizing the `uniqueKey`, setting a distinct one per request.

And once again, thank you very much for the input.
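That workaround can be sketched as follows: derive each request's `uniqueKey` from everything that makes the request distinct (method, URL, body), so a URL-based default no longer collapses the requests. This is a stdlib-only illustration, not the actual Apify SDK API.

```python
import hashlib
import json


def make_unique_key(url: str, method: str = "GET", body: bytes = b"") -> str:
    # Fold method and body into the key so distinct POSTs stay distinct.
    digest = hashlib.sha1(method.encode() + body).hexdigest()[:12]
    return f"{method}|{url}|{digest}"


keys = {
    make_unique_key(
        "https://httpbin.org/post", "POST",
        json.dumps({"code": f"CODE{n:0>4}", "rate": n}).encode(),
    )
    for n in range(3)
}
print(len(keys))  # 3 distinct uniqueKeys, so nothing is deduplicated away
```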
Thanks! Looking forward to trying out the new version!
We've successfully solved #185, that's awesome! 🎉 It seems to process redirects correctly now. However, I still struggle to get more than a single POST request done.

Running the code of my spider through Scrapy's `crawl` command gives me:

I tried twice without cache and got the same numbers. Running the very same code through the Apify integration gives me:

I don't understand why the number of GET requests differs by two, but let's say the difference in POSTs is the bigger concern for now. Looking at the log with the `debug` level turned on, I noticed that one thing repeats:

The `DEBUG [...]` part changes, but the `uniqueKey` doesn't change and `wasAlreadyPresent` is `True`, suspiciously. Is it possible that Apify's request queue dedupes the requests based only on the URL? The POSTs all have the same URL, just different payloads. That scenario should be very common, both by the definition of what POST is and in practical terms, with all the GraphQL APIs around.