
Twisted/CRITICAL: builtins.TypeError: can't pickle Selector objects (Scrapy) #189

Closed
honzajavorek opened this issue Mar 5, 2024 · 2 comments


@honzajavorek
Contributor

My spider https://github.com/juniorguru/plucker/blob/26d1758e310b8b2451541516cf4447e4a5e4a11a/juniorguru_plucker/jobs_jobscz/spider.py runs just fine with Scrapy, but fails with critical errors when teaming up with Apify.

Exception details:
[twisted] CRITICAL Unhandled error in Deferred:

Traceback (most recent call last):
  File "/Users/honza/.local/share/mise/installs/python/3.11/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/asyncioreactor.py", line 271, in _onTimer
    self.runUntilCurrent()
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/task.py", line 680, in _tick
    taskObj._oneWorkUnit()
--- <exception caught here> ---
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
    result = next(self._iterator)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/utils/defer.py", line 102, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/core/scraper.py", line 298, in _process_spidermw_output
    self.crawler.engine.crawl(request=output)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/core/engine.py", line 290, in crawl
    self._schedule_request(request, self.spider)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/core/engine.py", line 297, in _schedule_request
    if not self.slot.scheduler.enqueue_request(request):  # type: ignore[union-attr]
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/apify/scrapy/scheduler.py", line 87, in enqueue_request
    apify_request = to_apify_request(request, spider=self.spider)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/apify/scrapy/requests.py", line 76, in to_apify_request
    scrapy_request_dict_encoded = codecs.encode(pickle.dumps(scrapy_request_dict), 'base64').decode()
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/parsel/selector.py", line 532, in __getstate__
    raise TypeError("can't pickle Selector objects")
builtins.TypeError: can't pickle Selector objects

While debugging, I found that the following line is the culprit:

scrapy_request_dict_encoded = codecs.encode(pickle.dumps(scrapy_request_dict), 'base64').decode()
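To illustrate what that line does, here is a minimal, self-contained sketch of the same pickle-then-base64 round trip on a plain dict (the dict contents are illustrative; the real one holds the Scrapy request's attributes):

```python
import codecs
import pickle

# Stand-in for the request dict that gets serialized; the real dict carries
# the Scrapy request's url, callback name, cb_kwargs, headers, etc.
scrapy_request_dict = {"url": "https://example.com", "callback": "parse"}

# The failing line's pattern: pickle the dict, then base64-encode it to a str.
encoded = codecs.encode(pickle.dumps(scrapy_request_dict), "base64").decode()

# Decoding reverses both steps and recovers the original dict.
decoded = pickle.loads(codecs.decode(encoded.encode(), "base64"))
assert decoded == scrapy_request_dict
```

This only works if every value reachable from the dict is picklable, which is exactly what breaks here.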

Inspecting the problematic dicts, I found the culprit: I pass a response object around:

yield response.follow(
    script_url,
    callback=self.parse_job_widget_script,
    cb_kwargs=dict(item=item, html_response=response, track_id=track_id),
)

The response then ends up in the dict like this:

{'body': b'',
 'callback': 'parse_job_widget_script',
 'cb_kwargs': {'html_response': <200 https://example.com/.../>,
               'item': {...}}}

The <200 https://example.com/.../> is the repr of the Response object, which cannot be pickled, or at least the Selector objects inside it cannot.
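The failure mode is easy to reproduce without Scrapy or parsel installed. Parsel's `Selector.__getstate__` raises `TypeError` on purpose; the stand-in class below (my own minimal mock, not parsel's code) mimics that, so pickling any dict that contains one fails the same way:

```python
import pickle

# Minimal stand-in for parsel.Selector, which deliberately blocks pickling
# by raising in __getstate__ (see parsel/selector.py in the traceback above).
class Selector:
    def __getstate__(self):
        raise TypeError("can't pickle Selector objects")

try:
    # Anything reachable from the pickled object is pickled too, so one
    # Selector buried in cb_kwargs poisons the whole request dict.
    pickle.dumps({"cb_kwargs": {"html_response": Selector()}})
except TypeError as exc:
    print(exc)  # can't pickle Selector objects
```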

I don't think you can do much about it; it's probably an inherent limitation of delegating the request mechanics to an external system such as Apify. If the request must be serialized and later deserialized, there's simply no way to pass around something Python cannot pickle.

So I think the only solution here is to fail nicely. The line which pickles the request should catch the exception and provide a nicer error message which explains what is happening and why, ideally with some guidance on how to avoid the problem. I'll get back here if I come up with a workaround.

@honzajavorek honzajavorek changed the title Twisted/CRITICAL: builtins.TypeError: can't pickle Selector objects Twisted/CRITICAL: builtins.TypeError: can't pickle Selector objects (Scrapy) Mar 5, 2024
@honzajavorek
Contributor Author

I realized I don't need the whole response, so I was able to fix this with a change like this one: juniorguru/plucker@a0cabe8
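The shape of that kind of fix can be sketched without Scrapy: instead of putting the unpicklable Response into cb_kwargs, extract the plain values you actually need first (all names below are illustrative, not taken from the linked commit):

```python
import pickle

class Response:
    """Stand-in for scrapy.http.Response, which cannot be pickled here."""
    def __getstate__(self):
        raise TypeError("can't pickle Response objects")

response = Response()
response_url = "https://example.com/job/123"  # e.g. what response.url would hold

# Before: cb_kwargs = {"html_response": response, ...}  -> unpicklable.
# After: pass only plain data extracted from the response.
cb_kwargs = {"item": {"title": "Job"}, "html_url": response_url}

pickle.dumps(cb_kwargs)  # now succeeds
```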

@vdusek
Contributor

vdusek commented Mar 6, 2024

Thank you @honzajavorek for reporting this. I've opened PR #191, which should improve the error handling in to_apify_request. The ApifyScheduler should also let the user know that the request was not scheduled for this reason.
