Queueing same url on multiple workers in cluster with Redis cache results in duplicates #293

Open
lioreshai opened this issue Jul 12, 2018 · 4 comments

Comments

@lioreshai

What is the current behavior?

When using a Redis cache for the queue with a cluster of crawling processes, the crawler repeats requests.

If the current behavior is a bug, please provide the steps to reproduce

Create a cluster in which each worker process starts crawling the same URL (on a crawler using a Redis cache).
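
For illustration, a minimal reproduction sketch along these lines (the worker count, URL, and Redis connection details are placeholders; the RedisCache require path and options follow the project README):

```js
// Hypothetical repro: every worker queues the same URL against a shared Redis cache.
const cluster = require('cluster');
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

if (cluster.isMaster) {
  // Fork a few workers; each one runs the crawling code below.
  for (let i = 0; i < 4; i += 1) cluster.fork();
} else {
  (async () => {
    const crawler = await HCCrawler.launch({
      cache: new RedisCache({ host: '127.0.0.1', port: 6379 }),
      persistCache: true, // keep cached URLs in Redis across workers and runs
      onSuccess: result => console.log(`[worker ${process.pid}] ${result.options.url}`),
    });
    await crawler.queue('https://example.com/'); // same URL from every worker
    await crawler.onIdle();
    await crawler.close();
    // Expected: the URL is crawled once in total; observed: once per worker.
  })();
}
```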

What is the expected behavior?

Even if the same URL is added multiple times, I would expect there to be no duplicates. Should this be the case?

Please tell us about your environment:

  • Version: 1.8.0
  • Platform / OS version: Windows
  • Node.js version: 8.11.3
@Devhercule

I can confirm that I have the same problem.

@BubuAnabelas

Have you tried enabling the skipDuplicates and skipRequestedRedirect options in the queue options?

I believe the current behavior is that it crawls duplicate URLs because it treats them as different request/response pairs, but if you enable those options it should no longer make duplicate requests.
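
Something along these lines, for example (a sketch only; whether these flags belong on launch() or queue() may vary by version, so double-check against the README for your release):

```js
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

(async () => {
  const crawler = await HCCrawler.launch({
    cache: new RedisCache({ host: '127.0.0.1', port: 6379 }),
    persistCache: true,
    onSuccess: result => console.log(result.options.url),
  });
  await crawler.queue({
    url: 'https://example.com/',
    skipDuplicates: true,        // skip URLs that have already been requested
    skipRequestedRedirect: true, // also skip URLs already seen in redirect chains
  });
  await crawler.onIdle();
  await crawler.close();
})();
```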
Please confirm if your problem was fixed this way.

@maschad96

maschad96 commented Jun 25, 2019

@BubuAnabelas I set it up with skipDuplicates and skipRequestedRedirect, but the issue is still reproducible for me.

I have a feeling it might be because of differing extraHeaders?

Any guidance here would be appreciated; I'm a Redis novice and mostly just want to make my crawler more efficient so it doesn't hit the same pages once per worker.

@iamprageeth

Just posting here hoping this helps someone. It's true that it crawls duplicate URLs when concurrency > 1, so here is what I did (a sketch follows the list below).

  1. First, create a SQLite database.
  2. Then, in the RequestStarted event, insert the current URL.
  3. In the preRequest function (you can pass this function along with the options object), check whether there is already a record for the current URL. If there is, the URL has been crawled or is still being crawled, so return false and the request will be skipped.
  4. In the RequestRetried and RequestFailed events, delete the URL so the crawler is allowed to try it again.
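
A minimal sketch of that workaround, assuming the better-sqlite3 package (any SQLite binding would do) and the lowercase event/option names from the project README; treat it as illustrative rather than a drop-in implementation:

```js
const HCCrawler = require('headless-chrome-crawler');
const Database = require('better-sqlite3');

// 1. A file-based SQLite database, so it can be shared across worker processes.
const db = new Database('crawled-urls.db');
db.exec('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)');
const insertUrl = db.prepare('INSERT OR IGNORE INTO urls (url) VALUES (?)');
const findUrl = db.prepare('SELECT 1 FROM urls WHERE url = ?');
const deleteUrl = db.prepare('DELETE FROM urls WHERE url = ?');

(async () => {
  const crawler = await HCCrawler.launch({
    // 3. Skip any URL that already has a record (crawled or currently crawling).
    preRequest: options => !findUrl.get(options.url),
    onSuccess: result => console.log(result.options.url),
  });

  // 2. Record the URL as soon as its request starts.
  crawler.on('requeststarted', options => insertUrl.run(options.url));
  // 4. Remove the record on retry/failure so the URL can be tried again.
  crawler.on('requestretried', options => deleteUrl.run(options.url));
  crawler.on('requestfailed', error => deleteUrl.run(error.options.url));

  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```

Note that with several workers there is still a small window between the preRequest check and the requeststarted insert, so two workers can occasionally pick up the same URL at nearly the same moment.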
