
Crawler should honor the Crawl-Delay if obeyRobotsTxt:true #194

Open

panthony opened this issue Apr 3, 2018 · 2 comments


panthony commented Apr 3, 2018

What is the current behavior?

The Crawl-Delay is ignored.

What is the expected behavior?

The Crawl-Delay should be honored; it can be retrieved using getCrawlDelay() on the robots parser (see the sketch below).

What is the motivation / use case for changing the behavior?

A bot is bound to respect all the directives of the site's robots.txt.
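
For context, this is roughly how the directive surfaces through the robots-parser package (a minimal sketch; the robots.txt contents and user-agent string here are made up):

```js
const robotsParser = require('robots-parser');

// Hypothetical robots.txt containing a Crawl-delay directive.
const robotsTxt = [
  'User-agent: *',
  'Disallow: /private/',
  'Crawl-delay: 10',
].join('\n');

const robots = robotsParser('https://example.com/robots.txt', robotsTxt);

// robots-parser returns the raw value; by convention it is read as seconds.
console.log(robots.getCrawlDelay('MyBot/1.0')); // 10
```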

@yujiosaka (Owner)

@panthony
Crawl-Delay is not part of the standard, so there is no way we can tell whether the number is in seconds, minutes, hours, or days.
Providing your own robots.txt is probably the direct solution to your use case: #192


panthony commented Apr 4, 2018

@yujiosaka You are right; this is not part of the standard.

But it looks like everyone agrees that the value is expected to be a number of seconds, and if the crawler does not obey it out of the box, we should have some way to enforce it.

It would be sad to be banned from accessing a site because we did not obey their rules :)

I do not quite see how providing a robots.txt would be a solution, though.

Or did you mean that I could configure the crawler's delay according to the robots.txt I provide?
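
In case it helps, a workaround along those lines could look like the following sketch: fetch robots.txt yourself, read the directive with robots-parser, and feed it into the crawler's delay option. The URLs and user agent are placeholders, and the value is assumed to be in seconds:

```js
const fetch = require('node-fetch');
const robotsParser = require('robots-parser');
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  // Hypothetical target site.
  const robotsUrl = 'https://example.com/robots.txt';
  const robotsTxt = await (await fetch(robotsUrl)).text();
  const robots = robotsParser(robotsUrl, robotsTxt);

  // Read Crawl-Delay as seconds (the common convention) and convert to the
  // milliseconds expected by the crawler's `delay` option.
  const delay = (robots.getCrawlDelay('MyBot/1.0') || 0) * 1000;

  const crawler = await HCCrawler.launch({
    obeyRobotsTxt: true,
    maxConcurrency: 1, // `delay` requires a concurrency of 1
    delay,
    onSuccess: result => console.log(`Got ${result.options.url}`),
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```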
