
Should the crawler respect the <meta name="robots" content="noindex,nofollow">? #401

Open
pixelastic opened this issue Oct 4, 2018 · 10 comments

@pixelastic
Contributor

A user expected the crawler to respect the <meta name="robots" content="noindex,nofollow"> meta tag, which should tell crawlers to skip a page. We don't honor this tag at all (nor do we honor robots.txt).

I've always considered DocSearch an opt-in crawler, and therefore not bound to respect those rules: everything it will or won't crawl is configured in a config file that each website owner can edit. So I don't think we should respect this.

That being said, maybe we should introduce a new DocSearch-specific meta tag to exclude pages, to give owners more fine-grained control without requiring a PR to their config.

Thoughts @Shipow @s-pace @clemfromspace?
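
For context, a minimal sketch of what honoring the tag could look like in a Scrapy parse callback; the spider name, start URL, and record fields below are placeholders, not the scraper's actual code:

import scrapy

class DocsSpider(scrapy.Spider):
    # Hypothetical spider showing one way to honor <meta name="robots">.
    name = "docs"
    start_urls = ["https://example.com/docs/"]  # placeholder

    def parse(self, response):
        # Read the robots meta tag, e.g. content="noindex,nofollow".
        robots = response.xpath('//meta[@name="robots"]/@content').get() or ""
        directives = {d.strip().lower() for d in robots.split(",") if d.strip()}

        if "noindex" not in directives:
            # Only pages without "noindex" become search records.
            yield {"url": response.url, "title": response.xpath("//title/text()").get()}

        if "nofollow" not in directives:
            # Only follow outgoing links when "nofollow" is absent.
            for href in response.xpath("//a/@href").getall():
                yield response.follow(href, callback=self.parse)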

@s-pace
Contributor

s-pace commented Oct 8, 2018

I do think that creating a dedicated tag would be nice for excluding a small subset of pages. However, let's try to avoid making it a regular practice, since it might increase the load of the crawl by downloading more content than required. Providing a dedicated sitemap might be wiser: we would only follow the links from this dedicated documentation sitemap. WDYT?

@pixelastic
Contributor Author

Providing a dedicated sitemap might be wiser.

That could be a great idea. Some kind of docsearch.xml at the root that would list all the pages to crawl. Or maybe there is a way to reuse the standard robots.txt file?

@clemfromspace

clemfromspace commented Oct 9, 2018

Scrapy has built-in support for robots.txt: https://doc.scrapy.org/en/latest/topics/settings.html?highlight=robot#std:setting-ROBOTSTXT_OBEY
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#topics-dlmw-robots

Should be easy to add the Scrapy setting here ('ROBOTSTXT_OBEY': True): https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/index.py#L52

But it might impact existing configurations, though.
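
For reference, a sketch of what that change could look like; the surrounding settings are placeholders and the actual dict built in scraper/src/index.py may differ, so treat this as an assumption rather than the real code:

from scrapy.crawler import CrawlerProcess

# Sketch only: the settings actually built in scraper/src/index.py may differ.
process = CrawlerProcess({
    "USER_AGENT": "Algolia DocSearch Crawler",  # placeholder value
    # Enabling this makes Scrapy fetch and honor each site's robots.txt,
    # which would change behavior for existing configurations.
    "ROBOTSTXT_OBEY": True,
})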

@pixelastic
Contributor Author

Yep, I think we should not follow robots.txt by default (because changing that would not be backward compatible).

My suggestion was that maybe we could reuse the robots.txt syntax to add custom DocSearch information. Maybe something like:

User-agent: DocSearch
Disallow: /dont-index-that-directory/
Disallow: /tmp/
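
For what it's worth, the standard library already parses that syntax; here is a quick sketch using urllib.robotparser with a hypothetical DocSearch section (the paths and URLs are just examples):

from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content with a DocSearch-specific section.
ROBOTS_TXT = """\
User-agent: DocSearch
Disallow: /dont-index-that-directory/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The crawler would check each URL against the DocSearch rules before fetching.
print(parser.can_fetch("DocSearch", "https://example.com/tmp/draft.html"))          # False
print(parser.can_fetch("DocSearch", "https://example.com/guides/getting-started"))  # True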

@s-pace
Contributor

s-pace commented Oct 17, 2018

Good idea, but let's put this into the configuration.

Let's wait for the codebase refactor? (migrating to Python 3)

@clemfromspace

Yeah, let's wait for the refactor; we can then add a new middleware inspired by the built-in one from Scrapy: https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/robotstxt.py#L88
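
A rough sketch of what such a middleware could look like, applying DocSearch-specific Disallow rules taken from the crawler settings instead of fetching robots.txt over HTTP; the DOCSEARCH_ROBOTS_RULES setting name is invented for the example:

from urllib.robotparser import RobotFileParser

from scrapy.exceptions import IgnoreRequest, NotConfigured


class DocSearchRobotsMiddleware:
    # Downloader middleware sketch: drop requests disallowed for DocSearch.

    def __init__(self, rules):
        self.parser = RobotFileParser()
        self.parser.parse(rules.splitlines())

    @classmethod
    def from_crawler(cls, crawler):
        rules = crawler.settings.get("DOCSEARCH_ROBOTS_RULES")  # hypothetical setting
        if not rules:
            raise NotConfigured
        return cls(rules)

    def process_request(self, request, spider):
        # Match against the dedicated "User-agent: DocSearch" section.
        if not self.parser.can_fetch("DocSearch", request.url):
            spider.logger.info("Skipping %s (disallowed for DocSearch)", request.url)
            raise IgnoreRequest(f"Disallowed by DocSearch rules: {request.url}")
        return None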

@nkuehn

nkuehn commented Jan 27, 2021

from 2018:

Yeah, let's wait for the refactor, we can then add a new middleware

Any updates on this issue? It would be great to at least know Algolia's decision on whether support for respecting the noindex/nofollow meta tag is intended at all.

@Shipow

Shipow commented Feb 1, 2021

Hi @nkuehn
As far as I know, nothing is in the pipe regarding this for the moment.
Could you give more detail on how this would impact your experience or technical requirements?

@nkuehn

nkuehn commented Feb 5, 2021

Sure: our docs site generator supports a Markdown frontmatter flag that triggers the standard <meta name="robots" content="noindex"> HTML tagging to ensure a given page is not indexed by search engines.

There are varying use cases: pre-release documentation, deprecated features that are only documented as an archive, pages that are just lists or navigation aids and should not appear in search results, etc.

These pages are often in that state only temporarily and do not follow a specific "regex'able" pattern that we could put into the DocSearch config. We also need immediate control over adding and removing them, without having to bother you (the Algolia DocSearch team) with a PR to your configs repo for every individual change.

We have now understood that DocSearch only relies on whether a page is reachable through crawling. So we are teaching docs authors about the different behavior of on-site search vs. public search engines, and we live with some pages appearing in search results that we would ideally prefer not to see there. It's an acceptable situation: anything we absolutely want to hide would not be linked anyway.

TL;DR: The main downside is the additional mental workload for authors to understand the subtle differences between excluding a page from "search" (on-site) vs. "search" (public). IMHO absolutely acceptable for a free product that is great in all other respects.

PS: I personally think that de-facto standard HTML headers should be respected by a crawler by default, not only via customization. But that's probably feedback for Scrapy rather than for DocSearch.

@Shipow

Shipow commented Feb 6, 2021

Legit. cc @shortcuts, we should have a look at the current state of this.
Thanks @nkuehn for taking the time to give more details.
