Don't return first rule match for canFetch #18

Open
sebastianwessel opened this issue Jan 11, 2015 · 4 comments

@sebastianwessel

Currently the first matching rule is returned, but I don't think that's a good idea.

For example, with the following rules canFetch will always return true:
User-Agent: *
Allow: /
Disallow: /admin/
Disallow: /redirect/

Change /lib/entry.js like this:

Entry.prototype.allowance = function (url) {
  ut.d('* Entry.allowance, url: ' + url);
  var ret = true;
  for (var i = 0, len = this.rules.length, rule; i < len; i++) {
    rule = this.rules[i];

    if (rule.appliesTo(url)) {
      // Keep evaluating: the last matching rule wins.
      ret = rule.allowance;
    }
  }

  return ret;
};

...this will return the allowance of the last matching rule.
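
For illustration, here is roughly what the two strategies yield for the robots.txt above. The rule objects below are simplified stand-ins for the parser's internal rules, not its actual API:

// Simplified stand-ins for the three parsed rules above (hypothetical shape).
var rules = [
  { path: '/',          allowance: true  },  // Allow: /
  { path: '/admin/',    allowance: false },  // Disallow: /admin/
  { path: '/redirect/', allowance: false }   // Disallow: /redirect/
];

// Simple prefix match, standing in for rule.appliesTo(url).
function appliesTo(rule, url) {
  return url.indexOf(rule.path) === 0;
}

// First match: '/admin/secret' hits 'Allow: /' immediately and is allowed.
// Last match:  '/admin/secret' also hits 'Disallow: /admin/' later on,
// so the final answer is "disallowed".
function lastMatchAllowance(url) {
  var ret = true;
  for (var i = 0; i < rules.length; i++) {
    if (appliesTo(rules[i], url)) {
      ret = rules[i].allowance;
    }
  }
  return ret;
}

console.log(lastMatchAllowance('/admin/secret')); // false
console.log(lastMatchAllowance('/index.html'));   // true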

@srmor

srmor commented Feb 25, 2015

This is a big issue, but I'm not totally sure that returning the last matching rule really is the fix. Is there any way to determine whether a rule is more specific? Ultimately we want to apply the most specific rule.

@spinatmensch

...it depends on how you want to interpret the rules. In most ACL cases you write something like:
Disallow something
Explicitly allow some specific thing
Disallow some more specific thing that would normally be allowed by the rule before

Or the reverse case:
Allow all
Disallow a specific thing
Allow a more specific thing that would normally be disallowed by the rule before

For robots.txt there is officially no "allow" directive - only "disallow" is standard. So a robots.txt should normally contain only "disallow" lines to ensure correct interpretation.
So the current "return on first matching rule" is correct, and the fastest approach, if the robots.txt only contains "disallow" lines or you only respect "disallow".
But most big search engines also interpret an "allow" directive, so that they are able to crawl more pages.
And in that case the last matching rule wins, because it is always the most specific one - see the examples above and the concrete robots.txt sketch below.
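
For the reverse case, a concrete robots.txt could look like this (the paths are made up for illustration):
User-Agent: *
Allow: /
Disallow: /private/
Allow: /private/public/
Under last-match evaluation, /private/public/index.html stays allowed while /private/secret.html ends up disallowed.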

And remember: robots.txt is NOT a "you should not crawl" command, it's more "please, don't crawl" or "crawling of... is not necessary".

So in my eyes it's up to the creator of the ACL to ensure the correct order of the rules and the use of "allow"; there is no way to determine a "more specific rule". It's like the army: the last order rules, if you respect the "allow" command.

@srmor

srmor commented Feb 26, 2015

Very true... but I guess it really depends on whether you want it to accurately interpret all robots.txt files or just the ones that strictly follow the spec (practically none of them).

@ghost

ghost commented Mar 6, 2017

At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?csw=1#order-of-precedence-for-group-member-records
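
A minimal sketch of that longest-path precedence, following the shape of the allowance loop quoted above. Note that rule.path is an assumption here; the code in this issue only shows appliesTo() and allowance on the rule objects:

// Longest-path precedence (Google-style) - a sketch, not the library's current behavior.
// Assumes each rule also exposes its raw pattern as `rule.path` (hypothetical).
Entry.prototype.allowance = function (url) {
  var best = null;
  for (var i = 0; i < this.rules.length; i++) {
    var rule = this.rules[i];
    if (rule.appliesTo(url)) {
      // Keep the matching rule with the longest (most specific) path.
      if (best === null || rule.path.length > best.path.length) {
        best = rule;
      }
    }
  }
  // No matching rule: the URL is allowed by default.
  return best === null ? true : best.allowance;
};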
