Don't return first rule match for canFetch #18

Open
sebastianwessel opened this issue Jan 11, 2015 · 4 comments

@sebastianwessel

Currently the first matching rule is returned, but I don't think that's a good idea.

For example, with the following rules canFetch will always return true:
User-Agent: *
Allow: /
Disallow: /admin/
Disallow: /redirect/

Change /lib/entry.js like this:

Entry.prototype.allowance = function (url) {
  ut.d('* Entry.allowance, url: ' + url);
  var ret = true;
  for (var i = 0, len = this.rules.length, rule; i < len; i++) {
    rule = this.rules[i];

    if (rule.appliesTo(url)) {
      // Keep evaluating: the last matching rule wins.
      ret = rule.allowance;
    }
  }

  return ret;
};

...this will return the allowance of the last matching rule.
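
For illustration, here is roughly what the two strategies yield for the robots.txt above. The rule objects below are simplified stand-ins for the parser's internal rules, not its actual API:

// Simplified stand-ins for the three parsed rules above (hypothetical shape).
var rules = [
  { path: '/',          allowance: true  },  // Allow: /
  { path: '/admin/',    allowance: false },  // Disallow: /admin/
  { path: '/redirect/', allowance: false }   // Disallow: /redirect/
];

// Simple prefix match, standing in for rule.appliesTo(url).
function appliesTo(rule, url) {
  return url.indexOf(rule.path) === 0;
}

// First match: '/admin/secret' hits 'Allow: /' immediately and is allowed.
// Last match:  '/admin/secret' also hits 'Disallow: /admin/' later on,
// so the final answer is "disallowed".
function lastMatchAllowance(url) {
  var ret = true;
  for (var i = 0; i < rules.length; i++) {
    if (appliesTo(rules[i], url)) {
      ret = rules[i].allowance;
    }
  }
  return ret;
}

console.log(lastMatchAllowance('/admin/secret')); // false
console.log(lastMatchAllowance('/index.html'));   // true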

@srmor

srmor commented Feb 25, 2015

This is a big issue, but I'm not totally sure that returning the last matching rule really is the fix. Is there any way to determine whether a rule is more specific? Ultimately we want to apply the most specific rule.

@spinatmensch

...it depends on how you want to interpret the rules. In most ACL cases you write something like:
Disallow something
Explicitly allow some specific thing
Disallow some more specific thing that would normally be allowed by the rule before

Or the reverse case:
Allow all
Disallow a specific thing
Allow a more specific thing that would normally be disallowed by the rule before

For robots.txt there is officially no "allow" directive - only "disallow" is standard. So a robots.txt should normally contain only "disallow" lines to ensure correct interpretation.
So the current "return on first matching rule" is correct, and the fastest approach, if the robots.txt only contains "disallow" lines or you only respect "disallow".
But most big search engines also interpret an "allow" directive, so that they are able to crawl more pages.
And in that case the last matching rule wins, because it is always the most specific one - see the examples above and the concrete robots.txt sketch below.
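
For the reverse case, a concrete robots.txt could look like this (the paths are made up for illustration):
User-Agent: *
Allow: /
Disallow: /private/
Allow: /private/public/
Under last-match evaluation, /private/public/index.html stays allowed while /private/secret.html ends up disallowed.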

And remember: robots.txt is NOT a "you should not crawl" command, it's more "please, don't crawl" or "crawling of... is not necessary".

So in my eyes it's up to the creator of the ACL to ensure the correct order of the rules and the use of "allow"; there is no way to determine a "more specific rule". It's like the army: the last order rules, if you respect the "allow" command.

@srmor

srmor commented Feb 26, 2015

Very true... but I guess it really depends on whether you want it to accurately interpret all robots.txt files or just the ones that strictly follow the spec (practically none of them).

@ghost

ghost commented Mar 6, 2017

At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?csw=1#order-of-precedence-for-group-member-records
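
A minimal sketch of that longest-path precedence, following the shape of the allowance loop quoted above. Note that rule.path is an assumption here; the code in this issue only shows appliesTo() and allowance on the rule objects:

// Longest-path precedence (Google-style) - a sketch, not the library's current behavior.
// Assumes each rule also exposes its raw pattern as `rule.path` (hypothetical).
Entry.prototype.allowance = function (url) {
  var best = null;
  for (var i = 0; i < this.rules.length; i++) {
    var rule = this.rules[i];
    if (rule.appliesTo(url)) {
      // Keep the matching rule with the longest (most specific) path.
      if (best === null || rule.path.length > best.path.length) {
        best = rule;
      }
    }
  }
  // No matching rule: the URL is allowed by default.
  return best === null ? true : best.allowance;
};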
