The library should validate the document before processing it #34

Open
sneko opened this issue Jan 9, 2024 · 1 comment

Comments

sneko commented Jan 9, 2024

Hi @samclarke,

I have a script that watches the robots.txt files of multiple websites, but in some cases a site has none and instead serves fallback content. The issue is that your library returns isAllowed() -> true even when HTML code is passed to it.

  const robotsParser = require('robots-parser');

  // robotsUrl and rootUrl are placeholders for the site being watched.
  const robotsUrl = 'https://example.com/robots.txt';
  const rootUrl = 'https://example.com/';

  it('should not confirm it can be indexed', async () => {
    const body = `<html></html>`; // fallback HTML page returned instead of a robots.txt

    const robots = robotsParser(robotsUrl, body);
    const canBeIndexed = robots.isAllowed(rootUrl);

    expect(canBeIndexed).toBeFalsy();
  });

(This test fails, whereas it should pass; or better, the library should throw, since it exposes both isDisallowed() and isAllowed().)

Did I miss a way to check the robots.txt format?

Does it make sense to throw an error instead of allowing/disallowing something based on nothing?

Thank you,

EDIT: a workaround could be to check whether there is any HTML inside the file... hoping the website does not return yet another format (JSON, raw text...). But it's a bit hacky, no?
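
For illustration, a minimal sketch of that workaround; the looksLikeHtml helper and its heuristic are made up for this example and are not part of the library:

  const robotsParser = require('robots-parser');

  // Very rough heuristic: real robots.txt files never start with a tag or doctype.
  function looksLikeHtml(body) {
    return /^\s*(<!doctype|<html|<head|<body)/i.test(body);
  }

  // Reject HTML fallback pages before handing the body to robots-parser.
  function parseIfValid(robotsUrl, body) {
    if (looksLikeHtml(body)) {
      throw new Error(`Response from ${robotsUrl} looks like HTML, not robots.txt`);
    }
    return robotsParser(robotsUrl, body);
  }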

EDIT2: a related point of view: https://stackoverflow.com/a/31598530/3608410

samclarke (Owner) commented

Thanks for reporting!

It's a bit counter-intuitive, but I believe the behaviour of isAllowed() -> true for invalid robots.txt files is correct.

A robots.txt file is part of the Robots Exclusion Protocol. The default behaviour is to assume URLs are allowed unless specifically excluded.

Since an invalid robots.txt file doesn't exclude anything, and the default behaviour is to allow, everything should be allowed.

You're right that an invalid robots.txt file is a sign something is misconfigured, but I don't think this library can assume misconfigured means disallow. If the file is empty or returns a 404, nothing is excluded, so an invalid file shouldn't be treated differently.
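
For illustration (the URLs are placeholders), an empty file and an HTML body end up behaving the same way with this library, since neither contains a rule that excludes anything:

  const robotsParser = require('robots-parser');

  const emptyRobots = robotsParser('https://example.com/robots.txt', '');
  const htmlRobots = robotsParser('https://example.com/robots.txt', '<html></html>');

  // No rules apply in either case, so both URLs are allowed.
  console.log(emptyRobots.isAllowed('https://example.com/')); // true
  console.log(htmlRobots.isAllowed('https://example.com/'));  // true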

The draft specification says invalid characters should be ignored but says nothing about what to do if the whole file is invalid. However, Google's implementation does specify that, if given HTML, it will ignore the invalid lines, which is the same as this library does.

> I have a script that watches the robots.txt files of multiple websites

Are you using the library to validate the robots.txt files? If so, an isValid() and/or a getInvalidLines() method could be added. Every robots.txt parser I'm aware of ignores invalid lines, but it could be useful for website owners to check that nothing is misconfigured.
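
For illustration, a rough standalone sketch of what a getInvalidLines()-style check could look like. Neither isValid() nor getInvalidLines() exists in the library today, and the line grammar below is a simplification of the draft specification:

  function getInvalidLines(contents) {
    const knownFields = ['user-agent', 'allow', 'disallow', 'sitemap', 'crawl-delay', 'host'];
    const invalid = [];

    contents.split(/\r\n|\r|\n/).forEach((line, index) => {
      const trimmed = line.trim();

      // Blank lines and comments are always valid.
      if (trimmed === '' || trimmed.startsWith('#')) {
        return;
      }

      // Everything else must look like "field: value" with a known field name.
      const match = trimmed.match(/^([a-z-]+)\s*:/i);
      if (!match || !knownFields.includes(match[1].toLowerCase())) {
        invalid.push({ lineNumber: index + 1, line });
      }
    });

    return invalid;
  }

  // An HTML body produces invalid lines; a well-formed robots.txt produces none.
  console.log(getInvalidLines('<html></html>').length > 0);               // true
  console.log(getInvalidLines('User-agent: *\nDisallow: /admin').length); // 0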
