PSI API giving ERRORED_DOCUMENT_REQUEST error for some urls that worked recently #15989

Open
3 tasks done
pushkarbh opened this issue May 10, 2024 · 8 comments

@pushkarbh

FAQ

URL

https://www.realtor.com/realestateandhomes-search/Chicago_IL

What happened?

The URL https://www.realtor.com/realestateandhomes-search/Chicago_IL and some other valid URLs from the same domain have started failing in PSI API calls. We used the PSI API for these URLs successfully for a long time, but we have been seeing these errors for the past couple of weeks. Here is the error:

[Lighthouse returned error: ERRORED_DOCUMENT_REQUEST. Lighthouse was unable to reliably load the page you requested. Make sure you are testing the correct URL and that the server is properly responding to all requests. (Status code: 403)]

All of these failing URLs continue to work on https://pagespeed.web.dev. I checked bug reports for similar errors, but most of those are for Lighthouse rather than the PSI API. I see some possible causes listed in #2784, but I'm curious why the same URLs work on the PSI site. We run the API from a Python script, but the same error can be reproduced by calling the API from Postman.

Please suggest what can be done to resolve this.

What did you expect?

As mentioned earlier, these URLs worked until a couple of weeks ago. We expect the API to return web vitals data (field and lab metrics) very similar to what we can still see on https://pagespeed.web.dev.

What have you tried?

Tested different URLs and validated them on https://pagespeed.web.dev. URLs from the other sites in our test suite continue to work. Only the URLs from this domain stopped working recently.

How were you running Lighthouse?

PageSpeed Insights, Other

Lighthouse Version

11.5.0

Chrome Version

119.0.0.0

Node Version

No response

OS

Linux & Mac

Relevant log output

{
    "error": {
        "code": 400,
        "message": "Lighthouse returned error: ERRORED_DOCUMENT_REQUEST. Lighthouse was unable to reliably load the page you requested. Make sure you are testing the correct URL and that the server is properly responding to all requests. (Status code: 403)",
        "errors": [
            {
                "message": "Lighthouse returned error: ERRORED_DOCUMENT_REQUEST. Lighthouse was unable to reliably load the page you requested. Make sure you are testing the correct URL and that the server is properly responding to all requests. (Status code: 403)",
                "domain": "lighthouse",
                "reason": "lighthouseUserError"
            }
        ]
    }
}
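
When scripting against the API, it can help to separate the API-level code (400) from the status the origin returned to Lighthouse (403). A minimal Python sketch, assuming the error body always has the shape shown above; the helper name `summarize_psi_error` and the regular expressions are illustrative, not part of any PSI client library:

```python
import json
import re


def summarize_psi_error(body):
    """Extract the key fields from a PSI error response body (JSON text).

    Returns None when the body has no "error" object.
    """
    payload = json.loads(body)
    error = payload.get("error")
    if error is None:
        return None

    message = error.get("message", "")
    # Assumption: the message keeps the format seen in this issue.
    lh_error = re.search(r"Lighthouse returned error: ([A-Z_]+)", message)
    status = re.search(r"\(Status code: (\d+)\)", message)

    return {
        "api_code": error.get("code"),
        "lighthouse_error": lh_error.group(1) if lh_error else None,
        "origin_status": int(status.group(1)) if status else None,
    }
```

An `origin_status` of 403 points at the tested site refusing the request rather than at a malformed API call.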
@connorjclark
Collaborator

connorjclark commented May 13, 2024

Does this still occur with 12.0 (we just updated PSI API)?

I just tried a few times and it seems to work for me. It may be an intermittent error.

@pushkar-bh

pushkar-bh commented May 13, 2024

I just tried the endpoint we've been using, "https://www.googleapis.com/pagespeedonline/v5/runPagespeed", and I'm still getting the same error.

Here is the curl command:

curl --location 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed?key=<API-KEY>&url=https%3A%2F%2Fwww.realtor.com%2Frealestateandhomes-search%2FChicago_IL&strategy=mobile'

How do I test this api with 12.0? Using v12 as opposed to v5 gives a 404 error.

@connorjclark
Collaborator

Thanks. I'll look further tomorrow.

> How do I test this api with 12.0? Using v12 as opposed to v5 gives a 404 error.

You already are. There's only one PSI version (v5), but we update the LH version there (which is now 12).
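
To confirm which Lighthouse version served a given run, the version can be read from a successful response. A minimal Python sketch, assuming the v5 response carries it under `lighthouseResult.lighthouseVersion`; the helper names are illustrative, and the request URL mirrors the curl command earlier in the thread:

```python
from urllib.parse import urlencode

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"


def build_psi_url(page_url, api_key, strategy="mobile"):
    """Build the v5 runPagespeed request URL with a percent-encoded query."""
    query = urlencode({"key": api_key, "url": page_url, "strategy": strategy})
    return f"{PSI_ENDPOINT}?{query}"


def lighthouse_version(psi_response):
    """Return the Lighthouse version from a parsed PSI response dict.

    Returns None for error responses, which have no lighthouseResult.
    """
    return psi_response.get("lighthouseResult", {}).get("lighthouseVersion")
```

The path stays `v5` regardless of which Lighthouse version runs behind it.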

@pushkar-bh

Hopefully you're able to reproduce the issue. Let me know if not. Thanks!

@connorjclark
Collaborator

connorjclark commented May 15, 2024

I overlooked the 403 in your error message. I get the same locally when using the API, and also via plain usage of curl:

curl https://www.realtor.com/realestateandhomes-search/Chicago_IL -I

Seems your webserver is blocking UAs that indicate curl was used (or rather, that a web browser is not being used), which would explain failures of the API from programmatic usage.

The 403 error is coming from a machine at Google making requests to your web server, which IIUC should be the same whether curl kicks off the API request or the webserver does... so actually I'm really unsure why this could be happening. @paulirish mentions that perhaps X-Forwarded-For is what varies. Is your server perhaps checking that, or any other request headers, and blocking access to some bots?
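
One rough way to check whether the server keys on the User-Agent is to send the same HEAD request with different UA strings and compare status codes. A minimal Python sketch using only the standard library; the helper names are illustrative and any UA strings passed in are the caller's choice. Bot protection may also look at IP ranges or other headers, so matching results here would not rule out blocking:

```python
import urllib.error
import urllib.request


def build_probe(url, user_agent):
    """Build a HEAD request for url that sends the given User-Agent."""
    return urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )


def probe_status(request, timeout=10):
    """Send the request and return the HTTP status code.

    4xx/5xx responses raise HTTPError in urllib, so the code is
    taken from the exception in that case.
    """
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

Calling `probe_status(build_probe(url, ua))` once with a curl-style UA and once with a browser-style UA shows whether the response depends on that header alone.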

@pushkar-bh

I tried curl https://www.realtor.com, and it returns an error page with "This page requires JavaScript!" in the HTML response.

I don't work for realtor.com, so I won't be able to find out what has changed. But it seems like they've recently added some defense against non-browser access. This used to work, so it must be a recent change.

Is there any way to make this work by sending custom headers to the PSI API? Thanks for looking into this.

@pushkar-bh

I think the options for using PSI with this domain are limited, given the bot control mechanism now in place. Can the CrUX API or CrUX History API be used to fetch the aggregated data from BigQuery without reaching the origin URL?

@connorjclark
Collaborator

connorjclark commented May 21, 2024

We have some planned changes to the PSI API that preclude spending time now on returning the CrUX parts of the response even when the Lighthouse part fails. For now, any error in the Lighthouse part fails the entire request.

Is what you're looking for not part of these APIs? https://developer.chrome.com/docs/crux/methodology/tools#tool-crux-api or https://developer.chrome.com/docs/crux/methodology/tools#tool-crux-history-api
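
For reference, a minimal Python sketch of a CrUX API query for a single URL, which returns field data without loading the page. It assumes the `records:queryRecord` endpoint and a JSON body with `url` and `formFactor`; the helper name is illustrative and the API key is a placeholder:

```python
import json
import urllib.request
from urllib.parse import urlencode

CRUX_ENDPOINT = "https://chromeuxreport.googleapis.com/v1/records:queryRecord"


def build_crux_request(api_key, page_url, form_factor="PHONE"):
    """Build a POST request asking CrUX for one URL's field data."""
    body = json.dumps({"url": page_url, "formFactor": form_factor})
    return urllib.request.Request(
        f"{CRUX_ENDPOINT}?{urlencode({'key': api_key})}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the request to `urllib.request.urlopen` and parsing the JSON body gives the aggregated metrics. A URL without enough traffic in the dataset returns a 404 rather than data.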
