Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core(robots): use new fetcher to get robots.txt #12423

Merged
merged 11 commits into from
May 4, 2021
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -47,15 +47,17 @@ module.exports = [
},
{
artifacts: {
_maxChromiumMilestone: 91,
RobotsTxt: {
status: null,
content: null,
},
InspectorIssues: {
contentSecurityPolicy: [
{
// TODO: Fix connect-src violation when fetching robots.txt.
// https://github.com/GoogleChrome/lighthouse/issues/10225
//
// Fixed with new fetcher using M92.
blockedURL: 'http://localhost:10200/robots.txt',
violatedDirective: 'connect-src',
isReportOnly: false,
Expand All @@ -75,7 +77,6 @@ module.exports = [
// https://github.com/GoogleChrome/lighthouse/pull/12044#issuecomment-788274938
//
// Fixed with new fetcher using M92.
_maxChromiumMilestone: 91,
sourceMapUrl: 'http://localhost:10200/source-map/script.js.map',
errorMessage: 'Error: Timed out fetching resource.',
map: undefined,
Expand All @@ -87,4 +88,35 @@ module.exports = [
audits: {},
},
},
{
// Same CSP as above, but verifies correct behavior for M92.
artifacts: {
_minChromiumMilestone: 92,
RobotsTxt: {
status: 404,
content: null,
},
InspectorIssues: {
contentSecurityPolicy: [
{
// TODO: Fix style-src-elem violation when checking tap targets.
// https://github.com/GoogleChrome/lighthouse/issues/11862
violatedDirective: 'style-src-elem',
isReportOnly: false,
contentSecurityPolicyViolationType: 'kInlineViolation',
},
],
},
SourceMaps: [{
sourceMapUrl: 'http://localhost:10200/source-map/script.js.map',
map: {},
errorMessage: undefined,
}],
},
lhr: {
requestedUrl: 'http://localhost:10200/csp.html?' + blockAllExceptInlineScriptCsp,
finalUrl: 'http://localhost:10200/csp.html?' + blockAllExceptInlineScriptCsp,
audits: {},
},
},
];
57 changes: 33 additions & 24 deletions lighthouse-core/gather/fetcher.js
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@

/* global document */

/** @typedef {{content: string|null, status: number|null}} FetchResponse */

const log = require('lighthouse-logger');
const {getBrowserVersion} = require('./driver/environment.js');

Expand Down Expand Up @@ -97,7 +99,7 @@ class Fetcher {
*
* @param {string} url
* @param {{timeout: number}=} options timeout is in ms
* @return {Promise<string>}
* @return {Promise<FetchResponse>}
*/
async fetchResource(url, options = {timeout: 500}) {
if (!this._enabled) {
Expand Down Expand Up @@ -141,7 +143,7 @@ class Fetcher {

/**
* @param {string} url
* @return {Promise<LH.Crdp.IO.StreamHandle>}
* @return {Promise<{stream: LH.Crdp.IO.StreamHandle|null, status: number|null}>}
*/
async _loadNetworkResource(url) {
await this.session.sendCommand('Network.enable');
Expand All @@ -156,18 +158,16 @@ class Fetcher {
});
await this.session.sendCommand('Network.disable');

if (!networkResponse.resource.success || !networkResponse.resource.stream) {
const statusCode = networkResponse.resource.httpStatusCode || '';
throw new Error(`Loading network resource failed (${statusCode})`);
}

return networkResponse.resource.stream;
return {
stream: networkResponse.resource.success ? (networkResponse.resource.stream || null) : null,
status: networkResponse.resource.httpStatusCode || null,
};
Comment on lines +161 to +164
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used null to match the expectations of the RobotsTxt artifact. I think null makes more sense than undefined but it leads to some ugly code like this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have had 3 different versions of a comment written here in various tabs that I've lost with varying levels of concern, but at this point it just SGTM 😆

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the failure cases? ERR_BLOCKED_BY_CLIENT? Is success set by netError and error status codes, or just netError? If it's just netError, maybe we should mimic real fetch() and throw in that case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's just netError, maybe we should mimic real fetch() and throw in that case?

In one iteration of my comment, I discuss with my lost tab that it seems odd that we would throw an exception rather than a warning on notApplicable. It's not a bug, but rather a known edge case, so the new null behavior makes more sense. That does suggest we actually make it a warning in RobotsTxt audit.

I'm fine with maintaining the status quo as well and throwing if others prefer that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking in terms of fetcher on its own. The gatherer could then do as it saw fit with the response

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

success is set to false for error status codes, and there is no stream handle to use if content is served wit an error status code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no stream handle to use if content is served wit an error status code.

that seems like a CDP bug but probably not one worth pursuing until we have a reason to want to look at response bodies from requests that 404 or whatever :)

I think it still could make sense for fetcher to return

  • good statusCode: Promise.resolve({status: number, content: string})
  • bad statusCode: Promise.resolve({status: number, content: null})
  • netError: Promise.reject(new Error(netErrorName))

(and e.g. the RobotsTxt gatherer could catch the error and augment the artifact with a failure reason to add to the audit explanation string)

but I don't think this is critical today, so I'm fine with this version as well.

}

/**
* @param {string} url
* @param {{timeout: number}} options timeout is in ms
* @return {Promise<string>}
* @return {Promise<FetchResponse>}
*/
async _fetchResourceOverProtocol(url, options) {
const startTime = Date.now();
Expand All @@ -180,21 +180,29 @@ class Fetcher {
}, options.timeout);
});

const fetchStreamPromise = this._loadNetworkResource(url);
const stream = await Promise.race([fetchStreamPromise, timeoutPromise])
const responsePromise = this._loadNetworkResource(url);

/** @type {{stream: LH.Crdp.IO.StreamHandle|null, status: number|null}} */
const response = await Promise.race([responsePromise, timeoutPromise])
.finally(() => clearTimeout(timeoutHandle));

return await this._readIOStream(stream, {timeout: options.timeout - (Date.now() - startTime)});
if (response.stream && response.status && response.status >= 200 && response.status <= 299) {
const content = await this._readIOStream(response.stream, {
timeout: options.timeout - (Date.now() - startTime),
});
return {content, status: response.status};
}
return {content: null, status: response.status};
adamraine marked this conversation as resolved.
Show resolved Hide resolved
}

/**
* Fetches resource by injecting an iframe into the page.
* @param {string} url
* @param {{timeout: number}} options timeout is in ms
* @return {Promise<string>}
* @return {Promise<FetchResponse>}
*/
async _fetchResourceIframe(url, options) {
/** @type {Promise<string>} */
/** @type {Promise<FetchResponse>} */
const requestInterceptionPromise = new Promise((resolve, reject) => {
/** @param {LH.Crdp.Fetch.RequestPausedEvent} event */
const handlerAsync = async event => {
Expand All @@ -216,17 +224,18 @@ class Fetcher {
return;
}

// Now in the response stage, but the request failed.
if (!(responseStatusCode >= 200 && responseStatusCode < 300)) {
reject(new Error(`Invalid response status code: ${responseStatusCode}`));
return;
}

const responseBody = await this.session.sendCommand('Fetch.getResponseBody', {requestId});
if (responseBody.base64Encoded) {
resolve(Buffer.from(responseBody.body, 'base64').toString());
if (responseStatusCode >= 200 && responseStatusCode <= 299) {
adamraine marked this conversation as resolved.
Show resolved Hide resolved
adamraine marked this conversation as resolved.
Show resolved Hide resolved
const responseBody = await this.session.sendCommand('Fetch.getResponseBody', {requestId});
if (responseBody.base64Encoded) {
resolve({
content: Buffer.from(responseBody.body, 'base64').toString(),
status: responseStatusCode,
});
} else {
resolve({content: responseBody.body, status: responseStatusCode});
}
} else {
resolve(responseBody.body);
resolve({content: null, status: responseStatusCode});
}

// Fail the request (from the page's perspective) so that the iframe never loads.
Expand Down
24 changes: 19 additions & 5 deletions lighthouse-core/gather/gatherers/seo/robots-txt.js
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
'use strict';

const FRGatherer = require('../../../fraggle-rock/gather/base-gatherer.js');
const {getBrowserVersion} = require('../../driver/environment.js');

/* global fetch, location */

Expand Down Expand Up @@ -36,11 +37,24 @@ class RobotsTxt extends FRGatherer {
* @param {LH.Gatherer.FRTransitionalContext} passContext
* @return {Promise<LH.Artifacts['RobotsTxt']>}
*/
snapshot(passContext) {
return passContext.driver.executionContext.evaluate(getRobotsTxtContent, {
args: [],
useIsolation: true,
});
async snapshot(passContext) {
const milestone
= await getBrowserVersion(passContext.driver.defaultSession).then(v => v.milestone);
adamraine marked this conversation as resolved.
Show resolved Hide resolved

// TODO: Remove when 92 hits stable.
// Iframe fetcher still has issues with CSPs.
// Only use the fetcher if we are fetching over the CDP.
if (milestone < 92) {
return passContext.driver.executionContext.evaluate(getRobotsTxtContent, {
args: [],
useIsolation: true,
});
}

const robotsUrl = new URL(passContext.url);
adamraine marked this conversation as resolved.
Show resolved Hide resolved
robotsUrl.pathname = '/robots.txt';
passContext.driver.fetcher.enable();
return passContext.driver.fetcher.fetchResource(robotsUrl.toString());
}
}

Expand Down
8 changes: 5 additions & 3 deletions lighthouse-core/gather/gatherers/source-maps.js
Original file line number Diff line number Diff line change
Expand Up @@ -27,9 +27,11 @@ class SourceMaps extends Gatherer {
* @return {Promise<LH.Artifacts.RawSourceMap>}
*/
async fetchSourceMap(driver, sourceMapUrl) {
/** @type {string} */
const sourceMapJson = await driver.fetcher.fetchResource(sourceMapUrl, {timeout: 1500});
return JSON.parse(sourceMapJson);
const response = await driver.fetcher.fetchResource(sourceMapUrl, {timeout: 1500});
if (response.content === null) {
throw new Error(`Failed fetching source map (${response.status})`);
}
return JSON.parse(response.content);
}

/**
Expand Down
8 changes: 4 additions & 4 deletions lighthouse-core/test/gather/fetcher-test.js
Original file line number Diff line number Diff line change
Expand Up @@ -133,10 +133,10 @@ describe('._fetchResourceOverProtocol', () => {
.mockResponse('Network.disable');

const data = await fetcher._fetchResourceOverProtocol('https://example.com', {timeout: 500});
expect(data).toEqual(streamContents);
expect(data).toEqual({content: streamContents, status: 200});
});

it('throws when resource could not be fetched', async () => {
it('returns null when resource could not be fetched', async () => {
connectionStub.sendCommand = createMockSendCommandFn()
.mockResponse('Network.enable')
.mockResponse('Page.getFrameTree', {frameTree: {frame: {id: 'FRAME'}}})
Expand All @@ -145,8 +145,8 @@ describe('._fetchResourceOverProtocol', () => {
})
.mockResponse('Network.disable');

const dataPromise = fetcher._fetchResourceOverProtocol('https://example.com', {timeout: 500});
await expect(dataPromise).rejects.toThrowError(/Loading network resource failed/);
const data = await fetcher._fetchResourceOverProtocol('https://example.com', {timeout: 500});
expect(data).toEqual({content: null, status: 404});
});

it('throws on timeout', async () => {
Expand Down
35 changes: 32 additions & 3 deletions lighthouse-core/test/gather/gatherers/source-maps-test.js
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ describe('SourceMaps gatherer', () => {
* `resolvedSourceMapUrl` is used to assert that the SourceMaps gatherer is using the expected
* url to fetch the source map.
* `fetchError` mocks an error that happens in the page. Only fetch error message make sense.
* @param {Array<{scriptParsedEvent: LH.Crdp.Debugger.ScriptParsedEvent, map: string, resolvedSourceMapUrl?: string, fetchError: string}>} mapsAndEvents
* @param {Array<{scriptParsedEvent: LH.Crdp.Debugger.ScriptParsedEvent, map: string, status?: number, resolvedSourceMapUrl?: string, fetchError: string}>} mapsAndEvents
* @return {Promise<LH.Artifacts['SourceMaps']>}
*/
async function runSourceMaps(mapsAndEvents) {
Expand All @@ -50,7 +50,14 @@ describe('SourceMaps gatherer', () => {
.mockResponse('Fetch.disable', {});
const fetchMock = jest.fn();

for (const {scriptParsedEvent, map, resolvedSourceMapUrl, fetchError} of mapsAndEvents) {
for (const mapAndEvents of mapsAndEvents) {
const {
scriptParsedEvent,
map,
status = null,
resolvedSourceMapUrl,
fetchError,
} = mapAndEvents;
onMock.mockEvent('protocolevent', {
method: 'Debugger.scriptParsed',
params: scriptParsedEvent,
Expand All @@ -71,7 +78,7 @@ describe('SourceMaps gatherer', () => {
throw new Error(fetchError);
}

return map;
return {content: map, status};
});
}
const connectionStub = new Connection();
Expand Down Expand Up @@ -173,6 +180,28 @@ describe('SourceMaps gatherer', () => {
]);
});

it('throws an error message when fetching map returns bad status code', async () => {
const mapsAndEvents = [
{
scriptParsedEvent: {
url: 'http://www.example.com/bundle.js',
sourceMapURL: 'http://www.example.com/bundle.js.map',
},
status: 404,
map: null,
},
];
const artifact = await runSourceMaps(mapsAndEvents);
expect(artifact).toEqual([
{
scriptUrl: mapsAndEvents[0].scriptParsedEvent.url,
sourceMapUrl: mapsAndEvents[0].scriptParsedEvent.sourceMapURL,
errorMessage: 'Error: Failed fetching source map (404)',
map: undefined,
},
]);
});

it('generates an error message when fetching map fails', async () => {
const mapsAndEvents = [
{
Expand Down