
Cooperative request intercepts #6735

Merged
merged 67 commits into puppeteer:main on Jul 2, 2021

Conversation

benallfree
Contributor

@benallfree benallfree commented Jan 6, 2021

Hello, I wanted to share my fork with the Puppeteer team to see if there is anything here you'd like to bring into the core.

This PR introduces request intercept handlers that play well with others. I needed this for a recent project and thought it might help to share my fork because it seems that there is a lot of community confusion around the way request intercepts are handled.

The Problem: Puppeteer's core expects only one page.on('request') handler to call abort/respond/continue

Puppeteer anticipates only one request intercept handler (page.on('request', ...)) to perform continue(), abort(), or respond(). There are valid use cases where multiple request handlers may call abort/respond/continue, unaware of each other. For example, puppeteer-extra attempts a plugin architecture, yet enabling any plugin that utilizes request intercepts (such as ad blocker) means that I cannot further alter request interceptions because calling abort/respond/continue a second time will throw an error.

The current core design makes it impossible for multiple request handlers to cooperatively determine what to do with a request.

Here is a sample of issues for background:

berstend/puppeteer-extra#364
berstend/puppeteer-extra#156
argos-ci/jest-puppeteer#308
#5334
#3853 (comment)
#2687
#4524

The solution: Cooperative Request Interception mode

This PR proposes a cooperative interception strategy which allows multiple request intercept handlers to call abort/respond/continue in any order and multiplicity, while also making NetworkManager wait for async handlers to be fulfilled before finalizing the request interception.

// 2nd param enables Cooperative mode rather than Legacy mode.
// In Cooperative mode, abort/respond/continue can be called multiple times by multiple handlers.
// The winning action is decided after all handlers have resolved.
page.setRequestInterception(true, true);

// If any handler calls abort(), the request will be aborted.
// All handlers still run, but ultimately the request is aborted because at least
// one handler requested an abort.
page.on('request', (req) => {
  req.abort();
});

// If no handler called abort(), but at least one handler called respond(),
// then the request will be responded to.
page.on('request', (req) => {
  req.respond({...});
});

// This is the lowest priority. continue() is the default action, and no handler even needs
// to call it explicitly unless request info needs to be changed. A request is continued
// only if no handler has called abort() or respond().
// Cooperative mode is more forgiving: no request will hang, since continue() is the default action.
page.on('request', (req) => {
  req.continue(); // Not even necessary; NetworkManager falls through to continue() by default.
});
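The priority rule described in the comments above can be sketched as a standalone function. This is illustrative only, not the PR's actual implementation; `resolveAction` and the string action names are invented for the sketch:

```javascript
// Illustrative sketch of the "winning action" rule: run every handler,
// await any async work, then pick abort > respond > continue.
// Each handler returns the action it requested (or nothing).
async function resolveAction(handlers) {
  const requested = await Promise.all(handlers.map((h) => h()));
  if (requested.includes('abort')) return 'abort';
  if (requested.includes('respond')) return 'respond';
  // continue() is the default; no handler has to ask for it.
  return 'continue';
}
```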

Breaking changes

None. Legacy intercept mode is still the default. A new "Cooperative Intercept Mode" is introduced by a second Page.setRequestInterception parameter.

Todo

  • Document cooperative mode
  • Add unit tests
  • Fix failing tests/regressions

@whalderman

There's no reason to pull these breaking changes when a valid workaround already exists, as you referenced ( #3853 ).
This is a solvable issue within the plugin community.

@benallfree
Contributor Author

@whalderman Thanks for looking at this!

#3853 doesn't solve the problems I mentioned above, it only defers one handler's intercept resolution until all synchronous handlers have run. But you might be right that the plugin community could agree to install ONE handler that does the same thing my PR proposes. Let me take a shot at that and get back to you.
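For the record, that single shared handler could look roughly like this in userland. This is a hypothetical sketch; `createCooperativeInterceptor` and its API are invented names, not an existing package:

```javascript
// Hypothetical userland shim: the community installs ONE real
// page.on('request') handler, and plugins register cooperative
// sub-handlers with it instead of calling page.on directly.
function createCooperativeInterceptor() {
  const subHandlers = [];
  return {
    use(handler) {
      subHandlers.push(handler);
    },
    // Install as the single real handler: page.on('request', (r) => interceptor.handle(r))
    async handle(request) {
      let action = { type: 'continue' };
      for (const handler of subHandlers) {
        // Each sub-handler may request abort/respond/continue; the
        // highest-priority action is applied exactly once at the end.
        const result = await handler(request);
        if (result?.type === 'abort') {
          action = result;
        } else if (result?.type === 'respond' && action.type !== 'abort') {
          action = result; // last respond wins in this sketch
        }
      }
      if (action.type === 'abort') return request.abort();
      if (action.type === 'respond') return request.respond(action.response);
      return request.continue();
    },
  };
}
```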

@benallfree
Contributor Author

benallfree commented Jan 8, 2021

I have published a community package (https://www.npmjs.com/package/enchant-puppeteer) that monkey-patches NetworkManager and HTTPRequest to cooperatively process multiple request handlers with asynchronous executions.

IMHO this ought to be a core feature. It can be done with no breaking changes and would significantly improve interception capabilities.

I would like to hear an official thumbs-up/down from a Puppeteer maintainer before closing this issue. @whalderman forgive me if you are a maintainer; I didn't see your name in the contributor list.

@XavierGeerinck

+1 on @benallfree's comment; this is a crucial feature that I have also mentioned in #6567

@jackfranklin
Collaborator

@benallfree thanks for your efforts here and for kicking off the discussion.

I've read through your PR and I think I follow, but what I'm struggling to fully understand is a situation where the limitation of having one request handler causes problems. Is there a small code example that you could share? It just hasn't landed in my head yet, so I'm struggling to form an opinion and would like to make sure I understand fully - thanks.

@benallfree
Contributor Author

benallfree commented Jan 12, 2021

@jackfranklin Thanks for writing back and for giving this feature honest consideration 👍

Please consider this use case of two request handlers needing to work cooperatively:

The package @cliqz/adblocker contains a Puppeteer binding that intercepts Puppeteer requests and conditionally blocks them by calling req.abort() if they appear to be related to ads. Otherwise, it calls req.continue(). https://github.com/cliqz-oss/adblocker/blob/master/packages/adblocker-puppeteer/adblocker.ts#L276

AdBlocker makes for a nice plug-in, except using it precludes any other request intercept handlers from further manipulating the request. Once any handler has called abort() or continue(), calling either again throws an error.

Suppose I wanted to use AdBlocker, but I also wanted to further intercept requests not blocked by AdBlocker and return an alternate response stored in Redis. This is my real-world use case and how I found this problem.

const page = await browser.newPage();
await page.setRequestInterception(true);

// Initialize ad blocker.
// Internally, this registers a handler with page.on('request') and calls req.abort() or req.continue().
initializeAdBlocker(page);

// Now I would like to add my own caching intercept that either calls respond()
// or continue(). The problem is, by the time my handler runs, AdBlocker has already
// sealed our fate. There is no way to call respond() from here.
//
// Moreover, because this is an async function, the continue() will not
// be called until the next event loop. Any synchronous handler will "win" this race.
//
// By passing a second `true` to page.setRequestInterception(true, true), we can enable
// Cooperative Intercept Mode, and this use case will work with no other changes.
page.on('request', async (req) => {
  const cachedResponse = await lookUpCachedResponseInRedis(req.url());
  if (cachedResponse) {
    req.respond(cachedResponse);
    return;
  }
  req.continue();
});

await page.goto('...');

The modifications I propose in this PR create no breaking changes. Legacy behavior is preserved by default, and the new "Cooperative Intercept Mode" can be enabled for those who want to use it. In cooperative mode, it is possible for any number of handlers to call abort(), respond(), and continue() without throwing errors. After all the handlers have run and all asynchronous handlers have resolved, NetworkManager decides which action wins: if any handler called abort(), the request is aborted; if no handler called abort() but any handler called respond(), the request is responded to; otherwise, the request is continued by default.

@jackfranklin
Collaborator

@benallfree thank you, that's made it super clear and I think I understand the problem now.

I like the fact we could land such a change without needing to introduce breaking changes to the API.

@mathiasbynens what do you think about this?

@benallfree benallfree changed the base branch from main to 5.3.1-post January 13, 2021 12:45
@benallfree benallfree changed the base branch from 5.3.1-post to main January 13, 2021 12:46
@benallfree benallfree changed the base branch from main to 5.3.1-post January 14, 2021 17:59
@benallfree benallfree changed the base branch from 5.3.1-post to main January 14, 2021 17:59
@benallfree
Contributor Author

@jackfranklin Just checking back, thank you 👍

@mathiasbynens
Member

@jschfflr and @sadym-chromium, could you PTAL?

@jschfflr
Contributor

I also see a lot of value in this approach to handling request interception. Especially in the context of third party code that might be intercepting requests too.
But have you also considered the possibility of running the handlers in sequence instead of in parallel? I'm thinking of something similar to how Express handles network requests. The advantage would be that it's more explicit why a certain request was handled in a certain way. Otherwise a developer might, for example, be confused why a request they chose to modify via respond() was aborted.
Let me know what you think!

@benallfree
Contributor Author

benallfree commented Jan 22, 2021

@jschfflr Thanks for the feedback!

I have considered the same things, great discussion.

Synchronous handlers run in series because the Mitt event library fires them in the order they were subscribed. But if those handlers perform asynchronous operations, Mitt does not await them. I could take a shot at modifying Mitt to accept async handlers and await them in order. I agree that more determinism could be valuable.
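To illustrate the Mitt modification I mean, a serial-awaiting emitter could look something like this (a sketch only, not Mitt's actual API; `createSerialEmitter` is an invented name):

```javascript
// Sketch of a Mitt-style emitter that awaits each handler in
// registration order instead of firing them all synchronously.
function createSerialEmitter() {
  const listeners = new Map();
  return {
    on(event, handler) {
      const list = listeners.get(event) ?? [];
      list.push(handler);
      listeners.set(event, list);
    },
    async emit(event, payload) {
      // Handlers run strictly one after another, so each handler
      // observes whatever state the previous handler left behind.
      for (const handler of listeners.get(event) ?? []) {
        await handler(payload);
      }
    },
  };
}
```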

On the other hand, there could be significant performance penalties associated with serializing async handlers. We also don't currently have a way of prioritizing or introspecting page.on('request') handlers, so it would be hard to determine the order in which handlers are registered and executed anyway (especially if they are registered via plugins or even transitive dependencies). But to your point, in theory two ill-behaved async handlers could find themselves in occasional race conditions that are difficult to debug if they are allowed to execute in parallel.

Well-written handlers shouldn't assume anything about what other handlers might be doing, or how state may have changed during an await. I added the shouldAbort(), shouldRespond(), and shouldContinue() helpers to address this for interested handlers. If it's important that my handler behaves differently based on what may have happened already, I can check these states and react accordingly. Likewise, I can get the existing continuation or response data and modify it via continueRequestOverrides() or respondForRequest().

Overall, I'm leaning toward allowing parallel, non-deterministic execution, but the express community may have gotten this one right. Let me know what you think! 👍

@benallfree
Contributor Author

@jschfflr Any further thoughts? I've been thinking more about your idea of deterministic execution, I might have an idea. Do you think it's worthwhile pursuing?

@jschfflr
Contributor

jschfflr commented Jan 28, 2021

I've been thinking a lot about this problem because I'm not sure how to handle the situation best, but I came up with the following: having the requests be handled in a deterministic order sounds like the best first step in that direction. Off the top of my head, we could maybe work with an API that looks like the following:

const page = await browser.newPage();

// Adding the first interception handler will automatically enable interception.
// Request handlers will be run in the order they have been registered.
// request.continue() will call the next request handler.
// request.abort() and request.respond() will prevent further request handlers
// from being called.
page.addRequestInterceptionHandler((request) => {
  // Do something
  request.abort();
});

page.addRequestInterceptionHandler((request) => {
  // Do something
  request.respond();
});

page.on('request', (request) => {
  // Deprecate abort and respond in the on handler,
  // but keep current behaviour for now.
  // Event handlers will be triggered after the interception handlers,
  // even if abort or respond have already been called, to make sure
  // this is backwards compatible.
});

await page.goto('...');
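Under the hood, the dispatch for this sequential model could be as simple as the following. This is a hypothetical sketch, not a proposed implementation; the `settled` flag on the request object is an assumption made for illustration:

```javascript
// Express-style chain: handlers run in registration order, and a
// handler that settles the request (abort/respond) stops the chain.
function createHandlerChain() {
  const handlers = [];
  return {
    addRequestInterceptionHandler(handler) {
      handlers.push(handler);
    },
    async dispatch(request) {
      for (const handler of handlers) {
        await handler(request);
        // abort() or respond() settles the request; stop the chain
        // so later handlers never see an already-decided request.
        if (request.settled) return;
      }
      // No handler settled the request: fall through to continue().
      await request.continue();
    },
  };
}
```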

Let me know what you think!

@benallfree
Contributor Author

@sadym-chromium Was this closed in error?

@sadym-chromium
Collaborator

I'm terribly sorry! I closed it by mistake.

src/common/HTTPRequest.ts (review thread resolved)
@jschfflr jschfflr enabled auto-merge (squash) July 2, 2021 17:41
@jschfflr jschfflr merged commit b5e6474 into puppeteer:main Jul 2, 2021
@jschfflr
Contributor

jschfflr commented Jul 2, 2021

@benallfree Whohooo, this change is finally merged 🎉 🎉 🎉

@benallfree
Contributor Author

Fantastic!!! 🎉🎉🥳

@jschfflr
Contributor

jschfflr commented Jul 4, 2021

Thank you so much for your contribution!
I'm very much looking forward to seeing this being adopted :)

@benallfree
Contributor Author

@Androbin I’m the main author of this PR, can you please provide a little more explanation? Do you mean that this feature introduces a regression even when it is not enabled?

@benallfree
Contributor Author

Tagging

product-os/jellyfish#7068
#7475
#6696


8 participants