Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(Request): add requestHandlerTimeoutSecs and requestTimeoutSecs options #1560

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

mstephen19
Copy link
Contributor

closes #1485

Adds requestHandlerTimeoutSecs and requestTimeoutSecs options to a request to allow for request-specific timeouts.

If these optional parameters aren't provided in RequestOptions, the crawler will fallback to the timeouts provided in its configuration.

Copy link
Contributor

@szmarczak szmarczak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs tests.

// A request can have some request-specific timeouts defined. We check if they have been defined, and if so,
// use those timeouts instead. However; if they aren't present, the defaults will be used.
const requestTimeoutMillis = request?.requestTimeoutMillis ?? this.internalTimeoutMillis;
const requestHandlerTimeoutMillis = request?.requestHandlerTimeoutMillis ?? this.requestHandlerTimeoutMillis;
Copy link
Contributor

@szmarczak szmarczak Sep 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not

            request.requestHandlerTimeoutSecs ??= this.requestHandlerTimeoutMillis / 1000;
            [...]
            await this._timeoutAndRetry(
                () => source.markRequestHandled(request!),
                request.requestHandlerTimeoutSecs * 1000,
                `Marking request ${request.url} (${request.id}) as handled timed out after ${requestTimeoutMillis / 1e3} seconds.`,
            );

This way we don't need those helpers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes true, nullish coalescing reassignment. I rarely use it, but this is definitely a good use-case

@mstephen19
Copy link
Contributor Author

This is not done. Going to work on a better solution other than reassigning these values all over the place - it gets messy. These new request options need to be handled in BasicCrawler, HttpCrawler, and BrowserCrawler. Work in progress.

@B4nan B4nan marked this pull request as draft September 23, 2022 06:52
@B4nan
Copy link
Member

B4nan commented Sep 23, 2022

If we want to persist the information (e.g. to make this work on platform even after migration), we should use the same approach as with skipNavigation - it gets stored in the userData.__crawlee.

These new request options need to be handled in BasicCrawler, HttpCrawler, and BrowserCrawler

Isnt this all about basic crawler, the rest builds on top of it? I can imagine some protected method there (so we can override in extension classes) to calculate timeout for a given request instance, like crawler.getRequestTimeout(req).

@szmarczak
Copy link
Contributor

@B4nan What about making it look like this:

router.addHandler('label-a', async (ctx) => {
   ctx.log.info('...');
}, { timeoutSecs: 5 });

@mstephen19
Copy link
Contributor Author

@B4nan The problem (in my opinion) with the idea Szymon is proposing is that it would only allow the user to specify label-specific requestHandlerTimeoutSecs. If the options are right within RequestOptions, we can allow the user to control requestHandlerTimeoutSecs AND navigationTimeoutSecs for specific requests. It allows for more control.

@B4nan
Copy link
Member

B4nan commented Sep 23, 2022

I would rather support something like this, feels more natural (and having options parameter after a callback one is a weird DX):

router.addHandler('label-a', async ({ request, log }) => {
    request.timeoutSecs = 5;
    log.info('...');
});

(and this means those options would be part of RequestOptions too, I can see the value in the nav timeout)

But we can support both ways in the end if it makes sense.

Keep mind mind that a request is labeled, so this above is also "label specific timeout", I dont see a difference with the Szymon's version in this regard.

@mstephen19
Copy link
Contributor Author

It's different because you could do something like this:

await crawler.run([{ url: 'foo', label: 'TEST', timeoutSecs: 5 }, { url: 'foo', label: 'TEST', timeoutSecs: 10 }])

Same label, but different requestHandlerTimeoutSecs since the timeout is configured on the request object instead of the handler.

@B4nan
Copy link
Member

B4nan commented Sep 23, 2022

Yes, and one of that would have priority, two ways to do the same.

(not that the proposal is about router.addHandler, not crawler.run, but I guess we are aligned on that)

@mstephen19
Copy link
Contributor Author

Yeah not talking specifically about addHandler or run, but just about how these should be directly on the Request object. I'll take a look at what we did for skipNavigation and refactor the changes in this PR. Will also rename the requestTimeoutSecs option to navigationTimeoutSecs to match crawler options.

@B4nan
Copy link
Member

B4nan commented Sep 23, 2022

Right, I would definitely do the request options first, then we can think about the router API. The initial goal here to me was to allow modifying this e.g. via router middleware - there you can label stuff as well as modify the timeouts, before the corrent request handler is picked and executed.

@mstephen19
Copy link
Contributor Author

mstephen19 commented Sep 23, 2022

This feature requires not only changes in the Request object, but also in BasicCrawler, HttpCrawler, and BrowserCrawler. BasicCrawler has requestHandlerTimeoutSecs, which is used differently in both HttpCrawler (this crawler has its own userRequestHandlerTimeoutSecs) and BasicCrawler. It's a tightly coupled OOP mess with state values and overrides all over the place, and many things need to be changed in many places to make this work smoothly in all crawlers while also maintaining previous behaviors.
As someone not completely familiar with this codebase, I could easily forget something and unknowingly break stuff. @szmarczak would you feel comfortable taking a crack at it?

@B4nan
Copy link
Member

B4nan commented Sep 23, 2022

You lost me at OOP mess :D

@mtrunkat mtrunkat added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow distinct timeouts for differently labeled handlers in the router
4 participants