-
Description: When a long-running job has its own timeout set, no exception is thrown when that timeout is hit. An exception is only thrown when the job has no timeout but, for example, Horizon does. So if you say a job may have at most 1 exception (maxExceptions=1) but can try 10 times, with a per-attempt timeout of 100s, it will still run all 10 attempts of 100s each. I would expect an exception to be thrown, because every retry will fail against the same timeout for this long job. Real-life example: when using a cache lock inside a job to serialize work, the lock can't be released, because no exception is thrown and the "failed" method is never called, so you can't clean up the lock. On the second attempt the lock is still set, but you can't retrieve it because the retried job isn't its owner. The job therefore still retries x times, then fails, and only then force-releases the lock. Not the expected behaviour. Steps To Reproduce:
I consider this a bug, because the job can never succeed given the timeout that has been set on it. I would expect the same behaviour as when the Horizon process is killed by the worker's specified timeout; in that case the default Laravel behaviour works fine. If wanted, I can submit a PR with a proposal.
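The setup described above can be sketched as a minimal job. The class name, lock key, and timings are illustrative; the `$timeout`, `$tries`, `$maxExceptions` properties and the `failed()` hook are the standard Laravel ones:

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\Cache;

class LongRunningJob implements ShouldQueue
{
    use InteractsWithQueue, Queueable;

    public $timeout = 100;     // each attempt is killed after 100s
    public $tries = 10;        // up to 10 attempts
    public $maxExceptions = 1; // expected: fail for good after 1 exception

    public function handle()
    {
        // Acquire a lock (no expiry, as in the scenario described)
        // so jobs for the same client run one at a time.
        $lock = Cache::lock('client-lock');

        if (! $lock->get()) {
            // Another job owns the lock; put this one back on the queue.
            $this->release(10);
            return;
        }

        // ...long-running work. If it exceeds $timeout, the worker kills
        // the attempt, but no exception reaches the job, so neither
        // maxExceptions nor failed() kicks in and the lock stays held.

        $lock->release();
    }

    public function failed(\Throwable $e)
    {
        // Cleanup that, per this report, never runs on a plain job timeout.
        Cache::lock('client-lock')->forceRelease();
    }
}
```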
-
So... what were you expecting to happen? I'm not clear on that. Were you expecting it to only attempt once because of "maximum exceptions"?
-
Not clear what you expect here. You want to consider a job timeout as an exception and fail right away since you have maxExceptions=1?
-
Sorry for the misunderstanding. The expectation was the same behaviour as hitting the timeout in the horizon/queue command only: there it fails after one long attempt that hits the process timeout. So yes, failing because of maxExceptions=1.
But the behaviour is different in the end. When Horizon kills the process, you get the wanted exception right after hitting the timeout (so no additional retries, because of the maxExceptions variable set to 1) and the failed method on the job is called (when it exists). The latter doesn't happen when you set a timeout on the job that is smaller than the Horizon timeout (as advised in the documentation).
-
Sorry I still don't understand. What do you mean by the Horizon timeout? What is the bug here? Is it that the job retries while you have maxExceptions=1?
-
In Horizon you can set a timeout for the supervisor. If no job timeout is specified, the timeout set by the worker (Horizon) is used. In my opinion, the job's failed method should be called in that case instead of silently retrying. From the documentation, https://laravel.com/docs/7.x/queues#cleaning-up-after-failed-jobs:
The problem is that the "cleanup" can only happen after the full amount of retries, in the case of a timeout caused by the job timeout. If this "works as intended": how can you release a cache lock when the job runs too long and times out, without waiting for all its retries to happen (because maxExceptions is not taken into account for a timeout)?
-
Why is there then a change in behaviour when no timeout is set on the job? When the supervisor kills the process after hitting its timeout, a retry doesn't happen.
Can you give an example for the cache lock? Our cache lock doesn't have an expiry time. We use it to process jobs sequentially for a specific client, and release a job back to the queue if the lock is already held by another job. But when the job holding the lock hits its own timeout, the retries can't acquire the lock. This is also unwanted, because the job hit the timeout once and will time out every other time too. How can this be managed? If a timeout happened, it shouldn't retry again and use resources.
-
I think you're confusing things. I suggest you try posting some sample code on the forums and see if people can help explain what's going on in your job. As for Horizon, the timeout works the same as in a regular worker: when a timeout happens, the worker will exit, but the job is still in the queue and is going to be retried if you have tries available.
-
It would be a pretty huge change of behavior for us to start calling failed() for every timeout; we have never called it in that situation before. But I'm not sure if I'm even understanding what you want. Is that what you are wanting, for failed() to be called when the job times out?
-
Converting to discussion pending inactivity.
-
Correct, that was a misunderstanding in the tests we tried.
No, that's not the expected result. When the job fails, it only has to fail once. What we're expecting, on the other hand, is that the job can fail when there's a timeout. Most of the time you will catch timeout exceptions for requests to 3rd parties (through Guzzle, for example); the timeout of the job is always greater than those timeouts, so a retry isn't wanted here. A couple of possible solutions:
Best solution: a new function
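A sketch of the proposed direction: an opt-in job property that turns a timeout into a real failure, so failed() runs and cleanup can happen. Later Laravel versions shipped essentially this as $failOnTimeout; it is shown here only to illustrate the idea being asked for:

```php
<?php

namespace App\Jobs;

use Illuminate\Contracts\Queue\ShouldQueue;

class LongRunningJob implements ShouldQueue
{
    public $timeout = 100;

    // Opt in: when an attempt hits $timeout, mark the job as failed
    // immediately (calling failed()) instead of silently retrying.
    public $failOnTimeout = true;

    public function handle()
    {
        // ...long-running work...
    }

    public function failed(\Throwable $e)
    {
        // Cleanup (e.g. force-releasing the cache lock) now also runs
        // when the job fails because of a timeout.
    }
}
```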
-
You're using a lock that never expires! Not sure why you're doing so but this is very dangerous. There's no guarantee that any code will run that will release the lock. Your locks should auto expire after a decent amount of time based on how you design your jobs to run.
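For reference, Cache::lock accepts a TTL in seconds as its second argument, so the lock frees itself even if no cleanup code ever runs. The key name and timings below are illustrative:

```php
use Illuminate\Support\Facades\Cache;

// Lock auto-expires after 120 seconds, comfortably above the job's
// 100s timeout, so a killed attempt can't wedge the lock forever.
$lock = Cache::lock('client-lock', 120);

if ($lock->get()) {
    try {
        // ...do the work...
    } finally {
        // Release explicitly on the happy path; the TTL is the backstop.
        $lock->release();
    }
}
```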
-
Handled in #33521