Background workers #86

Open
wants to merge 53 commits into main

Conversation


@RealOrangeOne commented Feb 9, 2024

This is a proposal for adding a background workers interface to Django, to better facilitate developers (both working with Django and library maintainers) moving certain tasks from their web processes to separate background runners.

This work originally started in Wagtail as RFC 72, but was found to be generic enough to also apply to Django, so @carltongibson suggested I adapt it to a DEP. See also my more complete code examples.

Got feedback? Feel free to leave a comment. Got bigger feedback? I'm more than happy to chat on Twitter, Mastodon, or another channel if you already know me (so this PR doesn't get too messy).


Comment on lines 259 to 269
The global task connection ``tasks`` is used to access the configured backends, with global versions of those methods available for the default backend. This contradicts the pattern already used for storage and caches. A "task" is already used in a number of places to refer to an executed task, so using it to refer to the default backend is confusing and may lead to it being overridden in the current scope:

.. code:: python

from django.contrib.tasks import task

# Later...
task = task.enqueue(do_a_thing)

# Clearer
thing_task = task.enqueue(do_a_thing)
@RealOrangeOne (author)

Naming things is hard - I'd love some alternative proposals for how to handle this naming. Even having the connection handler called tasks irks me a little.

@PeterJCLaw commented Feb 9, 2024

Observation: in the case of from django.core.cache import cache or from django.db import connection, "cache" & "connection" are the names of the thing which hosts the action to be done or provides access to said host. In the case of "tasks" though, the naming thus far has come from the name of the thing to be worked on. Perhaps there's a name for the container/host analogue which could be used instead?

For example: "Runner", "Worker", "Queue" come to mind, though I'm not sure they're great alternatives as they sound a bit too close to implementations. "Background" maybe? (perhaps too abstract?)

🥘 🤔

@RealOrangeOne (author)

The original version used background, but I changed it for exactly that reason.

"Runner", "Worker" and "Queue" are all fairly specific components in the task execution pipeline, and that's not what this really is. I guess a task backend could be considered adding a task to a "Queue", but when developing a backend, having 2 very different things called "queues" is quite annoying.


Taking inspiration from the Rails ActiveJob API, I quite like this API:

from datetime import timedelta

from django.tasks import perform_later
from django.utils import timezone

perform_later(send_email)
perform_later(send_email, wait_until=timezone.now() + timedelta(minutes=5))

Only exposing one function (and its async counterpart) avoids having to come up with different names.

Leaving run_later, call_later, execute_later as alternatives.

@RealOrangeOne (author)

It looks like Laravel uses a similarly unified dispatch API, so unifying the call API could be nice. There's still a discovery API for whether scheduling is supported, so we can just reuse that.

perform_later doesn't quite solve what to call the overarching connection manager, though.


A separate thought on naming, which I think someone mentioned in a different thread but I can't immediately find -- I think it would be useful to separate the ideas of:

  • a function which can be enqueued & called later
  • the representation of such an enqueueing which is pending being run

Currently these seem to both be called "tasks", which potentially leads to confusion. One spelling for splitting this out might be tasks (definitions) & jobs (instances) respectively [1].

Footnotes

  1. this terminology stolen from DLQ.


In some of the latest updates he's referred to the task definition as a "task function", as opposed to the task which is a "job" in your definition above. I'm not sure we'll find terminology everyone is happy with, so I've lived with the ambiguity, although I did make an oblique suggestion that we need a glossary at the end of this comment.

@carltongibson (Member)

Thanks so much for getting to this so quickly @RealOrangeOne! 🎁

I’ll have a proper read over the weekend, but one initial remark: I don’t think this should be in the contrib namespace. Either django.tasks or django.core.tasks. (I kind of prefer the former: from django import tasks would be 🥰)

@prophile commented Feb 9, 2024

Hello! Thanks for putting this proposal up, IMO this would be great to have. Dropping some comments here for completeness (as we discussed in DMs), though I'm not sure this is the right place.

Task/transaction race condition

In a backend which queues tasks onto an external queue (let's say, Redis), there's a race condition between a task runner starting the task and the transaction being committed. If I as a caller create a PaymentRequest entity and then enqueue a task to process it, the task runner might start processing the task before the transaction has committed, and so the task runner won't be able to see the PaymentRequest entity. The way to avoid the race condition is to use transaction.on_commit to enqueue the task only after the transaction has committed.

The problem is, beyond being a bit of a pain to do manually and being easily forgotten, it's not clear what is best to do in the DatabaseBackend here. For the DatabaseBackend the opposite applies – because we're enqueueing the task inside the database, we want that to be transactional, and it's most unclear to me what the guaranteed semantics are of doing database writes in transaction.on_commit. For code which wants to portably support both backends, which I presume is the goal for Wagtail, this would presumably mean conditionally using or not using transaction.on_commit based on the backend, which is quite a pain, and doesn't necessarily generalise to new backends.
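For readers unfamiliar with the pattern being described, here is a minimal sketch of the manual on_commit approach for an external (e.g. Redis) backend. transaction.on_commit and functools.partial are standard APIs; enqueue, PaymentRequest and process_payment stand in for the proposed API, the model and the task function from the example above:

from functools import partial

from django.db import transaction
from django.tasks import enqueue  # name proposed in the DEP; illustrative only

from payments.models import PaymentRequest  # hypothetical app model
from payments.tasks import process_payment  # hypothetical task function


def create_payment_request(data):
    with transaction.atomic():
        payment_request = PaymentRequest.objects.create(**data)
        # Enqueue only once the surrounding transaction has committed, so a
        # worker reading from the external broker can always see the new row.
        transaction.on_commit(partial(enqueue, process_payment, payment_request.pk))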

Transactions in ImmediateBackend

There are some fairly important subtleties with transactions in ImmediateBackend which need to be nailed down for the non-autocommit case. Specifically - does the task run in the same transaction as the caller, and how?

Case 1: Task runs in the same transaction as the caller

This is the 'obvious' implementation but comes with some major hazards.

First, if the task errors, this can leave the transaction in an inconsistent or aborted state, which is a fairly major footgun, not to mention a substantially different set of problems than would be experienced in production.

Second, in the case of asynchronous task runners, there is no guarantee that the caller will wait for the task to complete before closing the transaction, which would leave the task in a very peculiar state.

Case 2: Task runs in the same transaction as the caller, but with a savepoint

This is a bit more robust, but only limits rather than fixes the first problem, and does nothing for the second. Releasing a savepoint is also not the same thing as committing a transaction, and so this can't replicate on-commit behaviours either from the database or Django. The former can pop up errors on commit which won't show up here, and the latter matters for on_commit calls because these will then be run when the caller commits, not when the task completes.

Case 3: Task runs in a separate transaction

This provides the most robust semantics, and best mirrors the production use, but also opens up two pretty big cans of worms:

  1. How is this even implemented? A separate connection to the database? Are we talking threads? Does that work for both synchronous and asynchronous code? What about the transaction.on_commit problem?
  2. What does this mean for tests? Tests are likely to be one of, if not the, major use of the ImmediateBackend, but for fast tests we definitely want to be rolling back transactions after each test rather than doing the TRUNCATEs manually to reset each table. How would or could that interact with the separate transactions for the task? Can asynchronous, or even synchronous, tasks dangle after a test has finished?

Serialization in DatabaseBackend

I presume that the args and kwargs are serialized with Pickle in the DatabaseBackend? Some considerations:

  1. This is probably worth documenting explicitly, because it might be something of a surprise if switching to a different backend.
  2. Will the ImmediateBackend replicate the same serialization? If not, this could be a source of surprises when switching between backends. I'm also not sure of the implications of passing ORM objects as arguments to tasks and the difference between passing them with and without intermediate serialization.
  3. Moving a class to a different module, or a few other innocuous changes, will cause task dequeuing to suddenly break after a release for tasks enqueued beforehand. This is potentially quite surprising. Perhaps there's something to be learned from the serialization infrastructure for sessions, which have quite similar problems?
  4. If this used Pickle protocol version -1 (i.e. the latest) then there's a race condition around releases if a new version were released – producers would immediately start enqueuing tasks with the new version, but consumers would still be running the old version. Should the version be pinned? This is maybe not fundamentally different from the problem of rolling out new versions of classes which are being pickled and passed around in general, but it's also potentially much more surprising to developers if it happens "behind their backs". For instance, if you upgraded Python version in production, you might unexpectedly not be able to roll back because all the enqueued tasks are unexpectedly not backwards-compatible.

Standard priority ordering

Your priority ordering is an int, but there are differing standards for which direction is more important. There's quite varied prior art here:

Lower is more important: UNIX (nice levels), Celery (Redis backend)

Higher is more important: Windows thread scheduling priorities, AMQP (including RabbitMQ), Celery (RabbitMQ backend)

It seems like it would be a good idea to standardise on one of these because otherwise there's no way to portably do prioritisation, and portable prioritisation is presumably why that argument is there in the first place.

For backends which can't support priority, which includes potentially fairly major targets like SQS and Google Cloud Pub/Sub, what's the behaviour if you pass a priority argument? Is it silently ignored, and is that safe?

Dedicated capacity

This configuration semantically puts all tasks into one (priority) queue. In previous projects I (and presumably others) have relied on being able to provision dedicated capacity with things split into multiple named queues, even if there's a default. Having a way to (1) specify this portably and (2) implement it with the DatabaseBackend would be fairly important for me. YMMV.

Reliability semantics and retrying

For automatic task retrying, whenever I've worked with queue infrastructure which doesn't have reliability, I've eventually found it necessary to emulate it. If this has been others' experience (and I'm not a terribly interesting developer so if I've experienced this I suspect a lot of people have), it would be good to take the opportunity here?

It also affects the design of the interface, so IMO it's at least worth spelling out how this would work even if the implementations are left for later.

When a task fails, it's necessary to have some way of signalling whether or not it should be retried. If my email sending task fails because the SMTP server is down, I probably want to retry it; if it fails because the email address is invalid, or because I can't actually decode the task data due to a Pickle upgrade, I probably don't want to retry it. The specification here doesn't have a way to portably signal that though, and TaskStatus has only Failed in the portable subset, where it may be useful to differentiate retryable and non-retryable failures.

Task status signalling in DatabaseBackend

Given portable transactions, in the DatabaseBackend, what are your expectations around when and how the task runner locks the task instances? If the transaction for a task which is Running fails, how do you update its status afterwards from Running to Failed without a race condition?
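For reference, a common answer in database-backed queues (not something the DEP currently specifies) is to claim rows with select_for_update(skip_locked=True) and to write the terminal status in a transaction separate from the task body itself; a rough sketch, with DBTaskRecord as a hypothetical model name:

from django.db import transaction

from myqueue.models import DBTaskRecord  # hypothetical task-record model


def claim_next_task():
    # Claim one pending task row; SKIP LOCKED lets concurrent workers pass
    # over rows that another worker has already locked.
    with transaction.atomic():
        record = (
            DBTaskRecord.objects.select_for_update(skip_locked=True)
            .filter(status="NEW")
            .order_by("priority", "enqueued_at")
            .first()
        )
        if record is None:
            return None
        record.status = "RUNNING"
        record.save(update_fields=["status"])
    return record


def mark_finished(record, succeeded):
    # The status update runs in its own transaction, independent of the task
    # body, so a failed task transaction cannot roll it back.
    DBTaskRecord.objects.filter(pk=record.pk).update(
        status="COMPLETE" if succeeded else "FAILED"
    )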

Entry cleanup in DatabaseBackend

Task records are presumably not left around forever, otherwise the amount of data used grows without bound. What's the strategy to clean up task records for tasks that have completed, and particularly what's the strategy to do this without a race condition?
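One typical strategy (again, an assumption rather than something the DEP defines) is a periodic sweep that deletes only terminal records older than a retention window, so the delete never competes with workers claiming pending rows; a sketch, reusing the hypothetical DBTaskRecord model from above:

from datetime import timedelta

from django.utils import timezone

from myqueue.models import DBTaskRecord  # hypothetical task-record model


def prune_finished_tasks(max_age=timedelta(days=7)):
    cutoff = timezone.now() - max_age
    # Only terminal rows are deleted, so the sweep never touches NEW/RUNNING
    # rows that workers may be claiming concurrently.
    return DBTaskRecord.objects.filter(
        status__in=["COMPLETE", "FAILED"], finished_at__lt=cutoff
    ).delete()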

Final thoughts

I'm asking these questions not because I object to the proposal but exactly because I'd like to see this proposal succeed. Give me a shout if there's anything I can clarify or help with here.

@jacobian (Member) commented Feb 9, 2024

Thank you so much for putting this together, @RealOrangeOne! This has long been on my personal Django wishlist, so I'm absolutely stoked to see it moving forward.

I've just had a quick read, so don't have super-detailed comments, but I do have a couple of big-picture questions that I don't see addressed, and that I think need to have some answers:

  1. How "far" do you picture this framework going? That is: Celery (to pick the most full-featured Python queue) has just a huge ton of features. There's probably a point at which if you need some of those things, you should just be using Celery. I think having some idea of where the "line" is going to be would be helpful.

    Personally, I think whatever's built in to Django should aim to be super-simple, with minimal configuration, and a minimally-useful subset. Beyond that, I think we should be pushing people to more full-featured external libraries. That is, for me, a built-in task queue needs to handle two major use-cases:

    • Facilitate rapid early development - when you're starting a new app, having to set up a more complex task queue can be a barrier. Having something built-in means you can quickly do background email sending (or whatever) without too much faff. This is the "... with deadlines" part of the Django tagline, right?
    • Give third-party apps a common baseline, so they can expect a task queue without needing to commit to a specific dependency. This is the Wagtail use-case, I think, right?

    There may be other use-cases -- but the point is I think we need to enumerate them to avoid feeling like we need to build something feature-competitive with Celery or whatever.

  2. Related: what is the "transition path" for moving from the built-in task queue to something else? Some amount of refactoring is unavoidable and fine, but I think the DEP should address, at least a bit, how we expect users to transition if they outgrow what the built-in queue offers.

@RealOrangeOne (author) commented Feb 9, 2024

How "far" do you picture this framework going?

Nowhere near Celery, at least not for a while. The vast majority of this proposal is about writing the glue which sits between Django and tools like Celery, so developers and library maintainers have a single interface for using queues, without needing to worry about the implementation details.

The biggest risk of scope-creep for this is definitely the DatabaseBackend, as it actually tries to implement a task queue (which, as @prophile outlined above, is quite difficult). I agree - I see DatabaseBackend as being perfect for anything which needs a sensible and performant job queue. If people already know they need Celery, it's absolutely not the backend for them, and I don't think it should try to be. With that said, I'd like to try and avoid intentionally nerfing the DatabaseBackend - just because it's optimised for smaller deployments doesn't mean it shouldn't be capable of handling a very large number of tasks concurrently and operating in a sensible and expected way. I suspect the target audience for DatabaseBackend and a SQL-based queue are fairly well aligned with those who may choose something like PostgreSQL FTS over something like ElasticSearch. Sure, ElasticSearch is probably better for those 10% of users who really need it, but that doesn't mean the other 90% won't be perfectly happy with postgres, and probably wouldn't benefit from ElasticSearch anyway.

I think the DatabaseBackend both helps people get moving quickly and lets them scale up fairly far without needing to worry too much. Once someone needs to move, there's little application code which needs changing to move over to something like Celery, should its wealth of features be needed.

what is the "transition path" for moving from the built-in task queue to something else

I think there are 2 things which would need to transition:

Code wise, I'd like there to be as close to 0 changes needed to swap out the backend (up to a certain point). If implementation details are encoded into how tasks are queued (eg the need for transaction.on_commit for non-DB queues), then that may need changing. Otherwise, I see a world where the only changes are those in settings.py, much like we have now with django.core.cache.
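To illustrate the settings-only swap, here is roughly what that could look like, mirroring the TASKS setting format used later in this thread (the dotted backend paths shown are hypothetical):

# settings.py during early development: run tasks in the built-in DB queue.
TASKS = {
    "default": {
        "BACKEND": "django.tasks.backends.DatabaseBackend",  # hypothetical path
    }
}

# Later, outgrowing the built-in queue: only the settings change, the
# task-enqueueing code stays the same (the Celery backend path is illustrative).
TASKS = {
    "default": {
        "BACKEND": "celery_django_tasks.backends.CeleryBackend",
    }
}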

As for the queue storage itself, I think that's a little more complex. Exactly how to, say, move a bunch of tasks from RQ to Celery and RabbitMQ is probably quite a difficult story, which doesn't have an obvious answer. However, I don't think that needs to be on Django to solve, since it's only providing the glue between the runner and Django, as opposed to a library for other task runners to integrate with (other than implementing this interface, I'd like there to be no changes Celery would need to make to support this). The easiest transition path might be running both queues in parallel for a while until the first queue is empty, and then stopping the runners.

@orf commented Feb 9, 2024

I love this, thank you @RealOrangeOne. A couple of thoughts:

Passing a plain callable to enqueue and defer is a nice interface, but it has some downsides. We can only really enqueue/delay globally importable functions: no lambdas, no local functions, etc. Handling this at either the callsite or the receiver has downsides, and it's generally confusing to new users. What do we do when a user tries to enqueue the callable some_model_instance.send_email?

Should we add some concept of ownership/queue/routing to the task model itself? Given a task ID how do I know what queue it was sent to, or how would I send a task to a specific queue? Defining lots of duplicate backends (1 per queue) seems redundant and not very performant.

An execution of a task is a separate thing to the task itself - a task might have 0 or more executions (i.e. retries, timeouts, etc). The current model seems to mix both of these together - is it worth making a distinction?

Can we/should we tie enqueuing into “transaction.on_commit” by default? Seems like a big footgun people will run into almost immediately.

I personally hate enqueue and defer - they are pretty widely used, but run_later and run_at seem more approachable. Just my personal opinion, though - I wonder if anyone else agrees.

@jacobian (Member)

I suspect the target audience for DatabaseBackend and a SQL-based queue are fairly well aligned with those who may choose something like PostgreSQL FTS over something like ElasticSearch. Sure, ElasticSearch is probably better for those 10% of users who really need it, but that doesn't mean the other 90% won't be perfectly happy with postgres, and probably wouldn't benefit from ElasticSearch anyway.

This is fantastic framing, and something that should probably end up in the eventual docs for this. Super-well-said!

@PeterJCLaw left a comment

Thanks for putting this together 👏. It's really cool to see something like this coming to Django.

I've added some comments & questions inline, though I also have some thoughts towards an alternative API which alleviates some of the things I and others have raised. I suspect this is into the realm of bigger thoughts so I'll save that for another venue. (The short version is that it's somewhere between DLQ's API and Django's ORM). I'll DM you somewhere to chat.

For context: my experience with queueing systems is almost entirely with a somewhat home-grown and less well-known one (https://github.com/thread/django-lightweight-queue, DLQ), which I suspect has fewer features than the libraries already mentioned. As a consequence I've had some hands-on experience extending such a system. Many of the comments I've added are built on learnings from that.

Just a thought -- would it be worth inviting the maintainers of Celery/RQ/etc. to have a look at this PR? Especially if the goal is that transitioning from Django's internal system to one of those is relatively easy I suspect they could have a lot of wisdom to share.

Comment on lines 289 to 293
- Bulk queueing
- Automated task retrying
- A generic way of executing task runners. This will remain the responsibility of the underlying implementation, and the user to execute correctly.
- Observability into task queues, including monitoring and reporting
- Cron-based scheduling. For now, this can be achieved by triggering the same task once it finishes.


I appreciate the desire to keep things simple, however (speaking from experience adding some of these features to a queueing implementation) many of the things pushed out here are likely to have significant impacts on the API design. I would encourage considering these upfront if possible, even if they are not included in the implementation for now.

In particular the ones I'm expecting could impact the API are:

  • Bulk enqueueing
  • Automated task retrying (see also @prophile's comments around reliability and races in the database backend, which are kinda the other half of this)
  • (from @prophile's comments) dedicated capacity/priority semantics
  • (not mentioned anywhere) job timeouts and other shutdown interactions (whether to wait for or kill an in-flight task)

I'd also encourage thinking about the observability aspects. While a little less likely to impact the enqueuing API, I expect this will impact the design of the backends and may impact what plumbing APIs we want.


task = defer(do_a_task, when=timezone.now() + timedelta(minutes=5))

When scheduling a task, it may not be **exactly** that time a task is executed, however it should be accurate to within a few seconds. This will depend on the current state of the queue and task runners, and is out of the control of Django.


❓ How important do you feel it is to prescribe the precision of the deferred running? Could this be left up to implementers to define? The latter would allow for configurable implementations if, e.g., a user needs more precision or is happy with far less.

👍 to calling out that it's out of Django's control and there may be some inaccuracy here.

@RealOrangeOne (author)

I think this may be a holdover from a previous version of this DEP, where it was more important to define this. It's absolutely backend dependent, but I still wanted to flag both that there's little Django can do here, and that the risk exists regardless.

Comment on lines 200 to 213
Deferring tasks
---------------

Tasks may also be "deferred" to run at a specific time in the future:

.. code:: python

from django.utils import timezone
from datetime import timedelta
from django.tasks import defer

task = defer(do_a_task, when=timezone.now() + timedelta(minutes=5))

When scheduling a task, it may not be **exactly** that time a task is executed, however it should be accurate to within a few seconds. This will depend on the current state of the queue and task runners, and is out of the control of Django.


❓ Given that running tasks on a cron is excluded from this proposal, how important do you feel it is to include support for running at an arbitrary time in the future? Thinking about the implementation side, cron support and delayed running support feel fairly similar.

For some context here, the queueing system I'm most familiar with (https://github.com/thread/django-lightweight-queue) supports cron but not delayed running of tasks.

@RealOrangeOne (author)

Implementation wise, they're very similar, sure. Definition wise, however, they're quite different. when is just a "do not execute before X" field in the queue. cron, however, not only requires the list of tasks to be defined statically (to avoid a bootstrapping issue), but also some kind of lock to ensure 2 workers don't try to run it at once.

This suffered from a bootstrapping problem. Instead, people can use whatever techniques they're using at the moment, and task-ify them as needed.
@RealOrangeOne (author)

We can only really enqueue/delay globally importable functions

I don't think I can encode this in the type signature, but yes the function will need to be global. I suspect we may need to validate this just before triggering. Whilst certain backends may support it, most won't.

Should we add some concept of ownership/queue/routing to the task model itself

Yes, absolutely we should. A Task should probably know all the attributes which were passed to the enqueue call which triggered it.

An execution of a task is a separate thing to the task itself

Currently, these are one and the same. The current implementation doesn't support retries directly, so retries would likely be separate tasks.

Can we/should we tie enqueuing into “transaction.on_commit” by default?

I don't think so. It's a foot-gun in some cases, but might end up being confusing in others. In some cases, you might not care about having objects committed to the DB, so it just delays the task unnecessarily. I think in the first-pass, we can solve this with documentation.

I personally hate enqueue and defer

Naming things is hard. I'm not attached to them by any means.

@ryanhiebert left a comment

Thank you so much for bringing this forward. I fully support it. I've spent the past two weeks battling myself on what things I think are critical features for this interface.


We should declare the expected acknowledgement semantics.

While I very heavily use Celery's acks_late=True in combination with a reliable message broker like RabbitMQ, I think that if we want to keep a simpler interface we should define that tasks are acknowledged before work on them begins. Thus, we should document that backends are expected to run tasks at most once in order to be compatible with this interface. Senders of tasks (including library authors) that perform such dangerous actions as sending out emails need to be confident that they aren't going to be sent out multiple times.

At least once execution is also a very helpful semantic for a broad variety of tasks, and I hope that we can standardize an interface for that in the future. What I think is unwise would be to leave this semantic distinction undefined in the specification.

@RealOrangeOne (author)

"at most once" semantics are definitely the easiest to reason about, although in some cases people may want the latter. I'm not exactly how we encode that in the current API. But I think I agree that we can assume "at most once" for all backends, and if we want it to be configurable over time, it'll probably end up being another argument to enqueue (or perhaps something defined at config time if it's not sensibly / easily configured per-task).


If we document it as the semantic we expect, then the API doesn't need to encode it. It's along the same lines as the scheduling API, where we want to be clear about the intended semantics of the API we're exposing.


Celery may not be able to be a straightforward backend as-is. And I'm not sure that's a problem we want to fix. Celery, afaik, requires tasks to be registered in the runner. But the more I think about the interface, the more I like that this doesn't require the callables to be registered somewhere, and I hope we can keep that.

@orf

However this means that moving a task function breaks all in-flight tasks, plus all the lovely security issues that come from a remote source giving us arbitrary functions, by path, to call.

Registration is definitely the way, unfortunately.

@orf commented Feb 25, 2024

Just to elaborate on this, what if a worker receives a malicious message like so:

{"function": "subprocess.check_call", "args": ["rm", "-rf", "/"]}

There has to be a form of task registration, so that there is an allowlist of specific functions to run when a task is received. This also helps decouple the function's import path from the actual message, which is a good thing ™️.

You could perhaps restrict this to functions that live in [app_name].tasks, but it still suffers from the same issue if subprocess is imported:

{"function": "some_app_name.tasks.subprocess.check_call", "args": ["rm", "-rf", "/"]}


I don't think that we should attempt to make this interface resilient to malicious messages any more than Django generally makes things resilient to malicious database queries. Sending messages is dangerous, and it is intended to do dangerous things, and you need to make sure you trust the code that can do that.

It's true that you can't change the location of a callable all at once without the possible need for downtime, but that's pretty easy to deal with by creating a wrapper function in one place that calls the other.

@orf commented Feb 25, 2024

Sending messages is dangerous, and it is intended to do dangerous things, and you need to make sure you trust the code that can do that.

There is a vast difference between this and "anyone who can add messages into the queue now has full remote-code execution capabilities, no questions asked". Invoking completely arbitrary, user-supplied functions with user-supplied arguments from outside sources has always ended rather poorly.

There are also unresolved issues around how (and where!) to handle task.enqueue(some_model_instance.send_email). Some things can't, and shouldn't, be done with this kind of pass-an-arbitrary-callable interface.


If Celery requires task functions to be marked, then it's unlikely they're the only library

I'm certain it is not the only one. Dramatiq also requires actors (analogous to Celery tasks) to be registered.

personally I'd prefer Django be forcing good practice on a user, rather than require they opt-in to security, or allowing them to create a foot-gun and potential CVE for Django later on.

I agree.


We could make this opt-in functionality, or even opt-out, but personally I'd prefer Django be forcing good practice on a user, rather than require they opt-in to security, or allowing them to create a foot-gun and potential CVE for Django later on.

Security needs to be on by default. We already receive emails to security@ that boil down to "When I ignore the security warning in the docs, there is a security gap". Any opt-out would need to be documented with a very clear "Security warning" to scare away those who don't actually need an opt-out, and to provide an easy link for us to reply to those security report emails.

While it follows common patterns, a decorator for task registration does affect usability compared to just sending the callable, because it requires that the modules be imported in order for the tasks to be found. This can be done explicitly or by some auto-discovery magical convention, and while you may prefer those patterns they aren't without trade-offs (ones that I would prefer to avoid).

Django is opinionated and already applies the auto-discovery pattern for several types of resources; most relevant examples being Models and management commands. Adding task auto-discovery to AppConfig with a def import_tasks(self) would be a natural fit.

This AppConfig auto-discovery opens the opportunity for built-in management commands to inspect and run/enqueue tasks. I've seen this exact "cron calling a management command that enqueues a Celery task" pattern way too many times in my career. It would be nice to remove that boilerplate code, because people will repeat that pattern if cron is not included.
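A sketch of how that discovery could look, reusing Django's existing autodiscover_modules helper (the AppConfig shown is illustrative; a dedicated import_tasks hook is the suggestion above, not an existing API):

from django.apps import AppConfig
from django.utils.module_loading import autodiscover_modules


class TasksConfig(AppConfig):
    name = "django.tasks"  # illustrative

    def ready(self):
        # Import every installed app's tasks.py so decorated task functions
        # register themselves, mirroring how admin.py modules are discovered.
        autodiscover_modules("tasks")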


Django is opinionated and already applies the auto-discovery pattern for several types of resources

FWIW, DLQ takes this approach -- it auto-loads tasks.py files in apps to discover tasks which register themselves via a decorator.

@RealOrangeOne (author)

Since this discussion, I've made quite a few changes to the API, including adding an explicit decorator to create a Task object based on a function (which in turn both validates the function is globally importable, and ensures only registered functions can be used as tasks).

I'm not super familiar with how Celery "marks" a task behind the scenes, and then validates it. If it just needs to store a function reference somewhere, that's easy enough to implement with the current calling pattern. If it needs the globally importable function to be of a given type, that might also be doable with a little more work.

@ryanhiebert as you've clearly used Celery more than I, I'm interested in your thoughts!
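For anyone following along, a rough sketch of what decoration-time registration and validation could look like (names and checks are illustrative; the actual behaviour lives in the DEP and the reference implementation):

registered_tasks = {}  # hypothetical module-level registry


def task(func):
    """Register a module-level function as a task (illustrative sketch)."""
    # Lambdas, closures and methods have a __qualname__ that differs from
    # __name__, and could not be re-imported by dotted path in a worker.
    if func.__name__ == "<lambda>" or func.__qualname__ != func.__name__:
        raise ValueError(f"{func!r} is not a globally importable function")
    registered_tasks[f"{func.__module__}.{func.__name__}"] = func
    return func

A decorator-factory form taking options such as queue_name, as discussed elsewhere in this thread, is omitted for brevity.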


It needs it to be of a specific type (a subclass of celery.Task), and registered with the (usually global) Celery app, I believe.

@adamchainz (Member)

I would rather we don’t use Pickle for argument serialization, and instead use json. There are too many problems with pickle:

  1. Protocol version changes between Python versions complicate upgrading, since you need to lose tasks or temporarily try both versions.
  2. Enqueueing data rather than references - That is, enqueueing a task with a model instance, rather than its ID. If you save the queued model instance within the task, it erases any intermediate changes in the database. I’ve seen this problem so many times that I put it first in my post “Common Issues Using Celery (And Other Task Queues)”.
  3. Security problems - per the big red box in the pickle docs. Yes, the task storage should be trusted, just like cache storage, but if we can just avoid pickle altogether there will never be an issue.
  4. Slower performance - pickle makes it easy to put lots of data into a task argument when you only need a bit, such as a whole model instance when you only need an ID. It’s also not the fastest to deserialize because it runs a whole stack machine.

The serialization should be swappable, perhaps by subclassing and replacing the serialize/deserialize methods. That way users can use pickle or whatever if they really need to.
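A sketch of the subclass-and-replace idea (class and method names are illustrative, not part of the DEP); it also shows pinning the pickle protocol, per the upgrade concern raised earlier in the thread:

import json
import pickle


class JSONSerializer:
    """Default, restrictive serializer: arguments must be JSON-safe."""

    def serialize(self, obj):
        return json.dumps(obj).encode()

    def deserialize(self, data):
        return json.loads(data)


class PickleSerializer(JSONSerializer):
    """Opt-in replacement for projects that really need pickle."""

    def serialize(self, obj):
        # Pin the protocol rather than using -1, so producers and consumers
        # running different Python versions stay compatible during rollouts.
        return pickle.dumps(obj, protocol=4)

    def deserialize(self, data):
        return pickle.loads(data)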

@RealOrangeOne (author)

The serialization should be swappable

I had intentionally not included support for this. If a given backend doesn't support pickle-based serialization, shoe-horning it in can be quite difficult and unpredictable. If a library maintainer also assumes pickle serialization, their library won't work in environments where the pickle constraints do matter. And that's before we think of a transition path.

Going with JSON (at least at first) has a few benefits:

  • It avoids the pickle issues
  • If it can be JSON serialized, it can be pickled (the inverse isn't true)
  • It bypasses a number of odd behaviours where someone accidentally passes something to a task which shouldn't be passed (i.e. it isn't thread-safe, or depends on external state, like a NamedTemporaryFile).
  • Keeping the format JSON serializable can be helpful when integrating with external monitoring tools

I'm not opposed to swappable serializers in future, but JSON only for the initial pass is probably the way to go. Being restrictive now and expanding support later is going to be better for adoption than being vague, or having to add restrictions in future.

This allows them to be statically typed, and avoids the need to import the Django methods all over the place.
@prophile commented Apr 7, 2024

The serialization should be swappable, perhaps by subclassing and replacing the serialize/deserialize methods. That way users can use pickle or whatever if they really need to.

json as default serialization sidesteps some problems and avoids the issues around serializing instances, but not being able to serialize a datetime in an argument by default is quite unfortunate. I don't know whether it's worth a default workaround though.
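One possible partial workaround, if the JSON route is taken: Django already ships DjangoJSONEncoder, which encodes datetime, date, time, timedelta, Decimal and UUID values, though they come back out of json.loads as plain strings:

import json
from datetime import datetime, timezone

from django.core.serializers.json import DjangoJSONEncoder

payload = {"send_at": datetime(2024, 4, 7, 12, 0, tzinfo=timezone.utc)}

encoded = json.dumps(payload, cls=DjangoJSONEncoder)
# '{"send_at": "2024-04-07T12:00:00Z"}'
decoded = json.loads(encoded)  # note: the value is now a plain string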

@carltongibson (Member) left a comment

Hey @RealOrangeOne.

This is looking good. A very gentle read, and really quite compelling. 🏅

With the (WIP) reference implementation coming along https://github.com/RealOrangeOne/django-core-tasks it's very exciting. (There are no Issues there, so I'll comment here: would it be worth adding an examples/ folder with a simple app setup, so that a first pass is not much more than a git clone, a pip install, and a ./manage.py ... or two?)

@ngnpope (Member) left a comment

I've had these comments floating around as pending for ages and never got around to posting them. Doing so now - I did have a quick look and tweaked them, but sorry if they're outdated or have been discussed/decided on already.

This allows it to handle raising the exception itself, which makes errors more descriptive
It's the `TaskResult` which has the status
@olivierdalang left a comment

As an author of one of the too many packages dealing with background tasks (django-toosimple-q), I can't wait to see this land in Django :-) Thanks so much for working on this!

Comment on lines 407 to 414
- Completion / failed hooks, to run subsequent tasks automatically
- Bulk queueing
- Automated task retrying
- A generic way of executing task runners. This will remain the responsibility of the underlying implementation, and the user to execute correctly.
- Observability into task queues, including monitoring and reporting
- Cron-based scheduling
- Task timeouts
- Swappable argument serialization (eg `pickle`)


Maybe we can include these in this list (which I see as potentially important and which I don't see discussed in the DEP)

  • Logging (besides the result, the task results could capture logs, stdout and stderr)
  • Realtime output (for long-running tasks, having the ability to get some kind of feedback before the task finishes would, for instance, allow implementing progress bars)
  • Non-Django tasks (not 100% clear if the proposed approach already supports enqueuing non-Django tasks, but if not, having a worker-agnostic background task API would open quite a few doors for distributed architectures)

@RealOrangeOne (author)

  • Capturing stdout / stderr is probably an antipattern anyway - collect metrics like that manually
  • As part of the initial interface, that's definitely not happening without manual implementation. Depending on how granular the progress is, it could end up being a performance bottleneck
  • This one has bugged me for a while - there's not a huge amount in here that's Django specific, so it could be extracted out to Python itself. However, I'd rather let the backends deal with that themselves. If someone wants to maintain a package which can let arbitrary Python apps schedule tasks using say the built-in DB queue, I'm not opposed to that at all.

Comment on lines 97 to 98
DatabaseBackend
This backend uses the Django ORM as a task store. This backend will support all features, and should be considered production-grade.


As you explained, this will suit most users. Could you describe just a little bit more how you'd see that implemented (as I didn't see it yet in your POC implementation)?
I guess it would be an optional contrib module providing a management command to run the workers, and probably some admin integration to see task results?

@RealOrangeOne (author)

I've intentionally not put implementation details like that in here, as they're fairly external. But yes, the idea would be a management command to run the worker, a DB model to store the data itself, and likely an admin interface for debugging.

@carltongibson (Member) commented May 13, 2024

Thanks for the discussion so far all. I've asked for @django/steering-council review on the Forum here:

https://forum.djangoproject.com/t/steering-council-vote-on-background-tasks-dep-14/31131

@apollo13 (Member)

Overall the proposal looks good, but there is one thing that I can't wrap my head around yet. Let's assume you are writing a third-party application and want to use this task framework. Which backend/queue names are you putting into the @task decorator? If you leave it at the default, then all of a sudden most third-party libraries will end up on the same backend. Using a unique backend name for your third-party library requires the user to configure multiple backends, which in the worst case all have an expensive-to-set-up connection. In an ideal world, as a consumer of the library, I'd like to be able to route where the tasks go without having to fork the library.

Two approaches come to mind (there might be more and even saner approaches):

  • Use the queue_name of the task to search for a backend that serves this queue?
  • Add aliases to the backend definition so that one could configure something like this:
   TASKS = {
      "high_prio": {
         "BACKEND": "django.tasks.backends.ImmediateBackend",
         "QUEUES": [],
         "OPTIONS": {},
         "ALIASES": ["default", "third_party_app1"] # See here
      },
      "low_prio": {
         "BACKEND": "django.tasks.backends.ImmediateBackend",
         "QUEUES": [],
         "OPTIONS": {},
         "ALIASES": ["third_party_app2"] # and here…
      }
   }

@apollo13 (Member)

I think the points raised by @prophile in Transactions in ImmediateBackend need some more thought. Even if we want fast tests, we shouldn't stray too much in semantics from a real deployment. This most likely means that we have to execute tasks from the immediate backend in a separate database connection.

Additionally we should probably add a flag to the database backend so it can operate in two ways: one where it joins the existing transaction on enqueue and one where it utilizes on_commit if a transaction is open.

The database backend part doesn't have to be part of the initial implementation, but we really need to get the semantics of the immediate backend right.

@RealOrangeOne (author)

Which backend/queue names are you putting into the @task decorator?

Unless you have a good reason, just omit it. If you want users to be able to configure the attribute themselves, you can define your own setting and let the user configure it there. I do like the aliases idea, as a way of easily allowing a user to route a task to a given queue, but a setting feels a bit less magic to me.

I imagine TASKS being very small, generally just a single entry - much like you would have in DATABASES for most deployments. High / low priority should be handled either with the priority attribute, or using queue names if the intention is more around dedicated capacity / specialised hardware.
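A sketch of the library-author workaround suggested here (the setting name, the task function, and the queue_name keyword on the decorator are all illustrative):

from django.conf import settings
from django.tasks import task  # decorator as discussed in this thread


@task(queue_name=getattr(settings, "MYLIB_TASK_QUEUE", "default"))
def rebuild_search_index(object_id):
    ...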

@RealOrangeOne (author) commented May 13, 2024

Transactions in ImmediateBackend

These are the semantics which concern me about an initial implementation. I suspect configuration is the way to go here, as there isn't one obvious answer (on_commit might make sense for a database-based backend, but is radically different to an external, eg Redis, backend).

I've intentionally not started much work on the DatabaseBackend yet, as I want to have lots of conversations with smart people to enumerate these kinds of potential issues, so we can explicitly address (and unit test) them.

@apollo13 (Member)

Which backend/queue names are you putting into the @task decorator?

Unless you have a good reason, just omit it. If you want users to be able to configure the attribute themselves, you can define your own setting and let the user configure it there. I do like the aliases idea, as a way of easily allowing a user to route a task to a given queue, but a setting feels a bit less magic to me.

I didn't consider a setting here, but I agree that it is a good workaround for now, and we can add aliases later on if needed.

Transactions in ImmediateBackend

These are the semantics which concern me about an initial implementation. I suspect configuration is the way to go here, as there isn't one obvious answer

I guess the immediate answer would be "behave like other backends would" :D But once one has tested that the backends behave properly, it might make sense to switch to "I know this code works, now run fast so my tests are quicker and better to test".
