
Fixes: #15194 - Check the whole queue for matching queued webhooks #15318

Open · wants to merge 1 commit into develop
Conversation

peteeckel
Contributor

Fixes: #15194

This PR changes the logic by which changes to M2M relationships are matched to changes in the base object the M2M relationship belongs to. The current code assumes that the post_save signal is immediately followed by the m2m_changed signal for that object, but that is not generally true: any save() operation for a different object occurring in between separates the two entries in the queue. This results in two webhooks being triggered, the first of which carries incorrect data in its payload.

The new code finds the position of the queued webhook data for the object change, updates it with the M2M change data, and thereby ensures that only one webhook is triggered, with the correct data.
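The approach can be sketched as follows. This is a simplified, hypothetical illustration of the queue-scan idea, not NetBox's actual internals; the dictionary keys and function names are made up for the example.

```python
# Hypothetical sketch of the whole-queue scan. Instead of inspecting only
# queue[-1], walk the entire queue looking for an entry that describes a
# change to the same object within the same request.

def find_queued_event(queue, content_type, object_id, request_id):
    """Return the index of a queued event for the same object, or None."""
    for index, event in enumerate(queue):
        if (
            event["content_type"] == content_type
            and event["object_id"] == object_id
            and event["request_id"] == request_id
        ):
            return index
    return None

def enqueue_m2m_change(queue, content_type, object_id, request_id, snapshot):
    index = find_queued_event(queue, content_type, object_id, request_id)
    if index is not None:
        # Merge the M2M change into the existing entry so that only one
        # webhook fires, carrying the final post-change snapshot.
        queue[index]["postchange_snapshot"] = snapshot
    else:
        queue.append({
            "content_type": content_type,
            "object_id": object_id,
            "request_id": request_id,
            "postchange_snapshot": snapshot,
        })
```

The lookup is linear, but the queue only holds the events of a single request, so it stays short in practice.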

@peteeckel peteeckel marked this pull request as draft February 29, 2024 19:09
@peteeckel peteeckel marked this pull request as ready for review March 1, 2024 12:57
@peteeckel peteeckel marked this pull request as draft March 1, 2024 17:37
@peteeckel peteeckel force-pushed the 15194-fix-duplicate-webhook-invocations branch from a2e6304 to 1728997 Compare March 1, 2024 18:10
@peteeckel peteeckel marked this pull request as ready for review March 1, 2024 18:28
@jeremystretch
Member

@peteeckel thanks for all your work on this, both in capturing the root issue and in devising a solution! Admittedly, I haven't attempted to reproduce this firsthand, but thanks to your very detailed notes I believe I understand the root issue.

Given that we cannot rely on merely checking the most recent queued item (queue[-1]), perhaps converting the queue to a hash map would be preferable. For each item, we could compute a simple hash from its content type, primary key, and request ID, which would be stored as the entry's key in an ordered dictionary. This would allow us to easily look up a previously queued record for an object by its unique hash. Does that make sense?
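The keyed-queue idea could look roughly like this. All names here are illustrative, not actual NetBox code; a plain dict is used since insertion order is preserved in Python 3.7+.

```python
# Illustrative sketch of storing events in a dict keyed by
# (content type, primary key, request ID), so a previously queued record
# for an object can be found in O(1) instead of via a linear scan.

def event_key(content_type, object_id, request_id):
    return f"{content_type}:{object_id}:{request_id}"

def record_event(events, content_type, object_id, request_id, data):
    key = event_key(content_type, object_id, request_id)
    if key in events:
        events[key].update(data)  # merge into the existing entry
    else:
        events[key] = dict(data)

events = {}  # insertion-ordered, so webhook firing order is preserved
record_event(events, "dcim.device", 1, "req-1", {"action": "created"})
record_event(events, "dcim.device", 1, "req-1", {"m2m": True})
# one entry remains, containing both pieces of data
```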

If you're open to it, I'd be happy to continue your work in this direction. (Of course, you're also welcome to continue tackling it yourself.) Let me know what you think, and thanks again!

@peteeckel
Contributor Author

Hi Jeremy, you're very welcome - I know that some of the bugs I find are slightly exotic and that you're probably working much more than usual to get NetBox 4 out, so never mind about the delay.

As for converting the queue to a dictionary: that makes a lot of sense to me, provided that nobody actually relies on it being a queue, which probably isn't the case. The only thing that might be of concern is that some users might make assumptions about the order in which webhooks are triggered, so my first thought was to use an OrderedDict.

In CPython, dicts are in fact always ordered, but the question is whether we can rely on NetBox always being run on CPython. Trying to err on the safe side, I'd assume the answer is 'no', so the queue should be replaced by an OrderedDict.

I'm currently on a business trip until Thursday, so I can start working on it again on Friday - if that's not an issue I'll give it a try. On the other hand, the sooner the problem is fixed the better, so don't hesitate to go for it.

@peteeckel
Contributor Author

Hi @jeremystretch, I just had a look at the code and it seems that converting the queue to a hash map is not a trivial change.

To me, there seems to be a showstopper in the flush_events function in netbox/extras/events.py, which not only expects a queue but also hands it off to the methods listed in EVENTS_PIPELINE; these might contain other code (plugin code, perhaps?) that would need to be prepared to handle the hash map instead of the queue.

So while changing it is rather simple, it will definitely break documented or at least established interfaces, which would probably be possible with the switch to NetBox 4 but not earlier, correct?

@peteeckel
Contributor Author

Hi @jeremystretch, as this seems to be stalled on the question of the interface to the methods in the EVENTS_PIPELINE - do you have a preference for how to proceed with this?

I think moving from a queue to a hash map would require changing the (documented?) interface, so the question is whether or not this is an option (or whether I am missing something). If it gets changed, it would probably be best with NetBox 4.0 as a major version.

Or were you suggesting converting the queue to a hash map just for the sake of the check whether an item is already in the queue, without changing the queue structure as such? I doubt this would be an efficient use of resources, as repeatedly copying the data into a hash map is more effort than the linear search I implemented.

If not, do you require any changes to my PR?

@jeffgdotorg
Collaborator

@peteeckel Please add a test that illustrates the behavior you're describing, so that the developer who picks this up can be confident in their understanding of the problem. Additionally, given the sensitivity of the code path that this PR targets, we need to include tests which confirm the continued correct operation of the rest of the code path's behavior. For example, there's some concern about the possibility of race conditions when multiple actors try to modify the same object.

@peteeckel
Contributor Author

Hi @jeffgdotorg, that's an interesting request :-)

  1. I can't add a test within NetBox because the behaviour can only be reproduced with a specifically crafted save() operation that calls a save() operation for a different model. AFAIK such a situation does not exist within NetBox (but it does within NetBox DNS).
  2. I agree that the code path is critical, which is why I am assuming that the existing NetBox test suite already covers it - and the existing test suite passes without a flaw with the patch in place.
  3. The code (the existing as well as the patched version) runs within the context of a single request, so I don't see how concurrency could come into play. Is there an existing test in the NetBox test suite that I can use as an example for how that can be done?

@peteeckel
Contributor Author

By the way, I can add a test to the NetBox DNS test suite. Not sure if that would help, though.

@jeremystretch
Member

@peteeckel my biggest concern, given the nature of the bug, is that some future change to the logging logic could reintroduce the error. If you can walk me through exactly what's happening in the plugin, I might be able to rig up a test to replicate it in NetBox.

@peteeckel
Contributor Author

Hi @jeremystretch, thanks for the quick reaction.

I constructed some code to reproduce the behaviour in the underlying issue. It can be triggered in multiple ways, all requiring a specifically crafted save() function, which may or may not affect a model with an associated many-to-many relationship.

The easiest way to trigger it is to call super().save() twice within that save() function: that causes a double invocation of the event hook, one with a correct pre-change snapshot and one with a correct post-change snapshot. The code in events.py does not handle this correctly because it only associates an m2m_changed event with an immediately preceding post_save event, and does not merge two consecutive post_save events.

When there is an M2M relationship, the issue is also triggered when the save() method of the model calls the save() method of another model: that, too, breaks the expected order of post_save immediately followed by m2m_changed, because post_save for the second model is emitted before the m2m_changed event.

The only thing my patch does is check the whole queue for events for the same object, instead of checking only whether the last queued event is a post_save immediately preceding the m2m_changed; this covers both cases.
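The second scenario can be simulated without Django. The following is a deliberately simplified model of the buggy check (signal names mirror Django's; the handler and queue layout are made up for the illustration):

```python
# Pure-Python simulation of the signal-ordering problem. A post_save for a
# different object arriving between an object's post_save and its
# m2m_changed defeats a check that only inspects the last queued item.

def handle_event(queue, signal, obj_id):
    if signal == "m2m_changed":
        # Buggy variant: only merge if the *last* entry matches.
        if queue and queue[-1] == ("post_save", obj_id):
            queue[-1] = ("post_save+m2m", obj_id)
            return
    queue.append((signal, obj_id))

queue = []
handle_event(queue, "post_save", "zone-1")     # save the base object
handle_event(queue, "post_save", "record-1")   # save() of another model in between
handle_event(queue, "m2m_changed", "zone-1")   # M2M change for the base object
# queue now holds two separate entries for zone-1: two webhooks fire,
# the first with an incorrect payload
```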

With your earlier suggestion of changing the queue's data structure to a hash map, this would be much simpler and more elegant, though at the cost of breaking the interface for the EVENTS_PIPELINE. Apart from that, not much would change.

@peteeckel
Contributor Author

Just in case you are asking yourself "Why should anyone want to do that kind of stuff?" ...

Take a DNS zone, now update the list of name servers for the zone. The latter is an m2m relationship.

  • If the zone needs to be created first, that causes a post_save event for Zone.
  • Now the m2m_handler kicks in and creates the NS records in the zone (one post_save event for Record per NS record) ...
  • ... which requires the zone's serial to be updated (another post_save for Zone, which can under some circumstances be optimised away, but not always)
  • And finally, the m2m_changed event for the zone itself is queued (m2m_changed event for Zone)

And there you have everything described above in a single request.
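Applying the whole-queue check to that sequence can be simulated like this (again an illustrative sketch, not the actual patch; the merge here simply keeps the latest event data for each object):

```python
# Simulation of the corrected handling for the DNS zone sequence above.
# Scanning the whole queue lets later events for the Zone find its earlier
# entry despite the intervening Record saves, so only one webhook fires
# per object.

def handle_event(queue, signal, obj_id):
    for i, (_, queued_id) in enumerate(queue):
        if queued_id == obj_id:
            queue[i] = (signal, obj_id)  # merge in place: latest data wins
            return
    queue.append((signal, obj_id))

queue = []
handle_event(queue, "post_save", "Zone")        # zone created
handle_event(queue, "post_save", "Record-NS1")  # NS records created
handle_event(queue, "post_save", "Record-NS2")
handle_event(queue, "post_save", "Zone")        # serial update
handle_event(queue, "m2m_changed", "Zone")      # M2M change for the zone
# one entry per object remains; the Zone entry carries the final state
```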

I documented the result in peteeckel/netbox-plugin-dns#89 (comment), including a verification that the PR actually solves this.

Successfully merging this pull request may close these issues.

Erroneous invocations of Event Rules/Webhooks with M2M relationships