Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Watchdog generates multiple file modify events #64684

Open
2 of 9 tasks
dfidler opened this issue Jul 20, 2023 · 11 comments
Open
2 of 9 tasks

[BUG] Watchdog generates multiple file modify events #64684

dfidler opened this issue Jul 20, 2023 · 11 comments
Assignees
Labels
Bug broken, incorrect, or confusing behavior needs-triage

Comments

@dfidler
Copy link

dfidler commented Jul 20, 2023

Description

If you modify a file that is being watched by watchdog, two beacon events are being generated by the minion. This appears to be because, on windows, multiple events are sent from the windows kernel, through watchdog.

IHAC that configured a watchdog beacon (interval 10) that is always generating two modify events and, as a result, is trigging a reactor twice to do the same work (effectively doubling the load on the salt-minion and the reactor system.

Setup

  1. Install minion on a windows server
  2. Ensure watchdog py module has not been installed before
  3. open console on salt-master and watch event stream ( salt-run state.event pretty=true )
  4. Run the following...
cat <<EOF > /srv/salt/watchdogtest.sls
ensure watchdog installed:
  pip.installed:
    - name: watchdog
    - reload_modules: True
      
watch_dog_beacon:
   beacon.present:
     - name: watchdog
     - save: True
     - enabled: True
     - directories:
        C:\inetpub\wwwroot:
          mask:
            - modify
            - delete
     - interval: 10
     - disable_during_state_run: True
     - require:
       - ensure watchdog installed
  1. salt <windows minion> state.apply watchdogtest
  2. Restart the salt-minion (note: This appears to be another bug - the windows minion can't seem to load the watchdog py module without a restart, causing the beacon.present to fail).

You will see two events go across the bus. And if you have a reactor configured to listen to these events, it will trigger twice.

Please be as specific as possible and give set-up details.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD
  • classic packaging
  • onedir packaging
  • used bootstrap to install

Steps to Reproduce the behavior

  1. Setup the system (per above)
  2. echo 1 > C:\inetpub\wwwroot\test (no event generated)
  3. wait a few seconds to ensure no bus events are forthcoming....
  4. echo 1 > C:\inetpub\wwwroot\test

As you watch the event bus, you'll see two events are generated

Expected behavior

Should only receive one event, not two. We should be de-duping events from watchdog.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
          Salt: 3005.1

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: 4.0.7
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: 1.1.4
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.10.1
        pygit2: Not Installed
        Python: 3.8.15 (tags/v3.8.15:44adf8a, Nov  8 2022, 17:20:07) [MSC v.1929 64 bit (AMD64)]
  python-gnupg: 0.4.8
        PyYAML: 5.4.1
         PyZMQ: 22.0.3
         smmap: 4.0.0
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist:
        locale: cp1252
       machine: AMD64
       release: 2019Server
        system: Windows
       version: 2019Server 10.0.17763 SP0 Multiprocessor Free
@dfidler dfidler added Bug broken, incorrect, or confusing behavior needs-triage labels Jul 20, 2023
@dfidler
Copy link
Author

dfidler commented Jul 20, 2023

Note: The version of watchdog being used is 3.0.0

@s0undt3ch
Copy link
Member

@dfidler being a common issue with watchdog, what do you suggest Salt should do?

@dfidler
Copy link
Author

dfidler commented Jul 22, 2023

Suggestions on the threads that I've seen are to dedupe the events. There's a discussion on some of the pros/cons of doing so (specifically around when creating files) on one of the sites I read. So cache the events and, rather than throw them immediately, wait half a second, and dedupe.

There's the potential for race conditions there is someone is doing a bunch of file operations that last longer than a half second. So maybe even not throwing an event in a file if it had already been thrown in the last (configurable) seconds?

One thread recommended using watchfiles rather than watchdog (newer/better) but I think that would be a new module.

edit: going back in time to hide the fact that either I can't type or that English appears to be my 689th language

@garethgreenaway
Copy link
Contributor

@dfidler Sounds like perhaps we should deprecate the watchdog beacon in favor of one based on watchfiles.

@twangboy twangboy self-assigned this Aug 24, 2023
@LeaDevelop
Copy link

Could reproduce this issue by us as well

@LeaDevelop
Copy link

Related issue #62638

@twangboy
Copy link
Contributor

@garethgreenaway The watchdog beacon was added for Windows as iNotify does not support Windows. I think I wrote the original beacon... it's quite possible that I did it wrong... it was my first (and only) time writing a beacon

@dfidler
Copy link
Author

dfidler commented Oct 26, 2023

@twangboy I think that this is less about "us" and more about "watchdog on windows". I think it's a reasonable expectation for a "file event monitoring tool" to only send out one event per change and I think everyone that uses it starts with that assumption.

In other news, someone published a "DuplicateEventLimiter" class in the watchdog repo: gorakhargosh/watchdog#1003 (comment)

Perhaps pilfer that code and move the threshold into a watchdog config on the minion?

@dfidler
Copy link
Author

dfidler commented Oct 26, 2023

Ah... I think this is the issue: gorakhargosh/watchdog#260

Interesting discussion on the topic...

@dfidler
Copy link
Author

dfidler commented Oct 26, 2023

My brain seems to have purged almost all knowledge that I had on this (signature of events, etc) - I wish I'd documented the details better in this thread. :(

Sorry all... I'm typing while I think through this so this is going to be a long one...

Anyway, I am of four minds (purist, architect, pragmatist and the masochist) on this issue and they are currently at war with each other. (actually the masochist is watching the other "three" of me and is laughing). But the more I think about this, the more complex the "right" solution becomes.

The only thing that all three brains agree on is that the sole purpose of any system producing an event is to allow a consumer to respond to it, and that if the consumer takes action, that action takes time to start, and complete (minutes/days for a manual one, seconds for an automated one).

The purist in me rails at the idea of a single filesystem operation generating two events. It's wrong on every level and it violates everything I hold dear as a purist. Imagine if the "response" to a filesystem event was to create an invoice (and there was no way to distinguish between those events). In that context, generating multiple events is completely inexcusable.

The architect (who is also kind of a purist) wants the beacon-reactor relationship to be transactional. That is to say that once a beacon is fired, the minion squashes all further events (or it has some kind of escalation mechanism where if the same event is fired a hundred times, it communicates that thrash to the salt-master with "something bad is happening") until the salt-master(s) acknowledge the event. I see two possible algorithms:

  1. salt-master has no configured reactors:

    1. salt-master responds with ACK:complete
    2. salt-minion clears state
  2. salt-master has reactors configured for subject:

    1. salt-master responds with ACK:working
    2. salt-master runs orchestration
    3. salt-master responds with ACK:complete (challenge here, if ACK:complete doesn't happen inside a state.apply, then a race condition ensues because there is a delay between when the state.apply finishes and when the ACK:complete reaches the minion and changes could occur during that window that wouldn't be caught - but this is also solvable).

The real challenge here is that a minion can speak to multiple masters and masters in a "cluster" don't share this kind of stateful information... so if the two masters are misconfigured with different reactors.... and down the rabbit hole I go adding more and more code to make this reliable.... :(

Of course this would further address some other issues with the reactor system, like if there are too many events on the bus, it just drops them on the floor (they are not queued).

This last point alone means that the whole reactor system could be improved significantly as we have users that stop using beacons/reactors because they can't rely on them to fire - this watchdog issue only brings us to that "overloaded-and-dropping-events" twice as fast.

The pragmatist in me thinks that the MVP here is the dedupe with a configurable timeout. It works around the duplicate event issue and goes at least some way (in the crudest sense) toward addressing the "it takes time to respond to an event", with a disclaimer, in the docs, that this is happening but that so long as the event's response is a state.apply, it doesn't really matter anyway.

Uh oh... the purist is trying to club the pragmatist with a clue bat... I need to put the keyboard down. 🤣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior needs-triage
Projects
None yet
Development

No branches or pull requests

5 participants