Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileChange added/deleted events for one file are returned out of order, or are missing #148

Open
Lx opened this issue May 27, 2022 · 11 comments
Labels

Comments

@Lx
Copy link

Lx commented May 27, 2022

Description

I'm trying to monitor file creations and deletions using awatch. I'm expecting that if files are rapidly created and then deleted, watchfiles would either:

  • report them, in order, as a creation followed by a deletion (ideally); or
  • not report them at all (acceptable).

I'm getting a Set[FileChange] back with two entries for the same file as expected, but when I enumerate over the set, the deletion is being reported before the addition.

I've tried setting debounce=0 and step=0 in my awatch call in the hope that this will prevent the events from being bundled within the same set, but this doesn't work.

Debug output suggests that the raw events are being emitted in the correct order. Possibly awatch should be returning a Sequence rather than a set?

Example Code

import asyncio
import watchfiles

async def go():
    async for changes in watchfiles.awatch(
        "/REDACTED", debounce=0, step=0, debug=True
    ):
        print("---")
        for change in changes:
            print(change)
        print("---")

asyncio.run(go())

Example Code Output

raw-event: Event { kind: Create(File), paths: ["/REDACTED/bee309868d89bfa2"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
raw-event: Event { kind: Modify(Name(Any)), paths: ["/REDACTED/bee309868d89bfa2"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
raw-event: Event { kind: Modify(Data(Content)), paths: ["/REDACTED/bee309868d89bfa2"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
---
(<Change.deleted: 3>, '/REDACTED/bee309868d89bfa2')
(<Change.added: 1>, '/REDACTED/bee309868d89bfa2')
---

Operating System

macOS 12.4 (21F79)

Environment

No response

Watchfiles Version

0.14.1

Python Version

Python 3.10.4

@Lx Lx added the bug label May 27, 2022
@Lx
Copy link
Author

Lx commented May 27, 2022

After digging around within watchfiles, and in case it helps:

raw_changes = await anyio.to_thread.run_sync(watcher.watch, debounce, step, timeout, stop_event_)

raw_changes receives the events out of order at this point, so it seems to be a RustWatcher-internal problem. I don't speak Rust, but I'm almost willing to bet that the use of a HashSet is causing the issue:

changes: Arc<Mutex<HashSet<(u8, String)>>>,

especially given the prevalence of the phrase "arbitrary order" in Rust's std::collections::HashSet docs.

If a set data structure is really desired then maybe a BTreeSet is more appropriate—but on further reflection, is a set really appropriate?

Suppose a scenario where, within one "cycle", a file is rapidly:

  1. created
  2. deleted
  3. created again.

Whether or not this situation is actually possible, I would want to be told that either:

  • the file was created, deleted, and then created; or
  • the file was created (and not be told about the additional deletion/creation).

What I really would not want in this situation is for one of the creation events to be treated as a duplicate, and filtered out with the deletion remaining in the result. I suspect that currently, this is exactly what would occur. Perhaps a Sequence-type structure is the only appropriate option.

@samuelcolvin
Copy link
Owner

Well python sets don't maintain order, so it doesn't really matter what type we use in rust.

I assume that offer doesn't matter, but de-duplication does, hence the use of sets.

@Lx
Copy link
Author

Lx commented May 27, 2022

Why is deduplication important?

@samuelcolvin
Copy link
Owner

Because depending on the OS there can easily be duplicates in the file changes. For example two "file modified" events because the data is modified and also the timestamp. E.g. look at your example above.

@Lx
Copy link
Author

Lx commented May 28, 2022

Well python sets don't maintain order, so it doesn't really matter what type we use in rust.

Yes, use of a set in either space will destroy the ordering, so even if the Rust was using a data type that maintained the raw event order, the Python would be losing it.

But the data is losing its order before being emitted from the Rust side of things, so I'm reasonably convinced that the data type being used in the Rust space does matter here also.

Because depending on the OS there can easily be duplicates in the file changes. For example two "file modified" events because the data is modified and also the timestamp.

Okay, I can understand the utility in deduplicating modification events.

I'm still persuaded to believe that naïvely deduplicating creation/deletion events is dangerous though.

To paraphrase my earlier example, would you agree that given three consecutive raw events (create → delete → create), removing one of the create events (and nothing else—a "naïve" deduplication) paints a false picture?

A more appropriate deduplication strategy for these events (if deduping is even really needed) might be to remove pairs of creation/deletion events, leaving the single creation event.

E.g. look at your example above.

In my example I see two raw events of kind Modify. Is that what you're referring to?

Are they truly duplicate events though? They have different "sub-kinds" and one of them seems to actually concern file deletion.

@samuelcolvin
Copy link
Owner

What you miss is that filesystem changes are often delayed by seconds or even tens of seconds, and can arrive or of order.

The delay issue is particularly prevalent on MacOS, but can also happen on Windows.

The other thing I think you miss is that most users of watchfiles (and watchgod before that), don't actually care much about exact order, many of them don't even care about which files changed.

I'm therefore 👎 on changing the interface and adding lots of extra logic for a minority of users.

However I'd be willing to accept a pr to add a new interface to return details of every change so they can be used in python. Would you like to have a go at that?

@Lx
Copy link
Author

Lx commented May 28, 2022

The other thing I think you miss is that most users of watchfiles (and watchgod before that), don't actually care much about exact order, many of them don't even care about which files changed.

This remark makes me feel that maybe I've again missed the mark in describing the core problem, so I hope you don't mind my attempting one final rephrase:

  • I (personally, like other users in your description) don't care about the order of operations across separate files.

  • However, if:

    • watchfiles is simultaneously telling me that the same file has been both created and deleted; and
    • watchfiles can't tell me which of those events is more recent; and
    • even if it could, naïve deduplication of these events has removed the latest creation/deletion event;

    then I have to resort to explicitly testing the file's existence outside watchfiles to get the true story.

  • If the Rust and Python code were collecting these events in an ordered data structure, part of the problem would be solved.

I'm therefore 👎 on changing the interface and adding lots of extra logic for a minority of users.

Understood and accepted.

In case anyone else lands on this bug report with the same problem I've faced, I've implemented this Python workaround:

def without_multifile_events(changes: set[watchfiles.main.FileChange]):
    """
    Prevent simultaneous reports of creation and deletion of one file.

    `watchfiles` might report creation and deletion of the same file at once,
    and since results are returned in an unordered set, the latest event can't
    be deduced from the results alone.  This function explicitly tests for the
    existence of paths in this ambiguous situation and emits only the events
    that represent the current path state.

    Specifically, for each unique path in `changes`:

    - Emit `Change.modified` events as is.
    - Emit one `Change.added` event if and only if:
        - there are no `Change.deleted` events for this path; or
        - there are also `Change.deleted` events, but the path exists.
    - Emit one `Change.deleted` event if and only if:
        - there are no `Change.added` events for this path; or
        - there are also `Change.added` events, but the path doesn't exist.

    See the related GitHub issue for further details:
    https://github.com/samuelcolvin/watchfiles/issues/148

    ```
    from collections import defaultdict
    from pathlib import Path
    import watchfiles

    for changes in watchfiles.watch("/path"):
        for change, path_str in without_multifile_events(changes):
            ...
    ```
    """
    changes_by_path = defaultdict[str, set[watchfiles.Change]](set)
    for change_tuple in changes:
        change, path = change_tuple
        if change is watchfiles.Change.modified:
            yield change_tuple
        else:
            changes_by_path[path].add(change)
    for path, changes_only in changes_by_path.items():
        if len(changes_only) == 1:
            (single_change,) = changes_only
        else:
            single_change = (
                watchfiles.Change.added
                if Path(path).exists()
                else watchfiles.Change.deleted
            )
        yield single_change, path

However I'd be willing to accept a pr to add a new interface to return details of every change so they can be used in python. Would you like to have a go at that?

This would involve making changes to the Rust code and I have no Rust experience, so I respectfully decline.

@Lx Lx closed this as completed May 28, 2022
@Lx Lx closed this as not planned Won't fix, can't repro, duplicate, stale May 28, 2022
@Lx Lx changed the title multiple FileChange events are returned out of order FileChange added/deleted events for one file are returned out of order, or are missing May 28, 2022
@samuelcolvin
Copy link
Owner

Ok, the rust wouldn't be that hard.

I'll leave this open as a python interface to the underlying changes would be useful.

@samuelcolvin samuelcolvin reopened this May 28, 2022
@0xpr03
Copy link

0xpr03 commented Aug 31, 2022

can't tell me which of those events is more recent

In my experience your OS also can't reliably tell you this. And it may have changed in between your notification and you using that file. So checking is definitely the right way.

That said I can understand wanting to correctly handle the case where it does actually work as expected.

@koiosjojo
Copy link

@samuelcolvin - you obviously have a much closer relationship to the typical use cases of end users utilizing this package, but in my case (monitoring many files simultaneously on a shared network drive for updates which then trigger downstream processes) @Lx's assessment and workaround has been critical in accurately identifying inflection points relative to a particular file.

It seems to me (and I only offer this since the topic is still open) that a package-native solve would be something a substantive percentage of users would find utility in having, rather than having to run this out to pathlib and defaultdict to manage externally.

As you mentioned above, I don't believe the request to be outside the scope of the Notify implementation this leans on, so it should be fairly simple to implement.

@samuelcolvin
Copy link
Owner

Maybe @Lx is right and this is an issue with the use of HashSet.

Maybe we could use https://docs.rs/indexmap/latest/indexmap/set/struct.IndexSet.html and thereby maintain order.

Pr welcome to implement this, we should be able to tell just from creating the pr if it fixes the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants