Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix most types in beets.util.__init__ #5223

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

snejus
Copy link
Member

@snejus snejus commented May 4, 2024

Description

This PR is Part 1 of the work #5215 that fixes typing issues in beets.util.__init__ module.

It addresses simple-to-fix / most of the issues in this module.

@snejus snejus self-assigned this May 4, 2024
@snejus snejus requested a review from wisp3rwind May 4, 2024 12:30

This comment was marked as resolved.

@snejus snejus force-pushed the fix-most-types-in-util-init branch from adac926 to 6c0b63f Compare May 4, 2024 12:46
Copy link
Member

@wisp3rwind wisp3rwind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added a bunch of comments: Many of them are concerned with the tension between what types of arguments functions would accept vs. what we actually want to call them with (i.e. often it should only by bytes that are accepted, but the code does not obviously fail for str).

In other cases, it's a bit unclear why a specific change was made; it would be great if you could elaborate.

beets/util/__init__.py Outdated Show resolved Hide resolved
beets/util/__init__.py Outdated Show resolved Hide resolved
@@ -188,28 +189,28 @@ def ancestry(path: bytes) -> List[str]:


def sorted_walk(
path: AnyStr,
ignore: Sequence = (),
path: BytesOrStr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
path: BytesOrStr,
path: bytes,

While there's the bytestring_path call below, I think it is more of a safety measure to avoid crashes in case of programming errors. Effectively, all paths should internally be represented as bytes and as such I think sorted_walk should only accept bytes here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why should paths be represented as bytes internally? This requirement may be less relevant now that python 2 is dropped and not supported anymore, see https://beets.io/blog/paths.html

This way, Python 3’s Unicode str can represent arbitrary bytes in filenames. (The first commit to beets happened a bit before Python 3.0 was released, so perhaps the project can be forgiven for not adopting this approach in the first place.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, I'd imagine these functions should accept the same types that the original ones do. In this case it's os.walk, which accepts any path-like str or bytes and returns the same type that it received.

On the other hand, this PR was about fixing types rather than adjusting the implementation, so I focused on underlining what the current implement does and handles to make life easier to the person who will need to refactor this module later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why should paths be represented as bytes internally? This requirement may be less relevant now that python 2 is dropped and not supported anymore, see https://beets.io/blog/paths.html

Yes, that could be changed in principle, but that's in fact not completely straightforward: We still need conversions, at a minimum when dealing with the database (which currently stores bytes, not text. Would SQLite even support storing text with surrogate escapes?).
At that point, we might want to swtich to pathlib, however (in which case it's still not completely obvious what we should store to the database).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQLite stores unicode fine these days (we may want to include its icu extension to cover users that have some ancient versions of SQLite).

For example, I locally store bandcamp releases in this db:

CREATE TABLE release (
    source TEXT,
    label TEXT,
    name TEXT,
    album TEXT,
    artist TEXT,
    title TEXT,
    comment TEXT,
    url TEXT,
    email_id TEXT,
    added TEXT,
    done INTEGER DEFAULT 0,
    done_date TEXT,
    bought INTEGER DEFAULT 0,
    next INTEGER DEFAULT 0,
    invalid INTEGER DEFAULT 0,
    PRIMARY KEY(email_id, added, source, name) ON CONFLICT IGNORE
  );

See some of the data that I've got there
image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand I'd strongly suggest to keep the database as it is, at least for now, since this means opening a can of worms of migrating existing user databases.

What I meant by 'internally', I guess, is the code handling the paths before it hits the database (I assume that regardless of whether we supply a str or bytes, db code appropriately converts it to bytes for storage).

On the other hand paths being stored as bytes in db doesn't mean that we need to handle them as such across the codebase. I agree that pathlib is indeed the right way forward.

Comment on lines 292 to +293
path: str,
root: Optional[Bytes_or_String] = None,
root: Optional[AnyStr] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both path and root should be bytes here, for the same reason as mentioned above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments above: in summary, I think we'd rather use types that reflect the current implementation. In addition to the two reasons I've given above, I see another risk: if we were to restrict the accepted type to bytes here, in a fully-typed codebase we'd expect mypy check to fail anywhere where a str is being fed into this function.

Right now, much the codebase is not typed, so the check passes silently. Whoever ends up adding types to the part of the codebase that sends str into this function will get slightly bamboozled: they will need to read this code and understand what it does and they will have to either

  1. Adjust the implementation of that code
  2. Relax the type here

Both of these actions will be out of scope of what they're doing. Remember, I had to do this in playlist.py when I ended up relaxing the type in syspath?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy with making these functions to handle bytes only if

  1. We explicitly handle bytes only - remove the code that casts str to bytes from the beginning of the implementation
  2. Rename the functions to make the only-bytes-accepted-here requirement clear, since the equivalents in os (normpath, samefile, ...) handle str and bytes types. Therefore, to not confuse people used to the builtins, we'd want to name these functions accordingly I suppose.



def samefile(p1: bytes, p2: bytes) -> bool:
def samefile(p1: AnyStr, p2: AnyStr) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def samefile(p1: AnyStr, p2: AnyStr) -> bool:
def samefile(p1: bytes, p2: bytes) -> bool:

See reasoning above.

try:
os.remove(path)
except OSError as exc:
raise FilesystemError(exc, "delete", (path,), traceback.format_exc())


def copy(path: bytes, dest: bytes, replace: bool = False):
def copy(path: AnyStr, dest: AnyStr, replace: bool = False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def copy(path: AnyStr, dest: AnyStr, replace: bool = False):
def copy(path: bytes, dest: bytes, replace: bool = False):

beets/util/__init__.py Outdated Show resolved Hide resolved
beets/util/__init__.py Show resolved Hide resolved
beets/util/__init__.py Outdated Show resolved Hide resolved
@@ -327,7 +327,7 @@ def item_file(item_id):

try:
# Imitate http.server behaviour
base_filename.encode("latin-1", "strict")
base_filename.decode().encode("latin-1", "strict")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain this change? I think base_filename should be str here, but mypy might not be able to infer this. In any case, I don't understand what this code does at all, it might need a proper cleanup.

Copy link
Member Author

@snejus snejus May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to do with the change I made in syspath
image

You can see that it now always returns the same type that was fed into it.

    return str_path if isinstance(path, str) else str_path.encode()

It previously returned undeterministic BytesOrStr: for non-windows systems it would return the same type that was fed in, but on Windows it would always return str. This caused a fair bit of typing issues down the line.

Now the code in web/__init__.py is a bit of a mess

    # On Windows under Python 2, Flask wants a Unicode path. On Python 3, it
    # *always* wants a Unicode path.
    if os.name == "nt":
        item_path = util.syspath(item.path)
    else:
        item_path = util.py3_path(item.path)

    base_filename = os.path.basename(item_path)
    # FIXME: Arguably, this should just use `displayable_path`: The latter
    # tries `_fsencoding()` first, but then falls back to `utf-8`, too.
    if isinstance(base_filename, bytes):
        try:
            unicode_base_filename = base_filename.decode("utf-8")
        except UnicodeError:
            unicode_base_filename = util.displayable_path(base_filename)
    else:
        unicode_base_filename = base_filename

    try:
        # Imitate http.server behaviour
        base_filename.decode().encode("latin-1", "strict")

You can see it calls syspath on Windows, which now returns bytes instead of str. It seemingly expects base_filename to have type bytes and assigns unicode_base_filename, but then ends up using base_filename in the last line here, assuming that it's a str but it is still bytes.

Thus, I decode()d it here before encoding it to latin-1.

Copy link
Member Author

@snejus snejus May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this comment

    # On Windows under Python 2, Flask wants a Unicode path. On Python 3, it
    # *always* wants a Unicode path.
    if os.name == "nt":
        item_path = util.syspath(item.path)
    else:
        item_path = util.py3_path(item.path)

I think it should be safe to simply call py3_path in all cases now, actually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that syspath is a tricky regarding typing — but this is in fact a change which I think has the potential to cause a big fallout if we get it wrong, and at this point I'm not fully convinced that it is correct.

Up to now, syspath is intended to return str on Windows, because all Windows filesystems and APIs use some Unicode encoding and the previous default mbcs encoding (which I think would have been used when passing bytes to OS APIs) would in fact have been incorrect, given that beets internally uses UTF-8 as stated in the code comment.

Due to https://peps.python.org/pep-0529/ your change might in fact be correct now (note that there's a legacy mode that uses MBCS, even though I'd say users who explicitly enable that are on their own).

I'd prefer not to change the behaviour for now (at least not in this PR, let's keep the code changes to a minimum when the primary goal is introducing typings). I'm not confident that our testsuite exercises non-ASCII filenames sufficiently to catch all encoding edge cases.

For now, another solution might be to define types conditionally (mypy understands the sys.platform switch IIRC):

if sys.platform == "win32":
    SysPath: TypeAlias = str
else:
    SysPath: TypeAlias = bytes
def syspath(path: bytes, prefix: bool = True) -> SysPath:

(we should probably also run mypy on Windows then).


Regarding the web plugin code: If I'm not completely mistaken, base_filename is guaranteed to be str when decode is invoked (irrespective of the syspath change, in fact), but only bytes can be decode()ed.

Actually, it looks like the whole section

    base_filename = os.path.basename(item_path)
    # FIXME: Arguably, this should just use `displayable_path`: The latter
    # tries `_fsencoding()` first, but then falls back to `utf-8`, too.
    if isinstance(base_filename, bytes):
        try:
            unicode_base_filename = base_filename.decode("utf-8")
        except UnicodeError:
            unicode_base_filename = util.displayable_path(base_filename)
    else:
        unicode_base_filename = base_filename

is a no-op nowadays (always hitting the else arm. (It's not hit in tests at least: https://app.codecov.io/gh/beetbox/beets/blob/master/beetsplug%2Fweb%2F__init__.py but note that there, also the UnicodeError never happens, i.e. this is a place where we don't seem to actually test non-ASCII (actually, non-latin1) filenames.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the decode is where the tests fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, another solution might be to define types conditionally (mypy understands the sys.platform switch IIRC):

if sys.platform == "win32":
    SysPath: TypeAlias = str
else:
    SysPath: TypeAlias = bytes
def syspath(path: bytes, prefix: bool = True) -> SysPath:

This is a nice solution - though the catch here is that for non-win32 systems the function actually returns AnyStr, and I'm not sure how to type it to avoid this issue

image

Regarding the failing tests, oops, I was meant to use os.fsdecode there

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it looks like the whole section

    base_filename = os.path.basename(item_path)
    # FIXME: Arguably, this should just use `displayable_path`: The latter
    # tries `_fsencoding()` first, but then falls back to `utf-8`, too.
    if isinstance(base_filename, bytes):
        try:
            unicode_base_filename = base_filename.decode("utf-8")
        except UnicodeError:
            unicode_base_filename = util.displayable_path(base_filename)
    else:
        unicode_base_filename = base_filename

is a no-op nowadays (always hitting the else arm. (It's not hit in tests at least: https://app.codecov.io/gh/beetbox/beets/blob/master/beetsplug%2Fweb%2F__init__.py

Note that the coverage is produced by running the tests in the ubuntu machine, so it doesn't show the logical branches that Windows hits 🙁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding syspath, I reckon we could use the good ol' cast?

    return cast(AnyStr, str_path)

@snejus snejus force-pushed the fix-most-types-in-util-init branch from 13dea13 to f0a9ef5 Compare May 5, 2024 17:18
@wisp3rwind
Copy link
Member

I left a few more comments. The one thing that remains and where we seem to disagree is this:


Ideally, I'd imagine these functions should accept the same types that the original ones do. In this case it's os.walk, which accepts any path-like str or bytes and returns the same type that it received.

On the other hand, this PR was about fixing types rather than adjusting the implementation, so I focused on underlining what the current implement does and handles to make life easier to the person who will need to refactor this module later.


I'd be happy with making these functions to handle bytes only if

We explicitly handle bytes only - remove the code that casts str to bytes from the beginning of the implementation
Rename the functions to make the only-bytes-accepted-here requirement clear, since the equivalents in os (normpath, samefile, ...) handle str and bytes types. Therefore, to not confuse people used to the builtins, we'd want to name these functions accordingly I suppose.


I guess there are two philosophies here:

  1. Add typings according to anything the code accepts currently. This significantly reduces how much the type checker can help us to verify that our current path handling is correct, but might be the quicker path to a clean mypy run.

  2. Add typings according to what is vaguely defined to be the intended usage (e.g. always call syspath with bytes). There may be some code around that would fail to type check due to this. I'd argue that this is actually a good thing, and probably indicates that the calling code needs to be fixed. In this interpretation, more lenient implementation (i.e. no assert isinstance(path, bytes) is a form of defensive programming.

I don't think we should start adding such assert statements immediately: This is too likely to cause issues. If we want to go this way, I think the steps should be

  • Type the entire beets code base and fix any callers
  • Add a warning where the assert should ultimately be
  • Do a release, and mention in the release notes that plugin authors need to anticipate this change
  • swap warnings for asserts

I'm not sure what's the best way to proceed here. I would very much like to see 2. happen eventually.
A broader roadmap could be:

  • Type the entire beets code base, erring on the more lenient side.
  • Prepare an updated version of https://beets.io/blog/paths.html, this should probably live in this repository rather than a blog. This should have clear guidance on what to do when
    • passing paths to os.path. and similar APIs
    • reading/writing paths or path fragments from/to files (e.g. playlist files)
    • passing paths to subprocesses
    • storing paths in the database
    • reading paths or path fragments from the configuration
    • ...
  • Tighten up our typings accordingly and add warnings where the ultimately we would want assertions.
  • Do a release, and mention in the release notes that plugin authors need to anticipate this change
  • swap warnings for asserts
  • At this point, I think we'd be in a good position to revisit a conversion to pathlib. We should start by updating our design document from the second point.

@wisp3rwind
Copy link
Member

See also #1409

@snejus snejus force-pushed the fix-most-types-in-util-init branch 2 times, most recently from 49473e0 to 79cf3fd Compare May 6, 2024 08:19
@snejus snejus force-pushed the fix-most-types-in-util-init branch from 79cf3fd to 6a1f684 Compare May 6, 2024 08:30
@snejus snejus force-pushed the fix-most-types-in-util-init branch from 8f63779 to 22e2a3c Compare May 6, 2024 09:06
@snejus
Copy link
Member Author

snejus commented May 6, 2024

Weird stuff: as you've seen my change made that single test fail on Windows, see this build 22e2a3c for example

FAILED test/plugins/test_web.py::WebPluginTest::test_get_item_file - Attribut...
===== 1 failed, 1639 passed, 123 skipped, 9 xfailed in 199.92s (0:03:19) ======

Now in the next commit abd8447 I reverted my change and I got bombarded with this

FAILED test/plugins/test_convert.py::ImportConvertTest::test_delete_originals
FAILED test/plugins/test_convert.py::ImportConvertTest::test_import_converted
FAILED test/plugins/test_convert.py::ConvertCliTest::test_convert - TypeError...
FAILED test/plugins/test_convert.py::ConvertCliTest::test_convert_keep_new - ...
FAILED test/plugins/test_convert.py::ConvertCliTest::test_convert_with_auto_confirmation
FAILED test/plugins/test_convert.py::ConvertCliTest::test_embed_album_art - T...
FAILED test/plugins/test_convert.py::ConvertCliTest::test_format_option - Typ...
FAILED test/plugins/test_convert.py::ConvertCliTest::test_no_transcode_when_maxbr_set_high_and_different_formats
FAILED test/plugins/test_convert.py::ConvertCliTest::test_playlist - TypeErro...
FAILED test/plugins/test_convert.py::ConvertCliTest::test_transcode_when_maxbr_set_low_and_different_formats
FAILED test/plugins/test_convert.py::ConvertCliTest::test_transcode_when_maxbr_set_low_and_same_formats
FAILED test/plugins/test_convert.py::ConvertCliTest::test_transcode_when_maxbr_set_to_none_and_different_formats
FAILED test/plugins/test_convert.py::NeverConvertLossyFilesTest::test_transcode_from_lossless
FAILED test/plugins/test_convert.py::NeverConvertLossyFilesTest::test_transcode_from_lossy
===== 14 failed, 1628 passed, 121 skipped, 9 xfailed in 194.90s (0:03:14) =====
wtf-white-cat.mp4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants