Fix most types in beets.util.init #5223

snejus · 2024-05-04T12:30:28Z

Description

This PR is Part 1 of the work #5215 that fixes typing issues in beets.util.__init__ module.

It addresses simple-to-fix / most of the issues in this module.

wisp3rwind

I just added a bunch of comments: Many of them are concerned with the tension between what types of arguments functions would accept vs. what we actually want to call them with (i.e. often it should only by bytes that are accepted, but the code does not obviously fail for str).

In other cases, it's a bit unclear why a specific change was made; it would be great if you could elaborate.

beets/util/__init__.py

wisp3rwind · 2024-05-04T16:49:35Z

beets/util/__init__.py

@@ -188,28 +189,28 @@ def ancestry(path: bytes) -> List[str]:


 def sorted_walk(
-    path: AnyStr,
-    ignore: Sequence = (),
+    path: BytesOrStr,


Suggested change

path: BytesOrStr,

path: bytes,

While there's the bytestring_path call below, I think it is more of a safety measure to avoid crashes in case of programming errors. Effectively, all paths should internally be represented as bytes and as such I think sorted_walk should only accept bytes here.

I was wondering why should paths be represented as bytes internally? This requirement may be less relevant now that python 2 is dropped and not supported anymore, see https://beets.io/blog/paths.html

This way, Python 3’s Unicode str can represent arbitrary bytes in filenames. (The first commit to beets happened a bit before Python 3.0 was released, so perhaps the project can be forgiven for not adopting this approach in the first place.)

Ideally, I'd imagine these functions should accept the same types that the original ones do. In this case it's os.walk, which accepts any path-like str or bytes and returns the same type that it received.

On the other hand, this PR was about fixing types rather than adjusting the implementation, so I focused on underlining what the current implement does and handles to make life easier to the person who will need to refactor this module later.

I was wondering why should paths be represented as bytes internally? This requirement may be less relevant now that python 2 is dropped and not supported anymore, see https://beets.io/blog/paths.html

Yes, that could be changed in principle, but that's in fact not completely straightforward: We still need conversions, at a minimum when dealing with the database (which currently stores bytes, not text. Would SQLite even support storing text with surrogate escapes?).
At that point, we might want to swtich to pathlib, however (in which case it's still not completely obvious what we should store to the database).

SQLite stores unicode fine these days (we may want to include its icu extension to cover users that have some ancient versions of SQLite).

For example, I locally store bandcamp releases in this db:

CREATE TABLE release ( source TEXT, label TEXT, name TEXT, album TEXT, artist TEXT, title TEXT, comment TEXT, url TEXT, email_id TEXT, added TEXT, done INTEGER DEFAULT 0, done_date TEXT, bought INTEGER DEFAULT 0, next INTEGER DEFAULT 0, invalid INTEGER DEFAULT 0, PRIMARY KEY(email_id, added, source, name) ON CONFLICT IGNORE );

See some of the data that I've got there

On the other hand I'd strongly suggest to keep the database as it is, at least for now, since this means opening a can of worms of migrating existing user databases.

What I meant by 'internally', I guess, is the code handling the paths before it hits the database (I assume that regardless of whether we supply a str or bytes, db code appropriately converts it to bytes for storage).

On the other hand paths being stored as bytes in db doesn't mean that we need to handle them as such across the codebase. I agree that pathlib is indeed the right way forward.

wisp3rwind · 2024-05-04T16:50:47Z

beets/util/__init__.py

    path: str,
-    root: Optional[Bytes_or_String] = None,
+    root: Optional[AnyStr] = None,


I think both path and root should be bytes here, for the same reason as mentioned above.

See my comments above: in summary, I think we'd rather use types that reflect the current implementation. In addition to the two reasons I've given above, I see another risk: if we were to restrict the accepted type to bytes here, in a fully-typed codebase we'd expect mypy check to fail anywhere where a str is being fed into this function.

Right now, much the codebase is not typed, so the check passes silently. Whoever ends up adding types to the part of the codebase that sends str into this function will get slightly bamboozled: they will need to read this code and understand what it does and they will have to either

Adjust the implementation of that code

Relax the type here

Both of these actions will be out of scope of what they're doing. Remember, I had to do this in playlist.py when I ended up relaxing the type in syspath?

I'd be happy with making these functions to handle bytes only if

We explicitly handle bytes only - remove the code that casts str to bytes from the beginning of the implementation

Rename the functions to make the only-bytes-accepted-here requirement clear, since the equivalents in os (normpath, samefile, ...) handle str and bytes types. Therefore, to not confuse people used to the builtins, we'd want to name these functions accordingly I suppose.

wisp3rwind · 2024-05-04T16:59:28Z

beets/util/__init__.py



-def samefile(p1: bytes, p2: bytes) -> bool:
+def samefile(p1: AnyStr, p2: AnyStr) -> bool:


Suggested change

def samefile(p1: AnyStr, p2: AnyStr) -> bool:

def samefile(p1: bytes, p2: bytes) -> bool:

See reasoning above.

wisp3rwind · 2024-05-04T17:03:29Z

beets/util/__init__.py

    try:
        os.remove(path)
    except OSError as exc:
        raise FilesystemError(exc, "delete", (path,), traceback.format_exc())


-def copy(path: bytes, dest: bytes, replace: bool = False):
+def copy(path: AnyStr, dest: AnyStr, replace: bool = False):


Suggested change

def copy(path: AnyStr, dest: AnyStr, replace: bool = False):

def copy(path: bytes, dest: bytes, replace: bool = False):

beets/util/__init__.py

wisp3rwind · 2024-05-04T17:19:31Z

beetsplug/web/__init__.py

@@ -327,7 +327,7 @@ def item_file(item_id):

    try:
        # Imitate http.server behaviour
-        base_filename.encode("latin-1", "strict")
+        base_filename.decode().encode("latin-1", "strict")


Could you explain this change? I think base_filename should be str here, but mypy might not be able to infer this. In any case, I don't understand what this code does at all, it might need a proper cleanup.

This has to do with the change I made in syspath

You can see that it now always returns the same type that was fed into it.

return str_path if isinstance(path, str) else str_path.encode()

It previously returned undeterministic BytesOrStr: for non-windows systems it would return the same type that was fed in, but on Windows it would always return str. This caused a fair bit of typing issues down the line.

Now the code in web/__init__.py is a bit of a mess

# On Windows under Python 2, Flask wants a Unicode path. On Python 3, it # *always* wants a Unicode path. if os.name == "nt": item_path = util.syspath(item.path) else: item_path = util.py3_path(item.path) base_filename = os.path.basename(item_path) # FIXME: Arguably, this should just use `displayable_path`: The latter # tries `_fsencoding()` first, but then falls back to `utf-8`, too. if isinstance(base_filename, bytes): try: unicode_base_filename = base_filename.decode("utf-8") except UnicodeError: unicode_base_filename = util.displayable_path(base_filename) else: unicode_base_filename = base_filename try: # Imitate http.server behaviour base_filename.decode().encode("latin-1", "strict")

You can see it calls syspath on Windows, which now returns bytes instead of str. It seemingly expects base_filename to have type bytes and assigns unicode_base_filename, but then ends up using base_filename in the last line here, assuming that it's a str but it is still bytes.

Thus, I decode()d it here before encoding it to latin-1.

Given this comment

# On Windows under Python 2, Flask wants a Unicode path. On Python 3, it # *always* wants a Unicode path. if os.name == "nt": item_path = util.syspath(item.path) else: item_path = util.py3_path(item.path)

I think it should be safe to simply call py3_path in all cases now, actually.

I agree that syspath is a tricky regarding typing — but this is in fact a change which I think has the potential to cause a big fallout if we get it wrong, and at this point I'm not fully convinced that it is correct.

Up to now, syspath is intended to return str on Windows, because all Windows filesystems and APIs use some Unicode encoding and the previous default mbcs encoding (which I think would have been used when passing bytes to OS APIs) would in fact have been incorrect, given that beets internally uses UTF-8 as stated in the code comment.

Due to https://peps.python.org/pep-0529/ your change might in fact be correct now (note that there's a legacy mode that uses MBCS, even though I'd say users who explicitly enable that are on their own).

I'd prefer not to change the behaviour for now (at least not in this PR, let's keep the code changes to a minimum when the primary goal is introducing typings). I'm not confident that our testsuite exercises non-ASCII filenames sufficiently to catch all encoding edge cases.

For now, another solution might be to define types conditionally (mypy understands the sys.platform switch IIRC):

if sys.platform == "win32": SysPath: TypeAlias = str else: SysPath: TypeAlias = bytes def syspath(path: bytes, prefix: bool = True) -> SysPath:

(we should probably also run mypy on Windows then).

Regarding the web plugin code: If I'm not completely mistaken, base_filename is guaranteed to be str when decode is invoked (irrespective of the syspath change, in fact), but only bytes can be decode()ed.

Actually, it looks like the whole section

base_filename = os.path.basename(item_path) # FIXME: Arguably, this should just use `displayable_path`: The latter # tries `_fsencoding()` first, but then falls back to `utf-8`, too. if isinstance(base_filename, bytes): try: unicode_base_filename = base_filename.decode("utf-8") except UnicodeError: unicode_base_filename = util.displayable_path(base_filename) else: unicode_base_filename = base_filename

is a no-op nowadays (always hitting the else arm. (It's not hit in tests at least: https://app.codecov.io/gh/beetbox/beets/blob/master/beetsplug%2Fweb%2F__init__.py but note that there, also the UnicodeError never happens, i.e. this is a place where we don't seem to actually test non-ASCII (actually, non-latin1) filenames.)

Actually, the decode is where the tests fail.

For now, another solution might be to define types conditionally (mypy understands the sys.platform switch IIRC):

if sys.platform == "win32": SysPath: TypeAlias = str else: SysPath: TypeAlias = bytes def syspath(path: bytes, prefix: bool = True) -> SysPath:

This is a nice solution - though the catch here is that for non-win32 systems the function actually returns AnyStr, and I'm not sure how to type it to avoid this issue

Regarding the failing tests, oops, I was meant to use os.fsdecode there

Actually, it looks like the whole section

base_filename = os.path.basename(item_path) # FIXME: Arguably, this should just use `displayable_path`: The latter # tries `_fsencoding()` first, but then falls back to `utf-8`, too. if isinstance(base_filename, bytes): try: unicode_base_filename = base_filename.decode("utf-8") except UnicodeError: unicode_base_filename = util.displayable_path(base_filename) else: unicode_base_filename = base_filename

is a no-op nowadays (always hitting the else arm. (It's not hit in tests at least: https://app.codecov.io/gh/beetbox/beets/blob/master/beetsplug%2Fweb%2F__init__.py

Note that the coverage is produced by running the tests in the ubuntu machine, so it doesn't show the logical branches that Windows hits 🙁

Regarding syspath, I reckon we could use the good ol' cast?

return cast(AnyStr, str_path)

wisp3rwind · 2024-05-05T19:53:13Z

I left a few more comments. The one thing that remains and where we seem to disagree is this:

Ideally, I'd imagine these functions should accept the same types that the original ones do. In this case it's os.walk, which accepts any path-like str or bytes and returns the same type that it received.

On the other hand, this PR was about fixing types rather than adjusting the implementation, so I focused on underlining what the current implement does and handles to make life easier to the person who will need to refactor this module later.

I'd be happy with making these functions to handle bytes only if

We explicitly handle bytes only - remove the code that casts str to bytes from the beginning of the implementation
Rename the functions to make the only-bytes-accepted-here requirement clear, since the equivalents in os (normpath, samefile, ...) handle str and bytes types. Therefore, to not confuse people used to the builtins, we'd want to name these functions accordingly I suppose.

I guess there are two philosophies here:

Add typings according to anything the code accepts currently. This significantly reduces how much the type checker can help us to verify that our current path handling is correct, but might be the quicker path to a clean mypy run.
Add typings according to what is vaguely defined to be the intended usage (e.g. always call syspath with bytes). There may be some code around that would fail to type check due to this. I'd argue that this is actually a good thing, and probably indicates that the calling code needs to be fixed. In this interpretation, more lenient implementation (i.e. no assert isinstance(path, bytes) is a form of defensive programming.

I don't think we should start adding such assert statements immediately: This is too likely to cause issues. If we want to go this way, I think the steps should be

Type the entire beets code base and fix any callers
Add a warning where the assert should ultimately be
Do a release, and mention in the release notes that plugin authors need to anticipate this change
swap warnings for asserts

I'm not sure what's the best way to proceed here. I would very much like to see 2. happen eventually.
A broader roadmap could be:

Type the entire beets code base, erring on the more lenient side.
Prepare an updated version of https://beets.io/blog/paths.html, this should probably live in this repository rather than a blog. This should have clear guidance on what to do when
- passing paths to os.path. and similar APIs
- reading/writing paths or path fragments from/to files (e.g. playlist files)
- passing paths to subprocesses
- storing paths in the database
- reading paths or path fragments from the configuration
- ...
Tighten up our typings accordingly and add warnings where the ultimately we would want assertions.
Do a release, and mention in the release notes that plugin authors need to anticipate this change
swap warnings for asserts
At this point, I think we'd be in a good position to revisit a conversion to pathlib. We should start by updating our design document from the second point.

wisp3rwind · 2024-05-05T19:54:35Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix most types in beets.util.init #5223

Fix most types in beets.util.init #5223

snejus commented May 4, 2024 •

edited

This comment was marked as resolved.

wisp3rwind left a comment

wisp3rwind May 4, 2024

snejus May 5, 2024

snejus May 5, 2024

wisp3rwind May 5, 2024

snejus May 5, 2024

snejus May 5, 2024

wisp3rwind May 4, 2024

snejus May 5, 2024

snejus May 5, 2024

wisp3rwind May 4, 2024

wisp3rwind May 4, 2024

wisp3rwind May 4, 2024

snejus May 5, 2024 •

edited

snejus May 5, 2024 •

edited

wisp3rwind May 5, 2024

wisp3rwind May 5, 2024

snejus May 6, 2024

snejus May 6, 2024

snejus May 6, 2024

wisp3rwind commented May 5, 2024

wisp3rwind commented May 5, 2024

snejus commented May 6, 2024



		def samefile(p1: bytes, p2: bytes) -> bool:
		def samefile(p1: AnyStr, p2: AnyStr) -> bool:

	def copy(path: AnyStr, dest: AnyStr, replace: bool = False):
	def copy(path: bytes, dest: bytes, replace: bool = False):

Fix most types in beets.util.__init__ #5223

Are you sure you want to change the base?

Fix most types in beets.util.__init__ #5223

Conversation

snejus commented May 4, 2024 • edited

Description

This comment was marked as resolved.

wisp3rwind left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

snejus May 5, 2024 • edited

Choose a reason for hiding this comment

snejus May 5, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wisp3rwind commented May 5, 2024

wisp3rwind commented May 5, 2024

snejus commented May 6, 2024

Fix most types in beets.util.init #5223

Fix most types in beets.util.init #5223

snejus commented May 4, 2024 •

edited

snejus May 5, 2024 •

edited

snejus May 5, 2024 •

edited