-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix most types in beets.util.__init__ #5223
base: master
Are you sure you want to change the base?
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
adac926
to
6c0b63f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added a bunch of comments: Many of them are concerned with the tension between what types of arguments functions would accept vs. what we actually want to call them with (i.e. often it should only by bytes
that are accepted, but the code does not obviously fail for str
).
In other cases, it's a bit unclear why a specific change was made; it would be great if you could elaborate.
@@ -188,28 +189,28 @@ def ancestry(path: bytes) -> List[str]: | |||
|
|||
|
|||
def sorted_walk( | |||
path: AnyStr, | |||
ignore: Sequence = (), | |||
path: BytesOrStr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
path: BytesOrStr, | |
path: bytes, |
While there's the bytestring_path
call below, I think it is more of a safety measure to avoid crashes in case of programming errors. Effectively, all paths should internally be represented as bytes
and as such I think sorted_walk
should only accept bytes
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering why should paths be represented as bytes
internally? This requirement may be less relevant now that python 2 is dropped and not supported anymore, see https://beets.io/blog/paths.html
This way, Python 3’s Unicode str can represent arbitrary bytes in filenames. (The first commit to beets happened a bit before Python 3.0 was released, so perhaps the project can be forgiven for not adopting this approach in the first place.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, I'd imagine these functions should accept the same types that the original ones do. In this case it's os.walk
, which accepts any path-like str
or bytes
and returns the same type that it received.
On the other hand, this PR was about fixing types rather than adjusting the implementation, so I focused on underlining what the current implement does and handles to make life easier to the person who will need to refactor this module later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering why should paths be represented as bytes internally? This requirement may be less relevant now that python 2 is dropped and not supported anymore, see https://beets.io/blog/paths.html
Yes, that could be changed in principle, but that's in fact not completely straightforward: We still need conversions, at a minimum when dealing with the database (which currently stores bytes, not text. Would SQLite even support storing text with surrogate escapes?).
At that point, we might want to swtich to pathlib, however (in which case it's still not completely obvious what we should store to the database).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQLite stores unicode fine these days (we may want to include its icu extension to cover users that have some ancient versions of SQLite).
For example, I locally store bandcamp releases in this db:
CREATE TABLE release (
source TEXT,
label TEXT,
name TEXT,
album TEXT,
artist TEXT,
title TEXT,
comment TEXT,
url TEXT,
email_id TEXT,
added TEXT,
done INTEGER DEFAULT 0,
done_date TEXT,
bought INTEGER DEFAULT 0,
next INTEGER DEFAULT 0,
invalid INTEGER DEFAULT 0,
PRIMARY KEY(email_id, added, source, name) ON CONFLICT IGNORE
);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand I'd strongly suggest to keep the database as it is, at least for now, since this means opening a can of worms of migrating existing user databases.
What I meant by 'internally', I guess, is the code handling the paths before it hits the database (I assume that regardless of whether we supply a str
or bytes
, db code appropriately converts it to bytes
for storage).
On the other hand paths being stored as bytes
in db doesn't mean that we need to handle them as such across the codebase. I agree that pathlib
is indeed the right way forward.
path: str, | ||
root: Optional[Bytes_or_String] = None, | ||
root: Optional[AnyStr] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both path
and root
should be bytes
here, for the same reason as mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comments above: in summary, I think we'd rather use types that reflect the current implementation. In addition to the two reasons I've given above, I see another risk: if we were to restrict the accepted type to bytes
here, in a fully-typed codebase we'd expect mypy
check to fail anywhere where a str
is being fed into this function.
Right now, much the codebase is not typed, so the check passes silently. Whoever ends up adding types to the part of the codebase that sends str
into this function will get slightly bamboozled: they will need to read this code and understand what it does and they will have to either
- Adjust the implementation of that code
- Relax the type here
Both of these actions will be out of scope of what they're doing. Remember, I had to do this in playlist.py
when I ended up relaxing the type in syspath
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happy with making these functions to handle bytes
only if
- We explicitly handle
bytes
only - remove the code that castsstr
tobytes
from the beginning of the implementation - Rename the functions to make the only-bytes-accepted-here requirement clear, since the equivalents in
os
(normpath
,samefile
, ...) handlestr
andbytes
types. Therefore, to not confuse people used to the builtins, we'd want to name these functions accordingly I suppose.
|
||
|
||
def samefile(p1: bytes, p2: bytes) -> bool: | ||
def samefile(p1: AnyStr, p2: AnyStr) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def samefile(p1: AnyStr, p2: AnyStr) -> bool: | |
def samefile(p1: bytes, p2: bytes) -> bool: |
See reasoning above.
try: | ||
os.remove(path) | ||
except OSError as exc: | ||
raise FilesystemError(exc, "delete", (path,), traceback.format_exc()) | ||
|
||
|
||
def copy(path: bytes, dest: bytes, replace: bool = False): | ||
def copy(path: AnyStr, dest: AnyStr, replace: bool = False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def copy(path: AnyStr, dest: AnyStr, replace: bool = False): | |
def copy(path: bytes, dest: bytes, replace: bool = False): |
beetsplug/web/__init__.py
Outdated
@@ -327,7 +327,7 @@ def item_file(item_id): | |||
|
|||
try: | |||
# Imitate http.server behaviour | |||
base_filename.encode("latin-1", "strict") | |||
base_filename.decode().encode("latin-1", "strict") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain this change? I think base_filename
should be str
here, but mypy might not be able to infer this. In any case, I don't understand what this code does at all, it might need a proper cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has to do with the change I made in syspath
You can see that it now always returns the same type that was fed into it.
return str_path if isinstance(path, str) else str_path.encode()
It previously returned undeterministic BytesOrStr
: for non-windows systems it would return the same type that was fed in, but on Windows it would always return str
. This caused a fair bit of typing issues down the line.
Now the code in web/__init__.py
is a bit of a mess
# On Windows under Python 2, Flask wants a Unicode path. On Python 3, it
# *always* wants a Unicode path.
if os.name == "nt":
item_path = util.syspath(item.path)
else:
item_path = util.py3_path(item.path)
base_filename = os.path.basename(item_path)
# FIXME: Arguably, this should just use `displayable_path`: The latter
# tries `_fsencoding()` first, but then falls back to `utf-8`, too.
if isinstance(base_filename, bytes):
try:
unicode_base_filename = base_filename.decode("utf-8")
except UnicodeError:
unicode_base_filename = util.displayable_path(base_filename)
else:
unicode_base_filename = base_filename
try:
# Imitate http.server behaviour
base_filename.decode().encode("latin-1", "strict")
You can see it calls syspath
on Windows, which now returns bytes
instead of str
. It seemingly expects base_filename
to have type bytes
and assigns unicode_base_filename
, but then ends up using base_filename
in the last line here, assuming that it's a str
but it is still bytes
.
Thus, I decode()
d it here before encoding it to latin-1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this comment
# On Windows under Python 2, Flask wants a Unicode path. On Python 3, it
# *always* wants a Unicode path.
if os.name == "nt":
item_path = util.syspath(item.path)
else:
item_path = util.py3_path(item.path)
I think it should be safe to simply call py3_path
in all cases now, actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that syspath
is a tricky regarding typing — but this is in fact a change which I think has the potential to cause a big fallout if we get it wrong, and at this point I'm not fully convinced that it is correct.
Up to now, syspath
is intended to return str
on Windows, because all Windows filesystems and APIs use some Unicode encoding and the previous default mbcs
encoding (which I think would have been used when passing bytes
to OS APIs) would in fact have been incorrect, given that beets internally uses UTF-8 as stated in the code comment.
Due to https://peps.python.org/pep-0529/ your change might in fact be correct now (note that there's a legacy mode that uses MBCS, even though I'd say users who explicitly enable that are on their own).
I'd prefer not to change the behaviour for now (at least not in this PR, let's keep the code changes to a minimum when the primary goal is introducing typings). I'm not confident that our testsuite exercises non-ASCII filenames sufficiently to catch all encoding edge cases.
For now, another solution might be to define types conditionally (mypy understands the sys.platform
switch IIRC):
if sys.platform == "win32":
SysPath: TypeAlias = str
else:
SysPath: TypeAlias = bytes
def syspath(path: bytes, prefix: bool = True) -> SysPath:
(we should probably also run mypy on Windows then).
Regarding the web
plugin code: If I'm not completely mistaken, base_filename
is guaranteed to be str
when decode
is invoked (irrespective of the syspath
change, in fact), but only bytes
can be decode()
ed.
Actually, it looks like the whole section
base_filename = os.path.basename(item_path)
# FIXME: Arguably, this should just use `displayable_path`: The latter
# tries `_fsencoding()` first, but then falls back to `utf-8`, too.
if isinstance(base_filename, bytes):
try:
unicode_base_filename = base_filename.decode("utf-8")
except UnicodeError:
unicode_base_filename = util.displayable_path(base_filename)
else:
unicode_base_filename = base_filename
is a no-op nowadays (always hitting the else
arm. (It's not hit in tests at least: https://app.codecov.io/gh/beetbox/beets/blob/master/beetsplug%2Fweb%2F__init__.py but note that there, also the UnicodeError
never happens, i.e. this is a place where we don't seem to actually test non-ASCII (actually, non-latin1) filenames.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the decode
is where the tests fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, another solution might be to define types conditionally (mypy understands the
sys.platform
switch IIRC):if sys.platform == "win32": SysPath: TypeAlias = str else: SysPath: TypeAlias = bytes def syspath(path: bytes, prefix: bool = True) -> SysPath:
This is a nice solution - though the catch here is that for non-win32 systems the function actually returns AnyStr
, and I'm not sure how to type it to avoid this issue
Regarding the failing tests, oops, I was meant to use os.fsdecode
there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it looks like the whole section
base_filename = os.path.basename(item_path) # FIXME: Arguably, this should just use `displayable_path`: The latter # tries `_fsencoding()` first, but then falls back to `utf-8`, too. if isinstance(base_filename, bytes): try: unicode_base_filename = base_filename.decode("utf-8") except UnicodeError: unicode_base_filename = util.displayable_path(base_filename) else: unicode_base_filename = base_filenameis a no-op nowadays (always hitting the
else
arm. (It's not hit in tests at least: https://app.codecov.io/gh/beetbox/beets/blob/master/beetsplug%2Fweb%2F__init__.py
Note that the coverage is produced by running the tests in the ubuntu
machine, so it doesn't show the logical branches that Windows hits 🙁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding syspath
, I reckon we could use the good ol' cast
?
return cast(AnyStr, str_path)
13dea13
to
f0a9ef5
Compare
I left a few more comments. The one thing that remains and where we seem to disagree is this:
I guess there are two philosophies here:
I don't think we should start adding such assert statements immediately: This is too likely to cause issues. If we want to go this way, I think the steps should be
I'm not sure what's the best way to proceed here. I would very much like to see 2. happen eventually.
|
See also #1409 |
49473e0
to
79cf3fd
Compare
79cf3fd
to
6a1f684
Compare
8f63779
to
22e2a3c
Compare
Weird stuff: as you've seen my change made that single test fail on Windows, see this build 22e2a3c for example
Now in the next commit abd8447 I reverted my change and I got bombarded with this
wtf-white-cat.mp4 |
Description
This PR is Part 1 of the work #5215 that fixes typing issues in
beets.util.__init__
module.It addresses simple-to-fix / most of the issues in this module.