Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid "str | bytes" where names or text are meant #29

Open
matangover opened this issue Jun 9, 2021 · 8 comments
Open

Avoid "str | bytes" where names or text are meant #29

matangover opened this issue Jun 9, 2021 · 8 comments

Comments

@matangover
Copy link

matangover commented Jun 9, 2021

Right now _Element.text is typed as Optional[Union[str, bytes]]. However according to the [lxml FAQ]:(https://lxml.de/3.6/FAQ.html):

In Python 3, lxml always returns Unicode strings for text and names, as does ElementTree.

Having to deal with bytes makes all code accessing _Element.text unnecessarily cumbersome.

@matangover
Copy link
Author

matangover commented Jun 9, 2021

This is also relevant to _Element.attrib, whose __getitem__ (among others) is typed as returning Union[str, bytes]. I'm not sure whether it can actually contain bytes values (couldn't find info in docs). However right now there is a discrepancy: _Element.get is typed as returning str. So one of those must be wrong.

@scoder
Copy link
Member

scoder commented Feb 4, 2022

The thing is, as long as users are still running their code in Py2, and writing code that works with Py2 and Py3, both need to be supported. Users who only target Py2 and not Py3 at all would probably also not use static type checkers, but legacy Py2/3 code should be supported.

That said, yes, the current declarations are inconsistent.

And the fact that legacy code needs to be supported doesn't mean that str wouldn't be sufficient. lxml does not use bytes as in "sequence of bytes" but only as byte encoded textual data, i.e. Py2 str, not Py3 bytes.

I think it's worth trying a switch to plain str and seeing if that really breaks anything.

PR welcome.

@lovetox
Copy link
Contributor

lovetox commented Feb 18, 2022

This seems to be changed in d95ca46

Issue can be closed in my opinion

@scoder scoder changed the title _Element.text should be changed to str on Python 3 Avoid "str | bytes" where names or text are meant Feb 19, 2022
@scoder
Copy link
Member

scoder commented Feb 19, 2022

I changed the title to reflect the general issue here.

@matangover
Copy link
Author

I think it's worth trying a switch to plain str and seeing if that really breaks anything.

It would break correctness on Python 2 because on Python 2 these attributes may return either str or unicode. Can we drop Python 2 compatibility of the typings? If yes, we can switch to str.

If you want to keep Python 2 compatibility, we can use Union[typing.Text, str] or use a version-dependent type alias.

@abelcheung
Copy link
Contributor

@matangover Agreed it's a good idea to drop py2 support for annotations. Personally I don't know anybody who's writing new code on py2 now, lest using annotation on py2 project.

However, the removal of bytes is mostly for method / function return type and object attributes. bytes as input argument is still in use, I'm not sure dropping that is good idea or not.

@matangover
Copy link
Author

Agreed, str should be used for places where "names or text" are returned / accepted.

@macro1
Copy link
Contributor

macro1 commented Jan 29, 2023

Per python/typing#208 an instance of Python 2 str is a subtype of unicode. Are we over thinking this, and would typing.Text just work for 2.7 type checking? It seems to me the only thing we need to do is indicate for Python 3 where bytes are acceptable as an input as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants