Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code enhancements suggestions #292

Open
lovetox opened this issue Apr 29, 2024 · 3 comments
Open

Code enhancements suggestions #292

lovetox opened this issue Apr 29, 2024 · 3 comments

Comments

@lovetox
Copy link
Contributor

lovetox commented Apr 29, 2024

While adding type hints to the library i found various places where in my opinion improvments could be made.

I add them here to discuss because most of them will be breaking changes, and should only be considered once a major revision is planned.

1)

class EmojiMatch:
    def data_copy(self) -> Dict[str, Any]:
        """
        Returns a copy of the data from :data:`EMOJI_DATA` for this match
        with the additional keys ``match_start`` and ``match_end``.
        """
        if self.data:
            emj_data = self.data.copy()
            emj_data['match_start'] = self.start
            emj_data['match_end'] = self.end
            return emj_data
        else:
            return {
                'match_start': self.start,
                'match_end': self.end
            }

This method returns 2 completely different dicts, i don’t think its evident for the user what they will get here, and if the user wants to prevent bugs, even with type hints, after each call the user needs to test which kind of dict he got.

I got a branch where all dicts are typed, but this is still a pain here.

Also deepcopy should be considered here, because the emoji data contain itself mutable objects again.

2)

class Token(NamedTuple):
    chars: str
    value: Union[str, EmojiMatch]

The user wants to call token.value.emoji but he first needs to test what he got with isinstance(token.value, str)

3)

In data_dict.py: STATUS should be an IntEnum, i found no reason why this is a dict.

4)

def get_emoji_unicode_dict(lang: str) -> Dict[str, Any]:
    """Generate dict containing all fully-qualified and component emoji name for a language
    The dict is only generated once per language and then cached in _EMOJI_UNICODE[lang]"""

    if _EMOJI_UNICODE[lang] is None:
        _EMOJI_UNICODE[lang] = {data[lang]: emj for emj, data in EMOJI_DATA.items()
                                if lang in data and data['status'] <= STATUS['fully_qualified']}

    return _EMOJI_UNICODE[lang]


def get_aliases_unicode_dict() -> Dict[str, str]:
    """Generate dict containing all fully-qualified and component aliases
    The dict is only generated once and then cached in _ALIASES_UNICODE"""

    if not _ALIASES_UNICODE:
        _ALIASES_UNICODE.update(get_emoji_unicode_dict('en'))
        for emj, data in EMOJI_DATA.items():
            if 'alias' in data and data['status'] <= STATUS['fully_qualified']:
                for alias in data['alias']:
                    _ALIASES_UNICODE[alias] = emj

    return _ALIASES_UNICODE

a)
Instead of returning mutable dicts, we could return a MappingProxy Object which wraps the dict, this makes it (as much as you can in python) immutable. And the user cannot accidentally modify it.

b)
Instead of the constant _EMOJI_UNICODE and _ALIASES_UNICODE one could use @functools.lru_cache which would make the code much smaller.

5)

A lot of test code imports the private cache _EMOJI_UNICODE. While this is probably not really a problem it makes it so we have to add many type: ignore comments.

A simple helper method in the test module which iterates the languages and builds the whole cache would let us get rid of all the comments.

@cvzi
Copy link
Contributor

cvzi commented May 6, 2024

Sounds great. I would generally be open to any of those.

Moving the EMOJI_DATA to json files with a separate json file for each language ( #280 ) would be a breaking change, so that should be a major version change anyway.

  1. Yes. I think Python 2 had some problem with enums, I don't exactly remember why I chose dict.

  2. Yes!

I am not familiar with 4 a/b

Sorry, I don't have much time at the moment.

@cvzi
Copy link
Contributor

cvzi commented May 6, 2024

  1. what would be a solution? Two separate analyze() methods, one that returns only "EmojiToken" and one that returns "MixedToken"?

@cvzi
Copy link
Contributor

cvzi commented May 15, 2024

I'll look into 4 b: @functools.lru_cache and 5: removing the many type: ignore in tests.

That should be possible without breaking changes, since _EMOJI_UNICODE and _ALIASES_UNICODE are private.

@cvzi cvzi mentioned this issue May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants