
Data files should probably be in a separate package. #251

Open
KOLANICH opened this issue Feb 1, 2023 · 11 comments

Comments

@KOLANICH

KOLANICH commented Feb 1, 2023

And also not in Python code, but in JSON/CSV/TSV.

@cvzi
Contributor

cvzi commented Feb 1, 2023

Why?

@KOLANICH
Author

KOLANICH commented Feb 1, 2023

To update them separately and fully automatically. For example with a CI pipeline run by "cron".

@cvzi
Contributor

cvzi commented Feb 2, 2023

Currently there is only one data file, emoji/unicode_codes/data_dict.py.
It is generated with the script in utils/get_codes_from_unicode_emoji_data_files.py, so it could be automated already.
However, I like to check the changed entries. The data is created manually by the Unicode contributors, so it's susceptible to errors or unexpected data.

Also, the Unicode data is only updated twice per year, so at the moment it's not that much work to do it manually.

@KOLANICH
Author

KOLANICH commented Feb 5, 2023

I have created a tool that merges data from different sources: https://github.com/KOLANICH-tools/emojiSlugMappingGen.py

@cvzi
Contributor

cvzi commented Feb 14, 2023

In light of pull request #252 adding Japanese & Korean, and the recently added languages Chinese & Indonesian, I think the more important issue with the single data file is memory consumption.
The import emoji statement currently loads the whole data file (as a big dictionary) into memory. I haven't measured it recently, but I expect it is about 1-2 MB of memory per language.
With every new language the memory consumption will grow. Probably most users only ever use one language though.
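The "about 1-2 MB per language" figure above is an estimate; a quick way to check it would be to measure allocations made during the import with the stdlib tracemalloc module. The helper below is a sketch, not part of the emoji package; it is demonstrated on "json" so it runs anywhere, but you would pass "emoji" on a machine where the package is installed.

```python
import importlib
import sys
import tracemalloc

def import_footprint(module_name: str) -> int:
    """Return bytes still allocated after a fresh import of module_name."""
    sys.modules.pop(module_name, None)   # force the module to be re-executed
    tracemalloc.start()
    importlib.import_module(module_name)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current

# e.g. import_footprint("emoji") to measure loading EMOJI_DATA;
# "json" here just demonstrates the helper on a stdlib module.
print(f"{import_footprint('json') / 1024:.0f} KiB")
```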

@KOLANICH
Author

@cvzi, https://github.com/KOLANICH-tools/emojifilt.cpp uses libintl .mo files generated by https://github.com/KOLANICH-tools/emojiSlugMappingGen.py. Currently it reads the whole file into memory: it relies on a library that parses .mo files with code compiled from a Kaitai Struct spec (libintl's public API is limited, so I found it easier to have my own lib), and KS currently cannot generate code that relies on memory mappings (we can use a memory mapping through a std::stream interface, but that is not enough for files larger than RAM; for those, raw structures need to be laid out over the mapped memory). In theory, the .mo format is suitable for use through a memory mapping.

@cvzi
Contributor

cvzi commented Feb 14, 2023

I think something like that would be overkill for this library. I don't think it needs to be that efficient.

My suggestion would be to ideally keep the current API of the library and still try to reduce memory usage a little bit.

People use the big dictionary directly at the moment (it is in the public API). I think it is also nice that you can just open the file in a text editor and look at the emoji. In that way it is kind of like JSON, a human can easily read or even edit it. It is simple to add custom slugs or even custom emoji.

Maybe the language data could be in separate files and could be loaded on request.
Like so:

import emoji

print(emoji.EMOJI_DATA['🐕']['fr']) # would throw an error because fr data has not been loaded

emoji.load_languages(['fr', 'zh', 'ja'])
print(emoji.EMOJI_DATA['🐕']['fr']) # now it would work

If all the languages were in separate files, it would probably reduce memory usage by about 50% for a user who only uses one language. It would still be a breaking change to the API though, since the first access in the example fails.

@cvzi
Contributor

cvzi commented Feb 14, 2023

Maybe we could use a class to emulate a dictionary with __getitem__(self, key). The __getitem__ could load the necessary language data from different files. That way there would be no breaking changes to the API.

Currently it looks like this:

EMOJI_DATA = {
    u'\U0001F415': { # 🐕
        'en' : ':dog:',
        'status' : fully_qualified,
        'E' : 0.7,
        'alias' : [':dog2:'],
        'variant': True,
        'de': ':hund:',
        'es': ':perro:',
        'fr': ':chien:',
        'ja': u':イヌ:',
        'ko': u':개:',
        'pt': ':cachorro:',
        'it': ':cane:',
        'fa': u':سگ:',
        'id': ':anjing:',
        'zh': u':狗:'
    },
    ...
}

Maybe the inner dictionaries could be objects instead and the language data could be in separate files:

EMOJI_DATA = {
    u'\U0001F415': ClassLikeADictionary({ # 🐕
        'en' : ':dog:',
        'status' : fully_qualified,
        'E' : 0.7,
        'alias' : [':dog2:'],
        'variant': True
    }),
    ...
}


class ClassLikeADictionary:
    def __getitem__(self, key):
        # Load language data if it is not loaded yet
        if languageIsNotLoaded(key):
            loadLanguageFromDataFile(key)
        return valueFor(key)
    ...

So you could still access it with EMOJI_DATA['🐕']['fr']
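The sketch above could be fleshed out along these lines. This is only an illustration, not the library's implementation: the class name, the shared cache, and the stand-in _LANGUAGE_FILES dict (which would really be per-language data files or modules) are all hypothetical, chosen so the example is self-contained and runnable.

```python
class LazyEmojiEntry:
    """Dict-like entry that loads language data on first access."""

    # Stand-in for the per-language data files (assumption, for illustration;
    # in the real package this would be a file read or a module import)
    _LANGUAGE_FILES = {
        "fr": {"\U0001F415": ":chien:"},
        "de": {"\U0001F415": ":hund:"},
    }
    _loaded = {}  # language -> {emoji: slug}, cache shared by all entries

    def __init__(self, emoji, base):
        self._emoji = emoji
        self._base = base  # 'en', 'status', 'E', 'alias', 'variant', ...

    def __getitem__(self, key):
        if key in self._base:
            return self._base[key]
        if key not in self._loaded:
            # Load language data if it is not loaded yet
            self._loaded[key] = self._LANGUAGE_FILES[key]
        return self._loaded[key][self._emoji]

EMOJI_DATA = {
    "\U0001F415": LazyEmojiEntry("\U0001F415", {"en": ":dog:", "E": 0.7}),
}

print(EMOJI_DATA["\U0001F415"]["fr"])  # loads 'fr' on first access
```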

@KOLANICH
Author

KOLANICH commented Feb 15, 2023

If you don't want to use binary gettext .mo (chosen mostly because it contains a precomputed on-disk hash table, though most implementations don't bother to actually use it), you have another option: a plain-text TSV file sorted lexicographically by the first column, plus some statistics to optimize lookup. It can be memory-mapped, and the mapping can then be navigated with a binary search over the file.
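The sorted-TSV idea can be sketched in a few lines: memory-map the file and binary-search by seeking into the mapping and re-aligning to line boundaries. The file layout assumed here (tab-separated "key<TAB>value" rows, sorted bytewise by key, one row per line) is an illustration of the technique, not an existing format of this library.

```python
import mmap

def tsv_lookup(path: str, key: str):
    """Binary-search a bytewise-sorted 'key<TAB>value' TSV via mmap."""
    want = key.encode("utf-8")
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm)
        while lo < hi:
            mid = (lo + hi) // 2
            # Re-align mid to the start of the line that contains it
            nl = mm.rfind(b"\n", lo, mid)
            start = nl + 1 if nl != -1 else lo
            end = mm.find(b"\n", start)
            if end == -1:
                end = len(mm)
            k, _, v = mm[start:end].partition(b"\t")
            if k == want:
                return v.decode("utf-8")
            if k < want:
                lo = end + 1    # continue in the lines after this one
            else:
                hi = start      # continue in the lines before this one
    return None
```

Since the mapping is paged in by the OS on demand, only the lines touched by the search ever occupy memory, which is the point being made about files larger than RAM.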

emoji.EMOJI_DATA['🐕']['fr']

I guess, since your implementation is going to spend memory on an open file for each language,

emoji.EMOJI_DATA['fr']['🐕']

would make more sense, because it makes opening a new file more explicit.

@cvzi
Contributor

cvzi commented Mar 15, 2023

Thinking about the idea of updating the data files in CI, I had another idea that could reduce memory usage for the average user:

We could release several flavors of this package instead of just one. For example:

  • emoji (full package as it is)
  • emoji_en (only English)
  • emoji_fr (only French)
  • ...

I think this could be easy with CI/GitHub Actions. The main thing to do would be to remove all languages from EMOJI_DATA except the one language, and then publish on PyPI.

@TahirJalilov
Collaborator

I think keeping languages in separate files (as it was before) but within the one "emoji" project is much better than creating separate projects on PyPI.
