
Data files should probably be in a separate package. #251

Open
KOLANICH opened this issue Feb 1, 2023 · 11 comments

Comments

@KOLANICH

KOLANICH commented Feb 1, 2023

And also not in Python code, but in JSON/CSV/TSV.

@cvzi
Contributor

cvzi commented Feb 1, 2023

Why?

@KOLANICH
Author

KOLANICH commented Feb 1, 2023

To update them separately and fully automatically. For example with a CI pipeline run by "cron".

@cvzi
Contributor

cvzi commented Feb 2, 2023

Currently there is only one data file, emoji/unicode_codes/data_dict.py.
It is generated with the script in utils/get_codes_from_unicode_emoji_data_files.py, so it could be automated already.
However, I like to check the changed entries. The data is created manually by the Unicode contributors, so it's susceptible to errors or unexpected data.

Also, the Unicode data is only updated twice per year, so at the moment it's not that much work to do it manually.

@KOLANICH
Author

KOLANICH commented Feb 5, 2023

I have created a tool that merges data from different sources: https://github.com/KOLANICH-tools/emojiSlugMappingGen.py

@cvzi
Contributor

cvzi commented Feb 14, 2023

In light of pull request #252 adding Japanese & Korean, and the recently added languages Chinese & Indonesian, I think the more important issue with the single data file is memory consumption.
The import emoji statement currently loads the whole data file (as a big dictionary) into memory. I haven't measured it recently, but I expect it is about 1-2 MB of memory per language.
With every new language the memory consumption will grow. Probably most users only ever use one language though.
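The "about 1-2 MB per language" figure above is an estimate; a quick way to check it would be to measure allocations made during the import with the stdlib tracemalloc module. The helper below is a sketch, not part of the emoji package; it is demonstrated on "json" so it runs anywhere, but you would pass "emoji" on a machine where the package is installed.

```python
import importlib
import sys
import tracemalloc

def import_footprint(module_name: str) -> int:
    """Return bytes still allocated after a fresh import of module_name."""
    sys.modules.pop(module_name, None)   # force the module to be re-executed
    tracemalloc.start()
    importlib.import_module(module_name)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current

# e.g. import_footprint("emoji") to measure loading EMOJI_DATA;
# "json" here just demonstrates the helper on a stdlib module.
print(f"{import_footprint('json') / 1024:.0f} KiB")
```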

@KOLANICH
Author

@cvzi, https://github.com/KOLANICH-tools/emojifilt.cpp uses libintl .mo files generated by https://github.com/KOLANICH-tools/emojiSlugMappingGen.py. Currently it reads the whole file into memory: it relies on a library that parses .mo files with code compiled from a Kaitai Struct spec (libintl's public API is limited, so I found it easier to have my own lib), and KS currently cannot generate code that relies on memory mappings (we can use a memory mapping through a std::stream interface, but that is not enough for files larger than RAM; for those, raw structures need to be laid out over the mapped memory). In theory, the .mo format is suitable for use through a memory mapping.

@cvzi
Contributor

cvzi commented Feb 14, 2023

I think something like that would be overkill for this library. I don't think it needs to be that efficient.

My suggestion would be to ideally keep the current API of the library and still try to reduce memory usage a little bit.

People use the big dictionary directly at the moment (it is in the public API). I think it is also nice that you can just open the file in a text editor and look at the emoji. In that way it is kind of like JSON, a human can easily read or even edit it. It is simple to add custom slugs or even custom emoji.

Maybe the language data could be in separate files and could be loaded on request.
Like so:

import emoji

print(emoji.EMOJI_DATA['🐕']['fr']) # would throw an error because fr data has not been loaded

emoji.load_languages(['fr', 'zh', 'ja'])
print(emoji.EMOJI_DATA['🐕']['fr']) # now it would work

If all the languages were in separate files, it would probably reduce memory usage by about 50% for a user who only uses one language. It would still be a breaking change to the API though, since the first access in the example fails.

@cvzi
Contributor

cvzi commented Feb 14, 2023

Maybe we could use a class to emulate a dictionary with __getitem__(self, key). The __getitem__ could load the necessary language data from different files. That way there would be no breaking changes to the API.

Currently it looks like this:

EMOJI_DATA = {
    u'\U0001F415': { # 🐕
        'en' : ':dog:',
        'status' : fully_qualified,
        'E' : 0.7,
        'alias' : [':dog2:'],
        'variant': True,
        'de': ':hund:',
        'es': ':perro:',
        'fr': ':chien:',
        'ja': u':イヌ:',
        'ko': u':개:',
        'pt': ':cachorro:',
        'it': ':cane:',
        'fa': u':سگ:',
        'id': ':anjing:',
        'zh': u':狗:'
    },
    ...
}

Maybe the inner dictionaries could be objects instead and the language data could be in separate files:

EMOJI_DATA = {
    u'\U0001F415': ClassLikeADictionary({ # 🐕
        'en' : ':dog:',
        'status' : fully_qualified,
        'E' : 0.7,
        'alias' : [':dog2:'],
        'variant': True
    }),
    ...
}


class ClassLikeADictionary:
    def __getitem__(self, key):
        # Load language data if it is not loaded yet
        if languageIsNotLoaded(key):
            loadLanguageFromDataFile(key)
        return valueFor(key)
    ...

So you could still access it with EMOJI_DATA['🐕']['fr']
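The sketch above could be fleshed out along these lines. This is only an illustration, not the library's implementation: the class name, the shared cache, and the stand-in _LANGUAGE_FILES dict (which would really be per-language data files or modules) are all hypothetical, chosen so the example is self-contained and runnable.

```python
class LazyEmojiEntry:
    """Dict-like entry that loads language data on first access."""

    # Stand-in for the per-language data files (assumption, for illustration;
    # in the real package this would be a file read or a module import)
    _LANGUAGE_FILES = {
        "fr": {"\U0001F415": ":chien:"},
        "de": {"\U0001F415": ":hund:"},
    }
    _loaded = {}  # language -> {emoji: slug}, cache shared by all entries

    def __init__(self, emoji, base):
        self._emoji = emoji
        self._base = base  # 'en', 'status', 'E', 'alias', 'variant', ...

    def __getitem__(self, key):
        if key in self._base:
            return self._base[key]
        if key not in self._loaded:
            # Load language data if it is not loaded yet
            self._loaded[key] = self._LANGUAGE_FILES[key]
        return self._loaded[key][self._emoji]

EMOJI_DATA = {
    "\U0001F415": LazyEmojiEntry("\U0001F415", {"en": ":dog:", "E": 0.7}),
}

print(EMOJI_DATA["\U0001F415"]["fr"])  # loads 'fr' on first access
```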

@KOLANICH
Author

KOLANICH commented Feb 15, 2023

If you don't want to use binary gettext .mo (chosen mostly because it contains a precomputed on-disk hash table, though most implementations don't bother to actually use it), you have another option: a plain-text TSV file sorted lexicographically by the first column, plus some statistics to optimize lookup. It can be memory-mapped, and the mapping can then be navigated with a binary search over the file.
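The sorted-TSV idea can be sketched in a few lines: memory-map the file and binary-search by seeking into the mapping and re-aligning to line boundaries. The file layout assumed here (tab-separated "key<TAB>value" rows, sorted bytewise by key, one row per line) is an illustration of the technique, not an existing format of this library.

```python
import mmap

def tsv_lookup(path: str, key: str):
    """Binary-search a bytewise-sorted 'key<TAB>value' TSV via mmap."""
    want = key.encode("utf-8")
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm)
        while lo < hi:
            mid = (lo + hi) // 2
            # Re-align mid to the start of the line that contains it
            nl = mm.rfind(b"\n", lo, mid)
            start = nl + 1 if nl != -1 else lo
            end = mm.find(b"\n", start)
            if end == -1:
                end = len(mm)
            k, _, v = mm[start:end].partition(b"\t")
            if k == want:
                return v.decode("utf-8")
            if k < want:
                lo = end + 1    # continue in the lines after this one
            else:
                hi = start      # continue in the lines before this one
    return None
```

Since the mapping is paged in by the OS on demand, only the lines touched by the search ever occupy memory, which is the point being made about files larger than RAM.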

emoji.EMOJI_DATA['🐕']['fr']

I guess, since your implementation is going to spend memory on an open file for each language,

emoji.EMOJI_DATA['fr']['🐕']

would make more sense, because it makes opening a new file more explicit.

@cvzi
Contributor

cvzi commented Mar 15, 2023

Thinking about the idea of updating the data files in CI, I had another idea that could reduce memory usage for the average user:

We could release several flavors of this package instead of just one. For example:

  • emoji (full package as it is)
  • emoji_en (only English)
  • emoji_fr (only French)
  • ...

I think this could be easy with CI/GitHub Actions. The main thing to do would be to remove all languages from EMOJI_DATA except the one language, and then publish on PyPI.

@TahirJalilov
Collaborator

I think keeping languages in separate files (as it was before) but within the one "emoji" project is much better than creating separate projects on PyPI.
