
import takes ~30 seconds #280

Open
Snawe opened this issue Jan 14, 2024 · 21 comments

Comments

@Snawe

Snawe commented Jan 14, 2024

Hi!
I just upgraded my application to python 3.12.

Doing import emoji there takes around 30 seconds. Doing the same on Python 3.11 takes less than a second.

Any clue?

It is reproducible with a simple 2-line script like

import emoji
print(emoji.is_emoji("done"))

Using Windows atm with emoji 2.9.0

Something similar was already reported here: #274

@cvzi
Contributor

cvzi commented Jan 16, 2024

I have no idea at the moment.

If you have the time, could you check two things:

Does it also happen with just the single import line?

import emoji

And maybe test some older versions of the module, to see if it is a new change that introduced this problem. For example these versions:

2.6.0
2.0.0
1.5.0

Do pip install emoji==2.6.0 to install a specific version.
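
For example, a simple way to measure just the import time on each version (standard library only; this is just a suggestion, not part of the library) is:

import time

start = time.perf_counter()
import emoji
print(f"import emoji took {time.perf_counter() - start:.2f} s")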


If you are interested in trying it, I could create a test version for you that only has English (or whatever languages you need), as suggested in the other issue. That would presumably reduce overall memory usage and reduce start-up time.

@Snawe
Author

Snawe commented Jan 18, 2024

Does it also happen with just the single import line?

yes

regarding emoji versions:

2.9.0 => 34 sec
2.6.0 => 24 sec
2.0.0 => 13 sec
1.5.0 => 10 sec

But I just found out something really strange... Like I already mentioned in the beginning, this only happens on Python 3.12. With Python 3.11 this does not happen.
What I found out now is that it only happens if I run it via VS Code. So if I hit F5 in VS Code, I get the above times.
If I run it directly on the command line, I still see that the time has doubled (more emojis, I think), but the times are:

2.9.0 => ~0.06 sec
2.0.0 => ~0.03 sec

I still think that it has something to do with the emoji library, since no other library takes ~500 times longer, but yeah...

Just to complete the list, with python 3.11 via vs code:

2.9.0 => 0.03 sec
2.6.0 => 0.03 sec
2.0.0 => 0.03 sec
1.5.0 => 0.03 sec

From the command line:
2.9.0 => 0.03 sec
2.6.0 => 0.03 sec
2.0.0 => 0.03 sec
1.5.0 => 0.03 sec

@cvzi
Contributor

cvzi commented Jan 18, 2024

Wow, thanks for the details!
My first guess would be that VS Code attaches a debugger or something similar, and that somehow changed between Python 3.11 and 3.12.

I will look into it.

@cvzi
Contributor

cvzi commented Jan 19, 2024

I can reproduce it with Python 3.12 on Windows 10 when running with F5 in VS Code. If you run it without the debugger (Ctrl+F5), it doesn't happen.
(Also, I see no problems with Python 3.11.)

VS Code runs a file like this, when you press F5:
cmd /C "C:\Python312\python.exe c:\Users\cuzi\.vscode\extensions\ms-python.python-2023.22.1\pythonFiles\lib\python\debugpy\adapter/../..\debugpy\launcher 64938 -- C:\Users\cuzi\Desktop\mytest.py "

I have created an issue at debugpy, maybe they know why this happens: microsoft/debugpy#1496

@cvzi
Contributor

cvzi commented Jan 19, 2024

The problem seems to be the large number of lines in the file data_dict.py (currently ~87k lines).

The dictionary in the file can be compressed into a single line:

EMOJI_DATA = {
        '\U0001F947': {
        'en': ':1st_place_medal:',
        'status':
        ...

-->

EMOJI_DATA={'\U0001F947':{'en':':1st_place_medal:','status': ...

resulting in a file with just 46 lines. With the compressed file, debugging runs as fast as in Python 3.11.
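
For anyone curious, a minimal sketch of how such a one-line file could be generated (this is just an illustration, not the actual build script in the repository; the output file name is made up). repr() writes the dict literal on a single line, though it emits numeric status values instead of the named constants used in the original file:

import emoji

# Illustration only: dump the existing EMOJI_DATA dict as one line of Python source
with open("data_dict_one_line.py", "w", encoding="utf-8") as f:
    f.write("EMOJI_DATA = " + repr(emoji.EMOJI_DATA) + "\n")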

@TahirJalilov
Collaborator

@cvzi maybe it's time to think about separating the languages into different files, like it was before, what do you think?

@Snawe
Author

Snawe commented Jan 19, 2024

oh wow! Thank you very much for looking into it! :)

@cvzi
Contributor

cvzi commented Jan 19, 2024

@cvzi maybe it's time to think about separating the languages into different files, like it was before, what do you think?

I agree.

Not sure it will help enough regarding this problem though, because the dictionary would still be huge. It takes 4 minutes on my computer at the moment. Even if it cut the time to 10%, it would still take about 25 seconds, far too long.

Putting the dictionary into a single line is obviously really ugly. But it would be a quick fix.

I guess using a different file format, not Python code, could solve this problem with the debugging. For example, storing the dictionary in a JSON file and then loading the JSON file when the module is imported.
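
A rough sketch of what that could look like (the file name and layout are assumptions for illustration only): the module would parse a bundled JSON file once at import time instead of executing a huge dict literal:

import json
import os

# Hypothetical: emoji_data.json sits next to the module and holds the whole dict
_data_file = os.path.join(os.path.dirname(__file__), "emoji_data.json")

with open(_data_file, encoding="utf-8") as f:
    EMOJI_DATA = json.load(f)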

@lsmith77

I guess right now there is no work-around when using Python 3.12?

cvzi added a commit to cvzi/emoji that referenced this issue Jan 24, 2024
... to prevent debugging overhead in Python 3.12

carpedm20#280
@lsmith77

@cvzi based on 6fb1321, are you releasing a workaround?

@cvzi
Contributor

cvzi commented Jan 25, 2024

I guess so. I am not really happy with putting the dict into a single line, but there seems to be no other quick workaround. And VS Code is one of the most used editors at the moment, and there already seem to be about 2000 downloads per day of this library from Python 3.12 (according to PyPI stats).

I have deployed that commit on my own apps, and it seems to work, i.e. in a release environment, not debugging.

@lsmith77 any chance you could test whether it actually solves the problem with VS Code for you? It does solve it on my computer. You can install from my branch cvzi/one_line_dict like this: pip install https://github.com/cvzi/emoji/archive/one_line_dict.zip and then just create a Python file with import emoji and run it in VS Code with debugging, i.e. F5.

@cvzi
Contributor

cvzi commented Jan 25, 2024

BTW for reference:
I also tried to put each "sub-dict" (each emoji) on a single line instead of everything in one line:

EMOJI_DATA = {
    '\U0001F947': {'en': ':1st_place_medal:','status': fully_qualified,'E': 3,'de': ':goldmedaille:','es': ':medalla_de_oro:','fr': ':médaille_d’or:','ja': ':金メダル:','ko': ':금메달:','pt': ':medalha_de_ouro:','it': ':medaglia_d’oro:','fa': ':مدال_طلا:','id': ':medali_emas:','zh': ':金牌:','ru': ':золотая_медаль:','tr': ':birincilik_madalyası:','ar': ':ميدالية_مركز_أول:'},
    '\U0001F948': {'en': ':2nd_place_medal:','status': fully_qualified,'E': 3,'de': ':silbermedaille:','es': ':medalla_de_plata:','fr': ':médaille_d’argent:','ja': ':銀メダル:','ko': ':은메달:','pt': ':medalha_de_prata:','it': ':medaglia_d’argento:','fa': ':مدال_نقره:','id': ':medali_perak:','zh': ':银牌:','ru': ':серебряная_медаль:','tr': ':ikincilik_madalyası:','ar': ':ميدالية_مركز_ثان:'},
    '\U0001F949': {'en': ':3rd_place_medal:','status': fully_qualified,'E': 3,'de': ':bronzemedaille:','es': ':medalla_de_bronce:','fr': ':médaille_de_bronze:','ja': ':銅メダル:','ko': ':동메달:','pt': ':medalha_de_bronze:','it': ':medaglia_di_bronzo:','fa': ':مدال_برنز:','id': ':medali_perunggu:','zh': ':铜牌:','ru': ':бронзовая_медаль:','tr': ':üçüncülük_madalyası:','ar': ':ميدالية_مركز_ثالث:'},
    ...

That reduces the import time (as expected) but it still takes too long, about 15 seconds on my computer.

@lsmith77

Sorry, didn't get to it today; will try to do it tomorrow morning.

@lsmith77

I tried my best, but I am stuck in virtualenv hell here. pip needs to be updated to even install the package, and I am somehow unable to figure out how to get pip to both upgrade and then actually use 3.12.

Anyway, using pdm I got it to work: nice and fast (first execution) and then slow using the official package (second execution):

[image: screenshot of the two executions and their timings]

So overall I can confirm your workaround does what it is supposed to.

cvzi mentioned this issue Jan 30, 2024
@cvzi
Contributor

cvzi commented Jan 30, 2024

@lsmith77 Thanks for checking!

@lsmith77

thank you for this package and caring about reports such as this one!

@cvzi
Contributor

cvzi commented Feb 17, 2024

I did some performance tests to check the feasibility of JSON compared to the Python dictionary literal.
Below are the import times for different methods of loading the dictionary.
My conclusion is that JSON could be used, and it would be viable to split the languages into separate JSON files.

Method => import time (s)

Python dict, pretty-printed, human-readable (before this bugfix) => 0.16004
One-line Python dict (current master branch) => 0.15565
JSON file, pretty-printed, human-readable => 0.22966
Compressed JSON file, one-line, no spaces => 0.19430
Split JSON files, pretty-printed, load English and metadata from one file, all other languages removed => 0.15470
Split JSON files, first load English and metadata (as above), then load ONE other language from another JSON file => 0.19083

Command to test this:

perf stat -r 10 -B python -c "import emoji; emoji.emojize(':lion:')"

where 10 is the number of repeats (it should be much higher for good average results)

cvzi added a commit to cvzi/emoji that referenced this issue Mar 23, 2024
@cvzi
Contributor

cvzi commented Apr 6, 2024

I am going to continue in this thread with this JSON idea; please unsubscribe if you're not interested. Any feedback or suggestions are appreciated though :)

I am thinking about making a main JSON file that has the metadata and English/aliases and a file for each language.

Main file:

{
  "🗺️": {
    "E": 0.7,
    "en": ":world_map:",
    "status": 2,
    "variant": true
  },
  "🗻": {
    "E": 0.6,
    "en": ":mount_fuji:",
    "status": 2
  },
  "🗼": {
    "E": 0.6,
    "alias": [
      ":tokyo_tower:"
    ],
    "en": ":Tokyo_tower:",
    "status": 2
  },
  ...
}

A language file would look like this, e.g. Spanish:

{
  "🗺️": ":mapa_mundial:",
  "🗻": ":monte_fuji:",
  "🗼": ":torre_de_tokio:",
  ...
}

The main file would be loaded when importing the module. The language file would only be loaded when the language is used with d/emojize(str, language='es'). It would be loaded into EMOJI_DATA and the EMOJI_DATA dict would have the same structure as before.

It does mean that the EMOJI_DATA dict is incomplete after importing the module, because all the languages are missing.

This reduces memory usage by roughly half if only one language is used.
Import time with only English is slightly faster (about 10%).
Import time with one other-than-English language is slightly slower (about 10%).
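
A minimal sketch of the on-demand loading described above (the function name, file names, and layout are assumptions, not the final API):

import json
import os

_here = os.path.dirname(__file__)

# Main file (metadata + English/aliases) is loaded once at import time
with open(os.path.join(_here, "emoji.json"), encoding="utf-8") as f:
    EMOJI_DATA = json.load(f)

_loaded_languages = {"en"}

def load_language(lang):
    """Merge the translations for one language into EMOJI_DATA (no-op if already loaded)."""
    if lang in _loaded_languages:
        return
    with open(os.path.join(_here, lang + ".json"), encoding="utf-8") as f:
        translations = json.load(f)  # e.g. {"🗺️": ":mapa_mundial:", ...}
    for emj, name in translations.items():
        EMOJI_DATA.setdefault(emj, {})[lang] = name
    _loaded_languages.add(lang)

emojize/demojize with language='es' would then call something like load_language('es') before doing any lookups.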

Advantages:

  • more languages can be added without increasing memory usage
  • also more (new) metadata could be added to separate JSON files as well
  • JSON data can be read by other programming languages
  • This debugging bug would be solved

Disadvantages:

  • You can't directly use e.g. EMOJI_DATA['🗺️']['fr'] anymore. To mitigate this, there could be a new function to load one or all languages from the JSON files.
  • JSON is less human-readable than a Python dict; in particular, comments aren't possible

So this would be a breaking change, but I don't think it would affect many people. I searched on GitHub and couldn't find a public repository that directly uses something like EMOJI_DATA['🗺️']['fr'].
(There are a few repositories that use English, i.e. EMOJI_DATA['🗺️']['en'], but that would still work.)

@lovetox
Contributor

lovetox commented Apr 14, 2024

I think this makes sense, and I see no other option; there are simply a lot of languages and most applications need exactly one. Loading them on demand seems the right decision.

EDIT: Question about your performance test methodology: does your command not also include the start-up of the whole Python interpreter? This would only be relevant for someone who uses this lib standalone.

For most projects this will just be one of many dependencies.

I did a quick test with 2.11.0:

def load():
    import emoji
    emoji.emojize(':lion:')

if __name__ == '__main__':
    import timeit
    res = timeit.timeit("load()",
                        setup="from __main__ import load",
                        number=1)
    print(res)

and this gives me a load time for just the lib of around 0.030 s on my machine.

As the import statement is only executed once, even on repeats, raising the number of repetitions does not yield interesting data.

@cvzi
Contributor

cvzi commented Apr 14, 2024

Yes my times include the loading of the Python interpreter. It doesn't really matter, because I am only interested in the relative changes.
Measuring a single import is not really robust, for example because some other process could be using the CPU at the same time as the test.

It is possible to load the module multiple times in Python, but it is a bit hacky:

import sys

def load():
    import emoji
    emoji.emojize(':lion:')

    # remove the emoji modules from the loaded modules
    for name in [name for name in sys.modules if "emoji" in name]:
        del sys.modules[name]

if __name__ == '__main__':
    import timeit
    res = timeit.timeit("load()",
                        setup="from __main__ import load",
                        number=100)
    print(res)

@cvzi
Contributor

cvzi commented May 17, 2024

FYI: compressing the dict into a single line has caused coverage.py to break on Python 3.12:
nedbat/coveragepy#1785
