not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs. #123

Open
peterbjorgensen opened this issue Feb 15, 2024 · 1 comment

@peterbjorgensen
Contributor

While running taggers on the hplt dataset, I encountered a problem that causes the not_alphanum_paragraph_v1 tagger to stall forever. To debug it, I created a minimal working example by copying some code from the TaggerProcessor. The debugging code is attached in this archive, along with some text that triggers the problem.
mwe.tar.gz

It looks like long sequences of emojis stall the tagger forever. Here are some timings for emoji text from the hplt dataset:

InputSpec(id='7', text='😠 😡', source='hplt1.2', version=None)
took 0.000039 seconds

InputSpec(id='4', text='😠 😡 😤 😋 😎 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000025 seconds

InputSpec(id='11', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 Anti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000021 seconds

InputSpec(id='5', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 64.204857 seconds

InputSpec(id='3', text='\nGæstebogs indlæg: *\n😄 😃 😊 😉 😍 😚 😗 😜 😛 😳 😁 😬 😌 😞 😢 😂 😭 😅 😓 😩 😮 😱 😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
... takes 'forever'

It seems to be a bug in the regex Python package. If I swap the regex package for the standard library re package, it takes only milliseconds again. I am not sure which feature of the regex package makes it necessary, but this bug makes me question whether something similar will happen with other regex queries.
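
For anyone who wants to reproduce the comparison, here is a minimal sketch of a timing harness that runs the same pattern through both packages. The pattern below is a hypothetical stand-in, not the tagger's actual expression, and the helper names `time_search` and `compare_backends` are made up for illustration. The regex package is treated as optional so the script still runs with only the standard library installed.

```python
import re
import time

try:
    import regex  # third-party package; optional for this sketch
except ImportError:
    regex = None

# Hypothetical stand-in pattern, NOT the expression used by the tagger.
# Replace it with the real pattern when reproducing the stall.
PATTERN = r"^[^a-zA-Z0-9]+$"


def time_search(module, pattern, text):
    """Compile `pattern` with `module` and time a single search over `text`."""
    compiled = module.compile(pattern)
    start = time.perf_counter()
    matched = compiled.search(text) is not None
    return matched, time.perf_counter() - start


def compare_backends(text, pattern=PATTERN):
    """Return {backend_name: (matched, seconds)} for every available backend."""
    backends = {"re": re}
    if regex is not None:
        backends["regex"] = regex
    return {name: time_search(mod, pattern, text) for name, mod in backends.items()}


if __name__ == "__main__":
    emojis = ["😠", "😡", "😤", "😋", "😎"]
    # Grow the emoji run to see how the match time scales with its length.
    for n in (2, 8, 32):
        text = " ".join(emojis[i % len(emojis)] for i in range(n))
        for name, (matched, seconds) in compare_backends(text).items():
            print(f"n={n:3d} backend={name:5s} matched={matched} took {seconds:.6f} seconds")
```

Growing the input in small steps, as in the loop above, makes it easier to see at which length the time starts to explode before committing to an input that never returns.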

We encountered the bug while trying to create an overview of the taggers:
centre-for-humanities-computing/danish-foundation-models#207 (comment)

@soldni
Member

soldni commented Feb 21, 2024

Yikes. Probably the easiest way to tackle this is to create two versions of the taggers: one using regex, the other using re.
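
A rough sketch of what a backend-selectable version could look like is below. This is not the project's implementation: the class name `NotAlphanumParagraphSketch`, its `backend` argument, and the simplified "no ASCII letter or digit in the paragraph" check are all assumptions made for illustration, and the real tagger's pattern and output format may differ.

```python
import re

try:
    import regex  # third-party package; only needed for backend="regex"
except ImportError:
    regex = None

# Simplified, assumed check: a paragraph counts as "not alphanumeric"
# when it contains no ASCII letter or digit.
HAS_ALPHANUM = r"[a-zA-Z0-9]"


class NotAlphanumParagraphSketch:
    """Hypothetical tagger with a selectable regular-expression backend."""

    def __init__(self, backend="re"):
        if backend == "re":
            module = re
        elif backend == "regex":
            if regex is None:
                raise ImportError("the 'regex' package is not installed")
            module = regex
        else:
            raise ValueError(f"unknown backend: {backend!r}")
        self.has_alphanum = module.compile(HAS_ALPHANUM)

    def tag(self, text):
        """Return (start, end) character spans of non-empty paragraphs
        that contain no alphanumeric character."""
        spans = []
        offset = 0
        for paragraph in text.split("\n"):
            end = offset + len(paragraph)
            if paragraph.strip() and not self.has_alphanum.search(paragraph):
                spans.append((offset, end))
            offset = end + 1  # skip the newline separator
        return spans


if __name__ == "__main__":
    sample = "😠 😡 😤\n\nAnti-Spam: *\nWarning:"
    tagger = NotAlphanumParagraphSketch(backend="re")
    for start, end in tagger.tag(sample):
        print(start, end, repr(sample[start:end]))
```

The check here is a single search for one character class per paragraph, so its cost grows linearly with paragraph length under either backend. Whether that simplification is acceptable for the real tagger is a judgment call for the maintainers.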

@soldni soldni self-assigned this Feb 21, 2024