not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs. #123

peterbjorgensen · 2024-02-15T18:50:49Z

While running taggers on the hplt dataset, I encountered a problem that causes the not_alphanum_paragraph_v1 tagger to stall forever. To debug it, I created a minimal working example by copying some code from the TaggerProcessor. The debugging code is attached in this archive, together with some text that triggers the problem.
mwe.tar.gz

It looks like long sequences of emojis stall the tagger forever. Here are some timings for emoji text from the hplt dataset:

InputSpec(id='7', text='😠 😡', source='hplt1.2', version=None)
took 0.000039 seconds

InputSpec(id='4', text='😠 😡 😤 😋 😎 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000025 seconds

InputSpec(id='11', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 Anti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000021 seconds

InputSpec(id='5', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 64.204857 seconds

InputSpec(id='3', text='\nGæstebogs indlæg: *\n😄 😃 😊 😉 😍 😚 😗 😜 😛 😳 😁 😬 😌 😞 😢 😂 😭 😅 😓 😩 😮 😱 😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
... takes 'forever'

It seems to be a bug in the regex Python package. If I swap the regex package for the standard library re package, it takes only milliseconds again. I am not sure which feature of the regex package makes it necessary, but this bug makes me question whether something similar will happen with other regex queries.
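
For anyone who wants to reproduce timings like the ones above without the archive, here is a minimal sketch of a timing helper. The pattern `HAS_ALPHANUM` is a hypothetical stand-in, not the expression the tagger actually uses, and `time_search` is a name made up for this example. The helper only relies on the compiled pattern having a `.search` method, so a pattern compiled with either re or regex can be passed in.

```python
import re
import time

# Hypothetical stand-in pattern; the tagger's real expression is not shown in this issue.
HAS_ALPHANUM = re.compile(r"[a-zA-Z0-9]")


def time_search(pattern, text):
    """Run pattern.search(text) once and return (matched, elapsed_seconds)."""
    start = time.perf_counter()
    matched = pattern.search(text) is not None
    elapsed = time.perf_counter() - start
    return matched, elapsed


if __name__ == "__main__":
    emojis = ["😠", "😡", "😤", "😋", "😎", "😴", "😈", "😇"]
    # Grow the emoji run step by step to see where the runtime starts to climb.
    for n in (2, 4, 8):
        text = " ".join(emojis[:n])
        matched, elapsed = time_search(HAS_ALPHANUM, text)
        print(f"{n} emojis: matched={matched}, took {elapsed:.6f} seconds")
```

With the stand-in pattern this finishes instantly; the point is only to have one place where the pattern and the backend can be swapped while the inputs stay the same.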

We encountered the bug while trying to create an overview of the taggers:
centre-for-humanities-computing/danish-foundation-models#207 (comment)

soldni · 2024-02-21T23:32:11Z

Yikes. Probably the easiest way to tackle this is to create two versions of the taggers: one using regex, the other using re.
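
A rough sketch of what "two versions" could look like, assuming the regex backend is passed in as a module. The class name `AlphanumParagraphChecker`, its method, and the ASCII-only pattern are all hypothetical and are not the existing tagger code or its API.

```python
import re
from typing import List, Tuple


class AlphanumParagraphChecker:
    """Hypothetical checker that flags paragraphs without ASCII letters or digits."""

    def __init__(self, regex_module=re):
        # Any module exposing a compatible compile() can be passed, e.g. re or regex.
        self.has_alphanum = regex_module.compile(r"[a-zA-Z0-9]")

    def non_alphanum_spans(self, text: str) -> List[Tuple[int, int]]:
        """Return (start, end) character offsets of non-empty paragraphs with no alphanumerics."""
        spans = []
        offset = 0
        for paragraph in text.split("\n"):
            end = offset + len(paragraph)
            if paragraph.strip() and not self.has_alphanum.search(paragraph):
                spans.append((offset, end))
            offset = end + 1  # skip the newline separator
        return spans


checker = AlphanumParagraphChecker(regex_module=re)
spans = checker.non_alphanum_spans("😠 😡\nWarning: 9+3\n!!!")
```

The same class could be instantiated a second time with the regex module, so both variants share one implementation and only the backend differs.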

soldni self-assigned this Feb 21, 2024
