bug/element type for non-English languages #3044

Open
cm-halfspace opened this issue May 17, 2024 · 1 comment
Labels
bug Something isn't working


cm-halfspace commented May 17, 2024

Describe the bug
When I partition a Danish .docx file, some elements are assigned unexpected element types.

I think this happens because the languages list is not being set in _parse_paragraph_text_for_element_type, e.g. when it calls is_possible_narrative_text(text).

Looking at the definition of is_possible_narrative_text, a quick temporary fix would be to at least take language_checks into account on line 90, so that the condition becomes:

if "eng" in languages and language_checks and (sentence_count(text, 3) < 2) and (not contains_verb(text)):

To Reproduce

from unstructured.partition.text_type import is_possible_narrative_text
text = "Dette er et eksempel på en kort sætning."
is_possible_narrative_text(text)

This currently returns False. With the quick fix above, it would return True, as expected.
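
To make the effect of the proposed gating easier to see without installing anything, here is a small standalone sketch of just that one condition. It is not the library's code: sentence_count and contains_verb below are hypothetical, simplified stand-ins (the real helpers rely on an NLP tagger), the tiny English verb list is invented for illustration, and verb_check_rejects with its gated flag is a made-up wrapper name.

```python
import re

# Hypothetical stand-in data: a tiny English-only verb list, chosen to mimic
# an English tagger that does not recognise Danish verbs such as "er".
_ENGLISH_VERBS = {"is", "are", "was", "were", "be", "has", "have", "returns"}


def sentence_count(text, min_length=0):
    """Simplified stand-in: count sentences with at least min_length words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(1 for s in sentences if len(s.split()) >= min_length)


def contains_verb(text):
    """Simplified stand-in: look words up in the English-only verb list."""
    words = re.findall(r"\w+", text.lower())
    return any(word in _ENGLISH_VERBS for word in words)


def verb_check_rejects(text, languages=("eng",), language_checks=False, gated=False):
    """Return True if the verb heuristic would reject text as narrative.

    gated=False models the condition as described in the report;
    gated=True additionally requires language_checks, as in the quick fix.
    """
    rejects = (
        "eng" in languages
        and sentence_count(text, 3) < 2
        and not contains_verb(text)
    )
    if gated:
        rejects = rejects and language_checks
    return rejects


danish = "Dette er et eksempel på en kort sætning."
print(verb_check_rejects(danish))              # ungated condition
print(verb_check_rejects(danish, gated=True))  # gated on language_checks
```

In this sketch the ungated condition rejects the Danish sentence because the English-only verb lookup finds nothing, while the gated version lets it through when language_checks is left at False. Note that gating this way would also skip the verb check for English text unless language_checks is enabled, which is presumably why it is described as a temporary fix.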

cm-halfspace added the bug (Something isn't working) label on May 17, 2024
@MthwRobinson
Contributor

Hi @cm-halfspace - thanks for reporting this! We'll look at this as soon as we can, or happy to review if you want to open a PR with your suggested change.
