--ignore-links flag creates new composite words in output #389

Open
strizhechenko opened this issue Jun 20, 2022 · 0 comments
strizhechenko commented Jun 20, 2022

Hi! I'm running some natural language processing experiments and using html2text to turn web pages into text sources. My problem is that the text of adjacent links gets stuck together when I use the --ignore-links flag:

html2text --ignore-links <<< '<a href="/1">1</a><a href="/2">2</a><a href="/3">3 4</a><a href="/5">5</a>'
123 45

The example is deliberately simplified, of course, but it shows how the flag "creates" new composite words. I've patched it locally to add a space after each ignored link, as a workaround with minimal changes:

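# Workaround: emit a space when an ignored link closes, so adjacent link texts stay separate.
if tag == "a" and self.ignore_links and not start:
    self.o(" ")
# The check that follows, shown only as context for where the patch goes:
if tag == "a" and not self.ignore_links: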

and it produces what I need:

html2text --ignore-links <<< '<a href="/1">1</a><a href="/2">2</a><a href="/3">3 4</a><a href="/5">5</a>'
1 2 3 4 5
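
The effect of the workaround can be sketched without html2text itself. The snippet below is only an illustration built on the standard library's html.parser; LinkTextExtractor, extract, and the pad_links option are hypothetical names, not part of html2text. It assumes that collapsing runs of whitespace afterwards is acceptable, which keeps the extra space from producing doubled or trailing spaces.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Hypothetical stand-in for html2text: keeps text, drops link markup."""

    def __init__(self, pad_links=False):
        super().__init__()
        self.pad_links = pad_links  # hypothetical option, not an html2text flag
        self.parts = []

    def handle_data(self, data):
        # Collect the text found between tags.
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Mirror the workaround: emit a space when an ignored link closes.
        if tag == "a" and self.pad_links:
            self.parts.append(" ")

    def text(self):
        # Collapse whitespace runs so the padding never yields double spaces.
        return " ".join("".join(self.parts).split())


def extract(html, pad_links=False):
    parser = LinkTextExtractor(pad_links=pad_links)
    parser.feed(html)
    parser.close()
    return parser.text()


SAMPLE = '<a href="/1">1</a><a href="/2">2</a><a href="/3">3 4</a><a href="/5">5</a>'
print(extract(SAMPLE))                  # 123 45
print(extract(SAMPLE, pad_links=True))  # 1 2 3 4 5
```

Without padding the link texts merge exactly as in the report; with padding each link's text stays a separate token. A real fix in html2text might need to handle links inside words or before punctuation differently, which this sketch does not attempt.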

Should I open a pull request for this? The code above is only a workaround, but if it would be useful, I'd be happy to make it cleaner and add tests, a changelog entry, etc.

  • Version by html2text --version: 2020.1.16 (from PyPI, but the GitHub master version is affected too)
  • Python version by python3.8 --version: Python 3.8.0