text-extraction thwarted by inline images #18

Open
catagras opened this issue May 23, 2018 · 3 comments

@catagras

I played a bit with the text-extraction sample and found that if an inline image is encountered (a BI / ID / EI construct), the rest of the page is skipped. Most likely this happens because the image data that follows ID is parsed as PDF tokens rather than as raw stream data.

Any hint on how I might skip inline images?

Thanks!

@galkahana
Owner

yeah...this will take improving the tokenizer. wanna help?
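
For anyone landing here, a rough sketch of what skipping the BI / ID / EI construct could look like as a pre-pass over the raw content stream bytes. This is not the project's tokenizer and not code from this repo; it is standalone Python, and `strip_inline_images` and its helpers are hypothetical names made up for illustration. It assumes `ID` and `EI` are both delimited by whitespace, and it treats the first whitespace-delimited `EI` after `ID` as the end of the image. That is only a heuristic, since binary image data can contain those bytes by chance.

```python
import re

WHITESPACE = b"\x00\t\n\x0c\r "   # PDF whitespace bytes
DELIMITERS = b"()<>[]{}/%"        # PDF delimiter bytes

# Assumption: ID and EI are surrounded by whitespace (heuristic, see lead-in).
_ID = re.compile(rb"[\x00\t\n\x0c\r ]ID[\x00\t\n\x0c\r ]")
_EI = re.compile(rb"[\x00\t\n\x0c\r ]EI(?=[\x00\t\n\x0c\r ]|\Z)")


def _end_of_literal_string(data: bytes, start: int) -> int:
    """Return the index just past the literal string opening at data[start]."""
    depth, i = 0, start
    while i < len(data):
        c = data[i]
        if c == 0x5C:        # backslash escapes the next byte
            i += 2
            continue
        if c == 0x28:        # "("
            depth += 1
        elif c == 0x29:      # ")"
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return len(data)


def _end_of_inline_image(data: bytes, pos: int) -> int:
    """Return the index just past EI for an image whose BI ends at pos."""
    id_match = _ID.search(data, pos)
    if id_match is None:
        return len(data)
    # Start one byte early so an empty image ("ID EI") still matches.
    ei_match = _EI.search(data, id_match.end() - 1)
    return len(data) if ei_match is None else ei_match.end()


def strip_inline_images(content: bytes) -> bytes:
    """Copy a content stream, dropping every BI ... ID ... EI span."""
    out = bytearray()
    i, n = 0, len(content)
    while i < n:
        c = content[i]
        if c == 0x28:        # literal string: copy verbatim, never scan inside
            j = _end_of_literal_string(content, i)
            out += content[i:j]
            i = j
        elif c == 0x25:      # "%" comment: copy up to the end of the line
            j = i
            while j < n and content[j] not in b"\r\n":
                j += 1
            out += content[i:j]
            i = j
        elif c in WHITESPACE or c in DELIMITERS:
            out.append(c)
            i += 1
        else:                # regular token (operator, number, name body)
            j = i
            while j < n and content[j] not in WHITESPACE and content[j] not in DELIMITERS:
                j += 1
            token = content[i:j]
            if token == b"BI":
                i = _end_of_inline_image(content, j)   # skip the whole image
            else:
                out += token
                i = j
    return bytes(out)
```

The output can then go through the normal token parser, which no longer sees the binary image bytes. Doing the skip inside the tokenizer itself would be cleaner, since the tokenizer already knows when it has just read an `ID` operator.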

@catagras
Author

catagras commented Nov 12, 2018 via email

@galkahana
Owner

k. don't have to :)
