text-extraction thwarted by inline images #18

catagras · 2018-05-23T04:50:24Z

Played a bit with the text-extraction sample and found that if an inline image is encountered (a BI / ID / EI construct), the rest of the page is skipped. Most likely this happens because the image data that follows ID is parsed as PDF tokens rather than as raw stream data.

Any hint on how I might skip inline images?

Thanks!
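
One possible workaround, until the tokenizer itself handles this, is to strip inline images from the content stream bytes before they are tokenized. The sketch below is not part of the library and does not use its API; `strip_inline_images` is a hypothetical helper name. It assumes the usual layout (`BI`, key/value pairs, `ID`, one whitespace byte, raw data, whitespace, `EI`) and finds the end of the data heuristically, so it can be fooled if the binary data happens to contain a whitespace-delimited `EI`, or if `BI` appears as a standalone word inside a string literal.

```python
import re

# PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE.
_WS = rb"\x00\t\n\x0c\r "

# "BI" as a standalone operator: not preceded by a non-whitespace byte,
# and followed by whitespace.
_BI = re.compile(rb"(?<![^" + _WS + rb"])BI(?=[" + _WS + rb"])")

# "ID" as a standalone operator, consuming the single whitespace byte
# that separates it from the raw image data.
_ID = re.compile(rb"(?<![^" + _WS + rb"])ID[" + _WS + rb"]")

# "EI" preceded by whitespace and followed by whitespace or end of stream.
# Heuristic: image bytes could contain this sequence by chance.
_EI = re.compile(rb"[" + _WS + rb"]EI(?=[" + _WS + rb"]|\Z)")


def strip_inline_images(content: bytes) -> bytes:
    """Return the content stream with every BI ... ID ... EI span removed.

    Hypothetical helper for illustration only. If an inline image is not
    properly terminated, the remainder of the stream is left untouched.
    """
    kept = []
    pos = 0
    while True:
        bi = _BI.search(content, pos)
        if bi is None:
            break
        image_data = _ID.search(content, bi.end())
        if image_data is None:
            break
        ei = _EI.search(content, image_data.end())
        if ei is None:
            break
        # Keep everything before the image, then resume after "EI".
        kept.append(content[pos:bi.start()])
        pos = ei.end()
    kept.append(content[pos:])
    return b"".join(kept)
```

Calling `strip_inline_images` on the decoded content stream bytes and feeding the result to the text extractor should let the operators after the image be read normally. A more robust fix would live in the tokenizer and use the image dictionary (width, height, bits per component, filters) to work out how many bytes to skip instead of searching for `EI`.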

galkahana · 2018-11-11T22:44:28Z

yeah...this will take improving the tokenizer. wanna help?

catagras · 2018-11-12T11:18:45Z

Well, I tried to do that a few months ago when I asked the question, but I didn't manage to fix it at the time. I'll gladly give it another go, even though I haven't coded in C++ in the past 20 years.

Cătălin

galkahana · 2018-11-12T18:13:36Z

k. don't have to :)
