[Question] OCR #397

Open
xuzeyu91 opened this issue Apr 10, 2024 · 3 comments
Labels
question Further information is requested

Comments

@xuzeyu91

Context / Scenario

54d4edad5bc9a4867cfffc214c9c94f
I followed this example and wrote an OCR implementation. When I tried it on a scanned PDF and on a PDF containing images, the OCR was not triggered. I'm not sure whether I did something wrong.


@xuzeyu91 xuzeyu91 added the question Further information is requested label Apr 10, 2024
@lecramr
Contributor

lecramr commented Apr 12, 2024

Looks like this is currently not possible, see code:
https://github.com/microsoft/kernel-memory/blob/main/service/Core/DataFormats/Pdf/PdfDecoder.cs

That said, we already have IOcrEngine (https://github.com/microsoft/kernel-memory/blob/main/service/Abstractions/DataFormats/IOcrEngine.cs) in place, which would be enough for simple text extraction, and UglyToad.PdfPig can extract images as an experimental feature.

@dluc Wouldn't it be possible to extend "FileContent" with an array of the images found in the PDF, described by the GPT-4 Vision API if enabled?

@marcominerva
Contributor

I think you will be able to support this scenario once issue #379 is completed (there is currently a PR in preview).

With that, you will be able to inject a custom decoder for PDF files.
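
As a rough illustration of what injecting a custom decoder means, here is a minimal, language-agnostic sketch in Python. It is not the Kernel Memory API: `DefaultPdfDecoder`, `CustomPdfDecoder`, and `pick_decoder` are hypothetical names, and the rule that the most recently registered matching decoder wins is an assumption of this sketch, so the real service may resolve decoders differently.

```python
class DefaultPdfDecoder:
    """Hypothetical stand-in for a built-in decoder that reads only the PDF text layer."""

    def supports_mime_type(self, mime_type: str) -> bool:
        return mime_type == "application/pdf"

    def decode(self, data: bytes) -> str:
        return "text layer only"


class CustomPdfDecoder:
    """Hypothetical user-supplied decoder that would also OCR embedded images."""

    def supports_mime_type(self, mime_type: str) -> bool:
        return mime_type == "application/pdf"

    def decode(self, data: bytes) -> str:
        return "text layer + OCR"


def pick_decoder(decoders, mime_type: str):
    """Return the last registered decoder that supports the MIME type.

    Assumption of this sketch: later registrations override earlier ones.
    """
    for decoder in reversed(decoders):
        if decoder.supports_mime_type(mime_type):
            return decoder
    raise LookupError(f"no decoder registered for {mime_type}")


# The custom decoder is registered after the default one, so it takes over PDFs.
registered = [DefaultPdfDecoder(), CustomPdfDecoder()]
chosen = pick_decoder(registered, "application/pdf")
```

With this arrangement the rest of the pipeline does not need to know that the PDF decoder was swapped.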

@dluc
Collaborator

dluc commented Apr 16, 2024

Now that custom content decoders can be injected, I would first try creating one that replaces the default PDF decoder and internally does all the work of extracting both the document text and the text inside images. For example, you can create a decoder that depends on the existing image decoder to parse images and returns all the text at the end, without the need to revisit the FileContent class (for now).
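
The composition described above could look roughly like the following sketch. It is written in Python with stand-in types only: `Page`, `OcrPdfDecoder`, `fake_parse_pdf`, and `fake_ocr` are hypothetical names, not Kernel Memory or PdfPig APIs. The PDF parser and the OCR step are passed in as plain callables so that the decoder itself only merges their results, one text section per page.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical parsed PDF page: its text layer plus raw embedded images."""
    text: str
    images: List[bytes] = field(default_factory=list)


class OcrPdfDecoder:
    """Hypothetical PDF decoder that delegates image text to an injected OCR function."""

    def __init__(self,
                 parse_pdf: Callable[[bytes], List[Page]],
                 ocr: Callable[[bytes], str]):
        self._parse_pdf = parse_pdf  # stand-in for the PDF library
        self._ocr = ocr              # stand-in for the existing image decoder / OCR engine

    def decode(self, data: bytes) -> List[str]:
        """Return one text section per page: text layer first, then OCR text per image."""
        sections = []
        for page in self._parse_pdf(data):
            parts = [page.text.strip()]
            parts += [self._ocr(image).strip() for image in page.images]
            # Drop empty parts so image-only pages do not start with a blank line.
            sections.append("\n".join(p for p in parts if p))
        return sections


# Fakes for illustration only; a real setup would plug in actual PDF parsing and OCR.
def fake_parse_pdf(data: bytes) -> List[Page]:
    return [Page("Invoice 42", [b"img1"]), Page("", [b"img2"])]


def fake_ocr(image: bytes) -> str:
    return "scanned: " + image.decode()


decoder = OcrPdfDecoder(fake_parse_pdf, fake_ocr)
sections = decoder.decode(b"%PDF")
```

Because the decoder returns plain text sections, nothing downstream has to change, which matches the suggestion to leave FileContent alone for now.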
