Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ROB: improve inline image extraction #2622

Open
wants to merge 43 commits into
base: main
Choose a base branch
from
Open

Conversation

pubpub-zz
Copy link
Collaborator

closes #2598

Copy link

codecov bot commented May 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.08%. Comparing base (1117278) to head (9c03aa7).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2622      +/-   ##
==========================================
+ Coverage   95.00%   95.08%   +0.07%     
==========================================
  Files          50       51       +1     
  Lines        8356     8478     +122     
  Branches     1673     1693      +20     
==========================================
+ Hits         7939     8061     +122     
- Misses        259      262       +3     
+ Partials      158      155       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pubpub-zz
Copy link
Collaborator Author

@pubpub-zz pubpub-zz marked this pull request as draft May 4, 2024 14:47
@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented May 5, 2024

use of:
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/bug1065245.pdf
as test input
8 images all looking like:
img

@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented May 7, 2024

@pubpub-zz
Copy link
Collaborator Author

@stefan6419846, @MartinThoma, @MasterOdin
In order to improve test coverage, I'm looking for PDF files with inline images. Can you provide me with some ?

@stefan6419846
Copy link
Collaborator

It is rather unlikely that I am able to provide further (real) files for testing. While I have access to lots of PDF files in theory, most of them are confidential. Additionally, the usual routines I use completely omit any inline images as they provide no value for my use cases.

If possible, I would indeed prefer to avoid two different implementations of the same filter algorithms - we already have sufficient coverage for the filters outside the inline image extraction and thus re-using them would make more sense and avoid larger coverage issues.

@pubpub-zz
Copy link
Collaborator Author

@stefan6419846
thanks for trying.
There is not really changes in the filter/image processing. The change only applies to improve data extraction from contents.
Will try to find a way to generate some data.

@pubpub-zz
Copy link
Collaborator Author

add test file for RL encoding:
RL.pdf
image:
RL

@pubpub-zz pubpub-zz marked this pull request as ready for review May 14, 2024 20:49
pypdf/_page.py Outdated Show resolved Hide resolved
pypdf/_page.py Outdated Show resolved Hide resolved
pypdf/_page.py Outdated Show resolved Hide resolved
pypdf/_page.py Outdated Show resolved Hide resolved
pypdf/filters.py Outdated Show resolved Hide resolved
@stefan6419846
Copy link
Collaborator

Besides the above remarks, there are two bigger issues I would like to talk about:

  • Importing pypdf._xobj_image_helpers without Pillow is marked as deprecated. Why? Does this mean you want to completely drop support for installations without Pillow?
  • Could we please use more verbose names for the filter methods while using all-lowercase function names as usually recommended by PEP8?

pubpub-zz and others added 7 commits May 20, 2024 10:36
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ValueError: invalid literal for int() with base 16: b'F:'
2 participants