ROB: improve inline image extraction #2622

pubpub-zz · 2024-05-03T21:05:50Z

codecov · 2024-05-04T13:12:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.08%. Comparing base (1117278) to head (9c03aa7).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2622      +/-   ##
==========================================
+ Coverage   95.00%   95.08%   +0.07%     
==========================================
  Files          50       51       +1     
  Lines        8356     8478     +122     
  Branches     1673     1693      +20     
==========================================
+ Hits         7939     8061     +122     
- Misses        259      262       +3     
+ Partials      158      155       -3

pubpub-zz · 2024-05-04T14:36:11Z

new test file:
Pages.62.73.from.0560-22_WSP.Plan_July.2022_Version.1.pdf

pubpub-zz · 2024-05-05T19:41:10Z

Using
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/bug1065245.pdf
as test input yields 8 images, all with the same appearance (image not preserved).

pubpub-zz · 2024-05-07T09:25:28Z

new test:
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/bug1065245.pdf

image 0:

closes py-pdf#2629

pubpub-zz · 2024-05-09T12:40:09Z

@stefan6419846, @MartinThoma, @MasterOdin
To improve test coverage, I'm looking for PDF files with inline images. Could you provide some?

stefan6419846 · 2024-05-09T12:59:21Z

It is rather unlikely that I am able to provide further (real) files for testing. While I have access to lots of PDF files in theory, most of them are confidential. Additionally, the usual routines I use completely omit any inline images as they provide no value for my use cases.

If possible, I would indeed prefer to avoid two different implementations of the same filter algorithms - we already have sufficient coverage for the filters outside the inline image extraction and thus re-using them would make more sense and avoid larger coverage issues.
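
As a rough illustration of the re-use idea (not the actual pypdf implementation): inline images use abbreviated filter names defined in the PDF specification, so a single mapping to the full names is enough to route them to decoders that already exist. The `decode_inline` function and the `DECODERS` table below are hypothetical, and only two stdlib-backed decoders are wired in.

```python
import binascii
import zlib

# Abbreviated inline-image filter names and their full equivalents
# (as listed in the PDF specification).
INLINE_FILTER_ABBREVIATIONS = {
    "/AHx": "/ASCIIHexDecode",
    "/A85": "/ASCII85Decode",
    "/LZW": "/LZWDecode",
    "/Fl": "/FlateDecode",
    "/RL": "/RunLengthDecode",
    "/CCF": "/CCITTFaxDecode",
    "/DCT": "/DCTDecode",
}


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHex data: ignore whitespace, stop at '>', pad odd input with 0."""
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:
        digits += b"0"
    return binascii.unhexlify(digits)


# Hypothetical table: a single set of decoders keyed by the full filter name.
DECODERS = {
    "/ASCIIHexDecode": ascii_hex_decode,
    "/FlateDecode": zlib.decompress,
}


def decode_inline(filter_name: str, data: bytes) -> bytes:
    """Normalize an abbreviated name, then call the shared decoder."""
    full_name = INLINE_FILTER_ABBREVIATIONS.get(filter_name, filter_name)
    return DECODERS[full_name](data)
```

With this shape, `decode_inline("/Fl", ...)` and `decode_inline("/FlateDecode", ...)` reach the same function, so the decoding logic only needs to be tested once.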

pubpub-zz · 2024-05-09T15:15:13Z

@stefan6419846
Thanks for trying.
There are no real changes to the filter/image processing. The change only improves how the image data is extracted from the page contents.
I will try to find a way to generate some test data.
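
One way to generate such data without a real PDF is to build a content-stream fragment by hand. The sketch below is only an illustration, not pypdf code: `build_inline_image` and `extract_ahx_payload` are hypothetical helpers. It writes a grayscale inline image with the `/AHx` (ASCIIHex) filter, then recovers the payload by reading up to the filter's `>` end-of-data marker instead of searching for `EI`.

```python
import binascii


def build_inline_image(width: int, height: int, gray_pixels: bytes) -> bytes:
    """Return a content-stream fragment with an ASCIIHex-encoded inline image."""
    if len(gray_pixels) != width * height:
        raise ValueError("expected one byte per pixel")
    header = b"BI /W %d /H %d /CS /G /BPC 8 /F /AHx ID " % (width, height)
    payload = binascii.hexlify(gray_pixels).upper() + b">"  # '>' marks end of data
    return b"q " + header + payload + b" EI Q"


def extract_ahx_payload(stream: bytes) -> bytes:
    """Read the hex data that follows ID, up to the '>' end-of-data marker."""
    start = stream.index(b" ID ") + len(b" ID ")
    end = stream.index(b">", start)
    digits = b"".join(stream[start:end].split())  # drop any whitespace
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is padded with 0
    return binascii.unhexlify(digits)


# The raw pixel bytes 0x45 0x49 spell "EI", which a naive search for the
# EI operator could trip over if the data were stored unencoded.
pixels = bytes([0, 69, 73, 255])
fragment = build_inline_image(2, 2, pixels)
recovered = extract_ahx_payload(fragment)
```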

pubpub-zz · 2024-05-11T13:29:18Z

Added a test file for RL (RunLength) encoding:
RL.pdf
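
For reference, a minimal sketch of RunLength decoding as the PDF specification describes it. `run_length_decode` is an illustrative name, not the function added in this PR.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode RunLength data.

    Each run starts with a length byte:
      0..127   copy the next (length + 1) bytes literally
      129..255 repeat the next byte (257 - length) times
      128      end of data
    """
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # end-of-data marker
            break
        if length < 128:  # literal run
            out += data[i : i + length + 1]
            i += length + 1
        else:  # repeated byte
            out += data[i : i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Literal "abc", then "x" repeated 4 times, then the end-of-data marker.
encoded = bytes([2]) + b"abc" + bytes([253]) + b"x" + bytes([128])
decoded = run_length_decode(encoded)
```

Because the end-of-data byte is explicit, a decoder of this kind also tells the caller where the image data stops, which is useful when the data sits inline in a content stream.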

stefan6419846 · 2024-05-20T07:58:38Z

Besides the above remarks, there are two bigger issues I would like to talk about:

Importing pypdf._xobj_image_helpers without Pillow is marked as deprecated. Why? Does this mean you want to completely drop support for installations without Pillow?
Could we please use more verbose names for the filter methods while using all-lowercase function names as usually recommended by PEP8?

pubpub-zz added 2 commits May 3, 2024 23:04

ROB: improve inline image extraction

b449664

closes py-pdf#2598

fix

44b41a7

complete testing

0952fee

pubpub-zz marked this pull request as draft May 4, 2024 14:47

pubpub-zz added 3 commits May 5, 2024 22:17

complete test

0ba5ae4

tests

fdbc092

fix

fd57ef7

pubpub-zz added 9 commits May 7, 2024 12:09

fix DCT

70f9c02

Fix A85

8996a73

Merge remote-tracking branch 'origin/iss2598' into iss2598

fd6334e

blank

5b38f34

with new link

67d51ea

Merge branch 'pb_stanford' into iss2598

9fb0974

fix test

092e2a5

BUG: Incorrect number of inline images

c5d62a3

closes py-pdf#2629

Merge branch 'iss2629' into iss2598

ae93628

pubpub-zz mentioned this pull request May 8, 2024

BUG: Incorrect number of inline images #2630

Open

add test for RL + fix

51bea2c

pubpub-zz added 4 commits May 11, 2024 16:25

remove encode as not used for the moment

bd84496

Fix + Test

770aaba

test+fix

a37b73f

test

184e141

pubpub-zz added 4 commits May 12, 2024 14:50

test / fix /refactoring

b79164e

Merge remote-tracking branch 'py-pdf/main' into iss2598

e247b72

fix

66f858c

fix2

ee637c0

pubpub-zz marked this pull request as ready for review May 14, 2024 20:49

pubpub-zz requested a review from stefan6419846 May 19, 2024 13:23

stefan6419846 reviewed May 20, 2024

pypdf/_page.py (outdated)

stefan6419846 reviewed May 20, 2024

pypdf/_page.py (outdated)

stefan6419846 reviewed May 20, 2024

pypdf/_page.py (outdated)

stefan6419846 reviewed May 20, 2024

pypdf/_page.py (outdated)

stefan6419846 reviewed May 20, 2024

pypdf/_xobj_image_helpers.py (outdated)

stefan6419846 reviewed May 20, 2024

pypdf/generic/_data_structures.py (outdated)

stefan6419846 reviewed May 20, 2024

pypdf/generic/_data_structures.py (outdated)

stefan6419846 reviewed May 20, 2024

pypdf/filters.py

stefan6419846 reviewed May 20, 2024

pypdf/filters.py (outdated)

stefan6419846 reviewed May 20, 2024

pypdf/generic/_data_structures.py (outdated)

pubpub-zz and others added 7 commits May 20, 2024 10:36

Update pypdf/_page.py

2874e56

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/_page.py

81e1f30

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/_page.py

90fe459

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/_page.py

54e4c1d

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/generic/_data_structures.py

d9841dd

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/generic/_data_structures.py

ecdba02

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

update from comments

ae9fdfc

pubpub-zz requested a review from stefan6419846 May 20, 2024 10:22

Merge branch 'main' into iss2598

5347820

stefan6419846 reviewed May 20, 2024

pypdf/generic/_image_inline.py (outdated)

pubpub-zz added 3 commits May 20, 2024 18:10

Update _data_structures.py

bcabdc8

Update _image_inline.py

dc045b6

Update test_generic.py

9c03aa7
