ENH: consider images inside PDF made with onlyoffice #2637

Open
wants to merge 13 commits into
base: main

Conversation


@0xNath 0xNath commented May 11, 2024

closes #2613

Added code to detect patterns in "_get_ids_image".
To avoid conflicts with images located directly in a page, or with images that use the same ID in different patterns, image IDs under patterns are returned in this form: "/Pattern/patternNameHere/imageNameHere"

Added code to deal with Pattern images in "_get_image".

Added a new test to verify ids and image data returned related to the modifications done above.
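
For readers following along, here is a minimal sketch of the ID scheme described above. It is not the code from this PR: the helper names (`make_pattern_image_id`, `split_image_id`, `is_pattern_image_id`) and the sample names `P1` and `X1` are hypothetical, and only the "/Pattern/patternName/imageName" form comes from the description.

```python
from typing import Tuple

# Top-level marker taken from the ID form quoted in the description.
PATTERN_PREFIX = "/Pattern"


def make_pattern_image_id(pattern_name: str, image_name: str) -> str:
    """Build an ID such as "/Pattern/P1/X1" from PDF-style names."""
    # Names in a PDF usually carry a leading slash; strip it before joining.
    return "/".join(
        [PATTERN_PREFIX, pattern_name.lstrip("/"), image_name.lstrip("/")]
    )


def split_image_id(image_id: str) -> Tuple[str, ...]:
    """Split an ID into its path components, dropping empty segments."""
    return tuple(part for part in image_id.split("/") if part)


def is_pattern_image_id(image_id: str) -> bool:
    """Return True for IDs of the form "/Pattern/<pattern>/<image>"."""
    parts = split_image_id(image_id)
    return len(parts) == 3 and parts[0] == "Pattern"
```

Because the pattern name is part of the ID, two patterns that both contain an image called `/X1` yield distinct IDs, and neither collides with a page-level `/X1`.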

@0xNath
Author

0xNath commented May 11, 2024

I made a PR to add the sample files:
py-pdf/sample-files#30

Instead of putting the files directly into this repository, should I host the files on github or somewhere else so they get downloaded?

@0xNath 0xNath marked this pull request as draft May 11, 2024 22:05
@stefan6419846
Collaborator

stefan6419846 commented May 12, 2024

> Instead of putting the files directly into this repository, should I host the files on github or somewhere else so they get downloaded?

In general, files should go into the sample files if you own all copyrights and are okay with the license terms there. Unfortunately, merging permissions are more restricted over there, thus I am not able to merge anything there myself.

The usual approach we take apart from this is to upload to an issue or PR and download the corresponding files from the GitHub URLs. Unfortunately, this will not work for image files any more as far as I know as GitHub started to add JWTs to their URLs which expire accordingly and are user-specific.

@pubpub-zz
Collaborator

Can you rename the title of your PR: this is not a STYle but an ENHancement.
You should also edit your first comment to set your "closes" info to ease closing.

You should be able to "factorize" your code to reuse the existing image extraction code. The current code looking for "/Resources/Xobjects/'Forms'" should be very similar to "/Resources/Patterns/", adjusting x_object.

You should also confirm that your code works with patterns within forms.

Also, it might be interesting to extract the thumbnail and confirm that no other images are ignored.
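
The refactoring suggested above can be illustrated with a small sketch in which a single recursive walk serves both form XObjects and patterns. This is an assumption-laden illustration, not pypdf's implementation: it uses plain Python dicts in place of pypdf objects, the function name `collect_image_ids` is hypothetical, and the joined-path ID format is chosen here only for readability.

```python
from typing import Any, Dict, List


def collect_image_ids(resources: Dict[str, Any], prefix: str = "") -> List[str]:
    """Collect image IDs from a resources dictionary, descending into
    form XObjects and patterns with the same logic."""
    ids: List[str] = []
    # Direct XObjects: images are leaves, forms carry their own /Resources.
    for name, obj in resources.get("/XObject", {}).items():
        subtype = obj.get("/Subtype")
        if subtype == "/Image":
            ids.append(f"{prefix}{name}")
        elif subtype == "/Form":
            ids.extend(
                collect_image_ids(obj.get("/Resources", {}), f"{prefix}{name}")
            )
    # Patterns may also carry /Resources; reuse the same walk, marking the
    # path with "/Pattern" so the IDs cannot collide with page-level names.
    for name, obj in resources.get("/Pattern", {}).items():
        ids.extend(
            collect_image_ids(obj.get("/Resources", {}), f"{prefix}/Pattern{name}")
        )
    return ids
```

Since the pattern branch calls the same function, a pattern nested inside a form is handled without extra code. The sketch does not guard against cyclic references, which a real implementation would need to consider.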

@0xNath 0xNath changed the title STY: consider images inside patterns ENH: consider images inside patterns May 12, 2024
@0xNath
Author

0xNath commented May 12, 2024

> > Instead of putting the files directly into this repository, should I host the files on github or somewhere else so they get downloaded?
>
> In general, files should go into the sample files if you own all copyrights and are okay with the license terms there. Unfortunately, merging permissions are more restricted over there, thus I am not able to merge anything there myself.
>
> The usual approach we take apart from this is to upload to an issue or PR and download the corresponding files from the GitHub URLs. Unfortunately, this will not work for image files any more as far as I know as GitHub started to add JWTs to their URLs which expire accordingly and are user-specific.

I reused images that were already present in the repo, so it should be fine in terms of rights. The issue is that OnlyOffice converts images to JPEG when they are included in a document, so I had to re-upload them once converted to pass content verification in the unit test.

Let's wait and see whether my PR on the other repo gets merged, then. Thanks.

@0xNath
Author

0xNath commented May 12, 2024

> Can you rename the title of your PR: this is not a STYle but an ENHancement. You should also edit your first comment to set your "closes" info to ease closing.

I was not sure which one to use, thanks.

> You should be able to "factorize" your code to reuse the existing image extraction code. The current code looking for "/Resources/Xobjects/'Forms'" should be very similar to "/Resources/Patterns/", adjusting x_object.

You're right, I didn't think about it. Let's do that.

> You should also confirm that your code works with patterns within forms.

It probably will not in its current state. I'll check what forms are, thank you.

> Also, it might be interesting to extract the thumbnail and confirm that no other images are ignored.

Yes, I agree. I'll need to read a bit more of the PDF standard; so far I've only read the part about patterns and image objects, so I'm missing a lot about how image objects are used and located. Maybe it would be nicer to have an issue for each case where images are missing, each with its matching PR?

@0xNath 0xNath marked this pull request as ready for review May 17, 2024 16:19
@0xNath 0xNath marked this pull request as draft May 17, 2024 16:38
@0xNath
Author

0xNath commented May 17, 2024

I have tested OnlyOffice forms, and images inside them are not taken into account.

Images inside PDFs from LibreOffice, with and without forms, are taken into account, as well as images from Word forms.

@pubpub-zz
Collaborator

@0xNath
Can you please upload your test files to this thread and modify your PR to get them from the URLs? For the names, use iss2613a.pdf, iss2613b.pdf, ...
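
As a rough illustration of the "fetch test files from URLs under a fixed name" approach requested above, here is a minimal caching sketch. It is not pypdf's own test helper: the function name `get_cached_file`, the `cache_dir` and `fetch` parameters, and any URL passed to it are assumptions made for this example.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def _fetch(url: str) -> bytes:
    """Download the raw bytes behind a URL (default fetcher)."""
    with urlopen(url) as response:
        return response.read()


def get_cached_file(
    url: str,
    name: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes] = _fetch,
) -> bytes:
    """Return the file content, downloading it only on the first call.

    The file is stored under ``cache_dir / name`` (for example
    "iss2613a.pdf"), so later test runs read it from disk.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        target.write_bytes(fetch(url))
    return target.read_bytes()
```

Keeping the fetcher injectable lets the caching logic be tested without network access. Note that this approach only works for URLs that stay valid, which is the concern raised earlier about expiring, user-specific image URLs.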

@0xNath
Author

0xNath commented May 17, 2024

I'll add the code for extracting the images from the form PDF, along with the corresponding test. I'm going to follow the same logic as for the patterns, but with "/Annots/.../..." as image identifiers.

iss2613-onlyoffice-form.pdf
iss2613-onlyoffice-standardImages.pdf

@0xNath 0xNath changed the title ENH: consider images inside patterns ENH: consider images inside PDF made with onlyoffice May 17, 2024
@0xNath
Author

0xNath commented May 17, 2024

Images to test against
iss2613-P1_X1
iss2613-P2_X1
iss2613-P3_X1


codecov bot commented May 17, 2024

Codecov Report

Attention: Patch coverage is 85.18519% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 94.89%. Comparing base (c227b0c) to head (24218c8).

| Files | Patch % | Lines |
|---|---|---|
| pypdf/_page.py | 85.18% | 2 Missing and 6 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2637      +/-   ##
==========================================
- Coverage   94.97%   94.89%   -0.08%     
==========================================
  Files          50       50              
  Lines        8331     8371      +40     
  Branches     1669     1689      +20     
==========================================
+ Hits         7912     7944      +32     
- Misses        260      262       +2     
- Partials      159      165       +6     


Nathanaël Renaud and others added 12 commits May 17, 2024 23:09
Added code to detect patterns in "_get_ids_image".
To avoid any conflicts with images that could be located directly in a page or images using the same ID in differents patterns, images ids under patterns are returned in this form :
"/Pattern/patternNameHere/imageNameHere"

Added code to deal with Pattern images in "_get_image".
fixed code style
fixed code style
@0xNath 0xNath force-pushed the ENH-#2613-integration-of-images-inside-patterns branch from fcb3b1f to d524672 Compare May 17, 2024 21:09
@0xNath 0xNath marked this pull request as ready for review May 17, 2024 21:16
commit d0493ae
Author: Nathanaël Renaud <perso@renaudna.fr>
Date:   Fri May 17 23:25:13 2024 +0200

    Modified _get_ids_image and _get_image so they work with onlyoffice images

commit 53a3781
Author: Nathanaël Renaud <perso@renaudna.fr>
Date:   Fri May 17 23:22:27 2024 +0200

    Added tests units about images extractions from PDF generated using onlyoffice
Development

Successfully merging this pull request may close these issues.

Images contained in objects of type "/Pattern" are not retrieved
3 participants