Images contained in objects of type "/Pattern" are not retrieved #2613

0xNath · 2024-05-01T10:10:20Z

Explanation

Hello,
First of all, thanks for your work; it's a very helpful library.

I am not able to extract images from a PDF generated with OnlyOffice:
B2.pdf

After looking into the PDF structure, it seems that the image on this page is contained inside a Tiling Pattern object, which neither "_page._get_ids_image" nor "_page._get_image" can handle.

I've taken a look at the PDF standard, and it specifies that Tiling Patterns can contain images, so this is not an OnlyOffice issue.
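To make the structure described above concrete, here is a minimal sketch of walking a resources dictionary and listing the image XObjects nested inside tiling patterns. The helper names `_resolve` and `iter_pattern_images` are hypothetical and not part of pypdf; the sketch only assumes dict-like objects, which covers both plain dictionaries and pypdf's dictionary objects.

```python
from typing import Any, Iterator, Tuple


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object exposes get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def iter_pattern_images(resources: Any) -> Iterator[Tuple[str, str, Any]]:
    """Yield (pattern_name, xobject_name, xobject) for images inside patterns.

    Hypothetical helper: `resources` is any dict-like /Resources dictionary.
    """
    resources = _resolve(resources)
    if "/Pattern" not in resources:
        return
    patterns = _resolve(resources["/Pattern"])
    for pattern_name in patterns:
        pattern = _resolve(patterns[pattern_name])
        # Tiling patterns have /PatternType 1 and carry their own /Resources;
        # shading patterns (type 2) are skipped.
        if "/PatternType" not in pattern or pattern["/PatternType"] != 1:
            continue
        if "/Resources" not in pattern:
            continue
        pattern_resources = _resolve(pattern["/Resources"])
        if "/XObject" not in pattern_resources:
            continue
        xobjects = _resolve(pattern_resources["/XObject"])
        for xobj_name in xobjects:
            xobj = _resolve(xobjects[xobj_name])
            # Keep only image XObjects, not form XObjects.
            if "/Subtype" in xobj and xobj["/Subtype"] == "/Image":
                yield str(pattern_name), str(xobj_name), xobj
```

With pypdf, this could be called as `iter_pattern_images(page["/Resources"])`, assuming the page carries its own "/Resources" entry rather than inheriting it from a parent node.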

I haven't read the standard's section on Patterns completely yet, but once I have, I'd like to propose a change so that images can at least be retrieved from them: when we get the images of a page, Patterns would be considered too.

What do you think about it?

Have a nice day!

stefan6419846 · 2024-05-01T10:52:47Z

Thanks for the report. To determine the images associated with a page, pypdf does indeed not consider nested xobjects for image extraction.

pubpub-zz · 2024-05-01T11:30:46Z

pypdf can look in sub XObjects; however, here you are looking for an object that is part of a pattern, which to me is not the usual way to do things.
Here is a proposal to extract your image:

import pypdf

r = pypdf.PdfReader("B2.pdf")
img = pypdf.filters._xobj_to_image(r.pages[0]["/Resources"]["/Pattern"]["/P1"]["/Resources"]["/XObject"]["/X1"])[2]
img.show()

I will also try to propose an easier way to extract an image.
Edit: I've found a better way.

closes py-pdf#2613

pubpub-zz · 2024-05-01T12:20:33Z

With the new PR, extraction will be easier:

import pypdf
r = pypdf.PdfReader("B2.pdf")
img = r.pages[0]["/Resources"]["/Pattern"]["/P1"]["/Resources"]["/XObject"]["/X1"].decode_as_image()
img.show()

0xNath · 2024-05-01T14:09:07Z

Wouldn't it be better for the function that is supposed to extract all images of a page to actually extract all images of the page?

The PDF standard says that images can be stored inside Patterns, so we should expect to find images in them.

pubpub-zz · 2024-05-01T16:52:41Z

I agree that images can be stored in patterns, but the solution used here is not common. A pattern is expected to be used in a context where it provides a repeated image over a surface.
There are too many places where images could be (patterns, annotations, ...); covering them all would be quite complex, and having the image out of its context may not be very efficient.

0xNath · 2024-05-01T17:31:30Z

We could add a bool parameter (recurse, deepSearch or whatever) to the _page.images method.

When set to False, the standard methods _page._get_ids_image and _page._get_image would get called, keeping image retrieval in its simplest form: the inline images and the image dictionaries of the page.

When set to True, we could call the standard methods and return, on top of their results, the images found in "special" cases like Patterns.

This way we still keep it efficient for the current usage.
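As a rough illustration of the flag described above, here is a minimal sketch written as a standalone helper rather than as a change to pypdf. The function name `collect_images` is hypothetical. It assumes each entry of `page.images` exposes `name` and `image`, that the page carries its own "/Resources" entry, and that image XObjects offer `decode_as_image()` as proposed in the PR mentioned in this thread.

```python
from typing import Any, List, Tuple


def collect_images(page: Any, recurse: bool = False) -> List[Tuple[str, Any]]:
    """Return (name, image) pairs; with recurse=True also look inside patterns.

    Hypothetical helper, not part of pypdf.
    """
    # Default path: only what the existing page.images property reports.
    found = [(img.name, img.image) for img in page.images]
    if not recurse or "/Resources" not in page:
        return found
    resources = page["/Resources"]
    if "/Pattern" not in resources:
        return found
    patterns = resources["/Pattern"]
    for pattern_name in patterns:
        pattern = patterns[pattern_name]
        # Shading patterns have no /Resources of their own, so skip them.
        if "/Resources" not in pattern:
            continue
        if "/XObject" not in pattern["/Resources"]:
            continue
        xobjects = pattern["/Resources"]["/XObject"]
        for xobj_name in xobjects:
            xobj = xobjects[xobj_name]
            if "/Subtype" in xobj and xobj["/Subtype"] == "/Image":
                # Assumption: decode_as_image() is available on the stream.
                name = f"{pattern_name}{xobj_name}"
                found.append((name, xobj.decode_as_image()))
    return found
```

Calling `collect_images(page)` keeps the current behaviour, while `collect_images(page, recurse=True)` appends the images found in patterns, named after their pattern and XObject keys.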

pubpub-zz · 2024-05-01T19:22:00Z

We can propose a PR

0xNath · 2024-05-01T20:41:58Z

Well well well, _page.images isn't a method but a property so passing a parameter to it isn't an option...

stefan6419846 added the workflow-images label (from a user's perspective, image handling is the affected feature/workflow) May 1, 2024

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 1, 2024

ENH: add decode_as_image() to ContentStreams

854c467

closes py-pdf#2613

pubpub-zz linked a pull request May 1, 2024 that will close this issue

ENH: add decode_as_image() to ContentStreams #2615

Open

stefan6419846 changed the title ~~Image contained in objects of type "/Pattern" are not retrived~~ Images contained in objects of type "/Pattern" are not retrieved May 2, 2024

0xNath linked a pull request May 11, 2024 that will close this issue

ENH: consider images inside PDF made with onlyoffice #2637

Open
