[Bug]: pdf export for large image sizes results in wrong colors #25806

simonwm · 2023-05-02T08:23:48Z

Bug summary

Exporting a figure filled with imshow as pdf produces the wrong colors if the image size/resolution becomes large, e.g. gray cells become red.

Code for reproduction

import numpy as np
import matplotlib.pyplot as plt

# data: four background tiles with one of them highlighted
background = [0.9]*3 # gray
highlight = [1,0,0] # red
X = np.array([[background,highlight],[background,background]]) # varying colors and combinations gives a rich spectrum of bug-phenotypes

# big figure
dpi, figsize = 500, 20 # less dpi or figsize avoids the bug
fig,ax = plt.subplots(figsize=(figsize,figsize),dpi=dpi)
ax.imshow(X)

# export it
fig.savefig('minimal.png') # correctly shows three tiles in gray and one tile in red
fig.savefig('minimal.pdf') # incorrectly shows all four tiles in red

Actual outcome

The png export shows the correct colors: 3 quarters gray, 1 quarter red. The pdf export shows incorrect colors: 4 quarters red.

minimal.pdf

Expected outcome

Both the png and the pdf export should show the correct colors: 3 quarters gray, 1 quarter red. I.e. the pdf should also look like the png pasted above.

Additional information

This bug only appears for large figure sizes and dpi (maybe if figuresize * dpi is larger than some threshold).
It is also dependent on the colors chosen and the content of the image. In my real-world use case I have a large heatmap with many fields and one color is consistently replaced by some other color from the plot. I also observed variations which replace a color with a darker version of it, e.g. if one switches the roles of red and gray in the minimal example above: Then instead of red the background is dark red in the pdf.

This minimal example was executed in a fresh conda env with matplotlib as the only dependency. I tried it before with matplotlib 3.6.3 in a less clean environment with the same results. I don't have an example of it working in previous versions.

I guess it is an issue with a buffer of fixed size in the pdf backend, as it works with the png backend.

A workaround can be the reduction of the dpi - for vector graphics this still contains all the information. But it changes the "physical" size of the pdf.

Operating system

RHEL 7.9

Matplotlib Version

3.7.1

Matplotlib Backend

No response

Python version

3.11.3

Jupyter version

No response

Installation

conda

saranti · 2023-05-02T10:26:24Z

Possibly related to #18871. The workaround works in this case.

tacaswell · 2023-05-02T14:30:19Z

To be explicit, the workaround is to set

ax.imshow(X, interpolation='none')  # the default is `nearest`

which suggests something is going wrong in the resampling / rasterization pipeline within the pdf backend.
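A minimal sketch of the workaround applied to the reproduction above. The figure here is deliberately small so that it runs quickly (an assumption of this sketch; the original report needed dpi=500 and figsize=20 to trigger the bug), and the PDF is written to an in-memory buffer instead of a file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

background = [0.9] * 3  # gray
highlight = [1, 0, 0]   # red
X = np.array([[background, highlight], [background, background]])

# Small figure for speed; the report used figsize=(20, 20), dpi=500.
fig, ax = plt.subplots(figsize=(2, 2), dpi=100)

# interpolation='none' asks the backend to embed the image without resampling.
ax.imshow(X, interpolation="none")

buf = io.BytesIO()
fig.savefig(buf, format="pdf")
plt.close(fig)

pdf_bytes = buf.getvalue()
print(len(pdf_bytes), "bytes written")
```

With `interpolation='none'` a vector backend such as pdf stores the 2x2 array at its native resolution and leaves the scaling to the viewer, so the embedded image stays tiny regardless of figure size and dpi.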

simonwm · 2023-05-03T12:05:28Z

Thanks! The workaround works nicely, indeed.
I wonder why errors in the interpolation are triggered by using a large image size/resolution...

tacaswell · 2023-05-03T13:56:49Z

My guess (and to be clear this is a guess) is that there are floats and rounding involved. Because of how floats work they lose (absolute) precision as they get bigger, thus if there is something near an edge it may work reliably with small absolute scales and fail at large ones.

This is the correct and expected behavior:


In [1]: 1e20 == (1e20 + 1)
Out[1]: True

as at the scale of 1e20 the gap between successive expressible floats is greater than 1!
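The size of that gap can be inspected directly. The following sketch uses `math.ulp` (available since Python 3.9) to show the spacing between adjacent floats at two different magnitudes; it only illustrates the float-precision argument and says nothing about where the actual bug lives.

```python
import math

x = 1e20

# Distance from x to the next representable double.
gap = math.ulp(x)
print(gap)            # 16384.0

# Adding 1 is far below the gap, so the result rounds back to x.
print(x + 1 == x)     # True

# Adding a full gap does reach the next float.
print(x + gap == x)   # False

# Near 1.0 the gap is tiny, which is why small scales behave well.
small_gap = math.ulp(1.0)
print(small_gap)
```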

QuLogic · 2023-05-03T21:44:21Z

I could not reproduce this problem. By some coincidence, it turns out I had disabled PDF compression for some other checks, and turning it back on does reproduce the issue. With compression on, we output images as compressed PNG and though we have not worked out the exact difference, there is obviously something going wrong there.

You may be able to work around the problem by disabling PDF compression, and then (because the file size is then huge), running it through Ghostscript.
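A sketch of what disabling compression could look like, using the `pdf.compression` rcParam (0 turns compression off). The small figure and the in-memory buffers are assumptions made to keep the example quick; the Ghostscript step is not shown.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

X = np.array([[[0.9] * 3, [1, 0, 0]], [[0.9] * 3, [0.9] * 3]])


def save_pdf(compression):
    """Render the 2x2 test image to PDF bytes at the given compression level."""
    with plt.rc_context({"pdf.compression": compression}):
        fig, ax = plt.subplots(figsize=(2, 2), dpi=100)
        ax.imshow(X)
        buf = io.BytesIO()
        fig.savefig(buf, format="pdf")
        plt.close(fig)
    return buf.getvalue()


raw = save_pdf(0)         # compression disabled: streams are written as raw data
compressed = save_pdf(6)  # a typical non-zero compression level

print(len(raw), len(compressed))
```

The uncompressed file is larger, which is why a post-processing pass through an external tool is suggested above.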

QuLogic · 2023-05-04T22:07:48Z

When compression is on, we output PNG images instead of raw data. Since #17895, we've also started making indexed PNG when possible. We do this by asking Pillow to convert with an adaptive palette. It's kind of undocumented what that does, but looking at various issues, it appears that that does not guarantee that it will use the same colours as the original image. I think this explains why you don't always get the same wrong colour.

If we print img.getextrema() around this conversion:

matplotlib/lib/matplotlib/backends/backend_pdf.py, lines 1751 to 1753 in f64a70e:

            img = img.convert(
                mode='P', dither=dither, palette=pmode, colors=num_colors
            )

we get

((229, 255), (0, 229), (0, 229))
(1, 1)

That is, before converting, there are different limits, but after converting, everything is palette index 1. Looking at Pillow's issues, I see some reference that there's no guarantee that adaptive palettes replicate exact colours. I did not look too deeply into Pillow code, but based on python-pillow/Pillow#1852, I think this may only be a visible problem for low-colour-count but high-pixel-count images. Of course, it might actually be slightly off for any image that is palettized.

I believe the fix is to request quantization with an explicit palette. This mostly appears to work, though there is some kind of off-by-one error somewhere as I get a strange banding effect when just directly doing that.

Asking Pillow for an "adaptive palette" does not appear to guarantee that the chosen colours will be the same, even if asking for exactly the same number as exist in the image. Instead, create an explicit palette, and quantize using it. Additionally, since now the palette may be smaller than 256 colours, Pillow may choose to encode the image data with fewer than 8 bits per component, so we need to properly reflect that in the decode parameters (this was already done for the image parameters). The effect on test images with _many_ colours is small, with a maximum RMS of 1.024, but for images with few colours, the result can be completely wrong as in the reported matplotlib#25806.

Asking Pillow for an "adaptive palette" does not appear to guarantee that the chosen colours will be the same, even if asking for exactly the same number as exist in the image. And asking Pillow to quantize with an explicit palette does not work either, as Pillow uses a cache that trims the last two bits from the colour and never makes an explicit match (see python-pillow/Pillow#1852 (comment)). So instead, manually calculate the indexed image using some NumPy tricks. Additionally, since now the palette may be smaller than 256 colours, Pillow may choose to encode the image data with fewer than 8 bits per component, so we need to properly reflect that in the decode parameters (this was already done for the image parameters). The effect on test images with _many_ colours is small, with a maximum RMS of 1.024, but for images with few colours, the result can be completely wrong as in the reported matplotlib#25806.
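One way such a manual indexing step could look is sketched below. This is an illustration only, not the code merged in #25824: the helper name `index_rgb` is hypothetical, and the approach (packing each RGB triple into one integer and calling `np.unique`) is an assumption about what "NumPy tricks" might mean here.

```python
import numpy as np


def index_rgb(rgb8):
    """Hypothetical helper: return (palette, indices) for an (H, W, 3) uint8 image.

    Every pixel maps to exactly its own colour, so palette[indices]
    reproduces the input without any quantization error.
    """
    h, w, _ = rgb8.shape
    rgb = rgb8.astype(np.uint32)

    # Pack each RGB triple into a single 24-bit integer.
    packed = (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]

    # Unique colours plus, for each pixel, the position of its colour.
    colours, inverse = np.unique(packed.ravel(), return_inverse=True)
    if len(colours) > 256:
        raise ValueError("too many colours for an 8-bit indexed image")

    # Unpack the unique colours back into an (N, 3) uint8 palette.
    palette = np.stack(
        [(colours >> 16) & 0xFF, (colours >> 8) & 0xFF, colours & 0xFF],
        axis=1,
    ).astype(np.uint8)
    indices = inverse.reshape(h, w).astype(np.uint8)
    return palette, indices


# The two colours from the report: gray (229) and red.
gray, red = (229, 229, 229), (255, 0, 0)
image = np.array([[gray, red], [gray, gray]], dtype=np.uint8)

palette, indices = index_rgb(image)
print(palette)
print(indices)
```

Because the palette is built from the image itself, the round trip `palette[indices]` is exact by construction, which is the property the adaptive palette did not provide.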

QuLogic added backend: pdf topic: images labels May 3, 2023

This was referenced May 5, 2023

pdf: Use explicit palette when saving indexed images #25824

Merged

cm.set_bad() not working for specific values of grayscale and dpi when saving as pdf #20575

Closed

ksunden closed this as completed in #25824 Jun 13, 2023

QuLogic added this to the v3.7.2 milestone Jun 29, 2023
