
[Bug]: page.pdf produces corrupt pdf #7757

Closed
tjacobs3 opened this issue Nov 9, 2021 · 18 comments

@tjacobs3
Contributor

tjacobs3 commented Nov 9, 2021

Bug description

Steps to reproduce the problem:

Occasionally, we find that our PDFs cannot be opened by any program. We've narrowed the issue down to the inclusion of certain images. When these images are present, the PDF created by puppeteer is corrupt. No image tool reports anything wrong with the image itself, so I believe this is an issue on the puppeteer side.

  1. Create a test.html file with the following contents:
<html>
  <head>
  </head>
  <body>
    <img src="image.jpg">
  </body>
</html>

In the same directory, place the attached image.jpg

  2. Create a save_to_pdf.js file with the following contents:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(`file://${__dirname}/test.html`, { waitUntil: 'networkidle0', timeout: 60000 });
  await page.pdf({
    path: 'out.pdf',
    printBackground: true
  });

  await browser.close();
})();
  3. Run node save_to_pdf.js
  4. Try to open out.pdf in any program.
    [Screenshot: out.pdf fails to open, 2021-11-09 10:22 AM]

puppeteer_bug.zip
image.jpg

Puppeteer version

10.1.0

Node.js version

12.22.5

npm version

6.14.14

What operating system are you seeing the problem on?

macOS

Relevant log output

No response

@tjacobs3 tjacobs3 added the bug label Nov 9, 2021
@MartinFalatic

I'm tracking a very similar issue with headless page pdf() calls creating corrupted files with certain input images, which started after puppeteer 10.0.0, and I've been working on a way to reproduce it. I'd been assuming the version of Chrome was somehow to blame... but both 10.0.0 and 10.1.0 reference the same Chromium revision, 884014.

There's not a ton of delta between puppeteer 10.0.0 and 10.1.0, but the answer is in there somewhere.

I've recently created a testbed (https://github.com/MartinFalatic/puppeteer-explorations) that I used to replicate the issue. I was able to replicate the failure with your jpg as well as (separately) my own problematic jpg.

So far there's no obvious reason. I also ran both images through http://exif.regex.info/ and found nothing notable, except that my other assumptions (that EXIF metadata, the colorspace, or some quirk of JPEG vs. JPG was to blame) now seem much less likely.

@tjacobs3
Contributor Author

tjacobs3 commented Dec 21, 2021

Thanks for the info! I was able to step through the commits and found that this was the cause:

e3699e2

I'm not too familiar with what's going on there but I'll try debugging when I have some time.

@MartinFalatic

The file that fails for me is only 70,361 bytes (it's a 300x225 image), so whatever is going on, it's not about the size of the image itself (though I wonder if there's some edge condition at work here, like some exact multiple of bytes in the image data).

It's particularly vexing because it doesn't seem to cause a visible error other than truncating the PDF.

MartinFalatic added a commit to MartinFalatic/puppeteer-explorations that referenced this issue Dec 22, 2021
Also replicates the bug at puppeteer/puppeteer#7757 and rules out a Chrome version change as the problem (the version of Chrome didn't change between 10.0.0 and 10.1.0)
@MartinFalatic

MartinFalatic commented Dec 22, 2021

So, I did a bit of experimentation tonight...

In node_modules\puppeteer\lib\cjs\puppeteer\common\helper.js (from 10.1.0) within my Docker container (corresponding to https://github.com/puppeteer/puppeteer/blob/v10.1.0/src/common/helper.ts#L363) I wedged in the following line:

size = 16384*2;

I noticed that size is 16384 (aka highWaterMark as noted in https://nodejs.org/docs/latest-v16.x/api/stream.html#new-streamreadableoptions)

Without that line, things fail as usual for the two files I know to be problematic. With that tweak? The PDF generates normally.

This reinforces my suspicion that this is not about the content but about the specific size of the content, though I haven't yet dumped the streams to see what the boundary conditions are (however, both files have odd lengths, so they're not exact multiples of size in any obvious way; but there may be overhead...).

Knowing this problem was also in 13.0.0, I see that there's a bugfix in 13.0.1 that affects this area of the code - but it doesn't help with this problem:

5b792de#diff-8a47ddda6f4b58ba812e57c7bbef53b97bd7297359e3f8c07bedd10e5baafb6e

Edit: response.data is indeed base64, which is food for thought here.

Edit2: I was able to decode and dump the response.data for each PDF as it was being processed so it's definitely matching expectations there. What's strange is how the response is throwing a premature EOF if size is <16384*2 (approximately). It's also strange that this is totally dependent on the image being present as-is, somewhere in the PDF.

I'm not sure at what point the images get base64 encoded in the first place, but whatever does it seems to respect the highWaterMark value, which suggests to me that it's somehow causing an EOF condition when it processes an image that happens to be an edge case (because the premature ending of the stream is detected as an EOF).

One interesting thing about your image is that the PDF stream contains the header of your image (as it's literally copied into the PDF), but after that it abruptly ends - right before the start of the image data entropy encoded segment (see https://github.com/corkami/formats/blob/master/image/JPEGRGB_dissected.png). That doesn't explain what's going on here (because there's definitely a dependence on the highWaterMark too), but I doubt it's a coincidence.
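As an aside, that boundary is easy to check mechanically: a JPEG's entropy-coded data begins right after the Start of Scan (SOS) marker, 0xFF 0xDA. A minimal sketch (the buffers below are synthetic stand-ins, not the attached image):

```javascript
// Scan a JPEG buffer for the Start of Scan (SOS, 0xFF 0xDA) marker, which
// precedes the entropy-coded image data. Useful for checking whether a
// PDF-embedded JPEG was cut off before its pixel data.
function findSosOffset(buf) {
  for (let i = 0; i + 1 < buf.length; i++) {
    if (buf[i] === 0xff && buf[i + 1] === 0xda) return i;
  }
  return -1; // no SOS marker: the stream ended before the image data
}

// A synthetic JPEG prefix: SOI, a dummy APP0 segment, then SOS.
const jpeg = Buffer.from([0xff, 0xd8, 0xff, 0xe0, 0x00, 0x04, 0x4a, 0x46, 0xff, 0xda]);
console.log(findSosOffset(jpeg)); // 8

const truncated = jpeg.slice(0, 6); // cut off before the SOS marker
console.log(findSosOffset(truncated)); // -1
```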

@MartinFalatic

MartinFalatic commented Dec 22, 2021

One more bit of info for today - puppeteer does give Chrome a hint as to how much data it wants at https://github.com/puppeteer/puppeteer/blob/v10.1.0/src/common/helper.ts#L364, specifically:

const response = await client.send('IO.read', { handle, size });

This accesses the Chrome DevTools protocol and uses IO.read (https://chromedevtools.github.io/devtools-protocol/1-3/IO/#method-read) to request up to a specified number of bytes from it (16384 by default).

I'm starting to wonder if Chrome is prematurely breaking the stream here (for no good reason evidently). If no size is given to that call, it'll send quite a lot of data in one go (at least 2.5MB, probably a lot more), but that's not tenable (it should have some reasonable chunk limit).

A way to mitigate this would be to check for the proper end-of-file magic in the PDF, then retry with a different size value, but of course that's playing whack-a-mole with this problem. If that size value were a hint to Page.pdf(), one could perform this validation on the fly and retry with different parameters.
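That end-of-file magic check could be sketched like this (a hypothetical helper, not part of puppeteer's API). A well-formed PDF ends with a %%EOF marker, optionally followed by whitespace:

```javascript
// Check that a generated PDF buffer ends with the "%%EOF" marker before
// accepting it. A missing marker is a strong hint the stream was truncated.
function looksComplete(pdfBuffer) {
  // The PDF spec allows trailing whitespace after %%EOF, so search the tail.
  const tail = pdfBuffer.slice(-1024).toString('latin1');
  return tail.includes('%%EOF');
}

const good = Buffer.from('%PDF-1.4\n...content...\n%%EOF\n');
const truncatedPdf = Buffer.from('%PDF-1.4\n...content...');
console.log(looksComplete(good)); // true
console.log(looksComplete(truncatedPdf)); // false
```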

@tjacobs3
Contributor Author

tjacobs3 commented Dec 22, 2021

Yeah, this seems to indicate that it's a bug on the Chrome side.

I do see that prior to that feature being added, size was not passed to IO.read (see https://github.com/puppeteer/puppeteer/blob/v10.0.0/src/common/helper.ts#L329). Maybe it's fine to just remove it until an issue is filed and resolved on the Chrome side.

@MartinFalatic

It's up to the devs. Even a massive value (say 8 MB) might make the problem space much smaller. Then again, it's not clear what Chrome isn't liking here either.

tjacobs3 added a commit to tjacobs3/puppeteer that referenced this issue Dec 22, 2021
When defining a chunk size for <CDPSession>.send('IO.read', { handle, size }), the CDPSession will occasionally indicate that it has reached the end of file without sending a full pdf. This is documented by the associated issue.

This behavior is not reproducible when leaving out the size parameter. Since the size parameter is not required on the CDPSession side and is merely a suggestion on the stream side, we can safely leave it out.

Refs: puppeteer#7757
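The shape of the patched read loop can be sketched as follows; the CDP client here is faked so the example runs standalone, while the real loop lives in src/common/helper.ts:

```javascript
// Simulation of the helper.ts stream-read loop. Per the fix, `size` is
// omitted from the IO.read params so Chrome picks its own chunk size.
async function readStreamToBuffer(client, handle) {
  const chunks = [];
  let eof = false;
  while (!eof) {
    const response = await client.send('IO.read', { handle }); // no `size`
    eof = response.eof;
    chunks.push(Buffer.from(response.data, response.base64Encoded ? 'base64' : 'latin1'));
  }
  return Buffer.concat(chunks);
}

// Fake CDP client that serves a payload in 8-byte base64 chunks.
function fakeClient(payload) {
  let offset = 0;
  return {
    async send(method, params) {
      const chunk = payload.slice(offset, offset + 8);
      offset += chunk.length;
      return { data: chunk.toString('base64'), base64Encoded: true, eof: offset >= payload.length };
    },
  };
}

(async () => {
  const out = await readStreamToBuffer(fakeClient(Buffer.from('%PDF-1.4 hello %%EOF')), 'h1');
  console.log(out.toString()); // %PDF-1.4 hello %%EOF
})();
```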
@MartinFalatic

MartinFalatic commented Dec 22, 2021

It's not clear to me what benefit was sought in making the original changes that landed in 10.1.0 to use streaming for the PDFs.

According to the same Chrome devtools docs for Page.printToPDF (https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-printToPDF), the IO.StreamHandle is still experimental, in big bold letters. All of this stems from the use of that handle (which seems to be pretty flaky in certain cases when size is specified, and may well be flaky in other cases yet to be determined). At least your branch should clear up the present problem, which is good!

If I can manage to get enough data on this, and it turns out to not be specific to Puppeteer, it would be worth filing a bug with the Chromium team.

Edit: Then again, puppeteer is by the selfsame team, so... hopefully someone there sees this bug report.

@OrKoN
Collaborator

OrKoN commented Jan 25, 2022

@MartinFalatic

> I think that might fix it https://chromium-review.googlesource.com/c/chromium/src/+/3413074

That looks promising, though it also looks like it's a long way from being in a release we can use.

@OrKoN
Collaborator

OrKoN commented Jan 26, 2022

#7868 won't actually fix the problem; it will just move the bug to the 10MB boundary, which makes it less likely to happen (because fewer files are bigger than that). We can still land the fix until the Chromium fix arrives.

OrKoN pushed a commit that referenced this issue Jan 26, 2022
When defining a chunk size for <CDPSession>.send('IO.read', { handle, size }), the CDPSession will occasionally indicate that it has reached the end of file without sending a full pdf. This is documented by the associated issue.

This behavior is not reproducible when leaving out the size parameter. Since the size parameter is not required on the CDPSession side and is merely a suggestion on the stream side, we can safely leave it out.

Issues: #7757
@OrKoN
Collaborator

OrKoN commented Jan 28, 2022

So the Chromium fix landed in M100, and I plan to create a new Puppeteer release early next week.

@OrKoN OrKoN self-assigned this Jan 28, 2022
@cut2run

cut2run commented Mar 18, 2022

Chromium 100 is already part of Puppeteer. Do you have any ETA on when this ticket will be handled and #7868 will be reverted?

@MartinFalatic

Interesting, given that Chrome 100 is still in beta. When is that going to be on stable/main?

http://dl.google.com/linux/chrome/deb stable/main amd64 google-chrome-stable amd64 99.0.4844.74-1

@OrKoN
Collaborator

OrKoN commented Mar 18, 2022

@MartinFalatic

I had no idea that page existed - thanks!

@cut2run

cut2run commented May 6, 2022

The fix for the original issue has been reverted for a couple of months now, but I don't see it in any recent release, and the ticket itself is still in the "Open" state. Would you mind clarifying when this will be part of a release?

@OrKoN
Collaborator

OrKoN commented May 9, 2022

I believe this is fixed now.

@OrKoN OrKoN closed this as completed May 9, 2022