
[Bug]: page.pdf produces corrupt pdf #7757

Closed
tjacobs3 opened this issue Nov 9, 2021 · 18 comments

@tjacobs3
Contributor

tjacobs3 commented Nov 9, 2021

Bug description

Steps to reproduce the problem:

Occasionally, we find that our PDFs cannot be opened by any program. We've narrowed the issue down to the inclusion of certain images. When these images are present, the PDF created by puppeteer is corrupt. No image tool reports anything wrong with the image itself, so I believe this is an issue on the puppeteer side.

  1. Create a test.html file with the following contents:
<html>
  <head>
  </head>
  <body>
    <img src="image.jpg">
  </body>
</html>

In the same directory, place the attached image.jpg

  2. Create a save_to_pdf.js file with the following contents:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(`file://${__dirname}/test.html`, { waitUntil: 'networkidle0', timeout: 60000 });
  await page.pdf({
    path: 'out.pdf',
    printBackground: true
  });

  await browser.close();
})();
  3. Run node save_to_pdf.js
  4. Try to open out.pdf in any program.
    [Screenshot: out.pdf fails to open, 2021-11-09 10:22 AM]

puppeteer_bug.zip
image.jpg

Puppeteer version

10.1.0

Node.js version

12.22.5

npm version

6.14.14

What operating system are you seeing the problem on?

macOS

Relevant log output

No response

@tjacobs3 tjacobs3 added the bug label Nov 9, 2021
@MartinFalatic

I'm tracking a very similar issue with headless page pdf() calls creating corrupted files with certain input images, which started after puppeteer 10.0.0, and I've been working on a way to reproduce it. I'd been assuming the version of Chrome was somehow to blame... but both 10.0.0 and 10.1.0 reference the same Chromium revision, 884014.

There's not a ton of delta between puppeteer 10.0.0 and 10.1.0, but the answer is in there somewhere.

I've recently created a testbed (https://github.com/MartinFalatic/puppeteer-explorations) that I used to replicate the issue. I was able to replicate the failure with your jpg as well as (separately) my own problematic jpg.

So far there's no obvious reason. I also ran both images through http://exif.regex.info/ and found nothing notable, except that my other assumptions (that EXIF metadata, the colorspace, or some quirk of JPEG vs. JPG was to blame) now seem much less likely.

@tjacobs3
Contributor Author

tjacobs3 commented Dec 21, 2021

Thanks for the info! I was able to step through the commits and found that this was the cause:

e3699e2

I'm not too familiar with what's going on there but I'll try debugging when I have some time.

@MartinFalatic

The file that fails for me is only 70,361 bytes (it's a 300x225 image), so whatever is going on, it's not about the size of the image itself (though I wonder if there's some edge condition at work here, like some exact multiple of bytes in the image data).

It's particularly vexing because it doesn't seem to cause a visible error other than truncating the PDF.

MartinFalatic added a commit to MartinFalatic/puppeteer-explorations that referenced this issue Dec 22, 2021
Also replicates the bug at puppeteer/puppeteer#7757 and rules out a Chrome version change as the problem (the version of Chrome didn't change between 10.0.0 and 10.1.0)
@MartinFalatic

MartinFalatic commented Dec 22, 2021

So, I did a bit of experimentation tonight...

In node_modules\puppeteer\lib\cjs\puppeteer\common\helper.js (from 10.1.0) within my Docker container (corresponding to https://github.com/puppeteer/puppeteer/blob/v10.1.0/src/common/helper.ts#L363) I wedged in the following line:

size = 16384*2;

I noticed that size is 16384 (aka highWaterMark as noted in https://nodejs.org/docs/latest-v16.x/api/stream.html#new-streamreadableoptions)

Without that line, things fail as usual for the two files I know to be problematic. With that tweak? The PDF generates normally.

This reinforces my suspicion that this is not about the content but about the specific size of the content, though I haven't yet dumped the streams to see what the boundary conditions are (however, both files have odd lengths, so they're not exact multiples of size in any obvious way; but there may be overhead...).

Knowing this problem was also in 13.0.0, I see that there's a bugfix in 13.0.1 that affects this area of the code - but it doesn't help with this problem:

5b792de#diff-8a47ddda6f4b58ba812e57c7bbef53b97bd7297359e3f8c07bedd10e5baafb6e

Edit: response.data is indeed base64, which is food for thought here.

Edit2: I was able to decode and dump the response.data for each PDF as it was being processed so it's definitely matching expectations there. What's strange is how the response is throwing a premature EOF if size is <16384*2 (approximately). It's also strange that this is totally dependent on the image being present as-is, somewhere in the PDF.

I'm not sure at what point the images get base64 encoded in the first place, but whatever does it seems to respect the highWaterMark value, which suggests to me that it's somehow causing an EOF condition when it processes an image that happens to be an edge case (because the premature ending of the stream is detected as an EOF).

One interesting thing about your image is that the PDF stream contains the header of your image (as it's literally copied into the PDF), but after that it abruptly ends - right before the start of the image data entropy encoded segment (see https://github.com/corkami/formats/blob/master/image/JPEGRGB_dissected.png). That doesn't explain what's going on here (because there's definitely a dependence on the highWaterMark too), but I doubt it's a coincidence.
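As an aside, that boundary is easy to check mechanically: a JPEG's entropy-coded data begins right after the Start of Scan (SOS) marker, 0xFF 0xDA. A minimal sketch (the buffers below are synthetic stand-ins, not the attached image):

```javascript
// Scan a JPEG buffer for the Start of Scan (SOS, 0xFF 0xDA) marker, which
// precedes the entropy-coded image data. Useful for checking whether a
// PDF-embedded JPEG was cut off before its pixel data.
function findSosOffset(buf) {
  for (let i = 0; i + 1 < buf.length; i++) {
    if (buf[i] === 0xff && buf[i + 1] === 0xda) return i;
  }
  return -1; // no SOS marker: the stream ended before the image data
}

// A synthetic JPEG prefix: SOI, a dummy APP0 segment, then SOS.
const jpeg = Buffer.from([0xff, 0xd8, 0xff, 0xe0, 0x00, 0x04, 0x4a, 0x46, 0xff, 0xda]);
console.log(findSosOffset(jpeg)); // 8

const truncated = jpeg.slice(0, 6); // cut off before the SOS marker
console.log(findSosOffset(truncated)); // -1
```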

@MartinFalatic

MartinFalatic commented Dec 22, 2021

One more bit of info for today - puppeteer does give Chrome a hint as to how much data it wants at https://github.com/puppeteer/puppeteer/blob/v10.1.0/src/common/helper.ts#L364, specifically:

const response = await client.send('IO.read', { handle, size });

This accesses the Chrome DevTools protocol and uses IO.read (https://chromedevtools.github.io/devtools-protocol/1-3/IO/#method-read) to request up to a specified number of bytes from it (16384 by default).

I'm starting to wonder if Chrome is prematurely breaking the stream here (for no good reason evidently). If no size is given to that call, it'll send quite a lot of data in one go (at least 2.5MB, probably a lot more), but that's not tenable (it should have some reasonable chunk limit).

A way to mitigate this would be to check for the proper end-of-file magic in the PDF, then retry with a different size value, but of course that's playing whack-a-mole with this problem. If that size value were a hint to Page.pdf(), one could perform this validation on the fly and retry with different parameters.
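That end-of-file magic check could be sketched like this (a hypothetical helper, not part of puppeteer's API). A well-formed PDF ends with a %%EOF marker, optionally followed by whitespace:

```javascript
// Check that a generated PDF buffer ends with the "%%EOF" marker before
// accepting it. A missing marker is a strong hint the stream was truncated.
function looksComplete(pdfBuffer) {
  // The PDF spec allows trailing whitespace after %%EOF, so search the tail.
  const tail = pdfBuffer.slice(-1024).toString('latin1');
  return tail.includes('%%EOF');
}

const good = Buffer.from('%PDF-1.4\n...content...\n%%EOF\n');
const truncatedPdf = Buffer.from('%PDF-1.4\n...content...');
console.log(looksComplete(good)); // true
console.log(looksComplete(truncatedPdf)); // false
```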

@tjacobs3
Contributor Author

tjacobs3 commented Dec 22, 2021

Yeah, this seems to indicate that it's a bug on the Chrome side.

I do see that prior to that feature being added, size was not passed to IO.read (see https://github.com/puppeteer/puppeteer/blob/v10.0.0/src/common/helper.ts#L329). Maybe it's fine to just remove it until an issue is filed and resolved on the Chrome side.

@MartinFalatic

It's up to the devs. Even a massive value (say 8 MB) might make the problem space much smaller. Then again, it's not clear what Chrome isn't liking here either.

tjacobs3 added a commit to tjacobs3/puppeteer that referenced this issue Dec 22, 2021
When defining a chunk size for <CDPSession>.send('IO.read', { handle, size }), the CDPSession will occasionally indicate that it has reached the end of file without sending a full pdf. This is documented by the associated issue.

This behavior is not reproducible when leaving out the size parameter. Since the size parameter is not required on the CDPSession side and is merely a suggestion on the stream side, we can safely leave it out.

Refs: puppeteer#7757
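The shape of the patched read loop can be sketched as follows; the CDP client here is faked so the example runs standalone, while the real loop lives in src/common/helper.ts:

```javascript
// Simulation of the helper.ts stream-read loop. Per the fix, `size` is
// omitted from the IO.read params so Chrome picks its own chunk size.
async function readStreamToBuffer(client, handle) {
  const chunks = [];
  let eof = false;
  while (!eof) {
    const response = await client.send('IO.read', { handle }); // no `size`
    eof = response.eof;
    chunks.push(Buffer.from(response.data, response.base64Encoded ? 'base64' : 'latin1'));
  }
  return Buffer.concat(chunks);
}

// Fake CDP client that serves a payload in 8-byte base64 chunks.
function fakeClient(payload) {
  let offset = 0;
  return {
    async send(method, params) {
      const chunk = payload.slice(offset, offset + 8);
      offset += chunk.length;
      return { data: chunk.toString('base64'), base64Encoded: true, eof: offset >= payload.length };
    },
  };
}

(async () => {
  const out = await readStreamToBuffer(fakeClient(Buffer.from('%PDF-1.4 hello %%EOF')), 'h1');
  console.log(out.toString()); // %PDF-1.4 hello %%EOF
})();
```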
@MartinFalatic

MartinFalatic commented Dec 22, 2021

It's not clear to me what benefit was sought in making the original changes that landed in 10.1.0 to use streaming for the PDFs.

According to the same Chrome devtools docs for Page.printToPDF (https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-printToPDF), the IO.StreamHandle is still experimental, in big bold letters. All of this stems from the use of that handle (which seems to be pretty flaky in certain cases when size is specified, and may well be flaky in other cases yet to be determined). At least your branch should clear up the present problem, which is good!

If I can manage to get enough data on this, and it turns out to not be specific to Puppeteer, it would be worth filing a bug with the Chromium team.

Edit: Then again, puppeteer is by the selfsame team, so... hopefully someone there sees this bug report.

@OrKoN
Collaborator

OrKoN commented Jan 25, 2022

@MartinFalatic

> I think that might fix it https://chromium-review.googlesource.com/c/chromium/src/+/3413074

That looks promising, though it also looks like it's a long way from being in a release we can use.

@OrKoN
Collaborator

OrKoN commented Jan 26, 2022

#7868 won't actually fix the problem; it will just move the bug to the 10MB boundary, which makes it less likely to happen (because fewer files are bigger than that). We can still land the fix until the Chromium fix arrives.

OrKoN pushed a commit that referenced this issue Jan 26, 2022
When defining a chunk size for <CDPSession>.send('IO.read', { handle, size }), the CDPSession will occasionally indicate that it has reached the end of file without sending a full pdf. This is documented by the associated issue.

This behavior is not reproducible when leaving out the size parameter. Since the size parameter is not required on the CDPSession side and is merely a suggestion on the stream side, we can safely leave it out.

Issues: #7757
@OrKoN
Collaborator

OrKoN commented Jan 28, 2022

So the Chromium fix landed in M100, and I plan to create a new Puppeteer release early next week.

@OrKoN OrKoN self-assigned this Jan 28, 2022
@cut2run

cut2run commented Mar 18, 2022

Chromium 100 is already part of Puppeteer. Do you have any ETA on when this ticket will be handled and #7868 will be reverted?

@MartinFalatic

Interesting, given that Chrome 100 is still in beta. When is that going to be on stable/main?

http://dl.google.com/linux/chrome/deb stable/main amd64 google-chrome-stable amd64 99.0.4844.74-1

@OrKoN
Collaborator

OrKoN commented Mar 18, 2022

@MartinFalatic

I had no idea that page existed - thanks!

@cut2run

cut2run commented May 6, 2022

The fix for the original issue has been reverted for a couple of months now, but I don't see it in any recent release, and the ticket itself is still in the "Open" state. Would you mind clarifying when this will be part of a release?

@OrKoN
Collaborator

OrKoN commented May 9, 2022

I believe this is fixed now.

@OrKoN OrKoN closed this as completed May 9, 2022