[Bug]: page.pdf produces corrupt pdf #7757
Comments
I'm tracking a very similar issue with headless page rendering. There's not a ton of delta between Puppeteer 10.0.0 and 10.1.0, but the answer is in there somewhere. I've recently created a testbed (https://github.com/MartinFalatic/puppeteer-explorations) that I used to replicate the issue. I was able to replicate the failure with your jpg as well as (separately) my own problematic jpg. So far there's no obvious reason. I also ran both images through http://exif.regex.info/ and got nothing obvious, except that my other assumptions (that the problem was EXIF metadata, or maybe the colorspace, or some quirk of JPEG vs. JPG) now seem much less likely.
The file that fails for me is only 70,361 bytes (it's a 300x225 image), so whatever is going on, it's not about the size of the image itself (though I wonder if there's some edge condition at work here, like some exact multiple of bytes in the image data). It's particularly vexing because it doesn't cause any visible error other than truncating the PDF.
This also replicates the bug at puppeteer/puppeteer#7757 and rules out a Chrome version change as the cause (the version of Chrome didn't change between 10.0.0 and 10.1.0).
So, I did a bit of experimentation tonight...

I noticed that without that line, things fail as usual for the two files I know to be problematic. With that tweak? The PDF generates normally. This reinforces my suspicion that this is not about the content but about the specific size of the content, though I haven't yet dumped the streams to see what the boundary conditions are (however, both files are odd lengths, so they're not exact multiples of the chunk size).

Knowing this problem was also in 13.0.0, I see that there's a bugfix in 13.0.1 that affects this area of the code, but it doesn't help with this problem: 5b792de#diff-8a47ddda6f4b58ba812e57c7bbef53b97bd7297359e3f8c07bedd10e5baafb6e

Edit2: I was able to decode and dump the response.data for each PDF as it was being processed, so it's definitely matching expectations there. What's strange is how the response throws a premature EOF if size is < 16384*2 (approximately). It's also strange that this is totally dependent on the image being present, as-is, somewhere in the PDF. I'm not sure at what point the images get base64-encoded in the first place, but whatever does it seems to respect the size hint.

One interesting thing about your image is that the PDF stream contains the header of your image (as it's literally copied into the PDF), but after that it ends abruptly, right before the start of the image's entropy-coded data segment (see https://github.com/corkami/formats/blob/master/image/JPEGRGB_dissected.png). That doesn't explain what's going on here (because there's definitely a dependence on the size as well).
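For anyone wanting to poke at the same boundary, here is a hedged sketch (not from Puppeteer; all names are illustrative) of walking a JPEG's segment markers up to the Start-of-Scan marker, which is where the entropy-coded data mentioned above begins — handy for seeing exactly where a truncated PDF stream cut the image off:

```javascript
// Hypothetical helper: list JPEG segment markers up to Start-of-Scan (0xFFDA),
// after which the entropy-coded image data begins.
function jpegMarkersBeforeScan(buf) {
  const markers = [];
  // A JPEG starts with SOI (0xFFD8), which has no length field.
  if (buf[0] !== 0xff || buf[1] !== 0xd8) return markers;
  markers.push({ marker: 0xffd8, offset: 0 });
  let i = 2;
  while (i + 3 < buf.length) {
    if (buf[i] !== 0xff) break; // not a marker: malformed or truncated input
    const marker = (buf[i] << 8) | buf[i + 1];
    markers.push({ marker, offset: i });
    if (marker === 0xffda) break; // SOS reached; entropy-coded segment follows
    // Other segments carry a 2-byte big-endian length that includes itself.
    const len = (buf[i + 2] << 8) | buf[i + 3];
    i += 2 + len;
  }
  return markers;
}

// Tiny synthetic header: SOI, an APP0 segment with 2 payload bytes, then SOS.
const sample = Buffer.from([
  0xff, 0xd8,                         // SOI
  0xff, 0xe0, 0x00, 0x04, 0x00, 0x00, // APP0, length 4
  0xff, 0xda, 0x00, 0x02,             // SOS (entropy data would follow)
]);
console.log(jpegMarkersBeforeScan(sample).map(m => m.marker.toString(16)));
// -> [ 'ffd8', 'ffe0', 'ffda' ]
```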
One more bit of info for today: Puppeteer does give Chrome a hint as to how much data it wants, at https://github.com/puppeteer/puppeteer/blob/v10.1.0/src/common/helper.ts#L364, specifically:
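This is not the verbatim helper (see the link above for the exact source), but the chunked read it performs looks roughly like the following; `DEFAULT_CHUNK_SIZE` and `mockSession` are illustrative stand-ins, not Puppeteer's own names:

```javascript
// Hedged paraphrase of a chunked CDP stream read. `client` stands in for the
// CDPSession; the chunk size here is illustrative, not Puppeteer's value.
const DEFAULT_CHUNK_SIZE = 16384;

async function readStream(client, handle, size = DEFAULT_CHUNK_SIZE) {
  const chunks = [];
  let eof = false;
  while (!eof) {
    // Per the DevTools protocol, IO.read returns { data, eof, base64Encoded? }.
    const response = await client.send('IO.read', { handle, size });
    eof = response.eof;
    chunks.push(Buffer.from(response.data, response.base64Encoded ? 'base64' : 'utf8'));
  }
  await client.send('IO.close', { handle });
  return Buffer.concat(chunks);
}

// Mock session for demonstration: serves a payload in `size`-byte slices.
function mockSession(payload) {
  let pos = 0;
  return {
    async send(method, params) {
      if (method === 'IO.close') return {};
      const chunk = payload.slice(pos, pos + params.size);
      pos += chunk.length;
      return { data: chunk.toString('base64'), base64Encoded: true, eof: pos >= payload.length };
    },
  };
}

// Usage:
readStream(mockSession(Buffer.from('hello world')), 'h1', 4)
  .then(buf => console.log(buf.toString())); // prints "hello world"
```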
This accesses the Chrome DevTools Protocol and uses I'm starting to wonder if Chrome is prematurely breaking the stream here (for no good reason, evidently). If no size is given to that call, it'll send quite a lot of data in one go (at least 2.5 MB, probably a lot more), but that's not tenable (it should have some reasonable chunk limit). A way to mitigate this would be to check for the proper end-of-file magic in the PDF, then retry with a different size value, but of course that's playing whack-a-mole with this problem. If that size value was a hint to
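The end-of-file check floated above could look roughly like this (a hedged sketch; `looksLikeCompletePdf` is a hypothetical helper, not Puppeteer API):

```javascript
// Hypothetical truncation check: a well-formed PDF ends with the "%%EOF"
// trailer, possibly followed by whitespace, so a missing trailer at the tail
// of the buffer suggests the stream was cut short.
function looksLikeCompletePdf(buf) {
  // Only the tail matters; tolerate trailing newlines.
  const tail = buf.slice(Math.max(0, buf.length - 32)).toString('latin1');
  return /%%EOF\s*$/.test(tail);
}

console.log(looksLikeCompletePdf(Buffer.from('%PDF-1.4 ... %%EOF\n'))); // true
console.log(looksLikeCompletePdf(Buffer.from('%PDF-1.4 ... truncat')));  // false
```

A retry wrapper around the read loop could use this to detect the bad case, though as noted above that is whack-a-mole rather than a real fix.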
Yeah, this seems to indicate that it's a bug on the Chrome side. I do see that prior to that feature being added,
It's up to the devs. Even if it were massive (say 8 MB), it might make the problem space much smaller. Then again, it's not clear what Chrome isn't liking here either.
When defining a chunk size for <CDPSession>.send('IO.read', { handle, size }), the CDPSession will occasionally indicate that it has reached the end of the file without sending a full PDF. This is documented in the associated issue. This behavior is not reproducible when leaving out the size parameter. Since the size parameter is not required on the CDPSession side and is merely a suggestion on the stream side, we can safely leave it out. Refs: puppeteer#7757
It's not clear to me what benefit was sought in making the original changes that landed in 10.1.0 to use streaming for the PDFs. According to the same Chrome DevTools docs for If I can manage to get enough data on this, and it turns out not to be specific to Puppeteer, it would be worth filing a bug with the Chromium team. Edit: Then again, Puppeteer is by the selfsame team, so... hopefully someone there sees this bug report.
I think that might fix it: https://chromium-review.googlesource.com/c/chromium/src/+/3413074
That looks promising, though it also looks like it's a long way from being in a release we can use.
#7868 won't actually fix the problem; it will just move the bug to the 10 MB boundary, which makes it less likely to happen (because fewer files are bigger than that). We can still land the fix until the Chromium fix arrives.
When defining a chunk size for <CDPSession>.send('IO.read', { handle, size }), the CDPSession will occasionally indicate that it has reached the end of file without sending a full pdf. This is documented by the associated issue. This behavior is not reproducible when leaving out the size parameter. Since the size parameter is not required on the CDPSession side and is merely a suggestion on the stream side, we can safely leave it out. Issues: #7757
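Concretely, the landed change amounts to dropping the size argument from the IO.read call, so Chrome picks its own chunk sizes. A hedged sketch of the resulting loop (names illustrative, not Puppeteer's source):

```javascript
// Sketch of the fixed read loop: no size hint is passed to IO.read, avoiding
// the premature-EOF behavior described in the issue.
async function readWholeStream(client, handle) {
  const chunks = [];
  let eof = false;
  while (!eof) {
    const response = await client.send('IO.read', { handle }); // no `size`
    eof = response.eof;
    chunks.push(Buffer.from(response.data, response.base64Encoded ? 'base64' : 'utf8'));
  }
  await client.send('IO.close', { handle });
  return Buffer.concat(chunks);
}

module.exports = { readWholeStream };
```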
So, the Chromium fix landed in M100, and I plan to create a new Puppeteer release early next week.
Chromium 100 is already part of Puppeteer. Do you have any ETA on when this ticket will be handled and #7868 reverted?
Interesting, given that Chrome 100 is still in beta. When is that going to be on stable/main?
I had no idea that page existed - thanks!
The fix for the original issue was reverted a couple of months ago, but I don't see it in any recent release and the ticket itself is still in the "Open" state. Would you mind clarifying when this will be part of a release?
I believe this is fixed now.
Bug description
Occasionally, we find that our PDFs are not openable by any program. We've narrowed the issue down to the inclusion of certain images. When these images are present, the PDF created by Puppeteer is corrupt. No image tool indicates that anything is wrong with the image itself, so I believe this is an issue on the Puppeteer side.

Steps to reproduce the problem:

1. Create a test.html file with the following contents
2. In the same directory, place the attached image.jpg
3. Create a save_to_pdf.js file with the following contents
4. Run node save_to_pdf.js
5. Try to open out.pdf in any program.

Attachment: puppeteer_bug.zip (contains image.jpg)
Puppeteer version
10.1.0
Node.js version
12.22.5
npm version
6.14.14
What operating system are you seeing the problem on?
macOS
Relevant log output
No response