TIFF with JPEG Compression not readable by Tesseract #540

Open
keinhaar opened this issue May 12, 2020 · 15 comments

@keinhaar

keinhaar commented May 12, 2020

If I create a multipage TIFF with JPEG compression, Tesseract cannot read it.
It gives this error:
"Error in pixReadFromTiffStream: bad tiff file: tiffbpl is too small"
Other compressions like LZW or Deflate work just fine.

GIMP also gives an error, but still opens the TIFF. I'll try to translate the message, because my GIMP is set to German. Something like "Incompatible TIFF: Additional Channels without Field ExtraSamples"

My code looks like this

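            import java.awt.image.BufferedImage;
            import java.util.Iterator;
            import javax.imageio.IIOImage;
            import javax.imageio.ImageIO;
            import javax.imageio.ImageWriteParam;
            import javax.imageio.ImageWriter;
            import javax.imageio.stream.ImageOutputStream;

            // "target" is the output, "tiffs" holds the inputs to combine.
            Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
            ImageWriter writer = writers.next();
            ImageOutputStream out = ImageIO.createImageOutputStream(target);
            writer.setOutput(out);
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType("JPEG");
            param.setCompressionQuality(0.9f);
            writer.prepareWriteSequence(null);
            for (int i = 0; i < tiffs.length; i++)
            {
                // Each input becomes one page of the multipage TIFF.
                BufferedImage raster = ImageIO.read(tiffs[i]);
                IIOImage image = new IIOImage(raster, null, null);
                writer.writeToSequence(image, param);
            }
            writer.endWriteSequence();
            writer.dispose();
            out.close(); // flush and release the output stream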

Is there something wrong with my code?
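
Since LZW and Deflate are reported above to work, one possible stopgap is to pick the compression type defensively instead of hard-coding "JPEG". The following is only a sketch: the class name `CompressionChoice` and the method `choose` are hypothetical, and the type names "LZW" and "Deflate" are taken from the wording of this report, so they may need adjusting to whatever `getCompressionTypes()` actually returns for the installed TIFF writer.

```java
import java.util.Arrays;
import java.util.List;
import javax.imageio.ImageWriteParam;

// Hypothetical helper, not part of any library.
final class CompressionChoice {
    /**
     * Selects the first compression type from "preferred" that the given
     * write param supports, and configures the param to use it.
     * Returns the chosen name, or null if none of them is supported.
     */
    static String choose(ImageWriteParam param, String... preferred) {
        if (!param.canWriteCompressed()) {
            return null; // this writer cannot compress at all
        }
        String[] supported = param.getCompressionTypes();
        if (supported == null) {
            return null; // no named compression types to choose from
        }
        List<String> names = Arrays.asList(supported);
        for (String candidate : preferred) {
            if (names.contains(candidate)) {
                // The mode must be explicit before a type can be set.
                param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
                param.setCompressionType(candidate);
                return candidate;
            }
        }
        return null;
    }
}
```

In the code above, this would replace the two `setCompressionMode`/`setCompressionType` calls, for example `CompressionChoice.choose(param, "LZW", "Deflate")`.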

@Schmidor
Contributor

I might be wrong, but could it be that your source image has 4 channels (RGB plus alpha), and the writer has issues with the alpha channel when writing with JPEG compression?

@keinhaar
Author

No, I checked that. The BufferedImage type is TYPE_3BYTE_BGR.
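
For anyone wanting to repeat that check, a minimal sketch is shown here. It only uses standard `java.awt.image` calls; the class name `ImageChannels` and the method `describe` are made up for illustration.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of any library.
final class ImageChannels {
    /** Summarizes the properties that matter for the alpha question. */
    static String describe(BufferedImage image) {
        int bands = image.getRaster().getNumBands();       // samples per pixel
        boolean alpha = image.getColorModel().hasAlpha();   // is there an alpha channel?
        boolean premultiplied = image.isAlphaPremultiplied();
        return "type=" + image.getType()
                + ", bands=" + bands
                + ", alpha=" + alpha
                + ", premultiplied=" + premultiplied;
    }
}
```

Calling `ImageChannels.describe(ImageIO.read(file))` on each source page before writing should show 3 bands and no alpha for a TYPE_3BYTE_BGR image.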

@Schmidor
Contributor

Schmidor commented May 12, 2020

Could you please attach some samples of the sources and a generated image?
I would like to check them. If it's something on the level of the metadata / structure I might be able to help. If it's deeper in the JPEG compression you definitely have to wait for Harald :)

@keinhaar
Author

Here it is. Hopefully this will help to find the issue.

"img001.tif" and "img002.tif" are combined into "target.tiff" by the Java class.

I run Tesseract 4.1.1 on target.tiff like this:

tesseract target.tiff out.pdf pdf

@keinhaar
Author

Sorry, the file was too large. Here it is again.
sample.zip

@Schmidor
Contributor

The output file looks pretty normal. There's nothing unexpected there.

But I can say that GIMP throws this error for every TIFF with JPEG compression I could find. The images are read normally, so I have no idea what it could interpret as an extra sample.
Is it possible Tesseract just doesn't support JPEG-compressed TIFFs?

@keinhaar
Author

I saved img001.tif with GIMP as a new TIFF file with JPEG compression, then reopened it. There is no warning, so this is not always the case.
After that I ran Tesseract on that file, and it works without problems. So it seems that JPEG compression is supported by Tesseract.

haraldk changed the title from "Tiff with JPEG Compression not readable by Tesseract" to "TIFF with JPEG Compression not readable by Tesseract" on May 13, 2020
@haraldk
Owner

haraldk commented May 13, 2020

Thanks guys for looking into this!

@keinhaar Can you attach the same image (target.tiff, with both pages), but after re-saving with GIMP, so I can have a look at the differences?

I don't understand the error message from GIMP either, as the TIFF structure has BitsPerSample: [8,8,8], with PhotometricInterpretation: 6/YCbCr, and the JPEG stream has 8 bit precision, 3 components, standard naming for YCbCr (ids 1, 2 and 3).

Opens fine in all the tools I have available. But... There's always the chance that we have missed something.

--
Harald K

@Schmidor
Contributor

target-gimp.zip
GIMP seems to save some ... interesting ... other things and 4 samples

@keinhaar
Author

I think Schmidor did it already, but to be safe...
target_gimp.tiff.zip
That file works without problems with Tesseract.

@haraldk
Owner

haraldk commented May 13, 2020

Okay...

So GIMP is a bit more sophisticated than our writer, in that it writes JPEGTables and Strips (and stores a lot of "unnecessary" extra information, like document name, thumbnail, Exif and sRGB ICC profile).

But the main differences are that it uses photometric RGB and stores 4 components, where the extra sample is (associated, aka premultiplied) alpha, even though the image is fully opaque. I don't know why GIMP does this, or why Tesseract likes this better, though... Most software I have displays these images the same...

We could probably add some options to force RGB mode for JPEGs... And I think you should get 4 components with associated alpha with the reader as-is, if you use TYPE_INT_ARGB_PRE or TYPE_4BYTE_ABGR_PRE for your images.

(Side note: Despite all the extra information, the GIMP file is about half the size of ours... Probably due to higher JPEG compression, but might be worth looking into...)

--
Harald K

@haraldk
Owner

haraldk commented May 14, 2020

Okay,

I think I found the bug in the GIMP code: file-tiff-load.c:262. It wrongly assumes (from the comment):

All other color space [than RGB] expect 1 channel (grayscale, palette, mask).

That is, it ignores YCbCr (like in our case), Separated (CMYK) and CIELab, which have multiple channels...

It seems the only problem is the warning though; the files (as you mentioned) otherwise load just fine.

Update: Filed GIMP issue 5081.

--
Harald K

@keinhaar
Author

Thanks for these deep insights.

I tried to use another color model as mentioned, but it gives an error when writing the final TIFF.
It seems the JPEGImageWriter uses some native library that does not support other color models. (I'm on Xubuntu Linux)

Exception in thread "main" javax.imageio.IIOException: Invalid argument to native writeImage
	at com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeImage(Native Method)
	at com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeOnThread(JPEGImageWriter.java:1067)
	at com.sun.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:363)
	at com.twelvemonkeys.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:162)
	at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageWriter.writePage(TIFFImageWriter.java:245)
	at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageWriter.writeToSequence(TIFFImageWriter.java:954)
	at de.exware.scan.TiffTool.concatTiffs(TiffTool.java:52)
	at de.exware.scan.TiffTool.main(TiffTool.java:61)

@haraldk
Owner

haraldk commented May 16, 2020

@keinhaar Thanks for trying that out. Maybe you could post your code as a failing test case, and I'll see if this is something that can be fixed?

And yes, ultimately, JPEG read/write is handled by native code, which for any Oracle JVM is a modified libJPEG AFAIK.

Usually, we can get around those issues by writing a raster instead of the full image, and just populating the metadata correctly ourselves (like I did for CMYK JPEG read/write).
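
To illustrate what "writing a raster instead of the full image" means at the ImageIO API level, here is a rough sketch. It is not the library's actual code: the class name `RasterOnly` and the method `wrapRaster` are hypothetical, and the metadata that tells the writer how to interpret the bands is left to the caller (null here means the writer falls back to its own defaults).

```java
import java.awt.image.BufferedImage;
import java.awt.image.Raster;
import javax.imageio.IIOImage;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;

// Hypothetical helper, not part of any library.
final class RasterOnly {
    /**
     * Wraps only the pixel data of "image" in an IIOImage, dropping its
     * ColorModel, so the writer sees plain bands plus the given metadata.
     */
    static IIOImage wrapRaster(ImageWriter writer, BufferedImage image, IIOMetadata metadata) {
        if (!writer.canWriteRasters()) {
            throw new IllegalArgumentException("Writer does not accept raster-only images");
        }
        Raster raster = image.getRaster(); // the raw samples, without color interpretation
        return new IIOImage(raster, null, metadata);
    }
}
```

The resulting `IIOImage` is then passed to `writer.write(null, iioImage, param)` as usual; the point of the design is that the color interpretation comes from the metadata rather than from the `BufferedImage`.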

--
Harald K

@keinhaar
Author

The code is still the same as in sample.zip. I just created a new BufferedImage of the type you requested and drew the original image into it with the Graphics2D context.
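
The conversion described in this comment might look roughly like the sketch below. The class name `PremultipliedCopy` and the method `toPremultipliedAbgr` are invented for illustration, and TYPE_4BYTE_ABGR_PRE is just one of the two types suggested earlier in the thread. Note that, per the stack trace above, passing such an image to the TIFF writer with JPEG compression failed with "Invalid argument to native writeImage", so this only reproduces the failing setup.

```java
import java.awt.AlphaComposite;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of any library.
final class PremultipliedCopy {
    /** Copies "source" into a new image with premultiplied alpha. */
    static BufferedImage toPremultipliedAbgr(BufferedImage source) {
        BufferedImage copy = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_4BYTE_ABGR_PRE);
        Graphics2D g2d = copy.createGraphics();
        try {
            // Src replaces the destination pixels instead of blending with them.
            g2d.setComposite(AlphaComposite.Src);
            g2d.drawImage(source, 0, 0, null);
        } finally {
            g2d.dispose();
        }
        return copy;
    }
}
```

In the loop from the original report, the page would then be wrapped as `new IIOImage(PremultipliedCopy.toPremultipliedAbgr(raster), null, null)`.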
