Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated JPEG magic number #4707

Merged
merged 9 commits into from
Jun 22, 2020
Merged

Conversation

Cykooz
Copy link
Contributor

@Cykooz Cykooz commented Jun 18, 2020

Current version of function JpegImagePlugin._accept() uses a very primitive "magic number" to check that file may be a JPEG-file:

def _accept(prefix):
    return prefix[0:1] == b"\377"

In some cases this leads to reading an entire file to decide it's not a JPEG-file. (Actually, it's a more complicated problem. Maybe I'll do another pull-request.)

Other libraries for decode JPEG-files uses more complex value of "magic number" - b'\xFF\xD8\xFF'. This value is also given on Wikipedia (https://en.wikipedia.org/wiki/JPEG).

@radarhere radarhere added the JPEG label Jun 18, 2020
@@ -706,6 +706,26 @@ def test_icc_after_SOF(self):
with Image.open("Tests/images/icc-after-SOF.jpg") as im:
assert im.info["icc_profile"] == b"profile"

def test_reading_not_whole_file_for_define_it_type(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'for define it type'... sorry, this confuses me. What is this name intended to communicate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... English is not my native language. May be more correct to name this test as "test_not_read_entire_file_to_determine_its_type".

@radarhere
Copy link
Member

I've created Cykooz#1 with some suggestions.

@radarhere
Copy link
Member

Could you give a brief description of the 'more complicated problem' you allude to?

@Cykooz
Copy link
Contributor Author

Cykooz commented Jun 20, 2020

My initial problem is that Pillow, in some cases, reads an entire file just to determine its type.
I found that this problem comes from JpegImagePlugin. Look at this "infinite" loop

while True:
i = i8(s)
if i == 0xFF:
s = s + self.fp.read(1)
i = i16(s)
else:
# Skip non-0xFF junk
s = self.fp.read(1)
continue
if i in MARKER:
name, description, handler = MARKER[i]
if handler is not None:
handler(self, i)
if i == 0xFFDA: # start of scan
rawmode = self.mode
if self.mode == "CMYK":
rawmode = "CMYK;I" # assume adobe conventions
self.tile = [("jpeg", (0, 0) + self.size, 0, (rawmode, ""))]
# self.__offset = self.fp.tell()
break
s = self.fp.read(1)
elif i == 0 or i == 0xFFFF:
# padded marker or junk; move on
s = b"\xff"
elif i == 0xFF00: # Skip extraneous data (escaped 0xFF)
s = self.fp.read(1)
else:
raise SyntaxError("no marker found")

This loop may read an entire file if content of file is something like this (with old version of "magic number"):

  • many FF bytes;
  • FF 00 FF 00 FF 00 ... FF 00;
  • FF 00 00 00 ... 00.

Correct "magic number" doesn't fix this problem completely. But it reduces the likelihood of encountering a problem. I am not an expert in JPEG-format and I don't know how many bytes need to read from a file to decide it's not a JPEG file.

@radarhere
Copy link
Member

My initial problem is that Pillow, in some cases, reads an entire file just to determine its type.

I wouldn't phrase it exactly like this. It's not in order to determine its type. Pillow is attempting to open the file, hits an error and then gives up on the idea that it is a valid JPEG. So a large amount of the file has to be read to determine its validity.

@Cykooz
Copy link
Contributor Author

Cykooz commented Jun 22, 2020

So a large amount of the file has to be read to determine its validity.

In my app, reading a file is a very expensive operation. Reading 3-4 Mb data from a file may take several seconds. This may be used in DoS attacks.
But I will not argue if you are sure that a JPEG file can contain a lot of "garbage" at the beginning of the file.

@radarhere
Copy link
Member

We guard against images with large dimensions. We don't presently guard against images with large file sizes.

Setting aside the matter of 0xFF bytes for one second, part of the problem is that Pillow reads data from before the SOS marker on the initial open. Without changing that, the only way to solve your dilemma would be if valid JPEGs only allowed a limited number of markers before SOS.

Considering that

  • we have encountered an image with multiple APP1 segments - Only extract first Exif segment #2946 - and we would presumably like to be flexible enough to handle such a file
  • and going through our test suite, I can see that APP1 markers can come before the SOS marker

I don't see a way forward there.

@radarhere radarhere merged commit 926af72 into python-pillow:master Jun 22, 2020
@radarhere radarhere changed the title Replaced primitive "magic number" inside of JpegImagePlugin._accept() function by more correct version Updated JPEG magic number Jun 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants