Updated JPEG magic number #4707

Cykooz · 2020-06-18T13:45:12Z

The current version of JpegImagePlugin._accept() uses a very primitive "magic number" to check whether a file may be a JPEG file:

def _accept(prefix):
    return prefix[0:1] == b"\377"

In some cases this leads to reading an entire file to decide it's not a JPEG-file. (Actually, it's a more complicated problem. Maybe I'll do another pull-request.)

Other libraries that decode JPEG files use a more specific "magic number": b'\xFF\xD8\xFF'. This value is also given on Wikipedia (https://en.wikipedia.org/wiki/JPEG).
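
For illustration, the difference between the one-byte check and the three-byte signature can be sketched as follows. This is a minimal sketch, not the code from this PR: the names `old_accept`, `accept_jpeg` and `JPEG_MAGIC` are hypothetical and chosen only for this example.

```python
# Three-byte signature: the SOI marker (FF D8) followed by the FF that
# starts the next marker. The constant name is hypothetical.
JPEG_MAGIC = b"\xff\xd8\xff"


def old_accept(prefix):
    """One-byte check, equivalent to comparing against b"\\377"."""
    return prefix[0:1] == b"\xff"


def accept_jpeg(prefix):
    """Stricter check: the prefix must start with the three-byte signature."""
    return prefix[:3] == JPEG_MAGIC


# A JFIF-style prefix passes both checks.
jfif_prefix = b"\xff\xd8\xff\xe0\x00\x10JFIF"
# Junk that merely starts with FF passes only the one-byte check.
junk_prefix = b"\xff\x00\xff\x00"

print(old_accept(jfif_prefix), accept_jpeg(jfif_prefix))
print(old_accept(junk_prefix), accept_jpeg(junk_prefix))
```

Slicing with `prefix[:3]` rather than indexing means a prefix shorter than three bytes is simply rejected instead of raising an error.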


radarhere · 2020-06-19T23:53:04Z

Tests/test_file_jpeg.py

@@ -706,6 +706,26 @@ def test_icc_after_SOF(self):
        with Image.open("Tests/images/icc-after-SOF.jpg") as im:
            assert im.info["icc_profile"] == b"profile"

+    def test_reading_not_whole_file_for_define_it_type(self):


'for define it type'... sorry, this confuses me. What is this name intended to communicate?

Cykooz · 2020-06-20

Hmm... English is not my native language. Maybe it would be more correct to name this test "test_not_read_entire_file_to_determine_its_type".

radarhere · 2020-06-20T00:02:16Z

I've created Cykooz#1 with some suggestions.

radarhere · 2020-06-20T00:02:46Z

Could you give a brief description of the 'more complicated problem' you allude to?

Cykooz · 2020-06-20T10:01:05Z

My initial problem is that Pillow, in some cases, reads an entire file just to determine its type.
I found that this problem comes from JpegImagePlugin. Look at this "infinite" loop:

Pillow/src/PIL/JpegImagePlugin.py

Lines 357 to 386 in db7186b

    
    while True:
        i = i8(s)
        if i == 0xFF:
            s = s + self.fp.read(1)
            i = i16(s)
        else:
            # Skip non-0xFF junk
            s = self.fp.read(1)
            continue
        if i in MARKER:
            name, description, handler = MARKER[i]
            if handler is not None:
                handler(self, i)
            if i == 0xFFDA:  # start of scan
                rawmode = self.mode
                if self.mode == "CMYK":
                    rawmode = "CMYK;I"  # assume adobe conventions
                self.tile = [("jpeg", (0, 0) + self.size, 0, (rawmode, ""))]
                # self.__offset = self.fp.tell()
                break
            s = self.fp.read(1)
        elif i == 0 or i == 0xFFFF:
            # padded marker or junk; move on
            s = b"\xff"
        elif i == 0xFF00:  # Skip extraneous data (escaped 0xFF)
            s = self.fp.read(1)
        else:
            raise SyntaxError("no marker found")

This loop may read an entire file if the file's content looks like one of the following (with the old version of the "magic number"):

many FF bytes;
FF 00 FF 00 FF 00 ... FF 00;
FF 00 00 00 ... 00.

A correct "magic number" doesn't fix this problem completely, but it reduces the likelihood of encountering it. I am not an expert in the JPEG format, and I don't know how many bytes need to be read from a file to decide it's not a JPEG file.
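
The three byte patterns listed above can be tried against a simplified, standalone model of the loop. This is a sketch under stated assumptions, not Pillow's code: the function name `bytes_read_before_giving_up` is hypothetical, only the start-of-scan marker is modelled (real marker handlers are omitted), and hitting the end of the data is treated as "give up".

```python
import io
import struct

SOS = 0xFFDA  # start of scan; the only marker this simplified model knows


def bytes_read_before_giving_up(data, magic=b"\xff"):
    """Return how many bytes are consumed before the model stops.

    `magic` is the prefix required before the marker loop is entered.
    """
    fp = io.BytesIO(data)
    if fp.read(len(magic)) != magic:
        return fp.tell()  # rejected by the magic-number check alone
    s = b"\xff"  # both magic values end in FF, so the loop starts from it
    try:
        while True:
            i = s[0]  # IndexError once the data is exhausted
            if i == 0xFF:
                s = s + fp.read(1)
                (i,) = struct.unpack(">H", s)  # struct.error on a short read
            else:
                s = fp.read(1)  # skip non-0xFF junk
                continue
            if i == SOS:
                break
            elif i == 0 or i == 0xFFFF:
                s = b"\xff"  # padded marker or junk; move on
            elif i == 0xFF00:
                s = fp.read(1)  # escaped 0xFF
            else:
                raise SyntaxError("no marker found")
    except (SyntaxError, IndexError, struct.error):
        pass
    return fp.tell()


samples = {
    "many FF bytes": b"\xff" * 1000,
    "FF 00 repeated": b"\xff\x00" * 500,
    "FF then zeros": b"\xff" + b"\x00" * 999,
}
for name, data in samples.items():
    old = bytes_read_before_giving_up(data)                   # 1000 each
    new = bytes_read_before_giving_up(data, b"\xff\xd8\xff")  # 3 each
    print(name, old, new)
```

In this model, data that starts with the three-byte signature and continues with FF padding is still consumed to the end, which matches the remark that the stricter check only lowers the likelihood of the problem.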


radarhere · 2020-06-22T09:24:30Z

> My initial problem is that Pillow, in some cases, reads an entire file just to determine its type.

I wouldn't phrase it exactly like this. It's not in order to determine its type. Pillow is attempting to open the file, hits an error and then gives up on the idea that it is a valid JPEG. So a large amount of the file has to be read to determine its validity.

Cykooz · 2020-06-22T09:45:09Z

> So a large amount of the file has to be read to determine its validity.

In my app, reading a file is a very expensive operation. Reading 3-4 MB of data from a file may take several seconds. This could be exploited in DoS attacks.
But I will not argue if you are sure that a JPEG file can contain a lot of "garbage" at the beginning of the file.
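
When reads are this expensive, one application-side option is to cap how many bytes any consumer may pull from a file object. The following is a minimal sketch of that idea and is not part of this PR or of Pillow: `BudgetedReader` and `ReadBudgetExceeded` are hypothetical names, and how a particular library reacts to the exception is outside the scope of the example.

```python
import io


class ReadBudgetExceeded(OSError):
    """Raised when more bytes are read than the budget allows (hypothetical)."""


class BudgetedReader:
    """Wrap a binary file object and limit the total number of bytes returned."""

    def __init__(self, fp, max_bytes):
        self._fp = fp
        self._remaining = max_bytes

    def read(self, size=-1):
        if size is None or size < 0:
            # Cap an unbounded read one byte past the budget so that
            # an overrun is still detected.
            size = self._remaining + 1
        data = self._fp.read(size)
        self._remaining -= len(data)
        if self._remaining < 0:
            raise ReadBudgetExceeded("read budget exhausted")
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        return self._fp.seek(offset, whence)

    def tell(self):
        return self._fp.tell()


# Reading byte by byte, as a marker scan would, stops after 64 bytes.
reader = BudgetedReader(io.BytesIO(b"\xff" * 1000), max_bytes=64)
consumed = 0
try:
    while reader.read(1):
        consumed += 1
except ReadBudgetExceeded:
    print("stopped after", consumed, "bytes")
```

The budget counts bytes returned rather than distinct offsets, so re-reading after a backwards `seek` also uses it up.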

radarhere · 2020-06-22T13:33:03Z

We guard against images with large dimensions. We don't presently guard against images with large file sizes.

Setting aside the matter of 0xFF bytes for one second, part of the problem is that Pillow reads data from before the SOS marker on the initial open. Without changing that, the only way to solve your dilemma would be if valid JPEGs only allowed a limited number of markers before SOS.

Considering that

we have encountered an image with multiple APP1 segments - Only extract first Exif segment #2946 - and we would presumably like to be flexible enough to handle such a file
and going through our test suite, I can see that APP1 markers can come before the SOS marker

I don't see a way forward there.
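
To make the point about segments before SOS concrete, here is a minimal sketch that walks the marker segments of a JPEG byte string up to the SOS marker. It assumes the usual segment layout (a two-byte marker followed by a two-byte big-endian length that counts itself but not the marker) and ignores fill bytes and standalone markers. The function name `markers_before_sos` and the synthetic data are hypothetical.

```python
import struct

SOI = b"\xff\xd8"
SOS = 0xFFDA


def markers_before_sos(data):
    """Return the marker codes found after SOI, up to and including SOS."""
    if data[:2] != SOI:
        raise ValueError("missing SOI marker")
    pos = 2
    found = []
    while pos + 4 <= len(data):
        marker, length = struct.unpack_from(">HH", data, pos)
        if marker >> 8 != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        found.append(marker)
        if marker == SOS:
            break
        pos += 2 + length  # skip the marker and the whole segment
    return found


# Synthetic data with two APP1 segments ahead of SOS.
data = (
    SOI
    + b"\xff\xe1\x00\x06Exif"  # APP1, length 6 (2 length bytes + 4 payload)
    + b"\xff\xe1\x00\x04ab"    # second APP1, length 4
    + b"\xff\xda\x00\x02"      # SOS
)
print([hex(m) for m in markers_before_sos(data)])
# ['0xffe1', '0xffe1', '0xffda']
```

Nothing in this layout limits how many segments may precede SOS, so a reader that processes them on open has no fixed upper bound on how far it must read.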


Replaced primitive "magic number" inside of JpegImagePlugin._accept()…

f99e0b8

… function by more correct version.

radarhere added the JPEG label Jun 18, 2020

radarhere added 2 commits June 20, 2020 09:48

Decreased length of test image data

3e9068a

Replaced OSError with more specific UnidentifiedImageError

abbc890

radarhere reviewed Jun 19, 2020


Renamed test

65742cf

radarhere mentioned this pull request Jun 20, 2020

Fix JPEG magic number Cykooz/Pillow#1

Merged

Cykooz and others added 3 commits June 22, 2020 07:54

Merge pull request #1 from radarhere/fix_jpeg_magic_number

cebaba1

Fix JPEG magic number

Merge branch 'master' into fix_jpeg_magic_number

95ace8a

Reformat code of `test_file_jpeg.py`.

6d2fe42

radarhere mentioned this pull request Jun 22, 2020

Updated _open check to match _accept Cykooz/Pillow#2

Merged

radarhere and others added 2 commits June 23, 2020 00:25

Updated _open check to match _accept

96d1a8b

Merge pull request #2 from radarhere/fix_jpeg_magic_number

b7b4aac

Updated _open check to match _accept

radarhere merged commit 926af72 into python-pillow:master Jun 22, 2020

radarhere changed the title ~~Replaced primitive "magic number" inside of JpegImagePlugin._accept() function by more correct version~~ Updated JPEG magic number Jun 22, 2020
