More universal SVG stream detection #723

stanio · 2023-01-02T11:12:46Z

This is a suggestion for implementing universal SVG stream detection using a SAX parser that (setup will most likely be used for the final parsing, and):

Would handle any supported text encoding including UTF-16 and possible BOM

Parser following F.1 Detection Without External Encoding Information would be able to handle EBCDIC code pages as well;
Handles possible namespace prefixes and matches the SVG namespace precisely.

I haven't verified explicitly but the parsing is stopped at most at the root document element, and the performance should not be noticeably slower. As previously, uses possible DOCTYPE declaration as a speculative optimization:

<!DOCTYPE svg ...>
<!--
  - Lenghty comment block...
  -->
<svg ...>

Notable differences with the existing implementation include:

No longer accepted:

<!--
  -- Malformed
  -->
<svg xmlns="http://www.w3.org/2000/svg">

No longer accepted:

<svg xmlns="http://www.w3.org/2023/fake">

Now is accepted:

<!DOCTYPE ns:svg>
<ns:svg xmlns:ns="http://www.w3.org/2000/svg">

The following is wrongfully accepted as before:

<!DOCTYPE svg>
<svg xmlns="http://www.w3.org/2023/fake">

I don't think it is necessary but the last one could be made more reliable by considering the DOCTYPE declaration only when the name doesn't match svg and doesn't end with :svg (definitely not an SVG), or when the name matches svg or ends with :svg and the public identifier matches -//W3C//DTD SVG 1.0//EN or -//W3C//DTD SVG 1.1//EN. If the DOCTYPE name is svg or ends with :svg, but the public identifier is unknown it could always fall back to looking at the root document element.

Use a standard XML reader to parse up to the root document element: - Should handle any supported text encoding Including UTF-16 and possible BOM. Parser following "F.1 Detection Without External Encoding Information" <https://www.w3.org/TR/xml/#sec-guessing> would be able to handle EBCDIC code pages as well; - Handles possible namespace prefixes and matches the SVG namespace precisely Uses possible DOCTYPE declaration as a speculative optimization (as previously): <!DOCTYPE svg ...>  <svg ...>

Extract doctype parsing into a separate file; Add missing lexical-handler setup.

stanio · 2023-01-02T13:07:37Z

I haven't verified explicitly but the performance should not be noticeably slower.

I've now run a few microbenchmarks. I've measured ~54% slower performance. I've got average times of ~68μs (current) vs. ~105μs (new). This roughly translates into ~3.7s added time for scanning 100,000 files. Arguably the more reliable detection is worth the possibly unnoticeable performance hit.

haraldk · 2023-01-02T15:23:25Z

Thank you!

I'll review the changes as soon as I find the time.

I'm initially a little reluctant to merge these changes, however, as file format detection should prioritize performance, over 100% reliable detection. From the API docs of canDecode(Object): "Returning true from this method does not guarantee that reading will succeed, only that there appears to be a reasonable chance of success based on a brief inspection of the stream contents". There are also some security considerations that need to be assessed, regarding XML parsing.

I also believe that comments are allowed anywhere after the optional XML declaration, so your first "no longer accepted" example above, should still be accepted. 😀

stanio · 2023-01-02T15:50:02Z

I also believe that comments are allowed anywhere after the optional XML declaration, so your first "no longer accepted" example above, should still be accepted. 😀

Note, that example uses a malformed comment – a fatal error after which XML parsers generally stop further processing:

For compatibility, the string " -- " (double-hyphen) MUST NOT occur within comments.

It is not something I've implemented explicitly but it is driven by the XML parser implementation.

stanio · 2023-01-02T16:16:57Z

file format detection should prioritize performance, over 100% reliable detection. From the API docs...

I'm unsure of the point you're making here. If the format detection rejects a valid input it won't get to be decoded – that's my concern (it could be considered a defect). I don't insist on using an XML parser if this could be achieved better otherwise, but it seems the custom parsing in canDecode is already rather complex.

Reuse full parser configuration incl. content handler. XMLReader.setProperty() turns out relatively expensive. Regains 1-2% performance back.

stanio · 2023-01-02T16:35:07Z

There are also some security considerations that need to be assessed, regarding XML parsing.

You may have a look at:

TwelveMonkeys/imageio/imageio-batik/src/main/java/com/twelvemonkeys/imageio/plugins/svg/DoctypeHandler.java

Lines 138 to 140 in bc31399

    
           SAXParserFactory spf = SAXParserFactory.newInstance(); 
        
           spf.setNamespaceAware(true); 
        
           spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

and:

TwelveMonkeys/imageio/imageio-batik/src/main/java/com/twelvemonkeys/imageio/plugins/svg/DoctypeHandler.java

Lines 127 to 132 in bc31399

    
           @Override 
        
           public InputSource resolveEntity(String name, 
        
                                            String publicId, 
        
                                            String baseURI, 
        
                                            String systemId) { 
        
               return new InputSource(new StringReader("")); // empty entity

Basically, no external entities are allowed – that's also necessary for performance.

haraldk · 2023-01-03T19:05:04Z

Thanks!

You are right about the XML comment, I missed the illegal double hyphens inside the comment.

Performance is critical, because it impacts not only the detection of SVG, but also all other formats when the SVG (Batik) plugin is installed. Most file formats can be detected just by looking at the first 2-4 bytes, without any extra allocations or instantiations. It's possible to do some tweaks to make this impact smaller, by "moving" the SVG detection to the end, after other formats. And, yes, the ThreadLocal probably helps a bit, but it also comes with cost and complexity...

ImageReaderSpi.canDecodeInput(..) is used from ImageIO.getImageReaders(..) and all the ImageIO.read(...) methods. It's still possible to use a separate detection method, and use ImageIO.getImageReadersByFormatName("SVG") for example (a common use case is some formats that can't be detected by content, but can still be identified by "external" properties, like file name, using ImageIO.getImageReadersBySuffix(suffixOf(fileName))).

But I agree that the best would be if all SVGs were correctly detected by the canDecodeInput(..) method, so I'm still considering it. 😀

haraldk · 2023-01-03T19:22:11Z

...eio/imageio-batik/src/main/java/com/twelvemonkeys/imageio/plugins/svg/SVGImageReaderSpi.java

-            int b;
-            while (Character.isWhitespace((char) (b = pInput.read()))) {
-                // Skip over leading WS
-            }
-
-            // If it's not a tag, this can't be valid XML
-            if (b != '<') {
-                return false;
-            }


What if we keep this initial test, just to very quickly filter out any file that is not XML, before creating stream wrapper and attempting to parse?

We might have to strengthen the whitespace skipping to handle BOMs and other multibyte whitespace chars. But it would at least make me happier, knowing we won't spend time trying to XML-parse binary files... 😀

The current detection is good enough, to be honest. What it lacks most is likely handling UTF-16 and UTF-BOM. Given nowadays almost everything uses just UTF-8 w/o BOM, one may consider my suggestion here superfluous. 😄 It just happens I've encountered this code and decided to give it a try.

I don't think you should spend more time looking into this PR. Thank you for providing feedback to me also.

Alternatively, the detection using an XML parser could be included and triggered by a system property, depending on one's needs. I'll further prepare a revision of this approach over the weekend.

... over the weekend.

Over already.

- Restore the original performance-optimized canDecode implementation - Enable canDecode using XML reader behind system property flag

Don't consider XML namespaces during detection. Would accept no namespace SVG documents: <svg>...</svg> but then Batik will refuse it during read: Root element namespace does not match that requested: Requested: http://www.w3.org/2000/svg Found: null

Operations on the parser factory must be synchronized.

stanio added 2 commits January 2, 2023 12:23

fixup! More universal SVG stream detection

af45270

Extract doctype parsing into a separate file; Add missing lexical-handler setup.

fixup! More universal SVG stream detection

bc31399

Reuse full parser configuration incl. content handler. XMLReader.setProperty() turns out relatively expensive. Regains 1-2% performance back.

haraldk reviewed Jan 3, 2023

View reviewed changes

stanio added 3 commits January 3, 2023 22:55

fixup! More universal SVG stream detection

8f3bc56

- Restore the original performance-optimized canDecode implementation - Enable canDecode using XML reader behind system property flag

fixup! More universal SVG stream detection

1b256e9

Operations on the parser factory must be synchronized.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More universal SVG stream detection #723

More universal SVG stream detection #723

stanio commented Jan 2, 2023

stanio commented Jan 2, 2023

haraldk commented Jan 2, 2023

stanio commented Jan 2, 2023

stanio commented Jan 2, 2023 •

edited

stanio commented Jan 2, 2023

haraldk commented Jan 3, 2023

haraldk Jan 3, 2023 •

edited

stanio Jan 3, 2023

stanio Jan 3, 2023 •

edited

stanio Jan 3, 2023

More universal SVG stream detection #723

Are you sure you want to change the base?

More universal SVG stream detection #723

Conversation

stanio commented Jan 2, 2023

stanio commented Jan 2, 2023

haraldk commented Jan 2, 2023

stanio commented Jan 2, 2023

stanio commented Jan 2, 2023 • edited

stanio commented Jan 2, 2023

haraldk commented Jan 3, 2023

haraldk Jan 3, 2023 • edited

Choose a reason for hiding this comment

stanio Jan 3, 2023

Choose a reason for hiding this comment

stanio Jan 3, 2023 • edited

Choose a reason for hiding this comment

stanio Jan 3, 2023

Choose a reason for hiding this comment

stanio commented Jan 2, 2023 •

edited

haraldk Jan 3, 2023 •

edited

stanio Jan 3, 2023 •

edited