Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More universal SVG stream detection #723

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

stanio
Copy link

@stanio stanio commented Jan 2, 2023

This is a suggestion for implementing universal SVG stream detection using a SAX parser that (setup will most likely be used for the final parsing, and):

  • Would handle any supported text encoding including UTF-16 and possible BOM

    Parser following F.1 Detection Without External Encoding Information would be able to handle EBCDIC code pages as well;

  • Handles possible namespace prefixes and matches the SVG namespace precisely.

I haven't verified explicitly but the parsing is stopped at most at the root document element, and the performance should not be noticeably slower. As previously, uses possible DOCTYPE declaration as a speculative optimization:

<!DOCTYPE svg ...>
<!--
  - Lenghty comment block...
  -->
<svg ...>

Notable differences with the existing implementation include:

  • No longer accepted:

    <!--
      -- Malformed
      -->
    <svg xmlns="http://www.w3.org/2000/svg">
  • No longer accepted:

    <svg xmlns="http://www.w3.org/2023/fake">
  • Now is accepted:

    <!DOCTYPE ns:svg>
    <ns:svg xmlns:ns="http://www.w3.org/2000/svg">

The following is wrongfully accepted as before:

<!DOCTYPE svg>
<svg xmlns="http://www.w3.org/2023/fake">

I don't think it is necessary but the last one could be made more reliable by considering the DOCTYPE declaration only when the name doesn't match svg and doesn't end with :svg (definitely not an SVG), or when the name matches svg or ends with :svg and the public identifier matches -//W3C//DTD SVG 1.0//EN or -//W3C//DTD SVG 1.1//EN. If the DOCTYPE name is svg or ends with :svg, but the public identifier is unknown it could always fall back to looking at the root document element.

Use a standard XML reader to parse up to the root document element:

-   Should handle any supported text encoding

    Including UTF-16 and possible BOM.

    Parser following "F.1 Detection Without External Encoding
    Information" <https://www.w3.org/TR/xml/#sec-guessing> would be
    able to handle EBCDIC code pages as well;

-   Handles possible namespace prefixes and matches the SVG namespace precisely

Uses possible DOCTYPE declaration as a speculative optimization (as previously):

    <!DOCTYPE svg ...>
    <!--
      - Lenghty comment block...
      -->
    <svg ...>
Extract doctype parsing into a separate file;
Add missing lexical-handler setup.
@stanio
Copy link
Author

stanio commented Jan 2, 2023

I haven't verified explicitly but the performance should not be noticeably slower.

I've now run a few microbenchmarks. I've measured ~54% slower performance. I've got average times of ~68μs (current) vs. ~105μs (new). This roughly translates into ~3.7s added time for scanning 100,000 files. Arguably the more reliable detection is worth the possibly unnoticeable performance hit.

@haraldk
Copy link
Owner

haraldk commented Jan 2, 2023

Thank you!

I'll review the changes as soon as I find the time.

I'm initially a little reluctant to merge these changes, however, as file format detection should prioritize performance, over 100% reliable detection. From the API docs of canDecode(Object): "Returning true from this method does not guarantee that reading will succeed, only that there appears to be a reasonable chance of success based on a brief inspection of the stream contents". There are also some security considerations that need to be assessed, regarding XML parsing.

I also believe that comments are allowed anywhere after the optional XML declaration, so your first "no longer accepted" example above, should still be accepted. 😀

@stanio
Copy link
Author

stanio commented Jan 2, 2023

I also believe that comments are allowed anywhere after the optional XML declaration, so your first "no longer accepted" example above, should still be accepted. 😀

Note, that example uses a malformed comment – a fatal error after which XML parsers generally stop further processing:

For compatibility, the string " -- " (double-hyphen) MUST NOT occur within comments.

It is not something I've implemented explicitly but it is driven by the XML parser implementation.

@stanio
Copy link
Author

stanio commented Jan 2, 2023

file format detection should prioritize performance, over 100% reliable detection. From the API docs...

I'm unsure of the point you're making here. If the format detection rejects a valid input it won't get to be decoded – that's my concern (it could be considered a defect). I don't insist on using an XML parser if this could be achieved better otherwise, but it seems the custom parsing in canDecode is already rather complex.

Reuse full parser configuration incl. content handler.
XMLReader.setProperty() turns out relatively expensive.
Regains 1-2% performance back.
@stanio
Copy link
Author

stanio commented Jan 2, 2023

There are also some security considerations that need to be assessed, regarding XML parsing.

You may have a look at:

SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

and:

@Override
public InputSource resolveEntity(String name,
String publicId,
String baseURI,
String systemId) {
return new InputSource(new StringReader("")); // empty entity

Basically, no external entities are allowed – that's also necessary for performance.

@haraldk
Copy link
Owner

haraldk commented Jan 3, 2023

Thanks!

You are right about the XML comment, I missed the illegal double hyphens inside the comment.

Performance is critical, because it impacts not only the detection of SVG, but also all other formats when the SVG (Batik) plugin is installed. Most file formats can be detected just by looking at the first 2-4 bytes, without any extra allocations or instantiations. It's possible to do some tweaks to make this impact smaller, by "moving" the SVG detection to the end, after other formats. And, yes, the ThreadLocal probably helps a bit, but it also comes with cost and complexity...

ImageReaderSpi.canDecodeInput(..) is used from ImageIO.getImageReaders(..) and all the ImageIO.read(...) methods. It's still possible to use a separate detection method, and use ImageIO.getImageReadersByFormatName("SVG") for example (a common use case is some formats that can't be detected by content, but can still be identified by "external" properties, like file name, using ImageIO.getImageReadersBySuffix(suffixOf(fileName))).

But I agree that the best would be if all SVGs were correctly detected by the canDecodeInput(..) method, so I'm still considering it. 😀

Comment on lines 77 to 85
int b;
while (Character.isWhitespace((char) (b = pInput.read()))) {
// Skip over leading WS
}

// If it's not a tag, this can't be valid XML
if (b != '<') {
return false;
}
Copy link
Owner

@haraldk haraldk Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we keep this initial test, just to very quickly filter out any file that is not XML, before creating stream wrapper and attempting to parse?

We might have to strengthen the whitespace skipping to handle BOMs and other multibyte whitespace chars. But it would at least make me happier, knowing we won't spend time trying to XML-parse binary files... 😀

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current detection is good enough, to be honest. What it lacks most is likely handling UTF-16 and UTF-BOM. Given nowadays almost everything uses just UTF-8 w/o BOM, one may consider my suggestion here superfluous. 😄 It just happens I've encountered this code and decided to give it a try.

I don't think you should spend more time looking into this PR. Thank you for providing feedback to me also.

Copy link
Author

@stanio stanio Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, the detection using an XML parser could be included and triggered by a system property, depending on one's needs. I'll further prepare a revision of this approach over the weekend.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... over the weekend.

Over already.

-   Restore the original performance-optimized canDecode implementation
-   Enable canDecode using XML reader behind system property flag
Don't consider XML namespaces during detection.  Would accept no
namespace SVG documents:

    <svg>...</svg>

but then Batik will refuse it during read:

    Root element namespace does not match that requested:
    Requested: http://www.w3.org/2000/svg
    Found: null
Operations on the parser factory must be synchronized.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants