-
-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More universal SVG stream detection #723
base: master
Are you sure you want to change the base?
Conversation
Use a standard XML reader to parse up to the root document element: - Should handle any supported text encoding Including UTF-16 and possible BOM. Parser following "F.1 Detection Without External Encoding Information" <https://www.w3.org/TR/xml/#sec-guessing> would be able to handle EBCDIC code pages as well; - Handles possible namespace prefixes and matches the SVG namespace precisely Uses possible DOCTYPE declaration as a speculative optimization (as previously): <!DOCTYPE svg ...> <!-- - Lenghty comment block... --> <svg ...>
Extract doctype parsing into a separate file; Add missing lexical-handler setup.
I've now run a few microbenchmarks. I've measured ~54% slower performance. I've got average times of ~68μs (current) vs. ~105μs (new). This roughly translates into ~3.7s added time for scanning 100,000 files. Arguably the more reliable detection is worth the possibly unnoticeable performance hit. |
Thank you! I'll review the changes as soon as I find the time. I'm initially a little reluctant to merge these changes, however, as file format detection should prioritize performance, over 100% reliable detection. From the API docs of I also believe that comments are allowed anywhere after the optional XML declaration, so your first "no longer accepted" example above, should still be accepted. 😀 |
Note, that example uses a malformed comment – a fatal error after which XML parsers generally stop further processing:
It is not something I've implemented explicitly but it is driven by the XML parser implementation. |
I'm unsure of the point you're making here. If the format detection rejects a valid input it won't get to be decoded – that's my concern (it could be considered a defect). I don't insist on using an XML parser if this could be achieved better otherwise, but it seems the custom parsing in |
Reuse full parser configuration incl. content handler. XMLReader.setProperty() turns out relatively expensive. Regains 1-2% performance back.
You may have a look at: Lines 138 to 140 in bc31399
and: Lines 127 to 132 in bc31399
Basically, no external entities are allowed – that's also necessary for performance. |
Thanks! You are right about the XML comment, I missed the illegal double hyphens inside the comment. Performance is critical, because it impacts not only the detection of SVG, but also all other formats when the SVG (Batik) plugin is installed. Most file formats can be detected just by looking at the first 2-4 bytes, without any extra allocations or instantiations. It's possible to do some tweaks to make this impact smaller, by "moving" the SVG detection to the end, after other formats. And, yes, the
But I agree that the best would be if all SVGs were correctly detected by the |
int b; | ||
while (Character.isWhitespace((char) (b = pInput.read()))) { | ||
// Skip over leading WS | ||
} | ||
|
||
// If it's not a tag, this can't be valid XML | ||
if (b != '<') { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we keep this initial test, just to very quickly filter out any file that is not XML, before creating stream wrapper and attempting to parse?
We might have to strengthen the whitespace skipping to handle BOMs and other multibyte whitespace chars. But it would at least make me happier, knowing we won't spend time trying to XML-parse binary files... 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current detection is good enough, to be honest. What it lacks most is likely handling UTF-16 and UTF-BOM. Given nowadays almost everything uses just UTF-8 w/o BOM, one may consider my suggestion here superfluous. 😄 It just happens I've encountered this code and decided to give it a try.
I don't think you should spend more time looking into this PR. Thank you for providing feedback to me also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, the detection using an XML parser could be included and triggered by a system property, depending on one's needs. I'll further prepare a revision of this approach over the weekend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... over the weekend.
Over already.
- Restore the original performance-optimized canDecode implementation - Enable canDecode using XML reader behind system property flag
Don't consider XML namespaces during detection. Would accept no namespace SVG documents: <svg>...</svg> but then Batik will refuse it during read: Root element namespace does not match that requested: Requested: http://www.w3.org/2000/svg Found: null
Operations on the parser factory must be synchronized.
This is a suggestion for implementing universal SVG stream detection using a SAX parser that (setup will most likely be used for the final parsing, and):
Would handle any supported text encoding including UTF-16 and possible BOM
Parser following F.1 Detection Without External Encoding Information would be able to handle EBCDIC code pages as well;
Handles possible namespace prefixes and matches the SVG namespace precisely.
I haven't verified explicitly but the parsing is stopped at most at the root document element, and the performance should not be noticeably slower. As previously, uses possible
DOCTYPE
declaration as a speculative optimization:Notable differences with the existing implementation include:
No longer accepted:
No longer accepted:
Now is accepted:
The following is wrongfully accepted as before:
I don't think it is necessary but the last one could be made more reliable by considering the
DOCTYPE
declaration only when the name doesn't matchsvg
and doesn't end with:svg
(definitely not an SVG), or when the name matchessvg
or ends with:svg
and the public identifier matches-//W3C//DTD SVG 1.0//EN
or-//W3C//DTD SVG 1.1//EN
. If theDOCTYPE
name issvg
or ends with:svg
, but the public identifier is unknown it could always fall back to looking at the root document element.