handleOnXML tries to parse .xlsx files #790

Open
theseanything opened this issue Oct 19, 2023 · 2 comments

@theseanything

The handleOnXML function attempts to parse responses with the content type application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, because the function looks for any mention of xml in the content type. This results in a parse error when xmlquery.Parse() is called, for example: `encoding/xml.SyntaxError {Msg: "illegal character code U+0003", Line: 1}`.

XLSX files are packaged as ZIP archives, so they can't be parsed directly as XML.

It would be ideal not to try to parse these files at all, possibly by being more explicit about which content types we consider to be XML.
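As a rough sketch of what "more explicit" could look like: parse the media type out of the header and accept only text/xml, application/xml, or a subtype ending in +xml, instead of searching for the substring xml. The helper name isXMLContentType is hypothetical and not part of the library; only the standard library is used.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// isXMLContentType is a hypothetical helper (not an existing library function).
// It reports whether a Content-Type header value names an XML media type,
// rather than checking whether the string merely contains "xml".
func isXMLContentType(contentType string) bool {
	// ParseMediaType strips parameters such as charset and lowercases the type.
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	switch mediaType {
	case "text/xml", "application/xml":
		return true
	}
	// Structured syntax suffix, e.g. application/rss+xml.
	return strings.HasSuffix(mediaType, "+xml")
}

func main() {
	for _, ct := range []string{
		"text/xml; charset=utf-8",
		"application/rss+xml",
		"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	} {
		fmt.Printf("%s -> %v\n", ct, isXMLContentType(ct))
	}
}
```

With a check along these lines, the spreadsheet content type above is rejected because its media type is neither one of the two XML types nor a +xml subtype.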

@theseanything
Author

This doesn't only affect xlsx, but also docx, pptx, and similar document types.

@theseanything
Author

theseanything commented Oct 20, 2023

To add to this, it would be nice to be able to have more granularity over what XML is parsed. For example, we use an OnXML handler to follow links in an XML sitemap, but our site contains many SVGs (image/svg+xml) and RDF documents (application/rdf+xml), which are also parsed unnecessarily.
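One possible shape for that granularity, purely as an illustration: a caller-supplied allowlist of exact media types that is checked before any parsing happens. The names xmlAllowlist, newXMLAllowlist, and allows are hypothetical; nothing like this exists in the library today as far as this issue states.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// xmlAllowlist is a hypothetical set of media types the caller opts into
// for XML parsing. It is an illustration, not an existing library type.
type xmlAllowlist map[string]struct{}

// newXMLAllowlist builds the set from exact media types such as "text/xml".
func newXMLAllowlist(mediaTypes ...string) xmlAllowlist {
	a := make(xmlAllowlist, len(mediaTypes))
	for _, mt := range mediaTypes {
		a[strings.ToLower(mt)] = struct{}{}
	}
	return a
}

// allows reports whether the Content-Type header value names a listed type.
func (a xmlAllowlist) allows(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	_, ok := a[mediaType]
	return ok
}

func main() {
	// A sitemap crawler might only care about these two types.
	sitemapOnly := newXMLAllowlist("application/xml", "text/xml")
	for _, ct := range []string{
		"application/xml; charset=utf-8",
		"image/svg+xml",
		"application/rdf+xml",
	} {
		fmt.Printf("%s -> %v\n", ct, sitemapOnly.allows(ct))
	}
}
```

An exact-match allowlist is stricter than a +xml suffix check, so SVG and RDF responses would be skipped unless the caller lists them explicitly.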
