Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEXT-215: Prevent decimal numeric entities from wrongly including hexadecimal characters #310

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

rbunel35
Copy link

@rbunel35 rbunel35 commented Mar 25, 2022

Hello,
This a quick bugfix on the NumericEntityUnescaper. The bug allows decimal characters entities without semi-colon and followed by a letter from A to F to be ignored by the translator.
A full description of the problem is found in the ticket: https://issues.apache.org/jira/browse/TEXT-215

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rbunel35 ! Thank you for your contribution. Could you include a unit test as well, please?

@rbunel35
Copy link
Author

rbunel35 commented Mar 25, 2022

Hi @kinow !
Thanks for the quick review. I just added a unit test for the "semiColonOptional" option which asserts the unescaping is working for both hexadecimal and decimal entities, with and without semi-colon.

@kinow
Copy link
Member

kinow commented Mar 25, 2022

Great! I've allowed the workflow to run. Now just need to find time to review issue & code (though that's rather a small change). Thanks for the pull request @rbunel35 !

@rbunel35
Copy link
Author

Thank you very much !

@rbunel35
Copy link
Author

Hello !
Do you have any news to give me about this fix ?
Thanks in advance :)

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay @rbunel35 . Had some spare time now, so re-read the JIRA issue with calm, and then looked at the code & test. Everything looks good IMO. Not sure whether we will need a CVE for this or not. For future issues related to security, it is better to use security instructions from the Commons site 👍

Will leave it open to see if anyone else wants to take a look. If it takes too long to get any activity, feel free to bump me and I'll check in the mailing list/slack/etc to see if anyone wants to review and will check on the CVE.

Thanks
Bruno

@garydgregory
Copy link
Member

https://www.w3.org/TR/REC-xml/#dt-charref

Why are illegal entities allowed in the first place? Am I reading the specification incorrectly? The ';' character should be required. IMO this feature creep on our end feels improper and should not be allowed or at the very least deprecated.

@kinow
Copy link
Member

kinow commented Apr 27, 2022

https://www.w3.org/TR/REC-xml/#dt-charref

Why are illegal entities allowed in the first place? Am I reading the specification incorrectly? The ';' character should be required. IMO this feature creep on our end feels improper and should not be allowed or at the very least deprecated.

Good point. I haven't checked any specification yet, but this:

# File: test.html
<iframe src="&#106avascript:alert(1)">

Or this:

# File: test.html
<iframe src="&#106;avascript:alert(1)">

Both trigger an alert (tested with python3 -m http.server and visit http://localhost:8000/test.html). I think the JIRA issue mentions how browsers handle this payload, so I suspect users could expect Commons Text to translate it in a similar way (not saying that it's correct or not, and whether we should do it or not 👍 , just FWIW)

@garydgregory
Copy link
Member

My personal opinion is that we should stick to a specific version of a specification, in this case W3C XML. If we also want to emulate what a browser does or what another language does, then that could be the job for a subclass. So maybe we should be refactoring the code? Needs other opinions...

@rbunel35
Copy link
Author

Hi,
Thank you for the your answers.

Indeed I understand that semicolon-less character entities do not form part of the specification, however as pointed by @kinow, virtually all modern browsers read them as valid entities anyway, and as a user of Commons Text, I expected it to work the same way.

Also, please note that the acceptance of numeric character entities without semi-colon precedes my contribution (via the "semiColonOptional" value in the enum). My fix only makes it work correctly with hexadecimal entities.

@rbunel35
Copy link
Author

Hey @kinow,

I hope I'm not too much impatient (sorry if that's the case ^^) but do you have news about this PR ?
You mentioned I could bump you if it took too long.

Thanks in advance !

@kinow
Copy link
Member

kinow commented May 10, 2022

Hey @kinow,

I hope I'm not too much impatient (sorry if that's the case ^^) but do you have news about this PR ? You mentioned I could bump you if it took too long.

Thanks in advance !

Hi @rbunel35

Thanks for the remainder. I'm starting a new job in a few months, and will have about one month with extra spare time, when I expect to be able to release Text if nobody else beats me to it.

I think we will have to move this question to a wider audience, @rbunel35 . We have two options here, I can send the email to the Commons Dev mailing list explaining what the PR is fixing, and trying to explain on the difference between browsers & specs, or if you are subscribed you can send the email over there.

The only difference, I think, is that I may forget the issue again if there are no replies over there 😄 in which case you'd have to ping me again. Up to you 👍

Thanks!!!
Bruno

@rbunel35
Copy link
Author

Thanks for the quick answer !
I'm not subscribed yet, but will do it right now, and then I'll send the email to explain my PR.
Thank you very much !

@kinow
Copy link
Member

kinow commented May 10, 2022

Thanks for the quick answer ! I'm not subscribed yet, but will do it right now, and then I'll send the email to explain my PR. Thank you very much !

Brilliant!

Every component is discussed there, that's the only downside. Use the following prefix for your subject, please: "[text] Enter your email subject here".

I have a rule in GMail to move it to another folder so that I can take a look when I'm not busy. You can ignore emails for components you are not interested (I'm about to write one email for Commons Configuration, it goes to the same mailing list, with prefix [configuration]...).

Hopefully we will find a solution for this issue and fix & release it soon. Thanks!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants