Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEXT-216: Add HTML5 Entities #312

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rbunel35
Copy link

@rbunel35 rbunel35 commented Mar 28, 2022

Hello !

This time a much larger pull request ^^
I added to the EntityArray class the HTML 5.0 Entities.

Here is how I produced the feature:

  • I got the list of every HTML 5.0 entities in JSON format from whatwg: https://html.spec.whatwg.org/multipage/named-characters.html
  • I ordered them by their Unicode value.
  • I removed all Unicode characters already found in the BASIC, ISO8859_1 and HTML40 maps (the HTML50 map is an extension of those).
  • I separated entities with a semicolon from those without one (which are not part of the HTML Standard).
    • In HTML 5.0, many Unicode characters can translate into different character entities.
    • For example the left bracket can translate into [ or [
  • I provided the HTML50_ESCAPE map with the entities with semicolon. For characters translating into multiple character entities, I used the first one (ex: I associated \u005B with [ but not [).
  • I provided the HTML50_UNESCAPE map, which is an invert of the HTML50_ESCAPE, along with character entities ignored from the HTML50_ESCAPE map (ex: I added an entry for [).
  • I provided the NO_SEMICOLON_UNESCAPE (for unescape purpose only) which maps character entities without semicolon with their corresponding Unicode character.
  • I added the escapeHtml5 and unescapeHtml5 methods in the StringEscapeUtils class using the aforementioned maps.
  • I provided unit tests for all these features.

I am very open to reviewing of this feature and to any question regarding the choices I made.
The JIRA ticket: https://issues.apache.org/jira/browse/TEXT-216

@darkma773r
Copy link
Contributor

@rbunel35, thank you for the PR and apologies for the late reply.

This looks like useful functionality to me, although I'm slightly concerned about the new size of EntityArrays. (If the character maps were accessed by static methods instead of constants, I would recommend using separate private classes to initialize each entity grouping.) Regardless, could you add a unit test that iterates through each HTML5 entity in the official reference and ensures that they are all accounted for and escaped/unescaped correctly? One way to do this would be to convert the JSON reference document into a properties file and load the properties file during the test.

@garydgregory
Copy link
Member

@rbunel35, thank you for the PR and apologies for the late reply.

This looks like useful functionality to me, although I'm slightly concerned about the new size of EntityArrays. (If the character maps were accessed by static methods instead of constants, I would recommend using separate private classes to initialize each entity grouping.) Regardless, could you add a unit test that iterates through each HTML5 entity in the official reference and ensures that they are all accounted for and escaped/unescaped correctly? One way to do this would be to convert the JSON reference document into a properties file and load the properties file during the test.

I agree about splitting out the large new arrays into a new class.

In general, new public and protected elements need to be documented with the Javadoc since tag. Also, we should make as little as possible new or protected to give us maximum flexibility for maintenance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants