Issues with whitespace definition #361

Open
thobe opened this issue Apr 4, 2019 · 7 comments

Comments

@thobe
Contributor

thobe commented Apr 4, 2019

Neither Java's Character.isWhitespace(int), nor Character.isSpaceChar(int), nor the Unicode [:White_Space:] specification treats \u180E (MONGOLIAN VOWEL SEPARATOR) as whitespace.

Yet the openCypher grammar considers this a whitespace character, why?

<literal value="&#x180e;"/> <!-- MONGOLIAN VOWEL SEPARATOR -->

Furthermore, the definition of whitespace in the openCypher grammar does not consider \u0085 (NEXT LINE) to be whitespace, although it is part of the Unicode [:White_Space:] specification. Perhaps it should be added? (Neither Character.isWhitespace(int) nor Character.isSpaceChar(int) considers it whitespace, which explains why it is not in the grammar.)

@thobe
Contributor Author

thobe commented Apr 4, 2019

I came across this difference when looking at why the whitespace production rules spell out all whitespace characters individually instead of just referencing the Unicode [:White_Space:] specification. So I investigated the difference.

The conclusion of this exercise is that, apart from \u0085 (NEXT LINE), the grammar includes all characters of the Unicode [:White_Space:] specification, and additionally includes \u001C (FILE SEPARATOR), \u001D (GROUP SEPARATOR), \u001E (RECORD SEPARATOR), and \u001F (UNIT SEPARATOR).

Tabulating the characters involved:

| Code Point | Character.isWhitespace(...) | Character.isSpaceChar(...) | [:White_Space:] |
|---|---|---|---|
| \u0009 | True | False | True |
| \u000a | True | False | True |
| \u000b | True | False | True |
| \u000c | True | False | True |
| \u000d | True | False | True |
| \u001c | True | False | False |
| \u001d | True | False | False |
| \u001e | True | False | False |
| \u001f | True | False | False |
| \u0020 | True | True | True |
| \u0085 | False | False | True |
| \u00a0 | False | True | True |
| \u1680 | True | True | True |
| \u180E | True (Java 8) | True (Java 8) | True (Unicode 4.0 - 6.2) |
| \u180E | False (Java 11) | False (Java 11) | False (Unicode 3.0 - 3.2; 6.3 -) |
| \u2000 | True | True | True |
| \u2001 | True | True | True |
| \u2002 | True | True | True |
| \u2003 | True | True | True |
| \u2004 | True | True | True |
| \u2005 | True | True | True |
| \u2006 | True | True | True |
| \u2007 | False | True | True |
| \u2008 | True | True | True |
| \u2009 | True | True | True |
| \u200a | True | True | True |
| \u2028 | True | True | True |
| \u2029 | True | True | True |
| \u202f | False | True | True |
| \u205f | True | True | True |
| \u3000 | True | True | True |
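
For anyone who wants to reproduce the tabulation on their own JDK, here is a minimal sketch. The class and method names (`WhitespaceTable`, `isUnicodeWhiteSpace`, `row`) are made up for illustration and are not part of openCypher. It assumes the `\p{IsWhite_Space}` binary property of `java.util.regex.Pattern` is a fair stand-in for the Unicode [:White_Space:] column; that property follows whatever Unicode version the running JDK ships, so the \u180E rows will differ between Java versions.

```java
import java.util.regex.Pattern;

final class WhitespaceTable {
    // White_Space binary property as implemented by the running JDK's regex engine
    // (assumption: this stands in for the Unicode [:White_Space:] column).
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    static boolean isUnicodeWhiteSpace(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return WHITE_SPACE.matcher(s).matches();
    }

    // One row: code point, isWhitespace, isSpaceChar, White_Space.
    static String row(int codePoint) {
        return String.format("U+%04X %b %b %b",
                codePoint,
                Character.isWhitespace(codePoint),
                Character.isSpaceChar(codePoint),
                isUnicodeWhiteSpace(codePoint));
    }

    public static void main(String[] args) {
        // The code points discussed in this issue.
        int[] codePoints = {
            0x0009, 0x000A, 0x000B, 0x000C, 0x000D,
            0x001C, 0x001D, 0x001E, 0x001F,
            0x0020, 0x0085, 0x00A0, 0x1680, 0x180E,
            0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
            0x2006, 0x2007, 0x2008, 0x2009, 0x200A,
            0x2028, 0x2029, 0x202F, 0x205F, 0x3000
        };
        System.out.println("CodePoint isWhitespace isSpaceChar White_Space");
        for (int cp : codePoints) {
            System.out.println(row(cp));
        }
    }
}
```

Running `main` on Java 8 and again on Java 11 should show the two variants of the \u180E row; the other rows are expected to be stable across versions.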

@thobe
Contributor Author

thobe commented Apr 4, 2019

If we agree to use the unicode [:White_Space:] specification, we could define whitespace as:

<production name="whitespace">
  <alt>
    <character set="White_Space"/>
    <character set="FS"/>
    <character set="GS"/>
    <character set="RS"/>
    <character set="US"/>
  </alt>
</production>
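
To make the proposal concrete for implementers, the following is a rough sketch of a predicate matching the production above: Unicode White_Space plus FS, GS, RS and US. The names `ProposedWhitespace` and `isProposedWhitespace` are hypothetical, and using the JDK's `\p{IsWhite_Space}` regex property for the `White_Space` set is an assumption; a real parser might generate the set from the Unicode data files instead.

```java
import java.util.regex.Pattern;

final class ProposedWhitespace {
    // Assumption: the JDK's White_Space property approximates the "White_Space" character set.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    static boolean isProposedWhitespace(int codePoint) {
        // FS (U+001C), GS (U+001D), RS (U+001E), US (U+001F) are added explicitly,
        // because they are not part of White_Space.
        if (codePoint >= 0x1C && codePoint <= 0x1F) {
            return true;
        }
        String s = new String(Character.toChars(codePoint));
        return WHITE_SPACE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isProposedWhitespace(0x0085)); // NEXT LINE
        System.out.println(isProposedWhitespace(0x001E)); // RECORD SEPARATOR
        System.out.println(isProposedWhitespace('a'));
    }
}
```

With this definition \u0085 is accepted, while the treatment of \u180E follows the Unicode version of the JDK in use.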

@thobe
Contributor Author

thobe commented Apr 4, 2019

Looking at commit history, it appears as if at some point Java's Character.isWhitespace(int) treated \u180E (MONGOLIAN VOWEL SEPARATOR) as a whitespace. At least that is what the code comments say. And indeed, in Java 8 it is included, but in Java 11 it is not.
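
A quick way to check this on a given JDK is to look at the general category that Java reports for the character. The sketch below uses a hypothetical class name (`MongolianVowelSeparatorCheck`); the expectation that the category is Zs on Java 8 and Cf on Java 11 rests on the Unicode 6.3 change listed in the table above.

```java
final class MongolianVowelSeparatorCheck {
    static final int MVS = 0x180E; // MONGOLIAN VOWEL SEPARATOR

    // Maps the two general categories relevant here to their short names.
    static String categoryName(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.SPACE_SEPARATOR:
                return "Zs";
            case Character.FORMAT:
                return "Cf";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("category     = " + categoryName(MVS));
        System.out.println("isWhitespace = " + Character.isWhitespace(MVS));
        System.out.println("isSpaceChar  = " + Character.isSpaceChar(MVS));
    }
}
```

Both `Character.isWhitespace` and `Character.isSpaceChar` are derived from the general category, so they flip together when the JDK's Unicode data reclassifies the character.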

@Mats-SX
Member

Mats-SX commented Apr 4, 2019

I think it makes good sense to stick with Unicode here. Do we even need the special additions of FS, GS, RS and US?

@thobe
Contributor Author

thobe commented Apr 4, 2019

The FILE SEPARATOR, GROUP SEPARATOR, RECORD SEPARATOR, and UNIT SEPARATOR have been explicitly treated as whitespace by Java since forever, and thus by the Neo4j Cypher parser.

They are unlikely to occur in Cypher queries. I'd say it's harmless to either include or exclude them.

@Mats-SX
Member

Mats-SX commented Apr 5, 2019

I agree. I would lean towards going with Unicode rather than Java (and abandoning Cypher's implementation history), but I don't feel strongly about it. I wonder if either of the two alternatives makes a difference for implementability? I doubt it.

@hvub
Contributor

hvub commented Mar 18, 2022

See #530


3 participants