Unicode operator support missing from PureScript #3006

Closed
toastal opened this issue Jul 22, 2021 · 4 comments · Fixed by #3020

Comments

@toastal
Contributor

toastal commented Jul 22, 2021

Information

  • Language: PureScript
  • Plugins: none

Description
While I did open a merge request for the lowest-hanging fruit as a keyword, there are still issues with PureScript and Unicode operators. ∷→←⇒⇐ are natively supported as operators in the compiler, but users can also define their own operators (such as ≡ for ==) (see: Lexer.hs). Anything that is considered a "symbol" according to

isSymbolChar :: Char -> Bool
isSymbolChar c = (c `elem` (":!#$%&*+./<=>?@\\^|-~" :: [Char])) || (not (Char.isAscii c) && Char.isSymbol c)

is 100% valid PureScript. As PureScript extends the Haskell syntax, and Haskell has the mention of . being an issue for function composition in its comment (with . being used for records in PureScript), how best to modify the regex was not immediately apparent to me.
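
For readers more at home in JavaScript, the predicate quoted above can be sketched as follows. This is only an illustration, not code from the compiler or from Prism: it assumes that Haskell's Char.isSymbol corresponds to the Unicode general category S (Sm, Sc, Sk, So), and it relies on the `u` flag and Unicode property escapes being available.

```javascript
// Sketch of the Haskell predicate quoted above, ported to JavaScript.
// Assumption: Char.isSymbol matches Unicode general category S,
// which \p{S} selects under the `u` flag.
const ASCII_SYMBOLS = ":!#$%&*+./<=>?@\\^|-~";
const NON_ASCII_SYMBOL = /^\p{S}$/u;

// `c` is expected to be a single character.
function isSymbolChar(c) {
  const isAscii = c.codePointAt(0) < 0x80;
  return ASCII_SYMBOLS.includes(c) || (!isAscii && NON_ASCII_SYMBOL.test(c));
}

isSymbolChar("≡"); // true: non-ASCII math symbol
isSymbolChar("."); // true: listed ASCII symbol
isSymbolChar("`"); // false: ASCII, but not in the list
isSymbolChar("λ"); // false: a letter, not a symbol
```

Every character for which this returns true can be part of a user-defined operator, which is why a fixed list of operators cannot cover all valid programs.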

Code snippet

Test page

The code being highlighted incorrectly.
readBooleanOrIntAsBoolean ∷ Foreign → Foreign.F Boolean
readBooleanOrIntAsBoolean value =
  Foreign.readBoolean value
    <|> (toBool =<< Foreign.readInt value)
  where
  toBool ∷ Int → Foreign.F Boolean
  toBool = case _ of
    0 → pure false
    1 → pure true
    int → Foreign.fail (Foreign.ForeignError ("Invalid integer: " <> show int))

isSuccessResponse ∷ ∀ a. AX.Response a → Boolean
isSuccessResponse { status } = status >= (StatusCode 200) && status < (StatusCode 400)

infix 4 eq as ≡

isMempty ∷ ∀ m. Monoid m ⇒ Boolean
isMempty = _ ≡ mempty
@RunDevelopment
Member

∷→←⇒⇐ are natively supported as operators in the compiler, but also users can define their own operators (such as ≡ for ==) (see: Lexer.hs).

Hmmm, this will be difficult. I don't think that we will be able to match the behaviour of isSymbolChar. It's possible but impractical.

Don't get me wrong, it's easy to create a regex that matches isSymbolChar:

/[:!#$%&*+./<=>?@\\^|\-~]|(?![\0-\x7F])[\p{gc=Math_Symbol}\p{gc=Currency_Symbol}\p{Modifier_Symbol}\p{Other_Symbol}]/u

But we can't use this regex. We support browsers that do not support Unicode property escapes or the u flag. "Inlining" the Unicode property escapes results in a very long regex (2.8kB):
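
To make the compatibility problem concrete, here is a minimal sketch of how support for the `u` flag and property escapes could be detected at run time. The helper name and the ASCII-only fallback are hypothetical and not part of Prism, and the thread does not say Prism does anything like this. `\p{S}` is used as shorthand for the four symbol categories in the regex above.

```javascript
// Hypothetical helper (not part of Prism): reports whether the running engine
// accepts the `u` flag together with Unicode property escapes.
function supportsUnicodePropertyEscapes() {
  try {
    // Built from a string so that an unsupported engine throws here at run
    // time rather than rejecting the whole script while parsing a literal.
    return new RegExp("\\p{Sm}", "u").test("≡");
  } catch (e) {
    return false;
  }
}

// Assumption for illustration: fall back to the ASCII symbols only.
const operatorPattern = supportsUnicodePropertyEscapes()
  ? new RegExp("[:!#$%&*+./<=>?@\\\\^|\\-~]|(?![\\0-\\x7F])\\p{S}", "u")
  : /[:!#$%&*+./<=>?@\\^|\-~]/;

operatorPattern.test("≡"); // true only where property escapes are supported
```

A fallback like this would highlight the same code differently depending on the browser, which is one reason a single pattern that works everywhere is preferable.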

/[:!#$%&*+./<=>?@\\^|\-~\xa2-\xa6\xa8\xa9\xac\xae-\xb1\xb4\xb8\xd7\xf7\u02c2-\u02c5\u02d2-\u02df\u02e5-\u02eb\u02ed\u02ef-\u02ff\u0375\u0384\u0385\u03f6\u0482\u058d-\u058f\u0606-\u0608\u060b\u060e\u060f\u06de\u06e9\u06fd\u06fe\u07f6\u07fe\u07ff\u09f2\u09f3\u09fa\u09fb\u0af1\u0b70\u0bf3-\u0bfa\u0c7f\u0d4f\u0d79\u0e3f\u0f01-\u0f03\u0f13\u0f15-\u0f17\u0f1a-\u0f1f\u0f34\u0f36\u0f38\u0fbe-\u0fc5\u0fc7-\u0fcc\u0fce\u0fcf\u0fd5-\u0fd8\u109e\u109f\u1390-\u1399\u166d\u17db\u1940\u19de-\u19ff\u1b61-\u1b6a\u1b74-\u1b7c\u1fbd\u1fbf-\u1fc1\u1fcd-\u1fcf\u1fdd-\u1fdf\u1fed-\u1fef\u1ffd\u1ffe\u2044\u2052\u207a-\u207c\u208a-\u208c\u20a0-\u20bf\u2100\u2101\u2103-\u2106\u2108\u2109\u2114\u2116-\u2118\u211e-\u2123\u2125\u2127\u2129\u212e\u213a\u213b\u2140-\u2144\u214a-\u214d\u214f\u218a\u218b\u2190-\u2307\u230c-\u2328\u232b-\u2426\u2440-\u244a\u249c-\u24e9\u2500-\u2767\u2794-\u27c4\u27c7-\u27e5\u27f0-\u2982\u2999-\u29d7\u29dc-\u29fb\u29fe-\u2b73\u2b76-\u2b95\u2b97-\u2bff\u2ce5-\u2cea\u2e50\u2e51\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u2ff0-\u2ffb\u3004\u3012\u3013\u3020\u3036\u3037\u303e\u303f\u309b\u309c\u3190\u3191\u3196-\u319f\u31c0-\u31e3\u3200-\u321e\u322a-\u3247\u3250\u3260-\u327f\u328a-\u32b0\u32c0-\u33ff\u4dc0-\u4dff\ua490-\ua4c6\ua700-\ua716\ua720\ua721\ua789\ua78a\ua828-\ua82b\ua836-\ua839\uaa77-\uaa79\uab5b\uab6a\uab6b\ufb29\ufbb2-\ufbc1\ufdfc\ufdfd\ufe62\ufe64-\ufe66\ufe69\uff04\uff0b\uff1c-\uff1e\uff3e\uff40\uff5c\uff5e\uffe0-\uffe6\uffe8-\uffee\ufffc\ufffd\u{10137}-\u{1013f}\u{10179}-\u{10189}\u{1018c}-\u{1018e}\u{10190}-\u{1019c}\u{101a0}\u{101d0}-\u{101fc}\u{10877}\u{10878}\u{10ac8}\u{1173f}\u{11fd5}-\u{11ff1}\u{16b3c}-\u{16b3f}\u{16b45}\u{1bc9c}\u{1d000}-\u{1d0f5}\u{1d100}-\u{1d126}\u{1d129}-\u{1d164}\u{1d16a}-\u{1d16c}\u{1d183}\u{1d184}\u{1d18c}-\u{1d1a9}\u{1d1ae}-\u{1d1e8}\u{1d200}-\u{1d241}\u{1d245}\u{1d300}-\u{1d356}\u{1d6c1}\u{1d6db}\u{1d6fb}\u{1d715}\u{1d735}\u{1d74f}\u{1d76f}\u{1d789}\u{1d7a9}\u{1d7c3}\u{1d800}-\u{1d9ff}\u{1da37}-\u{1da3a}\u{1da6d}-\u{1da74}\u{1da76}-\u{1da83}\u{1da85}\u{1da86}\u{1e14f}\u{1e2ff}\u{1ecac}\u{1ecb0}\u{1ed2e}\u{1eef0}\u{1eef1}\u{1f000}-\u{1f02b}\u{1f030}-\u{1f093}\u{1f0a0}-\u{1f0ae}\u{1f0b1}-\u{1f0bf}\u{1f0c1}-\u{1f0cf}\u{1f0d1}-\u{1f0f5}\u{1f10d}-\u{1f1ad}\u{1f1e6}-\u{1f202}\u{1f210}-\u{1f23b}\u{1f240}-\u{1f248}\u{1f250}\u{1f251}\u{1f260}-\u{1f265}\u{1f300}-\u{1f6d7}\u{1f6e0}-\u{1f6ec}\u{1f6f0}-\u{1f6fc}\u{1f700}-\u{1f773}\u{1f780}-\u{1f7d8}\u{1f7e0}-\u{1f7eb}\u{1f800}-\u{1f80b}\u{1f810}-\u{1f847}\u{1f850}-\u{1f859}\u{1f860}-\u{1f887}\u{1f890}-\u{1f8ad}\u{1f8b0}\u{1f8b1}\u{1f900}-\u{1f978}\u{1f97a}-\u{1f9cb}\u{1f9cd}-\u{1fa53}\u{1fa60}-\u{1fa6d}\u{1fa70}-\u{1fa74}\u{1fa78}-\u{1fa7a}\u{1fa80}-\u{1fa86}\u{1fa90}-\u{1faa8}\u{1fab0}-\u{1fab6}\u{1fac0}-\u{1fac2}\u{1fad0}-\u{1fad6}\u{1fb00}-\u{1fb92}\u{1fb94}-\u{1fbca}^]/iu

And getting rid of the u flag will make it even longer (probably around 6kB).

That's too long. Prism language definitions are supposed to be lightweight.

Could we somehow limit the symbols used? Are there symbols that are commonly used?

As PureScript extends the Haskell syntax and Haskell has the mention of . being an issue for function composition in its comment (with . being used for records in PureScript)

Could you please give an example of this problem?

@toastal
Contributor Author

toastal commented Jul 23, 2021

Could you please give an example of this problem?

Literally just quoting:

{
    // ...
    //
    // Most of this is needed because of the meaning of a single '.'.
    // If it stands alone freely, it is the function composition.
    // It may also be a separator between a module name and an identifier => no
    // operator. If it comes together with other special characters it is an
    // operator too.
    'operator': /\s\.\s|[-!#$%*+=?&@|~:<>^\\\/]*\.[-!#$%*+=?&@|~.:<>^\\\/]+|[-!#$%*+=?&@|~.:<>^\\\/]+\.[-!#$%*+=?&@|~:<>^\\\/]*|[-!#$%*+=?&@|~:<>^\\\/]+|`(?:[A-Z][\w']*\.)*[_a-z][\w']*`/,
    //
    // ...
}

I get what this means, but I can't immediately follow the RegExp itself. . <> is okay as is <>? I could be wrong, but it may be the case that this check is useless in PureScript, as compose is <<< and not . like in Haskell (though I and others have a Unicode alias of ∘, as it is the "proper" composition operator).
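
One way to follow the RegExp is to run it against a few small inputs. The sketch below copies the quoted `operator` pattern unchanged and wraps it in a hypothetical helper (`firstOperator` is not a Prism function) that returns the first match in a string.

```javascript
// The `operator` pattern quoted above, copied unchanged.
const haskellOperator = /\s\.\s|[-!#$%*+=?&@|~:<>^\\\/]*\.[-!#$%*+=?&@|~.:<>^\\\/]+|[-!#$%*+=?&@|~.:<>^\\\/]+\.[-!#$%*+=?&@|~:<>^\\\/]*|[-!#$%*+=?&@|~:<>^\\\/]+|`(?:[A-Z][\w']*\.)*[_a-z][\w']*`/;

// Hypothetical helper: first operator match in `src`, or null if there is none.
function firstOperator(src) {
  const m = haskellOperator.exec(src);
  return m ? m[0] : null;
}

firstOperator("f . g");           // " . "  free-standing dot, whitespace included
firstOperator("Data.Map.lookup"); // null   dot between identifiers is not an operator
firstOperator("record.field");    // null   so record access is left alone as well
firstOperator("xs <> ys");        // "<>"
firstOperator("a .<> b");         // ".<>"  dot combined with other symbol characters
firstOperator("a <>. b");         // "<>."
firstOperator("f <<< g");         // "<<<"
firstOperator("f ∘ g");           // null   no Unicode symbols are covered
```

So a dot only counts as an operator when it is surrounded by whitespace or attached to other symbol characters. The last line shows the gap this issue is about.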

That's too long. Prism language definitions are supposed to be lightweight.

I suppose a symbol expression could be something everyone using the library could use? Perhaps there are some common Unicode ranges that could suffice. I don't know what a long-term solution is, though, short of fully supporting all possibilities. I put some realistic examples from code I'm using into my shared example, and they are clearly unsupported.

Looking for actionable solutions with limitations:

  1. support the bare minimum ∷→←⇒⇐ which are built-in to the compiler (no user definitions required)
  2. see above, + something like one person's unicode Prelude and some basic math operators ∘×÷≡≠⫽⩓⩔∧∨↝⨁⊹ could cover a lot of basic cases.
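
The two options above can be sketched as plain character classes. This is only an illustration of the idea and not necessarily the approach any eventual fix takes; the names `builtIn`, `extended`, and `unicodeOperators` are hypothetical. All listed characters are in the Basic Multilingual Plane, so neither pattern needs the `u` flag.

```javascript
// Option 1: only the operators built into the compiler.
const builtIn = /[∷→←⇒⇐]/g;
// Option 2: built-ins plus the commonly used symbols listed above.
const extended = /[∷→←⇒⇐∘×÷≡≠⫽⩓⩔∧∨↝⨁⊹]/g;

// Hypothetical helper: all Unicode operator characters found in `src`.
function unicodeOperators(src, pattern) {
  return src.match(pattern) || [];
}

unicodeOperators("toBool ∷ Int → Foreign.F Boolean", builtIn); // ["∷", "→"]
unicodeOperators("isMempty = _ ≡ mempty", builtIn);            // []
unicodeOperators("isMempty = _ ≡ mempty", extended);           // ["≡"]
```

Both classes are tiny compared with the inlined property-escape regex, at the cost of missing any user-defined operator that uses a symbol outside the list.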

Personally, the Unicode support was one of the reasons that's kept me attracted to PureScript after many years.

@RunDevelopment
Member

Sorry for the delay @toastal. I made a PR (#3020) that fixes this issue, I think. Could you please verify that everything works?

@toastal
Contributor Author

toastal commented Jul 30, 2021

lgtm 👍
