Unicode operator support missing from PureScript #3006

toastal · 2021-07-22T15:35:47Z

Information

Language: PureScript
Plugins: none

Description
While I did open a merge request adding ∀ as a keyword (the lowest-hanging fruit), there are still issues with PureScript and Unicode operators. ∷→←⇒⇐ are natively supported as operators in the compiler, but users can also define their own operators (such as ≡ for ==) (see: Lexer.hs). Anything that is considered a "symbol" according to

isSymbolChar :: Char -> Bool
isSymbolChar c = (c `elem` (":!#$%&*+./<=>?@\\^|-~" :: [Char])) || (not (Char.isAscii c) && Char.isSymbol c)

is 100% valid PureScript. As PureScript extends the Haskell syntax, and the Haskell definition mentions in a comment that . is an issue because of function composition (with . being used for records in PureScript), how best to modify the regex was not immediately apparent to me.

Code snippet

Test page

The code being highlighted incorrectly.

readBooleanOrIntAsBoolean ∷ Foreign → Foreign.F Boolean
readBooleanOrIntAsBoolean value =
  Foreign.readBoolean value
    <|> (toBool =<< Foreign.readInt value)
  where
  toBool ∷ Int → Foreign.F Boolean
  toBool = case _ of
    0 → pure false
    1 → pure true
    int → Foreign.fail (Foreign.ForeignError ("Invalid integer: " <> show int))

isSuccessResponse ∷ ∀ a. AX.Response a → Boolean
isSuccessResponse { status } = status >= (StatusCode 200) && status < (StatusCode 400)

infix 4 eq as ≡

isMempty ∷ ∀ m. Eq m ⇒ Monoid m ⇒ m → Boolean
isMempty = (_ ≡ mempty)


RunDevelopment · 2021-07-22T21:07:24Z

∷→←⇒⇐ are natively supported as operators in the compiler, but also users can define their own operators (such as ≡ for ==) (see: Lexer.hs).

Hmmm, this will be difficult. I don't think that we will be able to match the behaviour of isSymbolChar. It's possible but impractical.

Don't get me wrong, it's easy to create a regex that matches isSymbolChar:

/[:!#$%&*+./<=>?@\\^|\-~]|(?![\0-\x7F])[\p{gc=Math_Symbol}\p{gc=Currency_Symbol}\p{Modifier_Symbol}\p{Other_Symbol}]/u
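For illustration, here is a minimal sketch of how the pattern above lines up with isSymbolChar on single characters. It assumes a JavaScript engine that supports the u flag and Unicode property escapes. The names symbolChar and isSymbolCharJs are made up for this example and are not part of Prism or the PureScript compiler.

```javascript
// Hypothetical helper mirroring PureScript's isSymbolChar for one character.
// Assumption: the engine supports the `u` flag and Unicode property escapes.
// The pattern is the regex quoted above, anchored so it must match the whole input.
const symbolChar = /^(?:[:!#$%&*+./<=>?@\\^|\-~]|(?![\0-\x7F])[\p{gc=Math_Symbol}\p{gc=Currency_Symbol}\p{Modifier_Symbol}\p{Other_Symbol}])$/u;

function isSymbolCharJs(ch) {
	return symbolChar.test(ch);
}

// Print the verdict for a few sample characters.
for (const ch of ["≡", "∷", "+", "a", "`", "👍"]) {
	console.log(JSON.stringify(ch), isSymbolCharJs(ch));
}
```

Note that the backtick is rejected: it is an ASCII character that is not in the explicit list, so the negative lookahead excludes it, just as the `not (Char.isAscii c)` guard does.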

But we can't use this regex. We support browsers that do not support Unicode property escapes or the u flag. "Inlining" the Unicode property escapes results in a very long regex (2.8kB):

/[:!#$%&*+./<=>?@\\^|\-~\xa2-\xa6\xa8\xa9\xac\xae-\xb1\xb4\xb8\xd7\xf7\u02c2-\u02c5\u02d2-\u02df\u02e5-\u02eb\u02ed\u02ef-\u02ff\u0375\u0384\u0385\u03f6\u0482\u058d-\u058f\u0606-\u0608\u060b\u060e\u060f\u06de\u06e9\u06fd\u06fe\u07f6\u07fe\u07ff\u09f2\u09f3\u09fa\u09fb\u0af1\u0b70\u0bf3-\u0bfa\u0c7f\u0d4f\u0d79\u0e3f\u0f01-\u0f03\u0f13\u0f15-\u0f17\u0f1a-\u0f1f\u0f34\u0f36\u0f38\u0fbe-\u0fc5\u0fc7-\u0fcc\u0fce\u0fcf\u0fd5-\u0fd8\u109e\u109f\u1390-\u1399\u166d\u17db\u1940\u19de-\u19ff\u1b61-\u1b6a\u1b74-\u1b7c\u1fbd\u1fbf-\u1fc1\u1fcd-\u1fcf\u1fdd-\u1fdf\u1fed-\u1fef\u1ffd\u1ffe\u2044\u2052\u207a-\u207c\u208a-\u208c\u20a0-\u20bf\u2100\u2101\u2103-\u2106\u2108\u2109\u2114\u2116-\u2118\u211e-\u2123\u2125\u2127\u2129\u212e\u213a\u213b\u2140-\u2144\u214a-\u214d\u214f\u218a\u218b\u2190-\u2307\u230c-\u2328\u232b-\u2426\u2440-\u244a\u249c-\u24e9\u2500-\u2767\u2794-\u27c4\u27c7-\u27e5\u27f0-\u2982\u2999-\u29d7\u29dc-\u29fb\u29fe-\u2b73\u2b76-\u2b95\u2b97-\u2bff\u2ce5-\u2cea\u2e50\u2e51\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u2ff0-\u2ffb\u3004\u3012\u3013\u3020\u3036\u3037\u303e\u303f\u309b\u309c\u3190\u3191\u3196-\u319f\u31c0-\u31e3\u3200-\u321e\u322a-\u3247\u3250\u3260-\u327f\u328a-\u32b0\u32c0-\u33ff\u4dc0-\u4dff\ua490-\ua4c6\ua700-\ua716\ua720\ua721\ua789\ua78a\ua828-\ua82b\ua836-\ua839\uaa77-\uaa79\uab5b\uab6a\uab6b\ufb29\ufbb2-\ufbc1\ufdfc\ufdfd\ufe62\ufe64-\ufe66\ufe69\uff04\uff0b\uff1c-\uff1e\uff3e\uff40\uff5c\uff5e\uffe0-\uffe6\uffe8-\uffee\ufffc\ufffd\u{10137}-\u{1013f}\u{10179}-\u{10189}\u{1018c}-\u{1018e}\u{10190}-\u{1019c}\u{101a0}\u{101d0}-\u{101fc}\u{10877}\u{10878}\u{10ac8}\u{1173f}\u{11fd5}-\u{11ff1}\u{16b3c}-\u{16b3f}\u{16b45}\u{1bc9c}\u{1d000}-\u{1d0f5}\u{1d100}-\u{1d126}\u{1d129}-\u{1d164}\u{1d16a}-\u{1d16c}\u{1d183}\u{1d184}\u{1d18c}-\u{1d1a9}\u{1d1ae}-\u{1d1e8}\u{1d200}-\u{1d241}\u{1d245}\u{1d300}-\u{1d356}\u{1d6c1}\u{1d6db}\u{1d6fb}\u{1d715}\u{1d735}\u{1d74f}\u{1d76f}\u{1d789}\u{1d7a9}\u{1d7c3}\u{1d800}-\u{1d9ff}\u{1da37}-\u{1da3a}\u{1da6d}-\u{1da74}\u{1da76}-\u{1da83}\u{1da85}\u{1da86}\u{1e14f}\u{1e2ff}\u{1ecac}\u{1ecb0}\u{1ed2e}\u{1eef0}\u{1eef1}\u{1f000}-\u{1f02b}\u{1f030}-\u{1f093}\u{1f0a0}-\u{1f0ae}\u{1f0b1}-\u{1f0bf}\u{1f0c1}-\u{1f0cf}\u{1f0d1}-\u{1f0f5}\u{1f10d}-\u{1f1ad}\u{1f1e6}-\u{1f202}\u{1f210}-\u{1f23b}\u{1f240}-\u{1f248}\u{1f250}\u{1f251}\u{1f260}-\u{1f265}\u{1f300}-\u{1f6d7}\u{1f6e0}-\u{1f6ec}\u{1f6f0}-\u{1f6fc}\u{1f700}-\u{1f773}\u{1f780}-\u{1f7d8}\u{1f7e0}-\u{1f7eb}\u{1f800}-\u{1f80b}\u{1f810}-\u{1f847}\u{1f850}-\u{1f859}\u{1f860}-\u{1f887}\u{1f890}-\u{1f8ad}\u{1f8b0}\u{1f8b1}\u{1f900}-\u{1f978}\u{1f97a}-\u{1f9cb}\u{1f9cd}-\u{1fa53}\u{1fa60}-\u{1fa6d}\u{1fa70}-\u{1fa74}\u{1fa78}-\u{1fa7a}\u{1fa80}-\u{1fa86}\u{1fa90}-\u{1faa8}\u{1fab0}-\u{1fab6}\u{1fac0}-\u{1fac2}\u{1fad0}-\u{1fad6}\u{1fb00}-\u{1fb92}\u{1fb94}-\u{1fbca}^]/iu

And getting rid of the u flag will make it even longer (probably around 6kB).

That's too long. Prism language definitions are supposed to be lightweight.

Could we somehow limit the symbols used? Are there symbols that are commonly used?

As PureScript extends the Haskell syntax and Haskell has the mention of . being an issue for function composition in its comment (with . being used for records in PureScript)

Could you please give an example of this problem?

toastal · 2021-07-23T07:24:36Z

Could you please give an example of this problem?

Literally just quoting:

{
    // ...
    //
    // Most of this is needed because of the meaning of a single '.'.
    // If it stands alone freely, it is the function composition.
    // It may also be a separator between a module name and an identifier => no
    // operator. If it comes together with other special characters it is an
    // operator too.
    'operator': /\s\.\s|[-!#$%*+=?&@|~:<>^\\\/]*\.[-!#$%*+=?&@|~.:<>^\\\/]+|[-!#$%*+=?&@|~.:<>^\\\/]+\.[-!#$%*+=?&@|~:<>^\\\/]*|[-!#$%*+=?&@|~:<>^\\\/]+|`(?:[A-Z][\w']*\.)*[_a-z][\w']*`/,
    //
    // ...
}

I get what the comment means, but I can't immediately follow the RegExp itself. Is . <> okay, as is <>? I could be wrong, but this check may be unnecessary in PureScript, since compose is <<< and not . as in Haskell (though I and others define a Unicode alias, ∘, as it is the "proper" composition operator).
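One way to see what the quoted pattern does is to run it against a few inputs. The sketch below copies the Haskell operator regex verbatim and reports the first match in each sample. The helper name firstOperator and the sample strings are assumptions made for this example and do not come from Prism's test suite.

```javascript
// The Haskell `operator` pattern quoted above, copied verbatim.
const haskellOperator = /\s\.\s|[-!#$%*+=?&@|~:<>^\\\/]*\.[-!#$%*+=?&@|~.:<>^\\\/]+|[-!#$%*+=?&@|~.:<>^\\\/]+\.[-!#$%*+=?&@|~:<>^\\\/]*|[-!#$%*+=?&@|~:<>^\\\/]+|`(?:[A-Z][\w']*\.)*[_a-z][\w']*`/;

// Hypothetical helper: returns the first operator match in `source`, or null.
function firstOperator(source) {
	const match = haskellOperator.exec(source);
	return match ? match[0] : null;
}

// Sample inputs chosen for illustration only.
const samples = [
	"f . g",            // free-standing dot (Haskell composition)
	"Data.Map.lookup",  // module-qualified name
	"rec.field",        // PureScript record access
	"a <> b",           // ASCII operator without a dot
	"f <<< g",          // PureScript composition
	"a .&. b",          // operator containing dots
	"f ∘ g",            // Unicode operator
];

for (const s of samples) {
	console.log(JSON.stringify(s), "->", JSON.stringify(firstOperator(s)));
}
```

With this pattern, a dot surrounded by whitespace and dot-free operators such as <> or <<< are matched, while Data.Map.lookup and rec.field produce no match. The last sample also produces no match, which is the gap this issue is about: none of the character classes include non-ASCII symbols.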

That's too long. Prism language definitions are supposed to be lightweight.

I suppose a symbol expression could be something everyone using the library could use? Perhaps there are some common Unicode ranges that could suffice. I don't know what a long-term solution is, though, short of fully supporting all possibilities. I put some realistic examples into my shared example, taken from code I'm using, and they are clearly not highlighted correctly.

Looking for actionable solutions with limitations:

1. Support the bare minimum, ∷→←⇒⇐, which are built into the compiler (no user definitions required).
2. The above, plus something like one person's Unicode Prelude and some basic math operators: ∘×÷≡≠⫽⩓⩔∧∨↝⨁⊹ could cover a lot of basic cases.
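The "common Unicode ranges" idea can be sketched as a small character class. This is only an illustration under stated assumptions, not the change that was eventually merged. The names unicodeOperatorChar and uncovered are hypothetical, and the choice of blocks (Arrows, Mathematical Operators, Supplemental Mathematical Operators, plus × and ÷) is a guess at what would cover the operators listed above.

```javascript
// Hypothetical, range-based approximation of PureScript's Unicode operator characters.
// Covers × (U+00D7) and ÷ (U+00F7) plus three Unicode blocks:
//   Arrows                              U+2190-U+21FF
//   Mathematical Operators              U+2200-U+22FF
//   Supplemental Mathematical Operators U+2A00-U+2AFF
// No `u` flag is needed because every range lies in the Basic Multilingual Plane.
const unicodeOperatorChar = /[\xd7\xf7\u2190-\u21ff\u2200-\u22ff\u2a00-\u2aff]/;

// Operators mentioned in the list above.
const builtIn = "∷→←⇒⇐";
const common = "∘×÷≡≠⫽⩓⩔∧∨↝⨁⊹";

// Hypothetical helper: returns the characters in `chars` the pattern does not cover.
function uncovered(chars) {
	return Array.from(chars).filter((c) => !unicodeOperatorChar.test(c));
}

console.log("built-in not covered:", uncovered(builtIn));
console.log("common not covered:", uncovered(common));
```

The trade-off is that a fixed set of blocks stays short but misses any user-defined operator outside them. It also includes ∀ (U+2200), so a grammar using something like this would presumably need to match that keyword before the operator pattern.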

Personally, the Unicode support was one of the reasons that's kept me attracted to PureScript after many years.

RunDevelopment · 2021-07-30T12:22:21Z

Sorry for the delay @toastal. I made a PR (#3020) that fixes this issue, I think. Could you please verify that everything works?

toastal · 2021-07-30T16:28:11Z

lgtm 👍

toastal added the language-definitions label Jul 22, 2021

RunDevelopment added the bug label Jul 30, 2021

RunDevelopment mentioned this issue Jul 30, 2021

Improved Haskell and PureScript #3020

Merged

RunDevelopment closed this as completed in #3020 Jul 30, 2021
