Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flow-remove-types produces invalid JS after non-latin characters #7779

Open
mourner opened this issue Jun 3, 2019 · 13 comments · May be fixed by #7781
Open

flow-remove-types produces invalid JS after non-latin characters #7779

mourner opened this issue Jun 3, 2019 · 13 comments · May be fixed by #7781

Comments

@mourner
Copy link

mourner commented Jun 3, 2019

Flow version: 0.100.0

Expected behavior

Given the following input:

// @flow
// п
function foo(bar: number) {}

flow-remove-types should produce:

//      
// п
function foo(bar        ) {}

Actual behavior

It produces invalid JS:

//      
// п
function foo(bar:         {}

The non-latin character is the trigger. Without it, it works correctly. This seems like quite a serious bug considering how common it is to use non-latin characters in comments (e.g. comment in other languages, draw arrows etc.).

cc @mroch @samwgoldman

@mourner mourner changed the title flow-remove-types produces invalid JS after unicode characters in comments flow-remove-types produces invalid JS after non-latin characters in comments Jun 3, 2019
@mourner
Copy link
Author

mourner commented Jun 3, 2019

This seems to be a bug in flow-parser which produces incorrect range values in AST for non-latin characters. E.g. parsing // п and // n produces [0, 5] and [0, 4] ranges respectively, although both should be the same.

@mroch
Copy link
Contributor

mroch commented Jun 3, 2019

same root cause as #5793, I believe. the lexer reports its position in unicode codepoints, of which these are 1, but they are multibyte.

@mourner
Copy link
Author

mourner commented Jun 3, 2019

@mroch I believe it's a recent regression, because https://astexplorer.net/ (which uses Flow v0.94.0) shows identical AST for both cases ([0, 4]).

@mroch
Copy link
Contributor

mroch commented Jun 3, 2019

Screen Shot 2019-06-03 at 9 27 32 AM

this is 0.94, when hovering over the FunctionDeclaration in the AST. so it's been there a while, but flow-remove-types regressed because it moved to flow-parser.

@mroch
Copy link
Contributor

mroch commented Jun 3, 2019

this appears to only be an issue with the range. the loc matches babylon.

it's a bit easier to see if you do it all on one line: https://flow.org/try/#0PQKk-CAsBmD2MNxA (see the AST tab). since it's all on line 1, the range should match the columns in loc.

@mourner
Copy link
Author

mourner commented Jun 3, 2019

@mroch ah, yeah, you're right. The link I commented earlier triggers the bug if you add a new line after the comment.

@mroch
Copy link
Contributor

mroch commented Jun 3, 2019

hrm i think we intentionally made range work like this. it's the byte offset in the file, which is what you'd want if you wanted to seek to read that part of the file from disk. but it's not what you'd pass to substring in JS, which is what tools like flow-remove-types do.

@mroch mroch added parsing and removed needs triage labels Jun 3, 2019
@mroch
Copy link
Contributor

mroch commented Jun 3, 2019

cc @nmote I think we should change range to match esprima and babylon, and potentially add (an option for) another field like byte_range or something. will need to figure out which tools need to be updated to use that.

@mroch
Copy link
Contributor

mroch commented Jun 3, 2019

(* Note on character encodings:
*
* Throughout Flow, we assume that program text uses a UTF-8 encoding. OCaml strings are just a
* sequence of bytes, so any handling of multi-byte characters needs to be done explicitly.
*
* Column numbers in `Loc.position`s are based on the number of characters into a line the position
* appears, not the number of bytes. Single-byte and multi-byte characters are treated the same for
* the purposes of counting columns.
*
* However, offsets are most useful (at least when working with OCaml's string representation) when
* they represent the number of bytes into the text a given position is. Therefore, this utility
* returns such offsets.
*
* In contrast, JavaScript strings must behave as if they have a UTF-16 encoding, and each element
* is a single 16-bit entry. So, each character occupies either one or two elements of a JavaScript
* string. Esprima, for example, returns ranges based on index into a JS string. For example, this
* utility would consider the smiley emoji (code point 0x1f603) to have width 4 (because its UTF-8
* encoding is 4 8-bit elements), but Esprima would consider it to have width 2 (because its UTF-16
* encodinng is 2 16-bit elements).
*
* If necessary to improve compatibility with Esprima, this utility could be extended to
* additionally track what the offsets into JS-style strings would be.
*)

... yup

@mourner
Copy link
Author

mourner commented Jun 3, 2019

Thanks for looking into this @mroch! Agreed about bringing Flow AST more in line with other parsers.

In the meantime, to fix the breaking bug with flow-remove-types sooner, should we just rewrite the logic so that it only depends on line/column, not range? I could take a stab at a PR.

@mroch
Copy link
Contributor

mroch commented Jun 3, 2019

alternatively, if it could work with UTF8 bytes instead of JS strings, that would make the existing ranges correct.

for example, perhaps changing flow-remove-types to use Buffers? I think the offsets of a Buffer are in bytes, not codepoints. modifying a buffer in place instead of lots of string concatenation could be a perf win too.

result += source.slice(lastPos, startOf(node));

@mourner
Copy link
Author

mourner commented Jun 3, 2019

@mroch yep, will do!

mourner added a commit to mourner/flow that referenced this issue Jun 3, 2019
@mourner mourner linked a pull request Jun 3, 2019 that will close this issue
4 tasks
@mroch mroch changed the title flow-remove-types produces invalid JS after non-latin characters in comments flow-remove-types produces invalid JS after non-latin characters Aug 12, 2019
@walkerdb
Copy link

walkerdb commented Aug 17, 2019

We are running into the same problem, in our case with a ™ in a comment.

// @flow
import React from 'react';

/*
  This should Just Work™
*/
export const SomeComponent = (props: Object) => (
  <div {...props}>
    This breaks
  </div>
);

becomes

import React from 'react';

/*
  This should Just Work™
*/
export const SomeComponent = (props:=> (
  <div {...props}>
    This breaks
  </div>
);

Looks like it's been a few months now, and it shows up in all v2 releases of flow-remove-types -- any updates on a fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants