flow-remove-types produces invalid JS after non-latin characters #7779

mourner · 2019-06-03T12:24:14Z

Flow version: 0.100.0

Expected behavior

Given the following input:

// @flow
// п
function foo(bar: number) {}

flow-remove-types should produce:

//      
// п
function foo(bar        ) {}

Actual behavior

It produces invalid JS:

//      
// п
function foo(bar:         {}

The non-latin character is the trigger. Without it, it works correctly. This seems like quite a serious bug considering how common it is to use non-latin characters in comments (e.g. comment in other languages, draw arrows etc.).

cc @mroch @samwgoldman

The text was updated successfully, but these errors were encountered:

mourner · 2019-06-03T13:16:02Z

This seems to be a bug in flow-parser which produces incorrect range values in AST for non-latin characters. E.g. parsing // п and // n produces [0, 5] and [0, 4] ranges respectively, although both should be the same.

mroch · 2019-06-03T13:17:48Z

same root cause as #5793, I believe. the lexer reports its position in unicode codepoints, of which these are 1, but they are multibyte.

mourner · 2019-06-03T13:20:55Z

@mroch I believe it's a recent regression, because https://astexplorer.net/ (which uses Flow v0.94.0) shows identical AST for both cases ([0, 4]).

mroch · 2019-06-03T13:28:14Z

this is 0.94, when hovering over the FunctionDeclaration in the AST. so it's been there a while, but flow-remove-types regressed because it moved to flow-parser.

mroch · 2019-06-03T13:35:57Z

this appears to only be an issue with the range. the loc matches babylon.

it's a bit easier to see if you do it all on one line: https://flow.org/try/#0PQKk-CAsBmD2MNxA (see the AST tab). since it's all on line 1, the range should match the columns in loc.

mourner · 2019-06-03T13:37:23Z

@mroch ah, yeah, you're right. The link I commented earlier triggers the bug if you add a new line after the comment.

mroch · 2019-06-03T13:39:58Z

hrm i think we intentionally made range work like this. it's the byte offset in the file, which is what you'd want if you wanted to seek to read that part of the file from disk. but it's not what you'd pass to substring in JS, which is what tools like flow-remove-types do.

mroch · 2019-06-03T13:43:32Z

cc @nmote I think we should change range to match esprima and babylon, and potentially add (an option for) another field like byte_range or something. will need to figure out which tools need to be updated to use that.

mroch · 2019-06-03T14:37:19Z

flow/src/parser/offset_utils.mli

Lines 8 to 30 in a05cbb2

    
           (* Note on character encodings: 
        
            * 
        
            * Throughout Flow, we assume that program text uses a UTF-8 encoding. OCaml strings are just a 
        
            * sequence of bytes, so any handling of multi-byte characters needs to be done explicitly. 
        
            * 
        
            * Column numbers in `Loc.position`s are based on the number of characters into a line the position 
        
            * appears, not the number of bytes. Single-byte and multi-byte characters are treated the same for 
        
            * the purposes of counting columns. 
        
            * 
        
            * However, offsets are most useful (at least when working with OCaml's string representation) when 
        
            * they represent the number of bytes into the text a given position is. Therefore, this utility 
        
            * returns such offsets. 
        
            * 
        
            * In contrast, JavaScript strings must behave as if they have a UTF-16 encoding, and each element 
        
            * is a single 16-bit entry. So, each character occupies either one or two elements of a JavaScript 
        
            * string. Esprima, for example, returns ranges based on index into a JS string. For example, this 
        
            * utility would consider the smiley emoji (code point 0x1f603) to have width 4 (because its UTF-8 
        
            * encoding is 4 8-bit elements), but Esprima would consider it to have width 2 (because its UTF-16 
        
            * encodinng is 2 16-bit elements). 
        
            * 
        
            * If necessary to improve compatibility with Esprima, this utility could be extended to 
        
            * additionally track what the offsets into JS-style strings would be. 
        
            *)

... yup

mourner · 2019-06-03T14:43:49Z

Thanks for looking into this @mroch! Agreed about bringing Flow AST more in line with other parsers.

In the meantime, to fix the breaking bug with flow-remove-types sooner, should we just rewrite the logic so that it only depends on line/column, not range? I could take a stab at a PR.

mroch · 2019-06-03T14:52:24Z

alternatively, if it could work with UTF8 bytes instead of JS strings, that would make the existing ranges correct.

for example, perhaps changing flow-remove-types to use Buffers? I think the offsets of a Buffer are in bytes, not codepoints. modifying a buffer in place instead of lots of string concatenation could be a perf win too.

flow/packages/flow-remove-types/index.js

Line 107 in a05cbb2

result += source.slice(lastPos, startOf(node));

mourner · 2019-06-03T15:00:04Z

@mroch yep, will do!

closes facebook#7779

walkerdb · 2019-08-17T00:07:23Z

We are running into the same problem, in our case with a ™ in a comment.

// @flow
import React from 'react';

/*
  This should Just Work™
*/
export const SomeComponent = (props: Object) => (
  <div {...props}>
    This breaks
  </div>
);

becomes

import React from 'react';

/*
  This should Just Work™
*/
export const SomeComponent = (props:=> (
  <div {...props}>
    This breaks
  </div>
);

Looks like it's been a few months now, and it shows up in all v2 releases of flow-remove-types -- any updates on a fix?

mourner added bug needs triage labels Jun 3, 2019

mourner mentioned this issue Jun 3, 2019

Update Flow to 0.100 and improve typing mapbox/mapbox-gl-js#8314

Merged

mourner changed the title ~~flow-remove-types produces invalid JS after unicode characters in comments~~ flow-remove-types produces invalid JS after non-latin characters in comments Jun 3, 2019

mroch added parsing and removed needs triage labels Jun 3, 2019

mourner added a commit to mourner/flow that referenced this issue Jun 3, 2019

byte-based indexing to fix multi-byte characters input

5c00d61

closes facebook#7779

mourner linked a pull request Jun 3, 2019 that will close this issue

Fix flow-remove-types for input with multi-byte characters #7781

Open

4 tasks

mroch mentioned this issue Aug 12, 2019

Flow-remove-types breaks when the source have a long dash "–" #7989

Closed

mroch changed the title ~~flow-remove-types produces invalid JS after non-latin characters in comments~~ flow-remove-types produces invalid JS after non-latin characters Aug 12, 2019

sgb-io mentioned this issue Sep 3, 2019

[flow-remove-types] Output is corrupted by \u2019 characters #8060

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flow-remove-types produces invalid JS after non-latin characters #7779

flow-remove-types produces invalid JS after non-latin characters #7779

mourner commented Jun 3, 2019 •

edited

mourner commented Jun 3, 2019

mroch commented Jun 3, 2019

mourner commented Jun 3, 2019

mroch commented Jun 3, 2019

mroch commented Jun 3, 2019

mourner commented Jun 3, 2019

mroch commented Jun 3, 2019

mroch commented Jun 3, 2019

mroch commented Jun 3, 2019

mourner commented Jun 3, 2019

mroch commented Jun 3, 2019

mourner commented Jun 3, 2019

walkerdb commented Aug 17, 2019 •

edited

flow-remove-types produces invalid JS after non-latin characters #7779

flow-remove-types produces invalid JS after non-latin characters #7779

Comments

mourner commented Jun 3, 2019 • edited

Expected behavior

Actual behavior

mourner commented Jun 3, 2019

mroch commented Jun 3, 2019

mourner commented Jun 3, 2019

mroch commented Jun 3, 2019

mroch commented Jun 3, 2019

mourner commented Jun 3, 2019

mroch commented Jun 3, 2019

mroch commented Jun 3, 2019

mroch commented Jun 3, 2019

mourner commented Jun 3, 2019

mroch commented Jun 3, 2019

mourner commented Jun 3, 2019

walkerdb commented Aug 17, 2019 • edited

mourner commented Jun 3, 2019 •

edited

walkerdb commented Aug 17, 2019 •

edited