Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The binary format doesn't have to directly mirror normal JS. #37

Open
dead-claudia opened this issue May 26, 2018 · 16 comments
Open

The binary format doesn't have to directly mirror normal JS. #37

dead-claudia opened this issue May 26, 2018 · 16 comments

Comments

@dead-claudia
Copy link

dead-claudia commented May 26, 2018

I'm not convinced it's required to marry the binary format too tightly to the JS source, and there are things you could do to make it simpler. It's not like the binary format is meant to be human-readable, just machine-readable. There are also idioms in compiled-to-JS code (e.g. from Elm, CoffeeScript, and Babel) that could also stand to benefit from a binary format that that's at least aware of some of their needs.

Here's a few of the ideas I have to throw at the wall. Apologies if this comes across as a bit rambly.

  • Adding a constant for undefined

    • This amounts to the vast majority of uses of void 0 (what UglifyJS replaces strict-mode undefined with) and similar.
    • UglifyJS already prefers a local permanently undefined variable over void 0 for code size reasons.
    • It's already a pseudo-literal in strict mode.
  • Allowing synthetic, anonymous locals that aren't viewable by direct eval or with

    • This would be useful for anything compiling to JS with extra sugar or whatnot, like Scala.js, TypeScript, and Babel.
    • This would also be useful for stripping local names from the binary for minification purposes.
  • Adding built-in source map support.

    • Read: nodes can have locations attached.
  • Adding a built-in description/metadata field.

    • Read: you can still include your license and such. It's just not something that can easily be placed as like a comment.
  • Adding combined type-checking operators for typeof x === "number", etc.

    • This amounts to 90% of typeof use cases, and engines already desugar these 100% of the time. (It's an obvious optimization.)
  • Adding combined coersion operators for x | 0/~~x, !!x, x + "", etc.

    • Those are the only uses for any of those productions, and engines never implement these literally.
    • In binary asm.js code, one of those could replace the standard x | 0 or at least act as a synonym.
    • Elm and TypeScript code could use these quite frequently.
  • Separating the pragmas from the source/function body.

    • This avoids having to parse the source via lookahead to determine what pragmas apply.
    • This makes it much simpler to just skip what you don't know, and skip pragmas you do know, but have already matched.
  • Separating strings from the code, instead storing debug string/name references as {int offset; int length} pairs and putting their values consecutively in a string table before any real code references.

    • The string table could be just be slab-allocated from the start, then filled with data as fast as it receives it. (This is very low-allocation and easily optimizable, so parsing this part is I/O-bound.)
    • It avoids needing to allocate nearly as much when the string/name is being parsed, needing only a reference to the data.
    • The string table is specific to the script/module file, and can be reclaimed after compilation. (String names are copied during bytecode generation post-parsing.)
    • There are potential size benefits here in my experience for making this separation: binary data and string data almost always compress differently, and from previous experimentation under similar circumstances (compressing data in a long URL hash), I noticed about a 2%-4% decrease alone just by doing this.
  • Reducing logical "and"/"or" to corresponding if/else variants: test && other(tmp = test) ? tmp : other, test || other(tmp = test) ? other : test.

    • This could increase code size slightly (no more than a few bytes overhead if it can't reuse an existing synthetic local), but every engine I'm aware of performs this desugaring already, so why make it different.
  • Using breakable blocks with an expression-based AST instead of a statement-based one.

    • Expression-based ASTs are usually smaller than statement-based ones.
    • You could optimize certain cases like let x = foo(); try { x = foo(); } catch (e) { if (!(e instanceof SafeError)) throw e }, where it could be simplified to something roughly like let x = try { foo() } catch (e) { e instanceof SafeError ? undefined : throw e }. Edit: Missed a backtick 😄
    • Elm's highly expression-oriented nature would love this, too.
  • Requiring fallthrough to be explicit in switch statements.

    • It's not common to actually want the fallthrough mechanism.
    • Combined with the previous version, many uses of switch would become much smaller when encoded.
  • Declaring locals before script/module/function body.

    • This makes it easier for engines to type-check which locals are accessible where.
  • Declaring imports/exports before module data table.

    • These would be specified in a separate array, followed by a list of imported bindings from each, for performance reasons.
    • This can enable engines to efficiently request and link modules before they even start parsing the meat of them. (In normal JS, you can't do this. At all. Not without some deep wizardry no engine implements support for.)
    • Loading the data table is one of the most I/O-bound parts of the compilation process, so it's possible multiple modules could have their data tables being loaded simultaneously, even in single-threaded environments.
  • Encoding the bytecode as a hybrid register/stack machine.

    • This removes most of the need for temporaries.
    • This makes the bytecode a little more dense.
    • Not 100% sure how this is otherwise helpful, though...
@Yoric
Copy link
Collaborator

Yoric commented May 26, 2018

Thanks for the proposals! Let me try and sort it a bit.

Kinda language extensions

  • Adding a constant for undefined
  • Allowing synthetic, anonymous locals that aren't viewable by direct eval or with
  • Adding combined type-checking operators for typeof x === "number", etc.
  • Adding combined coersion operators for x | 0/~~x, !!x, x + "", etc.
  • Requiring fallthrough to be explicit in switch statements.
  • Encoding the bytecode as a hybrid register/stack machine.

While each of these proposals makes sense individually, I really feel that they are mostly orthogonal to Binary AST. Recall that each addition to Binary AST will make its adoption more complicated by browsers and VMs. For this reason, we wish to keep it as minimal as possible, which means sticking exactly to the JS language.

Once Binary AST is a standard, we'll be happy to rediscuss each of these proposals, though :)

Kinda optimizations

Separating the pragmas from the source/function body.

That definitely makes sense.

Separating strings from the code, instead storing debug string/name references as {int offset; int length} pairs and putting their values consecutively in a string table before any real code references.

Already the case :)

Reducing logical "and"/"or" to corresponding if/else variants: test && other(tmp = test) ? tmp : other, test || other(tmp = test) ? other : test.

Encoders are, of course, free to make such transformations, especially if they are optimizations. I don't think we want to hardcode them into Binary AST, though.

Using breakable blocks with an expression-based AST instead of a statement-based one.

I don't understand that one, could you elaborate?

Declaring locals before script/module/function body.

Already the case :)

Declaring imports/exports before module data table.

I like the idea. We'll need to think more about it.

Kinda metadata

Adding built-in source map support.

Definitely an objective. Not on the top of our list yet, but we're working on it.

Adding a built-in description/metadata field.

Indeed, we're thinking of a built-in metadata field. It shouldn't be hard to add, as long as it doesn't have semantics. I suspect that most toolchains will strip it, though.

@j-f1
Copy link

j-f1 commented May 26, 2018

For switches, how about having a table at the beginning that has gotos in it, then have an array of statements after the table?

Example
switch (expr) {
  case 'Oranges':
    console.log('Oranges are $0.59 a pound.');
    break;
  case 'Mangoes':
  case 'Papayas':
    console.log('Mangoes and papayas are $2.79 a pound.');
    // expected output: "Mangoes and papayas are $2.79 a pound."
    break;
  default:
    console.log('Sorry, we are out of ' + expr + '.');
}

(source)

Binary AST:

Header:

Value of expr Index
"Oranges" 0
"Mangoes" 2
"Papayas" 2
[[Default]] 4

Body:

console.log('Oranges are $0.59 a pound.');
break;
console.log('Mangoes and papayas are $2.79 a pound.');
break;
console.log('Sorry, we are out of ' + expr + '.');

@Yoric
Copy link
Collaborator

Yoric commented May 26, 2018

@j-f1 Why not. What's the expected benefit?

@j-f1
Copy link

j-f1 commented May 26, 2018

It would allow the engine to skip over unused code and only parse the statements used in the current condition. Additionally, the engine would be able to determine if the default case needs to be taken without having to search the entire switch body for cases first.

@dead-claudia
Copy link
Author

@Yoric Of course, some of those points and thoughts were more or less very forward-looking, and not exactly what I'd expect would have a chance in making it into an MVP. (Like, for example, the synthetic anonymous locals.)

Using breakable blocks with an expression-based AST instead of a statement-based one.

I don't understand that one, could you elaborate?

Think similar to WebAssembly and how its blocks work, just optionally a little heavier (if you need to introduce locals in them) and with unconditionally only one return value. That's pretty much what I had in mind.

WebAssembly itself is expression-based, it uses blocks with (br N value)/(br_if N cond value) and is a hybrid register/stack machine. Expression-based languages are typically denser and simpler, since they don't need as much bookkeeping on what is and isn't a valid statement (almost everything is), and because it's pretty common for control flow to only assign to something once (consider why the do expression proposal exists).

To expand on my concrete example:

An engine wouldn't need to do much work to infer that the result of a particular try/catch is only used in a particular variable if you just assign the result directly into the value itself. Also, it's not like you couldn't emit the statement-like form and it still work, too. Engines can desugar that into bytecode similarly to how they already do with let x; try { x = foo() } catch (e) { ... } now (in fact, I'd expect most implementations to do just that), and Babel could compile its try/catch blocks within do expressions pretty similarly.* (For reasons that make no sense to me, CoffeeScript and Babel + do expressions instead generate IIFEs.)

* In reality, an expression-based syntax means we wouldn't need to support do expressions in the core proposal, since it'd be flat out useless at best. An outer breakable block would be what early returns would compile into if they couldn't be compiled into an if/else.

@Yoric
Copy link
Collaborator

Yoric commented May 26, 2018

@j-f1

It would allow the engine to skip over unused code and only parse the statements used in the current condition. Additionally, the engine would be able to determine if the default case needs to be taken without having to search the entire switch body for cases first.

We're thinking about something kind-of like this, but if we do it, this will complicate the specifications of switch() wrt syntax errors.

@syg
Copy link
Collaborator

syg commented May 26, 2018

@Yoric did a great job separating out the @isiahmeadows' language feature ideas from other suggestions. One of the highest, if not the highest constraints in this design space is to not bifurcate the language. To that end, I take the existence of an "obvious" bijection (modulo whitespace and whatnot) between source JS and binary AST to be a hard constraint. That is, binary AST will not add any language features. What's expressible in source JS must be expressible in binary AST.

Like @Yoric said, the biggest reason is adoption, both for developers and for web browsers. Please do not take binary AST to be our trying to design a better JS. It is very much just the JS language with a more performant transport format.

I also saw mentions to bytecode. There will never be any sort of bytecode involved with binary AST.

@dead-claudia
Copy link
Author

dead-claudia commented May 27, 2018

@syg I'm in high support of keeping it bijective. My hard constraint with all the "language feature" ideas I proposed were that every single one of them had to be trivially desugared. The only exception being synthetic locals used in a scope accessible to a direct eval.

I also saw mentions to bytecode. There will never be any sort of bytecode involved with binary AST.

I did sometimes conflate the binary AST representation with the concept of "bytecode", but I know it's technically incorrect - bytecode is a sequence of instructions, not a graph of nodes. I did make one suggestion here that would've actually made it bytecode in the technical sense, but it was kind of a thing thrown at the wall:

  • Encoding the bytecode as a hybrid register/stack machine.

I was already neutral on it to start, just throwing it out there for the sake of being thrown out there. I'm also not a big fan of the idea, since bytecode doesn't always end up smaller than a basic AST, and it's not always the most intuitive. And the more I think about this, the more I'm leaning against it. With this, all of a sudden you need a decompiler to debug the code, and we shouldn't be making browsers ship a glorified decompiler. This isn't WebAssembly, and even that one has chosen to make their bytecode a borderline AST format, just one that's mostly in reverse Polish notation for the node types. (Their text format lays this pretty bare.)

@bakkot
Copy link

bakkot commented May 27, 2018

@isiahmeadows:

every single one of them had to be trivially desugared

As you correctly point out, engines already do these sorts of simple translations, often at parse time. What's the proposed benefit to having the compiler do them? Keeping in mind that there's a real cost, which is adding new node types and defining and implementing their semantics.

@dead-claudia
Copy link
Author

@Yoric I forgot to clarify that the proposed global "metadata" field was supposed to not have any real semantics attached. It's for tools, not engines.

@dead-claudia
Copy link
Author

@bakkot

What's the proposed benefit to having the compiler do them?

Slightly less data over the wire, and engines don't need to enlist their heuristics to figure them out. Not all of them are especially justifiable, but some of them could be.

  • For x | 0/~~x, !!x, strict undefinedvoid 0, this isn't much (like one byte), and it might not justify their existence - it's like saving one byte.
  • For inverting explicit/implicit fallthrough, this is mostly useful for the common cases of never falling through, or only falling through (i.e. multiple grouped cases). The former is hardly compressible where an explicit break is just boilerplate, and the latter is trivially compressible if you change the switch statement layout to an array of cases + array of block bodies, since it'd just be a bunch of two-byte sequences. The inversion itself wouldn't benefit engines directly, but the size gain could be helpful. (I don't have raw numbers to back this up, however.)
  • For typeof x === "function", usually minified to "function"==typeof x, it'd save them from having to fetch the data table, making it easier and much faster to parse. It'd also encourage minifiers to convert the switch (typeof foo) { ... } idiom to if/else branches instead, saving engines from having to desugar that also - V8 didn't optimize that until TurboFan hit.

@erights
Copy link

erights commented Sep 26, 2018

@isiahmeadows

[undefined] is already a pseudo-literal in strict mode.

In what way? Just tested:

(function(){'use strict'; const undefined = 3; return undefined; })();

returns 3.

@ljharb
Copy link
Member

ljharb commented Sep 26, 2018

undefined is only a pseudo-literal in ES5+ in the global scope (in that it can't be reassigned there), but it can be shadowed anywhere else; i don't think strict mode comes into play at all.

@dead-claudia
Copy link
Author

dead-claudia commented Sep 26, 2018

@ljharb @erights I meant it's a pseudo-literal in that it has those special semantics - it's not a simple identifier in strict mode. It evaluates to:

  • The binding undefined if such a binding is defined for the current lexical scope, global lookups excluded.
  • The literal value undefined if no such binding is defined for the current lexical scope, global lookups excluded.

You could say it's a context-sensitive, non-context-free construct that could be parsed as either a keyword or identifier, depending on what variables are declared in its surrounding lexical scope. Implementations almost necessarily implement this as a keyword at parse time, since strict mode requires special semantics, and punting the check to identifier lookup will inevitably create performance problems. Nope, just a combination of non-configurable, non-writable and in the global scope...

In sloppy mode, it's just a simple global binding lookup, so it's of course not a special syntactic construct.

@erights
Copy link

erights commented Sep 26, 2018

Where is this special case in the spec?

Is this special case observable?

@dead-claudia
Copy link
Author

dead-claudia commented Sep 27, 2018

I edited it because I thought I knew how it worked, but apparently I was mistaken on particulars. 🙂 I thought it was a special case, but it's just a combination of descriptors + etc.

For what it's worth, UglifyJS only reduces undefined to void 0 in strict mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants