Sequences #72

jgravelle-google · 2019-09-24T00:23:32Z

Here's a dump of what I've been thinking about sequences so far. This is informed generally by the experimenting I've been doing in my polyfill/interpreter. This isn't so much an "I have answers" so much as it is a "here's why I'm thinking this is tricky"

Core use cases (in order of importance)

Passing buffers of bytes efficiently
1. buffers of larger data elements should be efficient too
Supporting collections of arbitrary data
1. including structured data
2. including non-scalar data (strings, and sequences)
Translating data between different representations of an ordered list

Sample sequences

C++ alone has three immediate options:

T* buffer (plus length separately)
std::vector<T>
std::list<T>

Assume vector is implemented as a struct with a pointer to a buffer, a length, and a current capacity.
Assume list is implemented as a doubly-linked list, with pointers to the front and back.

I'm using C++ here because it can represent this in a variety of ways. For any given language I expect we only need to idiomatically support a single style. So for C++ we can have compiler support to autogenerate buffer adapters, but not linked lists, whereas in Scheme we could natively support linked lists but not necessarily anything else.

Complications

Assume we model a buffer of arbitray bytes as seq<u8>. To support a maximally-efficient copy, we need to have primitives that can be optimized into a single memcpy.
For the C-style simple buffer, we can do something similar to memory-to-string/string-to-memory, which takes ptr + length and produces a seq, and vice-versa (open question: does this handle only u8, or is it parameterized, and what type is it parameterized on, interface types or wasm types?).
For the linked list case, this is fundamentally impossible (and that's ok). We do need to be able to build a sequence from a linked list, so we will at minimum need another pair of primitives to do so. In my polyfill I tried fold-seq and repeat-until (using helper functions), to go from seq to list and vice-versa.
The vector case is surprisingly tricky to handle. On the one hand we can reuse the buffer-copying logic to marshal the actual data over. On the other hand, we need to maintain the vector invariants as we go (length, capacity). We either need to use a fold primitive and call in to the C++ code to push_back every element (and let C++ maintain the invariants), use the mem-to-seq instructions and fix up the invariants at the end, or move the invariant-maintaining logic in to the body of the folds. The only one of those that is amenable to optimization is to defer fixing up the internal state. I can't quite imagine how to express that without hand-writing the adapter, however (although for std::vector, as part of the standard library, we are technically able to do so).

For simplicity of specification it may be nice to just have a single kind of general primitive, some form of fold or loop, and have clear rules for where it is optimized away into a memcpy. It is also probably better to start with the more general primitive, and add alternative more-optimal ones later on.

The text was updated successfully, but these errors were encountered:

lukewagner · 2019-09-24T17:12:42Z

One idea I've been pondering is to have a single, fairly-general memory-to-sequence lifting instruction that takes a function immediate with signature: [i32]→[T,i32,bool] where: the incoming i32 is the pointer to the element which is to be turned into a T element (for a seq<T>), the bool result indicates whether there is a next element and, if so, the i32 result is a pointer to it.

For a std::vector<uint8_t> this would look like (using Nick's wasm idea, and pulling in let from function-references) [Edit: take out explicit wasm...end]:

(@interface func (export ...) (param $vec i32) (result (seq u8))
  (i32.load offset=4 (local.get $vec))                ;; vec.end()
  let (local $end i32)
    (i32.load offset=0 (local.get $vec))              ;; vec.begin()
    memory-to-sequence (func (param $p i32) (result u8 i32 bool)
      (lift-int i32 u8 (i32.load8_u (local.get $p)))  ;; *p
      (i32.add (local.get $p) (i32.const 1))          ;; p = p + sizeof(uint8_t)
      (lift-bool (i32.eq (dup) (local.get $end)))     ;; (p == end)
    )
  end
)

Inlining the lambda function into the memory-to-sequence instruction's implied for loop, this adapter could be compiled fairly easily into a pretty-efficient loop in machine code. Optimizing this further into a memcpy would involve some fairly simple analysis, but it's possible that this would feel arbitrary enough to motivate a more-specialized instruction that was guaranteed to map to a memcpy.

The neat thing though is that the same memory-to-sequence+lambda design would allow mapping a std::list<uint8_t> to a sequence just as easily [Edit: take out explicit wasm...end]:

(@interface func (export ...) (param $list i32) (result (seq u8))
  (i32.load offset=0 (local.get $list))                ;; list.begin()
  memory-to-sequence (func (param $p i32) (result u8 i32 bool)
    (lift-int i32 u8 (i32.load8_u (local.get $p)))     ;; *p
    (i32.load offset=4 (local.get $p))                 ;; p->next
    (lift-bool (i32.eq (dup) (i32.const 0)))           ;; (p == nullptr)
  )
)

Lowering is a separate topic, but WDYT about this lifting?

fgmccabe · 2019-09-24T20:42:30Z

All those wasm .. end sequences are pretty clumsy. As an alternate to an embedded lambda it would perhaps be more in keeping with wasm to have special looping blocks. This is the approach I have been toying with in the scenarios. That way you can make invoking a helper to its own thing. The lifting and lowering constructs need to be designed in tandem -- because they have to be rewritten in tandem. It does look like we are all converging on something though... Francis

…

On Tue, Sep 24, 2019 at 10:12 AM Luke Wagner ***@***.***> wrote: One idea I've been pondering is to have a single, fairly-general memory-to-sequence lifting instruction that takes a function immediate with signature: [i32]→[T,i32,bool] where: the incoming i32 is the pointer to the element which is to be turned into a T element (for a seq<T>), the bool result indicates whether there is a next element and, if so, the i32 result is a pointer to it. For a std::vector<uint8_t> this would look like (using Nick's wasm idea <#67>, and pulling in let from function-references <https://github.com/WebAssembly/function-references/blob/master/proposals/function-references/Overview.md#local-bindings> ): ***@***.*** func (export ...) (param $vec i32) (result (seq u8)) wasm (i32.load offset=4 (local.get $vec)) ;; vec.end() end let (local $end i32) wasm (i32.load offset=0 (local.get $vec)) ;; vec.begin() end memory-to-sequence (func (param $p i32) (result u8 i32 bool) wasm (i32.load8_u (local.get $p)) ;; *p end lift-int i32 u8 wasm (i32.add (local.get $p) (i32.const 1)) ;; p = p + sizeof(uint8_t) (i32.eq (dup) (local.get $end)) ;; (p == end) end lift-bool ) end ) Inlining the lambda function into the memory-to-sequence instruction's implied for loop, this adapter could be compiled fairly easily into a pretty-efficient loop in machine code. Optimizing this further into a memcpy would involve some fairly simple analysis, but it's possible that this would feel arbitrary enough to motivate a more-specialized instruction that was guaranteed to map to a memcpy. The neat thing though is that the same memory-to-sequence+lambda design would allow mapping a std::list<uint8_t> to a sequence just as easily: ***@***.*** func (export ...) (param $list i32) (result (seq u8)) wasm (i32.load offset=0 (local.get $list)) ;; list.begin() end memory-to-sequence (func (param $p i32) (result u8 i32 bool) wasm (i32.load8_u (local.get $p)) ;; *p end lift-int i32 u8 wasm (i32.load offset=4 (local.get $p)) ;; p->next (i32.eq (dup) (i32.const 0)) ;; (p == nullptr) end lift-bool ) ) Lowering is a separate topic, but WDYT about this lifting? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#72?email_source=notifications&email_token=AAQAXUHSFUHMXFKXBRCMGNTQLJDA7A5CNFSM4IZWL2YKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7PDNYA#issuecomment-534656736>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAQAXUH2COGLFYTUMCQ3K4DQLJDA7ANCNFSM4IZWL2YA> .

-- Francis McCabe SWE

lukewagner · 2019-09-24T20:58:51Z

Yeah, the wasm ... end blocks aren't especially beautiful; the alternative is simply to allow the wasm instructions inline without the surrounding block; the high-order bit, I think, is not redefining core wasm instructions to be similar-but-slightly-different.

So for the looping block idea: is the idea that the memory-to-sequence instruction itself defines the block? That could certainly work, I think. I liked reusing the lambdas b/c I think we'd need them for independent reasons, but perhaps the essential difference here is how they are treated by the rewrite rules; lifting/lowering bodies get rewritten/fused; allocator functions don't.

Agreed that lifting/lowering have to be designed in tandem, but I was trying to keep the post at least somewhat byte-sized and digestible.

jgravelle-google · 2019-09-24T21:28:57Z

What I wound up doing for the wasm instructions in my polyfill was just reimplement them as add and load. Not elegant either but was the quickest thing to do.
We could also potentially have the wasm ... end blocks be an encoding detail, where for any instr that's part of the subset we support, we duplicate the text format, and then encode it in the binary as some start_wasm sentinel.
Ok that's compelling enough that I'll take it to that thread.

it's possible that this would feel arbitrary enough to motivate a more-specialized instruction that was guaranteed to map to a memcpy

Yeah, in the same way that string-to-mem + mem-to-string is very clearly optimizable to memcpy (ish), seq-to-mem + mem-to-seq would be a way of making it clear that that works. We just also need a more general primitive. Which order we ship them in is less important so long as we design both to the point where we're confident.

I was trying to keep the post at least somewhat byte-sized and digestible.

This sounds like a great document for the working notes :D
Which I'd thought about doing this as, but figured we should get something like consensus first. It sounds like we have some vaguely converging ideas though. But yeah I predict we will wind up with a note document with something like 5-10 examples here and that should help.

jgravelle-google · 2019-09-25T17:06:54Z

A question I have is, are there other data structures that should map to sequences?

Stacks and queues should, but their internal memory layout is (probably) similar to a vector. (Hash)Maps are a collection, but need to store a key associated with each element.

Do we want to pass sets across an interface? Does it make sense to convert from set to (say) vector?
If yes, the same principles apply as a linked list, where we may not expect things to be fast and can always fall back to constructing a value by repeatedly calling an add_elem export. A sample implementation of a set could model the data as a tree. If the tree is stored as explicitly-modeled nodes with pointers to their children, the data is spread out and we can't expect any magic. If the tree is stored as implicitly-modeled nodes, then we may be able to read the data out (in some order) with a contiguous memcpy, but in order to maintain the tree invariants if we write data back in, we need to call exports.

Writing this out, I think the question of "should we let people publish an API that takes collections and call it with the wrong kind of thing?" is answered with "how could we even stop them in theory?"

Are there other ways of laying out data in memory aside from contiguously, or spread out with pointers?

fgmccabe · 2019-09-25T17:09:58Z

We kind of don't/should not care about non-sequence sequences. Francis

…

On Wed, Sep 25, 2019 at 10:07 AM Jacob Gravelle ***@***.***> wrote: A question I have is, are there other data structures that should map to sequences? Stacks and queues should, but their internal memory layout is (probably) similar to a vector. (Hash)Maps are a collection, but need to store a key associated with each element. Do we want to pass sets across an interface? Does it make sense to convert from set to (say) vector? If yes, the same principles apply as a linked list, where we may not expect things to be fast and can always fall back to constructing a value by repeatedly calling an add_elem export. A sample implementation of a set could model the data as a tree. If the tree is stored as explicitly-modeled nodes with pointers to their children, the data is spread out and we can't expect any magic. If the tree is stored as implicitly-modeled nodes <https://slideplayer.com/slide/5107901/16/images/3/Binary+Heap+Review+%E2%80%A6+%E2%80%A6+Implicit+%3A+stored+in+array+with+no+pointers.jpg>, then we may be able to read the data out (in some order) with a contiguous memcpy, but in order to maintain the tree invariants if we write data back in, we need to call exports. Writing this out, I think the question of "should we let people publish an API that takes collections and call it with the wrong kind of thing?" is answered with "how could we even stop them in theory?" Are there other ways of laying out data in memory aside from contiguously, or spread out with pointers? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#72?email_source=notifications&email_token=AAQAXUCK262ZZ3ZBPV2ZYGTQLOLDFA5CNFSM4IZWL2YKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SUI6Y#issuecomment-535118971>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAQAXUEEBUDFAFLWLSYDYRDQLOLDFANCNFSM4IZWL2YA> .

-- Francis McCabe SWE

jgravelle-google · 2019-09-25T17:24:36Z

Right, my point there is not that we need to add anything for non-sequential collections, but that the primitives we have are already enough for people to sequence their non-sequential data. We can (and should) continue to not care about them, but people will use them, and they will mostly work.

A related question is what are the essential properties of a sequence. Well-ordered is probably one. Isomorphic across representations could be one? i.e. for all X,Y where X is a type of sequence and Y is a type of sequence, for all n where n is a collection of values, then X(Y(X(n))) == X(n).
For example let n = [1, 2, 2, 3], X = Vector, Y = List, then Vector(List(Vector([1, 2, 2, 3]))) == Vector([1, 2, 2, 3]), whereas if Y = Set then we would drop the duplicated 2 so the isomorphism fails to hold.
I don't think we have any mechanism to disallow a user from mapping a set to a sequence. But we don't have to care about what happens in that case.

fgmccabe · 2019-09-25T20:25:50Z

Sequence<T> is highly likely to be one of our standard types. For any given language whose compiler supports generating adapters there will be a 'language representative' for sequences. Other data structures in that language will likely need to be coerced into the language native variant of sequence. There is a related question though: what about supporting arbitrary generic collection types. IMO this is doable simply by an elaboration of our current thinking for record types: with the addition of being able to plug-in adapters. E.g., if we had a type: CCTransaction<Item> where CCTransaction is some type that models a credit card transaction, and Item is something that one might purchase with a credit card, then a. any actual transaction will have a type that grounds the Item type variable. (This does rule out some higher-order patterns) b. where Item shows up in the representation of a CCTransaction value, instead of a fixed lifting/lowering there would have to be code that lifted/lowered the actual transaction information.

…

On Wed, Sep 25, 2019 at 10:24 AM Jacob Gravelle ***@***.***> wrote: Right, my point there is not that we need to add anything for non-sequential collections, but that the primitives we have are already enough for people to sequence their non-sequential data. We can (and should) continue to not care about them, but people will use them, and they will mostly work. A related question is what are the essential properties of a sequence. Well-ordered is probably one. Isomorphic across representations could be one? i.e. for all X,Y where X is a type of sequence and Y is a type of sequence, for all n where n is a collection of values, then X(Y(X(n))) == X(n). For example let n = [1, 2, 2, 3], X = Vector, Y = List, then Vector(List(Vector([1, 2, 2, 3]))) == Vector([1, 2, 2, 3]), whereas if Y = Set then we would drop the duplicated 2 so the isomorphism fails to hold. I don't think we have any mechanism to disallow a user from mapping a set to a sequence. But we don't have to care about what happens in that case. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#72?email_source=notifications&email_token=AAQAXUHD53YFMF3VBQ5IEUDQLONFPA5CNFSM4IZWL2YKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SV7KQ#issuecomment-535125930>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAQAXUGFMJVAWHOXE6LVLULQLONFPANCNFSM4IZWL2YA> .

-- Francis McCabe SWE

This was referenced Sep 24, 2019

Generalize+formalize our supported subset of Wasm in wasm ... end blocks #67

Open

Video Call: September 29th 2019 #69

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sequences #72

Sequences #72

jgravelle-google commented Sep 24, 2019

lukewagner commented Sep 24, 2019 •

edited

fgmccabe commented Sep 24, 2019 via email

lukewagner commented Sep 24, 2019

jgravelle-google commented Sep 24, 2019

jgravelle-google commented Sep 25, 2019

fgmccabe commented Sep 25, 2019 via email

jgravelle-google commented Sep 25, 2019

fgmccabe commented Sep 25, 2019 via email

Sequences #72

Sequences #72

Comments

jgravelle-google commented Sep 24, 2019

Core use cases (in order of importance)

Sample sequences

Complications

lukewagner commented Sep 24, 2019 • edited

fgmccabe commented Sep 24, 2019 via email

lukewagner commented Sep 24, 2019

jgravelle-google commented Sep 24, 2019

jgravelle-google commented Sep 25, 2019

fgmccabe commented Sep 25, 2019 via email

jgravelle-google commented Sep 25, 2019

fgmccabe commented Sep 25, 2019 via email

lukewagner commented Sep 24, 2019 •

edited