This repository has been archived by the owner on Aug 17, 2022. It is now read-only.

[Interface Types] Avoid memory copy #88

Open
redradist opened this issue Nov 30, 2019 · 23 comments


@redradist

redradist commented Nov 30, 2019

Hi all,

From the interface types proposal, I understand that adapter functions would copy values from one ABI to another?

If so, this can hurt performance; consider the case of one million elements in some data type ...

... and I would suggest the better approach ...

What if, instead of copying, we just used an interface, like in Java, C#, C++ (abstract types) or Rust (traits)?
Let's imagine that a list (a vector or some other type in a language) is presented through a bunch of functions: it provides additional wrapper functions that should be called instead of the native API of some language

Let's consider List ...
In the following languages it has the following types:

  1. C++ -> std::vector<T>
  2. Rust -> std::vec::Vec<T>
  3. Swift -> Array<T>

From all these types we can see common semantics, and instead of copying types between platforms, let's add additional functions ...
I think it would be better if these functions were just an interface and provided a mechanism for reading and writing data in the native ABI of the wasm module

For List we can add the following common api:

  1. T * data() (pointer to a data)
  2. size() (size that is stored by this pointer)
  3. insert(T) (or add)
  4. delete(T) (or remove)

This approach will increase overall performance, because it does not require copying data between language ABIs; it just provides a mechanism for accessing data through the native ABI

I wrote this idea to @linclark on Twitter, but it seems she gets lots of messages and did not see mine, which is why I decided to write here ;)

@tniessen
Member

T * data() (pointer to a data)

How would you access the memory at the returned pointer from WebAssembly if it is not part of WebAssembly's linear memory? WebAssembly, by design, cannot access memory outside of linear memory, and removing that restriction would have a severe impact on performance and security.
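A minimal sketch of that restriction, modelling linear memory as a Python `bytearray`. The `LinearMemory` class and its `load_u8`/`store_u8` methods are hypothetical names used only for illustration; they are not a WebAssembly API. Every access is an offset checked against the module's own memory, so an address from the host process has no meaning inside the module.

```python
class LinearMemory:
    """Toy model of one module's linear memory (hypothetical, for illustration)."""

    PAGE_SIZE = 65536  # WebAssembly pages are 64 KiB

    def __init__(self, pages: int = 1) -> None:
        self._bytes = bytearray(pages * self.PAGE_SIZE)

    def _check(self, addr: int) -> None:
        # An address is only an offset into this module's own memory.
        if not 0 <= addr < len(self._bytes):
            raise IndexError(f"out-of-bounds access at {addr:#x}")

    def load_u8(self, addr: int) -> int:
        self._check(addr)
        return self._bytes[addr]

    def store_u8(self, addr: int, value: int) -> None:
        self._check(addr)
        self._bytes[addr] = value & 0xFF


memory = LinearMemory(pages=1)
memory.store_u8(16, 42)
in_bounds = memory.load_u8(16)

# A pointer taken from the host process heap lies far outside the 64 KiB range.
host_pointer = 0x7FFF_DEAD_BEEF
try:
    memory.load_u8(host_pointer)
    trapped = False
except IndexError:
    trapped = True
```

In this model the out-of-bounds access raises instead of reading host memory, which is roughly the role a trap plays in a real engine.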

@redradist
Author

redradist commented Nov 30, 2019

@tniessen

T * data() (pointer to a data)

How would you access the memory at the returned pointer from WebAssembly if it is not part of WebAssembly's linear memory? WebAssembly, by design, cannot access memory outside of linear memory, and removing that restriction would have a severe impact on performance and security.

T * data() (pointer to a data) will return pointer to a linear memory, because List in all languages has a property:
List should be allocated in linear memory, it is continuous array of bytes

Okay, maybe T * data() is not the best example for all languages; consider instead the following interface for List:

  1. T get(index)
  2. size() (size that is stored by this pointer)
  3. insert(T) (or add)
  4. delete(T) (or remove)

Anyway this example is for List, but it is obvious that it could be done for any type

Consider Dictionary, common api:

  1. bool has_key()
  2. T get_value_by(key)
  3. bool insert_by(key, value)
  4. bool remove_by(key)
  5. size()

Using such a technique we will have better performance, because we will delegate the real data access to the underlying ABI (the ABI of the programming language the application was compiled from) ;)
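The dictionary API listed above could be sketched as an abstract interface, roughly as follows. This is only an illustration in Python: the class names `DictionaryInterface` and `NativeDictionary`, the `key` parameter on `has_key`, and the meaning of the `bool` return values are assumptions, since the list above does not specify them.

```python
from abc import ABC, abstractmethod
from typing import Dict, Generic, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class DictionaryInterface(ABC, Generic[K, V]):
    """The five operations from the list above, as an abstract interface."""

    @abstractmethod
    def has_key(self, key: K) -> bool: ...

    @abstractmethod
    def get_value_by(self, key: K) -> V: ...

    @abstractmethod
    def insert_by(self, key: K, value: V) -> bool: ...

    @abstractmethod
    def remove_by(self, key: K) -> bool: ...

    @abstractmethod
    def size(self) -> int: ...


class NativeDictionary(DictionaryInterface[K, V]):
    """Hypothetical implementation that keeps the data in its native form."""

    def __init__(self) -> None:
        self._items: Dict[K, V] = {}

    def has_key(self, key: K) -> bool:
        return key in self._items

    def get_value_by(self, key: K) -> V:
        return self._items[key]

    def insert_by(self, key: K, value: V) -> bool:
        # Assumption: True means the key was newly added.
        is_new = key not in self._items
        self._items[key] = value
        return is_new

    def remove_by(self, key: K) -> bool:
        # Assumption: True means an entry was actually removed.
        if key in self._items:
            del self._items[key]
            return True
        return False

    def size(self) -> int:
        return len(self._items)


d: DictionaryInterface[str, int] = NativeDictionary()
added = d.insert_by("a", 1)
added_again = d.insert_by("a", 2)
value = d.get_value_by("a")
removed = d.remove_by("a")
```

The caller only ever sees the interface; how the entries are laid out stays private to the implementation.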

@tniessen
Member

List in all languages has a property:
List should be allocated in linear memory, it is continuous array of bytes

I am not sure what you mean. If you create a list in Java, JavaScript, python etc., it will be allocated on the process heap, not in WebAssembly linear memory. And in many languages, a list is not a "continuous array of bytes". For example, a list of strings would only contain pointers/references to strings, which are, again, not in linear memory, but instead somewhere on the process heap.

T get(index)

What would T be? Let's say you have a list<string>, what type would T be in WebAssembly? anyref? Then you would still have to copy the string to linear memory.

T get_value_by(key)

What would key be in a dictionary<string, string>? An anyref to a string? Or a pointer to linear memory where a string is stored?

@redradist
Author

redradist commented Nov 30, 2019

@tniessen

List in all languages has a property:
List should be allocated in linear memory, it is continuous array of bytes

I am not sure what you mean. If you create a list in Java, JavaScript, python etc., it will be allocated on the process heap, not in WebAssembly linear memory. And in many languages, a list is not a "continuous array of bytes". For example, a list of strings would only contain pointers/references to strings, which are, again, not in linear memory, but instead somewhere on the process heap.

Even in your example when list stores pointers, it stores them in linear memory layout ;)

@tniessen

T get(index)

What would T be? Let's say you have a list<string>, what type would T be in WebAssembly? anyref? Then you would still have to copy the string to linear memory.

It could be a recursive pattern ;)

T in case of string would also represent interface type string which could be accessed by the following api:

  1. memaddr get_str()
  2. u32 size()

@tniessen

T get_value_by(key)

What would key be in a dictionary<string, string>? An anyref to a string? Or a pointer to linear memory where a string is stored?

You will get interface type string and apply to it the same rule as above ;)

@tniessen
Member

Even in your example when list stores pointers, it stores them in linear memory layout ;)

No, not necessarily. That would be inefficient for languages that support sparse arrays, e.g., JavaScript. In other languages such as Java, it is up to the virtual machine implementation to choose an appropriate memory layout.

Even if the host stores the pointers in a "linear memory layout", that is not the same as WebAssembly linear memory. In simple terms, WebAssembly cannot access memory that it did not allocate itself. There is no way for WebAssembly to access memory outside of its own address range. So let's assume you create a list in some high-level programming language. The list is somewhere on the process heap, outside of the WebAssembly address range. We can copy the list into WebAssembly linear memory, so now we have an array of pointers in WebAssembly linear memory. But now WebAssembly cannot access the memory behind the pointers, because these pointers point outside of WebAssembly's address range. So in order to access the values, we have to copy them to WebAssembly's address range, too.
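To illustrate why copying only the array of pointers is not enough, here is a rough sketch that lowers a host list of strings into a toy linear memory. The `LinearMemory` bump allocator, `lower_string_list`, `read_string`, and the (offset, length) table layout are all hypothetical choices made for this example, not part of any proposal.

```python
import struct
from typing import List, Tuple


class LinearMemory:
    """Toy linear memory with a bump allocator (hypothetical helper)."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)
        self._next_free = 0

    def alloc(self, length: int) -> int:
        addr = self._next_free
        if addr + length > len(self.data):
            raise MemoryError("linear memory exhausted")
        self._next_free += length
        return addr

    def write(self, addr: int, payload: bytes) -> None:
        self.data[addr:addr + len(payload)] = payload


def lower_string_list(memory: LinearMemory, strings: List[str]) -> Tuple[int, int]:
    """Copy a host list of strings into linear memory.

    Returns (address of the table, number of entries). Each table entry is
    a little-endian (u32 offset, u32 byte length) pair.
    """
    entries = []
    for s in strings:
        encoded = s.encode("utf-8")      # host string -> bytes
        addr = memory.alloc(len(encoded))
        memory.write(addr, encoded)      # bytes -> linear memory
        entries.append((addr, len(encoded)))
    table = memory.alloc(8 * len(entries))
    for i, (addr, length) in enumerate(entries):
        memory.write(table + 8 * i, struct.pack("<II", addr, length))
    return table, len(entries)


def read_string(memory: LinearMemory, table: int, index: int) -> str:
    """Follow a table entry; this only works because the offset is in-range."""
    addr, length = struct.unpack_from("<II", memory.data, table + 8 * index)
    return bytes(memory.data[addr:addr + length]).decode("utf-8")


memory = LinearMemory(1024)
table, count = lower_string_list(memory, ["hello", "wasm"])
first = read_string(memory, table, 0)
second = read_string(memory, table, 1)
```

The table entries are offsets into the same linear memory, which is why the element bytes have to be copied in as well as the table itself.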

memaddr get_str()

Would memaddr create a copy within the address range of the WebAssembly module, similar to the existing string-to-memory? If not, how would WebAssembly access it?

@redradist
Author

@tniessen

Even in your example when list stores pointers, it stores them in linear memory layout ;)

No, not necessarily. That would be inefficient for languages that support sparse arrays, e.g., JavaScript. In other languages such as Java, it is up to the virtual machine implementation to choose an appropriate memory layout.

Actually, according to the Javadoc, ArrayList uses an underlying array, which is contiguous memory:
https://docs.oracle.com/javase/8/docs/api/?java/util/ArrayList.html

@tniessen

Even if the host stores the pointers in a "linear memory layout", that is not the same as WebAssembly linear memory. In simple terms, WebAssembly cannot access memory that it did not allocate itself. There is no way for WebAssembly to access memory outside of its own address range. So let's assume you create a list in some high-level programming language. The list is somewhere on the process heap, outside of the WebAssembly address range. We can copy the list into WebAssembly linear memory, so now we have an array of pointers in WebAssembly linear memory. But now WebAssembly cannot access the memory behind the pointers, because these pointers point outside of WebAssembly's address range. So in order to access the values, we have to copy them to WebAssembly's address range, too.

Do not get stuck on the Java Virtual Machine, my example is generic !!
No matter what language compiles to WebAssembly, it will just provide a standard interface to standard types that the WebAssembly VM can understand, that's all. It is pretty simple, I cannot see what is hard about it ?!

@tniessen

memaddr get_str()

Would memaddr create a copy within the address range of the WebAssembly module, similar to the existing string-to-memory? If not, how would WebAssembly access it?

A string from whatever programming language will provide, via memaddr, the address of its underlying linear memory and its size, without requiring copying from one ABI to another ...
We DO NOT need to copy data from one language to another as long as the library written in that language works as expected !!

@tniessen
Member

Ah, I might have misunderstood you. Are you talking about WebAssembly code that was compiled from multiple programming languages, but that shares the same WebAssembly memory, meaning that all data exchange happens within WebAssembly, and not between WebAssembly and the host system?

A string from whatever programming language will provide, via memaddr, the address of its underlying linear memory and its size, without requiring copying from one ABI to another ...

You do realize that programming languages use different representations of strings in memory?

  • Java uses both 1-byte encoding and modified UTF16
  • JavaScript uses both UCS-2 and UTF-16
  • Python uses one of UCS-4, UCS-2 and UTF-16
  • POSIX C usually uses UTF-8
  • Windows C usually uses UTF-16

So if one programming language uses one encoding, and passes the address to the string to a different language that uses a different encoding, how would the second language read the string?
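A small sketch of the problem, using Python's built-in codecs. The sample text is arbitrary; the point is only that the same string has different bytes under different encodings, so an address and a length alone do not tell the reader how to interpret them.

```python
# "naive cafe" with two non-ASCII letters, written with escapes for clarity.
text = "na\u00efve caf\u00e9"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Same text, different byte sequences and different lengths.
same_bytes = utf8_bytes == utf16_bytes
lengths = (len(utf8_bytes), len(utf16_bytes))

# A reader that assumes UTF-8 but is handed UTF-16 bytes gets garbage.
misread = utf16_bytes.decode("utf-8", errors="replace")

# Only a reader that knows the real encoding recovers the text.
round_trip = utf16_bytes.decode("utf-16-le")
```

Here the UTF-8 form is 12 bytes and the UTF-16 form is 20 bytes for the same ten characters.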

@redradist redradist reopened this Nov 30, 2019
@redradist
Author

@tniessen

Ah, I might have misunderstood you. Are you talking about WebAssembly code that was compiled from multiple programming languages, but that shares the same WebAssembly memory, meaning that all data exchange happens within WebAssembly, and not between WebAssembly and the host system?

Yes, you understood me correctly !!
Right now we on the same page ;)

@tniessen

A string from whatever programming language will provide, via memaddr, the address of its underlying linear memory and its size, without requiring copying from one ABI to another ...

You do realize that programming languages use different representations of strings in memory?

  • Java uses both 1-byte encoding and modified UTF16
  • JavaScript uses both UCS-2 and UTF-16
  • Python uses one of UCS-4, UCS-2 and UTF-16
  • POSIX C usually uses UTF-8
  • Windows C usually uses UTF-16

So if one programming language uses one encoding, and passes the address to the string to a different language that uses a different encoding, how would the second language read the string?

Yes, of course I understand that, but it is not an issue from my point of view )
We should not care, because it is the API's job to have proper documentation explaining what kind of string it is: UTF-8, UTF-16, ASCII ...
And, for example, a user who uses a C++ string in Rust should interpret it properly ...
Or we can add some annotation in WebAssembly saying what kind of string it is ...

Anyway, we will have a better solution than just copying memory from one place to another ...
Copying does not scale well enough; it will break down when a container takes up a lot of memory

As I said previously, with ten million elements we will have a performance hit, but with interface methods on the container (string, vector, list, dictionary and ...) we would have the same performance as in a module written in a single language ;)

@tniessen
Member

Ah, I might have misunderstood you. Are you talking about WebAssembly code that was compiled from multiple programming languages, but that shares the same WebAssembly memory, meaning that all data exchange happens within WebAssembly, and not between WebAssembly and the host system?
Yes, you understood me correctly !!

Oh, okay, I don't know if that is a primary concern of this proposal. I was under the impression that this proposal is supposed to help with host-to-WebAssembly data transfer (and the other way around), not within WebAssembly. Isn't that already possible? You essentially only want additional functions, but couldn't you implement them without any extensions of the WebAssembly specification?

We should not care, because it is the API's job to have proper documentation explaining what kind of string it is: UTF-8, UTF-16, ASCII ...

But in most programming languages (Java, JavaScript, Python, C++, C, ...), there is only one internal representation of strings that is universally understood. And in most languages, you would have to convert strings that use a different encoding to the internal encoding. For example, if you want to read a UTF-8 string in Java/JavaScript/Python, it will be converted to a string in the internal representation. And that conversion essentially copies the string, while re-encoding it. So either way you end up copying the entire string.

@redradist redradist changed the title Avoid memory copy [Interface Types] Avoid memory copy Nov 30, 2019
@redradist
Author

redradist commented Nov 30, 2019

@tniessen

Ah, I might have misunderstood you. Are you talking about WebAssembly code that was compiled from multiple programming languages, but that shares the same WebAssembly memory, meaning that all data exchange happens within WebAssembly, and not between WebAssembly and the host system?
Yes, you understood me correctly !!

Oh, okay, I don't know if that is a primary concern of this proposal. I was under the impression that this proposal is supposed to help with host-to-WebAssembly data transfer (and the other way around), not within WebAssembly. Isn't that already possible? You essentially only want additional functions, but couldn't you implement them without any extensions of the WebAssembly specification?

When I read the interface types article by @linclark, https://hacks.mozilla.org/2019/08/webassembly-interface-types/, I understood that it would be done by copying a type from one ABI to another
I have been worried since then that this could hurt performance, which is why I propose adding special functions for each interface type, through which it would be possible to access that type in the native ABI of its language, instead of providing functions that copy types

@fgmccabe
Contributor

fgmccabe commented Dec 1, 2019

I am afraid that you are maybe both missing a key aspect of the problem. We are targeting interface types to the scenario involving limited trust. This is true for both accessing host functionality and for ‘shared nothing linking’ involving wasm - wasm modules.
In that context we can definitely not assume that the two modules may access each other’s memory.
The design allows the embedder/linker to optimize if it can determine the appropriate circumstances. By default there will almost certainly be some copying of memory bound data structures; but there is obviously pressure to minimize that - so long as it’s provably safe to do so.

@tniessen
Member

tniessen commented Dec 1, 2019

I am afraid that you are maybe both missing a key aspect of the problem. [...] we can definitely not assume that the two modules may access each other’s memory

I don't think I am missing that aspect. That is why, from the very beginning, I suspected that this approach might not work because the modules don't have any way of addressing each other's memory. I only abandoned that point after @redradist insisted that both modules have access to the same memory.

@redradist
Author

redradist commented Dec 2, 2019

@fgmccabe

I am afraid that you are maybe both missing a key aspect of the problem. We are targeting interface types to the scenario involving limited trust. This is true for both accessing host functionality and for ‘shared nothing linking’ involving wasm - wasm modules.
In that context we can definitely not assume that the two modules may access each other’s memory.
The design allows the embedder/linker to optimize if it can determine the appropriate circumstances. By default there will almost certainly be some copying of memory bound data structures; but there is obviously pressure to minimize that - so long as it’s provably safe to do so.

Actually it is not an issue either !!
It is possible to mark some functions as host and if so then copy memory from host process to wasm, but if it is wasm <-> wasm scenario, then we can just call interface methods !!

Please do not mix two scenarios by providing one solution; you're breaking the SOLID principles, even the first one, S (single responsibility) ;)
It could be accomplished differently ;)

@redradist
Author

redradist commented Dec 4, 2019

@fgmccabe @tniessen
Are there any other concerns, guys?

@jgravelle-google
Contributor

(I probably should have replied sooner than a month later, sorry!)

It is possible to mark some functions as host and if so then copy memory from host process to wasm, but if it is wasm <-> wasm scenario, then we can just call interface methods !!

There are more trust boundaries than just between host and wasm module. Shared-nothing linking is a scheme by which two wasm modules can not trust each other but still call each others' APIs. If they agree on interface types as their ABI, they can 1) have their own separate data formats under the hood, and 2) make more guarantees about how their data gets exfiltrated. In particular, if module A doesn't export its memory, it can guarantee that module B cannot read it (modulo VM bugs).

If two wasm modules are ok with sharing memory, they don't need ITs and can just share an ABI. If they aren't, there's very little they can do to interact today, and ITs can expand the set of capabilities we permit while not expanding the permissions we need to expose. So that's a reason to not special-case the Host when thinking about IT.

@redradist
Author

Guys, will there be another meeting regarding this issue?
I missed the previous one due to the New Year holidays ... (

@jgravelle-google
Contributor

Sure, I'll add it to the issue-which-I-haven't-made-yet.


Looking at this again with fresh eyes, I think there's a blended approach that kind-of-works in a way you describe. In particular: it should be possible to expose an interface to an object without needing to have access to the underlying memory. Now, this won't be efficient either, because you'll need to interact with the object exclusively through cross-module function calls, which should be low-ish overhead, but not no-overhead.

Consider a module that exports:

  • type Array
  • fn newArray : () -> Array
  • fn size : (Array) -> u32
  • fn getIndex : (Array, u32) -> T
  • fn setIndex : (Array, u32, T) -> ()
  • fn push : (Array, T) -> ()
  • fn pop : (Array) -> T

(T can refer to either generics, or anyref, or one hardcoded type)

In this scheme, we only copy on a per-element basis, but the array is managed completely by the owning module. So we don't need to share memory, but still have access to an interface published by a module, and can pass around handles to these Array objects.
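A rough sketch of the handle-based scheme described above. Everything here is an assumption made for illustration: the `ArrayModule` class stands in for the owning module, handles are plain integers, and `T` is hardcoded to `int` (one of the options mentioned above).

```python
from typing import Dict, List


class ArrayModule:
    """Hypothetical owning module; T is hardcoded to int for this sketch."""

    def __init__(self) -> None:
        self._arrays: Dict[int, List[int]] = {}  # never exposed to callers
        self._next_handle = 1

    def new_array(self) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self._arrays[handle] = []
        return handle

    def size(self, array: int) -> int:
        return len(self._arrays[array])

    def get_index(self, array: int, index: int) -> int:
        return self._arrays[array][index]

    def set_index(self, array: int, index: int, value: int) -> None:
        self._arrays[array][index] = value

    def push(self, array: int, value: int) -> None:
        self._arrays[array].append(value)

    def pop(self, array: int) -> int:
        return self._arrays[array].pop()


def sum_elements(module: ArrayModule, array: int) -> int:
    """Consumer code: one cross-module call per element, no bulk copy."""
    return sum(module.get_index(array, i) for i in range(module.size(array)))


owner = ArrayModule()
numbers = owner.new_array()
for value in (10, 20, 30):
    owner.push(numbers, value)
owner.set_index(numbers, 0, 15)
total = sum_elements(owner, numbers)
last = owner.pop(numbers)
```

The consumer never touches the backing storage; the trade-off is that `sum_elements` makes one call per element, which is the "low-ish overhead, but not no-overhead" cost mentioned above.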

Note that this doesn't replace the need for array-like object support in Interface Types itself. There are some APIs where you do indeed want to copy the memory across an ownership boundary. In theory we could defer deciding what to do there until after MVP... but given the number of browser APIs that expect and return arrays, that puts a serious damper on the "viable" part of that, so I think we do still need a first-class memory-copying array primitive sooner rather than later.

@redradist
Author

@jgravelle-google

Sure, I'll add it to the issue-which-I-haven't-made-yet.

Looking at this again with fresh eyes, I think there's a blended approach that kind-of-works in a way you describe. In particular: it should be possible to expose an interface to an object without needing to have access to the underlying memory. Now, this won't be efficient either, because you'll need to interact with the object exclusively through cross-module function calls, which should be low-ish overhead, but not no-overhead.

Consider a module that exports:

  • type Array
  • fn newArray : () -> Array
  • fn size : (Array) -> u32
  • fn getIndex : (Array, u32) -> T
  • fn setIndex : (Array, u32, T) -> ()
  • fn push : (Array, T) -> ()
  • fn pop : (Array) -> T

(T can refer to either generics, or anyref, or one hardcoded type)

In this scheme, we only copy on a per-element basis, but the array is managed completely by the owning module. So we don't need to share memory, but still have access to an interface published by a module, and can pass around handles to these Array objects.

This is exactly the approach I was trying to introduce )

@jgravelle-google

Note that this doesn't replace the need for array-like object support in Interface Types itself. There are some APIs where you do indeed want to copy the memory across an ownership boundary. In theory we could defer deciding what to do there until after MVP... but given the number of browser APIs that expect and return arrays, that puts a serious damper on the "viable" part of that, so I think we do still need a first-class memory-copying array primitive sooner rather than later.

I agree with this; for some small structures we may want to copy the complete value across an ownership boundary. It could depend on the size of the structure: if the size is above some threshold, access it through the interface; otherwise copy it
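The size-based choice could look something like this sketch. The 4096-byte threshold, the `Copied` and `Handle` types, and the `pass_across_boundary` function are all made up for illustration; nothing in the proposal defines them.

```python
from dataclasses import dataclass
from typing import Union

COPY_THRESHOLD_BYTES = 4096  # hypothetical cut-off, not from any proposal


@dataclass(frozen=True)
class Copied:
    payload: bytes   # an independent copy owned by the receiver


@dataclass(frozen=True)
class Handle:
    handle_id: int   # opaque reference; the data stays with its owner


def pass_across_boundary(data: bytearray, handle_id: int) -> Union[Copied, Handle]:
    """Copy small values; hand out a handle for large ones."""
    if len(data) <= COPY_THRESHOLD_BYTES:
        return Copied(bytes(data))
    return Handle(handle_id)


small = pass_across_boundary(bytearray(b"abc"), handle_id=1)
large = pass_across_boundary(bytearray(10_000), handle_id=2)
```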

@ghost

ghost commented Jan 29, 2021

I cannot see why strings must be copied from linear memory, can we just have a Wasm interface string_view type that would create a view onto memory or another string, without a copy? Why do strings even need to be immutable again? Solely because JavaScript requires it?

@lukewagner
Member

Interface-typed values aren't normal first-class values; they are lazy expressions which, when evaluated, enable copying from some concrete string representation (in wasm or the host) to some concrete string representation (in wasm or the host). Moreover, since interface values are affine, there's not even really a notion of them being "immutable", since they are observed only the one time.
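As a loose analogy only, a lazy, affine value can be modelled as a one-shot wrapper around a producer function. The `AffineLazy` class and the `lift` function below are hypothetical names for this sketch and do not correspond to anything in the proposal text.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class AffineLazy(Generic[T]):
    """Hypothetical one-shot lazy value: evaluated on demand, at most once."""

    def __init__(self, producer: Callable[[], T]) -> None:
        self._producer: Optional[Callable[[], T]] = producer

    def consume(self) -> T:
        if self._producer is None:
            raise RuntimeError("affine value already consumed")
        producer, self._producer = self._producer, None
        return producer()


source_memory = bytearray(b"hello, wasm")
calls: List[str] = []


def lift() -> str:
    # The copy out of the source representation happens only here.
    calls.append("lift")
    return bytes(source_memory[0:5]).decode("utf-8")


value = AffineLazy(lift)
calls_before_use = len(calls)   # nothing has been copied yet
first = value.consume()

try:
    value.consume()
    second_use_rejected = False
except RuntimeError:
    second_use_rejected = True
```

Because the value is observed only once, there is no later observer who could notice a mutation, which is the sense in which "immutable" stops being a meaningful property.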

If what you want is to pass around views of memory, that would be a separate proposal. There are a number of challenges with view-based approaches, though. One is that languages compiled to linear memory don't have a good way to access views without first copying them into linear memory. E.g., in C, a random char* is always assumed to be relative to the one global linear memory, which means you can't get a pointer to a char in a view (without copying that char into linear memory), so, in the general case, you'll end up needing to copy it all to linear memory anyway. There are other problems too.

@ghost

ghost commented Jan 29, 2021

The affinity of interface types seems only necessary because of destructors, yet, if I were not to associate a destructor with an interface value, why would it need to be affine?

allocation will lead to double-free when two destructors are called

This couldn't happen if the type never used a destructor in the first place; how often is it expected that one would associate extraneous data with an interface value?

As for viewing other memory, that might be better solved by mmapping or something else.

@lukewagner
Member

The affinity of interface types seems only necessary because of destructors,

There is also the issue of lazy lifting being potentially effectful (in the extreme: executing user code, e.g., as part of a generator).

As for viewing other memory, that might be better solved by mmapping or something else.

Having considered these options over the years, I think there is a place for mmap and memory sharing, but for more-advanced situations and not as the universal basis for module composition.

@redradist
Author

I cannot see why strings must be copied from linear memory, can we just have a Wasm interface string_view type that would create a view onto memory or another string, without a copy? Why do strings even need to be immutable again? Solely because JavaScript requires it?

@00ff0000red This is exactly what I want to lobby for here )))


5 participants