
How to cancel a pending Atomics.waitAsync? #176

Open
juj opened this issue Jan 17, 2021 · 40 comments
@juj

juj commented Jan 17, 2021

In several scenarios, after issuing an Atomics.waitAsync(...).value.then((value) => { /* do something */ }); operation, some kind of resource/application/page deinitialization can occur that tears down the parts of the Wasm app containing the data structures, e.g. a lock, semaphore or other multithreading synchronization primitive, that resided at the waited-on address.

How would one ensure that, if such application deinitialization takes place before the waitAsync .then() handler has been invoked, the promise handler is never fired (and /* do something */ never gets executed)?

At application level, I can use a separate JavaScript variable

var lockStillExists = true;

function deleteLock() {
  lockStillExists = false;
}

...

Atomics.waitAsync(myInt32Array, ...).value.then((value) => { if (lockStillExists) { /* do something */ } })

to ensure that /* do something */ will never erroneously execute in case the wasm app deinitializes in between. But what happens if I want to deinitialize and let myInt32Array and other JS variables garbage collect altogether? I.e. the promise callback function (value) => { ... } will capture a scope, which would surely pin down myInt32Array and the other JS vars in that scope to remain alive indefinitely (since the .then() will never resolve)?

Is there a way to cancel an async wait and/or guarantee that JS garbage collection will safely occur?

@lars-t-hansen

I don't think there's a way of cancelling a waitAsync. ISTR there's generally some concern around promises not having a cancellation facility. May be worthwhile pinging TC39 around that.

(I don't know the job semantics well enough to say what happens with pending promises at page shutdown, which is probably the only sense of "deinitialization" that is known to the web.)

@juj
Author

juj commented Jan 18, 2021

Running the following test

<html><body><script>
function test() {
	var heap = new Int32Array(new SharedArrayBuffer(128*1024*1024));
	var waitIndex = 0;
	Atomics.waitAsync(heap, waitIndex, 0, Infinity).value.then((value) => {
		console.log(heap[waitIndex]);
	});
}
for(var i = 0; i < 10; ++i) {
	test();
}
</script></body></html>

in Chrome does indeed result in the created SABs being pinned down, and never freed.

In Unity use, for example, it is somewhat common for pages to need to unload a Unity engine build from the page, and maybe open another Wasm build. An example usage is a 3D asset store product page with a preview carousel, where a preview item may be a photo, a video, or a live 3D view (written with the Unity game engine and compiled to wasm) of the product in question. When one navigates the carousel, the page dynamically populates the currently selected preview item. An example page: https://assetstore.unity.com/packages/3d/environments/landscapes/polydesert-107196

Another example usage is a game portal, where game instances are loaded into a child element.

This limitation of being unable to gracefully deinitialize does effectively mean that one is mandated to use iframes in order to get JavaScript garbage collection of Wasm heaps to work.

@juj
Author

juj commented Jan 18, 2021

Trying out a similar test in Chrome with iframes does suggest that Chrome is able to reclaim the memory if one does use iframes:

iframe_parent.html

<html><body><script>
var numLoaded = 0;
function open() {
	if (numLoaded++ >= 20) return;
	console.log(numLoaded);
	var iframe = document.getElementById('container');
	if (iframe) {
		console.log('unload');
		iframe.parentElement.removeChild(iframe);
	}
	else {
		console.log('load');
		iframe = document.createElement('iframe');
		iframe.id = 'container';
		document.body.appendChild(iframe);
		iframe.src = 'iframe_child.html';
	}
	console.log(iframe.src);
}
setInterval(open, 1000);
</script></body></html>

iframe_child.html

<html><body><script>
var heap = new Int32Array(new SharedArrayBuffer(128*1024*1024));
var waitIndex = 0;
console.log('waitAsync');
Atomics.waitAsync(heap, waitIndex, 0, Infinity).value.then((value) => {
	console.log(heap[waitIndex]);
});
</script></body></html>

However, I'm not sure this is adequate - requiring content to be wrapped in an iframe for garbage collection to work can be quite a footgun, so I'm certainly in search of a better way here.

juj added a commit to juj/emscripten that referenced this issue Jan 18, 2021
… emscripten_atomic_cancel_wait_async() and emscripten_atomic_cancel_all_wait_asyncs() functions to allow canceling waitAsync operations. This does have a drawback of spurious wakeups, but necessary in order to work around GC problem WebAssembly/threads#176 .
@dschuff
Member

dschuff commented Jan 19, 2021

/cc @syg

@syg

syg commented Jan 19, 2021

Interesting example, Jukka, thanks!

As @lars-t-hansen says, yeah, there's not really a way to cancel, and the question has baggage around cancellation of Promises. Before I pull on threads to see about a Promise cancellation mechanism again, which, given AbortController, is probably not going to get anywhere, let me ask:

For the Unity use case, would either of the following be sufficient? And do you have a preference?

  1. Upon the SAB being finalized, at some future time all pending waitAsync Promises are either resolved with a new sentinel like "cancelled" or rejected?
  2. A new SAB.cancelAllAsyncWaiters() API or something ugly like that?

Also cc @marjakh

@juj
Author

juj commented Jan 19, 2021

The resolution proposal 1. would likely be sufficient for the GC reclaim scenario - such 'cancelled' or rejected resolves could easily be filtered out in the application.

The resolution proposal 2. may be tricky; in particular, the question is whether SAB.cancelAllAsyncWaiters() would cancel all .waitAsync()s in all Workers waiting on any address of the given SAB, or just all .waitAsync()s that were issued in the Worker that calls SAB.cancelAllAsyncWaiters(). If it canceled them in all Workers, then I think it would work. However, not needing to implement any code would certainly be preferable here - then the GC would "just work" without the danger that JS code could generate essentially unreachable objects that can never be freed (within the page lifetime).

There is also another scenario that becomes complicated from the lack of cancellation: deleting locks.

To make the example concrete, let's say your wasm application has some kind of database of records that are stored in the wasm heap. The records are large enough that SAB/pthread-based Workers are used to download them in the background (think e.g. image files or .json data or something else), and/or maybe on the main thread also.

Whenever a download finishes (e.g. via an XHR or Fetch Promise), the code will issue an Atomics.waitAsync into the wasm heap to lock a mutex and access a central shared data structure that contains this database of records, to insert the data into the database. Atomics.waitAsync is needed since the main thread cannot do sync Atomics.waits. Maybe for simplicity and ease of sharing the code flow, the Workers also use the same Atomics.waitAsync code (this saves having to write two copies of the mutex locking, one for .waitAsync and another for .wait).

So essentially we have N Workers that are each Atomics.waitAsyncing a lock address in the heap at unexpected times.

Then at some point, e.g. based on user input, it is decided that the database should close (but the wasm application still lives on, maybe opening a new database, or deciding to do something else). If implemented in common OOP fashion, the database lock address will likely be a dynamically allocated uint32 on the heap.

In order to deinitialize the lock, safely free() its uint32 address and reclaim the memory address for malloc() to reuse, the application would need to cancel all pending Atomics.waitAsyncs both in the main thread and in all Workers in the wasm application, in an atomic/watertight manner. Otherwise the main thread/Workers may race to resolve Atomics.waitAsyncs against an essentially freed memory address (or in the worst case, the app may have allocated a new lock structure at exactly the same memory address, leading to random behavior).

At first I thought all of this could simply be avoided by using a helper uint32 next to the wait address as a field to denote whether the uint32 lock is valid, and when deinitializing, mark the lock invalid first, then Atomics.notify() to wake up all waiters, and then proceed to free() all the memory. The woken-up waiters would then fizzle the event when they see the lock has been deallocated. And waiters would never be allowed to register a new Atomics.waitAsync() without first checking the "is the lock valid" uint32 field.
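A rough sketch of that idea (illustrative names only; heap is an Int32Array view of the SAB, LOCK is the futex word index and VALID the helper field index):

// Sketch of the "validity flag" scheme described above; all names are hypothetical.
const LOCK = 0, VALID = 1;

function asyncLockWait(heap) {
  if (Atomics.load(heap, VALID) === 0) return;  // never register a wait on a dead lock
  // Promise.resolve() also covers the case where the wait did not park ('not-equal').
  Promise.resolve(Atomics.waitAsync(heap, LOCK, 0, Infinity).value).then((value) => {
    if (Atomics.load(heap, VALID) === 0) return;  // lock was torn down while waiting: fizzle
    /* ...take the lock and touch the shared data... */
  });
}

function deleteLock(heap) {
  Atomics.store(heap, VALID, 0);   // 1. mark the lock invalid
  Atomics.notify(heap, LOCK);      // 2. wake all waiters so they fizzle
  // 3. free() the memory - but the VALID field must stay readable until the last
  //    pending waitAsync has resolved, which is exactly the problem described below.
}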

However, that does not quite work: it has the problem that freeing a lock becomes an asynchronous operation! That is, the "is the lock valid" uint32 helper field needs to be kept alive until the last Atomics.waitAsync() event has resolved, making it hard to deal with in C code: deleting locks and free()ing is generally expected to be a synchronous operation.

I can't think of a great way to resolve this issue, except maybe to start building some kind of refcounting scheme in the wasm heap, where whenever a Worker issues an Atomics.waitAsync() on a memory address, it increments a refcount, and when a lock is deleted, it is marked deleted; once all pending .waitAsync()s finally resolve, the last one to resolve will actually do the free(). However that will still have a big issue if one calls Worker.terminate(), since that would leak refcounts from that Worker, unless one keeps a global registry in the wasm heap of all addresses that each Worker has .waitAsync()ed.

I thought about imposing a restriction in Emscripten that pthread Workers are not allowed to do any .waitAsync()s, only the main thread is, and then only the main thread would be allowed to delete any locks. It would then be able to locally track the set of alive locks in JS-side state, and clear out the .waitAsync()s. But this kind of deinitialization restriction seems a bit limiting from a C programmer's viewpoint.

If there existed a function SAB.cancelAllAsyncWaitersAtAddress(address) that cleared all waiters in all Workers at a given address, then I think this kind of lock deinitialization would be possible to implement synchronously without complications, and one could delete a lock by first issuing a SAB.cancelAllAsyncWaitersAtAddress(address) followed by free(address).

So I wonder, would we be looking at two functions, SAB.cancelAllAsyncWaiters() and SAB.cancelAllAsyncWaitersAtAddress(address), or might there be a simpler way to address both use cases?

@marjakh

marjakh commented Jan 20, 2021

Moi Jukka (&others),

I'm mostly worrying about this from the "how do we release memory" point of view. In the example in OP, if you never call notify(), the code in the then-func will never be executed, so that's fine. But if you create waitAsyncs and never notify() them, in the current impl in V8, memory will grow without bound. (Even if the SAB is GCd!)

Relevant doc where this use case is discussed:

https://docs.google.com/document/d/1aeEGDm1XSqoJkQQKz9F75WqnuAa2caktxGy_O_KpO9Y/edit#heading=h.2cmggae83oqv (that link should link to the "Open question: Problem with location based lists" section).

Re: SAB not being released (your example in here: #176 (comment) ). I think what happens here is that the closure you pass to then() keeps "heap" alive which keeps the SAB alive :) So it's not the waitAsync that keeps stuff alive. If you didn't refer to "heap" in your function, the SAB should get GCd just fine. (But we still keep internal data structures alive and leak memory that way.)

Re: SAB.cancelAllAsyncWaiters APIs, just to confirm, the intent would be that it's a synchronous operation which does not call any callbacks (e.g., promise rejected callbacks), right? Otherwise it wouldn't help the "freeing a lock must be sync" use case.

@juj
Author

juj commented Jan 20, 2021

Moi Jukka (&others),

👋

Re: SAB not being released (your example in here: #176 (comment) ). I think what happens here is that the closure you pass to then() keeps "heap" alive which keeps the SAB alive :) So it's not the waitAsync that keeps stuff alive. If you didn't refer to "heap" in your function, the SAB should get GCd just fine. (But we still keep internal data structures alive and leak memory that way.)

That is exactly correct. It is the closure that keeps the heap alive. If the closure did not reference the heap, then the heap would be freed (as one can test by removing console.log(heap[waitIndex]); from the test code). But then again, it is the Atomics.waitAsync that is keeping that closure alive.

The rationale I now realize I missed explaining in that code is that it is extremely common to need to reference the heap index that one just waited on from within the Promise, so having to access heap[waitIndex] inside the Promise .then() is in some sense the "canonical" usage of the Atomics.waitAsync() API. (because that memory location generally stores the state of the lock primitive one is implementing)

Re: SAB.cancelAllAsyncWaiters APIs, just to confirm, the intent would be that it's a synchronous operation which does not call any callbacks (e.g., promise rejected callbacks), right? Otherwise it wouldn't help the "freeing a lock must be sync" use case.

I presume it could work either by resolving with a special 'cancelled', or rejecting, or not calling the callbacks at all. As long as one would know inside the callback that the wait was rejected so as not to access any heap memory in the callback. (since that heap memory has now already been free()d prior)

@marjakh

marjakh commented Jan 20, 2021

I think there are (at least) 2 separate problems here:

  1. The "SAB stays alive" problem can be solved with a suitable indirection object. You could make the closure point to the indirection object and the indirection object point to the SAB. When you want to free the memory, kill the indirection_object -> SAB pointer.

When you want to free memory, could you actually notify() everything at that point or is that cumbersome? You can make the closure no-op in that case, so it wouldn't access memory, so it shouldn't matter that you first free memory and only after that notify.

my_sab = new SharedArrayBuffer(...);
my_sab.indirection_object = {'buffer': my_sab };

function createWaiter(indirection_object) {
  Atomics.waitAsync(indirection_object.buffer, ...).value.then(() => {
    if (indirection_object.buffer) { // when this is null, do nothing
      // do stuff with indirection_object.buffer
    }});
}

// now we want to free memory:
my_sab.indirection_object.buffer = null;
// maybe notify everything here? the then-funcs will be noop since buffer is null

// and let SAB die
my_sab = null;

(I hope I didn't mess up that example :) )

  2. If you don't want to / can't notify everything, we have a "user creates waitAsyncs and doesn't notify them -> we leak memory" problem on our side (we don't leak the SAB but our internal data). We can either provide an API the user must call, or just do the right thing without an API (when a SAB dies, detect that we have waitAsyncs which will never finish and release them).

That'd require you let the SAB die on your side.

I might be grossly oversimplifying things here, feel free to point out what exactly :)

Not adding an API would be simpler in a sense; we should just do the right thing anyway :-P But it requires the user to jump through hoops to let the SAB die (as it's indeed canonical use case to refer to it in the closure). So... not sure which is better.

@juj
Author

juj commented Jan 20, 2021

1. The "SAB stays alive" problem can be solved with a suitable indirection object.

That would certainly help break the link. One difficulty in maintaining such an indirection is WebAssembly Memory.grow. When memory growth occurs, one must reinitialize all views to the newly grown buffer, so if there existed multiple copies of these views, they would have to be coordinated with the heap memory growth code to make sure they don't fall out of sync. This would practically limit the shape of any given codebase to having at most one place where it calls Memory.grow, and at most one place where it calls Atomics.waitAsync, so that those two can tightly coordinate the heap views.
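To illustrate the kind of coordination this implies, here is a minimal sketch (wasmMemory, HEAP32 and the helper names are hypothetical, loosely following the shape of Emscripten-generated code):

var wasmMemory = new WebAssembly.Memory({ initial: 256, maximum: 32768, shared: true });
var HEAP32 = new Int32Array(wasmMemory.buffer);

// The single place allowed to grow the heap: it must refresh every cached view,
// since existing views keep their old length after growth.
function growHeap(pages) {
  wasmMemory.grow(pages);
  HEAP32 = new Int32Array(wasmMemory.buffer);
}

// The single place allowed to issue async waits: it always reads the current view
// instead of capturing a possibly stale one in a closure.
function waitAsyncOn(index) {
  return Atomics.waitAsync(HEAP32, index, 0, Infinity).value;
}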

When you want to free memory, could you actually notify() everything at that point or is that cumbersome?
// maybe notify everything here? the then-funcs will be noop since buffer is null

A difficulty here is that we are dealing with multiple Workers - so each of them would have its own copy of the my_sab.indirection_object, and they would somehow have to coordinate setting that to null so that none of the Workers will accidentally resolve against the SAB. In the "application is terminating, we are GCing" scenario that would not be a problem, as all Workers will have long been .terminate()d by then, but for solving the sync free() problem, coordinating this across all Workers is difficult.

1. We can either provide an API the user must call, or just do the right thing without an API (when a SAB dies, detect that we have waitAsyncs which will never finish and release them).

It would probably be best if this could automatically be detected, also like what @syg pondered above. If there is a manual procedure one must follow, it can probably appear arcane for developers who are not experts on SAB+Atomics development, and they might not immediately realize there is an extra procedure to follow (and thus cause leaks).

@marjakh

marjakh commented Jan 20, 2021

Ah, I see where I was oversimplifying this :) Thanks for the added info.

It might be that coordinating workers wrt "letting the SAB die" (what must die is the backing store shared by all workers) is equally complex as what the cancelAllAsyncWaiters would do internally. The coordination would require some kind of a "drop all your references to the SAB" command sent to the workers. So maybe we want to add the explicit API to save the user from this complexity. I'll need to think about this a bit more.

@syg

syg commented Jan 21, 2021

It might be that coordinating workers wrt "letting the SAB die" (what must die is the backing store shared by all workers) is equally complex as what the cancelAllAsyncWaiters would do internally.

This is my current understanding as well after the discussion. My two initial proposed solutions and @marjakh's two problem statements in #176 (comment) are in a cycle of sorts. Ideally, as @juj says, we want cancellation of waitAsync Promises to require no extra code and be tied to the lifetime of the SAB, but it is precisely those waiters' Promise handlers that are keeping the SAB alive.

The ideal, "no user intervention needed" solution then seems to me to be that waitAsync Promise handlers' references to the SAB they are waiting on are automatically weak. This is much too specialized, however, since waitAsync Promises are just regular Promises, and most likely a non-starter.

I wonder if making the canonical use of waitAsync refer to the SAB through a WeakRef might work, before going down the explicit API proposal route?

@marjakh

marjakh commented Jan 21, 2021

I don't think WeakRefs would help here.

Either the worker has an additional strong pointer to the SAB or not (in addition to the closure ptr which would now be weak). The worker's SAB is a normal object on the worker's side, independent of the SAB in the main thread, except that they refer to the same backing store.

Has a strong pointer -> making the pointer in the closure weak doesn't change anything; the worker still needs to be told to nullify the strong pointer.

Doesn't have a strong pointer -> the SAB will die too early and the worker won't be able to access it when notify() is called.

Having a SAB in a different thread, referring to the same backing store, doesn't keep the worker's SAB alive. If that was the case, then maybe a weak pointer would help. Was that the scenario you were thinking about?

@juj
Author

juj commented Jan 21, 2021

Trying out WeakRefs with the original example, GC does occur properly:

<html><body><script>
function test() {
	var heap = new Int32Array(new SharedArrayBuffer(128*1024*1024));
	var weakheap = new WeakRef(heap);
	var waitIndex = 0;
	Atomics.waitAsync(heap, waitIndex, 0, Infinity).value.then((value) => {
		var heap = weakheap.deref();
		if (heap) console.log(heap[waitIndex]);
	});
}
for(var i = 0; i < 10; ++i) {
	test();
}
</script></body></html>

That certainly is workable to solve the first problem of wasm program deinitialization (to my understanding), but it does have two awkward concerns:

  1. When the Wasm app grows its heap size via WebAssembly.Memory.grow, the existing references to SAB views will all be cleared and reallocated to new, grown views. Thus it can happen that the WeakRef pointing to an existing view gets nulled while the application is still running, causing the Promise to resolve incorrectly during the app's lifetime.

The only way I can think of to resolve this is to capture the underlying SAB itself, rather than the view, but that has the awkward result that each .waitAsync resolve will then require allocating a new SAB view to process the resolve. I.e. the code would have to look as follows to be correct (compare to the above):

<html><body><script>
function test() {
	var heap = new SharedArrayBuffer(128*1024*1024);
	var view = new Int32Array(heap);
	var weakheap = new WeakRef(heap);
	var waitIndex = 0;
	Atomics.waitAsync(view, waitIndex, 0, Infinity).value.then((value) => {
		var heap = weakheap.deref();
		if (heap) {
			var view = new Int32Array(heap);
			console.log(view[waitIndex]);
		}
	});
}
for(var i = 0; i < 10; ++i) {
	test();
}
</script></body></html>
  2. A second concern is that in a larger application, the closure of .waitAsync() may need to capture multiple JS objects and variables. Application developers would hence need to refer to all of those other objects via WeakRefs as well, or otherwise anything else will leak too. There are no diagnostics one would get about this situation either; one "just has to know" this behavior, or silently gets memory leaks. That makes the API more brittle and harder to use correctly. Though in a pinch, I suppose that could work.

I wonder if it would be feasible&sensible to have a new API Atomics.cancel(sabInt32Array[, index]); (Atomics.reject, Atomics.notifyCancel, Atomics.notifyReject?) symmetric to Atomics.notify(), but instead of notifying the waiters, it would make the synchronous waiters return with a 'cancelled' or 'rejected' string, and it would make asynchronous waiters reject the wait Promise (or also resolve with a 'cancelled' or 'rejected'?). I believe that would allow "kicking the waiters out" from a given index when freeing up a lock in that index. Specifying the index param would be optional, in which case all waiters on all addresses would be rejected.
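If such an API existed, lock teardown might look roughly like this (a hypothetical sketch; Atomics.cancel is the proposed, non-existent API, and _free/lockIndex are purely illustrative):

// Hypothetical teardown: kick every waiter (in any thread) off the lock word,
// then release the lock's memory, with no waiter able to observe it afterwards.
function destroyLock(heap, lockIndex) {
  Atomics.cancel(heap, lockIndex);   // proposed API: waiters at lockIndex resolve/return as 'cancelled'
  _free(lockIndex * 4);              // illustrative free() of the lock's byte address in the wasm heap
}

// A waiter would then filter the cancellation out instead of touching freed memory:
function waitOnLock(heap, lockIndex) {
  Atomics.waitAsync(heap, lockIndex, 0, Infinity).value.then((value) => {
    if (value === 'cancelled') return;   // lock was torn down; do nothing
    /* ...normal lock handling... */
  });
}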

@juj
Author

juj commented Jan 21, 2021

One issue I note is that Safari does not support WeakRefs at all (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/WeakRef), so it would require Apple to add support for them to Safari in tandem with the SAB support, if this was made a required part of the spec.

@marjakh

marjakh commented Jan 21, 2021

In your latest example, if you force a GC, and then try to notify(), will the WeakRef be valid or invalid? I'd guess invalid. If valid, what is keeping the SAB alive?

@juj
Author

juj commented Jan 21, 2021

In your latest example, if you force a GC, and then try to notify(), will the WeakRef be valid or invalid? I'd guess invalid. If valid, what is keeping the SAB alive?

The WeakRef will be invalid, but that would be the intent in this example scenario.

In the real application/runtime, the code structure looks something like this:

function Module() {
  var heap = new SharedArrayBuffer(128*1024*1024);
  var weakheap = new WeakRef(heap);
  // ...more global state

  function waitAsync(waitIndex) {
    Atomics.waitAsync(heap, waitIndex, 0, Infinity).value.then((value) => {
      var heap = weakheap.deref();
      if (heap) {
        var view = new Int32Array(heap);
        console.log(view[waitIndex]);
      }
    });
  }
  // ... more global functions

  // main app entry point
  function main() {
    // ...start executing application, sets up all sorts of event handlers that can call Atomics.waitAsync()
  }
  // return exports
  return {
    main: main,
    // ...other exports
  }
}

// Creating an app
var app = Module();

// Deinitializing an app, this should cause the whole app contents to GC
app = null;

In this case, what keeps the SAB alive is the app object in global scope, which references the function exports from the Module, which reference the local scope of the Module where the SAB lives, so my understanding here is that the SAB would stay alive until app = null; is run.

@syg

syg commented Jan 21, 2021

I wonder if it would be feasible&sensible to have a new API Atomics.cancel(sabInt32Array[, index]); (Atomics.reject, Atomics.notifyCancel, Atomics.notifyReject?) symmetric to Atomics.notify(), but instead of notifying the waiters, it would make the synchronous waiters return with a 'cancelled' or 'rejected' string, and it would make asynchronous waiters reject the wait Promise (or also resolve with a 'cancelled' or 'rejected'?).

Atomics.cancel(sab, index, N) seems reasonable to me as an analog to notify for the async waiters. It seems the case is a little weaker for the synchronous wait use case, but I also see no harm in supporting sync waiters.

Shall I make a proposal?

@marjakh

marjakh commented Jan 25, 2021

The API seems useful (though it's quite subtle why that is :) ).

The thought process I used for convincing myself: I wonder whether Atomics.cancel(sab, index, N) enables something that's not doable by Atomics.notify + distinguishing between the two different kinds of notifications on the receiver's side. The trivial way of distinguishing would involve setting a value in the SAB which the notified thread then reads. But we don't want that, since we want to release the memory before the notified thread wakes up. There are probably other ways to do the distinguishing, but meh. By adding a new "cancel" notification, the notified thread can honor the invariant that it won't read the memory if it receives the cancel notification.

Is Atomics.cancel(sab, index, N) ergonomic enough for the users, or do we need something more generic like Atomics.cancel(sab)? (The latter is more difficult to implement so I'd be happy with the former of course.)

@lars-t-hansen

I suppose our semantics are already deeply tied into having a critical region for the wait/notify system inside which all ambiguities about racing wait/cancel pairs boil away, so this seems no worse than what we're already doing.

My gut feeling says to restrict this to only async waiters, and to call it cancelAsync. Sync waiters will always have access to the memory and can always read a dedicated location to look for eg a cancellation code. I think there would need to be additional justification for making cancel work on these waiters.

@benjamingr

Sorry for probably missing it but: Can someone explain to me why .cancel(sab, index, N) can't be accomplished by passing a signal parameter to waitAsync?

Is it because the canceller needs to be able to do this independently from whomever calls waitAsync and cancel another operation? Is that behaviour desirable?

@syg

syg commented Jan 28, 2021

Sorry for probably missing it but: Can someone explain to me why .cancel(sab, index, N) can't be accomplished by passing a signal parameter to waitAsync?

Could you spell out some more what you had in mind for the extra parameter?

@benjamingr

async function doSomething({ signal } = {}) {  
  const { value } = await Atomics.waitAsync(view, waitIndex, 0, Infinity, { signal })
  // do more stuff 
}
const ac = new AbortController();
doSomething({ signal: ac.signal });

ac.abort(); // stops waiting, throws AbortError, just like fetch and other web APIs

@syg

syg commented Jan 28, 2021

Thank you for the illustrative example.

The uncomfortable but real political answer here is that TC39, where JS is developed, is a different standards body than WHATWG, where AbortController is developed. To extend Atomics.waitAsync to use AbortController wouldn't work.

On technical grounds, the JS spec features also strive to be cross-platform, and AbortController is web platform-specific.

@benjamingr

On technical grounds, the JS spec features also strive to be cross-platform, and AbortController is web platform-specific.

Some of the non-browser platforms like Node.js and Deno have added support for AbortController, we've shipped it in Node.js pretty recently. This was mostly done to reduce the cognitive overhead of developers so there is one (obvious) way to cancel things in APIs that support it (things like http.request and fs.readFile).

I am wondering if the spec can be phrased in such a way that ECMAScript will support cancellation with AbortController without requiring implementing or speccing the full API.

Alternatively, if that is impossible, we could propose a simpler signal-like API that Node can implement on its AbortControllers (or, even better, that could be made part of the DOM's) - Atomics could use that.

@littledan
Contributor

littledan commented Jan 29, 2021

If it were only down to this "political"/layering issue, I would be motivated to work out some kind of general solution. However, when discussing it earlier with @marjakh , I got the impression that it's important for the cancellation to come from a different Worker (compared to where waitAsync is called), where there's no particular way to share the AbortController with a Worker to trigger it, and this motivates an index-based API described above. Is this accurate?

@syg

syg commented Jan 29, 2021

I got the impression that it's important for the cancellation to come from a different Worker (compared to where waitAsync is called), where there's no particular way to share the AbortController with a Worker to trigger it

Yes the cancellation can certainly come from any Worker or the main thread. I wasn't aware there wasn't a way to share an AbortController across workers, thanks for pointing this out.

@littledan
Contributor

littledan commented Jan 29, 2021

Thinking about this a little more: I suspect we could define AbortController and AbortSignal as serializable objects (assuming they are restricted to only in-memory), but they aren't defined that way right now. But maybe such an interface would be overkill for this application, and index-based cancellation is more manageable and efficient.

@benjamingr

@syg

Yes the cancellation can certainly come from any Worker or the main thread. I wasn't aware there wasn't a way to share an AbortController across workers, thanks for pointing this out.

Why? Can you elaborate on this perhaps with a use case where one thread needs the capability of aborting a waitAsync of another thread.

I suspect we could define AbortController and AbortSignal as serializable objects (assuming they are restricted to only in-memory), but they aren't defined that way right now.

I suspect that's the case - though we can also provide a method on atomics to obtain an AbortController for a given index. That kind of bypasses the "the capability to abort something and the capability to listen to cancellation are separate" thing.

and index-based cancellation is more manageable and efficient.

I don't understand the problem domain well enough to know if that's a real limitation though waitAsync already seems pretty "heavy" returning a promise compared to .wait.

@syg

syg commented Jan 29, 2021

Why? Can you elaborate on this perhaps with a use case where one thread needs the capability of aborting a waitAsync of another thread.

@juj lays out the use case in the first half of this thread: deleting mutexes.

Taking a step back, the primary use case for Atomics.wait and Atomics.waitAsync is to implement higher-level synchronization mechanisms such as mutexes. The most popular example is emscripten-compiled pthreads programs. Since Atomics.wait blocks and the main thread cannot block, Atomics.waitAsync was introduced so that those programs can also use mutexes on the main thread, asynchronously.

Since these are C/C++ pthreads programs compiled via emscripten (perhaps interacting with wasm), these waiters, async or otherwise, are always going to be notified from another thread.

The fact that same-thread notifications are possible is a result of JS asynchrony, and they indeed have use cases in JS itself. But given how low-level the API is (it's futexes), I imagine the majority use case will remain as an implementation building block for multithreaded synchronization mechanisms.
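For concreteness, a minimal async mutex of the kind such programs layer on top of these primitives might look like this (a sketch only, not Emscripten's actual implementation; heap is an Int32Array over the SAB and idx the lock word, 0 = unlocked, 1 = locked):

// Acquire the lock, waiting asynchronously if it is contended.
async function lockAsync(heap, idx) {
  for (;;) {
    // try to take the lock: 0 -> 1
    if (Atomics.compareExchange(heap, idx, 0, 1) === 0) return;
    // otherwise wait until someone notifies this address while it still holds 1
    const result = Atomics.waitAsync(heap, idx, 1, Infinity);
    if (result.async) await result.value; // value is a promise only when the wait actually parked
  }
}

function unlock(heap, idx) {
  Atomics.store(heap, idx, 0);
  Atomics.notify(heap, idx, 1); // wake one waiter
}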

@benjamingr

Thanks @syg, that is very informative. I also went back and read juj's comments and did some googling.

I think Chrome's behaviour in the first example is probably a bug that's worth reporting if it hasn't been already (it should reclaim the memory like it does in the iframe case).

I think AbortController/AbortSignal don't make sense here (and neither does TC39 cancellation if that progresses to be honest) since semantically they are describing different types of cancellation and the semantics don't map.

I think waitAsync and cancel is indeed probably better.

Thanks for explaining it and convincing me.

@benjamingr

To be clear: the reason AbortController/AbortSignal does not work here is because we might need to be able to cancel a waitAsync "without cooperation" - so even if we could transfer AbortControllers and they were a part of the language and not the web platform it would not help since a signal was not necessarily passed.

@juj
Author

juj commented Apr 9, 2021

Heya, sorry for the long delay on this thread. I took it upon myself to come up with a test case that would polyfill/simulate the waitAsync() and cancelAsync() behavior, to confirm whether such a cancelAsync() API would be race free and work to fix the issue, so that I'd have good confidence in the solution before @syg would proceed to add it to the spec.

But unfortunately I was not able to craft a test case that would not be a bit silly, so let me try to explain the race concern instead. The following sequence of events seems plausible:

  1. Main thread malloc()s a mutex location in a SAB,
  2. Main thread performs a waitAsync() on the mutex location to access the shared memory in the main thread.
  3. Before the waitAsync() has resolved, maybe the main thread receives another event (XHR/user input/etc) and decides that it should now tear down the database and the lock, and it calls .cancelAsync() to synchronously terminate any pending waits. Then back-to-back it calls free() on the lock memory, expecting that no waitAsync() should be able to race through in between that would still erroneously access the mutex memory location.

If we just have a main thread and no Workers, this race probably cannot happen, since naturally the waitAsync()s won't resolve from the queue while the main thread is executing (and cancelAsync() can certainly clear any pending waitAsync()s that might have already been placed in the event queue?)

But what if this scenario is colored with other Workers in the mix? It could be the case that right when the main thread calls .cancelAsync() on a memory location, another Worker has just decided to resolve a .waitAsync() in that Worker, and is executing the handler right then (or maybe hasn't reached executing JS code just yet). And then a .waitAsync() does resolve seemingly during/after a .cancelAsync() has taken place. This would mean that the main thread could not free() a mutex even if it had just issued a .cancelAsync(), unless it knew that it meant to just cancel its own waits.

So it seems that such a .cancelAsync() should be able to detect whether any .waitAsync()s are running in other Workers, and if so, should it then synchronously block for as long as those operations are running? I wonder if that would even be a good idea(?)

@syg

syg commented Apr 12, 2021

@juj Indeed that scenario points out that cancelAsync() has a different critical section from the one for the WaiterList. The WaiterList critical section is for exclusive access to, well, the WaiterList for a location. But it seems like the cancelAsync() guarantee that you want isn't that it had exclusive access to the WaiterList, but that there are no waitAsync handlers currently executing or already scheduled.

That is indeed going to be harder to implement, nor am I sure it even makes sense. What would you consider to be a waitAsync handler? Only the span of the immediate microtask that's triggered by the waitAsync promise? What about microtasks that it transitively enqueues?

All of this makes me feel like that cancelAsync(), even if it exists, isn't going to enable the programming pattern of `cancelAsync(); free();` that you want. You'd still need to coordinate some other way with user code to not touch the mutex memory while trying to cancel...

@juj
Author

juj commented Apr 13, 2021

What would you consider to be a waitAsync handler? Only the span of the immediate microtask that's triggered by the waitAsync promise? What about microtasks that it transitively enqueues?

That is a good question. I do think if such thing exists, that it should probably not cover any transitively enqueued tasks.

All of this makes me feel like that cancelAsync(), even if it exists, isn't going to enable the programming pattern of `waitAsync(); free();` that you want.

(You probably meant to write `cancelAsync(); free();`?)

You'd still need to coordinate some other way with user code to not touch the mutex memory while trying to cancel...

Indeed, and it seems that this coordination needs to also become asynchronous no matter what. The issue is that the mechanisms that one would use in regular native code to achieve such a guarantee do not seem to work for web/waitAsync code, but that achieving this guarantee fundamentally seems to turn the detaching/joining/parking code async as well.

If I step away from porting existing C/C++ code and think of the canonical "webby" ways to use waitAsync instead via JavaScript, let's say for implementing a main thread producer - N Workers consumers message queue:

  • a shared location C on SAB specifies how many messages are in the queue,
  • all N Workers would waitAsync on memory location C as long as value is 0,
  • main thread atomically increments C whenever a new task becomes available, and notifies a Waiter,
  • each Worker, when it is notified, does an atomic cmpxchg decrement to consume a work item.

Say, for example, a multithreaded image decoder could work like this; that seems to be a popular feature for web games.
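A sketch of what the consumer and producer sides could look like (illustrative names only; heap is a shared Int32Array and C the index of the counter):

// Worker-side consumer (sketch): wait while heap[C] is 0, then try to claim one item.
function consumeForever(heap, C) {
  // Promise.resolve() also covers the case where work is already queued ('not-equal').
  Promise.resolve(Atomics.waitAsync(heap, C, 0, Infinity).value).then(() => {
    let n;
    // cmpxchg decrement to consume one work item, racing against the other Workers
    while ((n = Atomics.load(heap, C)) > 0) {
      if (Atomics.compareExchange(heap, C, n, n - 1) === n) {
        /* ...decode one image / process one message... */
        break;
      }
    }
    consumeForever(heap, C); // go back to waiting
  });
}

// Main-thread producer (sketch): publish one message and wake one Worker.
function publishMessage(heap, C) {
  Atomics.add(heap, C, 1);
  Atomics.notify(heap, C, 1);
}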

In this kind of scenario, when the work queue is idle, all N Workers have a pending waitAsync active, waiting for C to become nonzero. Now, if one wanted to tear down this setup from the main thread, what would be the preferred way?

  1. Call worker.terminate() for all Workers? There is no notification of when worker.terminate() has actually taken place. So when can the main thread free() up the memory location C for other purposes?
  2. If we just assume that freeing C could be done safely after some suitable time has passed, it might still not be nice, since spawning a new Worker is a very heavyweight operation, which makes one want to reuse the Workers for multiple purposes. So instead of .terminate()ing, one might want to tell the Workers that they should now stop working on that task, as they'll be repurposed for something else. This could be done via a postMessage(), but when a Worker receives that postMessage(), it cannot even undo the waitAsync() it had issued on its own. So even if instead of a sync teardown we were doing an async cooperative teardown, the Worker cannot assist/cooperate in that.

I wonder if we should have a .cancelAsync() which works like setTimeout()/clearTimeout(), where setTimeout() returns a token that one can abort by calling clearTimeout()? In such a mode, a Worker could at least asynchronously and cooperatively assist the main thread in a request to shut the Worker down.

Now, it looks like it is possible to emulate such a setTimeout-style token in JS by creating an auxiliary JS object that tracks "alive" waitAsyncs; a JS-implemented cancelAsync() could then remove a wait ID from that object, so that later, if the waitAsync() does really resolve, it could be checked against the waits still remaining in the JS-side object table.
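A sketch of that emulation (all names invented; note that this only makes the handler fizzle - the underlying waitAsync itself keeps waiting, which is exactly the concern below):

let nextWaitId = 1;
const liveWaits = new Map(); // waitId -> handler

// Emulated waitAsync-with-token: returns an id that can later be "cancelled".
function waitAsyncWithToken(heap, index, value, handler) {
  const id = nextWaitId++;
  liveWaits.set(id, handler);
  Promise.resolve(Atomics.waitAsync(heap, index, value, Infinity).value).then((result) => {
    const h = liveWaits.get(id);
    liveWaits.delete(id);
    if (h) h(result); // only fire if the wait was not cancelled in the meantime
  });
  return id;
}

// Emulated cancelAsync: the engine-level wait stays registered (and keeps its
// resources alive); we merely drop the handler so it never runs.
function cancelAsync(id) {
  liveWaits.delete(id);
}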

However, that kind of method could potentially leave hundreds or thousands of such stale waitAsync()s alive in a program, since these types of waits could be performed at a high cadence, maybe even once or more per requestAnimationFrame(). Talking with @lars-t-hansen a long time ago, I recall one model for waitAsync() was that a browser would spin up a native helper thread to take care of the .waitAsync() request. Leaving multiple dormant wait threads behind would naturally be quite catastrophic.

So it seems that there are two problems even on the native web/JS side operation of .waitAsync: one cannot observe when a Worker .terminate()s to know when it is safe to really free up a memory location in a SAB, and one cannot even cooperatively cancel a wait from within the Worker that issued the wait. I wonder if there should exist a .cancelAsync() that would just cancel one's own issued wait instance via a token?

@syg

syg commented Apr 13, 2021

Good idea about the cancellation tokens, I'll need to think more about that. (Also, TC39 is coming up next week and I unfortunately won't really have time to fully re-engage here until afterwards.)

@syg

syg commented May 6, 2021

Paging this back in. I like the cancellation token idea a lot, but ISTM the cancelWaitAsync(id) API would need to be stronger than the cancelTimeout(id) API. Since the code you want to be able to write is cancelWaitAsync(id); free(mutex_mem);, cancelWaitAsync(id) can't be fire-and-forget like cancelTimeout(id); instead it needs to return once all pending waits and also any currently executing handlers (non-transitive, as we discussed above) are finished executing, right?

Of course, that can't be done synchronously since it'd be tantamount to blocking, so cancelWaitAsync(id) itself would be async, and you'd need to do await cancelWaitAsync(id); free(mutex_mem);.

Would that work for you?

@juj
Author

juj commented May 8, 2021

so cancelWaitAsync(id) itself would be async, and you'd need to do await cancelWaitAsync(id);

Yikes :/ Yeah, I do follow that argument, but the reality is that it would not work at all. I cannot defer calling free() there, because I cannot guarantee whether the memory allocator is even alive later, depending on what type of deinitialization the application is performing. I.e. it could be that the application is deinitializing completely, including its memory allocators, maybe for a new run altogether. This could include throwing out everything in some type of compartment along with its allocator, or it could involve reinitializing the heap.

Not sure I have ideas on what to propose next. Pragmatically I am starting to lean towards not including Atomics.waitAsync based access to synchronization primitives at all in the wasm workers API.

Also commented in #177 (comment) , which relates to this somewhat.

@juj
Author

juj commented May 8, 2021

can't be fire-and-forget like cancelTimeout(id); instead it needs to return once all pending waits and also any currently executing handlers (non-transitive, as we discussed above) are finished executing, right?

Or actually, are you thinking that the tokens would be integers that one can store to the SAB?

Because if the model was that cancelWaitAsync(token) only cancels a wait token that the calling thread itself issued, then there could not be any currently executing handlers, and hence one could have a guarantee(?) that after the cancelWaitAsync function returns there can't be any handlers still executing, since those would have to be executing on the same thread?
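In that model, the usage on any single thread could then be as simple as the following (cancelWaitAsync being the hypothetical same-thread-only API, and token whatever the wait registration handed back):

// Hypothetical: the wait was registered on this thread, so its handler can only ever run here.
// cancelWaitAsync(token) drops it synchronously; nothing can be mid-flight on this thread
// while this code is running, so freeing the memory right afterwards would be safe.
cancelWaitAsync(token);
_free(lockByteAddress); // illustrative: release the lock word in the wasm heap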

@syg

syg commented May 11, 2021

Because if the model was that cancelWaitAsync(token) only cancels a wait token that the calling thread itself issued, then there could not be any currently executing handlers, and hence one could have a guarantee(?) that after the cancelWaitAsync function returns there can't be any handlers still executing, since those would have to be executing on the same thread?

Ah indeed, that might work!
