Planar-only considered harmful #2458

Open
cmuratori opened this issue Nov 18, 2021 · 8 comments

@cmuratori

As far as I could tell from the spec and the API, the design of the WebAudio API is such that it is always "planar" rather than interleaved, no matter what part of the pipeline is in play. While the efficiency of this design for a filter graph is a separate concern, the WebAudio design creates a more serious issue because it does not distinguish between the final output format and the graph processing format.

As WASM becomes more prevalent, more people will be writing their own audio subsystems. These subsystems will have to output to the browser at some point. At the moment, the only viable option is to use WebAudio. Because WebAudio only supports planar data formats, this means people's internal audio subsystems must output planar data.

This creates a substantial inefficiency. A large installed base of CPUs does not have planar scatter hardware. Since modern mixers must use SIMD to be fast, scattering channel output 2-wide (stereo) or 8-wide (spatial) is extremely costly, as several instructions must be used to manually deinterleave and scatter each sample.

To add insult to injury, most hardware expects to receive interleaved sample data. This means that in many cases, after the WASM code has taken a large performance hit to deinterleave to planar, the browser will then turn around and take another large performance hit to reinterleave the samples, often (again) without gather hardware, meaning it will require several instructions to manually reinterleave.
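
For concreteness, here is a scalar sketch (TypeScript, illustrative buffer names only) of the round trip just described; the SIMD version of step 1 is where the several-instructions-per-sample cost comes from, since without scatter hardware each vector has to be shuffled/unpacked manually:

```ts
// Illustrative sketch of the deinterleave/reinterleave round trip, scalar for clarity.
const frames = 128;
const channels = 2;

// The mixer's natural output: interleaved stereo frames [L0, R0, L1, R1, ...].
const mixed = new Float32Array(frames * channels);

// Step 1 (application): deinterleave into the planar buffers Web Audio requires.
const planar = [new Float32Array(frames), new Float32Array(frames)];
for (let i = 0; i < frames; i++) {
  for (let c = 0; c < channels; c++) {
    planar[c][i] = mixed[i * channels + c];
  }
}

// Step 2 (browser, conceptually): re-interleave for the OS audio API, undoing step 1.
const device = new Float32Array(frames * channels);
for (let i = 0; i < frames; i++) {
  for (let c = 0; c < channels; c++) {
    device[i * channels + c] = planar[c][i];
  }
}
```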

I would like to recommend that serious consideration be given to supporting interleaved float as an output format. It could be a separate path just for direct audio output, and does not have to be part of the graph specification, if that reduces the cost of adding it to the specification. It could even be made as part of a WASM audio output specification only, with no JavaScript support, if necessary, since it would presumably only be relevant to people writing their own audio subsystems. I believe there is already consideration of WASM-specific use cases, as I have seen mentions of the need to avoid cloning memory from the WASM memory array into the JavaScript audio worklet, etc.

If I have misunderstood the intention here in some way, I would welcome explanations as to how to avoid the substantial performance penalties inherent in the current design.

- Casey

@tklajnscek

Just wanted to drop a quick note here that I wholeheartedly agree with Casey on this.

We have an Audio Worklet backend live in multiple products and this was one of the things that felt wrong when I was coding it.

All our internal processing (C++ compiled to WASM) uses interleaved data and we deinterleave when filling the buffers in the worklet processor.
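
As an illustration of that pattern, here is a minimal sketch of such a worklet processor in TypeScript; `renderInterleaved` and `memory` are hypothetical stand-ins for whatever the WASM module actually exports:

```ts
// Minimal sketch: an AudioWorkletProcessor that asks a WASM mixer for interleaved
// float frames and scatters them into the planar output buffers.
class InterleavedMixerProcessor extends AudioWorkletProcessor {
  // Wired up elsewhere, e.g. by instantiating module bytes received over this.port.
  private wasm: {
    renderInterleaved(frames: number, channels: number): number;
    memory: WebAssembly.Memory;
  } | null = null;

  process(_inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const output = outputs[0];
    if (!this.wasm) return true; // keep the node alive until the module is ready

    const channels = output.length;
    const frames = output[0].length; // the render quantum, currently always 128

    // The mixer renders `frames` interleaved frames and returns a byte offset into its heap.
    const ptr = this.wasm.renderInterleaved(frames, channels);
    const interleaved = new Float32Array(this.wasm.memory.buffer, ptr, frames * channels);

    // The deinterleave this thread is about: one scatter per sample, per channel.
    for (let c = 0; c < channels; c++) {
      const dst = output[c];
      for (let i = 0; i < frames; i++) {
        dst[i] = interleaved[i * channels + c];
      }
    }
    return true;
  }
}

registerProcessor('interleaved-mixer', InterleavedMixerProcessor);
```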

As all native / low-level interfaces I've used so far expect the data interleaved, this seems very counterintuitive and bad for performance, as Casey said.

So if we can't have interleaved everywhere due to some need of the graph processing system, at least a configurable option to bypass this conversion for the simple case would be great.

@meshula

meshula commented Nov 20, 2021

The API fundamentally makes you deal with channels (input/output indices). https://developer.mozilla.org/en-US/docs/Web/API/AudioNode/connect

At a basic level, I agree that an interleaved input on a destination node, and interleaved output from media sources, would be a fantastic addition.

At the graph level, I've always wished that nodes' inputs and outputs were some kind of Signal object, rather than planar PCM. In particular, I wish the Signal object could carry planar data, fused spatial or quasi-spatial data such as ambisonic signals, or spectral data, and that the signal rate was the natural signal rate for the data. Furthermore, I'd hope that there'd be some interface to test whether signal-outs are compatible with signal-ins, and that in the case of incompatible signals, explicit adaptor nodes might be available, so that deplanarization, spectralization, rate conversion, or up- and down-mixing would never be heuristically applied, but explicitly supplied to support the dataflow.
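
Purely to illustrate the idea, a hypothetical TypeScript shape for such a Signal abstraction might look like the sketch below; nothing like this exists in the current spec:

```ts
// Hypothetical sketch only: none of these types exist in the Web Audio API.
type SignalKind =
  | { tag: 'planar-pcm'; channels: number }
  | { tag: 'interleaved-pcm'; channels: number }
  | { tag: 'ambisonic'; order: number }
  | { tag: 'spectral'; bins: number };

interface Signal {
  kind: SignalKind;
  sampleRate: number; // the natural rate for this data, not necessarily the context rate
  data: Float32Array | Float32Array[];
}

// Compatibility would be tested explicitly rather than coerced heuristically;
// real rules would also compare channel counts, ambisonic order, rates, etc.
function compatible(out: SignalKind, inp: SignalKind): boolean {
  return out.tag === inp.tag;
}

// Incompatible connections would then require an explicit adaptor node,
// e.g. a hypothetical DeinterleaverNode, SpectralizerNode, or ResamplerNode.
```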

@padenot
Member

padenot commented Nov 22, 2021

The answer to most of the questions here is "for historical reasons". The Web Audio API was shipped by web browsers without having been fully specified, and without enough consideration for really advanced use cases and high performance. The alternative proposal was direct PCM playback in interleaved format, but it didn't get picked; this was about 10 years or so ago. The Web Audio API's native nodes will never work in interleaved mode, because the design is fundamentally planar (as shown in previous messages here), but this doesn't mean the problem cannot be solved so that folks with demanding workloads can use the Web.

Another API was briefly considered a few years back, but it didn't feel important enough to continue investigating, in light of the performance numbers gathered at the time. It was essentially just an audio device callback, but this is implementable today with just a single AudioWorkletNode.

That said, the first rule of any performance discussion is to gather performance data. In particular, are we talking here about:

(a) Software that uses a hybrid of native audio nodes, and AudioWorkletProcessor
(b) Multiple AudioWorkletProcessor, with the audio routed via regular AudioNode.connect calls
(c) A single AudioWorkletProcessor, with custom audio rendering pipeline or graph in it

or some other setup, or maybe something hybrid?

(a) and (b) suffer from lots of copies to/from the WASM heap; (c) doesn't: a single copy from the WASM heap will happen, at the end.

(a) and (b) also suffer from a minimum processing block size (referred to in the spec as a "render quantum") of (for now) 128 frames, but this will change in #2450. (c) doesn't have this issue.

For (a) and (b), the interleaving/deinterleaving operations can be folded into the copy to the WASM heap (with possible sample-type conversion if, e.g., the DSP works in int16 or fixed point). This lowers the real cost of the interleaving/deinterleaving operations (without eliminating it). (See the links at the end for the elimination of this copy.)
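
As a sketch of that folding, assuming a DSP that produces interleaved int16 in the WASM heap (hypothetical buffer names), the conversion and the deinterleave can share the single pass that fills the planar outputs:

```ts
// Folding the format conversion into the unavoidable copy: one pass over the
// interleaved int16 heap data fills the planar float outputs, instead of
// deinterleaving first and converting afterwards.
function copyDeinterleaveI16ToPlanar(
  src: Int16Array,      // interleaved int16 frames in the WASM heap
  dst: Float32Array[],  // planar outputs handed to process()
  frames: number,
): void {
  const channels = dst.length;
  const scale = 1 / 32768;
  for (let c = 0; c < channels; c++) {
    const out = dst[c];
    for (let i = 0; i < frames; i++) {
      out[i] = src[i * channels + c] * scale; // convert and scatter in one step
    }
  }
}
```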

For (c), the interleaving / deinterleaving operations will happen exactly twice (as noted): going into the AudioContext.destination, and from the AudioContext.destination to the underlying OS API. My guess is that this is wasteful but negligible with any meaningful DSP load, at least with all workloads I've measured over the years.
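
For reference, a minimal main-thread sketch of setup (c): one AudioWorkletNode hosting the whole custom pipeline, connected straight to the destination ('mixer-processor.js' and 'interleaved-mixer' are hypothetical names):

```ts
// Setup (c): the entire custom rendering pipeline lives inside a single
// AudioWorkletProcessor; only one node is connected to the native graph.
async function startAudio(): Promise<AudioContext> {
  const ctx = new AudioContext();
  await ctx.audioWorklet.addModule('mixer-processor.js');

  const mixerNode = new AudioWorkletNode(ctx, 'interleaved-mixer', {
    numberOfInputs: 0,        // the processor generates audio itself
    outputChannelCount: [2],  // stereo output
  });
  mixerNode.connect(ctx.destination);
  return ctx;
}

// Everything upstream (voices, effects, spatialization) is mixed inside the
// processor's process() callback; the only native-graph hop is node -> destination.
```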

Again, what is needed first and foremost is real performance numbers. Thankfully, if there is already running code implementing the approaches above ((a), (b) and (c), possibly others), it's not particularly hard to get them, using https://blog.paul.cx/post/profiling-firefox-real-time-media-workloads/ and https://web.dev/profiling-web-audio-apps-in-chrome/. I assume the people in this discussion can skip most of the prose in both those articles, because they are familiar with real-time-safe code, and can skip to the part about getting/sharing the data.

Here we're mostly interested in the difference between the total time it took to render n frames of audio (i.e. the AudioNode DSP methods, plus the calls to the process() method of each AudioWorkletProcessor instantiated in the graph) and the time it took for the callback to run, which would essentially be the routing overhead. It's going to be possible to get rather precise information about the per-AudioNode or per-AudioWorkletProcessor overhead; I'm happy to help if anybody is interested. Then there's the overhead of the additional audio IPC that browsers have to implement because of sandboxing and other security concerns, which can be another copy depending on the architecture, but this is a fixed cost per audio device.

Some assorted links for context:

@guest271314

This comment was marked as off-topic.

@o0101

o0101 commented Sep 17, 2022

I'm currently working on something like this for remote browser isolation audio streaming, and I'm having to resort to using a mono stream from parec because the format is planar (instead of interleaved) on the client, and I don't know a performant way to get the planar format AudioContext expects.

I think the use case of real-time processing / playing-from-stream of audio is pretty important.

Is there a way to do this with stereo?

@guest271314

This comment was marked as off-topic.

@guest271314

This comment was marked as off-topic.

@o0101

o0101 commented Sep 18, 2022

Thank you, @guest271314! This looks awesome. I might try to use your code at some point 🙂
