Replies: 7 comments 19 replies
-
Hey! A few hints / questions.
-
Hey @kali,
A single element: [benchmark image]
Batch of 20: [benchmark image]
Batch of 8-8-4 ("fixed"): [benchmark image]
It seems like I'm very close to the raw tract performance in cases 1 and 2, but my batching has some noticeable overhead (which likely relates to all the noise, too... I know it's more allocation-intensive than the other two.)
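A sketch of the kind of buffer reuse that could cut that allocation pressure — illustrative names and a flat row-major f32 layout are assumptions, not the actual benchmark code:

```rust
/// Reusable staging buffer for batch inputs, to avoid a fresh allocation
/// per batch submission (sketch; flat row-major f32 layout assumed).
struct Batcher {
    scratch: Vec<f32>,
}

impl Batcher {
    fn stage<'a>(&'a mut self, items: &[&[f32]], features: usize) -> &'a [f32] {
        self.scratch.clear(); // keeps capacity from previous batches
        self.scratch.reserve(items.len() * features);
        for item in items {
            debug_assert_eq!(item.len(), features);
            self.scratch.extend_from_slice(item);
        }
        &self.scratch
    }
}
```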
-
If we take a step back, there is also a wider question to consider: do you want to optimise for latency or for bandwidth? My understanding is you get items to process at more or less random intervals. It may also be worth choosing one single fixed batch size and running it always, even for partial batches, instead of having several variants of the network memoized. Switching network variants may have a cost in terms of cache locality. And on the other hand, in the matrix multiplication operations, if a 6-wide kernel has been chosen, running a 4-wide batch will still use a 6-wide tile, so it takes roughly the same time.
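One way to read that single-fixed-batch-size suggestion in code — a minimal sketch assuming a flat `[FIXED_BATCH, FEATURES]` f32 input layout (the constants and `pad_batch` are illustrative names, not tract API):

```rust
/// Pad a partial batch with zeros up to one fixed batch size, so a single
/// memoized network variant serves every request (sketch; sizes assumed).
const FIXED_BATCH: usize = 8;
const FEATURES: usize = 64;

fn pad_batch(items: &[[f32; FEATURES]]) -> Vec<f32> {
    assert!(items.len() <= FIXED_BATCH);
    // Zero-filled [FIXED_BATCH, FEATURES] buffer; real rows fill the front.
    let mut buf = vec![0.0f32; FIXED_BATCH * FEATURES];
    for (i, item) in items.iter().enumerate() {
        buf[i * FEATURES..(i + 1) * FEATURES].copy_from_slice(item);
    }
    // Feed as a [FIXED_BATCH, FEATURES] tensor; discard the padded rows of
    // the output.
    buf
}
```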
-
It's worth mentioning that while the arm kernels have been very optimised, the same cannot be said about the intel ones. We use a 16x6 kernel that performs reasonably well, but there is probably some more performance to be found there if we start optimising for intel vs amd chips, and ideally for specific variants of them too... This is a space where I'm hoping third party developers can offer help and make a difference, as we prioritise our internal resources on arm platforms.
-
So you can find background information in the three blog posts starting here: https://tech-blog.sonos.com/posts/optimising-a-neural-network-for-inference/ . The kernel selection happens here: https://github.com/sonos/tract/blob/main/linalg/src/x86_64_fma.rs . It's pretty basic. The aarch64 code ( https://github.com/sonos/tract/blob/main/linalg/src/arm64.rs#L88 ), on the other hand, 1/ has many variants to choose from, so we use a small dnn model to choose the implementation (trained on measurements from several cpu variants), and 2/ since runtime feature detection on aarch64 is not stable, we do it by hand. Kernels can be found in x86_64/fma. There is a bit of language abuse: the i32 kernel should live under "avx", but right now everything is under "fma". Adding a kernel specialization is a bit of work, but we have already done this on aarch64. The main file for the current kernel is https://github.com/sonos/tract/blob/main/linalg/x86_64/fma/fma_mmm_f32_16x6.tmpl .
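For flavour, "pretty basic" runtime selection on x86_64 can look roughly like this, since std's feature detection is stable there (unlike aarch64) — an illustrative sketch, not tract's actual selection code, and the kernel names are placeholders:

```rust
#[cfg(target_arch = "x86_64")]
fn select_f32_kernel() -> &'static str {
    // std's runtime feature detection is stable on x86_64, so selection can
    // be a simple cascade. Kernel names here are placeholders.
    if is_x86_feature_detected!("fma") {
        "fma_mmm_f32_16x6"
    } else {
        "generic_f32_4x4"
    }
}
```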
-
Cherry-picking some questions here to answer for now; will try to get back with more info and some code next week as I try to take this from raw inner loops to actual usable kernels.
This is the main problem, yes. AVX-512 is only a strong perf gain if you have two FMA ports; with one port you're lucky to get equal perf. On some platforms you'll see a clear degradation in clock speed when using any AVX-512 instructions, which makes it slower than two-port AVX2.
Neither am I, really. I'm just hoping to throw enough funky stuff at the wall to make something stick... hence the smattering of kernel sizes. But I think I'm in a good enough position to start selecting and putting out kernels more efficiently. I'm half expecting the optimal AVX-512 kernels to actually be the same as the AVX2 ones when looking at register distribution (i.e. treating 16x6 AVX2 and 32x6 AVX-512 as 2x6 in register terms).
This had me stumped for a bit, but I believe what happens is that I can't saturate the FPUs anymore: I have one load for the weights, then one load per FMA... leading to one FPU essentially idling. So I think a 32x6 kernel would actually perform well here, or even 64x6. Consider the ratio of loads to FMAs: we started at 8:12, and with the AVX-512 change we're at 7:6. Re the overperformance, I'm actually guessing that since we don't measure actual cycles but rather time, it ends up being... a bit inexact. Also, I don't think AVX2 necessarily runs at base clock, which would be 48 GFlops. But I'm not sure; I haven't found any clock table that includes my CPU. I'm mostly looking for relative numbers to see whether I'm improving... I'm not going to go comparing this against BLAS or SGEMM and claim I'm better.
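To make the starting 8:12 ratio concrete, here is one inner-loop step of a 16x6 AVX2 kernel shape — an illustrative sketch, not tract's actual kernel: 2 full-width loads of A plus 6 broadcast loads of B feed 12 FMAs.

```rust
/// One inner-loop step of a 16x6 AVX2 micro-kernel (sketch): 2 column loads
/// of A + 6 broadcast loads of B against 12 FMAs, i.e. the 8:12 ratio above.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,fma")]
unsafe fn step_16x6(a: *const f32, b: *const f32, acc: &mut [std::arch::x86_64::__m256; 12]) {
    use std::arch::x86_64::*;
    // 2 loads: one 16-wide column of A split across two YMM registers.
    let a0 = _mm256_loadu_ps(a);
    let a1 = _mm256_loadu_ps(a.add(8));
    for j in 0..6 {
        // One broadcast load of a B scalar per output column (6 total)...
        let bj = _mm256_broadcast_ss(&*b.add(j));
        // ...feeding two FMAs per column (12 total).
        acc[2 * j] = _mm256_fmadd_ps(a0, bj, acc[2 * j]);
        acc[2 * j + 1] = _mm256_fmadd_ps(a1, bj, acc[2 * j + 1]);
    }
}
```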
No, valid question! For a while (due to the failed 16x6 upgrade) I was under the impression that AVX-512 ran on port 5 only, as it does on some Skylake CPUs, so this is 2x2 AVX-512 with AVX2 tails, but it actually makes the loads-to-FMA balance bad (5:6). AVX-512-2L is AVX2 but using only the extended register range, to avoid reusing registers (later I renamed this to
I haven't tried anything square-ish, so I can't say if that works better. My optimization goal is primarily to get skinny x square to be faster. Consider the ratios here, though: this is 7 loads and 10 outs. That's not such a bad ratio -- but I should probably go for 96x2, since that's just a flipped 16x6 (2x6 -> 6x2).
SKX is just the code for Skylake-X, the CPU platform I'm running on. I was trying to push this one a bit harder with AVX2, but never got anything statistically significant out of it. I think this is one of the cases where the tools aren't happy, as it's 9 loads to 8 FMAs. And there's not a lot of room to play with, since the bulk of the loads are full-register loads (i.e. not broadcasts), so tricks based on shuffling/permuting won't work.
-
Hello!
I'm digging deep into optimizing our performance when using tract, and so far I've stayed mostly on my side of the fence, treating tract code as a black box. However, I'm starting to push up against points where the tract internals dominate my profiles, and I'd love to open up a discussion about how to push tract even harder.
Given that I'm in games, the current "full" pipeline that I'm investigating is split in three parts (a loading sketch follows the list):

- Cook: `into_typed()` and then `into_decluttered()`
- Loading: `into_optimized()`, `into_decluttered()`, `into_runnable()`
- Execute
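As a rough sketch of the Loading half, mirroring the calls listed above and reading from an in-memory buffer (the exact API surface may vary by tract version; this is illustrative, not my actual code):

```rust
use tract_nnef::prelude::*;

// Sketch: load a cooked NNEF model from an in-memory buffer and prepare it
// for execution, mirroring the Loading steps listed above.
fn load(nnef_bytes: &[u8]) -> TractResult<TypedSimplePlan<TypedModel>> {
    let nnef = tract_nnef::nnef();
    nnef.model_for_read(&mut std::io::Cursor::new(nnef_bytes))?
        .into_optimized()?
        .into_decluttered()?
        .into_runnable()
}
```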
The measurements I focus on improving are Loading and Execution, since those happen while the game is running.
My findings so far are that cooking ONNX to NNEF and then using that cuts load times quite a bit. I'm very happy about this already, but sub-10ms is my goal, at least for dynamic (these categories are described below). Not much else to say so far; it looks like parsing eats a lot of time (understandably). All measurements are done with files fully loaded, to avoid measuring IO -- I'm passing a Cursor around the raw bytes buffer.
Sadly I've had less success in reducing execution times in the general case. However, I have noticed that batching has a massive impact:
The three groups in this image show three different strategies:

- `none`/`simple` means we pass one example through the model at a time
- `fixed` means we minibatch as appropriate to some fixed set of pre-baked batch sizes - [1, 2, 4, 8] in this case
- `dynamic` means we concretize for the exact number of elements (on-demand but memoized; see the sketch below)

Note that this is normalized by the actual number of examples, so 20 in this example. So a batch of 20 elements processed one-at-a-time takes about 10ms, while running that as a single batch takes about 3ms. This difference shrinks as batch sizes go down, and at a single element they line up at 500us.
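For reference, the `dynamic` strategy is roughly this shape — a sketch assuming a tract version where the batch dimension is a symbol named "N" and `concretize_dims`/`SymbolValues` have the shapes shown below (adjust to your tract version; this is not my exact code):

```rust
use std::collections::HashMap;
use tract_nnef::prelude::*;

/// Memoize one concretized, optimized plan per batch size (sketch).
struct DynamicPlans {
    decluttered: TypedModel, // batch dim left as a symbol "N" (assumption)
    by_batch: HashMap<usize, TypedSimplePlan<TypedModel>>,
}

impl DynamicPlans {
    fn plan_for(&mut self, n: usize) -> TractResult<&TypedSimplePlan<TypedModel>> {
        if !self.by_batch.contains_key(&n) {
            // Pin the symbolic batch dim to `n`, then optimize and plan.
            let sym = self.decluttered.symbol_table.sym("N");
            let plan = self
                .decluttered
                .concretize_dims(&SymbolValues::default().with(&sym, n as i64))?
                .into_optimized()?
                .into_runnable()?;
            self.by_batch.insert(n, plan);
        }
        Ok(&self.by_batch[&n])
    }
}
```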
I'm unsure if this slowness can be remediated and where it comes from, but it seems to me like there's a ~350us overhead when doing small or single-element batches. That is unfortunately also my most common case! I'm unsure right now whether this is related to something in our setup or something in tract - but I do want to find out and help shrink the gap. 💨
I'd love to know if others have similar - or other - experiences and what steps you've taken to improve things, if any! Also happy to hear if there are other metrics that would be interesting to discuss/measure, as I have a fairly extensible benchmarking setup.