Replies: 7 comments 19 replies
-
Hey! A few hints / questions.
-
Hey @kali,
A single element: [benchmark image]
Batch of 20: [benchmark image]
Batch of 8-8-4 ("fixed"): [benchmark image]
It seems like I'm very close to the raw tract performance in cases 1 and 2, but my batching has some noticeable overhead (which likely relates to all the noise, too... I know it's more allocation-intensive than the other two.)
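A sketch of the kind of buffer reuse that could cut that allocation pressure — illustrative names and a flat row-major f32 layout are assumptions, not the actual benchmark code:

```rust
/// Reusable staging buffer for batch inputs, to avoid a fresh allocation
/// per batch submission (sketch; flat row-major f32 layout assumed).
struct Batcher {
    scratch: Vec<f32>,
}

impl Batcher {
    fn stage<'a>(&'a mut self, items: &[&[f32]], features: usize) -> &'a [f32] {
        self.scratch.clear(); // keeps capacity from previous batches
        self.scratch.reserve(items.len() * features);
        for item in items {
            debug_assert_eq!(item.len(), features);
            self.scratch.extend_from_slice(item);
        }
        &self.scratch
    }
}
```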
-
If we take a step back, there is also a wider question to consider: do you want to optimise for latency or for bandwidth? My understanding is you get items to process at more or less random intervals. It may also be worth choosing one single fixed batch size and running it always, even for partial batches, instead of having several variants of the network memoized. Switching network variants may have a cost in terms of cache locality. And on the other hand, in the matrix multiplication operations, if a 6-wide kernel has been chosen, running a 4-wide batch will still use a 6-wide tile, so it takes roughly the same time.
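One way to read that single-fixed-batch-size suggestion in code — a minimal sketch assuming a flat `[FIXED_BATCH, FEATURES]` f32 input layout (the constants and `pad_batch` are illustrative names, not tract API):

```rust
/// Pad a partial batch with zeros up to one fixed batch size, so a single
/// memoized network variant serves every request (sketch; sizes assumed).
const FIXED_BATCH: usize = 8;
const FEATURES: usize = 64;

fn pad_batch(items: &[[f32; FEATURES]]) -> Vec<f32> {
    assert!(items.len() <= FIXED_BATCH);
    // Zero-filled [FIXED_BATCH, FEATURES] buffer; real rows fill the front.
    let mut buf = vec![0.0f32; FIXED_BATCH * FEATURES];
    for (i, item) in items.iter().enumerate() {
        buf[i * FEATURES..(i + 1) * FEATURES].copy_from_slice(item);
    }
    // Feed as a [FIXED_BATCH, FEATURES] tensor; discard the padded rows of
    // the output.
    buf
}
```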
-
It's worth mentioning that while the arm kernels have been very optimised, the same cannot be said about the intel ones. We use a 16x6 kernel that performs reasonably well, but there is probably some more performance to be found there if we start optimising for intel vs amd chips, and ideally for specific variants of them too... This is a space where I'm hoping third party developers can offer help and make a difference, as we prioritise our internal resources on arm platforms.
-
So you can find background information in the three blog posts starting here: https://tech-blog.sonos.com/posts/optimising-a-neural-network-for-inference/ . The kernel selection happens here: https://github.com/sonos/tract/blob/main/linalg/src/x86_64_fma.rs . It's pretty basic. The aarch64 code ( https://github.com/sonos/tract/blob/main/linalg/src/arm64.rs#L88 ), on the other hand, 1/ has many variants to choose from, so we use a small dnn model to choose the implementation (trained on measurements from several cpu variants), and 2/ since runtime feature detection on aarch64 is not stable, we do it by hand. Kernels can be found in x86_64/fma. There is a bit of language abuse: the i32 kernel should live under "avx", but right now everything is under "fma". Adding a kernel specialization is a bit of work, but we have already done this on aarch64. The main file for the current kernel is https://github.com/sonos/tract/blob/main/linalg/x86_64/fma/fma_mmm_f32_16x6.tmpl .
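For flavour, "pretty basic" runtime selection on x86_64 can look roughly like this, since std's feature detection is stable there (unlike aarch64) — an illustrative sketch, not tract's actual selection code, and the kernel names are placeholders:

```rust
#[cfg(target_arch = "x86_64")]
fn select_f32_kernel() -> &'static str {
    // std's runtime feature detection is stable on x86_64, so selection can
    // be a simple cascade. Kernel names here are placeholders.
    if is_x86_feature_detected!("fma") {
        "fma_mmm_f32_16x6"
    } else {
        "generic_f32_4x4"
    }
}
```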
-
Cherry-picking some questions here to answer for now; will try to get back with more info and some code next week as I try to take this from raw inner loops to actual usable kernels.
This is the main problem, yes. AVX-512 is only a strong perf gain if you have two FMA ports; with one port you're lucky to get equal perf. On some platforms you'll see a clear degradation in clock speed when using any AVX-512 instructions, which makes it slower than two-port AVX2.
Neither am I, really. I'm just hoping to throw enough funky stuff at the wall to make something stick... hence the smattering of kernel sizes. But I think I'm in a good enough position to start selecting and putting out kernels more efficiently. I'm half expecting the optimal AVX-512 kernels to actually be the same as the AVX2 ones when looking at register distribution (i.e. treating 16x6 AVX2 and 32x6 AVX-512 as 2x6 in register terms).
This had me stumped for a bit, but I believe what happens is that I can't saturate the FPUs anymore: I have one load for the weights, then one load per FMA... leading to one FPU essentially idling. So I think a 32x6 kernel would actually perform well here, or even 64x6. Consider the ratio of loads to FMAs: we started at 8:12, and with the AVX-512 change we're at 7:6. Re the overperformance, I'm actually guessing that since we don't measure actual cycles but rather time, it ends up being... a bit inexact. Also, I don't think AVX2 necessarily runs at base clock, which would be 48 GFlops. But I'm not sure; I haven't found any clock table that includes my CPU. I'm mostly looking for relative numbers to see whether I'm improving... I'm not going to go comparing this against BLAS or SGEMM and claim I'm better.
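To make the starting 8:12 ratio concrete, here is one inner-loop step of a 16x6 AVX2 kernel shape — an illustrative sketch, not tract's actual kernel: 2 full-width loads of A plus 6 broadcast loads of B feed 12 FMAs.

```rust
/// One inner-loop step of a 16x6 AVX2 micro-kernel (sketch): 2 column loads
/// of A + 6 broadcast loads of B against 12 FMAs, i.e. the 8:12 ratio above.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,fma")]
unsafe fn step_16x6(a: *const f32, b: *const f32, acc: &mut [std::arch::x86_64::__m256; 12]) {
    use std::arch::x86_64::*;
    // 2 loads: one 16-wide column of A split across two YMM registers.
    let a0 = _mm256_loadu_ps(a);
    let a1 = _mm256_loadu_ps(a.add(8));
    for j in 0..6 {
        // One broadcast load of a B scalar per output column (6 total)...
        let bj = _mm256_broadcast_ss(&*b.add(j));
        // ...feeding two FMAs per column (12 total).
        acc[2 * j] = _mm256_fmadd_ps(a0, bj, acc[2 * j]);
        acc[2 * j + 1] = _mm256_fmadd_ps(a1, bj, acc[2 * j + 1]);
    }
}
```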
No, valid question! For a while (due to the failed 16x6 upgrade) I was under the impression that AVX-512 ran on port 5 only, as it does on some Skylake CPUs, so this is 2x2 AVX-512 with AVX2 tails, but it actually makes the loads-to-FMA balance bad (5:6). AVX-512-2L is AVX2 but using only the extended register range, to avoid reusing registers (later I renamed this to
I haven't tried anything square-ish, so I can't say if that works better. My optimization goal is primarily to get skinny x square to be faster. Consider the ratios here, though: this is 7 loads and 10 outs. That's not such a bad ratio -- but I should probably go for 96x2, since that's just a flipped 16x6 (2x6 -> 6x2).
SKX is just the code for Skylake-X, the CPU platform I'm running on. I was trying to push this one a bit harder with AVX2, but never got anything statistically significant out of it. I think this is one of the cases where the tools aren't happy, as it's 9 loads to 8 FMAs. And there's not a lot of room to play with, since the bulk of the loads are full-register loads (i.e. not broadcasts), so tricks based on shuffling/permuting won't work.
-
Hello!
I'm digging deep into optimizing our performance when using tract, and so far I've stayed mostly on my side of the fence, treating tract code as a black box. However, I'm starting to push up against points where the tract internals dominate my profiles, and I'd love to open up a discussion about how to push tract even harder.
Given that I'm in games, the current "full" pipeline that I'm investigating is split in three parts (a loading sketch follows the list):

- Cook: `into_typed()` and then `into_decluttered()`
- Loading: `into_optimized()`, `into_decluttered()`, `into_runnable()`
- Execute
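As a rough sketch of the Loading half, mirroring the calls listed above and reading from an in-memory buffer (the exact API surface may vary by tract version; this is illustrative, not my actual code):

```rust
use tract_nnef::prelude::*;

// Sketch: load a cooked NNEF model from an in-memory buffer and prepare it
// for execution, mirroring the Loading steps listed above.
fn load(nnef_bytes: &[u8]) -> TractResult<TypedSimplePlan<TypedModel>> {
    let nnef = tract_nnef::nnef();
    nnef.model_for_read(&mut std::io::Cursor::new(nnef_bytes))?
        .into_optimized()?
        .into_decluttered()?
        .into_runnable()
}
```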
The measurements I focus on improving are Loading and Execution, since those happen while the game is running.
My findings so far are that cooking ONNX to NNEF and then using that cuts load times quite a bit. I'm very happy about this already, but sub-10ms is my goal, at least for dynamic (these categories are described below). Not much else to say so far; it looks like parsing eats a lot of time (understandably). All measurements are done with files fully loaded, to avoid measuring IO -- I'm passing a Cursor around the raw bytes buffer.
Sadly I've had less success in reducing execution times in the general case. However, I have noticed that batching has a massive impact:
The three groups in this image show three different strategies:

- `none`/`simple` means we pass one example through the model at a time
- `fixed` means we minibatch as appropriate to some fixed set of pre-baked batch sizes - [1, 2, 4, 8] in this case
- `dynamic` means we concretize for the exact number of elements (on-demand but memoized; see the sketch below)

Note that this is normalized by the actual number of examples, so 20 in this example. So a batch of 20 elements processed one-at-a-time takes about 10ms, while running that as a single batch takes about 3ms. This difference shrinks as batch sizes go down, and at a single element they line up at 500us.
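For reference, the `dynamic` strategy is roughly this shape — a sketch assuming a tract version where the batch dimension is a symbol named "N" and `concretize_dims`/`SymbolValues` have the shapes shown below (adjust to your tract version; this is not my exact code):

```rust
use std::collections::HashMap;
use tract_nnef::prelude::*;

/// Memoize one concretized, optimized plan per batch size (sketch).
struct DynamicPlans {
    decluttered: TypedModel, // batch dim left as a symbol "N" (assumption)
    by_batch: HashMap<usize, TypedSimplePlan<TypedModel>>,
}

impl DynamicPlans {
    fn plan_for(&mut self, n: usize) -> TractResult<&TypedSimplePlan<TypedModel>> {
        if !self.by_batch.contains_key(&n) {
            // Pin the symbolic batch dim to `n`, then optimize and plan.
            let sym = self.decluttered.symbol_table.sym("N");
            let plan = self
                .decluttered
                .concretize_dims(&SymbolValues::default().with(&sym, n as i64))?
                .into_optimized()?
                .into_runnable()?;
            self.by_batch.insert(n, plan);
        }
        Ok(&self.by_batch[&n])
    }
}
```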
I'm unsure if this slowness can be remediated and where it comes from, but it seems to me like there's a ~350us overhead when doing small or single-element batches. That is unfortunately also my most common case! I'm unsure right now whether this is related to something in our setup or something in tract - but I do want to find out and help shrink the gap. 💨
I'd love to know if others have similar - or other - experiences and what steps you've taken to improve things, if any! Also happy to hear if there are other metrics that would be interesting to discuss/measure, as I have a fairly extensible benchmarking setup.