Implement slab level cache for remote frees #634

mjp41 · 2023-09-14T20:13:16Z

@nwf-msr observed that we could improve the performance of remote deallocation if the producer did more work building lists for each slab before returning objects to the original allocator. This could further improve producer/consumer scenarios.

Tasks

- Implement statistics to estimate performance wins.
- Implement a version that does not support snmalloc hardening.
- Implement a version that works with hardening as well.

mjp41 · 2023-09-14T20:15:06Z

@Licenser @darach are you still using snmalloc? If so, do you have any benchmarks that represent your workload? We have some ideas that would benefit your kind of workload for Tremor.

darach · 2023-09-14T20:41:25Z

Hi Matt, we sure are: in Tremor, and I also have a project at work (Axiom now) where I think it's a great fit. We'll ask the community as well. Cheers, Darach.

mjp41 · 2023-09-14T20:57:52Z

Awesome, thanks. Any benchmarks we can run would really help us justify the engineering work.

darach · 2023-09-14T21:23:34Z

Our CI benchmarks are snmalloc based: https://www.tremor.rs/benchmarks/ (relatively boring UI there).
The benchmarks we ran a year or two ago for Tremor (and for mimalloc before it, and jemalloc before that) are here:
https://github.com/tremor-rs/tremor-runtime/blob/main/bench/README.md

The benchmarks themselves have changed a fair bit since then, though, and we rewrote the runtime and connectors,
so the benchmarking code works differently now. YMMV.

- Trace "Handling remote" once per batch, rather than per element
- Remote queue events also log the associated metaslab; we'll use this to assess the efficacy of microsoft#634

Approximate a message-passing application as a set of producers, a set of consumers, and a set of proxies that do both. We'll use this for some initial insight for microsoft#634 but it seems worth having in general.

nwf-msr · 2023-09-25T17:12:41Z

I've spent a while playing with various cache strategies (though nothing very sophisticated, just sort of seeing what the low-hanging fruit was like). For the two workloads I've tried, for three interesting choices of hashes, I currently have these message counts:

| Workload | No caching | Perfect assembly | 4-way direct hash |
| --- | --- | --- | --- |
| `msgpass` | 7,205,380 | 552,781; 35 rings | 1,076,843 |
| `xmalloc-test` | 2,388,551 | 317,057; 5 rings | 317,065 |

The hash here is inspired by https://github.com/skeeto/hash-prospector but is sort of "a third" of that:

  hash = slab;
  hash *= 0x7feb352d;
  return (hash >> 16) & 3;

I'm sure it's possible to do better, but that seems to work alright, though it's not sensitive to the upper bits of the slab! Of note, however, changing the shift to `hash >> 30` or performing a prospector-esque xor-shift prior to multiplication performs significantly worse on `msgpass` (and a little worse on `xmalloc-test`). The full prospector hash also does a little worse than the numbers above and is, obviously, a fair bit more expensive.
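For anyone who wants to poke at this, here is a minimal C++ sketch of the hash above driving a 4-slot direct-mapped cache, counting messages with and without caching. Everything except the three lines of the hash is an assumption made for illustration: the names `slab_hash`, `SimResult` and `simulate` are hypothetical, slabs are modelled as plain integers, and the eviction policy (flush the pending batch as one message when a different slab lands in the same slot) is a guess at what "4-way direct hash" means, not the actual snmalloc code or the experiment behind the table.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// The hash from the comment above: one multiply, then two bits from the
// middle of the product. Only the low 18 bits of `slab` can influence the
// result, which is why it ignores the upper bits of the slab.
inline uint32_t slab_hash(uint64_t slab)
{
  uint64_t hash = slab;
  hash *= 0x7feb352d;
  return static_cast<uint32_t>((hash >> 16) & 3);
}

// Hypothetical result type: messages sent without and with the cache.
struct SimResult
{
  size_t uncached;
  size_t cached;
};

// Replays a trace of remote frees, each identified by its slab.
// Assumed model: without caching every free is its own message; with the
// cache, frees to the slab currently held in a slot join its pending batch,
// and a batch becomes one message when it is evicted or flushed at the end.
inline SimResult simulate(const std::vector<uint64_t>& frees)
{
  std::array<uint64_t, 4> slot_slab{};
  std::array<bool, 4> slot_used{};
  size_t messages = 0;

  for (uint64_t slab : frees)
  {
    uint32_t i = slab_hash(slab);
    if (slot_used[i] && slot_slab[i] == slab)
      continue; // Same slab: extend the pending batch, no message yet.
    if (slot_used[i])
      messages++; // Different slab: send the evicted batch.
    slot_slab[i] = slab;
    slot_used[i] = true;
  }

  // Final flush: one message per occupied slot.
  for (bool used : slot_used)
    if (used)
      messages++;

  return {frees.size(), messages};
}
```

Under this model, a trace such as `{1, 1, 1, 2, 2, 1}` costs 6 messages uncached and 2 cached, because slabs 1 and 2 land in different slots. A trace that alternates between two slabs sharing a slot, such as `{0, 4, 0, 4}`, gains nothing from the cache, which is the kind of thrashing a better hash or more ways would need to avoid.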

While working on microsoft#634, it's useful to be able to simulate caching policies without having to write all the C++ to actually run them. Here's a terrible little Perl script that can probably do most of what you might want.