
Poor performance report #251

Open · mjp41 opened this issue Sep 29, 2020 · 18 comments

@mjp41 (Member) commented Sep 29, 2020

This issue is to track the poor performance reported by @armintoepfer on Twitter when comparing to mimalloc:

> Initial testing, it's much slower than mimalloc. Do you have, similar to mimalloc, env variables to show debug summaries to ensure that your library was properly loaded? Maybe I screwed something up ;)

https://twitter.com/XLR/status/1310687339887423489

@armintoepfer, firstly, thanks for trying snmalloc. My usual approach to checking whether it has been correctly loaded is to run a profile and see if any snmalloc symbols appear in the trace. If you could run perf on a benchmarking run and share the profiling data for the snmalloc and libc symbols, we might be able to spot what is happening.

How are you loading it? LD_PRELOAD or are you linking the static library? The LD_PRELOAD approach is much better tested. The static library has only been used by the LLVM Windows patch so far.

Is the performance you are seeing similar to the system allocator? This would suggest we haven't intercepted the calls correctly.

mjp41 self-assigned this Sep 29, 2020
@armintoepfer commented:

Hey!

For our CCS binary we are ricing like hell, as every percentage point of performance is actual money saved and reduces time to result for future clinical samples. Who wouldn't like to get the results of a rare disease screen as fast as possible? Admittedly that's still a way off, but we are thinking big here :)

Here's a little bit about my toolchain:

  • We've built our own cross-compiled GCC 10, built against glibc 2.17 but linked against 2.32. This allows me to statically link the latest glibc on a CentOS 7 system, so all customers get the performance improvements no matter how ancient their system glibc is.
  • Subset of -ffast-math
  • LTO and PGO
  • -march=nehalem
  • static linking of mimalloc via LDFLAGS
  • LDFLAGS="-static"

The easiest way for me to check whether mimalloc is linked properly is to use one of its verbose env variables :)

I can deactivate PGO and then run perf.

@davidchisnall (Collaborator) commented:

If you're statically linking, then before you strip the binary, running nm and checking for snmalloc symbols will tell you whether you've pulled it in (which doesn't guarantee that it's the malloc that's actually being used, but will at least ensure that it's there).

Are you including snmalloc in the PGO / LTO bundle? Our fast path is around 15 instructions for malloc (one more when compiled with GCC than clang) and so may end up inlined if you're allocating on a hot path.

@armintoepfer commented Sep 29, 2020

> Are you including snmalloc in the PGO / LTO bundle?

Yes.

I'll test nm and let you know.

@armintoepfer commented:

Statically linked w/PGO&LTO&GLIBC2.32 on 2x7H12:

| Run | Wall time | CPU time | Peak RSS |
| --- | --- | --- | --- |
| MIMALLOC Tuned* | 4m 3s | 15h 54m | 25.9 GB |
| MIMALLOC | 4m 32s | 16h 35m | 26.6 GB |
| SNMALLOC 16MB | 5m 4s | 14h 37m | 30.6 GB |
| SNMALLOC 1MB | 5m 8s | 14h 42m | 30.7 GB |
| GLIBC 2.32 | 5m 19s | 15h 44m | 30.9 GB |

Dynamically linked w/GLIBC2.17 on 2x7742:

| Run | Wall time | CPU time | Peak RSS |
| --- | --- | --- | --- |
| MIMALLOC Tuned* | 5m 12s | 21h 25m | 30.2 GB |
| SNMALLOC 16MB | 5m 21s | 22h 4m | 29.9 GB |
| SNMALLOC 1MB | 5m 21s | 22h 4m | 30.0 GB |
| GLIBC 2.32 | 7m 38s | 16h 57m | 23.9 GB |

*MIMALLOC_PAGE_RESET=0 MIMALLOC_LARGE_OS_PAGES=1

It's obviously using snmalloc, as default malloc is really slow for 256 threads running at full throttle.

@mjp41 (Member, Author) commented Sep 29, 2020

So, given we are spending less CPU time but more wall-clock time, it suggests to me that we are spending too much time in the kernel, holding a lock or something. There is one bit of code I can think of that could be causing that:

# ifndef SNMALLOC_QEMU_WORKAROUND
    if (
      (page_aligned || is_aligned_block<page_size>(p, size)) &&
      (size > 16 * page_size))
    {
      // Only use this on large allocations, as memset is faster and doesn't
      // introduce an IPI, so is better for small allocations.
      SNMALLOC_ASSERT(is_aligned_block<page_size>(p, size));
      madvise(p, size, MADV_DONTNEED);
    }
    else
# endif
    {
      // Small (or unaligned) allocations are zeroed with memset instead.
      ::memset(p, 0, size);
    }

I wonder if, given the size of your machine, this IPI is costing too much. If you set

-DSNMALLOC_QEMU_WORKAROUND=On

then it will disable that call (badly named flag, based on a previous issue we had).

I guess the other possibility is that mimalloc has custom code for NUMA, which we don't have. That could also be a source of slowdown, but fixing that would be much harder.

P.S. Thank you for collecting and submitting the benchmark results.

@davidchisnall (Collaborator) commented:

Tuning that has been on my to-do list for a while. The 16-page minimum looks quite plausible, but on x86 in particular MADV_DONTNEED may require some IPIs, and its cost will scale as some polynomial in the number of cores, whereas the cost of memset is linear in the size of the allocation. The intersection of those is going to be complicated. I'd like to write a little microbenchmark that figures out the inflection point for a given processor, which we could (optionally?) run from CMake and use on a bunch of different machines to pick a number based on the number of cores when we don't have a magic number for the current machine.
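A rough sketch of what such a microbenchmark might look like (purely illustrative, not part of snmalloc; the page sizes and single-threaded setup are simplifying assumptions — a realistic version would also touch the mapping from several threads so that MADV_DONTNEED actually triggers TLB-shootdown IPIs):

```cpp
// Sketch: compare the cost of zeroing via memset vs madvise(MADV_DONTNEED)
// for buffers of increasing size, to estimate a per-machine crossover point.
#include <chrono>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main()
{
  constexpr size_t page = 4096;
  for (size_t pages = 1; pages <= 4096; pages *= 2)
  {
    size_t size = pages * page;
    void* p = mmap(
      nullptr, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
      return 1;

    std::memset(p, 1, size); // fault the pages in so both timings are comparable

    auto t0 = std::chrono::steady_clock::now();
    std::memset(p, 0, size); // zero by writing
    auto t1 = std::chrono::steady_clock::now();
    madvise(p, size, MADV_DONTNEED); // zero by dropping the pages
    auto t2 = std::chrono::steady_clock::now();

    auto ns = [](auto d) {
      return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
    };
    std::printf("%5zu pages: memset %9lld ns, madvise %9lld ns\n",
                pages, (long long)ns(t1 - t0), (long long)ns(t2 - t1));
    munmap(p, size);
  }
}
```

The crossover that something like this reports could then replace the fixed 16-page threshold, per machine or per core count.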

@davidchisnall (Collaborator) commented:

Is the CPU time just user, or user + system?

In the dynamically linked version, we're using a bit more than 3% more CPU than mimalloc and a bit less than 3% more wall-clock time. There's a much bigger difference in the PGO + LTO version. I wonder if we're seeing our fast path being inlined more than it should be and pushing more things out of i-cache? It's surprising that we'd use so much less (8%) CPU time but a lot more (25%!) wall-clock time; as @mjp41 says, that implies we're holding a kernel lock somewhere and just not running for a bit.

@mjp41 (Member, Author) commented Sep 29, 2020

@armintoepfer if you have page fault statistics available, those might also suggest what is going on. Are transparent huge pages enabled on the machine? I think the following should tell you:

cat /sys/kernel/mm/transparent_hugepage/enabled

@armintoepfer commented Sep 29, 2020

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

The following are all static PGO/LTO builds on 2x 7H12 systems.

| Metric | SN 16 QEMU=On | SN 16 | MI Tuned | MI default |
| --- | --- | --- | --- | --- |
| User time (seconds) | 53000.29 | 52938.56 | 57330.39 | 58802.21 |
| System time (seconds) | 339.14 | 357.15 | 63.45 | 990.52 |
| Percent of CPU this job got | 16850% | 16737% | 21999% | 20509% |
| Elapsed (wall clock) time (h:mm:ss or m:ss) | 5:16.53 | 5:18.42 | 4:20.88 | 4:51.53 |
| Average shared text size (kbytes) | 0 | 0 | 0 | 0 |
| Average unshared data size (kbytes) | 0 | 0 | 0 | 0 |
| Average stack size (kbytes) | 0 | 0 | 0 | 0 |
| Average total size (kbytes) | 0 | 0 | 0 | 0 |
| Maximum resident set size (kbytes) | 30652060 | 31385804 | 26211960 | 28317732 |
| Average resident set size (kbytes) | 0 | 0 | 0 | 0 |
| Major (requiring I/O) page faults | 1 | 1 | 1 | 1 |
| Minor (reclaiming a frame) page faults | 22091749 | 21991651 | 487583 | 20438894 |
| Voluntary context switches | 13798271 | 13884004 | 1468104 | 1773052 |
| Involuntary context switches | 868023 | 888361 | 484109 | 718255 |
| Swaps | 0 | 0 | 0 | 0 |
| File system inputs | 23938776 | 23938776 | 23938920 | 23938920 |
| File system outputs | 1842072 | 1842096 | 1842232 | 1842176 |
| Socket messages sent | 0 | 0 | 0 | 0 |
| Socket messages received | 0 | 0 | 0 | 0 |
| Signals delivered | 0 | 0 | 0 | 0 |
| Page size (bytes) | 4096 | 4096 | 4096 | 4096 |
| Exit status | 0 | 0 | 0 | 0 |

@mjp41 (Member, Author) commented Sep 30, 2020

@armintoepfer thanks for this information. I have not seen this level of difference in page faults before. I will run some benchmarks with the MIMALLOC_LARGE_OS_PAGES=1 setting to try to dig into this.

Reading the mimalloc code, they are grabbing memory with MAP_HUGETLB at 1GiB granularity, falling back to 2MiB if that fails, and then falling back to transparent huge pages (THP) if that does not work either. Whereas snmalloc just assumes THP will do a good job. I will see what I can implement to improve this.

@armintoepfer, thanks for taking the time to try snmalloc and for providing us with such useful feedback.
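For reference, a minimal sketch of that fallback chain (my reading of the strategy, not mimalloc's or snmalloc's actual code; explicit MAP_HUGETLB mappings also need huge pages to have been reserved by the administrator, which is why the chain ends with a THP hint):

```cpp
// Sketch: try explicit 1 GiB huge pages, then 2 MiB, then a normal mapping
// with an madvise(MADV_HUGEPAGE) hint so THP can back it if enabled.
#include <cstddef>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#  define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#  define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#endif
#ifndef MAP_HUGE_1GB
#  define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

void* reserve_with_huge_pages(size_t size)
{
  int prot = PROT_READ | PROT_WRITE;
  int flags = MAP_PRIVATE | MAP_ANONYMOUS;

  void* p = mmap(nullptr, size, prot, flags | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
  if (p == MAP_FAILED)
    p = mmap(nullptr, size, prot, flags | MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
  if (p == MAP_FAILED)
  {
    p = mmap(nullptr, size, prot, flags, -1, 0);
    if (p != MAP_FAILED)
      madvise(p, size, MADV_HUGEPAGE); // fall back to transparent huge pages
  }
  return p == MAP_FAILED ? nullptr : p;
}
```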

@armintoepfer commented:

Sure thing. But even without MIMALLOC_LARGE_OS_PAGES, mimalloc is much faster than snmalloc by default. So there is something beyond pages that is not working out in my use case. Ping me in case you have something to benchmark.

@mjp41 (Member, Author) commented Sep 30, 2020

@armintoepfer that is a very good point. It would be interesting to know the page fault numbers for the non-tuned version.

@armintoepfer commented:

Updated.

@nwf (Contributor) commented Sep 30, 2020

The things that leap out to me are (and perhaps these are obvious to other observers; if so, sorry for the spam):

  • Why is snmalloc doing an order of magnitude more voluntary context switches than mimalloc? snmalloc doesn't make that many system calls; are we perhaps contending on the address space lock in the kernel somehow?

  • It looks like mimalloc can tune itself down to experiencing only 2% of the minor page faults experienced by the other configurations. What's going on there?

@mjp41 (Member, Author) commented Oct 1, 2020

@nwf yeah, I think mimalloc is probably using 1GiB huge pages in that configuration, so that is why there are very few page faults when it is tuned.

With regard to the context switches, that is a good question. I have found that on some microbenchmarks, if I put in a large thread count, then I can replicate the statistics that @armintoepfer has. But it is unclear to me if that is coincidence or actually replicating the issue.

@armintoepfer are you using the same number of OS threads as hardware threads in your application?

One concern I have: there are a few places where we use spinning on infrequently accessed data structures, and when we get to 256 threads I am concerned this might no longer be the correct choice. I have created a branch that monitors the time spent in these bits:

https://github.com/microsoft/snmalloc/tree/mini_stats

@armintoepfer if you would be able to run this and send me the output, that would be amazing. It should print something like:

MPMC Push:
  Times: 2
  Time spend waiting: 1338 cycles
  Max wait: 1192 cycles
MPMC Push:
  Times: 4
  Time spend waiting: 1176 cycles
  Max wait: 382 cycles
MPMC Pop:
  Times: 7
  Time spend waiting: 3748 cycles
  Max wait: 1128 cycles
MPMC Pop:
  Times: 2
  Time spend waiting: 556 cycles
  Max wait: 368 cycles
FlagLock:
  Times: 11
  Time spend waiting: 948 cycles
  Max wait: 248 cycles

But the numbers will be much bigger on your machine.
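For context, the counters behind those summaries look roughly like this (a hypothetical x86 sketch, not the actual mini_stats code; the name InstrumentedFlagLock is made up):

```cpp
// Sketch: a spin lock that records how many acquisitions had to wait, the
// total cycles spent spinning, and the worst single wait.
#include <atomic>
#include <cstdint>
#include <x86intrin.h> // __rdtsc, _mm_pause

struct InstrumentedFlagLock
{
  std::atomic<bool> locked{false};
  std::atomic<uint64_t> times{0};      // acquisitions that had to wait
  std::atomic<uint64_t> total_wait{0}; // cycles spent spinning
  std::atomic<uint64_t> max_wait{0};   // worst single wait

  void lock()
  {
    if (!locked.exchange(true, std::memory_order_acquire))
      return; // uncontended fast path: no stats update

    uint64_t start = __rdtsc();
    while (locked.exchange(true, std::memory_order_acquire))
      _mm_pause(); // spin politely
    uint64_t waited = __rdtsc() - start;

    times.fetch_add(1, std::memory_order_relaxed);
    total_wait.fetch_add(waited, std::memory_order_relaxed);
    uint64_t prev = max_wait.load(std::memory_order_relaxed);
    while (prev < waited &&
           !max_wait.compare_exchange_weak(prev, waited, std::memory_order_relaxed))
    {
    }
  }

  void unlock()
  {
    locked.store(false, std::memory_order_release);
  }
};
```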

Second, @armintoepfer, for splitting the work up, do you uniformly split it at the start across the threads and then wait for them all to terminate, or do you have a more work-stealing-like approach?

@armintoepfer commented:

> Are you using the same number of OS threads as hardware threads in your application?

Correct. I'm using the same number of compute-heavy threads as I have (logical) cores available; logical because of hyper-threading. I have an additional 2x #cores threads for very lightweight output writing.

> For splitting the work up, do you uniformly split it at the start across the threads and then wait for them all to terminate, or do you have a more work-stealing-like approach?

It's a custom threadpool that I call FireAndForget. There is one producer, one orchestrator, and N workers. The orchestrator distributes work such that every worker is always busy. The "Forget" part is that there is no feedback from the workers back to the orchestrator. Each worker writes output on its own and stores additional metadata locally, which is merged right before joining the threads. This leads to having no bottleneck, except getting enough data to the workers.
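If it helps to see the shape of that, here is a minimal sketch of such a pool (purely illustrative, not the actual CCS code; the name FireAndForgetPool is made up):

```cpp
// Sketch: one side submits jobs, N workers drain them, and no results flow
// back; per-worker state would be merged only after join().
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class FireAndForgetPool
{
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::function<void()>> work;
  bool done = false;
  std::vector<std::thread> workers;

public:
  explicit FireAndForgetPool(size_t n)
  {
    for (size_t i = 0; i < n; i++)
      workers.emplace_back([this] {
        for (;;)
        {
          std::function<void()> job;
          {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return done || !work.empty(); });
            if (work.empty())
              return; // done and drained
            job = std::move(work.front());
            work.pop_front();
          }
          job(); // no result reported back: fire and forget
        }
      });
  }

  void submit(std::function<void()> job)
  {
    { std::lock_guard<std::mutex> lock(m); work.push_back(std::move(job)); }
    cv.notify_one();
  }

  void join()
  {
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();
    for (auto& t : workers) t.join();
    // per-worker metadata would be merged here
  }
};
```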

> If you would be able to run this and send me the output, that would be amazing.

I did, but I don't get any output from snmalloc using static linkage with PGO/LTO.
Rebuilt and linked dynamically via LD_PRELOAD and libsnmallocshim-16mib.so:

MPMC Push:
  Times: 2080
  Time spend waiting: 344006 cycles
  Max wait: 4056 cycles
MPMC Push:
  Times: 3211647
  Time spend waiting: 1227836194 cycles
  Max wait: 530608 cycles
MPMC Pop:
  Times: 3225056
  Time spend waiting: 2753547577 cycles
  Max wait: 68954808 cycles
MPMC Pop:
  Times: 2081
  Time spend waiting: 315796 cycles
  Max wait: 3172 cycles
FlagLock:
  Times: 20543
  Time spend waiting: 4008420 cycles
  Max wait: 748410 cycles

@mjp41 (Member, Author) commented Oct 1, 2020

@armintoepfer thanks for running this. While it does suggest there is an opportunity for optimisation in one case, it is nowhere near enough to account for the difference.

I will try to work out why the printing does not work for the LTO version, as that was the one that showed the difference in a considerably more pronounced way.

@mjp41 (Member, Author) commented Oct 8, 2020

@armintoepfer in the LTO/PGO case, how are you compiling snmalloc? Are you including malloc.cc and new.cc and adding them to your build? Or are you building the .a and linking that in? Are you adding some options to get it to use LTO?
