Poor performance report #251
Hey! For our CCS binary we are ricing like hell, as every percentage point of performance is actual money saved and reduces the time to result for future clinical samples. Who wouldn't like to get the results of a rare-disease screen as fast as possible? Admittedly that's still a way off, but we are thinking big here :) Here's a little bit about my toolchain:
The easiest way for me to check whether mimalloc is linked properly is to use one of its verbose env variables :) I can deactivate PGO and then run perf.
If you're statically linking then, before you strip the binary, you can check whether the snmalloc symbols are actually present. Are you including snmalloc in the PGO / LTO bundle? Our fast path is around 15 instructions for malloc (one more when compiled with GCC than with clang) and so may end up inlined if you're allocating on a hot path.
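A minimal sketch of that check (the binary name `ccs` is a placeholder for your build output; with aggressive LTO the snmalloc symbols may have been inlined away, so an empty result is not conclusive on its own):

```sh
# Run on the un-stripped binary: demangled snmalloc symbols showing up
# here means the library made it into the link.
nm -C --defined-only ccs | grep -i snmalloc | head
```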
Yes, I'll test.
Statically linked w/ PGO+LTO & GLIBC 2.32 on 2x 7H12:
Dynamically linked w/ GLIBC 2.17 on 2x 7742:
* It's obviously using snmalloc, as default malloc is really slow for 256 threads running at full throttle.
So given we are spending less CPU time but more wall-clock time, it suggests to me that we are spending too much time in the kernel, holding a lock or something. There is one bit of code I can think of that could be causing that: lines 44 to 55 in 923705e.
I wonder if, given the size of your machine, this IPI is costing too much. If you set it (a badly named flag, based on a previous issue we had), that call will be disabled. I guess the other possibility is that mimalloc has custom code for NUMA, which we don't have. That could also be a source of slow-down, but fixing that would be much harder.
P.S. Thank you for collecting and submitting the benchmark results.
Tuning that has been on my to-do list for a while. The 16-page minimum looks quite plausible, but on x86 in particular it may require some IPIs, and the cost of those scales with the size of the machine.
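One way to see whether those IPIs actually spike during a run (assuming an x86 Linux host, which exposes per-CPU IPI counters in /proc/interrupts; the binary name and arguments below are placeholders):

```sh
# "CAL" counts function-call IPIs, "TLB" counts remote TLB shootdowns.
# A large delta across the benchmark points at the madvise path above.
grep -E 'CAL|TLB' /proc/interrupts > ipi_before.txt
./ccs <args...>
grep -E 'CAL|TLB' /proc/interrupts > ipi_after.txt
diff ipi_before.txt ipi_after.txt
```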
Is the CPU time just user, or user + system? In the dynamically linked version, we're using a bit more than 3% more CPU than mimalloc and a bit less than 3% more wall-clock time. There's a much bigger difference in the PGO + LTO version. I wonder if we're seeing our fast path being inlined more than it should be and pushing more things out of i-cache? It's surprising that we'd use so much less (8%) CPU time but a lot more (25%!) wall-clock time; as @mjp41 says, that implies that we're holding a kernel lock somewhere and just not running for a bit.
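If it's easier to grab, GNU time reports the two separately; a sketch (the binary name and arguments are placeholders):

```sh
# /usr/bin/time is the GNU time binary, not the shell builtin.
/usr/bin/time -v ./ccs <args...>
# Compare the "User time (seconds)", "System time (seconds)" and
# "Elapsed (wall clock) time" lines.
```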
@armintoepfer, if you have page-fault statistics available, those might also suggest what is going on. Do you have transparent huge pages switched on on that machine? I think the following should tell you whether you do:
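A minimal check, assuming a typical Linux setup:

```sh
# The bracketed entry is the active setting: [always], [madvise] or [never].
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
```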
The following are all static PGO/LTO builds on 2x 7H12 systems.
@armintoepfer thanks for this information. I have not seen this level of difference in page faults before. I will run some benchmarks with the same settings. Reading the mimalloc code, they are grabbing huge pages. @armintoepfer, thanks for taking the time to try snmalloc and for providing us such useful feedback.
Sure thing. But even without that tuning, mimalloc still comes out ahead.
@armintoepfer that is a very good point. It would be interesting to know the page-fault number for the non-tuned version.
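If it helps, perf can report the relevant counters directly for a single run, so the tuned and non-tuned configurations can be compared like for like (binary name and arguments are placeholders):

```sh
perf stat -e page-faults,minor-faults,major-faults,context-switches \
    ./ccs <args...>
```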
Updated.
The things that leap out to me are (and perhaps these are obvious to other observers; if so, sorry for the spam):
@nwf yeah, I think mimalloc is probably using 1GiB huge pages in that configuration, so that is why there are very few page faults when it is tuned. With regard to the context switches, that is a good question. I have found that on some microbenchmarks, if I use a large thread count, I can replicate the statistics that @armintoepfer has, but it is unclear to me whether that is coincidence or actually replicating the issue.

@armintoepfer, are you using the same number of OS threads as hardware threads in your application? One concern I have is that there are a few places where we spin on infrequently accessed data structures, and when we get to 256 threads I am worried this might no longer be the correct choice. I have created a branch that monitors the time spent in these bits: https://github.com/microsoft/snmalloc/tree/mini_stats

@armintoepfer, if you would be able to run this and send me the output, that would be amazing. It should print something like:
But the numbers will be much bigger on your machine.

Second, @armintoepfer, for splitting the work up, do you split it uniformly across the threads at the start and then wait for them all to terminate, or do you have a more work-stealing-like approach?
Correct. I'm using the same number of compute-heavy threads as I have (logical) cores available; logical because of hyper-threading. I have an additional 2x #cores threads for very lightweight output writing.
It's a custom threadpool that I call
I did, but I don't get any output from snmalloc using static linkage with PGO/LTO.
@armintoepfer Thanks for running this. While it does suggest there is opportunity for optimisation in one case, it is nowhere near enough to account for the difference. I will try to work out why the printing does not work for the LTO version, as that was the one that showed the difference in a considerably more pronounced way.
@armintoepfer in the LTO/PGO case, how are you compiling snmalloc? Are you including malloc.cc and new.cc and adding them to your build? Or are you building the static library and linking against that?
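For reference, a sketch of the first option, compiling the override translation units straight into the application's LTO+PGO build (paths and flags are illustrative only, assuming a snmalloc checkout under `snmalloc/` and a clang profile at `ccs.profdata`; adjust to your layout and toolchain):

```sh
CXXFLAGS="-O3 -std=c++17 -mcx16 -flto=thin -fprofile-use=ccs.profdata"
clang++ $CXXFLAGS -Isnmalloc/src -c snmalloc/src/override/malloc.cc -o snmalloc_malloc.o
clang++ $CXXFLAGS -Isnmalloc/src -c snmalloc/src/override/new.cc -o snmalloc_new.o
# Link snmalloc_malloc.o and snmalloc_new.o with the rest of the
# application so malloc/free and new/delete resolve to snmalloc.
```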
This issue is to track the poor performance reported by @armintoepfer on Twitter when comparing snmalloc to mimalloc.
https://twitter.com/XLR/status/1310687339887423489
@armintoepfer, firstly thanks for trying snmalloc. My normal approach to checking whether it has been correctly loaded is to run a profile and see if any snmalloc symbols appear in the trace. If you could run `perf` on a benchmarking run and share any profiling data about the snmalloc and libc symbols, then we might be able to spot what is happening.

How are you loading it? LD_PRELOAD, or are you linking the static library? The LD_PRELOAD approach is much better tested; the static library has only been used by the LLVM Windows patch so far.
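For the LD_PRELOAD route, something like the following (the library path, binary name and arguments are placeholders; `libsnmallocshim.so` is the shared library the snmalloc CMake build produces):

```sh
LD_PRELOAD=/path/to/libsnmallocshim.so perf record -g -- ./ccs <args...>
perf report --stdio | grep -iE 'snmalloc|malloc' | head
```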
Is the performance you are seeing similar to the system allocator? If so, that would suggest we haven't intercepted the calls correctly.