
Deserialization Performance Issue - with sample repo #1142

Open

WernerMairl opened this issue Mar 26, 2024 · 4 comments

Comments

@WernerMairl

Hello again ;-)

Yes, I'm aware of #669 and my personal conclusion there, but I'd like to investigate more deeply, and maybe someone can help!

Issue: the CPU is not fully utilized on high-core machines.

I have created a small (hopefully small enough) repository (net8.0) that provides a simple use case with basic measuring, as a playground for everyone!

The proto definition used is a real-world scenario, coming from the OpenStreetMap PBF file format.
Part of the implementation is also "stolen" from the OsmSharp project (MIT-licensed).

Sample Repository: https://github.com/WernerMairl/protobuf-net-concurrency

[image attachment from the original issue]

Expectations

Using 8 threads/tasks in parallel, an overall duration of less than 2,000 ms should be possible (compared with the 4,950 ms measured with one thread).

I cannot understand why we see a rate of 242 deserializations per second in a single-thread scenario, but only 50 deserializations per second (per thread) in an 8-thread scenario.

Allowing for some overhead, I would expect a rate of around 180-200 for each of the 8 threads!

Questions

  • wrong expectations?
  • invalid test?
  • bugs in the implementation?

Any help is welcome to improve this.
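For anyone skimming without cloning the repo, the measurement pattern described above (N workers, each deserializing the same payload in a loop, counting deserializations per second) could be sketched roughly as below. The `Blob` contract, the payload size, and the loop counts are illustrative placeholders, not the sample repo's actual code; the protobuf-net NuGet package is assumed:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using ProtoBuf;

[ProtoContract]
public class Blob
{
    [ProtoMember(1)] public byte[] Payload { get; set; }
}

public static class Bench
{
    // Total deserializations per second across all workers combined.
    public static double MeasureRate(byte[] serialized, int workers, int iterations)
    {
        var sw = Stopwatch.StartNew();
        var tasks = new Task[workers];
        for (int w = 0; w < workers; w++)
        {
            tasks[w] = Task.Run(() =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    using var ms = new MemoryStream(serialized);
                    _ = Serializer.Deserialize<Blob>(ms); // fresh object graph every call
                }
            });
        }
        Task.WaitAll(tasks);
        return (double)workers * iterations / sw.Elapsed.TotalSeconds;
    }

    public static void Main()
    {
        using var ms = new MemoryStream();
        Serializer.Serialize(ms, new Blob { Payload = new byte[256 * 1024] });
        var data = ms.ToArray();
        Console.WriteLine($"1 worker : {MeasureRate(data, 1, 50):F0}/s");
        Console.WriteLine($"8 workers: {MeasureRate(data, 8, 50):F0}/s");
    }
}
```

With perfect scaling, the 8-worker total rate would be about 8× the single-worker rate; the gap between those two numbers is exactly what this issue is about.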

@ewgdg

ewgdg commented Mar 28, 2024

I generated a benchmark report for your sample project using BenchmarkDotNet, as detailed below.

The benchmark results indicate that the program scales effectively with concurrency when the sample size is 10. However, scalability issues arise as the sample size increases.

I think this might be because larger sample sizes create large objects, possibly landing on the large object heap (LOH).

Also, with a large sample size like 4000, more memory is allocated as concurrency increases, which suggests memory inefficiency in multi-threaded runs.
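The LOH effect is easy to confirm in isolation: in .NET, any object of 85,000 bytes or more is allocated directly on the large object heap, which is logically part of Gen2 and is collected only during a Gen2 GC (and by default never compacted). A minimal check:

```csharp
using System;

var small = new byte[80_000]; // below the 85,000-byte LOH threshold
var large = new byte[90_000]; // at/above the threshold: goes straight to the LOH

Console.WriteLine(GC.GetGeneration(small)); // 0 (freshly allocated on the small object heap)
Console.WriteLine(GC.GetGeneration(large)); // 2 (the LOH is reported as generation 2)
```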

Based on these observations, my hypothesis is that this is related to the overhead of handling large objects with ArrayPool.Shared in a multi-threaded context, as explored in this article.

ArrayPool.Shared relies on thread-local storage for its implementation. This design choice implies that high levels of concurrency can increase the likelihood of object allocations, which can be costly if they are large objects. This explains the observed correlation between increased concurrency and memory allocation.
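The thread-affinity of `ArrayPool<T>.Shared` can be observed directly: a buffer returned on a thread lands in that thread's local cache first, so a same-size rent on the same thread gets the identical instance back, while other threads must fall back to the shared per-core buckets or allocate. A small sketch (the `ReferenceEquals` result reflects how current runtimes behave, not a documented guarantee):

```csharp
using System;
using System.Buffers;

var pool = ArrayPool<byte>.Shared;

var first = pool.Rent(64 * 1024);
pool.Return(first);

// Same thread, same bucket size: served from the thread-local cache.
var second = pool.Rent(64 * 1024);
Console.WriteLine(ReferenceEquals(first, second)); // True on current runtimes

pool.Return(second);
```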

Also, the SharedArrayPool registers a Gen2GcCallback. Under memory pressure, this callback releases all thread-local cached arrays, even though you actually want to reuse them while your program continuously deserializes data in a loop. This further degrades performance in the high-concurrency case, since there are more thread-local objects to drop and re-allocate.
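One general mitigation at the .NET level is a dedicated pool from `ArrayPool<T>.Create`: it uses plain locked buckets shared by all threads and does not register the Gen2 trimming callback, so long-running workers keep their large buffers. (Hedged: protobuf-net's internal use of `ArrayPool<byte>.Shared` is not externally swappable as far as this thread establishes, so this applies to your own buffer management around the library, not inside it.)

```csharp
using System;
using System.Buffers;

// A private pool: no thread-local storage, no Gen2 trim callback.
// maxArrayLength / maxArraysPerBucket should be sized for the workload.
var pool = ArrayPool<byte>.Create(maxArrayLength: 1 << 24, maxArraysPerBucket: 64);

var buffer = pool.Rent(1 << 20); // 1 MB, well above the LOH threshold
try
{
    // ... fill and consume the buffer ...
}
finally
{
    pool.Return(buffer); // returned buffers stay pooled until the process exits
}
```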



BenchmarkDotNet v0.13.12, macOS Sonoma 14.4 (23E214) [Darwin 23.4.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 8.0.101
  [Host]   : .NET 8.0.1 (8.0.123.58001), Arm64 RyuJIT AdvSIMD
  ShortRun : .NET 8.0.1 (8.0.123.58001), Arm64 RyuJIT AdvSIMD

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

| Method | SerializedSample | Concurrency | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|--------|-----------------:|------------:|-----:|------:|-------:|-----:|-----:|-----:|----------:|
| Runner | 10 | 1 | 1,833.6 ms | 230.96 ms | 12.66 ms | 288000.0000 | 6000.0000 | - | 2.25 GB |
| Runner | 10 | 2 | 1,006.2 ms | 197.33 ms | 10.82 ms | 293000.0000 | 12000.0000 | 1000.0000 | 2.25 GB |
| Runner | 10 | 3 | 775.6 ms | 128.97 ms | 7.07 ms | 294000.0000 | 18000.0000 | 1000.0000 | 2.25 GB |
| Runner | 10 | 4 | 614.9 ms | 210.45 ms | 11.54 ms | 294000.0000 | 23000.0000 | 1000.0000 | 2.25 GB |
| Runner | 10 | 8 | 399.4 ms | 166.71 ms | 9.14 ms | 294000.0000 | 42000.0000 | 1000.0000 | 2.25 GB |
| Runner | 500 | 1 | 1,260.5 ms | 54.52 ms | 2.99 ms | 189000.0000 | 1000.0000 | - | 1.48 GB |
| Runner | 500 | 2 | 775.5 ms | 303.83 ms | 16.65 ms | 195000.0000 | 26000.0000 | 23000.0000 | 1.48 GB |
| Runner | 500 | 3 | 702.9 ms | 161.65 ms | 8.86 ms | 199000.0000 | 81000.0000 | 78000.0000 | 1.5 GB |
| Runner | 500 | 4 | 660.6 ms | 118.04 ms | 6.47 ms | 202000.0000 | 86000.0000 | 83000.0000 | 1.52 GB |
| Runner | 500 | 8 | 760.5 ms | 91.39 ms | 5.01 ms | 213000.0000 | 105000.0000 | 102000.0000 | 1.57 GB |
| Runner | 1000 | 1 | 1,349.7 ms | 202.89 ms | 11.12 ms | 190000.0000 | 66000.0000 | 63000.0000 | 1.49 GB |
| Runner | 1000 | 2 | 912.8 ms | 582.44 ms | 31.93 ms | 199000.0000 | 86000.0000 | 83000.0000 | 1.51 GB |
| Runner | 1000 | 3 | 803.3 ms | 284.16 ms | 15.58 ms | 203000.0000 | 93000.0000 | 90000.0000 | 1.54 GB |
| Runner | 1000 | 4 | 826.8 ms | 455.74 ms | 24.98 ms | 208000.0000 | 103000.0000 | 100000.0000 | 1.56 GB |
| Runner | 1000 | 8 | 1,061.0 ms | 20.52 ms | 1.12 ms | 222000.0000 | 106000.0000 | 105000.0000 | 1.66 GB |
| Runner | 4000 | 1 | 1,992.6 ms | 28.40 ms | 1.56 ms | 257000.0000 | 203000.0000 | 144000.0000 | 1.59 GB |
| Runner | 4000 | 2 | 2,287.4 ms | 1,042.69 ms | 57.15 ms | 750000.0000 | 747000.0000 | 127000.0000 | 1.68 GB |
| Runner | 4000 | 3 | 2,167.8 ms | 848.60 ms | 46.51 ms | 875000.0000 | 871000.0000 | 160000.0000 | 1.89 GB |
| Runner | 4000 | 4 | 1,848.8 ms | 1,515.13 ms | 83.05 ms | 579000.0000 | 539000.0000 | 148000.0000 | 1.99 GB |
| Runner | 4000 | 8 | 1,844.3 ms | 218.61 ms | 11.98 ms | 229000.0000 | 116000.0000 | 114000.0000 | 2.27 GB |
| Runner | 8000 | 1 | 2,790.2 ms | 2,518.96 ms | 138.07 ms | 318000.0000 | 284000.0000 | 155000.0000 | 1.73 GB |
| Runner | 8000 | 2 | 1,985.8 ms | 2,056.98 ms | 112.75 ms | 401000.0000 | 344000.0000 | 123000.0000 | 1.89 GB |
| Runner | 8000 | 3 | 1,558.9 ms | 828.29 ms | 45.40 ms | 248000.0000 | 180000.0000 | 112000.0000 | 2.06 GB |
| Runner | 8000 | 4 | 1,650.1 ms | 241.14 ms | 13.22 ms | 196000.0000 | 100000.0000 | 99000.0000 | 2.16 GB |
| Runner | 8000 | 8 | 1,918.2 ms | 1,543.50 ms | 84.60 ms | 171000.0000 | 86000.0000 | 85000.0000 | 2.48 GB |

@WernerMairl
Author

Thank you,
basically I arrive at much the same conclusion after every restart of my investigations.

Important to know: the sizing of the sample was chosen by its inventors for (file) storage efficiency, not for memory or deserialization efficiency. Anyway, I'm using the same "Apple Silicon" as you, and it is frustrating to see that all the available power (which is fantastic) is not usable :-(

So maybe my next round of learning and improvement should go into the area of "efficient object creation"...

BR Werner

@WernerMairl
Author

In the past I was in doubt whether the root cause here might lie in the area of vCPUs, hyperthreading, CPU caches, the CPU-to-memory connection, etc.

A new experiment confirmed that the issue is caused by .NET memory usage inside the current process.

What I did:

I reconfigured my sample/work so that only one thread is used; it then takes roughly 80 seconds to do the work.
Then I split the work up into multiple PROCESSES (each with one thread)... and see: it scales!

I can see CPU usage > 90 %, and I see faster execution of the overall work!

Yes, there is some overhead, but with an unoptimized duration of 80 seconds the overhead's impact is small in percentage terms...

| Processes | Duration (sec) |
|----------:|---------------:|
| 1 | 79.5 |
| 2 | 40.5 |
| 4 | 20.5 |
| 8 | 10.9 |
| 10 | 10.0 |

The CPU usage (%) scales in the same way...

From 1 to 8 processes it looks like perfect scaling; higher values are not as good, but that is not my main concern...
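The process-per-worker split described above could be sketched as below; the worker path and the shard argument convention are placeholders for however the sample's work is actually partitioned. Each process gets its own GC heap, its own ArrayPool.Shared, and its own thread-local caches, which is presumably why this sidesteps the in-process contention:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class ShardRunner
{
    // Launch one single-threaded worker process per shard and wait for all of them.
    public static async Task RunAsync(string workerPath, int processCount)
    {
        var workers = new Process[processCount];
        for (int shard = 0; shard < processCount; shard++)
        {
            workers[shard] = Process.Start(new ProcessStartInfo
            {
                FileName = workerPath,                       // placeholder executable
                Arguments = $"--shard {shard} --of {processCount}",
                UseShellExecute = false,
            })!;
        }
        await Task.WhenAll(Array.ConvertAll(workers, p => p.WaitForExitAsync()));
    }
}
```

For comparison, the table above is close to ideal scaling: 79.5 s / 8 ≈ 9.9 s against the measured 10.9 s for 8 processes.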

My hardware: MacBook Pro M2 (2023) with 12 Apple Silicon CPUs, running a Parallels VM with Windows 11 (ARM) and 10 CPUs assigned to the VM.

@mgravell
Member

mgravell commented Apr 11, 2024 via email
