
Deserialization Performance Issue - with sample repo #1142

Open

WernerMairl opened this issue Mar 26, 2024 · 4 comments

Comments

@WernerMairl

Hello again ;-)

Yes, I'm aware of #669 and my personal conclusion there, but I'd like to investigate more deeply, and maybe someone can help!

Issue: the CPU is not fully utilized on high-core machines.

I have created a small (hopefully small enough) repository (net8.0) that provides a simple use case with basic measuring, as a playground for everyone!

The proto definition used is a real-world scenario, coming from the OpenStreetMap PBF file format.
Part of the implementation is also "stolen" from the OsmSharp project (MIT-licensed).

Sample Repository: https://github.com/WernerMairl/protobuf-net-concurrency

[image attachment from the original issue]

Expectations

Using 8 threads/tasks in parallel, an overall duration of less than 2,000 ms should be possible (compared with the 4,950 ms measured with one thread).

I cannot understand why we see a rate of 242 deserializations per second in a single-thread scenario, but only 50 deserializations per second (per thread) in an 8-thread scenario.

Allowing for some overhead, I would expect a rate of around 180-200 for each of the 8 threads!

Questions

  • wrong expectations?
  • invalid test?
  • bugs in the implementation?

Any help is welcome to improve this.
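For anyone skimming without cloning the repo, the measurement pattern described above (N workers, each deserializing the same payload in a loop, counting deserializations per second) could be sketched roughly as below. The `Blob` contract, the payload size, and the loop counts are illustrative placeholders, not the sample repo's actual code; the protobuf-net NuGet package is assumed:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using ProtoBuf;

[ProtoContract]
public class Blob
{
    [ProtoMember(1)] public byte[] Payload { get; set; }
}

public static class Bench
{
    // Total deserializations per second across all workers combined.
    public static double MeasureRate(byte[] serialized, int workers, int iterations)
    {
        var sw = Stopwatch.StartNew();
        var tasks = new Task[workers];
        for (int w = 0; w < workers; w++)
        {
            tasks[w] = Task.Run(() =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    using var ms = new MemoryStream(serialized);
                    _ = Serializer.Deserialize<Blob>(ms); // fresh object graph every call
                }
            });
        }
        Task.WaitAll(tasks);
        return (double)workers * iterations / sw.Elapsed.TotalSeconds;
    }

    public static void Main()
    {
        using var ms = new MemoryStream();
        Serializer.Serialize(ms, new Blob { Payload = new byte[256 * 1024] });
        var data = ms.ToArray();
        Console.WriteLine($"1 worker : {MeasureRate(data, 1, 50):F0}/s");
        Console.WriteLine($"8 workers: {MeasureRate(data, 8, 50):F0}/s");
    }
}
```

With perfect scaling, the 8-worker total rate would be about 8× the single-worker rate; the gap between those two numbers is exactly what this issue is about.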

@ewgdg

ewgdg commented Mar 28, 2024

I generated a benchmark report for your sample project using BenchmarkDotNet, as detailed below.

The benchmark results indicate that the program scales effectively with concurrency when the sample size is 10. However, scalability issues arise as the sample size increases.

I think this might be because larger sample sizes create large objects, possibly landing on the large object heap (LOH).

Also, with a large sample size like 4000, more memory is allocated as concurrency increases, which suggests memory inefficiency in multi-threaded runs.
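The LOH effect is easy to confirm in isolation: in .NET, any object of 85,000 bytes or more is allocated directly on the large object heap, which is logically part of Gen2 and is collected only during a Gen2 GC (and by default never compacted). A minimal check:

```csharp
using System;

var small = new byte[80_000]; // below the 85,000-byte LOH threshold
var large = new byte[90_000]; // at/above the threshold: goes straight to the LOH

Console.WriteLine(GC.GetGeneration(small)); // 0 (freshly allocated on the small object heap)
Console.WriteLine(GC.GetGeneration(large)); // 2 (the LOH is reported as generation 2)
```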

Based on these observations, my hypothesis is that this is related to the overhead of handling large objects with ArrayPool.Shared in a multi-threaded context, as explored in this article.

ArrayPool.Shared relies on thread-local storage for its implementation. This design choice implies that high levels of concurrency can increase the likelihood of object allocations, which can be costly if they are large objects. This explains the observed correlation between increased concurrency and memory allocation.
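The thread-affinity of `ArrayPool<T>.Shared` can be observed directly: a buffer returned on a thread lands in that thread's local cache first, so a same-size rent on the same thread gets the identical instance back, while other threads must fall back to the shared per-core buckets or allocate. A small sketch (the `ReferenceEquals` result reflects how current runtimes behave, not a documented guarantee):

```csharp
using System;
using System.Buffers;

var pool = ArrayPool<byte>.Shared;

var first = pool.Rent(64 * 1024);
pool.Return(first);

// Same thread, same bucket size: served from the thread-local cache.
var second = pool.Rent(64 * 1024);
Console.WriteLine(ReferenceEquals(first, second)); // True on current runtimes

pool.Return(second);
```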

Also, the SharedArrayPool registers a Gen2GcCallback. Under memory pressure, this callback releases all thread-local cached arrays, even though you actually want to reuse them while your program continuously deserializes data in a loop. This further degrades performance in the high-concurrency case, since there are more thread-local objects to drop and re-allocate.
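One general mitigation at the .NET level is a dedicated pool from `ArrayPool<T>.Create`: it uses plain locked buckets shared by all threads and does not register the Gen2 trimming callback, so long-running workers keep their large buffers. (Hedged: protobuf-net's internal use of `ArrayPool<byte>.Shared` is not externally swappable as far as this thread establishes, so this applies to your own buffer management around the library, not inside it.)

```csharp
using System;
using System.Buffers;

// A private pool: no thread-local storage, no Gen2 trim callback.
// maxArrayLength / maxArraysPerBucket should be sized for the workload.
var pool = ArrayPool<byte>.Create(maxArrayLength: 1 << 24, maxArraysPerBucket: 64);

var buffer = pool.Rent(1 << 20); // 1 MB, well above the LOH threshold
try
{
    // ... fill and consume the buffer ...
}
finally
{
    pool.Return(buffer); // returned buffers stay pooled until the process exits
}
```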



BenchmarkDotNet v0.13.12, macOS Sonoma 14.4 (23E214) [Darwin 23.4.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 8.0.101
  [Host]   : .NET 8.0.1 (8.0.123.58001), Arm64 RyuJIT AdvSIMD
  ShortRun : .NET 8.0.1 (8.0.123.58001), Arm64 RyuJIT AdvSIMD

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

| Method | SerializedSample | Concurrency | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|--------|-----------------:|------------:|-----:|------:|-------:|-----:|-----:|-----:|----------:|
| Runner | 10 | 1 | 1,833.6 ms | 230.96 ms | 12.66 ms | 288000.0000 | 6000.0000 | - | 2.25 GB |
| Runner | 10 | 2 | 1,006.2 ms | 197.33 ms | 10.82 ms | 293000.0000 | 12000.0000 | 1000.0000 | 2.25 GB |
| Runner | 10 | 3 | 775.6 ms | 128.97 ms | 7.07 ms | 294000.0000 | 18000.0000 | 1000.0000 | 2.25 GB |
| Runner | 10 | 4 | 614.9 ms | 210.45 ms | 11.54 ms | 294000.0000 | 23000.0000 | 1000.0000 | 2.25 GB |
| Runner | 10 | 8 | 399.4 ms | 166.71 ms | 9.14 ms | 294000.0000 | 42000.0000 | 1000.0000 | 2.25 GB |
| Runner | 500 | 1 | 1,260.5 ms | 54.52 ms | 2.99 ms | 189000.0000 | 1000.0000 | - | 1.48 GB |
| Runner | 500 | 2 | 775.5 ms | 303.83 ms | 16.65 ms | 195000.0000 | 26000.0000 | 23000.0000 | 1.48 GB |
| Runner | 500 | 3 | 702.9 ms | 161.65 ms | 8.86 ms | 199000.0000 | 81000.0000 | 78000.0000 | 1.5 GB |
| Runner | 500 | 4 | 660.6 ms | 118.04 ms | 6.47 ms | 202000.0000 | 86000.0000 | 83000.0000 | 1.52 GB |
| Runner | 500 | 8 | 760.5 ms | 91.39 ms | 5.01 ms | 213000.0000 | 105000.0000 | 102000.0000 | 1.57 GB |
| Runner | 1000 | 1 | 1,349.7 ms | 202.89 ms | 11.12 ms | 190000.0000 | 66000.0000 | 63000.0000 | 1.49 GB |
| Runner | 1000 | 2 | 912.8 ms | 582.44 ms | 31.93 ms | 199000.0000 | 86000.0000 | 83000.0000 | 1.51 GB |
| Runner | 1000 | 3 | 803.3 ms | 284.16 ms | 15.58 ms | 203000.0000 | 93000.0000 | 90000.0000 | 1.54 GB |
| Runner | 1000 | 4 | 826.8 ms | 455.74 ms | 24.98 ms | 208000.0000 | 103000.0000 | 100000.0000 | 1.56 GB |
| Runner | 1000 | 8 | 1,061.0 ms | 20.52 ms | 1.12 ms | 222000.0000 | 106000.0000 | 105000.0000 | 1.66 GB |
| Runner | 4000 | 1 | 1,992.6 ms | 28.40 ms | 1.56 ms | 257000.0000 | 203000.0000 | 144000.0000 | 1.59 GB |
| Runner | 4000 | 2 | 2,287.4 ms | 1,042.69 ms | 57.15 ms | 750000.0000 | 747000.0000 | 127000.0000 | 1.68 GB |
| Runner | 4000 | 3 | 2,167.8 ms | 848.60 ms | 46.51 ms | 875000.0000 | 871000.0000 | 160000.0000 | 1.89 GB |
| Runner | 4000 | 4 | 1,848.8 ms | 1,515.13 ms | 83.05 ms | 579000.0000 | 539000.0000 | 148000.0000 | 1.99 GB |
| Runner | 4000 | 8 | 1,844.3 ms | 218.61 ms | 11.98 ms | 229000.0000 | 116000.0000 | 114000.0000 | 2.27 GB |
| Runner | 8000 | 1 | 2,790.2 ms | 2,518.96 ms | 138.07 ms | 318000.0000 | 284000.0000 | 155000.0000 | 1.73 GB |
| Runner | 8000 | 2 | 1,985.8 ms | 2,056.98 ms | 112.75 ms | 401000.0000 | 344000.0000 | 123000.0000 | 1.89 GB |
| Runner | 8000 | 3 | 1,558.9 ms | 828.29 ms | 45.40 ms | 248000.0000 | 180000.0000 | 112000.0000 | 2.06 GB |
| Runner | 8000 | 4 | 1,650.1 ms | 241.14 ms | 13.22 ms | 196000.0000 | 100000.0000 | 99000.0000 | 2.16 GB |
| Runner | 8000 | 8 | 1,918.2 ms | 1,543.50 ms | 84.60 ms | 171000.0000 | 86000.0000 | 85000.0000 | 2.48 GB |

@WernerMairl
Author

Thank you,
basically I arrive at much the same conclusion after every restart of my investigations.

Important to know: the sizing of the sample was chosen by its inventors for (file) storage efficiency, not for memory or deserialization efficiency. Anyway, I'm using the same "Apple Silicon" as you, and it is frustrating to see that all the available power (which is fantastic) is not usable :-(

So maybe my next round of learning and improvement should go into the area of "efficient object creation"...

BR Werner

@WernerMairl
Author

In the past I was in doubt whether the root cause here might lie in the area of vCPUs, hyperthreading, CPU caches, the CPU-to-memory connection, etc.

A new experiment confirmed that the issue is caused by .NET memory usage inside the current process.

What I did:

I reconfigured my sample/work so that only one thread is used; it then takes roughly 80 seconds to do the work.
Then I split the work up into multiple PROCESSES (each with one thread)... and see: it scales!

I can see CPU usage > 90 %, and I see faster execution of the overall work!

Yes, there is some overhead, but with an unoptimized duration of 80 seconds the overhead's impact is small in percentage terms...

| Processes | Duration (sec) |
|----------:|---------------:|
| 1 | 79.5 |
| 2 | 40.5 |
| 4 | 20.5 |
| 8 | 10.9 |
| 10 | 10.0 |

The CPU usage (%) scales in the same way...

From 1 to 8 processes it looks like perfect scaling; higher values are not as good, but that is not my main concern...
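The process-per-worker split described above could be sketched as below; the worker path and the shard argument convention are placeholders for however the sample's work is actually partitioned. Each process gets its own GC heap, its own ArrayPool.Shared, and its own thread-local caches, which is presumably why this sidesteps the in-process contention:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class ShardRunner
{
    // Launch one single-threaded worker process per shard and wait for all of them.
    public static async Task RunAsync(string workerPath, int processCount)
    {
        var workers = new Process[processCount];
        for (int shard = 0; shard < processCount; shard++)
        {
            workers[shard] = Process.Start(new ProcessStartInfo
            {
                FileName = workerPath,                       // placeholder executable
                Arguments = $"--shard {shard} --of {processCount}",
                UseShellExecute = false,
            })!;
        }
        await Task.WhenAll(Array.ConvertAll(workers, p => p.WaitForExitAsync()));
    }
}
```

For comparison, the table above is close to ideal scaling: 79.5 s / 8 ≈ 9.9 s against the measured 10.9 s for 8 processes.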

My hardware: MacBook Pro M2 (2023) with 12 Apple Silicon CPUs, running a Parallels VM with Windows 11 (ARM) and 10 CPUs assigned to the VM.

@mgravell
Member

mgravell commented Apr 11, 2024 via email
