Call benchmark method directly #2334

Open
timcassell wants to merge 1 commit into master

Conversation

timcassell
Collaborator

Fixes #1133

This wraps the workload call with a NoInlining | NoOptimization method instead of a delegate.
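As a rough illustration of the difference described above, here is a minimal sketch. It is not the code generated by BenchmarkDotNet; the class and member names (`WrapperSketch`, `InvokeViaDelegate`, `InvokeViaWrapper`) are hypothetical and only contrast a delegate invocation with a direct call wrapped in a `NoInlining | NoOptimization` method.

```csharp
using System;
using System.Runtime.CompilerServices;

public class WrapperSketch
{
    private int _field;

    // Delegate that points at the workload (the delegate-based approach).
    private readonly Action _workloadDelegate;

    public WrapperSketch() => _workloadDelegate = Workload;

    // Stand-in for a user's benchmark method.
    public void Workload() => _field++;

    // Delegate-based invocation: the call goes through the delegate.
    public void InvokeViaDelegate() => _workloadDelegate();

    // Wrapper-based invocation: a direct call to the workload.
    // NoInlining keeps this wrapper from being inlined into its caller;
    // NoOptimization asks the JIT not to optimize the wrapper body.
    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    public void InvokeViaWrapper() => Workload();

    public int Field => _field;
}
```

Both paths run the same workload; the sketch only shows the two call shapes being compared.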

Mac Intel x64 results

BenchmarkDotNet=v0.13.5.20230619-develop, OS=macOS Monterey 12.3 (21E230) [Darwin 21.4.0]
Intel Core i9-9880H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-preview.5.23303.2
  [Host]     : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT AVX2

Master

| Method         | Mean      | Error     | StdDev    |
|--------------- |----------:|----------:|----------:|
| OneIncrement   | 0.0013 ns | 0.0037 ns | 0.0048 ns |
| TwoIncrement   | 0.0034 ns | 0.0043 ns | 0.0064 ns |
| ThreeIncrement | 0.0000 ns | 0.0000 ns | 0.0000 ns |
| FourIncrement  | 0.1652 ns | 0.0053 ns | 0.0047 ns |
| FiveIncrement  | 0.3950 ns | 0.0047 ns | 0.0039 ns |
| SixIncrement   | 0.6129 ns | 0.0038 ns | 0.0035 ns |

This PR

| Method         | Mean      | Error     | StdDev    |
|--------------- |----------:|----------:|----------:|
| OneIncrement   | 0.2852 ns | 0.0063 ns | 0.0056 ns |
| TwoIncrement   | 0.3822 ns | 0.0477 ns | 0.0742 ns |
| ThreeIncrement | 0.4122 ns | 0.0071 ns | 0.0063 ns |
| FourIncrement  | 0.4845 ns | 0.0080 ns | 0.0071 ns |
| FiveIncrement  | 0.5980 ns | 0.0082 ns | 0.0068 ns |
| SixIncrement   | 1.1828 ns | 0.0078 ns | 0.0069 ns |

Windows AMD x64 results

BenchmarkDotNet=v0.13.5.20230619-develop, OS=Windows 10 (10.0.19045.3086/22H2/2022Update)
AMD Phenom(tm) II X6 1055T Processor, 1 CPU, 6 logical and 6 physical cores
.NET SDK=8.0.100-preview.5.23303.2
  [Host]     : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT SSE3
  DefaultJob : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT SSE3

Master

| Method         | Mean      | Error     | StdDev    |
|--------------- |----------:|----------:|----------:|
| OneIncrement   | 0.1833 ns | 0.0359 ns | 0.0300 ns |
| TwoIncrement   | 0.6950 ns | 0.0523 ns | 0.0538 ns |
| ThreeIncrement | 0.7524 ns | 0.0062 ns | 0.0051 ns |
| FourIncrement  | 1.0793 ns | 0.0154 ns | 0.0129 ns |
| FiveIncrement  | 1.1073 ns | 0.0589 ns | 0.0630 ns |
| SixIncrement   | 1.6137 ns | 0.0683 ns | 0.0787 ns |

This PR

| Method         | Mean      | Error     | StdDev    |
|--------------- |----------:|----------:|----------:|
| OneIncrement   | 0.6927 ns | 0.0182 ns | 0.0179 ns |
| TwoIncrement   | 1.1581 ns | 0.0104 ns | 0.0087 ns |
| ThreeIncrement | 1.2368 ns | 0.0057 ns | 0.0050 ns |
| FourIncrement  | 1.4841 ns | 0.0054 ns | 0.0045 ns |
| FiveIncrement  | 1.7893 ns | 0.0918 ns | 0.1128 ns |
| SixIncrement   | 2.3818 ns | 0.1026 ns | 0.0909 ns |

@timcassell
Collaborator Author

timcassell commented Jul 24, 2023

With this PR, I am seeing what looks like more accurate results in the default toolchain, but the InProcessEmitToolchain is now showing faster times than out-of-process toolchains. The IL is identical (as confirmed by the IL comparison tests), so I'm not really sure why this is. I tried to disassemble to see what's going on, but DisassemblyDiagnoser apparently doesn't work with InProcessEmitToolchain (I got errors).

Looking at the logs, it appears that the overhead is measured as taking more time.

@ig-sinicyn Any ideas?

Master:

    Runtime=.NET 7.0  

|         Method |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
|   OneIncrement | 0.1382 ns | 0.0093 ns | 0.0073 ns |
|   TwoIncrement | 0.7758 ns | 0.0281 ns | 0.0235 ns |
| ThreeIncrement | 0.7911 ns | 0.0304 ns | 0.0445 ns |
|  FourIncrement | 1.3554 ns | 0.0681 ns | 0.1040 ns |

    Toolchain=InProcessEmitToolchain  

|         Method |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
|   OneIncrement | 0.2270 ns | 0.0279 ns | 0.0261 ns |
|   TwoIncrement | 0.8583 ns | 0.0539 ns | 0.0477 ns |
| ThreeIncrement | 0.4007 ns | 0.0578 ns | 0.1071 ns |
|  FourIncrement | 1.4215 ns | 0.0096 ns | 0.0080 ns |

PR:

    Runtime=.NET 7.0  

|         Method |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
|   OneIncrement | 0.5118 ns | 0.0697 ns | 0.0977 ns |
|   TwoIncrement | 1.1762 ns | 0.0813 ns | 0.0903 ns |
| ThreeIncrement | 1.4889 ns | 0.0869 ns | 0.1300 ns |
|  FourIncrement | 1.6251 ns | 0.0893 ns | 0.1028 ns |

    Toolchain=InProcessEmitToolchain  

|         Method |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
|   OneIncrement | 0.0000 ns | 0.0000 ns | 0.0000 ns |
|   TwoIncrement | 0.1269 ns | 0.0088 ns | 0.0078 ns |
| ThreeIncrement | 0.5865 ns | 0.0064 ns | 0.0060 ns |
|  FourIncrement | 0.5677 ns | 0.0074 ns | 0.0062 ns |

@timcassell timcassell marked this pull request as draft July 24, 2023 14:03
@timcassell

This comment was marked as outdated.

@timcassell
Collaborator Author

timcassell commented Jul 28, 2023

I reverted the ClrMd disassembler back to v1 on my local machine so I could actually inspect the asm. The only difference I see that might affect the result is this:

Default toolchain

call      qword ptr [BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()]

InProcessEmit

call      BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()

The IL is exactly the same for those calls, so it seems the JIT treats emitted IL slightly differently. I don't know enough asm to say what effect that difference has, but it seems that the qword ptr form is faster. (Only the overhead and wrapper calls differ; the workload call uses qword ptr in both toolchains.)

call-direct-default-asm.md
call-direct-inprocess-asm.md

@timcassell
Collaborator Author

timcassell commented Aug 1, 2023

The assembly issue with InProcessEmit only occurs on .NET 7 and later. The overhead measurement is off by about 2-3 clock cycles, which isn't far off from the current measurement in all toolchains. I don't think it should block this from being merged.

@timcassell timcassell marked this pull request as ready for review August 1, 2023 06:51
@timcassell timcassell added this to the v0.14.0 milestone Jan 14, 2024
@timcassell
Collaborator Author

@AndreyAkinshin I would also like to get this into v0.14.0 if you don't mind (followed by #2336). These two PRs will likely change the results of long-running measurement suites (like dotnet/performance), in the direction of higher accuracy.

@AndreyAkinshin AndreyAkinshin modified the milestones: v0.14.x, v0.14.0 Jan 22, 2024
@timcassell timcassell linked an issue Mar 6, 2024 that may be closed by this pull request
@timcassell timcassell force-pushed the call-direct branch 2 times, most recently from 1982b8a to 6ba4993 Compare March 10, 2024 04:25
@AndreyAkinshin
Member

@timcassell could you please rebase on master one more time? I introduced a bug in Perfolizer 0.3.16 that was fixed in 0.3.17. I just pushed Perfolizer 0.3.17 to BenchmarkDotNet master.

@AndreyAkinshin
Member

It seems I found a problem. Let's consider the following environment:

BenchmarkDotNet v0.13.13-develop (2024-03-11), Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

The original benchmark was extended to the following form:

Source code
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main() => BenchmarkRunner.Run<OverheadTests>();
}

[DisassemblyDiagnoser]
public class OverheadTests
{
    private int _field;

    [Benchmark]
    public void Increment01()
    {
        _field++;
    }

    [Benchmark]
    public void Increment02()
    {
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment03()
    {
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment04()
    {
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment05()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment06()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment07()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment08()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment09()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment10()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }

    [Benchmark]
    public void Increment20()
    {
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
        _field++;
    }
}

Here is the generated assembly:

Assembly

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment01()
       inc       dword ptr [rdi+8]
       ret
; Total bytes of code 4

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment02()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 14

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment03()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 19

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment04()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 24

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment05()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 29

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment06()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 34

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment07()
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       ret
; Total bytes of code 39

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment08()
       push      rbp
       mov       rbp,rsp
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       pop       rbp
       ret
; Total bytes of code 49

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment09()
       push      rbp
       mov       rbp,rsp
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       pop       rbp
       ret
; Total bytes of code 54

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment10()
       push      rbp
       mov       rbp,rsp
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       pop       rbp
       ret
; Total bytes of code 59

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkDotNet.Samples.OverheadTests.Increment20()
       push      rbp
       mov       rbp,rsp
       mov       eax,[rdi+8]
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       inc       eax
       mov       [rdi+8],eax
       pop       rbp
       ret
; Total bytes of code 109

An interesting observation: starting with Increment08, .NET 8.0.0 (8.0.23.53103) wraps the method body in a frame-pointer prologue and epilogue:

push      rbp
mov       rbp,rsp
...
pop       rbp

Here are my results with the latest master:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0282 ns | 0.0000 ns | 0.0000 ns |       4 B |
| Increment02 | 0.0329 ns | 0.0000 ns | 0.0000 ns |      14 B |
| Increment03 | 0.0339 ns | 0.0005 ns | 0.0004 ns |      19 B |
| Increment04 | 0.0000 ns | 0.0000 ns | 0.0000 ns |      24 B |
| Increment05 | 0.1551 ns | 0.0001 ns | 0.0001 ns |      29 B |
| Increment06 | 0.1588 ns | 0.0007 ns | 0.0006 ns |      34 B |
| Increment07 | 0.3427 ns | 0.0022 ns | 0.0021 ns |      39 B |
| Increment08 | 0.5363 ns | 0.0002 ns | 0.0002 ns |      49 B |
| Increment09 | 0.7391 ns | 0.0005 ns | 0.0005 ns |      54 B |
| Increment10 | 0.9274 ns | 0.0015 ns | 0.0013 ns |      59 B |
| Increment20 | 2.7696 ns | 0.0004 ns | 0.0004 ns |     109 B |

The results are quite consistent, stable, and reproducible. For Increment01..04, we have "instant" results, but at least the "Mean" time is not decreasing with an increased number of increments.

Now let's run the same set of benchmarks using BenchmarkDotNet from this PR:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0113 ns | 0.0061 ns | 0.0057 ns |       4 B |
| Increment02 | 0.0073 ns | 0.0001 ns | 0.0000 ns |      14 B |
| Increment03 | 0.0030 ns | 0.0000 ns | 0.0000 ns |      19 B |
| Increment04 | 0.0145 ns | 0.0001 ns | 0.0001 ns |      24 B |
| Increment05 | 0.0175 ns | 0.0001 ns | 0.0001 ns |      29 B |
| Increment06 | 0.0251 ns | 0.0017 ns | 0.0016 ns |      34 B |
| Increment07 | 0.5500 ns | 0.0014 ns | 0.0013 ns |      39 B |
| Increment08 | 0.6812 ns | 0.0442 ns | 0.1007 ns |      49 B |
| Increment09 | 0.3456 ns | 0.0001 ns | 0.0001 ns |      54 B |
| Increment10 | 0.3746 ns | 0.0035 ns | 0.0033 ns |      59 B |
| Increment20 | 2.2034 ns | 0.0020 ns | 0.0018 ns |     109 B |

Observations:

  • The "instant"-result problem for Increment01..04 is not resolved (and Increment05..06 are now always "instant" as well)
  • The mean estimate drops after Increment08 (0.6812 ns for Increment08 -> 0.3456 ns for Increment09)

While what counts as "correct" results is debatable in this case, the non-monotonic Mean column definitely feels wrong, and it's a clear regression compared to master. These results are also reproducible on my machine: Increment09 and Increment10 are always reported as faster than Increment07 and Increment08.

I'm ready to collect any additional diagnostic info if needed.
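The non-monotonic Mean column described above can also be checked mechanically. The following is only a sketch: `MonotonicityCheck` and `FindDrops` are hypothetical names, not part of BenchmarkDotNet, and the array holds the Mean values (in ns) copied from the PR table in this comment, in the order Increment01..Increment10, Increment20.

```csharp
using System.Collections.Generic;

public static class MonotonicityCheck
{
    // Mean values (ns) from the "this PR" table above, Increment01..10 then Increment20.
    public static readonly double[] PrMeans =
    {
        0.0113, 0.0073, 0.0030, 0.0145, 0.0175, 0.0251,
        0.5500, 0.6812, 0.3456, 0.3746, 2.2034
    };

    // Returns every index i where means[i] < means[i - 1], i.e. where a method
    // with more increments is reported as faster than the one before it.
    public static List<int> FindDrops(IReadOnlyList<double> means)
    {
        var drops = new List<int>();
        for (int i = 1; i < means.Count; i++)
        {
            if (means[i] < means[i - 1])
            {
                drops.Add(i);
            }
        }
        return drops;
    }
}
```

Calling `MonotonicityCheck.FindDrops(MonotonicityCheck.PrMeans)` flags, among others, the position of Increment09 (index 8), which is the Increment08 to Increment09 drop noted in the observations. A strict comparison like this ignores the Error column, so drops between near-zero "instant" results are reported too and should be read with that in mind.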

@timcassell
Collaborator Author

@AndreyAkinshin Unfortunately I don't have a Ryzen CPU to run benchmarks on, but I ran those benchmarks again on both of my machines and got results that look mostly good (the only outlier is the drop from Increment03 to Increment04 on Intel, on both branches). There must be a CPU-architecture reason for those results.

Master

BenchmarkDotNet v0.13.13-develop (2024-03-11), Windows 10 (10.0.19045.4046/22H2/2022Update)
AMD Phenom(tm) II X6 1055T Processor, 1 CPU, 6 logical and 6 physical cores
.NET SDK 8.0.200
  [Host]     : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3
  DefaultJob : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3


| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.4739 ns | 0.0395 ns | 0.0369 ns |       4 B |
| Increment02 | 0.7638 ns | 0.0169 ns | 0.0158 ns |      14 B |
| Increment03 | 0.7994 ns | 0.0283 ns | 0.0236 ns |      19 B |
| Increment04 | 1.0731 ns | 0.0076 ns | 0.0059 ns |      24 B |
| Increment05 | 1.1564 ns | 0.0615 ns | 0.0683 ns |      29 B |
| Increment06 | 1.4360 ns | 0.0692 ns | 0.1097 ns |      34 B |
| Increment07 | 1.4596 ns | 0.0457 ns | 0.0382 ns |      39 B |
| Increment08 | 1.9861 ns | 0.0282 ns | 0.0220 ns |      44 B |
| Increment09 | 2.0474 ns | 0.0542 ns | 0.0507 ns |      49 B |
| Increment10 | 2.2247 ns | 0.0239 ns | 0.0200 ns |      54 B |
| Increment20 | 5.6803 ns | 0.0725 ns | 0.0643 ns |     104 B |

BenchmarkDotNet v0.13.13-develop (2024-03-11), macOS Monterey 12.6 (21G115) [Darwin 21.6.0]
Intel Core i9-9880H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2


| Method      | Mean      | Error     | StdDev    |
|------------ |----------:|----------:|----------:|
| Increment01 | 0.0168 ns | 0.0127 ns | 0.0119 ns |
| Increment02 | 0.0366 ns | 0.0104 ns | 0.0081 ns |
| Increment03 | 0.2366 ns | 0.0150 ns | 0.0133 ns |
| Increment04 | 0.1721 ns | 0.0175 ns | 0.0164 ns |
| Increment05 | 0.4294 ns | 0.0123 ns | 0.0103 ns |
| Increment06 | 0.7268 ns | 0.0299 ns | 0.0279 ns |
| Increment07 | 1.0112 ns | 0.0359 ns | 0.0336 ns |
| Increment08 | 1.2435 ns | 0.0271 ns | 0.0253 ns |
| Increment09 | 1.4853 ns | 0.0309 ns | 0.0274 ns |
| Increment10 | 1.7526 ns | 0.0231 ns | 0.0216 ns |
| Increment20 | 4.3590 ns | 0.0467 ns | 0.0437 ns |

PR

BenchmarkDotNet v0.13.13-develop (2024-03-11), Windows 10 (10.0.19045.4046/22H2/2022Update)
AMD Phenom(tm) II X6 1055T Processor, 1 CPU, 6 logical and 6 physical cores
.NET SDK 8.0.201
  [Host]     : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3
  DefaultJob : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3


| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.6082 ns | 0.0716 ns | 0.1757 ns |       4 B |
| Increment02 | 1.1070 ns | 0.0182 ns | 0.0142 ns |      14 B |
| Increment03 | 1.2486 ns | 0.0261 ns | 0.0231 ns |      19 B |
| Increment04 | 1.4787 ns | 0.0125 ns | 0.0104 ns |      24 B |
| Increment05 | 1.7511 ns | 0.0104 ns | 0.0092 ns |      29 B |
| Increment06 | 2.4044 ns | 0.0249 ns | 0.0221 ns |      34 B |
| Increment07 | 2.5753 ns | 0.1115 ns | 0.1145 ns |      39 B |
| Increment08 | 3.8930 ns | 0.0206 ns | 0.0172 ns |      44 B |
| Increment09 | 4.0305 ns | 0.0762 ns | 0.0713 ns |      49 B |
| Increment10 | 4.2039 ns | 0.0184 ns | 0.0172 ns |      54 B |
| Increment20 | 6.9911 ns | 0.0497 ns | 0.0465 ns |     104 B |

BenchmarkDotNet v0.13.13-develop (2024-03-11), macOS Monterey 12.6 (21G115) [Darwin 21.6.0]
Intel Core i9-9880H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2


| Method      | Mean      | Error     | StdDev    |
|------------ |----------:|----------:|----------:|
| Increment01 | 0.3142 ns | 0.0238 ns | 0.0223 ns |
| Increment02 | 0.4408 ns | 0.0141 ns | 0.0132 ns |
| Increment03 | 0.6237 ns | 0.0428 ns | 0.0380 ns |
| Increment04 | 0.5075 ns | 0.0271 ns | 0.0240 ns |
| Increment05 | 0.6775 ns | 0.0271 ns | 0.0240 ns |
| Increment06 | 1.2777 ns | 0.0309 ns | 0.0289 ns |
| Increment07 | 1.4158 ns | 0.0417 ns | 0.0390 ns |
| Increment08 | 1.6902 ns | 0.0253 ns | 0.0197 ns |
| Increment09 | 2.0321 ns | 0.0342 ns | 0.0267 ns |
| Increment10 | 2.2494 ns | 0.0456 ns | 0.0404 ns |
| Increment20 | 4.9814 ns | 0.0795 ns | 0.0704 ns |

There is no assembly for the Intel chip (the disassembler has no macOS support), but here is the assembly for the old AMD chip:

Assembly

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment01()
       inc       dword ptr [rcx+8]
       ret
; Total bytes of code 4

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment02()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 14

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment03()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 19

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment04()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 24

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment05()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 29

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment06()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 34

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment07()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 39

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment08()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 44

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment09()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 49

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment10()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 54

.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3

; ConsoleApp1.OverheadTests.Increment20()
       mov       eax,[rcx+8]
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       inc       eax
       mov       [rcx+8],eax
       ret
; Total bytes of code 104

I see no logical reason for there to be a flaw with wrapping the call in a NoInlining method rather than a delegate, but I could be missing something.

@timcassell
Collaborator Author

@AndreyAkinshin I know it'll spoil the "instant" results, but I wonder what results you will get if you make the field volatile to prevent any cpu optimization shenanigans.
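For clarity, the suggested change amounts to something like the sketch below. It assumes the same shape as the OverheadTests class above; the class name `VolatileOverheadSketch` is hypothetical and the [Benchmark] attributes are left out so the snippet compiles without BenchmarkDotNet.

```csharp
public class VolatileOverheadSketch
{
    // volatile: the JIT must perform every read and write of this field
    // and may not keep the value cached in a register across statements.
    private volatile int _field;

    // In the real benchmark class this method would carry [Benchmark].
    public void Increment02()
    {
        _field++;
        _field++;
    }

    // Exposed only so the effect of the increments can be observed.
    public int Field => _field;
}
```

The only difference from the original class is the `volatile` modifier on the field; the method bodies stay the same.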

@timcassell
Collaborator Author

@AndreyAkinshin Also, can you disassemble the *Workload* and *Overhead* methods? I'm curious whether the assembly calls match or whether there are differences like the ones I saw with the IL emit.

@AndreyAkinshin
Member

@timcassell

> I wonder what results you will get if you make the field volatile to prevent any cpu optimization shenanigans.

master:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0272 ns | 0.0005 ns | 0.0005 ns |       4 B |
| Increment02 | 0.0138 ns | 0.0151 ns | 0.0141 ns |       7 B |
| Increment03 | 0.0326 ns | 0.0000 ns | 0.0000 ns |      10 B |
| Increment04 | 0.0632 ns | 0.0018 ns | 0.0015 ns |      13 B |
| Increment05 | 0.1947 ns | 0.0001 ns | 0.0001 ns |      16 B |
| Increment06 | 0.4581 ns | 0.0017 ns | 0.0014 ns |      24 B |
| Increment07 | 0.5812 ns | 0.0037 ns | 0.0033 ns |      27 B |
| Increment08 | 0.8946 ns | 0.0183 ns | 0.0171 ns |      30 B |
| Increment09 | 0.9137 ns | 0.0008 ns | 0.0007 ns |      33 B |
| Increment10 | 1.0144 ns | 0.0023 ns | 0.0022 ns |      36 B |
| Increment20 | 3.1494 ns | 0.0025 ns | 0.0023 ns |      66 B |

PR:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0103 ns | 0.0018 ns | 0.0016 ns |       4 B |
| Increment02 | 0.0068 ns | 0.0022 ns | 0.0021 ns |       7 B |
| Increment03 | 0.0148 ns | 0.0001 ns | 0.0001 ns |      10 B |
| Increment04 | 0.0147 ns | 0.0000 ns | 0.0000 ns |      13 B |
| Increment05 | 0.0186 ns | 0.0015 ns | 0.0014 ns |      16 B |
| Increment06 | 0.0418 ns | 0.0003 ns | 0.0003 ns |      24 B |
| Increment07 | 0.2448 ns | 0.0010 ns | 0.0009 ns |      27 B |
| Increment08 | 0.5236 ns | 0.0049 ns | 0.0046 ns |      30 B |
| Increment09 | 0.5196 ns | 0.0183 ns | 0.0171 ns |      33 B |
| Increment10 | 0.6923 ns | 0.0062 ns | 0.0058 ns |      36 B |
| Increment20 | 2.7928 ns | 0.0044 ns | 0.0041 ns |      66 B |

@AndreyAkinshin
Member

@timcassell

> Also, can you disassemble *Workload* and *Overhead* methods? I'm curious if the assembly calls match or if there's some differences like I saw with the IL emit.

Could you please remind me what is the easiest way to do this on Linux nowadays?

@timcassell
Collaborator Author

> Could you please remind me what is the easiest way to do this on Linux nowadays?

You can use the --disasmFilter *Workload* *Overhead* command-line argument, or

config.AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig(filters: ["*Workload*", "*Overhead*"])));

@timcassell
Collaborator Author

timcassell commented Mar 11, 2024

PR:

| Method      | Mean      | Error     | StdDev    | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0103 ns | 0.0018 ns | 0.0016 ns |       4 B |
| Increment02 | 0.0068 ns | 0.0022 ns | 0.0021 ns |       7 B |
| Increment03 | 0.0148 ns | 0.0001 ns | 0.0001 ns |      10 B |
| Increment04 | 0.0147 ns | 0.0000 ns | 0.0000 ns |      13 B |
| Increment05 | 0.0186 ns | 0.0015 ns | 0.0014 ns |      16 B |
| Increment06 | 0.0418 ns | 0.0003 ns | 0.0003 ns |      24 B |
| Increment07 | 0.2448 ns | 0.0010 ns | 0.0009 ns |      27 B |
| Increment08 | 0.5236 ns | 0.0049 ns | 0.0046 ns |      30 B |
| Increment09 | 0.5196 ns | 0.0183 ns | 0.0171 ns |      33 B |
| Increment10 | 0.6923 ns | 0.0062 ns | 0.0058 ns |      36 B |
| Increment20 | 2.7928 ns | 0.0044 ns | 0.0041 ns |      66 B |

Well, those results look more stable (an almost consistent increase after Increment06). It looks like a nearly constant time of about 0.4 ns was subtracted from your master results. If your CPU runs at 5 GHz, that's 2 clock cycles. That's almost exactly what I see with the InProcessEmitToolchain on my older machine. I'll be curious to see if the assembly shows it.

It's also interesting that adding volatile shrank the code size. 🤔
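The cycle estimate above is simple arithmetic: cycles = time in ns × clock in GHz. A minimal sketch follows; the helper name `CycleMath.NanosecondsToCycles` is hypothetical, and the 5 GHz clock is the assumption stated in this comment, not a measured value.

```csharp
public static class CycleMath
{
    // cycles = seconds * hertz. With time in nanoseconds and frequency in
    // gigahertz, the 1e-9 and 1e9 factors cancel, leaving a plain product.
    public static double NanosecondsToCycles(double nanoseconds, double clockGHz)
    {
        return nanoseconds * clockGHz;
    }
}
```

For example, `CycleMath.NanosecondsToCycles(0.4, 5.0)` gives roughly 2 cycles, matching the estimate above. A different sustained clock speed would change the result proportionally.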

@timcassell
Collaborator Author

timcassell commented Mar 12, 2024

Other things to check out:

  • Results with net6.0 runtime
  • Results with full Framework runtime (if you can, I know you said you're on Linux)
  • Results with another cpu (if you have another cpu you can test with)

@AndreyAkinshin
Member

Status update: measurements are in progress (I want to collect a comprehensive set of summary tables and share them at once)

Development

Successfully merging this pull request may close these issues.

Inaccurate results reported for small methods
BenchmarkDotNet (arguably) slightly overcorrects for overhead
2 participants