
Replace BufferCache with ArrayPool<T> #8841

Open · wants to merge 1 commit into base: main

Conversation

@onkrot commented Feb 24, 2024

Fixes #

Main PR

Description

Replace internal BufferCache with generic ArrayPool<T>
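
Roughly, each BufferCache get/release pair becomes an ArrayPool<T>.Shared Rent/Return pair. A minimal sketch of the pattern (the ushort case; the wrapper method and try/finally are illustrative, not copied from the actual diff):

using System.Buffers;

internal static class Example
{
    internal static void UseScratchBuffer(int length)
    {
        // Before: ushort[] buffer = BufferCache.GetUShorts(length);
        //         ... use buffer ...
        //         BufferCache.ReleaseUShorts(buffer);

        // After: note that Rent may hand back an array longer than requested,
        // so consumers must work with buffer[0..length], not buffer.Length.
        ushort[] buffer = ArrayPool<ushort>.Shared.Rent(length);
        try
        {
            // ... fill and consume buffer[0..length] ...
        }
        finally
        {
            ArrayPool<ushort>.Shared.Return(buffer);
        }
    }
}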

Customer Impact

Small performance improvement

Regression

None

Testing

None

Risk

Low risk


@onkrot requested a review from a team as a code owner Feb 24, 2024
@dotnet-policy-service bot added the PR metadata and Community Contribution labels Feb 24, 2024
@pchaurasia14 (Contributor) commented:

@onkrot - Thank you for this PR. Looks nice. Were you able to benchmark the perf impact of this change at your end?

@onkrot (Author) commented Feb 25, 2024

For the benchmark, I made several assumptions:

  1. Single-threaded environment
  2. No other code interacts with the buffer cache
  3. The number of glyphs per font can be arbitrary but does not exceed 65536

So, with all of the above in mind, I came up with this benchmark:

using BenchmarkDotNet.Attributes;
using MS.Internal.FontCache;
using System.Buffers;

namespace Bench
{
    public class BenchCache
    {
        [Params(100, 200)]
        public int FontCount { get; set; }

        readonly Random random = new(0);

        [Benchmark]
        public void BufferCacheUint()
        {
            for (int i = 0; i < FontCount; i++)
            {
                var array = BufferCache.GetUInts(random.Next(65536));
                BufferCache.ReleaseUInts(array);
            }
        }

        [Benchmark]
        public void BufferCacheUshort()
        {
            for (int i = 0; i < FontCount; i++)
            {
                var array = BufferCache.GetUShorts(random.Next(65536));
                BufferCache.ReleaseUShorts(array);
            }
        }

        [Benchmark]
        public void BufferCacheGlyph()
        {
            for (int i = 0; i < FontCount; i++)
            {
                var array = BufferCache.GetGlyphMetrics(random.Next(65536));
                BufferCache.ReleaseGlyphMetrics(array);
            }
        }

        [Benchmark]
        public void ArrayPoolUint()
        {
            for (int i = 0; i < FontCount; i++)
            {
                var array = ArrayPool<uint>.Shared.Rent(random.Next(65536));
                ArrayPool<uint>.Shared.Return(array);
            }
        }

        [Benchmark]
        public void ArrayPoolUshort()
        {
            for (int i = 0; i < FontCount; i++)
            {
                var array = ArrayPool<ushort>.Shared.Rent(random.Next(65536));
                ArrayPool<ushort>.Shared.Return(array);
            }
        }

        [Benchmark]
        public void ArrayPoolGlyph()
        {
            for (int i = 0; i < FontCount; i++)
            {
                var array = ArrayPool<GlyphMetrics>.Shared.Rent(random.Next(65536));
                ArrayPool<GlyphMetrics>.Shared.Return(array);
            }
        }
    }
}

which yielded these results:

// * Summary *

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3155/23H2/2023Update/SunValley3)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-preview.1.24101.4
  [Host]     : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method            | FontCount | Mean         | Error      | StdDev      |
|------------------ |---------- |-------------:|-----------:|------------:|
| BufferCacheUint   | 100       |   115.958 us |  1.3134 us |   1.0967 us |
| BufferCacheUshort | 100       |    80.768 us |  1.4098 us |   1.3187 us |
| BufferCacheGlyph  | 100       | 1,220.442 us | 23.6995 us |  42.1259 us |
| ArrayPoolUint     | 100       |     1.324 us |  0.0022 us |   0.0020 us |
| ArrayPoolUshort   | 100       |     1.326 us |  0.0030 us |   0.0024 us |
| ArrayPoolGlyph    | 100       |     1.318 us |  0.0022 us |   0.0019 us |
| BufferCacheUint   | 200       |   231.818 us |  1.8254 us |   1.4251 us |
| BufferCacheUshort | 200       |   162.415 us |  1.3721 us |   1.0712 us |
| BufferCacheGlyph  | 200       | 2,415.900 us | 47.8346 us | 103.9884 us |
| ArrayPoolUint     | 200       |     2.646 us |  0.0076 us |   0.0064 us |
| ArrayPoolUshort   | 200       |     2.807 us |  0.0101 us |   0.0084 us |
| ArrayPoolGlyph    | 200       |     2.637 us |  0.0060 us |   0.0050 us |

@onkrot (Author) commented Feb 25, 2024

In the ideal scenario (element count < 1024), BufferCache is faster:

| Method            | FontCount | Mean     | Error     | StdDev    | Allocated |
|------------------ |---------- |---------:|----------:|----------:|----------:|
| BufferCacheUint   | 100       | 1.075 us | 0.0042 us | 0.0038 us |         - |
| BufferCacheUshort | 100       | 1.086 us | 0.0156 us | 0.0130 us |         - |
| BufferCacheGlyph  | 100       | 1.037 us | 0.0032 us | 0.0027 us |         - |
| ArrayPoolUint     | 100       | 1.323 us | 0.0037 us | 0.0029 us |         - |
| ArrayPoolUshort   | 100       | 1.496 us | 0.0024 us | 0.0020 us |         - |
| ArrayPoolGlyph    | 100       | 1.366 us | 0.0250 us | 0.0234 us |         - |
| BufferCacheUint   | 200       | 2.093 us | 0.0057 us | 0.0047 us |         - |
| BufferCacheUshort | 200       | 2.205 us | 0.0065 us | 0.0061 us |         - |
| BufferCacheGlyph  | 200       | 2.358 us | 0.0446 us | 0.0478 us |         - |
| ArrayPoolUint     | 200       | 2.731 us | 0.0450 us | 0.0399 us |         - |
| ArrayPoolUshort   | 200       | 2.638 us | 0.0068 us | 0.0053 us |         - |
| ArrayPoolGlyph    | 200       | 2.676 us | 0.0076 us | 0.0060 us |         - |

@onkrot (Author) commented Feb 26, 2024

The results with BufferCache's maximum cached array size increased to 1024 * 1024 (the default maximum size for ArrayPool<T>.Shared):

| Method            | FontCount | Mean       | Error    | StdDev   | Allocated |
|------------------ |---------- |-----------:|---------:|---------:|----------:|
| BufferCacheUint   | 100       |   600.2 ns |  7.01 ns |  6.56 ns |         - |
| BufferCacheUshort | 100       |   596.4 ns | 11.35 ns | 11.15 ns |         - |
| BufferCacheGlyph  | 100       |   747.6 ns | 11.35 ns | 10.62 ns |         - |
| ArrayPoolUint     | 100       |   729.9 ns | 11.08 ns | 10.37 ns |         - |
| ArrayPoolUshort   | 100       |   731.0 ns |  8.61 ns |  8.05 ns |         - |
| ArrayPoolGlyph    | 100       |   732.3 ns | 11.12 ns | 10.40 ns |         - |
| BufferCacheUint   | 200       | 1,200.8 ns | 23.06 ns | 21.57 ns |         - |
| BufferCacheUshort | 200       | 1,195.4 ns | 23.28 ns | 22.86 ns |         - |
| BufferCacheGlyph  | 200       | 1,187.0 ns | 17.25 ns | 16.13 ns |         - |
| ArrayPoolUint     | 200       | 1,533.5 ns |  1.18 ns |  0.99 ns |         - |
| ArrayPoolUshort   | 200       | 1,466.3 ns |  1.93 ns |  1.81 ns |         - |
| ArrayPoolGlyph    | 200       | 1,456.8 ns | 18.23 ns | 17.05 ns |         - |

@@ -632,7 +642,7 @@ out idealWidth
 GlyphTypeface glyphTypeface = TryGetGlyphTypeface();
 Invariant.Assert(glyphTypeface != null);

-MS.Internal.Text.TextInterface.GlyphMetrics[] glyphMetrics = BufferCache.GetGlyphMetrics(charBufferRange.Length);
+MS.Internal.Text.TextInterface.GlyphMetrics[] glyphMetrics = ArrayPool<MS.Internal.Text.TextInterface.GlyphMetrics>.Shared.Rent(charBufferRange.Length);
@lindexi (Contributor) commented Mar 1, 2024:

Thank you for your contribution. Sorry, I don't believe that putting GlyphMetrics into the ArrayPool is a genuine optimization. Unlike ushort or byte, a GlyphMetrics array is only used in this one context, so it cannot be shared with other modules through ArrayPool. Creating a pooled GlyphMetrics array requires additional objects such as SharedArrayPool<GlyphMetrics>, which seems to cost more than it saves. More importantly, the GlyphMetrics array is acquired and released very frequently: virtually every modern WPF application renders text, so these arrays are constantly rented and returned. In this case it might be worth caching the array in a local variable instead; that is probably the cheapest approach with the highest benefit (a sketch of that idea follows the table below).

| Type                                                                                      | Count | Size | ReferenceSize |
|------------------------------------------------------------------------------------------ |------:|-----:|--------------:|
| SharedArrayPool+Partitions<GlyphMetrics>[]                                                 |     1 |  240 |           240 |
| GlyphMetrics[]                                                                             |     1 |  152 |           152 |
| ConditionalWeakTable+Entry<SharedArrayPool+ThreadLocalArray<GlyphMetrics>[], Object>[]     |     1 |  152 |           152 |
| ConditionalWeakTable+Container<SharedArrayPool+ThreadLocalArray<GlyphMetrics>[], Object>   |     1 |   56 |           264 |
| ConditionalWeakTable<SharedArrayPool+ThreadLocalArray<GlyphMetrics>[], Object>             |     1 |   40 |           328 |
| SharedArrayPool<GlyphMetrics>                                                              |     1 |   40 |           608 |
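
A minimal sketch of the suggested field-cached buffer (the type and member names are hypothetical, not taken from the WPF source):

using System;

// Illustrative only: one reusable buffer per consumer, grown on demand and
// never returned to a shared pool; this keeps the GlyphMetrics array local to
// its single call site instead of routing it through SharedArrayPool<T>.
internal sealed class ScratchBuffer<T>
{
    private T[] _array = Array.Empty<T>();

    public T[] Get(int minimumLength)
    {
        if (_array.Length < minimumLength)
            _array = new T[minimumLength];
        return _array;
    }
}

The call site would then hold a ScratchBuffer<GlyphMetrics> and call Get(charBufferRange.Length) instead of renting from a pool.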

@onkrot (Author) replied:

OK, let me group the proposed solutions:

  1. GlyphMetrics[] - an approach similar to BufferCache, without automatic release
  2. ConditionalWeakTable - automatic release of the ArrayPool? SharedArrayPool already has a mechanism to detect unused arrays and free them on a garbage collection event
  3. SharedArrayPool - the proposed solution

Also, perhaps we could use a stackalloc'd array here when the element count is small enough (maybe 150, i.e. roughly 4096 / sizeof(GlyphMetrics)); a sketch of that follows.
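
A rough sketch of the stackalloc idea (GlyphMetricsStub stands in for the real MS.Internal.Text.TextInterface.GlyphMetrics struct, the 150-element threshold is only the rough 4096 / sizeof(GlyphMetrics) estimate above, and the struct is assumed to be unmanaged):

using System;
using System.Buffers;

internal struct GlyphMetricsStub
{
    public int LeftSideBearing;
    public int AdvanceWidth;
    public int RightSideBearing;
    public int TopSideBearing;
    public int AdvanceHeight;
    public int BottomSideBearing;
    public int VerticalOriginY;
}

internal static class GlyphMetricsBuffers
{
    private const int StackLimit = 150;   // ~4096 bytes / sizeof(GlyphMetricsStub)

    internal static void Process(int glyphCount)
    {
        // Rent only when the count is too large for the stack.
        GlyphMetricsStub[] rented = glyphCount <= StackLimit
            ? null
            : ArrayPool<GlyphMetricsStub>.Shared.Rent(glyphCount);

        Span<GlyphMetricsStub> metrics = rented is null
            ? stackalloc GlyphMetricsStub[StackLimit]
            : rented.AsSpan(0, glyphCount);

        try
        {
            // ... fill and consume metrics[..glyphCount] ...
        }
        finally
        {
            if (rented is not null)
                ArrayPool<GlyphMetricsStub>.Shared.Return(rented);
        }
    }
}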

@ThomasGoulet73 (Contributor) commented:

@onkrot Thank you for this PR. Unfortunately, your benchmark is not representative, because BufferCache will not cache arrays with more than 1024 items (see the limit here). So in your benchmark BufferCache mostly bypasses its cached arrays and allocates on the heap instead. I would also discourage the use of Random in a benchmark; it can make the results unreliable.
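
For reference, the fixed-size, Random-free variant might look like this sketch (only the uint pair is shown; the other benchmarks would change the same way):

using BenchmarkDotNet.Attributes;
using MS.Internal.FontCache;
using System.Buffers;

public class BenchCacheFixed
{
    private const int BufferSize = 1024; // BufferCache's caching limit

    [Params(100, 200)]
    public int FontCount { get; set; }

    [Benchmark]
    public void BufferCacheUint()
    {
        for (int i = 0; i < FontCount; i++)
        {
            var array = BufferCache.GetUInts(BufferSize);
            BufferCache.ReleaseUInts(array);
        }
    }

    [Benchmark]
    public void ArrayPoolUint()
    {
        for (int i = 0; i < FontCount; i++)
        {
            var array = ArrayPool<uint>.Shared.Rent(BufferSize);
            ArrayPool<uint>.Shared.Return(array);
        }
    }
}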

Running your benchmark with a fixed buffer size of 1024 on my machine shows that BufferCache is faster than ArrayPool (probably because ArrayPool is a generic pool, whereas BufferCache only has to handle its 3 supported array types):

| Method            | FontCount | Mean       | Error   | StdDev  |
|------------------ |---------- |-----------:|--------:|--------:|
| BufferCacheUint   | 100       |   601.3 ns | 1.65 ns | 1.47 ns |
| BufferCacheUshort | 100       |   601.0 ns | 1.12 ns | 1.05 ns |
| BufferCacheGlyph  | 100       |   590.1 ns | 1.30 ns | 1.16 ns |
| ArrayPoolUint     | 100       |   761.9 ns | 1.69 ns | 1.58 ns |
| ArrayPoolUshort   | 100       |   741.3 ns | 1.76 ns | 1.65 ns |
| ArrayPoolGlyph    | 100       |   743.2 ns | 1.69 ns | 1.58 ns |
| BufferCacheUint   | 200       | 1,208.8 ns | 1.38 ns | 1.29 ns |
| BufferCacheUshort | 200       | 1,210.1 ns | 2.28 ns | 2.02 ns |
| BufferCacheGlyph  | 200       | 1,199.1 ns | 1.71 ns | 1.60 ns |
| ArrayPoolUint     | 200       | 1,522.8 ns | 2.72 ns | 2.41 ns |
| ArrayPoolUshort   | 200       | 1,615.8 ns | 3.82 ns | 3.57 ns |
| ArrayPoolGlyph    | 200       | 1,480.6 ns | 1.75 ns | 1.56 ns |

@ThomasGoulet73 (Contributor) commented:

After looking back at this PR, it seems I missed the fact that you already mention the 1024-item limit, so you can disregard my comment. I still think my suggestion of removing Random from your benchmark applies, though.

For this PR, it looks like the benefit is improved performance for larger arrays at the cost of reduced performance for smaller arrays?

@onkrot (Author) commented Mar 8, 2024

@ThomasGoulet73 As the benchmark shows, yes. But BufferCache has another flaw that I could not capture in the benchmarks: its cached array grows linearly. With unfavorable input, it will allocate a new array on every request. ArrayPool, on the other hand, uses exponentially sized buckets and can handle the same requests with far fewer allocations.
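
To make the growth argument concrete, here is a small illustrative model. It assumes BufferCache keeps one cached array per element type and reallocates whenever a request exceeds the cached length (exact-fit growth), and contrasts that with a power-of-two bucketing policy like ArrayPool<T>'s; the counts come from this model, not from the real classes:

using System;
using System.Numerics;

internal static class GrowthModel
{
    internal static void Main()
    {
        int exactFitAllocations = 0, bucketAllocations = 0;
        int cachedExactLength = 0, cachedBucketLength = 0;

        // "Bad input": each request is one element larger than the previous one.
        for (int requested = 1; requested <= 10_000; requested++)
        {
            // Exact-fit cache (BufferCache-style model): any growth reallocates.
            if (requested > cachedExactLength)
            {
                cachedExactLength = requested;
                exactFitAllocations++;
            }

            // Power-of-two buckets (ArrayPool-style model): growth happens only
            // when the request crosses into a larger bucket.
            int bucket = (int)BitOperations.RoundUpToPowerOf2((uint)requested);
            if (bucket > cachedBucketLength)
            {
                cachedBucketLength = bucket;
                bucketAllocations++;
            }
        }

        Console.WriteLine($"Exact-fit allocations:    {exactFitAllocations}"); // 10000
        Console.WriteLine($"Power-of-two allocations: {bucketAllocations}");   // 15
    }
}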

@lindexi (Contributor) commented Mar 8, 2024

@onkrot With typical business logic, it is hard to hit that bad input.
