Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CSharp] Allow Span<byte>-based parsing in CodedInputStream #3431

Closed
mkosieradzki opened this issue Jul 31, 2017 · 70 comments
Closed

[CSharp] Allow Span<byte>-based parsing in CodedInputStream #3431

mkosieradzki opened this issue Jul 31, 2017 · 70 comments

Comments

@mkosieradzki
Copy link
Contributor

mkosieradzki commented Jul 31, 2017

IMO C# protobuf implementation could strongly benefit performance-wise from allowing some unsafe code.

Background

First of all I am not saying that unsafe should be enabled in every build/every platform, having 100% managed library is always a nice feature. All I am suggesting is again some conditional feature on some platform.

However there is a trend to move towards unsafe mixing we can observe in .NET Core.
For example a lot of .NET Core libraries use unsafe code due to need to interop with different kind of unmanaged libraries or just to handle AOT compiled code.

API change

Today CodedInputStream works in two modes: streaming with byte[] buffer and with a fixed buffer.

It would be really nice to be able to provide byte * and length to CodedInputStream constructor.

Alternatively we could create an UnmanagedCodedInputStream which would work on unmanaged memory.

This would require to abstracting CodedInputStream as an interface

Affected APIs

Async API #3166 makes absolutely no sense for UnmanagedCodedInputStream because it assumes everything have been already read.

Benefits

  1. It's extremely useful for interop (See scenario 1)
  2. It can work with stackalloc (See scenario 2)
  3. It allows even quicker deserialization we can quicker deserialize byte * to for example fixed-size primitive types by pointer dereference. (Endiannes might be a problem here)

Scenarios

Scenario 1. NoSQL Database

User is using native NoSQL database (for example RocksDB/LevelDB) and is using protobuf for data persistence. Database returns native allocated pointer with buffer. Instead of copying entire buffer to the memory user can deserialize record straight from the returned memory native memory and then free the pointer. Very little GC is involved and there is practically no overhead.

Scenario 2. Stackalloc-ated buffers

For performance reasons to decrease heap allocations and allocate buffer on stack. This scenario can be alternatively handled by Buffer Pooling, however stackalloc seems to be more efficient.

Affected by

https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md - this work in progress improvement can allow to achieve the same goals in a managed way by providing some platform improvements.

@Aaronontheweb
Copy link

Came here to say the same thing. Byte array copying inside Protobuf3 is one of the biggest sources of GC pressure in Akka.NET; we would like to be able to use pooled byte arrays (or Spans when they become part of the framework) and not have ByteString do a deep clone of them each time we use it.

Yes, we get it - it's dangerous. Understood, but let the end-users take the risk in order to put Protobuf to work in high performance contexts.

@alexvaluyskiy
Copy link

Protobuf Span API could look like this

namespace Google.Protobuf
{
    public class CodedInputStream
    {
        public CodedInputStream(ReadOnlySpan<byte> buffer);}
    
    public class CodedOutputStream
    {
        public CodedOutputStream(Span<byte> buffer);}
    
    public class MessageParser
    {
        public IMessage ParseFrom(ReadOnlySpan<byte> input);}
    
    public static class MessageExtensions
    {
        public static void MergeFrom(this IMessage message, ReadOnlySpan<byte> data);
        public static void WriteTo(this IMessage message, Span<byte> output);}
}

@alexvaluyskiy
Copy link

@Aaronontheweb At least this class should be public
https://github.com/google/protobuf/blob/master/csharp/src/Google.Protobuf/ByteString.cs#L60

@mkosieradzki
Copy link
Contributor Author

Thanks for backing me up :). If this proposal is accepted I am happy to create a proper PR. Problem with Span/ReadOnlySpan is that AFAIK it is not available yet and also that @jskeet will not be very keen on accepting PR requiring very-new compiler support and runtime support.

So I would be probably rather for the unsafe approach which can be widely used today.

@Aaronontheweb
Copy link

@mkosieradzki if they don't want to go forward with just exposing the Unsafe methods they have today, adding some overloads which use ArraySegment<byte> would also be a workable option in lieu of Span

@alexvaluyskiy
Copy link

alexvaluyskiy commented Aug 2, 2017

@mkosieradzki System.Memory package supports netstandard1.0

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Aug 2, 2017

@alexvaluyskiy Thanks. I have just checked that this package has been released in preview 3 months ago.

Are you aware whether it requires some specific compiler version? Or are you aware of any roadmap/official documentation?

@Aaronontheweb ArraySegment is not comparable to Span, because ArraySegment requires byte[] (does not support byte *) and Span can work on everything including byte[] and byte* or any other generic kind of memory.

@jskeet jskeet added the c# label Aug 2, 2017
@jskeet
Copy link
Contributor

jskeet commented Aug 2, 2017

I'm in two minds about this. I do see the point - but I'm definitely concerned about the level of complexity we end up with, in terms of lots of different builds for different scenarios. (Adding a dependency for Span also makes me nervous, but I'm gradually warming to the idea of ValueTask for the async part...) The async aspect adds another layer of complexity here - by the time we've got a matrix of managed/unmanaged, async/non-async, multiple versions of .NET supported, I think it's going to get tricky. It makes it really easy to break customers with what may seem like a trivial change.

There's also the aspect of time commitment from Google to maintain all of this. I'm not on the protobuf team, and have a full plate already - so @anandolee would need to be very comfortable with it all.

In terms of ByteString - no, the Unsafe class should not be made public, in the same way that I would never expect Microsoft to release a public String.Unsafe class that made it trivial to mutate strings. ByteStrings are intended to be immutable; ByteString.Unsafe exists to allow us to be more efficient where the library can guarantee that sharing a byte array will not exhibit mutability.

One option you may want to consider is forking the library entirely, creating an OptimizedAndVeryUnsafe.Protobuf library which is wire compatible but exposes everything you've ever wanted to, pays less attention to being massively backward compatible all the time, takes whatever dependencies it wants and generally gives the user more of a loaded shotgun. I've always tried to optimize within the bounds of making the code easy to use safely and hard to shoot yourself in the foot. If you want to remove those restrictions, that's fine - but it shouldn't be in the same library.

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Aug 2, 2017

@jskeet Thanks a lot for your response. My opinion would be to take the System.Memory path (without explicit unsafe). Span<byte> and ReadOnlySpan<byte> are incredibly great and safe wrappers for both byte[] and byte *.

If we could use System.Memory it would do all the heavy lifting (without requiring any UnmanagedCodedInputStream). We could just replace internal buffer with Span probably without changing a lot of code...

So I would suggest waiting for System.Memory release or I can try to create a fork using the preview version and we can merge it into the main library when ready.

Knowing that System.Memory is going to be released in a predictable future I think that special handling for byte * makes no sense.

I also think it's very similiar case as with ValueTask... (and System.Buffers for a shared buffer pool) - it's yet another dependency required to achieve optimal performance.

BTW. For more distant future I am also researching different ways to achieve Arena allocations for C#. There is a promising project called Snowflake https://www.microsoft.com/en-us/research/publication/project-snowflake-non-blocking-safe-manual-memory-management-net/# .

@jskeet
Copy link
Contributor

jskeet commented Aug 3, 2017

Yes, I'm definitely happier with Span than anything to do with unsafe code. However, I wouldn't want to do anything with it until it's had a full (non-beta) release. We could create PRs before then of course, but we shouldn't merge them.

@mkosieradzki
Copy link
Contributor Author

@jskeet Thanks! I will try to create a PR in August so we can preview this.

@alexvaluyskiy
Copy link

@mkosieradzki @jskeet I think the new Span API should be similar to java's protobuf ByteBuffer Api

@mkosieradzki
Copy link
Contributor Author

Exactly as I was afraid - Span requires new compiler version: aspnet/Announcements#268 ...

@jskeet
Copy link
Contributor

jskeet commented Aug 9, 2017

Let's see where it lands, in terms of requirements. It may be that it'll be harmless to expose it so long as Google.Protobuf is compiled with C# 7.2, which I'd be comfortable with (after that's released).

@dchennells
Copy link
Contributor

Tooling support for the C# 7.2 Span framework types was added with VS2017 15.5, which was RTMed on Dec 4. Span is now well documented and there are a couple of good collateral pieces, including one by Stephen Toub, linked here.

https://blogs.msdn.microsoft.com/dotnet/2017/11/15/welcome-to-c-7-2-and-span/

A couple of issues for discussion:

  1. At what point would it make sense to introduce the requirement for C# Google.Protobuf developers to be on VS2017 15.5+?
  2. For the project owner: would adding this to the mix be seen as an enhancement or rather as an "optimization" seeking a problem?

I'd be willing to work on this if there were to be a clear interest (or acceptance criteria) with respect to number two. and we're within about six weeks on number one (@anandolee & @jskeet).

@jskeet
Copy link
Contributor

jskeet commented Jan 4, 2018

I think it's reasonable to require that anyone developing the library uses VS2017 15.5 or the equivalent .NET Core SDK. There's a chunk of work required in order to update the SDK for continuous integration though.

Whether the Span methods are included in all output libraries is a different matter though. We'll need to look at the dependencies and whether there are issues consuming Span from older versions of VS/C#. (For example, if VS2015 users could easily end up using Span in a dangerous way, we may want to add a netstandard2.0 target and only expose Span there. We'll have to see.)

I think the first port of call should be some prototyping and benchmarking.

@jtattermusch
Copy link
Contributor

I think there's a good potential for using Span optimizations in gRPC C# (general idea: there's the grpc_csharp_ext native library and the messages need to be moved between managed and native layer and copying can be expensive. Being able to transparently address both managed/unmanaged memory with the same code seems useful.) - but we haven't really looked into these optimizations in detail (and there's complications so it's not possible to just say if it is going to be worth it without proper analysis and some experimenting).

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jan 4, 2018

I have started some experimentation in #3530 . I need to revisit my experiments since 15.5 is out. Before proper tooling support Span was super-slow (as expected).

Also it's important to remember that Span is a stack-only type - what has a lot of serious implications:
For example we might need to:

  1. make CodedInputStream a ref struct (doesn't seem right)
  2. or use heap-friendly Memory instead of Span
  3. or something else

This is especially difficult in context of #3166 - I have a lot of doubts related to the proper async-support.

I would really love to start a high level design discussion.

@dchennells
Copy link
Contributor

dchennells commented Jan 5, 2018

Perhaps we should first discuss where we believe the value for C# protobuf could originate with the new framework types and then try to align on a preliminary approach to assess whether that value actually exists. I'm mindful that this is already a well-crafted code-base and we also face the barrier/cost @jskeet noted of changing the SDK. No one wishes to waste their time developing PRs that never see the light of day, so we should try to fail sooner rather than later in the case that there isn't much value.

The rationale for the first wave of these framework types can be boiled down to a few major categories, based on what I'm seeing. Feel free to add or subtract from this picture:

  • Reducing cost of allocations
    • Allocations resulting from slicing/substring operations
    • Allocations that could now conditionally be put on the stack and in a safe/managed manner (when they are local arrays of primitives that turn out to be small)
  • Increasing managed code goodness
    • Reducing the unsafe footprint in the use and passing around of unsafe pointers in our library code

I did a little preliminary mechanical code analysis/triage and here is what I found. Outside of the context of unit tests, there are about 150 instances in the code of obvious allocations as indicated from the following starter list of search terms: "new byte["; ".Copy"; ".BlockCopy"; ".ToArray("; ".Substring("; "new MemoryStream"; ".Clone()." If you are interested, you can see the list in the attached spreadsheet.

Of these, probably only a small subset would be worth attacking, at least initially, and specifically those that:

  • can be avoided with the new types/overloads;
  • are in hot code regions (i.e. can deliver significant improvement in the benchmarks)

(As we know, there is no unsafe code presently in the code base.)

Purely for the purposes of initial discussion, a generic plan for the first couple of iterations in this type of situation, where we're skeptical of the benefits and want to proceed cautiously to avoid time wastage, might be as follows:

  1. Establish benchmark details (procedures, platform(s) and artifacts)
  2. Generate benchmarks for the current C# release (tagged as 3.5.1).
  3. Seek code sites for proof of concept ("POC") interventions by looking to the intersection of: (a) hot code regions (as identified by a tool such as PerfView while running the benchmark cases); and, (b) code regions that have buffer-oriented allocations that might benefit from the new framework types
  4. Select the most promising 2 - 8 code sites for preliminary interventions with span etc.
  5. Within a fork of the same release used above, complete a quick-and-dirty POC iteration.
  6. Obtain preliminary comparative benchmarks for the POC and come back up for air to the discussion in order to evaluate the value of proceeding with a more systematic/disciplined effort.
  7. Decide whether to proceed, on what timing, and who to assign as reviewer(s) etc.
  8. If it's a "Go", begin working on the actual PR(s)...

Let's have a good round of discussion on these details, alternative approaches, and any other thoughts. If something like the above turns out to make sense, I've also included a list of logistical questions in the attached file, which someone from Google could address. There are also some resources on this type of code-optimization effort.

Allocations Inventory, Questions and Resources.xlsx

@jtattermusch
Copy link
Contributor

@mkosieradzki can you clarify the backwards compatibility of Span:

  • What is the lowest version of .NET where Span can be used in the code? (not necessarily with seeing performance gains, not making things slower is enough - my understanding is that Span has a fallback implementation in case it's not supported by the runtime). Google.Protobuf needs to support at least net45+
  • What is the oldest runtime where one can see performance gains from Span (= where Span is properly supported)?

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jan 5, 2018

@jtattermusch
That's topic worth researching - I don't know exactly, and the answer might be very difficult, because it consist of 3 elements:

  • library (compatibility) - I believe System.Memory requires .NET Standard 1.1
  • compiler (ref struct concept and safety checks)
  • runtime optimizations (which are distributed among different versions of runtime starting with .NET Core 2.0 (http://adamsitnik.com/Span/) but I would not be surprised if there would be another bunch of significant improvements in .NET Core 2.1 and later.

According to: https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md

Runtimes that support by-ref fields and returns will get the fast ref-field Span<T>. Other runtimes will get the slower three-field Span<T>.

From benchmarks at http://adamsitnik.com/Span/ we can see 5-10% performance degradation of Spans vs Arrays on reads (on pre .NET Core 2.0 runtimes), however I believe that introducing safe stackallocs might help regain the performance even on older runtimes.

The biggest benefit from Span<T> is ability to have a single API for unmanaged and managed. However it will definitely cost us a huge API redesign as Span<T> cannot be member of non stack-only type. Let me emphasize this: You cannot make Span member of CodedInputStream and this defies all simple refactorings.

To be able to benefit from using Span<T> we need to talk about designing a completely different API.

And having this in mind. We have 2 possible approaches:

  1. Let's go fully unmanaged - Use pointers
    pros: fastest, reverse compatible, we can consider keeping buffer as a CodedInputStream class member, compatible with async approach
    cons: will not work on partial trust environments (this is a legacy concept), require more serious code reviews, when using unmanaged memory we need to do managed pointer pinning.
  1. Let's go Span<T>
    pros: safe, modern, requires pointer pinning, might work on some fully managed enviroments (I am not sure)
    cons: slower on legacy runtimes, need really serious redesign, will not work with async

I believe that Span<T> is a lot about using call stack to guarantee lifetime of the pinned memory pointer.

Also to address problems about async support:

  • Despite I am the author of async support (see Support for async in C# version #3173 ), I am not 100% convinced this is the way to go. I believe it can be a reasonable approach to use protocol buffers in a following manner:
  1. Pre-fetch entire buffer (handling delimiters)
  2. Schedule parsing using fast path only on a pre-fetched buffer.

This manner should be also compatible with gRPC streaming approach (and this is a level where I believe asynchrony fits the best).

This minimizes the time buffer needs to be pinned and strongly simplifies the code.

As I have mentioned before I have created a prototype for unsafe version of protobuf introducing even arena allocator approach to protobuf (see #3530 ).

I believe that introducing arena allocation together with unsafe approach will bring more and more benefits.

The last but not least: using unmanaged memory can allow implementing faster parsing by unsafe casting buggers into primitive types like int or long. Both protobuf and x86 are little-endian what should significantly improve parsing speed.

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jan 5, 2018

On more point about the arena allocations support:
It could be reasonable to use the memory from the original buffer in that case it makes parsing even faster. The most important thing to remember about this parsing approach is that arena-allocated + buffer-sourced data should be short-lived and the API should provide a way to transform it into a long-living heap-allocated C# objects if required by the developer.

In many scenarios long-living heap-allocated objects are not required or helpful at all.

The fact that we are using C# should not force us to downgrade to a Java-like approach with zero control over the allocations.

@dchennells
Copy link
Contributor

dchennells commented Jan 5, 2018

I'm thinking of Span as useful primarily in the narrow case of discrete, single threaded, synchronous functions and their helpers. In that case, the ability to potentially put a short array on the stack or to encapsulate neatly a reference to a slice or substring that can be passed to a helper without incurring an allocation or muddying the water with offsets is a nice (but limited) gain.

As far as the (different) topic of a public interface for core memory-encapsulating buffer objects, I'll defer to the prior participants in that conversation except to note that in my own work, which heavily depends on these, I almost always use such objects now in the context of various (multi-threaded and often parallelized) pipeline patterns in which exposing (heap-based) arrays as public properties is indeed indispensable. I try to avoid GC-pressure from a ton of small buffer or string allocations in those patterns; size buffers based on what's found to be optimal for a machine architecture (not individual document/message/cell size); and I reuse managed arrays in rotating, multi-segment buffers (with coarse-grained write vs. read gating across stages). For what it's worth, my own benchmarking has not demonstrated performance benefits for sequential access scenarios from using pointers with or without native memory regions in that particular context (though I do use pointers, unsafe code, and native memory/file handles for other use cases). All the same, a reusable core buffer strategy requires considerable bookkeeping. A preference for managed core buffer objects with ordinary indexed access may come down simply to managed lifetime as has been noted and cross-platform portability.

If folks ultimately conclude that Spans vs. core buffer refactoring are different topics, perhaps we should spin off the Spans to a separate issue?

@jtattermusch
Copy link
Contributor

@mkosieradzki I looked a bit more into what would the dependency on Span<> mean for Protobuf and gRPC:

System.Memory depends on netstandard1.0 (I believe it will use the unoptimized "fallback" version unless you target netstandard2.0), which means you can use it in net45 projects, but you will need a newer version of nuget and a newer IDE to be able to build those projects (they need to have knowledge of what "netstandard1.0" is)

gRPC currently targets net45 and netstandard1.5, Protobuf targets net45 and netstandard1.0.

As both gRPC and Protobuf currently target net45 explicitly, adding a dependency on System.Memory would have this effect:

  • users using older versions of Visual Studio (e.g. 2013 and I think some older versions of 2015) and mono unable to build their projects (as they wouldn't be able to resolve the System.Memory dependency). Users would be able to build their legacy projects if they upgraded their IDE / toolchain.

Also see: http://adamsitnik.com/Span/#how-to-use-it-today

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jan 9, 2018

@jtattermusch I might be wrong, but my understanding is:

  1. Framework target has no impact on performance at compile time. Neither package target framework version nor current compilation target version. Specifically if you target for example .NET Standard 1.1 you have a single assembly that does not need any recompilation - no matter what the compile target of the entire application is, and moreover no matter whether target runtime will support in-boxed Span or not.
  2. If I understand correctly fast version of Span is provided by the runtime (as an in-boxed type) (newer .NET Framework version installed on the machine or newer .NET Core runtime). My understanding is that if a compatible runtime is used, then bindings are redirected to the in-box version of Span.
  3. Newer version of compiler is required to enforce safety checks for Span as a stack only type and to be able to use ref struct concept. If you are using older compiler you can create incorrect library: I mean something that will throw an exception when an optimized version of Span is used. It also allows you to use type-safe stackallocs etc without using /unsafe. It might also introduce some compile-time optimizations, but I am not sure about this one.
  4. https://www.nuget.org/packages/System.Memory/4.4.0-preview1-25305-02 - if you look at the dependencies for this package - it has a reverse compatible target for net45, so you should still be able
    to compile with double targets net45;netstandard15 and net45;netstandard10 - and use System.Memory package.

@kwaclaw
Copy link

kwaclaw commented Feb 5, 2018

Coming to this discussion late.

Having gathered a little bit experience working on a Span based API for LMDB (https://github.com/LMDB/lmdb) I would agree that an entirely new API is necessary.

From a user persective the simplest things would be to add an overload of WriteTo like this:
public void WriteTo(Span<byte> output)

and one for parsing, like this:
public T ParseFrom(ReadOnlySpan<byte> data)

Benefits are two-fold:

  1. Span can wrap native pointers, it really becomes a universal buffer API.
  2. This makes it possible to have zero-copy interop with native libraries.

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jul 18, 2018

@jtattermusch I was rather thinking about ref ReadOnlySpan<T> immediateBuffer so while iterating over ReadOnlySequence we can move pointer to the next buffer.

When iterating over a Stream or an array based buffer it's enough to use immutable reference ReadOnlySpan<T>, but our goal is to handle ReadOnlySequence<T> efficiently.

BTW. My preliminary results

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.97 us | 0.1073 us | 0.1004 us |    5 |
                 ParseUsingClassicInline | 15.97 us | 0.1102 us | 0.0977 us |    4 |
                           ParseUsingNew | 14.97 us | 0.0939 us | 0.0878 us |    3 |
       ParseUsingCodedInputSpanPosParser | 13.06 us | 0.0493 us | 0.0461 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.49 us | 0.0803 us | 0.0751 us |    1 |

ParseUsingNew is a heavy-inlined version of classic (classic already benefits from some inlinining so it is not a true baseline) version optimized with an additional parameter ref ReadOnlySpan<T> (as you suggested). I need to track down where the 2 us are lost.

The current experimental code is available: https://github.com/mkosieradzki/protobuf/tree/spans/csharp/src/TestProtoPiper/Benchmarks - it looks pretty bad, but it's only to prototype.

@mkosieradzki
Copy link
Contributor Author

Found 0.7us ;) and accidentally boosted classic parser

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.19 us | 0.0571 us | 0.0534 us |    5 |
                 ParseUsingClassicInline | 14.86 us | 0.0341 us | 0.0319 us |    4 |
                           ParseUsingNew | 14.24 us | 0.0361 us | 0.0338 us |    3 |
       ParseUsingCodedInputSpanPosParser | 12.84 us | 0.0471 us | 0.0441 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.23 us | 0.0505 us | 0.0473 us |    1 |

It was hidden in an ineffient IsAtEnd implementation which didn't take the fast track if Stream was null. Still ~1.4us to find.

@mkosieradzki
Copy link
Contributor Author

Found another 0.2us in toxic inlining of slow-path, ~1.2 to go :)

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.31 us | 0.0957 us | 0.0896 us |    5 |
                 ParseUsingClassicInline | 14.73 us | 0.0480 us | 0.0449 us |    4 |
                           ParseUsingNew | 14.05 us | 0.0543 us | 0.0481 us |    3 |
       ParseUsingCodedInputSpanPosParser | 12.78 us | 0.0624 us | 0.0583 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.24 us | 0.0527 us | 0.0493 us |    1 |

@mkosieradzki
Copy link
Contributor Author

Picked-up the low-hanging fruit of indirect interface call. I think that the rest of the overhead resides inside the PushLimit / PopLimit / RecomputeBufferSizeAfterLimit - I think it can be optimized if the current message fits the current buffer - but let's leave this for later.

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.27 us | 0.1190 us | 0.1055 us |    5 |
                 ParseUsingClassicInline | 14.85 us | 0.0533 us | 0.0499 us |    4 |
                           ParseUsingNew | 13.79 us | 0.0602 us | 0.0563 us |    3 |
       ParseUsingCodedInputSpanPosParser | 12.67 us | 0.0683 us | 0.0605 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.55 us | 0.0490 us | 0.0459 us |    1 |

@jtattermusch

Long story short - I think that the concept of adding ref ReadOnlySpan<T> immediateBuffer is feasible (without generating dedicated code just for Span<T>) and worth the effort. Let's create a separate branch and create an experimental implementation.

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jul 19, 2018

@jtattermusch I have created a branch https://github.com/mkosieradzki/protobuf/tree/spans-pr with added spans support. (also there is branch when I do most of active development https://github.com/mkosieradzki/protobuf/tree/spans-pr-workspace)

It does not have all of the aforementioned (but significant amount) optimizations, however it is feature complete, passes all tests, has Code Access Security attributes.

I have split my changes into 4 commits: SDK-change, protoc-change, regenerated code, updated tests

This branch should be a nice starting point to add ReadOnlySequence<byte> support as no code in CodedInputStream depends on byte[] buffer. Everything utilizes ReadOnlySpan<byte> immediateBuffer except of the stream reading. I am anticipating you can add ReadOnlySequence<byte> in a similar way to Stream. As a source for immediate buffers (I can do this if you like).

As a small bonus I have added ReadOnlySpan<byte> support to ByteString what gives a secure access to the ByteString. I would like to suggest to change string to ByteString (or at least add such an option) as this is the best UTF-8 representation available today. One can use the exposed span together with the System.Buffers.Text.Utf8Parser.

How should we distribute/name/etc this package before it lands in the upsteam repo?

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jul 20, 2018

Progress report for https://github.com/mkosieradzki/protobuf/tree/spans-pr-workspace

  • Basic support for ReadOnlySpan<T> when deserializing
  • Basic support for Span<T> when serializing
  • Improve codegen to avoid using the inefficient FieldCodec abstractions:
    • Deserialization
    • Size computation
    • Serialization
    • Other?
  • Add inlining hints when applicable
  • Hide all public internal-use APIs from end-users
  • Add Code Access Security attributes where applicable
  • Add/Update XML Comments to all non-hidden APIs
  • Change XML Comments for hidden APIs as internal-use only
  • Get rid of the entire FieldCodec abstraction
  • Add nice API extensions to retain code-level compatibility (no binary compatibility is planned) with existing SDK
  • Add support for non-contiguous Memory<T>-based buffers
    • Serialization (IBufferWriter<byte>)
    • Deserialization (ReadOnlySequence<byte>)

Future work:

  • Try to avoid UTF8 -> UTF16 conversion when possible - we should wait for Utf8String
  • Integrate with Pipelines (should be extremely easy - PipeReader returns ReadOnlySequence<byte> and PipeWriter implements IBufferWriter<byte>) - so I am not even sure if we should implement this or it would be a trivial element of gRPC Server I am planning to implement basing on .NET Core 2.2 HTTP/2 support (it's based on Pipelines) when .NET Core 2.2 Preview 1 will be public.

@jtattermusch

  • Figure out what to do with this PR/package
  • Integrate with gRPC
  • Prepare independent benchmarks
  • Google paperwork (formal proposal, etc)

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jul 20, 2018

@jtattermusch code in branch https://github.com/mkosieradzki/protobuf/tree/spans-pr-workspace is now entirely Span-compatible, it is also hard-generated instead of the inefficient FieldCodec-abstraction. I expect a huge performance boost. (btw. the best effects will be observable when used from netcoreapp2.1).

@xfxyjwf
Copy link
Contributor

xfxyjwf commented Jul 20, 2018

@jtattermusch I'm assigning this issue to you since you have been following up with the proposal.

@mkosieradzki
Copy link
Contributor Author

@jtattermusch It's ready. I don't plan any significant work on this repo now. I have also merged spans-pr-workspace to span-pr.

@davidfowl
Copy link

This was a great discussion to read and I'm happy to share that we went through a very similar exercise when we were writing the HTTP/1.1 parser for Kestrel. The conclusion was that for optimal performance when using ReadOnlySequence, you need to parse a Span at a time, and only use the higher level multi buffer operations when crossing buffer boundaries.

Simple example here -https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L41-L62

More complex example here - https://github.com/aspnet/KestrelHttpServer/blob/612fcca7291e5fad753fccbd8b15279c21e2e288/src/Kestrel.Core/Internal/Http/HttpParser.cs#L198-L209

We're looking at adding a ref struct BufferReader/BufferWriter for the next version to make things a a bit easier to use and more efficient in the common case. We ended up doing this in Kestrel and work is happening now to make it part of the BCL (https://github.com/dotnet/corefxlab/blob/08a1874a0ca4a1a889b2045801c309a1e8575458/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader.cs).

The other interesting thing in this discussion was about async parsing vs sync parsing. I believe that for optimal performance, the best thing to do is to write low level synchronous parsers and build asynchronous parsing on top of that. It looks like you guys have come to a similar conclusion here. In my mind there are 2 patters:

  1. The parser is stateless and works on complete messages
  2. The parsers is stateful and resumable
    • The state can be passed into the parser
    • The state can be kept by the parser

The first makes the parser very simple but consumers to worry about the max message size to avoid a DoS.

The second parser is much harder to write and is where async helps (it basically automatically stores the state on the heap and resumes on your behalf). This state now needs to be managed manually by the parser itself.

Using pipelines though the unconsumed buffer is managed managed on your behalf so the parser just needs to know where it left off and needs to be able to resume based on the parser context.

Here's how I think each of these could look:

Stateless Parsing

async Task StatelessParsing()
{
    var input = connection.Input;

    while (true)
    {
        var result = await input.ReadAsync();
        var buffer = result.Buffer;

        try
        {
            if (result.IsCanceled)
            {
                break;
            }

            if (!buffer.IsEmpty)
            {
                while (Protocol.TryParseMessage(ref buffer, out var message))
                {
                    await _dispatcher.DispatchMessageAsync(connection, message);
                }
            }

            if (result.IsCompleted)
            {
                if (!buffer.IsEmpty)
                {
                    throw new InvalidDataException("Connection terminated while reading a message.");
                }
                break;
            }
        }
        finally
        {
            // The buffer was sliced up to where it was consumed, so we can just advance to the start.
            // We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data
            // before yielding the read again.
            input.AdvanceTo(buffer.Start, buffer.End);
        }
    }
}

Stateful Parsing

async Task StatefulParsing()
{
    var input = connection.Input;
    ParserContext parserContext;

    while (true)
    {
        var result = await input.ReadAsync();
        var buffer = result.Buffer;

        try
        {
            if (result.IsCanceled)
            {
                break;
            }

            if (!buffer.IsEmpty)
            {
                while (Protocol.TryParseMessage(ref buffer, ref parserContext, out var message))
                {
                    await _dispatcher.DispatchMessageAsync(connection, message);
                }
            }

            if (result.IsCompleted)
            {
                if (!buffer.IsEmpty)
                {
                    throw new InvalidDataException("Connection terminated while reading a message.");
                }
                break;
            }
        }
        finally
        {
            // The buffer was sliced up to where it was consumed, so we can just advance to the start.
            // We mark examined as buffer.End so that if we didn't receive a full frame, we'll wait for more data
            // before yielding the read again.
            input.AdvanceTo(buffer.Start, buffer.End);
        }
    }
}

Of course this is just the outer code, the more complicated code exists in the parser itself as each of the methods needs to flow the context in order to resume parsing.

Glad to see all of the energy and progress here!

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jul 21, 2018

@davidfowl Thank you for all your valuable points. I will take a closer look on the suggestions, but I really think we want to sacrifice async parsing for simplicity and performance.

I was considering stateful parsing as an option before, but I think it would make the codegen much more complicated - and especially after removing this FieldCodec abstraction responsible for virtcalls - current codegen is so incredibly fast. There also security considerations with gRPC - you don't want to start message parsing before it is already there.

@mkosieradzki
Copy link
Contributor Author

I have created a new branch: https://github.com/mkosieradzki/protobuf/tree/spans-pr-rebased

It is a rebased core code to make code-review, and further development easier.
It does not contain generated code to make rebasing to the latest version much easier - learned the hard-way when creating async PR.

Also I code on Windows (VS2017) and previously had problems with line endings in the generated code - so the code should be generated on Linux (or Bash on Ubuntu on Windows :) ).

@jtattermusch
Copy link
Contributor

jtattermusch commented Jul 26, 2018 via email

@jtattermusch
Copy link
Contributor

@mkosieradzki I started reviewing your code, but haven't noticed the spans-pr-rebased branch initially, which made the review more complicated. Here are some initial comments:

— I noticed there's a change from netstandard1.0 -> nestandard1.1, is that necessary? Making changes like this is theoretically possible but can slow down the review process quite a bit.

— (this point might be not applicable for spans-pr-rebased branch). It seemed that the generated .cs files were out of date when starting the branch - we should make sure C# photos are regenerated (let’s create a PR for that) before starting, otherwise the diffs for generated files are confusing (it's hard to tell new apart the changed parsing logic from the fields added to .proto files) Ideally, regenerating the protos would only happen in a single commit so that the rest of the code can reviewed commit-by-commit without needing to scroll through hundreds of lines of diffs in generated files.

— Providing the public ReadOnlySpan<byte> Span => bytes is a good idea and we should do it once the dependency on System.Memory is added. But it seems to be pretty independent from the rest of the parsing logic and should come as a separate PR. Same for ByteString CopyFrom(ReadOnlySpan<byte> bytes).

— aggressive inlining optimization (and other optimizations) should go in as a separate PR (1 PR per one optimization technique).

— what's the difference between Wrapped vs no-suffix when parsing? Explaining this concept would simplify reading the code review a lot. This seems to be the most interesting piece when it comes to parsing from Spans so we should make sure we are on the same page here.

— I've notices some refactoring in CodedOutputStream.ComputeSize.cs (and possibly elsewhere), but I think combining refactoring and changes in logic makes the change very hard to review and greatly increases the risk we will miss something important.

— I've noticed CodedOutputStream.cs changes -> are they necessary? IMHO we should address the serialization and deserialization pieces separately.

I'm currently trying to review spans-pr-rebased

@jtattermusch
Copy link
Contributor

jtattermusch commented Jul 26, 2018

@mkosieradzki you've come up with a great amount of good changes, but I'm now thinking of how to make the review process efficient and how we can integrate this. Here's what I suggest:

  • I created a branch in my fork: https://github.com/jtattermusch/protobuf/tree/span_support_experimental (currently in sync with protobuf master). I think we could review the code piece-by-piece by creating smaller PRs against that branch. That way we have control of what's being accepted and a good way to handle review comments before things get merged. The very first PR can add a dependency on System.Memory and then we can take it from there (lot of the optimizations you're suggesting are logically independent). I think the most important item is to support the non-contiguous parsing in it's simplest possible form.

  • I also created a myget.org feed where I can upload the experimental version of Google.Protobuf nuget at any time without disturbing the official version - anyone who want's to experiment with that package can just add the feed to their dev environment and use the experimental packages instead of the official vanilla one. (I think this is better than creating a nuget.org package under fake name).

What do you think?

@mkosieradzki
Copy link
Contributor Author

— I noticed there's a change from netstandard1.0 -> nestandard1.1, is that necessary? Making changes like this is theoretically possible but can slow down the review process quite a bit.

System.Memory requires 1.1 - so it simplified the PR a lot. If you take a look on the compatibility .NET Standard compatbility table the only different is Windows Phone Silverlight. So I think this sounds like a reasonable compromise between amount of #if and compatibility.

— (this point might be not applicable for spans-pr-rebased branch). It seemed that the generated .cs files were out of date when starting the branch - we should make sure C# photos are regenerated (let’s create a PR for that) before starting, otherwise the diffs for generated files are confusing (it's hard to tell new apart the changed parsing logic from the fields added to .proto files) Ideally, regenerating the protos would only happen in a single commit so that the rest of the code can reviewed commit-by-commit without needing to scroll through hundreds of lines of diffs in generated files.

Yup this is exaclty why I have created spans-pr-rebased.

— Providing the public ReadOnlySpan Span => bytes is a good idea and we should do it once the dependency on System.Memory is added. But it seems to be pretty independent from the rest of the parsing logic and should come as a separate PR. Same for ByteString CopyFrom(ReadOnlySpan bytes).

Fair enough. Not sure if the second is not used somewhere in the process...

— aggressive inlining optimization (and other optimizations) should go in as a separate PR (1 PR per one optimization technique).

Makes sense.

— what's the difference between Wrapped vs no-suffix when parsing? Explaining this concept would simplify reading the code review a lot. This seems to be the most interesting piece when it comes to parsing from Spans so we should make sure we are on the same page here.

Wrapped is dedicated to handle the wrapped messages like ValueInt32. Previously Wrapped messages were being handled by codecs with a lot of abstraction. The wrapped message structure is:

  • length
  • predefined tag (don't remember the number)
  • the actual value (Int32/Single/Double/etc..)

— I've notices some refactoring in CodedOutputStream.ComputeSize.cs (and possibly elsewhere), but I think combining refactoring and changes in logic makes the change very hard to review and greatly increases the risk we will miss something important.

I hope I didn't introduce any logic changes there! Please let me know where. I have only added support for wrapped messages.

— I've noticed CodedOutputStream.cs changes -> are they necessary? IMHO we should address the serialization and deserialization pieces separately.

I'm ok with both approaches.

@mkosieradzki
Copy link
Contributor Author

mkosieradzki commented Jul 26, 2018

@jtattermusch
Please give the spans-rebased-pr branch a try. I really tried to make this specific branch very clean and split into separate meaningful commits. I think that because we are dropping compatbility with 3.x branch by deleting FieldCodec abstraction the entire PR should be treated as a major change.

If you want to play with what I have built, I have created yet another branch spans-pr-rebased-build (with custom build process) and connected it to the AppVeyor. You can access the package from the AppVeyor nuget feed.

My apologies, but creating this PR as is has consumed most of my free time (it was a considerable amount of work, most in experimentation), I can't promise I will be able to split the PR into smaller pieces very soon

https://ci.appveyor.com/project/mkosieradzki/protobuf/build/1.0.30 - this is the best build I have currently.. There is one artifact with protoc for windows and one artifact with SDK nuget package. It is also available through this feed: as 4.0.0-pre-30 https://ci.appveyor.com/nuget/protobuf-spans

@jtattermusch
Copy link
Contributor

@mkosieradzki I plan to take a closer look at spans-rebased-pr tomorrow.

@jtattermusch jtattermusch changed the title [CSharp] Allow passing byte* as CodedInputStream buffer [CSharp] Allow Span<byte>-based parsing in CodedInputStream Oct 31, 2018
@prat0088
Copy link
Contributor

I followed the trail of related issues all the way to this PR.

I'm running a locally-built version of Protobuf where I have implemented some of these same optimizations. This issue and related PRs look dead. @mkosieradzki has done so much work on this I hate to put in effort and submit PRs of my own, but I don't see this issue moving otherwise.

What's the best path forward to get these optimizations in Protobuf?

@jtattermusch
Copy link
Contributor

@prat0088 we still plan to move ahead with the Span-based parsing but it's unclear which exact approach we're going to take (design discussions are ongoing). IMHO the PR that's the closest to the ideal design is #5888.
@mkosieradzki definitely deserves recognition for prototyping the approach, his code is really useful as a reference (regardless of which exact approach eventually end up landing).

@jtattermusch
Copy link
Contributor

This has been added in #7351

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests