Improve use of iodata #4
Hi! I had a look at the test repo and tried to code an alternative implementation that "just" recursively walks the IO list and doesn't build intermediate tuples (and probably reduces function calls a bit).
I think in this scenario we are paying the IO data => binary conversion only once, for the first chunk, and every subsequent split is very efficient. If we use iolist splitting, we need to keep working with iolists in the rest of the loop. From a very quick trial with (much) smaller iolists, the benchmarks seem to favor iolist splitting over the BIF, but this would need to be studied in more detail.
I like your approach - you're trading off building intermediate tuples against more stack work, but that's likely a good compromise to make. In terms of converting to a binary, I'm pretty sure we have to do that for at most one binary per call. To wit:
I've pushed a revised version of my split algorithm at mtrudel/iolist_test@b6dc785, which is slightly faster but still slower than binary splitting. Comments welcome.
Oh OK, I was missing some context. Due to this comment I was under the impression we wanted to split a big IO list into chunks. Maybe we should try to bench HTTP/2 using it to get a more real-world comparison?
Looks neat, pretty easy to understand! I like how you use
This is the thing - obviously the intent here is to make real-world iolists faster; I suspect I could get some good sample data by grabbing a rendered page from EEx as it gets sent via Bandit (to see it in its original iolist form). I'm going to wire that up into a couple of tests.
Regarding iodata splitting, this might be worth checking out too: https://github.com/ninenines/cowlib/blob/0f5c2f8922c89c58f51696cce690245cbdc5f327/src/cow_iolists.erl.
Hello!

```elixir
defmodule IOListSplit3 do
  @doc """
  Returns an iolist consisting of the first `length` bytes of the given list, along with the
  remainder of the iolist beyond it as a tuple. Returns `{:error, :length_exceeded}` if there are
  not `length` bytes remaining in the iolist.
  """
  def split(iodata, length) do
    if IO.iodata_length(iodata) >= length do
      do_split(iodata, length)
    else
      {:error, :length_exceeded}
    end
  end

  # A binary at the head of the list: either consume it whole and recurse,
  # or split it at the remaining length
  defp do_split([head | tail], length) when is_binary(head) and length != 0 do
    head_size = byte_size(head)

    if head_size <= length do
      {tail_head, tail_tail} = do_split(tail, length - head_size)
      {[head | tail_head], tail_tail}
    else
      <<head::binary-size(length), rest::binary>> = head
      {head, [rest | tail]}
    end
  end

  # A bare binary at the top level
  defp do_split(iodata, length) when is_binary(iodata) and length != 0 do
    <<head::binary-size(length), rest::binary>> = iodata
    {head, rest}
  end

  # Nothing left to take: the taken part is empty and the remainder is
  # whatever has not yet been consumed
  defp do_split(iodata, 0) do
    {[], iodata}
  end

  # A nested list at the head: flatten it one level and retry
  defp do_split([[head_head | head_tail] | tail], length) do
    case head_tail do
      [] -> do_split([head_head | tail], length)
      _ -> do_split([head_head | [head_tail | tail]], length)
    end
  end
end
```
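The core primitive these clauses rely on is the binary pattern match at a byte offset. As a standalone sketch (the module and function names here are made up for illustration, not part of the snippet above):

```elixir
# Minimal sketch of splitting a binary at a byte offset, as used in the
# clauses above. `SplitSketch.take/2` is a hypothetical name.
defmodule SplitSketch do
  # Returns {first `n` bytes, remainder} of a binary.
  def take(bin, n) when is_binary(bin) and byte_size(bin) >= n do
    <<head::binary-size(n), rest::binary>> = bin
    {head, rest}
  end
end

SplitSketch.take("hello world", 5)
# => {"hello", " world"}
```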
@chrooti - I copied your implementation into my iolist_test benchmarking harness at mtrudel/iolist_test@4f6aa0e, and it seems to come out roughly in line with my current best take on the algorithm. Have a look over there and see if I missed anything!
So I checked out your implementation a bit (thanks for reworking it into something actually competitive), and a few things struck me:
I have not had much free time to experiment with these due to other projects taking my attention, but I thought I'd drop by and at least thank you for the time spent considering (and fixing) my code. I hope I can submit something more meaningful in the near future :)
Oh no! To be clear, I didn't try to 'fix' your code at all; it was literally just a copy and paste job. If it's not exactly what you'd written in your original snippet then I messed up.
Okay, I checked it out, and it looks to me like: so maybe a copy-paste job gone wrong? :) BTW I benched the following example, which allows only well-formed lists, uses tail recursion, and should have zero overhead:

```elixir
defmodule IOListSplit3 do
  @doc """
  Returns an iolist consisting of the first `length` bytes of the given list, along with the
  remainder of the iolist beyond it as a tuple. Returns `{:error, :length_exceeded}` if there are
  not `length` bytes remaining in the iolist.
  """
  @compile {:inline, split: 3}

  def split(iodata, length, taken \\ [])

  def split([head | tail], length, taken) do
    head_length = IO.iodata_length(head)

    if head_length <= length do
      split(tail, length - head_length, [head | taken])
    else
      # Assumes the element straddling the split point is a binary
      <<head_head::binary-size(length), head_tail::binary>> = head
      # `taken` accumulates in reverse, so restore the original order here
      {Enum.reverse([head_head | taken]), [head_tail | tail]}
    end
  end

  def split([], _length, taken) do
    {Enum.reverse(taken), []}
  end
end
```

and this still doesn't get close to binary_concat; if anything, its performance is very close to your complete implementation, so it's probably impossible to go faster than that. Results:
Just leaving a comment here: recently at SpawnFest there was jason_native, which tries to handle these hard-to-optimize use cases on the BEAM with what its author calls a "NIF sprinkle". Specifically, he mentions binary patterns here. Perhaps this could be used here and in some other places? I'm not proficient enough in C++ or the issue at hand here... feel free to ignore my comment :)
Hi @mtrudel, I've refactored your iodata_split_3 solution and got some promising results:
I tested the equality of my solution against iodata_split_3 to make sure I didn't introduce any bugs. Let me know if you want me to push a PR, as I'm not really sure if you intended this for Bandit only or for the IO module, given your example:
Those are some serious numbers! I'd love to see the approach taken for this. Probably the best place to land initial work would be a helper module within Bandit; I'd suggest opening a PR on Bandit with the implementation in a helper module and some basic test coverage. We can easily wire this up in a couple of hot paths and set the benchmarker on it to see what numbers we get. Promising!
So I've got some good news and some bad news. The bad news is, if we want a general
iodata_split_5 is a refactored version of iodata_split_4 which passes cowlib's test (https://github.com/ninenines/cowlib/blob/0f5c2f8922c89c58f51696cce690245cbdc5f327/src/cow_iolists.erl#L58L76), but it is significantly slower than the others, though there might be room for improvement. The good news is, I believe that if we can guarantee the iodata only contains binaries (no charlists), then we can get a big speedup.
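To illustrate the distinction (a small sketch, not taken from the benchmark repo): plain bytes are legal iodata alongside binaries, and that is exactly what a fully general splitter has to cope with:

```elixir
# Bytes (integers 0..255) may appear directly in iodata, mixed with
# binaries and nested lists; a fully general splitter must handle them.
mixed = [?a, "bc", [?d, ["e"]]]
IO.inspect(IO.iodata_to_binary(mixed))
# => "abcde"

# Binaries-only iodata, the restricted shape discussed above:
binaries_only = ["a", ["bc", ["de"]]]
IO.inspect(IO.iodata_to_binary(binaries_only))
# => "abcde"
```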
I'm pretty sure that we can guarantee (at least in Bandit's case) that an iolist is either a binary or a (possibly improper) recursively defined list of iolists. Extremely excited to see this; this has been a bugaboo of mine for ages! |
The typedef for iolists includes bytes:
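For reference, the built-in type (as defined in Erlang/OTP and mirrored by Elixir's `iolist/0`) is:

```elixir
iolist() :: maybe_improper_list(byte() | binary() | iolist(), binary() | [])
```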
I suppose that means that all charlists are also iolists (and can be validly contained in them), but if you're able to handle bytes as a valid value without performance degradation then we'll be fine. In any case, we control the construction of iolists internally within Bandit, so we can restrict the typing to not include bytes if need be (though that may be a limiting factor in getting a future upstream PR accepted into the standard library if it's not a comprehensive solution).
@mtrudel I noticed @fhunleth doing some interesting iodata experiments that might be relevant: |
Just saw this. Thanks for looking at the
There are a number of places (particularly in the h2 stack) where we're building up buffers based on iodata input and end up needing to flatten them to binaries earlier than is optimal. Mostly this comes down to the fact that there is no way to grab the first n bytes of an iodata without resorting to pattern matching:
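A minimal sketch of the flatten-then-match approach being described (the example data here is made up for illustration):

```elixir
# Flatten the whole iodata to a binary, then pattern match off the first
# n bytes. This pays the full iodata -> binary conversion cost up front,
# even when only a small prefix is needed.
iodata = ["HTTP/", [?1, ?., ?1], " 200 OK"]
n = 8

bin = IO.iodata_to_binary(iodata)
<<head::binary-size(n), rest::binary>> = bin

IO.inspect({head, rest})
# => {"HTTP/1.1", " 200 OK"}
```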
Some provisional work for this is at https://github.com/mtrudel/iolist_test, but it's something around 8x slower than binary approaches. I'm assuming that this is mostly because binary conversion / matching is done in C within the BEAM, while my naive iolist splitting heuristic is being done in Elixir. To my eye there's strictly less work to be done in the iolist splitting scenario, so if it were implemented in C I imagine it would be faster / more memory efficient than the binary conversion / matching approach.
It would be useful to understand if / to what extent this is the case, and whether improved iolist primitives in the BEAM would be worth it.