
Read-ahead or concurrent fetching? #13

Open
lawik opened this issue Jun 9, 2023 · 6 comments

Comments


lawik commented Jun 9, 2023

I have an archive containing a lot of small files, and while I haven't measured to confirm, I'm fairly sure the streaming slows down significantly (the download drops from multiple Mb/s to a few Kb/s) because the per-file fetch overhead/latency outweighs the actual transfer time.

Thousands of files in this case.
It does complete eventually, but it would be neat to be able to ask Packmatic to buffer at least X bytes ahead of the current need, or something similar.

Is there a way to do this that I've missed, or would it be a good addition in your eyes?
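
For illustration, the kind of caller-side workaround I imagine might look like this sketch (hypothetical: urls and fetch_body are placeholders, it assumes Packmatic accepts a {:data, binary} source, and it starts every fetch up front with no bound on concurrency):

prefetched_entries =
  urls
  |> Enum.map(fn url ->
    # Start the download immediately; the {:dynamic, fun} source only awaits it
    # when Packmatic reaches this entry, so fetch latency overlaps with the
    # encoding of earlier entries. Note that Task.await/2 must be called from the
    # process that started the task, so the ZIP stream has to be consumed here.
    task = Task.async(fn -> fetch_body.(url) end)

    [
      # {:data, binary} is assumed here; the livebook below uses {:random, bytes}.
      source: {:dynamic, fn -> {:ok, {:data, Task.await(task, :infinity)}} end},
      path: Path.basename(url)
    ]
  end)

prefetched_entries
|> Packmatic.build_stream()
|> Stream.run()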


evadne commented Jun 9, 2023

Are you able to draw a waterfall graph of each file that needs to be fetched, or provide some details on the file sizes and the relative time to first byte?

There are many ways to pre-compute parts of the output; yes, this could include using an intermediary layer that concurrently prepares some of the responses, etc.


lawik commented Jun 9, 2023

I think this Livebook reproduces the problem in a reasonable, experimental way:

Packmatic starvation

Mix.install([:packmatic, :kino, :kino_vega_lite])
alias VegaLite, as: Vl

Section

delay = Kino.Input.number("Delay, ms", default: 300) |> Kino.render()
file_size = Kino.Input.number("File size, kb", default: 512) |> Kino.render()
entry_count = Kino.Input.number("Files", default: 200)
chart =
  Vl.new(width: 400, height: 400)
  |> Vl.mark(:line)
  |> Vl.encode_field(:x, "x", type: :quantitative)
  |> Vl.encode_field(:y, "y", type: :quantitative)
  |> Kino.VegaLite.new()
t1 = System.os_time(:millisecond)

log_event = fn event ->
  t2 = System.os_time(:millisecond)
  offset = t2 - t1

  case event do
    %Packmatic.Event.EntryUpdated{stream_bytes_emitted: bytes} ->
      seconds = offset / 1000

      if seconds > 0 do
        kb_per_second = bytes / 1024 / seconds
        IO.inspect(kb_per_second, label: "kb/s")
        point = %{x: seconds, y: kb_per_second}
        Kino.VegaLite.push(chart, point)
      end

    %Packmatic.Event.EntryCompleted{} ->
      IO.inspect(event)

    _ ->
      nil
  end

  :ok
end
latency = Kino.Input.read(delay)
size = Kino.Input.read(file_size)

small_remote_file = fn ->
  # Overhead latency for request
  :timer.sleep(latency)
  # file of `size` kB of random data (default 512)
  {:ok, {:random, size * 1024}}
end
count = Kino.Input.read(entry_count)

entries =
  1..count
  |> Enum.map(fn num ->
    [
      source: {:dynamic, small_remote_file},
      path: "#{num}.txt"
    ]
  end)

{t, _} =
  :timer.tc(fn ->
    entries
    |> Packmatic.build_stream(on_event: log_event)
    |> Stream.run()
  end)

IO.inspect(t / 1000, label: "took ms")
IO.inspect(count * latency, label: "entries * delay, ms")


lawik commented Jun 9, 2023

Pasting a Livebook into GitHub is kinda weird :D


lawik commented Jun 9, 2023


evadne commented Dec 12, 2023

@lawik Revisiting the problem, there are a few possible solutions:

  1. Make the Encoder concurrent by default, with the prefetching logic made tunable

  2. Write a ConcurrentEncoder and keep both

  3. Optimise connection setup: by using Hackney and pooling the connections, we should be reasonably fast, at least to a degree (see the hackney sketch below)

It would depend on:

  - Whether the sources are on different hosts that resolve to different IP/port pairs, which would require separate connections
  - Whether the individual files are large or small
  - Etc.
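
For illustration, plain hackney pooling looks like the following (whether and how Packmatic's URL source would expose a pool option is left open here; the pool name and URL are placeholders):

:ok = :hackney_pool.start_pool(:packmatic_pool, timeout: 15_000, max_connections: 50)

# Requests that pass the pool reuse keep-alive connections, so the per-file
# TCP/TLS setup cost is paid once per host rather than once per entry.
{:ok, 200, _headers, ref} =
  :hackney.request(:get, "https://example.com/file-1.bin", [], "", pool: :packmatic_pool)

{:ok, _body} = :hackney.body(ref)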

There is also another solution, which is to keep the encoding entries hot-addable, so that there is a producer and the consumer simply keeps going until it receives an end message. The intermediary layer could then be added on top of that.


lawik commented Dec 13, 2023

I no longer have the problem because we optimized away the need for about 7000 files and suddenly things are quite snappy.

The last option you mention would let the developer determine their own level of look-ahead. I assume this could be modeled as a stream of entries instead of a finalized list?
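
Roughly what I am picturing, as a sketch (none of this is current Packmatic API: it assumes a build_stream variant that accepts a lazy enumerable, and start_producer/0, next_entry/1 and stop_producer/1 are hypothetical producer callbacks):

entry_stream =
  Stream.resource(
    fn -> start_producer() end,
    fn producer ->
      case next_entry(producer) do
        # Producer signals the end; the archive would be finalised here.
        :done -> {:halt, producer}
        # Each entry is the usual [source: ..., path: ...] keyword list.
        entry -> {[entry], producer}
      end
    end,
    fn producer -> stop_producer(producer) end
  )

# Hypothetical lazy-manifest variant; today build_stream takes a finalised list.
entry_stream
|> Packmatic.build_stream()
|> Stream.run()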
