[WIP] Add encoder/decoder for Avro object container files #81

veedo · 2022-08-21T19:34:40Z

For some of my projects, I need full control of the encoding/decoding process and AvroEx provides a good basis for that.
The only things missing from AvroEx that I use a lot are object containers and cloud wire formats.
This pull request is my attempt to add object containers in a flexible way that gives a user maximum control over the process.

Theory of operation:

- Each part of the container can be encoded/decoded separately.
- The codec implementation can be supplied by the user.
  - Currently using the :avro_ex app config so that they can configure it in their mix config.
  - A snappy implementation is provided because snappy compression does something weird in Avro. It is provided without pulling in the underlying snappyer, because adding snappyer as a dependency adds a NIF compile requirement, which is undesirable for cross compilation.
  - I believe compression codecs are only used in Object Containers, so I placed them under that module. Let me know if that is not the case.
- Just return the encoded data and let the user decide how they want to write the file.
  - Not yet sure how well decoding will work for this concept; might be forced to use IO objects.
  - Could be solved by providing functions for figuring out how much data to read for each chunk?
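The "how much data to read for each chunk" question can be sketched roughly as follows. Per the Avro specification, each file data block starts with two longs (the object count and the byte size of the serialized objects), and Avro longs are zigzag-encoded base-128 varints, so the header has no fixed width. The module and function names here are hypothetical and are not part of AvroEx.

```elixir
defmodule BlockHeaderSketch do
  # Hypothetical helper module, not AvroEx API.
  import Bitwise

  # Encode a signed 64-bit integer as an Avro long: zigzag, then varint.
  def encode_long(n) when is_integer(n) do
    encode_varint(bxor(n <<< 1, n >>> 63))
  end

  defp encode_varint(v) when v < 128, do: <<v>>

  defp encode_varint(v) do
    # Low 7 bits first, with the continuation bit set.
    <<1::1, (v &&& 0x7F)::7>> <> encode_varint(v >>> 7)
  end

  # Decode one Avro long from the front of a binary.
  # Returns {value, rest}.
  def decode_long(binary), do: decode_varint(binary, 0, 0)

  defp decode_varint(<<0::1, low::7, rest::binary>>, shift, acc) do
    zigzag = acc ||| (low <<< shift)
    {bxor(zigzag >>> 1, -(zigzag &&& 1)), rest}
  end

  defp decode_varint(<<1::1, low::7, rest::binary>>, shift, acc) do
    decode_varint(rest, shift + 7, acc ||| (low <<< shift))
  end

  # Read both block header longs; num_bytes tells the caller how many
  # bytes of `rest` belong to this block, before the 16-byte sync marker.
  def decode_block_header(binary) do
    {num_objects, rest} = decode_long(binary)
    {num_bytes, rest} = decode_long(rest)
    {num_objects, num_bytes, rest}
  end
end
```

A caller reading from a file would only need enough bytes for the two longs before it knows the size of the next read.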

Please provide feedback on the PR as I go in case there's something untenable

davydog187 · 2022-08-21T20:14:40Z

Thank you for the PR! I will provide some direct feedback this week. One immediate piece of feedback is to move all codecs to separate files

davydog187 · 2022-08-22T15:25:05Z

lib/avro_ex/object_container.ex

+  @bh_schema AvroEx.decode_schema!(~S({
+    "type":"record","name":"block_header",
+    "fields":[
+      {"name":"num_objects","type":"long"},
+      {"name":"num_bytes","type":"long"}
+    ]
+  }))
+  @fh_schema AvroEx.decode_schema!(~S({
+    "type": "record", "name": "org.apache.avro.file.Header",
+    "fields" : [
+      {"name": "magic", "type": {"type": "fixed", "name": "Magic", "size": 4}},
+      {"name": "meta", "type": {"type": "map", "values": "bytes"}},
+      {"name": "sync", "type": {"type": "fixed", "name": "Sync", "size": 16}}
+    ]
+  }))


1. decode_schema! supports directly passing Elixir terms, so this can pass an Elixir map directly rather than a string.

2. Calling decode_schema!/1 here causes a compile-time dependency between this module and AvroEx. Since this is a premature optimization, I suggest we move it out of module attributes.

WDYT?

> 1. decode_schema! supports directly passing elixir terms, so this can pass an elixir map directly rather than a string

Didn't notice that, thanks.
A good reason for using JSON may be to "match" the Avro spec, as the JSON string in the diff above does,

vs

%{
  "type" => "record",
  "name" => "org.apache.avro.file.Header",
  "fields" => [
    %{"name" => "magic", "type" => %{"type" => "fixed", "name" => "Magic", "size" => 4}},
    %{"name" => "meta", "type" => %{"type" => "map", "values" => "bytes"}},
    %{"name" => "sync", "type" => %{"type" => "fixed", "name" => "Sync", "size" => 16}}
  ]
}

the spec is unlikely to change, so no one will ever need to update it
I don't have a strong opinion about it 🤷
let me know which you prefer with the options side by side

> 2. Calling `decode_schema!/1` here causes a compile-time dependency between this module and `AvroEx`. Since this is a premature optimization, I suggest we move it out of module attributes.
>
> WDYT?

I'm not aware of any downsides. I work with a lot of embedded devices though, so doing stuff at compile time feels natural to me.
Is there a reason to avoid this optimization?
I can stick it in file_header_schema/0 instead.

Since it shouldn't change much over time, I think it's fine to represent it in Elixir and get the benefits of formatting. It is also easier to edit as Elixir code.

Regarding compile time, it shouldn't make a massive difference here, but accrued work at compile time does affect how long the library takes to compile.
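The trade-off discussed here can be illustrated with a small sketch. `ParserSketch` is a hypothetical stand-in for a schema parser such as `AvroEx.decode_schema!/1`; none of these module names exist in AvroEx.

```elixir
defmodule ParserSketch do
  # Hypothetical stand-in for a schema parser.
  def parse!(%{"type" => _} = schema), do: {:parsed, schema}
end

defmodule CompileTimeSketch do
  # This call runs while CompileTimeSketch is being compiled, so the
  # module gains a compile-time dependency on ParserSketch and must be
  # recompiled whenever ParserSketch changes.
  @schema ParserSketch.parse!(%{"type" => "long"})

  def schema, do: @schema
end

defmodule RuntimeSketch do
  # The same call deferred to runtime: the schema is parsed on every
  # invocation, but only a runtime dependency on ParserSketch remains.
  def schema, do: ParserSketch.parse!(%{"type" => "long"})
end
```

Both modules return the same value; they differ only in when the parsing work happens and in how the compiler tracks the dependency.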

davydog187 · 2022-08-22T15:25:19Z

lib/avro_ex/object_container.ex

+  def magic(), do: @magic
+  def block_header_schema(), do: @bh_schema
+  def file_header_schema(), do: @fh_schema


I don't see a need to expose these directly

The magic is unnecessary, but the raw schemas are useful to have.
I'm using them in tests right now. It feels icky to repeat these schemas in the encoding and decoding test files.
Is there a better way other than exposing them?

IMO we should drop the need for them in tests, since we can just validate against expected inputs/outputs.

That is true. The format isn't expected to change either, so the tests won't need to be updated.
I'll try it out and see how it looks, it may make the tests 99% encoded data blobs though 😛

The best way to test this is to check that the data round-trips through the encoder, so that should be completely reasonable.

davydog187 · 2022-08-22T15:48:45Z

Took a quick look at this, here is some general feedback:

- To keep the API surface area of AvroEx as small as possible, I would suggest that we have top-level APIs for working with Object Container Files in AvroEx. The implementation can delegate to the AvroEx.ObjectContainer module.
- There was mention of using application configuration for the codec. I would advise against this and instead just allow the user to pass a keyword argument to the library. If the user of the library wants to use application config, let them do that in their own application. See the Elixir library guidelines.
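A minimal sketch of the keyword-argument approach, assuming a `:codec` option. The option name, the modules, and `encode_block/2` are all illustrative and are not the API in this PR.

```elixir
defmodule NullCodecSketch do
  # Hypothetical pass-through codec used as the default.
  def encode!(data), do: data
end

defmodule ReverseCodecSketch do
  # Toy codec, only here to show that the option is honoured.
  def encode!(data), do: String.reverse(data)
end

defmodule ContainerOptionsSketch do
  # The library reads the codec from the caller's options and never
  # touches application configuration itself.
  def encode_block(data, opts \\ []) when is_binary(data) do
    codec = Keyword.get(opts, :codec, NullCodecSketch)
    codec.encode!(data)
  end
end
```

An application that prefers configuration can still do the lookup on its own side, for example `ContainerOptionsSketch.encode_block(data, codec: Application.get_env(:my_app, :avro_codec, NullCodecSketch))`, where `:my_app` and `:avro_codec` are placeholder names.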

davydog187 · 2022-08-22T17:34:48Z

lib/avro_ex/object_container/codec/snappy.ex

+  @behaviour AvroEx.ObjectContainer.Codec
+  @impl AvroEx.ObjectContainer.Codec
+  def encode!(data) do
+    {:ok, compressed} = :snappyer.compress(data)


We will need to add these as optional dependencies, no? We should also probably not compile this module if the user does not have snappyer installed

I'm torn over how to handle it.
The issue is that the snappy codec is implemented differently from all the other codecs.
I haven't looked at how to do it yet, but it would be nice to only compile it in if :snappyer is in the list of compiled/included apps or maybe some kind of build flag?
The implementation could just live in the docs, but it seems to be a popular codec
What do you think?

My plan was to worry about it after finishing the rest of the object container
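For context on why snappy differs from the other codecs: the Avro specification requires each snappy-compressed block to be followed by the 4-byte big-endian CRC32 checksum of the uncompressed data. The sketch below separates that framing from the compression itself and only defines the codec module when `:snappyer` can be loaded. The module names are hypothetical, and `:snappyer.decompress/1` returning `{:ok, binary}` is an assumption mirroring the `:snappyer.compress/1` call in the diff above.

```elixir
defmodule SnappyFramingSketch do
  # Hypothetical helper: Avro's CRC32 trailer, independent of snappyer.

  # Append the big-endian CRC32 of the *uncompressed* data.
  def frame(compressed, uncompressed) do
    <<compressed::binary, :erlang.crc32(uncompressed)::big-unsigned-32>>
  end

  # Split a framed block into the compressed payload and its checksum.
  def unframe(framed) when byte_size(framed) >= 4 do
    size = byte_size(framed) - 4
    <<compressed::binary-size(size), crc::big-unsigned-32>> = framed
    {compressed, crc}
  end
end

# Only define the codec when the optional dependency is present.
if Code.ensure_loaded?(:snappyer) do
  defmodule SnappyCodecSketch do
    def encode!(data) do
      {:ok, compressed} = :snappyer.compress(data)
      SnappyFramingSketch.frame(compressed, data)
    end

    def decode!(framed) do
      {compressed, crc} = SnappyFramingSketch.unframe(framed)
      # Assumed return shape, see the note above.
      {:ok, data} = :snappyer.decompress(compressed)
      # Raises a MatchError if the checksum does not match.
      ^crc = :erlang.crc32(data)
      data
    end
  end
end
```

With this split, the framing logic can be tested without the NIF, and users without snappyer never compile the codec module.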

veedo · 2022-08-22T17:46:24Z

> Took a quick look at this, here is some general feedback:
>
> To keep the API surface area of AvroEx as small as possible, I would suggest that we have top-level APIs for working with Object Container Files in AvroEx. We can have the implementation delegate out to the AvroEx.ObjectContainer module

will do 👍

> There was mention of using application configuration for the codec. I would advise against this and instead just allow the user to pass a keyword argument to the library, if the user of the library wants to use Application config let them do that in their own application. See the Elixir library guidelines

Thanks, that is a useful document.
Right now the codec is passed in anyway; I'll just have to think about how the name + implementation pairing will work.
I'll probably just add a name/0 function to the behaviour.
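One way the name/0 idea could look, as a sketch only. The behaviour below is hypothetical: `encode!/1` mirrors the diff above, while `name/0` and `decode!/1` are assumptions. The names "null" and "deflate" are the codec identifiers the Avro specification stores under the `avro.codec` metadata key, and `:zlib.zip/1` / `:zlib.unzip/1` are used here because they work on raw deflate data without a zlib header.

```elixir
defmodule CodecSketch do
  # Hypothetical behaviour, not the one defined in this PR.
  @callback name() :: String.t()
  @callback encode!(binary()) :: binary()
  @callback decode!(binary()) :: binary()

  # Pick the codec whose name matches the "avro.codec" header value.
  def find(name, codecs) do
    Enum.find(codecs, fn codec -> codec.name() == name end)
  end
end

defmodule CodecSketch.Null do
  @behaviour CodecSketch

  @impl true
  def name, do: "null"

  @impl true
  def encode!(data), do: data

  @impl true
  def decode!(data), do: data
end

defmodule CodecSketch.Deflate do
  @behaviour CodecSketch

  @impl true
  def name, do: "deflate"

  @impl true
  def encode!(data), do: :zlib.zip(data)

  @impl true
  def decode!(data), do: :zlib.unzip(data)
end
```

On the decoding side, the reader could then resolve the codec from the file header with something like `CodecSketch.find("deflate", [CodecSketch.Null, CodecSketch.Deflate])` and handle a `nil` result as a missing codec.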

davydog187 · 2022-09-17T13:42:39Z

hello @veedo, just wanted to check in on this PR, are you waiting on any review from me, or still a WIP?

veedo · 2022-09-17T18:05:53Z

I've just been swamped this month unfortunately.
I'll probably make some progress next week/weekend though 😅

davydog187 · 2022-09-17T21:31:51Z

No rush! Just wanted to make sure you weren't waiting on me

davydog187 · 2023-02-17T15:00:29Z

Hello @veedo! Checking back in here, is there anything I can do to help with this PR? If you're not going to come back to it, we can consider other options

veedo · 2023-02-25T17:05:03Z

The swamping has continued unfortunately, and will probably continue for the next 2 months at least.

Currently the encoding part works correctly and consistently.
My plan was to finish all the tests and start on decoding.
I can split the encoding part out into its own PR and remove all the decoding parts,
but that may be a bit weird for a user expecting both.

I'll whip myself to finish the encoding tests this weekend.
How would you like to handle it? Splitting shouldn't be too much more work.

veedo · 2023-03-05T16:21:28Z

@davydog187 Disregard my last comment. I had forgotten how much was left to do.
The encoding and decoding work; I just need to do the documentation.
I'll see if I can get it into an understandable state today.

Add encoder for object container files

7cb2259

Only adding the raising functions for now The implementation is designed so that people can pass different implementations of the codecs as required. Some people may want to use pure beam functions, some may want to use NIFs.

veedo requested a review from a team as a code owner August 21, 2022 19:34

veedo added 2 commits August 21, 2022 16:27

Add file header encoding tests

426e44f

Improve how block headers are encoded to use an avro record This will help during decoding, since longs are variable length

Split each codec into its own file

efb6c76

davydog187 reviewed Aug 22, 2022

test/object_container_encode_test.exs (outdated, resolved)

davydog187 reviewed Aug 22, 2022

lib/avro_ex/object_container.ex (outdated, resolved)

veedo added 2 commits August 22, 2022 10:16

Use alias instead of property

d012a83

Swap __MODULE__ position in argument pattern

dd981a4

Matches the rest of the library

davydog187 reviewed Aug 22, 2022

veedo added 4 commits August 25, 2022 07:48

Use elixir maps instead JSON strings for static schemas

22ee254

Pass in codec into encoder instead of using env

12c4514

Add initial file header decode function

186ef44

Improve error handling and responses

99a97f2

Handle case with missing codec Add tests for parts of the file header decoding required by the spec

Decode file container objects

5426a95

Implement all the functions that go into decoding parts of the object container

davydog187 mentioned this pull request Feb 17, 2023

Expose AvroEx.Decode.decode #87

Closed

veedo added 4 commits March 4, 2023 11:27

File decoding

80113f5

Implement the function that decodes an object container

Test Encode/Decode blocks and object containers

1fbf413

Merge branch 'master' into object_container_file

6459f86

Make the snappy codec optional.

1689f3c

The implementation of the snappy codec in avro is quite unique, so providing an implementation is valuable. Uses the optional dependency method similar to ecto.
