Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add encoder/decoder for Avro object container files #81

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

veedo
Copy link

@veedo veedo commented Aug 21, 2022

For some of my projects, I need full control of the encoding/decoding process and AvroEx provides a good basis for that.
The only things missing from AvroEx that I use a lot are object containers and cloud wire formats.
This pull request is my attempt to add object containers in a flexible way that gives a user maximum control over the process.

Theory of operation:

  • Each part of the container can be encoded/decoded separately
  • Codec implementation can be supplied by the user
    • Currently using the :avro_ex app config so that they can configure it in their mix config
    • The snappy implementation is provided Without the underlying snappyer because snappy compression does something weird in Avro, but adding snappyer as a dependency adds a NIF compile requirement which is undesirable for cross compilation
    • I believe compression codecs are only used in Object Containers, so I placed them under that module. Let me know if that is not the case.
  • Just return the encoded data, let the user decide how they want to write the file
    • Not yet sure how well decoding will work for this concept, might be forced to use IO objects
    • Could be solved by providing functions for figuring out how much data to read for each chunk?

Please provide feedback on the PR as I go in case there's something untenable

Only adding the raising functions for now
The implementation is designed so that people can pass different implementations of the codecs as required.
Some people may want to use pure beam functions, some may want to use NIFs.
@veedo veedo requested a review from a team as a code owner August 21, 2022 19:34
@davydog187
Copy link
Member

Thank you for the PR! I will provide some direct feedback this week. One immediate piece of feedback is to move all codecs to separate files

Improve how block headers are encoded to use an avro record
This will help during decoding, since longs are variable length
Comment on lines 16 to 30
@bh_schema AvroEx.decode_schema!(~S({
"type":"record","name":"block_header",
"fields":[
{"name":"num_objects","type":"long"},
{"name":"num_bytes","type":"long"}
]
}))
@fh_schema AvroEx.decode_schema!(~S({
"type": "record", "name": "org.apache.avro.file.Header",
"fields" : [
{"name": "magic", "type": {"type": "fixed", "name": "Magic", "size": 4}},
{"name": "meta", "type": {"type": "map", "values": "bytes"}},
{"name": "sync", "type": {"type": "fixed", "name": "Sync", "size": 16}}
]
}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. decode_schema! supports directly passing elixir terms, so this can pass an elixir map directly rather than a string
  2. Calling decode_schema!/1 here causing a compile time dependency between this module and AvroEx. Since this is a premature optimization, I suggest we move out of module attributes

WDYT?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. decode_schema! supports directly passing elixir terms, so this can pass an elixir map directly rather than a string

Didn't notice that, thanks.
A good reason for using json may be to "match" with the Avro spec:
image
vs

 %{
   "type" => "record",
   "name" => "org.apache.avro.file.Header",
   "fields" => [
     %{"name" => "magic", "type" => %{"type" => "fixed", "name" => "Magic", "size" => 4}},
     %{"name" => "meta", "type" => %{"type" => "map", "values" => "bytes"}},
     %{"name" => "sync", "type" => %{"type" => "fixed", "name" => "Sync", "size" => 16}}
   ]
 })

the spec is unlikely to change, so no one will ever need to update it
I don't have a strong opinion about it 🤷
let me know which you prefer with the options side by side

2. Calling `decode_schema!/1` here causing a compile time dependency between this module and `AvroEx`. Since this is a premature optimization, I suggest we move out of module attributes

WDYT?

I'm not aware of any downsides. I work with a lot of embedded devices though, so doing stuff at compile time feels natural to me.
Is there a reason to avoid this optimization?
I can stick it in the file_header_schema/1 instead

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it shouldn't change much over time, I think its fine to represent in elixir and get the benefits of formatting. It is also easier to edit as elixir code

Regarding compile-time, it shouldn't make a massive difference here, but accrued work at compile time does effect the time it takes to compile the library.

Comment on lines 32 to 34
def magic(), do: @magic
def block_header_schema(), do: @bh_schema
def file_header_schema(), do: @fh_schema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a need to expose these directly

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic is unnecessary, but the raw schemas are useful to have.
I'm using them in tests right now. It feels icky to repeat these schemas in the encoding and decoding test files.
Is there a better way other than exposing them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should drop the need for them at tests, since we can just validate against expected inputs/outputs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true. The format isn't expected to change either, so the tests won't need to be updated.
I'll try it out and see how it looks, it may make the tests 99% encoded data blobs though 😛

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best way to test this is that you can round trip the data through the encoder, so that should be completely reasonable

@davydog187
Copy link
Member

Took a quick look at this, here is some general feedback:

  1. To keep the API surface area of AvroEx as small as possible, I would suggest that we have top-level APIs for working with Object Container Files in AvroEx. We can have the implementation delegate out to the AvroEx.ObjectContainer module
  2. There was mention of using application configuration for the codec. I would advise against this and instead just allow the user to pass a keyword argument to the library, if the user of the library wants to use Application config let them do that in their own application. See the Elixir library guidelines

@behaviour AvroEx.ObjectContainer.Codec
@impl AvroEx.ObjectContainer.Codec
def encode!(data) do
{:ok, compressed} = :snappyer.compress(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to add these as optional dependencies, no? We should also probably not compile this module if the user does not have snappyer installed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn over how to handle it.
The issue is that the snappy codec is implemented differently from all the other codecs.
I haven't looked at how to do it yet, but it would be nice to only compile it in if :snappyer is in the list of compiled/included apps or maybe some kind of build flag?
The implementation could just live in the docs, but it seems to be a popular codec
What do you think?

My plan was to worry about it after finishing the rest of the object container

@veedo
Copy link
Author

veedo commented Aug 22, 2022

Took a quick look at this, here is some general feedback:

  1. To keep the API surface area of AvroEx as small as possible, I would suggest that we have top-level APIs for working with Object Container Files in AvroEx. We can have the implementation delegate out to the AvroEx.ObjectContainer module

will do 👍

  1. There was mention of using application configuration for the codec. I would advise against this and instead just allow the user to pass a keyword argument to the library, if the user of the library wants to use Application config let them do that in their own application. See the Elixir library guidelines

Thanks, that is a useful document.
Right now the codec is passed in anyways, i'll just have to think about how the name+implementation will work.
I'll probably just add a name/0 function to the behaviour

@davydog187
Copy link
Member

hello @veedo, just wanted to check in on this PR, are you waiting on any review from me, or still a WIP?

@veedo
Copy link
Author

veedo commented Sep 17, 2022

I've just been swamped this month unfortunately.
I'll probably make some progress next week/weekend though 😅

@davydog187
Copy link
Member

No rush! Just wanted to make sure you weren't waiting on me

Implement all the functions that go into decoding parts of the object container
@davydog187
Copy link
Member

Hello @veedo! Checking back in here, is there anything I can do to help with this PR? If you're not going to come back to it, we can consider other options

@veedo
Copy link
Author

veedo commented Feb 25, 2023

The swamping has continued unfortunately, and will probably continue for the next 2 months at least.

Currently the encoding part works correctly and consistently.
My plan was to finish all the tests and start on decoding.
I can split the encoding part out into its own PR and remove all the decoding parts,
but that may be a bit weird for a user expecting both.

I'll whip myself to finish the encoding tests this weekend.
How would you like to handle it? Splitting shouldn't be too much more work.

Implement the function that decodes an object container
The implementation of the snappy codec in avro is quite
unique, so providing an implementation is valuable.
Uses the optional dependency method similar to ecto.
@veedo
Copy link
Author

veedo commented Mar 5, 2023

@davydog187 Disregard my last comment. I had forgotten how much was left to do.
The encoding and decoding works, I just need to do the documentation.
I'll see if I can get it into an understandable state today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants