
Add compress_batches feature #1

Merged: 8 commits merged into gocardless from compressed-batching on May 19, 2020

Conversation

@benwh commented May 17, 2020

This is best reviewed commit-by-commit, as there's quite a lot of noise in the diff due to Rubocop being introduced and auto-corrections applied.


f8d0755: Set a default ruby version

d5b9813: Remove deprecated has_rdoc option

This option has been deprecated since 2011.

d0e017e: Add rubocop configuration

3351fab: Add rubocop fixups and configuration

Use the Rubocop-generated configuration to ignore any major changes that
can't be auto-corrected.

ad03786: Adjust supported Ruby versions

Ruby 2.4 is now end-of-life. As a result of this, the `google-protobuf`
gem is no longer releasing artifacts compatible with this version.
(protocolbuffers/protobuf#7453)

In CI we encounter this error from `bundle install`:

```
google-protobuf-3.12.0-x86_64-linux requires ruby version >= 2.5, which is
incompatible with the current version, ruby 2.4.6p354
```

Therefore remove Ruby 2.4.6 as a tested version in Travis CI.

Additionally remove the constraint on patch versions in the Travis
config, so that we'll use the latest patch version available for each
release branch.

265e819: Add Prometheus metrics

Augment the existing operations with Prometheus metrics in order to
provide observability around the operations that the plugin is
performing.

Introduce a new metrics helper to prevent attempting to register the
same metric more than once in a multi-threaded or multi-instance
context (a sketch of this pattern follows the commit list).

44b4b87: Add compress_batches feature

As per the README updates, this can be used to compress a number of
input records into a single Pub/Sub message, therefore saving on costs.
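
To make the metrics helper from 265e819 concrete, here is a minimal sketch of one way to implement idempotent registration, assuming the prometheus-client gem; the module, method, and metric names are illustrative, not necessarily what the plugin actually uses.

```ruby
require "prometheus/client"

# Hypothetical helper, not the plugin's actual code: register a metric once,
# or return the one that is already registered. This avoids
# Prometheus::Client::Registry::AlreadyRegisteredError when several plugin
# instances (or threads) initialise the same metric.
module Metrics
  REGISTRY_MUTEX = Mutex.new

  def self.register_or_existing(registry, name)
    REGISTRY_MUTEX.synchronize do
      return registry.get(name) if registry.exist?(name)

      yield
    end
  end
end

# Usage: safe to call from every output instance.
registry = Prometheus::Client.registry
duration = Metrics.register_or_existing(registry, :compression_duration_seconds) do
  registry.histogram(
    :compression_duration_seconds,
    docstring: "Time spent compressing record batches",
  )
end
```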

@benwh changed the base branch from master to gocardless on May 17, 2020
@benwh force-pushed the compressed-batching branch 2 times, most recently from e5ddf42 to e0c92bd on May 17, 2020
@benwh commented May 17, 2020

Closing and reopening to kick Travis into action

@benwh closed this May 17, 2020
@benwh reopened this May 17, 2020
@benwh commented May 18, 2020

Note that in this PR I chose to use zlib, as it comes with the Ruby standard library and seems to do a pretty reasonable job (~85% compression on our lab logs).

Maybe we'd like to investigate alternatives, like lz4, but I'd propose doing that in a follow-up, which will be possible by specifying a different value for the `compression_algorithm` message attribute.
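
For illustration, here is a rough sketch of what zlib batch compression plus that attribute could look like, using the google-cloud-pubsub gem; the method name, framing, and topic name are assumptions rather than the plugin's actual code.

```ruby
require "zlib"
require "google/cloud/pubsub"

# Illustrative only: the real plugin needs to frame the records so that the
# consumer can split the batch apart again; newline-joining stands in for
# that framing here.
def publish_compressed_batch(topic, records)
  packed = records.join("\n")
  compressed = Zlib::Deflate.deflate(packed)

  # compression_algorithm is the message attribute mentioned above; a future
  # lz4 implementation would set a different value here.
  topic.publish(compressed, compression_algorithm: "zlib")
end

pubsub = Google::Cloud::Pubsub.new
topic = pubsub.topic("my-topic") # hypothetical topic name
publish_compressed_batch(topic, ["record-1", "record-2"])
```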

```
@pubsub = Google::Cloud::Pubsub.new project_id: project, credentials: key
@autocreate_topic = autocreate_topic
@topics = {}

@compression_ratio =
```

@dyson: I don't think this is capturing the compression ratio. See comment where this is calculated.

```
)

compressed_size = compressed_messages.bytesize
@compression_ratio.observe(
```

@dyson: This looks to me like it's calculating the data-rate saving, not the compression ratio; the saving is what we want, to ensure we can put this in known buckets of 0-1.

compression ratio = (original size) / (compressed size)
data-rate saving = 1 - (compressed size) / (original size)
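
For example, a 1000-byte batch that compresses to 150 bytes has a compression ratio of 1000/150 ≈ 6.7, but a data-rate saving of 1 - 150/1000 = 0.85, which fits naturally in buckets between 0 and 1.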

@benwh: Ah yeah, you're totally right, I think. I was a bit torn between producing a number that would make sense on a dashboard, Prometheus' best-practice definition of a percentage/ratio, and avoiding potential division by zero, and I ended up with something slightly confusing here.

I think you're right that we probably do want to keep this formula: if we computed the actual compression ratio, the value would be harder to predict (e.g. 10MiB/1KiB = 10240) and therefore harder to define buckets for.


As for what we should replace it with:

I'd argue that we're not talking about 'data rate' here, as we're operating on a single blob rather than an indefinite stream, so it should be expressed as 'space saving', right?

  • Current:
    `fluentd_output_gcloud_pubsub_messages_compression_ratio`

    Variable name: `@compression_ratio`

  • Proposal 1:
    `fluentd_output_gcloud_pubsub_messages_compression_space_savings_ratio`

    Variable name: `@compression_space_saving`

    I'd like to keep compression in the name, as it groups it with the `compression_duration_seconds` metric.

  • Proposal 2:
    `fluentd_output_gcloud_pubsub_messages_compressed_size_per_original_size_ratio`

    Variable name: `@compression_ratio` ??

    This is closer to the Prometheus suggestion, but feels a little bit awkward?

@dyson: That naming of Proposal 2 is really doing it for me 😆

Let's do that!
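
For illustration, the agreed-on metric might be registered along these lines, again assuming the prometheus-client gem; the bucket boundaries and sample values are invented for the example.

```ruby
require "prometheus/client"

registry = Prometheus::Client.registry

# Observing compressed/original keeps every sample in a predictable 0..1
# range, so fixed buckets work; these boundaries are made up.
compression_ratio = registry.histogram(
  :fluentd_output_gcloud_pubsub_messages_compressed_size_per_original_size_ratio,
  docstring: "Compressed size of a batch as a fraction of its original size",
  buckets: [0.1, 0.25, 0.5, 0.75, 0.9, 1.0],
)

original_size = 1_000 # sample values, in bytes
compressed_size = 150
compression_ratio.observe(compressed_size.to_f / original_size)
```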

```
@@ -0,0 +1,14 @@
# frozen_string_literal: true
```

@dyson: Something super weird has happened to your commits here. This file already exists, with this content in it plus more. 🤔

@dyson: Ah, ignore me. It's GitHub screwing up the ordering of the commits.

@benwh: Ah, I think this is probably GitHub doing its usual trick of showing commits in date order rather than actual commit order (I moved some commits around in the history when rebasing).

@dyson left a comment

Just the one comment on the naming of the compression ratio metric, but otherwise this looks super solid! Primo work, dude.

If you get back to me about that I can smash the approve button.

@benwh force-pushed the compressed-batching branch 2 times, most recently from 6073941 to 8d7e0bc on May 19, 2020
@benwh added 2 commits on May 19, 2020