Jcipar/parquet pr #18313
base: dev
Conversation
Before uploading to archival storage, write out locally. This new branch is mostly a copy of the demo work, but with improved organization.
Group rows into batches of 1000 for writing. Currently this number is hard coded.
Creating an interface for the schema registry reader breaks the circular dependency between schema registry, cluster, and datalake.
This commit adds a proto_to_arrow_consumer class that converts protobuf messages to columns in an Arrow table. If no schema is available, it falls back to the old code and the Arrow table simply encodes the key and value as binary columns.
Not a complete review, but leaving some quick thoughts (particularly if you're benchmarking, uploading the file stream is probably what you want).
Structurally I think the proto and arrow converters look fine and could probably be pulled into their own PR if you want to begin checking things in, given they're pretty self contained.
Also, just a thought about longer-term placement: I wonder if the parquet writers should be located similarly to WASM rather than being colocated with the data. I don't know the full details, but IIUC WASM uses its own internal topic to checkpoint itself. Especially as a CPU-intensive process, it's something to think about.
src/v/datalake/parquet_uploader.cc
Outdated
10ll * 1024ll * 1024ll
  * 1024ll, // FIXME(jcipar): 10 GiB is probably too much.
nit: just fyi base/units.h has some nifty macros, like 10_GiB
src/v/datalake/parquet_uploader.cc
Outdated
vlog(
  logger.debug,
  "Uploaded datalake topic {} successfully.",
  model::topic_view(_log->config().ntp().tp.topic));
nit: in either case we should probably be deleting the old file
});
co_await iobuf_ostream.close();

auto ret = co_await remote.upload_object(
You probably want to use upload_stream() here instead, to avoid buffering it all into an iobuf.
@@ -1171,6 +1171,11 @@ ss::future<cloud_storage::upload_result> ntp_archiver::do_upload_segment(
    ctxlog.warn,
    "Failed to upload datalake segment {}",
    candidate);
} else {
    vlog(
      ctxlog.warn,
nit: info?
local_partition_manager.set_panadaproxy_schema_registry(
  _schema_registry.get());
Had to do a double take at this -- though _schema_registry is only on shard 0, under the hood it's actually a wrapper around a sharded service, right? And calls transparently go through local() via the wasm:: impl.
try {
    _table_builder = std::make_unique<proto_to_arrow_converter>(
      protobuf_schema);
} catch (const std::exception& e) {
    // Couldn't build a table builder, fall back to schemaless
    // TODO: Log this
}
nit: longer term, may want to consider relying on some kind of result type (e.g. see base/outcome.h) and moving this work into some start() method, in case the exceptions get too unwieldy.
// TODO: Is this a good path? should it be configurable?
std::filesystem::path path = std::filesystem::path("/tmp/parquet_files")
Probably worth a config variable, or even some subdirectory of the cloud cache?
/ci-repeat 1
/ci-repeat 1 skip-rebase
v::cloud_storage
v::raft
Seastar::seastar
Arrow::arrow_shared
Why not static linking for this and parquet? Curious.
proto_to_arrow_scalar()
  : _builder(std::make_shared<BuilderType>()) {}

arrow::Status
This is interesting; we tend to use std::error_code (see the RPC layer), but wondering if the ok() is for Arrow internals.
public:
  proto_to_arrow_scalar()
    : _builder(std::make_shared<BuilderType>()) {}
This uses a global lock; lw_shared<> instead?
namespace pb = google::protobuf;

// Set up child arrays
for (int field_idx = 0; field_idx < message_descriptor->field_count();
Should the for loop be a function with tests, and use if()-return style for the conversions?
desc->cpp_type() == pb::FieldDescriptor::CPPTYPE_MESSAGE) {
    auto field_message_descriptor = desc->message_type();
    if (field_message_descriptor == nullptr) {
        throw std::runtime_error(
Wondering if a std::error_code (integer) like the RPC layer uses makes more sense here, since this is likely to be called in hot paths, and exceptions/runtime_errors are expensive, with global lock acquisitions.
for (int field_idx = 0; field_idx < message_descriptor->field_count();
     field_idx++) {
    auto desc = message_descriptor->field(field_idx);
    if (desc->cpp_type() == pb::FieldDescriptor::CPPTYPE_INT32) {
Could be its own series of tests. The code has both sum types (status<>) and exceptions; wondering if sum types alone should be enough to represent the paths.
}
}
void datalake::proto_to_arrow_converter::initialize_protobuf_schema(
  const std::string& schema) {
We use ss::sstring in general.
field_idx++) {
    auto desc = message_desc->field(field_idx);

    if (desc->cpp_type() == pb::FieldDescriptor::CPPTYPE_INT32) {
Seems like the same loop as above with the same code; one just pushes back and the other move-assigns, but the conversion could be a free function with tests.
void finish_batch();

std::shared_ptr<arrow::Table> build_table();
shared_ptr has a global lock on dtor. lw_shared?
Had some time and did a quick pass. Easy-to-read code.
&offset_builder](model::record&& record) {
    std::string key;
    std::string value;
    key = iobuf_to_string(record.key());
nit: initialize key on the same line as the assignment: const std::string key = iobuf_to_string(record.key());
// encode as utf8? The Parquet library will happily output binary in
// these columns, and a reader will get an exception trying to read the
// file.
_field_key = arrow::field("Key", arrow::utf8());
nit: Can these assignments be in the initializer list?
}

std::string
datalake::arrow_writing_consumer::iobuf_to_string(const iobuf& buf) {
I also see an implementation of iobuf_to_string in cloud_storage/tests/anomalies_detector_test.cc. Which is better, and should we pull this out into its own utility function elsewhere?
explicit arrow_writing_consumer();
ss::future<ss::stop_iteration> operator()(model::record_batch batch);
ss::future<std::shared_ptr<arrow::Table>> end_of_stream();
std::shared_ptr<arrow::Table> get_table();
Can you add some comments to these declarations, so users have a rough idea of the workflow for the public API?
Looks like some build errors @jcipar
Backports Required
Release Notes