DRILL-6820: Msgpack format reader #1500

jcmcote · 2018-10-11T12:58:22Z

Implementation of a msgpack format reader:

- schema learning
- skip over malformed records
- skip over invalid field names
- skip over records not matching schema
- writing msgpack has not yet been implemented

Implementation of a zstandard codec:

- only decompression is implemented

add support for msgpack extended types; fix issue with columns of INT which then encounter a BIGINT

fixed bug in reader (array of array); new test case that requires a schema; added useSchema property to turn off schema utilization

After doing some performance tests, I concluded that throwing an exception for 1 out of 10000 records had no significant impact on performance and makes the code much easier to understand. Also consolidated the reader and the count reader into a single reader class that can either count or actually read records. Again, the code is much easier to understand like this.
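As an illustration of the approach described above, here is a minimal sketch of a single reader class that either counts or materializes records and that skips malformed records by catching a per-record exception. It is not the code from this PR: `RecordReader`, `MalformedRecordException`, and the integer-parsing stand-in for msgpack decoding are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipMalformedSketch {

  /** Hypothetical stand-in for a per-record parse failure. */
  static class MalformedRecordException extends Exception {
    MalformedRecordException(String message) {
      super(message);
    }
  }

  /** One reader class; countOnly selects counting versus materializing records. */
  static class RecordReader {
    private final boolean countOnly;
    final List<Integer> records = new ArrayList<>();
    int recordCount;
    int parseErrorCount;

    RecordReader(boolean countOnly) {
      this.countOnly = countOnly;
    }

    // Hypothetical parser: a record is "malformed" when it is not an integer.
    private int parse(String raw) throws MalformedRecordException {
      try {
        return Integer.parseInt(raw.trim());
      } catch (NumberFormatException e) {
        throw new MalformedRecordException("bad record: " + raw);
      }
    }

    void readAll(Iterable<String> input) {
      for (String raw : input) {
        try {
          int value = parse(raw);
          if (!countOnly) {
            records.add(value);
          }
          recordCount++;
        } catch (MalformedRecordException e) {
          // Rare failure path: note the error and move on to the next record.
          parseErrorCount++;
        }
      }
    }
  }

  public static void main(String[] args) {
    RecordReader reader = new RecordReader(false);
    reader.readAll(Arrays.asList("1", "2", "oops", "4"));
    System.out.println(reader.records + " errors=" + reader.parseErrorCount);
  }
}
```

The exception is only raised on the uncommon malformed record, so the normal path stays a plain loop with no error codes to thread through each call.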

vdiravka · 2018-10-30T10:52:54Z

@jcmcote could you add the corresponding JIRA as a prefix in the title of the pull request? Refer to the format of other pull requests here: https://github.com/apache/drill/pulls

coercing values into target schema types

paul-rogers

Very cool addition. MsgPack, like Parquet, should provide the additional schema information to avoid the messy reality of JSON.

This is a partial review with an initial batch of comments as I learned the code. Will follow up with the remaining files, then probably take a deeper second pass.

paul-rogers · 2018-11-05T06:44:27Z

contrib/codec-zstd/src/main/java/org/apache/hadoop/io/compress/zstd/ZstdCompressor.java

+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+import org.apache.commons.logging.Log;


It turns out Drill requires the use of Logback logging. The build should have complained about illegal imports of commons-logging.

paul-rogers · 2018-11-05T06:45:34Z

contrib/codec-zstd/src/main/java/org/apache/hadoop/io/compress/zstd/ZstdCompressor.java

+ * A {@link Compressor} based on the snappy compression algorithm.
+ * http://code.google.com/p/snappy/
+ *
+ * jccote !!!DID NOT TEST THIS CLASS, JUST RENAMED SNAPPY FOR ZSTD!!!


Please identify the source of this class: GitHub URL or the like.

contrib/codec-zstd/src/main/java/org/apache/hadoop/io/compress/zstd/ZstdDecompressor.java

paul-rogers · 2018-11-05T06:50:39Z

contrib/format-msgpack/pom.xml

+  <build>
+    <plugins>
+    <plugin>
+  <groupId>org.codehaus.mojo</groupId>


Nit: maybe indent this block so it looks like:

<build>
  <plugins>
    <plugin>
      <groupId>...

...ib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackFormatPlugin.java

paul-rogers · 2018-11-05T07:36:15Z

...b/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackReaderContext.java

+    return parseErrorCount + runningRecordCount + recordCount + 1;
+  }
+
+  public void handleAndRaise(String suffix, Exception e) throws UserException {


Here's where you'd use UserException and its builders rather than a custom version.

contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackReader.java

paul-rogers · 2018-11-05T07:45:14Z

contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackReader.java

+    valueWriterMap.put(ValueType.BOOLEAN, new BooleanValueWriter());
+    valueWriterMap.put(ValueType.STRING, new StringValueWriter());
+    valueWriterMap.put(ValueType.BINARY, new BinaryValueWriter());
+    valueWriterMap.put(ValueType.EXTENSION, new ExtensionValueWriter());


Somewhat confused here; some comments may help.

Presumably, the file can contain any number of FLOAT fields, including 0. Each field needs its own state (reader, parser, whatever.) Each has its own value vector. How does it work to have one "ValueWriter" per type that takes no parameters to say which field or vector is being worked on?

jcmcote · 2018-11-06

I'll put more comments in the code. Basically, this is a switch implemented using an EnumMap. In the ComplexValueWriter, I use this switch to look up which class will handle writing a given value type. Here's the line of code from its writeElement method:

valueWriterMap.get(value.getValueType()).write(value, mapWriter, fieldName, listWriter, selection, schema);

So based on the value type I get the corresponding writer class to use.
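To make the dispatch pattern concrete, here is a small self-contained sketch of an EnumMap used as a switch. It is only an illustration under assumptions: the `ValueType` enum below is a hypothetical three-value subset, and the `ValueWriter` interface, `typeOf`, and the `Map<String, String>` target are invented for the example. They are not the classes in this PR. It does suggest why one writer per type can be enough: each writer is stateless, and the field name and output target are passed in on every call.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class EnumMapDispatchSketch {

  /** Hypothetical subset of value types, for illustration only. */
  enum ValueType { BOOLEAN, STRING, INTEGER }

  /** Stateless handler: the field and the target arrive with every call. */
  interface ValueWriter {
    void write(Object value, String fieldName, Map<String, String> target);
  }

  // The "switch": one handler per enum constant, looked up by value type.
  static final Map<ValueType, ValueWriter> WRITERS = new EnumMap<>(ValueType.class);

  static {
    WRITERS.put(ValueType.BOOLEAN, (v, f, t) -> t.put(f, "BIT:" + v));
    WRITERS.put(ValueType.STRING, (v, f, t) -> t.put(f, "VARCHAR:" + v));
    WRITERS.put(ValueType.INTEGER, (v, f, t) -> t.put(f, "BIGINT:" + v));
  }

  // Hypothetical type detection standing in for value.getValueType().
  static ValueType typeOf(Object value) {
    if (value instanceof Boolean) {
      return ValueType.BOOLEAN;
    }
    if (value instanceof String) {
      return ValueType.STRING;
    }
    if (value instanceof Integer || value instanceof Long) {
      return ValueType.INTEGER;
    }
    throw new IllegalArgumentException("unsupported value: " + value);
  }

  static void writeElement(Object value, String fieldName, Map<String, String> target) {
    WRITERS.get(typeOf(value)).write(value, fieldName, target);
  }

  public static void main(String[] args) {
    Map<String, String> row = new LinkedHashMap<>();
    writeElement(true, "active", row);
    writeElement("drill", "name", row);
    writeElement(42L, "count", row);
    System.out.println(row);
  }
}
```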

paul-rogers · 2018-11-05T07:46:14Z

contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackReader.java

+
+//
+//
+//  private void ensure(final int length) {


Not clear that this reader is one that needs an off-heap work buffer.

vvysotskyi · 2018-11-05T09:31:29Z

@jcmcote, ZStandard compression was added to the Hadoop library in HADOOP-13578. I think it would be better to use the existing, well-tested implementation instead of introducing a custom one.

jcmcote · 2018-11-06T21:38:07Z

> @jcmcote, ZStandard compression was added to the Hadoop library in HADOOP-13578. I think it would be better to use the existing, well-tested implementation instead of introducing a custom one.

Agreed. When will Drill pick up the new version of Hadoop? Is it a big deal to upgrade the version of Hadoop used?

vdiravka · 2018-11-07T12:20:40Z

@jcmcote There is a Jira ticket for Hadoop libs version update: DRILL-6540.
There is an issue related to commons-logging, see details.
Also there is my "work in progress" branch in the ticket.

vvysotskyi · 2018-11-07T12:53:53Z

@jcmcote, Is it possible to split this pull request into two parts: leave here only changes connected with Msgpack format reader, and continue work on Compression codecs in the scope of a separate Jira after upgrade of Hadoop library is done?

jcmcote · 2018-11-07T16:35:49Z

@vvysotskyi Sure I can split them up. Should be easy to do.

refactored the extension mechanism

avoiding creating objects and copying data into byte arrays not decoding field names but instead map lookup on bytes

it now uses a hashmap to avoid re-creating String for the map keys
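The optimization described in these commit notes (looking field names up by their raw bytes instead of decoding a new String each time) could look roughly like the sketch below. This is an assumption-laden illustration, not the PR's code: `FieldNameCacheSketch`, `fieldName`, and `decodeCount` are hypothetical names, and the PR's actual key type is not shown in this conversation. The sketch relies on `ByteBuffer` having content-based `equals` and `hashCode` over its remaining bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class FieldNameCacheSketch {

  // Keys compare by byte content, so equal names hit the same entry.
  private final Map<ByteBuffer, String> cache = new HashMap<>();

  /** Number of times a String actually had to be decoded (for illustration). */
  int decodeCount;

  String fieldName(byte[] buffer, int offset, int length) {
    // wrap() does not copy: the lookup key is a view over the input bytes.
    ByteBuffer key = ByteBuffer.wrap(buffer, offset, length);
    String name = cache.get(key);
    if (name == null) {
      name = new String(buffer, offset, length, StandardCharsets.UTF_8);
      decodeCount++;
      // Store a private copy so later reuse of the input buffer cannot corrupt the key.
      byte[] copy = Arrays.copyOfRange(buffer, offset, offset + length);
      cache.put(ByteBuffer.wrap(copy), name);
    }
    return name;
  }

  public static void main(String[] args) {
    FieldNameCacheSketch cache = new FieldNameCacheSketch();
    byte[] data = "name|name".getBytes(StandardCharsets.UTF_8);
    String first = cache.fieldName(data, 0, 4);
    String second = cache.fieldName(data, 5, 4);
    System.out.println(first + " " + (first == second) + " decodes=" + cache.decodeCount);
  }
}
```

A copy and a decode happen only the first time a field name is seen. Every later record with the same key bytes reuses the cached String.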

jcmcote · 2019-01-10T14:46:02Z

Hey @paul-rogers, I've made many code review fixes and improvements to the msgpack reader. Could you have another look at it? I would very much like to have it approved and made part of the main code base. Thanks!

arina-ielchiieva · 2019-01-10T14:54:08Z

@jcmcote, there is ongoing work to provide a schema using a file (https://issues.apache.org/jira/browse/DRILL-6835), so you might consider waiting for those changes to be published in order to use a common approach to reading and writing schema files.

jcmcote · 2019-01-10T15:00:15Z

okay sounds good

cgivre · 2019-07-21T20:06:06Z

Hi @jcmcote
Are you still interested in completing this PR? Recently, the enhanced vector format PRs were committed and could make this better and easier.

If you haven't seen this, here's a link to the tutorial by @paul-rogers https://github.com/paul-rogers/drill/wiki/EVF-Tutorial-Row-Batch-Reader.

cgivre · 2019-09-17T12:57:36Z

Hi @jcmcote Are you still interested in completing this PR?

jcmcote and others added 23 commits September 26, 2018 16:14

added a msgpack format plugin and a zstd codec

482dbc9

add support to skip over malformed MAP

c63fefa

add support for msgpack extended types fix issue with columns of INT which then encounter a BIGINT

moved project to format-msgpack

815f9db

returning error codes when parsing error detected

861730f

added test case for batch with no columns

16b03fb

avoiding creating the MapOrList instances

a536786

simple clean ups

7cc9e47

added support for learning schema and applying it

0044508

less verbose message when record is not valid

eb3efd5

testing complete model

cc26f2a

fixed bug in reader (array of array) new test case that require schema added useSchema property to turn off schema utilization

implemented apply schema so the reader knows what column types to output

3f00445

reader can skip records that do not match schema

d751fa4

when not using a schema the reader should ensureAtLeastOneField

c87a19d

increased batch size to 16k

fed1fb2

fixed import

d575bfd

clean up

9fbcfba

drill timestamp is in milliseconds

bade33e

ignoring junit test case which show internal drill issue

2594324

support to skip invalid elements in a list

cf20d98

consolidated printing warnings into context class

417e41e

support for coercing values according to schema

497e5e0

added discovery and loading of extended type readers

25169fa

Your Name added 2 commits October 30, 2018 12:49

refactored the writing of values

e558be9

coercing values into target schema types

refactored the writing of values

5629136

coercing values into target schema types

jcmcote changed the title ~~Msgpack format reader~~ DRILL-6820: Msgpack format reader Oct 31, 2018

work in progress

f2aff31

paul-rogers requested changes Nov 5, 2018

Your Name added 2 commits November 6, 2018 11:07

move schema to it's own package

b0fd31f

code review fixes

4d748fa

Your Name and others added 17 commits November 9, 2018 16:21

refactoring, documentation and better logging

ed7373c

documentation

d927316

added example of using msgpack

ba39754

unit test work

5e770be

unit test work

1cd9470

using ColumnMetadata

fc2c804

removed todo

8aedaec

performance test

8dd32d4

Merge remote-tracking branch 'upstream/master'

29affb8

performance improvements

cea0d12

Merge branch 'master' of https://github.com/apache/drill

a07dfb2

using a map to cache the field names

df7b378

refactored the extension mechanism

performance optimization

859d1fd

avoiding creating objects and copying data into byte arrays not decoding field names but instead map lookup on bytes

small code improvements

e1e802d

refactored FieldPathTracker

586a4d9

it now uses a hashmap to avoid re-creating String for the map keys

fix usage of drillbuf and tupleschema

8f1cd78

serialize schema as JSON with ordered childrens

3e552fa

working on msgpack

cba47ad

cgivre added the enhancement label (PRs that add a new functionality to Drill) Jan 24, 2019
