-
Notifications
You must be signed in to change notification settings - Fork 979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-8474: Adding Daffodil to Drill as a contrib. #2909
base: master
Are you sure you want to change the base?
Conversation
Requires Daffodil version 3.7.0 or higher. New format-daffodil module created Still uses absolute paths for the schemaFileURI. (which is cheating. Wouldn't work in a true distributed drill environment.) We have yet to work out how to enable Drill to provide access for DFDL schemas in XML form with include/import to be resolved. The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice. Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects. Test show this works for data as complex as having nested repeating sub-records. These DFDL types are supported: - int - long - short - byte - boolean - double - float (does not work. Bug DAFFODIL-2367) - hexBinary - string apache#2835
<build> | ||
<plugins> | ||
<plugin> | ||
<artifactId>maven-resources-plugin</artifactId> | ||
<executions> | ||
<execution> | ||
<id>copy-java-sources</id> | ||
<phase>process-sources</phase> | ||
<goals> | ||
<goal>copy-resources</goal> | ||
</goals> | ||
<configuration> | ||
<outputDirectory>${basedir}/target/classes/org/apache/drill/exec/store/daffodil | ||
</outputDirectory> | ||
<resources> | ||
<resource> | ||
<directory>src/main/java/org/apache/drill/exec/store/daffodil</directory> | ||
<filtering>true</filtering> | ||
</resource> | ||
</resources> | ||
</configuration> | ||
</execution> | ||
</executions> | ||
</plugin> | ||
</plugins> | ||
</build> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this section do? I noticed that the msaccess contrib doesn't have this section in its pom.xml.
This fails its tests due to a maven checkstyle failure. It's complaining about Drill:Exec:Vectors, which my code has no changes to. Can someone advise on what is wrong here? |
/home/runner/work/drill/drill/exec/vector/src/main/java/org/apache/drill/exec/record/metadata/MapBuilder.java:201:5:
|
*/ | ||
@Override | ||
public void resume() { | ||
if (Objects.isNull(parent)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbeckerle I don't know why the checkstyle is telling you the wrong file, but here, you'll need braces as well as at line 203.
ie:
if (parent instanceof MapBuilder) {
resumeMap();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbeckerle Confirmed. I successfully built your branch by adding the aforementioned braces. I'll save you some additional trouble. There's another check style violation in DaffodilBatchReader
. Drill doesn't like star imports for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the maven command line to just make it run this checkstyle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just built Drill using the following command:
mvn clean install -DskipTests
When I did that, I was getting the same error as on GitHub. After adding the braces as described above, it built without issues.
With that said, I think you can do just run the check style with:
mvn checkstyle:checkstyle
Tests are now failing due to these two things in TestDaffodilReader.scala
That's an absolute URI that is used to obtain access to the schema files in this statement:
This is assembling a select statement, and puts this absolute schemaURI into the schemaURI part of the select. What should I be doing to arrange for these schema URIs to be found. The schemas are a large complex set of files, not just a single file. Many files must be found relative to the initial root schema file. (Hundreds of files potentially). As they include/import other schema files using relative paths. |
Hi Mike,
Are you free at all this week? My apologies... We're in the middle of putting an offer on a house and my life is very hectic at the moment.
Best,
-- C
… On Apr 28, 2024, at 10:11 PM, Mike Beckerle ***@***.***> wrote:
Tests are now failing due to these two things in TestDaffodilReader.scala
String schemaURIRoot = "file:///opt/drill/contrib/format-daffodil/src/test/resources/";
That's an absolute URI that is used to obtain access to the schema files in this statement:
private String selectRow(String schema, String file) {
return "SELECT * FROM table(dfs.`data/" + file + "` " + " (type => 'daffodil'," + " " +
"validationMode => 'true', " + " schemaURI => '" + schemaURIRoot + "schema/" + schema +
".dfdl.xsd'," + " rootName => 'row'," + " rootNamespace => null " + "))";
}
This is assembling a select statement, and puts this absolute schemaURI into the schemaURI part of the select.
What should I be doing to arrange for these schema URIs to be found.
The schemas are a large complex set of files, not just a single file. Many files must be found relative to the initial root schema file. (Hundreds of files potentially). As they include/import other schema files using relative paths.
—
Reply to this email directly, view it on GitHub <#2909 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABKB7PT327D7FTY34D7Z6ULY7WT5DAVCNFSM6AAAAABG4LUKI6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBRG44DCNJUGY>.
You are receiving this because you commented.
|
Lots of availability. I'll send you separate email. |
*/ | ||
public class DaffodilMessageParser { | ||
|
||
private static final Logger logger = LoggerFactory.getLogger(DaffodilMessageParser.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logger needs to be Drill's logger for log messages originating in the drill bit.
rootName); | ||
} | ||
try { | ||
dmp.loadSchema(schemaFileURI); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This right here is depending on the schemaFileURI being class path relative so that the schema, and things referenced from within the schema, are found on the classpath. Daffodil makes extensive use of Java's Service Provider Interface (SPI) to dynamically load its own flavor of UDFs, custom charsets, and "layer" definitions, all of which are compiled java/scala code in jars and those jars can have other dependent jars, and so on.
So somehow the drill user has to be able to specify a path of multiple schema jar files, in order (the order can matter), and that has to end up on the class path of the drill bits so that this loadSchema call can work.
private List<Diagnostic> compileSchema(URI schemaFileURI, String rootName, String rootNS) | ||
throws URISyntaxException, IOException, CompileFailure { | ||
Compiler c = Daffodil.compiler(); | ||
ProcessorFactory pf = c.compileSource(schemaFileURI, rootName, rootNS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema source files can be spread over multiple jars (sometimes dozens of files), and they contain relative paths within them that must resolve relative to that same classpath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This compilation really wants to happen once on a central node and the resulting binary schema file wants to be cached somewhere that gets copied out to the drill bit locations.
Right now this either loads a binary DFDL schema, which needs to happen on the drill bits, or it needs to compile a DFDL schema source (which needs to happen centrally) to create the binary DFDL schema, which then gets loaded by the drill bits.
This has to get refactored because the drill bits should never be compiling a DFDL schema. They should only be loading up pre-compiled binary schemas.
DRILL-8474: Adding Daffodil to Drill as a contrib.
This PR replaces: #2836 which is closed. That was to retain history/comments while squashing numerous debug-related commits together into this PR.
Description
Requires Daffodil version 3.7.0 or higher.
New format-daffodil module created
Still uses absolute paths for the schemaFileURI.
(which is cheating. Wouldn't work in a true distributed drill environment.)
We have yet to work out how to enable Drill to provide access for DFDL schemas in XML form with include/import to be resolved.
The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.
Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects.
Test show this works for data as complex as having nested repeating sub-records.
These DFDL types are supported:
Documentation
TBD: feature is incomplete still. It will require substantial documentation for users.
Testing
See tests under src/test in the new daffodil contrib module.