
DRILL-7745: Add storage plugin for IPFS #2084

Open · wants to merge 2 commits into master

Conversation

@dbw9580 commented May 31, 2020

Add storage plugin for IPFS. See detailed introduction here.

TODOs:

  • Port to Drill 1.18.0
  • Add more tests
  • Support JSON reader

Authors:

This storage plugin was contributed by Bowen Ding, Yuedong Xu and Liang Wang. The authors are affiliated with Fudan University.

@cgivre (Contributor) commented May 31, 2020

@dbw9580
Thanks for contributing this. Do you want review comments now?

@dbw9580 (Author) commented May 31, 2020

Yes, please.

Some major problems I'm working on are:

  • Support for different data formats. Currently only JSON files are supported; CSV support was removed because of changes introduced in v1.17 and v1.18. The original implementation was a copy-paste of the easy format plugin (org/apache/drill/exec/store/easy/text/TextFormatPlugin.java), and I wonder if there is a better way to reuse that code.
  • Support for CREATE TABLE statements. This will require changes to the parts of Drill responsible for SQL parsing, query plan generation, etc. I would appreciate your opinions on this.

@sanel commented May 31, 2020

Added some small refactoring ideas; did not check implementation details.

@cgivre (Contributor) commented Jun 2, 2020

@dbw9580
One more comment. You'll want to add your plugin to the distribution files so that it will be built when Drill is built.

You'll have to do that here:

drill/distribution/pom.xml (lines 300 to 304 in 7d5b611):

```xml
<groupId>org.apache.drill.contrib</groupId>
<artifactId>drill-opentsdb-storage</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
```

and here:

```xml
<include>org.apache.drill.exec:drill-jdbc:jar</include>
<include>org.apache.drill:drill-protocol:jar</include>
<include>org.apache.drill:drill-common:jar</include>
<include>org.apache.drill:drill-logical:jar</include>
<include>org.apache.drill.exec:vector:jar</include>
<include>org.apache.drill.memory:drill-memory-base:jar</include>
<include>org.apache.drill.exec:drill-rpc:jar</include>
<include>org.apache.drill.exec:drill-java-exec:jar</include>
<include>org.apache.drill.metastore:drill-metastore-api:jar</include>
<include>org.apache.drill.metastore:drill-iceberg-metastore:jar</include>
<include>org.apache.drill.metastore:drill-rdbms-metastore:jar</include>
<include>org.apache.drill.contrib.storage-hive:drill-storage-hive-core:jar</include>
<include>org.apache.drill.contrib.storage-hive:drill-hive-exec-shaded:jar</include>
<include>org.apache.drill.contrib.data:tpch-sample-data:jar</include>
<include>org.apache.drill.contrib:drill-mongo-storage:jar</include>
<include>org.apache.drill.contrib:drill-storage-hbase:jar</include>
<include>org.apache.drill.contrib:drill-format-mapr:jar</include>
<include>org.apache.drill.contrib:drill-format-syslog:jar</include>
<include>org.apache.drill.contrib:drill-format-esri:jar</include>
<include>org.apache.drill.contrib:drill-format-hdf5:jar</include>
<include>org.apache.drill.contrib:drill-format-ltsv:jar</include>
<include>org.apache.drill.contrib:drill-format-excel:jar</include>
<include>org.apache.drill.contrib:drill-format-spss:jar</include>
<include>org.apache.drill.contrib:drill-jdbc-storage:jar</include>
<include>org.apache.drill.contrib:drill-kudu-storage:jar</include>
<include>org.apache.drill.contrib:drill-storage-kafka:jar</include>
<include>org.apache.drill.contrib:drill-storage-http:jar</include>
<include>org.apache.drill.contrib:drill-opentsdb-storage:jar</include>
<include>org.apache.drill.contrib:drill-udfs:jar</include>
```
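For the IPFS plugin that means adding matching entries in both places. A minimal sketch, assuming the module's artifactId is `drill-storage-ipfs` (use whatever the plugin's own pom.xml actually declares):

```xml
<!-- distribution/pom.xml, alongside the other contrib storage plugins (artifactId assumed) -->
<dependency>
  <groupId>org.apache.drill.contrib</groupId>
  <artifactId>drill-storage-ipfs</artifactId>
  <version>${project.version}</version>
</dependency>

<!-- the jar includes list shown above, in the assembly descriptor (artifactId assumed) -->
<include>org.apache.drill.contrib:drill-storage-ipfs:jar</include>
```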

@cgivre self-assigned this Jun 2, 2020
@cgivre (Contributor) commented Jun 18, 2020

@dbw9580
We're looking to get a release of Drill out by the end of June. If you can get these revisions done soon, I can help expedite the review so that we can get this (or at least an MVP of this) into Drill 1.18.

@dbw9580 (Author) commented Jun 18, 2020

Yes, I'll try my best. I was stuck with CSV and writer support and haven't had much progress so far. Can we settle for basic JSON reader support for now and maybe add those later?

@cgivre (Contributor) commented Jun 18, 2020

@dbw9580
Why don't you break out CSV and the writer support as separate PRs? Then we can get this in and work on those for the next release.

I've never done a writer, but I can assist with the CSV reader. Take a look here:
https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpCSVBatchReader.java

The HTTP storage plugin can read either JSON or CSV. I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.

@dbw9580 (Author) commented Jun 18, 2020

break out CSV and the writer support as separate PRs.

Great. Will do.

I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.

Ok. The text reader module from the easy format plugin framework looks good, but I couldn't figure out a way to reuse that part of code in this plugin. Copy-pasting code is not accepted, I assume?

@cgivre (Contributor) left a comment

break out CSV and the writer support as separate PRs.

Great. Will do.

I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.

Ok. The text reader module from the easy format plugin framework looks good, but I couldn't figure out a way to reuse that part of code in this plugin. Copy-pasting code is not accepted, I assume?

Cutting and pasting really wouldn't be the best approach here, although there are unfortunately many examples of it in Drill. :-/

The issue I ran into with the HTTP plugin was that I had an InputStream and needed the text reader to accept an InputStream rather than a file. I attempted to modify the CompliantTextReader to accept an InputStream, but it was really complicated and I eventually gave up on that for the time being.

Your plugin looks similar in this regard. It looks like you're getting an InputStream from IPFS. In theory you can then send that to any Drill format reader that accepts InputStreams.

Bottom line: if you can refactor the CompliantTextReader to accept an InputStream, then it becomes trivial to use that class for your plugin. If you can't, I'd suggest "borrowing" code from the HTTP CSV plugin.
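For reference, the "borrowed" approach boils down to driving a plain univocity CsvParser (the same library the HTTP plugin's reader leans on) directly over the InputStream. A minimal sketch of just that parsing loop, with the IPFS wiring and the writes into Drill's row-set loader left out:

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class CsvStreamSketch {
  // Parse CSV rows from an InputStream (e.g. one obtained from IPFS), row by row.
  public static void parse(InputStream in) {
    CsvParserSettings settings = new CsvParserSettings();
    settings.setLineSeparatorDetectionEnabled(true);
    CsvParser parser = new CsvParser(settings);
    parser.beginParsing(new InputStreamReader(in, StandardCharsets.UTF_8));
    String[] header = parser.parseNext();  // the first row is usually treated as the header
    String[] row;
    while ((row = parser.parseNext()) != null) {
      // a real batch reader would copy each field of 'row' into the matching Drill column writer
    }
    parser.stopParsing();
  }
}
```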

Start the service:
```
systemctl start drill-embedded.service
```
Contributor left a comment

Please create a separate JIRA to add documentation to the Drill website. That does not have to be part of this PR, but otherwise nobody will know about this. ;-).

contrib/storage-ipfs/pom.xml: outdated review thread (resolved)
@cgivre (Contributor) left a comment

Thanks for submitting and these revisions look good.

@cgivre (Contributor) commented Jun 19, 2020

@dbw9580
I had another thought about the design. I wonder whether, instead of "rolling your own" readers, you could do what the file plugin does and make use of all the existing format plugins. Basically, that way any new file format that has a format plugin could also be read from IPFS. I don't know IPFS well enough to know whether that would even make sense.

That also might simplify the code a lot.

@@ -0,0 +1,184 @@
# Drill Storage Plugin for IPFS
Member left a comment

Please keep only README.md in English, since it would be problematic to update it for other developers.

Contributor left a comment

@vvysotskyi
Can they put a link to the Chinese version somewhere?

Member left a comment

But in that case, documentation hosted outside the project may also become outdated.

contrib/storage-ipfs/pom.xml: outdated review thread (resolved)
@dbw9580 marked this pull request as ready for review June 22, 2020 18:48
@cgivre changed the title from "[WIP] DRILL-7745: Add storage plugin for IPFS" to "DRILL-7745: Add storage plugin for IPFS" Jun 22, 2020
@cgivre (Contributor) left a comment

@dbw9580
This is definitely making progress. Will testing require an IPFS installation?


A live demo: <http://www.datahub.pub/> hosted on a private cluster of Minerva.

Note that it's still in the early stages of development, and the overall stability and performance are not satisfactory. PRs are very much welcome!
Contributor left a comment

When we're ready to commit, can you please update the docs and remove all the language about building Drill etc.

@dbw9580 (Author) commented Jun 23, 2020

@dbw9580
This is definitely making progress. Will testing require an IPFS installation?

Yes, a running IPFS daemon is required.

@cgivre (Contributor) commented Jun 23, 2020

@dbw9580
This is definitely making progress. Will testing require an IPFS installation?

Yes, a running IPFS daemon is required.

Can you write some tests that do not require the IPFS daemon? I realize that we'll need it for some unit tests, but is it possible to test various components w/o the daemon?

@dbw9580 (Author) commented Jun 23, 2020

some tests that do not require the IPFS daemon

I guess the storage plugin config, scanSpec, etc. are static and independent of a running IPFS daemon, so those can be tested without one. For the JSON reader, I think I can supply test data from memory directly, without actually retrieving it from IPFS.
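For instance, a daemon-free test could just check that a plugin config snippet parses as expected. This is only a rough sketch, and the JSON keys below are made up for illustration rather than the plugin's actual config schema:

```java
import static org.junit.Assert.assertEquals;

import java.util.Map;

import org.junit.Test;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TestIPFSConfigParsing {

  @Test
  public void configJsonParsesWithoutDaemon() throws Exception {
    // No IPFS daemon needed: this only exercises JSON parsing of a config snippet.
    String json = "{\"type\": \"ipfs\", \"host\": \"127.0.0.1\", \"port\": 5001}";
    ObjectMapper mapper = new ObjectMapper();
    Map<String, Object> config = mapper.readValue(json, new TypeReference<Map<String, Object>>() {});
    assertEquals("ipfs", config.get("type"));
    assertEquals(5001, ((Number) config.get("port")).intValue());
  }
}
```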

@dbw9580 (Author) commented Aug 13, 2020

If I leave an instance of Drill running and then run the unit test (TestIPFSQueries), then it passes. I think the unit test does not actually build and run a full Drill server, which is why the connections are rejected.

@cgivre (Contributor) commented Aug 13, 2020

@dbw9580
The ClusterTest class is supposed to start a Drill cluster so that you can execute queries. You should not need to have a Drill cluster running for the unit tests to complete.

I think the reason this isn't doing what you're expecting is that in the initIPFS function in IPFSTestSuit you are creating a plugin with a null configuration, so it isn't initializing correctly.

I stepped through testSimple() with the debugger and the dataset object is null, hence the test fails. My suspicion is that there is one small step being missed here. Could you please take a look and step through this to make sure that Drill is being initialized correctly?
Thanks

@dbw9580 (Author) commented Aug 14, 2020

@cgivre Does Drill support connections from IPv6 sockets? Is it enabled by default or do I have to toggle some configuration items? The "protocol family unavailable" error could be due to lack of support for IPv6.

@cgivre (Contributor) commented Aug 14, 2020

@dbw9580 I believe Drill does support connections from IPv6 sockets. There was a recent PR for this in fact: (#1857) but I'm not sure if that is directly relevant.
Were you able to get it working?

@dbw9580 (Author) commented Aug 14, 2020

The connection rejected: /127.0.0.1:31011 failure was because Drill sometimes does not bind to the default ports (31010, 31011, 31012). It can bind to later ports like 31013, 31014, 31015 instead, hence the connection was rejected.

I believe the reason Drill didn't bind to the default ports is that those ports were still held by the process from the last test run and had not yet been released by the system. If I wait a minute or two before starting another round of testing, the test is likely to pass.

This is part of DRILL-7754, but I haven't come up with a plan to reliably store the port info in IPFS.
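For what it's worth, whether the old test run still holds those ports (or they are stuck in TIME_WAIT) can be checked with something like the following; the command is just an illustration, not part of the PR:

```
ss -tan | grep ':3101'
```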

@dbw9580 (Author) commented Aug 14, 2020

@dbw9580 I believe Drill does support connections from IPv6 sockets. There was a recent PR for this in fact: (#1857) but I'm not sure if that is directly relevant.
Were you able to get it working?

I don't see Drill binding to any IPv6 address in ss -6ltnp. I blocked IPv6 addresses in 9494a30 and the tests are now passing (most of the time, due to #2084 (comment)).

@cgivre (Contributor) commented Aug 14, 2020

@dbw9580
The unit tests are passing now on my machine.

@dbw9580 (Author) commented Aug 14, 2020

@cgivre
I tried to set the ports to their default values in c090a43, but it did not seem to do the trick. Why is that?

@dbw9580 (Author) commented Aug 16, 2020

I found that if I run the tests with mvn clean test -DforkMode=never, the "port already in use" errors are gone. I have no idea why.

@cgivre (Contributor) commented Aug 16, 2020

Hi @dbw9580
Thanks for these updates. I didn't have any issues running your unit tests before this. However, I took a look at the Maven docs, and I'm wondering if you can specify the number of forks directly in the pom.xml file.
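For example, something along these lines in the plugin's pom.xml should have the same effect as -DforkMode=never (newer Surefire versions express that as forkCount=0); this is just a sketch, I haven't tried it against this module:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- run the tests inside the Maven JVM, like -DforkMode=never -->
    <forkCount>0</forkCount>
  </configuration>
</plugin>
```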

@dbw9580 (Author) commented Aug 16, 2020

Hi @dbw9580
Thanks for these updates. I didn't have any issues running your unit tests before this. However, I took a look at the Maven docs, and I'm wondering if you can specify the number of forks directly in the pom.xml file.

Thanks!!

@cgivre (Contributor) commented Aug 17, 2020

@dbw9580
Can you please rebase on current master as well?

@dbw9580 (Author) commented Aug 17, 2020

@cgivre I think it's based on the current master already.

@cgivre (Contributor) commented Aug 17, 2020

@dbw9580
This is looking pretty good. I'm going to do a final check this evening or tomorrow, but can you please:

  1. Squash all commits and use DRILL-7745: Add storage plugin for IPFS as the commit message.
  2. Go through and do a final code-hygiene check (make sure there are no extra spaces, commented-out blocks, etc.). Drill does have a code formatter; just verify that the code complies with the coding standards for spacing and all that. (I didn't see anything jump out at me, but it always helps to double-check.)
  3. Please create a JIRA to add this to the public documentation. You're welcome to actually add the documentation as well, but for now let's just make sure we have a JIRA on file to add the docs.

Thanks!

@dbw9580 (Author) commented Aug 17, 2020

@cgivre sure, but I have to do these tomorrow (it's now midnight in my timezone). And maybe allow some time for the IPFS API repo to release an official version: ipfs-shipyard/java-ipfs-http-client#172 (comment) ?

.setState(DrillbitEndpoint.State.ONLINE)
.build();
//DRILL-7777: how to safely remove endpoints that are no longer needed once the query is completed?
ClusterCoordinator.RegistrationHandle handle = coordinator.register(ep);
Member left a comment

@dbw9580, could you please explain why it is required to register a Drillbit endpoint? It is prohibited to do this anywhere except the place where the Drillbit is starting. When the endpoint is registered, it may be misused when executing other queries. Also, the same node may run several group scans, which will fail in this case because the required ports will already be in use.

@dbw9580 (Author) left a comment

I need to tell Drill that IPFS peers which are also running Drill can be used to execute queries in a distributed fashion, so these Drillbit endpoints are created on the fly. Maybe we could limit these dynamically created endpoints to be used by this plugin only?

Member left a comment

Drill should obtain all the info about the source of the data only from the query plan and minor fragments. Please take a look at existing storage plugins, or even file format plugins to see how it is implemented there.

@dbw9580 (Author) left a comment

The problem is that, unlike other types of storage plugins, where the Drillbits are known when Drill starts because they reside in the same cluster and are managed by a coordinator, the Drillbits in this plugin are remote IPFS peers associated with a particular query, so they can only be known when the user runs a query.

The complete workflow of this plugin is (a rough sketch of steps 2 and 3 follows below):

  1. the user inputs an SQL statement that specifies an IPFS path to the target table of the query;
  2. this plugin resolves the path to an IPFS object and finds its "providers", i.e. IPFS nodes which store the target object, keeping only those that are running Drill (Drill-ready);
  3. these nodes are registered as Drillbit endpoints, and the query plan is sent to them;
  4. these nodes execute the query plan and return results.

I made some slides to illustrate the basic idea: https://www.slideshare.net/BowenDing4/minerva-ipfs-storage-plugin-for-ipfs, starting on slide 10.

I understand that the way this storage plugin works may break Drill's existing model, but I couldn't find a plugin that works in a similar way, and Drill's internals are too complex for me to work through completely. Could you please be more specific about how this plugin can be incompatible with other queries?
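In pseudocode, steps 2 and 3 look roughly like this; the helper names (resolvePathToObject, findProviders, isDrillReady, buildEndpoint) are illustrative, not the plugin's actual API:

```java
// Illustrative sketch only: resolve providers for an IPFS object and register the
// Drill-ready ones as endpoints for this query.
List<DrillbitEndpoint> resolveExecutors(String ipfsPath) {
  Multihash target = resolvePathToObject(ipfsPath);      // step 2: IPFS path -> object
  List<DrillbitEndpoint> endpoints = new ArrayList<>();
  for (IPFSPeer peer : findProviders(target)) {          // nodes that store the object
    if (isDrillReady(peer)) {                            // keep only peers also running Drill
      DrillbitEndpoint ep = buildEndpoint(peer);         // host and ports advertised by the peer
      coordinator.register(ep);                          // step 3: register the endpoint on the fly
      endpoints.add(ep);
    }
  }
  return endpoints;                                      // plan fragments are then sent to these
}
```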

Contributor left a comment

@dbw9580,
How would this plugin work if you were joining data from IPFS with data from another storage plugin? Would that break anything?

I'm wondering whether there is some way to mark an endpoint as IPFS-only, or even as belonging to a particular query only, so that it could not be misused; that might answer @vvysotskyi's concerns.

Contributor left a comment

@dbw9580,
Perhaps try some variations like an aggregate query, or a query that combines multiple data sources AND has subqueries. See if you can break it. ;-)

My hunch is that we can find a way to resolve @vvysotskyi's concerns. This is a new (and interesting) use case, and we should find a way to include it.

@dbw9580 (Author) commented Aug 20, 2020

I don't know how to produce a test case that will cause it to fail. I tested successfully on a two-node cluster with queries that involve data from the HTTP storage plugin, the classpath plugin, and this plugin, and that combine join, filter, and sort operators as well as nested subqueries. If @vvysotskyi could provide a test case showing that these dynamically added endpoints can be a problem, I can look into it and see what solutions we can find.

Member left a comment

@dbw9580, the case I mentioned may be reproduced in a cluster with several nodes, when this plugin registers an endpoint for a node that has no running Drillbit belonging to the current Drill cluster. In that case, the endpoint for that node will be registered. Assume that at this moment another query is submitted: Drill will try to send a plan fragment to this node, and that will cause problems since no Drillbit is actually running there.
Please take a look at the logic in BlockMapBuilder, where it decides which Drillbit will execute a specific CompleteWork instance.

Member left a comment

I think the issue with the unit tests you observed previously was a case similar to this.

@dbw9580 (Author) left a comment

@vvysotskyi @cgivre I think refactoring the cluster coordinator and the planner is beyond me. I actually tried to create an endpoint registry that manages endpoints according to the plugin they are registered with, but that didn't work out. The attempt is here: bdchain@85e7d2a
In 0c34b56 I added a manual switch in the config that lets the user choose whether to run queries in distributed mode. If it is set to false, no endpoints will be registered on the fly, and the plan will be executed by the foreman, just like the way the HTTP storage plugin works. I hope that by doing so (and clarifying it in the docs) we can mitigate the problem a bit.
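For illustration, the storage plugin config with that switch would look roughly like this; the key names are illustrative, and the actual keys are whatever the plugin's config class defines:

```json
{
  "type": "ipfs",
  "host": "127.0.0.1",
  "port": 5001,
  "distributed-mode": false,
  "enabled": true
}
```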

@cgivre (Contributor) commented Aug 20, 2020

@dbw9580
What would happen in the following scenario: user A executes a query using IPFS, which spins up new Drillbits, and user B then executes a query that does not use IPFS. If these two queries run concurrently, could user B's query end up on the IPFS Drillbits and then either not find data or cause some other problem?

Alternatively, what happens if user B executes a query while user A's IPFS queries are running, and user A's query completes before user B's? Would it tear down the Drillbits and cause a crash?

I'm asking because I really don't know here.

@dbw9580 (Author) commented Sep 6, 2020

The test failure looks unrelated to this change.

@cgivre (Contributor) commented Dec 1, 2020

@dbw9580
Hi there! I hope all is well. Are you still interested in completing this PR?

@dbw9580 (Author) commented Dec 3, 2020

@dbw9580
Hi there! I hope all is well. Are you still interested in completing this PR?

Yes. I'm currently busy with other projects and haven't had time to look further into this. I remember we were discussing the way this plugin interacts with the Drill coordinator, which needs a major design reconsideration. When I can spare more time for this, I will continue where I left off.

@cgivre (Contributor) commented Apr 26, 2022

@dbw9580 I was looking at this again after someone tagged me in an issue, but I think there is a WAY easier way to get this done. Take a look at this PR that I submitted to add the Dropbox file system to Drill: (#2337). It re-uses a lot of Drill's internals so that you don't have to deal with all the format readers.

Labels: documentation, enhancement, new-storage