Will Cloudpathlib support HDFS path? #394

Open
daviddwlee84 opened this issue Jan 19, 2024 · 2 comments

@daviddwlee84

I found that cloudpathlib currently only supports the prefixes ['az://', 's3://', 'gs://'].
Is there any plan to support HDFS paths (hdfs://) in the future?
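
For illustration, this is roughly what happens today (the exception name is what I see in current versions; the URI is just a placeholder):

from cloudpathlib import CloudPath
from cloudpathlib.exceptions import InvalidPrefixError

try:
    # hdfs:// is not among the registered prefixes (az://, gs://, s3://)
    CloudPath("hdfs://namenode:8020/user/me/data.parquet")
except InvalidPrefixError as err:
    print(err)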

@pjbull
Member

pjbull commented Jan 19, 2024

Hi @daviddwlee84, we're open to supporting HDFS as a provider. None of the core developers use it regularly, so it would be great to have it from a contributor.

A few questions:

  • What's the best Python library for interacting with HDFS? There seem to be quite a few options.
  • Is there a good Dockerfile for a single-node HDFS deployment? We'll want testing infrastructure that's easy to maintain.

Otherwise, the implementation will involve creating an HdfsClient and HdfsPath like the existing providers, plus a test rig, a mocked backend for unit testing, and any provider-specific tests.
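
To sketch the shape (untested; the exact base-class hooks and registration helpers should be copied from the existing S3/GS/Azure implementations rather than from this outline):

from cloudpathlib.client import Client, register_client_class
from cloudpathlib.cloudpath import CloudPath, register_path_class


@register_client_class("hdfs")
class HdfsClient(Client):
    """Hypothetical client wrapping an HDFS library (e.g. pyarrow.fs.HadoopFileSystem)."""

    def _exists(self, cloud_path: "HdfsPath") -> bool: ...
    def _list_dir(self, cloud_path: "HdfsPath", recursive: bool): ...
    def _download_file(self, cloud_path: "HdfsPath", local_path): ...
    def _upload_file(self, local_path, cloud_path: "HdfsPath") -> "HdfsPath": ...
    def _move_file(self, src: "HdfsPath", dst: "HdfsPath") -> "HdfsPath": ...
    def _remove(self, cloud_path: "HdfsPath", missing_ok: bool = True) -> None: ...


@register_path_class("hdfs")
class HdfsPath(CloudPath):
    cloud_prefix: str = "hdfs://"
    client: "HdfsClient"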

@daviddwlee84
Author

Thanks @pjbull, I see.

For the first question, as far as I know, pyarrow.fs.HadoopFileSystem might be a good choice. (Some other libraries just wrap the Hadoop CLI, which requires complex environment setup and version matching.)

fsspec's HDFS implementation, for example, is built on top of it and could serve as a reference.
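
A minimal sketch of using it directly (the host and port are placeholders; it needs a local Hadoop install providing libhdfs, with JAVA_HOME, HADOOP_HOME, and the Hadoop CLASSPATH set):

from pyarrow import fs

# Connect to the namenode; "namenode" and 8020 are placeholder values.
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# List a directory and read a file back, roughly the operations an
# HDFS client for cloudpathlib would need to cover.
for info in hdfs.get_file_info(fs.FileSelector("/user/me")):
    print(info.path, info.type)

with hdfs.open_input_stream("/user/me/example.txt") as f:
    data = f.read()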

For the second question, I haven't run Hadoop in a container before.
I found big-data-europe/docker-hadoop (an Apache Hadoop Docker image), which might be usable, but the project is not very active.

For a single-node deployment, you need a matching Java and Hadoop version, plus minimal configuration in $HADOOP_HOME/etc/hadoop/hdfs-site.xml like:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///nvme/HDFS/HadoopName</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///nvme/HDFS/HadoopData</value>
    </property>
</configuration>

Then HDFS can be started with $HADOOP_HOME/sbin/start-dfs.sh.
I think these installation steps could be scripted in a Dockerfile fairly easily.
