Will Cloudpathlib support HDFS path? #394

Open
daviddwlee84 opened this issue Jan 19, 2024 · 2 comments

@daviddwlee84

I found that cloudpathlib currently only supports the prefixes ['az://', 's3://', 'gs://'].
Is there any plan to support HDFS paths (hdfs://) in the future?
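
For illustration, this is roughly what happens today (the exception name is what I see in current versions; the URI is just a placeholder):

from cloudpathlib import CloudPath
from cloudpathlib.exceptions import InvalidPrefixError

try:
    # hdfs:// is not among the registered prefixes (az://, gs://, s3://)
    CloudPath("hdfs://namenode:8020/user/me/data.parquet")
except InvalidPrefixError as err:
    print(err)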

@pjbull
Member

pjbull commented Jan 19, 2024

Hi @daviddwlee84, we're open to supporting HDFS as a provider. None of the core developers use it regularly, so it would be great to have it from a contributor.

A few questions:

  • What's the best Python library for interacting with HDFS? There seem to be quite a few options.
  • Is there a good Dockerfile for a single-node HDFS deployment? We'll want testing infrastructure that's easy to maintain.

Otherwise, the implementation will involve creating an HdfsClient and HdfsPath like the existing providers, plus a test rig, a mocked backend for unit testing, and any provider-specific tests.
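
To sketch the shape (untested; the exact base-class hooks and registration helpers should be copied from the existing S3/GS/Azure implementations rather than from this outline):

from cloudpathlib.client import Client, register_client_class
from cloudpathlib.cloudpath import CloudPath, register_path_class


@register_client_class("hdfs")
class HdfsClient(Client):
    """Hypothetical client wrapping an HDFS library (e.g. pyarrow.fs.HadoopFileSystem)."""

    def _exists(self, cloud_path: "HdfsPath") -> bool: ...
    def _list_dir(self, cloud_path: "HdfsPath", recursive: bool): ...
    def _download_file(self, cloud_path: "HdfsPath", local_path): ...
    def _upload_file(self, local_path, cloud_path: "HdfsPath") -> "HdfsPath": ...
    def _move_file(self, src: "HdfsPath", dst: "HdfsPath") -> "HdfsPath": ...
    def _remove(self, cloud_path: "HdfsPath", missing_ok: bool = True) -> None: ...


@register_path_class("hdfs")
class HdfsPath(CloudPath):
    cloud_prefix: str = "hdfs://"
    client: "HdfsClient"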

@daviddwlee84
Author

Thanks @pjbull, I see.

For the first question, as far as I know, pyarrow.fs.HadoopFileSystem might be a good choice. (Some other libraries just wrap the Hadoop CLI, which requires complex environment setup and version matching.)

fsspec's HDFS implementation, for example, is built on top of it and could serve as a reference.
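
A minimal sketch of using it directly (the host and port are placeholders; it needs a local Hadoop install providing libhdfs, with JAVA_HOME, HADOOP_HOME, and the Hadoop CLASSPATH set):

from pyarrow import fs

# Connect to the namenode; "namenode" and 8020 are placeholder values.
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# List a directory and read a file back, roughly the operations an
# HDFS client for cloudpathlib would need to cover.
for info in hdfs.get_file_info(fs.FileSelector("/user/me")):
    print(info.path, info.type)

with hdfs.open_input_stream("/user/me/example.txt") as f:
    data = f.read()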

For the second question, I haven't run Hadoop in a container before.
I found big-data-europe/docker-hadoop (an Apache Hadoop Docker image), which might be usable, but the project is not very active.

For a single-node deployment, you need a matching Java and Hadoop version, plus minimal configuration in $HADOOP_HOME/etc/hadoop/hdfs-site.xml like:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///nvme/HDFS/HadoopName</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///nvme/HDFS/HadoopData</value>
    </property>
</configuration>

Then HDFS can be started with $HADOOP_HOME/sbin/start-dfs.sh.
I think these installation steps could be scripted in a Dockerfile fairly easily.
