HDFS via S3 connector

The currently recommended way for Hadoop to access SeaweedFS is the SeaweedFS Hadoop Compatible File System. It is the most efficient option because the client talks directly to the filer for metadata and directly to the volume servers for file content.

However, the downside is that you need to add the SeaweedFS client jar to the classpath and change some Hadoop settings.
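For reference, a minimal core-site.xml sketch of that setup, assuming the seaweedfs-hadoop3-client jar is already on the classpath (the implementation class and filer address below are the commonly documented values and may need adjusting for your deployment; see the Hadoop Compatible File System page for details):

<configuration>
    <property>
        <!-- SeaweedFS HCFS implementation class; verify against your client jar version -->
        <name>fs.seaweedfs.impl</name>
        <value>seaweed.hdfs.SeaweedFileSystem</value>
    </property>
    <property>
        <!-- point at your filer; localhost:8888 is only a placeholder -->
        <name>fs.defaultFS</name>
        <value>seaweedfs://localhost:8888</value>
    </property>
</configuration>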

Accessing SeaweedFS from HDFS via the S3A connector

The S3A connector is already included in Hadoop distributions, so you can use it directly.

Here is an example pom.xml snippet for a Spark job, using Hadoop 3.3.1 or later:

<properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.12.11</scala.version>
    <spark.version>3.1.2</spark.version>
    <hadoop.version>3.3.1</hadoop.version>
    <spark.pom.scope>compile</spark.pom.scope>
</properties>
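These properties alone do not declare the S3A connector itself. A sketch of the matching dependencies section (artifact choices and scopes here are assumptions and may need adjusting to your build; hadoop-aws pulls in S3AFileSystem and the bundled AWS SDK transitively):

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>${spark.version}</version>
        <scope>${spark.pom.scope}</scope>
    </dependency>
    <dependency>
        <!-- provides org.apache.hadoop.fs.s3a.S3AFileSystem -->
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>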

Then add the following to your code:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .master("local[*]")
    .config("spark.eventLog.enabled", "false")
    .config("spark.driver.memory", "1g")
    .config("spark.executor.memory", "1g")
    .appName("SparkDemoFromS3")
    .getOrCreate();

Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
// Credentials and endpoint of the SeaweedFS S3 gateway (listens on port 8333 by default)
hadoopConf.set("fs.s3a.access.key", "admin");
hadoopConf.set("fs.s3a.secret.key", "xx");
hadoopConf.set("fs.s3a.endpoint", "ip:8333");
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true");
// Buckets are not DNS-addressed, so use path-style access over plain HTTP
hadoopConf.set("fs.s3a.path.style.access", "true");
hadoopConf.set("fs.s3a.connection.ssl.enabled", "false");
// Disable bulk delete and keep directory markers
hadoopConf.set("fs.s3a.multiobjectdelete.enable", "false");
hadoopConf.set("fs.s3a.directory.marker.retention", "keep");
// Relax S3A change-detection checks, which assume AWS-style version metadata
hadoopConf.set("fs.s3a.change.detection.version.required", "false");
hadoopConf.set("fs.s3a.change.detection.mode", "warn");

// Read a file from a SeaweedFS bucket via s3a:// and write results back
RDD<String> rdd = spark.sparkContext().textFile("s3a://bk002/test1.txt", 1);
System.out.println(rdd.count());
rdd.saveAsTextFile("s3a://bk002/testcc/t2");
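The same S3A settings can also go into core-site.xml, so that plain HDFS clients (for example hadoop fs -ls s3a://bk002/) reach SeaweedFS without any Spark-specific code. A sketch, where the credentials and endpoint are placeholders to replace with your own:

<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>admin</value> <!-- placeholder credential -->
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>xx</value> <!-- placeholder credential -->
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>ip:8333</value> <!-- your SeaweedFS S3 gateway host:port -->
    </property>
    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>
    <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
    </property>
</configuration>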
