HDFS via S3 connector
The currently recommended way for Hadoop to access SeaweedFS is the SeaweedFS Hadoop Compatible File System, which is the most efficient option: the client talks directly to the filer for metadata and to the volume servers for file content.
The downside is that you need to add a SeaweedFS jar to the classpath and change some Hadoop settings (see the sketch below).
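For reference, the native route typically wires the SeaweedFS client into Hadoop along these lines. This is a minimal sketch: the filer host and port are placeholders, and the exact keys and jar name are documented on the Hadoop Compatible File System page.

```xml
<!-- core-site.xml: minimal sketch for the native SeaweedFS HCFS route.
     Assumes the SeaweedFS hadoop client jar is already on the classpath;
     the filer host/port below are placeholders. -->
<configuration>
  <property>
    <name>fs.seaweedfs.impl</name>
    <value>seaweed.hdfs.SeaweedFileSystem</value>
  </property>
  <property>
    <name>fs.seaweed.filer.host</name>
    <value>filer</value>
  </property>
  <property>
    <name>fs.seaweed.filer.port</name>
    <value>8888</value>
  </property>
</configuration>
```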
The S3A connector, by contrast, is already included in Hadoop distributions, so you can use it directly.
Here is an example Spark job pom.xml, using Hadoop 3.3.1 or later:
```xml
<properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.12.11</scala.version>
    <spark.version>3.1.2</spark.version>
    <hadoop.version>3.3.1</hadoop.version>
    <spark.pom.scope>compile</spark.pom.scope>
</properties>
```
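The properties alone don't pull in the S3A classes; a minimal dependency sketch to match them (assuming a Scala 2.12 Spark build, with `hadoop-aws` providing the S3A connector) could look like:

```xml
<!-- Minimal dependency sketch; versions come from the properties above. -->
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>${spark.version}</version>
        <scope>${spark.pom.scope}</scope>
    </dependency>
    <dependency>
        <!-- Pulls in the S3A connector (org.apache.hadoop.fs.s3a). -->
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
```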
Then add this to your code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.eventLog.enabled", "false")
        .config("spark.driver.memory", "1g")
        .config("spark.executor.memory", "1g")
        .appName("SparkDemoFromS3")
        .getOrCreate();

// Point the S3A connector at the SeaweedFS S3 gateway.
Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
hadoopConf.set("fs.s3a.access.key", "admin");
hadoopConf.set("fs.s3a.secret.key", "xx");
hadoopConf.set("fs.s3a.endpoint", "ip:8333");
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true");
// SeaweedFS S3 is addressed by path, not virtual-host-style bucket names.
hadoopConf.set("fs.s3a.path.style.access", "true");
hadoopConf.set("fs.s3a.connection.ssl.enabled", "false");
hadoopConf.set("fs.s3a.multiobjectdelete.enable", "false");
hadoopConf.set("fs.s3a.directory.marker.retention", "keep");
// Relax S3A's change detection, which expects AWS-style version metadata.
hadoopConf.set("fs.s3a.change.detection.version.required", "false");
hadoopConf.set("fs.s3a.change.detection.mode", "warn");

// Read, count, and write back through the s3a:// scheme.
RDD<String> rdd = spark.sparkContext().textFile("s3a://bk002/test1.txt", 1);
System.out.println(rdd.count());
rdd.saveAsTextFile("s3a://bk002/testcc/t2");
```
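If you prefer not to mutate the Hadoop configuration in job code, the same keys can instead be passed at session creation using Spark's standard `spark.hadoop.` prefix. A minimal sketch, with the same placeholder endpoint and credentials as above:

```java
// Equivalent S3A setup supplied up front via the spark.hadoop.* prefix,
// so the Hadoop configuration is never modified after startup.
SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("SparkDemoFromS3")
        .config("spark.hadoop.fs.s3a.endpoint", "ip:8333")
        .config("spark.hadoop.fs.s3a.access.key", "admin")
        .config("spark.hadoop.fs.s3a.secret.key", "xx")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
        .getOrCreate();
```

The same `spark.hadoop.*` settings can also go into `spark-defaults.conf` or on the `spark-submit` command line, which keeps credentials out of the source entirely.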