wukong-fs — old README

Swineherd-fs provides a common abstraction over the following filesystems:

  • file – Local file system. Only thoroughly tested on Ubuntu Linux.
  • hdfs – Hadoop distributed file system. Uses the Apache Hadoop 0.20 API. Requires JRuby.
  • s3 – Amazon Simple Storage Service (S3).
  • ftp – FTP (not yet implemented)
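
Each of these is exposed as its own class. A minimal sketch of instantiating each one (the S3 credentials are placeholders; see the Config section below):

require 'swineherd-fs'

localfs = Swineherd::LocalFileSystem.new
hdfs    = Swineherd::HadoopFileSystem.new   # JRuby only
s3      = Swineherd::S3FileSystem.new(:access_key => "my_access_key",
                                      :secret_key => "my_secret_key")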

All filesystem abstractions implement the following core functions, many taken from the UNIX filesystem (a short usage sketch follows the list):

  • mv
  • cp
  • cp_r
  • rm
  • rm_r
  • open
  • exists?
  • directory?
  • ls
  • ls_r
  • mkdir_p
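
For instance, a quick tour on the local filesystem (all paths here are hypothetical):

localfs = Swineherd::LocalFileSystem.new
localfs.mkdir_p('sandbox/data')                    # create nested directories
localfs.cp('input.txt', 'sandbox/data/input.txt')  # copy a file
localfs.ls('sandbox/data')                         # list directory contents
localfs.exists?('sandbox/data/input.txt')          # => true
localfs.directory?('sandbox/data')                 # => true
localfs.rm_r('sandbox')                            # remove recursively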

Note: Since S3 is just a key-value store, it is difficult to preserve the notion of a directory. The mkdir_p function therefore has little to do, as there can be no empty directories; currently it only ensures that the bucket exists. This implies that the directory? test only succeeds if the directory is non-empty, which clashes with the UNIX notion of a directory.
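
A sketch of these semantics, with a hypothetical bucket name and assuming the unschemed bucket/key path form:

s3 = Swineherd::S3FileSystem.new
s3.mkdir_p('my_bucket/logs')      # only ensures the bucket 'my_bucket' exists
s3.directory?('my_bucket/logs')   # false -- no keys exist under the 'logs' prefix
s3.directory?('my_bucket/data')   # true only if some key exists under 'data'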

Additionally, the S3 and HDFS abstractions implement functions for copying files to and from the local filesystem:

  • copy_to_local
  • copy_from_local

Note: For these methods the destination and source path, respectively, are assumed to be local, so they do not have to be prefixed with a file scheme.
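
For example, with HDFS (paths here are hypothetical):

hdfs = Swineherd::HadoopFileSystem.new
# remote source, local destination -- the local path carries no scheme
hdfs.copy_to_local('/data/output/part-00000', '/tmp/part-00000')
# local source, remote destination
hdfs.copy_from_local('/tmp/input.tsv', '/data/input/input.tsv')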

The Swineherd::FileSystem module implements a generic filesystem abstraction using schemed filepaths (hdfs://, s3://, file://).
Currently only the following methods are supported for Swineherd::FileSystem:

  • cp
  • exists?

For example, instead of doing the following:

hdfs = Swineherd::HadoopFileSystem.new
localfs = Swineherd::LocalFileSystem.new
hdfs.copy_to_local('foo/bar/baz.txt', 'foo/bar/baz.txt') unless localfs.exists? 'foo/bar/baz.txt'

You can do:

fs = Swineherd::FileSystem
fs.cp('hdfs://foo/bar/baz.txt', 'foo/bar/baz.txt') unless fs.exists?('foo/bar/baz.txt')

Note: A path without a scheme is treated as a path on the local filesystem; use the explicit file:// scheme for clarity. The following are equivalent:

fs.exists?('foo/bar/baz.txt')
fs.exists?('file://foo/bar/baz.txt')

Config

  • In order to use the S3FileSystem, Swineherd requires AWS S3 access credentials.
  • Put them in ~/swineherd.yaml or /etc/swineherd.yaml:

aws:
  access_key: my_access_key
  secret_key: my_secret_key

  • Or pass them in when creating the instance:

s3 = Swineherd::S3FileSystem.new(:access_key => "my_access_key", :secret_key => "my_secret_key")

JRuby for HadoopFileSystem

  • jruby -S gem install swineherd-fs
  • You will be warned about jruby-openssl if you do not have it installed; you should install that gem as well:

JRuby limited openssl loaded. http://jruby.org/openssl
gem install jruby-openssl for full support.
  • Try it out in jruby -S irb:

require 'swineherd-fs'
hdfs = Swineherd::FileSystem.get(:hdfs)
hdfs.ls("/user/dsnyder/")
=> ["hdfs://{machine-ip}/user/dsnyder/foo.txt"]
hdfs.exists?("/user/dsnyder/foo.txt")
=> true