Filer Change Data Capture
Is it too much to dream of something like inotify for a distributed file system? Not really!
Actually, SeaweedFS can give you more!
You can continuously watch the SeaweedFS meta data changes. For example, this command starts from 3 hours ago and pipes the changes through jq to show only the new entry of each event, such as a newly created file:
```
weed filer.meta.tail -timeAgo=3h | jq .eventNotification.newEntry
```
which will return:
```
{
  "name": "abc.png",
  "chunks": [
    {
      "size": "941248",
      "mtime": "1611297248363702000",
      "eTag": "2848d811982973ffda34cf8c8599e3f6",
      "fid": {
        "volumeId": 23,
        "fileKey": "155320",
        "cookie": 2256694723
      }
    }
  ],
  "attributes": {
    "fileSize": "941248",
    "mtime": "1611297248",
    "fileMode": 432,
    "uid": 502,
    "gid": 20,
    "crtime": "1611297248",
    "mime": "image/png",
    "replication": "000",
    "md5": "KEjYEZgpc//aNM+MhZnj9g=="
  }
}
```
The rest of the output has been truncated for brevity.
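The fields in this sample can be decoded with a few lines of code. Here is a minimal Python sketch that uses only values copied from the output above. It parses the quoted numbers as integers, shows `fileMode` in octal, converts the timestamps, and compares the base64 `md5` with the chunk's `eTag`. Reading `attributes.mtime` as seconds and the chunk `mtime` as nanoseconds is an inference from the size of the sample values, not something the output states.

```python
import base64
from datetime import datetime, timezone

# Values copied from the sample entry above.
attributes = {
    "fileSize": "941248",
    "mtime": "1611297248",
    "fileMode": 432,
    "md5": "KEjYEZgpc//aNM+MhZnj9g==",
}
chunk = {
    "mtime": "1611297248363702000",
    "eTag": "2848d811982973ffda34cf8c8599e3f6",
}

# Large integers appear as JSON strings in the sample, so convert them first.
file_size = int(attributes["fileSize"])

# fileMode is printed in decimal; the low 9 bits are the permission bits.
mode_octal = oct(attributes["fileMode"] & 0o777)

# Assumption: attributes.mtime is in seconds since the Unix epoch.
modified = datetime.fromtimestamp(int(attributes["mtime"]), tz=timezone.utc)

# Assumption: the chunk mtime is in nanoseconds since the Unix epoch.
chunk_seconds = int(chunk["mtime"]) // 1_000_000_000

# The md5 attribute is base64 encoded; hex is easier to compare.
md5_hex = base64.b64decode(attributes["md5"]).hex()

print(file_size, mode_octal, modified.isoformat(), md5_hex == chunk["eTag"])
```

For this single-chunk sample, the decoded `md5` is the same hex string as the chunk's `eTag`, and both timestamps fall in the same second.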
See the help:
```
$ weed filer.meta.tail -h
Example: weed filer.meta.tail [-filer=localhost:8888] [-target=/]
Default Usage:
  -es string
        comma-separated elastic servers http://<host:port>
  -es.index string
        ES index name (default "seaweedfs")
  -filer string
        filer hostname:port (default "localhost:8888")
  -pathPrefix string
        path to a folder or file, or common prefix for the folders or files on filer (default "/")
  -pattern string
        full path or just filename pattern, ex: "/home/?opher", "*.pdf", see https://golang.org/pkg/path/filepath/#Match
  -timeAgo duration
        start time before now. "300ms", "1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h"
Description:
  See recent changes on a filer.

  If configured Elastic Search server names, the meta data will be sent to Elastic Search

    $ weed filer.meta.tail -es=http://localhost:9200
```
The `weed filer.meta.tail` code is nothing fancy. It calls a gRPC stream API to subscribe to all meta data changes and simply prints out the meta data.
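Because the command only prints events, a script can consume its output in place of jq. The Python sketch below assumes that each event is printed as one JSON object per line; that is an assumption about the output format and is not stated on this page. The `eventNotification` and `newEntry` names come from the jq filter above, `oldEntry` is assumed to follow the same naming, and the function names are hypothetical.

```python
import json
import subprocess
from typing import Iterable, Iterator


def created_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield newEntry for events that carry a new entry but no old entry."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        # Assumption: one complete JSON object per line.
        event = json.loads(line).get("eventNotification", {})
        if event.get("newEntry") and not event.get("oldEntry"):
            yield event["newEntry"]


def tail_created_entries(time_ago: str = "3h") -> Iterator[dict]:
    """Run `weed filer.meta.tail` and stream the entries it reports as created."""
    proc = subprocess.Popen(
        ["weed", "filer.meta.tail", f"-timeAgo={time_ago}"],
        stdout=subprocess.PIPE,
        text=True,
    )
    assert proc.stdout is not None
    yield from created_entries(proc.stdout)


# Usage (requires the weed binary and a running filer):
#   for entry in tail_created_entries("3h"):
#       print(entry["name"])
```

Unlike the jq filter, this sketch skips updates and deletes, so it reports only entries that did not exist before.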
The gRPC API has several important use cases within SeaweedFS:
- Replicate data to other SeaweedFS clusters in `weed filer.sync`.
- Replicate meta data to other filers if not sharing the same filer meta store.
- Replicate meta data to `weed mount` asynchronously.
The gRPC API is also open to the public and can be used from many other languages.
A Java example is available in ExampleWatchFileChanges.java.
Subscribing to the meta data changes takes these parameters:
Parameter | Meaning |
---|---|
prefix | A path prefix. Watch any directory or file with this path prefix |
clientName | A client name, just for logging |
sinceNs | A timestamp in nanoseconds. Watch changes from this timestamp. Pass a timestamp in the past to replay earlier changes. |
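Since `sinceNs` is in nanoseconds, a starting point such as "3 hours ago" has to be converted first. Here is a small Python sketch of that arithmetic. The helper name `since_ns` is hypothetical, and treating the timestamp as nanoseconds since the Unix epoch is an assumption.

```python
import time
from typing import Optional

NANOS_PER_SECOND = 1_000_000_000


def since_ns(seconds_ago: float, now_ns: Optional[int] = None) -> int:
    """Return a nanosecond timestamp that lies `seconds_ago` before now."""
    if now_ns is None:
        now_ns = time.time_ns()  # current time in nanoseconds since the epoch
    return now_ns - int(seconds_ago * NANOS_PER_SECOND)


# Start watching from three hours ago.
start = since_ns(3 * 3600)
print(start)
```

Passing `now_ns` explicitly keeps the function deterministic, which helps when resuming from a timestamp saved by an earlier run.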
Basically there are four types of events to handle:
Type | Directory | NewEntry | OldEntry | NewParentPath |
---|---|---|---|---|
Create | exists | exists | null | equal to Directory |
Update | exists | exists | exists | equal to Directory |
Delete | exists | null | exists | equal to Directory |
Rename | exists | exists | exists | not equal to Directory |
This is based on the Filer gRPC API, so you should be able to implement it easily in your own language:
https://github.com/seaweedfs/seaweedfs/blob/master/weed/pb/filer.proto#L52
A Golang example: https://github.com/tuxmart/seawolf
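The four event types in the table above can be told apart with a few comparisons. The Python sketch below follows the table literally. The function name and the plain-dict representation of entries are hypothetical and are not part of the SeaweedFS API.

```python
from typing import Optional


def classify_event(directory: str,
                   new_entry: Optional[dict],
                   old_entry: Optional[dict],
                   new_parent_path: str) -> str:
    """Map one meta data event to create, update, delete, or rename."""
    if new_entry is not None and old_entry is None:
        return "create"
    if new_entry is None and old_entry is not None:
        return "delete"
    if new_entry is not None and old_entry is not None:
        # Both entries exist: the parent path decides between update and rename.
        if new_parent_path == directory:
            return "update"
        return "rename"
    return "unknown"  # neither entry is set; not covered by the table


print(classify_event("/docs", {"name": "a.txt"}, None, "/docs"))
```

A handler can branch on the returned string and ignore the event types it does not care about.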
This is basically stream processing or event processing for files. The possible use cases are all up to your imagination.
- Detect new image or video files. Add versions with different resolutions.
- Distributed configuration: store configuration files under a folder, detect configuration changes, and reload.
- A job queue: upload files to a folder, process new files as soon as possible, and delete the processed files.
- Do-it-yourself Data Replication or Backup.
- Batch processing: streaming data is cool, but sometimes batching is more efficient. To combine streaming and batching, you can put one batch of new data as a file and trigger the batch processing on that file.
- Folder size statistics and monitoring.
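As one illustration, the folder size use case can be sketched as a running total per directory that is updated on every event. This Python sketch assumes events arrive as dicts with the JSON field names `oldEntry`, `newEntry`, and `newParentPath`; only `newEntry` and `attributes.fileSize` appear in the sample output above, and the other names are assumptions. The helper names are hypothetical.

```python
from typing import Dict, Optional


def entry_size(entry: Optional[dict]) -> int:
    """Return the file size of an entry, or 0 if the entry or size is absent."""
    if not entry:
        return 0
    # fileSize is a JSON string in the sample output, so convert it.
    return int(entry.get("attributes", {}).get("fileSize", 0))


def apply_event(sizes: Dict[str, int], directory: str, event: dict) -> None:
    """Update per-directory byte totals for one meta data event."""
    old_size = entry_size(event.get("oldEntry"))
    new_size = entry_size(event.get("newEntry"))
    # Assumption: a missing newParentPath means the entry stays in `directory`.
    new_dir = event.get("newParentPath") or directory
    sizes[directory] = sizes.get(directory, 0) - old_size
    sizes[new_dir] = sizes.get(new_dir, 0) + new_size


sizes: Dict[str, int] = {}
apply_event(sizes, "/photos", {
    "newEntry": {"name": "abc.png", "attributes": {"fileSize": "941248"}},
    "newParentPath": "/photos",
})
print(sizes)
```

The same two lines of bookkeeping cover all four event types: a create only adds, a delete only subtracts, an update replaces the old size, and a rename moves the size to the new parent path.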