
Reading Deltas via Metadata #48

Open
omervk opened this issue Aug 5, 2018 · 1 comment
omervk commented Aug 5, 2018

(This is very similar in philosophy to #47, and it would be good to read that issue first; the same caveats apply.)

A job that wants to read only new data since the last time it ran must understand what the high-water mark was and read new data from its source based on a predicate. For instance:

val newData = spark
  .read
  /* ... */
  .filter($"day" === "2018-08-05")

However, we can instead base our reads on the accumulation of snapshots over time: if our snapshots are S1, S2, S3, and S4, and the last snapshot we processed was S1, we can read the new data from S2, S3, and S4 and skip the filtering entirely. This would essentially make our high-water mark metadata-based rather than data-based.

This can already be achieved using the low-level Iceberg API, but not using the Spark API; adding it there would be a great addition to the project.
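For reference, a rough sketch of doing this with the low-level API might look like the following. This is a sketch only: the exact method names (`snapshots`, `addedFiles`) are assumptions about the core API, and because snapshot IDs are not guaranteed to be ordered, the high-water mark here is a commit timestamp rather than an ID.

```scala
import scala.collection.JavaConverters._

import org.apache.iceberg.Table

// Sketch only: collect the paths of data files added by every snapshot
// committed after a persisted high-water-mark timestamp. `table` is an
// already-loaded Iceberg Table; `lastProcessedMillis` is the commit
// timestamp of the last snapshot this job processed.
def newFilePathsSince(table: Table, lastProcessedMillis: Long): Seq[String] =
  table.snapshots().asScala.toSeq
    .filter(_.timestampMillis() > lastProcessedMillis) // snapshots newer than the mark
    .flatMap(_.addedFiles().asScala)                   // files appended by each snapshot
    .map(_.path().toString)
```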

Here's a sketch of how this API might look:

spark
  .read
  .format("iceberg")
  .snapshots(2, 3, 4)
  .load(path)

Note: Specifying the list of snapshots would also let this API support other use cases, such as parallel-processing of snapshots, etc.

rdblue commented Aug 8, 2018

This can be done fairly easily by adding key-value properties when reading with Spark. We plan to do this to implement AS OF SYSTEM TIME SQL statements as well. You'd pass the extra information as read options and, in the IcebergSource, use it to customize the scan.
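Concretely, that approach might look like the sketch below. The option names `start-snapshot-id` and `end-snapshot-id` are assumptions for illustration, not an existing IcebergSource contract; `lastProcessedId` and `latestId` are hypothetical values the job would persist and look up itself.

```scala
// Hypothetical sketch: passing snapshot bounds as key-value read options
// that the IcebergSource would use to customize the scan.
val delta = spark
  .read
  .format("iceberg")
  .option("start-snapshot-id", lastProcessedId.toString) // assumed option name
  .option("end-snapshot-id", latestId.toString)          // assumed option name
  .load(path)
```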

Scans don't currently support filtering by what is "new" in a snapshot (or multiple snapshots) but that should be easy to add by extending the TableScan interface and the underlying BaseTableScan implementation.
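A sketch of what such an extension might look like (written Scala-style for brevity, though the real interfaces are Java; the method name `appendsBetween` and its semantics are assumptions, not existing API):

```scala
// Hypothetical extension of the scan interface; shape is illustrative only.
trait TableScan {
  // ... existing scan-configuration methods elided ...

  /** Restrict the scan to files appended by snapshots after
    * `fromSnapshotId` (exclusive) up to `toSnapshotId` (inclusive). */
  def appendsBetween(fromSnapshotId: Long, toSnapshotId: Long): TableScan
}
```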

Parth-Brahmbhatt pushed a commit to Parth-Brahmbhatt/iceberg that referenced this issue Apr 12, 2019
A manifest list is created for every commit attempt. Before this update,
the same file was used, which caused retries to fail trying to create
the same list file. This uses a new location for every manifest list,
keeps track of old lists, and cleans up unused lists after a commit
succeeds.