Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

minor change iceberg source interface #178

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

vamshigv
Copy link
Contributor

@vamshigv vamshigv commented Nov 7, 2023

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

Minor change in Iceberg source interface to consider snapshotId as the checkpoint as it aligns with Hudi Instant and Delta version number and has little serialization overhead.

Brief change log

Minor change in Iceberg source interface to consider snapshotId as the checkpoint as it aligns with Hudi Instant and Delta version number and has little serialization overhead.

Verify this pull request

(Please pick either of the following options)

This pull request is already covered by existing tests, such as (please describe tests).

@vamshigv
Copy link
Contributor Author

vamshigv commented Nov 7, 2023

@ashvina @the-other-tim-brown LMK what you think ?

Copy link
Contributor

@the-other-tim-brown the-other-tim-brown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm

Assertions.assertEquals(0, commitsBacklog.getInFlightInstants().size());
Assertions.assertArrayEquals(snapshots, commitsBacklog.getCommitsToProcess().toArray());
Long[] snapshotIds =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just use a list here?

@@ -52,7 +52,7 @@

@Log4j2
@Builder
public class IcebergSourceClient implements SourceClient<Snapshot> {
public class IcebergSourceClient implements SourceClient<Long> {
Copy link
Contributor

@ashvina ashvina Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is a bit confusing. Why is a generic required in the contract if all commits have to be identified by a long id. If the justification is alignment with delta and hudi, then all future sources will need to ensure a long id for commit is available. IMO, a snapshot is a cleaner and less ambiguous representation for a source client. Moreover, just because serialization is efficient with a long, the entire contract should not be changed.
I think such changes should have more compelling reasons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is allowing us to move towards your suggestion here: #135 - Maybe we have misunderstood.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashvina The Generics is still required because for Hudi it is HoodieInstant, For Delta it is Long in the current state. The entity in the generics semantically represents a way to get source table state at that point. If possible it is better to choose a leaner entity(We can add to docs) and snapshot Id for Iceberg fits this.

Copy link
Contributor

@ashvina ashvina Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be careful not to rush this change. These abstractions are fundamental to OneTable and will affect its future extensibility. My main concern is that we are altering the model because of how one of the components behaves, namely the serialization of the last and pending commits. Let me know if I misunderstood this.
In my opinion, we need a new representation for commits, OneCommit. Just as a table's state is mapped to OneTable, a source client implementation should map the metadata of a table format's commit to OneCommit.
Regarding the tracking of the commit backlog, if OneCommit is available, the serializer or any other component can decide independently what details are best and necessary to serialize. This should cleanly decouple the client's implementation and the behavior of core OneTable components.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's work through the round trip conversion testing to see where that leads us. I think that will allow us to have a better vision for where we should be going with respect to the commit representation and serialization

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants