Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dvc] dvc node blob discovery #980

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

ishwarya-citro
Copy link
Collaborator

@ishwarya-citro ishwarya-citro commented May 10, 2024

Summary, imperative, start upper case, don't end with a period

Implemented a new API endpoint designed to facilitate the bootstrapping of nodes by identifying and listing live nodes prepared to serve the store's blobs. If the blob transfer feature is enabled for the store, it retrieves each node's hostname from the push store, given that these nodes are alive.

How was this PR tested?

integ. test cases

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

“ishwarya-personal” added 2 commits May 15, 2024 12:41
@ishwarya-citro ishwarya-citro marked this pull request as ready for review May 16, 2024 14:12
Copy link
Collaborator

@adamxchen adamxchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also have one integ test to verify the API works as expected?

@ishwarya-citro
Copy link
Collaborator Author

Can we also have one integ test to verify the API works as expected?

does calling the endpoint count as a integ test? As i think that covers the entire Scenario ?I don't see any integ tests written for new router endpoints - pl correct me if i am wrong

List<String> hostNames = blobDiscoveryResponse.getLiveNodeHostNames();

Collections.sort(hostNames);
Collections.sort(expectedResult);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we don't really need to sort since the result is empty anyways

@adamxchen
Copy link
Collaborator

Can we also have one integ test to verify the API works as expected?

does calling the endpoint count as a integ test? As i think that covers the entire Scenario ?I don't see any integ tests written for new router endpoints - pl correct me if i am wrong

Ideally, we should be writing an integ test to cover a new API too, such as this test, which tests the TYPE_CURRENT_VERSION API of router. Or, you can add the integ test to PushStatusStoreTest (this looks like a better place to put this test.

try {
// gets the instances for a FULL_PUSH for the store's version and partitionId
// gets the instance's hostnames from its keys & filter to include only live instances
Map<CharSequence, Integer> instances = pushStatusStoreReader.getPartitionStatus(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a direction to remove the partition status update and one of change is already made in Controller:
#909

I think there will be a new change to only report version status in DaVinci, which means the partition-based lookup won't work any more.

I guess maybe we should just abandon the partition param from the call and just return a full list of hostnames, and DaVinci peers internally can figure out which nodes are hosting which partitions.

There will be a stage, when both partition-level status and version-level status will be reported, and I think the logic here needs to be compatible with both modes.
Essentially for requests without partition param,

  1. Query version-level key against push status system store.
  2. If 1) result is empty, need to figure out the total number of partitions and query each partition instead.
  3. Merge the results into an unique list.
  1. can be more complex to always query partition-level key, but I guess even without it, it might be good enough considering the transition period should be short, but I am fine to add the logic to always query partition-level keys regardless of whether the version-level key gives any result or not.

cc @huangminchn @sixpluszero

@@ -318,6 +324,13 @@ public RouterServer(
config.getRefreshAttemptsForZkReconnect(),
config.getRefreshIntervalForZkReconnectInMs());
this.liveInstanceMonitor = new HelixLiveInstanceMonitor(this.zkClient, config.getClusterName());

D2Client d2Client = D2ClientFactory.getD2Client(config.getZkConnection(), Optional.empty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not right as getZkConnection is for Venice ZK.
I think we should let the backend pass an instance of ZKClient instead of constructing it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants