[BUG] Prevent chunking config mismatch between parent and child data centers #659

Open
1 of 12 tasks
ZacAttack opened this issue Sep 21, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@ZacAttack
Contributor

ZacAttack commented Sep 21, 2023

Venice version

0.4.139

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.0): Mariner 5.15.111.1-1.cm2
  • JDK version: 17

Describe the problem

Issue encountered by a user in prod:

  • A store has write compute enabled.
  • The customer can't read the data in DC1, but can read the same keys in other fabrics.

Root cause: DC1 has the chunking flag enabled, but the parent and the other child fabrics do not.

  1. In the read path, Venice Server looks at StoreVersionState to check whether chunking is enabled, and the chunking flag in StoreVersionState is determined by the StartOfPush control message generated by VPJ.
  2. Even though DC1 has chunking enabled in the Version metadata, StoreVersionState doesn't, so the read path won't append the chunking suffix and the lookup always fails.
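
To make the failure mode concrete, here is a minimal sketch of what happens when the write side and the read side disagree on the chunking flag. Everything in it is an assumption for illustration: `ChunkingMismatchDemo`, `storageKey`, and the one-byte suffix are hypothetical stand-ins, not Venice's actual classes or key format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the mismatch; not Venice's real code.
final class ChunkingMismatchDemo {
  // Stand-in for the suffix appended to keys when chunking is enabled.
  // The real suffix format is not modeled here.
  private static final byte[] HYPOTHETICAL_SUFFIX = {0x0};

  // Builds the key used against storage, appending the suffix only when chunking is on.
  static ByteBuffer storageKey(String key, boolean chunkingEnabled) {
    byte[] raw = key.getBytes(StandardCharsets.UTF_8);
    if (!chunkingEnabled) {
      return ByteBuffer.wrap(raw);
    }
    byte[] suffixed = Arrays.copyOf(raw, raw.length + HYPOTHETICAL_SUFFIX.length);
    System.arraycopy(HYPOTHETICAL_SUFFIX, 0, suffixed, raw.length, HYPOTHETICAL_SUFFIX.length);
    return ByteBuffer.wrap(suffixed);
  }

  // Writes with one flag (ingestion path) and reads with another (read path);
  // returns whether the read finds the value.
  static boolean readSucceeds(boolean ingestionChunking, boolean readChunking) {
    Map<ByteBuffer, String> storage = new HashMap<>();
    storage.put(storageKey("k1", ingestionChunking), "v1");
    return storage.containsKey(storageKey("k1", readChunking));
  }
}
```

In this model, `readSucceeds(true, false)` corresponds to the DC1 situation (ingestion follows the Version metadata, reads follow StoreVersionState) and the lookup misses, while any call where both flags agree finds the value.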

Potential mitigation:

  1. Always look at StoreVersionState in the ingestion path.
  2. Prevent the update to Child Controllers (which aligns with the decision of the ParentController SPoF project).

Option 1 seems more robust, and option 2 seems good to have regardless, but I'm open to other approaches.

There's another very similar issue regarding partition count mismatch between the colos, so it'd be good to fix that here as well.
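
One way to catch both kinds of drift (chunking and partition count) is to compare each child fabric's version config against the parent's. The sketch below is only an illustration of that comparison: `FabricConfigCheck`, `VersionConfig`, and `findMismatches` are hypothetical names, and how the configs would actually be fetched from the controllers is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these types are illustrative and not part of Venice.
final class FabricConfigCheck {
  // Minimal stand-in for the per-version settings that must agree across fabrics.
  static final class VersionConfig {
    final boolean chunkingEnabled;
    final int partitionCount;

    VersionConfig(boolean chunkingEnabled, int partitionCount) {
      this.chunkingEnabled = chunkingEnabled;
      this.partitionCount = partitionCount;
    }
  }

  // Returns one human-readable entry per setting that differs from the parent.
  static List<String> findMismatches(VersionConfig parent, Map<String, VersionConfig> children) {
    List<String> mismatches = new ArrayList<>();
    for (Map.Entry<String, VersionConfig> entry : children.entrySet()) {
      String fabric = entry.getKey();
      VersionConfig child = entry.getValue();
      if (child.chunkingEnabled != parent.chunkingEnabled) {
        mismatches.add(fabric + ": chunkingEnabled=" + child.chunkingEnabled
            + ", parent=" + parent.chunkingEnabled);
      }
      if (child.partitionCount != parent.partitionCount) {
        mismatches.add(fabric + ": partitionCount=" + child.partitionCount
            + ", parent=" + parent.partitionCount);
      }
    }
    return mismatches;
  }
}
```

An empty result means every child agrees with the parent on both settings; a non-empty one could be surfaced as an alert or used to reject the push, depending on which approach is chosen.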

Tracking information

No response

Code to reproduce bug

No response

What component(s) does this bug affect?

  • Controller: This is the control-plane for Venice. Used to create/update/query stores and their metadata.
  • Router: This is the stateless query-routing layer for serving read requests.
  • Server: This is the component that persists all the store data.
  • VenicePushJob: This is the component that pushes derived data from Hadoop to Venice backend.
  • VenicePulsarSink: This is a Sink connector for Apache Pulsar that pushes data from Pulsar into Venice.
  • Thin Client: This is a stateless client users use to query Venice Router for reading store data.
  • Fast Client: This is a stateful client users use to query Venice Server for reading store data.
  • Da Vinci Client: This is an embedded, stateful client that materializes store data locally.
  • Alpini: This is the framework that fast-client and routers use to route requests to the storage nodes that have the data.
  • Samza: This is the library users use to make nearline updates to store data.
  • Admin Tool: This is the stand-alone client used for ad-hoc operations on Venice.
  • Scripts: These are the various ops scripts in the repo.
ZacAttack added the bug label Sep 21, 2023
@huangminchn
Contributor

Hey @ZacAttack, just to make sure I understand correctly, is this what happened:
Chunking was enabled in DC1 only (not through the parent controller), and thus the ingestion path in DC1 added the chunking suffix; however, since the SOP from the parent controller didn't have chunking enabled, the StoreVersionState in all data centers didn't have chunking enabled, and thus the read path in DC1 didn't add the chunking suffix?

@ZacAttack
Contributor Author

> Hey @ZacAttack, just to make sure I understand correctly, is this what happened:
> Chunking was enabled in DC1 only (not through the parent controller), and thus the ingestion path in DC1 added the chunking suffix; however, since the SOP from the parent controller didn't have chunking enabled, the StoreVersionState in all data centers didn't have chunking enabled, and thus the read path in DC1 didn't add the chunking suffix?

Yeah you got it

@huangminchn
Contributor

@ZacAttack This is a difficult one. If we do (1), the persisted data would conflict with the version config.

As for (2) "Prevent the update to Child Controllers", how exactly would we do that? Could you please remind me of the Controller SPoF project? I don't think we can block all config changes, since we sometimes want to test or canary new configs in one DC first.

Maybe we could take a fail-fast approach: when the SOP is consumed and we notice that its chunking config is different from the version config, fail loudly; meanwhile, the other 2 DCs could move on. What do you think?
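
A rough sketch of that fail-fast check, purely to illustrate the idea: `SopChunkingValidator`, `ChunkingConfigMismatchException`, and the method signature are hypothetical, and where such a check would hook into the real ingestion code is left open.

```java
// Hypothetical sketch of the fail-fast idea; names are illustrative, not Venice APIs.
final class SopChunkingValidator {
  // Dedicated exception type so the failure is easy to spot in logs and alerts.
  static final class ChunkingConfigMismatchException extends RuntimeException {
    ChunkingConfigMismatchException(String message) {
      super(message);
    }
  }

  // Intended to run when the StartOfPush (SOP) control message is consumed.
  // Throws if the SOP's chunking flag disagrees with the local version config.
  static void validateOnStartOfPush(
      String fabric,
      String storeVersion,
      boolean sopChunkingEnabled,
      boolean versionConfigChunkingEnabled) {
    if (sopChunkingEnabled != versionConfigChunkingEnabled) {
      throw new ChunkingConfigMismatchException(
          "Chunking mismatch for " + storeVersion + " in " + fabric
              + ": SOP=" + sopChunkingEnabled
              + ", version config=" + versionConfigChunkingEnabled);
    }
  }
}
```

Because the check only compares local state against the SOP, a fabric with a mismatched config would fail its own ingestion while fabrics whose configs agree with the SOP pass the check and proceed.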
