
[controller][admin-tool] Streamline controller configs and harden update-store workflow #946

Draft · wants to merge 1 commit into base: main from adminSingleUpdateZk
Conversation

@nisargthakkar (Contributor) commented Apr 16, 2024

Streamline controller configs and harden update-store workflow

This PR unifies the UpdateStore logic between the child and parent controllers. We've faced many issues due to the structure of the code (especially VeniceParentHelixAdmin), where certain sequences of store-update operations can leave a store in an unhealthy state. The changes in this PR are:

  1. Add a new config in controllers to specify if the Venice deployment is in multi-region mode or single-region mode.
  2. Add a new class UpdateStoreUtils that is called by both parent and child controllers to apply the store update and perform the necessary validations.
  3. Perform store config validation after applying all configs. This allows checking for compatibility between configs. The following validations are currently performed:
    1. Batch-only:
      1. Incremental push is not allowed
      2. Write compute is not allowed
    2. Hybrid:
      1. Rewind time in seconds cannot be negative
      2. Offset lag threshold and producer time lag threshold cannot both be negative
      3. In multi-region mode, incremental push is not allowed with the NON_AGGREGATE data replication policy
      4. ACTIVE_ACTIVE data replication policy is only supported when Active-Active replication is enabled
    3. Storage quota cannot be less than 0
    4. Read quota cannot be less than 0
    5. A store cannot have active-active replication enabled unless native replication is also enabled
    6. Active-Active and write-compute are not supported when amplification factor is more than 1
    7. Store must have a partitioner-config set
    8. Validate that the partitioner can be built with the specified partitioner configs
    9. Latest superset value schema id must map to a valid value schema id
    10. Max compaction lag cannot be less than min compaction lag
    11. ETL proxy user must be set. (We might want to deprecate this and remove the corresponding tests, but we have plans for KIF-based ETL that could benefit from the ETLConfig in store configs)
  4. Incremental push is no longer explicitly configurable. It is inferred from other configs (see the sketch after this list). Incremental push is supported in the following cases:
    1. In a single-region setup, all hybrid stores are allowed to do incremental pushes
    2. In a multi-region setup, the store must be hybrid and its DataReplicationPolicy must be one of ACTIVE_ACTIVE, AGGREGATE, or NONE
  5. Added a new concept of a "primary controller". This is the controller that is allowed to perform inferred updates. In multi-region mode, this is the parent controller; in single-region mode, it is a child controller.
  6. Previously, each "feature" update incurred a separate update to the store metadata on Zk, and the corresponding watchers would get invoked each time. This change replaces all those updates with a single Zk update, performed after the resulting configs are validated.
  7. The ordinal of the BackupStrategy enum was being written to the admin channel, which is problematic when the enum evolves. Added a getValue function and used that instead (see the enum sketch after this list).
  8. Use the multi.region config to control certain features instead of controlling them via explicit configs:
    1. Native-replication - Enabled in multi-region mode. Disabled in single-region mode.
    2. Admin channel consumption - Enabled in multi-region mode. Disabled in single-region mode.
      1. This is still allowed to be disabled via LiveConfig for store migration purposes.
    3. Controller allowing BATCH push via API - Disabled in multi-region mode. Enabled in single-region mode.
    4. Active-active replication enabled on a controller - Enabled in multi-region mode. Disabled in single-region mode.
  9. The code for enabling/disabling native replication for a cluster has been removed, since it is no longer honored.
  10. The following controller configs have also been removed:
    1. enable.native.replication.for.batch.only
    2. enable.native.replication.for.hybrid
    3. enable.native.replication.as.default.for.batch.only
    4. enable.native.replication.as.default.for.hybrid
    5. enable.active.active.replication.as.default.for.batch.only.store
    6. admin.topic.remote.consumption.enabled
    7. enable.incremental.push.for.hybrid.active.active.user.stores
    8. active.active.enabled.on.controller
    9. controller.enable.batch.push.from.admin.in.child
  11. A new config, controller.external.superset.schema.generation.enabled, has been added to replace controller.parent.external.superset.schema.generation.enabled, since external superset schema generation must also be allowed in single-region mode. controller.parent.external.superset.schema.generation.enabled has been marked deprecated, but it has not been completely removed yet.
  12. Due to the changes to inferred configs, a large number of tests started failing because they had incorrect setups. Fixed most of these test setups. Some examples are:
    1. Parent and child regions sharing the same Kafka cluster
    2. Setting up multi-region mode by explicitly creating parent Zk, controllers, and Kafka clusters instead of using VeniceTwoLayerMultiRegionMultiClusterWrapper
    3. Creating stores and doing pushes directly in child regions
      1. TestFabricBuildout does this and tests for the admin execution id. This is incorrect, as that is not how a new region gets added. We need the test setup to support adding blank regions so that we can simulate the true fabric buildout process. Because of this, I've disabled these tests for now.
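
For illustration, here is a minimal sketch of the inference described in item 4. The class and method names are hypothetical (the real logic lives in UpdateStoreUtils/AdminUtils in the Venice codebase); this only shows the decision rule, not the actual API:

```java
// Hypothetical sketch, not the real Venice API: incremental-push support is inferred
// from the other store configs instead of being an explicitly settable flag.
public final class IncrementalPushInference {
  public enum DataReplicationPolicy { NON_AGGREGATE, AGGREGATE, ACTIVE_ACTIVE, NONE }

  public static boolean isIncrementalPushAllowed(
      boolean multiRegion, boolean hybridStore, DataReplicationPolicy policy) {
    if (!hybridStore) {
      // Batch-only stores never support incremental push.
      return false;
    }
    if (!multiRegion) {
      // Single-region setup: every hybrid store may do incremental pushes.
      return true;
    }
    // Multi-region setup: allowed only for ACTIVE_ACTIVE, AGGREGATE or NONE policies.
    return policy == DataReplicationPolicy.ACTIVE_ACTIVE
        || policy == DataReplicationPolicy.AGGREGATE
        || policy == DataReplicationPolicy.NONE;
  }
}
```

And here is the enum pattern referred to in item 7: writing a stable, explicitly assigned value to the admin channel instead of ordinal(), so that reordering or inserting constants does not silently change the wire format. The constant names and numbers below are illustrative; the real BackupStrategy enum is defined in the Venice codebase:

```java
// Illustrative sketch of the getValue pattern from item 7.
public enum BackupStrategy {
  KEEP_MIN_VERSIONS(0),
  DELETE_ON_NEW_PUSH_START(1);

  private final int value;

  BackupStrategy(int value) {
    this.value = value;
  }

  // Stable value written to the admin channel; unlike ordinal(), it does not change
  // if constants are reordered or new ones are inserted before existing ones.
  public int getValue() {
    return value;
  }
}
```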

Some side-effects of this change are:

  1. The new workflow now checks whether the store schema can support write compute at the time the write-compute config is enabled. A few tests were also using value schemas that are unsupported for write compute.
  2. Persona validation was previously only performed in the parent controller in multi-region mode. It now works in single-region mode as well.
  3. SupersetSchemaGeneratorWithCustomProp had a bug where, if the first schema has a custom prop, future superset schema generation would fail because Avro does not allow overriding custom props (see the snippet below). This was caught because the update-store logic now also tries to create a superset schema if a store enables read compute or write compute, or if it previously had a superset schema.
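
As a minimal illustration of the Avro behavior behind side-effect 3 (this snippet is mine, not code from the PR): Avro's Schema.addProp refuses to overwrite an existing property with a different value, so re-applying a custom prop during superset schema generation fails.

```java
import org.apache.avro.Schema;

public class AvroCustomPropDemo {
  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":"
            + "[{\"name\":\"f\",\"type\":\"string\"}]}");
    schema.addProp("custom_prop", "v1");
    // Re-applying the same property with a different value is rejected:
    // throws org.apache.avro.AvroRuntimeException: Can't overwrite property: custom_prop
    schema.addProp("custom_prop", "v2");
  }
}
```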

There are a few other changes that we should make, but they are not part of this PR:
All major operations (creating a store, deleting a store, adding schemas) should only be allowed on the parent controller. We should exclude some system stores from this check, like we do for the check that allows batch pushes from the admin in child regions.

Recommendation for Reviewers

I recommend going through the changes in at least two passes. In the first pass, look through everything that has been purely deleted. (Skip ParentControllerConfigUpdateUtils, as it has been renamed to PrimaryControllerConfigUpdateUtils and its contents partially moved to UpdateStoreUtils, so GitHub doesn't detect it as a rename.) While doing this pass, you can also glance through the various small changes that do not need much pondering.

In the second pass, follow this review order:

  1. VeniceControllerClusterConfig, VeniceControllerConfig, VeniceControllerMultiClusterConfig
  2. VeniceControllerService
  3. Admin
  4. UpdateStoreUtils
  5. PrimaryControllerConfigUpdateUtils
  6. AdminUtils
  7. UpdateStoreWrapper
  8. VeniceHelixAdmin
  9. VeniceParentHelixAdmin
  10. All the test modifications
  11. All newly added tests
  12. Scripts and doc file changes:
    1. clients/venice-pulsar/readme.md
    2. docker/venice-client/create-store.sh

How was this PR tested?

GH CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@nisargthakkar nisargthakkar force-pushed the adminSingleUpdateZk branch 2 times, most recently from 7bb1870 to f6e6135 on April 17, 2024 at 19:43
@nisargthakkar nisargthakkar changed the title [WIP][controller][admin-tool] Streamline controller configs and harden update-store workflow [controller][admin-tool] Streamline controller configs and harden update-store workflow Apr 17, 2024
@nisargthakkar nisargthakkar marked this pull request as ready for review April 17, 2024 21:05
@mynameborat (Contributor) commented:

Thank you for the recommendations to the reviewers. Is it still possible to break this PR down? Even with the recommendations, the review seems quite complex, unless there is some context I am missing that forces the way this PR is structured.

@mynameborat (Contributor) commented:

> Thank you for the recommendations to the reviewers. Is it still possible to break this PR down? Even with the recommendations, the review seems quite complex, unless there is some context I am missing that forces the way this PR is structured.

Can you add a section to the PR description that covers the compatibility characteristics of this PR? It looks like we are deprecating a lot of features and also introducing new ways of inferring some of the existing features (incremental push, etc.).

It would be good to explicitly call out the considerations w.r.t. backward and forward compatibility in the context of this PR, and the release notes that describe these changes to end users (where applicable).

@gaojieliu (Contributor) commented:

@nisargthakkar
I wonder whether it is still possible to split the PR into multiple smaller ones.
Reviewing a large (giant) PR makes it easy to miss some important logic, as there are just too many changes altogether.
Since this is a critical code path, I think it is worth spending some effort to improve the review experience so that we can catch potential regressions.

@nisargthakkar (Contributor, Author) commented:

> I wonder whether it is still possible to split the PR into multiple smaller ones. Reviewing a large (giant) PR makes it easy to miss some important logic, as there are just too many changes altogether. Since this is a critical code path, I think it is worth spending some effort to improve the review experience so that we can catch potential regressions.

Yes, I've been trying to see how I can break this down. Even if I break it down, I expect at least one PR to be very big. Let me see how that goes.

@nisargthakkar nisargthakkar marked this pull request as draft April 24, 2024 18:31