
Proposal: Ingestion Spec controller #313

Open
AdheipSingh opened this issue Oct 9, 2022 · 2 comments

Comments

@AdheipSingh (Contributor)

Current State of Druid Operator

The Druid operator supports installing, upgrading, and maintaining a Druid cluster. Internally, the operator has a Druid controller that talks to the Kubernetes API for operations. Most of the built-in intelligence is from a Kubernetes installation perspective, and the CRD spec is very flexible. The current reconcile loop is stable and battle-tested.
The current CRDs belong to the group druid.apache.org, at version v1alpha1, with Druid as the only supported kind. The manager hooks in a single controller, i.e. druid_controller.

Goals

IMHO the Druid operator (the operator framework) is powerful enough to leverage Kubernetes as a control plane for running Druid: all operations and specs can be handled as CRD definitions. The goal here is to automate the handling of supervisor configs for ingestion into a Druid cluster by adding a new CRD to the group druid.apache.org.

Design

Separation of concerns: a new CRD + controller

  • Managing ingestion specs/supervisors requires HTTP calls to the overlord API. Adding this support to the current controller is an anti-pattern: the current reconcile loop is responsible for operations against Kubernetes, and we do not want HTTP calls to Druid in the same reconcile loop that handles the Kubernetes state for Druid pods. Controllers are eventually consistent, and managing this in the current code base would make the handler even more complex.
  • The relation between a DruidIngestion CR and a Druid CR is one-to-one. Having one-to-many would add complexity and confusion.
  • This takes motivation from other operators such as the Strimzi Kafka operator, which has separate CRDs for Kafka, Kafka topics, and Kafka ACLs.

Authentication with Druid API
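The proposal leaves this section open. Purely as a hedged sketch: one common pattern would be to keep Druid basic-auth credentials in a Kubernetes Secret referenced from the DruidIngestion CR, and have the controller read them before calling the overlord. The Secret keys and the reconciler shape below are illustrative assumptions, not part of the proposal.

package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DruidIngestionReconciler is the proposed controller; only the pieces
// needed for this sketch are shown.
type DruidIngestionReconciler struct {
	client.Client
}

// druidAuth reads basic-auth credentials for the Druid API from a Secret.
// The Secret reference and the "username"/"password" keys are assumptions
// for illustration; the proposal leaves authentication open.
func (r *DruidIngestionReconciler) druidAuth(ctx context.Context, namespace, secretName string) (string, string, error) {
	var secret corev1.Secret
	if err := r.Get(ctx, types.NamespacedName{Namespace: namespace, Name: secretName}, &secret); err != nil {
		return "", "", err
	}
	return string(secret.Data["username"]), string(secret.Data["password"]), nil
}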

Design of the CRD, CR Spec and Reconciliation

  • The CRD belongs to:
Group: druid.apache.org
Version: v1alpha1
Kind: DruidIngestion

The CRD is namespace-scoped. Validation uses OpenAPI v3; for more complex validation of the supervisor JSON, a validation webhook can be added.
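To make the shape concrete, here is a minimal sketch of the Go API types such a CRD could be generated from, assuming kubebuilder conventions; the field names mirror the sample CR below and are illustrative, not a settled API.

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DruidIngestionSpec mirrors the sample CR below; names are illustrative.
type DruidIngestionSpec struct {
	// ClusterRef names the Druid CR (one-to-one) whose overlord this targets.
	ClusterRef string `json:"clusterRef"`
	// Suspend toggles suspension of the supervisor.
	Suspend bool `json:"suspend"`
	// SupervisorSpec holds the raw supervisor JSON submitted to the overlord.
	SupervisorSpec string `json:"supervisorSpec"`
}

// DruidIngestionStatus is patched from the overlord's HTTP responses.
type DruidIngestionStatus struct {
	// SupervisorID as returned by the overlord on submission.
	SupervisorID string `json:"supervisorId,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// DruidIngestion is namespaced, in group druid.apache.org, version v1alpha1.
type DruidIngestion struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DruidIngestionSpec   `json:"spec,omitempty"`
	Status DruidIngestionStatus `json:"status,omitempty"`
}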
A sample CR Spec

apiVersion: "druid.apache.org/v1alpha1"
kind: "DruidIngestion"
metadata:
  name: sample-druid-spec
  namespace: mydruid
spec:
  clusterRef: mydruid
  suspend: false
  supervisorSpec: |-
    {
      "type": "kafka",
      "spec": {
        "dataSchema": {
          "dataSource": "metrics-kafka",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [
              "timestamp",
              "value"
            ]
          },
          "metricsSpec": [
            {
              "name": "count",
              "type": "count"
            },
            {
              "name": "value_sum",
              "fieldName": "value",
              "type": "doubleSum"
            },
            {
              "name": "value_min",
              "fieldName": "value",
              "type": "doubleMin"
            },
            {
              "name": "value_max",
              "fieldName": "value",
              "type": "doubleMax"
            }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "NONE"
          }
        },
        "ioConfig": {
          "topic": "metrics",
          "inputFormat": {
            "type": "json"
          },
          "consumerProperties": {
            "bootstrap.servers": "localhost:9092"
          },
          "taskCount": 1,
          "replicas": 1,
          "taskDuration": "PT1H"
        },
        "tuningConfig": {
          "type": "kafka",
          "maxRowsPerSegment": 5000000
        }
      }
    }

The controller shall reconcile this spec and send POST requests to the overlord API, i.e. http://localhost:8090/druid/indexer/v1/supervisor. A minimal sketch of the call follows.
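This sketch assumes the overlord base URL is resolved from the referenced Druid cluster; the helper name and error handling are illustrative. On success the overlord responds with the supervisor id, which is what the controller would record in the CR status.

package controllers

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// submitSupervisor POSTs the raw supervisor JSON from the CR to the overlord.
// The overlord answers with {"id": "<supervisor id>"}; the id is returned so
// the reconcile loop can patch it into the DruidIngestion status.
func submitSupervisor(ctx context.Context, overlordURL, supervisorSpec string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		overlordURL+"/druid/indexer/v1/supervisor", strings.NewReader(supervisorSpec))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("overlord returned %s", resp.Status)
	}

	var out struct {
		ID string `json:"id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.ID, nil
}

For example, submitSupervisor(ctx, "http://localhost:8090", ingestion.Spec.SupervisorSpec) would target the URL above.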

Reconciliation and State Changes

Controllers are a combination of level-driven and event-driven behavior. An update to the DruidIngestion CR can be reconciled as an update event, but there is still the possibility of an outage; to guard against missed events, the reconcile loop also triggers every N seconds.
In the current controller, all configs in the CR are converged to first-class Kubernetes objects. The supervisor spec can be created as a ConfigMap; this ConfigMap helps in case an event is missed, since we can trigger an update whenever the current state differs from the desired state. The Druid operator adds an objectHash to the ConfigMap (same flow as the current controller). A sketch of this hash check follows.
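This follows the objectHash idea from the current controller; the annotation key below is an illustrative assumption.

package controllers

import (
	"crypto/sha256"
	"encoding/hex"

	corev1 "k8s.io/api/core/v1"
)

// Illustrative annotation key, modeled on the current controller's objectHash flow.
const specHashAnnotation = "druidingestion.druid.apache.org/spec-hash"

// specHash hashes the supervisor spec so desired state can be compared with
// what was last applied, even if an update event was missed.
func specHash(supervisorSpec string) string {
	sum := sha256.Sum256([]byte(supervisorSpec))
	return hex.EncodeToString(sum[:])
}

// needsResubmit reports whether the spec recorded on the ConfigMap is stale
// relative to the CR, in which case the reconcile loop re-POSTs the spec.
func needsResubmit(cm *corev1.ConfigMap, supervisorSpec string) bool {
	return cm.Annotations[specHashAnnotation] != specHash(supervisorSpec)
}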

The CRD shall have a status. The status shall be patched with fields from the HTTP response from the Druid API; in particular, it shall hold the supervisor id. To suspend the supervisor, the controller shall get the id from the status and send a POST request to the overlord API at /druid/indexer/v1/supervisor/{supervisorId}/suspend.

Setting suspend back to false in the ingestion CR shall cause a reset of the supervisor spec. The operator shall emit events using the events API for each operation handled and update the status of the CR.
Deletion of the DruidIngestion CR shall be controlled by finalizers. Before deletion, the controller makes an HTTP call to delete the supervisor spec; at this point the CR is marked as terminating. Once the requests complete, the CR is removed. A sketch follows.
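This sketch uses controller-runtime's controllerutil helpers and builds on the reconciler and types sketched above; the finalizer name and the deleteSupervisor stub are illustrative assumptions.

package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Illustrative finalizer name.
const ingestionFinalizer = "druidingestion.druid.apache.org/finalizer"

// handleDeletion runs while the CR is terminating: delete the supervisor via
// the overlord API first, then drop the finalizer so the API server can
// remove the CR.
func (r *DruidIngestionReconciler) handleDeletion(ctx context.Context, ing *DruidIngestion) error {
	if !controllerutil.ContainsFinalizer(ing, ingestionFinalizer) {
		return nil
	}
	if err := r.deleteSupervisor(ctx, ing.Status.SupervisorID); err != nil {
		return err // retried on the next reconcile
	}
	controllerutil.RemoveFinalizer(ing, ingestionFinalizer)
	return r.Update(ctx, ing)
}

// deleteSupervisor is a hypothetical helper, e.g. a POST to the overlord's
// /druid/indexer/v1/supervisor/{supervisorId}/terminate endpoint.
func (r *DruidIngestionReconciler) deleteSupervisor(ctx context.Context, id string) error {
	// Sketch only; the real call would mirror submitSupervisor above.
	return nil
}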


This proposal may have missed some Druid-specific details of the API. The original issue: #251

@cintoSunny (Contributor)

What do you think the advantages are of having this in the operator/CRD, instead of having it in Druid? If I understand correctly, instead of directly submitting a job to Druid, users would have to deploy a CR. Correct me if I am wrong here. My concern is that deploying a CR every time may not be feasible for everyone. Not sure what everyone else thinks.

@AdheipSingh (Contributor, Author)

> What do you think the advantages are of having this in the operator/CRD, instead of having it in Druid? If I understand correctly, instead of directly submitting a job to Druid, users would have to deploy a CR. Correct me if I am wrong here. My concern is that deploying a CR every time may not be feasible for everyone. Not sure what everyone else thinks.

Do you see Kubernetes as an orchestration platform for running Druid, or do you see Kubernetes as a control plane for running Druid? If you consider the latter, you can leverage CRDs for handling supervisor specs etc.

Just like the Kafka operator, you can deploy Kafka and manage topics and ACLs via CRDs. Of course, you can create Kafka topics using CLI clients, the same way we can create Druid supervisors from the console, but if you want full control from k8s, enhanced GitOps, and the operator built as a control plane, this can be a way.
