
Proposal: Ingestion Spec controller #313

Open
AdheipSingh opened this issue Oct 9, 2022 · 2 comments

Comments

@AdheipSingh (Contributor)

Current State of Druid Operator

The Druid operator supports installing, upgrading, and maintaining a Druid cluster. Internally, the operator has a Druid controller that talks to the Kubernetes API for operations. Most of the built-in intelligence is from a Kubernetes installation perspective, and the CRD spec is very flexible. The current reconcile loop is stable and battle-tested.
The current CRDs belong to the group druid.apache.org, at version v1alpha1, with Druid as the only supported kind. The manager hooks in a single controller, i.e. druid_controller.

Goals

IMHO the Druid operator (the operator framework) is powerful enough to leverage Kubernetes as a control plane for running Druid: all operations and specs can be handled as CRD definitions. The goal here is to automate the handling of supervisor configs for ingestion into a Druid cluster by adding a new CRD to the group druid.apache.org.

Design

Separation of concerns: a new CRD + controller

  • Managing ingestion specs/supervisors requires HTTP calls to the overlord API. Adding this support to the current controller is an anti-pattern: the current reconcile loop is responsible for operations against Kubernetes, and we do not want HTTP calls to Druid in the same reconcile loop that handles the Kubernetes state for Druid pods. Controllers are eventually consistent, and managing this in the current code base would make the handler even more complex.
  • The relation between a DruidIngestion CR and a Druid CR is one-to-one. Having one-to-many would add complexity and confusion.
  • This takes motivation from other operators such as the Strimzi Kafka operator, which has separate CRDs for Kafka, Kafka topics, and Kafka ACLs.

Authentication with Druid API
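The proposal leaves this section open. Purely as a hedged sketch: one common pattern would be to keep Druid basic-auth credentials in a Kubernetes Secret referenced from the DruidIngestion CR, and have the controller read them before calling the overlord. The Secret keys and the reconciler shape below are illustrative assumptions, not part of the proposal.

package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DruidIngestionReconciler is the proposed controller; only the pieces
// needed for this sketch are shown.
type DruidIngestionReconciler struct {
	client.Client
}

// druidAuth reads basic-auth credentials for the Druid API from a Secret.
// The Secret reference and the "username"/"password" keys are assumptions
// for illustration; the proposal leaves authentication open.
func (r *DruidIngestionReconciler) druidAuth(ctx context.Context, namespace, secretName string) (string, string, error) {
	var secret corev1.Secret
	if err := r.Get(ctx, types.NamespacedName{Namespace: namespace, Name: secretName}, &secret); err != nil {
		return "", "", err
	}
	return string(secret.Data["username"]), string(secret.Data["password"]), nil
}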

Design of the CRD, CR Spec and Reconciliation

  • The CRD belongs to:
Group: druid.apache.org
Version: v1alpha1
Kind: DruidIngestion

The CRD is namespace-scoped. Validation uses OpenAPI v3; for more complex validation of the supervisor JSON, a validation webhook can be added.
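To make the shape concrete, here is a minimal sketch of the Go API types such a CRD could be generated from, assuming kubebuilder conventions; the field names mirror the sample CR below and are illustrative, not a settled API.

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DruidIngestionSpec mirrors the sample CR below; names are illustrative.
type DruidIngestionSpec struct {
	// ClusterRef names the Druid CR (one-to-one) whose overlord this targets.
	ClusterRef string `json:"clusterRef"`
	// Suspend toggles suspension of the supervisor.
	Suspend bool `json:"suspend"`
	// SupervisorSpec holds the raw supervisor JSON submitted to the overlord.
	SupervisorSpec string `json:"supervisorSpec"`
}

// DruidIngestionStatus is patched from the overlord's HTTP responses.
type DruidIngestionStatus struct {
	// SupervisorID as returned by the overlord on submission.
	SupervisorID string `json:"supervisorId,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// DruidIngestion is namespaced, in group druid.apache.org, version v1alpha1.
type DruidIngestion struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DruidIngestionSpec   `json:"spec,omitempty"`
	Status DruidIngestionStatus `json:"status,omitempty"`
}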
A sample CR Spec

apiVersion: "druid.apache.org/v1alpha1"
kind: "DruidIngestion"
metadata:
  name: sample-druid-spec
  namespace: mydruid
spec:
  clusterRef: mydruid
  suspend: false
  supervisorSpec: |-
    {
      "type": "kafka",
      "spec": {
        "dataSchema": {
          "dataSource": "metrics-kafka",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [
              "timestamp",
              "value"
            ]
          },
          "metricsSpec": [
            {
              "name": "count",
              "type": "count"
            },
            {
              "name": "value_sum",
              "fieldName": "value",
              "type": "doubleSum"
            },
            {
              "name": "value_min",
              "fieldName": "value",
              "type": "doubleMin"
            },
            {
              "name": "value_max",
              "fieldName": "value",
              "type": "doubleMax"
            }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "NONE"
          }
        },
        "ioConfig": {
          "topic": "metrics",
          "inputFormat": {
            "type": "json"
          },
          "consumerProperties": {
            "bootstrap.servers": "localhost:9092"
          },
          "taskCount": 1,
          "replicas": 1,
          "taskDuration": "PT1H"
        },
        "tuningConfig": {
          "type": "kafka",
          "maxRowsPerSegment": 5000000
        }
      }
    }

The controller shall reconcile this spec and send POST requests to the overlord API, i.e. http://localhost:8090/druid/indexer/v1/supervisor. A minimal sketch of the call follows.
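This sketch assumes the overlord base URL is resolved from the referenced Druid cluster; the helper name and error handling are illustrative. On success the overlord responds with the supervisor id, which is what the controller would record in the CR status.

package controllers

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// submitSupervisor POSTs the raw supervisor JSON from the CR to the overlord.
// The overlord answers with {"id": "<supervisor id>"}; the id is returned so
// the reconcile loop can patch it into the DruidIngestion status.
func submitSupervisor(ctx context.Context, overlordURL, supervisorSpec string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		overlordURL+"/druid/indexer/v1/supervisor", strings.NewReader(supervisorSpec))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("overlord returned %s", resp.Status)
	}

	var out struct {
		ID string `json:"id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.ID, nil
}

For example, submitSupervisor(ctx, "http://localhost:8090", ingestion.Spec.SupervisorSpec) would target the URL above.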

Reconciliation and State Changes

Controllers are a combination of level-driven and event-driven behavior. An update to the DruidIngestion CR can be reconciled as an update event, but there is still the possibility of an outage; to guard against missed events, the reconcile loop also triggers every N seconds.
In the current controller, all configs in the CR are converged to first-class Kubernetes objects. The supervisor spec can be created as a ConfigMap; this ConfigMap helps in case an event is missed, since we can trigger an update whenever the current state differs from the desired state. The Druid operator adds an objectHash to the ConfigMap (same flow as the current controller). A sketch of this hash check follows.
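This follows the objectHash idea from the current controller; the annotation key below is an illustrative assumption.

package controllers

import (
	"crypto/sha256"
	"encoding/hex"

	corev1 "k8s.io/api/core/v1"
)

// Illustrative annotation key, modeled on the current controller's objectHash flow.
const specHashAnnotation = "druidingestion.druid.apache.org/spec-hash"

// specHash hashes the supervisor spec so desired state can be compared with
// what was last applied, even if an update event was missed.
func specHash(supervisorSpec string) string {
	sum := sha256.Sum256([]byte(supervisorSpec))
	return hex.EncodeToString(sum[:])
}

// needsResubmit reports whether the spec recorded on the ConfigMap is stale
// relative to the CR, in which case the reconcile loop re-POSTs the spec.
func needsResubmit(cm *corev1.ConfigMap, supervisorSpec string) bool {
	return cm.Annotations[specHashAnnotation] != specHash(supervisorSpec)
}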

The CRD shall have a status. The status shall be patched with fields from the HTTP response from the Druid API; in particular, it shall hold the supervisor id. To suspend the supervisor, the controller shall get the id from the status and send a POST request to the overlord API at /druid/indexer/v1/supervisor/{supervisorId}/suspend.

Setting suspend back to false in the ingestion CR shall cause a reset of the supervisor spec. The operator shall emit events using the events API for each operation handled and update the status of the CR.
Deletion of the DruidIngestion CR shall be controlled by finalizers. Before deletion, the controller makes an HTTP call to delete the supervisor spec; at this point the CR is marked as terminating. Once the requests complete, the CR is removed. A sketch follows.
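This sketch uses controller-runtime's controllerutil helpers and builds on the reconciler and types sketched above; the finalizer name and the deleteSupervisor stub are illustrative assumptions.

package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Illustrative finalizer name.
const ingestionFinalizer = "druidingestion.druid.apache.org/finalizer"

// handleDeletion runs while the CR is terminating: delete the supervisor via
// the overlord API first, then drop the finalizer so the API server can
// remove the CR.
func (r *DruidIngestionReconciler) handleDeletion(ctx context.Context, ing *DruidIngestion) error {
	if !controllerutil.ContainsFinalizer(ing, ingestionFinalizer) {
		return nil
	}
	if err := r.deleteSupervisor(ctx, ing.Status.SupervisorID); err != nil {
		return err // retried on the next reconcile
	}
	controllerutil.RemoveFinalizer(ing, ingestionFinalizer)
	return r.Update(ctx, ing)
}

// deleteSupervisor is a hypothetical helper, e.g. a POST to the overlord's
// /druid/indexer/v1/supervisor/{supervisorId}/terminate endpoint.
func (r *DruidIngestionReconciler) deleteSupervisor(ctx context.Context, id string) error {
	// Sketch only; the real call would mirror submitSupervisor above.
	return nil
}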


This proposal may have missed some Druid-specific details of the API. The original issue: #251

@cintoSunny (Contributor)

What do you think the advantages are of having this in the operator/CRD, instead of having it in Druid? If I understand correctly, instead of directly submitting a job to Druid, users would have to deploy a CR. Correct me if I am wrong here. My concern is that deploying a CR every time may not be feasible for everyone. Not sure what everyone else thinks.

@AdheipSingh (Contributor, Author)

> What do you think the advantages are of having this in the operator/CRD, instead of having it in Druid? If I understand correctly, instead of directly submitting a job to Druid, users would have to deploy a CR. Correct me if I am wrong here. My concern is that deploying a CR every time may not be feasible for everyone. Not sure what everyone else thinks.

Do you see Kubernetes as an orchestration platform for running Druid, or do you see Kubernetes as a control plane for running Druid? If you consider the latter, you can leverage CRDs for handling supervisor specs etc.

Just like the Kafka operator, you can deploy Kafka and manage topics and ACLs via CRDs. Of course, you can create Kafka topics using CLI clients, the same way we can create Druid supervisors from the console, but if you want full control from k8s, enhanced GitOps, and the operator built as a control plane, this can be a way.
