
Druid pods taking more time to come up when security enabled #301

Open
bharathkuppala opened this issue Aug 3, 2022 · 14 comments
@bharathkuppala

Hello All,
druid-operator: v0.0.7
druid: v0.22.1

We are trying to enable basic security for Druid, and we found that it takes approx. 16 minutes for the entire cluster to come up. Investigating the logs, we found multiple retries against the coordinator while trying to fetch user data.
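For reference, our security setup follows the usual basic-security extension pattern from the Druid docs. Roughly (a sketch of the relevant part of our Druid CR; the authenticator/authorizer names and passwords here are placeholders, not our real values):

```yaml
spec:
  common.runtime.properties: |
    druid.extensions.loadList=["druid-basic-security"]

    druid.auth.authenticatorChain=["MyBasicMetadataAuthenticator"]
    druid.auth.authenticator.MyBasicMetadataAuthenticator.type=basic
    druid.auth.authenticator.MyBasicMetadataAuthenticator.initialAdminPassword=<admin-password>
    druid.auth.authenticator.MyBasicMetadataAuthenticator.initialInternalClientPassword=<internal-password>
    druid.auth.authenticator.MyBasicMetadataAuthenticator.credentialsValidator.type=metadata
    druid.auth.authenticator.MyBasicMetadataAuthenticator.authorizerName=MyBasicMetadataAuthorizer

    # Escalator: credentials the Druid services use to talk to each other
    druid.escalator.type=basic
    druid.escalator.internalClientUsername=druid_system
    druid.escalator.internalClientPassword=<internal-password>
    druid.escalator.authorizerName=MyBasicMetadataAuthorizer

    druid.auth.authorizers=["MyBasicMetadataAuthorizer"]
    druid.auth.authorizer.MyBasicMetadataAuthorizer.type=basic
```

With this extension enabled, every non-coordinator service polls the coordinator for the user/role cache at startup (the CoordinatorPollingBasicAuthorizerCacheManager in the stack trace below), so the other services stall in retries until a coordinator is reachable.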

Logs from middleManager:
{"timeMillis":1653992838211,"thread":"main","level":"WARN","loggerName":"org.apache.druid.java.util.common.RetryUtils","message":"Retrying (1 of 9) in 727ms.","thrown":{"commonElementCount":0,"localizedMessage":"No content to map due to end-of-input\n at [Source: (byte[])\"\"; line: -1, column: 0]","message":"No content to map due to end-of-input\n at [Source: (byte[])\"\"; line: -1, column: 0]","name":"com.fasterxml.jackson.databind.exc.MismatchedInputException","extendedStackTrace":[{"class":"com.fasterxml.jackson.databind.exc.MismatchedInputException","method":"from","file":"MismatchedInputException.java","line":59,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"_initForReading","file":"ObjectMapper.java","line":4360,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"_readMapAndClose","file":"ObjectMapper.java","line":4205,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"readValue","file":"ObjectMapper.java","line":3292,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"tryFetchUserMapsFromCoordinator","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":400,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"lambda$fetchUserAndRoleMapFromCoordinator$4","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":330,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":129,"exact":true,"location":"druid-core-0.2
2.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":81,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":163,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":153,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"fetchUserAndRoleMapFromCoordinator","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":328,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"initUserMaps","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":457,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"start","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":116,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"sun.reflect.NativeMethodAccessorImpl","method":"invoke0","file":"NativeMethodAccessorImpl.java","line":-2,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"sun.reflect.NativeMethodAccessorImpl","method":"invoke","file":"NativeMethodAccessorImpl.java","line":62,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"sun.reflect.DelegatingMethodAccessorImpl","method":"invoke","file":"DelegatingMethodAccessorImpl.java","line":43,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"java.lang.reflect.Method","method":"invoke","file":
"Method.java","line":498,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler","method":"start","file":"Lifecycle.java","line":446,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.lifecycle.Lifecycle","method":"start","file":"Lifecycle.java","line":341,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.guice.LifecycleModule$2","method":"start","file":"LifecycleModule.java","line":143,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.GuiceRunnable","method":"initLifecycle","file":"GuiceRunnable.java","line":115,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.ServerRunnable","method":"run","file":"ServerRunnable.java","line":63,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.Main","method":"main","file":"Main.java","line":113,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"}]},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}

In the above log, the service is calling a method that fetches information from the coordinator, and it retries up to 9 times.

We wanted to know: does the druid-operator impose any ordering on a Druid installation, e.g. the brokers, historicals, and middleManagers come up first during a scratch installation, and only then the coordinator and router?

If that is true, will the pods fail while waiting for the coordinator?
Any help on this?

@bharathkuppala
Author

@AdheipSingh: Can you please provide any info on this, if you already know?

@AdheipSingh
Contributor

AdheipSingh commented Aug 17, 2022

Sorry, I missed this.
The druid-operator has nothing to do with whether security is enabled on Druid or not. The druid-operator supports securityContexts on the Kubernetes pods it deploys and manages.
When rollingDeploy is set to true, the druid-operator upgrades the nodes in the order described here:
https://github.com/druid-io/druid-operator/blob/master/controllers/druid/handler.go#L1765
When the operator installs the cluster, all pods are deployed in parallel.
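For example, the flag lives at the top level of the CR spec (a minimal sketch; the cluster name is illustrative):

```yaml
apiVersion: druid.apache.org/v1alpha1
kind: Druid
metadata:
  name: my-druid-cluster   # illustrative name
spec:
  rollingDeploy: true      # ordered, incremental upgrades; the first install is still parallel
  # ... nodes, common.runtime.properties, etc.
```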

@bharathkuppala
Author

bharathkuppala commented Sep 8, 2022

Thanks for the reply @AdheipSingh, and sorry for not responding sooner; we were testing this. We found that when we define all node types as StatefulSets, all pods are deployed in parallel, as you said. But in our case we have:
Deployments: brokers, coordinators, and routers
StatefulSets: historicals and middleManagers

Now when we install, along with the two StatefulSets only one Deployment comes up, which blocks the other Deployments from coming up.

Can you please check this behavior if not already done? Any suggestions would be a great help!!

Stack trace
2022-09-08T05:40:35.675Z ERROR controller-runtime.manager.controller.druid Reconciler error {"reconciler group": "druid.apache.org", "reconciler kind": "Druid", "name": "tsdb-apache-druid", "namespace": "nom-backend", "error": "StatefulSet.apps \"druid-tsdb-apache-druid-brokers\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /home/jenkins/workspace/NM/Neo/operators/druid-operator/druid-operator-druid-operator-0.0.8/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /home/jenkins/workspace/NM/Neo/operators/druid-operator/druid-operator-druid-operator-0.0.8/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
After the above stack trace, reconciliation stops near the brokers, preventing the other Deployments from coming up.
Thanks!!

@bharathkuppala
Author

Intermittently I see this issue after doing a scratch install multiple times, but I always see the above stack trace; not sure why :(

@AdheipSingh
Contributor

AdheipSingh commented Sep 8, 2022

@bharathkuppala https://github.com/druid-io/druid-operator/blob/master/controllers/druid/handler.go#L182

Whether Deployment or StatefulSet, in any combination, all pods should come up in parallel on the first installation.

Can you tell me the generation number in the CR metadata? If it is > 1 and rollingDeploy is set to true in your CR, upgrades and changes happen incrementally.

Use kind: Deployment to switch to a Deployment; by default it is a StatefulSet.
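A sketch of what that looks like per node (field names as in the druid-operator CRD; the broker settings are illustrative):

```yaml
spec:
  rollingDeploy: true
  nodes:
    brokers:
      nodeType: broker
      kind: Deployment    # omit kind (or use StatefulSet) for the default
      druid.port: 8088
      replicas: 2
# To check the CR generation:
#   kubectl get druid <name> -n <namespace> -o jsonpath='{.metadata.generation}'
```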

@bharathkuppala
Author

@AdheipSingh: Yes, as you said, there is no problem during an upgrade: with generation > 1 it follows the prescribed order. But somehow I see that with podManagementPolicy set to Parallel, both the historicals and MM are deployed first; next the brokers come up, blocking the other Deployments.

@bharathkuppala
Author

We are running our Druid cluster on OpenShift. Do we have to check anything on the OpenShift side to troubleshoot this behavior?
Because in the operator we saw it often hitting this line and then erroring on the else branch:

if apierrors.IsNotFound(err) {

@AdheipSingh
Contributor

@bharathkuppala IMHO it should not arise; I will try to re-create this issue. Just to double-check, did you switch between StatefulSet and Deployment in kind?

@bharathkuppala
Author

@AdheipSingh: Yes, I did switch, but I still see this issue.

@bharathkuppala
Author

bharathkuppala commented Sep 9, 2022

druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 0s
druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 0s
druid-tsdb-apache-druid-middlemanagers-0 0/1 Pending 0 0s
druid-tsdb-apache-druid-middlemanagers-0 0/1 Pending 0 1s
druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 1s
druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 1s
druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 1s
druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Pending 0 1s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 2s
druid-tsdb-apache-druid-middlemanagers-0 0/1 Pending 0 2s
druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 2s
druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 2s
druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 2s
druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 2s
druid-tsdb-apache-druid-middlemanagers-0 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-historicals-0 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-middlemanagers-1 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 3s
druid-tsdb-apache-druid-middlemanagers-0 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-historicals-1 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-historicals-0 0/1 ContainerCreating 0 6s
druid-tsdb-apache-druid-middlemanagers-1 0/1 ContainerCreating 0 8s
druid-tsdb-apache-druid-middlemanagers-0 0/1 ContainerCreating 0 8s
druid-tsdb-apache-druid-historicals-1 0/1 ContainerCreating 0 9s
druid-tsdb-apache-druid-historicals-1 0/1 Running 0 1m
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Running 0 1m
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 Running 0 1m
druid-tsdb-apache-druid-middlemanagers-1 0/1 Running 0 1m
druid-tsdb-apache-druid-historicals-0 0/1 Running 0 1m
druid-tsdb-apache-druid-middlemanagers-0 0/1 Running 0 2m
druid-tsdb-apache-druid-historicals-0 1/1 Running 0 13m
druid-tsdb-apache-druid-middlemanagers-1 1/1 Running 0 13m
druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 1/1 Running 0 14m
druid-tsdb-apache-druid-brokers-86c4849b47-4598x 1/1 Running 0 14m
druid-tsdb-apache-druid-middlemanagers-0 1/1 Running 0 14m
druid-tsdb-apache-druid-historicals-1 1/1 Running 0 14m

druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Pending 0 0s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Pending 0 1s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Pending 0 1s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Pending 0 2s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Pending 0 2s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Pending 0 2s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Running 0 3s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Running 0 4s
druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 1/1 Running 0 30s
druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 1/1 Running 0 30s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Pending 0 0s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Pending 0 1s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Pending 0 1s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 ContainerCreating 0 1s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 ContainerCreating 0 1s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 ContainerCreating 0 2s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 ContainerCreating 0 3s
druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Running 0 3s
druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Running 0 4s

This is what is happening in my current setup: only once the historical, MM, and broker pods are up are the coordinators and routers triggered.

@AdheipSingh
Contributor

Are you adding any node selectors? Are your coordinators/routers getting scheduled on different nodes?
If rollingDeploy is set to false, all pods should come up in parallel; there is no conditional check that waits for Druid nodes on the first installation.

@bharathkuppala
Author

@AdheipSingh: No, we are not using any node selectors for scheduling the pods. And we are using rollingDeploy: true because we want our pods to upgrade in the prescribed order.

Yes, I see that the druid-operator code does not check the generation here, so pods should come up in parallel. I can share the druid-operator logs so we have a complete trace.

@AdheipSingh
Contributor

@bharathkuppala what is your podManagementPolicy? Try setting it to Parallel.

@bharathkuppala
Author

@AdheipSingh: Yes, it is Parallel by default for a StatefulSet, right?
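If it needed setting explicitly, I would expect it to go in the node spec, something like this (a sketch; I am assuming the druid-operator CRD exposes a podManagementPolicy field on StatefulSet nodes):

```yaml
spec:
  nodes:
    historicals:
      nodeType: historical
      kind: StatefulSet
      podManagementPolicy: Parallel   # vs OrderedReady, which starts pods one at a time
      replicas: 2
```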
