
FlinkOperator crashes when deploying a new Job in a FlinkCluster #408

Open
morelina opened this issue Feb 8, 2021 · 1 comment · May be fixed by #420


morelina commented Feb 8, 2021

I am trying to update a running Job in my Flink Job Cluster. I am using commit 72e89b2, which is still in a PR with some fixes.

The FlinkOperator triggers the savepoint and it is created successfully; however, the operator crashes immediately afterwards.
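
For context, an update of this kind is applied by editing the FlinkCluster resource (for example, bumping the job image). The sketch below is illustrative only, not the actual manifest from this report; all values other than the cluster name and namespace are placeholders:

# Illustrative sketch, not the actual manifest.
# Field names follow the FlinkCluster CRD; image, jar file and restart policy are placeholders.
apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: cluster-a
  namespace: namespace-a
spec:
  image:
    name: registry.example.com/my-flink-job:2.0.0   # changing the job spec/image is what triggers the update
  job:
    jarFile: ./my-job.jar
    restartPolicy: FromSavepointOnFailure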

These are the events on the FlinkCluster:
Normal SavepointCreated 17m FlinkOperator Successfully savepoint created
Normal SavepointTriggered 12m FlinkOperator Triggered savepoint for update: triggerID 590c343c5e3934e4996e5904b719cf17.
Normal SavepointCreated 12m FlinkOperator Successfully savepoint created
Normal SavepointTriggered 7m1s FlinkOperator Triggered savepoint for update: triggerID 51660f2c77254db025d37e23c0fa57e7.
Normal SavepointCreated 6m56s FlinkOperator Successfully savepoint created
Normal SavepointTriggered 107s FlinkOperator Triggered savepoint for update: triggerID ffc1ff58f0ee872a278fab5b

And these are the logs from the crash:

controllers.FlinkCluster ---------- 4. Take actions ---------- {"cluster": "namespace-a/cluster-a"}
controllers.FlinkCluster ConfigMap already exists, no action {"cluster": "namespace-a/cluster-a"}
controllers.FlinkCluster Statefulset already exists, no action {"cluster": "namespace-a/cluster-a", "component": "JobManager"}
controllers.FlinkCluster JobManager service already exists, no action {"cluster": "namespace-a/cluster-a"}
controllers.FlinkCluster Statefulset already exists, no action {"cluster": "namespace-a/cluster-a", "component": "TaskManager"}
controllers.FlinkCluster Job is about to be restarted to update {"cluster": "namespace-a/cluster-a"}
E0208 17:15:08.662342 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 362 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1422fc0, 0x2241f50)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1422fc0, 0x2241f50)
/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcileJob(0xc001cad4a0, 0x15c2f00, 0x0, 0x0, 0x0)
/workspace/controllers/flinkcluster_reconciler.go:511 +0x5fb
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcile(0xc001cad4a0, 0xc001cad4a0, 0x25, 0x0, 0x0)
/workspace/controllers/flinkcluster_reconciler.go:111 +0x223
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterHandler).reconcile(0xc000d55b28, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0xc0013e6800, 0x36e54d7482f9f143, 0x13b94c0, 0xc0013e6770)
/workspace/controllers/flinkcluster_controller.go:220 +0xb91
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterReconciler).Reconcile(0xc0007165a0, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0x0, 0xc0007a47273aff35, 0xc000724360, 0xc000724128)
/workspace/controllers/flinkcluster_controller.go:82 +0x249
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000c89c0, 0x1475780, 0xc0013e6760, 0x0)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:256 +0x161
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000c89c0, 0xc00051ee00)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:232 +0xae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0000c89c0)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00007a950)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007a950, 0x17b1ee0, 0xc0004362d0, 0x1668101, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00007a950, 0x3b9aca00, 0x0, 0x1, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc00007a950, 0x3b9aca00, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:193 +0x305
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x12eee3b]

goroutine 362 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/runtime/runtime.go:55 +0x105
panic(0x1422fc0, 0x2241f50)
/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcileJob(0xc001cad4a0, 0x15c2f00, 0x0, 0x0, 0x0)
/workspace/controllers/flinkcluster_reconciler.go:511 +0x5fb
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcile(0xc001cad4a0, 0xc001cad4a0, 0x25, 0x0, 0x0)
/workspace/controllers/flinkcluster_reconciler.go:111 +0x223
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterHandler).reconcile(0xc000d55b28, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0xc0013e6800, 0x36e54d7482f9f143, 0x13b94c0, 0xc0013e6770)
/workspace/controllers/flinkcluster_controller.go:220 +0xb91
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterReconciler).Reconcile(0xc0007165a0, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0x0, 0xc0007a47273aff35, 0xc000724360, 0xc000724128)
/workspace/controllers/flinkcluster_controller.go:82 +0x249
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000c89c0, 0x1475780, 0xc0013e6760, 0x0)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:256 +0x161
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000c89c0, 0xc00051ee00)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:232 +0xae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0000c89c0)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00007a950)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007a950, 0x17b1ee0, 0xc0004362d0, 0x1668101, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00007a950, 0x3b9aca00, 0x0, 0x1, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc00007a950, 0x3b9aca00, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:193 +0x305

elanv (Contributor) commented Feb 9, 2021

It seems to be caused by the newly added field. A workaround would be to set the value of spec.job.takeSavepointOnUpgrade to true.
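
For reference, a minimal sketch of where that field sits in the FlinkCluster spec; everything below other than takeSavepointOnUpgrade is illustrative:

spec:
  job:
    # Workaround from the comment above: set the newly added field explicitly.
    takeSavepointOnUpgrade: true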

elanv linked a pull request (#420) on Feb 23, 2021 that will close this issue.