
FlinkOperator crashes when deploying a new Job in a FlinkCluster #408

Open
morelina opened this issue Feb 8, 2021 · 1 comment · May be fixed by #420


morelina commented Feb 8, 2021

I am trying to update a running Job in my Flink Job Cluster. I am using commit 72e89b2, which is still in a PR with some fixes.

The FlinkOperator triggers the savepoint and it is created successfully; however, the operator crashes immediately afterwards.
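
For context, an update of this kind is applied by editing the FlinkCluster resource (for example, bumping the job image). The sketch below is illustrative only, not the actual manifest from this report; all values other than the cluster name and namespace are placeholders:

# Illustrative sketch, not the actual manifest.
# Field names follow the FlinkCluster CRD; image, jar file and restart policy are placeholders.
apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: cluster-a
  namespace: namespace-a
spec:
  image:
    name: registry.example.com/my-flink-job:2.0.0   # changing the job spec/image is what triggers the update
  job:
    jarFile: ./my-job.jar
    restartPolicy: FromSavepointOnFailure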

These are the events on the FlinkCluster:
Normal SavepointCreated 17m FlinkOperator Successfully savepoint created
Normal SavepointTriggered 12m FlinkOperator Triggered savepoint for update: triggerID 590c343c5e3934e4996e5904b719cf17.
Normal SavepointCreated 12m FlinkOperator Successfully savepoint created
Normal SavepointTriggered 7m1s FlinkOperator Triggered savepoint for update: triggerID 51660f2c77254db025d37e23c0fa57e7.
Normal SavepointCreated 6m56s FlinkOperator Successfully savepoint created
Normal SavepointTriggered 107s FlinkOperator Triggered savepoint for update: triggerID ffc1ff58f0ee872a278fab5b

And these are the logs from the crash:

controllers.FlinkCluster ---------- 4. Take actions ---------- {"cluster": "namespace-a/cluster-a"}
controllers.FlinkCluster ConfigMap already exists, no action {"cluster": "namespace-a/cluster-a"}
controllers.FlinkCluster Statefulset already exists, no action {"cluster": "namespace-a/cluster-a", "component": "JobManager"}
controllers.FlinkCluster JobManager service already exists, no action {"cluster": "namespace-a/cluster-a"}
controllers.FlinkCluster Statefulset already exists, no action {"cluster": "namespace-a/cluster-a", "component": "TaskManager"}
controllers.FlinkCluster Job is about to be restarted to update {"cluster": "namespace-a/cluster-a"}
E0208 17:15:08.662342 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 362 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1422fc0, 0x2241f50)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1422fc0, 0x2241f50)
/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcileJob(0xc001cad4a0, 0x15c2f00, 0x0, 0x0, 0x0)
/workspace/controllers/flinkcluster_reconciler.go:511 +0x5fb
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcile(0xc001cad4a0, 0xc001cad4a0, 0x25, 0x0, 0x0)
/workspace/controllers/flinkcluster_reconciler.go:111 +0x223
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterHandler).reconcile(0xc000d55b28, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0xc0013e6800, 0x36e54d7482f9f143, 0x13b94c0, 0xc0013e6770)
/workspace/controllers/flinkcluster_controller.go:220 +0xb91
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterReconciler).Reconcile(0xc0007165a0, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0x0, 0xc0007a47273aff35, 0xc000724360, 0xc000724128)
/workspace/controllers/flinkcluster_controller.go:82 +0x249
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000c89c0, 0x1475780, 0xc0013e6760, 0x0)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:256 +0x161
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000c89c0, 0xc00051ee00)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:232 +0xae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0000c89c0)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00007a950)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007a950, 0x17b1ee0, 0xc0004362d0, 0x1668101, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00007a950, 0x3b9aca00, 0x0, 0x1, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc00007a950, 0x3b9aca00, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:193 +0x305
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x12eee3b]

goroutine 362 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/runtime/runtime.go:55 +0x105
panic(0x1422fc0, 0x2241f50)
/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcileJob(0xc001cad4a0, 0x15c2f00, 0x0, 0x0, 0x0)
/workspace/controllers/flinkcluster_reconciler.go:511 +0x5fb
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcile(0xc001cad4a0, 0xc001cad4a0, 0x25, 0x0, 0x0)
/workspace/controllers/flinkcluster_reconciler.go:111 +0x223
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterHandler).reconcile(0xc000d55b28, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0xc0013e6800, 0x36e54d7482f9f143, 0x13b94c0, 0xc0013e6770)
/workspace/controllers/flinkcluster_controller.go:220 +0xb91
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterReconciler).Reconcile(0xc0007165a0, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0x0, 0xc0007a47273aff35, 0xc000724360, 0xc000724128)
/workspace/controllers/flinkcluster_controller.go:82 +0x249
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000c89c0, 0x1475780, 0xc0013e6760, 0x0)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:256 +0x161
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000c89c0, 0xc00051ee00)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:232 +0xae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0000c89c0)
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00007a950)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007a950, 0x17b1ee0, 0xc0004362d0, 0x1668101, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00007a950, 0x3b9aca00, 0x0, 0x1, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc00007a950, 0x3b9aca00, 0xc000114360)
/root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:193 +0x305

elanv (Contributor) commented Feb 9, 2021

It seems to be caused by the newly added field. A workaround would be to set the value of spec.job.takeSavepointOnUpgrade to true.
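
For reference, a minimal sketch of where that field sits in the FlinkCluster spec; everything below other than takeSavepointOnUpgrade is illustrative:

spec:
  job:
    # Workaround from the comment above: set the newly added field explicitly.
    takeSavepointOnUpgrade: true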

elanv linked a pull request (#420) on Feb 23, 2021 that will close this issue.