Workflows-worker ran for 41 days but is stuck in CrashLoopBackOff after a restart

Hi everyone,

I’m running the Kubernetes-based Mender server (chart version mender-5.2.3, app version 3.6.2) and I have an issue with the workflows-worker pod.

The pod had been running for 41 days without restarts, but today I deleted it manually with kubectl delete pod mender-workflows-worker-..., because I had changed the SMTP secret with a helm upgrade and the pod was not restarted automatically.

Since the manual pod restart, the pod has been in CrashLoopBackOff state with the following error:

kubectl logs -f mender-workflows-worker-859b66449-zjcl2
time="2023-10-20T11:59:42Z" level=info msg="migrating workflows" file=migrations.go func=mongo.Migrate line=38
time="2023-10-20T11:59:42Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=migrator_simple.go func="migrate.(*SimpleMigrator).Apply" line=124
time="2023-10-20T11:59:42Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=migrator_simple.go func="migrate.(*SimpleMigrator).Apply" line=139
time="2023-10-20T11:59:47Z" level=info msg="nats client closed the connection" file=client.go func=nats.NewClientWithDefaults.func1.1 line=91
2023/10/20 11:59:47 failed to apply Jetstream consumer migrations: context deadline exceeded

I updated the release with the following command, so hooks were not disabled (a quick hook-job check is sketched after the diff below):

helm upgrade mender mender/mender --version 5.2.3 -f mender-3.6.2.yml

and only the SMTP secret was changed:

helm diff revision mender 1 2
default, smtp, Secret (v1) has changed:
  # Source: mender/templates/secret-smtp.yaml
  apiVersion: v1
  kind: Secret
  metadata:
    labels:
      app.kubernetes.io/component: secrets
      app.kubernetes.io/instance: mender
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: smtp
      app.kubernetes.io/part-of: mender
      helm.sh/chart: mender
    name: smtp
    namespace: default
  data:
-   EMAIL_SENDER: '-------- # (14 bytes)'
-   SMTP_AUTH_MECHANISM: '-------- # (5 bytes)'
-   SMTP_HOST: '-------- # (12 bytes)'
-   SMTP_PASSWORD: '-------- # (4 bytes)'
-   SMTP_SSL: '-------- # (5 bytes)'
-   SMTP_USERNAME: '-------- # (4 bytes)'
+   EMAIL_SENDER: '++++++++ # (... bytes)'
+   SMTP_AUTH_MECHANISM: '++++++++ # (... bytes)'
+   SMTP_HOST: '++++++++ # (... bytes)'
+   SMTP_PASSWORD: '++++++++ # (...bytes)'
+   SMTP_SSL: '++++++++ # (... bytes)'
+   SMTP_USERNAME: '++++++++ # (... bytes)'
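
To double-check that the chart hooks from that upgrade actually ran, the hook jobs can be listed and inspected; the job names vary by chart version, so the name below is only a placeholder:

kubectl get jobs                    # hook/migration jobs created by the chart
kubectl logs job/<hook-job-name>    # logs of a specific hook job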

I do see some activity in the mender-nats-0 pod every time the workflows-worker restarts.
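
For reference, the NATS log excerpt below was captured roughly like this (pod name of the first replica of the default StatefulSet):

kubectl logs -f mender-nats-0   # follow the first NATS replica's logs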

[7] 2023/09/08 12:41:58.668123 [INF] 10.224.0.59:6222 - rid:14 - Router connection closed: Duplicate Route
[7] 2023/09/08 12:42:05.595351 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/09/08 12:42:09.849093 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > create-artifact-worker'
[7] 2023/10/20 11:49:05.650113 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/10/20 11:49:11.168638 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/10/20 11:49:31.952829 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/10/20 11:49:59.946672 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/10/20 11:50:47.943136 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'

Any help would be greatly appreciated.
Kind Regards

I tried a helm rollback to revision 1, which did not fix the issue.
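
For reference, the rollback was just the standard command (release name as above):

helm rollback mender 1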

helm history mender
REVISION        UPDATED                         STATUS          CHART           APP VERSION     DESCRIPTION     
1               Fri Sep  8 14:39:58 2023        superseded      mender-5.2.3    3.6.2           Install complete
2               Fri Oct 20 13:41:32 2023        superseded      mender-5.2.3    3.6.2           Upgrade complete
3               Fri Oct 20 13:52:53 2023        deployed        mender-5.2.3    3.6.2           Rollback to 1 

Hi @oldev,
can you please try with an updated NATS version?

In your values file you can specify:

nats:
  nats:
    image: nats:2.9.21-scratch

The default image is nats:2.7.4-scratch.

When you run the helm upgrade, a new migration job should run and fix the NATS JetStream. After that, can you try restarting the workflows-worker and workflows-server pods?
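
A minimal sketch of the whole sequence, assuming the release name and values file from your first post (the deployment names are taken from the default chart naming):

# apply the bumped NATS image from the values file
helm upgrade mender mender/mender --version 5.2.3 -f mender-3.6.2.yml

# wait for the migration/hook job to complete (kubectl get jobs), then restart the pods
kubectl rollout restart deployment mender-workflows-worker mender-workflows-server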

Thanks

Hi,
thanks for the quick reply.

The new revision is looking good.

The NATS pods were upgraded to the new image, and the workflows server and worker are up and running after a pod restart.
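
A quick way to confirm the image bump and the pod state (the pod name below comes from the default StatefulSet naming):

kubectl describe pod mender-nats-0 | grep "Image:"   # should now show nats:2.9.21-scratch
kubectl get pods                                     # check that the workflows pods are Running again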

Here are the new worker logs:

kubectl logs -f mender-workflows-worker-859b66449-6cgsg
time="2023-10-20T13:28:31Z" level=info msg="migrating workflows" file=migrations.go func=mongo.Migrate line=38
time="2023-10-20T13:28:31Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=migrator_simple.go func="migrate.(*SimpleMigrator).Apply" line=124
time="2023-10-20T13:28:31Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=migrator_simple.go func="migrate.(*SimpleMigrator).Apply" line=139
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading 9 workflows from /etc/workflows/definitions." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=187
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading decommission_device v10." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: decommission_device v10: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading deploy_device_configuration v2." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: deploy_device_configuration v2: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading provision_external_device v3." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: provision_external_device v3: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading reindex_inventory v3." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: reindex_inventory v3: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading reindex_reporting v3." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: reindex_reporting v3: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading update_device_status v7." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: error loading: update_device_status v7: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: loading provision_device v10." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: error loading: provision_device v10: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: loading reindex_reporting_deployment v1." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: error loading: reindex_reporting_deployment v1: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: loading update_device_inventory v3." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: error loading: update_device_inventory v3: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=9
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=6
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=8
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=3
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=1
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=7
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=0
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=2
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=4
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=5