oldev
October 20, 2023, 1:00pm
1
Hi everyone,
I’m running the Kubernetes-based Mender Server (chart version mender-5.2.3, app version 3.6.2) and I have an issue with the workflows-worker pod.
The pod had been running for 41 days without a restart, but today I performed a manual kubectl delete pod mender-workflows-worker-..., because I had changed the smtp Secret with a helm upgrade and the pod was not restarted automatically.
Since that manual restart, the pod has been stuck in CrashLoopBackOff with the following error:
kubectl logs -f mender-workflows-worker-859b66449-zjcl2
time="2023-10-20T11:59:42Z" level=info msg="migrating workflows" file=migrations.go func=mongo.Migrate line=38
time="2023-10-20T11:59:42Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=migrator_simple.go func="migrate.(*SimpleMigrator).Apply" line=124
time="2023-10-20T11:59:42Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=migrator_simple.go func="migrate.(*SimpleMigrator).Apply" line=139
time="2023-10-20T11:59:47Z" level=info msg="nats client closed the connection" file=client.go func=nats.NewClientWithDefaults.func1.1 line=91
2023/10/20 11:59:47 failed to apply Jetstream consumer migrations: context deadline exceeded
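For completeness: I deleted the pod by hand because the secret change alone did not trigger a restart. As far as I know, a rollout restart of the deployment should have the same effect (assuming the deployment is named mender-workflows-worker, matching the pod name):
kubectl rollout restart deployment mender-workflows-worker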
I updated the release with the following command (hooks were not disabled):
helm upgrade mender mender/mender --version 5.2.3 -f mender-3.6.2.yml
and only the smtp secret was changed:
helm diff revision mender 1 2
default, smtp, Secret (v1) has changed:
  # Source: mender/templates/secret-smtp.yaml
  apiVersion: v1
  kind: Secret
  metadata:
    labels:
      app.kubernetes.io/component: secrets
      app.kubernetes.io/instance: mender
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: smtp
      app.kubernetes.io/part-of: mender
      helm.sh/chart: mender
    name: smtp
    namespace: default
  data:
-   EMAIL_SENDER: '-------- # (14 bytes)'
-   SMTP_AUTH_MECHANISM: '-------- # (5 bytes)'
-   SMTP_HOST: '-------- # (12 bytes)'
-   SMTP_PASSWORD: '-------- # (4 bytes)'
-   SMTP_SSL: '-------- # (5 bytes)'
-   SMTP_USERNAME: '-------- # (4 bytes)'
+   EMAIL_SENDER: '++++++++ # (... bytes)'
+   SMTP_AUTH_MECHANISM: '++++++++ # (... bytes)'
+   SMTP_HOST: '++++++++ # (... bytes)'
+   SMTP_PASSWORD: '++++++++ # (... bytes)'
+   SMTP_SSL: '++++++++ # (... bytes)'
+   SMTP_USERNAME: '++++++++ # (... bytes)'
I do see some activity in the mender-nats-0 pod every time the workflows-worker restarts:
[7] 2023/09/08 12:41:58.668123 [INF] 10.224.0.59:6222 - rid:14 - Router connection closed: Duplicate Route
[7] 2023/09/08 12:42:05.595351 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/09/08 12:42:09.849093 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > create-artifact-worker'
[7] 2023/10/20 11:49:05.650113 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/10/20 11:49:11.168638 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/10/20 11:49:31.952829 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/10/20 11:49:59.946672 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[7] 2023/10/20 11:50:47.943136 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
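These lines were taken with a plain log dump, e.g.:
kubectl logs mender-nats-0
(If the pod runs more than one container, the right one may need to be selected with -c.)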
Any help would be greatly appreciated.
Kind Regards
oldev
October 20, 2023, 1:07pm
2
I tried a helm rollback to revision 1, which did not fix the issue:
helm history mender
REVISION  UPDATED                   STATUS      CHART         APP VERSION  DESCRIPTION
1         Fri Sep 8 14:39:58 2023   superseded  mender-5.2.3  3.6.2        Install complete
2         Fri Oct 20 13:41:32 2023  superseded  mender-5.2.3  3.6.2        Upgrade complete
3         Fri Oct 20 13:52:53 2023  deployed    mender-5.2.3  3.6.2        Rollback to 1
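For reference, the rollback itself was just the standard command, i.e. something like:
helm rollback mender 1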
robgio
October 20, 2023, 1:22pm
3
Hi @oldev ,
can you please try with an updated NATS version?
In your values file you can specify:
nats:
  nats:
    image: nats:2.9.21-scratch
The default image is nats:2.7.4-scratch.
When you run the helm upgrade, a new migration job should run and fix the NATS JetStream. After that, can you try restarting the workflows-worker and server pods?
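Roughly like this, reusing the release name and values file from your commands above (the deployment names are my assumption, taken from the pod names):
helm upgrade mender mender/mender --version 5.2.3 -f mender-3.6.2.yml
kubectl rollout restart deployment mender-workflows-worker
kubectl rollout restart deployment mender-workflows-server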
Thanks
oldev
October 20, 2023, 1:34pm
4
Hi,
thanks for the quick reply.
The new revision is looking good.
The NATS pods were upgraded to the new image, and the workflows-server and workflows-worker are up and running after a pod restart.
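For the record, the image bump can be verified with something like:
kubectl get pod mender-nats-0 -o jsonpath='{.spec.containers[*].image}'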
Here are the new worker logs:
kubectl logs -f mender-workflows-worker-859b66449-6cgsg
time="2023-10-20T13:28:31Z" level=info msg="migrating workflows" file=migrations.go func=mongo.Migrate line=38
time="2023-10-20T13:28:31Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=migrator_simple.go func="migrate.(*SimpleMigrator).Apply" line=124
time="2023-10-20T13:28:31Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=migrator_simple.go func="migrate.(*SimpleMigrator).Apply" line=139
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading 9 workflows from /etc/workflows/definitions." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=187
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading decommission_device v10." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: decommission_device v10: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading deploy_device_configuration v2." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: deploy_device_configuration v2: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading provision_external_device v3." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: provision_external_device v3: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading reindex_inventory v3." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: reindex_inventory v3: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading reindex_reporting v3." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: error loading: reindex_reporting v3: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:31Z" level=info msg="LoadWorkflows: loading update_device_status v7." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: error loading: update_device_status v7: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: loading provision_device v10." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: error loading: provision_device v10: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: loading reindex_reporting_deployment v1." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: error loading: reindex_reporting_deployment v1: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: loading update_device_inventory v3." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=189
time="2023-10-20T13:28:32Z" level=info msg="LoadWorkflows: error loading: update_device_inventory v3: Workflow already exists." file=datastore_mongo.go func="mongo.(*DataStoreMongo).LoadWorkflows" line=195
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=9
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=6
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=8
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=3
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=1
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=7
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=0
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=2
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=4
time="2023-10-20T13:28:32Z" level=info msg="worker starting up" file=worker.go func=worker.InitAndRun.func2 line=114 worker_id=5