Mender on K8S problems

I’m having a problem setting up mender (non-enterprise) on K8S.

I’ve followed the instructions from here:

I have also now created my own hostname for the Mender server (the other thread can be ignored) and have my own ingress that is working correctly. I have created a user using useradm as specified and can log in, but I get a number of errors. In K8S I can see that not all services are starting:

mender         api-gateway-5ccf655f9b-k98rw                               1/1     Running            0                4h10m
mender         create-artifact-worker-68db55dd4f-4xbtw                    0/1     CrashLoopBackOff   11 (3m25s ago)   35m
mender         deployments-66659cd55f-84287                               0/1     Running            11 (3h42m ago)   4h11m
mender         device-auth-ddf5fdccc-njw6f                                0/1     Running            1 (4h8m ago)     4h8m
mender         deviceconfig-798796457d-wsw45                              1/1     Running            1 (4h8m ago)     4h9m
mender         deviceconnect-5d4dc576cc-6dx6k                             1/1     Running            2 (4h8m ago)     4h9m
mender         gui-584bfd7ff7-gddzt                                       1/1     Running            0                4h10m
mender         inventory-595cb5bb7b-ngdnf                                 1/1     Running            1 (4h8m ago)     4h9m
mender         iot-manager-7b9596f696-855mf                               1/1     Running            2 (4h9m ago)     4h9m
mender         useradm-59c9449649-k5gwq                                   1/1     Running            4 (4h9m ago)     4h10m
mender         workflows-server-7794f55d99-2xhj2                          0/1     Error              54 (5m14s ago)   4h12m
mender         workflows-worker-778f8fb6d6-4m9dv                          0/1     CrashLoopBackOff   17 (3m51s ago)   67m

Most of them are, however, and the connection to Mongo is also working.

All three of the crashing services report something similar:

time="2022-03-11T16:05:51Z" level=info msg="migrating workflows" file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-11T16:05:51Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-11T16:05:51Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
2022/03/11 16:05:56 context deadline exceeded
time="2022-03-11T16:04:11Z" level=info msg="migrating workflows" file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-11T16:04:11Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-11T16:04:11Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
2022/03/11 16:04:16 context deadline exceeded
time="2022-03-11T16:05:25Z" level=info msg="migrating workflows" file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-11T16:05:25Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-11T16:05:25Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
2022/03/11 16:05:30 context deadline exceeded

Isn't it strange that they all log the same thing?

Also, the time between the previous log line and the "context deadline exceeded" message is only 5 seconds.

I recreated the namespace from scratch with the exact same result.

I noticed that I had used the most up-to-date Mongo, but the version in the MongoDB | Mender documentation was set at 10.21… So I installed that instead and redid the whole mender namespace as well… Exact same result again.
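
For reference, the Mongo install was roughly the following (assuming the Bitnami chart the documentation points to; mongodb.yml is just my values file):

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install mongodb bitnami/mongodb --version 10.21.1 -f mongodb.yml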

Hello @davel, and thank you for trying Mender!

The services that are crashing all depend on NATS, which I suspect is either not deployed or not reachable at the configured URL. Can you verify that NATS is deployed and reachable at the nats hostname in your deployment?
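
For example, something along these lines should confirm it (assuming nats-box was deployed alongside NATS, and adjusting the namespace and hostname to match your setup):

# confirm the NATS pods and service exist
kubectl -n nats get pods,svc

# quick publish test from the nats-box helper pod
kubectl -n nats exec deploy/nats-box -- nats pub test.connectivity "hello" -s nats://nats.nats.svc.cluster.local:4222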

Hi @alfrunes, thanks for the reply. NATS is deployed, healthy and reachable:

 nats: {
    URL: 'nats://nats.nats.svc.cluster.local:4222'
 },

What I did this morning was downgrade Mender to 3.1.1, and it is working now. I didn't change anything else, so if I had to hazard a guess I would say there is a bug somewhere in 3.2.1.
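
In case it helps anyone else, the downgrade was just a matter of pinning the chart version, roughly like this (assuming the chart version tracks the app version, as it does for 3.2.1; the values file name is mine):

helm upgrade --install mender mender/mender --version 3.1.1 -f mender-values.yml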

I think I may have found an error in our documentation. Since the latest release, these backend services depend on NATS JetStream, which needs explicit configuration to work. I think the following helm values should work for the NATS deployment:

cluster:
  enabled: true
  replicas: 2
nats:
  image: "nats:2.7.4-alpine"
  jetstream:
    enabled: true

    memStorage:
      enabled: true
      size: "2Gi"

    fileStorage:
      enabled: true
      size: "1Gi"
      storageDirectory: /data/
      #storageClassName: "default"
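
To apply these values, something like the following should work (adding the repo is only needed the first time):

helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm upgrade --install nats nats/nats -f nats-values.yml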

I’ve created a ticket to make sure the documentation gets corrected ASAP.

Hi @alfrunes ,

I have a similar problem to davel's.
I have tried installing NATS with your new configuration, and the result after running

cat >nats.yml <<EOF
cluster:
  enabled: true
  replicas: 2
nats:
  image: "nats:2.3.1-alpine"
  jetstream:
    enabled: true

    memStorage:
      enabled: true
      size: "1Gi"

    fileStorage:
      enabled: true
      size: "2Gi"
      storageDirectory: /data/
EOF

helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm repo update
helm install nats nats/nats --version 0.8.2 -f nats.yml

is:

NAME                                       READY   STATUS    RESTARTS      AGE
cert-manager-5d7f97b46d-kzr4b              1/1     Running   1 (55m ago)   68m
mongodb-arbiter-0                          1/1     Running   1 (55m ago)   66m
mongodb-0                                  1/1     Running   1 (55m ago)   66m
cert-manager-webhook-f697cc96d-tzvv8       1/1     Running   1 (55m ago)   68m
mongodb-1                                  1/1     Running   1 (55m ago)   65m
cert-manager-cainjector-69d885bf55-vzgqs   1/1     Running   2 (54m ago)   68m
nats-0                                     0/3     Pending   0             2m20s
nats-box-56c684cb47-tr9fb                  1/1     Running   0             2m20s

kubectl describe pod nats-0:
Name:           nats-0
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/instance=nats
                app.kubernetes.io/name=nats
                controller-revision-hash=nats-5f96889449
                statefulset.kubernetes.io/pod-name=nats-0
Annotations:    prometheus.io/path: /metrics
                prometheus.io/port: 7777
                prometheus.io/scrape: true
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/nats
Containers:
  nats:
    Image:       nats:2.3.1-alpine
    Ports:       4222/TCP, 7422/TCP, 7522/TCP, 6222/TCP, 8222/TCP, 7777/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      nats-server
      --config
      /etc/nats-config/nats.conf
    Liveness:   http-get http://:8222/ delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:8222/ delay=10s timeout=5s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:           nats-0 (v1:metadata.name)
      POD_NAMESPACE:      default (v1:metadata.namespace)
      CLUSTER_ADVERTISE:  $(POD_NAME).nats.$(POD_NAMESPACE).svc.cluster.local.
    Mounts:
      /data/ from nats-js-pvc (rw)
      /etc/nats-config from config-volume (rw)
      /var/run/nats from pid (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qpmvb (ro)
  reloader:
    Image:      natsio/nats-server-config-reloader:0.6.1
    Port:       <none>
    Host Port:  <none>
    Command:
      nats-server-config-reloader
      -pid
      /var/run/nats/nats.pid
      -config
      /etc/nats-config/nats.conf
    Environment:  <none>
    Mounts:
      /etc/nats-config from config-volume (rw)
      /var/run/nats from pid (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qpmvb (ro)
  metrics:
    Image:      natsio/prometheus-nats-exporter:0.8.0
    Port:       7777/TCP
    Host Port:  0/TCP
    Args:
      -connz
      -routez
      -subz
      -varz
      -prefix=nats
      -use_internal_server_id
      -jsz=all
      http://localhost:8222/
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qpmvb (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  nats-js-pvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  nats-js-pvc-nats-0
    ReadOnly:   false
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nats-config
    Optional:  false
  pid:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-qpmvb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  33s   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.

Am I missing something? The previous configuration started NATS correctly, but the workflows pods were in an error/crash state.

Thanks in advance,
Michal

Hi @Michal :wave:

It seems like you haven’t configured a DefaultStorageClass on your cluster.
In this case you can explicitly set the storage class by uncommenting the line from the helm values in my previous comment, replacing "default" with a class name available on your cluster.
To list the available storage classes on your cluster you can run the following command:

kubectl get storageclasses.storage.k8s.io

For instance on my local minikube cluster, I get:

$ kubectl get storageclasses.storage.k8s.io
NAME                 PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
standard (default)   k8s.io/minikube-hostpath   Delete          Immediate           false                  46h
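
For example, with the local-path provisioner that k3s ships, the file storage part of the NATS values would look something like this (the class name is only an example; use one reported by the command above):

nats:
  jetstream:
    enabled: true

    fileStorage:
      enabled: true
      size: "1Gi"
      storageDirectory: /data/
      storageClassName: "local-path" # replace with a storage class from your cluster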

Hi @alfrunes ,

thanks, there was a default class "local-path" in my cluster, but something was wrong with the PVC or PV. I deleted the whole cluster with all objects manually and recreated it, and NATS now starts correctly.
Everything is up except:

workflows-worker-778f8fb6d6-z9254          0/1     CrashLoopBackOff   5 (2m39s ago)   5m47s
deployments-66659cd55f-sbmzm               0/1     CrashLoopBackOff   5 (2m20s ago)   5m47s

The deployments service is down because minio does not have a valid certificate (the Let's Encrypt mechanism is not reachable yet, probably because of a missing port redirect).
workflows-worker’s logs:

time="2022-03-17T18:48:09Z" level=info msg="migrating workflows" file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: loading 8 workflows from /etc/workflows/definitions." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: loading deploy_device_configuration v1." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: error loading: deploy_device_configuration v1: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: loading provision_device v8." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: error loading: provision_device v8: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: loading provision_external_device v2." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: error loading: provision_external_device v2: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: loading reindex_inventory v2." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: error loading: reindex_inventory v2: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: loading reindex_reporting v3." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: error loading: reindex_reporting v3: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: loading update_device_inventory v2." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: error loading: update_device_inventory v2: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: loading update_device_status v6." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: error loading: update_device_status v6: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: loading decommission_device v8." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="LoadWorkflows: error loading: decommission_device v8: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-03-17T18:48:09Z" level=info msg="nats client closed the connection" file=client.go func=nats.NewClientWithDefaults.func1.1 line=86
2022/03/17 18:48:09 failed to subscribe to the nats JetStream: cannot create a queue subscription for a consumer without a deliver group

Looks like a NATS problem, but I'm quite lost as to where to look.
My Mender helm chart has this config:

nats:
    URL: "nats://nats:4222"

I have tried a newer image,

image: "nats:2.7.4-alpine"

for the NATS installation, as you proposed here, but it did not help.
workflows-server is running OK.

Thanks in advance!

Sorry for the late response - I was trying to reproduce your results using chart version 3.2.1 and master without any luck. What version of the helm chart are you using?
Could you also try running the following snippet to verify that the JetStream stream is in fact created?

kubectl exec `kubectl get pod -l app=nats-box -o name` -- nats stream ls

Example output:

╭────────────────────────────────────────────────────────────────────────────────╮
│                                    Streams                                     │
├───────────┬─────────────┬─────────────────────┬──────────┬──────┬──────────────┤
│ Name      │ Description │ Created             │ Messages │ Size │ Last Message │
├───────────┼─────────────┼─────────────────────┼──────────┼──────┼──────────────┤
│ WORKFLOWS │             │ 2022-03-15 16:03:49 │ 0        │ 0 B  │ never        │
╰───────────┴─────────────┴─────────────────────┴──────────┴──────┴──────────────╯
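
Since the error message mentions the consumer's deliver group, it may also be worth inspecting the consumers on the stream, for example:

kubectl exec `kubectl get pod -l app=nats-box -o name` -- nats consumer ls WORKFLOWS
kubectl exec `kubectl get pod -l app=nats-box -o name` -- nats consumer info WORKFLOWS workflows-worker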

As for the deployments service, you need to properly set up the certificate and a DNS record to make sure that the global.s3.AWS_URI helm value resolves to your minio instance.
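
For example, the relevant helm values would be along these lines (the hostname is just a placeholder for wherever your minio instance is exposed):

global:
  s3:
    AWS_URI: "https://minio.example.com"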

Thanks,

output is:

╭────────────────────────────────────────────────────────────────────────────────╮
│                                    Streams                                     │
├───────────┬─────────────┬─────────────────────┬──────────┬──────┬──────────────┤
│ Name      │ Description │ Created             │ Messages │ Size │ Last Message │
├───────────┼─────────────┼─────────────────────┼──────────┼──────┼──────────────┤
│ WORKFLOWS │             │ 2022-03-17 17:03:52 │ 0        │ 0 B  │ never        │
╰───────────┴─────────────┴─────────────────────┴──────────┴──────┴──────────────╯

helm version
version.BuildInfo{Version:"v3.8.1", GitCommit:"5cb9af4b1b271d11d7a97a71df3ac337dd94ad37", GitTreeState:"clean", GoVersion:"go1.17.5"}
helm list
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
cert-manager    default         1               2022-03-17 13:25:46.741016257 +0100 CET deployed        cert-manager-v1.4.0     v1.4.0
mender          default         1               2022-03-17 20:13:49.678944217 +0100 CET deployed        mender-3.2.1            3.2.1
minio-operator  default         1               2022-03-17 18:00:52.620454599 +0100 CET deployed        minio-operator-4.1.7    v4.1.3
mongodb         default         1               2022-03-17 17:58:18.709092987 +0100 CET deployed        mongodb-10.21.1         4.4.6
nats            default         1               2022-03-17 20:11:20.084756604 +0100 CET deployed        nats-0.14.2             2.7.4

NATS is version 2.7.4 from the last attempt, but 2.3.1 was giving the same results.
All charts were installed using the commands from Production installation with Kubernetes | Mender documentation.
I can upload the yml files generated in the process, if it helps.
I hope it is not caused by some leftover from my attempts before adding JetStream (I have recreated the whole cluster, though) or by the previous Docker Compose installation (it is shut down).

Michal

I tried (again) deploying everything from scratch in a k3s environment this time. Although I had the same issue with the default storage class not working as expected, I was not able to reproduce the issue with NATS. Everything simply works out of the box :man_shrugging:
May I ask you to try once more deploying to a fresh k3s cluster?
Before you do, could you also give me the full description of the workflows-worker pod (or at least the container image ID)?
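
Something like this should do it (assuming the default chart labels):

kubectl describe pod -l app.kubernetes.io/name=workflows-worker
kubectl get pod -l app.kubernetes.io/name=workflows-worker -o jsonpath='{.items[0].status.containerStatuses[0].imageID}'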

Hi @alfrunes,
sorry for the late response; my server was down, so there was no way to test it.
I have tried reinstalling everything again (for about the 4th time) from one big script, and now workflows-worker works right after the install :dizzy_face: I don't know why, because all the generated files look the same as before, except for the generated secrets.
The problem is that when I restart the whole machine, so that k3s boots automatically, I get

failed to subscribe to the nats JetStream: cannot create a queue subscription for a consumer without a deliver group

for workflows-worker.
What is even stranger,

kubectl exec `kubectl get pod -l app=nats-box -o name` -- nats stream ls

returns

Streams:
        WORKFLOWS

NATS pods are in Running state.

kubectl logs nats-0 nats
[8] 2022/04/04 19:04:11.607079 [INF] Starting nats-server
[8] 2022/04/04 19:04:11.607123 [INF]   Version:  2.3.1
[8] 2022/04/04 19:04:11.607125 [INF]   Git:      [907fef4]
[8] 2022/04/04 19:04:11.607127 [INF]   Name:     nats-0
[8] 2022/04/04 19:04:11.607132 [INF]   Node:     S1Nunr6R
[8] 2022/04/04 19:04:11.607134 [INF]   ID:       NDDIDMI5N3DZPPDNI6J6EPPKKL4QBSLT3LJUVVRXXCZ6NNTCJDYMPNZN
[8] 2022/04/04 19:04:11.607137 [INF] Using configuration file: /etc/nats-config/nats.conf
[8] 2022/04/04 19:04:11.612397 [INF] Starting JetStream
[8] 2022/04/04 19:04:11.613380 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[8] 2022/04/04 19:04:11.613386 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[8] 2022/04/04 19:04:11.613388 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[8] 2022/04/04 19:04:11.613389 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[8] 2022/04/04 19:04:11.613390 [INF]
[8] 2022/04/04 19:04:11.613391 [INF]          https://docs.nats.io/jetstream
[8] 2022/04/04 19:04:11.613393 [INF]
[8] 2022/04/04 19:04:11.613394 [INF] ---------------- JETSTREAM ----------------
[8] 2022/04/04 19:04:11.613400 [INF]   Max Memory:      1.00 GB
[8] 2022/04/04 19:04:11.613403 [INF]   Max Storage:     2.00 GB
[8] 2022/04/04 19:04:11.613404 [INF]   Store Directory: "/data/jetstream"
[8] 2022/04/04 19:04:11.613405 [INF] -------------------------------------------
[8] 2022/04/04 19:04:11.630812 [INF]   Restored 0 messages for stream "WORKFLOWS"
[8] 2022/04/04 19:04:11.632074 [INF]   Recovering 2 consumers for stream - "WORKFLOWS"
[8] 2022/04/04 19:04:11.643995 [INF] Starting JetStream cluster
[8] 2022/04/04 19:04:11.644005 [INF] Creating JetStream metadata controller
[8] 2022/04/04 19:04:11.647296 [INF] JetStream cluster recovering state
[8] 2022/04/04 19:04:11.650342 [INF] Starting http monitor on 0.0.0.0:8222
[8] 2022/04/04 19:04:11.650510 [INF] Listening for client connections on 0.0.0.0:4222
[8] 2022/04/04 19:04:11.651063 [INF] Server is ready
[8] 2022/04/04 19:04:11.651079 [INF] Cluster name is nats
[8] 2022/04/04 19:04:11.651332 [INF] Listening for route connections on 0.0.0.0:6222
[8] 2022/04/04 19:04:11.657697 [INF] JetStream cluster new stream leader for '$G > WORKFLOWS'
[8] 2022/04/04 19:04:11.658001 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > workflows-worker'
[8] 2022/04/04 19:04:11.658074 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > create-artifact-worker'
[8] 2022/04/04 19:04:21.653376 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-1.nats.default.svc.cluster.local.": lookup nats-1.nats.default.svc.cluster.local. on 10.43.0.10:53: read udp 10.42.0.50:48455->10.43.0.10:53: i/o timeout
[8] 2022/04/04 19:04:41.738252 [INF] 10.42.0.65:47636 - rid:13 - Route connection created
[8] 2022/04/04 19:04:46.661185 [INF] 10.42.0.65:6222 - rid:14 - Route connection created
[8] 2022/04/04 19:04:46.661540 [INF] 10.42.0.65:6222 - rid:14 - Router connection closed: Duplicate Route
[8] 2022/04/04 19:04:48.155619 [INF] JetStream cluster new metadata leader
[8] 2022/04/04 19:04:49.606657 [INF] JetStream cluster new consumer leader for '$G > WORKFLOWS > create-artifact-worker'

kubectl logs nats-1 nats
[8] 2022/04/04 19:04:18.698872 [INF] Starting nats-server
[8] 2022/04/04 19:04:18.698927 [INF]   Version:  2.3.1
[8] 2022/04/04 19:04:18.698930 [INF]   Git:      [907fef4]
[8] 2022/04/04 19:04:18.698932 [INF]   Name:     nats-1
[8] 2022/04/04 19:04:18.698936 [INF]   Node:     yrzKKRBu
[8] 2022/04/04 19:04:18.698938 [INF]   ID:       NB3WCYR7KJZ65XVYNIUYVLQL3Z2OHR5MST5E6GZMS3J3QD4AHCKTS4TH
[8] 2022/04/04 19:04:18.698941 [INF] Using configuration file: /etc/nats-config/nats.conf
[8] 2022/04/04 19:04:18.700034 [INF] Starting JetStream
[8] 2022/04/04 19:04:18.701403 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[8] 2022/04/04 19:04:18.701414 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[8] 2022/04/04 19:04:18.701415 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[8] 2022/04/04 19:04:18.701417 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[8] 2022/04/04 19:04:18.701418 [INF]
[8] 2022/04/04 19:04:18.701419 [INF]          https://docs.nats.io/jetstream
[8] 2022/04/04 19:04:18.701421 [INF]
[8] 2022/04/04 19:04:18.701422 [INF] ---------------- JETSTREAM ----------------
[8] 2022/04/04 19:04:18.701433 [INF]   Max Memory:      1.00 GB
[8] 2022/04/04 19:04:18.701435 [INF]   Max Storage:     2.00 GB
[8] 2022/04/04 19:04:18.701437 [INF]   Store Directory: "/data/jetstream"
[8] 2022/04/04 19:04:18.701438 [INF] -------------------------------------------
[8] 2022/04/04 19:04:18.702127 [INF] Starting JetStream cluster
[8] 2022/04/04 19:04:18.702133 [INF] Creating JetStream metadata controller
[8] 2022/04/04 19:04:18.708325 [INF] JetStream cluster recovering state
[8] 2022/04/04 19:04:18.715385 [INF] Starting http monitor on 0.0.0.0:8222
[8] 2022/04/04 19:04:18.715960 [INF] Listening for client connections on 0.0.0.0:4222
[8] 2022/04/04 19:04:18.716069 [INF] Server is ready
[8] 2022/04/04 19:04:18.716111 [INF] Cluster name is nats
[8] 2022/04/04 19:04:18.716128 [INF] Listening for route connections on 0.0.0.0:6222
[8] 2022/04/04 19:04:28.725672 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-0.nats.default.svc.cluster.local.": lookup nats-0.nats.default.svc.cluster.local. on 10.43.0.10:53: read udp 10.42.0.65:52598->10.43.0.10:53: i/o timeout
[8] 2022/04/04 19:04:41.738532 [INF] 10.42.0.50:6222 - rid:6 - Route connection created
[8] 2022/04/04 19:04:46.661250 [INF] 10.42.0.50:34610 - rid:7 - Route connection created
[8] 2022/04/04 19:04:46.661761 [INF] 10.42.0.50:34610 - rid:7 - Router connection closed: Duplicate Route

Here is workflows-worker’s describe:

Name:         workflows-worker-778f8fb6d6-drbbt
Namespace:    default
Priority:     0
Node:         vmyoctobuilder/10.0.7.58
Start Time:   Mon, 04 Apr 2022 20:53:20 +0200
Labels:       app.kubernetes.io/name=workflows-worker
              pod-template-hash=778f8fb6d6
Annotations:  <none>
Status:       Running
IP:           10.42.0.63
IPs:
  IP:           10.42.0.63
Controlled By:  ReplicaSet/workflows-worker-778f8fb6d6
Containers:
  workflows:
    Container ID:  containerd://16b1c17dcc5e36412841288cd41f16cf92c99bdb147a2218d1ea34854a0ce598
    Image:         docker.io/mendersoftware/workflows-worker:mender-3.2.1
    Image ID:      docker.io/mendersoftware/workflows-worker@sha256:8eebc2d4002b84eeb08ac5d6ea0434da0011e004e6fbb5ae6a9a5290920bce0a
    Port:          <none>
    Host Port:     <none>
    Args:
      worker
      --excluded-workflows
      generate_artifact
      --automigrate
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 04 Apr 2022 21:10:20 +0200
      Finished:     Mon, 04 Apr 2022 21:10:20 +0200
    Ready:          False
    Restart Count:  9
    Limits:
      cpu:     100m
      memory:  128M
    Requests:
      cpu:     10m
      memory:  64M
    Environment Variables from:
      mongodb-common  Secret with prefix 'WORKFLOWS_'  Optional: false
    Environment:
      WORKFLOWS_NATS_URI:  nats://nats:4222
      HAVE_DEVICECONNECT:  true
      HAVE_DEVICECONFIG:   true
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vx49d (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-vx49d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       17m                    default-scheduler  Successfully assigned default/workflows-worker-778f8fb6d6-drbbt to vmyoctobuilder
  Normal   Pulling         17m                    kubelet            Pulling image "docker.io/mendersoftware/workflows-worker:mender-3.2.1"
  Normal   Pulled          15m                    kubelet            Successfully pulled image "docker.io/mendersoftware/workflows-worker:mender-3.2.1" in 1m51.676472634s
  Warning  BackOff         14m (x3 over 15m)      kubelet            Back-off restarting failed container
  Normal   Pulled          14m (x3 over 15m)      kubelet            Container image "docker.io/mendersoftware/workflows-worker:mender-3.2.1" already present on machine
  Normal   Created         14m (x4 over 15m)      kubelet            Created container workflows
  Normal   Started         14m (x4 over 15m)      kubelet            Started container workflows
  Normal   SandboxChanged  6m33s                  kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          4m49s (x4 over 6m33s)  kubelet            Container image "docker.io/mendersoftware/workflows-worker:mender-3.2.1" already present on machine
  Normal   Created         4m49s (x4 over 6m33s)  kubelet            Created container workflows
  Normal   Started         4m48s (x4 over 6m33s)  kubelet            Started container workflows
  Warning  BackOff         83s (x24 over 6m22s)   kubelet            Back-off restarting failed container

Hi @alfrunes, sorry for the late reply. I'm sticking with my downgraded config for the short term but will attempt an install with the new instructions. Thanks for the help!

Thanks for the update. I will try to reproduce this scenario later and get back to you.

Hi everyone,
Was this issue ever resolved? I’m currently facing the same issue. Both the create-artifact-worker and workflows-worker pods are failing with CrashLoopBackOff. Logs for both have the same error message:
failed to subscribe to the nats JetStream: cannot create a queue subscription for a consumer without a deliver group

Per the documentation for the Kubernetes installation on the Mender site (NATS | Mender documentation), I am using chart version nats-0.8.2 and app version 2.3.1.

Ubuntu 22.04
k3s version v1.24.6+k3s1 (a8e0c66d)
go version go1.18.6
mender 3.4.0

Thanks,
stu

Hello @stumarr,

At the time I was not able to reproduce the issue locally. But I have encountered it recently.
It seems to happen when the majority of the NATS nodes go down while receiving traffic. The only workaround I have found for this is to remove the consumer configuration from the NATS server and recreate the workflows pods. If you have included nats-box in your deployment (it is included by default), you can run the following snippet:

kubectl exec `kubectl get pod -l app=nats-box -o name` -- \
    sh -c 'for consumer in workflows-worker create-artifact-worker; do nats consumer rm -f WORKFLOWS $consumer; done'
for deploy in workflows-worker workflows-server create-artifact-worker; do
    kubectl rollout restart deployment/$deploy
done

I’m hoping that the new NATS consumer API included in one of the recent driver versions will resolve the issue, but it will take some time before we can migrate to this API in our software.

Hello @alfrunes,

Thank you for the quick response! It looks like that worked. All pods are stable.
I appreciate the help!

Thanks,
stu

Hi, I used the Mender docs v3.7 to install the Mender server on K8S.

There is a problem like this at the "Installing the Mender Helm chart" step:

helm upgrade --install mender mender/mender -f mender-3.7.3.yml
[screenshot of the error output]

How can I fix it? @alfrunes

Thank you.
WangJi

Hi @Jiwangreal, can you please share the logs from one of the failing pods?
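
For example (adjust the namespace if you did not install into default):

# list the pods that are not healthy
kubectl -n default get pods | grep -vE 'Running|Completed'

# then fetch the logs of one of them
kubectl -n default logs <pod-name> --previous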