How to debug a Kubernetes installation of Mender?

Hello Everyone

We are trying to deploy Mender on Kubernetes, following the guidelines for 3.2: Production installation with Kubernetes | Mender documentation

Everything works except for 3 Mender containers:

  • workflows-server
  • workflows-worker
  • create-artifact-worker

These containers are stuck in CrashLoopBackOff.
❯ kubectl get pods --namespace application-peripherals
NAME                                      READY   STATUS             RESTARTS          AGE
api-gateway-756685fdc9-vj7rd              1/1     Running            0                 22h
cert-manager-54b9fc686-hbc4x              1/1     Running            0                 22h
cert-manager-cainjector-89487b959-8x9n6   1/1     Running            0                 22h
cert-manager-webhook-85f96c57dd-2nhvm     1/1     Running            0                 22h
create-artifact-worker-6676fd594-z7rqb    0/1     CrashLoopBackOff   247 (70s ago)     21h
deployments-88d4d87-66c2x                 0/1     Running            23 (20h ago)      22h
device-auth-77ffc8688c-nf7lt              0/1     Running            0                 22h
deviceconfig-7ccbfb857d-fk5pc             1/1     Running            0                 22h
deviceconnect-5468dd6c54-qw4mh            1/1     Running            0                 21h
gui-7b6988cb96-xwp9n                      1/1     Running            0                 22h
inventory-7454868b78-dmlm4                1/1     Running            0                 22h
iot-manager-5465779b4-w5zkg               1/1     Running            0                 22h
minio-operator-6c984995c9-lldss           1/1     Running            0                 22h
minio-operator-console-9d9cbbcc8-flbmf    1/1     Running            0                 22h
minio-ss-0-0                              1/1     Running            0                 22h
minio-ss-0-1                              1/1     Running            0                 22h
mongodb-0                                 1/1     Running            0                 22h
mongodb-arbiter-0                         1/1     Running            0                 22h
nats-0                                    3/3     Running            0                 22h
nats-box-67786894bd-hszrk                 1/1     Running            0                 22h
useradm-65db46c846-xjz59                  1/1     Running            0                 22h
workflows-server-db8fd468d-mb8w7          0/1     CrashLoopBackOff   254 (2m14s ago)   21h
workflows-worker-8657585498-7tcr2         0/1     CrashLoopBackOff   247 (71s ago)     21h

When we check the logs, we get the following:

create-artifact-worker-6676fd594-z7rqb

❯ kubectl logs create-artifact-worker-6676fd594-z7rqb --namespace application-peripherals
time="2022-02-02T09:39:11Z" level=info msg="migrating workflows" file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-02-02T09:39:11Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-02-02T09:39:11Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
2022/02/02 09:39:16 context deadline exceeded

workflows-server-db8fd468d-mb8w7

❯ kubectl logs workflows-server-db8fd468d-mb8w7 --namespace application-peripherals
time="2022-02-02T09:38:22Z" level=info msg="migrating workflows" file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-02-02T09:38:22Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-02-02T09:38:22Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
2022/02/02 09:38:27 context deadline exceeded

workflows-worker-8657585498-7tcr2

❯ kubectl logs workflows-worker-8657585498-7tcr2 --namespace application-peripherals
time="2022-02-02T09:39:16Z" level=info msg="migrating workflows" file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-02-02T09:39:16Z" level=info msg="migration to version 1.0.0 skipped" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
time="2022-02-02T09:39:16Z" level=info msg="DB migrated to version 1.0.0" db=workflows file=entry.go func="logrus.(*Entry).Infof" line=351
2022/02/02 09:39:21 context deadline exceeded

The ‘context deadline exceeded’ message is probably a generic Go timeout error, which makes the logs ambiguous: they do not say which connection timed out.
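Since the application log ends in a bare timeout, the pod events and the output of the previous container instance are a reasonable next place to look. The following is a minimal sketch using only standard kubectl subcommands; the helper name triage_cmds is hypothetical, and the namespace is the one from the listing above.

```shell
#!/bin/sh
# Namespace taken from the kubectl output in this post; override with NS=...
NS="${NS:-application-peripherals}"

# Print (not run) the commands that collect more detail for one pod.
# Printing first lets you review them; pipe the output to `sh` to execute.
triage_cmds() {
  pod="$1"
  # Pod events, restart reasons, probe failures, exit codes
  echo "kubectl describe pod $pod --namespace $NS"
  # Logs of the previous (crashed) container instance
  echo "kubectl logs $pod --namespace $NS --previous"
  # Recent events in the namespace, oldest first
  echo "kubectl get events --namespace $NS --sort-by=.lastTimestamp"
}
```

For example, `triage_cmds workflows-server-db8fd468d-mb8w7 | sh` would run all three against the workflows-server pod. None of this pinpoints which backend timed out, but the describe output at least shows the exit code and any failing probes.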


Deviations we have from the installation documentation (Production installation with Kubernetes | Mender documentation)

  • We don’t use AWS Kubernetes, we have a bare-metal Kubernetes
    • we use ingress-nginx as a reverse proxy and TLS termination
    • we use kube-flannel for networking
  • We use MinIO for S3 (deployed as explained in the mender documentation)

Questions:

  • What can we do to debug the ‘context deadline exceeded’ errors?
  • The documentation doesn’t mention it: do I have to create the MinIO bucket, or does Mender take care of this?
  • Is the nats://nats:4222 connection string mentioned in the documentation correct?
    • Don’t we have to use the internal DNS of the nats service?
    • e.g. nats://pod-0.nats.application-peripherals.svc.cluster.local:4222

Additional deviation:
The Mender chart version proposed in the documentation (--version 3.2.1) is not available; only 3.2.0 is available in the chart repo.

Looks to me like this was done yesterday, here:

Responding even though this is an older post, since it comes up in a search about Kubernetes deployment.

The ‘context deadline exceeded’ errors are almost certainly because the api-gateway cannot communicate with the S3 API server (e.g. MinIO). I have not yet found a simple way to debug it, but every time I have seen it, that was the root cause.

The bucket will be created for you, as long as the MinIO S3 API server communication (and authentication) are working.
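One rough way to check the communication part is to call the MinIO liveness endpoint from a throwaway pod inside the cluster. This is only a sketch: the Service name, port and plain-HTTP scheme below are assumptions (read the real values from `kubectl get svc`), and the helper names are hypothetical. It tests reachability only, not authentication.

```shell
#!/bin/sh
# ASSUMPTIONS: Service name, port and scheme are placeholders for your setup.
NS="${NS:-application-peripherals}"
MINIO_SVC="${MINIO_SVC:-minio}"
MINIO_PORT="${MINIO_PORT:-80}"

# Cluster-internal URL of MinIO's unauthenticated liveness endpoint.
minio_health_url() {
  printf 'http://%s.%s.svc.cluster.local:%s/minio/health/live\n' \
    "$MINIO_SVC" "$NS" "$MINIO_PORT"
}

# Start a temporary curl pod in the same namespace, request the endpoint,
# and print only the HTTP status code. The pod is removed afterwards.
check_minio() {
  kubectl run s3-check --namespace "$NS" --rm -i --restart=Never \
    --image=curlimages/curl --command -- \
    curl -sS -m 5 -o /dev/null -w '%{http_code}\n' "$(minio_health_url)"
}
```

Calling `check_minio` and getting 200 back suggests the S3 endpoint is reachable from inside the namespace; a timeout or DNS error points at networking rather than at Mender itself. Credentials would still need to be verified separately.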

If you are using namespaces other than default (e.g. I put nats in its own namespace), then you need to use the cluster-internal DNS name for the service:

nats://nats.<namespace>.svc.cluster.local:4222

The full name of the service for the default namespace would then be:
nats://nats.default.svc.cluster.local:4222
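
The naming pattern above can be sketched as a small helper, together with a DNS lookup from inside the cluster to confirm that the name resolves. The function names are hypothetical, and busybox:1.28 is simply the image commonly used for DNS debugging; port 4222 is the one from the connection string above.

```shell
#!/bin/sh
# Build the cluster-internal NATS URL for a Service in a given namespace.
# Namespace defaults to "default", port to 4222.
nats_url() {
  svc="$1"; ns="${2:-default}"; port="${3:-4222}"
  printf 'nats://%s.%s.svc.cluster.local:%s\n' "$svc" "$ns" "$port"
}

# Resolve the Service name from a temporary busybox pod.
# The pod runs in the namespace of your current kubectl context.
check_nats_dns() {
  svc="$1"; ns="${2:-default}"
  kubectl run dns-check --rm -i --restart=Never --image=busybox:1.28 -- \
    nslookup "$svc.$ns.svc.cluster.local"
}
```

For the setup in the original question, `nats_url nats application-peripherals` gives the URL to put in the chart values, and `check_nats_dns nats application-peripherals` shows whether that name resolves at all.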