Strange problem when upgrading a device via hosted.mender.io

I have sent this to support@m…io, but I thought I would try the community as well…

We have recently seen some issues when trying to update one of our devices on the community version of the Mender server (which we use internally for testing). To understand whether the problem was specific to our internal instance or a general issue, we tried to replicate it on hosted.mender.io (via our trial account). Unfortunately, we could not replicate the original issue (the one we saw on the community version), but instead seem to have discovered a different issue when trying to update the unit.

Below is a snippet from the error log of the failed update:

2022-02-17 18:36:58 +0000 UTC info: Device unauthorized; attempting reauthorization
2022-02-17 18:36:58 +0000 UTC info: Output (stderr) from command "/usr/share/mender/identity/mender-device-identity": using interface /sys/class/net/enp5s0f1
2022-02-17 18:36:59 +0000 UTC info: successfully received new authorization data from server https://hosted.mender.io
2022-02-17 18:36:59 +0000 UTC info: Local proxy started
2022-02-17 18:36:59 +0000 UTC info: Reauthorization successful
2022-02-17 18:36:59 +0000 UTC info: Validating the Update Info: https://s3.amazonaws.com/hosted-mender-artifacts/... [name: rel_0.11.2; devices: [bwt-dn201-sc]]
2022-02-17 18:36:59 +0000 UTC info: Attempting to upgrade to currently installed artifact name, not performing upgrade.
2022-02-17 18:36:59 +0000 UTC error: Update control map check failed: transient error: file already exists, retrying...
2022-02-17 18:36:59 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control-retry-refresh-maps [none]
2022-02-17 18:36:59 +0000 UTC info: Wait 1m0s before next update control map fetch/update attempt
2022-03-22 14:42:13 +0000 UTC info: State transition: mender-update-control-retry-refresh-maps [none] -> mender-update-control-refresh-maps [none]
2022-03-22 14:42:13 +0000 UTC info: Validating the Update Info: https://s3.amazonaws.com/hosted-mender-artifacts/... [name: rel_0.11.2; devices: [bwt-dn201-sc]]

with the “Validating the Update Info” sequence repeating until:

2022-03-22 20:50:23 +0000 UTC info: Validating the Update Info: https://s3.amazonaws.com/hosted-mender-artifacts/6... [name: rel_0.11.2; devices: [bwt-dn201-sc]]
2022-03-22 20:50:23 +0000 UTC info: Attempting to upgrade to currently installed artifact name, not performing upgrade.
2022-03-22 20:50:23 +0000 UTC error: Update control map check failed: transient error: file already exists, retrying...
2022-03-22 20:50:23 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control-retry-refresh-maps [none]
2022-03-22 20:50:23 +0000 UTC error: transient error: Tried maximum amount of times
2022-03-22 20:50:23 +0000 UTC info: State transition: mender-update-control-retry-refresh-maps [none] -> rollback [ArtifactRollback]
2022-03-22 20:50:23 +0000 UTC info: Performing rollback
2022-03-22 20:50:23 +0000 UTC info: Rolling back to the inactive partition (3).
2022-03-22 20:50:23 +0000 UTC info: State transition: rollback [ArtifactRollback] -> rollback-reboot [ArtifactRollbackReboot_Enter]
2022-03-22 20:50:23 +0000 UTC info: Executing script: ArtifactRollback_Leave_00_rootfs

The failure occurred post-reboot, after starting with client version 3.1.0 and ending with client version 3.2.1. Before commencing the test, the unit under test had its association with the server “wiped” by the following procedure:

  1. Stopped the mender agent on the unit
  2. Decommissioned and deleted the device from the server
  3. Removed all persistent mender state from the unit, specifically all the contents of /var/lib/mender (a symbolic link to /data/mender) except for /var/lib/mender/device_type
  4. Restarted the mender agent on the unit
  5. Accepted the unit in hosted.mender.io
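
Step 3 of the procedure above could be reproduced with something like the following. This is only a minimal Python sketch: the helper name wipe_mender_state is hypothetical, and the systemd unit name mender-client mentioned in the comments is an assumption that may differ on your image.

```python
import shutil
from pathlib import Path


def wipe_mender_state(state_dir, keep=("device_type",)):
    """Delete everything directly under state_dir except the names in keep.

    Returns the sorted list of entry names that were removed.
    """
    # Resolve first, because /var/lib/mender is a symlink to /data/mender here.
    root = Path(state_dir).resolve()
    removed = []
    for entry in sorted(root.iterdir()):
        if entry.name in keep:
            continue  # preserve device_type (or whatever else is listed)
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real subdirectory: remove recursively
        else:
            entry.unlink()  # regular file or symlink
        removed.append(entry.name)
    return removed


# Intended use on the device, as root, with the agent stopped first
# (unit name is an assumption):
#   systemctl stop mender-client
#   wipe_mender_state("/var/lib/mender")
#   systemctl start mender-client
```

Note that this removes mender-store and mender-store-lock along with everything else in the directory, which is exactly the state we suspect matters here.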

As far as I am able to diagnose, it's the wiping of the mender-store and mender-store-lock files that seems to be the root cause of the problem we are experiencing. We have other units connected to hosted.mender.io and they are able to update without problems; the only difference is that the mender state had not been “reset” on these units.

Any advice on what is going on would be greatly appreciated. I do note that the problem we see looks similar to the thread “Rollback when attempting to install an image with mender-client 3.2.1 into an image with 2.6.1”.