Mender keeps rolling back after update

I believe I have integrated Mender with our Yocto-based Toradex Verdin iMX8M Plus SoM (there was a weird scenario with this SoM, see Trying to integration mender with Toradex verdin-imx8mp - #2 by whsplk).

When I try updating via Hosted Mender, the Mender client rolls back the update after booting into the new partition. Checking the logs, it mentions "transient error: The deployment was aborted from the server" (timestamp 2022-03-02T01:03:01) before rolling back.

At first I thought our custom Yocto recipes might be causing this, but the rollback still persists after stripping out our application layer.

Not sure how else to debug this. If anyone from Mender, or anyone with Mender experience, could comment, that would be great!

Thanks

Full logs:

2022-03-02 01:13:25 +0000 UTC info: Running Mender client version: 3.2.1
2022-03-02 01:13:25 +0000 UTC info: State transition: update-fetch [Download_Enter] -> update-store [Download_Enter]
2022-03-02 01:13:25 +0000 UTC info: No public key was provided for authenticating the artifact
2022-03-02 01:13:25 +0000 UTC info: Opening device "/dev/mmcblk2p3" for writing
2022-03-02 01:13:25 +0000 UTC info: Native sector size of block device /dev/mmcblk2p3 is 512 bytes. Mender will write in chunks of 1048576 bytes
2022-03-02 01:14:19 +0000 UTC info: All bytes were successfully written to the new partition
2022-03-02 01:14:19 +0000 UTC info: The optimized block-device writer wrote a total of 1953 frames, where 15 frames did need to be rewritten (i.e., skipped)
2022-03-02 01:14:20 +0000 UTC info: Wrote 2046820352/2046820352 bytes to the inactive partition
2022-03-02 01:14:20 +0000 UTC info: State transition: update-store [Download_Enter] -> update-after-store [Download_Leave]
2022-03-02 01:14:20 +0000 UTC info: State transition: update-after-store [Download_Leave] -> mender-update-control-refresh-maps [none]
2022-03-02 01:14:20 +0000 UTC info: Validating the Update Info: https://s3.amazonaws.com/hosted-mender-artifacts/621bf49f19dc3d571d2dd3cb/bafab3e5-3ea1-4a1a-a1f4-9aed81c3adc1?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAQWI25QR6NDALMYE2%2F20220302%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220302T011420Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&response-content-type=application%2Fvnd.mender-artifact&X-Amz-Signature=a3c9880ae0b5206d1b1a72f8004a0a6f72400820b2012e4a0abfd0a2ce3443f8 [name: minimal-3; devices: [verdin-imx8mp]]
2022-03-02 01:14:20 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control [none]
2022-03-02 01:14:20 +0000 UTC info: State transition: mender-update-control [none] -> update-install [ArtifactInstall]
2022-03-02 01:14:20 +0000 UTC info: Enabling partition with new image installed to be a boot candidate: 3
2022-03-02 01:14:20 +0000 UTC info: State transition: update-install [ArtifactInstall] -> mender-update-control-refresh-maps [none]
2022-03-02 01:14:20 +0000 UTC error: Error receiving scheduled update data: failed to check update info on the server. Response: &{429 Too Many Requests 429 HTTP/1.1 1 1 map[Connection:[keep-alive] Content-Length:[0] Content-Type:[application/json; charset=utf-8] Date:[Wed, 02 Mar 2022 01:14:20 GMT] X-Men-Requestid:[4007ee82-418a-48e0-8b17-4de9e3d5453e]] 0x40002504c0 0 [] false false map[] 0x40001ded00 <nil>}
2022-03-02 01:14:20 +0000 UTC error: Update control map check failed: transient error: failed to check update info on the server. Response: &{429 Too Many Requests 429 HTTP/1.1 1 1 map[Connection:[keep-alive] Content-Length:[0] Content-Type:[application/json; charset=utf-8] Date:[Wed, 02 Mar 2022 01:14:20 GMT] X-Men-Requestid:[4007ee82-418a-48e0-8b17-4de9e3d5453e]] 0x40002504c0 0 [] false false map[] 0x40001ded00 <nil>}, retrying...
2022-03-02 01:14:20 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control-retry-refresh-maps [none]
2022-03-02 01:14:20 +0000 UTC info: Wait 1m0s before next update control map fetch/update attempt
2022-03-02 01:15:20 +0000 UTC info: State transition: mender-update-control-retry-refresh-maps [none] -> mender-update-control-refresh-maps [none]
2022-03-02 01:15:20 +0000 UTC info: Validating the Update Info: https://s3.amazonaws.com/hosted-mender-artifacts/621bf49f19dc3d571d2dd3cb/bafab3e5-3ea1-4a1a-a1f4-9aed81c3adc1?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAQWI25QR6NDALMYE2%2F20220302%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220302T011520Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&response-content-type=application%2Fvnd.mender-artifact&X-Amz-Signature=6eaed3b5ebc56fcde56ac1f22012bcf56a867854fceb9c1c597391494cafe44b [name: minimal-3; devices: [verdin-imx8mp]]
2022-03-02 01:15:21 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control [none]
2022-03-02 01:15:21 +0000 UTC info: State transition: mender-update-control [none] -> reboot [ArtifactReboot_Enter]
2022-03-02 01:15:21 +0000 UTC info: Rebooting device(s)
2022-03-02 01:15:21 +0000 UTC info: Mender rebooting from active partition: /dev/mmcblk2p2
2022-03-02 01:15:21 +0000 UTC info: Daemon terminated with SIGTERM
2022-03-02 01:17:35 +0000 UTC info: Running Mender client version: 3.2.1
2022-03-02 01:17:35 +0000 UTC info: State transition: init [none] -> after-reboot [ArtifactReboot_Leave]
2022-03-02 01:17:35 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> after-reboot [ArtifactReboot_Leave]
2022-03-02 01:17:35 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> mender-update-control-refresh-maps [none]
2022-03-02 01:17:35 +0000 UTC info: Device unauthorized; attempting reauthorization
2022-03-02 01:17:35 +0000 UTC info: Output (stderr) from command "/usr/share/mender/identity/mender-device-identity": using interface /sys/class/net/eth0
2022-03-02 01:17:36 +0000 UTC info: successfully received new authorization data from server https://hosted.mender.io
2022-03-02 01:17:36 +0000 UTC info: Local proxy started
2022-03-02 01:17:36 +0000 UTC info: Reauthorization successful
2022-03-02 01:17:36 +0000 UTC error: transient error: The deployment was aborted from the server
2022-03-02 01:17:36 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> rollback [ArtifactRollback]
2022-03-02 01:17:36 +0000 UTC info: Performing rollback
2022-03-02 01:17:36 +0000 UTC info: Rolling back to the inactive partition (2).
2022-03-02 01:17:36 +0000 UTC info: State transition: rollback [ArtifactRollback] -> rollback-reboot [ArtifactRollbackReboot_Enter]
2022-03-02 01:17:36 +0000 UTC info: Rebooting device(s) after rollback
2022-03-02 01:17:36 +0000 UTC info: Mender rebooting from inactive partition: /dev/mmcblk2p3
2022-03-02 01:17:36 +0000 UTC info: Daemon terminated with SIGTERM
2022-03-02 01:19:50 +0000 UTC info: Running Mender client version: 3.2.1
2022-03-02 01:19:50 +0000 UTC info: Mender shut down in state: rollback-reboot
2022-03-02 01:19:50 +0000 UTC info: State transition: init [none] -> verify-rollback-reboot [ArtifactRollbackReboot_Leave]
2022-03-02 01:19:50 +0000 UTC info: State transition: verify-rollback-reboot [ArtifactRollbackReboot_Leave] -> after-rollback-reboot [ArtifactRollbackReboot_Leave]
2022-03-02 01:19:50 +0000 UTC info: State transition: after-rollback-reboot [ArtifactRollbackReboot_Leave] -> update-error [ArtifactFailure]
2022-03-02 01:19:50 +0000 UTC info: State transition: update-error [ArtifactFailure] -> cleanup [Error]
2022-03-02 01:19:50 +0000 UTC info: State transition: cleanup [Error] -> update-status-report [none]
2022-03-02 01:19:50 +0000 UTC info: Device unauthorized; attempting reauthorization
2022-03-02 01:19:50 +0000 UTC info: Output (stderr) from command "/usr/share/mender/identity/mender-device-identity": using interface /sys/class/net/eth0
2022-03-02 01:19:51 +0000 UTC info: successfully received new authorization data from server https://hosted.mender.io
2022-03-02 01:19:51 +0000 UTC info: Local proxy started
2022-03-02 01:19:51 +0000 UTC info: Reauthorization successful

I probably should have searched before posting, but according to Update fails with "The deployment was aborted from the server", downgrading the client to version 3.1.0 fixes the issue.

But is this issue in version 3.2.x something that has been confirmed and is being looked at by Mender?

Hello @whsplk,

Thanks for the detailed report! We are already looking into it; could you please share the device ID too?

Thanks,
Josef

Hi Josef,
I believe it should be 6e411bff-7eaf-4712-ac9a-be39d770d6fa (grabbed it from the ID field under Device Identity)


@whsplk can I kindly ask you to provide some more information about where the Mender client's local database is stored on your devices? Is it in the /var/lib/mender folder? If so, is that folder persistent across reboots?
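In case it helps, a quick way to check both of these directly on the device is something like the following (a rough sketch, assuming the usual meta-mender layout with a separate data partition; adjust paths if your image differs):

# Is /var/lib/mender a plain directory or a symlink, and where does it point?
ls -ld /var/lib/mender
readlink -f /var/lib/mender

# Which filesystem actually backs it? On meta-mender images this is
# normally the persistent /data partition.
mount | grep -E '/data|mmcblk'

# The client state database itself; it should survive reboots.
ls -l /var/lib/mender/mender-store*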


@whsplk another question: would it be possible for you to enable the Mender client's debug logs and try the update again?
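For reference, one way to do that on a systemd-based image is to run the daemon with --log-level debug via a drop-in unit. This is only a rough sketch: the service name (mender-client) and binary path below match the usual meta-mender setup and may differ on your image.

# Raise the Mender client log level to debug via a systemd drop-in
mkdir -p /etc/systemd/system/mender-client.service.d
cat > /etc/systemd/system/mender-client.service.d/10-debug.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/mender --log-level debug daemon
EOF
systemctl daemon-reload
systemctl restart mender-client

Note that a drop-in under /etc lives on the rootfs, so it will not carry over to the newly written partition after a rootfs update; the post-reboot part of the log stays at info level unless the drop-in is also baked into the new image.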

Hi @tranchitella,

The local database is stored in /var/lib/mender, which I believe is symlinked to /data/mender.

Debug logs (I had to rename the extension to .yml to be able to upload them):
debug.yml (32.5 KB)

Thank you @whsplk, let me check the logs with the Mender Client team and I’ll get back to you.

Hello @whsplk :wave:

If you have access to the device through Mender Connect, SSH, or something similar, any chance you could just run mount and paste the output here?

Hello @whsplk,

We just rolled out a hot-fix on Hosted Mender to prevent failures like the one you reported.
Can I kindly ask you to check if your deployments work as expected?
For your information, the 429 Too Many Requests errors are unrelated to the failure, so you may still see them in the logs.

Thank you in advance.

Hi @oleorhagen , @tranchitella

Here’s the output of my mount:

/dev/mmcblk2p2 on / type ext4 (rw,noatime)
devtmpfs on /dev type devtmpfs (rw,relatime,size=3212452k,nr_inodes=803113,mode=755)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=666)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev)
tmpfs on /var/volatile type tmpfs (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
/dev/mmcblk2p1 on /uboot type vfat (rw,relatime,sync,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
/dev/mmcblk2p4 on /data type ext4 (rw,relatime)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=783732k,mode=700)

I tried deploying again, but it still rolled back. Debug logs:
debug.yml (35.7 KB)

@whsplk thank you!

One thing which stands out to me in your logs is:

2022-03-05 18:50:16 +0000 UTC debug: Returning artifact name from /etc/mender/artifact_info file.

This means that the client is not reading from the database on the data partition.

But I see /data in your mount output.

Could you show me the output of:

ls -l /var/lib/mender

and post it here?

root@verdin-imx8mp-06900814:~# ls -l /var/lib/mender
lrwxrwxrwx 1 root root 12 Mar  9  2018 /var/lib/mender -> /data/mender

@whsplk the more I stare at your logs, the less I feel like I understand here.

2022-03-08 03:06:17 +0000 UTC info: Running Mender client version: 3.2.1
2022-03-08 03:06:17 +0000 UTC info: State transition: init [none] -> after-reboot [ArtifactReboot_Leave]
2022-03-08 03:06:17 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> after-reboot [ArtifactReboot_Leave]
2022-03-08 03:06:17 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> mender-update-control-refresh-maps [none]
2022-03-08 03:06:17 +0000 UTC info: Device unauthorized; attempting reauthorization
2022-03-08 03:06:17 +0000 UTC info: Output (stderr) from command "/usr/share/mender/identity/mender-device-identity": using interface /sys/class/net/eth0
2022-03-08 03:06:17 +0000 UTC error: Failure occurred while executing authorization request: Method: Post, URL: https://hosted.mender.io/api/devices/v1/authentication/auth_requests
2022-03-08 03:06:17 +0000 UTC error: Failed to authorize with "https://hosted.mender.io": Unknown url.Error type: dial tcp: lookup hosted.mender.io on 1.0.0.1:53: dial udp 1.0.0.1:53: connect: network is unreachable
2022-03-08 03:06:17 +0000 UTC warning: Reauthorization failed with error: transient error: authorization request failed
2022-03-08 03:06:17 +0000 UTC error: Error receiving scheduled update data: update check request failed: transient error: authorization request failed
2022-03-08 03:06:17 +0000 UTC error: Update control map check failed: transient error: update check request failed: transient error: authorization request failed, retrying...
2022-03-08 03:06:17 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control-retry-refresh-maps [none]
2022-03-08 03:06:17 +0000 UTC info: Wait 1m0s before next update control map fetch/update attempt

This is from the first example.

Client goes to sleep, then the device reboots…

Second example:

2018-03-09 12:35:02 +0000 UTC info: Running Mender client version: 3.2.1
2018-03-09 12:35:02 +0000 UTC info: State transition: init [none] -> after-reboot [ArtifactReboot_Leave]
2018-03-09 12:35:02 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> after-reboot [ArtifactReboot_Leave]
2018-03-09 12:35:02 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> mender-update-control-refresh-maps [none]
2018-03-09 12:35:02 +0000 UTC info: Device unauthorized; attempting reauthorization
2018-03-09 12:35:02 +0000 UTC info: Output (stderr) from command "/usr/share/mender/identity/mender-device-identity": using interface /sys/class/net/eth0
2018-03-09 12:35:02 +0000 UTC error: Failure occurred while executing authorization request: Method: Post, URL: https://hosted.mender.io/api/devices/v1/authentication/auth_requests
2018-03-09 12:35:02 +0000 UTC error: Failed to authorize with "https://hosted.mender.io": Unknown url.Error type: dial tcp: lookup hosted.mender.io on 1.0.0.1:53: dial udp 1.0.0.1:53: connect: network is unreachable
2018-03-09 12:35:02 +0000 UTC warning: Reauthorization failed with error: transient error: authorization request failed
2018-03-09 12:35:02 +0000 UTC error: Error receiving scheduled update data: update check request failed: transient error: authorization request failed
2018-03-09 12:35:02 +0000 UTC error: Update control map check failed: transient error: update check request failed: transient error: authorization request failed, retrying...
2018-03-09 12:35:02 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control-retry-refresh-maps [none]
2018-03-09 12:35:02 +0000 UTC info: Wait 1m0s before next update control map fetch/update attempt

Same thing basically.

In both cases the device has no network connectivity, and it looks to me like the device dies pretty much straight away (not even a second passes between Mender starting up and getting killed).

What do the systemd boot logs show?
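Something along these lines would be useful (assuming persistent journald storage; mender-client is the meta-mender default service name and may differ on your image):

# Full journal for the current and the previous boot
journalctl -b 0 --no-pager
journalctl -b -1 --no-pager

# Just the Mender client from the previous boot (the one that was killed)
journalctl -b -1 -u mender-client --no-pager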

I'm having the same problem. Downgrading to version 3.1.0 solves the issue.

This forces me to use mender-connect 1.2, which has a bug where it loses the connection to the server quite often.

Is there still no solution?

Hi @Aalex. The 3.2.x Mender client versions have some issues we did not anticipate, and hence the 3.2.x versions will be deprecated. There is a new 3.3.0 release coming in a few weeks. In the meantime, 3.0.x is your safest bet :slight_smile:
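If you are building with Yocto/meta-mender and want to stay off 3.2.x until 3.3.0 is out, pinning the client in conf/local.conf is one option (a sketch; the mender-client recipe name matches current meta-mender branches, and which 3.0.x/3.1.x version is actually available depends on the branch you are on):

# conf/local.conf -- pin the Mender client recipe series
# (3.1.% is another option; the earlier poster had success with 3.1.0)
PREFERRED_VERSION_mender-client = "3.0.%"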

I see. I updated to 3.2.1, but now I get problems when deploying updates. Which combination (client and connect) is stable at the moment?

Hi @Aalex, there is a simple workaround which will work for you on 3.2.x.

That is, simply run an update of any sort that is not a rootfs update. So make a dummy update, e.g. write a 0-byte file to /tmp, or whatever no-op update you can think of.

Then the client will work just fine afterwards (since the DB is now populated correctly), and you get to keep your version of mender-connect as well :slight_smile:
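For anyone looking for a concrete starting point, here is a rough sketch of such a dummy deployment. It assumes an update-module-based artifact, in this case the community "script" update module, which must already be installed on the device (under /usr/share/mender/modules/v3/); the artifact name is a placeholder, the device type matches this thread, and the mender-artifact flags are as in recent 3.x versions (check mender-artifact write module-image --help on yours):

# A payload script that does nothing
cat > noop.sh <<'EOF'
#!/bin/sh
exit 0
EOF

# Wrap it in a non-rootfs (module-image) artifact
mender-artifact write module-image \
    -T script \
    -t verdin-imx8mp \
    -n dummy-noop-1 \
    -o dummy-noop-1.mender \
    -f noop.sh

Upload dummy-noop-1.mender to Hosted Mender and deploy it to the affected devices; once that non-rootfs deployment has completed, subsequent rootfs deployments should behave normally on 3.2.x, as described above.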
