Unable to migrate from Kirkstone to Scarthgap

This was tested on a UP Xtreme i11 Edge A10-0001

We are currently running a Kirkstone build (Mender-client 3.5.2), but we want to migrate to a Scarthgap build (Mender 4.0.4). The Kirkstone builds have been working flawlessly for a over a year without hiccups and across hundreds of updates. We followed the steps outlined here and built the Scarthgap image. We stripped out most of our code even though it’s more application rather than system level to make sure that wasn’t interferring.

When we attempt to flash this image to a board running a Kirkstone build we get the log:

2024-11-25 20:31:48 +0000 UTC info: Running Mender client version: 3.5.2
2024-11-25 20:31:48 +0000 UTC info: State transition: update-fetch [Download_Enter] -> update-store [Download_Enter]
2024-11-25 20:31:49 +0000 UTC info: No public key was provided for authenticating the artifact
2024-11-25 20:31:49 +0000 UTC info: Opening device "/dev/sda2" for writing
2024-11-25 20:31:49 +0000 UTC info: Native sector size of block device /dev/sda2 is 512 bytes. Mender will write in chunks of 1048576 bytes
2024-11-25 20:32:49 +0000 UTC info: All bytes were successfully written to the new partition
2024-11-25 20:32:49 +0000 UTC info: The optimized block-device writer wrote a total of 10161 frames, where 516 frames did need to be rewritten (i.e., skipped)
2024-11-25 20:32:50 +0000 UTC info: Wrote 10653532160/10653532160 bytes to the inactive partition
2024-11-25 20:32:50 +0000 UTC info: State transition: update-store [Download_Enter] -> update-after-store [Download_Leave]
2024-11-25 20:32:50 +0000 UTC info: State transition: update-after-store [Download_Leave] -> mender-update-control-refresh-maps [none]
2024-11-25 20:32:50 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control [none]
2024-11-25 20:32:50 +0000 UTC info: State transition: mender-update-control [none] -> update-install [ArtifactInstall]
2024-11-25 20:32:50 +0000 UTC info: Enabling partition with new image installed to be a boot candidate: 2
2024-11-25 20:32:50 +0000 UTC info: State transition: update-install [ArtifactInstall] -> mender-update-control-refresh-maps [none]
2024-11-25 20:32:50 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control [none]
2024-11-25 20:32:50 +0000 UTC info: State transition: mender-update-control [none] -> reboot [ArtifactReboot_Enter]
2024-11-25 20:32:50 +0000 UTC info: Rebooting device(s)
2024-11-25 20:32:50 +0000 UTC info: Mender rebooting from active partition: /dev/sda3
2024-11-25 20:32:51 +0000 UTC error: error forwarding from client to backend: websocket: close 1006 (abnormal closure): unexpected EOF
2024-11-25 20:32:51 +0000 UTC info: Daemon terminated with SIGTERM
2024-11-25 20:33:56 +0000 UTC info: Running Mender client version: 3.5.2
2024-11-25 20:33:56 +0000 UTC info: State transition: init [none] -> after-reboot [ArtifactReboot_Leave]
2024-11-25 20:33:56 +0000 UTC error: transient error: Reboot to the new update failed. Expected "upgrade_available" flag to be true but it was false. Either the switch to the new partition was unsuccessful, or the bootloader rolled back
2024-11-25 20:33:56 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> rollback [ArtifactRollback]
2024-11-25 20:33:56 +0000 UTC info: Performing rollback
2024-11-25 20:33:56 +0000 UTC info: No update available, so no rollback needed.
2024-11-25 20:33:56 +0000 UTC info: State transition: rollback [ArtifactRollback] -> rollback-reboot [ArtifactRollbackReboot_Enter]
2024-11-25 20:33:56 +0000 UTC info: Rebooting device(s) after rollback
2024-11-25 20:33:56 +0000 UTC info: Mender rebooting from inactive partition: /dev/sda3
2024-11-25 20:33:56 +0000 UTC info: Daemon terminated with SIGTERM
2024-11-25 20:34:13 +0000 UTC info: Running Mender client version: 3.5.2
2024-11-25 20:34:13 +0000 UTC info: Mender shut down in state: rollback-reboot
2024-11-25 20:34:13 +0000 UTC info: State transition: init [none] -> verify-rollback-reboot [ArtifactRollbackReboot_Leave]
2024-11-25 20:34:13 +0000 UTC info: State transition: verify-rollback-reboot [ArtifactRollbackReboot_Leave] -> after-rollback-reboot [ArtifactRollbackReboot_Leave]
2024-11-25 20:34:13 +0000 UTC info: State transition: after-rollback-reboot [ArtifactRollbackReboot_Leave] -> update-error [ArtifactFailure]
2024-11-25 20:34:13 +0000 UTC info: State transition: update-error [ArtifactFailure] -> cleanup [Error]
2024-11-25 20:34:13 +0000 UTC info: State transition: cleanup [Error] -> update-status-report [none]
2024-11-25 20:34:13 +0000 UTC info: Device unauthorized; attempting reauthorization
2024-11-25 20:34:14 +0000 UTC info: successfully received new authorization data from server https://hosted.mender.io
2024-11-25 20:34:14 +0000 UTC info: Local proxy started
2024-11-25 20:34:14 +0000 UTC info: Reauthorization successful

You will also briefly see a message from grub which says “rolling back”, even though when it reboots it never looks like it attempts to enter the other partition. (Goes straight from BIOS splash screen to the “rolling back” message).

We then flashed the SSD from scratch with the Scarthgap image using dd. The system boots fine and functions well + any updates to other Scarthgap images now work without issues. However, if we try to flash a Kirkstone image we see the following errors on the display before the whole system freezes:

[FAILED] Failed to start File System Check on /dev/sda4.
[DEPEND] Dependency failed for/data.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Grow File System on /data.
[FAILED] Failed to start File System Check on /dev/sda4.
[DEPEND] Dependency failed for data.
[DEPEND] Dependency failed for Grow File System on /data.
[DEPEND] Dependency failed for Local File Systems.
[FAILED] Failed to start File System Check on /dev/sda4.
[DEPEND] Dependency failed for data.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Grow File Systen on /data...
IFAILED] Failed to start Network Name Resolution.
IFAILED] Failed to start Network Name Resolution.
[FAILEDI Failed to start Network Name Resolution.
[FAILED) Failed to start Network Name Resolution. 
[FAILED) Failed to start Network Name Resolution.
[FAILED) Failed to start Network Name Resolution.

We then have to forcefully shutdown the device to complete the rollback, logs for that are here:

2024-11-25 21:30:04.317 +0000 UTC info: Running Mender client 4.0.4
2024-11-25 21:30:04.317 +0000 UTC info: Deployment with ID 55176e57-8f4d-4eb6-ab96-09d80dfbc7a4 started.
2024-11-25 21:30:04.317 +0000 UTC info: Sending status update to server
2024-11-25 21:30:05.068 +0000 UTC info: Installing artifact...
2024-11-25 21:31:13.672 +0000 UTC info: Update Module output (stdout): ================ STATISTICS ================
2024-11-25 21:31:13.672 +0000 UTC info: Update Module output (stdout): Blocks written: 1716
2024-11-25 21:31:13.672 +0000 UTC info: Update Module output (stdout): Blocks omitted: 8444
2024-11-25 21:31:13.672 +0000 UTC info: Update Module output (stdout): Bytes  written: 1799356416
2024-11-25 21:31:13.672 +0000 UTC info: Update Module output (stdout): ============================================
2024-11-25 21:31:14.135 +0000 UTC info: Sending status update to server
2024-11-25 21:31:14.738 +0000 UTC info: Update Module output (stderr): sync: ignoring all arguments
2024-11-25 21:31:14.745 +0000 UTC info: Update Module output (stderr): sync: ignoring all arguments
2024-11-25 21:31:14.749 +0000 UTC info: Update Module output (stderr): sync: ignoring all arguments
2024-11-25 21:31:14.759 +0000 UTC info: Update Module output (stderr): sync: ignoring all arguments
2024-11-25 21:31:14.765 +0000 UTC info: Update Module output (stderr): sync: ignoring all arguments
2024-11-25 21:31:14.77 +0000 UTC info: Update Module output (stderr): sync: ignoring all arguments
2024-11-25 21:31:14.782 +0000 UTC info: Sending status update to server
2024-11-25 21:31:15.142 +0000 UTC info: Calling `reboot` command and waiting for system to restart.
2024-11-25 21:31:15.191 +0000 UTC info: Termination signal received, shutting down gracefully
2024-11-25 21:51:28.695 +0000 UTC info: Running Mender client 4.0.4
2024-11-25 21:51:28.696 +0000 UTC info: The update client daemon is now ready to handle incoming deployments
2024-11-25 21:51:28.789 +0000 UTC error: Process returned non-zero exit status: ArtifactVerifyReboot: Process exited with status 1
2024-11-25 21:51:28.908 +0000 UTC info: Calling `reboot` command and waiting for system to restart.
2024-11-25 21:51:28.944 +0000 UTC info: Termination signal received, shutting down gracefully
2024-11-25 21:53:41.544 +0000 UTC info: Running Mender client 4.0.4
2024-11-25 21:53:41.544 +0000 UTC info: The update client daemon is now ready to handle incoming deployments
2024-11-25 21:53:41.65 +0000 UTC info: Sending status update to server

Does anyone have any clues as to why it could be failing to upgrade from Kirkstone to Scarthgap? We updated all the submodules and it’s a pretty barebones build. We currently use the following layers:

  ${TOPDIR}/../sources/poky/meta \
  ${TOPDIR}/../sources/poky/meta-poky \
  ${TOPDIR}/../sources/poky/meta-yocto-bsp \
  ${TOPDIR}/../sources/meta-openembedded \
  ${TOPDIR}/../sources/meta-intel \
  ${TOPDIR}/../sources/meta-mender/meta-mender-core \

This seems to happen if we change the mender version to 4.0.4 inside of Kirkstone.

To do this we added the following to the local.conf:

INIT_MANAGER = "systemd"
PREFERRED_VERSION_mender = "4.0.4"
PREFERRED_PROVIDER_mender-native = "mender-native"
PREFERRED_RPROVIDER_mender-auth = "mender"
PREFERRED_RPROVIDER_mender-update = "mender"

This causes the same issues as if we built the Yocto image on Scarthgap.

Hi @OliverW,

Thanks for the report, thats definitely an unexpected problem. Just a few detail questions so I understand correctly:

  • When you build Client 4.0.4 on kirkstone, do you deploy it as an update to the device running Client 3.5 already, or is it a fresh image written to storage directly?
  • Are you on latest kirkstone?
  • You are describing two separate, but connected issues:
    • the system boot yielding the log which the blocks forever on Failed to start network resolution
    • the rollback after a forced restart.

My understanding is that the rollback is expected and normal behaviour then, the Client sees the forced restart as failed update, which is also kinda sensible in this case.

So if you flash a fresh image with Mender Client 4.0 on kirkstone to a device, my guess would be that you are seeing the stuck boot. Correct?

Greetz,
Josef

When you build Client 4.0.4 on kirkstone, do you deploy it as an update to the device running Client 3.5 already, or is it a fresh image written to storage directly?

  • In my update yesterday the devices were running the Client 3.5 already.
  • Today I wrote the Kirkstone+4.0.4 uefiimg directly to an SSD using dd and installed it into a device
    • I then updated the device over Mender using the Kirkstone image that failed yesterday and it worked.
    • I will try pushing the Scarthgap image to it next and seeing if it worked.

Are you on latest kirkstone?

Yup, all submodules are on the latest Kirkstone versions.

My understanding is that the rollback is expected and normal behaviour then, the Client sees the forced restart as failed update, which is also kinda sensible in this case.

The rollback is correct, but it’s unfortunate that it needs to be manually power cycled for this to occur. If left alone, the device will never reboot and will be stuck updating indefinitely.

So if you flash a fresh image with Mender Client 4.0 on kirkstone to a device, my guess would be that you are seeing the stuck boot. Correct?

Negative, when I flash a fresh image to a SSD with dd the device boots correctly and I can update to other Yocto images also compiled with the 4.0.4 Mender client. I just can’t flash it back to images that use the 3.5.2 Mender client.

  • The issue seems to be updating between major Mender releases:
    • If the device is running the 3.5 Mender client, it can be updated to other Yocto images that also use the 3.5 Mender client
    • If the device is running the 4.0 Mender client, it can be updated to other Yocto images that also use the 4.0 Mender client
    • If the device is running the 3.5 Mender client and I attempt to update to a Yocto image that uses the 4.0 Mender client: it immediately fails after booting and rolls back (I don’t see the new OS load at all).
    • If the device is running the 4.0 Mender client and I attempt to update to a Yocto image that uses the 3.5 Mender client: it fails to start the file system check/network name resolution and freezes indefinitely.

The mender specific configuration we are doing is as follows:

IMAGE_INSTALL:append = " mender-server-certificate mender-configure mender-connect mender-configure-scripts lsb-ld"
MENDER_FEATURES_ENABLE:append = "mender-persist-systemd-machine-id"

MENDER_DEVICE_TYPE="test"
MENDER_ARTIFACT_NAME = "test1"
MENDER_STORAGE_DEVICE = "/dev/sda"
MENDER_RETRY_POLL_INTERVAL_SECONDS = "10"
MENDER_UPDATE_POLL_INTERVAL_SECONDS = "10"
MENDER_INVENTORY_POLL_INTERVAL_SECONDS = "10"
MENDER_STORAGE_TOTAL_SIZE_MB_DEFAULT = "20480"
MENDER_SERVER_URL = "https://hosted.mender.io"

We inherit mender-full in the local.conf

Hi :slight_smile:

We are facing similar issue when trying to migrate from Kirkstone to Scarthgap.

On a computer running the Kirkstone version, the deployment of a Scarthgap version fails and the computer rolls back to the Kirkstone version.
Unlike @OliverW, the computer is under Intel and does not use Grub, so the reboot of the computer seems to work after the rollback (without any manual action)

In case it helps, here are the logs we have :

2024-12-26 09:11:38 +0000 UTC info: Running Mender client version: 3.5.2
2024-12-26 09:11:38 +0000 UTC info: State transition: update-fetch [Download_Enter] -> update-store [Download_Enter]
2024-12-26 09:11:38 +0000 UTC info: No public key was provided for authenticating the artifact
2024-12-26 09:11:38 +0000 UTC info: Opening device "/dev/sda3" for writing
2024-12-26 09:11:38 +0000 UTC info: Native sector size of block device /dev/sda3 is 512 bytes. Mender will write in chunks of 1048576 bytes
2024-12-26 09:14:15 +0000 UTC info: All bytes were successfully written to the new partition
2024-12-26 09:14:15 +0000 UTC info: The optimized block-device writer wrote a total of 2993 frames, where 1684 frames did need to be rewritten (i.e., skipped)
2024-12-26 09:14:16 +0000 UTC info: Wrote 3137339392/3137339392 bytes to the inactive partition
2024-12-26 09:14:16 +0000 UTC info: State transition: update-store [Download_Enter] -> update-after-store [Download_Leave]
2024-12-26 09:14:16 +0000 UTC info: State transition: update-after-store [Download_Leave] -> mender-update-control-refresh-maps [none]
2024-12-26 09:14:16 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control [none]
2024-12-26 09:14:16 +0000 UTC info: State transition: mender-update-control [none] -> update-install [ArtifactInstall]
2024-12-26 09:14:16 +0000 UTC info: Enabling partition with new image installed to be a boot candidate: 3
2024-12-26 09:14:17 +0000 UTC info: State transition: update-install [ArtifactInstall] -> mender-update-control-refresh-maps [none]
2024-12-26 09:14:17 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control [none]
2024-12-26 09:14:17 +0000 UTC info: State transition: mender-update-control [none] -> reboot [ArtifactReboot_Enter]
2024-12-26 09:14:17 +0000 UTC info: Rebooting device(s)
2024-12-26 09:14:17 +0000 UTC info: Mender rebooting from active partition: /dev/sda2
2024-12-26 09:14:18 +0000 UTC error: error forwarding from client to backend: websocket: close 1006 (abnormal closure): unexpected EOF
2024-12-26 09:14:18 +0000 UTC warning: error while sending close message: write tcp 127.0.0.1:38491->127.0.0.1:59432: use of closed network connection
2024-12-26 09:14:18 +0000 UTC info: Daemon terminated with SIGTERM
2024-12-26 09:15:04 +0000 UTC info: Running Mender client version: 3.5.2
2024-12-26 09:15:04 +0000 UTC info: State transition: init [none] -> after-reboot [ArtifactReboot_Leave]
2024-12-26 09:15:04 +0000 UTC error: transient error: Reboot to the new update failed. Expected "upgrade_available" flag to be true but it was false. Either the switch to the new partition was unsuccessful, or the bootloader rolled back
2024-12-26 09:15:04 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> rollback [ArtifactRollback]
2024-12-26 09:15:04 +0000 UTC info: Performing rollback
2024-12-26 09:15:04 +0000 UTC info: No update available, so no rollback needed.
2024-12-26 09:15:04 +0000 UTC info: State transition: rollback [ArtifactRollback] -> rollback-reboot [ArtifactRollbackReboot_Enter]
2024-12-26 09:15:04 +0000 UTC info: Rebooting device(s) after rollback
2024-12-26 09:15:04 +0000 UTC info: Mender rebooting from inactive partition: /dev/sda2
2024-12-26 09:15:05 +0000 UTC info: Daemon terminated with SIGTERM
2024-12-26 09:15:43 +0000 UTC info: Running Mender client version: 3.5.2
2024-12-26 09:15:43 +0000 UTC info: Mender shut down in state: rollback-reboot
2024-12-26 09:15:43 +0000 UTC info: State transition: init [none] -> verify-rollback-reboot [ArtifactRollbackReboot_Leave]
2024-12-26 09:15:43 +0000 UTC info: State transition: verify-rollback-reboot [ArtifactRollbackReboot_Leave] -> after-rollback-reboot [ArtifactRollbackReboot_Leave]
2024-12-26 09:15:44 +0000 UTC info: State transition: after-rollback-reboot [ArtifactRollbackReboot_Leave] -> update-error [ArtifactFailure]
2024-12-26 09:15:44 +0000 UTC info: State transition: update-error [ArtifactFailure] -> cleanup [Error]
2024-12-26 09:15:44 +0000 UTC info: State transition: cleanup [Error] -> update-status-report [none]
2024-12-26 09:15:44 +0000 UTC info: Device unauthorized; attempting reauthorization
2024-12-26 09:15:44 +0000 UTC info: Output (stderr) from command "/usr/share/mender/identity/mender-device-identity": using interface /sys/class/net/enp3s0
2024-12-26 09:15:44 +0000 UTC error: Failure occurred while executing authorization request: Method: Post, URL: https://eu.hosted.mender.io/api/devices/v1/authentication/auth_requests
2024-12-26 09:15:44 +0000 UTC error: Failed to authorize with "https://eu.hosted.mender.io": Unknown url.Error type: dial tcp: lookup eu.hosted.mender.io on [::1]:53: read udp [::1]:39463->[::1]:53: read: connection refused
2024-12-26 09:15:44 +0000 UTC warning: Reauthorization failed with error: transient error: authorization request failed
2024-12-26 09:15:44 +0000 UTC error: Failed to report status: transient error: authorization request failed
2024-12-26 09:15:44 +0000 UTC error: error reporting update status: reporting status failed: transient error: authorization request failed
2024-12-26 09:15:44 +0000 UTC error: Failed to send status to server: transient error: reporting status failed: transient error: authorization request failed
2024-12-26 09:15:44 +0000 UTC info: State transition: update-status-report [none] -> update-retry-report [none]
2024-12-26 09:16:14 +0000 UTC info: State transition: update-retry-report [none] -> update-status-report [none]
2024-12-26 09:16:14 +0000 UTC info: Device unauthorized; attempting reauthorization
2024-12-26 09:16:14 +0000 UTC info: Output (stderr) from command "/usr/share/mender/identity/mender-device-identity": using interface /sys/class/net/enp3s0
2024-12-26 09:16:14 +0000 UTC info: successfully received new authorization data from server https://eu.hosted.mender.io
2024-12-26 09:16:14 +0000 UTC info: Local proxy started
2024-12-26 09:16:14 +0000 UTC info: Reauthorization successful

We tried a fresh build of Poky on Kirkstone and Scarthgap with minimal modifications (This tutorial, except we also make the production tweaks). The update still fails with the minimal Poky builds without modifications.

We have also tried 3 different hardware platforms: two Intel NUC models and our custom hardware.

Hi @OliverW and @LudovicCook-e,

Thanks for the details and sorry for the inconvenience! I’ll hopefully get around to reproduce this tomorrow or early next week. Will keep you posted!

Greetz,
Josef

1 Like

Hi @OliverW and @LudovicCook-e,

I spend some time to do quite a bit of testing on this, and what I found so far:

  • the upgrade kirkstone + Client 3.5 to scarthgap + Client 4.0 works with a u-boot integration, tested on Raspberry Pi 4-64
  • the upgrade kirkstone + Client 3.5 to scarthgap + Client 4.0 doesn’t work with a grub integration, tested on Beaglebone Black

When looking at the boot log, the problem seems to be GRUB not being able to load from the new active partition. So my current guess would be that there is something off when the UEFI path is used.

Can somebody of you actually see the boot logs, and post them?

Greetz,
Josef

Addition: if GRUB is involved, the following workaround might be required: Yocto Project runtime | Mender documentation

Greetz,
Josef

Unfortunately, I don’t know how to retrieve the boot logs on my machine …

Thanks for the additional information about GRUB. I will try to add this option in my next trial.

Looking a bit through it, I think this might fix the problem for @LudovicCook-e.

@OliverW what kind of board is this? Do you have a serial terminal that you can access through a UART?

I confirm that it has fixed the problem. Thanks for your support @TheYoctoJester :slight_smile:

1 Like

Sorry for the late reply. I can confirm that adding:

EXTRA_IMAGECMD:ext4:append = " -O ^metadata_csum_seed"

to local.conf fixes the issue. Thank you for the help @TheYoctoJester.

Many thanks,
Oli

1 Like