We’re having issues with failing updates, and traced the problem back to nvbootctrl get-current-slot returning the wrong slot in 50% of all cases.
There is a thread on this problem at the NVIDIA support forum as well. There, I learned that Mender uses U-Boot variables to switch between the two rootfs partitions, and that this is a separate mechanism from the A/B slot mechanism in the NVIDIA bootloader chain.
Can anyone elaborate on that? I would like to learn how Mender interfaces with and affects the A/B slot mechanism in the NVIDIA bootloader chain. I would also like to learn how system boot works in general on that platform. I didn’t quite grok the vanilla A/B slot system, let alone the changes due to Mender.
We are observing the following symptoms:
- The Mender update fails when nvbootctrl get-current-slot returns the wrong slot.
- The values returned by the nvbootctrl commands change with every boot. They cycle through the following pattern:
nvbootctrl get-current-slot : 0 ; Priority of slot 0: 14 ; Priority of slot 1: 14 ;
nvbootctrl get-current-slot : 0 ; Priority of slot 0: 14 ; Priority of slot 1: 15 ;
nvbootctrl get-current-slot : 1 ; Priority of slot 0: 13 ; Priority of slot 1: 15 ;
nvbootctrl get-current-slot : 1 ; Priority of slot 0: 15 ; Priority of slot 1: 14 ;
In every one of these reboots,
- the machine booted from slot 0 according to both fw_printenv mender_boot_part and
- retry_count as reported by nvbootctrl dump-slots-info was 7 for both slots
- boot_successful as reported by nvbootctrl dump-slots-info was 1 for both slots
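To make the "50% of all cases" figure concrete, here is a minimal Python sketch that tallies the four-boot cycle listed above against the slot the machine actually booted from. The tuples are copied from the observations in this post; the names OBSERVED_CYCLE, BOOTED_SLOT and mismatch_rate are made up for illustration and are not part of Mender or nvbootctrl.

```python
# Observations copied from the four-boot cycle reported above:
# (nvbootctrl get-current-slot, priority of slot 0, priority of slot 1)
OBSERVED_CYCLE = [
    (0, 14, 14),
    (0, 14, 15),
    (1, 13, 15),
    (1, 15, 14),
]

# Per the report, the machine booted from slot 0 on every one of these boots.
BOOTED_SLOT = 0


def mismatch_rate(cycle, booted_slot):
    """Return the fraction of boots where nvbootctrl disagrees with the booted slot."""
    mismatches = [slot for slot, _, _ in cycle if slot != booted_slot]
    return len(mismatches) / len(cycle)


if __name__ == "__main__":
    rate = mismatch_rate(OBSERVED_CYCLE, BOOTED_SLOT)
    print(f"mismatch rate: {rate:.0%}")  # prints "mismatch rate: 50%"
```

Two of the four boots in the cycle report slot 1 while the system is running from slot 0, which matches the failure rate we see for updates.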
Thanks a lot,
The interaction with the A/B slot mechanism happens in https://github.com/mendersoftware/meta-mender-community/tree/dunfell/meta-mender-tegra/recipes-mender/tegra-state-scripts as a part of the mender updates. For instance, here is where the nv_update_engine command is executed, and here is where that script is set up as the ArtifactInstall state script.
I don’t really understand how Mender would be involved in standalone reboots, since the scripts here should only execute as a part of an update or rollback. So I think the issue to understand is why slot numbers are changing with normal reboots.
It would be interesting to try to reproduce this phenomenon on a build without mender installed at all, for instance stock L4T from NVIDIA or tegra demo distro with tegrademo instead of
Incidentally, we just noticed this same issue on one device today after using mender updates successfully on several devices for months.
A bit more about the difference between u-boot and cboot interaction:
With u-boot, the boot partition is chosen here based on mender variables set in the uboot environment in response to fw_setenv from the mender client.
However, with cboot builds there’s a fake version of libubootenv which uses nvbootctrl to get the current slot and uses this to communicate to mender which partition is currently running.
So in a uboot based implementation, if software outside mender (i.e. the nvidia bootloader) decides to switch between boot slots using the logic in the Update Engine State Machine, there will be a mismatch between the bootloader and the root filesystem, and this is what the error is telling you. This would obviously be an issue if it happened across any update which attempted to update the bootloader.
In a cboot based system, the bootloader and root filesystem will always be in sync, so if the Update Engine State Machine decides to roll back to a different boot slot it will also roll back the root filesystem. You won’t get the mismatch error you see with uboot in this case but you will be running a rootfs and bootloader version you likely didn’t expect.
So in both cases it’s bad that the update engine is deciding to switch boot slots on its own, but it’s likely more problematic for uboot.
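As a rough illustration of the difference described above, here is a toy Python model of what happens when the update engine flips the bootloader slot on its own. This is only a sketch of the reasoning in this post, not code from Mender or NVIDIA; BootState, after_engine_switch and in_sync are hypothetical names, and slots are simply numbered 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootState:
    bootloader_slot: int  # slot the NVIDIA bootloader chain is using (0 or 1)
    rootfs_slot: int      # rootfs in use (0 for A, 1 for B)


def after_engine_switch(state: BootState, mode: str) -> BootState:
    """Model the update engine switching bootloader slots without mender's involvement."""
    new_slot = 1 - state.bootloader_slot
    if mode == "cboot":
        # With cboot the rootfs follows the bootloader slot, so both move together.
        return BootState(new_slot, new_slot)
    if mode == "uboot":
        # With u-boot the rootfs is chosen from mender's U-Boot variables,
        # which the update engine does not touch, so the rootfs stays put.
        return BootState(new_slot, state.rootfs_slot)
    raise ValueError(f"unknown mode: {mode}")


def in_sync(state: BootState) -> bool:
    """True when the bootloader slot and the rootfs slot agree."""
    return state.bootloader_slot == state.rootfs_slot


if __name__ == "__main__":
    start = BootState(bootloader_slot=0, rootfs_slot=0)
    for mode in ("uboot", "cboot"):
        end = after_engine_switch(start, mode)
        print(mode, end, "in sync" if in_sync(end) else "MISMATCH")
```

In this model the uboot case ends in a mismatch, while the cboot case stays in sync but silently lands on the other rootfs, which is the unexpected-version situation described above.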
Created an issue at https://github.com/OE4T/meta-mender-community/issues/7 to track how to deal with this. I’d like to understand whether PREFERRED_PROVIDER_virtual/bootloader = "cboot-prebuilt" is a possible solution, or at least makes the problem easier to deal with.