Failing updates on Tegra TX2 due to issues with `nvbootctrl`

Hello everyone,

We’re having issues with failing updates, and have traced the problem back to nvbootctrl get-current-slot returning the wrong slot in 50% of all cases.

There is also a thread on this problem at the NVIDIA support forum. There, I learned that Mender uses U-Boot variables to switch between the two rootfs partitions, and that this is a separate mechanism from the A/B slot mechanism in the NVIDIA bootloader chain.

Can anyone elaborate on that? I would like to learn how Mender interfaces with and affects the A/B slot mechanism in the NVIDIA bootloader chain. I would also like to learn how system boot works in general on that platform. I didn’t quite grok the vanilla A/B slot system, let alone the changes due to Mender.

We are observing the following symptoms:

  • The Mender update fails when nvbootctrl get-current-slot returns the wrong slot.
  • The values returned by the nvbootctrl commands change with every boot, in this cyclic pattern:

Boot 1: nvbootctrl get-current-slot : 0 ; Priority of slot 0: 14 ; Priority of slot 1: 14 ;
Boot 2: nvbootctrl get-current-slot : 0 ; Priority of slot 0: 14 ; Priority of slot 1: 15 ;
Boot 3: nvbootctrl get-current-slot : 1 ; Priority of slot 0: 13 ; Priority of slot 1: 15 ;
Boot 4: nvbootctrl get-current-slot : 1 ; Priority of slot 0: 15 ; Priority of slot 1: 14 ;

In each of these reboots,

  • the machine booted from slot 0 according to both fw_printenv mender_boot_part and findmnt /.
  • retry_count as reported by nvbootctrl dump-slots-info was 7 for both slots
  • boot_successful as reported by nvbootctrl dump-slots-info was 1 for both slots
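
The pattern above can be checked mechanically. Here is a minimal Python sketch, assuming the four log lines are pasted as-is and that the machine really booted from slot 0 every time (as fw_printenv mender_boot_part and findmnt / indicated). The helper names parse_boots and mismatched_boots are made up for illustration and are not part of any Mender or NVIDIA tool.

```python
import re

# The four boot records quoted above, copied verbatim.
LOG = """\
Boot 1: nvbootctrl get-current-slot : 0 ; Priority of slot 0: 14 ; Priority of slot 1: 14 ;
Boot 2: nvbootctrl get-current-slot : 0 ; Priority of slot 0: 14 ; Priority of slot 1: 15 ;
Boot 3: nvbootctrl get-current-slot : 1 ; Priority of slot 0: 13 ; Priority of slot 1: 15 ;
Boot 4: nvbootctrl get-current-slot : 1 ; Priority of slot 0: 15 ; Priority of slot 1: 14 ;
"""

LINE_RE = re.compile(
    r"Boot (\d+): nvbootctrl get-current-slot : (\d+) ; "
    r"Priority of slot 0: (\d+) ; Priority of slot 1: (\d+) ;"
)


def parse_boots(log):
    """Return one dict per boot record found in the log text."""
    boots = []
    for match in LINE_RE.finditer(log):
        boot, slot, prio0, prio1 = (int(g) for g in match.groups())
        boots.append(
            {"boot": boot, "reported_slot": slot, "priority": (prio0, prio1)}
        )
    return boots


def mismatched_boots(boots, actual_slot=0):
    """List the boots where the reported slot differs from the slot
    the system actually booted from (assumed to be slot 0 here)."""
    return [b["boot"] for b in boots if b["reported_slot"] != actual_slot]


boots = parse_boots(LOG)
bad = mismatched_boots(boots)
print(f"{len(bad)} of {len(boots)} boots report the wrong slot: {bad}")
```

On the data above, boots 3 and 4 are flagged, which matches the 50% failure rate we see.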

Thanks a lot,
Manuel

The A/B slot mechanism interaction happens in https://github.com/mendersoftware/meta-mender-community/tree/dunfell/meta-mender-tegra/recipes-mender/tegra-state-scripts as part of Mender updates, so for instance here is where the nv_update_engine command is executed and here is where that script is set up as the ArtifactInstall state script.

I don’t really understand how Mender would be involved in standalone reboots, since the scripts here should only execute as part of an update or rollback. So I think the issue to understand is why the slot numbers are changing with normal reboots.

It would be interesting to try to reproduce this phenomenon on a build without mender installed at all, for instance stock L4T from NVIDIA or tegra demo distro with tegrademo instead of tegrademo-mender.

Incidentally, we just noticed this same issue on one device today after using mender updates successfully on several devices for months.


A bit more about the difference between u-boot and cboot interaction:

With u-boot, the boot partition is chosen here based on the Mender variables set in the u-boot environment in response to fw_printenv and fw_setenv calls from the Mender client.

However, with cboot builds there’s a fake version of libubootenv which uses nvbootctrl to get the current slot and uses this to communicate to Mender which partition is currently running.

So in a uboot based implementation, if software outside Mender (i.e. the NVIDIA bootloader) decides to switch between boot slots using the logic in the Update Engine State Machine, there will be a mismatch between the bootloader and the root filesystem, and this is what the error is telling you. This would obviously be an issue if it happened across any update which attempted to update the bootloader.

In a cboot based system, the bootloader and root filesystem will always be in sync, so if the Update Engine State Machine decides to roll back to a different boot slot, it will also roll back the root filesystem. You won’t get the mismatch error you see with uboot in this case, but you will be running a rootfs and bootloader version you likely didn’t expect.

So in both cases it’s bad that the update engine is deciding to switch boot slots on its own, but it’s likely more problematic for uboot.
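
To make the difference concrete, here is a toy Python model of the two cases. It is only a simplified sketch of the behaviour described above, not the actual Mender, U-Boot, or cboot logic; the function running_state and its parameters are hypothetical.

```python
def running_state(bootloader, bootloader_slot, mender_boot_slot):
    """Toy model of a two-slot system.

    bootloader:       "uboot" or "cboot"
    bootloader_slot:  slot picked by the NVIDIA bootloader chain
    mender_boot_slot: slot Mender last recorded in the U-Boot environment
                      (ignored for cboot in this model)

    Returns (bootloader_slot, rootfs_slot, in_sync).
    """
    if bootloader == "uboot":
        # The rootfs is chosen from the Mender variables, independently
        # of the NVIDIA A/B slot.
        rootfs_slot = mender_boot_slot
    elif bootloader == "cboot":
        # The rootfs follows the NVIDIA A/B slot.
        rootfs_slot = bootloader_slot
    else:
        raise ValueError(f"unknown bootloader: {bootloader!r}")
    return bootloader_slot, rootfs_slot, bootloader_slot == rootfs_slot


# The bootloader chain switches to slot 1 on its own, while Mender
# still expects slot 0.
print("uboot:", running_state("uboot", bootloader_slot=1, mender_boot_slot=0))
print("cboot:", running_state("cboot", bootloader_slot=1, mender_boot_slot=0))
```

In this model the uboot case ends up out of sync (bootloader slot 1, rootfs slot 0), while the cboot case stays in sync but runs slot 1, which nobody asked for.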


Created an issue at https://github.com/OE4T/meta-mender-community/issues/7 to track how to deal with this. I’d like to understand whether PREFERRED_PROVIDER_virtual/bootloader = "cboot-prebuilt" is a possible solution, or at least makes the problem easier to deal with.