Mender automatic fallback

Hello,

I’m using a Raspberry Pi 3 with a version of Raspbian converted to work with Mender. Updates work just fine. However, as a test, I deliberately corrupted the inactive root partition just after the update process and before the reboot.

From my understanding, if the updated root partition fails to boot more than 3 times, Mender should automatically revert to the previous rootfs? And it should all happen without human interaction?

But in my case, I left my Raspberry Pi running for about 5 minutes, and it never rebooted to the previous rootfs.

Is this normal?

Thanks.

Actually, it fails only once before performing a rollback (which is handled in the bootloader).

It depends a bit on what happened when you corrupted the inactive partition. If the Linux kernel halts or hangs before “user-space” is started, something needs to kick some life back into the system. That is the job of the hardware watchdog, because the Mender client will not be running if the system never reaches user-space.

Then, what could cause a rollback? I don’t understand because if the user-space is started that means that the system is running some way or another.

And is the hardware watchdog part of the Mender configuration?

There are plenty of things that can cause a rollback, and it is hard to cover them all, but here is a short list:

  • Fails to load Linux kernel (file is corrupted or misplaced)
  • Fails to boot Linux kernel (crash before user-space, HW watchdog needs to reset it, could also be e.g. CONFIG_PANIC_TIMEOUT)
  • Mender client (in managed mode, that is, connected to a server) not able to connect to the server after the update
  • Power loss / reset of the device during many of the Mender client states (ArtifactReboot is an exception)
  • User/application can also force a rollback by defining custom sanity checks using state-scripts

“I don’t understand because if the user-space is started that means that the system is running some way or another.”

If user-space starts, the Mender client is typically happy as long as it can connect to the server again. If you need further sanity checks, you would need to provide them using state-scripts.
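As a rough illustration of the state-script approach mentioned above, here is a minimal sketch of a check that could run in the ArtifactCommit_Enter state. It assumes that a non-zero exit status in that state makes the client treat the update as failed and roll back. The script name, the helper name `required_files_present` and the paths in the comments are hypothetical examples.

```shell
#!/bin/sh
# Sketch of a sanity check for a Mender state script, e.g. one saved as
# ArtifactCommit_Enter_10_sanity-check (the suffix is a made-up example).
# Assumption: exiting non-zero in this state leads to a rollback.

# Return 0 only if every path passed as an argument exists and is non-empty.
required_files_present() {
    for rfp_path in "$@"; do
        if [ ! -s "$rfp_path" ]; then
            echo "sanity check failed: $rfp_path is missing or empty" >&2
            return 1
        fi
    done
    return 0
}

# In an actual state script the last line would be something like
# (hypothetical paths):
#   required_files_present /usr/bin/my-app /etc/my-app.conf || exit 1
```

A real check would more likely test that the application itself starts and responds. The file test is only the simplest thing that shows the exit-code convention.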

“And is the hardware watchdog part of the Mender configuration?”

No, this is platform specific. Ultimately you would want the watchdog to be enabled in the bootloader so that it can catch a failing Linux kernel.

E.g. on ARM you would want to do this in U-Boot, and each platform/board/SoC has a different watchdog.

The same is true (or even worse) for x86. There you would want to enable the watchdog in the firmware (BIOS/UEFI), which is again highly platform specific, if it is supported at all.

Thank you for the list. However, which of these things are caught by Mender? That’s what I don’t understand.

Thank you

“Thank you for the list. However, which of these things are caught by Mender? That’s what I don’t understand.”

It should be all of them except:

“Fails to boot Linux kernel (crash before user-space, HW watchdog needs to reset it, could also be e.g. CONFIG_PANIC_TIMEOUT)”

The bootloader integration will detect that the boot failed (this is what the boot counters in U-Boot are for), but it needs help with the reset, since this is an in-between state.
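To make the boot-counter behaviour a bit more concrete, here is a small sketch of the decision the bootloader makes on each boot. It is a simplified model, not the actual U-Boot code. It assumes the variable names `upgrade_available`, `bootcount` and `bootlimit`, and a `bootlimit` of 1, which would match the "fails only once" behaviour described earlier in the thread. The function name `decide_boot` is made up for this example.

```shell
# Simplified model of the boot-counter check (an assumption-based sketch).
# Arguments: upgrade_available, bootcount, bootlimit
decide_boot() {
    db_upgrade=$1
    db_count=$2
    db_limit=$3

    # The counter only advances while an update is waiting to be committed.
    if [ "$db_upgrade" = "1" ]; then
        db_count=$((db_count + 1))
    fi

    # Past the limit, the alternate boot command (the rollback path) runs.
    if [ "$db_count" -gt "$db_limit" ]; then
        echo "altbootcmd bootcount=$db_count"
    else
        echo "bootcmd bootcount=$db_count"
    fi
}

decide_boot 1 0 1   # first boot after an update: prints "bootcmd bootcount=1"
decide_boot 1 1 1   # boot after one failed attempt: prints "altbootcmd bootcount=2"
```

The model also shows why something has to reset the board after the first failed attempt: the counter is only evaluated when the bootloader runs again.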

OK, thank you. So a kernel panic should be caught either by a kernel panic timeout or by a HW watchdog, neither of which is managed by Mender.

I imagine that this error is caught by U-Boot directly, which will revert to the previous root partition without having to communicate with Mender. Is that right?

Yes, though the Mender client will notice that this happened (because of the U-Boot environment variables that changed) and report failure to the Mender server.
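The U-Boot environment can usually be inspected from the running system with `fw_printenv`, assuming the tool is installed and configured on the device. The sketch below filters its output for three variables that I am assuming the Mender U-Boot integration uses (`mender_boot_part`, `upgrade_available`, `bootcount`). Check the names against your own environment. The function name `summarise_boot_env` is made up for this example.

```shell
# Summarise boot-related variables from fw_printenv-style input
# (name=value lines on stdin). The variable names are assumptions.
summarise_boot_env() {
    sbe_upgrade=unknown
    sbe_count=unknown
    sbe_part=unknown
    while IFS='=' read -r sbe_name sbe_value; do
        case "$sbe_name" in
            upgrade_available) sbe_upgrade=$sbe_value ;;
            bootcount)         sbe_count=$sbe_value ;;
            mender_boot_part)  sbe_part=$sbe_value ;;
        esac
    done
    echo "boot_part=$sbe_part upgrade_available=$sbe_upgrade bootcount=$sbe_count"
}

# On the device (needs root and a working fw_env.config):
#   fw_printenv | summarise_boot_env
```

Comparing the summary before the update and after the automatic reboot should show whether the active partition changed back.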

So with the information you provided I retried my tests. The first one, deleting the zImage on the inactive root partition before rebooting worked fine. The system rebooted automatically to the previous root partition.

However, for the second test, even when setting kernel.panic=5 and kernel.panic_on_oops=1 in the /etc/sysctl.conf file, if I corrupt the zImage on the inactive root partition before rebooting, the RPi3 will hang on the rainbow screen. Do you have any idea?

“However, for the second test, even when setting kernel.panic=5 and kernel.panic_on_oops=1”

You probably also need to add kernel.panic = 10, that is, reboot 10 seconds after a kernel panic/oops.

How do you corrupt it?

“the RPi3 will hang on the rainbow screen. Do you have any idea?”

It would be best if you could connect a serial console and share the output, to get a better idea of what is happening.
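Without a serial console, the panic settings discussed above can at least be read back from the running system. One caveat: values in /etc/sysctl.conf are applied from user-space during boot, so they cannot influence a panic that happens before that point. For that window the kernel falls back to its built-in default (CONFIG_PANIC_TIMEOUT) or a `panic=` parameter on the kernel command line. The sketch below just reports what is currently in effect. The function name `report_panic_settings` is made up for this example.

```shell
# Report the panic settings found in a sysctl directory
# (defaults to /proc/sys/kernel) and what they imply for a panic.
report_panic_settings() {
    rps_dir=${1:-/proc/sys/kernel}
    rps_timeout=$(cat "$rps_dir/panic" 2>/dev/null || echo 0)
    rps_oops=$(cat "$rps_dir/panic_on_oops" 2>/dev/null || echo 0)

    echo "kernel.panic=$rps_timeout kernel.panic_on_oops=$rps_oops"
    if [ "$rps_timeout" -gt 0 ]; then
        echo "a panic reboots after ${rps_timeout}s"
    elif [ "$rps_timeout" -lt 0 ]; then
        echo "a panic reboots immediately"
    else
        echo "a panic hangs until something else resets the board"
    fi
}

# Usage on the device:
#   report_panic_settings
```

If the values look right here but the board still hangs on a corrupted zImage, the failure is probably happening before these settings are applied, or before the kernel gets far enough to panic at all.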

What’s the benefit of changing from 5 to 10?

I use hexer, which lets me edit binary files. I select a random patch of bytes, set them to 0, then save the file.

Unfortunately, due to COVID-19, I don’t have access to this type of equipment right now.

In the meantime, I enabled the hardware watchdog on a new image following the non-domotic part of this guide: https://www.domoticz.com/wiki/Setting_up_the_raspberry_pi_watchdog

The reset is set to 14 seconds. Once the newly generated images are ready, I will check the update process again to see whether the watchdog lets the Pi reboot after booting a corrupted zImage.

Ignore my suggestion about kernel.panic = 10. I missed that you had set this already. The timeout value should not matter.

“I use hexer, which lets me edit binary files. I select a random patch of bytes, set them to 0, then save the file.”

Hm, yeah, this could introduce a case where the kernel does not trigger a panic/oops, and only a HW watchdog would reset it.
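For anyone who wants to reproduce this kind of corruption without a hex editor, zeroing a patch of bytes can be sketched with `dd`. This is only an approximation of the hexer procedure described above. The function name `zero_patch` and the path, offset and length in the usage comment are hypothetical.

```shell
# Overwrite LENGTH bytes of FILE with zeros, starting at byte OFFSET.
# conv=notrunc keeps the rest of the file and its size intact.
# Usage: zero_patch FILE OFFSET LENGTH
zero_patch() {
    dd if=/dev/zero of="$1" bs=1 seek="$2" count="$3" conv=notrunc 2>/dev/null
}

# Example (hypothetical mount point, offset and length):
#   zero_patch /mnt/inactive/boot/zImage 4096 512
```

Using a fixed offset instead of a random one makes the test repeatable, which helps when comparing runs with and without the watchdog enabled.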

Unfortunately, the hardware watchdog does not seem to work either. Do you know whether mender-convert could “break” something that the Raspberry Pi hardware watchdog depends on?

Hi @deku, I don’t know of anything obvious that would break the watchdog reset capability. Have you verified that it works before running mender-convert?
Drew

Hi,

The problem is, I don’t know how to test it without having Mender as a fallback. I tested it with a fork bomb and the Pi rebooted. But I get some strange output when reading the watchdog status:

deku@pi-ps6:~ $ sudo service watchdog status
[sudo] password for deku:
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2020-04-24 15:25:53 BST; 3 days ago
  Process: 660 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprob
  Process: 662 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options (code=exited, st
 Main PID: 664 (watchdog)
    Tasks: 1 (limit: 2200)
   Memory: 912.0K
   CGroup: /system.slice/watchdog.service
           └─664 /usr/sbin/watchdog

Apr 24 15:25:53 pi-ps6 watchdog[664]: interface: no interface to check
Apr 24 15:25:53 pi-ps6 watchdog[664]: temperature: no sensors to check
Apr 24 15:25:53 pi-ps6 watchdog[664]: no test binary files
Apr 24 15:25:53 pi-ps6 watchdog[664]: no repair binary files
Apr 24 15:25:53 pi-ps6 watchdog[664]: error retry time-out = 60 seconds
Apr 24 15:25:53 pi-ps6 watchdog[664]: repair attempts = 1
Apr 24 15:25:53 pi-ps6 watchdog[664]: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
Apr 24 15:25:53 pi-ps6 watchdog[664]: watchdog now set to 14 seconds
Apr 24 15:25:53 pi-ps6 watchdog[664]: hardware watchdog identity: Broadcom BCM2835 Watchdog timer
Apr 24 15:25:53 pi-ps6 systemd[1]: Started watchdog daemon.

The heartbeat is set to none, although it has been set in /etc/modprobe.d/watchdog.conf:
options bcm2835_wdt nowayout=1 heartbeat=10

and also in /etc/watchdog.conf:
watchdog-device = /dev/watchdog
watchdog-timeout = 14
realtime = yes
priority = 1
max-load-1 = 24

In addition, this entry has been added to /boot/config.txt:
dtparam=watchdog=on

Thank you for the help
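One way to cross-check what the kernel driver itself reports, independently of the daemon's log, is to read the watchdog attributes from sysfs. This is a sketch that assumes the kernel exposes them under /sys/class/watchdog/watchdog0, which depends on the kernel configuration. The function name `show_watchdog` is made up for this example. As far as I know, the `heartbeat=[none]` field in the daemon's log line refers to the daemon's optional heartbeat file rather than to the module's `heartbeat=10` option, but that is worth checking against the watchdog man pages.

```shell
# Print selected watchdog attributes from a sysfs directory
# (defaults to /sys/class/watchdog/watchdog0). Attributes that the
# kernel does not expose are reported as "unavailable".
show_watchdog() {
    sw_dir=${1:-/sys/class/watchdog/watchdog0}
    for sw_attr in identity state timeout nowayout; do
        if [ -r "$sw_dir/$sw_attr" ]; then
            echo "$sw_attr=$(cat "$sw_dir/$sw_attr")"
        else
            echo "$sw_attr=unavailable"
        fi
    done
}

# Usage on the device:
#   show_watchdog
```

A `timeout` of 14 here would agree with the "watchdog now set to 14 seconds" line in the service output above.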