Recovery when the rootfs is corrupted

Hi,
Is there a good practice to check whether the rootfs is corrupted before the system enters an emergency state, then have U-Boot boot the rootfs from partition B, and finally have Mender deploy its rootfs update again to the corrupted partition?
I tried adding my own emergency services: one that changes the U-Boot upgrade_available flag and resets the device, and another that runs fsck on the other partition to fix the errors, but I don't think this is the right way.
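For reference, here is a minimal sketch of what such an emergency fallback could look like as a script run from a service hooked into systemd's emergency.target. The U-Boot variable names (mender_boot_part, upgrade_available) and the partition number are assumptions based on Mender's standard U-Boot integration and may differ per board, so treat this as a sketch rather than something to deploy as-is:

```shell
#!/bin/sh
# Hypothetical emergency-fallback script, meant to be wired into
# systemd's emergency.target. Variable names below are assumptions.

OTHER_PART="${OTHER_PART:-3}"   # partition number of rootfs B (assumption)
DRY_RUN="${DRY_RUN:-1}"         # set DRY_RUN=0 on the device to act for real

# run() executes a command, or only prints it in dry-run mode, so the
# sequence can be inspected safely off-device.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Point U-Boot at the other rootfs partition and clear the "update in
# progress" flag so the bootloader does not treat this as a failed upgrade.
run fw_setenv mender_boot_part "$OTHER_PART"
run fw_setenv upgrade_available 0
run systemctl reboot
```

In dry-run mode the script only prints the commands it would issue, which makes it easy to sanity-check the sequence before wiring it into the real emergency path.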

Hello @SergeyMich

I think I understand where you are coming from. The way to be sure that a faulty update is rolled back is to have a watchdog on the system, and have it roll back the update if something happens to the filesystem during boot.

Hi @oleorhagen
Yes, you are right. I have a watchdog on my system, but the watchdog is not triggered when the system goes to the emergency shell.
However, if I manage to trigger the watchdog, how will Mender automatically deploy the rootfs update to the corrupted rootfs A?

Alright, I’m not completely following what is going on here.

So rootfs A is the active partition (aka the one we are updating from), and rootfs B is the inactive partition, the one we are trying to update to?

Just to get the nomenclature right :slight_smile:

Alright, let’s consider a scenario where, under normal circumstances, my device operates with rootfs A. However, at some point, rootfs A becomes corrupted, manifesting an error like the following:

[   15.518323] EXT4-fs error (device mmcblk0p7): ext4_lookup:1785: inode #636: comm systemd-udevd: iget: bad extra_isize 7340 (inode size 256)

This error leads the system into an emergency shell, rendering the Mender client non-functional and leaving the device stranded with the corrupted rootfs A. In response, I modified the emergency service to update the U-Boot parameters and trigger a system reboot, transitioning it to rootfs B.

Even after successfully bringing up and stabilizing rootfs B, rootfs A remains corrupted. Now, I have a couple of questions:

  1. How can I detect the corruption of rootfs A to initiate a seamless switch to rootfs B using Mender?
  2. Is there an automated way to deploy changes to the corrupted rootfs A from the functioning rootfs B?
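For question 1, one possibility (a sketch, not something I have tested) is to probe the inactive partition from the healthy rootfs B with a read-only fsck pass. The device node below is only an assumption taken from my error log, and fsck's exit status is a bit mask documented in fsck(8):

```shell
#!/bin/sh
# Sketch: detect corruption on the inactive rootfs from the running one.
INACTIVE_PART="${1:-/dev/mmcblk0p7}"   # device node is an assumption

# Interpret fsck's exit status (a bit mask, see fsck(8)):
#   0 = no errors, 1 = errors corrected, 2 = reboot needed,
#   4 and above = errors left uncorrected or an operational error.
classify_fsck() {
    case "$1" in
        0)     echo clean ;;
        1|2|3) echo repaired ;;
        *)     echo damaged ;;
    esac
}

# Only probe if the block device actually exists (it won't off-device).
if [ -b "$INACTIVE_PART" ]; then
    fsck.ext4 -n "$INACTIVE_PART"   # -n: read-only check, change nothing
    state=$(classify_fsck "$?")
    echo "inactive rootfs is: $state"
fi
```

If the check reports damage, the same script could attempt a repair with fsck.ext4 -y, or mark the device so that a fresh deployment overwrites the partition.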

You can, as you said, do it if you end up in the emergency shell, I guess: change the U-Boot parameters to boot off the other partition, then simply reboot. However, I would not recommend this, as it smells a little like an opportunity for a downgrade attack. But in all regards, it sounds possible. Maybe a rescue partition would be a better thing to fall back to in this instance?

I cannot think of anything off the top of my head. You could do something like having a deployment group, maybe: when the device falls back to B, it sets some inventory value, which makes sure it becomes part of this group and then automatically gets a new update from it. However, it feels like we're shoehorning this a little bit.
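To sketch the inventory idea: the Mender client picks up executables named mender-inventory-* from /usr/share/mender/inventory and expects them to print key=value pairs on stdout. A hypothetical script (the attribute name and marker-file path are made up for illustration) could look like:

```shell
#!/bin/sh
# Hypothetical /usr/share/mender/inventory/mender-inventory-fallback.
# Reports whether the device has fallen back to rootfs B, based on a
# marker file that the fallback logic would create (path is an assumption).

report_fallback() {
    if [ -f "$1" ]; then
        echo "rootfs_fallback=true"
    else
        echo "rootfs_fallback=false"
    fi
}

report_fallback "${FALLBACK_MARKER:-/data/fellback-to-b}"
```

A device group filtered on rootfs_fallback=true (if your Mender plan supports dynamic groups) could then be the target of the redeployment.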

Thanks, these are great suggestions to think about. I feel that you may have some better suggestions… maybe from lower down in the architecture of my devices.

Hmm, I don’t have any quick solutions. In general, redundancy is the way to go, I think. So even though it might be expensive, a separate rescue partition sounds like something you might want.

But I think I understand your motivation behind falling back to the other partition. It can also be a solution, but I would not push it into production myself, I think.