Mender 1.7 standalone - Power failure case failing

I’m testing Mender 1.7 in standalone mode. The procedure is as follows:

  • Run mender -rootfs “HTTPS local server path of mender artifact”
  • While the artifact is downloading, pull the power plug when the download is around 80% complete.
  • Connect the power plug again and boot the system.

I expected the fail-safe mechanism to kick in here, but I’m getting the error below when performing the above procedure:

[ 1.859484] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
[ 1.868841] VFS: Mounted root (ext4 filesystem) readonly on device 179:1.
[ 1.876893] devtmpfs: mounted
[ 1.880023] Freeing unused kernel memory: 448K
[ 1.885645] EXT4-fs warning (device mmcblk0p1): dx_probe:792: inode #3451: comm swapper/0: dx entry: limit 0 != root limit 125
[ 1.897057] EXT4-fs warning (device mmcblk0p1): dx_probe:864: inode #3451: comm swapper/0: Corrupt directory, running e2fsck is recommended
[ 1.909711] Starting init: /sbin/init exists but couldn’t execute it (error -4094)
[ 1.918429] EXT4-fs warning (device mmcblk0p1): dx_probe:792: inode #2868: comm swapper/0: dx entry: limit 0 != root limit 125
[ 1.929845] EXT4-fs warning (device mmcblk0p1): dx_probe:864: inode #2868: comm swapper/0: Corrupt directory, running e2fsck is recommended
[ 1.942467] Starting init: /etc/init exists but couldn’t execute it (error -4094)
[ 1.951434] Starting init: /bin/init exists but couldn’t execute it (error -4094)
[ 1.958954] Starting init: /bin/sh exists but couldn’t execute it (error -4094)
[ 1.966279] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[ 1.980454] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.14.62-imx_4.14.62_1.0.0_beta+g1907fe4 #1
[ 1.994287] Call trace:
[ 1.996740] [ffff000008089c08] dump_backtrace+0x0/0x3c8
[ 2.002145] [ffff000008089fe4] show_stack+0x14/0x20
[ 2.007202] [ffff000008757620] dump_stack+0x9c/0xbc
[ 2.012257] [ffff0000080cb0b8] panic+0x124/0x29c
[ 2.017053] [ffff000008769cfc] kernel_init+0xec/0x100
[ 2.022282] [ffff000008084ed8] ret_from_fork+0x10/0x18
[ 2.027598] SMP: stopping secondary CPUs
[ 2.031524] Kernel Offset: disabled
[ 2.035015] CPU features: 0x0802008
[ 2.038497] Memory Limit: none
[ 2.041552] Rebooting in 10 seconds…

On further debugging, I found the following:

  • When the artifact download started, the boot partition variables read from the running system were:

mender_boot_part=2
mender_boot_part_hex=2

  • But after the power failure and power-on, the values seen from U-Boot became:

mender_boot_part=1
mender_boot_part_hex=1

I believe this is what causes the kernel panic above, since the inactive partition had already been partially written and was left corrupted by the power failure. Please correct me if I’m wrong.

But how do I set this up properly? Why does U-Boot switch to the inactive partition after the power failure? Am I missing anything here?

PS: If I don’t update using a Mender artifact and instead change the partitions manually (by setting mender_boot_part and mender_boot_part_hex), there are no issues and only the expected partition is marked as active.

Does anyone know what is missing here? Should I provide any specific details?

Please note that the integration checklist has already been verified and is working fine for the platform.

Any help would be really appreciated.

This is worrying. Whilst I don’t know the answer, the first thing I would do is look through the source code on GitHub to see when the U-Boot variables get changed in standalone mode.

I myself use standalone mode, but have not checked this scenario yet. I have now added it to the list.

@dellgreen Thank you, and agreed on checking the source code. Since this is a power failure test, and a power failure can happen at any point during the update, I thought of asking the experts :slight_smile:

Unfortunately I won’t be able to check this scenario this week as I’m working on other things. Maybe one of the Go developers, @mirzak, can confirm whether the U-Boot variables get changed prior to download in standalone mode under some conditions. My understanding was that they didn’t get changed until you ran mender-client with the -commit argument after doing a mender-client -rootfs. This certainly seems to be the case, as several times I have forgotten to commit the changes and rebooted, only to end up on the old partition by mistake.

In your case that doesn’t sound like what is happening. Could it be something else you have going on? For example, do you have your own systemd service that touches mender-client on boot or touches the U-Boot variables?

I’m not sure whether this is a mender standalone issue or something with my set-up. I’m disabling the mender client daemon during the boot-up as below:

sudo systemctl stop mender
sudo systemctl disable mender

AFAIK, Mender standalone changes the boot partition only once the client has run -rootfs and finished the download, not before that. The -commit then marks the partition as active once we have rebooted and verified.

I made the following observation during further testing:

  • I have the following rootfs partitions (please note that this differs from the standard Mender partition layout, where an additional (optional) boot partition is present):

fw_setenv mender_boot_part 1
fw_setenv mender_boot_part_hex 1

fw_setenv mender_boot_part 2
fw_setenv mender_boot_part_hex 2

  • If I’m on the first partition (i.e. mender_boot_part 1) and run the update, the power failure test passes and I get the same partition (i.e. mender_boot_part 1) after power-on and reboot.
  • If I’m on the second partition (i.e. mender_boot_part 2) and run the update, the power failure test fails and I get the other partition (i.e. mender_boot_part 1) after power-on and reboot!
  • I don’t understand why it is always set to the first partition (i.e. mender_boot_part 1) in the power failure and reboot scenario.

I think @mirzak can give more details regarding this.

I can confirm that the client does not modify the U-Boot environment at all before the download has finished completely. Are you sure that the rootfs partitions are set correctly in /etc/mender/mender.conf?

Please see the mender.conf details of the platform below:

cat /etc/mender/mender.conf
{
"InventoryPollIntervalSeconds": 5,
"RetryPollIntervalSeconds": 30,
"RootfsPartA": "/dev/mmcblk0p1",
"RootfsPartB": "/dev/mmcblk0p2",
"ServerCertificate": "/etc/mender/server.crt",
"ServerURL": "https://docker.mender.io",
"TenantToken": "dummy",
"UpdatePollIntervalSeconds": 5
}

Do you suspect anything else that could be causing the switch?

It looks correct. Can you post the whole output from fw_printenv?

Please see the fw_printenv output below, taken while running from /dev/mmcblk0p2:

fw_printenv
altbootcmd=run mender_altbootcmd; run bootcmd
bootlimit=1
bootcount=0
upgrade_available=0
mender_uboot_boot=mmc 0:1
mender_uboot_if=mmc
mender_uboot_dev=0
mender_boot_kernel_type=booti
mender_kernel_name=Image
mender_dtb_name=fsl-imx8dx-ccu.dtb
mender_pre_setup_commands=run m4boot_0
mender_post_setup_commands=
mender_setup=if test "${mender_pre_setup_commands}" != ""; then run mender_pre_setup_commands; fi; setenv mender_kernel_root /dev/mmcblk0p${mender_boot_part}; if test ${mender_boot_part} = 1; then setenv mender_boot_part_name /dev/mmcblk0p1; else setenv mender_boot_part_name /dev/mmcblk0p2; fi; setenv mender_kernel_root_name ${mender_boot_part_name}; setenv mender_uboot_root mmc 0:${mender_boot_part_hex}; setenv mender_uboot_root_name ${mender_boot_part_name}; setenv expand_bootargs "setenv bootargs \"${bootargs}\""; run expand_bootargs; setenv expand_bootargs; if test "${mender_post_setup_commands}" != ""; then run mender_post_setup_commands; fi
mender_altbootcmd=if test ${mender_boot_part} = 1; then setenv mender_boot_part 2; setenv mender_boot_part_hex 2; else setenv mender_boot_part 1; setenv mender_boot_part_hex 1; fi; setenv upgrade_available 0; saveenv; run mender_setup
mender_try_to_recover=if test ${upgrade_available} = 1; then reset; fi
bootcmd=run mender_setup; setenv bootargs root=${mender_kernel_root} ${bootargs}; if test "${fdt_addr_r}" != ""; then load ${mender_uboot_root} ${fdt_addr_r} /boot/${mender_dtb_name}; fi; load ${mender_uboot_root} ${loadaddr} /boot/${mender_kernel_name}; ${mender_boot_kernel_type} ${loadaddr} - ${fdt_addr_r}; run mender_try_to_recover
bootdelay=3
baudrate=115200
ethprime=eth0
loadaddr=0x80280000
mfgtool_args=setenv bootargs console=${console},${baudrate} rdinit=/linuxrc clk_ignore_unused
kboot=booti
bootcmd_mfg=run mfgtool_args;if iminfo ${initrd_addr}; then if test ${tee} = yes; then bootm ${tee_addr} ${initrd_addr} ${fdt_addr}; else booti ${loadaddr} ${initrd_addr} ${fdt_addr}; fi; else echo "Run fastboot …"; fastboot 0; fi;
initrd_addr=0x83100000
initrd_high=0xffffffffffffffff
m4_0_image=m4_ccu_app.bin
loadm4image_0=load mmc 0:5 ${loadaddr} ${m4_0_image}
m4boot_0=run loadm4image_0; dcache flush; bootaux ${loadaddr} 0
fdt_file=fsl-imx8dx-ccu.dtb
fdt_addr_r=0x83000000
fdt_high=0xffffffffffffffff
ethaddr=00:01:02:03:04:05
loadaddr=0x80280000
kernel=Image
bootargs=console=ttyLP0,115200 earlycon=lpuart32,0x5a060000,115200 rootwait rootfstype=ext4 rw
mender_boot_part=2
mender_boot_part_hex=2

The issue occurs only when the system is running from RootfsPartB (i.e. /dev/mmcblk0p2 or mender_boot_part 2).

I have not been able to reproduce the reported issues on one of my boards.

If this only happens when you are running on RootfsPartB, it could mean that you are somehow corrupting the U-Boot environment and it reverts to the default one, which would put you on mender_boot_part=1 again.

Can you post the output of fdisk -l <your disk on device>, and also say where the U-Boot environment is stored?
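One way to test this hypothesis is to check whether the stored environment still has a valid checksum. Below is a minimal Python sketch, assuming the usual U-Boot layout (a 4-byte CRC32 followed by the variable data, with one extra flag byte when a redundant environment is configured) and a little-endian target. ENV_SIZE is a hypothetical value, and make_env is only a helper for experimenting on a desktop machine; neither comes from this thread.

```python
import struct
import zlib

ENV_SIZE = 0x2000  # hypothetical; use the CONFIG_ENV_SIZE of your U-Boot build


def env_is_valid(blob, redundant=False):
    """Check the CRC32 header of a raw U-Boot environment blob.

    Assumes a 4-byte little-endian CRC32, an optional 1-byte flag when
    redundant environments are enabled, then the variable data.
    """
    header = 5 if redundant else 4
    if len(blob) <= header:
        return False
    (stored_crc,) = struct.unpack("<I", blob[:4])
    return zlib.crc32(blob[header:]) == stored_crc


def make_env(variables, size=ENV_SIZE):
    """Build a non-redundant environment blob for experimenting (not for flashing)."""
    data = b"".join(f"{k}={v}".encode() + b"\0" for k, v in variables.items())
    data = (data + b"\0").ljust(size - 4, b"\0")  # zero padding is an assumption
    return struct.pack("<I", zlib.crc32(data)) + data


good = make_env({"mender_boot_part": "2", "mender_boot_part_hex": "2"})
# Simulate a stray write landing inside the environment: flip one data byte.
clobbered = good[:100] + bytes([good[100] ^ 0xFF]) + good[101:]

print(env_is_valid(good))       # True
print(env_is_valid(clobbered))  # False
```

On the device, the same check could be run against ENV_SIZE bytes read from the disk at the environment offset. When the checksum does not match, U-Boot normally falls back to its compiled-in default environment, which would explain mender_boot_part going back to 1.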

Thank you @mirzak for confirming that this issue does not occur by default in standalone mode.

Please see the fdisk -l <your disk on device> output below:

fdisk -l /dev/mmcblk0
Disk /dev/mmcblk0: 7456 MB, 7818182656 bytes, 15269888 sectors
238592 cylinders, 4 heads, 16 sectors/track
Units: cylinders of 64 * 512 = 32768 bytes

Device       Boot StartCHS    EndCHS        StartLBA     EndLBA    Sectors  Size Id Type
/dev/mmcblk0p1    32,0,1      31,3,16           2048     657407     655360  320M 83 Linux
/dev/mmcblk0p2    32,0,1      31,3,16         657408    1312767     655360  320M 83 Linux
/dev/mmcblk0p3    32,0,1      287,3,16       1312768    1329151      16384 8192K 83 Linux
/dev/mmcblk0p4    288,0,1     1023,3,16      1329152   15269887   13940736 6807M  5 Extended
/dev/mmcblk0p5    320,0,1     383,3,16       1331200    1335295       4096 2048K 83 Linux
/dev/mmcblk0p6    416,0,1     415,3,16       1337344    5531647    4194304 2048M 83 Linux
/dev/mmcblk0p7    448,0,1     447,3,16       5533696    6582271    1048576  512M 83 Linux
/dev/mmcblk0p8    480,0,1     991,3,16       6584320    6617087      32768 16.0M 83 Linux
/dev/mmcblk0p9    0,0,1       1023,3,16      6619136   15269887    8650752 4224M 83 Linux

We are storing the U-Boot environment at offset 0x400000 of /dev/mmcblk0.

Please let me know whether anything is missing or wrong here.

Yeah as suspected :slight_smile:

/dev/mmcblk0p1 32,0,1 31,3,16 2048 657407 655360 320M 83 Linux

The partition above starts at sector 2048, which is byte offset 0x100000 (2048 × 512), and it is 320M long, so it covers your U-Boot environment at 0x400000. This means that when you write data to /dev/mmcblk0p1, you overwrite what is at 0x400000.
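The arithmetic can be double-checked with a few lines of Python. This is only a sketch that uses the sector numbers from the fdisk listing and the 0x400000 offset mentioned in this thread; the helper name partition_containing is made up for illustration.

```python
SECTOR_SIZE = 512  # bytes per sector, as reported by fdisk

# (name, StartLBA, EndLBA) copied from the fdisk listing in this thread
PARTITIONS = [
    ("/dev/mmcblk0p1", 2048, 657407),
    ("/dev/mmcblk0p2", 657408, 1312767),
    ("/dev/mmcblk0p3", 1312768, 1329151),
]

ENV_OFFSET = 0x400000  # U-Boot environment location stated in this thread


def partition_containing(offset):
    """Return the name of the partition whose byte range contains offset, or None."""
    for name, start_lba, end_lba in PARTITIONS:
        start = start_lba * SECTOR_SIZE
        end = (end_lba + 1) * SECTOR_SIZE  # exclusive end, EndLBA is inclusive
        if start <= offset < end:
            return name
    return None


p1_start = PARTITIONS[0][1] * SECTOR_SIZE
print(hex(p1_start))                     # 0x100000
print(partition_containing(ENV_OFFSET))  # /dev/mmcblk0p1
```

So an update streamed to /dev/mmcblk0p1 (which is what happens when running from /dev/mmcblk0p2) passes straight over the environment, while an update streamed to /dev/mmcblk0p2 does not touch it.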

Sorry, I didn’t get it :thinking: Could you please elaborate a little more?

Does this mean that when I’m on /dev/mmcblk0p2 and run the update (streaming to /dev/mmcblk0p1), it will overwrite what is at 0x400000?

How can I resolve this problem?

I believe the issue is that you have the U-Boot environment at 0x400000, but your /dev/mmcblk0p1 partition starts before that location (at 0x100000) and extends well past it, so it completely covers the environment. Because your partition table has no record of the U-Boot environment being there, Linux has no way of knowing that there’s data it shouldn’t overwrite when performing an update.

I’d suggest enhancing whatever method you’re using to generate your partition table to account for the fact that you have essentially a shadow partition on the disk that needs to be avoided.
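As a rough idea of what such a guard could look like, here is a Python sketch of a layout check that could run in whatever script generates the partition table. The environment size (0x2000), the 1 MiB alignment, and all function and region names are assumptions made for illustration; only the sector numbers and the 0x400000 offset come from this thread.

```python
SECTOR_SIZE = 512


def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open byte ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def find_conflicts(partitions, reserved):
    """partitions: {name: (start_lba, end_lba_inclusive)}
    reserved: {name: (byte_offset, byte_size)} for raw regions outside the table.
    Returns a list of (partition_name, reserved_name) pairs that collide."""
    conflicts = []
    for pname, (start_lba, end_lba) in partitions.items():
        p_start = start_lba * SECTOR_SIZE
        p_end = (end_lba + 1) * SECTOR_SIZE
        for rname, (offset, size) in reserved.items():
            if overlaps(p_start, p_end, offset, offset + size):
                conflicts.append((pname, rname))
    return conflicts


def first_safe_lba(reserved, align_bytes=1024 * 1024):
    """Smallest aligned sector at or beyond the end of every reserved region."""
    end = max(offset + size for offset, size in reserved.values())
    aligned = -(-end // align_bytes) * align_bytes  # round up to the alignment
    return aligned // SECTOR_SIZE


reserved = {"uboot-env": (0x400000, 0x2000)}  # size is a hypothetical value
current = {"rootfs-a": (2048, 657407), "rootfs-b": (657408, 1312767)}

print(find_conflicts(current, reserved))  # [('rootfs-a', 'uboot-env')]
print(first_safe_lba(reserved))           # 10240
```

With these assumed numbers, starting the first rootfs partition at sector 10240 (5 MiB) instead of 2048 would keep it clear of the environment; failing the build when find_conflicts returns anything would catch the problem before it reaches a device.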

Thank you very much @dellgreen @TimFroehlich @kacf and @mirzak for your support. I got the point :+1:

Should/could we add something to meta-mender to detect such situations and warn the user (or raise an error) during the build? It was not immediately obvious to an average user what the cause of this problem was.

Sorry to jump in, but imo, if it’s low impact for everybody and reasonably doable, I’m all for making meta-mender detect such situations.

(Just my 2 cents…)

If one is using the provided sdimg image type, one is protected from this happening, and there are more sanity checks that make sure things do not overlap.

But I believe @ajithpv is using a custom image type, and in that case there is nothing we could add to meta-mender to check this, as we have no insight into how the image is generated. At least nothing that I am aware of, and if someone has creative ideas please feel free :smiley:

Yes, I agree with @mirzak’s point. I’m using a custom partition layout which is different from the standard Mender partition layout. Hence, I believe that whoever makes the custom changes should take care of the sanity checks :slight_smile: