Strange reboot behaviour after successful board integration

Hey!

I’ve been integrating mender into our product and I think I’m 99% done.
However, I’ve run into strange behavior that I cannot explain.

Setup:
Allwinner S3 boot from eMMC

Partition table
uboot at start sector of eMMC (not in partition table, boot command hardcoded with mender patches applied)
rootfsA at mmcblk2p1 (ro)
rootfsB at mmcblk2p2 (ro)
/data at mmcblk2p3 (rw)

I have a symlink in /var/lib to point at /data/mender to store persistent mender data.
Same for /etc/fw_env.config
fw_printenv works, and every item on the integration checklist at https://docs.mender.io/2.4/devices/yocto-project/bootloader-support/u-boot/integration-checklist works exactly as expected.

However:
I’m booting from A. The mender daemon starts and connects to the server correctly. Once the device is accepted in the admin interface and an update is scheduled, mender downloads the image and writes it to partition B. However, it does not change the u-boot config. Mender then reboots, the system comes up from A again, and only after the service has started does mender change the boot config to boot from B. On the server, I get “update successful”. Then nothing happens. If I now reboot manually, the system boots correctly from B.

The same happens if I start a new update, just the other way around.

From my understanding, this behavior doesn’t make any sense. I would expect mender to temporarily boot to the other partition after the successful update, and if the update worked and mender daemon starts, it sets this partition as the permanent boot partition.
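The expected flow described above can be sketched as a small simulation. This is a simplified model, not Mender's or U-Boot's actual code: it assumes the usual Mender U-Boot variables (mender_boot_part, upgrade_available, bootcount) and a bootlimit of 1, and the function names are made up for illustration.

```python
# Simplified model of the expected A/B update flow (illustrative only).
# Partition "1" is rootfsA, "2" is rootfsB, as in the setup above.

def other_part(part):
    return "2" if part == "1" else "1"

def install_update(env):
    """Client side, after writing the image: point U-Boot at the inactive
    partition and arm the rollback mechanism."""
    env = dict(env)
    env["mender_boot_part"] = other_part(env["mender_boot_part"])
    env["upgrade_available"] = "1"
    env["bootcount"] = "0"
    return env

def uboot_boot(env, bootlimit=1):
    """Bootloader side: count boot attempts while an upgrade is pending and
    roll back once the limit is exceeded. Returns (new_env, booted_part)."""
    env = dict(env)
    if env["upgrade_available"] == "1":
        env["bootcount"] = str(int(env["bootcount"]) + 1)
        if int(env["bootcount"]) > bootlimit:
            # Roughly what altbootcmd is assumed to do: switch back.
            env["mender_boot_part"] = other_part(env["mender_boot_part"])
            env["upgrade_available"] = "0"
    return env, env["mender_boot_part"]

def commit_update(env):
    """Client side, after a successful boot into the new partition."""
    env = dict(env)
    env["upgrade_available"] = "0"
    return env

if __name__ == "__main__":
    env = {"mender_boot_part": "1", "upgrade_available": "0", "bootcount": "0"}
    env = install_update(env)
    env, booted = uboot_boot(env)
    print("first boot after install:", booted)
    env = commit_update(env)
    env, booted = uboot_boot(env)
    print("boot after commit:", booted)
```

In this model the first reboot after the install already lands on B, and B only becomes permanent after the commit. The behavior reported above differs: the first reboot still lands on A.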

All my configs and u-boot patches are based on this repo: https://github.com/mendersoftware/buildroot-mender/tree/master/buildroot-external-mender

I used the mender package available in buildroot-2020-05.

You can find the persistent storage, including some logs of the updates here:
https://drive.google.com/drive/folders/1e163zHSW4SokoRncK-CjfWyl3DIgzuyC?usp=sharing

I just used the same rootfs generated for provisioning to create the artifacts.

Any help would be greatly appreciated!

Thanks and best regards
Niklas Fauth
go-e GmbH

Hi @niklas welcome to Mender hub.

I think the interesting part of your log info is:

{"level":"debug","message":"Marking inactive partition as a boot candidate successful.","timestamp":"2020-07-03T01:12:02+02:00"}
{"level":"info","message":"State transition: update-install [ArtifactInstall] -\u003e reboot [ArtifactReboot_Enter]","timestamp":"2020-07-03T01:12:02+02:00"}
{"level":"debug","message":"statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds","timestamp":"2020-07-03T01:12:02+02:00"}
{"level":"debug","message":"statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds","timestamp":"2020-07-03T01:12:02+02:00"}
{"level":"debug","message":"handling reboot state","timestamp":"2020-07-03T01:12:02+02:00"}
{"level":"debug","message":"status reported, response 204 No Content","timestamp":"2020-07-03T01:12:02+02:00"}
{"level":"info","message":"rebooting device(s)","timestamp":"2020-07-03T01:12:02+02:00"}
{"level":"info","message":"Mender rebooting from active partition: /dev/mmcblk2p1","timestamp":"2020-07-03T01:12:02+02:00"}

Unfortunately all that does is confirm what you are seeing; it doesn’t seem to have any reason why.

Can you get a serial console log of the setup across the reboot? And another one where you interrupt it at U-Boot before launching and print the environment?

Drew

Hi Drew!

Thanks for your reply.
Here are the logs you requested.
For the captures, with the setup powered off, I uploaded a new release to the mender server. I then powered the module on, let it do the update, reboot, and then do another, manual reboot.

Here is the log of an update without uboot interruption:

And here is a log with me interrupting uboot both after the automatic and the manual reboot:

I hope that helps with analyzing the problem. To me, the uboot env looks exactly as the mender daemon set it. The question remains why the mender daemon sets the wrong partition active before reboot.

Best regards and have a nice weekend!
Niklas

So I spent today porting uboot 2020.04 to our platform. Before, I used a uboot fork from 2017, and I wanted to be sure that wasn’t the problem. Unlike when integrating mender into that old version, I had no problems patching uboot at all and I was able to select all required config options via menuconfig. So I’m now confident my uboot and env setup is exactly as expected by mender.

Unfortunately, the behavior is just the same. So now I assume this is related to a config problem.

Just to make sure, is there anything else that could go wrong with uboot even though everything on https://docs.mender.io/2.4/devices/yocto-project/bootloader-support/u-boot/integration-checklist works as expected?

What does mender use the redundant env for? I understand that it is nice to have. But is it required?
What I find strange is that if I comment out one of the two config lines in /etc/fw_env.config, I fail to read the env. I would expect that with either one of the two config lines I can read either the main or the redundant environment. Why do I require both to match the config? Is this expected behavior?
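For context, each non-comment line in /etc/fw_env.config names a device, an offset and an environment size. A possible explanation for the read failure (an assumption, not confirmed in this thread) is that the fw-utils treat a two-line config as a redundant environment, which carries an extra flags byte after the CRC, so a one-line config would parse the same bytes with a different layout and fail the CRC check. The sketch below only checks that a config has two same-sized, non-overlapping entries. The helper names and the offsets in the example are hypothetical.

```python
from typing import List, NamedTuple

class EnvEntry(NamedTuple):
    device: str
    offset: int
    size: int

def parse_fw_env_config(text: str) -> List[EnvEntry]:
    """Parse 'device offset size [...]' lines, ignoring '#' comments."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"malformed line: {raw!r}")
        # int(x, 0) accepts both decimal and 0x-prefixed hex values.
        entries.append(EnvEntry(fields[0], int(fields[1], 0), int(fields[2], 0)))
    return entries

def check_entries(entries: List[EnvEntry]) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if len(entries) != 2:
        problems.append(f"expected two entries for a redundant env, got {len(entries)}")
        return problems
    a, b = entries
    if a.size != b.size:
        problems.append("the two copies have different sizes")
    same_device = a.device == b.device
    if same_device and a.offset < b.offset + b.size and b.offset < a.offset + a.size:
        problems.append("the two copies overlap")
    return problems

if __name__ == "__main__":
    # Hypothetical offsets; use the values your U-Boot build was configured with.
    example = """
    # device      offset    size
    /dev/mmcblk2  0x400000  0x4000
    /dev/mmcblk2  0x800000  0x4000
    """
    print(check_entries(parse_fw_env_config(example)) or "no problems found")
```

A check like this cannot tell whether the offsets match what U-Boot itself uses; that still has to be compared against the U-Boot configuration by hand.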

I find it hard to believe that this bug is actually related to mender, so I want to make sure all I touched during porting is done correctly.

I’m very happy for any hint on where to look next!

Thanks and best regards
Niklas

Hmm. I wonder if perhaps the fw_env.config is not correct and maybe you are writing to an address that is not used by UBoot. That would cause the Mender client to write the correct data but then UBoot itself would see the old data.

The redundant environment is how UBoot handles the fact that it cannot guarantee atomic writes to flash or something like that. @kacf may have more input here.

Hey Drew!

I think this is not the case since using fw_setenv I can make uboot boot the other partition, like in the integration checklist, part 14.

Also, if that was the case, mender wouldn’t be able to change the boot partition at all.
But it can do so, it just does it incorrectly.

Thanks!

Well, I was thinking that perhaps U-Boot and the fw-utils shared one of the environment blocks but not the other. Then sometimes the writes would go to the block that was shared and in that case it would work.

Yes, the redundant environment is required, otherwise your updates are not atomic.

Drew also has a point about sharing environment blocks: If one block is overlapping, but not the other, then some writes may be picked up and some may not. What you can do is to try to do several fw_setenv test 1 commands from the command line, with a different value each time. Make sure to reboot between each one, and ensure that every single one of those is picked up at the U-Boot prompt with printenv test.
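To illustrate why alternating writes could go unnoticed, here is a toy model of a redundant environment. It assumes each copy carries a serial number, the newest valid copy is the active one, and every write goes to the other copy. The offsets and function names are hypothetical, and serial wraparound and CRC handling are left out.

```python
from dataclasses import dataclass, field

@dataclass
class EnvCopy:
    serial: int                      # incremented on every write
    variables: dict = field(default_factory=dict)

def active_offset(storage, offsets):
    """Offset of the newest copy among the given offsets, or None."""
    present = [o for o in offsets if o in storage]
    return max(present, key=lambda o: storage[o].serial) if present else None

def fw_setenv(storage, offsets, name, value):
    """Userspace write: copy the active env, update it, store it in the
    other slot with a higher serial."""
    current = active_offset(storage, offsets)
    if current is None:
        variables, serial, target = {}, 0, offsets[0]
    else:
        variables = dict(storage[current].variables)
        serial = storage[current].serial
        target = offsets[1] if current == offsets[0] else offsets[0]
    variables[name] = value
    storage[target] = EnvCopy(serial + 1, variables)

def uboot_printenv(storage, offsets, name):
    """Bootloader read: use the newest copy at U-Boot's own offsets."""
    current = active_offset(storage, offsets)
    return None if current is None else storage[current].variables.get(name)

def run_scenario(uboot_offsets, tool_offsets, writes=4):
    storage = {uboot_offsets[0]: EnvCopy(1, {"test": "0"})}
    seen = []
    for i in range(1, writes + 1):
        fw_setenv(storage, tool_offsets, "test", str(i))
        seen.append(uboot_printenv(storage, uboot_offsets, "test"))
    return seen

if __name__ == "__main__":
    # Matching offsets: every write is visible to U-Boot.
    print(run_scenario((0x400000, 0x800000), (0x400000, 0x800000)))
    # Second copy mismatched: only every other write is visible.
    print(run_scenario((0x400000, 0x800000), (0x400000, 0xC00000)))
```

In the mismatched case U-Boot only sees a change on every second write, which would look like the boot partition switching one step late. Whether that is what happens on this board is exactly what the fw_setenv test above should reveal.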