Mender_setup failure orphans deployed sensors. Need remote recovery solution!

Environment
Client: Mender version 2.1.1
Client Mode: Managed
Server: 2.1
Bootloader: U-boot - ver=U-Boot 2019.01
HW: Custom - based on BBB core reference.
OS: Bespoke/linux based
Build System: Yocto

Background
We have ~48,000 sensors in the field that are currently orphaned due to mender_setup failing to ‘initialize’ u-boot, resulting in a failed canary check when attempting an OTA update.

This can be fixed by simply touching the U-Boot environment, for example by setting an arbitrary variable:
fw_setenv random_var 1
After the next reboot, mender_setup runs successfully and OTA updates succeed as expected.

Sadly, the failed canary check exception is raised before the ArtifactInstall_Enter state, so we cannot apply this fix with a state script, and therefore cannot deploy it over the air.
Update Modules are not implemented.

Given the cost to roll a truck for all the afflicted devices, we are asking for recommendations, ideas, cheats, voodoo magic, anything to help us bring these orphaned devices back into the family.

Notes
How did this happen?
So, a sad circumstance of the recommended integration testing, which is the first smoke test we perform on new release candidates, is that it prescribes writing an envvar to u-boot, rebooting, and reading the written variable back to ensure fw_* works. This very action masked the bug: when our CI system moved on to test OTA updates, they succeeded because u-boot had been ‘touched’, which let mender_setup run successfully on subsequent boots.
Unfortunately, in manufacturing, the image is written to the mmc while it’s on the reel, and no subsequent ‘touch’ of u-boot happens, so these units were all manufactured into inventory and then deployed to the field without detection.

Any help or suggestions are greatly appreciated!
SLR-

That is unfortunate! Do you have Update Modules installed on the devices? On Yocto this would be PACKAGECONFIG_append_pn-mender-client = " modules". If so, you can use a script update to run arbitrary commands on the device, because non-rootfs-image updates do not perform the check.

Thanks for the prompt response!
Sadly, no. Our security model prohibits implementing potential execution vectors without a declared business need, and we didn’t anticipate this. I wish we had! Then we could just use the script module.

I’m afraid that you may be out of options then. The canary check happens very early, before the daemon even starts contacting the server, and if the check fails, then the rootfs-image setup is considered broken, and such Artifacts are simply not available to install. If you also don’t have any Update Modules, then the mender client is unfortunately dead weight on your devices right now; it cannot install any Artifact until the setup is fixed.

Sorry I cannot provide a better solution, but I’m fairly certain in this assessment, and I also tested the canary logic just to be sure.

Not that stdout/logging order is a real indication of a system’s order of operation, but I do see in the logs that the state script we implemented to fix this (ArtifactInstall_Enter_00) was “processing” before the canary exception is raised.
The script isn’t in “execute” state, but it is apparently seen by the installer.
I guess I was hoping this could be manipulated in some way to get this simple one-line script to run before the canary exception is raised.

2020-07-06 00:20:38 +0000 UTC info: no public key was provided for authenticating the artifact
2020-07-06 00:20:38 +0000 UTC info: Update Module path “/usr/share/mender/modules/v3” could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available
2020-07-06 00:20:38 +0000 UTC debug: checking if device [beaglebone-yocto] is on compatible device list: [beaglebone-yocto]
2020-07-06 00:20:38 +0000 UTC debug: installer: processing script: ArtifactInstall_Enter_00
2020-07-06 00:20:38 +0000 UTC debug: installer: successfully read artifact [name: 01-02-00-A02; version: 3; compatible devices: [beaglebone-yocto]]
2020-07-06 00:20:38 +0000 UTC debug: Trying to install update of size: 394264576
2020-07-06 00:20:38 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
2020-07-06 00:20:38 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
2020-07-06 00:20:38 +0000 UTC debug: MARC - Test saveenv canary variables: map[mender_check_saveenv_canary:1]
2020-07-06 00:20:38 +0000 UTC error: Artifact install failed: Payload: can not install Payload: core-image-minimal-beaglebone-yocto.ext4: No match between boot and root partitions.: Failed mender_saveenv_canary check. There is an error in the U-Boot setup. Likely causes are: 1) Mismatch between the U-Boot boot loader environment location and the location specified in /etc/fw_env.config. 2) ‘mender_setup’ is not run by the U-Boot boot script: exit status 1
2020-07-06 00:20:38 +0000 UTC info: State transition: update-store [Download_Enter] -> cleanup [Error]

You can ignore the log entry tagged "MARC - "; that’s just a print statement from a debug patch I implemented last month after chatting with you about this.

So, is the consensus that our fleet is pretty much ‘bricked’ by this bug?

I’ll defer to @kacf’s expertise on the canary stuff.

Is Mender the only channel you have to your sensors?

Hey Drew,
Well, we’ve got UART. No listening ports. There is an MQTT interface publishing data over cat-m1, and while we support IoT jobs on these devices, we explicitly prohibit arbitrary execution.
I’m suddenly missing the olden days, when everyone implemented back doors in their hardware sigh

You’re right, it is actually re-checked at that point as well. But this is before any Artifact scripts are run, so it still does not help, unfortunately.

I’ve been turning it upside down, and inside out in my head, but unfortunately, I think so. With both SSH and Update Modules disabled, there really is only one way into the device, and that is via a rootfs-image Artifact. But it won’t let you do that if it thinks the setup is wrong.

The check was of course implemented exactly to prevent these kinds of mistakes, but the test scenario and “touching” of U-Boot that you described managed to sidestep this safety check in an unexpected way and it’s now working against you instead. Certainly not intended, and I’ll try to think about ways that we can improve in this area.

I very much realize the magnitude of this problem, and I really wish I could be more helpful.

In the interest of providing good defaults for future users of Mender, I’m wondering if we should not make the script module opt-out instead of opt-in, like it is now. Your argument makes sense, but if we look a bit deeper at the technical details, having the script module available does not fundamentally open up new attack vectors, since script execution is already possible by using state scripts in a rootfs-image Artifact. Having this kind of emergency hatch can, as you have painfully experienced, be incredibly important when all bad luck strikes at once.

Hey kacf,

Thanks for the thorough response, much appreciated! Moving forward, do you know at which client release this bug is no longer present? I believe you mentioned in a previous thread that you were no longer testing u-boot integrations, but do you have a sense for which release does not exhibit this behavior (2.1.1 is our current rev, which fails mender_setup at init)? Specifically for uboot based beaglebone-yocto builds.

THNX!
SLR-

The saveenv_canary check has been in the client for quite a long time; it first appeared in version 1.6.

As for the bug, to be clear, the problem here is still in the U-Boot setup. The fact that the saveenv_canary check triggers is evidence that the U-Boot environment is incorrect at startup, and the touching of U-Boot is just hiding that fact. The best way forward for you, at least for future devices, would be to make sure that the test framework does not do the touching, and that a Mender update succeeds on a device with a pristine environment. Another possibility would be to destroy the environment in the test and reboot before doing an update, forcing it back to its default.
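A minimal sketch of the second option for a test rig, assuming the stored environment copies are exactly the ones listed in /etc/fw_env.config and that zeroing a copy is enough to make U-Boot fall back to its built-in default. The function name print_env_wipe_commands is made up for illustration. It only prints the dd commands rather than running them, since writing to the wrong offset would damage the device:

```sh
# Print (do NOT run) one dd command per entry of an fw_env.config-style file.
# Each printed command would overwrite that stored environment copy with zeros.
# Intended for disposable test units only, never for fielded devices.
# $1: path to the config file, one "device offset size" entry per line
print_env_wipe_commands() {
    while read -r dev offset size _rest; do
        case "$dev" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        # Assumption: offset and size are multiples of 512 bytes.
        echo "dd if=/dev/zero of=$dev bs=512 seek=$(($offset / 512)) count=$(($size / 512)) conv=notrunc"
    done < "$1"
}
```

The idea would be to review the printed commands, run them by hand on a test unit, reboot, and only then attempt the Mender update, with no fw_setenv call anywhere in between.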

One piece of reassurance is that the touching you mentioned will no longer work from the Yocto dunfell branch and onwards, because it uses a different user space tool. This tool doesn’t come with a default environment, so it cannot fix anything if the bootloader has not already fixed it. So it will be impossible to “accidentally succeed” as you did with your testing.

Hey kacf,

While we have worked around this, for the moment, I am compelled to understand the cause of this ‘problem in the U-Boot setup’. Building a nascent yocto warrior core-image-minimal with mender support results in a meter in this state. Any pointers on tracking down the culprit of this bad setup?

Notes/logs
root@304511b1242f:~# cat /etc/fw_env.config

/dev/mmcblk1 0x800000 0x20000
/dev/mmcblk1 0x1000000 0x20000

root@304511b1242f:~# cat /data/mender/mender.conf

{
"RootfsPartA": "/dev/mmcblk1p2",
"RootfsPartB": "/dev/mmcblk1p3"
}

root@304511b1242f:~# fw_printenv mender_setup

Warning: Bad CRC, using default environment
mender_setup=if test "${mender_saveenv_canary}" != "1"; then setenv mender_saveenv_canary 1; saveenv; fi; if test "${mender_pre_setup_commands}" != ""; then run mender_pre_setup_commands; fi; if test "${mender_systemd_machine_id}" != ""; then setenv bootargs systemd.machine_id=${mender_systemd_machine_id} ${bootargs}; fi; setenv mender_kernel_root /dev/mmcblk1p${mender_boot_part}; if test ${mender_boot_part} = 2; then setenv mender_boot_part_name /dev/mmcblk1p2; else setenv mender_boot_part_name /dev/mmcblk1p3; fi; setenv mender_kernel_root_name ${mender_boot_part_name}; setenv mender_uboot_root mmc 1:${mender_boot_part_hex}; setenv mender_uboot_root_name ${mender_boot_part_name}; setenv expand_bootargs "setenv bootargs \\"${bootargs}\\""; run expand_bootargs; setenv expand_bootargs; if test "${mender_post_setup_commands}" != ""; then run mender_post_setup_commands; fi

mender_check_saveenv_canary is set, so mender_setup appears to have run and saved the env, yet the CRC check still fails:
root@304511b1242f:~# fw_printenv mender_check_saveenv_canary

Warning: Bad CRC, using default environment
mender_check_saveenv_canary=1

mender complains at startup

Jul 21 05:51:35 304511b1242f mender[155]: time="2020-07-21T05:51:35Z" level=error msg="Failed to read the current active partition: No match between boot and root partitions.: Failed mender_saveenv_canary check. There is an error in the U-Boot setup. Likely causes are: 1) Mismatch between the U-Boot boot loader environment location and the location specified in /etc/fw_env.config. 2) 'mender_setup' is not run by the U-Boot boot script: exit status 1" module=cli

U-Boot complains about an attempt to read the environment from FAT?

U-Boot SPL 2019.01 (Jul 14 2020 - 06:37:11 +0000)
Trying to boot from MMC2
Loading Environment from FAT… Card did not respond to voltage select!
Loading Environment from MMC… *** Warning - bad CRC, using default environment

U-Boot 2019.01 (Jul 14 2020 - 06:37:11 +0000)

CPU : AM335X-GP rev 2.1
I2C: ready
DRAM: 512 MiB
No match for driver ‘omap_hsmmc’
No match for driver ‘omap_hsmmc’
Some drivers were not found
MMC: OMAP SD/MMC: 0, OMAP SD/MMC: 1
Loading Environment from FAT… Card did not respond to voltage select!
Loading Environment from MMC… *** Warning - bad CRC, using default environment

<ethaddr> not set. Validating first E-fuse MAC
Net: cpsw, usb_ether
Press SPACE to abort autoboot in 2 seconds
Saving Environment to FAT… Card did not respond to voltage select!
Failed (1)
34532 bytes read in 11 ms (3 MiB/s)
6504392 bytes read in 422 ms (14.7 MiB/s)
## Flattened Device Tree blob at 88000000
Booting using the fdt blob at 0x88000000
Loading Device Tree to 8fff4000, end 8ffff6e3 … OK

Note the failed save to FAT

THNX!
SLR-

mender_check_saveenv_canary is what triggers the check to be run; it’s not the pass condition [1]. For that you need to look for mender_saveenv_canary=1. The “Bad CRC” message is also an indicator: it should never occur on a healthy setup, and it never will if mender_saveenv_canary has been written.

However, what fw_printenv prints is actually irrelevant for this investigation, because this only displays the symptoms, not the cause. You need to investigate from the U-Boot prompt. Try stopping the boot at the prompt and running printenv. This output should be identical to the output you get from fw_printenv. My hypothesis is that bootcmd is not set up correctly to run mender_setup, and this causes the canary logic to not execute. Or maybe the Mender patches are missing entirely from the U-Boot built-in environment.

Note that the “Bad CRC” message will usually occur at the U-Boot prompt on first boot, but it should only occur once, and then never again.
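One way to do that comparison, as a sketch: save the printenv output from the U-Boot prompt (for example from the UART session log) into one file, save fw_printenv from the running system into another, and diff them after sorting. compare_env_dumps and the file names are hypothetical. Only tr, grep, sort and diff are assumed to exist on the machine doing the comparison:

```sh
# Compare two environment dumps, ignoring variable order and the carriage
# returns that serial-console captures usually contain.
# $1: capture of `printenv` taken at the U-Boot prompt
# $2: capture of `fw_printenv` taken on the running system
# Returns 0 when the name=value lines are identical, non-zero otherwise.
compare_env_dumps() {
    a=$(mktemp) && b=$(mktemp) || return 2
    # Keep only name=value lines, so prompts, warnings and the
    # "Environment size" footer do not show up as differences.
    tr -d '\r' < "$1" | grep -E '^[A-Za-z0-9_.-]+=' | sort > "$a"
    tr -d '\r' < "$2" | grep -E '^[A-Za-z0-9_.-]+=' | sort > "$b"
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Example call: compare_env_dumps uboot_printenv.txt fw_printenv.txt. Lines prefixed with < exist only at the U-Boot prompt, and lines prefixed with > exist only in the fw_printenv view.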

[1] This is for historical reasons. Very old versions of Mender images did not have the canary logic, and we could not simply add it to the client, since that would break if run on old images. So we introduced it in two places: one hard-coded mender_check_saveenv_canary, which triggers the check to be run, and one mender_saveenv_canary, which is only saved at runtime and which makes the check pass.
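The two-variable logic described above can be sketched as a small read-only check against a saved fw_printenv dump. This follows only the description in this thread, not the client source, and canary_status is a made-up helper name:

```sh
# Classify the canary state from a saved environment dump.
# $1: file containing name=value lines, e.g. from `fw_printenv > env.txt`
# Prints one of: not-checked, pass, fail
canary_status() {
    if ! grep -qx 'mender_check_saveenv_canary=1' "$1"; then
        # No hard-coded trigger (very old image): the check is not run.
        echo "not-checked"
    elif grep -qx 'mender_saveenv_canary=1' "$1"; then
        # The variable saved at runtime is present: the check passes.
        echo "pass"
    else
        # Trigger present, but the saved variable is missing.
        echo "fail"
    fi
}
```

For a dump that contains mender_check_saveenv_canary=1 but no mender_saveenv_canary=1 line, it prints fail.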

The printenvs differ. They both include the mender setup, etc, but they do, indeed, differ.

The CRC failures persist across boots.

Loading Environment from MMC… *** Warning - bad CRC, using default environment

Note:

Saving Environment to FAT… Card did not respond to voltage select!
Failed (1)

What is this failed attempt to write to a FAT?
Do I need to explicitly disable CONFIG_ENV_IS_IN_FAT?

SLR-

Yes, standard Mender setup uses CONFIG_ENV_IS_IN_MMC.

I’ll create a U-Boot recipe to apply this, disable the FAT target, and test that.
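A quick way to confirm the result of such a recipe change is to inspect the .config that the U-Boot build produces. The sketch below assumes the usual Kconfig output format (CONFIG_X=y when enabled, a "# CONFIG_X is not set" comment when disabled). check_env_location is a hypothetical helper, and the path to .config depends on the build directory:

```sh
# Report whether a built U-Boot .config keeps the environment in MMC only.
# $1: path to the .config produced by the U-Boot build
# Returns 0 when FAT storage is disabled and MMC storage is enabled.
check_env_location() {
    rc=0
    if grep -qx 'CONFIG_ENV_IS_IN_FAT=y' "$1"; then
        echo "CONFIG_ENV_IS_IN_FAT is enabled"
        rc=1
    fi
    if ! grep -qx 'CONFIG_ENV_IS_IN_MMC=y' "$1"; then
        echo "CONFIG_ENV_IS_IN_MMC is not enabled"
        rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "environment is MMC-only"
    fi
    return "$rc"
}
```

If the check passes, the boot log should presumably no longer show the "Loading Environment from FAT" and "Saving Environment to FAT" attempts, which is worth confirming on the console.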