Idempotency of Mender config files migration scripts

Hi,

The release of Mender client 2.2 made it necessary to add some migration scripts to our Yocto conf files for backward compatibility. We now need to add the following code:

IMAGE_INSTALL_append = " mender-migrate-configuration"
PACKAGECONFIG_remove = "split-mender-config"
MENDER_PERSISTENT_CONFIGURATION_VARS = "RootfsPartA RootfsPartB"
MENDER_ARTIFACT_EXTRA_ARGS_append = " -v 2"

Is this idempotent? Can this remain in our Yocto conf files even after the jump from Thud to Warrior?

The rationale is that a lot of our Mender-enabled devices can stay offline for a very long time. If we want to deploy Mender updates that work for all of our devices, regardless of how up-to-date they are, it’s important that our artifacts can be deployed safely whatever system the devices currently run.

Side note: couldn’t the Mender client have handled this migration automatically and without additional config?

I think it’s idempotent (I haven’t tested it though), but it’s still risky to run it every time, because there is no rollback for this change. It’s a bit like updating your boot loader: Yes, it’s idempotent, but you don’t want to do it on every update if you can avoid it.

I have a different suggestion for you: In addition to the migration script, include a state script in the update which changes /data/mender/device_type to a new device type. Once the update has finished, your fleet is divided into non-migrated (the old device type), and migrated (the new device type) devices, and Artifacts will never accidentally end up on the wrong one. This state script also won’t be able to roll back, but by doing it together with the migration script in one single Artifact, you are limiting the risk.

After this is done, just set MENDER_DEVICE_TYPE to the new device type for all subsequent builds.
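A minimal sketch of such a state script, written as POSIX shell. The script name, the new device type, and the single-line `device_type=<name>` file format (inferred from the "Read data from device manifest file: device_type=beaglebone-yocto" line in the client log) are assumptions for illustration, not something Mender ships.

```shell
#!/bin/sh
# Hypothetical state script, e.g. ArtifactCommit_Enter_20_set-device-type.
# Assumes the device type file holds a single line "device_type=<name>".

# set_device_type FILE NEW_TYPE
# Rewrites FILE so that it contains "device_type=NEW_TYPE".
# The new content goes to a temporary file first and is then renamed,
# which makes a half-written file after a power cut unlikely.
set_device_type() {
    file=$1
    new_type=$2
    tmp="${file}.tmp"

    # Already switched: nothing to do, so running the script twice is harmless.
    if [ "$(cat "$file" 2>/dev/null)" = "device_type=${new_type}" ]; then
        return 0
    fi

    printf 'device_type=%s\n' "$new_type" > "$tmp" || return 1
    sync
    mv "$tmp" "$file"
}
```

On a device the call would look like `set_device_type /data/mender/device_type my-board-migrated`, where `my-board-migrated` is a placeholder for whatever new device type you choose.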

OK thanks for the suggestion. I think we’ll do the following:

  • use a version identifier exposed through the device inventory to separate devices and plan deployments properly. On this subject, this would be easier to track with “smart” device groups.
  • use an ArtifactInstall_Enter script as a failsafe to block erroneous artifacts (e.g. an artifact which has the RPi bootfiles upgrade disabled but actually requires it) or needlessly dangerous ones (e.g. artifacts that migrate the Mender conf files when that’s already been done)
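A rough sketch of the second point, for an artifact that carries the configuration migration. The script name is hypothetical, and using the presence of a persistent configuration file as the "already migrated" marker is an assumption of this sketch; any marker you trust on your devices would do. As far as I understand Mender's state script handling, a non-zero exit from an ArtifactInstall_Enter script makes the client abort the installation.

```shell
#!/bin/sh
# Hypothetical ArtifactInstall_Enter_05_refuse-if-migrated, shipped only in
# artifacts that carry the configuration migration.

# already_migrated PERSISTENT_CONF
# Returns 0 (true) when the persistent configuration file exists and is
# non-empty. Treating that file as the "migration done" marker is an
# assumption of this sketch.
already_migrated() {
    [ -s "$1" ]
}

# guard_migration PERSISTENT_CONF
# Returns non-zero when the migration must not run again; a state script
# would hand that status to exit so the installation is refused.
guard_migration() {
    if already_migrated "$1"; then
        echo "configuration already migrated, refusing this artifact" >&2
        return 1
    fi
    return 0
}
```

In the state script itself the last line would be something like `guard_migration /data/mender/mender.conf || exit 1`, with the path adjusted to wherever your persistent configuration lives.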

An idea for later occurrences of such migrations, which are bound to happen: let the Mender client handle the migration automatically, and instead provide an opt-out flag for users who want more control. I cannot speak for all users, but I reckon this would offer a better tradeoff for a lot of us.

We have a mixed deployment, with tens of thousands of nodes running ~1.7; these units successfully update with the above migration script additions. HOWEVER, we have over 2,000 nodes deployed from our most recent manufacturing run that are natively running 2.1.1, and they all fail to take an OTA update from an artifact with the migration script additions included. Without exception, they throw a “Failed mender_saveenv_canary check” error and abort.

i.e.

2020-05-05 05:27:35 +0000 UTC info: Running Mender version 2.1.1
2020-05-05 05:27:35 +0000 UTC debug: handle update fetch state
2020-05-05 05:27:36 +0000 UTC debug: status reported, response 204 No Content
2020-05-05 05:27:36 +0000 UTC debug: Received fetch update response &{200 OK 200 HTTP/1.1 1 1 map[Accept-Ranges:[bytes] Connection:[keep-alive] Content-Length:[46405120] Content-Security-Policy:[block-all-mixed-content] Content-Type:[application/vnd.mender-artifact] Date:[Tue, 05 May 2020 05:27:36 GMT] Etag:[“943be2fdae469c7f1316500eb5714da4”] Last-Modified:[Mon, 27 Apr 2020 00:09:53 GMT] Server:[openresty/1.13.6.2] Strict-Transport-Security:[max-age=63072000; includeSubdomains; preload] Vary:[Origin] X-Amz-Request-Id:[160C0AA14D99108B] X-Content-Type-Options:[nosniff] X-Frame-Options:[DENY] X-Xss-Protection:[1; mode=block]] 0x80fec0 46405120 [] false false map[] 0x876900 0x867620}+
2020-05-05 05:27:36 +0000 UTC info: State transition: update-fetch [Download_Enter] -> update-store [Download_Enter]
2020-05-05 05:27:36 +0000 UTC debug: handle update install state
2020-05-05 05:27:36 +0000 UTC debug: status reported, response 204 No Content
2020-05-05 05:27:36 +0000 UTC debug: Read data from device manifest file: device_type=beaglebone-yocto
2020-05-05 05:27:36 +0000 UTC debug: Current manifest data: beaglebone-yocto
2020-05-05 05:27:36 +0000 UTC info: no public key was provided for authenticating the artifact
2020-05-05 05:27:36 +0000 UTC info: Update Module path “/usr/share/mender/modules/v3” could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available
2020-05-05 05:27:36 +0000 UTC debug: checking if device [beaglebone-yocto] is on compatible device list: [beaglebone-yocto]
2020-05-05 05:27:36 +0000 UTC debug: installer: processing script: ArtifactCommit_Enter_10_migrate-configuration
2020-05-05 05:27:36 +0000 UTC debug: installer: successfully read artifact [name: 01-02-00; version: 2; compatible devices: [beaglebone-yocto]]
2020-05-05 05:27:36 +0000 UTC debug: Trying to install update of size: 394264576
2020-05-05 05:27:37 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
2020-05-05 05:27:37 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
2020-05-05 05:27:37 +0000 UTC error: Artifact install failed: Payload: can not install Payload: core-image-minimal-beaglebone-yocto.ext4: No match between boot and root partitions.: Failed mender_saveenv_canary check. There is an error in the U-Boot setup. Likely causes are: 1) Mismatch between the U-Boot boot loader environment location and the location specified in /etc/fw_env.config. 2) ‘mender_setup’ is not run by the U-Boot boot script: exit status 1
2020-05-05 05:27:37 +0000 UTC info: State transition: update-store [Download_Enter] -> cleanup [Error]
2020-05-05 05:27:37 +0000 UTC debug: transitioning to error state
2020-05-05 05:27:37 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-05-05 05:27:37 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-05-05 05:27:37 +0000 UTC debug: Handling Cleanup state
2020-05-05 05:27:37 +0000 UTC info: State transition: cleanup [Error] -> update-status-report [none]
2020-05-05 05:27:37 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-05-05 05:27:37 +0000 UTC debug: handle update status report state
2020-05-05 05:27:37 +0000 UTC debug: status reported, response 204 No Content
2020-05-05 05:27:37 +0000 UTC debug: attempting to upload deployment logs for failed update

This entire wing of our fleet can’t be updated right now.

This doesn’t sound like a problem that is related to the earlier configuration migration. The canary check reveals that either the boot loader has been installed incorrectly, or the configuration of the U-Boot user space tools is incorrect. My guess would be that you cannot install any Artifact on these devices, and they are simply misconfigured from the manufacturer’s side. Do you have SSH access to them?

I suggest you start by getting your hands on one of the new ones, and try to figure out what is wrong with the setup. Things to look for are:

  • Does the content of /etc/fw_env.config look correct?
  • If you run fw_setenv test hello_from_shell, can you see that at the U-Boot prompt with printenv test?
  • If you run setenv test hello_from_uboot; saveenv in the U-Boot prompt, can you see that in the shell using fw_printenv test?

It should pass the entire integration checklist.
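The Linux half of the round-trip test above can be scripted; a small sketch follows. `FW_SETENV` and `FW_PRINTENV` are overridable only so the function can be tried off-target, and the function name is made up for this example. It proves only that user space can read back its own write; the `printenv test` step at the U-Boot prompt still has to be done by hand.

```shell
#!/bin/sh
# Linux half of the round-trip test. The two variables below default to
# the real tools and exist only so the sketch can be exercised with stubs.
FW_SETENV=${FW_SETENV:-fw_setenv}
FW_PRINTENV=${FW_PRINTENV:-fw_printenv}

# roundtrip_from_shell NAME VALUE
# Sets NAME=VALUE with fw_setenv and reads it back with fw_printenv -n,
# which prints only the value. Returns 0 when the value read back matches.
roundtrip_from_shell() {
    name=$1
    value=$2
    "$FW_SETENV" "$name" "$value" || return 1
    readback=$("$FW_PRINTENV" -n "$name" 2>/dev/null) || return 1
    [ "$readback" = "$value" ]
}
```

For example: `roundtrip_from_shell test hello_from_shell && echo "user-space round trip OK"`.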

Interesting. Thanks for the response! I will grab a unit from the lab tomorrow and test this.
A couple of things I should add: CI runs the integration checklist after build and deploy to our lab units (it is an imperfect world, so maybe something went wrong there and wasn’t asserted properly). Further, there is no difference between these build versions other than the Mender update: no Yocto updates, no new recipes, no configuration changes other than the migration edits to local.conf. The difference between builds is solely the Mender update, plus automatic changes to core-files to update /etc/issue and os-release with the new version tags.

Thanks again!

fw_env.config is identical to previous versions:

cat /etc/fw_env.config
/dev/mmcblk1 0x800000 0x20000
/dev/mmcblk1 0x1000000 0x20000

Running fw_setenv test hello_from_shell throws a CRC warning, but it does set the variable.

fw_setenv test hello_from_shell
Warning: Bad CRC, using default environment

fw_printenv test
test=hello_from_shell

Sadly, I forgot to grab the keyed driver needed to open the device enclosure to get to the UART, so I won’t be able to test from the U-Boot prompt until I can get back to the lab tomorrow. However, the U-Boot variables I set from the shell do persist if I pull the power from the unit, power it back up, boot, and run ‘fw_printenv test’ again.
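As far as I know, the "Bad CRC, using default environment" warning means the user-space tools found no copy with a valid checksum at the location given in /etc/fw_env.config and fell back to the defaults compiled into the tools. It may be worth recording whether a unit shows this warning before anything has been written with fw_setenv. A small sketch for that check; the function name is invented for this example and `FW_PRINTENV` is overridable only for off-target testing.

```shell
#!/bin/sh
FW_PRINTENV=${FW_PRINTENV:-fw_printenv}

# env_is_default
# Returns 0 when fw_printenv reports that it fell back to the built-in
# default environment, non-zero otherwise. Both output streams are
# searched, so it does not matter where the tool prints the warning.
env_is_default() {
    "$FW_PRINTENV" 2>&1 | grep -q '^Warning: Bad CRC'
}
```

Usage: `if env_is_default; then echo "no valid saved environment found"; fi`.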

The U-Boot environment variables follow:

fw_printenv
Warning: Bad CRC, using default environment
altbootcmd=run mender_altbootcmd; run bootcmd
bootlimit=1
bootcount=0
upgrade_available=0
mender_boot_part=2
mender_boot_part_hex=2
mender_uboot_boot=mmc 1:1
mender_uboot_if=mmc
mender_uboot_dev=1
mender_boot_kernel_type=bootz
mender_kernel_name=zImage
mender_dtb_name=am335x-bonegreen.dtb
mender_pre_setup_commands=
mender_post_setup_commands=
mender_check_saveenv_canary=1
mender_setup=if test "${mender_saveenv_canary}" != "1"; then setenv mender_saveenv_canary 1; saveenv; fi; if test "${mender_pre_setup_commands}" != ""; then run mender_pre_setup_commands; fi; if test "${mender_systemd_machine_id}" != ""; then setenv bootargs systemd.machine_id=${mender_systemd_machine_id} ${bootargs}; fi; setenv mender_kernel_root /dev/mmcblk1p${mender_boot_part}; if test ${mender_boot_part} = 2; then setenv mender_boot_part_name /dev/mmcblk1p2; else setenv mender_boot_part_name /dev/mmcblk1p3; fi; setenv mender_kernel_root_name ${mender_boot_part_name}; setenv mender_uboot_root mmc 1:${mender_boot_part_hex}; setenv mender_uboot_root_name ${mender_boot_part_name}; setenv expand_bootargs "setenv bootargs \\"${bootargs}\\""; run expand_bootargs; setenv expand_bootargs; if test "${mender_post_setup_commands}" != ""; then run mender_post_setup_commands; fi
mender_altbootcmd=if test ${mender_boot_part} = 2; then setenv mender_boot_part 3; setenv mender_boot_part_hex 3; else setenv mender_boot_part 2; setenv mender_boot_part_hex 2; fi; setenv upgrade_available 0; saveenv; run mender_setup
mender_try_to_recover=if test ${upgrade_available} = 1; then reset; fi
bootcmd=run mender_setup; setenv bootargs root=${mender_kernel_root} ${bootargs}; if test "${fdt_addr_r}" != ""; then load ${mender_uboot_root} ${fdt_addr_r} /boot/${mender_dtb_name}; fi; load ${mender_uboot_root} ${kernel_addr_r} /boot/${mender_kernel_name}; ${mender_boot_kernel_type} ${kernel_addr_r} - ${fdt_addr_r}; run mender_try_to_recover
bootdelay=2
baudrate=115200
arch=arm
cpu=armv7
board=am335x
board_name=am335x
vendor=ti
soc=am33xx
loadaddr=0x82000000
kernel_addr_r=0x82000000
fdtaddr=0x88000000
fdt_addr_r=0x88000000
rdaddr=0x88080000
ramdisk_addr_r=0x88080000
scriptaddr=0x80000000
pxefile_addr_r=0x80100000
bootm_size=0x10000000
boot_fdt=try
mmcdev=0
mmcrootfstype=ext4 rootwait
finduuid=part uuid mmc ${bootpart} uuid
args_mmc=run finduuid;setenv bootargs console=${console} ${optargs} rw rootfstype=${mmcrootfstype}
loadbootscript=load mmc ${mmcdev} ${loadaddr} boot.scr
bootscript=echo Running bootscript from mmc${mmcdev} ...; source ${loadaddr}
bootenvfile=uEnv.txt
importbootenv=echo Importing environment from mmc${mmcdev} ...; env import -t ${loadaddr} ${filesize}
loadbootenv=fatload mmc ${mmcdev} ${loadaddr} ${bootenvfile}
loadimage=load ${devtype} ${bootpart} ${loadaddr} ${bootdir}/${bootfile}
loadfdt=load ${devtype} ${bootpart} ${fdtaddr} ${bootdir}/${fdtfile}
envboot=mmc dev ${mmcdev}; if mmc rescan; then echo SD/MMC found on device ${mmcdev};if run loadbootscript; then run bootscript;else if run loadbootenv; then echo Loaded env from ${bootenvfile};run importbootenv;fi;if test -n $uenvcmd; then echo Running uenvcmd ...;run uenvcmd;fi;fi;fi;
mmcloados=run args_mmc; if test ${boot_fdt} = yes || test ${boot_fdt} = try; then if run loadfdt; then bootz ${loadaddr} - ${fdtaddr}; else if test ${boot_fdt} = try; then bootz; else echo WARN: Cannot load the DT; fi; fi; else bootz; fi;
mmcboot=mmc dev ${mmcdev}; setenv devnum ${mmcdev}; setenv devtype mmc; if mmc rescan; then echo SD/MMC found on device ${mmcdev};if run loadimage; then if test ${boot_fit} -eq 1; then run loadfit; else run mmcloados;fi;fi;fi;
boot_fit=0
fit_loadaddr=0x87000000
fit_bootfile=fitImage
update_to_fit=setenv loadaddr ${fit_loadaddr}; setenv bootfile ${fit_bootfile}
loadfit=run args_mmc; bootm ${loadaddr}#${fdtfile};
bootpart=0:2
bootdir=/boot
bootfile=zImage
fdtfile=undefined
console=ttyO0,115200n8
partitions=uuid_disk=${uuid_gpt_disk};name=bootloader,start=384K,size=1792K,uuid=${uuid_gpt_bootloader};name=rootfs,start=2688K,size=-,uuid=${uuid_gpt_rootfs}
optargs=
ramroot=/dev/ram0 rw
ramrootfstype=ext2
spiroot=/dev/mtdblock4 rw
spirootfstype=jffs2
spisrcaddr=0xe0000
spiimgsize=0x362000
spibusno=0
spiargs=setenv bootargs console=${console} ${optargs} rootfstype=${spirootfstype}
ramargs=setenv bootargs console=${console} ${optargs} rootfstype=${ramrootfstype}
loadramdisk=load mmc ${mmcdev} ${rdaddr} ramdisk.gz
spiboot=echo Booting from spi ...; run spiargs; sf probe ${spibusno}:0; sf read ${loadaddr} ${spisrcaddr} ${spiimgsize}; bootz ${loadaddr}
ramboot=echo Booting from ramdisk ...; run ramargs; bootz ${loadaddr} ${rdaddr} ${fdtaddr}
findfdt=if test $board_name = A335BONE; then setenv fdtfile am335x-bone.dtb; fi; if test $board_name = A335BNLT; then setenv fdtfile am335x-boneblack.dtb; fi; if test $board_name = A335PBGL; then setenv fdtfile am335x-pocketbeagle.dtb; fi; if test $board_name = BBBW; then setenv fdtfile am335x-boneblack-wireless.dtb; fi; if test $board_name = BBG1; then setenv fdtfile am335x-bonegreen.dtb; fi; if test $board_name = BBGW; then setenv fdtfile am335x-bonegreen-wireless.dtb; fi; if test $board_name = BBBL; then setenv fdtfile am335x-boneblue.dtb; fi; if test $board_name = BBEN; then setenv fdtfile am335x-sancloud-bbe.dtb; fi; if test $board_name = A33515BB; then setenv fdtfile am335x-evm.dtb; fi; if test $board_name = A335X_SK; then setenv fdtfile am335x-evmsk.dtb; fi; if test $board_name = A335_ICE; then setenv fdtfile am335x-icev2.dtb; fi; if test $fdtfile = undefined; then echo WARNING: Could not determine device tree to use; fi;
init_console=if test $board_name = A335_ICE; then setenv console ttyO3,115200n8;else setenv console ttyO0,115200n8;fi;
static_ip=${ipaddr}:${serverip}:${gatewayip}:${netmask}:${hostname}::off
nfsopts=nolock
rootpath=/export/rootfs
netloadimage=tftp ${loadaddr} ${bootfile}
netloadfdt=tftp ${fdtaddr} ${fdtfile}
netargs=setenv bootargs console=${console} ${optargs} nfsroot=${serverip}:${rootpath},${nfsopts} rw ip=dhcp
netboot=echo Booting from network ...; setenv autoload no; dhcp; run netloadimage; run netloadfdt; run netargs; bootz ${loadaddr} - ${fdtaddr}
dfu_alt_info_emmc=rawemmc raw 0 3751936;boot part 1 1;rootfs part 1 2;MLO fat 1 1;MLO.raw raw 0x100 0x100;u-boot.img.raw raw 0x300 0x1000;u-env.raw raw 0x1300 0x200;spl-os-args.raw raw 0x1500 0x200;spl-os-image.raw raw 0x1700 0x6900;spl-os-args fat 1 1;spl-os-image fat 1 1;u-boot.img fat 1 1;uEnv.txt fat 1 1
dfu_alt_info_mmc=boot part 0 1;rootfs part 0 2;MLO fat 0 1;MLO.raw raw 0x100 0x100;u-boot.img.raw raw 0x300 0x1000;u-env.raw raw 0x1300 0x200;spl-os-args.raw raw 0x1500 0x200;spl-os-image.raw raw 0x1700 0x6900;spl-os-args fat 0 1;spl-os-image fat 0 1;u-boot.img fat 0 1;uEnv.txt fat 0 1
dfu_alt_info_ram=kernel ram 0x80200000 0x4000000;fdt ram 0x80f80000 0x80000;ramdisk ram 0x81000000 0x4000000
mmc_boot=if mmc dev ${devnum}; then setenv devtype mmc; run scan_dev_for_boot_part; fi
boot_net_usb_start=usb start
usb_boot=usb start; if usb dev ${devnum}; then setenv devtype usb; run scan_dev_for_boot_part; fi
boot_efi_binary=if fdt addr ${fdt_addr_r}; then bootefi bootmgr ${fdt_addr_r};else bootefi bootmgr ${fdtcontroladdr};fi;load ${devtype} ${devnum}:${distro_bootpart} ${kernel_addr_r} efi/boot/bootarm.efi; if fdt addr ${fdt_addr_r}; then bootefi ${kernel_addr_r} ${fdt_addr_r};else bootefi ${kernel_addr_r} ${fdtcontroladdr};fi
load_efi_dtb=load ${devtype} ${devnum}:${distro_bootpart} ${fdt_addr_r} ${prefix}${efi_fdtfile}
efi_dtb_prefixes=/ /dtb/ /dtb/current/
scan_dev_for_efi=setenv efi_fdtfile ${fdtfile}; if test -z "${fdtfile}" -a -n "${soc}"; then setenv efi_fdtfile ${soc}-${board}${boardver}.dtb; fi; for prefix in ${efi_dtb_prefixes}; do if test -e ${devtype} ${devnum}:${distro_bootpart} ${prefix}${efi_fdtfile}; then run load_efi_dtb; fi;done;if test -e ${devtype} ${devnum}:${distro_bootpart} efi/boot/bootarm.efi; then echo Found EFI removable media binary efi/boot/bootarm.efi; run boot_efi_binary; echo EFI LOAD FAILED: continuing...; fi; setenv efi_fdtfile
boot_prefixes=/ /boot/
boot_scripts=boot.scr.uimg boot.scr
boot_script_dhcp=boot.scr.uimg
boot_targets=mmc0 legacy_mmc0 mmc1 legacy_mmc1 nand0 pxe dhcp
boot_syslinux_conf=extlinux/extlinux.conf
boot_extlinux=sysboot ${devtype} ${devnum}:${distro_bootpart} any ${scriptaddr} ${prefix}${boot_syslinux_conf}
scan_dev_for_extlinux=if test -e ${devtype} ${devnum}:${distro_bootpart} ${prefix}${boot_syslinux_conf}; then echo Found ${prefix}${boot_syslinux_conf}; run boot_extlinux; echo SCRIPT FAILED: continuing...; fi
boot_a_script=load ${devtype} ${devnum}:${distro_bootpart} ${scriptaddr} ${prefix}${script}; source ${scriptaddr}
scan_dev_for_scripts=for script in ${boot_scripts}; do if test -e ${devtype} ${devnum}:${distro_bootpart} ${prefix}${script}; then echo Found U-Boot script ${prefix}${script}; run boot_a_script; echo SCRIPT FAILED: continuing...; fi; done
scan_dev_for_boot=echo Scanning ${devtype} ${devnum}:${distro_bootpart}...; for prefix in ${boot_prefixes}; do run scan_dev_for_extlinux; run scan_dev_for_scripts; done;run scan_dev_for_efi;
scan_dev_for_boot_part=part list ${devtype} ${devnum} -bootable devplist; env exists devplist || setenv devplist 1; for distro_bootpart in ${devplist}; do if fstype ${devtype} ${devnum}:${distro_bootpart} bootfstype; then run scan_dev_for_boot; fi; done
bootcmd_mmc0=setenv devnum 0; run mmc_boot
bootcmd_legacy_mmc0=setenv mmcdev 0; setenv bootpart 0:2 ; run mmcboot
bootcmd_mmc1=setenv devnum 1; run mmc_boot
bootcmd_legacy_mmc1=setenv mmcdev 1; setenv bootpart 1:2 ; run mmcboot
bootcmd_nand=run nandboot
bootcmd_pxe=run boot_net_usb_start; dhcp; if pxe get; then pxe boot; fi
bootcmd_dhcp=run boot_net_usb_start; if dhcp ${scriptaddr} ${boot_script_dhcp}; then source ${scriptaddr}; fi;setenv efi_fdtfile ${fdtfile}; if test -z "${fdtfile}" -a -n "${soc}"; then setenv efi_fdtfile ${soc}-${board}${boardver}.dtb; fi; setenv efi_old_vci ${bootp_vci};setenv efi_old_arch ${bootp_arch};setenv bootp_vci PXEClient:Arch:00010:UNDI:003000;setenv bootp_arch 0xa;if dhcp ${kernel_addr_r}; then tftpboot ${fdt_addr_r} dtb/${efi_fdtfile};if fdt addr ${fdt_addr_r}; then bootefi ${kernel_addr_r} ${fdt_addr_r}; else bootefi ${kernel_addr_r} ${fdtcontroladdr};fi;fi;setenv bootp_vci ${efi_old_vci};setenv bootp_arch ${efi_old_arch};setenv efi_fdtfile;setenv efi_old_arch;setenv efi_old_vci;
distro_bootcmd=for target in ${boot_targets}; do run bootcmd_${target}; done

Thanks!
SLR-

It could be that they are individually persistent, but if U-Boot and fw_setenv are referring to different storage areas, then a change in one won’t be reflected in the other. So they will both need to be tested together.

Roger that. I will definitely test as soon as I can get my hands on the tools to attach to the UART on these devices. I have triage meetings in the morning, but I’m hoping to have this done by early afternoon, San Francisco time.
I’ll update with my findings…

Thanks again, for your attention!
SLR-

Setting var from uboot

Loading Environment from MMC… OK
Net: cpsw, usb_ether
Press SPACE to abort autoboot in 2 seconds
=>
=>
=>
=> setenv test hello_from_uboot
=> saveenv
Saving Environment to MMC… Writing to redundant MMC(1)… OK
=>

Unplug power.
Plug power back in and boot.

~# fw_printenv test
test=hello_from_uboot

Any thoughts?
SLR-

To be 100% sure, you would need to run the same test twice, as you have two locations for the U-Boot environment.

It might also be helpful to capture the following each time you test:

fw_printenv | grep mender_
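To see how many copies are configured, the entries in /etc/fw_env.config can be listed; two entries mean a redundant environment, and since the copies are written alternately, a single save and read-back exercises only one of them. A minimal sketch, assuming the usual whitespace-separated "device offset size" layout shown earlier in this thread; the function name is invented for this example.

```shell
#!/bin/sh
# list_env_copies [CONFIG]
# Prints one line per environment copy configured for the U-Boot
# user-space tools, as "<device> <offset> <size>".
# Blank lines and comment lines (first field starting with '#') are skipped.
list_env_copies() {
    config=${1:-/etc/fw_env.config}
    awk 'NF && $1 !~ /^#/ { print $1, $2, $3 }' "$config"
}
```

Running `list_env_copies | wc -l` then gives the number of rounds the set/save/read-back test needs.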

~# fw_printenv | grep mender_

altbootcmd=run mender_altbootcmd; run bootcmd
bootcmd=run mender_setup; setenv bootargs root=${mender_kernel_root} ${bootargs}; if test "${fdt_addr_r}" != ""; then load ${mender_uboot_root} ${fdt_addr_r} /boot/${mender_dtb_name}; fi; load ${mender_uboot_root} ${kernel_addr_r} /boot/${mender_kernel_name}; ${mender_boot_kernel_type} ${kernel_addr_r} - ${fdt_addr_r}; run mender_try_to_recover
mender_altbootcmd=if test ${mender_boot_part} = 2; then setenv mender_boot_part 3; setenv mender_boot_part_hex 3; else setenv mender_boot_part 2; setenv mender_boot_part_hex 2; fi; setenv upgrade_available 0; saveenv; run mender_setup
mender_boot_kernel_type=bootz
mender_boot_part=2
mender_boot_part_hex=2
mender_check_saveenv_canary=1
mender_dtb_name=am335x-bonegreen.dtb
mender_kernel_name=zImage
mender_saveenv_canary=1
mender_setup=if test "${mender_saveenv_canary}" != "1"; then setenv mender_saveenv_canary 1; saveenv; fi; if test "${mender_pre_setup_commands}" != ""; then run mender_pre_setup_commands; fi; if test "${mender_systemd_machine_id}" != ""; then setenv bootargs systemd.machine_id=${mender_systemd_machine_id} ${bootargs}; fi; setenv mender_kernel_root /dev/mmcblk1p${mender_boot_part}; if test ${mender_boot_part} = 2; then setenv mender_boot_part_name /dev/mmcblk1p2; else setenv mender_boot_part_name /dev/mmcblk1p3; fi; setenv mender_kernel_root_name ${mender_boot_part_name}; setenv mender_uboot_root mmc 1:${mender_boot_part_hex}; setenv mender_uboot_root_name ${mender_boot_part_name}; setenv expand_bootargs "setenv bootargs \\"${bootargs}\\""; run expand_bootargs; setenv expand_bootargs; if test "${mender_post_setup_commands}" != ""; then run mender_post_setup_commands; fi
mender_try_to_recover=if test ${upgrade_available} = 1; then reset; fi
mender_uboot_boot=mmc 1:1
mender_uboot_dev=1
mender_uboot_if=mmc

Hmm, I do not see anything wrong here.

Looking at the code, I can only see two reasons why the following would be printed:

Failed mender_saveenv_canary check
  1. Something wrong with U-Boot environment configuration (which I think we have eliminated)
  2. Something wrong with how the Mender client parses U-Boot environment, but the following code seems to find the correct variable as it prints it a few lines before. But then it fails on this line.

And it looks like it is failing on this call,

If you are able to compile the Mender client for your target, it might be useful to add some additional print statements to pinpoint this.
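Before rebuilding the client, the two reads can be reproduced from the shell. This sketch mirrors what the error text and the debug lines in the log suggest the client does (read `mender_check_saveenv_canary`, then require `mender_saveenv_canary=1`); it is not taken from the client's source, and the function name is invented for this example.

```shell
#!/bin/sh
FW_PRINTENV=${FW_PRINTENV:-fw_printenv}

# canary_check
# Reports which step fails and returns non-zero on failure.
canary_check() {
    # Step 1: is the check requested at all?
    check=$("$FW_PRINTENV" -n mender_check_saveenv_canary 2>/dev/null) || check=""
    if [ "$check" != "1" ]; then
        echo "canary check not requested (mender_check_saveenv_canary != 1)"
        return 0
    fi
    # Step 2: has U-Boot saved the canary where user space can read it?
    canary=$("$FW_PRINTENV" -n mender_saveenv_canary 2>/dev/null) || canary=""
    if [ "$canary" != "1" ]; then
        echo "FAIL: mender_saveenv_canary is '${canary}', expected '1'"
        return 1
    fi
    echo "OK: mender_saveenv_canary=1"
}
```

For comparison, the full `fw_printenv` dump earlier in the thread lists `mender_check_saveenv_canary=1` but no `mender_saveenv_canary` entry, while the later `grep mender_` output does list `mender_saveenv_canary=1`.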

I can do that. Are you just wanting me to print env/bootEnv from getBootEnvActivePartition()?

SLR-

I am mostly curious on how this one can fail,

So the content of “vars” would be interesting.

content of “vars”

map[mender_check_saveenv_canary:1]

i.e.

2020-05-26 06:00:24 +0000 UTC info: Running Mender version 2.1.1-dirty
2020-05-26 06:00:24 +0000 UTC debug: handle update fetch state
2020-05-26 06:00:24 +0000 UTC debug: status reported, response 204 No Content
2020-05-26 06:00:29 +0000 UTC debug: Received fetch update response &{200 OK 200 HTTP/1.1 1 1 map[Accept-Ranges:[bytes] Connection:[keep-alive] Content-Length:[46405120] Content-Security-Policy:[block-all-mixed-content] Content-Type:[application/vnd.mender-artifact] Date:[Tue, 26 May 2020 06:00:26 GMT] Etag:["943be2fdae469c7f1316500eb5714da4"] Last-Modified:[Mon, 27 Apr 2020 00:09:53 GMT] Server:[openresty/1.13.6.2] Strict-Transport-Security:[max-age=63072000; includeSubdomains; preload] Vary:[Origin] X-Amz-Request-Id:[16127E9BDDB2EB7A] X-Content-Type-Options:[nosniff] X-Frame-Options:[DENY] X-Xss-Protection:[1; mode=block]] 0x9a6c80 46405120 [] false false map[] 0x876400 0x866960}+
2020-05-26 06:00:29 +0000 UTC info: State transition: update-fetch [Download_Enter] -> update-store [Download_Enter]
2020-05-26 06:00:30 +0000 UTC debug: handle update install state
2020-05-26 06:00:31 +0000 UTC debug: status reported, response 204 No Content
2020-05-26 06:00:31 +0000 UTC debug: Read data from device manifest file: device_type=beaglebone-yocto
2020-05-26 06:00:31 +0000 UTC debug: Current manifest data: beaglebone-yocto
2020-05-26 06:00:31 +0000 UTC info: no public key was provided for authenticating the artifact
2020-05-26 06:00:31 +0000 UTC info: Update Module path "/usr/share/mender/modules/v3" could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available
2020-05-26 06:00:31 +0000 UTC debug: checking if device [beaglebone-yocto] is on compatible device list: [beaglebone-yocto]
2020-05-26 06:00:31 +0000 UTC debug: installer: processing script: ArtifactCommit_Enter_10_migrate-configuration
2020-05-26 06:00:31 +0000 UTC debug: installer: successfully read artifact [name: 01-02-00; version: 2; compatible devices: [beaglebone-yocto]]
2020-05-26 06:00:31 +0000 UTC debug: Trying to install update of size: 394264576
2020-05-26 06:00:31 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
2020-05-26 06:00:31 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
2020-05-26 06:00:31 +0000 UTC debug: MARC - Test saveenv canary variables: map[mender_check_saveenv_canary:1]
2020-05-26 06:00:31 +0000 UTC error: Artifact install failed: Payload: can not install Payload: core-image-minimal-beaglebone-yocto.ext4: No match between boot and root partitions.: Failed mender_saveenv_canary check. There is an error in the U-Boot setup. Likely causes are: 1) Mismatch between the U-Boot boot loader environment location and the location specified in /etc/fw_env.config. 2) 'mender_setup' is not run by the U-Boot boot script: exit status 1
2020-05-26 06:00:31 +0000 UTC info: State transition: update-store [Download_Enter] -> cleanup [Error]
2020-05-26 06:00:31 +0000 UTC debug: transitioning to error state
2020-05-26 06:00:31 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-05-26 06:00:31 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-05-26 06:00:31 +0000 UTC debug: Handling Cleanup state
2020-05-26 06:00:31 +0000 UTC info: State transition: cleanup [Error] -> update-status-report [none]
2020-05-26 06:00:31 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-05-26 06:00:31 +0000 UTC debug: handle update status report state
2020-05-26 06:00:32 +0000 UTC debug: status reported, response 204 No Content
2020-05-26 06:00:32 +0000 UTC debug: attempting to upload deployment logs for failed update

…and from journalctl on the target device:

May 26 06:00:30 redaptive-304511b1242f mender[260]: time=“2020-05-26T06:00:30Z” level=info msg=“State transition: update-fetch [Download_Enter] -> update-store [Download_Enter]” module=mender
May 26 06:00:31 redaptive-304511b1242f mender[260]: time=“2020-05-26T06:00:31Z” level=info msg=“no public key was provided for authenticating the artifact” module=installer
May 26 06:00:31 redaptive-304511b1242f mender[260]: time=“2020-05-26T06:00:31Z” level=info msg=“Update Module path “/usr/share/mender/modules/v3” could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available” module=modules
May 26 06:00:31 redaptive-304511b1242f mender[260]: time=“2020-05-26T06:00:31Z” level=error msg=“Artifact install failed: Payload: can not install Payload: core-image-minimal-beaglebone-yocto.ext4: No match between boot and root partitions.: Failed mender_saveenv_canary check. There is an error in the U-Boot setup. Likely causes are: 1) Mismatch between the U-Boot boot loader environment location and the location specified in /etc/fw_env.config. 2) ‘mender_setup’ is not run by the U-Boot boot script: exit status 1” module=state
May 26 06:00:31 redaptive-304511b1242f mender[260]: time=“2020-05-26T06:00:31Z” level=info msg=“State transition: update-store [Download_Enter] -> cleanup [Error]” module=mender
May 26 06:00:31 redaptive-304511b1242f mender[260]: time=“2020-05-26T06:00:31Z” level=info msg=“State transition: cleanup [Error] -> update-status-report [none]” module=mender
May 26 06:00:33 redaptive-304511b1242f mender[260]: time=“2020-05-26T06:00:33Z” level=info msg=“State transition: update-status-report [none] -> idle [Idle]” module=mender
May 26 06:00:33 redaptive-304511b1242f mender[260]: time=“2020-05-26T06:00:33Z” level=info msg=“authorization data present and valid” module=mender
May 26 06:00:33 redaptive-304511b1242f mender[260]: time=“2020-05-26T06:00:33Z” level=info msg=“State transition: idle [Idle] -> check-wait [Idle]” module=mender

We might be in real trouble here. We have thousands of units deployed on this rev, and an ongoing manufacturing run that is currently pumping out additional units.

Unfortunately the logs do not help much. They show what one would expect, so the cause is still unclear.

Going back a bit, do the updates work if you disable the migration script?

Also, can you share the output of:

cat /etc/mender/mender.conf

cat /var/lib/mender/mender.conf

For clarity’s sake, the following data is from the target device, which is running a previous OS version (v1.1), not the OS version we are attempting to update TO (v1.2).

Background:
This device is running Mender 2.1.1-dirty and our OS image v1.1, and fails to take an OTA update to our image v1.2.
If a previous version of our OS (say, ~v1.0*) is updated to THIS version (v1.1) via OTA update, and then a subsequent OTA update to v1.2 is pushed, it will be successful; it is only devices with eMMC chips programmed on the reel, during manufacturing, that fail to take an OTA update from ~v1.1 to v1.2.

…again, the following data is from the target device, which is running a previous OS version (v1.1), not the OS version we are attempting to update TO (v1.2).
cat /etc/mender/mender.conf

{
"ClientProtocol": "https",
"InventoryPollIntervalSeconds": 86400,
"RetryPollIntervalSeconds": 300,
"RootfsPartA": "/dev/mmcblk1p2",
"RootfsPartB": "/dev/mmcblk1p3",
"ServerCertificate": "/etc/mender/server.crt",
"ServerURL": "https://update-nebula.device.redaptiveinc.com",
"TenantToken": "dummy",
"UpdatePollIntervalSeconds": 1800
}

cat /var/lib/mender/mender.conf

cat: can’t open ‘/var/lib/mender/mender.conf’: No such file or directory

root@redaptive-304511b1242f:~# cat /var/lib/mender/

deployments.0001.7ef3a085-a723-42d4-8455-e901053d47c1.log mender-agent.pem mender-store-lock
device_type mender-store scripts/

When I disable the migration script, the update doesn’t fail almost immediately, as it does here. It goes on to fully stream the image to disk and reboot, at which point it fails with similar errors.

Just to be 100% sure:
The manufacturing devices, they have not received an update previously?
They are straight off the production line?