Mender in bandwidth-constrained environments: 2G/15 kbps connection speed

Good morning,

Are there any known issues when operating Mender at such a low connection speed?

I have a device running Mender that reliably updates over connections down to 50 kbps. However, I need to operate in more data-constrained environments at the low end of 2G connection speeds, typically as low as 12-15 kbps.

For the same artifact I am able to complete a deployment at unrestricted speeds, but at 15 kbps the same operation fails to reboot and record the update. I am still gathering data, but I have had success even with a deployment restricted to 20 kbps.
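
For anyone wanting to reproduce the constrained link, below is a rough sketch of how such a restriction can be applied with Linux traffic control (the interface name and the burst/latency values are illustrative only, not my exact setup; note that tbf shapes egress, so to throttle the device's downloads it would sit on the upstream/router interface facing the device):

    # Shape the link to roughly 15 kbit/s with a token bucket filter.
    # eth0 and the burst/latency values are placeholders, not exact values.
    tc qdisc add dev eth0 root tbf rate 15kbit burst 4kb latency 400ms

    # Check the shaper and its statistics
    tc -s qdisc show dev eth0

    # Remove the restriction again
    tc qdisc del dev eth0 root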

The hunch is that a race condition may be occurring. To test this theory I have turned off all but one of the CPUs on the device and am waiting on the results.

Many thanks

Hi @beruti.

Interesting use-case.

-> but at 15 kbps the same operation fails to reboot and record the update

How long do deployments take at the different speeds? I want to see if you are close to any of the timeouts. I know that there is a 24-hour timeout (I will need to double-check to verify) on the download URL. Meaning that if the download takes more than 24 hours and gets interrupted during the update, the link would be invalid when the client tries to resume, and the update would fail.

I would like to point out that we are about to release Binary Delta Update support (Mender Binary Delta rootfs update), though this will only be available through our commercial offerings.

But this feature is particularly useful in such environments as yours.

It's quick to download: under 10 minutes on any reasonable fixed-line connection.

It uses a pretty small rootfs of 400 MB, so it ends up as a ~13 MB artifact. With the connection traffic-controlled at 15 kbps, the device reports to the server that it is in the download phase for the expected amount of time (~2 hours), but then the process stalls at reboot until I'm forced to pull power from the device. A rollback then occurs on the status report.
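
As a sanity check on that figure (assuming a 15 kbit/s line rate and ignoring protocol overhead), the arithmetic does land at about two hours, well inside the 24-hour download-URL window mentioned above:

    # ~13 MB artifact at 15 kbit/s (i.e. ~1875 bytes/s)
    echo $(( (13 * 1024 * 1024) / (15000 / 8) ))   # ~7270 seconds, ~2 hours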

When I connect over the serial console it seems that the device is stuck at the bootloader phase, but this behaviour isn't seen when the connection is unrestricted. Why this would occur at a lower connection speed is yet to be understood.

With one CPU the same sequence occurs.

I'll certainly be following to see if Binary Delta Update could be worked into our setup.

I cannot think of anything that would be Mender related. Mender does integrity checks of the written data, ensuring that it matches the data in the Mender Artifact. If the Mender client detected a checksum error it would not try to reboot.
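
If you want to double-check the artifact itself off-device, mender-artifact can inspect and validate it. A quick sketch (the file name is just a guess based on the artifact name in your logs):

    # Show the artifact's payloads, metadata and checksums
    mender-artifact read mender-neo2-openwrt-2019101714-34.mender

    # Structural validation (and signature verification, if the artifact is signed)
    mender-artifact validate mender-neo2-openwrt-2019101714-34.mender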

-> but then the process stalls at reboot until I'm forced to pull power from the device

You can typically get around this by enabling a hardware watchdog, which I would recommend regardless, as it prevents devices from being “stuck” in such conditions.
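
As a rough illustration only (assuming the board exposes a hardware watchdog as /dev/watchdog and BusyBox's watchdog applet is available; the timeouts are example values, not recommendations):

    # Feed /dev/watchdog every 10 s; if nothing has fed it for 60 s
    # (e.g. the system is hung at reboot), the hardware resets the board.
    watchdog -t 10 -T 60 /dev/watchdog

On systemd-based images the equivalent is usually the RuntimeWatchdogSec setting in /etc/systemd/system.conf.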

Also, if you are able to share a failed deployment log, I can take a look.

Thanks

Log 1

    2019-10-24 01:42:02 +0000 UTC info: wrote 419430400/419430400 bytes of update to device /dev/mmcblk0p2
    2019-10-24 01:42:02 +0000 UTC debug: installer: successfully read artifact [name: mender-neo2-openwrt-2019101714-34; version: 2; compatible devices: [mender-neo2-openwrt]]
    2019-10-24 01:42:12 +0000 UTC error: got unexpected HTTP status when reporting status: 502
    2019-10-24 01:42:12 +0000 UTC error: error reporting update status: (request_id: ): reporting status failed, bad status 502 server error message: failed to parse server response: invalid character '<' looking for beginning of value
    2019-10-24 01:42:12 +0000 UTC info: State transition: update-store [Download] -> update-install [ArtifactInstall]
    2019-10-24 01:42:12 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
    2019-10-24 01:42:12 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
    2019-10-24 01:42:13 +0000 UTC debug: status reported, response 204 No Content
    2019-10-24 01:42:13 +0000 UTC debug: Inactive partition: /dev/mmcblk0p2
    2019-10-24 01:42:13 +0000 UTC debug: Marking inactive partition (/dev/mmcblk0p2) as the new boot candidate.
    2019-10-24 01:42:13 +0000 UTC info: Enabling partition with new image installed to be a boot candidate: 2
    2019-10-24 01:42:13 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
    2019-10-24 01:42:13 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
    2019-10-24 01:42:13 +0000 UTC debug: Have U-Boot variable: mender_saveenv_canary=1
    2019-10-24 01:42:13 +0000 UTC debug: List of U-Boot variables:map[mender_saveenv_canary:1]
    2019-10-24 01:42:13 +0000 UTC debug: Marking inactive partition as a boot candidate successful.
    2019-10-24 01:42:13 +0000 UTC info: State transition: update-install [ArtifactInstall] -> reboot [ArtifactReboot_Enter]
    2019-10-24 01:42:13 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
    2019-10-24 01:42:13 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
    2019-10-24 01:42:13 +0000 UTC debug: handling reboot state
    2019-10-24 01:42:13 +0000 UTC debug: status reported, response 204 No Content
    2019-10-24 01:42:13 +0000 UTC info: rebooting device
    2019-10-24 01:42:13 +0000 UTC info: Mender rebooting from active partition: /dev/mmcblk0p3
    2019-10-18 15:30:31 +0000 UTC error: got invalid entrypoint into the state machine: state: reboot
    2019-10-18 15:30:31 +0000 UTC info: State transition: init [none] -> update-error [ArtifactFailure]
    2019-10-18 15:30:31 +0000 UTC debug: transitioning to error state
    2019-10-18 15:30:31 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
    2019-10-18 15:30:31 +0000 UTC debug: handle update error state
    2019-10-18 15:30:31 +0000 UTC info: State transition: update-error [ArtifactFailure] -> update-status-report [none]
    2019-10-18 15:30:31 +0000 UTC debug: handle update status report state
    2019-10-18 15:30:31 +0000 UTC error: failed to report status: Put

Log 2

    2019-10-24 08:30:23 +0000 UTC debug: Trying to install update of size: 419430400
    2019-10-24 08:30:23 +0000 UTC debug: Active partition: /dev/mmcblk0p3
    2019-10-24 08:30:23 +0000 UTC debug: Detected inactive partition /dev/mmcblk0p2, based on active partition /dev/mmcblk0p3
    2019-10-24 08:30:23 +0000 UTC info: native sector size of block device /dev/mmcblk0p2 is 512, we will write in chunks of 1048576
    2019-10-24 08:30:31 +0000 UTC info: opening device /dev/mmcblk0p2 for writing
    2019-10-24 08:30:31 +0000 UTC info: partition /dev/mmcblk0p2 size: 419430400
    2019-10-24 10:28:00 +0000 UTC info: wrote 419430400/419430400 bytes of update to device /dev/mmcblk0p2
    2019-10-24 10:28:00 +0000 UTC debug: installer: successfully read artifact [name: mender-neo2-openwrt-2019101714-34; version: 2; compatible devices: [mender-neo2-openwrt]]
    2019-10-24 10:28:03 +0000 UTC debug: status reported, response 204 No Content
    2019-10-24 10:28:03 +0000 UTC info: State transition: update-store [Download] -> update-install [ArtifactInstall]
    2019-10-24 10:28:03 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
    2019-10-24 10:28:03 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
    2019-10-24 10:28:04 +0000 UTC debug: status reported, response 204 No Content
    2019-10-24 10:28:04 +0000 UTC debug: Inactive partition: /dev/mmcblk0p2
    2019-10-24 10:28:04 +0000 UTC debug: Marking inactive partition (/dev/mmcblk0p2) as the new boot candidate.
    2019-10-24 10:28:04 +0000 UTC info: Enabling partition with new image installed to be a boot candidate: 2
    2019-10-24 10:28:04 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
    2019-10-24 10:28:04 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
    2019-10-24 10:28:04 +0000 UTC debug: Have U-Boot variable: mender_saveenv_canary=1
    2019-10-24 10:28:04 +0000 UTC debug: List of U-Boot variables:map[mender_saveenv_canary:1]
    2019-10-24 10:28:04 +0000 UTC debug: Marking inactive partition as a boot candidate successful.
    2019-10-24 10:28:04 +0000 UTC info: State transition: update-install [ArtifactInstall] -> reboot [ArtifactReboot_Enter]
    2019-10-24 10:28:04 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
    2019-10-24 10:28:04 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
    2019-10-24 10:28:04 +0000 UTC debug: handling reboot state
    2019-10-24 10:28:04 +0000 UTC debug: status reported, response 204 No Content
    2019-10-24 10:28:04 +0000 UTC info: rebooting device
    2019-10-24 10:28:04 +0000 UTC info: Mender rebooting from active partition: /dev/mmcblk0p3
    2019-10-18 15:24:35 +0000 UTC error: got invalid entrypoint into the state machine: state: reboot
    2019-10-18 15:24:35 +0000 UTC info: State transition: init [none] -> update-error [ArtifactFailure]
    2019-10-18 15:24:35 +0000 UTC debug: transitioning to error state

Each log ends with "error reporting update status: reporting status failed:".

Other oddities from the U-Boot environment on my last serial connection:
uboot_bootcount=22 but failed_bootcount=4

implying that mender_altbootcmd has run 4 times while mender_setup has been triggered 22 times, even though they should have a one-to-one relationship.

mender_altbootcmd=setexpr failed_bootcount ${failed_bootcount} + 1; if test ${mender_boot_part} = 2; then setenv mender_boot_part 3; setenv mender_boot_part_hex 3; else setenv mender_boot_part 2; setenv mender_boot_part_hex 2; fi; setenv 
+
mender_setup=setexpr uboot_bootcount ${uboot_bootcount} + 1; saveenv; if test "${mender_saveenv_canary}" != "1"; then setenv mender_saveenv_canary 1; saveenv; fi; if test "${mender_pre_setup_commands}" != ""; then run mender_pre_setup_commands; fi; setenv mender_kernel_root /dev/mmcblk0${mender_boot_part}; if test ${mender_boot_part} = 2; then setenv mender_boot_part_name /dev/mmcblk0p2; else setenv mender_boot_part_name /dev/mmcblk0p3; fi; setenv mender_kernel_root_name ${mender_boot_part_name}; setenv mender_uboot_root mmc 0:${mender_boot_part_hex}; setenv mender_uboot_root_name ${mender_boot_part_name}; setenv expand_bootargs "setenv bootargs \\"${bootargs}\\""; run expand_bootargs; setenv expand_bootargs; if test "${mender_post_setup_commands}" != ""; then run mender_post_setup_commands; fi
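
For reference, these counters can be read and reset from Linux with the u-boot-tools utilities (a sketch; it assumes /etc/fw_env.config correctly points at the environment storage for this board):

    # Read the custom counters and the selected boot partition
    fw_printenv uboot_bootcount failed_bootcount mender_boot_part

    # Reset the counters while debugging
    fw_setenv uboot_bootcount 0
    fw_setenv failed_bootcount 0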

-> error reporting update status: reporting status failed:

This one is a bit misleading, as it eventually succeeded in publishing the error log, but it can fail a couple of times while the network is being brought up.

The error that is the cause of the failure is this one:

2019-10-18 15:30:31 +0000 UTC error: got invalid entrypoint into the state machine: state: reboot

This implies that something went wrong during the boot process, either in U-Boot or in Linux.

-> mender_altbootcmd=setexpr failed_bootcount ${failed_bootcount} + 1; if test ${mender_boot_part}…

Oh, you are not using the Mender-provided patches here. Where did you get this one?

You can see our default commands here.

So that makes it a bit more difficult to help out. By default we let CONFIG_BOOTCOUNT_LIMIT and CONFIG_BOOTCOUNT_ENV handle this logic, and Mender only sets/clears the upgrade_available variable, which is part of the CONFIG_BOOTCOUNT_LIMIT feature. Mender also clears the bootcount on a successful update.
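
To illustrate the division of labour, here is a rough sketch of the stock environment flow (simplified, and not your patched scripts): U-Boot's own code increments bootcount and falls back to altbootcmd once the limit is exceeded, while the Mender client only touches upgrade_available and the counter:

    # Relevant environment variables with CONFIG_BOOTCOUNT_LIMIT/CONFIG_BOOTCOUNT_ENV
    # (illustrative values):
    bootlimit=1
    upgrade_available=0
    # U-Boot itself increments bootcount on each boot (with CONFIG_BOOTCOUNT_ENV it
    # is only persisted while upgrade_available=1) and runs altbootcmd instead of
    # bootcmd once bootcount exceeds bootlimit.
    #
    # The Mender client, from Linux, then only needs the equivalent of:
    #   fw_setenv upgrade_available 1   # after writing the new rootfs, before reboot
    #   fw_setenv upgrade_available 0   # on commit, together with
    #   fw_setenv bootcount 0           # clearing the counter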

We do provide an experimental patch which one can use as inspiration for implementing the CONFIG_BOOTCOUNT_LIMIT feature if it is not available in the U-Boot version you are using.

Ah yes. They are from a rossbcan patch for use in the mender-openwrt setup. It is likely not this, though, as the same files have served for all previous successful deployments.

I am testing to see whether any changes independent of his work are the root cause.

I think it is acceptable to leave this edge case as it is, given that 20 kbps updates are working predictably. An incremental update approach made possible by Directory could be our workaround where bandwidth is so tightly rationed. I will open a more thorough report should we action that level of testing in the future. Thanks for your help.