Are there any known issues when operating mender at such a low connection speed?
I have a device with mender, reliably updating successfully over connections down to 50kbps. However, I need to run in more data constrained environments at the low end of 2G connection speeds, typically as low as 12-15kbps.
For the same artifact I am able to complete a deployment at unrestricted speeds but at 15kbps speeds the same operation fails to reboot and record update. I am still gathering data but I have had success with even a 20kbps restricted deployment.
The hunch is that a race condition may be occurring. In order to test this theory I have turned off all but one of the cpus on the device and am waiting on the results.
-> but at 15kbps speeds the same operation fails to reboot and record update
How long do a deployments take at the different speeds? To see if you are close to any of the timeouts. I know that there is a 24 hours (I will need to double-check to verify) timeout for the on the download URL. Meaning if the download takes more then 24 hours, and you would get interrupted during the update, the link would be invalid when it tries to resume and fail the update.
I would like to point out that we are about to release Binary Delta Update support, Mender Binary Delta rootfs update. Though this will only be available trough our commercial offerings.
But this feature is particularly useful in such environments as yours.
Its quick to download, under 10 minutes, on any reasonable fixed-line connection.
It is using a pretty small rootfs of 400mb so it ends up as a ~13mb artifact. With a connection traffic controlled at 15kbps, the ~13mb artifact communicates to the server that it is in the download phase for the expected amount of time ~2 hours but then the process stalls at reboot until im forced to pull power from the device. Then a rollback occurs on status report.
When i connect the serial connection it seems that the device is stuck at the bootloader phase but this behaviour isnt seen when the connection is unrestricted. Why this would occur at a lower connection speed is to be understood.
Can not think of anything that would be Mender related. Mender does integrity checks of the written data ensuring that it matches the data in the Mender Artifact. If the Mender client would detect a checksum error it would not try to reboot.
but then the process stalls at reboot until im forced to pull power from the device
This you can typically get around by enabling hardware watchdog which I would recommend regardless, which would prevent devices being “stuck” in such conditions.
2019-10-24 01:42:02 +0000 UTC info: wrote 419430400/419430400 bytes of update to device /dev/mmcblk0p2
2019-10-24 01:42:02 +0000 UTC debug: installer: successfully read artifact [name: mender-neo2-openwrt-2019101714-34; version: 2; compatible devices: [mender-neo2-openwrt]]
2019-10-24 01:42:12 +0000 UTC error: got unexpected HTTP status when reporting status: 502
2019-10-24 01:42:12 +0000 UTC error: error reporting update status: (request_id: ): reporting status failed, bad status 502 server error message: failed to parse server response: invalid character '<' looking for beginning of value
2019-10-24 01:42:12 +0000 UTC info: State transition: update-store [Download] -> update-install [ArtifactInstall]
2019-10-24 01:42:12 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
2019-10-24 01:42:12 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
2019-10-24 01:42:13 +0000 UTC debug: status reported, response 204 No Content
2019-10-24 01:42:13 +0000 UTC debug: Inactive partition: /dev/mmcblk0p2
2019-10-24 01:42:13 +0000 UTC debug: Marking inactive partition (/dev/mmcblk0p2) as the new boot candidate.
2019-10-24 01:42:13 +0000 UTC info: Enabling partition with new image installed to be a boot candidate: 2
2019-10-24 01:42:13 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
2019-10-24 01:42:13 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
2019-10-24 01:42:13 +0000 UTC debug: Have U-Boot variable: mender_saveenv_canary=1
2019-10-24 01:42:13 +0000 UTC debug: List of U-Boot variables:map[mender_saveenv_canary:1]
2019-10-24 01:42:13 +0000 UTC debug: Marking inactive partition as a boot candidate successful.
2019-10-24 01:42:13 +0000 UTC info: State transition: update-install [ArtifactInstall] -> reboot [ArtifactReboot_Enter]
2019-10-24 01:42:13 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
2019-10-24 01:42:13 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
2019-10-24 01:42:13 +0000 UTC debug: handling reboot state
2019-10-24 01:42:13 +0000 UTC debug: status reported, response 204 No Content
2019-10-24 01:42:13 +0000 UTC info: rebooting device
2019-10-24 01:42:13 +0000 UTC info: Mender rebooting from active partition: /dev/mmcblk0p3
2019-10-18 15:30:31 +0000 UTC error: got invalid entrypoint into the state machine: state: reboot
2019-10-18 15:30:31 +0000 UTC info: State transition: init [none] -> update-error [ArtifactFailure]
2019-10-18 15:30:31 +0000 UTC debug: transitioning to error state
2019-10-18 15:30:31 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
2019-10-18 15:30:31 +0000 UTC debug: handle update error state
2019-10-18 15:30:31 +0000 UTC info: State transition: update-error [ArtifactFailure] -> update-status-report [none]
2019-10-18 15:30:31 +0000 UTC debug: handle update status report state
2019-10-18 15:30:31 +0000 UTC error: failed to report status: Put
Log 2
2019-10-24 08:30:23 +0000 UTC debug: Trying to install update of size: 419430400
2019-10-24 08:30:23 +0000 UTC debug: Active partition: /dev/mmcblk0p3
2019-10-24 08:30:23 +0000 UTC debug: Detected inactive partition /dev/mmcblk0p2, based on active partition /dev/mmcblk0p3
2019-10-24 08:30:23 +0000 UTC info: native sector size of block device /dev/mmcblk0p2 is 512, we will write in chunks of 1048576
2019-10-24 08:30:31 +0000 UTC info: opening device /dev/mmcblk0p2 for writing
2019-10-24 08:30:31 +0000 UTC info: partition /dev/mmcblk0p2 size: 419430400
2019-10-24 10:28:00 +0000 UTC info: wrote 419430400/419430400 bytes of update to device /dev/mmcblk0p2
2019-10-24 10:28:00 +0000 UTC debug: installer: successfully read artifact [name: mender-neo2-openwrt-2019101714-34; version: 2; compatible devices: [mender-neo2-openwrt]]
2019-10-24 10:28:03 +0000 UTC debug: status reported, response 204 No Content
2019-10-24 10:28:03 +0000 UTC info: State transition: update-store [Download] -> update-install [ArtifactInstall]
2019-10-24 10:28:03 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
2019-10-24 10:28:03 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
2019-10-24 10:28:04 +0000 UTC debug: status reported, response 204 No Content
2019-10-24 10:28:04 +0000 UTC debug: Inactive partition: /dev/mmcblk0p2
2019-10-24 10:28:04 +0000 UTC debug: Marking inactive partition (/dev/mmcblk0p2) as the new boot candidate.
2019-10-24 10:28:04 +0000 UTC info: Enabling partition with new image installed to be a boot candidate: 2
2019-10-24 10:28:04 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
2019-10-24 10:28:04 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
2019-10-24 10:28:04 +0000 UTC debug: Have U-Boot variable: mender_saveenv_canary=1
2019-10-24 10:28:04 +0000 UTC debug: List of U-Boot variables:map[mender_saveenv_canary:1]
2019-10-24 10:28:04 +0000 UTC debug: Marking inactive partition as a boot candidate successful.
2019-10-24 10:28:04 +0000 UTC info: State transition: update-install [ArtifactInstall] -> reboot [ArtifactReboot_Enter]
2019-10-24 10:28:04 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
2019-10-24 10:28:04 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1m0s seconds
2019-10-24 10:28:04 +0000 UTC debug: handling reboot state
2019-10-24 10:28:04 +0000 UTC debug: status reported, response 204 No Content
2019-10-24 10:28:04 +0000 UTC info: rebooting device
2019-10-24 10:28:04 +0000 UTC info: Mender rebooting from active partition: /dev/mmcblk0p3
2019-10-18 15:24:35 +0000 UTC error: got invalid entrypoint into the state machine: state: reboot
2019-10-18 15:24:35 +0000 UTC info: State transition: init [none] -> update-error [ArtifactFailure]
2019-10-18 15:24:35 +0000 UTC debug: transitioning to error state
each end with error reporting update status: reporting status failed:
So that makes it a bit more difficult to help out. By default we let CONFIG_BOOTCOUNT_LIMIT and CONFIG_BOOTCOUNT_ENV handle this logic and Mender only sets/clears the upgrade_available
variable which is part of the CONFIG_BOOTCOUNT_LIMIT feature. Mender also cleans up bootcount on successful update.
We do provide an experimental patch which one can use as inspiration for implementing the CONFIG_BOOTCOUNT_LIMIT feature if it is not available in the U-Boot version you are using.
Ah yes. They are from a rossbcan patch for use in the mender-openwrt setup. It is likely not this though as the same files have served for all previous successful deployments.
Testing to see if any changes independent of his work are the root cause.
Think it is acceptable to deal with this edge case given that 20kbps updates are predictably working. An incremental update approach made possible by Directory could be our workaround in the situation where bandwidth is so tightly rationed. Will open a more thorough report should we action that level of testing in the future. Thanks for your help