Artifact installs failing on "Tried maximum amount of times"

~20% of our fleet is radio-challenged, which is to say those devices suffer from a number of latency and packet-loss issues. While these network challenges don't tend to prevent the devices from delivering their minutely payload to our upstream data lakes, they do seem to result in OTA failures, e.g.:

...
2021-11-03 03:35:16 +0000 UTC error: Artifact install failed: Payload: can not install Payload: core-image-minimal-redrock-yocto.ext4: Cannot resume download: Tried maximum amount of times

At the moment this leaves us with somewhere around 30,000 sensors flagged as Not OTA Updatable. While this has not generally affected their ability to deliver sensor data, I'm hoping there are server- and/or client-side modifications we can make that will help ameliorate this issue going forward.

Environment: Self Managed Server v2.4
Client: v2.1.1 - v2.6.0
Note: Our artifacts are generally in the 35 MB range.

Thoughts?

SLR-

You could try setting RetryPollIntervalSeconds a bit higher, but I don't think it will make a big difference in your case. The retry logic uses exponential backoff, with the number of attempts defined by this formula:

3 * ceil(log2(RetryPollIntervalSeconds) + 1)

In other words, increasing RetryPollIntervalSeconds will only increase the number of attempts slightly, but it will increase the waiting time a lot.
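To make that concrete, here is a tiny Go sketch that simply evaluates the formula above at face value. It is not the actual client code; the multiplier of 3 (the per-interval attempt count) is pulled out as a constant, since that multiplier is what you would patch if you go the route described below.

package main

import (
	"fmt"
	"math"
)

// Per-interval attempt count from the formula quoted above.
// Illustration only, not a copy of the Mender client source.
const attemptsPerInterval = 3

// maxAttempts evaluates 3 * ceil(log2(RetryPollIntervalSeconds) + 1).
func maxAttempts(retryPollIntervalSeconds float64) int {
	return attemptsPerInterval * int(math.Ceil(math.Log2(retryPollIntervalSeconds)+1))
}

func main() {
	for _, interval := range []float64{300, 1800, 3600} {
		fmt.Printf("RetryPollIntervalSeconds=%v -> about %d attempts\n",
			interval, maxAttempts(interval))
	}
	// Prints roughly: 300 -> 30, 1800 -> 36, 3600 -> 39 attempts.
}

Going from 300 to 1800 seconds only buys a handful of extra attempts, while the maximum wait between attempts grows six-fold.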

If you are OK with patching, you could try changing the 3 in that formula instead, which can be done here. Note that to do this you must be using Mender Client 3.0.0 or later, or else you will most likely get this error:

State data stored and retrieved maximum number of times

@kacf
Thanks for the prompt response, Kristian!
My assumption was that the poll interval related specifically to the client polling the server for updates.
In this case, the server is sending an RST, ungracefully terminating the session in the middle of an artifact download, and the client is attempting to resume the download at the last known offset. Is the number of times the client can attempt to resume a download governed by the same poll-interval algorithm?

SLR-

Yes.

Thanks for the update, @kacf!

While likely of little value to anyone else here, I'll note that for us these two use cases (polling for updates vs. streaming an artifact download to the local filesystem) aren't well served by a single retry policy. A process that checks for updates ~30 times a day and a process that is managing a mission-critical, long-running data transfer don't deserve the same governing thresholds.
Our customer SLA makes no promise that, on our publication of a new firmware version, their devices will be notified of the update within 30 minutes. We do, however, assure them that their devices will successfully receive critical updates, or we will roll a truck.
In a challenged radio environment (common for field sensors and other IoT devices), pushing a 35 MB OTA image over and over until it completes (or doesn't), based on RetryPollIntervalSeconds, represents a real, material cost in data overages for our fleet. And ultimately still needing to roll a truck costs thousands of times that.
This is what I’m trying to mitigate.
If there are no remediations on the client side, is there any configuration we can perform on the server side to help mitigate the apparent RSTs it's sending?

Thanks again for your time!
SLR-

You’ve got some valid points. For now I’ve created MEN-5206. I can’t promise this will get attention soon, but depending on urgency, maybe it can serve as inspiration if you or someone else wants to attempt a pull request?

For the server side, this isn’t my area of expertise, but you could possibly take a look at the page about ip-sysctl. There are loads of settings in there which tweak how the kernel deals with timeouts and the sending of various TCP status packets.
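I can't say which (if any) of those knobs would help without knowing where the resets actually originate (the server itself, a reverse proxy, or NAT/firewall state somewhere along the path), but purely as an illustration, the settings below are the kind of TCP keepalive and retransmission tuning documented there. The values are placeholders, not recommendations; they would typically go in a file under /etc/sysctl.d/ on the server.

# Illustrative values only -- test against your own network first.
# More frequent keepalive probes, so idle connections are less likely
# to be dropped by intermediate NAT/firewall devices.
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 8
# Allow more retransmission attempts before the kernel gives up on an
# established connection over a lossy link.
net.ipv4.tcp_retries2 = 20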

Otherwise I defer to other Mender Hub users to answer that one.

Thanks, Kristian, you're a rockstar!
If I can manage to make some space for someone on my team, I may ask them to take a look at addressing MEN-5206 internally.

THNX!
SLR-

Thanks, glad I’m living up to my avatar! :wink:

@kacf As an aside, I just noticed that all the devices in our lab that are failing OTA updates with this issue are retrying 13 times. Based on the algorithm you posted above, I was expecting 36, given we use the default poll interval (1800). Am I misreading the algorithm, or is the Mender client trying far fewer times than expected? …or maybe it's a logging anomaly where only some of the retry events are logged? …or?

SLR-

I would expect that too. The message Download connection broken should be printed every time the connection is retried; it is logged right next to the part that calculates the next interval. Does this appear in the log?

This is what it looks like:

2022-06-12 23:16:24 +0000 UTC info: Running Mender client version: 2.6.0
2022-06-12 23:16:50 +0000 UTC info: State transition: update-fetch [Download_Enter] -> update-store [Download_Enter]
2022-06-12 23:16:52 +0000 UTC info: No public key was provided for authenticating the artifact
2022-06-12 23:16:52 +0000 UTC info: Update Module path "/usr/share/mender/modules/v3" could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available
2022-06-12 23:17:11 +0000 UTC info: Opening device "/dev/mmcblk1p3" for writing
2022-06-12 23:17:11 +0000 UTC info: Native sector size of block device /dev/mmcblk1p3 is 512 bytes. Mender will write in chunks of 1048576 bytes
2022-06-12 23:28:25 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-12 23:28:25 +0000 UTC info: Resuming download in 1m0s
2022-06-12 23:29:25 +0000 UTC info: Attempting to resume artifact download from offset 1015213
2022-06-12 23:30:46 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-12 23:30:46 +0000 UTC info: Resuming download in 1m0s
2022-06-12 23:31:46 +0000 UTC info: Attempting to resume artifact download from offset 1015213
2022-06-12 23:50:24 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-12 23:50:24 +0000 UTC info: Resuming download in 1m0s
2022-06-12 23:51:24 +0000 UTC info: Attempting to resume artifact download from offset 2308893
2022-06-13 00:11:46 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 00:11:46 +0000 UTC info: Resuming download in 2m0s
2022-06-13 00:13:46 +0000 UTC info: Attempting to resume artifact download from offset 3979405
2022-06-13 00:26:02 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 00:26:02 +0000 UTC info: Resuming download in 2m0s
2022-06-13 00:28:02 +0000 UTC info: Attempting to resume artifact download from offset 4847101
2022-06-13 00:38:27 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 00:38:27 +0000 UTC info: Resuming download in 2m0s
2022-06-13 00:40:27 +0000 UTC info: Attempting to resume artifact download from offset 5419885
2022-06-13 00:55:15 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 00:55:15 +0000 UTC info: Resuming download in 4m0s
2022-06-13 00:59:15 +0000 UTC info: Attempting to resume artifact download from offset 6779101
2022-06-13 01:15:26 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 01:15:26 +0000 UTC info: Resuming download in 4m0s
2022-06-13 01:19:26 +0000 UTC info: Attempting to resume artifact download from offset 8089165
2022-06-13 01:36:17 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 01:36:17 +0000 UTC info: Resuming download in 4m0s
2022-06-13 01:40:17 +0000 UTC info: Attempting to resume artifact download from offset 9546685
2022-06-13 02:02:07 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 02:02:07 +0000 UTC info: Resuming download in 5m0s
2022-06-13 02:07:07 +0000 UTC info: Attempting to resume artifact download from offset 11184429
2022-06-13 02:19:40 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 02:19:40 +0000 UTC info: Resuming download in 5m0s
2022-06-13 02:24:40 +0000 UTC info: Attempting to resume artifact download from offset 12330652
2022-06-13 02:40:41 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 02:40:41 +0000 UTC info: Resuming download in 5m0s
2022-06-13 02:45:41 +0000 UTC info: Attempting to resume artifact download from offset 13657099
2022-06-13 03:02:15 +0000 UTC error: Download connection broken: unexpected EOF
2022-06-13 03:02:15 +0000 UTC error: Failed to write 36825600 bytes to the new partition
2022-06-13 03:02:15 +0000 UTC info: The optimized block-device writer wrote a total of 37 frames, where 0 frames did need to be rewritten (i.e., skipped)
2022-06-13 03:02:15 +0000 UTC error: Artifact install failed: Payload: can not install Payload: core-image-minimal-<HWID>.ext4: Cannot resume download: Tried maximum amount of times
2022-06-13 03:02:15 +0000 UTC info: State transition: update-store [Download_Enter] -> cleanup [Error]
2022-06-13 03:02:15 +0000 UTC info: State transition: cleanup [Error] -> update-status-report [none]

SLR-

Are you sure you have set the interval correctly? The log seems to indicate it is set to 5 minutes: you can see that the interval increases exponentially at first, but instead of going on to 8 minutes, it caps at 5 and stays there. You are using RetryPollIntervalSeconds, right?

In 3.3, you will also be able to use RetryPollCount, so you can control the retry count without adjusting the interval.
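For reference, both of these are plain fields in /etc/mender/mender.conf. The snippet below is just an illustration with placeholder numbers (and omits the other fields, such as the server URL, that a real configuration needs); RetryPollCount requires client 3.3 or newer.

{
  "RetryPollIntervalSeconds": 300,
  "RetryPollCount": 30,
  "UpdatePollIntervalSeconds": 1800
}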

@kacf That's entirely possible. Maybe it's set to 300 seconds and I'm just misremembering, or confusing it with our UpdatePollIntervalSeconds value.
Regardless, as you mentioned in this thread months ago, increasing the retry interval has only a negligible effect on the retry count. Increasing from 300 to 1800 probably affords us another 5 or 6 retries. …but who knows, maybe that would be just enough to make a difference in the field ¯\_(ツ)_/¯

I guess I need to have the test-eng team look a little more closely at the real-world effect of changing this value.

Thanks for your help/input!
SLR-