Artifact installs failing on "Tried maximum amount of times"

About 20% of our fleet is radio-challenged, which is to say those devices suffer from a number of latency and packet-loss issues. While these network challenges don’t tend to prevent our devices from delivering their minutely payload to our upstream data lakes, they do seem to result in OTA failures, e.g.:

...
2021-11-03 03:35:16 +0000 UTC error: Artifact install failed: Payload: can not install Payload: core-image-minimal-redrock-yocto.ext4: Cannot resume download: Tried maximum amount of times

At the moment this leaves us with somewhere around 30,000 sensors flagged as Not OTA Updatable. While this has not generally affected their ability to deliver sensor data, I’m hoping there are server and/or client side modifications we can make that will help ameliorate this issue moving forward.

Environment: Self Managed Server v2.4
Client: v2.1.1 - v2.6.0
Note: Generally our artifacts are in the 35MB range.

Thoughts?

SLR-

You could try to set RetryPollIntervalSeconds a bit higher, but I don’t think it will make a big difference in your case. It uses exponential backoff, with the number of attempts defined by this formula:

3 * ceil(log2(RetryPollIntervalSeconds) + 1)

In other words, increasing RetryPollIntervalSeconds will only increase the number of attempts slightly, but will increase the waiting time a lot.
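
To put numbers on that, here is a minimal Python sketch that simply evaluates the formula above for a few interval values. It is not the client's actual code; the function name `max_attempts` and the sample intervals are made up for illustration.

```python
import math


def max_attempts(retry_poll_interval_seconds: int) -> int:
    """Evaluate 3 * ceil(log2(RetryPollIntervalSeconds) + 1) from the post above.

    Hypothetical helper for illustration only; not taken from the Mender client.
    """
    if retry_poll_interval_seconds < 1:
        raise ValueError("interval must be at least 1 second")
    return 3 * math.ceil(math.log2(retry_poll_interval_seconds) + 1)


if __name__ == "__main__":
    # Sample intervals chosen arbitrarily to show how slowly the count grows.
    for interval in (60, 300, 600, 1800):
        print(f"{interval:>5} s -> {max_attempts(interval)} attempts")
```

Doubling the interval from 300 to 600 seconds only moves the result from 30 to 33 attempts, since each doubling adds 3.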

If you are ok with patching, you could try to change the 3 in that formula instead, which can be done here. Note that in order to do this you must be using Mender Client 3.0.0 or later, or else you will most likely get this error:

State data stored and retrieved maximum number of times

@kacf
Thanks for the prompt response Kristian!
My assumption was the poll interval related specifically to the client polling the server for updates.
In this case, the server is sending a RST, ungracefully terminating the session in the middle of an artifact download, and the client is attempting to resume the connection at the last known offset. Is the number of times the client can attempt to resume a download governed by the same poll interval algorithm!?
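
For context, resuming at an offset over HTTP is normally expressed with a Range request header. The sketch below assumes that is the mechanism in play here, which this thread does not confirm; `resume_headers` is a hypothetical helper, not part of the Mender client.

```python
def resume_headers(bytes_already_written: int) -> dict:
    """Build request headers that ask the server to continue from an offset.

    Assumption: the download is resumed with a standard HTTP Range request.
    """
    if bytes_already_written < 0:
        raise ValueError("offset cannot be negative")
    if bytes_already_written == 0:
        # Nothing on disk yet, so request the whole artifact.
        return {}
    # "bytes=N-" means: everything from byte N to the end of the resource.
    return {"Range": f"bytes={bytes_already_written}-"}


if __name__ == "__main__":
    # Example: 10 MiB of the artifact had been written before the reset.
    print(resume_headers(10 * 1024 * 1024))
```

Under that assumption, each resume attempt would only request the remaining bytes, but every attempt still counts against the retry budget.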

SLR-

Yes.

Thanks for the update @kacf !

While likely of little value to anyone else here, I’ll note that for us these two use cases (polling for updates vs. streaming an artifact download to the local filesystem) aren’t well served by that single logic. A process that checks for updates ~30 times a day and a process that is managing a mission-critical “~long running” data transfer don’t deserve the same governing thresholds.
Our customer SLA makes no promise that, on our publication of a new firmware version, their devices will be notified of the update within 30 minutes. We do, however, assure them that their devices will successfully receive critical updates, or we will roll a truck.
In a challenged radio environment (common for field sensors and other ‘IoT’ devices), pushing a 35MB OTA image over and over until it completes (or doesn’t), based on the RetryPollIntervalSeconds, represents a real material cost in data overages for our fleet. Then, ultimately, still needing to roll a truck is thousands of times that cost.
This is what I’m trying to mitigate.
If there are no remediations on the client side, is there any configuration we can perform on the server side to help mitigate the apparent RSTs it’s sending?

Thanks again for your time!
SLR-

You’ve got some valid points. For now I’ve created MEN-5206. I can’t promise this will get attention soon, but depending on urgency, maybe it can serve as inspiration if you or someone else wants to attempt a pull request?

For the server side, this isn’t my area of expertise, but you could possibly take a look at the page about ip-sysctl. There are loads of settings in there which tweak how the kernel deals with timeouts and the sending of various TCP status packets.

Otherwise I defer to other Mender Hub users to answer that one.

Thanks Kristian, you’re a rockstar!
If I can manage to make some space for someone on my team, I may ask them to take a look at addressing MEN-5206 internally.

THNX!
SLR-

Thanks, glad I’m living up to my avatar! :wink: