Client does not resume download and hangs indefinitely when server connection is severed

Hey,

We have a setup with 3 devices, all running Mender client (v5.0.1) for OTA updates. Two of the devices have direct uplinks to the internet, whereas internet traffic from the third device is forwarded through the other two devices, which have net.ipv4.ip_forward = 1.

We are attempting to install the OS on all devices at the same time, but in some scenarios the OS install stalls on the third device while it is downloading the artifact. This happens when the first two devices manage to finish their download and reboot while the third is still downloading.

Ideally we would like the download to resume on the third device once the connection to the internet is reestablished. This seems to be supported by the mender-client, as it uses Range reads and will retry failed reads. However, in our case the read never fails (i.e. the TCP connection is still considered established), and the download client stalls indefinitely.
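
To illustrate the failure mode in miniature: a peer that stays "connected" but never sends anything does not make a blocking read fail, so the reader only notices if it enforces its own deadline. Below is a minimal Python sketch using a local socket pair. It is not mender-client code, and `read_with_deadline` is a made-up helper name; it only shows the general idea of turning a silent stall into a detectable condition.

```python
import socket


def read_with_deadline(sock, nbytes, timeout_s):
    """Return received bytes, or None if nothing arrived within timeout_s."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The connection still looks established, but no data arrived in time.
        return None


# Two connected local sockets stand in for client and server.
client, server = socket.socketpair()

# The server sends nothing: without a deadline this recv would block forever.
stalled = read_with_deadline(client, 4096, 0.2)

# Once data flows again, the same call returns it normally.
server.sendall(b"chunk")
data = read_with_deadline(client, 4096, 0.2)

client.close()
server.close()
```

A caller that gets `None` back could treat it like a failed read and retry, which is the behaviour we are missing when the connection is severed silently.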

We’ve managed to get this working by configuring tcp_keepalive on the third device to much more aggressive values:

net.ipv4.tcp_keepalive_time = 60     # 1 minute
net.ipv4.tcp_keepalive_intvl = 75         
net.ipv4.tcp_keepalive_probes = 9

With these settings the mender-client will “realize” that the connection has been severed, and then attempt to resume the download, which eventually leads to the OS install being successful.
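
For a sense of scale: on Linux, the time before an idle connection is declared dead is roughly tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes. A quick back-of-the-envelope check in Python (the function name is only for illustration; 7200 / 75 / 9 are the stock Linux defaults):

```python
def keepalive_detection_seconds(idle_s, interval_s, probes):
    """Approximate time until the kernel gives up on a silent idle connection."""
    # First probe is sent after idle_s; each unanswered probe waits interval_s.
    return idle_s + interval_s * probes


tuned = keepalive_detection_seconds(60, 75, 9)        # values from this post
default = keepalive_detection_seconds(7200, 75, 9)    # stock Linux defaults

print(f"tuned:   {tuned} s")
print(f"default: {default} s")
```

So with the values above, a dead connection is noticed after about 12 minutes (735 s), compared to a bit over 2 hours (7875 s) with the defaults.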

However, configuring this on the host means all TCP connections will use these settings, which we expect will lead to other issues down the line.


Is it possible to configure the TCP keepalive options on the individual socket in mender-client today (or is it something you would consider adding)?
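
For context, the three sysctls above have per-socket counterparts on Linux (SO_KEEPALIVE plus TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT), so in principle only the download connection would need to be tuned. A minimal Python sketch of what we have in mind, purely to show the socket options: this is not mender-client code, `enable_keepalive` is a made-up helper, and the constants used are Linux-specific.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=75, probes=9):
    """Turn on TCP keepalive for this one socket only (Linux option names)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket equivalents of tcp_keepalive_time / _intvl / _probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


# Options can be set before connecting; no network access is needed here.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
idle = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
interval = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL)
count = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Set this way, the host-wide sysctls stay at their defaults and other TCP connections are unaffected.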

Alternatively, if you have another suggestion on how to handle this scenario, that would be welcome too!

Hi @ericwenn,

As far as I can tell, there is no direct way to tune these settings; you’d have to dive into the code. The recommended way to handle this kind of setup is the Mender Gateway, as it’s meant to do exactly that (and more, like caching).

Greetz,
Josef

Thanks, I haven’t seen Mender Gateway before!

We are currently not on the Enterprise plan, so we cannot use Mender Gateway. Reading through the linked page, though, I saw this:

“Devices in a System usually require coordination during the update process.”

However, I don’t see anything else in the documentation indicating how this is done with Mender Gateway. Coordinating the update process (so that not all devices start at the same time) is also something we could consider, but ideally we would rather not build it on top of Mender ourselves.

Is there anything in Mender Gateway that implements such functionality?


If we end up trying to patch mender-client source, would this be a welcome change upstream in Mender, or would we have to carry this patch ourselves in our Yocto build?

Thanks in advance, and happy Easter!