Hey,
We have a setup with 3 devices, all running the Mender client (v5.0.1) for OTA updates. Two of the devices have direct uplinks to the internet, whereas internet requests from the third device are proxied through the other two devices using net.ipv4.ip_forward = 1.
We are attempting to install the OS on all devices at the same time, but in some scenarios the OS install stalls on the third device while it is downloading the artifact. This happens when the first two devices manage to finish their download and reboot while the third is still downloading.
Ideally we would like the download to resume on the third device once the connection to the internet is reestablished. This seems to be supported by the mender-client, as it uses Range reads and will retry failed reads. However, in our case the read never fails (i.e. the TCP connection is still considered established), and the download client stalls indefinitely.
We’ve managed to get this working by configuring tcp_keepalive on the third device to much more aggressive values:
net.ipv4.tcp_keepalive_time = 60 # 1 minute
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
With these settings the mender-client will “realize” that the connection has been severed, and then attempt to resume the download, which eventually leads to the OS install being successful.
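For reference, here is a rough estimate of how long an idle, dead connection takes to be detected with these values. It assumes the usual Linux behavior: the first probe is sent after tcp_keepalive_time seconds of idle, and the connection is dropped after tcp_keepalive_probes unanswered probes spaced tcp_keepalive_intvl seconds apart. The helper name is only for illustration.

```python
def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Approximate seconds before the kernel drops an idle, dead connection."""
    # Idle time before the first probe, plus one interval per unanswered probe.
    return idle + interval * probes


print(keepalive_detection_seconds(60, 75, 9))    # our settings: 735
print(keepalive_detection_seconds(7200, 75, 9))  # Linux defaults: 7875
```

So with our settings the stall should last roughly 12 minutes before the download is retried, compared to a bit over 2 hours with the default values.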
However, configuring this on the host means all TCP connections will use these settings, which we expect will lead to other issues down the line.
Is it possible to configure the TCP keepalive options on the individual socket in mender-client today (or is it something you would consider adding)?
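To illustrate what we mean by per-socket configuration, here is a minimal sketch in Python using the standard Linux socket options. This is not mender-client code, and the function name and default values are our own assumptions; it only shows the kind of setting we would like to apply to the download socket instead of the whole host.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 75, probes: int = 9) -> socket.socket:
    """Enable TCP keepalive on one socket only (Linux-specific options)."""
    # Turn keepalive on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idle time before the first probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between successive probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Number of unanswered probes before the connection is dropped.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

These per-socket options override the net.ipv4.tcp_keepalive_* sysctls for that socket only, so other TCP connections on the host would keep the system defaults.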
Alternatively, if you have another suggestion on how to handle this scenario, that would be welcome too!