Mender Troubleshoot Add-on reliability

Hi,

we sometimes can’t access our devices via the troubleshoot remote terminal. 95% of the time it works fine, but once troubleshoot stops working, it stays broken until we restart mender-connect or reboot the device.

We have a bunch of devices set up with a custom Yocto-based OS. We use EU hosted Mender and have licensed the troubleshoot add-on. The devices continue to update the inventory and successfully check for and install pending deployments. Rebooting the device or restarting the mender-connect systemd service over a separate SSH session immediately fixes the issue. This problem comes up sporadically and we don’t have a way to reproduce it.

Are there any known issues with mender-connect and the troubleshoot add-on, for example when working with less reliable internet connections, wifi, …? Could problems on the server side cause this? Does anyone have pointers on what we could try? Currently my only idea is to periodically restart mender-connect, which would not be an ideal solution.

Best regards,
Nils

Does anyone have an idea? What additional information can I provide to aid in resolving this?

Best regards
Nils

I have since contacted support via e-mail and got a quick reply. I mistakenly thought we had already updated to the current mender-connect 2.2.1, but we were still on 2.1.0, which apparently had known issues.

Side note: The support rep also suggested migrating to mender client 4 if we have not rolled out to production yet.

I’ll close this for now and re-open if we find more problems.

Hi @NiBr, thanks for the recap!

Hi again,

sadly, updating mender and mender-connect did not fix the issue. We still face the same problem. The device is connected (and, for our development devices, also reachable via SSH) and inventory updates work, but troubleshoot does not. The devices sometimes simply lose their connection to troubleshoot, so neither the web GUI nor the mender-cli terminal works.
Restarting the mender-connect service or rebooting the device immediately solves the problem.

Hello Nils,

Thanks for following up. Let me answer in here as well:

There is a config-based workaround: set the following in your mender-connect.conf:

"Sessions": {    
    "StopExpired": true,    
    "ExpireAfterIdle": 600  
}

This basically ensures that, in the worst case, mender-connect heals itself after 10 minutes (600 seconds).
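
If you would rather script the change than edit the file by hand, here is a minimal Python sketch of merging those two keys into an existing config. It assumes the config is plain JSON and lives at /etc/mender/mender-connect.conf (adjust the path for your image). The helper names `apply_session_expiry` and `update_config_file` are made up for this example and are not part of any Mender tooling.

```python
import json
from pathlib import Path

# Assumption: location of the mender-connect config on the device.
CONFIG_PATH = Path("/etc/mender/mender-connect.conf")


def apply_session_expiry(config: dict, idle_seconds: int = 600) -> dict:
    """Return a copy of `config` with the Sessions expiry keys set.

    Other top-level keys and other keys inside "Sessions" are preserved.
    """
    updated = dict(config)
    sessions = dict(updated.get("Sessions", {}))
    sessions["StopExpired"] = True
    sessions["ExpireAfterIdle"] = idle_seconds
    updated["Sessions"] = sessions
    return updated


def update_config_file(path: Path = CONFIG_PATH, idle_seconds: int = 600) -> None:
    """Read the JSON config at `path`, apply the keys, and write it back."""
    config = json.loads(path.read_text())
    new_config = apply_session_expiry(config, idle_seconds)
    path.write_text(json.dumps(new_config, indent=4) + "\n")


if __name__ == "__main__":
    # Dry run on an in-memory dict; nothing on disk is touched here.
    print(json.dumps(apply_session_expiry({}), indent=4))
```

To change the real file, call `update_config_file()` as root on the device. I would expect mender-connect to need a restart afterwards to pick up the new values, but treat that as an assumption.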

Can you give it a try?

BR,
Luis

Hi Luis,

thanks for the reply. From my side we can continue here in the forum instead of the private mail chat, so everyone can benefit from our findings. For your information: we are now using current versions of mender (3.5.2) and mender-connect (2.2.1).

I’m glad to test ExpireAfterIdle. Sometimes we have to use Mender troubleshoot to help customers in certain situations, e.g. when they call via phone. Does it hurt to set the ExpireAfterIdle time to let’s say 300 or 60 seconds instead of 600?
Waiting 10 minutes during a call just because the connection got lost could get annoying quickly.

FWIW, this happens to us quite often (we have observed it for a few months now), to the point that mender-connect is not something we can rely on. Today it already happened a few times, and I noticed that we lost connectivity to all our devices at around the same time (± 1 minute); when it started working again, all devices came back online simultaneously. At the same time, I was intermittently getting errors like “accepted devices couldn’t be loaded. timeout of 10000ms exceeded” in the Mender console, which got me wondering whether it could actually be a server-related issue.

We are on the trial plan of hosted Mender (EU) and are using mender 3.4.0 and mender-connect 2.1.0 (will try upgrading today). Here is an excerpt of the mender-connect logs that I’m seeing on one of these devices:

Jun 13 07:30:32 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:30:32Z" level=error msg="eventLoop: error reconnecting: failed to connect after max number of retries"
Jun 13 07:30:57 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:30:57Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 1/10); len(token)=654"
Jun 13 07:31:22 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:31:22Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 2/10); len(token)=654"
Jun 13 07:31:42 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:31:42Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 3/10); len(token)=654"
Jun 13 07:31:58 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:31:58Z" level=info msg="eventLoop: Connection established with http://127.0.0.1:39823"
Jun 13 07:34:44 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:34:44Z" level=error msg="messageLoop: error on readMessage: websocket: close 1000 (normal): read tcp 10.11.8.21:60980->20.61.30.255:443: read: connection timed out; disconnecting, waiting for reconnect."
Jun 13 07:35:14 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:35:14Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 1/10); len(token)=654"
Jun 13 07:35:39 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:35:39Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 2/10); len(token)=654"
Jun 13 07:36:04 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:36:04Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 3/10); len(token)=654"
Jun 13 07:36:29 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:36:29Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 4/10); len(token)=654"
Jun 13 07:36:54 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:36:54Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 5/10); len(token)=654"
Jun 13 07:37:19 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:37:19Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 6/10); len(token)=654"
Jun 13 07:37:44 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:37:44Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 7/10); len(token)=654"
Jun 13 07:38:09 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:38:09Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 8/10); len(token)=654"
Jun 13 07:38:34 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:38:34Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: websocket: bad handshake; reconnecting in 5s (try 9/10); len(token)=654"
Jun 13 07:38:45 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:38:45Z" level=info msg="eventLoop: Connection established with http://127.0.0.1:39823"
Jun 13 07:58:44 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:58:44Z" level=error msg="messageLoop: error on readMessage: websocket: bad close code 1006; disconnecting, waiting for reconnect."
Jun 13 07:59:25 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:59:25Z" level=info msg="eventLoop: Connection established with http://127.0.0.1:39823"
Jun 13 08:34:46 iot-gate-imx8 mender-connect[523]: time="2024-06-13T08:34:46Z" level=error msg="messageLoop: error on readMessage: websocket: close 1000 (normal): read tcp 10.11.8.21:60988->20.61.30.255:443: read: connection reset by peer; disconnecting, waiting for reconnect."
Jun 13 08:35:01 iot-gate-imx8 mender-connect[523]: time="2024-06-13T08:35:01Z" level=info msg="eventLoop: Connection established with http://127.0.0.1:39823"
Jun 13 09:39:16 iot-gate-imx8 mender-connect[523]: time="2024-06-13T09:39:16Z" level=error msg="messageLoop: error on readMessage: websocket: close 1000 (normal): read tcp 10.11.8.21:60990->20.61.30.255:443: read: connection reset by peer; disconnecting, waiting for reconnect."
Jun 13 09:39:32 iot-gate-imx8 mender-connect[523]: time="2024-06-13T09:39:32Z" level=info msg="eventLoop: Connection established with http://127.0.0.1:39823"
Jun 13 10:26:31 iot-gate-imx8 mender-connect[523]: time="2024-06-13T10:26:31Z" level=error msg="messageLoop: error on readMessage: websocket: close 1011 (internal server error): read tcp 172.26.65.255:8080->172.26.65.60:33946: i/o timeout; disconnecting, waiting for reconnect."
Jun 13 10:27:21 iot-gate-imx8 mender-connect[523]: time="2024-06-13T10:27:21Z" level=info msg="eventLoop: Connection established with http://127.0.0.1:39823"
Jun 13 10:29:32 iot-gate-imx8 mender-connect[523]: time="2024-06-13T10:29:32Z" level=error msg="messageLoop: error on readMessage: websocket: close 1011 (internal server error): read tcp 172.26.67.23:8080->172.26.65.60:49428: i/o timeout; disconnecting, waiting for reconnect."
Jun 13 10:30:29 iot-gate-imx8 mender-connect[523]: time="2024-06-13T10:30:29Z" level=error msg="connection manager failed to connect to http://127.0.0.1:39823/api/devices/v1/deviceconnect/connect: read tcp 127.0.0.1:35640->127.0.0.1:39823: i/o timeout; reconnecting in 5s (try 1/10); len(token)=654"
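
To see how long each outage in a log like this lasted, a small script can pair every first error with the next "Connection established" line. This is only a rough Python sketch: it assumes the `time="…" level=… msg="…"` layout shown above, and the names `parse_line` and `outages` are made up for this example.

```python
import re
from datetime import datetime

# Matches the logrus-style part of each journal line shown above.
LINE_RE = re.compile(r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"\s*$')


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m["time"], "%Y-%m-%dT%H:%M:%SZ")
    return ts, m["level"], m["msg"]


def outages(lines):
    """Yield (start, end, seconds) from the first error of an outage
    to the next 'Connection established' line."""
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, level, msg = parsed
        if level == "error" and start is None:
            start = ts  # first error after a healthy period
        elif level == "info" and "Connection established" in msg and start is not None:
            yield start, ts, (ts - start).total_seconds()
            start = None


# Two lines copied from the excerpt above.
SAMPLE = [
    'Jun 13 07:58:44 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:58:44Z" level=error msg="messageLoop: error on readMessage: websocket: bad close code 1006; disconnecting, waiting for reconnect."',
    'Jun 13 07:59:25 iot-gate-imx8 mender-connect[523]: time="2024-06-13T07:59:25Z" level=info msg="eventLoop: Connection established with http://127.0.0.1:39823"',
]

if __name__ == "__main__":
    for start, end, seconds in outages(SAMPLE):
        print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: {seconds:.0f}s")
        # prints "07:58:44 -> 07:59:25: 41s"
```

Feeding it the full excerpt gives one line per outage, which makes it easier to compare the gaps across devices or against a status page incident.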

EDIT: @NiBr Ah, I guess that explains some of the connection problems. Still, this problem has been affecting us for much longer. I will try upgrading mender-connect and see if that helps.

Hi @martin.zima,

EU hosted mender seems to have experienced an outage today: https://mender.statuspage.io/.
Maybe the problems you’ve experienced are caused by that. The problem I was discussing does not affect all devices simultaneously.