Mender server deviceauth propagate issue

Hi peter,
thanks for helping me out!
I’ll update to 3.3.0 today and will try to reproduce the issue and give detailed information regarding your questions. I keep you updated.

I tested now carefully the entire setup a few days and updating the server to 3.3.0 fixed indeed the original issue I had! Upgrading to 3.2.0 did not work, so the fix might have come with migrating to 3.3.0 I guess.

After doing some tests I noticed everything is now working again as expected, besides the mender-connect, which is not working any more. I tried to update and downgrade the clients, nothing helped. On the client side I only get following error:

level=error msg="messageLoop: error on readMessage: websocket: close 1006 (abnormal closure): unexpected EOF; disconnecting, waiting for reconnect."

Updating, inventory, everything works, but not mender-connect.
Should I open a new topic?

Best regards,
Simon

After some digging I actually think the issue is related to my original issue.

Here a short summery what’s the status:

Original issue (mender-server 3.1.0):

  • Wired device status and inventory behavior
  • mender-connect working as expected

Updated to mender-server 3.3.0

  • Wired device behavior fixed
  • mender-connect not working any more

The mender-client failed with following error message:

mender[13087]: time="2022-07-11T13:34:51+02:00" level=warning msg="error while sending close message: use of closed network connection"
mender[13087]: time="2022-07-11T13:35:02+02:00" level=error msg="error forwarding from backend to client: websocket: close 1006 (abnormal closure): unexpected EOF"

The mender-connect just reports

mender-connect[13092]: time="2022-07-11T13:36:04+02:00" level=info msg="eventLoop: Connection established with http://localhost:37857"
mender-connect[13092]: time="2022-07-11T13:36:04+02:00" level=error msg="messageLoop: error on readMessage: websocket: invalid close code; disconnecting, waiting for reconnect."

Since all clients worked with 3.1.0 and previous version, I’m sure the issue is in the server setup.

Now checking the server logs, there might be the missing key what’s going on with my device db:

time="2022-07-11T11:37:47Z" level=info byteswritten=0 clientip=172.18.0.15 device_id=0f6da962-efc2-4cf5-99dd-a87f53633dc9 file=middleware_gin.go func=accesslog.Middleware.func1
line=131 method=GET path=/api/devices/v1/deviceconnect/connect plan=enterprise qs= request_id=3262e5fe-eeb3-4b98-b1a2-e4a5e7199c58 responsetime=1129us status=200 ts="2022-07-11T
11:37:47Z" type=HTTP/1.1 useragent=Go-http-client/1.1
time="2022-07-11T11:37:47Z" level=error msg="websocket closed with error: error sending websocket close frame: websocket: invalid control frame: write exception: write errors: [
E11000 duplicate key error collection: deviceconnect.devices index: _id_ dup key: { _id: \"0f6da962-efc2-4cf5-99dd-a87f53633dc9\" }]" device_id=0f6da962-efc2-4cf5-99dd-a87f53633
dc9 file=device.go func=http.DeviceController.connectWSWriter.func1 line=225 plan=enterprise request_id=3262e5fe-eeb3-4b98-b1a2-e4a5e7199c58

I have no clue where the duplicated key might come from, but this could also be a hint what might have gone wrong with the initial issue? The device has only one key registered, everything works expect, mender-connect.

@peter are there any other information I can provide you?

hey,

wow, @t-mon, some serious debugging you are doing. perhaps we should switch places :slight_smile:
could you give me the version of mender-connect --version and mender --version on your device? Also you could check:

dbus-send --system --dest=io.mender.AuthenticationManager --print-reply /io/mender/AuthenticationManager io.mender.Authentication1.GetJwtToken

on the device (please remove the token (string "ey…) before sending the output).

best regards,
peter

mender-connect --version                                                                                                                        
mender-connect version 2.0.0    runtime: go1.17.6
mender --version                                                                                                                                    
3.2.0   runtime: go1.17.6
dbus-send --system --dest=io.mender.AuthenticationManager --print-reply /io/mender/AuthenticationManager io.mender.Authentication1.GetJwtToken
method return time=1657805389.556103 sender=:1.1964 -> destination=:1.3834 serial=25595 reply_serial=2
   string "eyJhb****"
   string "http://localhost:37451"

Could I kindly ask you to upgrade the client elements to the latest (3.3.0 mender client), 3.2.0 has reached EOL.
if that is the debian based system, then do not forget to do apt-get update.
if that does not help, could we try restarting systemctl restart mender-connect and if that does not help I would gladly see the complete logs from both components: journalctl -u mender-client and journalctl -u mender-connect.

peter

I installed the latest mender-client from your repository:

apt policy mender-client
mender-client:
  Installed: 3.2.0-1
  Candidate: 3.2.0-1
  Version table:
 *** 3.2.0-1 500
        500 https://downloads.mender.io/repos/debian stable/main armhf Packages
        100 /var/lib/dpkg/status
     1.7.0-4+b12 500
        500 http://deb.debian.org/debian buster/main armhf Packages

I’ll build my own 3.3.0 package and retry.

thanks for the detailed response. the repo has been deprecated, please use the following one:

https://downloads.mender.io/repos/debian debian/buster/stable main

(Downloads | Mender documentation)

pg

ok, I updated all packages from the new stable repository

mender --version
3.3.0   runtime: go1.17.6
mender-connect --version
mender-connect version 2.0.2    runtime: go1.17.6

The mender-connect journal:

Jul 15 06:13:05  mender-connect[672]: time="2022-07-15T06:13:05Z" level=error msg="messageLoop: error on readMessage: websocket: invalid close code; disconnecting, waiting for reconnect."
Jul 15 06:13:15  mender-connect[672]: time="2022-07-15T06:13:15Z" level=info msg="eventLoop: Connection established with http://localhost:33633"

The mender journal:

Jul 15 06:15:30  mender[352]: time="2022-07-15T06:15:30Z" level=warning msg="error while sending close message: connection closed"
Jul 15 06:15:40  mender[352]: time="2022-07-15T06:15:40Z" level=error msg="error forwarding from backend to client: websocket: close 1006 (abnormal closure): unexpected EOF"
dbus-send --system --dest=io.mender.AuthenticationManager --print-reply /io/mender/AuthenticationManager io.mender.Authentication1.GetJwtToken
method return time=1657867021.129224 sender=:1.46 -> destination=:1.61 serial=8 reply_serial=2
   string "eyJh****"
   string "http://localhost:35317"

Unfortunatly, the same behavior…

ok, could we try to run remote terminal via cli:

server_url=...
username=...
password=...
device_id=... # get it for instance from the UI
mender-cli login --server "https://${server_url}" --username "${username}" --password "${password}" --skip-verify
mender-cli terminal "${device_id}" --server "https://${server_url}" --skip-verify

what are the logs saying then? Could we try then to systemctl restart mender-client, wait a bit, and try again? then the same with
systemctl restart mender-connect, wait a bit and try the cli?
Is there anything between the device and the server that may prevent to upgrade the https to a websocket? as in: a web app firewall, some proxy?
Does the server run the default configuration of the api gateway? I mean: is there any additional element on the server side that may prevent the upgrade to a websocket?
I am sorry for the amount of questions, but I am unable to reproduce the problem locally.

pg

Trying to get the terminal:

Configuration file not found. Continuing.
FAILURE: unable to get the device: Get https://<****>/api/management/v1/deviceconnect/devices/0f6da****request failed with status 404

skip-verify does not change anything regarding this behavior, it is a valid ssl cert, and also the browser is says it is secure.

The 404 makes sense, the device is online from the check in perspective, check update and inventory update works, but mender-connect failed to establish a connection.

The client websocket error close 1006 (abnormal closure) comes from the missing close code before dropping the socket. This means to me: somehow the connection gets killed without staying in the ws protocol.

On the server side, I get this error:

websocket closed with error: error sending websocket close frame: websocket: invalid control frame: write exception: write errors: [
E11000 duplicate key error collection: deviceconnect.devices index: _id_ dup key: { _id: \"0f6d***\" }]" device_id=0f6da**
dc9 file=device.go func=http.DeviceController.connectWSWriter.func1 line=225 plan=enterprise request_id=3262e5fe-eeb3-4b98-b1a2-e4a5e7199c58

It says unable to write the close frame…matching basically the error on the client side.

Since the mender-connect connection worked using server 3.1.0 I still think it is an issue on the server side of my setup. Same certificate as used with 3.1.0, same server, same port configs etc…not working any more since 3.3.0 update.

Is there anything else I could check on the server what could kill the TCP connection? The connection attempt reaches the right service, so I guess no proxy, firewall or other networking issue there.

Maybe the deviceconnect service already trying to close the connection for any other reason or violations and printing only the error that he was unable to send the close frame because the socket was already gone? Can I get somehow more logs from the service?

I made some progress!

Since I still think something is wrong in my db, i created a completely fresh setup, created a db-dump from the production server, and restored the db in the new setup and devices i the field started to connect using mender-connect and seem to work again.

Still wired, my test device using the latest client version still failed with the errors I described in the previous post. So…progress, but not yet solved…something is killing the TCP connection.

nice!
I think you have solved it. there is a corruption in the db, the cause of it: a bug we very recently discovered which you could have triggered during the upgrades. I would first clean all the documents in the deviceconnect devices collection that have {"_id":"0f6d***"}. Then I would check , sessions, recordings, and control collections as well, sessions should be empty if you have no open sessions, recordings and control you can either leave or clean – they contain the session recording data. Then just to be sure I would decommission the test device, and re-accept it.

peter

I’m glad you where able to figure out what could have gone wrong here!

Do you have a command on the hand how I can clean those entries from the db in a simple way?
As I understood, those data is stored in a mongo db as json?

Is there any hotfix I can try while importing exporting the data to fix the db?

to remove all the documents with given device id (it assumes you have one mongo container running):

device_id="0f6da962-efc2-4cf5-99dd-a87f53633"
echo -ne "use deviceconnect\ndb.devices.deleteMany({\"_id\":\"${device_id}\"})\n" | docker exec -i `docker ps | grep mongo | head -1 | awk '{print($1);}'` mongo

we can also check the other collections:

for i in sessions recordings control; do
 echo "${i}:";
 echo -ne "use deviceconnect\ndb.\"${i}\".find({})\n" | docker exec -i `docker ps | grep mongo | head -1 | awk '{print($1);}'` mongo
done

let me know how it goes!

peter

Hi peter,

good news, removing the device if from the database collections indeed made the mender-connect features work again!

I’ve tested it with my “broken” test device and removed it from the devices db using your command, then also from sessions recordings and control collections, restarted the device and it was working again, no decommissioning was required.

There was a small typo in your for loop, for completeness here the working query:

for i in sessions recordings control; do  
    echo "${i}:";  echo -ne "use deviceconnect\ndb.${i}.find({})\n" | docker exec -i `docker ps | grep mongo | head -1 | awk '{print($1);}'` mongo; 
done

I tested it now also with other devices where the feature stopped working with older mender-client and mender-connect binaries installed, and they connected again to the server after a reboot.

I’m glad we where able to solve it :slight_smile: Thanks for your support!

1 Like

Hi @t-mon @peter,

I’ve got a very similar issue that I could partially fix. I will try to explain it as well as I can, but please let me know if it isn’t clear.

I’m using a Mender Open-Source server, currently on 3.1.0 (I’m pretty sure it appeared when we moved from 3.0.0 to 3.1.0), and use the preautorization workflow since the beginning. At some point (and can’t explain why), some devices I created weren’t able to connect on first try anymore, so it was necessary to decomission them, and manually accept the new pending requests. The problem is, I still have the inventory of the preautorized devices and can’t delete them, even with the Delete Device Inventory API request. It does return a code 204, but I still can get the device inventory.

It’s already an issue, but the real problem is that they share the same serial number I’m using as ID data, and I just realized yesterday that this could prevent some devices to be included in deployment. After I accepted the new devices requests, I do see them as accepted in my GUI, but with the GET List Devices Inventories request, I saw that they missed the “status” attribute.

Using the commands @t-mon provided at the top, I was able to fix the issue for the current devices (thanks a lot btw!), but I’m concerned that this issue might remain for following devices.

I also tried the last commands @peter provided, but it seems that the deletion of my device doesn’t work:

root@mender:~/mender-server.3.1.0/production# echo -ne "use deviceconnect\ndb.devices.deleteMany({\"_id\":\"${device_id}\"})\n" | docker exec -i `docker ps | grep mongo | head -1 | awk '{print($1);}'` mongo
MongoDB shell version v4.4.20
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("6656bac5-628a-4fd1-bcab-7c1e420073e2") }
MongoDB server version: 4.4.20
switched to db deviceconnect
{ "acknowledged" : true, "deletedCount" : 0 }
bye

Any help on that will be very appreciated.

Yoann

hey @Piocky

thanks for using Mender. perhaps we should start a new topic, since your issue is a bit different.
what I would suggest is: backup the inventory devices collection, then clean it:

use inventory...
db.devices.deleteMany({})

after that, I would suggest to run the deviceauth commands to propagate the data to inventory:

docker exec -i `docker ps | grep device-auth | head -1 | awk '{print($1);}'` deviceauth propagate-inventory-statuses
docker exec -i `docker ps | grep device-auth | head -1 | awk '{print($1);}'` deviceauth propagate-inventory-id-data

it is a bit late on Friday night, so just double check that those commands exist in 3.1. on that note using the Mender Server 3.1 does not really make much sense, you should upgrade to 3.6 as soon as you can.
one more thing: the serial number as such is not necessarily a problem, as long as the whole identity is unique (I hope you have something more there than just the serial).

best regards,
peter

Hi @peter,

Thanks for your quick answer. I do have a script to backup my database, but I’m not sure how I should use the following commands:

use inventory...
db.devices.deleteMany({})

as well as where to check if the commands you provided exist on 3.1

We will upgrade from 3.1 to 3.6 when we will migrate the Server, but currently we can’t because of the current certificate we are using on this server (not compatible with 3.2 and more).
Unfortunately, we use only Serial Number + RSA public key for identity data, and I feel like the last one is in auth_sets, not in inventory, so maybe that’s why having twice the same serial number is a problem in our case.

Yoann

I started a new thread: Cleaning inventry and deviceauth sync commands