Mender server deviceauth propagate issue

Hi,
I’m hosting an open source mender-server and since the update to 3.1.0 (at least I think it is since then) I have some issues in my setup and I hope you can help me out.

When a new device comes online, it does not get recognized by the server at all:

  • I brought up a not pre-authorized device, it was not shown under pending devices
  • I pre-authorized a device, brought it online, it did not come online once connected to the server the first time

After digging around bit, I found this post and tried it out.

docker exec $(docker ps -q -n 1 -f 'name=device-auth') /usr/bin/deviceauth propagate-inventory 
docker exec $(docker ps -q -n 1 -f 'name=device-auth') /usr/bin/deviceauth propagate-inventory-id-data
docker exec $(docker ps -q -n 1 -f 'name=device-auth') /usr/bin/deviceauth propagate-inventory-statuses

The moment I executed the 3 command, the pending devices showed up and the pre-authorized device appeared in the device list. Everything looked correct.

Then I tried to accept the pending device, the status of device in the webinterface startet toggeling between accepted and pending (very fast).
After running deviceauth propagate-inventory-statuses the status flickering stopped and the device was accepted and everything is as it should be.

Unfortunately this behavior remains, I re-pulled the containers, restarted the server and the service, nothing helped. It seems like my setup has something broken and I have no idea what it could be.

I also tried the deviceauth maintenance --decommissioning-cleanup, the issue remains.

I tried to create a backup, migrated to an entirely fresh server instance, restored the backup (worked very smooth), but the issue remains also in the new instance.

So I guess the issue comes with the database data.
Any ideas how to figure out whats the issue in the database content?

Might there some miss-configuration in the prod.yaml?

Since this is quiet a blocker right now, I really would appreciate any hint helping find a solution.

Continuing trying to debugging the issue, I figured out that the mongo db has actually inconsistent data, no clue how that could happen, but there seems to be an issue.

I have inventory data for device ID’s which do not exist any more. Removing them using the API for it returns me a 204 which seems to be the correct response code, but the data remains int he database, so the call has no effect (Mender API docs). Removing the data directly for the database makes the inconsistent data disappear, but does not solve the overall issue.

If I dis-commission a device, it remains with noauth in the system (identity data still in tact), re-connecting the device (and doing the update calls on mender server so it shows up as pending) I end up with a new device ID, having 2 devices with the same identity data. Maybe there is the root of the issue.

At some point the connection between database and the rest of the system seems to be broken.

How is it possible that the API thinks the database call was successful, but the data remains actually in the database?

How can I have 2 devices with the same identity sets?

Is there a sanity check for the database available somewhere?

Hello t-mon,

I am very sorry for the late answer. And I am very happy that you are interested in Mender.
What is the current status?
Much has changed over the course of time, and I would recommend upgrading to 3.3.0 (the latest release).
I can see that you did also a very thorough debugging. Can we decommission both devices? Let’s start by doing so from the UI. And then lets see if the inventory.devices and deviceauth.devices and auth_sets collections are empty. Please keep me posted. And sorry once again for the late response.

best regards,
peter

Hi peter,
thanks for helping me out!
I’ll update to 3.3.0 today and will try to reproduce the issue and give detailed information regarding your questions. I keep you updated.

I tested now carefully the entire setup a few days and updating the server to 3.3.0 fixed indeed the original issue I had! Upgrading to 3.2.0 did not work, so the fix might have come with migrating to 3.3.0 I guess.

After doing some tests I noticed everything is now working again as expected, besides the mender-connect, which is not working any more. I tried to update and downgrade the clients, nothing helped. On the client side I only get following error:

level=error msg="messageLoop: error on readMessage: websocket: close 1006 (abnormal closure): unexpected EOF; disconnecting, waiting for reconnect."

Updating, inventory, everything works, but not mender-connect.
Should I open a new topic?

Best regards,
Simon

After some digging I actually think the issue is related to my original issue.

Here a short summery what’s the status:

Original issue (mender-server 3.1.0):

  • Wired device status and inventory behavior
  • mender-connect working as expected

Updated to mender-server 3.3.0

  • Wired device behavior fixed
  • mender-connect not working any more

The mender-client failed with following error message:

mender[13087]: time="2022-07-11T13:34:51+02:00" level=warning msg="error while sending close message: use of closed network connection"
mender[13087]: time="2022-07-11T13:35:02+02:00" level=error msg="error forwarding from backend to client: websocket: close 1006 (abnormal closure): unexpected EOF"

The mender-connect just reports

mender-connect[13092]: time="2022-07-11T13:36:04+02:00" level=info msg="eventLoop: Connection established with http://localhost:37857"
mender-connect[13092]: time="2022-07-11T13:36:04+02:00" level=error msg="messageLoop: error on readMessage: websocket: invalid close code; disconnecting, waiting for reconnect."

Since all clients worked with 3.1.0 and previous version, I’m sure the issue is in the server setup.

Now checking the server logs, there might be the missing key what’s going on with my device db:

time="2022-07-11T11:37:47Z" level=info byteswritten=0 clientip=172.18.0.15 device_id=0f6da962-efc2-4cf5-99dd-a87f53633dc9 file=middleware_gin.go func=accesslog.Middleware.func1
line=131 method=GET path=/api/devices/v1/deviceconnect/connect plan=enterprise qs= request_id=3262e5fe-eeb3-4b98-b1a2-e4a5e7199c58 responsetime=1129us status=200 ts="2022-07-11T
11:37:47Z" type=HTTP/1.1 useragent=Go-http-client/1.1
time="2022-07-11T11:37:47Z" level=error msg="websocket closed with error: error sending websocket close frame: websocket: invalid control frame: write exception: write errors: [
E11000 duplicate key error collection: deviceconnect.devices index: _id_ dup key: { _id: \"0f6da962-efc2-4cf5-99dd-a87f53633dc9\" }]" device_id=0f6da962-efc2-4cf5-99dd-a87f53633
dc9 file=device.go func=http.DeviceController.connectWSWriter.func1 line=225 plan=enterprise request_id=3262e5fe-eeb3-4b98-b1a2-e4a5e7199c58

I have no clue where the duplicated key might come from, but this could also be a hint what might have gone wrong with the initial issue? The device has only one key registered, everything works expect, mender-connect.

@peter are there any other information I can provide you?

hey,

wow, @t-mon, some serious debugging you are doing. perhaps we should switch places :slight_smile:
could you give me the version of mender-connect --version and mender --version on your device? Also you could check:

dbus-send --system --dest=io.mender.AuthenticationManager --print-reply /io/mender/AuthenticationManager io.mender.Authentication1.GetJwtToken

on the device (please remove the token (string "ey…) before sending the output).

best regards,
peter

mender-connect --version                                                                                                                        
mender-connect version 2.0.0    runtime: go1.17.6
mender --version                                                                                                                                    
3.2.0   runtime: go1.17.6
dbus-send --system --dest=io.mender.AuthenticationManager --print-reply /io/mender/AuthenticationManager io.mender.Authentication1.GetJwtToken
method return time=1657805389.556103 sender=:1.1964 -> destination=:1.3834 serial=25595 reply_serial=2
   string "eyJhb****"
   string "http://localhost:37451"

Could I kindly ask you to upgrade the client elements to the latest (3.3.0 mender client), 3.2.0 has reached EOL.
if that is the debian based system, then do not forget to do apt-get update.
if that does not help, could we try restarting systemctl restart mender-connect and if that does not help I would gladly see the complete logs from both components: journalctl -u mender-client and journalctl -u mender-connect.

peter

I installed the latest mender-client from your repository:

apt policy mender-client
mender-client:
  Installed: 3.2.0-1
  Candidate: 3.2.0-1
  Version table:
 *** 3.2.0-1 500
        500 https://downloads.mender.io/repos/debian stable/main armhf Packages
        100 /var/lib/dpkg/status
     1.7.0-4+b12 500
        500 http://deb.debian.org/debian buster/main armhf Packages

I’ll build my own 3.3.0 package and retry.

thanks for the detailed response. the repo has been deprecated, please use the following one:

https://downloads.mender.io/repos/debian debian/buster/stable main

(Downloads | Mender documentation)

pg

ok, I updated all packages from the new stable repository

mender --version
3.3.0   runtime: go1.17.6
mender-connect --version
mender-connect version 2.0.2    runtime: go1.17.6

The mender-connect journal:

Jul 15 06:13:05  mender-connect[672]: time="2022-07-15T06:13:05Z" level=error msg="messageLoop: error on readMessage: websocket: invalid close code; disconnecting, waiting for reconnect."
Jul 15 06:13:15  mender-connect[672]: time="2022-07-15T06:13:15Z" level=info msg="eventLoop: Connection established with http://localhost:33633"

The mender journal:

Jul 15 06:15:30  mender[352]: time="2022-07-15T06:15:30Z" level=warning msg="error while sending close message: connection closed"
Jul 15 06:15:40  mender[352]: time="2022-07-15T06:15:40Z" level=error msg="error forwarding from backend to client: websocket: close 1006 (abnormal closure): unexpected EOF"
dbus-send --system --dest=io.mender.AuthenticationManager --print-reply /io/mender/AuthenticationManager io.mender.Authentication1.GetJwtToken
method return time=1657867021.129224 sender=:1.46 -> destination=:1.61 serial=8 reply_serial=2
   string "eyJh****"
   string "http://localhost:35317"

Unfortunatly, the same behavior…

ok, could we try to run remote terminal via cli:

server_url=...
username=...
password=...
device_id=... # get it for instance from the UI
mender-cli login --server "https://${server_url}" --username "${username}" --password "${password}" --skip-verify
mender-cli terminal "${device_id}" --server "https://${server_url}" --skip-verify

what are the logs saying then? Could we try then to systemctl restart mender-client, wait a bit, and try again? then the same with
systemctl restart mender-connect, wait a bit and try the cli?
Is there anything between the device and the server that may prevent to upgrade the https to a websocket? as in: a web app firewall, some proxy?
Does the server run the default configuration of the api gateway? I mean: is there any additional element on the server side that may prevent the upgrade to a websocket?
I am sorry for the amount of questions, but I am unable to reproduce the problem locally.

pg

Trying to get the terminal:

Configuration file not found. Continuing.
FAILURE: unable to get the device: Get https://<****>/api/management/v1/deviceconnect/devices/0f6da****request failed with status 404

skip-verify does not change anything regarding this behavior, it is a valid ssl cert, and also the browser is says it is secure.

The 404 makes sense, the device is online from the check in perspective, check update and inventory update works, but mender-connect failed to establish a connection.

The client websocket error close 1006 (abnormal closure) comes from the missing close code before dropping the socket. This means to me: somehow the connection gets killed without staying in the ws protocol.

On the server side, I get this error:

websocket closed with error: error sending websocket close frame: websocket: invalid control frame: write exception: write errors: [
E11000 duplicate key error collection: deviceconnect.devices index: _id_ dup key: { _id: \"0f6d***\" }]" device_id=0f6da**
dc9 file=device.go func=http.DeviceController.connectWSWriter.func1 line=225 plan=enterprise request_id=3262e5fe-eeb3-4b98-b1a2-e4a5e7199c58

It says unable to write the close frame…matching basically the error on the client side.

Since the mender-connect connection worked using server 3.1.0 I still think it is an issue on the server side of my setup. Same certificate as used with 3.1.0, same server, same port configs etc…not working any more since 3.3.0 update.

Is there anything else I could check on the server what could kill the TCP connection? The connection attempt reaches the right service, so I guess no proxy, firewall or other networking issue there.

Maybe the deviceconnect service already trying to close the connection for any other reason or violations and printing only the error that he was unable to send the close frame because the socket was already gone? Can I get somehow more logs from the service?

I made some progress!

Since I still think something is wrong in my db, i created a completely fresh setup, created a db-dump from the production server, and restored the db in the new setup and devices i the field started to connect using mender-connect and seem to work again.

Still wired, my test device using the latest client version still failed with the errors I described in the previous post. So…progress, but not yet solved…something is killing the TCP connection.

nice!
I think you have solved it. there is a corruption in the db, the cause of it: a bug we very recently discovered which you could have triggered during the upgrades. I would first clean all the documents in the deviceconnect devices collection that have {"_id":"0f6d***"}. Then I would check , sessions, recordings, and control collections as well, sessions should be empty if you have no open sessions, recordings and control you can either leave or clean – they contain the session recording data. Then just to be sure I would decommission the test device, and re-accept it.

peter

I’m glad you where able to figure out what could have gone wrong here!

Do you have a command on the hand how I can clean those entries from the db in a simple way?
As I understood, those data is stored in a mongo db as json?

Is there any hotfix I can try while importing exporting the data to fix the db?

to remove all the documents with given device id (it assumes you have one mongo container running):

device_id="0f6da962-efc2-4cf5-99dd-a87f53633"
echo -ne "use deviceconnect\ndb.devices.deleteMany({\"_id\":\"${device_id}\"})\n" | docker exec -i `docker ps | grep mongo | head -1 | awk '{print($1);}'` mongo

we can also check the other collections:

for i in sessions recordings control; do
 echo "${i}:";
 echo -ne "use deviceconnect\ndb.\"${i}\".find({})\n" | docker exec -i `docker ps | grep mongo | head -1 | awk '{print($1);}'` mongo
done

let me know how it goes!

peter

Hi peter,

good news, removing the device if from the database collections indeed made the mender-connect features work again!

I’ve tested it with my “broken” test device and removed it from the devices db using your command, then also from sessions recordings and control collections, restarted the device and it was working again, no decommissioning was required.

There was a small typo in your for loop, for completeness here the working query:

for i in sessions recordings control; do  
    echo "${i}:";  echo -ne "use deviceconnect\ndb.${i}.find({})\n" | docker exec -i `docker ps | grep mongo | head -1 | awk '{print($1);}'` mongo; 
done

I tested it now also with other devices where the feature stopped working with older mender-client and mender-connect binaries installed, and they connected again to the server after a reboot.

I’m glad we where able to solve it :slight_smile: Thanks for your support!

1 Like