Mender-stress-test-client corrupted the MongoDB

Hey there Community!

We have an open-source Mender 3.4 installed as a dev environment. It was deployed with Docker Compose on a t3.micro instance (2 vCPUs, 1 GB RAM).

After some time we decided we had to do a load test, and our choice was mender-stress-test-client.

Here is how we used it:

./mender-stress-test-client run --server-url https://mender.com --count 100 --mac-address-prefix "e4" --device-type "test" --update-interval 5 --tenant-token "some_token"

We successfully accepted 100 devices and ran a simple deployment simulation. The load average (LA) was about 10-15% compared to the usual 1-3%.
I wrote a script to clear all these test devices out after the test by their "noauth" status.

Then we decided to create 1000 devices at once, and this is where our server fell over =). The LA climbed to about 13-15 and the GUI started to glitch. I was able to accept about 150 devices, then the server just refused to change the devices' statuses, so I stopped the script.


Some devices were missing names, attributes, etc. I was able to "dismiss" all the devices on a page, and some of the statuses became "noauth" as expected, but when I tried to decommission these devices via the GUI or the API, their status changed back to "Accepted".

curl -X DELETE -H "Authorization: Bearer ${JWT_TOKEN}" "${MENDER_SERVER_URL}/api/management/v1/inventory/devices/b8fe1252-4a23-4b0e-8e5b-afd18e139036"

None of my further actions managed to delete even one of these devices, so I was forced to recreate my database to fix all the issues.
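
For reference, the decommission endpoint I ended up using in the cleanup script at the end of this post lives in the deviceauth management API rather than the inventory one. A minimal sketch, reusing the same example device ID as above and assuming a valid management JWT:

curl -X DELETE -H "Authorization: Bearer ${JWT_TOKEN}" "${MENDER_SERVER_URL}/api/management/v2/devauth/devices/b8fe1252-4a23-4b0e-8e5b-afd18e139036"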

  1. Now I want to know what the problem was. It looked like MongoDB couldn't handle so many requests at a time and the metadata for every new device got corrupted.
    We need to know whether the problem was related to the EC2 instance's CPU resources or disk speed, or whether this is simply too much for any Mender server at once. Is it possible that my server will handle this amount of requests if I set the "updatepollinterval" to 600?
  2. Are there any other tools that would allow us to test a real deployment on hundreds of devices at once, to be sure our network configuration is sufficient?

In the end, we need to be 100% sure that our production server will handle a situation where 1000 devices connect at the same time, or where there is a big update of hundreds of devices at once. (I know that is quite a rare situation, but nevertheless.)

Looking forward to hearing from you!
Kos

UPD:
This time the server felt absolutely fine with 200 devices, but attributes were glitching again at 500. Fortunately, this time I was able to accept all the devices and even run a deployment.

The problem is likely the inter-service synchronization when you accept the devices. When a device is accepted, there is an asynchronous job (triggered by the deviceauth service) that propagates this information to the inventory service. Could you check the workflows.jobs collection in MongoDB and see if there are any jobs that failed?

// Connect to the primary: mongosh mongodb://<primary FQDN>
use workflows
// List jobs with a status greater than 0 and inspect them for failures
db.jobs.find({status: {$gt: 0}})

Also, we have seen some issues with NATS in the past (especially with nats:2.6, which is used in Mender 3.4). To debug NATS, you can install their CLI tool and inspect whether there are any pending or redelivered messages:

nats stream status WORKFLOWS
# Reports the state of the stream: check if the size > 0.
nats consumer report WORKFLOWS
# Gives a report of all subscribers and the number of pending messages

In the ideal situation there should be no persistent messages when the server is idle.
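
If the report does show pending or redelivered messages, you can drill into the individual consumer for more detail (the consumer name below is a placeholder; use whichever names the report above lists):

nats consumer info WORKFLOWS <consumer_name>
# Prints details for that consumer, including the unprocessed (pending) and redelivered message counts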

Unfortunately, mender-stress-test-client is the only open-source tool that we have for this kind of scale testing.

The Mender server is capable of handling a lot more than a couple of thousand devices. However, the LA will of course scale with the number of API calls the devices collectively make, so to have an efficient system you should consider the client parameters as well.
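
As a rough back-of-the-envelope illustration (using the device counts and intervals already mentioned in this thread), the aggregate poll rate is roughly the device count divided by the poll interval, per polled endpoint:

# Rough aggregate poll rate: device count / poll interval (per endpoint the clients poll)
awk 'BEGIN { printf "1000 devices, 5 s interval:   %.1f requests/s\n", 1000/5 }'
awk 'BEGIN { printf "1000 devices, 600 s interval: %.1f requests/s\n", 1000/600 }'

So raising the client polling intervals reduces the steady-state request rate roughly proportionally.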

Hey! Thank you for your feedback!
Unfortunately, the server became absolutely unresponsive when I tried to delete the test devices, so I decided to recreate the DB and can't provide you with the workflows or NATS statuses.

But I've done a few more tests and figured out that it is possible to add 1000 devices at a time if I extend the start-time parameter to 120 seconds instead of the default value of 10 seconds. In that case I still see visual glitches with the attributes, but I can change device statuses, accept devices, and the attributes start to look normal after a simple interaction. The DB also stays responsive.

CPU load is ~20%, MongoDB consumes 19% of memory, and 80% of all memory is used.
Here is how the command looks now:

./mender-stress-test-client run --server-url $MENDER_SERVER_URL --count 1000 --mac-address-prefix "e2" --device-type "test" --update-interval 600 --start-time 120 --tenant-token $JWT_TOKEN
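
Just to illustrate my own rough reasoning (assuming --start-time spreads the client start-up over that window, which is how it looked in practice): with 1000 devices, the initial burst of authentication requests drops from roughly 100 per second to under 10 per second.

# Rough authentication burst rate during ramp-up (assumption: devices start spread over --start-time)
awk 'BEGIN { printf "start-time 10 s:  %.0f auth requests/s\n", 1000/10 }'
awk 'BEGIN { printf "start-time 120 s: %.1f auth requests/s\n", 1000/120 }'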

Also, I have written a simple script to delete the test devices after the stress test. Hope someone will find it useful:

#!/bin/bash

# Replace with your Mender server URL
MENDER_SERVER_URL='https://mender.host.com'

# This is the user credentials to obtain the JWT_TOKEN
USER_CREDENTIALS='admin@mail.io:password'

# Obtain the JWT_TOKEN from Mender server
JWT_TOKEN=$(curl -s -X POST -u "$USER_CREDENTIALS" "$MENDER_SERVER_URL/api/management/v1/useradm/auth/login")

echo "JWT_TOKEN: ${JWT_TOKEN}"

PAGE=1
PER_PAGE=20
HAS_MORE=true

while [ "$HAS_MORE" = true ] ; do
  # Fetch devices from the Mender server
  curl -s -H "Authorization: Bearer ${JWT_TOKEN}" "${MENDER_SERVER_URL}/api/management/v1/inventory/devices?per_page=${PER_PAGE}&page=${PAGE}" > devices.json

  # Get the list of devices from the file
  DEVICES=$(cat devices.json)

  echo "DEVICES: ${DEVICES}"

  # Filter devices with device_type "test"
  TEST_DEVICES=$(echo "$DEVICES" | jq -r '.[] | select(.attributes[] | select(.name == "device_type" and .value == "test")) | .id')

  echo "TEST_DEVICES: ${TEST_DEVICES}"

  # Loop through the devices and delete them
  for DEVICE in $TEST_DEVICES; do
    echo "Deleting device ${DEVICE}"
    DELETE_RESPONSE=$(curl -s -X DELETE -H "Authorization: Bearer ${JWT_TOKEN}" "${MENDER_SERVER_URL}/api/management/v2/devauth/devices/${DEVICE}")
    echo "DELETE_RESPONSE: ${DELETE_RESPONSE}"
  done

  # Check if there are more devices to fetch
  HAS_MORE=$(echo "$DEVICES" | jq -r ". | length == ${PER_PAGE}")

  # Note: because devices are deleted while paging forward, remaining devices can shift
  # to an earlier page and be skipped; re-run the script until no test devices are left.
  let "PAGE++"
done

echo "Deletion process completed."