High MongoDB CPU usage from the mender-device-auth service is causing major issues

We are running self-hosted Mender 1.7 and are experiencing a complete outage of our environment due to extremely high CPU usage from the mongod process, driven by the mender-device-auth container.

We are attempting to upgrade to the latest version, 2.5, but the upgrade is being blocked by the high CPU usage against our existing mender-device-auth container. Is there a way to terminate any locking calls on the database so that we can take a backup of our data and initialize a new server to work around this issue?

I am working with my team to gather all necessary logs. If logs would be helpful, please let us know which ones you need and where to collect them so we can tackle this issue for good.

@tranchitella @0lmi @merlin any thoughts here?

I’ve generated logs for the devauth docker container and for the current operations in mongod:

Device Auth Docker Container Log:
docker logs menderproduction_mender-device-auth_1 2>&1 | tee 2020-11-04_device_auth.log

Download from Google Drive

MongoDB Current Operations Log:
docker exec -it menderproduction_mender-mongo_1 mongo --eval "db.currentOp(true)" > 2020-11-04_mongo-currentOp.log

Download from Google Drive

Here are the logs from our api-gateway container.

Our client-side polling intervals are problematically short. Is there an approach that can be taken on the server side to rate-limit the incoming API gateway requests? Due to the server load we are unable to push a deployment to increase the client-side polling intervals, and any server-side means of compensating (even temporarily) would create room for us to make such a client-side change.

API Gateway Container Log:
docker container logs menderproduction_mender-api-gateway_1 2>&1 | tee 2020-11-04_api-001.log

Download from Google Drive

It looks like you are heading in the right direction. Mender 2.2 brought significant performance improvements, so upgrading the server version is a really good idea. The inventory update and deployment check poll intervals are definitely important for decreasing server load.

The situation is really interesting, and I can only think of hacky ways to handle it.

I think the only option for rate limiting in the open source version is tweaking the nginx configuration to enable rate limiting by IP (or doing the same on your firewall), and then sequentially updating the device fleet with longer poll intervals.
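
As an illustration only, nginx's built-in limit_req module could be used for this. Where exactly the directives go depends on the nginx.conf shipped in your mender-api-gateway image, and the location path below is just a placeholder:

# http {} context: one shared zone keyed on the client IP
limit_req_zone $binary_remote_addr zone=devices_by_ip:10m rate=1r/s;

# inside the existing location {} that proxies device traffic
# (path is a placeholder; keep your existing proxy_pass and header directives)
location /api/devices/ {
    limit_req zone=devices_by_ip burst=10 nodelay;
    # ... existing proxy settings ...
}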

Another thing which might help is manually checking for and creating indexes in the DB for the most frequent and heaviest queries. But this step will require attention and manual intervention, because it has to be taken into consideration during the subsequent upgrade process. The difference in indices is significant:
1.7.1

> use deviceauth
switched to db deviceauth
> db.getCollectionNames().forEach(function(collection) {
...    indexes = db[collection].getIndexes();
...    print("Indexes for " + collection + ":");
...    printjson(indexes);
... });
Indexes for auth_sets:
[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "deviceauth.auth_sets"
	},
	{
		"v" : 2,
		"unique" : true,
		"key" : {
			"device_id" : 1,
			"id_data" : 1,
			"pubkey" : 1
		},
		"name" : "auth_sets:DeviceId:IdData:PubKey",
		"ns" : "deviceauth.auth_sets"
	},
	{
		"v" : 2,
		"unique" : true,
		"key" : {
			"device_id" : 1,
			"id_data_sha256" : 1,
			"pubkey" : 1
		},
		"name" : "auth_sets:IdDataSha256:PubKey",
		"ns" : "deviceauth.auth_sets"
	}
]
Indexes for devices:
[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "deviceauth.devices"
	},
	{
		"v" : 2,
		"unique" : true,
		"key" : {
			"id_data" : 1
		},
		"name" : "devices:IdentityData",
		"ns" : "deviceauth.devices"
	}
]
Indexes for migration_info:
[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "deviceauth.migration_info"
	}
]
>

2.5.0

> use deviceauth
switched to db deviceauth
> db.getCollectionNames().forEach(function(collection) {
...    indexes = db[collection].getIndexes();
...    print("Indexes for " + collection + ":");
...    printjson(indexes);
... });
Indexes for auth_sets:
[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_"
	},
	{
		"v" : 2,
		"unique" : true,
		"key" : {
			"device_id" : 1,
			"id_data_sha256" : 1,
			"pubkey" : 1
		},
		"name" : "auth_sets:IdDataSha256:PubKey",
		"background" : false
	},
	{
		"v" : 2,
		"unique" : true,
		"key" : {
			"id_data_sha256" : 1,
			"pubkey" : 1
		},
		"name" : "auth_sets:NoDeviceId:IdDataSha256:PubKey",
		"background" : false
	}
]
Indexes for devices:
[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_"
	},
	{
		"v" : 2,
		"unique" : true,
		"key" : {
			"id_data_sha256" : 1
		},
		"name" : "devices:IdentityDataSha256",
		"background" : false
	}
]
Indexes for migration_info:
[ { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_" } ]
Indexes for tokens:
[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_"
	},
	{
		"v" : 2,
		"key" : {
			"exp.time" : 1
		},
		"name" : "TokenExpiration",
		"background" : false,
		"expireAfterSeconds" : 0
	}
]
>

The tokens collection might also contain quite a lot of unindexed expired tokens, which can be safely deleted.
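
For illustration, a rough sketch of what that cleanup could look like in the mongo shell, assuming your deviceauth database has a tokens collection that stores the expiry under exp.time (as the 2.5.0 TTL index above suggests); verify the field layout and take a backup first:

// run inside the mongo shell of the mender-mongo container
use deviceauth

// inspect one document first to confirm the expiry field name
db.tokens.findOne()

// count tokens whose expiry is already in the past
db.tokens.find({ "exp.time": { $lt: new Date() } }).count()

// delete them once the count looks sane
db.tokens.deleteMany({ "exp.time": { $lt: new Date() } })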

Hello 0lmi, thanks for taking the time to respond to our questions.

I have a question regarding the creation of indexes for the device auth database:

Is it possible to just use the 2.5.0 index schema in 1.7.1, or are specific changes necessary before we can use that schema in 1.7.1? Also, would it be possible to provide an example of what needs to be done on the 1.7.1 database to stabilize things? I am not too familiar with MongoDB; Mender is my first interaction with it.

We had another idea to help stem the issue, and it is a bit of a hack:

We were thinking about modifying the PatchDeviceAttributesHandler in api_inventory.go to skip the “Upsert” of the attributes by simply commenting out lines 329-333 in the file linked below:

Do you think there could be any unforeseen consequences of this change, other than Mender losing the ability to track device characteristics? We already track these device attributes outside of Mender, so from a naive perspective losing them would not have much of an impact on us. The main goal here is to get off the 1.7 version so that we can transition all of our devices and our server to the later 2.5.x version, which may have better optimizations and work better for our large pool of devices.

If we can get to a degraded but working 1.7.x version running on our self-hosted infrastructure, then we can work towards fixing the root cause of the issue. However, in our current configuration we do not have a working server and cannot effectively deploy updates while the load on the database is this immense.

@robg Everything depends on your goals, system availability, and timeline requirements.

If the goal is to stabilise the system for rolling out urgent updates, then I would just create the indices; I'm pretty sure that will help get the system responsive again. But those indices might (this is my supposition; I didn't check it and don't remember for sure) cause issues during the upgrade process. If that is the case, the upgrade will require an extra step of removing the indices before starting it.
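
For example, a minimal sketch of what that could look like in the mongo shell, using the 2.5.0 index shapes above as a template. The index names below are my own (so they are easy to spot and drop again before the upgrade migrations run), I left out unique so the build cannot fail on existing duplicates, and whether these particular keys help depends on which queries currentOp or the profiler shows as slow on your system:

// run inside the mongo shell of the mender-mongo container
use deviceauth

// 1) see what mongod is busy with right now (operations running 5+ seconds)
db.currentOp({ secs_running: { $gte: 5 } })

// 2) mirror the 2.5.0 lookup indexes that are missing in 1.7.1
//    (assumes the 1.7.1 documents already carry id_data_sha256; check with findOne())
db.auth_sets.createIndex(
    { id_data_sha256: 1, pubkey: 1 },
    { name: "manual_idDataSha256_pubkey", background: true }
)
db.devices.createIndex(
    { id_data_sha256: 1 },
    { name: "manual_idDataSha256", background: true }
)

// 3) before running the backend upgrade later, drop the manual indexes again
db.auth_sets.dropIndex("manual_idDataSha256_pubkey")
db.devices.dropIndex("manual_idDataSha256")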

If it is acceptable to stop traffic processing for the Mender backend upgrade, then that is the best option: just stop traffic processing, incrementally upgrade the Mender backend to the latest version, and then start traffic processing again. In v2.0 the auth management API v1 was removed together with Mender Artifact v1 (see more info in the release announcement). Removing the management API can only break your additional automation, if you have any, and that is simple and quick to fix by pointing the related calls at the v2 API. If you don't use v1 artifacts, it is completely safe for you to upgrade the backend to the latest version. Upgrading the Mender client in sync with the backend upgrade is not required, because the current Mender backend supports all older client versions.

Your hack might also work as a temporary workaround, but my bet is that most resources are being consumed by full collection scans due to the missing indices, rather than the DB being overloaded by excessive write operations. A hint here can come from top output, where you can check where most resources are spent and whether the disk subsystem is a bottleneck candidate. For example:
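
Reusing the container names from your commands above (mongotop and mongostat ship with the official MongoDB images, but verify they are present in yours):

# CPU / memory / block IO per container at a glance
docker stats --no-stream

# time mongod spends reading and writing each collection, refreshed every 5 seconds
docker exec -it menderproduction_mender-mongo_1 mongotop 5

# global throughput and queueing counters for mongod
docker exec -it menderproduction_mender-mongo_1 mongostat 5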

@robg Just out of curiosity, to understand the scale and the load better, could you please answer the following questions:

  1. How many devices are currently managed by Mender?
  2. What are your inventory update and deployment check poll intervals?
  3. How much CPU, memory, and storage is currently allocated to the Mender backend?