Decommissioning broken on Mender 2.3 (Elasticsearch can't start)

Hello,

Since updating to open-source Mender 2.3, I can no longer decommission any device, either via the UI or via the API. I always get a 500 Internal Server Error without any explanation.
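For reference, the call I make via the API is roughly the following (assuming the standard device authentication management endpoint; host and device ID are placeholders):

curl -iks -X DELETE \
  -H "Authorization: Bearer $JWT" \
  "https://my-mender-server/api/management/v2/devauth/devices/<device-id>"
# always returns 500 Internal Server Error, with an empty body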

I also get a 500 reply when I try to accept or reject authorization requests; however, those operations seem to work even though 500 is returned.

Is anyone else observing the same issue?

After debugging it a bit using the Troubleshooting section of the docs, I found the following:
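(All logs below were collected with the ./run wrapper from the production directory; a plain docker-compose logs <service> should give the same output:)

./run logs mender-device-auth
./run logs mender-conductor
./run logs mender-elasticsearch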

Logs from mender-device-auth suggest it cannot contact mender-conductor:

mender-device-auth_1 | time="2020-03-27T18:05:45Z" level=error msg="internal error: submit device decommissioning job error: failed to submit decommissioning job: Post http://mender-conductor:8080/api/workflow/decommission_device: dial tcp 172.19.0.15:8080: connect: connection refused" file=response_helpers.go func=rest_utils.RestErrWithLogMsg line=62 request_id=8db6c8f8-791c-4c83-9437-eeb017f403e7 user_id=5ec43064-1fe1-49bd-92b2-c3a7ee220460

The mender-conductor logs show that it is waiting for Elasticsearch:

mender-conductor_1               | Fri Mar 27 18:18:15 UTC 2020 waiting for ES to be ready at mender-elasticsearch:9300
mender-conductor_1               | nc: bad address 'mender-elasticsearch:9300'
mender-conductor_1               | Fri Mar 27 18:18:16 UTC 2020 waiting for ES to be ready at mender-elasticsearch:9300
mender-conductor_1               | nc: bad address 'mender-elasticsearch:9300'
mender-conductor_1               | Fri Mar 27 18:18:17 UTC 2020 waiting for ES to be ready at mender-elasticsearch:9300

I can see that mender-elasticsearch is constantly restarting with the following errors:

Elasticsearch logs:
mender-elasticsearch_1           | OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
mender-elasticsearch_1           | [2020-03-27T16:57:43,165][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [unknown] uncaught exception in thread [main]
mender-elasticsearch_1           | org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:116) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:93) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | Caused by: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
mender-elasticsearch_1           | 	at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:300) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.node.Node.<init>(Node.java:296) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.node.Node.<init>(Node.java:266) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	... 6 more
mender-elasticsearch_1           | Caused by: java.io.IOException: failed to obtain lock on /usr/share/elasticsearch/data/nodes/0
mender-elasticsearch_1           | 	at org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:218) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:270) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.node.Node.<init>(Node.java:296) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.node.Node.<init>(Node.java:266) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	... 6 more
mender-elasticsearch_1           | Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes/0/node.lock
mender-elasticsearch_1           | 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:90) ~[?:?]
mender-elasticsearch_1           | 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
mender-elasticsearch_1           | 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
mender-elasticsearch_1           | 	at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182) ~[?:?]
mender-elasticsearch_1           | 	at java.nio.channels.FileChannel.open(FileChannel.java:292) ~[?:?]
mender-elasticsearch_1           | 	at java.nio.channels.FileChannel.open(FileChannel.java:345) ~[?:?]
mender-elasticsearch_1           | 	at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:125) ~[lucene-core-7.7.2.jar:7.7.2 d4c30fc2856154f2c1fefc589eb7cd070a415b94 - janhoy - 2019-05-28 23:30:25]
mender-elasticsearch_1           | 	at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-7.7.2.jar:7.7.2 d4c30fc2856154f2c1fefc589eb7cd070a415b94 - janhoy - 2019-05-28 23:30:25]
mender-elasticsearch_1           | 	at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-7.7.2.jar:7.7.2 d4c30fc2856154f2c1fefc589eb7cd070a415b94 - janhoy - 2019-05-28 23:30:25]
mender-elasticsearch_1           | 	at org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:211) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:270) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.node.Node.<init>(Node.java:296) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.node.Node.<init>(Node.java:266) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-6.8.6.jar:6.8.6]
mender-elasticsearch_1           | 	... 6 more

Googling these symptoms suggests that another instance of Elasticsearch is running in parallel.
I tried stopping all Docker containers and even restarted the server, but it didn’t help.
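To check the “not writable” theory from the exception, the volume’s backing directory on the host can be inspected (paths assume Docker’s default volume location):

docker volume inspect mender-elasticsearch-db --format '{{ .Mountpoint }}'
# typically /var/lib/docker/volumes/mender-elasticsearch-db/_data
sudo ls -ln /var/lib/docker/volumes/mender-elasticsearch-db/_data/nodes/0
sudo fuser -v /var/lib/docker/volumes/mender-elasticsearch-db/_data/nodes/0/node.lock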

A strange thing: I see elastic and java processes constantly reappearing in ps aux (outside Docker?!), with the PID incrementing several times a second.
I tried to kill these processes with kill $(ps aux | grep 'elastic' | awk '{print $2}'), but the command always returns “No such process”, even though ps aux | grep 'elastic' shows there is one; it terminates very quickly in a loop and its PID keeps changing.
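Two shell details bite here: grep matches its own command line, and a short-lived PID is gone before kill runs. Matching on the full command line is more reliable; a sketch (org.elasticsearch appears on the Java command line of an Elasticsearch node):

sudo pkill -f org.elasticsearch
# to check whether a matching process actually lives inside a container,
# look at its cgroup; Docker-managed processes mention docker there:
for p in $(pgrep -f org.elasticsearch); do cat /proc/$p/cgroup; done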

This process appears even after a server reboot, without any docker containers running yet!
Any help would be much appreciated…

Hi @ster, thanks for digging in and troubleshooting here. I tried this in my local setup and could not immediately reproduce it, but my server is pretty much ephemeral, so I started from scratch. It could be related to how long you have maintained your server and how you handled the upgrades. Can you describe the lifecycle of your server a bit?

@tranchitella, @merlin any thoughts here?

Drew

Hello @drewmoseley, thanks a lot for looking at this.

The server I use was set up back in the days of Mender 2.0, carefully following the Production installation section of the documentation. It was then updated to 2.1 and recently to 2.3 by merging the appropriate git branches into my installation branch, as described in the Upgrading section of the docs. During every update I had merge conflicts (though I saw on the forum that I’m not the only one) and ended up force-applying the incoming changes.

The problem appeared after the 2.1 -> 2.3 upgrade, either right away or some time later (I thought I had checked that all containers were up after the upgrade, but now I’m not sure anymore; I might have missed restarting Elasticsearch). I also noticed that with 2.3 prod.yml moved to another directory and its format changed a bit (which, by the way, isn’t reflected in the Upgrading docs; I only discovered it from launch errors ;), so I manually updated and moved prod.yml.

I will carefully compare the files on my branch with the files in the repo; maybe something went wrong while resolving the merge conflicts… Normally there shouldn’t be any differences between my branch and the 2.3.0 tag (except prod.yml, keys and certificates), right? Something like the diff below should show any drift.
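A sketch, assuming prod.yml now lives under production/ (the pathspec exclusion just skips my local config; adjust it to wherever your config, keys and certificates actually live):

git fetch --tags origin
git diff 2.3.0 -- . ':!production/prod.yml'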

The problem seems to be fixed by removing and recreating the mender-elasticsearch-db volume (commands below).
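Roughly, with the server stopped (volume name as created during the Production installation):

docker volume rm mender-elasticsearch-db
docker volume create --name=mender-elasticsearch-db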

It would still be nice to know the following; maybe someone could confirm:

  • Does any data from mender-elasticsearch-db need to be backed up and restored, or is it safe to wipe it?
    Seeing that the migration/dump-db tool doesn’t back up anything from Elasticsearch, I assumed it is safe to wipe, and indeed everything seems to work fine; still, it would be nice to be sure.
  • Could the fact that I skipped a minor version (I updated from 2.1 directly to 2.3) be the reason for this problem?

@ster

  • You can safely remove the Elasticsearch block device; it is used by Conductor as an index cache of workflow status and does not contain anything application-specific for Mender.
  • Mender 2.3 uses Elasticsearch 6, while previous versions used Elasticsearch 5. It is possible something went wrong with the upgrade of the volume, but as mentioned above, you can start from an empty Elasticsearch cluster and solve the issue.

I’m having exactly the same problem on a server migrated from 2.0 -> 2.1 -> 2.2 and now to 2.3.

@tranchitella, what do you mean by “You can safely remove the elasticsearch block device”? Where?
Are you talking about prod.yml?

The result of my docker ps -a is:

802e8d72c93d        mendersoftware/api-gateway:2.1.0              "/entrypoint.sh"         14 minutes ago      Up 28 seconds                      80/tcp, 0.0.0.0:443->443/tcp   menderproduction_mender-api-gateway_1
796596a83159        mendersoftware/deployments:1.9.0              "/entrypoint.sh --co…"   14 minutes ago      Up 29 seconds                      8080/tcp                       menderproduction_mender-deployments_1
884a86ab64e6        openresty/openresty:1.13.6.2-0-alpine         "/usr/local/openrest…"   14 minutes ago      Up 30 seconds                      0.0.0.0:9000->9000/tcp         menderproduction_storage-proxy_1
eb9e38cfa5a0        mendersoftware/deviceauth:2.2.0               "/usr/bin/deviceauth…"   15 minutes ago      Up 55 seconds                      8080/tcp                       menderproduction_mender-device-auth_1
29a0fc3aac3c        mendersoftware/mender-conductor:1.6.0         "/srv/start_conducto…"   15 minutes ago      Up 57 seconds (health: starting)   8080/tcp, 8090/tcp             menderproduction_mender-conductor_1
2b5ba91b3ac7        mendersoftware/workflows:1.0.0                "/usr/bin/workflows …"   15 minutes ago      Up 54 seconds                      8080/tcp                       menderproduction_mender-workflows-server_1
c927e027857d        mendersoftware/create-artifact-worker:1.0.0   "/usr/bin/workflows …"   15 minutes ago      Up 55 seconds                      8080/tcp                       menderproduction_mender-create-artifact-worker_1
41777600600e        mendersoftware/useradm:1.10.0                 "/usr/bin/useradm --…"   15 minutes ago      Up 55 seconds                      8080/tcp                       menderproduction_mender-useradm_1
269907a52bfe        mendersoftware/inventory:1.7.0                "/usr/bin/inventory …"   15 minutes ago      Up 55 seconds                      8080/tcp                       menderproduction_mender-inventory_1
d6479fd2ad3e        redis:5-alpine3.8                             "/redis/entrypoint.sh"   15 minutes ago      Up About a minute                  6379/tcp                       menderproduction_mender-redis_1
886c73307019        minio/minio:RELEASE.2018-09-25T21-34-43Z      "/usr/bin/docker-ent…"   15 minutes ago      Up About a minute (healthy)        9000/tcp                       menderproduction_minio_1
aaea94c34b85        mongo:3.6                                     "docker-entrypoint.s…"   15 minutes ago      Up About a minute                  27017/tcp                      menderproduction_mender-mongo_1
908791b91d19        mendersoftware/gui:2.3.0                      "/entrypoint.sh nginx"   15 minutes ago      Up 57 seconds                      80/tcp                         menderproduction_mender-gui_1
43b8c402429c        elasticsearch:6.8.6                           "/usr/local/bin/dock…"   15 minutes ago      Restarting (1) 2 seconds ago                                      menderproduction_mender-elasticsearch_1

and ./run ps returns:

----------------------------------------------------------------------------------------------------------------------------------------
menderproduction_mender-api-gateway_1              /entrypoint.sh                   Up                      0.0.0.0:443->443/tcp, 80/tcp
menderproduction_mender-conductor_1                /srv/start_conductor.sh          Up (health: starting)   8080/tcp, 8090/tcp
menderproduction_mender-create-artifact-worker_1   /usr/bin/workflows --confi ...   Up                      8080/tcp
menderproduction_mender-deployments_1              /entrypoint.sh --config /e ...   Up                      8080/tcp
menderproduction_mender-device-auth_1              /usr/bin/deviceauth --conf ...   Up                      8080/tcp
menderproduction_mender-elasticsearch_1            /usr/local/bin/docker-entr ...   Restarting
menderproduction_mender-gui_1                      /entrypoint.sh nginx             Up                      80/tcp
menderproduction_mender-inventory_1                /usr/bin/inventory --confi ...   Up                      8080/tcp
menderproduction_mender-mongo_1                    docker-entrypoint.sh mongod      Up                      27017/tcp
menderproduction_mender-redis_1                    /redis/entrypoint.sh             Up                      6379/tcp
menderproduction_mender-useradm_1                  /usr/bin/useradm --config  ...   Up                      8080/tcp
menderproduction_mender-workflows-server_1         /usr/bin/workflows --confi ...   Up                      8080/tcp
menderproduction_minio_1                           /usr/bin/docker-entrypoint ...   Up (healthy)            9000/tcp
menderproduction_storage-proxy_1                   /usr/local/openresty/bin/o ...   Up                      0.0.0.0:9000->9000/tcp

No, you just delete the “buggy” volume with a standard Docker command.
Once deleted, it will either be recreated healthy on the next docker run, or you will have to recreate it manually using a command from the “Production installation” procedure described in the docs; I don’t remember exactly which, most probably the second.
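If memory serves, the recreate command from the Production installation docs is the one below (double-check the volume name against docker volume ls):

docker volume create --name=mender-elasticsearch-db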

I removed the Elasticsearch container with docker rm 43b8c402429c, where “43b8c402429c” is the ID of my elasticsearch container.
Then I executed ./run up -d, but elasticsearch is still restarting.

Fixed! The catch was that docker rm only removes the container; the named volume holding the stale lock persists until removed with docker volume rm. I don’t know if this is the right procedure, but I’ll post it here:

  • ./run stop
  • docker ps -a
  • docker rm idofelasticsearch
  • docker volume ls
  • docker volume rm mender-elasticsearch-db
  • ./run pull
  • docker volume create --name=mender-elasticsearch-db
  • ./run up -d
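
To verify the fix took, something like:

./run ps    # mender-elasticsearch should now report "Up" instead of "Restarting"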

As pointed out by @ster, I meant that you can delete your Elasticsearch volume and start with a new one. The content of the volume is not critical, and you can start with an empty Elasticsearch without any issue.

Anyway, I see you already solved the issue by removing the volume and creating it again, well done! :slight_smile:


I had the same problem upgrading from 2.2.1 to 2.3.0, and the solutions above worked great.