Mender setup on AWS ECS

I’m currently setting up Mender on AWS ECS:

  • Mender 2.3.0
  • using S3 as the storage backend, without storage-proxy and minio.

The entire setup seems to be working so far: I can add/decommission devices and run deployments, but I ran into a strange issue when trying to upload an artifact via the CLI:

FAILURE: artifact upload failed with status 504, reason: <html>
<head><title>504 Gateway Time-out</title></head>

The artifact looks fine to me, and it was generated with the latest version of mender-artifact (3.3.0):

Artifact file 'test-release-1.0.mender' validated successfully

API gateway log:

2020/04/27 12:01:34 [warn] 54#54: *3950 a client request body is buffered to a temporary file /usr/local/openresty/nginx/client_body_temp/0000000007 while sending to client, client: <client>, server: <server>, request: "POST /api/management/v1/deployments/artifacts HTTP/1.1", host: "<server>"

Related entry in the mender-deployments log:

time="2020-04-27T12:01:52Z" level=error msg="unexpected EOF" file=view.go func="view.(*RESTView).RenderError" line=53 request_id=<request_id> user_id=<user>

time="2020-04-27T12:01:52Z" level=info msg="400 539196549μs POST /api/management/v1/deployments/artifacts HTTP/1.0 - Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0" file=middleware.go func="accesslog.(*AccessLogMiddleware).MiddlewareFunc.func1" line=60 request_id=<request_id> user_id=<user_id>

The test demo images upload just fine using both the CLI tool and the UI.

Questions:

  1. How can I check the artifact, other than using mender-artifact validate?
  2. Is it OK to exclude storage-proxy and minio from the setup when using AWS S3 as storage?

Hello @imort, welcome to Mender Hub.

As far as I know, removing the storage-proxy and minio should be OK when using S3, although perhaps @tranchitella or @merlin know better.

Are you able to upload that artifact through the Web UI?

The artifact format is actually just a tarball with the payload and some metadata. You can use tar to extract the contents, but if the mender-artifact validate command succeeded, I’m not sure that will be helpful.
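For example, something along these lines lists and extracts the contents (using the artifact name from your post):

# a .mender artifact is a plain tarball; list what is inside
tar tvf test-release-1.0.mender

# extract it for closer inspection
mkdir -p /tmp/artifact && tar xvf test-release-1.0.mender -C /tmp/artifact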

Drew

@imort if you are uploading large files, please have a look here:

If you are using an AWS load balancer in front of Mender, edit its configuration using the AWS console and increase the idle timeout accordingly.
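If you prefer the AWS CLI to the console, the same change can be made along these lines (the load balancer ARN and the timeout value are placeholders):

# raise the ALB idle timeout so large artifact uploads are not cut off
aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn <your-alb-arn> \
    --attributes Key=idle_timeout.timeout_seconds,Value=600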

Thank you for your kind response!
I’ve added a timeout to the API gateway nginx config and increased the ALB idle timeout - artifact uploads work now!
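For anyone hitting the same 504: the nginx side boils down to directives roughly like these in the API gateway config (a sketch; the exact values depend on your artifact sizes and upload bandwidth):

# inside the location that proxies artifact uploads to mender-deployments
client_max_body_size 4G;        # allow large artifact uploads
proxy_connect_timeout 300s;
proxy_send_timeout    300s;     # give slow uploads enough time
proxy_read_timeout    300s;     # wait long enough for the backend/S3 to finish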

Can I post other details about the Mender setup on ECS here later? It could be helpful for the community.

@imort sure, feel free to add any additional suggestions or details that can be useful to the community!

@tranchitella, regarding API gateway setup:

You’re using a reload-when-hosts-changed script, called from the container entrypoint, which reloads nginx if the IP address of any underlying service changes.

Since I’m using service discovery and a private DNS zone to identify services, there will be more than one IP address per service name, so nginx could get reloaded too often, even when the DNS server returns the same IP addresses in a different order.

Looks like there is a better solution available now:

nginx.conf:

location /ui {
    resolver @RESOLVER@ valid=30s;
    set $backend "http://@MENDER_GUI@:80";
    <skipped>
    proxy_pass $backend;
}

The resolver should be retrieved in entrypoint.sh and substituted into nginx.conf with sed, just like MENDER_GUI, of course:

# take the first nameserver entry from /etc/resolv.conf and strip the "nameserver " prefix
nameserver=$(grep -m1 '^nameserver' /etc/resolv.conf)
resolver=${nameserver#nameserver }

This forces nginx to re-resolve the MENDER_GUI hostname and pick up any new IPs every 30 seconds.
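For completeness, the corresponding substitution step in entrypoint.sh could look roughly like this (a sketch; the template and output paths are assumptions for the openresty-based gateway image):

# substitute the placeholders in the nginx config template before starting nginx
sed -e "s|@RESOLVER@|${resolver}|g" \
    -e "s|@MENDER_GUI@|${MENDER_GUI:-mender-gui}|g" \
    /etc/nginx/nginx.conf.template > /usr/local/openresty/nginx/conf/nginx.conf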

A quick test shows that it works without reloading nginx when the GUI container is replaced, and that requests are distributed evenly over two mender-gui containers under the same hostname.

@imort

Thanks for sharing! It sounds like an interesting approach, indeed. What about launching a PR with such a change? :slight_smile:

Side note: are you running Mender on Kubernetes, Swarm, or virtual machines? I’m curious about your deployment setup, as you mentioned service discovery and multiple instances of the microservices. We can take this to private communication if you cannot share these details in public.

What about launching a PR with such a change?

I should ask my boss about that first, sorry :slight_smile:

Side note: are you running Mender in Kubernetes, Swarm, Virtual machines?

I’m going to run it on AWS ECS (Fargate), and it seems to be working fine at the testing stage. Data storage will be deployed separately, as ECS containers are stateless by design.

The v2.3 setup has been working for over a year now, no issues.

I have to prepare a migration to 2.6 due to the MongoDB 3.6 end of life.
The test 2.6 setup looks good, but I’m having trouble connecting a client; it seems there are some issues with the deviceauth service. Could you please take a look at the logs?

time="2021-04-23T08:23:25Z" level=info msg="401 28538μs POST /api/devices/v1/authentication/auth_requests HTTP/1.1 - Go-http-client/1.1" byteswritten=109 file=middleware.go func="accesslog.(*AccessLogMiddleware).MiddlewareFunc.func1" line=71 method=POST path=/api/devices/v1/authentication/auth_requests qs= request_id=bd4fd44c-8434-4bae-a32c-bf382f2b4054 responsetime=0.028538446 status=401 ts="2021-04-23 08:23:25.066767212 +0000 UTC" type=http
time="2021-04-23T08:23:25Z" level=warning msg="dev auth: unauthorized: dev auth: unauthorized" file=response_helpers.go func=rest_utils.restErrWithLogMsg line=108 request_id=bd4fd44c-8434-4bae-a32c-bf382f2b4054

There are no details here that can help me out :(
The logs were generated with the Mender stress test client, which works perfectly with 2.3 under the same configuration.

I also found out that create-artifact-worker 1.0.2 and 1.0.3 can’t connect to MongoDB while 1.0.0 can (using the same setup), but that’s not essential.

Thank you in advance!

@imort the log is telling you the device is not authorized to connect to the server. Can you please check if the device shows up in the list of “pending” devices in the UI?

A few extra questions: are you running a freshly installed Mender 2.6 server, or did you upgrade from 2.3 to 2.6? Have you considered installing 2.7, which we released a few days ago?

Yep, I get it; the problem is that the device doesn’t appear in the pending list as it should.

It’s a freshly installed 2.6 test setup.

Have you considered installing 2.7, which we released a few days ago?

No, because 2.7 replaced the nginx-based api-gateway with traefik, and it would be more work to adapt that to the ECS setup I have at the moment.

@imort can you please try following these instructions to force the sync between deviceauth and inventory?

Will do, thank you!

I also found this in the workflows-worker logs:

time="2021-04-23T08:23:25Z" level=info msg="processTask: calling http task: POST http://${env.INVENTORY_ADDR|mender-inventory:8080}/api/internal/v1/inventory/tenants/${workflow.input.tenant_id}/devices/${workflow.input.device_status}" file=entry.go func="logrus.(*Entry).Infof" line=346

It seems that it doesn’t get the proper internal DNS name for the inventory service; does workflows-worker need it?
I can’t see it in the docker-compose file.

@imort yes, you can either set it as the INVENTORY_ADDR env variable for the workflows-worker container, or rely on the default (mender-inventory:8080).
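In docker-compose terms (or in the environment section of an ECS task definition) that is a single environment entry, something like this sketch; the image reference and the hostname are placeholders for your own setup:

mender-workflows-worker:
  image: <your-workflows-worker-image>
  environment:
    # internal DNS name of the inventory service; if unset, the worker
    # falls back to the default mender-inventory:8080
    INVENTORY_ADDR: "mender-inventory.internal.example:8080"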


you can either set it as the INVENTORY_ADDR env variable for the workflows-worker container, or rely on the default (mender-inventory:8080).

I think it should be added to config.yaml then (it’s not there for the server/worker in 2.6). I’ll create a pull request for that later.

Added the correct inventory address:

workflows-worker logs:

time="2021-04-23T09:06:21Z" level=error msg="(DuplicateKey) E11000 duplicate key error collection: workflows.job_queue index: _id_ dup key: { _id: \"60828e0d50199f804398dfb3\" }" file=process.go func=worker.processJob line=44
time="2021-04-23T09:06:21Z" level=error msg="(DuplicateKey) E11000 duplicate key error collection: workflows.job_queue index: _id_ dup key: { _id: \"60828e0d50199f804398dfb2\" }" file=process.go func=worker.processJob line=44
time="2021-04-23T09:06:21Z" level=info msg="Message bus: New job (60828e0d50199f804398dfb3) with workflow update_device_status" file=entry.go func="logrus.(*Entry).Infof" line=346
time="2021-04-23T09:06:21Z" level=info msg="Worker: processing job 60828e0d50199f804398dfb3 workflow update_device_status: " file=entry.go func="logrus.(*Entry).Infof" line=346
time="2021-04-23T09:06:21Z" level=info msg="Message bus: New job (60828e0d50199f804398dfb2) with workflow update_device_inventory" file=entry.go func="logrus.(*Entry).Infof" line=346
time="2021-04-23T09:06:21Z" level=info msg="Worker: processing job 60828e0d50199f804398dfb2 workflow update_device_inventory: " file=entry.go func="logrus.(*Entry).Infof" line=346
time="2021-04-23T09:05:00Z" level=info msg="Job scheduler listening to message bus" file=datastore_mongo.go func="mongo.(*DataStoreMongo).GetJobs.func1" line=361
time="2021-04-23T09:05:00Z" level=info msg="LoadWorkflows: loading 4 workflows from /etc/workflows/definitions." file=entry.go func="logrus.(*Entry).Infof" line=346

workflows-server logs:

time="2021-04-23T09:06:21Z" level=info msg="201 3.102328 POST /api/v1/workflow/update_device_status 10.166.11.77 - Go-http-client/1.1" byteswritten=63 clientip=10.166.11.77 file=logger.go func=http.routerLogger.func1 line=65 method=POST path=/api/v1/workflow/update_device_status responsetime=3.102328 status=201 ts="2021-04-23 09:06:21.328148075 +0000 UTC" type=http
time="2021-04-23T09:06:21Z" level=info msg="StartWorkflow db.InsertJob returned &{60828e0d50199f804398dfb3 update_device_status [{device_status pending} {request_id 6e480fbb-68d9-4bad-801a-5c399e89e431} {devices [{\"id\":\"3ec37dbd-491e-4743-b478-99e566721c53\",\"revision\":2}]} {tenant_id }] 1 [] 2021-04-23 09:06:21.328243564 +0000 UTC m=+7580.894436042},<nil>" file=entry.go func="logrus.(*Entry).Infof" line=346

Still the same error in the deviceauth logs, and no device in the pending list.
I’ll try your instructions from the post above.

Thank you for your help!

I reinstalled the entire Mender deployment, and it works: the device is added, but it still stays in the pending list after being accepted :slight_smile:

I tried to decommission it and add it again - now it’s stuck in the pending state, with no option to accept/reject.

Can you please advise which services I should check?
Does this look suspicious in the workflows-worker logs?

time="2021-04-23T09:56:08Z" level=error msg="(DuplicateKey) E11000 duplicate key error collection: workflows.job_queue index: _id_ dup key: { _id: \"608299b86e4126ead4381061\" }" file=process.go func=worker.processJob line=44

When I try to decommission the device, it stays in the pending list with pending status.

deviceauth logs:

time="2021-04-23T10:33:35Z" level=warning msg="Decommission device with id: ef0937b8-ebeb-4d9d-b950-538047f61306" file=devauth.go func="devauth.(*DevAuth).DecommissionDevice" line=688 request_id=79bb17f9-7c3d-4742-bae9-6f08652785da user_id=8d048067-e532-5932-97f0-cf4181d5ad98
 time="2021-04-23T10:33:35Z" level=info msg="204 11753μs DELETE /api/management/v2/devauth/devices/ef0937b8-ebeb-4d9d-b950-538047f61306 HTTP/1.1 - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36" byteswritten=0 file=middleware.go func="accesslog.(*AccessLogMiddleware).MiddlewareFunc.func1" line=71 method=DELETE path=/api/management/v2/devauth/devices/ef0937b8-ebeb-4d9d-b950-538047f61306 qs= request_id=79bb17f9-7c3d-4742-bae9-6f08652785da responsetime=0.011753271 status=204 ts="2021-04-23 10:33:35.524395681 +0000 UTC" type=http user_id=8d048067-e532-5932-97f0-cf4181d5ad98

workflows-worker logs:

time="2021-04-23T10:33:35Z" level=info msg="Worker: processing job 6082a27f6e4126ead438107e workflow decommission_device: " file=entry.go func="logrus.(*Entry).Infof" line=346
 time="2021-04-23T10:33:35Z" level=info msg="Message bus: New job (6082a27f6e4126ead438107e) with workflow decommission_device" file=entry.go func="logrus.(*Entry).Infof" line=346
 time="2021-04-23T10:33:35Z" level=error msg="(DuplicateKey) E11000 duplicate key error collection: workflows.job_queue index: _id_ dup key: { _id: \"6082a27f6e4126ead438107e\" }" file=process.go func=worker.processJob line=44

workflows-server logs:

time="2021-04-23T10:33:35Z" level=info msg="201 2.248705 POST /api/v1/workflow/decommission_device 10.166.11.90 - Go-http-client/1.1" byteswritten=62 clientip=10.166.11.90 file=logger.go func=http.routerLogger.func1 line=65 method=POST path=/api/v1/workflow/decommission_device responsetime=2.248705 status=201 ts="2021-04-23 10:33:35.530177082 +0000 UTC" type=http
time="2021-04-23T10:33:35Z" level=info msg="StartWorkflow starting workflow decommission_device : StatusCreated" file=entry.go func="logrus.(*Entry).Infof" line=346
time="2021-04-23T10:33:35Z" level=info msg="StartWorkflow db.InsertJob returned &{6082a27f6e4126ead438107e decommission_device [{authorization Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOiJmMTIzN2M3Zi1iNDkyLTQwYmYtODdmNC0xMTg3YTM0MTAwZTkiLCJzdWIiOiI4ZDA0ODA2Ny1lNTMyLTU5MzItOTdmMC1jZjQxODFkNWFkOTgiLCJleHAiOjE2MTk3NzY0MjIsImlhdCI6MTYxOTE3MTYyMiwibWVuZGVyLnVzZXIiOnRydWUsImlzcyI6Im1lbmRlci51c2VyYWRtIiwic2NwIjoibWVuZGVyLioiLCJuYmYiOjE2MTkxNzE2MjJ9.CtRlhNNS0yz8WaZd-K1lLezhmG7vOZ_96b2HxnR3mDhBa8mKnFmIgIsx2FDcA6EV9Ri2TtU9p_LOHBw9DCuhwml0XgKctizBaebahAkvQMCKSI_yLAr7pnejsixJH-WpXV503mFMRhh30D_9omEPtwK1UqS9kmMDgLpzHM93bdHq55SWozGTnQTFqmi7EKzpiJEGEFkhtf2xlQAldPPGkuKJR1sHups6nmxV3iXdNKVLlN80eFyVlchk54pXsKmqXtmm2LnE6hORIlXTinwdJFJ8QwNojf_CanXvnFuFt_SZNPYou433URyi2F_Nj7_zikr4ixSUh2LwTptB6wqm4sK3yREEUUMSR85crglmQ7_jA214cgUqD7msLj7LTOs5bmFyy2YnNzt-RC27Bmd1CgGwHQ2jtu3LLKAGIc-Mzs2V1Y9syagXWstK6nzr96CCMpvyUm6rY96FZR8OQNjB0hOXxvvQv5Lb4GiasyDcAUzZa79yO71OxU84BxtfDILd} {tenant_id } {device_id ef0937b8-ebeb-4d9d-b950-538047f61306} {request_id 79bb17f9-7c3d-4742-bae9-6f08652785da}] 1 [] 2021-04-23 10:33:35.530305978 +0000 UTC m=+2958.273998974},<nil>" file=entry.go func="logrus.(*Entry).Infof" line=346

It seems that workflows-worker has some trouble specifically when accepting/decommissioning devices. Could it be related to the queue errors?
Updating the device inventory works, though, and deployments work too.

P.S. When workflows-worker (1.3.0) starts, I can see two errors in the logs:

time="2021-04-23T11:51:40Z" level=info msg="LoadWorkflows: error loading: update_device_status v2: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351
time="2021-04-23T11:51:40Z" level=info msg="LoadWorkflows: error loading: update_device_inventory v2: Workflow already exists." file=entry.go func="logrus.(*Entry).Infof" line=351

Could this be related to the issues described above?

Hello @imort,

How many instances of the workflows-worker container do you have running in your system?
Can you please check the content of the “jobs” collection in the “workflows” database in MongoDB to find out whether the workflows were executed correctly? For each workflow, you’ll see the list of tasks and the HTTP response codes.
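For example, from the mongo shell, something along these lines (a sketch; the job id is taken from your worker logs above):

use workflows

// most recent job documents, newest first
db.jobs.find().sort({_id: -1}).limit(10).pretty()

// a specific job, by the id reported in the worker logs
// (wrap the id in ObjectId("...") if _id is stored as an ObjectId)
db.jobs.find({_id: "60828e0d50199f804398dfb3"}).pretty()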

One workflows-server, one workflows-worker, both 1.3.0.
It looks like some of the added devices are processed, and some are not.
I added five test devices, and only two appeared in the pending list. I accepted both: one moved to the devices list, and one is still in the pending list (even though its status changed to accepted).

I’ve added one more client device using the stress test docker container; it didn’t appear in the Mender pending list at all.

It looks like the jobs were not executed by the worker:

Worker logs:

time="2021-04-26T06:28:45Z" level=info msg="StartWorkflow db.InsertJob returned &{60865d9d42dd26c5db22d614 update_device_inventory [{request_id c5f30f9c-93f8-4189-b363-2571c2defe2e} {tenant_id } {device_id efb659cb-99d1-4fea-adeb-6b240662fb70} {scope identity} {attributes [{\"name\":\"mac\",\"value\":\"cf:fb:c6:0b:18:9e\",\"scope\":\"identity\"}]}] 1 [] 2021-04-26 06:28:45.02920532 +0000 UTC m=+3077.017808925},<nil>" file=entry.go func="logrus.(*Entry).Infof" line=346
time="2021-04-26T06:28:45Z" level=info msg="StartWorkflow db.InsertJob returned &{60865d9d42dd26c5db22d615 update_device_status [{request_id c5f30f9c-93f8-4189-b363-2571c2defe2e} {devices [{\"id\":\"efb659cb-99d1-4fea-adeb-6b240662fb70\",\"revision\":2}]} {tenant_id } {device_status pending}] 1 [] 2021-04-26 06:28:45.04276124 +0000 UTC m=+3077.031364748},<nil>" file=entry.go func="logrus.(*Entry).Infof" line=346

It seems that the worker returns nil and the job is never executed.

I added one more device, and it appears in the pending list:

It really seems strange: some jobs are handled properly, others stay unprocessed, and the only thing I can see in the worker logs is something like:

time="2021-04-26T06:28:45Z" level=info msg="StartWorkflow db.InsertJob returned &{60865d9d42dd26c5db22d615 update_device_status [{request_id c5f30f9c-93f8-4189-b363-2571c2defe2e} {devices [{\"id\":\"efb659cb-99d1-4fea-adeb-6b240662fb70\",\"revision\":2}]} {tenant_id } {device_status pending}] 1 [] 2021-04-26 06:28:45.04276124 +0000 UTC m=+3077.031364748},**[nil]**" file=entry.go func="logrus.(*Entry).Infof" line=346

instead of an HTTP status code when it fails. The worker doesn’t fail, hang, or restart.
And I can’t find such requests in the inventory log, which may explain it.
I’ll try to remove ${env.INVENTORY_ADDR|mender-inventory:8080} from the job definitions; maybe it’s trying to send requests to the default mender-inventory for 50% of the requests…

The entire setup right now is pretty simple: a single container for each microservice, plus service discovery (checked, no dead services):

  • inventory
  • device-auth
  • user-adm
  • gui
  • create-artifact-worker
  • workflows-server
  • workflows-worker
  • docker-api-gateway
  • deployments

P.S. I tried adjusting the inventory address in the workflow files and setting WORKFLOWS_CONCURRENCY to 1, but no luck.

This is what I see in the workflows-worker logs for the failed jobs:

time="2021-04-26T07:56:54Z" level=error msg="(DuplicateKey) E11000 duplicate key error collection: workflows.job_queue index: _id_ dup key: { _id: \"60867246a63e3c8cb86eb857\" }" file=process.go func=worker.processJob line=44

It’s obvious that the worker can’t add a new document to workflows.job_queue, but I can’t figure out why. Only 20-50% of devices are added successfully using this stress client setup:

/mender-stress-test-client -backend https://<test_mender_setup_url> -count 10 -invfreq 60 -pollfreq 60 -wait 60 -current_device qemux86-64 -inventory "device_type:qemux86-64,image_id:test,client_version:test"

This is the failed job document in workflows.job_queue:

[screenshot: 2021-04-26_14-02-28]

This is the same failed job document in workflows.jobs:

[screenshot: 2021-04-26_14-03-05]

And here are the two documents for a job that did not fail (workflows.job_queue, workflows.jobs):

[screenshot: 2021-04-26_14-07-08]

P.P.S.

I redeployed and tried adding devices one by one.
Errors happen in the update_device_inventory and update_device_status workflows, preventing the device from registering and/or moving to accepted:

time="2021-04-26T09:18:52Z" level=error msg="(DuplicateKey) E11000 duplicate key error collection: workflows.job_queue index: _id_ dup key: { _id: \"6086857c925d25a7809155e5\" }" file=process.go func=worker.processJob line=44

I double-checked that there was no job with _id=6086857c925d25a7809155e5 in job_queue before adding the device.
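The check was a mongo shell query along these lines (a sketch):

use workflows

// should return 0 before the device is added
db.job_queue.find({_id: "6086857c925d25a7809155e5"}).count()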

Overall, this seems really strange to me :(

I added deviceconnect and nats to the setup, following the demo setup.

Same issues when adding 10 clients:

/mender-stress-test-client -backend https://<test_mender_setup> -count 10 -invfreq 60 -pollfreq 60 -wait 60 -current_device qemux86-64 -inventory "device_type:qemux86-64,image_id:test,client_version:test" -startup_interval 60

Typically 8 devices are added to the pending list; 4-5 move to the devices list when accepted, while the others stay in the pending list with accepted status.

Looks like I’ll have to try v2.7, because I have no clue how to debug this duplicate MongoDB document ID issue I’m facing here.